A Dataset for Addressing Patient's Information Needs related to Clinical Course of Hospitalization
Journal: arXiv
Published Date: Jun 4, 2025
Abstract
Patients have distinct information needs about their hospitalization that can
be addressed using clinical evidence from electronic health records (EHRs).
While artificial intelligence (AI) systems show promise in meeting these needs,
robust datasets are needed to evaluate the factual accuracy and relevance of
AI-generated responses. To our knowledge, no existing dataset captures patient
information needs in the context of their EHRs. We introduce ArchEHR-QA, an
expert-annotated dataset based on real-world patient cases from intensive care
unit and emergency department settings. The cases comprise questions posed by
patients to public health forums, clinician-interpreted counterparts, relevant
clinical note excerpts with sentence-level relevance annotations, and
clinician-authored answers. To establish benchmarks for grounded EHR question
answering (QA), we evaluated three open-weight large language models
(LLMs), Llama 4, Llama 3, and Mixtral, across three prompting strategies:
(1) generating answers together with citations to clinical note sentences,
(2) generating answers first and adding citations afterward, and (3) generating
answers from a filtered set of cited sentences. We assessed
performance on two dimensions: Factuality (overlap between cited note sentences
and ground truth) and Relevance (textual and semantic similarity between system
and reference answers). The final dataset contains 134 patient cases. The
answer-first prompting approach consistently performed best, with Llama 4
achieving the highest scores. Manual error analysis supported these findings
and revealed common issues such as omitted key clinical evidence and
contradictory or hallucinated content. Overall, ArchEHR-QA provides a strong
benchmark for developing and evaluating patient-centered EHR QA systems,
underscoring the need for further progress toward generating factual and
relevant responses in clinical contexts.
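
The Factuality dimension is described above only as overlap between cited note
sentences and ground truth; the abstract does not give the exact formula. As a
minimal sketch, assuming the overlap is scored as set-based precision, recall,
and F1 over sentence identifiers, it might look like the following. The names
`CitationScores` and `citation_overlap` and the example sentence IDs are
hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class CitationScores:
    precision: float
    recall: float
    f1: float


def citation_overlap(cited_ids: Iterable[str], gold_ids: Iterable[str]) -> CitationScores:
    """Score cited sentence IDs against annotated relevant sentence IDs.

    Assumption: overlap is measured with set-based precision/recall/F1.
    The paper's actual metric definition may differ.
    """
    cited, gold = set(cited_ids), set(gold_ids)
    hits = len(cited & gold)  # sentences both cited and annotated as relevant
    precision = hits / len(cited) if cited else 0.0
    recall = hits / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return CitationScores(precision, recall, f1)


# Hypothetical case: the system cites sentences 1, 3, 4; annotators marked 1, 2, 3.
scores = citation_overlap(["1", "3", "4"], ["1", "2", "3"])
print(scores)
```

Treating citations as sets ignores citation order and duplicates, which seems
reasonable when the question is only whether the right evidence was cited.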