ER-REASON: A Benchmark Dataset for LLM-Based Clinical Reasoning in the Emergency Room
Journal: arXiv
Published Date: May 28, 2025
Abstract
Large language models (LLMs) have been extensively evaluated on medical
question answering tasks based on licensing exams. However, real-world
evaluations often depend on costly human annotators, and existing benchmarks
tend to focus on isolated tasks that rarely capture the clinical reasoning or
full workflow underlying medical decisions. In this paper, we introduce
ER-Reason, a benchmark designed to evaluate LLM-based clinical reasoning and
decision-making in the emergency room (ER)--a high-stakes setting where
clinicians make rapid, consequential decisions across diverse patient
presentations and medical specialties under time pressure. ER-Reason includes
data from 3,984 patients, encompassing 25,174 de-identified longitudinal
clinical notes spanning discharge summaries, progress notes, history and
physical exams, consults, echocardiography reports, imaging notes, and ER
provider documentation. The benchmark includes evaluation tasks that span key
stages of the ER workflow: triage intake, initial assessment, treatment
selection, disposition planning, and final diagnosis--each structured to
reflect core clinical reasoning processes such as differential diagnosis via
rule-out reasoning. We also collected 72 full physician-authored rationales
explaining reasoning processes that mimic the teaching process used in
residency training and that are typically absent from ER documentation. Evaluations
of state-of-the-art LLMs on ER-Reason reveal a gap between LLM-generated and
clinician-authored clinical reasoning for ER decisions, highlighting the need
for future research to bridge this divide.