Evaluating large language models for drafting emergency department encounter summaries.

Journal: PLOS Digital Health
Published Date:

Abstract

Large language models (LLMs) possess a range of capabilities that may be applied to the clinical domain, including text summarization. As ambient artificial intelligence scribes and other LLM-based tools begin to be deployed within healthcare settings, rigorous evaluations of the accuracy of these technologies are urgently needed. In this cross-sectional study of 100 randomly sampled adult Emergency Department (ED) visits from 2012 to 2023 at the University of California, San Francisco ED, we investigated the performance of GPT-4 and GPT-3.5-turbo in generating ED encounter summaries and evaluated the prevalence and type of errors in each section of the encounter summary across three criteria: 1) inaccuracy of LLM-summarized information; 2) hallucination of information; and 3) omission of relevant clinical information. In total, 33% of summaries generated by GPT-4 and 10% of those generated by GPT-3.5-turbo were entirely error-free across all evaluated domains. Summaries generated by GPT-4 were mostly accurate, with inaccuracies found in only 10% of cases; however, 42% of summaries exhibited hallucinations and 47% omitted clinically relevant information. Inaccuracies and hallucinations were most commonly found in the Plan sections of LLM-generated summaries, while clinical omissions were concentrated in text describing patients' Physical Examination findings or History of Presenting Complaint. The potential harmfulness score across errors was low, with a mean score of 0.57 (SD 1.11) out of 7; only three errors scored 4 ('Potential for permanent harm') or greater. In summary, we found that LLMs could generate accurate encounter summaries but were prone to hallucination and omission of clinically relevant information. On average, individual errors had a low potential for harm. A comprehensive understanding of the location and type of errors found in LLM-generated clinical text is important to facilitate clinician review of such content and prevent patient harm.
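To make the reported statistics concrete, the following is a minimal Python sketch, not the authors' actual analysis pipeline, of how per-summary error annotations could be aggregated into the prevalence figures (percentage of error-free summaries, percentage with each error type) and the mean per-error harmfulness score described above. All class names, field names, and example records are hypothetical.

# Hypothetical sketch: aggregating reviewer annotations of LLM-generated
# ED encounter summaries into summary-level statistics. Field names and
# example records are illustrative only.
from dataclasses import dataclass
from statistics import mean, stdev

@dataclass
class SummaryAnnotation:
    summary_id: str
    inaccuracies: int          # count of inaccurate statements
    hallucinations: int        # count of hallucinated statements
    omissions: int             # count of omitted clinically relevant items
    harm_scores: list[float]   # per-error potential harmfulness, 0-7 scale

def summarize_errors(annotations: list[SummaryAnnotation]) -> dict[str, float]:
    """Compute prevalence of each error type and mean per-error harm score."""
    n = len(annotations)
    error_free = sum(
        1 for a in annotations
        if a.inaccuracies == 0 and a.hallucinations == 0 and a.omissions == 0
    )
    all_harm = [s for a in annotations for s in a.harm_scores]
    return {
        "pct_error_free": 100 * error_free / n,
        "pct_with_inaccuracy": 100 * sum(a.inaccuracies > 0 for a in annotations) / n,
        "pct_with_hallucination": 100 * sum(a.hallucinations > 0 for a in annotations) / n,
        "pct_with_omission": 100 * sum(a.omissions > 0 for a in annotations) / n,
        "mean_harm": mean(all_harm) if all_harm else 0.0,
        "sd_harm": stdev(all_harm) if len(all_harm) > 1 else 0.0,
    }

# Example with two hypothetical annotated summaries
example = [
    SummaryAnnotation("ED-001", inaccuracies=0, hallucinations=1, omissions=0, harm_scores=[1.0]),
    SummaryAnnotation("ED-002", inaccuracies=0, hallucinations=0, omissions=0, harm_scores=[]),
]
print(summarize_errors(example))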

Authors

  • Christopher Y K Williams
    Bakar Computational Health Sciences Institute, University of California San Francisco, San Francisco, CA, USA.
  • Jaskaran Bains
    Department of Emergency Medicine, University of California San Francisco, San Francisco, CA, USA.
  • Tianyu Tang
    Department of Radiology, Zhongda Hospital, School of Medicine, Southeast University, Nanjing, China.
  • Kishan Patel
    Department of Emergency Medicine, University of California San Francisco, San Francisco, CA, USA.
  • Alexa N Lucas
    Department of Emergency Medicine, University of California San Francisco, San Francisco, CA, USA.
  • Fiona Chen
    Department of Diagnostic Imaging, Rhode Island Hospital, Providence, RI, USA.
  • Brenda Y Miao
    Bakar Computational Health Sciences Institute, University of California San Francisco, San Francisco, CA, USA. miao.brenda1@gmail.com.
  • Atul J Butte
    Bakar Computational Health Sciences Institute, University of California San Francisco, San Francisco, CA, USA.
  • Aaron E Kornblith
    Department of Emergency Medicine, University of California San Francisco, San Francisco, CA, USA.
