Assessing Completeness of Clinical Histories Accompanying Imaging Orders Using Adapted Open-Source and Closed-Source Large Language Models.

Journal: Radiology
PMID:

Abstract

Background: Incomplete clinical histories are a well-known problem in radiology. Previous quality improvement efforts aimed at reproducible assessment of the completeness of free-text clinical histories have relied on tedious manual analysis.

Purpose: To adapt and evaluate open-source and closed-source large language models (LLMs) for automatically extracting clinical history elements from imaging orders, and to use the best-performing adapted open-source model to assess the completeness of a large sample of clinical histories as a benchmark for clinical practice.

Materials and Methods: This retrospective single-site study used previously extracted information accompanying CT, MRI, US, and radiography orders placed from August 2020 to May 2022 in the adult and pediatric emergency department of a 613-bed tertiary academic medical center. Two open-source LLMs (Llama 2-7B [Meta] and Mistral-7B [Mistral AI]) and one closed-source LLM (GPT-4 Turbo [OpenAI]) were adapted using prompt engineering, in-context learning, and fine-tuning (open-source only) to extract the elements "past medical history," "what," "when," "where," and "clinical concern" from clinical histories. Model performance, interreader agreement, and semantic similarity between model outputs and the adjudicated manual annotations of two board-certified radiologists (16 and 3 years of postfellowship experience, respectively) were assessed using accuracy, Cohen κ (none to slight, 0.01-0.20; fair, 0.21-0.40; moderate, 0.41-0.60; substantial, 0.61-0.80; almost perfect, 0.81-1.00), and BERTScore, an LLM-based metric that quantifies how well two pieces of text convey the same meaning; 95% CIs were also calculated. The best-performing open-source model was then used to assess completeness in a large dataset of unannotated clinical histories.

Results: A total of 50 186 clinical histories were included (794 training, 150 validation, 300 initial testing, 48 942 real-world application). Of the two open-source models, Mistral-7B outperformed Llama 2-7B in assessing completeness and was therefore selected for fine-tuning. Both Mistral-7B and GPT-4 Turbo showed substantial overall agreement with radiologists (mean κ, 0.73 [95% CI: 0.67, 0.78] to 0.77 [95% CI: 0.71, 0.82]) and with the adjudicated annotations (mean BERTScore, 0.96 [95% CI: 0.96, 0.97] for both models; P = .38). Mistral-7B also rivaled GPT-4 Turbo in performance (weighted overall mean accuracy, 91% [95% CI: 89, 93] vs 92% [95% CI: 90, 94]; P = .31) despite being a smaller model. Using Mistral-7B, 26.2% (12 803 of 48 942) of unannotated clinical histories were found to contain all five elements.

Conclusion: An easily deployable fine-tuned open-source LLM (Mistral-7B), rivaling GPT-4 Turbo in performance, could effectively extract clinical history elements with substantial agreement with radiologists and produce a benchmark for completeness of a large sample of clinical histories. The model and code will be fully open-sourced.

© RSNA, 2025
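The adaptation approach described above (prompt engineering and in-context learning over an open-source model) can be sketched in a few lines. The checkpoint name, prompt wording, and JSON output schema below are illustrative assumptions, not the authors' exact setup:

```python
# Sketch: prompting an open-source LLM to extract the five clinical history
# elements. Checkpoint, prompt, and output schema are assumed for illustration.
import json
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "mistralai/Mistral-7B-Instruct-v0.2"  # assumed checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)

ELEMENTS = ["past medical history", "what", "when", "where", "clinical concern"]

def extract_elements(history: str) -> dict:
    """Ask the model to identify which elements a clinical history contains."""
    prompt = (
        "From the clinical history below, extract these elements: "
        f"{', '.join(ELEMENTS)}. Return JSON mapping each element to its "
        "supporting text, or null if the element is absent.\n\n"
        f"Clinical history: {history}"
    )
    input_ids = tokenizer.apply_chat_template(
        [{"role": "user", "content": prompt}],
        add_generation_prompt=True,
        return_tensors="pt",
    ).to(model.device)
    output = model.generate(input_ids, max_new_tokens=256, do_sample=False)
    reply = tokenizer.decode(
        output[0, input_ids.shape[-1]:], skip_special_tokens=True
    )
    return json.loads(reply)  # assumes the model emits valid JSON

print(extract_elements(
    "72F with h/o CHF, acute shortness of breath since this morning, "
    "concern for pulmonary edema."
))
```

In-context learning would add a few annotated example histories to the prompt; fine-tuning would instead update the model weights on the annotated training histories.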
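The reported agreement and similarity metrics have standard implementations: Cohen κ in scikit-learn and BERTScore in the bert-score package. A minimal sketch with toy labels (the study compared model outputs against adjudicated radiologist annotations):

```python
# Sketch: per-element agreement (Cohen kappa) and semantic similarity
# (BERTScore) against reference annotations. All labels here are toy data.
from sklearn.metrics import cohen_kappa_score
from bert_score import score

# Binary presence labels for one element (e.g., "clinical concern"):
# model prediction vs adjudicated reference, one entry per history.
model_labels     = [1, 0, 1, 1, 0, 1]
reference_labels = [1, 0, 1, 0, 0, 1]

kappa = cohen_kappa_score(model_labels, reference_labels)
print(f"Cohen kappa: {kappa:.2f}")  # 0.67 here; 0.61-0.80 reads as substantial

# BERTScore compares the extracted text spans with the reference spans.
candidates = ["concern for pulmonary edema"]
references = ["worried about pulmonary edema"]
P, R, F1 = score(candidates, references, lang="en")
print(f"BERTScore F1: {F1.mean().item():.2f}")
```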
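Once elements are extracted, the completeness benchmark itself is simple arithmetic: a history counts as complete when all five elements are present. A minimal sketch, assuming extraction results shaped like the dictionaries above:

```python
# Sketch: share of histories containing all five elements (toy data).
ELEMENTS = ["past medical history", "what", "when", "where", "clinical concern"]

def is_complete(extracted: dict) -> bool:
    """True if every element was found (non-null) in the history."""
    return all(extracted.get(e) is not None for e in ELEMENTS)

results = [
    {"past medical history": "h/o CHF", "what": "dyspnea",
     "when": "this morning", "where": None,
     "clinical concern": "pulmonary edema"},
    {"past medical history": "h/o CHF", "what": "dyspnea",
     "when": "this morning", "where": "chest",
     "clinical concern": "pulmonary edema"},
]
rate = sum(is_complete(r) for r in results) / len(results)
print(f"Complete histories: {rate:.1%}")  # study reported 26.2% (12 803 of 48 942)
```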

Authors

  • David B Larson
    Department of Radiology, Stanford University, Palo Alto, Calif.
  • Arogya Koirala
    Department of Radiology, Stanford University School of Medicine, 453 Quarry Rd, MC 5659, Stanford, CA 94304.
  • Lina Y Cheuy
    Department of Radiology, Stanford University School of Medicine, 453 Quarry Rd, MC 5659, Stanford, CA 94304.
  • Magdalini Paschali
    Department of Radiology, Stanford University School of Medicine, Stanford, CA.
  • Dave Van Veen
    Department of Electrical Engineering, Stanford University, Stanford, CA 94305.
  • Hye Sun Na
    Department of Radiology, Stanford University School of Medicine, 453 Quarry Rd, MC 5659, Stanford, CA 94304.
  • Matthew B Petterson
    Department of Radiology, Stanford University School of Medicine, 453 Quarry Rd, MC 5659, Stanford, CA 94304.
  • Zhongnan Fang
    LVIS Corporation, Palo Alto, CA.
  • Akshay S Chaudhari
    Department of Radiology, Stanford University, Stanford, CA.