Performance of Open-Source Large Language Models to Extract Symptoms from Clinical Notes.

Journal: Studies in Health Technology and Informatics
Published Date:

Abstract

In this study, we examined how well open-source foundational large language models (LLMs) can extract symptoms and signs (S&S), along with their corresponding ICD-10 codes, from clinical notes in the public MTSamples dataset. The dataset, comprising notes of patients with genitourinary conditions, was manually annotated so that the S&S extraction results could be compared with outputs generated by the LLMs. We assessed three Llama model variants (Llama 3.1-13B, Llama 3.3-70B, and Me-Llama-13B), focusing on their consistency, runtime, and performance. Each model was tested on two tasks: (1) S&S extraction and (2) ICD-10 code generation. Our findings indicate that Llama 3.3-70B performed best overall. With a fast runtime and high consistency, it achieved an average recall of 0.87 and an average precision of 0.71 for S&S extraction, and an average recall of 0.71 and an average precision of 0.54 for ICD-10 code generation.
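As a purely illustrative sketch (not the authors' evaluation pipeline), the reported precision and recall can be understood as set-based comparisons between LLM-extracted terms and manual annotations; the normalization step and exact-match criterion below are assumptions, and the example terms are hypothetical.

```python
# Illustrative sketch: set-based precision/recall for comparing
# LLM-extracted symptom terms against manually annotated terms.
# Normalization and exact matching are assumptions, not the paper's method.

def normalize(term: str) -> str:
    """Lowercase and collapse whitespace so trivially different strings match."""
    return " ".join(term.lower().split())

def precision_recall(extracted: list[str], gold: list[str]) -> tuple[float, float]:
    """Exact-match precision and recall over normalized term sets."""
    pred = {normalize(t) for t in extracted}
    ref = {normalize(t) for t in gold}
    tp = len(pred & ref)  # terms found in both sets (true positives)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(ref) if ref else 0.0
    return precision, recall

# Hypothetical example usage:
p, r = precision_recall(
    extracted=["Dysuria", "urinary frequency", "fever"],
    gold=["dysuria", "urinary frequency", "flank pain"],
)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.67 recall=0.67
```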

Authors

  • Yunbing Bai
    Department of Biomedical Informatics, School of Medicine, University of Utah, Salt Lake City, Utah.
  • Wanting Cui
    Icahn School of Medicine at Mount Sinai, New York, NY, USA.
  • Joseph Finkelstein
    Department of Biomedical Informatics, School of Medicine, University of Utah, USA.