Assessing large language model performance related to aging in genetic conditions.

Journal: npj Aging
Published Date:

Abstract

Most genetic conditions are described in pediatric populations, leaving a gap in understanding their clinical progression and management in adulthood. Motivated by other applications of large language models (LLMs), we evaluated whether Llama-2-70b-chat (70b) and GPT-3.5 (GPT) could generate plausible medical vignettes, patient-geneticist dialogues, and management plans for hypothetical child and adult patients across 282 genetic conditions (selected by prevalence and categorized based on age-related characteristics). Based on Correctness and Completeness scores graded by clinicians, the LLMs provided age-appropriate responses in both child and adult outputs. A sub-analysis of metabolic conditions, including those that typically present neonatally with crisis, also showed age-appropriate LLM responses. However, 70b and GPT obtained low Correctness and Completeness scores for producing plausible management plans (55-66% for 70b and a wider range, 50-90%, for GPT). This suggests that LLMs still have some limitations in clinical applications.

Authors

  • Amna A Othman
    Medical Genomics Unit, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA. amna.othman@nih.gov.
  • Kendall A Flaharty
    Medical Genomics Unit, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA.
  • Suzanna E Ledgister Hanchard
  • Ping Hu
    Division of Cancer Prevention, National Cancer Institute, National Institutes of Health, Bethesda, MD, USA.
  • Dat Duong
  • Rebekah L Waikel
  • Benjamin D Solomon
