Assessing large language model performance related to aging in genetic conditions.

Journal: npj Aging

Published Date: May 3, 2025

Abstract

Most genetic conditions are described in pediatric populations, leaving a gap in understanding their clinical progression and management in adulthood. Motivated by other applications of large language models (LLMs), we evaluated whether Llama-2-70b-chat (70b) and GPT-3.5 (GPT) could generate plausible medical vignettes, patient-geneticist dialogues, and management plans for hypothetical child and adult patients across 282 genetic conditions (selected by prevalence and categorized by age-related characteristics). Based on Correctness and Completeness scores graded by clinicians, the LLMs provided appropriate age-based responses in both child and adult outputs. A sub-analysis of metabolic conditions, including those that typically present neonatally with crisis, also showed age-appropriate LLM responses. However, 70b and GPT obtained low Correctness and Completeness scores when producing plausible management plans (55-66% for 70b and a wider range, 50-90%, for GPT). This suggests that LLMs still have some limitations in clinical applications.

Authors

Amna A Othman

Medical Genomics Unit, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA. amna.othman@nih.gov.
Kendall A Flaharty

Medical Genomics Unit, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA.
Suzanna E Ledgister Hanchard
Ping Hu

Division of Cancer Prevention, National Cancer Institute, Canada.
Dat Duong
Rebekah L Waikel
Benjamin D Solomon


PubMed ID: 40319013
