Assessing large language model performance related to aging in genetic conditions.
Journal:
npj Aging
Published Date:
May 3, 2025
Abstract
Most genetic conditions are described in pediatric populations, leaving a gap in understanding their clinical progression and management in adulthood. Motivated by other applications of large language models (LLMs), we evaluated whether Llama-2-70b-chat (70b) and GPT-3.5 (GPT) could generate plausible medical vignettes, patient-geneticist dialogues, and management plans for hypothetical child and adult patients across 282 genetic conditions (selected by prevalence and categorized by age-related characteristics). Based on Correctness and Completeness scores graded by clinicians, the LLMs provided age-appropriate responses in both child and adult outputs. A sub-analysis of metabolic conditions, including those that typically present neonatally with crisis, also showed age-appropriate LLM responses. However, 70b and GPT obtained low Correctness and Completeness scores for producing plausible management plans (55-66% for 70b and a wider range, 50-90%, for GPT). This suggests that LLMs still have some limitations in clinical applications.