Accuracy of Large Language Models in Generating Rare Disease Differential Diagnosis Using Key Clinical Features.
Journal:
Studies in health technology and informatics
Published Date:
Aug 7, 2025
Abstract
Generating differential diagnoses for rare disease patients can be time-intensive and highly dependent on the background and training of the evaluating physicians. Large language models (LLMs) have the potential to complement this process by automatically generating differentials to support physicians, but their performance in real-world patient populations remains underexplored. We therefore assessed the diagnostic accuracy of two LLMs, ChatGPT-4o and Llama 3.1-8B-Instruct, and of the bioinformatic tool Exomiser in 424 rare disease patients at the Undiagnosed Diseases Network. ChatGPT-4o had the highest differential diagnostic accuracy (22.4% [95% CI: 18.4, 26.4]), outperforming Exomiser (13.9% [10.6, 17.2]; p < 0.001) and Llama 3.1-8B-Instruct (11.6% [8.5, 14.6]; p < 0.001). After adjusting for other factors, age at symptom onset was a significant predictor of ChatGPT-4o's diagnostic accuracy: the model performed better in patients with later symptom onset, potentially due to more distinct phenotypic presentations in older individuals. The combined accuracy of ChatGPT-4o and Exomiser was 30% [95% CI: 25.6, 34.3], higher than that of either model alone (p < 0.01). This improvement highlights the potential of combining LLMs and bioinformatic models to generate differential diagnoses for rare diseases.
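As a rough check on the arithmetic behind the reported intervals, the sketch below recomputes each accuracy and its 95% confidence interval with a normal-approximation (Wald) interval. The abstract does not say which interval method was used, and it does not report per-model counts of correct diagnoses; the counts here are hypothetical values back-calculated from the reported percentages and the cohort size of 424. Under those assumptions the output agrees with the reported figures to one decimal place.

```python
import math


def wald_ci(successes, n, z=1.96):
    """Return (proportion, lower, upper) using a normal-approximation interval.

    z=1.96 corresponds to a two-sided 95% interval.
    """
    p = successes / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p, p - half_width, p + half_width


N_PATIENTS = 424  # cohort size stated in the abstract

# ASSUMPTION: these counts are not given in the abstract. They are
# back-calculated from the reported percentages and are illustrative only.
assumed_correct = {
    "ChatGPT-4o": 95,
    "Exomiser": 59,
    "Llama 3.1-8B-Instruct": 49,
    "ChatGPT-4o + Exomiser": 127,
}

for name, k in assumed_correct.items():
    p, lo, hi = wald_ci(k, N_PATIENTS)
    print(f"{name}: {100 * p:.1f}% [{100 * lo:.1f}, {100 * hi:.1f}]")
```

The reported p-values compare models evaluated on the same patients, so they would need a paired test on patient-level outcomes, which the abstract does not provide; the sketch covers only the single-proportion intervals.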