A retrieval-augmented generation large language model framework for accurate dementia identification from electronic health records

Journal: medRxiv

Published Date: Jan 25, 2026

Abstract

Objective: Accurate and scalable disease phenotyping from electronic health records (EHRs) is foundational for predictive modeling and precision medicine. Traditional rule- and keyword-based approaches are limited by inconsistent documentation and inability to capture clinical nuance. We aim to evaluate whether large language models (LLMs) can overcome these limitations to improve dementia phenotyping from real-world EHR data.

Methods: We developed and evaluated a framework integrating large language models and retrieval-augmented generation (RAG) to improve dementia identification from EHRs. Using Mass General Brigham EHR data, we identified a cohort of potential dementia cases and established gold-standard labels through chart review. Among 623 candidate cases, we compared rule-based classification, keyword-filtered LLMs, and RAG-based LLMs.

Results: The RAG-based classifier achieved the highest performance (F1=0.933, sensitivity=91.1%, PPV=95.5%) compared to rule-based (F1=0.823, sensitivity=81.1%, PPV=83.5%) and keyword-filtered LLM (F1=0.903, sensitivity=91.7%, PPV=88.6%). Error analysis revealed that structured-code dependence contributed to false positives, whereas unrecognized contextual cues in notes drove false negatives.

Conclusion: This framework demonstrates how RAG-based LLMs can produce reliable, context-aware dementia phenotypes to support predictive modeling, early detection, and precision care strategies across real-world populations.
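The F1 scores in the Results relate to sensitivity and PPV through the standard definition of F1 as the harmonic mean of recall (sensitivity) and precision (PPV). The following is a minimal sketch of that relationship, not the authors' evaluation code; the function name and the dictionary layout are illustrative assumptions, and the inputs are the rounded percentages quoted in the abstract.

```python
def f1_from_sensitivity_ppv(sensitivity: float, ppv: float) -> float:
    """Return F1 as the harmonic mean of sensitivity (recall) and PPV (precision)."""
    if sensitivity + ppv == 0:
        return 0.0  # avoid division by zero when both metrics are zero
    return 2 * sensitivity * ppv / (sensitivity + ppv)


# (sensitivity, PPV, reported F1) as quoted in the abstract.
# The dictionary keys are hypothetical labels chosen for this sketch.
reported = {
    "rule_based": (0.811, 0.835, 0.823),
    "keyword_filtered_llm": (0.917, 0.886, 0.903),
    "rag_based_llm": (0.911, 0.955, 0.933),
}

for name, (sens, ppv, f1_reported) in reported.items():
    f1 = f1_from_sensitivity_ppv(sens, ppv)
    print(f"{name}: recomputed F1={f1:.3f}, reported F1={f1_reported:.3f}")
```

Because the inputs here are rounded to one decimal place of a percentage, the recomputed values can differ from the reported F1 scores in the third decimal place.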

Authors

Wang L.; Liu B.; Yang R.; Chuang Y.-W.; Estiri H.; Murphy S.; Zhou L.; Marshall G.
