Predicting Alzheimer’s Disease Diagnosis, a Decade or more Years before Onset using the Electronic Health Record and Random Forest Machine Learning Models
Journal: medRxiv
Published Date: Jan 1, 2025
Abstract
There is a need to detect and intervene in the pre-clinical phases of Alzheimer’s disease (AD). Electronic health records (EHRs) may help predict AD using machine learning methods. We identified EHRs for 19,473 cases with AD and 111,922 controls. Records spanned 10 or more years prior to AD diagnosis. We trained a random forest model (employing 5-fold cross-validation with 2,499 features) to predict AD 10 years prior to its onset using a 75/25% train/test split and then computed permuted feature importance. On the test set, the model achieved an area under the ROC curve of 0.80 and an area under the precision-recall curve of 0.55. Feature importance identified factors associated with AD, including age, sex, race, ethnicity, BMI, cardiovascular diseases, inflammation, pain, sleep, trauma, other neurodegenerative disorders, diuretics, colon-related disorders and procedures, seizures, and vitamin B12. This work contributes to knowledge about EHR-based prediction of AD 10 years prior to onset, which could help predict AD and inform prevention/early intervention. The value of this work is not necessarily predicting AD, given what is now known about its biology and blood biomarkers. Rather, the work inspires further examination of informatics methods for predicting disease diagnoses a decade or more prior to clinical diagnoses. The code used to build and evaluate the model is located at https://github.com/dbmi-pitt/ad_prediction_PLP/tree/main.

We reviewed the relevant literature using traditional methods (e.g., PubMed) and found that previous studies have predicted Alzheimer’s disease (AD) early with the help of the electronic health record (EHR). This may help identify patients early in the course of disease. We demonstrate predicting AD 10 years prior to diagnosis using the EHR and identify important predictive factors. More work is needed to build and validate these types of models across hospital systems over multiple timescales.
These models could be directly embedded into EHR systems.
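
The modeling workflow summarized in the abstract (a 75/25 train/test split, 5-fold cross-validation, a random forest, ROC and precision-recall areas, and permutation-based feature importance) can be illustrated with a minimal sketch. This is not the authors' code, which lives in the repository linked above. It assumes scikit-learn, uses synthetic data in place of EHR features, and assumes that cross-validation is used to select hyperparameters on the training split, a detail the abstract does not specify. The function name `run_sketch`, the small hyperparameter grid, and the roughly 15% positive rate (chosen to resemble the reported case/control ratio) are illustrative assumptions only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split


def run_sketch(n_samples=2000, n_features=20, seed=0):
    """Hypothetical end-to-end sketch on synthetic data (not real EHR data)."""
    # Synthetic stand-in for an EHR feature matrix with about 15% positives.
    X, y = make_classification(
        n_samples=n_samples,
        n_features=n_features,
        n_informative=5,
        weights=[0.85, 0.15],
        random_state=seed,
    )

    # 75/25 train/test split, stratified to preserve the class ratio.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=seed
    )

    # Assumption: 5-fold cross-validation tunes hyperparameters on the
    # training split only; the grid below is purely illustrative.
    search = GridSearchCV(
        RandomForestClassifier(n_estimators=100, random_state=seed),
        param_grid={"max_depth": [5, None]},
        cv=5,
        scoring="roc_auc",
    )
    search.fit(X_train, y_train)
    model = search.best_estimator_

    # Evaluate on the held-out test split.
    scores = model.predict_proba(X_test)[:, 1]
    auroc = roc_auc_score(y_test, scores)
    # Average precision is a common summary of the precision-recall curve.
    auprc = average_precision_score(y_test, scores)

    # Permutation importance: the drop in test AUROC when one feature
    # column is shuffled, averaged over several repeats.
    result = permutation_importance(
        model, X_test, y_test, scoring="roc_auc", n_repeats=5, random_state=seed
    )
    ranking = np.argsort(result.importances_mean)[::-1].tolist()
    return auroc, auprc, ranking


if __name__ == "__main__":
    auroc, auprc, ranking = run_sketch()
    print(f"AUROC={auroc:.2f}  AUPRC={auprc:.2f}  top features={ranking[:5]}")
```

With a positive rate near 15%, a classifier that ranks at random would score about 0.15 on the precision-recall summary, which is why that metric is usually read against the class prevalence rather than against 0.5.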