Machine Learning Models to Evaluate County-Level Incidence of Diagnosed Diabetes and Sociodemographic Factors.
Journal:
American Journal of Medicine Open
Published Date:
Mar 27, 2026
Abstract
AIMS: To evaluate county-level incidence of diagnosed diabetes and key sociodemographic factors in a high-dimensional, nonlinear setting. METHODS: This temporally aggregated observational study used US Centers for Disease Control and Prevention data on county-level incidence of diagnosed diabetes from 2004 to 2019 and 34 sociodemographic factors from public databases. We defined counties as higher-burden if diabetes incidence was >12.6 per 1000 persons (1 standard deviation [SD] above the sample mean). As relationships between sociodemographic factors and diabetes incidence may be nonlinear and involve complex interactions, we trained three machine learning models to estimate incidence (elastic net regression), classify counties as higher-burden (eXtreme Gradient Boosting [XGBoost], support vector machine [SVM]), and identify feature importance. Model performance was evaluated using fivefold cross-validation, with stratified folds for the XGBoost and SVM models. RESULTS: Overall, 500 of 3114 counties (16.1%) were higher-burden. Elastic net regression showed good predictive performance for estimating diabetes incidence (R² 0.78 [95% CI, 0.75-0.80]). For classification of higher-burden counties, SVM and XGBoost showed high discrimination, with AUROCs of 0.962 (95% CI, 0.948-0.974) and 0.957 (95% CI, 0.941-0.971), respectively. Sensitivity analyses using alternative definitions of higher-burden counties (mean + 0.75 × SD; mean + 1.25 × SD) yielded comparable results. Across all three models, key county-level features contributing to model predictions were the percentages of children living with grandparent householders and of people with limited English. CONCLUSIONS: Machine learning models demonstrated consistent performance in estimating and classifying county-level diabetes incidence, with high discrimination for identifying higher-burden counties. Sociodemographic factors, including children living with grandparent householders, may inform tailored public health interventions.
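The labeling rule and fold design described in the methods can be illustrated with a minimal sketch. It is not the authors' code. It assumes the sample SD is used for the threshold, uses synthetic incidence values rather than study data, and introduces two hypothetical helper names (`higher_burden_labels`, `stratified_folds`). It shows how a mean + k × SD cutoff labels counties for k = 0.75, 1.0, and 1.25, and how indices might be assigned to five stratified folds so that each fold keeps a similar share of higher-burden counties.

```python
import random
import statistics


def higher_burden_labels(incidence, k=1.0):
    """Label a county 1 if its incidence exceeds mean + k * SD, else 0."""
    mean = statistics.mean(incidence)
    sd = statistics.stdev(incidence)  # assumption: sample SD (n - 1 denominator)
    threshold = mean + k * sd
    return threshold, [1 if x > threshold else 0 for x in incidence]


def stratified_folds(labels, n_folds=5, seed=0):
    """Assign each index to a fold, dealing each class out round-robin."""
    rng = random.Random(seed)
    folds = [None] * len(labels)
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[i] = pos % n_folds
    return folds


# Synthetic incidence values per 1000 persons (illustrative only, not study data).
rng = random.Random(42)
incidence = [rng.gauss(9.0, 3.0) for _ in range(1000)]

# Main definition (k = 1.0) and the two sensitivity definitions.
results = {}
for k in (0.75, 1.0, 1.25):
    threshold, labels = higher_burden_labels(incidence, k)
    results[k] = (threshold, sum(labels))
    print(f"k={k}: threshold={threshold:.2f}, higher-burden counties={sum(labels)}")

# Stratified fivefold assignment under the main definition.
threshold, labels = higher_burden_labels(incidence, 1.0)
folds = stratified_folds(labels, n_folds=5)
positives_per_fold = [
    sum(1 for f, y in zip(folds, labels) if f == j and y == 1) for j in range(5)
]
print("higher-burden counties per fold:", positives_per_fold)
```

Stratifying matters here because only about one county in six is higher-burden. Unstratified folds could leave some folds with too few positive cases for a stable AUROC estimate.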