Predicting high-need high-cost pediatric hospitalized patients in China based on machine learning methods.

Journal: Scientific reports
PMID:

Abstract

Rapidly increasing healthcare spending worldwide is driven in large part by high-need, high-cost (HNHC) patients, who make up the top 5% of patients by annual healthcare costs yet account for over half of total expenditures. Programs targeting existing HNHC patients have shown limited long-term impact, and research on predicting HNHC pediatric patients in China is scarce. There is an urgent need for a specific, valid, and reliable machine-learning-based prediction model that identifies potential HNHC pediatric patients so that proactive interventions can be implemented before high costs arise. This study used a 7-year retrospective cohort dataset from two administrative databases in Shanghai, covering pediatric patients under 18 years. Machine-learning-based models were developed to predict HNHC status using logistic regression, k-nearest neighbors (KNN), random forest (RF), multi-layer perceptron (MLP), and naive Bayes. Data from 2021-2022 were split 70:30 into a training set and a test set, with the Synthetic Minority Over-sampling Technique (SMOTE) used for class balancing. Hyperparameters were optimized by grid search with k-fold cross-validation. Model performance was assessed by five metrics: area under the receiver operating characteristic curve (ROC-AUC), accuracy, sensitivity, specificity, and F1 score. External validation on 2022-2023 data and internal validation with different train-test ratios (80:20 and 90:10) were used to assess the robustness of the trained models. Among the 91,882 hospitalized children included in 2021, significant differences were found between the HNHC and non-HNHC groups in socioeconomic characteristics, disease, healthcare service utilization, previous healthcare expenditure, and hospital characteristics. Hospitalization costs for HNHC pediatric patients accounted for over 35% of total spending.
The MLP model demonstrated the highest predictive performance (ROC-AUC: 0.872), followed by RF (0.869), KNN (0.836), and naive Bayes (0.828). The most important predictive factors were length of stay, number of hospitalizations, previous HNHC status, age, and presence of Top 20 HNHC diseases. MLP also proved robust, remaining the best-performing model in external validation (ROC-AUC: 0.843) and in internal validation with different train-test ratios (ROC-AUC: 0.826 at 80:20; 0.807 at 90:10). Machine learning models, particularly MLP, can effectively predict HNHC pediatric patients, providing a basis for early identification of HNHC patients and for introducing proactive healthcare interventions into clinical practice. This approach can also assist policymakers and payers in optimizing healthcare resource allocation, controlling healthcare costs, and improving patient outcomes.
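
The five evaluation metrics named in the abstract can be illustrated with a minimal sketch in plain Python. This is not the authors' code: the function names, the 0.5 decision threshold, and the label coding (1 = HNHC, 0 = non-HNHC) are assumptions made for illustration, and ROC-AUC is computed here by direct pairwise comparison of scores, which is only practical for small samples.

```python
from typing import Dict, Sequence


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """ROC-AUC as the probability that a random positive case is scored
    higher than a random negative case (ties count as one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def evaluate(y_true: Sequence[int], scores: Sequence[float],
             threshold: float = 0.5) -> Dict[str, float]:
    """Return ROC-AUC, accuracy, sensitivity, specificity, and F1 score.

    Assumed coding: 1 = HNHC (positive class), 0 = non-HNHC.
    The threshold of 0.5 is an illustrative default, not taken from the study.
    """
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # recall on HNHC cases
    specificity = tn / (tn + fp) if (tn + fp) else 0.0  # recall on non-HNHC cases
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = (2 * precision * sensitivity / (precision + sensitivity)
          if (precision + sensitivity) else 0.0)
    return {
        "roc_auc": roc_auc(y_true, scores),
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
    }


if __name__ == "__main__":
    # Toy labels and predicted probabilities, invented for illustration only.
    labels = [1, 1, 0, 0, 0, 1, 0, 0]
    probs = [0.9, 0.4, 0.3, 0.6, 0.1, 0.8, 0.2, 0.5]
    for name, value in evaluate(labels, probs).items():
        print(f"{name}: {value:.3f}")
```

Because HNHC patients are by definition a small minority, accuracy alone can look high for a model that rarely flags anyone; reporting sensitivity, specificity, and F1 alongside ROC-AUC guards against that.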

Authors

  • Peng Zhang
    Key Laboratory of Macromolecular Science of Shaanxi Province, School of Chemistry & Chemical Engineering, Shaanxi Normal University, Xi'an, Shaanxi 710062, China.
  • Bifan Zhu
    Shanghai Health Development Research Center, Room 804, No 1477, West Beijing Road, Jing'an District, Shanghai, 20040, China.
  • Xing Chen
    School of Information and Electrical Engineering, China University of Mining and Technology, Xuzhou, 221116, China. xingchen@amss.ac.cn.
  • Linan Wang
    Shanghai Health Development Research Center, Room 804, No 1477, West Beijing Road, Jing'an District, Shanghai, 20040, China. wanglinan@51mch.com.