Comparing large scale and selected feature learning for community acquired pneumonia prognosis prediction using clinical data: a stacked ensemble approach.
Journal: Scientific Reports
PMID: 40210962
Abstract
This study investigated and validated models for predicting all-cause in-hospital death in hospitalized pneumonia patients, based on large-scale clinical data including diagnoses, medication prescriptions, and laboratory test codes. Two feature sets were compared: large-scale feature learning with a Common Data Model (CDM), and a selected set of specific pneumonia-related risk factors. A stacked ensemble machine-learning model was compared with traditional machine-learning models. Accuracy, F1-score, the Area Under the Precision-Recall Curve (AUPRC), and the Area Under the Receiver Operating Characteristic Curve (AUROC) were used for performance evaluation. For large-scale feature learning using a CDM, the ensemble model (LASSO LR + GBM + RF) achieved the highest performance: its AUROC was 0.867 (95% CI: 0.823-0.910) for the 365-day lookback and 0.867 (95% CI: 0.822-0.912) for the 7-day lookback. In contrast, for feature learning based on selected pneumonia risk factors, the RF model performed best among the traditional models, with AUROCs of 0.774 (95% CI: 0.717-0.830) for the 365-day lookback and 0.773 (95% CI: 0.717-0.828) for the 7-day lookback. Combining large-scale feature learning within the CDM with a stacked ensemble model yielded more accurate and robust predictions, highlighting its potential to capture complex relationships among clinical features and improve prognostic assessments.
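To make the two ranking metrics named above concrete, the following is a minimal sketch of how AUROC and AUPRC can be computed from binary outcome labels and predicted risk scores. It is not the authors' code: the function names `auroc` and `average_precision` and the toy data are assumptions for illustration, AUPRC is approximated here by average precision, and the average-precision function assumes no tied scores.

```python
def auroc(y_true, y_score):
    """AUROC as the probability that a randomly chosen positive case
    receives a higher score than a randomly chosen negative case.
    Tied scores count as half a win."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def average_precision(y_true, y_score):
    """Average precision: the mean of the precision values measured at
    the rank of each positive case, with cases sorted by descending score.
    Assumes scores are distinct (no special handling of ties)."""
    ranked = sorted(zip(y_score, y_true), key=lambda t: t[0], reverse=True)
    n_pos = sum(1 for _, y in ranked if y == 1)
    if n_pos == 0:
        raise ValueError("need at least one positive label")
    hits = 0
    total = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / n_pos


# Hypothetical toy data: 1 = in-hospital death, 0 = survival.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

print(auroc(y_true, y_score))
print(average_precision(y_true, y_score))
```

The pairwise AUROC loop is quadratic in the number of cases, which is acceptable for a small illustration; a rank-based formulation or an established library routine would be the usual choice for cohort-sized data. Because in-hospital death is typically a minority outcome, AUPRC is more sensitive than AUROC to how well the few positive cases are ranked, which is one reason to report both.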