Evaluating how different balancing data techniques impact on prediction of premature birth using machine learning models.
Journal: PLoS One
PMID: 40173408
Abstract
Premature birth, defined as birth before 37 weeks of gestation, is a significant global health issue and the leading cause of neonatal deaths. In this work, we evaluate machine learning models for predicting premature birth using Brazilian sociodemographic and obstetric data, focusing on the challenge of data imbalance, a common problem that can lead to biased predictions. We evaluate five data balancing techniques: Undersampling, Oversampling, and three Hybridsampling configurations in which the minority class was increased by factors of 2, 3, and 4. The machine learning models, including Decision Tree, Random Forest, and AdaBoost, are trained and evaluated on a dataset of over 483,000 cases. With the Hybridsampling approach, the Decision Tree model reached an accuracy of 70%, a recall of 64%, and a precision of 74%. Results show that Hybridsampling techniques significantly improve model performance compared to Undersampling and Oversampling, highlighting the importance of proper data balancing in predictive models for preterm birth. Our work is particularly relevant to the Brazilian Unified Health System (SUS). By improving the accuracy of premature birth predictions, our models could help healthcare providers identify at-risk pregnancies earlier, allowing for timely interventions. Integrating such models into SUS could enhance maternal and neonatal care, reduce the incidence of preterm births, and potentially decrease neonatal mortality, especially in underserved regions.
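The abstract does not spell out how the Hybridsampling configurations were built, so the following is only a minimal sketch of one plausible reading: the minority class is grown to `factor` times its original size by random duplication, and the majority class is then randomly reduced to the same size. The function name `hybrid_sample`, its parameters, and the toy data are all hypothetical and are not taken from the paper.

```python
import random
from collections import Counter


def hybrid_sample(X, y, factor, minority_label=1, seed=0):
    """Oversample the minority class by `factor`, then undersample the
    majority class to the same size. An assumed scheme for illustration,
    not the authors' documented procedure."""
    rng = random.Random(seed)
    minority = [i for i, label in enumerate(y) if label == minority_label]
    majority = [i for i, label in enumerate(y) if label != minority_label]

    # Oversampling: keep every original minority row, then draw extra
    # rows with replacement until the class is `factor` times larger.
    target = factor * len(minority)
    extra = [rng.choice(minority) for _ in range(target - len(minority))]
    minority_idx = minority + extra

    # Undersampling: draw majority rows without replacement down to the
    # new minority size (or keep them all if there are fewer).
    keep = min(target, len(majority))
    majority_idx = rng.sample(majority, keep)

    idx = minority_idx + majority_idx
    rng.shuffle(idx)
    return [X[i] for i in idx], [y[i] for i in idx]


# Toy data (made up): 10 preterm cases (label 1) and 90 term cases (label 0).
X = [[i] for i in range(100)]
y = [1] * 10 + [0] * 90

Xb, yb = hybrid_sample(X, y, factor=3)
counts = Counter(yb)
```

Under these assumptions, `factor` values of 2, 3, and 4 would correspond to the three configurations named in the abstract. Resampling of this kind is normally applied to the training split only, so that evaluation metrics such as recall and precision reflect the original class distribution.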