Cervical cancer prediction using machine learning models based on routine blood analysis.
Journal:
Scientific Reports
Published Date:
Jul 2, 2025
Abstract
Cervical cancer (CC) is the fourth most common cancer among women globally. Early detection, diagnosis, and treatment are key to preventing and managing CC. This study aimed to develop an interpretable model for predicting CC risk using routine blood data. The primary endpoint variable was the occurrence of CC, as confirmed by histopathological diagnosis. We used the Shapley Additive Explanation (SHAP) method to provide interpretability and identify key factors associated with CC. In this retrospective study, medical records of patients from 2013 to 2023 were collected. A total of 2,503 patients diagnosed with CC were included in the case group, while the control group comprised 3,794 patients without apparent signs of the disease, including women with other gynecological conditions as well as healthy individuals undergoing routine check-ups. Age, clinical diagnosis information, and 22 blood cell analysis results were considered. Four different algorithms were applied to construct a model for estimating the likelihood of CC occurrence. Using the least absolute shrinkage and selection operator (LASSO) and the random forest (RF) method, 15 key features were ultimately selected from an initial set of 23 features for model training. These features include age, red blood cell count (RBC), platelet distribution width (PDW), white blood cell count (WBC), lymphocyte percentage (LYMPH%), basophil count (BASO), basophil percentage (BASO%), lymphocyte absolute value (LYMPH), neutrophil percentage (NEUT%), hemoglobin (HGB), mean corpuscular hemoglobin concentration (MCHC), red cell distribution width (R-CV), mean platelet volume (MPV), and plateletcrit (PCT). Among the four models, the extreme gradient boosting (XGBoost) model achieved the highest predictive performance, with an area under the curve (AUC) of 0.964. In contrast, the RF model exhibited the poorest generalization ability, with an AUC of 0.907.
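The AUC values quoted above can be read as the probability that a randomly chosen case receives a higher predicted risk than a randomly chosen control. The following is a minimal sketch of that pairwise definition, not the authors' evaluation code; the function name `roc_auc` and the toy labels and scores are hypothetical and chosen only for illustration.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (case, control) pairs ranked correctly.

    labels: 1 for a case, 0 for a control.
    scores: predicted risk for each sample.
    Tied pairs count as half a correct ranking.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data (not from the study): 3 cases, 4 controls.
labels = [1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1]
auc = roc_auc(labels, scores)
print(f"AUC = {auc:.3f}")
```

Here 11 of the 12 case-control pairs are ordered correctly, so the toy AUC is 11/12. The pairwise loop is quadratic in sample size; for cohorts of several thousand patients a rank-based implementation would normally be preferred.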
SHAP analysis identified the top six predictors of CC by importance ranking, with the average platelet distribution width (PDW) ranked as the most important predictor of CC occurrence.
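SHAP importance rankings rest on Shapley values, which split a single prediction among the input features. The sketch below computes exact Shapley values by brute force for a toy model, assuming absent features are replaced by a baseline value. It is an illustration of the underlying idea only: the study's actual SHAP configuration is not described here, and `toy_model`, its coefficients, and the inputs are hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values for one sample x relative to a baseline.

    Features outside a coalition take their baseline value.
    Cost grows exponentially with the number of features.
    """
    n = len(x)

    def value(coalition):
        z = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            # Shapley weight for coalitions of size k that exclude feature i.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def toy_model(z):
    """Hypothetical risk score with one interaction term (not from the study)."""
    pdw, age, wbc = z
    return 0.5 * pdw + 0.2 * age + 0.1 * pdw * wbc


x = [2.0, 1.0, 3.0]          # hypothetical standardized inputs
baseline = [0.0, 0.0, 0.0]   # hypothetical reference point
phi = shapley_values(toy_model, x, baseline)
print([round(v, 3) for v in phi])
```

The attributions sum to the difference between the prediction for `x` and the prediction for the baseline, and the interaction term is shared equally between the two features involved. Practical SHAP implementations for tree ensembles use faster algorithms than this enumeration.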