Cervical cancer prediction using machine learning models based on routine blood analysis.

Journal: Scientific reports
Published Date:

Abstract

Cervical cancer (CC) is the fourth most common cancer among women globally. The key to preventing and treating CC is early detection, diagnosis, and treatment. This study aimed to develop an interpretable model for predicting CC risk using routine blood data. The primary endpoint variable is the occurrence of CC, as confirmed by histopathological diagnosis. We used the Shapley Additive Explanations (SHAP) method to provide interpretability and identify key factors associated with CC. In this retrospective study, medical records of patients from 2013 to 2023 were collected. A total of 2,503 patients diagnosed with CC were included in the case group, while the control group comprised 3,794 patients without apparent signs of the disease, including women with other gynecological conditions as well as healthy individuals undergoing routine check-ups. Age, clinical diagnosis information, and 22 blood cell analysis results were considered. Four different algorithms were applied to construct a model for estimating the likelihood of CC occurrence. Using the least absolute shrinkage and selection operator (LASSO) and the random forest (RF) method, 15 key routine blood features were ultimately selected from an initial set of 23 features for model training. These features include age, red blood cell count (RBC), platelet distribution width (PDW), white blood cell count (WBC), lymphocyte percentage (LYMPH%), basophil count (BASO), basophil percentage (BASO%), lymphocyte absolute value (LYMPH), neutrophil percentage (NEUT%), hemoglobin (HGB), mean corpuscular hemoglobin concentration (MCHC), red cell distribution width (R-CV), mean platelet volume (MPV), and plateletcrit (PCT). Among the four models, the extreme gradient boosting (XGBoost) model achieved the highest predictive performance, with an area under the curve (AUC) of 0.964. In contrast, the RF model exhibited the poorest generalization ability, with an AUC of 0.907.
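As an aid to reading the AUC figures above, the following minimal sketch computes AUC as the fraction of (case, control) pairs in which the case receives the higher predicted score. It is an illustration only, not the study's evaluation code: the function name `roc_auc` and the toy labels and scores are assumptions made up for this example, and a real analysis would more likely rely on an established library routine.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (case, control) pairs ranked correctly.

    A tie between a case score and a control score counts as half a pair.
    """
    case_scores = [s for y, s in zip(labels, scores) if y == 1]
    control_scores = [s for y, s in zip(labels, scores) if y == 0]
    if not case_scores or not control_scores:
        raise ValueError("need at least one case and one control")

    wins = 0.0
    for c in case_scores:
        for k in control_scores:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


# Toy, made-up predictions (NOT data from the study): 1 = case, 0 = control.
toy_labels = [1, 1, 1, 0, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1]
toy_auc = roc_auc(toy_labels, toy_scores)
print(round(toy_auc, 3))
```

Under this reading, an AUC of 0.964 would mean that a randomly chosen case is scored above a randomly chosen control in roughly 96% of such pairs.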
The SHAP method identified the top six predictors of CC by importance ranking, with the average platelet distribution width (PDW) recognized as the most important predictor of CC occurrence (the primary endpoint variable).
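The attribution idea behind SHAP can be sketched by computing exact Shapley values for a very small model. The example below is a hedged illustration, not the authors' pipeline: `toy_model`, its coefficients, the patient values, and the all-zero reference point are invented for this sketch, and only the feature names are borrowed from the abstract. In practice the SHAP library typically averages over a background dataset and uses tree-specific algorithms rather than the brute-force enumeration shown here.

```python
from itertools import combinations
from math import factorial
from typing import Callable, Dict


def shapley_values(
    model: Callable[[Dict[str, float]], float],
    x: Dict[str, float],
    baseline: Dict[str, float],
) -> Dict[str, float]:
    """Exact Shapley values by enumerating every feature subset."""
    names = list(x)
    n = len(names)

    def value(present: set) -> float:
        # Features in `present` take the patient's value; the rest take the baseline.
        mixed = {k: (x[k] if k in present else baseline[k]) for k in names}
        return model(mixed)

    phi = {}
    for name in names:
        others = [k for k in names if k != name]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                present = set(subset)
                total += weight * (value(present | {name}) - value(present))
        phi[name] = total
    return phi


def toy_model(f: Dict[str, float]) -> float:
    # Hypothetical risk score; coefficients are made up for illustration.
    return 0.5 * f["PDW"] + 0.2 * f["age"] + 0.1 * f["PDW"] * f["WBC"]


patient = {"PDW": 2.0, "age": 1.0, "WBC": 3.0}    # made-up, standardized values
reference = {"PDW": 0.0, "age": 0.0, "WBC": 0.0}  # assumed baseline
phi = shapley_values(toy_model, patient, reference)
print({k: round(v, 3) for k, v in phi.items()})
```

The attributions sum to the difference between the model output for the patient and for the reference point, which is the property that makes a per-feature importance ranking of this kind meaningful.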

Authors

  • Jie Su
    School of Information Engineering, Suqian University, Suqian, Jiangsu, China.
  • Hui Lu
    Key Laboratory of the plateau of environmental damage control, Lanzhou General Hospital of Lanzhou Military Command, Lanzhou, China.
  • Ruihuan Zhang
    The Inner Mongolia Medical Intelligent Diagnostics Big Data Research Institute, Inner Mongolia, People's Republic of China.
  • Na Cui
    Department of Critical Care Medicine, State Key Laboratory of Complex Severe and Rare Diseases, Peking Union Medical College Hospital, Chinese Academy of Medical Science and Peking Union Medical College, Beijing, China.
  • Chao Chen
    Department of Neonatology, Children's Hospital of Fudan University, National Children's Medical Center, Shanghai, China.
  • Qin Si
    Department of Control Science and Engineering, University of Shanghai for Science and Technology, Shanghai, China.
  • Biao Song
    Inner Mongolia Wesure Date Technology Co., Ltd, Inner Mongolia, P.R. China.