Explainable Machine Learning Models for Colorectal Cancer Prediction Using Clinical Laboratory Data.
Journal:
Cancer control : journal of the Moffitt Cancer Center
PMID:
40334702
Abstract
IntroductionEarly diagnosis of colorectal cancer (CRC) poses a significant clinical challenge. This study aims to develop machine learning (ML) models for CRC risk prediction using clinical laboratory data.MethodsThis retrospective, single-center study analyzed laboratory examination data from healthy controls (HC), polyp patients (Polyp), and CRC patients between 2013 and 2023. Five ML algorithms, including adaptive boosting (AdaBoost), extreme gradient boosting (XGBoost), decision tree (DT), logistic regression (LR), and random forest (RF), were employed to classify subjects into HC vs Polyp vs CRC, HC vs CRC, and Polyp vs CRC, respectively.ResultsThis study included 31 539 subjects: 11 793 HCs, 10 125 polyp patients, and 9621 CRC patients. The XGBoost model achieved the highest AUCs of 0.966 for differentiating HC from CRC and 0.881 for Polyp from CRC, outperforming carcino-embryonic antigen (CEA) and fecal occult blood testing (FOBT) tests. This model could also identify CEA-negative or FOBT-negative CRC patients. Incorporating stool miR-92a detection into the model further improved diagnostic performance. Shapley additive explanations (SHAP) plots indicated that FOBT, CEA, lymphocyte percentage (LYMPH%), and hematocrit (HCT) were the most significant features contributing to CRC diagnosis. Additionally, a computational tool for predicting CRC risk based on the optimal model was developed, designed for researchers with programming experience.ConclusionFive ML models for CRC diagnosis, based on ten routine laboratory test items, were developed, achieving higher diagnostic accuracies than traditional CRC biomarkers. The diagnostic capabilities of these ML models can be further enhanced by including stool miR-92a levels.