Building a cancer risk and survival prediction model based on social determinants of health combined with machine learning: A NHANES 1999 to 2018 retrospective cohort study.
Journal:
Medicine
PMID:
39928823
Abstract
The occurrence and progression of cancer is a significant focus of research worldwide, often accompanied by a prolonged disease course. Concurrently, researchers have identified that social determinants of health (SDOH) (employment status, family income and poverty ratio, food security, education level, access to healthcare services, health insurance, housing conditions, and marital status) are associated with the progression of many chronic diseases. However, there is a paucity of research examining the influence of SDOH on cancer incidence risk and the survival of cancer survivors. The aim of this study was to utilize SDOH as a primary predictive factor, integrated with machine learning models, to forecast both cancer risk and prognostic survival. This research is grounded in the SDOH data derived from the National Health and Nutrition Examination Survey dataset spanning 1999 to 2018. It employs methodologies including adaptive boosting, gradient boosting machine (GradientBoosting), random forest (RF), extreme gradient boosting, light gradient boosting machine, support vector machine, and logistic regression to develop models for predicting cancer risk and prognostic survival. The hyperparameters of these models-specifically, the number of estimators (100-200), maximum tree depth (10), learning rate (0.01-0.2), and regularization parameters-were optimized through grid search and cross-validation, followed by performance evaluation. Shapley Additive exPlanations plots were generated to visualize the influence of each feature. RF was the best model for predicting cancer risk (area under the curve: 0.92, accuracy: 0.84). Age, non-Hispanic White, sex, and housing status were the 4 most important characteristics of the RF model. Age, gender, employment status, and household income/poverty ratio were the 4 most important features in the gradient boosting machine model. The predictive models developed in this study exhibited strong performance in estimating cancer incidence risk and survival time, identifying several factors that significantly influence both cancer incidence risk and survival, thereby providing new evidence for cancer management. Despite the promising findings, this study acknowledges certain limitations, including the omission of risk factors in the cancer survivor survival model and potential biases inherent in the National Health and Nutrition Examination Survey dataset. Future research is warranted to further validate the model using external datasets.