A comparative analysis of data-driven models for breast cancer survival prediction.

Journal: Scientific Reports
Published Date:

Abstract

Breast cancer is the most frequently diagnosed cancer among women and persists as a societal problem worldwide. It remains a leading cause of cancer-associated morbidity and mortality, particularly in low- and middle-income countries where access to timely diagnosis and treatment is often limited. This study compares survival models and classical machine learning models for predicting breast cancer survival in Ethiopia, with the aim of identifying approaches that balance predictive accuracy with interpretability. The study used retrospective data from 1164 women treated at Tikur Anbesa Specialized Hospital and Hiwot Fana Specialized University Hospital between 2019 and 2024. Kaplan-Meier estimation, Cox proportional hazards, random survival forests (RSF), DeepSurv, and classical machine learning classifiers (SVM, XGBoost, LGBM, and random forest (RF)) were applied and evaluated with metrics such as AUC, C-index, and Integrated Brier Score (IBS). Shapley Additive Explanations (SHAP) were used to interpret models such as RSF, DeepSurv, and RF, and to identify predictors of breast cancer outcome that were consistent across models. The RSF (C-index: 0.754; IBS: 0.091) and the RF (0.729 ± 0.006) achieved the highest performance, outperforming the other models under consideration. The SHAP analysis for the RSF model identified age, tumour size, metastasis, stage, comorbidities, and marital status as the most important predictors of breast cancer survival. For the RF model, the SHAP analysis indicated that the higher age category (45 and above), metastasis status (M1), stage four, and larger tumour size strongly influence predictions. Among the machine learning models, the random forest algorithm effectively identified the key predictors of breast cancer outcomes. Among the survival analysis methods, the RSF offers robust handling of time-to-event data and censoring, making it well suited to accurate survival prediction. Combining these approaches gave clearer insight into the key factors influencing breast cancer prognosis. This study highlights the value of data-driven methods in helping healthcare professionals identify high-risk patients with greater precision and take timely, informed action to support their care.
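
The C-index reported above measures how often a model ranks patients in the right order of risk when some follow-up times are censored. The following is a minimal sketch of Harrell's concordance index in plain Python, not the study's implementation, which the abstract does not specify. The function name `harrell_c_index` and the toy data are hypothetical and chosen only for illustration, and pairs with tied follow-up times are simply skipped as a simplification.

```python
from itertools import combinations


def harrell_c_index(times, events, risk_scores):
    """Harrell's concordance index for right-censored data.

    times:       observed follow-up times
    events:      1 if the event was observed, 0 if the subject was censored
    risk_scores: model output where a higher score means higher predicted risk
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Order the pair so that subject `a` has the shorter follow-up time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        # A pair is comparable only if the times differ and the subject with
        # the shorter time had an observed event (simplification: tied times
        # are skipped).
        if times[a] == times[b] or not events[a]:
            continue
        comparable += 1
        if risk_scores[a] > risk_scores[b]:
            concordant += 1.0  # earlier event was given the higher risk
        elif risk_scores[a] == risk_scores[b]:
            concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data invented for illustration only (not taken from the study).
toy_times = [5, 8, 12, 20, 25]
toy_events = [1, 0, 1, 1, 0]
toy_risk = [0.9, 0.4, 0.7, 0.2, 0.3]
toy_c_index = harrell_c_index(toy_times, toy_events, toy_risk)
print(round(toy_c_index, 3))
```

In this toy example, seven pairs are comparable and six of them are ranked correctly, so the index is 6/7. A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which gives a scale for reading a C-index such as the 0.754 reported for the RSF.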
