Semi-supervised learning to improve generalizability of risk prediction models.

Journal: Journal of biomedical informatics
Published Date:

Abstract

The utility of a prediction model depends on its generalizability to patients drawn from different but related populations. We explored whether a semi-supervised learning model could improve the generalizability of colorectal cancer (CRC) risk prediction relative to supervised learning methods. Data on 113,141 patients diagnosed with nonmetastatic CRC from 2004 to 2012 were obtained from the Surveillance, Epidemiology, and End Results (SEER) registry for model development, and data on 1,149 patients diagnosed between 2004 and 2011 at the Second Affiliated Hospital, Zhejiang University School of Medicine, were collected for generalizability testing. We developed and validated a clinical prediction model for CRC survival risk using semi-supervised logistic regression, and assessed its discrimination, calibration, generalizability, interpretability, and clinical usefulness. Its performance was compared rigorously with that of supervised learning models. The area under the curve of the logistic membership model revealed substantial heterogeneity between the development and validation cohorts, which is typical of generalizability studies of prediction models. Discrimination was good for all models. Calibration on the validation cohort was poor for the supervised learning models but good for the semi-supervised logistic regression model, indicating good generalizability. Clinical usefulness analysis showed that semi-supervised logistic regression can lead to better clinical outcomes than supervised learning methods. These results improve our understanding of the generalizability of different models and provide a reference for developing predictive models for clinical decision-making.
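The abstract reports an area under the curve for a logistic membership model without describing the procedure. As commonly used in external validation work, such a model is fit to predict which cohort a patient belongs to: an AUC near 0.5 suggests similar case mix, and an AUC well above 0.5 suggests heterogeneity between cohorts. The following is a minimal sketch under that assumption, not the authors' implementation. The cohorts are synthetic, the two standardized predictors are hypothetical, and all function names are illustrative.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(X, y, lr=0.5, epochs=200):
    """Fit unregularized logistic regression by batch gradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            grad_b += err
            for j in range(d):
                grad_w[j] += err * xi[j]
        b -= lr * grad_b / n
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
    return w, b


def auc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def membership_auc(dev_rows, val_rows):
    """In-sample AUC of a model predicting cohort membership (0 = development, 1 = validation)."""
    X = dev_rows + val_rows
    y = [0] * len(dev_rows) + [1] * len(val_rows)
    w, b = fit_logistic(X, y)
    scores = [sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) for xi in X]
    return auc(scores, y)


def synthetic_cohort(n, shift, rng):
    # Two hypothetical standardized predictors; only the first one is shifted.
    return [[rng.gauss(shift, 1.0), rng.gauss(0.0, 1.0)] for _ in range(n)]


if __name__ == "__main__":
    rng = random.Random(0)
    dev = synthetic_cohort(300, 0.0, rng)
    similar = synthetic_cohort(300, 0.0, rng)
    shifted = synthetic_cohort(300, 1.5, rng)
    print("similar cohorts:", round(membership_auc(dev, similar), 3))
    print("shifted cohorts:", round(membership_auc(dev, shifted), 3))
```

The AUC here is computed on the same rows used for fitting, which is adequate for a low-dimensional illustration but would be optimistic with many predictors; a held-out split or cross-validation would be the safer choice in practice.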

Authors

  • Shengqiang Chi
    Engineering Research Center of EMR and Intelligent Expert System, Ministry of Education, Collaborative Innovation Center for Diagnosis and Treatment of Infectious Diseases, College of Biomedical Engineering and Instrument Science, Zhejiang University, Hangzhou, Zhejiang Province, China.
  • Xinhang Li
    Engineering Research Center of EMR and Intelligent Expert System, Ministry of Education, Collaborative Innovation Center for Diagnosis and Treatment of Infectious Diseases, College of Biomedical Engineering and Instrument Science, Zhejiang University, Hangzhou, Zhejiang Province, China.
  • Yu Tian
    Key Laboratory of Development and Maternal and Child Diseases of Sichuan Province, Department of Pediatrics, Sichuan University, Chengdu, China.
  • Jun Li
    Department of Emergency, Zhuhai Integrated Traditional Chinese and Western Medicine Hospital, Zhuhai, 519020, Guangdong Province, China. quanshabai43@163.com.
  • Xiangxing Kong
    Department of Surgical Oncology, The Second Affiliated Hospital of Zhejiang University Medical School, Hangzhou, China.
  • Kefeng Ding
    Department of Surgical Oncology, The Second Affiliated Hospital of Zhejiang University Medical School, Hangzhou, China.
  • Chunhua Weng
    Department of Biomedical Informatics, Columbia University.
  • Jingsong Li
    Research Center for Healthcare Data Science, Zhejiang Lab, Hangzhou, China.