Prediction of lung cancer patient survival via supervised machine learning classification techniques.

Journal: International journal of medical informatics
Published Date:

Abstract

Outcomes for cancer patients have previously been estimated by applying various machine learning techniques to large datasets such as the Surveillance, Epidemiology, and End Results (SEER) program database. For lung cancer in particular, it is not well understood which types of techniques yield more predictive information, or which data attributes should be used to obtain it. In this study, a number of supervised learning techniques are applied to the SEER database to classify lung cancer patients in terms of survival, including linear regression, Decision Trees, Gradient Boosting Machines (GBM), Support Vector Machines (SVM), and a custom ensemble. Key data attributes in applying these methods include tumor grade, tumor size, gender, age, stage, and number of primaries, with the goal of enabling comparison of predictive power between the various methods. The prediction is treated as a continuous target, rather than as a classification into categories, as a first step towards improving survival prediction. The results show that the predicted values agree with actual values for low to moderate survival times, which constitute the majority of the data. The best performing technique was the custom ensemble, with a Root Mean Square Error (RMSE) value of 15.05. The most influential model within the custom ensemble was GBM, while Decision Trees may be inapplicable because they produced too few discrete outputs. The results further show that among the five individual models generated, the most accurate was GBM, with an RMSE value of 15.32. Although SVM underperformed with an RMSE value of 15.82, statistical analysis singles out SVM as the only model that generated a distinctive output. The results of the models are consistent with a classical Cox proportional hazards model used as a reference technique.
We conclude that applying these supervised learning techniques to lung cancer data in the SEER database may be useful for estimating patient survival time, with the ultimate goal of informing patient care decisions, and that the performance of these techniques on this particular dataset may be on par with that of classical methods.
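
The RMSE comparison described in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the survival times and per-model predictions are invented (they are not SEER data), the unit of months is assumed, the function names `rmse` and `weighted_ensemble` are hypothetical, and the weighted average is only a simple stand-in for the custom ensemble, whose actual construction is not given in the abstract.

```python
import math


def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def weighted_ensemble(predictions, weights):
    """Combine per-model predictions with a weighted average.

    This is a stand-in for illustration, not the authors' ensemble.
    """
    total_weight = sum(weights[name] for name in predictions)
    n_samples = len(next(iter(predictions.values())))
    return [
        sum(weights[name] * predictions[name][i] for name in predictions)
        / total_weight
        for i in range(n_samples)
    ]


# Invented survival times (assumed unit: months); NOT taken from SEER.
actual = [6.0, 12.0, 3.0, 24.0, 9.0]

# Invented outputs standing in for three fitted regression models.
predictions = {
    "linear": [8.0, 10.0, 5.0, 18.0, 9.0],
    "gbm": [7.0, 11.0, 4.0, 20.0, 10.0],
    "svm": [5.0, 9.0, 3.0, 15.0, 8.0],
}

# Hypothetical weights; GBM is weighted more heavily purely as an example.
weights = {"linear": 1.0, "gbm": 2.0, "svm": 1.0}

# Score each individual model, then the combined prediction.
scores = {name: rmse(actual, pred) for name, pred in predictions.items()}
scores["ensemble"] = rmse(actual, weighted_ensemble(predictions, weights))

# Lower RMSE means predictions are closer to the actual values.
for name, score in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{name:10s} RMSE = {score:.2f}")
```

Because RMSE is expressed in the same unit as the target, values from different models fitted to the same data can be compared directly, which is how the abstract ranks the ensemble, GBM, and SVM.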

Authors

  • Chip M Lynch
    Department of Computer Engineering and Computer Science, University of Louisville, KY, USA.
  • Behnaz Abdollahi
    Department of Electrical and Computer Engineering, University of Louisville, KY, USA.
  • Joshua D Fuqua
    Department of Bioengineering, University of Louisville, KY, USA.
  • Alexandra R de Carlo
    Department of Bioengineering, University of Louisville, KY, USA.
  • James A Bartholomai
    Department of Bioengineering, University of Louisville, Louisville, KY, USA.
  • Rayeanne N Balgemann
    Department of Bioengineering, University of Louisville, KY, USA.
  • Victor H van Berkel
    Department of Cardiovascular and Thoracic Surgery, University of Louisville, KY, USA.
  • Hermann B Frieboes
    Department of Bioengineering, University of Louisville, Louisville, KY, USA.