Regression and machine learning approaches identify potential risk factors for glioblastoma multiforme.

Journal: Brain Communications
Published Date:

Abstract

Glioblastoma multiforme is a lethal disease, with a 5-year survival rate of <10%. Identifying risk factors for glioblastoma multiforme is essential for understanding the disease and could facilitate more effective stratification of high-risk individuals. However, current knowledge of glioblastoma multiforme risk factors is limited. Given the complexity and heterogeneity of the disease, traditional epidemiological approaches may be insufficient to study its risk factors. Combining traditional approaches with machine learning models could prove effective in identifying relevant factors for glioblastoma multiforme risk. In this study, we developed glioblastoma multiforme risk models in the UK Biobank cohort using 576 glioblastoma multiforme cases and 302 602 controls. First, 369 exposures were tested with traditional regression models in a case-control study and significant associations were identified. Subsequently, significant features were filtered based on their completion rate and correlation. The selected exposures were then used to develop two machine learning models: a support vector machine and a multi-layer perceptron. To address the imbalance within the subpopulation, two controls per case with full data were selected, resulting in 442 glioblastoma multiforme cases and 884 controls being analysed with the machine learning models. Relevant factors for glioblastoma multiforme risk were identified by explaining the results of the two models with SHapley Additive exPlanations. Traditional regression methods identified 38 significant associations between environmental exposures and glioblastoma multiforme risk under the Bonferroni threshold (P < 1.35 × 10⁻⁴). Subsequent filtering resulted in the selection of 12 exposures, which were then analysed together with age, sex and a polygenic score using the two machine learning models. The support vector machine and the multi-layer perceptron demonstrated good sensitivity (0.91 and 0.82, respectively). In addition to age and genetics, SHapley Additive exPlanations demonstrated significant contributions of insulin-like growth factor 1 blood levels and right-hand grip strength to the predictions made by the models, with the latter effect potentially being confounded by endogenous testosterone levels. The integration of machine learning with traditional models has the potential to enhance the identification of risk factors for glioblastoma multiforme.
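
Two of the numerical steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' code. It assumes a family-wise significance level of 0.05 (not stated in the abstract) and assumes that controls were drawn at random without replacement, since the abstract says only that two controls per case were "selected". The function names and the pool of control identifiers are hypothetical.

```python
import random


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests


def sample_controls(case_ids, control_ids, ratio=2, seed=0):
    """Draw `ratio` controls per case at random, without replacement.

    Random sampling is an assumption; the abstract does not say how
    the controls were chosen.
    """
    n_needed = ratio * len(case_ids)
    if n_needed > len(control_ids):
        raise ValueError("not enough controls for the requested ratio")
    rng = random.Random(seed)
    return rng.sample(list(control_ids), n_needed)


# 369 exposures tested; alpha = 0.05 is assumed, not taken from the abstract.
threshold = bonferroni_threshold(0.05, 369)
print(f"Bonferroni threshold: {threshold:.3g}")

# 442 cases with full data; the control ID pool below is made up.
cases = list(range(442))
control_pool = list(range(10_000, 15_000))
selected = sample_controls(cases, control_pool, ratio=2)
print(f"Selected controls: {len(selected)}")
```

Under these assumptions the threshold comes out at roughly 1.355 × 10⁻⁴, in line with the reported 1.35 × 10⁻⁴, and a 2:1 ratio applied to 442 cases gives the 884 controls reported.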

Authors

  • Alessio Felici
    Department of Biology, University of Pisa, Via Luca Ghini, 13 - 56126, Pisa, Italy. Electronic address: alessio.felici@phd.unipi.it.
  • Giulia Peduzzi
    Department of Biology, University of Pisa, Via Luca Ghini, 13 - 56126, Pisa, Italy. Electronic address: giulia.peduzzi@biologia.unipi.it.
  • Roberto Pellungrini
    Classe di scienze, Scuola Normale Superiore, Piazza dei Cavalieri, 7 - 56126, Pisa, Italy. Electronic address: roberto.pellungrini@sns.it.
  • Daniele Campa
    Department of Biology, University of Pisa, Via Luca Ghini, 13 - 56126, Pisa, Italy. Electronic address: daniele.campa@unipi.it.
  • Federico Canzian
    Genomic Epidemiology Group, German Cancer Research Center (DKFZ), Heidelberg 69120, Germany.
