Machine learning workflows to estimate class probabilities for precision cancer diagnostics on DNA methylation microarray data.

Journal: Nature Protocols
Published Date:

Abstract

DNA methylation data-based precision cancer diagnostics is emerging as the state of the art for molecular tumor classification. Standards for choosing statistical methods with regard to well-calibrated probability estimates for these typically highly multiclass classification tasks are still lacking. To support this choice, we evaluated well-established machine learning (ML) classifiers including random forests (RFs), elastic net (ELNET), support vector machines (SVMs) and boosted trees in combination with post-processing algorithms and developed ML workflows that allow for unbiased class probability (CP) estimation. Calibrators included ridge-penalized multinomial logistic regression (MR) and Platt scaling by fitting logistic regression (LR) and Firth's penalized LR. We compared these workflows on a recently published brain tumor 450k DNA methylation cohort of 2,801 samples with 91 diagnostic categories using a 5 × 5-fold nested cross-validation scheme and demonstrated their generalizability on external data from The Cancer Genome Atlas. ELNET was the top stand-alone classifier with the best calibration profiles. The best overall two-stage workflow was MR-calibrated SVM with linear kernels, closely followed by ridge-calibrated tuned RF. For calibration, MR was the most effective regardless of the primary classifier. The protocols developed as a result of these comparisons provide valuable guidance on choosing ML workflows and their tuning to generate well-calibrated CP estimates for precision diagnostics using DNA methylation data. Depending on the ML algorithm, computation times range from <15 min to 5 d on multi-core desktop PCs. Detailed scripts in the open-source R language are freely available on GitHub, targeting users with intermediate experience in bioinformatics and statistics and using R with Bioconductor extensions.
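
To illustrate the calibration step mentioned above, the following is a minimal sketch of Platt scaling: a logistic regression fitted on raw classifier scores so that the resulting probabilities are better calibrated. It is not the authors' R code. It assumes a binary simplification of the 91-class problem, uses simulated scores in place of real classifier output, and fits the two Platt parameters by plain gradient descent. All function names (`fit_platt`, `simulate`, `log_loss`) and the simulation settings are hypothetical and chosen only for illustration.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_platt(scores, labels, lr=0.1, n_iter=1000):
    """Fit p = sigmoid(a * score + b) by gradient descent on the mean log loss."""
    a, b = 0.0, 0.0
    n = len(scores)
    for _ in range(n_iter):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            err = sigmoid(a * s + b) - y  # derivative of log loss w.r.t. the logit
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


def log_loss(probs, labels, eps=1e-12):
    """Mean negative log-likelihood; lower values indicate better probabilities."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # guard against log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)


def simulate(n, rng):
    """Simulated (assumed) data: the true logit is z, but the 'classifier'
    reports an overconfident score of 5 * z."""
    scores, labels = [], []
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)
        labels.append(1 if rng.random() < sigmoid(z) else 0)
        scores.append(5.0 * z)
    return scores, labels


rng = random.Random(0)
# The calibrator is fitted on one set of scores and evaluated on another,
# mirroring the idea of keeping calibration and evaluation data separate.
calib_scores, calib_labels = simulate(1000, rng)
test_scores, test_labels = simulate(1000, rng)

a, b = fit_platt(calib_scores, calib_labels)
raw_loss = log_loss([sigmoid(s) for s in test_scores], test_labels)
cal_loss = log_loss([sigmoid(a * s + b) for s in test_scores], test_labels)
```

In this simulation the fitted slope `a` shrinks the overconfident scores back toward the true scale, and `cal_loss` can be compared with `raw_loss` to check whether calibration helped. In a nested cross-validation scheme such as the one described in the abstract, the scores used to fit the calibrator would come from inner-fold predictions rather than from a simulation.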

Authors

  • Máté E Maros
    Institute of Medical Biometry and Informatics (IMBI), University of Heidelberg, Heidelberg, Germany.
  • David Capper
    German Cancer Consortium (DKTK), Partner Site Berlin, and German Cancer Research Center (DKFZ), 69210 Heidelberg, Germany. david.capper@charite.de.
  • David T W Jones
    Hopp Children's Cancer Center Heidelberg (KiTZ), Heidelberg, Germany.
  • Volker Hovestadt
    Division of Molecular Genetics, German Cancer Research Center (DKFZ), Heidelberg, Germany.
  • Andreas von Deimling
    Department of Neuropathology, Institute of Pathology, Heidelberg University Hospital, Heidelberg, Germany; German Cancer Consortium (DKTK), German Cancer Research Center (DKFZ), Heidelberg, Germany.
  • Stefan M Pfister
    Hopp Children's Cancer Center Heidelberg (KiTZ), Heidelberg, Germany.
  • Axel Benner
    Division of Biostatistics, German Cancer Research Center (DKFZ), Heidelberg, Germany.
  • Manuela Zucknick
    Oslo Centre for Biostatistics and Epidemiology, Department of Biostatistics, Institute of Basic Medical Sciences, University of Oslo, Oslo, Norway.
  • Martin Sill
    Hopp Children's Cancer Center Heidelberg (KiTZ), Heidelberg, Germany. m.sill@kitz-heidelberg.de.