Guiding questions to avoid data leakage in biological machine learning applications.

Journal: Nature Methods
Published Date:

Abstract

Machine learning methods for extracting patterns from high-dimensional data play a central role in the biological sciences. In some cases, however, the reported prediction performance cannot be reproduced in real-world applications. One of the main reasons is data leakage, which can be seen as the illicit sharing of information between the training data and the test data; it results in performance estimates that are far better than the performance observed in the intended application scenario. Data leakage can be difficult to detect in biological datasets because of their complex dependencies. With this in mind, we present seven questions that should be asked to prevent data leakage when constructing machine learning models in biological domains. We illustrate the usefulness of our questions by applying them to nontrivial examples. Our goal is to raise awareness of potential data leakage problems and to promote robust and reproducible machine learning-based research in biology.
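
As a rough illustration of how information can be shared between training and test data, the sketch below compares a naive positional split with a split that keeps dependent samples together. It is not taken from the paper and does not reproduce the seven questions. The function names group_split and shared_groups, the patient labels, and the assumption that dependencies can be summarized by a single group label per sample (for example a patient ID or a sequence-similarity cluster) are all hypothetical.

```python
import random
from collections import defaultdict


def group_split(groups, test_fraction=0.3, seed=0):
    """Split sample indices so that no group spans train and test.

    groups: one group label per sample (hypothetical, e.g. a patient ID
    or a sequence-similarity cluster ID).
    """
    by_group = defaultdict(list)
    for idx, g in enumerate(groups):
        by_group[g].append(idx)

    labels = sorted(by_group)              # fixed order before shuffling
    random.Random(seed).shuffle(labels)

    target = test_fraction * len(groups)
    train, test = [], []
    for g in labels:
        # Move whole groups into the test set until it reaches the target size.
        if len(test) < target:
            test.extend(by_group[g])
        else:
            train.extend(by_group[g])
    return sorted(train), sorted(test)


def shared_groups(groups, train, test):
    """Return the group labels that occur on both sides of a split."""
    return {groups[i] for i in train} & {groups[i] for i in test}


# Toy data: 8 samples from 4 hypothetical patients.
groups = ["p1", "p1", "p2", "p2", "p3", "p3", "p4", "p4"]

# A naive split by position cuts patient p3 across train and test.
naive_train, naive_test = list(range(5)), list(range(5, 8))
print(shared_groups(groups, naive_train, naive_test))   # {'p3'}

# The group-aware split keeps every patient on one side only.
train, test = group_split(groups)
print(shared_groups(groups, train, test))               # set()
```

Because whole groups are assigned at once, the test set can end up larger than test_fraction suggests (here 4 of 8 samples). Whether a single group label captures the relevant dependencies has to be decided per dataset.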

Authors

  • Judith Bernett
    Data Science in Systems Biology, TUM School of Life Sciences, Technical University of Munich, Maximus-von-Imhof Forum 3, 85354, Freising, Germany.
  • David B Blumenthal
    Chair of Experimental Bioinformatics, TUM School of Life Sciences, Technical University of Munich, Freising, Germany. david.b.blumenthal@fau.de.
  • Dominik G Grimm
    Technical University of Munich, Campus Straubing for Biotechnology and Sustainability, Bioinformatics, Straubing, Germany.
  • Florian Haselbeck
    TUM Campus Straubing for Biotechnology and Sustainability, Technical University of Munich, Straubing, Germany.
  • Roman Joeres
    Department of Chemistry and Molecular Biology, University of Gothenburg, Gothenburg, Sweden.
  • Olga V Kalinina
    Helmholtz Institute for Pharmaceutical Research Saarland (HIPS), Helmholtz Centre for Infection Research (HZI), Saarbrücken, Germany.
  • Markus List
    Chair of Experimental Bioinformatics, TUM School of Life Sciences, Technical University of Munich, Freising, Germany.