Data splitting to avoid information leakage with DataSAIL.

Journal: Nature communications
PMID:

Abstract

Information leakage is an increasingly important topic in machine learning research for biomedical applications. When information leaks during training, the model risks memorizing the training data instead of learning generalizable properties. This can lead to inflated performance metrics that do not reflect the actual performance at inference time. We present DataSAIL, a versatile Python package that facilitates leakage-reduced data splitting and thereby enables realistic evaluation of machine learning models for biological data that are intended to be applied in out-of-distribution scenarios. DataSAIL formulates the search for leakage-reduced data splits as a combinatorial optimization problem. We prove that this problem is NP-hard and provide a scalable heuristic based on clustering and integer linear programming. Finally, we empirically demonstrate DataSAIL's impact on evaluating biomedical machine learning models.
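
To make the idea of cluster-based, leakage-reduced splitting concrete, the following is a minimal sketch and not DataSAIL's implementation or API. All function names are hypothetical. It assumes pairwise similarities are given as a dictionary, groups items by a simple similarity threshold (an assumed clustering choice), and assigns whole clusters to splits with a greedy rule that stands in for the integer linear program mentioned in the abstract. Leakage is measured here, again as an assumption, as the total similarity between items that end up in different splits.

```python
from collections import defaultdict


def cluster_by_threshold(n_items, similarity, threshold):
    """Group items so that any pair with similarity >= threshold shares a cluster
    (single linkage via union-find). Hypothetical helper, not part of DataSAIL."""
    parent = list(range(n_items))

    def find(i):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for (i, j), s in similarity.items():
        if s >= threshold:
            parent[find(i)] = find(j)

    clusters = defaultdict(list)
    for i in range(n_items):
        clusters[find(i)].append(i)
    return list(clusters.values())


def assign_clusters(clusters, fractions):
    """Greedy stand-in for the ILP step: place clusters, largest first, into the
    split that is currently furthest below its target size."""
    total = sum(len(c) for c in clusters)
    targets = [f * total for f in fractions]
    splits = [[] for _ in fractions]
    for cluster in sorted(clusters, key=len, reverse=True):
        k = max(range(len(splits)), key=lambda s: targets[s] - len(splits[s]))
        splits[k].extend(cluster)  # a cluster is never divided across splits
    return splits


def leaked_similarity(splits, similarity):
    """Total similarity between items assigned to different splits."""
    where = {item: k for k, members in enumerate(splits) for item in members}
    return sum(s for (i, j), s in similarity.items() if where[i] != where[j])


# Toy data: six items, two tight groups (0-1-2 and 3-4) plus weak links.
similarity = {(0, 1): 0.9, (1, 2): 0.8, (3, 4): 0.95, (2, 3): 0.1, (4, 5): 0.2}
clusters = cluster_by_threshold(6, similarity, threshold=0.5)
train, test = assign_clusters(clusters, fractions=[0.5, 0.5])
leak = leaked_similarity([train, test], similarity)
```

Because similar items are kept together in one cluster, only the weak cross-cluster similarities can cross the train/test boundary. A greedy assignment gives no optimality guarantee; an ILP solver would instead search for the cluster assignment that minimizes the leaked similarity subject to the split-size constraints.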

Authors

  • Roman Joeres
    Department of Chemistry and Molecular Biology, University of Gothenburg, Gothenburg, Sweden.
  • David B Blumenthal
    Chair of Experimental Bioinformatics, TUM School of Life Sciences, Technical University of Munich, Freising, Germany. david.b.blumenthal@fau.de.
  • Olga V Kalinina
    Helmholtz Institute for Pharmaceutical Research Saarland (HIPS), Helmholtz Centre for Infection Research (HZI), Saarbrücken, Germany.