Confound-leakage: confound removal in machine learning leads to leakage.

Journal: GigaScience
Published Date:

Abstract

BACKGROUND: Machine learning (ML) approaches are a crucial component of modern data analysis in many fields, including epidemiology and medicine. Nonlinear ML methods often achieve accurate predictions, for instance, in personalized medicine, as they are capable of modeling complex relationships between features and the target. However, ML models and their predictions can be biased by confounding information present in the features. To remove this spurious signal, researchers often employ featurewise linear confound regression (CR). While this is considered a standard approach for dealing with confounding, the possible pitfalls of using CR in ML pipelines are not fully understood.
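For readers unfamiliar with the term, featurewise linear CR can be sketched as follows: each feature is regressed on the confound(s) with an ordinary least-squares model, and the residuals replace the original feature values. The snippet below is a minimal illustration on synthetic data, not the authors' implementation. The function names, the simulated data, and the choice to fit the regression on a training split and apply it to a held-out split are all assumptions made for this example.

```python
import numpy as np


def fit_confound_regression(X_train, C_train):
    """Fit one linear model per feature: X[:, j] ~ intercept + confounds.

    Hypothetical helper for illustration; returns coefficients of shape
    (1 + n_confounds, n_features).
    """
    design = np.column_stack([np.ones(len(C_train)), C_train])
    # lstsq solves the least-squares problem for every feature column at once.
    coef, *_ = np.linalg.lstsq(design, X_train, rcond=None)
    return coef


def remove_confound(X, C, coef):
    """Subtract the confound-predicted part of each feature (keep residuals)."""
    design = np.column_stack([np.ones(len(C)), C])
    return X - design @ coef


# Synthetic data (assumed for this sketch): three features that all
# depend linearly on a single confound plus independent noise.
rng = np.random.default_rng(0)
n_samples = 200
confound = rng.normal(size=(n_samples, 1))
X = 2.0 * confound + rng.normal(size=(n_samples, 3))

# Assumed workflow: estimate the CR model on training data only,
# then apply the same coefficients to the held-out data.
train, test = slice(0, 150), slice(150, n_samples)
coef = fit_confound_regression(X[train], confound[train])
X_train_res = remove_confound(X[train], confound[train], coef)
X_test_res = remove_confound(X[test], confound[test], coef)
```

By construction, the training residuals are linearly uncorrelated with the confound. This guarantee covers only linear dependence, feature by feature, so it says nothing about what a nonlinear model applied afterwards may still pick up.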

Authors

  • Sami Hamdan
    Institute of Neuroscience and Medicine, Brain and Behaviour (INM-7), Forschungszentrum Jülich, 52428 Jülich, Germany.
  • Bradley C Love
    University College London, London, UK.
  • Georg G von Polier
    Institute of Neuroscience and Medicine, Brain and Behaviour (INM-7), Forschungszentrum Jülich, 52428 Jülich, Germany.
  • Susanne Weis
    Institute of Neuroscience and Medicine, Brain and Behaviour (INM-7), Forschungszentrum Jülich, 52428 Jülich, Germany.
  • Holger Schwender
    Institute of Mathematics, Heinrich-Heine University Düsseldorf, 40225 Düsseldorf, Germany.
  • Simon B Eickhoff
    Institute of Neuroscience and Medicine (INM-1, INM-3), Research Centre Jülich, Jülich, Germany.
  • Kaustubh R Patil
    Institute of Neuroscience and Medicine, Brain and Behaviour (INM-7), Forschungszentrum Jülich, 52428 Jülich, Germany.