How doppelgänger effects in biomedical data confound machine learning.

Journal: Drug Discovery Today

Published Date: Oct 28, 2021

Abstract

Machine learning (ML) models have been increasingly adopted in drug development for faster identification of potential targets. Cross-validation techniques are commonly used to evaluate these models. However, the reliability of such validation methods can be affected by the presence of data doppelgängers. Data doppelgängers occur when independently derived data are very similar to each other, causing models to perform well regardless of how they are trained (i.e., the doppelgänger effect). Despite the abundance of data doppelgängers in biomedical data and their inflationary effects, they remain uncharacterized. We show their prevalence in biomedical data, demonstrate how doppelgängers arise, and provide proof of their confounding effects. To mitigate the doppelgänger effect, we recommend identifying data doppelgängers before the training-validation split.
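The recommendation to identify data doppelgängers before the training-validation split can be illustrated with a minimal sketch. The abstract does not specify how doppelgängers are detected, so the code below makes several assumptions: similarity is measured by the pairwise Pearson correlation between samples, a fixed cutoff (the hypothetical `threshold=0.95`) marks a pair as doppelgängers, and linked samples are kept on the same side of the split. The function names `find_doppelganger_pairs`, `group_samples`, and `grouped_split` are illustrative only and are not taken from the paper.

```python
import numpy as np


def find_doppelganger_pairs(X, threshold=0.95):
    """Return sample index pairs (i, j), i < j, whose Pearson correlation
    exceeds `threshold`. Rows of X are samples, columns are features.
    The correlation measure and cutoff are assumptions of this sketch."""
    corr = np.corrcoef(X)  # sample-by-sample correlation matrix
    i_idx, j_idx = np.triu_indices_from(corr, k=1)  # upper triangle only
    mask = corr[i_idx, j_idx] > threshold
    return [(int(i), int(j)) for i, j in zip(i_idx[mask], j_idx[mask])]


def group_samples(n_samples, pairs):
    """Merge samples linked by doppelgänger pairs into groups (union-find).
    Returns one group label per sample."""
    parent = list(range(n_samples))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path compression
            a = parent[a]
        return a

    for i, j in pairs:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[rj] = ri
    return [find(i) for i in range(n_samples)]


def grouped_split(groups, val_fraction=0.25, seed=0):
    """Split sample indices so that every group lands entirely in training
    or entirely in validation, never in both."""
    rng = np.random.default_rng(seed)
    unique = np.array(sorted(set(groups)))
    rng.shuffle(unique)
    n_val = max(1, int(round(val_fraction * len(unique))))
    val_groups = set(unique[:n_val].tolist())
    train = [i for i, g in enumerate(groups) if g not in val_groups]
    val = [i for i, g in enumerate(groups) if g in val_groups]
    return train, val
```

Under these assumptions, a typical use would be to call `find_doppelganger_pairs` on the full data matrix, pass the result to `group_samples`, and then build the split with `grouped_split`, so that no validation sample has a near-duplicate in the training set. Whether to keep doppelgängers together, or to handle them in some other way, is a design choice this sketch does not settle.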

Authors

Li Rong Wang

School of Computer Science and Engineering, Nanyang Technological University, Singapore.

Limsoon Wong

Department of Computer Science, National University of Singapore, Singapore; Department of Pathology, National University of Singapore, Singapore.

Wilson Wen Bin Goh

School of Biological Sciences, Nanyang Technological University, Singapore 637551, Republic of Singapore. Electronic address: wilsongoh@ntu.edu.sg.

Keywords

Machine Learning; Reproducibility of Results

External Resources

PubMed ID: 34743902
