Systematic auditing is essential to debiasing machine learning in biology.

Journal: Communications Biology

Published Date: Feb 10, 2021

Abstract

Biases in data used to train machine learning (ML) models can inflate their prediction performance and confound our understanding of how and what they learn. Although biases are common in biological data, systematic auditing of ML models to identify and eliminate these biases is not a common practice when applying ML in the life sciences. Here we devise a systematic, principled, and general approach to audit ML models in the life sciences. We use this auditing framework to examine biases in three ML applications of therapeutic interest and identify unrecognized biases that hinder the ML process and result in substantially reduced model performance on new datasets. Ultimately, we show that ML models tend to learn primarily from data biases when there is insufficient signal in the data to learn from. We provide detailed protocols, guidelines, and examples of code to enable tailoring of the auditing framework to other biomedical applications.
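The central claim of the abstract, that bias can inflate apparent performance and that the inflation disappears on new data, can be illustrated with a minimal sketch. This is not the authors' protocol or code. The toy pairs, the item names (`P1`, `N1`, `Q1`, ...), and the helper functions below are all hypothetical, chosen only to show one kind of audit: a baseline that uses no biological features at all, only how often each item appears in positive training pairs, is scored once on new pairs of previously seen items and once on pairs of entirely unseen items.

```python
from collections import defaultdict


def fit_bias_only_baseline(train_pairs):
    """For each item, learn the fraction of its training pairs labeled positive.

    Deliberately ignores all biological features; it can only exploit
    how items are represented in the training set.
    """
    positives = defaultdict(int)
    totals = defaultdict(int)
    for a, b, label in train_pairs:
        for item in (a, b):
            totals[item] += 1
            positives[item] += label
    return {item: positives[item] / totals[item] for item in totals}


def predict(rates, a, b, default=0.5):
    """Predict 1 if the mean positive rate of the two items exceeds 0.5."""
    score = (rates.get(a, default) + rates.get(b, default)) / 2
    return 1 if score > 0.5 else 0


def accuracy(rates, pairs):
    """Fraction of pairs whose predicted label matches the true label."""
    correct = sum(predict(rates, a, b) == label for a, b, label in pairs)
    return correct / len(pairs)


# Hypothetical toy data: (item_a, item_b, label). P* items occur only in
# positive pairs and N* items only in negative pairs (a representation bias).
train = [
    ("P1", "P2", 1), ("P2", "P3", 1), ("P3", "P4", 1),
    ("N1", "N2", 0), ("N2", "N3", 0), ("N3", "N4", 0),
]

# New pairs built from items that were seen during training.
seen_items_test = [
    ("P1", "P3", 1), ("P2", "P4", 1),
    ("N1", "N3", 0), ("N2", "N4", 0),
]

# Pairs built only from items never seen during training.
unseen_items_test = [
    ("Q1", "Q2", 1), ("Q3", "Q4", 1),
    ("Q5", "Q6", 0), ("Q7", "Q8", 0),
]

rates = fit_bias_only_baseline(train)
print(accuracy(rates, seen_items_test))    # 1.0
print(accuracy(rates, unseen_items_test))  # 0.5
```

In this toy setting the feature-free baseline is perfect when test items overlap with training items and falls to chance when they do not. A real model that shows a similar gap between the two evaluations may be leaning on the same kind of bias rather than on biological signal.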

Authors

Fatma-Elzahraa Eid

Broad Institute of MIT and Harvard, Cambridge, MA, USA. fatma@broadinstitute.org.

Haitham A Elmarakeby

Broad Institute of MIT and Harvard, Cambridge, MA, USA.

Yujia Alina Chan

Broad Institute of MIT and Harvard, Cambridge, MA, USA.

Nadine Fornelos

Broad Institute of MIT and Harvard, Cambridge, MA, USA.

Mahmoud ElHefnawi

Eliezer M Van Allen

Department of Medical Oncology, Dana-Farber Cancer Institute, Harvard Medical School, Harvard University, Boston, Massachusetts.

Lenwood S Heath

Department of Computer Science, Virginia Polytechnic Institute and State University, Blacksburg, VA, USA.

Kasper Lage

Department of Surgery, Massachusetts General Hospital, Boston, MA, USA. lage.kasper@mgh.harvard.edu.

Keywords

Animals; Bias; Data Mining; Databases, Protein; Histocompatibility Antigens; Humans; Machine Learning; Pharmaceutical Preparations; Protein Binding; Protein Interaction Maps; Proteins; Proteome; Proteomics; Reproducibility of Results

External Resources

PubMed ID: 33568741
