Navigating the pitfalls of applying machine learning in genomics.

Journal: Nature Reviews Genetics

Published Date: Nov 26, 2021

Abstract

The scale of genetic, epigenomic, transcriptomic, cheminformatic and proteomic data available today, coupled with easy-to-use machine learning (ML) toolkits, has propelled the application of supervised learning in genomics research. However, the assumptions behind the statistical models and performance evaluations in ML software frequently are not met in biological systems. In this Review, we illustrate the impact of several common pitfalls encountered when applying supervised ML in genomics. We explore how the structure of genomics data can bias performance evaluations and predictions. To address the challenges associated with applying cutting-edge ML methods to genomics, we describe solutions and appropriate use cases where ML modelling shows great potential.

Authors

Sean Whalen
Gladstone Institutes, University of California, San Francisco, CA, USA. Electronic address: shwhalen@gmail.com.

Jacob Schreiber
Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, USA.

William S Noble
Department of Genome Sciences, University of Washington, Seattle 98195, Washington, United States.

Katherine S Pollard
Gladstone Institute of Data Science and Biotechnology, San Francisco, CA, USA. katherine.pollard@gladstone.ucsf.edu.

Keywords

Animals; Genomics; Humans; Machine Learning; Models, Statistical; Software

External Resources

PubMed ID: 34837041
