Evaluating the factors influencing accuracy, interpretability, and reproducibility in the use of machine learning classifiers in biology to enable standardization.

Journal: Scientific Reports
PMID:

Abstract

The complexity and variability of biological data have promoted the increased use of machine learning methods to understand processes and predict outcomes. These same features complicate reliable, reproducible, interpretable, and responsible use of such methods, resulting in questionable relevance of the derived outcomes. Here we systematically explore challenges associated with applying machine learning to predict and understand biological processes using a well-characterized in vitro experimental system. We evaluated factors that vary while applying machine learning classifiers: (1) type of biochemical signature (transcripts vs. proteins), (2) data curation methods (pre- and post-processing), and (3) choice of machine learning classifier. Using accuracy, generalizability, interpretability, and reproducibility as metrics, we found that the above factors significantly modulate outcomes even within a simple model system. Our results caution against the unregulated use of machine learning methods in the biological sciences, and strongly advocate the need for data standards and validation tool-kits for such studies.

Authors

  • Kaitlyn M Martinez
    A-1 Information Systems and Modeling, Los Alamos National Laboratory, Los Alamos, NM, United States of America.
  • Kristen Wilding
    T-6 Theoretical Biology and Biophysics, Los Alamos National Laboratory, Los Alamos, NM, United States of America.
  • Trent R Llewellyn
    C-PCS Physical Chemistry and Applied Spectroscopy, Los Alamos National Laboratory, Los Alamos, NM, United States of America.
  • Daniel E Jacobsen
    C-PCS Physical Chemistry and Applied Spectroscopy, Los Alamos National Laboratory, Los Alamos, NM, United States of America.
  • Makaela M Montoya
    C-PCS Physical Chemistry and Applied Spectroscopy, Los Alamos National Laboratory, Los Alamos, NM, United States of America.
  • Jessica Z Kubicek-Sutherland
    C-PCS Physical Chemistry and Applied Spectroscopy, Los Alamos National Laboratory, Los Alamos, NM, United States of America.
  • Sweta Batni
    Defense Threat Reduction Agency, Fort Belvoir, VA, USA.
  • Carrie Manore
    T-6 Theoretical Biology and Biophysics, Los Alamos National Laboratory, Los Alamos, NM, United States of America.
  • Harshini Mukundan
    C-PCS Physical Chemistry and Applied Spectroscopy, Los Alamos National Laboratory, Los Alamos, NM, United States of America. hmukundan@lbl.gov.