Investigating the determinants of performance in machine learning for protein fitness prediction.

Journal: Protein Science: a publication of the Protein Society
Published Date:

Abstract

Machine learning (ML) has revolutionized protein biology, solving long-standing problems in protein folding, scaffold generation, and function design tasks. A range of architectures have shown success on supervised protein fitness prediction tasks. Nevertheless, in the absence of rational approaches for evaluating which architectures are optimal for specific datasets and engineering tasks, architecture choice remains challenging. Here, we propose a framework for investigating the determinants of success for a range of ML architectures. Using simulated (the NK model) and empirical fitness landscapes, we measure sequence-fitness prediction along six key performance metrics: interpolation within the training domain, extrapolation outside the training domain, robustness to increasing epistasis/ruggedness, ability to perform positional extrapolation, robustness to sparse training data, and sensitivity to sequence length. We show that architectural differences between algorithms consistently affect performance against these metrics across both experimental and theoretical landscapes. Moreover, landscape ruggedness emerges as a primary determinant of the accuracy of sequence-fitness prediction. Our methodology and results provide a rational strategy for experimental data sampling, model selection, and evaluation rooted in fitness landscape theory, one that we hope will advance sequence-fitness prediction accuracy, with implications for protein engineering and variant functional prediction.
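
For readers unfamiliar with the NK model mentioned above, the following is a minimal sketch of how such a simulated landscape can be generated and how the parameter K controls epistasis and ruggedness. It is not the authors' implementation: the function names (`make_nk_landscape`, `count_local_maxima`), the two-letter alphabet, the random choice of interacting sites, and the use of local-maximum counts as a ruggedness proxy are all assumptions made for illustration.

```python
import itertools
import random


def make_nk_landscape(n, k, alphabet="AC", seed=0):
    """Return a fitness function for an NK landscape (hypothetical helper).

    Assumption: each site interacts with k other sites chosen at random.
    """
    rng = random.Random(seed)
    # Site i depends on its own state plus the states of k other sites.
    neighbours = [
        [i] + rng.sample([j for j in range(n) if j != i], k) for i in range(n)
    ]
    # One lookup table per site: a random contribution for every joint
    # state of the k + 1 sites it depends on.
    tables = [
        {states: rng.random()
         for states in itertools.product(alphabet, repeat=k + 1)}
        for _ in range(n)
    ]

    def fitness(seq):
        # Fitness is the mean of the per-site contributions.
        total = sum(
            tables[i][tuple(seq[j] for j in neighbours[i])] for i in range(n)
        )
        return total / n

    return fitness


def count_local_maxima(n, k, alphabet="AC", seed=0):
    """Count sequences with no fitter single-site mutant (ruggedness proxy)."""
    fitness = make_nk_landscape(n, k, alphabet, seed)
    maxima = 0
    for seq in itertools.product(alphabet, repeat=n):
        f = fitness(seq)
        mutants = (
            seq[:pos] + (a,) + seq[pos + 1:]
            for pos in range(n)
            for a in alphabet
            if a != seq[pos]
        )
        if all(fitness(m) <= f for m in mutants):
            maxima += 1
    return maxima


if __name__ == "__main__":
    # K = 0 is additive (smooth); larger K adds epistatic interactions.
    for k in (0, 2, 5):
        print(k, count_local_maxima(n=6, k=k))
```

With K = 0 every site contributes independently, so the landscape is additive and has a single peak; raising K couples sites together and typically increases the number of local maxima, which is the sense of "ruggedness" used in the abstract.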

Authors

  • Mahakaran Sandhu
    Research School of Chemistry, The Australian National University, Canberra, Australian Capital Territory, Australia.
  • Adam C Mater
    ARC Centre of Excellence for Electromaterials Science, Research School of Chemistry, Australian National University, Canberra, Australian Capital Territory 2601, Australia.
  • Dana S Matthews
    Research School of Chemistry, The Australian National University, Canberra, Australian Capital Territory, Australia.
  • Matthew A Spence
    Research School of Chemistry, The Australian National University, Canberra, Australian Capital Territory, Australia.
  • Artem A Lenskiy
    School of Engineering and Technology, The University of New South Wales, Canberra, Australian Capital Territory, Australia.
  • Colin Jackson
    Research School of Chemistry, The Australian National University, Canberra, Australian Capital Territory, Australia.