Investigating the determinants of performance in machine learning for protein fitness prediction.

Journal: Protein science : a publication of the Protein Society

Published Date: Aug 1, 2025

Abstract

Machine learning (ML) has revolutionized protein biology, solving long-standing problems in protein folding, scaffold generation, and function design tasks. A range of architectures have shown success on supervised protein fitness prediction tasks. Nevertheless, in the absence of rational approaches for evaluating which architectures are optimal for specific datasets and engineering tasks, architecture choice remains challenging. Here, we propose a framework for investigating the determinants of success for a range of ML architectures. Using simulated (the NK model) and empirical fitness landscapes, we measure sequence-fitness prediction along six key performance metrics: interpolation within the training domain, extrapolation outside the training domain, robustness to increasing epistasis/ruggedness, ability to perform positional extrapolation, robustness to sparse training data, and sensitivity to sequence length. We show that architectural differences between algorithms consistently affect performance against these metrics across both experimental and theoretical landscapes. Moreover, landscape ruggedness emerges as a primary determinant of the accuracy of sequence-fitness prediction. Our methodology and results provide a rational strategy for experimental data sampling, model selection, and evaluation rooted in fitness landscape theory, one that we hope will advance sequence-fitness prediction accuracy, with implications for protein engineering and variant functional prediction.
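The simulated landscapes mentioned in the abstract come from the NK model, in which each of N positions contributes a fitness value that depends on its own state and the states of K other positions, so that larger K gives a more rugged (more epistatic) landscape. The sketch below is a minimal illustration, not the authors' implementation: the circular-neighbour interaction scheme, the two-letter alphabet, the uniform random contributions, the Hamming-distance split used to mimic interpolation versus extrapolation, and the names `make_nk_landscape` and `hamming` are all assumptions made for the example.

```python
import itertools
import random


def make_nk_landscape(n, k, alphabet="AC", seed=0):
    """Return a fitness function for an NK landscape (illustrative sketch).

    Assumption: position i interacts with the next k positions, wrapping
    around the end of the sequence. Contributions are uniform in [0, 1).
    """
    if not 0 <= k < n:
        raise ValueError("k must satisfy 0 <= k < n")
    rng = random.Random(seed)
    # Neighbourhood of position i: itself plus its k circular successors.
    neighbourhoods = [[(i + j) % n for j in range(k + 1)] for i in range(n)]
    # One lookup table per position, keyed by the states of its neighbourhood.
    tables = [
        {states: rng.random()
         for states in itertools.product(alphabet, repeat=k + 1)}
        for _ in range(n)
    ]

    def fitness(seq):
        if len(seq) != n:
            raise ValueError("sequence length must equal n")
        total = sum(
            tables[i][tuple(seq[p] for p in neighbourhoods[i])]
            for i in range(n)
        )
        return total / n  # mean contribution, so fitness lies in [0, 1)

    return fitness


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


# Hypothetical split: train near a reference sequence, test further away.
fitness = make_nk_landscape(n=6, k=2, seed=1)
reference = "A" * 6
all_seqs = ["".join(s) for s in itertools.product("AC", repeat=6)]
train = [s for s in all_seqs if hamming(s, reference) <= 2]
test = [s for s in all_seqs if hamming(s, reference) > 2]
print(len(train), len(test), fitness(reference))
```

With K = 0 every position contributes independently, so the landscape is purely additive; raising K couples positions and is one simple way to probe how a model copes with increasing ruggedness.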

Authors

Mahakaran Sandhu
Research School of Chemistry, The Australian National University, Canberra, Australian Capital Territory, Australia.

Adam C Mater
ARC Centre of Excellence for Electromaterials Science, Research School of Chemistry, Australian National University, Canberra, Australian Capital Territory 2601, Australia.

Dana S Matthews
Research School of Chemistry, The Australian National University, Canberra, Australian Capital Territory, Australia.

Matthew A Spence
Research School of Chemistry, The Australian National University, Canberra, Australian Capital Territory, Australia.

Artem A Lenskiy
School of Engineering and Technology, The University of New South Wales, Canberra, Australian Capital Territory, Australia.

Colin Jackson
Research School of Chemistry, The Australian National University, Canberra, Australian Capital Territory, Australia.

Keywords

Algorithms; Machine Learning; Protein Folding; Proteins

External Resources

PubMed ID: 40689706
