Investigating the determinants of performance in machine learning for protein fitness prediction.
Journal:
Protein Science: a publication of the Protein Society
Published Date:
Aug 1, 2025
Abstract
Machine learning (ML) has revolutionized protein biology, solving long-standing problems in protein folding, scaffold generation, and function design tasks. A range of architectures have shown success on supervised protein fitness prediction tasks. Nevertheless, in the absence of rational approaches for evaluating which architectures are optimal for specific datasets and engineering tasks, architecture choice remains challenging. Here, we propose a framework for investigating the determinants of success for a range of ML architectures. Using simulated (the NK model) and empirical fitness landscapes, we measure sequence-fitness prediction along six key performance metrics: interpolation within the training domain, extrapolation outside the training domain, robustness to increasing epistasis/ruggedness, ability to perform positional extrapolation, robustness to sparse training data, and sensitivity to sequence length. We show that architectural differences between algorithms consistently affect performance against these metrics across both experimental and theoretical landscapes. Moreover, landscape ruggedness emerges as a primary determinant of the accuracy of sequence-fitness prediction. Our methodology and results provide a rational strategy for experimental data sampling, model selection, and evaluation rooted in fitness landscape theory, one that we hope will advance sequence-fitness prediction accuracy, with implications for protein engineering and variant functional prediction.
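As an illustration of the simulated landscapes mentioned in the abstract, the following is a minimal sketch of a textbook NK model, not the authors' implementation. It assumes a binary alphabet, randomly chosen interaction partners, and site contributions drawn uniformly from [0, 1); the names `make_nk_landscape` and `count_local_optima` are hypothetical. Raising K increases epistasis, which typically shows up as more local optima, one common proxy for ruggedness.

```python
import itertools
import random


def make_nk_landscape(n, k, alphabet="AB", seed=0):
    """Build a random NK landscape and return its fitness function.

    n: sequence length; k: number of other sites each site interacts with.
    Illustrative assumptions: random neighbourhoods, uniform contributions.
    """
    if not 0 <= k < n:
        raise ValueError("k must satisfy 0 <= k < n")
    rng = random.Random(seed)

    # Each site i interacts with k randomly chosen other sites.
    neighbours = [
        sorted(rng.sample([j for j in range(n) if j != i], k))
        for i in range(n)
    ]

    # One lookup table per site: every combination of the site's own state
    # and its neighbours' states gets an independent random contribution.
    tables = [
        {combo: rng.random() for combo in itertools.product(alphabet, repeat=k + 1)}
        for _ in range(n)
    ]

    def fitness(seq):
        if len(seq) != n:
            raise ValueError("sequence length must equal n")
        total = 0.0
        for i in range(n):
            key = (seq[i],) + tuple(seq[j] for j in neighbours[i])
            total += tables[i][key]
        return total / n  # mean contribution, so fitness lies in [0, 1)

    return fitness


def count_local_optima(fitness, n, alphabet="AB"):
    """Count sequences that no single-site substitution can improve."""
    count = 0
    for seq in itertools.product(alphabet, repeat=n):
        f = fitness(seq)
        improved = False
        for i in range(n):
            for a in alphabet:
                if a != seq[i] and fitness(seq[:i] + (a,) + seq[i + 1:]) > f:
                    improved = True
                    break
            if improved:
                break
        if not improved:
            count += 1
    return count


if __name__ == "__main__":
    n = 8
    for k in (0, 2, 7):
        landscape = make_nk_landscape(n, k, seed=1)
        print(f"K={k}: local optima = {count_local_optima(landscape, n)}")
```

With K = 0 the landscape is purely additive and has a single peak; larger K values couple the sites and generally fragment the landscape into many peaks. The exhaustive enumeration here is only practical for short sequences and small alphabets.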