Overcoming systematic data biases enables accurate prediction of enzyme kcat fold-changes for computational protein design

Journal: bioRxiv
Published Date:

Abstract

Machine learning is increasingly used to guide protein engineering by predicting how mutations affect desired properties. Recent models for the turnover number (kcat) of enzymes report high accuracy, suggesting that mutation effects can be inferred directly from protein sequence. However, these approaches are typically evaluated on heterogeneous datasets of enzyme variants, where closely related sequences and systematic reporting patterns may confound model performance. A central challenge is therefore to determine whether current models truly capture mutation-specific effects or instead exploit statistical regularities in the data. Here we show that much of the reported accuracy in mutant kcat prediction arises from two pervasive biases: variants of the same enzyme occupy a narrow activity range, and mutations within a group often share a common direction of change. Simple baselines that exploit these biases match or exceed the performance of existing models, indicating that high apparent accuracy does not imply mechanistic understanding. To address this limitation, we introduce a bias-aware framework that reformulates prediction as a pairwise fold-change task and evaluates performance on unseen mutant-mutant pairs, thereby isolating mutation-specific signal. A proof-of-principle implementation explains approximately one-third of the variance under these conditions and outperforms existing models on leakage-controlled benchmarks. More broadly, this work establishes a general strategy for evaluating and modeling mutation effects in biochemical datasets, with implications for protein engineering and related fields.
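The two ideas in the abstract, a baseline that exploits the narrow within-enzyme activity range and a reformulation of the target as a pairwise fold-change between mutants, can be illustrated with a minimal sketch. This is not the authors' implementation: the toy records, the `wt` label for wild type, the choice of log10 scale, and all function names are assumptions made purely for illustration, and the kcat values are invented.

```python
import math
from itertools import combinations
from statistics import mean

# Hypothetical toy data: (enzyme_id, variant_label, kcat in 1/s).
# All values are invented for illustration only.
RECORDS = [
    ("enzA", "wt", 100.0),
    ("enzA", "m1", 50.0),
    ("enzA", "m2", 25.0),
    ("enzB", "wt", 0.1),
    ("enzB", "m1", 0.05),
    ("enzB", "m2", 0.2),
]


def log_kcat_by_enzyme(records):
    """Group variants by enzyme and store log10(kcat) per variant."""
    groups = {}
    for enzyme, variant, kcat in records:
        groups.setdefault(enzyme, {})[variant] = math.log10(kcat)
    return groups


def leave_one_out_group_mean(groups):
    """Baseline: predict each variant as the mean log10(kcat) of the
    other variants of the same enzyme. No mutation information is used."""
    y_true, y_pred = [], []
    for variants in groups.values():
        for name, value in variants.items():
            others = [v for n, v in variants.items() if n != name]
            if others:  # skip enzymes with a single variant
                y_true.append(value)
                y_pred.append(mean(others))
    return y_true, y_pred


def r_squared(y_true, y_pred):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def mutant_pair_fold_changes(groups, wild_type_label="wt"):
    """Build (enzyme, mutant_a, mutant_b, log10(kcat_b / kcat_a)) targets
    for every mutant-mutant pair within the same enzyme."""
    pairs = []
    for enzyme, variants in groups.items():
        mutants = sorted(n for n in variants if n != wild_type_label)
        for a, b in combinations(mutants, 2):
            pairs.append((enzyme, a, b, variants[b] - variants[a]))
    return pairs


groups = log_kcat_by_enzyme(RECORDS)
baseline_r2 = r_squared(*leave_one_out_group_mean(groups))
pairs = mutant_pair_fold_changes(groups)
```

In this toy setting the group-mean baseline scores a high R² on absolute log kcat even though it ignores the mutations entirely, because the variance between enzymes dominates the variance within them. The pairwise targets in `pairs` cancel the enzyme-level offset, so a model evaluated on them has to explain the within-enzyme differences.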

Authors

  • Rousset, Y.
  • Kroll, A.
  • Lercher, M.

Categories