Data-Efficient Exploration of Enzyme Function Using Family-Specific Machine Learning

Journal: bioRxiv
Published Date:

Abstract

Enzymes are essential biocatalysts across diverse industries, driving demand for high-performing variants. Foundation models are attractive for guiding enzyme discovery, but often lack the resolution to model subtle variations driving function within homologous families. Navigating these rugged functional landscapes to identify elite variants remains challenging and experimentally costly, even when guided by such models. Here we show that coupling dense, family-specific experimental screening with targeted, sequence-based deep learning provides a data-efficient discovery strategy. We experimentally screened 1,513 natural homologues from an esterase superfamily (>7,500 assays) and used this functional landscape to train task-specific models that predict activity, thermostability, and substrate specificity from sequence alone. Prospective experimental validation of previously untested sequences demonstrated that these task-specific models significantly outperformed generalist pre-trained and physics-based models in enriching for target traits. Residue-level attribution further indicated that the models captured sequence patterns consistent with underlying structural features. Finally, retrospective simulations showed that iterative retraining compresses the search space, discovering 60% of top-tier hits using nearly half the samples required by pre-trained baseline models. Together, these results highlight that machine learning can provide mechanistic insight, and that integrating targeted data acquisition with iterative machine learning provides a more data-efficient discovery strategy than relying on generic model scale.
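The retrospective simulation described above (retrain on what has been tested, pick the next batch, count the samples needed to recover a set fraction of top-tier hits) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the k-nearest-neighbour surrogate stands in for their task-specific models, the feature vectors and activities are synthetic, and the names `knn_predict` and `samples_to_recover` as well as the batch size and top-tier cutoff are hypothetical.

```python
import math
import random


def knn_predict(x, labeled, k=3):
    """Predict activity as the mean label of the k nearest labeled points."""
    nearest = sorted(
        labeled,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(x, item[0])),
    )[:k]
    return sum(y for _, y in nearest) / len(nearest)


def samples_to_recover(features, activity, batch_size=10,
                       target_fraction=0.6, top_fraction=0.1, seed=0):
    """Count how many variants are tested before `target_fraction` of the
    top-tier hits (the top `top_fraction` by activity) has been found."""
    rng = random.Random(seed)
    n = len(features)
    n_top = max(1, int(n * top_fraction))
    ranked = sorted(range(n), key=lambda i: activity[i], reverse=True)
    top_hits = set(ranked[:n_top])
    needed = math.ceil(target_fraction * n_top)

    # Round 0: a random batch, since no model exists yet.
    tested = set(rng.sample(range(n), min(batch_size, n)))

    while len(top_hits & tested) < needed and len(tested) < n:
        # "Retrain": the surrogate only sees variants tested so far.
        labeled = [(features[i], activity[i]) for i in tested]
        untested = [i for i in range(n) if i not in tested]
        # Greedy acquisition: test the highest predicted activities next.
        untested.sort(key=lambda i: knn_predict(features[i], labeled),
                      reverse=True)
        tested.update(untested[:batch_size])

    return len(tested)


if __name__ == "__main__":
    # Synthetic one-dimensional "landscape" with a single activity peak.
    pool = [(i / 200.0,) for i in range(200)]
    act = [1.0 - abs(x[0] - 0.7) for x in pool]
    print(samples_to_recover(pool, act))
```

Running the same function with a fixed, non-retrained ranking in place of `knn_predict` would give the baseline against which the reduction in required samples is measured.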

Authors

  • Ahmed, F. H.
  • Bender, A.
  • Wijesinghe, A.
  • Zhu, A.
  • Zhang, L.
  • Gebbie, L.
  • Marsh, A.
  • Ishitate, C.
  • Holdsworth, W.
  • Jones, C.
  • Warden, A. C.
  • Power, H.
  • Ong, C. S.
  • Steinberg, D. M.
  • Speight, R. E.

Categories