AdventML: Advanced Enzyme Temperature Prediction with Transformer-Based Embeddings and Resampling Strategies
Journal: bioRxiv
Published Date: Jun 3, 2026
Abstract
Accurate prediction of an enzyme's optimal catalytic temperature (Topt) is crucial in biotechnology, because enzymes with extreme Topt values are highly desirable for reactions at extreme temperatures and for their general stability. However, experimental determination of Topt is costly, labor-intensive, and time-consuming. Meanwhile, existing computational methods suffer from small and imbalanced datasets, suboptimal predictions at extreme temperatures, and insufficient validation. In this study, we address these challenges by expanding the Topt dataset and validating on an independent test set constructed on the basis of sequence similarity. We also compare multiple resampling techniques to improve predictions at the extremes, and we evaluate diverse protein representations and multiple machine learning architectures. Overall, the best-performing models reached R² ≈ 0.64 with a mean absolute error (MAE) of approximately 7-8 °C, while resampling targeted at the extremes improved tail performance, reducing tail MAE by up to approximately 1.8 °C. Notably, our models outperform existing state-of-the-art prediction models. We also demonstrate that accurate prediction of Topt is achievable even in the absence of organism growth temperature (OGT). Our Topt prediction models are freely available as AdventML on GitHub.
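To make the reported metrics concrete, the following is a minimal sketch of how R², MAE, and a tail MAE could be computed for Topt predictions. It is not taken from AdventML: the function names, the tail thresholds (`low=30.0`, `high=70.0`), and the toy values are all assumptions chosen for illustration, and the abstract does not state how the tails are defined.

```python
from statistics import mean


def mae(y_true, y_pred):
    """Mean absolute error over all samples (same units as Topt, here degrees C)."""
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def tail_mae(y_true, y_pred, low=30.0, high=70.0):
    """MAE restricted to samples whose true Topt lies in either tail.

    The thresholds are hypothetical; they only illustrate the idea of
    scoring the extremes separately from the bulk of the distribution.
    """
    tail_errors = [abs(t - p) for t, p in zip(y_true, y_pred)
                   if t <= low or t >= high]
    if not tail_errors:
        raise ValueError("no samples fall in the tails for these thresholds")
    return mean(tail_errors)


# Toy values for illustration only; these are not results from the study.
y_true = [20.0, 37.0, 50.0, 65.0, 90.0]
y_pred = [28.0, 40.0, 48.0, 60.0, 78.0]

overall_mae = mae(y_true, y_pred)
overall_r2 = r2(y_true, y_pred)
extreme_mae = tail_mae(y_true, y_pred)

print(f"MAE = {overall_mae:.2f}, R2 = {overall_r2:.3f}, tail MAE = {extreme_mae:.2f}")
```

In this toy case the tail MAE is larger than the overall MAE, which is the kind of gap that resampling toward the extremes is meant to narrow.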