Machine Learning on the Impacts of Mutations in the SARS-CoV-2 Spike RBD on Binding Affinity to Human ACE2 Based on Deep Mutational Scanning Data.
Journal: Biochemistry
Published Date: Aug 14, 2025
Abstract
Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) continues to accumulate mutations in the spike receptor-binding domain (RBD), leading to the emergence of new variants that may change the binding affinity for the human angiotensin-converting enzyme 2 (hACE2) receptor. Deep mutational scanning (DMS) is a powerful biochemical experimental technique that can characterize the impact of mutations on protein sequence-function relationships, allowing for rapid assessment of new mutations. Herein, machine learning (ML) models were built using the SARS-CoV-2 DMS data set, with input features derived from Rosetta-computed decomposition energy terms. To improve the performance of this physics-based model, we further incorporated local environment information (the number of residue pair-specific contacts within shells at different distances) as input features. Alternatively, a convolutional neural network (CNN) model based on amino acid sequence information together with the physicochemical and biochemical properties of the amino acids was employed, yielding predictions in good agreement with the experimental data. In addition, compared to three popular protein language models, this dual-encoding CNN model demonstrated consistently superior performance on the SARS-CoV-2 DMS data set and on seven additional DMS data sets for different biological properties. Furthermore, a transfer-learning strategy was applied to fine-tune the CNN model using recently reported DMS data sets for the Alpha, Delta, and Omicron BA.1, BA.2, and XBB.1.5 variants, enabling the development of variant-specific prediction models. These ML models trained on DMS data sets can not only identify the effects of single-point mutations in mutagenesis data sets but also help predict the effects of multiple-point mutations, providing valuable information for ongoing viral surveillance efforts. Moreover, this dual-encoding CNN model, which does not include 3D geometric information, has the potential to be a robust alternative ML model for other DMS studies.
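The local environment features mentioned in the abstract (counts of residue pair-specific contacts within distance shells) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `shell_contact_counts`, the shell boundaries (0-4, 4-8, 8-12 Å), the use of one representative coordinate per residue, and the toy coordinates are all assumptions chosen for illustration.

```python
from collections import Counter
from itertools import combinations
import math


def shell_contact_counts(residues, shell_edges=(0.0, 4.0, 8.0, 12.0)):
    """Count residue-pair contacts per distance shell.

    residues: list of (one_letter_code, (x, y, z)) tuples, one representative
              coordinate per residue (an assumption of this sketch).
    shell_edges: hypothetical shell boundaries in angstroms; shell k covers
                 the half-open interval [shell_edges[k], shell_edges[k + 1]).
    Returns a Counter keyed by (pair, shell_index), where pair is the two
    one-letter codes in alphabetical order so that "KE" and "EK" coincide.
    """
    counts = Counter()
    for (aa_i, xyz_i), (aa_j, xyz_j) in combinations(residues, 2):
        d = math.dist(xyz_i, xyz_j)
        for k in range(len(shell_edges) - 1):
            if shell_edges[k] <= d < shell_edges[k + 1]:
                pair = "".join(sorted((aa_i, aa_j)))
                counts[(pair, k)] += 1
                break  # each residue pair falls into at most one shell
    return counts


# Toy coordinates, invented for illustration only (not from any structure).
toy = [
    ("K", (0.0, 0.0, 0.0)),
    ("E", (3.0, 0.0, 0.0)),
    ("Y", (0.0, 6.0, 0.0)),
    ("E", (0.0, 0.0, 10.0)),
]
features = shell_contact_counts(toy)
for key in sorted(features):
    print(key, features[key])
```

In a real pipeline the resulting counts would be flattened into a fixed-length vector (one entry per residue pair type and shell) and concatenated with the energy-term features; how the paper orders or normalizes these entries is not stated in the abstract.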