Machine Learning on the Impacts of Mutations in the SARS-CoV-2 Spike RBD on Binding Affinity to Human ACE2 Based on Deep Mutational Scanning Data.

Journal: Biochemistry
Published Date:

Abstract

Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) continues to accumulate mutations in the spike receptor-binding domain (RBD) region, leading to the emergence of new variants that potentially change the binding affinity for the human angiotensin-converting enzyme 2 (hACE2) receptor. Deep mutational scanning (DMS) is a powerful biochemical experimental technique that can characterize the impact of mutations on protein sequence-function relationships, allowing for rapid assessment of new mutations. Herein, machine learning (ML) models were built using the SARS-CoV-2 DMS data set, with input features derived from the Rosetta-computed decomposition energy terms. To improve the performance of this physics-based model, we further incorporated local environment information (the number of residue pair-specific contacts within shells at different distances) as input features. Alternatively, a convolutional neural network (CNN) model based on amino acid sequence information together with the residues' physicochemical and biochemical properties was also employed, yielding predictions that achieved good agreement with the experimental data. In addition, compared to three popular protein language models, the dual-encoding CNN model demonstrated consistently superior performance on the SARS-CoV-2 DMS data set and seven additional DMS data sets for different biological properties. Furthermore, a transfer-learning strategy was applied to fine-tune the CNN model using recently reported DMS data sets for the Alpha, Delta, and Omicron BA.1, BA.2, and XBB.1.5 variants, enabling the development of variant-specific prediction models. These ML models trained on DMS data sets can not only identify the effects of single-point mutations in mutagenesis data sets but may also help predict the effects of multiple-point mutations and provide valuable information for ongoing viral surveillance efforts. Moreover, this dual-encoding CNN model, which does not use 3D geometric information, has the potential to be a robust alternative ML model for other DMS studies.
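
The dual-encoding idea described above (sequence identity plus residue properties) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the choice of the Kyte-Doolittle hydropathy index as the single example property, the min-max scaling, and the per-residue concatenation are all assumptions made for illustration. The abstract does not state which properties were used or how the two encodings are combined.

```python
# Illustrative sketch only; all names and design choices here are assumptions.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy index, used as one example physicochemical property.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def one_hot(seq):
    """Encode each residue as a 20-dimensional indicator vector."""
    return [[1.0 if aa == ref else 0.0 for ref in AMINO_ACIDS] for aa in seq]


def property_encoding(seq):
    """Encode each residue by its hydropathy, min-max scaled to [0, 1]."""
    lo, hi = min(HYDROPATHY.values()), max(HYDROPATHY.values())
    return [[(HYDROPATHY[aa] - lo) / (hi - lo)] for aa in seq]


def dual_encode(seq):
    """Concatenate the two encodings per residue: shape (len(seq), 21)."""
    return [oh + prop for oh, prop in zip(one_hot(seq), property_encoding(seq))]


def mutate(seq, mutation, offset=1):
    """Apply a point mutation written as <wild type><position><mutant>.

    `offset` is the residue number of the first character of `seq`.
    """
    wild_type, position, mutant = mutation[0], int(mutation[1:-1]), mutation[-1]
    idx = position - offset
    if seq[idx] != wild_type:
        raise ValueError(f"expected {wild_type} at position {position}, found {seq[idx]}")
    return seq[:idx] + mutant + seq[idx + 1:]


# Toy fragment with hypothetical numbering, not a real RBD sequence.
variant = mutate("ACDN", "N4Y")
features = dual_encode(variant)
print(variant, len(features), len(features[0]))
```

A matrix of this kind, with one row per residue, is the sort of input a 1D convolution could scan along the sequence axis. Multiple-point mutants can be built by applying `mutate` repeatedly before encoding.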

Authors

  • Hui Xia
    Key Laboratory of Environmental Medicine and Engineering of Ministry of Education, School of Public Health, Southeast University, Nanjing 210009, China.
  • Dacong Wei
    Shenzhen Grubbs Institute, Department of Chemistry and Guangdong Provincial Key Laboratory of Catalysis, Southern University of Science and Technology, Shenzhen, 518055, China.
  • Zhihong Guo
    State Key Laboratory of Soil & Sustainable Agriculture, Institute of Soil Science, Chinese Academy of Sciences, Nanjing 211135, China; University of Chinese Academy of Sciences, Beijing 100049, China.
  • Lung Wa Chung
    Shenzhen Grubbs Institute, Department of Chemistry and Guangdong Provincial Key Laboratory of Catalysis, Southern University of Science and Technology, Shenzhen, 518055, China. oscarchung@sustech.edu.cn.