Prediction and design of thermostable proteins with a desired melting temperature.

Journal: Scientific Reports
Published Date:

Abstract

The stability of proteins at high temperatures is crucial for their functionality and is measured by their melting temperature (Tm), the temperature at which 50% of the protein has lost its native structure and activity. Existing methods for predicting Tm have two major limitations: first, they are often trained on redundant proteins, and second, they do not allow users to design proteins with a desired Tm. To address these limitations, we developed a regression method for predicting the Tm of proteins using 17,312 non-redundant proteins, where no two proteins share more than 40% sequence similarity. We used 80% of the data for training and testing and the remaining 20% for validation. Initially, we developed machine learning models using standard features derived from protein sequences. Our best model, developed using Shannon entropy for all residues, achieved a Pearson correlation of 0.80 with an R² of 0.63 between the predicted and actual Tm of proteins on the validation dataset. Next, we fine-tuned large language models (e.g., ProtBert, ProtGPT2, ProtT5) on our training dataset and generated embeddings, which were then used to develop machine learning models. Our best model, developed using ProtBert embeddings, achieved a maximum correlation of 0.89 with an R² of 0.80 on the validation dataset. Finally, we developed an ensemble method that combines standard protein features and embeddings. One of the aims of the study is to assist the scientific community in designing proteins with a targeted melting temperature. Our standalone software, PPTstab, can be used to screen thermostable proteins at the genome level, and we demonstrated its application in identifying thermostable proteins in different organisms. A user-friendly web server (https://webs.iiitd.edu.in/raghava/pptstab) and a Python package (https://github.com/raghavagps/pptstab) are available for predicting and designing thermostable proteins.
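
The abstract names a composition-based feature (Shannon entropy for all residues) and two evaluation metrics (Pearson correlation and the coefficient of determination). The sketch below illustrates one plausible reading of each, using only the Python standard library. It is not the PPTstab implementation: the function names are hypothetical, and it assumes the per-residue entropy term for amino acid i is -p_i * log2(p_i), where p_i is the fraction of the sequence made up of that residue.

```python
import math
from collections import Counter

# The 20 standard amino acids (one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def residue_entropy_terms(seq):
    """Return one entropy term per amino acid: -p * log2(p).

    Assumption: p is the fraction of the sequence occupied by that residue.
    Residues absent from the sequence contribute 0.
    """
    seq = seq.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    counts = Counter(seq)
    n = len(seq)
    terms = {}
    for aa in AMINO_ACIDS:
        p = counts.get(aa, 0) / n
        terms[aa] = -p * math.log2(p) if p > 0 else 0.0
    return terms


def shannon_entropy(seq):
    """Sequence-level Shannon entropy in bits (sum of the per-residue terms)."""
    return sum(residue_entropy_terms(seq).values())


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_a = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot
```

In a pipeline of this kind, the 20 values from `residue_entropy_terms` would form a feature vector for a regressor, and `pearson_r` and `r_squared` would be computed between predicted and measured Tm on the held-out 20% of the data. The regressor itself is not specified in the abstract and is therefore omitted here.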

Authors

  • Purva Tijare
    Department of Computational Biology, Indraprastha Institute of Information Technology, Okhla Industrial Estate, Phase III (Near Govind Puri Metro Station), Office: A-302 (R&D Block), New Delhi, 110020, India.
  • Nishant Kumar
    Department of Computational Biology, Indraprastha Institute of Information Technology, Okhla Phase 3, New Delhi 110020, India. Electronic address: nishantk@iiitd.ac.in.
  • Gajendra P S Raghava
    Department of Computational Biology, Indraprastha Institute of Information Technology, New Delhi, India.