TPGPred: A Mixed-Feature-Driven Approach for Identifying Thermophilic Proteins Based on GradientBoosting.

Journal: International Journal of Molecular Sciences
Published Date:

Abstract

Thermophilic proteins remain stable and functional at extremely high temperatures, which makes them important in both fundamental biological research and biotechnological applications. In this study, we developed TPGPred, a GradientBoosting-based machine learning model that identifies thermophilic proteins and was built on a large-scale dataset of thermophilic and non-thermophilic protein sequences. By combining various machine learning algorithms with feature-engineering methods, we systematically evaluated classification performance and identified the optimal feature combination and classification model. Trained on a large public dataset of 5652 samples, TPGPred achieved an accuracy greater than 0.95 and an area under the receiver operating characteristic curve (AUROC) greater than 0.98 on an independent test set of 627 samples. Our findings offer new insights into the identification and classification of thermophilic proteins and provide a solid foundation for developing their industrial applications.
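
To make the two reported metrics and the idea of sequence-derived features concrete, the following is a minimal Python sketch and not the TPGPred implementation. The abstract does not list the features used, so amino acid composition (AAC) is assumed here as one common sequence descriptor, and the function names `aac_features`, `accuracy`, and `auroc` are hypothetical. The toy labels and scores are illustrative only and are not results from the paper.

```python
from collections import Counter

# The 20 standard amino acid residues (assumed alphabet for this sketch).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aac_features(sequence):
    """Return the 20-dimensional amino acid composition of a sequence.

    Hypothetical helper: AAC is an assumed example descriptor, not
    necessarily one of the features used by TPGPred.
    """
    counts = Counter(ch for ch in sequence.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def auroc(y_true, scores):
    """AUROC as the probability that a random positive example receives
    a higher score than a random negative one (ties count as 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data: 1 = thermophilic, 0 = non-thermophilic (illustrative only).
y_true = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

toy_accuracy = accuracy(y_true, y_pred)  # 0.5 on this toy data
toy_auroc = auroc(y_true, scores)        # 0.75 on this toy data
```

In practice, feature vectors such as those from `aac_features` could be passed to a gradient boosting classifier (for example scikit-learn's `GradientBoostingClassifier`), with the predicted probabilities scored on a held-out test set using the two metrics above.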

Authors

  • Cuihuan Zhao
    Center for Synthetic and Systems Biology, School of Life Sciences, Tsinghua University, Beijing 100084, China.
  • Shuan Yan
    Institute of Public Safety Research, Department of Engineering Physics, Tsinghua University, Beijing 100084, China.
  • Jiahang Li
    School of Mathematical Sciences, Nankai University, Tianjin 300071, China.