Porter 6: Protein Secondary Structure Prediction by Leveraging Pre-Trained Language Models (PLMs).

Journal: International journal of molecular sciences
PMID:

Abstract

Accurately predicting protein secondary structure (PSSP) is crucial for understanding protein function, which is foundational to advancements in drug development, disease treatment, and biotechnology. Researchers gain critical insights into protein folding and function within cells by predicting protein secondary structures. The advent of deep learning models, capable of processing complex sequence data and identifying meaningful patterns, offer substantial potential to enhance the accuracy and efficiency of protein structure predictions. In particular, recent breakthroughs in deep learning-driven by the integration of natural language processing (NLP) algorithms-have significantly advanced the field of protein research. Inspired by the remarkable success of NLP techniques, this study harnesses the power of pre-trained language models (PLMs) to advance PSSP prediction. We conduct a comprehensive evaluation of various deep learning models trained on distinct sequence embeddings, including one-hot encoding and PLM-based approaches such as ProtTrans and ESM-2, to develop a cutting-edge prediction system optimized for accuracy and computational efficiency. Our proposed model, Porter 6, is an ensemble of CBRNN-based predictors, leveraging the protein language model ESM-2 as input features. Porter 6 achieves outstanding performance on large-scale, independent test sets. On a 2022 test set, the model attains an impressive 86.60% accuracy in three-state (Q3) and 76.43% in eight-state (Q8) classifications. When tested on a more recent 2024 test set, Porter 6 maintains robust performance, achieving 84.56% in Q3 and 74.18% in Q8 classifications. This represents a significant 3% improvement over its predecessor, outperforming or matching state-of-the-art approaches in the field.

Authors

  • Wafa Alanazi
    School of Computer Science, University College Dublin (UCD), D04 V1W8 Dublin, Ireland.
  • Di Meng
    School of Computer Science and Technology, AnHui University of Technology, Maxiang Street, Maanshan, 243032, Anhui, China.
  • Gianluca Pollastri