Prior knowledge facilitates low homologous protein secondary structure prediction with DSM distillation.

Journal: Bioinformatics (Oxford, England)
Published Date:

Abstract

MOTIVATION: Protein secondary structure prediction (PSSP) is a fundamental and challenging problem in computational biology. Accurate PSSP relies on sufficient homologous protein sequences to build a multiple sequence alignment (MSA). Unfortunately, many proteins lack homologous sequences, which results in low-quality MSAs and poor prediction performance. In this article, we propose DSM-Distil, a novel method built on a dynamic scoring matrix (DSM), to tackle this issue; it takes advantage of a pretrained BERT model and applies knowledge distillation to the newly designed DSM features. Specifically, we propose the DSM to replace the widely used profile and position-specific scoring matrix (PSSM) features. Starting from the original profile, the DSM automatically searches for a suitable feature for each residue. As a result, DSM-Distil not only adapts to low homologous proteins but also remains compatible with high homologous ones. Thanks to this dynamic property, the DSM adapts to the input data much better and achieves higher performance. Moreover, to compensate for low-quality MSAs, we propose to generate a pseudo-DSM from a pretrained BERT model and aggregate it with the original DSM by adaptive residue-wise fusion, which helps to build richer and more complete input features. In addition, we propose to supervise the learning of low-quality DSM features using high-quality ones. To achieve this, a novel teacher-student model is designed to distill knowledge from proteins with high homologous sequences to those with low homologous ones. Combining all the proposed methods, our model achieves new state-of-the-art performance for low homologous proteins.
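
The two mechanisms named in the abstract, adaptive residue-wise fusion and teacher-student distillation, can be illustrated with a minimal sketch. The abstract gives no formulas, so everything below is an assumption made for illustration: the fusion is written as a per-residue convex combination with gate values supplied by the caller (in a trained model they would presumably be learned), and the distillation term is the generic temperature-softened KL divergence rather than the loss actually used by DSM-Distil. The function names fuse_residuewise, softmax and distillation_loss are hypothetical.

```python
import math


def fuse_residuewise(dsm, pseudo_dsm, gates):
    """Blend two L x C matrices residue by residue.

    Assumed form (not taken from the paper): for residue i with gate g_i
    in [0, 1], fused_i = g_i * dsm_i + (1 - g_i) * pseudo_dsm_i.
    """
    fused = []
    for row, pseudo_row, g in zip(dsm, pseudo_dsm, gates):
        fused.append([g * a + (1.0 - g) * b for a, b in zip(row, pseudo_row)])
    return fused


def softmax(scores, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [s / temperature for s in scores]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_scores, teacher_scores, temperature=2.0):
    """Mean per-residue KL(teacher || student) on softened distributions.

    A generic stand-in for the teacher-student objective: the teacher sees
    the high-quality input, the student the low-quality one.
    """
    total = 0.0
    for s_row, t_row in zip(student_scores, teacher_scores):
        p = softmax(t_row, temperature)  # teacher distribution
        q = softmax(s_row, temperature)  # student distribution
        total += sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return total / len(student_scores)


# Toy example: two residues, three channels (a real profile has 20 or more).
dsm = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
pseudo = [[3.0, 0.0, 0.0], [0.0, 0.0, 4.0]]
fused = fuse_residuewise(dsm, pseudo, gates=[0.5, 1.0])
loss = distillation_loss(student_scores=pseudo, teacher_scores=dsm)
```

A gate of 1.0 keeps the MSA-derived row unchanged and a gate of 0.0 replaces it with the pseudo row, which is one way to read "adaptive": residues with poor alignment coverage can lean more heavily on the BERT-derived features.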

Authors

  • Qin Wang
    Department of Pharmacy, Affiliated Hospital of Nantong University, Nantong, China.
  • Jun Wei
    Guangzhou Perception Vision Medical Technology Inc., Guangzhou 510000, China.
  • Yuzhe Zhou
    The Chinese University of Hong Kong (Shenzhen), Shenzhen 51800, China.
  • Mingzhi Lin
    Zelixir Biotech, Shanghai 200030, China.
  • Ruobing Ren
    Shanghai Key Laboratory of Metabolic Remodeling and Health, Institute of Metabolism and Integrative Biology, Fudan University, Shanghai 200000, China.
  • Sheng Wang
    Intensive Care Medical Center, Tongji Hospital, School of Medicine, Tongji University, Shanghai, 200065, People's Republic of China.
  • Shuguang Cui
  • Zhen Li
    PepsiCo R&D, Valhalla, NY, United States.