Multiple sequence alignment-based RNA language model and its application to structural inference.

Journal: Nucleic acids research
Published Date:

Abstract

Compared with proteins, DNA and RNA are more difficult languages to interpret because four-letter coded DNA/RNA sequences have less information content than 20-letter coded protein sequences. While BERT (Bidirectional Encoder Representations from Transformers)-like language models have been developed for RNA, they are ineffective at capturing the evolutionary information from homologous sequences because, unlike protein sequences, RNA sequences are less conserved. Here, we have developed an unsupervised multiple sequence alignment-based RNA language model (RNA-MSM) by utilizing homologous sequences from an automatic pipeline, RNAcmap, as it can provide significantly more homologous sequences than manually annotated Rfam. We demonstrate that the resulting unsupervised, two-dimensional attention maps and one-dimensional embeddings from RNA-MSM contain structural information. In fact, they can be directly mapped with high accuracy to 2D base pairing probabilities and 1D solvent accessibilities, respectively. Further fine-tuning led to significantly improved performance on these two downstream tasks compared with existing state-of-the-art techniques including SPOT-RNA2 and RNAsnap2. By comparison, the embeddings from RNA-FM, a BERT-based RNA language model, perform worse than one-hot encoding in base pair and solvent-accessible surface area prediction. We anticipate that the pre-trained RNA-MSM model can be fine-tuned on many other tasks related to RNA structure and function.
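
As an illustration of what "directly mapping" attention maps to base pairing probabilities could look like, the sketch below symmetrizes per-head attention maps, applies an average product correction, and combines the heads with a logistic regression. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the use of average product correction, the array shapes, and the random weights are all hypothetical, and in practice the weights would be fitted on RNAs with known structures.

```python
import numpy as np

def symmetrize(attn):
    # Base pairing is symmetric, so add each map to its transpose.
    return attn + np.swapaxes(attn, -1, -2)

def average_product_correction(x):
    # Subtract the background expected from row and column sums (assumed step).
    row = x.sum(axis=-1, keepdims=True)
    col = x.sum(axis=-2, keepdims=True)
    total = x.sum(axis=(-1, -2), keepdims=True)
    return x - row * col / total

def attention_to_pair_probs(attn, weights, bias):
    # attn: (heads, L, L) attention maps for one sequence (hypothetical input).
    # weights: (heads,) logistic regression coefficients; bias: scalar.
    feats = average_product_correction(symmetrize(attn))
    logits = np.tensordot(weights, feats, axes=([0], [0])) + bias
    return 1.0 / (1.0 + np.exp(-logits))  # (L, L) pairing probabilities

# Toy usage with random stand-ins for real attention maps and fitted weights.
rng = np.random.default_rng(0)
attn = rng.random((4, 6, 6))
attn /= attn.sum(axis=-1, keepdims=True)  # rows sum to 1, like softmax output
weights = rng.normal(size=4)
probs = attention_to_pair_probs(attn, weights, bias=-1.0)
```

The output is an L x L symmetric matrix with entries between 0 and 1, which can be thresholded to obtain a predicted set of base pairs.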

Authors

  • Yikun Zhang
    Laboratory of Image Science and Technology, Southeast University, Nanjing, Jiangsu, China.
  • Mei Lang
    Institute of Systems and Physical Biology, Shenzhen Bay Laboratory, Shenzhen, Guangdong, 518106, China.
  • Jiuhong Jiang
    Institute of Systems and Physical Biology, Shenzhen Bay Laboratory, Shenzhen 518107, China.
  • Zhiqiang Gao
    Beijing Entry-Exit Inspection and Quarantine Bureau, Beijing 100026, China.
  • Fan Xu
    Department of Public Health, Chengdu Medical College, Sichuan, China.
  • Thomas Litfin
    School of Information and Communication Technology, Griffith University, Gold Coast 4222, Australia.
  • Ke Chen
    Department of Signal Processing, Tampere University of Technology, Finland.
  • Jaswinder Singh
    Signal Processing Laboratory, Griffith University, Brisbane, QLD 4122, Australia.
  • Xiansong Huang
    Peng Cheng Laboratory, Shenzhen 518066, China.
  • Guoli Song
    State Key Laboratory of Robotics, Shenyang Institute of Automation, Chinese Academy of Sciences, Shenyang, 110016, China; Institutes for Robotics and Intelligent Manufacturing, Chinese Academy of Sciences, Shenyang, 110016, China; Liaoning Medical Surgery and Rehabilitation Robot Engineering Research Center, Shenyang, CO, 110134, China. Electronic address: songgl@sia.cn.
  • Yonghong Tian
    National Engineering Laboratory for Video Technology, School of Electronics Engineering and Computer Science, Peking University, Beijing, China; Peng Cheng Laboratory, Shenzhen, China.
  • Jian Zhan
    School of Information and Communication Technology and Institute for Glycomics, Griffith University, Parklands Drive, Southport, Queensland, 4215, Australia.
  • Jie Chen
    School of Basic Medical Sciences, Health Science Center, Ningbo University, Ningbo, China.
  • Yaoqi Zhou
    Institute of Systems and Physical Biology, Shenzhen Bay Laboratory, Shenzhen, Guangdong, 518106, China. Electronic address: zhouyq@szbl.ac.cn.