Identifying the missing proteins in human proteome by biological language model.

Journal: BMC systems biology
Published Date:

Abstract

BACKGROUND: With the rapid development of high-throughput sequencing technology, proteomics has become an active research field in the post-genomic era. Identifying all natively encoded protein sequences is necessary for further function and pathway analysis. Toward that end, the Human Proteome Organization launched the Human Proteome Project in 2011. However, many proteins are hard to detect by experimental methods, which has become one of the bottlenecks of the Human Proteome Project. Given the difficulty of detecting these missing proteins with wet-lab experiments, here we use a bioinformatics method to pre-filter the missing proteins.

Authors

  • Qiwen Dong
    Institute for Data Science and Engineering, East China Normal University, Shanghai 200062, People's Republic of China.
  • Kai Wang
    Department of Rheumatology, The Affiliated Huai'an No. 1 People's Hospital of Nanjing Medical University, Huai'an, Jiangsu, China.
  • Xuan Liu
    Department of Electrical and Computer Engineering, New Jersey Institute of Technology, University Heights, Newark, NJ 07102, USA.