Improving protein domain classification for third-generation sequencing reads using deep learning.

Journal: BMC Genomics
Published Date:

Abstract

BACKGROUND: With the development of third-generation sequencing (TGS) technologies, researchers can now obtain DNA reads ranging from tens to hundreds of kilobases in length. These long reads allow protein domains to be annotated without assembly and can therefore yield important insights into the biological functions encoded in the underlying data. However, the high error rate of TGS data poses a new challenge for established domain analysis pipelines. State-of-the-art methods are not optimized for noisy reads and classify domains in TGS data with unsatisfactory accuracy. New computational methods are therefore needed to improve domain prediction in long, noisy reads.

Authors

  • Nan Du
    Tencent Medical AI Lab, Palo Alto, CA, USA.
  • Jiayu Shang
    Electrical Engineering Dept., City University of Hong Kong, Kowloon, Hong Kong Special Administrative Region.
  • Yanni Sun
    Department of Computer Science and Engineering, Michigan State University, East Lansing, MI 48824, USA.