PCVR: a pre-trained contextualized visual representation for DNA sequence classification.

Journal: BMC bioinformatics
PMID:

Abstract

BACKGROUND: The classification of DNA sequences is pivotal in bioinformatics and essential for genetic information analysis. Traditional alignment-based tools tend to be slow and to have low recall. Machine learning methods learn implicit patterns from data using encoding techniques such as k-mer counting and ordinal encoding, which either fail to handle long sequences or sacrifice structural and sequential information. Frequency chaos game representation (FCGR) converts DNA sequences of arbitrary length into fixed-size images, removing the sequence-length constraint while preserving more sequential information than other representations. However, existing works consider only local information, ignoring long-range dependencies and global contextual information within the FCGR image.
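
To make the FCGR idea concrete, the following is a minimal Python sketch of how a sequence of any length can be mapped to a fixed-size 2^k x 2^k frequency matrix. It is an illustration under stated assumptions, not the authors' implementation: the function name `fcgr`, the corner assignment in `CORNERS`, the default `k`, and the choice to skip k-mers containing ambiguous bases are all hypothetical, and conventions differ between tools.

```python
# Assumed corner assignment for the four nucleotides (row bit, column bit).
# Different FCGR tools use different layouts; this one is only illustrative.
CORNERS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}


def fcgr(sequence, k=3):
    """Return a 2^k x 2^k matrix of relative k-mer frequencies."""
    size = 2 ** k
    counts = [[0] * size for _ in range(size)]
    seq = sequence.upper()
    total = 0

    for start in range(len(seq) - k + 1):
        kmer = seq[start:start + k]
        if any(base not in CORNERS for base in kmer):
            continue  # assumption: skip k-mers with ambiguous bases such as N

        # In the chaos game, the last base selects the coarsest quadrant,
        # so the k-mer is read in reverse when building the cell index.
        row = col = 0
        for base in reversed(kmer):
            r, c = CORNERS[base]
            row = (row << 1) | r
            col = (col << 1) | c

        counts[row][col] += 1
        total += 1

    if total == 0:
        return [[0.0] * size for _ in range(size)]
    # Normalize counts so the matrix sums to 1 regardless of sequence length.
    return [[value / total for value in line] for line in counts]


short_image = fcgr("ACGTACGT", k=3)
long_image = fcgr("ACGTTGCAAGCT" * 500, k=3)
# Both matrices are 8 x 8 even though the inputs differ greatly in length.
```

The output size depends only on `k`, which is what lets an image model consume sequences of arbitrary length. Each cell holds the frequency of one k-mer, so neighbouring cells share suffixes, and patterns that span distant cells are the kind of global context the abstract says earlier FCGR-based methods ignore.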

Authors

  • Jiarui Zhou
    School of Artificial Intelligence and Data Science, University of Science and Technology of China, Hefei, 230026, Anhui Province, China.
  • Hui Wu
    China Medical University College of Health Management, Shenyang 110122, Liaoning Province, China.
  • Kang Du
    Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei, 230026, Anhui Province, China.
  • Wengang Zhou
    Department of Electronic Engineering and Information Science, University of Science and Technology of China, Hefei, 230026, Anhui Province, China. zhwg@ustc.edu.cn.
  • Cong-Zhao Zhou
    Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei, 230026, Anhui Province, China.
  • Houqiang Li
    Department of Electronic Engineering and Information Science, University of Science and Technology of China, Hefei, 230026, Anhui Province, China.