PCVR: a pre-trained contextualized visual representation for DNA sequence classification.
Journal:
BMC bioinformatics
PMID:
40346458
Abstract
BACKGROUND: The classification of DNA sequences is pivotal in bioinformatics, essentially for genetic information analysis. Traditional alignment-based tools tend to have slow speed and low recall. Machine learning methods learn implicit patterns from data with encoding techniques such as k-mer counting and ordinal encoding, which fail to handle long sequences or sacrifice structural and sequential information. Frequency chaos game representation (FCGR) converts DNA sequences of arbitrary lengths into fixed-size images, breaking free from the constraints of sequence length while preserving more sequential information than other representations. However, existing works merely consider local information, ignoring long-range dependencies and global contextual information within FCGR image.