Topological Sequence Analysis of Genomes: Category Approaches

Journal: arXiv
Published Date:

Abstract

Sequence data, such as DNA, RNA, and protein sequences, exhibit intricate, multi-scale structures that pose significant challenges for conventional analysis methods, particularly those relying on alignment or purely statistical representations. In this work, we introduce category-based topological sequence analysis (CTSA ) of genomes. CTSA models a sequence as a resolution category, capturing its hierarchical structure through a categorical construction. Substructure complexes are then derived from this categorical representation, and their persistent homology is computed to extract multi-scale topological features. Our models depart from traditional alignment-free approaches by incorporating structured mathematical formalisms rooted in sequence topology. The resulting topological signatures provide informative representations across a variety of tasks, including the phylogenetic analysis of SARS-CoV-2 variants and the prediction of protein-nucleic acid binding affinities. Comparative studies were carried out against six state-of-the-art methods. Experimental results demonstrate that CTSA achieves excellent and consistent performance in these tasks, suggesting its general applicability and robustness. Beyond sequence analysis, the proposed framework opens new directions for the integration of categorical and homological theories for biological sequence analysis.

Authors

  • Jian Liu
  • Li Shen
  • Mushal Zia
  • Guo-Wei Wei