Topological Sequence Analysis of Genomes: Category Approaches
Journal: arXiv
Published Date: Jul 9, 2025
Abstract
Sequence data, such as DNA, RNA, and protein sequences, exhibit intricate,
multi-scale structures that pose significant challenges for conventional
analysis methods, particularly those relying on alignment or purely statistical
representations. In this work, we introduce category-based topological sequence
analysis (CTSA) of genomes. CTSA models a sequence as a resolution category,
capturing its hierarchical structure through a categorical construction.
Substructure complexes are then derived from this categorical representation,
and their persistent homology is computed to extract multi-scale topological
features. Our models depart from traditional alignment-free approaches by
incorporating structured mathematical formalisms rooted in sequence topology.
The resulting topological signatures provide informative representations across
a variety of tasks, including the phylogenetic analysis of SARS-CoV-2 variants
and the prediction of protein-nucleic acid binding affinities. Comparative
studies were carried out against six state-of-the-art methods. Experimental
results demonstrate that CTSA achieves excellent and consistent performance in
these tasks, suggesting its general applicability and robustness. More broadly,
the proposed framework opens new directions for integrating categorical and
homological theories into biological sequence analysis.
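
To make the idea of multi-scale topological features of a sequence concrete, the following is a minimal illustrative sketch. It is not the CTSA construction: the abstract does not specify the resolution category or the substructure complexes, so the sketch substitutes a much simpler stand-in. It assumes each k-mer is represented by the set of positions where it occurs, treats those positions as points on a line, and computes the 0-dimensional persistent homology of the Vietoris-Rips filtration on them. For points on a line, every connected component is born at scale 0 and components merge exactly at the gaps between consecutive positions, so the finite bars are the sorted gaps. The function names `kmer_positions`, `h0_barcode`, and `topological_signature` are hypothetical and chosen only for this example.

```python
from collections import defaultdict


def kmer_positions(seq, k=1):
    """Map each k-mer to the sorted list of start positions where it occurs."""
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)
    return dict(positions)


def h0_barcode(points):
    """0-dimensional persistence barcode of points on a line.

    Assumption for this sketch: a Vietoris-Rips filtration indexed by
    distance. All components are born at 0; two neighbouring components
    merge when the scale reaches the gap between them, so the death
    times are the gaps between consecutive sorted points. One component
    never dies and gets an infinite bar.
    """
    pts = sorted(points)
    deaths = sorted(b - a for a, b in zip(pts, pts[1:]))
    bars = [(0, d) for d in deaths]
    if pts:
        bars.append((0, float("inf")))
    return bars


def topological_signature(seq, k=1):
    """Barcode per k-mer: a simple alignment-free signature of the sequence."""
    return {kmer: h0_barcode(pos) for kmer, pos in kmer_positions(seq, k).items()}


if __name__ == "__main__":
    for kmer, bars in sorted(topological_signature("ACGTAACA").items()):
        print(kmer, bars)
```

Under these assumptions, closely clustered occurrences of a k-mer produce short bars and widely separated occurrences produce long ones, which is the sense in which the barcode records structure at several scales at once. The method described in the abstract derives its complexes from a categorical representation instead of raw positions, and higher-dimensional homology would require a persistent homology library.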