Bridging Chaos Game Representations and $k$-mer Frequencies of DNA Sequences
Journal: arXiv
Published Date: Jun 27, 2025
Abstract
This paper establishes formal mathematical foundations linking Chaos Game
Representations (CGR) of DNA sequences to their underlying $k$-mer frequencies.
We prove that the Frequency CGR (FCGR) of order $k$ is equivalent to a
discretization of CGR at resolution $2^k \times 2^k$, and that its
vectorization corresponds to the $k$-mer frequencies of the sequence.
Additionally, we characterize how symmetry transformations of CGR images
correspond to specific nucleotide permutations in the originating sequences.
Leveraging these insights, we introduce an algorithm that generates synthetic
DNA sequences from prescribed $k$-mer distributions by constructing Eulerian
paths on De Bruijn multigraphs. This enables reconstruction of sequences
matching target $k$-mer profiles with arbitrarily high precision, facilitating
the creation of synthetic CGR images for applications such as data augmentation
for machine learning-based taxonomic classification of DNA sequences. Numerical
experiments validate the effectiveness of our method across both real genomic
data and artificially sampled distributions. To our knowledge, this is the
first comprehensive framework that unifies CGR geometry, $k$-mer statistics,
and sequence reconstruction, offering new tools for genomic analysis and
visualization.
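
The two central claims can be illustrated with a minimal Python sketch. It is not the paper's implementation. The corner assignment (A, C, G, T mapped to the unit-square corners below), the row/column orientation of the grid, and all function names are assumptions chosen for illustration; other conventions appear in the literature. The first part bins exact CGR points into a $2^k \times 2^k$ grid and compares the result with direct $k$-mer counting. The second part rebuilds a sequence from integer $k$-mer counts with a plain Hierholzer traversal of the De Bruijn multigraph, assuming $k \ge 2$ and counts that admit an Eulerian path (for example, counts taken from a real sequence).

```python
from collections import Counter, defaultdict
from fractions import Fraction

# Assumed corner convention; the paper's own labelling may differ.
CORNERS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}


def cgr_points(seq):
    """Exact CGR trajectory: each point is the midpoint toward the base's corner."""
    x = y = Fraction(1, 2)
    points = []
    for base in seq:
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2, (y + cy) / 2
        points.append((x, y))
    return points


def fcgr_from_cgr(seq, k):
    """Bin CGR points (from the k-th point onward) into a 2^k x 2^k grid."""
    n = 2 ** k
    grid = [[0] * n for _ in range(n)]
    for x, y in cgr_points(seq)[k - 1:]:
        grid[int(y * n)][int(x * n)] += 1
    return grid


def kmer_cell(kmer):
    """Grid cell of a k-mer: the last base supplies the most significant bit."""
    row = col = 0
    for t, base in enumerate(kmer):
        cx, cy = CORNERS[base]
        col += cx << t
        row += cy << t
    return row, col


def kmer_counts(seq, k):
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def fcgr_from_kmers(seq, k):
    """Place each k-mer count directly into its grid cell."""
    n = 2 ** k
    grid = [[0] * n for _ in range(n)]
    for kmer, count in kmer_counts(seq, k).items():
        row, col = kmer_cell(kmer)
        grid[row][col] = count
    return grid


def sequence_from_kmer_counts(counts):
    """Hierholzer traversal of the De Bruijn multigraph (assumes k >= 2 and
    that an Eulerian path exists)."""
    out_edges = defaultdict(list)
    balance = Counter()  # out-degree minus in-degree per (k-1)-mer node
    for kmer, count in counts.items():
        u, v = kmer[:-1], kmer[1:]
        out_edges[u].extend([v] * count)
        balance[u] += count
        balance[v] -= count
    # Start at the node with one surplus outgoing edge, else anywhere (circuit).
    start = next((u for u, b in balance.items() if b == 1), next(iter(out_edges)))
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if out_edges[u]:
            stack.append(out_edges[u].pop())
        else:
            path.append(stack.pop())
    path.reverse()
    return path[0] + "".join(node[-1] for node in path[1:])


seq = "ACGTTGCAACGTAGCTAGGATCCA"
same_grid = fcgr_from_cgr(seq, 3) == fcgr_from_kmers(seq, 3)
rebuilt = sequence_from_kmer_counts(kmer_counts(seq, 3))
same_profile = kmer_counts(rebuilt, 3) == kmer_counts(seq, 3)
```

In this sketch, `same_grid` checks that binning CGR points and counting $k$-mers produce the same grid, and `same_profile` checks that the rebuilt sequence has the same $k$-mer counts as the input, although the sequence itself may differ. Exact rational arithmetic (`Fraction`) is used so that points never land on a cell boundary through rounding; a floating-point version would need care for long sequences.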