Intra-DNA k-mer Conservation Patterns Encode Evolutionary Selection of Variants
Journal: bioRxiv
Published Date: Jan 1, 2025
Abstract
Evolution shapes the structure and content of genomes, yet the contribution of local sequence composition to variant selection remains poorly understood. Whereas traditional models emphasize protein function or cross-species conservation, we propose that intra-genomic patterns of oligonucleotide (k-mer) frequencies also reflect selective forces. To explore this, we developed the kGain score, a metric that quantifies the shift in k-mer frequency caused by a single-nucleotide substitution, using the surrounding genomic context as a baseline. We hypothesize that variants arising in high-kGain contexts are evolutionarily favored and therefore more likely to persist. We tested this hypothesis across multiple systems. In E. coli and S. cerevisiae long-term evolution experiments (LTEE), fixed, essential, and parallel mutations consistently showed elevated kGain scores. The same trend held in SARS-CoV-2 variants of concern and in an in-house antibiotic adaptation experiment, in which a high-kGain fusA Y515N mutation conferred resistance and maintained fitness when overexpressed, demonstrating a causal link between kGain and adaptive potential. To enable cross-species generalization, we trained a transformer-based neural network regressor on LTEE-derived mutations to predict kGain from sequence alone. The model achieved high correlation on held-out in-domain data (Pearson r = 0.81) and accurately predicted kGain trends in out-of-domain data (Pearson r = 0.82), demonstrating that k-mer-based sequence constraints learned from one genome can be transferred effectively to others. Together, our results establish kGain as a biologically meaningful, scalable metric for probing within-genome sequence selection, offering a complementary lens to existing conservation-based frameworks for understanding evolutionary fitness and variant persistence.
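The abstract describes kGain only in words and does not give its formula. To make the idea of a substitution-induced k-mer frequency shift concrete, the following is a minimal illustrative sketch and not the authors' implementation. It assumes that the score is the mean log2 ratio of background counts between the k-mers created by the substitution and the k-mers they replace, that the background is the whole input sequence, and that a pseudocount of 1 handles unseen k-mers. The names `count_kmers`, `overlapping_kmers`, and `kgain_sketch`, along with the choice of k = 3, are hypothetical.

```python
from collections import Counter
from math import log2
from typing import List


def count_kmers(seq: str, k: int) -> Counter:
    """Count every overlapping k-mer in seq (the assumed background)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def overlapping_kmers(seq: str, pos: int, k: int) -> List[str]:
    """Return all k-mers of seq that cover the 0-based position pos."""
    first = max(0, pos - k + 1)
    last = min(pos, len(seq) - k)
    return [seq[i:i + k] for i in range(first, last + 1)]


def kgain_sketch(genome: str, pos: int, alt: str, k: int = 3,
                 pseudocount: float = 1.0) -> float:
    """Hypothetical kGain-like score for substituting genome[pos] with alt.

    ASSUMPTION: the score is the mean log2 ratio of background counts of
    the mutant k-mers to the reference k-mers covering pos. The source
    abstract does not specify the actual definition.
    """
    if len(alt) != 1 or genome[pos] == alt:
        raise ValueError("alt must be a single base different from the reference")
    background = count_kmers(genome, k)
    mutant = genome[:pos] + alt + genome[pos + 1:]
    ref_kmers = overlapping_kmers(genome, pos, k)
    alt_kmers = overlapping_kmers(mutant, pos, k)
    ratios = [
        log2((background[a] + pseudocount) / (background[r] + pseudocount))
        for r, a in zip(ref_kmers, alt_kmers)
    ]
    return sum(ratios) / len(ratios)


# Toy example: T -> G at the last position turns the rare "ACT" into the
# common "ACG", so the sketch score is positive.
print(kgain_sketch("ACGACGACGACT", 11, "G"))  # 1.0
```

Under these assumptions, a positive value means the substitution moves the local sequence toward k-mers that are more frequent in the background, and a negative value means the opposite.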