Intra-DNA k-mer Conservation Patterns Encode Evolutionary Selection of Variants

Journal: bioRxiv
Published Date:

Abstract

Evolution shapes the structure and content of genomes, yet the contribution of local sequence composition to variant selection remains poorly understood. While traditional models emphasize protein function or cross-species conservation, we propose that intra-genomic patterns of oligonucleotide (k-mer) frequencies also reflect selective forces. To explore this, we developed the kGain score, a metric that quantifies the frequency shift of a k-mer upon single-nucleotide substitution, using the surrounding genomic context as a baseline. We hypothesize that variants arising in high-kGain contexts are more likely to persist because they are evolutionarily favored. We tested this hypothesis across multiple systems. In E. coli and S. cerevisiae long-term evolution experiments, we found that fixed, essential, and parallel mutations consistently show elevated kGain scores. This trend held in SARS-CoV-2 variants of concern and in an in-house antibiotic adaptation experiment, where a high-kGain fusA Y515N mutation conferred resistance and maintained fitness when overexpressed, demonstrating a causal link between kGain and adaptive potential. To enable cross-species generalization, we trained a transformer-based neural network regressor on LTEE-derived mutations to predict kGain from sequence alone. The model achieved high correlation on held-out in-domain data (Pearson r = 0.81) and accurately predicted kGain trends on out-of-domain data (Pearson r = 0.82), demonstrating that k-mer-based sequence constraints learned from one genome can be effectively transferred to others. Together, our results establish kGain as a biologically meaningful, scalable metric for probing within-genome sequence selection, offering a complementary lens to existing conservation-based frameworks for understanding evolutionary fitness and variant persistence.
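
The abstract describes kGain only in words, so the following is a minimal sketch of one way such a score could be computed, not the authors' formula. It assumes, purely for illustration, that the score is the log2 ratio of genome-wide counts of the k-mers created by a substitution to the counts of the k-mers it replaces. The function names, the default k = 5, and the pseudocount are all hypothetical.

```python
import math
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kgain_sketch(seq, pos, alt, k=5, pseudocount=1.0):
    """Illustrative k-mer frequency shift for a substitution at 0-based pos.

    ASSUMPTION: this log2-ratio definition is a stand-in chosen for the
    example; the exact kGain formula is not given in the abstract.
    """
    if seq[pos] == alt:
        raise ValueError("alt must differ from the reference base")

    # Baseline k-mer frequencies come from the unmutated sequence.
    counts = kmer_counts(seq, k)
    mutated = seq[:pos] + alt + seq[pos + 1:]

    # Start positions of all k-mer windows that overlap the mutated base.
    starts = range(max(0, pos - k + 1), min(pos, len(seq) - k) + 1)

    ref_total = sum(counts[seq[s:s + k]] for s in starts)
    alt_total = sum(counts[mutated[s:s + k]] for s in starts)

    # Positive score: the substitution creates k-mers that are more common
    # in the baseline sequence than the k-mers it destroys.
    return math.log2((alt_total + pseudocount) / (ref_total + pseudocount))


if __name__ == "__main__":
    toy = "AAAAAAAACAAAAAAA"  # toy sequence, not real genomic data
    print(kgain_sketch(toy, pos=8, alt="A", k=3))
```

Under this assumed definition, a substitution that converts rare local k-mers into k-mers that are frequent elsewhere in the genome scores above zero, and the reverse change scores below zero.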

Authors

  • Bernadette Mathew; Abhishek Halder; Nancy Jaiswal; Smruti Panda; Debjit Pramanik; Sreeram Chandra Murthy Peela; Abhishek Garg; Sadhana Tripathi; Swarnava Samanta; Prashant Gupta; Vandana Malhotra; Gaurav Ahuja; Debarka Sengupta