Resolving Genome-to-Phenotype Links in Bacteria: Machine-Learned Inference from Downsampled k-mer Representations
Journal: bioRxiv
Published Date: Feb 18, 2026
Abstract
Standard approaches to bacterial phenotyping often treat the entire genome as the fundamental unit of information, which yields high-dimensional inputs that may contain significant redundancy and rests on the assumption that entire sequences are required for accurate predictions. Although downsampling based on min-hashing or prefix filtering has been used for clustering, its utility as a direct input for predictive machine learning remains underexplored. Here, we show that a novel prefix-based downsampling algorithm can reduce the size of genomes while maintaining relatively high accuracy on phenotype prediction tasks. By combining a prefix reduction strategy with the specificity of short k-mers, we developed a method to downsample entire genomes into k-mer frequency matrices and k-mer-on-a-string representations. Ensemble models such as Random Forest and Gradient Boosting, trained on k-mer frequency matrices from the downsampled genomes, outperformed more complex deep learning architectures trained on the same downsampled representation, particularly on datasets with limited data or highly similar genomes. We were able to demonstrate explainability by tracing the k-mers with the greatest impact on the models back to genes coding for the specific phenotype. Our results demonstrate that downsampled genomic data can yield models with good predictive power, establishing an alternative when using full genomes is infeasible. The approach offers relatively high performance on bacterial phenotyping tasks and points a path toward lightweight Genome Language Models that will enable analysis of entire genomes.
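The abstract does not specify the downsampling algorithm, so the following is only an illustrative sketch of one plausible reading of prefix-based downsampling into a k-mer frequency matrix. It assumes that the k-mer immediately following each occurrence of a fixed prefix is retained and that counts are normalized per genome. The function names, the prefix "ATG", k = 3, and the toy sequences are all hypothetical and are not taken from the paper.

```python
from collections import Counter
from itertools import product


def prefix_downsample(genome: str, prefix: str, k: int) -> list[str]:
    """Keep only the k-mers that immediately follow an occurrence of `prefix`.

    Assumption: this is one possible form of prefix filtering, not
    necessarily the algorithm used in the paper.
    """
    genome = genome.upper()
    p = len(prefix)
    kept = []
    for i in range(len(genome) - p - k + 1):
        if genome.startswith(prefix, i):
            kmer = genome[i + p : i + p + k]
            # Skip k-mers containing ambiguous bases such as N.
            if set(kmer) <= set("ACGT"):
                kept.append(kmer)
    return kept


def kmer_frequency_vector(kmers: list[str], k: int) -> list[float]:
    """Turn the retained k-mers into a fixed-length relative-frequency vector."""
    vocab = ["".join(t) for t in product("ACGT", repeat=k)]  # 4**k columns
    counts = Counter(kmers)
    total = sum(counts.values()) or 1  # avoid division by zero
    return [counts[v] / total for v in vocab]


# Hypothetical toy genomes; real inputs would be assembled bacterial genomes.
genomes = {
    "isolate_a": "ATGACGTTATGCCGAATGACG",
    "isolate_b": "ATGTTTGGATGTTTCC",
}

# One row per genome, one column per possible k-mer.
matrix = {
    name: kmer_frequency_vector(prefix_downsample(seq, "ATG", 3), 3)
    for name, seq in genomes.items()
}
```

Each row of such a matrix has 4^k entries regardless of genome length, which is what would let it serve as a fixed-size input to tree ensembles such as Random Forest or Gradient Boosting.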