Using core genome and machine learning for serovar prediction in Salmonella enterica subspecies I strains.
Journal:
FEMS microbiology letters
PMID:
40210591
Abstract
This study presents a dual investigation of Salmonella enterica subspecies I, focusing on serovar prediction and core genome characteristics. We used two large genomic datasets (panX and NCBI Pathogen Detection) to test machine learning methods for predicting Salmonella serovars based on genomic differences. Among the four tested algorithms, the Random Forest model demonstrated the highest performance, achieving 90.3% accuracy with the panX dataset and 95.3% with the NCBI dataset, and was particularly effective when trained on >50% of the available data. When combined with hierarchical clustering validation, our approach achieved 100% prediction accuracy on the simulated data. Parallel analysis of panX core genome characteristics revealed that pathogenicity-related genes (including sseA, invA, mgtC, phoP, phoQ, and sitA) exhibited similar phylogenetic topologies that were distinct from the core genome phylogenetic tree, suggesting shared evolutionary histories. Notably, all identified core antibiotic resistance genes and virulence factors showed evidence of negative selection, indicating their essential role in bacterial survival. This study not only presents a promising machine learning-based alternative for Salmonella serovar classification, particularly valuable when analyzing newly identified serovars alongside known reference strains, but also provides insights into the evolutionary dynamics of core virulence-associated genes, contributing to our understanding of Salmonella genomic architecture and pathogenicity.