Using core genome and machine learning for serovar prediction in Salmonella enterica subspecies I strains.

Journal: FEMS microbiology letters

PMID: 40210591

Abstract

This study presents a dual investigation of Salmonella enterica subspecies I, focusing on serovar prediction and core genome characteristics. We used two large genomic datasets (panX and NCBI Pathogen Detection) to test machine learning methods for predicting Salmonella serovars from genomic differences. Among the four algorithms tested, the Random Forest model performed best, achieving 90.3% accuracy on the panX dataset and 95.3% on the NCBI dataset; it was particularly effective when trained on >50% of the available data. When combined with hierarchical clustering validation, our approach achieved 100% prediction accuracy on simulated data. A parallel analysis of panX core genome characteristics revealed that pathogenicity-related genes (including sseA, invA, mgtC, phoP, phoQ, and sitA) exhibited similar phylogenetic topologies that were distinct from the core genome phylogenetic tree, suggesting shared evolutionary histories. Notably, all identified core antibiotic resistance genes and virulence factors showed evidence of negative selection, indicating their essential role in bacterial survival. This study presents a promising machine learning-based alternative for Salmonella serovar classification, one that is particularly valuable when newly identified serovars are analyzed alongside known reference strains. It also provides insights into the evolutionary dynamics of core virulence-associated genes, contributing to our understanding of Salmonella genomic architecture and pathogenicity.
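The abstract does not say how negative selection was detected. One common way to express it is the ratio of nonsynonymous to synonymous substitution rates (dN/dS), where a ratio well below 1 points to purifying (negative) selection. The following is a minimal sketch under that assumption. The class, function names, tolerance band, and rate values are all hypothetical and are not taken from the study.

```python
from dataclasses import dataclass


@dataclass
class GeneSubstitutions:
    """Hypothetical per-gene substitution rates (not values from the study)."""
    name: str
    dn: float  # nonsynonymous substitutions per nonsynonymous site
    ds: float  # synonymous substitutions per synonymous site


def omega(gene: GeneSubstitutions) -> float:
    """Return dN/dS; the ratio is undefined when dS is not positive."""
    if gene.ds <= 0:
        raise ValueError(f"dS must be positive to compute dN/dS for {gene.name}")
    return gene.dn / gene.ds


def selection_regime(gene: GeneSubstitutions, tolerance: float = 0.05) -> str:
    """Label a gene by its dN/dS ratio.

    The tolerance band around 1 is an illustrative assumption; a real
    analysis would rely on a statistical test, not a fixed cutoff.
    """
    w = omega(gene)
    if w < 1 - tolerance:
        return "negative (purifying)"
    if w > 1 + tolerance:
        return "positive (diversifying)"
    return "approximately neutral"


# Made-up genes and rates, for illustration only.
examples = [
    GeneSubstitutions("gene_A", dn=0.002, ds=0.020),
    GeneSubstitutions("gene_B", dn=0.010, ds=0.010),
    GeneSubstitutions("gene_C", dn=0.030, ds=0.010),
]

for g in examples:
    print(f"{g.name}: dN/dS = {omega(g):.2f} -> {selection_regime(g)}")
```

Under this reading, the abstract's finding would correspond to every core resistance and virulence gene falling in the first category.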

Authors

Xiang Li

Department of Radiology, Massachusetts General Hospital and Harvard Medical School, Boston, MA, United States.

Adelumola Oladeinde

U.S. National Poultry Research Center, Egg & Poultry Production Safety Research Unit, Agricultural Research Service, U.S. Department of Agriculture, Athens, GA 30605, United States.

Michael Rothrock

U.S. National Poultry Research Center, Egg & Poultry Production Safety Research Unit, Agricultural Research Service, U.S. Department of Agriculture, Athens, GA 30605, United States.

Tae Jung Chung

U.S. National Poultry Research Center, Egg & Poultry Production Safety Research Unit, Agricultural Research Service, U.S. Department of Agriculture, Athens, GA 30605, United States.

Walid Ghazi Al Hakeem

U.S. National Poultry Research Center, Egg & Poultry Production Safety Research Unit, Agricultural Research Service, U.S. Department of Agriculture, Athens, GA 30605, United States.

Keywords

Genome, Bacterial; Genomics; Machine Learning; Phylogeny; Salmonella enterica; Serogroup; Virulence Factors

