Machine learning–assisted selection of informative loci for strain-level phylogenetics of Neisseria gonorrhoeae
Journal: bioRxiv
Published Date: Jan 1, 2025
Abstract
Epidemiological surveillance of Neisseria gonorrhoeae is hindered by the limitations of existing molecular typing methods such as NG-MAST and MLST, which suffer from either excessive variability or insufficient resolution. In this study, we propose and evaluate a machine learning (ML) algorithm that automatically selects a minimal set of informative genetic loci for accurate strain classification. Using a collection of 29 N. gonorrhoeae reference genomes, we developed a pipeline that integrates Random Forest models and DNABERT embeddings to generate optimized gene panels. ML-selected panels substantially outperform traditional schemes, yielding markedly improved phylogenetic accuracy with branch support consistently above 90%. The proposed approach significantly reduces computational cost relative to whole-genome analysis and offers a promising, resource-efficient tool for routine epidemiological monitoring, tracking of transmission pathways, and identification of antibiotic-resistant strains.
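To make the notion of a "minimal set of informative loci" concrete, the following toy sketch may help. It is not the Random Forest and DNABERT pipeline described in the abstract. It swaps in a much simpler greedy heuristic that repeatedly picks the locus separating the most still-indistinguishable strain pairs. The function name `select_loci`, the strain and locus names, and the allele numbers are all hypothetical and chosen purely for illustration.

```python
from itertools import combinations


def select_loci(allele_profiles):
    """Greedily choose loci until every pair of strains differs at a chosen locus.

    allele_profiles: dict mapping strain name -> {locus name: allele id}.
    Returns the chosen loci in selection order. This is a simple stand-in
    heuristic (an assumption for illustration), not the study's ML method.
    """
    strains = sorted(allele_profiles)
    loci = sorted(next(iter(allele_profiles.values())))
    # Strain pairs that no chosen locus separates yet.
    unresolved = set(combinations(strains, 2))
    chosen = []

    while unresolved:
        candidates = [locus for locus in loci if locus not in chosen]
        if not candidates:
            break

        def gain(locus):
            # Number of unresolved pairs whose alleles differ at this locus.
            return sum(
                allele_profiles[a][locus] != allele_profiles[b][locus]
                for a, b in unresolved
            )

        best = max(candidates, key=gain)
        if gain(best) == 0:
            break  # remaining pairs are identical at every remaining locus
        chosen.append(best)
        # Keep only the pairs that the new locus fails to separate.
        unresolved = {
            (a, b)
            for a, b in unresolved
            if allele_profiles[a][best] == allele_profiles[b][best]
        }
    return chosen


# Hypothetical toy data: four strains typed at three loci.
toy_profiles = {
    "strain_1": {"locus_A": 1, "locus_B": 1, "locus_C": 1},
    "strain_2": {"locus_A": 1, "locus_B": 2, "locus_C": 1},
    "strain_3": {"locus_A": 2, "locus_B": 1, "locus_C": 1},
    "strain_4": {"locus_A": 2, "locus_B": 2, "locus_C": 1},
}

panel = select_loci(toy_profiles)
print(panel)
```

In this toy case the invariant locus is never selected, because it separates no strain pair. A model-based ranking, such as feature importances from a Random Forest, would replace the `gain` score with a learned measure of how much each locus contributes to classification.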