Machine learning–assisted selection of informative loci for strain-level phylogenetics of Neisseria gonorrhoeae

Journal: bioRxiv
Published Date:

Abstract

Epidemiological surveillance of Neisseria gonorrhoeae is hindered by the limitations of existing molecular typing methods, such as NG-MAST and MLST, which suffer from either excessive variability or insufficient resolution. In this study, we propose and evaluate a machine learning (ML) approach for the automated selection of a minimal set of informative genetic loci for accurate strain classification. Using a collection of 29 N. gonorrhoeae reference genomes, we developed a pipeline that integrates Random Forest models and DNABERT embeddings to generate optimized gene panels. The results demonstrate that ML-selected panels substantially outperform traditional schemes, yielding markedly improved phylogenetic accuracy and branch support consistently above 90%. The proposed approach significantly reduces computational costs compared to whole-genome analysis and offers a promising, resource-efficient tool for routine epidemiological monitoring, tracking transmission pathways, and identifying antibiotic-resistant strains.

Authors

  • Elizaveta Kochubei; Serafim Dobrovolskii; Zlata Zenchenko; Mikhail Rayko