A Machine Learning Framework for Serogroup Classification of Pathogenic Species of Leptospira Based on rfb Locus Profiles
Journal: bioRxiv
Published Date: Mar 5, 2026
Abstract
Leptospira is a highly diverse genus traditionally classified by serological assays into more than 30 serogroups and over 300 serovars. However, this classification system is complex and often inconsistent, because antigenic cross-reactions can produce ambiguous or unreliable results. Moreover, serological tests such as the microscopic agglutination test (MAT) and the cross-agglutinin absorption test (CAAT) are labor-intensive, require live cultures, and are difficult to standardize across laboratories. To overcome these limitations, we compiled genomic data from 721 pathogenic Leptospira samples obtained from NCBI RefSeq and BIGSdb (Institut Pasteur) and developed a machine learning framework that predicts serological classification directly from genomic information. Our approach focuses on the rfb locus, a genomic region associated with lipopolysaccharide biosynthesis and antigenic diversity, and operates in two stages: the first stage assigns each sample to one of four major serological classes, and the second stage classifies it into its serogroup. Models from both stages achieved high predictive performance, with a perfect score in the first stage and a mean F1-score of 0.931 in the second. Based on the strong genetic coherence observed at the rfb locus and on shared antigenic features, we formally propose the term "seroclass" to designate these higher-order groupings of serogroups. By enabling accurate serological inference from genomic data, our approach provides a scalable and reproducible alternative to traditional serological testing, with applications in epidemiological surveillance, outbreak investigation, and vaccine development within the Leptospira genus.
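The two-stage design described in the abstract can be illustrated with a minimal sketch. The abstract does not state which features or learning algorithm were used, so everything here is an assumption made for illustration: each sample is represented as a set of rfb-locus gene identifiers, both stages use a 1-nearest-neighbour rule with Jaccard similarity, and all gene names, class labels, and function names are hypothetical. The `macro_f1` helper shows one common reading of a "mean F1-score" (the unweighted mean of per-class F1), which may differ from the averaging used in the paper.

```python
from collections import defaultdict


def jaccard(a, b):
    """Jaccard similarity between two sets of gene identifiers."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def nearest_label(genes, training, key):
    """Label (under `key`) of the training record most similar to `genes`."""
    best = max(training, key=lambda rec: jaccard(genes, rec["genes"]))
    return best[key]


class TwoStageClassifier:
    """Stage 1 predicts a seroclass; stage 2 predicts a serogroup within it.

    Illustrative stand-in only, not the model used in the paper.
    """

    def fit(self, records):
        self.records = list(records)
        # Group the training records by seroclass so that stage 2 only
        # compares a query against samples from the predicted seroclass.
        self.by_class = defaultdict(list)
        for rec in self.records:
            self.by_class[rec["seroclass"]].append(rec)
        return self

    def predict(self, genes):
        seroclass = nearest_label(genes, self.records, "seroclass")
        serogroup = nearest_label(genes, self.by_class[seroclass], "serogroup")
        return seroclass, serogroup


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-label F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for lab in labels:
        tp = sum(t == lab and p == lab for t, p in zip(y_true, y_pred))
        fp = sum(t != lab and p == lab for t, p in zip(y_true, y_pred))
        fn = sum(t == lab and p != lab for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy data: gene identifiers and labels are placeholders.
records = [
    {"genes": {"g1", "g2", "g3"}, "seroclass": "class_A", "serogroup": "group_A1"},
    {"genes": {"g1", "g2", "g4"}, "seroclass": "class_A", "serogroup": "group_A2"},
    {"genes": {"g7", "g8", "g9"}, "seroclass": "class_B", "serogroup": "group_B1"},
]
model = TwoStageClassifier().fit(records)
prediction = model.predict({"g1", "g2", "g3", "g5"})
```

Restricting the second stage to samples from the predicted seroclass keeps each serogroup decision among closely related rfb profiles, but it also means a stage-one error cannot be recovered in stage two, so stage-one accuracy bounds the overall result.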