Genomic-island cassette architecture drives pathogenic Enterococcus cecorum lineages: Cassette2Vec-EC, a structural genomics and machine-learning framework
Journal:
bioRxiv
Published Date:
Feb 21, 2026
Abstract
Mobile genetic elements and genomic islands (GIs) frequently encode antibiotic resistance and host-adaptation cargo, yet routine genome comparison pipelines often miss the higher-order organization of how genes co-occur as transferable, GI-anchored modules. We present Cassette2Vec-EC, a structural genomics framework that converts annotated genomes into cassette units (local gene neighborhoods with GI context), encodes each cassette as a fixed-length feature vector, and applies genome-grouped machine learning to predict pathogenic lineages while preventing within-genome leakage. Using a curated Enterococcus cecorum cohort from poultry production systems, we integrate pangenome context, GI calls, mobility markers, and AMR/virulence annotations into cassette-level features and evaluate models strictly under GroupKFold-by-genome. Cassette2Vec-EC achieves strong genome-level generalization (AUROC 0.975 ± 0.030, average precision 0.938 ± 0.077, Brier score 0.056 ± 0.058). When evaluated at the cassette level under the same genome-grouped protocol, performance remains high (AUROC 0.974 ± 0.029, AP 0.919 ± 0.093, Brier 0.057 ± 0.057), indicating that cassette representations capture transferable signal rather than genome identity. Baselines show that GI burden alone can partially rank genomes but yields poorer calibration and limited interpretability. By combining comparative genomics with cassette-aware features and providing locus-level explanations (SHAP) that map predictions to specific GI-associated modules, Cassette2Vec-EC provides a practical blueprint for genomic-island-aware pathogen surveillance, junction-based diagnostics, and targeted monitoring of high-risk lineages.
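The genome-grouped evaluation described above can be illustrated with a minimal, standard-library-only sketch. It is not the authors' pipeline: the function names (`group_kfold`, `brier_score`, `genome_level_scores`), the toy genome identifiers, and the choice to average cassette probabilities into a genome-level score are all assumptions made for illustration, since the abstract does not state how cassette scores are aggregated. In practice, scikit-learn's `GroupKFold` provides the same no-shared-group guarantee.

```python
from collections import defaultdict


def group_kfold(groups, n_splits):
    """Yield (train_idx, test_idx) pairs in which no group spans both sides."""
    unique = sorted(set(groups))
    if n_splits < 2 or n_splits > len(unique):
        raise ValueError("n_splits must be between 2 and the number of groups")
    # Count rows (cassettes) per group (genome).
    sizes = defaultdict(int)
    for g in groups:
        sizes[g] += 1
    # Greedy balancing: place the largest genomes first into the lightest fold.
    fold_of = {}
    load = [0] * n_splits
    for g in sorted(unique, key=lambda name: (-sizes[name], name)):
        k = load.index(min(load))
        fold_of[g] = k
        load[k] += sizes[g]
    for k in range(n_splits):
        test = [i for i, g in enumerate(groups) if fold_of[g] == k]
        train = [i for i, g in enumerate(groups) if fold_of[g] != k]
        yield train, test


def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and 0/1 label."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def genome_level_scores(genome_ids, cassette_probs):
    """Average cassette probabilities per genome (assumed aggregation rule)."""
    per_genome = defaultdict(list)
    for g, p in zip(genome_ids, cassette_probs):
        per_genome[g].append(p)
    return {g: sum(ps) / len(ps) for g, ps in per_genome.items()}


# Toy data: six cassettes drawn from three hypothetical genomes.
genome_ids = ["g1", "g1", "g2", "g2", "g2", "g3"]
folds = list(group_kfold(genome_ids, n_splits=3))
for train_idx, test_idx in folds:
    train_genomes = {genome_ids[i] for i in train_idx}
    test_genomes = {genome_ids[i] for i in test_idx}
    # Cassettes from one genome never sit on both sides of a split.
    assert train_genomes.isdisjoint(test_genomes)
```

Splitting by genome rather than by cassette matters because cassettes from the same genome share background and label; a row-level split would let a model score well by recognizing the genome rather than the cassette content.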