CompleteBin: A transformer-based framework unlocks microbial dark matter through improved short contig binning

Journal: bioRxiv
Published Date:

Abstract

Metagenomic binning is crucial for reconstructing microbial genomes from metagenomic sequencing samples. However, existing tools struggle in complex communities where short, low-abundance contigs predominate, which limits the recovery of complete metagenome-assembled genomes (MAGs) and the identification of novel functions. Here, we introduce CompleteBin, a transformer-based framework that integrates contig sequence context, pre-trained taxonomic embeddings from a genome language model, and dynamic contrastive learning to bin short contigs robustly. Across CAMI II datasets, CompleteBin increased near-complete MAG recovery by 38.5% over leading methods such as COMEBin. Across diverse real-world datasets (marine, freshwater, plant-associated, cold seep sediment, and human gut), it achieved a 57.4% improvement on average. Applying CompleteBin to six cold seep sediment samples uncovered 129 strain-level genome bins across 30 phyla, including 13 phyla undetected by other tools, and taxonomically assigned 90,405 genes (32.1% of the total), revealing previously unknown species involved in nitrogen and sulfur cycling. CompleteBin unlocks microbial dark matter in diverse environments, advancing our understanding of microbial ecology and biogeochemical processes.

Authors

  • Bohao Zou; Zhenmiao Zhang; Xiaohan Wang; Rong Tao; Nianzhen Gu; Karsten Kristiansen; Mo Han; Lu Zhang