Deciphering the biosynthetic potential of microbial genomes using a BGC language processing neural network model.

Journal: Nucleic acids research
PMID:

Abstract

Biosynthetic gene clusters (BGCs), key in synthesizing microbial secondary metabolites, are mostly hidden in microbial genomes and metagenomes. To unearth this vast potential, we present BGC-Prophet, a transformer-based language model for BGC prediction and classification. Leveraging the transformer encoder, BGC-Prophet captures location-dependent relationships between genes. As one of the pioneering ultrahigh-throughput tools, BGC-Prophet significantly surpasses existing methods in efficiency and fidelity, enabling comprehensive pan-phylogenetic and whole-metagenome BGC screening. Through the analysis of 85 203 genomes and 9428 metagenomes, BGC-Prophet has profiled an extensive array of sub-million BGCs. It highlights notable enrichment in phyla like Actinomycetota and the widespread distribution of polyketide, NRP, and RiPP BGCs across diverse lineages. It reveals enrichment patterns of BGCs following important geological events, suggesting environmental influences on BGC evolution. BGC-Prophet's capabilities in detection of BGCs and evolutionary patterns offer contributions to deeper understanding of microbial secondary metabolites and application in synthetic biology.

Authors

  • Qilong Lai
    MOE Key Laboratory of Molecular Biophysics of the Ministry of Education, Hubei Key Laboratory of Bioinformatics and Molecular-imaging, Center of AI Biology, Department of Bioinformatics and Systems Biology, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, Hubei, China.
  • Shuai Yao
    MOE Key Laboratory of Molecular Biophysics of the Ministry of Education, Hubei Key Laboratory of Bioinformatics and Molecular-imaging, Center of AI Biology, Department of Bioinformatics and Systems Biology, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, Hubei, China.
  • Yuguo Zha
    MOE Key Laboratory of Molecular Biophysics, Hubei Key Laboratory of Bioinformatics and Molecular-imaging, Center of Artificial Intelligence Biology, Department of Bioinformatics and Systems Biology, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China.
  • Haohong Zhang
    Key Laboratory of Molecular Biophysics of the Ministry of Education, Hubei Key Laboratory of Bioinformatics and Molecular-imaging, Center of AI Biology, Department of Bioinformatics and Systems Biology, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, Hubei, China.
  • Haobo Zhang
    Key Laboratory of Molecular Biophysics of the Ministry of Education, Hubei Key Laboratory of Bioinformatics and Molecular-imaging, Center of AI Biology, Department of Bioinformatics and Systems Biology, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan, Hubei 430074, China.
  • Ying Ye
    The Wistar Institute, Philadelphia, PA, 19104, USA.
  • Yonghui Zhang
    Guangdong Provincial Center for Disease Control and Prevention, Guangzhou, China.
  • Hong Bai
    MOE Key Laboratory of Molecular Biophysics of the Ministry of Education, Hubei Key Laboratory of Bioinformatics and Molecular-imaging, Center of AI Biology, Department of Bioinformatics and Systems Biology, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, Hubei, China.
  • Kang Ning
    MOE Key Laboratory of Molecular Biophysics, Hubei Key Laboratory of Bioinformatics and Molecular-imaging, Center of Artificial Intelligence Biology, Department of Bioinformatics and Systems Biology, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China. Electronic address: ningkang@hust.edu.cn.