Deep Learning to Predict the Biosynthetic Gene Clusters in Bacterial Genomes.

Journal: Journal of molecular biology
PMID:

Abstract

Biosynthetic gene clusters (BGCs) in bacterial genomes code for important small molecules and secondary metabolites. Based on the validated BGCs and the corresponding sequences of protein family domains (Pfams), Pfam functions and clan information, we develop a deep learning method e-DeepBGC, that extends DeepBGC, for detecting the BGCs and their biosynthetic class in bacterial genomes. We show that e-DeepBGC leads to reduced false positive rates in BGC identification and an increased sensitivity in identifying BGCs compared to DeepBGC. We apply e-DeepBGC to 5,666 Ref Seq bacterial genomes and detect a total of 170, 685 BGCs with an average of 30.1 BGCs in each genome. We summarize all the predicted BGCs, their functional classes and the distributions of the BGCs in different bacterial phyla.

Authors

  • Mingyang Liu
    Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, Philadelphia, PA 19104, USA.
  • Yun Li
    School of Public Health, University of Michigan, Ann Arbor, MI, USA.
  • Hongzhe Li
    Department of Biostatistics, Epidemiology, and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, United States.