Fine-Grained Structural Classification of Biosynthetic Gene Cluster-Encoded Products
Journal: bioRxiv
Published Date: Jan 1, 2025
Abstract
Biosynthetic gene clusters (BGCs) are responsible for the biosynthesis of many natural products, including a multitude of effective therapeutics and their precursors. Advances in genomic data collection and in computational techniques have made it possible to identify BGCs at scale. However, accurately determining the types of BGC-encoded products from genomic content remains difficult. Here, we introduce BGCat (BGC annotation tool), a machine learning method for fine-grained structural classification of BGC-encoded products based on the NPClassifier natural product nomenclature. Our method uses a pre-trained protein language model to create meaningful gene representations and a deep neural network to predict class labels. We show that the method outperforms state-of-the-art approaches in coarse-grained product classification and is effective for detailed classification. We implement a clustering-based augmentation strategy for BGC-product relationships, addressing a crucial gap in the available datasets. We then introduce the concept of product class profiles (PCPs) of gene cluster families (GCFs), which associate each GCF with a probabilistic distribution of product types and offer a new perspective on GCF functions. Lastly, we use BGCat to provide new product class labels for over 100k BGCs in antiSMASH DB that presently have minimal information about their products.
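To make the idea of a product class profile concrete, the following is a minimal sketch of one way such a profile could be computed. It assumes, since the abstract does not say, that a PCP is the normalized average of per-BGC predicted class probabilities within a GCF. The function name `product_class_profiles`, the BGC and GCF identifiers, and the three example classes are all hypothetical and are not taken from BGCat.

```python
from collections import defaultdict

# Hypothetical class order, used only for illustration.
CLASSES = ["polyketide", "terpenoid", "alkaloid"]


def product_class_profiles(bgc_probs, bgc_to_gcf):
    """Average per-BGC class probabilities within each GCF.

    Assumption: a profile is the mean predicted probability per class,
    renormalized to sum to 1 so that it reads as a distribution even
    when the per-BGC predictions are multi-label.
    """
    grouped = defaultdict(list)
    for bgc_id, probs in bgc_probs.items():
        grouped[bgc_to_gcf[bgc_id]].append(probs)

    profiles = {}
    for gcf_id, rows in grouped.items():
        # Column-wise mean across all BGCs assigned to this GCF.
        means = [sum(col) / len(rows) for col in zip(*rows)]
        total = sum(means)
        profiles[gcf_id] = [m / total for m in means] if total > 0 else means
    return profiles


# Toy predictions for three BGCs grouped into two families.
bgc_probs = {
    "bgc1": [0.8, 0.1, 0.1],
    "bgc2": [0.6, 0.3, 0.1],
    "bgc3": [0.1, 0.8, 0.1],
}
bgc_to_gcf = {"bgc1": "gcfA", "bgc2": "gcfA", "bgc3": "gcfB"}

profiles = product_class_profiles(bgc_probs, bgc_to_gcf)
for gcf_id, profile in profiles.items():
    print(gcf_id, dict(zip(CLASSES, profile)))
```

Under this assumption, a family whose members mostly receive the same predicted class gets a sharply peaked profile, while a family with mixed predictions gets a flatter one.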