DomDiff: protein family and domain annotation via diffusion model and ESM2 embedding
Journal: bioRxiv
Published Date: Jan 1, 2025
Abstract
Accurate identification of conserved protein domain boundaries and their classification are fundamental to genome annotation, but are hindered by ambiguous boundaries, cross-domain interference, and limited samples for rare families. Here, we present DomDiff, a supervised conditional diffusion framework that reformulates the task as a generative process. Taking ESM2 embeddings, secondary structures, and biLSTM priors as inputs, it generates labels from Gaussian noise through iterative denoising, allowing coarse-to-fine optimization. We conducted a series of benchmark analyses on publicly available protein sequence datasets, showing that DomDiff outperforms existing methods in domain boundary identification and classification, with gains of 12.6% in boundary detection and 4.2% in classification accuracy over other leading models. It performs particularly well on rare families, making it a useful tool for applications such as large-scale genome annotation and functional characterization of novel proteins, and offering a new paradigm for few-shot challenges in bioinformatics.
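
To make the idea of generating labels from Gaussian noise by iterative denoising more concrete, the following is a minimal sketch of a standard DDPM-style reverse sampling loop over per-residue class scores. It is not the DomDiff implementation: the noise predictor `toy_denoiser`, its weight matrices `w_x` and `w_c`, the linear noise schedule, the step count, and the helper names are all assumptions made for illustration, and the conditioning vectors `cond` merely stand in for features such as ESM2 embeddings. The toy predictor also treats each residue independently, which a real model would not.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances (an assumed schedule, not from the paper)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def toy_denoiser(x_row, cond_row, w_x, w_c):
    """Hypothetical noise predictor for one residue: tanh(x W_x + c W_c).

    A trained network would replace this and would also take the timestep.
    """
    out = []
    for k in range(len(x_row)):
        s = sum(x_row[j] * w_x[j][k] for j in range(len(x_row)))
        s += sum(cond_row[j] * w_c[j][k] for j in range(len(cond_row)))
        out.append(math.tanh(s))
    return out


def sample_labels(cond, num_classes, w_x, w_c, num_steps=50, seed=0):
    """Run the reverse diffusion chain and return one class index per residue."""
    rng = random.Random(seed)
    betas = linear_beta_schedule(num_steps)

    # Cumulative products of alpha_t = 1 - beta_t.
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)

    # Start from pure Gaussian noise: one score vector per residue.
    x = [[rng.gauss(0.0, 1.0) for _ in range(num_classes)] for _ in cond]

    for t in range(num_steps - 1, -1, -1):
        coef = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        scale = 1.0 / math.sqrt(1.0 - betas[t])
        sigma = math.sqrt(betas[t]) if t > 0 else 0.0  # no noise on the last step
        new_x = []
        for x_row, cond_row in zip(x, cond):
            eps_hat = toy_denoiser(x_row, cond_row, w_x, w_c)
            new_x.append([
                scale * (xi - coef * ei) + sigma * rng.gauss(0.0, 1.0)
                for xi, ei in zip(x_row, eps_hat)
            ])
        x = new_x

    # Discretize the denoised scores into per-residue labels.
    return [row.index(max(row)) for row in x]


def boundaries(labels):
    """Positions where the per-residue label changes (candidate domain boundaries)."""
    return [i for i in range(1, len(labels)) if labels[i] != labels[i - 1]]


if __name__ == "__main__":
    rng = random.Random(1)
    seq_len, embed_dim, num_classes = 12, 8, 3
    # Random vectors standing in for per-residue conditioning features.
    cond = [[rng.gauss(0.0, 1.0) for _ in range(embed_dim)] for _ in range(seq_len)]
    w_x = [[0.1 * rng.gauss(0.0, 1.0) for _ in range(num_classes)] for _ in range(num_classes)]
    w_c = [[0.1 * rng.gauss(0.0, 1.0) for _ in range(num_classes)] for _ in range(embed_dim)]
    labels = sample_labels(cond, num_classes, w_x, w_c)
    print(labels, boundaries(labels))
```

With untrained random weights the output labels are meaningless; the sketch only shows the control flow in which a noisy label tensor is refined step by step under fixed conditioning and then discretized into labels and boundary positions.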