Integrating protein and DNA embeddings for improving genome-wide transcription factor binding site prediction.

Journal: NAR genomics and bioinformatics
Published Date:

Abstract

Transcription factors (TFs) regulate gene expression by binding to specific DNA sites on the genome, making accurate TF binding site prediction critical for understanding gene regulation and downstream phenotypes. Almost all current deep learning methods use only DNA-related information to predict TF binding sites, ignoring the fact that TFs with different protein sequences and structures recognize distinct DNA patterns. Ignoring TF information not only limits prediction accuracy but also prevents these methods from generalizing to new TFs that are absent from the training data. Here, we present TransBind, a protein-aware deep learning architecture that integrates DNA sequence information with protein embeddings containing both sequence and structural information, derived from a protein language model pretrained on DNA-binding proteins, to improve TF binding site prediction. Through cross-attention, a TF embedding selectively attends to genomic regions according to its unique binding properties. Evaluated on data from 690 ChIP-seq experiments spanning 161 TFs across 91 human cell types, TransBind achieves an AUROC of 0.9508 and an AUPR of 0.3741, representing a [Formula: see text]11.8% relative AUPR improvement over state-of-the-art methods including TBiNet, EPBDXDNABERT-2, DanQ, and DeepSEA. The model outperformed existing methods in [Formula: see text]98% of TF-cell type combinations. It also recovered 160 known TF binding motifs from the JASPAR database, supporting the biological interpretability of the model. Moreover, the approach enables label-zero-shot prediction for unseen TFs, demonstrating its potential to generalize to new, poorly characterized TFs. The source code of TransBind is available at https://github.com/jianlin-cheng/TransBind. The version used in this work is archived at https://doi.org/10.5281/zenodo.19462292.
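
The cross-attention step described in the abstract can be illustrated with a minimal sketch. This is not the TransBind implementation: the single attention head, the function name cross_attention, the toy dimensions, and the random projection matrices are all assumptions made for illustration. The only property taken from the abstract is that the TF embedding acts as the query and the DNA position features supply the keys and values, so different TFs yield different attention patterns over the same genomic region.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def cross_attention(tf_embedding, dna_features, w_q, w_k, w_v):
    """Single-head cross-attention with one TF query over L DNA positions.

    tf_embedding: (d_tf,)     hypothetical protein embedding of one TF
    dna_features: (L, d_dna)  hypothetical per-position DNA features
    w_q, w_k, w_v: projection matrices into a shared dimension d
    Returns the attended context vector (d,) and attention weights (L,).
    """
    q = tf_embedding @ w_q                 # query from the TF, shape (d,)
    k = dna_features @ w_k                 # keys from DNA, shape (L, d)
    v = dna_features @ w_v                 # values from DNA, shape (L, d)
    scores = k @ q / np.sqrt(q.shape[0])   # scaled dot products, shape (L,)
    weights = softmax(scores)              # distribution over DNA positions
    context = weights @ v                  # weighted sum of values, shape (d,)
    return context, weights

# Toy sizes and random inputs, assumed purely for illustration.
rng = np.random.default_rng(0)
d_tf, d_dna, d, L = 16, 8, 4, 10
dna_features = rng.normal(size=(L, d_dna))
w_q = rng.normal(size=(d_tf, d))
w_k = rng.normal(size=(d_dna, d))
w_v = rng.normal(size=(d_dna, d))

tf_a = rng.normal(size=d_tf)   # stand-in embedding for one TF
tf_b = rng.normal(size=d_tf)   # stand-in embedding for a second TF

context_a, weights_a = cross_attention(tf_a, dna_features, w_q, w_k, w_v)
context_b, weights_b = cross_attention(tf_b, dna_features, w_q, w_k, w_v)
```

In this sketch the two stand-in TF embeddings produce different weight vectors over the same DNA positions, which is the behavior the abstract attributes to cross-attention. Because the TF enters only through its embedding, an unseen TF can in principle be scored by supplying its embedding at inference time.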
