Identifying the DNA methylation preference of transcription factors using ProtBERT and SVM.
Journal:
PLoS computational biology
Published Date:
May 13, 2025
Abstract
Transcription factors (TFs) affect gene expression by binding to specific DNA sequences, and this binding may be modulated by DNA methylation. A subset of TFs that serve as methylation readers preferentially bind to certain methylated DNA; these are defined as TFPMs. Identifying TFPMs enhances our understanding of the role of DNA methylation in gene regulation, but experimental identification is resource-demanding. In this study, we propose a novel two-step computational approach to classify TFs and TFPMs. First, we employ a fine-tuned ProtBERT model to distinguish TFs from non-TFs. Second, we combine the Reduced Amino Acid Category (RAAC) with K-mer features and an SVM to predict the potential of TFs to bind to methylated DNA. Comparative experiments demonstrate that our proposed methods outperform all existing approaches and underscore the efficiency of our computational framework in classifying TFs and TFPMs. Cross-species validation on an independent mouse dataset further demonstrates the generalizability of the framework. In addition, we ran predictions on all human transcription factors and found that most of the top 20 proteins belong to the Krueppel C2H2-type zinc-finger family. Previous studies have reported a partial correlation between this family and DNA methylation and have confirmed the methylation preference of some of its members, which supports the robustness of our approach.
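To make the second step more concrete, the following is a minimal Python sketch of how a protein sequence might be turned into RAAC-based K-mer features before classification. It is illustrative only: the six-group reduced alphabet, the function names `reduce_sequence` and `kmer_features`, and the choice of k = 2 are assumptions made for this example and are not taken from the study, which may use a different clustering and K-mer length.

```python
from itertools import product

# Hypothetical reduced alphabet: a simple physicochemical grouping chosen for
# illustration. It is NOT the RAAC clustering selected in the study.
RAAC_GROUPS = {
    "A": "AVLIMC",  # aliphatic / hydrophobic
    "F": "FWYH",    # aromatic
    "S": "STNQ",    # polar, uncharged
    "K": "KR",      # positively charged
    "D": "DE",      # negatively charged
    "G": "GP",      # conformationally special
}

# Lookup table from each of the 20 residues to its group representative.
RESIDUE_TO_GROUP = {
    residue: rep for rep, members in RAAC_GROUPS.items() for residue in members
}


def reduce_sequence(seq):
    """Map each residue to its group representative; unknown symbols are skipped."""
    return "".join(
        RESIDUE_TO_GROUP[r] for r in seq.upper() if r in RESIDUE_TO_GROUP
    )


def kmer_features(seq, k=2):
    """Return normalized K-mer frequencies computed over the reduced alphabet."""
    reduced = reduce_sequence(seq)
    alphabet = sorted(RAAC_GROUPS)
    vocab = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(vocab, 0)
    total = max(len(reduced) - k + 1, 0)  # number of overlapping K-mers
    for i in range(total):
        counts[reduced[i:i + k]] += 1
    # Fixed-length vector: one entry per possible K-mer, in sorted vocab order.
    return [counts[v] / total if total else 0.0 for v in vocab]
```

Each sequence yields a fixed-length vector (6^k entries for this six-group alphabet), so the vectors for a labeled set of TFs could then be passed to an SVM classifier, for example scikit-learn's `sklearn.svm.SVC`, to separate TFPMs from other TFs. Reducing the alphabet before counting keeps the feature space small compared with the 20^k K-mers of the full amino acid alphabet.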