DeepDBS: Identification of DNA-binding sites in protein sequences by using deep representations and random forest.

Journal: Methods (San Diego, Calif.)
PMID:

Abstract

Interactions among biological molecules are considered primary factors in the life cycle of an organism. Many important biological functions depend on such interactions, and among them protein-DNA interactions are essential for transcription, regulation of gene expression, and DNA repair and packaging. Knowledge of these interactions and of the sites at which they occur is therefore necessary for studying the mechanisms of various biological processes. Because experimental identification through biological assays is resource-demanding, costly, and error-prone, researchers turn to computational methods for efficient and accurate identification of DNA-protein interaction sites. Herein, we propose DeepDBS, a novel and accurate method for identifying DNA-binding sites in proteins from their primary amino acid sequences. Deep representations were computed from the protein sequences with a one-dimensional convolutional neural network (1D-CNN), a recurrent neural network (RNN), and a long short-term memory (LSTM) network, and these representations were then used to train a Random Forest classifier. Random Forest with LSTM-based features outperformed the other models as well as existing state-of-the-art methods, with an accuracy of 0.99 in the self-consistency test, 10-fold cross-validation, 5-fold cross-validation, and jackknife validation, and 0.92 on independent dataset testing. Based on these results, we conclude that DeepDBS can support accurate and efficient identification of DNA-binding sites (DBS) in proteins.
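
The four internal validation protocols named in the abstract differ only in how samples are divided into training and test sets. The following is a minimal sketch of that difference and is not the authors' implementation. All function names are hypothetical, the one-dimensional toy features stand in for the LSTM-derived representations, and a nearest-centroid rule stands in for the Random Forest classifier so that the example needs only the Python standard library.

```python
import random

def self_consistency_split(n):
    """Self-consistency test: train and test on the same samples."""
    idx = list(range(n))
    yield idx, idx

def k_fold_splits(n, k, seed=0):
    """k-fold cross-validation: each sample is held out in exactly one fold."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test

def jackknife_splits(n):
    """Jackknife (leave-one-out): every sample is held out once on its own."""
    for i in range(n):
        yield [j for j in range(n) if j != i], [i]

def fit_nearest_centroid(X_train, y_train):
    """Stand-in classifier (assumption): NOT the Random Forest of the paper."""
    sums, counts = {}, {}
    for x, label in zip(X_train, y_train):
        sums[label] = sums.get(label, 0.0) + x
        counts[label] = counts.get(label, 0) + 1
    centroids = {label: sums[label] / counts[label] for label in sums}

    def predict(x):
        # Assign the label whose class mean is closest to x.
        return min(centroids, key=lambda label: abs(x - centroids[label]))

    return predict

def accuracy(fit, X, y, splits):
    """Pool predictions over all test sets and return the fraction correct."""
    correct = total = 0
    for train, test in splits:
        predict = fit([X[i] for i in train], [y[i] for i in train])
        for i in test:
            correct += int(predict(X[i]) == y[i])
            total += 1
    return correct / total

# Toy data (assumption): one feature per sample, two well-separated classes.
X = [0.1, 0.2, 0.3, 0.4, 0.5, 1.5, 1.6, 1.7, 1.8, 1.9]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

results = {
    "self-consistency": accuracy(fit_nearest_centroid, X, y, self_consistency_split(len(X))),
    "5-fold": accuracy(fit_nearest_centroid, X, y, k_fold_splits(len(X), 5)),
    "10-fold": accuracy(fit_nearest_centroid, X, y, k_fold_splits(len(X), 10)),
    "jackknife": accuracy(fit_nearest_centroid, X, y, jackknife_splits(len(X))),
}
```

Independent dataset testing is not shown because it involves no resampling: the model is fitted once on the full training set and scored on separately collected samples. The self-consistency test reuses the training data for scoring, so it measures fit rather than generalization.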

Authors

  • Yaser Daanial Khan
    Department of Computer Science, School of Systems and Technology, University of Management and Technology, Lahore, Punjab 54770, Pakistan.
  • Tamim Alkhalifah
    Department of Computer, College of Science and Arts in Ar Rass, Qassim University, Buraydah 52571, Saudi Arabia.
  • Fahad Alturise
    Department of Computer, College of Science and Arts in Ar Rass, Qassim University, Ar Rass 58892, Qassim, Saudi Arabia.
  • Ahmad Hassan Butt
    Department of Computer Science, Faculty of Computing and Information Technology, University of the Punjab, Lahore 54000, Punjab, Pakistan. Electronic address: ahmad.hassan@pucit.edu.pk.