Weakly-Supervised Convolutional Neural Network Architecture for Predicting Protein-DNA Binding.

Journal: IEEE/ACM transactions on computational biology and bioinformatics
Published Date:

Abstract

Although convolutional neural networks (CNNs) have outperformed conventional methods in predicting the sequence specificities of protein-DNA binding in recent years, they do not take full advantage of the weakly-supervised information intrinsic to DNA sequences: a bound sequence may contain multiple transcription factor binding sites (TFBSs). Here, we propose a weakly-supervised convolutional neural network architecture (WSCNN), which combines multiple-instance learning (MIL) with a CNN, to further improve the prediction of protein-DNA binding. WSCNN first divides each DNA sequence (a bag) into multiple overlapping subsequences (instances) with a sliding window, then models each instance separately using a CNN, and finally fuses the predicted scores of all instances in the same bag with one of four fusion methods: Max, Average, Linear Regression, and Top-Bottom Instances. The experimental results on in vivo and in vitro datasets illustrate the performance of the proposed approach. Moreover, models built on in vitro data using WSCNN can predict in vivo protein-DNA binding with good accuracy. In addition, through a series of experiments, we give a quantitative analysis of the importance of the reverse-complement mode in predicting in vivo protein-DNA binding, and explain why we do not directly use advanced pooling layers to combine MIL with a CNN.
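
The bag-and-instance pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the window and stride parameters, and the exact form of each fusion rule are assumptions made for illustration. In particular, Top-Bottom Instances is taken here to be the mean of the k highest and k lowest instance scores, and the Linear Regression weights are assumed to be learned elsewhere. The per-instance CNN is omitted, so the instance scores are simply passed in.

```python
import numpy as np

def split_into_instances(seq, window, stride):
    """Slide a fixed-size window over a DNA sequence (the bag) and
    return the overlapping subsequences (the instances)."""
    if len(seq) < window:
        return [seq]  # sequence shorter than the window: single instance
    return [seq[i:i + window] for i in range(0, len(seq) - window + 1, stride)]

def fuse_scores(scores, method="max", k=1, weights=None, bias=0.0):
    """Combine per-instance scores into one bag-level score.

    All four rules are illustrative assumptions, not the paper's exact
    definitions.
    """
    s = np.asarray(scores, dtype=float)
    if method == "max":
        return float(s.max())
    if method == "average":
        return float(s.mean())
    if method == "linear":
        # Weighted sum of instance scores; weights assumed learned elsewhere.
        if weights is None:
            raise ValueError("linear fusion needs one weight per instance")
        return float(np.dot(np.asarray(weights, dtype=float), s) + bias)
    if method == "top_bottom":
        # Assumed reading: mean of the k highest and k lowest scores.
        ordered = np.sort(s)
        return float(np.concatenate([ordered[-k:], ordered[:k]]).mean())
    raise ValueError(f"unknown fusion method: {method}")

# Hypothetical usage: the scores stand in for per-instance CNN outputs.
instances = split_into_instances("ACGTACGT", window=4, stride=2)
scores = [0.1, 0.9, 0.4]
bag_score = fuse_scores(scores, method="max")
```

Max fusion encodes the usual MIL assumption that a bag is positive if at least one instance is positive, whereas the other rules let several instances contribute, which matches the observation that a bound sequence may contain multiple TFBSs.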

Authors

  • Qinhu Zhang
  • Lin Zhu
    Institute of Environmental Technology, College of Environmental and Resource Sciences; Zhejiang University, Hangzhou 310058, China.
  • Wenzheng Bao
    School of Information Engineering, Xuzhou University of Technology, Xuzhou, China.
  • De-Shuang Huang