Protein Family-Specific Models Using Deep Neural Networks and Transfer Learning Improve Virtual Screening and Highlight the Need for More Data.

Journal: Journal of Chemical Information and Modeling
Published Date:

Abstract

Machine learning has shown enormous potential for computer-aided drug discovery. Here we show how modern convolutional neural networks (CNNs) can be applied to structure-based virtual screening. We have coupled our densely connected CNN (DenseNet) with a transfer learning approach which we use to produce an ensemble of protein family-specific models. We conduct an in-depth empirical study and provide the first guidelines on the minimum requirements for adopting a protein family-specific model. Our method also highlights the need for additional data, even in data-rich protein families. Our approach outperforms recent benchmarks on the DUD-E data set and an independent test set constructed from the ChEMBL database. Using a clustered cross-validation on DUD-E, we achieve an average AUC ROC of 0.92 and a 0.5% ROC enrichment factor of 79. This represents an improvement in early enrichment of over 75% compared to a recent machine learning benchmark. Our results demonstrate that the continued improvements in machine learning architecture for computer vision apply to structure-based virtual screening.
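The two metrics quoted above can be illustrated with a minimal sketch. It assumes the common definition of ROC enrichment as the ratio of true positive rate to false positive rate at a fixed false positive rate (0.5% here); the abstract does not spell out the exact formula, so this may differ in detail from the authors' evaluation code. The function names `roc_auc` and `roc_enrichment` are hypothetical and not taken from the paper.

```python
def roc_auc(scores, labels):
    """AUC ROC as the probability that a random active outscores a random decoy.

    Pairwise (Mann-Whitney) formulation; tied scores count as half a win.
    Quadratic in the number of compounds, so only suitable for small examples.
    """
    actives = [s for s, y in zip(scores, labels) if y == 1]
    decoys = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for a in actives:
        for d in decoys:
            if a > d:
                wins += 1.0
            elif a == d:
                wins += 0.5
    return wins / (len(actives) * len(decoys))


def roc_enrichment(scores, labels, fpr_level=0.005):
    """Assumed definition: TPR / FPR at the point where a fraction
    `fpr_level` of the decoys has been retrieved.

    Compounds are ranked by descending score; ties are broken by input order.
    """
    n_actives = sum(labels)
    n_decoys = len(labels) - n_actives
    # Number of decoys allowed at this FPR (small epsilon guards float rounding).
    max_fp = int(fpr_level * n_decoys + 1e-9)
    if max_fp < 1:
        raise ValueError("too few decoys to resolve the requested FPR level")

    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    tp = fp = 0
    for _, label in ranked:
        if label == 1:
            tp += 1
        else:
            if fp == max_fp:
                break  # the next decoy would exceed the allowed FPR
            fp += 1

    tpr = tp / n_actives
    fpr = fp / n_decoys
    return tpr / fpr
```

Under this definition, the largest value attainable at a 0.5% false positive rate is 1 / 0.005 = 200, which is reached only when every active is ranked above all but the first 0.5% of decoys.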

Authors

  • Fergus Imrie
    Oxford Protein Informatics Group, Department of Statistics, University of Oxford, Oxford OX1 3LB, U.K.
  • Anthony R Bradley
    Structural Genomics Consortium, University of Oxford, Oxford OX3 7DQ, U.K.
  • Mihaela van der Schaar
    University of California, Los Angeles, CA, USA.
  • Charlotte M Deane
    Oxford Protein Informatics Group, Department of Statistics, University of Oxford, Oxford, United Kingdom.