Exploration of chemical space with partial labeled noisy student self-training and self-supervised graph embedding.

Journal: BMC bioinformatics
Published Date:

Abstract

BACKGROUND: Drug discovery is time-consuming and costly. Machine learning, especially deep learning, shows great potential in quantitative structure-activity relationship (QSAR) modeling to accelerate the drug discovery process and reduce its cost. A major challenge in developing robust and generalizable deep learning models for QSAR is the scarcity of data with high-quality, balanced labels. To address this challenge, we developed a self-training method, Partially LAbeled Noisy Student (PLANS), and a novel self-supervised graph embedding, Graph-Isomorphism-Network Fingerprint (GINFP), which uses unlabeled data to learn chemical compound representations that carry substructure information. The representations can be used to predict chemical properties such as binding affinity and toxicity. PLANS-GINFP allows us to exploit millions of unlabeled chemical compounds, as well as labeled and partially labeled pharmacological data, to improve the generalizability of neural network models.
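
The self-training idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a multi-task logistic regression in place of a neural network on GINFP inputs, uses Gaussian input noise as a stand-in for the noise applied to the student, and marks missing labels with NaN. All function names (`train_masked_logreg`, `fill_missing`, `noisy_student`) are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_masked_logreg(X, Y, noise_std=0.0, epochs=200, lr=0.5, seed=0):
    """Multi-task logistic regression; NaN entries in Y are left out of the loss."""
    rng = np.random.default_rng(seed)
    W = np.zeros((X.shape[1], Y.shape[1]))
    b = np.zeros(Y.shape[1])
    mask = ~np.isnan(Y)                      # True where a label is available
    targets = np.where(mask, Y, 0.0)
    denom = np.maximum(mask.sum(axis=0), 1)  # labeled examples per task
    for _ in range(epochs):
        # Assumption: Gaussian input noise stands in for the student's noise.
        Xn = X + rng.normal(0.0, noise_std, X.shape) if noise_std > 0 else X
        P = sigmoid(Xn @ W + b)
        G = (P - targets) * mask             # cross-entropy gradient, masked
        W -= lr * (Xn.T @ G) / denom
        b -= lr * G.sum(axis=0) / denom
    return W, b

def predict_proba(X, params):
    W, b = params
    return sigmoid(X @ W + b)

def fill_missing(Y, P):
    """Keep observed labels; replace NaN entries with the teacher's soft labels."""
    return np.where(np.isnan(Y), P, Y)

def noisy_student(X_lab, Y_partial, X_unlab, rounds=2, noise_std=0.1):
    # 1. Train a teacher on the (partially) labeled data only, without noise.
    teacher = train_masked_logreg(X_lab, Y_partial)
    X_all = np.vstack([X_lab, X_unlab])
    Y_unlab = np.full((X_unlab.shape[0], Y_partial.shape[1]), np.nan)
    Y_all = np.vstack([Y_partial, Y_unlab])
    for _ in range(rounds):
        # 2. The teacher pseudo-labels every missing entry, including unlabeled rows.
        pseudo = fill_missing(Y_all, predict_proba(X_all, teacher))
        # 3. A noised student is trained on true plus pseudo labels.
        student = train_masked_logreg(X_all, pseudo, noise_std=noise_std)
        # 4. The student becomes the teacher for the next round.
        teacher = student
    return teacher

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X_lab = rng.normal(size=(40, 5))
    Y = (X_lab @ rng.normal(size=(5, 3)) > 0).astype(float)
    Y[::3, 0] = np.nan                       # simulate partially labeled rows
    X_unlab = rng.normal(size=(100, 5))
    model = noisy_student(X_lab, Y, X_unlab)
    print(predict_proba(X_lab[:3], model).round(2))
```

In this sketch, observed labels are never overwritten: only the missing entries are filled with teacher predictions, which is one simple way to combine labeled, partially labeled, and unlabeled compounds in a single training set.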

Authors

  • Yang Liu
    Department of Computer Science, Hong Kong Baptist University, Hong Kong, China.
  • Hansaim Lim
  • Lei Xie
    Ph.D. Program in Computer Science, The City University of New York, New York, NY, United States.