Multi-channel GCN ensembled machine learning model for molecular aqueous solubility prediction on a clean dataset.

Journal: Molecular Diversity
Published Date:

Abstract

This study presents a new aqueous solubility dataset and a solubility regression model that ensembles a graph convolutional neural network (GCN) with machine learning models. Aqueous solubility is a key physicochemical property of small molecules in drug discovery. Over the past few decades, there have been many studies on solubility prediction. However, many of these studies report a high root mean squared error (RMSE), and their datasets often contain salt compounds and solubility data obtained under different experimental conditions. In this paper, we constructed a clean dataset of 2609 compounds, which is small but contains only salt-free solubility records measured at the same temperature (25 °C). We applied a GCN to construct an aqueous solubility prediction model. To enhance the performance of the model, molecular MACCS key fingerprints and physicochemical descriptors were combined with the GCN to build a multi-channel model. Additionally, we built two machine learning models (support vector regression and gradient boosting decision tree) and ensembled them with the GCN model to improve the RMSE (RMSE = 0.665). Finally, comparative experiments showed that our framework achieved the best performance on the ESOL dataset (RMSE = 0.56, RMSE = 0.44) and surpassed four established software tools in predicting the aqueous solubility of new compounds.
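As an illustration of the two ideas the abstract relies on, the sketch below computes RMSE and combines the predictions of several regressors. The abstract does not state how the GCN, support vector regression, and gradient boosting models are combined, so the unweighted average used here is an assumption, and the function names and logS values are hypothetical placeholders rather than anything taken from the paper.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def average_ensemble(*model_predictions):
    """Unweighted mean of per-model predictions.

    Assumption: the paper's actual combination rule is not given
    in the abstract; a plain average is used for illustration.
    """
    return [sum(preds) / len(preds) for preds in zip(*model_predictions)]


# Hypothetical logS values, for illustration only (not data from the paper).
y_true = [-2.0, -3.5, -1.0, -4.2]
gcn_pred = [-2.4, -3.0, -1.3, -4.0]
svr_pred = [-1.7, -3.9, -0.6, -4.6]
gbdt_pred = [-2.2, -3.4, -1.4, -3.9]

ensemble_pred = average_ensemble(gcn_pred, svr_pred, gbdt_pred)

for name, pred in [
    ("GCN", gcn_pred),
    ("SVR", svr_pred),
    ("GBDT", gbdt_pred),
    ("ensemble", ensemble_pred),
]:
    print(f"{name}: RMSE = {rmse(y_true, pred):.3f}")
```

In this toy data the individual models err in different directions, so averaging cancels part of the error. That cancellation is not guaranteed in general: an unweighted average is never worse than the worst member in RMSE, but it can be worse than the best one.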

Authors

  • Chenglong Deng
    Laboratory of Molecular Design and Drug Discovery, School of Science, China Pharmaceutical University, 639 Longmian Avenue, Nanjing 211198, China.
  • Li Liang
    Duke Clinical Research Institute, Duke University, Durham, North Carolina.
  • GuoMeng Xing
    Laboratory of Molecular Design and Drug Discovery, School of Science, China Pharmaceutical University, 639 Longmian Avenue, Nanjing 211198, China.
  • Yi Hua
    Laboratory of Molecular Design and Drug Discovery, School of Science, China Pharmaceutical University, 639 Longmian Avenue, Nanjing 211198, China.
  • Tao Lu
    Laboratory of Molecular Design and Drug Discovery, School of Science, China Pharmaceutical University, Nanjing, China.
  • Yanmin Zhang
    Department of Paediatric Cardiology, Shaanxi Institute for Pediatric Diseases, Affiliate Children's Hospital of Xi'an Jiaotong University, Xi'an, China.
  • Yadong Chen
    Laboratory of Molecular Design and Drug Discovery, School of Science, China Pharmaceutical University, 639 Longmian Avenue, Nanjing, 211198 Jiangsu, China.
  • Haichun Liu
    Laboratory of Molecular Design and Drug Discovery, School of Science, China Pharmaceutical University, 639 Longmian Avenue, Nanjing, 211198 Jiangsu, China.