Efficient Exploration of Chemical Compound Space Using Active Learning for Prediction of Thermodynamic Properties of Alkane Molecules.

Journal: Journal of chemical information and modeling
Published Date:

Abstract

We introduce an exploratory active learning (AL) algorithm that uses Gaussian process regression with a marginalized graph kernel (GPR-MGK) to sample chemical compound space (CCS) at minimal cost. Targeting 251,728 enumerated alkane molecules with 4 to 19 carbon atoms, we applied the AL algorithm to select a diverse and representative set of molecules and then conducted high-throughput molecular simulations on the selected molecules. To demonstrate the power of the AL algorithm, we built directed message-passing neural networks (D-MPNN), using the simulation data as the training set, to predict liquid densities, heat capacities, and vaporization enthalpies across the CCS. Validations show that D-MPNN models built on the smallest training set considered in this work, which consists of 313 molecules or 0.124% of the original CCS, predict the properties with R² > 0.99 against the computational data and R² > 0.94 against the experimental data. The advantage of the presented AL algorithm is that the predicted uncertainty of GPR depends only on the molecular structures, which makes it compatible with high-throughput data generation.
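
The closing point of the abstract, that the GPR predictive uncertainty depends only on the inputs and not on the property values, can be illustrated with a minimal sketch. It is not the authors' implementation: an RBF kernel on random feature vectors stands in for the marginalized graph kernel on molecular graphs, the greedy "pick the most uncertain candidate" loop is one common way to realize uncertainty-driven selection, and all names (rbf_kernel, posterior_variance, select_by_uncertainty, X_pool) are hypothetical.

```python
import numpy as np


def rbf_kernel(A, B, length_scale=1.0):
    # Stand-in kernel (assumption): the paper uses a marginalized graph
    # kernel between molecular graphs, not an RBF kernel on vectors.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * d2 / length_scale**2)


def posterior_variance(X_pool, X_sel, noise=1e-6):
    # GPR posterior variance: k(x, x) - k(x, S) [K_SS + noise*I]^-1 k(S, x).
    # No property values (y) appear here, only the inputs.
    K_ss = rbf_kernel(X_sel, X_sel) + noise * np.eye(len(X_sel))
    K_ps = rbf_kernel(X_pool, X_sel)
    solved = np.linalg.solve(K_ss, K_ps.T)
    # k(x, x) = 1 for this RBF kernel.
    return 1.0 - np.einsum("ij,ji->i", K_ps, solved)


def select_by_uncertainty(X_pool, n_select, seed_index=0):
    # Greedy loop: repeatedly add the candidate with the largest variance.
    selected = [seed_index]
    while len(selected) < n_select:
        var = posterior_variance(X_pool, X_pool[selected])
        var[selected] = -np.inf  # never pick the same candidate twice
        selected.append(int(np.argmax(var)))
    return selected


# Hypothetical pool of 200 candidates described by 4 made-up features.
rng = np.random.default_rng(0)
X_pool = rng.normal(size=(200, 4))
picked = select_by_uncertainty(X_pool, n_select=10)
```

Because no property value enters posterior_variance, the whole selection can be completed before any simulation is run, and the simulations for the selected candidates can then be launched as a single batch.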

Authors

  • Yan Xiang
    School of Chemistry and Chemical Engineering, Shanghai Jiao Tong University, Shanghai 200240, China.
  • Yu-Hang Tang
    Computational Research Division, Lawrence Berkeley National Laboratory, Berkeley, California 94720, United States.
  • Zheng Gong
    Sino-Cellbiomed Institutes of Medical Cell & Pharmaceutical Proteins Qingdao University, Qingdao, Shandong, China. xblong2000@gmail.com.
  • Hongyi Liu
    Department of General Surgery II, the First Medical Center of Chinese PLA General Hospital, Fuxing Road, Haidian District, Beijing, China.
  • Liang Wu
    Clinical and Research Center of AIDS, Beijing Ditan Hospital, Capital Medical University, Beijing, China.
  • Guang Lin
    Zhejiang Cancer Hospital, Hangzhou Institute of Medicine (HIM), Chinese Academy of Sciences, Hangzhou, Zhejiang, China.
  • Huai Sun
    School of Chemistry and Chemical Engineering, Materials Genome Initiative Center, and Key Laboratory of Scientific and Engineering Computing of Ministry of Education, Shanghai Jiao Tong University, Shanghai, China 200240.