Knowledge-based BERT: a method to extract molecular features like computational chemists.

Journal: Briefings in bioinformatics
Published Date:

Abstract

Molecular property prediction models based on machine learning algorithms have become important tools for triaging unpromising lead molecules in the early stages of drug discovery. Compared with the mainstream descriptor- and graph-based methods for molecular property prediction, SMILES-based methods can extract molecular features directly from SMILES without human expert knowledge, but they require more powerful algorithms for feature extraction and a larger amount of training data, which makes them less popular. Here, we show the great potential of pre-training for improving the prediction of important pharmaceutical properties. Using three pre-training tasks based on atom feature prediction, molecular feature prediction and contrastive learning, we developed a new pre-training method, K-BERT, which can extract chemical information from SMILES as chemists do. The results on 15 pharmaceutical datasets show that K-BERT outperforms well-established descriptor-based (XGBoost) and graph-based (Attentive FP and HRGCN+) models. In addition, we found that the contrastive learning pre-training task enables K-BERT to 'understand' SMILES beyond canonical SMILES. Moreover, the general fingerprints K-BERT-FP generated by K-BERT exhibit predictive power comparable to that of MACCS on the 15 pharmaceutical datasets and can also capture molecular size and chirality information that traditional binary fingerprints cannot capture. Our results illustrate the great potential of K-BERT in practical applications of molecular property prediction in drug discovery.
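
The idea behind the contrastive learning task can be illustrated with a small sketch. The abstract does not state which loss K-BERT uses, so an InfoNCE-style loss is swapped in here purely as an assumption, and the two-dimensional embeddings are hand-written, hypothetical values rather than model outputs. The only chemical fact relied on is that "CCO", "OCC" and "C(O)C" are all valid SMILES strings for ethanol, while "c1ccccc1" is benzene.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss: low when the anchor is closer to the positive
    than to the negatives. Assumed here for illustration only; it is not
    confirmed to be the loss used by K-BERT."""
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))


# Different SMILES strings that all describe ethanol.
ethanol_smiles = ["CCO", "OCC", "C(O)C"]

# Hypothetical embeddings (hand-written, not produced by any model).
emb = {
    "CCO": [1.0, 0.0],
    "OCC": [0.9, 0.1],       # another ethanol SMILES, placed near "CCO"
    "c1ccccc1": [0.0, 1.0],  # benzene, placed far from ethanol
}

# Positive pair = two SMILES of the same molecule; negative = another molecule.
aligned = info_nce(emb["CCO"], emb["OCC"], [emb["c1ccccc1"]])

# Deliberately wrong pairing, to show that the loss penalizes it.
misaligned = info_nce(emb["CCO"], emb["c1ccccc1"], [emb["OCC"]])

print(f"aligned loss:    {aligned:.4f}")
print(f"misaligned loss: {misaligned:.4f}")
```

Under these assumptions the loss is small when alternative SMILES of the same molecule map to nearby embeddings and large otherwise, which is the kind of pressure that would push an encoder toward representations that do not depend on the canonical form.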

Authors

  • Zhenxing Wu
  • Dejun Jiang
    Innovation Institute for Artificial Intelligence in Medicine, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, Zhejiang, P. R. China.
  • Jike Wang
    School of Computer Science, Wuhan University, Wuhan, Hubei 430072, China.
  • Xujun Zhang
    Injury Prevention Research Institute, Department of Epidemiology and Biostatistics, School of Public Health, Southeast University, Nanjing, Jiangsu Province, China.
  • Hongyan Du
    Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, Zhejiang, P. R. China.
  • Lurong Pan
    Global Health Drug Discovery Institute, Beijing 100192, P. R. China.
  • Chang-Yu Hsieh
    Tencent Quantum Laboratory, Tencent, Shenzhen 518057, Guangdong, P. R. China.
  • Dongsheng Cao
    School of Pharmaceutical Sciences, Central South University, Changsha, China. oriental-cds@163.com.
  • Tingjun Hou
    College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, Zhejiang 310058, China.