TocoDecoy: A New Approach to Design Unbiased Datasets for Training and Benchmarking Machine-Learning Scoring Functions.

Journal: Journal of medicinal chemistry
Published Date:

Abstract

Development of accurate machine-learning-based scoring functions (MLSFs) for structure-based virtual screening against a given target requires a large unbiased dataset with structurally diverse actives and decoys. However, most datasets used for the development of MLSFs were designed for traditional SFs and may suffer from hidden biases and data insufficiency. Here, we developed a new approach named topology-based and conformation-based decoys generation (TocoDecoy), which integrates two strategies to generate decoys by tweaking the actives for a specific target, to generate unbiased and expandable datasets for training and benchmarking MLSFs. For hidden bias evaluation, the performance of InteractionGraphNet (IGN) trained on the TocoDecoy, LIT-PCBA, and DUD-E-like datasets was assessed. The results illustrate that the IGN model trained on the TocoDecoy dataset is competitive with that trained on the LIT-PCBA dataset but remarkably outperforms that trained on the DUD-E dataset, suggesting that the decoys in TocoDecoy are unbiased for training and benchmarking MLSFs.

Authors

  • Xujun Zhang
    Injury Prevention Research Institute, Department of Epidemiology and Biostatistics, School of Public Health, Southeast University, Nanjing, Jiangsu Province, China.
  • Chao Shen
    Department of Epidemiology, School of Public Health, Soochow University, Suzhou 215123, China.
  • Ben Liao
    Tencent Quantum Laboratory, Tencent, Shenzhen 518057, Guangdong, China.
  • Dejun Jiang
    Innovation Institute for Artificial Intelligence in Medicine, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, Zhejiang, P. R. China.
  • Jike Wang
    School of Computer Science, Wuhan University, Wuhan, Hubei 430072, China.
  • Zhenxing Wu
  • Hongyan Du
    Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, Zhejiang, P. R. China.
  • Tianyue Wang
    Key Laboratory of Drinking Water Science and Technology, Research Center for Eco-Environmental Sciences, Chinese Academy of Sciences, Beijing 100085, China; School of Chemical and Environmental Engineering, Beijing Campus, China University of Mining and Technology, Beijing 100083, China.
  • Wenbo Huo
    Tsinghua AI Drug Discovery group, Research Institute of Tsinghua University in Shenzhen, Shenzhen 518057, Guangdong, China.
  • Lei Xu
    Key Laboratory of Biomedical Information Engineering of the Ministry of Education, Department of Biomedical Engineering, School of Life Science and Technology, Xi'an Jiaotong University, Xi'an, China.
  • Dongsheng Cao
    School of Pharmaceutical Sciences, Central South University, Changsha, China. oriental-cds@163.com.
  • Chang-Yu Hsieh
    Tencent Quantum Laboratory, Tencent, Shenzhen 518057, Guangdong, P. R. China.
  • Tingjun Hou
    College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, Zhejiang 310058, China.