Developing machine learning approaches to identify candidate persistent, mobile and toxic (PMT) and very persistent and very mobile (vPvM) substances based on molecular structure.

Journal: Water Research
Published Date:

Abstract

Determining which substances on the global market could be classified as persistent, mobile and toxic (PMT) or very persistent and very mobile (vPvM) is essential to prevent or reduce their contamination of drinking water. This study developed machine learning models based on different molecular descriptors (MDs) and defined applicability domains for screening PMT/vPvM substances. The models were trained on 3111 substances with expert weight-of-evidence-based PMT/vPvM hazard classifications that considered the highest-quality data available. The models were based on the hypothesis that PMT/vPvM substances share similar MDs that are representative of chemical structures resistant to degradation, are associated with low sorption (or high water solubility) and, in some cases, are associated with known toxic mechanisms. All possible model combinations were tested by integrating different molecular description methods, data balancing strategies and machine learning algorithms. Our model allows one-step prediction of candidate PMT/vPvM substances, and it was compared with an approach that predicts P, M and T separately (i.e. three-step prediction). The results showed that the one-step model achieved a higher accuracy of 92% for PMT/vPvM identification (i.e. positive samples) on an internal test set, and a higher accuracy of 90% on an external test set of chemical pollutants detected in Taihu Lake, China. Furthermore, the prediction mechanism of the model was interpreted using Shapley additive explanations (SHAP). This work advances big-data in silico screening models for identifying substances that potentially meet the PMT/vPvM criteria.
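
As a rough illustration of the evaluation design described in the abstract, the Python sketch below enumerates every combination of descriptor set, balancing strategy and algorithm, computes accuracy on positive samples only, and shows how separate P, M and T predictions might be merged in a three-step approach. All option names and function names are hypothetical placeholders, since the abstract does not list the actual choices, and the merging rule is a simplifying assumption that covers only the PMT case (a vPvM classification does not depend on T).

```python
from itertools import product

# Hypothetical option names; the abstract does not list the actual choices.
DESCRIPTORS = ["descriptor_set_a", "descriptor_set_b"]
BALANCING = ["none", "oversample", "undersample"]
ALGORITHMS = ["algorithm_x", "algorithm_y"]


def model_combinations(descriptors, balancing, algorithms):
    """Return every (descriptor, balancing, algorithm) triple to be tested."""
    return list(product(descriptors, balancing, algorithms))


def positive_accuracy(y_true, y_pred):
    """Fraction of truly positive samples (label 1) that were predicted as 1."""
    positives = [p for t, p in zip(y_true, y_pred) if t == 1]
    if not positives:
        raise ValueError("y_true contains no positive samples")
    return sum(positives) / len(positives)


def three_step_label(p, m, t):
    """Assumed merging rule: a substance is PMT only if P, M and T are all predicted."""
    return int(p and m and t)
```

Under these assumptions, each triple returned by `model_combinations` would be trained and scored with `positive_accuracy`, and the three-step baseline would be scored on labels produced by `three_step_label`. A one-step model instead predicts the combined label directly, so it does not need a merging rule.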

Authors

  • Min Han
    National Laboratory of Solid State Microstructures, College of Engineering and Applied Sciences, Nanjing University, 22 Hankou Road, Nanjing 210093, P. R. China.
  • Biao Jin
    State Key Laboratory of Organic Geochemistry, Guangzhou Institute of Geochemistry, Chinese Academy of Sciences, Guangzhou, 510640, China; CAS Center for Excellence in Deep Earth Science, Guangzhou, 510640, China; University of Chinese Academy of Sciences, Beijing, 10069, China. Electronic address: jinbiao@gig.ac.cn.
  • Jun Liang
    Department of AI and IT, Second Affiliated Hospital, School of Medicine, Zhejiang University, Hangzhou, Zhejiang, People's Republic of China.
  • Chen Huang
    Department of Pharmacy, The First Affiliated Hospital, Fujian Medical University, Fuzhou, China.
  • Hans Peter H Arp
    Norwegian Geotechnical Institute (NGI), P.O. Box 3930 Ullevaal Stadion, Oslo, N-0806, Norway; Norwegian University of Science and Technology (NTNU), Trondheim, NO-7491, Norway.