Impact of Chemist-In-The-Loop Molecular Representations on Machine Learning Outcomes.

Journal: Journal of Chemical Information and Modeling
Published Date:

Abstract

The development of molecular descriptors is a central challenge in cheminformatics. Most approaches rely either on algorithms that extract atomic environments or on end-to-end machine learning. However, an open question is how these approaches compare with the critical eye of trained chemists. The CAS fingerprint engages expert chemists to curate chemical motifs that they judge could influence bioactivity. In this paper, we benchmark the CAS fingerprint against commonly used fingerprints on a well-established benchmark set of 88 targets. We show that the CAS fingerprint outperforms most of the commonly used molecular fingerprints. Analysis of the CAS fingerprint reveals that experts tend to select features that are rarely reported in the literature, though not all rare features are selected. Our analysis also shows that the CAS fingerprint provides a source of information distinct from that of other commonly used fingerprints. These results suggest that human-derived insights do have predictive power and highlight the importance of a chemist-in-the-loop approach in the era of machine learning.

Authors

  • Todd J Wills
    CAS, P.O. Box 3012, Columbus, Ohio 43210-0012, United States.
  • Dmitrii A Polshakov
    CAS, P.O. Box 3012, Columbus, Ohio 43210-0012, United States.
  • Matthew C Robinson
    Department of Physics, J J Thomson Avenue, Cambridge, CB3 0HE, UK.
  • Alpha A Lee
    Department of Physics, J J Thomson Avenue, Cambridge, CB3 0HE, UK. aal44@cam.ac.uk.