Deciphering Molecular Embeddings with Centered Kernel Alignment.

Journal: Journal of Chemical Information and Modeling
PMID:

Abstract

Analyzing machine learning models, especially nonlinear ones, poses significant challenges. In this context, centered kernel alignment (CKA) has emerged as a promising model analysis tool that assesses the similarity between two embeddings. CKA's efficacy depends on selecting a kernel that adequately captures the underlying properties of the compared models. CKA was designed for neural networks (NNs), with their invariance to data rotation in mind, and has been successfully employed in various scientific domains. However, CKA has rarely been adopted in cheminformatics, partly because of the popularity of the random forest (RF) machine learning algorithm, which is not rotationally invariant. In this work, we present an adaptation of CKA that builds on the RF kernel to match the properties of RF. As part of the method validation, we show that the resulting similarity measure correlates well with the prediction similarity of RF models. Furthermore, we demonstrate how CKA with the RF kernel can be used to analyze and explain the behavior of RF models derived from molecular and rooted fingerprints.
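
To make the idea of comparing two embeddings through a kernel concrete, the following is a minimal sketch of CKA computed from two Gram matrices, together with one common form of RF kernel. It is an illustration under stated assumptions, not the authors' implementation. The function names (center_gram, cka, leaf_proximity_kernel) are hypothetical, and the RF kernel is assumed here to be the usual leaf-proximity definition: the fraction of trees in which two samples fall into the same leaf. The abstract does not specify the exact kernel used in the paper. The leaf indices would typically come from a fitted forest, for example via the apply method of a scikit-learn random forest; a small hand-written array stands in for them below.

```python
import numpy as np

def center_gram(K):
    """Double-center a Gram matrix: H K H with H = I - (1/n) 11^T."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def cka(K, L):
    """CKA between two Gram matrices built on the same n samples."""
    Kc, Lc = center_gram(K), center_gram(L)
    # Frobenius inner product of the centered matrices (unnormalized HSIC),
    # divided by the product of their Frobenius norms.
    return np.sum(Kc * Lc) / (np.linalg.norm(Kc) * np.linalg.norm(Lc))

def leaf_proximity_kernel(leaves):
    """Assumed RF kernel: fraction of trees in which two samples share a leaf.

    leaves: array of shape (n_samples, n_trees) holding, per tree,
    the index of the leaf each sample ends up in.
    """
    leaves = np.asarray(leaves)
    same_leaf = leaves[:, None, :] == leaves[None, :, :]
    return same_leaf.mean(axis=2)

# Toy leaf assignments for 3 samples in 2 trees (made-up illustration data).
leaves_a = [[0, 1], [0, 2], [3, 2]]
leaves_b = [[5, 5], [5, 6], [7, 6]]
K_a = leaf_proximity_kernel(leaves_a)
K_b = leaf_proximity_kernel(leaves_b)
print(cka(K_a, K_b))
```

With a linear kernel (K = X X^T), this quantity does not change when the embedding X is rotated, which suits NNs. A leaf-based kernel instead reflects how the trees partition the samples, which is the property the abstract argues is appropriate for RF.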

Authors

  • Matthias Welsch
    Department of Pharmaceutical Sciences, Division of Pharmaceutical Chemistry, Faculty of Life Sciences, University of Vienna, Josef-Holaubek-Platz 2, 1090 Vienna, Austria.
  • Steffen Hirte
    Center for Bioinformatics (ZBH), Department of Informatics, Faculty of Mathematics, Informatics and Natural Sciences, Universität Hamburg, 20146 Hamburg, Germany. steffen.hirte@studium.uni-hamburg.de.
  • Johannes Kirchmair
    Department of Pharmaceutical Sciences, Division of Pharmaceutical Chemistry, Faculty of Life Sciences, University of Vienna, Josef-Holaubek-Platz 2, 1090 Vienna, Austria.