The topology of molecular representations and its influence on machine learning performance.

Journal: Journal of Cheminformatics
Published Date:

Abstract

Advancements in cheminformatics have led to numerous methods for encoding molecules numerically. The choice of molecular representation impacts the accuracy and generalizability of learning algorithms applied to chemical datasets. Designing and selecting an appropriate representation often lacks a systematic approach and instead relies on computationally exhaustive empirical testing. Moreover, research has shown that deep learning models do not substantially outperform traditional approaches across many tasks, with no clear explanation for this shortfall. In this work, we present TopoLearn, a model that predicts the effectiveness of representations on datasets based on the topological characteristics of the corresponding feature space. Using interpretability techniques, we find that persistent homology descriptors are linked with the error metrics of trained machine learning models, offering a new method to better understand and select molecular representations.

Scientific contribution: Our research is the first to establish an empirical connection between the topology of feature spaces and the machine learning performance of molecular representations. In addition, we facilitate future research endeavors by providing open access to our developed model.
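
As a rough illustration of what a persistent homology descriptor of a feature space can look like, the sketch below computes 0-dimensional persistence for a small point cloud. It is not the TopoLearn implementation, and the abstract does not specify which descriptors the authors use. The function names `h0_death_times` and `h0_summary` are hypothetical, and the summary statistics are chosen only for illustration. The sketch relies on a standard fact: in a Vietoris-Rips filtration, every connected component is born at scale 0, and the scales at which components merge are the edge lengths of a minimum spanning tree. The convention assumed here is that an edge enters the filtration when the scale equals the pairwise Euclidean distance.

```python
import math
from itertools import combinations


def h0_death_times(points):
    """Death times of finite 0-dimensional persistence bars.

    Assumption: an edge enters the Vietoris-Rips filtration at the
    Euclidean distance between its endpoints. All components are born
    at 0, so each bar is described by its death time alone.
    """
    n = len(points)
    parent = list(range(n))  # union-find forest over the points

    def find(i):
        # Find the root of i, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All pairwise edges, sorted by length (Kruskal's algorithm).
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )

    deaths = []
    for dist, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            # Two components merge: one bar dies at this scale.
            parent[root_i] = root_j
            deaths.append(dist)
    return deaths  # n - 1 values, in ascending order


def h0_summary(points):
    """Illustrative scalar descriptors derived from the H0 bars."""
    deaths = h0_death_times(points)
    return {
        "n_finite_bars": len(deaths),
        "total_persistence": sum(deaths),
        "max_persistence": max(deaths) if deaths else 0.0,
    }
```

In a setting like the one the abstract describes, `points` would be the rows of a molecular feature matrix (for example, fingerprint or embedding vectors), and scalar summaries of this kind could serve as inputs to a model that predicts downstream error. Higher-dimensional homology (loops, voids) requires a dedicated library and is not covered by this sketch.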

Authors

  • Florian Rottach
    Boehringer Ingelheim Pharma GmbH & Co. KG, 88397 Biberach, Germany.
  • Sebastian Schieferdecker
    Boehringer Ingelheim Pharma GmbH & Co. KG, 88397 Biberach, Germany.
  • Carsten Eickhoff
    Department of Computer Science, ETH Zurich, Zurich, Switzerland; Center for Biomedical Informatics, Brown University, Providence, RI, USA.
