Coverage bias in small molecule machine learning.

Journal: Nature Communications
PMID:

Abstract

Small molecule machine learning aims to predict chemical, biochemical, or biological properties from molecular structures, with applications such as toxicity prediction, ligand binding, and pharmacokinetics. A recent trend is to develop end-to-end models that avoid explicit domain knowledge. These models assume that the training and evaluation data have no coverage bias, meaning that the data are representative of the true distribution. However, the domain of applicability is rarely considered in such models. Here, we investigate how well large-scale datasets cover the space of known biomolecular structures. To do so, we propose a distance measure based on solving the Maximum Common Edge Subgraph (MCES) problem, which aligns well with chemical similarity. Although the MCES problem is computationally hard, we introduce an efficient approach that combines Integer Linear Programming with heuristic bounds. Our findings reveal that many widely used datasets lack uniform coverage of biomolecular structures, which limits the predictive power of models trained on them. We propose two additional methods to assess whether training datasets diverge from known molecular distributions, potentially guiding future dataset creation to improve model performance.
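
The "cheap bound first, exact computation only when needed" idea mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on several assumptions. The distance is taken to be |E1| + |E2| - 2|MCES|, which the abstract does not state. The exact step is a brute-force search that stands in for the Integer Linear Programming step. The bound is a simple count of shared bond types that stands in for the heuristic bounds. Molecules are toy graphs given as (atom labels, bond list), and all function names and the threshold parameter are hypothetical.

```python
from collections import Counter

def edge_label_counts(labels, edges):
    # Count bonds by the unordered pair of atom labels at their ends.
    return Counter(tuple(sorted((labels[u], labels[v]))) for u, v in edges)

def lower_bound(mol_a, mol_b):
    # A common subgraph cannot contain more bonds of a given type than
    # either molecule has, so this never exceeds the exact distance.
    (labels_a, edges_a), (labels_b, edges_b) = mol_a, mol_b
    shared = sum((edge_label_counts(labels_a, edges_a)
                  & edge_label_counts(labels_b, edges_b)).values())
    return len(edges_a) + len(edges_b) - 2 * shared

def exact_mces_size(mol_a, mol_b):
    # Brute force over partial, label-preserving, injective atom mappings.
    # Exponential time: only usable for very small graphs (assumed stand-in
    # for an ILP solver).
    (labels_a, edges_a), (labels_b, edges_b) = mol_a, mol_b
    edge_set_b = {frozenset(e) for e in edges_b}
    mapping, used, best = {}, set(), 0

    def preserved():
        # Bonds of A whose mapped endpoints are also bonded in B.
        return sum(1 for u, v in edges_a
                   if u in mapping and v in mapping
                   and frozenset((mapping[u], mapping[v])) in edge_set_b)

    def extend(i):
        nonlocal best
        if i == len(labels_a):
            best = max(best, preserved())
            return
        extend(i + 1)  # leave atom i unmapped
        for j, label in enumerate(labels_b):
            if j not in used and label == labels_a[i]:
                mapping[i] = j
                used.add(j)
                extend(i + 1)
                del mapping[i]
                used.discard(j)

    extend(0)
    return best

def bounded_distance(mol_a, mol_b, threshold):
    # Return (distance, is_exact). If the cheap bound already reaches the
    # threshold, skip the expensive exact computation and report the bound.
    bound = lower_bound(mol_a, mol_b)
    if bound >= threshold:
        return bound, False
    size = exact_mces_size(mol_a, mol_b)
    return len(mol_a[1]) + len(mol_b[1]) - 2 * size, True

# Toy heavy-atom graphs (hydrogens omitted).
ethanol = (["C", "C", "O"], [(0, 1), (1, 2)])
propane = (["C", "C", "C"], [(0, 1), (1, 2)])
butane = (["C", "C", "C", "C"], [(0, 1), (1, 2), (2, 3)])
isobutane = (["C", "C", "C", "C"], [(0, 1), (0, 2), (0, 3)])

print(bounded_distance(ethanol, propane, threshold=10))
print(bounded_distance(butane, isobutane, threshold=10))
```

For butane and isobutane the bond-type bound is 0 although the two graphs differ, so the exact step is what separates them. With a low threshold, clearly dissimilar pairs are answered by the bound alone and never reach the expensive step.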

Authors

  • Fleming Kretschmer
    Bioinformatics, Friedrich Schiller University Jena, 07743 Jena, Germany.
  • Jan Seipp
    Algorithmic Bioinformatics, Institute for Computer Science, Heinrich Heine University Düsseldorf, Düsseldorf, Germany.
  • Marcus Ludwig
    Chair for Bioinformatics, Institute for Computer Science, Friedrich Schiller University Jena, Jena, Germany.
  • Gunnar W Klau
    Algorithmic Bioinformatics, Institute for Computer Science, Heinrich Heine University Düsseldorf, Düsseldorf, Germany.
  • Sebastian Böcker
    Chair for Bioinformatics, Friedrich Schiller University, 07743 Jena, Germany; sebastian.boecker@uni-jena.de.