Improved Machine Learning Predictions of EC50s Using Uncertainty Estimation from Dose-Response Data.

Journal: Journal of Chemical Information and Modeling
Published Date:

Abstract

In early-stage drug design, machine learning models often rely on compressed representations of data, where raw experimental results are distilled into a single metric per molecule through curve fitting. This process discards valuable information about the quality of the curve fit. In this study, we incorporated a fit-quality metric into machine learning models to capture the reliability of metrics for individual molecules. Using 40 data sets from PubChem (public) and BASF (private), we demonstrated that including this quality metric can significantly improve predictive performance without additional experiments. Four methods were tested: random forests with parametric bootstrap, weighted random forests, variable output smearing random forests, and weighted support vector regression. When using fit-quality metrics, at least one of these methods led to a statistically significant improvement on 31 of the 40 data sets. In the best case, these methods led to a 22% reduction in the root-mean-squared error of the models. Overall, our results demonstrate that by adapting data processing to account for curve fit quality, we can improve predictive performance across a range of different data sets.
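
As a rough illustration of how a fit-quality metric might feed into the weighted and parametric-bootstrap approaches named above, the following minimal sketch assumes that each molecule comes with a fitted pEC50 value and a standard error from the curve fit. The inverse-variance weighting, the Gaussian noise model, the function names, and the example numbers are all assumptions made for illustration; the abstract does not specify the exact scheme used in the study.

```python
import random
import statistics


def inverse_variance_weights(std_errors, floor=1e-3):
    """Turn per-molecule fit standard errors into weights with mean 1.

    Assumption: weight is proportional to 1 / se^2; `floor` avoids
    division by zero for near-perfect fits.
    """
    raw = [1.0 / max(se, floor) ** 2 for se in std_errors]
    mean_raw = sum(raw) / len(raw)
    return [w / mean_raw for w in raw]


def parametric_bootstrap_labels(values, std_errors, n_replicates, seed=0):
    """Draw replicate label vectors, sampling each label from N(value, se).

    Assumption: the fit uncertainty is Gaussian around the fitted value.
    """
    rng = random.Random(seed)
    return [
        [rng.gauss(v, se) for v, se in zip(values, std_errors)]
        for _ in range(n_replicates)
    ]


# Hypothetical pEC50 estimates and curve-fit standard errors for four molecules.
pec50 = [6.2, 5.1, 7.4, 4.8]
fit_se = [0.05, 0.40, 0.10, 0.80]

weights = inverse_variance_weights(fit_se)
replicates = parametric_bootstrap_labels(pec50, fit_se, n_replicates=200)

# Per-molecule spread of the resampled labels: poorly fitted curves vary more.
spread = [statistics.stdev(column) for column in zip(*replicates)]
```

In a setup like this, the weights could be passed as `sample_weight` when fitting a scikit-learn `RandomForestRegressor` or `SVR`, while each bootstrap replicate could serve as the training labels for a separate model whose predictions are then averaged.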

Authors

  • Hugo Bellamy
    Department of Chemical Engineering and Biotechnology, University of Cambridge, Cambridge CB2 1TN, United Kingdom of Great Britain and Northern Ireland.
  • Joachim Dickhaut
    BASF, Ludwigshafen 67056, Germany.
  • Ross D King
    Department of Biology and Biological Engineering, Division of Systems and Synthetic Biology, Chalmers University of Technology, Kemivägen 10, SE-412 96 Gothenburg, Sweden.