Data Scaling and Generalization Insights for Medicinal Chemistry Deep Learning Models.

Journal: Journal of Chemical Information and Modeling

Published Date: Jun 2, 2025

Abstract

Predictive models hold considerable promise in enabling the faster discovery of safer, more efficacious therapeutics. To better understand and improve the performance of small-molecule predictive models for drug discovery, we conduct multiple experiments with deep learning and traditional machine learning approaches, leveraging our large internal data sets as well as publicly available data sets. The experiments include assessing performance on random, temporal, and reverse-temporal data ablation tasks as well as tasks testing model extrapolation to different property spaces. We identify factors that contribute to the higher performance of predictive models built using graph neural networks compared to traditional methods such as XGBoost and random forest. These insights were successfully used to develop a scaling relationship that explains 81% of the variance in model performance across various assays and data regimes. This relationship can be used to estimate the performance of models for ADMET (absorption, distribution, metabolism, excretion, and toxicity) end points, as well as for drug discovery assay data more broadly. The findings offer guidance for further improving model performance in drug discovery.
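The abstract names three data ablation schemes (random, temporal, and reverse-temporal) without defining them. The following is a minimal sketch of one plausible reading, not the authors' procedure. It assumes that each training record carries a timestamp, that "temporal" ablation keeps the earliest records, and that "reverse-temporal" ablation keeps the most recent ones. The `Record` type, the `ablate` function, and its parameters are hypothetical names introduced for illustration only.

```python
import random
from typing import List, Tuple

# Hypothetical record layout: (timestamp, measured assay value).
Record = Tuple[int, float]


def ablate(records: List[Record], fraction: float, mode: str, seed: int = 0) -> List[Record]:
    """Return a reduced training set containing `fraction` of `records`.

    Assumed semantics (not confirmed by the source):
      "random"           -> uniform random subset
      "temporal"         -> earliest records by timestamp
      "reverse_temporal" -> most recent records by timestamp
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    n_keep = max(1, round(len(records) * fraction))
    by_time = sorted(records, key=lambda r: r[0])

    if mode == "random":
        rng = random.Random(seed)  # seeded for reproducibility
        return rng.sample(records, n_keep)
    if mode == "temporal":
        return by_time[:n_keep]
    if mode == "reverse_temporal":
        return by_time[-n_keep:]
    raise ValueError(f"unknown ablation mode: {mode}")


if __name__ == "__main__":
    # Ten synthetic records with timestamps 0..9 (illustrative data only).
    data = [(t, float(t)) for t in range(10)]
    for mode in ("random", "temporal", "reverse_temporal"):
        print(mode, ablate(data, 0.3, mode))
```

Training the same model on each reduced set and scoring it on a fixed held-out set would give one performance value per scheme and data fraction, which is the kind of measurement a scaling relationship across data regimes could be fitted to.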

Authors

Jacky Chen

Discipline of Medical Imaging Sciences, Sydney School of Health Sciences, Faculty of Medicine and Health, University of Sydney, Camperdown, NSW 2006, Australia; Medical Imaging Optimisation Perception Group, Discipline of Medical Imaging Sciences, Sydney School of Health Sciences, Faculty of Medicine and Health, University of Sydney, Camperdown, NSW 2006, Australia. Electronic address: jche5218@uni.sydney.edu.au.
Yunsie Chung

Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States.
Jonathan Tynan

Modeling & Informatics, Merck & Co., Inc., South San Francisco, California 94080, United States.
Chen Cheng

Stomatology Hospital, School of Stomatology, Zhejiang University School of Medicine, Zhejiang Provincial Clinical Research Center for Oral Diseases, Key Laboratory of Oral Biomedical Research of Zhejiang Province, Cancer Center of Zhejiang University, Hangzhou 310006, China.
Song Yang

Key Laboratory of Pesticide Toxicology&Application Technique, College of Plant Protection, Shandong Agricultural University, Tai'an 271018, China.
Alan C Cheng

Computational and Structural Chemistry, Merck & Co., Inc., South San Francisco, California 94080, United States.

Keywords

Chemistry, Pharmaceutical; Deep Learning; Drug Discovery; Humans; Neural Networks, Computer

External Resources

PubMed ID: 40454949
