Structural Bias in Three-Dimensional Autoregressive Generative Machine Learning of Organic Molecules.

Journal: Journal of chemical information and modeling

Published Date: Jul 14, 2025

Abstract

A range of generative machine learning models for the design of novel molecules and materials have been proposed in recent years. Models that can generate three-dimensional structures are particularly suitable for quantum chemistry workflows, enabling direct property prediction. The performance of generative models is typically assessed based on their ability to produce novel, valid, and unique molecules. However, equally important is their ability to learn the prevalence of functional groups and certain chemical moieties in the underlying training data, that is, to faithfully reproduce the chemical space spanned by the training data. Here, we investigate the ability of the autoregressive generative machine learning model G-SchNet to reproduce the chemical space and property distributions of training data sets composed of large, functional organic molecules. We assess the elemental composition, size- and bond-length distributions, as well as the functional group and chemical space distribution of training and generated molecules. By principal component analysis of the chemical space, we find that the model leads to a biased generation that is largely unaffected by the choice of hyperparameters or the training data set distribution, producing molecules that are, on average, less saturated and contain more heteroatoms. Purely aliphatic molecules are mostly absent in the generation. We further investigate generation with functional group constraints and based on composite data sets, which can help to partially remedy the model generation bias. Decision tree models can recognize the generation bias in the models and discriminate between training and generated data, revealing key chemical differences between the two sets. The chemical differences we find affect the distributions of electronic properties such as the HOMO-LUMO gap, which is a common target for functional molecule design.
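The abstract notes that decision tree models can discriminate between training and generated molecules. A minimal sketch of that idea is a depth-1 decision tree (a decision stump) that searches for the single descriptor threshold that best separates the two sets by Gini impurity. This is not the authors' implementation: the function names, the descriptor names, and all numbers below are hypothetical, synthetic values chosen only for illustration.

```python
from typing import List, Sequence, Tuple


def gini(labels: Sequence[int]) -> float:
    """Gini impurity of a collection of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def best_stump(X: List[Sequence[float]], y: Sequence[int]) -> Tuple[int, float, float]:
    """Return (feature index, threshold, weighted Gini) of the best single split.

    A feature index of -1 means no split improved on the unsplit data.
    """
    n = len(y)
    best = (-1, 0.0, gini(y))
    for j in range(len(X[0])):
        values = sorted({row[j] for row in X})
        # Candidate thresholds are midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0
            left = [y[i] for i in range(n) if X[i][j] <= t]
            right = [y[i] for i in range(n) if X[i][j] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best[2]:
                best = (j, t, score)
    return best


# Hypothetical descriptors per molecule (synthetic values, NOT data from the paper):
# [fraction of saturated carbons, number of heteroatoms]
FEATURES = ["saturated_carbon_fraction", "heteroatom_count"]
training = [[0.80, 1], [0.75, 2], [0.90, 0], [0.70, 3], [0.85, 2]]
generated = [[0.30, 3], [0.40, 2], [0.35, 4], [0.20, 1], [0.45, 3]]

X = training + generated
y = [0] * len(training) + [1] * len(generated)  # 0 = training, 1 = generated

feature, threshold, impurity = best_stump(X, y)
print(f"split on {FEATURES[feature]} at {threshold:.3f}, weighted Gini {impurity:.3f}")
```

On this toy data the chosen split falls on the saturation descriptor, which mirrors the kind of interpretable chemical difference the abstract describes. A real analysis would compute descriptors from actual molecular structures and would likely use a deeper tree from an established library.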

Authors

Zsuzsanna Koczor-Benda
Department of Chemistry, University of Warwick, Coventry CV4 7AL, U.K.

Joe Gilkes
Department of Chemistry, University of Warwick, Coventry CV4 7AL, U.K.

Francesco Bartucca
Department of Chemistry, University of Warwick, Coventry CV4 7AL, U.K.

Abdulla Al-Fekaiki
Department of Chemistry, University of Warwick, Coventry CV4 7AL, U.K.

Reinhard J Maurer
Department of Chemistry, University of Warwick, Gibbet Hill Road, Coventry CV4 7AL, United Kingdom.

Keywords

Machine Learning; Molecular Structure; Organic Chemicals

External Resources

PubMed ID: 40556385
