Transforming molecular cores, substituents, and combinations into structurally diverse compounds using chemical language models.
Journal:
European journal of medicinal chemistry
Published Date:
Apr 10, 2025
Abstract
Transformer-based chemical language models (CLMs) were derived to generate structurally and topologically diverse embeddings of core structure fragments, substituents, or core/substituent combinations in chemically proper compounds, representing a design task that is difficult to address using conventional structure generation methods. To this end, CLM variants were challenged to learn different fragment-to-compound mappings in the absence of structural rules or any other fragment linking or synthetic information. The resulting alternative models were found to have high syntactic fidelity, but displayed notable differences in their ability to generate valid candidate compounds containing test fragments, with a clear preference for a model variant processing core/substituent combinations. However, the majority of valid candidate compounds generated with all models were distinct from training data and structurally novel. In addition, the CLMs exhibited high chemical diversification capacity and often generated structures with new topologies not encountered during training. Furthermore, all models produced large numbers of close structural analogues of known bioactive compounds covering a large target space, thus indicating the relevance of newly generated candidates for pharmaceutical research. As a part of our study, the new methodology and all data are made publicly available.