AssayMatch: Learning To Select Data for Molecular Activity Models.

Journal: Journal of Chemical Information and Modeling

Abstract

The performance of machine-learning models in drug discovery depends heavily on the quality and consistency of the training data. Because individual data sets are small, many models are trained by aggregating bioactivity data from diverse sources, including public databases such as ChEMBL. However, this approach often introduces significant noise due to variability in experimental protocols. We introduce AssayMatch, a framework for data selection that builds smaller, more homogeneous training sets attuned to the test set of interest. AssayMatch leverages data attribution methods to quantify the contribution of each training assay to the model's performance. These attribution scores are used to fine-tune language embeddings of text-based assay descriptions so that they capture not just semantic similarity but also the compatibility between assays. Unlike existing data attribution methods, our approach enables data selection for a test set with unknown labels, mirroring real-world drug discovery campaigns in which the activities of candidate molecules are not known in advance. At test time, embeddings fine-tuned with AssayMatch are used to rank all available training data. We demonstrate that models trained on data selected by AssayMatch can surpass the performance of a model trained on the complete data set, highlighting the framework's ability to filter out harmful or noisy experiments. We perform experiments on two common machine-learning architectures and observe improved predictive performance over a strong language-only baseline for 8 of 12 model-target pairs. AssayMatch provides a data-driven mechanism to curate higher-quality data sets, reducing noise from incompatible experiments and improving the predictive power and data efficiency of models for drug discovery. AssayMatch is available at https://github.com/Ozymandias314/AssayMatch.
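
The test-time ranking step described in the abstract can be illustrated with a minimal sketch. It assumes that each assay description has already been mapped to an embedding vector, and that training assays are ranked by cosine similarity to the test assay's embedding; the abstract does not state which similarity measure is used, so that choice is an assumption. The function names and toy vectors below are hypothetical and are not taken from the AssayMatch repository.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_training_assays(
    test_embedding: Sequence[float],
    train_embeddings: Dict[str, Sequence[float]],
) -> List[Tuple[str, float]]:
    """Return (assay_id, score) pairs sorted from most to least similar.

    Assumption: similarity in embedding space stands in for assay
    compatibility. No test-set labels are needed for this step.
    """
    scored = [
        (assay_id, cosine_similarity(test_embedding, emb))
        for assay_id, emb in train_embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored


def select_top_k(ranked: List[Tuple[str, float]], k: int) -> List[str]:
    """Keep the identifiers of the k highest-ranked training assays."""
    return [assay_id for assay_id, _ in ranked[:k]]


# Toy two-dimensional embeddings (hypothetical values for illustration only).
train = {
    "assay_A": [1.0, 0.0],
    "assay_B": [0.0, 1.0],
    "assay_C": [1.0, 1.0],
}
test = [1.0, 0.1]

ranked = rank_training_assays(test, train)
selected = select_top_k(ranked, k=2)
print(selected)
```

In this toy setting the two assays whose embeddings point closest to the test embedding are kept, and a downstream activity model would then be trained only on the molecules from those assays. The cutoff `k` is a free parameter of the sketch, not a value reported in the abstract.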
