Accurate prediction of activity cliff compounds based on bioactivity profiles depends on assay nearest neighbor relationships.

Journal: Journal of Cheminformatics

Abstract

The definition of activity cliffs (ACs) depends on compound similarity and activity difference criteria and on activity data types. ACs are usually defined as pairs or groups of structurally similar compounds or structural analogues that are active against the same target but have large differences in potency (requiring numerical potency values, preferably equilibrium constants). In addition, ACs have also been defined as pairs of structural analogues that are active or inactive in screening assays. In medicinal chemistry, ACs are of particular interest because they often reveal structure-activity relationship (SAR) determinants during compound optimization. In cheminformatics, ACs present challenging test cases for machine learning (ML) and activity predictions because they represent an extreme form of SAR discontinuity in compound data sets. Given their composition, ACs are notoriously difficult to predict based on chemical structure representations. Various attempts have been made to predict compound pairs forming ACs or the activity of AC compounds via ML, often reporting high accuracy. However, in the absence of data leakage from training to test sets, AC prediction accuracy based on chemical structure is low or moderate at best. As an alternative to structural representations, biological/functional compound descriptors such as biological assay profiles might be considered, which have been investigated for other compound activity prediction tasks. In this work, we report the prediction of AC compounds based on bioactivity profiles derived from a compound profiling matrix, using data partitioning schemes designed to control information or data leakage during ML. Under these stringent conditions, AC compound predictions based on bioactivity profiles often failed. However, we also observed subsets of highly accurate predictions and explored in detail why these predictions succeeded while others failed.
The analysis revealed a critically important role of assay similarity for successful AC compound predictions. Most profile assays did not measurably influence the predictions. By contrast, accurate predictions mostly depended on the presence of one or at most a few profile assays that were similar to test assays. In most cases, these profile assays could be identified and exploited for predictions by nearest neighbor searching, thus putting ML performance into perspective.

Scientific contribution

As an alternative to the generally difficult prediction of ACs based on chemical structure, we introduce prediction of AC compounds based on bioactivity profiles. We show that accurate activity predictions of AC compounds do not depend on the global information content of assay profiles, but are largely determined by nearest neighbor relationships between profile and test assays.
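
The nearest neighbor idea referred to in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not state the similarity measure or the data layout, so the Tanimoto similarity on binary activity vectors, the function names, and the toy matrix below are all assumptions made for illustration. The sketch finds the profile assay whose training-compound activities most resemble the test assay and then reads the test compounds' predicted activities directly from that assay.

```python
import numpy as np


def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity of two binary activity vectors.

    Assumption: assay similarity is measured on active/inactive labels;
    the abstract does not specify the measure that was actually used.
    """
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 0.0
    return float(np.logical_and(a, b).sum() / union)


def nearest_profile_assay(train_profiles, train_labels):
    """Return (column index, similarity) of the profile assay closest to the test assay."""
    sims = [
        tanimoto(train_profiles[:, j], train_labels)
        for j in range(train_profiles.shape[1])
    ]
    best = int(np.argmax(sims))
    return best, sims[best]


def predict_by_nearest_assay(train_profiles, train_labels, test_profiles):
    """Predict test-assay activity by copying the nearest profile assay's outcome."""
    best, sim = nearest_profile_assay(train_profiles, train_labels)
    return test_profiles[:, best].astype(int), best, sim


# Hypothetical toy data: rows are compounds, columns are profile assays.
train_profiles = np.array([
    [1, 0, 1],
    [0, 0, 1],
    [1, 1, 0],
    [0, 1, 0],
    [1, 0, 0],
])
# Hypothetical activity of the training compounds in the test assay.
train_labels = np.array([1, 0, 1, 0, 1])
# Hypothetical bioactivity profiles of two held-out AC compounds.
test_profiles = np.array([
    [1, 1, 0],
    [0, 0, 1],
])

predictions, best_assay, similarity = predict_by_nearest_assay(
    train_profiles, train_labels, test_profiles
)
print(best_assay, similarity, predictions.tolist())
```

In this toy case a single profile assay reproduces the test assay on the training compounds, so the remaining columns contribute nothing to the prediction, which mirrors the abstract's observation that most profile assays did not measurably influence the results.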
