Consistent semantic representation learning for out-of-distribution molecular property prediction.
Journal:
Briefings in bioinformatics
PMID:
40205853
Abstract
Invariant molecular representation models offer a potential way to keep molecular property prediction accurate under out-of-distribution (OOD) shifts by identifying and leveraging invariant substructures inherent to the molecules. However, because molecular functional groups are complexly entangled and molecular properties frequently display activity cliffs, separating molecules into such substructures is difficult and often inaccurate. As a result, the invariant substructures identified by existing models carry inconsistent semantics: molecules sharing identical invariant structures may exhibit drastically different properties. To address these challenges, this paper explores, in the semantic space, the potential correlation between consistent semantics (the same information expressed across different molecular representation forms) and the molecular property prediction problem. To improve OOD molecular property prediction, this paper proposes a consistent semantic representation learning (CSRL) framework that does not separate molecules. It comprises two modules: a semantic uni-code (SUC) module and a consistent semantic extractor (CSE). To address the inconsistent mapping of semantics across molecular representation forms, SUC adjusts incorrect embeddings into the correct embeddings of the two molecular representation forms. CSE then uses non-semantic information as training labels to guide the discriminator's learning, thereby suppressing the reliance of CSE on the non-semantic information in the different molecular representation embeddings. Extensive experiments demonstrate that consistent semantics can guarantee model performance. Overall, CSRL improves the average area under the receiver operating characteristic curve (ROC-AUC) by 6.43% compared with 11 state-of-the-art models on 12 datasets.
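For readers unfamiliar with the reported metric, the following is a minimal sketch of how a ROC-AUC score can be computed for one binary task and then averaged across several datasets. It is not the authors' evaluation code. The abstract does not state how the average was taken or whether the 6.43% gain is absolute or relative, so the unweighted mean used here is an assumption, the function names `roc_auc` and `mean_roc_auc` are hypothetical, and the labels and scores are toy values invented for illustration.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the probability that a randomly chosen positive
    is scored above a randomly chosen negative (ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def mean_roc_auc(per_dataset):
    """Unweighted mean of per-dataset ROC-AUC values (an assumed
    averaging scheme; the abstract does not specify one)."""
    aucs = [roc_auc(labels, scores) for labels, scores in per_dataset]
    return sum(aucs) / len(aucs)


# Toy data, invented for illustration only (not results from the paper).
toy_datasets = [
    ([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]),
    ([1, 0, 1, 0], [0.9, 0.2, 0.6, 0.3]),
]
toy_mean = mean_roc_auc(toy_datasets)
```

The pairwise formulation is quadratic in the number of samples and is used here only because it makes the definition explicit; a rank-based implementation would be preferable for large test sets.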