COLOR: A Compositional Linear Operation-Based Representation of Protein Sequences for Identification of Monomer Contributions to Properties.

Journal: Journal of chemical information and modeling

PMID: 40272990

Abstract

The properties of biological materials like proteins and nucleic acids are largely determined by their primary sequence. Certain segments in the sequence strongly influence specific functions, but identifying these segments, or so-called motifs, is challenging due to the complexity of sequential data. While deep learning (DL) models can accurately capture sequence-property relationships, the degree of nonlinearity in these models limits the assessment of monomer contributions to a property, which is a critical step in identifying key motifs. Recent advances in explainable AI (XAI) offer attention- and gradient-based methods for estimating monomeric contributions. However, these methods are primarily applied to classification tasks, such as binding site identification, where they achieve limited accuracy (40-45%) and rely on qualitative evaluations. To address these limitations, we introduce a DL model with interpretable steps, enabling direct tracing of monomeric contributions. Inspired by the masking technique commonly used in vision and natural language processing domains, we propose a new metric for quantitative analysis on datasets mainly containing distinct properties of anticancer peptides (ACP), antimicrobial peptides (AMP), and collagen. Our model exhibits 22% higher explainability than the gradient- and attention-based state-of-the-art models, recognizes critical motifs (RRR, RRI, and RSS) that significantly destabilize ACPs, and identifies motifs in AMPs that are 50% more effective in converting non-AMPs to AMPs. These findings highlight the potential of our model in guiding mutation strategies for designing protein-based biomaterials.
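
The masking idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the paper's model or its metric: the additive scoring function, the residue weights, the mask token `X`, and every function name below are hypothetical and chosen only to show the general pattern. Positions are ranked by estimated contribution, the top-ranked ones are masked, and the resulting change in the prediction is compared with the change from masking random positions.

```python
import random

# Hypothetical per-residue weights for a toy additive model (not from the paper).
TOY_WEIGHTS = {"R": 1.0, "K": 0.8, "I": 0.3, "S": -0.2, "G": 0.0, "A": 0.1}
MASK = "X"  # assumed placeholder token; it contributes 0 in this toy model


def toy_predict(seq):
    """Toy additive 'property' score: sum of per-residue weights."""
    return sum(TOY_WEIGHTS.get(res, 0.0) for res in seq)


def toy_contributions(seq):
    """Per-position contributions; exact here because the toy model is linear."""
    return [TOY_WEIGHTS.get(res, 0.0) for res in seq]


def mask_positions(seq, positions):
    """Return seq with the given positions replaced by the mask token."""
    chosen = set(positions)
    return "".join(MASK if i in chosen else res for i, res in enumerate(seq))


def masking_effect(seq, scores, k, predict=toy_predict):
    """Absolute prediction change after masking the k highest-|score| positions."""
    ranked = sorted(range(len(seq)), key=lambda i: abs(scores[i]), reverse=True)
    return abs(predict(seq) - predict(mask_positions(seq, ranked[:k])))


def random_masking_effect(seq, k, trials=200, seed=0, predict=toy_predict):
    """Baseline: mean absolute prediction change when k random positions are masked."""
    rng = random.Random(seed)
    base = predict(seq)
    total = 0.0
    for _ in range(trials):
        picked = rng.sample(range(len(seq)), k)
        total += abs(base - predict(mask_positions(seq, picked)))
    return total / trials


seq = "GARRRISKAG"  # made-up peptide containing an RRR segment
scores = toy_contributions(seq)
top_effect = masking_effect(seq, scores, k=3)
rand_effect = random_masking_effect(seq, k=3)
print(f"top-3 masking effect: {top_effect:.2f}, random baseline: {rand_effect:.2f}")
```

In this sketch, a contribution estimate counts as more faithful when masking its top-ranked positions shifts the prediction more than random masking does. A real evaluation would substitute a trained model for `toy_predict` and the attribution method under test for `toy_contributions`.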

Authors

Akash Pandey

Department of Mechanical Engineering, Northwestern University, Evanston, Illinois 60208, United States.

Wei Chen

Department of Urology, Zigong Fourth People's Hospital, Sichuan, China.

Sinan Keten

Department of Mechanical Engineering, Northwestern University, Evanston, Illinois 60208, United States.

Keywords

Amino Acid Sequence; Deep Learning; Proteins
