A new strategy for Cas protein recognition based on graph neural networks and SMILES encoding.

Journal: Scientific Reports

PMID: 40307455

Abstract

The CRISPR-Cas system, an adaptive immune mechanism found in bacteria and archaea, has evolved into a promising genome-editing tool in which the various types of Cas proteins play a crucial role. In this study, we developed a set of strategies for mining and identifying Cas1 proteins. First, we analyzed in detail how 14 types of Cas proteins differ in the embedding space of a protein large language model. We then converted the proteins into the Simplified Molecular Input Line Entry System (SMILES) format and constructed graph data representing atom and bond features. Next, based on the characteristic differences between Cas proteins, we designed and trained an ensemble model composed of two Directed Message Passing Neural Network (DMPNN) models for high-precision identification of Cas1 proteins. This ensemble model performed excellently on both the training data and newly designed datasets, and a comparison with other methods, such as CRISPRCasFinder, demonstrated its effectiveness. Finally, the ensemble model was successfully used to identify potential Cas1 proteins in the Ensemble database, further highlighting its robustness and practicality. The strategies and models from this research may be extensible to other types of Cas proteins, though this would require further investigation and validation. Moreover, our work highlights SMILES encoding as a versatile tool for studying biological macromolecules, enabling efficient structural representation and advanced computational applications in protein research and beyond.
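
The conversion of a protein into SMILES mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say which tool or residue encoding the authors used, so everything below is an assumption made for illustration: the residue table `RESIDUE_SMILES`, the helper name `peptide_to_smiles`, and the decision to omit stereochemistry are all hypothetical. The sketch writes each of the 20 standard amino acids as a backbone fragment, concatenates the fragments so that adjacent residues form peptide bonds, and caps the C-terminus with a hydroxyl oxygen.

```python
# Hypothetical residue table: each fragment is N - C_alpha(side chain) - C(=O),
# written from N-terminus to C-terminus. Stereochemistry is omitted on purpose
# to keep the sketch short; this is an assumption, not the paper's encoding.
RESIDUE_SMILES = {
    "A": "NC(C)C(=O)",
    "R": "NC(CCCNC(N)=N)C(=O)",
    "N": "NC(CC(N)=O)C(=O)",
    "D": "NC(CC(O)=O)C(=O)",
    "C": "NC(CS)C(=O)",
    "Q": "NC(CCC(N)=O)C(=O)",
    "E": "NC(CCC(O)=O)C(=O)",
    "G": "NCC(=O)",
    "H": "NC(Cc1cnc[nH]1)C(=O)",
    "I": "NC(C(C)CC)C(=O)",
    "L": "NC(CC(C)C)C(=O)",
    "K": "NC(CCCCN)C(=O)",
    "M": "NC(CCSC)C(=O)",
    "F": "NC(Cc1ccccc1)C(=O)",
    "P": "N1CCCC1C(=O)",  # backbone nitrogen is part of the pyrrolidine ring
    "S": "NC(CO)C(=O)",
    "T": "NC(C(C)O)C(=O)",
    "W": "NC(Cc1c[nH]c2ccccc12)C(=O)",
    "Y": "NC(Cc1ccc(O)cc1)C(=O)",
    "V": "NC(C(C)C)C(=O)",
}


def peptide_to_smiles(sequence: str) -> str:
    """Join residue fragments into one SMILES string for a linear peptide."""
    if not sequence:
        raise ValueError("empty sequence")
    parts = []
    for position, residue in enumerate(sequence.upper()):
        if residue not in RESIDUE_SMILES:
            raise ValueError(
                f"unsupported residue {residue!r} at position {position}"
            )
        parts.append(RESIDUE_SMILES[residue])
    # Ring-closure digits are opened and closed inside each fragment, so
    # reusing the digits 1 and 2 across residues is valid SMILES.
    # The trailing "O" completes the C-terminal carboxylic acid.
    return "".join(parts) + "O"


if __name__ == "__main__":
    print(peptide_to_smiles("AG"))  # Ala-Gly dipeptide
```

A string produced this way could then be parsed by a cheminformatics toolkit into atoms and bonds, which is the kind of graph input a message passing network operates on. That parsing step is not shown here.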

Authors

Gaoxiang Chen
The First Affiliated Hospital of Wenzhou Medical University, Wenzhou 325000, China.

Liya Hou
Zhejiang Laboratory, Research Center for Life Sciences Computing, Hangzhou, 311100, China.

Zhanwei Li
Zhejiang Laboratory, Research Center for Life Sciences Computing, Hangzhou, 311100, China.

Bin Xie
School of Automation, Central South University, Changsha, China. xiebin@csu.edu.cn.

Yongqiang Liu
Zhejiang Laboratory, Research Center for Life Sciences Computing, Hangzhou, 311100, China.

Keywords

Computational Biology; CRISPR-Associated Proteins; CRISPR-Cas Systems; Databases, Protein; Graph Neural Networks; Neural Networks, Computer

