A new strategy for Cas protein recognition based on graph neural networks and SMILES encoding.

Journal: Scientific Reports
PMID:

Abstract

The CRISPR-Cas system, an adaptive immune mechanism found in bacteria and archaea, has evolved into a promising genome-editing tool in which various types of Cas proteins play a crucial role. In this study, we developed a set of strategies for mining and identifying Cas1 proteins. First, we analyzed in detail how 14 types of Cas proteins differ in the embedding space of a protein large language model. We then converted the proteins into the Simplified Molecular Input Line Entry System (SMILES) format and constructed graph data representing atom and bond features. Next, based on the characteristic differences between Cas proteins, we designed and trained an ensemble model composed of two Directed Message Passing Neural Network (DMPNN) models for high-precision identification of Cas1 proteins. This ensemble model performed well on both the training data and newly designed datasets. Comparison with other methods, such as CRISPRCasFinder, demonstrated its effectiveness. Finally, the ensemble model was successfully used to identify potential Cas1 proteins in the Ensembl database, further highlighting its robustness and practicality. The strategies and models from this research may potentially be extended to other types of Cas proteins, though this would require further investigation and validation. Moreover, our work highlights SMILES encoding as a versatile tool for studying biological macromolecules, enabling efficient structural representation and advanced computational applications in protein research and beyond.
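
The core idea of directed message passing over an atom-and-bond graph can be illustrated with a minimal sketch. This is not the authors' implementation: the scalar features, the identity weights (no learned parameters), the mean readout, and the three-atom toy graph are all assumptions made for illustration, and the function names are hypothetical. In practice the graph would be derived from a protein's SMILES string with a cheminformatics toolkit rather than written by hand.

```python
from collections import defaultdict


def directed_message_passing(atom_feats, bonds, steps=2):
    """Run DMPNN-style message passing on directed bonds.

    atom_feats: one scalar feature per atom (toy stand-in for a feature vector).
    bonds: undirected bonds as (atom_u, atom_v, bond_feat) tuples.
    Returns a dict mapping each directed edge (u, v) to its hidden state.
    """
    # Each undirected bond becomes two directed edges; the initial state
    # combines the source atom's feature with the bond feature.
    init = {}
    for u, v, bond_feat in bonds:
        init[(u, v)] = atom_feats[u] + bond_feat
        init[(v, u)] = atom_feats[v] + bond_feat

    # incoming[a] lists the atoms that have a directed edge pointing at a.
    incoming = defaultdict(list)
    for u, v in init:
        incoming[v].append(u)

    hidden = dict(init)
    for _ in range(steps):
        updated = {}
        for u, v in hidden:
            # Sum the states of edges w -> u, skipping the reverse edge v -> u
            # so a message is never passed straight back to its sender.
            message = sum(hidden[(w, u)] for w in incoming[u] if w != v)
            # Skip connection to the initial state, followed by a ReLU.
            updated[(u, v)] = max(0.0, init[(u, v)] + message)
        hidden = updated
    return hidden


def readout(atom_feats, hidden):
    """Add incoming edge states to each atom, then mean-pool over atoms."""
    atom_states = list(atom_feats)
    for (_, v), state in hidden.items():
        atom_states[v] += state
    return sum(atom_states) / len(atom_states)


# Toy chain of three atoms, 0 - 1 - 2, with assumed feature values.
atom_feats = [1.0, 2.0, 3.0]
bonds = [(0, 1, 1.0), (1, 2, 1.0)]
hidden = directed_message_passing(atom_feats, bonds)
score = readout(atom_feats, hidden)
```

A trained model would replace the additions with learned linear layers over feature vectors and feed the pooled representation to a classifier. How the two DMPNN models in the ensemble are combined is not stated in the abstract, so that step is not sketched here.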

Authors

  • Gaoxiang Chen
    The First Affiliated Hospital of Wenzhou Medical University, Wenzhou 325000, China.
  • Liya Hou
    Zhejiang Laboratory, Research Center for Life Sciences Computing, Hangzhou, 311100, China.
  • Zhanwei Li
    Zhejiang Laboratory, Research Center for Life Sciences Computing, Hangzhou, 311100, China.
  • Bin Xie
    School of Automation, Central South University, Changsha, China. xiebin@csu.edu.cn.
  • Yongqiang Liu
    Zhejiang Laboratory, Research Center for Life Sciences Computing, Hangzhou, 311100, China.