A new strategy for Cas protein recognition based on graph neural networks and SMILES encoding.
Journal:
Scientific reports
PMID:
40307455
Abstract
The CRISPR-Cas system, an adaptive immune mechanism found in bacteria and archaea, has evolved into a promising genomic editing tool, with various types of Cas proteins playing a crucial role. In this study, we developed a set of strategies for mining and identifying Cas1 proteins. Firstly, we analyzed the characteristic differences of 14 types of Cas proteins in the protein large language model embedding space in detail; then converted proteins into the Simplified Molecular Input Line Entry System (SMILES) format, thereby constructing graph data representing atom and bond features. Next, based on the characteristic differences of different Cas proteins, we designed and trained an ensemble model composed of two Directed Message Passing Neural Network (DMPNN) models for high-precision identification of Cas1 proteins. This ensemble model performed excellently on both training data and newly designed datasets. The comparison of this method with other methods, such as CRISPRCasFinder, has demonstrated its effectiveness. Finally, the ensemble model was successfully employed to identify potential Cas1 proteins in the Ensemble database, further highlighting its robustness and practicality. The strategies and models from this research may potentially be extended to other types of Cas proteins, though this would require further investigation and validation. Moreover, our work highlights SMILES encoding as a versatile tool for studying biological macromolecules, enabling efficient structural representation and advanced computational applications in protein research and beyond.