Uncovering Cas9 PAM diversity through metagenomic mining and machine learning.
Journal:
Nature Communications
Published Date:
Feb 8, 2026
Abstract
Recognition of protospacer adjacent motifs (PAMs) is crucial for target site recognition by CRISPR-Cas systems. In genome editing applications, the requirement for specific PAM sequences at the target locus imposes substantial constraints, driving efforts to search for novel Cas9 orthologs with extended or alternative PAM compatibilities. Here, we present CRISPR-PAMdb, a comprehensive and publicly accessible database compiling Cas9 protein sequences from 3.8 million bacterial and archaeal genomes and PAM profiles from 7.4 million phage and plasmid sequences. Through spacer-protospacer alignment, we infer consensus PAM preferences for 8,003 unique Cas9 clusters. To extend PAM discovery beyond traditional alignment-based approaches, we develop CICERO, a machine learning model that predicts PAM preferences directly from Cas9 protein sequences. Built on the ESM2 protein language model and trained on CRISPR-PAMdb, CICERO achieves an average cosine similarity of 0.69 on test data and 0.75 on experimentally validated Cas9 orthologs. For Cas9 clusters where alignment-based predictions are infeasible, CICERO generates PAM profiles for an additional 50,308 Cas9 proteins, including 17,453 high-confidence predictions with CICERO confidence scores above 0.8. Together, CRISPR-PAMdb and CICERO enable large-scale exploration of PAM diversity across Cas9 proteins, accelerating the design of next-generation CRISPR-Cas9 tools for precise genome engineering.
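To make the cosine-similarity figures concrete, the following is a minimal sketch of how two PAM profiles might be compared. It assumes each profile is a positions-by-nucleotides frequency matrix built from equal-length protospacer-flanking sequences, and that similarity is computed on the flattened matrices; the abstract does not state the exact formulation, so both choices, along with the function names `pam_frequency_matrix` and `cosine_similarity` and the toy sequences, are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

NUCLEOTIDES = "ACGT"

def pam_frequency_matrix(flanks):
    """Build a (positions x 4) nucleotide frequency matrix.

    `flanks` is a list of equal-length sequences taken from the region
    adjacent to aligned protospacers (hypothetical input format).
    """
    length = len(flanks[0])
    counts = np.zeros((length, 4))
    for seq in flanks:
        if len(seq) != length:
            raise ValueError("all flanking sequences must have the same length")
        for pos, base in enumerate(seq.upper()):
            if base in NUCLEOTIDES:  # skip ambiguous bases such as N
                counts[pos, NUCLEOTIDES.index(base)] += 1
    totals = counts.sum(axis=1, keepdims=True)
    # Leave a position as all zeros if no valid base was observed there.
    return np.divide(counts, totals, out=np.zeros_like(counts), where=totals > 0)

def cosine_similarity(a, b):
    """Cosine similarity between two profiles, compared as flattened vectors."""
    a = np.asarray(a, dtype=float).ravel()
    b = np.asarray(b, dtype=float).ravel()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

# Toy example: an NGG-like profile versus an NGA-like profile.
ngg = pam_frequency_matrix(["TGG", "AGG", "CGG", "GGG"])
nga = pam_frequency_matrix(["TGA", "AGA", "CGA", "GGA"])

same = cosine_similarity(ngg, ngg)
different = cosine_similarity(ngg, nga)
print(f"NGG vs NGG: {same:.3f}")
print(f"NGG vs NGA: {different:.3f}")
```

In this toy case the two profiles agree at the first two positions and disagree at the third, so the score falls between 0 and 1. Averaging per-position similarities instead of flattening would be an equally plausible convention and would give different numbers.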