Deep Learning-Based Classification of CRISPR Loci Using Repeat Sequences.
Journal: ACS Synthetic Biology
PMID: 40261207
Abstract
With the widespread application of CRISPR-Cas systems in gene editing and related fields, and with the increasing availability of metagenomic data, demand for detecting and classifying CRISPR-Cas systems in metagenomic data sets has grown significantly. Traditional classification methods rely primarily on identifying cas genes near CRISPR arrays. These methods may fail when cas gene information is absent, as in metagenomes or fragmented genome assemblies. Here, we present CRISPRclassify-CNN-Att, a deep learning-based method that classifies CRISPR loci based solely on their repeat sequences. CRISPRclassify-CNN-Att uses convolutional neural networks (CNNs) and self-attention mechanisms to extract features from repeat sequences. It employs a stacking strategy to address the imbalance of samples across subtypes and uses transfer learning to improve classification accuracy for subtypes with fewer samples. CRISPRclassify-CNN-Att shows strong performance in classifying multiple subtypes, particularly those with larger sample sizes. Although CRISPR locus classification traditionally depends on cas genes, CRISPRclassify-CNN-Att offers a novel approach that complements cas-based methods, enabling the classification of orphan or distant CRISPR loci. The tool is freely available at https://github.com/Xingyu-Liao/CRISPRclassify-CNN-Att.
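To make the feature-extraction idea concrete, the following is a minimal sketch of how a repeat sequence might pass through a convolution and a self-attention step. It is not the CRISPRclassify-CNN-Att implementation: the function names, the example repeat, and the two hand-set 3-mer filters are all hypothetical, the attention uses identity projections instead of learned weights, and the stacking and transfer-learning stages are omitted. It uses only the Python standard library.

```python
import math

ALPHABET = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a list of 4-dim one-hot rows.
    Bases outside ACGT (for example N) become all-zero rows."""
    return [[1.0 if base == letter else 0.0 for letter in ALPHABET]
            for base in seq.upper()]

def conv1d_relu(x, kernels):
    """Valid 1D convolution (cross-correlation) followed by ReLU.
    x is L x 4; each kernel is K x 4; the result is (L - K + 1) x len(kernels)."""
    k = len(kernels[0])
    out = []
    for start in range(len(x) - k + 1):
        window = x[start:start + k]
        row = []
        for kern in kernels:
            s = sum(window[i][c] * kern[i][c]
                    for i in range(k) for c in range(len(ALPHABET)))
            row.append(max(0.0, s))
        out.append(row)
    return out

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(h):
    """Scaled dot-product self-attention with Q = K = V = h.
    A trained model would apply learned projections first (assumption: omitted here)."""
    d = len(h[0])
    out = []
    for query in h:
        scores = [sum(a * b for a, b in zip(query, key)) / math.sqrt(d) for key in h]
        weights = softmax(scores)
        out.append([sum(weights[j] * h[j][c] for j in range(len(h)))
                    for c in range(d)])
    return out

def mean_pool(h):
    """Average over positions to get one fixed-length feature vector."""
    return [sum(col) / len(h) for col in zip(*h)]

# Hypothetical repeat fragment, chosen only for illustration.
repeat = "GTTTCAATCC"

# Hand-set filters that respond to the 3-mers GTT and ATC.
# In a real CNN these weights would be learned from labeled repeats.
kernels = [one_hot("GTT"), one_hot("ATC")]

features = conv1d_relu(one_hot(repeat), kernels)  # 8 positions x 2 filters
attended = self_attention(features)               # same shape, context-mixed
pooled = mean_pool(attended)                      # 2-dim summary vector
print(pooled)
```

In a trained classifier, a vector like `pooled` would feed a final layer that outputs subtype probabilities. Here it only shows how local motif detection (convolution) and position-to-position weighting (self-attention) can be combined on a single repeat.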