Biological Foundation Models Enable CRISPR Array Detection Without Metagenomic Assembly

Journal: bioRxiv
Published Date:

Abstract

Accurate identification of CRISPR arrays is essential for studying prokaryotic adaptive immunity, yet existing tools struggle with short-read sequencing data and with arrays containing degenerate repeats. These limitations restrict CRISPR analysis in metagenomic and fragmented genomic datasets. We present a foundation model-based approach for CRISPR array detection that addresses both of these challenges. We fine-tune a large genomic foundation model using Low-Rank Adaptation (LoRA), a Parameter-Efficient Fine-Tuning (PEFT) method, to perform per-nucleotide classification of DNA sequences into repeat, spacer, and non-array regions directly from raw input nucleotide sequences. We develop two model variants for different sequence context lengths. The long-context model, which supports sequences of up to 8,192 nucleotides, achieves 98.16% test accuracy and identifies degenerate repeat candidates missed by similarity-based CRISPR detection tools. The short-context model, which supports sequences of up to 150 nucleotides and is optimized for Illumina reads, reaches 90.03% accuracy and enables direct analysis of individual reads without assembly. On simulated metagenomic data, the short-context model achieves a spacer recall of 49.12% and recovers 12.57% of spacers that are not detected by dedicated metagenomic CRISPR array detection methods, which require metagenomic assembly. Together, these results demonstrate that genomic foundation models provide a robust and complementary paradigm for CRISPR array detection.
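
To illustrate what per-nucleotide classification output and spacer recall could look like in practice, the following is a minimal sketch and not the authors' implementation. It assumes a hypothetical label alphabet ('R' for repeat, 'S' for spacer, 'N' for non-array), treats each maximal run of 'S' labels as one predicted spacer, and counts a reference spacer as recovered only on an exact sequence match. The function names and the matching criterion are assumptions made for illustration; the abstract does not specify them.

```python
from itertools import groupby

# Hypothetical label alphabet (an assumption, not taken from the paper):
# 'R' = repeat, 'S' = spacer, 'N' = non-array.


def extract_segments(sequence, labels, target="S"):
    """Return (start, end, subsequence) for each maximal run of `target` labels."""
    if len(sequence) != len(labels):
        raise ValueError("sequence and labels must have the same length")
    segments = []
    pos = 0
    for label, run in groupby(labels):
        length = sum(1 for _ in run)
        if label == target:
            segments.append((pos, pos + length, sequence[pos:pos + length]))
        pos += length
    return segments


def spacer_recall(predicted, reference):
    """Fraction of reference spacers that appear exactly among the predictions."""
    reference = set(reference)
    if not reference:
        return 0.0
    return len(reference & set(predicted)) / len(reference)


# Toy read: 2 nt flank, 6 nt repeat, 8 nt spacer, 6 nt repeat, 2 nt flank.
sequence = "AA" + "GTTTCC" + "ACGTACGT" + "GTTTCC" + "GG"
labels = "NN" + "RRRRRR" + "SSSSSSSS" + "RRRRRR" + "NN"

spacers = [seq for _, _, seq in extract_segments(sequence, labels)]
recall = spacer_recall(spacers, ["ACGTACGT", "TTTTGGGG"])
```

In this toy example one of the two reference spacers is recovered, so `recall` is 0.5. A real evaluation would likely need a tolerance for partial or boundary-shifted matches, which this sketch deliberately omits.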

Authors

  • Backofen, R.
  • Schroeder, L. D.
  • Mitrofanov, A.
  • Koeksal, R.
  • Uhl, M.