Balanced Training Sets Improve Deep Learning-Based Prediction of CRISPR sgRNA Activity.

Journal: ACS Synthetic Biology
PMID:

Abstract

CRISPR-Cas systems have transformed the field of synthetic biology by providing a versatile method for genome editing. The efficiency of CRISPR systems is largely dependent on the sequence of the constituent sgRNA, necessitating the development of computational methods for designing active sgRNAs. While deep learning-based models have shown promise in predicting sgRNA activity, the accuracy of prediction is primarily governed by the data set used in model training. Here, we trained a convolutional neural network (CNN) model and a large language model (LLM) on balanced and imbalanced data sets generated from CRISPR-Cas12a screening data for the yeast and evaluated their ability to predict high- and low-activity sgRNAs. We further tested whether prediction performance can be improved by training on imbalanced data sets augmented with synthetic sgRNAs. Lastly, we demonstrated that adding synthetic sgRNAs to inherently imbalanced CRISPR-Cas9 data sets from and leads to improved performance in predicting sgRNA activity, thus underscoring the importance of employing balanced training sets for accurate sgRNA activity prediction.
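The abstract does not say how the balanced data sets were built. As a minimal illustration of what "balanced" means here, the sketch below assumes each sgRNA has already been given a binary activity label ("high" or "low") and equalizes the classes by random undersampling of the majority class. Undersampling is a swapped-in technique chosen for brevity and is not the synthetic-sgRNA augmentation the authors describe. The function names, labels, and toy sequences are all hypothetical.

```python
import random
from collections import Counter


def random_spacer(rng, length=20):
    """Generate a random DNA string (hypothetical stand-in for an sgRNA spacer)."""
    return "".join(rng.choice("ACGT") for _ in range(length))


def balance_by_undersampling(records, seed=0):
    """Return a class-balanced subset of (sequence, label) records.

    Every class is randomly downsampled to the size of the smallest class.
    This is one generic way to balance a data set, not the paper's method.
    """
    rng = random.Random(seed)

    # Group records by their activity label.
    by_label = {}
    for sequence, label in records:
        by_label.setdefault(label, []).append((sequence, label))

    # The smallest class determines how many records each class keeps.
    target_size = min(len(group) for group in by_label.values())

    balanced = []
    for label in sorted(by_label):
        balanced.extend(rng.sample(by_label[label], target_size))

    # Shuffle so the classes are interleaved for training.
    rng.shuffle(balanced)
    return balanced


# Toy imbalanced data set (assumed 90 low-activity vs. 10 high-activity guides).
toy_rng = random.Random(42)
toy_records = [(random_spacer(toy_rng), "low") for _ in range(90)]
toy_records += [(random_spacer(toy_rng), "high") for _ in range(10)]

balanced_records = balance_by_undersampling(toy_records)
label_counts = Counter(label for _, label in balanced_records)
print(label_counts)
```

Undersampling discards majority-class examples, which may be why augmenting the minority class with synthetic guides, as the abstract reports, can be preferable when the data set is small.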

Authors

  • Varun Trivedi
    Department of Chemical and Environmental Engineering, University of California, Riverside, California 92521, United States.
  • Amirsadra Mohseni
    Department of Computer Science, University of California, Riverside, California 92521, United States.
  • Stefano Lonardi
    Computer Science and Engineering, University of California, Riverside, California 92521, United States. stelo@cs.ucr.edu.
  • Ian Wheeldon
    Department of Chemical and Environmental Engineering, University of California, Riverside, California 92521, United States.