SeqScreen: accurate and sensitive functional screening of pathogenic sequences via ensemble learning.

Journal: Genome Biology
PMID:

Abstract

The COVID-19 pandemic has emphasized the importance of accurate detection of known and emerging pathogens. However, robust characterization of pathogenic sequences remains an open challenge. To address this need, we developed SeqScreen, which accurately characterizes short nucleotide sequences using taxonomic and functional labels and a customized set of curated Functions of Sequences of Concern (FunSoCs) specific to microbial pathogenesis. We show that our ensemble machine learning model can label protein-coding sequences with FunSoCs with high recall and precision. SeqScreen is a step towards a novel paradigm of functionally informed synthetic DNA screening and pathogen characterization, and is available for download at www.gitlab.com/treangenlab/seqscreen.

Authors

  • Advait Balaji
    Department of Computer Science, Rice University, Houston, TX, USA.
  • Bryce Kille
    Department of Computer Science, Rice University, Houston, TX, USA.
  • Anthony D Kappell
    Signature Science, LLC, 8329 North Mopac Expressway, Austin, TX, USA.
  • Gene D Godbold
    Signature Science, LLC, 1670 Discovery Drive, Charlottesville, VA, USA.
  • Madeline Diep
    Fraunhofer USA Center Mid-Atlantic CMA, Riverdale, MD, USA.
  • R A Leo Elworth
    Department of Computer Science, Rice University, Houston, TX, USA.
  • Zhiqin Qian
    Department of Computer Science, Rice University, Houston, TX, USA.
  • Dreycey Albin
    Department of Computer Science, Rice University, Houston, TX, USA.
  • Daniel J Nasko
    Department of Computer Science, University of Maryland, College Park, MD, USA.
  • Nidhi Shah
    Department of Surgery, University of Michigan, Ann Arbor, MI, USA.
  • Mihai Pop
    Department of Computer Science, University of Maryland, College Park, MD, USA.
  • Santiago Segarra
    Department of Electrical and Computer Engineering, Rice University, Houston, TX, USA.
  • Krista L Ternus
    Signature Science, LLC, 8329 North Mopac Expressway, Austin, TX, USA. kternus@signaturescience.com.
  • Todd J Treangen
    Department of Computer Science, Rice University, Houston, TX, USA. treangen@rice.edu.