High-throughput deep learning variant effect prediction with Sequence UNET.

Journal: Genome Biology
PMID:

Abstract

Understanding coding mutations is important for many applications in biology and medicine, but the vast mutation space makes comprehensive experimental characterisation impossible. Current predictors, including recent deep learning models, are often computationally intensive and difficult to scale. We introduce Sequence UNET, a highly scalable deep learning architecture that classifies variants and predicts variant frequency from sequence alone, using multi-scale representations from a fully convolutional compression/expansion architecture. Its pathogenicity prediction performance is comparable to that of recent methods. We demonstrate scalability by analysing 8.3B variants in 904,134 proteins detected through large-scale proteomics. Sequence UNET runs on modest hardware with a simple Python package.
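
To give a sense of the scale involved, the following is a minimal sketch of how the mutation space grows with protein length. It assumes that "variants" means single amino acid substitutions over the 20 standard amino acids (19 alternatives per position), which the abstract does not state explicitly. The function names and the toy sequence are hypothetical and are not part of the Sequence UNET package.

```python
# Illustrative arithmetic only; not part of the Sequence UNET package.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def single_substitutions(sequence):
    """Yield (position, wild_type, mutant) for every single substitution."""
    for pos, wt in enumerate(sequence, start=1):
        for mut in AMINO_ACIDS:
            if mut != wt:
                yield pos, wt, mut


def count_single_substitutions(length):
    """Each position can change to 19 other amino acids."""
    return 19 * length


toy = "MKTAYIAK"  # hypothetical toy sequence
variants = list(single_substitutions(toy))
assert len(variants) == count_single_substitutions(len(toy))

# Figures quoted in the abstract.
TOTAL_VARIANTS = 8.3e9
N_PROTEINS = 904_134

variants_per_protein = TOTAL_VARIANTS / N_PROTEINS
# Assumption: all counted variants are single substitutions (19 per residue).
implied_mean_length = variants_per_protein / 19
print(f"{variants_per_protein:.0f} variants per protein on average")
print(f"implied mean protein length: {implied_mean_length:.0f} residues")
```

Under that assumption, the quoted totals correspond to roughly nine thousand variants per protein, or a mean length of a few hundred residues, which is why exhaustive experimental characterisation does not scale.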

Authors

  • Alistair S Dunham
    European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire, CB10 1SD, UK. ad44@sanger.ac.uk.
  • Pedro Beltrao
    European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire, CB10 1SD, UK.
  • Mohammed AlQuraishi
    Laboratory of Systems Pharmacology, Harvard Medical School, Boston, MA 02115, USA; Department of Systems Biology, Harvard Medical School, Boston, MA 02115, USA. Electronic address: alquraishi@hms.harvard.edu.