Benchmarking of deep neural networks for predicting personal gene expression from DNA sequence highlights shortcomings.

Journal: Nature Genetics
Published Date:

Abstract

Deep learning methods have recently become the state of the art in a variety of regulatory genomic tasks, including the prediction of gene expression from genomic DNA. As such, these methods promise to serve as important tools in interpreting the full spectrum of genetic variation observed in personal genomes. Previous evaluation strategies have assessed their predictions of gene expression across genomic regions; however, systematic benchmarking of their predictions across individuals, which would directly evaluate their utility as personal DNA interpreters, is lacking. We used paired whole-genome sequencing and gene expression data from 839 individuals in the ROSMAP study to evaluate the ability of current methods to predict gene expression variation across individuals at varied loci. Our approach identifies a limitation of current methods in correctly predicting the direction of variant effects. We show that this limitation stems from insufficiently learned sequence motif grammar and suggest new model training strategies to improve performance.
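
The cross-individual evaluation described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the abstract does not name a metric, so the use of a per-gene Pearson correlation is an assumption, and the function names (`pearson`, `cross_individual_scores`), gene labels, and toy values are all hypothetical. The idea is that, for each gene, predicted and observed expression are compared across individuals; a negative correlation would flag a gene whose predicted direction of effect is reversed.

```python
import math
from typing import Dict, List


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation of two equal-length sequences (NaN if either is constant)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)


def cross_individual_scores(
    observed: Dict[str, List[float]],
    predicted: Dict[str, List[float]],
) -> Dict[str, float]:
    """Per-gene correlation across individuals (assumed metric, see lead-in).

    Each dict maps a gene name to one expression value per individual,
    with individuals in the same order in both dicts.
    """
    return {
        gene: pearson(observed[gene], predicted[gene])
        for gene in observed
        if gene in predicted
    }


# Hypothetical toy data: two genes measured in four individuals.
observed = {
    "GENE_A": [1.0, 2.0, 3.0, 4.0],
    "GENE_B": [1.0, 2.0, 3.0, 4.0],
}
predicted = {
    "GENE_A": [0.9, 2.1, 2.9, 4.2],  # tracks the observed variation
    "GENE_B": [4.0, 3.0, 2.0, 1.0],  # same magnitude, reversed direction
}

scores = cross_individual_scores(observed, predicted)

# Genes whose predictions run opposite to the observed variation.
wrong_direction = [gene for gene, r in scores.items() if r < 0]
```

In this toy setup `GENE_B` ends up in `wrong_direction`, which mirrors the kind of failure the abstract reports: a prediction can capture that a locus matters while getting the sign of the effect wrong.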

Authors

  • Alexander Sasse
    Physik Department T38, Technische Universität München, James-Franck-Straße, Garching, Germany.
  • Bernard Ng
    Department of Statistics, University of British Columbia, Vancouver, BC V6T 1Z4, Canada.
  • Anna E Spiro
    Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, WA, USA.
  • Shinya Tasaki
    Rush Alzheimer's Disease Center, Rush University Medical Center, Chicago, IL, USA. stasaki@gmail.com.
  • David A Bennett
    Rush Alzheimer's Disease Center, Rush University Medical Center, Chicago, IL, USA.
  • Christopher Gaiteri
    Rush Alzheimer's Disease Center, Rush University Medical Center, Chicago, IL, USA.
  • Philip L De Jager
    Department of Neurology, Center for Translational and Computational Neuroimmunology, Columbia University Medical Center, New York, NY, USA.
  • Maria Chikina
    Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, PA, USA. mchikina@gmail.com.
  • Sara Mostafavi
    Department of Statistics, University of British Columbia, Vancouver, BC V6T 1Z4, Canada; cb@hms.harvard.edu saram@stat.ubc.ca.