A Method for Localizing Non-Reference Sequences to the Human Genome.

Journal: Pacific Symposium on Biocomputing
Published Date:

Abstract

As the last decade of human genomics research begins to bear the fruit of advancements in precision medicine, it is important to ensure that genomics' improvements in human health are distributed globally and equitably. An important step toward health equity is to improve the human reference genome so that it captures global diversity by including a wide variety of alternative haplotypes, sequences that are not currently captured on the reference genome. We present a method that localizes 100 basepair (bp) long sequences extracted from short-read sequencing and that can ultimately be used to identify which regions of the human genome non-reference sequences belong to. We extract reads that do not align to the reference genome and compute the population's distribution of 100-mers found within the unmapped reads. We use genetic data from families to identify shared genetic material between siblings, and match the distribution of unmapped k-mers to these inheritance patterns to determine the most likely genomic region of a k-mer. We perform this localization with two highly interpretable methods of artificial intelligence: a computationally tractable Hidden Markov Model coupled to a Maximum Likelihood Estimator. Using a set of alternative haplotypes with known locations on the genome, we show that our algorithm is able to localize 96% of k-mers with over 90% accuracy and less than 1Mb median resolution. As the collection of sequenced human genomes grows larger and more diverse, we hope that this method can be used to improve the human reference genome, a critical step in addressing precision medicine's diversity crisis.
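The k-mer counting step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `reads_by_sample` input layout, and the choice to skip k-mers containing the ambiguous base `N` are all assumptions made for illustration. It only shows how a per-sample distribution of k-mers could be tabulated from reads that have already been identified as unmapped.

```python
from collections import Counter


def count_kmers(reads, k=100):
    """Count every length-k substring across one sample's unmapped reads.

    `reads` is an iterable of read sequences (strings). Hypothetical helper.
    """
    counts = Counter()
    for read in reads:
        read = read.upper()
        # A read of length n contains n - k + 1 overlapping k-mers.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if "N" not in kmer:  # assumption: drop k-mers with ambiguous bases
                counts[kmer] += 1
    return counts


def population_distribution(reads_by_sample, k=100):
    """Map each k-mer to a {sample_id: count} dict over all samples.

    `reads_by_sample` maps a sample identifier to that sample's unmapped
    reads. Samples in which a k-mer never occurs are simply absent.
    """
    dist = {}
    for sample_id, reads in reads_by_sample.items():
        for kmer, n in count_kmers(reads, k).items():
            dist.setdefault(kmer, {})[sample_id] = n
    return dist
```

With real data, `k` would stay at 100 to match the abstract, and the resulting per-sample counts would be the input that is compared against sibling sharing patterns; a small `k` is convenient only for checking the logic on toy strings.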

Authors

  • Brianna Sierra Chrisman
    Department of Bioengineering, Stanford University, Stanford, CA 94305, USA, briannac@stanford.edu.
  • Kelley M Paskov
  • Chloe He
    Wellcome/EPSRC Centre for Interventional and Surgical Sciences, University College London, 43-45 Foley St, London, W1W 7TY, UK; Department of Computer Science, University College London, 66-72 Gower St, London WC1E 6EA, UK; AI Team, Apricity, 14 Grays Inn Rd, London WC1X 8HN, UK. Electronic address: chloe.he.21@ucl.ac.uk.
  • Jae-Yoon Jung
  • Nate Stockham
  • Peter Yigitcan Washington
  • Dennis Paul Wall
    Department of Pediatrics, Division of Systems Medicine, Stanford University, California, United States of America.