An ensemble approach to accurately detect somatic mutations using SomaticSeq.

Journal: Genome Biology
Published Date:

Abstract

SomaticSeq is a somatic mutation detection pipeline that implements a stochastic boosting algorithm to produce highly accurate calls for both single nucleotide variants and small insertions and deletions. The workflow currently incorporates five state-of-the-art somatic mutation callers and extracts over 70 individual genomic and sequencing features for each candidate site. A training set is provided to an adaptively boosted decision tree learner to create a classifier that predicts the mutation status of each site. We validate our results with both synthetic and real data. We report that SomaticSeq achieves better overall accuracy than any of the individual tools it incorporates.
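To make the idea of an adaptively boosted decision tree learner concrete, the following is a minimal sketch of AdaBoost over one-level decision trees (stumps) in plain Python. It is not the SomaticSeq implementation. The two features (number of callers reporting a site, and variant allele fraction), the six training rows, and all function names are hypothetical and chosen only for illustration; the actual pipeline uses over 70 features per candidate site.

```python
import math

def train_stump(X, y, w):
    """Return (weighted_error, feature, threshold, sign) for the best stump.

    A stump predicts `sign` when row[feature] >= threshold, else -sign.
    """
    best = None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            for sign in (1, -1):
                err = sum(wi for row, yi, wi in zip(X, y, w)
                          if (sign if row[j] >= thr else -sign) != yi)
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    return best

def stump_predict(stump, row):
    _, j, thr, sign = stump
    return sign if row[j] >= thr else -sign

def adaboost_fit(X, y, rounds):
    """Fit `rounds` stumps, re-weighting the training rows after each one."""
    n = len(X)
    w = [1.0 / n] * n
    model = []
    for _ in range(rounds):
        stump = train_stump(X, y, w)
        # Clamp the error so the log below is always finite.
        err = min(max(stump[0], 1e-10), 1 - 1e-10)
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((alpha, stump))
        # Misclassified rows gain weight, correctly classified rows lose it.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, row))
             for row, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return model

def adaboost_predict(model, row):
    """Weighted vote of all stumps: +1 = somatic, -1 = not somatic."""
    score = sum(alpha * stump_predict(stump, row) for alpha, stump in model)
    return 1 if score >= 0 else -1

# Made-up toy training set: (callers reporting the site, variant allele fraction).
X = [(5, 0.30), (4, 0.20), (4, 0.02), (1, 0.25), (1, 0.01), (0, 0.03)]
y = [1, 1, -1, -1, -1, -1]  # +1 = somatic, -1 = not somatic

model = adaboost_fit(X, y, rounds=3)
train_predictions = [adaboost_predict(model, row) for row in X]
print(train_predictions)
```

On this toy set no single stump separates the classes, but the weighted vote of three stumps does, which is the basic motivation for combining weak rules (here, thresholds on individual features) into one classifier.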

Authors

  • Li Tai Fang
    Bina Technologies, Roche Sequencing, Redwood City, 94065, CA, USA. li_tai.fang@bina.roche.com.
  • Pegah Tootoonchi Afshar
    Department of Electrical Engineering, Stanford University, Stanford, 94305, CA, USA. pegahta@stanford.edu.
  • Aparna Chhibber
    Bina Technologies, Roche Sequencing, Redwood City, 94065, CA, USA. aparna.chhibber@bina.roche.com.
  • Marghoob Mohiyuddin
    Bina Technologies, Roche Sequencing, Redwood City, 94065, CA, USA. marghoob.mohiyuddin@bina.roche.com.
  • Yu Fan
    Department of Bioinformatics and Computational Biology, The University of Texas MD Anderson Cancer Center, Houston, 77030, TX, USA. YFan1@mdanderson.org.
  • John C Mu
    Bina Technologies, Roche Sequencing, Redwood City, 94065, CA, USA. john.mu@bina.roche.com.
  • Greg Gibeling
    Bina Technologies, Roche Sequencing, Redwood City, 94065, CA, USA. greg.gibeling@bina.roche.com.
  • Sharon Barr
    Bina Technologies, Roche Sequencing, Redwood City, 94065, CA, USA. sharon.barr@bina.roche.com.
  • Narges Bani Asadi
    Bina Technologies, Roche Sequencing, Redwood City, 94065, CA, USA. narges.baniasadi@bina.roche.com.
  • Mark B Gerstein
    Program in Computational Biology and Bioinformatics, Yale University, New Haven, 06520, CT, USA. mark.gerstein@yale.edu.
  • Daniel C Koboldt
    The Genome Institute, Washington University in St. Louis, St. Louis, Missouri, United States of America.
  • Wenyi Wang
    Department of Bioinformatics and Computational Biology, The University of Texas MD Anderson Cancer Center, Houston, 77030, TX, USA. wwang7@mdanderson.org.
  • Wing H Wong
    Department of Statistics, Stanford University, Stanford, 94305, CA, USA. whwong@stanford.edu.
  • Hugo Y K Lam
    Bina Technologies, Roche Sequencing, Redwood City, 94065, CA, USA. hugo.lam@bina.roche.com.