Large-scale machine learning for metagenomics sequence classification.

Journal: Bioinformatics (Oxford, England)
Published Date:

Abstract

MOTIVATION: Metagenomics characterizes the taxonomic diversity of microbial communities by sequencing DNA directly from an environmental sample. One of the main challenges in metagenomics data analysis is the binning step, where each sequenced read is assigned to a taxonomic clade. Because of the large volume of metagenomics datasets, binning methods need fast and accurate algorithms that can operate with reasonable computing requirements. While standard alignment-based methods provide state-of-the-art performance, compositional approaches that assign a taxonomic class to a DNA read based on the k-mers it contains have the potential to provide faster solutions.
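
As a rough illustration of what a compositional representation of a read looks like, the sketch below counts the overlapping k-mers in a DNA string. It is not the authors' implementation: the function name `kmer_profile`, the example read, and the choice of k = 3 are all assumptions made for this example. A real pipeline would also need to handle ambiguous bases and feed the resulting profile to a trained classifier.

```python
from collections import Counter


def kmer_profile(read: str, k: int) -> Counter:
    """Count every overlapping k-mer in a DNA read (hypothetical helper).

    Reads shorter than k yield an empty profile.
    """
    read = read.upper()
    return Counter(read[i:i + k] for i in range(len(read) - k + 1))


# Illustrative read and k; neither comes from the paper.
profile = kmer_profile("ACGTACGT", k=3)
for kmer, count in sorted(profile.items()):
    print(kmer, count)
```

A classifier operating on such profiles never aligns the read against reference genomes, which is where the potential speed advantage over alignment-based methods comes from.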

Authors

  • Kévin Vervier
    Bioinformatics Research Department, bioMérieux, 69280 Marcy-l'Étoile; MINES ParisTech, PSL Research University, CBIO-Centre for Computational Biology, 77300 Fontainebleau; Institut Curie, 75248 Paris Cedex; and INSERM U900, 75248 Paris Cedex, France.
  • Pierre Mahé
    Bioinformatics Research Department, bioMérieux, 69280 Marcy-l'Étoile.
  • Maud Tournoud
    Bioinformatics Research Department, bioMérieux, 69280 Marcy-l'Étoile.
  • Jean-Baptiste Veyrieras
    Bioinformatics Research Department, bioMérieux, 69280 Marcy-l'Étoile.
  • Jean-Philippe Vert
    MINES ParisTech, PSL Research University, CBIO-Centre for Computational Biology, 77300 Fontainebleau; Institut Curie, 75248 Paris Cedex; and INSERM U900, 75248 Paris Cedex, France.