Simulations of Sequence Evolution: How (Un)realistic They Are and Why.

Journal: Molecular biology and evolution
PMID:

Abstract

MOTIVATION: Simulating multiple sequence alignments (MSAs) using probabilistic models of sequence evolution plays an important role in the evaluation of phylogenetic inference tools and is crucial to the development of novel learning-based approaches for phylogenetic reconstruction, for instance, neural networks. These models and the resulting simulated data need to be as realistic as possible to be indicative of the performance of the developed tools on empirical data and to ensure that neural networks trained on simulations perform well on empirical data. Over the years, numerous models of evolution have been published with the goal to represent as faithfully as possible the sequence evolution process and thus simulate empirical-like data. In this study, we simulated DNA and protein MSAs under increasingly complex models of evolution with and without insertion/deletion (indel) events using a state-of-the-art sequence simulator. We assessed their realism by quantifying how accurately supervised learning methods are able to predict whether a given MSA is simulated or empirical.

Authors

  • Johanna Trost
    Biometry and Evolutionary Biology Laboratory (LBBE), University Claude Bernard Lyon 1, Lyon, France.
  • Julia Haag
    Computational Molecular Evolution Group, Heidelberg Institute for Theoretical Studies, Heidelberg, Germany.
  • Dimitri Höhler
    Computational Molecular Evolution Group, Heidelberg Institute for Theoretical Studies, Heidelberg, Germany.
  • Laurent Jacob
    University of Lyon, Université Lyon 1, CNRS, Laboratoire de Biométrie et Biologie Évolutive UMR 5558, Lyon, Rhône France.
  • Alexandros Stamatakis
    Computational Molecular Evolution Group, Heidelberg Institute for Theoretical Studies, 69118 Heidelberg, Germany.
  • Bastien Boussau
    Biometry and Evolutionary Biology Laboratory (LBBE), University Claude Bernard Lyon 1, Lyon, France.