Integrative taxonomy using traits and genomic data for Species Delimitation with Deep learning

Journal: bioRxiv
Published Date:

Abstract

Recognizing species boundaries in complex speciation scenarios, including those involving gene flow and demographic fluctuations, remains a challenge, particularly given the diversity of existing species concepts. Promising recent approaches adopt an integrative taxonomy that combines multiple sources of evidence (e.g., genetics, morphology, geographic distributions), reflecting different properties associated with the dynamics of the speciation continuum. The use of statistical inference methods for model comparison, such as approximate Bayesian computation, approximate likelihood approaches, and machine learning, has improved the assessment of species boundaries in such contexts. However, most existing approaches analyze genetic and phenotypic/geographical information separately and then compare the results visually or qualitatively. Methods that integrate genetic information with other sources of evidence remain limited to simple evolutionary models and are typically unable to analyze more than a few hundred loci across a maximum of a few tens of samples. Here, we present a deep learning approach (DeepID) that combines two convolutional neural networks to integrate genomic data (thousands of loci or single nucleotide polymorphisms, SNPs) and trait information into a unified framework. Using both simulated and empirical data sets, we evaluate the power and accuracy of this approach for discriminating among competing divergence speciation scenarios (with minimal ongoing gene flow) across a varying number of SNPs and traits, as well as different levels of missing data. Analyses based on genomic or trait data alone yielded slightly lower accuracy, whereas integrating genomic and trait data resulted in improved performance. When we violated the speciation model by including extensive migration, approaches incorporating trait data were less affected than those relying solely on genomic information.
Together, these results suggest that combining genomic and trait data may capture complementary signals associated with different stages of the speciation process, reflecting the fact that speciation is a continuum in which genetic and phenotypic divergence may proceed at different rates. Moreover, our approach successfully recovered the expected delimitation scenarios in empirical data sets from a plant (Euphorbia balsamifera) and a fish (Lepomis megalotis) species complex. We argue that our method is flexible and promising, allowing for the comparison of complex scenarios and the use of multiple types of data.
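
The idea of feeding genomic and trait data through two convolutional branches and merging them before classification can be illustrated with a minimal sketch. This is not the DeepID implementation: the function names (`conv_branch`, `predict_scenario`), the kernel sizes, the number of kernels, and the toy data are all assumptions made for illustration, and the weights are random and untrained, so the output only demonstrates the data flow, not a meaningful delimitation result.

```python
import math
import random


def conv_branch(matrix, kernels, width):
    """One convolutional branch (hypothetical layout: rows = individuals).

    Each kernel spans all rows and `width` columns, slides along the
    columns, passes through ReLU, and is averaged over positions,
    giving one feature per kernel.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    features = []
    for kernel in kernels:
        activations = []
        for start in range(n_cols - width + 1):
            total = 0.0
            for i in range(n_rows):
                for j in range(width):
                    total += kernel[i][j] * matrix[i][start + j]
            activations.append(max(0.0, total))  # ReLU
        features.append(sum(activations) / len(activations))  # global average pooling
    return features


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def random_kernels(rng, count, n_rows, width):
    """Untrained placeholder weights; a real model would learn these."""
    return [[[rng.uniform(-0.5, 0.5) for _ in range(width)]
             for _ in range(n_rows)]
            for _ in range(count)]


def predict_scenario(snps, traits, params):
    """Run both branches, concatenate their features, and classify."""
    genomic = conv_branch(snps, params["snp_kernels"], params["snp_width"])
    trait = conv_branch(traits, params["trait_kernels"], params["trait_width"])
    merged = genomic + trait  # list concatenation = feature fusion
    scores = [sum(w * x for w, x in zip(row, merged)) + b
              for row, b in zip(params["dense_w"], params["dense_b"])]
    return softmax(scores)


# Toy inputs (assumed shapes): 6 individuals, 20 SNPs coded 0/1/2, 4 traits.
rng = random.Random(0)
n_ind, n_snps, n_traits, n_scenarios, n_kernels = 6, 20, 4, 3, 4
snps = [[rng.choice([0, 1, 2]) for _ in range(n_snps)] for _ in range(n_ind)]
traits = [[rng.gauss(0, 1) for _ in range(n_traits)] for _ in range(n_ind)]

params = {
    "snp_kernels": random_kernels(rng, n_kernels, n_ind, 3),
    "snp_width": 3,
    "trait_kernels": random_kernels(rng, n_kernels, n_ind, 2),
    "trait_width": 2,
    "dense_w": [[rng.uniform(-0.5, 0.5) for _ in range(2 * n_kernels)]
                for _ in range(n_scenarios)],
    "dense_b": [0.0] * n_scenarios,
}

# One probability per competing delimitation scenario.
probs = predict_scenario(snps, traits, params)
print(probs)
```

In practice such a model would be built in a deep learning framework and trained on data simulated under each competing scenario; the pure-Python version above only makes the two-branch structure explicit.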

Authors

  • Perez, M. F.
  • Riina, R.
  • Faircloth, B. C.
  • Cioffi, M. d. B.
  • Sanmartin, I.