A composite universal DNA signature for the tree of life.
Journal:
Nature Ecology & Evolution
Published Date:
Jun 25, 2025
Abstract
Species identification using DNA barcodes has revolutionized biodiversity sciences. However, conventional barcoding methods may lack power and universal applicability across the tree of life. Alternative methods based on whole genome sequencing are hard to scale due to large data requirements. Here we develop a novel DNA-based identification method, varKoding, using exceptionally low-coverage genome skim data to create two-dimensional images representing the genomic signature of a species. Using these representations, we train neural networks for taxonomic identification. Using a novel, taxonomically verified genomic dataset of Malpighiales plant accessions, we optimize training hyperparameters and find the highest performance by combining a transformer architecture with a new modified chaos game representation. Greater than 91% precision is achieved despite minimal input data, exceeding alternative methods tested. We illustrate the broad utility of varKoding across several focal clades of eukaryotes and prokaryotes. We also train a model capable of identifying all species in the Sequence Read Archive of the National Center for Biotechnology Information using less than 10 Mbp of sequencing data, with 96% precision and 95% recall, and robust across sequencing platforms. The varKoding approach offers enhanced computational efficiency and scalability, minimal data inputs that are robust to sequencing details, and modularity for further development in biodiversity science.
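To make the idea of turning reads into an image concrete, the following is a minimal sketch of the classic frequency chaos game representation, in which each k-mer is counted in one cell of a 2^k x 2^k grid. It is not the modified representation developed in the paper, whose details the abstract does not give. The corner assignment, the function names `kmer_cell` and `cgr_image`, and the default `k` are all assumptions chosen for illustration.

```python
import numpy as np

# Assumed corner assignment (x, y) for the unit square; a common convention,
# not necessarily the one used by varKoding.
CORNERS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}


def kmer_cell(kmer):
    """Return the (row, col) cell of a k-mer in a 2^k x 2^k CGR grid.

    In the chaos game each new base halves the distance to its corner,
    so the last base of the k-mer selects the coarsest quadrant. Reading
    the k-mer in reverse therefore yields bits from most to least significant.
    """
    row = col = 0
    for base in reversed(kmer):
        x, y = CORNERS[base]
        row = (row << 1) | y
        col = (col << 1) | x
    return row, col


def cgr_image(reads, k=3):
    """Count k-mers from an iterable of reads into a 2^k x 2^k integer array."""
    size = 2 ** k
    image = np.zeros((size, size), dtype=np.int64)
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Skip k-mers containing ambiguous bases such as N.
            if all(base in CORNERS for base in kmer):
                row, col = kmer_cell(kmer)
                image[row, col] += 1
    return image


if __name__ == "__main__":
    # Toy reads for illustration only; real genome skims contain millions of reads.
    print(cgr_image(["ACGTACGT", "GGGTTTAA"], k=2))
```

An array of this kind could be normalized and passed to an image classifier, which is the general workflow the abstract describes. The choice of k, the normalization, and the network architecture would all need to follow the paper itself.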