TaxonMatch: taxonomic integration and tree construction from heterogeneous biological databases

Journal: bioRxiv
Published Date:

Abstract

Integrating taxonomic data across heterogeneous biological databases remains a major challenge in biodiversity research due to non-standardized nomenclature, incomplete synonym annotation, and inconsistencies in taxonomic hierarchies. These issues limit interoperability between key resources such as the Global Biodiversity Information Facility (GBIF), the National Center for Biotechnology Information (NCBI), and citizen science platforms such as iNaturalist. Here, we present TaxonMatch, a scalable and reproducible framework for taxonomic reconciliation and cross-database integration. The workflow combines string-based candidate generation using TF-IDF vectorization, supervised machine learning for match classification, and lineage-aware synonym resolution to align taxonomic entities across multiple sources. By integrating both declared and implicit equivalences, TaxonMatch resolves typographical variation, synonymy, and structural inconsistencies in taxonomic data. The framework produces a unified taxonomic structure in which equivalent entities are reconciled while preserving source-specific identifiers, provenance information, and hierarchical relationships. We evaluate its robustness across multiple classifiers and demonstrate its effectiveness in resolving ambiguous taxonomic cases that are not handled by traditional matching approaches. We illustrate the applicability of TaxonMatch through three use cases: the construction of a unified arthropod taxonomy integrating GBIF, NCBI, and iNaturalist data; the identification of closest extant relatives of fossil taxa with molecular information; and the integration of genomic resources with conservation data from the IUCN Red List. These applications highlight the ability of the workflow to support the integration of ecological, genomic, and paleontological datasets. 
TaxonMatch provides a flexible and generalizable solution for taxonomic data integration, enabling the construction of coherent and interoperable biodiversity datasets for downstream analyses in ecology, evolution, and conservation biology.
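
The string-based candidate generation step described above can be illustrated with a minimal sketch. It is not the TaxonMatch implementation or API: the function names, the character trigram choice, the smoothed IDF formula, the 0.5 threshold, and the example names are all assumptions made for illustration. The supervised match classification and lineage-aware synonym resolution steps are not shown.

```python
import math
from collections import Counter


def char_ngrams(name, n=3):
    """Split a name into overlapping character n-grams (lowercased, space-padded)."""
    padded = f" {name.lower().strip()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def tfidf_vectors(names, n=3):
    """Return one L2-normalised sparse TF-IDF vector (a dict) per name."""
    grams = [Counter(char_ngrams(name, n)) for name in names]
    # Document frequency: in how many names does each n-gram occur?
    doc_freq = Counter(g for counts in grams for g in counts)
    total = len(names)
    vectors = []
    for counts in grams:
        # Smoothed IDF (an assumption): common n-grams keep a small positive weight.
        vec = {g: tf * (math.log((1 + total) / (1 + doc_freq[g])) + 1)
               for g, tf in counts.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({g: w / norm for g, w in vec.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity of two L2-normalised sparse vectors."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(g, 0.0) for g, w in u.items())


def candidate_pairs(source_a, source_b, threshold=0.5):
    """Pair each name in source_a with its most similar name in source_b.

    Only pairs scoring at or above the (hypothetical) threshold are kept;
    source_b is assumed to be non-empty.
    """
    vectors = tfidf_vectors(source_a + source_b)
    vecs_a, vecs_b = vectors[:len(source_a)], vectors[len(source_a):]
    pairs = []
    for name_a, vec_a in zip(source_a, vecs_a):
        score, name_b = max(
            (cosine(vec_a, vec_b), name_b)
            for name_b, vec_b in zip(source_b, vecs_b)
        )
        if score >= threshold:
            pairs.append((name_a, name_b, round(score, 3)))
    return pairs


# Illustrative input only; these lists are not taken from the paper's data.
names_a = ["Apis mellifera", "Bombus terrestris"]
names_b = ["Apis melifera", "Bombus terrestris", "Drosophila melanogaster"]
for pair in candidate_pairs(names_a, names_b):
    print(pair)
```

In this sketch the misspelled "Apis melifera" is still proposed as a candidate for "Apis mellifera" because the two names share most of their character trigrams. In a workflow like the one described, such candidate pairs would then be passed to a classifier rather than accepted on similarity alone.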

Authors

  • Leone, M.
  • Rech De Laval, V.
  • Drage, H. B.
  • Waterhouse, R. M.
  • Robinson-Rechavi, M.