A multi-modal dataset for insect biodiversity with imagery and DNA at the trap and individual level
Journal:
arXiv
Published Date:
Jul 9, 2025
Abstract
Insects comprise millions of species, many experiencing severe population
declines under environmental and habitat changes. High-throughput approaches
are crucial for accelerating our understanding of insect diversity, with DNA
barcoding and high-resolution imaging showing strong potential for automatic
taxonomic classification. However, most image-based approaches rely on
individual specimen data, unlike the unsorted bulk samples collected in
large-scale ecological surveys. We present the Mixed Arthropod Sample
Segmentation and Identification (MassID45) dataset for training automatic
classifiers of bulk insect samples. It uniquely combines molecular and imaging
data at both the unsorted sample level and the full set of individual
specimens. Human annotators, supported by an AI-assisted tool, performed two
tasks on bulk images: creating segmentation masks around each individual
arthropod and assigning taxonomic labels to over 17 000 specimens. Combining
the taxonomic resolution of DNA barcodes with precise abundance estimates of
bulk images holds great potential for rapid, large-scale characterization of
insect communities. This dataset pushes the boundaries of tiny object detection
and instance segmentation, fostering innovation in both ecological and machine
learning research.