High Throughput Information Extraction of Printed Specimen Labels from Large-Scale Digitization of Entomological Collections using a Semi-Automated Pipeline

Journal: bioRxiv
Published Date:

Abstract

Natural history museums curate billions of insect specimens, forming a vast but underutilized resource for biodiversity research. While digitization efforts have increased the availability of high-resolution specimen images, extracting metadata from labels remains a major bottleneck, often requiring manual transcription. We developed a semi-automated pipeline, ELIE (Entomological Label Information Extraction), which combines computer vision, convolutional neural networks (CNNs), optical character recognition (OCR), and clustering algorithms to streamline label data extraction. Our pipeline operates in three stages: (1) label detection and classification (printed vs. handwritten), (2) OCR-based text extraction from printed labels using Tesseract or Google Vision, and (3) clustering of extracted text for human validation of outliers. Benchmarking on diverse datasets from multiple museum collections showed that our approach successfully extracted and clustered up to 98% of printed labels, significantly reducing manual effort. The pipeline improves efficiency in digitization workflows while maintaining high accuracy in label data capture. Our approach demonstrates the potential of integrating AI-driven methods with human validation to accelerate specimen digitization. By reducing manual transcription workload and enabling scalable extraction of insect label metadata, it unlocks biodiversity data for research in ecology, systematics, and conservation globally.
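
As an illustration of stage (3), the sketch below groups OCR transcripts by string similarity and sets aside singleton clusters for human validation. It is a minimal sketch under stated assumptions, not the ELIE implementation: the abstract does not name the clustering algorithm, so the greedy grouping with Python's standard-library `difflib.SequenceMatcher`, the function names (`cluster_labels`, `split_for_review`), the 0.8 threshold, and the sample transcripts are all hypothetical.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial OCR differences are ignored."""
    return " ".join(text.lower().split())


def similarity(a: str, b: str) -> float:
    """Return a similarity ratio in [0, 1] between two strings."""
    return SequenceMatcher(None, a, b).ratio()


def cluster_labels(texts, threshold=0.8):
    """Greedy grouping (an assumed stand-in for the paper's clustering step):
    each transcript joins the first cluster whose first member is at least
    `threshold` similar; otherwise it starts a new cluster."""
    clusters = []  # each cluster is a list of the original transcripts
    for text in texts:
        for cluster in clusters:
            if similarity(normalize(text), normalize(cluster[0])) >= threshold:
                cluster.append(text)
                break
        else:
            clusters.append([text])
    return clusters


def split_for_review(clusters, min_size=2):
    """Separate clusters large enough to accept in bulk from outliers
    that a human should check individually."""
    accepted = [c for c in clusters if len(c) >= min_size]
    outliers = [c[0] for c in clusters if len(c) < min_size]
    return accepted, outliers


# Invented transcripts for illustration only; not data from the study.
ocr_texts = [
    "Germany, Berlin 12.V.1931 leg. Mueller",
    "Germany, Berlin 12.V.1931 leg. Muller",
    "Germany, Berlin 12.V.1931 leg. Mueller",
    "Coll. A. Example",
    "Coll. A. Example",
    "xq#7 ~~ l1",
]
accepted, outliers = split_for_review(cluster_labels(ocr_texts))
print(len(accepted), "clusters accepted;", len(outliers), "outlier(s) for review")
```

In this toy setup, near-identical transcripts of the same printed label fall into one cluster that can be validated once, while a transcript that matches nothing else is routed to manual review. The threshold and minimum cluster size would need tuning on real label data.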

Authors

  • Margot Belot; Joël Tuberosa; Leonardo Preuss; Olha Svezhentseva; Magdalena Claessen; Christian Bölling; Franziska Schuster; Théo Léger