High Throughput Information Extraction of Printed Specimen Labels from Large-Scale Digitization of Entomological Collections using a Semi-Automated Pipeline

Journal: bioRxiv

Published Date: Jan 1, 2025

Abstract

Natural history museums curate billions of insect specimens, forming a vast but underutilized resource for biodiversity research. While digitization efforts have increased the availability of high-resolution specimen images, extracting metadata from labels remains a major bottleneck, often requiring manual transcription. We developed a semi-automated pipeline, ELIE (Entomological Label Information Extraction), which combines computer vision, convolutional neural networks (CNNs), optical character recognition (OCR), and clustering algorithms to streamline label data extraction. Our pipeline operates in three stages: (1) label detection and classification (printed vs. handwritten), (2) OCR-based text extraction from printed labels using Tesseract or Google Vision, and (3) clustering of extracted text for human validation of outliers. Benchmarking on diverse datasets from multiple museum collections showed that our approach successfully extracted and clustered up to 98% of printed labels, significantly reducing manual effort. The pipeline improves efficiency in digitization workflows while maintaining high accuracy in label data capture. Our approach demonstrates the potential of integrating AI-driven methods with human validation to accelerate specimen digitization. By reducing manual transcription workload and enabling scalable extraction of insect label metadata, it unlocks biodiversity data for research in ecology, systematics, and conservation globally.
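
As an illustration of stage (3), the sketch below groups OCR output by string similarity and sets aside texts that match no other label so a person can check them. It is not the ELIE implementation: the abstract does not name the clustering algorithm, so the greedy grouping, the `difflib` similarity measure, the 0.8 threshold, the function names, and the sample label strings are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio in [0, 1] between two strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def cluster_labels(texts, threshold=0.8):
    """Greedy clustering (hypothetical stand-in for the paper's method).

    Each text joins the first cluster whose first member is at least
    `threshold` similar to it; otherwise it starts a new cluster.
    """
    clusters = []
    for text in texts:
        for cluster in clusters:
            if similarity(text, cluster[0]) >= threshold:
                cluster.append(text)
                break
        else:
            clusters.append([text])
    return clusters


def split_for_review(clusters, min_size=2):
    """Separate clusters with enough members from singleton outliers."""
    accepted = [c for c in clusters if len(c) >= min_size]
    outliers = [c[0] for c in clusters if len(c) < min_size]
    return accepted, outliers


# Invented OCR strings, including typical OCR noise ("Ber1in").
ocr_texts = [
    "Germany, Berlin 12.V.1931 leg. Mueller",
    "Germany, Berlin 12.V.1931 leg. Muller",
    "Germany, Ber1in 12.V.1931 leg. Mueller",
    "Brazil, Manaus 03.XI.1978 leg. Silva",
]

clusters = cluster_labels(ocr_texts)
accepted, outliers = split_for_review(clusters)
print("clusters accepted:", len(accepted))
print("sent to human validation:", outliers)
```

In this toy setup the three near-identical German labels fall into one cluster and can be transcribed once, while the single Brazilian label is routed to a human. A real workflow would need a threshold tuned on labelled data and a more robust clustering method.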

Authors

Margot Belot; Joël Tuberosa; Leonardo Preuss; Olha Svezhentseva; Magdalena Claessen; Christian Bölling; Franziska Schuster; Théo Léger