LIGHT: Multi-Modal Text Linking on Historical Maps
Journal:
arXiv
Published Date:
Jun 27, 2025
Abstract
Text on historical maps provides valuable information for studies in history,
economics, geography, and other related fields. Unlike structured or
semi-structured documents, text on maps varies significantly in orientation,
reading order, shape, and placement. Many modern methods can detect and
transcribe text regions, but they struggle to effectively ``link'' the
recognized text fragments, e.g., determining a multi-word place name. Existing
layout analysis methods model word relationships to improve text understanding
in structured documents, but they primarily rely on linguistic features and
neglect geometric information, which is essential for handling map text. To
address these challenges, we propose LIGHT, a novel multi-modal approach that
integrates linguistic, image, and geometric features for linking text on
historical maps. In particular, LIGHT includes a geometry-aware embedding
module that encodes the polygonal coordinates of text regions to capture
polygon shapes and their relative spatial positions on an image. LIGHT unifies
this geometric information with the visual and linguistic token embeddings from
LayoutLMv3, a pretrained layout analysis model. LIGHT uses the cross-modal
information to predict the reading-order successor of each text instance
directly with a bi-directional learning strategy that enhances sequence
robustness. Experimental results show that LIGHT outperforms existing methods
on the ICDAR 2024/2025 MapText Competition data, demonstrating the
effectiveness of multi-modal learning for historical map text linking.