Joint Low-level and High-level Textual Representation Learning with Multiple Masking Strategies
Journal: arXiv
Published Date: May 11, 2025
Abstract
Most existing text recognition methods are trained on large-scale synthetic
datasets due to the scarcity of labeled real-world datasets. Synthetic images,
however, cannot faithfully reproduce real-world scenarios, such as uneven
illumination, irregular layout, occlusion, and degradation, resulting in
performance disparities when handling complex real-world images. Recent
self-supervised learning techniques, notably contrastive learning and masked
image modeling (MIM), narrow this domain gap by exploiting unlabeled real text
images. This study first analyzes the original Masked AutoEncoder (MAE) and
observes that random patch masking predominantly captures low-level textural
features but misses high-level contextual representations. To capture these
high-level contextual representations, we introduce random blockwise and span
masking for the text recognition task. These strategies mask contiguous
image patches and can completely remove some characters, forcing the model to infer
relationships among characters within a word. Our Multi-Masking Strategy (MMS)
integrates random patch, blockwise, and span masking into the MIM framework
to jointly learn low-level and high-level textual representations. After fine-tuning
with real data, MMS outperforms state-of-the-art self-supervised methods on
various text-related tasks, including text recognition, segmentation, and
text-image super-resolution.
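
The three masking strategies named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the patch grid size, the mask ratio, the block and span size limits, and the uniform choice among strategies are all assumptions made for illustration, and the function names are hypothetical. Each function returns a boolean grid in which True marks a masked patch.

```python
import numpy as np

def random_patch_mask(h, w, ratio, rng):
    """Mask individual patches chosen uniformly at random (MAE-style)."""
    n = h * w
    k = int(round(ratio * n))
    mask = np.zeros(n, dtype=bool)
    mask[rng.choice(n, size=k, replace=False)] = True
    return mask.reshape(h, w)

def blockwise_mask(h, w, ratio, rng, max_block=(4, 8)):
    """Mask rectangular blocks of adjacent patches.

    Blocks are added until at least the target number of patches is
    masked, so the final count may slightly exceed the target.
    max_block is an assumed limit, not a value from the paper.
    """
    target = int(round(ratio * h * w))
    mask = np.zeros((h, w), dtype=bool)
    while mask.sum() < target:
        bh = rng.integers(1, min(max_block[0], h) + 1)
        bw = rng.integers(1, min(max_block[1], w) + 1)
        top = rng.integers(0, h - bh + 1)
        left = rng.integers(0, w - bw + 1)
        mask[top:top + bh, left:left + bw] = True
    return mask

def span_mask(h, w, ratio, rng, max_span=4):
    """Mask full-height spans of consecutive patch columns.

    Removing whole columns can hide entire characters in a text line.
    max_span is an assumed limit, not a value from the paper.
    """
    target_cols = int(round(ratio * w))
    mask = np.zeros((h, w), dtype=bool)
    while mask[0].sum() < target_cols:
        span = rng.integers(1, min(max_span, w) + 1)
        left = rng.integers(0, w - span + 1)
        mask[:, left:left + span] = True
    return mask

def multi_mask(h, w, ratio, rng):
    """Pick one of the three strategies uniformly at random (an assumption)."""
    strategies = [random_patch_mask, blockwise_mask, span_mask]
    choice = strategies[rng.integers(len(strategies))]
    return choice(h, w, ratio, rng)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical 8 x 32 patch grid for a wide text-line image.
    mask = multi_mask(8, 32, 0.75, rng)
    print(mask.astype(int))
```

In an MIM pipeline, a mask like this would select which patch tokens are hidden from the encoder and reconstructed by the decoder. Random patch masking leaves fragments of every character visible, whereas blockwise and span masks can hide whole characters, which is the property the abstract credits with encouraging contextual inference.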