SemiETS: Integrating Spatial and Content Consistencies for Semi-Supervised End-to-end Text Spotting
Journal: arXiv
Published Date: Apr 14, 2025
Abstract
Most previous scene text spotting methods rely on high-quality manual
annotations to achieve promising performance. To reduce the high cost of such
annotations, we study semi-supervised text spotting (SSTS) to exploit useful
information from unlabeled images. However, directly applying existing
semi-supervised methods designed for general scenes to SSTS raises new
challenges: 1) inconsistent pseudo labels between the detection and recognition
tasks, and 2) sub-optimal supervision caused by inconsistency between the
teacher and the student. Thus, we propose a new Semi-supervised framework for
End-to-end Text Spotting, namely SemiETS, that leverages the complementarity of
text detection and recognition.
Specifically, it gradually generates reliable hierarchical pseudo labels for
each task, thereby reducing noisy labels. Meanwhile, it extracts important
information in locations and transcriptions from bidirectional flows to improve
consistency. Extensive experiments on three datasets under various settings
demonstrate the effectiveness of SemiETS on arbitrary-shaped text. For example,
it outperforms previous state-of-the-art SSL methods by a large margin on
end-to-end spotting (+8.7%, +5.6%, and +2.6% H-mean under 0.5%, 1%, and 2%
labeled data settings on Total-Text, respectively). More importantly, it still
improves upon a strongly supervised text spotter trained with plenty of labeled
data by 2.0%. Its compelling domain adaptation ability also shows practical
potential. Moreover, our method yields consistent improvements across different
text spotters.
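
For readers unfamiliar with the metric quoted above: H-mean is the harmonic mean of precision and recall. The sketch below is only an illustration of that definition. The function name and the example precision/recall values are assumptions made here, not figures or code from the paper.

```python
def h_mean(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (hypothetical helper)."""
    # Guard against division by zero when both inputs are zero.
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative values only; these are not results reported in the paper.
example = h_mean(0.8, 0.6)
print(f"H-mean: {example:.4f}")
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a spotter must keep both precision and recall high to score well on it.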