Beyond Cropped Regions: New Benchmark and Corresponding Baseline for Chinese Scene Text Retrieval in Diverse Layouts
Journal:
arXiv
Published Date:
Jun 5, 2025
Abstract
Chinese scene text retrieval is a practical task that aims to search for
images containing visual instances of a Chinese query text. This task is
extremely challenging because Chinese text often features complex and diverse
layouts in real-world scenes. Existing approaches tend to inherit solutions
designed for English scene text retrieval and fail to achieve satisfactory
performance. In
this paper, we establish a Diversified Layout benchmark for Chinese Street View
Text Retrieval (DL-CSVTR), which is specifically designed to evaluate retrieval
performance across various text layouts, including vertical, cross-line, and
partial alignments. To address the limitations of existing methods, we propose
Chinese Scene Text Retrieval CLIP (CSTR-CLIP), a novel model that integrates
global visual information with multi-granularity alignment training. CSTR-CLIP
applies a two-stage training process to overcome previous limitations, such as
the exclusion of visual features outside the text region and reliance on
single-granularity alignment, thereby enabling the model to effectively handle
diverse text layouts. Experiments on an existing benchmark show that CSTR-CLIP
outperforms the previous state-of-the-art model by 18.82% in accuracy and also
offers faster inference. Further analysis on DL-CSVTR confirms the
superior performance of CSTR-CLIP in handling various text layouts. The dataset
and code will be publicly available to facilitate research in Chinese scene
text retrieval.
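
As an illustration of the retrieval task described above, the following minimal sketch ranks gallery images by cosine similarity between a query-text embedding and one embedding per image. It shows only the general embedding-based ranking step common to CLIP-style retrieval. The function names (`cosine_similarity`, `rank_images`), the image file names, and the toy three-dimensional vectors are hypothetical placeholders; they are not CSTR-CLIP's encoders, training procedure, or API.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_images(query_embedding, image_embeddings):
    """Return (image_name, score) pairs sorted from most to least similar."""
    scored = [
        (name, cosine_similarity(query_embedding, embedding))
        for name, embedding in image_embeddings.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy embeddings standing in for the outputs of an image encoder (assumption).
gallery = {
    "street_001.jpg": [0.9, 0.1, 0.0],
    "street_002.jpg": [0.1, 0.8, 0.3],
    "street_003.jpg": [0.7, 0.6, 0.1],
}

# Toy embedding standing in for the encoded Chinese query text (assumption).
query = [1.0, 0.2, 0.0]

ranking = rank_images(query, gallery)
for name, score in ranking:
    print(f"{name}\t{score:.4f}")
```

In a real system the vectors would come from trained text and image encoders, and the ranked list would be scored against ground-truth annotations for each query.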