Every Pixel Tells a Story: End-to-End Urdu Newspaper OCR
Journal: arXiv
Published Date: May 20, 2025
Abstract
This paper introduces a comprehensive end-to-end pipeline for Optical
Character Recognition (OCR) on Urdu newspapers. In our approach, we address the
unique challenges of complex multi-column layouts, low-resolution archival
scans, and diverse font styles. Our process decomposes the OCR task into four
key modules: (1) article segmentation, (2) image super-resolution, (3) column
segmentation, and (4) text recognition. For article segmentation, we fine-tune
and evaluate YOLOv11x to identify and separate individual articles from
cluttered layouts. Our model achieves a precision of 0.963 and mAP@50 of 0.975.
For super-resolution, we fine-tune and benchmark the SwinIR model (reaching
32.71 dB PSNR) to enhance the quality of degraded newspaper scans. For column
segmentation, we use YOLOv11x to separate columns of text, which further
improves downstream performance; this model reaches a precision of 0.970 and
mAP@50 of 0.975. In the text recognition stage, we benchmark a range of LLMs from
different families, including Gemini, GPT, Llama, and Claude. The lowest WER of
0.133 is achieved by Gemini-2.5-Pro.
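
For readers unfamiliar with the two scalar metrics quoted above, the following is a minimal sketch of how WER and PSNR are conventionally defined. It is not the authors' evaluation code: the function names `word_error_rate` and `psnr` are hypothetical, whitespace tokenization is an assumption (the abstract does not state how Urdu text is normalized or tokenized before scoring), and an 8-bit peak value of 255 is assumed for PSNR.

```python
import math


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: words are separated by whitespace and compared exactly;
    no Urdu-specific normalization is applied.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def psnr(reference, restored, max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB over two equal-length pixel sequences.

    Assumption: pixels are flattened intensity values on a 0..max_value scale.
    """
    if len(reference) != len(restored) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((a - b) ** 2 for a, b in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10 * math.log10(max_value ** 2 / mse)
```

Under these definitions, a WER of 0.133 corresponds to roughly 13 word-level errors (substitutions, deletions, or insertions) per 100 reference words, and a higher PSNR indicates a restored image closer to its reference.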