Deciphering the Underserved: Benchmarking LLM OCR for Low-Resource Scripts
Journal: arXiv
Published Date: Dec 20, 2024
Abstract
This study investigates the potential of Large Language Models (LLMs),
particularly GPT-4o, for Optical Character Recognition (OCR) in low-resource
scripts such as Urdu, Albanian, and Tajik, with English serving as a benchmark.
Using a meticulously curated dataset of 2,520 images incorporating controlled
variations in text length, font size, background color, and blur, the research
simulates diverse real-world challenges. The results expose the limitations of
zero-shot LLM-based OCR, particularly for linguistically complex scripts, and
point to the need for annotated datasets and fine-tuned models. This work
underscores the urgency of addressing accessibility gaps in text digitization,
paving the way for inclusive and robust OCR solutions for underserved
languages.
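
The controlled variations described in the abstract can be pictured as a grid of rendering conditions. The sketch below is a minimal illustration only: the factor names come from the abstract, but every level value (the length buckets, font sizes, background colors, and blur radii) is a hypothetical placeholder. The abstract states neither the levels actually used nor whether the factors were fully crossed, so the count printed here is not meant to reproduce the 2,520 images.

```python
from itertools import product

# Factor names follow the abstract; all level values below are hypothetical
# placeholders, since the abstract does not list the values used in the study.
LANGUAGES = ["English", "Urdu", "Albanian", "Tajik"]
TEXT_LENGTHS = ["short", "medium", "long"]  # assumed buckets
FONT_SIZES = [12, 18, 24]                   # assumed point sizes
BACKGROUNDS = ["white", "yellow"]           # assumed colors
BLUR_LEVELS = [0.0, 1.0]                    # assumed blur radii


def build_conditions():
    """Return one dict per combination of the controlled factors."""
    grid = product(LANGUAGES, TEXT_LENGTHS, FONT_SIZES, BACKGROUNDS, BLUR_LEVELS)
    return [
        {
            "language": language,
            "text_length": text_length,
            "font_size": font_size,
            "background": background,
            "blur": blur,
        }
        for language, text_length, font_size, background, blur in grid
    ]


conditions = build_conditions()
print(len(conditions))  # 4 * 3 * 3 * 2 * 2 = 144 combinations
```

Keeping each condition as a plain dictionary makes it straightforward to attach a rendered image path and a recognition result later, so accuracy can be grouped by any single factor.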