Context-Independent OCR with Multimodal LLMs: Effects of Image Resolution and Visual Complexity
Journal: arXiv
Published Date: Mar 31, 2025
Abstract
Due to their high versatility in tasks such as image captioning, document
analysis, and automated content generation, multimodal Large Language Models
(LLMs) have attracted significant attention across various industrial fields.
In particular, they have been shown to surpass specialized models in Optical
Character Recognition (OCR). Nevertheless, their performance under varying
image conditions remains insufficiently investigated, and because they rely on
contextual cues, correct recognition of individual characters is not
guaranteed. In this
work, we examine a context-independent OCR task using single-character images
with diverse visual complexities to determine the conditions for accurate
recognition. Our findings reveal that multimodal LLMs can match conventional
OCR methods at about 300 ppi, yet their performance deteriorates significantly
below 150 ppi. Additionally, for multimodal LLMs we observe a very weak
correlation between visual complexity and misrecognitions, whereas a
conventional OCR-specific model exhibits no such correlation. These results
suggest that image resolution and visual
complexity may play an important role in the reliable application of multimodal
LLMs to OCR tasks that require precise character-level accuracy.
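
To make the two quantities in the abstract concrete, the following Python sketch shows (a) how a resolution in ppi translates into the pixel height available to a single character, and (b) how a correlation between a per-character complexity score and a binary misrecognition flag could be computed. It is an illustration under stated assumptions, not the procedure used in the paper: the abstract names neither the font size, the complexity measure, nor the correlation statistic, so the 12 pt size, the `complexity` scores, and the use of the Pearson coefficient (which equals the point-biserial correlation when one variable is binary) are all hypothetical choices.

```python
from math import sqrt

POINTS_PER_INCH = 72  # one typographic point is 1/72 inch


def glyph_height_px(font_size_pt: float, ppi: float) -> float:
    """Approximate em-box height in pixels for a glyph rendered at `ppi`."""
    return font_size_pt / POINTS_PER_INCH * ppi


def pearson_r(xs, ys) -> float:
    """Pearson correlation coefficient between two equal-length sequences.

    With a binary `ys` (1 = misrecognized, 0 = correct) this is the
    point-biserial correlation.
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical 12 pt character at the two resolutions named in the abstract.
    for ppi in (300, 150):
        print(ppi, "ppi ->", glyph_height_px(12, ppi), "px")

    # Hypothetical data: one complexity score and one error flag per character.
    complexity = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
    misrecognized = [0, 0, 1, 0, 1, 1]
    print("r =", pearson_r(complexity, misrecognized))
```

Under these assumptions, halving the resolution from 300 ppi to 150 ppi halves the pixel height of each character, which is one way to see why recognition could degrade at low resolution. A coefficient near zero from `pearson_r` would correspond to the "very weak" or absent correlation the abstract describes.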