Context-Independent OCR with Multimodal LLMs: Effects of Image Resolution and Visual Complexity

Journal: arXiv

Published Date: Mar 31, 2025

Abstract

Due to their high versatility in tasks such as image captioning, document analysis, and automated content generation, multimodal Large Language Models (LLMs) have attracted significant attention across various industrial fields. In particular, they have been shown to surpass specialized models in Optical Character Recognition (OCR). Nevertheless, their performance under different image conditions remains insufficiently investigated, and individual character recognition is not guaranteed due to their reliance on contextual cues. In this work, we examine a context-independent OCR task using single-character images with diverse visual complexities to determine the conditions for accurate recognition. Our findings reveal that multimodal LLMs can match conventional OCR methods at about 300 ppi, yet their performance deteriorates significantly below 150 ppi. Additionally, we observe a very weak correlation between visual complexity and misrecognitions, whereas a conventional OCR-specific model exhibits no correlation. These results suggest that image resolution and visual complexity may play an important role in the reliable application of multimodal LLMs to OCR tasks that require precise character-level accuracy.
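The two quantities the abstract relies on, resolution in ppi and the correlation between visual complexity and misrecognition, can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the 12 pt font size, and the use of a plain Pearson coefficient against a 0/1 error indicator are all assumptions made for illustration.

```python
import math


def glyph_height_px(font_size_pt: float, ppi: float) -> float:
    """Approximate em-box height in pixels (1 pt = 1/72 inch)."""
    return font_size_pt * ppi / 72.0


def pearson(xs, ys):
    """Pearson correlation coefficient; returns NaN if either input is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical 12 pt character rendered at the two resolutions
    # mentioned in the abstract.
    print(glyph_height_px(12, 300))  # 50.0
    print(glyph_height_px(12, 150))  # 25.0

    # Made-up data: a per-character complexity score and whether the
    # character was misrecognized (1) or not (0).
    complexity = [3.0, 5.0, 8.0, 12.0, 20.0]
    misrecognized = [0, 0, 1, 0, 1]
    print(pearson(complexity, misrecognized))
```

Under these assumptions, halving the resolution from 300 ppi to 150 ppi halves the pixel height available to each character, which is one way to picture why recognition might degrade at low resolution. Correlating a continuous score with a binary outcome in this way is equivalent to a point-biserial correlation.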

Authors

Kotaro Inoue

External Resources

View on arXiv (http://arxiv.org/abs/2503.23667v1)
