Can MLLMs Guide Me Home? A Benchmark Study on Fine-Grained Visual Reasoning from Transit Maps
Journal:
arXiv
Published Date:
May 24, 2025
Abstract
Multimodal large language models (MLLMs) have recently achieved significant
progress in visual tasks, including semantic scene understanding and text-image
alignment, with reasoning variants enhancing performance on complex tasks
involving mathematics and logic. However, their capacity for reasoning tasks
involving fine-grained visual understanding remains insufficiently evaluated.
To address this gap, we introduce ReasonMap, a benchmark designed to assess the
fine-grained visual understanding and spatial reasoning abilities of MLLMs.
ReasonMap encompasses high-resolution transit maps from 30 cities across 13
countries and includes 1,008 question-answer pairs spanning two question types
and three templates. Furthermore, we design a two-level evaluation pipeline
that properly assesses answer correctness and quality. Comprehensive
evaluations of 15 popular MLLMs, including both base and reasoning variants,
reveal a counterintuitive pattern: among open-source models, base models
outperform reasoning ones, while the opposite trend is observed in
closed-source models. Additionally, performance generally degrades when visual
inputs are masked, indicating that while MLLMs can leverage prior knowledge to
answer some questions, fine-grained visual reasoning tasks still require
genuine visual perception for strong performance. Our benchmark study offers
new insights into visual reasoning and contributes to investigating the gap
between open-source and closed-source models.