GeoChain: Multimodal Chain-of-Thought for Geographic Reasoning
Journal: arXiv
Published Date: Jun 1, 2025
Abstract
This paper introduces GeoChain, a large-scale benchmark for evaluating
step-by-step geographic reasoning in multimodal large language models (MLLMs).
Leveraging 1.46 million Mapillary street-level images, GeoChain pairs each
image with a 21-step chain-of-thought (CoT) question sequence (over 30 million
Q&A pairs). These sequences guide models from coarse attributes to fine-grained
localization across four reasoning categories (visual, spatial, cultural, and
precise geolocation) and are annotated by difficulty. Images are also enriched with
semantic segmentation (150 classes) and a visual locatability score. Our
benchmarking of contemporary MLLMs (GPT-4.1 variants, Claude 3.7, Gemini 2.5
variants) on a diverse 2,088-image subset reveals consistent challenges: models
frequently exhibit weaknesses in visual grounding, display erratic reasoning,
and struggle to achieve accurate localization, especially as the reasoning
complexity escalates. GeoChain offers a robust diagnostic methodology for
advancing complex geographic reasoning in MLLMs.
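
As a rough illustration of the structure described above, the following Python sketch models one benchmark record: an image paired with a 21-step question sequence, where each step carries one of the four reasoning categories and a difficulty label, plus a locatability score. All class and field names (`Step`, `ImageRecord`, `locatability`, and so on) are hypothetical and chosen for illustration; they are not the dataset's actual schema, and the segmentation annotations are omitted. The sketch also checks the arithmetic behind the "over 30 million" figure, treating 1.46 million as an exact count for the purpose of the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List

# Figures taken from the abstract; the image count is rounded there.
NUM_IMAGES = 1_460_000
STEPS_PER_IMAGE = 21


class Category(Enum):
    """The four reasoning categories named in the abstract."""
    VISUAL = "visual"
    SPATIAL = "spatial"
    CULTURAL = "cultural"
    GEOLOCATION = "precise geolocation"


@dataclass
class Step:
    """One question in the chain (hypothetical field names)."""
    index: int          # position in the sequence, 1..21
    question: str
    category: Category
    difficulty: str     # free-form label; the real label set is not assumed here


@dataclass
class ImageRecord:
    """One image with its question sequence (hypothetical schema)."""
    image_id: str
    steps: List[Step]
    locatability: float  # visual locatability score

    def __post_init__(self) -> None:
        # Every image is paired with exactly 21 steps.
        if len(self.steps) != STEPS_PER_IMAGE:
            raise ValueError(
                f"expected {STEPS_PER_IMAGE} steps, got {len(self.steps)}"
            )


def total_qa_pairs(num_images: int = NUM_IMAGES,
                   steps: int = STEPS_PER_IMAGE) -> int:
    """Number of Q&A pairs implied by one fixed-length chain per image."""
    return num_images * steps
```

With these assumptions, `total_qa_pairs()` evaluates to 30,660,000, which is consistent with the abstract's "over 30 million Q&A pairs".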