GeoChain: Multimodal Chain-of-Thought for Geographic Reasoning

Journal: arXiv

Published Date: Jun 1, 2025

Abstract

This paper introduces GeoChain, a large-scale benchmark for evaluating step-by-step geographic reasoning in multimodal large language models (MLLMs). Leveraging 1.46 million Mapillary street-level images, GeoChain pairs each image with a 21-step chain-of-thought (CoT) question sequence (over 30 million Q&A pairs). These sequences guide models from coarse attributes to fine-grained localization across four reasoning categories - visual, spatial, cultural, and precise geolocation - annotated by difficulty. Images are also enriched with semantic segmentation (150 classes) and a visual locatability score. Our benchmarking of contemporary MLLMs (GPT-4.1 variants, Claude 3.7, Gemini 2.5 variants) on a diverse 2,088-image subset reveals consistent challenges: models frequently exhibit weaknesses in visual grounding, display erratic reasoning, and struggle to achieve accurate localization, especially as the reasoning complexity escalates. GeoChain offers a robust diagnostic methodology, critical for fostering significant advancements in complex geographic reasoning within MLLMs.
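The scale figures in the abstract can be checked with a small sketch. The record layout below is a hypothetical illustration, not the dataset's actual schema: the names `QuestionStep`, `CATEGORIES`, and `total_qa_pairs` and the free-text `difficulty` field are assumptions, and only the 21-step length, the four category names, and the image count come from the abstract.

```python
from dataclasses import dataclass

# The four reasoning categories named in the abstract.
CATEGORIES = ("visual", "spatial", "cultural", "precise geolocation")

# Each image is paired with a 21-step question sequence (per the abstract).
STEPS_PER_IMAGE = 21


@dataclass
class QuestionStep:
    """One step of an image's chain-of-thought sequence (hypothetical schema)."""

    index: int       # position in the sequence, 1..STEPS_PER_IMAGE
    category: str    # one of CATEGORIES
    difficulty: str  # assumed free-text label; the abstract does not list the levels
    question: str

    def __post_init__(self) -> None:
        # Reject steps that fall outside the structure the abstract describes.
        if not 1 <= self.index <= STEPS_PER_IMAGE:
            raise ValueError(f"index must be in 1..{STEPS_PER_IMAGE}")
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown category: {self.category!r}")


def total_qa_pairs(num_images: int) -> int:
    """Number of Q&A pairs when every image has a full question sequence."""
    return num_images * STEPS_PER_IMAGE


# 1.46 million is a rounded figure, so the product is approximate.
print(total_qa_pairs(1_460_000))  # 30660000
```

The product, roughly 30.66 million, is consistent with the abstract's "over 30 million Q&A pairs".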

Authors

Sahiti Yerramilli
Nilay Pande
Rynaa Grover
Jayant Sravan Tamarapalli

External Resources

View on arXiv (http://arxiv.org/abs/2506.00785v1)
