A Vision Centric Remote Sensing Benchmark
Journal:
arXiv
Published Date:
Mar 20, 2025
Abstract
Multimodal Large Language Models (MLLMs) have achieved remarkable success in
vision-language tasks but their remote sensing (RS) counterpart are relatively
under explored. Unlike natural images, RS imagery presents unique challenges
that current MLLMs struggle to handle, particularly in visual grounding and
spatial reasoning. This study investigates the limitations of CLIP-based MLLMs
in RS, highlighting their failure to differentiate visually distinct yet
semantically similar RS images. To address this, we introduce a remote sensing
multimodal visual patterns (RSMMVP) benchmark. It is designed to evaluate MLLMs
in RS tasks by identifying the CLIP-blind pairs, where CLIP-based models
incorrectly assign high similarity scores to visually distinct RS images.
Through a visual question answering (VQA) evaluation, we analyze the
performance of state-of-the-art MLLMs, revealing significant limitations in RS
specific representation learning. The results provide valuable insights into
the weaknesses of CLIP-based visual encoding and offer a foundation for future
research to develop more effective MLLMs tailored for remote sensing
applications.