Towards Omnidirectional Reasoning with 360-R1: A Dataset, Benchmark, and GRPO-based Method
Journal:
arXiv
Published Date:
May 20, 2025
Abstract
Omnidirectional images (ODIs), with their 360° field of view, provide
unparalleled spatial awareness for immersive applications like augmented
reality and embodied AI. However, the capability of existing multi-modal large
language models (MLLMs) to comprehend and reason about such panoramic scenes
remains underexplored. This paper addresses this gap by introducing OmniVQA,
the first dataset for omnidirectional visual question answering, and by
conducting the first benchmark on this task. Our evaluation of state-of-the-art MLLMs reveals
significant limitations in handling omnidirectional visual question answering,
highlighting persistent challenges in object localization, feature extraction,
and hallucination suppression within panoramic contexts. These results
underscore the disconnect between current MLLM capabilities and the demands of
omnidirectional visual understanding, which calls for dedicated architectural
or training innovations tailored to 360° imagery. Building on the OmniVQA
dataset and benchmark, we further introduce a rule-based reinforcement learning
method, 360-R1, based on Qwen2.5-VL-Instruct. Concretely, we modify group
relative policy optimization (GRPO) by proposing three novel reward functions:
(1) reasoning process similarity reward, (2) answer semantic accuracy reward,
and (3) structured format compliance reward. Extensive experiments on
OmniVQA demonstrate the superiority of our proposed method in omnidirectional
space (+6% improvement).
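
To make the reward design concrete, the following is a minimal sketch of how three rule-based rewards could be combined and turned into group-relative advantages in a GRPO-style update. It is not the paper's implementation: the `<think>`/`<answer>` tag layout, the equal weights, and the function names `text_similarity`, `total_reward`, and `group_relative_advantages` are all assumptions made for illustration, and a token-overlap ratio stands in for whatever similarity measures the method actually uses for the reasoning and answer rewards.

```python
import re
from difflib import SequenceMatcher
from statistics import mean, pstdev

# Assumed output layout (hypothetical): reasoning in <think>, answer in <answer>.
FORMAT_RE = re.compile(
    r"^<think>(.*?)</think>\s*<answer>(.*?)</answer>$", re.DOTALL
)


def text_similarity(a: str, b: str) -> float:
    """Token-overlap ratio in [0, 1]; a lexical stand-in for a semantic score."""
    return SequenceMatcher(None, a.lower().split(), b.lower().split()).ratio()


def total_reward(completion, ref_reasoning, ref_answer, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of reasoning, answer, and format rewards (weights assumed)."""
    match = FORMAT_RE.match(completion.strip())
    if match is None:
        # Without the expected structure, nothing can be parsed or scored.
        return 0.0
    reasoning, answer = match.group(1).strip(), match.group(2).strip()
    r_reason = text_similarity(reasoning, ref_reasoning)  # (1) reasoning similarity
    r_answer = text_similarity(answer, ref_answer)        # (2) answer accuracy
    r_format = 1.0                                        # (3) format compliance
    w_reason, w_answer, w_format = weights
    return w_reason * r_reason + w_answer * r_answer + w_format * r_format


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

In this sketch, each question is answered several times by the policy, every completion is scored with `total_reward`, and `group_relative_advantages` compares each completion against the others sampled for the same question, so no learned value model is needed.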