Relation-R1: Progressively Cognitive Chain-of-Thought Guided Reinforcement Learning for Unified Relation Comprehension
Journal: arXiv
Published Date: Apr 20, 2025
Abstract
Recent advances in multi-modal large language models (MLLMs) have
significantly improved object-level grounding and region captioning. However,
they remain limited in visual relation understanding, struggling even with
binary relation detection, let alone N-ary relations involving
multiple semantic roles. The core reason is the lack of modeling for
structural semantic dependencies among multiple entities, leading to
unreliable outputs, hallucinations, and over-reliance on language priors (e.g.,
defaulting to "person drinks a milk" if a person is merely holding it). To
this end, we propose Relation-R1, the first unified relation
comprehension framework that explicitly integrates cognitive chain-of-thought
(CoT)-guided supervised fine-tuning (SFT) and group relative policy
optimization (GRPO) within a reinforcement learning (RL) paradigm.
Specifically, we first establish foundational reasoning capabilities via SFT,
enforcing structured outputs with thinking processes. Then, GRPO is used to
refine these outputs via multi-reward optimization, prioritizing
visual-semantic grounding over language-induced biases, thereby improving
generalization capability. Furthermore, we investigate the impact of various
CoT strategies within this framework, demonstrating that a specific-to-general
progressive approach in CoT guidance further improves generalization,
especially in capturing synonymous N-ary relations. Extensive
experiments on the widely used PSG and SWiG datasets demonstrate that Relation-R1
achieves state-of-the-art performance in both binary and N-ary
relation understanding.
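
To make the GRPO step described in the abstract more concrete, the following is a minimal sketch of how several rewards could be combined and turned into group-relative advantages. It is not the authors' implementation: the `<think>`/`<answer>` output template, the two reward functions, their equal weighting, and all function names are assumptions introduced for illustration. Only the normalization of each reward against the mean and standard deviation of its sampled group follows the standard GRPO formulation.

```python
import re
from statistics import mean, pstdev

# Hypothetical output template (an assumption, not taken from the paper):
# reasoning inside <think>...</think>, final relation inside <answer>...</answer>.
TEMPLATE = re.compile(r"^<think>.+?</think>\s*<answer>.+?</answer>$", re.DOTALL)
ANSWER = re.compile(r"<answer>(.+?)</answer>", re.DOTALL)


def format_reward(response: str) -> float:
    """Return 1.0 if the response follows the assumed template, else 0.0."""
    return 1.0 if TEMPLATE.match(response.strip()) else 0.0


def answer_reward(response: str, target: str) -> float:
    """Return 1.0 if the extracted answer equals the target (case-insensitive)."""
    m = ANSWER.search(response)
    if m is None:
        return 0.0
    return 1.0 if m.group(1).strip().lower() == target.strip().lower() else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def score_group(responses, target):
    """Sum the (assumed, equally weighted) rewards and compute advantages."""
    rewards = [format_reward(r) + answer_reward(r, target) for r in responses]
    return rewards, group_relative_advantages(rewards)


if __name__ == "__main__":
    # Toy group of sampled responses for one image-question pair.
    group = [
        "<think>The cup is in the hand, not at the mouth.</think>"
        "<answer>person holds milk</answer>",
        "<think>People usually drink milk.</think>"
        "<answer>person drinks milk</answer>",
        "person holds milk",  # correct words, but no structured output
    ]
    rewards, advantages = score_group(group, "person holds milk")
    for r, a in zip(rewards, advantages):
        print(f"reward={r:.1f} advantage={a:+.3f}")
```

In this toy group, the response that is both well formed and visually grounded receives a positive advantage, while the response driven by the language prior ("drinks") and the unstructured response are ranked below it. A real setup would replace the exact-match reward with task-specific scoring for binary or N-ary relations.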