Boosting MLLM Reasoning with Text-Debiased Hint-GRPO
Journal: arXiv
Published Date: Mar 31, 2025
Abstract
Multimodal large language model (MLLM) reasoning has drawn widespread research
attention for its strong problem-solving capability. Current reasoning methods
fall into two types: process reward models (PRMs), which supervise the
intermediate reasoning steps, and outcome reward models (ORMs), which supervise
the final results. Recently, DeepSeek-R1 has challenged the traditional view
that PRM outperforms ORM by demonstrating strong generalization performance
using an ORM method (i.e., GRPO). However, current GRPO algorithms for MLLMs still
struggle to handle challenging and complex multimodal reasoning tasks (e.g.,
mathematical reasoning). In this work, we reveal two problems that impede the
performance of GRPO on MLLMs: low data utilization and text-bias. Low data
utilization means that, on difficult samples, GRPO cannot obtain positive
rewards with which to update the MLLM; text-bias is the phenomenon in which,
after GRPO training, the MLLM bypasses the image condition and relies solely on
the text condition for generation. To tackle these problems, this work proposes
Hint-GRPO, which improves
data utilization by adaptively providing hints for samples of varying
difficulty, and text-bias calibration, which mitigates text-bias by calibrating
the token prediction logits with the image condition at test time. Experimental
results on three base MLLMs across eleven datasets demonstrate that our
proposed methods improve the reasoning capability of the original MLLMs by a
large margin and outperform existing MLLM reasoning methods. Our code is
available at https://github.com/hqhQAQ/Hint-GRPO.
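
The two problems named in the abstract can be illustrated with a minimal sketch. It is not taken from the authors' repository. The first function uses the standard GRPO idea of standardizing rewards within a sampled group (population standard deviation is an assumption here) and shows that a group with no positive reward yields all-zero advantages, so that sample contributes no update. The second function is a hypothetical form of logit calibration that contrasts logits computed with and without the image; the abstract does not give the formula, so the function names, `eps`, `gamma`, and the contrastive rule itself are all assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize each reward within its sampled group.

    Assumption: population standard deviation plus a small eps for stability.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def calibrate_logits(logits_with_image, logits_text_only, gamma=0.5):
    """Hypothetical test-time calibration (assumed form, not the paper's formula).

    Amplifies the part of each logit that changes when the image is present.
    """
    return [
        li + gamma * (li - lt)
        for li, lt in zip(logits_with_image, logits_text_only)
    ]


# Difficult sample: every sampled answer is wrong, so all rewards are 0.
hard_sample = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

# Easier sample: some answers are correct, so advantages are non-zero.
mixed_sample = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# Token 0 is unaffected by the image; token 1 is favored only with the image.
calibrated = calibrate_logits([2.0, 1.0], [2.0, 0.0], gamma=0.5)
```

Under these assumptions, `hard_sample` carries no learning signal, which is the situation that adaptive hints are meant to avoid, while `calibrated` raises only the token whose score depends on the image.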