Group Relative Policy Optimization for Image Captioning
Journal: arXiv
Published Date: Mar 3, 2025
Abstract
Image captioning models are usually trained in two stages. The first stage
optimizes a cross-entropy loss, and the second stage applies reinforcement
learning through self-critical sequence training (SCST). However, the SCST
algorithm has
certain defects. SCST relies only on a single greedy decoding result as a
baseline. If the model itself is not yet stable, the greedy decoding result
may be relatively poor, which leads to high variance in the advantage
estimate and, in turn, to unstable policy updates. In addition, SCST compares
only one sampled caption with the greedy decoding result, so generation
diversity is limited and training may fall into a local optimum. In this
paper, we propose using the latest Group Relative Policy Optimization (GRPO)
reinforcement learning algorithm as an optimization solution for the second
stage. GRPO generates multiple candidate captions for the input image and then
optimizes the model through intra-group comparison. By constraining both the
magnitude of policy updates and the KL divergence, GRPO keeps training
stable. In addition, whereas SCST draws only one sampled caption, GRPO samples
several, and the candidates in a group cover a wider solution space. Combined
with the KL divergence constraint, GRPO can improve diversity while preserving
model stability. The code for this paper is available at
https://github.com/liangxu-one/ms-models/tree/image_caption_grpo/research/arxiv_papers/Image_Caption_GRPO.
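
The contrast the abstract draws between the two advantage estimates can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation. It assumes the commonly used GRPO formulation (rewards standardized within a group, a PPO-style clipped probability ratio, and a KL penalty against a reference policy) and, for brevity, treats each caption as having one sequence-level log-probability. The function names, the reward values, and the defaults `clip_eps=0.2` and `beta=0.04` are hypothetical choices made for this example.

```python
import math
from statistics import mean, pstdev


def scst_advantage(sample_reward, greedy_reward):
    """SCST: one sampled caption is scored against the greedy caption's reward."""
    return sample_reward - greedy_reward


def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: standardize rewards within one image's group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


def grpo_term(logp_new, logp_old, logp_ref, advantage, clip_eps=0.2, beta=0.04):
    """Clipped surrogate minus a KL penalty for one caption (to be maximized).

    Assumption: sequence-level log-probabilities; real implementations usually
    work per token and average over the caption.
    """
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    surrogate = min(ratio * advantage, clipped * advantage)
    # Non-negative estimator of KL(new || ref): r - log(r) - 1 with r = p_ref / p_new.
    log_r = logp_ref - logp_new
    kl = math.exp(log_r) - log_r - 1.0
    return surrogate - beta * kl


# Hypothetical caption rewards (e.g. CIDEr scores) for one image.
group_rewards = [0.9, 1.1, 1.0, 1.4]
advantages = grpo_advantages(group_rewards)

# SCST sees a single sample; its baseline is whatever greedy decoding produced.
single_advantage = scst_advantage(sample_reward=1.1, greedy_reward=0.6)
```

In this sketch the SCST advantage depends entirely on how good the single greedy caption happens to be, while the group-relative advantages are centered on the group mean, so a poor baseline caption cannot shift every update in the same direction. The clipping and the KL term bound how far one update can move the policy.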