DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning
Journal:
arXiv
Published Date:
May 20, 2025
Abstract
Large Vision-Language Models (VLMs) have shown strong capabilities in
multimodal understanding and reasoning, yet they are primarily constrained by
text-based reasoning processes. However, achieving seamless integration of
visual and textual reasoning which mirrors human cognitive processes remains a
significant challenge. In particular, effectively incorporating advanced visual
input processing into reasoning mechanisms is still an open question. Thus, in
this paper, we explore the interleaved multimodal reasoning paradigm and
introduce DeepEyes, a model with "thinking with images" capabilities
incentivized through end-to-end reinforcement learning without the need for
cold-start SFT. Notably, this ability emerges natively within the model itself,
leveraging its inherent grounding ability as a tool instead of depending on
separate specialized models. Specifically, we propose a tool-use-oriented data
selection mechanism and a reward strategy to encourage successful tool-assisted
reasoning trajectories. DeepEyes achieves significant performance gains on
fine-grained perception and reasoning benchmarks and also demonstrates
improvement in grounding, hallucination, and mathematical reasoning tasks.
Interestingly, we observe the distinct evolution of tool-calling behavior from
initial exploration to efficient and accurate exploitation, and diverse
thinking patterns that closely mirror human visual reasoning processes. Code is
available at https://github.com/Visual-Agent/DeepEyes.