ThinkVideo: High-Quality Reasoning Video Segmentation with Chain of Thoughts
Journal: arXiv
Published Date: May 24, 2025
Abstract
Reasoning Video Object Segmentation is a challenging task that requires generating a
mask sequence from an input video and an implicit, complex text query. Existing
methods approach the problem by fine-tuning Multimodal Large Language Models
(MLLMs) to produce segmentation output, but they still fall short on difficult
videos with temporally sensitive queries, primarily because they fail to
integrate temporal and spatial information. In this paper, we
propose ThinkVideo, a novel framework that leverages the zero-shot
Chain-of-Thought (CoT) capability of MLLMs to address these challenges.
Specifically, ThinkVideo uses CoT prompts to extract object
selectivities associated with particular keyframes, and then bridges a reasoning
image segmentation model and the SAM2 video processor to output mask sequences. The
ThinkVideo framework is training-free and compatible with closed-source MLLMs,
and it can be applied to Reasoning Video Instance Segmentation. We further
extend the framework to online video streams, where CoT is used to update
the object of interest when a better target emerges and becomes
visible. We conduct extensive experiments on video object segmentation with
explicit and implicit queries. The results show that ThinkVideo significantly
outperforms previous methods in both cases, qualitatively and quantitatively.
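
The three stages described in the abstract (CoT keyframe and object selection, reasoning image segmentation on the keyframe, and video mask propagation) can be illustrated with a minimal sketch. This is not the authors' implementation: every name below (`KeyframeChoice`, `think_video_sketch`, and the three callables) is hypothetical, and the MLLM, the reasoning image segmentation model, and the SAM2 video processor are replaced by toy stand-ins so that only the data flow is shown.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence

Frame = str  # stand-in for an image array (assumption for illustration)
Mask = str   # stand-in for a binary mask (assumption for illustration)


@dataclass
class KeyframeChoice:
    """Hypothetical container: a keyframe index plus the target chosen on it."""
    frame_index: int
    target: str


def think_video_sketch(
    frames: Sequence[Frame],
    query: str,
    cot_select: Callable[[Sequence[Frame], str], List[KeyframeChoice]],
    segment_image: Callable[[Frame, str], Mask],
    propagate: Callable[[Sequence[Frame], int, Mask], Dict[int, Mask]],
) -> Dict[int, List[Mask]]:
    """Training-free pipeline sketch: select keyframes, segment, propagate."""
    masks_per_frame: Dict[int, List[Mask]] = {i: [] for i in range(len(frames))}
    # Stage 1: an MLLM prompted with CoT picks keyframes and target objects.
    for choice in cot_select(frames, query):
        # Stage 2: an image segmentation model produces a seed mask on the keyframe.
        seed = segment_image(frames[choice.frame_index], choice.target)
        # Stage 3: a video processor spreads the seed mask to the other frames.
        for index, mask in propagate(frames, choice.frame_index, seed).items():
            masks_per_frame[index].append(mask)
    return masks_per_frame


# Toy stand-ins for the three models (assumptions, not real model calls).
def toy_cot_select(frames: Sequence[Frame], query: str) -> List[KeyframeChoice]:
    return [KeyframeChoice(frame_index=1, target="the dog that jumps first")]


def toy_segment_image(frame: Frame, target: str) -> Mask:
    return f"mask({target})@{frame}"


def toy_propagate(frames: Sequence[Frame], key_index: int, seed: Mask) -> Dict[int, Mask]:
    return {i: f"{seed}->frame{i}" for i in range(len(frames))}


result = think_video_sketch(
    ["f0", "f1", "f2"],
    "Which dog jumps first?",
    toy_cot_select,
    toy_segment_image,
    toy_propagate,
)
```

Because each keyframe choice contributes its own mask track, the same loop returns one mask list per frame, which is one plausible way to represent the instance segmentation setting mentioned in the abstract.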