Memory Transmission Based Referring Video Object Segmentation.
Journal:
Neural networks : the official journal of the International Neural Network Society
Published Date:
Sep 1, 2025
Abstract
Referring Video Object Segmentation (RVOS) is the task of segmenting the target objects in a video that are referred to by a textual description. To keep the segmented objects consistent across video frames, inter-frame modeling is adopted to capture object motion; it usually divides the video into several clips and models the associations among the frames within each clip. However, such clip-level modeling cannot capture the continuous motion of an object across the whole video. To address this issue, we propose memory transmission based continuous inter-frame modeling, which uses the segmentation result of the previous frame to compute a pseudo mask for the current frame. Building on this continuous inter-frame modeling method, we propose Memory Transmission Based Referring Video Object Segmentation (MT-RVOS), which uses the transmitted pseudo mask to guide the inference of the segmentation mask for the current frame. Extensive experiments on four referring video object segmentation benchmarks demonstrate that MT-RVOS achieves competitive performance.
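
The frame-to-frame transmission described in the abstract can be illustrated with a minimal sketch. The abstract does not state how the pseudo mask is computed, so the version below substitutes a generic feature-affinity propagation: each pixel of the current frame takes a softmax-weighted average of the previous frame's mask values. The names `propagate_mask`, `segment_video`, `segment_frame`, and the `temperature` parameter are hypothetical and chosen only for illustration; this is not the MT-RVOS implementation.

```python
import numpy as np


def propagate_mask(prev_feat, cur_feat, prev_mask, temperature=0.1):
    """Transmit the previous frame's mask to the current frame.

    ASSUMPTION: affinity-based propagation is an illustrative stand-in;
    the abstract does not specify the actual computation.

    prev_feat, cur_feat: (N, C) per-pixel features, N = H * W.
    prev_mask: (N,) soft mask in [0, 1] for the previous frame.
    Returns an (N,) pseudo mask in [0, 1] for the current frame.
    """
    eps = 1e-8
    # L2-normalise so the dot product is a cosine similarity.
    prev_n = prev_feat / (np.linalg.norm(prev_feat, axis=1, keepdims=True) + eps)
    cur_n = cur_feat / (np.linalg.norm(cur_feat, axis=1, keepdims=True) + eps)
    # Affinity of every current pixel to every previous pixel: (N, N).
    logits = (cur_n @ prev_n.T) / temperature
    # Numerically stable softmax over the previous-frame pixels.
    logits = logits - logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=1, keepdims=True)
    # Each current pixel receives a weighted average of previous mask values.
    return weights @ prev_mask


def segment_video(frame_feats, segment_frame):
    """Segment frames one by one, passing the mask forward continuously.

    segment_frame(feat, pseudo_mask) is a hypothetical callable standing in
    for the text-conditioned segmentation step; pseudo_mask is None for the
    first frame, which has no predecessor.
    """
    masks = []
    for t, feat in enumerate(frame_feats):
        pseudo = None
        if t > 0:
            pseudo = propagate_mask(frame_feats[t - 1], feat, masks[-1])
        masks.append(segment_frame(feat, pseudo))
    return masks
```

Because every frame after the first receives a pseudo mask derived from its immediate predecessor, the chain runs across the entire video rather than restarting at clip boundaries, which is the contrast with clip-level modeling that the abstract draws.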