Memory Transmission Based Referring Video Object Segmentation.

Journal: Neural networks : the official journal of the International Neural Network Society
Published Date:

Abstract

Referring Video Object Segmentation (RVOS) is the task of segmenting, from a video, the target objects referred to by a textual description. To keep the segmented objects consistent across frames, inter-frame modeling is adopted to capture object motion, usually by dividing the video into several clips and modeling the associations among the frames within each clip. However, such clip-level modeling cannot capture the continuous motion changes of an object across the whole video. To address this issue, we propose memory transmission based continuous inter-frame modeling, which uses the segmentation result of the previous frame to calculate a pseudo mask for the current frame. Building on this continuous inter-frame modeling method, we propose Memory Transmission Based Referring Video Object Segmentation (MT-RVOS), which uses the transmitted pseudo mask to guide segmentation mask inference for the current frame. Extensive experiments on four referring video object segmentation benchmarks demonstrate that MT-RVOS achieves competitive performance.
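
The frame-by-frame transmission idea described above can be illustrated with a small toy sketch. This is not the MT-RVOS implementation: the abstract does not say how the pseudo mask is calculated or how it guides inference, so the sketch assumes a simple feature-affinity propagation of the previous mask and a weighted fusion with a per-pixel text-matching score. The names `propagate_mask`, `segment_video`, `alpha`, `temperature`, and `threshold` are all hypothetical.

```python
import math


def propagate_mask(prev_feats, cur_feats, prev_mask, temperature=0.1):
    """Carry the previous frame's mask to the current frame (assumed scheme).

    Each current pixel takes a softmax-weighted average of the previous
    mask, weighted by feature similarity to every previous pixel.
    """
    pseudo = []
    for cur in cur_feats:
        sims = [sum(a * b for a, b in zip(cur, prev)) / temperature
                for prev in prev_feats]
        peak = max(sims)  # subtract the max for numerical stability
        weights = [math.exp(s - peak) for s in sims]
        total = sum(weights)
        pseudo.append(sum(w * m for w, m in zip(weights, prev_mask)) / total)
    return pseudo


def segment_video(frame_feats, text_scores, alpha=0.5, threshold=0.5):
    """Segment frames one by one, passing each result on to the next frame.

    frame_feats: per-frame lists of per-pixel feature vectors.
    text_scores: per-frame lists of per-pixel text-matching scores in [0, 1]
                 (a stand-in for a language-conditioned segmentation head).
    """
    masks = []
    for t, (feats, scores) in enumerate(zip(frame_feats, text_scores)):
        if t == 0:
            fused = list(scores)  # no memory yet: rely on the text score only
        else:
            pseudo = propagate_mask(frame_feats[t - 1], feats, masks[-1])
            # Assumed fusion rule: blend text score with the pseudo mask.
            fused = [alpha * s + (1 - alpha) * p
                     for s, p in zip(scores, pseudo)]
        masks.append([1.0 if v > threshold else 0.0 for v in fused])
    return masks


# Toy video: 4 "pixels" per frame, two belonging to the object.
feats = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
frame_feats = [feats, feats, feats]
# Frame 0 is unambiguous; frames 1 and 2 have uninformative text scores.
text_scores = [[0.9, 0.9, 0.1, 0.1], [0.45] * 4, [0.45] * 4]
masks = segment_video(frame_feats, text_scores)
print(masks[-1])  # [1.0, 1.0, 0.0, 0.0]
```

In this toy setup the text scores of the later frames alone would yield an empty mask; the mask transmitted from the previous frame is what keeps the object segmented through the whole sequence, without any clip boundary interrupting the chain.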

Authors

  • Zijin Liu
    Beijing Key Laboratory of Multimedia and Intelligent Software Technology, Beijing Institute of Artificial Intelligence, China; School of Information Science and Technology, Beijing University of Technology, Beijing, 100124, China.
  • Lichun Wang
    Beijing Key Laboratory of Multimedia and Intelligent Software Technology, Beijing Institute of Artificial Intelligence, China; School of Information Science and Technology, Beijing University of Technology, Beijing, 100124, China. Electronic address: wanglc@bjut.edu.cn.
  • Yongli Hu
    Institute for Infocomm Research, A*STAR, 1 Fusionopolis Way, #21-01 Connexis (South Tower), Singapore, Singapore. huy@i2r.a-star.edu.sg.
  • Baocai Yin
    iFLYTEK Research, Hefei, China.