First-frame Supervised Video Polyp Segmentation via Propagative and Semantic Dual-teacher Network
Journal: arXiv
Published Date: Dec 21, 2024
Abstract
Automatic video polyp segmentation plays a critical role in gastrointestinal
cancer screening, but the cost of frame-by-frame annotations is prohibitively
high. While sparse-frame supervised methods have reduced this burden
proportionately, the cost remains overwhelming for long-duration videos and
large-scale datasets. In this paper, we, for the first time, reduce the
annotation cost to just a single frame per polyp video, regardless of the
video's length. To this end, we introduce a new task, First-Frame Supervised
Video Polyp Segmentation (FSVPS), and propose a novel Propagative and Semantic
Dual-Teacher Network (PSDNet). Specifically, PSDNet adopts a teacher-student
framework but employs two distinct types of teachers: the propagative teacher
and the semantic teacher. The propagative teacher is a universal object tracker
that propagates the first-frame annotation to subsequent frames as pseudo
labels. However, tracking errors may accumulate over time, gradually degrading
the pseudo labels and misguiding the student model. To address this, we
introduce the semantic teacher, an exponential moving average of the student
model, which produces more stable and time-invariant pseudo labels. PSDNet
merges the pseudo labels from both teachers using a carefully designed
back-propagation strategy. This strategy assesses the quality of the pseudo
labels by tracking them backward to the first frame. High-quality pseudo labels
are more likely to spatially align with the first-frame annotation after this
backward tracking, ensuring more accurate teacher-to-student knowledge transfer
and improved segmentation performance. Benchmarking on SUN-SEG, the largest VPS
dataset, demonstrates the competitive performance of PSDNet compared to
fully-supervised approaches, and its superiority over sparse-frame supervised
state-of-the-art methods, with a minimum improvement of 4.5% in Dice score.
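
The two mechanisms described above, the exponential-moving-average semantic teacher and the backward-tracking quality check, can be illustrated with a minimal sketch. This is not the PSDNet implementation: the abstract does not specify the merging rule, so the sketch assumes the simplest choice of keeping whichever pseudo label aligns better (by Dice) with the first-frame annotation after backward tracking. The function names, the momentum value, and the assumption that the backward-tracked masks are already computed by some tracker are all hypothetical.

```python
import numpy as np


def ema_update(teacher, student, momentum=0.99):
    """Move each teacher weight toward the matching student weight (EMA).

    `teacher` and `student` are dicts of NumPy arrays with the same keys.
    The momentum value is an illustrative assumption.
    """
    return {
        name: momentum * teacher[name] + (1.0 - momentum) * student[name]
        for name in teacher
    }


def dice(mask_a, mask_b, eps=1e-6):
    """Dice overlap between two binary masks."""
    a = mask_a.astype(bool)
    b = mask_b.astype(bool)
    intersection = np.logical_and(a, b).sum()
    return (2.0 * intersection + eps) / (a.sum() + b.sum() + eps)


def pick_pseudo_label(prop_label, sem_label, prop_back, sem_back, first_mask):
    """Choose between the two teachers' pseudo labels for one frame.

    `prop_back` and `sem_back` are the masks obtained by tracking each
    pseudo label backward to the first frame (produced elsewhere by a
    tracker, which is not modeled here). The label whose backward-tracked
    mask overlaps the first-frame annotation more is kept. This
    winner-takes-all rule is an assumption made for illustration.
    """
    prop_score = dice(prop_back, first_mask)
    sem_score = dice(sem_back, first_mask)
    return prop_label if prop_score >= sem_score else sem_label
```

In this sketch a pseudo label is trusted only to the extent that it can be traced back to the one annotated frame, which is the intuition the abstract gives for the back-propagation strategy.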