SASVi -- Segment Any Surgical Video
Journal: arXiv
Published Date: Feb 12, 2025
Abstract
Purpose: Foundation models, trained on multitudes of public datasets, often
require additional fine-tuning or re-prompting mechanisms to be applied to
visually distinct target domains such as surgical videos. Further, without
domain knowledge, they cannot model the specific semantics of the target
domain. Hence, when applied to surgical video segmentation, they fail to
generalise to sections where previously tracked objects leave the scene or new
objects enter. Methods: We propose SASVi, a novel re-prompting mechanism based
on a frame-wise Mask R-CNN Overseer model, which is trained on a minimal amount
of scarcely available annotations for the target domain. This model
automatically re-prompts the foundation model SAM2 when the scene constellation
changes, allowing for temporally smooth and complete segmentation of full
surgical videos. Results: Re-prompting based on our Overseer model
significantly improves the temporal consistency of surgical video segmentation,
by at least 1.5%, compared to similar prompting techniques and especially to
frame-wise segmentation, which neglects temporal information. Our
proposed approach allows us to successfully deploy SAM2 to surgical videos,
which we quantitatively and qualitatively demonstrate for three different
cholecystectomy and cataract surgery datasets. Conclusion: SASVi can serve as a
new baseline for smooth and temporally consistent segmentation of surgical
videos with scarcely available annotation data. Our method allows us to
leverage scarce annotations and obtain complete annotations for full videos of
the large-scale counterpart datasets. We make these annotations publicly
available, providing extensive annotation data for the future development of
surgical data science models.
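
The re-prompting idea described under Methods can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that the frame-wise Overseer yields, for each frame, the set of class labels it detects, and that a change in this set is what signals a changed scene constellation. The function name `find_reprompt_frames`, the label strings, and the exact change criterion are hypothetical and chosen only for illustration; the calls to Mask R-CNN and SAM2 are deliberately left out.

```python
from typing import Iterable, List, Optional, Set


def find_reprompt_frames(detected_per_frame: Iterable[Set[str]]) -> List[int]:
    """Return the frame indices at which the tracker would be re-prompted.

    Assumption: each element is the set of class labels a frame-wise
    overseer model detected in that frame. A re-prompt is triggered on the
    first frame and whenever the detected set differs from the set that
    was in place at the last re-prompt (an object left or a new one entered).
    """
    reprompt_frames: List[int] = []
    tracked: Optional[Set[str]] = None  # classes in place since the last re-prompt

    for idx, detected in enumerate(detected_per_frame):
        if tracked is None or detected != tracked:
            reprompt_frames.append(idx)
            tracked = set(detected)  # copy so later caller mutations do not leak in

    return reprompt_frames


if __name__ == "__main__":
    # Hypothetical per-frame detections for a short clip.
    frames = [
        {"grasper", "liver"},
        {"grasper", "liver"},
        {"liver"},            # grasper leaves the scene
        {"liver", "hook"},    # hook enters the scene
        {"liver", "hook"},
    ]
    print(find_reprompt_frames(frames))
```

In a full pipeline, each returned index would be a point at which the Overseer's masks are handed to the video segmentation model as fresh prompts. A practical system would probably also need to guard against single-frame detection flicker; whether and how SASVi does this is not stated in the abstract.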