ReSW-VL: Representation Learning for Surgical Workflow Analysis Using Vision-Language Model
Journal: arXiv
Published Date: May 19, 2025
Abstract
Surgical phase recognition from video is a technology that automatically
classifies the progress of a surgical procedure and has a wide range of
potential applications, including real-time surgical support, optimization of
medical resources, training and skill assessment, and safety improvement.
Recent advances in surgical phase recognition have focused primarily on
Transformer-based methods, although two-stage methods, which extract spatial
features from individual frames with a CNN and then derive video features by
applying time series modeling to the resulting feature sequence, have shown
high performance.
However, there remains a paucity of research on training methods for CNNs
employed for feature extraction or representation learning in surgical phase
recognition. In this study, we propose a method for representation learning in
surgical workflow analysis using a vision-language model (ReSW-VL). Our
proposed method involves fine-tuning the image encoder of CLIP (Contrastive
Language-Image Pre-training), a vision-language model, using prompt learning for surgical
phase recognition. The experimental results on three surgical phase recognition
datasets demonstrate the effectiveness of the proposed method in comparison to
conventional methods.
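
The abstract does not state the training objective or the form of the prompts, so the following is only a minimal sketch of the generic CLIP-style scoring that prompt learning usually builds on: a frame embedding is compared with one text embedding per surgical phase by cosine similarity, and the scaled similarities are turned into phase probabilities and a cross-entropy loss. The function names, the toy three-dimensional embeddings, and the temperature of 0.07 are assumptions made for illustration and are not taken from the paper. In an actual setup the embeddings would come from the CLIP image and text encoders, and the text side would be produced from learnable prompt vectors.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def phase_probabilities(image_emb, phase_text_embs, temperature=0.07):
    """Softmax over temperature-scaled cosine similarities (hypothetical helper)."""
    img = l2_normalize(image_emb)
    logits = []
    for text_emb in phase_text_embs:
        txt = l2_normalize(text_emb)
        cosine = sum(a * b for a, b in zip(img, txt))
        logits.append(cosine / temperature)
    # Subtract the maximum logit for numerical stability before exponentiating.
    max_logit = max(logits)
    exps = [math.exp(z - max_logit) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def phase_loss(image_emb, phase_text_embs, true_phase, temperature=0.07):
    """Cross-entropy of the true phase; the quantity fine-tuning would minimize."""
    probs = phase_probabilities(image_emb, phase_text_embs, temperature)
    return -math.log(probs[true_phase])


# Toy stand-ins for encoder outputs: one frame embedding, three phase embeddings.
frame_embedding = [0.9, 0.1, 0.0]
phase_embeddings = [
    [1.0, 0.0, 0.0],  # hypothetical phase 0
    [0.0, 1.0, 0.0],  # hypothetical phase 1
    [0.0, 0.0, 1.0],  # hypothetical phase 2
]

probs = phase_probabilities(frame_embedding, phase_embeddings)
predicted_phase = probs.index(max(probs))
loss_if_correct = phase_loss(frame_embedding, phase_embeddings, true_phase=0)
loss_if_wrong = phase_loss(frame_embedding, phase_embeddings, true_phase=2)
```

In this toy case the frame embedding lies closest to the first phase embedding, so the loss is small when phase 0 is the label and large when phase 2 is. Under the assumptions above, gradients of this loss would be what updates the image encoder and the prompt vectors during fine-tuning.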