Multi-Stage Boundary-Aware Transformer Network for Action Segmentation in Untrimmed Surgical Videos
Journal: arXiv
Published Date: Apr 26, 2025
Abstract
Understanding actions within surgical workflows is essential for evaluating
post-operative outcomes. However, capturing long sequences of actions performed
in surgical settings is challenging because individual surgeons have their own
approaches, shaped by their expertise, which leads to significant variability. To
tackle this problem, we focus on segmentation with precise
boundaries, a demanding task due to the inherent variability in action
durations and the subtle transitions often observed in untrimmed videos. These
transitions, marked by ambiguous starting and ending points, complicate the
segmentation process. Traditional models such as MS-TCN, which depend on large
receptive fields, frequently suffer from over-segmentation (splitting one action
into fragmented segments) or under-segmentation (merging distinct actions). Both
errors degrade segmentation quality. To overcome these
challenges, we present the Multi-Stage Boundary-Aware Transformer Network
(MSBATN) with hierarchical sliding window attention, designed to enhance action
segmentation. Our proposed approach incorporates a novel unified loss function
that treats action classification and boundary detection as distinct yet
interdependent tasks. Unlike traditional binary boundary detection methods, our
boundary voting mechanism accurately identifies start and end points by
leveraging contextual information. Extensive experiments on three
challenging surgical datasets show that the proposed method achieves
state-of-the-art F1 scores at thresholds of 25% and 50% while remaining
comparable to existing methods on other metrics.
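
To make the reported metric concrete, the following is a minimal sketch of a segmental F1 score at a given overlap threshold, assuming the definition commonly used in the action segmentation literature: a predicted segment counts as a true positive when its intersection-over-union (IoU) with an unmatched ground-truth segment of the same class meets the threshold. The abstract does not spell out the evaluation protocol, so the function names `to_segments` and `segmental_f1` and the matching rule are illustrative assumptions, not the authors' code.

```python
from itertools import groupby


def to_segments(labels):
    """Collapse frame-wise labels into (label, start, end) runs; end is exclusive."""
    segments, start = [], 0
    for label, run in groupby(labels):
        length = len(list(run))
        segments.append((label, start, start + length))
        start += length
    return segments


def segmental_f1(pred, truth, iou_threshold):
    """Segmental F1 at one IoU threshold (assumed protocol, see lead-in)."""
    pred_segs, true_segs = to_segments(pred), to_segments(truth)
    matched = [False] * len(true_segs)
    tp = fp = 0
    for label, ps, pe in pred_segs:
        # Find the same-class ground-truth segment with the highest IoU.
        best_iou, best_idx = 0.0, -1
        for i, (tl, ts, te) in enumerate(true_segs):
            if tl != label:
                continue
            inter = max(0, min(pe, te) - max(ps, ts))
            union = max(pe, te) - min(ps, ts)
            iou = inter / union
            if iou > best_iou:
                best_iou, best_idx = iou, i
        # Each ground-truth segment may be claimed by only one prediction.
        if best_idx >= 0 and best_iou >= iou_threshold and not matched[best_idx]:
            tp += 1
            matched[best_idx] = True
        else:
            fp += 1
    fn = matched.count(False)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy over-segmentation case: one stray frame splits the first action in two.
truth = [0] * 10 + [1] * 10
pred = [0] * 4 + [1] * 1 + [0] * 5 + [1] * 10
frame_accuracy = sum(p == t for p, t in zip(pred, truth)) / len(truth)
f1_at_50 = segmental_f1(pred, truth, 0.50)
```

In this toy case a single mislabeled frame leaves frame-wise accuracy at 0.95 but drops the segmental F1 at the 50% threshold to 2/3, which illustrates why over-segmentation is penalized by this metric even when most frames are correct.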