Improving action segmentation via explicit similarity measurement
Journal: arXiv
Published Date: Feb 15, 2025
Abstract
Existing supervised action segmentation methods depend on the quality of
frame-wise classification using attention mechanisms or temporal convolutions
to capture temporal dependencies. Even boundary-detection-based methods
depend primarily on the accuracy of an initial frame-wise classification, and
can therefore miss the precise location of segments and boundaries when that
prediction is of low quality. To address this problem, this paper proposes ASESM
(Action Segmentation via Explicit Similarity Measurement), which improves
segmentation accuracy by incorporating explicit similarity evaluation across
frames and predictions. Our supervised learning architecture uses frame-level
multi-resolution features as input to multiple Transformer encoders. The
resulting multiple frame-wise predictions are used for similarity voting to
obtain a high-quality initial prediction. We then apply a newly proposed boundary
correction algorithm that uses feature similarity between consecutive frames
to adjust the boundary locations iteratively throughout the
learning process. The corrected prediction is then further refined through
multiple stages of temporal convolutions. As post-processing, we optionally
apply boundary correction again followed by a segment smoothing method that
removes outlier classes within segments using similarity measurement between
consecutive predictions. Additionally, we propose a fully unsupervised boundary
detection-correction algorithm that identifies segment boundaries based solely
on feature similarity without any training. Experiments on 50Salads, GTEA, and
Breakfast datasets show the effectiveness of both the supervised and
unsupervised algorithms. Code and models are available on GitHub.
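
To make the idea of similarity-based boundary detection and correction concrete, here is a minimal sketch. It is not the ASESM algorithm from the paper: the use of cosine similarity, the fixed threshold, the search window, and the function names (`consecutive_similarity`, `detect_boundaries`, `correct_boundary`) are all assumptions made for illustration only.

```python
import numpy as np

def consecutive_similarity(features):
    """Cosine similarity between each frame and the next; returns shape (T-1,).

    `features` is a (T, D) array of per-frame feature vectors.
    """
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    normed = features / (norms + 1e-8)  # small epsilon avoids division by zero
    return np.sum(normed[:-1] * normed[1:], axis=1)

def detect_boundaries(features, threshold=0.5):
    """Frames t at which a new segment starts.

    Assumption: a boundary is declared wherever the similarity between
    frame t-1 and frame t falls below a fixed, hand-picked threshold.
    """
    sims = consecutive_similarity(features)
    return [int(i) + 1 for i in np.nonzero(sims < threshold)[0]]

def correct_boundary(sims, boundary, window=2):
    """Move a proposed boundary to the least-similar transition nearby.

    Assumption: `boundary` is the first frame of the new segment, so it maps
    to transition index boundary - 1 in `sims`; we search +/- `window`
    transitions around it.
    """
    center = boundary - 1
    lo = max(center - window, 0)
    hi = min(center + window, len(sims) - 1)
    best = lo + int(np.argmin(sims[lo:hi + 1]))
    return best + 1

# Toy input: four frames of one "action" followed by four frames of another.
toy = np.array([[1.0, 0.0]] * 4 + [[0.0, 1.0]] * 4)
boundaries = detect_boundaries(toy)
corrected = correct_boundary(consecutive_similarity(toy), boundary=5)
```

In this toy input the only low-similarity transition lies between frames 3 and 4, so a proposed boundary that is off by one frame is pulled back to frame 4. Real video features are noisier, so a fixed threshold would likely need to be replaced by something adaptive.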