Multi-Scale Attention and Gated Shifting for Fine-Grained Event Spotting in Videos
Journal:
arXiv
Published Date:
Jul 10, 2025
Abstract
Precise Event Spotting (PES) in sports videos requires frame-level
recognition of fine-grained actions from single-camera footage. Existing PES
models typically incorporate lightweight temporal modules such as Gate Shift
Module (GSM) or Gate Shift Fuse (GSF) to enrich 2D CNN feature extractors with
temporal context. However, these modules are limited in both temporal receptive
field and spatial adaptability. We propose a Multi-Scale Attention Gate Shift
Module (MSAGSM) that enhances GSM with multi-scale temporal dilations and
multi-head spatial attention, enabling efficient modeling of both short- and
long-term dependencies while focusing on salient regions. MSAGSM is a
lightweight plug-and-play module that can be easily integrated with various 2D
backbones. To further advance the field, we introduce the Table Tennis
Australia (TTA) dataset-the first PES benchmark for table tennis-containing
over 4800 precisely annotated events. Extensive experiments across five PES
benchmarks demonstrate that MSAGSM consistently improves performance with
minimal overhead, setting new state-of-the-art results.