Beyond Pixels: Leveraging the Language of Soccer to Improve Spatio-Temporal Action Detection in Broadcast Videos
Journal:
arXiv
Published Date:
May 14, 2025
Abstract
State-of-the-art spatio-temporal action detection (STAD) methods show
promising results for extracting soccer events from broadcast videos. However,
when operated in the high-recall, low-precision regime required for exhaustive
event coverage in soccer analytics, their lack of contextual understanding
becomes apparent: many false positives could be resolved by considering a
broader sequence of actions and game-state information. In this work, we
address this limitation by reasoning at the game level and improving STAD
through the addition of a denoising sequence transduction task. Sequences of
noisy, context-free player-centric predictions are processed alongside clean
game state information using a Transformer-based encoder-decoder model. By
modeling extended temporal context and reasoning jointly over team-level
dynamics, our method leverages the "language of soccer" - its tactical
regularities and inter-player dependencies - to generate "denoised" sequences
of actions. This approach improves both precision and recall in low-confidence
regimes, enabling more reliable event extraction from broadcast video and
complementing existing pixel-based methods.