Context Sensitive Network for weakly-supervised fine-grained temporal action localization.
Journal:
Neural Networks: the official journal of the International Neural Network Society
Published Date:
Jan 24, 2025
Abstract
Weakly-supervised fine-grained temporal action localization seeks to identify fine-grained action instances in untrimmed videos using only video-level labels. The primary challenge in this task arises from the subtle distinctions among various fine-grained action categories, which complicate the accurate localization of specific action instances. In this paper, we note that the context information embedded within the videos plays a crucial role in overcoming this challenge. However, we also find that effectively integrating context information across different scales is non-trivial, as not all scales provide equally valuable information for distinguishing fine-grained actions. Based on these observations, we propose a weakly-supervised fine-grained temporal action localization approach termed the Context Sensitive Network, which aims to fully leverage context information. Specifically, we first introduce a multi-scale context extraction module designed to efficiently capture multi-scale temporal contexts. Subsequently, we develop a scale-sensitive context gating module that facilitates interaction among multi-scale contexts and adaptively selects informative contexts based on varying video content. Extensive experiments conducted on two benchmark datasets, FineGym and FineAction, demonstrate that our approach achieves state-of-the-art performance.
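The two modules named in the abstract can be illustrated with a loose NumPy sketch. This is not the paper's implementation: the abstract does not specify the operations, so moving-average pooling stands in for multi-scale context extraction, and a softmax over per-scale activation energy stands in for the scale-sensitive gating; all function names and the energy heuristic are assumptions.

```python
import numpy as np

def multi_scale_context(feats, scales=(1, 2, 4)):
    """Extract temporal contexts at several window sizes via moving-average
    pooling (hypothetical stand-in for the paper's extraction module)."""
    T, D = feats.shape
    contexts = []
    for s in scales:
        # pad with edge values so every scale yields a (T, D) context map
        pad = s // 2
        padded = np.pad(feats, ((pad, s - 1 - pad), (0, 0)), mode="edge")
        ctx = np.stack([padded[t:t + s].mean(axis=0) for t in range(T)])
        contexts.append(ctx)
    return np.stack(contexts)  # (num_scales, T, D)

def scale_gating(contexts):
    """Adaptively weight the scales at each time step with a softmax over
    scale-wise mean activation (illustrative gate, not the paper's)."""
    energy = contexts.mean(axis=2)                 # (num_scales, T)
    w = np.exp(energy - energy.max(axis=0))        # softmax over the scale axis
    w = w / w.sum(axis=0)
    return (w[..., None] * contexts).sum(axis=0)   # fused features, (T, D)

# toy usage: 4 time steps, 3-dim snippet features
feats = np.arange(12, dtype=float).reshape(4, 3)
fused = scale_gating(multi_scale_context(feats))
```

In a real model the pooling branches would be learned (e.g. dilated temporal convolutions) and the gate would be a small network conditioned on video content, but the shapes and the scale-wise weighted fusion follow the same pattern.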