Kronecker Mask and Interpretive Prompts are Language-Action Video Learners
Journal: arXiv
Published Date: Feb 5, 2025
Abstract
Contrastive language-image pretraining (CLIP) has significantly advanced
image-based vision learning. A pressing topic subsequently arises: how can we
effectively adapt CLIP to the video domain? Recent studies have focused on
adjusting either the textual or visual branch of CLIP for action recognition.
However, we argue that adaptations of both branches are crucial. In this paper,
we propose CLAVER: a Contrastive
Language-Action Video Learner, designed to
shift CLIP's focus from the alignment of static visual objects and concrete
nouns to the alignment of dynamic action behaviors and abstract verbs.
Specifically, we introduce a novel Kronecker mask attention for temporal
modeling. Our tailored Kronecker mask offers three benefits: 1) it expands the
temporal receptive field for each token, 2) it serves as an effective
spatiotemporal heterogeneity inductive bias, mitigating the issue of
spatiotemporal homogenization, and 3) it can be seamlessly plugged into
transformer-based models. Regarding the textual branch, we leverage large
language models to generate diverse, sentence-level, and semantically rich
interpretive prompts of actions, which shift the model's focus toward verb
comprehension. Extensive experiments on various benchmarks and learning
scenarios demonstrate the superiority and generality of our approach.
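
The abstract does not define the Kronecker mask, so the following is only a minimal sketch of how a Kronecker-product attention mask over video tokens could be built and plugged into standard attention. It assumes one plausible structure (not confirmed by the text above): each token may attend to itself and to every token in other frames, but not to other tokens in its own frame. The function names `kronecker_mask` and `masked_attention`, the frame-major token ordering, and the NumPy implementation are all hypothetical choices made for illustration.

```python
import numpy as np


def kronecker_mask(num_frames: int, tokens_per_frame: int) -> np.ndarray:
    """Build a boolean (T*N, T*N) mask; True means attention is allowed.

    Assumed structure (illustrative, not taken from the paper):
    a token attends to itself and to all tokens in *other* frames.
    Tokens are ordered frame-major: index = frame * N + patch.
    """
    t, n = num_frames, tokens_per_frame
    # (1_TxT - I_T) kron 1_NxN selects every pair of tokens in different frames.
    cross_frame = np.kron(np.ones((t, t)) - np.eye(t), np.ones((n, n)))
    # The identity over all T*N tokens re-enables each token's own position.
    self_token = np.eye(t * n)
    return (cross_frame + self_token) > 0


def masked_attention(q, k, v, mask):
    """Scaled dot-product attention with disallowed pairs set to -inf."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)
    # Subtract the row maximum for numerical stability before the softmax.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


# Usage: 3 frames with 4 patch tokens each, 8-dimensional features.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(3 * 4, 8))
mask = kronecker_mask(num_frames=3, tokens_per_frame=4)
output, attn = masked_attention(tokens, tokens, tokens, mask)
```

Under this assumed structure, every token sees all frames (a wide temporal receptive field) while same-frame spatial pairs are blocked, and the mask is applied to an otherwise unmodified attention layer. For 2 frames of 2 tokens, the first row of the mask is `[True, False, True, True]`.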