SMILE: Infusing Spatial and Motion Semantics in Masked Video Learning
Journal: arXiv
Published Date: Apr 1, 2025
Abstract
Masked video modeling, exemplified by VideoMAE, is an effective paradigm for video
self-supervised learning (SSL). However, such methods primarily rely on
reconstructing pixel-level details of natural videos, which have substantial
temporal redundancy; this limits their capability for semantic representation and
sufficient encoding of motion dynamics. To address these issues, this paper
introduces a novel SSL approach for video representation learning, dubbed
SMILE, that infuses both spatial and motion semantics. In SMILE, we leverage
image-language pretrained models, such as CLIP, to guide the learning process
with their high-level spatial semantics. We enhance the representation of
motion by introducing synthetic motion patterns in the training data, allowing
the model to capture more complex and dynamic content. Furthermore, using
SMILE, we establish a new self-supervised video learning paradigm capable of
learning strong video representations without requiring any natural video data.
We have carried out extensive experiments on 7 datasets with various downstream
scenarios. SMILE surpasses current state-of-the-art SSL methods, showcasing its
effectiveness in learning more discriminative and generalizable video
representations. Code is available at https://github.com/fmthoker/SMILE
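
To make the idea of synthetic motion patterns concrete, the following is a minimal sketch and not the SMILE implementation. It assumes that synthetic motion can be approximated by pasting a small patch onto each frame of a clip and shifting it along a straight line. The function name `add_synthetic_motion` and its parameters are hypothetical and chosen only for illustration; the abstract does not specify how the motion patterns are generated.

```python
import numpy as np


def add_synthetic_motion(clip, patch, start, velocity):
    """Paste `patch` onto every frame of `clip`, moving it linearly over time.

    Hypothetical illustration only; not the procedure used in SMILE.

    clip:     array of shape (T, H, W, C)
    patch:    array of shape (h, w, C), with h <= H and w <= W
    start:    (row, col) of the patch's top-left corner in frame 0
    velocity: (d_row, d_col) displacement applied per frame
    """
    num_frames, height, width, _ = clip.shape
    patch_h, patch_w, _ = patch.shape
    out = clip.copy()  # leave the input clip untouched
    for t in range(num_frames):
        # Clamp the position so the patch always stays fully inside the frame.
        row = int(np.clip(start[0] + t * velocity[0], 0, height - patch_h))
        col = int(np.clip(start[1] + t * velocity[1], 0, width - patch_w))
        out[t, row:row + patch_h, col:col + patch_w] = patch
    return out


# Usage: a static image repeated over time stands in for a clip built
# without natural video; all motion then comes from the pasted patch.
image = np.zeros((8, 8, 3), dtype=np.float32)
clip = np.repeat(image[None], 4, axis=0)       # shape (4, 8, 8, 3)
patch = np.ones((2, 2, 3), dtype=np.float32)
moving = add_synthetic_motion(clip, patch, start=(0, 0), velocity=(1, 2))
```

Under this assumption, a clip built from a single still image contains no motion of its own, so any temporal signal a model picks up must come from the synthetic patch trajectory.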