MOOSE: Pay Attention to Temporal Dynamics for Video Understanding via Optical Flows
Journal: arXiv
Published Date: Jun 1, 2025
Abstract
Many motion-centric video analysis tasks, such as recognizing atomic actions, detecting
atypical motor behavior in individuals with autism, or analyzing articulatory
motion in real-time MRI of human speech, require efficient and interpretable
temporal modeling. Capturing temporal dynamics is a central challenge in video
analysis, often requiring significant computational resources and fine-grained
annotations that are not widely available. This paper presents MOOSE (Motion
Flow Over Spatial Space), a novel temporally-centric video encoder explicitly
integrating optical flow with spatial embeddings to model temporal information
efficiently, inspired by human perception of motion. Unlike prior models, MOOSE
takes advantage of rich, widely available pre-trained visual and optical flow
encoders instead of training video models from scratch. This significantly
reduces computational complexity while enhancing temporal interpretability. Our
primary contributions include (1) proposing a computationally efficient
temporally-centric architecture for video understanding; (2) demonstrating
enhanced interpretability in modeling temporal dynamics; and (3) achieving
state-of-the-art performance on diverse benchmarks, including clinical,
medical, and standard action recognition datasets, confirming the broad
applicability and effectiveness of our approach.