Adaptive Keyframe Sampling for Long Video Understanding
Journal:
arXiv
Published Date:
Feb 28, 2025
Abstract
Multimodal large language models (MLLMs) have enabled open-world visual
understanding by injecting visual input as extra tokens into large language
models (LLMs) as contexts. However, when the visual input changes from a single
image to a long video, the above paradigm encounters difficulty because the
vast amount of video tokens has significantly exceeded the maximal capacity of
MLLMs. Therefore, existing video-based MLLMs are mostly established upon
sampling a small portion of tokens from input data, which can cause key
information to be lost and thus produce incorrect answers. This paper presents
a simple yet effective algorithm named Adaptive Keyframe Sampling (AKS). It
inserts a plug-and-play module known as keyframe selection, which aims to
maximize the useful information with a fixed number of video tokens. We
formulate keyframe selection as an optimization involving (1) the relevance
between the keyframes and the prompt, and (2) the coverage of the keyframes
over the video, and present an adaptive algorithm to approximate the best
solution. Experiments on two long video understanding benchmarks validate that
Adaptive Keyframe Sampling improves video QA accuracy (beyond strong baselines)
upon selecting informative keyframes. Our study reveals the importance of
information pre-filtering in video-based MLLMs. Code is available at
https://github.com/ncTimTang/AKS.