Movie Facts and Fibs (MF$^2$): A Benchmark for Long Movie Understanding
Journal:
arXiv
Published Date:
Jun 6, 2025
Abstract
Despite recent progress in vision-language models (VLMs), holistic
understanding of long-form video content remains a significant challenge,
partly due to limitations in current benchmarks. Many focus on peripheral,
``needle-in-a-haystack'' details, encouraging context-insensitive retrieval
over deep comprehension. Others rely on large-scale, semi-automatically
generated questions (often produced by language models themselves) that are
easier for models to answer but fail to reflect genuine understanding. In this
paper, we introduce MF$^2$, a new benchmark for evaluating whether models can
comprehend, consolidate, and recall key narrative information from full-length
movies (50-170 minutes long). MF$^2$ includes over 50 full-length,
open-licensed movies, each paired with manually constructed sets of claim pairs
-- one true (fact) and one plausible but false (fib), totalling over 850 pairs.
These claims target core narrative elements such as character motivations and
emotions, causal chains, and event order, and refer to memorable moments that
humans can recall without rewatching the movie. Instead of multiple-choice
formats, we adopt a binary claim evaluation protocol: for each pair, models
must correctly identify both the true and false claims. This reduces biases
like answer ordering and enables a more precise assessment of reasoning. Our
experiments demonstrate that both open-weight and closed state-of-the-art
models fall well short of human performance, underscoring the relative ease of
the task for humans and their superior ability to retain and reason over
critical narrative information -- an ability current VLMs lack.