Burn After Reading: Do Multimodal Large Language Models Truly Capture Order of Events in Image Sequences?
Journal: arXiv
Published Date: Jun 12, 2025
Abstract
This paper introduces the TempVS benchmark, which focuses on the temporal
grounding and reasoning capabilities of Multimodal Large Language Models
(MLLMs) in image sequences. TempVS consists of three main tests (i.e., event
relation inference, sentence ordering and image ordering), each accompanied
by a basic grounding test. TempVS requires MLLMs to rely on both visual and
linguistic modalities to understand the temporal order of events. We evaluate
38 state-of-the-art MLLMs, demonstrating that models struggle to solve TempVS,
with a substantial performance gap compared to human capabilities. We also
provide fine-grained insights that suggest promising directions for future
research. Our TempVS benchmark data and code are available at
https://github.com/yjsong22/TempVS.