Burn After Reading: Do Multimodal Large Language Models Truly Capture Order of Events in Image Sequences?

Journal: arXiv

Published Date: Jun 12, 2025

Abstract

This paper introduces the TempVS benchmark, which focuses on the temporal grounding and reasoning capabilities of Multimodal Large Language Models (MLLMs) in image sequences. TempVS consists of three main tests (event relation inference, sentence ordering, and image ordering), each accompanied by a basic grounding test. TempVS requires MLLMs to rely on both visual and linguistic modalities to understand the temporal order of events. We evaluate 38 state-of-the-art MLLMs and show that they struggle to solve TempVS, with a substantial performance gap relative to humans. We also provide fine-grained insights that suggest promising directions for future research. Our TempVS benchmark data and code are available at https://github.com/yjsong22/TempVS.

Authors

Yingjin Song
Yupei Du
Denis Paperno
Albert Gatt

External Resources

View on arXiv: http://arxiv.org/abs/2506.10415v1
