Beyond Single Frames: Can LMMs Comprehend Temporal and Contextual Narratives in Image Sequences?
Journal: arXiv
Published Date: Feb 19, 2025
Abstract
Large Multimodal Models (LMMs) have achieved remarkable success across
various visual-language tasks. However, existing benchmarks predominantly focus
on single-image understanding, leaving the analysis of image sequences largely
unexplored. To address this limitation, we introduce StripCipher, a
comprehensive benchmark designed to evaluate the ability of LMMs to comprehend
and reason over sequential images. StripCipher comprises a human-annotated
dataset and three challenging subtasks: visual narrative comprehension,
contextual frame prediction, and temporal narrative reordering. Our evaluation
of 16 state-of-the-art LMMs, including GPT-4o and Qwen2.5VL, reveals a
significant performance gap compared to human capabilities, particularly in
tasks that require reordering shuffled sequential images. For instance, GPT-4o
achieves only 23.93% accuracy on the reordering subtask, 56.07 percentage points lower
than human performance. Further quantitative analysis examines several factors,
such as the input format of images, that affect the performance of LMMs in sequential
understanding, underscoring the fundamental challenges that remain in the
development of LMMs.