ALLVB: All-in-One Long Video Understanding Benchmark
Journal: arXiv
Published Date: Mar 10, 2025
Abstract
From image to video understanding, Multi-modal LLMs (MLLMs) are becoming
increasingly capable. However, the videos in most existing video understanding
benchmarks are relatively short, which makes these benchmarks inadequate for
evaluating the long-sequence modeling capabilities of MLLMs. This highlights
the urgent need for a comprehensive and integrated long video understanding
benchmark that assesses MLLMs thoroughly. To this end, we propose
ALLVB (ALL-in-One Long Video Understanding Benchmark). ALLVB's main
contributions include: 1) It integrates 9 major video understanding tasks.
These tasks are converted into a video QA format, allowing a single benchmark
to evaluate 9 different video understanding capabilities of MLLMs, which makes
ALLVB versatile, comprehensive, and challenging. 2) It is annotated by a fully
automated pipeline built on GPT-4o that requires human involvement only for
quality control, which facilitates the maintenance and expansion of the
benchmark. 3) It contains 1,376 videos across 16 categories, averaging nearly 2
hours each, with a total of 252k QAs. To the best of our knowledge, it is the
largest long video understanding benchmark in terms of the number of videos,
average duration, and number of QAs. We have tested various mainstream MLLMs on
ALLVB, and the results indicate that even the most advanced commercial models
have significant room for improvement. This reflects the benchmark's
challenging nature and demonstrates the substantial potential for development
in long video understanding.
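
Because all 9 tasks share one QA format, a single scoring loop can report accuracy per task. The sketch below only illustrates that idea. It assumes multiple-choice questions with a single-letter answer, and the `QAItem` fields, the task labels, and the `per_task_accuracy` helper are hypothetical names, not ALLVB's actual schema or evaluation code.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple


@dataclass(frozen=True)
class QAItem:
    """One QA record (hypothetical schema, assumed for illustration)."""
    video_id: str
    task: str                  # which of the benchmark's tasks this item probes
    question: str
    options: Tuple[str, ...]   # candidate answers (multiple choice is assumed)
    answer: str                # ground-truth option letter, e.g. "A"


def per_task_accuracy(items: Sequence[QAItem],
                      predictions: Sequence[str]) -> Dict[str, float]:
    """Return the fraction of correct predictions for each task label."""
    if len(items) != len(predictions):
        raise ValueError("items and predictions must have the same length")
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for item, pred in zip(items, predictions):
        total[item.task] += 1
        # Normalize the model output before comparing option letters.
        if pred.strip().upper() == item.answer:
            correct[item.task] += 1
    return {task: correct[task] / total[task] for task in total}


# Toy data with placeholder task labels; real items would come from the benchmark.
items: List[QAItem] = [
    QAItem("v1", "task_a", "q1", ("x", "y"), "A"),
    QAItem("v1", "task_a", "q2", ("x", "y"), "B"),
    QAItem("v2", "task_b", "q3", ("x", "y"), "A"),
]
predictions = ["A", "A", " a "]
scores = per_task_accuracy(items, predictions)
print(scores)
```

Keeping the task label on each record means that adding a new task requires no change to the scoring code, which is one practical benefit of a unified QA format.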