Benchmarking Multimodal Mathematical Reasoning with Explicit Visual Dependency
Journal:
arXiv
Published Date:
Apr 24, 2025
Abstract
Recent advancements in Large Vision-Language Models (LVLMs) have
significantly enhanced their ability to integrate visual and linguistic
information, achieving near-human proficiency in tasks like object recognition,
captioning, and visual question answering. However, current benchmarks
typically focus on knowledge-centric evaluations that assess domain-specific
expertise, often neglecting the core ability to reason about fundamental
mathematical elements and visual concepts. We identify a gap in evaluating
elementary-level math problems, which rely on explicit visual
dependencies-requiring models to discern, integrate, and reason across multiple
images while incorporating commonsense knowledge, all of which are crucial for
advancing toward broader AGI capabilities. To address this gap, we introduce
VCBENCH, a comprehensive benchmark for multimodal mathematical reasoning with
explicit visual dependencies. VCBENCH includes 1,720 problems across six
cognitive domains, featuring 6,697 images (averaging 3.9 per question) to
ensure multi-image reasoning. We evaluate 26 state-of-the-art LVLMs on VCBENCH,
revealing substantial performance disparities, with even the top models unable
to exceed 50% accuracy. Our findings highlight the ongoing challenges in
visual-mathematical integration and suggest avenues for future LVLM
advancements. The project can be found at
https://alibaba-damo-academy.github.io/VCBench/.