MMReason: An Open-Ended Multi-Modal Multi-Step Reasoning Benchmark for MLLMs Toward AGI
Journal:
arXiv
Published Date:
Jun 30, 2025
Abstract
Reasoning plays a crucial role in advancing Multimodal Large Language Models
(MLLMs) toward Artificial General Intelligence. However, existing MLLM
benchmarks often fall short in precisely and comprehensively evaluating
long-chain reasoning abilities from three key aspects: (1) lack of difficulty
and diversity, (2) susceptibility to guessability and memorization, (3)
inadequate assessment of intermediate reasoning steps. To fill this gap, we
introduce MMReason, a new benchmark designed to precisely and comprehensively
evaluate MLLM long-chain reasoning capability with diverse, open-ended,
challenging questions. First, we curate challenging questions requiring
multi-step reasoning from various fields (i.e., 6 disciplines) and multiple
difficulty levels (i.e., from pre-university to university, and from
foundational to competition tiers). Second, these questions are reformulated
into an open-ended format and filtered using a multi-model voting technique to
eliminate shortcut cases related to guessing and memorization, ensuring robust
reasoning evaluations. Third, we annotate the questions with detailed
step-by-step solutions, and design a reference-based ternary scoring mechanism
to reliably assess intermediate reasoning steps. With MMReason, we benchmark
popular leading MLLMs and provide an in-depth analysis of their reasoning
capabilities. We hope MMReason will serve as a valuable resource for advancing
MLLM reasoning research. Code will be available at
https://github.com/HJYao00/MMReason.