M$^3$-Med: A Benchmark for Multi-lingual, Multi-modal, and Multi-hop Reasoning in Medical Instructional Video Understanding
Journal: arXiv
Published Date: Jul 6, 2025
Abstract
With the rapid progress of artificial intelligence (AI) in multi-modal
understanding, there is increasing potential for video comprehension
technologies to support professional domains such as medical education.
However, existing benchmarks suffer from two primary limitations: (1)
Linguistic Singularity: they are largely confined to English, neglecting the
need for multilingual resources; and (2) Shallow Reasoning: their questions are
often designed for surface-level information retrieval, failing to properly
assess deep multi-modal integration. To address these limitations, we present
M3-Med, the first benchmark for Multi-lingual, Multi-modal, and Multi-hop
reasoning in Medical instructional video understanding. M3-Med consists of
medical questions paired with corresponding video segments, annotated by a team
of medical experts. A key innovation of M3-Med is its multi-hop reasoning task,
which requires a model to first locate a key entity in the text, then find
corresponding visual evidence in the video, and finally synthesize information
across both modalities to derive the answer. This design moves beyond simple
text matching and poses a substantial challenge to a model's deep cross-modal
understanding capabilities. We define two tasks: Temporal Answer Grounding in
Single Video (TAGSV) and Temporal Answer Grounding in Video Corpus (TAGVC). We
evaluate several state-of-the-art models and Large Language Models (LLMs) on
M3-Med. The results reveal a significant performance gap between all models and
human experts, especially on the complex multi-hop questions where model
performance drops sharply. M3-Med effectively highlights the current
limitations of AI models in deep cross-modal reasoning within specialized
domains and provides a new direction for future research.
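
To make the two task names concrete, the following is a minimal sketch of how a temporal answer grounding prediction might be scored. The abstract does not state the evaluation metric, so temporal intersection-over-union (IoU), a common choice for temporal grounding, is assumed here. The names `Span`, `temporal_iou`, `tagsv_hit`, `tagvc_hit`, and the 0.5 threshold are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """A video segment in seconds (hypothetical representation)."""
    start: float
    end: float


def temporal_iou(pred: Span, gold: Span) -> float:
    """Intersection-over-union of two time spans; 0.0 if they do not overlap."""
    inter = max(0.0, min(pred.end, gold.end) - max(pred.start, gold.start))
    union = (pred.end - pred.start) + (gold.end - gold.start) - inter
    return inter / union if union > 0 else 0.0


def tagsv_hit(pred: Span, gold: Span, threshold: float = 0.5) -> bool:
    """Single-video setting: the video is given, only the span is predicted.

    The threshold value is an assumption for illustration.
    """
    return temporal_iou(pred, gold) >= threshold


def tagvc_hit(pred_video: str, pred: Span,
              gold_video: str, gold: Span,
              threshold: float = 0.5) -> bool:
    """Corpus setting: the model must also pick the right video.

    A prediction counts only if the video matches and the span overlaps enough.
    """
    return pred_video == gold_video and tagsv_hit(pred, gold, threshold)
```

Under these assumptions, a corpus-level prediction that localizes the span perfectly but in the wrong video still counts as a miss, which is one way the corpus task can be harder than the single-video task.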