Overview of the NLPCC 2025 Shared Task 4: Multi-modal, Multilingual, and Multi-hop Medical Instructional Video Question Answering Challenge
Journal: arXiv
Published Date: May 11, 2025
Abstract
Following the successful hosting of the 1st (NLPCC 2023 Foshan) CMIVQA and the
2nd (NLPCC 2024 Hangzhou) MMIVQA challenges, this year, a new task has been
introduced to further advance research in multi-modal, multilingual, and
multi-hop medical instructional video question answering (M4IVQA) systems. The
M4IVQA challenge focuses on
evaluating models that integrate information from medical instructional videos,
understand multiple languages, and answer multi-hop questions requiring
reasoning over various modalities. This task consists of three tracks:
multi-modal, multilingual, and multi-hop Temporal Answer Grounding in Single
Video (M4TAGSV); multi-modal, multilingual, and multi-hop Video Corpus
Retrieval (M4VCR); and multi-modal, multilingual, and multi-hop Temporal Answer
Grounding in Video Corpus (M4TAGVC). Participants in M4IVQA are expected to
develop algorithms capable of processing both video and text data,
understanding multilingual queries, and providing relevant answers to multi-hop
medical questions. We believe the newly introduced M4IVQA challenge will drive
innovations in multimodal reasoning systems for healthcare scenarios,
ultimately contributing to smarter emergency response systems and more
effective medical education platforms in multilingual communities. Our official
website is https://cmivqa.github.io/