Kvasir-VQA-x1: A Multimodal Dataset for Medical Reasoning and Robust MedVQA in Gastrointestinal Endoscopy
Journal:
arXiv
Published Date:
Jun 11, 2025
Abstract
Medical Visual Question Answering (MedVQA) is a promising field for
developing clinical decision support systems, yet progress is often limited by
the available datasets, which can lack clinical complexity and visual
diversity. To address these gaps, we introduce Kvasir-VQA-x1, a new,
large-scale dataset for gastrointestinal (GI) endoscopy. Our work significantly
expands upon the original Kvasir-VQA by incorporating 159,549 new
question-answer pairs that are designed to test deeper clinical reasoning. We
developed a systematic method using large language models to generate these
questions, which are stratified by complexity to better assess a model's
inference capabilities. To ensure our dataset prepares models for real-world
clinical scenarios, we have also introduced a variety of visual augmentations
that mimic common imaging artifacts. The dataset is structured to support two
main evaluation tracks: one for standard VQA performance and another to test
model robustness against these visual perturbations. By providing a more
challenging and clinically relevant benchmark, Kvasir-VQA-x1 aims to accelerate
the development of more reliable and effective multimodal AI systems for use in
clinical settings. The dataset is fully accessible and adheres to FAIR data
principles, making it a valuable resource for the wider research community.
Code and data: https://github.com/Simula/Kvasir-VQA-x1 and
https://huggingface.co/datasets/SimulaMet/Kvasir-VQA-x1