ReXVQA: A Large-scale Visual Question Answering Benchmark for Generalist Chest X-ray Understanding
Journal:
arXiv
Published Date:
Jun 4, 2025
Abstract
We present ReXVQA, the largest and most comprehensive benchmark for visual
question answering (VQA) in chest radiology, comprising approximately 696,000
questions paired with 160,000 chest X-ray studies across training, validation,
and test sets. Unlike prior efforts that rely heavily on template-based
queries, ReXVQA introduces a diverse and clinically authentic task suite
reflecting five core radiological reasoning skills: presence assessment,
location analysis, negation detection, differential diagnosis, and geometric
reasoning. We evaluate eight state-of-the-art multimodal large language models,
including MedGemma-4B-it, Qwen2.5-VL, Janus-Pro-7B, and Eagle2-9B. The
best-performing model (MedGemma) achieves 83.24% overall accuracy. To bridge
the gap between AI performance and clinical expertise, we conducted a
comprehensive human reader study involving 3 radiology residents on 200
randomly sampled cases. Our evaluation demonstrates that MedGemma achieved
superior performance (83.84% accuracy) compared to human readers (best
radiology resident: 77.27%), representing a significant milestone where AI
performance exceeds expert human evaluation on chest X-ray interpretation. The
reader study reveals distinct performance patterns between AI models and human
experts: inter-reader agreement among the radiologists is strong, whereas
agreement between human readers and AI models is more variable. ReXVQA
establishes a new standard for evaluating generalist radiological AI systems,
offering public leaderboards, fine-grained evaluation splits, structured
explanations, and category-level breakdowns. This benchmark lays the foundation
for next-generation AI systems capable of mimicking expert-level clinical
reasoning beyond narrow pathology classification. Our dataset will be
open-sourced at https://huggingface.co/datasets/rajpurkarlab/ReXVQA
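
To make the reported metrics concrete, the following is a minimal sketch of how overall accuracy and a category-level breakdown could be computed from model predictions. It is not the authors' evaluation code. The record layout (the `category`, `answer`, and `prediction` fields), the single-letter answer format, and the function name `accuracy_by_category` are all assumptions made for illustration; the category names are taken from the five reasoning skills listed above.

```python
from collections import defaultdict


def accuracy_by_category(records):
    """Return (overall_accuracy, {category: accuracy}).

    Each record is a dict with hypothetical keys:
    "category", "answer" (ground truth), and "prediction".
    """
    totals = defaultdict(int)   # questions seen per category
    correct = defaultdict(int)  # correct predictions per category
    for r in records:
        totals[r["category"]] += 1
        if r["prediction"] == r["answer"]:
            correct[r["category"]] += 1

    n = sum(totals.values())
    overall = sum(correct.values()) / n if n else 0.0
    per_category = {c: correct[c] / totals[c] for c in totals}
    return overall, per_category


# Toy records for illustration only; these are not ReXVQA data.
records = [
    {"category": "presence assessment", "answer": "A", "prediction": "A"},
    {"category": "presence assessment", "answer": "B", "prediction": "C"},
    {"category": "negation detection", "answer": "D", "prediction": "D"},
    {"category": "geometric reasoning", "answer": "A", "prediction": "A"},
]
overall, per_category = accuracy_by_category(records)
```

Reporting the per-category values alongside the overall figure shows whether a high aggregate score hides weak performance on a single skill, such as negation detection.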