DrVD-Bench: Do Vision-Language Models Reason Like Human Doctors in Medical Image Diagnosis?
Journal: arXiv
Published Date: May 30, 2025
Abstract
Vision-language models (VLMs) exhibit strong zero-shot generalization on
natural images and show early promise in interpretable medical image analysis.
However, existing benchmarks do not systematically evaluate whether these
models truly reason like human clinicians or merely imitate superficial
patterns. To address this gap, we propose DrVD-Bench, the first multimodal
benchmark for clinical visual reasoning. DrVD-Bench consists of three modules:
Visual Evidence Comprehension, Reasoning Trajectory Assessment, and Report
Generation Evaluation, comprising a total of 7,789 image-question pairs. Our
benchmark covers 20 task types, 17 diagnostic categories, and five imaging
modalities: CT, MRI, ultrasound, radiography, and pathology. DrVD-Bench is
explicitly structured to reflect the clinical reasoning workflow from modality
recognition to lesion identification and diagnosis. We benchmark 19 VLMs,
including general-purpose and medical-specific, open-source and proprietary
models, and observe that performance drops sharply as reasoning complexity
increases. While some models begin to exhibit traces of human-like reasoning,
they often still rely on shortcut correlations rather than grounded visual
understanding. DrVD-Bench offers a rigorous and structured evaluation framework
to guide the development of clinically trustworthy VLMs.