MedSG-Bench: A Benchmark for Medical Image Sequences Grounding
Journal: arXiv
Published Date: May 17, 2025
Abstract
Visual grounding is essential for precise perception and reasoning in
multimodal large language models (MLLMs), especially in medical imaging
domains. While existing medical visual grounding benchmarks primarily focus on
single-image scenarios, real-world clinical applications often involve
sequential images, where accurate lesion localization across different
modalities and temporal tracking of disease progression (e.g., pre- vs.
post-treatment comparison) require fine-grained cross-image semantic alignment
and context-aware reasoning. To remedy the underrepresentation of image
sequences in existing medical visual grounding benchmarks, we propose
MedSG-Bench, the first benchmark tailored for Medical Image Sequences
Grounding. It comprises eight VQA-style tasks organized into two grounding
paradigms: 1) Image Difference Grounding, which focuses on detecting regions
that change across images, and 2) Image Consistency Grounding, which
emphasizes detecting consistent or shared semantics across sequential
images. MedSG-Bench covers 76 public datasets, 10 medical imaging modalities,
and a wide spectrum of anatomical structures and diseases, totaling 9,630
question-answer pairs. We benchmark both general-purpose MLLMs (e.g.,
Qwen2.5-VL) and medical-domain specialized MLLMs (e.g., HuatuoGPT-vision),
observing that even advanced models exhibit substantial limitations in
medical sequential grounding tasks. To advance this field, we construct
MedSG-188K, a large-scale instruction-tuning dataset tailored for sequential
visual grounding, and further develop MedSeq-Grounder, an MLLM designed to
facilitate future research on fine-grained understanding across medical
sequential images. The benchmark, dataset, and model are available at
https://huggingface.co/MedSG-Bench