GroundingDINO-US-SAM: Text-Prompted Multi-Organ Segmentation in Ultrasound with LoRA-Tuned Vision-Language Models
Journal:
arXiv
Published Date:
Jun 30, 2025
Abstract
Accurate and generalizable object segmentation in ultrasound imaging remains
a significant challenge due to anatomical variability, diverse imaging
protocols, and limited annotated data. In this study, we propose a
prompt-driven vision-language model (VLM) that integrates Grounding DINO with
SAM2 to enable object segmentation across multiple ultrasound organs. A total
of 18 public ultrasound datasets, encompassing the breast, thyroid, liver,
prostate, kidney, and paraspinal muscle, were utilized. These datasets were
divided into 15 for fine-tuning and validation of Grounding DINO using Low Rank
Adaptation (LoRA) to the ultrasound domain, and 3 were held out entirely for
testing to evaluate performance in unseen distributions. Comprehensive
experiments demonstrate that our approach outperforms state-of-the-art
segmentation methods, including UniverSeg, MedSAM, MedCLIP-SAM, BiomedParse,
and SAMUS on most seen datasets while maintaining strong performance on unseen
datasets without additional fine-tuning. These results underscore the promise
of VLMs in scalable and robust ultrasound image analysis, reducing dependence
on large, organ-specific annotated datasets. We will publish our code on
code.sonography.ai after acceptance.