BiPrompt-SAM: Enhancing Image Segmentation via Explicit Selection between Point and Text Prompts
Journal: arXiv
Published Date: Mar 25, 2025
Abstract
Segmentation is a fundamental task in computer vision, with prompt-driven
methods gaining prominence due to their flexibility. The Segment Anything Model
(SAM) excels at point-prompted segmentation, while text-based models, often
leveraging powerful multimodal encoders like BEIT-3, provide rich semantic
understanding. However, effectively combining these complementary modalities
remains a challenge. This paper introduces BiPrompt-SAM, a novel dual-modal
prompt segmentation framework employing an explicit selection mechanism. We
leverage SAM's ability to generate multiple mask candidates from a single point
prompt and use a text-guided mask (generated via EVF-SAM with BEIT-3) to select
the point-generated mask that best aligns spatially, measured by Intersection
over Union (IoU). This approach, interpretable as a simplified Mixture of
Experts (MoE), effectively fuses spatial precision and semantic context without
complex model modifications. Notably, our method achieves strong zero-shot
performance on the EndoVis17 medical dataset (89.55% mDice, 81.46% mIoU) using
only a single point prompt per instance. This significantly reduces annotation
burden compared to bounding boxes and aligns better with practical clinical
workflows, demonstrating the method's effectiveness without domain-specific
training. On the RefCOCO series, BiPrompt-SAM attains 87.1%, 86.5%, and 85.8%
IoU, significantly outperforming existing approaches. Experiments show
BiPrompt-SAM excels in scenarios requiring both spatial accuracy and semantic
disambiguation, offering a simple, effective, and interpretable perspective on
multi-modal prompt fusion.
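
The selection step described in the abstract can be illustrated with a minimal sketch. It assumes the point-prompted candidate masks and the text-guided mask have already been produced (by SAM and EVF-SAM respectively in the paper) and are available as binary grids of equal size. The function names `iou` and `select_mask` and the toy masks are hypothetical, chosen for illustration only; this is not the authors' implementation.

```python
from typing import List, Sequence, Tuple

# Assumption: a mask is a grid of 0/1 (or bool) pixels, one inner sequence per row.
Mask = Sequence[Sequence[int]]


def iou(a: Mask, b: Mask) -> float:
    """Intersection over Union of two binary masks of the same shape."""
    inter = 0
    union = 0
    for row_a, row_b in zip(a, b):
        for pa, pb in zip(row_a, row_b):
            if pa and pb:
                inter += 1
            if pa or pb:
                union += 1
    # Two empty masks have no overlap to measure; report 0.0 by convention.
    return inter / union if union else 0.0


def select_mask(point_masks: List[Mask], text_mask: Mask) -> Tuple[int, float]:
    """Return (index, IoU) of the point-generated candidate that best
    overlaps the text-guided mask."""
    if not point_masks:
        raise ValueError("at least one candidate mask is required")
    scores = [iou(m, text_mask) for m in point_masks]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


# Toy example: three candidates from one point prompt, one text-guided mask.
text_mask = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
candidates = [
    [[0, 1, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]],  # too small
    [[0, 1, 1, 0], [0, 1, 0, 0], [0, 0, 0, 0]],  # close match
    [[1, 1, 1, 1], [1, 1, 1, 1], [0, 0, 0, 0]],  # too large
]
best_index, best_iou = select_mask(candidates, text_mask)
print(best_index, best_iou)
```

In this sketch the text-guided mask acts only as a selector: the returned mask is always one of the point-generated candidates, so its boundaries come from the point prompt while the text prompt resolves which candidate is meant.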