Customized SAM 2 for Referring Remote Sensing Image Segmentation
Journal: arXiv
Published Date: Mar 10, 2025
Abstract
Referring Remote Sensing Image Segmentation (RRSIS) aims to segment target
objects in remote sensing (RS) images based on textual descriptions. Although
Segment Anything Model 2 (SAM 2) has shown remarkable performance in various
segmentation tasks, its application to RRSIS presents several challenges,
including understanding the text-described RS scenes and generating effective
prompts from text descriptions. To address these issues, we propose RS2-SAM 2,
a novel framework that adapts SAM 2 to RRSIS by aligning the adapted RS
features and textual features, providing pseudo-mask-based dense prompts, and
enforcing boundary constraints. Specifically, we first employ a union encoder
to jointly encode the visual and textual inputs, generating aligned visual and
text embeddings as well as multimodal class tokens. Then, we design a
bidirectional hierarchical fusion module to adapt SAM 2 to RS scenes and align
adapted visual features with the visually enhanced text embeddings, improving
the model's interpretation of text-described RS scenes. Additionally, we
introduce a mask prompt generator that takes the visual embeddings and class
tokens as input and produces a pseudo-mask that serves as the dense prompt for
SAM 2. To further
refine segmentation, we introduce a text-guided boundary loss to optimize
segmentation boundaries by computing text-weighted gradient differences.
Experimental results on several RRSIS benchmarks demonstrate that RS2-SAM 2
achieves state-of-the-art performance.
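
The abstract describes the text-guided boundary loss only as "computing text-weighted gradient differences" and gives no formula. The block below is a minimal sketch of one way such a loss could be computed, not the authors' formulation. It assumes that the gradients are spatial finite differences of the predicted and ground-truth masks, and that the text guidance is available as a per-pixel weight map. The names `spatial_gradient_magnitude`, `text_weighted_boundary_loss`, and `text_weight` are hypothetical and introduced here for illustration only.

```python
import numpy as np


def spatial_gradient_magnitude(mask: np.ndarray) -> np.ndarray:
    """Finite-difference gradient magnitude of a 2-D (soft) mask."""
    # np.gradient returns one array per axis: rows (y) first, then columns (x).
    gy, gx = np.gradient(mask.astype(np.float64))
    return np.sqrt(gx ** 2 + gy ** 2)


def text_weighted_boundary_loss(
    pred: np.ndarray,
    target: np.ndarray,
    text_weight: np.ndarray,
    eps: float = 1e-8,
) -> float:
    """Weighted mean absolute difference between mask gradient magnitudes.

    Assumption: `text_weight` is a non-negative per-pixel map derived from the
    text embeddings; how it is produced is not specified in the abstract.
    """
    grad_diff = np.abs(
        spatial_gradient_magnitude(pred) - spatial_gradient_magnitude(target)
    )
    return float((text_weight * grad_diff).sum() / (text_weight.sum() + eps))


if __name__ == "__main__":
    # Toy example: a square target and a prediction shifted by one pixel.
    target = np.zeros((8, 8))
    target[2:6, 2:6] = 1.0
    pred = np.roll(target, 1, axis=1)
    uniform_weight = np.ones_like(target)  # stand-in for a text-derived map
    print(text_weighted_boundary_loss(pred, target, uniform_weight))
```

In this sketch the loss is zero when the predicted and target boundaries coincide and grows as they diverge, with pixels that carry a larger text weight contributing more to the penalty.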