Semantic Localization Guiding Segment Anything Model For Reference Remote Sensing Image Segmentation
Journal:
arXiv
Published Date:
Jun 12, 2025
Abstract
The Reference Remote Sensing Image Segmentation (RRSIS) task, which generates
segmentation masks for objects specified by textual descriptions, has
attracted widespread attention and research interest.
Current RRSIS methods rely on multi-modal fusion backbones and semantic
segmentation heads but face challenges like dense annotation requirements and
complex scene interpretation. To address these issues, we propose a framework
named prompt-generated semantic localization guiding Segment Anything
Model (PSLG-SAM), which decomposes the RRSIS task into two stages: coarse
localization and fine segmentation. In the coarse localization stage, a visual
grounding network roughly locates the text-described object. In the fine
segmentation stage, the coordinates from the first stage guide the Segment
Anything Model (SAM), enhanced by a clustering-based foreground point generator
and a mask boundary iterative optimization strategy for precise segmentation.
Notably, the second stage can be training-free, significantly reducing the
annotation data burden for the RRSIS task. Additionally, decomposing the RRSIS
task into two stages allows for focusing on specific region segmentation,
avoiding interference from complex scenes. We further contribute a high-quality,
multi-category manually annotated dataset. Experimental validation on two
datasets (RRSIS-D and RRSIS-M) demonstrates that PSLG-SAM achieves significant
performance improvements and surpasses existing state-of-the-art models. Our
code will be made publicly available.
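
To make the coarse-to-fine idea concrete, the following Python sketch shows one way a clustering-based foreground point could be derived from a stage-one bounding box. It is not the authors' implementation: the abstract does not specify the clustering algorithm, the features, or the rule for choosing the foreground cluster. The use of k-means on pixel colors, the "cluster closest to the box center" heuristic, and the names `kmeans` and `foreground_point` are all assumptions made for illustration.

```python
import numpy as np


def kmeans(features, k=2, iters=10):
    """Plain Lloyd's k-means; returns one cluster label per row of `features`."""
    # Deterministic init (assumption): seed the centers with pixels spread
    # evenly across the brightness range, from darkest to brightest.
    order = np.argsort(features.mean(axis=1))
    init_idx = order[np.linspace(0, len(order) - 1, k).astype(int)]
    centers = features[init_idx].astype(float)
    labels = np.zeros(len(features), dtype=int)
    for _ in range(iters):
        dists = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = features[labels == c].mean(axis=0)
    return labels


def foreground_point(image, box, k=2):
    """Return an (x, y) point prompt inside `box` = (x0, y0, x1, y1).

    Hypothetical helper: clusters the pixels of the box crop by color and
    returns a pixel from the cluster assumed to be the object.
    """
    x0, y0, x1, y1 = box
    crop = image[y0:y1, x0:x1].astype(float)
    h, w = crop.shape[:2]
    labels = kmeans(crop.reshape(h * w, -1), k=k).reshape(h, w)
    ys, xs = np.mgrid[0:h, 0:w]
    center_y, center_x = (h - 1) / 2.0, (w - 1) / 2.0

    # Heuristic (assumption): the object is the cluster whose pixels lie,
    # on average, closest to the center of the box.
    best, best_dist = None, np.inf
    for c in range(k):
        mask = labels == c
        if not mask.any():
            continue
        dist = np.hypot(ys[mask] - center_y, xs[mask] - center_x).mean()
        if dist < best_dist:
            best, best_dist = c, dist

    # Pick the cluster pixel nearest the cluster's spatial centroid, so the
    # returned point is guaranteed to lie on the chosen cluster.
    mask = labels == best
    cy, cx = ys[mask].mean(), xs[mask].mean()
    i = np.argmin(np.hypot(ys[mask] - cy, xs[mask] - cx))
    return int(xs[mask][i]) + x0, int(ys[mask][i]) + y0
```

In a two-stage pipeline of this kind, the returned point and the box would then be handed to a promptable segmenter such as SAM as a foreground point prompt and a box prompt. The mask boundary iterative optimization mentioned in the abstract is not covered by this sketch.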