LGD: Leveraging Generative Descriptions for Zero-Shot Referring Image Segmentation
Journal: arXiv
Published Date: Apr 20, 2025
Abstract
Zero-shot referring image segmentation aims to locate and segment the target
region based on a referring expression, with the primary challenge of aligning
and matching semantics across visual and textual modalities without training.
Previous work addresses this challenge by using Vision-Language Models and
mask proposal networks for region-text matching. However, this paradigm may
lead to incorrect target localization due to the inherent ambiguity and
diversity of free-form referring expressions. To alleviate this issue, we
present LGD (Leveraging Generative Descriptions), a framework that utilizes the
advanced language generation capabilities of Multi-Modal Large Language Models
to enhance region-text matching performance in Vision-Language Models.
Specifically, we first design two kinds of prompts, the attribute prompt and
the surrounding prompt, to guide the Multi-Modal Large Language Models in
generating descriptions related to the crucial attributes of the referent
object and the details of surrounding objects, referred to as attribute
description and surrounding description, respectively. Second, we introduce
three visual-text matching scores that measure the similarity between
instance-level visual features and textual features; these scores determine the
mask most associated with the referring expression. The proposed method
achieves new state-of-the-art performance on three public datasets, RefCOCO,
RefCOCO+, and RefCOCOg, with maximum improvements of 9.97% in oIoU and 11.29%
in mIoU over previous methods.
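
The mask-selection step described above can be illustrated with a minimal sketch. The abstract does not state how the three visual-text matching scores are defined or combined, so everything below is an assumption made for illustration: one cosine-similarity score per text (the referring expression, the attribute description, and the surrounding description), fused by a weighted sum. The names `cosine`, `select_mask`, and `weights` are hypothetical and are not taken from the paper.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def select_mask(
    mask_feats: Sequence[Sequence[float]],
    text_feats: Sequence[Sequence[float]],
    weights: Sequence[float] = (1.0, 1.0, 1.0),
) -> int:
    """Return the index of the mask proposal with the highest combined score.

    mask_feats: one instance-level visual feature vector per mask proposal.
    text_feats: text feature vectors, assumed here to be (referring expression,
                attribute description, surrounding description).
    weights:    hypothetical fusion weights; the weighted sum is an assumption,
                not the paper's stated scoring rule.
    """
    if not mask_feats:
        raise ValueError("at least one mask proposal is required")
    if len(text_feats) != len(weights):
        raise ValueError("need exactly one weight per text feature")

    def combined(feat: Sequence[float]) -> float:
        # Sum the weighted similarity of this mask to each text feature.
        return sum(w * cosine(feat, t) for w, t in zip(weights, text_feats))

    scores = [combined(f) for f in mask_feats]
    # Pick the proposal whose combined score is largest.
    return max(range(len(scores)), key=scores.__getitem__)


if __name__ == "__main__":
    # Toy 2-D features standing in for Vision-Language Model embeddings.
    masks = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    texts = [[0.0, 1.0], [0.0, 1.0], [0.1, 1.0]]
    print(select_mask(masks, texts))
```

In practice the feature vectors would come from the image and text encoders of a Vision-Language Model, and the descriptions from a Multi-Modal Large Language Model; the toy vectors here only demonstrate the selection logic.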