Rethinking the Global Knowledge of CLIP in Training-Free Open-Vocabulary Semantic Segmentation
Journal: arXiv
Published Date: Feb 5, 2025
Abstract
Recent works modify CLIP to perform open-vocabulary semantic segmentation in
a training-free manner (TF-OVSS). In vanilla CLIP, patch-wise image
representations mainly encode homogeneous image-level properties, which hinders
the application of CLIP to the dense prediction task. Previous TF-OVSS works
sacrifice globality to enhance the locality of CLIP features, by making each
patch mainly attend to itself or its neighboring patches within a narrow local
window. With these modifications, the ability of CLIP to aggregate global
context information is largely weakened. In contrast, in this paper we rethink
the global knowledge encoded by CLIP and propose GCLIP, which addresses how to
extract and utilize the beneficial global knowledge of CLIP for TF-OVSS. As the
representation of each patch is finally determined by the attention weights and
the Value embeddings, we propose to reshape the last-block attention and Value
embeddings to aggregate useful global context into the final features. First, we
aim to equip the last-block attention with image-level properties without
introducing homogeneous attention patterns across patches. To this end,
we fuse the attention from the global-token emerging blocks with the
Query-Query attention. Second, we aim to make the Value embeddings of the
last-block attention module more semantically correlated. To realize this, we
design a novel channel suppression strategy. Extensive experiments on five
standard benchmarks demonstrate that our method consistently outperforms
previous state-of-the-art methods.
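
The attention reshaping described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not state how the two attention maps are fused, so the weighted average controlled by the hypothetical parameter `alpha` is an assumption, and the attention from the global-token emerging blocks is simply passed in as a precomputed matrix `global_attn`. The channel suppression strategy is not sketched, because the abstract gives no details about it.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def query_query_attention(q):
    """Query-Query attention: each patch attends to patches with similar queries.

    q: array of shape (num_patches, dim). Returns a (num_patches, num_patches)
    matrix whose rows sum to 1.
    """
    dim = q.shape[-1]
    return softmax(q @ q.T / np.sqrt(dim), axis=-1)


def fuse_attention(qq_attn, global_attn, alpha=0.5):
    """Assumed fusion rule (not from the paper): a convex combination.

    If both inputs have rows summing to 1, so does the result.
    """
    return alpha * qq_attn + (1.0 - alpha) * global_attn


def patch_features(q, v, global_attn, alpha=0.5):
    """Final patch features: fused attention weights applied to Value embeddings."""
    attn = fuse_attention(query_query_attention(q), global_attn, alpha)
    return attn @ v
```

With `q` and `v` of shape `(num_patches, dim)` and `global_attn` of shape `(num_patches, num_patches)`, `patch_features(q, v, global_attn)` returns one feature vector per patch. Setting `alpha=1.0` in this sketch reduces it to plain Query-Query attention.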