Refining CLIP's Spatial Awareness: A Visual-Centric Perspective
Journal: arXiv
Published Date: Apr 3, 2025
Abstract
Contrastive Language-Image Pre-training (CLIP) excels in global alignment
with language but exhibits limited sensitivity to spatial information, leading
to strong performance in zero-shot classification tasks but underperformance in
tasks requiring precise spatial understanding. Recent approaches have
introduced Region-Language Alignment (RLA) to enhance CLIP's performance in
dense multimodal tasks by aligning regional visual representations with
corresponding text inputs. However, we find that CLIP ViTs fine-tuned with RLA
suffer from a notable loss in spatial awareness, which is crucial for dense
prediction tasks. To address this, we propose the Spatial Correlation
Distillation (SCD) framework, which preserves CLIP's inherent spatial structure
and mitigates this degradation. To further enhance spatial correlations,
we introduce a lightweight Refiner that extracts refined correlations directly
from CLIP before feeding them into SCD, based on an intriguing finding that
CLIP naturally captures high-quality dense features. Together, these components
form a robust distillation framework that enables CLIP ViTs to integrate both
visual-language and visual-centric improvements, achieving state-of-the-art
results across various open-vocabulary dense prediction benchmarks.
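
To make the idea of distilling spatial correlations more concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the abstract does not specify the loss, so the sketch assumes that a "spatial correlation" is the pairwise cosine similarity between patch features and that distillation is a row-wise KL divergence between temperature-scaled teacher and student correlation maps. The function names and the `temperature` parameter are hypothetical, chosen for illustration only.

```python
import math


def l2_normalize(vector):
    """Scale a feature vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector)) or 1.0
    return [x / norm for x in vector]


def correlation_matrix(features):
    """Pairwise cosine similarity between N patch features, giving an N x N map."""
    unit = [l2_normalize(f) for f in features]
    return [[sum(a * b for a, b in zip(u, w)) for w in unit] for u in unit]


def softmax(row, temperature):
    """Numerically stable softmax over one row of a correlation map."""
    scaled = [x / temperature for x in row]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def correlation_distillation_loss(student_feats, teacher_feats, temperature=0.1):
    """Mean row-wise KL(teacher || student) between softmaxed correlation maps.

    Assumed form of the objective, for illustration only.
    """
    student_corr = correlation_matrix(student_feats)
    teacher_corr = correlation_matrix(teacher_feats)
    loss = 0.0
    for s_row, t_row in zip(student_corr, teacher_corr):
        p = softmax(t_row, temperature)  # teacher distribution (target)
        q = softmax(s_row, temperature)  # student distribution
        loss += sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return loss / len(student_corr)


# Toy patch features: three patches with two channels each.
teacher = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
collapsed_student = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]

matched_loss = correlation_distillation_loss(teacher, teacher)
collapsed_loss = correlation_distillation_loss(collapsed_student, teacher)
```

Under these assumptions, a student whose patch features reproduce the teacher's correlation structure incurs no loss, while a student whose patches have collapsed to a single direction (one way spatial awareness could be lost) is penalized. The objective depends only on relationships between patches, not on the features themselves, which is what would allow it to sit alongside a separate region-language alignment term.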