Not All Attention Heads Are What You Need: Refining CLIP's Image Representation with Attention Ablation
Journal:
arXiv
Published Date:
Jul 1, 2025
Abstract
This paper studies the role of attention heads in CLIP's image encoder. While
CLIP has exhibited robust performance across diverse applications, we
hypothesize that certain attention heads negatively affect final
representations and that ablating them can improve performance in downstream
tasks. To capitalize on this insight, we propose a simple yet effective method,
called Attention Ablation Technique (AAT), to suppress the contribution of
specific heads by manipulating attention weights. By integrating two
alternative strategies tailored for different application scenarios, AAT
systematically identifies and ablates detrimental attention heads to enhance
representation quality. Experiments demonstrate that AAT consistently improves
downstream task performance across various domains, boosting recall rate by up
to 11.1% on CLIP-family models for cross-modal retrieval. The results highlight
the potential of AAT to effectively refine large-scale vision-language models
with virtually no increase in inference cost.