Extract Free Dense Misalignment from CLIP
Journal: arXiv
Published Date: Dec 24, 2024
Abstract
Recent vision-language foundation models still frequently produce outputs
misaligned with their inputs, as evidenced by object hallucination in captioning
and prompt misalignment in text-to-image generation. Recent studies
have explored methods for identifying misaligned elements, aiming not only to
enhance interpretability but also to improve model performance. However,
current approaches primarily rely on large foundation models in a zero-shot
manner or fine-tuned models with human annotations, which limits scalability
due to significant computational costs. This work proposes a novel approach,
dubbed CLIP4DM, for detecting dense misalignments from pre-trained CLIP,
specifically focusing on pinpointing misaligned words between image and text.
We carefully revamp the gradient-based attribution computation method so that
the negative gradients of individual text tokens indicate misalignment. We also
propose F-CLIPScore, which aggregates misaligned attributions with a global
alignment score. We evaluate our method on various dense misalignment detection
benchmarks covering a range of image and text domains and misalignment types. Our
method demonstrates state-of-the-art performance among zero-shot models and
competitive performance with fine-tuned models while maintaining superior
efficiency. Our qualitative examples show that our method is uniquely able to
detect entity-level objects, intangible objects, and attributes that existing
methods cannot easily detect. We conduct ablation studies and analyses
to highlight the strengths and limitations of our approach. Our code is
publicly available at https://github.com/naver-ai/CLIP4DM.
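
The idea of reading misalignment from the sign of per-token attributions can be illustrated with a toy sketch. This is not the CLIP4DM implementation and not the paper's F-CLIPScore formula: it assumes hand-made 3-dimensional embeddings in place of CLIP features, a mean-pooled text embedding, plain gradient-times-input attribution, and a hypothetical aggregate (`toy_aggregate`) that simply adds the negative attributions to the global cosine score. All function names and values are made up for illustration.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def token_attributions(image_emb, token_embs):
    """Return (global cosine score, gradient-times-input attribution per token).

    Assumption: the text embedding is the mean of the token embeddings,
    which stands in for a real text encoder.
    """
    n = len(token_embs)
    dim = len(image_emb)
    text = [sum(tok[d] for tok in token_embs) / n for d in range(dim)]
    i_norm = math.sqrt(dot(image_emb, image_emb))
    t_norm = math.sqrt(dot(text, text))
    score = dot(image_emb, text) / (i_norm * t_norm)
    # Analytic gradient of the cosine similarity with respect to the text embedding.
    grad_text = [
        image_emb[d] / (i_norm * t_norm) - score * text[d] / (t_norm ** 2)
        for d in range(dim)
    ]
    # Mean pooling passes 1/n of that gradient to every token; multiplying
    # by the token embedding gives a gradient-times-input attribution.
    attrs = [dot(grad_text, tok) / n for tok in token_embs]
    return score, attrs


def flag_misaligned(words, attrs, threshold=0.0):
    """Words whose attribution falls below the (hypothetical) threshold."""
    return [w for w, a in zip(words, attrs) if a < threshold]


def toy_aggregate(score, attrs, threshold=0.0):
    """Hypothetical aggregate: global score plus the sum of negative attributions."""
    return score + sum(a for a in attrs if a < threshold)


# Toy data: the image embedding matches "dog" and "running" but not "frisbee".
image = [1.0, 0.0, 0.0]
words = ["dog", "running", "frisbee"]
tokens = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

score, attrs = token_attributions(image, tokens)
print(flag_misaligned(words, attrs))
print(toy_aggregate(score, attrs) < score)
```

In this toy setting the token whose embedding is orthogonal to the image receives a negative attribution while the matching tokens receive positive ones, so the aggregate falls below the plain cosine score. With a real CLIP model the gradients would come from automatic differentiation through the text encoder rather than from a closed-form expression.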