Adding simple structure at inference improves Vision-Language Compositionality

Journal: arXiv

Published Date: Jun 11, 2025

Abstract

Dual-encoder Vision-Language Models (VLMs) such as CLIP are widely used for image-text retrieval. However, these models struggle with compositionality, showing bag-of-words-like behavior that limits their retrieval performance. Many training approaches have been proposed to improve the vision-language (VL) compositionality of these models; in comparison, inference-time techniques have received little attention. In this paper, we propose adding simple structure at inference. Given an image and a caption: i) we divide the image into smaller crops; ii) we extract text segments that capture objects, attributes and relations; iii) using a VLM, we find the image crops that best align with the text segments, obtaining matches; and iv) we compute the final image-text similarity by aggregating the individual similarities of the matches. Using various popular dual-encoder VLMs, we evaluate our approach on controlled and natural datasets for VL compositionality. We find that our approach consistently improves the performance of the evaluated VLMs without any training, which shows the potential of inference-time techniques. The results are especially good for attribute-object binding, as shown on the controlled dataset. Through an extensive analysis, i) we show that processing image crops is essential for the observed gains in performance, and ii) we identify specific areas in which inference-time approaches can be further improved.
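The four steps in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how crops are chosen, how text segments are extracted, or how similarities are aggregated. The sketch assumes a regular grid of crops, best-crop (argmax) matching per text segment, and a plain mean as the aggregation. The function names (`grid_crops`, `match_segments`, `structured_similarity`) are hypothetical, and the embeddings are toy vectors standing in for the outputs of a dual-encoder VLM.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]
Box = Tuple[int, int, int, int]  # (left, upper, right, lower)


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def grid_crops(width: int, height: int, rows: int, cols: int) -> List[Box]:
    """Step i (assumed strategy): split the image into a rows x cols grid of boxes."""
    boxes = []
    for r in range(rows):
        for c in range(cols):
            boxes.append((
                c * width // cols,
                r * height // rows,
                (c + 1) * width // cols,
                (r + 1) * height // rows,
            ))
    return boxes


def match_segments(
    crop_embs: Sequence[Vector], segment_embs: Sequence[Vector]
) -> List[Tuple[int, int, float]]:
    """Step iii (assumed rule): pair each text segment with its most similar crop.

    Returns (segment_index, crop_index, similarity) triples.
    """
    matches = []
    for s_idx, seg in enumerate(segment_embs):
        sims = [cosine(crop, seg) for crop in crop_embs]
        best = max(range(len(sims)), key=sims.__getitem__)
        matches.append((s_idx, best, sims[best]))
    return matches


def structured_similarity(
    crop_embs: Sequence[Vector], segment_embs: Sequence[Vector]
) -> float:
    """Step iv (assumed aggregation): mean similarity over the matches."""
    if not crop_embs or not segment_embs:
        raise ValueError("need at least one crop and one text segment")
    matches = match_segments(crop_embs, segment_embs)
    return sum(sim for _, _, sim in matches) / len(matches)


# Toy usage: two crops and two text segments (step ii is assumed already done,
# and the vectors stand in for VLM image/text embeddings).
crops = [[1.0, 0.0], [0.0, 1.0]]
segments = [[0.0, 2.0], [3.0, 0.0]]
score = structured_similarity(crops, segments)
```

In practice the crop boxes would be cut from the image and encoded with the VLM's image encoder, and the text segments with its text encoder; other matching or aggregation rules could be substituted for the assumed argmax and mean.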

Authors

Imanol Miranda
Ander Salaberria
Eneko Agirre
Gorka Azkune

External Resources

View on arXiv (http://arxiv.org/abs/2506.09691v1)
