EvdCLIP: Improving Vision-Language Retrieval with Entity Visual Descriptions from Large Language Models
Journal: arXiv
Published Date: May 24, 2025
Abstract
Vision-language retrieval (VLR), which uses text (or images) as queries to
retrieve corresponding images (or text), has attracted significant attention
in both academia and industry. However, existing methods often neglect the
rich visual semantic knowledge of entities, leading to incorrect retrieval
results. To address this problem, we propose the Entity
Visual Description enhanced CLIP (EvdCLIP), designed to leverage the visual
knowledge of entities to enrich queries. Specifically, since humans recognize
entities through visual cues, we employ a large language model (LLM) to
generate Entity Visual Descriptions (EVDs) as alignment cues to complement
textual data. These EVDs are then integrated into raw queries to create
visually rich, EVD-enhanced queries. Furthermore, recognizing that EVD-enhanced
queries may introduce noise or low-quality expansions, we develop a novel,
trainable EVD-aware Rewriter (EaRW) for vision-language retrieval tasks. EaRW
utilizes EVD knowledge and the generative capabilities of the language model to
effectively rewrite queries. With our specialized training strategy, EaRW can
generate high-quality and low-noise EVD-enhanced queries. Extensive
quantitative and qualitative experiments on image-text retrieval benchmarks
validate the superiority of EvdCLIP on vision-language retrieval tasks.
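
The query-enrichment step described above can be illustrated with a minimal sketch. This is not the authors' implementation: the EVD_TABLE contents, the enhance_query helper, and the concatenation format are all hypothetical, and the descriptions are hard-coded here rather than generated by an LLM. The sketch only shows the general idea of attaching visual descriptions of entities to a raw text query before it is passed to a retrieval model.

```python
from typing import Dict, List

# Hypothetical entity -> visual description table. In the paper these
# descriptions are generated by an LLM; here they are hard-coded
# purely for illustration.
EVD_TABLE: Dict[str, List[str]] = {
    "zebra": ["black and white stripes", "horse-like body"],
    "flamingo": ["pink feathers", "long thin legs"],
}


def enhance_query(query: str, evd_table: Dict[str, List[str]]) -> str:
    """Append visual descriptions for every known entity found in the query.

    Entity matching is a naive lowercase word lookup (an assumption made
    for this sketch; the abstract does not say how entities are detected).
    """
    words = {w.strip(".,!?").lower() for w in query.split()}
    parts = [query]
    for entity, descriptions in evd_table.items():
        if entity in words:
            parts.append(f"{entity}: " + ", ".join(descriptions))
    return ". ".join(parts)


if __name__ == "__main__":
    print(enhance_query("A zebra drinking water", EVD_TABLE))
```

Plain concatenation like this is exactly the kind of expansion that can add noise when a description is irrelevant to the image, which is the motivation the abstract gives for training a rewriter (EaRW) rather than appending descriptions unconditionally.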