Zero-Shot Interactive Text-to-Image Retrieval via Diffusion-Augmented Representations
Journal: arXiv
Published Date: Jan 26, 2025
Abstract
Interactive Text-to-Image Retrieval (I-TIR) has emerged as a valuable
interactive tool for applications in domains such as e-commerce and
education. Yet, current methodologies predominantly depend on finetuned
Multimodal Large Language Models (MLLMs), which face two critical limitations:
(1) Finetuning imposes prohibitive computational overhead and long-term
maintenance costs. (2) Finetuning narrows the pretrained knowledge distribution
of MLLMs, reducing their adaptability to novel scenarios. These issues are
exacerbated by the inherently dynamic nature of real-world I-TIR systems, where
queries and image databases evolve in complexity and diversity, often deviating
from static training distributions. To overcome these constraints, we propose
Diffusion Augmented Retrieval (DAR), a paradigm-shifting framework that
bypasses MLLM finetuning entirely. DAR synergizes Large Language Model
(LLM)-guided query refinement with Diffusion Model (DM)-based visual synthesis
to create contextually enriched intermediate representations. This
dual-modality approach deciphers nuanced user intent more holistically,
enabling precise alignment between textual queries and visually relevant
images. Rigorous evaluations across four benchmarks reveal two strengths of
DAR: (1) without task-specific training, it matches state-of-the-art finetuned
I-TIR models on straightforward queries; (2) it generalizes at scale,
surpassing finetuned baselines by 7.61% in Hits@10 (top-10 accuracy) under
multi-turn conversational complexity, which demonstrates robustness to
intricate, distributionally shifted interactions. By eliminating finetuning
dependencies and leveraging generative-augmented representations, DAR
establishes a new trajectory for efficient, adaptive, and scalable cross-modal
retrieval systems.
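
The retrieval idea described above, scoring gallery images against both a refined text query and diffusion-generated images and then checking top-k membership, can be illustrated with a minimal sketch. The abstract does not specify how the two modalities are combined, so the weighted-sum fusion, the `alpha` parameter, and the function names `fused_scores` and `hits_at_k` are assumptions made purely for illustration; the embeddings are assumed to come from some pretrained encoder that is not shown here.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def fused_scores(text_emb, generated_img_embs, gallery_embs, alpha=0.5):
    """Score each gallery image against a text query and generated images.

    Hypothetical fusion rule (an assumption, not taken from the paper):
    alpha * text similarity + (1 - alpha) * mean similarity to generated images.
    """
    scores = []
    for g in gallery_embs:
        text_score = cosine(text_emb, g)
        if generated_img_embs:
            img_score = sum(cosine(e, g) for e in generated_img_embs) / len(generated_img_embs)
        else:
            # Fall back to text-only scoring when no images were generated.
            img_score = text_score
        scores.append(alpha * text_score + (1 - alpha) * img_score)
    return scores


def hits_at_k(scores, target_index, k=10):
    """Return True if the target image is among the k highest-scoring gallery images."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return target_index in ranked[:k]


# Toy 2-D embeddings standing in for real encoder outputs.
gallery = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores = fused_scores([1.0, 0.0], [[0.9, 0.1]], gallery)
found = hits_at_k(scores, target_index=0, k=1)
```

Averaging `hits_at_k` over a set of queries gives a per-query top-k accuracy in the sense the abstract uses for Hits@10; the exact evaluation protocol across dialogue turns is not stated in the abstract.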