Composed Multi-modal Retrieval: A Survey of Approaches and Applications
Journal: arXiv
Published Date: Mar 3, 2025
Abstract
With the rapid growth of multi-modal data from social media, short-video
platforms, and e-commerce, content-based retrieval has become essential for
efficiently searching and utilizing heterogeneous information. Over time,
retrieval techniques have evolved from Unimodal Retrieval (UR) to Cross-modal
Retrieval (CR) and, more recently, to Composed Multi-modal Retrieval (CMR). CMR
enables users to retrieve images or videos by integrating a reference visual
input with textual modifications, enhancing search flexibility and precision.
This paper provides a comprehensive review of CMR, covering its fundamental
challenges, technical advancements, and categorization into supervised,
zero-shot, and semi-supervised learning paradigms. We discuss key research
directions, including data augmentation, model architecture, and loss
optimization in supervised CMR, as well as transformation frameworks and
external knowledge integration in zero-shot CMR. Additionally, we highlight the
application potential of CMR in composed image retrieval, video retrieval, and
person retrieval, which have significant implications for e-commerce, online
search, and public security. Given its ability to refine and personalize search
experiences, CMR is poised to become a pivotal technology in next-generation
retrieval systems. A curated list of related works and resources is available
at: https://github.com/kkzhang95/Awesome-Composed-Multi-modal-Retrieval
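
To make the composed-query idea concrete, the following is a minimal sketch of how a reference image and a textual modification might be combined into a single query and matched against a gallery. It is an illustration only, not a method taken from the surveyed works: the toy 3-dimensional embeddings, the function names `compose_query` and `rank_candidates`, and the weighted-sum fusion with weight `alpha` are all assumptions chosen for simplicity. In practice the embeddings would come from trained image and text encoders, and fusion is typically learned.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def compose_query(image_emb, text_emb, alpha=0.5):
    """Fuse a reference-image embedding with a modification-text embedding.

    Hypothetical baseline: a weighted sum of the two unit vectors.
    """
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    fused = [(1 - alpha) * i + alpha * t for i, t in zip(img, txt)]
    return l2_normalize(fused)

def rank_candidates(query, candidates):
    """Return candidate ids sorted by cosine similarity to the query, best first."""
    scored = []
    for cid, emb in candidates.items():
        c = l2_normalize(emb)
        scored.append((sum(q * x for q, x in zip(query, c)), cid))
    scored.sort(reverse=True)
    return [cid for _, cid in scored]

# Toy embeddings (assumed): axes loosely stand for "dress", "red", "blue".
reference_image = [1.0, 0.0, 1.0]      # a blue dress
modification_text = [0.0, 1.0, -1.0]   # "make it red": add red, remove blue
gallery = {
    "blue_dress": [1.0, 0.0, 1.0],
    "red_dress": [1.0, 1.0, 0.0],
    "red_shoe": [0.0, 1.0, 0.0],
}

query = compose_query(reference_image, modification_text)
ranking = rank_candidates(query, gallery)
print(ranking)
```

In this toy setting the composed query keeps the "dress" component of the reference image while the text shifts it from "blue" toward "red", so the red dress ranks above both the unmodified reference and an item that matches the text alone.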