Composed Multi-modal Retrieval: A Survey of Approaches and Applications

Journal: arXiv

Published Date: Mar 3, 2025

Abstract

With the rapid growth of multi-modal data from social media, short video platforms, and e-commerce, content-based retrieval has become essential for efficiently searching and utilizing heterogeneous information. Over time, retrieval techniques have evolved from Unimodal Retrieval (UR) to Cross-modal Retrieval (CR) and, more recently, to Composed Multi-modal Retrieval (CMR). CMR enables users to retrieve images or videos by integrating a reference visual input with textual modifications, enhancing search flexibility and precision. This paper provides a comprehensive review of CMR, covering its fundamental challenges, technical advancements, and categorization into supervised, zero-shot, and semi-supervised learning paradigms. We discuss key research directions, including data augmentation, model architecture, and loss optimization in supervised CMR, as well as transformation frameworks and external knowledge integration in zero-shot CMR. Additionally, we highlight the application potential of CMR in composed image retrieval, video retrieval, and person retrieval, which have significant implications for e-commerce, online search, and public security. Given its ability to refine and personalize search experiences, CMR is poised to become a pivotal technology in next-generation retrieval systems. A curated list of related works and resources is available at: https://github.com/kkzhang95/Awesome-Composed-Multi-modal-Retrieval

Authors

Kun Zhang
Jingyu Li
Zhe Li
Jingjing Zhang

External Resources

arXiv: http://arxiv.org/abs/2503.01334v1
