GMM-Based Comprehensive Feature Extraction and Relative Distance Preservation For Few-Shot Cross-Modal Retrieval
Journal: arXiv
Published Date: May 19, 2025
Abstract
Few-shot cross-modal retrieval focuses on learning cross-modal
representations with limited training samples, enabling the model to handle
unseen classes during inference. Unlike traditional cross-modal retrieval
tasks, which assume that training and testing data share the same class
distribution, few-shot retrieval must cope with classes that are only sparsely
represented in each modality. Existing methods often fail to adequately model the
multi-peak distribution of few-shot cross-modal data, resulting in two main
biases in the latent semantic space: intra-modal bias, where sparse samples
fail to capture intra-class diversity, and inter-modal bias, where
misalignments between image and text distributions exacerbate the semantic gap.
These biases hinder retrieval accuracy. To address these issues, we propose a
novel method, GCRDP, for few-shot cross-modal retrieval. This approach
effectively captures the complex multi-peak distribution of data using a
Gaussian Mixture Model (GMM) and incorporates a multi-positive sample
contrastive learning mechanism for comprehensive feature modeling.
Additionally, we introduce a new strategy for cross-modal semantic alignment,
which constrains the relative distances between image and text feature
distributions, thereby improving the accuracy of cross-modal representations.
We validate our approach through extensive experiments on four benchmark
datasets, demonstrating superior performance over six state-of-the-art methods.
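
The two ingredients named in the abstract can be illustrated with a minimal sketch. The abstract does not give GCRDP's loss functions, so this is one plausible reading and not the authors' implementation. It assumes that each class is summarized by several prototype vectors (for example, the component means of a fitted GMM) that act as multiple positives for an anchor feature, and that relative distance preservation can be approximated by matching pairwise distances between class centers in the image and text spaces. The function names, the temperature value, and the toy vectors are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def multi_positive_contrastive_loss(anchor, candidates, positive_idx, temperature=0.1):
    """Average negative log-softmax probability over all positive candidates.

    Assumption: positives are, e.g., GMM component means of the anchor's class;
    every other candidate is treated as a negative.
    """
    logits = [cosine(anchor, c) / temperature for c in candidates]
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(logits)
    log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
    return -sum(logits[i] - log_denom for i in positive_idx) / len(positive_idx)


def relative_distance_loss(image_centers, text_centers):
    """Mean squared gap between pairwise class-center distances in the two modalities.

    Assumption: image_centers[k] and text_centers[k] describe the same class k,
    and at least two classes are given.
    """
    n = len(image_centers)
    total, count = 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            d_img = math.dist(image_centers[i], image_centers[j])
            d_txt = math.dist(text_centers[i], text_centers[j])
            total += (d_img - d_txt) ** 2
            count += 1
    return total / count


# Toy usage: two prototypes of the anchor's class plus one negative prototype.
anchor = [1.0, 0.0]
prototypes = [[1.0, 0.0], [0.9, 0.1], [-1.0, 0.0]]
contrastive = multi_positive_contrastive_loss(anchor, prototypes, positive_idx=[0, 1])

# Class centers with the same relative geometry in both modalities.
alignment = relative_distance_loss([[0.0, 0.0], [3.0, 4.0]], [[1.0, 1.0], [4.0, 5.0]])
```

In this sketch the alignment term depends only on distances between class centers, so a text space that is a shifted copy of the image space incurs no penalty. That is one way to read "constraining relative distances" as opposed to forcing the two distributions to coincide point by point.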