Graph-Based Cross-Domain Knowledge Distillation for Cross-Dataset Text-to-Image Person Retrieval
Journal:
arXiv
Published Date:
Jan 25, 2025
Abstract
Video surveillance systems are crucial components for ensuring public safety
and management in smart cities. As a fundamental task in video surveillance,
text-to-image person retrieval aims to retrieve the target person from an image
gallery that best matches the given text description. Most existing
text-to-image person retrieval methods are trained in a supervised manner that
requires sufficient labeled data in the target domain. However, it is common in
practice that only unlabeled data is available in the target domain due to the
difficulty and cost of data annotation, which limits the generalization of
existing methods in practical application scenarios. To address this issue, we
propose a novel unsupervised domain adaptation method, termed Graph-Based
Cross-Domain Knowledge Distillation (GCKD), to learn the cross-modal feature
representation for text-to-image person retrieval in a cross-dataset scenario.
The proposed GCKD method consists of two main components. Firstly, a
graph-based multi-modal propagation module is designed to bridge the
cross-domain correlation among the visual and textual samples. Secondly, a
contrastive momentum knowledge distillation module is proposed to learn the
cross-modal feature representation using the online knowledge distillation
strategy. By jointly optimizing the two modules, the proposed method is able to
achieve strong performance for cross-dataset text-to-image person retrieval.
Extensive experiments on three publicly available text-to-image person
retrieval datasets demonstrate the effectiveness of the proposed GCKD method,
which consistently outperforms the state-of-the-art baselines.
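
The abstract does not give the update rules of the contrastive momentum knowledge distillation module. As a rough illustration only, online distillation with a momentum teacher is commonly implemented by keeping the teacher as an exponential moving average (EMA) of the student. The sketch below shows that generic update on a flat list of weights; the function name `ema_update`, the `momentum` value, and the list-of-floats representation are assumptions made for illustration and are not taken from the paper.

```python
def ema_update(teacher, student, momentum=0.99):
    """Generic EMA (momentum) teacher update, not the authors' code.

    Each teacher weight moves a small step toward the matching student
    weight: new_t = momentum * t + (1 - momentum) * s.
    """
    if not 0.0 <= momentum <= 1.0:
        raise ValueError("momentum must lie in [0, 1]")
    if len(teacher) != len(student):
        raise ValueError("teacher and student must have the same number of weights")
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]


# Hypothetical toy weights: the teacher drifts slowly toward the student.
teacher_weights = [0.0, 1.0]
student_weights = [1.0, 1.0]
teacher_weights = ema_update(teacher_weights, student_weights, momentum=0.9)
```

A momentum close to 1 keeps the teacher stable across training steps, which is the usual motivation for distilling from a slowly moving teacher rather than from the student itself.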