TeDA: Boosting Vision-Lanuage Models for Zero-Shot 3D Object Retrieval via Testing-time Distribution Alignment
Journal:
arXiv
Published Date:
May 5, 2025
Abstract
Learning discriminative 3D representations that generalize well to unknown
testing categories is an emerging requirement for many real-world 3D
applications. Existing well-established methods often struggle to attain this
goal due to insufficient 3D training data from broader concepts. Meanwhile,
pre-trained large vision-language models (e.g., CLIP) have shown remarkable
zero-shot generalization capabilities. Yet, they are limited in extracting
suitable 3D representations due to substantial gaps between their 2D training
and 3D testing distributions. To address these challenges, we propose
Testing-time Distribution Alignment (TeDA), a novel framework that adapts a
pretrained 2D vision-language model CLIP for unknown 3D object retrieval at
test time. To our knowledge, it is the first work that studies the test-time
adaptation of a vision-language model for 3D feature learning. TeDA projects 3D
objects into multi-view images, extracts features using CLIP, and refines 3D
query embeddings with an iterative optimization strategy by confident
query-target sample pairs in a self-boosting manner. Additionally, TeDA
integrates textual descriptions generated by a multimodal language model
(InternVL) to enhance 3D object understanding, leveraging CLIP's aligned
feature space to fuse visual and textual cues. Extensive experiments on four
open-set 3D object retrieval benchmarks demonstrate that TeDA greatly
outperforms state-of-the-art methods, even those requiring extensive training.
We also experimented with depth maps on Objaverse-LVIS, further validating its
effectiveness. Code is available at https://github.com/wangzhichuan123/TeDA.