Probabilistic Prompt Distribution Learning for Animal Pose Estimation
Journal:
arXiv
Published Date:
Mar 20, 2025
Abstract
Multi-species animal pose estimation has emerged as a challenging yet
critical task, hindered by substantial visual diversity and uncertainty. This
paper tackles the problem through efficient prompt learning for Vision-Language
Pretrained (VLP) models, e.g., CLIP, aiming to resolve the
cross-species generalization problem. At the core of the solution lie
prompt design, probabilistic prompt modeling, and cross-modal adaptation,
thereby enabling prompts to compensate for cross-modal information and
effectively overcome large data variances under an unbalanced data distribution.
To this end, we propose a novel probabilistic prompting approach to fully
explore textual descriptions, which can alleviate the diversity issues caused
by the long-tail property and increase the adaptability of prompts to instances
of unseen categories. Specifically, we first introduce a set of learnable prompts
and propose a diversity loss to maintain distinctiveness among prompts, thus
representing diverse image attributes. Diverse textual probabilistic
representations are sampled and used as guidance for pose estimation.
Subsequently, we explore three different cross-modal fusion strategies at
the spatial level to alleviate the adverse impact of visual uncertainty. Extensive
experiments on multi-species animal pose benchmarks show that our method
achieves state-of-the-art performance under both supervised and zero-shot
settings. The code is available at https://github.com/Raojiyong/PPAP.
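
To make the two central ideas of the abstract concrete, the sketch below illustrates a diversity penalty over a set of prompt embeddings and the sampling of a probabilistic prompt representation. It is not the authors' implementation: the abstract does not give the form of the diversity loss or of the prompt distribution, so the mean absolute pairwise cosine similarity and the diagonal Gaussian used here are assumptions, and the names `diversity_loss` and `sample_prompt` are hypothetical.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def diversity_loss(prompts):
    """Assumed form: mean absolute cosine similarity over all prompt pairs.

    The value is 0 when the prompts are mutually orthogonal and 1 when
    they are collinear, so minimizing it pushes prompts apart.
    """
    pairs = [(i, j) for i in range(len(prompts)) for j in range(i + 1, len(prompts))]
    return sum(abs(cosine(prompts[i], prompts[j])) for i, j in pairs) / len(pairs)


def sample_prompt(mean, log_var, rng):
    """Assumed form: one draw from a diagonal Gaussian.

    Uses the reparameterization z = mean + sigma * eps with eps ~ N(0, 1),
    where sigma = exp(0.5 * log_var).
    """
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mean, log_var)]


rng = random.Random(0)

# Toy prompt embeddings (illustrative values only).
orthogonal = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
collinear = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0]]

loss_orthogonal = diversity_loss(orthogonal)
loss_collinear = diversity_loss(collinear)

# With a very small variance the sample stays close to the mean.
sample = sample_prompt([0.5, -0.5], [-100.0, -100.0], rng)
```

In a real training setup these operations would be written with a differentiable tensor library so that the prompt means and variances can be learned; the pure-Python version only shows the arithmetic.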