Attention to Burstiness: Low-Rank Bilinear Prompt Tuning
Journal:
arXiv
Published Date:
Jun 28, 2025
Abstract
Visual Prompt Tuning (VPT) is a parameter-efficient fune-tuning technique
that adapts a pre-trained vision Transformer (ViT) by learning a small set of
parameters in the input space, known as prompts. In VPT, we uncover
``burstiness'' in the values arising from the interaction of image patch
embeddings, and the key and query projectors within Transformer's
self-attention module. Furthermore, the values of patch embeddings and the key
and query projectors exhibit Laplacian and hyper-Laplacian distribution,
respectively. Intuitively, these non-Gaussian distributions pose challenges for
learning prompts. To address this, we propose whitening these data,
de-correlating them and equalizing their variance towards more Gaussian before
learning prompts. We derive the whitening matrix over random image patch
embeddings and ViT's key and query projectors, and multiply it with the prompt
to be learned in a bilinear manner. Surprisingly, this method significantly
accelerates prompt tuning and boosts accuracy, e.g., $>$25 accuracy points on
the CUB dataset; interestingly, it learns ``bursty prompts''. Extending the
bilinear model which is known to introduce burstiness, we present a compact,
low-rank version by learning two smaller matrices whose multiplication yields
the final prompts. We call the proposed methods Bilinear Prompt Tuning (BPT).
Extensive experiments across multiple benchmark datasets demonstrate that BPT
methods not only outperform various VPT methods but also reduce parameter count
and computation overhead.