AdaFV: Rethinking of Visual-Language alignment for VLM acceleration
Journal:
arXiv
Published Date:
Jan 16, 2025
Abstract
The success of vision-language models (VLMs) often relies on the dynamic
high-resolution schema that adaptively augments the input images into multiple
crops, so that the details of
the images can be retained. However, such approaches result in a large number
of redundant visual tokens, thus significantly reducing the efficiency of the
VLMs. To improve VLM efficiency without introducing extra training costs,
many works have proposed reducing the visual tokens by filtering out
uninformative ones or aggregating their information. Some approaches
reduce the visual tokens according to the self-attention of the VLM,
which is biased and can lead to inaccurate responses. Token reduction
approaches that rely solely on visual cues are text-agnostic and fail to focus
on the areas most relevant to the question, especially when the queried
objects are not salient in the image. In this work, we first conduct
experiments to show that the original text embeddings are aligned with the
visual tokens, without bias toward the tail visual tokens. We then propose a
self-adaptive cross-modality attention mixture mechanism that dynamically
leverages the effectiveness of visual saliency and text-to-image similarity in
the pre-LLM layers to select the visual tokens that are informative. Extensive
experiments demonstrate that the proposed approach achieves state-of-the-art
training-free VLM acceleration performance, especially when the reduction rate
is sufficiently large.
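
The selection idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how the two cues are mixed, so the entropy-based weighting, the function name select_visual_tokens, the temperature parameter, and the use of a per-token saliency vector (for example, attention from a vision encoder's class token) are all assumptions made for illustration. The sketch scores each visual token by visual saliency and by text-to-image cosine similarity, gives more weight to whichever cue is more sharply peaked, and keeps the top-scoring tokens before they would reach the LLM.

```python
import numpy as np


def _softmax(x):
    """Numerically stable softmax over a 1-D array."""
    x = x - np.max(x)
    e = np.exp(x)
    return e / e.sum()


def _peakedness(p):
    """1 - normalized entropy: 0 for a uniform distribution, near 1 for one-hot."""
    entropy = -np.sum(p * np.log(p + 1e-12))
    max_entropy = max(np.log(len(p)), 1e-12)
    return max(0.0, 1.0 - entropy / max_entropy)


def select_visual_tokens(visual_tokens, text_embeds, saliency, keep, temperature=0.1):
    """Hypothetical text-aware token selection (illustrative only).

    visual_tokens: (N, D) visual tokens in the text embedding space
    text_embeds:   (M, D) original text embeddings of the question
    saliency:      (N,)   text-agnostic saliency score per visual token
    keep:          number of visual tokens to retain
    Returns the sorted indices of the kept tokens and the mixing weight w.
    """
    # Text-to-image similarity: best cosine match of each visual token to any text token.
    v = visual_tokens / (np.linalg.norm(visual_tokens, axis=1, keepdims=True) + 1e-12)
    t = text_embeds / (np.linalg.norm(text_embeds, axis=1, keepdims=True) + 1e-12)
    similarity = (v @ t.T).max(axis=1)

    p_sal = _softmax(np.asarray(saliency, dtype=float))
    p_txt = _softmax(similarity / temperature)

    # ASSUMPTION: weight each cue by how peaked (confident) its distribution is.
    c_sal, c_txt = _peakedness(p_sal), _peakedness(p_txt)
    w = c_txt / (c_sal + c_txt + 1e-12)

    score = (1.0 - w) * p_sal + w * p_txt
    kept = np.sort(np.argsort(-score)[:keep])  # keep original token order
    return kept, w


# Usage with random placeholder data (shapes are arbitrary).
rng = np.random.default_rng(0)
tokens = rng.normal(size=(16, 8))
text = rng.normal(size=(3, 8))
sal = rng.random(16)
kept_idx, mix_weight = select_visual_tokens(tokens, text, sal, keep=5)
```

When the saliency scores are flat, the weight shifts almost entirely to the text similarity, which mirrors the case the abstract highlights: queried objects that are not salient in the image.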