GreedyPrune: Retenting Critical Visual Token Set for Large Vision Language Models
Journal: arXiv
Published Date: Jun 16, 2025
Abstract
Although Large Vision Language Models (LVLMs) have demonstrated remarkable
performance in image understanding tasks, their computational efficiency
remains a significant challenge, particularly on resource-constrained devices
due to the high cost of processing large numbers of visual tokens. Recently,
training-free visual token pruning methods have gained popularity as a low-cost
solution to this issue. However, existing approaches suffer from two key
limitations: semantic saliency-based strategies primarily focus on visual
tokens with high cross-attention scores, often neglecting visual diversity, whereas
visual diversity-based methods risk inadvertently discarding semantically
important tokens, especially under high compression ratios. In this paper, we
introduce GreedyPrune, a training-free plug-and-play visual token pruning
algorithm designed to jointly optimize semantic saliency and visual diversity.
We formalize the token pruning process as a combinatorial optimization problem
and demonstrate that greedy algorithms effectively balance computational
efficiency with model accuracy. Extensive experiments validate the
effectiveness of our approach, showing that GreedyPrune achieves
state-of-the-art accuracy across various multimodal tasks and models while
significantly reducing end-to-end inference latency.
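
To make the idea of jointly weighing saliency and diversity concrete, the following is a minimal sketch of a greedy token selector. It is not the paper's algorithm: the abstract does not state the objective, so the scoring rule here (a weighted sum of a per-token saliency score and the cosine distance to the nearest already-kept token) is an assumption, as are the names `greedy_select`, `cosine_distance`, and `trade_off`.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity; zero vectors are treated as distance 1."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def greedy_select(features, saliency, k, trade_off=0.5):
    """Greedily keep k tokens (hypothetical objective, not the paper's).

    features  : list of token feature vectors
    saliency  : list of per-token saliency scores (e.g. attention weights)
    trade_off : 1.0 ranks by saliency only, 0.0 by diversity only
    Returns the kept token indices in their original order.
    """
    n = len(features)
    if len(saliency) != n:
        raise ValueError("features and saliency must have the same length")
    k = min(k, n)
    if k <= 0:
        return []

    # Seed with the most salient token, since diversity is undefined
    # until at least one token has been kept.
    first = max(range(n), key=lambda i: saliency[i])
    selected = [first]
    chosen = {first}
    # min_dist[i] = distance from token i to its nearest kept token.
    min_dist = [cosine_distance(f, features[first]) for f in features]

    while len(selected) < k:
        candidates = [i for i in range(n) if i not in chosen]
        best = max(
            candidates,
            key=lambda i: trade_off * saliency[i] + (1.0 - trade_off) * min_dist[i],
        )
        selected.append(best)
        chosen.add(best)
        # Adding a token can only shrink each nearest-kept-token distance.
        for i in candidates:
            min_dist[i] = min(min_dist[i], cosine_distance(features[i], features[best]))

    return sorted(selected)


# Toy example: tokens 0 and 1 are near-duplicates, token 2 is distinct.
toy_features = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.9, 0.1]]
toy_saliency = [0.9, 0.85, 0.3, 0.5]
print(greedy_select(toy_features, toy_saliency, k=2, trade_off=0.5))  # [0, 2]
print(greedy_select(toy_features, toy_saliency, k=2, trade_off=1.0))  # [0, 1]
```

In the toy example, ranking by saliency alone keeps the two near-duplicate tokens, while the mixed score replaces one of them with the visually distinct token. This mirrors the failure mode the abstract attributes to saliency-only pruning, though the real method may define both terms differently.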