GreedyPrune: Retenting Critical Visual Token Set for Large Vision Language Models

Journal: arXiv

Published Date: Jun 16, 2025

Abstract

Although Large Vision Language Models (LVLMs) have demonstrated remarkable performance in image understanding tasks, their computational efficiency remains a significant challenge, particularly on resource-constrained devices due to the high cost of processing large numbers of visual tokens. Recently, training-free visual token pruning methods have gained popularity as a low-cost solution to this issue. However, existing approaches suffer from two key limitations: semantic saliency-based strategies primarily focus on high cross-attention visual tokens, often neglecting visual diversity, whereas visual diversity-based methods risk inadvertently discarding semantically important tokens, especially under high compression ratios. In this paper, we introduce GreedyPrune, a training-free plug-and-play visual token pruning algorithm designed to jointly optimize semantic saliency and visual diversity. We formalize the token pruning process as a combinatorial optimization problem and demonstrate that greedy algorithms effectively balance computational efficiency with model accuracy. Extensive experiments validate the effectiveness of our approach, showing that GreedyPrune achieves state-of-the-art accuracy across various multimodal tasks and models while significantly reducing end-to-end inference latency.
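The abstract does not state the objective GreedyPrune optimizes, so the following is only a minimal sketch of the general idea of greedily selecting a token subset that trades off saliency against diversity. Everything in it is an assumption made for illustration: the function names `cosine` and `greedy_select`, the weight `lam`, the use of per-token saliency scores (for example, cross-attention weights), and the scoring rule (saliency plus the cosine distance to the nearest already-kept token). It should not be read as the authors' formulation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def greedy_select(features, saliency, k, lam=1.0):
    """Greedily keep k token indices (hypothetical objective, not the paper's).

    features: list of token feature vectors
    saliency: list of per-token importance scores (assumed given)
    lam:      weight of the diversity term; lam=0 reduces to top-k by saliency
    """
    n = len(features)
    k = min(k, n)
    if k <= 0:
        return []

    # Seed with the most salient token.
    selected = [max(range(n), key=lambda i: saliency[i])]
    # min_dist[i] = cosine distance from token i to its nearest kept token.
    min_dist = [1.0 - cosine(features[i], features[selected[0]]) for i in range(n)]

    while len(selected) < k:
        kept = set(selected)
        # Pick the token with the best saliency + diversity gain.
        best = max(
            (i for i in range(n) if i not in kept),
            key=lambda i: saliency[i] + lam * min_dist[i],
        )
        selected.append(best)
        # Update nearest-kept-token distances with the new token.
        for i in range(n):
            d = 1.0 - cosine(features[i], features[best])
            if d < min_dist[i]:
                min_dist[i] = d

    return sorted(selected)


# Toy example: tokens 0, 1 and 3 are near-duplicates; token 2 is distinct but less salient.
feats = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
sal = [0.9, 0.8, 0.3, 0.7]
print(greedy_select(feats, sal, k=2, lam=1.0))  # keeps the distinct token 2
print(greedy_select(feats, sal, k=2, lam=0.0))  # plain top-k by saliency
```

In this toy input, setting `lam=0.0` keeps the two most salient tokens even though they are duplicates, while `lam=1.0` keeps the most salient token together with the visually distinct one, which illustrates the two failure modes the abstract contrasts.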

Authors

Ruiguang Pei
Weiqing Sun
Zhihui Fu
Jun Wang

External Resources

View on arXiv (http://arxiv.org/abs/2506.13166v1)
