Not All Tokens and Heads Are Equally Important: Dual-Level Attention Intervention for Hallucination Mitigation
Journal: arXiv
Published Date: Jun 14, 2025
Abstract
Large vision-language models (LVLMs) have shown remarkable capabilities
across a wide range of multimodal tasks. However, they remain prone to visual
hallucination (VH), often producing confident but incorrect descriptions of
visual content. We present VisFlow, an efficient and training-free framework
designed to mitigate VH by directly manipulating attention patterns during
inference. Through systematic analysis, we identify three key pathological
attention behaviors in LVLMs: (1) weak visual grounding, where attention to
visual tokens is insufficient or misallocated, over-focusing on uninformative
regions; (2) language prior dominance, where excessive attention to prior
response tokens reinforces autoregressive patterns and impairs multimodal
alignment; (3) prompt redundancy, where many attention heads fixate on system
prompt tokens, disrupting the integration of image, instruction, and response
content. To address these issues, we introduce two inference-time
interventions: token-level attention intervention (TAI), which enhances focus
on salient visual content, and head-level attention intervention (HAI), which
suppresses over-attention to system prompt and nearby text tokens. VisFlow operates
without additional training or model modifications. Extensive experiments
across models and benchmarks show that VisFlow effectively reduces
hallucinations and improves visual factuality, with negligible computational
cost.
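
The abstract gives no formulas for TAI or HAI, so the following is only an illustrative sketch of the two kinds of intervention it describes, not the authors' method. It assumes post-softmax attention weights for a single query position, one row per head. The function names, the saliency rule (boost visual tokens at or above the mean visual attention), and the parameters `alpha`, `beta`, and `threshold` are all hypothetical choices made for this example.

```python
from typing import List, Sequence

Row = List[float]  # attention weights of one head for one query position


def renormalize(row: Sequence[float]) -> Row:
    """Rescale weights so they sum to 1 again after an intervention."""
    total = sum(row)
    return [w / total for w in row]


def token_level_intervention(row: Sequence[float],
                             visual_idx: Sequence[int],
                             alpha: float = 0.5) -> Row:
    """Token-level sketch: amplify attention on 'salient' visual tokens.

    Assumption: a visual token counts as salient when its weight is at
    least the mean weight over all visual tokens. This rule is hypothetical.
    """
    if not visual_idx:
        return list(row)
    mean_vis = sum(row[i] for i in visual_idx) / len(visual_idx)
    out = list(row)
    for i in visual_idx:
        if row[i] >= mean_vis:
            out[i] = row[i] * (1.0 + alpha)  # boost salient visual tokens
    return renormalize(out)


def head_level_intervention(heads: Sequence[Sequence[float]],
                            text_idx: Sequence[int],
                            threshold: float = 0.5,
                            beta: float = 0.5) -> List[Row]:
    """Head-level sketch: damp heads that fixate on prompt/text tokens.

    Assumption: a head is treated as over-attending when its total weight
    on `text_idx` exceeds `threshold`; those weights are scaled by `beta`.
    """
    text_set = set(text_idx)
    out: List[Row] = []
    for row in heads:
        mass = sum(row[i] for i in text_set)
        if mass > threshold:
            damped = [w * beta if i in text_set else w
                      for i, w in enumerate(row)]
            out.append(renormalize(damped))
        else:
            out.append(list(row))  # heads below the threshold are untouched
    return out


# Toy layout: token 0 = system prompt, tokens 1-2 = visual, token 3 = response.
demo_heads = [[0.7, 0.1, 0.1, 0.1], [0.1, 0.4, 0.2, 0.3]]
demo_heads = head_level_intervention(demo_heads, text_idx=[0])
demo_heads = [token_level_intervention(r, visual_idx=[1, 2]) for r in demo_heads]
```

In this toy layout only the first head exceeds the prompt-attention threshold, so only that head is damped, while the visual boost is applied to every head. Operating on post-softmax weights with renormalization keeps each row a valid distribution. An actual implementation might instead adjust pre-softmax logits inside the model's attention layers.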