ONLY: One-Layer Intervention Sufficiently Mitigates Hallucinations in Large Vision-Language Models
Journal: arXiv
Published Date: Jul 1, 2025
Abstract
Recent Large Vision-Language Models (LVLMs) have introduced a new paradigm
for understanding and reasoning about image input through textual responses.
Although they have achieved remarkable performance across a range of
multi-modal tasks, they face the persistent challenge of hallucination, which
undermines their practical reliability and raises concerns about their
deployment in real-world applications. Existing work has explored contrastive
decoding approaches to mitigate this issue, where the output of the original
LVLM is compared and contrasted with that of a perturbed version. However,
these methods require two or more queries per response, which slows LVLM
generation and makes them less suitable for real-time applications. To overcome
this limitation, we propose ONLY, a training-free decoding approach that
requires only a single query and a one-layer intervention during decoding,
enabling efficient real-time deployment. Specifically, we enhance textual
outputs by selectively amplifying crucial textual information using a
text-to-visual entropy ratio for each token. Extensive experiments
demonstrate that ONLY consistently outperforms state-of-the-art
methods across various benchmarks while requiring minimal implementation effort
and computational cost. Code is available at https://github.com/zifuwan/ONLY.
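
The two ideas contrasted in the abstract can be illustrated with a small, self-contained Python sketch. It is not the authors' implementation (see the repository linked above for that). The first helper shows the generic contrastive-decoding combination used in prior work, which needs logits from both the original and a perturbed query. The second shows one plausible way to form a per-token text-to-visual entropy ratio. The abstract does not say how the two entropies are obtained, so the sketch assumes each is the Shannon entropy of a normalized attention distribution, one over text positions and one over visual positions. All function names, the `alpha` weight, and the `eps` constant are hypothetical.

```python
import math


def softmax(logits):
    """Convert raw scores into a probability distribution."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def contrastive_logits(original, perturbed, alpha=1.0):
    """Generic contrastive decoding: needs TWO sets of logits per step.

    Combines them as (1 + alpha) * original - alpha * perturbed.
    `alpha` is an illustrative contrast weight, not a value from the paper.
    """
    return [(1 + alpha) * o - alpha * p for o, p in zip(original, perturbed)]


def text_to_visual_entropy_ratio(text_attn, visual_attn, eps=1e-12):
    """ASSUMPTION: ratio of the entropy of a token's attention over text
    positions to the entropy of its attention over visual positions.
    The abstract names the ratio but does not define it this precisely.
    """
    return entropy(text_attn) / (entropy(visual_attn) + eps)


# Contrastive decoding sharpens tokens the original query prefers.
combined = contrastive_logits([2.0, 1.0], [1.0, 1.0], alpha=1.0)
next_token_probs = softmax(combined)

# A token whose text attention is more spread out than its visual attention
# yields a ratio above 1 under the assumed definition.
ratio = text_to_visual_entropy_ratio([0.25, 0.25, 0.25, 0.25], [0.5, 0.5])
```

The point of the comparison is cost: `contrastive_logits` requires a second forward pass to produce `perturbed`, whereas a ratio of this kind can be computed from quantities already available in a single pass, which is the efficiency argument the abstract makes.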