Mitigating Hallucination of Large Vision-Language Models via Dynamic Logits Calibration
Journal: arXiv
Published Date: Jun 26, 2025
Abstract
Large Vision-Language Models (LVLMs) have demonstrated significant
advancements in multimodal understanding, yet they are frequently hampered by
hallucination, the generation of text that contradicts visual input. Existing
training-free decoding strategies exhibit critical limitations, including the
use of static constraints that do not adapt to semantic drift during
generation, inefficiency stemming from the need for multiple forward passes,
and degradation of detail due to overly rigid intervention rules. To overcome
these challenges, this paper introduces Dynamic Logits Calibration (DLC), a
novel training-free decoding framework designed to dynamically align text
generation with visual evidence at inference time. At each decoding step, DLC
employs CLIP to assess the semantic alignment between the input image
and the generated text sequence. Then, the Relative Visual Advantage (RVA) of
candidate tokens is evaluated against a dynamically updated contextual
baseline, adaptively adjusting output logits to favor tokens that are visually
grounded. Furthermore, an adaptive weighting mechanism, informed by a real-time
context alignment score, carefully balances the visual guidance while ensuring
the overall quality of the textual output. Extensive experiments conducted
across diverse benchmarks and various LVLM architectures (such as LLaVA,
InstructBLIP, and MiniGPT-4) demonstrate that DLC significantly reduces
hallucinations, outperforming current methods while maintaining high inference
efficiency by avoiding multiple forward passes. Overall, we present an
effective and efficient decoding-time solution to mitigate hallucinations,
thereby enhancing the reliability of LVLMs in practical applications. Code will be
released on GitHub.
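
The calibration step described above can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so everything here is an assumption made for illustration: the function name `calibrate_logits`, the normalization of RVA by the baseline, the linear weighting rule (weaker context alignment gives stronger visual guidance), and the `max_strength` parameter are all hypothetical. The similarity scores are passed in as plain numbers standing in for CLIP image-text scores.

```python
from typing import Dict


def calibrate_logits(
    logits: Dict[str, float],
    candidate_scores: Dict[str, float],
    baseline_score: float,
    context_alignment: float,
    max_strength: float = 3.0,
) -> Dict[str, float]:
    """Adjust candidate-token logits by their Relative Visual Advantage.

    logits:            raw logit for each candidate token
    candidate_scores:  image-text similarity of (context + token); a stand-in
                       for a CLIP score, supplied by the caller
    baseline_score:    similarity of the current context alone (the
                       contextual baseline)
    context_alignment: value in [0, 1] describing how well the text
                       generated so far matches the image
    """
    # ASSUMPTION: guidance weight shrinks linearly as the context becomes
    # better aligned. The paper's actual weighting rule may differ.
    weight = max_strength * (1.0 - context_alignment)

    calibrated = {}
    for token, logit in logits.items():
        # ASSUMPTION: RVA is the candidate's gain over the baseline,
        # normalized by the baseline magnitude.
        rva = (candidate_scores[token] - baseline_score) / max(
            abs(baseline_score), 1e-8
        )
        calibrated[token] = logit + weight * rva
    return calibrated


# Usage: "cat" has the higher raw logit, but "dog" is better grounded
# in the (hypothetical) image scores, so "dog" outranks "cat" afterwards.
example = calibrate_logits(
    logits={"dog": 2.0, "cat": 2.1},
    candidate_scores={"dog": 0.30, "cat": 0.20},
    baseline_score=0.25,
    context_alignment=0.5,
)
best_token = max(example, key=example.get)
```

In this sketch the language model's forward pass is run only once per step; the visual signal enters purely as an additive correction to the existing logits, which is consistent with the abstract's claim of avoiding multiple forward passes.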