Treble Counterfactual VLMs: A Causal Approach to Hallucination
Journal: arXiv
Published Date: Mar 8, 2025
Abstract
Vision-Language Models (VLMs) have advanced multi-modal tasks like image
captioning, visual question answering, and reasoning. However, they often
generate hallucinated outputs inconsistent with the visual context or prompt,
limiting reliability in critical applications like autonomous driving and
medical imaging. Existing studies link hallucination to statistical biases,
language priors, and biased feature learning but lack a structured causal
understanding. In this work, we introduce a causal perspective to analyze and
mitigate hallucination in VLMs. We hypothesize that hallucination arises from
unintended direct influences of either the vision or text modality, bypassing
proper multi-modal fusion. To address this, we construct a causal graph for
VLMs and employ counterfactual analysis to estimate the Natural Direct Effect
(NDE) of vision, text, and their cross-modal interaction on the output. We
systematically identify and mitigate these unintended direct effects to ensure
that responses are primarily driven by genuine multi-modal fusion. Our approach
consists of three steps: (1) designing structural causal graphs to distinguish
correct fusion pathways from spurious modality shortcuts, (2) estimating
modality-specific and cross-modal NDE using perturbed image representations,
hallucinated text embeddings, and degraded visual inputs, and (3) implementing
a test-time intervention module to dynamically adjust the model's dependence on
each modality. Experimental results demonstrate that our method significantly
reduces hallucination while preserving task performance, providing a robust and
interpretable framework for improving VLM reliability. To support
reproducibility, we release our code at
https://github.com/TREE985/Treble-Counterfactual-VLMs.
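
As a rough illustration of step (3), the sketch below shows one way a test-time intervention could subtract estimated direct effects from a model's output logits. It is not the authors' implementation: the abstract does not give the exact formulation, so the function names (`remove_direct_effects`, `softmax`), the per-pathway weights, and the toy numbers are all hypothetical. It assumes that a direct-effect estimate for each pathway (vision, text, cross-modal) has already been obtained as a vector over the vocabulary, for example from the counterfactual inputs described in step (2).

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def remove_direct_effects(factual_logits, direct_effects, weights):
    """Subtract weighted direct-effect estimates from the factual logits.

    factual_logits: logits from the unmodified (image, prompt) input.
    direct_effects: dict mapping a pathway name to its estimated direct
        effect on each logit (hypothetical; how these are estimated is
        outside the scope of this sketch).
    weights: dict mapping a pathway name to a scaling factor; pathways
        without an entry are left uncorrected (weight 0.0).
    """
    adjusted = list(factual_logits)
    for name, effect in direct_effects.items():
        if len(effect) != len(adjusted):
            raise ValueError(f"effect '{name}' has the wrong length")
        w = weights.get(name, 0.0)
        adjusted = [a - w * e for a, e in zip(adjusted, effect)]
    return adjusted


# Toy example with a three-token vocabulary (all numbers are made up).
# Token 0 is favoured mostly because of a text-only shortcut.
factual = [2.0, 1.0, 0.5]
effects = {
    "vision": [0.0, 0.0, 0.0],
    "text": [1.5, 0.0, 0.0],
    "cross": [0.0, 0.0, 0.0],
}
weights = {"vision": 1.0, "text": 1.0, "cross": 1.0}

adjusted = remove_direct_effects(factual, effects, weights)
probs = softmax(adjusted)
print(adjusted)
print(probs.index(max(probs)))
```

In this toy case, removing the text-only contribution shifts the top-ranked token from index 0 to index 1. In practice the weights would need tuning, since subtracting too much of a pathway's effect could also remove useful signal.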