Mixed Signals: Decoding VLMs' Reasoning and Underlying Bias in Vision-Language Conflict
Journal: arXiv
Published Date: Apr 11, 2025
Abstract
Vision-language models (VLMs) have demonstrated impressive performance by
effectively integrating visual and textual information to solve complex tasks.
However, it is not clear how these models reason over the visual and textual
data together, nor how the flow of information between modalities is
structured. In this paper, we examine how VLMs reason by analyzing their biases
when confronted with scenarios that present conflicting image and text cues, a
common occurrence in real-world applications. To uncover the extent and nature
of these biases, we build upon existing benchmarks to create five datasets
containing mismatched image-text pairs, covering topics in mathematics,
science, and visual descriptions. Our analysis shows that VLMs favor text in
simpler queries but shift toward images as query complexity increases. This
bias correlates with model scale, with the difference between the percentage of
image- and text-preferred responses ranging from +56.8 percentage points (image
favored) to -74.4 percentage points (text favored), depending on the task and
model. In addition, we explore
three mitigation strategies: simple prompt modifications, modifications that
explicitly instruct models on how to handle conflicting information (akin to
chain-of-thought prompting), and a task decomposition strategy that analyzes
each modality separately before combining their results. Our findings indicate
that the effectiveness of these strategies in identifying and mitigating bias
varies significantly and is closely linked to the model's overall performance
on the task and the specific modality in question.
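
The bias measure quoted above, the difference between the percentage of image-preferred and text-preferred responses, can be illustrated with a minimal sketch. The function name `modality_bias_gap`, the label strings, and the choice to count every response (including those that follow neither modality) in the denominator are assumptions made for illustration; the abstract does not specify how the paper computes the figure.

```python
from collections import Counter


def modality_bias_gap(labels):
    """Return image-preferred % minus text-preferred % over all responses.

    `labels` holds one string per model response: "image" if the answer
    followed the image cue, "text" if it followed the text cue, and any
    other value (e.g. "other") if it followed neither. Counting the
    "neither" responses in the denominator is an assumption of this sketch.
    """
    if not labels:
        raise ValueError("labels must be non-empty")
    counts = Counter(labels)
    total = len(labels)
    image_pct = 100.0 * counts["image"] / total
    text_pct = 100.0 * counts["text"] / total
    # Positive values mean the image is favored, negative values the text.
    return image_pct - text_pct


# Hypothetical annotations for ten conflicting image-text queries.
labels = ["image"] * 6 + ["text"] * 3 + ["other"] * 1
gap = modality_bias_gap(labels)
print(f"{gap:+.1f} percentage points")
```

Under this convention, a positive gap corresponds to the "image favored" end of the reported range and a negative gap to the "text favored" end.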