Images Speak Louder than Words: Understanding and Mitigating Bias in Vision-Language Model from a Causal Mediation Perspective
Journal: arXiv
Published Date: Jul 3, 2024
Abstract
Vision-language models (VLMs) pre-trained on extensive datasets can
inadvertently learn biases by correlating gender information with specific
objects or scenarios. Current methods, which focus on modifying inputs and
monitoring changes in the model's output probability scores, often struggle to
comprehensively understand bias from the perspective of model components. We
propose a framework that incorporates causal mediation analysis to measure and
map the pathways of bias generation and propagation within VLMs. This approach
allows us to identify the direct effects of interventions on model bias and the
indirect effects of interventions on bias mediated through different model
components. Our results show that image features are the primary contributors
to bias, with significantly higher impacts than text features, specifically
accounting for 32.57% and 12.63% of the bias in the MSCOCO and PASCAL-SENTENCE
datasets, respectively. Notably, the image encoder's contribution surpasses
that of the text encoder and the deep fusion encoder. Further experimentation
confirms that contributions from both language and vision modalities are
aligned and non-conflicting. Consequently, blurring gender representations
within the image encoder, the component that contributes most to model bias,
reduces bias by 22.03% and 9.04% on the MSCOCO and PASCAL-SENTENCE datasets,
respectively, with minimal performance loss and little additional
computational cost.
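
The distinction the abstract draws between direct effects and effects mediated through model components can be illustrated with a minimal sketch. This is not the paper's implementation: the function `mediation_effects`, the toy mediator, and the toy outcome are hypothetical stand-ins, and a real analysis would replace them with a VLM component (for example, an encoder's activations) and a bias metric computed on the model's output.

```python
from typing import Callable, Dict

def mediation_effects(
    outcome: Callable[[float, float], float],
    mediator: Callable[[float], float],
    x_base: float,
    x_int: float,
) -> Dict[str, float]:
    """Total, direct, and indirect effects of changing the input from x_base to x_int.

    outcome(x, m) is the measured quantity (a stand-in for a bias score) given
    the input x and the mediator value m; mediator(x) is the value an internal
    component takes when the model sees x. Both are hypothetical placeholders.
    """
    m_base = mediator(x_base)   # mediator under the original input
    m_int = mediator(x_int)     # mediator under the intervened input
    y_base = outcome(x_base, m_base)

    # Total effect: intervene on the input and let the mediator follow.
    total = outcome(x_int, m_int) - y_base
    # Direct effect: intervene on the input, hold the mediator at its baseline value.
    direct = outcome(x_int, m_base) - y_base
    # Indirect effect: keep the original input, set the mediator to its intervened value.
    indirect = outcome(x_base, m_int) - y_base
    return {"total": total, "direct": direct, "indirect": indirect}

# Toy linear system (an assumption made purely for illustration).
def toy_mediator(x: float) -> float:
    return 2.0 * x

def toy_outcome(x: float, m: float) -> float:
    return 0.5 * x + 3.0 * m

effects = mediation_effects(toy_outcome, toy_mediator, x_base=0.0, x_int=1.0)
print(effects)  # {'total': 6.5, 'direct': 0.5, 'indirect': 6.0}
```

In this toy linear system the direct and indirect effects sum exactly to the total effect. That additivity is a property of the linear example and is not guaranteed for a nonlinear model such as a VLM.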