When Tom Eats Kimchi: Evaluating Cultural Bias of Multimodal Large Language Models in Cultural Mixture Contexts
Journal: arXiv
Published Date: Mar 21, 2025
Abstract
In a highly globalized world, it is important for multimodal large language
models (MLLMs) to recognize and respond correctly to mixed-cultural inputs. For
example, a model should correctly identify kimchi (a Korean food) in an image
whether an Asian woman or an African man is eating it.
However, current MLLMs show an over-reliance on the visual features of the
person, leading to misclassification of the entities. To examine the robustness
of MLLMs to different ethnicities, we introduce MixCuBe, a cross-cultural bias
benchmark, and study elements from five countries and four ethnicities. Our
findings reveal that MLLMs achieve both higher accuracy and lower sensitivity
to such perturbation for high-resource cultures, but not for low-resource
cultures. GPT-4o, the best-performing model overall, shows a difference of up
to 58% in accuracy between the original and perturbed cultural settings for
low-resource cultures. Our dataset is publicly available at:
https://huggingface.co/datasets/kyawyethu/MixCuBe.
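
The comparison between original and perturbed settings can be illustrated with a minimal sketch. It assumes the gap is the plain difference between two accuracies computed over the same labels; the abstract does not specify the exact formula, and the function names and toy data below are hypothetical, not taken from MixCuBe.

```python
# Illustrative sketch only: the names and toy data are assumptions,
# not the benchmark's actual evaluation code or results.

def accuracy(predictions, labels):
    """Fraction of predictions that exactly match the reference labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("labels must not be empty")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def accuracy_gap(original_preds, perturbed_preds, labels):
    """Accuracy on original images minus accuracy on perturbed images.

    Assumes both prediction lists refer to the same cultural items, so the
    same labels apply; only the depicted person's ethnicity was changed.
    """
    return accuracy(original_preds, labels) - accuracy(perturbed_preds, labels)


# Made-up toy example: four images and their reference labels.
labels = ["kimchi", "kimchi", "dish_b", "dish_b"]
original_preds = ["kimchi", "kimchi", "dish_b", "dish_b"]
perturbed_preds = ["kimchi", "dish_c", "dish_b", "dish_d"]

gap = accuracy_gap(original_preds, perturbed_preds, labels)
print(f"accuracy gap: {gap:.2f}")
```

A larger gap indicates that the model's predictions depend more heavily on the ethnicity of the person shown than on the cultural item itself.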