Cultural Evaluations of Vision-Language Models Have a Lot to Learn from Cultural Theory
Journal: arXiv
Published Date: May 28, 2025
Abstract
Modern vision-language models (VLMs) often fail at cultural competency
evaluations and benchmarks. Given the diversity of applications built upon
VLMs, there is renewed interest in understanding how they encode cultural
nuances. While individual aspects of this problem have been studied, we still
lack a comprehensive framework for systematically identifying and annotating
the nuanced cultural dimensions present in images for VLMs. This position paper
argues that foundational methodologies from visual culture studies (cultural
studies, semiotics, and visual studies) are necessary for cultural analysis of
images. Building on a review of these methodologies, we propose a set of five frameworks,
corresponding to cultural dimensions, that must be considered for a more
complete analysis of the cultural competencies of VLMs.