Seeing Sound, Hearing Sight: Uncovering Modality Bias and Conflict of AI models in Sound Localization
Journal: arXiv
Published Date: May 16, 2025
Abstract
Imagine hearing a dog bark and turning toward the sound only to see a parked
car, while the real, silent dog sits elsewhere. Such sensory conflicts test
perception, yet humans reliably resolve them by prioritizing sound over
misleading visuals. Despite advances in multimodal AI integrating vision and
audio, little is known about how these systems handle cross-modal conflicts or
whether they favor one modality. In this study, we systematically examine
modality bias and conflict resolution in AI sound localization. We assess
leading multimodal models and benchmark them against human performance in
psychophysics experiments across six audiovisual conditions, including
congruent, conflicting, and absent cues. Humans consistently outperform AI,
demonstrating superior resilience to conflicting or missing visuals by relying
on auditory information. In contrast, AI models often default to visual input,
which degrades their performance to near-chance levels. To address this, we fine-tune a
state-of-the-art model using a stereo audio-image dataset generated via 3D
simulations. Even with limited training data, the refined model surpasses
existing benchmarks. Notably, it also mirrors the human-like horizontal
localization bias that favors left-right precision, likely because the stereo
audio structure reflects human ear placement. These findings underscore how sensory
input quality and system architecture shape multimodal representation accuracy.