IllusionBench: A Large-scale and Comprehensive Benchmark for Visual Illusion Understanding in Vision-Language Models
Journal: arXiv
Published Date: Jan 1, 2025
Abstract
Current Vision-Language Models (VLMs) show impressive image understanding but
struggle with visual illusions, especially in real-world scenarios. Existing
benchmarks focus on classical cognitive illusions, which state-of-the-art
(SOTA) VLMs have already learned; this reveals issues such as hallucinations and
limited perceptual abilities. To address this gap, we introduce IllusionBench,
a comprehensive visual illusion dataset that encompasses not only classic
cognitive illusions but also real-world scene illusions. This dataset features
1,051 images, 5,548 question-answer pairs, and 1,051 golden text descriptions
that address the presence, causes, and content of the illusions. We evaluate
ten SOTA VLMs on this dataset using true-or-false, multiple-choice, and
open-ended tasks. In addition to real-world illusions, we design trap illusions
that resemble classical patterns but differ in reality, highlighting
hallucination issues in SOTA models. The top-performing model, GPT-4o, achieves
80.59% accuracy on true-or-false tasks and 76.75% on multiple-choice questions,
but still lags behind human performance. In the semantic description task,
GPT-4o's hallucinations on classical illusions result in low scores for trap
illusions, where it even falls behind some open-source models. IllusionBench is, to
the best of our knowledge, the largest and most comprehensive benchmark for
visual illusions in VLMs to date.
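
The accuracy figures reported for the true-or-false and multiple-choice tasks can be read as the share of questions answered correctly within each task. The following is a minimal sketch of that computation, not the authors' evaluation code: the record keys (`task`, `prediction`, `answer`), the task labels, and the case-insensitive exact-match rule are all assumptions made for illustration, and the records are toy values rather than benchmark data.

```python
from collections import defaultdict


def accuracy_by_task(records):
    """Return {task: accuracy in percent} for closed-form QA records.

    Each record is a dict with hypothetical keys 'task', 'prediction',
    and 'answer'. Matching is exact after trimming and lowercasing,
    which is an assumption, not the benchmark's documented rule.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        task = record["task"]
        total[task] += 1
        predicted = record["prediction"].strip().lower()
        expected = record["answer"].strip().lower()
        if predicted == expected:
            correct[task] += 1
    return {task: 100.0 * correct[task] / total[task] for task in total}


# Toy records for illustration only; these are not IllusionBench data.
records = [
    {"task": "true_false", "prediction": "True", "answer": "true"},
    {"task": "true_false", "prediction": "false", "answer": "true"},
    {"task": "multiple_choice", "prediction": "B", "answer": "B"},
    {"task": "multiple_choice", "prediction": "c ", "answer": "C"},
]
scores = accuracy_by_task(records)
```

Scoring each task separately keeps the true-or-false and multiple-choice results distinct, in the same way the abstract reports them. The open-ended description task would need a different scoring scheme and is not covered by this sketch.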