BYO-Eval: Build Your Own Dataset for Fine-Grained Visual Assessment of Multimodal Language Models
Journal: arXiv
Published Date: Jun 5, 2025
Abstract
Visual Language Models (VLMs) are now sufficiently advanced to support a
broad range of applications, including answering complex visual questions, and
are increasingly expected to interact with images in varied ways. To evaluate
them, current benchmarks often focus on specific domains (e.g., reading
charts), constructing datasets of annotated real images paired with pre-defined
Multiple Choice Questions (MCQs) to report aggregate accuracy scores. However,
such benchmarks entail high annotation costs, risk information leakage, and do
not clarify whether failures stem from limitations in visual perception,
reasoning, or general knowledge. We propose a new evaluation methodology,
inspired by ophthalmologic diagnostics, leveraging procedural generation of
synthetic images to obtain control over visual attributes and precisely reveal
perception failures in VLMs. Specifically, we build collections of images with
gradually more challenging variations in the content of interest (e.g., number
of objects in a counting task) while holding other visual parameters constant.
This diagnostic approach allows systematic stress testing and fine-grained failure
analysis, shifting the focus from coarse benchmarking toward targeted and
interpretable assessment of VLM capabilities. Our code is available at
https://github.com/byoeval/BYO-EVAL.
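
The controlled-variation idea described above (vary one attribute, hold the rest fixed, then report accuracy per difficulty level rather than one aggregate score) can be illustrated with a minimal Python sketch. This is not the BYO-EVAL API: `SceneConfig`, `make_sweep`, `place_objects`, `accuracy_by_level`, and `toy_model` are hypothetical names introduced only for illustration, and the toy model stands in for a real VLM call.

```python
import random
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SceneConfig:
    """Hypothetical description of one synthetic image (illustrative fields)."""
    num_objects: int            # the attribute under test
    object_size: float = 0.1    # held constant across the sweep
    background: str = "plain"   # held constant across the sweep
    seed: int = 0               # fixed seed keeps the layout reproducible


def make_sweep(base, counts):
    """Vary only num_objects; every other field is copied from base."""
    return [replace(base, num_objects=n) for n in counts]


def place_objects(cfg):
    """Sample object positions in the unit square from the config's seed."""
    rng = random.Random(cfg.seed)
    return [(rng.random(), rng.random()) for _ in range(cfg.num_objects)]


def accuracy_by_level(configs, predict):
    """Report accuracy per difficulty level instead of a single aggregate."""
    totals = {}
    for cfg in configs:
        hits, n = totals.get(cfg.num_objects, (0, 0))
        correct = int(predict(cfg) == cfg.num_objects)
        totals[cfg.num_objects] = (hits + correct, n + 1)
    return {level: hits / n for level, (hits, n) in sorted(totals.items())}


def toy_model(cfg):
    """Stand-in for a VLM: counts correctly up to 4 objects, then undercounts."""
    return cfg.num_objects if cfg.num_objects <= 4 else cfg.num_objects - 1


sweep = make_sweep(SceneConfig(num_objects=1), range(1, 8))
report = accuracy_by_level(sweep, toy_model)
for level, acc in report.items():
    print(f"{level} objects: accuracy {acc:.2f}")
```

Because only `num_objects` changes between configurations, a drop in per-level accuracy can be attributed to the counting difficulty itself rather than to an incidental change in size, background, or layout. A real pipeline would render each configuration to an image and query the model under test in place of `toy_model`.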