BYO-Eval: Build Your Own Dataset for Fine-Grained Visual Assessment of Multimodal Language Models

Journal: arXiv

Published Date: Jun 5, 2025

Abstract

Visual Language Models (VLMs) are now sufficiently advanced to support a broad range of applications, including answering complex visual questions, and are increasingly expected to interact with images in varied ways. To evaluate them, current benchmarks often focus on specific domains (e.g., reading charts), constructing datasets of annotated real images paired with pre-defined Multiple Choice Questions (MCQs) to report aggregate accuracy scores. However, such benchmarks entail high annotation costs, risk information leakage, and do not clarify whether failures stem from limitations in visual perception, reasoning, or general knowledge. We propose a new evaluation methodology, inspired by ophthalmologic diagnostics, leveraging procedural generation of synthetic images to obtain control over visual attributes and precisely reveal perception failures in VLMs. Specifically, we build collections of images with gradually more challenging variations in the content of interest (e.g., number of objects in a counting task) while holding other visual parameters constant. This diagnostic allows systematic stress testing and fine-grained failure analysis, shifting the focus from coarse benchmarking toward targeted and interpretable assessment of VLM capabilities. Our code is available at https://github.com/byoeval/BYO-EVAL.

Authors

Ludovic Arnould
Salim Khazem
Hugues Ali Mehenni

External Resources

View on arXiv arXiv (http://arxiv.org/abs/2506.05440v1)

BYO-Eval: Build Your Own Dataset for Fine-Grained Visual Assessment of Multimodal Language Models

Abstract

Authors

Categories

External Resources

Popular Topics

Recent Journals

BYO-Eval: Build Your Own Dataset for Fine-Grained Visual Assessment of Multimodal Language Models

Abstract

Authors

Categories

External Resources

Stay Ahead of Medical AI

Popular Topics

Recent Journals