SMIR: Efficient Synthetic Data Pipeline To Improve Multi-Image Reasoning
Journal:
arXiv
Published Date:
Jan 7, 2025
Abstract
Vision-Language Models (VLMs) excel at understanding single images, aided by
high-quality instruction datasets. However, multi-image reasoning remains
underexplored in the open-source community due to two key challenges: (1)
scaling datasets with correlated images and complex reasoning instructions is
resource-intensive, and (2) robust evaluation benchmarks for multi-image tasks
are lacking. To address these challenges, we introduce SMiR, a synthetic data-generation
pipeline for multi-image reasoning, along with a high-quality dataset generated
using this pipeline. SMiR efficiently extracts correlated images via multimodal
embeddings, integrates visual and descriptive information, and leverages
open-source LLMs to generate high-quality instructions. Using this approach, we
produce 160K synthetic training samples, offering a cost-effective alternative
to closed-source solutions. Additionally, we present SMiR-Bench, a multi-image
reasoning benchmark comprising 200 diverse examples across seven complex
reasoning tasks. SMiR-Bench is multi-turn and employs a VLM judge to evaluate
free-form responses, providing a comprehensive assessment of model
expressiveness and reasoning capability across modalities. We demonstrate the
effectiveness of SMiR by fine-tuning open-source VLMs and evaluating them on
SMiR-Bench.
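
The abstract says that correlated images are extracted via multimodal embeddings but does not say how. As a rough illustration only, the sketch below greedily groups images whose embedding vectors have high cosine similarity. The greedy nearest-neighbor strategy, the function names (`cosine_similarity`, `group_correlated`), and the hand-written toy vectors are all assumptions made for this example; they are not taken from the SMiR pipeline.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def group_correlated(embeddings, group_size):
    """Greedily partition image ids into groups of similar embeddings.

    Hypothetical helper: takes a dict {image_id: vector}, picks the first
    unassigned image as a seed, and attaches its most similar unassigned
    neighbors until the group has `group_size` members.
    """
    unassigned = list(embeddings)  # image ids in insertion order
    groups = []
    while unassigned:
        seed = unassigned.pop(0)
        # Rank the remaining images by similarity to the seed, best first.
        ranked = sorted(
            unassigned,
            key=lambda other: cosine_similarity(embeddings[seed], embeddings[other]),
            reverse=True,
        )
        members = [seed] + ranked[: group_size - 1]
        for member in members[1:]:
            unassigned.remove(member)
        groups.append(members)
    return groups


# Toy stand-ins for multimodal embeddings (assumed values, not real model output).
toy_embeddings = {
    "cat_1": [0.9, 0.1, 0.0],
    "car_1": [0.0, 0.1, 0.9],
    "cat_2": [0.8, 0.2, 0.1],
    "car_2": [0.1, 0.0, 0.95],
}

groups = group_correlated(toy_embeddings, group_size=2)
print(groups)
```

In a setting like the one the abstract describes, each resulting group of related images would then be handed, together with descriptive text, to an LLM that writes a multi-image instruction. A real pipeline would likely replace the quadratic greedy loop with an approximate nearest-neighbor index.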