SMIR: Efficient Synthetic Data Pipeline To Improve Multi-Image Reasoning

Journal: arXiv

Published Date: Jan 7, 2025

Abstract

Vision-Language Models (VLMs) excel at understanding single images, aided by high-quality instruction datasets. However, multi-image reasoning remains underexplored in the open-source community due to two key challenges: (1) scaling datasets with correlated images and complex reasoning instructions is resource-intensive, and (2) robust evaluation benchmarks for multi-image tasks are lacking. To address these challenges, we introduce SMiR, a synthetic data-generation pipeline for multi-image reasoning, along with a high-quality dataset generated using this pipeline. SMiR efficiently extracts correlated images via multimodal embeddings, integrates visual and descriptive information, and leverages open-source LLMs to generate quality instructions. Using this approach, we produce 160K synthetic training samples, offering a cost-effective alternative to closed-source solutions. Additionally, we present SMiR-Bench, a multi-image reasoning benchmark comprising 200 diverse examples across seven complex reasoning tasks. SMiR-Bench is multi-turn and employs a VLM judge to evaluate free-form responses, providing a comprehensive assessment of model expressiveness and reasoning capability across modalities. We demonstrate the effectiveness of SMiR by fine-tuning open-source VLMs and evaluating them on SMiR-Bench.
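
The abstract mentions extracting correlated images via multimodal embeddings but does not spell out the procedure. As a rough illustration only, the sketch below assumes each image already has an embedding vector and groups each image with its nearest neighbors by cosine similarity. The function names, the toy vectors, and the choice of cosine similarity are all assumptions made for this example, not details taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def correlated_groups(embeddings, k=2):
    """Map each image id to its k most similar other images.

    `embeddings` is a dict of image id -> embedding vector. This helper
    is hypothetical and only illustrates similarity-based grouping.
    """
    groups = {}
    for name, vec in embeddings.items():
        scored = [
            (cosine(vec, other_vec), other_name)
            for other_name, other_vec in embeddings.items()
            if other_name != name
        ]
        scored.sort(reverse=True)  # highest similarity first
        groups[name] = [other_name for _, other_name in scored[:k]]
    return groups


# Toy 3-dimensional embeddings (made up for illustration).
toy = {
    "cat_1": [0.9, 0.1, 0.0],
    "cat_2": [0.8, 0.2, 0.1],
    "car_1": [0.0, 0.1, 0.9],
    "car_2": [0.1, 0.0, 0.8],
}

groups = correlated_groups(toy, k=1)
```

With these toy vectors, each cat image is paired with the other cat image and each car image with the other car image. A real pipeline would obtain the vectors from a multimodal encoder and would likely use an approximate nearest-neighbor index in place of the quadratic loop shown here.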

Authors

Andrew Li
Rahul Thapa
Rahul Chalamala
Qingyang Wu
Kezhen Chen
James Zou

External Resources

arXiv: http://arxiv.org/abs/2501.03675v2
