Hallucination at a Glance: Controlled Visual Edits and Fine-Grained Multimodal Learning
Journal: arXiv
Published Date: Jun 8, 2025
Abstract
Multimodal large language models (MLLMs) have achieved strong performance on
vision-language tasks but still struggle with fine-grained visual differences,
leading to hallucinations or missed semantic shifts. We attribute this to
limitations in both training data and learning objectives. To address these
issues, we propose a controlled data generation pipeline that produces
minimally edited image pairs with semantically aligned captions. Using this
pipeline, we construct the Micro Edit Dataset (MED), containing over 50K
image-text pairs spanning 11 fine-grained edit categories, including attribute,
count, position, and object presence changes. Building on MED, we introduce a
supervised fine-tuning (SFT) framework with a feature-level consistency loss
that promotes stable visual embeddings under small edits. We evaluate our
approach on the Micro Edit Detection benchmark, which includes carefully
balanced evaluation pairs designed to test sensitivity to subtle visual
variations across the same edit categories. Our method improves difference
detection accuracy and reduces hallucinations compared to strong baselines,
including GPT-4o. Moreover, it yields consistent gains on standard
vision-language tasks such as image captioning and visual question answering.
These results demonstrate the effectiveness of combining targeted data and
alignment objectives for enhancing fine-grained visual reasoning in MLLMs.
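
The feature-level consistency loss mentioned above can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so everything here is an assumption made for illustration: the cosine-distance form of the penalty, the function names `consistency_loss` and `total_loss`, and the `weight` hyperparameter are all hypothetical. The idea shown is only that the SFT objective is combined with a term that penalizes drift between the visual embeddings of an original image and its minimally edited counterpart.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Small epsilon guards against division by zero for all-zero vectors.
    return dot / (norm_u * norm_v + 1e-12)

def consistency_loss(feats_orig, feats_edit):
    """Mean cosine distance over (original, edited) embedding pairs.

    Assumption: cosine distance is used as the consistency measure;
    the paper's actual loss may differ.
    """
    if len(feats_orig) != len(feats_edit):
        raise ValueError("expected one edited embedding per original embedding")
    if not feats_orig:
        return 0.0
    distances = [
        1.0 - cosine_similarity(u, v)
        for u, v in zip(feats_orig, feats_edit)
    ]
    return sum(distances) / len(distances)

def total_loss(sft_loss, feats_orig, feats_edit, weight=0.1):
    """SFT loss plus a weighted consistency term (weight is hypothetical)."""
    return sft_loss + weight * consistency_loss(feats_orig, feats_edit)
```

In this sketch, embeddings that stay identical under an edit contribute nothing to the penalty, while embeddings that diverge raise the total loss in proportion to `weight`. How such a term is balanced against the need to remain sensitive to the edit itself is a design question the abstract does not detail.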