Stop learning it all to mitigate visual hallucination, Focus on the hallucination target
Journal: arXiv
Published Date: Jun 13, 2025
Abstract
Multimodal Large Language Models (MLLMs) frequently suffer from hallucination
issues, generating information about objects that are not present in input
images during vision-language tasks. These hallucinations particularly
undermine model reliability in practical applications requiring accurate object
identification. To address this challenge, we propose a preference
learning approach that mitigates hallucinations by focusing on targeted areas
where they occur. To implement this, we build a dataset containing hallucinated
responses, correct responses, and target information (i.e., objects present in
the images and the corresponding chunk positions in responses affected by
hallucinations). By applying a preference learning method restricted to these
specific targets, the model can filter out irrelevant signals and focus on
correcting hallucinations. This allows the model to produce more factual
responses by concentrating solely on relevant information. Experimental results
demonstrate that the proposed method effectively reduces hallucinations across
multiple vision hallucination tasks, improving the reliability of MLLMs without
degrading their overall performance.
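
The idea of restricting preference learning to hallucination targets can be illustrated with a minimal sketch. The abstract does not give the objective, so this sketch assumes a DPO-style loss in which per-token log-probability ratios are summed only over tokens inside the annotated target chunks. The record layout (PreferencePair), the function names, and the beta value are hypothetical and are not taken from the paper.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class PreferencePair:
    """Hypothetical training record: per-token log-probs plus target masks.

    A mask value of 1 marks a token inside a chunk affected by (or correcting)
    a hallucination; 0 marks a token outside the target.
    """
    chosen_logps: List[float]        # policy log-probs, correct response
    chosen_ref_logps: List[float]    # reference-model log-probs, correct response
    chosen_mask: List[int]
    rejected_logps: List[float]      # policy log-probs, hallucinated response
    rejected_ref_logps: List[float]  # reference-model log-probs, hallucinated response
    rejected_mask: List[int]


def masked_log_ratio(policy: List[float], ref: List[float], mask: List[int]) -> float:
    """Sum of (policy - reference) log-probs over target tokens only."""
    return sum(m * (p - r) for p, r, m in zip(policy, ref, mask))


def targeted_preference_loss(pair: PreferencePair, beta: float = 0.1) -> float:
    """Assumed DPO-style loss, -log(sigmoid(margin)), restricted to target tokens."""
    chosen = masked_log_ratio(pair.chosen_logps, pair.chosen_ref_logps, pair.chosen_mask)
    rejected = masked_log_ratio(pair.rejected_logps, pair.rejected_ref_logps, pair.rejected_mask)
    margin = beta * (chosen - rejected)
    # Numerically stable softplus(-margin), which equals -log(sigmoid(margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Toy example: only the middle token of each response lies in a target chunk.
example = PreferencePair(
    chosen_logps=[-1.0, -0.5, -2.0],
    chosen_ref_logps=[-1.0, -1.0, -2.0],
    chosen_mask=[0, 1, 0],
    rejected_logps=[-1.0, -3.0, -0.1],
    rejected_ref_logps=[-1.0, -2.0, -2.0],
    rejected_mask=[0, 1, 0],
)
print(targeted_preference_loss(example))
```

Under this assumption, tokens outside the target chunks contribute nothing to the loss, so they send no gradient signal. That is one way to read the claim that the model filters out irrelevant signals.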