Re-Align: Aligning Vision Language Models via Retrieval-Augmented Direct Preference Optimization
Journal: arXiv
Published Date: Feb 18, 2025
Abstract
The emergence of large Vision Language Models (VLMs) has broadened the scope
and capabilities of single-modal Large Language Models (LLMs) by integrating
visual modalities, thereby unlocking transformative cross-modal applications in
a variety of real-world scenarios. Despite their impressive performance, VLMs
are prone to significant hallucinations, particularly in the form of
cross-modal inconsistencies. Building on the success of Reinforcement Learning
from Human Feedback (RLHF) in aligning LLMs, recent work has focused
on applying direct preference optimization (DPO) to carefully curated datasets
to mitigate these issues. Yet, such approaches typically introduce preference
signals in a brute-force manner, neglecting the crucial role of visual
information in the alignment process. In this paper, we introduce Re-Align, a
novel alignment framework that leverages image retrieval to construct a
dual-preference dataset, effectively incorporating both textual and visual
preference signals. We further introduce rDPO, an extension of standard
direct preference optimization that incorporates an additional visual
preference objective during fine-tuning. Our experimental results demonstrate
that Re-Align not only mitigates hallucinations more effectively than previous
methods but also yields significant performance gains in general visual
question-answering (VQA) tasks. Moreover, we show that Re-Align maintains
robustness and scalability across a wide range of VLM sizes and architectures.
This work represents a significant step forward in aligning multimodal LLMs,
paving the way for more reliable and effective cross-modal applications. We
release all the code at https://github.com/taco-group/Re-Align.
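
To make the idea of a dual-preference objective concrete, the following is a minimal sketch, not the authors' implementation. The function dpo_loss is the standard DPO loss for a single preference pair. The function dual_preference_loss is a hypothetical illustration of how a visual preference term could be added: it assumes the visual pair scores the same chosen response once conditioned on the original image and once on a retrieved image, and that the two terms are simply summed with a weight. The abstract does not specify the exact form of rDPO, so the names dual_preference_loss, visual_weight, and the tuple layout are all assumptions; see the released code for the actual objective.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(logp_w: float, logp_l: float,
             ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one (preferred, dispreferred) pair.

    logp_* are sequence log-probabilities under the policy being trained,
    ref_logp_* are the same quantities under the frozen reference model.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -log_sigmoid(margin)


def dual_preference_loss(text_pair, visual_pair,
                         beta: float = 0.1,
                         visual_weight: float = 1.0) -> float:
    """Hypothetical dual-preference loss (an assumption, not the paper's formula).

    text_pair:   (logp_w, logp_l, ref_logp_w, ref_logp_l) for a chosen vs.
                 rejected response, both conditioned on the same image.
    visual_pair: the same four quantities, but for one response conditioned
                 on the original image (w) vs. a retrieved image (l).
    """
    text_term = dpo_loss(*text_pair, beta=beta)
    visual_term = dpo_loss(*visual_pair, beta=beta)
    return text_term + visual_weight * visual_term


if __name__ == "__main__":
    # Illustrative log-probabilities only; these are not results from the paper.
    text_pair = (-12.0, -15.0, -13.0, -14.0)
    visual_pair = (-12.0, -18.0, -13.0, -16.0)
    print(dual_preference_loss(text_pair, visual_pair))
```

In this sketch, setting visual_weight to 0 recovers the plain DPO loss on the textual pair, which makes it easy to compare the two objectives under the same assumptions.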