Feedback-Driven Vision-Language Alignment with Minimal Human Supervision
Journal: arXiv
Published Date: Jan 8, 2025
Abstract
Vision-language models (VLMs) have demonstrated remarkable potential in
integrating visual and linguistic information, but their performance is often
constrained by the need for extensive, high-quality image-text training data.
Curation of these image-text pairs is both time-consuming and computationally
expensive. To address this challenge, we introduce SVP (Sampling-based Visual
Projection), a novel framework that enhances vision-language alignment without
relying on manually curated image-text pairs or preference annotations. SVP
combines a small set of manually selected images with self-captioning, and
uses a pre-trained grounding model as a feedback mechanism to elicit latent
information in VLMs. We evaluate our approach across six key areas: captioning,
referring, visual question answering, multitasking, hallucination control, and
object recall. Results demonstrate significant improvements, including a 14%
average improvement in captioning tasks, an increase of up to 12% in object
recall, and significantly fewer hallucinations, while maintaining question-answering
capabilities. Using SVP, a small VLM achieves hallucination reductions similar
to a model five times larger, while a VLM with initially poor referring
capabilities more than doubles its performance, approaching parity with a model
twice its size.
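
The abstract describes SVP only at a high level, so the following is a minimal sketch of one way a "self-captioning plus grounding feedback" loop could be organized, not the authors' implementation. It assumes the VLM samples several candidate captions per image and that a grounding model returns a scalar score for how well a caption is supported by the image. The names `select_captions_by_feedback`, `sample_captions`, `grounding_score`, and the toy stand-ins are all hypothetical and do not come from the paper.

```python
from typing import Callable, List, Sequence, Tuple


def select_captions_by_feedback(
    images: Sequence[str],
    sample_captions: Callable[[str, int], List[str]],
    grounding_score: Callable[[str, str], float],
    num_samples: int = 4,
) -> List[Tuple[str, str, float]]:
    """For each image, sample candidate captions and keep the highest-scoring one.

    Both callables are hypothetical stand-ins: `sample_captions` plays the role
    of the VLM describing an image, and `grounding_score` plays the role of a
    pre-trained grounding model rating how well a caption matches the image.
    """
    selected = []
    for image in images:
        candidates = sample_captions(image, num_samples)
        scored = [(grounding_score(image, caption), caption) for caption in candidates]
        # Keep the caption with the highest feedback score (first one on ties).
        best_score, best_caption = max(scored, key=lambda pair: pair[0])
        selected.append((image, best_caption, best_score))
    return selected


# Toy stand-ins so the sketch runs without any model (assumptions, not real APIs).
def toy_sampler(image: str, n: int) -> List[str]:
    pool = ["a dog on grass", "a cat", "a dog", "an animal outdoors"]
    return pool[:n]


def toy_scorer(image: str, caption: str) -> float:
    # Pretend the grounding model detected these objects in the image.
    detected = {"img_dog.jpg": {"dog", "grass"}}.get(image, set())
    mentioned = set(caption.split())
    return len(mentioned & detected) / max(len(detected), 1)


result = select_captions_by_feedback(["img_dog.jpg"], toy_sampler, toy_scorer)
print(result)
```

In a real pipeline the selected image-caption pairs might then serve as training data for the VLM; whether SVP selects, reweights, or otherwise uses the feedback is not stated in the abstract, so that step is left out here.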