Black-Box Visual Prompt Engineering for Mitigating Object Hallucination in Large Vision Language Models
Journal:
arXiv
Published Date:
Apr 30, 2025
Abstract
Large Vision Language Models (LVLMs) often suffer from object hallucination,
which undermines their reliability. Surprisingly, we find that simple
object-based visual prompting -- overlaying visual cues (e.g., bounding box,
circle) on images -- can significantly mitigate such hallucination; however,
different visual prompts (VPs) vary in effectiveness. To address this, we
propose Black-Box Visual Prompt Engineering (BBVPE), a framework to identify
optimal VPs that enhance LVLM responses without needing access to model
internals. Our approach employs a pool of candidate VPs and trains a router
model to dynamically select the most effective VP for a given input image. This
black-box approach is model-agnostic, making it applicable to both open-source
and proprietary LVLMs. Evaluations on benchmarks such as POPE and CHAIR
demonstrate that BBVPE effectively reduces object hallucination.