Black-Box Visual Prompt Engineering for Mitigating Object Hallucination in Large Vision Language Models

Journal: arXiv
Published Date:

Abstract

Large Vision Language Models (LVLMs) often suffer from object hallucination, which undermines their reliability. Surprisingly, we find that simple object-based visual prompting -- overlaying visual cues (e.g., a bounding box or circle) on images -- can significantly mitigate such hallucination; however, different visual prompts (VPs) vary in effectiveness. To address this variability, we propose Black-Box Visual Prompt Engineering (BBVPE), a framework to identify optimal VPs that enhance LVLM responses without needing access to model internals. Our approach employs a pool of candidate VPs and trains a router model to dynamically select the most effective VP for a given input image. This black-box approach is model-agnostic, making it applicable to both open-source and proprietary LVLMs. Evaluations on benchmarks such as POPE and CHAIR demonstrate that BBVPE effectively reduces object hallucination.
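
The select-then-overlay idea described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the image representation (a grid of grayscale values), the two candidate prompts, the names `VP_POOL`, `route`, and `apply_best_prompt`, and the `scorer` callable standing in for the trained router are all hypothetical, and the object box is assumed to be given.

```python
from typing import Callable, Dict, List, Tuple

Image = List[List[int]]          # grayscale image as rows of pixel values (assumption)
Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), inclusive pixel coordinates
Prompt = Callable[[Image, Box], Image]

MARK = 255  # pixel value used to draw the visual cue


def draw_box(image: Image, box: Box) -> Image:
    """Return a copy of the image with a rectangle outline drawn around the box."""
    x0, y0, x1, y1 = box
    out = [row[:] for row in image]
    for x in range(x0, x1 + 1):
        out[y0][x] = MARK
        out[y1][x] = MARK
    for y in range(y0, y1 + 1):
        out[y][x0] = MARK
        out[y][x1] = MARK
    return out


def draw_corners(image: Image, box: Box) -> Image:
    """Return a copy of the image with only the four box corners marked."""
    x0, y0, x1, y1 = box
    out = [row[:] for row in image]
    for x, y in ((x0, y0), (x1, y0), (x0, y1), (x1, y1)):
        out[y][x] = MARK
    return out


# Hypothetical pool of candidate visual prompts (VPs).
VP_POOL: Dict[str, Prompt] = {"box": draw_box, "corners": draw_corners}


def route(scores: Dict[str, float]) -> str:
    """Pick the candidate VP with the highest router score."""
    return max(scores, key=lambda name: scores[name])


def apply_best_prompt(
    image: Image,
    box: Box,
    scorer: Callable[[Image], Dict[str, float]],
) -> Tuple[str, Image]:
    """Score every VP for this image, then overlay the top-scoring one.

    `scorer` is a stand-in for a trained router; only its outputs are used,
    so the downstream LVLM can remain a black box.
    """
    name = route(scorer(image))
    return name, VP_POOL[name](image, box)
```

In this sketch the prompted image returned by `apply_best_prompt` would be sent to the LVLM in place of the original; how the router is trained and which VPs belong in the pool are left open here.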

Authors

  • Sangmin Woo
  • Kang Zhou
  • Yun Zhou
  • Shuai Wang
  • Sheng Guan
  • Haibo Ding
  • Lin Lee Cheong