ByDeWay: Boost Your multimodal LLM with DEpth prompting in a Training-Free Way
Journal: arXiv
Published Date: Jul 11, 2025
Abstract
We introduce ByDeWay, a training-free framework designed to enhance the
performance of Multimodal Large Language Models (MLLMs). ByDeWay uses a novel
prompting strategy called Layered-Depth-Based Prompting (LDP), which improves
spatial reasoning and grounding without modifying any model parameters. It
segments the scene into closest, mid-range, and farthest layers using monocular
depth estimation, then generates region-specific captions with a grounded
vision-language model. These structured, depth-aware captions are appended to
the image-question prompt, enriching it with spatial context. This guides MLLMs
to produce better-grounded responses with fewer hallucinations. Our method is
lightweight, modular, and compatible with black-box MLLMs. Experiments on
hallucination-sensitive (POPE) and reasoning-intensive (GQA) benchmarks show
consistent improvements across multiple MLLMs, validating the effectiveness of
depth-aware prompting in a zero-training setting.
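
The pipeline described above (depth layering, then per-region captions, then prompt augmentation) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that larger depth values mean farther from the camera, that the three layers are cut at equal-population tertiles, and that the captions have already been produced by some grounded vision-language model. The names `split_depth_layers` and `build_ldp_prompt` and the prompt wording are hypothetical and chosen for illustration only.

```python
from typing import Dict, List

# Layer names follow the abstract: closest, mid-range, farthest.
LAYERS = ("closest", "mid-range", "farthest")


def split_depth_layers(depth: List[List[float]]) -> List[List[str]]:
    """Label every pixel of a depth map with one of the three layer names.

    Assumption: larger values are farther away, and the cut points are the
    tertiles of the depth values (the source does not specify the thresholds).
    """
    flat = sorted(v for row in depth for v in row)
    n = len(flat)
    near_cut = flat[n // 3]        # boundary between closest and mid-range
    far_cut = flat[(2 * n) // 3]   # boundary between mid-range and farthest

    def label(v: float) -> str:
        if v < near_cut:
            return LAYERS[0]
        if v < far_cut:
            return LAYERS[1]
        return LAYERS[2]

    return [[label(v) for v in row] for row in depth]


def build_ldp_prompt(question: str, captions: Dict[str, str]) -> str:
    """Append depth-ordered region captions to the original question.

    `captions` maps a layer name to a caption string; producing those
    captions (the captioning model itself) is outside this sketch.
    """
    lines = [
        f"{name.capitalize()} region: {captions[name]}"
        for name in LAYERS
        if name in captions
    ]
    return question + "\n\nDepth-layered scene description:\n" + "\n".join(lines)


# Toy usage: a 3x3 depth map and hand-written captions.
toy_depth = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
toy_layers = split_depth_layers(toy_depth)
toy_prompt = build_ldp_prompt(
    "Is there a dog in the image?",
    {"closest": "a dog on grass", "mid-range": "a bench", "farthest": "trees"},
)
```

Because the augmented prompt is plain text, it can be sent together with the image to any MLLM that is only reachable through an API, which matches the black-box compatibility claimed above.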