Multimodal Reference Visual Grounding
Journal: arXiv
Published Date: Apr 2, 2025
Abstract
Visual grounding focuses on detecting objects from images based on language
expressions. Recent Large Vision-Language Models (LVLMs) have significantly
advanced visual grounding performance by training large models with large-scale
datasets. However, the problem remains challenging, especially when similar
objects appear in the input image. For example, an LVLM may not be able to
distinguish Diet Coke from regular Coke in an image. In such cases, additional
reference images of Diet Coke and regular Coke, when available, can help ground
the similar objects.
In this work, we introduce a new task named Multimodal Reference Visual
Grounding (MRVG). In this task, a model has access to a set of reference images
of objects in a database. Based on these reference images and a language
expression, the model is required to detect a target object from a query image.
We first introduce a new dataset to study the MRVG problem. Then we introduce a
novel method, named MRVG-Net, to solve this visual grounding problem. We show
that by efficiently using reference images with few-shot object detection and
using Large Language Models (LLMs) for object matching, our method achieves
superior visual grounding performance compared to state-of-the-art LVLMs
such as Qwen2.5-VL-7B. Our approach bridges the gap between few-shot detection
and visual grounding, unlocking new capabilities for visual understanding.
Project page with our code and dataset:
https://irvlutd.github.io/MultiGrounding
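
To make the task setup concrete, the following is a minimal sketch of an MRVG-style pipeline, not the MRVG-Net implementation. It assumes that candidate regions in the query image and the reference objects are already encoded as feature vectors. Every name here (Candidate, label_candidates, keyword_matcher, ground) is hypothetical, the cosine-similarity labeling merely stands in for few-shot object detection, and the keyword overlap merely stands in for LLM-based object matching.

```python
from dataclasses import dataclass
from math import sqrt
from typing import Callable, Dict, List, Optional, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


@dataclass
class Candidate:
    """A region proposed in the query image (hypothetical structure)."""
    box: Box
    feature: Sequence[float]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def label_candidates(
    candidates: List[Candidate],
    references: Dict[str, Sequence[float]],
) -> List[Tuple[str, Box]]:
    """Label each candidate with the name of its most similar reference.

    Assumption: a stand-in for few-shot detection against the database.
    """
    labeled = []
    for cand in candidates:
        name = max(references, key=lambda n: cosine(cand.feature, references[n]))
        labeled.append((name, cand.box))
    return labeled


def keyword_matcher(expression: str, names: List[str]) -> Optional[int]:
    """Toy stand-in for an LLM: pick the name sharing the most words."""
    words = set(expression.lower().split())
    scores = [len(words & set(name.lower().split())) for name in names]
    best = max(range(len(names)), key=lambda i: scores[i], default=None)
    if best is None or scores[best] == 0:
        return None
    return best


def ground(
    expression: str,
    candidates: List[Candidate],
    references: Dict[str, Sequence[float]],
    matcher: Callable[[str, List[str]], Optional[int]] = keyword_matcher,
) -> Optional[Box]:
    """Return the box of the candidate that matches the expression, if any."""
    if not candidates or not references:
        return None
    labeled = label_candidates(candidates, references)
    index = matcher(expression, [name for name, _ in labeled])
    return labeled[index][1] if index is not None else None


if __name__ == "__main__":
    # Made-up 2-D features purely for illustration.
    refs = {"diet coke": [1.0, 0.0], "regular coke": [0.0, 1.0]}
    cands = [
        Candidate(box=(0, 0, 10, 10), feature=[0.1, 0.9]),
        Candidate(box=(20, 0, 30, 10), feature=[0.9, 0.2]),
    ]
    print(ground("the diet coke can", cands, refs))
```

The matcher is passed in as a function so that the keyword heuristic could be swapped for a call to a language model without changing the rest of the sketch.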