Can We Challenge Open-Vocabulary Object Detectors with Generated Content in Street Scenes?
Journal:
arXiv
Published Date:
Jun 30, 2025
Abstract
Open-vocabulary object detectors such as Grounding DINO are trained on vast
and diverse data, achieving remarkable performance on challenging datasets. Due
to that, it is unclear where to find their limitations, which is of major
concern when using in safety-critical applications. Real-world data does not
provide sufficient control, required for a rigorous evaluation of model
generalization. In contrast, synthetically generated data allows to
systematically explore the boundaries of model competence/generalization. In
this work, we address two research questions: 1) Can we challenge
open-vocabulary object detectors with generated image content? 2) Can we find
systematic failure modes of those models? To address these questions, we design
two automated pipelines using stable diffusion to inpaint unusual objects with
high diversity in semantics, by sampling multiple substantives from WordNet and
ChatGPT. On the synthetically generated data, we evaluate and compare multiple
open-vocabulary object detectors as well as a classical object detector. The
synthetic data is derived from two real-world datasets, namely LostAndFound, a
challenging out-of-distribution (OOD) detection benchmark, and the NuImages
dataset. Our results indicate that inpainting can challenge open-vocabulary
object detectors in terms of overlooking objects. Additionally, we find a
strong dependence of open-vocabulary models on object location, rather than on
object semantics. This provides a systematic approach to challenge
open-vocabulary models and gives valuable insights on how data could be
acquired to effectively improve these models.