GroundCap: A Visually Grounded Image Captioning Dataset
Journal: arXiv
Published Date: Feb 19, 2025
Abstract
Current image captioning systems cannot link descriptive text to
specific visual elements, which makes their outputs difficult to verify. While
recent approaches offer some grounding capabilities, they cannot track object
identities across multiple references or ground both actions and objects
simultaneously. We propose a novel ID-based grounding system that enables
consistent object reference tracking and action-object linking, and present
GroundCap, a dataset containing 52,016 images from 77 movies, with 344
human-annotated and 52,016 automatically generated captions. Each caption is
grounded on detected objects (132 classes) and actions (51 classes) using a tag
system that maintains object identity while linking actions to the
corresponding objects. Our approach features persistent object IDs for
reference tracking, explicit action-object linking, and segmentation of
background elements through K-means clustering. We propose gMETEOR, a metric
combining caption quality with grounding accuracy, and establish baseline
performance by fine-tuning Pixtral-12B. Human evaluation demonstrates our
approach's effectiveness in producing verifiable descriptions with coherent
object references.
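
To make the ID-based tagging and the combined metric more concrete, the following Python sketch parses grounded captions, flags actions whose object ID never appears as an object mention, and combines a caption score with a grounding F1. The tag syntax (`<obj id=...>`, `<act id=... class=...>`), all function names, and the harmonic-mean combination are assumptions made for illustration only; the abstract does not specify the actual tag format or the gMETEOR formula.

```python
import re
from typing import List, NamedTuple, Optional

# Hypothetical tag syntax (an assumption, not the dataset's actual format):
#   <obj id="person-0">a man</obj>              object mention with a persistent ID
#   <act id="person-0" class="run">runs</act>   action linked to that object ID
TAG_RE = re.compile(
    r'<(obj|act) id="([a-z_]+-\d+)"(?: class="([a-z_]+)")?>(.*?)</\1>'
)


class Grounding(NamedTuple):
    kind: str               # "obj" or "act"
    object_id: str          # e.g. "person-0"
    action: Optional[str]   # action class for "act" tags, otherwise None
    text: str               # the caption span wrapped by the tag


def parse_groundings(caption: str) -> List[Grounding]:
    """Extract every grounded span from a tagged caption."""
    return [Grounding(*m.groups()) for m in TAG_RE.finditer(caption)]


def unlinked_actions(groundings: List[Grounding]) -> List[Grounding]:
    """Return actions whose object ID is never mentioned as an object."""
    known = {g.object_id for g in groundings if g.kind == "obj"}
    return [g for g in groundings if g.kind == "act" and g.object_id not in known]


def grounding_f1(predicted: List[Grounding], reference: List[Grounding]) -> float:
    """F1 over the sets of (kind, object ID, action) triples."""
    pred = {(g.kind, g.object_id, g.action) for g in predicted}
    ref = {(g.kind, g.object_id, g.action) for g in reference}
    true_positives = len(pred & ref)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(ref)
    return 2 * precision * recall / (precision + recall)


def combined_score(caption_score: float, grounding_score: float) -> float:
    """Harmonic mean of the two scores (assumed combination rule)."""
    total = caption_score + grounding_score
    if total == 0:
        return 0.0
    return 2 * caption_score * grounding_score / total


if __name__ == "__main__":
    predicted = parse_groundings(
        '<obj id="person-0">A man</obj> '
        '<act id="person-0" class="run">runs</act> past '
        '<obj id="car-0">a car</obj> while <obj id="person-0">he</obj> waves.'
    )
    reference = parse_groundings(
        '<obj id="person-0">A man</obj> '
        '<act id="person-0" class="run">runs</act>.'
    )
    f1 = grounding_f1(predicted, reference)
    # 0.5 stands in for a caption-quality score computed elsewhere.
    print(f1, combined_score(0.5, f1))
```

In this example the predicted caption refers to `person-0` twice, so both mentions resolve to the same object, and the extra `car-0` grounding lowers precision, giving a grounding F1 of roughly 0.8. A harmonic mean is used here because it stays low unless both the caption score and the grounding score are high; whether the actual metric combines them this way is not stated in the abstract.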