GroundCap: A Visually Grounded Image Captioning Dataset

Journal: arXiv

Published Date: Feb 19, 2025

Abstract

Current image captioning systems lack the ability to link descriptive text to specific visual elements, making their outputs difficult to verify. While recent approaches offer some grounding capabilities, they cannot track object identities across multiple references or ground both actions and objects simultaneously. We propose a novel ID-based grounding system that enables consistent object reference tracking and action-object linking, and present GroundCap, a dataset containing 52,016 images from 77 movies, with 344 human-annotated and 52,016 automatically generated captions. Each caption is grounded on detected objects (132 classes) and actions (51 classes) using a tag system that maintains object identity while linking actions to the corresponding objects. Our approach features persistent object IDs for reference tracking, explicit action-object linking, and segmentation of background elements through K-means clustering. We propose gMETEOR, a metric combining caption quality with grounding accuracy, and establish baseline performance by fine-tuning Pixtral-12B. Human evaluation demonstrates our approach's effectiveness in producing verifiable descriptions with coherent object references.
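The ID-based grounding idea in the abstract can be illustrated with a minimal sketch. The tag syntax below (`<obj id="...">` and `<act by="...">`) and the function names are hypothetical, chosen only for illustration; they are not GroundCap's actual markup. The sketch shows how a persistent object ID lets several mentions of the same object be grouped, and how an action can be linked back to the object performing it.

```python
import re
from collections import defaultdict

# Hypothetical tag syntax, for illustration only (not GroundCap's actual markup):
#   <obj id="person-0">a man</obj>   an object mention carrying a persistent ID
#   <act by="person-0">runs</act>    an action linked to the object performing it
OBJ_TAG = re.compile(r'<obj id="([^"]+)">(.*?)</obj>')
ACT_TAG = re.compile(r'<act by="([^"]+)">(.*?)</act>')


def parse_grounded_caption(caption):
    """Return (mentions, actions), each mapping an object ID to a list of text spans."""
    mentions = defaultdict(list)
    actions = defaultdict(list)
    for obj_id, text in OBJ_TAG.findall(caption):
        mentions[obj_id].append(text)
    for obj_id, text in ACT_TAG.findall(caption):
        actions[obj_id].append(text)
    return dict(mentions), dict(actions)


def strip_tags(caption):
    """Remove the grounding tags, leaving the plain caption text."""
    return re.sub(r'</?(?:obj|act)[^>]*>', '', caption)


caption = (
    '<obj id="person-0">A man</obj> <act by="person-0">walks</act> toward '
    '<obj id="car-0">a car</obj> while <obj id="person-0">he</obj> '
    '<act by="person-0">waves</act>.'
)
mentions, actions = parse_grounded_caption(caption)
plain = strip_tags(caption)
```

Here both "A man" and "he" resolve to the same ID (`person-0`), which is what makes repeated references verifiable against a single detected object, and both actions are attributed to that same object.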

Authors

Daniel A. P. Oliveira
Lourenço Teodoro
David Martins de Matos

External Resources

View on arXiv (http://arxiv.org/abs/2502.13898v2)
