DiscoSG: Towards Discourse-Level Text Scene Graph Parsing through Iterative Graph Refinement
Journal: arXiv
Published Date: Jun 18, 2025
Abstract
Vision-Language Models (VLMs) now generate discourse-level, multi-sentence
visual descriptions, challenging text scene graph parsers originally designed
for single-sentence caption-to-graph mapping. Current approaches typically
merge sentence-level parsing outputs for discourse input, often missing
phenomena like cross-sentence coreference, resulting in fragmented graphs and
degraded downstream VLM task performance. To address this, we introduce a new
task, Discourse-level text Scene Graph parsing (DiscoSG), supported by our
dataset DiscoSG-DS, which comprises 400 expert-annotated and 8,430 synthesised
multi-sentence caption-graph pairs for images. Each caption averages 9
sentences, and each graph contains at least 3 times more triples than those in
existing datasets. While fine-tuning large PLMs (i.e., GPT-4) on DiscoSG-DS
improves SPICE by approximately 48% over the best sentence-merging baseline,
high inference cost and restrictive licensing hinder their open-source use, and
smaller fine-tuned PLMs struggle with complex graphs. We propose
DiscoSG-Refiner, which drafts a base graph using one small PLM, then employs a
second PLM to iteratively propose graph edits, reducing full-graph generation
overhead. Using two Flan-T5-Base models, DiscoSG-Refiner still improves SPICE
by approximately 30% over the best baseline while achieving 86 times faster
inference than GPT-4. It also consistently improves downstream VLM tasks like
discourse-level caption evaluation and hallucination detection. Code and data
are available at: https://github.com/ShaoqLin/DiscoSG
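
The draft-then-refine idea described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration rather than the released DiscoSG-Refiner code: the triple representation, the "add"/"delete" edit vocabulary, the stopping rule, and the names refine_graph, toy_draft, and toy_propose_edits are hypothetical. The two toy functions are hand-written stand-ins for the two small PLMs and only mimic a cross-sentence coreference fix.

```python
from typing import Callable, List, Set, Tuple

Triple = Tuple[str, str, str]   # (subject, relation, object)
Edit = Tuple[str, Triple]       # ("add" or "delete", triple); assumed edit format


def refine_graph(
    caption: str,
    draft: Callable[[str], Set[Triple]],
    propose_edits: Callable[[str, Set[Triple]], List[Edit]],
    max_rounds: int = 3,
) -> Set[Triple]:
    """Draft a base graph once, then apply proposed edits for a few rounds."""
    graph = set(draft(caption))
    for _ in range(max_rounds):
        edits = propose_edits(caption, graph)
        if not edits:
            break  # no further edits proposed: treat the graph as converged
        for op, triple in edits:
            if op == "add":
                graph.add(triple)
            elif op == "delete":
                graph.discard(triple)
    return graph


def toy_draft(caption: str) -> Set[Triple]:
    # Stand-in for the drafting model: behaves like a sentence-level parse
    # that leaves the pronoun "it" unresolved across sentences.
    return {("dog", "on", "sofa"), ("it", "wear", "collar")}


def toy_propose_edits(caption: str, graph: Set[Triple]) -> List[Edit]:
    # Stand-in for the editing model: replaces the unresolved pronoun triple.
    unresolved = ("it", "wear", "collar")
    if unresolved in graph:
        return [("delete", unresolved), ("add", ("dog", "wear", "collar"))]
    return []


caption = "A dog is lying on a sofa. It is wearing a collar."
final_graph = refine_graph(caption, toy_draft, toy_propose_edits)
```

In this toy run the second round proposes no edits, so the loop stops early. Emitting a handful of edits per round, rather than regenerating the whole graph, is the kind of saving the abstract attributes to the refiner.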