Low-hallucination Synthetic Captions for Large-Scale Vision-Language Model Pre-training

Journal: arXiv

Published Date: Apr 17, 2025

Abstract

In recent years, the field of vision-language model pre-training has experienced rapid advancements, driven primarily by the continuous enhancement of textual capabilities in large language models. However, existing training paradigms for multimodal large language models heavily rely on high-quality image-text pairs. As models and data scales grow exponentially, the availability of such meticulously curated data has become increasingly scarce and saturated, thereby severely limiting further advancements in this domain. This study investigates scalable caption generation techniques for vision-language model pre-training and demonstrates that large-scale low-hallucination synthetic captions can serve dual purposes: 1) acting as a viable alternative to real-world data for pre-training paradigms and 2) achieving superior performance enhancement when integrated into vision-language models through empirical validation. This paper presents following key contributions: 1) a novel pipeline for generating high-quality, low-hallucination, and knowledge-rich synthetic captions. Our continuous DPO methodology yields remarkable results in reducing hallucinations. Specifically, the non-hallucination caption rate on a held-out test set increases from 48.3% to 77.9% for a 7B-size model. 2) Comprehensive empirical validation reveals that our synthetic captions confer superior pre-training advantages over their counterparts. Across 15 vision language tasks, the model trained with our data achieves a significant performance gain of at least 6.2% compared to identical images with alt-text. In 20 common cognitive domains, the model trained with our data outperforms the alt-text data by at least 7.5%. Meanwhile, it also offers considerable support in the text-to-image domain. With our dataset, the FID score is reduced by 17.1 on a real-world validation benchmark and 13.3 on the MSCOCO validation benchmark.

Authors

Xinsong Zhang
Yarong Zeng
Xinting Huang
Hu Hu
Runquan Xie
Han Hu
Zhanhui Kang

External Resources

View on arXiv arXiv (http://arxiv.org/abs/2504.13123v2)

Low-hallucination Synthetic Captions for Large-Scale Vision-Language Model Pre-training

Abstract

Authors

Categories

External Resources

Popular Topics

Recent Journals

Low-hallucination Synthetic Captions for Large-Scale Vision-Language Model Pre-training

Abstract

Authors

Categories

External Resources

Stay Ahead of Medical AI

Popular Topics

Recent Journals