Evaluating Image Caption via Cycle-consistent Text-to-Image Generation
Journal: arXiv
Published Date: Jan 7, 2025
Abstract
Evaluating image captions typically relies on reference captions, which are
costly to obtain and exhibit significant diversity and subjectivity. While
reference-free evaluation metrics have been proposed, most focus on cross-modal
evaluation between captions and images. Recent research has revealed that a
modality gap generally exists in the representations of contrastive
learning-based multi-modal systems, undermining the reliability of
cross-modality metrics such as CLIPScore. In this paper, we propose CAMScore, a
cyclic reference-free automatic evaluation metric for image captioning models.
To circumvent the aforementioned modality gap, CAMScore utilizes a
text-to-image model to generate images from captions and subsequently evaluates
these generated images against the original images. Furthermore, to provide
fine-grained information for a more comprehensive evaluation, we design a
three-level evaluation framework for CAMScore that encompasses pixel-level,
semantic-level, and objective-level perspectives. Extensive experimental results
across multiple benchmark datasets show that CAMScore achieves a superior
correlation with human judgments compared to existing reference-based and
reference-free metrics, demonstrating the effectiveness of the framework.
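
The cyclic idea described in the abstract (caption to generated image, then image-to-image comparison at several levels) can be illustrated with a minimal sketch. This is not the paper's implementation: the abstract does not name the text-to-image model, the per-level measures, or how the levels are combined. Everything below is an assumption made for illustration: the function names (`cycle_score`, `pixel_similarity`, `label_overlap`), the hypothetical `generate`, `embed`, and `detect` callables, the choice of mean absolute difference, cosine similarity, and Jaccard overlap as stand-ins for the three levels, and the equal-weight average. Images are represented as flat lists of intensities in [0, 1] to keep the example dependency-free.

```python
import math
from typing import Callable, Dict, Sequence, Set

Image = Sequence[float]  # flattened pixel intensities in [0, 1] (toy stand-in)


def pixel_similarity(a: Image, b: Image) -> float:
    """Assumed pixel-level measure: 1 minus the mean absolute difference."""
    if len(a) != len(b):
        raise ValueError("images must have the same number of pixels")
    return 1.0 - sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Assumed semantic-level measure: cosine similarity of two embeddings."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def label_overlap(a: Set[str], b: Set[str]) -> float:
    """Assumed third-level measure: Jaccard overlap of detected labels."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cycle_score(
    caption: str,
    image: Image,
    generate: Callable[[str], Image],          # hypothetical text-to-image model
    embed: Callable[[Image], Sequence[float]],  # hypothetical image encoder
    detect: Callable[[Image], Set[str]],        # hypothetical object detector
) -> Dict[str, float]:
    """Generate an image from the caption, then compare it to the original.

    Both sides of every comparison are images, so no text-image
    embedding comparison is involved.
    """
    generated = generate(caption)
    scores = {
        "pixel": pixel_similarity(image, generated),
        "semantic": cosine_similarity(embed(image), embed(generated)),
        "object": label_overlap(detect(image), detect(generated)),
    }
    # Equal weighting is an arbitrary choice for this sketch.
    scores["combined"] = sum(scores.values()) / 3.0
    return scores


# Toy stand-ins so the sketch runs without any model.
original = [0.0, 0.5, 1.0, 0.5]
result = cycle_score(
    caption="a dog playing with a ball",
    image=original,
    generate=lambda text: [0.0, 0.5, 1.0, 0.0],
    embed=lambda img: list(img),
    detect=lambda img: {"dog", "ball"} if img[3] > 0 else {"dog"},
)
print(result)
```

In practice the three callables would wrap real models, and the per-level scores could be reported separately rather than averaged, which matches the abstract's emphasis on fine-grained information.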