DIMCIM: A Quantitative Evaluation Framework for Default-mode Diversity and Generalization in Text-to-Image Generative Models
Journal: arXiv
Published Date: Jun 5, 2025
Abstract
Recent text-to-image (T2I) models have achieved impressive quality and
consistency, but these gains have come at the cost of representation
diversity. While automatic evaluation methods exist for benchmarking model
diversity, they either require reference image datasets or lack specificity
about the kind of diversity measured, limiting their adaptability and
interpretability. To address this gap, we introduce the Does-it/Can-it
framework, DIMCIM, a reference-free measurement of default-mode diversity
("Does" the model generate images with expected attributes?) and generalization
capacity ("Can" the model generate diverse attributes for a particular
concept?). We construct the COCO-DIMCIM benchmark, which is seeded with COCO
concepts and captions and augmented by a large language model. With
COCO-DIMCIM, we find that widely used models improve in generalization at the
cost of default-mode diversity when scaling from 1.5B to 8.1B parameters.
DIMCIM also identifies fine-grained failure cases, such as attributes that are
generated with generic prompts but are rarely generated when explicitly
requested. Finally, we use DIMCIM to evaluate the training data of a T2I model
and observe a correlation of 0.85 between diversity in training images and
default-mode diversity. Our work provides a flexible and interpretable
framework for assessing T2I model diversity and generalization, enabling a more
comprehensive understanding of model performance.
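
The abstract does not give formal definitions of the two scores, so the following is only a minimal sketch of one plausible reading: "Does-it" as the fraction of images from a generic prompt that show a given attribute, and "Can-it" as the fraction of images from a prompt that explicitly requests the attribute and actually show it. The function name `attribute_rate`, the example prompts, and the per-image attribute sets (standing in for the output of some attribute detector) are all hypothetical and not taken from the paper.

```python
def attribute_rate(detections, attribute):
    """Fraction of images whose detected-attribute set contains `attribute`.

    `detections` holds one set of attribute labels per generated image.
    This is an assumed stand-in for a real attribute detector's output.
    """
    if not detections:
        return 0.0
    return sum(attribute in d for d in detections) / len(detections)


# Hypothetical detections for a generic prompt such as "a dog".
generic_prompt_images = [{"brown"}, {"brown"}, {"black"}, {"brown"}]
# Hypothetical detections for an explicit prompt such as "a white dog".
explicit_prompt_images = [{"white"}, {"brown"}, {"white"}, {"white"}]

# Assumed "Does-it" reading: how often the attribute appears by default.
does_it = attribute_rate(generic_prompt_images, "white")
# Assumed "Can-it" reading: how often the attribute appears when requested.
can_it = attribute_rate(explicit_prompt_images, "white")

print(f"does-it(white) = {does_it:.2f}, can-it(white) = {can_it:.2f}")
```

Under this reading, an attribute with a high rate for the generic prompt but a low rate for the explicit prompt would correspond to the kind of fine-grained failure case the abstract mentions.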