Towards Evaluating Robustness of Prompt Adherence in Text to Image Models
Journal:
arXiv
Published Date:
Jul 9, 2025
Abstract
The advancements in the domain of LLMs in recent years have surprised many,
showcasing their remarkable capabilities and diverse applications. Their
potential applications in various real-world scenarios have led to significant
research on their reliability and effectiveness. On the other hand, multimodal
LLMs and Text-to-Image models have only recently gained prominence, especially
when compared to text-only LLMs. Their reliability remains constrained due to
insufficient research on assessing their performance and robustness. This paper
aims to establish a comprehensive evaluation framework for Text-to-Image
models, concentrating particularly on their adherence to prompts. We created a
novel dataset that aimed to assess the robustness of these models in generating
images that conform to the specified factors of variation in the input text
prompts. Our evaluation studies present findings on three variants of Stable
Diffusion models: Stable Diffusion 3 Medium, Stable Diffusion 3.5 Large, and
Stable Diffusion 3.5 Large Turbo, and two variants of Janus models: Janus Pro
1B and Janus Pro 7B. We introduce a pipeline that leverages text descriptions
generated by the gpt-4o model for our ground-truth images, which are then used
to generate artificial images by passing these descriptions to the
Text-to-Image models. We then pass these generated images again through gpt-4o
using the same system prompt and compare the variation between the two
descriptions. Our results reveal that these models struggle to create simple
binary images with only two factors of variation: a simple geometric shape and
its location. We also show, using pre-trained VAEs on our dataset, that they
fail to generate images that follow our input dataset distribution.