On the Feasibility of Poisoning Text-to-Image AI Models via Adversarial Mislabeling
Journal: arXiv
Published Date: Jun 27, 2025
Abstract
Today's text-to-image generative models are trained on millions of images
sourced from the Internet, each paired with a detailed caption produced by
Vision-Language Models (VLMs). This part of the training pipeline is critical
for supplying the models with large volumes of high-quality image-caption pairs
during training. However, recent work suggests that VLMs are vulnerable to
stealthy adversarial attacks, where adversarial perturbations are added to
images to mislead the VLMs into producing incorrect captions.
In this paper, we explore the feasibility of adversarial mislabeling attacks
on VLMs as a mechanism for poisoning training pipelines for text-to-image
models. Our experiments demonstrate that VLMs are highly vulnerable to
adversarial perturbations, allowing attackers to produce benign-looking images
that are consistently miscaptioned by the VLMs. This has the effect of
injecting strong "dirty-label" poison samples into the training pipeline for
text-to-image models, successfully altering their behavior with a small number
of poisoned samples. We find that while potential defenses can be effective,
they can be targeted and circumvented by adaptive attackers. This suggests a
cat-and-mouse game that is likely to reduce the quality of training data and
increase the cost of text-to-image model development. Finally, we demonstrate
the real-world effectiveness of these attacks, achieving high attack success
rates (over 73%) even in black-box scenarios against commercial VLMs (Google Vertex
AI and Microsoft Azure).
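
To make the idea of a bounded adversarial perturbation concrete, the following is a minimal sketch, not the attack used in the paper. It assumes a toy linear scorer (logits = W x) standing in for a VLM, and applies signed-gradient steps that push an input toward an attacker-chosen label while staying within an L-infinity budget. The function name `targeted_perturbation` and the parameters `eps`, `step`, and `iters` are illustrative assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def targeted_perturbation(x, W, target, eps=0.05, step=0.01, iters=50):
    """Signed-gradient attack on a toy linear scorer (hypothetical stand-in
    for a captioning model, NOT the paper's method).

    x: list of pixel values in [0, 1]
    W: list of weight rows, one per label
    Returns a copy of x nudged toward `target`, with every pixel within
    eps of its original value.
    """
    x_adv = list(x)
    for _ in range(iters):
        logits = [sum(w * v for w, v in zip(row, x_adv)) for row in W]
        probs = softmax(logits)
        # Gradient of the cross-entropy loss toward `target` w.r.t. the
        # input: sum_k (p_k - [k == target]) * W[k]
        coeffs = [p - (1.0 if k == target else 0.0)
                  for k, p in enumerate(probs)]
        grad = [sum(c * row[i] for c, row in zip(coeffs, W))
                for i in range(len(x))]
        for i, g in enumerate(grad):
            sign = (g > 0) - (g < 0)
            v = x_adv[i] - step * sign                # lower the target loss
            v = min(max(v, x[i] - eps), x[i] + eps)   # project into eps-ball
            x_adv[i] = min(max(v, 0.0), 1.0)          # keep a valid pixel
    return x_adv


# Usage: two "pixels", two labels; push the input toward label 1.
x = [0.5, 0.5]
W = [[1.0, 0.0], [0.0, 1.0]]
x_adv = targeted_perturbation(x, W, target=1)
```

The projection step is what keeps the perturbed input "benign-looking" in this toy setting: no pixel moves by more than `eps`, yet the scorer's preference shifts toward the target label. A real attack on a VLM would replace the linear scorer with the model's own loss and gradients, or with a black-box surrogate.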