Augmented Conditioning Is Enough For Effective Training Image Generation
Journal: arXiv
Published Date: Feb 6, 2025
Abstract
The image generation abilities of text-to-image diffusion models have advanced
significantly, yielding highly photorealistic images from descriptive text and
making it increasingly viable to use synthetic images to train
computer vision models. To serve as effective training data, generated images
must be highly realistic while also sufficiently diverse within the support of
the target data distribution. Yet, state-of-the-art conditional image
generation models have been primarily optimized for creative applications,
prioritizing image realism and prompt adherence over conditional diversity. In
this paper, we investigate how to improve the diversity of generated images
with the goal of increasing their effectiveness for training downstream image
classification models, without fine-tuning the image generation model. We find
that conditioning the generation process on an augmented real image and text
prompt produces generations that serve as effective synthetic datasets for
downstream training. Conditioning on real training images contextualizes the
generation process to produce images that are in-domain with the real image
distribution, while data augmentations introduce visual diversity that improves
the performance of the downstream classifier. We validate
augmentation-conditioning on five established image classification benchmarks,
one long-tailed and four few-shot. Conditioning the generation process on
augmentations yields consistent improvements over the state of the art on the
long-tailed benchmark and remarkable gains in the extreme few-shot regimes of
the remaining four benchmarks. These results constitute an
important step towards effectively leveraging synthetic data for downstream
training.
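
The pipeline summarized above (augment a real training image, then condition a frozen generator on the augmented image together with a text prompt) might be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the particular augmentations (random crop and horizontal flip), the prompt template, the `generator` callable interface, and every function name here are hypothetical choices made for the example. Images are represented as plain nested lists so the sketch runs with the standard library alone.

```python
import random
from typing import Callable, List, Sequence, Tuple

# Illustrative image type: a grayscale image stored as rows of pixel values.
Image = List[List[float]]
# Hypothetical generator interface: (conditioning image, text prompt) -> image.
Generator = Callable[[Image, str], Image]


def horizontal_flip(image: Image) -> Image:
    """Mirror every row left to right."""
    return [row[::-1] for row in image]


def random_crop(image: Image, size: int, rng: random.Random) -> Image:
    """Cut a size x size window at a uniformly random position."""
    top = rng.randint(0, len(image) - size)
    left = rng.randint(0, len(image[0]) - size)
    return [row[left:left + size] for row in image[top:top + size]]


def augment(image: Image, crop_size: int, rng: random.Random) -> Image:
    """Assumed augmentation: random crop, then a flip with probability 0.5."""
    out = random_crop(image, crop_size, rng)
    if rng.random() < 0.5:
        out = horizontal_flip(out)
    return out


def build_synthetic_dataset(
    real_data: Sequence[Tuple[Image, str]],
    generator: Generator,
    per_image: int,
    crop_size: int,
    seed: int = 0,
) -> List[Tuple[Image, str]]:
    """Generate per_image synthetic samples for each real (image, label) pair.

    The generator is never updated here; only its conditioning inputs vary.
    """
    rng = random.Random(seed)
    synthetic = []
    for image, label in real_data:
        prompt = f"a photo of a {label}"  # assumed prompt template
        for _ in range(per_image):
            conditioning = augment(image, crop_size, rng)
            synthetic.append((generator(conditioning, prompt), label))
    return synthetic


def identity_generator(image: Image, prompt: str) -> Image:
    """Stand-in for a real image-and-text conditioned model; echoes its input."""
    return image
```

In practice, `identity_generator` would be replaced by a pretrained image-and-text conditioned diffusion model used without fine-tuning, and the resulting `(image, label)` pairs would be used to train the downstream classifier.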