Cost-Aware Routing for Efficient Text-To-Image Generation
Journal: arXiv
Published Date: Jun 17, 2025
Abstract
Diffusion models are well known for their ability to generate a high-fidelity
image for an input prompt through an iterative denoising process.
Unfortunately, the high fidelity also comes at a high computational cost due
to the inherently sequential generative process. In this work, we seek to
optimally balance quality and computational cost, and propose a framework to
allow the amount of computation to vary for each prompt, depending on its
complexity. Each prompt is automatically routed to the most appropriate
text-to-image generation function, which may correspond to a distinct number of
denoising steps of a diffusion model, or a disparate, independent text-to-image
model. Unlike uniform cost reduction techniques (e.g., distillation, model
quantization), our approach achieves the optimal trade-off by learning to
reserve expensive choices (e.g., 100+ denoising steps) only for a few complex
prompts, and to employ more economical choices (e.g., a small distilled model) for
less sophisticated prompts. We empirically demonstrate on COCO and DiffusionDB
that by learning to route to nine already-trained text-to-image models, our
approach is able to deliver an average quality that is higher than that
achievable by any of these models alone.
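
To make the routing idea concrete, here is a minimal sketch of one way a per-prompt decision could be made. It is not the method from the paper: the abstract does not specify the routing rule, so the sketch assumes a simple quality-minus-penalized-cost score. The function name `route`, the per-candidate predicted quality scores, the cost values, and the trade-off weight `lam` are all hypothetical and chosen only for illustration.

```python
from typing import Sequence


def route(pred_quality: Sequence[float], costs: Sequence[float], lam: float) -> int:
    """Pick the index of the candidate generator with the best quality/cost score.

    Assumption (not from the paper): each candidate is scored as
    predicted quality minus lam times its cost, and the highest score wins.
    """
    if len(pred_quality) != len(costs):
        raise ValueError("pred_quality and costs must have the same length")
    if not costs:
        raise ValueError("at least one candidate is required")
    scores = [q - lam * c for q, c in zip(pred_quality, costs)]
    # max() returns the first maximal index, so ties go to the earlier candidate.
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical relative costs for three candidates, e.g. a distilled model,
# a mid-size model, and a diffusion model run with many denoising steps.
costs = [1.0, 4.0, 10.0]

# Hypothetical predicted quality per candidate for two prompts.
simple_prompt_quality = [0.80, 0.82, 0.83]   # all candidates do about equally well
complex_prompt_quality = [0.40, 0.60, 0.85]  # only the expensive candidate does well

simple_choice = route(simple_prompt_quality, costs, lam=0.02)
complex_choice = route(complex_prompt_quality, costs, lam=0.02)
print(simple_choice, complex_choice)
```

Under these made-up numbers, the simple prompt is sent to the cheapest candidate and the complex prompt to the most expensive one, which mirrors the behavior the abstract describes. Raising `lam` pushes more prompts toward cheap candidates; setting it to zero ignores cost entirely.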