MuLan: Adapting Multilingual Diffusion Models for Hundreds of Languages with Negligible Cost
Journal: arXiv
Published Date: Dec 2, 2024
Abstract
In this work, we explore a cost-effective framework for multilingual image
generation. We find that, compared with tuning models on high-quality images
with multilingual annotations, leveraging text encoders pre-trained on widely
available, noisy Internet image-text pairs significantly improves data
efficiency in text-to-image (T2I) generation across multiple languages. Based
on this insight, we introduce MuLan (Multi-Language adapter), a lightweight
language adapter with fewer than 20M parameters, trained alongside a frozen
text encoder and image diffusion model. Compared to previous multilingual T2I
models, this framework offers: (1) Cost efficiency: using readily accessible
English data and off-the-shelf multilingual text encoders minimizes the
training cost; (2) High performance: it achieves comparable generation
capabilities in over 110 languages, with CLIP similarity scores nearly matching
those in English (38.61 for English vs. 37.61 for other languages); and (3)
Broad applicability: it integrates seamlessly with compatible community tools
such as LoRA, LCM, ControlNet, and IP-Adapter, expanding its potential use cases.
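
To give a feel for what "fewer than 20M parameters" allows, the following sketch counts the trainable weights of a hypothetical adapter placed between a frozen multilingual text encoder and a frozen diffusion model. The layout (an input projection, one standard transformer encoder layer, an output projection) and every dimension are illustrative assumptions; the abstract does not specify the adapter's architecture.

```python
# Parameter-budget sketch for a HYPOTHETICAL language adapter.
# The layout and all dimensions are assumptions chosen for illustration;
# they are not taken from the paper.

PARAM_BUDGET = 20_000_000  # "fewer than 20M parameters" from the abstract


def linear_params(d_in: int, d_out: int) -> int:
    """Weights plus biases of one fully connected layer."""
    return d_in * d_out + d_out


def transformer_layer_params(d_model: int, d_ff: int) -> int:
    """One standard encoder layer: self-attention, feed-forward, two LayerNorms."""
    attention = 4 * linear_params(d_model, d_model)  # Q, K, V, output projections
    feed_forward = linear_params(d_model, d_ff) + linear_params(d_ff, d_model)
    layer_norms = 2 * (2 * d_model)  # scale and shift for each LayerNorm
    return attention + feed_forward + layer_norms


def adapter_params(d_text: int, d_model: int, d_ff: int,
                   d_cond: int, num_layers: int) -> int:
    """Input projection -> transformer layers -> output projection."""
    return (
        linear_params(d_text, d_model)
        + num_layers * transformer_layer_params(d_model, d_ff)
        + linear_params(d_model, d_cond)
    )


# Assumed sizes: 1024-d multilingual encoder output, 768-d conditioning
# space expected by the diffusion model, a single adapter layer.
total = adapter_params(d_text=1024, d_model=768, d_ff=3072,
                       d_cond=768, num_layers=1)

print(f"adapter parameters: {total:,}")
print(f"within budget: {total < PARAM_BUDGET}")
```

Only these adapter weights would receive gradients in such a setup; the text encoder and the diffusion model stay frozen, which is what keeps the training cost low.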