FaR: Enhancing Multi-Concept Text-to-Image Diffusion via Concept Fusion and Localized Refinement
Journal: arXiv
Published Date: Apr 4, 2025
Abstract
Generating images that contain multiple new concepts remains a challenging
problem in text-to-image generation. Current methods often overfit when trained on a small
number of samples and struggle with attribute leakage, particularly for
class-similar subjects (e.g., two specific dogs). In this paper, we introduce
Fuse-and-Refine (FaR), a novel approach that tackles these challenges through
two key contributions: the Concept Fusion technique and the Localized Refinement
loss function. Concept Fusion systematically augments the training data by
separating reference subjects from backgrounds and recombining them into
composite images to increase diversity. This augmentation counters overfitting
by broadening the narrow distribution of the limited training samples. In
addition, the Localized Refinement loss function is introduced to preserve
each subject's representative attributes by aligning each concept's
attention map to its correct region. This approach effectively prevents
attribute leakage by ensuring that the diffusion model distinguishes similar
subjects without mixing their attention maps during the denoising process. By
fine-tuning specific modules at the same time, FaR balances the learning of new
concepts with the retention of previously learned knowledge. Empirical results
show that FaR not only prevents overfitting and attribute leakage while
maintaining photorealism, but also outperforms other state-of-the-art methods.
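
The two contributions described above can be illustrated with a toy sketch. It is not the authors' implementation. Images and attention maps are small nested lists rather than tensors. The helper names (paste_subject, fuse_concepts, localized_refinement_loss), the one-subject-per-vertical-strip placement rule, and the exact loss form (the fraction of each concept's attention mass that falls outside its region) are all assumptions made for illustration, since the abstract does not specify them.

```python
import random


def paste_subject(canvas, subject, mask, top, left):
    """Copy subject pixels onto canvas wherever mask is 1; return the region mask."""
    region = [[0] * len(canvas[0]) for _ in canvas]
    for i, mask_row in enumerate(mask):
        for j, keep in enumerate(mask_row):
            if keep:
                canvas[top + i][left + j] = subject[i][j]
                region[top + i][left + j] = 1
    return region


def fuse_concepts(background, subjects, rng):
    """Composite segmented subjects onto a copy of the background.

    `subjects` is a list of (pixels, mask) pairs. Each subject is placed at a
    random offset inside its own vertical strip, so subjects never overlap
    (a placement rule assumed for this sketch, not taken from the paper).
    """
    height, width = len(background), len(background[0])
    slot = width // len(subjects)
    canvas = [row[:] for row in background]  # leave the input untouched
    regions = []
    for k, (pixels, mask) in enumerate(subjects):
        h, w = len(mask), len(mask[0])
        if h > height or w > slot:
            raise ValueError("subject does not fit in its slot")
        top = rng.randint(0, height - h)            # bounds are inclusive
        left = k * slot + rng.randint(0, slot - w)
        regions.append(paste_subject(canvas, pixels, mask, top, left))
    return canvas, regions


def localized_refinement_loss(attention_maps, regions):
    """Mean fraction of each concept's attention that lies outside its region.

    Hypothetical loss form: 0.0 when every concept attends only inside its
    own region, 1.0 when all attention falls outside it.
    """
    losses = []
    for attn, region in zip(attention_maps, regions):
        total = sum(sum(row) for row in attn)
        inside = sum(
            a * r
            for attn_row, region_row in zip(attn, region)
            for a, r in zip(attn_row, region_row)
        )
        losses.append(1.0 - inside / total if total > 0 else 1.0)
    return sum(losses) / len(losses)
```

In an actual diffusion fine-tuning setup, the region masks returned by the compositing step would supervise the cross-attention maps of each concept token, and a term of this kind would be added to the usual denoising objective. How the paper weights and combines these terms is not stated in the abstract.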