MADiff: Text-Guided Fashion Image Editing with Mask Prediction and Attention-Enhanced Diffusion
Journal: arXiv
Published Date: Dec 28, 2024
Abstract
Text-guided image editing models have achieved great success in the general
domain. However, directly applying these models to the fashion domain may
encounter two issues: (1) inaccurate localization of the editing region; (2) weak
editing magnitude. To address these issues, the MADiff model is proposed. Specifically,
to identify the editing region more accurately, MaskNet is proposed, in which
the foreground region, DensePose, and mask prompts from a large language model are
fed into a lightweight UNet to predict the mask for the editing region. To
strengthen the editing magnitude, the Attention-Enhanced Diffusion Model is
proposed, where the noise map, attention map, and the mask from MaskNet are fed
into the proposed Attention Processor to produce a refined noise map. By
integrating the refined noise map into the diffusion model, the edited image
can better align with the target prompt. Given the absence of benchmarks in
fashion image editing, we constructed a dataset named Fashion-E, comprising
28390 image-text pairs in the training set, and 2639 image-text pairs for four
types of fashion tasks in the evaluation set. Extensive experiments on
Fashion-E demonstrate that our proposed method can accurately predict the mask
of the editing region and significantly enhance editing magnitude in fashion image
editing compared with state-of-the-art methods.
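
The abstract does not give the Attention Processor's formula, so the following is only an illustrative sketch of the general idea of combining a noise map, an attention map, and a mask into a refined noise map. It assumes, purely for illustration, that the noise is amplified inside the mask in proportion to the normalized attention and left unchanged elsewhere. The function name refine_noise, the strength parameter, and the scaling rule are all hypothetical and are not taken from the paper.

```python
from typing import List

Grid = List[List[float]]  # a 2-D map stored as rows of floats


def refine_noise(noise: Grid, attention: Grid, mask: Grid,
                 strength: float = 0.5) -> Grid:
    """Hypothetical refinement rule (an assumption, not the paper's method).

    Inside the mask, each noise value is scaled by
    1 + strength * (attention / max attention); outside the mask
    (mask value 0) the noise is returned unchanged.
    """
    # Normalize attention to [0, 1] by its peak value; guard against all zeros.
    peak = max(max(row) for row in attention)
    scale = peak if peak > 0 else 1.0

    refined: Grid = []
    for n_row, a_row, m_row in zip(noise, attention, mask):
        refined.append([
            n * (1.0 + strength * m * (a / scale))
            for n, a, m in zip(n_row, a_row, m_row)
        ])
    return refined


# Toy 2x2 example: the bottom-left pixel lies outside the mask.
noise = [[1.0, 2.0], [3.0, 4.0]]
attention = [[0.0, 2.0], [4.0, 4.0]]
mask = [[1.0, 1.0], [0.0, 1.0]]
result = refine_noise(noise, attention, mask, strength=0.5)
```

In this toy setup, pixels with high attention inside the mask receive the largest change, while the unmasked pixel keeps its original noise value; this mirrors the stated goal of strengthening edits only in the predicted editing region.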