MTADiffusion: Mask Text Alignment Diffusion Model for Object Inpainting
Journal:
arXiv
Published Date:
Jun 30, 2025
Abstract
Advancements in generative models have enabled image inpainting models to
generate content within specific regions of an image based on provided prompts
and masks. However, existing inpainting methods often suffer from problems such
as semantic misalignment, structural distortion, and style inconsistency. In
this work, we present MTADiffusion, a Mask-Text Alignment diffusion model
designed for object inpainting. To enhance the semantic capabilities of the
inpainting model, we introduce MTAPipeline, an automatic solution for
annotating masks with detailed descriptions. Based on the MTAPipeline, we
construct a new MTADataset comprising 5 million images and 25 million mask-text
pairs. Furthermore, we propose a multi-task training strategy that integrates
both inpainting and edge prediction tasks to improve structural stability. To
promote style consistency, we present a novel inpainting style-consistency loss
using a pre-trained VGG network and the Gram matrix. Comprehensive evaluations
on BrushBench and EditBench demonstrate that MTADiffusion achieves
state-of-the-art performance compared to other methods.