Marmot: Multi-Agent Reasoning for Multi-Object Self-Correcting in Improving Image-Text Alignment
Journal:
arXiv
Published Date:
Apr 10, 2025
Abstract
While diffusion models excel at generating high-quality images, they often
struggle with accurate counting, attributes, and spatial relationships in
complex multi-object scenes. To address these challenges, we propose Marmot, a
novel and generalizable framework that employs Multi-Agent Reasoning for
Multi-Object Self-Correcting, enhancing image-text alignment and facilitating
more coherent multi-object image editing. Our framework adopts a
divide-and-conquer strategy that decomposes the self-correction task into three
critical dimensions (counting, attributes, and spatial relationships), and
further divided into object-level subtasks. We construct a multi-agent editing
system featuring a decision-execution-verification mechanism, effectively
mitigating inter-object interference and enhancing editing reliability. To
resolve the problem of subtask integration, we propose a Pixel-Domain Stitching
Smoother that employs mask-guided two-stage latent space optimization. This
innovation enables parallel processing of subtask results, thereby enhancing
runtime efficiency while eliminating multi-stage distortion accumulation.
Extensive experiments demonstrate that Marmot significantly improves accuracy
in object counting, attribute assignment, and spatial relationships for image
generation tasks.