Marmot: Multi-Agent Reasoning for Multi-Object Self-Correcting in Improving Image-Text Alignment

Journal: arXiv

Published Date: Apr 10, 2025

Abstract

While diffusion models excel at generating high-quality images, they often struggle with accurate counting, attributes, and spatial relationships in complex multi-object scenes. To address these challenges, we propose Marmot, a novel and generalizable framework that employs Multi-Agent Reasoning for Multi-Object Self-Correcting, enhancing image-text alignment and facilitating more coherent multi-object image editing. Our framework adopts a divide-and-conquer strategy that decomposes the self-correction task into three critical dimensions (counting, attributes, and spatial relationships), and further divided into object-level subtasks. We construct a multi-agent editing system featuring a decision-execution-verification mechanism, effectively mitigating inter-object interference and enhancing editing reliability. To resolve the problem of subtask integration, we propose a Pixel-Domain Stitching Smoother that employs mask-guided two-stage latent space optimization. This innovation enables parallel processing of subtask results, thereby enhancing runtime efficiency while eliminating multi-stage distortion accumulation. Extensive experiments demonstrate that Marmot significantly improves accuracy in object counting, attribute assignment, and spatial relationships for image generation tasks.

Authors

Jiayang Sun
Hongbo Wang
Jie Cao
Huaibo Huang
Ran He

External Resources

View on arXiv arXiv (http://arxiv.org/abs/2504.20054v1)

Marmot: Multi-Agent Reasoning for Multi-Object Self-Correcting in Improving Image-Text Alignment

Abstract

Authors

Categories

External Resources

Popular Topics

Recent Journals

Marmot: Multi-Agent Reasoning for Multi-Object Self-Correcting in Improving Image-Text Alignment

Abstract

Authors

Categories

External Resources

Stay Ahead of Medical AI

Popular Topics

Recent Journals