MADFormer: Mixed Autoregressive and Diffusion Transformers for Continuous Image Generation
Journal:
arXiv
Published Date:
Jun 9, 2025
Abstract
Recent progress in multimodal generation has increasingly combined
autoregressive (AR) and diffusion-based approaches, leveraging their
complementary strengths: AR models capture long-range dependencies and produce
fluent, context-aware outputs, while diffusion models operate in continuous
latent spaces to refine high-fidelity visual details. However, existing hybrids
often lack systematic guidance on how and why to allocate model capacity
between these paradigms. In this work, we introduce MADFormer, a Mixed
Autoregressive and Diffusion Transformer that serves as a testbed for analyzing
AR-diffusion trade-offs. MADFormer partitions image generation into spatial
blocks, using AR layers for one-pass global conditioning across blocks and
diffusion layers for iterative local refinement within each block. Through
controlled experiments on FFHQ-1024 and ImageNet, we identify two key insights:
(1) block-wise partitioning significantly improves performance on
high-resolution images, and (2) vertically mixing AR and diffusion layers
yields better quality-efficiency balances--improving FID by up to 75% under
constrained inference compute. Our findings offer practical design principles
for future hybrid generative models.