Can We Achieve Efficient Diffusion without Self-Attention? Distilling Self-Attention into Convolutions
Journal:
arXiv
Published Date:
Apr 30, 2025
Abstract
Contemporary diffusion models built upon U-Net or Diffusion Transformer (DiT)
architectures have revolutionized image generation through transformer-based
attention mechanisms. The prevailing paradigm has commonly employed
self-attention with quadratic computational complexity to handle global spatial
relationships in complex images, thereby synthesizing high-fidelity images with
coherent visual semantics.Contrary to conventional wisdom, our systematic
layer-wise analysis reveals an interesting discrepancy: self-attention in
pre-trained diffusion models predominantly exhibits localized attention
patterns, closely resembling convolutional inductive biases. This suggests that
global interactions in self-attention may be less critical than commonly
assumed.Driven by this, we propose \(\Delta\)ConvFusion to replace conventional
self-attention modules with Pyramid Convolution Blocks
(\(\Delta\)ConvBlocks).By distilling attention patterns into localized
convolutional operations while keeping other components frozen,
\(\Delta\)ConvFusion achieves performance comparable to transformer-based
counterparts while reducing computational cost by 6929$\times$ and surpassing
LinFusion by 5.42$\times$ in efficiency--all without compromising generative
fidelity.