DiTFastAttnV2: Head-wise Attention Compression for Multi-Modality Diffusion Transformers

Journal: arXiv

Published Date: Mar 28, 2025

Abstract

Text-to-image generation models, especially Multimodal Diffusion Transformers (MMDiT), have shown remarkable progress in generating high-quality images. However, these models face significant computational bottlenecks, particularly in their attention mechanisms, which limit scalability and efficiency. In this paper, we introduce DiTFastAttnV2, a post-training compression method designed to accelerate attention in MMDiT. Through an in-depth analysis of MMDiT's attention patterns, we identify key differences from prior DiT-based methods and propose head-wise arrow attention and caching mechanisms that adjust the computation of each attention head dynamically, bridging this gap. We also design an efficient fused kernel for further acceleration. By leveraging local metric methods and optimization techniques, our approach reduces the search time for optimal compression schemes to just minutes while maintaining generation quality. Furthermore, with the customized kernel, DiTFastAttnV2 achieves a 68% reduction in attention FLOPs and a 1.5x end-to-end speedup on 2K image generation without compromising visual fidelity.
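
The abstract does not define arrow attention, so the following is only an illustrative sketch under stated assumptions: text tokens come first in the joint sequence, every query or key that involves a text token keeps full attention, and image-to-image attention is restricted to a local window. The function names (arrow_mask, kept_fraction), the 1D window over image tokens, and the parameter values are hypothetical and are not taken from the paper; they only show how such a mask reduces the number of attention entries that must be computed.

```python
def arrow_mask(n_text, n_image, window):
    """Build a boolean mask; True means the query (row) may attend to the key (column).

    Assumptions (not from the paper): text tokens occupy the first n_text
    positions, and image tokens use a simple 1D window of the given radius.
    """
    n = n_text + n_image
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(n):
            if q < n_text or k < n_text:
                # Any pair involving a text token keeps full attention.
                mask[q][k] = True
            elif abs(q - k) <= window:
                # Image-to-image pairs are kept only inside the local window.
                mask[q][k] = True
    return mask


def kept_fraction(mask):
    """Fraction of query-key pairs that still need to be computed."""
    total = len(mask) * len(mask[0])
    kept = sum(sum(row) for row in mask)
    return kept / total


if __name__ == "__main__":
    m = arrow_mask(n_text=2, n_image=4, window=1)
    for row in m:
        print("".join("#" if v else "." for v in row))
    print(kept_fraction(m))
```

In a head-wise scheme, a mask of this kind would be applied only to the heads selected for it, with other heads left at full attention or served from a cache; how heads are selected is not described in the abstract and is not modeled here.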

Authors

Hanling Zhang
Rundong Su
Zhihang Yuan
Pengtao Chen
Mingzhu Shen
Yibo Fan
Shengen Yan
Guohao Dai
Yu Wang

External Resources

arXiv: http://arxiv.org/abs/2503.22796v1
