Aquarius: A Family of Industry-Level Video Generation Models for Marketing Scenarios
Journal: arXiv
Published Date: May 14, 2025
Abstract
This report introduces Aquarius, a family of industry-level video generation
models for marketing scenarios, designed to run on clusters of thousands of
xPUs and to scale to models with hundreds of billions of parameters. Leveraging
an efficient engineering architecture and algorithmic innovation, Aquarius
demonstrates exceptional performance in high-fidelity, multi-aspect-ratio, and
long-duration video synthesis. By disclosing the framework's design details, we
aim to demystify industrial-scale video generation systems and catalyze
advances in the generative video community. The Aquarius framework consists of
five components:
1. Distributed Graph and Video Data Processing Pipeline: Manages tens of
   thousands of CPUs and thousands of xPUs via automated task distribution,
   enabling efficient video data processing. We also plan to open-source the
   entire data processing framework, named "Aquarius-Datapipe".
2. Model Architectures for Different Scales: Includes a Single-DiT
   architecture for 2B models and a Multimodal-DiT architecture for 13.4B
   models, supporting multi-aspect-ratio, multi-resolution, and multi-duration
   video generation.
3. High-Performance Infrastructure for Video Generation Model Training:
   Incorporates hybrid parallelism and fine-grained memory optimization
   strategies, achieving 36% MFU at large scale.
4. Multi-xPU Parallel Inference Acceleration: Uses diffusion caching and
   attention optimization to achieve a 2.35x inference speedup.
5. Multiple Marketing-Scenario Applications: Includes image-to-video,
   text-to-video (avatar), video inpainting, and video personalization, among
   others.

More downstream applications and multi-dimensional evaluation metrics will be
added in upcoming versions.
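
To make the two headline figures concrete, the sketch below shows how MFU (model FLOPs utilization) and an inference speedup are commonly computed. It assumes the usual definition of MFU as achieved model FLOPs per second divided by the aggregate peak FLOPs per second of the hardware; the report's exact accounting may differ. The function names and every number in the usage lines are hypothetical and chosen only for illustration, not taken from the report.

```python
def mfu(model_flops_per_step: float, step_time_s: float,
        num_devices: int, peak_flops_per_device: float) -> float:
    """Model FLOPs utilization: achieved throughput over aggregate hardware peak."""
    achieved_flops_per_s = model_flops_per_step / step_time_s
    peak_flops_per_s = num_devices * peak_flops_per_device
    return achieved_flops_per_s / peak_flops_per_s


def latency_after_speedup(baseline_latency_s: float, speedup: float) -> float:
    """Latency implied by a given speedup factor over a baseline latency."""
    return baseline_latency_s / speedup


# Hypothetical numbers: 1000 devices at 1e15 peak FLOP/s each, and a training
# step that performs 3.6e17 model FLOPs in 1 second, correspond to 36% MFU.
example_mfu = mfu(3.6e17, 1.0, 1000, 1e15)

# Hypothetical baseline of 100 s per video; a 2.35x speedup implies about 42.6 s.
example_latency = latency_after_speedup(100.0, 2.35)
```

Under this definition, MFU counts only the FLOPs the model itself requires, so recomputation and communication overhead lower the figure rather than inflate it.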