VIVAT: Virtuous Improving VAE Training through Artifact Mitigation
Journal: arXiv
Published Date: Jun 9, 2025
Abstract
Variational Autoencoders (VAEs) remain a cornerstone of generative computer
vision, yet their training is often plagued by artifacts that degrade
reconstruction and generation quality. This paper introduces VIVAT, a
systematic approach to mitigating common artifacts in KL-VAE training without
requiring radical architectural changes. We present a detailed taxonomy of five
prevalent artifacts (color shift, grid patterns, blur, corner artifacts, and
droplet artifacts) and analyze their root causes. Through straightforward
modifications, including adjustments to loss weights, padding strategies, and
the integration of Spatially Conditional Normalization, we demonstrate
significant improvements in VAE performance. Our method achieves
state-of-the-art results in image reconstruction metrics (PSNR and SSIM) across
multiple benchmarks and enhances text-to-image generation quality, as evidenced
by superior CLIP scores. By preserving the simplicity of the KL-VAE framework
while addressing its practical challenges, VIVAT offers actionable insights for
researchers and practitioners aiming to optimize VAE training.
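
For readers unfamiliar with the reconstruction metrics named above, PSNR can be
sketched as follows. This is the standard textbook definition, not the paper's
evaluation code; the function name `psnr`, the flat-list pixel representation,
and the default `max_value` of 255 are assumptions made for illustration.

```python
import math


def psnr(reference, reconstruction, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences.

    Illustrative helper (hypothetical, not from the paper): images are assumed
    to be flattened into sequences of numeric pixel intensities.
    """
    if len(reference) != len(reconstruction):
        raise ValueError("inputs must have the same number of pixels")
    if len(reference) == 0:
        raise ValueError("inputs must be non-empty")

    # Mean squared error over all pixels.
    mse = sum((a - b) ** 2 for a, b in zip(reference, reconstruction)) / len(reference)

    # Identical images have zero error, so PSNR is unbounded.
    if mse == 0:
        return math.inf

    # PSNR = 10 * log10(MAX^2 / MSE); higher means a closer reconstruction.
    return 10.0 * math.log10(max_value ** 2 / mse)
```

A higher value indicates that the VAE reconstruction is closer to the input
image. SSIM, by contrast, compares local structure rather than per-pixel error
and is not covered by this sketch.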