SIGMAN: Scaling 3D Human Gaussian Generation with Millions of Assets
Journal:
arXiv
Published Date:
Apr 9, 2025
Abstract
3D human digitization has long been a highly pursued yet challenging task.
Existing methods aim to generate high-quality 3D digital humans from single or
multiple views, but remain primarily constrained by current paradigms and the
scarcity of 3D human assets. Specifically, recent approaches fall into several
paradigms: optimization-based and feed-forward (both single-view regression and
multi-view generation with reconstruction). However, they are limited by slow
speed, low quality, cascade reasoning, and ambiguity in mapping low-dimensional
planes to high-dimensional space due to occlusion and invisibility,
respectively. Furthermore, existing 3D human assets remain small-scale,
insufficient for large-scale training. To address these challenges, we propose
a latent space generation paradigm for 3D human digitization, which compresses
multi-view images into Gaussians via a UV-structured VAE and pairs this with
DiT-based conditional generation. This paradigm transforms the ill-posed
low-to-high-dimensional mapping problem into a learnable distribution shift
and also supports end-to-end inference. In addition, we employ a multi-view
optimization approach combined with synthetic data to construct the HGS-1M
dataset, which contains 1 million 3D Gaussian assets to support
large-scale training. Experimental results demonstrate that our paradigm,
powered by large-scale training, produces high-quality 3D human Gaussians with
intricate textures, facial details, and loose clothing deformation.
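
As a rough illustration of what a UV-structured Gaussian representation could look like, the sketch below unpacks a UV attribute map into a list of 3D Gaussians, one per valid texel. The abstract does not specify the channel layout or activations, so everything here is an assumption made for illustration: the 14-channel layout, the exponential and sigmoid activations, and the hypothetical function name `uv_map_to_gaussians`. It is not the authors' decoder.

```python
import math

# Assumed per-texel layout (hypothetical, not taken from the paper):
# 3 position + 3 log-scale + 4 rotation quaternion + 1 opacity logit + 3 color
CHANNELS = 14


def uv_map_to_gaussians(uv_map, valid_mask):
    """Turn an H x W grid of 14-channel texels into a list of Gaussians.

    uv_map:     nested lists, uv_map[v][u] is a list of CHANNELS floats
    valid_mask: nested lists of bools; False marks texels outside the UV layout
    """
    gaussians = []
    for row, mask_row in zip(uv_map, valid_mask):
        for texel, valid in zip(row, mask_row):
            if not valid:
                continue  # texel does not lie on the body surface
            if len(texel) != CHANNELS:
                raise ValueError(f"expected {CHANNELS} channels, got {len(texel)}")
            quat = texel[6:10]
            # Fall back to 1.0 so an all-zero quaternion does not divide by zero
            norm = math.sqrt(sum(q * q for q in quat)) or 1.0
            gaussians.append({
                "position": tuple(texel[0:3]),
                "scale": tuple(math.exp(s) for s in texel[3:6]),   # keeps scales positive
                "rotation": tuple(q / norm for q in quat),         # unit quaternion
                "opacity": 1.0 / (1.0 + math.exp(-texel[10])),     # sigmoid to (0, 1)
                "color": tuple(texel[11:14]),
            })
    return gaussians
```

Under this assumed layout, the number of Gaussians equals the number of valid texels, so a fixed UV template gives every generated human the same Gaussian count and ordering.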