UrbanCraft: Urban View Extrapolation via Hierarchical Sem-Geometric Priors
Journal: arXiv
Published Date: May 29, 2025
Abstract
Existing neural-rendering-based urban scene reconstruction methods mainly
focus on the Interpolated View Synthesis (IVS) setting, which synthesizes
views close to the training camera trajectory. However, IVS cannot guarantee
on-par performance for novel views outside the training camera distribution
(e.g., looking left, right, or downwards), which limits the generalizability
of urban reconstruction applications. Previous methods have addressed this
with image diffusion, but they fail to handle text-ambiguous or large unseen
view angles because text-only diffusion offers only coarse-grained control.
In this paper, we design UrbanCraft, which addresses the Extrapolated View
Synthesis (EVS) problem using hierarchical sem-geometric representations
as additional priors. Specifically, we leverage the partially
observable scene to reconstruct coarse semantic and geometric primitives,
establishing a coarse scene-level prior through an occupancy grid as the base
representation. Additionally, we incorporate fine instance-level priors from 3D
bounding boxes to enhance object-level details and spatial relationships.
Building on this, we propose Hierarchical Semantic-Geometric-Guided
Variational Score Distillation (HSG-VSD), which integrates semantic and
geometric constraints from a pretrained UrbanCraft2D into the score
distillation sampling process, forcing the distribution to be consistent
with the observable scene. Qualitative and quantitative comparisons
demonstrate the effectiveness of our method on the EVS problem.
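
To make the two-level prior described above more concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes the partially observed scene is available as labeled 3D points, builds a coarse scene-level prior as a sparse semantic occupancy grid (majority label per voxel), and then refines it with fine instance-level priors by stamping 3D bounding boxes into the grid. All names (`voxel_index`, `build_semantic_grid`, `stamp_boxes`), the integer label convention, and the use of axis-aligned boxes are assumptions made for illustration; the abstract does not specify these details.

```python
from math import floor


def voxel_index(point, origin, voxel_size):
    """Map a 3D point to integer voxel coordinates (hypothetical helper)."""
    return tuple(floor((p - o) / voxel_size) for p, o in zip(point, origin))


def build_semantic_grid(points, labels, origin, voxel_size):
    """Coarse scene-level prior: sparse grid of voxel -> majority semantic label."""
    votes = {}
    for point, label in zip(points, labels):
        idx = voxel_index(point, origin, voxel_size)
        counts = votes.setdefault(idx, {})
        counts[label] = counts.get(label, 0) + 1
    # Keep the most frequent label in each occupied voxel.
    return {idx: max(counts, key=counts.get) for idx, counts in votes.items()}


def stamp_boxes(grid, boxes, origin, voxel_size):
    """Fine instance-level prior: overwrite every voxel spanned by each box.

    Boxes are (min_corner, max_corner, label) and assumed axis-aligned,
    which is a simplification made for this sketch.
    """
    refined = dict(grid)  # leave the coarse grid untouched
    for box_min, box_max, label in boxes:
        lo = voxel_index(box_min, origin, voxel_size)
        hi = voxel_index(box_max, origin, voxel_size)
        for i in range(lo[0], hi[0] + 1):
            for j in range(lo[1], hi[1] + 1):
                for k in range(lo[2], hi[2] + 1):
                    refined[(i, j, k)] = label
    return refined


# Hypothetical label ids: 1 = road, 2 = building, 3 = car.
points = [(0.2, 0.1, 0.0), (0.4, 0.3, 0.1), (1.5, 0.2, 0.0)]
labels = [1, 1, 2]
origin = (0.0, 0.0, 0.0)

coarse = build_semantic_grid(points, labels, origin, voxel_size=1.0)
refined = stamp_boxes(
    coarse, [((1.1, 0.1, 0.1), (1.9, 0.9, 0.9), 3)], origin, voxel_size=1.0
)
```

In this sketch the instance-level boxes take precedence over the coarse labels wherever they overlap, which reflects the hierarchy described in the abstract only loosely; how the two levels are actually combined and rendered into conditions for HSG-VSD is not stated here.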