ERUPT: Efficient Rendering with Unposed Patch Transformer
Journal: arXiv
Published Date: Mar 31, 2025
Abstract
This work addresses the problem of novel view synthesis in diverse scenes
from small collections of RGB images. We propose ERUPT (Efficient Rendering
with Unposed Patch Transformer), a state-of-the-art scene reconstruction model
capable of efficient scene rendering using unposed imagery. We introduce
patch-based querying, in contrast to existing pixel-based queries, to reduce
the compute required to render a target view. This makes our model highly
efficient both during training and at inference, capable of rendering at 600
fps on commercial hardware. Notably, our model is designed to use a learned
latent camera pose which allows for training using unposed targets in datasets
with sparse or inaccurate ground truth camera pose. We show that our approach
can generalize on large real-world data and introduce a new benchmark dataset
(MSVS-1M) for latent view synthesis using street-view imagery collected from
Mapillary. In contrast to NeRF and Gaussian Splatting, which require dense
imagery and precise metadata, ERUPT can render novel views of arbitrary scenes
with as few as five unposed input images. ERUPT achieves better rendered image
quality than current state-of-the-art methods for unposed image synthesis
tasks, reduces labeled data requirements by ~95% and decreases computational
requirements by an order of magnitude, providing efficient novel view synthesis
for diverse real-world scenes.
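
To give a rough sense of why patch-based querying reduces rendering compute, the sketch below counts how many decoder queries one target view needs under each scheme. The image resolution and patch size are illustrative assumptions, not values taken from the paper, and the function name `query_counts` is hypothetical.

```python
def query_counts(height: int, width: int, patch_size: int) -> tuple[int, int]:
    """Return (pixel_queries, patch_queries) for a single target view.

    Assumes one query per pixel in the pixel-based scheme and one query per
    non-overlapping square patch in the patch-based scheme. These are
    simplifying assumptions for illustration only.
    """
    if patch_size <= 0:
        raise ValueError("patch_size must be positive")
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")
    pixel_queries = height * width
    patch_queries = (height // patch_size) * (width // patch_size)
    return pixel_queries, patch_queries


# Hypothetical example: a 224x224 target view with 8x8 patches.
pixel_q, patch_q = query_counts(224, 224, 8)
print(pixel_q, patch_q, pixel_q // patch_q)  # 50176 784 64
```

Under these assumptions the number of queries falls by a factor of `patch_size ** 2`. Each patch query must decode more output values than a pixel query, so the actual savings depend on the decoder design.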