STeP: A General and Scalable Framework for Solving Video Inverse Problems with Spatiotemporal Diffusion Priors
Journal: arXiv
Published Date: Apr 10, 2025
Abstract
We study how to solve general Bayesian inverse problems involving videos
using diffusion model priors. While it is desirable to use a video diffusion
prior to effectively capture complex temporal relationships, due to the
computational and data requirements of training such a model, prior work has
instead relied on image diffusion priors on single frames combined with
heuristics to enforce temporal consistency. However, these approaches struggle
with faithfully recovering the underlying temporal relationships, particularly
for tasks with high temporal uncertainty. In this paper, we demonstrate the
feasibility of practical and accessible spatiotemporal diffusion priors by
fine-tuning latent video diffusion models from pretrained image diffusion
models using a limited number of domain-specific videos. Leveraging this plug-and-play
spatiotemporal diffusion prior, we introduce a general and scalable framework
for solving video inverse problems. We then apply our framework to two
challenging scientific video inverse problems: black hole imaging and dynamic
MRI. Our framework enables the generation of diverse, high-fidelity video
reconstructions that not only fit observations but also recover multi-modal
solutions. By incorporating a spatiotemporal diffusion prior, we significantly
improve our ability to capture complex temporal relationships in the data while
also enhancing spatial fidelity.
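
For orientation, the Bayesian setup the abstract refers to can be written in a standard generic form. The notation here is assumed for illustration and is not taken from the paper: x denotes the unknown video, y the measurements, A the forward (measurement) operator, and n the measurement noise, taken to be additive for simplicity.

```latex
\begin{align}
  \mathbf{y} &= \mathcal{A}(\mathbf{x}) + \mathbf{n},
  \\
  p(\mathbf{x} \mid \mathbf{y}) &\propto p(\mathbf{y} \mid \mathbf{x})\, p(\mathbf{x}).
\end{align}
```

In this reading, p(x) is the spatiotemporal diffusion prior over the whole video rather than a prior over individual frames, and drawing samples from the posterior, instead of computing a single point estimate, is what makes it possible to recover multi-modal solutions.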