VipDiff: Towards Coherent and Diverse Video Inpainting via Training-free Denoising Diffusion Models
Journal: arXiv
Published Date: Jan 21, 2025
Abstract
Recent video inpainting methods have achieved encouraging improvements by
leveraging optical flow to guide pixel propagation from reference frames either
in the image space or feature space. However, they produce severe
artifacts in the mask center when the masked area is so large that no pixel
correspondences can be found for the center. Recently, diffusion models have
demonstrated impressive performance in generating diverse and high-quality
images, and have been exploited in a number of works for image inpainting.
These methods, however, cannot be applied directly to videos to produce
temporally coherent inpainting results. In this paper, we propose a
training-free framework, named VipDiff, that conditions a diffusion model in
the reverse diffusion process to produce temporally coherent inpainting results
without requiring any training data or fine-tuning of the pre-trained diffusion
model.
VipDiff takes optical flow as guidance to extract valid pixels from reference
frames to serve as constraints in optimizing the randomly sampled Gaussian
noise, and uses the generated results for further pixel propagation and
conditional generation. VipDiff also allows for generating diverse video
inpainting results from different sampled noise. Experiments demonstrate that
VipDiff can substantially outperform state-of-the-art video inpainting methods
in terms of both spatiotemporal coherence and fidelity.
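
The noise-optimization idea described above can be illustrated with a small toy sketch. This is not the authors' implementation. It assumes a one-dimensional "frame", a fixed linear map (`generator`) standing in for a frozen pre-trained diffusion model's reverse process, and a hand-written `valid_mask` in place of real optical-flow propagation. The function `optimize_noise` and all variable names are hypothetical. The sketch shows only two things: sampled Gaussian noise can be optimized so that the output matches the known pixels, and different noise samples give different content in the mask center.

```python
import numpy as np

def optimize_noise(generator, target, valid_mask, noise, steps=500):
    """Fit the noise so generated pixels match the valid (propagated) pixels.

    Minimizes 0.5 * ||valid_mask * (generator @ z - target)||^2 by gradient descent.
    """
    # Step size 1 / L, where L = ||generator||_2^2 bounds the curvature.
    lr = 1.0 / np.linalg.norm(generator, 2) ** 2
    z = noise.copy()
    for _ in range(steps):
        residual = valid_mask * (generator @ z - target)
        z -= lr * (generator.T @ residual)
    return z

rng = np.random.default_rng(0)
n_pixels, n_latent = 16, 64

# Hypothetical linear "generator" standing in for a frozen reverse diffusion pass.
generator = rng.standard_normal((n_pixels, n_latent))

# Pixels assumed to be propagated from reference frames by optical flow;
# the zeros mark the mask center, where no correspondence exists.
valid_mask = np.ones(n_pixels)
valid_mask[4:12] = 0.0
target = rng.standard_normal(n_pixels) * valid_mask

# Two different noise samples, optimized under the same constraints.
frame_a = generator @ optimize_noise(generator, target, valid_mask,
                                     rng.standard_normal(n_latent))
frame_b = generator @ optimize_noise(generator, target, valid_mask,
                                     rng.standard_normal(n_latent))

hole = valid_mask == 0
known_error = max(np.abs((frame_a - target)[~hole]).max(),
                  np.abs((frame_b - target)[~hole]).max())
hole_difference = np.abs(frame_a - frame_b)[hole].max()
print(f"max error on known pixels: {known_error:.2e}")
print(f"max difference inside hole: {hole_difference:.2e}")
```

In this toy setting both outputs agree with the constraint pixels while differing inside the hole, which mirrors the claim that results vary with the sampled noise. A real pipeline would replace the linear map with the diffusion model's sampling procedure and the fixed mask with flow-based propagation.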