Get In Video: Add Anything You Want to the Video
Journal:
arXiv
Published Date:
Mar 8, 2025
Abstract
Video editing increasingly demands the ability to incorporate specific
real-world instances into existing footage, yet current approaches
fundamentally fail to capture the unique visual characteristics of particular
subjects or to ensure natural instance/scene interactions. We formalize this
overlooked yet critical editing paradigm as "Get-In-Video Editing", where users
provide reference images to precisely specify visual elements they wish to
incorporate into videos. To address the task's two main challenges, severe
scarcity of training data and the difficulty of maintaining spatiotemporal
coherence, we introduce three key contributions. First, we develop the GetIn-1M
dataset, created through our automated Recognize-Track-Erase pipeline, which
sequentially performs video captioning, salient instance identification, object
detection, temporal tracking, and instance removal to generate high-quality
video editing pairs with comprehensive annotations (reference image, tracking
mask, instance prompt). Second, we present GetInVideo, a novel end-to-end
framework that leverages a diffusion transformer architecture with 3D full
attention to process reference images, condition videos, and masks
simultaneously, maintaining temporal coherence, preserving visual identity, and
ensuring natural scene interactions when integrating reference objects into
videos. Finally, we establish GetInBench, the first comprehensive benchmark for
the Get-In-Video Editing scenario, demonstrating our approach's superior
performance through extensive evaluations. Our work enables accessible,
high-quality incorporation of specific real-world subjects into videos,
significantly advancing personalized video editing capabilities.
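
The stage ordering of the Recognize-Track-Erase pipeline described above can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract names the stages and the annotations (reference image, tracking mask, instance prompt) but not their interfaces, so every function, type, and field name below is hypothetical, the stage models are replaced by toy stand-ins, and cropping the reference image from the first frame is an assumption made only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# All names here are hypothetical; they are not taken from the paper.
Frame = List[List[int]]          # toy grayscale frame: rows of pixel values
Mask = List[List[bool]]          # True where the instance is visible
Box = Tuple[int, int, int, int]  # (row0, col0, row1, col1), end-exclusive


@dataclass
class EditingPair:
    target_video: List[Frame]     # original footage containing the instance
    condition_video: List[Frame]  # same footage with the instance erased
    reference_image: Frame        # assumed here to be a crop of the first frame
    tracking_mask: List[Mask]     # one mask per frame
    instance_prompt: str          # short text naming the instance


def recognize_track_erase(
    video: List[Frame],
    caption: Callable[[List[Frame]], str],
    pick_instance: Callable[[str], str],
    detect: Callable[[Frame, str], Box],
    track: Callable[[List[Frame], Box], List[Mask]],
    erase: Callable[[List[Frame], List[Mask]], List[Frame]],
) -> EditingPair:
    """Run the stages in the order the abstract lists them."""
    description = caption(video)               # video captioning
    prompt = pick_instance(description)        # salient instance identification
    r0, c0, r1, c1 = detect(video[0], prompt)  # object detection
    masks = track(video, (r0, c0, r1, c1))     # temporal tracking
    condition = erase(video, masks)            # instance removal
    reference = [row[c0:c1] for row in video[0][r0:r1]]
    return EditingPair(video, condition, reference, masks, prompt)


# Toy stand-ins so the sketch runs end to end; real stages would be models.
def toy_detect(frame: Frame, prompt: str) -> Box:
    cells = [(r, c) for r, row in enumerate(frame) for c, v in enumerate(row) if v > 0]
    rows = [r for r, _ in cells]
    cols = [c for _, c in cells]
    return min(rows), min(cols), max(rows) + 1, max(cols) + 1


def toy_track(video: List[Frame], box: Box) -> List[Mask]:
    # Ignores the box and simply thresholds every frame.
    return [[[v > 0 for v in row] for row in frame] for frame in video]


def toy_erase(video: List[Frame], masks: List[Mask]) -> List[Frame]:
    # Replaces masked pixels with the background value 0 without mutating the input.
    return [
        [[0 if m else v for v, m in zip(row, mask_row)] for row, mask_row in zip(frame, mask)]
        for frame, mask in zip(video, masks)
    ]


toy_video = [
    [[0, 9, 0], [0, 9, 0], [0, 0, 0]],
    [[0, 0, 9], [0, 0, 9], [0, 0, 0]],
]
pair = recognize_track_erase(
    toy_video,
    caption=lambda v: "a dog runs across a lawn",
    pick_instance=lambda text: "dog",
    detect=toy_detect,
    track=toy_track,
    erase=toy_erase,
)
```

Passing the stages as callables keeps the ordering explicit while leaving each stage swappable, which is one reasonable way to structure such a pipeline when the concrete models are not specified.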