SwapAnyone: Consistent and Realistic Video Synthesis for Swapping Any Person into Any Video
Journal: arXiv
Published Date: Mar 12, 2025
Abstract
Video body-swapping, which replaces the body in an existing video with a new
body from an arbitrary source, has garnered increasing attention in recent years.
Existing methods treat video body-swapping as a composite of multiple tasks
rather than an independent task and typically chain several models to perform
the swap sequentially. As a result, these methods cannot be optimized end to
end, which causes issues such as variations in luminance among frames,
disorganized occlusion relationships, and noticeable separation between
bodies and background. In this work, we
define video body-swapping as an independent task and propose three critical
consistencies: identity consistency, motion consistency, and environment
consistency. We introduce an end-to-end model named SwapAnyone, treating video
body-swapping as a video inpainting task with reference fidelity and motion
control. To better maintain environmental harmony in the resulting video,
particularly luminance harmony, we introduce EnvHarmony, a novel strategy that
trains our model progressively. Additionally, we provide a
dataset named HumanAction-32K covering various videos about human actions.
Extensive experiments demonstrate that our method achieves state-of-the-art
(SOTA) performance among open-source methods while approaching or surpassing
closed-source models across multiple dimensions. All code, model weights, and
the HumanAction-32K dataset will be open-sourced at
https://github.com/PKU-YuanGroup/SwapAnyone.
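
To make the inpainting formulation and the luminance issue more concrete, the following is a minimal sketch, not the SwapAnyone implementation. It assumes a video stored as a NumPy array of shape (T, H, W, 3) with values in [0, 1] and a per-frame boolean body mask. The function names `mask_body_region` and `frame_luminance` are hypothetical. The first blanks out the original body so that an inpainting model would have to fill that region. The second computes the mean Rec. 709 luma of each frame, one simple way to inspect luminance variation among frames.

```python
import numpy as np


def mask_body_region(video, body_masks):
    """Zero out the original body in every frame (hypothetical helper).

    video:      (T, H, W, 3) array with values in [0, 1]
    body_masks: (T, H, W) boolean array, True where the original body is
    """
    video = np.asarray(video, dtype=np.float32)
    keep = ~np.asarray(body_masks, dtype=bool)
    # Broadcast the (T, H, W) keep-mask over the channel axis.
    return video * keep[..., None]


def frame_luminance(video):
    """Mean Rec. 709 luma of each frame (hypothetical helper)."""
    weights = np.array([0.2126, 0.7152, 0.0722], dtype=np.float32)
    luma = np.asarray(video, dtype=np.float32) @ weights  # shape (T, H, W)
    return luma.mean(axis=(1, 2))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    video = rng.random((8, 64, 64, 3), dtype=np.float32)
    masks = np.zeros((8, 64, 64), dtype=bool)
    masks[:, 16:48, 24:40] = True  # assumed body region, for illustration only

    masked = mask_body_region(video, masks)
    per_frame = frame_luminance(video)
    print("masked input shape:", masked.shape)
    print("luminance std across frames:", float(per_frame.std()))
```

A large standard deviation of the per-frame luminance on an otherwise static scene would be one rough indicator of the flicker described above. The actual model inputs, conditioning signals, and evaluation metrics are defined in the paper and repository, not here.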