RoboSwap: A GAN-driven Video Diffusion Framework For Unsupervised Robot Arm Swapping
Journal:
arXiv
Published Date:
Jun 10, 2025
Abstract
Recent advancements in generative models have revolutionized video synthesis
and editing. However, the scarcity of diverse, high-quality datasets continues
to hinder video-conditioned robotic learning, limiting cross-platform
generalization. In this work, we address the challenge of swapping a robotic
arm in one video with another: a key step for crossembodiment learning. Unlike
previous methods that depend on paired video demonstrations in the same
environmental settings, our proposed framework, RoboSwap, operates on unpaired
data from diverse environments, alleviating the data collection needs. RoboSwap
introduces a novel video editing pipeline integrating both GANs and diffusion
models, combining their isolated advantages. Specifically, we segment robotic
arms from their backgrounds and train an unpaired GAN model to translate one
robotic arm to another. The translated arm is blended with the original video
background and refined with a diffusion model to enhance coherence, motion
realism and object interaction. The GAN and diffusion stages are trained
independently. Our experiments demonstrate that RoboSwap outperforms
state-of-the-art video and image editing models on three benchmarks in terms of
both structural coherence and motion consistency, thereby offering a robust
solution for generating reliable, cross-embodiment data in robotic learning.