Towards a Generalizable Bimanual Foundation Policy via Flow-based Video Prediction
Journal: arXiv
Published Date: May 30, 2025
Abstract
Learning a generalizable bimanual manipulation policy is extremely
challenging for embodied agents due to the large action space and the need for
coordinated arm movements. Existing approaches rely on Vision-Language-Action
(VLA) models to acquire bimanual policies. However, transferring knowledge from
single-arm datasets or pre-trained VLA models often fails to generalize
effectively, primarily due to the scarcity of bimanual data and the fundamental
differences between single-arm and bimanual manipulation. In this paper, we
propose a novel bimanual foundation policy that fine-tunes leading
text-to-video models to predict robot trajectories and trains a lightweight
diffusion policy for action generation. Given the lack of embodied knowledge in
text-to-video models, we introduce a two-stage paradigm that fine-tunes
independent text-to-flow and flow-to-video models derived from a pre-trained
text-to-video model. Specifically, optical flow serves as an intermediate
variable, providing a concise representation of subtle movements between
images. The text-to-flow model predicts optical flow to concretize the intent
of language instructions, and the flow-to-video model leverages this flow for
fine-grained video prediction. Our method mitigates the ambiguity of language
in single-stage text-to-video prediction and significantly reduces the
robot-data requirement by avoiding direct use of low-level actions. In
experiments, we collect high-quality manipulation data for a real dual-arm
robot, and results from both simulation and real-world experiments demonstrate
the effectiveness of our method.
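
The data flow described above (instruction to optical flow, optical flow to video, video to actions) can be illustrated with a toy sketch. This is not the authors' implementation: every function name, the constant-shift flow rule, the nearest-neighbour warp, and the 14-dimensional action (assumed here to stand for two 7-DoF arms) are hypothetical stand-ins for the fine-tuned text-to-flow model, the flow-to-video model, and the diffusion policy. It only shows how the three stages might be chained.

```python
from typing import List, Tuple

Frame = List[List[float]]            # toy grayscale image, H x W
Flow = List[List[Tuple[int, int]]]   # per-pixel displacement (dx, dy)


def text_to_flow(instruction: str, first_frame: Frame, horizon: int) -> List[Flow]:
    """Hypothetical stage 1: turn an instruction into per-step flow fields.

    Stand-in rule: shift every pixel right if the instruction mentions
    "right", otherwise left. A real model would be learned.
    """
    h, w = len(first_frame), len(first_frame[0])
    dx = 1 if "right" in instruction else -1
    return [[[(dx, 0) for _ in range(w)] for _ in range(h)] for _ in range(horizon)]


def flow_to_video(first_frame: Frame, flows: List[Flow]) -> List[Frame]:
    """Hypothetical stage 2: roll out future frames by forward-warping with the flow."""
    frames = [first_frame]
    for flow in flows:
        prev = frames[-1]
        h, w = len(prev), len(prev[0])
        nxt = [[0.0] * w for _ in range(h)]
        for y in range(h):
            for x in range(w):
                dx, dy = flow[y][x]
                ty, tx = y + dy, x + dx
                if 0 <= ty < h and 0 <= tx < w:  # pixels leaving the image are dropped
                    nxt[ty][tx] = prev[y][x]
        frames.append(nxt)
    return frames


def policy_from_video(frames: List[Frame], action_dim: int = 14) -> List[List[float]]:
    """Hypothetical stage 3: placeholder for the lightweight diffusion policy.

    Emits one action vector per predicted frame transition; the value is
    just the total absolute pixel change, repeated action_dim times.
    """
    actions = []
    for prev, nxt in zip(frames, frames[1:]):
        change = sum(abs(a - b) for rp, rn in zip(prev, nxt) for a, b in zip(rp, rn))
        actions.append([change] * action_dim)
    return actions


def run_pipeline(instruction: str, first_frame: Frame, horizon: int = 3):
    """Chain the three stages: text -> flow -> video -> actions."""
    flows = text_to_flow(instruction, first_frame, horizon)
    frames = flow_to_video(first_frame, flows)
    actions = policy_from_video(frames)
    return flows, frames, actions


if __name__ == "__main__":
    start = [[1.0, 0.0, 0.0, 0.0],
             [0.0, 0.0, 0.0, 0.0]]
    flows, frames, actions = run_pipeline("push the block right", start)
    print(len(flows), len(frames), len(actions))
```

In this sketch the policy never sees the text instruction, only the predicted frames, which mirrors the abstract's point that optical flow makes the language intent concrete before video prediction and action generation.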