Learning Coordinated Bimanual Manipulation Policies using State Diffusion and Inverse Dynamics Models
Journal: arXiv
Published Date: Mar 30, 2025
Abstract
When performing tasks like laundry, humans naturally coordinate both hands to
manipulate objects and anticipate how their actions will change the state of
the clothes. However, achieving such coordination in robotics remains
challenging due to the need to model object movement, predict future states,
and generate precise bimanual actions. In this work, we address these
challenges by infusing the predictive nature of human manipulation strategies
into robot imitation learning. Specifically, we disentangle task-related state
transitions from agent-specific inverse dynamics modeling to enable effective
bimanual coordination. Using a demonstration dataset, we train a diffusion
model to predict future states given historical observations, envisioning how
the scene evolves. Then, we use an inverse dynamics model to compute robot
actions that achieve the predicted states. Our key insight is that modeling
object movement can help in learning policies for coordinated bimanual
manipulation tasks. Evaluating our framework across diverse simulation and
real-world manipulation setups, including multimodal goal configurations,
bimanual manipulation, deformable objects, and multi-object setups, we find
that it consistently outperforms state-of-the-art state-to-action mapping
policies. Our method demonstrates a remarkable capacity to navigate multimodal
goal configurations and action distributions, maintain stability across
different control modes, and synthesize a broader range of behaviors than those
present in the demonstration dataset.
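
The two-stage structure described above (first predict a future state, then recover the action that reaches it) can be illustrated with a small toy sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the state is the 1-D position of each of two grippers, the dynamics are a hypothetical linear rule s' = s + GAIN * a, the inverse dynamics model is a single scalar fitted by least squares, and `predict_next_state` is a hand-written placeholder standing in for the learned state diffusion model.

```python
import random

# Hypothetical toy setup (not from the paper): the state holds the 1-D position
# of each of two grippers, and the environment follows s' = s + GAIN * a.
GAIN = 0.5


def step(state, action):
    """Toy environment dynamics, applied independently to each arm."""
    return [s + GAIN * a for s, a in zip(state, action)]


def collect_demos(n, rng):
    """Random (s, a, s') transitions standing in for a demonstration dataset."""
    demos = []
    for _ in range(n):
        s = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
        a = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
        demos.append((s, a, step(s, a)))
    return demos


def fit_inverse_dynamics(demos):
    """Least-squares fit of a = k * (s' - s) with one shared scalar k."""
    num = 0.0
    den = 0.0
    for s, a, s_next in demos:
        for si, ai, sn in zip(s, a, s_next):
            delta = sn - si
            num += delta * ai
            den += delta * delta
    return num / den


def predict_next_state(history, goal, rate=0.3):
    """Placeholder for the learned state predictor (a diffusion model in the
    paper): moves the latest state a fixed fraction of the way to the goal."""
    current = history[-1]
    return [c + rate * (g - c) for c, g in zip(current, goal)]


def rollout(start, goal, k, steps=30):
    """Alternate state prediction and inverse dynamics for a fixed horizon."""
    history = [start]
    for _ in range(steps):
        target = predict_next_state(history, goal)
        # Inverse dynamics: the action expected to move the state to `target`.
        action = [k * (t - c) for t, c in zip(target, history[-1])]
        history.append(step(history[-1], action))
    return history


rng = random.Random(0)
k = fit_inverse_dynamics(collect_demos(200, rng))
goal = [0.2, -0.2]
trajectory = rollout(start=[-1.0, 1.0], goal=goal, k=k)
final_state = trajectory[-1]
```

In this sketch the state predictor never sees actions and the inverse dynamics fit never sees the goal, which mirrors the separation of task-related state transitions from agent-specific inverse dynamics described in the abstract. A real system would replace both toy components with learned networks.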