HAMSTER: Hierarchical Action Models For Open-World Robot Manipulation
Journal: arXiv
Published Date: Feb 8, 2025
Abstract
Large foundation models have shown strong open-world generalization to
complex problems in vision and language, but similar levels of generalization
have yet to be achieved in robotics. One fundamental challenge is the lack of
robotic data, which are typically obtained through expensive on-robot
operation. A promising remedy is to leverage cheaper, off-domain data such as
action-free videos, hand-drawn sketches or simulation data. In this work, we
posit that hierarchical vision-language-action (VLA) models can be more
effective in utilizing off-domain data than standard monolithic VLA models that
directly finetune vision-language models (VLMs) to predict actions. In
particular, we study a class of hierarchical VLA models, where the high-level
VLM is finetuned to produce a coarse 2D path indicating the desired robot
end-effector trajectory given an RGB image and a task description. The
intermediate 2D path prediction then serves as guidance to the low-level,
3D-aware control policy capable of precise manipulation. Doing so relieves
the high-level VLM of fine-grained action prediction, while reducing the
low-level policy's burden of complex task-level reasoning. We show that, with
the hierarchical design, the high-level VLM can transfer across significant
domain gaps between the off-domain finetuning data and real-robot testing
scenarios, including differences in embodiments, dynamics, visual appearances,
and task semantics. In the real-robot experiments, we observe an average
improvement of 20% in absolute success rate over OpenVLA across seven
different axes of generalization, representing a 50% relative gain. Visual results,
code, and dataset are provided at: https://hamster-robot.github.io/
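
The two-level interface described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: `predict_path_stub` and `low_level_policy_stub` are hypothetical placeholders for the finetuned high-level VLM and the 3D-aware low-level policy, and passing the path to the low-level policy as an overlay drawn on the RGB image, as well as the 7-dimensional action, are assumptions made only for illustration.

```python
import numpy as np

def predict_path_stub(image, task, num_points=5):
    """Hypothetical stand-in for the high-level VLM.

    Returns a coarse 2D path as (x, y) points in normalized [0, 1] image
    coordinates. Here it is a fixed diagonal line, ignoring its inputs.
    """
    t = np.linspace(0.1, 0.9, num_points)
    return np.stack([t, t], axis=1)  # shape: (num_points, 2)

def draw_path(image, path, color=(255, 0, 0)):
    """Overlay the 2D path on a copy of the image (assumed guidance format)."""
    out = image.copy()
    h, w = out.shape[:2]
    pixels = path * np.array([w - 1, h - 1])  # normalized -> pixel coordinates
    for start, end in zip(pixels[:-1], pixels[1:]):
        # Sample each segment densely enough to leave no gaps.
        steps = int(np.abs(end - start).max()) + 1
        for s in np.linspace(0.0, 1.0, steps + 1):
            x, y = np.rint(start + s * (end - start)).astype(int)
            out[y, x] = color
    return out

def low_level_policy_stub(annotated_image, path):
    """Hypothetical stand-in for the low-level control policy.

    Returns a zero action; the 7-dimensional action size is an assumption.
    """
    return np.zeros(7)

def hierarchical_step(image, task):
    """One control step: high-level path prediction, then low-level action."""
    path = predict_path_stub(image, task)
    annotated = draw_path(image, path)
    action = low_level_policy_stub(annotated, path)
    return path, annotated, action

image = np.zeros((64, 64, 3), dtype=np.uint8)
path, annotated, action = hierarchical_step(image, "pick up the cup")
```

The point of the sketch is the division of labor: only `predict_path_stub` sees the task description, so task-level reasoning stays in the high-level model, while the low-level policy only has to follow the path.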