$\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization

Journal: arXiv
Published Date:

Abstract

In order for robots to be useful, they must perform practically relevant tasks in the real world, outside of the lab. While vision-language-action (VLA) models have demonstrated impressive results for end-to-end robot control, it remains an open question how far such models can generalize in the wild. We describe $\pi_{0.5}$, a new model based on $\pi_{0}$ that uses co-training on heterogeneous tasks to enable broad generalization. $\pi_{0.5}$ uses data from multiple robots, high-level semantic prediction, web data, and other sources to enable broadly generalizable real-world robotic manipulation. Our system uses a combination of co-training and hybrid multi-modal examples that combine image observations, language commands, object detections, semantic subtask prediction, and low-level actions. Our experiments show that this kind of knowledge transfer is essential for effective generalization, and we demonstrate for the first time that an end-to-end learning-enabled robotic system can perform long-horizon and dexterous manipulation skills, such as cleaning a kitchen or bedroom, in entirely new homes.

Authors

  • Physical Intelligence
  • Kevin Black
  • Noah Brown
  • James Darpinian
  • Karan Dhabalia
  • Danny Driess
  • Adnan Esmail
  • Michael Equi
  • Chelsea Finn
  • Niccolo Fusai
  • Manuel Y. Galliker
  • Dibya Ghosh
  • Lachy Groom
  • Karol Hausman
  • Brian Ichter
  • Szymon Jakubczak
  • Tim Jones
  • Liyiming Ke
  • Devin LeBlanc
  • Sergey Levine
  • Adrian Li-Bell
  • Mohith Mothukuri
  • Suraj Nair
  • Karl Pertsch
  • Allen Z. Ren
  • Lucy Xiaoyang Shi
  • Laura Smith
  • Jost Tobias Springenberg
  • Kyle Stachowicz
  • James Tanner
  • Quan Vuong
  • Homer Walke
  • Anna Walling
  • Haohuan Wang
  • Lili Yu
  • Ury Zhilinsky