FLIP: Flow-Centric Generative Planning as General-Purpose Manipulation World Model
Journal:
arXiv
Published Date:
Dec 11, 2024
Abstract
We aim to develop a model-based planning framework for world models that can
be scaled with increasing model and data budgets for general-purpose
manipulation tasks with only language and vision inputs. To this end, we
present FLow-centric generative Planning (FLIP), a model-based planning
algorithm in visual space that features three key modules: 1. a multi-modal
flow generation model as the general-purpose action proposal module; 2. a
flow-conditioned video generation model as the dynamics module; and 3. a
vision-language representation learning model as the value module. Given an
initial image and language instruction as the goal, FLIP can progressively
search for long-horizon flow and video plans that maximize the discounted
return to accomplish the task. FLIP is able to synthesize long-horizon plans
across objects, robots, and tasks with image flows as the general action
representation, and the dense flow information also provides rich guidance for
long-horizon video generation. In addition, the synthesized flow and video
plans can guide the training of low-level control policies for robot execution.
Experiments on diverse benchmarks demonstrate that FLIP can improve both the
success rates and quality of long-horizon video plan synthesis and that it has
the interactive world model property, opening up wider applications for future
work. Video demos are on our website: https://nus-lins-lab.github.io/flipweb/.
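
The interplay of the three modules described above (action proposal, dynamics, value) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the search procedure, so a plain beam search is assumed here, and the names `plan`, `propose`, `dynamics`, and `value` are hypothetical. Scalars stand in for images and flows so that the example runs without any learned model.

```python
from typing import Callable, List, Tuple

# Toy stand-ins (assumptions): in the paper a state is an image and an
# action is an image flow; plain floats are used here for illustration.
State = float
Action = float


def plan(
    initial_state: State,
    propose: Callable[[State], List[Action]],    # action proposal module
    dynamics: Callable[[State, Action], State],  # dynamics module
    value: Callable[[State], float],             # value module
    horizon: int = 5,
    beam_width: int = 3,
    gamma: float = 0.9,
) -> Tuple[float, State, List[Action]]:
    """Beam search for the action sequence with the highest discounted return."""
    # Each beam entry: (discounted return so far, current state, actions so far)
    beams: List[Tuple[float, State, List[Action]]] = [(0.0, initial_state, [])]
    for t in range(horizon):
        candidates = []
        for ret, state, actions in beams:
            for action in propose(state):
                next_state = dynamics(state, action)
                reward = value(next_state)
                candidates.append(
                    (ret + (gamma ** t) * reward, next_state, actions + [action])
                )
        # Keep only the highest-return partial plans.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]
    return beams[0]


# Toy task (hypothetical): move a scalar state from 0.0 to the goal 3.0.
GOAL = 3.0
best_return, final_state, actions = plan(
    initial_state=0.0,
    propose=lambda s: [-1.0, 0.0, 1.0],
    dynamics=lambda s, a: s + a,
    value=lambda s: -abs(GOAL - s),
)
print(actions, final_state, best_return)
```

In this toy setting the value is the negative distance to the goal, so the search steps toward the goal and then stays there. In FLIP the corresponding roles are played by learned flow generation, video generation, and vision-language models.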