VLM-TDP: VLM-guided Trajectory-conditioned Diffusion Policy for Robust Long-Horizon Manipulation
Journal: arXiv
Published Date: Jul 6, 2025
Abstract
Diffusion policy has demonstrated promising performance in the field of
robotic manipulation. However, its effectiveness has been primarily limited to
short-horizon tasks, and its performance significantly degrades in the presence
of image noise. To address these limitations, we propose a VLM-guided
trajectory-conditioned diffusion policy (VLM-TDP) for robust and long-horizon
manipulation. Specifically, the proposed method leverages state-of-the-art
vision-language models (VLMs) to decompose long-horizon tasks into concise,
manageable sub-tasks, while also generating voxel-based
trajectories for each sub-task. The generated trajectories serve as a crucial
conditioning factor, effectively steering the diffusion policy and
substantially enhancing its performance. The proposed Trajectory-conditioned
Diffusion Policy (TDP) is trained on trajectories derived from demonstration
data and validated using the trajectories generated by the VLM. Simulation
results indicate that our method significantly outperforms
classical diffusion policies, achieving an average 44% increase in success
rate, over 100% improvement in long-horizon tasks, and a 20% reduction in
performance degradation in challenging conditions, such as noisy images or
altered environments. These findings are further reinforced by our real-world
experiments, where the performance gap becomes even more pronounced in
long-horizon tasks. Videos are available at https://youtu.be/g0T6h32OSC8
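
To make the idea of a voxel-based trajectory concrete, the following is a minimal sketch of how continuous end-effector waypoints from a demonstration might be discretized into a sequence of voxel indices that could then serve as a conditioning signal. It is an illustration only, not the paper's implementation: the function name `voxelize_trajectory`, the workspace bounds, and the grid resolution are all assumptions, and the abstract does not specify how the voxel grid is defined.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]
Voxel = Tuple[int, int, int]


def voxelize_trajectory(
    points: Sequence[Point],
    lower: Point,
    upper: Point,
    grid_size: int,
) -> List[Voxel]:
    """Map 3D waypoints to voxel indices on a cubic grid.

    Hypothetical helper: `lower`/`upper` are assumed workspace bounds and
    `grid_size` is an assumed number of cells per axis.
    Consecutive waypoints that fall in the same voxel are merged.
    """
    if grid_size <= 0:
        raise ValueError("grid_size must be positive")
    if any(hi <= lo for lo, hi in zip(lower, upper)):
        raise ValueError("each upper bound must exceed its lower bound")

    voxels: List[Voxel] = []
    for point in points:
        cells = []
        for value, lo, hi in zip(point, lower, upper):
            # Normalize to [0, 1], scale to the grid, then clamp so that
            # out-of-bounds waypoints snap to the nearest boundary cell.
            cell = int((value - lo) / (hi - lo) * grid_size)
            cells.append(min(max(cell, 0), grid_size - 1))
        voxel = (cells[0], cells[1], cells[2])
        # Keep only changes of voxel, so the result is a compact path.
        if not voxels or voxels[-1] != voxel:
            voxels.append(voxel)
    return voxels


if __name__ == "__main__":
    demo = [(0.0, 0.0, 0.0), (0.05, 0.0, 0.0), (0.5, 0.5, 0.5), (1.0, 1.0, 1.0)]
    print(voxelize_trajectory(demo, (0.0, 0.0, 0.0), (1.0, 1.0, 1.0), 10))
    # prints [(0, 0, 0), (5, 5, 5), (9, 9, 9)]
```

Under these assumptions, a trajectory derived from demonstration data (for training) and one proposed by a VLM (for validation) would share the same discrete representation, which is one plausible reason a coarse voxel path is a convenient interface between the two.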