CDP: Towards Robust Autoregressive Visuomotor Policy Learning via Causal Diffusion
Journal: arXiv
Published Date: Jun 17, 2025
Abstract
Diffusion Policy (DP) enables robots to learn complex behaviors by imitating
expert demonstrations through action diffusion. However, in practical
applications, hardware limitations often degrade data quality, while real-time
constraints restrict model inference to instantaneous state and scene
observations. These limitations seriously reduce the efficacy of learning from
expert demonstrations, resulting in failures in object localization, grasp
planning, and long-horizon task execution. To address these challenges, we
propose Causal Diffusion Policy (CDP), a novel transformer-based diffusion
model that enhances action prediction by conditioning on historical action
sequences, thereby enabling more coherent and context-aware visuomotor policy
learning. To offset the computational cost of autoregressive inference, we also
introduce a caching mechanism that stores attention key-value pairs from
previous timesteps, substantially reducing redundant computation during
execution. Extensive experiments in both
simulated and real-world environments, spanning diverse 2D and 3D manipulation
tasks, demonstrate that CDP uniquely leverages historical action sequences to
achieve significantly higher accuracy than existing methods. Moreover, even
when faced with degraded input observation quality, CDP maintains remarkable
precision by reasoning through temporal continuity, which highlights its
practical robustness for robotic control under realistic, imperfect conditions.
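
The caching idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a single attention head, random projection weights, and plain Python lists, and the names `CachedCausalAttention`, `step`, and `recompute` are hypothetical. It shows only that appending each new timestep's key and value to a cache yields the same causal-attention output as recomputing keys and values for the whole history at every step.

```python
import math
import random


def vecmat(x, W):
    """Multiply row vector x (length d) by matrix W (d x d)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(q, keys, values):
    """Scaled dot-product attention of one query over the given keys/values."""
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
    weights = softmax(scores)
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(d)]


class CachedCausalAttention:
    """Hypothetical single-head attention layer with a key-value cache."""

    def __init__(self, dim, seed=0):
        rng = random.Random(seed)

        def make():
            # Assumption: random weights stand in for learned projections.
            return [[rng.gauss(0, 1) / math.sqrt(dim) for _ in range(dim)]
                    for _ in range(dim)]

        self.Wq, self.Wk, self.Wv = make(), make(), make()
        self.k_cache, self.v_cache = [], []

    def step(self, x):
        """Process one new timestep, projecting only the new token."""
        self.k_cache.append(vecmat(x, self.Wk))
        self.v_cache.append(vecmat(x, self.Wv))
        return attend(vecmat(x, self.Wq), self.k_cache, self.v_cache)

    def recompute(self, xs):
        """Reference path: re-project the full history at every timestep."""
        outs = []
        for t in range(len(xs)):
            keys = [vecmat(x, self.Wk) for x in xs[: t + 1]]
            values = [vecmat(x, self.Wv) for x in xs[: t + 1]]
            outs.append(attend(vecmat(xs[t], self.Wq), keys, values))
        return outs


rng = random.Random(1)
actions = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(5)]  # toy action history

attn = CachedCausalAttention(dim=4)
incremental = [attn.step(a) for a in actions]
reference = attn.recompute(actions)

matches = all(
    math.isclose(a, b, abs_tol=1e-9)
    for row_a, row_b in zip(incremental, reference)
    for a, b in zip(row_a, row_b)
)
```

In this sketch the cached path performs one key projection and one value projection per timestep, whereas the reference path repeats those projections for every earlier timestep, which is the kind of redundant computation a key-value cache avoids.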