Score-Based Diffusion Policy Compatible with Reinforcement Learning via Optimal Transport
Journal:
arXiv
Published Date:
Feb 18, 2025
Abstract
Diffusion policies have shown promise in learning complex behaviors from
demonstrations, particularly for tasks requiring precise control and long-term
planning. However, they face challenges in robustness when encountering
distribution shifts. This paper explores improving diffusion-based imitation
learning models through online interactions with the environment. We propose
OTPR (Optimal Transport-guided score-based diffusion Policy for Reinforcement
learning fine-tuning), a novel method that integrates diffusion policies with
RL using optimal transport theory. OTPR leverages the Q-function as a transport
cost and views the policy as an optimal transport map, enabling efficient and
stable fine-tuning. Moreover, we introduce masked optimal transport to guide
state-action matching using expert keypoints and a compatibility-based
resampling strategy to enhance training stability. Experiments on three
simulation tasks demonstrate OTPR's superior performance and robustness
compared to existing methods, especially in complex and sparse-reward
environments. In sum, OTPR provides an effective framework for combining IL and
RL, achieving versatile and reliable policy learning. The code will be released
at https://github.com/Sunmmyy/OTPR.git.