Mirror Descent Safe Policy Optimization for Reinforcement Learning Agents.

Journal: IEEE Transactions on Pattern Analysis and Machine Intelligence
Published Date:

Abstract

Embodied intelligence and related disciplines have identified several mechanisms that help embodied agents learn to solve complex problems. Reinforcement learning (RL) is one of the most promising computational approaches for enhancing the learning-based problem-solving abilities of such agents. With the recent rapid evolution of artificial intelligence, RL has become a keystone technology that accelerates scientific discovery and finds applications in many other domains. In RL, an agent collects data by interacting with the environment and uses that data to optimize a policy for a higher return. Further improvement requires more exploration of the action space. However, not all actions in that space are safe and acceptable, so the agent's exploration must be constrained. In this work, a novel mirror descent safe policy optimization (MDSPO) algorithm is proposed to ensure the safety of an RL agent. The algorithm leverages mirror descent optimization to maximize the return while satisfying the safety constraint. A novel optimization objective is formulated, and an innovative three-stage optimization strategy is employed, comprising gradient descent without the cost constraint, projection onto the nonparametric policy space with the cost constraint, and projection onto the parametric policy space. Compared to previous methods, MDSPO is a simple, easy-to-implement first-order approach that does not impose a hard constraint on the trust region. Theoretical analysis of MDSPO reveals a lower bound on return improvement and an upper bound on constraint violation at each policy update. Numerical results from two sets of constrained locomotive experiments demonstrate that MDSPO improves the average return by about 12% and satisfies the cost constraints better than other state-of-the-art methods. In a real-world obstacle avoidance experiment using an unmanned surface vessel, MDSPO both finds the optimal path and guarantees agent safety.
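
The three-stage strategy named in the abstract can be illustrated with a minimal sketch. The abstract does not give the update rules, so everything here is an assumption made for illustration and is not the authors' implementation: a single state with three discrete actions, a KL (entropic) mirror map, a cost projection solved by bisection on a multiplier, and a tabular softmax as the parametric policy. All function names, advantage values, the step size, and the cost budget are hypothetical.

```python
import math

def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]

def expected(policy, values):
    return sum(p * v for p, v in zip(policy, values))

def mirror_step(policy, reward_adv, step_size):
    # Stage 1 (assumed form): mirror descent step with a KL mirror map and no
    # cost constraint. Closed form: new policy is proportional to
    # old policy * exp(step_size * reward advantage).
    top = max(reward_adv)  # shift exponents for numerical stability
    return normalize([p * math.exp(step_size * (a - top))
                      for p, a in zip(policy, reward_adv)])

def cost_projection(policy, cost_adv, budget, iters=60):
    # Stage 2 (assumed form): KL projection onto {q : E_q[cost_adv] <= budget}
    # in the nonparametric policy space. The projection has the form
    # q proportional to policy * exp(-lam * cost_adv) with lam >= 0,
    # and lam is found here by bisection.
    if expected(policy, cost_adv) <= budget:
        return list(policy)
    low = min(cost_adv)  # shift exponents for numerical stability

    def tilt(lam):
        return normalize([p * math.exp(-lam * (c - low))
                          for p, c in zip(policy, cost_adv)])

    lo, hi = 0.0, 1.0
    # Grow hi until the constraint holds; if the budget is infeasible the loop
    # stops at the cap and the result is only a best-effort policy.
    while expected(tilt(hi), cost_adv) > budget and hi < 1e3:
        hi *= 2.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if expected(tilt(mid), cost_adv) > budget:
            lo = mid
        else:
            hi = mid
    return tilt(hi)

def softmax(logits):
    top = max(logits)
    return normalize([math.exp(x - top) for x in logits])

def parametric_projection(logits, target, lr=1.0, steps=500):
    # Stage 3 (assumed form): fit softmax logits to the target policy by
    # gradient descent on KL(target || softmax(logits)); the gradient with
    # respect to the logits is softmax(logits) - target.
    for _ in range(steps):
        probs = softmax(logits)
        logits = [x - lr * (p - t) for x, p, t in zip(logits, probs, target)]
    return logits

# Hypothetical numbers: action 0 has the highest reward but also the highest cost.
old_policy = [1 / 3, 1 / 3, 1 / 3]
reward_adv = [1.0, 0.2, -0.5]
cost_adv = [1.0, 0.1, 0.0]
budget = 0.3

stage1 = mirror_step(old_policy, reward_adv, step_size=1.0)
stage2 = cost_projection(stage1, cost_adv, budget)
fitted = softmax(parametric_projection([0.0, 0.0, 0.0], stage2))

print("after stage 1:", stage1)
print("after stage 2:", stage2)
print("after stage 3:", fitted)
```

In this toy setting the first stage shifts probability toward the high-reward action, the second stage pulls it back until the expected cost meets the budget, and the third stage recovers the projected policy with softmax logits. With a neural network policy the third stage would only approximate the projected policy.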

Authors
