VCSAP: Online reinforcement learning exploration method based on visitation count of state-action pairs.

Journal: Neural networks : the official journal of the International Neural Network Society
PMID:

Abstract

In online reinforcement learning, exploration strategies driven by intrinsic rewards tend to perform well in environments with deceptive or sparse rewards. Counting state visitations is an efficient count-based way to derive such an intrinsic reward. However, this approach considers only the novelty of the states the agent encounters, so the agent may over-explore particular state-action pairs and become trapped in a locally optimal solution. This paper proposes a count-based method called the visitation count of state-action pairs (VCSAP), which builds on the strong error-correction ability of online reinforcement learning. VCSAP counts visits to both individual states and state-action pairs, which drives the agent not only to visit novel states but also to select novel actions. MuJoCo is an advanced multi-joint dynamics simulator, and its sparse-reward environments are more challenging and closer to real-world conditions. VCSAP is applied to Proximal Policy Optimization (PPO) and Trust Region Policy Optimization (TRPO), and comparative experiments against exploration baselines are conducted on multiple tasks from the MuJoCo and sparse MuJoCo benchmarks. The experimental results show that, compared with the Random Network Distillation method, the performance of PPO-VCSAP and TRPO-VCSAP improves by 18% and 8%, respectively, across 8 environments.
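
The idea of counting both states and state-action pairs can be illustrated with a minimal Python sketch. The abstract does not give the VCSAP reward formula, so everything here is an assumption for illustration only: the class name `PairCountBonus`, the inverse-square-root bonus form, the scale `beta`, and the grid discretization controlled by `bins` are all hypothetical and are not taken from the paper.

```python
import math
from collections import defaultdict


class PairCountBonus:
    """Hypothetical count-based bonus over states and state-action pairs.

    Assumed bonus form (not from the paper):
        beta * (1 / sqrt(N(s)) + 1 / sqrt(N(s, a)))
    """

    def __init__(self, beta=0.1, bins=10):
        self.beta = beta                      # assumed bonus scale
        self.bins = bins                      # assumed grid resolution
        self.state_counts = defaultdict(int)  # N(s)
        self.pair_counts = defaultdict(int)   # N(s, a)

    def _key(self, vector):
        # Map a continuous vector to a grid cell so that it can be counted.
        return tuple(math.floor(x * self.bins) for x in vector)

    def bonus(self, obs, action):
        s = self._key(obs)
        a = self._key(action)
        # Update both counters, then compute the bonus from the new counts.
        self.state_counts[s] += 1
        self.pair_counts[(s, a)] += 1
        state_term = 1.0 / math.sqrt(self.state_counts[s])
        pair_term = 1.0 / math.sqrt(self.pair_counts[(s, a)])
        return self.beta * (state_term + pair_term)


explorer = PairCountBonus(beta=1.0, bins=10)
r1 = explorer.bonus([0.12, 0.50], [0.3])   # new state, new action
r2 = explorer.bonus([0.12, 0.50], [0.3])   # same state, same action
r3 = explorer.bonus([0.12, 0.50], [-0.3])  # same state, new action
print(r1, r2, r3)
```

In this sketch the third call returns a larger bonus than the second even though the state has been visited more often, because the state-action pair is new. That is the behavior the abstract describes as motivating the agent to select novel actions. In a PPO or TRPO training loop, such a bonus would typically be added to the environment reward before advantage estimation; how VCSAP combines the two is not stated in the abstract.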

Authors

  • Ruikai Zhou
    Key Laboratory of Symbolic Computation and Knowledge Engineering (Jilin University), Changchun 130012, China; College of Computer Science and Technology, Jilin University, Changchun 130012, China. Electronic address: zhourk20@mails.jlu.edu.cn.
  • Wenbo Zhu
    School of Mechatronic Engineering and Automation, Foshan University, Foshan 528225, China.
  • Shuai Han
  • Meng Kang
    Key Laboratory of Symbolic Computation and Knowledge Engineering (Jilin University), Changchun 130012, China; College of Computer Science and Technology, Jilin University, Changchun 130012, China. Electronic address: kangmeng20@mails.jlu.edu.cn.
  • Shuai Lu
    Laboratory of Molecular Design and Drug Discovery, School of Science, China Pharmaceutical University, Nanjing, China.