Neurocontrol for fixed-length trajectories in environments with soft barriers.

Journal: Neural Networks: the official journal of the International Neural Network Society
PMID:

Abstract

In this paper we present three neurocontrol problems in which the analytic policy gradient, computed via back-propagation through time, is used to train an agent to maximise a polynomial reward function in a simulated environment. If the environment includes terminal barriers (e.g. solid walls) that terminate the episode whenever the agent touches them, we show that learning can get stuck in oscillating limit cycles or local minima. Hence we propose using fixed-length trajectories and changing these barriers into soft barriers, which the agent may pass through while incurring a significant penalty cost. We demonstrate that soft barriers have the drawback of causing exploding learning gradients. Furthermore, the strongest learning gradients often appear at inappropriate parts of the trajectory, where control of the system has already been lost. Combined with modern adaptive optimisers, these exploding gradients and inappropriately placed learning signals often cause learning to grind to a halt. We propose ways to avoid these difficulties, either by careful gradient clipping or by smoothly truncating the gradients of the soft barriers' polynomial cost functions. We argue that this enables the learning algorithm to avoid exploding gradients and to concentrate on the most important parts of the trajectory, rather than on parts where control has already been irreversibly lost.
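Below is a minimal, illustrative sketch (not the authors' exact formulation; the barrier location, penalty weight, and truncation depth are assumed constants) of how the gradient of a quadratic soft-barrier penalty can be smoothly truncated: the cost is quadratic near the barrier and switches to a linear tail beyond a chosen penetration depth, so its gradient saturates rather than exploding deep inside the barrier.

    # Sketch in JAX: a quadratic soft-barrier penalty vs. a smoothly
    # gradient-truncated variant. All constants are illustrative.
    import jax
    import jax.numpy as jnp

    BARRIER = 1.0     # assumed barrier position
    PENALTY = 100.0   # assumed weight of the polynomial penalty
    GRAD_CAP = 0.5    # penetration depth beyond which the gradient stops growing

    def soft_barrier_cost(x):
        # Quadratic (polynomial) penalty for penetrating the barrier at x > BARRIER.
        # Its gradient, 2*PENALTY*depth, grows without bound with penetration depth.
        depth = jnp.maximum(x - BARRIER, 0.0)
        return PENALTY * depth ** 2

    def truncated_barrier_cost(x):
        # Same penalty, but with a linear tail beyond GRAD_CAP, chosen so that the
        # cost and its gradient stay continuous; the gradient saturates at
        # 2*PENALTY*GRAD_CAP instead of growing with penetration depth.
        depth = jnp.maximum(x - BARRIER, 0.0)
        quadratic = PENALTY * depth ** 2
        linear = PENALTY * (2.0 * GRAD_CAP * depth - GRAD_CAP ** 2)
        return jnp.where(depth < GRAD_CAP, quadratic, linear)

    for x in (1.2, 2.0, 10.0):
        g_raw = float(jax.grad(soft_barrier_cost)(x))
        g_trunc = float(jax.grad(truncated_barrier_cost)(x))
        print(f"x={x:5.1f}  raw grad={g_raw:8.1f}  truncated grad={g_trunc:6.1f}")

The alternative fix mentioned in the abstract, careful gradient clipping, would instead leave the cost function unchanged and bound the back-propagated gradients directly during training.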

Authors

  • Michael Fairbank
  • Danil Prokhorov
    Toyota Research Institute NA, Ann Arbor, MI, US.
  • David Barragan-Alcantar
    School of Computer Science and Electronic Engineering, University of Essex, Colchester, UK.
  • Spyridon Samothrakis
    Institute for Analytics and Data Science, University of Essex, Colchester, Essex, United Kingdom.
  • Shuhui Li