Rethinking exploration-exploitation trade-off in reinforcement learning via cognitive consistency.

Journal: Neural Networks: the official journal of the International Neural Network Society

Abstract

The exploration-exploitation dilemma is one of the fundamental challenges in deep reinforcement learning (RL). Agents must strike a trade-off between acting on their current beliefs and gathering more information. Prior work mostly devises sophisticated exploration methods that aim to ensure accurate target Q-values or to learn the association between rewards and actions, which may not be intelligent enough to achieve sample efficiency. In this paper, we rethink the trade-off between exploration and exploitation from the perspective of cognitive consistency: humans tend to think and behave in line with their existing knowledge structures (maintaining cognitive consistency), which yields satisfactory results within a brief timeframe. We argue that maintaining consistency, specifically through pessimistic exploration within a cognition oriented toward the optimal policy, can improve efficiency without compromising performance. To this end, we propose the Cognitive Consistency (CoCo) framework. CoCo first leverages a self-imitating distribution correction approach to pursue cognition oriented toward the optimal policy. It then implements pessimistic exploration conservatively by deriving novel inconsistency-minimization objectives inspired by label distribution learning. We validate our framework on a range of standard off-policy RL tasks and show that maintaining cognitive consistency improves both sample efficiency and performance. Code is available at https://github.com/DkING-lv6/CoCo.
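To make the two components above concrete, the following is a minimal, hypothetical PyTorch sketch of the general pattern the abstract describes, not the authors' implementation: a pessimistic bootstrap target taken as the minimum over twin critics, combined with a self-imitation-style term that only pushes value estimates up toward observed returns that exceed the current estimate. All names (TwinQ, pessimistic_self_imitation_loss), the Monte Carlo return input, and the 0.1 weighting are illustrative assumptions; CoCo's actual distribution-correction weighting and label-distribution-inspired inconsistency objectives are defined in the paper, not here.

# Hypothetical sketch (not the authors' code): pessimistic twin-critic target
# plus a self-imitation-style regularizer. Shapes: all batch tensors are
# (batch_size, dim) float tensors; `done` holds 0/1 floats.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwinQ(nn.Module):
    """Two independent Q-heads; taking their minimum gives a pessimistic estimate."""
    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        def mlp():
            return nn.Sequential(
                nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, 1),
            )
        self.q1, self.q2 = mlp(), mlp()

    def forward(self, obs, act):
        x = torch.cat([obs, act], dim=-1)
        return self.q1(x), self.q2(x)

def pessimistic_self_imitation_loss(critic, target_critic, batch, gamma=0.99):
    obs, act, rew, next_obs, next_act, mc_return, done = batch
    with torch.no_grad():
        tq1, tq2 = target_critic(next_obs, next_act)
        # Pessimistic bootstrap: use the smaller of the two target estimates.
        target = rew + gamma * (1.0 - done) * torch.min(tq1, tq2)
    q1, q2 = critic(obs, act)
    td_loss = F.mse_loss(q1, target) + F.mse_loss(q2, target)
    # Self-imitation-style term: only push the (pessimistic) estimate upward
    # toward observed returns that exceed it, i.e. imitate past successes.
    q_min = torch.min(q1, q2)
    sil_loss = F.relu(mc_return - q_min).pow(2).mean()
    return td_loss + 0.1 * sil_loss  # 0.1 is an illustrative weight

In practice such a loss would be applied to minibatches sampled from a replay buffer, with the target critic updated by Polyak averaging, as in standard off-policy actor-critic methods.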

Authors

  • Da Wang
    Department of Colorectal Surgery, The Second Affiliated Hospital of Zhejiang University School of Medicine, Hangzhou, China; Cancer Institute (Key Laboratory of Cancer Prevention and Intervention, China National Ministry of Education), Key Laboratory of Molecular Biology in Medical Sciences, Zhejiang Province, China; The Second Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, China.
  • Wei Wei
    Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA, USA.
  • Lin Li
    Department of Medicine III, LMU University Hospital, LMU Munich, Munich, Germany.
  • Xin Wang
    Key Laboratory of Bio-based Material Science & Technology (Northeast Forestry University), Ministry of Education, Harbin 150040, China.
  • Jiye Liang
    Key Laboratory of Computational Intelligence and Chinese Information Processing of Ministry of Education, School of Computer and Information Technology, Shanxi University, Taiyuan 030006, Shanxi, China. Electronic address: ljy@sxu.edu.cn.