Rethinking exploration-exploitation trade-off in reinforcement learning via cognitive consistency.
Journal:
Neural Networks: The Official Journal of the International Neural Network Society
PMID:
40090299
Abstract
The exploration-exploitation dilemma is one of the fundamental challenges in deep reinforcement learning (RL). Agents must strike a balance between acting on their current beliefs and gathering more information. Prior work mostly devises sophisticated exploration methods to ensure accurate target Q-values or to learn associations between rewards and actions, which may not be sample-efficient. In this paper, we rethink the trade-off between exploration and exploitation from the perspective of cognitive consistency: humans tend to think and behave in line with their existing knowledge structures (maintaining cognitive consistency), which yields satisfactory results within a brief timeframe. We argue that maintaining consistency, specifically through pessimistic exploration within a cognition oriented toward the optimal policy, can improve efficiency without compromising performance. To this end, we propose a Cognitive Consistency (CoCo) framework. CoCo first leverages a self-imitating distribution-correction approach to pursue cognition oriented toward the optimal policy. It then implements pessimistic exploration conservatively by deriving novel inconsistency-minimization objectives inspired by label distribution learning. We validate the framework on a range of standard off-policy RL tasks and show that maintaining cognitive consistency improves both sample efficiency and performance. Code is available at https://github.com/DkING-lv6/CoCo.
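The abstract names two generic ingredients: a pessimistic (conservative) exploration signal and a consistency objective that keeps behaviour aligned with current value estimates. The sketch below is only a minimal illustration of those general ideas, not the CoCo method itself; the ensemble size, the KL-based consistency loss, the 0.1 trade-off weight, and all names (`critics`, `policy`, `consistency_loss`) are assumptions for illustration. The authors' actual implementation is in the linked repository.

```python
# Illustrative sketch only -- NOT the CoCo implementation from the paper.
# Shows two generic ingredients the abstract alludes to, under assumed names:
#   (1) a pessimistic one-step target taken as the minimum over a critic ensemble,
#   (2) a consistency loss (KL divergence) pulling the acting policy toward a
#       softmax over current Q-values, so behaviour stays in line with existing estimates.
import torch
import torch.nn as nn
import torch.nn.functional as F

STATE_DIM, N_ACTIONS, GAMMA = 4, 3, 0.99

def make_critic():
    return nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))

critics = [make_critic() for _ in range(2)]  # hypothetical ensemble size
policy = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))

def pessimistic_target(next_states, rewards, dones):
    """One-step TD target using the elementwise minimum over the critic ensemble."""
    with torch.no_grad():
        q_next = torch.min(torch.stack([c(next_states) for c in critics]), dim=0).values
        return rewards + GAMMA * (1.0 - dones) * q_next.max(dim=1).values

def consistency_loss(states, temperature=1.0):
    """KL(policy || softmax(Q)): penalize acting distributions that diverge from current value estimates."""
    with torch.no_grad():
        q_probs = F.softmax(critics[0](states) / temperature, dim=1)
    log_pi = F.log_softmax(policy(states), dim=1)
    return F.kl_div(log_pi, q_probs, reduction="batchmean")

# Toy batch to show the shapes involved.
batch = 8
states = torch.randn(batch, STATE_DIM)
next_states = torch.randn(batch, STATE_DIM)
actions = torch.randint(0, N_ACTIONS, (batch,))
rewards = torch.randn(batch)
dones = torch.zeros(batch)

target = pessimistic_target(next_states, rewards, dones)
q_pred = critics[0](states).gather(1, actions.unsqueeze(1)).squeeze(1)
td_loss = F.mse_loss(q_pred, target)
total_loss = td_loss + 0.1 * consistency_loss(states)  # 0.1: hypothetical trade-off weight
total_loss.backward()
print(float(td_loss), float(total_loss))
```

Taking the minimum over critics is one standard way to obtain a pessimistic value estimate, and the KL term is one standard way to express a consistency penalty; the paper's self-imitating distribution correction and label-distribution-learning objectives are more specific than this sketch.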