Dual-driven optimization of collaborative multi-agent via case learning and curiosity.
Journal:
Neural networks : the official journal of the International Neural Network Society
Published Date:
Sep 5, 2025
Abstract
Multi-Agent Deep Reinforcement Learning (MADRL) faces significant challenges in the exploration-exploitation trade-off during training, particularly when learning collaborative behaviors through continuous interaction with the environment. Current exploration methods generally rely on unbiased randomized policies, which leave policy optimization without goal direction. As a result, the experience replay buffer fills with transitions of low signal-to-noise ratio, which seriously degrades the learning efficiency and policy convergence stability of MADRL. To address these challenges, we propose the Case-Enhanced Random Network Distillation Exploration for Centralized Training and Decentralized Execution (CERE-CTDE) paradigm. Our innovation lies in the novel integration of Random Network Distillation (RND) and Case-Based Reasoning (CBR): RND provides intrinsic motivation to enhance exploration and overcome sparse rewards, while CBR enables goal-directed exploitation by leveraging historical cases to guide agent action selection. This dual mechanism creates a dynamic equilibrium between exploring novel policies and exploiting proven cases, effectively preventing premature convergence. We incorporate CERE into two categories of MADRL methods based on the CTDE paradigm. We evaluate our approach alongside two exploration-focused methods on 13 confrontation scenarios in the StarCraft Multi-Agent Challenge (SMAC). The experimental results demonstrate: a statistically significant 17.97% improvement in win rate on complex battlefields compared to baseline performance in simple scenarios; effective enhancement of policy exploration-exploitation and mitigation of partial sparse reward problems through intrinsic motivation and CBR-guided action sampling; and superior capability in escaping local optima while maintaining learning efficiency. The framework's robustness is further validated by its consistent performance across different SMAC scenarios with varying difficulty levels.
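The two mechanisms named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the linear networks, the nearest-neighbor case retrieval, the distance threshold `max_dist`, and all class and function names (`RND`, `CaseBase`, `select_action`) are assumptions chosen for illustration. The RND part rewards observations on which a trained predictor still disagrees with a fixed random target; the CBR part reuses the action of a sufficiently similar stored case and otherwise falls back to random sampling.

```python
import numpy as np

OBS_DIM, EMB_DIM, N_ACTIONS = 4, 8, 3


class RND:
    """Toy Random Network Distillation with linear networks (an assumption)."""

    def __init__(self, rng):
        # Fixed, randomly initialised target network; it is never trained.
        self.w_target = rng.normal(size=(OBS_DIM, EMB_DIM))
        # Predictor network, trained to imitate the target on visited observations.
        self.w_pred = np.zeros((OBS_DIM, EMB_DIM))

    def intrinsic_reward(self, obs):
        # Novelty bonus: squared prediction error against the fixed target.
        err = obs @ self.w_pred - obs @ self.w_target
        return float(np.sum(err ** 2))

    def update(self, obs, lr=0.05):
        # One gradient step on the squared error for a single observation.
        err = obs @ self.w_pred - obs @ self.w_target
        self.w_pred -= lr * 2.0 * np.outer(obs, err)


class CaseBase:
    """Hypothetical case store: (observation, action) pairs from past successes."""

    def __init__(self, max_dist=0.5):
        self.obs, self.actions = [], []
        self.max_dist = max_dist  # assumed similarity threshold

    def add(self, obs, action):
        self.obs.append(np.asarray(obs, dtype=float))
        self.actions.append(int(action))

    def retrieve(self, obs):
        # Return the action of the nearest stored case, or None if none is close enough.
        if not self.obs:
            return None
        dists = np.linalg.norm(np.stack(self.obs) - obs, axis=1)
        i = int(np.argmin(dists))
        return self.actions[i] if dists[i] <= self.max_dist else None


def select_action(obs, cases, rng):
    # Exploit a proven case when one matches; otherwise explore at random.
    action = cases.retrieve(obs)
    return action if action is not None else int(rng.integers(N_ACTIONS))


rng = np.random.default_rng(0)
rnd, cases = RND(rng), CaseBase()

familiar = np.array([1.0, 0.5, -0.5, 0.0])  # observation visited many times
novel = np.array([0.0, 0.0, 0.0, 1.0])      # observation never visited

for _ in range(50):
    rnd.update(familiar)
cases.add(familiar, action=2)

print("bonus (familiar):", rnd.intrinsic_reward(familiar))
print("bonus (novel):   ", rnd.intrinsic_reward(novel))
print("action near a stored case:", select_action(familiar + 0.01, cases, rng))
print("action without a match:   ", select_action(novel, cases, rng))
```

In this sketch the novelty bonus for the repeatedly visited observation shrinks toward zero while the unvisited one keeps a large bonus, and the case base only overrides random sampling when a stored case lies within the threshold. How the paper actually combines the two signals during CTDE training is not specified in the abstract.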