Constraining an Unconstrained Multi-agent Policy with offline data.

Journal: Neural networks : the official journal of the International Neural Network Society
PMID:

Abstract

Real-world multi-agent decision-making systems often have to satisfy constraints, such as limits on harmful behavior or economic cost, spurring the emergence of Constrained Multi-Agent Reinforcement Learning (CMARL). Existing studies of CMARL mainly focus on training a constrained policy in an online manner, that is, maximizing cumulative reward while keeping constraint violations within budget. In practice, however, online learning may be infeasible due to safety restrictions or the lack of a high-fidelity simulator. Moreover, as the learned policy is deployed, new constraints that were not considered during training may arise. To address these two issues, we propose Constraining an UnconsTrained Multi-Agent Policy with offline data, dubbed CUTMAP, which follows the popular centralized training with decentralized execution paradigm. Specifically, we formulate a scalable optimization objective for CMARL within the framework of multi-agent maximum-entropy reinforcement learning. This objective estimates a decomposable Q-function by combining an unconstrained "prior policy" with cost signals extracted from offline data. When a new constraint arrives, CUTMAP can reuse the prior policy without re-training it. To tackle the distribution-shift challenge in offline learning, we also incorporate a conservative loss term when updating the Q-function. As a result, an unconstrained prior policy can be adapted to satisfy cost constraints through CUTMAP without expensive interactions with the real environment, facilitating the practical application of MARL algorithms. Empirical results on several cooperative multi-agent benchmarks, including StarCraft games, particle games, food-search games, and robot control, demonstrate the superior performance of our method.
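
For context, the abstract does not spell out the paper's exact objective; the following is a generic sketch of the constrained maximum-entropy formulation that CMARL methods of this kind build on, written in standard CMDP notation (discount \gamma, entropy temperature \alpha, per-step cost c, cost budget d, Lagrange multiplier \lambda) rather than symbols taken from the paper:

    \max_{\pi} \ \mathbb{E}_{\pi}\Big[\textstyle\sum_{t}\gamma^{t}\big(r(s_t,\boldsymbol{a}_t)+\alpha\,\mathcal{H}(\pi(\cdot\mid s_t))\big)\Big]
    \quad \text{s.t.} \quad \mathbb{E}_{\pi}\Big[\textstyle\sum_{t}\gamma^{t}\,c(s_t,\boldsymbol{a}_t)\Big]\le d,

which is commonly optimized through its Lagrangian relaxation,

    \min_{\lambda\ge 0}\ \max_{\pi}\ \mathbb{E}_{\pi}\Big[\textstyle\sum_{t}\gamma^{t}\big(r(s_t,\boldsymbol{a}_t)+\alpha\,\mathcal{H}(\pi(\cdot\mid s_t))-\lambda\,c(s_t,\boldsymbol{a}_t)\big)\Big]+\lambda d.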

Authors

  • Cong Guan
    Nanjing University, Nanjing 210023, China.
  • Tao Jiang
  • Yi-Chen Li
    National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China; School of Artificial Intelligence, Nanjing University, Nanjing, China; Polixir Technologies, Nanjing, China. Electronic address: https://www.lamda.nju.edu.cn/liyc/.
  • Zongzhang Zhang
    National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China; School of Artificial Intelligence, Nanjing University, Nanjing, China; Polixir Technologies, Nanjing, China. Electronic address: https://www.lamda.nju.edu.cn/zhangzz/.
  • Lei Yuan
    National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China; School of Artificial Intelligence, Nanjing University, Nanjing, China; Polixir Technologies, Nanjing, China.
  • Yang Yu
    National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China; School of Artificial Intelligence, Nanjing University, Nanjing, China; Polixir Technologies, Nanjing, China.