Constraining an Unconstrained Multi-agent Policy with offline data.

Journal: Neural networks : the official journal of the International Neural Network Society
PMID:

Abstract

Real-world multi-agent decision-making systems often have to satisfy constraints, such as limits on harmful behavior or economic cost, spurring the emergence of Constrained Multi-Agent Reinforcement Learning (CMARL). Existing studies of CMARL mainly focus on training a constrained policy in an online manner, that is, maximizing cumulative reward while keeping constraint violations within budget. In practice, however, online learning may be infeasible due to safety restrictions or the lack of a high-fidelity simulator. Moreover, as the learned policy is deployed, new constraints that were not considered during training may arise. To address these two issues, we propose Constraining an UnconsTrained Multi-Agent Policy with offline data, dubbed CUTMAP, which follows the popular centralized training with decentralized execution paradigm. Specifically, we formulate a scalable optimization objective for CMARL within the framework of multi-agent maximum-entropy reinforcement learning. This objective estimates a decomposable Q-function by combining an unconstrained "prior policy" with cost signals extracted from offline data. When a new constraint arrives, CUTMAP can reuse the prior policy without re-training it. To tackle the distribution-shift challenge in offline learning, we also incorporate a conservative loss term when updating the Q-function. As a result, an unconstrained prior policy can be adapted to satisfy cost constraints through CUTMAP without expensive interactions with the real environment, facilitating the practical application of MARL algorithms. Empirical results on several cooperative multi-agent benchmarks, including StarCraft games, particle games, food-search games, and robot control, demonstrate the superior performance of our method.
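
For context, the abstract does not spell out the paper's exact objective; the following is a generic sketch of the constrained maximum-entropy formulation that CMARL methods of this kind build on, written in standard CMDP notation (discount \gamma, entropy temperature \alpha, per-step cost c, cost budget d, Lagrange multiplier \lambda) rather than symbols taken from the paper:

    \max_{\pi} \ \mathbb{E}_{\pi}\Big[\textstyle\sum_{t}\gamma^{t}\big(r(s_t,\boldsymbol{a}_t)+\alpha\,\mathcal{H}(\pi(\cdot\mid s_t))\big)\Big]
    \quad \text{s.t.} \quad \mathbb{E}_{\pi}\Big[\textstyle\sum_{t}\gamma^{t}\,c(s_t,\boldsymbol{a}_t)\Big]\le d,

which is commonly optimized through its Lagrangian relaxation,

    \min_{\lambda\ge 0}\ \max_{\pi}\ \mathbb{E}_{\pi}\Big[\textstyle\sum_{t}\gamma^{t}\big(r(s_t,\boldsymbol{a}_t)+\alpha\,\mathcal{H}(\pi(\cdot\mid s_t))-\lambda\,c(s_t,\boldsymbol{a}_t)\big)\Big]+\lambda d.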

Authors

  • Cong Guan
    Nanjing University, Nanjing 210023, China.
  • Tao Jiang
  • Yi-Chen Li
    National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China; School of Artificial Intelligence, Nanjing University, Nanjing, China; Polixir Technologies, Nanjing, China. Electronic address: https://www.lamda.nju.edu.cn/liyc/.
  • Zongzhang Zhang
    National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China; School of Artificial Intelligence, Nanjing University, Nanjing, China; Polixir Technologies, Nanjing, China. Electronic address: https://www.lamda.nju.edu.cn/zhangzz/.
  • Lei Yuan
    National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China; School of Artificial Intelligence, Nanjing University, Nanjing, China; Polixir Technologies, Nanjing, China.
  • Yang Yu
    National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China; School of Artificial Intelligence, Nanjing University, Nanjing, China; Polixir Technologies, Nanjing, China.