Special Issue on Multi-Agent Collaborative Perception and Control

Reward Function Design Method for Long Episode Pursuit Tasks Under Polar Coordinate in Multi-Agent Reinforcement Learning

  • (1. School of Aeronautics and Astronautics, Shanghai Jiao Tong University, Shanghai 200240, China; 2. Beijing Institute of Electronic System Engineering, Beijing 100854, China)

Accepted date: 2023-10-10

Online published: 2024-07-28

Abstract

Multi-agent reinforcement learning has recently been applied to pursuit problems. However, each training episode involves a large number of time steps, so training often fails to converge, resulting in low rewards and agents that cannot learn effective strategies. This paper proposes a deep reinforcement learning (DRL) training method that uses an ensemble segmented multi-reward function design to address this convergence problem. The ensemble reward function combines the advantages of two reward functions, enhancing agent training over long episodes. We also eliminate the non-monotonic behavior that the trigonometric functions of the traditional 2D polar-coordinate observation representation introduce into the reward function. Experimental results demonstrate that, in the pursuit scenario, this method outperforms the traditional single-reward-function mechanism and improves the agents' policy scores on the task. These ideas address the convergence challenges faced by DRL models in long-episode pursuit problems and improve model training performance.
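The abstract does not give the paper's actual reward formulas, so the Python sketch below only illustrates the general idea it describes: a per-step reward segmented over the pursuit (here by pursuer-target distance, purely as an assumption), combining a dense shaping term with a sparse capture term, and using a bearing error wrapped to [0, π] so the angular component stays monotonic where raw trigonometric terms of a 2D polar observation would not. All function names, thresholds, and coefficients are hypothetical, not the paper's.

import numpy as np

def angular_error(theta_pursuer: float, theta_target: float) -> float:
    """Wrap the bearing difference to [0, pi] so the angular penalty grows
    monotonically with misalignment (raw sin/cos of the polar angle do not)."""
    diff = (theta_target - theta_pursuer + np.pi) % (2.0 * np.pi) - np.pi
    return abs(diff)

def ensemble_segmented_reward(rho: float, theta_err: float,
                              rho_switch: float = 5.0,
                              capture_radius: float = 0.5) -> float:
    """Hypothetical per-step reward for one pursuer.

    rho            : pursuer-target distance (polar radius)
    theta_err      : wrapped bearing error from angular_error()
    rho_switch     : assumed distance at which the reward segment changes
    capture_radius : assumed capture distance
    """
    if rho <= capture_radius:
        return 10.0                             # sparse terminal capture bonus
    if rho > rho_switch:
        return -0.01 * rho - 0.05 * theta_err   # far segment: mild dense shaping
    return -0.1 * rho - 0.1 * theta_err         # near segment: steeper gradient

# Example: pursuer 8 m from the target and 30 degrees off its bearing
step_reward = ensemble_segmented_reward(8.0, angular_error(0.0, np.pi / 6))

The distance-based switch above merely stands in for the segmentation described in the abstract; the paper's actual segmentation criterion, reward members, and weights may differ.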

Cite this article

DONG Yubo1 (董玉博), CUI Tao1 (崔涛), ZHOU Yufan1 (周禹帆), SONG Xun2 (宋勋), ZHU Yue2 (祝月), DONG Peng1* (董鹏). Reward Function Design Method for Long Episode Pursuit Tasks Under Polar Coordinate in Multi-Agent Reinforcement Learning [J]. Journal of Shanghai Jiaotong University (Science), 2024, 29(4): 646-655. DOI: 10.1007/s12204-024-2713-4
