Reward Function Design Method for Long Episode Pursuit Tasks Under Polar Coordinate in Multi-Agent Reinforcement Learning

doi:10.1007/s12204-024-2713-4

Abstract

Abstract: Multi-agent reinforcement learning has recently been applied to solve pursuit problems. However, it suffers from a large number of time steps per training episode, thus always struggling to converge effectively, resulting in low rewards and an inability for agents to learn strategies. This paper proposes a deep reinforcement learning (DRL) training method that employs an ensemble segmented multi-reward function design approach to address the convergence problem mentioned before. The ensemble reward function combines the advantages of two reward functions, which enhances the training effect of agents in long episode. Then, we eliminate the non-monotonic behavior in reward function introduced by the trigonometric functions in the traditional 2D polar coordinates observation representation. Experimental results demonstrate that this method outperforms the traditional single reward function mechanism in the pursuit scenario by enhancing agents’ policy scores of the task. These ideas offer a solution to the convergence challenges faced by DRL models in long episode pursuit problems, leading to an improved model training performance.

Key words: multi-agent reinforcement learning, deep reinforcement learning (DRL), long episode, reward function

摘要： 多智能体强化学习最近被应用于解决追击问题。然而，当算法面临训练的时间步数较多的长周期任务时，会遇到算法难以训练收敛的问题，进而导致智能体奖励较低、无法有效学习策略。提出了一种深度强化学习训练方法，采用联合分段多奖励函数设计方法来解决前面提到的收敛问题。联合奖励函数结合了两种不同特性的奖励函数的优点，增强了智能体在长周期任务中的训练效果。然后，提出方法消除了传统二维极坐标观测表示法中三角函数带来的奖励函数非单调行为。实验结果表明，在追逐场景中，提出的方法优于传统的单一奖励函数机制，提高了智能体在追击任务中的策略得分。方法为深度强化学习模型在长周期极坐标系追击问题中面临的收敛难题提供了解决方案，提高了模型训练性能。

关键词: 多智能体强化学习，深度强化学习，长周期，奖赏函数

CLC Number:

TP242

DONG Yubo¹ (董玉博), CUI Tao¹ (崔涛), ZHOU Yufan¹ (周禹帆), SONG Xun² (宋勋), ZHU Yue² (祝月), DONG Peng^1∗ (董鹏). Reward Function Design Method for Long Episode Pursuit Tasks Under Polar Coordinate in Multi-Agent Reinforcement Learning[J]. J Shanghai Jiaotong Univ Sci, 2024, 29(4): 646-655.

References

[1] SILVER D, HUANG A, MADDISON C J, et al. Mastering the game of Go with deep neural networks and tree search [J]. Nature, 2016, 529: 484-489.
[2] SILVER D, SCHRITTWIESER J, SIMONYAN K, et al. Mastering the game of Go without human knowledge [J]. Nature, 2017, 550: 354-359.
[3] BERNER C, BROCKMAN G, CHAN B, et al. Dota 2 with large scale deep reinforcement learning [DB/OL]. (2019-12-13). http://arxiv.org/abs/1912.06680
[4] VINYALS O, BABUSCHKIN I, CZARNECKI W M, et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning [J]. Nature, 2019, 575: 350-354.
[5] KOBER J, BAGNELL J A, PETERS J. Reinforcement learning in robotics: A survey [J]. International Journal of Robotics Research, 2013, 32(11): 1238-1274.
[6] LILLICRAP T P, HUNT J J, PRITZEL A, et al. Continuous control with deep reinforcement learning [DB/OL]. (2015-09-09). http://arxiv.org/abs/1509.02971
[7] LI P, RUAN X, ZHU X Q, et al. A regionalization vision navigation method based on deep reinforcement learning [J]. Journal of Shanghai Jiao Tong University, 2021, 55(5): 575-585 (in Chinese).
[8] SHALEV-SHWARTZ S, SHAMMAH S, SHASHUA A. Safe, multi-agent, reinforcement learning for autonomous driving [DB/OL]. (2016-10-11). https:// arxiv.org/abs/1610.03295
[9] ZHOU Y, ZHOU L, DING J, et al. Power network topology optimization and power flow control based on deep reinforcement learning [J]. Journal of Shanghai Jiao Tong University, 2021, 55(S2): 7-14 (in Chinese).
[10] L¨U Q B, LIU T Y, ZHANG R, et al. Generation approach of human-robot cooperative assembly strategy based on transfer learning [J]. Journal of Shanghai Jiao Tong University (Science), 2022, 27(5): 602-613.
[11] LIU Y, SHEN X, GU X, et al. A dual-system reinforcement learning method for flexible job shop dynamic scheduling [J]. Journal of Shanghai Jiao Tong University, 2022, 56(9): 1262-1275 (in Chinese).
[12] MNIH V, KAVUKCUOGLU K, SILVER D, et al. Human-level control through deep reinforcement learning [J]. Nature, 2015, 518: 529-533.
[13] BUSONIU L, BABUSKA R, DE SCHUTTER B. A comprehensive survey of multiagent reinforcement learning [J]. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 2008, 38(2): 156-172.
[14] BUS?ONIU L, BABUˇSKA R, DE SCHUTTER B. Multi-agent reinforcement learning: An overview [M]// Innovations in multi-agent systems and applications - 1. Berlin, Heidelberg: Springer, 2010: 183-221.
[15] FOERSTER J N, ASSAEL Y M, DE FREITAS N, et al. Learning to communicate with Deep multi-agent reinforcement learning [C]//30th International Conference on Neural Information Processing Systems. Barcelona: NIPS, 2016: 2145-2153.
[16] JIANG J C, LU Z Q. Learning attentional communication for multi-agent cooperation [C]//32nd International Conference on Neural Information Processing Systems. Montr′eal: NIPS, 2018: 7265-7275.
[17] SUKHBAATAR S, SZLAM A, FERGUS R. Learning multiagent communication with backpropagation [C]//30th International Conference on Neural Information Processing Systems. Barcelona: NIPS, 2016: 2252-2260.
[18] PENG P, WEN Y, YANG Y D, et al. Multiagent bidirectionally-coordinated nets: Emergence of human-level coordination in learning to play StarCraft combat games [DB/OL]. (2017-03-29). https://arxiv.org/abs/1703.10069
[19] JIANG J, DUN C, HUANG T, et al. Graph convolutional reinforcement learning [J]. (2018-10-22). https://arxiv.org/abs/1810.09202
[20] SINGH A, JAIN T, SUKHBAATAR S. Learning when to communicate at scale in multiagent cooperative and competitive tasks [DB/OL]. (2018-12-23). http://arxiv.org/abs/1812.09755
[21] KIM D, MOON S, HOSTALLERO D, et al. Learning to schedule communication in multiagent reinforcement learning [DB/OL]. (2019-02-09). http://arxiv.org/abs/1902.01554
[22] DAS A, GERVET T, ROMOFF J, et al. TarMAC: Targeted multi-agent communication [C]// 36th International Conference on Machine Learning. Long Beach: PMLR 97, 2019: 1538-1546.
[23] WANG Y F, ZHONG F W, XU J, et al. ToM2C: Target-oriented multi-agent communication and cooperation with theory of mind [DB/OL]. (2021-10-15). https://arxiv.org/abs/2111.09189
[24] SUNEHAG P, LEVER G, GRUSLYS A, et al. Value-decomposition networks for cooperative multi-agent learning [DB/OL]. (2017-06-16). http://arxiv.org/abs/1706.05296
[25] WEI E M, WICKE D, FREELAN D, et al. Multiagent soft Q-learning [DB/OL]. (2018-04-25). https://arxiv.org/abs/1804.09817
[26] SON K, KIM D, KANG W J, et al. QTRAN: Learning to factorize with transformation for cooperative multiagent reinforcement learning [DB/OL]. (2019-05-14). http://arxiv.org/abs/1905.05408
[27] WANG J H, REN Z Z, LIU T, et al. QPLEX: Duplex dueling multi-agent Q-learning [DB/OL]. (2020-08-03). https://arxiv.org/abs/2008.01062
[28] TABISH R, MIKAYEL S, SCHROEDER D W C, et al. Monotonic value function factorisation for deep multiagent reinforcement learning [J]. Journal of Machine Learning Research, 2020, 21(1): 7234-7284.
[29] YANG Y D, WEN Y, CHEN L H, et al. Multi-agent determinantal Q-learning [C]//37th International Conference on Machine Learning. Vienna: PMLR 119, 2020: 10757-10766.
[30] FU W, YU C, XU Z L, et al. Revisiting some common practices in cooperative multi-agent reinforcement learning [DB/OL]. (2022-06-15). http://arxiv.org/abs/2206.07505
[31] LONG Q, ZHOU Z H, GUPTA A, et al. Evolutionary population curriculum for scaling multiagent reinforcement learning [DB/OL]. (2020-03-23). https://arxiv.org/abs/2003.10423
[32] WANG W X, YANG T P, LIU Y, et al. From few to more: Large-scale dynamic multiagent curriculum learning [J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2020, 34(5): 7293-7300.
[33] FOERSTER J, FARQUHAR G, AFOURAS T, et al. Counterfactual multi-agent policy gradients [J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2018, 32(1): 2974-2982.
[34] LOWE R, WU Y, TAMAR A, et al. Multi-agent actorcritic for mixed cooperative-competitive environments [C]//31st International Conference on Neural Information Processing Systems. Long Beach: NIPS, 2017: 6382-6393.
[35] MAHAJAN A, RASHID T, SAMVELYAN M, et al. MAVEN: Multi-agent variational exploration [DB/OL]. (2019-10-16). http://arxiv.org/abs/1910.07483