J Shanghai Jiaotong Univ Sci ›› 2024, Vol. 29 ›› Issue (4): 646-655. doi: 10.1007/s12204-024-2713-4
DONG Yubo1 (董玉博), CUI Tao1 (崔涛), ZHOU Yufan1 (周禹帆), SONG Xun2 (宋勋), ZHU Yue2 (祝月), DONG Peng1∗ (董鹏)
Accepted: 2023-10-10
Online: 2024-07-14
Published: 2024-07-14
Abstract: Multi-agent reinforcement learning has recently been applied to pursuit problems. However, when the algorithm faces long-episode tasks with a large number of training time steps, it struggles to converge, which leaves the agents with low rewards and prevents them from learning effective policies. A deep reinforcement learning training method is proposed that uses a joint segmented multi-reward-function design to address this convergence problem. The joint reward function combines the advantages of two reward functions with different characteristics, improving the agents' training performance in long-episode tasks. The proposed method also eliminates the non-monotonic behavior that trigonometric functions introduce into the reward function under the conventional two-dimensional polar-coordinate observation representation. Experimental results show that, in pursuit scenarios, the proposed method outperforms the conventional single-reward-function mechanism and improves the agents' policy scores in pursuit tasks. The method offers a solution to the convergence difficulties that deep reinforcement learning models face in long-episode pursuit problems under polar coordinates and improves model training performance.
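The abstract only outlines the joint segmented multi-reward design; the paper's exact formulation is not reproduced here. As a rough illustration only, the following Python sketch assumes a single pursuer with polar observations (distance rho, bearing error theta) and combines a dense shaping segment with a sparse capture segment, using a monotonic |theta| bearing term in place of a raw trigonometric encoding. All function names, thresholds, and weights are hypothetical.

```python
import math

# Hypothetical sketch of a joint segmented multi-reward function for a
# pursuit task with polar-coordinate observations (rho, theta).
# Thresholds and weights below are illustrative, not the paper's values.

def joint_reward(rho, rho_prev, theta, capture_radius=1.0, max_range=50.0):
    """Combine a dense shaping reward (useful early in long episodes)
    with a sparse terminal reward (aligned with the task objective).

    rho, rho_prev : current and previous pursuer-target distance
    theta         : bearing error in radians, in [-pi, pi]
    """
    # Dense segment: reward progress toward the target. The distance
    # decrement keeps the signal monotonic in the pursuer's progress.
    r_dense = (rho_prev - rho) / max_range

    # Bearing term: |theta| / pi decreases monotonically as the pursuer
    # turns toward the target, avoiding the non-monotonicity that raw
    # cos/sin terms of the polar observation would introduce.
    r_heading = -abs(theta) / math.pi

    # Sparse segment: large terminal bonus on capture.
    if rho <= capture_radius:
        return 10.0

    # Joint segmented combination outside the capture radius: weighted
    # dense shaping terms drive learning over the long episode.
    return 0.5 * r_dense + 0.1 * r_heading
```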
DONG Yubo1 (董玉博), CUI Tao1 (崔涛), ZHOU Yufan1 (周禹帆), SONG Xun2 (宋勋), ZHU Yue2 (祝月), DONG Peng1∗ (董鹏). Reward Function Design Method for Long Episode Pursuit Tasks Under Polar Coordinate in Multi-Agent Reinforcement Learning[J]. J Shanghai Jiaotong Univ Sci, 2024, 29(4): 646-655.