J Shanghai Jiaotong Univ Sci ›› 2024, Vol. 29 ›› Issue (3): 471-479. doi: 10.1007/s12204-023-2586-y
ZHAO Yingce1 (赵英策), ZHANG Guanghao2 (张广浩), XING Zhengyu2 (邢正宇), LI Jianxun2∗ (李建勋)
Accepted: 2022-02-18
Online: 2024-05-28
Published: 2024-05-28
Abstract: Taking the option-critic algorithm as its theoretical basis, this paper proposes an option-selection deterministic policy network algorithm against opponents with a fixed offensive strategy. The algorithm introduces an upper-level policy structure, the option-selection network, which outputs an activation signal for the offensive or defensive policy according to the relative situation; the lower-level policy networks then produce the corresponding interaction actions according to that activation signal; finally, the critic network makes a deterministic value estimate over both the lower-level interaction actions and the upper-level activation signal. The algorithm effectively weakens the assumptions of semi-Markov decision processes and simplifies the network structure by removing the termination-probability network. Experimental results show that the adversarial algorithm based on the option-selection deterministic policy network switches between offensive and defensive strategies more flexibly than the classical deep deterministic policy gradient (DDPG) algorithm, and obtains better adversarial decision-making returns.
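The two-level forward pass described in the abstract can be sketched as follows. This is a minimal illustration only, assuming hypothetical layer sizes and random linear weights in place of the paper's trained networks; all names (`W_option`, `W_policy`, `W_critic`, `act`) are illustrative, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

STATE_DIM, ACTION_DIM, N_OPTIONS = 6, 2, 2  # hypothetical sizes; options: offense (0), defense (1)

def layer(in_dim, out_dim):
    """A random linear layer standing in for a trained network."""
    return rng.normal(scale=0.1, size=(in_dim, out_dim))

W_option = layer(STATE_DIM, N_OPTIONS)                                 # upper-level option-selection network
W_policy = [layer(STATE_DIM, ACTION_DIM) for _ in range(N_OPTIONS)]    # lower-level policy networks
W_critic = rng.normal(scale=0.1, size=STATE_DIM + N_OPTIONS + ACTION_DIM)  # critic

def act(state):
    # Upper level: choose offense or defense from the relative situation
    # and emit a one-hot activation signal.
    option = int(np.argmax(state @ W_option))
    signal = np.eye(N_OPTIONS)[option]
    # Lower level: the activated policy outputs a deterministic action.
    action = np.tanh(state @ W_policy[option])
    # Critic: a single deterministic value estimate conditioned on the
    # state, the upper-level activation signal, and the lower-level action.
    q = float(np.concatenate([state, signal, action]) @ W_critic)
    return option, action, q

option, action, q = act(rng.normal(size=STATE_DIM))
```

Note that, consistent with the abstract, no termination-probability network appears: the option is re-selected from the state at every step rather than terminated stochastically as in the original option-critic architecture.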
ZHAO Yingce(赵英策), ZHANG Guanghao(张广浩), XING Zhengyu(邢正宇), LI Jianxun(李建勋). Hierarchical Reinforcement Learning Adversarial Algorithm Against Opponent with Fixed Offensive Strategy[J]. J Shanghai Jiaotong Univ Sci, 2024, 29(3): 471-479.