Hierarchical Reinforcement Learning Adversarial Algorithm Against Opponent with Fixed Offensive Strategy

doi:10.1007/s12204-023-2586-y

J Shanghai Jiaotong Univ Sci, 2024, Vol. 29, Issue 3: 471-479.

ZHAO Yingce¹ (赵英策), ZHANG Guanghao² (张广浩), XING Zhengyu² (邢正宇), LI Jianxun²* (李建勋)

(1. Shenyang Aircraft Design and Research Institute, Shenyang 110031, China; 2. School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai 200240, China)

Accepted: 2022-02-18 Online: 2024-05-28 Published: 2024-05-28

Abstract

Building on the option-critic algorithm, this paper proposes an adversarial algorithm, the deterministic policy network with option architecture, to improve an agent's performance against an opponent with a fixed offensive strategy. An option-selection network is introduced as the upper-level policy; based on the current relative situation, it outputs an activation signal for either the offensive or the defensive strategy. The lower-level policy network then produces the corresponding interactive action under the guidance of this signal, and a critic network gives a deterministic value estimate of the lower-level action and the upper-level activation signal together. The method effectively relaxes the assumptions of the semi-Markov decision process and simplifies the network structure by removing the termination-probability network. Experimental results show that the proposed algorithm switches between offensive and defensive strategies more flexibly and obtains higher reward than the classical deep deterministic policy gradient algorithm.

Key words: hierarchical reinforcement learning, fixed offensive strategy, option architecture, deterministic gradient policy
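
The two-level structure described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the single-layer linear networks, the one-hot form of the activation signal, the option ordering, and all names (`OptionSelectingAgent`, `select_option`, `act`, `q_value`) are assumptions made for illustration, and no training step is shown. The sketch only shows the data flow: the upper level picks an option at every step (so no termination network is needed), the lower level produces a deterministic action for that option, and the critic scores the action and the activation signal together.

```python
import math
import random


def init_layer(rng, n_out, n_in):
    """Return (weights, bias) for one small randomly initialised affine layer."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def linear(weights, bias, x):
    """Apply one affine layer: y = W x + b."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


class OptionSelectingAgent:
    # Hypothetical ordering: index 0 = offensive option, index 1 = defensive option.
    N_OPTIONS = 2

    def __init__(self, state_dim, action_dim, seed=0):
        rng = random.Random(seed)
        # Upper level: option-selection network (state -> one score per option).
        self.selector = init_layer(rng, self.N_OPTIONS, state_dim)
        # Lower level: one deterministic policy head per option (state -> action).
        self.policies = [init_layer(rng, action_dim, state_dim)
                         for _ in range(self.N_OPTIONS)]
        # Critic: Q(state, activation signal, action) -> scalar.
        self.critic = init_layer(rng, 1, state_dim + self.N_OPTIONS + action_dim)

    def select_option(self, state):
        """Return a one-hot activation signal (assumed form) for the best-scoring option."""
        scores = linear(*self.selector, state)
        best = max(range(self.N_OPTIONS), key=lambda i: scores[i])
        return [1.0 if i == best else 0.0 for i in range(self.N_OPTIONS)]

    def act(self, state):
        """Re-select the option at this step, then let its policy head pick the action."""
        signal = self.select_option(state)
        option = signal.index(1.0)
        action = [math.tanh(v) for v in linear(*self.policies[option], state)]
        return signal, action

    def q_value(self, state, signal, action):
        """Deterministic value estimate of the signal and the action together."""
        return linear(*self.critic, list(state) + list(signal) + list(action))[0]


# Example forward pass on a made-up 4-dimensional relative-situation vector.
agent = OptionSelectingAgent(state_dim=4, action_dim=2)
state = [0.3, -0.1, 0.8, 0.05]
signal, action = agent.act(state)
q = agent.q_value(state, signal, action)
```

Because the option is chosen afresh from the state at each step, switching between offensive and defensive behaviour needs no separate termination probability; how the actual algorithm parameterises and trains these networks is specified in the paper itself, not in this sketch.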

CLC number: TP242.6

ZHAO Yingce (赵英策), ZHANG Guanghao (张广浩), XING Zhengyu (邢正宇), LI Jianxun (李建勋). Hierarchical Reinforcement Learning Adversarial Algorithm Against Opponent with Fixed Offensive Strategy[J]. J Shanghai Jiaotong Univ Sci, 2024, 29(3): 471-479.
