Hierarchical Reinforcement Learning Adversarial Algorithm Against Opponent with Fixed Offensive Strategy

doi:10.1007/s12204-023-2586-y

Abstract

Building on the option-critic algorithm, this paper proposes a new adversarial algorithm, the deterministic policy network with option architecture, to improve an agent's performance against an opponent that follows a fixed offensive strategy. An option network is introduced at the upper level; according to the current relative situation, it outputs an activation signal that selects either the offensive or the defensive strategy. Guided by this signal, the lower-level executive layer produces the interactive action, and a critic network evaluates the activation signal and the interactive action together. The method effectively relaxes the assumptions of the semi-Markov decision process and simplifies the network structure by eliminating the termination probability layer. Experimental results show that the new algorithm switches between offensive and defensive strategies more flexibly and obtains more reward from the environment than the classical deep deterministic policy gradient algorithm.
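The data flow described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the class name `OptionActorCritic`, the single linear layers, the random weights, the dimensions, and the choice to re-select the option at every step are all assumptions made for illustration. Only the forward pass is shown (option network, then actor, then critic); no training is performed.

```python
import math
import random

OPTIONS = ("offensive", "defensive")  # the two strategy styles named in the abstract


def linear(weights, inputs):
    """Return one weighted sum per row of `weights`."""
    return [sum(w * x for w, x in zip(row, inputs)) for row in weights]


def random_matrix(rng, rows, cols):
    """Hypothetical untrained weights, drawn uniformly from [-1, 1]."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


class OptionActorCritic:
    """Toy two-level agent: option network -> actor -> critic (assumed layout)."""

    def __init__(self, state_dim, action_dim, seed=0):
        rng = random.Random(seed)
        n_opt = len(OPTIONS)
        self.w_option = random_matrix(rng, n_opt, state_dim)
        self.w_actor = random_matrix(rng, action_dim, state_dim + n_opt)
        self.w_critic = random_matrix(rng, 1, state_dim + n_opt + action_dim)

    def activation_signal(self, state):
        """Upper level: one-hot signal for the highest-scoring option."""
        scores = linear(self.w_option, state)
        best = scores.index(max(scores))
        return [1.0 if i == best else 0.0 for i in range(len(OPTIONS))]

    def action(self, state, signal):
        """Lower level: deterministic action in [-1, 1], conditioned on the signal."""
        return [math.tanh(v) for v in linear(self.w_actor, state + signal)]

    def value(self, state, signal, action):
        """Critic: one value for the (state, signal, action) triple."""
        return linear(self.w_critic, state + signal + action)[0]

    def step(self, state):
        # Assumption: the option is re-selected at every step, so no
        # termination network is needed to decide when an option ends.
        signal = self.activation_signal(state)
        act = self.action(state, signal)
        return OPTIONS[signal.index(1.0)], act, self.value(state, signal, act)


agent = OptionActorCritic(state_dim=4, action_dim=2)
option, act, q = agent.step([0.5, -0.2, 0.1, 0.9])
print(option, act, q)
```

In this sketch the critic receives the activation signal alongside the action, which mirrors the abstract's statement that both are evaluated together; how the paper actually parameterizes or trains these networks is not specified here.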

Key words: hierarchical reinforcement learning, fixed offensive strategy, option architecture, deterministic policy gradient

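Abstract (Chinese version, translated): Taking the option-critic algorithm as its theoretical basis, this paper proposes an option-selection deterministic policy network algorithm for opponents with a fixed offensive strategy. The algorithm introduces an upper-level policy structure, the option-selection network, which outputs an activation signal for the offensive or defensive strategy according to the relative situation. The lower-level policy network takes the corresponding interactive action according to the activation signal, and the critic network then produces a deterministic value estimate for the lower-level interactive action and the upper-level activation signal. The algorithm effectively weakens the assumptions of the semi-Markov decision process and simplifies the network structure by removing the termination probability network. Experimental results show that, compared with the classical deep deterministic policy gradient algorithm, the adversarial algorithm based on the option-selection deterministic policy network switches more flexibly between offensive and defensive strategies and obtains better adversarial decision returns.

Key words (Chinese version, translated): hierarchical reinforcement learning, fixed offensive strategy, option-selection network architecture, deterministic policy gradient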

CLC Number: TP242.6

ZHAO Yingce(赵英策), ZHANG Guanghao(张广浩), XING Zhengyu(邢正宇), LI Jianxun(李建勋). Hierarchical Reinforcement Learning Adversarial Algorithm Against Opponent with Fixed Offensive Strategy[J]. J Shanghai Jiaotong Univ Sci, 2024, 29(3): 471-479.
