Based on the option-critic algorithm, a new adversarial algorithm named the deterministic policy network with option architecture is proposed to improve an agent's performance against an opponent with a fixed offensive strategy. An option network is introduced at the upper level; it generates an activation signal that selects between defensive and offensive strategies according to the current situation. The lower-level executive layer then computes the interactive action under the guidance of the activation signal, and the values of both the activation signal and the interactive action are evaluated jointly by a critic structure. This method effectively relaxes the requirement of a semi-Markov decision process and simplifies the network structure by eliminating the termination-probability layer. Experimental results show that the new algorithm switches neatly between offensive and defensive strategy styles and acquires more reward from the environment than the classical deep deterministic policy gradient algorithm does.
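A minimal sketch of the two-level structure described in the abstract is given below. It is not the authors' code: the layer sizes, class names, and the two-option (offensive/defensive) signal are illustrative assumptions. An upper-level option network emits an activation signal, a lower-level actor outputs a deterministic action conditioned on that signal, and a single critic evaluates the state, signal, and action together, as in a DDPG-style update.

```python
# Illustrative sketch (PyTorch) of the hierarchical actor-critic architecture
# described in the abstract; all dimensions and names are assumptions.
import torch
import torch.nn as nn

class OptionNetwork(nn.Module):
    """Upper level: maps the state to a soft activation signal over options."""
    def __init__(self, state_dim, num_options=2, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_options),
        )

    def forward(self, state):
        # Softmax keeps the signal differentiable; at execution time the
        # dominant entry selects the offensive or the defensive strategy.
        return torch.softmax(self.net(state), dim=-1)

class ExecutiveActor(nn.Module):
    """Lower level: deterministic action guided by the activation signal."""
    def __init__(self, state_dim, num_options, action_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + num_options, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),
        )

    def forward(self, state, signal):
        return self.net(torch.cat([state, signal], dim=-1))

class Critic(nn.Module):
    """Evaluates the activation signal and the interactive action jointly."""
    def __init__(self, state_dim, num_options, action_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + num_options + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, signal, action):
        return self.net(torch.cat([state, signal, action], dim=-1))

# Forward pass on a batch of states (dimensions are placeholders).
state = torch.randn(32, 8)
option_net, actor, critic = OptionNetwork(8), ExecutiveActor(8, 2, 2), Critic(8, 2, 2)
signal = option_net(state)
action = actor(state, signal)
q_value = critic(state, signal, action)  # scores signal and action together
```

Because the option head is evaluated by the same critic as the action, no separate termination-probability layer is needed in this sketch, which mirrors the simplification claimed in the abstract.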
ZHAO Yingce (赵英策), ZHANG Guanghao (张广浩), XING Zhengyu (邢正宇), LI Jianxun (李建勋). Hierarchical Reinforcement Learning Adversarial Algorithm Against Opponent with Fixed Offensive Strategy [J]. Journal of Shanghai Jiaotong University (Science), 2024, 29(3): 471-479.
DOI: 10.1007/s12204-023-2586-y
[1] ZHANG J W, HUANG S C, HAN C C. Analysis of trajectory simulation of proportional guidance based on Matlab [J]. Tactical Missile Technology, 2009(3): 60-64 (in Chinese).
[2] ZHAO W C, NA L, JIN X Y. Research and realization of quasi-parallel approaching method [J]. Measurement & Control Technology, 2009, 28(3): 92-95 (in Chinese).
[3] ZENG J, MOU J, LIU Y. Lightweight issues of swarm intelligence based multi-agent game strategy [J]. Journal of Command and Control, 2020, 6(4): 381-387 (in Chinese).
[4] SCHULMAN J, WOLSKI F, DHARIWAL P, et al. Proximal policy optimization algorithms [DB/OL]. (2017-08-28) [2021-10-25]. https://arxiv.org/abs/1707.06347.
[5] LILLICRAP T P, HUNT J J, PRITZEL A, et al. Continuous control with deep reinforcement learning [DB/OL]. (2019-07-05) [2021-10-25]. https://arxiv.org/abs/1509.02971.
[6] FUJIMOTO S, VAN HOOF H, MEGER D. Addressing function approximation error in actor-critic methods [C]//35th International Conference on Machine Learning. Stockholm: IMLS, 2018: 1587-1596.
[7] HAARNOJA T, ZHOU A, ABBEEL P, et al. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor [C]//35th International Conference on Machine Learning. Stockholm: IMLS, 2018: 1861-1870.
[8] LANGE S, RIEDMILLER M. Deep auto-encoder neural networks in reinforcement learning [C]//The 2010 International Joint Conference on Neural Networks. Barcelona: IEEE, 2010: 1-8.
[9] ABTAHI F, ZHU Z G, BURRY A M. A deep reinforcement learning approach to character segmentation of license plate images [C]//2015 14th IAPR International Conference on Machine Vision Applications. Tokyo: IEEE, 2015: 539-542.
[10] LANGE S, RIEDMILLER M, VOIGTLÄNDER A. Autonomous reinforcement learning on raw visual input data in a real world application [C]//The 2012 International Joint Conference on Neural Networks. Brisbane: IEEE, 2012: 1-8.
[11] NACHUM O, GU S S, LEE H, et al. Data-efficient hierarchical reinforcement learning [C]//32nd Conference on Neural Information Processing Systems. Montréal: NIPS, 2018: 1-11.
[12] LOWE R, WU Y, TAMAR A, et al. Multi-agent actor-critic for mixed cooperative-competitive environments [C]//31st Conference on Neural Information Processing Systems. Long Beach: NIPS, 2017: 1-12.
[13] FOERSTER J, FARQUHAR G, AFOURAS T, et al. Counterfactual multi-agent policy gradients [J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2018, 32(1): 2974-2982.
[14] DIETTERICH T G. The MAXQ method for hierarchical reinforcement learning [C]//15th International Conference on Machine Learning. Madison: IMLS, 1998: 118-126.
[15] KULKARNI T D, NARASIMHAN K R, SAEEDI A, et al. Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation [C]//29th Conference on Neural Information Processing Systems. Barcelona: NIPS, 2016: 1-9.
[16] SUTTON R S, PRECUP D, SINGH S. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning [J]. Artificial Intelligence, 1999, 112(1/2): 181-211.
[17] BACON P L, HARB J, PRECUP D. The option-critic architecture [J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2017, 31(1): 1726-1734.
[18] LEVY A, KONIDARIS G, PLATT R, et al. Learning multi-level hierarchies with hindsight [DB/OL]. (2019-09-03) [2021-10-25]. https://arxiv.org/abs/1712.00948.