J Shanghai Jiaotong Univ Sci ›› 2024, Vol. 29 ›› Issue (3): 471-479. doi: 10.1007/s12204-023-2586-y

• Automation & Computer Technologies •

Hierarchical Reinforcement Learning Adversarial Algorithm Against Opponent with Fixed Offensive Strategy

ZHAO Yingce1 (赵英策), ZHANG Guanghao2 (张广浩), XING Zhengyu2 (邢正宇), LI Jianxun2∗ (李建勋)   

  (1. Shenyang Aircraft Design and Research Institute, Shenyang 110031, China; 2. School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai 200240, China)
  • Accepted: 2022-02-18  Online: 2024-05-28  Published: 2024-05-28

Abstract: Building on the option-critic algorithm, a new adversarial algorithm, the deterministic policy network with option architecture, is proposed to improve an agent's performance against an opponent that follows a fixed offensive strategy. An option network is introduced at the upper level; according to the relative situation, it outputs an activation signal that selects between the offensive and defensive strategies. The lower-level executive policy network then produces the interactive action under the guidance of the activation signal, and a critic network jointly evaluates the value of the activation signal and the interactive action. This method effectively relaxes the assumptions of the semi-Markov decision process and simplifies the network structure by eliminating the termination probability network. Experimental results show that the new algorithm switches between offensive and defensive strategies more flexibly and obtains more reward from the environment than the classical deep deterministic policy gradient algorithm.

Key words: hierarchical reinforcement learning, fixed offensive strategy, option architecture, deterministic gradient policy
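
The data flow described in the abstract (option network, then lower-level policy, then a critic that scores both outputs) can be illustrated with a minimal forward-pass sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the class name `OptionActorCritic`, the single linear layers, the softmax over two options, the tanh-bounded action, and the dimensions. The sketch recomputes the activation signal at every step, which is one simple way to do without a termination network; the abstract does not specify the paper's exact mechanism, and no training is shown.

```python
import math
import random


def linear(n_in, n_out, rng):
    """Random weights for one dense layer (illustrative initialisation)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]


def apply(weights, x):
    """Matrix-vector product: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def softmax(z):
    """Numerically stable softmax over a list of floats."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


class OptionActorCritic:
    """Hypothetical forward pass: option network -> actor -> joint critic."""

    OPTIONS = ("offensive", "defensive")

    def __init__(self, state_dim, action_dim, seed=0):
        rng = random.Random(seed)
        n_opt = len(self.OPTIONS)
        self.w_option = linear(state_dim, n_opt, rng)
        self.w_actor = linear(state_dim + n_opt, action_dim, rng)
        self.w_critic = linear(state_dim + n_opt + action_dim, 1, rng)

    def activation_signal(self, state):
        # Upper level: weights over the offensive and defensive strategies.
        return softmax(apply(self.w_option, state))

    def action(self, state, signal):
        # Lower level: deterministic action conditioned on the signal.
        return [math.tanh(v) for v in apply(self.w_actor, state + signal)]

    def q_value(self, state, signal, action):
        # Critic scores the activation signal and the action together.
        return apply(self.w_critic, state + signal + action)[0]

    def step(self, state):
        # No termination network: the signal is recomputed on every call.
        signal = self.activation_signal(state)
        act = self.action(state, signal)
        return signal, act, self.q_value(state, signal, act)


# Example with an arbitrary 4-dimensional state (values are made up).
model = OptionActorCritic(state_dim=4, action_dim=2)
signal, act, q = model.step([0.5, -0.2, 0.1, 0.9])
active = OptionActorCritic.OPTIONS[signal.index(max(signal))]
```

Because the critic takes the activation signal as an input alongside the action, a single value estimate covers both levels of the hierarchy, which mirrors the joint evaluation the abstract describes.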
