J Shanghai Jiaotong Univ Sci, 2024, Vol. 29, Issue 4: 646-655. doi: 10.1007/s12204-024-2713-4

• Special Issue on Multi-Agent Collaborative Perception and Control •

Reward Function Design Method for Long Episode Pursuit Tasks Under Polar Coordinate in Multi-Agent Reinforcement Learning


DONG Yubo1 (董玉博), CUI Tao1 (崔涛), ZHOU Yufan1 (周禹帆), SONG Xun2 (宋勋), ZHU Yue2 (祝月), DONG Peng1∗ (董鹏)   

  1. School of Aeronautics and Astronautics, Shanghai Jiao Tong University, Shanghai 200240, China; 2. Beijing Institute of Electronic System Engineering, Beijing 100854, China
  • Accepted: 2023-10-10; Online: 2024-07-28; Published: 2024-07-28

Abstract: Multi-agent reinforcement learning has recently been applied to pursuit problems. However, when each training episode involves a large number of time steps, training often struggles to converge, leaving agents with low rewards and unable to learn effective strategies. This paper proposes a deep reinforcement learning (DRL) training method that employs an ensemble segmented multi-reward function design to address this convergence problem. The ensemble reward function combines the advantages of two reward functions, which enhances the training effect of agents over long episodes. We also eliminate the non-monotonic behavior that the trigonometric functions in the traditional 2D polar-coordinate observation representation introduce into the reward function. Experimental results demonstrate that this method outperforms the traditional single-reward-function mechanism in the pursuit scenario, improving agents' policy scores on the task. These ideas offer a solution to the convergence challenges faced by DRL models in long-episode pursuit problems, leading to improved model training performance.
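To make the reward design concrete, the following is a minimal Python sketch of one way an ensemble segmented reward could be assembled for a 2D polar-coordinate pursuit task. The abstract does not give the paper's exact formulation, so the constants CAPTURE_RADIUS and MAX_STEPS, the function names, and the segment-weighting scheme below are all illustrative assumptions.

    # Minimal illustrative sketch (not the paper's exact formulation) of an
    # ensemble segmented reward for a 2D polar-coordinate pursuit task.
    # CAPTURE_RADIUS, MAX_STEPS, the function names, and the blending scheme
    # are assumptions made for illustration only.

    CAPTURE_RADIUS = 1.0  # assumed distance at which the evader counts as caught
    MAX_STEPS = 2000      # assumed long-episode horizon

    def dense_reward(rho: float, rho_prev: float) -> float:
        """Dense shaping term: reward each step of progress toward the evader.

        Shaping on the change in radial distance rho is monotonic in the
        pursuer's progress; a term built from cos/sin of the bearing angle
        would instead be non-monotonic in the angular error, which is the
        kind of behavior the observation redesign described above removes.
        """
        return rho_prev - rho

    def sparse_reward(rho: float) -> float:
        """Sparse terminal term: a large bonus only when capture occurs."""
        return 10.0 if rho <= CAPTURE_RADIUS else 0.0

    def ensemble_segmented_reward(rho: float, rho_prev: float, step: int) -> float:
        """Blend both terms with a segment-dependent weight.

        Early in a long episode the dense term dominates, so the agent gets
        a learning signal at every time step; late in the episode the sparse
        capture bonus dominates, so the policy is driven by the true goal.
        """
        w = min(step / MAX_STEPS, 1.0)  # 0 early in the episode, 1 near the end
        return (1.0 - w) * dense_reward(rho, rho_prev) + w * sparse_reward(rho)

For example, at step 100 of 2000 with rho_prev = 5.2 and rho = 5.0, the weight w = 0.05, so the return value 0.19 comes almost entirely from the dense progress term, which is what lets agents learn before any capture bonus has ever been observed.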

Key words: multi-agent reinforcement learning, deep reinforcement learning (DRL), long episode, reward function


