J Shanghai Jiaotong Univ Sci ›› 2025, Vol. 30 ›› Issue (6): 1125-1133. doi: 10.1007/s12204-023-2666-z



Gait Learning Reproduction for Quadruped Robots Based on Experience Evolution Proximal Policy Optimization

LI Chunyang (李春阳), ZHU Xiaoqing (朱晓庆), RUAN Xiaogang (阮晓钢), LIU Xinyuan (刘鑫源), ZHANG Siyuan (张思远)

  1. Faculty of Information Technology, Beijing University of Technology; Beijing Key Laboratory of Computational Intelligence and Intelligent System; Engineering Research Center of Digital Community of Ministry of Education, Beijing 100124, China
  • Received: 2023-02-20  Accepted: 2023-03-14  Online: 2025-11-21  Published: 2023-11-06


Keywords: quadruped robot, proximal policy optimization (PPO), prior knowledge, evolutionary strategy, bionic gait learning

Abstract: Bionic gait learning for quadruped robots based on reinforcement learning has become an active research topic. Because of problems such as reward sparsity, the proximal policy optimization (PPO) algorithm has a low probability of learning a successful gait from scratch. To address this, we propose an experience evolution proximal policy optimization (EEPPO) algorithm, which integrates PPO with prior knowledge guided by an evolutionary strategy. Samples from successful evolutionary-strategy training are used as prior knowledge to guide the learning direction and thereby increase the probability that learning succeeds. To verify the effectiveness of the proposed EEPPO algorithm, we conducted simulation experiments on a quadruped robot gait learning task in PyBullet. The experimental results show that, using key information such as the robot’s speed, posture, and joint information, the central pattern generator based radial basis function (CPG-RBF) network and the policy network can be updated simultaneously to learn a bionic diagonal trot gait. A comparison with the traditional soft actor-critic (SAC) algorithm validates the advantage of the proposed EEPPO algorithm, which learns a more stable diagonal trot gait on flat terrain.
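
For readers unfamiliar with PPO, the clipped surrogate objective that PPO-based methods such as EEPPO build on can be sketched as follows. This is a minimal illustration in plain Python, not the authors' implementation: the function name `clipped_surrogate`, its parameters, and the default `clip_eps=0.2` are assumptions chosen for the example, and neither the evolutionary-strategy prior knowledge nor the CPG-RBF network mentioned in the abstract is modeled here.

```python
import math


def clipped_surrogate(logp_new, logp_old, advantages, clip_eps=0.2):
    """Mean PPO clipped surrogate objective (a quantity to be maximized).

    logp_new / logp_old: log-probabilities of the sampled actions under the
    current policy and under the policy that collected the data.
    advantages: advantage estimates for the same samples.
    clip_eps: clipping range epsilon (hypothetical default for this sketch).
    """
    if not advantages:
        raise ValueError("need at least one sample")
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        # Probability ratio pi_new(a|s) / pi_old(a|s), computed from log-probs.
        ratio = math.exp(lp_new - lp_old)
        # Restrict the ratio to the interval [1 - eps, 1 + eps].
        clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        # Take the more pessimistic of the unclipped and clipped terms.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)
```

When the probability ratio leaves the interval [1 - ε, 1 + ε] in the direction favored by the advantage, the clipped term caps the objective, which limits how far a single update can move the policy.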
