J Shanghai Jiaotong Univ Sci ›› 2025, Vol. 30 ›› Issue (6): 1085-1102.doi: 10.1007/s12204-023-2631-x



Self-Adaptive LSAC-PID Approach Based on Lyapunov Reward Shaping for Mobile Robots

YU Xinyi, XU Siyu, FAN Yuehai, OU Linlin

  1. College of Information Engineering, Zhejiang University of Technology, Hangzhou 310023, China
  • Received:2021-11-23 Accepted:2022-01-27 Online:2025-11-21 Published:2025-11-26


Key words: multiple-input multiple-output (MIMO), PID tuning, reinforcement learning, Lyapunov-based reward shaping, soft actor-critic, mobile robot

Abstract: To solve the control problem of multiple-input multiple-output (MIMO) systems in complex and variable environments, this paper proposes a model-free adaptive LSAC-PID method based on deep reinforcement learning (RL) for the automatic control of mobile robots. According to the environmental feedback, the RL agent acting as the upper-level controller outputs the optimal parameters to the lower-level MIMO PID controllers, realizing real-time optimal PID control. First, a model-free adaptive MIMO PID hybrid control strategy is presented to realize real-time optimal tuning of the control parameters using the soft actor-critic (SAC) algorithm, a state-of-the-art RL method. Second, to improve the RL convergence speed and the control performance, a Lyapunov-based reward shaping method for off-policy RL algorithms is designed, and a self-adaptive LSAC-PID tuning approach with the Lyapunov-based reward is then derived. Through the policy evaluation and policy improvement of soft policy iteration, the convergence and optimality of the proposed LSAC-PID algorithm are proved mathematically. Finally, based on the proposed reward shaping method, a reward function is designed to improve the stability of the line-following robot system. Simulation and experiment results show that the proposed adaptive LSAC-PID approach achieves real-time optimal tuning of the MIMO PID parameters without requiring a system model or control-loop decoupling, and offers fast convergence, strong generalization, and high real-time performance.
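The hierarchical loop described in the abstract, an RL policy emitting PID gains that a conventional PID law then applies, with a Lyapunov-based term shaping the reward, can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the `PID` class, the Lyapunov candidate V(e) = e², the shaping weight `eta`, and the toy plant response are all assumptions made for the example; the stand-in gain vector replaces the actual SAC policy output.

```python
import numpy as np

class PID:
    """Simple PID controller whose gains are reset externally at each step."""
    def __init__(self):
        self.kp = self.ki = self.kd = 0.0
        self.integral = 0.0
        self.prev_error = 0.0

    def set_gains(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd

    def step(self, error, dt):
        # Standard parallel-form PID law.
        self.integral += error * dt
        derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

def lyapunov_shaped_reward(error, next_error, base_reward, eta=1.0):
    """Shaping term built from a Lyapunov-like candidate V(e) = e^2:
    the agent is rewarded when V decreases, i.e. the tracking error shrinks."""
    v, v_next = error ** 2, next_error ** 2
    return base_reward + eta * (v - v_next)

# Toy closed loop: a hypothetical upper-level policy maps state -> PID gains.
pid = PID()
error, dt = 1.0, 0.05
for _ in range(3):
    gains = np.array([2.0, 0.1, 0.05])    # stand-in for the SAC policy output
    pid.set_gains(*gains)
    u = pid.step(error, dt)
    next_error = error - 0.5 * u * dt     # stand-in plant response
    r = lyapunov_shaped_reward(error, next_error, base_reward=-abs(next_error))
    error = next_error                    # (state, r) would feed the RL replay buffer
```

In the paper's setting the gains come from the trained SAC actor and the base reward encodes the line-following objective; the sketch only shows where the shaping term enters the transition.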
