J Shanghai Jiaotong Univ Sci, 2025, Vol. 30, Issue (2): 385-398. doi: 10.1007/s12204-023-2610-2

• Automation & Computer Science •

Exploiting a No-Regret Opponent in Repeated Zero-Sum Games


LI Kai1, HUANG Wenhan1, LI Chenchen2, DENG Xiaotie3

1. Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai 200240, China; 2. Beijing Jingdong Century Trading Co., Ltd., Beijing 100176, China; 3. Center on Frontiers of Computing Studies, Peking University, Beijing 100871, China
• Accepted: 2021-10-14; Online: 2025-03-21; Published: 2025-03-21

Abstract: In repeated zero-sum games, learning to exploit the opponent from historical interactions typically yields a higher utility than constantly playing an equilibrium strategy of the stage game. However, against a fully adaptive opponent, it is difficult to identify the opponent's adaptive dynamics and exploit its potential weaknesses. In this paper, we study the problem of optimizing against an adaptive opponent that uses no-regret learning, a classic and widely used family of adaptive learning algorithms. We propose a general framework for modeling no-regret opponents online and exploiting their weaknesses. With this framework, one can approximate the opponent's no-regret learning dynamics and then, based on inferences of the opponent's strategies, develop a response plan that obtains a significant profit. We employ two system identification architectures, a recurrent neural network (RNN) and a nonlinear autoregressive exogenous model, and adopt an efficient greedy response plan within the framework. Theoretically, we prove the capability of our RNN architecture to approximate specific no-regret dynamics. Empirically, we demonstrate that, during interactions with a low level of non-stationarity, our architectures approximate the dynamics with low error, and the derived policies exploit the no-regret opponent to obtain a decent utility.
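To make the moving parts concrete, the following is a minimal, self-contained sketch of the loop the abstract describes: a no-regret opponent (here multiplicative weights, a standard no-regret algorithm standing in for whatever dynamics the paper's models approximate), a predictor of the opponent's next mixed strategy fitted from the interaction history, and a greedy best response to that prediction. The game, step size, and toy linear predictor are illustrative assumptions, not the authors' implementation.

```python
# Sketch of the exploit-a-no-regret-opponent loop, under stated assumptions:
# the opponent runs multiplicative weights; the exploiter predicts the
# opponent's next mixed strategy from history and greedily best-responds.
import numpy as np

# Row player's payoff matrix for rock-paper-scissors (a zero-sum stage game).
A = np.array([[ 0.0, -1.0,  1.0],
              [ 1.0,  0.0, -1.0],
              [-1.0,  1.0,  0.0]])

eta = 0.1                  # opponent's multiplicative-weights step size
w = np.ones(3)             # opponent (column player) action weights
history = []               # observed opponent mixed strategies
total_utility = 0.0
T = 1000

for t in range(T):
    y = w / w.sum()        # opponent's current mixed strategy

    # Exploiter: predict the opponent's next strategy, then best-respond.
    if len(history) >= 2:
        # Toy predictor: linear extrapolation from the last two observations.
        # The paper instead identifies the dynamics with an RNN or a
        # nonlinear autoregressive exogenous (NARX) model.
        y_hat = np.clip(2.0 * history[-1] - history[-2], 1e-9, None)
        y_hat /= y_hat.sum()
    else:
        y_hat = np.full(3, 1.0 / 3.0)
    x = np.zeros(3)
    x[np.argmax(A @ y_hat)] = 1.0   # greedy pure best response to prediction

    total_utility += x @ A @ y      # exploiter's expected payoff this round
    history.append(y)

    # Opponent: multiplicative-weights (no-regret) update on its own
    # payoff vector, which is -x^T A in this zero-sum game.
    w *= np.exp(eta * (-(x @ A)))

print(f"exploiter's average utility over {T} rounds: {total_utility / T:.3f}")
```

In the paper's framework, the hand-rolled extrapolator above would be replaced by the identified RNN or NARX model, and the argmax step corresponds to the efficient greedy response plan mentioned in the abstract.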

Key words: no-regret learning, repeated game, opponent exploitation, opponent modeling, dynamical system, system identification, recurrent neural network (RNN)


