J Shanghai Jiaotong Univ Sci ›› 2025, Vol. 30 ›› Issue (2): 385-398. doi: 10.1007/s12204-023-2610-2
Li Kai1, Huang Wenhan1, Li Chenchen2, Deng Xiaotie3
Accepted: 2021-10-14
Online: 2025-03-21
Published: 2025-03-21
Abstract: In repeated zero-sum games, a player who uses the stage-game equilibrium strategy in every round may obtain a low payoff. In contrast, learning to exploit the opponent from the history of interactions is usually more effective. However, against a fully adaptive opponent it is difficult to identify the dynamics by which the opponent's strategy changes and to further exploit its potential weaknesses. This paper studies how to optimize one's own payoff against an adaptive opponent that uses no-regret learning, a widely used family of adaptive learning algorithms. We propose a general framework for online modeling of a no-regret opponent's strategy-update process and for exploiting its weaknesses. Within this framework, the opponent's no-regret dynamics can be approximated, and a strategic response can be devised that earns a significant profit on the basis of the inferred opponent strategy. Two system-identification architectures are adopted to model the opponent, namely recurrent neural networks (RNNs) and nonlinear autoregressive exogenous (NARX) models, combined with an efficient greedy response scheme inside the framework. We theoretically prove the representational capability of the RNN architecture to approximate certain no-regret dynamics, and show experimentally that both architectures approximate these dynamics with very low error under mildly non-stationary interactions, and that the derived strategies exploit no-regret opponents to obtain high payoffs.
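The following is a minimal sketch, not the authors' implementation, of the general idea summarized above: fit a small RNN (one of the two system-identification architectures mentioned) to the opponent's observed mixed strategies over past rounds, predict the next-round strategy, and play a greedy response to that prediction in a zero-sum matrix game. The names OpponentRNN and greedy_best_response, the use of PyTorch, the rock-paper-scissors example, and all hyperparameters are illustrative assumptions.

import torch
import torch.nn as nn

class OpponentRNN(nn.Module):
    """Predict the opponent's next mixed strategy from the history of
    (our one-hot action, observed opponent strategy) pairs."""
    def __init__(self, n_actions: int, hidden: int = 32):
        super().__init__()
        self.rnn = nn.RNN(input_size=2 * n_actions, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_actions)

    def forward(self, history: torch.Tensor) -> torch.Tensor:
        # history: (batch, T, 2 * n_actions); returns a point on the action simplex
        out, _ = self.rnn(history)
        return torch.softmax(self.head(out[:, -1]), dim=-1)

def greedy_best_response(payoff: torch.Tensor, opp_strategy: torch.Tensor) -> int:
    """Row action maximizing expected payoff A @ y against the predicted strategy y."""
    return int(torch.argmax(payoff @ opp_strategy))

if __name__ == "__main__":
    # Toy rock-paper-scissors payoff matrix for the row player.
    A = torch.tensor([[0., -1., 1.],
                      [1., 0., -1.],
                      [-1., 1., 0.]])
    model = OpponentRNN(n_actions=3)
    # One fabricated interaction step: our one-hot action + the opponent's observed strategy.
    step = torch.cat([torch.eye(3)[0], torch.tensor([0.5, 0.3, 0.2])]).view(1, 1, 6)
    y_hat = model(step).squeeze(0).detach()   # untrained prediction, for illustration only
    print("predicted opponent strategy:", y_hat)
    print("greedy response action:", greedy_best_response(A, y_hat))

In practice the model would be trained online, e.g., by minimizing the cross-entropy between predicted and realized opponent strategies each round, and the NARX alternative would replace OpponentRNN with a feed-forward map over a fixed window of past inputs.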
CLC number:
Li Kai, Huang Wenhan, Li Chenchen, Deng Xiaotie. Exploiting a No-Regret Opponent in Repeated Zero-Sum Games[J]. J Shanghai Jiaotong Univ Sci, 2025, 30(2): 385-398.