Automation & Computer Science

Exploiting a No-Regret Opponent in Repeated Zero-Sum Games

  • 1. Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai 200240, China; 2. Beijing Jingdong Century Trading Co., Ltd., Beijing 100176, China; 3. Center on Frontiers of Computing Studies, Peking University, Beijing 100871, China

Accepted date: 2021-10-14

  Online published: 2025-03-21

Abstract

In repeated zero-sum games, rather than constantly playing an equilibrium strategy of the stage game, a player can typically obtain a higher utility by learning to exploit the opponent from historical interactions. However, when facing a fully adaptive opponent, it is difficult to identify the opponent's adaptive dynamics and exploit its potential weaknesses. In this paper, we study the problem of optimizing against an adaptive opponent who uses no-regret learning, a classic and widely used family of adaptive learning algorithms. We propose a general framework for online modeling of no-regret opponents and exploitation of their weaknesses. With this framework, one can approximate the opponent's no-regret learning dynamics and then develop a response plan that obtains a significant profit based on inferences of the opponent's strategies. We instantiate the framework with two system identification architectures, a recurrent neural network (RNN) and a nonlinear autoregressive exogenous (NARX) model, together with an efficient greedy response plan. Theoretically, we prove the capability of our RNN architecture to approximate specific no-regret dynamics. Empirically, we demonstrate that, when interactions exhibit a low level of non-stationarity, our architectures approximate the dynamics with low error, and the derived policies exploit the no-regret opponent to obtain a decent utility.
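
The abstract describes the pipeline only at a high level. The following is a minimal, illustrative sketch (not the authors' implementation) of that pipeline under simplifying assumptions: the opponent runs a Hedge (multiplicative-weights) no-regret update, a small GRU is fit online to predict the opponent's next mixed strategy from the history of play, and the responder greedily best-responds to the prediction. All names (StrategyRNN, hedge_update), the payoff matrix, and the hyperparameters are assumptions made for this example.

```python
# Minimal sketch: online RNN modeling of a no-regret (Hedge) opponent
# and greedy best response in a repeated zero-sum matrix game.
import numpy as np
import torch
import torch.nn as nn

n = 3                                  # actions per player (rock-paper-scissors)
A = np.array([[0., -1., 1.],           # row player's payoff matrix: A[i, j] is our
              [1., 0., -1.],           # payoff when we play i and the opponent plays j
              [-1., 1., 0.]])

def hedge_update(weights, payoffs, eta=0.1):
    """Multiplicative-weights (Hedge) step: a classic no-regret update."""
    return weights * np.exp(eta * payoffs)

class StrategyRNN(nn.Module):
    """GRU mapping the history of our mixed strategies to the opponent's next one."""
    def __init__(self, n, hidden=32):
        super().__init__()
        self.rnn = nn.GRU(n, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n)
    def forward(self, x):
        h, _ = self.rnn(x)
        return torch.softmax(self.head(h[:, -1]), dim=-1)

model = StrategyRNN(n)
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
opp_w = np.ones(n)                     # opponent's Hedge weights
our_hist, opp_hist, total = [], [], 0.0

for t in range(500):
    opp_strat = opp_w / opp_w.sum()
    if len(our_hist) >= 5:
        with torch.no_grad():
            x = torch.tensor(np.array(our_hist), dtype=torch.float32).unsqueeze(0)
            pred = model(x).numpy().ravel()
        our_strat = np.eye(n)[np.argmax(A @ pred)]   # greedy best response to prediction
    else:
        our_strat = np.ones(n) / n                    # uniform play while warming up
    total += our_strat @ A @ opp_strat
    our_hist.append(our_strat)
    opp_hist.append(opp_strat)
    # opponent observes our mixed strategy and takes a no-regret (Hedge) step;
    # the column player's payoff vector is -(our_strat @ A)
    opp_w = hedge_update(opp_w, -(our_strat @ A))
    # online system identification: fit the RNN to the observed strategy sequence
    if len(our_hist) >= 6:
        x = torch.tensor(np.array(our_hist[:-1]), dtype=torch.float32).unsqueeze(0)
        y = torch.tensor(opp_hist[-1], dtype=torch.float32).unsqueeze(0)
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(x), y)
        loss.backward()
        opt.step()

print(f"average per-round utility against the Hedge opponent: {total / 500:.3f}")
```

For brevity, this sketch lets the opponent observe our full mixed strategy and covers only the RNN branch of the framework; the paper additionally considers a NARX system identification model, and the actual protocol, loss, and response plan may differ.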

Cite this article

LI Kai, HUANG Wenhan, LI Chenchen, DENG Xiaotie. Exploiting a no-regret opponent in repeated zero-sum games [J]. Journal of Shanghai Jiao Tong University (Science), 2025, 30(2): 385-398. DOI: 10.1007/s12204-023-2610-2.
