J Shanghai Jiaotong Univ Sci ›› 2025, Vol. 30 ›› Issue (2): 385-398. doi: 10.1007/s12204-023-2610-2
Li Kai1, Huang Wenhan1, Li Chenchen2, Deng Xiaotie3
Accepted:
2021-10-14
Online:
2025-03-21
Published:
2025-03-21
Abstract: In repeated zero-sum games, a player who uses the stage-game equilibrium strategy in every round may earn a low payoff. In contrast, learning to exploit the opponent from the history of interaction is usually more effective. Against a fully adaptive opponent, however, it is difficult to identify the dynamics governing its strategy changes and then exploit its potential weaknesses. This paper studies how to optimize one's own payoff against an adaptive opponent that uses no-regret learning, a widely used class of adaptive learning algorithms. We propose a general framework that models the strategy-update process of a no-regret opponent online and exploits its weaknesses. Within this framework, the opponent's no-regret dynamics can be approximated, and a strategic response can be derived from the inferred opponent strategy to earn a significant profit. We adopt two system-identification architectures to model the opponent, a recurrent neural network (RNN) and a nonlinear autoregressive exogenous (NARX) model, together with an efficient greedy response scheme inside the framework. We theoretically prove the representational capacity of the RNN architecture for approximating particular no-regret dynamics. Experiments further show that, under mildly non-stationary interaction, these architectures approximate the dynamics with very low error, and the derived strategies exploit no-regret opponents to achieve high payoffs.
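To make the framework concrete, the sketch below is a minimal toy version, not the paper's implementation. A Hedge (multiplicative-weights) opponent, a canonical no-regret learner that updates its weights as w_j ← w_j · exp(−η ℓ_j), plays rock-paper-scissors; the exploiter fits a linear NARX-style one-step predictor of the opponent's mixed strategy, feeding its own past actions in as exogenous inputs, and plays the greedy best response to the prediction. Two assumptions are made purely for illustration: the opponent's mixed strategy is taken to be directly observable each round, and a linear least-squares fit stands in for the paper's RNN and nonlinear ARX architectures.

```python
import numpy as np

# Rock-paper-scissors payoffs for the row player (the exploiter);
# the column player (the no-regret opponent) receives the negation.
A = np.array([[ 0., -1.,  1.],
              [ 1.,  0., -1.],
              [-1.,  1.,  0.]])

rng = np.random.default_rng(0)
n = 3                      # actions per player
T, k, eta = 2000, 2, 0.1   # rounds, history window, Hedge step size

weights = np.ones(n)       # opponent's Hedge weights over its actions
qs, acts = [], []          # observed opponent strategies, our past actions
X, Y = [], []              # NARX training pairs: window -> next strategy
total = 0.0

def one_hot(i):
    v = np.zeros(n)
    v[i] = 1.0
    return v

def window(w_qs, w_acts):
    # NARX regressor: last k opponent strategies plus our last k actions
    return np.concatenate([np.ravel(w_qs),
                           np.ravel([one_hot(a) for a in w_acts])])

for t in range(T):
    q = weights / weights.sum()            # opponent's current mixed strategy

    # Exploiter: predict q with a linear NARX fit, then respond greedily.
    if len(X) >= 20:
        W, *_ = np.linalg.lstsq(np.array(X), np.array(Y), rcond=None)
        q_hat = window(qs[-k:], acts[-k:]) @ W
        q_hat = np.clip(q_hat, 1e-9, None)
        q_hat /= q_hat.sum()
    else:
        q_hat = np.full(n, 1.0 / n)        # uniform guess while data is scarce
    i = int(np.argmax(A @ q_hat))          # greedy best response to prediction

    j = rng.choice(n, p=q)                 # opponent samples its action
    total += A[i, j]

    # Opponent: Hedge update; its loss vector given our action i is A[i, :].
    weights *= np.exp(-eta * A[i, :])
    weights /= weights.sum()               # renormalize for numerical safety

    # Record data: window ending at round t predicts the strategy at t+1.
    qs.append(q)
    acts.append(i)
    if len(qs) > k:
        X.append(window(qs[-k - 1:-1], acts[-k - 1:-1]))
        Y.append(qs[-1])

print(f"average per-round payoff: {total / T:.3f}")
```

Since Hedge's strategy is a softmax of its accumulated losses, the linear fit is only an approximation of the true dynamics, and with a fixed step size the exploiter's per-round edge over the game value is capped by the opponent's no-regret guarantee. The predict-then-respond loop, however, has the same shape as the framework described in the abstract.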
Li Kai, Huang Wenhan, Li Chenchen, Deng Xiaotie. Exploiting a No-Regret Opponent in Repeated Zero-Sum Games[J]. J Shanghai Jiaotong Univ Sci, 2025, 30(2): 385-398.