J Shanghai Jiaotong Univ Sci ›› 2025, Vol. 30 ›› Issue (2): 385-398. doi: 10.1007/s12204-023-2610-2
Li Kai1, Huang Wenhan1, Li Chenchen2, Deng Xiaotie3
Accepted: 2021-10-14
Online: 2025-03-21
Published: 2025-03-21
Abstract: In repeated zero-sum games, a player who uses the stage-game equilibrium strategy in every round may obtain a low payoff. In contrast, learning to exploit the opponent from the history of interactions is usually more effective. However, against a fully adaptive opponent it is difficult to identify the dynamics by which the opponent's strategy changes and to further exploit its potential weaknesses. This paper studies how to optimize one's own payoff against an adaptive opponent that uses no-regret learning, a widely used family of adaptive learning algorithms. We propose a general framework for online modeling of a no-regret opponent's strategy-update process and for exploiting its weaknesses. Within this framework, the opponent's no-regret dynamics can be approximated, and a strategic response can be devised that earns a significant profit on the basis of the inferred opponent strategy. Two system-identification architectures are adopted to model the opponent, namely recurrent neural networks (RNNs) and nonlinear autoregressive exogenous (NARX) models, combined with an efficient greedy response scheme inside the framework. We theoretically prove the representational capability of the RNN architecture to approximate certain no-regret dynamics, and show experimentally that both architectures approximate these dynamics with very low error under mildly non-stationary interactions, and that the derived strategies exploit no-regret opponents to obtain high payoffs.
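The following is a minimal sketch, not the authors' implementation, of the general idea summarized above: fit a small RNN (one of the two system-identification architectures mentioned) to the opponent's observed mixed strategies over past rounds, predict the next-round strategy, and play a greedy response to that prediction in a zero-sum matrix game. The names OpponentRNN and greedy_best_response, the use of PyTorch, the rock-paper-scissors example, and all hyperparameters are illustrative assumptions.

import torch
import torch.nn as nn

class OpponentRNN(nn.Module):
    """Predict the opponent's next mixed strategy from the history of
    (our one-hot action, observed opponent strategy) pairs."""
    def __init__(self, n_actions: int, hidden: int = 32):
        super().__init__()
        self.rnn = nn.RNN(input_size=2 * n_actions, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_actions)

    def forward(self, history: torch.Tensor) -> torch.Tensor:
        # history: (batch, T, 2 * n_actions); returns a point on the action simplex
        out, _ = self.rnn(history)
        return torch.softmax(self.head(out[:, -1]), dim=-1)

def greedy_best_response(payoff: torch.Tensor, opp_strategy: torch.Tensor) -> int:
    """Row action maximizing expected payoff A @ y against the predicted strategy y."""
    return int(torch.argmax(payoff @ opp_strategy))

if __name__ == "__main__":
    # Toy rock-paper-scissors payoff matrix for the row player.
    A = torch.tensor([[0., -1., 1.],
                      [1., 0., -1.],
                      [-1., 1., 0.]])
    model = OpponentRNN(n_actions=3)
    # One fabricated interaction step: our one-hot action + the opponent's observed strategy.
    step = torch.cat([torch.eye(3)[0], torch.tensor([0.5, 0.3, 0.2])]).view(1, 1, 6)
    y_hat = model(step).squeeze(0).detach()   # untrained prediction, for illustration only
    print("predicted opponent strategy:", y_hat)
    print("greedy response action:", greedy_best_response(A, y_hat))

In practice the model would be trained online, e.g., by minimizing the cross-entropy between predicted and realized opponent strategies each round, and the NARX alternative would replace OpponentRNN with a feed-forward map over a fixed window of past inputs.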
CLC number:
Li Kai, Huang Wenhan, Li Chenchen, Deng Xiaotie. Exploiting a No-Regret Opponent in Repeated Zero-Sum Games[J]. J Shanghai Jiaotong Univ Sci, 2025, 30(2): 385-398.