PROVABLE REINFORCEMENT LEARNING FOR CONSTRAINED AND MULTI-AGENT CONTROL SYSTEMS

by

Dongsheng Ding

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)

December 2022

Copyright 2022 Dongsheng Ding

Dedication

To my family.

Acknowledgements

First and foremost, I would like to express my deep and sincere gratitude to my advisor, Professor Mihailo R. Jovanović, for his generous guidance during my graduate study. Every meeting with him was a lesson, and I am grateful for all the knowledge and experience he has shared with me. His commitment to excellence and open-mindedness encouraged me to perfect my research and to think about the big picture. I will always be grateful for all his advice and support, which continue to inspire me to be a better control engineer.

I was fortunate to have the opportunity to work with many excellent researchers, including Dr. Xiaohan Wei, Professor Zhuoran Yang, Professor Zhaoran Wang, Dr. Kaiqing Zhang, Dr. Chen-Yu Wei, and Professor Bin Hu. Their creativity and patience have made our collaboration joyful. They generously offered me an abundant amount of time for discussing and shaping several exciting research problems presented in this dissertation. I am also privileged to have collaborated with Professor Tamer Başar. I would like to thank him for his precious advice and generous support. His amazing enthusiasm and vision have always inspired me to work hard.

My gratitude extends to Professors Ashutosh Nayyar and Meisam Razaviyayn, for serving on my defense/qualifying exam committees, and to Professors Pierluigi Nuzzo and Mahdi Soltanolkotabi, for serving on my qualifying exam committee. I have benefitted from their invaluable comments and feedback. I am also grateful to many excellent teachers who have guided my graduate study, including Professors Zhi-Quan Luo, Andrew Lamperski, Yousef Saad, Peter Seiler, Georgios Giannakis, Arindam Banerjee, and Soheil Mohajer at the University of Minnesota Twin Cities; Professors Bart Kosko, Meisam Razaviyayn, Mahdi Soltanolkotabi, Igor Kukavica, Jason D. Lee, Haipeng Luo, Steven M. Heilman, Sergey Lototsky, Andrew Manion, and Rahul Jain at the University of Southern California. Additionally, I would like to thank Anastassia Tzoytzoyrakos, Lucienne Aarsen, and Elizabeth Fife, for teaching me English communication.

It has been an amazing journey to start my doctoral program at the University of Minnesota Twin Cities and finish it at the University of Southern California. I am very grateful to have my past and current labmates, Dr. Xiaofan Wu, Professor Armin Zare, Dr. Neil Dhingra, Professor Yongxin Chen, Dr. Hamza Farooq, Dr. Sepideh Hassan-Moghaddam, Dr. Saakar Byahut, Dr. Wei Ran, Dr. Anubhav Dwivedi, Hesameddin Mohammadi, Samantha Samuelson, Emily Reed, Zalan Fabian, Jin Zhou, Milad Pooladsanj, and Ibrahim Ozaslan, for making my graduate study more enjoyable. Moreover, I would like to thank many friends I met in this journey for their company, including Dr. Diqing Su, Dr. Jianjun Yuan, Dr. Huaijin Yao, Wen Zhou, Dr. Rui Ma, Dr. Shreyas Bhaban, and Dr. Bhaskar Sen in Minneapolis; Dr. Huimei Cheng, Dr. Sarah Cooney, Shushan Arakelyan, Nripsuta Ani Saxena, Qinyi Luo, Yifang Chen, Dr. Qian Yu, Dr. Yihang Zhang, Dr. Jiali Duan, Dr. Bin Wang, Dr. Nathan Dahlin, Dr. Yeji Shen, Dr. Buyun Chen, Dr. Seyed Mohammadreza Mousavi Kalan, Dr. Mukul Gagrani, Dr.
Filipe Vital, Renyan Sun, Panagiotis Kyriakis, Yijun Liu, Yunsong Liu, Jayson Sia, Krishna Chaitanya Kalagarla, Tiancheng Jin, Fernando V. Monteiro, Mengxiao Zhang, Chung-Wei Lee, and Sagar Sudhakara in Los Angeles. Also, I would like to thank my roommates, Dr. Yaobin Qin, Dr. Jinfeng Yang, Jian Shi, Xuansheng Wang, Tianyi Chen, Gele Qin, and Dr. Xin Bai, for the help in my daily life.

Finally, my deepest gratitude goes to my granduncle, parents, sisters, and cousins, for their endless love and support. My granduncle and father have taught me to be independent and pursue my dreams; my mother has taught me to appreciate every moment in life; my sisters and cousins have always encouraged me to believe in myself. I am the most fortunate guy in the world to have you as my extended family, and I am forever in your debt.

Table of Contents

Dedication
Acknowledgements
List of Figures
Abstract
1 Introduction
    1.1 Policy gradient methods for constrained MDPs
    1.2 Provably efficient RL for constrained MDPs
    1.3 Temporal-difference learning with linear function approximation
    1.4 Markov potential games and independent learning
    1.5 Organization of the dissertation
    1.6 Contributions of the dissertation
I Reinforcement learning for constrained control systems
2 Natural policy gradient primal-dual method for constrained MDPs
    2.1 Introduction
    2.2 Problem setup
        2.2.1 Constrained Markov decision processes
        2.2.2 Method of Lagrange multipliers
        2.2.3 Policy parametrization
    2.3 Natural policy gradient primal-dual method
        2.3.1 Constrained policy optimization methods
        2.3.2 Natural policy gradient primal-dual method
    2.4 Tabular softmax policy case
        2.4.1 Non-asymptotic convergence analysis
    2.5 Function approximation case
        2.5.1 Log-linear policy class
        2.5.2 Non-asymptotic convergence analysis
        2.5.3 General smooth policy class
    2.6 Sample-based algorithms
        2.6.1 Sample complexity
    2.7 Computational experiments
    2.8 Concluding remarks
3 Provably efficient policy optimization for constrained MDPs
    3.1 Introduction
    3.2 Problem setup
        3.2.1 Learning performance
        3.2.2 Linear function approximation
    3.3 Optimistic primal-dual proximal policy optimization
        3.3.1 Policy improvement
        3.3.2 Dual update
        3.3.3 Policy evaluation
    3.4 Regret and constraint violation analysis
        3.4.1 Setting up the analysis
        3.4.2 Proof of regret bound
        3.4.3 Proof of constraint violation
    3.5 Further results on the tabular case
    3.6 Concluding remarks
II Reinforcement learning for multi-agent control systems
4 Multi-agent temporal-difference learning for multi-agent MDPs
    4.1 Introduction
    4.2 Problem formulation and background
        4.2.1 Multi-agent stochastic optimization problem
        4.2.2 Multi-agent Markov decision process
        4.2.3 Multi-agent policy evaluation and temporal-difference learning
        4.2.4 Standard stochastic primal-dual algorithm
    4.3 Algorithm and performance
        4.3.1 Distributed homotopy primal-dual algorithm
        4.3.2 Assumptions
        4.3.3 Finite-time performance bound
    4.4 Finite-time performance analysis
        4.4.1 Setting up the analysis
        4.4.2 Useful lemmas
        4.4.3 Proof of main theorem
    4.5 Computational experiments
    4.6 Concluding remarks
5 Independent policy gradient for Markov potential games
    5.1 Introduction
    5.2 Markov potential games
    5.3 Independent learning protocol
    5.4 Independent policy gradient methods
        5.4.1 Policy gradient for Markov potential games
        5.4.2 Faster rates for Markov cooperative games
    5.5 Nash regret analysis
        5.5.1 Setting up the analysis
        5.5.2 Nash regret analysis for Markov potential games
        5.5.3 Nash regret analysis for Markov cooperative games
    5.6 Independent policy gradient with function approximation
    5.7 Game-agnostic convergence
    5.8 Computational experiments
    5.9 Concluding remarks
6 Discussion and future directions
    6.1 Policy gradient primal-dual algorithms
    6.2 Provably efficient RL for constrained MDPs
    6.3 Multi-agent policy evaluation in other settings
    6.4 Multi-agent policy gradient methods
Bibliography
Appendices
A Supporting proofs in Chapter 2
    A.1 Proof of Lemma 1
    A.2 Proof of Lemma 2
    A.3 Proof of Lemma 3
    A.4 Proof of Theorem 4
    A.5 Proof of Lemma 5
    A.6 Sample-based algorithm with function approximation
    A.7 Proof of Theorem 13
    A.8 Proof of Theorem 14
B Supporting proofs in Chapter 3
    B.1 Proof of Theorem 23
    B.2 Proof of Formulas (3.9) and (3.11)
    B.3 Proof of Formulas (3.14) and (3.15)
        B.3.1 Proof of Formula (3.16)
        B.3.2 Proof of Formula (B.1)
    B.4 Supporting lemmas from optimization
    B.5 Other supporting lemmas
C Supporting proofs in Chapter 4
    C.1 Proof of Lemma 25
    C.2 Martingale concentration bound
    C.3 Proof of Lemma 27
    C.4 Proof of Lemma 28
D Supporting proofs in Chapter 5
    D.1 Proof of Lemma 34
    D.2 Proof of Lemma 35
    D.3 Proof of Lemma 36
    D.4 Proof of Lemma 37
    D.5 Proofs for Section 5.6
        D.5.1 Unbiased estimate
        D.5.2 Proof of Theorem 40
        D.5.3 Proof of Theorem 41
        D.5.4 Sample complexity
    D.6 Proofs for Section 5.7
        D.6.1 Proof of Theorem 42
        D.6.2 Proof of Theorem 43
    D.7 Other auxiliary lemmas
        D.7.1 Auxiliary lemma for potential functions
        D.7.2 Auxiliary lemma for single-player MDPs
        D.7.3 Auxiliary lemma for multi-player MDPs
        D.7.4 Auxiliary lemma for stochastic projected gradient descent

List of Figures

1.1 Empirical success of RL in two representative applications: (a) playing the game of Go [203] and (b) solving Rubik's cube with a robot hand [10].

1.2 Diagram of controller-system interaction. In addition to reward, other learning objectives are critical as constraints, e.g., energy cost [94], traffic delay [11], and infections [253].

1.3 Diagram of multiple controllers-system interactions. More than one controller participates in multi-agent systems, e.g., computer games [229], swarm robotics [105], and financial management [283].

2.1 An example of a constrained MDP for which the objective function $V_r^{\pi_\theta}(s)$ in Problem (2.3) is not concave and the constraint set $\{\theta \in \Theta \mid V_g^{\pi_\theta}(s) \ge b\}$ is not convex. The pair $(r, g)$ associated with each directed arrow represents the (reward, utility) received when an action at a certain state is taken. This example is utilized in the proof of Lemma 3.

2.2 Learning curves of NPG-PD (—) and FOCOPS [280] (—) for Hopper-v3, Swimmer-v3, and HalfCheetah-v3 robotic tasks with the respective speed limits 82.748, 24.516, and 151.989. The vertical axes represent the average reward and the average cost (i.e., average speed). The solid lines show the means of 1000 bootstrap samples obtained over 5 random seeds and the shaded regions display the bootstrap 95% confidence intervals.
2.3 Learning curves of NPG-PD (—) and FOCOPS [280] (—) for Walker2d-v3, Ant-v3, and Humanoid-v3 robotic tasks with the respective speed limits 81.886, 103.115, and 20.140. The vertical axes represent the average reward and the average cost (i.e., average speed). The solid lines show the means of 1000 bootstrap samples obtained over 5 random seeds and the shaded regions display the bootstrap 95% confidence intervals.

2.4 Learning curves of NPG-PD (—) and FOCOPS [280] (—) for the Humanoid Circle-v0 and Ant Circle-v0 robotic tasks. The horizontal axis represents the number of dual updates. The average cost is constrained to go below 50. The vertical axes represent the average reward and the average cost (i.e., average speed). The solid lines show the means of 1000 bootstrap samples obtained over 5 random seeds and the shaded regions display the bootstrap 95% confidence intervals.

4.1 A system with six agents that communicate over a connected undirected network. Each agent interacts with the environment by receiving a private reward and taking a local action.

4.2 Performance comparison for the centralized problem with $N = 1$. Our algorithm with stepsize 0.1 and $K = 4$ achieves a smaller optimality gap than the stochastic primal-dual algorithm with stepsizes 0.1, 0.05, 0.025, and $1/\sqrt{t}$. It also provides a smaller optimality gap than the approach that utilizes pre-collected i.i.d. samples in a buffer.

4.3 Performance comparison for the distributed case with $N = 10$. Our algorithm with stepsize 0.05 and $K = 4$ achieves a smaller optimality gap than the stochastic primal-dual algorithm with stepsizes 0.05, 0.025, 0.0125, and $1/\sqrt{t}$. It also provides a smaller optimality gap than the approach that utilizes pre-collected i.i.d. samples in a buffer.

4.4 Performance comparison for the distributed case with $N = 5$. Our algorithm with stepsize 0.1 and $K = 3$ achieves a smaller optimality gap than the stochastic primal-dual algorithm with stepsizes 0.5, 0.25, 0.125, and $1/\sqrt{t}$. It also provides a smaller optimality gap than the approach that utilizes pre-collected i.i.d. samples in a buffer.

5.1 Convergence performance. (a) Learning curves for our independent policy gradient (—) with stepsize 0.001 and the projected stochastic gradient ascent (—) with stepsize 0.0001 [131]. Each solid line is the mean of trajectories over three random seeds and each shaded region displays the confidence interval.
(b) Learning curves for six individual runs of our independent policy gradient (solid lines) and the projected stochastic gradient ascent (dashed lines), three each. (c) Distribution of players in one of two states taking four actions. In (a) and (b), we measure the accuracy by the absolute distance of each iterate to the converged Nash policy, i.e., $\frac{1}{N}\sum_{i=1}^{N} \|\pi_i^{(t)} - \pi_i^{\mathrm{Nash}}\|_1$. In our computational experiments, the initial distribution is uniform.

5.2 Convergence performance. (a) Learning curves for our independent policy gradient (—) with stepsize 0.002 and the projected stochastic gradient ascent (—) with stepsize 0.0001 [131]. Each solid line is the mean of trajectories over three random seeds and each shaded region displays the confidence interval. (b) Learning curves for six individual runs of our independent policy gradient (solid lines) and the projected stochastic gradient ascent (dashed lines), three each. (c) Distribution of players in one of two states taking four actions. In (a) and (b), we measure the accuracy by the absolute distance of each iterate to the converged Nash policy, i.e., $\frac{1}{N}\sum_{i=1}^{N} \|\pi_i^{(t)} - \pi_i^{\mathrm{Nash}}\|_1$. In our computational experiments, the initial distribution is uniform.

5.3 Convergence performance. (a) Learning curves for our independent policy gradient (—) with stepsize 0.001 and the projected stochastic gradient ascent (—) with stepsize 0.0001 [131]. Each solid line is the mean of trajectories over three random seeds and each shaded region displays the confidence interval. (b) Learning curves for six individual runs of our independent policy gradient (solid lines) and the projected stochastic gradient ascent (dashed lines), three each. (c) Distribution of players in one of two states taking four actions. In (a) and (b), we measure the accuracy by the absolute distance of each iterate to the converged Nash policy, i.e., $\frac{1}{N}\sum_{i=1}^{N} \|\pi_i^{(t)} - \pi_i^{\mathrm{Nash}}\|_1$. In our computational experiments, the initial distribution is nearly degenerate, $(0.9999, 0.0001)$.

5.4 Convergence performance. (a) Learning curves for our independent policy gradient (—) with stepsize 0.001 and the projected stochastic gradient ascent (—) with stepsize 0.0001 [131]. Each solid line is the mean of trajectories over three random seeds and each shaded region displays the confidence interval. (b) Learning curves for six individual runs of our independent policy gradient (solid lines) and the projected stochastic gradient ascent (dashed lines), three each. (c) Distribution of players in one of two states taking four actions. In (a) and (b), we measure the accuracy by the absolute distance of each iterate to the converged Nash policy, i.e., $\frac{1}{N}\sum_{i=1}^{N} \|\pi_i^{(t)} - \pi_i^{\mathrm{Nash}}\|_1$. In our computational experiments, the initial distribution is nearly degenerate, $(0.0001, 0.9999)$.

Abstract

Reinforcement learning (RL) has proven its value through great empirical success in many artificial sequential decision-making control systems, but uptake in complex real-world systems has been much slower. A wide gap between standard RL setups and reality often results from the constraints and the multiple agents present in real-world systems. To reduce this gap, we develop effective RL algorithms, with theoretical performance guarantees, that search for better control policies in two types of real-world stochastic control systems: constrained systems and multi-agent systems.

Part I of the dissertation is devoted to RL for constrained control systems. We study two settings of sequential decision-making control problems described by constrained Markov decision processes (MDPs), in which a controller (or an agent) aims at satisfying a constraint in addition to maximizing the standard reward objective. In the simulation setting, we propose a direct policy search method for infinite-horizon constrained MDPs: the natural policy gradient primal-dual method, which updates the primal policy via natural policy gradient ascent and the dual variable via projected sub-gradient descent. We establish a global convergence theory for our method using softmax, log-linear, and general smooth policy parametrizations, and demonstrate finite-sample complexity guarantees for two model-free extensions of our method.
In the online episodic setting, we propose an online policy optimization method for episodic finite-horizon constrained MDPs: optimistic primal-dual proximal policy optimization, where we effectuate safe exploration through upper-confidence-bound optimism and address constraints via primal-dual optimization. We establish sublinear regret and constraint violation bounds that depend on the size of the state-action space only through the dimension of the feature mapping, making our results hold even when the number of states goes to infinity.

Part II of the dissertation is devoted to RL for multi-agent control systems. We study two setups of multi-agent sequential decision-making control problems modeled by multi-agent MDPs in which multiple agents aim at maximizing their reward objectives. In the cooperative setup, we propose an online distributed temporal-difference learning algorithm for solving the classical policy evaluation problem with networked agents. Our algorithm works as a true stochastic primal-dual update using online Markovian samples and homotopy-based adaptive stepsizes. We establish an optimal finite-time error bound with a sharp dependence on the network size and topology. In the cooperative/competitive setup, we propose a new independent policy gradient method for learning a Nash policy of Markov potential games. We establish sublinear Nash regret bounds that are free of explicit dependence on the state space size, enabling our method to work for problems with a large state space and a large number of players. We demonstrate finite-sample complexity guarantees for a model-free extension of our method in the function approximation setting. Moreover, we identify a class of independent policy gradient methods that enjoys last-iterate convergence and sublinear Nash regret bounds for learning both zero-sum Markov games and Markov cooperative games.

Chapter 1

Introduction

Reinforcement learning (RL) is an algorithmic paradigm for sequential decision-making in which a controller (or an agent) aims to maximize the task-associated long-term reward by interacting with an unknown system (or environment) over time to learn a good control policy. In recent years, RL has achieved remarkable empirical success in a large set of simulated systems, such as playing computer games and manipulating robots; see Figure 1.1. However, it is challenging to extend such success to real-world applications by directly applying existing RL methods. In this thesis, we advance current RL progress toward real-world applications by establishing algorithmic solutions for two types of stochastic control systems, constrained systems and multi-agent systems, from a mainly theoretical point of view.

Figure 1.1: Empirical success of RL in two representative applications: (a) playing the game of Go [203] and (b) solving Rubik's cube with a robot hand [10].

In many real-world RL tasks, it is not sufficient for the agent to only maximize the long-term reward associated with a single learning objective. The control system is also subject to constraints on its utilities/costs in many safety-critical applications, e.g., in autonomous driving, robotics, cyber-security, and financial management. Application of standard RL techniques to such constrained systems stimulates an active line of research on constrained RL: in addition to maximizing the long-term reward, it is also critical to take into account the (safety) constraint on the long-term utility/cost as an extra learning objective; see Figure 1.2.

Figure 1.2: Diagram of controller-system interaction. In addition to reward, other learning objectives are critical as constraints, e.g., energy cost [94], traffic delay [11], and infections [253].

Along this line of research, we focus on a fundamental class of constrained control systems modeled by constrained Markov decision processes (MDPs). Part I of the dissertation is devoted to direct policy search methods for constrained MDPs in two fundamental RL setups: with or without policy simulators. We are interested in exploiting the structure of the RL objective to design RL algorithms with provable performance guarantees. The two topics we investigate are:

• Natural policy gradient primal-dual methods for constrained MDPs.
• Provably efficient policy optimization for constrained MDPs.

Beyond a single controller (or agent), many successful RL applications involve the participation of more than one agent, e.g., computer games, swarm robotics, and financial management. Leveraging standard RL techniques for such multi-agent systems encourages a rich line of research on multi-agent RL: multiple agents operate in a common system and each of them aims to maximize its long-term reward by interacting with the unknown system and the other agents; see Figure 1.3.

Figure 1.3: Diagram of multiple controllers-system interactions. More than one controller participates in multi-agent systems, e.g., computer games [229], swarm robotics [105], and financial management [283].

In this line of research, we are interested in two multi-agent RL tasks: the policy evaluation problem for multi-agent MDPs and independent direct policy search methods for Markov potential games. Part II of the dissertation is devoted to the following two topics.

• Multi-agent temporal-difference learning for multi-agent MDPs.
• Independent policy gradient methods for Markov potential games.

Having stated the two lines of research above, a critical capability we aim to develop for our RL methods is that they can work in large state spaces with function approximation. With this attribute, we support our RL methods with theoretical convergence guarantees that result from a solid integration of control, optimization, statistics, and game theory.

The remainder of this introduction is organized as follows. We briefly overview policy gradient methods and their applications to constrained MDPs in Section 1.1. We discuss provably efficient RL algorithms for constrained MDPs in Section 1.2. We overview temporal-difference learning with linear function approximation in Section 1.3. We discuss Markov potential games and independent learning in Section 1.4. In Section 1.5, we provide an outline of the dissertation. Finally, we summarize the main contributions of the dissertation in Section 1.6.

1.1 Policy gradient methods for constrained MDPs

Policy gradient methods lie at the heart of the empirical success of RL, which motivates a rich line of global convergence results. In [87, 152, 170, 167, 168, 169], the authors provided global convergence guarantees and quantified the sample complexity of (natural) policy gradient methods for the nonconvex linear quadratic regulator problem of both discrete- and continuous-time systems.
In [276], the authors showed that locally optimal policies for MDPs are achievable using policy gradient methods with reward reshaping. It was demonstrated in [233] that (natural) policy gradient methods converge to the globally optimal value when overparametrized neural networks are used. A variant of natural policy gradient, trust-region policy optimization (TRPO) [195], converges to the globally optimal policy with overparametrized neural networks [139] and for regularized MDPs [199]. In [32, 33], the authors studied global optimality and convergence of policy gradient methods from a policy iteration perspective. In [8], the authors characterized global convergence properties of (natural) policy gradient methods and studied computational, approximation, and sample size issues. Additional recent advances along these lines include [159, 271, 50, 142, 75, 117, 247]. While all these references handle a lack of convexity in the objective function, we make an additional effort to deal with the nonconvex constraints that arise in optimal control of constrained MDPs.

In many constrained MDP algorithms [11, 2, 1, 37], Lagrangian-based policy gradient methods have been widely used to address constraints. However, convergence guarantees for these algorithms are either local (to stationary-point or locally optimal policies) [35, 56, 220] or asymptotic [37]. When function approximation is used for policy parametrization, [264] recognized the lack of convexity and showed asymptotic convergence (to a stationary point) of a method based on successive convex relaxations. In [179], the authors provided a duality analysis for constrained MDPs in the policy space and proposed a provably convergent dual descent algorithm by assuming access to a nonconvex optimization oracle. However, it is not clear how to obtain the solution to the primal nonconvex optimization problem, and global convergence guarantees are not established. In [180], the authors proposed a primal-dual algorithm and provided computational experiments but did not offer any convergence analysis. This motivates us to establish non-asymptotic convergence of Lagrangian-based policy gradient methods to a globally optimal solution. Other related Lagrangian-based policy optimization methods include CPG [227], accelerated PDPO [135], CPO [7, 259], FOCOPS [280], IPPO [143], P3O [201], and CUP [257], but theoretical guarantees for these algorithms are still lacking.

1.2 Provably efficient RL for constrained MDPs

Provably efficient RL algorithms have shown the power of function approximation to achieve statistical efficiency through the tradeoff between exploration and exploitation. Using optimism in the face of uncertainty [17, 45], [256, 255, 110, 47, 267] addressed the exploration-exploitation tradeoff by adding an Upper Confidence Bound (UCB) bonus, and the proposed algorithms are provably sample-efficient. In [47], optimism has been combined with policy-based RL: an optimistic proximal policy optimization with UCB exploration. However, all these references only studied particular MDPs in unconstrained RL. This motivates us to design an optimistic variant of proximal policy optimization for constrained MDPs. For large constrained MDPs with unknown transition models, there is a line of literature related to policy optimization under constraints, e.g., [227, 7, 259, 220, 143, 280, 201, 257]. However, exploration under constraints is less studied and the corresponding theoretical guarantees are unknown.
The present work fills in this gap in the linear function approximation setting.

The study of provably efficient RL algorithms for constrained MDPs has received growing attention, especially for learning constrained MDPs with unknown transitions and rewards. Most existing methods are model-based and only apply to finite state-action spaces. [204, 85] leveraged upper confidence bounds (UCB) on fixed reward, utility, and transition probability to propose sample-efficient algorithms for tabular constrained MDPs; [204] established an $\tilde{O}(\sqrt{|\mathcal{A}| T^{1.5}\log T})$ regret and constraint violation via linear programming in the average-cost case in time $T$; [85] achieved an $\tilde{O}(|\mathcal{S}|\sqrt{H^3 T})$ regret and constraint violation in the episodic setting via linear programming and primal-dual policy optimization, where $|\mathcal{S}|$ is the size of the state space, $|\mathcal{A}|$ is the size of the action space, and $H$ is the horizon of an episode. In [187], the authors studied an adversarial stochastic shortest path problem under constraints and unknown transitions with $\tilde{O}(|\mathcal{S}|\sqrt{|\mathcal{A}| H^2 T})$ regret and constraint violation. [21] extended Q-learning with optimism to finite state-action constrained MDPs with peak constraints. [42] proposed UCB-based convex planning for episodic tabular constrained MDPs that handles convex or hard constraints. [115, 98] established probably approximately correct (PAC) guarantees that enjoy better problem-dependent sample complexity. In contrast, our proposed algorithm can potentially apply to scenarios with an infinite state space, and our sublinear regret and constraint violation bounds only depend on the implicit dimension instead of the true dimension of the state space. Compared to more recent references [73, 52, 271, 251, 201, 257], we attack the exploration problem directly and do not rely on any policy simulators (or generative models).

1.3 Temporal-difference learning with linear function approximation

Temporal-difference (TD) learning with linear function approximators is a popular approach for estimating the value function of an agent that follows a particular policy. The asymptotic convergence of the original TD method, which is known as TD(0), and its extension TD(λ) was established in [225]. In spite of their widespread use, these algorithms can become unstable and convergence cannot be guaranteed in off-policy learning scenarios [26, 218]. To ensure stability, batch methods, e.g., Least-Squares Temporal-Difference (LSTD) learning [41], have been proposed at the expense of increased computational complexity. To achieve both stability and low computational cost, a class of gradient-based temporal-difference (GTD) algorithms [215, 216], e.g., GTD, GTD2, and TDC, were proposed, and their asymptotic convergence in off-policy settings was established by analyzing certain ordinary differential equations (ODEs) [38]. However, these are not true stochastic gradient methods with respect to the original objective functions [218], because the underlying TD objectives, e.g., the mean square Bellman error (MSBE) in GTD or the mean square projected Bellman error (MSPBE) in GTD2, involve products and inverses of expectations. As such, they cannot be sampled directly and it is difficult to analyze efficiency with respect to the TD objectives.

The finite-time or finite-sample performance analysis of TD algorithms is critically important in applications with limited time budgets and finite amounts of data. Early results are based on the stochastic approximation approach under i.i.d. sampling.
For TD(0) and GTD, an $O(1/T^{\sigma})$ error bound with $\sigma \in (0, 0.5)$ was established in [61, 62] and an improved $O(1/T)$ bound was provided in [122], where $T$ is the total number of iterations. In the Markov setting, an $O(1/T)$ bound was established for TD(0) with a projection step [34], for a linear stochastic approximation algorithm driven by Markovian noise [207], and for an on-policy TD algorithm known as SARSA [286]. Recent work [102] provided a complementary analysis of TD algorithms using Markov jump linear system theory. To enable both on- and off-policy implementation, an optimization-based approach [138] was used to cast the MSPBE into a convex-concave objective that allows the use of stochastic gradient algorithms [175] with an $O(1/\sqrt{T})$ bound for i.i.d. samples; an extension of this approach to the Markov setting was provided in [238]. In [222], the finite-time error bound of GTD was improved to $O(1/T)$ for i.i.d. sampled data, but it remains unclear how to extend these results to multi-agent scenarios where data is not only Markovian but also distributed over a network. The present work fills in this gap by proposing a new online distributed TD learning algorithm that operates on online Markovian samples.
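Since the algorithms above all revolve around the same primitive, a short sketch may help fix ideas: the snippet below implements the classical TD(0) update with linear function approximation, $w \leftarrow w + \alpha\,\delta_t\,\phi(s_t)$ with $\delta_t = r_t + \gamma\,\phi(s_{t+1})^\top w - \phi(s_t)^\top w$, on a small made-up Markov reward process. The transition matrix, features, and stepsize are hypothetical choices used only for illustration; they are not taken from the dissertation or from any of the cited algorithms.

```python
import numpy as np

rng = np.random.default_rng(0)

# A small Markov reward process induced by a fixed policy:
# 3 states, transition matrix P, per-state reward r, discount gamma.
P = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.2, 0.3, 0.5]])
r = np.array([1.0, 0.0, 0.5])
gamma = 0.9

# Linear value-function approximation V(s) ~ phi(s)^T w with 2 features per state.
Phi = np.array([[1.0, 0.0],
                [0.0, 1.0],
                [0.5, 0.5]])
w = np.zeros(2)
alpha = 0.05                      # constant stepsize

s = 0
for _ in range(20000):            # one long Markovian trajectory
    s_next = rng.choice(3, p=P[s])
    # TD(0): move w along the feature of the current state, scaled by the TD error.
    td_error = r[s] + gamma * Phi[s_next] @ w - Phi[s] @ w
    w += alpha * td_error * Phi[s]
    s = s_next

V_exact = np.linalg.solve(np.eye(3) - gamma * P, r)   # exact values for comparison
print("approximate values:", Phi @ w)
print("exact values:      ", V_exact)
```

The GTD-type methods discussed above replace this plain stochastic-approximation update with gradient steps on a surrogate objective such as the MSPBE, which is the formulation that Chapter 4 builds on in the distributed, multi-agent setting.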
1.4 Markov potential games and independent learning

In stochastic optimal control, the Markov potential game (MPG) model dates back to [66, 95]. More recent studies include [268, 157, 147, 160], and all of these studies focus on systems with known dynamics. MPGs have also attracted attention in multi-agent RL. In the infinite-horizon setting, [131, 277] extended the policy gradient method [8, 112] to multiple players and established iteration/sample complexities that scale with the size of the state space; [92] generalized the natural policy gradient method [112, 8] and established global asymptotic convergence. In the finite-horizon setting, [205] built on the single-agent Nash-VI [140] to propose a sample-efficient turn-based algorithm and [153] studied the policy gradient method. Earlier, [234, 144] studied Markov cooperative games and [118, 178, 57] studied one-state MPGs; both of these are special cases of MPGs. We note that the term "Markov potential game" has also been used to refer to state-based potential MDPs [154, 161], which are different from the MPGs that we study; see the counterexamples in [131].

Despite recent advances in the theory of policy gradient methods [32, 8], the theory of policy gradient methods for multi-agent RL is relatively less studied. In basic two-player zero-sum Markov games, [274, 44, 64, 282] established global convergence guarantees for policy gradient methods for learning an (approximate) Nash equilibrium. More recently, [49, 239] examined variants of policy gradient methods and provided last-iterate convergence guarantees. However, it is much harder for policy gradient methods to work in general Markov games [158, 97]. The effectiveness of (natural) policy gradient methods for tabular MPGs was demonstrated in [131, 277, 92, 278]. Moreover, [249, 235, 262, 181] reported impressive empirical performance of multi-agent policy gradient methods with function approximation in cooperative Markov games, but the theoretical foundation has not been provided, which motivates the present work.

Independent learning has recently received attention in multi-agent RL [64, 273, 177, 193, 111, 205, 116], because it only requires local information for learning and naturally yields algorithms that scale to a large number of players. The algorithms in [131, 277, 92, 278] can also be generally categorized as independent learning algorithms for MPGs. In addition, we establish a new independent learning method for MPGs with a large state space and a large number of players.

Being game-agnostic is a desirable property for independent learning, in which players are oblivious to the types of games being played. In particular, classical fictitious play warrants average-iterate convergence for several games [189, 172, 100]. Although online learning algorithms, e.g., the one based on multiplicative weight updates (MWU) [51], offer average-iterate convergence in zero-sum matrix games, they often do not provide last-iterate convergence guarantees [25], which motivates recent studies [65, 171, 241]. Interestingly, while MWU converges in last-iterate for potential games [178, 57], this is not the case for zero-sum matrix games [53]. Recently, [130, 129] established last-iterate convergence of Q-learning dynamics for both zero-sum and potential/cooperative matrix games. However, it remains an open question whether an algorithm can have last-iterate convergence for both zero-sum and potential/cooperative Markov games. We provide the first affirmative answer to this question.

1.5 Organization of the dissertation

The dissertation contains two parts. In each part, we investigate two problem setups in two chapters, respectively. Part I of the dissertation is devoted to reinforcement learning for constrained control systems: Chapter 2 is based on joint work with Kaiqing Zhang, Jiali Duan, and Tamer Başar [73, 69, 68], and Chapter 3 is based on joint work with Xiaohan Wei, Zhuoran Yang, and Zhaoran Wang [74]. Part II of the dissertation is devoted to reinforcement learning for multi-agent control systems: Chapter 4 is based on joint work with Xiaohan Wei, Zhuoran Yang, and Zhaoran Wang [70, 71], and Chapter 5 is based on joint work with Chen-Yu Wei and Kaiqing Zhang [72].

Part I

In Chapter 2, we first formulate an optimal control problem for constrained MDPs and provide necessary background material. Then, we describe our natural policy gradient primal-dual method and provide convergence guarantees for our algorithm under the tabular softmax, log-linear, and general smooth policy parametrizations. Next, we establish convergence and finite-sample complexity guarantees for two model-free primal-dual algorithms and provide computational experiments. Finally, we close this chapter with concluding remarks.

In Chapter 3, we first introduce the episodic finite-horizon constrained MDPs, the metrics of learning performance, and the linear function approximation. We then propose an optimistic primal-dual policy optimization algorithm for learning constrained MDPs. Next, we establish a regret and constraint violation analysis for the proposed algorithm. We further present some improved results in the tabular setting. Finally, we close this chapter with concluding remarks.

Part II

In Chapter 4, we first introduce a class of multi-agent stochastic saddle point problems that contain, as a special instance, minimization of a mean square projected Bellman error via distributed temporal-difference learning. We then develop a homotopy-based online distributed primal-dual algorithm to solve this problem and establish a finite-time performance bound for the proposed algorithm. We offer computational experiments to demonstrate the merits and the effectiveness of our theoretical findings. Finally, we close the chapter with concluding remarks.
In Chapter 5, we first introduce Markov potential games and Nash equilibria, and provide necessary background material. Then, we present an independent learning protocol, propose an independent policy gradient method for Markov potential games, and establish a Nash regret analysis. We next establish a model-free extension of our method and analysis to the linear function approximation setting. Furthermore, we establish game-agnostic convergence of an optimistic independent policy gradient method for learning both Markov cooperative games and zero-sum Markov games. We also provide computational experiments to demonstrate the merits and the effectiveness of our theoretical findings. Finally, we close the chapter with concluding remarks.

1.6 Contributions of the dissertation

This section summarizes the most important contributions of the dissertation.

Part I

Convergence and sample complexity of natural policy gradient primal-dual methods for constrained MDPs. We propose a simple but effective primal-dual algorithm for solving discounted infinite-horizon optimal control problems for constrained MDPs. Our Natural Policy Gradient Primal-Dual (NPG-PD) method employs natural policy gradient ascent to update the primal variable and projected sub-gradient descent to update the dual variable. We exploit the structure of the softmax policy parametrization to establish global convergence guarantees in spite of the fact that the objective function in the maximization problem is not concave and the constraint set is not convex. In particular, we prove that our NPG-PD method achieves global convergence with rate $O(1/\sqrt{T})$ in both the optimality gap and the constraint violation, where $T$ is the total number of iterations. Our convergence guarantees are dimension-free, i.e., the rate is independent of the size of the state-action space. We further establish convergence with rate $O(1/\sqrt{T})$ in both the optimality gap and the constraint violation for log-linear and general smooth policy parametrizations, up to a function approximation error caused by the restricted policy parametrization. We also provide convergence and finite-sample complexity guarantees for two sample-based NPG-PD algorithms. Finally, we utilize computational experiments to showcase the merits and the effectiveness of our approach.

Provably efficient policy optimization for learning constrained MDPs with linear function approximation. We propose a provably efficient safe RL algorithm for constrained MDPs with an unknown transition model in the linear episodic setting: an Optimistic Primal-Dual Proximal Policy OPtimization (OPDOP) algorithm, where the value function is estimated by combining least-squares policy evaluation with an additional bonus term for exploration under constraints (or safe exploration). Theoretically, we prove that the proposed algorithm achieves an $\tilde{O}(dH^{2.5}\sqrt{T})$ regret and the same $\tilde{O}(dH^{2.5}\sqrt{T})$ constraint violation, where $d$ is the dimension of the feature mapping, $H$ is the horizon of each episode, and $T$ is the total number of steps. We establish these bounds in the setting where the reward/utility functions are fixed but the feedback after each episode is bandit. Our bounds depend on the capacity of the state space only through the dimension of the feature mapping and thus hold even when the number of states goes to infinity. To the best of our knowledge, our result is the first provably efficient online policy optimization for constrained MDPs in the function approximation setting, with safe exploration.

Part II

Homotopy stochastic primal-dual optimization for multi-agent temporal-difference learning. We formulate multi-agent temporal-difference (TD) learning as the minimization of the mean square projected Bellman error (MSPBE). We employ Fenchel duality to cast the MSPBE minimization as a stochastic saddle point problem in which the primal-dual objective is convex and strongly concave with respect to the primal and dual variables, respectively. Since the primal-dual objective has a linear dependence on expectations, we can obtain unbiased estimates of gradients from state samples, thereby overcoming a challenge faced by approaches based on naive TD objectives [138]. This allows us to design a true stochastic primal-dual learning algorithm and perform a finite-time performance analysis in the Markov setting [175, 83]. Our primal-dual formulation utilizes distributed dual averaging [82], and for our homotopy-based distributed learning algorithm we establish a sharp finite-time error bound in terms of the network size and topology. This differentiates our work from the approaches and results in [127, 76, 77]. To the best of our knowledge, we are the first to utilize the homotopy-based approach for solving a class of distributed convex-concave saddle point programs with an $O(1/T)$ finite-time performance bound.

Independent policy gradient for large-scale Markov potential games (MPGs): sharper rates, function approximation, and game-agnostic convergence. First, we propose an independent policy gradient algorithm for learning an $\epsilon$-Nash equilibrium of Markov potential games (MPGs) with $O(1/\epsilon^2)$ iteration complexity. In contrast to the state-of-the-art results [131, 277], this iteration complexity does not explicitly depend on the state space size. Second, we consider a linear function approximation setting and design an independent sample-based policy gradient algorithm that learns an $\epsilon$-Nash equilibrium with $O(1/\epsilon^5)$ sample complexity. This appears to be the first result for learning MPGs with function approximation. Third, we establish the convergence of an independent optimistic policy gradient algorithm (which has been proved to converge in learning zero-sum Markov games [239]) for learning a subclass of MPGs: Markov cooperative games. We show that the same type of optimistic policy learning algorithm provides an $\epsilon$-Nash equilibrium in both zero-sum Markov games and Markov cooperative games while the players are oblivious to the types of games being played. To the best of our knowledge, this appears to be the first game-agnostic convergence result in Markov games.

Part I
Reinforcement learning for constrained control systems

Chapter 2

Natural policy gradient primal-dual method for constrained MDPs

In this chapter, we study sequential decision-making control problems aimed at maximizing the expected total reward while satisfying a constraint on the expected total utility. We employ the natural policy gradient method to solve the discounted infinite-horizon optimal control problem for constrained Markov decision processes (MDPs). Specifically, we propose a new Natural Policy Gradient Primal-Dual (NPG-PD) method that updates the primal variable via natural policy gradient ascent and the dual variable via projected sub-gradient descent. Although the underlying maximization involves a nonconcave objective function and a nonconvex constraint set, under the softmax policy parametrization we prove that our method achieves global convergence with sublinear rates regarding both the optimality gap and the constraint violation.
Such convergence is independent of the size of the state-action space, i.e., it is dimension-free. Furthermore, for log-linear and general smooth policy parametrizations, we establish sublinear convergence rates up to a function approximation error caused by the restricted policy parametrization. We also provide convergence and finite-sample complexity guarantees for two sample-based NPG-PD algorithms. Finally, we use computational experiments to showcase the merits and the effectiveness of our approach.

2.1 Introduction

Policy gradient [217] and natural policy gradient [112] methods have enjoyed substantial empirical success in solving MDPs [195, 136, 164, 194, 214]. Policy gradient methods, or more generally direct policy search methods, have also been used to solve constrained MDPs [227, 37, 35, 56, 220, 135, 180, 7, 206], but most existing theoretical guarantees are asymptotic and/or only provide local convergence guarantees to stationary-point policies. On the other hand, it is desirable to show that, for an arbitrary initial condition, a solution that enjoys an $\epsilon$-optimality gap and $\epsilon$-constraint violation can be computed using a finite number of iterations and/or samples. It is thus imperative to establish global convergence guarantees for policy gradient methods when solving constrained MDPs.

In this chapter, we provide a theoretical foundation for non-asymptotic global convergence of the natural policy gradient method in solving optimal control problems for constrained MDPs and answer the following questions:

(i) Can we employ natural policy gradient methods to solve optimal control problems for constrained MDPs?

(ii) Do natural policy gradient methods converge to the globally optimal solution that satisfies the constraints?

(iii) What is the convergence rate of natural policy gradient methods and the effect of the function approximation error caused by a restricted policy parametrization?

(iv) What is the sample complexity of model-free natural policy gradient methods?

In Section 2.2, we formulate an optimal control problem for constrained Markov decision processes and provide necessary background material. In Section 2.3, we describe our natural policy gradient primal-dual method. We provide convergence guarantees for our algorithm under the tabular softmax policy parametrization in Section 2.4 and under log-linear and general smooth policy parametrizations in Section 2.5. We establish convergence and finite-sample complexity guarantees for two model-free primal-dual algorithms in Section 2.6 and provide computational experiments in Section 2.7. We close the chapter with remarks in Section 2.8.

2.2 Problem setup

In Section 2.2.1, we introduce constrained Markov decision processes. In Section 2.2.2, we present the method of Lagrange multipliers, formulate a saddle-point problem for the constrained policy optimization, and exhibit several problem properties: strong duality, boundedness of the optimal dual variable, and a bound on the constraint violation. In Section 2.2.3, we introduce a parametrized formulation of the constrained policy optimization problem, provide an example of a constrained MDP which is not convex, and present several useful policy parametrizations.

2.2.1 Constrained Markov decision processes

We consider a discounted constrained Markov decision process [11],

$$\text{CMDP}(\mathcal{S}, \mathcal{A}, P, r, g, b, \gamma, \rho)$$

where $\mathcal{S}$ is a state space, $\mathcal{A}$ is an action space, $P$ is a transition probability measure which specifies the transition probability $P(s' \mid s, a)$ from state $s$ to the next state $s'$ under action $a \in \mathcal{A}$, $r\colon \mathcal{S}\times\mathcal{A} \to [0, 1]$ is a reward function, $g\colon \mathcal{S}\times\mathcal{A} \to [0, 1]$ is a utility function, $b$ is a constraint offset, $\gamma \in [0, 1)$ is a discount factor, and $\rho$ is an initial distribution over $\mathcal{S}$. For any state $s_t$, a stochastic policy $\pi\colon \mathcal{S} \to \Delta(\mathcal{A})$ is a function into the probability simplex $\Delta(\mathcal{A})$ over the action space $\mathcal{A}$, i.e., $a_t \sim \pi(\cdot \mid s_t)$ at time $t$. Let $\Pi$ be the set of all possible policies. A policy $\pi \in \Pi$, together with the initial state distribution $\rho$, induces a distribution over trajectories $\zeta = \{(s_t, a_t, r_t, g_t)\}_{t=0}^{\infty}$, where $s_0 \sim \rho$, $a_t \sim \pi(\cdot \mid s_t)$, and $s_{t+1} \sim P(\cdot \mid s_t, a_t)$ for all $t \ge 0$.

Given a policy $\pi$, the value functions $V_r^\pi, V_g^\pi\colon \mathcal{S} \to \mathbb{R}$ associated with the reward $r$ or the utility $g$ are determined by the expected values of the total discounted rewards or utilities received under policy $\pi$,

$$V_r^\pi(s) := \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t\, r(s_t, a_t) \,\Big|\, \pi,\ s_0 = s\right], \qquad V_g^\pi(s) := \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t\, g(s_t, a_t) \,\Big|\, \pi,\ s_0 = s\right]$$

where the expectation $\mathbb{E}$ is taken over the randomness of the trajectory induced by $\pi$. Starting from an arbitrary state-action pair $(s, a)$ and following a policy $\pi$, we also introduce the state-action value functions $Q_r^\pi, Q_g^\pi\colon \mathcal{S}\times\mathcal{A} \to \mathbb{R}$ together with their advantage functions $A_r^\pi, A_g^\pi\colon \mathcal{S}\times\mathcal{A} \to \mathbb{R}$,

$$Q_\diamond^\pi(s, a) := \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t\, \diamond(s_t, a_t) \,\Big|\, \pi,\ s_0 = s,\ a_0 = a\right], \qquad A_\diamond^\pi(s, a) := Q_\diamond^\pi(s, a) - V_\diamond^\pi(s)$$

where the symbol $\diamond$ represents $r$ or $g$. Since $r, g \in [0, 1]$, we have $V_\diamond^\pi(s) \in [0, 1/(1-\gamma)]$, and the expected values under the initial distribution $\rho$ are determined by $V_\diamond^\pi(\rho) := \mathbb{E}_{s_0 \sim \rho}\left[V_\diamond^\pi(s_0)\right]$.

Having defined a policy as well as the state-action value functions for the discounted constrained MDP, the objective is to find a policy that maximizes the expected reward value over all policies subject to a constraint on the expected utility value,

$$\underset{\pi \,\in\, \Pi}{\text{maximize}} \ \ V_r^\pi(\rho) \quad \text{subject to} \quad V_g^\pi(\rho) \,\ge\, b. \tag{2.1}$$

In view of the aforementioned boundedness of $V_r^\pi(s)$ and $V_g^\pi(s)$, we set the constraint offset $b \in (0, 1/(1-\gamma)]$ to make Problem (2.1) meaningful.

Remark 1. For notational convenience we consider a single constraint in Problem (2.1), but our convergence guarantees are readily generalizable to problems with multiple constraints.
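As a concrete illustration of these definitions, the following self-contained sketch estimates $V_r^\pi(\rho)$ and $V_g^\pi(\rho)$ by Monte Carlo rollouts for a fixed policy on a small hypothetical CMDP, and then checks the constraint $V_g^\pi(\rho) \ge b$ from Problem (2.1). All transition probabilities, rewards, utilities, and the offset $b$ are made-up numbers used only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical CMDP with 2 states and 2 actions.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],     # P[s, a, s']
              [[0.6, 0.4], [0.1, 0.9]]])
r = np.array([[1.0, 0.0], [0.2, 0.6]])      # reward r(s, a) in [0, 1]
g = np.array([[0.0, 1.0], [0.8, 0.3]])      # utility g(s, a) in [0, 1]
gamma, b = 0.9, 3.0                         # discount factor and constraint offset
rho = np.array([0.5, 0.5])                  # initial state distribution
pi = np.array([[0.5, 0.5], [0.3, 0.7]])     # a fixed stochastic policy pi(a | s)

def value_estimates(pi, episodes=2000, horizon=200):
    """Monte Carlo estimates of V_r^pi(rho) and V_g^pi(rho)."""
    Vr = Vg = 0.0
    for _ in range(episodes):
        s = rng.choice(2, p=rho)
        for t in range(horizon):            # horizon truncates the infinite sum
            a = rng.choice(2, p=pi[s])
            Vr += gamma**t * r[s, a] / episodes
            Vg += gamma**t * g[s, a] / episodes
            s = rng.choice(2, p=P[s, a])
    return Vr, Vg

Vr, Vg = value_estimates(pi)
print(f"V_r(rho) = {Vr:.2f}, V_g(rho) = {Vg:.2f}, constraint V_g >= b holds: {Vg >= b}")
```

For a tabular CMDP the same quantities can also be computed exactly by solving the Bellman evaluation system $(I - \gamma P_\pi)V_\diamond^\pi = r_\diamond^\pi$ induced by the policy, which avoids the Monte Carlo error; the primal-dual sketch later in this chapter takes that route.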
2.2.2 Method of Lagrange multipliers

By dualizing the constraint [145, 29], we cast Problem (2.1) into the following max-min problem,

$$\underset{\pi \,\in\, \Pi}{\text{maximize}} \ \underset{\lambda \,\ge\, 0}{\text{minimize}} \ \ V_r^\pi(\rho) \,+\, \lambda\left(V_g^\pi(\rho) - b\right) \tag{2.2}$$

where $V_L^{\pi,\lambda}(\rho) := V_r^\pi(\rho) + \lambda\,(V_g^\pi(\rho) - b)$ is the Lagrangian of Problem (2.1), $\pi$ is the primal variable, and $\lambda \ge 0$ is the nonnegative Lagrange multiplier or dual variable. The associated dual function is defined as

$$V_D^\lambda(\rho) \,:=\, \underset{\pi \,\in\, \Pi}{\text{maximize}} \ V_L^{\pi,\lambda}(\rho).$$

Instead of utilizing the linear programming method [11], we employ direct policy search methods to solve Problem (2.2). Direct methods are attractive for three reasons: (i) they allow us to directly optimize/monitor the value functions that we are interested in; (ii) they can deal with large state-action spaces via policy parametrization, e.g., neural nets; and (iii) they can utilize policy gradient estimates obtained via simulations of the policy.

Since Problem (2.1) is a nonconcave constrained maximization problem over the policy space $\Pi$, which is often infinite-dimensional, Problems (2.1) and (2.2) are challenging. In spite of these challenges, Problem (2.1) has nice properties in the policy space when it is strictly feasible. We adopt the standard Slater condition [29] and assume strict feasibility of Problem (2.1) throughout the chapter.

Assumption 1 (Slater condition). There exist $\xi > 0$ and $\bar\pi \in \Pi$ such that $V_g^{\bar\pi}(\rho) - b \ge \xi$.

The Slater condition is mild in practice because we usually have a priori knowledge of a strictly feasible policy, e.g., the minimal utility is achievable by a particular policy so that the constraint becomes loose. Let $\pi^\star$ denote an optimal solution to Problem (2.1), let $\lambda^\star$ be an optimal dual variable,

$$\lambda^\star \,\in\, \underset{\lambda \,\ge\, 0}{\text{argmin}} \ V_D^\lambda(\rho)$$

and let the set of all optimal dual variables be $\Lambda^\star$. We use the shorthand notation $V_r^\star(\rho) = V_r^{\pi^\star}(\rho)$ and $V_D^\star(\rho) = V_D^{\lambda^\star}(\rho)$ whenever it is clear from the context. We recall strong duality for constrained MDPs and prove boundedness of the optimal dual variable $\lambda^\star$.

Lemma 1 (Strong duality and boundedness of $\lambda^\star$). Let Assumption 1 hold. Then (i) $V_r^\star(\rho) = V_D^\star(\rho)$; (ii) $0 \le \lambda^\star \le \left(V_r^\star(\rho) - V_r^{\bar\pi}(\rho)\right)/\xi$.

Proof. See Appendix A.1.

Let the value function associated with Problem (2.1) be determined by

$$v(\tau) \,:=\, \underset{\pi \,\in\, \Pi}{\text{maximize}} \ \left\{\, V_r^\pi(\rho) \;\middle|\; V_g^\pi(\rho) \,\ge\, b + \tau \,\right\}.$$

Using the concavity of $v(\tau)$ (e.g., see [179, Proposition 1]), in Lemma 2 we establish a bound on the constraint violation; see Appendix A.2 for the proof.

Lemma 2 (Constraint violation). Let Assumption 1 hold. For any $C \ge 2\lambda^\star$, if there exist a policy $\pi \in \Pi$ and $\epsilon > 0$ such that $V_r^\star(\rho) - V_r^\pi(\rho) + C\,[\,b - V_g^\pi(\rho)\,]_+ \le \epsilon$, then $[\,b - V_g^\pi(\rho)\,]_+ \le 2\epsilon/C$, where $[x]_+ = \max(x, 0)$.

Proof. See Appendix A.2.

Aided by the above properties implied by the Slater condition, we target the max-min Problem (2.2) in the primal-dual domain.

2.2.3 Policy parametrization

Introduction of a set of parametrized policies $\{\pi_\theta \mid \theta \in \Theta\}$ brings Problem (2.1) into a constrained optimization problem over the finite-dimensional parameter space $\Theta$,

$$\underset{\theta \,\in\, \Theta}{\text{maximize}} \ \ V_r^{\pi_\theta}(\rho) \quad \text{subject to} \quad V_g^{\pi_\theta}(\rho) \,\ge\, b. \tag{2.3}$$

A parametric version of the max-min Problem (2.2) is given by

$$\underset{\theta \,\in\, \Theta}{\text{maximize}} \ \underset{\lambda \,\ge\, 0}{\text{minimize}} \ \ V_r^{\pi_\theta}(\rho) \,+\, \lambda\left(V_g^{\pi_\theta}(\rho) - b\right) \tag{2.4}$$

where $V_L^{\theta,\lambda}(\rho) := V_r^{\pi_\theta}(\rho) + \lambda\,(V_g^{\pi_\theta}(\rho) - b)$ is the associated Lagrangian and $\lambda \ge 0$ is the Lagrange multiplier. The dual function is determined by $V_D^\lambda(\rho) := \text{maximize}_{\theta \in \Theta}\ V_L^{\theta,\lambda}(\rho)$. The primal maximization Problem (2.3) is finite-dimensional but not concave, even in the absence of a constraint [8]. In Lemma 3 we prove that, in general, Problem (2.3) is not convex because it involves maximization of a nonconcave objective function over a nonconvex constraint set. The proof is provided in Appendix A.3 and it utilizes the example of a constrained MDP in Figure 2.1.

Lemma 3 (Lack of convexity). There exists a constrained MDP for which the objective function $V_r^{\pi_\theta}(s)$ in Problem (2.3) is not concave and the constraint set $\{\theta \in \Theta \mid V_g^{\pi_\theta}(s) \ge b\}$ is not convex.

Figure 2.1: An example of a constrained MDP for which the objective function $V_r^{\pi_\theta}(s)$ in Problem (2.3) is not concave and the constraint set $\{\theta \in \Theta \mid V_g^{\pi_\theta}(s) \ge b\}$ is not convex. The pair $(r, g)$ associated with each directed arrow represents the (reward, utility) received when an action at a certain state is taken. This example is utilized in the proof of Lemma 3.

In general, the Lagrangian $V_L^{\theta,\lambda}(\rho)$ in Problem (2.4) is convex in $\lambda$ but not concave in $\theta$. While many algorithms for solving max-min optimization problems, e.g., those proposed in [137, 176, 254], require extra assumptions on the max-min structure or only guarantee convergence to a stationary point, we exploit the problem structure and propose a new primal-dual method to compute a globally optimal solution to Problem (2.4). Before doing that, we first introduce several useful classes of policies.

Direct policy parametrization. A direct parametrization of a policy is the probability distribution itself,

$$\pi_\theta(a \mid s) \,=\, \theta_{s,a} \quad \text{for all } \theta \in \Delta(\mathcal{A})^{|\mathcal{S}|}$$

where $\theta_s \in \Delta(\mathcal{A})$ for any $s \in \mathcal{S}$, i.e., $\theta_{s,a} \ge 0$ and $\sum_{a \in \mathcal{A}} \theta_{s,a} = 1$. This policy class is complete since it directly represents any stochastic policy. Even though it is challenging to deal with from both theoretical and computational viewpoints [159, 8], it offers a sanity check for many policy search methods.

Softmax policy parametrization. This class of policies is parametrized by the softmax function,

$$\pi_\theta(a \mid s) \,=\, \frac{\exp(\theta_{s,a})}{\sum_{a' \in \mathcal{A}} \exp(\theta_{s,a'})} \tag{2.5}$$

for all $\theta \in \mathbb{R}^{|\mathcal{S}||\mathcal{A}|}$. The softmax policy can be used to represent any stochastic policy and its closure contains all stationary policies. It has been utilized to study convergence properties of many RL algorithms [32, 8, 159, 50, 117] and it offers several algorithmic advantages: (i) it equips the policy with a rich structure so that the natural policy gradient update works like the classical multiplicative weights update in the online learning literature (e.g., see [51]); (ii) it can be used to interpret the function approximation error [8].

Log-linear policy parametrization. A log-linear policy is given by

$$\pi_\theta(a \mid s) \,=\, \frac{\exp(\theta^\top \phi_{s,a})}{\sum_{a' \in \mathcal{A}} \exp(\theta^\top \phi_{s,a'})} \tag{2.6}$$

for all $\theta \in \mathbb{R}^d$, where $\phi_{s,a} \in \mathbb{R}^d$ is the feature map at a state-action pair $(s, a)$. The log-linear policy builds on the softmax policy by applying the softmax function to a set of linear functions in a given feature space. More importantly, it exactly characterizes linear function approximation via policy parametrization [8]; see [162, 13] for linear constrained MDPs.

General policy parametrization. A general class of stochastic policies is given by $\{\pi_\theta \mid \theta \in \Theta\}$ with $\Theta \subseteq \mathbb{R}^d$, without specifying the structure of $\pi_\theta$. The parameter space $\Theta$ has dimension $d$ and this policy class covers settings that utilize nonlinear function approximation, e.g., (deep) neural networks [139, 233]. When we choose $d \ll |\mathcal{S}||\mathcal{A}|$ in either the log-linear policy or a general nonlinear policy, the policy class has limited expressiveness and it may not contain all stochastic policies. Motivated by this observation, the theory that we develop in Section 2.5 establishes global convergence up to an error caused by the restricted policy class.

2.3 Natural policy gradient primal-dual method

In Section 2.3.1, we provide a brief summary of three basic algorithms that have been used to solve the constrained policy optimization Problem (2.3). In Section 2.3.2, we propose a natural policy gradient primal-dual method which extends the natural policy gradient method to constrained optimization problems.

2.3.1 Constrained policy optimization methods

We briefly summarize three basic algorithms that can be employed to solve the primal Problem (2.3). We assume that the value function and the policy gradient can be evaluated exactly for any given policy. We first introduce some useful definitions. The discounted visitation distribution $d_{s_0}^\pi$ of a policy $\pi$ and its expectation over the initial distribution $\rho$ are respectively given by

$$d_{s_0}^\pi(s) \,=\, (1-\gamma) \sum_{t=0}^{\infty} \gamma^t\, \mathrm{Pr}^\pi(s_t = s \mid s_0), \qquad d_\rho^\pi(s) \,=\, \mathbb{E}_{s_0 \sim \rho}\left[d_{s_0}^\pi(s)\right] \tag{2.7}$$

where $\mathrm{Pr}^\pi(s_t = s \mid s_0)$ is the probability of visiting state $s$ at time $t$ under the policy $\pi$ with initial state $s_0$. When the use of the parametrized policy $\pi_\theta$ is clear from the context, we use $V_r^{\theta}(\rho)$ to denote $V_r^{\pi_\theta}(\rho)$. When $\pi_\theta(\cdot \mid s)$ is differentiable and belongs to the probability simplex, i.e., $\pi_\theta \in \Delta(\mathcal{A})^{|\mathcal{S}|}$ for all $\theta$, the policy gradient of the Lagrangian (2.4) is determined by

$$\nabla_\theta V_L^{\theta,\lambda}(s_0) \,=\, \nabla_\theta V_r^{\theta}(s_0) \,+\, \lambda\, \nabla_\theta V_g^{\theta}(s_0) \,=\, \frac{1}{1-\gamma}\, \mathbb{E}_{s \sim d_{s_0}^{\pi_\theta}}\, \mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\!\left[A_L^{\theta,\lambda}(s, a)\, \nabla_\theta \log \pi_\theta(a \mid s)\right]$$

where $A_L^{\theta,\lambda}(s, a) = A_r^{\theta}(s, a) + \lambda\, A_g^{\theta}(s, a)$.

Dual method

When strong duality in Lemma 1 holds, it is convenient to work with the dual formulation of the primal Problem (2.3),

$$\underset{\lambda \,\ge\, 0}{\text{minimize}} \ \ V_D^\lambda(\rho). \tag{2.8}$$

While the dual function is convex regardless of whether the primal maximization problem is concave, it is often non-differentiable [30].
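To fix ideas before the algorithmic development continues, here is a minimal, self-contained sketch of a primal-dual loop of the kind studied in this chapter, run on a hypothetical two-state, two-action CMDP with the softmax parametrization (2.5): the policy parameters are moved by an ascent step driven by the Lagrangian advantage $A_L^{\theta,\lambda}$, and the multiplier $\lambda$ is updated by projected sub-gradient descent. The environment, the stepsizes, and the particular update $\theta \leftarrow \theta + \frac{\eta}{1-\gamma} A_L^{\theta,\lambda}$ used for the primal step are illustrative assumptions rather than the exact NPG-PD updates analyzed later in the chapter.

```python
import numpy as np

# Hypothetical CMDP with 2 states and 2 actions (same toy setup as above).
nS, nA, gamma, b = 2, 2, 0.9, 3.0
P = np.array([[[0.9, 0.1], [0.2, 0.8]],     # P[s, a, s']
              [[0.6, 0.4], [0.1, 0.9]]])
r = np.array([[1.0, 0.0], [0.2, 0.6]])
g = np.array([[0.0, 1.0], [0.8, 0.3]])
rho = np.array([0.5, 0.5])

def softmax_policy(theta):
    e = np.exp(theta - theta.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def q_values(pi, reward):
    """Exact Q^pi for a per-(s, a) reward table, via the Bellman evaluation equation."""
    P_pi = np.einsum('sab,sa->sb', P, pi)          # state-to-state kernel under pi
    r_pi = (pi * reward).sum(axis=1)               # expected one-step reward
    V = np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)
    return reward + gamma * P @ V                  # Q[s, a]

theta, lam = np.zeros((nS, nA)), 0.0
eta_theta, eta_lam = 1.0, 0.1                      # illustrative stepsizes

for _ in range(500):
    pi = softmax_policy(theta)
    Qr, Qg = q_values(pi, r), q_values(pi, g)
    QL = Qr + lam * Qg                             # state-action Lagrangian value
    AL = QL - (pi * QL).sum(axis=1, keepdims=True) # Lagrangian advantage A_L(s, a)
    # Primal step: softmax-parameter ascent driven by the Lagrangian advantage
    # (an assumed form of the natural-gradient step; see the lead-in above).
    theta += eta_theta / (1.0 - gamma) * AL
    # Dual step: projected sub-gradient descent on the multiplier.
    Vg_rho = rho @ (pi * Qg).sum(axis=1)
    lam = max(0.0, lam - eta_lam * (Vg_rho - b))

pi = softmax_policy(theta)
Vr_rho = rho @ (pi * q_values(pi, r)).sum(axis=1)
Vg_rho = rho @ (pi * q_values(pi, g)).sum(axis=1)
print(f"V_r(rho) = {Vr_rho:.2f}, V_g(rho) = {Vg_rho:.2f} (target >= {b}), lambda = {lam:.2f}")
```

In the sample-based algorithms of Section 2.6, the exact policy evaluation performed by q_values above would be replaced by estimates obtained from simulated trajectories.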
Let $\pi^\star$ denote an optimal solution to Problem (2.1), let $\lambda^\star$ be an optimal dual variable,

$\lambda^\star \,\in\, \underset{\lambda \ge 0}{\text{argmin}} \;\, V_D^\lambda(\rho)$

and let the set of all optimal dual variables be $\Lambda^\star$. We use the shorthand notation $V_r^\star(\rho) = V_r^{\pi^\star}(\rho)$ and $V_D^\star(\rho) = V_D^{\lambda^\star}(\rho)$ whenever it is clear from the context. We recall strong duality for constrained MDPs and we prove boundedness of the optimal dual variable $\lambda^\star$.

Lemma 1 (Strong duality and boundedness of $\lambda^\star$). Let Assumption 1 hold. Then (i) $V_r^\star(\rho) = V_D^\star(\rho)$; (ii) $0 \le \lambda^\star \le (V_r^\star(\rho) - V_r^{\bar{\pi}}(\rho))/\xi$.

Proof. See Appendix A.1.

Let the value function associated with Problem (2.1) be determined by

$v(\tau) \,:=\, \underset{\pi \in \Pi}{\text{maximize}} \;\, \{\, V_r^\pi(\rho) \;|\; V_g^\pi(\rho) \ge b + \tau \,\}.$

Using the concavity of $v(\tau)$ (e.g., see [179, Proposition 1]), in Lemma 2 we establish a bound on the constraint violation; see Appendix A.2 for the proof.

Lemma 2 (Constraint violation). Let Assumption 1 hold. For any $C \ge 2\lambda^\star$, if there exist a policy $\pi \in \Pi$ and $\delta > 0$ such that $V_r^\star(\rho) - V_r^\pi(\rho) + C\,[\, b - V_g^\pi(\rho) \,]_+ \le \delta$, then $[\, b - V_g^\pi(\rho) \,]_+ \le 2\delta/C$, where $[x]_+ = \max(x, 0)$.

Proof. See Appendix A.2.

Aided by the above properties implied by the Slater condition, we target the max-min Problem (2.2) in a primal-dual domain.

2.2.3 Policy parametrization

Introduction of a set of parametrized policies $\{\pi_\theta \,|\, \theta \in \Theta\}$ brings Problem (2.1) into a constrained optimization problem over the finite-dimensional parameter space $\Theta$,

$\underset{\theta \in \Theta}{\text{maximize}} \;\; V_r^{\pi_\theta}(\rho) \quad \text{subject to} \quad V_g^{\pi_\theta}(\rho) \,\ge\, b.$    (2.3)

A parametric version of max-min Problem (2.2) is given by

$\underset{\theta \in \Theta}{\text{maximize}} \;\; \underset{\lambda \ge 0}{\text{minimize}} \;\; V_r^{\pi_\theta}(\rho) \,+\, \lambda \left( V_g^{\pi_\theta}(\rho) \,-\, b \right)$    (2.4)

where $V_L^{\theta,\lambda}(\rho) := V_r^{\pi_\theta}(\rho) + \lambda (V_g^{\pi_\theta}(\rho) - b)$ is the associated Lagrangian and $\lambda$ is the Lagrange multiplier. The dual function is determined by $V_D^\lambda(\rho) := \text{maximize}_{\theta \in \Theta}\, V_L^{\theta,\lambda}(\rho)$. The primal maximization problem (2.3) is finite-dimensional but it is not concave even in the absence of a constraint [8]. In Lemma 3 we prove that, in general, Problem (2.3) is not convex because it involves maximization of a nonconcave objective function over a nonconvex constraint set. The proof is provided in Appendix A.3 and it utilizes the example of a constrained MDP in Figure 2.1.

Lemma 3 (Lack of convexity). There exists a constrained MDP for which the objective function $V_r^{\pi_\theta}(s)$ in Problem (2.3) is not concave and the constraint set $\{\theta \in \Theta \,|\, V_g^{\pi_\theta}(s) \ge b\}$ is not convex.

[Figure 2.1: An example of a constrained MDP, with states $s_1, \ldots, s_5$, for which the objective function $V_r^{\pi_\theta}(s)$ in Problem (2.3) is not concave and the constraint set $\{\theta \in \Theta \,|\, V_g^{\pi_\theta}(s) \ge b\}$ is not convex. The pair $(r, g)$ associated with each directed arrow represents the (reward, utility) received when an action at a certain state is taken. This example is utilized in the proof of Lemma 3.]

In general, the Lagrangian $V_L^{\theta,\lambda}(\rho)$ in Problem (2.4) is convex in $\lambda$ but not concave in $\theta$. While many algorithms for solving max-min optimization problems, e.g., those proposed in [137, 176, 254], require extra assumptions on the max-min structure or only guarantee convergence to a stationary point, we exploit problem structure and propose a new primal-dual method that computes a globally optimal solution to Problem (2.4). Before doing that, we first introduce several useful classes of policies.

Direct policy parametrization. A direct parametrization of a policy is the probability distribution

$\pi_\theta(a \,|\, s) \,=\, \theta_{s,a} \quad \text{for all} \;\; \theta \in \Delta(\mathcal{A})^{|\mathcal{S}|}$

where $\theta_s \in \Delta(\mathcal{A})$ for any $s \in \mathcal{S}$, i.e., $\theta_{s,a} \ge 0$ and $\sum_{a \in \mathcal{A}} \theta_{s,a} = 1$. This policy class is complete since it directly represents any stochastic policy. Even though it is challenging to deal with from both theoretical and computational viewpoints [159, 8], it offers a sanity check for many policy search methods.

Softmax policy parametrization.
This class of policies is parametrized by the softmax function, (ajs) = exp( s;a ) P a 0 2A exp( s;a 0) (2.5) for all2R jSjjAj . The softmax policy can be used to represent any stochastic policy and its closure contains all stationary policies. It has been utilized to study convergence properties of many RL algo- rithms [32, 8, 159, 50, 117] and it oers several algorithmic advantages: (i) it equips the policy with a rich structure so that the natural policy gradient update works like the classical multiplica- tive weights update in the online learning literature (e.g., see [51]); (ii) it can be used to interpret the function approximation error [8]. Log-linearpolicyparametrization. A log-linear policy is given by (ajs) = exp( > s;a ) P a 0 2A exp( > s;a 0) (2.6) for all2R d , where s;a 2R d is the feature map at a state-action pair (s;a). The log-linear policy builds on the softmax policy by applying the softmax function to a set of linear functions in a given feature space. More importantly, it exactly characterizes the linear function approximation via policy parametrization [8]; see [162, 13] for linear constrained MDPs. Generalpolicyparametrization. A general class of stochastic policies is given byf j2 g with R d without specifying the structure of . The parameter space has dimensiond and 26 this policy class covers a setting that utilizes nonlinear function approximation, e.g., (deep) neural networks [139, 233]. When we choosedjSjjAj in either the log-linear policy or the general nonlinear policy, the policy class has a limited expressiveness and it may not contain all stochastic policies. Motivated by this observation, the theory that we develop in Section 2.5 establishes global convergence up to error caused by the restricted policy class. 2.3 Naturalpolicygradientprimal-dualmethod In Section 2.3.1, we provide a brief summary of three basic algorithms that have been used to solve constrained policy optimization problem (2.3). In Section 2.3.2, we propose a natural policy gradient primal-dual method which represents an extension of natural policy gradient method to constrained optimization problems. 2.3.1 Constrainedpolicyoptimizationmethods We briey summarize three basic algorithms that can be employed to solve the primal prob- lem (2.3). We assume that the value function and the policy gradient can be evaluated exactly for any given policy. We rst introduce some useful denitions. The discounted visitation distribution d s 0 of a policy and its expectation over the initial distribution are respectively given by, d s 0 (s) = (1 ) 1 X t = 0 t Pr (s t =sjs 0 ) d (s) = E s 0 d s 0 (s) (2.7) 27 where Pr (s t = sjs 0 ) is the probability of visiting states at timet under the policy with an initial states 0 . When the use of parametrized policy is clear from the context, we useV r () to denoteV r (). When (js) is dierentiable and when it belongs to the probability simplex, i.e., 2 (A) jSj for all, the policy gradient of the Lagrangian (2.4) is determined by, r V ; L (s 0 ) = r V r (s 0 ) + r V g (s 0 ) = 1 1 E s 0 d s 0 E a (js) h A ; L (s;a)r log (ajs) i whereA ; L (s;a) =A r (s;a) +A g (s;a). Dualmethod When strong duality in Lemma 1 holds, it is convenient to work with the dual formulation of the primal problem (2.3), minimize 0 V D (): (2.8) While the dual function is convex regardless of concavity of the primal maximization problem, it is often non-dierentiable [30]. 
Thus, a projected dual subgradient descent can be used to solve the dual problem, (t+1) = P + (t) @ V (t) D () whereP + () is the projection to the non-negative real axis,> 0 is the stepsize, and @ V (t) D () := @ V D () = (t) is the subgradient of the dual function evaluated at = (t) . 28 The dual method works in the space of dual variables and it requires ecient evaluation of the subgradient of the dual function. We note that computing the dual function V D () for a given = (t) in each step amounts to a standard unconstrained RL problem [179]. In spite of global convergence guarantees for several policy search methods in the tabular setting, it is often challenging to obtain the dual function and/or to compute its sub-gradient, e.g., when the problem dimension is high and/or when the state space is continuous. Although the primal problem can be approximated using the rst-order Taylor series expansion [7, 259], inverting Hessian matrices becomes the primary computational burden and it is costly to implement the dual method. Primalmethod In the primal method, a policy search strategy works directly on the primal problem (2.3) by seeking an optimal policy in a feasible region. The key challenge is to ensure the feasibility of the next iterate in the search direction, which is similar to the use of the primal method in nonlinear programming [145]. An intuitive approach is to check the feasibility of each iterate and determine whether the constraint is active [251]. If the iterate is feasible or the constraint is inactive, we move towards maximizing the single objective function; otherwise, we look for a feasible direction. For the softmax policy parametrization (2.5), this can be accomplished using a simple rst-order gradient method, (t+1) s;a = (t) s;a + G (t) s;a () G (t) s;a () := 8 > > > > < > > > > : 1 1 A (t) r (s;a); when V (t) g () < b b 1 1 A (t) g (s;a); when V (t) g () b b (2.9) 29 where we use theA (t) r (s;a) andA (t) g (s;a) to denoteA (t) r (s;a) andA (t) g (s;a), respectively,G (t) s;a () is the gradient ascent direction determined by the scaled version of advantage functions, and b > 0 is the relaxation parameter for the constraint V g () b. When the iterate violates the relaxed constraint,V g () b b , it maximizes the constraint function to gain feasibility. More reliable evaluation of the feasibility often demands a more tractable characterization of the constraint, e.g., by utilizing Lyapunov function [54], Gaussian process modeling [211], backward value function [191], and logarithmic penalty function [143]. Hence, the primal method oers the adaptability of adjusting a policy to satisfy the constraint, which is desirable in safe train- ing applications. However, global convergence theory is still lacking and recent progress [251] requires a careful relaxation of the constraint. Primal-dualmethod The primal-dual method simultaneously updates primal and dual variables [16]. A basic primal- dual method with the direct policy parametrization (ajs) = s;a performs the following Policy Gradient Primal-Dual (PG-PD) update [1], (t+1) = P (t) + 1 r V (t) ; (t) L () (t+1) = P (t) 2 V (t) g ()b (2.10) wherer V (t) ; (t) L () := r V (t) r () + (t) r V (t) g (), 1 > 0 and 2 > 0 are the stepsizes, P is the projection onto probability simplex := (A) jSj , andP is the projection that will be specied later. 
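The following self-contained sketch implements one form of PG-PD update (2.10) on a small synthetic tabular CMDP with the direct parametrization, using exact occupancy measures and Q-functions computed from the model. The Euclidean simplex projection, the dual projection interval, the stepsizes, and all problem data are assumptions made only for this illustration and are not taken from the dissertation.

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma, b = 3, 2, 0.9, 4.0
P = rng.dirichlet(np.ones(nS), size=(nS, nA))        # synthetic transition kernel P(s'|s,a)
r, g = rng.uniform(size=(nS, nA)), rng.uniform(size=(nS, nA))
rho = np.ones(nS) / nS
theta = np.ones((nS, nA)) / nA                       # direct parametrization: theta[s] lies in the simplex
lam, eta1, eta2, lam_max = 0.0, 0.1, 0.05, 20.0      # dual variable, stepsizes, assumed dual interval

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    k = np.arange(1, v.size + 1)
    idx = k[u + (1.0 - css) / k > 0][-1]
    tau = (1.0 - css[idx - 1]) / idx
    return np.maximum(v + tau, 0.0)

def values_and_occupancy(pi, f):
    P_pi = np.einsum("sa,sap->sp", pi, P)
    V = np.linalg.solve(np.eye(nS) - gamma * P_pi, np.einsum("sa,sa->s", pi, f))
    Q = f + gamma * P @ V                             # Q(s,a) = f(s,a) + gamma * E[V(s')]
    d = (1 - gamma) * np.linalg.solve(np.eye(nS) - gamma * P_pi.T, rho)  # discounted visitation
    return V, Q, d

for t in range(200):
    V_r, Q_r, d = values_and_occupancy(theta, r)
    V_g, Q_g, _ = values_and_occupancy(theta, g)
    grad = d[:, None] * (Q_r + lam * Q_g) / (1 - gamma)   # policy gradient of the Lagrangian
    theta = np.vstack([project_simplex(theta[s] + eta1 * grad[s]) for s in range(nS)])
    lam = np.clip(lam - eta2 * (rho @ V_g - b), 0.0, lam_max)

V_r, _, _ = values_and_occupancy(theta, r)
V_g, _, _ = values_and_occupancy(theta, g)
print("V_r(rho) =", rho @ V_r, " V_g(rho) =", rho @ V_g, " lambda =", lam)
```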
For the max-min formulation (2.4), PG-PD method (2.10) directly performs projected gradient ascent in the policy parameter and descent in the dual variable, both over 30 the Lagrangian V ; L (). The primal-dual method overcomes disadvantages of the primal and dual methods either by relaxing the precise calculation of the subgradient of the dual function or by changing the descent direction via tuning of the dual variable. While this simple method provides a foundation for solving constrained MDPs [56, 220], lack of convexity in (2.4) makes it challenging to establish convergence theory, which is the primary objective of this chapter. We rst leverage structure of constrained policy optimization problem (2.3) to provide a pos- itive result in terms of optimality gap and constraint violation. Theorem4(Restrictiveconvergence: directpolicy) LetAssumption1holdwithapolicyclass f =j2 g and let = [ 0; 2= ((1 )) ],> 0, (0) = 0, and (0) be such thatV (0) r () V ? r (). If we choose 1 = O(1) and 2 = O(1= p T ), then the iterates (t) generated by PG-PD method (2.10) satisfy (Optimality gap) 1 T T1 X t = 0 V ? r ()V (t) r () C 1 jAjjSj (1 ) 6 T 1=4 d ? = 2 1 (Constraint violation) " 1 T T1 X t = 0 bV (t) g () # + C 2 jAjjSj (1 ) 6 T 1=4 d ? = 2 1 whereC 1 andC 2 are absolute constants that are independent ofT. For the tabular constrained MDPs with direct policy parametrization, Theorem 4 guarantees that, on average, the optimality gap V ? r ()V (t) r () and the constraint violation bV (t) g () decay to zero with the sublinear rate 1=T 1=4 . However, this rate explicitly depends on the sizes of state/action spacesjSj andjAj, and the distribution shiftkd ? =k 1 that species the exploration factor. A careful initialization (0) that satisesV (0) r ()V ? r () is also required. The proof of Theorem 4 is provided in Appendix A.4 and it exploits the problem structure that casts the primal problem (2.3) as a linear program in the occupancy measure [11] and applies 31 the convex optimization analysis. This method is not well-suited for large-scale problems, and projections onto the high-dimensional probability simplex are not desirable in practice. We next introduce a natural policy gradient primal-dual method to overcome these challenges and provide stronger convergence guarantees. 2.3.2 Naturalpolicygradientprimal-dualmethod The Fisher information matrix induced by , F () := E sd E a (js) h r log (ajs) (r log (ajs)) > i is used in the update of the primal variable in our primal-dual algorithm. The expectations are taken over the randomness of the state-action trajectory induced by and Natural Policy Gra- dient Primal-Dual (NPG-PD) method for solving Problem (2.4) is given by, (t+1) = (t) + 1 F y ( (t) )r V (t) ; (t) L () (t+1) = P (t) 2 V (t) g ()b (2.11) wherey denotes the Moore-Penrose inverse of a given matrix,P () denotes the projection to the interval that will be specied later, and ( 1 ; 2 ) are constant positive stepsizes in the updates of primal and dual variables. The primal update (t+1) is obtained using a pre-conditioned gradient ascent via the natural policy gradient F y ( (t) )r V (t) L () and it represents the policy gradient of the Lagrangian V (t) L () in the geometry induced by the Fisher information matrix F ( (t) ). On the other hand, the dual update (t+1) is obtained using a projected sub-gradient descent by 32 collecting the constraint violationbV (t) g (); where, for brevity, we useV (t) L () andV (t) g () to denoteV (t) ; (t) L () andV (t) g (), respectively. 
In Section 2.4, we establish global convergence of NPG-PD method (2.11) under the softmax policy parametrization. In Section 2.5, we examine the general policy parametrization and, in Section 2.6, we analyze sample complexity of two sample-based implementations of NPG-PD method (2.11). Remark2 Theperformancedierencelemma[113,8],whichquantiesthedierencebetweentwo state value functions,V (s 0 ) andV 0 (s 0 ), for any two policies and 0 and any states 0 , V (s 0 ) V 0 (s 0 ) = 1 1 E sd s 0 ;a(js) h A 0 (s;a) i is utilized in our analysis, where the symbol denotesr org. 2.4 Tabularsoftmaxpolicycase We rst examine NPG-PD method (2.11) under softmax policy parametrization (2.5). Strong du- ality in Lemma 1 holds on the closure of the softmax policy class, because of completeness of the softmax policy class. Even though maximization problem (2.3) is not concave, we establish global convergence of our algorithm with dimension-independent convergence rates. We rst exploit the softmax policy structure to show that the primal update in (2.11) can be expressed in a more compact form; see Appendix A.5 for the proof. 33 Lemma5(PrimalupdateasMWU) Let := [ 0; 2=((1 )) ] and letA (t) L (s;a) :=A (t) r (s;a) + (t) A (t) g (s;a). Under softmax parametrized policy (2.5), NPG-PD algorithm (2.11) can be brought to the following form, (t+1) s;a = (t) s;a + 1 1 A (t) L (s;a) (t+1) = P (t) 2 V (t) g ()b : (2.12a) Furthermore, the primal update in (2.12a) can be equivalently expressed as (t+1) (ajs) = (t) (ajs) exp 1 1 A (t) L (s;a) Z (t) (s) (2.12b) whereZ (t) (s) := P a2A (t) (ajs) exp 1 1 A (t) L (s;a) . The primal updates in (2.12a) do not depend on the state distributiond (t) that appears in NPG- PD algorithm (2.11) through the policy gradient. This is because of the Moore-Penrose inverse of the Fisher information matrix in (2.11). Furthermore, policy update (2.12b) is given by the multiplicative weights update (MWU) which is commonly used in online linear optimization [51]. In contrast to the online linear optimization, an advantage function appears in the MWU policy update at each iteration in (2.12b). In Theorem 6, we establish global convergence of NPG-PD algorithm (2.12a) with respect to both the optimality gapV ? r ()V (t) r () and the constraint violationbV (t) g (). Even though we set (0) s;a = 0 and (0) = 0 in the proof of Theorem 6 in Section 2.4.1, global convergence can be established for arbitrary initial conditions. 34 Theorem6(Dimension-freeglobalconvergence: softmaxpolicy) LetusxT > 0and2 S and let Assumption 1 hold for > 0. If we choose 1 = 2 logjAj and 2 = 2(1 )= p T, then the iterates (t) generated by algorithm (2.12) satisfy, (Optimality gap) 1 T T1 X t = 0 V ? r () V (t) r () 7 (1 ) 2 1 p T (Constraint violation) " 1 T T1 X t = 0 b V (t) g () # + 2= + 4 (1 ) 2 1 p T : Theorem 6 demonstrates that, on average, the reward value function converges to its globally optimal value and that the constraint violation decays to zero. In other words, for a desired accuracy, it takesO(1= 2 ) iterations to compute the solution which is away from the globally optimal one (with respect to both the optimality gap and the constraint violation). We note that the required number of iterations only depends on the desired accuracy and is independent of the sizes of the state and action spaces. 
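As a complement to the guarantees above, the following self-contained sketch runs the softmax NPG-PD update (2.12), i.e., the multiplicative-weights primal step combined with a projected dual step, on a synthetic tabular CMDP with exact advantage evaluation. The problem data, the assumed Slater slack used to set the dual interval, and the number of iterations are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma, b = 3, 2, 0.9, 4.0
P = rng.dirichlet(np.ones(nS), size=(nS, nA))        # synthetic transition kernel
r, g = rng.uniform(size=(nS, nA)), rng.uniform(size=(nS, nA))
rho = np.ones(nS) / nS

T = 200
xi = 0.5                                  # assumed Slater slack; sets the dual interval
lam_max = 2.0 / ((1 - gamma) * xi)
eta1 = 2.0 * np.log(nA)                   # primal stepsize from Theorem 6
eta2 = 2.0 * (1 - gamma) / np.sqrt(T)     # dual stepsize from Theorem 6

def eval_policy(pi, f):
    P_pi = np.einsum("sa,sap->sp", pi, P)
    V = np.linalg.solve(np.eye(nS) - gamma * P_pi, np.einsum("sa,sa->s", pi, f))
    Q = f + gamma * P @ V
    return V, Q - V[:, None]              # value and advantage A(s,a) = Q(s,a) - V(s)

pi, lam = np.ones((nS, nA)) / nA, 0.0
for t in range(T):
    V_r, A_r = eval_policy(pi, r)
    V_g, A_g = eval_policy(pi, g)
    A_L = A_r + lam * A_g                                 # advantage of the Lagrangian
    logits = np.log(pi) + eta1 / (1 - gamma) * A_L        # multiplicative-weights primal step (2.12b)
    pi = np.exp(logits - logits.max(axis=1, keepdims=True))
    pi /= pi.sum(axis=1, keepdims=True)
    lam = np.clip(lam - eta2 * (rho @ V_g - b), 0.0, lam_max)   # projected dual step

print("V_r(rho) =", rho @ V_r, " V_g(rho) =", rho @ V_g, " lambda =", lam)
```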
Although maximization problem (2.3) is not concave, our rate (1= p T; 1= p T ) for optimality/constraint violation gap outperforms the classical one (1= p T; 1=T 3=4 ) [149] and it matches the achievable rate for solving online convex minimization problems with convex constraint sets [263]. Moreover, in contrast to the bounds established for PG-PD algorithm (2.10) in Theorem 4, the bounds in Theorem 6 for NPG-PD algorithm (2.11) under softmax policy parameterization do not depend on the initial distribution. As shown in Lemma 7 in Section 2.4.1, the reward and utility value functions are coupled and the natural policy gradient method in the unconstrained setting does not provide monotonic improvement to either of them [8, Section 5.3]. To address this challenge, we introduce a new line of analysis. To bound the optimality gap via a drift analysis of the dual update we rst establish the bounded average performance in Lemma 8 in Section 2.4.1. Furthermore, instead of using 35 methods from constrained convex optimization [149, 263, 242, 265], which either require extra assumptions or have slow convergence rate, under strong duality we establish that the constraint violation for nonconvex Problem (2.3) converges with the same rate as the optimality gap. 2.4.1 Non-asymptoticconvergenceanalysis We rst show that the policy improvement is not monotonic in either the reward value function or the utility value function. Lemma7(Non-monotonicimprovement) For any distribution of the initial state , iterates ( (t) ; (t) ) of algorithm (2.12) satisfy V (t+1) r () V (t) r () + (t) V (t+1) g () V (t) g () 1 1 E s logZ (t) (s) 0: (2.13) Proof. Letd (t+1) := d (t+1) . The performance dierence lemma in conjunction with the multi- plicative weights update in (2.12b) yield, V (t+1) r () V (t) r () = 1 1 E sd (t+1) " X a2A (t+1) (ajs)A (t) r (s;a) # = 1 1 E sd (t+1) " X a2A (t+1) (ajs) log (t+1) (ajs) (t) (ajs) Z (t) (s) # (t) 1 E sd (t+1) " X a2A (t+1) (ajs)A (t) g (s;a) # = 1 1 E sd (t+1) D KL (t+1) (ajs)k (t) (ajs) + 1 1 E sd (t+1) logZ (t) (s) (t) 1 E sd (t+1) " X a2A (t+1) (ajs)A (t) g (s;a) # 36 where the last equality follows from the denition of the Kullback-Leibler divergence or relative entropy between distributionsp andq,D KL (pkq) :=E xp log(p(x)=q(x)). Furthermore, 1 1 E sd (t+1) D KL (t+1) (ajs)k (t) (ajs) + 1 1 E sd (t+1) logZ (t) (s) (t) 1 E sd (t+1) " X a2A (t+1) (ajs)A (t) g (s;a) # (a) 1 1 E sd (t+1) logZ (t) (s) (t) 1 E sd (t+1) " X a2A (t+1) (ajs)A (t) g (s;a) # (b) = 1 1 E sd (t+1) logZ (t) (s) (t) V (t+1) g () V (t) g () is a consequence of the performance dierence lemma, where we drop a nonnegative term in (a) and (b). The rst inequality in (2.13) follows from a componentwise inequalityd (t+1) (1 ), which is obtained using (2.7). Now we prove that logZ (t) (s) 0. From the denition ofZ (t) (s) we have logZ (t) (s) = log X a2A (t) (ajs) exp 1 1 A (t) r (s;a) + (t) A (t) g (s;a) ! (a) X a2A (t) (ajs) log exp 1 1 A (t) r (s;a) + (t) A (t) g (s;a) = 1 1 X a2A (t) (ajs) A (t) r (s;a) + (t) A (t) g (s;a) = 1 1 X a2A (t) (ajs)A (t) r (s;a) + 1 1 (t) X a2A (t) (ajs)A (t) g (s;a) (b) = 0 37 where in (a) we apply the Jensen’s inequality to the concave function log(x). On the other hand, the last equality is due to that X a2A (t) (ajs)A (t) r (s;a) = X a2A (t) (ajs) Q (t) r (s;a)V (t) r (s) = 0 X a2A (t) (ajs)A (t) g (s;a) = 0 which follow from the denitions ofA (t) r (s;a) andA (t) g (s;a). 
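The two identities used at the end of this proof, namely that the advantage averages to zero under the current policy and that $\log Z^{(t)}(s) \ge 0$ by Jensen's inequality, can be checked numerically. The short snippet below does so at a single state with arbitrary synthetic numbers; the action count, stepsize, and surrogate Q-values are assumptions made only for the check.

```python
import numpy as np

rng = np.random.default_rng(1)
nA, gamma, eta = 4, 0.9, 0.5           # assumed action count, discount, and stepsize
pi = rng.dirichlet(np.ones(nA))        # pi^(t)(.|s) at a fixed state s
Q = rng.uniform(size=nA)               # stand-in for Q_r + lambda * Q_g at state s
A = Q - pi @ Q                         # advantage: Q(s, a) - V(s)

print("sum_a pi(a|s) A(s,a) =", pi @ A)            # zero up to round-off
Z = pi @ np.exp(eta / (1 - gamma) * A)             # normalizer Z^(t)(s) from (2.12b)
print("log Z(s) =", np.log(Z), " (nonnegative)")   # Jensen's inequality gives log Z >= 0
```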
We next compare the value functions of policy iterates generated by algorithm (2.12) with the ones that result from the use of optimal policy. Lemma8(Boundedaverageperformance) Let Assumption 1 hold and let us x T > 0 and 2 S . Then the iterates ( (t) ; (t) ) generated by algorithm (2.12) satisfy 1 T T1 X t = 0 V ? r ()V (t) r () + (t) V ? g ()V (t) g () logjAj 1 T + 1 (1 ) 2 T + 2 2 (1 ) 3 : (2.14) Proof. Letd ? :=d ? . The performance dierence lemma in conjunction with the multiplicative weights update in (2.12b) yield, V ? r () V (t) r () = 1 1 E sd ? " X a2A ? (ajs)A (t) r (s;a) # = 1 1 E sd ? " X a2A ? (ajs) log (t+1) (ajs) (t) (ajs) Z (t) (s) # (t) 1 E sd ? " X a2A ? (ajs)A (t) g (s;a) # : 38 Application of the denition of the Kullback–Leibler divergence or relative entropy between dis- tributionsp andq, D KL (pkq) := E xp log(p(x)=q(x)), and the performance dierence lemma again yield, V ? r () V (t) r () = 1 1 E sd ? D KL ? (ajs)k (t) (ajs) D KL ? (ajs)k (t+1) (ajs) + 1 1 E sd ? logZ (t) (s) (t) 1 E sd ? " X a2A ? (ajs)A (t) g (s;a) # = 1 1 E sd ? D KL ? (ajs)k (t) (ajs) D KL ? (ajs)k (t+1) (ajs) + 1 1 E sd ? logZ (t) (s) (t) V ? g () V (t) g () : (2.15) On the other hand, the rst inequality in (2.13) with =d ? becomes V (t+1) r (d ? ) V (t) r (d ? ) + (t) V (t+1) g (d ? ) V (t) g (d ? ) 1 1 E sd ? logZ (t) (s): (2.16) Hence, application of (2.16) to the average of (2.15) overt = 0; 1;:::;T 1 leads to, 1 T T1 X t = 0 V ? r () V (t) r () = 1 1 T T1 X t = 0 E sd ? D KL ? (ajs)k (t) (ajs) D KL ? (ajs)k (t+1) (ajs) + 1 1 T T1 X t = 0 E sd ? logZ (t) (s) 1 T T1 X t = 0 (t) V ? g () V (t) g () 1 1 T T1 X t = 0 E sd ? D KL ? (ajs)k (t) (ajs) D KL ? (ajs)k (t+1) (ajs) + 1 (1 )T T1 X t = 0 V (t+1) r (d ? ) V (t) r (d ? ) + 1 (1 )T T1 X t = 0 (t) V (t+1) g (d ? ) V (t) g (d ? ) 1 T T1 X t = 0 (t) V ? g () V (t) g () : (2.17) 39 From the dual update in (2.12a) we have 1 T T1 X t = 0 (t) V (t+1) g () V (t) g () = 1 T T1 X t = 0 (t+1) V (t+1) g () (t) V (t) g () + 1 T T1 X t = 0 (t) (t+1) V (t+1) g () (a) 1 T (T ) V (T ) g () + 1 T T1 X t = 0 (t) (t+1) V (t+1) g () (b) 2 2 (1 ) 2 (2.18) where we take a telescoping sum for the rst sum in (a) and drop a non-positive term, and in (b) we utilizej (T ) j 2 T=(1 ) andj (t) (t+1) j 2 =(1 ), which follows from the dual update in (2.12a), the non-expansiveness of projectionP , and boundedness of the value function V (t) g () 1=(1 ). Application of (2.18) with =d ? and the use of telescoping sum to (2.17) yields, 1 T T1 X t = 0 V ? r () V (t) r () 1 1 T E sd ?D KL ? (ajs)k (0) (ajs) + 1 (1 )T V (T ) r (d ? ) + 2 2 (1 ) 3 1 T T1 X t = 0 (t) V ? g () V (t) g () : Finally, we useD KL (pkq) logjAj forp2 (A) andq = Unif A ,V (T ) r (d ? ) 1=(1 ), and V ? g ()b to complete the proof. Proof. [Proof of Theorem 6] 40 Boundingtheoptimalitygap. From the dual update in (2.12a) we have 0 (T ) 2 = T1 X t = 0 ( (t+1) ) 2 ( (t) ) 2 = T1 X t = 0 P ( (t) 2 (V (t) g ()b) ) 2 ( (t) ) 2 (a) T1 X t = 0 (t) 2 (V (t) g ()b) 2 ( (t) ) 2 = 2 2 T1 X t = 0 (t) bV (t) g () + 2 2 T1 X t = 0 V (t) g ()b 2 (b) 2 2 T1 X t = 0 (t) V ? g ()V (t) g () + 2 2 T (1 ) 2 (2.19a) where (a) because of the projectionP , (b) is because of the feasibility of the optimal policy ? : V ? g ()b, andjV (t) g ()bj 1=(1 ). Hence, 1 T T1 X t = 0 (t) V ? g ()V (t) g () 2 2(1 ) 2 : (2.19b) To obtain the optimality gap bound, we now substitute (2.19b) into (2.14), applyD KL (pkq) logjAj forp2 (A) andq = Unif A , and take 1 = 2 logjAj and 2 = 2(1 )= p T . Bounding the constraint violation. 
For any 2 0; 2=((1 )) , from the dual update in (2.12a) we have j (t+1) j 2 (a) (t) 2 V (t) g ()b 2 = (t) 2 2 2 V (t) g ()b (t) + 2 2 V (t) g ()b 2 (b) (t) 2 2 2 V (t) g ()b (t) + 2 2 (1 ) 2 41 where (a) is because of the non-expansiveness of projectionP and (b) is due to (V (t) g ()b) 2 1=(1 ) 2 . Averaging the above inequality overt = 0;:::;T 1 yields 0 1 T j (T ) j 2 1 T j (0) j 2 2 2 T T1 X t = 0 V (t) g ()b (t) + 2 2 (1 ) 2 ; which implies, 1 T T1 X t = 0 V (t) g ()b (t) 1 2 2 T (0) 2 + 2 2(1 ) 2 : (2.20) We now add (2.20) to (2.14) on both sides of the inequality, and utilizeV ? g ()b, 1 T T1 X t = 0 V ? r () V (t) r () + T T1 X t = 0 b V (t) g () logjAj 1 T + 1 (1 ) 2 T + 2 2 (1 ) 3 + 1 2 2 T (0) 2 + 2 2(1 ) 2 : Taking = 2 (1 ) when P T1 t = 0 bV (t) g () 0 and = 0 otherwise, we obtain V ? r () 1 T T1 X t = 0 V (t) r () + 2 (1 ) " b 1 T T1 X t = 0 V (t) g () # + logjAj 1 T + 1 (1 ) 2 T + 2 2 (1 ) 3 + 2 2 (1 ) 2 2 T + 2 2(1 ) 2 : Note that bothV (t) r () andV (t) g () can be expressed as linear functions in the same occupancy measure [11, Chapter 10] that is induced by policy (t) and transitionP(s 0 js;a). The convexity of the set of occupancy measures shows that the average ofT occupancy measures is an occupancy 42 measure that produces a policy 0 with valueV 0 r andV 0 g . Hence, there exists a policy 0 such thatV 0 r () = 1 T P T1 t = 0 V (t) r () andV 0 g () = 1 T P T1 t = 0 V (t) g (). Thus, V ? r () V 0 r () + 2 (1 ) h b V 0 g () i + logjAj 1 T + 1 (1 ) 2 T + 2 2 (1 ) 3 + 2 2 (1 ) 2 2 T + 2 2(1 ) 2 : Application of Lemma 2 with 2=((1 )) 2 ? yields h b V 0 g () i + logjAj 1 T + (1 )T + 2 2 (1 ) 2 + 2 2 (1 )T + 2 2(1 ) which leads to our constraint violation bound if we further utilize 1 T P T1 t = 0 bV (t) g () = b V 0 g (), 1 = 2 logjAj, and 2 = 2(1 )= p T . 2.5 Functionapproximationcase Let us consider a general form of NPG-PD algorithm (2.11), (t+1) = (t) + 1 1 w (t) (t+1) = P (t) 2 V (t) g ()b (2.21) wherew (t) =(1 ) denotes either the exact natural policy gradient or its sample-based approx- imation. For a general policy class,f j2 g, with the parameter space R d , the strong duality in Lemma 1 does not necessarily hold and our analysis of Section 2.4 does not apply directly. Let the parametric dual function V D () := maximize 2 V ; L () be minimized at 43 the optimal dual variable ? . Under the Slater condition in Assumption 1, the parametrization gap [179, Theorem 2] is determined by, V ? r () = V ? D () V ? D () V ? r () M where := max s k(js) (js)k 1 is the policy approximation error and M > 0 is a problem-dependent constant. Application of item (ii) in Lemma 1 to the set of all optimal dual variables ? yields ? 2 [ 0; 2=((1 )) ] and, thus, = [ 0; 2=((1 )) ]. To quantify errors caused by the restricted policy parametrization, let us rst generalize NPG. For a distribution over state-action pair2 (SA), we introduce the compatible function approximation error as the following regression objective [112], E (w;;) := E (s;a) A ; L (s;a)w > r log (ajs) 2 whereA ; L (s;a) :=A r (s;a)+A g (s;a). We can view NPG in (2.11) as a minimizer ofE (w;;) for(s;a) =d (s) (ajs), (1 )F y ()r V ; L () 2 argmin w E (w;;): (2.22) Expression (2.22) follows from the rst-order optimality condition and the use ofr V ; L () := r V r () +r V g () allows us to rewrite it as (1 )F y ()r V () 2 argmin w E (w ;) (2.23) 44 where denotesr org. Let the minimal error beE ;? 
:= minimize w E (w ;), where the compatible function ap- proximation errorE (w ;) is given by E (w ;) := E (s;a) h A (s;a)w > r log (ajs) 2 i : (2.24) When the compatible function approximation error is zero, the global convergence follows from Theorem 6. However, this is not the case for a general policy class because it may not include all possible policies (e.g., if we takedjSjjAj for the tabular constrained MDPs). The intuition behindcompatibility is that any minimizer ofE (w ;) can be used as the NPG direction without aecting convergence theory; also see discussions in [112, 217, 8]. Since the state-action measure of some feasible comparison policy is not known, we introduce an exploratory initial distribution 0 over state-action pairs and dene a state-action visitation distribution 0 of a policy as 0 (s;a) = (1 )E (s 0 ;a 0 ) 0 " 1 X t = 0 t Pr (s t =s;a t =ajs 0 ;a 0 ) # where Pr (s t =s;a t =ajs 0 ;a 0 ) is the probability of visiting a state-action pair (s;a) under policy for an initial state-action pair (s 0 ;a 0 ). Whenever clear from context, we use (t) to denote (t) 0 for notational convenience. It the minimizer is computed exactly, we can updatew (t) in (2.21) usingw (t) =w (t) r + (t) w (t) g , wherew (t) r andw (t) g are given by w (t) 2 argmin w E (t) w ; (t) : (2.25) 45 Even though the exact computation of the minimizer may not be feasible, we can use sample- based algorithms to approximately solve the empirical risk minimization problem. By character- izing errors that result from sample-based solutions and from function approximation, we next prove convergence of (2.21) for the log-linear and for the general smooth policy classes. 2.5.1 Log-linearpolicyclass We rst consider policies in the log-linear class (2.6), with linear feature maps s;a 2 R d . In this case, the gradientr log (ajs) becomes a shifted version of feature s;a , r log (ajs) = s;a E a 0 (js) [ s;a 0 ] := s;a : (2.26) Thus, the compatible function approximation error (2.24) captures how well the linear function > s;a approximates the advantage functionsA r (s;a) orA g (s;a) under the state-action distri- bution . We also introduce the compatible function approximation error with respect to the state-action value functionsQ (s;a), E (w ;) := E (s;a) h Q (s;a) w > s;a 2 i : When there are no compatible function approximation errors, the policy update in (2.21) forw (t) that is calculated by (2.25) is given byw (t) = w (t) r + (t) w (t) g ,w (t) 2 argmin w E (t) w ; (t) for =r org, where (t) (s;a) =d (t) (s) (t) (ajs) is an on-policy state-action visitation distribution. This is because the softmax function is invariant to any terms that are independent of the action. 46 Let us consider an approximate solution, w (t) argmin kwk 2 W E (t) w ; (t) (2.27) where the bounded domain W > 0 can be viewed as an ` 2 -regularization and let the exact minimizer be w (t) ;? 2 argmin kwk 2 W E (t) (w ; (t) ). Fixing a state-action distribution (t) , the estimation error in w (t) arises from the discrepancy between w (t) and w (t) ;? , which comes from the randomness in a sample-based optimization algorithm and the mismatch between the linear function and the true state-action value function. We represent the estimation error as E (t) ;est := E h E (t) w (t) ; (t) E (t) w (t) ;? ; (t) i where the expectationE is taken over the randomness of approximate algorithm that is used to solve (2.27). Note that the state-action distribution (t) is on-policy. To characterize the eect of distribu- tion shift onw (t) ;? 
, let us introduce some notation. We represent a xed distribution over state- action pairs (s;a) by ? (s;a) := d ? (s) Unif A (a): (2.28) The xed distribution ? samples a state fromd ? (s) and an action uniformly from Unif A (a). We characterize the error inw (t) ;? that arises from the distribution shift using the transfer error, E (t) ;bias := E E ? w (t) ;? ; (t) : 47 Assumption2(Estimationerrorandtransfererror) Boththeestimationerrorandthetrans- fer error are bounded, i.e.,E (t) ;est est andE (t) ;bias bias , where denotesr org. When we apply a sample-based algorithm to (2.27), it is standard to have est =O(1= p K), where K is the number of samples; e.g., see [20, Theorem 1]. A special case is the exact tabular softmax policy parametrization for which bias = est = 0. For any state-action distribution, we dene := E (s;a) s;a > s;a and, to compare with ? , we introduce the relative condition number, := sup w2R d w > ?w w > 0 w : Assumption3(Relativeconditionnumber) For an initial state-action distribution 0 and ? determined by (2.28), the relative condition number is nite. With the estimation error est , the transfer error bias , and the relative condition number in place, in Theorem 9 we provide convergence guarantees for algorithm (2.21) using the approximate update (2.27). Even though we set (0) = 0 and (0) = 0 in the proof of Theorem 9, global convergence can be established for arbitrary initial conditions. 48 Theorem9(Convergenceandoptimality: log-linearpolicy) Let Assumption 1 hold for > 0andletusxastatedistributionandastate-actiondistribution 0 . Iftheiterates ( (t) ; (t) )gen- eratedby algorithm (2.21)and (2.27)withk s;a kB and 1 = 2 = 1= p T satisfyAssumptions 2 and 3, then, E " 1 T T1 X t = 0 V ? r () V (t) r () # C 3 (1 ) 5 1 p T + 2 + 4= (1 ) 2 p jAj bias + s jAj est 1 ! E " 1 T T1 X t = 0 b V (t) g () # + C 4 (1 ) 4 1 p T + 4 + 2 1 p jAj bias + s jAj est 1 ! whereC 3 = 1 + logjAj + 5B 2 W 2 = andC 4 = (1 + logjAj +B 2 W 2 ) + (2 + 4B 2 W 2 )=. Theorem 9 shows that, on average, the reward value function converges to its globally optimal value and that the constraint violation decays to zero (up to an estimation error est and a transfer error bias ). When bias = est = 0, the rate (1= p T; 1= p T ) matches the result in Theorem 6 for the exact tabular softmax case. In contrast to the optimality gap, the lower order of eective horizon 1=(1 ) in the constraint violation yields a tighter error bound. Remark3 In the standard error decomposition, E (t) w (t) ; (t) = E (t) w (t) ; (t) E (t) w (t) ;? ; (t) +E (t) w (t) ;? ; (t) the dierence term is the standard estimation error that result from the discrepancy between w (t) andw (t) ;? ,andthelasttermcharacterizestheapproximationerrorinw (t) ;? . InCorollary10,werepeat Theorem 9 in terms of an upper bound approx on the approximation error, E (t) ;approx := E h E (t) w (t) ;? ; (t) i : 49 SinceE (t) ;approx utilizeson-policystate-actiondistribution (t) ,theerrorboundsinCorollary10depend ontheworst-casedistributionmismatchcoeceintk ? = 0 k 1 . Incontrast,applicationofestimation and transfer errors in Theorem 9 does not involve the distribution mismatch coecient. Therefore, theerrorboundsinTheorem9aretighterthantheonesinCorollary10thatutilizethestandarderror decomposition. Corollary10(Convergenceandoptimality: log-linearpolicy) Let Assumption 1 hold for > 0andletusxastatedistributionandastate-actiondistribution 0 . 
Iftheiterates ( (t) ; (t) ) generated by algorithm (2.21) and (2.27) withk s;a k B and 1 = 2 = 1= p T satisfy Assump- tion 2 except forE (t) ;bias , Assumption 3, andE (t) ;approx approx , =r org, then, E " 1 T T1 X t = 0 V ? r () V (t) r () # C 3 (1 ) 5 1 p T + C 0 3 s jAj approx 1 ? 0 1 + s jAj est 1 ! E " 1 T T1 X t = 0 b V (t) g () # + C 4 (1 ) 4 1 p T + C 0 4 s jAj approx 1 ? 0 1 + s jAj est 1 ! whereC 3 = 1 + logjAj + 5B 2 W 2 =, C 4 = (1 + logjAj +B 2 W 2 ) + (2 + 4B 2 W 2 )=, C 0 3 = (2 + 4=)=(1 ) 2 , andC 0 4 = (4 + 2)=(1 ). Proof. From the denitions ofE ? andE (t) we have E ? w (t) ;? ; (t) ? (t) 1 E (t) w (t) ;? ; (t) 1 1 ? 0 1 E (t) w (t) ;? ; (t) where the second inequality is because of (1 ) 0 (t) . Thus, E (t) ;bias 1 1 ? 0 1 E (t) ;approx 50 which allows us to replaceE (t) ;bias in the proof of Theorem 9 byE (t) ;approx . 2.5.2 Non-asymptoticconvergenceanalysis We rst provide a regret-type anlysis for our primal-dual method. Lemma11(Regret/violationlemma) Let Assumption 1 hold for > 0, let us x a state distri- bution andT > 0, and let log (ajs) be-smooth in for any (s;a). If the iterates ( (t) ; (t) ) aregeneratedbyalgorithm (2.21)with (0) = 0, (0) = 0, 1 = 2 = 1= p T,andkw (t) kW,then, 1 T T1 X t = 0 V ? r () V (t) r () C 3 (1 ) 5 1 p T + T1 X t = 0 err (t) r ( ? ) (1 )T + T1 X t = 0 2 err (t) g ( ? ) (1 ) 2 T " 1 T T1 X t = 0 b V (t) g () # + C 4 (1 ) 4 1 p T + T1 X t = 0 err (t) r ( ? ) T + T1 X t = 0 2 err (t) g ( ? ) (1 )T whereC 3 = 1 + logjAj + 5W 2 =,C 4 = (1 + logjAj +W 2 ) + (2 + 4W 2 )=, and err (t) () := E sd E a(js) A (t) (s;a) (w (t) ) > r log (t) (ajs) where =r org. Proof. The smoothness of the log-linear policy in conjunction with an application of Taylor series expansion to log (t) (ajs) yield log (t) (ajs) (t+1) (ajs) + (t+1) (t) > r log (t) (ajs) 2 k (t+1) (t) k 2 (2.29) 51 where (t+1) (t) = 1 w (t) =(1 ). Fixing and, we used to denoted to obtain, E sd D KL ((js)k (t) (js))D KL ((js)k (t+1) (js)) = E sd E a(js) log (t) (ajs) (t+1) (ajs) (a) 1 E sd E a(js) h r log (t) (ajs)w (t) i 2 1 2(1 ) 2 kw (t) k 2 (b) = 1 E sd E a(js) h r log (t) (ajs)w (t) r i + 1 (t) E sd E a(js) h r log (t) (ajs)w (t) g i 2 1 2(1 ) 2 kw (t) k 2 = 1 E sd E a(js) A (t) r (s;a) + 1 (t) E sd E a(js) A (t) g (s;a) + 1 E sd E a(js) h r log (t) (ajs) w (t) r + (t) w (t) g A (t) r (s;a)+ (t) A (t) g (s;a) i 2 1 (1 ) 2 kw (t) r k 2 + (t) 2 kw (t) g k 2 (c) 1 (1 ) V r ()V (t) r () + 1 (1 ) (t) V g ()V (t) g () 1 err (t) r () 1 (t) err (t) g () 2 1 W 2 (1 ) 2 2 1 W 2 (1 ) 2 (t) 2 where (a) is because of (2.29). On the other hand, we use the updatew (t) = w (t) r + (t) w (t) g for a given (t) in (b) and in (c) we apply the performance dierence lemma, denitions of err (t) r () and err (t) g (), andkw (t) kW . Rearrangement of the above inequality yields V r ()V (t) r () 1 1 1 1 E sd D KL ((js)k (t) (js))D KL ((js)k (t+1) (js)) + 1 1 err (t) r () + 2 (1 ) 2 err (t) g () + 1 W 2 (1 ) 3 + 4 1 W 2 (1 ) 5 2 (t) V g ()V (t) g () : where we utilize 0 (t) 2=((1 )) from the dual update in (2.21). 
52 Averaging the above inequality above overt = 0; 1;:::;T 1 yields 1 T T1 X t = 0 V r ()V (t) r () 1 (1 ) 1 T T1 X t = 0 E sd D KL ((js)k (t) (js))D KL ((js)k (t+1) (js)) + 1 (1 )T T1 X t = 0 err (t) r () + 2 (1 ) 2 T T1 X t = 0 err (t) g () + 1 W 2 (1 ) 3 + 4 1 W 2 (1 ) 5 2 1 T T1 X t = 0 (t) V g ()V (t) g () which implies that, 1 T T1 X t = 0 V r ()V (t) r () logjAj (1 ) 1 T + 1 (1 )T T1 X t = 0 err (t) r () + 2 (1 ) 2 T T1 X t = 0 err (t) g () + 1 W 2 (1 ) 3 + 4 1 W 2 (1 ) 5 2 + 1 T T1 X t = 0 (t) V g ()V (t) g () : If we choose the comparison policy = ? , then we have 1 T T1 X t = 0 V ? r ()V (t) r () + 1 T T1 X t = 0 (t) V ? g ()V (t) g () logjAj (1 ) 1 T + 1 (1 )T T1 X t = 0 err (t) r ( ? ) + 2 (1 ) 2 T T1 X t = 0 err (t) g ( ? ) + 1 W 2 (1 ) 3 + 4 1 W 2 (1 ) 5 2 : (2.30) 53 Provingtherstinequality. By the same reasoning as in (2.19a), 0 (T ) 2 = T1 X t = 0 ( (t+1) ) 2 ( (t) ) 2 2 2 T1 X t = 0 (t) bV (t) g () + 2 2 T1 X t = 0 V (t) g ()b 2 (a) 2 2 T1 X t = 0 (t) V ? g ()V (t) g () + 2 2 T (1 ) 2 (2.31a) where (a) is because of feasibility of ? :V ? g ()b, andjV (t) g ()bj 1=(1 ). Hence, 1 T T1 X t = 0 (t) V ? g ()V (t) g () 2 2(1 ) 2 : (2.31b) By adding the inequality (2.31b) to (2.30) on both sides and taking 1 = 2 = 1= p T , we obtain the rst inequality. Provingthesecondinequality. Since the dual update in (2.21) is the same as the one in (2.12a), we can use the same reasoning to conclude (2.20). Adding the inequality (2.20) to (2.30) on both sides and usingV ? g ()b yield 1 T T1 X t = 0 V ? r () V (t) r () + T T1 X t = 0 b V (t) g () logjAj (1 ) 1 T + 1 (1 )T T1 X t = 0 err (t) r ( ? ) + 2 (1 ) 2 T T1 X t = 0 err (t) g ( ? ) + 1 W 2 (1 ) 3 + 4 1 W 2 (1 ) 5 2 + 1 2 2 T (0) 2 + 2 2(1 ) 2 : 54 Taking = 2 (1 ) when P T1 t = 0 bV (t) g () 0 and = 0 otherwise, we obtain V ? r () 1 T T1 X t = 0 V (t) r () + 2 (1 ) " b 1 T T1 X t = 0 V (t) g () # + logjAj (1 ) 1 T + 1 (1 )T T1 X t = 0 err (t) r ( ? ) + 2 (1 ) 2 T T1 X t = 0 err (t) g ( ? ) + 1 W 2 (1 ) 3 + 4 1 W 2 (1 ) 5 2 T + 2 2 (1 ) 2 2 + 2 2(1 ) 2 : SinceV (t) r () andV (t) g () are linear functions in the occupancy measure [11, Chapter 10], there exists a policy 0 such thatV 0 r () = 1 T P T1 t = 0 V (t) r () andV 0 g () = 1 T P T1 t = 0 V (t) g (). Hence, V ? r () V 0 r () + 2 (1 ) h b V 0 g () i + logjAj (1 ) 1 T + 1 (1 )T T1 X t = 0 err (t) r ( ? ) + 2 (1 ) 2 T T1 X t = 0 err (t) g ( ? ) + 1 W 2 (1 ) 3 + 4 1 W 2 (1 ) 5 2 + 2 2 (1 ) 2 2 T + 2 2(1 ) 2 : Application of Lemma 2 with 2=((1 )) 2 ? yields b V 0 g () + logjAj 1 T + T T1 X t = 0 err (t) r ( ? ) + 2 (1 )T T1 X t = 0 err (t) g ( ? ) + 1 W 2 (1 ) 2 + 4 1 W 2 (1 ) 4 + 2 2 (1 )T + 2 2(1 ) : which leads to our constraint violation bound if we further utilize 1 T P T1 t = 0 bV (t) g () = b V 0 g () and 1 = 2 = 1= p T . Proof. [Proof of Theorem 9] 55 Whenk s;a k B, for the log-linear policy class, log (ajs) is-smooth with = B 2 . By Lemma 11, it remains to consider the randomness in sequences ofw (t) and the error bounds for err (t) ( ? ). Application of the triangle inequality yields err (t) r ( ? ) E sd ? E a ? (js) h A (t) r (s;a) (w (t) r;? ) > r log (t) (ajs) i + E sd ? E a ? (js) w (t) r;? w (t) r > r log (t) (ajs) : (2.32) Application of (2.26) andA (t) r (s;a) =Q (t) r (s;a)E a 0 (t) (js) Q (t) r (s;a 0 ) yields E sd ? E a ? (js) h A (t) r (s;a) (w (t) r;? ) > r log (t) (ajs) i = E sd ? E a ? (js) h Q (t) r (s;a) > s;a w (t) r;? i E sd ? E a 0 (t) (js) h Q (t) r (s;a 0 ) > s;a 0w (t) r;? i r E sd ? E a ? (js) Q (t) r (s;a) > s;a w (t) r;? 2 + r E sd ? 
E a 0 (t) (js) Q (t) r (s;a 0 ) > s;a 0 w (t) r;? 2 2 s jAjE sd ? E a Unif A Q (t) r (s;a) > s;a w (t) r;? 2 = 2 q jAjE ? r (w (t) r;? ; (t) ): (2.33) Similarly, E sd ? E a ? (js) w (t) r;? w (t) r > r log (t) (ajs) = E sd ? E a ? (js) w (t) r;? w (t) r > s;a E sd ? E a 0 (t) (js) w (t) r;? w (t) r > s;a 0 2 v u u t jAjE sd ? E a Unif A " w (t) r;? w (t) r > s;a 2 # = 2 q jAjkw (t) r;? w (t) r k 2 ? (2.34) 56 where ? :=E (s;a) ? s;a > s;a . From the denition of we have kw (t) r;? w (t) r k 2 ? kw (t) r;? w (t) r k 2 0 1 kw (t) r;? w (t) r k 2 (t) (2.35) where we use (1 ) 0 (t) 0 := (t) in the second inequality. Evaluation of the rst-order optimality condition ofw (t) r;? 2 argmin kwrk 2 W E (t) r (w r ; (t) ) yields w r w (t) r;? > r E (t) r (w (t) r;? ; (t) ) 0; for anyw r satisfyingkw r k W: Thus, E (t) r (w r ; (t) )E (t) r (w (t) r;? ; (t) ) = E s;a (t) Q (t) r (s;a) > s;a w (t) r;? + > s;a w (t) r;? > s;a w r 2 E (t) r (w (t) r;? ; (t) ) = 2 w (t) r;? w r > E s;a (t) h Q (t) r (s;a) > s;a w (t) r;? s;a i +E s;a (t) > s;a w (t) r;? > s;a w r 2 = w r w (t) r;? > r E (t) r (w (t) r;? ; (t) ) +kw r w (t) r;? k 2 (t) kw r w (t) r;? k 2 (t) : Takingw r =w (t) r in the above inequality and combining it with (2.34) and (2.35), yield E sd ? E a ? (js) w (t) r;? w (t) r > r log (t) (ajs) 2 s jAj 1 E (t) r (w (t) r ; (t) )E (t) r (w (t) r;? ; (t) ) : (2.36) 57 Substitution of (2.33) and (2.36) into the right-hand side of (2.32) yields E err (t) r ( ? ) 2 r jAjE h E d ? r (w (t) r;? ; (t) ) i + 2 s jAj 1 E h E (t) r (w (t) r ; (t) )E (t) r (w (t) r;? ; (t) ) i : By the same reasoning, we can establish a similar bound onE err (t) g ( ? ) . Finally, our desired results follow by applying Assumption 2 and Lemma 11. 2.5.3 Generalsmoothpolicyclass For a general class of smooth policies [276, 8], we now establish convergence of algorithm (2.21) with approximate gradient update, w (t) = w (t) r + (t) w (t) g w (t) argmin kwk 2 W E (t) w ; (t) (2.37) where denotesr org and the exact minimizer is given byw (t) ;? 2 argmin kwk 2 W E (t) (w ; (t) ). Assumption4(Policysmoothness) For alls2S anda2A, log (ajs) is a-smooth func- tion of, kr log (ajs)r 0 log 0(ajs)k k 0 k for all ; 0 2 R d : Since both tabular softmax and log-linear policies satisfy Assumption 4 [8], Assumption 4 covers a broader function class relative to softmax policy parametrization (2.5). 58 Given a state-action distribution (t) , we introduce the estimation error as E (t) ;est := E h E (t) w (t) ; (t) E (t) w (t) ;? ; (t) (t) i : Furthermore, given a state distribution and an optimal policy ? , we dene a state-action dis- tribution ? (s;a) :=d ? (s) ? (ajs) as a comparator and introduce the transfer error, E (t) ;bias := E E ? w (t) ;? ; (t) : Finally, for any state-action distribution, we dene = E (s;a) h r log (ajs) r log (ajs) > i and use (t) to denote (t) . Assumption5(Estimation/transfererrorsandrelativeconditionnumber) The above es- timation and transfer errors as well as the expected relative condition number are bounded, i.e., E (t) ;est est andE (t) ;bias bias , for =r org, and E " sup w2R d w > (t) ?w w > (t) 0 w # : We next provide convergence guarantees for algorithm (2.21) in Theorem 12 using the ap- proximate update (2.37). Even though we set (0) = 0 and (0) = 0 in the proof of Theorem 12, convergence can be established for arbitrary initial conditions. 59 Theorem12(Convergenceandoptimality: generalsmoothpolicy) Letusxastatedistri- bution, a state-action distribution 0 , andT > 0, and let Assumptions 1 and 4 hold. 
If the iterates ( (t) ; (t) )generatedbyalgorithm (2.21)and (2.37)with 1 = 2 = 1= p T satisfyAssumption5and kw (t) kW, then, E " 1 T T1 X t = 0 V ? r () V (t) r () # C 3 (1 ) 5 1 p T + 1 + 2= (1 ) 2 p bias + r est 1 E " 1 T T1 X t = 0 b V (t) g () # + C 4 (1 ) 4 1 p T + 2 + 1 p bias + r est 1 whereC 3 = 1 + logjAj + 5W 2 = andC 4 = (1 + logjAj +W 2 ) + (2 + 4W 2 )=. Proof. Since Lemma 11 holds for any smooth policy class that satises Assumption 4, it remains to bound err (t) ( ? ) for = r org. We next separately bound each term on the right-hand side of (2.32). For the rst term, E sd ? E a ? (js) h A (t) r (s;a) (w (t) r;? ) > r log (t) (ajs) i r E sd ? E a ? (js) A (t) r (s;a) (w (t) r;? ) > r log (t) (ajs) 2 = q E ? r (w (t) r;? ; (t) ): (2.38) Similarly, E sd ? E a ? (js) w (t) r;? w (t) r > r log (t) (ajs) v u u t E sd ? E a ? (js) " w (t) r;? w (t) r > r log (t) (ajs) 2 # = r kw (t) r;? w (t) r k 2 (t) ? : (2.39a) 60 Let (t) := (t) 0 1=2 (t) ? (t) 0 1=2 2 be the relative condition number at timet. Thus, kw (t) r;? w (t) r k 2 (t) ? k (t) 0 1=2 (t) ? (t) 0 1=2 kkw (t) r;? w (t) r k 2 (t) 0 (a) (t) 1 kw (t) r;? w (t) r k 2 (t) (b) (t) 1 E (t) r (w (t) r ; (t) )E (t) r (w (t) r;? ; (t) ) where we use (1 ) 0 (t) 0 := (t) in (a), and we get (b) by the same reasoning as bound- ing (2.35). Taking an expectation over the inequality above from both sides yields E kw (t) r;? w (t) r k 2 (t) ? E (t) 1 E h E (t) r (w (t) r ; (t) )E (t) r (w (t) r;? ; (t) )j (t) i E (t) 1 est est 1 (2.39b) where the last two inequalities are because of Assumption 5. Substitution of (2.38) and (2.39) to the right-hand side of (2.32) yields an upper bound on E err (t) r ( ? ) . By the same reasoning, we can establish a similar bound onE err (t) g ( ? ) . Finally, application of these upper bounds to Lemma 11 yields the desired result. 61 2.6 Sample-basedalgorithms We now leverage convergence results established in Theorems 9 and 12 to design two model- free algorithms that utilize sample-based estimates. In particular, we propose a sample-based extension of NPG-PD algorithm (2.21) with function approximation and = [ 0; 2=((1 )) ], (t+1) = (t) + 1 1 ^ w (t) (t+1) = P (t) 2 ^ V (t) g ()b (2.40) where ^ w (t) and ^ V (t) g () are the sample-based estimates of the gradient and the value function. At each timet, we can access constrained MDP environment by executing a policy with termi- nating probability 1 . For the minimization problem in (2.37), we can run stochastic gradient descent (SGD) forK rounds,w ;k+1 =P kw ;k kW (w ;k G ;k ). Here,G ;k is a sample-based estimate of the population gradientr E (t) (w ; (t) ), G ;k = 2 (w ;k ) > r log (t) (ajs) ^ A (t) (s;a) r log (t) (ajs) ^ A (t) (s;a) := ^ Q (t) (s;a) ^ V (t) (s), ^ Q (t) (s;a) and ^ V (t) (s) are undiscounted sums that are collected in Algorithm 2. In addition, we estimate ^ V (t) g () using an undiscounted sum in Algorithm 3. As shown in Appendix A.6,G ;k , ^ A (t) (s;a), and ^ V (t) g () are unbiased estimates and we approximate gradient using the average of the SGD iterates ^ w (t) = K 1 P K k = 1 (w r;k + (t) w g;k ), which is an approximate solution for least-squares regression [20, Theorem 1]. 62 Algorithm1 Sample-based NPG-PD algorithm with general policy parametrization 1: Initialization: Learning rates 1 and 2 , number of SGD iterationsK, SGD learning rate. 2: Initialize (0) = 0, (0) = 0. 3: fort = 0;:::;T 1do 4: Initializew r;0 =w g;0 = 0. 
5: fork = 0; 1;:::;K 1 do 6: Estimate ^ A r (s;a) and ^ A g (s;a) for some (s;a) (t) , using Algorithm 2 with policy (t) . 7: Take a step of SGD, w r;k+1 =P kwrkW w r;k 2 (w r;k ) > r log (t) (s;a) ^ A (t) r (s;a) r log (t) (s;a) w g;k+1 =P kwgkW w g;k 2 (w g;k ) > r log (t) (s;a) ^ A (t) g (s;a) r log (t) (s;a) : 8: endfor 9: Set ^ w (t) = ^ w (t) r + (t) ^ w (t) g , where ^ w (t) r = 1 K P K1 k = 0 w r;k and ^ w (t) g = 1 K P K1 k = 0 w g;k . 10: Estimate ^ V (t) g () using Algorithm 3 with policy (t) . 11: Natural policy gradient primal-dual update (t+1) = (t) + 1 ^ w (t) (t+1) = P [ 0; 2=((1 )) ] (t) 2 ^ V (t) g ()b : 12: endfor Algorithm2A-Unbiased estimate (A est , =r org) 1: Input: Initial state-action distribution 0 , policy, discount factor . 2: Sample (s 0 ;a 0 ) 0 , execute the policy with probability at each steph; otherwise, accept (s h ;a h ) as the sample. 3: Start with (s h ;a h ), execute the policy with the termination probability 1 . Once termi- nated, add all rewards/utilities from steph onwards as ^ Q (s h ;a h ) for =r org. 4: Start withs h , samplea 0 h (js h ), and execute the policy with the termination probability 1 . Once terminated, add all rewards/utilities from steph onwards as ^ V (s h ) for =r or g. 5: Output: (s h ;a h ) and ^ A (s h ;a h ) := ^ Q (s h ;a h ) ^ V (s h ), =r org. Algorithm3V -Unbiased estimate (V est g ) 1: Input: Initial state distribution, policy, discount factor . 2: Samples 0 , execute the policy with the termination probability 1 . Once terminated, add all utilities up as ^ V g (). 3: Output: ^ V g (). 63 2.6.1 Samplecomplexity To establish sample complexity of Algorithm 1, we assume the score functionr log(ajs) has bounded norm [276, 8]. Assumption6(Lipschitzpolicy) For 0t<T, the policy (t) satises kr log (t) (ajs)k L ; whereL > 0: Under Assumption 6, sample-based estimate of SGD gradient is bounded byG := 2L (WL + 1=(1 )) and, in Theorem 13, we establish sample complexity of Algorithm 1. Theorem13(Samplecomplexity: generalsmoothpolicy) Let Assumptions 1, 4, and 6 hold andletusxastatedistribution,astate-actiondistribution 0 ,andT > 0. Iftheiterates ( (t) ; (t) ) aregeneratedbythesample-basedNPG-PDmethoddescribedinAlgorithm1with 1 = 2 = 1= p T and =W=(G p K), in whichK rounds of trajectory samples are used at each timet, then, E " 1 T T1 X t = 0 V ? r () V (t) r () # C 5 (1 ) 5 1 p T + 1 + 2= (1 ) 2 p bias + s GW (1 ) p K ! E " 1 T T1 X t = 0 b V (t) g () # + C 6 (1 ) 4 1 p T + 2 + 1 p bias + s GW (1 ) p K ! whereC 5 = 2 + logjAj + 5W 2 = andC 6 = (2 + logjAj +W 2 ) + (2 + 4W 2 )=. In Theorem 13, the sampling eect appears as an error rate 1=K 1=4 , whereK is the size of sampled trajectories. This rate follows the standard SGD result [197, Theorem 14.8] and it can be improved to 1= p K under additional restrictions on the dataset [101, 58]. The proof of Theorem 13 64 in Appendix A.7 follows the proof of Theorem 12 except that we use sample-based estimates of gradients in the primal update and sample-based value functions in the dual update. Algorithm4 Sample-based NPG-PD algorithm with log-linear policy parametrization 1: Input: Learning rates 1 and 2 , number of SGD iterationsK, SGD learning rate. 2: Initialize (0) = 0, (0) = 0, 3: fort = 0;:::;T 1do 4: Initializew r;0 =w g;0 = 0. 5: fork = 0; 1;:::;K 1 do 6: Estimate ^ Q (t) r (s;a) and ^ Q (t) g (s;a) for some (s;a) (t) , using Algorithm 5 with log- linear policy (t) . 
7: Take a step of SGD, w r;k+1 = P kwrkW w r;k 2 > s;a w r;k ^ Q (t) r (s;a) s;a w g;k+1 = P kwgkW w g;k 2 > s;a w g;k ^ Q (t) g (s;a) s;a : 8: endfor 9: Set ^ w (t) = ^ w (t) r + (t) ^ w (t) g , where ^ w (t) r = 1 K P K1 k = 0 w r;k and ^ w (t) g = 1 K P K1 k = 0 w g;k . 10: Estimate ^ V (t) g () using Algorithm 3 with log-linear policy (t) . 11: Natural policy gradient primal-dual update (t+1) = (t) + 1 1 ^ w (t) (t+1) = P [ 0; 2=((1 )) ] (t) 2 ^ V (t) g ()b : (2.41) 12: endfor Algorithm5Q-Unbiased estimate (Q est , =r org) 1: Input: Initial state-action distribution 0 , policy, discount factor . 2: Sample (s 0 ;a 0 ) 0 , execute the policy with probability at each steph; otherwise, accept (s h ;a h ) as the sample. 3: Start with (s h ;a h ), execute the policy with the termination probability 1 . Once termi- nated, add all rewards/utilities from steph onwards as ^ Q (s h ;a h ) for =r org, respectively. 4: Output: (s h ;a h ) and ^ Q (s h ;a h ), =r org. Algorithm 4 is utilized for log-linear policy parametrization. For the feature s;a that has bounded normk s;a k B, the sample-based gradient in SGD is bounded byG := 2B(WB + 65 1=(1 )). In Theorem 14, we establish sample complexity of Algorithm 4; see Appendix A.8 for proof. Theorem14(Samplecomplexity: log-linearpolicy) Let Assumption 1 hold and let us x a state distribution and a state-action distribution 0 . If the iterates ( (t) ; (t) ) generated by the sample-based NPG-PD method described in Algorithm 4 withk s;a k B, 1 = 2 = 1= p T, and =W=(G p K), in whichK rounds of trajectory samples are used at each timet, then, E " 1 T T1 X t = 0 V ? r () V (t) r () # C 5 (1 ) 5 1 p T + 2 + 4= (1 ) 2 p jAj bias + s jAjGW (1 ) p K ! E " 1 T T1 X t = 0 b V (t) g () # + C 6 (1 ) 4 1 p T + 4 + 2 1 p jAj bias + s jAjGW (1 ) p K ! whereC 5 = 2 + logjAj + 5W 2 = andC 6 = (2 + logjAj +W 2 ) + (2 + 4W 2 )=. When we specialize the log-linear policy to be the softmax policy, Algorithm 4 becomes a sample-based implementation of NPG-PD method (2.12) that utilizes the state-action value func- tions. In this case, bias = 0 andB = 1 in Theorem 14. When there are no sampling eects, i.e., asK!1, our rate (1= p T; 1= p T ) matches the rate in Theorem 6. 2.7 Computationalexperiments We use two examples of robotic tasks with constraints to demonstrate the merits and the ef- fectiveness of our sample-based NPG-PD method described in Algorithm 1. The rst example involves robots with speed limit tasks [280] and the second example rewards robots for running in a circle while staying in a safe region [7]. The robotic environments are implemented using the OpenAI Gym [43] for the MuJoCo physical simulators [221]. 66 We compare performance of our NPG-PD algorithm with First Order Constrained Optimiza- tion in Policy Space (FOCOPS) algorithm [280], an approach that provides the state-of-the-art performance for constrained robotic tasks. The policy is represented as a Gaussian distribution, where the mean action is parametrized by a two-layer neural network with the tanh activation and the state-independent logarithmic standard deviation is computed separately from the mean action. To have a fair comparison, we instantiate subroutines of Algorithms 2 and 3 by tting a two-layer neural network to estimate the value function and we implement lines 5–8 of Algo- rithm 1 withK = 8192 by solving a regularized empirical risk minimization problem to reduce variance. We also use the FOCOPS’ hyperparameters [280, Table 3] as our default hyperparame- ters. 
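For reference, a minimal NumPy sketch of the policy class described above, a diagonal Gaussian whose mean action is a two-layer tanh network and whose logarithmic standard deviation is a state-independent parameter vector, is given below. The input/output dimensions, hidden width, and initialization scale are assumptions chosen to resemble a Hopper-v3-sized task; the actual experiments use the FOCOPS code base and its hyperparameters rather than this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
obs_dim, act_dim, hidden = 11, 3, 64     # placeholder sizes (assumptions for the sketch)

# Two-layer tanh network for the mean action; log-std is a free, state-independent vector.
W1 = rng.normal(scale=0.1, size=(hidden, obs_dim)); b1 = np.zeros(hidden)
W2 = rng.normal(scale=0.1, size=(act_dim, hidden)); b2 = np.zeros(act_dim)
log_std = np.zeros(act_dim)

def mean_action(s):
    return W2 @ np.tanh(W1 @ s + b1) + b2

def sample_action(s):
    mu, std = mean_action(s), np.exp(log_std)
    a = mu + std * rng.standard_normal(act_dim)
    # Diagonal-Gaussian log-density, the quantity needed for score-function/NPG estimates.
    logp = -0.5 * np.sum(((a - mu) / std) ** 2 + 2.0 * log_std + np.log(2.0 * np.pi))
    return a, logp

s = rng.normal(size=obs_dim)             # a dummy observation
a, logp = sample_action(s)
print("action:", a, " log-probability:", logp)
```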
In the first example, the goal of a robot is to move along either a line or in a plane while satisfying a speed limit constraint [280]. We train six MuJoCo robotic agents to walk: Hopper-v3, Swimmer-v3, HalfCheetah-v3, Walker2d-v3, Ant-v3, and Humanoid-v3, while we constrain the moving speed to be under a given threshold. Figure 2.2 shows that in the first three tasks, our NPG-PD algorithm achieves higher rewards than the baseline FOCOPS algorithm while achieving a similar constraint satisfaction cost. In the second and third tasks, we observe an oscillatory response that arises from the dual updates in the NPG-PD algorithm [210]. For the last three tasks, Figure 2.3 shows a competitive performance of NPG-PD relative to FOCOPS. In Humanoid-v3, even though oscillations slow down the convergence of NPG-PD, it achieves higher rewards than FOCOPS in spite of the early oscillatory behavior.

In the second example, the robot aims to move along a circular trajectory while remaining within a safe region [7, 280]. For Humanoid Circle-v0, Figure 2.4 shows a slow initial response of NPG-PD compared to FOCOPS. We suspect that this is because the incremental update of the dual variable does not produce a sufficient penalty for reducing the constraint violation. As the dual variable (or the average cost) approaches a stationary point, the average reward converges quickly. In contrast, for Ant Circle-v0, NPG-PD achieves a much higher average reward than FOCOPS.

2.8 Concluding remarks

We have proposed a Natural Policy Gradient Primal-Dual (NPG-PD) algorithm for solving optimal control problems for constrained MDPs. Our algorithm utilizes natural policy gradient ascent to update the primal variable and projected sub-gradient descent to update the dual variable. Although the underlying maximization involves a nonconcave objective function and a nonconvex constraint set, we have established global convergence for softmax, log-linear, and general smooth policy parametrizations and have provided finite-sample complexity guarantees for two model-free extensions of the NPG-PD algorithm. To the best of our knowledge, our work is the first to offer finite-time performance guarantees for policy-based primal-dual methods in the context of discounted infinite-horizon constrained MDPs.

[Figure 2.2: Learning curves of NPG-PD and FOCOPS [280] for the Hopper-v3, Swimmer-v3, and Half Cheetah-v3 robotic tasks with the respective speed limits 82.748, 24.516, and 151.989. Each panel plots the average reward and the average cost (i.e., average speed) against the iteration count. The solid lines show the means of 1000 bootstrap samples obtained over 5 random seeds and the shaded regions display the bootstrap 95% confidence intervals.]

[Figure 2.3: Learning curves of NPG-PD and FOCOPS [280] for the Walker2d-v3, Ant-v3, and Humanoid-v3 robotic tasks with the respective speed limits 81.886, 103.115, and 20.140. Each panel plots the average reward and the average cost (i.e., average speed) against the iteration count. The solid lines show the means of 1000 bootstrap samples obtained over 5 random seeds and the shaded regions display the bootstrap 95% confidence intervals.]

[Figure 2.4: Learning curves of NPG-PD and FOCOPS [280] for the Humanoid Circle-v0 and Ant Circle-v0 robotic tasks. The horizontal axis represents the number of dual updates and the average cost is constrained to stay below 50. Each panel plots the average reward and the average cost (i.e., average speed). The solid lines show the means of 1000 bootstrap samples obtained over 5 random seeds and the shaded regions display the bootstrap 95% confidence intervals.]

Chapter 3

Provably efficient policy optimization for constrained MDPs

In this chapter, we focus on episodic constrained Markov decision processes (MDPs) with function approximation, where the Markov transition kernels have a linear structure but we do not impose any additional assumptions on the sampling model. Designing safe reinforcement learning algorithms with provable computational and statistical efficiency is particularly challenging in this setting because of the need to incorporate both the constraint and the function approximation into the fundamental exploitation/exploration tradeoff. To this end, we propose an Optimistic Primal-Dual Proximal Policy OPtimization (OPDOP) algorithm in which the value function is estimated by combining least-squares policy evaluation with an additional bonus term for exploration under constraints (or safe exploration). We prove that the proposed algorithm achieves an $\tilde{\mathcal{O}}(d H^{2.5} \sqrt{T})$ regret and an $\tilde{\mathcal{O}}(d H^{2.5} \sqrt{T})$ constraint violation, where $d$ is the dimension of the feature mapping, $H$ is the horizon of each episode, and $T$ is the total number of steps. These bounds hold when the reward/utility functions are fixed but the feedback after each episode is bandit. Our bounds depend on the capacity of the state-action space only through the dimension of the feature mapping, and thus our results hold even when the number of states goes to infinity.

3.1 Introduction

Safe Reinforcement Learning (safe RL) augments RL [214] with the practical consideration of safety to deal with restrictions/constraints arising in real-world problems [93, 14, 84], e.g., collision avoidance in autonomous robots [89, 91], cost limitations in medical applications [94, 23], and legal and business restrictions in financial management [5]. There is considerable growth in safe RL, especially in studies of constrained MDPs [227, 210, 37, 35, 220, 135, 264, 7, 259, 280, 179, 180], showing the successful integration of constrained optimization and policy-based RL for addressing constraints. However, these safe RL algorithms either do not have a convergence theory or are limited to asymptotic convergence. In practice, only a finite amount of data is available. Hence, it is imperative to design safe RL algorithms with computational and statistical efficiency guarantees. For this purpose, we must address the exploration/exploitation trade-off under constraints.
The task of safe exploration is to explore the unknown environment and learn to adapt the policy to the constraint set. Our problem setting deviates from existing scenarios that focus on good priors on constraints or transition models, e.g., [226, 28, 63, 231, 54, 55, 230]. Recent policy-based safe RL algorithms for constrained MDPs, e.g., constrained policy optimization [7, 259, 280] and primal-dual policy optimization [179, 180], seek a single safe policy via constrained policy optimization, but their sample-efficiency guarantees lack a theory.

In this chapter, we present our answer to the following theoretical question.

Can we design a provably sample-efficient online policy optimization algorithm for constrained MDPs in the function approximation setting?

In Section 3.2, we introduce an episodic control problem for constrained MDPs, the metrics of learning performance, and the linear function approximation. In Section 3.3, we propose an optimistic primal-dual policy optimization algorithm for constrained MDPs. In Section 3.4, we establish the regret and constraint violation analysis for the proposed algorithm. In Section 3.5, we present improved results in the tabular setting. We close this chapter with concluding remarks in Section 3.6.

3.2 Problem setup

We consider an episodic constrained Markov decision process, CMDP$(\mathcal{S}, \mathcal{A}, H, \mathbb{P}, r, g, b)$, where $\mathcal{S}$ is a state space, $\mathcal{A}$ is an action space, $H$ is the fixed length of each episode, $\mathbb{P} = \{\mathbb{P}_h\}_{h=1}^H$ is a collection of transition probability measures, $r = \{r_h\}_{h=1}^H$ is a collection of reward functions, $g = \{g_h\}_{h=1}^H$ is a collection of utility functions, and $b$ is a constraint offset. We assume that $\mathcal{S}$ is a measurable space with a possibly infinite number of elements. Moreover, for each step $h \in [H]$, $\mathbb{P}_h(\cdot\,|\,s,a)$ is the transition kernel over the next state if action $a$ is taken at state $s$, and $r_h\colon \mathcal{S}\times\mathcal{A} \to [0,1]$ is a reward function. We assume that the reward/utility functions are deterministic; our analysis readily generalizes to the setting where they are random.

Let the policy space $\Delta(\mathcal{A}\,|\,\mathcal{S}, H)$ be $\{\{\pi_h(\cdot\,|\,\cdot)\}_{h=1}^H \colon \pi_h(\cdot\,|\,s) \in \Delta(\mathcal{A}),\ \forall s \in \mathcal{S} \text{ and } h \in [H]\}$, where $\Delta(\mathcal{A})$ denotes the probability simplex over the action space. Let $\pi^k \in \Delta(\mathcal{A}\,|\,\mathcal{S}, H)$ be the policy taken by the agent at episode $k$, where $\pi_h^k(\cdot\,|\,s_h^k)$ determines the action that the agent takes at state $s_h^k$. For brevity, we assume that the initial state $s_1^k$ is fixed as $s_1$ across episodes.

The agent interacts with the environment in the $k$th episode as follows. At the beginning, the agent determines a policy $\pi^k$. Then, at each step $h \in [H]$, the agent observes the state $s_h^k \in \mathcal{S}$, takes an action $a_h^k$ following the policy $\pi_h^k(\cdot\,|\,s_h^k)$, and receives a reward $r_h(s_h^k, a_h^k)$ together with a utility $g_h(s_h^k, a_h^k)$. Meanwhile, the MDP evolves into the next state $s_{h+1}^k$, drawn from the probability $\mathbb{P}_h(\cdot\,|\,s_h^k, a_h^k)$. The episode terminates at the state $s_H^k$, in which no control action is taken and both the reward and utility functions are equal to zero. Our focus is the bandit setting in which the agent only observes the values of the reward/utility functions, $r_h(s_h^k, a_h^k)$ and $g_h(s_h^k, a_h^k)$, at the visited state-action pair $(s_h^k, a_h^k)$. We assume that the reward/utility functions are fixed over episodes.
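The episodic interaction described above can be summarized with a short rollout loop. The sketch below is illustrative only: the environment interface (reset, step) and the policy representation are hypothetical stand-ins for the CMDP and the policy $\pi^k$, and the returned trajectory records exactly the bandit feedback $r_h(s_h^k, a_h^k)$, $g_h(s_h^k, a_h^k)$ that is available to the learner.

```python
import numpy as np

def run_episode(env, policy, H):
    """Roll out one episode of the episodic CMDP under a fixed policy (a sketch).

    env    : hypothetical environment with reset() -> s and step(a) -> (s_next, r, g)
    policy : policy[h](s) returns a probability vector over a finite action set
    Returns the trajectory with the bandit feedback observed at visited pairs.
    """
    s = env.reset()
    trajectory = []
    for h in range(H):
        probs = policy[h](s)                        # pi_h^k(. | s_h^k)
        a = np.random.choice(len(probs), p=probs)   # a_h^k ~ pi_h^k(. | s_h^k)
        s_next, r, g = env.step(a)                  # only r_h and g_h at (s_h^k, a_h^k) are observed
        trajectory.append((s, a, r, g))
        s = s_next
    return trajectory
```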
Given a policy2 (AjS;H), the value functionV r;h associated with the reward function r at each steph are the expected values of total rewards, V r;h (s) = E " H X i =h r i (s i ;a i ) s h =s # for alls2S,h2 [H], where the expectationE is taken over the random state-action sequence f(s h ;a h )g H h =i ; the actiona h follows the policy h (js h ) at the states h and the next states h+1 follows the transition dynamicsP h (js h ;a h ). Thus, the state-action functionQ r;h (s;a):SA! 75 R associated with the reward functionr is the expected value of total rewards when the agent starts from state-action pair (s;a) at steph and follows policy, Q r;h (s;a) = E " H X i =h r i (s i ;a i ) s h =s;a h =a # for all (s;a)2SA andh2 [H]. Similarly, we dene the value functionV g;h :S! R and the state-action functionQ g;h (s;a):SA! R associated with the utility functiong. Denote symbol = r org. For brevity, we take the shorthandP h V ;h+1 (s;a) := E s 0 P h (js;a) V ;h+1 (s 0 ). The Bellman equations associated with a policy are given by Q ;h (s;a) = h +P h V ;h+1 (s;a) (3.1) where V ;h (s) = Q ;h (s; ); h (js) A , for all (s;a)2SA. Here, the inner product of a functionf:SA!R with(js)2 (A) at xeds2S represents hf(s; );(js)i A := X a2A hf(s;a);(ajs)i: 3.2.1 Learningperformance The design of optimal control policy reduces to nding a solution of a constrained problem in which the objective function is the expected total rewards and the constraint is on the expected total utilities, maximize 2 (AjS;H) V r;1 (s 1 ) subject to V g;1 (s 1 ) b (3.2) 76 where we takeb2 (0;H] to avoid triviality. It is readily generalized to the problem with multiple constraints. Let ? 2 (AjS;H) be a solution to problem (3.2). Since the policy ? is computed from knowing the transition model and all reward and utility functions, we refer it as an optimal policy in-hindsight. The associated Lagrangian of problem (3.2) is given by V ;Y L (s 1 ) := V r;1 (s 1 ) + Y V g;1 (s 1 )b where is the primal variable andY 0 is the dual variable. We can cast (3.2) into a saddle-point problem, maximize 2 (AjS;H) minimize Y 0 V ;Y L (s 1 ) whereV ;Y L (s 1 ) is convex inY and is non-concave in in general. To address the non-concavity, we exploit the structure of value functions to propose a variant of Lagrange multiplier method for constrained RL problems in Section 3.3, which warrants a new line of primal-dual mirror descent type analysis in sequel. This distinguishes from unconstrained RL, e.g., [8, 47]. Another key feature of constrained RL is the safe exploration under constraints [93]. Without any constraint information a priori, it is infeasible for each policy to satisfy the constraint since utility information on constraints is only revealed after a policy is decided. Instead, we allow each policy to violate the constraint in each episode and minimize regret while minimizing total constraint violations for safe exploration overK episodes. We dene the regret as the dierence between the total reward value of policy ? in hindsight and that of the agent’s policy k overK 77 episodes, and the constraint violation as a dierence between the osetKb and the total utility value of the agent’s policy k overK episodes, Regret(K) = K X k = 1 V ? r;1 (s 1 )V k r;1 (s 1 ) Violation(K) = K X k = 1 bV k g;1 (s 1 ) : (3.3) In this chapter, we design algorithms, taking bandit feedback of the reward/utility functions, with both regret and constraint violation being sublinear in the total number of stepsT :=HK. 
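As a concrete illustration of the performance metrics in (3.3), the following sketch accumulates the regret and the constraint violation from per-episode values. In practice the values $V^{\pi^k}_{r,1}(s_1)$, $V^{\pi^k}_{g,1}(s_1)$, and $V^{\pi^\star}_{r,1}(s_1)$ are not directly observable by the learner, so the snippet only serves to make the definitions concrete; the function name and arguments are hypothetical.

```python
def regret_and_violation(v_r_star, v_r_per_episode, v_g_per_episode, b):
    """Cumulative regret and constraint violation as defined in (3.3) (a sketch).

    v_r_star         : V*_{r,1}(s_1), reward value of the best policy in hindsight
    v_r_per_episode  : sequence of V^{pi_k}_{r,1}(s_1) over k = 1, ..., K
    v_g_per_episode  : sequence of V^{pi_k}_{g,1}(s_1) over k = 1, ..., K
    b                : constraint offset
    """
    regret = sum(v_r_star - v for v in v_r_per_episode)
    violation = sum(b - v for v in v_g_per_episode)
    return regret, violation
```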
Put dierently, the algorithm should ensure that given> 0, ifT =O(1= 2 ), then both Regret(K) = O() and Violation(K) =O() hold with high probability. LetV Y D (s 1 ) := maximize V ;Y L (s 1 ) be the dual function andY ? := argmin Y 0 V Y D (s 1 ) be the optimal dual variable. We assume feasibility for Problem (3.2) in Assumption 7 that is known as the Slater condition [179, 85, 187]. It is convenient to establish the strong duality [179] and the boundedness of the optimal dual variableY ? that can be found in Appendix B.4. Assumption7(Feasibility) Thereexists > 0and 2 (AjS;H)suchthatV g;1 (s 1 )b + . Lemma15(StrongDualityandBoundednessofY ? ) Let Assumption 7 hold. Then (i) V ? r;1 (s 1 ) = V Y ? D (s 1 ):; (ii) 0 Y ? (V ? r;1 (s 1 )V r;1 (s 1 ))= : Lemma 15 provides useful optimization properties of (3.2) for our algorithm design and anal- ysis. 78 3.2.2 Linearfunctionapproximation We focus on a class of constrained MDPs, where transition kernels are linear in feature maps. Assumption8 The CMDP(S;A;H;P;r;g) is a linear MDP with a kernel feature map :S AS!R d 1 , if for anyh2 [H], there exists a vector h 2R d 1 withk h k 2 p d 1 such that for any (s;a;s 0 )2SAS, P h (s 0 js;a) = h (s;a;s 0 ); h i; there exists a feature map':SA!R d 2 and vectors r;h ; g;h 2R d 2 such that for any (s;a)2 SA, r h (s;a) = h'(s;a); r;h i and g h (s;a) = h'(s;a); g;h i where max(k r;h k 2 ;k g;h k 2 ) p d 2 . Moreover, we assume that for any functionV:S! [0;H], Z S (s;a;s 0 )V (s 0 )ds 0 2 p d 1 H for all (s;a)2SA and max(d 1 ;d 2 )d. Assumption 8 adapts the denition of linear kernel MDP [18, 284, 47] for constrained MDPs. Linear kernel MDP examples include tabular MDPs [284], feature embedded transition mod- els [255], and linear combinations of base models [166]. We can construct related examples of constrained MDPs with linear structure by adding proper constraints. For usefulness of linear structure, see discussions in the literature [80, 228, 124]. For more general transition dynamics, see factored MDPs [184]. 79 Although our denition in Assumption 8 and linear MDPs [256, 110] all contain tabular MDPs as special cases, they dene transition dynamics using dierent feature maps. They are not com- parable since one cannot be implied by the other [284]. We provide more details on the tabular case of Assumption 8 in Section 3.5. 3.3 Optimisticprimal-dualproximalpolicyoptimization In Algorithm 6, we present a new variant of proximal policy optimization [194]: an Optimistic Primal-Dual Proximal Policy OPtimization (OPDOP) algorithm. Specically, we eectuate the optimism through the Upper-Condence Bounds (UCB) [256, 255, 110], and address the con- straints via the union of the Lagrange multipliers method with the value function structure that is captured by the performance dierence lemma. Remark4 For any two policies ; 0 2 (AjS;H), = r or g, the performance dierence lemma [47] quanties the value function dierence by V 0 ;1 (s 1 ) V ;1 (s 1 ) = E 0 " H X h = 1 Q ;h (s h ;); 0 h (js h ) h (js h ) s 1 # : In each episode, our algorithm consists of three main stages. The rst stage (lines 4–8) isPolicy Improvement: we receive a new policy k by improving previous k1 via a mirror descent type optimization; The second stage (line 9) is Dual Update: we update dual variableY k based on the constraint violation induced by previous policy k ; The third stage (line 10) is Policy Evaluation: we optimistically evaluate newly obtained policy via the least-squares policy evaluation with an additional UCB bonus term for exploration. 
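The three stages can be arranged into a per-episode loop whose structure is sketched below. The callables are hypothetical placeholders for the concrete updates introduced in the following subsections (mirror-descent policy improvement, projected dual update, and least-squares policy evaluation with a UCB bonus); the sketch only records the order in which the stages are executed within each episode.

```python
def opdop_loop(K, improve_and_rollout, dual_update, evaluate, Y0=0.0):
    """Per-episode structure of the OPDOP algorithm (a sketch with hypothetical callables).

    improve_and_rollout(pi, est, Y) -> (new policy, episode data)     # stage 1: policy improvement + interaction
    dual_update(Y, v_g_est)         -> new dual variable              # stage 2: projected dual step
    evaluate(data)                  -> dict with 'Q_r', 'Q_g', 'V_g'  # stage 3: optimistic policy evaluation
    """
    pi, Y, est = None, Y0, None
    for _ in range(K):
        pi, data = improve_and_rollout(pi, est, Y)   # improve the policy, then run one episode
        v_g_est = est["V_g"] if est is not None else 0.0
        Y = dual_update(Y, v_g_est)                  # adjust the Lagrange multiplier
        est = evaluate(data)                         # LSTD + UCB estimates used in the next episode
    return pi, Y
```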
80 3.3.1 Policyimprovement In the k-th episode, a natural attempt of obtaining a policy k is to solve a Lagrangian-based policy optimization problem, maximize 2 (AjS;H) V ;Y k1 L (s 1 ) := V r;1 (s 1 )Y k1 (bV g;1 (s 1 )) whereV ;Y L (s 1 ) is the Lagrangian and the dual variableY k1 0 is from the last episode; we show thatY k1 can be updated eciently in Section 3.3.2. This type update also nds in [126, 179, 180, 220]. They rely on an oracle solver, e.g., Q-learning [86], proximal policy optimization [194], or trust region policy optimization [195], to deliver a near-optimal policy, making overall algo- rithmic complexity expensive. Hence, they are not suitable for online use. In contrast, we utilize the RL problem structure and show that only an easily-computable proximal step is sucient for eciently achieving near-optimal performance. Recall symbol =r org. Via the performance dierence lemma, we can expand value func- tionsV ;1 (s 1 ) at the previously known policy k1 , V ;1 (s 1 ) = V k1 ;1 (s k 1 ) + E k1 " H X h = 1 Q ;h (s h ; ); ( h k1 h )(js h ) # whereE k1 is taken over the random state-action sequencef(s h ;a h )g H h = 1 . Thus, we introduce an approximation ofV ;1 (s 1 ) for any state-action sequencef(s h ;a h )g H h = 1 induced by, L k1 () = V k1 ;1 (s 1 ) + H X h = 1 Q k1 ;h (s h ; ); ( h k1 h )(js h ) 81 whereV k1 ;h andQ k1 ;h can be estimated from an optimistic policy evaluation that will be discussed in Section 3.3.3. With this notion, in each episode, instead of solving a Lagrangian-based policy optimization, we perform a simple policy update in online mirror descent fashion, maximize 2 (AjS;H) L k1 r () Y k1 bL k1 g () 1 H X h = 1 D KL h (js h )j ~ k1 h (js h ) where ~ k1 h (js h ) = (1) k1 h (js h ) + Unif A () is a mixed policy of the previous one and the uniform distribution Unif A with 2 (0; 1]. The constant > 0 is a trade-o parameter, D KL (j ~ k1 ) is the Kullback-Leibler (KL) divergence between and ~ k1 in which is abso- lutely continuous in ~ k1 . The policy mixing step ensures such absolute continuity and implies uniformly bounded KL divergence; see Lemma 54 in Appendix B.5. Ignoring other-irrelevant terms, we update k in terms of previous policy k1 by argmax 2(AjS;H) H X h = 1 (Q k1 r;h +Y k1 Q k1 g;h )(s h ; ); h (js h ) 1 H X h = 1 D KL h (js h )j ~ k1 h (js h ) : Since the above update is separable overH steps, we can update the policy k as line 6 in Algo- rithm 6, a closed-form solution for any steph2 [H]. If we setY k1 = 0 and = 0, the above update reduces to one step in an optimistic proximal policy optimization [47]. The idea of KL- divergence regularization in policy optimization has been widely used in many unconstrained scenarios [112, 194, 195, 233, 139]. Our method is distinct in that it is based on the performance dierence lemma and the optimistically estimated value functions. 82 Algorithm6 Optimistic Primal-Dual Proximal Policy OPtimization (OPDOP) 1: Initialization: LetfQ 0 r;h ;Q 0 g;h g H h = 1 be zero functions,f 0 h g h2 [H] be uniform distributions onA,V 0 g;1 beb,Y 0 be 0, be 2H= ,;> 0;2 (0; 1]. 2: for episodek = 1;:::;K + 1do 3: Set the initial states k 1 =s 1 . 4: for steph = 1; 2;:::;H do 5: Mix the previous policy by ~ k1 h (j) = (1) k1 h (j) + Unif A . 6: Update the current policy by k h (j) / ~ k1 h (j ) e Q k1 r;h +Y k1 Q k1 g;h (; ) . 7: Take an actiona k h k h (js k h ) and recieve reward/utilityr h (s k h ;a k h ); g h (s k h ;a k h ). 8: Observe the next states k h+1 . 
9: endfor 10: Update the dual variableY k byY k = P [ 0; ] Y k1 + (bV k1 g;1 (s 1 )) . 11: Estimate the state-action or value functionsfQ k r;h (;);Q k g;h (;);V k g;h ()g H h = 1 via LSTD fs h ;a h ;r h (s h ;a h );g h (s h ;a h )g H;k h; = 1 : 12: endfor 3.3.2 Dualupdate To infer the constraint violation for the dual update, we estimateV k g;1 (s 1 ) via an optimistic policy evaluation byV k1 g;1 (s 1 ) that is discussed in Section 3.3.3. We update the Lagrange multiplierY by movingY k to the direction of minimizing the LagrangianV ;Y L (s 1 ) overY 0 in line 10 of Algorithm 6, where> 0 is a stepsize andP [ 0; ] is a projection onto [0;] with an upper bound onY k . By Lemma 15, we choose = 2H= 2Y ? so that projection interval [ 0; ] includes the optimal dual variableY ? . This type design also nds in [85, 174]. The dual update works as a trade-o between the reward maximization and the constraint vio- lation reduction. If the current policy k satises the approximated constraint, i.e.,bL k1 g ( k ) 0, we put less weight on the state-action function associated with the utility and maximize the re- ward; otherwise, we sacrice the reward a bit to satisfy the constraint. The dual update has a sim- ilar use in dealing with constraints in constrained MDPs, e.g., Lagrangian-based actor-critic [56, 83 135], and online constrained optimization [263, 242, 265]. In contrast, we handle the dual update via the optimistic policy evaluation, yielding a simple, but ecient estimation on the constraint violation. Algorithm7 Least-squares temporal dierence (LSTD) with UCB exploration 1: Input:fs h ;a h ;r h (s h ;a h );g h (s h ;a h )g H;k h; = 1 . 2: Initialization: SetfV k r;H+1 ;V k g;H+1 g be zero functions and = 1; =O( p dH 2 log (dT=p)). 3: for steph =H;H 1; ; 1do 4: k ;h = k1 X = 1 ;h (s h ;a h ) ;h (s h ;a h ) > +I. 5: w k ;h = ( k ;h ) 1 k1 X = 1 ;h (s h ;a h )V ;h+1 (s h+1 ). 6: k ;h (;) = R S (;;s 0 )V k ;h+1 (s 0 )ds 0 . 7: k ;h (;) = ( k ;h (;) > ( k ;h ) 1 k ;h (;)) 1=2 . 8: k h = k1 X = 1 '(s h ;a h )'(s h ;a h ) > +I. 9: u k ;h = ( k h ) 1 k1 X = 1 '(s h ;a h ) h (s h ;a h ). 10: k h (;) = ('(;) > ( k h ) 1 '(;)) 1=2 . 11: Q k ;h (;) = min Hh + 1; '(;) > u k ;h + k ;h (;) > w k ;h + ( k h + k ;h )(;) + : 12: V k ;h () = Q k ;h (;); k h (j) A . 13: endfor 14: Return:fQ k ;h (;);V k ;h (;)g H h = 1 . 3.3.3 Policyevaluation The last stage of the kth episode takes the Least-Squares Temporal Dierence (LSTD) [41, 39, 125, 121] to evaluate the policy k based on previousk 1 historical trajectories. For each step 84 h2 [H], instead ofP h V k r;h+1 in the Bellman equations (3.1), we estimateP h V k r;h+1 by ( k r;h ) > w k r;h wherew k r;h is updated by the minimizer of the regularized least-squares problem overw, k1 X = 1 V r;h+1 (s h+1 ) r;h (s h ;a h ) > w 2 + kwk 2 2 (3.4) where r;h (; ) := R S (;;s 0 )V r;h+1 (s 0 )ds 0 , V r;h+1 () =hQ r;h+1 (; ); h+1 (j )i A forh2 [H1] andV H+1 = 0, and> 0 is the regularization parameter. Similarly, we estimateP h V k g;h+1 by ( k g;h ) > w k g;h . We display the least-squares solution in line 4–6 of Algorithm 7, where symbol =r org. We also estimater h (;) by' > u k r;h , whereu k r;h is updated by the minimizer of another regularized least-squares problem, k1 X = 1 r h (s h ;a h ) '(s h ;a h ) > u 2 + kuk 2 2 (3.5) where > 0 is the regularization parameter. Similarly, we estimate g h (;) by ' > u k g;h . The least-squares solutions lead to line 8–9 of Algorithm 7. 
After obtaining estimates ofP h V k ;h+1 and h (;) for = r org, we update the estimated state-action functionfQ k ;h g H h = 1 iteratively in line 11 of Algorithm 7, where' > u k ;h is an estimate of h and ( k ;h ) > w k ;h is an estimate ofP h V k ;h+1 ; we add UCB bonus terms k h (; ); k ;h (; ): SA!R + so that ' > u k ;h + k h and ( k ;h ) > w k ;h + k ;h all become their upper condence bounds. Here, the bonus terms take k h = (' > ( k h ) 1 ') 1=2 and k ;h =(( k ;h ) > ( k ;h ) 1 k ;h ) 1=2 and we leave the parameter > 0 to be tuned later. More- over, the bounded reward/utility h 2 [0; 1] impliesQ k ;h 2 [0;Hh + 1]. 85 We remark the computational eciency of Algorithm 6. For the time complexity, since line 6 is a scalar update, they need O(djAjT ) time. A dominating calculation is from lines 5/9 in Algorithm 7. If we use the Sherman–Morrison formula for computing ( k h ) 1 , it takesO(d 2 T ) time. Another important calculation is the integration from line 6 in Algorithm 7. We can either compute it analytically if it is tractable or approximate it via the Monte Carlo integration [284] that assumes polynomial time. Therefore, the time complexity isO(poly(d)jAjT ) in total. For the space complexity, we don’t need to store policy since it is recursively calculated via line 6 of Algorithm 6. By updatingY k , k h , k ;h , w k ;h , u k ;h , and h (s k h ;a k h ) recursively, it takesO((d 2 + jAj)H) space. 3.4 Regretandconstraintviolationanalysis We now prove that the regret and the constraint violation for Algorithm 6 are sublinear inT := KH, the total number of steps taken by the algorithm, whereK is the total number of episodes andH is the episode horizon. We recall thatjAj is the cardinality of action spaceA andd is the dimension of the feature map. Theorem16(LinearkernalMDP:regretandconstraintviolation) LetAssumptions7and8 hold. Fix p 2 (0; 1). We set = p logjAj=(H 2 K), = C 1 p dH 2 log (dT=p), = 1= p K, = 1=K, and = 1 in Algorithm 6, where C 1 is an absolute constant. Suppose logjAj = O d 2 log 2 (dT=p) . Then, with probability 1p, the regret and the constraint violation in (3.3) satisfy Regret(K) CdH 2:5 p T log dT p [Violation(K)] + C 0 dH 2:5 p T log dT p 86 whereC andC 0 are absolute constants. Theorem 16 establishes that Algorithm 6 enjoys an ~ O(dH 2:5 p T ) regret and an ~ O(dH 2:5 p T ) constraint violation if we set algorithm parametersf;;;;g properly. Our results have the optimal dependence on the total number of stepsT up to some logarithmic factors. Thed dependence occurs due to the uniform concentration for controlling the uctuations in the least- squares policy evaluation. This matches the existing bounds in the linear MDP setting without any constraints [47, 18, 284]. Our bounds dier from them only byH dependence, which is a price introduced by the uniform bound on the constraint violation. It is noticed that our algorithm works for bandit feedback of reward/utility functions after each episode. Regarding safe exploration, our violation bound provides nite-time convergence to the fea- sibility region dened by constraints. In the interaction with an unknown environment, the UCB exploration in the utility value function adds optimism towards constraint satisfaction. The dual update regularizes the policy improvement for governing actual constraint violation. Our regret and violation bounds readily lead to PAC guarantees [108]. Compared to most recent ref- erences [73, 251, 52, 271], our algorithm is sample-ecient in exploration and does not take any simulations of policy. 
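Before turning to the tabular specialization, the following minimal sketch illustrates the two computational primitives behind the policy evaluation in Algorithm 7: a regularized least-squares fit together with the associated elliptical (UCB) bonus, and a Sherman–Morrison rank-one update of $\Lambda^{-1}$ that realizes the $O(d^2)$ per-step cost mentioned above. The function names and the batch interface are assumptions made for illustration, not the exact implementation.

```python
import numpy as np

def lstd_ucb(Phi, targets, lam, beta):
    """Ridge regression plus an elliptical UCB bonus (one step of Algorithm 7, sketched).

    Phi     : (k-1) x d array of regression features (rows phi(s_h^tau, a_h^tau))
    targets : length-(k-1) array of targets, e.g. V_{h+1}(s_{h+1}^tau) or observed rewards
    lam     : ridge regularization parameter lambda
    beta    : bonus scaling parameter
    """
    d = Phi.shape[1]
    Lambda = Phi.T @ Phi + lam * np.eye(d)                      # Lambda_h^k
    Lambda_inv = np.linalg.inv(Lambda)
    w = Lambda_inv @ (Phi.T @ targets)                          # regularized least-squares weights
    bonus = lambda phi: beta * np.sqrt(phi @ Lambda_inv @ phi)  # beta * (phi^T Lambda^{-1} phi)^{1/2}
    return w, bonus

def sherman_morrison(Lambda_inv, phi):
    """Rank-one update of Lambda^{-1} after adding phi phi^T, in O(d^2) time."""
    v = Lambda_inv @ phi
    return Lambda_inv - np.outer(v, v) / (1.0 + phi @ v)
```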
We remark the tabular setting for Algorithm 6. The tabular CMDP(S;A;H;P;r;g;b) is a special case of Assumption 8 withjSj<1 andjAj<1. Letd 1 =jSj 2 jAj andd 2 =jSjjAj. We take the following feature maps (s;a;s 0 )2R d 1 ,'(s;a)2R d 2 , and parameter vectors, (s;a;s 0 ) = e (s;a;s 0 ) ; h = P h (;; ) '(s;a) = e (s;a) ; r;h = r h (; ); g;h = g h (; ) (3.6) 87 wheree (s;a;s 0 ) is a canonical basis ofR d 1 associated with (s;a;s 0 ) and h =P h (;; ) reads that for any (s;a;s 0 )2SAS, the (s;a;s 0 )th entry of h isP(s 0 js;a); similarly we denee (s;a) , r;h , and g;h . We can verify thatk h k p d 1 ,k r;h k p d 2 ,k g;h k p d 2 , and for any V : S! [0;H] and any (s;a)2SA, we havek P s 0 2S (s;a;s 0 )V (s 0 )k p jSjH p d 1 H. Therefore, we taked := max (d 1 ;d 2 ) =jSj 2 jAj in Assumption 8 for the tabular case. We now detail Algorithm 6 for the tabular case as follows. Our policy evaluation works with regression feature ;h :SA!R d 2 , ;h (s;a) = X s 0 (s;a;s 0 )V ;h+1 (s 0 ); for any (s;a)2SA where =r org. Thus, for any ( s; a; s 0 )2SAS, the ( s; a; s 0 )th entry of ;h (s;a) is given by ;h (s;a) ( s; a; s 0 ) = 1f(s;a) = ( s; a)gV ;h+1 ( s 0 ) which shows that ;h (s;a) is a sparse vector withjSj nonzero elements atf(s;a;s 0 );s 0 2Sg and the (s;a;s 0 )th entry of ;h (s;a) isV ;h+1 (s 0 ). For instance of =r, the regularized least-squares problem (3.4) becomes k1 X = 1 V r;h+1 (s h+1 ) X (s;a;s 0 ) 1f(s;a) = (s h ;a h )gV r;h+1 (s 0 )[w] (s;a;s 0 ) 2 + kwk 2 2 88 where [w] (s;a;s 0 ) is the (s;a;s 0 )th entry ofw, and the solutionw k r;h serves as an estimator of the transition kernelP h (j;). On the other hand, since'(s h ;a h ) = e (s h ;a h ) , the regularized least- squares problem (3.5) becomes k1 X = 1 r h (s h ;a h ) [u] (s h ;a h ) 2 + kuk 2 2 where [u] (s;a) is the (s;a)th entry ofu, the solutionu k r;h is an estimate ofr h (s;a) as'(s;a) > u k r;h . By adding similar UCB bonus terms k h , k r;h :SA!R given in Algorithm 7, we estimate the state-action function as follows, Q k r;h (s;a) = min [u k r;h ] (s;a) + k r;h (s;a) > w k r;h + ( k h + k r;h )(s;a); Hh + 1 + = min [u k r;h ] (s;a) + X s 0 2S V k r;h+1 (s 0 )[w k r;h ] (s;a;s 0 ) +( k h + k r;h )(s;a); Hh + 1 + for any (s;a)2SA. Thus,V k r;h (s) =hQ k r;h (s;); k h (js)i A . Similarly, we estimateg h (s;a) and thusQ k g;h (s;a) andV k g;h (s). Using already estimatedfQ k r;h (;);Q k g;h (;);V k r;h ();Q k g;h ()g H h = 1 , we execute the policy improvement and the dual update in Algorithm 6. We restate the result of Theorem 16 for the tabular case as follows. Corollary17(Regretandconstraintviolation) For the tabular constrained MDP with feature maps (3.6), let Assumption 7 hold. Fixp2 (0; 1). In Algorithm 6, we set = p logjAj=(H 2 K), 89 = C 1 p jSj 2 jAjH 2 log (jSjjAjT=p), = 1= p K, = 1=K, and = 1, whereC 1 is an absolute constant. Then, the regret and the constraint violation in (3.3) satisfy Regret(K) CjSj 2 jAjH 2:5 p T log jSjjAjT p Violation(K) C 0 jSj 2 jAjH 2:5 p T log jSjjAjT p with probability 1p whereC andC 0 are absolute constants. Proof. It follows the proof of Theorem 16 by noting that the tabular constrained MDP is a special linear MDP in Assumption 8, withd =jSj 2 jAj, and we have logjAjO (d 2 log (dT=p)) automatically. Algorithm8 Optimistic policy evaluation (OPE) 1: Input:fs h ;a h ;r h (s h ;a h );g h (s h ;a h )g H;k h; = 1 . 2: Initialization: Let = 1, = C 1 H p jSj log(jSjjAjT=p), and setfV k r;H+1 ;V k g;H+1 g be zero functions. 
3: for steph =H;H 1; ; 1do 4: =r,g 5: Compute counters n k h (s;a;s 0 ) and n k h (s;a) via (3.36) for all (s;a;s 0 )2SAS and (s;a)2SA. 6: Estimate reward/utility functions ^ r k h , ^ g k h via (3.37) for all (s;a)2SA. 7: Estimate transition ^ P k h via (3.38) for all (s;a;s 0 )2SAS, and take bonus k h = n k h (s;a) + 1=2 for all (s;a)2SA. 8: Q k ;h (;) = min Hh + 1; ^ k h (;) + P s 0 2S ^ P h (s 0 j;)V k ;h+1 (s 0 ) + 2 k h (;) + . 9: V k ;h () = Q k ;h (;); k h (j) A . 10: endfor 11: Return:fQ k r;h (; );Q k g;h (; )g H h = 1 . 90 3.4.1 Settinguptheanalysis Our analysis begins with decomposition of the regret given in (3.3). Regret(K) = K X k = 1 V ? r;1 (s 1 )V k r;1 (s 1 ) | {z } (R.I) + K X k = 1 V k r;1 (s 1 )V k r;1 (s 1 ) | {z } (R.II) (3.7) where we add and subtract the valueV k r;1 (s 1 ) estimated from an optimistic policy evaluation by Algorithm 7; the policy ? in hindsight is the best policy in hindsight for problem (3.2). To bound the total regret (3.7), we would like to analyze (R.I) and (R.II) separately. First, we dene the model prediction error for the reward as k r;h := r h + P h V k r;h+1 Q k r;h (3.8) for all (k;h)2 [K] [H], which describes the prediction error in the Bellman equations (3.1) usingV k r;h+1 instead ofV k r;h+1 . With this notation, we expand (R.I) into (R.I) = K X k = 1 H X h = 1 E ? h Q k r;h (s h ; ); ? h (js h ) k h (js h ) i + K X k = 1 H X h = 1 E ? h k r;h (s h ;a h ) i (3.9) where the rst double sum is linear in terms of the policy dierence and the second one describes the total model prediction error. The above expansion is based on the standard performance dierence lemma (see Remark 4). Meanwhile, if we dene the model prediction error for the utility as k g;h := g h + P h V k g;h+1 Q k g;h (3.10) 91 then, similarly, we can expand P K k = 1 V ? g;1 (s 1 )V k g;1 (s 1 ) into K X k = 1 H X h = 1 E ? Q k g;h (s h ; ); ? h (js h ) k h (js h ) + K X k = 1 H X h = 1 E ? h k g;h (s h ;a h ) i : (3.11) To analyze the constraint violation, we also introduce a useful decomposition, Violation(K) = K X k = 1 bV k g;1 (s 1 ) + K X k = 1 V k g;1 (s 1 )V k g;1 (s 1 ) | {z } (V.II) (3.12) which the inserted value V k g;1 (s 1 ) is estimated from an optimistic policy evaluation by Algo- rithm 7. For notational simplicity, we introduce the underlying probability structure as follows. For any (k;h)2 [K][H], we deneF k h;1 as a-algebra generated by state-action sequences, reward and utility functions, f(s i ;a i )g (;i)2 [k1][H] [ f(s k i ;a k i )g i2 [h] : Similarly, we deneF k h;2 as an-algebra generated by f(s i ;a i )g (;i)2 [k1][H] [ f(s k i ;a k i )g i2 [h] [ fs k h+1 g: Here,s k H+1 is a null state for anyk2 [K]. A sequence of-algebrasfF k h;m g (k;h;m)2 [K][H][2] is a ltration in terms of time index t(k;h;m) := 2(k 1)H + 2(h 1) + m (3.13) 92 which holds thatF k h;m F k 0 h 0 ;m 0 for any t t 0 . The estimated reward/utility value functions, V k r;h ;V k g;h , and the associated state-action functions,Q k r;h ;Q k g;h areF k 1;1 -measurable since they are obtained from previousk 1 historical trajectories. With these notations, we can expand (R.II) in (3.7) into (R.II) = K X k = 1 H X h = 1 k r;h (s k h ;a k h ) + M K r;H;2 (3.14) wherefM k r;h;m g (k;h;m)2[K][H][2] is a martingale adapted to the ltrationfF k h;m g (k;h;m)2[K][H][2] in terms of time indext. 
Similarly, we have it for (V.II), (V.II) = K X k = 1 H X h = 1 k g;h (s k h ;a k h ) + M K g;H;2 (3.15) wherefM k g;h;m g (k;h;m)2[K][H][2] is a martingale adapted to the ltrationfF k h;m g (k;h;m)2[K][H][2] in terms of time index t. We prove (3.14) in Appendix B.3 (also see [47, Lemma 4.2]); (3.15) is similar. We recall two UCB bonus terms in the state-action function estimation of Algorithm 7, k ;h := (( k ;h ) > ( k ;h ) 1 k ;h ) 1=2 and k h := ((') > ( k h ) 1 ') 1=2 By the UCB argument, if we set = 1 and = C 1 p dH 2 log(dT=p) whereC 1 is an absolute constant, then for any (k;h)2 [K] [H] and (s;a)2SA, we have 2( k h + k ;h )(s;a) k ;h (x;a) 0 (3.16) 93 with probability 1p=2 where the symbol = r org. We prove (3.16) in Appendix B.3.1 for completeness. In what follows we delve into the analysis of the regret and the constraint violation. 3.4.2 Proofofregretbound Our analysis begins with a primal-dual mirror descent type analysis for the policy update in line 6 of Algorithm 6. In Lemma 18, we present a key upper bound on the total dierences of estimated valuesV k r;1 (s 1 ) andV k g;1 (s 1 ) given by Algorithm 7 to the optimal ones. Lemma18(Policyimprovement: primal-dualmirrordescent) Let Assumptions 7-8 hold. In Algorithm 6, if we set = p logjAj=(H 2 p K) and = 1=K, then K X k = 1 V ? r;1 (s 1 )V k r;1 (s 1 ) + K X k = 1 Y k V ? g;1 (s 1 )V k g;1 (s 1 ) C 2 H 2:5 p T logjAj + K X k = 1 H X h = 1 E ? h k r;h (s h ;a h ) i + K X k = 1 H X h = 1 Y k E ? h k g;h (s h ;a h ) i (3.17) whereC 2 is an absolute constant andT =HK. Proof. We recall that line 6 of Algorithm 6 follows a solution k to the following subproblem, maximize 2 (AjS;H) H X h = 1 Q k1 r;h +Y k1 Q k1 g;h ; h 1 H X h = 1 D KL h j ~ k1 h (3.18) where Q k1 r;h +Y k1 Q k1 g;h ; h is a shorthand for (Q k1 r;h +Y k1 Q k1 g;h )(s h ; ); h (js h ) and the shorthandD KL ( h j ~ k1 h ) forD KL ( h (js h )j ~ k1 h (js h )) if dependence on the state-action sequencefs h ;a h g H h = 1 is clear from context. We note that (3.18) is in form of a mirror descent 94 subproblem in Lemma 53. We can apply the pushback property with x ? = k h ;y = ~ k1 h and z = ? h , H X h = 1 Q k1 r;h +Y k1 Q k1 g;h ; k h 1 H X h = 1 D KL k h j ~ k1 h H X h = 1 Q k1 r;h +Y k1 Q k1 g;h ; ? h 1 H X h = 1 D KL ? h j ~ k1 h + 1 H X h = 1 D KL ? h j k h : Equivalently, we write the above inequality as follows, H X h = 1 Q k1 r;h ; ? h k1 h + Y k1 H X h = 1 Q k1 g;h ; ? h k1 h H X h = 1 Q k1 r;h +Y k1 Q k1 g;h ; k h k1 h 1 H X h = 1 D KL k h j ~ k1 h + 1 H X h = 1 D KL ? h j ~ k1 h 1 H X h = 1 D KL ? h j k h : (3.19) By taking expectationE ? on both sides of (3.19) over the state-action sequencef(s h ;a h )g H h = 1 starting froms 1 , and applying decompositions (3.9) and (3.11), we have V ? r;1 (s 1 )V k1 r;1 (s 1 ) + Y k1 V ? g;1 (s 1 )V k1 g;1 (s 1 ) H X h = 1 E ? h Q k1 r;h +Y k1 Q k1 g;h ; k h k1 h i 1 H X h = 1 E ? D KL k h j ~ k1 h + 1 H X h = 1 E ? D KL ? h j ~ k1 h D KL ? h j k h + H X h = 1 E ? h k1 r;h (s h ;a h ) i + Y k1 H X h = 1 E ? h k1 g;h (s h ;a h ) i (3.20) 95 The rest is to bound the right-hand side of the above inequality (3.20). 
By the Hölder’s in- equality and the Pinsker’s inequality, we rst have H X h = 1 Q k1 r;h +Y k1 Q k1 g;h ; k h k1 h 1 H X h = 1 D KL k h j ~ k1 h = H X h = 1 Q k1 r;h +Y k1 Q k1 g;h ; k h ~ k1 h 1 H X h = 1 D KL k h j ~ k1 h + H X h = 1 Q k1 r;h +Y k1 Q k1 g;h ; ~ k1 h k1 h H X h = 1 Q k1 r;h +Y k1 Q k1 g;h 1 k h ~ k1 h 1 1 2 k h ~ k1 h 2 1 + H X h = 1 Q k1 r;h +Y k1 Q k1 g;h 1 ~ k1 h k1 h 1 : Then, using the square completion, Q k1 r;h +Y k1 Q k1 g;h 1 k h ~ k1 h 1 1 2 k h ~ k1 h 2 1 = 1 2 Q k1 r;h +Y k1 Q k1 g;h 1 k h ~ k1 h 1 2 + 2 Q k1 r;h +Y k1 Q k1 g;h 2 1 2 Q k1 r;h +Y k1 Q k1 g;h 2 1 where we dropo the rst quadratic term for the inequality, and ~ k1 h k1 h 1 , we have H X h = 1 Q k1 r;h +Y k1 Q k1 g;h ; k h k1 h 1 H X h = 1 D KL k h j ~ k1 h 2 H X h = 1 Q k1 r;h +Y k1 Q k1 g;h 2 1 + H X h = 1 Q k1 r;h +Y k1 Q k1 g;h 1 (1 +) 2 H 3 2 + (1 +)H 2 (3.21) 96 where the last inequality is due to Q k1 r;h 1 H, a fact from line 12 in Algorithm 7, and 0 Y k1 . Taking the same expectationE ? as previously on both sides of (3.21) and substituting it into the left-hand side of (3.20) yield, V ? r;1 (s 1 )V k1 r;1 (s 1 ) + Y k1 V ? g;1 (s 1 )V k1 g;1 (s 1 ) (1 +) 2 H 3 2 + (1 +)H 2 + 1 H X h = 1 E ? D KL ? h j ~ k1 h D KL ? h j k h + H X h = 1 E ? h k1 r;h (s h ;a h ) i + Y k1 H X h = 1 E ? h k1 g;h (s h ;a h ) i (1 +) 2 H 3 2 + (1 +)H 2 + H logjAj + 1 H X h = 1 E ? D KL ? h j k1 h D KL ? h j k h + H X h = 1 E ? h k1 r;h (s h ;a h ) i + Y k1 H X h = 1 E ? h k1 g;h (s h ;a h ) i : (3.22) where in the second inequality we note the fact thatD KL ( ? h j ~ k1 h )D KL ( ? h j k1 h ) logjAj from Lemma 54. We note thatY 0 is initialized to be zero. By taking a telescoping sum of both sides of (3.22) fromk = 1 tok =K + 1 and shifting the indexk by one, we have K X k = 1 V ? r;1 (s 1 )V k r;1 (s 1 ) + K X k = 1 Y k V ? g;1 (s 1 )V k g;1 (s 1 ) (1 +) 2 H 3 (K + 1) 2 + (1 +)H 2 (K + 1) + H(K + 1) logjAj + H logjAj + K X k = 1 H X h = 1 E ? h k r;h (s h ;a h ) i + K X k = 1 H X h = 1 Y k E ? h k g;h (s h ;a h ) i : (3.23) 97 where we ignore 1 P H h = 1 E ?[D( ? h j K+1 h )] and utilize D KL ? h j 0 h = X a2A ? h (ajs h ) log (jAj ? h (ajs h )) logjAj where 0 h is uniform overA and we ignore P a2A ? h (ajs h ) log ( ? h (ajs h )) that is nonpositive. Finally, we take :=H= and, in the lemma to complete the proof. By the dual update of Algorithm 6, we can simplify the result in Lemma 18 and return back to the regret (3.7). Lemma19 Let Assumptions 7 and 8 hold. In Algorithm 6, if we set = p logjAj=(H 2 p K), = 1= p K, and = 1=K, then with probability 1p=2 , Regret(K) = C 3 H 2:5 p T logjAj + K X k = 1 H X h = 1 E ? k r;h (s h ;a h ) k r;h (s k h ;a k h ) + M K r;H;2 (3.24) whereC 3 is an absolute constant. Proof. By the dual update in line 9 in Algorithm 6, we have 0 Y K+1 2 = K+1 X k=1 Y k 2 Y k1 2 = K+1 X k=1 P [ 0; ] Y k1 +(bV k1 g;1 (s 1 )) 2 Y k1 2 K+1 X k=1 Y k1 +(bV k1 g;1 (s 1 )) 2 Y k1 2 K+1 X k=1 2Y k1 V ? g;1 (s 1 )V k1 g;1 (s 1 ) + 2 bV k1 g;1 (s 1 ) 2 : 98 where we use the feasibility of ? in the last inequality. SinceY 0 = 0 andjbV k1 g;1 (s 1 )j H, the above inequality implies that K X k=1 Y k V ? g;1 (s 1 )V k g;1 (s 1 ) K+1 X k=1 2 bV k1 g;1 (s 1 ) 2 H 2 (K + 1) 2 : (3.25) By noting the UCB result (3.16) andY k 0, the inequality (3.17) implies that K X k = 1 V ? r;1 (s 1 )V k r;1 (s 1 ) + K X k = 1 Y k V ? g;1 (s 1 )V k g;1 (s 1 ) C 2 H 2:5 p T logjAj + K X k = 1 H X h = 1 E ? h k r;h (s h ;a h ) i : If we add (3.25) to the above inequality and take = 1= p K, then, K X k = 1 V ? 
r;1 (s 1 )V k r;1 (s 1 ) C 3 H 2:5 p T logjAj + K X k = 1 H X h = 1 E ? h k r;h (s h ;a h ) i (3.26) whereC 3 is an absolute constant. Finally, we combine (3.14) and (3.26) to complete the proof. By Lemma 19, the rest is to bound the last two terms in the right-hand side of (3.24). We next show two probability bounds for them in Lemma 20 and Lemma 21, separately. Lemma20(Modelpredictionerrorbound) Let Assumption 8 hold. Fix p2 (0; 1). If we set =C 1 p dH 2 log (dT=p) in Algorithm 6, then with probability 1p=2 it holds that K X k = 1 H X h = 1 E ? k r;h (s h ;a h ) k r;h (s k h ;a k h ) 4C 1 s 2d 2 H 3 T log (K + 1) log dT p (3.27) whereC 1 is an absolute constant andT =HK. 99 Proof. By the UCB result (3.16), with probability 1p=2 for any (k;h)2 [K] [H] and (s;a)2SA, we have 2( k h + k r;h )(s;a) k r;h (x;a) 0: By the denition of k r;h (s;a),j k r;h (s;a)j 2H. Hence, it holds with probability 1p=2 that k r;h (s;a) 2 min H; ( k h + k r;h )(s;a) for any (k;h)2 [K] [H] and (s;a)2SA. Therefore, we have K X k = 1 H X h = 1 E ? k r;h (s h ;a h )js 1 k r;h (s k h ;a k h ) 2 K X k = 1 H X h = 1 min H; ( k h + k r;h )(s k h ;a k h ) where k h (;) =('(;) > ( k h ) 1 '(;)) 1=2 and k r;h (;) =( k r;h (;) > ( k r;h ) 1 k r;h (;)) 1=2 . Ap- plication of the Cauchy-Schwartz inequality shows that K X k = 1 H X h = 1 min H; ( k h + k r;h )(s k h ;a k h ) K X k = 1 H X h = 1 min H ; '(s k h ;a k h ) > ( k h ) 1 '(s k h ;a k h ) 1=2 + k r;h (s k h ;a k h ) > ( k r;h ) 1 k r;h (s k h ;a k h ) 1=2 : (3.28) Since we take = C 1 p dH 2 log (dT=p) withC 1 > 1, we haveH= 1. The rest is to apply Lemma 52. First, for anyh2 [H] it holds that K X k = 1 k r;h s k h ;a k h > k r;h 1 k r;h s k h ;a k h 2 log det K+1 r;h det 1 r;h ! : 100 Due tok k r;h k p dH in Assumption 8 and 1 r;h = I in Algorithm 7, it is clear that for any h2 [H], K+1 r;h = K X k = 1 k r;h s k h ;a k h k r;h s k h ;a k h > + I (dH 2 K +)I: Thus, log det K+1 r;h det 1 r;h ! log det (dH 2 K +)I det(I) ! d log dH 2 K + : Therefore, K X k = 1 k r;h s k h ;a k h > k r;h 1 k r;h s k h ;a k h 2d log dH 2 K + : (3.29) Similarly, we can show that K X k = 1 ' s k h ;a k h > k h 1 ' s k h ;a k h 2d log dK + : (3.30) Applying the above inequalities (3.29) and (3.30) to (3.28) leads to K X k = 1 H X h = 1 min H; ( k h + k r;h )(s k h ;a k h ) H X h = 1 min K; K X k = 1 '(s k h ;a k h ) > ( k h ) 1 '(s k h ;a k h ) 1=2 + k r;h (s k h ;a k h ) > ( k r;h ) 1 k r;h (s k h ;a k h ) 1=2 ! H X h = 1 0 @ K K X k = 1 '(s k h ;a k h ) > ( k h ) 1 '(s k h ;a k h ) ! 1=2 + K K X k = 1 k r;h (s k h ;a k h ) > ( k r;h ) 1 k r;h (s k h ;a k h ) ! 1=2 1 A H X h = 1 p K 2d log dK + 1=2 + 2d log dH 2 K + 1=2 ! Finally, we set =C 1 p dH 2 log (dT=p) and = 1 to obtain (3.27). 101 Lemma21(Matingalebound) Fixp2 (0; 1). In Algorithm 6, it holds with probability 1p=2 that M K r;H;2 4 s H 2 T log 4 p (3.31) whereT =HK. Proof. In the verication of (3.14) (see Appendix B.3), we introduce the following martingale, M K r;H;2 = K X k = 1 H X h = 1 D k r;h;1 +D k r;h;2 where D k r;h;1 = I k h Q k r;h Q k ;k r;h (s k h ) Q k r;h Q k ;k r;h s k h ;a k h D k r;h;2 = P h V k r;h+1 P h V k ;k r;h+1 s k h ;a k h V k r;h+1 V k ;k r;h+1 s k h+1 where I k h f (s) := f(s;); k h (js) . Due to the truncation in line 11 of Algorithm 7, we know that Q k r;h ;Q k r;h ;V k r;h+1 ;V k r;h+1 2 [0;H]. This shows thatjD k r;h;1 j;jD k r;h;2 j 2H for all (k;h)2 [K] [H]. 
Application of the Azuma-Hoeding inequality yields, P M K r;H;2 s 2 exp s 2 16H 2 T : Forp2 (0; 1), if we sets = 4H p T log (4=p), then the inequality (3.31) holds with probability at least 1p=2. 102 We now are ready to show the desired regret bound. Applying (3.27) and (3.31) to the right- hand side of the inequality (3.24), we have Regret(K) C 3 H 2:5 p T logjAj + 2C 1 s 2d 2 H 3 T log (K + 1) log dT p + 4 s H 2 T log 4 p with probability 1p whereC 1 ;C 3 are absolute constants. Then, with probability 1p it holds that Regret(K) CdH 2:5 p T log dT p whereC is an absolute constant. 3.4.3 Proofofconstraintviolation In Lemma 18, we have provided a useful upper bound on the total dierences that are weighted by the dual updateY k . To extract the constraint violation, we rst rene Lemma 18 as follows. Lemma22(Policyimprovement: renedprimal-dualmirrordescent) LetAssumptions7- 8 hold. In Algorithm 6, if we set = p logjAj=(H 2 p K), = 1=K, and = 1= p K, then Then, for anyY2 [0;], with probability 1p=2 , K X k = 1 V ? r;1 (s 1 )V k r;1 (s 1 ) + Y K X k = 1 bV k g;1 (s 1 ) C 4 H 2:5 p T logjAj (3.32) whereC 4 is an absolute constant,T =HK, and :=H= . 103 Proof. By the dual update in line 10 in Algorithm 6, for anyY2 [0;] we have jY k+1 Yj 2 = P [ 0; ] Y k +(bV k g;1 (s 1 )) P [ 0; ] (Y ) 2 Y k +(bV k g;1 (s 1 ))Y 2 Y k Y 2 + 2 bV k g;1 (s 1 ) Y k Y + 2 H 2 where we apply the non-expansiveness of projection in the rst inequality andjbV k g;1 (s 1 )jH for the last inequality. By summing the above inequality fromk = 1 tok =K, we have 0 jY K+1 Yj 2 = jY 1 Yj 2 + 2 K X k = 1 bV k g;1 (s 1 ) Y k Y + 2 H 2 K which implies that K X k = 1 bV k g;1 (s 1 ) YY k 1 2 jY 1 Yj 2 + 2 H 2 K: By adding the above inequality to (3.23) in Lemma 18 and noting thatV ? ;k g;1 (s 1 )b and the UCB result (3.16), we have K X k = 1 V ? r;1 (s 1 )V k r;1 (s 1 ) + Y K X k = 1 bV k g;1 (s 1 ) (1 +) 2 H 3 (K + 1) 2 + (1 +)H 2 (K + 1) + H(K + 1) logjAj + H logjAj + 1 2 jY 1 Yj 2 + 2 H 2 K: By taking =H= , and,, in the lemma, we complete the proof. 104 According to Lemma 22, we can multiply (3.15) byY 0 and add it, together with (3.14), to (3.32), K X k = 1 V ? r;1 (s 1 )V k r;1 (s 1 ) + Y K X k = 1 bV k g;1 (s 1 ) C 4 H 2:5 p T logjAj K X k = 1 H X h = 1 k r;h (s k h ;a k h ) Y K X k = 1 H X h = 1 k g;h (s k h ;a k h ) + M K r;H;2 + YM K g;H;2 : (3.33) We now are ready to show the desired constraint violation bound. We note that there exists a policy 0 such thatV 0 r;1 (s 1 ) = 1 K P K k = 1 V k r;1 (s 1 ) andV 0 g;1 (s 1 ) = 1 K P K k = 1 V k g;1 (s 1 ). By the occu- pancy measure method [11],V k r;1 (s 1 ) andV k g;1 (s 1 ) are linear in terms of an occupancy measure induced by policy k and initial states 1 . Thus, an average ofK occupancy measures is still an occupancy measure that produces policy 0 with values V 0 r;1 (s 1 ) and V 0 g;1 (s 1 ). Particularly, we takeY = 0 when P K k = 1 bV k g;1 (s 1 ) < 0; otherwiseY =. Therefore, we have V ? r;1 (s 1 ) 1 K K X k = 1 V k r;1 (s 1 ) + " b 1 K K X k = 1 V k g;1 (s 1 ) # + = V ? r;1 (s 1 )V 0 r;1 (s 1 ) + h bV 0 g;1 (s 1 ) i + C 4 H 2:5 p T logjAj K 1 K K X k = 1 H X h = 1 k r;h (s k h ;a k h ) K K X k = 1 H X h = 1 k g;h (s k h ;a k h ) + 1 K M K r;H;2 + K M K g;H;2 C 4 H 2:5 p T logjAj K + 1 K K X k = 1 H X h = 1 ( k h + k r;h )(s k h ;a k h ) + K K X k = 1 H X h = 1 ( k h + k g;h )(s k h ;a k h ) + 1 K M K r;H;2 + K M K g;H;2 (3.34) where we apply the UCB result (3.16) for the last inequality. 105 Finally, we recall two immediate results of Lemma 20 and Lemma 21. 
Fixp2 (0; 1), the proof of Lemma 20 also shows that with probability 1p=2, K X k = 1 H X h = 1 ( k h + k ;h ) s k h ;a k h C 1 s 2d 2 H 3 T log (K + 1) log dT p (3.35) and the proof of Lemma 21 shows that with probability 1p=2, M K g;H;2 4 s H 2 T log 4 p : If we take logjAj =O(d 2 log 2 (dT=p)), (3.34) implies that with probability 1p we have V ? r;1 (s 1 )V 0 r;1 (s 1 ) + h bV 0 g;1 (s 1 ) i + C 5 dH 2:5 p T log dT p where C 5 is an absolute constant. Finally, by noting our choice of 2Y ? , we can apply Lemma 49 to conclude that [Violation(K)] + C 0 dH 2:5 p T log dT p with probability 1p, whereC 0 is an absolute constant. 106 3.5 Furtherresultsonthetabularcase The proof of Theorem 16 is generic, since it is ready to achieve sublinear regret and constraint violation bounds as long as the policy evaluation is sample-ecient, e.g., the UCB design ofopti- misminthefaceofuncertainty. In what follows, we introduce another ecient policy evaluation for line 11 of Algorithm 6 in the tabular case. Let us rst introduce some notation. For any (h;k)2 [H] [K], any (s;a;s 0 )2SAS, and any (s;a)2SA, we dene two visitation countersn k h (s;a;s 0 ) andn k h (s;a) at steph in episodek, n k h (s;a;s 0 ) = k1 X = 1 1f(s;a;s 0 ) = (s h ;a h ;a h+1 )g n k h (s;a) = k1 X = 1 1f(s;a) = (s h ;a h )g: (3.36) This allows us to estimate reward function r, utility function g, and transition kernel P h for episodek by ^ r k h (s;a) = k1 X = 1 1f(s;a) = (s h ;a h )gr h (s h ;a h ) n k h (s;a) + ^ g k h (s;a) = k1 X = 1 1f(s;a) = (s h ;a h )gg h (s h ;a h ) n k h (s;a) + (3.37) ^ P k h (s 0 js;a) = n k h (s;a;s 0 ) n k h (s;a) + (3.38) for all (s;a;s 0 )2SAS, (s;a)2SA where > 0 is the regularization parameter. Moreover, we introduce the bonus term k h :SA!R, k h (s;a) = n k h (s;a) + 1=2 which adapts the counter-based bonus terms in [19, 108], where > 0 is to be determined later. Using the estimated transition kernelsf ^ P k h g H h = 1 , reward/utility functionsf^ r k h ; ^ g k h g H h = 1 , and the bonus termsf k h g H h = 1 , we now can estimate the state-action function via line 7 of Algorithm 8 for 107 any (s;a)2SA, where =r org. Thus,V k ;h (s) =hQ k ;h (s;); k h (js)i A . We summarize the above procedure in Algorithm 8. Using already estimatedfQ k r;h (;);Q k g;h (;)g H h = 1 , we execute the policy improvement and the dual update in Algorithm 6. As Theorem 16, we provide theoretical guarantees in Theorem 23, which improves (jSj;jAj) dependence in Theorem 16 for the tabular case and also matchesjSj dependence in [85, 187]. It is worthy mentioning our Algorithm 6 is generic in handling an innite state space. Theorem23(Tabularcase: regretandconstraintviolation) Let Assumption 7 hold and let Assumption8holdwithfeatures (3.6). Fixp2 (0; 1). InAlgorithm6,weset = p logjAj=(H 2 K), =C 1 H p jSj log(jSjjAjT=p), = 1= p K, = 1=K,and = 1whereC 1 isanabsoluteconstant. Then, with probability 1p, the regret and the constraint violation in (3.3) satisfy Regret(K) CjSj p jAjH 5 T log jSjjAjT p [Violation(K)] + C 0 jSj p jAjH 5 T log jSjjAjT p whereC andC 0 are absolute constants. Proof. See Appendix B.1. 3.6 Concludingremarks We have developed a provably ecient safe reinforcement learning algorithm in the linear MDP setting. The algorithm extends the proximal policy optimization to constrained MDPs by incor- porating the UCB exploration. We prove that the proposed algorithm achieves an ~ O( p T ) regret and an ~ O( p T ) constraint violation under mild conditions, whereT is the total number of steps 108 taken by the algorithm. 
Our algorithm works in the setting where reward/utility functions are given by bandit feedback. To the best of our knowledge, our algorithm is the rst provably e- cient online policy optimization algorithm for constrained MDPs in the function approximation setting. 109 PartII Reinforcementlearningformulti-agentcontrolsystems 110 Chapter4 Multi-agenttemporal-dierencelearningformulti-agent MDPs In this chapter, we study the policy evaluation problem in multi-agent reinforcement learning where a group of agents, with jointly observed states and private local actions and rewards, col- laborate to learn the value function of a given policy via local computation and communication over a connected undirected network. This problem arises in various large-scale multi-agent sys- tems, including power grids, intelligent transportation systems, wireless sensor networks, and multi-agent robotics. When the dimension of state-action space is large, the temporal-dierence learning with linear function approximation is widely used. In this chapter, we develop a new dis- tributed temporal-dierence learning algorithm and quantify its nite-time performance. Our al- gorithm combines a distributed stochastic primal-dual method with a homotopy-based approach to adaptively adjust the learning rate in order to minimize the mean-square projected Bellman error by taking fresh online samples from a causal on-policy trajectory. We explicitly take into account the Markovian nature of sampling and improve the best-known nite-time error bound fromO(1= p T ) toO(1=T ), whereT is the total number of iterations. 111 4.1 Introduction Temporal-dierence (TD) learning is a central approach to policy evaluation in modern reinforce- ment learning (RL) [214]. It was introduced in [213, 31, 26] and signicant advances have been made in a host of single-agent decision-making applications, including Atari or Go games [165, 203]. Recently, TD learning has been used to address multi-agent decision making problems for large-scale systems, including power grids [163], intelligent transportation systems [120], wire- less sensor networks [182], and multi-agent robotics [119]. Motivated by these applications, we introduce an extension of TD learning to a distributed setting of policy evaluation. This setup involves a group of agents that communicate over a connected undirected network. All agents share a joint state and dynamics of state transition are governed by the local actions of agents which follow a local policy and own a private local reward. To maximize the total reward, i.e., the sum of all local rewards, it is essential to quantify performance that each agent achieves if it follows a particular policy while interacting with the environment and using only local data and information exchange with its neighbors. This task is usually referred to as a distributed policy evaluation problem and it has received signicant recent attention [148, 155, 127, 232, 48, 76, 77, 212, 196]. In the context of distributed policy evaluation, several attempts have been made to extend TD algorithms to a multi-agent setup using linear function approximators. When the reward is global and actions are local, mean square convergence of a distributed primal-dual gradient temporal- dierence (GTD) algorithm for minimizing mean-square projected Bellman error (MSPBE), with diusion updates, was established in [148] and an extension to time-varying networks was made in [208]. 
In [155], the authors combined the gossip averaging scheme [40] with TD(0) and showed asymptotic convergence. In the off-line setting, references [232] and [48] proposed different consensus-based primal-dual algorithms for minimizing a batch-sampled version of MSPBE with linear convergence; a fully asynchronous gossip-based extension was studied in [196] and its communication efficiency was analyzed in [188]. To gain data efficiency, the recent focus of multi-agent TD learning research has shifted to finite-time or finite-sample performance analysis. For distributed TD(0) and TD($\lambda$) with local rewards, an $O(1/T)$ error bound was established in [76] and [77], respectively. Linear convergence of distributed TD(0) to a neighborhood of the stationary point was proved in [212], and an $O(1/\sqrt{T})$ error bound for distributed GTD was shown in [127]. In [78], an $O(1/T^{2/3})$ error bound was provided for a distributed variant of a two-time-scale stochastic approximation algorithm. Apart from [77, 212], other finite-sample results rely on i.i.d. state sampling in policy evaluation. In most RL applications, this assumption is overly restrictive because of the Markovian nature of state trajectory samples. In [96], an example was provided to demonstrate that convergence guarantees based on i.i.d. sampling can fail to hold when samples become correlated. It is thus relevant to examine how to design an online distributed learning algorithm for the policy evaluation problem (e.g., MSPBE minimization) in the Markovian setting. Such distributed learning algorithms are essential in multi-agent RL; e.g., see the distributed variant of the policy gradient theorem [275] along with recent surveys [273, 272, 128].

In Section 4.2, we introduce a class of multi-agent stochastic saddle point problems that contains, as a special instance, minimization of the mean square projected Bellman error via distributed TD learning. In Section 4.3, we develop a homotopy-based online distributed primal-dual algorithm to solve this problem and establish a finite-time performance bound for the proposed algorithm. In Section 4.4, we prove the main result; in Section 4.5, we offer computational experiments to demonstrate the merits and the effectiveness of our theoretical findings; and, in Section 4.6, we close the chapter with concluding remarks.

4.2 Problem formulation and background

In this section, we formulate a multi-agent stochastic saddle point problem over a connected undirected network. The motivation for studying this class of problems comes from distributed reinforcement learning, where a group of agents with jointly observed states and private local actions/rewards collaborate to learn the value function of a given policy via local computation and communication. We exploit the structure of the underlying optimization problem to demonstrate that it enables unbiased estimation of the saddle point objective from Markovian samples. Furthermore, we discuss an algorithm that is convenient for distributed implementation and finite-time performance analysis.
4.2.1 Multi-agentstochasticoptimizationproblem We consider a stochastic optimization problem over a connected undirected networkG = (V;E) withN agents, minimize x2X 1 N N X j = 1 f j (x) (4.1a) whereV =f1;:::;Ng is the set of nodes,EVV is the set of edges,x is the optimization variable,XR d is a convex set, andf j :X!R is a local objective function determined by, f j (x) = max y j 2Y E [ j (x;y j ;) ] | {z } j (x;y j ) : (4.1b) 114 Here,y j is a local variable that belongs to a convex setY R d and j (x;y j ;) is a stochastic function of a random variable which is distributed according to the stationary distribution of a Markov chain. Equivalently, problem (4.1) can be cast as a multi-agent stochastic saddle point problem, minimize x2X maximize y j 2Y 1 N N X j = 1 j (x;y j ) (4.2) with the primal variablex and the dual variabley := (y 1 ;:::;y N ). Each agentj can only communicate with its neighbors over the networkG and receive sam- ples from a stochastic process that converges to the stationary distribution . Finite optimal solution (x ? ;y ? ) to the saddle point problem (4.2) satises x ? := argmin x2X 1 N N X j = 1 j (x;y ? j ) y ? j := argmax y j 2Y j (x ? ;y j ) and the primary motivation for studying this class of problems comes from multi-agent rein- forcement learning where a group of agents with jointly observed states and private local ac- tions/rewards collaborate to learn the value function of a given policy via local computation and communication. Although our theory and algorithm can be readily extended to other settings, we restrict our attention to the Markovian structure in the context of policy evaluation. Formu- lation (4.1) arises in a host of large-scale multi-agent systems, e.g., in supervised learning [261] and in nonparametric regression [67], and we describe it next. 115 4.2.2 Multi-agentMarkovdecisionprocess Let us consider a control system described by a Markov decision process (MDP) over a connected undirected networkG withN agents, the state spaceS, and the joint action spaceA :=A 1 A N . Without loss of generality we assume that, for each agentj, the local action spaceA j is the same for all states. LetP(s 0 js;a) is the transition probability from states to states 0 under a joint actiona = (a 1 ;:::;a N )2A and letr j (s;a) be the local reward received by agentj that corresponds to the pair (s;a). The multi-agent MDP can be represented by the tuple, S;fA j g N j=1 ;P;fr j g N j=1 ; where 2 (0; 1) is a xed discount factor. When the states, actions, and rewards are globally observable, a multi-agent MDP problem simplies to the classical single-agent problem and the centralized controller can be utilized to solve it. In many applications (e.g., see [273, Section 12.4.1.2]), both the actions a j 2A j and the rewardsr j (s;a) are private and every agent can only communicate with its neighbors over the networkG. It is thus critically important to extend single-agent TD learning algorithms to a setup in which only local information exchange is available. We consider a cooperative learning task in which agents aim to maximize the total reward (1=N) P j r j (s;a) and, in Fig. 4.1, we illustrate the interaction between the environment and the agents. Let:SA! [0; 1] be a joint policy which species the probability to take an action 116 r j a j Environment Agent j s Figure 4.1: A system with six agents that communicate over a connected undirected network. Each agent interacts with the environment by receiving a private reward and taking a local action. a2A at states2S. 
We dene the global rewardR (s) at states2S under policy to be the expected value of the average of all local rewards, R (s) = 1 N N X j = 1 R j (s) (4.3) whereR j (s) :=E a(js) [r j (s;a) ]. For any xed joint policy, the multi-agent MDP becomes a Markov chain overS with the probability transition matrixP 2R jSjjSj , whereP s;s 0 = P a2A (ajs)P(s 0 js;a) is the (s;s 0 )- element of P . If the Markov chain associated with the policy is aperiodic and irreducible then, for any initial state, it converges to the unique stationary distribution with a geometric rate [132]; see Assumption 14 for a formal statement. 117 4.2.3 Multi-agentpolicyevaluationandtemporal-dierencelearning Let the value function of a policy,V :S!R, be dened as the expected value of discounted cumulative rewards, V (s) = E " 1 X p = 0 p R (s p ) s 0 =s; # wheres 0 = s is the initial state. If we arrangeV (s) andR (s) over all statess2S into the vectorsV andR , the Bellman equation forV can be written as [185], V = P V + R : (4.4) Since it is challenging to evaluateV directly for a large state space, we approximateV using a family of linear functionsfV x (s) = (s) > x; x2 R d g, where x2 R d is the vector of un- known parameters and:S!R d is a known dictionary consisting ofd features. If we arrange fV x (s)g s2S into the vector V x 2 R jSj , we have V x = x where the ith row of the matrix 2R jSjd is given by(s i ) > . Since the dictionary is a function determined by, e.g., polynomial basis, it is not restrictive to assume that the matrix has the full column rank [31]. The goal of policy evaluation now becomes to determine the vector of feature weightsx2R d so that V x approximates the true value function V . The objective of a typical TD learning method is to minimize the mean square Bellman error (MSBE) [215], 1 2 kV x P V x R k 2 D where D := diagf(s);s2Sg2 R jSjjSj is a diagonal matrix determined by the stationary distribution . As discussed in [216], the solution to the xed point problemV x = P V x +R 118 may not exist because the right-hand-side may not stay in the column space of the matrix . To address this challenge, GTD algorithm [216] was proposed to minimize the mean square projected Bellman error (MSPBE), f(x) := 1 2 kP (V x P V x R )k 2 D via stochastic-gradient-type updates, where P := ( > D) 1 > D is a projection operator onto the column space of . Equivalently, MSPBE can be compactly written as, f(x) = 1 2 kAx bk 2 C 1 (4.5a) whereA,b, andC are obtained by taking expectations over the stationary distribution , A := E s [(s)((s) (s 0 )) > ] b := E s [R (s)(s) ] C := E s [(s)(s) > ]: (4.5b) Assumption9 There exists a feature matrix such that the matricesA andC are full rank and positive denite, respectively. In [31, page 300], it was shown that the full column rank matrix yields a full rank A and a positive deniteC and that the objective functionf in (4.5) has a unique minimizer. Nevertheless, whenA,b, andC are replaced by their sampled versions it is challenging to solve (4.5) because their nonlinear dependence on the underlying samples introduces bias in the objective function 119 f. In what follows, we address the sampling challenge by reformulating (4.5) in terms of a saddle- point objective. Since the global rewardR (s) in (4.3) is determined by the average of all local rewardsR j (s), we can express the vector b as b = (1=N) P j b j , where b j := E s [R j (s)(s) ]. 
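To make the quantities in (4.5) and the decomposition b = (1/N) Σ_j b_j concrete, the following Python sketch builds A, b_j, and C from the stationary distribution of a small policy-induced Markov chain and evaluates the MSPBE. The particular chain, feature map, and local rewards are illustrative assumptions and not part of the formulation above; only the algebraic structure matches (4.5b).

```python
import numpy as np

# Illustrative setup (assumed): 4 states, 2 features, N = 3 agents, discount 0.95.
np.random.seed(0)
S, d, N, gamma = 4, 2, 3, 0.95
P = np.random.rand(S, S); P /= P.sum(axis=1, keepdims=True)   # policy-induced transition matrix
Phi = np.random.rand(S, d)                                     # feature matrix, rows are phi(s)^T
R_local = np.random.rand(N, S)                                 # local rewards R_j(s)

# Stationary distribution of the chain (left eigenvector associated with eigenvalue 1).
w, V = np.linalg.eig(P.T)
pi = np.real(V[:, np.argmin(np.abs(w - 1))]); pi /= pi.sum()
D = np.diag(pi)

# Expectations in (4.5b): A = E[phi(s)(phi(s) - gamma*phi(s'))^T], C = E[phi(s)phi(s)^T],
# local b_j = E[R_j(s) phi(s)], and the global b is the average of the b_j.
A = Phi.T @ D @ (Phi - gamma * P @ Phi)
C = Phi.T @ D @ Phi
b_local = np.array([Phi.T @ D @ R_local[j] for j in range(N)])
b = b_local.mean(axis=0)

def mspbe(x):
    """MSPBE f(x) = 0.5 * ||A x - b||^2 in the C^{-1}-weighted norm, as in (4.5a)."""
    r = A @ x - b
    return 0.5 * r @ np.linalg.solve(C, r)

x_star = np.linalg.solve(A, b)          # unique minimizer when A is full rank (Assumption 9)
print(mspbe(np.zeros(d)), mspbe(x_star))  # the latter is numerically zero
```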
Thus, the problem of minimizing MSPBE (4.5) can be cast as minimize x2X 1 N N X j = 1 1 2 kAx b j k 2 C 1 (4.6) wheref j (x) := 1 2 kAxb j k 2 C 1 is the local MSPBE for the agentj andXR d is a convex set that contains the unique minimizer off(x) = (1=N) P j f j (x). Hence, it is sucient to restrict X to be a compact set containing the minimizer. A decentralized stochastic optimization problem (4.6) with N private stochastic objectives involves products and inverses of the expectations; cf. (4.5b). This unique feature of MSPBE is not encountered in typical distributed optimization settings [173, 82] and it makes the problem of obtaining an unbiased estimator of the objective function from a few state samples challenging. Using Fenchel duality, we can express each local MSPBE in (4.6) as f j (x) = max y j 2Y y > j (Axb j ) 1 2 y > j Cy j (4.7) wherey j is a dual variable andYR d is a convex compact set such thatC 1 (Axb j )2Y for allx2X . SinceC is a positive denite matrix andX is a compact set, suchY exists. In fact, one could take a ball centered at the origin with a radius greater than (1= min (C)) supkAxb j k, where the supremum is taken overx2X andj2f1;:::;Ng. Thus, we can reformulate (4.6) as 120 a decentralized stochastic saddle point problem (4.2) with objective j (x;y j ) = y > j (Axb j ) 1 2 y > j Cy j and compact convex domain setsX andY. By replacing expectations in the expressions forA,b j , andC with their samples that arise from the stationary distribution , we obtain an unbiased estimate of the saddle point objective j . We also note that each agentj indeed takes a local MSPBE as its local objective functionf j (x) = 1 2 kAxb j k 2 C 1 . Since the stationary distribution is not known, it is not possible to directly estimate A, b, andC. However, as we explain next, the policy evaluation problem allows correlated sampling according to a Markov process. 4.2.4 Standardstochasticprimal-dualalgorithm When i.i.d. samples from the stationary distribution are available and a centralized controller exists, the stochastic approximation method can be used to compute the solution to (4.2) with a convergence rateO(1= p T ) in terms of the primal-dual gap [175]. The stochastic primal-dual algorithm generates two pairs of vectors (x 0 (t);y 0 (t)) and (x(t);y(t)) that are contained inX Y N , wheret is a positive integer,X;Y R d are convex projection sets. It is sucient to take bounded sets X and Y containing the nite solution to (4.2). At iteration t, the primal-dual updates are given by x 0 (t + 1) = x 0 (t) (t)G x (x(t);y(t); t ) x(t + 1) = P X (x 0 (t + 1) ) y 0 (t + 1) = y 0 (t) + (t)G y (x(t);y(t); t ) y(t + 1) = P Y N (y 0 (t + 1) ) (4.8) 121 where(t) is a non-increasing sequence of stepsizes, operatorsP X (x 0 ) := argmin x2X kxx 0 k andP Y N (y 0 ) := argmin y2Y Nky5y 0 k are Euclidean projections ontoX andY N , and the sam- pled gradients are given by G x (x(t);y(t); t ) = r x (x(t);y(t); t ) G y (x(t);y(t); t ) = r y (x(t);y(t); t ) In our multi-agent MDP setup, however, each agent receives samples t from a Markov process whose state distribution at timet isP t , whereP t converges to the unknown distribution with a geometric rate. Thus, i.i.d. samples from the stationary distribution are not available. Since i.i.d. 
sampling-based convergence guarantees may not hold for correlated samples [96], it is im- portant to examine the ergodic stochastic optimization scenario in which samples are taken from a stochastic process [83]; a recent application for the centralized GTD can be found in [238]. In particular, we are interested in designing and analyzing distributed algorithms for stochastic saddle point problem (4.2) in the ergodic setting. 4.3 Algorithmandperformance We now present the main results of the chapter: a fast algorithm for the multi-agent learning. We propose a distributed stochastic primal-dual algorithm in Section 4.3.1, introduce underlying assumptions in Section 4.3.2, and establish a nite-time performance bound in Section 4.3.3. 122 Algorithm9 Distributed Homotopy Primal-Dual Algorithm Initialization:x j;1 (1) =x 0 j;1 (1) = 0,y j;1 (1) =y 0 j;1 (1) = 0 for allj2V;T 1 , 1 ,K Fork = 1 toK do 1. Fort = 1 toT k 1do • Primal update, x 0 j;k (t + 1) = N X i = 1 W ij x 0 i;k (t) k G j;x (x j;k (t);y j;k (t); k;t ) x j;k (t + 1) = P X (x 0 j;k (t + 1) ): • Dual update, y 0 j;k (t + 1) = y 0 j;k (t) + k G j;y (x j;k (t);y j;k (t); k;t ) y j;k (t + 1) = P Y (y 0 j;k (t + 1) ): endfor 2. Restart initialization, (x j;k+1 (1); y j;k+1 (1) ) = ( ^ x j;k ; ^ y j;k ) (see (4.9)) x 0 j;k+1 (1); y 0 j;k+1 (1) = (x j;k+1 (1); y j;k+1 (1) ): 3. Update stepsize, horizon: k+1 = 1 2 k ; T k+1 = 2T k . endfor Output: (^ x j;K ; ^ y j;K ) for allj2V 4.3.1 Distributedhomotopyprimal-dualalgorithm In this section, we extend stochastic primal-dual algorithm (4.8) to the multi-agent learning set- ting. To solve the stochastic saddle point program (4.2) in a distributed manner, the algorithm op- erates 2N primal-dual pairs of vectorsz j;k (t) := (x j;k (t);y j;k (t)) andz 0 j;k (t) := (x 0 j;k (t);y 0 j;k (t)), 123 which belong to projection setXY . In thekth iteration round at timet, thejth agent computes local gradient using the private objective j (z j;k (t); k;t ), G j (z j;k (t); k;t ) := 2 6 6 4 G j;x (z j;k (t); k;t ) G j;y (z j;k (t); k;t ) 3 7 7 5 and receives the vectorsfx 0 i;k (t);i 2 N j g from its neighborsN j . Here, G j;x (z j;k (t); k;t ) and G j;y (z j;k (t); k;t ) are gradients of j (z j;k (t); k;t ) with respect tox j;k (t) andy j;k (t), respectively. The primal iteratex j;k (t) is updated using a convex combination of the vectorsfx 0 i;k (t);i2 N j g and the dual iteratey j;k (t) is modied using the mirror descent update. We model the convex combination as a mixing process over the graphG and assume that the mixing matrix W is a doubly stochastic, N X i = 1 W ij = X i2N j W ij = 1; for allj2V N X j = 1 W ij = X j2N i W ij = 1; for alli2V where W ij > 0 for (i;j) 2 E. For a given learning rate k , each agent updates primal and dual variables according to Algorithm 9, whereP X ( ) andP Y ( ) are Euclidean projections onto bounded setsX andY . In practice, these projections are easily computable,P kkr ( 0 ) = r 0 =k 0 k whenk 0 k>r and is simply 0 otherwise. The iteration countersk andt are used in our Distributed Homotopy Primal-Dual (DHPD) Al- gorithm, i.e., Algorithm 9. The homotopy approach varies certain parameter for multiple rounds, where each round takes an estimated solution from the previous round as a starting point. We use the learning rate as a homotopy parameter in our algorithm. At the initial round k = 1, 124 problem (4.2) is solved with a large learning rate 1 and, in subsequent iterations, the learning rate is gradually decreased until a desired error tolerance is reached. 
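The following Python sketch outlines the overall structure of Algorithm 9: restarted rounds in which the learning rate is halved and the horizon is doubled, a primal consensus-plus-gradient step with Euclidean projection, and a dual ascent step with Euclidean projection. The mixing matrix `W`, the sampled-gradient oracles `G_x` and `G_y`, the Markovian sample stream `xi_stream`, and the radii and initial parameters are placeholders that a user would supply; this is a minimal sketch of the loop, not a full implementation.

```python
import numpy as np

def proj_ball(v, r):
    """Euclidean projection onto the ball {v : ||v|| <= r}."""
    nrm = np.linalg.norm(v)
    return v if nrm <= r else (r / nrm) * v

def dhpd(W, G_x, G_y, xi_stream, d, eta1=0.1, T1=100, K=4, rx=10.0, ry=10.0):
    """Sketch of the distributed homotopy primal-dual loop of Algorithm 9.

    W         : (N, N) doubly stochastic mixing matrix of the communication graph
    G_x, G_y  : sampled gradient oracles G_x(j, x_j, y_j, xi) and G_y(j, x_j, y_j, xi)
    xi_stream : iterator producing Markovian samples xi_{k,t}
    """
    N = W.shape[0]
    x = np.zeros((N, d)); y = np.zeros((N, d))          # projected iterates x_{j,k}(t), y_{j,k}(t)
    x_raw = np.zeros((N, d)); y_raw = np.zeros((N, d))  # pre-projection iterates x'_{j,k}(t), y'_{j,k}(t)
    eta, T = eta1, T1
    for k in range(K):
        x_sum = np.zeros((N, d)); y_sum = np.zeros((N, d))
        for t in range(T):
            xi = next(xi_stream)
            # Primal: mix the neighbors' x' iterates, take a gradient step, then project.
            x_raw = W.T @ x_raw - eta * np.stack([G_x(j, x[j], y[j], xi) for j in range(N)])
            # Dual: local gradient ascent step, then project.
            y_raw = y_raw + eta * np.stack([G_y(j, x[j], y[j], xi) for j in range(N)])
            x = np.stack([proj_ball(v, rx) for v in x_raw])
            y = np.stack([proj_ball(v, ry) for v in y_raw])
            x_sum += x; y_sum += y
        # Restart from the time-running averages, halve the stepsize, double the horizon.
        x, y = x_sum / T, y_sum / T
        x_raw, y_raw = x.copy(), y.copy()
        eta, T = eta / 2, 2 * T
    return x, y
```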
For a xed learning rate k , we employ the distributed stochastic primal-dual method to solve (4.2) and obtain an approximate solution that is given by a time-running average of primal- dual pairs, ^ x j;k := 1 T k T k X t = 1 x j;k (t); ^ y j;k := 1 T k T k X t = 1 y j;k (t): (4.9) These are used as initial points for the next learning rate k+1 . At roundk, each agentj performs primal-dual updates withT k iterations, indexed by timet. At next roundk+1, we initialize primal and dual iterations using the previous approximate solutions ^ x j;k and ^ y j;k , reduce the learning rate by half, k+1 = k =2, and double the number of the inner-loop iterations,T k+1 = 2T k . The number of inner iterations in thekth round isT k and the number of total rounds isK. The homotopy approach not only provides outstanding practical performance but it also facil- itates an eective iteration complexity analysis [248]. In particular, for stochastic strongly convex programs, the rate faster thanO(1= p T ) was established in [224, 258]; other fast rate results can be found in [252, 243]. To the best of our knowledge, we are the rst to show that the homotopy approach can be used to solve distributed stochastic saddle-point problems with convergence rate better thanO(1= p T ). In Section 4.3.3, we use the primal optimality gap, err(^ x i;k ) := 1 N N X j = 1 (f j (^ x i;k )f j (x ? )) (4.10) 125 to quantify the distance of the running local average, ^ x i;k := (1=T k ) P T k t=1 x i;k (t), for theith agent from the optimal solutionx ? . The primal optimality gap measures performance of each agent in terms of MSPBE which is described by the global objective function (4.5), or equivalently by (4.6). Remark5(Multi-agentpolicyevaluation) For the multi-agent policy evaluation problem, we take j (x;y j ) =y > j (Axb j ) 1 2 y > j Cy j . Since j depends linearly onA,b j , andC, by replacing expectations in (4.5b) with the corresponding samples we obtain unbiased gradients, G j;x (z j;k (t); k;t ) = ((s) (s 0 ))(s) > y j;k (t) G j;y (z j;k (t); k;t ) = (s)((s) (s 0 )) > x j;k (t) R j (s)(s)(s)(s) > y j;k (t): wheres,s 0 aretwoconsecutivestatesthatevolveaccordingtotheunderlyingMarkovprocess k;t in- dexedbytimetandtheiterationroundk. Ineachiteration,Algorithm9requiresO(dN 2 )operations whered is the problem dimension (or the feature dimension in linear approximation) andN is the totalnumberofagents. Forasingle-agentproblem,O(d)operationsarerequiredwhichisconsistent with GTD algorithm [216]. 4.3.2 Assumptions We now formally state assumptions required to establish our main result in Theorem 24 that quanties nite-time performance of Algorithm 9 for stochastic primal-dual optimization prob- lem (4.2). Assumption10(Convexcompactprojectionsets) The projection sets X and Y contain the origininR d andthenitesolutionto (4.2),andtheyareconvexandcompactwithradiusr> 0,i.e., sup x2X;y2Y k(x;y)k 2 r 2 . 
126 Assumption11(Convexityandconcavity) The function j (x;y j ) in (4.2) is convex in x for any xedy j 2Y, and is strongly concave iny j for any xedx2X, i.e., for anyx;x 0 2X and y j ;y 0 j 2Y, there existsL y > 0 such that j (x;y j ) j (x 0 ;y j ) +hr x j (x 0 ;y j );xx 0 i j (x;y j ) j (x;y 0 j ) r y j (x;y 0 j );y j y 0 j Ly 2 ky j y 0 j k 2 : Moreover,f j (x) := max y j 2Y j (x;y j )isstronglyconvex,i.e.,foranyx;x 0 2X,thereexistsL x > 0 such that f j (x) f j (x 0 ) +hrf j (x 0 );xx 0 i + L x 2 kxx 0 k 2 : Assumption12(Boundedgradient) Foranytandk,thereisapositiveconstantcsuchthatthe sample gradientG j (x;y j ; k;t ) satises, kG j (x;y j ; k;t )k c for allx2X;y j 2Y with probability one. Remark6 Jensen’s inequality can be combined with Assumption 12 to show that the population gradient is also bounded, i.e.,kg j (x;y j )kc, for allx2X andy j 2Y, where g j (x;y j ) := 2 6 6 4 g j;x (x;y j ) g j;y (x;y j ) 3 7 7 5 = 2 6 6 4 r x j (x;y j ) r y j (x;y j ) 3 7 7 5 : 127 Assumption13(Lipschitzgradient) Foranytandk,thereexistsapositiveconstantLsuchthat for anyx;x 0 2X andy j ;y 0 j 2Y, we have kG j (x;y j ; k;t )G j (x 0 ;y j ; k;t )k Lkxx 0 k kG j (x;y j ; k;t )G j (x;y 0 j ; k;t )k Lky j y 0 j k with probability one. We also recall some important concepts from probability theory. The total variation distance between distributionsP andQ on a set R jSj is given by d tv (P;Q) := Z jp()q()j d() = 2 sup E jP (E)Q(E)j where distributionsP andQ (with densitiesp andq) are continuous with respect to the Lebesgue measure, and the supremum is taken over all measurable subsets of . We use the notion of mixing time to evaluate the convergence speed of a sequence of proba- bility measures generated by a Markovian process to its (unique) stationary distribution , whose density is assumed to exist. LetF k;t be the-eld generated by the rstt samples at roundk, f k;1 ;:::; k;t g, drawn fromfP k;1 ;:::;P k;t g, whereP k;t is the probability measure of the Marko- vian process at timet and roundk. LetP [s] k;t be the distribution of k;t conditioned onF k;s (i.e., given samples up to time slots,f k;1 ;:::; k;s g) at roundk, whose densityp [s] k;t also exists. The mixing time for a Markovian process is dened as follows [83]. 128 Denition1 The total variation mixing time tv (P [s] k ;") of the Markovian process conditioned on the-eldoftheinitialssamplesF k;s =( k;1 ;:::; k;s )isthesmallestpositiveintegertsuchthat d tv (P [s] k;s+t ; )", tv (P [s] k ;") = inf ts t2N; Z jp [s] k;t ()()j d()" : The mixing time tv (P [s] k ;") measures the number of additional steps required until the distri- bution of k;t is within" neighborhood of the stationary distribution given the initials samples, f k;1 ;:::; k;s g. Assumption14 The underlying Markov chain is irreducible and aperiodic, i.e., there exists > 0 and2 (0; 1) such thatE[d tv (P [t] k;t+ ; ) ] for all2N and allk. Furthermore, we have tv (P [s] k ;") log (=") jlog j + 1; for allk; s2 N (4.11) wherede is the ceiling function and" species the distance to the stationarity; also see [132, Theorem 4.9]. 4.3.3 Finite-timeperformancebound For stochastic saddle point problem (4.2), we establish a nite-time error bound in terms of the average primal optimality gap in Theorem 24 where the total number of iterations in Algorithm 9 is given byT := P K k = 1 T k = (2 K 1)T 1 . 129 Theorem24 Let Assumptions 10–14 hold. 
Then, for any 1 1=(4=L y + 2=L x ) and anyT 1 and K that satisfy T 1 := log (T ) j logj + 1 (4.12) theoutput ^ x j;K ofAlgorithm9providesthesolutiontoproblem (4.2)withthefollowingupperbound c(rL +c) T C 1 log 2 (T p N) 1 2 (W ) + C 2 (1 +T 1 ) ! (4.13) on (1=N) P N j=1 E[ err(^ x j;K ) ],wheretheprimaloptimalitygap err(^ x j;K )isdenedin (4.10),risthe bound on feasible setsX andY in Assumption 10,c is the bound on sample gradients in Assump- tion 12, C 1 andC 2 are constants independent ofT, 2 (W ) is the second largest eigenvalue ofW, andN is the total number of agents. Remark7(Multi-agentpolicyevaluation) Assumption 9 guarantees that A is full rank and thatC is positive denite. For j (x;y j ) = y > j (Axb j ) 1 2 y > j Cy j , since all features and rewards are bounded, Assumptions 11-13 hold with L x = max (A > A)= min (C); L y = min (C) c (2 1 + 2 )r + 0 L max( p 2 1 + 2 2 ; 1 ) 130 where 0 ; 1 and 2 provide upper bounds tokR j (s)(s)k 0 ,k(s)((s) (s 0 )) > k 1 , andk(s)(s) > k 2 . The unique minimizer of (4.5) and the expression (4.7) that results from Fenchel duality validate Assumption 10 with r 2 0 min (C) 2 1 max (C) min (A > A) min (C) + 1 : In practice, when some prior knowledge about the model is available, e.g., when generative models or simulators can be utilized, samples from a near stationary state distribution under a given policy can be used to estimate these parameters. Remark8(Optimalperformanceboundandselectionofparameters) We can ndT 1 and K such that condition (4.12) holds, as long asT > =d log (T )=j logje. In particular, choosing T 1 = andK = log(1 +T=) gives the desiredO(log 2 (T p N)=T ) scaling of nite-time perfor- mance bound (4.13). In general, to satisfy condition (4.12), we can chooseT 1 andK such thatT 1 d (K + log (T 1 ))=j logje + 1. Hence, performance bound (4.13) scales asO((K 2 + log 2 T 1 )=T ). Time-running average (4.9) is used as the output of our algorithm and when the algorithm is ter- minated in an inner loopK, the time-running average from previous inner loop can be used as an output and our performance bound holds forK 1. Remark9(Mixingtime) When" = 1=T, =d log (T )=j logjeprovidesalowerboundonthe mixingtime(cf. (4.11)),where > 0and2 (0; 1)aregiveninAssumption14. Thus,performance bound (4.13) in Theorem 24 depends on how fast the process P [s] k reaches 1=T mixing via . In particular, when samples are independent and identically distributed, 0-mixing is reached in one step, i.e., we have = 1 and = 0. Hence, by settingT 1 = 1, performance bound (4.13) simplies toO(log 2 (T )=T ). 131 Remark10(Networksizeandtopology) In nite-time performance bound (4.13), the depen- dence on the network sizeN and the spectral gap 1 2 (W ) of the mixing matrixW is quantied by log 2 (T p N)=(1 2 (W )). We note thatW can be expressed using the LaplacianL of the un- derlying graph,W = ID 1=2 LD 1=2 =( max + 1), whereD := diag ( 1 ;:::; N ), i is the degree of nodei, and max := max i i . The algebraic connectivity of the network N1 (L), i.e., the second smallest eigenvalue ofL, can be used to bound 2 (W ). In particular, for a ring withN nodes, we have 2 (W ) = (1=N 2 ); for other topologies, see [82, Section 6]. 4.4 Finite-timeperformanceanalysis In this section, we study nite-time performance of the distributed homotopy primal-dual algo- rithm described in Algorithm 9. We dene auxiliary quantities in Section 4.4.1, present useful lemmas in Section 4.4.2, and provide the proof of Theorem 24 in Section 4.4.3. 
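As a concrete illustration of the network dependence discussed in Remark 10, the Python sketch below constructs the mixing matrix W = I − D^{−1/2} L D^{−1/2}/(δ_max + 1) for a ring of N nodes and reports the spectral gap 1 − σ_2(W) that enters bound (4.13). The ring topology is chosen only as an example; any connected undirected graph can be substituted.

```python
import numpy as np

def ring_mixing_matrix(N):
    """Mixing matrix W = I - D^{-1/2} L D^{-1/2} / (delta_max + 1) for a ring graph."""
    Adj = np.zeros((N, N))
    for i in range(N):                      # node i is connected to i-1 and i+1 (mod N)
        Adj[i, (i - 1) % N] = Adj[i, (i + 1) % N] = 1.0
    deg = Adj.sum(axis=1)
    L = np.diag(deg) - Adj                  # graph Laplacian
    D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
    return np.eye(N) - D_inv_sqrt @ L @ D_inv_sqrt / (deg.max() + 1)

for N in (5, 10, 20, 40):
    W = ring_mixing_matrix(N)
    eigs = np.sort(np.linalg.eigvalsh(W))[::-1]   # W is symmetric and doubly stochastic
    print(N, 1.0 - eigs[1])                       # spectral gap 1 - sigma_2(W); decays like 1/N^2 for a ring
```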
4.4.1 Settinguptheanalysis We rst introduce three types of averages that are used to describe sequences generated by the primal update of Algorithm 9. The average value ofx j;k (t) over all agents is denoted by x k (t) := 1 N P N j=1 x j;k (t), the time-running average ofx j;k (t) is ^ x j;k := 1 T k P T k t=1 x j;k (t), the averaged time- running average of x j;k (t) is given by ~ x k := 1 N P N j=1 ^ x j;k = 1 T k P T k t=1 x k (t), and two auxiliary averaged sequences are, respectively, given by x 0 k (t) := 1 N P N j=1 x 0 j;k (t) andx k (t) :=P X ( x 0 k (t)). Since the mixing matrixW is doubly stochastic, the primal update x 0 k (t) has a simple ‘centralized’ form, x 0 k (t + 1) = x 0 k (t) k N N X j = 1 G j;x (z j;k (t); k;t ): (4.14) 132 We now utilize the approach similar to the network averaging analysis in [82] to quantify how well the agentj estimates the network average at roundk. Lemma25 Let Assumption 12 hold, letW be a doubly stochastic mixing matrix over graphG, let 2 (W )denoteitssecondlargestsingularvalue,andletasequencex j;k (t)begeneratedbyAlgorithm9 for agentj at roundk. Then, 1 T k T k X t = 1 E[kx j;k (t) x k (t)k ] 2 k where k is given by k := 2 k c log( p NT k ) 1 2 (W ) + 4c T k log( p NT k ) 1 2 (W ) + 1 ! k X l = 1 l T l + 2 k c Proof. See Appendix C.1. The dual updatey 0 j;k (t+1) behaves in a similar way as (4.14). We utilize the following classical online gradient descent to analyze the behavior of x 0 k (t) andy 0 j;k (t) under projectionsP X ( ) and P Y ( ). Lemma26([285]) LetU be a convex closed subset ofR d , letfg(t)g T t=1 be an arbitrary sequence inR d ,andletsequencesw(t)andu(t)begeneratedbytheprojection,w(t + 1) =w(t)g(t)and u(t + 1) =P U (w(t + 1) ),whereu(1)2U istheinitialpointand> 0isthelearningrate. Then, for any xedu ? 2U, T X t = 1 hg(t);u(t)u ? i ku(1)u ? k 2 2 + 2 T X t = 1 kg(t)k 2 : 133 Proof. See Appendix B.3 in [285]. If ^ y ? j;k := argmax y j 2Y j (^ x i;k ;y j ), using Fenchel dual (4.7), we havef j (^ x i;k ) = j (^ x i;k ; ^ y ? j;k ) and f j (x ? ) = j (x ? ;y ? j ). This allows us to express primal optimality gap (4.10) in terms of a primal-dual objective, err(^ x i;k ) = 1 N N X j = 1 j (^ x i;k ;y ? j;k ) j (x ? ;y ? j ) : (4.15) Let ^ x ? k := argmin x2X 1 N P N j = 1 j (x; ^ y j;k ). To analyze optimality gap (4.15), we introduce a sur- rogate gap, err 0 (^ x i;k ; ^ y k ) := 1 N N X j = 1 j (^ x i;k ; ^ y ? j;k ) j (x ? ; ^ y j;k ) (4.16) as well as average primal optimality (4.10) and surrogate (4.16) gaps, err k := 1 N N X j = 1 E[ err(^ x j;k ) ]; err 0 k := 1 N N X j = 1 E[ err 0 (^ x j;k ; ^ y k ) ]: In Lemma 27, we establish relation between the surrogate gap err 0 (^ x i;k ; ^ y k ) and the primal optimality gap err(^ x i;k ). Lemma27 Let ^ x i;k and ^ y k be generated by Algorithm 9 for agenti at roundk. Then, 0 err(^ x i;k ) err 0 (^ x i;k ; ^ y k ) and 0 err k err 0 k : Proof. See Appendix C.3. 134 Lemma28 Let Assumption 11 hold and let ^ x i;k and ^ y k be generated by Algorithm 9 for agenti at roundk. Then, err(^ x i;k ) L x 2 kx ? ^ x i;k k 2 (4.17a) err 0 (^ x i;k ; ^ y k ) L y 2N N X j = 1 ky ? j ^ y j;k k 2 (4.17b) err 0 (^ x i;k ; ^ y k ) L y 2N N X j = 1 ky ? j ^ y ? j;k k 2 : (4.17c) Proof. See Appendix C.4. We are now ready to provide an overview of our remaining analysis. In Lemma 29, we use a sum of network errors (NET) and local primal-dual gaps (PDG) to bound surrogate gap (4.16). 
In Lemma 30, we provide a bound on local primal-dual gaps by a sum of local dual gaps and a term that depends on mixing time. We combine Lemma 29 and Lemma 30 and apply the restarting strategy to get a recursion on the surrogate gap. Finally, in Section 4.4.3, we complete the proof by utilizing induction on the roundk. 4.4.2 Usefullemmas We utilize convexity and concavity of j with respect to primal and dual variables to decompose surrogate gap (4.16) into parts that quantify the inuence of network errors (NET) and local primal-dual gaps (PDG), respectively. Lemma29 Let Assumptions 11 and 12 hold and let ^ x i;k and ^ y k be generated by Algorithm 9 for agenti at roundk. Then, err 0 (^ x i;k ; ^ y k ) NET + PDG 135 where NET = c T k T k X t = 1 kx i;k (t) x k (t)k + 1 N N X j = 1 kx j;k (t) x k (t)k PDG = 1 NT k T k X t = 1 N X j = 1 j (x j;k (t); ^ y ? j;k ) j (x ? ;y j;k (t)) : Proof. Applying the mean value theorem and boundedness of the gradient of j (x; ^ y ? j;k ) with respect tox, we have j (^ x i;k ; ^ y ? j;k ) j (~ x k ; ^ y ? j;k ) ck^ x i;k ~ x k k c T k T k X t = 1 kx i;k (t) x k (t)k where the second inequality follows from the Jensen’s inequality. Then, breaking err 0 (^ x i;k ; ^ y k ) in (4.16) by adding and subtracting 1 N P N j=1 j (~ x k ; ^ y ? j;k ), we bound err 0 (^ x i;k ; ^ y k ) by err 0 (^ x i;k ; ^ y k ) c T k T k X t = 1 kx i;k (t) x k (t)k + 1 N N X j = 1 ( j (~ x k ; ^ y ? j;k ) j (x ? ; ^ y j;k )): Next, we nd a simple bound for the second sum. We recall that ~ x k := 1 T k P T k t=1 x k (t) and ^ y j;k := 1 T k P T k t=1 y j;k (t) and apply the Jensen’s inequality twice to obtain, 1 NT k T k X t = 1 N X j = 1 j ( x k (t); ^ y ? j;k ) j (x ? ;y j;k (t)) : (4.18) Similarly, we have j ( x k (t); ^ y ? j;k ) j (x j;k (t); ^ y ? j;k ) ck x k (t)x j;k (t)k. Breaking (4.18) by adding and subtracting 1 NT k P T k t=1 P N j=1 j (x j;k (t); ^ y ? j;k ), we further bound (4.18) by c NT k T k X t = 1 N X j = 1 kx j;k (t) x k (t)k + PDG: 136 The proof is completed by combining all above bounds. Lemma 29 establishes a bound for the surrogate gap err 0 (^ x i;k ; ^ y k ) for agenti at roundk. The terms in NET describe the accumulated network error that measures the deviation of each agent’s estimate from the average. On the other hand, PDG determines the average of primal-dual gaps incurred by local agents that are commonly used in the analysis of primal-dual algorithms [175, 174]. Next, we utilize the Markov mixing property to control the average of local primal-dual gaps PDG. First, we break the dierence j (x j;k (t); ^ y ? j;k ) j (x ? ;y j;k (t)) into a sum of j (x j;k (t); ^ y ? j;k ) j (x j;k (t);y j;k (t)) and j (x j;k (t);y j;k (t)) j (x ? ;y j;k (t)). We now can utilize convexity and concavity of j (x;y j ) to deal with these terms separately. Dividing the sum indexed byt into two intervals, 1 t T k and T k + 1 t T k , the PDG term can be bounded by PDG PDG + + PDG , where PDG + = 1 NT k T k X t = 1 N X j = 1 g j;y (z j;k (t)); ^ y ? j;k y j;k (t) +hg j;x (z j;k (t));x j;k (t)x ? i PDG = 1 NT k T k X t =T k +1 N X j = 1 g j;y (z j;k (t)); ^ y ? j;k y j;k (t) +hg j;x (z j;k (t));x j;k (t)x ? i : Here, is the mixing time of the ergodic sequence k;1 ;:::; k;t at round k. The intuition be- hind this is that, given the initialt samples k;1 ;:::; k;t , the sample k;t is almost a sample that arises from the stationary distribution . 
With this in mind, we next show that an appro- priate breakdown of the term PDG + enables applications of the martingale concentration from Lemma 55 and mixing time property (4.11), thereby producing a gradient-free bound on primal- dual gaps. 137 Lemma30 Let Assumptions 10–13 hold. ForT k that satisesT k 1 +d log(T k )=j logje =, we have E[ PDG ] c N 8 p T k + 1 T k N X j = 1 E ^ y ? j;k y ? j + E[ MIX ] where MIX = 2rL + p 2c NT k T k X t =+1 N X j = 1 kz j;k (t)z j;k (t)k + 1 2 k T k kx ? x k ( + 1)k 2 + 1 N N X j = 1 ^ y ? j;k y j;k ( + 1) 2 + c NT k T k X t =+1 N X j = 1 kx j;k (t)x k (t)k + 2rc( + 1) T k + k c 2 : Proof. Using Assumption 12, PDG can be upper bounded by PDG c NT k T k X t =T k +1 N X j = 1 ^ y ? j;k y j;k (t) +kx j;k (t)x ? k : Since the domain is bounded, this term is upper bounded by 2rc=T k . Next, we deal with PDG + . We divide eachhg j;y (z j;k (t)); ^ y ? j;k y j;k (t)i+hg j;x (z j;k (t));x j;k (t) x ? i into a sum of ve terms (4.19a)–(4.19e) by adding and subtracting G j;x (z j;k (t); k;t+ ) and G j;y (z j;k (t); k;t+ ) into the rst arguments of two inner products, respectively, and then inserting y ? j into the second argument for the rst resulting inner product, g j;y (z j;k (t))G j;y (z j;k (t); k;t+ ); ^ y ? j;k y ? j (4.19a) g j;y (z j;k (t))G j;y (z j;k (t); k;t+ );y ? j y j;k (t) (4.19b) 138 G j;y (z j;k (t); k;t+ ); ^ y ? j;k y j;k (t) (4.19c) hg j;x (z j;k (t))G j;x (z j;k (t); k;t+ );x j;k (t)x ? i (4.19d) hG j;x (z j;k (t); k;t+ );x j;k (t)x ? i: (4.19e) We sum each of (4.19a)–(4.19e) overt = 1;:::;T k andj = 1;:::;N, divide it byNT k , and represent each of them usingS 1 toS 5 . Thus,E[ PDG + ] =E[S 1 +S 2 +S 3 +S 4 +S 5 ]. We next bound each term separately. BoundingthetermE[S 1 ]: For agentj, we have a martingale dierence sequencefX j (t)g T k t = 1 , X j (t) = g j;y (z j;k (t)) G j;y (z j;k (t); k;t+ ) E j;k (t) where E j;k (t) := E[g j;y (z j;k (t))G j;y (z j;k (t); k;t+ )jF k;t ] and M = 4c in Lemma 55. This allows us to rewriteS 1 as S 1 = 1 N N X j = 1 * 1 T k T k X t = 1 X j (t); ^ y ? j;k y ? j + + 1 NT k N X j = 1 T k X t = 1 E j;k (t); ^ y ? j;k y ? j : Since (T k )=T k 1, Lemma 55 implies E 2 4 1 T k T k X t = 1 X j (t) 2 3 5 64c 2 T k : 139 Using Assumption 12 and the mixing time property (4.11), we can boundkE j;k (t)k by kE j;k (t)k = Z G i;y (z i;k (t);) ()p [t] k;t+ () d() c Z ()p [t] k;t+ () d() = cd tv (P [t] k;t+ ; ): (4.20) Applying the triangle and Cauchy-Schwartz inequalities toE[S 1 ] and using (4.20) lead to, E[S 1 ] 1 N 8c p T k N X j = 1 ^ y ? j;k y ? j + c NT k N X j = 1 T k X t = 1 E h d tv (P [t] k;t+ ; ) i ^ y ? j;k y ? j c N 8 p T k + 1 T k N X j = 1 ^ y ? j;k y ? j : where the last inequality follows from the mixing time property: if we chooseT k such that = 1 +d log(T k )= logjje tv (P [t] k ; 1=T k ), then we haveE[d tv (P [t] k;t+ ; ) ] 1=T k . BoundingthetermsE[S 2 ]andE[S 4 ]: Using the Cauchy-Schwartz inequality and (4.20), we can boundE[S 2 ] by E[S 2 ] c NT k T k X t = 1 N X j = 1 E[d tv (;P [t] k;t+ ) ]ky ? j y j;k (t)k c NT 2 k T k X t = 1 N X j = 1 E[ky ? j y j;k (t)k ] rc T k : Similarly, we have E[S 4 ] c NT 2 k T k X t = 1 N X j = 1 E[kx j;k (t)x ? k ] rc T k : 140 BoundingthetermsE[S 3 ]andE[S 5 ]: We re-index the sum inS 3 overt and write it as 1 NT k T k X t =+1 N X j = 1 G j;y (z j;k (t); k;t ); ^ y ? j;k y j;k (t) where eachhG j;y (z j;k (t); k;t ); ^ y ? j;k y j;k (t)i can be split into a sum of the following three inner products, hG j;y (z j;k (t); k;t )G j;y (z j;k (t); k;t ); ^ y ? 
j;k y j;k (t)i hG j;y (z j;k (t); k;t ); ^ y ? j;k y j;k (t)i hG j;y (z j;k (t); k;t );y j;k (t)y j;k (t)i: Combining the Lipschitz continuity of the gradientkG j;y (z j;k (t); k;t )G j;y (z j;k (t); k;t )k Lkz j;k (t)z j;k (t)k with the domain/gradient boundedness yields, S 3 rL NT k T k X t =+1 N X j = 1 kz j;k (t)z j;k (t)k + 1 NT k T k X t =+1 N X j = 1 G j;y (z j;k (t); k;t ); ^ y ? j;k y j;k (t) + c NT k T k X t =+1 N X j = 1 ky j;k (t)y j;k (t)k: Similarly, we have S 5 rL NT k T k X t =+1 N X j = 1 kz j;k (t)z j;k (t)k + c NT k T k X t =+1 N X j = 1 kx j;k (t)x j;k (t)k + 1 NT k T k X t =+1 N X j = 1 hG j;x (z j;k (t); k;t );x j;k (t)x ? i: 141 Insertingx k (t) into the second argument of the inner product in the above bound onS 5 and using inequality (kak +kbk) 2 2kak 2 + 2kbk 2 , we boundS 3 +S 5 by S 3 + S 5 2rL + p 2c NT k T k X t =+1 N X j = 1 kz j;k (t)z j;k (t)k + 1 NT k T k X t =+1 N X j = 1 G j;y (z j;k (t); k;t ); ^ y ? j;k y j;k (t) + 1 NT k T k X t =+1 N X j = 1 hG j;x (z j;k (t); k;t );x j;k (t)x k (t)i + 1 T k T k X t =+1 * 1 N N X j = 1 G j;x (z j;k (t); k;t );x k (t)x ? + : On the right-hand side of the above inequality, the third term can be bounded by applying the Cauchy-Schwartz inequality and the gradient boundedness; for the second term and the fourth term, application of Lemma 26 yields, T k X t =+1 G j;y (z j;k (t); k;t ); ^ y ? j;k y j;k (t) ^ y ? j;k y j;k ( + 1) 2 2 k + k 2 T k X t =+1 kG j;y (z j;j (t); k;t )k 2 T k X t =+1 * 1 N N X j = 1 G j;x (z j;k (t); k;t );x k (t)x ? + kx k ( + 1)x ? k 2 2 k + k 2 T k X t =+1 1 N N X j = 1 G j;x (z j;k (t); k;t ) 2 142 which allows us to boundS 3 +S 5 by S 3 + S 5 2rL + p 2c NT k T k X t =+1 N X j = 1 kz j;k (t)z j;k (t)k + 1 2 k T k kx k ( + 1)x ? k 2 + 1 N N X j = 1 ^ y ? j;k y j;k ( + 1) 2 + c NT k T k X t =+1 N X j = 1 kx j;k (t)x k (t)k + k c 2 : Taking expectation ofS 3 +S 5 and adding previous bounds onE[S 1 ],E[S 2 ], andE[S 4 ] to it lead to the nal bound onE[ PDG + ]. The proof is completed by adding the established upper bounds onE[PDG + ] andE[PDG ]. Lemma 30 is based on the ergodic analysis of the mixing process. We may set = 0 to take the traditional stochastic gradient method with i.i.d. samples. In fact, by combining results of Lem- mas 29 and 30, we can obtain a loose boundO(1= p T k ) bound for the surrogate gap err 0 (^ x i;k ; ^ y k ). Instead, we combine the strong concavity of j (x;y j ) in terms ofy j with Lemma 30 to establish O(1=T k ) bound on the surrogate gap. Lemma31 Let Assumptions 10–13 hold. For k andT k that satisfyL x k T k 16 andT k 1 + d log(T k )=j logje =, we have E[ err 0 (^ x i;k ; ^ y k ) ] 4E[ NET 0 ] + 4c 2 L y 4 p T k + 1 T k 2 + 16 k (rL +c) + 4 k c 2 + 8 k c 2 2 T k + 8rc( + 1) T k + 16E[ err 0 (^ x i;k1 ; ^ y k1 ) ] L y k T k + N X j = 1 8E[ err 0 (^ x j;k1 ; ^ y k1 ) ] L x k NT k 143 where NET 0 := NET + 4 (rL +c) NT k T k X t = 1 N X j = 1 kx j;k (t)x k (t)k: Proof. Taking expectation of (4.17c) yields, E[ err 0 (^ x i;k ; ^ y k ) ] L y 2N N X j = 1 E[ky ? j ^ y ? j;k k 2 ] L y 2 1 N N X j = 1 E[ky ? j ^ y ? j;k k ] 2 where we useE[X 2 ] E[X] 2 along with (1=N) P N i = 1 a 2 i ((1=N) P N i = 1 a i ) 2 to establish the second inequality. Substituting the above inequality into the result in Lemma 30 and combining it with Lemma 29 yield, E[ err 0 (^ x i;k ; ^ y k ) ] E[ NET ] + E[ MIX ] + p 2c p L y 4 p T k + 1 T k E[ err 0 (^ x i;k ; ^ y k ) ] 1=2 : Thus, := E[ err 0 (^ x i;k ; ^ y k ) ] 1=2 satises the quadratic inequality 2 a +b. 
Combining the values of that satisfy this inequality with (kak +kbk) 2 2kak 2 + 2kbk 2 leads to, E[ err 0 (^ x i;k ; ^ y k ) ] 2E[ NET ] + 2E[ MIX ] + 2c 2 L y 4 p T k + 1 T k 2 : (4.21) The remaining task is to express expectations on the right-hand side of (4.21) in terms of previously introduced terms. Lemma 25 provides a bound onE[ NET ] and, next, we evaluate the 144 terms inE[ MIX ]. Combination of the triangle inequality with non-expansiveness of projection allows us to boundE[kz j;k (t)z j;k (t)k ] by E[kz j;k (t)z j;k (t)k ] E[ky 0 j;k (t)y 0 j;k (t)k ] + E[kx j;k (t)x j;k (t)k ] whereE[kx j;k (t)x j;k (t)k ] can be further bounded by a sum of three terms: E[kx j;k (t)x k (t)k ];E[kx k (t)x k (t)k ]; andE[kx k (t)x j;k (t)k ]: Similarly,kx k (t)x k (t)kk x 0 k (t) x 0 k (t)k. By the gradient boundedness, it is clear from the primal-dual updates thatE[ky 0 j;k (t)y 0 j;k (t)k ];E[kx k (t)x k (t)k ] k c. Thus, we have 1 NT k T k X t =+1 N X j = 1 E[kz j;k (t)z j;k (t)k ] 2 k c + 1 NT k T k X t =+1 N X j = 1 E[kx j;k (t)x k (t)k ] + 1 NT k T k X t =+1 N X j = 1 E[kx k (t)x j;k (t)k ] 2 k c + 2 NT k T k X t = 1 N X j = 1 E[kx j;k (t)x k (t)k ] 145 where we sum overt from 1 toT k instead of from + 1 toT k in the last inequality. Now, we turn to next two expectations as follows, E[kx ? x k ( + 1)k 2 ] = E[kx k ( + 1)x k (1) +x k (1)x ? k 2 ] 2E[kx k ( + 1)x k (1)k 2 ] + 2E[kx k (1)x ? k 2 ] 2E[k x 0 k ( + 1) x 0 k (1)k 2 ] + 2E[kx k (1)x ? k 2 ] 2 2 k c 2 2 + 2 N N X j = 1 E[kx 0 j;k (1)x ? k 2 ] = 2 2 k c 2 2 + 2 N N X j = 1 E[kx j;k (1)x ? k 2 ] where we apply the inequalityka+bk 2 2kak 2 +2kbk 2 and the non-expansiveness of projection for the rst and the second inequalities; the third inequality is because of (4.14), the gradient boundedness, and application of the Jensen’s inequality tokk 2 withx k (1) := 1 N P N j=1 x 0 j;k (1); and the last equality follows the initializationx 0 j;k (1) =x j;k (1). Similarly, we derive E[k^ y ? j;k y j;k ( + 1)k 2 ] = E[k^ y ? j;k y j;k (1) +y j;k (1)y j;k ( + 1)k 2 ] 2E[k^ y ? j;k y j;k (1)k 2 ] + 2E[ky j;k (1)y j;k ()k 2 ] 2E[k^ y ? j;k y j;k (1)k 2 ] + 2E[ky 0 j;k (1)y 0 j;k ()k 2 ] 2E[k^ y ? j;k y j;k (1)k 2 ] + 2 2 k c 2 2 = 2E[k^ y ? j;k y ? j +y ? j y j;k (1)k 2 ] + 2 2 k c 2 2 4E[k^ y ? j;k y ? j k 2 ] + 4E[ky ? j y j;k (1)k 2 ] + 2 2 k c 2 2 : 146 From (4.17c) we know that 1 N N X j = 1 E[k^ y ? j;k y ? j k 2 ] 2 L y E[ err 0 (^ x i;k ; ^ y k ) ]: Now, we collect above inequalities for (4.21). For notational convenience, we sum overt from 1 untilT k instead of from + 1 toT k and combine similar terms to obtain the following bound on E[ MIX ], E[ MIX ] 4 k c(rL +c) + 4(rL +c) NT k T k X t = 1 N X j = 1 E[kx j;k (t)x k (t)k ] + 1 k NT k N X j = 1 2E[ky ? j y j;k (1)k 2 ] +E[kx j;k (1)x ? k 2 ] + 4E[ err 0 (^ x i;k ; ^ y k ) ] L y k T k + 2 k c 2 2 T k + 2rc ( + 1) T k + k c 2 : We note the restarting scheme of Algorithm 9, i.e., x j;k (1) = ^ x j;k1 and y j;k (1) = ^ y j;k1 . By (4.17a), (4.17b), and Lemma 27, we have E[kx j;k (1)x ? k 2 ] = E[k^ x j;k1 x ? k 2 ] 2 L x E[ err(^ x j;k1 ) ] 2 L x E[ err 0 (^ x j;k1 ; ^ y k1 ) ] 1 N N X j = 1 E[ky ? j y j;k (1)k 2 ] = 1 N N X j = 1 E[ky ? j ^ y j;k1 k 2 ] 2 L y E[ err 0 (^ x i;k1 ; ^ y k1 ) ]: Combining above inequalities with the bound onE[ MIX ] yields a bound onE[ err 0 (^ x i;k ; ^ y k ) ]. Finally, we utilizeL x k T k 16 to nish the proof. 147 4.4.3 Proofofmaintheorem The proof of Theorem 24 is based on the result in Lemma 31. 
We leave the mixing time to be determined so that it works for every roundk and focus on the averaged surrogate gap err 0 k := 1 N P N j = 1 E[ err 0 (^ x j;k ; ^ y k ) ]. Let H k := 4c 2 L y 8 p T k + 1 T k 2 E k := 8 (4rL + 6c) k + 16 k c (rL +c) + 8rc ( + 1) T k + 4 k c 2 : We rst apply Lemma 25 to obtainE[ NET 0 ] 2(4rL + 6c) k for each roundk. With previous simplied notation, we apply this inequality for the bound in Lemma 31, and then take average overi = 1;:::;N on both sides to obtain, err 0 k H k + E k + 8 k c 2 2 T k + err 0 k1 L 0 k T k (4.22) where 1=L 0 := 16=L y + 8=L x . In Algorithm 9, we have updates k = k1 =2 andT k = 2T k1 . Since 1 4=L 0 and T 1 1, we have L 0 k T k 4 for all k 1. Clearly, the assumption L x k T k 16 holds in Lemma 31. Therefore, (4.22) can be simplied into err 0 k H k + E k + 8 k c 2 2 T k + err 0 k1 4 : (4.23) Simple comparisons show thatH k H k1 =2 andE k E k1 =2. Thus, k =T k = (1=4) k1 1 =T 1 and H k H 1 =2 k1 are true for all k 1. If we set = 1 +d log(T )=j logje T 1 with 148 suitableT 1 andK, Lemma 31 applies at any roundk. Starting from the nal roundk = K, we repeat (4.23) to obtain, err 0 K H K + E K + 2 4 K2 1 c 2 2 T 1 + err 0 K1 4 1 2 K2 K X k = 1 1 2 k H 1 + K X k = 1 1 2 k E K + 2K 4 K2 1 c 2 2 T 1 + 1 4 K err 0 0 8H 1 T 1 T + 2E K + 32K 1 c 2 2 T 1 T 2 + r 2 T 2 1 T 2 whereT = P K k = 1 T k = (2 K 1)T 1 2 K T 1 and err 0 0 err 0 = sup x2X;y2Y k(x;y)k 2 r 2 . We now substituteH 1 andE K into this bound and bound err 0 K by 32c 2 T 1 L y T 8 p T 1 + 1 T 1 2 + 16 (4rL + 6c) K + 32K 1 c 2 2 T 1 T 2 + r 2 T 2 1 T 2 + 16rc ( + 1) T K + 32 K c (2rL + p 2c) + 8 K c 2 : SinceT 2 K T 1 , we have K = 1 =2 K 1 T 1 =T andT K = 2 K1 T 1 T=2. SinceT = (2 K 1)T 1 2 K1 T 1 , we haveK 1+log(T=T 1 ). Thus, P K k = 1 k T k =K 1 T 1 1 T 1 (1+log(T=T 1 )). Therefore, the above bound has the following order, C 1 c (c +rL) log 2 ( p NT ) T (1 2 (W )) + C 2 c (c +rL)(1 +T 1 ) T whereC 1 areC 2 are absolute constants. Since err K err 0 K , it also bounds err K . The proof is completed by combining 1 4=L 0 withT 1 . 149 4.5 Computationalexperiments We rst utilize a modied Mountain Car Task [214, Example 10.1] for multi-agent policy evalua- tion problem. We generate the dataset using the approach presented in [232], obtain a policy by running SARSA withd = 300 features, and sample the trajectories of states and actions according to the policy. The discount factor is set to = 0:95. We simulate the communication network withN agents using the Erdős-Rényi graph with connectivity 0:1. At every time instant, each agent observes a local reward that is generated as a random proportion of the total reward. Since the stationary distribution is unknown, we use sampled averages from the entire dataset to compute sampled versions ^ A, ^ b, and ^ C ofA,b, andC. We then formulate an empirical MSPBE as (1=2)k ^ Ax ^ bk 2 ^ C 1 and compute the optimal empirical MSPBE. We use this empirical MSPBE as an approximation of the population MSPBE to calculate the optimality gap. The dataset con- tains 85453 samples and we run our online algorithm over one trajectory of 30000 samples using multiple passes. We set an initial restart time toT 1 = 10 5 and a restart round toK = 4 to insure T 1 'O(K + logT 1 ), take large bounds for Euclidean projections, and choose dierent learning rates. We compare the performance of Algorithm 9 (DHPD) with stochastic primal-dual (SPD) al- gorithm under dierent settings. 
For N = 1, SPD corresponds to the GTD algorithm in [138, 238, 222], and for N > 1, SPD becomes the multi-agent GTD algorithm [127]. We show computational results for N = 1 and N = 10 in Fig. 4.2 and Fig. 4.3, respectively. The optimality gap is the difference between the empirical MSPBE and the optimal one. Our algorithm achieves a smaller optimality gap than SPD in all cases, thereby demonstrating its sample efficiency. In computational experiments with simple diminishing stepsizes, the algorithm converges relatively slowly, as is typical in stochastic optimization. Also, our online algorithm competes well against the approach that utilizes pre-collected i.i.d. samples (instead of true i.i.d. samples from the stationary distribution) in a fixed buffer.

Figure 4.2 (optimality gap vs. iteration count): Performance comparison for the centralized problem with N = 1. Our algorithm with stepsize 0.1 and K = 4 achieves a smaller optimality gap than the stochastic primal-dual algorithm with stepsizes 0.1, 0.05, 0.025, and 1/√t. It also provides a smaller optimality gap than the approach that utilizes pre-collected i.i.d. samples in a buffer.

In our second computational experiment, we test randomly generated multi-agent MDPs for a ring network with N = 5 agents that utilize a fixed policy [127, Example 1]. This leads to a Markov chain with S = {1, 2, 3, 4}, γ = 0.95, φ(s) = [φ_1(s) φ_2(s) φ_3(s) φ_4(s)]^⊤ ∈ R^4, where φ_i(s) = e^{−(s−i)^2}, and r_j(s) = 1(s = 4). In Fig. 4.4, we demonstrate that our online algorithm with an initial restart time T_1 = 20000 and a restart round K = 3 outperforms the SPD algorithm that utilizes diminishing stepsizes or a replay buffer.

Figure 4.3 (optimality gap vs. iteration count): Performance comparison for the distributed case with N = 10. Our algorithm with stepsize 0.05 and K = 4 achieves a smaller optimality gap than the stochastic primal-dual algorithm with stepsizes 0.05, 0.025, 0.0125, and 1/√t. It also provides a smaller optimality gap than the approach that utilizes pre-collected i.i.d. samples in a buffer.

Figure 4.4 (optimality gap vs. iteration count): Performance comparison for the distributed case with N = 5. Our algorithm with stepsize 0.1 and K = 3 achieves a smaller optimality gap than the stochastic primal-dual algorithm with stepsizes 0.5, 0.25, 0.125, and 1/√t. It also provides a smaller optimality gap than the approach that utilizes pre-collected i.i.d. samples in a buffer.

4.6 Concluding remarks

In this chapter, we approached multi-agent temporal-difference learning through a distributed primal-dual stochastic saddle point problem. We proposed a new online distributed homotopy-based primal-dual algorithm for minimizing the mean square projected Bellman error in the Markovian setting and established an O(1/T) error bound. Our result improves upon the best known O(1/√T) error bound for general stochastic primal-dual algorithms and demonstrates that distributed saddle point programs can be solved efficiently even in applications with limited time budgets.

Chapter 5

Independent policy gradient for Markov potential games

In this chapter, we study global non-asymptotic convergence properties of policy gradient methods for multi-agent reinforcement learning problems in Markov potential games (MPGs).
To learn a Nash equilibrium of an MPG in which the size of the state space and/or the number of players can be very large, we propose new independent policy gradient algorithms that are run by all players in tandem. When there is no uncertainty in the gradient evaluation, we show that our algorithm finds an ε-Nash equilibrium with O(1/ε²) iteration complexity which does not explicitly depend on the state space size. When the exact gradient is not available, we establish an O(1/ε⁵) sample complexity bound in a potentially infinitely large state space for a sample-based algorithm that utilizes function approximation. Moreover, we identify a class of independent policy gradient algorithms that enjoy convergence for both zero-sum Markov games and Markov cooperative games with players that are oblivious to the type of game being played. Finally, we provide computational experiments to corroborate the merits and the effectiveness of our theoretical developments.

5.1 Introduction

Multi-agent reinforcement learning (multi-agent RL) studies how multiple players learn to maximize their long-term returns in a setup where players' actions influence the environment and other agents' returns [46, 273]. Recently, multi-agent RL has achieved significant success in various multi-agent learning scenarios, e.g., competitive game-playing [203, 202, 229], autonomous robotics [198, 133], and economic policy-making [283, 223]. In the framework of stochastic games [200, 90], most results are established for fully-competitive (i.e., two-player zero-sum) games; e.g., see [64, 239, 49]. However, to achieve social welfare for AI [60, 59, 209], it is imperative to establish theoretical guarantees for multi-agent RL in Markov games with cooperation.

Policy gradient methods [245, 217] have received a lot of attention for both single-agent [32, 8] and multi-agent RL problems [274, 64, 239]. Independent policy gradient [273, 177] is probably the most practical protocol in multi-agent RL, where each player behaves myopically by only observing her own rewards and actions (as well as the system states), while individually optimizing her own policy. More importantly, independent learning dynamics do not scale exponentially with the number of players in the game. Recently, [64, 131, 277] have in fact shown that multi-agent RL players can perform policy gradient updates independently, while enjoying global non-asymptotic convergence. However, these results are focused on the basic tabular setting in which the value functions are represented by tables; they do not carry over to large-scale multi-agent RL problems in which the state space size is potentially infinite and the number of players is large. This motivates the following question:

Can we design independent policy gradient methods for large-scale Markov games, with non-asymptotic global convergence guarantees?

In this chapter, we provide the first affirmative answer to this question for a class of mixed cooperative/competitive Markov games: Markov potential games [147, 131, 277]. In Section 5.2, we introduce Markov potential games, Nash equilibrium, and provide necessary background material. In Section 5.3, we present an independent learning protocol. We propose an independent policy gradient method for Markov potential games in Section 5.4, and establish the Nash regret analysis in Section 5.5. In Section 5.6, we establish an extension of our method and analysis to the linear function approximation setting.
In Section 5.7, we establish game-agnostic convergence of an optimistic independent policy gradient method for both Markov cooperative games and zero-sum Markov games. We provide computational experiments to demonstrate the merits and the eectiveness of our theoretical ndings in Section 5.8, and we close the chapter with concluding remarks in Section 5.9. 5.2 Markovpotentialgames We consider anN-player, innite-horizon, discounted Markov potential game (MPG), MPG (S;fA i g N i = 1 ;P;fr i g N i = 1 ; ; ) (5.1) whereS is a state space,A i is an action space for theith player, with the joint action space of N 2 players denoted asA :=A 1 :::A N ,P is a transition probability measure specied by a distributionP(js;a) overS if N players jointly take an action a fromA in state s, r i : SA! [0; 1] is an immediate reward function for theith player, 2 [0; 1) is a discount factor, and is an initial state distribution overS. We assume that all action spaces are nite with the 156 same sizeA = A i =jA i j for alli = 1;:::;N. It is straightforward to apply our analysis to the general case in which players’ nite action spaces have dierent sizes. For theith player, (A i ) represents the probability simplex over the action setA i . A stochas- tic policy for playeri is given by i :S! (A i ) that species the action distribution i (js)2 (A i ) for each state s 2 S. The set of stochastic policies for player i is denoted by i := ((A i )) jSj , the joint probability simplex is given by (A) := (A 1 ):::(A N ), and the joint policy space is := ((A)) jSj . A Markov product policy :=f i g N i = 1 2 forN players con- sists of the policy i 2 for all playersi = 1;:::;N. We use the shorthand i =f k g N k = 1;k6=i to represent the policy of all but theith player. We denote byV i :S!R theith player value function under the joint policy, starting from an initial states (0) =s: V i (s) := E " 1 X t = 0 t r i (s (t) ;a (t) ) s (0) =s # where the expectationE is over a (t) (js (t) ) and s (t+1) P(js (t) ;a (t) ). Finally, V i () denotes the expected value function ofV i (s) over a state distribution,V i () :=E s [V i (s)]. In an MPG, at any states2S, there exists a global function – the potential function (s): S!R – that captures the incentive of all players to vary their policies at states, V i ; i i (s) V 0 i ; i i (s) = i ; i (s) 0 i ; i (s) for any policies i ; 0 i 2 i and i 2 i . Let () :=E s [ (s)] be the expected potential function over a state distribution. Thus,V i ; i i ()V 0 i ; i i () = i ; i () 0 i ; i (). There always exists a constantC > 0 such thatj () 0 ()j C for any; 0 ;; see a 157 trivial upper bound in Lemma 68 in Appendix D.7. An important subclass of MPGs is given by Markov cooperative games in which all players share the same reward function r = r i for all i = 1;:::;N. We also denote byQ i :SA!R the state-action value function under policy, starting from an initial state-action pair (s (0) ;a (0) ) = (s;a): Q i (s;a) :=E " 1 X t = 0 t r i (s (t) ;a (t) ) s (0) =s;a (0) =a # : The value function can be equivalently expressed asV i (s) = P a 0 2A (a 0 js)Q i (s;a 0 ). For each playeri, by averaging out i , we can dene the averaged state-action value function Q i ; i i : SA i !R, Q i ; i i (s;a i ) := X a i 2A i i (a i js)Q i ; i i (s;a i ;a i ) whereA i is the set of actions of all but the ith player. We use the shorthand Q i for Q i ; i i when i and i are from the same joint policy. It is straightforward to see thatV i ;Q i , and Q i are bounded between 0 and 1=(1 ). 
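For small games, the value functions introduced above can be computed exactly by solving the Bellman evaluation equations. The Python sketch below does this for a randomly generated two-player game with a handful of states and actions (an illustrative setup, not an example from the text): it evaluates Q_i, V_i, and the averaged function bar{Q}_i that appears in the policy updates studied later in this chapter.

```python
import numpy as np

np.random.seed(1)
S, A1, A2, gamma = 3, 2, 2, 0.9                       # illustrative sizes and discount factor
P = np.random.rand(S, A1, A2, S); P /= P.sum(-1, keepdims=True)   # P(s'|s, a1, a2)
r = np.random.rand(2, S, A1, A2)                      # rewards r_i(s, a1, a2) in [0, 1]
pi1 = np.random.dirichlet(np.ones(A1), size=S)        # player 1 policy pi_1(a1|s)
pi2 = np.random.dirichlet(np.ones(A2), size=S)        # player 2 policy pi_2(a2|s)

# Policy-induced transition matrix and expected rewards under the product policy.
joint = np.einsum('sa,sb->sab', pi1, pi2)             # pi(a1, a2|s)
P_pi = np.einsum('sab,sabt->st', joint, P)
r_pi = np.einsum('sab,isab->is', joint, r)

# V_i solves (I - gamma * P_pi) V_i = r_i^pi; Q_i(s,a) = r_i(s,a) + gamma * E[V_i(s')].
V = np.stack([np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi[i]) for i in range(2)])
Q = r + gamma * np.einsum('sabt,it->isab', P, V)

# Averaged functions: bar{Q}_1(s, a1) averages Q_1 over a2 ~ pi_2, and vice versa.
Qbar1 = np.einsum('sb,sab->sa', pi2, Q[0])
Qbar2 = np.einsum('sa,sab->sb', pi1, Q[1])

# Sanity check: V_i(s) = sum_{a_i} pi_i(a_i|s) * bar{Q}_i(s, a_i).
assert np.allclose(V[0], (pi1 * Qbar1).sum(-1)) and np.allclose(V[1], (pi2 * Qbar2).sum(-1))
```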
We recall the notion of (Markov perfect stationary) Nash equilibrium [90]. A joint policy ? is called a Nash equilibrium if for each playeri = 1;:::;N, V ? i ; ? i i (s) V i ; ? i i (s); for all i 2 i ; s2S and called an-Nash equilibrium if fori = 1;:::;N, V ? i ; ? i i (s) V i ; ? i i (s) ; for all i 2 i ; s2S: 158 Nash equilibria for MPGs with nite states and actions always exist [90]. When the state space is innite, we assume the existence of a Nash equilibrium; see [219, 150, 151, 12] for cases with countable or compact state spaces. Given policy and initial states (0) , we dene the discounted state visitation distribution, d s (0) (s) = (1 ) 1 X t = 0 t Pr (s (t) =sjs (0) ): For a state distribution, dened (s) = E s (0) [d s (0) (s) ]. By denition,d (s) (1 )(s) for any ands. Remark11 Itisusefultointroduceavariantoftheperformancedierencelemma[8]formultiple players; for other versions, see [274, 64, 277, 131]. For theith player, if we x the policy i and any state distribution, then for any two policies ^ i and i , V ^ i ; i i ()V i ; i i () = 1 1 X s;a i d ^ i ; i (s) (^ i i )(a i js) Q i ; i i (s;a i ) where Q i ; i i (s;a i ) = P a i i (a i js)Q i ; i i (s;a i ;a i ): It is common to use the distribution mismatch coecient to measure the exploration diculty in policy optimization [8]. We next dene a distribution mismatch coecient for MPGs [131] in Denition 2, and its minimax variant in Denition 3. Denition2(Distributionmismatchcoecient) For any distribution2 (S) and policy 2 ,thedistributionmismatchcoecient isthemaximumdistributionmismatchof relative to, := sup 2 d = 1 , where the divisiond = is evaluated in a componentwise manner. 159 Denition3(Minimaxdistributionmismatchcoecient) For any distribution2 (S), theminimaxdistributionmismatchcoecient ~ istheminimaxvalueofthedistributionmismatch of relative to , ~ := inf 2 (S) sup 2 d = 1 , where the division d = is evaluated in a componentwise manner. Other notation. We denote bykk the` 2 -norm of a vector or the spectral norm of a matrix. The inner product of a function f:SA! R with p2 (A) at xed s2S is given by hf(s;);p()i A := P a2A f(s;a)p(a). The ` 2 -norm projection operator onto a convex set is dened asP (x) := argmin x 0 2 kx 0 xk. For functionsf andg, we writef(n) = O(g(n)) if there existsN <1 andC <1 such thatf(n)Cg(n) fornN, and writef(n) = ~ O(g(n)) if logg(n) appears inO(). We use “.” and “&” to denote “” and “” up to a constant. 5.3 Independentlearningprotocol Algorithm10 Independent policy gradient ascent 1: Parameters:> 0. Initialization: Let (1) i (a i js) = 1=A fors2S,a i 2A i andi = 1;:::;N. 2: for stept = 1;:::;T do 3: for playeri = 1;:::;N (in parallel)do 4: Dene playeri’s policy ons2S, (t+1) i (js) := argmax i (js)2 (A i ) i (js); Q (t) i (s;) A i 1 2 i (js) (t) i (js) 2 (5.2) where Q (t) i (s; a i ) is a shorthand for Q (t) i ; (t) i i (s;a i ) (dened in Denition 5.2). 5: endfor 6: endfor We examine an independent learning setting [273, 64, 177] in which all players repeatedly execute their own policy and update rules individually. At each timet, all players propose their 160 own polices (t) i :S! (A i ) with the player indexi = 1;:::;N, while a game oracle can either evaluate each player’s policy or generate a set of sample trajectories for each player. In repeating such protocol forT times, each player behaves myopically in optimizing its own policy. 
To evaluate the learning performance, we introduce a notion of regret, Nash-Regret(T ) := 1 T T X t = 1 max i max 0 i V 0 i ; (t) i i ()V (t) i () which averages the worst player’s local gaps inT iterations: max 0 i V 0 i ; (t) i i ()V (t) i () fort = 1;:::;T , where max 0 i V 0 i ; (t) i i () is theith player best response given (t) i . In Nash-Regret(T ), we compare the learnt joint policy (t) with the best policy that theith player can take by xing (t) i . We notice that Nash-Regret is closely related to the notion of dynamic regret [285] in which the regret comparator changes over time. This is a suitable notion because the environment is non-stationary from the perspective of an independent learner [156, 273]. To obtain an-Nash equilibrium (t ? ) with a tolerance> 0, our goal is to show the following average performance, Nash-Regret(T ) = : The existence of sucht ? is straightforward, t ? := argmin 1tT max i max 0 i V 0 i ; (t) i i ()V (t) i () : Since each summand above is non-negative, V (t ? ) i () V 0 i ; (t ? ) i i () for any 0 i and i = 1;:::;N, which implies that (t ? ) is an-Nash equilibrium. 161 5.4 Independentpolicygradientmethods In this section, we assume that we have access to exact gradient and examine a gradient-based method for learning a Nash equilibrium in Markov potential/cooperative games. 5.4.1 PolicygradientforMarkovpotentialgames A natural independent learning scheme for MPGs is to let every player independently perform policy gradient ascent [131, 277]. In this approach, theith player updates its policy according the gradient of the value function with respect to the policy parameters, (t+1) i (js) P (A i ) (t) i (js) + @V i () @ i (a i js) = (t) @V i () @ i (a i js) = 1 1 d (s) Q i (s;a i ) (5.3) where the calculation for the gradient in (5.3) can be found in [8, 131, 277]. Update rule (5.3) may suer from a slow learning rate for some states. Since the gradient with respect to i (a i js) scales withd (s) – which may be small if the current policy has small visitation frequency tos – the corresponding states may experience slow learning progress. To address this issue, we propose the following update rule (equivalent to (5.2) in Algorithm 10): (t+1) i (js) P (A i ) (t) i (js) + Q (t) i (s;) (5.4) which essentially removes thed (s)=(1 ) factor in standard policy gradient (5.3) and allevi- ates the slow-learning issue. Interestingly, update rule (5.4) for the single-player MDP has also 162 been studied in [247], concurrently. However, since the optimal value is not unique, the analysis of [247] does not apply to our multi-player case for which many Nash policies exist and the set that contains them is non-convex [131]. We also note that regularized variants of (5.4) for the single-player MDP appeared in [123, 270]. Furthermore, in contrast to (5.3), our update rule (5.4) is invariant to the initial state distribu- tion. This allows us to establish performance guarantees simultaneously for all in a similar way as typically done for natural policy gradient (NPG) and other policy mirror descent algo- rithms for single-player MDPs [8, 123, 270]. Theorem 32 establishes performance guarantees for Algorithm 10; see Section 5.5.2 for proof. Theorem32(Nash-RegretboundforMarkovpotentialgames) For MPG (5.1) with an ini- tialstatedistribution,ifallplayersindependentlyperformthepolicyupdateinAlgorithm10then, for two dierent choices of stepsize, we have Nash-Regret(T ) . 
8 > > > > < > > > > : p ~ AN (C ) 1 4 (1 ) 9 4 T 1 4 ; = (1 ) 5 2 p C NA p T min( ;S) 2 p ANC (1 ) 3 p T ; = (1 ) 4 8 min( ;S) 3 NA : Depending on the stepsize , Theorem 32 provides two rates for the average Nash regret: T 1=4 andT 1=2 . The technicalities behind these choices will be explained later and, to obtain an -Nash equilibrium, our two bounds suggest respective iteration complexities, ~ 2 A 2 N 2 C (1 ) 9 4 and min( ;S) 4 ANC (1 ) 6 2 : 163 Compared with the iteration complexity guarantees in [131, 277], our bounds in Theorem 32 improve the dependence on the distribution mismatch coecient and the state space size S =jSj. Since our minimax distribution mismatch coecient ~ satises ~ min( ;S) our ~ -dependence or min( ;S)-dependence are less restrictive than the explicitS-dependence in [131, 277]. Importantly, this permits our bounds to work for systems with large number of states, and makes Algorithm 10 suitable for sample-based scenario with function approximation (see Section 5.6). With polynomial dependence on the number of playersN instead of exponen- tial, Algorithm 10 overcomes the curse of multiagents [111, 205]. In terms of problem parameters ( ;A;N;C ), our iteration complexity either improves or becomes slightly worse. Remark12(Innitestatespace) When the state space is innite, explicitS-dependence disap- pears in our iteration complexities. ImplicitS-dependence only exists in the distribution mismatch coecient or ~ . However, it is easy to bound by devising an initial state distribution without introducing constraints on the MDP dynamics. For instance, in MPGs with agent-independent tran- sitions (in which every state is a potential game and transitions do not depend on actions [131]), if weselecttobethestationarystatedistributiond then = 1regardlessofthestate-spacesizeS. Remark13(Ourkeytechniques) A key step of the analysis is to quantify the policy improve- ment regarding the potential function in each iteration. Similar to the standard descent lemma in 164 optimization[15], applyingtheprojectedpolicygradientalgorithmtoasmooth yieldsthefollow- ing ascent property (cf. Eq. (9) in [131] and Lemmas 11 and 12 in [277]), (t+1) () (t) () & 1 N X i=1 X s k (t+1) i (js) (t) i (js)k 2 where > 0 is related to the smoothness constant (or the second-order derivative) of the potential function. However,sincethesearchdirectioninourpolicyupdateisnotthestandardsearchdirection utilized in policy gradient, this ascent analysis does not apply to our algorithm. To obtain such improvement bound, it is crucial to analyze the joint policy improvement. Let usconsidertwoplayersiandj: playerichangesitspolicyfrom i to 0 i tomaximizeitsownreward based on the current policy prole ( i ; j ) and playerj changes its policy from j to 0 j in its own interest. What is the overall progress after they independently change their policies from ( i ; j ) to ( 0 i ; 0 j )? One method of capturing the joint policy improvement exploits the smoothness of the potential function, which is useful in the standard policy gradient ascent method [131, 277]. In our analysis, we connect the joint policy improvement with the individual policy improvement via the performance dierence lemma. In particular, as shown in Lemma 38, Lemma 34 and Lemma 37 provideaneectivemeansforanalyzingthejointpolicyimprovement. Theproposedapproachcould be of independent interests for analyzing other Markov games. InLemma38,weobtaintwodierentjointpolicyimprovementboundsbydealingwiththecross termsintwodierentways(seetheproofsfordetails). 
Hence,weestablishtwodierentNash-Regret boundsin Theorem32: onehas betterdependence onT whilethe otherhas betterdependenceon . Even though, it is an open issue how to achieve the best of the two, we next show that this is indeed possible for a special case: Markov cooperative games. 165 5.4.2 FasterratesforMarkovcooperativegames When all players use the same reward function, i.e., r = r i for all i = 1;:::;N, MPG (5.1) reduces to a Markov cooperative game. In this case,V i =V andQ i =Q for alli = 1;:::;N and Algorithm 10 works immediately. Thus, we continue to use Nash-Regret i (T ) that is dened throughV 0 i ; (t) i i =V 0 i ; (t) i andV (t) i =V (t) . Algorithm 33 provides a Nash-Regret bound for Markov cooperative games; see Section 5.5.3 for proof. Theorem33(Nash-RegretboundforMarkovcooperativegames) ForMPG (5.1)withiden- ticalrewardsandaninitialstatedistribution,ifallplayersindependentlyperformthepolicyupdate in Algorithm 10 with stepsize = (1 )=(2NA) then, Nash-Regret(T ) . p ~ AN (1 ) 2 p T . For Markov cooperative games, Theorem 33 achieves the best of the two bounds in Theo- rem 32 and an-Nash equilibrium is achieved with the following iteration complexity, ~ AN (1 ) 4 2 : This iteration complexity improves the ones provided in [131, 277] in several aspects. In par- ticular, we have introduced the minimax distribution mismatch coecient ~ , which is upper bounded by . When we take this lower bound, our bound improves the -dependence in [131, 277] from 2 to . We note that if we view the Markov cooperative game as an MPG, then the 166 value functionV serves as a potential function which is bounded between 0 and 1=(1 ). Thus, our (1 )-dependence matches the one in [277] and improves the one in [131] by (1 ) 2 . 5.5 Nashregretanalysis In this section, we study Nash regret of Algorithm 10 for Markov potential games and Markov co- operative games. We present useful lemmas in Section 5.5.1, and provide the proof of Theorem 32 in Section 5.5.2 and the proof of Theorem 33 in Section 5.5.3. 5.5.1 Settinguptheanalysis We rst introduce a decomposition of the dierence of multivariate functions, which is useful to decompose the dierence of potential functions () at two dierent policies for any state distribution. Let : !R be any multivariate function mapping a policy2 to a real number. In Lemma 34, we show that the dierence 0 at any two policies ; 0 equals to a sum of several partial dierences. For i;j2f1;:::;Ng with i < j, we denote by “i j” the set of indicesfkji<k<jg, “j” the set of indices fkjk =j + 1;:::;Ng. We use the shorthand I :=f k g k2I to represent the joint policy for all playersk2 I. For example, whenI = i j, I =f k g j1 k =i+1 is a joint policy for players from i + 1 toj 1; j can be introduced similarly. 167 Lemma34(Multivariatefunctiondierence) For any function : ! R, and any two policies; 0 2 , 0 = N X i = 1 0 i ; i + N X i = 1 N X j =i+1 j ; 0 i ; 0 j j ; i ; 0 j j ; 0 i ; j + j ; i ; j : (5.5) Proof. See Appendix D.1. It is useful to introduce the following two dierence bounds. Lemma35(State-actionvaluefunctiondierence) Suppose i < j for i;j = 1;:::;N. Let ~ ij be the policy for all players buti;j and i be the policy for playeri. For any two policies for playerj: j and 0 j , we have max s k Q ~ ij ; i ; 0 j i (s;) Q ~ ij ; i ; j i (s;)k 1 1 (1 ) 2 max s k 0 j (js) j (js)k 1 : Proof. See Appendix D.2. Lemma36(Visitationmeasuredierence) Let and 0 be two policies for an MDP, and be an initial state distribution. Then, X s d (s)d 0 (s) max s k(js) 0 (js)k 1 : Proof. 
See Appendix D.3. 168 5.5.2 NashregretanalysisforMarkovpotentialgames We rst extend the the 1st-order performance dierence in Remark 11 to the 2nd-order perfor- mance dierence, which is useful to measure the joint policy improvement from multiple players. Lemma37(The2nd-orderperformancedierence) Inatwo-playercommon-rewardMarkov game with state spaceS and action setsA 1 ,A 2 , let 1 = ((A 1 )) jSj and 2 = ((A 2 )) jSj be player 1 and player 2’s policy sets, respectively. Then, for anyx;x 0 2 1 andy;y 0 2 2 , V x;y () V x 0 ;y () V x;y 0 () + V x 0 ;y 0 () 2 2 A (1 ) 4 X s d x 0 ;y 0 (s) kx(js)x 0 (js)k 2 +ky(js)y 0 (js)k 2 where is the distribution mismatch coecient relative to. Proof. See Appendix D.4. We now apply Lemma 34 to the potential function () at two consecutive policies (t+1) and (t) in Algorithm 10, where is an initial state distribution. We use the shorthand (t) () for (t) (), the value of potential function at policy (t) . Lemma38(Policyimprovement: Markovpotentialgames) For MPG (5.1) with any initial statedistribution,thedierenceofpotentialfunctions ()attwoconsecutivepolicies (t+1) and (t) in Algorithm 10, (t+1) () (t) () can be lower bounded by either (i) or (ii), (i) 1 2(1 ) N X i = 1 X s d (t+1) i ; (t) i (s) (t+1) i (js) (t) i (js) 2 4 2 A 2 N 2 (1 ) 5 (ii) 1 2(1 ) N X i = 1 X s d (t+1) i ; (t) i (s) 1 4 3 AN (1 ) 4 (t+1) i (js) (t) i (js) 2 169 where isthestepsize,N isthenumberofplayers,Aisthesizeofoneplayer’sactionspace,and is the distribution mismatch coecient relative to. Proof. We let 0 = (t+1) and = (t) for brevity. By Lemma 34 with = (), it is equivalent to analyze (t+1) () (t) () = Di + Di (5.6) Di = N X i = 1 0 i ; i () () Di = N X i = 1 N X j =i+1 j ; 0 i ; 0 j () j ; i ; 0 j () j ; 0 i ; j () + j ; i ; j () : BoundingDi . By the property of the potential function (), 0 i ; i () () = V 0 i ; i i () V i () = 1 1 X s;a i d 0 i ; i (s) ( 0 i (a i js) i (a i js)) Q i ; i i (s;a i ) (5.7) where the second equality is due to the perforamnce dierence in Remark 11 using ^ i = 0 i and i = i . The optimality of 0 i = (t+1) i in line 4 of Algorithm 10 leads to 0 i (js); Q i ; i i (s;) A i 1 2 0 i (js) i (js) 2 i (js); Q i ; i i (s;) A i : (5.8) Combining (5.7) and (5.8) yields 0 i ; i () () 1 2(1 ) X s d 0 i ; i (s)k 0 i (js) i (js)k 2 : 170 Therefore, Di 1 2(1 ) N X i = 1 X s d (t+1) i ; (t) i (s) (t+1) i (js) (t) i (js) 2 : (5.9) Bounding Di . For simplicity, we denote ~ ij as the joint policy of playersNnfi;jg where playersj use 0 . 
For each summand inDi , ~ ij ; 0 i ; 0 j () ~ ij ; i ; 0 j () ~ ij ; 0 i ; j () + ~ ij ; i ; j () (a) = V ~ ij ; 0 i ; 0 j i () V ~ ij ; i ; 0 j i () V ~ ij ; 0 i ; j i () + V ~ ij ; i ; j i () (b) = 1 1 X s;a i d ~ ij ; 0 i ; 0 j (s) ( 0 i (a i js) i (a i js)) Q ~ ij ; i ; 0 j i (s;a i ) 1 1 X s;a i d ~ ij ; 0 i ; j (s) ( 0 i (a i js) i (a i js)) Q ~ ij ; i ; j i (s;a i ) = 1 1 X s;a i d ~ ij ; 0 i ; 0 j (s) ( 0 i (a i js) i (a i js)) Q ~ ij ; i ; 0 j i (s;a i ) Q ~ ij ; i ; j i (s;a i ) + 1 1 X s;a i d ~ ij ; 0 i ; 0 j (s)d ~ ij ; 0 i ; j (s) ( 0 i (a i js) i (a i js)) Q ~ ij ; i ; j i (s;a i ) 1 1 X s d ~ ij ; 0 i ; 0 j (s)k 0 i (js) i (js)k 1 k Q ~ ij ; i ; 0 j i (s;) Q ~ ij ; i ; j i (s;)k 1 1 1 X s d ~ ij ; 0 i ; 0 j (s)d ~ ij ; 0 i ; j (s) k 0 i (js) i (js)k 1 k Q ~ ij ; i ; j i (s;)k 1 (c) 1 (1 ) 3 max s k 0 i (js) i (js)k 1 max s k 0 j (js) j (js)k 1 1 (1 ) 2 max s k 0 j (js) j (js)k 1 max s k 0 i (js) i (js)k 1 (d) 8 2 A 2 (1 ) 5 where (a) is due to the property of the potential function, (b) is due to the performance dierence in Remark 11; for (c), we use Lemma 35, Lemma 36, and the fact that P s d ~ ij ; 0 i ; 0 j (s) = 1 and 171 k Q ~ ij ; i ; j i (s;)k 1 1 1 ; The last inequality (d) follows a direct result from the optimality of 0 i = (t+1) i given by (5.8) andkk p Akk 1 andkk 1 p Akk: k 0 i (js) i (js)k 2 2 (t+1) i (js) (t) i (js); Q i ; i i (s;) A i 2k (t+1) j (js) (t) j (js)kk Q i ; i i (s;)k and thus k (t+1) j (js) (t) j (js)k 2k Q i ; i i (s;)k 2 p A 1 k (t+1) j (js) (t) j (js)k 1 2A 1 : Therefore, Di N(N 1) 2 8 2 A 2 (1 ) 5 4 2 A 2 N 2 (1 ) 5 : (5.10) We now complete the proof of (i) by combining (5.6), (5.9), and (5.10). Alternatively, by Lemma 37, we can bound each summand ofDi by ~ ij ; 0 i ; 0 j () ~ ij ; i ; 0 j () ~ ij ; 0 i ; j () + ~ ij ; i ; j () = V ~ ij ; 0 i ; 0 j i () V ~ ij ; i ; 0 j i () V ~ ij ; 0 i ; j i () + V ~ ij ; i ; j i () 2 2 A (1 ) 4 X s d ~ ij ; i ; j (s) k i (js) 0 i (js)k 2 +k j (js) 0 j (js)k 2 : 172 Thus, Di 2 2 A (1 ) 4 N X i = 1 N X j =i+1 X s d ~ ij ; i ; j (s) k i (js) 0 i (js)k 2 +k j (js) 0 j (js)k 2 2 3 NA (1 ) 5 N X i = 1 X s d (t+1) i ; (t) i (s)k (t) i (js) (t+1) i (js)k 2 where the last inquality is due to d (s) d 0 (s) 1 for any; 0 ;s. Combining the inequality above with (5.6) and (5.9) nishes the proof of (ii). Proof. [Theorem 32] By the optimality of (t+1) i in line 4 of Algorithm 10, 0 i (js) (t+1) i (js); Q (t) i (s;) (t+1) i (js) + (t) i (js) A i 0; for any 0 i 2 i : Hence, if 1 p A , then for any 0 i 2 i , 0 i (js) (t) i (js); Q (t) i (s;) A i = 0 i (js) (t+1) i (js); Q (t) i (s;) A i + (t+1) i (js) (t) i (js); Q (t) i (s;) A i 1 0 i (js) (t+1) i (js); (t+1) i (js) (t) i (js) A i + (t+1) i (js) (t) i (js); Q (t) i (s;) A i (a) 2 k (t+1) i (js) (t) i (js)k +k (t+1) i (js) (t) i (js)kk Q (t) i (s;)k (b) 3 (t+1) i (js) (t) i (js) where in (a) we apply the Cauchy-Schwarz inequality and thatkpp 0 kkpp 0 k 1 2 for any two distributionsp andp 0 ; (b) is because ofk Q (t) i (s;)k p A 1 and 1 p A . 173 5Therefore, for any initial distribution, T X t = 1 max i max 0 i V 0 i ; (t) i i ()V (t) i () (a) = 1 1 T X t = 1 max 0 i X s;a i d 0 i ; (t) i (s) 0 i (a i js) (t) i (a i js) Q (t) i (s;a i ) (b) 3 (1 ) T X t = 1 X s d 0 i ; (t) i (s) (t+1) i (js) (t) i (js) (c) . 
q sup 2 kd =k 1 (1 ) 3 2 T X t = 1 X s r d 0 i ; (t) i (s)d (t+1) i ; (t) i (s) (t+1) i (js) (t) i (js) (d) q sup 2 kd =k 1 (1 ) 3 2 v u u t T X t = 1 X s d 0 i ; (t) i (s) v u u t T X t = 1 X s d (t+1) i ; (t) i (s) (t+1) i (js) (t) i (js) 2 (e) q sup 2 kd =k 1 (1 ) 3 2 p T v u u t T X t = 1 N X i = 1 X s d (t+1) i ; (t) i (s) (t+1) i (js) (t) i (js) 2 (5.11) where (a) is due to the performance dierence in Remark 11 and we slightly abuse the notation i to represent argmax i , in (b) we slightly abuse the notation 0 i to represent argmax 0 i , in (c) we choose an arbitrary2 (S) and use the following inequality: d 0 i ; (t) i (s) d (t+1) i ; (t) i (s) d 0 i ; (t) i (s) (1 )(s) sup 2 kd =k 1 1 : We apply the Cauchy–Schwarz inequality in (d), and nally we replacei ( argmax i in (a)) in the last square root term in (e) by the sum over all players. 174 If we proceed (5.11) with = argmin 2 (S) max 2 kd =k 1 , then, T X t = 1 max i max 0 i V 0 i ; (t) i i ()V (t) i () (a) p ~ (1 ) 3 2 p T s 2(1 ) ( (T +1) () (1) ()) + 4 3 A 2 N 2 (1 ) 4 T (b) . s ~ TC (1 ) 2 + s ~ T 2 A 2 N 2 (1 ) 7 where in (a) we apply the rst bound (i) in Lemma 38 (with = ) and use Denition 3: ~ = min 2 (S) max 2 kd =k 1 , and in (b) we usej () 0 ()jC for any; 0 , and further simplify the bound in (b). We complete the proof for the rst bound by taking stepsize = (1 ) 2:5 p C NA p T (by the upper bound ofC given in Lemma 68, the condition 1 p A is satised). If we proceed (5.11) with the second bound (ii) in Lemma 38 with the choice of (1 ) 4 8 3 NA , then, T X t = 1 max i max 0 i V 0 i ; (t) i i ()V (t) i () q sup 2 kd =k 1 (1 ) 3 2 p T q 4(1 ) ( (T +1) () (1) ()) . s sup 2 kd =k 1 TC (1 ) 2 : We next discuss two special choices of for proving our bound. First, if =, then (1 ) 4 8 3 NA . By letting = (1 ) 4 8 3 NA , the last square root term can be bounded byO q 4 NATC (1 ) 6 . Second, if = Unif S , the uniform distribution overS, then 1 S , which allows a valid choice = (1 ) 4 8S 3 NA (1 ) 4 8 3 NA . Hence, we can bound the last square root term byO q S 4 NATC (1 ) 6 . Since is arbitrary, combining these two special choices completes the proof. 175 5.5.3 NashregretanalysisforMarkovcooperativegames We rst establish policy improvement regarding the state-action value function at two consecu- tive policies (t+1) and (t) in Algorithm 10. Lemma39(Policyimprovement: Markovcooperativegames) For MPG (5.1) with identical rewardsandaninitialstatedistribution> 0,ifallplayersindependentlyperformthepolicyupdate in Algorithm 10 with stepsize 1 2N , then for anyt and anys, E a (t+1) (js) Q (t) (s;a) E a (t) (js) Q (t) (s;a) 1 4 N X i = 1 k (t+1) i (js) (t) i (js)k 2 where is the stepsize andN is the number of players. Proof. Fixing the time t and the state s, we apply 34 to = E a(js) Q (t) (s;a) , where Q (t) :=Q (t) (recall that is a joint policy of all players). By Lemma 34, for any two policies 0 and, E a 0 (js) Q (t) (s;a) E a(js) Q (t) (s;a) = N X i = 1 E a i 0 i (js);a i i (js) Q (t) (s;a) E a(js) Q (t) (s;a) + N X i = 1 N X j =i+1 E a i 0 i (js);a j 0 j (js);a ij ~ ij (js) Q (t) (s;a) E a i i (js);a j 0 j (js);a ij ~ ij (js) Q (t) (s;a) E a i 0 i (js);a j j (js);a ij ~ ij (js) Q (t) (s;a) +E a i i (js);a j j (js);a ij ~ ij (js) Q (t) (s;a) ! (5.12) 176 where ~ ij is a joint policy of playersNnfi;jg in which playersj use 0 . Particularly, we choose 0 = (t+1) and = (t) . 
Thus, we can reduce (5.12) into E a 0 (js) Q (t) (s;a) E a(js) Q (t) (s;a) = N X i = 1 X a i ( 0 i (a i js) i (a i js)) Q (t) i (s;a i ) + N X i = 1 N X j =i+1 X a i ;a j ( 0 i (a i js) i (a i js)) 0 j (a j js) j (a j js) E a ij ~ ij (js) Q (t) (s;a) (a) N X i = 1 1 2 k 0 i (js) i (js)k 2 1 1 N X i = 1 N X j =i+1 X a i ;a j j 0 i (a i js) i (a i js)j 0 j (a j js) j (a j js) (b) N X i = 1 1 2 k 0 i (js) i (js)k 2 A 2(1 ) N X i = 1 N X j =i+1 k 0 i (js) i (js)k 2 +k 0 j (js) j (js)k 2 = N X i=1 1 2 k 0 i (js) i (js)k 2 (N 1)A 2(1 ) N X i=1 k 0 i (js) i (js)k 2 (c) N X i = 1 1 4 k 0 i (js) i (js)k 2 where (a) is due to the optimality condition (5.8) and Q (t) (s;a) 1 1 , (b) is due tohx;yi kxk 2 +kyk 2 2 , and (c) follows the choice of 1 2NA . Proof. [Proof of Theorem 33] By the performance dierence in Remark 11 and Lemma 39, we have for any2 (S), V (t+1) () V (t) () = 1 1 X s;a d (t+1) (s) (t+1) (ajs) (t) (ajs) Q (t) (s;a) 1 4(1 ) N X i = 1 X s d (t+1) (s)k (t+1) i (js) (t) i (js)k 2 : (5.13) 177 By the same argument as the proof of Theorem 32, T X t = 1 max i max 0 i V 0 i ; (t) i ()V (t) () (a) 3 (1 ) T X t = 1 X s d 0 i ; (t) i (s) (t+1) i (js) (t) i (js) (b) . p ~ (1 ) 3 2 T X t = 1 X s r d 0 i ; (t) i (s)d (t+1) (s) (t+1) i (js) (t) i (js) p ~ (1 ) 3 2 v u u t T X t = 1 X s d 0 i ; (t) i (s) v u u t T X t = 1 X s d (t+1) (s) (t+1) i (js) (t) i (js) 2 (c) p ~ (1 ) 3 2 p T v u u t T X t = 1 N X i = 1 X s d (t+1) (s) (t+1) i (js) (t) i (js) 2 (d) p ~ (1 ) 3 2 p T q 4(1 ) (V (T +1) ()V (1) ()) where in (a) we slightly abuse the notationi to represent argmax i as in (5.11), in (b) we take = argmin 2(S) max 2 kd =k 1 and use the denition of ~ from Denition 3, and we replace i (argmax i in (a)) in the last square root term in (c) by the sum over all players, and we apply (5.13) in (d). Finally, we complete the proof by taking stepsize = 1 2NA and usingV (T +1) ()V (1) () 1 1 . 178 5.6 Independentpolicygradientwithfunctionapproxima- tion We next remove the exact gradient requirement and apply Algorithm 10 to the linear function approximation setting. In what follows, we assume that the averaged action value function is linear in a given feature map. Algorithm11 Independent policy gradient with linear function approximation 1: Parameters:K,W , and> 0. 2: Initialization: Let (1) i (a i js) = 1=A fors2S,a i 2A i andi = 1;:::;N. 3: for stept = 1;:::;T do 4: // Phase 1 (data collection) 5: for roundk = 1;:::;K do 6: For eachi2 [N], sampleh i Geometric(1 ) andh 0 i Geometric(1 ). 7: Draw an initial state s (0) . 8: Continuing from s (0) , let all players interact with each other usingf (t) i g N i = 1 for H = max i (h i + h 0 i ) steps, which generates a state-joint-action-reward trajectory s (0) ; a (0) ; r (0) ; s (1) ; a (1) ; r (1) ; :::; s (H) ; a (H) ; r (H) . 
9: Dene for every playeri2 [N], s (k) i = s (h i ) ; a (k) i = a (h i ) i ; R (k) i = h i +h 0 i 1 X h =h i r (h) i : (5.14) 10: endfor 11: // Phase 2 (policy update) 12: for playeri = 1;:::;N (in parallel)do 13: Compute ^ w (t) i as ^ w (t) i argmin kw i kW K X k = 1 R (k) i i (s (k) i ;a (k) i );w i 2 : (5.15) 14: Dene ^ Q (t) i (s;) := i (s;); ^ w (t) i and playeri’s policy fors2S, (t+1) i (js) = argmax i (js)2 (A i ) i (js); ^ Q (t) i (s;) A i 1 2 i (js) (t) i (js) 2 : (5.16) 15: endfor 16: endfor 179 Assumption15(LinearaveragedQ) In MPG (5.1), for each playeri, there is a feature map i : SA i !R d , such that for any (s;a i )2SA i and any policy2 , Q i (s;a i ) = h i (s;a i ); w i i; for somew i 2 R d : Moreover,k i k 1 for alls;a i , andkw i kW for all. Without loss of generality, we can assumeW p d=(1 ); see Lemma 8 in [240]. Assump- tion 15 is a multi-agent generalization of the standard linearQ assumption [4] for single-player MDPs. It is dierent from the multi-agent linear MDP assumption [250, 81] in which both transi- tion and reward functions are linear in given feature maps. In contrast, Assumption 15 qualies each player to estimate its averaged action value function without observing other players’ ac- tions. A special case of Assumption 15 is the tabular case in which the sizes of state/action spaces are nite, and where we can select i to be an indicator function. Since the feature map i is locally-dened coordination between players is avoided [282]. Remark14(Functionapproximation) Since RL with function approximation is often statisti- cally hard, e.g., see [244, 236] for hardness results, assuming regularity of underlying MDPs is nec- essary for the application of function approximation to multi-agent RL in which either the value function [250, 81, 107, 104] or the policy [282] is approximated. Because of restrictive function ap- proximationpower,themainchallengeistheentanglementofpolicyimprovement(oroptimization) andpolicyevaluation (orapproximation)errors. In Theorem40andTheorem41, weshowthatopti- mizationandapproximationerrorsaredecoupledunderAssumption15sothatwecancontrolthem, separately. Our analysis can be generalized to some neural networks, e.g., overparametrized neural 180 networks [139], a rich function class that allows splitting optimization and approximation errors, which we leave for future work. We formally present our algorithm in Algorithm 11. At each stept, there are two phases. In Phase 1, the players begin with the initial state s (0) and simultaneously execute their current policiesf (t) i g N i = 1 to interact with the environment forK rounds. In each roundk, we terminate the interaction at stepH = max i (h i +h 0 i ), whereh i andh 0 i are sampled from a geometric distribu- tionGeometric(1 ), independently; the state ath i naturally follows s (h i ) d (t) . By collecting rewards from steph i toh i +h 0 i 1, as shown in (5.14), we can justifyE[R (k) i ] = Q (t) i ( s (h i ) ; a (h i ) i ) where Q (t) i (;) := Q (t) i (;) and a (h i ) i (t) i (j s (h i ) ), in Appendix D.5.1. In the end of roundk, we collect a sample tuple: (s (k) i ;a (k) i ;R (k) i ) in (5.14) for each playeri. After each player collectsK samples, in Phase 2, they use these samples to estimate Q (t) i (;), which is required for policy updates. By Assumption 15, Q (t) i (s;a i ) = h i (s;a i );w (t) i i;8(s;a i )2SA i where w (t) i represents w (t) i . 
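As an aside on Phase 1, the following sketch shows how a single sample tuple $(\bar{s}_i^{(k)}, \bar{a}_i^{(k)}, R_i^{(k)})$ in (5.14) could be generated, with the geometric draws arranged so that $\bar{s}_i^{(k)}$ is distributed as $d_\rho^{\pi^{(t)}}$ and $\mathbb{E}[R_i^{(k)}] = \bar{Q}_i^{(t)}(\bar{s}_i^{(k)}, \bar{a}_i^{(k)})$. The environment interface (`reset`/`step`) and the exact support convention of the geometric draws are our assumptions for illustration; the rollout is written from the viewpoint of a simulator that sees the joint action, even though each player only keeps its own part of the tuple.

```python
import numpy as np

def sample_tuple(env, policies, gamma, rng, i=0):
    """Phase 1 of Algorithm 11, one round k, from player i's viewpoint.
    Draws h ~ Geom(1-gamma) on {0,1,...} and h' ~ Geom(1-gamma) on {1,2,...},
    rolls out the joint policy, and returns (s_bar, a_bar_i, R_i) as in (5.14),
    so that E[R_i] = Q_bar_i^{(t)}(s_bar, a_bar_i).
    Assumed interfaces: env.reset() -> s and env.step(joint_action) -> (s_next,
    reward_vector); policies[j] is an (S, A_j) array."""
    h = rng.geometric(1.0 - gamma) - 1
    h_prime = rng.geometric(1.0 - gamma)
    s = env.reset()
    traj = []                                   # (state, joint action, rewards) per step
    for _ in range(h + h_prime):
        a = [rng.choice(pi.shape[1], p=pi[s]) for pi in policies]
        s_next, rewards = env.step(a)
        traj.append((s, a, rewards))
        s = s_next
    s_bar, a_bar, _ = traj[h]                   # s_bar is distributed as d_rho^{pi^{(t)}}
    R_i = sum(r[i] for (_, _, r) in traj[h:])   # rewards from step h to h + h' - 1
    return s_bar, a_bar[i], R_i
```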
Our goal is to obtain a solution ^ w (t) i w (t) i using samples, and estimate Q (t) i (s;a i ) via ^ Q (t) i (s;a i ) := h i (s;a i ); ^ w (t) i i;8(s;a i )2SA i : (5.17) We note thatE[R (k) i ] = Q (t) i (s (k) i ;a (k) i ) =h i (s (k) i ;a (k) i );w (t) i i. To obtain ^ w (t) i , the standard approach is to solve linear regression (5.15). 181 We measure the estimation quality of ^ w (t) i via the expected regression loss, L (t) i (w i ) =E (s;a i ) (t) i Q (t) i (s;a i )h i (s;a i );w i i 2 where (t) i (s;a i ) := d (t) (s) (t) i (a i js) and L (t) i (w (t) i ) = 0 by Assumption 15. We make the following assumption for the expected regression loss of ^ w (t) i . Assumption16(Boundedstatisticalerror) Fix a state distribution. For any sequence of iter- ates ^ w (1) i ;:::; ^ w (T ) i fori = 1;:::;N that are generated by Algorithm 11, there exists an stat <1 such that E L (t) i ( ^ w (t) i ) stat for alli andt, where the expectation is on randomness in generating ^ w (t) i . The bound for stat can be established using standard linear regression analysis and it is given by stat =O dW 2 K(1 ) 2 . This bound can be achieved by applying the stochastic projected gradient descent method [101, 58] to the regression problem. We next show how the approximation error aect the convergence. We take an expectation over the randomness of approximately computing ^ w (t) i as Assumption 16. After obtaining ^ Q (t) i (;), we update the polices in (5.16) which is dierent from the up- date in Algorithm 10 in two aspects: (i) the gradient direction ^ Q (t) i (;) is the estimated ver- sion of Q (t) i (;); and (ii) the Euclidean projection set becomes (A i ) :=f (1) i (js) + Unif A i ;8 i (js)g that introduces-greedy policies for exploration [131, 277], where2 (0; 1). Theorem 40 establishes performance guarantees for Algorithm 11; see Appendix D.5.2 for proof. 182 Theorem40(Nash-RegretboundforMarkovpotentialgames) Let Assumption 15 hold for MPG (5.1) with an initial state distribution. If all players independently run Algorithm 11 with = min 2 NA stat (1 ) 2 W 2 1 3 ; 1 2 and Assumption 16 holds, then E [ Nash-Regret(T ) ] . R() + 2 WAN stat (1 ) 5 1 3 R() = 8 > > > > < > > > > : p WN (AC ) 1 4 (1 ) 7 4 T 1 4 ; = (1 ) 3 2 p C WN p AT 2 p ANC (1 ) 3 p T ; = (1 ) 4 16 3 NA : Theorem 40 shows the additive eect of the function approximation error stat on the Nash regret of Algorithm 11. When stat = 0, Theorem 40 matches the rates in Theorem 32 in the exact gradient case. As in Algorithm 10, even though update rule (5.16) iterates over alls2S, we do not need to assume a nite state spaceS. In fact, (5.16) only “denes” a function (t) i (js) instead of “calculating” it. This is commonly used in policy optimization with function approximation, e.g., [47, 146]. To execute this algorithm, (t) i (js) only needs to be evaluated if necessary, e.g., when the states is visited in Phase 1 of Algorithm 11. When we apply stochastic projected gradient updates to (5.15), Algorithm 11 becomes a sample-based algorithm and existing stochastic projected gradient results directly apply. De- pending on the stepsize choice, an-Nash equilibrium is achieved with sample complexities (see Corollary 1 in Appendix D.5.4), TK = O 1 7 and O 1 5 ; respectively. 183 Compared with the sample complexity guarantees for the tabular MPG case [131, 277], our sample complexity guarantees hold for MPGs with potentially innitely large state spaces. 
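Returning to the regression step (5.15), the following sketch illustrates the projected stochastic gradient approach mentioned above for producing $\hat{w}_i^{(t)}$ within the ball $\{\|w\|\leq W\}$, together with the resulting estimate (5.17). The feature map `phi`, the learning rate, and the data arrays are placeholders; any solver that meets Assumption 16 would do.

```python
import numpy as np

def fit_w_hat(phi, states, actions, returns, W, lr, n_epochs=1, rng=None):
    """Approximately solve (5.15): min_{||w|| <= W} sum_k (R_k - <phi(s_k, a_k), w>)^2
    with projected stochastic gradient descent.
    phi(s, a) -> feature vector in R^d (Assumption 15); returns the estimate w_hat."""
    rng = rng or np.random.default_rng(0)
    d = phi(states[0], actions[0]).shape[0]
    w = np.zeros(d)
    K = len(returns)
    for _ in range(n_epochs):
        for k in rng.permutation(K):
            x = phi(states[k], actions[k])
            grad = 2.0 * (x @ w - returns[k]) * x      # gradient of the squared residual
            w -= lr * grad
            norm = np.linalg.norm(w)
            if norm > W:                               # project back onto {||w|| <= W}
                w *= W / norm
    return w

def q_hat(phi, w_hat):
    """Estimated averaged action values, Q_hat(s, a_i) = <phi(s, a_i), w_hat> as in (5.17)."""
    return lambda s, a_i: float(phi(s, a_i) @ w_hat)
```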
When we specialize Assumption 15 to the tabular case, our second sample complexity improves the sample complexity in [131, 277] fromO(1= 6 ) toO(1= 5 ). As before, we get improved performance guarantees when we apply Algorithm 11 to Markov cooperative games. Theorem41(Nash-RegretboundforMarkovcooperativegames) Let Assumption 15 hold for MPG (5.1) with identical rewards and an initial state distribution > 0. If all players indepen- dentlyperformthepolicyupdateinAlgorithm11withstepsize = (1 )=(2NA)andexploration rate = min 2 NA stat (1 ) 2 W 2 1 3 ; 1 2 , with Assumption 16, E [ Nash-Regret(T ) ] . p AN (1 ) 2 p T + 2 WAN stat (1 ) 5 1 3 : We prove Theorem 41 in Appendix D.5.3 and show sample complexity TK = O(1= 5 ) in Corollary 2 of Appendix D.5.4. 5.7 Game-agnosticconvergence In Section 5.4 and Section 5.6, we have shown that our independent policy gradient method con- verges (in best-iterate sense) to a Nash equilibrium of MPGs. For the same algorithm in two- player case, however, [24] showed that players’ policies can diverge for zero-sum matrix games (a single-state case of zero-sum Markov games). A natural question arises: 184 Does there exist a simple gradient-based algorithm that provably converges to a Nash equilibrium in both potential/cooperative and zero-sum games? Unfortunately, classical MWU and optimistic MWU updates do not converge to a Nash equi- librium in zero-sum and coordination games simultaneously [53]. Recently, this question was partially answered by [130, 129] in which the authors established last-iterate convergence ofQ- learning dynamics to a quantal response equilibrium for both zero-sum and potential/cooperative matrix games. In this work, we provide an armative answer to this question for general Markov games that cover matrix games. Specically, we next show that optimistic gradient descent/ascent with a smoothed critic (see Algorithm 12) – an algorithm that converges to a Nash equilibrium in two-player zero-sum Markov games [239] – also converges to a Nash equilibrium in Markov cooperative games. We now setup notation for tabular two-player Markov cooperative games withN = 2,r = r 1 =r 2 ,A =jA 1 j =jA 2 j, andS =jSj. For convenience, we usex s 2R A andy s 2R A to denote policies 1 (js) and 2 (js) taken at states2S, andQ s 2R AA to denoteQ (s;a 1 ;a 2 ) with a 1 2A 1 anda 2 2A 2 . We describe our policy update (5.18) in Algorithm 12: the next iterate (x (t+1) s ;y (t+1) s ) is obtained from two steps of policy gradient ascent with an intermediate iterate ( x (t+1) s ; y (t+1) s ). Motivated by [239], we introduce a criticQ (t) s to learn the value function at each states using the learning rate (t) . When the critic is ideal, i.e.,Q (t) s =Q (t) s , whereQ (t) s is a matrix form ofQ (t) (s;a 1 ;a 2 ) fora 1 2A 1 anda 2 2A 2 , we can view Algorithm 12 as a two-player case of Algorithm 10. In Theorem 42, we establish asymptotic last-iterate convergence of Algorithm 12 in Markov cooperative games; see Appendix D.6.1 for proof. 185 Algorithm12 Independent optimistic policy gradient ascent 1: Parameters: 0< 1 32 p A and a non-increasing sequencef (t) g 1 t=1 that satises 0 < (t) 1 6 for allt and 1 X t =t 0 (t) = 1 for anyt 0 : 2: Initialization: Letx (1) s = x (1) s =y (1) s = y (1) s = 1=A andV (0) s = 0 for alls2S. 
3: for stept = 1; 2;:::do 4: DeneQ (t) s 2R AA for alls2S, Q (t) s (a 1 ;a 2 ) = r(s;a 1 ;a 2 ) + E s 0 P(js;a 1 ;a 2 ) h V (t1) s 0 i : 5: Dene two players’ policies fors2S, x (t+1) s = argmax xs2 (A 1 ) x > s Q (t) s y (t) s 1 2 x s x (t) s 2 x (t+1) s = argmax xs2 (A 1 ) x > s Q (t) s y (t) s 1 2 x s x (t+1) s 2 y (t+1) s = argmax ys2 (A 2 ) (x (t) s ) > Q (t) s y s 1 2 y s y (t) s 2 y (t+1) s = argmax ys2 (A 2 ) (x (t) s ) > Q (t) s y s 1 2 y s y (t+1) s 2 (5.18) V (t) s = (1 (t) )V (t1) s + (t) (x (t) s ) > Q (t) s y (t) s : 6: endfor Theorem42(Last-iterateconvergencefortwo-playerMarkovcooperativegames) For MPG (5.1) with two players and identical rewards, if both players run Algorithm 12 with 0 < < (1 )=(32 p A)andanon-increasingf (t) g 1 t = 1 thatsatises 0< (t) < 1=6and P 1 t =t 0 (t) =1 for anyt 0 0, then the policy pair (x (t) ;y (t) ) converges to a Nash equilibrium whent!1. Last-iterate convergence in Theorem 42 is measured by the local gaps max x 0(V x 0 ;y (t) () V x (t) ;y (t) ()) and max y 0(V x (t) ;y 0 ()V x (t) ;y (t) ()), i.e., a policy pair (x (t) ;y (t) ) constitutes an ap- proximate Nash policy for larget. The condition on algorithm parameters and (t) in Theo- rem 42 is mild in sense that it is straightforward to take a pair of such parameters that ensures 186 last-iterate convergence in zero-sum Markov games [239]. Hence, Algorithm 12 enjoys last- iterate convergence in both two-player Markov cooperative and zero-sum competitive games. Compared with the result [92], our proof of Theorem 42 utilizes gap convergence instead of point-wise policy convergence that is restricted to isolated xed points of the algorithm dynam- ics. Moreover, our algorithm works for both cooperative and competitive Markov games. In the following Theorem 43, we further strengthen our result of Theorem 42 and show the sublinear Nash-Regret bounds for Algorithm 12 in both two-player Markov cooperative and zero- sum competitive games; see Appendix D.6.2 for proof. Theorem43(Nash-RegretboundforMarkovcooperative/zero-sumgames) When both players in two-player Markov games running Algorithm 12 with (t) = 1 6 3 p t and = (1 ) 2 32 p SA , inde- pendently, we have (i) if two players have identical rewards (r 1 =r 2 =r), then, 1 T T X t = 1 max x 0 ;y 0 V x 0 ;y (t) () +V x (t) ;y 0 () 2V x (t) ;y (t) () . (S 3 A ) 1 4 (1 ) 7 2 T 1 6 : (ii) if two players have zero-sum rewards (r 1 =r 2 =r), then, 1 T T X t = 1 max x 0 ;y 0 V x 0 ;y (t) ()V x (t) ;y 0 () . (S 3 A ) 1 2 (1 ) 15 4 T 1 6 : For two-player Markov cooperative/competitive games, Theorem 43 establishes the same rate T 1=6 for the Nash regret and the average duality gap, respectively. Alternatively, independent players in Algorithm 12 can nd an-Nash equilibrium afterO(1= 6 ) iterations, no matter which types of games are being played. To the best of our knowledge, Theorem 43 appears to be the rst 187 game-agnostic convergence for Markov cooperative/competitive games with nite-time perfor- mance guarantees. We leave the extension to more general Markov games for future work. 5.8 Computationalexperiments To demonstrate the merits and the eectiveness of our approach, we examine an MDP in which every state denes a congestion game. This example is borrowed from [36] and it includes MPG as a special case. For illustration, we consider the state spaceS =fsafe;distancingg and action spaceA i =fA;B;C;Dg, and the number of playersN = 8. 
In each state $s\in\mathcal{S}$, the reward for player $i$ taking an action $a\in\mathcal{A}_i$ is the $w_s^a$-weighted number of players using the action $a$, where $w_s^a$ specifies the action preference $w_s^A < w_s^B < w_s^C < w_s^D$. The reward in state distancing is less than that in state safe by a large amount $c > 0$. For the state transition, if more than half of the players find themselves using the same action, then the state transits to the state distancing; the state transits back to the state safe whenever no more than half of the players take the same action.

In our experiments, we implement our independent policy gradient method based on the code for the projected stochastic gradient ascent [131]. At each iteration, we collect a batch of 20 trajectories to estimate the state-action value function and (or) the stationary state distribution under the current policy. We choose the discount factor $\gamma = 0.99$, and vary the stepsize and the initial state distribution. We note that a stepsize of 0.001 does not provide convergence of the projected stochastic gradient ascent [131].

We report computational results showing that our independent policy gradient with a large stepsize quickly converges to a Nash equilibrium for a broad range of initial distributions. We first verify that our independent policy gradient with $\eta = 0.001$ still converges in Figure 5.1. In Figure 5.2, we see an improved convergence of our independent policy gradient using a larger stepsize, e.g., $\eta = 0.002$. We also remark that the learnt policies for all these experiments generate the same Nash policy, which matches the result in [131].

Figure 5.1: Convergence performance. (a) Learning curves for our independent policy gradient with stepsize $\eta = 0.001$ and the projected stochastic gradient ascent with $\eta = 0.0001$ [131]. Each solid line is the mean of trajectories over three random seeds and each shaded region displays the confidence interval. (b) Learning curves for six individual runs, three of our independent policy gradient (solid lines) and three of the projected stochastic gradient ascent (dashed lines). (c) Distribution of players in one of the two states taking the four actions. In (a) and (b), we measure the accuracy by the absolute distance of each iterate to the converged Nash policy, i.e., $\frac{1}{N}\sum_{i=1}^{N} \|\pi_i^{(t)} - \pi_i^{\mathrm{Nash}}\|_1$. In our computational experiments, the initial distribution is uniform.

Figure 5.2: Convergence performance, in the same format as Figure 5.1, for our independent policy gradient with stepsize $\eta = 0.002$ and the projected stochastic gradient ascent with $\eta = 0.0001$ [131]. The initial distribution is uniform.
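For reference, a schematic implementation of the congestion-game environment described at the beginning of this section is sketched below. The particular weight values and the penalty `PENALTY_C` are illustrative placeholders, since the text only specifies the ordering $w_s^A < w_s^B < w_s^C < w_s^D$ and that $c$ is large; the `reset`/`step` interface is ours.

```python
import numpy as np

SAFE, DISTANCING = 0, 1
N, ACTIONS = 8, 4                              # players; actions A, B, C, D
WEIGHTS = np.array([[1.0, 2.0, 3.0, 4.0],      # w_safe^A < ... < w_safe^D (illustrative)
                    [1.0, 2.0, 3.0, 4.0]])     # w_distancing^A < ... < w_distancing^D
PENALTY_C = 10.0                               # large reward gap for state distancing (illustrative)

class CongestionGame:
    """Two-state congestion-game MDP of Section 5.8 (schematic)."""
    def reset(self):
        self.state = SAFE
        return self.state

    def step(self, joint_action):
        counts = np.bincount(joint_action, minlength=ACTIONS)
        # reward: weighted number of players choosing the same action, minus c in distancing
        rewards = np.array([WEIGHTS[self.state, a] * counts[a] for a in joint_action])
        if self.state == DISTANCING:
            rewards -= PENALTY_C
        # transition: distancing iff more than half the players pick the same action
        self.state = DISTANCING if counts.max() > N // 2 else SAFE
        return self.state, rewards
```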
We next examine how sensitive the performance of the algorithms is to the initial state distribution. As discussed in Section 5.4, our independent policy gradient method (5.4) differs from the projected policy gradient (5.3) by removing the dependence on the initial state distribution. In the policy gradient theory [8], convergence of projected policy gradient methods is often restricted by how explorative the initial state distribution is. To be fair, we choose stepsize $\eta = 0.001$ for our algorithm since it achieves a similar performance to the projected stochastic gradient ascent [131] in Figure 5.1. We choose two different initial state distributions, $\rho = (0.9999, 0.0001)$ and $\rho = (0.0001, 0.9999)$, and report our computational results in Figure 5.3 and Figure 5.4, respectively. Comparing Figure 5.3 with Figure 5.1, both algorithms become a bit slower, but our algorithm is relatively insensitive to the change of $\rho$. This becomes even clearer in Figure 5.4 for the other choice $\rho = (0.0001, 0.9999)$. This demonstrates that the practical performance of our independent policy gradient method (5.4) is indeed invariant to the initial distribution.

Figure 5.3: Convergence performance, in the same format as Figure 5.1, for our independent policy gradient with stepsize $\eta = 0.001$ and the projected stochastic gradient ascent with $\eta = 0.0001$ [131]; the initial distribution is nearly degenerate, $\rho = (0.9999, 0.0001)$.

Figure 5.4: Convergence performance, in the same format as Figure 5.1, for our independent policy gradient with stepsize $\eta = 0.001$ and the projected stochastic gradient ascent with $\eta = 0.0001$ [131]; the initial distribution is nearly degenerate, $\rho = (0.0001, 0.9999)$.

5.9 Concluding remarks

We have proposed new independent policy gradient algorithms for learning a Nash equilibrium of Markov potential games (MPGs) when the size of the state space and/or the number of players is large. In the exact gradient case, we show that our algorithm finds an $\epsilon$-Nash equilibrium with $O(1/\epsilon^2)$ iteration complexity. Such iteration complexity does not explicitly depend on the state space size. In the sample-based case, our algorithm works in the function approximation setting, and we prove $O(1/\epsilon^5)$ sample complexity in a potentially infinitely large state space. This appears to be the first result for learning MPGs with function approximation.
Moreover, we identify a class of independent policy gradient algorithms that enjoys last-iterate convergence and sublinear Nash regret for both zero-sum Markov games and Markov cooperative games (a special case of MPGs). This finding sheds light on an open question in the literature on the existence of such an algorithm.

Chapter 6

Discussion and future directions

We have established several RL algorithms for constrained and multi-agent control systems with theoretical convergence guarantees. We hope our results will serve as general frameworks/tools for considering real-world systems in standard RL setups. While there are a number of future directions motivated by our work, we list the most compelling ones in this chapter.

6.1 Policy gradient primal-dual algorithms

Since our work [73] was published at NeurIPS 2020, there has been a line of studies on policy gradient primal-dual algorithms for constrained MDPs. The first focus is the two-time-scale scheme for updating primal and dual variables: [141, 134, 260] have shown fast convergence rates by incorporating modifications to the objective function or to the update of the dual variable into the algorithm design. It is relevant to examine when we can improve the convergence of single-time-scale primal-dual algorithms. The second focus is zero constraint violation during training: [22] has shown that the constraint violation can be reduced to zero by using pessimism for constraint satisfaction, which is an important design to implement in practical algorithms. Moreover, all these methods are limited to small tabular constrained MDPs, which leads to an open question for constrained MDPs with large state/action spaces.

A critical underlying assumption in policy gradient primal-dual algorithms is that the state space is already well-explored. Without this assumption, vanishing policy gradients often yield poor sample efficiency [8]. It is important to study strategic exploration for policy gradient primal-dual algorithms, e.g., policy cover directed exploration [9, 88, 266] and $\epsilon$-greedy [269].

In practice, oscillatory behavior commonly arises in primal-dual methods [210]. It would be useful to establish a tighter understanding of the primal-dual optimization geometry for constrained MDPs.

6.2 Provably efficient RL for constrained MDPs

Robustness has received a lot of attention in policy optimization algorithms, especially in handling adversarial rewards/utilities [190, 109, 47, 146, 99] and model misspecification [110]. We believe that techniques from the adversarial MDP literature [47, 99, 146] allow us to derive similar regret and constraint violation bounds for constrained MDPs, which is an important future direction.

Beyond linear kernel MDPs, confidence-interval exploration has been widely used for other types of dynamics, e.g., linear MDPs [110] and linear/nonlinear regulators [6, 114]. It remains to be seen if/how provably efficient RL algorithms can be designed for similar constrained dynamics.

In practice, we often encounter general function approximation beyond linear functions [103, 106, 79, 279]. It would be useful to design provably efficient RL algorithms for constrained MDPs with nonlinear function approximation.

6.3 Multi-agent policy evaluation in other settings

Our developed framework of distributed policy evaluation is based on linear function approximation. It is of interest to study nonlinear function approximators [281, 237].
The convex-concave property of the primal-dual formulation no longer holds in this setup and a different analysis is required.

Our network setting is restricted to fully connected consensus networks. It is of interest to examine the transient performance of distributed algorithms that, instead of our consensus-based approach, utilize a network diffusion strategy [192]. Our algorithm requires synchronous communication over a network with doubly stochastic matrices. This can be restrictive in applications that involve directed networks, communication delays, and time-varying channels. It is thus relevant to examine the effect of alternative communication schemes.

Our multi-agent MDP assumes that agents function without any failures, and it is of interest to examine a setup in which agents may experience communication/computation failures, with some of them acting maliciously [246].

6.4 Multi-agent policy gradient methods

A natural future direction is to extend techniques that offer faster rates for single-agent policy gradient methods [123, 270, 247] to independent multi-agent learning for Markov cooperative/potential games.

In terms of improving sample efficiency, it is of interest to study the exploration techniques in single-agent RL [111, 186] for Markov zero-sum/cooperative/potential games with or without function approximation.

Another important direction is to investigate independent policy gradient methods for other classes of large-scale Markov games with continuous state/action spaces, e.g., linear quadratic Markov zero-sum/cooperative/potential games [268, 157, 44, 274, 97].

Bibliography

[1] Felisa Vázquez Abad and Vikram Krishnamurthy. “Policy gradient stochastic approximation algorithms for adaptive control of constrained time varying Markov decision processes”. In: Proceedings of the IEEE International Conference on Decision and Control. Vol. 3. 2003, pp. 2823–2828.

[2] Felisa Vázquez Abad, Vikram Krishnamurthy, Katerine Martin, and Irina Baltcheva. “Self learning control of constrained Markov chains — a gradient approach”. In: Proceedings of the IEEE Conference on Decision and Control. Vol. 2. 2002.

[3] Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. “Improved algorithms for linear stochastic bandits”. In: Proceedings of the Advances in Neural Information Processing Systems. Vol. 24. 2011, pp. 2312–2320.

[4] Yasin Abbasi-Yadkori, Peter Bartlett, Kush Bhatia, Nevena Lazic, Csaba Szepesvári, and Gellért Weisz. “POLITEX: Regret bounds for policy iteration using expert prediction”. In: Proceedings of the International Conference on Machine Learning. 2019, pp. 3692–3702.

[5] Naoki Abe, Prem Melville, Cezar Pendus, Chandan K. Reddy, David L. Jensen, Vince P. Thomas, James J. Bennett, Gary F. Anderson, Brent R. Cooley, Melissa Kowalczyk, et al. “Optimizing debt collections using constrained reinforcement learning”. In: Proceedings of the International Conference on Knowledge Discovery and Data Mining. 2010, pp. 75–84.

[6] Marc Abeille and Alessandro Lazaric. “Efficient optimistic exploration in linear-quadratic regulators via Lagrangian relaxation”. In: Proceedings of the International Conference on Machine Learning. 2020, pp. 23–31.

[7] Joshua Achiam, David Held, Aviv Tamar, and Pieter Abbeel. “Constrained policy optimization”. In: Proceedings of the International Conference on Machine Learning. Vol. 70. 2017, pp. 22–31.

[8] Alekh Agarwal, Sham M. Kakade, Jason D. Lee, and Gaurav Mahajan.
“On the theory of policy gradient methods: Optimality, approximation, and distribution shift”. In: Journal of Machine Learning Research 22.98 (2021), pp. 1–76. [9] Alekh Agarwal, Mikael Hena, Sham M. Kakade, and Wen Sun. “PC-PG: Policy cover directed exploration for provable policy gradient learning”. In: Proceedings of the Advances in Neural Information Processing Systems. Vol. 33. 2020, pp. 13399–13412. [10] Ilge Akkaya, Marcin Andrychowicz, Maciek Chociej, Mateusz Litwin, Bob McGrew, Arthur Petron, Alex Paino, Matthias Plappert, Glenn Powell, Raphael Ribas, et al. “Solving rubik’s cube with a robot hand”. In: arXiv preprint arXiv:1910.07113 (2019). [11] Eitan Altman. Constrained Markov Decision Processes. Vol. 7. CRC Press, 1999. [12] Eitan Altman, Arie Hordijk, and F.M. Spieksma. “Contraction conditions for average and -discount optimality in countable state Markov games with unbounded rewards”. In: Mathematics of Operations Research 22.3 (1997), pp. 588–618. [13] Sanae Amani, Christos Thrampoulidis, and Lin F. Yang. “Safe reinforcement learning with linear function approximation”. In: Proceedings of the International Conference on Machine Learning. 2021, pp. 243–253. [14] Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. “Concrete problems in AI safety”. In: arXiv preprint arXiv:1606.06565 (2016). [15] Sanjeev Arora. Lecture 6 in Toward Theoretical Understanding of Deep Learning. https://www.cs.princeton.edu/courses/archive/fall18/cos597G/lecnotes/lecture6.pdf. 2008. [16] Kenneth J. Arrow. Studies in Linear and Non-linear Programming. Stanford University Press, 1958. [17] Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. “Finite-time analysis of the multiarmed bandit problem”. In: Machine Learning 47.2-3 (2002), pp. 235–256. [18] Alex Ayoub, Zeyu Jia, Csaba Szepesvári, Mengdi Wang, and Lin F. Yang. “Model-based reinforcement learning with value-targeted regression”. In: Proceedings of the International Conference on Machine Learning. 2020, pp. 463–474. [19] Mohammad Gheshlaghi Azar, Ian Osband, and Rémi Munos. “Minimax regret bounds for reinforcement learning”. In: Proceedings of the International Conference on Machine Learning. 2017, pp. 263–272. [20] Francis Bach and Eric Moulines. “Non-strongly-convex smooth stochastic approximation with convergence rateO(1=n)”. In: Proceedings of the Advances in Neural Information Processing Systems. 2013, pp. 773–781. 199 [21] Qinbo Bai, Ather Gattami, and Vaneet Aggarwal. “Model-free algorithm and regret analysis for MDPs with peak constraints”. In: arXiv preprint arXiv:2003.05555 (2020). [22] Qinbo Bai, Amrit Singh Bedi, Mridul Agarwal, Alec Koppel, and Vaneet Aggarwal. “Achieving zero constraint violation for constrained reinforcement learning via primal-dual approach”. In: Proceedings of the AAAI Conference on Articial Intelligence. Vol. 36. 4. 2022, pp. 3682–3689. [23] Yu Bai, Tengyang Xie, Nan Jiang, and Yu-Xiang Wang. “Provably ecient Q-learning with low switching cost”. In: Proceedings of the Advances in Neural Information Processing Systems. Vol. 32. 2019, pp. 8002–8011. [24] James Bailey and Georgios Piliouras. “Fast and furious learning in zero-sum games: Vanishing regret with non-vanishing step sizes”. In: Proceedings of the Advances in Neural Information Processing Systems. Vol. 32. 2019. [25] James P. Bailey and Georgios Piliouras. “Multiplicative weights update in zero-sum games”. In: Proceedings of the ACM Conference on Economics and Computation. 2018, pp. 321–338. [26] Leemon Baird. 
“Residual algorithms: Reinforcement learning with function approximation”. In: Proceedings of the International Conference on Machine Learning. 1995, pp. 30–37. [27] Amir Beck. First-order Methods in Optimization. Vol. 25. SIAM, 2017. [28] Felix Berkenkamp, Matteo Turchetta, Angela Schoellig, and Andreas Krause. “Safe model-based reinforcement learning with stability guarantees”. In: Proceedings of the Advances in Neural Information Processing Systems. 2017, pp. 908–918. [29] Dimitri P. Bertsekas. Constrained Optimization and Lagrange Multiplier Methods. Academic Press, 2014. [30] Dimitri P. Bertsekas. Nonlinear Programming: Second Edition. Athena Scientic, 2008. [31] Dimitri P. Bertsekas and John N. Tsitsiklis. Neuro-dynamic Programming. Vol. 5. Athena Scientic Belmont, MA, 1996. [32] Jalaj Bhandari and Daniel Russo. “Global optimality guarantees for policy gradient methods”. In: arXiv preprint arXiv:1906.01786 (2019). [33] Jalaj Bhandari and Daniel Russo. “On the linear convergence of policy gradient methods for nite MDPs”. In: Proceedings of the International Conference on Articial Intelligence and Statistics. 2021, pp. 2386–2394. 200 [34] Jalaj Bhandari, Daniel Russo, and Raghav Singal. “A nite time analysis of temporal dierence learning with linear function approximation”. In: Proceedings of the Conference on Learning Theory. 2018, pp. 1691–1692. [35] Shalabh Bhatnagar and K. Lakshmanan. “An online actor–critic algorithm with function approximation for constrained Markov decision processes”. In: Journal of Optimization Theory and Applications 153.3 (2012), pp. 688–708. [36] Ilai Bistritz and Nicholas Bambos. “Cooperative multi-player bandit optimization”. In: Proceedings of the Advances in Neural Information Processing Systems. Vol. 33. 2020. [37] Vivek S. Borkar. “An actor-critic algorithm for constrained Markov decision processes”. In: Systems & Control Letters 54.3 (2005), pp. 207–213. [38] Vivek S. Borkar and Sean P. Meyn. “The ODE method for convergence of stochastic approximation and reinforcement learning”. In: SIAM Journal on Control and Optimization 38.2 (2000), pp. 447–469. [39] Justin A. Boyan. “Least-squares temporal dierence learning”. In: Proceedings of the International Conference on Machine Learning. 1999, pp. 49–56. [40] Stephen Boyd, Arpita Ghosh, Balaji Prabhakar, and Devavrat Shah. “Gossip algorithms: Design, analysis and applications”. In: Proceedings of the IEEE 24th Annual Joint Conference of the IEEE Computer and Communications Societies. Vol. 3. 2005, pp. 1653–1664. [41] Steven J. Bradtke and Andrew G. Barto. “Linear least-squares algorithms for temporal dierence learning”. In: Machine Learning 22.1-3 (1996), pp. 33–57. [42] Kianté Brantley, Miroslav Dudik, Thodoris Lykouris, Sobhan Miryoose, Max Simchowitz, Aleksandrs Slivkins, and Wen Sun. “Constrained episodic reinforcement learning in concave-convex and knapsack settings”. In: Proceedings of the Advances in Neural Information Processing Systems. Vol. 33. 2020. [43] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. “OpenAI Gym”. In: arXiv preprint arXiv:1606.01540 (2016). [44] Jingjing Bu, Lillian J. Ratli, and Mehran Mesbahi. “Global convergence of policy gradient for sequential zero-sum linear quadratic dynamic games”. In: arXiv preprint arXiv:1911.04672 (2019). [45] Sébastien Bubeck and Nicolo Cesa-Bianchi. “Regret analysis of stochastic and nonstochastic multi-armed bandit problems”. 
In: Foundations and Trends® in Machine Learning 5.1 (2012), pp. 1–122. 201 [46] Lucian Busoniu, Robert Babuska, and Bart De Schutter. “A comprehensive survey of multiagent reinforcement learning”. In: IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 38.2 (2008), pp. 156–172. [47] Qi Cai, Zhuoran Yang, Chi Jin, and Zhaoran Wang. “Provably ecient exploration in policy optimization”. In: Proceedings of the International Conference on Machine Learning. 2020, pp. 1283–1294. [48] Lucas Cassano, Kun Yuan, and Ali H. Sayed. “Multi-agent fully decentralized value function learning with linear convergence rates”. In: IEEE Transactions on Automatic Control 66.4 (2020), pp. 1497–1512. [49] Shicong Cen, Yuting Wei, and Yuejie Chi. “Fast policy extragradient methods for competitive games with entropy regularization”. In: Proceedings of the Advances in Neural Information Processing Systems. Vol. 34. 2021. [50] Shicong Cen, Chen Cheng, Yuxin Chen, Yuting Wei, and Yuejie Chi. “Fast global convergence of natural policy gradient methods with entropy regularization”. In: Operations Research (2021). [51] Nicolo Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006. [52] Yi Chen, Jing Dong, and Zhaoran Wang. “A primal-dual approach to constrained Markov decision processes”. In: arXiv preprint arXiv:2101.10895 (2021). [53] Yun Kuen Cheung and Georgios Piliouras. “Chaos, extremism and optimism: Volume analysis of learning in games”. In: Proceedings of the Advances in Neural Information Processing Systems. Vol. 33. 2020, pp. 9039–9049. [54] Yinlam Chow, Or Nachum, Edgar Duenez-Guzman, and Mohammad Ghavamzadeh. “A Lyapunov-based approach to safe reinforcement learning”. In: Proceedings of the Advances in Neural Information Processing Systems. Vol. 31. 2018. [55] Yinlam Chow, Or Nachum, Aleksandra Faust, Mohammad Ghavamzadeh, and Edgar Duenez-Guzman. “Lyapunov-based safe policy optimization for continuous control”. In: arXiv preprint arXiv:1901.10031 (2019). [56] Yinlam Chow, Mohammad Ghavamzadeh, Lucas Janson, and Marco Pavone. “Risk-constrained reinforcement learning with percentile risk criteria”. In: Journal of Machine Learning Research 18.1 (2017), pp. 6070–6120. [57] Johanne Cohen, Amélie Héliou, and Panayotis Mertikopoulos. “Learning with bandit feedback in potential games”. In: Proceedings of the Advances in Neural Information Processing Systems. 2017, pp. 6372–6381. 202 [58] Kobi Cohen, Angelia Nedić, and Rayadurgam Srikant. “On projected stochastic gradient descent algorithm with weighted averaging for least squares regression”. In: IEEE Transactions on Automatic Control 62.11 (2017), pp. 5974–5981. [59] Allan Dafoe, Yoram Bachrach, Gillian Hadeld, Eric Horvitz, Kate Larson, and Thore Graepel. Cooperative AI: machines must learn to nd common ground. 2021. [60] Allan Dafoe, Edward Hughes, Yoram Bachrach, Tantum Collins, Kevin R. McKee, Joel Z. Leibo, Kate Larson, and Thore Graepel. “Open problems in cooperative AI”. In: arXiv preprint arXiv:2012.08630 (2020). [61] Gal Dalal, Balázs Szörényi, Gugan Thoppe, and Shie Mannor. “Finite sample analyses for TD (0) with function approximation”. In: Proceedings of the AAAI Conference on Articial Intelligence. Vol. 32. 1. 2018. [62] Gal Dalal, Gugan Thoppe, Balázs Szörényi, and Shie Mannor. “Finite sample analysis of two-time scale stochastic approximation with applications to reinforcement learning”. In: Proceedings of the Conference on Learning Theory. 2018, pp. 1199–1233. 
[63] Gal Dalal, Krishnamurthy Dvijotham, Matej Vecerik, Todd Hester, Cosmin Paduraru, and Yuval Tassa. “Safe exploration in continuous action spaces”. In: arXiv preprint arXiv:1801.08757 (2018). [64] Constantinos Daskalakis, Dylan J. Foster, and Noah Golowich. “Independent policy gradient methods for competitive reinforcement learning”. In: Proceedings of the Advances in Neural Information Processing Systems. Vol. 33. 2020. [65] Constantinos Daskalakis and Ioannis Panageas. “The limit points of (optimistic) gradient descent in min-max optimization”. In: Proceedings of the Advances in Neural Information Processing Systems. Vol. 31. 2018. [66] W. Davis Dechert and S.I. O’Donnell. “The stochastic lake game: A numerical solution”. In: Journal of Economic Dynamics and Control 30.9-10 (2006), pp. 1569–1587. [67] Nishanth Dikkala, Greg Lewis, Lester Mackey, and Vasilis Syrgkanis. “Minimax estimation of conditional moment models”. In: Proceedings of the Advances in Neural Information Processing Systems. Vol. 33. 2020, pp. 12248–12262. [68] Dongsheng Ding, Kaiqing Zhang, Tamer Başar, and Mihailo R. Jovanović. “Convergence and optimality of policy gradient primal-dual method for constrained Markov decision processes”. In: Proceedings of the 2022 American Control Conference. 2022, pp. 2851–2856. [69] Dongsheng Ding, Kaiqing Zhang, Jiali Duan, Tamer Başar, and Mihailo R. Jovanović. “Convergence and sample complexity of natural policy gradient primal-dual methods for constrained MDPs”. In: arXiv preprint arXiv:2206.02346 (2022). 203 [70] Dongsheng Ding, Xiaohan Wei, Zhuoran Yang, Zhaoran Wang, and Mihailo R. Jovanović. “Fast multi-agent temporal-dierence learning via homotopy stochastic primal-dual method”. In: Optimization Foundations for Reinforcement Learning Workshop, 33rd Conference on Neural Information Processing Systems. 2019. [71] Dongsheng Ding, Xiaohan Wei, Zhuoran Yang, Zhaoran Wang, and Mihailo R. Jovanović. “Fast multi-agent temporal-dierence learning via homotopy stochastic primal-dual optimization”. In: arXiv preprint arXiv:1908.02805 (2019). [72] Dongsheng Ding, Chen-Yu Wei, Kaiqing Zhang, and Mihailo R. Jovanović. “Independent policy gradient for large-scale Markov potential games: Sharper rates, function approximation, and game-agnostic convergence”. In: Proceedings of the International Conference on Machine Learning. 2022, pp. 5166–5220. [73] Dongsheng Ding, Kaiqing Zhang, Tamer Başar, and Mihailo R. Jovanović. “Natural policy gradient primal-dual method for constrained Markov decision processes”. In: Proceedings of the Advances in Neural Information Processing Systems. Vol. 33. 2020, pp. 8378–8390. [74] Dongsheng Ding, Xiaohan Wei, Zhuoran Yang, Zhaoran Wang, and Mihailo R. Jovanović. “Provably ecient safe exploration via primal-dual policy optimization”. In: Proceedings of the International Conference on Articial Intelligence and Statistics. 2021, pp. 3304–3312. [75] Yuhao Ding, Junzi Zhang, and Javad Lavaei. “Beyond exact gradients: Convergence of stochastic soft-max policy gradient methods with entropy regularization”. In: arXiv preprint arXiv:2110.10117 (2021). [76] Thinh T. Doan, Siva Maguluri, and Justin Romberg. “Finite-time analysis of distributed TD(0) with linear function approximation on multi-agent reinforcement learning”. In: Proceedings of the International Conference on Machine Learning. 2019, pp. 1626–1635. [77] Thinh T. Doan, Siva Theja Maguluri, and Justin Romberg. 
“Finite-time performance of distributed temporal-dierence learning with linear function approximation”. In: SIAM Journal on Mathematics of Data Science 3.1 (2021), pp. 298–320. [78] Thinh T. Doan and Justin Romberg. “Finite-time performance of distributed two-time-scale stochastic approximation”. In: Proceedings of the Learning for Dynamics and Control. 2020, pp. 26–36. [79] Simon S. Du, Sham M. Kakade, Jason D. Lee, Shachar Lovett, Gaurav Mahajan, Wen Sun, and Ruosong Wang. “Bilinear classes: A structural framework for provable generalization in RL”. In: Proceedings of the International Conference on Machine Learning. 2021, pp. 2826–2836. 204 [80] Simon S. Du, Sham M. Kakade, Ruosong Wang, and Lin F. Yang. “Is a good representation sucient for sample ecient reinforcement learning?” In: Proceedings of the International Conference on Learning Representations. 2019. [81] Abhimanyu Dubey and Alex Pentland. “Provably ecient cooperative multi-agent reinforcement learning with function approximation”. In: arXiv preprint arXiv:2103.04972 (2021). [82] John C. Duchi, Alekh Agarwal, and Martin J. Wainwright. “Dual averaging for distributed optimization: Convergence analysis and network scaling”. In: IEEE Transactions on Automatic Control 57.3 (2012), pp. 592–606. [83] John C. Duchi, Alekh Agarwal, Mikael Johansson, and Michael I. Jordan. “Ergodic mirror descent”. In: SIAM Journal on Optimization 22.4 (2012), pp. 1549–1578. [84] Gabriel Dulac-Arnold, Daniel Mankowitz, and Todd Hester. “Challenges of real-world reinforcement learning”. In: arXiv preprint arXiv:1904.12901 (2019). [85] Yonathan Efroni, Shie Mannor, and Matteo Pirotta. “Exploration-exploitation in constrained MDPs”. In: arXiv preprint arXiv:2003.02189 (2020). [86] Damien Ernst, Pierre Geurts, and Louis Wehenkel. “Tree-based batch mode reinforcement learning”. In: Journal of Machine Learning Research 6.Apr (2005), pp. 503–556. [87] Maryam Fazel, Rong Ge, Sham M. Kakade, and Mehran Mesbahi. “Global convergence of policy gradient methods for the linear quadratic regulator”. In: Proceedings of the International Conference on Machine Learning. 2018, pp. 1467–1476. [88] Fei Feng, Wotao Yin, Alekh Agarwal, and Lin F. Yang. “Provably correct optimization and exploration with non-linear policies”. In: Proceedings of the International Conference on Machine Learning. 2021, pp. 3263–3273. [89] Seyedshams Feyzabadi. “Robot Planning with Constrained Markov Decision Processes”. PhD thesis. UC Merced, 2017. [90] Arlington M. Fink. “Equilibrium in a stochasticn-person game”. In: Journal of Science of the Hiroshima University, Series AI (mathematics) 28.1 (1964), pp. 89–93. [91] Jaime F. Fisac, Anayo K. Akametalu, Melanie N. Zeilinger, Shahab Kaynama, Jeremy Gillula, and Claire J. Tomlin. “A general safety framework for learning-based control in uncertain robotic systems”. In: IEEE Transactions on Automatic Control 64.7 (2018), pp. 2737–2752. 205 [92] Roy Fox, Stephen M. Mcaleer, Will Overman, and Ioannis Panageas. “Independent natural policy gradient always converges in Markov potential games”. In: Proceedings of the International Conference on Articial Intelligence and Statistics. 2022, pp. 4414–4425. [93] Javier Garcıa and Fernando Fernández. “A comprehensive survey on safe reinforcement learning”. In: Journal of Machine Learning Research 16.1 (2015), pp. 1437–1480. [94] Cory Jay Girard. “Structural Results for Constrained Markov Decision Processes”. PhD thesis. Cornell University, 2018. [95] David González-Sánchez and Onésimo Hernández-Lerma. 
Discrete–Time Stochastic Control and Dynamic Potential Games: the Euler–Equation Approach. Springer Science & Business Media, 2013. [96] László Györ and Harro Walk. “On the averaged stochastic approximation for linear regression”. In: SIAM Journal on Control and Optimization 34.1 (1996), pp. 31–61. [97] Ben M. Hambly, Renyuan Xu, and Huining Yang. “Policy gradient methods nd the Nash equilibrium in N-player general-sum linear-quadratic games”. In: arXiv preprint arXiv:2107.13090 (2021). [98] Aria HasanzadeZonuzy, Archana Bura, Dileep Kalathil, and Srinivas Shakkottai. “Learning with safety constraints: Sample complexity of reinforcement learning for constrained MDPs”. In: Proceedings of the AAAI Conference on Articial Intelligence. Vol. 35. 9. 2021, pp. 7667–7674. [99] Jiafan He, Dongruo Zhou, and Quanquan Gu. “Near-optimal policy optimization algorithms for learning adversarial linear mixture MDPs”. In: Proceedings of the International Conference on Articial Intelligence and Statistics. 2022, pp. 4259–4280. [100] Josef Hofbauer and William H. Sandholm. “On the global convergence of stochastic ctitious play”. In: Econometrica 70.6 (2002), pp. 2265–2294. [101] Daniel Hsu, Sham M. Kakade, and Tong Zhang. “Random design analysis of ridge regression”. In: Proceedings of the Conference on Learning Theory. 2012, pp. 9–1. [102] Bin Hu and Usman Syed. “Characterizing the exact behaviors of temporal dierence learning algorithms using Markov jump linear system theory”. In: Proceedings of the Advances in Neural Information Processing Systems. 2019, pp. 8477–8488. [103] Baihe Huang, Kaixuan Huang, Sham M. Kakade, Jason D. Lee, Qi Lei, Runzhe Wang, and Jiaqi Yang. “Going beyond linear RL: Sample ecient neural function approximation”. In: Proceedings of the Advances in Neural Information Processing Systems. Vol. 34. 2021, pp. 8968–8983. 206 [104] Baihe Huang, Jason D. Lee, Zhaoran Wang, and Zhuoran Yang. “Towards general function approximation in zero-sum Markov games”. In: Proceedings of the International Conference on Learning Representations. 2022. [105] Maximilian Hüttenrauch, Sosic Adrian, and Gerhard Neumann. “Deep reinforcement learning for swarm systems”. In: Journal of Machine Learning Research 20.54 (2019), pp. 1–31. [106] Chi Jin, Qinghua Liu, and Sobhan Miryoose. “Bellman Eluder dimension: New rich classes of RL problems, and sample-ecient algorithms”. In: Proceedings of the Advances in Neural Information Processing Systems. Vol. 34. 2021, pp. 13406–13418. [107] Chi Jin, Qinghua Liu, and Tiancheng Yu. “The power of exploiter: Provable multi-agent RL in large state spaces”. In: Proceedings of the International Conference on Machine Learning. 2022, pp. 10251–10279. [108] Chi Jin, Zeyuan Allen-Zhu, Sebastien Bubeck, and Michael I. Jordan. “Is Q-learning provably ecient?” In: Proceedings of the Advances in Neural Information Processing Systems. Vol. 31. 2018, pp. 4863–4873. [109] Chi Jin, Tiancheng Jin, Haipeng Luo, Suvrit Sra, and Tiancheng Yu. “Learning adversarial Markov decision processes with bandit feedback and unknown transition”. In: Proceedings of the International Conference on Machine Learning. 2020, pp. 4860–4869. [110] Chi Jin, Zhuoran Yang, Zhaoran Wang, and Michael I. Jordan. “Provably ecient reinforcement learning with linear function approximation”. In: Proceedings of the Conference on Learning Theory. 2020, pp. 2137–2143. [111] Chi Jin, Qinghua Liu, Yuanhao Wang, and Tiancheng Yu. “V-learning – A simple, ecient, decentralized Algorithm for multiagent RL”. 
In: arXiv preprint arXiv:2110.14555 (2021). [112] Sham M. Kakade. “A natural policy gradient”. In: Proceedings of the Advances in Neural Information Processing Systems. 2002, pp. 1531–1538. [113] Sham M. Kakade and John Langford. “Approximately optimal approximate reinforcement learning”. In: Proceedings of the International Conference on Machine Learning. Vol. 2. 2002, pp. 267–274. [114] Sham M. Kakade, Akshay Krishnamurthy, Kendall Lowrey, Motoya Ohnishi, and Wen Sun. “Information theoretic regret bounds for online nonlinear control”. In: Proceedings of the Advances in Neural Information Processing Systems. Vol. 33. 2020, pp. 15312–15325. 207 [115] Krishna C. Kalagarla, Rahul Jain, and Pierluigi Nuzzo. “A sample-ecient algorithm for episodic nite-horizon MDP with constraints”. In: Proceedings of the AAAI Conference on Articial Intelligence. Vol. 35. 9. 2021, pp. 8030–8037. [116] Hsu Kao, Chen-Yu Wei, and Vijay Subramanian. “Decentralized cooperative reinforcement learning with hierarchical information structure”. In: Proceedings of the International Conference on Algorithmic Learning Theory. 2022, pp. 573–605. [117] Sajad Khodadadian, Prakirt Raj Jhunjhunwala, Sushil Mahavir Varma, and Siva Theja Maguluri. “On linear and super-linear convergence of natural policy gradient algorithm”. In: Systems & Control Letters 164 (2022), p. 105214. [118] Robert Kleinberg, Georgios Piliouras, and Éva Tardos. “Multiplicative updates outperform generic no-regret learning in congestion games”. In: Proceedings of the annual ACM symposium on Theory of Computing. 2009, pp. 533–542. [119] Jens Kober, J. Andrew Bagnell, and Jan Peters. “Reinforcement learning in robotics: A survey”. In: International Journal of Robotics Research 32.11 (2013), pp. 1238–1274. [120] Lior Kuyer, Shimon Whiteson, Bram Bakker, and Nikos Vlassis. “Multiagent reinforcement learning for urban trac control using coordination graphs”. In: Proceedings of the Joint European Conference on Machine Learning and Knowledge Discovery in Databases. 2008, pp. 656–671. [121] Michail G. Lagoudakis and Ronald Parr. “Least-squares policy iteration”. In: Journal of Machine Learning Research. Vol. 4. 2003, pp. 1107–1149. [122] Chandrashekar Lakshminarayanan and Csaba Szepesvári. “Linear stochastic approximation: How far does constant step-size and iterate averaging go?” In: Proceedings of the International Conference on Articial Intelligence and Statistics. 2018, pp. 1347–1355. [123] Guanghui Lan. “Policy mirror descent for reinforcement learning: Linear convergence, new sampling complexity, and generalized problem classes”. In: Mathematical Programming (2022). [124] Tor Lattimore, Csaba Szepesvári, and Gellert Weisz. “Learning with good feature representations in bandits and in RL with a generative model”. In: Proceedings of the International Conference on Machine Learning. 2020, pp. 5662–5670. [125] Alessandro Lazaric, Mohammad Ghavamzadeh, and Remi Munos. “Finite-sample analysis of LSTD”. In: Proceedings of the International Conference on Machine Learning. 2010, pp. 615–622. [126] Hoang Le, Cameron Voloshin, and Yisong Yue. “Batch policy learning under constraints”. In: Proceedings of the International Conference on Machine Learning. 2019, pp. 3703–3712. 208 [127] Donghwan Lee and Jianghai Hu. “Primal-dual distributed temporal dierence learning”. In: arXiv preprint arXiv:1805.07918 (2018). [128] Donghwan Lee, Niao He, Parameswaran Kamalaruban, and Volkan Cevher. “Optimization for reinforcement learning: From a single agent to cooperative agents”. 
In: IEEE Signal Processing Magazine 37.3 (2020), pp. 123–135. [129] Stefanos Leonardos and Georgios Piliouras. “Exploration-exploitation in multi-agent learning: Catastrophe theory meets game theory”. In: Articial Intelligence 304 (2022), p. 103653. [130] Stefanos Leonardos, Georgios Piliouras, and Kelly Spendlove. “Exploration-exploitation in multi-agent competition: Convergence with bounded rationality”. In: Proceedings of the Advances in Neural Information Processing Systems. Vol. 34. 2021. [131] Stefanos Leonardos, Will Overman, Ioannis Panageas, and Georgios Piliouras. “Global convergence of multi-agent policy gradient in Markov potential games”. In: Proceedings of the International Conference on Learning Representations. 2022. [132] David A. Levin and Yuval Peres. Markov Chains and Mixing Times. Vol. 107. American Mathematical Society, 2017. [133] Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. “End-to-end training of deep visuomotor policies”. In: Journal of Machine Learning Research 17.1 (2016), pp. 1334–1373. [134] Tianjiao Li, Ziwei Guan, Shaofeng Zou, Tengyu Xu, Yingbin Liang, and Guanghui Lan. “Faster algorithm and sharper analysis for constrained Markov decision process”. In: arXiv preprint arXiv:2110.10351 (2021). [135] Qingkai Liang, Fanyu Que, and Eytan Modiano. “Accelerated primal-dual policy optimization for safe reinforcement learning”. In: arXiv preprint arXiv:1802.06480 (2018). [136] Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. “Continuous control with deep reinforcement learning”. In: arXiv preprint arXiv:1509.02971 (2015). [137] Tianyi Lin, Chi Jin, and Michael I. Jordan. “On gradient descent ascent for nonconvex-concave minimax problems”. In: Proceedings of the International Conference on Machine Learning. 2019, pp. 6083–6093. [138] Bo Liu, Ji Liu, Mohammad Ghavamzadeh, Sridhar Mahadevan, and Marek Petrik. “Finite-sample analysis of proximal gradient TD algorithms”. In: Proceedings of the Conference on Uncertainty in Articial Intelligence. 2015, pp. 504–513. 209 [139] Boyi Liu, Qi Cai, Zhuoran Yang, and Zhaoran Wang. “Neural trust region/proximal policy optimization attains globally optimal policy”. In: Proceedings of the Advances in Neural Information Processing Systems. 2019, pp. 10564–10575. [140] Qinghua Liu, Tiancheng Yu, Yu Bai, and Chi Jin. “A sharp analysis of model-based reinforcement learning with self-play”. In: Proceedings of the International Conference on Machine Learning. 2021, pp. 7001–7010. [141] Tao Liu, Ruida Zhou, Dileep Kalathil, Panganamala Kumar, and Chao Tian. “Fast global convergence of policy optimization for constrained MDPs”. In: arXiv preprint arXiv:2111.00552 (2021). [142] Yanli Liu, Kaiqing Zhang, Tamer Başar, and Wotao Yin. “An improved analysis of (variance-reduced) policy gradient and natural policy gradient methods”. In: Proceedings of the Advances in Neural Information Processing Systems. Vol. 33. 2020, pp. 7624–7636. [143] Yongshuai Liu, Jiaxin Ding, and Xin Liu. “IPO: Interior-point policy optimization under constraints”. In: Proceedings of the AAAI Conference on Articial Intelligence. Vol. 34. 04. 2020, pp. 4940–4947. [144] Ryan Lowe, Yi Wu, Aviv Tamar, Jean Harb, Pieter Abbeel, and Igor Mordatch. “Multi-agent actor-critic for mixed cooperative-competitive environments”. In: Proceedings of the Advances in Neural Information Processing Systems. Vol. 30. 2017, pp. 6379–6390. [145] David G. Luenberger and Yinyu Ye. 
Linear and Nonlinear Programming. Vol. 2. Springer, 1984. [146] Haipeng Luo, Chen-Yu Wei, and Chung-Wei Lee. “Policy optimization in adversarial MDPs: Improved exploration via dilated bonuses”. In: Proceedings of the Advances in Neural Information Processing Systems. Vol. 34. 2021. [147] Sergio Valcarcel Macua, Javier Zazo, and Santiago Zazo. “Learning parametric closed-Loop policies for Markov potential games”. In: Proceedings of the International Conference on Learning Representations. 2018. [148] Sergio Valcarcel Macua, Jianshu Chen, Santiago Zazo, and Ali H. Sayed. “Distributed policy evaluation under multiple behavior strategies”. In: IEEE Transactions on Automatic Control 60.5 (2014), pp. 1260–1274. [149] Mehrdad Mahdavi, Rong Jin, and Tianbao Yang. “Trading regret for eciency: Online convex optimization with long term constraints”. In: Journal of Machine Learning Research 13 (2012), pp. 2503–2528. [150] A. Maitra and T. Parthasarathy. “On stochastic games”. In: Journal of Optimization Theory and Applications 5.4 (1970), pp. 289–300. 210 [151] A. Maitra and T. Parthasarathy. “On stochastic games, II”. In: Journal of Optimization Theory and Applications 8.2 (1971), pp. 154–160. [152] Dhruv Malik, Ashwin Pananjady, Kush Bhatia, Koulik Khamaru, Peter L. Bartlett, and Martin J. Wainwright. “Derivative-free methods for policy optimization: Guarantees for linear quadratic systems”. In: Journal of Machine Learning Research 21.21 (2020), pp. 1–51. [153] Weichao Mao, Lin F. Yang, Kaiqing Zhang, and Tamer Başar. “On improving model-free algorithms for decentralized multi-agent reinforcement learning”. In: Proceedings of the International Conference on Machine Learning. 2022, pp. 15007–15049. [154] Jason R. Marden. “State based potential games”. In: Automatica 48.12 (2012), pp. 3075–3088. [155] Adwaitvedant Mathkar and Vivek S. Borkar. “Distributed reinforcement learning via gossip”. In: IEEE Transactions on Automatic Control 62.3 (2016), pp. 1465–1470. [156] Laetitia Matignon, Guillaume J. Laurent, and Nadine Le Fort-Piat. “Independent reinforcement learners in cooperative Markov games: A survey regarding coordination problems”. In: The Knowledge Engineering Review 27.1 (2012), pp. 1–31. [157] Vladimir V. Mazalov, Anna N. Rettieva, and Konstantin E. Avrachenkov. “Linear-quadratic discrete-time dynamic potential games”. In: Automation and Remote Control 78.8 (2017), pp. 1537–1544. [158] Eric Mazumdar, Lillian J. Ratli, Michael I. Jordan, and S. Shankar Sastry. “Policy-gradient algorithms have no guarantees of convergence in linear quadratic games”. In: Proceedings of the International Conference on Autonomous Agents and Multiagent Systems. 2020. [159] Jincheng Mei, Chenjun Xiao, Csaba Szepesvári, and Dale Schuurmans. “On the global convergence rates of softmax policy gradient methods”. In: Proceedings of the International Conference on Machine Learning. 2020, pp. 6820–6829. [160] David Mguni, Joel Jennings, and Enrique Munoz de Cote. “Decentralised learning in systems with many, many strategic agents”. In: Proceedings of the AAAI Conference on Articial Intelligence. 2018. [161] David H Mguni, Yutong Wu, Yali Du, Yaodong Yang, Ziyi Wang, Minne Li, Ying Wen, Joel Jennings, and Jun Wang. “Learning in nonzero-sum stochastic games with potentials”. In: Proceedings of the International Conference on Machine Learning. 2021, pp. 7688–7699. 211 [162] Sobhan Miryoose and Chi Jin. “A simple reward-free approach to constrained reinforcement learning”. 
In: Proceedings of the International Conference on Machine Learning. 2022, pp. 15666–15698. [163] Sudip Misra, Ayan Mondal, Shukla Banik, Manas Khatua, Samaresh Bera, and Mohammad S. Obaidat. “Residential energy management in smart grid: A Markov decision process-based approach”. In: Proceedings of the IEEE International Conference on Green Computing and Communications and IEEE Internet of things and IEEE Cyber, Physical and Social Computing. 2013, pp. 1152–1157. [164] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. “Asynchronous methods for deep reinforcement learning”. In: Proceedings of the International Conference on Machine Learning. 2016, pp. 1928–1937. [165] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, et al. “Human-level control through deep reinforcement learning”. In: Nature 518.7540 (2015), p. 529. [166] Aditya Modi, Nan Jiang, Ambuj Tewari, and Satinder Singh. “Sample complexity of reinforcement learning using linearly combined model ensembles”. In: Proceedings of the International Conference on Articial Intelligence and Statistics. 2020, pp. 2010–2020. [167] Hesameddin Mohammadi, Mihailo R. Jovanović, and Mahdi Soltanolkotabi. “Learning the model-free linear quadratic regulator via random search”. In: Proceedings of the Learning for Dynamics and Control. 2020, pp. 531–539. [168] Hesameddin Mohammadi, Mahdi Soltanolkotabi, and Mihailo R. Jovanović. “On the linear convergence of random search for discrete-time LQR”. In: IEEE Control Systems Letters 5.3 (2020), pp. 989–994. [169] Hesameddin Mohammadi, Armin Zare, Mahdi Soltanolkotabi, and Mihailo R. Jovanović. “Convergence and sample complexity of gradient methods for the model-free linear–quadratic regulator problem”. In: IEEE Transactions on Automatic Control 67.5 (2022), pp. 2435–2450. [170] Hesameddin Mohammadi, Armin Zare, Mahdi Soltanolkotabi, and Mihailo R. Jovanović. “Global exponential convergence of gradient methods over the nonconvex landscape of the linear quadratic regulator”. In: Proceedings of the IEEE 58th Conference on Decision and Control. 2019, pp. 7474–7479. [171] Aryan Mokhtari, Asuman Ozdaglar, and Sarath Pattathil. “A unied analysis of extra-gradient and optimistic gradient methods for saddle point problems: Proximal point approach”. In: Proceedings of the International Conference on Articial Intelligence and Statistics. 2020, pp. 1497–1507. 212 [172] D. Monderer and L. Shapley. “Fictitious play property for games with identical interests”. In: Journal of Economic Theory 68 (1996), pp. 258–265. [173] Angelia Nedić and Asuman Ozdaglar. “Distributed subgradient methods for multi-agent optimization”. In: IEEE Transactions on Automatic Control 54.1 (2009), p. 48. [174] Angelia Nedić and Asuman Ozdaglar. “Subgradient methods for saddle-point problems”. In: Journal of Optimization Theory and Applications 142.1 (2009), pp. 205–228. [175] Arkadi Nemirovski, Anatoli Juditsky, Guanghui Lan, and Alexander Shapiro. “Robust stochastic approximation approach to stochastic programming”. In: SIAM Journal on Optimization 19.4 (2009), pp. 1574–1609. [176] Maher Nouiehed, Maziar Sanjabi, Tianjian Huang, Jason D. Lee, and Meisam Razaviyayn. “Solving a class of non-convex min-max games using iterative rst order methods”. In: Proceedings of the Advances in Neural Information Processing Systems. 2019, pp. 14905–14916. 
[177] Asuman Ozdaglar, Muhammed O. Sayin, and Kaiqing Zhang. “Independent learning in stochastic games”. In: arXiv preprint arXiv:2111.11743 (2021). [178] Gerasimos Palaiopanos, Ioannis Panageas, and Georgios Piliouras. “Multiplicative weights update with constant step-size in congestion games: Convergence, limit cycles and chaos”. In: Proceedings of the Advances in Neural Information Processing Systems. Vol. 30. 2017. [179] Santiago Paternain, Luiz F.O. Chamon, Miguel Calvo-Fullana, and Alejandro Ribeiro. “Constrained reinforcement learning has zero duality gap”. In: Proceedings of the Advances in Neural Information Processing Systems. 2019, pp. 7553–7563. [180] Santiago Paternain, Miguel Calvo-Fullana, Luiz F.O. Chamon, and Alejandro Ribeiro. “Safe policies for reinforcement learning via primal-dual methods”. In: IEEE Transactions on Automatic Control (2022). [181] Bei Peng, Tabish Rashid, Christian Schroeder de Witt, Pierre-Alexandre Kamienny, Philip Torr, Wendelin Böhmer, and Shimon Whiteson. “FACMAC: Factored multi-agent centralised policy gradients”. In: Proceedings of the Advances in Neural Information Processing Systems. Vol. 34. 2021, pp. 12208–12221. [182] Paris Pennesi and Ioannis Ch. Paschalidis. “A distributed actor-critic algorithm and applications to mobile sensor network coordination problems”. In: IEEE Transactions on Automatic Control 55.2 (2010), pp. 492–497. [183] Iosif Pinelis. “Optimum bounds for the distributions of martingales in Banach spaces”. In: The Annals of Probability 22.4 (1994), pp. 1679–1706. 213 [184] Bernardo Ávila Pires and Csaba Szepesvári. “Policy error bounds for model-based reinforcement learning with factored linear models”. In: Proceedings of the Conference on Learning Theory. 2016, pp. 121–151. [185] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2014. [186] Shuang Qiu, Xiaohan Wei, Jieping Ye, Zhaoran Wang, and Zhuoran Yang. “Provably ecient ctitious play policy optimization for zero-sum Markov games with structured transitions”. In: Proceedings of the International Conference on Machine Learning. 2021, pp. 8715–8725. [187] Shuang Qiu, Xiaohan Wei, Zhuoran Yang, Jieping Ye, and Zhaoran Wang. “Upper condence primal-dual optimization: Stochastically constrained Markov decision processes with adversarial losses and unknown transitions”. In: Proceedings of the Advances in Neural Information Processing Systems. Vol. 1281. 2020, pp. 15277–15287. [188] Jineng Ren, Jarvis Haupt, and Zehua Guo. “Communication-ecient hierarchical distributed optimization for multi-agent policy evaluation”. In: Journal of Computational Science 49 (2021), p. 101280. [189] Julia Robinson. “An iterative method of solving a game”. In: Annals of Mathematics (1951), pp. 296–301. [190] Aviv Rosenberg and Yishay Mansour. “Online convex optimization in adversarial Markov decision processes”. In: Proceedings of the International Conference on Machine Learning. 2019, pp. 5478–5486. [191] Harsh Satija, Philip Amortila, and Joelle Pineau. “Constrained Markov decision processes via backward value functions”. In: Proceedings of the International Conference on Machine Learning. 2020, pp. 8502–8511. [192] Ali H. Sayed. “Adaptation, Learning, and Optimization over Networks”. In: Foundations and Trends® in Machine Learning 7.4-5 (2014), pp. 311–801. [193] Muhammed O. Sayin, Kaiqing Zhang, David S. Leslie, Tamer Başar, and Asuman E. Ozdaglar. “Decentralized Q-learning in zero-sum Markov games”. 
In: Proceedings of the Advances in Neural Information Processing Systems. 2021. [194] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. “Proximal policy optimization algorithms”. In: arXiv preprint arXiv:1707.06347 (2017). [195] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. “Trust region policy optimization”. In: Proceedings of the International Conference on Machine Learning. 2015, pp. 1889–1897. 214 [196] Xingyu Sha, Jiaqi Zhang, Kaiqing Zhang, Keyou You, and Tamer Başar. “Asynchronous policy evaluation in distributed reinforcement learning over networks”. In: arXiv preprint arXiv:2003.00433 (2020). [197] Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014. [198] Shai Shalev-Shwartz, Shaked Shammah, and Amnon Shashua. “Safe, multi-agent, reinforcement learning for autonomous driving”. In: arXiv preprint arXiv:1610.03295 (2016). [199] Lior Shani, Yonathan Efroni, and Shie Mannor. “Adaptive trust region policy optimization: Global convergence and faster rates for regularized MDPs”. In: Proceedings of the AAAI Conference on Articial Intelligence. Vol. 34. 04. 2020, pp. 5668–5675. [200] Lloyd S. Shapley. “Stochastic games”. In: Proceedings of the National Academy of Sciences 39.10 (1953), pp. 1095–1100. [201] Li Shen, Long Yang, Shixiang Chen, Bo Yuan, Xueqian Wang, and Dacheng Tao. “Penalized proximal policy optimization for safe reinforcement learning”. In: arXiv preprint arXiv:2205.11814 (2022). [202] David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al. “A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play”. In: Science 362.6419 (2018), pp. 1140–1144. [203] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. “Mastering the game of Go with deep neural networks and tree search”. In: Nature 529.7587 (2016), p. 484. [204] Rahul Singh, Abhishek Gupta, and Ness B. Shro. “Learning in Markov decision processes under constraints”. In: arXiv preprint arXiv:2002.12435 (2020). [205] Ziang Song, Song Mei, and Yu Bai. “When can we learn general-sum Markov games with a large number of players sample-eciently?” In: Proceedings of the International Conference on Learning Representations. 2022. [206] Thomas Spooner and Rahul Savani. “A natural actor-critic algorithm with downside risk constraints”. In: arXiv preprint arXiv:2007.04203 (2020). [207] Rayadurgam Srikant and Lei Ying. “Finite-time error bounds for linear stochastic approximation and TD learning”. In: Proceedings of the Conference on Learning Theory. 2019, pp. 1–28. 215 [208] Miloš S. Stanković and Srdjan S. Stanković. “Multi-agent temporal-dierence learning with linear function approximation: Weak convergence under time-varying network topologies”. In: Proceedings of the American Control Conference. 2016, pp. 167–172. [209] Julian Stastny, Maxime Riché, Alexander Lyzhov, Johannes Treutlein, Allan Dafoe, and Jesse Clifton. “Normative disagreement as a challenge for cooperative AI”. In: arXiv preprint arXiv:2111.13872 (2021). [210] Adam Stooke, Joshua Achiam, and Pieter Abbeel. “Responsive safety in reinforcement learning by PID Lagrangian methods”. In: Proceedings of the International Conference on Machine Learning. 2020, pp. 
9133–9143. [211] Yanan Sui, Vincent Zhuang, Joel Burdick, and Yisong Yue. “Stagewise safe Bayesian optimization with Gaussian processes”. In: Proceedings of the International conference on machine learning. 2018, pp. 4781–4789. [212] Jun Sun, Gang Wang, Georgios B. Giannakis, Qinmin Yang, and Zaiyue Yang. “Finite-time analysis of decentralized temporal-dierence learning with linear function approximation”. In: Proceedings of the International Conference on Articial Intelligence and Statistics. 2020, pp. 4485–4495. [213] Richard S. Sutton. “Learning to predict by the methods of temporal dierences”. In: Machine Learning 3.1 (1988), pp. 9–44. [214] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2018. [215] Richard S. Sutton, Hamid R. Maei, and Csaba Szepesvári. “A convergent O(n) temporal-dierence algorithm for o-policy learning with linear function approximation”. In: Proceedings of the Advances in Neural Information Processing Systems. 2009, pp. 1609–1616. [216] Richard S. Sutton, Hamid Reza Maei, Doina Precup, Shalabh Bhatnagar, David Silver, Csaba Szepesvári, and Eric Wiewiora. “Fast gradient-descent methods for temporal-dierence learning with linear function approximation”. In: Proceedings of the International Conference on Machine Learning. 2009, pp. 993–1000. [217] Richard S. Sutton, David A. McAllester, Satinder P. Singh, and Yishay Mansour. “Policy gradient methods for reinforcement learning with function approximation”. In: Proceedings of the Advances in Neural Information Processing Systems. 2000, pp. 1057–1063. [218] Csaba Szepesvári. Algorithms for Reinforcement Learning. Morgan and Claypool Publishers, 2010. 216 [219] Masayuki Takahashi. “Stochastic games with innitely many strategies”. In: Journal of Science of the Hiroshima University, Series AI (Mathematics) 26.2 (1962), pp. 123–134. [220] Chen Tessler, Daniel J. Mankowitz, and Shie Mannor. “Reward constrained policy optimization”. In: Proceedings of the International Conference on Learning Representations. 2019. [221] Emanuel Todorov, Tom Erez, and Yuval Tassa. “Mujoco: A physics engine for model-based control”. In: Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems. 2012, pp. 5026–5033. [222] Ahmed Touati, Pierre-Luc Bacon, Doina Precup, and Pascal Vincent. “Convergent TREE BACKUP and RETRACE with Function Approximation”. In: Proceedings of the International Conference on Machine Learning. 2018, pp. 4962–4971. [223] Alexander Trott, Sunil Srinivasa, Douwe van der Wal, Sebastien Haneuse, and Stephan Zheng. “Building a foundation for data-driven, interpretable, and robust policy design using the AI economist”. In: arXiv preprint arXiv:2108.02904 (2021). [224] Konstantinos I. Tsianos and Michael G. Rabbat. “Distributed strongly convex optimization”. In: Proceedings of the Allerton Conference on Communication, Control, and Computing. 2012, pp. 593–600. [225] John N. Tsitsiklis and Benjamin Van Roy. “An analysis of temporal-dierence learning with function approximation”. In: IEEE Transactions on Automatic Control 42.5 (1997), pp. 674–690. [226] Matteo Turchetta, Felix Berkenkamp, and Andreas Krause. “Safe exploration in nite Markov decision processes with Gaussian processes”. In: Proceedings of the Advances in Neural Information Processing Systems. 2016, pp. 4312–4320. [227] Eiji Uchibe and Kenji Doya. “Constrained reinforcement learning from intrinsic and extrinsic rewards”. 
In: Proceedings of the International Conference on Development and Learning. 2007, pp. 163–168. [228] Benjamin Van Roy and Shi Dong. “Comments on the Du-Kakade-Wang-Yang Lower Bounds”. In: arXiv preprint arXiv:1911.07910 (2019). [229] Oriol Vinyals, Igor Babuschkin, Wojciech M. Czarnecki, Michaël Mathieu, Andrew Dudzik, Junyoung Chung, David H. Choi, Richard Powell, Timo Ewalds, Petko Georgiev, et al. “Grandmaster level in StarCraft II using multi-agent reinforcement learning”. In: Nature 575.7782 (2019), pp. 350–354. [230] Akifumi Wachi and Yanan Sui. “Safe reinforcement learning in constrained Markov decision processes”. In: Proceedings of the International Conference on Machine Learning. 2020, pp. 9797–9806. 217 [231] Akifumi Wachi, Yanan Sui, Yisong Yue, and Masahiro Ono. “Safe exploration and optimization of constrained MDPs using Gaussian processes”. In: Proceedings of the AAAI Conference on Articial Intelligence. 2018. [232] Hoi-To Wai, Zhuoran Yang, Zhaoran Wang, and Mingyi Hong. “Multi-agent reinforcement learning via double averaging primal-dual optimization”. In: Proceedings of the Advances in Neural Information Processing Systems. 2018, pp. 9649–9660. [233] Lingxiao Wang, Qi Cai, Zhuoran Yang, and Zhaoran Wang. “Neural policy gradient methods: Global optimality and rates of convergence”. In: Proceedings of the International Conference on Learning Representations. 2019. [234] Xiaofeng Wang and Tuomas Sandholm. “Reinforcement learning to play an optimal Nash equilibrium in team Markov games”. In: Proceedings of the Advances in Neural Information Processing Systems. Vol. 15. 2002, pp. 1603–1610. [235] Yihan Wang, Beining Han, Tonghan Wang, Heng Dong, and Chongjie Zhang. “DOP: O-policy multi-agent decomposed policy gradients”. In: Proceedings of the International Conference on Learning Representations. 2021. [236] Yuanhao Wang, Ruosong Wang, and Sham M. Kakade. “An exponential lower bound for linearly realizable MDP with constant suboptimality gap”. In: Proceedings of the Advances in Neural Information Processing Systems. Vol. 34. 2021. [237] Yue Wang, Shaofeng Zou, and Yi Zhou. “Non-asymptotic analysis for two time-scale TDC with general smooth function approximation”. In: Proceedings of the Advances in Neural Information Processing Systems. Vol. 34. 2021, pp. 9747–9758. [238] Yue Wang, Wei Chen, Yuting Liu, Zhi-Ming Ma, and Tie-Yan Liu. “Finite sample analysis of the GTD policy evaluation algorithms in Markov setting”. In: Proceedings of the Advances in Neural Information Processing Systems. 2017, pp. 5504–5513. [239] Chen-Yu Wei, Chung-Wei Lee, Mengxiao Zhang, and Haipeng Luo. “Last-iterate convergence of decentralized optimistic gradient descent/ascent in innite-horizon competitive Markov games”. In: Proceedings of the Conference on Learning Theory. 2021. [240] Chen-Yu Wei, Mehdi Jafarnia Jahromi, Haipeng Luo, and Rahul Jain. “Learning innite-horizon average-reward MDPs with linear function approximation”. In: Proceedings of the International Conference on Articial Intelligence and Statistics. 2021, pp. 3007–3015. [241] Chen-Yu Wei, Chung-Wei Lee, Mengxiao Zhang, and Haipeng Luo. “Linear last-iterate convergence in constrained saddle-point optimization”. In: Proceedings of the International Conference on Learning Representations. 2020. 218 [242] Xiaohan Wei, Hao Yu, and Michael J. Neely. “Online primal-dual mirror descent under stochastic constraints”. In: Proceedings of the International Conference on Measurement and Modeling of Computer Systems. Vol. 4. 2. 2020, pp. 1–36. 
[243] Xiaohan Wei, Hao Yu, Qing Ling, and Michael J. Neely. “Solving non-smooth constrained programs with lower complexity than O(1="): A primal-dual homotopy smoothing approach”. In: Proceedings of the Advances in Neural Information Processing Systems. 2018, pp. 3999–4009. [244] Gellért Weisz, Philip Amortila, and Csaba Szepesvári. “Exponential lower bounds for planning in MDPs with linearly-realizable optimal action-value functions”. In: Proceedings of the International Conference on Algorithmic Learning Theory. 2021, pp. 1237–1264. [245] Ronald J. Williams. “Simple statistical gradient-following algorithms for connectionist reinforcement learning”. In: Machine Learning 8.3 (1992), pp. 229–256. [246] Zhaoxian Wu, Han Shen, Tianyi Chen, and Qing Ling. “Byzantine-resilient decentralized policy evaluation with linear function approximation”. In: IEEE Transactions on Signal Processing 69 (2021), pp. 3839–3853. [247] Lin Xiao. “On the convergence rates of policy gradient methods”. In: arXiv preprint arXiv:2201.07443 (2022). [248] Lin Xiao and Tong Zhang. “A proximal-gradient homotopy method for the sparse least-squares problem”. In: SIAM Journal on Optimization 23.2 (2013), pp. 1062–1091. [249] Dong Xie and Xiangnan Zhong. “Semicentralized deep deterministic policy gradient in cooperative StarCraft games”. In: IEEE Transactions on Neural Networks and Learning Systems (2020). [250] Qiaomin Xie, Yudong Chen, Zhaoran Wang, and Zhuoran Yang. “Learning zero-sum simultaneous-move Markov games using function approximation and correlated equilibrium”. In: Proceedings of the Conference on Learning Theory. 2020, pp. 3674–3682. [251] Tengyu Xu, Yingbin Liang, and Guanghui Lan. “CRPO: A new approach for safe reinforcement learning with convergence guarantee”. In: Proceedings of the International Conference on Machine Learning. 2021, pp. 11480–11491. [252] Yi Xu, Yan Yan, Qihang Lin, and Tianbao Yang. “Homotopy smoothing for non-smooth problems with lower complexity than O(1=)”. In: Proceedings of the Advances In Neural Information Processing Systems. 2016, pp. 1208–1216. [253] Reza Yaesoubi and Ted Cohen. “Dynamic health policies for controlling the spread of emerging infections: inuenza as an example”. In: PloS One 6.9 (2011), e24043. 219 [254] Junchi Yang, Negar Kiyavash, and Niao He. “Global convergence and variance reduction for a class of nonconvex-nonconcave minimax problems”. In: Proceedings of the Advances in Neural Information Processing Systems. Vol. 33. 2020, pp. 1153–1165. [255] Lin F. Yang and Mengdi Wang. “Reinforcement learning in feature space: Matrix bandit, kernels, and regret bound”. In: Proceedings of the International Conference on Machine Learning. 2020, pp. 10746–10756. [256] Lin F. Yang and Mengdi Wang. “Sample-optimal parametric Q-learning using linearly additive features”. In: Proceedings of the International Conference on Machine Learning. 2019, pp. 6995–7004. [257] Long Yang, Jiaming Ji, Juntao Dai, Yu Zhang, Pengfei Li, and Gang Pan. “CUP: A conservative update policy algorithm for safe reinforcement learning”. In: arXiv preprint arXiv:2202.07565 (2022). [258] Tianbao Yang and Qihang Lin. “RSG: Beating subgradient method without smoothness and strong convexity”. In: Journal of Machine Learning Research 19.1 (2018), pp. 236–268. [259] Tsung-Yen Yang, Justinian Rosca, Karthik Narasimhan, and Peter J. Ramadge. “Projection-based constrained policy optimization”. In: Proceedings of the International Conference on Learning Representations. 2020. 
[260] Donghao Ying, Yuhao Ding, and Javad Lavaei. “A dual approach to constrained Markov decision processes with entropy regularization”. In: Proceedings of the International Conference on Articial Intelligence and Statistics. 2022, pp. 1887–1909. [261] Yiming Ying, Longyin Wen, and Siwei Lyu. “Stochastic online AUC maximization”. In: Proceedings of the Advances in Neural Information Processing Systems. Vol. 29. 2016, pp. 451–459. [262] Chao Yu, Akash Velu, Eugene Vinitsky, Yu Wang, Alexandre Bayen, and Yi Wu. “The surprising eectiveness of PPO in cooperative, multi-agent games”. In: arXiv preprint arXiv:2103.01955 (2021). [263] Hao Yu, Michael J. Neely, and Xiaohan Wei. “Online convex optimization with stochastic constraints”. In: Proceedings of the Advances in Neural Information Processing Systems. 2017, pp. 1428–1438. [264] Ming Yu, Zhuoran Yang, Mladen Kolar, and Zhaoran Wang. “Convergent policy optimization for safe reinforcement learning”. In: Proceedings of the Advances in Neural Information Processing Systems. 2019, pp. 3121–3133. [265] Jianjun Yuan and Andrew Lamperski. “Online convex optimization for cumulative constraints”. In: Proceedings of the Advances in Neural Information Processing Systems. 2018, pp. 6137–6146. 220 [266] Andrea Zanette, Ching-An Cheng, and Alekh Agarwal. “Cautiously optimistic policy optimization and exploration with linear function approximation”. In: Proceedings of the Conference on Learning Theory. Vol. 134. 2021, pp. 4473–4525. [267] Andrea Zanette, Alessandro Lazaric, Mykel Kochenderfer, and Emma Brunskill. “Learning near optimal policies with low inherent bellman error”. In: Proceedings of the International Conference on Machine Learning. 2020, pp. 10978–10989. [268] Santiago Zazo, Sergio Valcarcel Macua, Matilde Sánchez-Fernández, and Javier Zazo. “Dynamic potential games with constraints: Fundamentals and applications in communications”. In: IEEE Transactions on Signal Processing 64.14 (2016), pp. 3806–3821. [269] Sihan Zeng, Thinh T. Doan, and Justin Romberg. “Finite-time complexity of online primal-dual natural actor-critic algorithm for constrained Markov decision processes”. In: arXiv preprint arXiv:2110.11383 (2021). [270] Wenhao Zhan, Shicong Cen, Baihe Huang, Yuxin Chen, Jason D. Lee, and Yuejie Chi. “Policy mirror descent for regularized reinforcement learning: A generalized framework with linear convergence”. In: arXiv preprint arXiv:2105.11066 (2021). [271] Junyu Zhang, Alec Koppel, Amrit Singh Bedi, Csaba Szepesvári, and Mengdi Wang. “Variational policy gradient method for reinforcement learning with general utilities”. In: Proceedings of the Advances in Neural Information Processing Systems. Vol. 33. 2020, pp. 4572–4583. [272] Kaiqing Zhang, Zhuoran Yang, and Tamer Başar. “Decentralized multi-agent reinforcement learning with networked agents: Recent advances”. In: Frontiers of Information Technology & Electronic Engineering 22.6 (2021), pp. 802–814. [273] Kaiqing Zhang, Zhuoran Yang, and Tamer Başar. “Multi-agent reinforcement learning: A selective overview of theories and algorithms”. In: Handbook of Reinforcement Learning and Control (2021), pp. 321–384. [274] Kaiqing Zhang, Zhuoran Yang, and Tamer Başar. “Policy optimization provably converges to Nash equilibria in zero-sum linear quadratic games”. In: Proceedings of the Advances in Neural Information Processing Systems. Vol. 32. 2019, pp. 11602–11614. [275] Kaiqing Zhang, Zhuoran Yang, Han Liu, Tong Zhang, and Tamer Başar. 
“Fully decentralized multi-agent reinforcement learning with networked agents”. In: Proceedings of the International Conference on Machine Learning. 2018, pp. 5867–5876.
[276] Kaiqing Zhang, Alec Koppel, Hao Zhu, and Tamer Başar. “Global convergence of policy gradient methods to (almost) locally optimal policies”. In: SIAM Journal on Control and Optimization 58.6 (2020), pp. 3586–3612.
[277] Runyu Zhang, Zhaolin Ren, and Na Li. “Gradient play in multi-agent Markov stochastic games: Stationary points and convergence”. In: arXiv preprint arXiv:2106.00198 (2021).
[278] Runyu Zhang, Jincheng Mei, Bo Dai, Dale Schuurmans, and Na Li. “On the effect of log-barrier regularization in decentralized softmax gradient play in multiagent systems”. In: arXiv preprint arXiv:2202.00872 (2022).
[279] Xuezhou Zhang, Yuda Song, Masatoshi Uehara, Mengdi Wang, Alekh Agarwal, and Wen Sun. “Efficient reinforcement learning in block MDPs: A model-free representation learning approach”. In: Proceedings of the International Conference on Machine Learning. 2022, pp. 26517–26547.
[280] Yiming Zhang, Quan Vuong, and Keith Ross. “First order constrained optimization in policy space”. In: Proceedings of the Advances in Neural Information Processing Systems. Vol. 33. 2020, pp. 15338–15349.
[281] Yufeng Zhang, Qi Cai, Zhuoran Yang, Yongxin Chen, and Zhaoran Wang. “Can temporal-difference and Q-Learning learn representation? A mean-field theory”. In: Proceedings of the Advances in Neural Information Processing Systems. Vol. 33. 2020, pp. 19680–19692.
[282] Yulai Zhao, Yuandong Tian, Jason D. Lee, and Simon S. Du. “Provably efficient policy optimization for two-player zero-sum Markov games”. In: Proceedings of the International Conference on Artificial Intelligence and Statistics. 2022, pp. 2736–2761.
[283] Stephan Zheng, Alexander Trott, Sunil Srinivasa, Nikhil Naik, Melvin Gruesbeck, David C. Parkes, and Richard Socher. “The AI economist: Improving equality and productivity with AI-driven tax policies”. In: arXiv preprint arXiv:2004.13332 (2020).
[284] Dongruo Zhou, Jiafan He, and Quanquan Gu. “Provably efficient reinforcement learning for discounted MDPs with feature mapping”. In: Proceedings of the International Conference on Machine Learning. 2021, pp. 12793–12802.
[285] Martin Zinkevich. “Online convex programming and generalized infinitesimal gradient ascent”. In: Proceedings of the International Conference on Machine Learning. 2003, pp. 928–936.
[286] Shaofeng Zou, Tengyu Xu, and Yingbin Liang. “Finite-sample analysis for SARSA and Q-learning with linear function approximation”. In: Proceedings of the Advances in Neural Information Processing Systems. 2019, pp. 8665–8675.

Appendices

Appendix A
Supporting proofs in Chapter 2

A.1 Proof of Lemma 1

The proof of (i) is standard; e.g., see [11, Theorem 3.6], [179, Theorem 1], or [180, Theorem 3]. The proof of (ii) builds on constrained optimization [27, Section 8.5]. Let $\Lambda_a := \{\lambda \geq 0 \mid V_D(\lambda) \leq a\}$ be a sublevel set of the dual function for $a \in \mathbb{R}$. For any $\lambda \in \Lambda_a$, we have
$$a \,\geq\, V_D(\lambda) \,\geq\, V_r^{\bar\pi}(\rho) \,+\, \lambda\big(V_g^{\bar\pi}(\rho) - b\big) \,\geq\, V_r^{\bar\pi}(\rho) \,+\, \lambda\,\xi$$
where $\bar\pi$ is a Slater point. Thus, $\lambda \leq (a - V_r^{\bar\pi}(\rho))/\xi$. If we take $a = V_r^{\pi^\star}(\rho) = V_D^\star$, then $\Lambda_a = \Lambda^\star$, which proves (ii).

A.2 Proof of Lemma 2

By the definition of $v(\tau)$, we have $v(0) = V_r^{\pi^\star}(\rho)$. We also note that $v(\tau)$ is concave (see the proof of [179, Proposition 1]). First, we show that $-\lambda^\star \in \partial v(0)$. By the definition of $V_L^{\pi,\lambda}(\rho)$ and the strong duality in Lemma 1, for all $\pi$,
$$V_L^{\pi,\lambda^\star}(\rho) \,\leq\, \underset{\pi}{\operatorname{maximize}}\; V_L^{\pi,\lambda^\star}(\rho) \,=\, V_D(\lambda^\star) \,=\, V_r^{\pi^\star}(\rho) \,=\, v(0).$$
Hence, for any $\pi \in \{\pi \mid V_g^{\pi}(\rho) \geq b + \tau\}$,
$$v(0) - \tau\lambda^\star \,\geq\, V_L^{\pi,\lambda^\star}(\rho) - \tau\lambda^\star \,=\, V_r^{\pi}(\rho) + \lambda^\star\big(V_g^{\pi}(\rho) - b\big) - \tau\lambda^\star \,=\, V_r^{\pi}(\rho) + \lambda^\star\big(V_g^{\pi}(\rho) - b - \tau\big) \,\geq\, V_r^{\pi}(\rho).$$
Maximizing the right-hand side of this inequality over $\{\pi \mid V_g^{\pi}(\rho) \geq b + \tau\}$ yields
$$v(0) - \tau\lambda^\star \,\geq\, v(\tau) \tag{A.1}$$
and, thus, $-\lambda^\star \in \partial v(0)$.

On the other hand, if we take $\tau = -\big[\,b - V_g^{\pi}(\rho)\,\big]_+$, then $V_r^{\pi}(\rho) \leq v(\tau)$ and
$$V_r^{\pi^\star}(\rho) \,=\, v(0) \,\geq\, v(\tau) + \tau\lambda^\star. \tag{A.2}$$
Combining (A.1) and (A.2) yields $V_r^{\pi}(\rho) - V_r^{\pi^\star}(\rho) \leq -\tau\lambda^\star$. Thus,
$$(C - \lambda^\star)\,|\tau| \,=\, -\lambda^\star|\tau| + C|\tau| \,=\, \tau\lambda^\star + C|\tau| \,\leq\, V_r^{\pi^\star}(\rho) - V_r^{\pi}(\rho) + C|\tau|$$
which completes the proof by applying the assumed condition on $C$.

A.3 Proof of Lemma 3

We prove Lemma 3 by providing a concrete constrained MDP example, as shown in Figure 2.1. States $s_3$, $s_4$, and $s_5$ are terminal states with zero reward and utility. We consider the non-trivial state $s_1$ with two actions: $a_1$ moving ‘up’ and $a_2$ going ‘right’, and the associated value functions are given by
$$V_r^{\pi}(s_1) \,=\, \pi(a_2|s_1)\,\pi(a_1|s_2), \qquad V_g^{\pi}(s_1) \,=\, \pi(a_1|s_1) \,+\, \pi(a_2|s_1)\,\pi(a_1|s_2).$$
We consider the following two policies $\pi^{(1)}$ and $\pi^{(2)}$ using the softmax parametrization (2.5),
$$\theta^{(1)} \,=\, (\log 1,\ \log x,\ \log x,\ \log 1), \qquad \theta^{(2)} \,=\, (-\log 1,\ -\log x,\ -\log x,\ -\log 1)$$
where the parameter takes the form $\theta = (\theta_{s_1,a_1},\, \theta_{s_1,a_2},\, \theta_{s_2,a_1},\, \theta_{s_2,a_2})$ with $x > 0$.

First, we show that $V_r$ is not concave. We compute that
$$\pi^{(1)}(a_1|s_1) = \frac{1}{1+x}, \quad \pi^{(1)}(a_2|s_1) = \frac{x}{1+x}, \quad \pi^{(1)}(a_1|s_2) = \frac{x}{1+x}, \quad V_r^{(1)}(s_1) = \Big(\frac{x}{1+x}\Big)^2, \quad V_g^{(1)}(s_1) = \frac{1+x+x^2}{(1+x)^2}$$
$$\pi^{(2)}(a_1|s_1) = \frac{x}{1+x}, \quad \pi^{(2)}(a_2|s_1) = \frac{1}{1+x}, \quad \pi^{(2)}(a_1|s_2) = \frac{1}{1+x}, \quad V_r^{(2)}(s_1) = \Big(\frac{1}{1+x}\Big)^2, \quad V_g^{(2)}(s_1) = \frac{1+x+x^2}{(1+x)^2}.$$
Now, we consider the policy $\pi^{(\alpha)}$ with $\theta^{(\alpha)} := \alpha\,\theta^{(1)} + (1-\alpha)\,\theta^{(2)} = \big(\log 1,\ \log x^{2\alpha-1},\ \log x^{2\alpha-1},\ \log 1\big)$ for some $\alpha \in [0,1]$, which is defined on the segment between $\theta^{(1)}$ and $\theta^{(2)}$. Therefore,
$$\pi^{(\alpha)}(a_1|s_1) = \frac{1}{1+x^{2\alpha-1}}, \quad \pi^{(\alpha)}(a_2|s_1) = \frac{x^{2\alpha-1}}{1+x^{2\alpha-1}}, \quad \pi^{(\alpha)}(a_1|s_2) = \frac{x^{2\alpha-1}}{1+x^{2\alpha-1}}$$
$$V_r^{(\alpha)}(s_1) = \Big(\frac{x^{2\alpha-1}}{1+x^{2\alpha-1}}\Big)^2, \qquad V_g^{(\alpha)}(s_1) = \frac{1+x^{2\alpha-1}+(x^{2\alpha-1})^2}{(1+x^{2\alpha-1})^2}.$$
When $x = 3$ and $\alpha = \tfrac{1}{2}$,
$$\tfrac{1}{2}\,V_r^{(1)}(s_1) + \tfrac{1}{2}\,V_r^{(2)}(s_1) \,=\, \tfrac{5}{16} \,>\, V_r^{(1/2)}(s_1) \,=\, \tfrac{4}{16}$$
which implies that $V_r$ is not concave. When $x = 10$ and $\alpha = \tfrac{1}{2}$,
$$V_g^{(1)}(s_1) \,=\, V_g^{(2)}(s_1) \,\geq\, 0.9 \qquad \text{and} \qquad V_g^{(1/2)}(s_1) \,=\, 0.75$$
which shows that if we take the constraint offset $b = 0.9$, then $V_g^{(1)}(s_1) = V_g^{(2)}(s_1) \geq b$ and $V_g^{(1/2)}(s_1) < b$, in which case the policy $\pi^{(1/2)}$ is infeasible. Therefore, the set $\{\theta \mid V_g^{\pi_\theta}(s) \geq b\}$ is not convex.

A.4 Proof of Theorem 4

Let us first recall the notion of occupancy measure [11]. An occupancy measure $q^{\pi}$ of a policy $\pi$ is defined as a set of distributions generated by executing $\pi$,
$$q_{s,a}^{\pi} \,=\, \sum_{t=0}^{\infty} \gamma^t\, \mathrm{Pr}\big(s_t = s,\, a_t = a \mid \pi,\, s_0 \sim \rho\big) \tag{A.3}$$
for all $s \in S$, $a \in A$. We put all $q_{s,a}^{\pi}$ together as $q^{\pi} \in \mathbb{R}^{|S||A|}$ and write $q_a^{\pi} = [\,q_{1,a}^{\pi}, \ldots, q_{|S|,a}^{\pi}\,]^\top$, for brevity. For an action $a$, we collect the transition probabilities $P(s'|s,a)$ for all $s', s \in S$ into the shorthand notation $P_a \in \mathbb{R}^{|S| \times |S|}$. The occupancy measure $q^{\pi}$ has to satisfy a set of linear constraints given by $Q := \{\, q \in \mathbb{R}^{|S||A|} \mid \sum_{a \in A} (I - \gamma P_a^\top)\, q_a = \rho \ \text{and}\ q \geq 0 \,\}$. With a slight abuse of notation, we write $r \in [0,1]^{|S||A|}$ and $g \in [0,1]^{|S||A|}$. Thus, the value functions $V_r^{\pi}$, $V_g^{\pi}\colon S \to \mathbb{R}$ under the initial state distribution $\rho$ are linear in $q^{\pi}$:
$$V_r^{\pi}(\rho) \,=\, \langle q^{\pi}, r \rangle \,:=\, F_r(q^{\pi}) \qquad \text{and} \qquad V_g^{\pi}(\rho) \,=\, \langle q^{\pi}, g \rangle \,:=\, F_g(q^{\pi}).$$
We are now in a position to consider the primal problem (2.3) as a linear program,
$$\underset{q \in Q}{\operatorname{maximize}}\ \ F_r(q) \qquad \text{subject to} \ \ F_g(q) \,\geq\, b \tag{A.4}$$
where the maximization is over all occupancy measures $q \in Q$. Once we compute a solution $q$, the associated policy solution can be recovered via
$$\pi(a|s) \,=\, \frac{q_{s,a}}{\sum_{a' \in A} q_{s,a'}} \qquad \text{for all } s \in S,\ a \in A. \tag{A.5}$$
Abstractly, we let $\pi_q\colon Q \to \Delta(A)^{|S|}$ be the mapping (A.5) from an occupancy measure $q$ to a policy $\pi$. Similarly, as defined by (A.3), we let $q_\pi\colon \Delta(A)^{|S|} \to Q$ be the mapping from a policy $\pi$ to an occupancy measure $q^{\pi}$. Clearly, $q_\pi = (\pi_q)^{-1}$.
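To make the linear-programming reformulation (A.4) and the policy-recovery map (A.5) concrete, the following sketch solves the occupancy-measure LP for a small tabular constrained MDP and then recovers a policy. This is a minimal illustration rather than an algorithm from the dissertation; the array layout, the choice of `scipy.optimize.linprog` as the solver, and the uniform fallback at states with zero occupancy mass are assumptions made for the example.

```python
import numpy as np
from scipy.optimize import linprog

def solve_cmdp_lp(P, r, g, rho, b, gamma):
    """Solve (A.4): maximize <q, r> over q in Q subject to <q, g> >= b,
    where Q = { q >= 0 : sum_a (I - gamma * P_a^T) q_a = rho },
    then recover pi(a|s) = q_{s,a} / sum_{a'} q_{s,a'} as in (A.5).

    P: (S, A, S) transition probabilities, r, g: (S, A) reward and utility,
    rho: (S,) initial distribution, b: constraint offset, gamma: discount factor.
    """
    S, A = r.shape
    # Bellman-flow equality constraints: one row per state s'.
    A_eq = np.zeros((S, S * A))
    for s in range(S):
        for a in range(A):
            col = s * A + a
            A_eq[s, col] += 1.0                 # the I * q_a contribution
            A_eq[:, col] -= gamma * P[s, a, :]  # the -gamma * P_a^T q_a contribution
    # Utility constraint <q, g> >= b written as -<q, g> <= -b.
    res = linprog(
        c=-r.reshape(-1),                       # linprog minimizes, so negate r
        A_ub=-g.reshape(1, -1), b_ub=np.array([-b]),
        A_eq=A_eq, b_eq=rho, bounds=(0, None), method="highs",
    )
    if not res.success:
        raise ValueError("LP infeasible or solver failure: " + res.message)
    q = res.x.reshape(S, A)
    # Policy recovery (A.5); states with zero mass get a uniform policy
    # (a choice made for this example, since (A.5) leaves them unconstrained).
    mass = q.sum(axis=1, keepdims=True)
    pi = np.where(mass > 1e-12, q / np.maximum(mass, 1e-12), 1.0 / A)
    return q, pi
```

For tabular problems this recovers the optimal value of the constrained program directly; the primal-dual methods analyzed in this chapter instead operate in the policy parametrization, where the same objective is no longer concave (cf. Lemma 3).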
Despite the non-convexity essence of (2.3) in policy space, the reformulation (A.4) reveals underlying convexity in occupancy measureq . In Lemma 44, we exploit this convexity to show the average policy improvement overT steps. Lemma44(Boundedaverageperformance) Let assumptions in Theorem 4 hold. Then, the it- erates ( (t) ; (t) ) generated by PG-PD method (2.10) satisfy 1 T T1 X t = 0 Z (t) F r (q ? )F r (q (t) ) + 1 T T1 X t = 0 (t) F g (q ? )F g (q (t) ) D L T 1=4 (A.6) whereD := 8jSj (1 ) 2 kd ? =k 2 1 andL := 2jAj(1+2=) (1 ) 4 . Proof. From the dual update in (2.10) we have 0 (t) 2=((1 )). From the smooth property of the value functions under the direct policy parametrization [8, Lemma D.3] we have F r (q )F r (q (t) ) r F r (q (t) ); (t) jAj (1 ) 3 k (t) k 2 : If we x (t) 0, then (F r + (t) F g )(q ) (F r + (t) F g )(q (t) ) r F r (q (t) ) + (t) r F g (q (t) ); (t) L 2 k (t) k 2 : Thus, (F r + (t) F g )(q ) (F r + (t) F g )(q (t) ) + r F r (q (t) ) + (t) r F g (q (t) ); (t) L 2 k (t) k 2 (F r + (t) F g )(q ) L k (t) k 2 : (A.7) We note that the primal update in (2.10) is equivalent to (t+1) = argmax 2 ( V (t) r () + (t) V (t) g () + r V (t) r () + (t) r V (t) g (); (t) 1 2 1 k (t) k 2 ) : 227 By taking 1 = 1=L and = (t+1) in (A.7), (F r + (t) F g )(q (t+1) ) maximize 2 ( (F r + (t) F g )(q (t) ) + r F r (q (t) ) + (t) r F g (q (t) ); (t) L 2 k (t) k 2 ) maximize 2 (F r + (t) F g )(q ) L k (t) k 2 maximize 2 [0;1] (F r + (t) F g )(q ) L k (t) k 2 (A.8) where := q (q ? + (1 )q (t) ), we apply (A.7) for the second inequality, and the last inequality is due to q q = id SA and linearity ofq in. SinceF r andF g are linear inq , we have (F r + (t) F g )(q ) = (F r + (t) F g )(q ? ) + (1)(F r + (t) F g )(q (t) ): (A.9) By the denition of q , ( q (q) q (q 0 )) sa = 1 P a2A q sa (q sa q 0 sa ) + P a2A q 0 sa P a2A q sa P a2A q sa P a2A q sa q 0 sa which, together withkx +yk 2 2kxk 2 + 2kyk 2 , gives k q (q) q (q 0 )k 2 2 X s2S X a2A (q sa q 0 sa ) 2 ( P a2A q sa ) 2 + 2 X s2S X a2A P a2A q 0 sa P a2A q sa P a2A q sa P a2A q sa 2 (q 0 sa ) 2 2 X s2S 1 ( P a2A q sa ) 2 0 @ X a2A (q sa q 0 sa ) 2 + X a2A q 0 sa X a2A q sa ! 2 1 A : Therefore, k (t) k 2 = q q ? + (1)q (t) q q (t) 2 X s2S 2 2 P a2A q (t) sa 2 0 @ X a2A q ? sa q (t) sa 2 + X a2A q (t) sa X a2A q ? sa ! 2 1 A 228 in which the upper bound further can be relaxed into X s2S 4 2 P a2A q (t) sa 2 0 @ X a2A q ? sa ! 2 + X a2A q (t) sa ! 2 1 A = 4 2 X s2S (d ? (s)) 2 + d (t) (s) 2 d (t) (s) 2 4 2 jSj + 4 2 jSj d ? d (t) 2 1 4 2 jSj 0 @ 1 + 1 (1 ) 2 d ? 2 1 1 A 2 D (A.10) where we applyd (t) (1 ) componentwise in the second inequality. We now apply (A.9) and (A.10) to (A.8), (F r + (t) F g )(q ? ) (F r + (t) F g )(q (t+1) ) minimize 2 [0;1] L k (t) k 2 + (F r + (t) F g )(q ? ) (F r + (t) F g )(q ) minimize 2 [0;1] n 2 D L + (1) (F r + (t) F g )(q ? ) (F r + (t) F g )(q (t) ) o which further implies (F r + (t+1) F g )(q ? ) (F r + (t+1) F g )(q (t+1) ) minimize 2 [0;1] n 2 D L + (1) (F r + (t) F g )(q ? ) (F r + (t) F g )(q (t) ) o ( (t) (t+1) )(F g (q ? )F g (q (t+1) )): (A.11) We check the right-hand side of the inequality (A.11). By the dual update in (2.10), it is easy to see that( (t) (t+1) )(F g (q ? )F g (q (t+1) ))j (t) (t+1) j=(1 ) 2 =(1 ) 2 . we discuss three cases: (i) when (t) < 0, we set = 0 for (A.11), (F r + (t+1) F g )(q ? ) (F r + (t+1) F g )(q (t+1) ) D L 2 p T ; (A.12) (ii) when (t) > 1, we set = 1 that leads to (F r + (t+1) F g )(q ? ) (F r + (t+1) F g )(q (t+1) ) 3 2 D L , i.e., (t+1) 3=4. 
Thus, this case reduces to the next case (iii): 0 (t) 1 in which we can express (A.11) as (F r + (t+1) F g )(q ? ) (F r + (t+1) F g )(q (t+1) ) 1 (Fr + (t) Fg )(q ? )(Fr + (t) Fg )(q (t) ) 4D L (F r + (t) F g )(q ? ) (F r + (t) F g )(q (t) ) + D L 2 p T 229 or equivalently, (t+1) 1 (t) 2 (t) + 1 4 p T : (A.13) By choosing (0) = 0 and (0) such thatV (0) r () V ? r (), we know that (0) 0. Thus, (1) 1=(4 p T ). By (A.12), the case (1) 0 is trivial. Without loss of generality, we assume that 0 (t) 1=T 1=4 1. By induction overt for (A.13), (t+1) 1 (t) 2 (t) + 1 4 p T 1 T 1=4 : (A.14) By combining (A.12) and (A.14), and averaging overt = 0; 1; ;T 1, we get the desired bound. Proof. [Proof of Theorem 4] Boundingtheoptimalitygap. By the dual update (2.10) and (0) = 0, it is convenient to bound ( (T ) ) 2 by (T ) 2 = T1 X t = 0 ( (t+1) ) 2 ( (t) ) 2 = 2 2 T1 X t = 0 (t) bF g (q (t) ) + 2 2 T1 X t = 0 F g (q (t) )b 2 2 2 T1 X t = 0 (t) F g (q ? )F g (q (t) ) + 2 2 T (1 ) 2 where the inequality is due to the feasibility of the optimal policy ? or the associated occupancy measureq ? = q ? : F g (q ? ) b, andjF g (q (t) )bj 1=(1 ). The above inequality further implies 1 T T1 X t = 0 (t) F g (q ? )F g (q (t) ) 2 2(1 ) 2 : By substituting the above inequality into (A.6) in Lemma 44, we show the desired optimality gap bound, where we take 2 = (1 ) 2 D L =(2 p T ). Bounding the constraint violation. From the dual update in (2.10) we have for any 2 [ 0; 2=((1 )) ], j (t+1) j 2 (a) (t) 2 F g (q (t) )b 2 (b) (t) 2 2 2 F g (q (t) )b (t) + 2 2 (1 ) 2 where (a) is due to the non-expansiveness of projectionP and (b) is due to (F g (q (t) )b) 2 1=(1 ) 2 . Summing it up fromt = 0 tot =T 1, and dividing it byT , yield 1 T j (T ) j 2 1 T j (0) j 2 2 2 T T1 X t = 0 F g (q (t) )b (t) + 2 2 (1 ) 2 230 which further implies, 1 T T1 X t = 0 F g (q (t) )b (t) j (0) j 2 2 2 T + 2 2(1 ) 2 : We note thatF g (q ? )b. By adding the inequality above to (A.6) in Lemma 44 from both sides, 1 T T1 X t = 0 F r (q ? )F r (q (t) ) + T T1 X t = 0 bF g (q (t) ) D L T 1=4 + 1 2 2 T j (0) j 2 + 2 2(1 ) 2 : We choose = 2 (1 ) if P T1 t = 0 bF g (q (t) ) 0; otherwise = 0. Thus, F r (q ? )F r (q 0 ) + 2 (1 ) [bF g (q 0 )] + D L T 1=4 + 1 2 2 (1 ) 2 2 T + 2 2(1 ) 2 where there existsq 0 such thatF r (q 0 ) := 1 T P T1 t = 0 F r (q (t) ) andF g (q 0 ) := 1 T P T1 t = 0 F g (q (t) ) by the denition of occupancy measure. Application of Lemma 2 with 2=((1 )) 2 ? yields [bF g (q 0 )] + (1 )D L T 1=4 + 1 2 2 (1 )T + 2 2(1 ) which readily leads to the desired constraint violation bound by noting that 1 T T1 X t = 0 bF g (q (t) ) = bF g (q 0 ) and taking 2 = 8jAjjSj(1+2=) (1 ) 4 p T kd ? =k 2 1 andkd ? =k 2 1 (1 ) 2 . A.5 ProofofLemma5 The dual update follows Lemma 1. Since ? (V ? r ()V r ())= with 0V ? r ,V r 1=(1 ), we take projection interval = [ 0; 2=((1 )) ] such that upper bound 2=((1 )) is such that 2=((1 )) 2 ? . We now verify the primal update. We expand the primal update in (2.11) into the following form, (t+1) = (t) + 1 F y ( (t) )r V (t) r () + 1 (t) F y ( (t) )r V (t) g (): (A.15) We now deal with: F y ( (t) )r V (t) r () and F y ( (t) )r V (t) g (). For the rst one, the proof begins with solutions to the following approximation error minimization problem: minimize w2R jSjjAj E r (w) := E sd ;a (ajs) h A r (s;a)w > r log (ajs) 2 i : 231 Using the Moore-Penrose inverse, the optimal solution reads, w ? r = F y ()E sd ;a (ajs) r log (ajs)A ; r (s;a) = (1 )F y ()r V ; r () where F () is the Fisher information matrix induced by . 
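The least-squares characterization just stated, namely that the minimizer $w_r^\star$ of $E_r(w)$ equals $(1-\gamma)\,F_\rho^{\dagger}(\theta)\,\nabla_\theta V_r^{\pi_\theta}(\rho)$, can be read as a simple sample-based recipe: regress estimated advantages on the score vectors $\nabla_\theta \log \pi_\theta(a|s)$. The sketch below only illustrates that idea; it assumes pre-collected samples, and the batch ridge solve stands in for the stochastic-gradient routine of the dissertation's sample-based algorithm, so the helper name and the `ridge` parameter are not from the text.

```python
import numpy as np

def estimate_npg_direction(score_feats, adv_targets, ridge=1e-6):
    """Approximately solve  min_w E[(A(s,a) - w^T grad_theta log pi(a|s))^2]
    from samples with s ~ d_rho^pi and a ~ pi(.|s). Up to a 1/(1 - gamma)
    scaling and a state-dependent offset, the solution points along the
    natural policy gradient direction F_rho^+(theta) grad V (cf. (A.17) below).

    score_feats: (N, d) rows are grad_theta log pi(a_i | s_i).
    adv_targets: (N,)  estimated advantages A(s_i, a_i).
    ridge: small regularizer standing in for the Moore-Penrose pseudo-inverse.
    """
    n, d = score_feats.shape
    fisher_hat = score_feats.T @ score_feats / n   # empirical Fisher matrix
    moment_hat = score_feats.T @ adv_targets / n   # empirical E[grad log pi * A]
    return np.linalg.solve(fisher_hat + ridge * np.eye(d), moment_hat)
```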
One key observation from this solution is thatw ? r is parallel to the NPG directionF y ()r V ; r (). On the other hand, it is easy to verify thatA r is a minimizer ofE r (w). The softmax pol- icy (2.5) implies that @ log (ajs) @ s 0 ;a 0 = 1[s =s 0 ] (1[a =a 0 ] (a 0 js)) (A.16) where1[E] is the indicator function of eventE being true. Thus, we have w > r log (ajs) = w s;a X a 0 2A w s;a 0 (a 0 js): The above equality together with the fact: P a2A (ajs)A ; r (s;a) = 0, yieldsE r (A r ) = 0. However, A r may not be the unique minimizer. We consider the following general form of possible solutions, A r + u; where u2 R jSjjAj : For any states and actiona such thats is reachable under, using (A.16) yields u > r log (ajs) = u s;a X a 0 2A u s;a 0 (a 0 js): Here, we make use of the following fact: is a stochastic policy with (ajs) > 0 for all actionsa in each states, so that if a state is reachable under, then it will also be reachable using . Therefore, we require zero derivative at each reachable state: u > r log (ajs) = 0 for alls,a so thatu s;a is independent of the action and becomes a constantc s for eachs. Therefore, the minimizer ofE r (w) is given up to some state-dependent oset, F y ()r V r () = A r 1 + u (A.17) whereu s;a =c s for somec s 2R for each states and actiona. We can repeat the above procedure forF y ( (t) )r V (t) g () and show, F y ()r V g () = A g 1 + v (A.18) wherev s;a =d s for somed s 2R for each states and actiona. 232 Substituting (A.17) and (A.18) into the primal update (A.15) yields, (t+1) = (t) + 1 1 A (t) r + (t) A (t) g + 1 u + (t) v (t+1) (ajs) = (t) (ajs) exp 1 1 A (t) r (s;a) + (t) A (t) g (s;a) + 1 c s + (t) d s Z (t) (s) where the second equality also utilizes the normalization termZ (t) (s). Finally, we complete the proof by settingc s =d s = 0. A.6 Sample-basedalgorithmwithfunctionapproximation We describe a sample-based NPG-PD algorithm with function approximation in Algorithm 1. We note the computational complexity of Algorithm 1: each round has expected length 2=(1 ) so the expected number of total samples is 4KT=(1 ); the total number of gradient computations r log (t) (ajs) is 2KT ; the total number of scalar multiplies, divides, and additions isO(dKT + KT=(1 )). The following unbiased estimates that are useful in our analysis. E h ^ V (t) g (s) i = E " K 0 1 X k = 0 g(s k ;a k )j (t) ;s 0 =s # = E " 1 X k = 0 1[K 0 1k 0]g(s k ;a k )j (t) ;s 0 =s # (a) = 1 X k = 0 E E K 0 [1[K 0 1k 0]]g(s k ;a k )j (t) ;s 0 =s (b) = 1 X k = 0 E k g(s k ;a k )j (t) ;s 0 =s (c) = E " 1 X k = 0 k g(s k ;a k )j (t) ;s 0 =s # = V (t) g (s) where we apply the Monotone Convergence Theorem and the Dominated Convergence Theorem for (a) and swap the expectation and the innite sum in (c), and in (b) we apply E K 0 [1[K 0 1k 0]] = 1P (K 0 <k) = k sinceK 0 Geometric(1 ), a geometric distribution. 
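A minimal sampling sketch of the estimator analyzed above (function names and array shapes are hypothetical): the policy is rolled out for K0 ~ Geometric(1-gamma) steps and the undiscounted utilities are summed, so the identity E[1{K0 - 1 >= k}] = gamma^k used in step (b) supplies the discounting in expectation.

```python
import numpy as np

def sampled_value(P, g, pi, s0, gamma, rng):
    """One unbiased estimate of V^pi_g(s0) via a geometric random horizon.
    P: (S, A, S) transition kernel, g: (S, A) utility, pi: (S, A) policy."""
    S, A = g.shape
    K0 = rng.geometric(1.0 - gamma)     # K0 >= 1 and P(K0 >= k + 1) = gamma**k
    s, total = s0, 0.0
    for _ in range(K0):                 # steps k = 0, ..., K0 - 1
        a = rng.choice(A, p=pi[s])
        total += g[s, a]                # undiscounted sum; discounting comes from the cut-off
        s = rng.choice(S, p=P[s, a])
    return total

# Averaging many independent calls approximates V^pi_g(s0) = E[sum_k gamma^k g(s_k, a_k)];
# fixing the first action instead of drawing it from pi gives the Q-estimate treated next.
```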
233 By a similar agument as above, E h ^ Q (t) r (s;a) i = E " K 0 1 X k = 0 r(s k ;a k )j (t) ;s 0 =s;a 0 =a # = E " 1 X k = 0 1[K 0 1k 0]r(s k ;a k )j (t) ;s 0 =s;a 0 =a # = 1 X k = 0 E E K 0 [1[K 0 1k 0]]r(s k ;a k )j (t) ;s 0 =s;a 0 =a = 1 X k = 0 E k r(s k ;a k )j (t) ;s 0 =s;a 0 =a = E " 1 X k = 0 k r(s k ;a k )j (t) ;s 0 =s;a 0 =a # = Q (t) r (s;a): Therefore, E h ^ A (t) r (s;a) i = E h ^ Q (t) r (s;a) i E h ^ V (t) r (s) i = Q (t) r (s;a) V (t) r (s) = A (t) r (s;a): We also provide a bound on the variance of ^ V (t) g (s), Var h ^ V (t) g (s) i = E ^ V (t) g (s)V (t) g (s) 2 j (t) ;s 0 =s = E P K 0 1 k = 0 g(s k ;a k )V (t) g (s) 2 j (t) ;s 0 =s = E K 0 E P K 0 1 k = 0 g(s k ;a k )V (t) g (s) 2 jK 0 (a) E K 0 (K 0 ) 2 jK 0 (b) = 1 (1 ) 2 where (a) is due to 0g(x k ;a k ) 1 andV (t) g (s) 0 and (b) is clear fromK 0 Geometric(1 ). By the sampling scheme of Algorithm 2, we can show thatG r;k is an unbiased estimate of the population gradientr E (t) r (w r ; (t) ), E (s;a)d (t) [G ;k ] = 2E h w > r;k r log (t) (ajs) ^ A (t) r (s;a) r log (t) (ajs) i = 2E h w > r;k r log (t) (ajs)E h ^ A (t) r (s;a)js;a i r log (t) (ajs) i = 2E h w > r;k r log (t) (ajs)A (t) r (s;a) r log (t) (ajs) i = r wr E (t) r (w r ; (t) ): 234 A.7 ProofofTheorem13 We rst adapt Lemma 11 to the sample-based case as follows. Lemma45(Sample-basedregret/violationlemma) Let Assumption 1 hold and let us x a state distribution and T > 0. Assume that log (ajs) is -smooth in for any (s;a). If the iterates ( (t) ; (t) ) generated by the Algorithm 1 with (0) = 0, (0) = 0, 1 = 2 = 1= p T, and k ^ w (t) r k,k ^ w (t) g kW, then, E " 1 T T1 X t = 0 V ? r () V (t) r () # C 5 (1 ) 5 1 p T + T1 X t = 0 E h err (t) r ( ? ) i (1 )T + T1 X t = 0 2E h err (t) g ( ? ) i (1 ) 2 T E " 1 T T1 X t = 0 b V (t) g () # + C 6 (1 ) 4 1 p T + T1 X t = 0 E h err (t) r ( ? ) i T + T1 X t = 0 2E h err (t) g ( ? ) i (1 )T whereC 5 = 2 + logjAj + 5W 2 =,C 6 = (2 + logjAj +W 2 ) + (2 + 4W 2 )=, and c err (t) () := E sd E a(js) A (t) (s;a) ( ^ w (t) ) > r log (t) (ajs) ; where = r org: Proof. The smoothness of log-linear policy in conjunction with an application of Taylor’s the- orem to log (t) (ajs) yield log (t) (ajs) (t+1) (ajs) + (t+1) (t) > r log (t) (ajs) 2 k (t+1) (t) k 2 where (t+1) (t) = 1 1 ^ w (t) . We unloadd ? asd ? since ? and are xed. Therefore, E sd ? D KL ( ? (js)k (t) (js))D KL ( ? (js)k (t+1) (js)) = E sd ?E a ? (js) log (t) (ajs) (t+1) (ajs) 1 E sd ?E a ? (js) h ( ^ w (t) ) > r log (t) (ajs) i 2 1 2(1 ) 2 k ^ w (t) k 2 = 1 E sd ?E a ? (js) h ( ^ w (t) r ) > r log (t) (ajs) i + 1 (t) E sd ?E a ? (js) h ( ^ w (t) g ) > r log (t) (ajs) i 2 1 2(1 ) 2 k ^ w (t) k 2 = 1 E sd ?E a ? (js) A (t) r (s;a) + 1 (t) E sd ?E a ? (js) A (t) g (s;a) + 1 E sd ?E a ? (js) h ^ w (t) r + (t) ^ w (t) g > r log (t) (ajs) A (t) r (s;a)+ (t) A (t) g (s;a) i 2 1 (1 ) 2 k ^ w (t) r k 2 + (t) 2 k ^ w (t) g k 2 1 (1 ) V ? r ()V (t) r () + 1 (1 ) (t) V ? g ()V (t) g () 1 c err (t) r ( ? ) 1 (t) c err (t) g ( ? ) 2 1 W 2 (1 ) 2 2 1 W 2 (1 ) 2 (t) 2 235 where ^ w (t) = ^ w (t) r + (t) ^ w (t) g for a given (t) , in the last inequality we apply the performance dierence lemma, notation ofc err (t) r ( ? ) andc err (t) g ( ? ), andk ^ w (t) r k,k ^ w (t) g kW . Rearranging the inequality above leads to, V ? r ()V (t) r () 1 1 1 1 E sd ? D KL ( ? (js)k (t) (js))D KL ( ? (js)k (t+1) (js)) + 1 1 c err (t) r ( ? ) + 2 (1 ) 2 c err (t) g ( ? ) + 1 W 2 (1 ) 3 + 4 1 W 2 (1 ) 5 2 (t) V ? g ()V (t) g () where we utilize 0 (t) 2=((1 )) from the dual update of Algorithm 1. 
Therefore, 1 T T1 X t = 0 V ? r ()V (t) r () 1 (1 ) 1 T T1 X t = 0 E sd ? D KL ( ? (js)k (t) (js))D KL ( ? (js)k (t+1) (js)) + 1 (1 )T T1 X t = 0 c err (t) r ( ? ) + 2 (1 ) 2 T T1 X t = 0 c err (t) g ( ? ) + 1 W 2 (1 ) 3 + 4 1 W 2 (1 ) 5 2 1 T T1 X t = 0 (t) V ? g ()V (t) g () logjAj (1 ) 1 T + 1 (1 )T T1 X t = 0 c err (t) r ( ? ) + 2 (1 ) 2 T T1 X t = 0 c err (t) g ( ? ) + 1 W 2 (1 ) 3 + 4 1 W 2 (1 ) 5 2 + 1 T T1 X t = 0 (t) V g ()V (t) g () where in the last inequality we take a telescoping sum of the rst sum and drop a non-positive term. Taking the expectation over the randomness in sampling on both sides of the inequality above yields E " 1 T T1 X t = 0 V ? r ()V (t) r () # + E " 1 T T1 X t = 0 (t) V ? g ()V (t) g () # logjAj (1 ) 1 T + 1 (1 )T T1 X t = 0 E h c err (t) r ( ? ) i + 2 (1 ) 2 T T1 X t = 0 E h c err (t) g ( ? ) i + 1 W 2 (1 ) 3 + 4 1 W 2 (1 ) 5 2 : (A.19) 236 Provingtherstinequality. From the dual update in Algorithm 1 we have 0 (T ) 2 = T1 X t = 0 ( (t+1) ) 2 ( (t) ) 2 T1 X t = 0 (t) 2 ^ V (t) g ()b 2 ( (t) ) 2 = 2 2 T1 X t = 0 (t) b ^ V (t) g () + 2 2 T1 X t = 0 ^ V (t) g ()b 2 2 2 T1 X t = 0 (t) V ? g ()V (t) g () + 2 2 T1 X t = 0 (t) V (t) g () ^ V (t) g () + 2 2 T1 X t = 0 ( ^ V (t) g ()b) 2 where the second inequality is due to the feasibility of the policy ? :V ? g ()b. SinceV (t) g () is a population quantity and ^ V (t) g () is an estimate that is independent of (t) given the past history, (t) is independent ofV (t) g () ^ V (t) g () at timet and thusE (t) V (t) g () ^ V (t) g () = 0 due to the factE ^ V (t) g () =V (t) g (); see it in Appendix A.6. Therefore, E " 1 T T1 X t = 0 (t) V ? g ()V (t) g () # E " 2 2T T1 X t = 0 ( ^ V (t) g ()b) 2 # 2 2 (1 ) 2 (A.20) where in the second inequality we drop a non-positive term and use the fact E h ^ V (t) g () 2 i = Var h ^ V (t) g (s) i + E h ^ V (t) g (s) i 2 2 (1 ) 2 where the inequality is due to that Var ^ V (t) g (s) 1=(1 ) 2 ; see it in Appendix A.6, and E h ^ V (t) g () i = V (t) g (), where 0V (t) g (s) 1=(1 ). Adding the inequality (A.20) to (A.19) on both sides and taking 1 = 2 = 1= p T yield the rst inequality. Proving the second inequality. From the dual update in Algorithm 1 we have for any 2 := 0; 1=((1 )) , E j (t+1) j 2 = E P (t) 2 ^ V (t) g ()b P () 2 (a) E h (t) 2 ^ V (t) g ()b 2 i = E h (t) 2 i 2 2 E h ^ V (t) g ()b (t) i + 2 2 E h ^ V (t) g ()b 2 i (b) E h (t) 2 i 2 2 E h ^ V (t) g ()b (t) i + 3 2 2 (1 ) 2 237 where (a) is due to the non-expansiveness of projectionP and (b) is due toE ( ^ V (t) g ()b) 2 2=(1 ) 2 + 1=(1 ) 2 . Summing it up fromt = 0 tot =T 1 and dividing it byT yield 0 1 T E j (T ) j 2 1 T E h (0) 2 i 2 2 T T1 X t = 0 E h ^ V (t) g ()b (t) i + 3 2 2 (1 ) 2 which further implies that E " 1 T T1 X t = 0 V (t) g ()b (t) # 1 2 2 T E h (0) 2 i + 2 2 (1 ) 2 where we useE ^ V (t) g () =V (t) g () and (t) is independent of ^ V (t) g () given the past history. We now add the above inequality into (A.19) on both sides and utilizeV ? g ()b, E " 1 T T1 X t = 0 V ? r () V (t) r () # + E " 1 T T1 X t = 0 b V (t) g () # logjAj (1 ) 1 T + 1 (1 )T T1 X t = 0 E err (t) r ( ? ) + 2 (1 ) 2 T T1 X t = 0 E err (t) g ( ? ) + 1 W 2 (1 ) 3 + 4 1 W 2 (1 ) 5 2 + 1 2 2 T E h (0) 2 i + 2 2 (1 ) 2 : By taking = 2 (1 ) when P T1 t = 0 bV (t) g () 0; otherwise = 0, we reach E " V ? r () 1 T T1 X t = 0 V (t) r () # + 2 (1 ) E " b 1 T T1 X t = 0 V (t) g () # + logjAj (1 ) 1 T + 1 (1 )T T1 X t = 0 E err (t) r ( ? ) + 2 (1 ) 2 T T1 X t = 0 E err (t) g ( ? 
) + 1 W 2 (1 ) 3 + 4 1 W 2 (1 ) 5 2 + 2 2 (1 ) 2 2 T + 2 2 (1 ) 2 : SinceV (t) r () andV (t) g () are linear functions in the occupancy measure [11, Chapter 10], there exists a policy 0 such thatV 0 r () = 1 T P T1 t = 0 V (t) r () andV 0 g () = 1 T P T1 t = 0 V (t) g (). Hence, E h V ? r () V 0 r () i + 2 (1 ) E h b V 0 g () i + logjAj (1 ) 1 T + 1 (1 )T T1 X t = 0 E err (t) r ( ? ) + 2 (1 ) 2 T T1 X t = 0 E err (t) g ( ? ) + 1 W 2 (1 ) 3 + 4 1 W 2 (1 ) 5 2 + 2 2 (1 ) 2 2 T + 2 2 (1 ) 2 : 238 Application of Lemma 2 with 2=((1 )) 2 ? yields E b V 0 g () + logjAj 1 T + T T1 X t = 0 E err (t) r ( ? ) + 2 (1 )T T1 X t = 0 E err (t) g ( ? ) + 1 W 2 (1 ) 2 + 4 1 W 2 (1 ) 4 + 2 2 (1 )T + 2 2 (1 ) which leads to our constraint violation bound by taking 1 = 2 = 1= p T . Proof. [Proof of Theorem 13] By Lemma 45, we only need to consider the randomness in sequences of ^ w (t) and bound E h err (t) ( ? ) i for =r org. Application of the triangle inequality yields c err (t) r ( ? ) E sd ? E a ? (js) h A (t) r (s;a) (w (t) r;? ) > r log (t) (ajs) i + E sd ? E a ? (js) h w (t) r;? ^ w (t) r > r log (t) (ajs) i (A.21) where w (t) r;? 2 argmin kwrk 2 W E (t) r (w r ; (t) ). We next bound each term in the right-hand side of (A.21), separately. For the rst term, E sd ? E a ? (js) h A (t) r (s;a) (w (t) r;? ) > r log (t) (ajs) i r E sd ? E a ? (js) A (t) r (s;a) (w (t) r;? ) > r log (t) (ajs) 2 = q E ? r w (t) r;? ; (t) : (A.22) Similarly, E sd ? E a ? (js) h w (t) r;? ^ w (t) r > r log (t) (ajs) i s E sd ? E a ? (js) w (t) r;? ^ w (t) r > r log (t) (ajs) 2 = r kw (t) r;? ^ w (t) r k 2 (t) ? : (A.23) We let (t) := (t) 0 1=2 (t) ? (t) 0 1=2 2 be the relative condition number at timet. Thus, kw (t) r;? ^ w (t) r k 2 (t) ? k (t) 0 1=2 (t) ? (t) 0 1=2 kkw (t) r;? ^ w (t) r k 2 (t) 0 (a) (t) 1 kw (t) r;? ^ w (t) r k 2 (t) (b) (t) 1 E (t) r ^ w (t) r ; (t) E (t) r w (t) r;? ; (t) (A.24) 239 where we use (1 ) 0 (t) 0 := (t) in (a), and we get (b) due to that the rst-order optimality condition forw (t) r;? , w r w (t) r;? > r E (t) r w (t) r;? ; (t) 0; for anyw r satisfyingkw r k W: further implies that E (t) r w r ; (t) E (t) r w (t) r;? ; (t) = E s;a (t) A (t) r (s;a) > s;a w (t) r;? + > s;a w (t) r;? > s;a w r 2 E (t) r w (t) r;? ; (t) = 2 w (t) r;? w r > E s;a (t) h A (t) r (s;a) > s;a w (t) r;? s;a i +E s;a (t) > s;a w (t) r;? > s;a w r 2 = w r w (t) r;? > r E (t) r (w (t) r;? ; (t) ) +kw r w (t) r;? k 2 (t) kw r w (t) r;? k 2 (t) : Taking an expectation over (A.24) from both sides yields E kw (t) r;? w (t) r k 2 (t) ? E (t) 1 E h E (t) r ^ w (t) r ; (t) E (t) r w (t) r;? ; (t) (t) i (a) E (t) 1 GW p K (b) GW (1 ) p K (A.25) where (a) is due to the standard SGD result [197, Theorem 14.8]: for =W=(G p K), E (t) r;est = E h E (t) r ^ w (t) r ; (t) E (t) r w (t) r;? ; (t) i GW p K and (b) follows Assumption 5. Substitution of (A.23), (A.25) into the right-hand side of (A.21) yields an upper bound on E err (t) r ( ? ) . By the same reasoning, we can establish a similar bound onE err (t) g ( ? ) . Finally, application of these upper bounds to Lemma 45 leads to our desired results. A.8 ProofofTheorem14 Byk s;a k B, for the log-linear policy class, log (ajs) is -smooth with = B 2 . By Lemma 45, we only need to consider the randomness in sequences of ^ w (t) and the error bounds 240 forE c err (t) r ( ? ) andE c err (t) g ( ? ) . We rst use (A.21) and consider the following cases. By (2.26) andA (t) r (s;a) =Q (t) r (s;a)E a 0 (t) (js) Q (t) r (s;a 0 ), E sd ? E a ? (js) h A (t) r (s;a) (w (t) r;? 
) > r log (t) (ajs) i = E sd ? E a ? (js) h Q (t) r (s;a) > s;a w (t) r;? i E sd ? E a 0 (t) (js) h Q (t) r (s;a 0 ) > s;a 0w (t) r;? i r E sd ? E a ? (js) Q (t) r (s;a) > s;a w (t) r;? 2 + r E sd ? E a 0 (t) (js) Q (t) r (s;a 0 ) > s;a 0 w (t) r;? 2 2 s jAjE sd ? E a Unif A Q (t) r (s;a) > s;a w (t) r;? 2 = 2 q jAjE ? r w (t) r;? ; (t) : (A.26) Similarly, E sd ? E a ? (js) h w (t) r;? ^ w (t) r > r log (t) (ajs) i = E sd ? E a ? (js) h w (t) r;? ^ w (t) r > s;a i E sd ? E a 0 (t) (js) h w (t) r;? ^ w (t) r > s;a 0 i 2 s jAjE sd ? E a Unif A w (t) r;? ^ w (t) r > s;a 2 = 2 q jAjkw (t) r;? ^ w (t) r k 2 ? (A.27) where ? :=E (s;a) ? s;a > s;a . By the denition of, kw (t) r;? ^ w (t) r k 2 ? kw (t) r;? ^ w (t) r k 2 0 1 kw (t) r;? ^ w (t) r k 2 (t) (A.28) where we use (1 ) 0 (t) 0 := (t) in the second inequality. We note that w (t) r;? 2 argmin kwrk 2 W E (t) r (w r ; (t) ): Application of the rst-order optimality condition forw (t) r;? yields w r w (t) r;? > r E (t) r w (t) r;? ; (t) 0; for anyw r satisfyingkw r k W: 241 Thus, E (t) r w r ; (t) E (t) r w (t) r;? ; (t) = E s;a (t) Q (t) r (s;a) > s;a w (t) r;? + > s;a w (t) r;? > s;a w r 2 E (t) r w (t) r;? ; (t) = 2 w (t) r;? w r > E s;a (t) h Q (t) r (s;a) > s;a w (t) r;? s;a i +E s;a (t) > s;a w (t) r;? > s;a w r 2 = w r w (t) r;? > r E (t) r w (t) r;? ; (t) +kw r w (t) r;? k 2 (t) kw r w (t) r;? k 2 (t) : Takingw r = ^ w (t) r in the inequality above and combining it with (A.28) and (A.27) yield E sd ? E a ? (js) h w (t) r;? ^ w (t) r > r log (t) (ajs) i 2 s jAj 1 E (t) r ^ w (t) r ; (t) E (t) r w (t) r;? ; (t) : (A.29) We now substitute (A.26) and (A.29) into the right-hand side of (A.21), E h err (t) r ( ? ) i 2 r jAjE h E d ? r w (t) r;? ; (t) i + 2 r jAj 1 E h E (t) r ^ w (t) r ; (t) E (t) r w (t) r;? ; (t) i 2 r jAjE h E d ? r w (t) r;? ; (t) i + 2 s jAj 1 GW p K where the second inequality is due to the standard SGD result [197, Theorem 14.8]: for = W=(G p K), E (t) r;est = E h E (t) r ^ w (t) r ; (t) E (t) r w (t) r;? ; (t) i GW p K : By the same reasoning, we can nd a similar bound onE err (t) g ( ? ) . Finally, our desired results follow by applying Assumption 2 and Lemma 45. 242 AppendixB SupportingproofsinChapter3 B.1 ProofofTheorem23 As we see in the proof of Theorem 16, our nal regret or constraint violation bounds are domi- nated by the accumulated bonus terms, which come from the design of ‘optimism in the face of uncertainty.’ This framework provides a powerful exibility for Algorithm 6 to incorporate other optimistic policy evaluation methods. In what follows, we introduce Algorithm 6 with a variant of optimistic policy evaluation. We repeat notation for readers’ convenience. For any (h;k)2 [H] [K], any (s;a;s 0 )2 SAS, and any (s;a)2SA, we dene two visitation countersn k h (s;a;s 0 ) andn k h (s;a) at steph in episodek, n k h (s;a;s 0 ) = k1 X = 1 1f(s;a;s 0 ) = (s h ;a h ;a h+1 )g and n k h (s;a) = k1 X = 1 1f(s;a) = (s h ;a h )g: This allows us to estimate transition kernel P h , reward function r, and utility function g for episodek by ^ P k h (s 0 js;a) = n k h (s;a;s 0 ) n k h (s;a) + ; for all (s;a;s 0 )2SAS ^ r k h (s;a) = 1 n k h (s;a) + k1 X = 1 1f(s;a) = (s h ;a h )gr h (s h ;a h ); for all (s;a)2SA ^ g k h (s;a) = 1 n k h (s;a) + k1 X = 1 1f(s;a) = (s h ;a h )gg h (s h ;a h ); for all (s;a)2SA where> 0 is the regularization parameter. Moreover, we introduce the bonus term k h :SA! R, k h (s;a) = n k h (s;a) + 1=2 243 which adapts the counter-based bonus terms in [19, 108], where > 0 is to be determined later. 
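The counters, empirical model, and bonus just defined are simple to compute from logged data; a minimal sketch follows (the container format for past transitions and all array shapes are hypothetical).

```python
import numpy as np

def empirical_model_and_bonus(episodes, S, A, H, lam=1.0, beta=1.0):
    """Counter-based estimates hat{P}, hat{r}, hat{g} and the bonus
    Gamma^k_h(s, a) = beta * (n^k_h(s, a) + lam) ** (-1/2).
    `episodes` is a list of past episodes, each a list of (h, s, a, s_next, r, g)."""
    n = np.zeros((H, S, A))
    n_next = np.zeros((H, S, A, S))
    r_sum = np.zeros((H, S, A))
    g_sum = np.zeros((H, S, A))
    for episode in episodes:
        for (h, s, a, s_next, rew, util) in episode:
            n[h, s, a] += 1
            n_next[h, s, a, s_next] += 1
            r_sum[h, s, a] += rew
            g_sum[h, s, a] += util
    P_hat = n_next / (n[..., None] + lam)   # hat{P}^k_h(s' | s, a)
    r_hat = r_sum / (n + lam)               # hat{r}^k_h(s, a)
    g_hat = g_sum / (n + lam)               # hat{g}^k_h(s, a)
    bonus = beta / np.sqrt(n + lam)         # Gamma^k_h(s, a)
    return P_hat, r_hat, g_hat, bonus
```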
Using the estimated transition kernelsf ^ P k h g H h = 1 , the estimated reward/utility functions f^ r k h ; ^ g k h g H h = 1 , and the bonus termsf k h g H h = 1 , we now can estimate the action-value function via Q k ;h (s;a) = min ^ k h (s;a) + X s 0 2S ^ P h (s 0 js;a)V k ;h+1 (s 0 ) + 2 k h (s;a); Hh + 1 + for any (s;a)2SA, where =r org. Thus,V k ;h (s) =hQ k ;h (s;); k h (js)i A . We summarize the above procedure in Algorithm 8. Using already estimatedfQ k r;h (;);Q k g;h (;)g H h = 1 , we can execute the policy improvement and the dual update in Algorithm 6. Similar to Theorem 16, we prove the following regret and constraint violation bounds. Proof. [Proof of Theorem 23] The proof is similar to Theorem 16. Since we only change the policy evaluation, all previous policy improvement results still hold. By Lemma 19, we have Regret(K) = C 3 H 2:5 p T logjAj + K X k = 1 H X h = 1 E ? k r;h (s h ;a h ) k r;h (s k h ;a k h ) + M K r;H;2 where k r;h is the model prediction error given by (3.8) andfM k r;h;m g (k;h;m)2[K][H][2] is a martin- gale adapted to the ltrationfF k h;m g (k;h;m)2[K][H][2] in terms of time indext dened in (3.13). By Lemma 21, it holds with probability 1p=3 thatjM K r;H;2 j 4 p H 2 T log(4=p). The rest is to bound the double sum term. As shown in Appendix B.3.2, with probability 1p=2 it holds that for any (k;h)2 [K] [H] and (s;a)2SA, 4 k h (s;a) k r;h (s;a) 0: (B.1) Together with the choice of k h , we have K X k = 1 H X h = 1 E ? k r;h (s h ;a h )js 1 k r;h (s k h ;a k h ) 4 K X k = 1 H X h = 1 k h (s k h ;a k h ) = 4 K X k = 1 H X h = 1 n k h (s k h ;a k h ) + 1=2 : Dene mapping : SA ! R jSjjAj as (s;a) = e (s;a) , we can utilize Lemma 52. For any (k;h)2 [K] [H], we have k h = k1 X = 1 (s h ;a h ) (s h ;a h ) > +I 2 R jSjjAjjSjjAj k h (s;a) = n k h (s;a) + 1=2 = q (s;a)( k h ) 1 (s;a) > 244 where k h is a diagonal matrix whose the (s;a)th diagonal entry isn k h (s;a) +. Therefore, we have K X k = 1 H X h = 1 E ? k r;h (s h ;a h ) k r;h (s k h ;a k h ) 4 K X k = 1 H X h = 1 (s k h ;a k h )( k h ) 1 (s k h ;a k h ) > 1=2 4 H X h = 1 K K X k = 1 (s k h ;a k h )( k h ) 1 (s k h ;a k h ) > ! 1=2 4 p 2K H X h = 1 log 1=2 det K+1 h det 1 h ! where we apply the Cauchy-Schwartz inequality for the second inequality and Lemma 52 for the third inequality. Notice that (K +)I K h and 1 h =I. Hence, Regret(K) = C 3 H 2:5 p T logjAj + 4 p 2jSjjAjHT s log K + + 4 s H 2 T log 6 p : Notice that logjAj O jSj 2 jAj log 2 (jSjjAjT=p) . We conclude the desired regret bound by setting = 1 and =C 1 H p jSj log(jSjjAjT=p). For the constraint violation analysis, Lemmas 22 still holds. Similar to (3.34), we have V ? r;1 (s 1 )V 0 r;1 (s 1 ) + h bV 0 g;1 (s 1 ) i + C 4 H 2:5 p T logjAj K + 4 K K X k = 1 H X h = 1 k h (s k h ;a k h ) + 4 K K X k = 1 H X h = 1 k h (s k h ;a k h ) + 1 K M K r;H;2 + K M K g;H;2 where V 0 r;1 (s 1 ) = 1 K P K k = 1 V k r;1 (s 1 ) and V 0 g;1 (s 1 ) = 1 K P K k = 1 V k g;1 (s 1 ). Similar to Lemma 21, it holds with probability 1p=3 thatjM K g;H;2 j 4 p H 2 T log(6=p) for = r or g. As shown in Appendix B.3.2, with probability 1p=3 it holds that4 k h (s;a) k ;h (s;a) 0 for any (k;h)2 [K] [H] and (s;a)2SA. Therefore, we have V ? r;1 (s 1 )V 0 r;1 (s 1 ) + bV 0 g;1 (s 1 ) + C 4 H 2:5 p T log(Aj) K + 4(1 +) p 2jSjjAjHT K s log K + + 4(1 +) K s H 2 T log 6 p which leads to the desired constraint violation bound due to Lemma 49 and we set and as previously. 245 B.2 ProofofFormulas(3.9)and(3.11) For any (k;h)2 [K] [H], we recall the denitions ofV ? 
r;h in the Bellman equations (3.1) and V k r;h from line 12 in Algorithm 7, V ? r;h (s) = Q ? h (s;); ? h (js) and V k r;h (s) = Q k h (s; ); k h (js) : We can expand the dierenceV ? r;h (s)V k r;h (s) into V ? r;h (s) V k r;h (s) = Q ? h (s; ); ? h (js) Q k h (s; ); k h (js) = Q ? h (s; )Q k h (s; ); ? h (js) + Q k h (s; ); ? h (js) k h (js) = Q ? h (s; )Q k h (s; ); ? h (js) + k h (x) (B.2) where k h (s) :=hQ k h (s; ); ? h (js) k h (js)i. Recall the equality in the Bellman equations (3.1) and the model prediction error, Q ? r;h = r k h + P h V ? r;h+1 and k r;h = r h + P h V k r;h+1 Q k r;h : As a result of the above two, it is easy to see that Q ? r;h Q k r;h = P h V ? r;h+1 V k r;h+1 + k r;h : Substituting the above dierence into the right-hand side of (B.2) yields, V ? r;h (s) V k r;h (s) = P h V ? r;h+1 V k r;h+1 (s; ); ? h (js) + k r;h (s; ); ? h (js) + k h (s): which displays a recursive formula overh. Thus, we expandV ? r;1 (s 1 )V k r;1 (s 1 ) recursively with x =s 1 as V ? r;1 (s 1 ) V k r;1 (s 1 ) = P 1 V ? r;2 V k r;2 (s 1 ;); ? 1 (js 1 ) + k r;1 (s 1 ;); ? 1 (js 1 )i + k 1 (s 1 ) = P 1 P 2 V ? r;3 V k r;3 (x 2 ;); ? 2 (jx 2 ) (s 1 ;); ? 1 (js 1 ) + P 1 k r;2 (x 2 ;); ? 2 (jx 2 ) (s 1 ;); ? 1 (js 1 ) + k r;1 (s 1 ;); ? 1 (js 1 ) + P 1 k 2 (s 1 ;); ? 1 (js 1 ) + k 1 (s 1 ): (B.3) For notational simplicity, for any (k;h)2 [K] [H], we dene an operatorI h for function f :SA!R, (I h f) (s) = f(s; ); ? h (js) : 246 With this notation, repeating the above recursion (B.3) overh2 [H] yields V ? r;1 (s 1 )V k r;1 (s 1 ) = I 1 P 1 I 2 P 2 V ? r;3 V k r;3 +I 1 P 1 I 2 k r;2 +I 1 k r;1 +I 1 P 1 k 2 + k 1 = I 1 P 1 I 2 P 2 I 3 P 3 V ? r;4 V k r;4 +I 1 P 1 I 2 P 2 I 3 k r;3 +I 1 P 1 I 2 k r;2 +I 1 k r;1 +I 1 P 1 I 2 P 2 k 3 +I 1 P 1 k 2 + k 1 . . . = H Y h = 1 I h P h ! V ? r;H+1 V k r;H+1 + H X h = 1 h1 Y i = 1 I i P i ! I h k r;h + H X h = 1 h1 Y h = 1 I i P i ! k h : Finally, notice thatV ? r;H+1 = V k r;H+1 = 0, we use the denitions ofP h andI h to conclude (3.9). Similarly, we can also use the above argument to verify (3.11). B.3 ProofofFormulas(3.14)and(3.15) We recall the denition ofV k r;h and dene an operatorI k h for functionf :SA!R, V k r;h (s) = Q k h (s; ); k h (js) and I k h f (s) = f(s; ); k h (js) : We expand the model prediction error k r;h into, k r;h (s k h ;a k h ) = r h (s k h ;a k h ) + (P h V k r;h+1 )(s k h ;a k h )Q k r;h (s k h ;a k h ) = r h (s k h ;a k h ) + (P h V k r;h+1 )(s k h ;a k h )Q k r;h (s k h ;a k h ) + Q k r;h (s k h ;a k h )Q k r;h (s k h ;a k h ) = P h V k r;h+1 P h V k r;h+1 (s k h ;a k h ) + Q k r;h (s k h ;a k h )Q k r;h (s k h ;a k h ) where we use the Bellman equation Q k r;h (s k h ;a k h ) = r h (s k h ;a k h ) + (P h V k r;h+1 )(s k h ;a k h ) in the last equality. With the above formula, we expand the dierenceV k r;1 (s 1 )V k r;1 (s 1 ) into V k r;h (s k h )V k r;h (s k h ) = I k h (Q k r;h Q k r;h ) (s k h ) k r;h (s k h ;a k h ) + P h V k r;h+1 P h V k r;h+1 (s k h ;a k h ) + Q k r;h Q k r;h (s k h ;a k h ): Let D k r;h;1 := I k h (Q k r;h Q k r;h ) (s k h ) Q k r;h Q k r;h (s k h ;a k h ); D k r;h;2 := P h V k r;h+1 P h V k r;h+1 (s k h ;a k h ) V k r;h+1 V k r;h+1 (s k h+1 ): Therefore, we have the following recursive formula overh, V k r;h (s k h )V k r;h (s k h ) = D k r;h;1 + D k r;h;2 + V k r;h+1 V k r;h+1 (s k h+1 ) k r;h (s k h ;a k h ): 247 Notice thatV k r;H+1 =V k r;H+1 = 0. 
Summing the above equality overh2 [H] yields V k r;1 (s 1 ) V k r;1 (s 1 ) = H X h = 1 D k r;h;1 +D k r;h;2 H X h = 1 k r;h (s k h ;a k h ): (B.4) Following the denitions ofF k h;1 andF k h;2 , we knowD k r;h;1 2F k h;1 andD k r;h;2 2F k h;2 . Thus, for any (k;h)2 [K] [H], E D k r;h;1 jF k h1;2 = 0 and E D k r;h;2 jF k h;1 = 0: Notice thatt(k; 0; 2) =t(k 1;H; 2) = 2H(k 1). Clearly,F k 0;2 =F k1 H;2 for anyk 2. LetF 1 0;2 be empty. We dene a martingale sequence, M k r;h;m = k1 X = 1 H X i = 1 D r;i;1 +D r;i;2 + h1 X i = 1 D k r;i;1 +D k r;i;2 + m X ` = 1 D k r;h;` = X (;i;`)2 [K][H][2];t(;i;`)t(k;h;m) D r;i;` where t(k;h;m) := 2(k 1)H + 2(h 1) +m is the time index. Clearly, this martingale is adapted to the ltrationfF k h;m g (k;h;m)2[K][H][2] , and particularly, K X k = 1 H X h = 1 (D k r;h;1 +D k r;h;2 ) = M K r;H;2 : Finally, we combine the above martingale with (B.4) to obtain (3.14). It is similar for (3.15). B.3.1 ProofofFormula(3.16) We recall the denition of the feature map k r;h , k r;h (s;a) = Z S (s;a;s 0 )V k r;h+1 (s 0 )ds 0 for any (k;h)2 [K] [H] and (s;a)2SA. By Assumption 8, we have (P h V k r;h+1 ) (s;a) = Z S (s;a;s 0 ) > h V k r;h+1 (s 0 )ds 0 = k r;h (s;a) > h = k r;h (s;a) > ( k r;h ) 1 k r;h h = k r;h (s;a) > ( k r;h ) 1 k1 X = 1 r;h (s h ;a h ) r;h (s h ;a h ) > h + h ! = k r;h (s;a) > ( k r;h ) 1 k1 X = 1 r;h (s h ;a h ) (P h V r;h+1 ) (s h ;a h ) + h ! 248 where the second equality is due to the denition of k r;h , we exploit k r;h from line 4 of Algorithm 7 in the fourth equality, and we recursively replace r;h (s h ;a h ) > h by (P h V r;h+1 ) (s h ;a h ) for all 2 [k 1] in the last equality. We recall the update w k r;h = ( k r;h ) 1 P k1 = 1 r;h (s h ;a h )V r;h+1 (s h+1 ) from line 5 of Algo- rithm 7. Therefore, k r;h (s;a) > w k r;h (P h V k r;h+1 ) (s;a) = k r;h (s;a) > ( k r;h ) 1 k1 X = 1 r;h (s h ;a h ) V r;h+1 (s h+1 ) (P h V r;h+1 ) (s h ;a h ) + k r;h (s;a) > ( k r;h ) 1 h k r;h (s;a) > ( k r;h ) 1 k r;h (s;a) 1 2 k1 X = 1 r;h (s h ;a h ) V r;h+1 (s h+1 )(P h V r;h+1 ) (s h ;a h ) ( k r;h ) 1 + k r;h (s;a) > ( k r;h ) 1 k r;h (s;a) 1 2 k h k ( k r;h ) 1 for any (k;h)2 [K] [H] and (s;a)2SA, where we apply the Cauchy-Schwarz inequality twice in the inequality. By Lemma 51, set = 1, with probability 1p=2 it holds that k1 X = 1 r;h (s h ;a h ) V r;h+1 (s h+1 ) (P h V r;h+1 ) (s h ;a h ) ( k r;h ) 1 C s dH 2 log dT p : Also notice that k r;h I andk h k p d, thusk h k ( k r;h ) 1 p d. Thus, by taking an appropriate absolute constantC, we obtain that k r;h (s;a) > w k r;h (P h V k r;h+1 ) (s;a) C k r;h (s;a) > ( k r;h ) 1 k r;h (s;a) 1=2 s dH 2 log dT p for any (k;h)2 [K] [H] and (s;a)2SA under the event of Lemma 51. We now setC > 1 and = C p dH 2 log (dT=p). By the exploration bonus k r;h in line 7 of Algorithm 7, with probability 1p=2 it holds that k r;h (s;a) > w k r;h (P h V k r;h+1 ) (x;a) k r;h (s;a) (B.5) for any (k;h)2 [K] [H] and (s;a)2SA. 249 We note that reward/utility functions are xed over episodes,r h (s h ;a h ) :='(s h ;a h ) > r;h For the dierence'(s;a) > u k r;h r h (s;a), we have '(s;a) > u k r;h r h (s;a) = '(s;a) > u k r;h '(s;a) > r;h = '(s;a) > ( k h ) 1 k1 X = 1 '(s h ;a h )r h (s h ;a h ) k h r;h ! = '(s;a) > ( k h ) 1 k1 X = 1 '(s h ;a h ) r h (s h ;a h )'(s h ;a h ) > r;h + r;h ! = '(s;a) > ( k h ) 1 r;h '(s;a) > ( k h ) 1 '(s;a) 1=2 k r;h k ( k h ) 1 where we apply the Cauchy-Schwartz inequality in the inequality. Notice that k h I and k r;h k p d, thusk r;h k ( k h ) 1 p d. 
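As an aside, the value-backup weights w^k_{r,h} and the reward weights u^k_{r,h} manipulated in this argument share the same ridge-regression/elliptical-bonus pattern; a generic implementation sketch (names and shapes are hypothetical) is given below, and the bonus terms Gamma^k_h and Gamma^k_{r,h} above are of the returned form beta * sqrt(phi^T Lambda^{-1} phi).

```python
import numpy as np

def ridge_weights_and_bonus(Phi, targets, phi_query, lam=1.0, beta=1.0):
    """Lambda = lam * I + sum_tau phi_tau phi_tau^T, w = Lambda^{-1} sum_tau phi_tau y_tau,
    bonus = beta * sqrt(phi_query^T Lambda^{-1} phi_query).
    Phi: (n, d) past features; targets: (n,) regression targets; phi_query: (d,)."""
    n, d = Phi.shape
    Lam = lam * np.eye(d) + Phi.T @ Phi
    w = np.linalg.solve(Lam, Phi.T @ targets)
    bonus = beta * np.sqrt(phi_query @ np.linalg.solve(Lam, phi_query))
    return w, bonus
```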
Hence, if we set = 1 and = C p dH 2 log (dT=p), then any (k;h)2 [K] [H] and (s;a)2SA, '(s;a) > u k r;h r h (s;a) k h (s;a): (B.6) We recall the model prediction error k r;h :=r h +P h V k r;h+1 Q k r;h and the estimated state-action value functionQ k r;h in line 11 of Algorithm 7, Q k r;h (s;a) = min '(s;a) > u k r;h + k r;h (s;a) > w k r;h + ( k h + k r;h )(s;a); Hh + 1 + for any (k;h)2 [K] [H] and (s;a)2SA. By (B.5) and (B.6), we rst have k r;h (s;a) > w k r;h + k r;h (s;a) 0 and '(s;a) > u k r;h + k h (s;a) 0: Then, we can show that k r;h (s;a) = Q k r;h (s;a) (r h +P h V k r;h+1 )(s;a) '(s;a) > u k r;h + k r;h (s;a) > w k r;h + ( k h + k r;h )(s;a) (r k h +P h V k r;h+1 )(s;a) ('(s;a) > u k r;h r h (s;a)) + k h (s;a) + 2 k r;h (s;a) (B.7) for any (k;h)2 [K] [H] and (s;a)2SA. Therefore, (B.7) reduces to k r;h (s;a) 2 k h (s;a) + 2 k r;h (s;a) = 2( k h + k r;h )(s;a): 250 On the other hand, notice that (r k h +P h V k r;h+1 )(s;a)Hh + 1, thus k r;h (s;a) = (r h +P h V k r;h+1 )(s;a) Q k r;h (s;a) (r h +P h V k r;h+1 )(s;a) min '(s;a) > u k r;h + k r;h (s;a) > w k r;h + ( k h + k r;h )(s;a); Hh + 1 + max r h (s;a)'(s;a) > u k r;h k h (s;a)+(P h V k r;h+1 )(s;a) k r;h (s;a) > w k r;h k r;h (s;a); 0 + 0 for any (k;h)2 [K] [H] and (s;a)2SA. Therefore, we have proved that with probability 1p=2 it holds that 2( k h + k r;h )(s;a) k r;h (s;a) 0 for any (k;h)2 [K] [H] and (s;a)2SA. Similarly, we can show another inequality2( k h + k g;h )(s;a) k g;h (s;a) 0. B.3.2 ProofofFormula(B.1) LetV =fV :S! [0;H]g be a set of bounded function onS. Fo anyV 2V, we consider the dierence between P x 0 2S ^ P k h (s 0 j;)V (s 0 ) and P x 0 2S P h (s 0 j;)V (s 0 ) as follows, n k h (s;a) + 1=2 X x 0 2S ^ P k h (s 0 js;a)V (s 0 )P h (s 0 js;a)V (s 0 ) = n k h (s;a) + 1=2 X x 0 2S n k h (s;a;s 0 )V (s 0 ) (n k h (s;a) +)(P h V )(s;a) n k h (s;a) + 1=2 X x 0 2S n k h (s;a;s 0 )V (s 0 )n k h (s;a)(P h V )(s;a) + n k h (s;a) + 1=2 j(P h V )(s;a)j = n k h (s;a) + 1=2 k1 X = 1 1f(s;a) = (s h ;a h )g V (x h+1 ) (P h V )(s;a) + n k h (s;a) + 1=2 j(P h V )(s;a)j (B.8) for any (k;h)2 [K] [H] and (s;a)2SA, where we apply the triangle inequality for the inequality. 251 Let h :=V (x h+1 )(P h V )(s h ;a h ). Conditioning on the ltrationF k h;1 , h is a zero-mean and H=2-subGaussian random variable. By Lemma 50, we useY =I andX =1f(s;a) = (s h ;a h )g and thus with probability at least 1 it holds that n k h (s;a) + 1=2 k1 X = 1 1f(s;a) = (s h ;a h )g V (x h+1 ) (P h V )(s;a) v u u t H 2 2 log n k h (s;a) + 1=2 1=2 =H ! s H 2 2 log T for any (k;h)2 [K] [H]. Also, since 0VH, we have n k h (s;a) + 1=2 j(P h V )(s;a)j p H: By returning to (B.8) and setting = 1, with probability at least 1 it holds that n k h (s;a) + X x 0 2S ^ P k h (s 0 js;a)V (s 0 )P h (s 0 js;a)V (s 0 ) 2 H 2 log T + 2 (B.9) for anyk 1. Letd(V;V 0 ) = max s2S jV (s)V 0 (s)j be a distance onV. For any, an-coveringV ofV with respect to distanced(;) satises jV j 1 + 2 p jSjH ! jSj : Thus, for anyV2V, there existsV 0 2V such that max s2S jV (s)V 0 (s)j. By the triangle inequality, we have n k h (s;a) + 1=2 X s 0 2S ^ P k h (s 0 js;a)V (s 0 )P h (s 0 js;a)V (s 0 ) = n k h (s;a) + 1=2 X s 0 2S ^ P k h (s 0 js;a)V 0 (s 0 )P h (s 0 js;a)V 0 (s 0 ) + n k h (s;a) + 1=2 X s 0 2S ^ P k h (s 0 js;a)(V (s 0 )V 0 (s 0 ))P h (s 0 js;a)(V (s 0 )V 0 (s 0 )) n k h (s;a) + 1=2 X s 0 2S ^ P k h (s 0 js;a)V 0 (s 0 )P h (s 0 js;a)V 0 (s 0 ) + 2 n k h (s;a) + 1=2 : 252 Furthermore, we choose = (p=3)= (jV jjSjjAj) and take a union bound over V 2 V and (s;a)2SA. 
By (B.9), with probability at least 1p=2 it holds that sup V2V ( n k h (s;a) + 1=2 X s 0 2S ^ P k h (s 0 js;a)V (s 0 )P h (s 0 js;a)V (s 0 ) ) s H 2 log T + 2 + 2 n k h (s;a) + 1=2 H K s 2H 2 logjV j + log 2jSjjAjT p + 2 + 2 n k h (s;a) + 1=2 H K C 1 H s jSj log jSjjAjT p := for all (k;h) and (s;a), whereC 1 is an absolute constant. We recall our choice of k h and. Hence, with probability at least 1p=2 it holds that X s 0 2S ^ P k h (s 0 js;a)V (s 0 )P h (s 0 js;a)V (s 0 ) n k h (s;a) + 1=2 := k h (s;a) for any (k;h)2 [K] [H] and (s;a)2jSjjAj, where :=C 1 H p jSj log(jSjjAjT=p). We recall the denitionr h (s;a) = e > (s;a) r;h . By our estimation ^ r k h (s;a) in Algorithm 8, we have ^ r k h (s;a) = 1 n k h (s;a) + k1 X = 1 1f(s;a) = (s h ;a h )g[ r;h ] (s h ;a h ) and thus ^ r k h (s;a)r h (s;a) = ^ r k h (s;a) [ r;h ] (s;a) = n k h (s;a) + 1 k1 X = 1 1f(s;a) = (s h ;a h )g [ r;h ] (s h ;a h ) [ r;h ] (s;a) [ r;h ] (s;a) = n k h (s;a) + 1 [ r;h ] (s;a) n k h (s;a) + 1 n k h (s;a) + 1=2 k h (s;a) where we utilize = 1 and 1 in the inequalities. We now are ready to check the model prediction error k r;h dened by (3.8), k r;h (s;a) = Q k r;h (s;a) (r h +P h V k r;h+1 )(s;a) ^ r k h (s;a) + P s 0 2S ^ P k h (s 0 js;a)V k r;h+1 (s 0 ) + 2 k h (s;a) (r h +P h V k r;h+1 )(s;a) 4 k h (s;a) 253 for any (s;a)2SA. On the other hand, notice that (r h +P h V k r;h+1 )(s;a)Hh + 1, thus k r;h (s;a) = (r h +P h V k r;h+1 )(s;a) Q k r;h (s;a) (r h +P h V k r;h+1 )(s;a)min ^ r k h (s;a)+ X s 0 2S ^ P k h (s 0 js;a)V k r;h+1 (s 0 ) + 2 k h (s;a); Hh + 1 ! + max (r h ^ r h )(s;a) k h (s;a)+(P h V k r;h+1 )(s;a) X s 0 2S ^ P k h (s 0 js;a)V k r;h+1 (s 0 ) k h (s;a); 0 ! + 0 for any (k;h)2 [K] [H] and (s;a)2SA. Hence, we complete the proof of (B.1). B.4 Supportinglemmasfromoptimization We rephrase some optimization results from the literature for our constrained problem (3.2), maximize 2 (AjS;H) V r;1 (s 1 ) subject to V g;1 (s 1 ) b in which we maximize over all policies andb2 (0;H]. Let the optimal solution be ? such that V ? r;1 (s 1 ) = maximize 2 (AjS;H) fV r;1 (s 1 )jV g;1 (s 1 ) bg: Let the Lagrangian be V ;Y L (s 1 ) := V r;1 (s 1 ) +Y (V g;1 (s 1 )b), where Y 0 is the Lagrange multiplier or dual variable. The associated dual function is dened as V Y D (s 1 ) := maximize 2 (AjS;H) V ;Y L (s 1 ) :=V r;1 (s 1 ) + Y (V g;1 (s 1 )b) and the optimal dual isY ? := argmin Y 0 V Y D (s 1 ), V Y ? D (s 1 ) := minimize 0 V Y D (s 1 ): We recall that the problem (3.2) enjoys strong duality under the strict feasibility condition (also called Slater condition). The proof follows [180, Proposition 1] in the nite-horizon setting. Assumption17(Slatercondition) There exists > 0 and such thatV g;1 (s 1 )b . Lemma46(Strongduality) [180, Proposition 1] If the Slater condition holds, then the strong du- ality holds, V ? r;1 (s 1 ) = V Y ? D (s 1 ): Strong duality implies that the optimal solution to the dual problem: minimize Y 0 V Y D (s 1 ) is obtained atY ? . Denote the set of all optimal dual variables as ? . Under the Slater condition, a useful property of the dual variable is that the sublevel sets are bounded [27, Section 8.5]. 254 Lemma47(Boundednessofsublevelsetsofthedualfunction) Let Slater condition hold. FixC2R. For anyY2fY 0jV Y D (s 1 )Cg, it holds that Y 1 CV r;1 (s 1 ) : Proof. ByY2fY 0jV Y D (s 1 )Cg, C V Y D (s 1 ) V r;1 (s 1 ) +Y (V g;1 (s 1 )b) V r () +Y where we use the Slater point in the last inequality. We complete the proof by noting > 0. Corollary48(BoundednessofY ? ) If we take C = V ? 
r;1 (s 1 ) = V Y ? D (s 1 ), then ? = fY 0jV Y D (s 1 )Cg. Thus, for anyY2 ? , Y 1 V ? r;1 (s 1 )V r;1 (s 1 ) : Another useful theorem from the optimization [27, Section 3.5] is given as follows. It describes that the constraint violationbV g;1 (s 1 ) can be bounded similarly even if we have some weak bound. We next state and prove it for our problem, which is used in our constraint violation analysis. Lemma49(Constraintviolation) Let the Slater condition hold andY ? 2 ? . LetC ? 2Y ? . Assume that2 (AjS;H) satises V ? r;1 (s 1 ) V r;1 (s 1 ) + C ? bV g;1 (s 1 ) + : Then, bV g;1 (s 1 ) + 2 C ? where [x] + = max(x; 0). Proof. Let v() = maximize 2 (AjS;H) fV r;1 (s 1 )jV g;1 (s 1 ) b +g: By denition,v(0) = V ? r;1 (s 1 ). It has been shown as a special case of [180, Proposition 1] that v() is concave. First, we show thatY ? 2@v(0). By the Lagrangian and the strong duality, V ;Y ? L (s 1 ) maximize 2 (AjS;H) V ;Y ? L (s 1 ) = V Y ? D (s 1 ) = V ? r;1 (s 1 ) = v(0) for all2 (AjS;H). For any2f2 (AjS;H)jV g;1 (s 1 )b +g, v(0)Y ? L(;Y ? )Y ? = V r;1 (s 1 ) +Y ? (V g;1 (s 1 )b)Y ? = V r;1 (s 1 ) +Y ? (V g;1 (s 1 )b) V r;1 (s 1 ): 255 If we maximize the right-hand side of above inequality over2f2 (AjS;H)jV g;1 (s 1 ) b +g, then v(0)Y ? v() which show thatY ? 2@v(0). On the other hand, if we take = :=(bV g;1 (s 1 )) + , then V r;1 (s 1 ) v( ) and V ? r;1 (s 1 ) = v(0) v( ): Combing the above two yields V r;1 (s 1 )V ? r;1 (s 1 ) Y ? : Thus, (C ? Y ? )j j = Y ? j j +C ? j j = Y ? +C ? j j V ? r;1 (s 1 )V r;1 (s 1 ) +C ? j j: By our assumption andj j = bV g () + , bV g;1 (s 1 ) + C ? Y ? 2 C ? : B.5 Othersupportinglemmas First, we state a useful concentration inequality for the standard self-normalized processes. Lemma50(Concentrationofself-normalizedprocesses) LetfF t g 1 t = 0 be a ltration and f t g 1 t = 1 be aR-valued stochastic process such that t isF t -measurable for anyt 0. Assume that for anyt 0, conditioning onF t , t is a zero-mean and-subGaussian random variable with the variance proxy 2 > 0, i.e.,E e t jF t e 2 2 =2 for any2 R. LetfX t g 1 t = 1 be anR d -valued stochastic process such thatX t isF t -measurable for anyt 0. LetY 2 R dd be a deterministic and positive-denite matrix. For anyt 0, we dene Y t := Y + t X = 1 X X > and S t = t X = 1 X : Then, for any xed2 (0; 1), it holds with probability at least 1 that kS t k 2 ( Yt) 1 2 2 log det Y t 1=2 det (Y ) 1=2 ! for anyt 0. Proof. See the proof of Theorem 1 in [3]. The above concentration inequality can be customized to our setting in the following form without using covering number arguments as in [110]. 256 Lemma51 Let = 1 in Algorithm 7. Fix2 (0; 1). Then, for any (k;h)2 [K] [H] it holds for =r org that k1 X = 1 ;h (s h ;a h ) > V k ;h+1 (s h+1 ) (P h V k ;h+1 )(s h ;a h ) ( k ;h ) 1 C s dH 2 log dT with probability at least 1=2 whereC > 0 is an absolute constant. Proof. See the proof of Lemma D.1 in [47]. Lemma52(Ellipticalpotentiallemma) Letf t g 1 t=1 beasequenceoffunctionsinR d and 0 2 R dd beapositivedenitematrix. Let t = 0 + P t1 i = 1 i > i . Assumek t k 2 1and min ( 0 ) 1. Then for anyt 1 it holds that log det ( t+1 ) det ( 1 ) t X i = 1 > i 1 i i 2 log det ( t+1 ) det ( 1 ) : Proof. See the proof of Lemma D.2 in [110] or [47]. Lemma53(PushbackpropertyofKL-divergence) Letf: !Rbeaconcavefunctionwhere is a probability simplex in R d . Let o be the interior of . Let x ? = argmax x2 f(x) 1 D KL (x;y) for a xedy2 o and> 0. Then, for anyz2 , f(x ? ) 1 D KL (x ? ;y) f(z) 1 D KL (z;y) + 1 D KL (z;x ? ): Proof. 
See the proof of Lemma 14 in [242]. Lemma54(BoundedKL-divergencedierence) Let 1 ; 2 be two probability distributions in (A). Let ~ 2 = (1) 2 +1=jAj where2 (0; 1]. Then, D KL ( 1 j ~ 2 ) D KL ( 1 j 2 ) logjAj: Moreover, we have a uniform bound,D KL ( 1 j ~ 2 ) log(jAj=). Proof. See the proof of Lemma 31 in [242]. 257 AppendixC SupportingproofsinChapter4 C.1 ProofofLemma25 We begin with the triangle inequality,kx j;k (t) x k (t)kkx j;k (t)x k (t)k+kx k (t) x k (t)k where x k (t) =P X ( x 0 k (t)) and x k (t) = 1 N P N j = 1 x j;k (t)). First, the non-expansiveness of projectionP X implies that kx j;k (t)x k (t)k kx 0 j;k (t) x 0 k (t)k (C.1) wherex j;k (t) = P X (x 0 j;k (t)). Second, by the convexity of normkk and non-expansiveness of projectionP X , kx k (t) x k (t)k 1 N N X j = 1 kP X ( x 0 k (t))P X (x 0 j;k (t))k 1 N N X j = 1 kx 0 j;k (t) x 0 k (t)k: (C.2) Next, we focus onx 0 j;k (t) and x 0 k (t). Let [W s ] j be thejth row ofW s and [W s ] ji be the (j;i)th element ofW s . For anyt 2, the primal updatex 0 j;k (t) of Algorithm 9 can be expressed as x 0 j;k (t) = N X i = 1 [W t1 ] ij x 0 i;k (1) k t1 X s = 2 N X i = 1 [W ts+1 ] ij G i;x (z i;k (s 1); k;s1 ) k G j;x (z j;k (t 1); k;t1 ) (C.3) andx 0 j;k (2) = P N i=1 [W ] ij x 0 i;k (1) k G j;x (z j;k (1); k;1 ). Similar to the argument of [82, Eqs. (26) and (27)], we utilize the gradient boundedness to bound the dierence of (4.14) and (C.3) by x 0 j;k (t) x 0 k (t) N X i = 1 1 N [W t1 ] ij kx 0 i;k (1)k + k c t1 X s = 2 1 N 1 [W ts+1 ] j 1 + 2 k c: (C.4) 258 Application of the Markov chain property of mixing matrix [82] on the second sum in (C.4) yields, t1 X s = 2 1 N 1 [W ts+1 ] j 1 2 log( p NT k ) 1 2 (W ) : (C.5) For x 0 i;k (1) = 1 T k1 P T k1 t=1 x i;k1 (t) wherex i;k1 (t) = P X (x 0 i;k1 (t)) and 02X , we utilize the non-expansiveness of projection to bound it as kx 0 i;k (1)k 1 T k1 T k1 X t = 1 x 0 i;k1 (t) : Using (C.3) at roundk 1, we utilize the property of doubly stochasticW to have kx 0 i;k1 (t)k N X j = 1 [W t1 ] ji kx 0 j;k1 (1)k + 2 k1 T k1 c: Repeating this inequality fork 2;k 3;:::; 1 yields, kx 0 i;k (1)k 2 k1 X l = 1 l T l c: (C.6) where we usex 0 j;1 (1) = 0 for allj2V. Now, we are ready to show the desired result. Notice thatkx 0 j;k (1) x 0 k (1)kkx 0 j;k (1)k + 1 N P N j = 1 kx 0 j;k (1)k. We collect (C.5) and (C.6) for (C.4), and average it overt = 1;:::;T k to obtain, 1 T k T k X t = 1 kx 0 j;k (t) x 0 k (t)k = 1 T k T k X t = 2 kx 0 j;k (t) x 0 k (t)k + 1 T k kx 0 j;k (1) x 0 k (1)k 2 k c log( p NT k ) 1 2 (W ) + 4 T k k1 X l = 1 l T l c + 2 k c + 2c T k T k X t = 2 N X i = 1 1 N [W t1 ] ji k1 X l = 1 l T l : Bounding the sum P T k t = 2 P N i = 1 1 N [W t1 ] ji by (C.5) and application of (C.1) and (C.2) com- plete the proof. C.2 Martingaleconcentrationbound We state a useful result about martingale sequence. 259 Lemma55 LetfX(t)g T t=1 be a martingale dierence sequence inR d , and letkX(t)kM. Then, E 2 4 1 T T X t = 1 X(t) 2 3 5 4M 2 T : (C.7) Proof. We recall the classic concentration result in [183]: for any 0, we have P 0 @ 1 T T X t = 1 X(t) 2 4M 2 T 1 A e : The left-hand side of (C.7) can be expanded into Z 1 0 P 0 @ 1 T T X t = 1 X(t) 2 s 1 A ds = 4M 2 T Z 1 0 P 0 @ 1 T T X t=1 X(t) 2 4M 2 T 1 A d 4M 2 T Z 1 0 e d 4M 2 T : C.3 ProofofLemma27 It is clear that err(^ x i;k ) 0 from the optimality ofx ? in (4.10). The optimality ofy ? j yields, 1 N N X j = 1 j (x ? ;y ? j ) 1 N N X j = 1 j (x ? ; ^ y j;k ): Thus, using (4.15) and (4.16), we have err(^ x i;k ) err 0 (^ x i;k ; ^ y k ). 
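To make the two inequalities concrete, here is a toy numerical instance. The quadratic psi_j below and all constants are hypothetical, and err and err' are written out in the min-max form in which they are manipulated in Appendices C.3-C.4 (the displays (4.15)-(4.16) themselves are not reproduced here): err(x_hat) = (1/N) sum_j [f_j(x_hat) - f_j(x*)] with f_j(x) = max_y psi_j(x, y), and err'(x_hat, y_hat) = err(x_hat) + (1/N) sum_j [psi_j(x*, y*_j) - psi_j(x*, y_hat_j)].

```python
import numpy as np

rng = np.random.default_rng(0)
N = 5
c = rng.uniform(0.5, 2.0, size=N)            # hypothetical curvatures

def psi(j, x, y):                            # convex in x, strongly concave in y
    return c[j] * x ** 2 + x * y - 0.5 * y ** 2

def f(j, x):                                 # f_j(x) = max_y psi_j(x, y), attained at y = x
    return psi(j, x, x)

x_star = 0.0                                 # minimizer of (1/N) sum_j f_j
y_star = np.zeros(N)                         # y*_j = argmax_y psi_j(x_star, y)
x_hat = 0.3                                  # an approximate primal iterate
y_hat = rng.uniform(-1.0, 1.0, size=N)       # approximate local maximizers

err = np.mean([f(j, x_hat) - f(j, x_star) for j in range(N)])
err_prime = np.mean([f(j, x_hat) - psi(j, x_star, y_hat[j]) for j in range(N)])

assert 0.0 <= err <= err_prime               # the two facts just shown
```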
C.4 ProofofLemma28 To show (4.17a), we apply the strong convexity off j (x), err(^ x i;k ) 1 N N X j = 1 hrf j (x ? ); ^ x i;k x ? i + L x 2 k^ x i;k x ? k 2 L x 2 k^ x i;k x ? k 2 where the second inequality is due to the optimality ofx ? . 260 To show (4.17b), we subtract and add 1 N P N j=1 j (x ? ;y ? j ) into (4.16) and apply the strong con- cavity of j (x;y j ) iny j , err 0 (^ x i;k ; ^ y k ) = err(^ x i;k ) + 1 N N X j = 1 ( j (x ? ;y ? j ) j (x ? ; ^ y j;k )) 1 N N X j = 1 ( j (x ? ;y ? j ) j (x ? ; ^ y j;k )) L y 2N N X j = 1 ky ? j ^ y j;k k 2 where err(^ x i;k ) 0 is omitted in the rst inequality, and the second inequality follows the strong concavity assumption of j (x ? ;y j ) iny j and the optimality ofy ? j = argmax y j 2Y j (x ? ;y j ). To show (4.17c), because of the optimality ofx ? andy ? j , we rst have j (^ x i;k ;y ? j ) j (x ? ;y ? j ) 0, and j (x ? ;y ? j ) j (x ? ; ^ y j;k ) 0. This further shows that j (^ x i;k ; ^ y ? j;k ) j (x ? ; ^ y j;k ) = ( j (^ x i;k ; ^ y ? j;k ) j (^ x i;k ;y ? j )) + ( j (^ x i;k ;y ? j ) j (x ? ;y ? j )) + ( j (x ? ;y ? j ) j (x ? ; ^ y j;k )) j (^ x i;k ; ^ y ? j;k ) j (^ x i;k ;y ? j ): Hence, we have err 0 (^ x i;k ; ^ y k ) 1 N N X j = 1 ( j (^ x i;k ; ^ y ? j;k ) j (^ x i;k ;y ? j ) ) L y 2N N X j = 1 ky ? j ^ y ? j;k k 2 where the second inequality follows from the strong concavity assumption of j (^ x i;k ;y j ) iny j and the optimality of ^ y ? j;k = argmax y j 2Y j (^ x ? i;k ;y j ). 261 AppendixD SupportingproofsinChapter5 D.1 ProofofLemma34 We prove (5.5) by induction on the number of playersN. In the basic step:N = 2, the right-hand side of (5.5) becomes 0 1 ; 2 1 ; 2 + 1 ; 0 2 1 ; 2 + 0 1 ; 0 2 1 ; 0 2 0 1 ; 2 + 1 ; 2 which equals to the left-hand side: 0 1 ; 0 2 1 ; 2 . Assume the equality (5.5) holds forN players. We next consider the induction step forN + 1 players . By subtracting and adding N ; 0 N+1 , 0 = 0 N ; 0 N+1 N ; 0 N+1 | {z } Di N + N ; 0 N+1 N ; N+1 | {z } Di N+1 : (D.1) In (D.1), we use the shorthand 0 N and N forf 0 k g N k = 1 andf k g N k = 1 , respectively. We note thatDi N orDi N+1 can be viewed as a function forN players if we x the (N + 1)th policy. For the rst termDi N , Di N = N X i = 1 0 i ; j ; 0 i ; 0 j ; 0 N+1 j ; i ; 0 j ; 0 N+1 j ; 0 i ; j ; 0 N+1 + j ; i ; j ; 0 N+1 262 the induction assumption implies that Di N = N X i = 1 0 i ; j ; 0 i ; 0 j ; 0 N+1 j ; i ; 0 j ; 0 N+1 j ; 0 i ; j ; 0 N+1 + j ; i ; j ; 0 N+1 where we use 0 >j to representf 0 k g N k =j+1 . AddingDi N+1 to the last equivalent expression ofDi N above yields Di N +Di N+1 = N+1 X i = 1 0 i ; i + N X i = 1 N+1 X j =N+1 j ; 0 i ; 0 j j ; i ; 0 j j ; 0 i ; j + j ; i ; j + N X i = 1 N X j =i+1 j ; 0 i ; 0 j ; 0 N+1 j ; i ; 0 j ; 0 N+1 j ; 0 i ; j ; 0 N+1 + j ; i ; j ; 0 N+1 = N+1 X i = 1 0 i ; i + N+1 X i = 1 N+1 X j =i+1 j ; 0 i ; 0 j j ; i ; 0 j j ; 0 i ; j + j ; i ; j : where the rst equality has a slight abuse of the notation: 0 >j representsf 0 k g N+1 k =j+1 in the rst double sum and 0 >j representsf 0 k g N k =j+1 in the second double sum. Therefore, (5.5) holds for N + 1 players. The proof is completed by induction. D.2 ProofofLemma35 We note that Q ~ ij ; i ; 0 j i (s;) and Q ~ ij ; i ; j i (s;) are averaged action value functions for player i using policy i , but they have dierent underlying averaged MDPs because of dierent policies executed by player j. Hence, we can directly apply Lemma 69. 
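For intuition, the "averaged" MDP invoked here is obtained by marginalizing the game over the other players' policies; a minimal two-player sketch (tensor shapes and names are hypothetical) is:

```python
import numpy as np

def averaged_mdp_for_player(r, P, policies, i):
    """Two-player illustration of bar{r}(s, a_i) = sum_{a_j} pi_j(a_j | s) r_i(s, a_i, a_j)
    and the analogous bar{p}(s' | s, a_i).
    r: (S, A1, A2) reward of player i, P: (S, A1, A2, S), policies = [pi_1, pi_2]."""
    j = 1 - i
    pi_j = policies[j]                                 # (S, A_j)
    if i == 0:
        r_bar = np.einsum('sab,sb->sa', r, pi_j)       # average out player 2's action
        P_bar = np.einsum('sabt,sb->sat', P, pi_j)
    else:
        r_bar = np.einsum('sab,sa->sb', r, pi_j)       # average out player 1's action
        P_bar = np.einsum('sabt,sa->sbt', P, pi_j)
    return r_bar, P_bar
```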
Specically, let (r;p) be the 263 averaged reward and transition functions for playeri induced by (~ ij ; j ), and (~ r; ~ p) be those induced by (~ ij ; 0 j ). Then, jr(s;a i ) ~ r(s;a i )j = X a j ;a ij r(s;a i ;a j ;a ij ) j (a j js) 0 j (a j js) ~ ij (a ij js) k j (js) 0 j (js)k 1 and kp(js;a i ) ~ p(js;a i )k 1 = X s 0 X a j ;a ij p(s 0 js;a i ;a j ;a ij ) j (a j js) 0 j (a j js) ~ ij (a ij js) X s 0 X a j ;a ij p(s 0 js;a i ;a j ;a ij )~ ij (a ij js) j (a j js) 0 j (a j js) k j (js) 0 j (js)k 1 : Application of two inequalities above to Lemma 69 competes the proof. D.3 ProofofLemma36 By the denition, for a xed states ] , d (s ] ) = (1 )E " 1 X t = 0 t 1 fst =s ] g s 0 ; # : By taking reward functionr(s;a) = (1 )1 fs =s ] g , we can viewd (s ] ) as a value function under the policy and the initial distribution. With a slight abuse of notation, we denote such a value function by V (;s ] ) = d (s ] ). Similarly, we can dene V (s;s ] ) and Q (s;a;s ] ), using the same reward function. By the performance dierence lemma (a single-player version of the performance dierence in Remark 11), d (s ] )d 0 (s ] ) = V (;s ] )V 0 (;s ] ) = X s;a d (s) ((ajs) 0 (ajs))Q 0 (s;a;s ] ): Therefore, X s ] d (s ] )d 0 (s ] ) X s ] X s;a d (s)j(ajs) 0 (ajs)jQ 0 (s;a;s ] ): (D.2) 264 We also note thatQ 0 (;;s ] ) is the action value function associated with the reward function r(s;a) = (1 )1 fs =s ] g . Thus, X s ] Q 0 (s;a;s ] ) = X s ] E " (1 ) 1 X t = 0 t 1 fst =s ] g (s 0 ;a 0 ) = (s;a); 0 # = E " (1 ) 1 X t = 0 t (s 0 ;a 0 ) = (s;a); 0 # = 1: Therefore, we can arrange (D.2) as follows, X s ] d (s ] )d 0 (s ] ) X s;a d (s)j(ajs) 0 (ajs)j = X s d (s)k(js) 0 (js)k 1 max s k(js) 0 (js)k 1 : D.4 ProofofLemma37 We consider a two-player common-reward Markov game with state spaceS and action setsA 1 , A 2 . Letr :SA 1 A 2 ! [0; 1] be the reward function, andp :SA 1 A 2 ! (S) be the transition function. Let 1 = ((A 1 )) jSj and 2 = ((A 2 )) jSj be player 1 and player 2’s policy sets, respectively. For anyx;x 0 2 1 , we dene the following non-stationary policies: x i : a Player 1’s policy where in steps from 0 toi 1,x 0 is executed; in steps fromi to1,x is executed. With this denition, x 0 =x and x 1 =x 0 . We dene y i similarly. Since x i is non-stationary, we specify its action distribution as x i (js;h) whereh is the step index. The joint value function for these non-stationary policies can be dened as usual: V x i ; y j () := E " 1 X t = 0 t r(s t ;a t ;b t ) s 0 ; a t x i (js t ;t); b t y j (js t ;t) # : For simplicity, we omit the initial distribution in writing the value function. We rst show that for anyH2N, V x 0 ; y 0 V x H ; y 0 V x 0 ; y H + V x H ; y H = H1 X i = 0 H1 X j = 0 (V x i ; y j V x i+1 ; y j V x i ; y j+1 + V x i+1 ; y j+1 ): 265 In fact, the right-hand side above is equal to H1 X j = 0 H1 X i = 0 (V x i ; y j V x i+1 ; y j ) + H1 X j = 0 H1 X i = 0 (V x i ; y j+1 +V x i+1 ; y j+1 ) = H1 X j = 0 (V x 0 ; y j V x H ; y j ) + H1 X j = 0 (V x 0 ; y j+1 +V x H ; y j+1 ) = H1 X j = 0 (V x 0 ; y j V x 0 ; y j+1 ) + H1 X j = 0 (V x H ; y j +V x H ; y j+1 ) = V x 0 ; y 0 V x 0 ; y H V x H ; y 0 + V x H ; y H : SendingH to innity and recalling that x 0 =x, x 1 =x 0 , y 0 =y, y 1 =y 0 lead to V x;y V x 0 ;y V x;y 0 + V x 0 ;y 0 = 1 X i = 0 1 X j = 0 (V x i ; y j V x i+1 ; y j V x i ; y j+1 + V x i+1 ; y j+1 ): We next focus on the particular summand above with index (i;j) and discuss three cases. Case1:i<j. We rst re-writeV x i ; y j V x i+1 ; y j . 
Notice that the value dierence between the policy pairs ( x i ; y j ) and ( x i+1 ; y j ) starts at stepi, since both policy pairs are equal to (x 0 ;y 0 ) from step 0 to stepi 1. At theith step, x i changes tox while x i+1 remains asx 0 . Therefore, V x i ; y j V x i+1 ; y j = 1 1 X s;a;b d x 0 ;y 0 (s;i)x(ajs)y 0 (bjs) (r(s;a;b) + (s;a;b;i;j)) 1 1 X s;a;b d x 0 ;y 0 (s;i)x 0 (ajs)y 0 (bjs) (r(s;a;b) + (s;a;b;i;j)) = 1 1 X s;a;b d x 0 ;y 0 (s;i) x(ajs)x 0 (ajs) y 0 (bjs) r(s;a;b) + (s;a;b;i;j) ! where we dene (also note thatd x;y (s) = P 1 i = 0 d x;y (s;i)) d x;y (s;i) := (1 )E i 1[s i =s]js 0 (s;a;b;i;j) := E " 1 X t =i+1 ti r(s t ;a t ;b t ) s i+1 p(js;a;b); x i ; y j # : 266 Similarly, V x i ; y j+1 V x i+1 ; y j+1 = 1 1 X s;a;b d x 0 ;y 0 (s;i) x(ajs)x 0 (ajs) y 0 (bjs) r(s;a;b) + (s;a;b;i;j + 1) ! : We notice that the dierence (s;a;b;i;j) (s;a;b;i;j + 1) is equivalent to 1 X ~ s;~ a; ~ b d x;y 0 p(js;a;b) (~ s;ji 1)x(~ aj ~ s) y( ~ bj ~ s)y 0 ( ~ bj ~ s) Q x;y (~ s; ~ a; ~ b): Hence, V x i ; y j V x i+1 ; y j V x i ; y j+1 +V x i+1 ; y j+1 = (1 ) 2 X s;a;b X ~ s;~ a; ~ b d x 0 ;y 0 (s;i)d x;y 0 p(js;a;b) (~ s;ji 1) x(ajs)x 0 (ajs) y 0 (bjs)x(~ aj ~ s) y( ~ bj ~ s)y 0 ( ~ bj ~ s) Q x;y (~ s; ~ a; ~ b) 2(1 ) 3 X s;a;b X ~ s;~ a; ~ b d x 0 ;y 0 (s;i)d x;y 0 p(js;a;b) (~ s;ji 1)y 0 (bjs)x(~ aj ~ s) x(ajs)x 0 (ajs) 2 + 2(1 ) 3 X s;a;b X ~ s;~ a; ~ b d x 0 ;y 0 (s;i)d x;y 0 p(js;a;b) (~ s;ji 1)y 0 (bjs)x(~ aj ~ s) y( ~ bj ~ s)y 0 ( ~ bj ~ s) 2 (boundjQ x;y (;;)j by 1 1 and use AM-GM) = A 2(1 ) 3 X s;a X ~ s d x 0 ;y 0 (s;i)d x;y 0 p(js;a;y 0 ) (~ s;ji 1) x(ajs)x 0 (ajs) 2 (denep(js;a;y) = P b p(js;a;b)y(bjs)) + A 2(1 ) 3 X s X ~ s; ~ b d x 0 ;y 0 (s;i)d x;y 0 p(js;Unif A ;y 0 ) (~ s;ji 1) y( ~ bj ~ s)y 0 ( ~ bj ~ s) 2 : (uniform distribution Unif A = 1 A 1) 267 Summing the inequality above overi<j yields 1 X i = 0 1 X j =i+1 (V x i ; y j V x i+1 ; y j V x i ; y j+1 +V x i+1 ; y j+1 ) A 2(1 ) 3 1 X i = 0 X s;a d x 0 ;y 0 (s;i) x(ajs)x 0 (ajs) 2 X ~ s 1 X j =i+1 d x;y 0 p(js;a;y 0 ) (~ s;ji 1) ! + A 2(1 ) 3 1 X i = 0 X ~ s; ~ b X s d x 0 ;y 0 (s;i) y( ~ bj ~ s)y 0 ( ~ bj ~ s) 2 1 X j =i+1 d x;y 0 p(js;unif;y 0 ) (~ s;ji 1) ! = A 2(1 ) 3 X s;a d x 0 ;y 0 (s) x(ajs)x 0 (ajs) 2 + A 2(1 ) 3 X ~ s; ~ b X s d x 0 ;y 0 (s)d x;y 0 p(js;Unif A ;y 0 ) (~ s) y( ~ bj ~ s)y 0 ( ~ bj ~ s) 2 (use the property P 1 i = 0 d x;y (s;i) =d x;y (s)) = A 2(1 ) 3 X s d x 0 ;y 0 (s)kx(js)x 0 (js)k 2 + A 2(1 ) 3 X s d x;y 0 0 (s)ky(js)y 0 (js)k 2 where 0 is a state distribution that generates the state by the following procedure: rst sample a states 0 according tod x 0 ;y 0 (), then execute (Unif A ;y 0 ) = ( 1 A 1;y 0 ) for one step, and then output the next state. By Lemma 70 (with = (x 0 ;y 0 ), 0 = (x;y 0 ), and = (Unif A ;y 0 )), we have d x;y 0 0 (s) d x 0 ;y 0 (s) d x;y 0 0 (s) (1 ) 2 (1 ) . Therefore, 1 X i = 0 1 X j =i+1 (V x i ; y j V x i+1 ; y j V x i ; y j+1 +V x i+1 ; y j+1 ) 2 A 2(1 ) 4 X s d x 0 ;y 0 (s) kx(js)x 0 (js)k 2 +ky(js)y 0 (js)k 2 : Case2:i>j. This case is symmetric to the case ofi<j, and can be handled similarly. 268 Case3:i =j. 
In this case, 1 X i = 0 (V x i ; y i V x i+1 ; y i V x i ; y i+1 +V x i+1 ; y i+1 ) = 1 1 1 X i = 0 X s;a;b d x 0 ;y 0 (s;i) x 0 (ajs)x(ajs) y 0 (ajs)y(ajs) Q x;y (s;a;b) = 1 1 X s;a;b d x 0 ;y 0 (s) x 0 (ajs)x(ajs) y 0 (ajs)y(ajs) Q x;y (s;a;b) 1 2(1 ) 2 X s;a;b d x 0 ;y 0 (s) x 0 (ajs)x(ajs) 2 + 1 2(1 ) 2 X s;a;b d x 0 ;y 0 (s) y 0 (ajs)y(ajs) 2 = A 2(1 ) 2 X s d x 0 ;y 0 (s)kx 0 (js)x(js)k 2 + A 2(1 ) 2 X s d x 0 ;y 0 (s)ky 0 (js)y(js)k 2 : Summing the bounds in all three cases above completes the proof. D.5 ProofsforSection5.6 In this section, we provide proofs of Theorem 40 and Theorem 41 in Appendix D.5.2 and Ap- pendix D.5.3, respectively. D.5.1 Unbiasedestimate We consider thekth sampling in the data collection phase of Algorithm 11. By the sampling model in lines 6-8 of Algorithm 11, it is straightforward to see that s (h i ) d (t) for playeri. Then, we take a (h i ) i (k) i (js (h i ) ) at steph i for playeri. Each playeri begins with such ( s (h i ) ; a (h i ) i ) while 269 all players execute the policyf (t) i g N i = 1 with the termination probability 1 . Once terminated, we add all rewards collected inR (k) i . We next show thatE[R (k) i ] = Q i ( s (h i ) ; a (h i ) i ), E[R (k) i ] = E 2 4 h i +h 0 i 1 X h =h i r (h) i s (h i ) ; a (h i ) i ; a (h i ) i (t) i (j s (h i ) );h 0 i Geometric(1 ) 3 5 (a) = E 2 4 h 0 i 1 X h = 0 r (h+h i ) i s (h i ) ; a (h i ) i ; a (h i ) i (t) i (j s (h i ) );h 0 i Geometric(1 ) 3 5 = E " 1 X h = 0 1 f0hh 0 i 1g r (h+h i ) i s (h i ) ; a (h i ) i ; a (h i ) i (t) i (j s (h i ) );h 0 i Geometric(1 ) # (b) = 1 X h = 0 E E h 0 i 1 f0hh 0 i 1g r (h+h i ) i s (h i ) ; a (h i ) i ; a (h i ) i (t) i (j s (h i ) ) (c) = 1 X h = 0 E h r (h+h i ) i s (h i ) ; a (h i ) i ; a (h i ) i (t) i (j s (h i ) ) = E a (h i ) i (t) i (j s (h i ) ) E " 1 X h = 0 h r (h+h i ) i s (h i ) ; a (h i ) i ; a (h i ) i (t) i (j s (h i ) ) # = E a (h i ) i (t) i (j s (h i ) ) h Q (t) i ( s (h i ) ; a (h i ) i ; a (h i ) i ) i = Q (t) i ( s (h i ) ; a (h i ) i ) where in (a) we change the range of indexh while using the same initial state and action, (b) is due to the tower property, (c) follows thatE h 0 i 1 f0hh 0 i 1g = 1 (1 (1p) h ) = h , where p = 1 , and we also apply the monotone convergence and dominated convergence theorems for swapping the sum and the expectation. D.5.2 ProofofTheorem40 We apply Lemma 34 to the potential function () at two consecutive policies (t+1) and (t) in Algorithm 11, where is the initial state distribution. We use the shorthand (t) () for (t) (), the value of potential function at policy (t) . The proof extends Lemma 38 by accounting for the statistical error in Assumption 16. 270 Lemma56(Policyimprovement: Markovpotentialgames) Let Assumption 15 hold. In Al- gorithm 11, the dierence of potential functions () at two consecutive policies (t+1) and (t) , (t+1) () (t) () is lower bounded by either (i) or (ii), (i) 1 4(1 ) N X i = 1 X s d (t+1) i ; (t) i (s) (t+1) i (js) (t) i (js) 2 4 2 AW 2 N 2 (1 ) 3 A (1 ) 2 N X i = 1 L (t) i ( ^ w (t) i ) (ii) 1 4(1 ) N X i = 1 X s d (t+1) i ; (t) i (s) 1 4 3 NA (1 ) 4 (t+1) i (js) (t) i (js) 2 A (1 ) 2 N X i = 1 L (t) i ( ^ w (t) i ) where is the stepsize,N is the number of players,A is the size of one player’s action space,W is the 2-norm bound of ^ w (t) i , and is the distribution mismatch coecient relative to. Proof. We let 0 = (t+1) and = (t) for brevity. We rst express (t+1) () (t) () = Di +Di , whereDi andDi are given as those in (5.6). BoundingDi . 
By the property of the potential function () and Remark 11, 0 i ; i () () = V 0 i ; i i () V i () = 1 1 X s;a i d 0 i ; i (s) ( 0 i (a i js) i (a i js)) Q i ; i i (s;a i ): The optimality of 0 i = (t+1) i in line 14 of Algorithm 11 leads to 0 i (js); ^ Q (t) i (s;) A i 1 2 0 i (js) i (js) 2 i (js); ^ Q (t) i (s;) A i : (D.3) Hence, 0 i ; i () () 1 2(1 ) X s d 0 i ; i (s)k 0 i (js) i (js)k 2 + 1 1 X s d 0 i ; i (s) 0 i (js) i (js); Q i ; i i (s;) ^ Q (t) i (s;) A i : Therefore, Di 1 2(1 ) N X i = 1 X s d (t+1) i ; (t) i (s) (t+1) i (js) (t) i (js) 2 + 1 1 N X i = 1 X s d (t+1) i ; (t) i (s) (t+1) i (t) i (js); Q (t) i (s;) ^ Q (t) i (s;) A i : 271 However, X s d (t+1) i ; (t) i (s) (t+1) i (t) i (js); Q (t) i (s;) ^ Q (t) i (s;) A i (a) X s d (t+1) i ; (t) i (s) 1 2 0 k (t+1) i (t) i (js)k 2 + 0 2 k Q (t) i (s;) ^ Q (t) i (s;)k 2 (b) = 1 4 X s d (t+1) i ; (t) i (s)k (t+1) i (t) i (js)k 2 X s d (t+1) i ; (t) i (s)k Q (t) i (s;) ^ Q (t) i (s;)k 2 where (a) follows the inequalityhx;yi kxk 2 2 0 + 0 kyk 2 2 for 0 > 0, and we choose 0 = 2 in (b). Therefore, Di 1 4(1 ) N X i = 1 X s d (t+1) i ; (t) i (s) (t+1) i (js) (t) i (js) 2 1 N X i = 1 X s d (t+1) i ; (t) i (s)k Q (t) i (s;) ^ Q (t) i (s;)k 2 : (D.4) Bounding Di . For simplicity, we denote ~ ij as the joint policy of playersNnfi;jg where players< i andi j use and players> j use 0 . As done in the proof of Lemma 38, we can bound each summand in Di except for the last step from (c) to (d), ~ ij ; 0 i ; 0 j () ~ ij ; i ; 0 j () ~ ij ; 0 i ; j () + ~ ij ; i ; j () (c) 1 (1 ) 3 max s k 0 i (js) i (js)k 1 max s k 0 j (js) j (js)k 1 1 (1 ) 2 max s k 0 j (js) j (js)k 1 max s k 0 i (js) i (js)k 1 (d) 8 2 AW 2 (1 ) 3 where (d) follows a direct result from the optimality of (t+1) j given by (D.3), k (t+1) j (js) (t) j (js)k 2k ^ Q (t) i (s;)k 2W and thatkk 1 p Akk. Therefore, Di 4 2 AW 2 N 2 (1 ) 3 : (D.5) 272 We now complete the proof of (i) by combining (D.4) and (D.5) and we also employ that X s d (t+1) i ; (t) i (s)k Q (t) i (s;) ^ Q (t) i (s;)k 2 (a) 1 X s d (t) i ; (t) i (s)k Q (t) i (s;) ^ Q (t) i (s;)k 2 (b) A (1 ) L (t) i ( ^ w (t) i ) where (a) follows the denition of and (b) is the denition ofL (t) i ( ^ w (t) i ): L (t) i ( ^ w (t) i ) := E sd (t) ;a i (t) i (js) h Q (t) i (s;a i ) ^ Q (t) i (s;a i ) 2 i A E sd (t) X a i Q (t) i (s;a i ) ^ Q (t) i (s;a i ) 2 : Alternatively, as done in Lemma 38, we can apply Lemma 37 to each summand ofDi and show that Di 2 3 NA (1 ) 5 N X i = 1 X s d (t+1) i ; (t) i (s)k (t) i (js) (t+1) i (js)k 2 : Combining the inequality above with (D.4) nishes the proof of (ii). Proof. [Theorem 40] By the optimality of (t+1) i in line 14 of Algorithm 11, for any 0 i 2 i , D (1) 0 i (js) + A 1 (t+1) i (js); ^ Q (t) i (s;) (t+1) i (js) + (t) i (js) E A i 0 which leads to D 0 i (js) (t+1) i (js); ^ Q (t) i (s;) E A i D 0 i (js) (t+1) i (js); (t+1) i (js) (t) i (js) E A i (D.6) + 1 D (t+1) (js) 1 A 1; ^ Q (t) i (s;) (t+1) i (js) + (t) i (js) E . k (t+1) i (js) (t) i (js)k + W (D.7) 273 where the last inequality is because ofk ^ Q (t) i (s;)k W and 1 2 . Hence, if 1 W , then for any 0 i 2 i , 0 i (js) (t) i (js); Q (t) i (s;) A i = 0 i (js) (t+1) i (js); ^ Q (t) i (s;) A i + (t+1) i (js) (t) i (js); ^ Q (t) i (s;) A i + 0 i (js) (t) i (js); Q (t) i (s;) ^ Q (t) i (s;) A i (a) . 1 k (t+1) i (js) (t) i (js)k + W +k (t+1) i (js) (t) i (js)kk ^ Q (t) i (s;)k + 0 i (js) (t) i (js); Q (t) i (s;) ^ Q (t) i (s;) A i (b) . 
1 (t+1) i (js) (t) i (js) + W + 0 i (js) (t) i (js); Q (t) i (s;) ^ Q (t) i (s;) A i where we apply (D.7) and the Cauchy-Schwarz inequality in (a), and (b) is becausek ^ Q (t) i (s;)k W and 1 W . As done in the proof of Theorem 32, the dierent steps begin from (b) in (5.11), T X t = 1 max i max 0 i V 0 i ; (t) i i ()V (t) i () (b) . 1 (1 ) T X t = 1 X s d 0 i ; (t) i (s) (t+1) i (js) (t) i (js) + TW 1 + 1 1 T X t = 1 X s d 0 i ; (t) i (s) 0 i (js) (t) i (js); Q (t) i (s;) ^ Q (t) i (s;) A i (c) . p (1 ) 3 2 T X t = 1 X s r d (t+1) i ; (t) i (s)d 0 i ; (t) i (s) (t+1) i (js) (t) i (js) + TW 1 + 1 T X t = 1 X s d (t) (s) 0 i (js) (t) i (js); Q (t) i (s;) ^ Q (t) i (s;) A i (d) p (1 ) 3 2 v u u t T X t = 1 X s d (t+1) i ; (t) i (s) v u u t T X t = 1 X s d (t+1) i ; (t) i (s) (t+1) i (js) (t) i (js) 2 + TW 1 + 1 T X t = 1 s AL (t) i ( ^ w (t) i ) (e) p (1 ) 3 2 p T v u u t T X t = 1 N X i = 1 X s d (t+1) i ; (t) i (s) (t+1) i (js) (t) i (js) 2 + TW 1 + 1 T X t = 1 s AL (t) i ( ^ w (t) i ) (D.8) 274 where we slightly abuse the notation 0 i in (b) to represent argmax 0 i andi represents argmax i as in (5.11), (c) is due to the denition of the distribution mismatch coecient (see it in Denition 2): d 0 i ; (t) i (s) d (t+1) i ; (t) i (s) d 0 i ; (t) i (s) (1 )(s) 1 (d) follows the Cauchy–Schwarz inequality, the inequality p P i x i P i p x i for anyx i 0, the Jensen’s inequality, and the denition ofL (t) i ( ^ w (t) i ), X s d (t) (s) 0 i (js) (t) i (js); Q (t) i (s;) ^ Q (t) i (s;) A i . s X s d (t) (s) s X s d (t) (s) X a i Q (t) i (s;a i ) ^ Q (t) i (s;a i ) 2 s AL (t) i ( ^ w (t) i ) where ^ Q (t) i (s;a i ) =h i (s;a i ); ^ w (t) i i, and we replacei ( argmax i in (b)) in the square root term in (e) by the sum over all players. If we proceed (D.8) with the rst bound (i) in Lemma 56, then, E " T X t = 1 max i max 0 i V 0 i ; (t) i i ()V (t) i () # (a) . p T (1 ) 3 2 v u u t (1 )( (N+1) (1) ) + 3 AW 2 N 2 (1 ) 2 T + 2 A (1 ) T X t = 1 N X i = 1 E h L (t) i ( ^ w (t) i ) i + TW 1 + 1 T X t = 1 v u u t AE h L (t) i ( ^ w (t) i ) i (b) . s TC (1 ) 2 + T s AW 2 N 2 (1 ) 5 + T (1 ) 2 s AN stat + TW 1 where we apply the rst bound (i) in Lemma 56 and the telescoping sum for (a), and we use the boundedness of the potential function:j 0 jC for any and 0 , and further simplify the bound in (f) by Assumption 16. We complete the proof of (i) by taking stepsize = (1 ) 3=2 p C WN p AT and exploration rate 2 NA stat (1 ) 2 W 2 1 3 . 275 If we proceed (D.8) with the rst bound (ii) in Lemma 56 with the choice of (1 ) 4 16 3 NA , then, E " T X t = 1 max i max 0 i V 0 i ; (t) i i ()V (t) i () # (a) . p T (1 ) 3 2 v u u t (1 )( (N+1) (1) ) + 2 A (1 ) T X t = 1 N X i = 1 E h L (t) i ( ^ w (t) i ) i + TW 1 + 1 T X t = 1 v u u t AE h L (t) i ( ^ w (t) i ) i (b) . s TC (1 ) 2 + T (1 ) 2 s AN stat + TW 1 which completes the proof if we choose = (1 ) 4 16 3 NA and exploration rate 2 NA stat (1 ) 2 W 2 1 3 . D.5.3 ProofofTheorem41 We rst establish policy improvement regarding theQ-function at two consecutive policies (t+1) and (t) in Algorithm 11. Lemma57(Policyimprovement: Markovcooperativegames) For MPG (5.1) with identical rewardsandaninitialstatedistribution> 0,ifallplayersindependentlyperformthepolicyupdate in Algorithm 11 with stepsize 1 2N , then for anyt and anys, E a (t+1) (js) Q (t) (s;a) E a (t) (js) Q (t) (s;a) 1 8 N X i=1 k (t+1) i (js) (t) i (js)k 2 N X i = 1 kQ (t) i (s;) ^ Q (t) i (s;)k 2 where is the stepsize andN is the number of players, 276 Proof. 
As done in the proof of Lemma 39, we let :=E a(js) Q (t) (s;a) and (5.12) holds, whereQ (t) :=Q (t) . By taking 0 = (t+1) and = (t) for (5.12), E a 0 (js) Q (t) (s;a) E a(js) Q (t) (s;a) = N X i = 1 X a i ( 0 i (a i js) i (a i js)) ^ Q (t) i (s;a i ) + N X i = 1 X a i ( 0 i (a i js) i (a i js)) Q (t) i (s;a i ) ^ Q (t) i (s;a i ) + N X i = 1 N X j =i+1 X a i ;a j ( 0 i (a i js) i (a i js)) 0 j (a j js) j (a j js) E a ij ~ ij (js) Q (t) (s;a) (a) N X i = 1 1 2 k 0 i (js) i (js)k 2 N X i = 1 1 2 0 k 0 i (js) i (js)k 2 + 0 2 k Q (t) i (s;) ^ Q (t) i (s;)k 2 1 1 N X i = 1 N X j =i+1 X a i ;a j j 0 i (a i js) i (a i js)j 0 j (a j js) j (a j js) (b) N X i = 1 1 4 k 0 i (js) i (js)k 2 N X i = 1 k Q (t) i (s;) ^ Q (t) i (s;)k 2 1 2(1 ) N X i = 1 N X j =i+1 k 0 i (js) i (js)k 2 +k 0 j (js) j (js)k 2 = N X i=1 1 4 k 0 i (js) i (js)k 2 N X i = 1 k Q (t) i (s;) ^ Q (t) i (s;)k 2 N 1 2(1 ) N X i=1 k 0 i (js) i (js)k 2 (c) N X i = 1 1 8 k 0 i (js) i (js)k 2 N X i = 1 k Q (t) i (s;) ^ Q (t) i (s;)k 2 where (a) is due to the optimality condition (D.3), the inequalityhx;yi kxk 2 2 0 + 0 kyk 2 2 for 0 > 0, andQ (t) (s;a) 1 1 , (b) is due tohx;yi kxk 2 +kyk 2 2 and 0 = 2, and (c) follows the choice of 1 4N . 277 Proof. [Proof of Theorem 41] By Lemma 11 and Lemma 57, V (t+1) () V (t) () = 1 1 X s;a d (t+1) (s) (t+1) (ajs) (t) (ajs) Q (t) (s;a) 1 8(1 ) N X i = 1 X s d (t+1) (s)k (t+1) i (js) (t) i (js)k 2 1 N X i = 1 X s d (t+1) (s)k Q (t) i (s;) ^ Q (t) i (s;)k 2 1 8(1 ) N X i = 1 X s d (t+1) (s)k (t+1) i (js) (t) i (js)k 2 A (1 ) 2 N X i = 1 L (t) i ( ^ w (t) i ) where the last inequality is due to that X s d (t+1) (s)k Q (t) i (s;) ^ Q (t) i (s;)k 2 (a) 1 X s d (t) (s)k Q (t) i (s;) ^ Q (t) i (s;)k 2 = A (1 ) X s d (t) (s) X a i A Q (t) i (s;a i )h i (s;a i ); ^ w (t) i i 2 (b) A (1 ) E sd (t) ;a i (t) i (js) h Q (t) i (s;a i )h i (s;a i ); ^ w (t) i i 2 i = A (1 ) L (t) i ( ^ w (t) i ) where (a) follows the denition of and (b) is the denition ofL (t) i ( ^ w (t) i ). By the same argument as the proof of Theorem 40, T X t = 1 max i max 0 i V 0 i ; (t) i ()V (t) () . p (1 ) 3 2 v u u t T X t = 1 X s d 0 i ; (t) i (s) v u u t T X t = 1 N X i = 1 X s d (t+1) (s) (t+1) i (js) (t) i (js) 2 + TW 1 + 1 T X t = 1 s AL (t) i ( ^ w (t) i ) : 278 By taking expectation and the Jensen’s inequality, E " T X t = 1 max i max 0 i V 0 i ; (t) i ()V (t) () # . p T (1 ) 3 2 v u u t 8(1 )(V (N+1) V (1) ) + 8 2 A (1 ) T X t = 1 N X i = 1 E h L (t) i ( ^ w (t) i ) i + TW 1 + 1 T X t = 1 v u u t AE h L (t) i ( ^ w (t) i ) i . s 8 T (1 ) 3 + T s 8AN (1 ) 4 stat + TW 1 : We complete the proof by taking stepsize = 1 2NA , exploration rate 2 NA stat (1 ) 2 W 2 1 3 , and usingV (N+1) V (1) 1 1 . D.5.4 Samplecomplexity We present our sample complexity guarantees for Algorithm 11 in which the regression prob- lem (5.15) in each iteration is approximately solved by the stochastic projected gradient de- scent (D.20). We measure the sample complexity by the total number of trajectory samplesTK, whereT is the number of iterations andK is the batch size of trajectories. Corollary1(SamplecomplexityforMarkovpotentialgames) Assume the setting in Theo- rem 40 by excluding Assumption 16. Suppose we compute ^ w (t) i := 1 K P K k = 1 (K) k w (k) i via stochastic projected gradient descent (D.20) with stepsize (k) = 2 2+k and (K) k = 1= (k) P K r = 1 1= (r) . Then, if we choose stepsize = (1 ) 3=2 p C WN p AT and exploration rate = min 2 ANd (1 ) 4 K 1 3 ; 1 2 , then, E [ Nash-Regret(T ) ] . 
p WN(AC ) 1 4 (1 ) 7 2 T 1 4 + W ( 2 ANd ) 1 3 (1 ) 7 3 K 1 3 : Furthermore, if we choose stepsize = (1 ) 4 16 3 NA and exploration rate = min 2 ANd (1 ) 4 K 1 3 ; 1 2 , then, E [ Nash-Regret(T ) ] . 2 p ANC (1 ) 3 p T + W ( 2 ANd ) 1 3 (1 ) 7 3 K 1 3 : Moreover, their sample complexity guarantees areTK = O( 1 7 ) orTK = O( 1 5 ), respectively, for obtaining an-Nash equilibrium. 279 Proof. By the unbiased estimate in Appendix D.5.1, the stochastic gradient ^ r (t) i in (D.20) is also unbiased. We note the variance of the stochastic gradient is bounded by 1 (1 ) 2 . By Lemma 71, if we choose (k) = 2 2+k and (K) k = 1= (k) P K r = 1 1= (r) , then E h L (t) i ( ^ w (t) i ) i L (t) i (w (t) i ) dW 2 (1 ) 2 K where L (t) i (w (t) i ) = 0 by Assumption 15. Therefore, substitution of stat dW 2 (1 ) 2 K into Theo- rem 40 yields desired results. Finally, we let the upper bound on Nash-Regret(T ) be > 0 and calculate the sample com- plexityTK =O( 1 7 ) orTK =O( 1 5 ), respectively. Corollary2(SamplecomplexityforMarkovcooperativegames) ExceptingAssumption16, assume the setting in Theorem 41. Suppose we compute ^ w (t) i := 1 K P K k = 1 (K) k w (k) i via stochastic projected gradient descent (D.20) with stepsize (k) = 2 2+k and (K) k = 1= (k) P K r = 1 1= (r) . Then, if we choose stepsize = 1 WNA p T and exploration rate = min 2 AN (1 ) 4 K 1 3 ; 1 2 , then, E [ Nash-Regret(T ) ] . p AN (1 ) 2 p T + W ( 2 ANd ) 1 3 (1 ) 7 3 K 1 3 : Moreover, the sample complexity guarantee isTK =O( 1 5 ) for obtaining an-Nash equilibrium. Proof. The proof follows the proof steps of Corollary 1 above. D.6 ProofsforSection5.7 In this section, we prove Theorem 42 and Theorem 43 in D.6.1 and D.6.2, respectively. D.6.1 ProofofTheorem42 It is convenient to introduce an auxiliary sequencef (t;) g 1 = 0 associated with the learning rate f (t) g 1 t = 1 , (t;) := 8 > > > > > > > < > > > > > > > : t Y j = 1 (1 (j) ); for = 0 () t Y j =+1 (1 (j) ); for 1 t 0; for > t: (D.9) It is straightforward to verify that P t1 = 0 (t1;) = 1 fort 1. Lemma58 In Algorithm 12,V (t) s = P t = 1 (t;) (x () s ) > Q () s y () s for alls;t. 280 Proof. We prove it by induction. When t = 0 and t = 1, it holds trivially by noting that V (0) s = 0 and (1;1) = (1) . Assume that it holds for 0; 1;:::;t 1. By the update rule forV (t) s in Algorithm 12, V (t) s = (1 (t) )V (t1) s + (t) (x (t) s ) > Q (t) s y (t) s (a) = (1 (t) ) t1 X = 1 (t1;) (x () s ) > Q () s y () s + (t;t) (x (t) s ) > Q (t) s y (t) s (b) = t X = 1 (t;) (x () s ) > Q () s y () s where (a) follows the induction hypothesis and (b) is due to the denition of (t;) . Lemma59 In Algorithm 12, for every states and timet 1, (x (t+1) s ) > Q (t) s y (t+1) s (x (t) s ) > Q (t) s y (t) s 15 16 kz (t+1) s z (t+1) s k 2 + 7 16 k z (t+1) s z (t) s k 2 9 16 kz (t) s z (t) s k 2 wherez (t) s = (x (t) s ;y (t) s ) and z (t) s = ( x (t) s ; y (t) s ). Proof. We decompose the dierence into three terms: (x (t+1) s ) > Q (t) s y (t+1) s (x (t) s ) > Q (t) s y (t) s = (x (t+1) s x (t) s ) > Q (t) s y (t) s | {z } Dix + (x (t) s ) > Q (t) s (y (t+1) s y (t) s ) | {z } Diy + (x (t+1) s x (t) s ) > Q (t) s (y (t+1) s y (t) s ) | {z } Dixy : We next deal withDi x ,Di y , andDi xy , separately. BoundingDi x . 
The optimality ofx (t+1) s implies that for anyx 0 s 2 (A 1 ), (x (t+1) s ) > Q (t) s y (t) s 1 2 kx (t+1) s x (t+1) s k 2 (x 0 s ) > Q (t) s y (t) s 1 2 kx 0 s x (t+1) s k 2 + 1 2 kx 0 s x (t+1) s k 2 which implies that, by takingx 0 s = x (t+1) s , (x (t+1) s x (t+1) s ) > Q (t) s y (t) s 1 kx (t+1) s x (t+1) s k 2 : (D.10) The optimality of x (t+1) s implies that for anyx 0 s 2 (A 1 ), ( x (t+1) s ) > Q (t) s y (t) s 1 2 k x (t+1) s x (t) s k 2 (x 0 s ) > Q (t) s y (t) s 1 2 kx 0 s x (t) s k 2 + 1 2 kx 0 s x (t+1) s k 2 281 which implies that, by takingx 0 s =x (t) s , ( x (t+1) s x (t) s ) > Q (t) s y (t) s 1 2 k x (t+1) s x (t) s k 2 1 2 kx (t) s x (t) s k 2 : (D.11) Combining the two inequalities above yields Di x = (x (t+1) s x (t) s ) > Q (t) s y (t) s 1 kx (t+1) s x (t+1) s k 2 + 1 2 k x (t+1) s x (t) s k 2 1 2 kx (t) s x (t) s k 2 : (D.12) BoundingDi y . Similarly, Di y = (x (t) s ) > Q (t) s (y (t+1) s y (t) s ) 1 ky (t+1) s y (t+1) s k 2 + 1 2 k y (t+1) s y (t) s k 2 1 2 ky (t) s y (t) s k 2 : (D.13) BoundingDi xy . By the AM-GM and Cauchy-Schwarz inequalities, Di xy p A 2(1 ) kx (t+1) s x (t) s k 2 p A 2(1 ) ky (t+1) s y (t) s k 2 (a) 3 p A 2(1 ) kx (t+1) s x (t+1) s k 2 +k x (t+1) s x (t) s k 2 +k x (t) s x (t) s k 2 +ky (t+1) s y (t+1) s k 2 +k y (t+1) s y (t) s k 2 +k y (t) s y (t) s k 2 (b) 1 16 kx (t+1) s x (t+1) s k 2 +k x (t+1) s x (t) s k 2 +k x (t) s x (t) s k 2 +ky (t+1) s y (t+1) s k 2 +k y (t+1) s y (t) s k 2 +k y (t) s y (t) s k 2 where (a) followskx +y +zk 2 3kxk 2 + 3kyk 2 + 3kzk 2 and (b) is by 1 32 p A . Finally, we complete the proof by summing up the bounds above forDi x ,Di y , andDi xy . Lemma60 In Algorithm 12, for allt ands, the following two inequalities hold: (i) V (t) s V (t1) s ; (ii) (x (t+1) s ) > Q (t+1) s y (t+1) s (x (t) s ) > Q (t) s y (t) s 15 16 kz (t+1) s z (t+1) s k 2 + 7 16 k z (t+1) s z (t) s k 2 9 16 kz (t) s z (t) s k 2 : 282 Proof. We rst note that (ii) is a consequence of 59 and (i), (x (t+1) s ) > Q (t+1) s y (t+1) s (x (t) s ) > Q (t) s y (t) s = (x (t+1) s ) > Q (t+1) s y (t+1) s (x (t+1) s ) > Q (t) s y (t+1) s + (x (t+1) s ) > Q (t) s y (t+1) s (x (t) s ) > Q (t) s y (t) s (a) min s 0 V (t) s 0 V (t1) s 0 + 15 16 kz (t+1) s z (t+1) s k 2 + 7 16 k z (t+1) s z (t) s k 2 9 16 kz (t) s z (t) s k 2 (b) 15 16 kz (t+1) s z (t+1) s k 2 + 7 16 k z (t+1) s z (t) s k 2 9 16 kz (t) s z (t) s k 2 where (a) is due to Lemma 59, and the update ofQ (t) s in Algorithm 12, Q (t+1) s (a 1 ;a 2 )Q (t) s (a 1 ;a 2 ) = E s 0 P(js;a 1 ;a 2 ) h V (t) s 0 V (t1) s 0 i and (b) follows (i). Therefore, it suces to prove (i). We prove it by induction. Dene (t) s :=kz (t) s z (t) s k 2 and (t) s :=k z (t+1) s z (t) s k 2 . For notational simplicity, deneQ (0) s =0 AA ,z (0) s = z (0) s = 1 A 1 =z (1) s = z (1) s . Thus, (ii) holds fort = 0 and (i) holds fort = 1. We note that fort 2, V (t) s V (t1) s (a) = (t) x (t) s Q (t) s y (t) s V (t1) s (b) = (t) t1 X = 0 (t1;) x (t) s Q (t) s y (t) s x () s Q () s y () s ! = (t) t1 X = 0 (t1;) t1 X i = x (i+1) s Q (i+1) s y (i+1) s x (i) s Q (i) s y (i) s ! = (t) t1 X = 0 (t1;) t1 X i = x (i+1) s Q (i+1) s y (i+1) s x (i) s Q (i) s y (i) s 15 16 (i+1) s 7 16 (i) s + 9 16 (i) s ! + (t) t1 X = 0 (t1;) t1 X i = 15 16 (i+1) s + 7 16 (i) s 9 16 (i) s ! (c) (t) t1 X i = 0 i X = 0 (t1;) ! 15 16 (i+1) s 9 16 (i) s = (t) t X i = 1 (i) s 15 16 i1 X = 0 (t1;) 9 16 i X = 0 (t1;) ! (t) 0 X = 0 (t1;) ! 9 16 (0) s (d) = (t) t X i = 2 (i) s 15 16 i1 X = 0 (t1;) 9 16 i X = 0 (t1;) ! 
(e) 0 283 where (a) follows the update ofV (t) s in Algorithm 12, we apply Lemma 58 and P t1 = 0 (t1;) = 1 in (b), (c) follows the induction hypothesis (ii), (d) is due to that (0) s = (1) s = 0, and we apply Lemma 64 for (e). Lemma61 For everys2S, the following quantities in Algorithm 12 all converge to some xed values whent!1: (i)V (t) s ; (ii)kz (t) s z (t) s k 2 +k z (t) s z (t1) s k 2 (converges to zero); (iii) (x (t) s ) > Q (t) s y (t) s . Proof. Establishing (i). By (i) in Lemma 60,fV (t) s g 1 t = 0 is a bounded increasing sequence. By the monotone convergence theorem, it is convergent. Therefore, (i) holds. Establishing(ii). By summing up the inequality (ii) in Lemma 60 overt and using the fact that z (1) s = z (1) s , t X = 1 6 16 kz (+1) s z (+1) s k 2 + 7 16 k z (+1) s z () s k 2 (x (t+1) s ) > Q (t+1) s y (t+1) s (x (1) s ) > Q (1) s y (1) s 1 1 which implies that 6 16 kz (+1) s z (+1) s k 2 + 7 16 k z (+1) s z () s k 2 must converge to zero when!1, which further implies (ii). Establishing(iii). By (ii) in Lemma 60, (x (t+1) s ) > Q (t+1) s y (t+1) s 15 16 kz (t+1) s z (t+1) s k 2 (x (t) s ) > Q (t) s y (t) s 15 16 kz (t) s z (t) s k 2 + 7 16 k z (t+1) s z (t) s k 2 + 6 16 kz (t) s z (t) s k 2 : Therefore, (x (t) s ) > Q (t) s y (t) s 15 16 kz (t) s z (t) s k 2 converges to a xed value (increasing and upper bounded). In (ii), we have shown thatkz (t) s z (t) s k 2 converges to zero. Therefore, (x (t) s ) > Q (t) s y (t) s must also converge. Therefore, (iii) holds. Lemma62 In Algorithm 12, for everys2S, lim t!1 V x (t) ;y (t) s exists, and lim t!1 V (t) s = lim t!1 V x (t) ;y (t) s : Proof. By Lemma 61,V (t) s and (x (t) s ) > Q (t) s y (t) s both are convergent. LetV ? s := lim t!1 V (t) s and ? s := lim t!1 (x (t) s ) > Q (t) s y (t) s . We next showV ? s = ? s by contradiction. Assume that there exists 284 > 0 such thatjV ? s ? s j = . Since (x (t) s ) > Q (t) s y (t) s converges to ? s , there exists somet 0 > 0 such that for alltt 0 , (x (t) s ) > Q (t) s y (t) s ? s 3 : (D.14) By our choice of (t) , P 1 t =t 0 (t) =1 for anyt 0 . Thus, there existst 1 > 0 such that for alltt 1 and allt 0 , (t;) t Y i =+1 (1 (i) ) (a) exp t X i =t 0 +1 (i) ! (1 ) 3t 0 (D.15) where log(1x)x forx2 (0; 1) is used in (a). By the update ofV (t) s in Algorithm 12, for allt max(t 0 ;t 1 ), V (t) s ? s = t X = 0 (t;) (x () s ) > Q () s y () s ? s (a) t 0 1 X = 0 (t;) (x () s ) > Q () s y () s ? s + t X =t 0 (t;) (x () s ) > Q () s y () s ? s (b) t 0 1 X = 0 (t;) ! 1 1 + 1 t 0 1 X = 1 (t;) ! 3 t 0 max t 0 (t;) 1 1 + 3 (c) 2 3 where we apply the triangle inequality for (a), (b) is due to (D.14) and P t = 1 (t;) = 1, and (c) follows (D.15). SincejV ? s ? s j =, it is impossible thatV (t) s converges toV ? s , and it must be that V ? s = ? s . Therefore,V (t) s (x (t) s ) > Q (t) s y (t) s converges to zero ast!1. Equivalently,V (t) s (x (t) s ) > Q (t) s y (t) s can be expressed as V (t) s V (t1) s +V (t1) s X a 1 ;a 2 x (t) s (a 1 )y (t) s (a 2 ) r(s;a 1 ;a 2 ) + E s 0 P(js;a 1 ;a 2 ) h V (t1) s 0 i : By lettingt! 0, sinceV (t) s V (t1) s ! 0, thus, V (t1) s X a 1 ;a 2 x (t) s (a 1 )y (t) s (a 2 ) r(s;a 1 ;a 2 ) + E s 0 P(js;a 1 ;a 2 ) h V (t1) s 0 i : also converges to zero. Hence,V (t) s converges to the unique xed point of the Bellman equa- tion. By the uniqueness,V (t1) s V x (t) ;y (t) s converges to zero. Therefore, lim t!1 V x (t) ;y (t) s = lim t!1 V (t1) s =V ? s . 285 Lemma63 In Algorithm 12, for everys, lim t!1 max x 0 (x 0 s x (t) s ) > Q (t) s y (t) s = 0: Proof. 
By the optimality ofx (t+1) s , x 0 s x (t+1) s ; Q (t) s y (t) s x (t+1) s + x (t+1) s 0; for anyx 0 s : Rearranging the inequality yields, for anyx 0 s , hx 0 s x (t+1) s ;Q (t) s y (t) s i 1 x 0 s x (t+1) s ; x (t+1) s x (t) s + 1 x (t+1) s x (t) s ; Q (t) s y (t) s x (t+1) s + x (t+1) s . 1 kx (t+1) s x (t) s k 1 kx (t+1) s x (t+1) s k +k x (t+1) s x (t) s k +k x (t) s x (t) s k : By (ii) of Lemma 61, the right-hand side above converges to zero, which completes the proof. Lemma64 Letf (t) g 1 t=1 be a non-increasing sequence that satises 0 < (t) 1 6 for allt. Then for anyti 2, i X = 0 (t;) 5 3 i1 X = 0 (t;) : Proof. Equivalently, we prove (t;i) 2 3 i1 X = 0 (t;) : If suces to show that (t;i) 2 3 (t;i1) + 2 3 (t;i2) . We have the following two cases. Case1:i> 2. By the denition of (t;) and the monotonicity of 0< (t) 1 6 , (t;i) (t;i1) = (i) Q t j =i+1 (1 (j) ) (i1) Q t j =i (1 (j) ) = (i) (i1) (1 (i) ) 1 1 (i) 1 1 1 6 = 6 5 (t;i) (t;i2) = (i) (i2) (1 (i) )(1 (i1) ) 36 25 : Therefore, 2 3 (t;i1) + 2 3 (t;i2) 2 3 5 6 + 25 36 (t;i) (t;i) : 286 Case2:i = 2. By the denition of (t;) and the monotonicity of 0< (t) 1 6 , (t;2) (t;0) = (2) Q t j = 3 (1 (j) ) Q t j = 1 (1 (j) ) = (2) (1 (1) )(1 (2) ) 1 6 5 6 5 6 = 6 25 : Therefore, 2 3 (t;1) + 2 3 (t;0) 2 3 25 6 (t;2) (t;2) : Proof. [Proof of Theorem 42] max x 0 V x 0 ;y (t) ()V x (t) ;y (t) () = max x 0 1 1 X s d x 0 ;y (t) (s) x 0 s x (t) s > Q x (t) ;y (t) s y (t) s max x 0 1 1 X s d x 0 ;y (t) (s) x 0 s x (t) s > Q (t) s y (t) s | {z } Di P + max x 0 1 1 X s d x 0 ;y (t) (s) x 0 s x (t) s > Q x (t) ;y (t) s Q (t) s y (t) s | {z } Di Q : By Lemma 63,Di P ! 0 whent!1. ForDi Q , we notice that Q x (t) ;y (t) s Q (t) s max s 0 V x (t) ;y (t) s 0 V (t) s 0 which converges to zero by Lemma 62. Therefore,Di Q ! 0 whent!1. Therefore, (x (t) ;y (t) ) converges to a Nash equilibrium whent!1. D.6.2 ProofofTheorem43 We rst introduce a corollary of Lemma 60. Corollary3 In Algorithm 12, for every states, and anyT > 0, T X t = 1 k z (t+1) s z (t) s k 2 +k z (t+1) s z (t) s k 2 8 1 wherez (t) s = (x (t) s ;y (t) s ) and z (t) s = ( x (t) s ; y (t) s ). 287 Proof. By (ii) of Lemma 60, (x (t+1) s ) > Q (t+1) s y (t+1) s (x (t) s ) > Q (t) s y (t) s + 15 16 kz (t) s z (t) s k 2 kz (t+1) s z (t+1) s k 2 7 16 k z (t+1) s z (t) s k 2 + 6 16 kz (t) s z (t) s k 2 6 16 k z (t+1) s z (t) s k 2 + 6 16 kz (t) s z (t) s k 2 : Thus, by the inequalitykx +yk 2 2kxk 2 + 2kyk 2 , k z (t+1) s z (t) s k 2 +k z (t+1) s z (t) s k 2 3k z (t+1) s z (t) s k 2 + 2k z (t) s z (t) s k 2 3k z (t+1) s z (t) s k 2 + 3k z (t) s z (t) s k 2 8 (x (t+1) s ) > Q (t+1) s y (t+1) s (x (t) s ) > Q (t) s y (t) s + 15 2 kz (t) s z (t) s k 2 kz (t+1) s z (t+1) s k 2 which yields our desired result if we sum it overt, use (x (T +1) s ) > Q (T +1) s y (T +1) s 1 1 andz (1) s = z (1) s , and ignore a negative term. Lemma65 In Algorithm 12, the gap between the criticQ (t) s and the trueQ (t) s satises T X t = 1 max s kQ (t) s Q (t) s k 2 1 . A ( (T ) ) 2 (1 ) 6 T X t = 1 max s kx (t) s x (t1) s k 2 +ky (t) s y (t1) s k 2 : Proof. For notational simplicity, deneQ (0) s =Q (0) s =0 AA . 288 max s kQ (t) s Q (t) s k 2 1 := max s;a 1 ;a 2 Q (t) s (a 1 ;a 2 )Q (t) s (a 1 ;a 2 ) 2 max s;a 1 ;a 2 r(s;a 1 ;a 2 ) + E s 0 P(js;a 1 ;a 2 ) h (x s 0) > Q (t) s 0 y s 0 i t1 X = 0 (t1;) r(s;a 1 ;a 2 ) + E s 0 P(js;a 1 ;a 2 ) h (x () s 0 ) > Q () s 0 y () s 0 i 2 2 max s 0 t1 X = 0 (t1;) (x s 0) > Q (t) s 0 y s 0 (x () s 0 ) > Q () s 0 y () s 0 2 (a) 6 2 1 max s t1 X = 0 (t1;) (x (t) s ) > Q (t) s Q () s y (t) s ! 
2 + 2 2 1 + max s t1 X = 0 (t1;) (x (t) s ) > Q () s Q () s y (t) s ! 2 + 6 2 1 max s t1 X = 0 (t1;) (x (t) s ) > Q () s (y (t) s y () s ) ! 2 + 6 2 1 max s t1 X = 0 (t1;) (x (t) s x () s ) > Q () s y () s ! 2 (b) 2 2 1 + max s t1 X = 0 (t1;) ! t1 X = 0 (t1;) (x (t) s ) > Q () s Q () s y (t) s 2 ! +c 0 max s t1 X = 0 (t1;) kx (t) s x () s k 1 +ky (t) s y () s k 1 ! 2 (c) 2 2 1 + max s " t1 X = 0 (t1;) kQ () s Q () s k 2 1 # + c 0 max s t1 X = 0 (t1;) t X h =+1 di (h) s ! 2 max s " t1 X = 0 (t1;) kQ () s Q () s k 2 1 # + c 0 max s t X h = 1 h1 X = 0 (t1;) di (h) s ! 2 (d) max s " t1 X = 0 (t1;) kQ () s Q () s k 2 1 # + c 0 max s t X h = 1 (t1;h1) di (h) s ! 2 (e) max s " t1 X = 0 (t1;) kQ () s Q () s k 2 1 # + c 0 max s t X h = 1 (1 (t) ) th di (h) s ! 2 where in (a) we apply (x +y +z +w) 2 6x 2 1 + 2y 2 1+ + 6z 2 1 + 6w 2 1 from the Cauchy-Schwarz inequality, in (b) we use Lemma 67 and obtainc 0 =O 1 (1 ) 5 , in (c) we introduce notation, di (h) s := kx (h) s x (h1) s k 1 +ky (h) s y (h1) s k 1 289 and in (d) we introduce notation, (t;) = t Y i =+1 (1 (i) ): and apply Lemma 35 of [239], (e) is due to thatf (t) g 1 t = 0 is a non-increasing sequence. Application of Lemma 33 of [239] to the recursion relation above yields max s kQ (t) s Q (t) s k 2 1 c 0 t X = 1 (t;) max s X q = 1 (1 () ) q di (q) s ! 2 c 0 t X = 1 (t;) X q = 1 (1 () ) q di (q) ! 2 (D.16) where (t;) := () Q t1 i= (1 (i) + (i) ) for 1 <t and (t;t) := 1, and di (t) := max s di (t) s . The right-hand side of (D.16) can be further upper bounded by c 0 t X = 1 (t;) X q = 1 (1 () ) q ! X q = 1 (1 () ) q (di (q) ) 2 ! (a) c 0 t X = 1 (t;) () X q = 1 (1 () ) q (di (q) ) 2 = c 0 t X q = 1 t X =q (t;) () (1 () ) q (di (q) ) 2 = c 0 t X q = 1 " t1 X =q (1 (t) + (t) ) t (1 (t) ) q (di (q) ) 2 + (1 (t) ) tq (t) (di (q) ) 2 # = c 0 t X q = 1 " (1 (t) + (t) ) tq t1 X =q 1 (t) 1 (t) + (t) q + (1 (t) ) tq (t) # (di (q) ) 2 = c 0 t X q = 1 2 6 4 (1 (t) + (t) ) tq 1 1 (t) 1 (t) + (t) tq (t) 1 (t) + (t) + (1 (t) ) tq (t) 3 7 5 (di (q) ) 2 2c 0 (t) t X q = 1 (1 (t) + (t) ) tq (di (q) ) 2 where (a) is due to that P q = 1 (1 () ) q 1 () : 290 Substitution of the upper bound above into (D.16) yields, T X t = 1 max s kQ (t) s Q (t) s k 2 1 . T X t = 1 c 0 (t) t X q = 1 (1 (t) + (t) ) tq (di (q) ) 2 (a) T X q = 1 T X t =q c 0 (T ) (1 (T ) + (T ) ) tq (di (q) ) 2 (b) c 0 ( (T ) ) 2 (1 ) T X q = 1 (di (q) ) 2 where (a) is due to that (t) is non-increasing, and (b) is due to that P T t =q (1 (T ) + (T ) ) tq 1 (T) (1 ) . Finally, using the denition ofc 0 and applyingkk 1 p Akk to di (q) lead to the desired result. Proof. [Proof of Theorem 43] The proof consists of two parts: Markov cooperative games and Markov competitive games, separately. Markovcooperativegames. Fixs, the optimality of x (t+1) s in Algorithm 12 yields Q (t) s y (t) s ( x (t+1) s x (t) s );x 0 s x (t+1) s 0; for anyx 0 s 2 (A 1 ): (D.17) Thus, for anyx 0 s 2 (A 1 ), (x 0 s x (t) s ) > Q (t) s y (t) s = (x 0 s x (t+1) s ) > Q (t) s y (t) s + (x 0 s x (t+1) s ) > (Q (t) s Q (t) s )y (t) s + ( x (t+1) s x (t) s ) > Q (t) s y (t) s (a) . 1 (x 0 s x (t+1) s ) > ( x (t+1) s x (t) s ) +kQ (t) s Q (t) s k + p A 1 k x (t+1) s x (t) s k (b) . 1 k x (t+1) s x (t) s k +k x (t+1) s x (t) s k +kQ (t) s Q (t) s k 291 where we use (D.17) andkQ (t) s k p A 1 in (a), and (b) is due to the Cauchy-Schwarz inequality and the choice of 1 32 p A . Hence, T X t = 1 max x 0 V x 0 ;y (t) ()V x (t) ;y (t) () = 1 1 T X t = 1 max x 0 X s d x 0 ;y (t) (s) x 0 s x (t) s > Q (t) s y (t) s . 
1 (1 ) T X t = 1 X s d x 0 ;y (t) (s) x (t+1) s x (t) s + x (t+1) s x (t) s + 1 (1 ) T X t = 1 X s d x 0 ;y (t) (s)kQ (t) s Q (t) s k (a) . 1 (1 ) v u u t T X t = 1 X s d x 0 ;y (t) (s) v u u t T X t = 1 X s d x 0 ;y (t) (s) x (t+1) s x (t) s 2 + x (t+1) s x (t) s 2 + 1 (1 ) T X t = 1 X s d x 0 ;y (t) (s)kQ (t) s Q (t) s k (b) p T (1 ) v u u t T X t = 1 X s x (t+1) s x (t) s 2 + x (t+1) s x (t) s 2 + 1 (1 ) T X t = 1 X s d x 0 ;y (t) (s)kQ (t) s Q (t) s k (c) . p T (1 ) s S 1 + 1 1 v u u t T X t=1 X s d x 0 ;y (t) (s) v u u t T X t=1 X s kQ (t) s Q (t) s k 2 (d) . s ST (1 ) 3 + p T (1 ) 4 v u u t SA ( (T ) ) 2 T X t=1 max s kx (t) s x (t1) s k 2 +ky (t) s y (t1) s k 2 (e) . s ST (1 ) 3 + p T 1 s S 2 A ( (T ) ) 2 (1 ) 7 292 where (a) is due to the Cauchy-Schwarz inequality for (a), (b) follows the state distribution d x 0 ;y (t) (s), (c) is due to Corollary 3, (d) is because of Lemma 65, and (e) again is due to Corollary 3. By taking = (1 ) 2 32 p SA and (t) = 1 6 3 p t , the last upper bound above is of order, O (S 3 A) 1 4 p T (1 ) 7 2 (T ) ! = O (S 3 A) 1 4 T 5 6 (1 ) 7 2 ! : Markov competitive game. We start from an intermediate step in the proof of Theorem 1 of [239]. Specically, they have shown that if both players use Algorithm 12 in a two-player zero- sum Markov game, then, T X t = 1 max x 0 ;y 0 V x 0 ;y (t) ()V x (t) ;y 0 () = O S p C C T (1 ) ! whereC := 1+ P T t = 1 (t) andC is an upper bound for P T t = (t;) with (t;) := () Q t1 i = (1 (i) + (i) ) if <t and (t;t) := 1. We next calculate the upper bounds forC andC . BoundingC . Recall that (t) = 1 6 t 1 3 . By the denition ofC , C = 1 + 1 6 T X t = 1 t 1 3 = O T 2 3 : BoundingC . Using (t) = 1 6 t 1 3 , for any 1, we have T X t = (t;) 1 + T X t =+1 () t1 Y i = (1 (i) + (i) ) = 1 + 1 6 T X t =+1 1 3 1 1 6 t 1 3 (1 ) t 1 + 1 6 t 0 X t =+1 1 3 + 1 6 T X t =t 0 +1 1 3 1 1 6 t 1 3 (1 ) t (for somet 0 dened below) 1 + 1 6 1 3 (t 0 ) + 1 6 1 3 T X t =t 0 +1 exp 1 6 t 1 3 (1 )(t) : (D.18) Denet 0 := +H( +c) 1 3 ln( +c)+c, whereH := 48 1 andc := 2 2H 1 1 3 ln H 1 1 3 1 1 1 3 (ift 0 >T , we simply ignore the second term in (D.18)). By Lemma 66 withq = 1 3 , for alltt 0 , t H 2 t 2 1 3 ln t 2 : 293 Hence, we can continue to bound the right-hand side of (D.18) by O H +c 1 3 ln( +c) + c 1 3 ! + 1 6 1 3 T X t =t 0 +1 exp 1 12 t 1 3 (1 ) H 2 t 2 1 3 ln t 2 ! ~ O H(1 +c) 1 3 +c + 1 6 1 3 T X t =t 0 +1 2 t = ~ O 1 (1 ) 3 2 ! which proves thatC = ~ O 1 (1 ) 3 2 . Therefore, T X t = 1 max x 0 ;y 0 V x 0 ;y (t) ()V x (t) ;y 0 () = O S p C C T (1 ) ! = ~ O ST 5 6 (1 ) 7 4 ! which completes the proof by taking = (1 ) 2 32 p SA . Lemma66 Fix2N, 0 0. By the denition ofc, for allt c,t 1q t 2 1q 2H 1q ln H 1q and thus t 1q H 1q ln(t 1q ) = H lnt which proves the rst item, and that t 2 1q H 1q ln t 2 1q = H ln t 2 294 which gives d dt t H 2 t 2 q ln t 2 = 1 H 2 q 2 t 2 q1 ln t 2 H 2 t 2 q 1 t 1 1 2 q 2 1 2 0 which proves the second item. By the rst item and the denition oft 0 ,t 0 + ( +c) +c = 2 + 2c. Then by the second item, for alltt 0 we have t H 2 t 2 q ln t 2 t 0 H 2 t 0 2 q ln t 0 2 t 0 H 2 ( +c) q ln ( +c) which completes the proof. Lemma67 For any two policies (x 0 ;y 0 ) and (x;y), max s kQ x 0 ;y 0 s Q x;y s k max (1 ) 2 max s 0 (kx 0 s 0x s 0k 1 +ky 0 s 0y s 0k 1 ): Proof. 
By the denition, kQ x 0 ;y 0 s Q x;y s k max (a) := max a 1 ;a 2 Q x 0 ;y 0 s (a 1 ;a 2 )Q x;y s (a 1 ;a 2 ) (b) X s 0 P(s 0 js; a 1 ; a 2 ) (x 0 s 0) > Q x 0 ;y 0 s 0 y 0 s 0 (x s 0) > Q x;y s 0 y s 0 max s 0 (x s 0) > Q x 0 ;y 0 s 0 y s 0 (x s 0) > Q x;y s 0 y s 0 | {z } Qi (D.19) where a 1 and a 2 achieve the maximum in (a), and (b) is due to the Bellman equation, Q x;y s (a 1 ;a 2 ) = r(s;a 1 ;a 2 ) + E s 0 P(js;a 1 ;a 2 ) [V x;y s 0 ] = r(s;a 1 ;a 2 ) + X s 0 P(s 0 js;a 1 ;a 2 ) X a 0 1 ;a 0 2 x s 0(a 0 1 )y s 0(a 0 2 )Q x;y s 0 (a 0 1 ;a 0 2 ): Fixs 0 , we next subtract and add (x s 0) > Q x 0 ;y 0 s 0 y s 0 in Qi and applyja +bjjaj +jbj to reach, Qi (x 0 s 0) > Q x 0 ;y 0 s 0 y 0 s 0 (x s 0) > Q x 0 ;y 0 s 0 y s 0 + (x s 0) > Q x 0 ;y 0 s 0 y s 0 (x s 0) > Q x;y s 0 y s 0 1 1 X a 0 1 ;a 0 2 (x 0 s 0(a 1 )y 0 s 0(a 0 2 )x s 0(a 0 1 )y s 0(a 0 2 ))Q x 0 ;y 0 s 0 (a 0 1 ;a 0 2 ) + (x s 0) > Q x 0 ;y 0 s 0 Q x;y s 0 y s 0 1 1 X a 0 1 ;a 0 2 jx 0 s 0(a 0 1 )y 0 s 0(a 0 2 )x s 0(a 0 1 )y s 0(a 0 2 )j + (x s 0) > Q x 0 ;y 0 s 0 Q x;y s 0 y s 0 : 295 We also notice that kx 0 s 0y 0 s 0x s 0y s 0k 1 := X a 0 1 ;a 0 2 jx 0 s 0(a 0 1 )y 0 s 0(a 0 2 )x s 0(a 0 1 )y s 0(a 0 2 )j X a 0 1 ;a 0 2 jx 0 s 0(a 0 1 )y 0 s 0(a 0 2 )x s 0(a 0 1 )y 0 s 0(a 0 2 )j + X a 0 1 ;a 0 2 jx s 0(a 0 1 )y 0 s 0(a 0 2 )x s 0(a 0 1 )y s 0(a 0 2 )j X a 0 1 jx 0 s 0(a 0 1 )x s 0(a 0 1 )j + X a 0 2 jy 0 s 0(a 0 2 )y s 0(a 0 2 )j = kx 0 s 0x s 0k 1 +ky 0 s 0y s 0k 1 and (x s 0) > Q x 0 ;y 0 s 0 Q x;y s 0 y s 0 max a 1 ;a 2 Q x 0 ;y 0 s 0 (a 1 ;a 2 )Q x;y s 0 (a 1 ;a 2 ) := kQ x 0 ;y 0 s 0 Q x;y s 0 k max : By substituting the upper bound onQi above into (D.19), kQ x 0 ;y 0 s Q x;y s k max max s 0 1 1 (kx 0 s 0x s 0k 1 +ky 0 s 0y s 0k 1 ) + max s 0 kQ x 0 ;y 0 s 0 Q x;y s 0 k max : Therefore, max s kQ x 0 ;y 0 s Q x;y s k max max s 0 1 1 (kx 0 s 0x s 0k 1 +ky 0 s 0y s 0k 1 ) + max s 0 kQ x 0 ;y 0 s 0 Q x;y s 0 k max : which yields the desired result. D.7 Otherauxiliarylemmas In this section, we provide other auxiliary lemmas that are helpful in our analysis. D.7.1 Auxiliarylemmaforpotentialfunctions Lemma68 ForanyN-playerMarkovpotentialgamewithinstantaneousrewardboundedin [0; 1], it holds that () 0 () N 1 for any; 0 2 and2 (S). 296 Proof. By the potential property, () 0 () = ( 0 1 ; 1 ) + ( 0 1 ; 1 0 f1;2g ; f1;2g ) + ::: + ( N ; 0 N 0 ) = (V 1 V 0 1 ; 1 1 ) + (V 0 1 ; 1 2 V 0 f1;2g ; f1;2g 2 ) + ::: + (V N ; 0 N N V 0 N ) N 1 where the last inequality is due toV i V 0 i 1 1 for any and 0 . By symmetry, 0 () () N 1 . D.7.2 Auxiliarylemmaforsingle-playerMDPs Lemma69(State-actionvaluefunctiondierence) Suppose that two MDPs have the same state/actionspaces,butdierentrewardandtransitionfunctions: (r;p)and (~ r; ~ p). Then,foragiven policy, two action value functions associated with two MDPs satisfy max s;a jQ (s;a) ~ Q (s;a)j 1 1 max s;a jr(s;a) ~ r(s;a)j + (1 ) 2 max s;a kp(js;a) ~ p(js;a)k 1 : Proof. 
By the Bellman equations, Q (s;a) = r(s;a) + X s 0 ;a 0 p(s 0 js;a)(a 0 js 0 )Q (s 0 ;a 0 ) ~ Q (s;a) = ~ r(s;a) + X s 0 ;a 0 ~ p(s 0 js;a)(a 0 js 0 ) ~ Q (s 0 ;a 0 ): Subtracting equalities above on both sides yields jQ (s;a) ~ Q (s;a)j jr(s;a) ~ r(s;a)j + X s 0 ;a 0 (p(s 0 js;a) ~ p(s 0 js;a))(a 0 js 0 )Q (s 0 ;a 0 ) + X s 0 ;a 0 ~ p(s 0 js;a)(a 0 js 0 ) Q (s 0 ;a 0 ) ~ Q (s 0 ;a 0 ) jr(s;a) ~ r(s;a)j + 1 kp(js;a) ~ p(js;a)k 1 + max s 0 ;a 0 Q (s 0 ;a 0 ) ~ Q (s 0 ;a 0 ) : Taking the maximum over (s;a) leads to max s;a jQ (s;a) ~ Q (s;a)j max s;a jr(s;a) ~ r(s;a)j + 1 max s;a kp(js;a) ~ p(js;a)k 1 + max s;a jQ (s;a) ~ Q (s;a)j which leads to the desired inequality after rearrangement. 297 D.7.3 Auxiliarylemmaformulti-playerMDPs Lemma70 Let, 0 and be three policies, and be some initial distribution. Let 0 be a state distribution that generates a state according to the following: rst sample an s 0 from d (), then execute for one step, and then output the next state. Then, d 0 0 1 2 := 1 sup ~ d ~ 1 2 : Proof. For a particular states ] , we view the supremum sup ~ d ~ (s ] ) (s ] ) as the optimal value of an MDP whose reward function isr(s;a) = 1 (s ] ) 1[s = s ] ] and initial state is generated by. The optimal value of this MDP is upper bounded by by Denition 2. We next consider the following non-stationary policy for this MDP: rst execute for one step, and then execute 0 in the rest of the steps. The discounted value of this non-stationary policy is lower bounded by X s Pr (s 1 =sjs 0 ; a 0 (js 0 )) d 0 s (s ] ) (s ] ) = X s 0 ;a 0 ;s (s 0 ) (a 0 js 0 )p(sjs 0 ;a 0 ) d 0 s (s ] ) (s ] ) : We can upper and lower bound the discounted sum above as the following: X s 0 ;a 0 ;s d (s 0 ) (a 0 js 0 )p(sjs 0 ;a 0 ) d 0 s (s ] ) (s ] ) X s 0 ;a 0 ;s (s 0 ) (a 0 js 0 )p(sjs 0 ;a 0 ) d 0 s (s ] ) (s ] ) where the right inequality is due to that this discounted value must be upper bounded by the op- timal value of this MDP, which has an upper bound , and the left inequality is by the denition of . Now notice that 0 (s) = X s 0 ;a 0 d (s 0 ) (a 0 js 0 )p(sjs 0 ;a 0 ) by the denition of 0 . Plugging this into the previous inequality, we get d 0 0(s ] ) (s ] ) : Since this holds for anys ] , this gives d 0 0 1 2 : 298 Algorithm13 Stochastic projected gradient descent with weighted averaging 1: Parameters:W , (k) , and (K) k . 2: Input: Stepsize, total number of iterationsK > 0. 3: Initialization:w (0) = 0. 4: for stepk = 1;:::;K do 5: Drawr (k) form a distribution such thatE[r (k) jw (k) ]2@f(w (k) ). 6: Updatew (k+1) =P kwkW w (k) (k) r (k) . 7: endfor 8: Output: P K k = 0 (K) k w (k) . D.7.4 Auxiliarylemmaforstochasticprojectedgradientdescent Algorithm 11 serves a sample-based algorithm if we solve the empirical risk minimization prob- lem (5.15) via a stochastic projected gradient descent, w (k+1) i = P kwkW w (k) i (k) ^ r (t) i (s (k) ;a (k) i )) (D.20) where ^ r (t) i := 2(h i ;w (k) i iR (k) i ) i is thekth gradient of (5.15) and (k) > 0 is the stepsize. We assume that the smallest eigenvalue of correlation matrixE s;a i i (s;a i ) i (s;a i ) > is positive. For a constrained convex optimization, minimize w2fwjkwkWg f(w), wheref(w) is a convex function andW > 0, we consider a basic method for solving this problem: the stochastic pro- jected gradient descent in Algorithm 13, whereP kwkW is a Euclidean projection inR d to the constraint setkwkW . Lemma71 Let w ? := argmin w2fwjkwkWg f(w). Suppose Var(r (k) ) 2 . 
If we run Algorithm 13 with stepsize $\alpha^{(k)} = O\!\left(\tfrac{1}{1+k}\right)$ and averaging weights $\lambda_k^{(K)} = \dfrac{1/\alpha^{(k)}}{\sum_{r=0}^{K} 1/\alpha^{(r)}}$, then
\[
\mathbb{E}\left[ f\!\left( \sum_{k=0}^{K} \lambda_k^{(K)} w^{(k)} \right) \right] \;-\; f(w^\star) \;\lesssim\; \frac{\sigma^2 W^2 d}{K}.
\]
Proof. See the proof of Theorem 1 in [58].
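To make this sample-based step concrete, the following Python sketch implements the projected stochastic subgradient update of Algorithm 13 with the 1/alpha^(k)-weighted averaging used in Corollaries 1 and 2. It is a minimal illustration, not the dissertation's implementation: the oracle `stochastic_grad` stands in for an unbiased subgradient estimate such as (D.20) for the regression loss (5.15), and the ball radius `radius`, dimension `dim`, and iteration count are assumed inputs.

```python
import numpy as np

def project_l2_ball(w, radius):
    """Euclidean projection onto the ball {w : ||w||_2 <= radius}."""
    norm = np.linalg.norm(w)
    return w if norm <= radius else (radius / norm) * w

def spgd_weighted_average(stochastic_grad, dim, radius, num_steps):
    """Stochastic projected subgradient descent with weighted averaging.

    stochastic_grad(w) returns an unbiased estimate of a subgradient of
    the convex objective at w.  Uses alpha^(k) = 2/(2+k) and averaging
    weights proportional to 1/alpha^(k), as in Corollaries 1 and 2.
    """
    w = np.zeros(dim)
    iterates, weights = [], []
    for k in range(1, num_steps + 1):
        alpha = 2.0 / (2.0 + k)                       # stepsize alpha^(k)
        w = project_l2_ball(w - alpha * stochastic_grad(w), radius)
        iterates.append(w.copy())
        weights.append(1.0 / alpha)                   # weight 1/alpha^(k)
    weights = np.asarray(weights) / np.sum(weights)
    return sum(lam * wk for lam, wk in zip(weights, iterates))
```

For the regression problem (5.15), `stochastic_grad(w)` would return the sampled gradient 2(<phi_i(s, a_i), w> - R_i) phi_i(s, a_i) from (D.20), computed on one trajectory collected as in Appendix D.5.1.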
Abstract
Reinforcement learning (RL) has proven its value through great empirical success in many artificial sequential decision-making control systems, but its uptake in complex real-world systems has been much slower. A wide gap between standard RL setups and reality often arises from the constraints and the multiple interacting agents present in real-world systems. To reduce this gap, we develop effective RL algorithms, with theoretical performance guarantees, that search for better control policies in two types of real-world stochastic control systems: constrained systems and multi-agent systems.
Part I of the dissertation is devoted to RL for constrained control systems. We study two settings of sequential decision-making control problems described by constrained Markov decision processes (MDPs) in which a controller (or an agent) aims at satisfying a constraint in addition to maximizing the standard reward objective. In the simulation setting, we propose a direct policy search method for infinite-horizon constrained MDPs: the natural policy gradient primal-dual method, which updates the primal policy via natural policy gradient ascent and the dual variable via projected sub-gradient descent. We establish a global convergence theory for our method using softmax, log-linear, and general smooth policy parametrizations, and demonstrate finite-sample complexity guarantees for two model-free extensions of our method. In the online episodic setting, we propose an online policy optimization method for episodic finite-horizon constrained MDPs: optimistic primal-dual proximal policy optimization, where we effectuate safe exploration through upper-confidence-bound optimism and address constraints via primal-dual optimization. We establish sublinear regret and constraint violation bounds that depend on the size of the state-action space only through the dimension of the feature mapping, so that our results hold even when the number of states goes to infinity.
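To illustrate the primal-dual structure described above, here is a minimal, hypothetical sketch of one iteration for a tabular softmax policy with a single constraint. It assumes oracle access to estimated Q-functions of the reward and of the constraint utility, and to an estimate of the constraint value; the dual cap `lam_max`, the threshold, and all function and parameter names are illustrative assumptions rather than the dissertation's exact algorithm.

```python
import numpy as np

def npg_primal_dual_step(pi, q_reward, q_utility, v_utility, lam,
                         threshold, eta_pi, eta_lam, lam_max):
    """One schematic natural policy gradient primal-dual iteration.

    pi        : (S, A) array with the current policy pi(a|s)
    q_reward  : (S, A) array of estimated reward Q-values under pi
    q_utility : (S, A) array of estimated constraint-utility Q-values
    v_utility : scalar estimate of the constraint value under pi
    lam       : current dual variable; threshold : constraint level b
    """
    # Primal step: for softmax policies, natural gradient ascent on the
    # Lagrangian reduces to a multiplicative-weights update driven by
    # Q_reward + lam * Q_utility (state-wise normalization below).
    q_lagrangian = q_reward + lam * q_utility
    logits = np.log(pi + 1e-12) + eta_pi * q_lagrangian
    new_pi = np.exp(logits - logits.max(axis=1, keepdims=True))
    new_pi /= new_pi.sum(axis=1, keepdims=True)

    # Dual step: projected subgradient descent on the dual variable,
    # clipped to the interval [0, lam_max].
    new_lam = float(np.clip(lam - eta_lam * (v_utility - threshold),
                            0.0, lam_max))
    return new_pi, new_lam
```

In a sample-based variant the Q-functions and the constraint value would themselves be estimated from data; here they are treated as given inputs to keep the sketch self-contained.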
Part II of the dissertation is devoted to RL for multi-agent control systems. We study two setups of multi-agent sequential decision-making control problems modeled by multi-agent MDPs in which multiple agents aim at maximizing their reward objectives. In the cooperative setup, we propose an online distributed temporal-difference learning algorithm for solving the classical policy evaluation problem with networked agents. Our algorithm works as a true stochastic primal-dual update using online Markovian samples and homotopy-based adaptive stepsizes. We establish an optimal finite-time error bound with a sharp dependence on the network size and topology. In the cooperative/competitive setup, we propose a new independent policy gradient method for learning a Nash policy of Markov potential games. We establish sublinear Nash regret bounds that are free of explicit dependence on the state space size, enabling our method to work for problems with a large state space and a large number of players. We demonstrate finite-sample complexity guarantees for a model-free extension of our method in the function approximation setting. Moreover, we identify a class of independent policy gradient methods that enjoys last-iterate convergence and a sublinear Nash regret bound for learning both zero-sum Markov games and Markov cooperative games.
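As a rough illustration of the independence in this second setup, the following sketch shows one round in which every player takes a projected ascent step on its own probability simplex using only its own estimated marginal Q-values. This is a simplified, hypothetical sketch under tabular assumptions (direct policy parametrization, oracle Q-estimates), not the sample-based algorithm analyzed in the dissertation; the function names and array shapes are assumptions of the sketch.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of a vector onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > (css - 1.0))[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def independent_policy_update(policies, q_estimates, eta):
    """One round of independent projected policy updates.

    policies[i]    : (S, A_i) array, player i's current policy
    q_estimates[i] : (S, A_i) array, player i's estimated marginal
                     Q-values Q_i(s, a_i) under the current joint policy
    Each player updates using only its own estimate; no player observes
    the other players' policies or rewards.
    """
    new_policies = []
    for pi_i, q_i in zip(policies, q_estimates):
        updated = np.stack([project_simplex(pi_i[s] + eta * q_i[s])
                            for s in range(pi_i.shape[0])])
        new_policies.append(updated)
    return new_policies
```

The projection step lets each player treat the rest of the players as part of the environment, which is what makes the method independent in the sense used above.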