NO-REGRET LEARNING AND LAST-ITERATE CONVERGENCE IN GAMES
by
Chung-Wei Lee
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)
May 2024
Copyright 2024 Chung-Wei Lee
Table of Contents
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x
Chapter 1: Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Online Learning Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Online Learning Algorithms and Their Optimistic Versions . . . . . . . . . . . . . . . . . . 3
1.4 Normal-Form, Extensive-form, and Convex Games . . . . . . . . . . . . . . . . . . . . . . . 5
1.5 Learning in Games . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.6 Outline of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
Chapter 2: No-Regret Learning and Last-Iterate Convergence in Normal-Form Games and
Constrained Saddle-Point Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2 Notations and Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.3 Convergence Results for OMWU . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.4 Convergence Results for OGDA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
Chapter 3: No-Regret Learning and Last-Iterate Convergence in Extensive-Form Games . . . . . . 29
3.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.2 Problem Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.3 Optimistic Regret-Minimization Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.4 Last-Iterate Convergence Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.5 Analysis Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.5.1 Proof of Theorem 13 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.5.2 Review for Normal-Form Games . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.5.3 Analysis of Theorem 15 and Theorem 16 . . . . . . . . . . . . . . . . . . . . . . . . 44
3.6 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
Chapter 4: Kernelized Multiplicative Weights: Bridging the Gap Between Learning in Extensive-Form and Normal-Form Games . . . . . . . . . . . . . . . . . . . . . . . 48
4.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.1.1 Online Learning and Multiplicative Weights Update . . . . . . . . . . . . . . . . . 51
4.1.2 Canonical Optimistic Learning Setup . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.2 Multiplicative Weights in Polyhedral Convex Games . . . . . . . . . . . . . . . . . . . . . . 54
4.3 Kernelized Multiplicative Weights Update . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.4 KOMWU in Extensive-Form Games . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.4.1 Extensive-Form Games and Tree-Form Sequential Decision Problems . . . . . . . . 61
4.4.2 Linear-Time Implementation of KOMWU . . . . . . . . . . . . . . . . . . . . . . . 63
4.4.3 KOMWU Regret Bounds and Convergence . . . . . . . . . . . . . . . . . . . . . . . 65
4.4.4 Experimental Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
Chapter 5: Near-Optimal No-Regret Learning in General Convex Games . . . . . . . . . . . . . . . 68
5.1 No-Regret Learning and Convex Games . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.1.1 Convex Games . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.1.2 Applications and Examples of Convex Games . . . . . . . . . . . . . . . . . . . . . 73
5.1.3 Online Linear Optimization and No-Regret Learning . . . . . . . . . . . . . . . . . 74
5.2 Near-Optimal No-Regret Learning in Convex Games . . . . . . . . . . . . . . . . . . . . . 75
5.2.1 Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
5.2.2 Overview of Our Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
5.2.3 Regret Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
5.2.4 Main Result . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
5.2.5 Implementation and Iteration Complexity . . . . . . . . . . . . . . . . . . . . . . . 81
Chapter 6: No-Regret and Last-Iterate Convergence Properties of Regret Matching Algorithms in
Games: Bridging the Gap Between Theory and Practice . . . . . . . . . . . . . . . . . . 84
6.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
6.2 Instability of (Predictive) Regret Matching+ . . . . . . . . . . . . . . . . . . . . . . . . . . 91
6.3 Conceptual Regret Matching+ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
6.4 Numerical Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
6.5 Preliminaries on Regret Matching+ in Two-Player Zero-Sum Normal-Form Games . . . . . 102
6.6 Non-Convergence of RM+, Alternating RM+, and PRM+ . . . . . . . . . . . . . . . . . . . 105
6.7 Convergence Properties of ExRM+ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
6.7.1 Asymptotic Last-Iterate Convergence of ExRM+ . . . . . . . . . . . . . . . . . . . 107
6.7.2 Best-Iterate Convergence Rate of ExRM+ . . . . . . . . . . . . . . . . . . . . . . . 111
6.7.3 Linear Last-Iterate Convergence for ExRM+ with Restarts . . . . . . . . . . . . . . 111
6.8 Last-Iterate Convergence of Smooth Predictive RM+ . . . . . . . . . . . . . . . . . . . . . . 113
6.9 Numerical Experiments of Last-Iterate Convergence . . . . . . . . . . . . . . . . . . . . . . 115
Chapter 7: Conclusions and Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
Appendix A . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
Omitted Details in Chapter 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
A.1 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
A.1.1 More Empirical Results for Matrix Games . . . . . . . . . . . . . . . . . . . . . . . 132
A.1.2 Matrix Game on Curved Regions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
A.1.3 Strongly-Convex-Strongly-Concave Games . . . . . . . . . . . . . . . . . . . . . . 133
A.1.4 An Example with β > 0 for SP-MS . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
A.1.5 Matrix Games with Multiple Nash Equilibria . . . . . . . . . . . . . . . . . . . . . . 136
A.2 Lemmas for Optimistic Mirror Descent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
A.3 An Auxiliary Lemma on Recursive Formulas . . . . . . . . . . . . . . . . . . . . . . . . . . 141
A.4 Proofs of Lemma 2 and Theorem 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
A.4.1 Some Problem-Dependent Constants . . . . . . . . . . . . . . . . . . . . . . . . . . 143
A.4.2 Auxiliary Lemmas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
A.4.3 Proofs of Lemma 2 and Theorem 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
A.5 Proofs of Lemma 4 and the Sum-of-Duality-Gap Bound . . . . . . . . . . . . . . . . . . . . 157
A.6 The Equivalence Between SP-MS and Metric Subregularity . . . . . . . . . . . . . . . . . . 160
A.7 Proof of Theorem 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
A.8 Proof of Theorem 6 and Theorem 7 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
A.9 Proof of Theorem 8 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
A.10 Proof of Theorem 9 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
Appendix B . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
Omitted Details in Chapter 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
B.1 Examples of EFG and Treeplexes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
B.2 Omitted Details of Section 3.6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
B.2.1 Additional Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
B.2.2 Description of the Games . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
B.2.3 Parameter Selection in the Dilated Regularizers . . . . . . . . . . . . . . . . . . . . 187
B.3 Omitted Details of Section 3.3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
B.4 CFR-Based Algorithms and Proof of Theorem 12 . . . . . . . . . . . . . . . . . . . . . . . . 189
B.4.1 CFR and Its Optimistic Version . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
B.4.2 CFR+ and Its Optimistic Version . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191
B.4.3 Proof of Theorem 12 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192
B.4.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
B.5 Proofs of Theorem 15 and Theorem 16 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
B.5.1 Strict Complementary Slackness and Proof of Equation (3.6) in EFGs . . . . . . . . 197
B.5.1.1 Strict Complementary Slackness . . . . . . . . . . . . . . . . . . . . . . . 198
B.5.1.2 Some Problem-Dependent Constants . . . . . . . . . . . . . . . . . . . . 200
B.5.1.3 Proof of Lemma 88 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204
B.5.2 Proofs of Equation (3.12) and Equation (3.13) . . . . . . . . . . . . . . . . . . . . . 209
B.5.3 Lower Bounds on the Probability Masses . . . . . . . . . . . . . . . . . . . . . . . . 211
B.5.4 Proof of Theorem 15 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
B.5.5 Proof of Theorem 16 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
B.5.5.1 The Significant Difference Lemma . . . . . . . . . . . . . . . . . . . . . . 214
B.5.5.2 The Counterpart of Equation (3.11) for DOMWU . . . . . . . . . . . . . 219
B.5.5.3 Proof of Theorem 16 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222
B.5.6 Remarks on DOGDA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224
Appendix C . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226
Omitted Details in Chapter 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226
C.1 Additional Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226
C.1.1 More Results for Optimistic Algorithms in Games . . . . . . . . . . . . . . . . . . . 226
C.1.2 Approaches in Online Combinatorial Optimization . . . . . . . . . . . . . . . . . . 227
C.2 Pseudocode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
C.3 Extensive-Form Games . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
C.4 Experimental Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229
C.5 Proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
C.6 Further Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240
C.6.1 n-sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240
C.6.1.1 Polynomial, O(d min{n, d − n})-time kernel evaluation . . . . . . . . . 241
C.6.1.2 Implementing KOMWU with O(d min{n, d−n}) Per-Iteration complexity 241
C.6.2 Unit Hypercube . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243
C.6.3 Flows in Directed Acyclic Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243
C.6.4 Permutations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 244
C.6.5 Cartesian Product . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 244
Appendix D . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245
Omitted Details in Chapter 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245
D.1 Omitted Proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245
D.1.1 Preliminary Proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245
D.1.2 Analysis of OFTRL with Logarithmic Regularizer . . . . . . . . . . . . . . . . . . . 248
D.1.3 RVU Bound in the Original Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . 254
D.1.4 Main Result: Proof of Theorem 39 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 258
D.1.5 Extending the Analysis under Approximate Iterates . . . . . . . . . . . . . . . . . . 260
D.2 Implementation via Proximal Oracles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 264
D.2.1 The Proximal Newton Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265
D.2.2 Proximal Oracle for Normal-Form and Extensive-Form Games . . . . . . . . . . . . 267
Appendix E . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 276
Omitted Details in Chapter 6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 276
E.1 Proof of Lemma 44 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 276
E.2 Instability of RM+ and predictive RM+ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 277
E.2.1 Proof of Theorem 45 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 277
E.2.2 Counterexample on 3 × 3 matrix game for RM+ . . . . . . . . . . . . . . . . . . . 279
E.3 Proof of Proposition 46 and Corollary 47 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 279
E.4 Proof of Theorem 48 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 283
E.5 Proof of Theorem 49 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 285
E.6 Proof of Theorem 52 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 287
E.7 Proof of Theorem 54 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 290
E.8 Proof of Theorem 55 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 292
E.9 Proof of Lemma 56 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 295
E.10 Extensive-Form Games . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 296
E.11 Details on the Numerical Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 301
E.11.1 Performances of ExRM+, Stable PRM+ and Smooth PRM+ on our small matrix game example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 302
E.11.2 Extensive-Form Game Used in the Experiments . . . . . . . . . . . . . . . . . . . . 302
E.12 Useful Propositions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 305
E.13 Proof of Theorem 57 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 305
E.14 Proofs for Section 6.7 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 309
E.14.1 Proofs for Section 6.7.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 309
E.14.2 Proofs for Section 6.7.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 313
E.14.3 Proofs for Section 6.7.3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 315
E.15 Proofs for Section 6.8 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 317
E.15.1 Asymptotic Convergence of SPRM+ . . . . . . . . . . . . . . . . . . . . . . . . . . 317
E.15.2 Best-Iterate Convergence of SPRM+ . . . . . . . . . . . . . . . . . . . . . . . . . . 320
E.15.3 Linear Convergence of RS-SPRM+. . . . . . . . . . . . . . . . . . . . . . . . . . . . 322
E.16 Additional Numerical Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 325
E.16.1 Game Instances . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 325
E.16.2 Experiments with Restarting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 326
List of Tables
1.1 A mapping from chapters to papers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
4.1 Properties of various no-regret algorithms for EFGs . . . . . . . . . . . . . . . . . . . . . . 48
5.1 Comparison of prior results on minimizing external regret in games . . . . . . . . . . . . . 71
6.1 Summary of regret guarantees for the algorithms studied in Chapter 6 . . . . . . . . . . . 86
List of Figures
2.1 Experiments of OGDA and OMWU with different learning rates for a matrix game f(x, y) = x^⊤Gy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.1 Experiments on Kuhn poker, Pursuit-evasion, and Leduc poker . . . . . . . . . . . . . . . . 46
4.1 Construction of the Vertex OMWU algorithm. The matrix V has the (possibly
exponentially-many) vertices VΩ of the convex polytope Ω as columns. . . . . . . . . . . . 57
4.2 Experimental comparison of KOMWU with CFR (left two columns) and DOMWU (right
two columns) in two standard poker benchmark games. . . . . . . . . . . . . . . . . . . . . 67
6.1 Left plots show the iterate-to-iterate variation in the last 100 iterates of predictive RM+.
Center plots show the regret for the x and y players under predictive RM+. Right plots
show empirical convergence speed of RM+ (top row) and Predictive RM+ (bottom row). . 93
6.2 Empirical performances of RM+, PRM+, ExRM+, Stable PRM+ and Smooth PRM+ on our 3 × 3 matrix game (left plot) and on random instances for different step sizes. . . . . . . . 100
6.3 Practical performance of our variant of clairvoyant CFR (‘Cvynt CFR’) compared to
predictive CFR, across four multiplayer extensive-form games . . . . . . . . . . . . . . . . 101
6.4 Duality gap of the current iterates generated by RM+, RM+ with alternation, Predictive
RM+ (PRM+), and Predictive RM+ with alternation on the zero-sum game with payoff
matrix A = [[3, 0, −3], [0, 3, −4], [0, 0, 1]]. . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
6.5 Pictorial illustration of Lemma 62. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
6.6 Empirical performances of several algorithms on the 3 × 3 matrix game (left plot), Kuhn
poker and Goofspiel (center plots), and random instances (right plot). . . . . . . . . . . . . 115
A.1 Experiments of convergence with respect to ln ∥z^t − z^∗∥ of OGDA and OMWU . . . . . . 133
A.2 Experiments of OGDA on matrix games with curved regions where f(x, y) = x_2 y_1 − x_1 y_2, X = Y ≜ {(a, b), 0 ≤ a ≤ 1/2, 0 ≤ b ≤ 1/2^n, a^n ≤ b}, and n = 2, 4, 6, 8 . . . 134
A.3 Experiments on a strongly-convex-strongly-concave game where f(x, y) = x_1^2 − y_1^2 + 2x_1 y_1 and X = Y ≜ {(a, b), 0 ≤ a, b ≤ 1, a + b = 1} . . . . . . . . . . . . . . . . . . . . 134
A.4 Experiments of OGDA on a set of games satisfying SP-MS with β > 0, where f(x, y) = x_1^{2n} − x_1 y_1 − y_1^{2n} for some integer n ≥ 2 and X = Y ≜ {(a, b), 0 ≤ a, b ≤ 1, a + b = 1} . . . 135
A.5 Experiments of OGDA and OMWU with different learning rates on a matrix game with
multiple Nash equilibria . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
B.1 A game tree for Kuhn poker . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
B.2 The decision space for player x and treeplex X . . . . . . . . . . . . . . . . . . . . . . . . 185
B.3 The decision space for player y and treeplex Y . . . . . . . . . . . . . . . . . . . . . . . . . 186
B.4 Experiments on Kuhn Leduc poker for DOMWU (left) and VOGDA (right) with small step
sizes and more time steps. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
B.5 Last-iterate divergence of CFR, CFR+ with simultaneous updates, and their optimistic
versions in the rock-paper-scissors game . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
C.1 Tree-form sequential decision making process of the first acting player in the game of
Kuhn poker. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228
C.2 Maximum per-player regret cumulated by KOMWU for four different choices of constant learning rate η^t = η ∈ {0.1, 1, 5, 10}, compared to that cumulated by CFR and CFR(RM+) in two multiplayer poker games. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
E.1 ∥x^{t+1} − x^t∥_2^2 (Figure E.1a) and ∥y^{t+1} − y^t∥_2^2 (Figure E.1b) for RM+. . . . . . . . . . . . 279
E.2 Empirical performance of ExRM+ (with linear averaging) on our 3 × 3 matrix game from
Section 6.1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 303
E.3 Empirical performance of Stable PRM+ (with alternation and linear averaging) on our
3 × 3 matrix game from Section 6.1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 303
E.4 Empirical performance of Smooth PRM+ (with alternation and linear averaging) on our
3 × 3 matrix game from Section 6.1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 304
E.5 For a stepsize η = 0.05, empirical performances of several algorithms on our 3 × 3 matrix
game (left plot), Kuhn Poker and Goofspiel (center plots) and on random instances (right
plot). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 326
Abstract
No-regret learning (or online learning) is a general framework for studying sequential decision-making.
Within this framework, the learner iteratively makes decisions, receives feedback, and adjusts their strategies. In this thesis, we analyze the learning dynamics of no-regret algorithms in game scenarios where players play a single game repeatedly with particular no-regret algorithms. This exploration not only raises fundamental questions at the intersection of machine learning and game theory but also serves as a vital element in recent breakthroughs in artificial intelligence.
A notable instance of this influence is in the widespread adoption of the “self-play” concept in game
AI development, exemplified in games such as Go and Poker. With this technique, AI agents learn how to
play by competing against themselves to enhance their performance step by step. In the terminology of the literature on learning in games, the method involves running a set of online learning algorithms for
players in the game to compute and approximate their game equilibria. To learn more efficiently in games,
it is critical to design better online learning algorithms. Standard notions evaluating online learning algorithms in games include “regret,” assessing the average quality of iterates, and “last-iterate convergence,”
representing the quality of the final iterates.
In this thesis, we design online learning algorithms and prove that they achieve near-optimal regret
or fast last-iterate convergence in various game settings. We start from the simplest two-player zero-sum normal-form games and extend the results to multi-player games, extensive-form games that capture
sequential interaction and imperfect information, and finally, the most general convex games. Moreover,
we also analyze the weaknesses of prevalent online learning algorithms widely employed in practice and
propose a fix for them. This not only makes the algorithms more robust but also sheds light on designing better learning algorithms for artificial intelligence in the future.
Chapter 1
Introduction
Games are prevalent in machine learning and artificial intelligence. The achievements of artificial intelligence in games such as Go [144, 145], Poker [10, 126, 16, 17], and video games [157] set important
milestones in the development of artificial intelligence. Meanwhile, game-theoretic formulations have also
become a popular way to describe machine learning problems [141, 69]. Even though the underlying games
of these applications are very different, they all boil down to the following question: how to design machine
learning algorithms to find good solutions to the games of interest efficiently? It turns out that finding a good
solution in games is fundamentally different from other machine learning problems, such as supervised
learning. Gradient descent, arguably the most prevalent algorithm in machine learning, has been shown to fail in some simple games [122, 119]. Thus, understanding game-solving requires a framework different from the traditional supervised learning paradigm.
Online learning, also called no-regret learning, has become an important framework for analyzing
game dynamics. Fundamental connections have been forged between no-regret learning and game-theoretic
solution concepts [59, 76, 57, 139]. Nevertheless, the traditional no-regret learning framework is overly
pessimistic, insisting on modeling the environment in a fully adversarial way. While this well-understood
worst-case view might be justifiable for applications such as security games, it could be far from optimal in
more benign and predictable environments, including the setting of training agents using self-play. Additionally, classical online learning guarantees for converging to game-theoretic solutions concern only ergodic (average-iterate) convergence. Consequently, theoretical convergence rates of online learning algorithms are typically limited to O(1/√T) or O(1/T) for T rounds. In contrast, it is known that linear convergence rates are achievable for certain other first-order algorithms [155, 66].
In this thesis, we take a closer look at the online learning framework and its application in games.
We provide a more adaptive and refined analysis of online learning algorithms in various types of games
and obtain stronger theoretical bounds and convergence guarantees. Before formally stating our main
contributions, we first introduce some frequently used notations and give an overview of some important
topics throughout this thesis.
1.1 Notations
For d ∈ N, we write 1_d ∈ R^d for the vector with 1 on every component. The simplex of dimension d − 1 is ∆_d = {x ∈ R^d_+ | ⟨x, 1_d⟩ = 1}. The vector 0 has 0 on every component, and its dimension is implicit. We use the convention that 0/0 = (1/d)1_d. For x ∈ R, we write [x]^+ for the positive part of x: [x]^+ = max{0, x}, and we overload this notation to vectors component-wise. For two vectors a and b, a ≥ b means a is at least b component-wise. For a vector z, we use ∥z∥_p to denote its p-norm (with ∥z∥ being a shorthand for ∥z∥_2).

For a convex function ψ, the associated Bregman divergence is defined as D_ψ(u, v) = ψ(u) − ψ(v) − ⟨∇ψ(v), u − v⟩, and ψ is called κ-strongly convex with respect to the p-norm if D_ψ(u, v) ≥ (κ/2)∥u − v∥_p^2 holds for all u and v in the domain. The Kullback-Leibler divergence, which is the Bregman divergence with respect to the entropy function, is denoted by KL(·, ·). For u ∈ R^d, we define supp(u) = {i : u_i > 0}. Finally, we use [[N]] to denote the set {1, 2, . . . , N} for some positive integer N.
1.2 Online Learning Framework
In the online learning framework, the learner has to select a strategy x^t from the decision space X ⊆ R^d at each round t ∈ [[T]]. At the end of round t, the learner receives loss feedback ℓ^t ∈ R^d and suffers loss ⟨x^t, ℓ^t⟩. The goal is to minimize the regret over T rounds, which is defined as

  Reg_T = max_{x∈X} [ Σ_{t=1}^T ⟨x^t, ℓ^t⟩ − Σ_{t=1}^T ⟨x, ℓ^t⟩ ].

The loss ℓ^t can be chosen adversarially and can depend on x^1, x^2, . . . , x^t. In the next section, we introduce some important online learning algorithms considered in the following chapters.
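To make the protocol concrete, the following minimal Python sketch (illustrative only; the learner object with act and observe methods is a placeholder interface, not a construct used elsewhere in this thesis) plays T rounds and evaluates the regret in the common case where X is the probability simplex:

import numpy as np

def run_online_learning(learner, loss_fn, T):
    """Play T rounds: the learner selects x^t, then observes the loss vector l^t."""
    xs, losses = [], []
    for t in range(T):
        x = learner.act()            # x^t in X
        loss = loss_fn(t, x)         # l^t, possibly chosen adversarially
        learner.observe(loss)
        xs.append(x)
        losses.append(loss)
    return np.array(xs), np.array(losses)

def regret_over_simplex(xs, losses):
    """Reg_T when X is the probability simplex: the best fixed comparator
    is the single best action in hindsight."""
    learner_loss = np.einsum("td,td->", xs, losses)
    best_fixed_loss = losses.sum(axis=0).min()
    return learner_loss - best_fixed_loss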
1.3 Online Learning Algorithms and Their Optimistic Versions
For simplicity, here we assume X = ∆_d and introduce three key algorithms in this case. We will consider more general settings and other variants of these algorithms in the following chapters. The algorithms all start with an arbitrary initial point x^1 ∈ ∆_d and consider the following updates:
(Online) Gradient Descent (GD). The algorithm takes a parameter η > 0 and performs the following steps at t = 1, 2, . . . :

  x̃^t = x^t − ηℓ^t,    (1.1)
  x^{t+1} = argmin_{x∈X} ∥x − x̃^t∥^2.    (1.2)

Here Equation (1.1) is an (unconstrained) gradient descent step and Equation (1.2) is the Euclidean projection that ensures x^{t+1} ∈ X.
Multiplicative Weights Updates (MWU). The algorithm takes a parameter η > 0 and performs the following steps at t = 1, 2, . . . (write x = (x_1, x_2, . . . , x_d)):

  x̃^t_j = x^t_j · exp(−ηℓ^t_j), ∀ j ∈ [[d]],    (1.3)
  x^{t+1} = x̃^t / ∥x̃^t∥_1.    (1.4)

Here, Equation (1.3) is a “multiplicative version” of gradient descent, and Equation (1.4) is the normalization that ensures x^{t+1} ∈ X.
Regret Matching+ (RM+). The algorithm has no parameters and initializes R^1 = 0. Then, it performs the following steps at t = 1, 2, . . . :

  R^{t+1} = [R^t + ⟨x^t, ℓ^t⟩1_d − ℓ^t]^+,    (1.5)
  x^{t+1} = R^{t+1} / ∥R^{t+1}∥_1.    (1.6)

Here ⟨x^t, ℓ^t⟩1_d − ℓ^t is the instantaneous regret at time t against every coordinate, so Equation (1.5) performs a truncated version of instantaneous regret ascent. Finally, Equation (1.6) is the normalization similar to Equation (1.4).
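For concreteness, one step of each of the three updates can be written in a few lines of Python. This is an illustrative sketch only: project_simplex is the standard sorting-based Euclidean projection onto the simplex, and the RM+ helper falls back to the uniform distribution, following the 0/0 = (1/d)1_d convention of Section 1.1.

import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u - css / np.arange(1, len(v) + 1) > 0)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1.0), 0.0)

def gd_update(x, loss, eta):
    """One step of online gradient descent, Equations (1.1)-(1.2)."""
    return project_simplex(x - eta * loss)

def mwu_update(x, loss, eta):
    """One step of multiplicative weights update, Equations (1.3)-(1.4)."""
    w = x * np.exp(-eta * loss)
    return w / w.sum()

def rm_plus_update(R, x, loss):
    """One step of Regret Matching+, Equations (1.5)-(1.6); returns (R^{t+1}, x^{t+1})."""
    R_next = np.maximum(R + np.dot(x, loss) - loss, 0.0)
    if R_next.sum() > 0:
        x_next = R_next / R_next.sum()
    else:
        x_next = np.full_like(R_next, 1.0 / len(R_next))  # 0/0 = (1/d)1_d convention
    return R_next, x_next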
Optimistic Algorithms. The optimistic version of the above algorithms has the following structure. It maintains an auxiliary path using the original update but performs an extra update step that moves away from this path to obtain its final strategy at every round. For example, if we view the GD update in Equation (1.1) and Equation (1.2) as a function x^{t+1} = GD_Update(x^t, ℓ^t), optimistic gradient descent (OGDA) initializes x^1 and x̂^1 arbitrarily and performs the following steps at t = 1, 2, . . . :

  x̂^{t+1} = GD_Update(x̂^t, ℓ^t),    (1.7)
  x^{t+1} = GD_Update(x̂^{t+1}, ℓ^t).    (1.8)

It maintains an auxiliary path x̂^{t+1} using the original GD update (Equation (1.7)) and takes one extra GD update from the path to get x^{t+1} (Equation (1.8)).

Optimistic multiplicative weights updates (OMWU) follow the same logic. Rewrite Equation (1.3) and Equation (1.4) as x^{t+1} = MWU_Update(x^t, ℓ^t); then OMWU at each time performs steps similar to Equation (1.7) and Equation (1.8) but replaces GD_Update with MWU_Update. The optimistic version of RM+ (PRM+) can also be derived following a similar principle. We defer the details to Chapter 6.
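Continuing the illustrative sketch above, the optimistic wrapper is a small generic modification that can be applied to any of the base updates; the function below reuses gd_update and mwu_update from the previous snippet.

def optimistic_step(base_update, x_hat, loss, *args):
    """One round of Equations (1.7)-(1.8): advance the auxiliary path with the
    base update, then take one extra step from it using the same loss."""
    x_hat_next = base_update(x_hat, loss, *args)   # Equation (1.7): auxiliary path
    x_next = base_update(x_hat_next, loss, *args)  # Equation (1.8): played strategy
    return x_hat_next, x_next

# OGDA / OMWU are obtained by plugging in the corresponding base update, e.g.
# x_hat, x = optimistic_step(mwu_update, x_hat, loss, eta)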
1.4 Normal-Form, Extensive-form, and Convex Games
In a normal-form game (NFG), there are n players, and every player i ∈ [[n]] has a finite set of actions A_i and a utility function U_i : ×_{j=1}^n A_j → R. The value U_i(a_1, . . . , a_n) represents player i’s utility when each player j plays action a_j ∈ A_j. Players are allowed to randomize; each player i can choose a strategy (probability distribution) x_i ∈ ∆(A_i), where ∆(A_i) is the set of probability distributions supported on A_i, and the utility of player i becomes

  E_{(a_1,...,a_n)∼(x_1,...,x_n)} [U_i(a_1, . . . , a_n)],

which we can write more compactly as E_{a∼x}[U_i(a)].
Extensive-form games (EFGs) generalize NFGs by capturing sequential and simultaneous moves, stochasticity from the environment, and imperfect information. We will have more detailed introductions in the following chapters. In a nutshell, extensive-form games can be seen as a subset of normal-form games where each A_i is a set of 0/1 integral vertices, that is, A_i ⊆ {0, 1}^{d_i} for some integer d_i.

Convex games are even more general than EFGs. The strategy space X_i of a convex game is, for every player i, a nonempty convex and compact set, and the utility function u_i : ×_{j=1}^n X_j → R is concave in x_i for every x_{−i} = (x_1, . . . , x_{i−1}, x_{i+1}, . . . , x_n) ∈ ×_{j≠i} X_j. It is not hard to verify that convex games include both normal-form games and extensive-form games as special cases. We will introduce more examples and applications in Chapter 5.
Here, we also introduce a pervasive class of games, two-player zero-sum games. As the name suggests,
these games have two players, and the sum of their utility functions is zero, that is, u1 = −u2. Therefore, a player in these games has interests completely conflicting with their opponent’s. Since we study this setting
extensively in this thesis, we introduce a set of notations tailored to it. We drop the index i but call the two
players player x and player y. We also overload the notation x and y so that they are used to represent
the probability distributions the two players play, respectively. Similarly, we use X and Y to denote their
strategy space. When discussing two-player zero-sum games, we also use f = −u1 so that
u1(x, y) = −f(x, y), u2(x, y) = f(x, y).
In this thesis, we are particularly interested in finding an (approximate) Nash equilibrium in a two-player zero-sum game. Specifically, we want to find (x^∗, y^∗) ∈ X × Y satisfying f(x^∗, y) ≤ f(x^∗, y^∗) ≤ f(x, y^∗) for any (x, y) ∈ X × Y. Given a pair (x, y) ∈ X × Y, we measure its closeness to the set of Nash equilibria by its duality gap, defined as

  DualityGap(x, y) = max_{y′∈Y} f(x, y′) − min_{x′∈X} f(x′, y),

which is always non-negative since max_{y′∈Y} f(x, y′) ≥ f(x, y) ≥ min_{x′∈X} f(x′, y).
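For the bilinear case f(x, y) = x^⊤Gy over probability simplices (the matrix games studied in Chapter 2), the duality gap has a simple closed form because the inner maximization and minimization are attained at coordinates; a small illustrative snippet:

import numpy as np

def duality_gap_matrix_game(G, x, y):
    """DualityGap(x, y) = max_{y'} f(x, y') - min_{x'} f(x', y) for f(x, y) = x^T G y.
    Over simplices the inner max/min reduce to the largest entry of x^T G
    and the smallest entry of G y."""
    return float(np.max(x @ G) - np.min(G @ y))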
1.5 Learning in Games
In the learning in games protocol, players use online learning algorithms to play a game repeatedly. Here, we take learning in NFGs as an example. The setting can be naturally extended to EFGs and convex games; we will cover them later. Given a normal-form game with its action set A_i and utility function U_i for every player i, we set the decision space of player i to be X_i = ∆(A_i). Each player i doesn’t have direct access to their utility function, but they can observe a loss vector at every round jointly determined by all the players. At round t, every player i simultaneously plays x^t_i according to their online learning algorithm. After that, each player i receives their loss vector ℓ^t_i such that

  ℓ^t_i[a_i] = E_{(a_1,...,a_{i−1},a_{i+1},...,a_n) ∼ (x_1,...,x_{i−1},x_{i+1},...,x_n)} [−U_i(a_1, . . . , a_n)].

In other words, each coordinate of ℓ^t_i is the negative expected utility of player i when they play the corresponding action with probability 1 and all the other players follow their strategies at that round. It is then not hard to see that ⟨x^t_i, ℓ^t_i⟩, the loss player i suffers, is the negative expected utility of player i in the game when each player j plays x^t_j.
It turns out that the concept of regret in online learning is closely related to computing the equilibrium of the game under this protocol. Specifically, suppose each player uses an online learning algorithm
that suffers regret at most ϵT over T rounds. Then the time-average joint distribution of all the players’ strategies is an ϵ-approximate coarse correlated equilibrium. We omit the details and the formal statement here but want to highlight the importance of analyzing the individual regret of players in games, given its close connection to convergence to the game equilibria. Lower individual regret implies faster convergence to the equilibria. This also means that the corresponding online learning algorithms learn more efficiently in games.
In two-player zero-sum games, coarse correlated equilibria coincide with Nash equilibria. Specifically, if player x suffers at most ϵ_x T regret while player y suffers at most ϵ_y T regret over T rounds, we have

  DualityGap(x̄^T, ȳ^T) ≤ ϵ_x + ϵ_y,

where

  x̄^T = (1/T) Σ_{t=1}^T x^t,    ȳ^T = (1/T) Σ_{t=1}^T y^t.
Thus, low regret ensures that the iterates converge to Nash equilibria on average. However, we have a more ambitious goal in two-player zero-sum games. Instead of average-iterate convergence, we are interested in online learning algorithms converging pointwise, that is, (x^t, y^t) → (x^∗, y^∗). We call this favorable property last-iterate convergence.
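As an informal illustration of this protocol (reusing mwu_update and duality_gap_matrix_game from the earlier sketches; the step size below is an arbitrary choice), self-play with MWU in a matrix game and the duality gap of the resulting average iterates can be simulated as follows:

import numpy as np

def self_play_mwu(G, T, eta=0.1):
    """Both players of the matrix game f(x, y) = x^T G y run MWU against each other;
    returns the duality gap of the average iterates (x_bar^T, y_bar^T)."""
    m, n = G.shape
    x, y = np.full(m, 1.0 / m), np.full(n, 1.0 / n)
    x_sum, y_sum = np.zeros(m), np.zeros(n)
    for _ in range(T):
        loss_x, loss_y = G @ y, -(G.T @ x)   # losses induced by the opponent's strategy
        x_sum += x
        y_sum += y
        x, y = mwu_update(x, loss_x, eta), mwu_update(y, loss_y, eta)
    return duality_gap_matrix_game(G, x_sum / T, y_sum / T)

Consistent with the discussion above, the duality gap of the averages shrinks as T grows, whereas the last iterates of plain MWU may keep cycling; obtaining last-iterate convergence is the focus of the following chapters.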
1.6 Outline of the Thesis
In this thesis, we greatly extend the understanding of low regret and last-iterate convergence properties of
online learning algorithms in NFGs, EFGs, and convex games. Specifically, our main contributions
are summarized below:
• In Chapter 2, for OMWU in two-player zero-sum NFGs, we show that when the equilibrium is
unique, linear last-iterate convergence is achieved. We then extend the results to more general
objectives and feasible sets for the OGDA algorithm, by introducing a sufficient condition under
which OGDA exhibits concrete last-iterate convergence rates. We show that two-player zero-sum bilinear games over any polytope (which include NFGs) satisfy this condition and OGDA converges
exponentially fast even without the unique equilibrium assumption.
• In Chapter 3, we study last-iterate convergence in EFGs. Although the last-iterate convergence
results of OGDA cover EFGs, the projection step may be costly in general and thus less favorable
in this setting. Inspired by this, we show that DOMWU, a variant of OMWU in EFGs that can
run efficiently, has a linear convergence rate under the assumption that there is a unique Nash
equilibrium, the same assumption under which OMWU achieves similar results for NFGs.
• In Chapter 4, we consider another way to achieve last-iterate convergence in EFGs: running OMWU
on an NFG converted from the original EFG. It is well known that there is an exponential blowup of
the strategy space with this conversion, so the resulting algorithm is inefficient. However, we show
that, surprisingly, the algorithm can be simulated in linear time per iteration in the game tree size
using a kernel trick. The resulting algorithm, Kernelized OMWU (KOMWU), applies more broadly to
all convex games whose strategy space is a polytope with 0/1 integral vertices. In the particular case
of EFGs, KOMWU closes several standing gaps between NFG and EFG learning, by enabling direct,
black-box transfer to EFGs of desirable properties of learning dynamics that were so far known to
be achievable only in NFGs.
• In Chapter 5, we establish the first online learning algorithm with O(log T) per-player regret in general convex games. Our algorithm is based on an instantiation of optimistic follow-the-regularized-leader over an appropriately lifted space using a self-concordant regularizer that is, peculiarly, not a barrier for the feasible region. Further, our learning dynamics are efficiently implementable given access to a proximal oracle for the convex strategy set, leading to O(log log T) per-iteration complexity; we also give extensions when access to only a linear optimization oracle is assumed.
• In Chapter 6, we turn our focus to RM+ and its variants. We first give counterexamples showing that RM+ and PRM+ can be unstable, which might cause other players to suffer large regret. We then provide smoothing techniques to achieve O(1) social regret in NFGs. We also apply our techniques to clairvoyant updates in the uncoupled learning setting for RM+, introduce Extragradient RM+, and prove similar results. As for last-iterate convergence, we first show numerically that several practical variants of RM+ all lack last-iterate convergence guarantees. We then prove that Extragradient RM+, equipped with our smoothing techniques, enjoys asymptotic last-iterate convergence and 1/√t best-iterate convergence. Finally, we introduce its restarted variant and show that it enjoys linear-rate last-iterate convergence.
Finally, we provide a mapping between the chapter index and the source of the content in Table 1.1.

Chapter    Paper
Chapter 2  [161]
Chapter 3  [105]
Chapter 4  [55]
Chapter 5  [46]
Chapter 6  [48, 22]
Table 1.1: A mapping from chapters to papers
Chapter 2
No-Regret Learning and Last-Iterate Convergence in Normal-Form
Games and Constrained Saddle-Point Optimization
In this chapter, we focus on two-player zero-sum games, which can be formulated as (constrained) saddle-point optimization problems. Saddle-point optimization in the form of min_x max_y f(x, y) dates back to
[129], where the celebrated minimax theorem was discovered. Due to the advances of Generative Adversarial Networks (GANs) [69] (whose training is itself a saddle-point problem), the question of how to find a good approximation of the saddle point, especially via an efficient iterative algorithm, has recently gained significant
research interest. Simple algorithms such as Gradient Descent Ascent (GDA) and Multiplicative Weights
Update (MWU) are known to cycle and fail to converge even in simple bilinear cases (see e.g., [8] and [30]).
Many recent works consider resolving this issue via simple modifications of standard algorithms,
usually in the form of some extra gradient descent/ascent steps. This includes Extra-Gradient methods (EG) [109, 123], Optimistic Gradient Descent Ascent (OGDA) [38, 64, 120], Optimistic Multiplicative
Weights Update (OMWU) [39, 106], and others. In particular, OGDA and OMWU are suitable for the repeated game setting where two players repeatedly propose x^t and y^t and receive only ∇_x f(x^t, y^t) and ∇_y f(x^t, y^t) respectively as feedback, with the goal of converging to a saddle point, or equivalently a Nash equilibrium using game theory terminology. One notable benefit of OGDA and OMWU is that they are also no-regret algorithms with important applications in online learning, especially when playing against
adversarial opponents [31, 135].
Despite considerable progress, especially for the unconstrained setting, the behavior of these algorithms in the constrained setting, where x and y are restricted to closed convex sets X and Y respectively, is still not fully understood. This is even true when f is a bilinear function and X and Y are simplices,
known as the classic two-player zero-sum games in normal form, or simply matrix games. Indeed, existing
convergence results on the last iterate of OGDA or OMWU for matrix games are unsatisfactory — they
lack explicit convergence rates [134, 120], only apply to exponentially small learning rates and thus do not reflect the behavior of the algorithms in practice [39], or require additional conditions such as uniqueness of
the equilibrium or a good initialization [39].
Motivated by this fact, we first improve the last-iterate convergence result of OMWU for matrix games.
Under the same unique equilibrium assumption as made by Daskalakis and Panageas [39], we show linear
convergence with a concrete rate in terms of the Kullback-Leibler divergence between the last iterate and
the equilibrium, using a learning rate whose value is set to a universal constant.
We then significantly extend our results and consider OGDA for general constrained and smooth
convex-concave saddle-point problems, without the uniqueness assumption. Specifically, we start with
proving an average duality gap convergence of OGDA at the rate of O(1/
√
T) after T iterations. Then,
to obtain a more favorable last-iterate convergence in terms of the distance to the set of equilibria, we
propose a general sufficient condition on X , Y, and f, called Saddle-Point Metric Subregularity (SP-MS),
under which we prove concrete last-iterate convergence rates, all with a constant learning rate and without further assumptions.
Our last-iterate convergence results of OGDA greatly generalize that of [84, Theorem 2], which itself
is a consolidated version of results from several earlier works. The key implication of our new results is
that, by showing that matrix games satisfy our SP-MS condition, we provide by far the most general last-iterate guarantee with a linear convergence rate for this problem using OGDA. Compared to that of OMWU,
the convergence result of OGDA holds more generally even when there are multiple equilibria.
More generally, the same linear last-iterate convergence holds for any bilinear games over polytopes
since they also satisfy the SP-MS condition as we show. To complement this result, we construct an example of a bilinear game with a non-polytope feasible set where OGDA provably does not ensure linear
convergence, indicating that the shape of the feasible set matters.
Finally, we also provide experimental results to support our theory. In particular, we observe that
OGDA generally converges faster than OMWU for matrix games, despite the fact that both provably converge exponentially fast and that OMWU is often considered more favorable than OGDA when
the feasible set is the simplex.
2.1 Related Work
Average-iterate convergence. While showing last-iterate convergence has been a challenging task,
it is well-known that the average-iterate of many standard algorithms such as GDA and MWU enjoys a
converging duality gap at the rate of O(1/√T) [59]. A line of works shows that the rate can be improved
to O(1/T) using the “optimistic” version of these algorithms such as OGDA and OMWU [135, 35, 150].
For tasks such as training GANs, however, average-iterate convergence is unsatisfactory since averaging
large neural networks is usually prohibitive.
Extra-Gradient (EG) algorithms. The saddle-point problem fits into the more general variational inequality framework [73]. A classic algorithm for variational inequalities is EG, first introduced in [93].
Tseng [155] is the first to show last-iterate convergence for EG in various settings such as bilinear or
strongly-convex-strongly-concave problems. Recent works significantly expand the understanding of
EG and its variants for unconstrained bilinear problems [109], unconstrained strongly-convex-strongly-concave problems [123], and more [163, 110, 68].
The original EG is not applicable to a repeated game setting where only one gradient evaluation is
possible in each iteration. Moreover, unlike OGDA and OMWU, EG is shown to have linear regret against
adversarial opponents, and thus it is not a no-regret learning algorithm [9, 67]. However, there are “single-call variants” of EG that address these issues. In fact, some of these versions coincide with the OGDA
algorithm under different names such as modified Arrow–Hurwicz method [134] and “extrapolation from
the past” [64]. Apart from OGDA, other single-call variants of EG include Reflected Gradient [114, 33,
116] and Optimistic Gradient [38, 124]. These variants are all equivalent in the unconstrained setting
but differ in the constrained setting. To the best of our knowledge, none of the existing results for any
single-call variant of EG covers the constrained bilinear case (which is one of our key contributions).
Error Bounds and Metric Subregularity. To derive linear convergence for variational inequality problems, the error bound method is a commonly used technique [131, 113]. For example, it is a standard approach
to studying the last-iterate convergence of EG algorithms [155, 83, 6]. An error bound method is associated
with an error function that gives every point in the feasible set a measure of sub-optimality that is lower
bounded by the distance of the point to the optimal set up to some problem-dependent constant. If such an error function exists, linear convergence can be obtained. The choice of the error function depends on the feasible region, the objective function, and the algorithm. Common error functions include natural residual functions [85, 115] and gap functions [103, 146, 28]. Our method to derive the last-iterate convergence
for OGDA can also be viewed as an error bound method.
Metric subregularity is another important concept to derive linear convergence via some Lipschitz
behavior of a set-valued operator [107, 108, 2, 104]. Metric subregularity is closely related to error bound
methods [100]. In fact, as we prove in Appendix A.6, one special case of our condition SP-MS (that allows
us to show linear convergence) is equivalent to metric subregularity of an operator defined in terms of the
normal cone of the feasible set and the gradient of the objective. This is also the reason why we call our
condition Saddle-Point Metric Subregularity. Although metric subregularity has been extensively used in
the literature, to the best of our knowledge, our work is the first to use this condition to analyze OGDA.
OGDA and OMWU. Recently, last-iterate convergence for OGDA has been proven in various settings
such as convex-concave problems [38], unconstrained bilinear problems [41, 109], strongly-convex-strongly-concave problems [123], and others (e.g., [120]).
However, the behavior of OGDA and OMWU for the constrained bilinear case, or even the special
case of classic matrix games, appears to be much more mysterious and less understood. Cheung and
Piliouras [29] provide an alternative view on the convergence behavior of OMWU by studying volume
contraction in the dual space. Daskalakis and Panageas [39] show last-iterate convergence of OMWU
for matrix games under a uniqueness assumption and without a concrete rate. Although it is implicitly
suggested in [39, 40] that a rate of O(1/T^{1/9}) is possible, it is still not clear how to choose the learning
rate appropriately from their analysis. As mentioned, our results for OMWU significantly improve theirs,
with a clean linear convergence rate using a constant learning rate under the same uniqueness assumption,
while our results for OGDA further remove the uniqueness assumption.
2.2 Notations and Preliminaries
We consider the following constrained saddle-point problem: min_{x∈X} max_{y∈Y} f(x, y), where X and Y are closed convex sets, and f is a continuously differentiable function that is convex in x for any fixed y and concave in y for any fixed x. By the celebrated minimax theorem [129], we have min_{x∈X} max_{y∈Y} f(x, y) = max_{y∈Y} min_{x∈X} f(x, y).

The set of minimax optimal strategies is denoted by X^∗ = argmin_{x∈X} max_{y∈Y} f(x, y), and the set of maximin optimal strategies is denoted by Y^∗ = argmax_{y∈Y} min_{x∈X} f(x, y). It is well-known that X^∗ and Y^∗ are convex, and any pair (x^∗, y^∗) ∈ X^∗ × Y^∗ is a Nash equilibrium satisfying f(x^∗, y) ≤ f(x^∗, y^∗) ≤ f(x, y^∗) for any (x, y) ∈ X × Y.
For notational convenience, we define Z = X × Y and similarly Z^∗ = X^∗ × Y^∗. For a point z = (x, y) ∈ Z, we further define f(z) = f(x, y) and F(z) = (∇_x f(x, y), −∇_y f(x, y)).

Our goal is to find a point z ∈ Z that is close to the set of Nash equilibria Z^∗, and we consider three ways of measuring the closeness. The first one is the duality gap, defined as

  DualityGap(z) = max_{y′∈Y} f(x, y′) − min_{x′∈X} f(x′, y),

which is always non-negative since max_{y′∈Y} f(x, y′) ≥ f(x, y) ≥ min_{x′∈X} f(x′, y).

The second one is the distance between z and Z^∗. Specifically, for any closed set A, we define the projection operator Π_A as Π_A(a) = argmin_{a′∈A} ∥a − a′∥ (throughout this chapter ∥ · ∥ represents the L2 norm). The squared distance between z and Z^∗ is then defined as dist^2(z, Z^∗) = ∥z − Π_{Z^∗}(z)∥^2.

The third one is only for the case when X and Y are probability simplices, and z^∗ = (x^∗, y^∗) is the unique equilibrium. In this case, we use the sum of Kullback-Leibler divergences KL(x^∗, x) + KL(y^∗, y) to measure the closeness between z = (x, y) and z^∗, where KL(x, x′) = Σ_i x_i ln(x_i / x′_i). With a slight abuse of notation, we use KL(z, z′) to denote KL(x, x′) + KL(y, y′).
Optimistic Gradient Descent Ascent (OGDA). Starting from an arbitrary point (x̂^1, ŷ^1) = (x^0, y^0) from Z, OGDA with step size η > 0 iteratively computes the following for t = 1, 2, . . . ,

  x^t = Π_X(x̂^t − η∇_x f(x^{t−1}, y^{t−1})),    x̂^{t+1} = Π_X(x̂^t − η∇_x f(x^t, y^t)),
  y^t = Π_Y(ŷ^t + η∇_y f(x^{t−1}, y^{t−1})),    ŷ^{t+1} = Π_Y(ŷ^t + η∇_y f(x^t, y^t)).
Note that there are several slightly different versions of the algorithm in the literature, which differ in
the timing of performing the projection. Our version is the same as those in [31, 135]. It is also referred
to as “single-call extra-gradient” in [84], but it does not belong to the class of “extra-gradient” methods
discussed in [155, 109, 68] for example.
Also note that OGDA only requires accessing f via its gradient. In fact, only one gradient at the point (x^t, y^t) is needed for iteration t. This aspect makes it especially suitable for a repeated game setting, where in each round, one player proposes x^t while another player proposes y^t. With only the information of the gradient from the environment (∇_x f(x^t, y^t) for the first player and ∇_y f(x^t, y^t) for the other), both players can execute the algorithm.
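As an illustration of how lightweight each iteration is in the matrix-game case f(x, y) = x^⊤Gy with X = ∆_M and Y = ∆_N, the following sketch implements the OGDA updates above (for exposition only; project_simplex is the Euclidean projection helper from the Chapter 1 snippet, and the step size is an arbitrary constant):

import numpy as np

def ogda_matrix_game(G, T, eta=0.05):
    """OGDA for f(x, y) = x^T G y over probability simplices."""
    m, n = G.shape
    x_hat, y_hat = np.full(m, 1.0 / m), np.full(n, 1.0 / n)
    # gradients at (x^0, y^0) = (x_hat^1, y_hat^1), used by the first iteration
    gx_prev, gy_prev = G @ y_hat, G.T @ x_hat
    for _ in range(T):
        x = project_simplex(x_hat - eta * gx_prev)   # x^t
        y = project_simplex(y_hat + eta * gy_prev)   # y^t
        gx, gy = G @ y, G.T @ x                      # grad_x f and grad_y f at (x^t, y^t)
        x_hat = project_simplex(x_hat - eta * gx)    # x_hat^{t+1}
        y_hat = project_simplex(y_hat + eta * gy)    # y_hat^{t+1}
        gx_prev, gy_prev = gx, gy
    return x, y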
Optimistic Multiplicative Weights Update (OMWU). When the feasible sets X and Y are probability simplices ∆_M and ∆_N for some integers M and N, OMWU is another common iterative algorithm to solve the saddle-point problem. For simplicity, we assume that it starts from the uniform distributions (x̂^1, ŷ^1) = (x^0, y^0) = (1_M/M, 1_N/N), where 1_d is the all-one vector of dimension d. Then OMWU with step size η > 0 iteratively computes the following for t = 1, 2, . . . ,

  x^t_i = x̂^t_i exp(−η(∇_x f(x^{t−1}, y^{t−1}))_i) / Σ_j x̂^t_j exp(−η(∇_x f(x^{t−1}, y^{t−1}))_j),    x̂^{t+1}_i = x̂^t_i exp(−η(∇_x f(x^t, y^t))_i) / Σ_j x̂^t_j exp(−η(∇_x f(x^t, y^t))_j),

  y^t_i = ŷ^t_i exp(η(∇_y f(x^{t−1}, y^{t−1}))_i) / Σ_j ŷ^t_j exp(η(∇_y f(x^{t−1}, y^{t−1}))_j),    ŷ^{t+1}_i = ŷ^t_i exp(η(∇_y f(x^t, y^t))_i) / Σ_j ŷ^t_j exp(η(∇_y f(x^t, y^t))_j).
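The OMWU updates admit an equally short implementation in the same matrix-game setting (again an illustrative sketch with an arbitrary constant step size):

import numpy as np

def omwu_matrix_game(G, T, eta=0.1):
    """OMWU for f(x, y) = x^T G y over probability simplices."""
    m, n = G.shape
    x_hat, y_hat = np.full(m, 1.0 / m), np.full(n, 1.0 / n)
    gx_prev, gy_prev = G @ y_hat, G.T @ x_hat                  # gradients at (x^0, y^0)
    for _ in range(T):
        x = x_hat * np.exp(-eta * gx_prev); x /= x.sum()       # x^t
        y = y_hat * np.exp(eta * gy_prev);  y /= y.sum()       # y^t
        gx, gy = G @ y, G.T @ x                                # gradients at (x^t, y^t)
        x_hat = x_hat * np.exp(-eta * gx); x_hat /= x_hat.sum()    # x_hat^{t+1}
        y_hat = y_hat * np.exp(eta * gy);  y_hat /= y_hat.sum()    # y_hat^{t+1}
        gx_prev, gy_prev = gx, gy
    return x, y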
OMWU and OGDA as Optimistic Mirror Descent Ascent. OMWU and OGDA can be viewed as special cases of Optimistic Mirror Descent Ascent. Specifically, let the regularizer ψ(u) denote the negative entropy Σ_i u_i ln u_i for the case of OMWU and (half of) the squared L2 norm (1/2)∥u∥^2 for the case of OGDA (so that D_ψ(u, v) is KL(u, v) and (1/2)∥u − v∥^2 respectively). Then using the shorthands z^t = (x^t, y^t) and ẑ^t = (x̂^t, ŷ^t) and recalling the notation defined earlier, Z = X × Y and F(z) = (∇_x f(x, y), −∇_y f(x, y)), one can rewrite OMWU/OGDA compactly as

  z^t = argmin_{z∈Z} { η⟨z, F(z^{t−1})⟩ + D_ψ(z, ẑ^t) },    (2.1)
  ẑ^{t+1} = argmin_{z∈Z} { η⟨z, F(z^t)⟩ + D_ψ(z, ẑ^t) }.    (2.2)

By the standard regret analysis of Optimistic Mirror Descent, we have the following important lemma, which is readily applied to OMWU and OGDA when ψ is instantiated as the corresponding regularizer. The proof is mostly standard (see e.g., [135, Lemma 1]). For completeness, we include it in Appendix A.2.

Lemma 1. Consider update rules Equation (2.1) and Equation (2.2) and define dist^2_p(z, z′) = ∥x − x′∥^2_p + ∥y − y′∥^2_p. Suppose that ψ satisfies

  D_ψ(z, z′) ≥ (1/2) dist^2_p(z, z′)

for some p ≥ 1, and F satisfies dist^2_q(F(z), F(z′)) ≤ L^2 dist^2_p(z, z′) for q ≥ 1 with 1/p + 1/q = 1. Also, assume that η ≤ 1/(8L). Then for any z ∈ Z and any t ≥ 1, we have

  ηF(z^t)^⊤(z^t − z) ≤ D_ψ(z, ẑ^t) − D_ψ(z, ẑ^{t+1}) − D_ψ(ẑ^{t+1}, z^t) − (15/16)D_ψ(z^t, ẑ^t) + (1/16)D_ψ(ẑ^t, z^{t−1}).
2.3 Convergence Results for OMWU
In this section, we show that for a two-player zero-sum matrix game with a unique equilibrium, OMWU
with a constant learning rate converges to the equilibrium exponentially fast. The assumption and the
algorithm are the same as those considered in [39], but our analysis improves theirs in two ways. First,
we do not require the learning rate to be exponentially smaller than some problem-dependent quantity.
Second, we explicitly provide a linear convergence rate. In Section 2.4, we further remove the uniqueness
assumption and significantly generalize the results by studying OGDA.
In a matrix game we have X = ∆_M, Y = ∆_N, and f(z) = x^⊤Gy for some matrix G ∈ [−1, 1]^{M×N}.
To show the last-iterate convergence of OMWU, we first apply Lemma 1 with D_ψ(u, v) = KL(u, v), z = z* (the unique equilibrium of the game matrix G), and (p, q) = (1, ∞). The constant L can be chosen as 1 since dist²_∞(F(z), F(z′)) = max_i |(G(y − y′))_i|² + max_j |(G^⊤(x − x′))_j|² ≤ ∥y − y′∥²_1 + ∥x − x′∥²_1 = dist²_1(z, z′). Also notice that F(z^t)^⊤(z^t − z*) = f(x^t, y^t) − f(x*, y^t) + f(x^t, y*) − f(x^t, y^t) = f(x^t, y*) − f(x*, y^t) ≥ 0 by the optimality of z*. Therefore, when η ≤ 1/8 we have
KL(z*, ẑ^{t+1}) ≤ KL(z*, ẑ^t) − KL(ẑ^{t+1}, z^t) − (15/16) KL(z^t, ẑ^t) + (1/16) KL(ẑ^t, z^{t−1}).
Defining Θ^t = KL(z*, ẑ^t) + (1/16) KL(ẑ^t, z^{t−1}) and ζ^t = KL(ẑ^{t+1}, z^t) + KL(z^t, ẑ^t), we rewrite the above as
Θ^{t+1} ≤ Θ^t − (15/16) ζ^t.   (2.3)
From Equation (2.3) it is clear that the quantity Θ^t is always non-increasing in t due to the non-negativity of ζ^t. Furthermore, the more the algorithm moves between round t and round t + 1 (that is, the larger ζ^t is), the more Θ^t decreases.
To establish the rate of convergence, a natural idea is to relate ζ^t back to Θ^t or Θ^{t+1}. For example, if we can show ζ^t ≥ cΘ^{t+1} for some constant c > 0, then Equation (2.3) implies Θ^{t+1} ≤ Θ^t − (15c/16) Θ^{t+1}, which further gives Θ^{t+1} ≤ (1 + 15c/16)^{−1} Θ^t. This immediately implies a linear convergence rate for Θ^t as well as KL(z*, ẑ^t), since KL(z*, ẑ^t) ≤ Θ^t.
Moreover, notice that to find such c, it suffices to find a c′ > 0 such that ζ^t ≥ c′ KL(z*, ẑ^{t+1}). This is because it will then give ζ^t ≥ (1/16) KL(ẑ^{t+1}, z^t) + (15/16) ζ^t ≥ (1/16) KL(ẑ^{t+1}, z^t) + (15c′/16) KL(z*, ẑ^{t+1}) ≥ min{1, 15c′/16} Θ^{t+1}, and thus c ≜ min{1, 15c′/16} satisfies the condition.
From the discussion above, we see that to establish the linear convergence of KL(z*, ẑ^t), we only need to show that there exists some c′ > 0 such that KL(ẑ^{t+1}, z^t) + KL(z^t, ẑ^t) ≥ c′ KL(z*, ẑ^{t+1}). The high-level interpretation of this inequality is that when ẑ^{t+1} is far from the equilibrium z* (i.e., KL(z*, ẑ^{t+1}) is large), the algorithm should have a large move between round t and t + 1, making KL(ẑ^{t+1}, z^t) + KL(z^t, ẑ^t) large.
In our analysis, we use a two-stage argument to find such a c′. In the first stage, we only show that KL(ẑ^{t+1}, z^t) + KL(z^t, ẑ^t) ≥ c′′ KL(z*, ẑ^{t+1})² for some c′′ > 0, and use it to argue a slower convergence rate KL(z*, ẑ^t) = O(1/t). Then in the second stage, we show that after ẑ^t and z^t become close enough to z*, we have KL(ẑ^{t+1}, z^t) + KL(z^t, ẑ^t) ≥ c′ KL(z*, ẑ^{t+1}) for some c′ > 0.
This kind of two-stage argument might be reminiscent of that used by Daskalakis and Panageas [39];
however, the techniques we use are very different. Specifically, Daskalakis and Panageas [39] utilize tools
of “spectral analysis” similar to [109] and show that the OMWU update can be viewed as a “contraction
mapping” with respect to a matrix whose eigenvalue is smaller than 1. Our analysis, on the other hand,
leverages analysis of online mirror descent, starting from the “one-step regret bound” (Lemma 1) and
making use of the two negative terms that are typically dropped in the analysis. Importantly, our analysis
does not need an exponentially small learning rate required by [39]. Thus, unlike their results, our learning
rate is kept as a universal constant in all stages. The arguments above are formalized below:
Lemma 2. Consider a matrix game f(x, y) = x^⊤Gy with X = ∆_M, Y = ∆_N, and G ∈ [−1, 1]^{M×N}. Assume that there exists a unique Nash equilibrium z* and η ≤ 1/8. Then, there exists a constant C1 > 0 that depends on G such that for any t ≥ 1, OMWU ensures
KL(ẑ^{t+1}, z^t) + KL(z^t, ẑ^t) ≥ η²C1 KL(z*, ẑ^{t+1})².
Also, there is a constant ξ > 0 that depends on G (defined in Definition 3) such that as long as max{∥z* − ẑ^t∥_1, ∥z* − z^t∥_1} ≤ ηξ/10, then
KL(ẑ^{t+1}, z^t) + KL(z^t, ẑ^t) ≥ η²C2 KL(z*, ẑ^{t+1})
for another constant C2 > 0 that depends on G.
With Lemma 2 and the earlier discussion, the last-iterate convergence rate of OMWU is established:
Theorem 3. For a matrix game f(x, y) = x^⊤Gy with a unique Nash equilibrium z*, OMWU with a learning rate η ≤ 1/8 guarantees KL(z*, z^t) ≤ C3(1 + C4)^{−t}, where C3, C4 > 0 are some constants depending on the game matrix G.
Proofs for this section are deferred to Appendix A.4, where all problem-dependent constants are specified as well.∗ Theorem 3 gives the first last-iterate convergence result for OMWU with a concrete linear
rate. We note that the uniqueness assumption is critical for our analysis, and whether this is indeed
necessary for OMWU is left as an important future direction.
2.4 Convergence Results for OGDA
In this section, we provide last-iterate convergence results for OGDA, which are much more general than those in Section 2.3. We propose a general condition subsuming many well-studied cases, under which OGDA enjoys a concrete last-iterate convergence guarantee in terms of the L2 distance between z^t and Z*. The results in this part can be specialized to the setting of bilinear games over the simplex, but the unique equilibrium assumption made in Section 2.3 and in [39] is no longer needed.
∗One might find that the constant C3 is exponential in some problem-dependent quantity T0. However, this is simply a loose bound in exchange for a more concise presentation; our proof in fact shows that when t < T0, the convergence is of a slower 1/t rate, and when t ≥ T0, the convergence is linear without this large constant.
Throughout the section we make the assumption that f is L-smooth:
Assumption 1. For any z, z′ ∈ Z, ∥F(z) − F(z′)∥ ≤ L∥z − z′∥ holds.†
To introduce our general condition, we first provide some intuition by applying Lemma 1 again. Letting ψ(u) = (1/2)∥u∥² in Lemma 1, we get that for OGDA, for any z ∈ Z and any t ≥ 1,
2ηF(z^t)^⊤(z^t − z) ≤ ∥ẑ^t − z∥² − ∥ẑ^{t+1} − z∥² − ∥ẑ^{t+1} − z^t∥² − (15/16)∥z^t − ẑ^t∥² + (1/16)∥ẑ^t − z^{t−1}∥².
Now we instantiate the inequality above with z = Π_{Z*}(ẑ^t) ∈ Z*. Since z = Π_{Z*}(ẑ^t) is an equilibrium, we have F(z^t)^⊤(z^t − z) ≥ f(x^t, y^t) − f(x, y^t) + f(x^t, y) − f(x^t, y^t) = f(x^t, y) − f(x, y^t) ≥ 0 by the convexity/concavity of f and the optimality of z, and thus
∥ẑ^{t+1} − Π_{Z*}(ẑ^t)∥² ≤ ∥ẑ^t − Π_{Z*}(ẑ^t)∥² − ∥ẑ^{t+1} − z^t∥² − (15/16)∥z^t − ẑ^t∥² + (1/16)∥ẑ^t − z^{t−1}∥².
Further noting that the left-hand side is lower bounded by dist²(ẑ^{t+1}, Z*) by definition, we arrive at
dist²(ẑ^{t+1}, Z*) ≤ dist²(ẑ^t, Z*) − ∥ẑ^{t+1} − z^t∥² − (15/16)∥z^t − ẑ^t∥² + (1/16)∥ẑ^t − z^{t−1}∥².
Similarly, we define Θ^t = ∥ẑ^t − Π_{Z*}(ẑ^t)∥² + (1/16)∥ẑ^t − z^{t−1}∥² and ζ^t = ∥ẑ^{t+1} − z^t∥² + ∥z^t − ẑ^t∥², and rewrite the above as
Θ^{t+1} ≤ Θ^t − (15/16) ζ^t.   (2.4)
As in Section 2.3, our goal now is to lower bound ζ^t by some quantity related to dist²(ẑ^{t+1}, Z*), and then use Equation (2.4) to obtain a convergence rate for Θ^t. In order to incorporate more general objective functions into the discussion, in the following Lemma 4, we provide an intermediate lower bound for ζ^t, which will be further related to dist²(ẑ^{t+1}, Z*) later.
†This is equivalent to the condition dist²_q(F(z), F(z′)) ≤ L² dist²_p(z, z′) in Lemma 1 with p = 2, hence the same notation L.
Lemma 4. For any t ≥ 0 and z′ ∈ Z with z′ ≠ ẑ^{t+1}, OGDA with η ≤ 1/(8L) ensures
∥ẑ^{t+1} − z^t∥² + ∥z^t − ẑ^t∥² ≥ (32/81) η² [F(ẑ^{t+1})^⊤(ẑ^{t+1} − z′)]²_+ / ∥ẑ^{t+1} − z′∥²,   (2.5)
where [a]_+ ≜ max{a, 0}, and similarly, for z′ ≠ z^{t+1},
∥ẑ^{t+1} − z^{t+1}∥² + ∥z^t − ẑ^{t+1}∥² ≥ (32/81) η² [F(z^{t+1})^⊤(z^{t+1} − z′)]²_+ / ∥z^{t+1} − z′∥².   (2.6)
We note that a direct consequence of Lemma 4 is an “average duality gap” guarantee for OGDA when Z is bounded:
(1/T) Σ_{t=1}^{T} DualityGap(z^t) = (1/T) Σ_{t=1}^{T} ( max_{x′∈X, y′∈Y} f(x^t, y′) − f(x′, y^t) ) = O(D/(η√T)),   (2.7)
where D ≜ sup_{z,z′∈Z} ∥z − z′∥ is the diameter of Z (the duality gap may be undefined when Z is unbounded). We are not aware of any previous work that gives this result for the constrained case. See Appendix A.5 for the proof of Equation (2.7) and comparisons with previous works.
However, to obtain last-iterate convergence results, we need to make sure that the right-hand side of Equation (2.5) is large enough. Motivated by this fact, we propose the following general condition on f and Z that guarantees this.
Definition 1 (Saddle-Point Metric Subregularity (SP-MS)). The SP-MS condition is defined as: for any z ∈ Z\Z* with z* = Π_{Z*}(z),
sup_{z′∈Z} F(z)^⊤(z − z′) / ∥z − z′∥ ≥ C∥z − z*∥^{β+1}   (SP-MS)
holds for some parameter β ≥ 0 and C > 0.
We call this condition Saddle-Point Metric Subregularity because the case with β = 0 is equivalent
to one type of metric subregularity in variational inequality problems, as we prove in Appendix A.6. The
condition is also closely related to other error bound conditions that have been identified for variational
inequality problems (e.g., Tseng, Gilpin, Peña, and Sandholm, Malitsky [155, 66, 115]). Although these
works have shown that under similar conditions their algorithms exhibit linear convergence, to the best
of our knowledge, there is no previous work that analyzes OGDA or other no-regret learning algorithms
using such conditions.
SP-MS covers many standard settings studied in the literature. The first and perhaps the most important example is bilinear games with a polytope feasible set, which in particular includes the classic
two-player matrix games considered in Section 2.3.
Theorem 5. A bilinear game f(x, y) = x^⊤Gy with X ⊆ R^M and Y ⊆ R^N being polytopes and G ∈ R^{M×N} satisfies SP-MS with β = 0.
We emphasize again that, unlike Lemma 2, Theorem 5 does not require a unique equilibrium. Note that we have not provided the concrete form of the parameter C in the theorem (which depends on X, Y, and G), but it can be found in the proof (see Appendix A.7). The next example shows that strongly-convex-strongly-concave problems are also special cases of our condition.
Theorem 6. If f is strongly convex in x and strongly concave in y, then SP-MS holds with β = 0.
Next, we provide a toy example where SP-MS holds with β > 0.
Theorem 7. Let X = Y ≜ {(a, b) : 0 ≤ a, b ≤ 1, a + b = 1}, let n > 2 be an integer, and let f(x, y) = x_1^{2n} − x_1 y_1 − y_1^{2n}. Then SP-MS holds with β = 2n − 2.
With this general condition, we are now able to complete the loop. For any value of β, we show the
following last-iterate convergence guarantee for OGDA.
Theorem 8. For any η ≤ 1/(8L), if SP-MS holds with β = 0, then OGDA guarantees linear last-iterate convergence:
dist²(z^t, Z*) ≤ 64 dist²(ẑ^1, Z*)(1 + C5)^{−t};   (2.8)
on the other hand, if the condition holds with β > 0, then we have a slower convergence:
dist²(z^t, Z*) ≤ 32 [ (1 + 4⁴/β)^{1/β} dist²(ẑ^1, Z*) + (2²/(C5 β))^{1/β} ] t^{−1/β},   (2.9)
where C5 ≜ min{16η²C²/81, 1/2}.
We defer the proof to Appendix A.9 and make several remarks. First, note that based on a convergence result on dist²(z^t, Z*), one can immediately obtain a convergence guarantee for the duality gap DualityGap(z^t) as long as f is also Lipschitz. This is because DualityGap(z^t) ≤ max_{x′,y′} f(x^t, y′) − f(x*, y′) + f(x′, y*) − f(x′, y^t) ≤ O(∥x^t − x*∥ + ∥y^t − y*∥) = O(√(dist²(z^t, Z*))), where (x*, y*) = Π_{Z*}(z^t). While this leads to stronger guarantees compared to Equation (2.7), we emphasize that the latter holds even without the SP-MS condition.
Second, our results significantly generalize [84, Theorem 2] which itself is a consolidated version of
several earlier works and also shows a linear convergence rate of OGDA under a condition stronger than
our SP-MS with β = 0 as discussed earlier. More specifically, our results show that linear convergence
holds for a much broader set of problems. Furthermore, we also show slower sublinear convergence rates
for any value of β > 0, which is also new as far as we know. In particular, we empirically verify that OGDA
indeed does not converge exponentially fast for the toy example defined in Theorem 7 (see Appendix A.1).
Last but not least, the most significant implication of Theorem 8 is that it provides by far the most general linear convergence result for OGDA for the classic two-player matrix games, or more generally bilinear games with polytope constraints, according to Theorem 5 and Equation (2.8). Compared to recent works of [41, 39] for matrix games (on OGDA or OMWU), our result is considerably stronger: 1) we do not require a unique equilibrium while they do; 2) linear convergence holds for any initial point ẑ^1, while their result only holds if the initial points are in a small neighborhood of the unique equilibrium (otherwise the convergence is sublinear initially); 3) our only requirement on the step size is η ≤ 1/(8L),‡ while they require an exponentially small η, which does not reflect the behavior of the algorithms in practice. Even compared with our result in Section 2.3, we see that for OGDA, the unique equilibrium assumption is not required, and we do not have an initial phase of sublinear convergence as in Lemma 2. In Appendix A.1, we empirically show that OGDA often outperforms OMWU when both are tuned with a constant learning rate.
One may wonder what happens if a bilinear game has a non-polytope constraint. It turns out that in this case, SP-MS may only hold with β > 0, due to the following example showing that linear convergence provably does not hold for OGDA when the feasible set has a curved boundary.
Theorem 9. There exists a bilinear game with a non-polytope feasible set such that SP-MS holds with β = 3, and dist²(z^t, Z*) = Ω(1/t²) holds for OGDA.
This example indicates that the shape of the feasible set plays an important role in last-iterate convergence, which may be an interesting future direction to investigate. This is also verified empirically in our experiments (see Appendix A.1).
2.5 Experiments for Matrix Games
In this section, we provide empirical results on the performance of OGDA and OMWU for matrix games on the probability simplex.§ We include more empirical results in other settings in Appendix A.1. We set the size of the game matrix to be 32 × 32, then generate a random matrix with each entry G_ij drawn uniformly at random from [−1, 1], and finally rescale its operator norm to 1. With probability 1, the game has a unique Nash equilibrium [39].
‡In fact, any η < 1/(2L) is enough to achieve a linear convergence rate for OGDA, as one can verify by going over our proof. We use η ≤ 1/(8L) simply for consistency with the results for OMWU (where η cannot be set any larger due to technical reasons).
§Note that in this case the projection step of OGDA can be implemented efficiently in O(M ln M + N ln N) time [159].
Figure 2.1: Experiments of OGDA and OMWU with different learning rates for a matrix game f(x, y) = x^⊤Gy. “OGDA/OMWU-eta=η” represents the curve of OGDA/OMWU with learning rate η. The configuration order in the legend is consistent with the order of the curves. For OMWU, η ≥ 11 makes the algorithm diverge. The plot confirms the linear convergence of OMWU and OGDA, although OGDA is generally observed to converge faster than OMWU.
We compare the performance of OGDA and OMWU. For both algorithms, we choose a series of different learning rates and compare their performance, as shown in Figure 2.1. The x-axis represents the time step t, and the y-axis represents ln(KL(z*, z^t)) (we observe similar results using dist²(z*, z^t) or the duality gap as the measure; see Appendix A.1.1). Note that here we approximate z* by running OGDA for many more iterations and taking the very last iterate. We also verify that the iterates of OMWU converge to the same point as OGDA.
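For reference, the two measures used in this comparison can be computed as in the following small sketch (not part of the thesis); for a matrix game over simplices, both optimizations in the duality gap are attained at pure strategies.

```python
import numpy as np

def kl(p, q, eps=1e-300):
    """KL(p, q) between probability vectors; applied to each player and summed for KL(z*, z^t)."""
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def duality_gap(G, x, y):
    """max_{y'} x^T G y' - min_{x'} x'^T G y = max over columns of x^T G minus min over rows of G y."""
    return float(np.max(x @ G) - np.min(G @ y))
```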
From Figure 2.1, we see that all curves eventually become straight lines, supporting our linear convergence results. Generally, the slope of the line is larger for a larger learning rate η. However, the algorithm diverges when η exceeds some value (such as 11 for the case of OMWU). Comparing OMWU and OGDA, we see that OGDA converges faster, which is also consistent with our theory if one compares the bounds in Theorem 3 and Theorem 8 (with the value of the constants revealed in the proofs). We find this observation interesting, since OMWU is usually considered more favorable for problems defined over the simplex, especially in terms of regret minimization. Our experiments suggest, however, that in terms of last-iterate convergence, OGDA might perform even better than OMWU.
Chapter 3
No-Regret Learning and Last-Iterate Convergence in Extensive-Form
Games
Extensive-form games (EFGs) are an important class of games in game theory and artificial intelligence
that can model imperfect information and sequential interactions. EFGs are typically solved by finding
or approximating a Nash equilibrium. Regret-minimization algorithms are among the most popular approaches to approximate Nash equilibria. The motivation comes from a classical result which says that in
two-player zero-sum games, when both players use no-regret algorithms, the average strategy converges
to Nash equilibrium [59, 76, 166]. Counterfactual Regret Minimization (CFR) [166] and its variants such
as CFR+ [152] are based on this motivation. However, the averaging procedure can create complications.
It not only increases the computational and memory overhead [10], but also makes things difficult when
incorporating neural networks in the solution process, where averaging is usually not possible. Indeed, to
address this issue, Brown et al. [13] create a separate neural network to approximate the average strategy
in their Deep CFR model.
In this chapter, following Chapter 2, we greatly extend the theoretical understanding of last-iterate
convergence of regret-minimization algorithms in two-player zero-sum extensive-form games with perfect recall and open up many interesting directions both in theory and practice. First, we show that any
optimistic online mirror-descent algorithm instantiated with a strongly convex regularizer that is continuously differentiable on the EFG strategy space provably enjoys last-iterate convergence, while CFR with
either regret matching or regret matching+ fails to converge. Moreover, for some of the optimistic algorithms, we further show explicit convergence rates. In particular, we prove that optimistic mirror descent
instantiated with the 1-strongly-convex dilated entropy regularizer [98], which we refer to as Dilated Optimistic Multiplicative Weights Update (DOMWU), has a linear convergence rate under the assumption that
there is a unique Nash equilibrium; we note that this assumption was also made by [39, 161] in order to
achieve similar results for normal-form games.
3.1 Related Work
Extensive-form Games Here we focus on work related to two-player zero-sum perfect-recall games.
Although there are many game-solving techniques such as abstraction [97, 62, 12], endgame solving [20,
61], and subgame solving [125, 17], these methods all rely on scalable methods for computing approximate
Nash equilibria. There are several classes of algorithms for computing approximate Nash equilibria, such as
double-oracle methods [117], fictitious play [11, 79], first-order methods [81, 98], and CFR methods [166,
102, 152]. Notably, variants of the CFR approach have achieved significant success in poker games [10,
126, 16]. Underlying the first-order and CFR approaches is the sequence-form representation [158], which
allows the problem to be represented as a bilinear saddle-point problem. This leads to algorithms based
on smoothing techniques and other first-order methods [66, 95, 63], and enables CFR via the theorem
connecting no-regret guarantees to Nash equilibrium.
Online Convex Optimization and Optimistic Regret Minimization Online convex optimization
[165] is a framework for repeated decision making where the goal is to minimize regret. When applied to
repeated two-player zero-sum games, it is known that the average strategy converges to Nash equilibria at the rate of O(1/√T) when both players apply regret-minimization algorithms whose regret grows on the order of O(√T) [59, 76, 166]. Moreover, when the players use optimistic regret-minimization algorithms, the convergence rate is improved to O(1/T) [135, 150]. Recent works have applied optimism ideas to EFGs,
such as optimistic algorithms with dilated regularizers [98, 54], CFR-like local optimistic algorithms [49],
and optimistic CFR algorithms [19, 14, 52]. However, the theoretical results in all these existing papers
consider the average strategy, while we are the first to consider last-iterate convergence in EFGs.
Last-iterate Convergence in Saddle-point Optimization As mentioned previously, two-player zero-sum games can be formulated as saddle-point optimization problems. Saddle-point problems have recently
gained a lot of attention due to their applications in machine learning, for example in generative adversarial
networks [69]. Basic algorithms, including gradient descent ascent and multiplicative weights update,
diverge even in simple instances [119, 8]. In contrast, their optimistic versions, optimistic gradient descent
ascent (OGDA) [38, 120, 161] and optimistic multiplicative weights update (OMWU) [39, 106, 161] have
been shown to enjoy attractive last-iterate convergence guarantees. However, almost none of these results
apply to the case of EFGs: Wei et al. [161] show a result that implies linear convergence of vanilla OGDA in
EFGs (see Corollary 14), but no results are known for vanilla OMWU or more importantly for algorithms
instantiated with dilated regularizers which lead to fast iterate updates in EFGs. In this work we extend the
existing results on normal-form games to EFGs, including the practically-important dilated regularizers.
3.2 Problem Setup
Extensive-form Games as Bilinear Saddle-point Optimization We consider the problem of finding
a Nash equilibrium of a two-player zero-sum extensive-form game (EFG) with perfect recall. Instead of
formally introducing the definition of an EFG (see Appendix B.1 for an example), for the purpose of this work, it suffices to consider an equivalent formulation, which casts the problem as a simple bilinear saddle-point optimization [158]:
min_{x∈X} max_{y∈Y} x^⊤Gy = max_{y∈Y} min_{x∈X} x^⊤Gy,   (3.1)
where G ∈ [−1, +1]^{M×N} is a known matrix, and X ⊂ R^M and Y ⊂ R^N are two polytopes called treeplexes (to be defined soon). The set of Nash equilibria is then defined as Z* = X* × Y*, where X* = argmin_{x∈X} max_{y∈Y} x^⊤Gy and Y* = argmax_{y∈Y} min_{x∈X} x^⊤Gy. Our goal is to find a point z ∈ Z = X × Y that is close to the set of Nash equilibria Z*, and we use the Bregman divergence (of some function ψ) between z and the closest point in Z* to measure the closeness, that is, min_{z*∈Z*} D_ψ(z*, z).
For notational convenience, we let P = M + N and F(z) = (Gy, −G^⊤x) for any z = (x, y) ∈ Z ⊂ R^P. Without loss of generality, we assume ∥F(z)∥_∞ ≤ 1 for all z ∈ Z (which can always be ensured by normalizing the entries of G accordingly).
Treeplexes The structure of the EFG is implicitly captured by the treeplexes X and Y, which are generalizations of simplexes that capture the sequential structure of an EFG. The formal definition is as follows.
(In Appendix B.1, we provide more details on the connection between treeplexes and the structure of the
EFG, as well as concrete examples of treeplexes for better illustrations.)
Definition 2 (Hoda et al. [81]). A treeplex is recursively constructed via the following three operations:
1. Every probability simplex is a treeplex.
2. Given treeplexes Z_1, . . . , Z_K, the Cartesian product Z_1 × · · · × Z_K is a treeplex.
3. (Branching) Given treeplexes Z_1 ⊂ R^M and Z_2 ⊂ R^N, and any i ∈ [[M]],
Z_1 ▷_i Z_2 = { (u, u_i · v) ∈ R^{M+N} : u ∈ Z_1, v ∈ Z_2 }
is a treeplex.
By definition, a treeplex is a tree-like structure built with simplexes, which intuitively represents the
tree-like decision space of a single player, and an element in the treeplex represents a strategy for the
player. Let H_Z denote the collection of all the simplexes in treeplex Z, which following Definition 2 can be recursively defined as: H_Z = {Z} if Z is a simplex; H_Z = ∪_{k=1}^{K} H_{Z_k} if Z is a Cartesian product Z_1 × · · · × Z_K; and H_Z = H_{Z_1} ∪ H_{Z_2} if Z = Z_1 ▷_i Z_2. In EFG terminology, H_X and H_Y are the collections of information sets for player x and player y respectively, which are the decision points for the players, at which they select an action within the simplex. For any h ∈ H_Z, we let Ω_h denote the set of indices belonging to h, and for any z ∈ Z, we let z_h be the slice of z whose indices are in Ω_h. For each index i, we also let h(i) be the simplex i falls into, that is, i ∈ Ω_{h(i)}.
In Definition 2, the last branching operation naturally introduces the concept of a parent variable for each h ∈ H_Z, which can again be recursively defined as: if Z is a simplex, then it has no parent; if Z is a Cartesian product Z_1 × · · · × Z_K, then the parent of h ∈ H_Z is the same as the parent of h in the treeplex Z_k that h belongs to (that is, h ∈ H_{Z_k}); finally, if Z = {(u, u_i · v) : u ∈ Z_1, v ∈ Z_2}, then for all h ∈ H_{Z_2} without a parent, their parent in Z is u_i, and for all other h, their parents remain the same as in Z_1 or Z_2. We denote by σ(h) the index of the parent variable of h, and let it be 0 if h has no parent. For convenience, we let z_0 = 1 for all z ∈ Z (so that z_{σ(h)} is always well-defined). Also define H_i = {h ∈ H_Z : σ(h) = i} to be the collection of simplexes whose parent index is i.
Similarly, for an index i, its parent index is defined as p_i = σ(h(i)), and i is called a terminal index if it is not a parent index (that is, i ≠ p_j for all j). Finally, for an element z ∈ Z and an index i, we define q_i = z_i/z_{p_i}. In EFG terminology, q_i specifies the probability of selecting action i in the information set h(i) according to strategy z.
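The index bookkeeping above can be illustrated with a tiny example. The following sketch (not from the thesis) encodes a toy treeplex by listing, for each information set h, the index set Ω_h and the parent index σ(h), and checks the resulting sequence-form constraints; the encoding, names, and the example treeplex are assumptions made purely for illustration.

```python
# Toy treeplex: a root simplex over indices {1, 2} and a child simplex over {3, 4}
# reached by the branching operation after index 2; parent = 0 means "no parent",
# and we reserve index 0 for the constant entry z_0 = 1.
info_sets = {
    "h_root":  {"indices": [1, 2], "parent": 0},
    "h_child": {"indices": [3, 4], "parent": 2},
}

def is_in_treeplex(z, info_sets, tol=1e-9):
    """Check z >= 0 and, for every information set h, sum_{i in Omega_h} z_i = z_{sigma(h)}."""
    zf = {0: 1.0} | dict(enumerate(z, start=1))
    if any(v < -tol for v in zf.values()):
        return False
    return all(abs(sum(zf[i] for i in h["indices"]) - zf[h["parent"]]) <= tol
               for h in info_sets.values())

# Here q_3 = 0.1/0.4 and q_4 = 0.3/0.4 are the action probabilities at h_child.
print(is_in_treeplex([0.6, 0.4, 0.1, 0.3], info_sets))   # True
```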
3.3 Optimistic Regret-Minimization Algorithms
There are many different algorithms for solving bilinear saddle-point problems over general constrained
sets. We focus specifically on a family of regret-minimization algorithms, called Optimistic Online Mirror
Descent (OOMD) [135], which are known to be highly efficient. In contrast to the CFR algorithm and its
variants, which minimize a local regret notion at each information set (which upper bounds global regret),
the algorithms we consider explicitly minimize global regret. As our main results in the next section show,
these global regret-minimization algorithms enjoy last-iterate convergence, while CFR provably diverges.
Specifically, given a step size η > 0 and a convex function ψ (called a regularizer), OOMD sequentially performs the following update for t = 1, 2, . . . ,
x^t = argmin_{x∈X} { η⟨x, Gy^{t−1}⟩ + D_ψ(x, x̂^t) },   x̂^{t+1} = argmin_{x∈X} { η⟨x, Gy^t⟩ + D_ψ(x, x̂^t) },
y^t = argmin_{y∈Y} { η⟨y, −G^⊤x^{t−1}⟩ + D_ψ(y, ŷ^t) },   ŷ^{t+1} = argmin_{y∈Y} { η⟨y, −G^⊤x^t⟩ + D_ψ(y, ŷ^t) },
with (x̂^1, ŷ^1) = (x^0, y^0) ∈ Z being arbitrary. Using the shorthands z^t = (x^t, y^t), ẑ^t = (x̂^t, ŷ^t), ψ(z) = ψ(x) + ψ(y) and recalling the notation F(z) = (Gy, −G^⊤x), the updates above can be compactly written as OOMD with regularizer ψ over treeplex Z:
z^t = argmin_{z∈Z} { η⟨z, F(z^{t−1})⟩ + D_ψ(z, ẑ^t) },   ẑ^{t+1} = argmin_{z∈Z} { η⟨z, F(z^t)⟩ + D_ψ(z, ẑ^t) }.   (3.2)
Below, we discuss four different regularizers and their resulting algorithms (throughout, we use notations Φ for regularizers based on Euclidean norm and Ψ for regularizers based on entropy).
Vanilla Optimistic Gradient Descent Ascent (VOGDA) Define the vanilla squared Euclidean norm regularizer as Φ^van(z) = (1/2) Σ_i z_i². We call OOMD instantiated with ψ = Φ^van Vanilla Optimistic Gradient Descent Ascent (VOGDA). In this case, the Bregman divergence is D_{Φ^van}(z, z′) = (1/2)∥z − z′∥² (by definition Φ^van is thus 1-strongly convex with respect to the 2-norm), and the updates simply become projected gradient descent. For VOGDA there is no closed form for Equation (3.2), since projection onto the treeplex Z is required. Nevertheless, the solution can still be computed in O(P² log P) time (recall that P is the dimension of Z) [66].
Vanilla Optimistic Multiplicative Weights Update (VOMWU) Define the vanilla entropy regularizer as Ψ^van(z) = Σ_i z_i ln z_i. We call OOMD with ψ = Ψ^van Vanilla Optimistic Multiplicative Weights Update (VOMWU). The Bregman divergence in this case is the generalized KL divergence: D_{Ψ^van}(z, z′) = Σ_i ( z_i ln(z_i/z′_i) − z_i + z′_i ). Although it is well known that Ψ^van is 1-strongly convex with respect to the 1-norm in the special case when Z is a simplex, this is not true in general on a treeplex. Nevertheless, it can still be shown that Ψ^van is 1-strongly convex with respect to the 2-norm; see Appendix B.3.
The name “Multiplicative Weights Update” is inherited from the case when X and Y are simplexes, in which case the updates in Equation (3.2) have a simple multiplicative form. We emphasize, however, that in general VOMWU does not admit a closed-form update. Instead, to solve Equation (3.2), one can equivalently solve a simpler dual optimization problem; see [164, Proposition 1].
The two regularizers mentioned above ignore the structure of the treeplex. Dilated Regularizers [81], on the other hand, take the structure into account and allow one to decompose the update into simpler updates at each information set. Specifically, given any convex function ψ defined over the simplex and a weight parameter α ∈ R^{H_Z}_+, the dilated version of ψ defined over Z is:
ψ^dil_α(z) = Σ_{h∈H_Z} α_h · z_{σ(h)} · ψ(z_h / z_{σ(h)}).   (3.3)
This is well-defined since z_h / z_{σ(h)} is indeed a distribution within the simplex h (with q_i for i ∈ Ω_h being its entries). It can also be shown that ψ^dil_α is always convex in z [81]. Intuitively, ψ^dil_α applies the base regularizer ψ to the action distribution in each information set and then scales the value by its parent variable and the weight α_h. By picking different base regularizers, we obtain the following two algorithms.
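As an illustration of Equation (3.3), the sketch below (not from the thesis) evaluates the dilated entropy Ψ^dil_α on the toy treeplex encoding from the earlier example; alpha maps each information set to its weight α_h, and skipping information sets with z_{σ(h)} = 0 is a simplifying assumption.

```python
import numpy as np

def dilated_entropy(z, info_sets, alpha):
    """Evaluate sum_h alpha_h * z_{sigma(h)} * psi(z_h / z_{sigma(h)}) with psi the negative entropy."""
    zf = {0: 1.0} | dict(enumerate(z, start=1))
    total = 0.0
    for name, h in info_sets.items():
        parent = zf[h["parent"]]
        if parent <= 0:                       # local term vanishes as z_{sigma(h)} -> 0
            continue
        q = np.array([zf[i] for i in h["indices"]]) / parent   # behavioral probabilities q_i
        total += alpha[name] * parent * float(np.sum(q * np.log(np.maximum(q, 1e-300))))
    return total

# Example with the toy treeplex from before:
# dilated_entropy([0.6, 0.4, 0.1, 0.3], info_sets, {"h_root": 1.0, "h_child": 2.0})
```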
Dilated Optimistic Gradient Descent Ascent (DOGDA) [54] Define the dilated squared Euclidean norm regularizer Φ^dil_α as Equation (3.3) with ψ being the vanilla squared Euclidean norm ψ(z) = (1/2) Σ_i z_i². Direct calculation shows Φ^dil_α(z) = (1/2) Σ_i α_{h(i)} z_i q_i. We call OOMD with regularizer Φ^dil_α the Dilated Optimistic Gradient Descent Ascent algorithm (DOGDA). It is known that there exists an α such that Φ^dil_α is 1-strongly convex with respect to the 2-norm [54]. Importantly, DOGDA decomposes the update Equation (3.2) into simpler gradient descent-style updates at each information set, as shown below.
Lemma 10 (Hoda et al. [81]). If z′ = argmin_{z∈Z} { η⟨z, f⟩ + D_{Φ^dil_α}(z, ẑ) }, then for every h ∈ H_Z, the corresponding vector q′_h = z′_h / z′_{σ(h)} can be computed by:
q′_h = argmin_{q_h∈∆_{|Ω_h|}} { η⟨q_h, L_h⟩ + (α_h/2)∥q_h − q̂_h∥² },   (3.4)
where q̂_h = ẑ_h / ẑ_{σ(h)}, L_h is the slice of L whose entries are in Ω_h, and L is defined through:
L_i = f_i + Σ_{h∈H_i} ( ⟨q′_h, L_h⟩ + (α_h/(2η))∥q′_h − q̂_h∥² ).
While the definitions of q′_h and L are seemingly recursive, one can verify that they can in fact be computed in a “bottom-up” manner, starting with the terminal indices. Although Equation (3.4) still does not admit a closed-form solution, it only requires projection onto a simplex, which can be solved efficiently; see e.g. [32]. Finally, with q′_h computed for all h, z′ can be calculated in a “top-down” manner by definition.
Dilated Optimistic Multiplicative Weights Update (DOMWU) [98] Finally, define the dilated entropy regularizer Ψ^dil_α as Equation (3.3) with ψ being the vanilla entropy ψ(z) = Σ_i z_i ln z_i. Direct calculation shows Ψ^dil_α(z) = Σ_i α_{h(i)} z_i ln q_i. We call OOMD with regularizer Ψ^dil_α the Dilated Optimistic Multiplicative Weights Update algorithm (DOMWU). Similar to DOGDA, there exists an α such that Ψ^dil_α is 1-strongly convex with respect to the 2-norm [98].∗ Moreover, in contrast to all three algorithms mentioned above, the update of DOMWU has a closed-form solution:
Lemma 11 (Hoda et al. [81]). Suppose z′ = argmin_{z∈Z} { η⟨z, f⟩ + D_{Ψ^dil_α}(z, ẑ) }. Similarly to the notation q_i, define q′_i = z′_i / z′_{p_i} and q̂_i = ẑ_i / ẑ_{p_i}. Then we have
q′_i ∝ q̂_i exp(−ηL_i/α_{h(i)}),   where   L_i = f_i − Σ_{h∈H_i} (α_h/η) ln ( Σ_{j∈Ω_h} q̂_j exp(−ηL_j/α_h) ).
This lemma again implies that we can compute q′_i bottom-up, and then z′ can be computed top-down. This is similar to DOGDA, except that all updates have a closed form.
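The bottom-up computation implied by Lemma 11 can be sketched as follows. This is an illustrative snippet only (not from the thesis): it reuses the toy treeplex encoding from the earlier examples, assumes a loss entry f[i] for every index, and assumes bottom_up_order lists information sets with children before their parents.

```python
import numpy as np

def domwu_local_update(f, q_hat, info_sets, alpha, eta, bottom_up_order):
    """One DOMWU update in behavioral form: q'_i proportional to q_hat_i * exp(-eta * L_i / alpha_{h(i)}),
    with L_i accumulated bottom-up as in Lemma 11."""
    L = dict(f)                               # L_i is initialized to f_i
    q_new = {}
    for name in bottom_up_order:
        h = info_sets[name]
        # exp(logit_j) = q_hat_j * exp(-eta * L_j / alpha_h) for j in Omega_h
        logits = np.array([np.log(q_hat[j]) - eta * L[j] / alpha[name] for j in h["indices"]])
        lse = logits.max() + np.log(np.sum(np.exp(logits - logits.max())))   # ln of the normalizer
        for j, p in zip(h["indices"], np.exp(logits - lse)):
            q_new[j] = float(p)               # q'_j, normalized within the information set
        if h["parent"] != 0:                  # subtract (alpha_h / eta) * ln(...) from L_{sigma(h)}
            L[h["parent"]] -= alpha[name] / eta * lse
    return q_new
```

With q′ computed for every index, z′ is then recovered top-down via z′_i = q′_i · z′_{p_i}, exactly as described after Lemma 11.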
3.4 Last-Iterate Convergence Results
In this section, we present our main last-iterate convergence results for the global regret-minimization
algorithms discussed in Section 3.3. Before doing so, we point out again that the sequence produced by
the well-known CFR algorithm may diverge (even if the average converges to a Nash equilibrium). Indeed,
this can happen even for a simple normal-form game, as formally shown below.
Theorem 12. In the rock-paper-scissors game, CFR (with some particular initialization) produces a diverging
sequence.
In fact, we empirically observe that all of CFR, CFR+ [152] (with simultaneous updates), and their
optimistic versions [52] may diverge in the rock-paper-scissors game. We introduce the algorithms and
show the results in Appendix B.4.
On the contrary, every algorithm from the OOMD family given by Equation (3.2) ensures last-iterate
convergence, as long as the regularizer is strongly convex and continuously differentiable.
∗Kroer et al. [98] also show a better strong convexity result with respect to the 1-norm. We focus on the 2-norm here for
consistency with our other results, but our analysis can be applied to the 1-norm case as well.
Theorem 13. Consider the update rules in Equation (3.2). Suppose that ψ is 1-strongly convex with respect to the 2-norm and continuously differentiable on the entire domain, and η ≤ 1/(8P). Then z^t converges to a Nash equilibrium as t → ∞.
As mentioned, Φ^van and Ψ^van are both 1-strongly convex with respect to the 2-norm, and so are Φ^dil_α and Ψ^dil_α under some specific choice of α (in the rest of the chapter, we fix this choice of α). However, only Φ^van and Φ^dil_α are continuously differentiable on the entire domain. Therefore, Theorem 13 provides an asymptotic convergence result only for VOGDA and DOGDA, but not VOMWU and DOMWU. Nevertheless, below, we resort to different analyses to show a concrete last-iterate convergence rate for three of our algorithms, which is a much more challenging task.
First of all, note that [161, Theorem 5, Theorem 8] already provide a general last-iterate convergence
rate for VOGDA over polytopes. Since treeplexes are polytopes, we can directly apply their results and
obtain the following corollary.
Corollary 14. Define dist²(z, Z*) = min_{z*∈Z*} ∥z − z*∥². For η ≤ 1/(8P), VOGDA guarantees
dist²(z^t, Z*) ≤ 64 dist²(ẑ^1, Z*)(1 + C1)^{−t},
where C1 > 0 is some constant that depends on the game and η.
However, the result for VOMWU in [161, Theorem 3] is very specific to normal-form games (that is, when X and Y are simplexes) and thus cannot be applied here. Nevertheless, we are able to extend their analysis to get the following result.
Theorem 15. If the EFG has a unique Nash equilibrium z*, then VOMWU with step size η ≤ 1/(8P) guarantees (1/2)∥ẑ^t − z*∥² ≤ D_{Ψ^van}(z*, ẑ^t) ≤ C2/t, where C2 > 0 is some constant depending on the game, ẑ^1, and η.
We note that the uniqueness assumption is often required in the analysis of OMWU even for normal-form games [39, 161] (although [161, Appendix A.5] provides empirical evidence suggesting that this may be an artifact of the analysis). Also note that for normal-form games, [161, Theorem 3] shows a linear convergence rate, whereas here we only show a slower sublinear rate, due to additional complications introduced by treeplexes (see more discussion in the next section). Whether this can be improved is left as a future direction.
On the other hand, thanks to the closed-form updates of DOMWU, we are able to show the following linear convergence rate for this algorithm.
Theorem 16. If the EFG has a unique Nash equilibrium z*, then DOMWU with step size η ≤ 1/(8P) guarantees (1/2)∥z^t − z*∥² ≤ D_{Ψ^dil_α}(z*, z^t) ≤ C3(1 + C4)^{−t}, where C3, C4 > 0 are constants that depend on the game, ẑ^1, and η.
To the best of our knowledge, this is the first last-iterate convergence result for algorithms with dilated regularizers. Unfortunately, due to technical difficulties, we were unable to prove similar results for
DOGDA (see Appendix B.5 for more discussion). We leave that as an important future direction.
3.5 Analysis Overview
In this section, we provide an overview of our analysis. It starts from the following standard one-step
regret analysis of OOMD (see, for example, [161, Lemma 1]):
Lemma 17. Consider the update rules in Equation (3.2). Suppose that ψ is 1-strongly convex with respect to the 2-norm, ∥F(z_1) − F(z_2)∥ ≤ L∥z_1 − z_2∥ for all z_1, z_2 ∈ Z and some L > 0, and η ≤ 1/(8L). Then for any z ∈ Z and any t ≥ 1, we have
ηF(z^t)^⊤(z^t − z) ≤ D_ψ(z, ẑ^t) − D_ψ(z, ẑ^{t+1}) − D_ψ(ẑ^{t+1}, z^t) − (15/16) D_ψ(z^t, ẑ^t) + (1/16) D_ψ(ẑ^t, z^{t−1}).
Note that the Lipschitz condition on F holds in our case with L = P since
∥F(z_1) − F(z_2)∥ = √( ∥G(y_1 − y_2)∥² + ∥G^⊤(x_1 − x_2)∥² ) ≤ √( P∥z_1 − z_2∥²_1 ) ≤ P∥z_1 − z_2∥,
which is also why the step size is chosen to be η ≤ 1/(8P) in all our results. In the following, we first prove Theorem 13. Then, we review the convergence analysis of [161] for OMWU in normal-form games, and finally demonstrate how to prove Theorem 15 and Theorem 16 by building upon this previous work and addressing the additional complications from EFGs.
3.5.1 Proof of Theorem 13
For any z* ∈ Z*, by the optimality of z* we have:
F(z^t)^⊤(z^t − z*) = x^{t⊤}Gy^t − x^{t⊤}Gy^t + x^{t⊤}Gy* − x^{*⊤}Gy^t ≥ x^{*⊤}Gy* − x^{*⊤}Gy* = 0.
Thus, taking z = z* in Lemma 17 and rearranging, we arrive at
D_ψ(z*, ẑ^{t+1}) ≤ D_ψ(z*, ẑ^t) − D_ψ(ẑ^{t+1}, z^t) − (15/16) D_ψ(z^t, ẑ^t) + (1/16) D_ψ(ẑ^t, z^{t−1}).
Defining Θ^t = D_ψ(z*, ẑ^t) + (1/16) D_ψ(ẑ^t, z^{t−1}) and ζ^t = D_ψ(ẑ^{t+1}, z^t) + D_ψ(z^t, ẑ^t), we rewrite the inequality above as
Θ^{t+1} ≤ Θ^t − (15/16) ζ^t.   (3.5)
We remark that similar inequalities appear in the last chapter (see Equation (2.3) and Equation (2.4)), but here we pick z* ∈ Z* arbitrarily, while there we had to pick a particular z* ∈ Z* (such as the projection of ẑ^t onto Z*). Summing Equation (3.5) over t, telescoping, and applying the strong convexity of ψ, we have
Θ^1 ≥ Θ^1 − Θ^T ≥ (15/16) Σ_{t=1}^{T−1} ζ^t ≥ (15/32) Σ_{t=1}^{T−1} ( ∥ẑ^{t+1} − z^t∥² + ∥z^t − ẑ^t∥² ) ≥ (15/64) Σ_{t=2}^{T−1} ∥z^t − z^{t−1}∥².
Similar to the last inequality, we also have Θ^1 ≥ (15/64) Σ_{t=1}^{T−1} ∥ẑ^{t+1} − ẑ^t∥² since 2∥ẑ^{t+1} − z^t∥² + 2∥z^t − ẑ^t∥² ≥ ∥ẑ^{t+1} − ẑ^t∥². Therefore, we conclude that ∥z^t − ẑ^t∥, ∥z^{t+1} − z^t∥, and ∥ẑ^{t+1} − ẑ^t∥ all converge to 0 as t → ∞. On the other hand, since the sequence {z^1, z^2, . . .} is bounded, by the Bolzano-Weierstrass theorem, there exists a convergent subsequence, which we denote by {z^{i_1}, z^{i_2}, . . .}. Let z^∞ = lim_{τ→∞} z^{i_τ}. By ∥ẑ^t − z^t∥ → 0 we also have z^∞ = lim_{τ→∞} ẑ^{i_τ}. Now, using the first-order optimality condition of ẑ^{t+1}, we have for every z′ ∈ Z,
(∇ψ(ẑ^{t+1}) − ∇ψ(ẑ^t) + ηF(z^t))^⊤(z′ − ẑ^{t+1}) ≥ 0.
Applying this with t = i_τ for every τ and letting τ → ∞, we obtain
0 ≤ lim_{τ→∞} (∇ψ(ẑ^{i_τ+1}) − ∇ψ(ẑ^{i_τ}) + ηF(z^{i_τ}))^⊤(z′ − ẑ^{i_τ+1})   (by the first-order optimality)
= lim_{τ→∞} ηF(z^{i_τ})^⊤(z′ − ẑ^{i_τ+1})   (by ∥ẑ^{t+1} − ẑ^t∥ → 0 and the continuity of ∇ψ)
= ηF(z^∞)^⊤(z′ − z^∞)   (by z^∞ = lim_{τ→∞} z^{i_τ} = lim_{τ→∞} ẑ^{i_τ}).
This implies that z^∞ is a Nash equilibrium. Finally, choosing z* = z^∞ in the definition of Θ^t, we have lim_{τ→∞} Θ^{i_τ} = 0 because lim_{τ→∞} D_ψ(z^∞, ẑ^{i_τ}) = 0 and lim_{τ→∞} ∥ẑ^{i_τ} − z^{i_τ−1}∥ = 0. Additionally, by Equation (3.5) we also have that lim_{t→∞} Θ^t = 0 as Θ^t is non-increasing. Therefore, we conclude that the entire sequence {z^1, z^2, . . .} converges to z^∞. On the other hand, since OOMD is a regret-minimization algorithm, it is well known that the average iterate converges to a Nash equilibrium [59]. Consequently, combining the two facts above implies that z^t has to converge to a Nash equilibrium, which proves Theorem 13.
We remark that Lemma 17 holds for general closed convex domains as shown in [161]. Consequently,
with the same argument, Theorem 13 holds more generally as long as X and Y are closed convex sets.
While the argument is straightforward, we are not aware of similar results in prior works. Also note that
unlike Theorem 15 and Theorem 16, Theorem 13 holds without the uniqueness assumption for VOMWU
and DOMWU.
3.5.2 Review for Normal-Form Games
To better explain our analysis and highlight its novelty, we first review the two-stage analysis of [161]
for OMWU in normal-form games, a special case of our setting when X and Y are simplexes. Note that
both VOMWU and DOMWU reduce to OMWU in this case. As with Theorem 15 and Theorem 16, the
normal-form OMWU results assume a unique Nash equilibrium z*. With this uniqueness assumption and [119, Lemma C.4], Wei et al. [161] show the following inequality
ζ^t = D_ψ(ẑ^{t+1}, z^t) + D_ψ(z^t, ẑ^t) ≥ C5∥z* − ẑ^{t+1}∥²   (3.6)
for some problem-dependent constant C5 > 0, which, when combined with Equation (3.5), implies that if the algorithm's current iterate is far from z*, then the decrease in Θ^t is more substantial, that is, the algorithm makes more progress toward z*. To establish a recursion, however, we need to connect the 2-norm back to the Bregman divergence (a reverse direction of strong convexity). To do so, Wei et al. [161] argue that ẑ^{t+1}_i can be lower bounded by another problem-dependent constant for i ∈ supp(z*) [161, Lemma 19], where supp(z*) denotes the support of z*. This further allows them to lower bound ∥z* − ẑ^{t+1}∥ in terms of D_ψ(z*, ẑ^{t+1}) (which is just KL(z*, ẑ^{t+1})), leading to
ζ^t = D_ψ(ẑ^{t+1}, z^t) + D_ψ(z^t, ẑ^t) ≥ C6 D_ψ(z*, ẑ^{t+1})²,   (3.7)
for some C6 > 0. On the other hand, ignoring the nonnegative term D_ψ(z^t, ẑ^t), we also have:
ζ^t = D_ψ(ẑ^{t+1}, z^t) + D_ψ(z^t, ẑ^t) ≥ D_ψ(ẑ^{t+1}, z^t) ≥ (1/4) D_ψ(ẑ^{t+1}, z^t)²,   (3.8)
where the last step uses the fact that ẑ^{t+1} and z^t are close [161, Lemma 17 and Lemma 18]. Now, Equation (3.7) and Equation (3.8) together imply
6ζ^t ≥ 2C6 D_ψ(z*, ẑ^{t+1})² + D_ψ(ẑ^{t+1}, z^t)² ≥ min{C6, 1/2} (Θ^{t+1})².
Plugging this back into Equation (3.5), we obtain a recursion
Θ^{t+1} ≤ Θ^t − C7 (Θ^{t+1})²   (3.9)
for some C7 > 0, which then implies Θ^t = O(1/t) [161, Lemma 12]. This can be seen as the first and slower stage of the convergence behavior of the algorithm.
To further show a linear convergence rate, they argue that there exists a constant C8 > 0 such that when the algorithm's iterate is reasonably close to z* in the following sense:
max{∥z* − ẑ^t∥_1, ∥z* − z^t∥_1} ≤ C8,   (3.10)
the following improved version of Equation (3.7) holds (note the lack of square on the right-hand side):
ζ^t = D_ψ(ẑ^{t+1}, z^t) + D_ψ(z^t, ẑ^t) ≥ C9 D_ψ(z*, ẑ^{t+1})   (3.11)
for some constant 0 < C9 < 1. Therefore, using the 1/t convergence rate derived in the first stage, there exists a T0 such that when t ≥ T0, Equation (3.10) holds and the algorithm enters the second stage. In this stage, combining Equation (3.11) and the fact ζ^t ≥ D_ψ(ẑ^{t+1}, z^t) gives ζ^t ≥ (C9/2) Θ^{t+1}, which, together with Equation (3.5) again, implies an improved recursion Θ^{t+1} ≤ Θ^t − (15/32) C9 Θ^{t+1}. This finally shows a linear convergence rate Θ^t = O((1 + ρ)^{−t}) for some problem-dependent constant ρ > 0.
3.5.3 Analysis of Theorem 15 and Theorem 16
While we mainly follow the steps of the analysis of [161] discussed above to prove Theorem 15 and Theorem 16, we remark that the generalization is highly non-trivial. First of all, we have to prove Equation (3.6) for Z being a general treeplex, which does not follow from [119, Lemma C.4] since its proof is very specific to simplexes. Instead, we prove it by writing down the primal-dual linear program of Equation (3.1) and applying strict complementary slackness; see Appendix B.5.1 for details.
Next, to connect the 2-norm back to the Bregman divergence (which is no longer the simple KL divergence, especially for DOMWU), we prove the following for VOMWU:
D_ψ(z*, ẑ^{t+1}) ≤ Σ_{i∈supp(z*)} (z*_i − ẑ^{t+1}_i)² / ẑ^{t+1}_i + Σ_{i∉supp(z*)} ẑ^{t+1}_i ≤ 3P∥z* − ẑ^{t+1}∥ / min_{i∈supp(z*)} ẑ^{t+1}_i,   (3.12)
and the following for DOMWU:
D_ψ(z*, ẑ^{t+1}) / C′ ≤ Σ_{i∈supp(z*)} (z*_i − ẑ^{t+1}_i)² / (z*_i q̂^{t+1}_i) + Σ_{i∉supp(z*)} z*_{p_i} q̂^{t+1}_i ≤ ∥z* − ẑ^{t+1}∥_1 / min_{i∈supp(z*)} (z*_i ẑ^{t+1}_i),   (3.13)
where C′ = 4P∥α∥_∞ (see Appendix B.5.2). We then show a lower bound on z^{t+1}_i and ẑ^{t+1}_i for all i ∈ supp(z*), using arguments similar to those of [161] (see Appendix B.5.3). Combining Equation (3.12) and Equation (3.13) with Equation (3.6), we have the counterpart of Equation (3.7) for both VOMWU and DOMWU.
Showing Equation (3.8) also involves extra complications if we follow their analysis, especially for VOMWU, which does not admit a closed-form update. Instead, we find a simple workaround: by applying Equation (3.5) repeatedly, we get D_ψ(z*, ẑ^1) = Θ^1 ≥ · · · ≥ Θ^{t+1} ≥ (1/16) D_ψ(ẑ^{t+1}, z^t); thus, ζ^t ≥ D_ψ(ẑ^{t+1}, z^t) ≥ C10 D_ψ(ẑ^{t+1}, z^t)² for some C10 > 0 depending on D_ψ(z*, ẑ^1). Combining this with Equation (3.7), and applying them to Equation (3.5), we obtain the recursion Θ^{t+1} ≤ Θ^t − C11 (Θ^{t+1})² for some C11 > 0 similar to Equation (3.9), which implies Θ^t = O(1/t) for both VOMWU and DOMWU and proves Theorem 15.
Finally, to show a linear convergence rate, we need to show the counterpart of Equation (3.11), which is again more involved compared to the normal-form game case. Indeed, we are only able to do so for DOMWU by making use of its closed-form update described in Lemma 11. Specifically, observe that in Equation (3.13), the term Σ_{i∉supp(z*)} z*_{p_i} q̂^{t+1}_i is the one that prevents us from bounding D_ψ(z*, ẑ^{t+1}) by O(∥z* − ẑ^{t+1}∥²). Thus, our high-level idea is to argue that Σ_{i∉supp(z*)} z*_{p_i} q̂^{t+1}_i decreases significantly as ẑ^t gets close enough to z*. To do so, we use a bottom-up induction to prove that, for any information set h ∈ H_Z and indices i, j ∈ Ω_h such that i ∉ supp(z*) and j ∈ supp(z*), L̂^t_i is significantly larger than L̂^t_j when ẑ^t is close to z*, where L̂^t is the counterpart of L in Lemma 11 in the computation of q̂^{t+1}. This ensures that the term Σ_{i∉supp(z*)} z*_{p_i} q̂^{t+1}_i is dominated by the other term involving i ∈ supp(z*) in Equation (3.13), which eventually helps us show Equation (3.11) and the final linear convergence rate in Theorem 16. See Appendix B.5.5 for details.
Figure 3.1: Experiments on Kuhn poker (left), Pursuit-evasion (middle), and Leduc poker (right). A description of each game is given in Appendix B.2. ξ^t = max_y x^{t⊤}Gy − min_x x^⊤Gy^t is the duality gap at time step t, where (x^t, y^t) is the approximate Nash equilibrium computed by the algorithm at time t (for the optimistic algorithms this is the last iterate, while for the CFR-based algorithms it is the linear average). The legend order reflects the curve order at the right-most point. Due to much higher computational overhead than all the other algorithms, we only run VOMWU on Kuhn poker, the game with the smallest size among the three games. For each optimistic algorithm, we fine-tune the step size η to get better convergence results and show its value in the legends. There is no hyperparameter for the CFR-based algorithms. All the experiments are run on CPU on a personal computer and the total computation time is less than an hour. There is no random seed and the results are all deterministic.
3.6 Experiments
In this section, we experimentally evaluate the algorithms on three standard EFG benchmarks: Kuhn poker
[101], Pursuit-evasion [94], and Leduc poker [147]. The results are shown in Figure 3.1. Besides the optimistic algorithms, we also show two CFR-based algorithms as reference points. “CFR+” refers to CFR
with alternating updates, linear averaging [152], and regret matching+ as the regret minimizer. “CFR w/
RM+” is CFR with regret matching+ and linear averaging. We provide the formal descriptions of these two
algorithms in Appendix B.4 for completeness. For the optimistic algorithms, we plot the last iterate performance. For the CFR-based algorithms, we plot the performance of the linear average of iterates (recall
that the last iterate of CFR-based algorithms is not guaranteed to converge to a Nash equilibrium).
For Kuhn poker and Pursuit-evasion (on the left and in the middle of Figure 3.1), all of the optimistic
algorithms perform much better than CFR+, and their curves are nearly straight, showing their linear
last-iterate convergence on these games.
For Leduc poker, although CFR+ performs the best, we can still observe the last-iterate convergence trends of the optimistic algorithms. We remark that although VOGDA and DOMWU have a linear convergence rate in theory, the experiment on Leduc uses a step size η that is much larger than Corollary 14 and Theorem 16 suggest, which may void the linear convergence guarantee. This is done because the theoretical step size takes too many iterations before it starts to show improvement. It is worth noting that CFR+ improves significantly when changing simultaneous updates (that is, CFR w/ RM+) to alternating updates. Analyzing alternation and combining it with optimistic algorithms is a promising direction. We provide a description of each game, more discussions, and details of the experiments in Appendix B.2.
Chapter 4
Kernelized Multiplicative Weights: Bridging the Gap Between Learning
in Extensive-Form and Normal-Form Games
Algorithm | Per-player regret bound | Last-iter. conv.†
CFR (regret matching / regret matching+) [166] | O(√A ∥Q∥1 T^{1/2}) | no
CFR (MWU) [166] | O(√(log A) ∥Q∥1 T^{1/2}) | no
FTRL / OMD (dilated entropy) [98] | O(√(log A) 2^{D/2} ∥Q∥1 T^{1/2}) | no
FTRL / OMD (dilatable global entropy) [51] | O(√(log A) ∥Q∥1 T^{1/2}) | no
Kernelized MWU (our work) | O(√(log A) √(∥Q∥1) T^{1/2}) | no
Optimistic FTRL / OMD (dilated entropy) [98] | O(√(m log A) 2^D ∥Q∥1² T^{1/4}) | yes [105]
Optimistic FTRL / OMD (dilatable gl. ent.) [51] | O(√(m log A) ∥Q∥1² T^{1/4}) | no
Kernelized OMWU (our work) | O(m log(A) ∥Q∥1 log⁴(T)) | yes
Table 4.1: Properties of various no-regret algorithms for EFGs. All algorithms take linear time to perform an iteration. The first set of rows is for non-optimistic algorithms; the second set is for optimistic algorithms. The regret bounds are per player and apply to multiplayer general-sum games. They depend on the maximum number of actions A available at any decision point, the maximum ℓ1 norm ∥Q∥1 = max_{q∈Q} ∥q∥1 over the player's decision polytope Q, the depth D of the decision polytope, and the number of players m. Optimistic algorithms have better asymptotic regret, but worse dependence on the game constants m, A, and ∥Q∥1. Note that our algorithms achieve better dependence on ∥Q∥1 compared to all existing algorithms. †Last-iterate convergence results are for two-player zero-sum games, and some results rely on the assumption of a unique Nash equilibrium; see Section 4.4.3 for details.
Online learning results for EFGs are generally somewhat harder to come by, and have often lagged
behind results for NFGs. This is due to the more complicated combinatorial structure of the decision
spaces in EFGs. For example, the following concepts were all developed later for EFGs than for NFGs, and
sometimes with weaker guarantees: good distance measures [81, 99, 98, 51], optimistic regret-minimization
algorithms [54, 49], and last-iterate convergence results [161, 105]. Very recent NFG results such as the
polylogarithmic regret bounds for OMWU dynamics in general-sum NFGs [36] do not currently have an
analogue for EFGs.
In principle, an EFG can be represented as an NFG where each action in the NFG corresponds to an assignment of decisions at each decision point in the EFG. One could then run, e.g., OMWU on this normal-form representation and receive all the guarantees obtained for NFGs directly. However, this reduction is exponentially large in the size of the EFG representation, and for this reason the normal-form representation was viewed as impractical. This led to the development of the various more complicated approaches mentioned in the previous paragraph.
We contradict popular belief and show that it is possible to work with the normal form efficiently: we
provide a kernel-based reduction from EFGs to NFGs that allows us to simulate MWU and OMWU on the
normal-form representation, using only linear (in the EFG size) time per iteration. Our algorithm, Kernelized OMWU (KOMWU), closes the gap between NFGs and EFGs; KOMWU achieves all the guarantees
provided by the various normal-form results mentioned previously, as well as any future results on OMWU
for NFGs. As an unexpected byproduct, KOMWU obtains new state-of-the-art regret bounds among all
online learning algorithms for EFGs (see also Table 4.1); we improve the dependence on the maximum ℓ1 norm ∥Q∥1 over the sequence-form polytope Q from ∥Q∥1² to ∥Q∥1 (for the non-optimistic version we improve it from ∥Q∥1 to √(∥Q∥1)). Due to the connection between regret minimization and convergence to Nash equilibrium, this also improves the state-of-the-art bounds for converging to a Nash equilibrium at either a rate of 1/√T or 1/T by the same factor. Moreover, KOMWU achieves last-iterate convergence, and as such it is the first algorithm to achieve linear-rate last-iterate convergence with a learning rate that does not become impractically small as the game grows large (albeit under a restrictive uniqueness assumption).
More generally, we show that KOMWU can simulate OMWU for 0/1-polyhedral sets (of which the decision sets for EFGs are a special case): a decision set Ω ⊆ R^d which is convex and polyhedral, and whose vertices are all contained in {0, 1}^d. KOMWU reduces the problem of running OMWU on the vertices of the polyhedral set to d + 1 evaluations of what we call the 0/1-polyhedral kernel. Thus, given an efficient algorithm for performing these kernel evaluations, KOMWU enables one to get all the benefits of running MWU or OMWU on the simplex of vertices, while retaining the crucial property that each iteration of OMWU can be performed efficiently. In addition to EFGs, in the appendix we show that the kernel can be computed efficiently for several other settings including n-sets, unit cubes, flows on directed acyclic graphs, and permutations. As with EFGs, this immediately gives us an efficient algorithm with favorable properties such as last-iterate convergence and polylogarithmic regret for games with 0/1-polyhedral strategy sets. In particular, for n-sets, we show an improvement on the time complexity per round compared with the dynamic programming approach discussed in [151]. To the best of our knowledge, this is the state-of-the-art bound for simulating MWU/OMWU on n-sets.
Related work There were several past works on specialized online learning methods for EFGs. One
class of methods is based on specialized Bregman divergences that lead to efficient iteration updates [81,
99, 98, 51]. Combined with optimistic regret minimizers for general convex set, this yields stronger regret
bounds that take into account the variation in payoffs, and combined with the connection between regret
minimization and Nash equilibrium computation, this yields 1/T-rate convergence for two-player zerosum games [135, 150, 54]. The counterfactual regret minimization (CFR) framework Zinkevich et al. [166]
also yields efficient iteration updates. This approach yields a worse √
T regret bound, but leads to the best
practical performance in most games [96, 98, 52]. Farina et al. [49] show that it is possible to attain O(T
1/4
)
regret within the CFR framework by using OMWU at each decision point. However, the game-dependent
constants in their bound are much worse than the ones in Table 4.1.
Regret minimization over 0/1 polyhedral sets, the framework we consider, is closely related to online
combinatorial optimization problems [5], where the decision maker (randomly) selects a 0/1 vertex in each
round instead of a point in the convex hull of the set of vertices, and the regret is measured in expectation.
We review approaches related to the use of MWU here, and other less closely related approaches in Appendix C.1. One approach similar to our KOMWU is to perform MWU over vertices (e.g., Cesa-Bianchi and Lugosi [24]); the remaining problem is whether there is an efficient way to maintain and sample from the weights. Such efficient implementations have been shown in many instances such as paths [151], spanning trees [91], and m-sets [160]. The work of Takimoto and Warmuth [151] is the closest to ours: they show how to produce MWU iterates for paths in directed graphs. Our kernelized method can be seen as a significant extension of their approach to general 0/1-polyhedral games, unifying many of the previous results listed above. This unification not only results in important applications to EFGs, but also leads to improvements on previously studied problems such as n-sets.
4.1 Preliminaries
In this section we review some fundamental connections between normal-form games and no-regret learners.
4.1.1 Online Learning and Multiplicative Weights Update
Given a finite set of choices A, consider the following abstract model of a repeated decision-making problem between a decision maker and an unknown (potentially adversarial) environment. At each time t = 1, 2, . . . , the decision maker is given (or otherwise selects) a prediction vector m^t ∈ R^A. Then, the decision maker must select and output a probability distribution λ^t over A, that is, a vector λ^t ∈ ∆(A) = {λ ∈ R^A_{≥0} : Σ_{a∈A} λ[a] = 1}. Finally, the environment picks (possibly in an adversarial way) a loss vector ℓ^t ∈ R^A and shows it to the decision maker, who then suffers a loss equal to ⟨ℓ^t, λ^t⟩. Given any time T, a key quantity for the decision maker is its cumulative regret (or simply regret) up to time T,

Reg^T = Σ_{t=1}^T ⟨ℓ^t, λ^t⟩ − min_{λ̂∈∆(A)} Σ_{t=1}^T ⟨ℓ^t, λ̂⟩. (4.1)
As we recall in the next subsection, decision-making algorithms that guarantee sublinear regret (in T) in the worst case make for natural agents to learn equilibria in games. The most well-studied decision-making algorithm with that property is the optimistic multiplicative weights update (OMWU) algorithm.∗ Let ℓ^0, m^0 = 0 ∈ R^A and λ^0 = (1/|A|)1 ∈ ∆(A); then, at all times t ∈ N_{>0}, OMWU updates the distribution λ^{t−1} ∈ ∆(A) according to

λ^t[a] = λ^{t−1}[a] · e^{−η^t w^t[a]} / Σ_{a′∈A} λ^{t−1}[a′] · e^{−η^t w^t[a′]} (4.2)

for all a ∈ A, where w^t = ℓ^{t−1} − m^{t−1} + m^t and η^t > 0 is a learning rate (full pseudocode is given in Appendix C.2). The nonpredictive version of OMWU, called multiplicative weights update (MWU), is obtained from OMWU as the special case in which m^t = 0 at all t.
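For concreteness, the following minimal Python sketch (not part of the original text; the function name and numerical values are illustrative only) implements the update (4.2); setting both predictions to zero recovers MWU.

```python
import numpy as np

def omwu_step(lam_prev, loss_prev, m_prev, m_curr, eta):
    # One OMWU update (4.2): lambda^t[a] proportional to lambda^{t-1}[a] * exp(-eta * w^t[a]),
    # with w^t = ell^{t-1} - m^{t-1} + m^t.  Setting m_prev = m_curr = 0 recovers MWU.
    w = loss_prev - m_prev + m_curr
    unnormalized = lam_prev * np.exp(-eta * w)
    return unnormalized / unnormalized.sum()

# Example with |A| = 3 actions and the COLS-style prediction m^t = ell^{t-1}.
lam = np.ones(3) / 3                  # lambda^0: uniform distribution
ell_prev = np.zeros(3)                # ell^0 = 0
m_prev = np.zeros(3)                  # m^0 = 0
for ell in [np.array([0.2, 0.7, 0.1]), np.array([0.5, 0.1, 0.4])]:
    m_curr = ell_prev                 # prediction equal to the previously observed loss
    lam = omwu_step(lam, ell_prev, m_prev, m_curr, eta=0.1)
    ell_prev, m_prev = ell, m_curr    # the environment then reveals the loss ell^t
```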
4.1.2 Canonical Optimistic Learning Setup
Normal-form games (NFG) are simultaneous-move, nonsequential games in which each player picks an
action from a finite set, and receives a payoff that depends on the tuple of actions played by the players.
Formally, we represent a normal-form game as a tuple Γ = (m, {A_i}, {U_i}), where the positive integer m ∈ N_{>0} denotes the number of players, each of which is assigned a unique player number in the set [[m]] = {1, . . . , m}; the finite set A_i specifies the actions available to player i ∈ [[m]]; and U_i : A_1 × · · · × A_m → [0, 1] is the payoff function for player i ∈ [[m]]. The game is said to be zero-sum if Σ_{i∈[[m]]} U_i(a_1, . . . , a_m) = 0 for all (a_1, . . . , a_m) ∈ A_1 × · · · × A_m.

A mixed strategy for any player i ∈ [[m]] is a probability distribution λ_i ∈ ∆(A_i) over the player's action set A_i. When the players play according to mixed strategies λ_1, . . . , λ_m, the expected utility Ū_i of any player i ∈ [[m]] is defined accordingly as the function Ū_i : (λ_1, . . . , λ_m) ↦ E_{a_1∼λ_1, . . . , a_m∼λ_m}[U_i(a_1, . . . , a_m)].

∗ In the literature, OMWU is often given under the assumption that m^t = ℓ^{t−1} at all times t. In this chapter we present OMWU in its general form, that is, with no assumptions on m^t.

Because of the linearity of expectation, the expected utility function Ū_i of each player i is a multilinear function of the strategies λ_1, . . . , λ_m.
Learning in NFGs We now describe a learning setup for NFGs, which we will refer to as the canonical optimistic learning setup (COLS). In the COLS, the NFG is played repeatedly. At each time t ∈ N_{>0}, each player i ∈ [[m]] picks a mixed strategy λ_i^t ∈ ∆(A_i) according to a learning algorithm R_i, with the following choice of loss and prediction vectors:
• The loss vector ℓ_i^t is the opposite of the gradient of the expected utility of player i with respect to player i's strategy; in symbols, ℓ_i^t = −∇_{λ_i} Ū_i(λ_1^t, . . . , λ_m^t);
• The prediction vector m_i^t is defined as the previous loss m_i^t = ℓ_i^{t−1} if t ≥ 2, and m_i^1 = 0 otherwise.
This is the same setup that was used in landmark papers such as [150] and [36]. A key result in
the theory of learning in games establishes a deep connection between the COLS and coarse-correlated
equilibria (CCEs) of the game (which, in two-player zero-sum games, are Nash equilibria).
Theorem 18. Under the COLS, the average product distribution of play µ̄ = (1/T) Σ_{t=1}^T λ_1^t ⊗ · · · ⊗ λ_m^t is an O(max_{i∈[[m]]} Reg_i^T / T)-approximate CCE of the game, where Reg_i^T is the regret of player i (see Equation (4.1)).
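As an illustration of Theorem 18 in the two-player case, the following sketch computes the average product distribution µ̄ and its empirical CCE gap; the payoff matrices and strategy histories are hypothetical inputs, and the helper name is ours rather than part of the thesis.

```python
import numpy as np

def cce_gap_two_player(U1, U2, lam1_hist, lam2_hist):
    """Empirical CCE gap of the average product distribution (Theorem 18), for m = 2 players.

    U1, U2     : payoff matrices with entries U_i[a1, a2] in [0, 1]
    lam*_hist  : lists of mixed strategies lambda_i^t played over T rounds
    """
    T = len(lam1_hist)
    mu = sum(np.outer(l1, l2) for l1, l2 in zip(lam1_hist, lam2_hist)) / T
    dev1 = U1 @ mu.sum(axis=0)      # utility of player 1 deviating to each fixed action a1'
    dev2 = mu.sum(axis=1) @ U2      # utility of player 2 deviating to each fixed action a2'
    val1 = np.sum(mu * U1)          # expected utility of player 1 under mu
    val2 = np.sum(mu * U2)
    return max(dev1.max() - val1, dev2.max() - val2)
```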
When each player i learns under the COLS using OMWU with the same constant learning rate η_i^t = η as their learning algorithm R_i, the following strong properties hold for any NFG Γ = (m, {A_i}, {U_i}).
Property 19 (O(T^{1/4}) per-player regret). For all T, if η = T^{−1/4}/√(m−1), the regret of each player i ∈ [[m]] is bounded as Reg_i^T ≤ (4 + log |A_i|) √(m − 1) · T^{1/4} [150].
Property 20 (Near-optimal per-player regret). There exist universal constants C, C′ > 1 so that, for all T, if η ≤ 1/(C m log^4 T), the regret of each player i ∈ [[m]] is bounded as Reg_i^T ≤ log|A_i|/η + C′ log T [36].
Property 21 (Optimal regret sum). If η ≤ 1/√(8(m−1)), at all times T ∈ N_{>0} the sum of the players' regrets satisfies Σ_{i=1}^m Reg_i^T ≤ (m/η) max_{i∈[[m]]} log |A_i| [150].
We remark that while Property 20 provides a substantially better bound than Property 19 asymptotically, for moderate values of T, Property 19 provides a better numerical bound.
When Γ is a two-player zero-sum game, the following also holds when learning under the COLS using
OMWU.
Property 22 (Last-iterate convergence). There exists a certain schedule of learning rates η_i^t such that the players' strategies (λ_1^t, λ_2^t) converge to a Nash equilibrium of the game [82]. Furthermore, if Γ has a unique Nash equilibrium (λ_1^*, λ_2^*) and each player uses any constant learning rate η_i^t = η ≤ 1/8, at all times t the strategy profile (λ_1^t, λ_2^t) satisfies KL(λ_1^*, λ_1^t) + KL(λ_2^*, λ_2^t) ≤ C(1 + C′)^{−t}, where the constants C, C′ only depend on the game, and KL(·, ·) denotes the KL-divergence between two distributions [161].
4.2 Multiplicative Weights in Polyhedral Convex Games
A powerful generalization of normal-form games is polyhedral convex games, of which extensive-form games are an example [71]. Unlike NFGs, in which players select a mixed strategy from the probability simplex spanned by the set of available actions A_i, in a polyhedral convex game the set of "randomized strategies" from which each player i ∈ [[m]] can draw is a given convex polytope Ω_i ⊆ R^{d_i}. Analogously to NFGs, we represent a polyhedral convex game as a tuple Γ = (m, {Ω_i}, {Ū_i}), where the functions Ū_i : Ω_1 × · · · × Ω_m → [0, 1] are the multilinear utility functions for each player i ∈ [[m]].

The concepts of learning agents, equilibria, and COLS introduced in Sections 4.1.1 and 4.1.2 can be directly extended to polyhedral convex games without difficulty, by simply replacing the set of mixed strategies ∆(A_i) of each player with their convex polyhedral counterpart Ω_i.
Because the set of mixed strategies Ω of every player is a polytope, the decision problem of picking a mixed strategy x^t ∈ Ω can be equivalently thought of as the decision problem of picking a convex combination λ^t ∈ ∆(V_Ω) over the finite set of vertices V_Ω of Ω. Indeed, it is not hard to show that a learning algorithm R for Ω ⊆ R^d can be constructed from any learning algorithm R̃ for the set of vertices V_Ω, as we describe next. Let V denote the matrix whose columns are the vertices V_Ω; then:
• whenever R receives a prediction m^t ∈ R^d (resp., loss ℓ^t), it computes the vector m̃^t = V^⊤ m^t ∈ R^{V_Ω} (resp., ℓ̃^t = V^⊤ ℓ^t) and forwards it to R̃;
• whenever R̃ plays a new distribution λ^t ∈ ∆(V_Ω), the convex combination of vertices x^t = Σ_{v∈V_Ω} λ^t[v] v = Vλ^t is played by R.
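The reduction in the two bullets above amounts to two matrix multiplications, as in the following sketch (a hypothetical toy example with Ω the unit square; enumerating the columns of V explicitly is exactly what becomes intractable for large games).

```python
import numpy as np

def to_vertex_space(V, vec):
    # Map a loss/prediction over Omega (in R^d) to the vertex space R^{V_Omega}: vec_tilde = V^T vec.
    return V.T @ vec

def to_polytope_point(V, lam):
    # Map a distribution over vertices back to a point of Omega: x = V lam.
    return V @ lam

# Toy example: Omega = [0, 1]^2 (unit square), whose four vertices are the columns of V.
V = np.array([[0, 0, 1, 1],
              [0, 1, 0, 1]], dtype=float)
loss = np.array([0.3, -0.2])
loss_tilde = to_vertex_space(V, loss)   # forwarded to the regret minimizer over Delta(V_Omega)
lam = np.ones(4) / 4                    # e.g., a uniform distribution over vertices
x = to_polytope_point(V, lam)           # the point of Omega actually played
```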
It is immediate to verify that the regret cumulated by R and R̃ is equal at all times T. So, as long as R̃ guarantees sublinear regret, then so does R. In this chapter we are particularly interested in the algorithm obtained by using the above construction for the specific choice of OMWU as the algorithm R̃. We coin Vertex OMWU the resulting learning algorithm R in that case, depicted in Figure 4.1. Let ℓ^0, m^0 = 0 ∈ R^{V_Ω} and λ^0 = (1/|V_Ω|)1 ∈ ∆(V_Ω); then, at all times t ∈ N_{>0}, Vertex OMWU updates the convex combination of vertices λ^{t−1} ∈ ∆(V_Ω) according to

λ^t[v] = λ^{t−1}[v] · e^{−η^t ⟨w^t, v⟩} / Σ_{v′∈V_Ω} λ^{t−1}[v′] · e^{−η^t ⟨w^t, v′⟩}, (4.3)

where

w^t = ℓ^{t−1} − m^{t−1} + m^t ∈ R^d, (4.4)

and then outputs the iterate

Ω ∋ x^t = Σ_{v∈V_Ω} λ^t[v] · v = Vλ^t. (4.5)
It is straightforward to show that Vertex OMWU satisfies Properties 20 to 22 with |A_i| replaced with |V_{Ω_i}|, by using a black-box reduction to NFGs. Indeed, let Γ = (m, {Ω_i}, {Ū_i}) be a polyhedral convex game, and introduce the NFG Γ̃ equivalent to Γ, defined as Γ̃ = (m, {V_{Ω_i}}, {U_i}), where the action set of each player is the set of vertices V_{Ω_i}, and U_i(v_1, . . . , v_m) = Ū_i(v_1, . . . , v_m) for all (v_1, . . . , v_m) ∈ V_{Ω_1} × · · · × V_{Ω_m}. Consider the losses ℓ_i^t, predictions m_i^t, and iterates x_i^t ∈ Ω_i produced by agents learning (under the COLS) in Γ using Vertex OMWU, and the losses ℓ̃_i^t, predictions m̃_i^t, and iterates λ_i^t ∈ ∆(V_{Ω_i}) produced by agents learning (again under the COLS) in Γ̃ using OMWU. For all players i ∈ [[m]], it is immediate to verify by induction that the relationships (i) ℓ̃_i^t = V_i^⊤ ℓ_i^t, (ii) m̃_i^t = V_i^⊤ m_i^t, and (iii) x_i^t = V_i λ_i^t hold at all t, where V_i is the matrix whose columns are the vertices V_{Ω_i} (see also Figure 4.1). The above discussion shows that, in a precise sense, Vertex OMWU and OMWU are the same algorithm, just on different equivalent representations of the game. Hence, the regret cumulated by each player i in Γ matches the regret cumulated by the same player in Γ̃, showing that Properties 20 and 21 hold for Vertex OMWU. Furthermore, whenever λ_i^t converges in iterates, then clearly so does x_i^t = V_i λ_i^t, showing that Property 22 applies to Vertex OMWU as well.

The main drawback of Vertex OMWU is that it is not clear how to avoid a per-iteration complexity linear in the number of vertices of Ω, which is typically exponential in d (this is the case in extensive-form games). While different learning algorithms that guarantee polynomial per-iteration complexity in d exist, none of them is known to guarantee near-optimal per-player regret (Property 20) or last-iterate convergence (Property 22) enjoyed by Vertex OMWU, much less all three Properties 20 to 22 at the same time. In the rest of the chapter we fill this gap, by showing that in several cases of interest, Vertex OMWU can be implemented with polynomial-time (in d) iterations using a kernel trick.
Figure 4.1: Construction of the Vertex OMWU algorithm: losses and predictions for the polyhedral convex game Γ are mapped into the equivalent NFG Γ̃ via m̃^t = V^⊤ m^t and ℓ̃^t = V^⊤ ℓ^t, OMWU (4.2) produces λ^t ∈ ∆(V_Ω), and the iterate played in Γ is x^t = Vλ^t ∈ Ω (cf. (4.3), (4.5)). The matrix V has the (possibly exponentially many) vertices V_Ω of the convex polytope Ω as columns.
4.3 Kernelized Multiplicative Weights Update
In this section, we introduce Kernelized OMWU (KOMWU). Kernelized OMWU gives a way of efficiently
simulating the Vertex OMWU algorithm described in Section 4.2 on polyhedral decision sets whose vertices
have 0/1 integer coordinates, as long as a specific polyhedral kernel function can be evaluated efficiently.
We will assume that we are given a polytope Ω ⊆ R^d with (possibly exponentially many) 0/1 integral vertices V_Ω = {v_1, . . . , v_{|V_Ω|}} ⊆ {0, 1}^d. Furthermore, given a vertex v ∈ V_Ω, we will write k ∈ v as a shorthand for v[k] = 1.
We define the 0/1-polyhedral feature map ϕ_Ω : R^d → R^{V_Ω} associated with Ω as the function such that

ϕ_Ω(x)[v] = Π_{k∈v} x[k]   ∀ x ∈ R^d, v ∈ V_Ω. (4.6)

Correspondingly, the 0/1-polyhedral kernel K_Ω associated with Ω is defined as the function K_Ω : R^d × R^d → R,

K_Ω(x, y) = ⟨ϕ_Ω(x), ϕ_Ω(y)⟩ = Σ_{v∈V_Ω} Π_{k∈v} x[k] y[k]. (4.7)
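By way of illustration, definitions (4.6) and (4.7) can be evaluated by brute force when the vertex set is small, as in the sketch below (illustrative only: the enumeration is exponential in general, and avoiding it is the point of the remainder of the section; the function names are ours).

```python
import numpy as np

def feature_map(x, vertices):
    # phi_Omega(x)[v] = prod_{k in v} x[k], where "k in v" means v[k] = 1  (Equation (4.6)).
    return np.array([np.prod(x[v.astype(bool)]) for v in vertices])

def kernel(x, y, vertices):
    # K_Omega(x, y) = <phi_Omega(x), phi_Omega(y)>  (Equation (4.7)), by brute-force enumeration.
    return float(feature_map(x, vertices) @ feature_map(y, vertices))

# Toy example: Omega is the 2-dimensional probability simplex, whose vertices are the unit vectors.
vertices = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
x, y = np.array([2.0, 3.0]), np.array([1.0, 1.0])
print(kernel(x, y, vertices))   # = x[0]*y[0] + x[1]*y[1] = 5.0
```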
We show that Vertex OMWU can be simulated using d + 1 evaluations of the kernel K_Ω at every iteration. The key observation is summarized in the next theorem, which shows that the iterates λ^t produced by Vertex OMWU are highly structured, in the sense that they are always proportional to the feature mapping ϕ_Ω(b^t) for some b^t ∈ R^d.
Theorem 23. Consider the Vertex OMWU algorithm (4.3), (4.5). At all times t ≥ 0, the vector b^t ∈ R^d defined as

b^t[k] = exp{− Σ_{τ=1}^t η^τ w^τ[k]} (4.8)

for all k = 1, . . . , d, is such that

λ^t = ϕ_Ω(b^t) / K_Ω(b^t, 1). (4.9)
Proof. By induction.
• At time t = 0, the vector b^0 is b^0 = 1 ∈ R^d. By definition of the feature map (4.6), ϕ_Ω(1) = 1 ∈ R^{V_Ω}. So, K_Ω(b^0, 1) = Σ_{v∈V_Ω} 1 = |V_Ω|, and hence the right-hand side of (4.9) is (1/|V_Ω|)1, which matches the λ^0 produced by Vertex OMWU, as we wanted to show.
• Assume the statement holds up to some time t − 1 ≥ 0. We will show that it holds at time t as well. Since v has integral 0/1 coordinates, we can write

exp{−η^t ⟨w^t, v⟩} = exp{−η^t Σ_{k∈v} w^t[k]} = Π_{k∈v} exp{−η^t w^t[k]}. (4.10)

From the inductive hypothesis and (4.6), for all v ∈ V_Ω,

λ^{t−1}[v] = ϕ_Ω(b^{t−1})[v] / K_Ω(b^{t−1}, 1) = (Π_{k∈v} b^{t−1}[k]) / K_Ω(b^{t−1}, 1). (4.11)

Plugging (4.10) and (4.11) into (4.3), we have the inductive step

λ^t[v] = (Π_{k∈v} b^{t−1}[k] exp{−η^t w^t[k]}) / (Σ_{v′∈V_Ω} Π_{k∈v′} b^{t−1}[k] exp{−η^t w^t[k]}) = ϕ_Ω(b^t)[v] / K_Ω(b^t, 1)

for all v ∈ V_Ω, where in the last step we used the fact that b^t[k] = b^{t−1}[k] exp{−η^t w^t[k]} by (4.8).
The structure of λ^t uncovered by Theorem 23 can be leveraged to compute the iterate x^t produced by Vertex OMWU, i.e., the convex combination of the vertices (4.5), using d + 1 evaluations of the kernel K_Ω. We do so by extending an idea of Takimoto and Warmuth [151].
Theorem 24. Let b^t be as in Theorem 23. For each h = 1, . . . , d, let ē_h ∈ R^d be defined as the indicator vector

ē_h[k] = 1_{k≠h} = { 0 if k = h,  1 if k ≠ h }. (4.12)

Then, at all t ≥ 1, the iterate x^t ∈ Ω produced by Vertex OMWU can be written as

x^t = ( 1 − K_Ω(b^t, ē_1)/K_Ω(b^t, 1),  . . . ,  1 − K_Ω(b^t, ē_d)/K_Ω(b^t, 1) ). (4.13)
Proof. The proof crucially relies on the observation that for all h = 1, . . . , d, the feature map ϕ_Ω(ē_h) satisfies

ϕ_Ω(ē_h)[v] = Π_{k∈v} ē_h[k] = Π_{k∈v} 1_{k≠h} = 1_{h∉v},   ∀ v ∈ V_Ω.

Using the fact that ϕ_Ω(1) = 1, we conclude that

ϕ_Ω(1)[v] − ϕ_Ω(ē_h)[v] = 1_{h∈v},   ∀ h = 1, . . . , d. (4.14)

Therefore, for all k = 1, . . . , d, we obtain

x^t[k] = Σ_{v∈V_Ω} λ^t[v] · v[k]   (by (4.5))
       = Σ_{v∈V_Ω} λ^t[v] · 1_{k∈v}
       = Σ_{v∈V_Ω} λ^t[v] · (ϕ_Ω(1)[v] − ϕ_Ω(ē_k)[v])
       = (⟨ϕ_Ω(b^t), ϕ_Ω(1)⟩ − ⟨ϕ_Ω(b^t), ϕ_Ω(ē_k)⟩) / K_Ω(b^t, 1)
       = (K_Ω(b^t, 1) − K_Ω(b^t, ē_k)) / K_Ω(b^t, 1)
       = 1 − K_Ω(b^t, ē_k)/K_Ω(b^t, 1),

where the second equality follows from the integrality of v ∈ V_Ω, the third from (4.14), the fourth from Theorem 23, and the fifth from the definition of K_Ω (4.7).
Combined, Theorems 23 and 24 suggest that by keeping track of the vectors b^t instead of λ^t, updating them using Theorem 23 and reconstructing the iterates x^t using Theorem 24, Vertex OMWU can be simulated efficiently. We call the resulting algorithm, given in Algorithm 1, Kernelized OMWU (KOMWU). Similarly, we call Kernelized MWU the non-optimistic version of KOMWU obtained as the special case in which m^t = 0 at all t. In light of the preceding discussion, we have the following.

Theorem 25. Kernelized OMWU produces the same iterates x^t as Vertex OMWU when it receives the same sequence of predictions m^t and losses ℓ^t ∈ R^d. Furthermore, each iteration of KOMWU runs in time proportional to the time required to compute the d + 1 kernel evaluations {K_Ω(b^t, 1), K_Ω(b^t, ē_1), . . . , K_Ω(b^t, ē_d)}.
Algorithm 1: Kernelized OMWU (KOMWU)
 1  ℓ^0, m^0, s^0 ← 0 ∈ R^d   [▷ Initialization]
 2  for t = 1, 2, . . . do
 3      receive prediction m^t ∈ R^d of next loss   [▷ set m^t = 0 for non-predictive variant]
        [▷ Compute b^t according to Theorem 23]
 4      w^t ← ℓ^{t−1} − m^{t−1} + m^t
 5      s^t ← s^{t−1} + η^t w^t   [▷ s^t = Σ_τ η^τ w^τ]
 6      for k = 1, . . . , d do
 7          b^t[k] ← exp{−s^t[k]}   [▷ see (4.8)]
        [▷ Produce iterate x^t according to Theorem 24]
 8      x^t ← 0 ∈ R^d
 9      α ← K_Ω(b^t, 1)   [▷ K_Ω is defined in (4.7)]
10      for k = 1, . . . , d do
11          x^t[k] ← 1 − K_Ω(b^t, ē_k)/α   [▷ see (4.13)]
12      output x^t ∈ Ω and receive loss vector ℓ^t ∈ R^d
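A minimal Python sketch of Algorithm 1 is given below; it assumes access to a callable kernel(x, y) evaluating K_Ω as in (4.7), and the function names are ours, not part of the original pseudocode.

```python
import numpy as np

def komwu_iterate(s, kernel, d):
    """Produce the KOMWU iterate x^t from the running sum s^t = sum_tau eta^tau * w^tau.

    kernel(x, y) must evaluate the 0/1-polyhedral kernel K_Omega of Equation (4.7).
    """
    b = np.exp(-s)                               # b^t[k] = exp(-s^t[k]), see (4.8)
    alpha = kernel(b, np.ones(d))                # K_Omega(b^t, 1)
    x = np.empty(d)
    for k in range(d):
        e_bar = np.ones(d)
        e_bar[k] = 0.0                           # indicator vector of (4.12)
        x[k] = 1.0 - kernel(b, e_bar) / alpha    # coordinate formula (4.13)
    return x

def komwu_run(losses, predictions, eta, kernel, d):
    # losses[t] and predictions[t] are the vectors ell^t and m^t in R^d received over time.
    s = np.zeros(d)
    ell_prev, m_prev = np.zeros(d), np.zeros(d)
    iterates = []
    for ell, m in zip(losses, predictions):
        w = ell_prev - m_prev + m                # w^t = ell^{t-1} - m^{t-1} + m^t
        s = s + eta * w
        iterates.append(komwu_iterate(s, kernel, d))
        ell_prev, m_prev = ell, m                # the loss ell^t is revealed after playing x^t
    return iterates
```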
4.4 KOMWU in Extensive-Form Games
In this section, we show how the general theory we developed in Section 4.3 applies to extensive-form games, i.e., tree-form games that incorporate sequential and simultaneous moves, and imperfect information. The central result of this section, Theorem 29, shows that OMWU on the normal-form representation of any EFG can be simulated in linear time in the game tree size via KOMWU, contradicting the popular wisdom that working with the normal form of an extensive-form game is intractable.
4.4.1 Extensive-Form Games and Tree-Form Sequential Decision Problems
We now briefly recall standard concepts and notation about extensive-form games which we use in the
rest of the section. More details and an example are available in Appendix C.3.
In an m-player perfect-recall extensive-form game, each player i ∈ [[m]] faces a tree-form sequential decision problem (TFSDP). In a TFSDP, the player interacts with the environment in two ways: at decision points, the agent must act by picking an action from a set of legal actions; at observation points, the agent observes a signal drawn from a set of possible signals. We denote the set of decision points of player i as J_i. The set of actions available at decision point j ∈ J_i is denoted A_j. A pair (j, a), where j ∈ J_i and a ∈ A_j, is called a non-empty sequence. The set of all non-empty sequences of player i is denoted Σ_i^* = {(j, a) : j ∈ J_i, a ∈ A_j}. For notational convenience, we will often denote an element (j, a) of Σ_i^* as ja, without using parentheses. Given a decision point j ∈ J_i, we denote by p_j its parent sequence, defined as the last sequence (that is, decision point-action pair) encountered on the path from the root of the decision process to j. If the agent does not act before j (that is, j is the root of the process or only observation points are encountered on the path from the root to j), we let p_j be the special element ∅, called the empty sequence. We let Σ_i = Σ_i^* ∪ {∅}. Given σ ∈ Σ_i, we let C_σ = {j ∈ J_i : p_j = σ}.
An m-player extensive-form game is a polyhedral convex game (Section 4.2) Γ = (m, {Q_i}, {U_i}), where the convex polytope of mixed strategies Q_i of each player i ∈ [[m]] is called a sequence-form strategy space [136, 158, 90], and is defined as

Q_i = { x ∈ R^{Σ_i} : (1) x[∅] = 1,  (2) x[p_j] = Σ_{a∈A_j} x[ja] ∀ j ∈ J_i }.

It is known that the set of vertices of Q_i is the set of deterministic sequence-form strategies Π_i = Q_i ∩ {0, 1}^{Σ_i}. We mention the following result (see Appendix C.5).

Proposition 26. The number of vertices of Q_i is upper bounded by A^{∥Q_i∥_1}, where A = max_{j∈J_i} |A_j| is the largest number of possible actions, and ∥Q_i∥_1 = max_{q∈Q_i} ∥q∥_1.
We will often need to describe strategies for subtrees of the TFSDP faced by each player i. We use the notation j′ ⪰ j to denote the fact that j′ ∈ J_i is a descendant of j ∈ J_i, and j′ ≻ j to denote a strict descendant (i.e., j′ ⪰ j and j′ ≠ j). For any j ∈ J_i we let Σ_{i,j}^* = {j′a′ : j′ ⪰ j, a′ ∈ A_{j′}} denote the set of non-empty sequences in the subtree rooted at j. The set of sequence-form strategies for that subtree is defined as the convex polytope

Q_{i,j} = { x ∈ R^{Σ_{i,j}^*} : (1) Σ_{a∈A_j} x[ja] = 1,  (2) x[p_{j′}] = Σ_{a∈A_{j′}} x[j′a] ∀ j′ ≻ j }.

Correspondingly, we let Π_{i,j} = Q_{i,j} ∩ {0, 1}^{Σ_{i,j}^*} denote the set of vertices of Q_{i,j}, each of which is a deterministic sequence-form strategy for the subtree rooted at j.
4.4.2 Linear-Time Implementation of KOMWU
For any player i, the 0/1-polyhedral kernel K_{Q_i} associated with the player's sequence-form strategy space Q_i can be evaluated in linear time in the number of sequences |Σ_i| of that player. To do so, we introduce a partial kernel function K_j : R^{Σ_i} × R^{Σ_i} → R for every decision point j ∈ J_i,

K_j(x, y) = Σ_{π∈Π_{i,j}} Π_{σ∈π} x[σ] y[σ]. (4.15)
Theorem 27. For any vectors x, y ∈ R^{Σ_i}, the two following recursive relationships hold:

K_{Q_i}(x, y) = x[∅] y[∅] Π_{j∈C_∅} K_j(x, y), (4.16)

and, for all decision points j ∈ J_i,

K_j(x, y) = Σ_{a∈A_j} ( x[ja] y[ja] Π_{j′∈C_{ja}} K_{j′}(x, y) ). (4.17)

In particular, Equations (4.16) and (4.17) give a recursive algorithm to evaluate the polyhedral kernel K_{Q_i} associated with the sequence-form strategy space of any player i in an EFG in linear time in the number of sequences |Σ_i|.
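A minimal sketch of the recursion (4.16)-(4.17) is given below, under an assumed dictionary-based encoding of the TFSDP; the representation and names are illustrative, not the thesis's implementation.

```python
def seq_form_kernel(x, y, actions, children, EMPTY=()):
    """Evaluate the sequence-form kernel K_{Q_i}(x, y) by the recursion of Theorem 27.

    x, y     : dicts mapping sequences (EMPTY or (j, a) pairs) to reals
    actions  : dict mapping each decision point j to its list of actions A_j
    children : dict mapping each sequence sigma to the list of decision points in C_sigma
    """
    def K_j(j):
        total = 0.0
        for a in actions[j]:
            term = x[(j, a)] * y[(j, a)]
            for jp in children.get((j, a), []):   # product over C_{ja}, see (4.17)
                term *= K_j(jp)
            total += term
        return total

    value = x[EMPTY] * y[EMPTY]
    for j in children.get(EMPTY, []):             # product over C_emptyset, see (4.16)
        value *= K_j(j)
    return value

# Tiny example: a single decision point "J" with two actions; with x = y = all-ones the kernel
# counts the deterministic sequence-form strategies, here |Pi_i| = 2.
actions = {"J": ["a", "b"]}
children = {(): ["J"]}
ones = {(): 1.0, ("J", "a"): 1.0, ("J", "b"): 1.0}
print(seq_form_kernel(ones, ones, actions, children))   # -> 2.0
```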
Theorem 27 shows that the kernel K_{Q_i} can be evaluated in linear time (in |Σ_i|) at any (x, y). So, the KOMWU algorithm (Algorithm 1) can be trivially implemented for Ω = Q_i in quadratic O(|Σ_i|^2) time per iteration by directly evaluating the |Σ_i| + 1 kernel evaluations {K_{Q_i}(b^t, 1)} ∪ {K_{Q_i}(b^t, ē_σ) : σ ∈ Σ_i} needed at each iteration, where ē_σ ∈ R^{Σ_i}, defined in (4.12) for the general case, is the vector whose components are ē_σ[σ′] = 1_{σ≠σ′} for all σ, σ′ ∈ Σ_i. We refine that result by showing that an implementation of KOMWU with linear-time (i.e., O(|Σ_i|)) per-iteration complexity exists, by exploiting the structure of the particular set of kernel evaluations needed at every iteration. In particular, we rely on the following observation.
Proposition 28. For any player i ∈ [[m]], vector x ∈ R^{Σ_i}_{>0}, and sequence ja ∈ Σ_i^*,

(1 − K_{Q_i}(x, ē_{ja})/K_{Q_i}(x, 1)) / (1 − K_{Q_i}(x, ē_{p_j})/K_{Q_i}(x, 1)) = (x[ja] Π_{j′∈C_{ja}} K_{j′}(x, 1)) / K_j(x, 1).
In order to compute {K_{Q_i}(b^t, ē_σ) : σ ∈ Σ_i} in cumulative O(|Σ_i|) time, we then do the following.
1. We compute the values K_j(b^t, 1) for all j ∈ J_i in cumulative O(|Σ_i|) time by using (4.17).
2. We compute the ratio K_{Q_i}(b^t, ē_∅)/K_{Q_i}(b^t, 1) by evaluating the two kernels separately using Theorem 27, spending O(|Σ_i|) time.
3. We repeatedly use Proposition 28 in a top-down fashion along the tree-form decision problem of player i to compute the ratio K_{Q_i}(b^t, ē_{ja})/K_{Q_i}(b^t, 1) for each sequence ja ∈ Σ_i^*, given the value of the parent ratio K_{Q_i}(b^t, ē_{p_j})/K_{Q_i}(b^t, 1) and the partial kernel evaluations {K_j(b^t, 1) : j ∈ J_i} from Step 1. For each ja ∈ Σ_i^*, Proposition 28 gives a formula whose runtime is linear in the number of children decision points |C_{ja}| at that sequence. Therefore, the cumulative runtime required to compute all ratios K_{Q_i}(b^t, ē_{ja})/K_{Q_i}(b^t, 1) is O(|Σ_i|).
4. By multiplying the ratios computed in Step 3 by the value of K_{Q_i}(b^t, 1) computed in Step 2, we can easily recover each K_{Q_i}(b^t, ē_σ) for every σ ∈ Σ_i^*.
Hence, we have just proved the following.
Theorem 29. For each player i in a perfect-recall extensive-form game, the Kernelized OMWU algorithm can be implemented exactly, with a per-iteration complexity linear in the number of sequences |Σ_i| of that player.
4.4.3 KOMWU Regret Bounds and Convergence
If the players in an EFG run KOMWU, then we can combine Theorem 25 with standard OMWU regret
bounds, Proposition 26 and Properties 20 to 22 to get the following:
Theorem 30. In an EFG, after T rounds of learning under the COLS, KOMWU satisfies:
1. A player i using KOMWU with η^t = η = √(8 log(A) ∥Q_i∥_1) / √T is guaranteed to incur regret at most Reg_i^T = O(√(∥Q_i∥_1 log(A) T)).
2. For all T, if η^t = η = T^{−1/4}/√(m−1), the regret of each player i ∈ [[m]] is bounded as Reg_i^T ≤ (4 + ∥Q_i∥_1 log A) √(m − 1) · T^{1/4}.
3. There exist C, C′ > 0 such that if all m players learn using KOMWU with constant learning rate η^t = η ≤ 1/(C m log^4 T), then each player is guaranteed to incur regret at most log(A) ∥Q_i∥_1 / η + C′ log T.
4. If all m players learn using KOMWU with η^t = η ≤ 1/√(8(m − 1)), then the sum of regrets is at most Σ_{i=1}^m Reg_i^T = O(max_{i∈[[m]]} {∥Q_i∥_1 log A_i} · m/η).
5. For two-player zero-sum EFGs, if both players learn using KOMWU, then there exists a schedule of learning rates η^t such that the iterates converge to a Nash equilibrium. Furthermore, if the NFG representation of the EFG has a unique Nash equilibrium and both players use learning rates η^t = η ≤ 1/8, then the iterates converge to a Nash equilibrium at a linear rate C(1 + C′)^{−t}, where C, C′ are constants that depend on the game.
Prior to our result, the strongest regret bound for methods that take linear time per iteration was based on instantiating, e.g., follow the regularized leader (FTRL) or its optimistic variant with the dilatable global entropy regularizer of Farina, Kroer, and Sandholm [51]. For FTRL this yields a regret bound of the form O(√(log(A) ∥Q∥_1^2 T)). For optimistic FTRL this yields a regret bound of the form O(log(A) ∥Q∥_1^2 √m T^{1/4}), when every player in an m-player game uses that algorithm and appropriate learning rates.

Our algorithm improves the state-of-the-art rate in two ways. First, we improve the dependence on game constants by almost a square-root factor, because our dependence on ∥Q∥_1 is smaller by a square root compared to prior results. Secondly, in the multi-player general-sum setting, every other method achieves regret that is on the order of T^{1/4}, whereas our method achieves regret on the order of log^4(T). In the context of two-player zero-sum EFGs, the bound on the sum of regrets in Theorem 30 guarantees convergence to a Nash equilibrium at a rate of O(max_i ∥Q_i∥_1 log A_i / T). This similarly improves the prior state of the art.
Lee, Kroer, and Luo [105] showed the first last-iterate results for EFGs using algorithms that require linear time per iteration. In particular, they show that the dilated entropy DGF combined with optimistic online mirror descent leads to last-iterate convergence at a linear rate. However, their result requires learning rates η ≤ 1/(8|Σ_i|), which is impractically small. In contrast, our last-iterate linear-rate result for KOMWU allows learning rates of size 1/8. That said, our result is not directly comparable to theirs. The existence of a unique Nash equilibrium in the EFG representation is a necessary condition for uniqueness in the NFG representation. However, it is possible that the NFG has additional equilibria even when the EFG does not. Wei et al. [161] conjecture that linear-rate convergence holds even without the assumption of a unique Nash equilibrium. If this conjecture turns out to be true for NFGs, then
Theorem 25 would immediately imply that KOMWU also has last-iterate linear-rate convergence without the uniqueness assumption.

Figure 4.2: Experimental comparison of KOMWU with CFR (left two columns) and DOMWU (right two columns) in two standard poker benchmark games. Each panel plots the maximum individual regret against the iteration count for 3-player and 4-player Kuhn and Leduc poker; KOMWU is shown for learning rates η ∈ {0.1, 1, 5, 10}.
4.4.4 Experimental Evaluation
We numerically investigate agents learning under the COLS in Kuhn and Leduc poker [101, 147]. We
compare the maximum per-player regret cumulated by KOMWU for four different choices of constant
learning rate, against that cumulated by two standard algorithms from the extensive-form game solving
literature (CFR and CFR(RM+)), as well as the DOMWU algorithm introduced by Lee, Kroer, and Luo [105].
More details about the games and the algorithms are given in Appendix C.4. Results are shown in Figure 4.2.
We observe that the per-player regret cumulated by KOMWU plateaus and remains constant, unlike that of the CFR variants. This behavior is consistent with the near-optimal per-player regret guarantees of KOMWU (Theorem 30). Furthermore, we observe that KOMWU outperforms DOMWU.
Chapter 5
Near-Optimal No-Regret Learning in General Convex Games
Obtaining the optimal performance guarantees when learning agents are competing against each other in general games is a fundamental question in machine learning and game theory. It was first formulated and addressed by Daskalakis, Deckelbaum, and Kim [34] within the context of zero-sum games. Since then, there has been considerable interest in extending their guarantee to more general settings [135, 150, 58, 27, 37, 132]. In particular, Daskalakis, Fishelson, and Golowich [36] recently established that when all players in a general normal-form game employ an optimistic variant of multiplicative weights update (MWU), the regret of each player grows nearly optimally as O(log^4 T) after T repetitions of the game, leading to an exponential improvement over the guarantees obtained using traditional techniques within the no-regret framework. However, while normal-form games are a common way to represent strategic interactions in theory, most settings of practical significance inevitably involve more complex strategy spaces. For those settings, any faithful approximation of the game using the normal form is typically inefficient, requiring an action space that is exponential in the natural parameters of the problem, thereby limiting the practical implications of those prior results. This motivates our central question:

Can we establish near-optimal, efficiently implementable, and strongly uncoupled no-regret learning dynamics in general convex games? (♣)
Convex games are a rich class of games wherein the strategy space of each player is an arbitrary convex and compact set, while the utility of each player is an arbitrary concave function (see Section 5.1 for a formal description). As such, convex games encompass normal-form and extensive-form games, but go well beyond them, covering many other fundamental settings in economic theory including routing games, resource allocation problems, and competition between firms. Our primary contribution in this chapter is to substantially extend prior results to all such games, addressing Question (♣).
Our Contributions In this chapter we introduce a novel no-regret learning algorithm, which we coin
lifted log-regularized optimistic follow the regularized leader (LRL-OFTRL). LRL-OFTRL settles Question (♣)
in the positive, as summarized in the following theorem.∗
Theorem 31 (Detailed version in Theorem 39). Consider any general convex game. When all players employ
our strongly uncoupled learning dynamics (LRL-OFTRL), the regret of each player grows as O(log T). At the
same time, if the player is facing adversarial utilities, we guarantee O(√T) regret.
Importantly, our learning dynamics are efficiently implementable given access to a proximal oracle for the set (Equation (5.6)), requiring only O(log log T) operations per iteration (Theorem 42); such an oracle is weaker than the (relatively standard in convex optimization) quadratic optimization oracle. We also point out extensions under a weaker linear optimization oracle, albeit with a worse per-iteration complexity (Theorem 43). Our no-regret learning dynamics imply the first efficiently implementable and near-optimal regret guarantees in general convex games, significantly extending the scope of prior polylogarithmic-regret guarantees [36, 55]; a comparison with prior approaches is included in Table 5.1. We remark that Theorem 31 establishes near-optimal regret both under self-play and in the adversarial regime; the latter feature of adversarial robustness has been a central desideratum in this line of work.

∗ For simplicity in the exposition we use the O(·) notation in our introduction to suppress time-independent parameters that depend (polynomially) on the game; precise statements are deferred to Section 5.2.
Our proposed learning dynamics lie within the general framework of optimistic no-regret learning, pioneered by Chiang et al. [31] and Rakhlin and Sridharan [135]. We leverage the OFTRL algorithm of Syrgkanis et al. [150], but with some important twists. First, as detailed in Algorithm 2, the OFTRL optimization
step is performed over a “lifted” space. This lifting ensures that the regret incurred by OFTRL is nonnegative (Theorem 33). Further, we employ a logarithmic self-concordant regularizer; interestingly, and perhaps
surprisingly, this is not a barrier for the underlying feasible set. This deviates substantially from the typical use of self-concordant regularization (especially within the bandit setting [1, 162, 18]). A pictorial
overview of our construction is given in the caption of Algorithm 2.
The use of the logarithmic regularizer serves two main purposes. First, we show that it guarantees
multiplicative stability of the strategies, a refined notion of stability that is also leveraged in the work
of Daskalakis, Fishelson, and Golowich [36]. Nonetheless, we are the first to leverage such properties in
general domains, going well beyond the guarantees of (Optimistic) MWU on the simplex [36]. Further, the
local norm induced by the logarithmic regularizer enables us to cast regret bounds from the lifted space
to the original space, while preserving the RVU property [150, Definition 3]. In turn, this implies near-optimal regret by establishing that the second-order path lengths up to time T are bounded by O(log T) (Theorem 38), building on a recent technique of Anagnostides et al. [4] which crucially leverages the nonnegativity of swap regret.†
† To see why nonnegativity is crucial, note that the RVU bound implies an optimal bound on the sum of the players' regrets [150]. Thus, nonnegativity would imply the same bound for each player's regret.

Further Related Work The rich line of work pursuing improved regret guarantees in games was pioneered by Daskalakis, Deckelbaum, and Kim [34]. Specifically, they developed strongly uncoupled learning dynamics so that the players' regrets grow as O(log T), an exponential improvement over the guarantee one could hope for in adversarial environments [142, 25]. Their result was significantly simplified by Rakhlin and Sridharan [135], again in zero-sum games, who introduced a simple variant of mirror descent with a recency bias, a.k.a. optimistic mirror descent (OMD). It is worth noting that, beyond the benefits of optimism from an optimization standpoint [133], recency bias has been experimentally documented in natural learning environments in economics [60].

Method | Uncoupled | Applies to | Regret bound | Cost per iteration
OFTRL / OMD [150] | ✓ | General convex set | O(√n R T^{1/4}) | Regularizer- and oracle-dependent
OMWU [36] | ✓ | Simplex ∆_d | O(n log d log^4 T) | O(d)
Clairvoyant MWU [132] | ✗ | Simplex ∆_d † | O(n log d) | ‡
Kernelized OMWU [55] | ✓ | Polytope Ω = co V with V ⊆ {0, 1}^d | O(n log |V| log^4 T) | d × cost of kernel
LRL-OFTRL [Our work] | ✓ | General convex set X ⊆ R^d | O(n d ∥X∥_1^3 log T) | Oracle-dependent: O(log log T) proximal oracle calls, or O(T) linear optimization oracle calls

Table 5.1: Comparison of prior results on minimizing external regret in games. For simplicity, we have suppressed dependencies on the smoothness and the range of the utilities. We use n to denote the number of players; T to denote the number of repetitions; R to indicate a parameter that depends on the regularizer; co V to denote the convex hull of V; and ∥X∥_1 to denote a bound on the maximum ℓ_1 norm of any strategy. † Piliouras, Sim, and Skoulakis [132] suggest that some of their techniques could apply beyond simplex domains, though it is unclear whether the regret bound and efficient implementation of the dynamics would transfer. ‡ It is not clear how to compare the per-iteration cost of Clairvoyant MWU with the other methods as the former is not uncoupled and therefore does not perform parallel strategy updates.
Subsequently, Syrgkanis et al. [150] crystallized the RVU property, an adversarial regret bound applicable to a broad class of optimistic no-regret learning algorithms. Using that property, they showed that the individual regret of each player grows as O(T^{1/4}) in general games, thereby converging to the set of coarse correlated equilibria at a rate of O(T^{−3/4}). A near-optimal bound of O(log^4 T) in normal-form games was finally established by Daskalakis, Fishelson, and Golowich [36], while Farina et al. [55] generalized that result to a class of polyhedral games that includes extensive-form games. Some extensions of the previous results have also been established for the stronger notion of no-swap-regret learning dynamics in normal-form games [27, 3, 4]. In particular, our work builds on a very recent technique of Anagnostides et al. [4], which established O(log T) swap regret in normal-form games using as a regularizer a self-concordant barrier function. On the other hand, establishing even sublinear o(T) swap regret in extensive-form games is a notorious open question. Finally, an interesting new approach for obtaining near-optimal external regret in normal-form games was recently proposed by Piliouras, Sim, and Skoulakis [132]. Nevertheless, their proposed dynamics are not uncoupled;‡ hence, their regret bounds are not comparable with the aforementioned results.
Games with continuous strategy spaces have received a lot of attention in the literature; e.g., see [140,
44, 74, 82, 121, 148, 149], and references therein. Such games encompass a wide variety of applications in
economics and multiagent systems; we give several examples in Section 5.1. Indeed, in many applications
of interest a faithful approximation of the game requires an extremely large or even infinite action space;
such settings could be abstracted as Littlestone games in the sense of the recent work of Daskalakis and
Golowich [37].
5.1 No-Regret Learning and Convex Games
In this section we review the general setting of convex games§ which encompasses a number of important
applications, as explained in Section 5.1.2. We then formally define the framework of uncoupled and online
no-regret learning in games in Section 5.1.3.
5.1.1 Convex Games
Let [[n]] = {1, 2, . . . , n} be a set of players, with n ∈ N. In a convex game, every player i ∈ [[n]] has a nonempty convex and compact set of strategies X_i ⊆ R^{d_i}. For a joint strategy profile x = (x_1, . . . , x_n) ∈ ×_{j=1}^n X_j, the reward of player i is given by a continuously differentiable utility function u_i : ×_{j=1}^n X_j → R subject to the following standard assumption.

‡ During the preparation of this manuscript we were informed about a concurrent revision of the preprint by Piliouras, Sim, and Skoulakis [132] introducing an uncoupled variant of Clairvoyant OMWU. This will be addressed in a future version of this manuscript.
§ Sometimes these are referred to as concave games [137] or continuous games [82].
Assumption 2 (Convex games). The utility function u_i(x_1, . . . , x_n) of any player i ∈ [[n]] satisfies the following properties:
1. (Concavity) u_i(x_i, x_{−i}) is concave in x_i for any x_{−i} = (x_1, . . . , x_{i−1}, x_{i+1}, . . . , x_n) ∈ ×_{j≠i} X_j;
2. (Bounded gradients) for any (x_1, . . . , x_n) ∈ ×_{j=1}^n X_j, ∥∇_{x_i} u_i(x_1, . . . , x_n)∥_∞ ≤ B, for some parameter B > 0; and
3. (L-smoothness) there exists L > 0 so that for any two joint strategy profiles x, x′ ∈ ×_{j=1}^n X_j,
   ∥∇_{x_i} u_i(x) − ∇_{x_i} u_i(x′)∥_∞ ≤ L Σ_{j∈[[n]]} ∥x_j − x′_j∥_1.
5.1.2 Applications and Examples of Convex Games
Here we discuss several different classes of games which can all be analyzed under the common framework of convex games. For simplicity, we describe Cournot competition in the one-dimensional setting, but it can be readily generalized to more general domains. For more examples, we refer to [44, 82] and references therein.
Normal-Form Games In normal-form games (NFGs), every player i ∈ [[n]] has a finite and nonempty set of strategies A_i. Player i's strategy set contains all probability distributions supported on A_i; that is, X_i = ∆(A_i). The utility of player i can be expressed as the multilinear function u_i(x) = E_{a∼x}[U_i(a)], for some arbitrary function U_i : ×_{j=1}^n A_j → R.
Extensive-Form Games Extensive-form games (EFGs) generalize NFGs by capturing both sequential and simultaneous moves, stochasticity from the environment, as well as imperfect information. EFGs are abstracted on a directed tree. Once the game reaches a terminal (or leaf) node z ∈ Z, each player i ∈ [[n]] receives a utility U_i(z), for some U_i : Z → R. The strategy space of each player i ∈ [[n]] can be compactly represented using the sequence-form polytope Q_i [136, 90]. If p_c(z) is the probability of reaching terminal node z ∈ Z over "chance moves", the utility of player i can be expressed as u_i(q) = Σ_{z∈Z} p_c(z) U_i(z) Π_{j∈[[n]]} q_j[σ_{j,z}], where q = (q_1, . . . , q_n) ∈ ×_{j=1}^n Q_j is the joint strategy profile, and q_j[σ_{j,z}] is the probability mass assigned to the last sequence σ_{j,z} encountered by player j before reaching z. The smoothness and the concavity of the utilities follow directly from multilinearity; for a more detailed account of EFGs we refer the interested reader to the excellent book of Shoham and Leyton-Brown [143].
Splittable Routing Games In these games [140], every player has to route a flow f_i from a source to a destination in an undirected graph G = (V, E). Every edge e ∈ E is associated with a latency function ℓ_e(f_e) mapping the amount of flow passing through the edge to some latency. The set of strategies of player i corresponds to the possible ways of "splitting" the flow f_i into paths from the source to the destination. Under suitable restrictions on the latency functions, those games satisfy Assumption 2 (see [150]).
Cournot Competition This game is played among n firms (players). Every firm i decides the quantity s_i ∈ S_i ⊆ R_{≥0} of a common good to produce, where S_i is an interval. Further, a cost function c_i : S_i → R assigns a production cost to a given quantity, while p : ×_{i=1}^n S_i → R_{≥0} is the price of the good determined by the joint choice of quantities s = (s_1, . . . , s_n) across the firms. Then, the utility of firm i is defined as u_i(s) = s_i p(s) − c_i(s_i). In linear Cournot competition, p(s) = a − b(Σ_{i=1}^n s_i) for some a, b > 0, while the cost functions c_i are assumed to be smooth and convex [44].
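As a small illustration of this last example, the sketch below evaluates the utility and its gradient for linear Cournot competition with an assumed quadratic cost (the cost form and all parameter values are hypothetical). The gradient is exactly the feedback u_i^t = ∇_{x_i} u_i(x^t) that the learning framework of Section 5.1.3 uses.

```python
import numpy as np

def cournot_utility_and_gradient(s, i, a=10.0, b=1.0, cost_coef=0.5):
    # Linear Cournot competition with p(s) = a - b * sum_j s_j and an assumed smooth convex
    # cost c_i(s_i) = cost_coef * s_i^2, chosen only for illustration.
    price = a - b * np.sum(s)
    utility = s[i] * price - cost_coef * s[i] ** 2
    grad = price - b * s[i] - 2.0 * cost_coef * s[i]   # d u_i / d s_i
    return utility, grad

s = np.array([2.0, 3.0, 1.0])   # production quantities of n = 3 firms
print(cournot_utility_and_gradient(s, i=0))   # -> (6.0, 0.0)
```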
5.1.3 Online Linear Optimization and No-Regret Learning
In the online learning framework, a learning agent has to select a strategy x^t ∈ X ⊆ R^d at every time t ∈ N. Then, in the full information model, the learner receives as feedback from the environment a linear utility function x ↦ ⟨x, u^t⟩, for some vector u^t ∈ R^d. The canonical measure of performance is the notion of regret, defined for a time horizon T ∈ N as follows:

Reg^T = max_{x^*∈X} { Σ_{t=1}^T ⟨x^*, u^t⟩ } − Σ_{t=1}^T ⟨x^t, u^t⟩. (5.1)
That is, the performance of the agent is compared to the optimal fixed strategy in hindsight. It is important to note that regret can be negative. In the context of convex games, it is assumed that every player i ∈ [[n]] receives at time t the "linearized" utility function x_i ↦ ⟨x_i, u_i^t⟩, where u_i^t = ∇_{x_i} u_i(x^t). By concavity (Assumption 2),

max_{x_i^*∈X_i} Σ_{t=1}^T ( u_i(x_i^*, x_{−i}^t) − u_i(x^t) ) ≤ max_{x_i^*∈X_i} Σ_{t=1}^T ⟨x_i^* − x_i^t, ∇_{x_i} u_i(x^t)⟩.

As a result, a regret bound on the linearized regret, in the sense of (5.1), automatically translates to a regret bound in the convex game.
Strongly Uncoupled Learning Dynamics In this setting, all learning dynamics are uncoupled in the
sense of Hart and Mas-Colell [77]: every player is oblivious to the other players’ utilities. In fact, players need not have any prior knowledge about the game whatsoever, even about their own utilities; this
essentially captures the condition of strong uncoupledness of Daskalakis, Deckelbaum, and Kim [34].
For simplicity, and without loss of generality, for the rest of this chapter we will assume that ∥u_i^t∥_∞ ∥x_i∥_1 ≤ 1 for any time t ∈ N, player i ∈ [[n]], and x_i ∈ X_i.
5.2 Near-Optimal No-Regret Learning in Convex Games
In this section we describe our algorithm, Log-Regularized Lifted Optimistic FTRL (henceforth LRL-OFTRL). The central result of this section, Theorem 39, asserts that when all players learn using LRL-OFTRL, their regret only grows logarithmically with respect to the number of repetitions of the game. Detailed proofs for this section are available in Appendix C.5.
5.2.1 Setup
In the sequel, we will define and analyze the regret cumulated by LRL-OFTRL from the perspective of a
generic player, omitting player subscripts.
We denote the set of strategies of the player by X ⊆ R^d. Without loss of generality, we will assume that X ⊆ [0, +∞)^d; otherwise, it suffices to first shift the set. Furthermore, we assume without loss of generality that there is no index r ∈ [[d]] such that x[r] = 0 for all x ∈ X; if not, dropping the identically-zero dimension would not alter regret. We define the lifting of the set X as the following set:

R^{d+1} ⊇ X̃ = {(λ, y) : λ ∈ [0, 1], y ∈ λX}. (5.2)

Further, we define the ℓ_1-norm ∥X∥_1 of X as the maximum ℓ_1-norm of any vector x ∈ X, that is, ∥X∥_1 = max_{x∈X} ∥x∥_1; for example, ∥∆_d∥_1 = 1.
The logarithmic regularizer for R^{d+1}_{>0} is the function

R(λ, y) = −log λ − Σ_{r=1}^d log y[r],   ∀ (λ, y) ∈ R^{d+1}_{>0}.

Given any vector (λ, y) ∈ X̃ ∩ R^{d+1}_{>0}, we denote with ∥·∥_{(λ,y)} and ∥·∥_{*,(λ,y)} the local norms centered at (λ, y) induced by R(λ, y), defined as

∥(a, z)∥_{(λ,y)} = √( (a/λ)^2 + Σ_{r=1}^d (z[r]/y[r])^2 ),   ∥(a, z)∥_{*,(λ,y)} = √( (aλ)^2 + Σ_{r=1}^d (z[r] y[r])^2 )

for any (a, z) ∈ R^{d+1}. These are the norms induced by the Hessian matrix of R at (λ, y) and its inverse. It is a well-known fact that ∥·∥_{*,(λ,y)} is the dual norm of ∥·∥_{(λ,y)}, and vice versa.
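These local norms are straightforward to evaluate; the following minimal sketch (illustrative names and values, not part of the original text) computes the primal-dual pair at a given point.

```python
import numpy as np

def local_norms(point, vec):
    """Primal and dual local norms induced by the logarithmic regularizer R at (lambda, y).

    point = (lambda, y[1], ..., y[d]) with all entries > 0, and vec = (a, z[1], ..., z[d]).
    """
    primal = np.sqrt(np.sum((vec / point) ** 2))   # ||(a, z)||_{(lambda, y)}
    dual = np.sqrt(np.sum((vec * point) ** 2))     # ||(a, z)||_{*, (lambda, y)}
    return primal, dual

point = np.array([0.5, 0.2, 0.3])                  # (lambda, y) with d = 2
vec = np.array([0.1, -0.05, 0.02])
print(local_norms(point, vec))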
5.2.2 Overview of Our Algorithm
Our algorithm (Algorithm 2) leverages optimistic follow the regularized leader (OFTRL), a simple variant of FTRL introduced by Syrgkanis et al. [150], but with some important twists. First, the optimization is performed over the lifting X̃ of the set X. More precisely, at every iteration the observed utility u^t ∈ R^d is transformed to ũ^t ∈ R^{d+1} according to Line 6; this ensures that ũ^t is orthogonal to the vector (1, x^t). Then, this utility vector ũ^t is given as input to a regret minimizer operating over X̃, employing OFTRL under the logarithmic regularizer R(λ, y); this step is described in Line 3. We discuss how such an optimization problem can be solved efficiently in Section 5.2.5. Below we point out that Line 3 is indeed well-defined.
Proposition 32. For any η ≥ 0 and at all times t ∈ N, the OFTRL optimization problem on Line 3 of Algorithm 2 admits a unique optimal solution (λ^t, y^t) ∈ X̃ ∩ R^{d+1}_{>0}.
Finally, given the iterate (λ^t, y^t) output by the OFTRL step at time t, our regret minimizer over X selects the next strategy x^t = y^t/λ^t (Line 4); this is indeed a valid strategy in X by the definition of X̃ in (5.2), together with the fact that λ^t > 0, as asserted in Proposition 32.
5.2.3 Regret Analysis
In this section, we study the regret of LRL-OFTRL under the idealized assumption that the optimization
problem on Line 3 (OFTRL step) is solved exactly at each time t. In Section 5.2.5 we will relax that assumption, and study the regret of LRL-OFTRL when the solution to Line 3 is approximated using variants of
Newton’s method.
Algorithm 2: Log-Regularized Lifted Optimistic FTRL (LRL-OFTRL)
[Schematic: u^t → (Lifting) → ũ^t → (OFTRL with log regularizer) → (λ^t, y^t) → (Normalization) → x^t = y^t/λ^t]
Data: learning rate η
1  Set Ũ^1, ũ^0 ← 0 ∈ R^{d+1}
2  for t = 1, 2, . . . , T do
3      Set (λ^t, y^t) ← argmax_{(λ,y)∈X̃} { η ⟨Ũ^t + ũ^{t−1}, (λ, y)⟩ + log λ + Σ_{r=1}^d log y[r] }   [▷ OFTRL]
4      Play strategy x^t = y^t/λ^t ∈ X   [▷ Normalization]
5      Observe u^t ∈ R^d
6      Set ũ^t ← (−⟨u^t, x^t⟩, u^t)   [▷ Lifting]
7      Set Ũ^{t+1} ← Ũ^t + ũ^t
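The lifting and normalization steps of Algorithm 2 are mechanical; only Line 3 requires care. The following sketch makes this explicit, treating the OFTRL step as an abstract solver (oftrl_argmax and u_observed_fn are hypothetical callables; implementing the former is the subject of Section 5.2.5).

```python
import numpy as np

def lrl_oftrl_round(U_tilde, u_tilde_prev, u_observed_fn, eta, oftrl_argmax):
    """One round of Algorithm 2, assuming an (approximate) solver for the OFTRL step on Line 3.

    oftrl_argmax(g) should return (lam, y) maximizing <g, (lam, y)> + log lam + sum_r log y[r]
    over the lifted set X_tilde.
    """
    g = eta * (U_tilde + u_tilde_prev)
    lam, y = oftrl_argmax(g)                        # Line 3 (OFTRL over the lifted set)
    x = y / lam                                     # Line 4 (normalization back to X)
    u = u_observed_fn(x)                            # Line 5 (utility feedback)
    u_tilde = np.concatenate(([-np.dot(u, x)], u))  # Line 6 (lifting: orthogonal to (1, x))
    return x, U_tilde + u_tilde, u_tilde            # Line 7 (accumulate utilities)
```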
To study the regret Reg^T of LRL-OFTRL, defined in (5.1), it is useful to introduce the quantity R̃eg^T, which measures the regret incurred by the internal OFTRL algorithm (Line 3) up to a time T ∈ N in the lifted space X̃, i.e.,

R̃eg^T = max_{(λ^*, y^*)∈X̃} Σ_{t=1}^T ⟨ũ^t, (λ^*, y^*) − (λ^t, y^t)⟩.

As the following theorem clarifies, there is a strong connection between R̃eg^T and Reg^T.

Theorem 33. For any time T ∈ N it holds that R̃eg^T = max{0, Reg^T}. In particular, it follows that R̃eg^T ≥ 0 and Reg^T ≤ R̃eg^T for any T ∈ N.
The nonnegativity of R̃eg^T will be a crucial property in establishing Theorem 38. Further, Theorem 33 implies that a guarantee over the lifted space can be automatically translated to a regret bound over the original space X. Now let

∥·∥_t = ∥·∥_{(λ^t, y^t)} and ∥·∥_{*,t} = ∥·∥_{*,(λ^t, y^t)} (5.3)

be the local norms centered at the point (λ^t, y^t) produced by OFTRL at time t (Line 3). In the next proposition we establish a refined RVU (Regret bounded by Variation in Utilities) bound in terms of this primal-dual norm pair. Before we proceed with its statement, we recall the assumption that ∥u^t∥_∞ ∥x∥_1 ≤ 1, which will be tacitly assumed throughout this section.

Proposition 34 (RVU bound of OFTRL in local norms). Let R̃eg^T be the regret cumulated up to time T by the internal OFTRL algorithm. Then, for any time horizon T ∈ N and learning rate η ≤ 1/50,

R̃eg^T ≤ (4 + (d + 1) log T)/η + 5η Σ_{t=1}^T ∥ũ^t − ũ^{t−1}∥_{*,t}^2 − (1/(27η)) Σ_{t=1}^{T−1} ∥(λ^{t+1}, y^{t+1}) − (λ^t, y^t)∥_t^2.
(We recall that ũ^0 = 0.) Proposition 34 differs from prior analogous results in that the regularizer is not a barrier over the feasible set. Next, we show that the iterates produced by OFTRL satisfy a refined notion of stability, which we refer to as multiplicative stability.

Proposition 35 (Multiplicative stability). For any time t ∈ N and learning rate η ≤ 1/50,

∥(λ^{t+1}, y^{t+1}) − (λ^t, y^t)∥_t ≤ 22η.

Intuitively, this property ensures that the coordinates of successive iterates will have a small multiplicative deviation. We leverage this refined notion of stability to establish the following crucial lemma.

Lemma 36. For any time t ∈ N and learning rate η ≤ 1/50,

∥x^{t+1} − x^t∥_1 ≤ 4∥X∥_1 ∥(λ^{t+1}, y^{t+1}) − (λ^t, y^t)∥_t.
Combining this lemma with Proposition 34 allows us to obtain an RVU bound for the original space X, with no dependencies on local norms.

Corollary 37 (RVU bound in the original (unlifted) space). For any time T ∈ N and η ≤ 1/256,

R̃eg^T ≤ 6 + ((d + 1) log T)/η + 16η∥X∥_1^2 Σ_{t=1}^{T−1} ∥u^{t+1} − u^t∥_∞^2 − (1/(512η∥X∥_1^2)) Σ_{t=1}^{T−1} ∥x^{t+1} − x^t∥_1^2.
5.2.4 Main Result
So far, in Section 5.2.3, we have performed the analysis from the perspective of a single player, obtaining regret bounds that apply under an arbitrary sequence of utilities. Next, we assume that all players follow our dynamics, so that the variation in one player's utilities is related to the variation in the joint strategies through the smoothness condition of the utility function, connecting the last two terms of the RVU bound. Further leveraging the nonnegativity of the regrets in the lifted space, we establish that the second-order path lengths of the dynamics up to time T are bounded by O(log T):

Theorem 38. Suppose that Assumption 2 holds for some parameter L > 0. If all players follow LRL-OFTRL with learning rate η ≤ min{1/256, 1/(128nL∥X∥_1^2)}, where ∥X∥_1 = max_{i∈[[n]]} ∥X_i∥_1, then

Σ_{i=1}^n Σ_{t=1}^{T−1} ∥x_i^{t+1} − x_i^t∥_1^2 ≤ 6144nη∥X∥_1^2 + 1024n(d + 1)∥X∥_1^2 log T.
We leverage this theorem to establish Theorem 31, the detailed version of which is given below.

Theorem 39 (Detailed version of Theorem 31). Suppose that Assumption 2 holds for some parameter L > 0. If all players follow LRL-OFTRL with learning rate η = min{1/256, 1/(128nL∥X∥_1^2)}, then for any T ∈ N the regret Reg_i^T of each player i ∈ [[n]] can be bounded as

Reg_i^T ≤ 12 + 256(d + 1) max{nL∥X∥_1^2, 2} log T. (5.4)

Furthermore, the algorithm can be adaptive so that, if player i is instead facing adversarial utilities, then Reg_i^T = O(√T).
Recall that we have assumed that ∥u_i^t∥_∞ ∥x_i∥_1 ≤ 1; if instead we only knew that ∥u_i^t∥_∞ ≤ B (Assumption 2), the aforementioned assumption can be met by rescaling the utilities by 1/(B∥X∥_1), introducing an additional multiplicative factor of B∥X∥_1 in the regret bounds (see Table 5.1). For clarity, below we cast (5.4) of Theorem 39 in normal-form games with utilities normalized in [−1, 1].

Corollary 40 (Normal-form games). Suppose that all players in a normal-form game with n ≥ 2 follow LRL-OFTRL with learning rate η = 1/(128n). Then, for any T ∈ N and player i ∈ [[n]],

Reg_i^T ≤ 12 + 256n(d + 1) log T.
5.2.5 Implementation and Iteration Complexity
In this section, we discuss the implementation and iteration complexity of LRL-OFTRL. The main difficulty in the implementation is the computation of the solution to the strictly concave nonsmooth constrained optimization problem in Line 3. We start by studying how the guarantees laid out in Theorem 39 are affected when the exact solution to the OFTRL problem (Line 3) in Algorithm 2 is replaced with an approximation. Specifically, suppose that at all times t the solution to the OFTRL step (Line 3) in Algorithm 2 is only approximately solved within tolerance ϵ^t, in the sense that

∥(λ^t, y^t) − (λ_⋆^t, y_⋆^t)∥_{(λ_⋆^t, y_⋆^t)} ≤ ϵ^t, (5.5)
where (λ^t, y^t) ∈ R^{d+1}_{>0} and

(λ_⋆^t, y_⋆^t) = argmax_{(λ,y)∈X̃} { η ⟨Ũ^t + ũ^{t−1}, (λ, y)⟩ + log λ + Σ_{r=1}^d log y[r] }.

Then, it can be proven directly from the definition of regret that the guarantees given in Corollary 37 deteriorate by an additive factor proportional to the sum of the tolerances Σ_{t=1}^T ϵ^t. As an immediate corollary, when ϵ^t = ϵ = 1/T, the conclusion of Theorem 39 applies even when the solution to the optimization problem on Line 3 is only approximated up to ϵ tolerance. Therefore, to complete our construction, it suffices to show that it is indeed possible to efficiently compute approximate solutions to the OFTRL step. In the remainder of this section, we show that this is indeed the case assuming access to two different types of oracles.
Proximal Oracle First, we will assume access to a proximal oracle in local norm for the set X̃, that is, access to a function that is able to compute the solution to the (positive-definite) quadratic optimization problem

Π_{w̃}(g̃) = argmin_{x̃∈X̃} { g̃^⊤ x̃ + (1/2) ∥x̃ − w̃∥_{w̃}^2 } = argmin_{x̃∈X̃} { g̃^⊤ x̃ + (1/2) Σ_{r=1}^{d+1} ( x̃[r]/w̃[r] − 1 )^2 } (5.6)

for arbitrary centers w̃ ∈ R^{d+1}_{>0} and gradients g̃ ∈ R^{d+1}. For certain sets X ⊆ R^d, exact proximal oracles with polynomial complexity in the dimension d can be given. In particular, we show that this is the case when X is the strategy set of normal-form and extensive-form games, by extending a technique of Gilpin [65], as formalized below.
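As a purely numerical illustration of the oracle (5.6), the following sketch solves the proximal step with a general-purpose solver for the special case X = ∆_d, so that X̃ is described by linear constraints; this is for intuition only and is unrelated to the exact oracles discussed in Proposition 41 below (the solver choice and starting point are assumptions of ours).

```python
import numpy as np
from scipy.optimize import minimize

def proximal_oracle_lifted_simplex(g, w):
    # Numerical sketch of (5.6) when X is the simplex Delta_d, so that
    # X_tilde = {(lam, y): 0 <= lam <= 1, y >= 0, sum_r y[r] = lam}.
    d1 = len(w)                                        # dimension d + 1

    def objective(x):
        return float(g @ x + 0.5 * np.sum((x / w - 1.0) ** 2))

    cons = [{"type": "eq", "fun": lambda x: np.sum(x[1:]) - x[0]}]   # sum_r y[r] = lam
    bounds = [(0.0, 1.0)] + [(0.0, None)] * (d1 - 1)
    x0 = np.append(1.0, np.ones(d1 - 1) / (d1 - 1))    # feasible start: lam = 1, uniform y
    res = minimize(objective, x0, bounds=bounds, constraints=cons, method="SLSQP")
    return res.x

g = np.array([0.1, -0.3, 0.2])
w = np.array([0.8, 0.4, 0.4])
print(proximal_oracle_lifted_simplex(g, w))
```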
Proposition 41. Let X ⊆ R^d be the polytope of sequence-form strategies for a player in a perfect-recall extensive-form game. Then, the local proximal oracle Π_{w̃}(g̃) defined in (5.6) can be implemented exactly in time polynomial in the dimension d, for any w̃ ∈ R^{d+1}_{>0} and g̃ ∈ R^{d+1}.

We provide the details and a more precise statement in Appendix D.2. In this context, the following guarantee employs the proximal Newton algorithm of Tran-Dinh, Kyrillidis, and Cevher [154]; see Algorithm 15.
Theorem 42 (Proximal Newton). Given any ϵ > 0, it is possible to compute (λ^t, y^t) ∈ X̃ ∩ R^{d+1}_{>0} such that (5.5) holds for ϵ^t = ϵ using O(log log(1/ϵ)) operations and O(log log(1/ϵ)) calls to the proximal oracle defined in Equation (5.6).

Linear Maximization Oracle Moreover, we consider having access to a weaker linear maximization oracle (LMO) for the set X:

L_X(u) = argmax_{x∈X} ⟨x, u⟩. (5.7)

Such an oracle is more realistic in many settings [86], and it is particularly natural in the context of games, where it can be thought of as a best-response oracle. We point out that an LMO for X automatically implies an LMO for X̃. The following guarantee follows readily by applying the Frank-Wolfe (projected) Newton method [112, Algorithms 1 and 2].

Theorem 43 (Frank-Wolfe Newton). Given any ϵ > 0, it is possible to compute (λ^t, y^t) ∈ X̃ ∩ R^{d+1}_{>0} such that (5.5) holds for ϵ^t = ϵ using O(1/ϵ) operations and O(1/ϵ) calls to the LMO oracle defined in Equation (5.7).
Chapter 6
No-Regret and Last-Iterate Convergence Properties of Regret Matching
Algorithms in Games: Bridging the Gap Between Theory and Practice
Regret minimization is an important framework for solving games. Its connection to game theory provides a practically efficient way to approximate game-theoretic equilibria [59, 76]. Moreover, it provides a scalable way to solve large-scale sequential games, for example, using the Counterfactual Regret Minimization (CFR) decomposition [166]. Consequently, regret minimization algorithms are a central component in recent superhuman poker AIs [10, 126, 15]. Regret Matching+ (RM+) [152] is the most prevalent regret minimizer in these applications. In theory, it guarantees an O(1/√T) convergence rate after T iterations, but its practical performance is usually significantly faster.
On the other hand, a line of recent works shows that regret minimizers based on follow-the-regularized-leader (FTRL) or online mirror descent (OMD) enjoy faster convergence rates in theory when combined with the concept of optimism/predictiveness [135, 150]. The result was originally proven in matrix games
[135], and later extended to multiplayer normal-form games [150, 27, 36], extensive-form games [49, 54,
55, 7], and general convex games [82, 46]. However, despite their favorable properties in theory, optimistic algorithms based on FTRL/OMD are usually numerically inferior to RM+ when applied to solving
large-scale sequential games. It remains a mystery whether some optimistic variant of RM+ enjoys a theoretically faster convergence rate, considering the strong empirical performance of RM+. It is also an
84
open question whether there exists an algorithm that has both favorable theoretical guarantees similar to
FTRL/OMD algorithms and practical performance comparable to RM+.
Additionally, very little is known about the last-iterate convergence properties of RM+ and its variants.∗ Inspired by recent work on the connection between OMD and RM+ [52], we provide new insights
on the theoretical and empirical behavior of RM+-based algorithms, and we show that the analysis of
fast convergence for OMD can be extended to RM+ with some simple modifications to the algorithm.
Specifically, our main contributions can be summarized as follows.
1. We provide a detailed theoretical and empirical analysis of the potential causes of the slow performance of RM+ and predictive RM+. We start by showing that, in stark contrast to FTRL/OMD algorithms, which are inherently stable, there exist loss sequences that make RM+ and its variants unstable, leading to cycling between very different strategies. The key reason for such instability is that the decisions of these algorithms are chosen by normalizing an aggregate payoff vector; thus, in a region close to the origin, two consecutive aggregate payoffs may point in very different directions, despite being close, resulting in unstable iterates. Note that, perhaps surprisingly, this can only happen when the aggregate payoff vectors, which essentially measure the algorithm's regret against each action, are small, so instability can only occur when one's own regret is small and is thus seemingly not an issue. However, in a game setting, such instability might cause other players to suffer large regret because they have to learn in an unpredictable environment. Indeed, we identify a $3 \times 3$ matrix game where this is the case and both RM+ and predictive RM+ converge slowly at a rate of $O(1/\sqrt{T})$ (Figure 6.1). We emphasize that very little was previously known about the properties of (predictive) RM+, and we are the first to show concrete examples of stability issues in matrix games and the adversarial setting.
∗We note that a recent work [118] does study the last-iterate convergence of RM+ variants, but for strongly-convex strongly-concave games, which is incomparable to our matrix game setting.
Algorithms                                           | Social regret in multi-player NFGs
RM+ [76]                                             | $O(T^{1/2})$
Predictive RM+ [54]                                  | $O(T^{1/2})$
Stable Predictive RM+ (Alg. 3)                       | $O(T^{1/4})$
Smooth Predictive RM+ (Alg. 4)                       | $O(1)$
Conceptual RM+ (Alg. 5)                              | $O(1)$
Approximate Conceptual RM+ (Alg. 6 with $k = \log(T)$) | $O(1)$
Extragradient RM+ (Alg. 7)                           | $O(1)$

Table 6.1: Summary of regret guarantees for the algorithms studied in Chapter 6. The constants hidden in the $O(\cdot)$ notations depend on initialization and the dimensions of the games and are given in our theorems.
2. Motivated by our counterexamples, we propose two methods to stabilize RM+: restarting, which reinitializes the algorithm when the aggregate payoffs are all below a threshold, and chopping off the region of the nonnegative orthant near the origin to smooth the algorithm. When applying these techniques to online learning with RM+, we show improved regret and fast convergence similar to predictive OMD: we obtain $O(T^{1/4})$ individual regret for Stable Predictive RM+ (which uses restarting) and $O(1)$ social regret for Smooth Predictive RM+ (which chops off the origin). We also consider conceptual-prox and extragradient versions of RM+ for normal-form games. We show that our stabilizing ideas also provide the required stability in these settings and thus give strong theoretical guarantees: Conceptual RM+ achieves $O(1)$ individual regret (Theorem 52) while Extragradient RM+ achieves $O(1)$ social regret (Theorem 55). See Table 6.1 for a summary of our results for normal-form games. We further extend Conceptual RM+ to extensive-form games (EFGs), yielding $O(1)$ regret in $T$ iterations with $O(T \log(T))$ gradient computations. The key step here is to show the Lipschitzness of the CFR decomposition (Lemma 124).
3. We apply our algorithms to solve matrix games and EFGs. For the $3 \times 3$ matrix game instability counterexample, our algorithms indeed perform significantly better than (predictive) RM+. For random matrix games, we find that Stable and Smooth Predictive RM+ have very strong empirical performance, on par with (unstabilized) Predictive RM+, and greatly outperform RM+ in all our experiments; Extragradient RM+ appears to be more sensitive to the choice of step sizes and sometimes performs only as well as RM+. Our experiments on four different EFGs show that our implementation of clairvoyant CFR outperforms predictive CFR in some, but not all, instances.
4. We provide numerical evidence that RM+ and important variants of RM+, including alternating RM+ and PRM+, may fail to have asymptotic last-iterate convergence. Conversely, we also prove that RM+ does have last-iterate convergence in a very restrictive setting, where the matrix game admits a strict Nash equilibrium. We then study the convergence properties of two variants of RM+: ExRM+ (Algorithm 11) and SPRM+ (Algorithm 12). For these two algorithms, we first prove the asymptotic convergence of the last iterate (without providing a rate). We then show an $O(1/\sqrt{t})$ rate for the duality gap of the best iterate after $t$ iterations. Building on this last observation, we finally introduce new variants of ExRM+ and SPRM+ that restart whenever the distance between two consecutive iterates has been halved, and prove that they enjoy linear last-iterate convergence.

5. We verify the last-iterate convergence of ExRM+ and SPRM+ (including their restarted variants that we propose) numerically on four instances of matrix games, including Kuhn poker and Goofspiel. We also note that while vanilla RM+, alternating RM+, and PRM+ may not converge, alternating PRM+ exhibits surprisingly fast last-iterate convergence.
6.1 Preliminaries
Online Linear Minimization. In online linear minimization, at every decision period $t \ge 1$, an algorithm chooses a decision $x^t$ from a convex decision set $\mathcal{X}$. A loss vector $\ell^t$ is chosen arbitrarily and an instantaneous loss of $\langle \ell^t, x^t \rangle$ is incurred. The regret of an algorithm generating the sequence of decisions $x^1, \ldots, x^T$ is defined as the difference between the cumulative loss incurred and that of any fixed strategy $\hat{x} \in \mathcal{X}$: $\mathrm{Reg}^T(\hat{x}) = \sum_{t=1}^{T} \langle \ell^t, x^t - \hat{x} \rangle$. A regret minimizer guarantees that $\mathrm{Reg}^T(\hat{x}) = o(T)$ for any $\hat{x} \in \mathcal{X}$.
Online Mirror Descent. A famous regret minimizer is Online Mirror Descent (OMD) [128], which generates the decisions $x^1, \ldots, x^T$ as follows (with a learning rate $\eta > 0$):
$$x^{t+1} = \Pi_{x^t, \mathcal{X}}\left(\eta \ell^t\right) \tag{OMD}$$
where for any $x \in \mathcal{X}$ and any loss $\ell$, the proximal operator $\ell \mapsto \Pi_{x,\mathcal{X}}(\ell)$ is defined as $\Pi_{x,\mathcal{X}}(\ell) = \operatorname*{argmin}_{\hat{x} \in \mathcal{X}} \langle \ell, \hat{x} \rangle + D(\hat{x}, x)$, where $D$ is the Bregman divergence associated with $\varphi : \mathcal{X} \to \mathbb{R}$, a 1-strongly convex regularizer (with respect to some norm $\|\cdot\|$): $D(\hat{x}, x) = \varphi(\hat{x}) - \varphi(x) - \langle \nabla \varphi(x), \hat{x} - x \rangle$ for all $\hat{x}, x \in \mathcal{X}$. OMD guarantees that the worst-case regret against any $\hat{x}$ grows as $O(\sqrt{T})$ (omitting other dependence for simplicity; the same below). Other popular regret minimizers include Follow-The-Regularized-Leader (FTRL) and adaptive variants of OMD and FTRL; we refer the reader to [78] for an extensive survey on regret minimizers.
Regret Matching and Regret Matching+. Regret Matching (RM) and Regret Matching+ (RM+) are two regret minimizers that achieve $O(\sqrt{T})$ worst-case regret when $\mathcal{X} = \Delta^d$. RM [76] maintains a sequence of aggregate payoffs $(R^t)_{t \ge 1}$: $R^1 = R_0 \mathbf{1}_d$, and for $t \ge 1$,
$$x^t = [R^t]_+ / \|[R^t]_+\|_1, \qquad R^{t+1} = R^t + \langle x^t, \ell^t \rangle \mathbf{1}_d - \ell^t,$$
where $R_0 \ge 0$ specifies an initial point and $0/0$ is defined as the uniform distribution for convenience. The original RM sets $R_0 = 0$, making the algorithm completely parameter-free, a very appealing property in practice. RM+ is a simple variation of RM, where the aggregate payoffs are thresholded at every iteration [152]. In particular, RM+ only keeps track of the non-negative components of the aggregate payoffs to compute a decision: $R^1 = R_0 \mathbf{1}_d$, and for $t \ge 1$,
$$x^t = R^t / \|R^t\|_1, \qquad R^{t+1} = \left[ R^t + \langle x^t, \ell^t \rangle \mathbf{1}_d - \ell^t \right]_+.$$
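To make the updates concrete, here is a minimal NumPy sketch of one step of RM and of RM+ (function names are ours; both algorithms start from $R^1 = R_0\mathbf{1}_d$, with $R_0 = 0$ by default):

```python
import numpy as np

def rm_step(R, loss):
    """One Regret Matching step: play the normalized positive part of R,
    then update the (unthresholded) aggregate payoff vector."""
    R_pos = np.maximum(R, 0.0)
    x = R_pos / R_pos.sum() if R_pos.sum() > 0 else np.full(R.size, 1.0 / R.size)
    R_next = R + np.dot(x, loss) - loss          # R + <x, l> 1 - l
    return x, R_next

def rm_plus_step(R, loss):
    """One Regret Matching+ step: same decision rule, but the aggregate
    payoff vector is thresholded at zero after every update."""
    x = R / R.sum() if R.sum() > 0 else np.full(R.size, 1.0 / R.size)
    R_next = np.maximum(R + np.dot(x, loss) - loss, 0.0)
    return x, R_next
```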
We highlight that very little is known about the theoretical properties of RM+, despite its strong empirical performance: [153] show that RM+ is a regret minimizer (and enjoys the stronger $K$-tracking regret property), and [21] show that it can safely be combined with alternation ([72] prove strict improvement when using alternation). Farina, Kroer, and Sandholm [52] show an interesting connection between RM+ and Online Mirror Descent: the update $R^{t+1} = \left[R^t + \langle x^t, \ell^t\rangle \mathbf{1}_d - \ell^t\right]_+$ of RM+ can be rewritten as
$$R^{t+1} = \Pi_{R^t, \mathcal{X}}\left(\eta f(x^t, \ell^t)\right)$$
for $\mathcal{X} = \mathbb{R}^d_+$, $\varphi = \frac{1}{2}\|\cdot\|_2^2$, $\eta = 1$, and $f(x^t, \ell^t)$ defined as $f(x^t, \ell^t) = \ell^t - \langle x^t, \ell^t\rangle \mathbf{1}_d$. Therefore, RM+, generating a sequence of decisions $x^1, \ldots, x^T$ facing a sequence of losses $(\ell^t)_{t \ge 1}$, is closely connected to OMD instantiated with the non-negative orthant as the decision set and facing the sequence of losses $(f(x^t, \ell^t))_{t \ge 1}$. We have the following relation between the regret in $x^1, \ldots, x^T$ and the regret in $R^1, \ldots, R^T$ (the proof follows [52] and is deferred to the appendix).
Lemma 44. Let $x^1, \ldots, x^T \in \Delta^d$ be generated as $x^t = R^t / \|R^t\|_1$ for some sequence $R^1, \ldots, R^T \in \mathbb{R}^d_+$. The regret $\mathrm{Reg}^T(\hat{x})$ of $x^1, \ldots, x^T$ facing a sequence of losses $\ell^1, \ldots, \ell^T$ is equal to $\mathrm{Reg}^T(\hat{R})$, the regret of $R^1, \ldots, R^T$ facing the sequence of losses $f(x^1, \ell^1), \ldots, f(x^T, \ell^T)$, compared against $\hat{R} = \hat{x}$: $\mathrm{Reg}^T(\hat{R}) = \sum_{t=1}^{T} \langle f(x^t, \ell^t), R^t - \hat{R} \rangle$.
Since OMD is a regret minimizer guaranteeing $\mathrm{Reg}^T(\hat{R}) = O(\sqrt{T})$, Lemma 44 directly shows that RM+ is also a regret minimizer: $\mathrm{Reg}^T(\hat{x}) = O(\sqrt{T})$.
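The equivalence above is easy to verify numerically: with the Euclidean regularizer, the proximal step on the nonnegative orthant is a clipped gradient step, and feeding it the loss $f(x^t, \ell^t)$ with $\eta = 1$ reproduces the RM+ update. A minimal sketch (assuming NumPy; names are ours):

```python
import numpy as np

def f(x, loss):
    """Negative instantaneous regret vector f(x, l) = l - <x, l> 1."""
    return loss - np.dot(x, loss)

def omd_orthant_step(R, loss_vec, eta=1.0):
    """Euclidean OMD/proximal step on the nonnegative orthant:
    argmin_{R' >= 0} <eta*loss_vec, R'> + 0.5*||R' - R||^2 = [R - eta*loss_vec]_+."""
    return np.maximum(R - eta * loss_vec, 0.0)

rng = np.random.default_rng(0)
R = rng.random(4) + 0.1
x = R / R.sum()
loss = rng.standard_normal(4)
# With eta = 1 this is exactly the RM+ update [R + <x, l> 1 - l]_+.
assert np.allclose(omd_orthant_step(R, f(x, loss)),
                   np.maximum(R + np.dot(x, loss) - loss, 0.0))
```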
Multiplayer Normal-Form Games. In a multiplayer normal-form game, there are $n \in \mathbb{N}$ players. Each player $i$ has $d_i$ strategies and their decision space $\Delta^{d_i}$ is the probability simplex over the $d_i$ strategies. We denote $\Delta = \times_{i=1}^n \Delta^{d_i}$ the joint decision space of all players and $d = d_1 + \cdots + d_n$. The utility function of player $i$ is a concave function $u_i : \Delta \to [-1, 1]$ that maps every joint strategy profile $x = (x_1, \ldots, x_n) \in \Delta$ to a payoff. We assume bounded gradients and $L_u$-smoothness for the utilities of the players: there exist $B_u > 0$ and $L_u > 0$ such that for any $x, x' \in \Delta$ and any player $i$,
$$\|\nabla_{x_i} u_i(x)\|_2 \le B_u, \qquad \|\nabla_{x_i} u_i(x) - \nabla_{x_i} u_i(x')\|_2 \le L_u \|x - x'\|_2. \tag{6.1}$$
The function mapping joint strategies to negative payoff gradients for all players is the vector-valued function $G : \Delta \to \mathbb{R}^d$ such that $G(x) = (-\nabla_{x_1} u_1(x), \ldots, -\nabla_{x_n} u_n(x))$. It is well known that running a regret minimizer for $(x^t_1, \ldots, x^t_n) \in \Delta = \times_{i=1}^n \Delta^{d_i}$ facing the loss $G(x^t) = (\ell^t_1, \ldots, \ell^t_n)$ leads to strong game-theoretic guarantees (e.g., the average iterate being an approximate coarse correlated equilibrium). However, in light of Lemma 44, we will instead perform regret minimization on $(R^t_1, \ldots, R^t_n) \in \mathcal{X} = \times_{i=1}^n \mathbb{R}^{d_i}_+$ with the losses $(f(x^t_1, \ell^t_1), \ldots, f(x^t_n, \ell^t_n))$. For conciseness, we thus define the operator $F : \mathcal{X} \to \mathbb{R}^d$ as follows: for $z = (R_1, \ldots, R_n)$, $F(z) = (f(x_1, \ell_1), \ldots, f(x_n, \ell_n))$, where $x_i = R_i / \|R_i\|_1$ for all $i = 1, \ldots, n$ and $(\ell_i)_{i \in [n]} = G(x)$.
Predictive OMD and Its RVU Bounds. The predictive version of OMD proceeds as follows:
$$x^t = \Pi_{\tilde{x}^t, \mathcal{X}}\left(\eta m^t\right), \qquad \tilde{x}^{t+1} = \Pi_{\tilde{x}^t, \mathcal{X}}\left(\eta \ell^t\right).$$
When setting $m^t = \ell^{t-1}$, predictive OMD satisfies
$$\mathrm{Reg}^T(\hat{x}) \le \frac{D(\hat{x}, \tilde{x}^1)}{\eta} + \eta \sum_{t=1}^{T} \|\ell^t - \ell^{t-1}\|_*^2 - \frac{1}{8\eta} \sum_{t=1}^{T-1} \|x^{t+1} - x^t\|^2.$$
This regret bound satisfies the RVU (regret bounded by variation in utilities) condition, introduced in [150]. The authors show that this type of bound guarantees that the social regret (i.e., the sum of the regrets of all players) is $O(1)$ when all players apply this special instance of predictive OMD. Syrgkanis et al. [150] further prove that each player has improved $O(T^{1/4})$ individual regret by the stability of predictive OMD. Specifically, they show that predictive OMD guarantees $\|x^{t+1} - x^t\| = O(\eta)$ against any adversarial loss sequence, i.e., the algorithm is stable in the sense that the change in the iterates can be controlled by choosing $\eta$ appropriately.
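Specialized to the nonnegative orthant with the Euclidean regularizer (the instantiation relevant for RM+), both proximal steps of predictive OMD are clipped gradient steps. A minimal sketch with the prediction $m^t = \ell^{t-1}$ (code and names are ours):

```python
import numpy as np

def predictive_omd_orthant(losses, eta, d):
    """Predictive OMD on R^d_+ with the Euclidean regularizer and m^t = l^{t-1}."""
    x_tilde = np.zeros(d)            # secondary iterate \tilde{x}^t
    m = np.zeros(d)                  # prediction of the next loss
    decisions = []
    for loss in losses:
        x = np.maximum(x_tilde - eta * m, 0.0)            # x^t
        decisions.append(x)
        x_tilde = np.maximum(x_tilde - eta * loss, 0.0)   # \tilde{x}^{t+1}
        m = loss                                          # m^{t+1} = l^t
    return decisions
```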
Predictive RM+. Similar to OMD, we can generalize RM+ to Predictive Regret Matching+ [52]: define $R^1 = m^1 = R_0 \mathbf{1}_d$ (with $R_0 = 0$ by default), and for $t \ge 1$,
$$x^t = \hat{R}^t / \|\hat{R}^t\|_1, \quad \text{for } \hat{R}^t = \left[R^t + m^t\right]_+,$$
$$R^{t+1} = \left[R^t - f(x^t, \ell^t)\right]_+, \quad \text{for } f(x^t, \ell^t) = \ell^t - \langle x^t, \ell^t \rangle \mathbf{1}_d.$$
We call the algorithm predictive RM+ (PRM+) when $m^t = -f(x^{t-1}, \ell^{t-1})$, and it recovers RM+ when $m^t = 0$. A regret bound with a similar RVU condition is attainable for predictive RM+ by its connection to predictive OMD [52], but only in the non-negative orthant space instead of the actual strategy space. To make a connection between them, stability is required, as we show later. A natural question is then whether (predictive) RM+ is also always stable. We show that the answer is no by giving an adversarial example in the next section.
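Before turning to that example, the PRM+ update just described can be summarized in a few lines of NumPy (a sketch with names of our choosing; the default prediction is $m^1 = 0$):

```python
import numpy as np

def f(x, loss):
    return loss - np.dot(x, loss)   # f(x, l) = l - <x, l> 1

def prm_plus(losses, d):
    """Predictive RM+ with m^t = -f(x^{t-1}, l^{t-1}) (and m^1 = 0)."""
    R, m = np.zeros(d), np.zeros(d)
    decisions = []
    for loss in losses:
        R_hat = np.maximum(R + m, 0.0)
        x = R_hat / R_hat.sum() if R_hat.sum() > 0 else np.full(d, 1.0 / d)
        decisions.append(x)
        R = np.maximum(R - f(x, loss), 0.0)
        m = -f(x, loss)
    return decisions
```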
6.2 Instability of (Predictive) Regret Matching+
We start by showing that there exist adversarial loss sequences that lead to instability for both RM+ and predictive RM+. Our construction starts with an unbounded loss sequence $\ell^t$ so that $x^t$ alternates between $(1/2, 1/2)$ and $(0, 1)$: we set $\ell^t = (\ell_t, 0)$, where $\ell_1 = 2$, and for $t \ge 2$, $\ell_t = -2^{(t-2)/2}$ if $t$ is even and $\ell_t = 2^{(t-1)/2}$ if $t$ is odd. Our proof is completed by normalizing the losses to $[-1, 1]$ given a fixed time horizon (see Appendix E.2 for details).
Theorem 45. There exist finite sequences of losses in $\mathbb{R}^2$ for RM+ and its predictive version such that $x^t = (\tfrac{1}{2}, \tfrac{1}{2})$ when $t$ is odd and $x^t = (0, 1)$ when $t$ is even.
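The (unnormalized) construction is easy to reproduce; the following sketch (our own code) runs RM+ on the loss sequence described above and prints iterates that alternate between $(1/2, 1/2)$ and $(0, 1)$:

```python
import numpy as np

def rm_plus_run(losses):
    """Run RM+ (R^1 = 0, with 0/0 taken as uniform) on a list of 2-dim losses."""
    R = np.zeros(2)
    iterates = []
    for loss in losses:
        x = R / R.sum() if R.sum() > 0 else np.array([0.5, 0.5])
        iterates.append(x)
        R = np.maximum(R + np.dot(x, loss) - loss, 0.0)
    return iterates

# l^t = (l_t, 0) with l_1 = 2, l_t = -2^{(t-2)/2} for even t, l_t = 2^{(t-1)/2} for odd t >= 3.
T = 10
scalars = [2.0] + [-2.0 ** ((t - 2) / 2) if t % 2 == 0 else 2.0 ** ((t - 1) / 2)
                   for t in range(2, T + 1)]
losses = [np.array([s, 0.0]) for s in scalars]
for t, x in enumerate(rm_plus_run(losses), start=1):
    print(t, x)   # (0.5, 0.5) for odd t, (0, 1) for even t
```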
This is in stark contrast to OMD, which always ensures $\|x^{t+1} - x^t\| = O(\eta)$ and is thus inherently stable. However, a somewhat surprising property about (predictive) RM+ is that instability actually implies low regret. To see this, we first present the following Lipschitz property of the normalization function $g : x \mapsto x / \|x\|_1$ for $x \in \mathbb{R}^d_+$.

Proposition 46. Let $x, y \in \mathbb{R}^d_+$ with $\mathbf{1}^\top x \ge 1$. Then $\|g(y) - g(x)\|_2 \le \sqrt{d} \cdot \|y - x\|_2$.

This proposition shows that the normalization step has a reasonable Lipschitz constant ($\sqrt{d}$) as long as its input is not too close to the origin, which further implies the following corollary.
Corollary 47. RM+ with $\|R^t\|_1 \ge R_0$ satisfies
$$\|x^{t+1} - x^t\|_2 \le \frac{\sqrt{d}}{R_0} \cdot \|R^{t+1} - R^t\|_2 \le \frac{2 d B_u}{R_0}.$$
Put differently, the corollary states that instability can happen only when the cumulative regret vector $R^t$ is small. For example, if $\|x^{t+1} - x^t\| = \Omega(1)$, then we must have $\|R^t\|_1 = O(d B_u)$, and thus the regret at that point is at most $O(d B_u)$. A similar argument holds for predictive RM+ as well. Therefore, instability is in fact not an issue for these algorithms' own regret.

However, when using these algorithms to play a game, what could happen is that such instability leads to other players learning in an unpredictable environment with large regret. We show this phenomenon via an example of a $3 \times 3$ matrix game $\max_{x \in \Delta^3} \min_{y \in \Delta^3} \langle x, A y \rangle$, where $A = ((3, 0, -3), (0, 3, -4), (0, 0, 1))$. The first column of Figure 6.1 shows the squared $\ell_2$ norm of the consecutive difference of the last 100 iterates of Predictive RM+ for the $x$ player (top) and the $y$ player (bottom). The iterates of the $x$ player change rapidly in a periodic fashion, while the iterates of the $y$ player are stable, with changes on the order of $10^{-5}$. In the center plots, where we show the individual regret of each player, we indeed observe that the cumulative regret of the $x$ player is near zero, as implied by instability, but the instability causes large regret (close to $T^{0.5}$ empirically) for the $y$ player. (We show the same plots for RM+ in Figure E.1 in Appendix
E.2; there, the iterates of both players are stable, but since RM+ lacks predictivity, it still leads to larger regret for one player.)

Figure 6.1: Left plots show the iterate-to-iterate variation in the last 100 iterates of predictive RM+. Center plots show the regret for the $x$ and $y$ players under predictive RM+. Right plots show the empirical convergence speed of RM+ (top row) and Predictive RM+ (bottom row).
The right column of Figure 6.1 shows the duality gap achieved by the linear average
$$(\bar{x}^t, \bar{y}^t) = \left( \frac{2}{T(T+1)} \sum_{t=1}^{T} t\, x^t,\; \frac{2}{T(T+1)} \sum_{t=1}^{T} t\, y^t \right),$$
when the iterates are generated by RM+ with alternation (top) and predictive RM+ (bottom). For both algorithms the convergence rate slows down around $10^4$ iterations. A linear regression estimate of the rate over the last $10^6$ iterates shows rates of $-0.497$ and $-0.496$ for RM+ and predictive RM+, respectively. To the best of our knowledge, this is the first known case of empirical convergence rates on the order of $T^{-0.5}$ for either RM+ or predictive RM+; the worst prior instance for RM+ was $T^{-0.74}$ in Farina, Kroer, and Sandholm [54]; no hard instance was known for predictive RM+.
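For reference, the linearly weighted average used throughout the experiments is a one-liner (a sketch, assuming NumPy):

```python
import numpy as np

def linear_average(iterates):
    """Linearly weighted average: (2 / (T(T+1))) * sum_t t * x^t."""
    T = len(iterates)
    return np.average(np.stack(iterates), axis=0, weights=np.arange(1, T + 1))
```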
Stable Predictive RM+. One way to maintain the required distance to the origin is via restarting: we initialize the algorithm with the cumulative regret vector equal to some non-zero amount, instead of the usual initialization at zero. Then, when the cumulative regret vector gets below the initialization point, we restart the algorithm from the initialization point. Applying this idea to predictive RM+ yields Algorithm 3. Player $i$ starts with $R^1_i = R_0 \mathbf{1}_{d_i}$, runs predictive RM+, and restarts whenever $R^t_i \le R_0 \mathbf{1}_{d_i}$. In the algorithm we write $(R^t_1, \ldots, R^t_n)$ compactly as $w^t$ (similarly for $z^t$). Note, though, that the updates are decentralized for each player, as in vanilla predictive RM+.

Given this modification, Stable PRM+ achieves improved individual regret in multiplayer games, as stated in Theorem 48. We defer the proof to the appendix. One key step in the analysis is to note that, by definition, the regret against any action is negative when the restarting event happens, so it is sufficient to consider the regret starting from the last restart. Thanks to the stability enforced by the restarts, the regret from the last restart is also well controlled, and the result follows by tuning $\eta$ and $R_0$ optimally. In fact, since the algorithm is scale-invariant up to the relative scale of the two parameters, it is without loss of generality to always set $R_0 = 1$.
Algorithm 3: Stable Predictive RM+
1: Input: $R_0 > 0$, step size $\eta > 0$
2: Initialization: $w^0 = R_0 \mathbf{1}_d$
3: for $t = 1, \ldots, T$ do
4:   $z^t = \Pi_{w^{t-1}, \mathcal{X}}(\eta m^t)$
5:   $w^t = \Pi_{w^{t-1}, \mathcal{X}}(\eta F(z^t))$
6:   $(x^t_1, \ldots, x^t_n) = (g(z^t_1), \ldots, g(z^t_n))$
7:   for $i = 1, \ldots, n$ do
8:     if $w^t_i \le R_0 \mathbf{1}_{d_i}$ then
9:       $w^t_i = R_0 \mathbf{1}_{d_i}$

Algorithm 4: Smooth Predictive RM+
1: Input: step size $\eta > 0$
2: Initialization: $w^0 \in \mathcal{X}_\ge$
3: for $t = 1, \ldots, T$ do
4:   $z^t = \Pi_{w^{t-1}, \mathcal{X}_\ge}(\eta m^t)$
5:   $w^t = \Pi_{w^{t-1}, \mathcal{X}_\ge}(\eta F(z^t))$
6:   $(x^t_1, \ldots, x^t_n) = (g(z^t_1), \ldots, g(z^t_n))$
Theorem 48. Let $\eta = \big(d^2 T\big)^{-1/4}$ and $R_0 = 1$. Let $(f^t_i)_{i \in [n]} = F(z^t)$ for $t \ge 1$. For each player $i$, set the sequence of predictions $m^t_i = 0$ when $t = 0$ or a restart happens at $t - 1$; otherwise, $m^t_i = f^{t-1}_i$ for all $t \ge 1$. Then Algorithm 3 guarantees that the individual regret $\mathrm{Reg}^T_i(\hat{x}_i) = \sum_{t=1}^{T} \langle \nabla_{x_i} u_i(x^t), \hat{x}_i - x^t_i \rangle$ of each player $i$ is bounded by $O\big(d^{3/2} T^{1/4}\big)$ in multiplayer normal-form games.
Although the restarting idea successfully stabilizes the RM+ algorithm, the discontinuity created by
asynchronous restarts causes technical difficulty for bounding the social regret by O(1). Next we introduce
an alternative stabilization idea to fix this issue.
Smooth Predictive RM+. Our second stabilization idea is to restrict the decision space to a subset where we "chop off" the area that is too close to the origin, that is, project the vector $R^t_i$ onto the set $\Delta^{d_i}_{\ge} = \{R \in \mathbb{R}^{d_i}_+ \mid \|R\|_1 \ge 1\}$. We denote the joint chopped-off decision space by $\mathcal{X}_\ge = \times_{i=1}^n \Delta^{d_i}_{\ge}$. We call the resulting algorithm Smooth Predictive RM+ (Algorithm 4). Besides a result similar to Theorem 48 on the individual regret (omitted for simplicity), Algorithm 4 also guarantees that the social regret is bounded by a game-dependent constant, as shown in Theorem 49.
Theorem 49. Let $\eta = \big(2\sqrt{2}\,(n-1) \max_i \{d_i^{3/2}\}\big)^{-1}$. Using the sequence of predictions $m^0 = 0$ and $m^t = F(z^{t-1})$ for all $t \ge 1$, Algorithm 4 guarantees that the social regret $\sum_{i=1}^n \mathrm{Reg}^T_i(\hat{x}_i) = \sum_{i=1}^n \sum_{t=1}^T \langle \nabla_{x_i} u_i(x^t), \hat{x}_i - x^t_i \rangle$ is upper bounded by $O\big(n^2 \max_{i=1,\ldots,n}\{d_i^{3/2}\} \max_{i=1,\ldots,n}\{\|w^0_i - \hat{x}_i\|_2^2\}\big)$ in multiplayer normal-form games.
Algorithm 4 dominates Algorithm 3 in terms of our theoretical results so far, but it has one drawback:
it requires occasional projection onto X≥. In Appendix E.11 we show that this can be done with a sorting
trick in O(d log d) time, whereas the restarting procedure is implementable in linear time.
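To make the projection step concrete: projecting onto $\Delta^{d}_{\ge} = \{R \in \mathbb{R}^d_+ \mid \|R\|_1 \ge 1\}$ reduces to clipping to the orthant and, if the $\ell_1$ constraint is then violated, projecting onto the probability simplex, which can be done with the standard sorting-based algorithm. The following sketch is our own illustration of this $O(d \log d)$ procedure (it is consistent with, but not taken from, Appendix E.11):

```python
import numpy as np

def project_simplex(u):
    """Euclidean projection onto {v >= 0, sum(v) = 1} by sorting (O(d log d))."""
    d = u.shape[0]
    s = np.sort(u)[::-1]
    cssv = np.cumsum(s) - 1.0
    rho = np.nonzero(s - cssv / np.arange(1, d + 1) > 0)[0][-1]
    return np.maximum(u - cssv[rho] / (rho + 1.0), 0.0)

def project_clipped_orthant(u):
    """Euclidean projection onto {R >= 0, ||R||_1 >= 1}."""
    v = np.maximum(u, 0.0)
    return v if v.sum() >= 1.0 else project_simplex(u)
```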
6.3 Conceptual Regret Matching+
In this section, we depart from the predictive OMD framework and develop new smooth variants of RM+ from a different angle. Instead of using predictive OMD to compute the iterates $(R^t_i)_{t \ge 1}$, we consider the following regret minimizer, which we call cheating OMD, defined for an arbitrary closed decision set $\mathcal{Z}$ and an arbitrary sequence of losses $(\ell^t)_{t \ge 1}$: $z^t = \Pi_{z^{t-1}, \mathcal{Z}}(\eta \ell^t)$ for $t \ge 1$, and $z^0 \in \mathcal{X}_\ge$. Cheating OMD is inspired by the Conceptual Prox method for solving variational inequalities associated with monotone operators [26, 89, 127]. We call it cheating OMD because at iteration $t$, the decision $z^t$ is chosen as a function of the current loss $\ell^t$, which is revealed only after the decision $z^t$ has been chosen. It is well known that cheating OMD yields a sequence of decisions with constant regret; we show it for our setting in the following lemma.
Lemma 50. The Cheating OMD iterates $(z^t)_t$ satisfy $\sum_{t=1}^T \langle \ell^t, z^t - \hat{z} \rangle \le \frac{1}{2\eta} \|z^0 - \hat{z}\|_2^2$ for all $\hat{z} \in \mathcal{Z}$.
To instantiate RM+ with Cheating OMD as a regret minimizer for the sequence $(R^t_i)_{t \ge 1}$ of each player $i$, we need to show the existence of a vector $z^t \in \mathcal{X}_\ge$ such that
$$z^t = \Pi_{z^{t-1}, \mathcal{X}_\ge}\left(\eta F(z^t)\right). \tag{6.2}$$
Equation (6.2) can be interpreted as a fixed-point equation for the map $z \mapsto \Pi_{z^{t-1}, \mathcal{X}_\ge}(\eta F(z))$. For any $z' \in \mathcal{X}_\ge$, the map $z \mapsto \Pi_{z', \mathcal{X}_\ge}(\eta F(z))$ is $\eta L$-Lipschitz continuous as long as $F$ is $L$-Lipschitz continuous. Therefore, it is a contraction when $\eta < 1/L$, and then the fixed-point equation $z = \Pi_{z', \mathcal{X}_\ge}(\eta F(z))$ has a unique solution. Recall that for $z = (R_1, \ldots, R_n) \in \mathcal{X}_\ge$, the operator $F$ is defined as $F(z) = (f(x_1, \ell_1), \ldots, f(x_n, \ell_n))$, where $x_i = g(R_i)$ and $\ell_i = -\nabla_{x_i} u_i(x)$ for all $i \in \{1, \ldots, n\}$. We now show the Lipschitzness of $F$ over $\mathcal{X}_\ge$ for normal-form games.
Lemma 51. For a normal-form game, the operator $F$ is $L_F$-Lipschitz continuous over $\mathcal{X}_\ge$, with $L_F = (\max_i d_i)\sqrt{2 B_u^2 + 4 L_u^2}$, where $B_u, L_u$ are defined in (6.1).
For $L_F$ defined as in Lemma 51 and $\eta < 1/L_F$, the existence of the fixed point $z^t = \Pi_{z^{t-1}, \mathcal{X}_\ge}(\eta F(z^t))$ is guaranteed. This yields Conceptual RM+, defined in Algorithm 5. In the following theorem, we show that Conceptual RM+ ensures constant regret for each player.
Theorem 52. Let $L_F > 0$ be defined as in Lemma 51. For $\eta < 1/L_F$, Algorithm 5 guarantees that the individual regret $\mathrm{Reg}^T_i(\hat{x}_i) = \sum_{t=1}^T \langle \nabla_{x_i} u_i(x^t), \hat{x}_i - x^t_i \rangle$ of each player $i$ is bounded by $\frac{1}{2\eta}\|z^0_i - \hat{x}_i\|_2^2$ in multiplayer normal-form games.
Algorithm 5: Conceptual RM+
1: Input: step size $\eta > 0$ with $\eta < 1/L_F$
2: Initialization: $z^0 \in \mathcal{X}_\ge$
3: for $t = 1, \ldots, T$ do
4:   $z^t = \Pi_{z^{t-1}, \mathcal{X}_\ge}(\eta F(z^t))$
5:   $(x^t_1, \ldots, x^t_n) = (g(z^t_1), \ldots, g(z^t_n))$

Algorithm 6: Conceptual RM+ with approximate fixed point
1: Input: step size $\eta > 0$ with $\eta < 1/L_F$
2: Initialization: $z^0 \in \mathcal{X}_\ge$
3: for $t = 1, \ldots, T$ do
4:   $w^0 = z^{t-1}$
5:   for $j = 0, \ldots, k - 1$ do
6:     $w^{j+1} = \Pi_{z^{t-1}, \mathcal{X}_\ge}(\eta F(w^j))$
7:   $z^t = \Pi_{z^{t-1}, \mathcal{X}_\ge}(\eta F(w^k))$
8:   $(x^t_1, \ldots, x^t_n) = (g(w^k_1), \ldots, g(w^k_n))$
Note that the requirement of η < 1/LF in Theorem 52 and Algorithm 5 is only needed in order to
ensure existence of a fixed-point. If the fixed-point condition holds for some larger η, then the algorithm
is still well-defined and the same convergence guarantee holds.
Remark 53. Piliouras, Sim, and Skoulakis [132] propose the clairvoyant multiplicative weights update (MWU) algorithm, based on the classical MWU algorithm, but where the rescaling at iteration $t$ involves the payoffs of the players at iteration $t$. The connection with the conceptual prox method is made explicit by [50], who show how to extend clairvoyant MWU for normal-form games to clairvoyant OMD for general convex games. Our algorithm uses the same idea but for RM+.
For $z' \in \mathcal{X}_\ge$, we can approximate the fixed point of $z \mapsto \Pi_{z', \mathcal{X}_\ge}(\eta F(z))$ by performing $k \in \mathbb{N}$ fixed-point iterations. This results in Algorithm 6. We give the guarantees for Algorithm 6 below.
Theorem 54. Let $L_F > 0$ be defined as in Lemma 51 and $\eta < 1/L_F$. Assume that in Algorithm 6 we ensure $\|w^k - \Pi_{z^{t-1}, \mathcal{X}_\ge}(\eta F(w^k))\|_2 \le \epsilon^{(t)}$ for all $t \ge 1$. Then Algorithm 6 guarantees that the individual regret $\mathrm{Reg}^T_i(\hat{x}_i) = \sum_{t=1}^T \langle \nabla_{x_i} u_i(x^t), \hat{x}_i - x^t_i \rangle$ of each player $i$ is bounded by $\frac{1}{2\eta}\|z^0_i - \hat{x}_i\|_2^2 + 2 B_u \sqrt{d_i} \sum_{t=1}^T \epsilon^{(t)}$ in multiplayer normal-form games.
By Theorem 54, if we ensure error $\epsilon^{(t)} = 1/t^2$ in Algorithm 6, then the individual regret of each player is bounded by a constant. Since $w \mapsto \Pi_{z^{t-1}, \mathcal{X}_\ge}(\eta F(w))$ is a contraction for $\eta < 1/L_F$, this only requires $k = O(\log(t))$ fixed-point iterations at each time $t$. If the number of iterations $T$ is known in advance, we can choose $k = O(\log(T))$ to ensure $\epsilon^{(t)} = O(1/T)$, and therefore the individual regret of each player $i$ is bounded by the constant $\frac{1}{2\eta}\|z^0_i - \hat{x}_i\|_2^2 + O\big(2 B_u \sqrt{d_i}\big)$.
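The logarithmic iteration count is simply the geometric convergence of a contraction: each fixed-point iteration shrinks the residual by a constant factor, so reaching tolerance $\epsilon$ takes $O(\log(1/\epsilon))$ iterations. A toy sketch (our own illustration with a synthetic contraction, not the game operator itself):

```python
import numpy as np

def approx_fixed_point(T_map, w0, tol, max_iter=10_000):
    """Iterate w <- T_map(w) until the residual ||T_map(w) - w|| drops below tol."""
    w = w0
    for k in range(1, max_iter + 1):
        w_next = T_map(w)
        if np.linalg.norm(w_next - w) <= tol:
            return w_next, k
        w = w_next
    return w, max_iter

contraction = lambda w: 0.5 * w + 1.0           # toy contraction with modulus 1/2
for tol in (1e-2, 1e-4, 1e-8):
    _, k = approx_fixed_point(contraction, np.zeros(3), tol)
    print(tol, k)                                # k grows like log(1/tol)
```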
Recall that the uniform distribution over a sequence of strategy profiles $\{x^t\}_{t=1}^T$ is a $(\max_i \mathrm{Reg}^T_i)/T$-approximate coarse correlated equilibrium (CCE) of a multiplayer normal-form game (see, e.g., Theorem 2.4 in Piliouras, Sim, and Skoulakis [132]). Therefore, Algorithm 5 guarantees $O(1/T)$ convergence to a CCE after $T$ iterations. With the setup from Theorem 54 and $k = O(\log(T))$, Algorithm 6 guarantees $O(\log(T)/T)$ convergence to a CCE after $T$ evaluations of the operator $F$.
Algorithm 7: Extragradient RM+ (ExRM+)
1: Input: step size $\eta > 0$ with $\eta < 1/L_F$
2: Initialization: $z^0 \in \mathcal{X}_\ge$
3: for $t = 1, \ldots, T$ do
4:   $w^t = \Pi_{z^{t-1}, \mathcal{X}_\ge}(\eta F(z^{t-1}))$
5:   $z^t = \Pi_{z^{t-1}, \mathcal{X}_\ge}(\eta F(w^t))$
6:   $(x^t_1, \ldots, x^t_n) = (g(w^t_1), \ldots, g(w^t_n))$
Extragradient RM+. We now consider the case of Algorithm 6 but with only one fixed-point iteration
(k = 1). This is similar to the mirror prox algorithm [127] or the extragradient method [93]. We call this
algorithm extragradient RM+ (ExRM+, Algorithm 7). We show that one fixed-point iteration (k = 1) at
every iteration ensures constant social regret.
Theorem 55. Define $L_F$ as in Lemma 51 and let $\eta = (\sqrt{2} L_F)^{-1}$. Algorithm 7 guarantees that the social regret $\sum_{i=1}^n \mathrm{Reg}^T_i(\hat{x}_i) = \sum_{i=1}^n \sum_{t=1}^T \langle \nabla_{x_i} u_i(x^t), \hat{x}_i - x^t_i \rangle$ is bounded by $\frac{1}{2\eta} \sum_{i=1}^n \|z^0_i - \hat{x}_i\|_2^2$ in multiplayer normal-form games.
We now apply Theorem 55 to the case of matrix games, where the goal is to solve
$$\min_{x \in \Delta^{d_1}} \max_{y \in \Delta^{d_2}} \langle x, A y \rangle$$
for $A \in \mathbb{R}^{d_1 \times d_2}$. The operator $F$ is defined as
$$F\begin{pmatrix} R_1 \\ R_2 \end{pmatrix} = \begin{pmatrix} f\big(g(R_1), A g(R_2)\big) \\ f\big(g(R_2), -A^\top g(R_1)\big) \end{pmatrix}$$
and $\mathcal{X}_\ge = \Delta^{d_1}_{\ge} \times \Delta^{d_2}_{\ge}$. The next lemma gives the Lipschitz constant of the operator $F$ in the case of matrix games.
Lemma 56. For matrix games, the operator $F$ is $L_F$-Lipschitz over $\mathcal{X}_\ge$, with $L_F = \sqrt{6}\,\|A\|_{\mathrm{op}} \max\{d_1, d_2\}$, where $\|A\|_{\mathrm{op}} = \sup\{\|A v\|_2 / \|v\|_2 \mid v \in \mathbb{R}^{d_2}, v \ne 0\}$.
Combining Lemma 56 with Theorem 55, ExRM+ for matrix games with $\mathcal{X}_\ge$ as the decision set and $\eta = (\sqrt{2} L_F)^{-1}$ guarantees constant social regret, so that the average of the iterates computed by ExRM+ converges to a Nash equilibrium at a rate of $O(1/T)$ [59].
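In code, the matrix-game operator $F$ is a direct transcription of the display above (a sketch assuming NumPy; the helper names are ours):

```python
import numpy as np

def g(R):
    """Normalize a nonnegative vector into the simplex (0/0 -> uniform)."""
    s = R.sum()
    return R / s if s > 0 else np.full(R.size, 1.0 / R.size)

def f(x, loss):
    """f(x, l) = l - <x, l> 1."""
    return loss - np.dot(x, loss)

def F_matrix_game(z, A):
    """F(R1, R2) = (f(g(R1), A g(R2)), f(g(R2), -A^T g(R1))) for min_x max_y <x, A y>."""
    d1 = A.shape[0]
    x, y = g(z[:d1]), g(z[d1:])
    return np.concatenate([f(x, A @ y), f(y, -A.T @ x)])
```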
Extensive-form games. Our convergence results for Conceptual RM+ apply beyond normal-form games, to EFGs. Briefly, an EFG is a game played on a tree, where each node belongs to some player, and the player chooses a probability distribution over branches. Moreover, players have information sets, which are groups of a player's nodes that the player cannot distinguish among, and thus the player must choose the same probability distribution at all nodes in an information set. As is standard, we assume that each player never forgets information. Below, we describe the main ideas behind the extension; details are given in Appendix E.10.

In order to extend our results, we use the CFR regret decomposition [166, 53]. CFR defines a notion of local regret at each information set, using so-called counterfactual values. By minimizing the regret incurred at each information set with respect to counterfactual values, CFR guarantees that the overall regret over tree-form strategies is minimized. Importantly, counterfactual values are multilinear in the strategies of the players, and therefore they are Lipschitz functions of the strategies of the other players. Hence, using Algorithm 6 at each information set with counterfactual values and applying Theorem 54 yields a smooth-RM+-based algorithm that computes a sequence of iterates with regret at most $\epsilon$ in $O(1/\epsilon)$ iterations and using $O(\log(1/\epsilon)/\epsilon)$ gradient computations.
6.4 Numerical Experiments
Matrix games. We compute the performance of ExRM+, Stable PRM+, and Smooth PRM+ on the $3 \times 3$ matrix game instance from Section 6.2 (with step size $\eta = 0.1$) and on 30 random matrix games of size $(d_1, d_2) = (30, 40)$ with normally distributed coefficients of the payoff matrix and with step sizes $\eta \in \{0.1, 1, 10\}$. We initialize our algorithms at $(1/d_1)\mathbf{1}_d$, all algorithms use linear averaging, and all algorithms (except ExRM+) use alternation. The results are shown in Figure 6.2. Our new algorithms greatly outperform RM+ and PRM+ in the $3 \times 3$ matrix game; linear regression finds an asymptotic convergence rate of $O(1/T^2)$. More detailed results for this instance are given in Appendix E.11.1. For random matrix games, our algorithms ExRM+, Smooth PRM+, and Stable PRM+ all outperform RM+ for step size $\eta = 0.1$. ExRM+ performs on par with RM+ for larger values of $\eta$, while Stable PRM+ and Smooth PRM+ remain very competitive, performing on par with the unstabilized version of PRM+. We note that we use step sizes that are larger than the theoretical ones since the latter may be overly conservative [54, 98].
Figure 6.2: Empirical performances of RM+, PRM+, ExRM+, Stable PRM+, and Smooth PRM+ (duality gap versus number of iterations) on our $3 \times 3$ matrix game (left plot) and on random instances for different step sizes.
Figure 6.3: Practical performance of our variant of clairvoyant CFR (‘Cvynt CFR’) compared to predictive CFR across four multiplayer extensive-form games (2-player Sheriff, 3-player Leduc poker, 4-player Kuhn poker, and 4-player Liar's dice), plotting the CCE gap against the number of gradient computations. Note that on Liar's dice, both algorithms are down to machine-precision accuracy immediately, which explains the jittery plot.
Extensive-form games. We implemented and evaluated our CFR-based clairvoyant algorithm (henceforth ‘Clairvoyant CFR’) for extensive-form games. To our knowledge, it is the first time that clairvoyant
algorithms are evaluated in extensive-form games. Overall, we were unable to observe the same strong performance as in normal-form games (Figure 6.2), for a combination of reasons. First, we observe
that the stepsize η calculated in Appendix E.10 to make the operator F a contraction in extensive-form
games is prohibitively small in the games we test on, each of which has a number of sequences on the order of tens of thousands. At the same time, we observe that ignoring the issue by setting a large constant
stepsize in practice often leads to non-convergence of the fixed point iterations. To sidestep both issues,
we considered a variant of the algorithm which only performs a single fixed-point iteration, and uses a
stepsize hyperparameter η, where we pick the best from the set {1, 10, 20}. We remark that this variant
of the algorithm is clairvoyant only in spirit, and while it is a sound regret-minimization algorithm, we
expect that the strong theoretical guarantees of constant per-player regret do not apply. Nevertheless,
in Figure 6.3 we show that we are able to sometimes observe superior performance to (non-clairvoyant)
predictive CFR in the four games we tried, which are described in the appendix. For both algorithms, we
ignore the first 100 iterations, in which the iterates are very far from convergence. To compensate for the
increased amount of computation needed at each iteration by our clairvoyant algorithm, we plot on the
x-axis not the number of iterations but rather the number of gradients of the utility functions computed
for each player. On the y-axis, we measure the gap to a coarse correlated equilibrium, which is equal to
the maximum regret across the players, divided by the number of iterations.
6.5 Preliminaries on Regret Matching+ in Two-Player Zero-Sum Normal-Form Games
In the rest of this chapter, we focus on last-iterate convergence and study iterative algorithms for solving the following matrix game:
$$\min_{x \in \Delta^{d_1}} \max_{y \in \Delta^{d_2}} x^\top A y \tag{6.3}$$
for a payoff matrix $A \in \mathbb{R}^{d_1 \times d_2}$. We define $\mathcal{Z} = \Delta^{d_1} \times \Delta^{d_2}$ to be the set of feasible pairs of strategies.
The duality gap of a pair of feasible strategies $(x, y) \in \mathcal{Z}$ is defined as
$$\mathrm{DualityGap}(x, y) := \max_{y' \in \Delta^{d_2}} x^\top A y' - \min_{x' \in \Delta^{d_1}} x'^\top A y.$$
Note that we always have DualityGap(x, y) ≥ 0, and it is well-known that DualityGap(x, y) ≤ ϵ implies
that the pair (x, y) ∈ Z is an ϵ-Nash equilibrium of the matrix game (6.3). When both players of (6.3)
employ a regret minimizer, a well-known folk theorem shows that the averages of the iterates generated
during self-play converge to a Nash equilibrium of the game [59]. This framework can be instantiated with
any regret minimizers, for instance, online mirror descent, follow-the-regularized leader, regret matching,
and optimistic variants of these algorithms. We refer to [78] for an extensive review on regret minimization.
From here on, we focus on solving (6.3) via Regret Matching+ and its variants. To describe these algorithms, it is useful to define, for a strategy $x \in \Delta^d$ and a loss vector $\ell \in \mathbb{R}^d$, the negative instantaneous regret vector $f(x, \ell) = \ell - x^\top \ell \cdot \mathbf{1}_d$,† and also to define the normalization operator $g : \mathbb{R}^{d_1}_+ \times \mathbb{R}^{d_2}_+ \to \mathcal{Z}$ such that for $z = (z_1, z_2) \in \mathbb{R}^{d_1}_+ \times \mathbb{R}^{d_2}_+$, we have $g(z) = (z_1/\|z_1\|_1, z_2/\|z_2\|_1) \in \mathcal{Z}$.

†Here, $d$ can be either $d_1$ or $d_2$. That is, we overload the notation $f$ so its domain depends on the inputs.
Regret Matching+ (RM+), alternation, and Predictive RM+. We describe Regret Matching+ (RM+) in Algorithm 8 [152].‡ It maintains two sequences: a sequence of joint aggregate payoffs $(R^t_x, R^t_y) \in \mathbb{R}^{d_1}_+ \times \mathbb{R}^{d_2}_+$ updated using the instantaneous regret vectors, and a sequence of joint strategies $(x^t, y^t) \in \mathcal{Z}$ obtained by directly normalizing the aggregate payoffs. Note that the update rules are stepsize-free and only perform closed-form operations (thresholding and rescaling).

A popular variant of RM+, Alternating RM+ [153, 21], is shown in Algorithm 9. In alternating RM+, the updates of the two players are asynchronous, and at iteration $t$, the second player observes the choice $x^{t+1}$ of the first player when choosing their own decision $y^{t+1}$. Alternation leads to faster empirical performance for solving matrix and extensive-form games, even though the theoretical guarantees remain the same as for vanilla RM+ [21, 72].
Algorithm 8: Regret Matching+ (RM+)
1: Initialize: $(R^0_x, R^0_y) = 0$, $(x^0, y^0) \in \mathcal{Z}$
2: for $t = 0, 1, \ldots$ do
3:   $R^{t+1}_x = [R^t_x - f(x^t, A y^t)]_+$, $\quad R^{t+1}_y = [R^t_y + f(y^t, A^\top x^t)]_+$
4:   $(x^{t+1}, y^{t+1}) = g(R^{t+1}_x, R^{t+1}_y)$

Algorithm 9: Alternating RM+ (alt. RM+)
1: Initialize: $(R^0_x, R^0_y) = 0$, $(x^0, y^0) \in \mathcal{Z}$
2: for $t = 0, 1, \ldots$ do
3:   $R^{t+1}_x = [R^t_x - f(x^t, A y^t)]_+$, $\quad x^{t+1} = R^{t+1}_x / \|R^{t+1}_x\|_1$
4:   $R^{t+1}_y = [R^t_y + f(y^t, A^\top x^{t+1})]_+$, $\quad y^{t+1} = R^{t+1}_y / \|R^{t+1}_y\|_1$

‡Typically, RM+ and PRM+ are introduced as regret minimizers that return a sequence of decisions against any sequence of losses [153, 52]. For conciseness, we directly present them as self-play algorithms for solving matrix games, as in Algorithm 8 and Algorithm 10.
Algorithm 10: Predictive RM+ (PRM+)
1: Initialize: $(R^0_x, R^0_y) = 0$, $(x^0, y^0) \in \mathcal{Z}$
2: for $t = 0, 1, \ldots$ do
3:   $R^{t+1}_x = [R^t_x - f(x^t, A y^t)]_+$, $\quad R^{t+1}_y = [R^t_y + f(y^t, A^\top x^t)]_+$
4:   $(x^{t+1}, y^{t+1}) = g\big([R^{t+1}_x - f(x^t, A y^t)]_+, [R^{t+1}_y + f(y^t, A^\top x^t)]_+\big)$
Finally, we describe Predictive RM+ (PRM+) from [52] in Algorithm 10. PRM+ incorporates predictions of the next losses faced by each player (using the most recently observed losses) when computing the strategies at each iteration, akin to predictive/optimistic online mirror descent [135, 150].

In practice, PRM+ can also be combined with alternation, but despite strong empirical performance, it is unknown if alternating PRM+ enjoys ergodic convergence. In contrast, based on the aforementioned folk theorem and the regret guarantees of RM+ [152], alternating RM+ [21], and PRM+ [52], the duality gap of the average strategy of all these algorithms goes down at a rate of $O(1/\sqrt{T})$. However, we will show in the next section that the iterates $(x^t, y^t)$ themselves may not converge. We also note that despite the connections between PRM+ and predictive online mirror descent, PRM+ does not achieve $O(1/T)$ ergodic convergence, because of its lack of stability [48].
Extragradient RM+ and Smooth Predictive RM+. We now describe two theoretically faster variants of RM+ recently introduced in [48]. To provide a concise formulation, we first need some additional notation. First, we define the clipped positive orthant $\Delta^{d_i}_{\ge} := \{u \in \mathbb{R}^{d_i}_+ : u^\top \mathbf{1}_{d_i} \ge 1\}$ for $i = 1, 2$ and $\mathcal{Z}_\ge = \Delta^{d_1}_{\ge} \times \Delta^{d_2}_{\ge}$. For a point $z \in \mathcal{Z}_\ge$, we often write it as $z = (R x, Q y)$ for positive real numbers $R$ and $Q$ such that $(x, y) = g(z)$. Moreover, we define the operator $F : \mathcal{Z}_\ge \to \mathbb{R}^{d_1 + d_2}$ as follows: for any $z = (R x, Q y) \in \mathcal{Z}_\ge$,
$$F(z) = F((R x, Q y)) = \begin{pmatrix} f(x, A y) \\ f(y, -A^\top x) \end{pmatrix} = \begin{pmatrix} A y - x^\top A y \cdot \mathbf{1}_{d_1} \\ -A^\top x + x^\top A y \cdot \mathbf{1}_{d_2} \end{pmatrix}, \tag{6.4}$$
which is $L_F$-Lipschitz-continuous over $\mathcal{Z}_\ge$ with $L_F = \sqrt{6}\,\|A\|_{\mathrm{op}} \max\{d_1, d_2\}$ [48]. We also write $\Pi_{\mathcal{Z}_\ge}(u)$ for the $L_2$ projection of the vector $u$ onto $\mathcal{Z}_\ge$: $\Pi_{\mathcal{Z}_\ge}(u) = \operatorname*{argmin}_{z' \in \mathcal{Z}_\ge} \|z' - u\|_2$. With this notation, Extragradient RM+ (ExRM+) and Smooth PRM+ (SPRM+) are defined in Algorithm 11 and Algorithm 12.
Algorithm 11: Extragradient RM+ (ExRM+)
1: Input: step size $\eta \in (0, \tfrac{1}{L_F})$.
2: Initialize: $z^0 \in \mathcal{Z}$
3: for $t = 0, 1, \ldots$ do
4:   $z^{t+1/2} = \Pi_{\mathcal{Z}_\ge}\big(z^t - \eta F(z^t)\big)$
5:   $z^{t+1} = \Pi_{\mathcal{Z}_\ge}\big(z^t - \eta F(z^{t+1/2})\big)$

Algorithm 12: Smooth PRM+ (SPRM+)
1: Input: step size $\eta \in (0, \tfrac{1}{8 L_F}]$.
2: Initialize: $z^{-1} = w^0 \in \mathcal{Z}$
3: for $t = 0, 1, \ldots$ do
4:   $z^t = \Pi_{\mathcal{Z}_\ge}\big(w^t - \eta F(z^{t-1})\big)$
5:   $w^{t+1} = \Pi_{\mathcal{Z}_\ge}\big(w^t - \eta F(z^t)\big)$
ExRM+ is connected to the Extragradient (EG) algorithm [93] and SPRM+ is connected to the Optimistic Gradient algorithm [134, 135] (see Section 6.7 and Section 6.8 for details). Farina et al. [48] show that ExRM+ and SPRM+ enjoy fast $O(\frac{1}{T})$ ergodic convergence for solving matrix games.
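Putting the pieces together, ExRM+ amounts to an extragradient loop over $\mathcal{Z}_\ge$. The following self-contained sketch (our own NumPy implementation, with helper names of our choosing) runs ExRM+ on the $3 \times 3$ game studied below and reports the duality gap of the last iterate:

```python
import numpy as np

def g_vec(R):
    s = R.sum()
    return R / s if s > 0 else np.full(R.size, 1.0 / R.size)

def F(z, A):
    """F(z) = (f(x, Ay), f(y, -A^T x)) with (x, y) = g(z), as in (6.4)."""
    d1 = A.shape[0]
    x, y = g_vec(z[:d1]), g_vec(z[d1:])
    val = x @ A @ y
    return np.concatenate([A @ y - val, -A.T @ x + val])

def proj_simplex(u):
    s = np.sort(u)[::-1]
    cssv = np.cumsum(s) - 1.0
    rho = np.nonzero(s - cssv / np.arange(1, u.size + 1) > 0)[0][-1]
    return np.maximum(u - cssv[rho] / (rho + 1.0), 0.0)

def proj_clipped(u):
    v = np.maximum(u, 0.0)
    return v if v.sum() >= 1.0 else proj_simplex(u)

def proj_Z(z, d1):
    return np.concatenate([proj_clipped(z[:d1]), proj_clipped(z[d1:])])

def exrm_plus(A, T, eta):
    d1, d2 = A.shape
    z = np.concatenate([np.full(d1, 1.0 / d1), np.full(d2, 1.0 / d2)])  # z^0 in Z
    for _ in range(T):
        z_half = proj_Z(z - eta * F(z, A), d1)
        z = proj_Z(z - eta * F(z_half, A), d1)
    return g_vec(z[:d1]), g_vec(z[d1:])

A = np.array([[3.0, 0.0, -3.0], [0.0, 3.0, -4.0], [0.0, 0.0, 1.0]])
L_F = np.sqrt(6) * np.linalg.norm(A, 2) * max(A.shape)
x, y = exrm_plus(A, 2000, eta=1.0 / (np.sqrt(2) * L_F))
print((x @ A).max() - (A @ y).min())   # duality gap of the last iterate
```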
6.6 Non-Convergence of RM+, Alternating RM+, and PRM+
In this section, we show empirically that several existing variants of RM+ may not converge in iterates. Specifically, we numerically investigate four algorithms, namely RM+, alternating RM+, PRM+, and alternating PRM+, on a simple $3 \times 3$ game matrix $A = [[3, 0, -3], [0, 3, -4], [0, 0, 1]]$ that has the unique Nash equilibrium $(x^*, y^*) = \big([\tfrac{1}{12}, \tfrac{1}{12}, \tfrac{5}{6}], [\tfrac{1}{3}, \tfrac{5}{12}, \tfrac{1}{4}]\big)$. The same instance was also used in [48] to illustrate the instability of PRM+ and the slow ergodic convergence of RM+ and PRM+. The results are shown in Figure 6.4. We observe that for RM+, alternating RM+, and PRM+, the duality gap remains on the order of $10^{-1}$ even after $10^5$ iterations. Our empirical findings are in line with Theorem 3 of Lee, Kroer, and Luo [105], who pointed out that RM+ diverges on the rock-paper-scissors game. In contrast, alternating PRM+ enjoys good last-iterate convergence properties on this matrix game instance. Overall, our empirical results suggest that
RM+, alternating RM+, and PRM+ all fail to converge in iterates, even when the game has a unique Nash equilibrium, a more regular and benign setting than the general case.

Figure 6.4: Duality gap of the current iterates generated by RM+, RM+ with alternation, Predictive RM+ (PRM+), and Predictive RM+ with alternation on the zero-sum game with payoff matrix $A = [[3, 0, -3], [0, 3, -4], [0, 0, 1]]$.
We complement our empirical non-convergence results by showing that RM+ has asymptotic last-iterate convergence under the restrictive assumption that the game has a strict Nash equilibrium. To our knowledge, this is the first positive last-iterate convergence result related to RM+. In a strict Nash equilibrium $(x^*, y^*)$, $x^*$ is the unique best response to $y^*$ and vice versa. It follows from this definition that the equilibrium is unique and that $x^*, y^*$ are pure strategies. As an example, the game matrix $A = [[2, 1], [3, 4]]$ has a strict Nash equilibrium $x^* = [1, 0]$ and $y^* = [1, 0]$.
Theorem 57 (Convergence of RM+ to Strict NE). If a matrix game has a strict Nash equilibrium $(x^*, y^*)$, RM+ (Algorithm 8) converges in last iterate, that is, $\lim_{t \to \infty} (x^t, y^t) = (x^*, y^*)$.
We remark that the assumption of strict NE in Theorem 57 cannot be weakened to the assumption of a
unique, non-strict Nash equilibrium, as our empirical counterexample shows. Despite the isolated positive
result given in Theorem 57 (under a very strong assumption that we do not expect to hold in practice), the
negative empirical results encountered in this section paint a bleak picture of the last-iterate convergence
of unmodified regret-matching algorithms. This sets the stage for the rest of the chapter, where we will
show unconditional last-iterate convergence of variants of RM+.
6.7 Convergence Properties of ExRM+
In this section, we prove that ExRM+ exhibits favorable last-iterate convergence results. The section is organized as follows. In Section 6.7.1, we prove asymptotic last-iterate convergence of ExRM+. Then, in Section 6.7.2, we provide a concrete rate of $O(1/\sqrt{T})$ for the best iterate, based on which we finally show a linear last-iterate convergence rate using a restarting mechanism in Section 6.7.3. All omitted proofs for this section can be found in Appendix E.14.
6.7.1 Asymptotic Last-Iterate Convergence of ExRM+
ExRM+ (Algorithm 11) is equivalent to the Extragradient (EG) algorithm of Korpelevich [93] for solving a variational inequality $VI(\mathcal{Z}_\ge, F)$. For a closed convex set $\mathcal{S} \subseteq \mathbb{R}^n$ and an operator $G : \mathcal{S} \to \mathbb{R}^n$, the variational inequality problem $VI(\mathcal{S}, G)$ is to find $z \in \mathcal{S}$ such that $\langle G(z), z - z' \rangle \le 0$ for all $z' \in \mathcal{S}$. We denote by $\mathrm{SOL}(\mathcal{S}, G)$ the solution set of $VI(\mathcal{S}, G)$. EG has been extensively studied in the literature, and its last-iterate convergence properties are known in several settings, including:

1. If $G$ is Lipschitz and pseudo-monotone with respect to the solution set $\mathrm{SOL}(\mathcal{S}, G)$, i.e., for any $z^* \in \mathrm{SOL}(\mathcal{S}, G)$ it holds that $\langle G(z), z - z^* \rangle \ge 0$ for all $z \in \mathcal{S}$, then the iterates produced by EG converge to a solution of $VI(\mathcal{S}, G)$ [45, Ch. 12];

2. If $G$ is Lipschitz and monotone, i.e., $\langle G(z) - G(z'), z - z' \rangle \ge 0$ for all $z, z' \in \mathcal{S}$, then the iterates $\{z^t\}$ produced by EG have $O(\frac{1}{\sqrt{t}})$ last-iterate convergence, in the sense that $\langle G(z^t), z^t - z \rangle \le O(\frac{1}{\sqrt{t}})$ for all $z \in \mathcal{S}$ [68, 70, 23].
Unfortunately, these results do not apply directly to our case: although the operator $F$ (as defined in Equation (6.4)) is $L_F$-Lipschitz-continuous over $\mathcal{Z}_\ge$, it is not monotone or even pseudo-monotone with respect to $\mathrm{SOL}(\mathcal{Z}_\ge, F)$. However, we observe that $F$ satisfies the Minty condition: there exists a solution $z^* \in \mathrm{SOL}(\mathcal{Z}_\ge, F)$ such that $\langle F(z), z - z^* \rangle \ge 0$ for all $z \in \mathcal{Z}_\ge$. The Minty condition is weaker than pseudo-monotonicity with respect to $\mathrm{SOL}(\mathcal{Z}_\ge, F)$ (note the different quantifiers $\forall$ and $\exists$ for $z^*$ in the two conditions).
Proposition 58 ($F$ satisfies the Minty condition). For any Nash equilibrium $z^* \in \Delta^{d_1} \times \Delta^{d_2}$ of the matrix game $A$, $\langle F(z), z - a z^* \rangle \ge 0$ holds for all $a \ge 1$ and all $z \in \mathcal{Z}_\ge$.

Proof. Let $z^* = (x^*, y^*) \in \Delta^{d_1} \times \Delta^{d_2}$ be a Nash equilibrium of the matrix game. For any $z = (R x, Q y) \in \mathcal{Z}_\ge$, we have $\langle F(z), z - a z^* \rangle = -\langle F(z), a z^* \rangle = a\big(x^\top A y^* - (x^*)^\top A y\big) \ge 0$, using $\langle F(z), z \rangle = 0$ and the definition of Nash equilibrium.
We are not aware of last-iterate convergence results for variational inequalities under only the Minty condition, but with this condition and Lipschitzness, a standard analysis shows that the distance between the sequence $\{z^t\}$ produced by ExRM+ and any $z^*$ that satisfies the Minty condition is decreasing. This serves as an important cornerstone for the rest of our analysis.
Lemma 59 (Adapted from Lemma 12.1.10 in [45]). Let $z^* \in \mathcal{Z}_\ge$ be a point such that $\langle F(z), z - z^* \rangle \ge 0$ for all $z \in \mathcal{Z}_\ge$. Let $\{z^t\}$ be the sequence produced by ExRM+. Then for every iteration $t \ge 0$ it holds that
$$\|z^{t+1} - z^*\|^2 \le \|z^t - z^*\|^2 - (1 - \eta^2 L_F^2)\,\|z^{t+\frac{1}{2}} - z^t\|^2.$$
By Lemma 59, the sequence $\{z^t\}$ is bounded, so it has at least one limit point $\hat{z} \in \mathcal{Z}_\ge$. We next show that every limit point $\hat{z}$ lies in the solution set of $VI(\mathcal{Z}_\ge, F)$ and induces a Nash equilibrium of the matrix game $A$. Moreover, we have $\lim_{t \to \infty} \|z^t - z^{t+1}\| = 0$.

Lemma 60. Let $\{z^t\}$ be the iterates produced by ExRM+. Then $\lim_{t \to \infty} \|z^t - z^{t+1}\| = 0$, and also:
1. If $\hat{z}$ is a limit point of $\{z^t\}$, then $\hat{z} \in \mathrm{SOL}(\mathcal{Z}_\ge, F)$.
2. If $\hat{z} \in \mathrm{SOL}(\mathcal{Z}_\ge, F)$, then $(\hat{x}, \hat{y}) = g(\hat{z})$ is a Nash equilibrium of the matrix game $A$.
Now we know that the sequence $\{z^t\}$ has at least one limit point $\hat{z} \in \mathrm{SOL}(\mathcal{Z}_\ge, F)$. If $\hat{z}$ is the unique limit point, then $\{z^t\}$ converges to $\hat{z}$. To show this, we first provide another condition under which $\{z^t\}$ converges in the following proposition.

Proposition 61. If the iterates $\{z^t\}$ produced by ExRM+ have a limit point $\hat{z}$ such that $\hat{z} = a z^*$ for some $z^* \in \mathcal{Z}$ and $a \ge 1$ (equivalently, $\hat{z}$ is colinear with a pair of strategies), then $\{z^t\}$ converges to $\hat{z}$.

Proof. Denote by $\{z^t\}_{t \in \kappa}$ a subsequence of $\{z^t\}$ that converges to $\hat{z}$. By Proposition 58, the Minty condition holds for $\hat{z}$, so by Lemma 59, $\{\|z^t - \hat{z}\|\}$ is monotonically decreasing and therefore converges. Since $\lim_{t \to \infty} \|z^t - \hat{z}\| = \lim_{t (\in \kappa) \to \infty} \|z^t - \hat{z}\| = 0$, $\{z^t\}$ converges to $\hat{z}$.
However, the condition of Proposition 61 may not hold in general, and we empirically observe in experiments that it is not uncommon for the limit point $\hat{z} = (R\hat{x}, Q\hat{y})$ to have $R \ne Q$. To proceed, we will use the observation that the only "bad" case that prevents us from proving convergence of $\{z^t\}$ is that $\{z^t\}$ has infinitely many limit points (note that the number of solutions $|\mathrm{SOL}(\mathcal{Z}_\ge, F)|$ is indeed infinite). This is because if $\{z^t\}$ has a finite number of limit points, then since $\lim_{t \to \infty} \|z^{t+1} - z^t\| = 0$ (Lemma 60), it must have a unique limit point (see a formal proof in Proposition 126). In the following, to show that it is impossible for $\{z^t\}$ to have infinitely many limit points, we first prove a lemma on the structure of the limit points of $\{z^t\}$.
Lemma 62 (Structure of Limit Points). Let $\{z^t\}$ be the iterates produced by ExRM+ and let $z^* \in \Delta^{d_1} \times \Delta^{d_2}$ be any Nash equilibrium of $A$. If $\hat{z}$ and $\tilde{z}$ are two limit points of $\{z^t\}$, then the following holds.
1. $\|a z^* - \hat{z}\|^2 = \|a z^* - \tilde{z}\|^2$ for all $a \ge 1$.
2. $\|\hat{z}\|^2 = \|\tilde{z}\|^2$.
3. $\langle z^*, \hat{z} - \tilde{z} \rangle = 0$.

Figure 6.5: Pictorial illustration of Lemma 62.
See Figure 6.5 for an illustration of this lemma.§ With such a structural understanding of the limit points, we are now ready to show that $\{z^t\}$ necessarily has a unique limit point.

Lemma 63 (Unique Limit Point). The sequence $\{z^t\}$ produced by ExRM+ has a unique limit point.
Proof Sketch. As discussed above, since $\lim_{t \to \infty} \|z^{t+1} - z^t\| = 0$, it suffices to prove that $\{z^t\}$ has finitely many limit points. Let $\hat{z} = (\hat{R}\hat{x}, \hat{Q}\hat{y})$ and $\tilde{z} = (\tilde{R}\tilde{x}, \tilde{Q}\tilde{y})$ be any two distinct limit points of $\{z^t\}$ such that $\hat{R} \ne \hat{Q}$ and $\tilde{R} \ne \tilde{Q}$ (otherwise we can apply Proposition 61 to prove convergence). By a careful case analysis, the structure of limit points (Lemma 62), and properties of Nash equilibria, we prove a key equality: $\hat{R} + \tilde{R} = \hat{Q} + \tilde{Q}$. Now $\hat{z}$ and $\tilde{z}$ must be the only two limit points: suppose there exists another limit point $z = (R x, Q y)$ with $R \ne Q$; then at least one of $\hat{R} + R = \hat{Q} + Q$ and $\tilde{R} + R = \tilde{Q} + Q$ would be violated and lead to a contradiction. Thus $\{z^t\}$ has at most two distinct limit points, which, again, when further combined with the fact that $\lim_{t \to \infty} \|z^{t+1} - z^t\| = 0$, implies that it in fact has a unique limit point (Proposition 126).
The last-iterate convergence of ExRM+ now follows directly from Lemma 60 and Lemma 63.

Theorem 64 (Last-Iterate Convergence of ExRM+). Let $\{z^t\}$ be the sequence produced by ExRM+. Then $\{z^t\}$ is bounded and converges to $z^* \in \mathcal{Z}_\ge$ with $g(z^*) = (x^*, y^*) \in \Delta^{d_1} \times \Delta^{d_2}$ being a Nash equilibrium of the matrix game $A$.

§Note that we draw $z^*$ in a simplex only as a simplified illustration; technically $z^*$ should be from the Cartesian product of two simplexes instead.
6.7.2 Best-Iterate Convergence Rate of ExRM+
To remedy not having a concrete convergence rate in the last result, we now prove an $O(\frac{1}{\sqrt{T}})$ best-iterate convergence rate for ExRM+ in this section. The following key lemma relates the duality gap of the pair of strategies $(x^{t+1}, y^{t+1})$ to the distance $\|z^{t+\frac{1}{2}} - z^t\|$.

Lemma 65. Let $\{z^t\}$ be the iterates produced by ExRM+ and $(x^{t+1}, y^{t+1}) = g(z^{t+1})$. Then
$$\mathrm{DualityGap}(x^{t+1}, y^{t+1}) \le \frac{12\,\|z^{t+\frac{1}{2}} - z^t\|}{\eta}.$$
Now, combining Lemma 59 and Lemma 65, we conclude the following best-iterate convergence rate.

Theorem 66. Let $\{z^t\}$ be the sequence produced by ExRM+ with initial point $z^0$. Then for any Nash equilibrium $z^*$ and any $T \ge 1$, there exists $t \le T$ with $(x^t, y^t) = g(z^t)$ and
$$\mathrm{DualityGap}(x^t, y^t) \le \frac{12\,\|z^0 - z^*\|}{\eta \sqrt{1 - \eta^2 L_F^2}} \cdot \frac{1}{\sqrt{T}}.$$
If $\eta = \frac{1}{\sqrt{2} L_F}$, then $\mathrm{DualityGap}(x^t, y^t) \le \frac{24 L_F \|z^0 - z^*\|}{\sqrt{T}}$.

Proof. Fix any Nash equilibrium $z^*$ of the game. From Lemma 59, we know $\sum_{t=0}^{T-1} \|z^{t+\frac{1}{2}} - z^t\|^2 \le \frac{\|z^0 - z^*\|^2}{1 - \eta^2 L_F^2}$. This implies that there exists $0 \le t \le T - 1$ such that $\|z^{t+\frac{1}{2}} - z^t\| \le \frac{\|z^0 - z^*\|}{\sqrt{T}\sqrt{1 - \eta^2 L_F^2}}$. We then get the desired result by applying Lemma 65.
6.7.3 Linear Last-Iterate Convergence for ExRM+ with Restarts
In this section, based on the best-iterate convergence result from the last section, we further provide a
simple restarting mechanism under which ExRM+ enjoys linear last-iterate convergence. To show this,
we recall that zero-sum matrix games satisfy the following metric subregularity condition.
Proposition 67 (Metric Subregularity [161]). Let $A \in \mathbb{R}^{d_1 \times d_2}$ be a matrix game. There exists a constant $c > 0$ (depending only on $A$) such that for any $z = (x, y) \in \Delta^{d_1} \times \Delta^{d_2}$, it holds that $\mathrm{DualityGap}(x, y) \ge c\,\|z - \Pi_{\mathcal{Z}^*}[z]\|$, where $\mathcal{Z}^*$ denotes the set of all Nash equilibria.
Together with the best-iterate convergence rate from Theorem 66 (with $\eta = \frac{1}{\sqrt{2} L_F}$), we immediately get that for any $T \ge 1$, there exists $1 \le t \le T$ such that
$$\left\|(x^t, y^t) - \Pi_{\mathcal{Z}^*}[(x^t, y^t)]\right\| \le \frac{\mathrm{DualityGap}(x^t, y^t)}{c} \le \frac{24 L_F}{c \sqrt{T}} \cdot \left\|z^0 - \Pi_{\mathcal{Z}^*}[z^0]\right\|.$$
The above inequality further implies that if $T \ge \frac{48^2 L_F^2}{c^2}$, then there exists $1 \le t \le T$ such that
$$\left\|(x^t, y^t) - \Pi_{\mathcal{Z}^*}[(x^t, y^t)]\right\| \le \frac{1}{2} \left\|z^0 - \Pi_{\mathcal{Z}^*}[z^0]\right\|.$$
Therefore, after at most a constant number of iterations (smaller than $\frac{48^2 L_F^2}{c^2}$), the distance of the best iterate $(x^t, y^t)$ to the equilibrium set $\mathcal{Z}^*$ is halved compared to that of the initial point. If we could somehow identify this best iterate, then we would just need to restart the algorithm with this best iterate as the next initial strategy. Repeating this would then lead to linear last-iterate convergence.
The issue with the argument above is obviously that we cannot exactly identify the best iterate, since $\|(x^t, y^t) - \Pi_{\mathcal{Z}^*}[(x^t, y^t)]\|$ is unknown. However, it turns out that we can use $\|z^{t+\frac{1}{2}} - z^t\|$ as a proxy, since $\|(x^t, y^t) - \Pi_{\mathcal{Z}^*}[(x^t, y^t)]\| \le \frac{1}{c}\mathrm{DualityGap}(x^t, y^t) \le \frac{12\|z^{t+\frac{1}{2}} - z^t\|}{c\eta}$ by Lemma 65. This motivates the design of our algorithm, Restarting ExRM+ (RS-ExRM+ for short; see Algorithm 13), which restarts for the $k$-th time if $\|z^{t+\frac{1}{2}} - z^t\|$ is less than $O(\frac{1}{2^k})$. Importantly, RS-ExRM+ does not require knowing the value of $c$, the constant in the metric subregularity condition, which can be hard to compute.

The main result of this section is the following linear convergence rate for RS-ExRM+.
Algorithm 13: Restarting ExRM+ (RS-ExRM+)
1: Input: step size $\eta \in (0, \tfrac{1}{L_F})$, $\rho > 0$.
2: Initialize: $z^0 \in \mathcal{Z}$, $k = 1$
3: for $t = 0, 1, \ldots$ do
4:   $z^{t+1/2} = \Pi_{\mathcal{Z}_\ge}\big(z^t - \eta F(z^t)\big)$
5:   $z^{t+1} = \Pi_{\mathcal{Z}_\ge}\big(z^t - \eta F(z^{t+1/2})\big)$
6:   if $\|z^{t+1/2} - z^t\| \le \rho / 2^k$ then
7:     $z^{t+1} \leftarrow g(z^{t+1}) \in \Delta^{d_1} \times \Delta^{d_2}$
8:     $k \leftarrow k + 1$

Algorithm 14: Restarting SPRM+ (RS-SPRM+)
1: Input: step size $\eta \in (0, \tfrac{1}{8 L_F}]$.
2: Initialize: $z^{-1} = w^0 \in \mathcal{Z}$, $k = 1$
3: for $t = 0, 1, \ldots$ do
4:   $z^t = \Pi_{\mathcal{Z}_\ge}\big(w^t - \eta F(z^{t-1})\big)$
5:   $w^{t+1} = \Pi_{\mathcal{Z}_\ge}\big(w^t - \eta F(z^t)\big)$
6:   if $\|w^{t+1} - z^t\| + \|w^t - z^t\| \le 8 / 2^k$ then
7:     $z^t, w^{t+1} \leftarrow g(w^{t+1}) \in \Delta^{d_1} \times \Delta^{d_2}$
8:     $k \leftarrow k + 1$
Theorem 68 (Linear Last-Iterate Convergence of RS-ExRM+). Let $\{z^t\}$ be the sequence produced by RS-ExRM+ and let $\rho = \frac{4}{\sqrt{1 - \eta^2 L_F^2}}$. Then for any $t \ge 1$, the iterate $(x^t, y^t) = g(z^t)$ satisfies
$$\left\|(x^t, y^t) - \Pi_{\mathcal{Z}^*}[(x^t, y^t)]\right\| \le \frac{\mathrm{DualityGap}(x^t, y^t)}{c} \le \alpha \cdot (1 - \beta)^t,$$
where $\alpha = \frac{576}{c^2 \eta^2 (1 - \eta^2 L_F^2)}$ and $\beta = \frac{1}{3(1 + \alpha)}$.
Proof Sketch. Let us denote by $t_k$ the iteration at which the $k$-th restart happens. According to the restart condition and Lemma 65, we know that at iteration $t_k$, the duality gap of $(x^{t_k}, y^{t_k})$ and its distance to $\mathcal{Z}^*$ are at most $O(\frac{1}{2^k})$. For iterates $t \in [t_k, t_{k+1}]$ at which the algorithm does not restart, we can use Theorem 66 to show that their performance is not much worse than that of $(x^{t_k}, y^{t_k})$. Then we prove that $t_{k+1} - t_k$ is upper bounded by a constant for every $k$, which leads to a linear last-iterate convergence rate for all iterations $t \ge 1$.
6.8 Last-Iterate Convergence of Smooth Predictive RM+
In this section we study another RM+ variant, SPRM+ (Algorithm 12). We present convergence results very similar to those in the last section for ExRM+. Given the similarity, and for the sake of conciseness, we only state the main results here, with all details and proofs deferred to Appendix E.15.

Theorem 69 (Asymptotic Last-Iterate Convergence of SPRM+). Let $\{w^t\}$ and $\{z^t\}$ be the sequences produced by SPRM+. Then $\{w^t\}$ and $\{z^t\}$ are bounded and both converge to $z^* \in \mathcal{Z}_\ge$ with $g(z^*) = (x^*, y^*) \in \Delta^{d_1} \times \Delta^{d_2}$ being a Nash equilibrium of the matrix game $A$.
Theorem 70 (Best-Iterate Convergence Rate of SPRM+). Let $\{w^t\}$ and $\{z^t\}$ be the sequences produced by SPRM+. For any Nash equilibrium $z^*$ of the game and any $T \ge 1$, there exists $1 \le t \le T$ such that the iterate $g(w^t) \in \Delta^{d_1} \times \Delta^{d_2}$ satisfies
$$\mathrm{DualityGap}(g(w^t)) \le \frac{10\,\|w^0 - z^*\|}{\eta} \cdot \frac{1}{\sqrt{T}}.$$
Moreover, for any Nash equilibrium $z^*$ of the game and any $T \ge 1$, there exists $0 \le t \le T$ such that the iterate $g(z^t) \in \Delta^{d_1} \times \Delta^{d_2}$ satisfies
$$\mathrm{DualityGap}(g(z^t)) \le \frac{18\,\|w^0 - z^*\|}{\eta} \cdot \frac{1}{\sqrt{T}}.$$
We apply the same idea of restarting to SPRM+ to design a new algorithm called Restarting SPRM+ (RS-SPRM+; see Algorithm 14) that has provable linear last-iterate convergence.
Theorem 71 (Linear Last-Iterate Convergence of RS-SPRM+). Let $\{w^t\}$ and $\{z^t\}$ be the sequences produced by RS-SPRM+. Define $\alpha = \frac{400}{c^2 \eta^2}$ and $\beta = \frac{1}{3(1 + \alpha)}$. Then for any $t \ge 1$, the iterate $g(w^t) \in \Delta^{d_1} \times \Delta^{d_2}$ satisfies
$$\left\|g(w^t) - \Pi_{\mathcal{Z}^*}[g(w^t)]\right\| \le \frac{\mathrm{DualityGap}(g(w^t))}{c} \le \alpha \cdot (1 - \beta)^t.$$
Moreover, the iterate $g(z^t) \in \Delta^{d_1} \times \Delta^{d_2}$ satisfies
$$\left\|g(z^t) - \Pi_{\mathcal{Z}^*}[g(z^t)]\right\| \le \frac{\mathrm{DualityGap}(g(z^t))}{c} \le 2\alpha \cdot (1 - \beta)^t.$$
114
6.9 Numerical Experiments of Last-Iterate Convergence
Figure 6.6: Empirical performances of several algorithms on the 3 × 3 matrix game (left plot), Kuhn poker and Goofspiel (center plots), and random instances (right plot). Each panel plots the duality gap against the number of iterations for RM+, alternating RM+, PRM+, alternating PRM+, ExRM+, and Smooth PRM+.
Next, we numerically evaluate the last-iterate performance of each algorithm studied in this chapter. We use the 3 × 3 matrix game instance from Section 6.6, the normal-form representations of Kuhn poker and Goofspiel, as well as 25 random matrix games of size (d_1, d_2) = (10, 15) (for which we average the duality gaps across the instances and show the associated confidence intervals). More details on the games can be found in Appendix E.16. We set η = 0.1 for all algorithms with a stepsize and we initialize all algorithms at ((1/d_1)1_{d_1}, (1/d_2)1_{d_2}). In every iteration, we plot the duality gap of (x^t, y^t) for RM+, PRM+, and alternating PRM+; the duality gap of g(z^t) for ExRM+ and RS-ExRM+; and the duality gap of g(w^t) for SPRM+ and RS-SPRM+. The results are shown in Figure 6.6. For the 3 × 3 matrix game, we see that alternating PRM+, ExRM+, and SPRM+ achieve machine precision after 10^3 iterations (while the others stay around 10^{−1}, as discussed earlier). On Kuhn poker, PRM+ and alternating PRM+ have faster convergence before 10^3 iterations, and perform on par with ExRM+ and SPRM+ after that. On Goofspiel, alternating PRM+ is again the best algorithm, although all algorithms (except RM+) have comparable performance after 10^5 iterations. Finally, on random instances, the last-iterate performance of ExRM+ and SPRM+ vastly outperforms that of RM+, PRM+, and alternating RM+, but we note that alternating PRM+ seems to outperform all other algorithms. Overall, these results are consistent with our theoretical findings for ExRM+ and SPRM+. That said, understanding the superiority of alternating PRM+ (an algorithm that we do not analyze in this work and for which neither ergodic nor last-iterate convergence guarantees are known) remains open. We also numerically investigate the impact of restarting for RS-ExRM+ and RS-SPRM+. For the sake of space we provide our analysis in Appendix E.16, where we note that restarting does not significantly change the empirical convergence of ExRM+ and SPRM+, which is coherent with the fact that ExRM+ and SPRM+ (without restarting) already exhibit fast last-iterate convergence.
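For reference, the duality gap plotted in Figure 6.6 can be computed directly from a strategy pair on the product of simplices. Below is a minimal sketch, assuming (as in Chapter 2) that the x-player is the minimizer of x^⊤Ay; if the roles are reversed, the two terms are swapped. The function name is illustrative only.

import numpy as np

def duality_gap(A, x, y):
    # DualityGap(x, y) = max_j (A^T x)_j - min_i (A y)_i for the matrix game
    # min_x max_y x^T A y; it is zero if and only if (x, y) is a Nash equilibrium.
    return np.max(A.T @ x) - np.min(A @ y)

# Example on the uniform initialization used in the experiments above.
d1, d2 = 10, 15
A = np.random.uniform(-1, 1, size=(d1, d2))
x = np.ones(d1) / d1
y = np.ones(d2) / d2
print(duality_gap(A, x, y))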
Chapter 7
Conclusions and Future Directions
In this thesis, we study learning dynamics in various games. We provide fundamental fast convergence results, in the average iterate or the last iterate, for classic no-regret learning algorithms. We also carefully design new no-regret algorithms that enjoy these favorable properties. Our analysis goes beyond traditional worst-case analysis, attempting to discover the adaptivity of these algorithms in slowly drifting environments, in particular, in games where players interact with each other and gradually alter their behavior correspondingly.

In Chapter 2, we first prove the linear convergence of OGDA and OMWU in normal-form games. In Chapter 3, we generalize the results of OMWU to extensive-form games with the help of dilated regularization. In Chapter 4, we extend OMWU from normal-form games to extensive-form games in a black-box manner using the kernel trick, obtaining not only last-iterate convergence but all the favorable properties OMWU has in normal-form games. In Chapter 5, we consider more general convex games and give the first near-optimal no-regret algorithm using the lifting technique and log regularization. Finally, in Chapter 6, we point out that instability causes RM+ and its variants to fail to have low regret and last-iterate convergence, and we provide practical solutions to stabilize them and ensure their fast convergence.
There are still several important questions unsolved in the learning-in-games literature. The first direction involves understanding the behavior of OMWU even further. OMWU, in general, has a logarithmic dependence on the dimension, which makes it well-suited for large-scale applications. However, in theory, getting linear convergence without the uniqueness assumption remains open even in normal-form games. Getting constant or O(log T) individual regret for OMWU is also an interesting question. Another open question is whether OMWU (entropic regularizers) can be incorporated into our framework in Chapter 5 to get near-optimal regret in convex games.

The second direction involves understanding RM+ in more detail. Although we have shown that in the worst case it is impossible for RM+ to have last-iterate convergence and low regret without stabilizing it, the algorithms are well-behaved in most games tested. It would further bridge the gap between theory and experiments if we could characterize the conditions under which RM+ fails or not. Moreover, alternation combined with predictive RM+ performs exceptionally well in games without our stabilizing technique, but proving its improved regret or last-iterate convergence is still open.

Lastly, at a higher level, understanding and characterizing the performance of general no-regret algorithms is challenging but essential for future work. For example, under the optimistic online mirror descent or follow-the-regularized-leader framework, do we have sufficient or necessary conditions on regularizers so that they are guaranteed to have last-iterate convergence or subpolynomial regret? Additionally, when analyzing optimistic (predictive) algorithms, we always let the prediction be the loss or regret in the previous round. Is this necessary? Are there other predictions leading to favorable properties in games? The fast convergence analyses of no-regret algorithms in games, in this thesis and in previous work, are ad hoc in general, and utilizing the same method to obtain similar results for other no-regret algorithms is difficult. To answer these questions, we may need to develop more powerful and generalizable techniques or study the problems more systematically. Any successful attempt along this line will further push our understanding of learning dynamics.
Bibliography
[1] Jacob Abernethy, Elad Hazan, and Alexander Rakhlin. “Competing in the dark: An efficient
algorithm for bandit linear optimization”. In: In Proceedings of the 21st Annual Conference on
Learning Theory (COLT). 2008.
[2] Ahmet Alacaoglu, Olivier Fercoq, and Volkan Cevher. “On the convergence of stochastic
primal-dual hybrid gradient”. In: arXiv preprint arXiv:1911.00799 (2019).
[3] Ioannis Anagnostides, Constantinos Daskalakis, Gabriele Farina, Maxwell Fishelson,
Noah Golowich, and Tuomas Sandholm. “Near-optimal no-regret learning for correlated
equilibria in multi-player general-sum games”. In: STOC ’22: 54th Annual ACM SIGACT
Symposium on Theory of Computing. ACM, 2022, pp. 736–749.
[4] Ioannis Anagnostides, Gabriele Farina, Christian Kroer, Chung-Wei Lee, Haipeng Luo, and
Tuomas Sandholm. “Uncoupled Learning Dynamics with O(log T) Swap Regret in Multiplayer
Games”. In: Advances in Neural Information Processing Systems 35 (2022), pp. 3292–3304.
[5] Jean-Yves Audibert, Sébastien Bubeck, and Gábor Lugosi. “Regret in online combinatorial
optimization”. In: Mathematics of Operations Research 39.1 (2014), pp. 31–45.
[6] Waiss Azizian, Ioannis Mitliagkas, Simon Lacoste-Julien, and Gauthier Gidel. “A tight and unified
analysis of gradient-based methods for a whole spectrum of differentiable games”. In:
International Conference on Artificial Intelligence and Statistics. 2020.
[7] Yu Bai, Chi Jin, Song Mei, Ziang Song, and Tiancheng Yu. “Efficient Phi-Regret Minimization in
Extensive-Form Games via Online Mirror Descent”. In: arXiv preprint arXiv:2205.15294 (2022).
[8] James P Bailey and Georgios Piliouras. “Multiplicative weights update in zero-sum games”. In:
Proceedings of the 2018 ACM Conference on Economics and Computation. 2018.
[9] Michael Bowling. “Convergence and no-regret in multiagent learning”. In: Advances in neural
information processing systems. 2005, pp. 209–216.
[10] Michael Bowling, Neil Burch, Michael Johanson, and Oskari Tammelin. “Heads-up limit hold’em
poker is solved”. In: Science 347.6218 (2015), pp. 145–149.
[11] George W Brown. “Iterative solution of games by fictitious play”. In: Activity analysis of
production and allocation 13.1 (1951), pp. 374–376.
[12] Noam Brown, Sam Ganzfried, and Tuomas Sandholm. “Hierarchical Abstraction, Distributed
Equilibrium Computation, and Post-Processing, with Application to a Champion No-Limit Texas
Hold’em Agent.” In: AAAI Workshop: Computer Poker and Imperfect Information. 2015.
[13] Noam Brown, Adam Lerer, Sam Gross, and Tuomas Sandholm. “Deep counterfactual regret
minimization”. In: International conference on machine learning. PMLR. 2019, pp. 793–802.
[14] Noam Brown and Tuomas Sandholm. “Solving imperfect-information games via discounted
regret minimization”. In: Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 33.
2019, pp. 1829–1836.
[15] Noam Brown and Tuomas Sandholm. “Superhuman AI for heads-up no-limit poker: Libratus
beats top professionals”. In: Science (Dec. 2017), eaao1733.
[16] Noam Brown and Tuomas Sandholm. “Superhuman AI for heads-up no-limit poker: Libratus
beats top professionals”. In: Science 359.6374 (2018), pp. 418–424.
[17] Noam Brown and Tuomas Sandholm. “Superhuman AI for multiplayer poker”. In: Science
365.6456 (2019), pp. 885–890.
[18] Sébastien Bubeck, Yuanzhi Li, Haipeng Luo, and Chen-Yu Wei. “Improved Path-length Regret
Bounds for Bandits”. In: Conference on Learning Theory, COLT 2019, 25-28 June 2019. Vol. 99.
Proceedings of Machine Learning Research. PMLR, 2019, pp. 508–528.
[19] Neil Burch. “Time and space: Why imperfect information games are hard”. PhD thesis. University
of Alberta, 2018.
[20] Neil Burch, Michael Johanson, and Michael Bowling. “Solving imperfect information games using
decomposition”. In: Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 28. 2014.
[21] Neil Burch, Matej Moravcik, and Martin Schmid. “Revisiting CFR+ and alternating updates”. In:
Journal of Artificial Intelligence Research 64 (2019), pp. 429–443.
[22] Yang Cai, Gabriele Farina, Julien Grand-Clément, Christian Kroer, Chung-Wei Lee, Haipeng Luo,
and Weiqiang Zheng. “Last-Iterate Convergence Properties of Regret-Matching Algorithms in
Games”. In: arXiv preprint arXiv:2311.00676 (2023).
[23] Yang Cai, Argyris Oikonomou, and Weiqiang Zheng. “Finite-Time Last-Iterate Convergence for
Learning in Multi-Player Games”. In: Advances in Neural Information Processing Systems
(NeurIPS). 2022.
[24] Nicolo Cesa-Bianchi and Gábor Lugosi. “Combinatorial bandits”. In: Journal of Computer and
System Sciences 78.5 (2012), pp. 1404–1422.
[25] Nicolo Cesa-Bianchi and Gábor Lugosi. Prediction, learning, and games. Cambridge university
press, 2006.
[26] Gong Chen and Marc Teboulle. “Convergence analysis of a proximal-like minimization algorithm
using Bregman functions”. In: SIAM Journal on Optimization 3.3 (1993), pp. 538–543.
[27] Xi Chen and Binghui Peng. “Hedging in games: Faster convergence of external and swap regrets”.
In: Proceedings of the Annual Conference on Neural Information Processing Systems (NeurIPS). 2020.
[28] Yunmei Chen, Guanghui Lan, and Yuyuan Ouyang. “Accelerated schemes for a class of
variational inequalities”. In: Mathematical Programming (2017).
[29] Yun Kuen Cheung and Georgios Piliouras. “Chaos, Extremism and Optimism: Volume Analysis of
Learning in Games”. In: Advances in Neural Information Processing Systems (2020).
[30] Yun Kuen Cheung and Georgios Piliouras. “Vortices Instead of Equilibria in MinMax
Optimization: Chaos and Butterfly Effects of Online Learning in Zero-Sum Games”. In: Conference
on Learning Theory. 2019, pp. 807–834.
[31] Chao-Kai Chiang, Tianbao Yang, Chia-Jung Lee, Mehrdad Mahdavi, Chi-Jen Lu, Rong Jin, and
Shenghuo Zhu. “Online optimization with gradual variations”. In: Conference on Learning Theory.
2012, pp. 6–1.
[32] Laurent Condat. “Fast projection onto the simplex and the ℓ1 ball”. In: Mathematical Programming
158.1 (2016), pp. 575–585.
[33] Shisheng Cui and Uday V Shanbhag. “On the analysis of reflected gradient and splitting methods
for monotone stochastic variational inequality problems”. In: 2016 IEEE 55th Conference on
Decision and Control (CDC). 2016.
[34] Constantinos Daskalakis, Alan Deckelbaum, and Anthony Kim. “Near-optimal no-regret
algorithms for zero-sum games”. In: Annual ACM-SIAM Symposium on Discrete Algorithms
(SODA). 2011.
[35] Constantinos Daskalakis, Alan Deckelbaum, Anthony Kim, et al. “Near-optimal no-regret
algorithms for zero-sum games”. In: Games and Economic Behavior (2015).
[36] Constantinos Daskalakis, Maxwell Fishelson, and Noah Golowich. “Near-Optimal No-Regret
Learning in General Games”. In: CoRR abs/2108.06924 (2021).
[37] Constantinos Daskalakis and Noah Golowich. “Fast rates for nonparametric online learning: from
realizability to learning in games”. In: STOC ’22: 54th Annual ACM SIGACT Symposium on Theory
of Computing. ACM, 2022, pp. 846–859.
[38] Constantinos Daskalakis, Andrew Ilyas, Vasilis Syrgkanis, and Haoyang Zeng. “Training GANs
with Optimism”. In: International Conference on Learning Representations. 2018.
[39] Constantinos Daskalakis and Ioannis Panageas. “Last-Iterate Convergence: Zero-Sum Games and
Constrained Min-Max Optimization”. In: 10th Innovations in Theoretical Computer Science
Conference (ITCS 2019). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik. 2019.
[40] Constantinos Daskalakis and Ioannis Panageas. “Last-Iterate Convergence: Zero-Sum Games and
Constrained Min-Max Optimization”. In: Smooth Games Optimization and Machine Learning
Workshop (NeurIPS 2019) (2019).
[41] Constantinos Daskalakis and Ioannis Panageas. “The limit points of (optimistic) gradient descent
in min-max optimization”. In: Advances in Neural Information Processing Systems. 2018,
pp. 9236–9246.
[42] Damek Davis. Lecture 5, Mathematical Programming I. Available at
people.orie.cornell.edu/dsd95/teaching/orie6300/lec05.pdf. 2016.
[43] Damek Davis. Lecture 6, Mathematical Programming I. Available at
people.orie.cornell.edu/dsd95/teaching/orie6300/lec06.pdf. 2016.
[44] Eyal Even-Dar, Yishay Mansour, and Uri Nadav. “On the convergence of regret minimization
dynamics in concave games”. In: Proceedings of the 41st Annual ACM Symposium on Theory of
Computing, STOC 2009. Ed. by Michael Mitzenmacher. ACM, 2009, pp. 523–532. doi:
10.1145/1536414.1536486.
[45] Francisco Facchinei and Jong-Shi Pang. Finite-dimensional variational inequalities and
complementarity problems. Springer, 2003.
[46] Gabriele Farina, Ioannis Anagnostides, Haipeng Luo, Chung-Wei Lee, Christian Kroer, and
Tuomas Sandholm. “Near-optimal no-regret learning dynamics for general convex games”. In:
Advances in Neural Information Processing Systems 35 (2022), pp. 39076–39089.
[47] Gabriele Farina, Andrea Celli, Nicola Gatti, and Tuomas Sandholm. “Ex ante coordination and
collusion in zero-sum multi-player extensive-form games”. In: Advances in Neural Information
Processing Systems. 2018, pp. 9638–9648.
[48] Gabriele Farina, Julien Grand-Clément, Christian Kroer, Chung-Wei Lee, and Haipeng Luo.
“Regret Matching+:(In) Stability and Fast Convergence in Games”. In: Annual Conference on
Neural Information Processing Systems (NeurIPS). 2023.
[49] Gabriele Farina, Christian Kroer, Noam Brown, and Tuomas Sandholm. “Stable-predictive
optimistic counterfactual regret minimization”. In: International conference on machine learning.
PMLR. 2019, pp. 1853–1862.
[50] Gabriele Farina, Christian Kroer, Chung-Wei Lee, and Haipeng Luo. “Clairvoyant Regret
Minimization: Equivalence with Nemirovski’s Conceptual Prox Method and Extension to General
Convex Games”. In: OPT 2022: Optimization for Machine Learning (NeurIPS 2022 Workshop). 2022.
[51] Gabriele Farina, Christian Kroer, and Tuomas Sandholm. “Better Regularization for Sequential
Decision Spaces: Fast Convergence Rates for Nash, Correlated, and Team Equilibria”. In:
Proceedings of the 2021 ACM Conference on Economics and Computation. 2021.
[52] Gabriele Farina, Christian Kroer, and Tuomas Sandholm. “Faster Game Solving via Predictive
Blackwell Approachability: Connecting Regret Matching and Mirror Descent”. In: Proceedings of
the AAAI Conference on Artificial Intelligence. 2021.
[53] Gabriele Farina, Christian Kroer, and Tuomas Sandholm. “Online Convex Optimization for
Sequential Decision Processes and Extensive-Form Games”. In: AAAI Conference on Artificial
Intelligence. 2019.
[54] Gabriele Farina, Christian Kroer, and Tuomas Sandholm. “Optimistic Regret Minimization for
Extensive-Form Games via Dilated Distance-Generating Functions”. In: Advances in Neural
Information Processing Systems. 2019, pp. 5222–5232.
[55] Gabriele Farina, Chung-Wei Lee, Haipeng Luo, and Christian Kroer. “Kernelized Multiplicative
Weights for 0/1-Polyhedral Games: Bridging the Gap Between Learning in Extensive-Form and
Normal-Form Games”. In: Proceedings of the 39th International Conference on Machine Learning.
Vol. 162. Proceedings of Machine Learning Research. PMLR, 2022.
[56] Gabriele Farina, Chun Kai Ling, Fei Fang, and Tuomas Sandholm. “Correlation in Extensive-Form
Games: Saddle-Point Formulation and Benchmarks”. In: Conference on Neural Information
Processing Systems (NeurIPS). 2019.
[57] Dean Foster and Rakesh Vohra. “Calibrated Learning and Correlated Equilibrium”. In: Games and
Economic Behavior 21 (1997), pp. 40–55.
[58] Dylan J Foster, Zhiyuan Li, Thodoris Lykouris, Karthik Sridharan, and Eva Tardos. “Learning in
games: Robustness of fast convergence”. In: Advances in Neural Information Processing Systems.
2016.
[59] Yoav Freund and Robert E Schapire. “Adaptive game playing using multiplicative weights”. In:
Games and Economic Behavior 29.1-2 (1999), pp. 79–103.
[60] Drew Fudenberg and Alexander Peysakhovich. “Recency, Records and Recaps: Learning and
Non-Equilibrium Behavior in a Simple Decision Problem”. In: Proceedings of the Fifteenth ACM
Conference on Economics and Computation. EC ’14. Palo Alto, California, USA: Association for
Computing Machinery, 2014, pp. 971–986.
[61] Sam Ganzfried and Tuomas Sandholm. “Endgame solving in large imperfect-information games”.
In: Proceedings of the 2015 International Conference on Autonomous Agents and Multiagent Systems.
Citeseer. 2015, pp. 37–45.
[62] Sam Ganzfried and Tuomas Sandholm. “Potential-aware imperfect-recall abstraction with earth
mover’s distance in imperfect-information games”. In: Proceedings of the AAAI Conference on
Artificial Intelligence. Vol. 28. 2014.
[63] Yuan Gao, Christian Kroer, and Donald Goldfarb. “Increasing Iterate Averaging for Solving
Saddle-Point Problems”. In: Proceedings of the AAAI Conference on Artificial Intelligence. 2021.
[64] Gauthier Gidel, Hugo Berard, Gaëtan Vignoud, Pascal Vincent, and Simon Lacoste-Julien. “A
variational inequality perspective on generative adversarial networks”. In: International
Conference on Learning Representations (2019).
[65] Andrew Gilpin. Algorithms for abstracting and solving imperfect information games. Doctoral
dissertation, Computer Science Department, Carnegie Mellon University, Pittsburgh, 2009.
[66] Andrew Gilpin, Javier Peña, and Tuomas Sandholm. “First-Order Algorithm with O(ln(1/ϵ))
convergence for ϵ-equilibrium in Two-Person Zero-Sum Games”. In: AAAI. 2008.
[67] Noah Golowich, Sarath Pattathil, and Constantinos Daskalakis. “Tight last-iterate convergence
rates for no-regret learning in multi-player games”. In: Advances in Neural Information Processing
Systems (2020).
[68] Noah Golowich, Sarath Pattathil, Constantinos Daskalakis, and Asuman Ozdaglar. “Last Iterate is
Slower than Averaged Iterate in Smooth Convex-Concave Saddle Point Problems”. In: Conference
on Learning Theory (2020).
[69] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair,
Aaron Courville, and Yoshua Bengio. “Generative adversarial nets”. In: Advances in neural
information processing systems. 2014.
[70] Eduard Gorbunov, Nicolas Loizou, and Gauthier Gidel. “Extragradient method: O(1/K)
last-iterate convergence for monotone variational inequalities and connections with
cocoercivity”. In: International Conference on Artificial Intelligence and Statistics (AISTATS). PMLR.
2022, pp. 366–402.
[71] Geoffrey J Gordon, Amy Greenwald, and Casey Marks. “No-regret learning in convex games”. In:
Proceedings of the 25th international conference on Machine learning. ACM. 2008, pp. 360–367.
[72] Julien Grand-Clément and Christian Kroer. “Solving optimization problems with Blackwell
approachability”. In: Mathematics of Operations Research (2023).
[73] Patrick T Harker and Jong-Shi Pang. “Finite-dimensional variational inequality and nonlinear
complementarity problems: a survey of theory, algorithms and applications”. In: Mathematical
programming (1990).
[74] Tobias Harks and Max Klimm. “Demand Allocation Games: Integrating Discrete and Continuous
Strategy Spaces”. In: Internet and Network Economics - 7th International Workshop, WINE 2011.
Ed. by Ning Chen, Edith Elkind, and Elias Koutsoupias. Vol. 7090. Lecture Notes in Computer
Science. Springer, 2011, pp. 194–205.
[75] Charles R. Harris, K. Jarrod Millman, Stéfan J. van der Walt, Ralf Gommers, Pauli Virtanen,
David Cournapeau, Eric Wieser, Julian Taylor, Sebastian Berg, Nathaniel J. Smith, Robert Kern,
Matti Picus, Stephan Hoyer, Marten H. van Kerkwijk, Matthew Brett, Allan Haldane,
Jaime Fernández del Río, Mark Wiebe, Pearu Peterson, Pierre Gérard-Marchant, Kevin Sheppard,
Tyler Reddy, Warren Weckesser, Hameer Abbasi, Christoph Gohlke, and Travis E. Oliphant.
“Array programming with NumPy”. In: Nature 585.7825 (Sept. 2020), pp. 357–362. doi:
10.1038/s41586-020-2649-2.
[76] Sergiu Hart and Andreu Mas-Colell. “A simple adaptive procedure leading to correlated
equilibrium”. In: Econometrica 68.5 (2000), pp. 1127–1150.
[77] Sergiu Hart and Andreu Mas-Colell. “Uncoupled dynamics do not lead to Nash equilibrium”. In:
American Economic Review 93 (2003), pp. 1830–1836.
[78] Elad Hazan et al. “Introduction to online convex optimization”. In: Foundations and Trends® in
Optimization 2.3-4 (2016), pp. 157–325.
[79] Johannes Heinrich and David Silver. “Deep reinforcement learning from self-play in
imperfect-information games”. In: arXiv preprint arXiv:1603.01121 (2016).
[80] David Helmbold and Manfred Warmuth. “Learning Permutations with Exponential Weights.” In:
Journal of Machine Learning Research 10.7 (2009).
[81] Samid Hoda, Andrew Gilpin, Javier Pena, and Tuomas Sandholm. “Smoothing techniques for
computing Nash equilibria of sequential games”. In: Mathematics of Operations Research 35.2
(2010), pp. 494–512.
[82] Yu-Guan Hsieh, Kimon Antonakopoulos, and Panayotis Mertikopoulos. “Adaptive Learning in
Continuous Games: Optimal Regret Bounds and Convergence to Nash Equilibrium”. In:
Conference on Learning Theory (2021).
[83] Yu-Guan Hsieh, Franck Iutzeler, Jérôme Malick, and Panayotis Mertikopoulos. “Explore
Aggressively, Update Conservatively: Stochastic Extragradient Methods with Variable Stepsize
Scaling”. In: Advances in Neural Information Processing Systems (2020).
[84] Yu-Guan Hsieh, Franck Iutzeler, Jérôme Malick, and Panayotis Mertikopoulos. “On the
convergence of single-call stochastic extra-gradient methods”. In: Advances in Neural Information
Processing Systems. 2019.
[85] Alfredo N Iusem, Alejandro Jofré, Roberto Imbuzeiro Oliveira, and Philip Thompson.
“Extragradient method with variance reduction for stochastic variational inequalities”. In: SIAM
Journal on Optimization (2017).
[86] Martin Jaggi. “Revisiting Frank-Wolfe: Projection-Free Sparse Convex Optimization”. In:
Proceedings of the 30th International Conference on Machine Learning, ICML 2013. Vol. 28. JMLR
Workshop and Conference Proceedings. JMLR.org, 2013, pp. 427–435.
[87] Mark Jerrum, Alistair Sinclair, and Eric Vigoda. “A polynomial-time approximation algorithm for
the permanent of a matrix with nonnegative entries”. In: Journal of the ACM (JACM) 51.4 (2004),
pp. 671–697.
[88] Adam Kalai and Santosh Vempala. “Efficient Algorithms for Online Decision Problems”. In:
Journal of Computer and System Sciences 71 (2005), pp. 291–307.
[89] Krzysztof C Kiwiel. “Proximal minimization methods with generalized Bregman functions”. In:
SIAM journal on control and optimization 35.4 (1997), pp. 1142–1168.
[90] Daphne Koller, Nimrod Megiddo, and Bernhard von Stengel. “Efficient Computation of Equilibria
for Extensive Two-Person Games”. In: Games and Economic Behavior 14.2 (1996).
[91] Terry Koo, Amir Globerson, Xavier Carreras Pérez, and Michael Collins. “Structured prediction
models via the matrix-tree theorem”. In: Joint Conference on Empirical Methods in Natural
Language Processing and Computational Natural Language Learning (EMNLP-CoNLL). 2007,
pp. 141–150.
[92] Wouter M Koolen, Manfred K Warmuth, and Jyrki Kivinen. “Hedging Structured Concepts”. In:
COLT 2010: Proceedings of the 23rd Annual Conference on Learning Theory. 2010, pp. 93–105.
[93] G. M. Korpelevich. “The extragradient method for finding saddle points and other problems”. In:
1976.
[94] Christian Kroer, Gabriele Farina, and Tuomas Sandholm. “Robust Stackelberg equilibria in
extensive-form games and extension to limited lookahead”. In: Proceedings of the AAAI
Conference on Artificial Intelligence. Vol. 32. 2018.
[95] Christian Kroer, Gabriele Farina, and Tuomas Sandholm. “Smoothing method for approximate
extensive-form perfect equilibrium”. In: Proceedings of the 26th International Joint Conference on
Artificial Intelligence. 2017, pp. 295–301.
[96] Christian Kroer, Gabriele Farina, and Tuomas Sandholm. “Solving Large Sequential Games with
the Excessive Gap Technique”. In: Proceedings of the Annual Conference on Neural Information
Processing Systems (NIPS). 2018.
[97] Christian Kroer and Tuomas Sandholm. “Extensive-form game abstraction with bounds”. In:
Proceedings of the fifteenth ACM conference on Economics and computation. 2014, pp. 621–638.
[98] Christian Kroer, Kevin Waugh, Fatma Kılınç-Karzan, and Tuomas Sandholm. “Faster algorithms
for extensive-form game solving via improved smoothing functions”. In: Mathematical
Programming 179.1 (2020), pp. 385–417.
[99] Christian Kroer, Kevin Waugh, Fatma Kılınç-Karzan, and Tuomas Sandholm. “Faster First-Order
Methods for Extensive-Form Game Solving”. In: Proceedings of the ACM Conference on Economics
and Computation (EC). 2015.
[100] Alexander Y Kruger. “Error bounds and metric subregularity”. In: Optimization (2015).
[101] HW Kuhn. “Simplified two-person poker”. In: Contributions to the Theory of Games (1950),
pp. 97–103.
[102] Marc Lanctot, Kevin Waugh, Martin Zinkevich, and Michael H Bowling. “Monte Carlo Sampling
for Regret Minimization in Extensive Games.” In: NIPS. 2009, pp. 1078–1086.
[103] Torbjörn Larsson and Michael Patriksson. “A class of gap functions for variational inequalities”.
In: Mathematical Programming (1994).
[104] Puya Latafat, Nikolaos M Freris, and Panagiotis Patrinos. “A new randomized block-coordinate
primal-dual proximal algorithm for distributed optimization”. In: IEEE Transactions on Automatic
Control (2019).
[105] Chung-Wei Lee, Christian Kroer, and Haipeng Luo. “Last-iterate convergence in extensive-form
games”. In: Advances in Neural Information Processing Systems 34 (2021).
[106] Qi Lei, Sai Ganesh Nagarajan, Ioannis Panageas, and Xiao Wang. “Last iterate convergence in
no-regret learning: constrained min-max optimization for convex-concave landscapes”. In: The
24nd International Conference on Artificial Intelligence and Statistics (2021).
[107] D Leventhal. “Metric subregularity and the proximal point method”. In: Journal of Mathematical
Analysis and Applications (2009).
[108] Jingwei Liang, Jalal Fadili, and Gabriel Peyré. “Convergence rates with inexact non-expansive
operators”. In: Mathematical Programming (2016).
[109] Tengyuan Liang and James Stokes. “Interaction Matters: A Note on Non-asymptotic Local
Convergence of Generative Adversarial Networks”. In: The 22nd International Conference on
Artificial Intelligence and Statistics. 2019.
[110] Tianyi Lin, Chi Jin, Michael Jordan, et al. “Near-optimal algorithms for minimax optimization”.
In: Conference on Learning Theory (2020).
[111] Viliam Lisý, Marc Lanctot, and Michael H Bowling. “Online Monte Carlo Counterfactual Regret
Minimization for Search in Imperfect Information Games.” In: Autonomous Agents and
Multi-Agent Systems. 2015.
[112] Deyi Liu, Volkan Cevher, and Quoc Tran-Dinh. “A Newton Frank-Wolfe Method for Constrained
Self-Concordant Minimization”. In: CoRR abs/2002.07003 (2020).
[113] Zhi-Quan Luo and Paul Tseng. “Error bounds and convergence analysis of feasible descent
methods: a general approach”. In: Annals of Operations Research (1993).
[114] Yu Malitsky. “Projected reflected gradient methods for monotone variational inequalities”. In:
SIAM Journal on Optimization (2015).
[115] Yura Malitsky. “Golden ratio algorithms for variational inequalities”. In: Mathematical
Programming (2019).
[116] Yura Malitsky and Matthew K Tam. “A forward-backward splitting method for monotone
inclusions without cocoercivity”. In: SIAM Journal on Optimization ().
[117] H Brendan McMahan, Geoffrey J Gordon, and Avrim Blum. “Planning in the presence of cost
functions controlled by an adversary”. In: Proceedings of the 20th International Conference on
Machine Learning (ICML-03). 2003, pp. 536–543.
[118] Linjian Meng, Zhenxing Ge, Wenbin Li, Bo An, and Yang Gao. “Efficient Last-iterate Convergence
Algorithms in Solving Games”. In: arXiv preprint arXiv:2308.11256 (2023).
[119] Panayotis Mertikopoulos, Christos Papadimitriou, and Georgios Piliouras. “Cycles in adversarial
regularized learning”. In: Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on
Discrete Algorithms. 2018.
[120] Panayotis Mertikopoulos, Houssam Zenati, Bruno Lecouat, Chuan-Sheng Foo,
Vijay Chandrasekhar, and Georgios Piliouras. “Optimistic Mirror Descent in Saddle-Point
Problems: Going the Extra (Gradient) Mile”. In: International Conference on Learning
Representations. 2019.
[121] Panayotis Mertikopoulos and Zhengyuan Zhou. “Learning in games with continuous action sets
and unknown payoff functions”. In: Math. Program. 173.1-2 (2019), pp. 465–507.
[122] Luke Metz, Ben Poole, David Pfau, and Jascha Sohl-Dickstein. “Unrolled Generative Adversarial
Networks”. In: International Conference on Learning Representations. 2016.
[123] Aryan Mokhtari, Asuman Ozdaglar, and Sarath Pattathil. “A Unified Analysis of Extra-gradient
and Optimistic Gradient Methods for Saddle Point Problems: Proximal Point Approach”. In: The
22nd International Conference on Artificial Intelligence and Statistics (2020).
[124] Aryan Mokhtari, Asuman Ozdaglar, and Sarath Pattathil. “Convergence Rate of O(1/k) for
Optimistic Gradient and Extra-gradient Methods in Smooth Convex-Concave Saddle Point
Problems”. In: SIAM Journal on Optimization 30.4 (2020), pp. 3230–3251.
[125] Matej Moravcik, Martin Schmid, Karel Ha, Milan Hladik, and Stephen Gaukrodger. “Refining
subgames in large imperfect information games”. In: Proceedings of the AAAI Conference on
Artificial Intelligence. Vol. 30. 2016.
[126] Matej Moravčík, Martin Schmid, Neil Burch, Viliam Lisý, Dustin Morrill, Nolan Bard,
Trevor Davis, Kevin Waugh, Michael Johanson, and Michael Bowling. “Deepstack: Expert-level
artificial intelligence in heads-up no-limit poker”. In: Science 356.6337 (2017), pp. 508–513.
[127] Arkadi Nemirovski. “Prox-method with rate of convergence O (1/t) for variational inequalities
with Lipschitz continuous monotone operators and smooth convex-concave saddle point
problems”. In: SIAM Journal on Optimization (2004).
[128] Arkadi Nemirovski and David Yudin. Problem complexity and method efficiency in optimization.
1983.
[129] John von Neumann. “Zur theorie der gesellschaftsspiele”. In: Mathematische annalen (1928).
[130] Noam Nisan, Tim Roughgarden, Eva Tardos, and Vijay V Vazirani. Algorithmic Game Theory.
Cambridge University Press, 2007.
[131] Jong-Shi Pang. “Error bounds in mathematical programming”. In: Mathematical Programming
(1997).
[132] Georgios Piliouras, Ryann Sim, and Stratis Skoulakis. “Optimal No-Regret Learning in General
Games: Bounded Regret with Unbounded Step-Sizes via Clairvoyant MWU”. In: arXiv preprint
arXiv:2111.14737 (2021).
[133] Boris T Polyak. “Introduction to optimization. optimization software”. In: Inc., Publications
Division, New York 1 (1987), p. 32.
[134] Leonid Denisovich Popov. “A modification of the Arrow-Hurwicz method for search of saddle
points”. In: Mathematical notes of the Academy of Sciences of the USSR (1980).
[135] Sasha Rakhlin and Karthik Sridharan. “Optimization, learning, and games with predictable
sequences”. In: Advances in Neural Information Processing Systems. 2013, pp. 3066–3074.
[136] I. Romanovskii. “Reduction of a Game with Complete Memory to a Matrix Game”. In: Soviet
Mathematics 3 (1962).
[137] J. B. Rosen. “Existence and Uniqueness of Equilibrium Points for Concave N-Person Games”. In:
Econometrica 33.3 (1965), pp. 520–534. (Visited on 05/11/2022).
[138] Sheldon M Ross. “Goofspiel—the game of pure strategy”. In: Journal of Applied Probability 8.3
(1971), pp. 621–625.
[139] Tim Roughgarden. “Intrinsic Robustness of the Price of Anarchy”. In: J. ACM 62.5 (2015),
32:1–32:42.
[140] Tim Roughgarden and Florian Schoppmann. “Local smoothness and the price of anarchy in
splittable congestion games”. In: J. Econ. Theory 156 (2015), pp. 317–342.
[141] Robert E Schapire. “The boosting approach to machine learning: An overview”. In: Nonlinear
estimation and classification (2003), pp. 149–171.
[142] Shai Shalev-Shwartz. “Online Learning and Online Convex Optimization”. In: Foundations and
Trends in Machine Learning 4.2 (2012). issn: 1935-8237.
[143] Yoav Shoham and Kevin Leyton-Brown. Multiagent systems: Algorithmic, game-theoretic, and
logical foundations. Cambridge University Press, 2008.
[144] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre,
George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam,
Marc Lanctot, et al. “Mastering the game of Go with deep neural networks and tree search”. In:
nature 529.7587 (2016), pp. 484–489.
[145] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang,
Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. “Mastering the
game of go without human knowledge”. In: nature 550.7676 (2017), pp. 354–359.
[146] Michael V Solodov and Paul Tseng. “Some methods based on the D-gap function for solving
monotone variational inequalities”. In: Computational optimization and applications (2000).
[147] Finnegan Southey, Michael Bowling, Bryce Larson, Carmelo Piccione, Neil Burch, Darse Billings,
and Chris Rayner. “Bayes’ bluff: opponent modelling in poker”. In: Proceedings of the Twenty-First
Conference on Uncertainty in Artificial Intelligence. 2005, pp. 550–558.
[148] Noah D. Stein, Pablo A. Parrilo, and Asuman E. Ozdaglar. “Correlated equilibria in continuous
games: Characterization and computation”. In: Games Econ. Behav. 71.2 (2011), pp. 436–455.
[149] Gilles Stoltz and Gábor Lugosi. “Learning correlated equilibria in games with compact sets of
strategies”. In: Games Econ. Behav. 59.1 (2007), pp. 187–208.
[150] Vasilis Syrgkanis, Alekh Agarwal, Haipeng Luo, and Robert E Schapire. “Fast convergence of
regularized learning in games”. In: Advances in Neural Information Processing Systems. 2015.
[151] Eiji Takimoto and Manfred K Warmuth. “Path kernels and multiplicative updates”. In: The Journal
of Machine Learning Research 4 (2003), pp. 773–818.
[152] Oskari Tammelin. “Solving large imperfect information games using CFR+”. In: arXiv preprint
arXiv:1407.5042 (2014).
[153] Oskari Tammelin, Neil Burch, Michael Johanson, and Michael Bowling. “Solving heads-up limit
texas hold’em”. In: Twenty-fourth international joint conference on artificial intelligence. 2015.
[154] Quoc Tran-Dinh, Anastasios Kyrillidis, and Volkan Cevher. “Composite self-concordant
minimization”. In: J. Mach. Learn. Res. 16 (2015), pp. 371–416.
[155] Paul Tseng. “On linear convergence of iterative methods for the variational inequality problem”.
In: Journal of Computational and Applied Mathematics. 1995.
[156] Robert J Vanderbei et al. Linear programming. Vol. 3. Springer, 2015.
[157] Oriol Vinyals, Igor Babuschkin, Wojciech M Czarnecki, Michaël Mathieu, Andrew Dudzik,
Junyoung Chung, David H Choi, Richard Powell, Timo Ewalds, Petko Georgiev, et al.
“Grandmaster level in StarCraft II using multi-agent reinforcement learning”. In: Nature 575.7782
(2019), pp. 350–354.
[158] Bernhard von Stengel. “Efficient computation of behavior strategies”. In: Games and Economic
Behavior 14.2 (1996), pp. 220–246.
[159] Weiran Wang and Miguel A Carreira-Perpinán. “Projection onto the probability simplex: An
efficient algorithm with a simple proof, and an application”. In: arXiv preprint arXiv:1309.1541
(2013).
[160] Manfred K Warmuth and Dima Kuzmin. “Randomized online PCA algorithms with regret bounds
that are logarithmic in the dimension”. In: Journal of Machine Learning Research 9.Oct (2008),
pp. 2287–2320.
[161] Chen-Yu Wei, Chung-Wei Lee, Mengxiao Zhang, and Haipeng Luo. “Linear Last-iterate
Convergence in Constrained Saddle-point Optimization”. In: International Conference on Learning
Representations. 2021.
[162] Chen-Yu Wei and Haipeng Luo. “More Adaptive Algorithms for Adversarial Bandits”. In:
Conference On Learning Theory, COLT 2018. Vol. 75. Proceedings of Machine Learning Research.
PMLR, 2018, pp. 1263–1291.
[163] Junyu Zhang, Mingyi Hong, and Shuzhong Zhang. “On Lower Iteration Complexity Bounds for
the Saddle Point Problems”. In: arXiv preprint arXiv:1912.07481 (2019).
[164] Alexander Zimin and Gergely Neu. “Online learning in episodic Markovian decision processes by
relative entropy policy search”. In: Neural Information Processing Systems 26. 2013.
[165] Martin Zinkevich. “Online convex programming and generalized infinitesimal gradient ascent”.
In: Proceedings of the 20th international conference on machine learning (icml-03). 2003,
pp. 928–936.
[166] Martin Zinkevich, Michael Johanson, Michael Bowling, and Carmelo Piccione. “Regret
minimization in games with incomplete information”. In: Advances in neural information
processing systems 20 (2007), pp. 1729–1736.
Appendix A
Omitted Details in Chapter 2
A.1 More Experiment Results
A.1.1 More Empirical Results for Matrix Games
Here, we provide more plots for the same matrix game experiment described in Section 2.5. Specifically, the left plot in Figure A.1 shows the convergence with respect to ln ∥z^t − z^∗∥, while the right plot shows the convergence with respect to the logarithm of the duality gap ln(α_f(z^t)) = ln(max_j (G^⊤x^t)_j − min_i (Gy^t)_i). One can see that the plots are very similar to those in Figure 2.1.
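For completeness, here is a minimal Python sketch of how such an experiment can be reproduced, assuming the standard OMWU update on the simplices used in Chapter 2 (a two-step multiplicative-weights update with the previous round's loss as the prediction); the function and variable names are illustrative.

import numpy as np

def omwu_matrix_game(G, eta, num_iters):
    # OMWU for the matrix game min_x max_y x^T G y on Delta_M x Delta_N.
    # xh, yh are the secondary ("hat") iterates; x, y are the played iterates.
    M, N = G.shape
    xh, yh = np.ones(M) / M, np.ones(N) / N
    x, y = xh.copy(), yh.copy()
    trajectory = []
    for _ in range(num_iters):
        # play using the previous round's loss as the prediction
        x_new = xh * np.exp(-eta * (G @ y));   x_new /= x_new.sum()
        y_new = yh * np.exp(eta * (G.T @ x));  y_new /= y_new.sum()
        # update the hat iterates with the realized loss of this round
        xh = xh * np.exp(-eta * (G @ y_new));  xh /= xh.sum()
        yh = yh * np.exp(eta * (G.T @ x_new)); yh /= yh.sum()
        x, y = x_new, y_new
        trajectory.append((x, y))
    return trajectory

# Example: a random 32 x 32 game as described in Figure A.1.
G = np.random.uniform(-1, 1, size=(32, 32))
G /= np.linalg.norm(G, 2)                     # rescale the operator norm to 1
iterates = omwu_matrix_game(G, eta=0.5, num_iters=1000)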
A.1.2 Matrix Game on Curved Regions
Next, we conduct experiments on a bilinear game similar to the one constructed in the proof of Theorem 9.
Specifically, the bilinear game is defined by
f(x, y) = x2y1 − x1y2, X = Y ≜ {(a, b), 0 ≤ a ≤
1
2
, 0 ≤ b ≤
1
2n , an ≤ b}.
For any positive integer n, the equilibrium point of this game is (0, 0) for both x and y. Note that in
Theorem 9, we prove that OGDA only converges at a rate no better than Ω(1/t2
) in this game when
n = 2.
Figure A.1: Experiments of OGDA and OMWU with different learning rates on a matrix game f(x, y) = x^⊤Gy, where we generate G ∈ R^{32×32} with each entry G_{ij} drawn uniformly at random from [−1, 1] and then rescale G's operator norm to 1. "OGDA/OMWU-eta=η" represents the curve of OGDA/OMWU with learning rate η. The configuration order in the legend is consistent with the order of the curves. For OMWU, η ≥ 11 makes the algorithm diverge. The plot confirms the linear convergence of OMWU and OGDA, although OGDA is generally observed to converge faster than OMWU.
Figure A.2 shows the empirical results for various values of n. In this figure, we plot ∥z^t − z^∗∥ versus time step t in log-log scale. Note that in a log-log plot, a straight line with slope s implies a convergence rate of order O(t^s), that is, a sublinear convergence rate. It is clear from Figure A.2 that OGDA indeed converges sublinearly for all n, supporting our Theorem 9.
A.1.3 Strongly-Convex-Strongly-Concave Games
In this section, we use the same experiment setup for strongly-convex-strongly-concave games as [106], where

f(x, y) = x_1² − y_1² + 2x_1y_1,  and  X = Y ≜ {(a, b) : 0 ≤ a, b ≤ 1, a + b = 1}.
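As an illustration of how OGDA is run on such a constrained instance, the following is a minimal Python sketch. It assumes the optimistic-OMD form of OGDA from Chapter 2 (a projected step using the previous gradient as prediction, followed by a secondary projected step), with the standard sort-based Euclidean projection onto the simplex [159]; the function names are illustrative.

import numpy as np

def proj_simplex(v):
    # Euclidean projection onto the probability simplex.
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > (css - 1.0))[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1)
    return np.maximum(v - theta, 0.0)

def ogda(F, proj, z0, eta, num_iters):
    # z^t = proj(zh^t - eta * F(z^{t-1})),  zh^{t+1} = proj(zh^t - eta * F(z^t)).
    zh, z_prev = z0.copy(), z0.copy()
    trajectory = []
    for _ in range(num_iters):
        z = proj(zh - eta * F(z_prev))
        zh = proj(zh - eta * F(z))
        z_prev = z
        trajectory.append(z)
    return trajectory

# The game of this subsection: f(x, y) = x1^2 - y1^2 + 2*x1*y1 on Delta_2 x Delta_2,
# with z = (x, y) stacked as a length-4 vector and F(z) = (grad_x f, -grad_y f).
def F_game(z):
    x1, y1 = z[0], z[2]
    return np.array([2 * x1 + 2 * y1, 0.0, -(-2 * y1 + 2 * x1), 0.0])

proj_game = lambda z: np.concatenate([proj_simplex(z[:2]), proj_simplex(z[2:])])
traj = ogda(F_game, proj_game, np.array([0.5, 0.5, 0.5, 0.5]), eta=0.125, num_iters=200)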
The equilibrium point is (0, 1) for both x and y. In Figure A.3, we present the log plot of ∥z^t − z^∗∥ versus time step t and compare OGDA with OMWU using different learning rates as in Appendix A.1.1. The straight line of OGDA implies that the OGDA algorithm converges exponentially fast, supporting Theorem 6 and Theorem 8. Also note that here, OGDA outperforms OMWU, which is different from the empirical results shown in [106]. We hypothesize that this is because they use a different version of OGDA.

Figure A.2: Experiments of OGDA on matrix games with curved regions where f(x, y) = x_2y_1 − x_1y_2, X = Y ≜ {(a, b) : 0 ≤ a ≤ 1/2, 0 ≤ b ≤ 1/2^n, a^n ≤ b}, and n = 2, 4, 6, 8. This figure is a log-log plot of ∥z^t − z^∗∥ versus t, and it indicates sublinear convergence rates of OGDA in all these games.

Figure A.3: Experiments on a strongly-convex-strongly-concave game where f(x, y) = x_1² − y_1² + 2x_1y_1 and X = Y ≜ {(a, b) : 0 ≤ a, b ≤ 1, a + b = 1}. The figure shows ln ∥z^t − z^∗∥ versus the time step t. The result shows that OGDA enjoys linear convergence and outperforms OMWU in this case.

Figure A.4: Experiments of OGDA on a set of games satisfying SP-MS with β > 0, where f(x, y) = x_1^{2n} − x_1y_1 − y_1^{2n} for some integer n ≥ 2 and X = Y ≜ {(a, b) : 0 ≤ a, b ≤ 1, a + b = 1}. The result shows that OGDA converges to the Nash equilibrium with sublinear rates in these instances.
A.1.4 An Example with β > 0 for SP-MS
We also consider the toy example in Theorem 7, where f(x, y) = x_1^{2n} − x_1y_1 − y_1^{2n} for some integer n ≥ 2 and X = Y ≜ {(a, b) : 0 ≤ a, b ≤ 1, a + b = 1}. The equilibrium point is (0, 1) for both x and y. We prove in Theorem 7 that SP-MS does not hold for β = 0 but does hold for β = 2n − 2.

The point-wise convergence result is shown in Figure A.4, which is again a log-log plot of ∥z^t − z^∗∥ versus time step t. One can observe that the convergence rate of OGDA is sublinear, supporting our theory again.
A.1.5 Matrix Games with Multiple Nash Equilibria
Finally, we provide empirical results for OGDA and OMWU in matrix games with multiple Nash equilibria, even though theoretically we only prove linear convergence results for OMWU assuming that the Nash equilibrium is unique. We consider the following game matrix

G =
[  0  −1   1   0   0 ]
[  1   0  −1   0   0 ]
[ −1   1   0   0   0 ]
[ −1   1   0   2  −1 ]
[ −1   1   0  −1   2 ].

The value of G is 0. To verify this, consider x^0 = y^0 = (1/3, 1/3, 1/3, 0, 0). Then we have max_{y∈∆_5} x^{0⊤}Gy = min_{x∈∆_5} x^⊤Gy^0 = 0. Direct calculation gives the following set of Nash equilibria:

X^∗ = {x^0},  Y^∗ = {y ∈ ∆_5 : y_1 = y_2 = y_3; (1/2)y_5 ≤ y_4 ≤ 2y_5}.
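The value claim above is easy to verify numerically; a small sketch (using NumPy) is:

import numpy as np

G = np.array([
    [ 0, -1,  1,  0,  0],
    [ 1,  0, -1,  0,  0],
    [-1,  1,  0,  0,  0],
    [-1,  1,  0,  2, -1],
    [-1,  1,  0, -1,  2],
], dtype=float)

x0 = np.array([1/3, 1/3, 1/3, 0, 0])
y0 = x0.copy()

# max_y x0^T G y = max_j (G^T x0)_j and min_x x^T G y0 = min_i (G y0)_i;
# both equal 0, so x0 and y0 guarantee the value 0 to each player.
print(np.max(G.T @ x0), np.min(G @ y0))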
Figure A.5 shows the point-wise convergence result, where Π_{Z^∗}(z^t) is the projection of z^t onto the set of Nash equilibria. One can observe from the plots that both OGDA and OMWU achieve a linear convergence rate in this example. We thus conjecture that the uniqueness assumption for Theorem 3 can be further relaxed.
A.2 Lemmas for Optimistic Mirror Descent
We prove Lemma 1 in this section. To do so, we use the following two lemmas.
Figure A.5: Experiments of OGDA and OMWU with different learning rates on a matrix game with multiple Nash equilibria. "OGDA/OMWU-eta=η" represents the curve of OGDA/OMWU with learning rate η. We observe from these plots that both OGDA and OMWU enjoy a linear convergence rate, even though we are only able to show the linear convergence of OMWU under the uniqueness assumption.
Lemma 72. Let A be a convex set and u′ = argmin_{u′∈A} {⟨u′, g⟩ + D_ψ(u′, u)}. Then for any u^∗ ∈ A,

⟨u′ − u^∗, g⟩ ≤ D_ψ(u^∗, u) − D_ψ(u^∗, u′) − D_ψ(u′, u). (A.1)

Proof. Since D_ψ(u′, u) = ψ(u′) − ψ(u) − ⟨∇ψ(u), u′ − u⟩, by the first-order optimality condition of u′, we have

(g + ∇ψ(u′) − ∇ψ(u))^⊤(u^∗ − u′) ≥ 0.

On the other hand, notice that the right-hand side of Equation (A.1) is

(ψ(u^∗) − ψ(u) − ⟨∇ψ(u), u^∗ − u⟩) − (ψ(u^∗) − ψ(u′) − ⟨∇ψ(u′), u^∗ − u′⟩) − (ψ(u′) − ψ(u) − ⟨∇ψ(u), u′ − u⟩) = ⟨∇ψ(u′) − ∇ψ(u), u^∗ − u′⟩.

Therefore, Equation (A.1) is equivalent to ⟨g + ∇ψ(u′) − ∇ψ(u), u^∗ − u′⟩ ≥ 0, which we have already shown above.
Lemma 73. Suppose that ψ satisfies D_ψ(x, x′) ≥ (1/2)∥x − x′∥_p² for some p ≥ 1, and let u, u_1, u_2 ∈ A (a convex set) be related by the following:

u_1 = argmin_{u′∈A} {⟨u′, g_1⟩ + D_ψ(u′, u)},
u_2 = argmin_{u′∈A} {⟨u′, g_2⟩ + D_ψ(u′, u)}.

Then we have ∥u_1 − u_2∥_p ≤ ∥g_1 − g_2∥_q, where q ≥ 1 and 1/p + 1/q = 1.

Proof. By the first-order optimality conditions of u_1 and u_2, we have

⟨∇ψ(u_1) − ∇ψ(u) + g_1, u_2 − u_1⟩ ≥ 0,
⟨∇ψ(u_2) − ∇ψ(u) + g_2, u_1 − u_2⟩ ≥ 0.

Summing them up and rearranging the terms, we get

⟨u_2 − u_1, g_1 − g_2⟩ ≥ ⟨∇ψ(u_1) − ∇ψ(u_2), u_1 − u_2⟩. (A.2)

By the condition on ψ, we have ⟨∇ψ(u_1), u_1 − u_2⟩ ≥ ψ(u_1) − ψ(u_2) + (1/2)∥u_1 − u_2∥_p² and ⟨∇ψ(u_2), u_2 − u_1⟩ ≥ ψ(u_2) − ψ(u_1) + (1/2)∥u_1 − u_2∥_p². Summing them up we get ⟨∇ψ(u_1) − ∇ψ(u_2), u_1 − u_2⟩ ≥ ∥u_1 − u_2∥_p². Combining this with Equation (A.2) we get

⟨u_2 − u_1, g_1 − g_2⟩ ≥ ∥u_1 − u_2∥_p².

Since ⟨u_2 − u_1, g_1 − g_2⟩ ≤ ∥u_1 − u_2∥_p ∥g_1 − g_2∥_q by Hölder's inequality, we further get ∥u_1 − u_2∥_p ≤ ∥g_1 − g_2∥_q.
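As a quick numerical illustration of Lemma 73 in the Euclidean case p = q = 2 (where ψ = (1/2)∥·∥_2² and the update reduces to a projection), the following sketch, which takes the box [0, 1]^d as the convex set A, is one way to check the inequality:

import numpy as np

rng = np.random.default_rng(0)
d = 5
u = rng.uniform(0, 1, d)
g1, g2 = rng.normal(size=d), rng.normal(size=d)

proj = lambda v: np.clip(v, 0.0, 1.0)   # Euclidean projection onto A = [0, 1]^d
u1, u2 = proj(u - g1), proj(u - g2)
print(np.linalg.norm(u1 - u2) <= np.linalg.norm(g1 - g2))   # True, as Lemma 73 predicts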
Proof of Lemma 1. Considering Equation (2.2), and using Lemma 72 with u = ẑ^t, u′ = ẑ^{t+1}, u^∗ = z, and g = ηF(z^t), we get

ηF(z^t)^⊤(ẑ^{t+1} − z) ≤ D_ψ(z, ẑ^t) − D_ψ(z, ẑ^{t+1}) − D_ψ(ẑ^{t+1}, ẑ^t).

Considering Equation (2.1), and using Lemma 72 with u = ẑ^t, u′ = z^t, u^∗ = ẑ^{t+1}, and g = ηF(z^{t−1}), we get

ηF(z^{t−1})^⊤(z^t − ẑ^{t+1}) ≤ D_ψ(ẑ^{t+1}, ẑ^t) − D_ψ(ẑ^{t+1}, z^t) − D_ψ(z^t, ẑ^t).

Summing up the two inequalities above, and adding η(F(z^t) − F(z^{t−1}))^⊤(z^t − ẑ^{t+1}) to both sides, we get

ηF(z^t)^⊤(z^t − z) ≤ D_ψ(z, ẑ^t) − D_ψ(z, ẑ^{t+1}) − D_ψ(ẑ^{t+1}, z^t) − D_ψ(z^t, ẑ^t) + η(F(z^t) − F(z^{t−1}))^⊤(z^t − ẑ^{t+1}). (A.3)
Using Lemma 73 with u = x̂^t, u_1 = x^t, u_2 = x̂^{t+1}, g_1 = η∇_xf(z^{t−1}) and g_2 = η∇_xf(z^t), we get ∥x^t − x̂^{t+1}∥_p ≤ η∥∇_xf(z^{t−1}) − ∇_xf(z^t)∥_q. Similarly, we have ∥y^t − ŷ^{t+1}∥_p ≤ η∥∇_yf(z^t) − ∇_yf(z^{t−1})∥_q. Therefore, by Hölder's inequality, we have

η(F(z^t) − F(z^{t−1}))^⊤(z^t − ẑ^{t+1})
≤ η∥x^t − x̂^{t+1}∥_p ∥∇_xf(z^{t−1}) − ∇_xf(z^t)∥_q + η∥y^t − ŷ^{t+1}∥_p ∥∇_yf(z^{t−1}) − ∇_yf(z^t)∥_q
≤ η²∥∇_xf(z^{t−1}) − ∇_xf(z^t)∥_q² + η²∥∇_yf(z^{t−1}) − ∇_yf(z^t)∥_q²
= η² dist_q²(F(z^t), F(z^{t−1}))
≤ η²L² dist_p²(z^t, z^{t−1})   (by assumption)
≤ (1/64) dist_p²(z^t, z^{t−1}).   (by our choice of η)
Continuing from Equation (A.3), we then have

ηF(z^t)^⊤(z^t − z)
≤ D_ψ(z, ẑ^t) − D_ψ(z, ẑ^{t+1}) − D_ψ(ẑ^{t+1}, z^t) − D_ψ(z^t, ẑ^t) + (1/64) dist_p²(z^t, z^{t−1})
≤ D_ψ(z, ẑ^t) − D_ψ(z, ẑ^{t+1}) − D_ψ(ẑ^{t+1}, z^t) − D_ψ(z^t, ẑ^t) + (1/32) dist_p²(z^t, ẑ^t) + (1/32) dist_p²(ẑ^t, z^{t−1})   (∥u + v∥_p² ≤ (∥u∥_p + ∥v∥_p)² ≤ 2∥u∥_p² + 2∥v∥_p²)
≤ D_ψ(z, ẑ^t) − D_ψ(z, ẑ^{t+1}) − D_ψ(ẑ^{t+1}, z^t) − D_ψ(z^t, ẑ^t) + (1/16) D_ψ(z^t, ẑ^t) + (1/16) D_ψ(ẑ^t, z^{t−1})   (by the assumption on ψ)
= D_ψ(z, ẑ^t) − D_ψ(z, ẑ^{t+1}) − D_ψ(ẑ^{t+1}, z^t) − (15/16) D_ψ(z^t, ẑ^t) + (1/16) D_ψ(ẑ^t, z^{t−1}).

This concludes the proof.
A.3 An Auxiliary Lemma on Recursive Formulas
Here, we provide an auxiliary lemma that gives an explicit bound based on a particular recursive formula.
This will be useful later for deriving the convergence rate.
Lemma 74. Consider a non-negative sequence {B_t}_{t=1,2,···} that satisfies, for some p > 0 and q > 0,

• B_{t+1} ≤ B_t − qB_{t+1}^{p+1}, ∀t ≥ 1;
• q(1 + p)B_1^p ≤ 1.

Then B_t ≤ ct^{−1/p}, where c = max{B_1, (2/(qp))^{1/p}}.

Proof. We first prove that B_{t+1} ≤ B_t − (q/2)B_t^{p+1}. Notice that since the B_t are all non-negative, by the first condition, we have B_{t+1} ≤ B_t ≤ · · · ≤ B_1. Using the fundamental theorem of calculus, we have

B_t^{p+1} − B_{t+1}^{p+1} = ∫_{B_{t+1}}^{B_t} (d/dx) x^{p+1} dx = (p + 1) ∫_{B_{t+1}}^{B_t} x^p dx ≤ (p + 1)(B_t − B_{t+1})B_t^p,

and thus

B_{t+1} ≤ B_t − qB_{t+1}^{p+1} ≤ B_t − qB_t^{p+1} + q(p + 1)(B_t − B_{t+1})B_t^p.

By rearranging, we get

B_{t+1} ≤ (1 − qB_t^p/(1 + q(1 + p)B_t^p)) B_t ≤ (1 − qB_t^p/2) B_t = B_t − (q/2)B_t^{p+1},

where the last inequality is because q(1 + p)B_t^p ≤ q(1 + p)B_1^p ≤ 1.

Below we use induction to prove B_t ≤ ct^{−1/p}, where c = max{B_1, (2/(qp))^{1/p}}. This clearly holds for t = 1. Suppose that it holds for 1, . . . , t. Note that the function f(B_t) = (1 − (q/2)B_t^p)B_t is increasing in B_t, as f′(B_t) = 1 − (q(p+1)/2)B_t^p ≥ 1 − (q(p+1)/2)B_1^p ≥ 0. Therefore, we apply the induction hypothesis and get

B_{t+1} ≤ (1 − (q/2)B_t^p)B_t ≤ (1 − (q/2)c^p t^{−1}) ct^{−1/p}
= ct^{−1/p} − (q/2)c^{p+1}t^{−1−1/p}
≤ ct^{−1/p} − (c/p)t^{−1−1/p}   (c/p ≤ (q/2)c^{p+1} by the definition of c)
≤ c(t + 1)^{−1/p},

where the last inequality is by the fundamental theorem of calculus:

t^{−1/p} − (1 + t)^{−1/p} = ∫_{1+t}^{t} (d/dx) x^{−1/p} dx = ∫_{1+t}^{t} (−1/p) x^{−1−1/p} dx = ∫_{t}^{t+1} (1/p) x^{−1−1/p} dx ≤ (1/p) t^{−1−1/p}.

This completes the induction.
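As a quick numerical sanity check of Lemma 74, the sketch below iterates the equality case of the recursion, B_{t+1} + qB_{t+1}^{p+1} = B_t (solved by bisection), and compares the sequence with the bound ct^{−1/p}; the parameter values are arbitrary choices satisfying the lemma's conditions.

import numpy as np

def check_lemma_74(B1, p, q, T):
    assert q * (1 + p) * B1**p <= 1                 # second condition of the lemma
    B = [B1]
    for _ in range(T - 1):
        lo, hi = 0.0, B[-1]
        for _ in range(100):                        # bisection on x + q*x^(p+1) = B_t
            mid = 0.5 * (lo + hi)
            if mid + q * mid**(p + 1) > B[-1]:
                hi = mid
            else:
                lo = mid
        B.append(lo)                                # B_{t+1}, slightly below the root
    c = max(B1, (2.0 / (q * p)) ** (1.0 / p))
    t = np.arange(1, T + 1)
    return bool(np.all(np.array(B) <= c * t ** (-1.0 / p)))

print(check_lemma_74(B1=0.5, p=1.0, q=0.5, T=10000))   # True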
A.4 Proofs of Lemma 2 and Theorem 3
In this section, we consider f(x, y) = x^⊤Gy with X = ∆_M and Y = ∆_N being simplices and G ∈ [−1, 1]^{M×N}. We assume that G has a unique Nash equilibrium z^∗ = (x^∗, y^∗). The value of the game is denoted as ρ = min_{x∈X} max_{y∈Y} x^⊤Gy = max_{y∈Y} min_{x∈X} x^⊤Gy = x^{∗⊤}Gy^∗.

Before proving Lemma 2 and Theorem 3, in Appendix A.4.1, we define some constants for later analysis; in Appendix A.4.2, we state more auxiliary lemmas, which are useful when proving Lemma 2 and Theorem 3 in Appendix A.4.3.
A.4.1 Some Problem-Dependent Constants
First, we define a constant ξ that is determined by G.
Definition 3.

ξ ≜ min{ min_{i∉supp(x^∗)} (Gy^∗)_i − ρ,  ρ − max_{i∉supp(y^∗)} (G^⊤x^∗)_i } ∈ (0, 1].

The fact ξ ≤ 1 can be shown by

ξ ≤ ((min_{i∉supp(x^∗)} (Gy^∗)_i − ρ) + (ρ − max_{i∉supp(y^∗)} (G^⊤x^∗)_i)) / 2 ≤ (∥Gy^∗∥_∞ + ∥G^⊤x^∗∥_∞) / 2 ≤ 1,

while the fact ξ > 0 is a direct consequence of Lemma C.3 of Mertikopoulos, Papadimitriou, and Piliouras [119], stated below.
Lemma 75 (Lemma C.3 of Mertikopoulos, Papadimitriou, and Piliouras [119]). Let G ∈ R^{M×N} be a game matrix for a two-player zero-sum game with value ρ. Then there exists a Nash equilibrium (x^∗, y^∗) such that

(Gy^∗)_i = ρ  ∀i ∈ supp(x^∗),
(Gy^∗)_i > ρ  ∀i ∉ supp(x^∗),
(G^⊤x^∗)_i = ρ  ∀i ∈ supp(y^∗),
(G^⊤x^∗)_i < ρ  ∀i ∉ supp(y^∗).
Below, we define V^∗(Z) = V^∗(X) × V^∗(Y), where

V^∗(X) ≜ {x : x ∈ ∆_M, supp(x) ⊆ supp(x^∗)}

and

V^∗(Y) ≜ {y : y ∈ ∆_N, supp(y) ⊆ supp(y^∗)}.
Definition 4.

c_x ≜ min_{x∈∆_M\{x^∗}} max_{y∈V^∗(Y)} ((x − x^∗)^⊤Gy) / ∥x − x^∗∥_1,  c_y ≜ min_{y∈∆_N\{y^∗}} max_{x∈V^∗(X)} (x^⊤G(y^∗ − y)) / ∥y^∗ − y∥_1.

Note that in the definition of c_x and c_y, the outer minimization is over an open set, which may make the definition problematic as the optimal value may not be attained. However, the following lemma shows that c_x and c_y are well-defined.

Lemma 76. c_x and c_y are well-defined, and 0 < c_x, c_y ≤ 1.
Proof. We first show c_x and c_y are well-defined. To simplify the notation, we define x^∗_min ≜ min_{i∈supp(x^∗)} x^∗_i and X′ ≜ {x : x ∈ ∆_M, ∥x − x^∗∥_1 ≥ x^∗_min}, and define y^∗_min and Y′ similarly. We will show that

c_x = min_{x∈X′} max_{y∈V^∗(Y)} ((x − x^∗)^⊤Gy) / ∥x − x^∗∥_1,  c_y = min_{y∈Y′} max_{x∈V^∗(X)} (x^⊤G(y^∗ − y)) / ∥y^∗ − y∥_1,

which are well-defined as the outer minimization is now over a closed set. Consider c_x; it suffices to show that for any x ∈ ∆_M such that x ≠ x^∗ and ∥x − x^∗∥_1 < x^∗_min, there exists x′ ∈ ∆_M such that ∥x′ − x^∗∥_1 = x^∗_min and

((x − x^∗)^⊤Gy) / ∥x − x^∗∥_1 = ((x′ − x^∗)^⊤Gy) / ∥x′ − x^∗∥_1, ∀y. (A.4)

In fact, we can simply choose x′ = x^∗ + (x − x^∗) · x^∗_min/∥x − x^∗∥_1. We first argue that x′ is still in ∆_M. For each j ∈ [[K]], if x_j − x^∗_j ≥ 0, we surely have x′_j ≥ x^∗_j + 0 ≥ 0; otherwise, x^∗_j > x_j ≥ 0 and thus j ∈ supp(x^∗) and x^∗_j ≥ x^∗_min, which implies x′_j ≥ x^∗_j − |x_j − x^∗_j| · x^∗_min/∥x − x^∗∥_1 ≥ x^∗_j − x^∗_min ≥ 0. In addition, Σ_j x′_j = Σ_j x^∗_j = 1. Combining these facts, we have x′ ∈ ∆_M.

Moreover, according to the definition of x′, ∥x′ − x^∗∥_1 = x^∗_min holds. Also, since x^∗ − x and x^∗ − x′ are parallel vectors, Equation (A.4) is satisfied. The arguments above show that the c_x in Definition 4 is a well-defined real number. The case of c_y is similar.

Now we show 0 < c_x, c_y ≤ 1. The fact that c_x, c_y ≤ 1 is a direct consequence of G being in [−1, 1]^{M×N}. Below, we use contradiction to prove that c_y > 0. First, if c_y < 0, then there exists y ≠ y^∗ such that x^{∗⊤}Gy^∗ < x^{∗⊤}Gy. This contradicts the fact that (x^∗, y^∗) is the equilibrium.

On the other hand, if c_y = 0, then there is some y ≠ y^∗ such that

max_{x∈V^∗(X)} x^⊤G(y^∗ − y) = 0. (A.5)

Consider the point y′ = y^∗ + (ξ/2)(y − y^∗) (recall the definition of ξ in Definition 3 and that 0 < ξ ≤ 1), which lies on the line segment between y^∗ and y. Then, for any x ∈ X,

x^⊤Gy′ = Σ_{i∉supp(x^∗)} x_i(Gy′)_i + Σ_{i∈supp(x^∗)} x_i(Gy′)_i
≥ Σ_{i∉supp(x^∗)} (x_i(Gy^∗)_i − x_i∥y′ − y^∗∥_1) + Σ_{i∈supp(x^∗)} ((ξ/2) · x_i(G(y − y^∗))_i + x_i(Gy^∗)_i)
   (using G_{ij} ∈ [−1, 1] for the first part and y′ = y^∗ + (ξ/2)(y − y^∗) for the second)
≥ Σ_{i∉supp(x^∗)} (x_i(Gy^∗)_i − x_i∥y′ − y^∗∥_1) + Σ_{i∈supp(x^∗)} x_iρ
   (using Equation (A.5) and (Gy^∗)_i = ρ for all i ∈ supp(x^∗))
≥ Σ_{i∉supp(x^∗)} x_i((Gy^∗)_i − ξ) + Σ_{i∈supp(x^∗)} x_iρ
   (using y′ − y^∗ = (ξ/2)(y − y^∗) and ∥y − y^∗∥_1 ≤ 2)
≥ Σ_{i∉supp(x^∗)} x_iρ + Σ_{i∈supp(x^∗)} x_iρ   (by the definition of ξ)
= ρ.

This shows that min_{x∈X} x^⊤Gy′ ≥ ρ, that is, y′ ≠ y^∗ is also a maximin point, contradicting that z^∗ is unique. Therefore, c_y > 0 has to hold, and so does c_x > 0 by the same argument.
Finally, we define the following constant that depends on G:

Definition 5.

ϵ ≜ min_{j∈supp(z^∗)} exp(−ln(MN)/z^∗_j).
A.4.2 Auxiliary Lemmas
All lemmas stated in this section are for the case f(x, y) = x^⊤Gy with Z = ∆_M × ∆_N and a unique Nash equilibrium z^∗ = (x^∗, y^∗).
Lemma 77. For any z ∈ Z, we have

max_{z′∈V^∗(Z)} F(z)^⊤(z − z′) ≥ C∥z^∗ − z∥_1

for C = min{c_x, c_y} ∈ (0, 1].

Proof. Recall that ρ = x^{∗⊤}Gy^∗ is the game value and note that

max_{z′∈V^∗(Z)} F(z)^⊤(z − z′) = max_{z′∈V^∗(Z)} [(x − x′)^⊤Gy + x^⊤G(y′ − y)] = max_{z′∈V^∗(Z)} [−x′^⊤Gy + x^⊤Gy′]
= max_{x′∈V^∗(X)} [ρ − x′^⊤Gy] + max_{y′∈V^∗(Y)} [x^⊤Gy′ − ρ]
= max_{x′∈V^∗(X)} x′^⊤G(y^∗ − y) + max_{y′∈V^∗(Y)} (x − x^∗)^⊤Gy′   (Lemma 75)
≥ c_y∥y^∗ − y∥_1 + c_x∥x^∗ − x∥_1   (by Definition 4)
≥ min{c_x, c_y}∥z^∗ − z∥_1,

which completes the proof.
Lemma 78. For any z ∈ Z, we have

KL(z^∗, z) ≤ Σ_{i∈supp(z^∗)} (z^∗_i − z_i)²/z_i + Σ_{i∉supp(z^∗)} z_i ≤ (1/min_{i∈supp(z^∗)} z_i) ∥z^∗ − z∥_1.

Proof. Using the definition of the Kullback-Leibler divergence, we have

KL(x^∗, x) = Σ_i x^∗_i ln(x^∗_i/x_i) ≤ ln(Σ_i (x^∗_i)²/x_i) = ln(1 + Σ_i (x^∗_i − x_i)²/x_i) ≤ Σ_i (x^∗_i − x_i)²/x_i,

where the first inequality is by the concavity of the ln(·) function, and the second inequality is because ln(1 + u) ≤ u. Considering i ∈ supp(x^∗) and i ∉ supp(x^∗) separately in the last summation, we have

Σ_i (x^∗_i − x_i)²/x_i = Σ_{i∈supp(x^∗)} (x^∗_i − x_i)²/x_i + Σ_{i∉supp(x^∗)} (x_i)²/x_i = Σ_{i∈supp(x^∗)} (x^∗_i − x_i)²/x_i + Σ_{i∉supp(x^∗)} x_i.

The case for KL(y^∗, y) is similar. Combining both cases finishes the proof of the first inequality (recall that KL(z^∗, z) is defined as KL(x^∗, x) + KL(y^∗, y)). The second inequality is straightforward:

Σ_{i∈supp(z^∗)} (z^∗_i − z_i)²/z_i + Σ_{i∉supp(z^∗)} z_i ≤ (1/min_{i∈supp(z^∗)} z_i) (Σ_{i∈supp(z^∗)} |z^∗_i − z_i| + Σ_{i∉supp(z^∗)} |z_i|) = (1/min_{i∈supp(z^∗)} z_i) ∥z^∗ − z∥_1.
Lemma 79. For η ≤
1
8
, OMWU guarantees 3
4
zb
t
i ≤ z
t
i ≤
4
3
zb
t
i
and 3
4
zb
t
i ≤ zb
t+1
i ≤
4
3
zb
t
i
.
Proof. This is shown directly by the update of xb
t
:
xb
t
i
exp (−η)
exp (η)
≤ xb
t+1
i =
xb
t
i
exp (−η · (Gyt
)i)
P
j
xb
t
j
exp (−η · (Gyt)j )
≤
xb
t
i
exp (η)
exp(−η)
.
So by the condition on η, we have 3
4
xb
t
i ≤ exp(−2η) · xb
t
i ≤ xb
t+1
i ≤ exp(2η) · xb
t
i ≤
4
3
xb
t
i
. The cases for x
t
,
yb
t
and y
t
are similar.
148
Lemma 80. For any two probability vectors u, v, if for every entry i,
1
2
ui ≤ vi ≤
3
2
ui
, then 1
3
P
i
(vi−ui)
2
ui
≤
KL(u, v) ≤
P
i
(vi−ui)
2
ui
≤
1
4
.
Proof. Using the definition of the Kullback-Leibler divergence, we have
KL(u, v) = −
X
i
ui
ln vi
ui
≥ −X
i
ui
vi − ui
ui
−
1
3
(vi − ui)
2
u
2
i
=
1
3
X
i
(vi − ui)
2
ui
,
KL(u, v) = −
X
i
ui
ln vi
ui
≤ −X
i
ui
vi − ui
ui
−
(vi − ui)
2
u
2
i
=
X
i
(vi − ui)
2
ui
≤
1
4
,
where the first inequality is because ln(1 + a) ≤ a −
1
3
a
2
for −
1
2 ≤ a ≤
1
2
, and the second inequality is
because ln(1 + a) ≥ a − a
2
for −
1
2 ≤ a ≤
1
2
. The third inequality is by using the condition |ui − vi
| ≤
1
2
ui
.
Lemma 81. For all i ∈ supp(z
∗
) and t, OMWU guarantees zb
t
i ≥ ϵ (ϵ is defined in Definition 5).
Proof. Using Equation (2.3), we have
KL(z
∗
, zb
t
) ≤ Θ
t ≤ · · · ≤ Θ
1 =
1
16KL(zb
1
, z
0
) + KL(z
∗
, zb
1
) = KL(z
∗
, zb
1
), (A.6)
where the last equality is because zb
1 = z
0 = (1M
M ,
1N
N
).
Then, for any i ∈ supp(z
∗
), we have
z
∗
i
ln 1
zb
t
i
≤
X
j
z
∗
j
ln 1
zb
t
j
= KL(z
∗
, zb
t
) −
X
j
z
∗
j
ln z
∗
j ≤ KL(z
∗
, zb
1
) −
X
j
z
∗
j
ln z
∗
j
=
X
j
z
∗
j
ln 1
zb1,j
= ln(MN).
149
Therefore, we conclude for all t and i ∈ supp(z
∗
), zb
t
i
satisfies
zb
t
i ≥ exp
−
ln(MN)
z
∗
i
≥ min
j∈supp(z∗)
exp
−
ln(MN)
z
∗
j
!
= ϵ.
A.4.3 Proofs of Lemma 2 and Theorem 3
Proof of Lemma 2. Below we consider any z
′ ∈ Z such that supp(z
′
) ⊆ supp(z
∗
), that is, z
′ ∈ V∗
(Z).
Considering Equation (2.1), and using the first-order optimality condition of zb
t+1, we have
(∇ψ(zb
t+1) − ∇ψ(zb
t
) + ηF(z
t
))⊤(z
′ − zb
t+1) ≥ 0,
where ψ(z) = P
i
zi
ln zi
. Rearranging the terms and we get
ηF
z
t
⊤
zb
t+1 − z
′
≤
∇ψ(zb
t+1) − ∇ψ(zb
t
)
⊤
z
′ − zb
t+1
=
X
i
z
′
i − zb
t+1
i
ln zb
t+1
i
zb
t
i
. (A.7)
The left hand side of Equation (A.7) is lower bounded as
ηF
z
t
⊤
zb
t+1 − z
′
= ηF
zb
t+1⊤
zb
t+1 − z
′
+ η
F
z
t
− F
zb
t+1⊤
zb
t+1 − z
′
≥ ηF
zb
t+1⊤
zb
t+1 − z
′
− η∥F
z
t
− F
zb
t+1
∥∞∥zb
t+1 − z
′
∥1
≥ ηF
zb
t+1⊤
zb
t+1 − z
′
− 4η
z
t − zb
t+1
1
(∥F
z
t
− F
zb
t+1
∥∞ ≤
z
t − zb
t+1
1
≤ 4)
≥ ηF
zb
t+1⊤
zb
t+1 − z
′
−
1
2
z
t − zb
t+1
on the other hand, the right hand side of Equation (A.7) is upper bounded by
X
i
z
′
i − zb
t+1
i
ln zb
t+1
i
zb
t
i
=
X
i∈supp(z∗)
z
′
i
ln zb
t+1
i
zb
t
i
− KL(zb
t+1
, zb
t
) (supp(z
′
) ⊆ supp(z
∗
))
≤
X
i∈supp(z∗)
ln zb
t+1
i
zb
t
i
=
X
i∈supp(z∗)
max
ln
1 +
zb
t+1
i − zb
t
i
zb
t
i
, ln
1 +
zb
t
i − zb
t+1
i
zb
t+1
i
≤
X
i∈supp(z∗)
ln
1 +
zb
t+1
i − zb
t
i
min{zb
t+1
i
, zb
t
i
}
!
≤
4
3
X
i∈supp(z∗)
zb
t+1
i − zb
t
i
zb
t
i
. (ln(1 + a) ≤ a and Lemma 79)
Combining the bounds on the two sides of Equation (A.7), we get
ηF
zb
t+1⊤
zb
t+1 − z
′
≤
4
3
X
i∈supp(z∗)
zb
t+1
i − zb
t
i
zb
t
i
+
1
2
∥z
t − zb
t+1∥1.
Since z
′
can be chosen as any point in V
∗
(Z), we further lower bound the left-hand side above using
Lemma 77 and get
ηC∥z
∗ − zb
t+1∥1 ≤
4
3
X
i∈supp(z∗)
|zb
t+1
i − zb
t
i
|
zb
t
i
+
1
2
∥z
t − zb
t+1∥1
≤
4
3ϵ
∥zb
t+1 − zb
t
∥1 +
1
2
∥z
t − zb
t+1∥1, (Lemma 81)
≤
4
3ϵ
∥zb
t+1 − zb
t
∥1 + ∥z
t − zb
t+1∥1
(A.8)
where the last inequality uses ϵ ≤ 1. With the help of Equation (A.8), below we prove the desired inequalities.
Case 1. General case.
KL(zb
t+1
, z
t
) + KL(z
t
, zb
t
)
≥
1
2
∥xb
t+1 − x
t
∥
2
1 +
1
2
∥yb
t+1 − y
t
∥
2
1 +
1
2
∥x
t − xb
t
∥
2
1 +
1
2
∥y
t − yb
t
∥
2
1
(Pinsker’s inequality)
≥
1
4
∥zb
t+1 − z
t
∥
2
1 +
1
4
∥z
t − zb
t
∥
2
1
(a
2 + b
2 ≥
1
2
(a + b)
2
)
≥
1
16
∥zb
t+1 − z
t
∥
2
1 +
1
8
∥zb
t+1 − z
t
∥
2
1 + ∥z
t − zb
t
∥
2
1
≥
1
16
∥zb
t+1 − z
t
∥
2
1 +
1
16
∥zb
t+1 − zb
t
∥
2
1
(a
2 + b
2 ≥
1
2
(a + b)
2
and triangle inequality)
≥
1
32
∥zb
t+1 − z
t
∥1 + ∥zb
t+1 − zb
t
∥1
2
(a
2 + b
2 ≥
1
2
(a + b)
2
)
≥
1
32
3ϵηC
4
2
∥z
∗ − zb
t+1∥
2
1
(Equation (A.8))
≥
ϵ
2η
2C
2
64
× ϵ
2KL(z
∗
, zb
t+1)
2 =
ϵ
4η
2C
2
64
KL(z
∗
, zb
t+1)
2
. (Lemma 78 and Lemma 81)
This proves the first part of the lemma with C1 = ϵ
4C
2/64.
Case 2. The case when max{∥z
∗ − zb
t∥1, ∥z
∗ − z
t∥1} ≤ ηξ
10 .
KL(zb
t+1
, z
t
) + KL(z
t
, zb
t
)
≥
1
3
X
i
(zb
t+1
i − z
t
i
)
2
zb
t+1
i
+
(z
t
i − zb
t
i
)
2
z
t
i
(Lemma 79 and Lemma 80)
≥
1
4
X
i /∈supp(z∗)
(zb
t+1
i − z
t
i
)
2
zb
t
i
+
(z
t
i − zb
t
i
)
2
zb
t
i
(Lemma 79)
≥
1
8
X
i /∈supp(z∗)
(zb
t+1
i − zb
t
i
)
2
zb
t
i
. (A.9)
Below we continue to bound P
i /∈supp(z∗)
(zb
t+1
i −zb
t
i
)
2
zb
t
i
.
1
By the assumption, we have ∥y
t − y
∗∥1 ≤
ηξ
10 , which by Lemma 75 and Definition 3 implies
∀i ∈ supp(x
∗
), (Gyt
)i ≤ (Gy∗
)i +
ηξ
10
= ρ +
ηξ
10
≤ ρ +
ξ
10
,
∀i /∈ supp(x
∗
), (Gyt
)i ≥ (Gy∗
)i −
ηξ
10
≥ ρ + ξ −
ηξ
10
≥ ρ +
9ξ
10
.
We also have ∥xb
t − x
∗∥1 ≤
ηξ
10 , so P
j /∈supp(x∗)
xb
t
j ≤
ηξ
10 . Then, for i /∈ supp(x
∗
), we have
xb
t+1
i =
xb
t
i
exp(−η(Gyt
)i)
P
j
xb
t
j
exp(−η(Gyt)j )
≤
xb
t
i
exp(−η(Gyt
)i)
P
j∈supp(x∗)
xb
t
j
exp(−η(Gyt)j )
≤
xb
t
i
exp(−η(ρ +
9ξ
10 ))
P
j∈supp(x∗)
xb
t
j
exp(−η(ρ +
ξ
10 ))
=
xb
t
i
exp
−
8
10 ηξ
1 −
P
j /∈supp(x∗)
xb
t
j
≤
xb
t
i
exp
−
8
10 ηξ
1 −
ηξ
10 ≤ xb
t
i
1 −
1
2
ηξ
,
where the last inequality is because exp(−0.8u)
1−0.1u ≤ 1 − 0.5u for u ∈ [0, 1]. Rearranging gives
|xb
t+1
i − xb
t
i
|
2
xb
t
i
≥
η
2
ξ
2
4
xb
t
i ≥
η
2
ξ
2
8
xb
t+1
i
,
where the last step uses Lemma 79. The case for yb
t
is similar, so we have
|zb
t+1
i − zb
t
i
|
2
zb
t
i
≥
η
2
ξ
2
8
zb
t+1
i
.
Combining this with Equation (A.9), we get
KL(zb
t+1
, z
t
) + KL(z
t
, zb
t
) ≥
η
2
ξ
2
64
X
i /∈supp(z∗)
zb
t+1
i
. (A.10)
1
Now we combine two lower bounds of KL(zb
t+1
, z
t
) + KL(z
t
, zb
t
). Using an intermediate step in Case 1,
and Equation (A.10), we get
KL(zb
t+1
, z
t
) + KL(z
t
, zb
t
) = 1
2
KL(zb
t+1
, z
t
) + KL(z
t
, zb
t
)
+
1
2
KL(zb
t+1
, z
t
) + KL(z
t
, zb
t
)
≥
ϵ
2η
2C
2
128
∥z
∗ − zb
t+1∥
2
1 +
η
2
ξ
2
128
X
i /∈supp(z∗)
zb
t+1
i
=
ϵ
3η
2C
2
ξ
2
128
1
ξ
2ϵ
∥zb
t+1 − z
∗
∥
2
1 +
1
ϵ
3C2
X
i /∈supp(z∗)
zb
t+1
i
≥
ϵ
3η
2C
2
ξ
2
128
1
ϵ
∥zb
t+1 − z
∗
∥
2
1 +
X
i /∈supp(z∗)
zb
t+1
i
(ξ ≤ 1, C ≤ 1, and ϵ ≤ 1)
≥
ϵ
3η
2C
2
ξ
2
128
KL(z
∗
, zb
t+1). (Lemma 78 and Lemma 81)
This proves the second part of the lemma with C2 = ϵ
3C
2
ξ
2/128.
Now we are ready to prove Theorem 3.
Proof of Theorem 3. As argued in Section 2.3, with Θt = KL(z
∗
, zb
t
)+ 1
16KL(zb
t
, z
t−1
) and ζ
t = KL(zb
t+1
, z
t
)+
KL(z
t
, zb
t
), we have (see Equation (2.3))
Θ
t+1 ≤ Θ
t −
15
16 ζ
t
.
1
We the proceed as,
ζ
t ≥
1
2
KL(zb
t+1
, z
t
) + 1
2
ζ
t
≥
1
2
KL(zb
t+1
, z
t
) + η
2C1
2
KL(z
∗
, zb
t+1)
2
(Lemma 2)
≥ 2KL(zb
t+1
, z
t
)
2 +
η
2C1
2
KL(z
∗
, zb
t+1)
2
(by Lemma 79 and Lemma 80)
≥
η
2C1
2
KL(zb
t+1
, z
t
)
2 + KL(z
∗
, zb
t+1)
2
(C1 = ϵ
4C
2/64 ≤ 1/64 as shown in the proof of Lemma 2)
≥
η
2C1
4
KL(zb
t+1
, z
t
) + KL(z
∗
, zb
t+1)
2
≥
η
2C1
4
Θ
t+12
.
Therefore, Θt+1 ≤ Θt −
15η
2C1
64
Θt+12
≤ Θt −
15η
2C1
64+ln MN
Θt+12
. Also, recall zb
1 = z
0 = (1M
M ,
1N
N
)
and thus Θ1 = KL(z
∗
, zb
1
) ≤ ln(MN). Therefore, the conditions of Lemma 74 are satisfied with p = 1
and q =
15η
2C1
64+ln(MN)
, and we conclude that
Θ
t ≤
C
′
t
,
where C
′ = max n
ln(MN),
128+2 ln(MN)
15η
2C1
o
=
128+2 ln(MN)
15η
2C1
.
Next we prove the main result. Set T0 =
12800C′
η
2ξ
2 . For t ≥ T0, we have using Pinsker’s inequality,
∥z
∗ − zb
t
∥
2
1 ≤ 2∥x
∗ − xb
t
∥
2
1 + 2∥y
∗ − yb
t
∥
2
1 ≤ 4KL(z
∗
, zb
t
) ≤
4C
′
T0
≤
η
2
ξ
2
100
,
∥z
∗ − z
t
∥
2
1 ≤ 2∥z
∗ − zb
t+1∥
2
1 + 2∥zb
t+1 − z
t
∥
2
1
≤ 4∥x
∗ − xb
t+1∥
2
1 + 4∥xb
t+1 − x
t
∥
2
1 + 4∥y
∗ − yb
t+1∥
2
1 + 4∥yb
t+1 − y
t
∥
2
1
≤ 8KL(z
∗
, zb
t+1) + 8KL(zb
t+1
, z
t
)
≤ 128Θt+1 ≤
128C
′
T0
≤
η
2
ξ
2
100
.
1
Therefore, when t ≥ T0, the condition of the second part of Lemma 2 is satisfied, and we have
ζ
t ≥
1
2
KL(zb
t+1
, z
t
) + 1
2
ζ
t
≥
1
2
KL(zb
t+1
, z
t
) + η
2C2
2
KL(z
∗
, zb
t+1) (by Lemma 2)
≥
η
2C2
2
Θ
t+1
. (C2 = ϵ
3C
2
ξ
2/128 ≤ 1/128 as shown in the proof of Lemma 2)
Therefore, when t ≥ T0, Θt+1 ≤ Θt −
15η
2C2
32 Θt+1, which further leads to
Θ
t ≤ Θ
T0
·
1 +
15η
2C2
32 T0−t
≤ Θ
1
·
1 +
15η
2C2
32 T0−t
≤ ln(MN)
1 +
15η
2C2
32 T0−t
.
where the second inequality uses Equation (A.6). The inequality trivially holds for t < T0 as well, so it
holds for all t.
We finish the proof by relating KL(z
∗
, z
t
) and Θt+1. Note that by Lemma 78, Lemma 79, and Lemma 81,
we have
KL(z
∗
, z
t
)
2 ≤
∥z
∗ − z
t∥
2
mini∈supp(z
∗)
(z
t
i
)
2
≤
16∥z
∗ − z
t∥
2
9ϵ
2
≤ 4
∥z
∗ − zb
t+1∥
2 + ∥zb
t+1 − z
t∥
2
ϵ
2
.
We continue to bound the last term as
4
∥z
∗ − zb
t+1∥
2 + ∥zb
t+1 − z
t∥
2
ϵ
2
= 4
∥x
∗ − xb
t+1∥
2 + ∥y
∗ − yb
t+1∥
2 + ∥xb
t+1 − x
t∥
2 + ∥yb
t+1 − y
t∥
2
ϵ
2
= 4
∥x
∗ − xb
t+1∥
2
1 + ∥y
∗ − yb
t+1∥
2
1 + ∥xb
t+1 − x
t∥
2
1 + ∥yb
t+1 − y
t∥
2
1
ϵ
2
(∥x∥2 ≤ ∥x∥1)
≤
128
ϵ
2
KL(z
∗
, zb
t+1)
16
+
KL(zb
t+1
, z
t
)
16
(Pinsker’s inequality)
≤
128
ϵ
2
Θ
t+1
.
156
Combining everything, we get
KL(z
∗
, z
t
) ≤
√
128
ϵ
√
Θt+1 ≤
p
128 ln(MN)
ϵ
1 +
15η
2C2
32 T0−t−1
2
,
which completes the proof.
A.5 Proofs of Lemma 4 and the Sum-of-Duality-Gap Bound
Proof of Lemma 4. Below we consider any z
′ ̸= zb
t+1 ∈ Z. Considering Equation (2.1) with Dψ(u, v) =
1
2
∥u − v∥
2
, and using the first-order optimality condition of zb
t+1, we have
(zb
t+1 − zb
t + ηF(z
t
))⊤(z
′ − zb
t+1) ≥ 0,
(z
t+1 − zb
t+1 + ηF(z
t
))⊤(z
′ − z
t+1) ≥ 0.
Rearranging the terms and we get
(zb
t+1 − zb
t
)
⊤(z
′ − zb
t+1) ≥ ηF(z
t
)
⊤(zb
t+1 − z
′
)
= ηF(zb
t+1)
⊤(zb
t+1 − z
′
) + η
F(z
t
) − F(zb
t+1)
⊤
(zb
t+1 − z
′
)
≥ ηF(zb
t+1)
⊤(zb
t+1 − z
′
) − ηL∥z
t − zb
t+1∥∥zb
t+1 − z
′
∥
≥ ηF(zb
t+1)
⊤(zb
t+1 − z
′
) −
1
8
∥z
t − zb
t+1∥∥zb
t+1 − z
′
∥,
15
and
(z
t+1 − zb
t+1)
⊤(z
′ − z
t+1) ≥ ηF(z
t
)
⊤(z
t+1 − z
′
)
= ηF(z
t+1)
⊤(z
t+1 − z
′
) + η
F(z
t
) − F(z
t+1)
⊤
(z
t+1 − z
′
)
≥ ηF(z
t+1)
⊤(z
t+1 − z
′
) − ηL∥z
t − z
t+1∥∥z
t+1 − z
′
∥
≥ ηF(z
t+1)
⊤(z
t+1 − z
′
) −
1
8
∥z
t − z
t+1∥∥z
t+1 − z
′
∥.
Here, for both block, the third step uses Hölder’s inequality and the smoothness condition Assumption 1,
and the last step uses the condition η ≤ 1/(8L). Upper bounding the left-hand side of the two inequalities
by ∥zb
t+1 − zb
t∥∥zb
t+1 − z
′∥ and ∥z
t+1 − zb
t+1∥∥z
t+1 − z
′∥ respectively and then rearranging, we get
∥zb
t+1 − z
′
∥
∥zb
t+1 − zb
t
∥ +
1
8
∥z
t − zb
t+1∥
≥ ηF(zb
t+1)
⊤(zb
t+1 − z
′
),
∥z
t+1 − z
′
∥
∥z
t+1 − zb
t+1∥ +
1
8
∥z
t − z
t+1∥
≥ ηF(z
t+1)
⊤(z
t+1 − z
′
).
Therefore, we have
∥zb
t+1 − zb
t
∥ +
1
8
∥z
t − zb
t+1∥
2
≥
η
2
[F(zb
t+1)
⊤(zb
t+1 − z
′
)]2
+
∥zbt+1 − z
′∥
2
,
∥z
t+1 − zb
t+1∥ +
1
8
∥z
t − z
t+1∥
2
≥
η
2
[F(z
t+1)
⊤(z
t+1 − z
′
)]2
+
∥z
t+1 − z
′∥
2
.
15
Finally, by the triangle inequality and the fact (a + b)
2 ≤ 2a
2 + 2b
2
, we have
∥zb
t+1 − zb
t
∥ +
1
8
∥z
t − zb
t+1∥
2
≤
∥z
t − zb
t
∥ +
9
8
∥z
t − zb
t+1∥
2
≤
9
8
∥z
t − zb
t
∥ +
9
8
∥z
t − zb
t+1∥
2
≤
81
32
∥z
t − zb
t
∥
2 + ∥z
t − zb
t+1∥
2
,
∥zb
t+1 − z
t+1∥ +
1
8
∥z
t − z
t+1∥
2
≤
9
8
∥z
t+1 − zb
t+1∥ + ∥z
t − zb
t+1∥
2
≤
9
8
∥z
t+1 − zb
t+1∥ +
9
8
∥z
t − zb
t+1∥
2
≤
81
32
∥z
t+1 − zb
t+1∥
2 + ∥z
t − zb
t+1∥
2
,
which finishes the proof.
Next, we use Equation (2.4) and Equation (2.6) to derive a result on the convergence of “average duality
gap” across time. First, we use the following lemma to relate the right-hand side of Equation (2.6) to the
duality gap of z
t
.
Lemma 82. Let Z be closed and bounded. Then for any z ∈ Z, we have αf (z) ≤ maxz′∈Z F(z)
⊤(z −z
′
).
Proof. This is a direct consequence of the convexity of f(·, y) and the concavity of f(x, ·):
αf (z) = max
(x′
,y′)∈X ×Y
f(x, y
′
) − f(x, y) + f(x, y) − f(x
′
, y)
≤ max
(x′
,y′)∈X ×Y
∇yf(x, y)
⊤(y
′ − y) + ∇xf(x, y)
⊤(x − x
′
)
= max
z′∈Z
F(z)
⊤(z − z
′
).
With Lemma 82, the following theorem can be proven straightforwardly.
Theorem 83. Let Z be closed and bounded. Then OGDA with η ≤
1
8L
ensures 1
T
PT
t=1 αf (z
t
) = O
D
η
√
T
for any T, where D ≜ supz,z′∈Z ∥z − z
′∥.
Proof. We first bound the sum of squared duality gap as (recall ζ
t = ∥zb
t+1 − z
t∥
2 + ∥z
t − zb
t∥
2
):
X
T
t=1
αf (z
t
)
2 ≤
X
T
t=1
max
z′∈Z
F(z
t
)
⊤(z
t − z
′
)
2
(Lemma 82)
≤
81
32η
2
X
T
t=1
(ζ
t−1 + ζ
t
)∥z
t − z
′
∥
2
(Lemma 4)
≤ O
D2
η
2
X
T
t=2
(Θt−1 − Θ
t + Θt − Θ
t+1)
!
(Equation (2.4))
= O
D2
η
2
. (telescoping)
Finally, by Cauchy-Schwarz inequality, we get 1
T
PT
t=1 αf (z
t
) ≤
1
T
q
T
PT
t=1 αf (z
t)
2 = O
D
η
√
T
.
This theorem indicates that αf (z
t
) is converging to zero. A rate of αf (z
t
) = O(
D
η
√
t
) would be compatible with the theorem, but is not directly implied by it. In a recent work, Golowich et al. [68] consider
the unconstrained setting and show that the extra-gradient algorithm obtains the rate αf (z
t
) = O(
D
η
√
t
),
under an extra assumption that the Hessian of f is also Lipschitz (since Golowich et al. [68] study the unconstrained setting, their duality gap αf is defined only with respect to the best responses that lie within
a ball of radius D centered around the equilibrium). Note that the extra-gradient algorithm requires more
cooperation between the two players compared to OGDA and is less suitable for a repeated game setting.
A.6 The Equivalence Between SP-MS and Metric Subregularity
In this section, we formally that show our SP-MS condition with β = 0 is equivalent to metric subregularity. Before introducing the main theorem, we introduce several definitions. We let Z
∗ ⊆ Z ⊆ R
K (Z
∗
and
Z follow the same definitions as in our main text). First, we define the element-to-set distance function d:
Definition 6. The element-to-set distance function d: R
K × 2
RK → R is defined as d(z, S) = infz′∈S ∥z −
z
′∥.
160
The definition of metric subregularity involves a set-valued operator T : Z → 2
RK
, which maps an
element of Z to a set in R
K.
Definition 7. A set-valued operator T is called metric subregular at (z¯, v) for v ∈ T (z¯) if there exists
κ > 0 and a neighborhood Ω of z¯ such that
d(v, T (z)) ≥ κd(z, T
−1
(v))
for all z ∈ Ω, where x ∈ T −1
(v) ⇔ v ∈ T (x). If Ω = Z, we call T globally metric subregular.
The following definition of normal cone is also required in the analysis:
Definition 8. The normal cone of Z at point z is N (z) = {g | g
⊤(z
′ − z) ≤ 0, ∀z
′ ∈ Z} (we omit its
dependence on Z for simplicity). Equivalently, N (z) is the polar cone of the convex set Z −z (a property that
we will use in the proof).
Now we are ready to show that our SP-MS condition with β = 0 is equivalent to metric subregularity
of the operator N + F, defined via: (N + F)(z) = {g + F(z) | g ∈ N (z)}.
Theorem 84. Let z
∗ ∈ Z∗
. Then the following two statements are equivalent:
• (N + F) is globally metric subregular at (z
∗
, 0) with κ > 0;
• For all z ∈ Z\Z∗
, maxz′∈Z F(z)
⊤ (z−z
′
)
∥z−z′∥ ≥ κd(z, Z
∗
).
Proof. Let T = N + F. Notice that
z ∈ Z∗ ⇔ F(z)
⊤(z
′ − z) ≥ 0 ⇔ −F(z) ∈ N (z) ⇔ 0 ∈ (N + F)(z).
161
Therefore, 0 ∈ T (z
∗
) indeed holds, and we have T
−1
(0) = Z
∗
. This means that the first statement in the
theorem is equivalent to
d(0, T (z)) ≥ κd(z, T
−1
(0)) ⇔ d(0, N (z) + F(z)) ≥ κd(z, Z
∗
).
This inequality holds trivially when z ∈ Z∗
. Thus, to complete the proof, it suffices to prove that
d(0, N (z) + F(z)) = maxz′∈Z F(z)
⊤ (z−z
′
)
∥z−z′∥
for z ∈ Z\Z∗
. To do so, note that
d(0, N (z) + F(z))
= d(−F(z), N (z))
= ∥ − F(z) − ΠN (z)
(−F(z))∥
= ∥ΠN ◦(z)
(−F(z))∥
where N ◦
(z) = {g | g
⊤n ≤ 0, ∀n ∈ N (z)} is the polar cone of N (z) and the last step is by Moreau’s
theorem. Now consider the projection of −F(z) onto the polar cone N ◦
(z):
ΠN ◦(z)
(−F(z)) = argmin
y∈N ◦(z)
∥ − F(z) − y∥
2
= argmin
y∈N ◦(z)
n
2F(z)
⊤y + ∥y∥
2
o
= argmin
y∈N ◦(z)
2F(z)
⊤ y
∥y∥
· ∥y∥ + ∥y∥
2
= argmin
λ≥0, z¯∈N ◦(z), ∥z¯∥=1
n
2λF(z)
⊤z¯ + λ
2
o
,
162
where the last equality is because N ◦
(z) is a cone. Next, we find the z¯
∗
and λ
∗
that realize the last argmin
operator: notice that the objective is increasing in F(z)
⊤z¯, so z¯
∗ = argminz¯∈N ◦(z): ∥z¯∥=1
F(z)
⊤z¯
,
and thus λ
∗ = −F(z)
⊤z¯
∗ when F(z)
⊤z¯
∗ ≤ 0 and λ
∗ = 0 otherwise. Therefore,
∥ΠN ◦(z)
(−F(z))∥ = λ
∗ = max
0, max
z¯∈N ◦(z),∥z¯∥=1
−F(z)
⊤z¯
.
Note that N (z) is the polar cone of the conic hull of Z − z. Therefore, N ◦
(z) = (ConicHull(Z −
z))◦◦ = ConicHull(Z − z) and
max
0, max
z¯∈N ◦(z),∥z¯∥=1
−F(z)
⊤z¯
= max
0, max
z′∈Z
F(z)
⊤ (z − z
′
)
∥z
′ − z∥
.
Finally, note that when z ∈ Z\Z∗
, we have maxz′∈Z F(z)
⊤(z − z
′
) > 0. Combining all the facts
above, we have shown d(0, N (z) + F(z)) = maxz′∈Z F(z)
⊤ (z−z
′
)
∥z−z′∥
.
A.7 Proof of Theorem 5
Proof of Theorem 5. Let ρ = minx∈X maxy∈Y x
⊤Gy = maxy∈Y minx∈X x
⊤Gy be the game value. In
this proof, we prove that there exists some c > 0 such that
max
y′∈Y
x
⊤Gy′ − ρ ≥ c∥x − ΠX∗ (x)∥ (A.11)
for all x ∈ X . Similarly we prove
max
x′∈X
ρ − x
′⊤Gy ≥ c∥y − ΠY∗ (y)∥
163
for all y ∈ Y. Assume that the diameter of the polytope is D < ∞. Then combining the two proves
max
z′
F(z)
⊤(z − z
′
)
∥z − z
′∥
≥
1
D
max
z′
F(z)
⊤(z − z
′
) = 1
D
max
y′
x
⊤Gy′ − min
x′
x
′⊤Gy
≥
c
D
(∥y − ΠY∗ (y)∥ + ∥x − ΠX∗ (x)∥) ≥
c
D
∥z − ΠZ∗ (z)∥,
meaning that SP-MS holds with β = 0. We break the proof into following several claims.
Claim 1. If X , Y are polytopes, then X
∗
and Y
∗
are also polytopes.
Proof of Claim 1. Note that X
∗ =
x ∈ X : maxy∈Y x
⊤Gy ≤ ρ
. Since Y is a polytope, the maximum is
attained at vertices of Y. Therefore, X
∗
can be equivalently written as
x ∈ X : maxy∈V(Y) x
⊤Gy ≤ ρ
,
where V(Y) is the set of vertices of Y. Since the constraints of X
∗
are all linear constraints, X
∗
is a
polytope.
With Claim 1, we without loss of generality write X
∗
as
X
∗ =
n
x ∈ R
M : a
⊤
i x ≤ bi
, for i = 1, . . . , L, c
⊤
i x ≤ di
, for i = 1, . . . , Ko
,
where the a
⊤
i x ≤ bi constraints come from x ∈ X and the c
⊤
i x ≤ di constraints come from
max
y∈V(Y)
x
⊤Gy ≤ ρ.
Below, we refer to a
⊤
i x ≤ bi as the feasibility constraints, and c
⊤
i x ≤ di as the optimality constraints. In
fact, one can identify the i-th optimality constraint as ci = Gy(i)
and di = ρ, where y
(i)
is the i-th vertex
of Y. This is based on our construction of X
∗
in the proof of Claim 1. Therefore, K = |V(Y)|.
Since Equation (A.11) clearly holds for x ∈ X ∗
, below, we focus on an x ∈ X \X ∗
, and let x
∗ ≜
ΠX∗ (x).
164
We say a constraint is tight at x
∗
if a
⊤
i x
∗ = bi or c
⊤
i x
∗ = di
. Below we assume that there are ℓ tight
feasibility constraints at and k tight optimality constraints at x
∗
. Without loss of generality, we assume
these tight constraints correspond to i = 1, . . . , ℓ and i = 1, . . . , k respectively. That is,
a
⊤
i x
∗ = bi
, for i = 1, . . . , ℓ,
c
⊤
i x
∗ = di
, for i = 1, . . . , k.
Claim 2. x violates at least one of the tight optimality constraint at x
∗
.
Proof of Claim 2. We prove this by contradiction. Suppose that x satisfies all k tight optimality constraints
at x
∗
. Then x must violates some of the remaining K − k optimality constraints (otherwise x ∈ X ∗
).
Assume that it violates constraints K − n + 1, . . . , K for some 1 ≤ n ≤ K − k. Thus, we have the
following:
c
⊤
i x ≤ di for i = 1, . . . K − n;
c
⊤
i x > di for i = K − n + 1, . . . , K.
Recall that c
⊤
i x
∗ ≤ di for i = 1, . . . , K − n and c
⊤
i x
∗ < di for all i = K − n + 1, . . . , K. Thus, there
exists some x
′
that lies strictly between x and x
∗
that makes all constraints hold (notice that x and x
∗
both satisfy all feasibility constraints), which contradicts with ΠX∗ (x) = x
∗
.
Claim 3. maxy′∈Y
x
⊤Gy′ − ρ
≥ maxi∈{1,...,k} c
⊤
i
(x − x
∗
).
Proof of Claim 3. Recall that we identify ci with Gy(i)
and di = ρ. Therefore,
max
y′∈Y
x
⊤Gy′ − ρ
= max
i∈{1,...,|V(Y)|}
c
⊤
i x − di
≥ max
i∈{1,...,k}
c
⊤
i x − di
= max
i∈{1,...,k}
c
⊤
i
(x − x
∗
),
16
where the last equality is because c
⊤
i x
∗ = di for i = 1, . . . , k.
Recall from linear programming literature [42, 43] that the normal cone of X
∗
at x
∗
is expressed as
follows:
Nx∗ =
n
x
′ − x
∗
: x
′ ∈ R
M, ΠX∗ (x
′
) = x
∗
o
=
(X
ℓ
i=1
piai +
X
k
i=1
qici
: pi ≥ 0, qi ≥ 0
)
.
The normal cone of X
∗
at x
∗
consists of all outgoing normal vectors of X
∗ originated from x
∗
. Clearly,
x − x
∗ belongs to Nx∗ . However, besides the fact that x − x
∗
is a normal vector of X
∗
, we also have the
additional constraints that x ∈ X . We claim that in our case, x − x
∗
lies in the following smaller cone
(which is a subset of Nx∗ ):
Claim 4. x − x
∗ belongs to
Mx∗ =
(X
ℓ
i=1
piai +
X
k
i=1
qici
: pi ≥ 0, qi ≥ 0, a
⊤
j
X
ℓ
i=1
piai +
X
k
i=1
qici
!
≤ 0, ∀j = 1, . . . , ℓ)
.
Proof of Claim 4. As argued above, x − x
∗ ∈ Nx∗ , and thus x − x
∗
can be expressed as Pℓ
i=1 piai +
Pk
i=1 qici with pi ≥ 0, qi ≥ 0. To prove that x − x
∗ ∈ Mx∗ , we only need to prove that it satisfies the
additional constraints, that is,
a
⊤
i
(x − x
∗
) ≤ 0, ∀i = 1, . . . , ℓ.
166
This is shown by noticing that for all i = 1, . . . , ℓ,
a
⊤
i
(x − x
∗
) =
a
⊤
i x
∗ − bi
+ a
⊤
i
(x − x
∗
) (the i-th constraint is tight at x
∗
)
= a
⊤
i
(x
∗ + x − x
∗
) − bi
= a
⊤
i x − bi ≤ 0. (x ∈ X )
Claim 5. x − x
∗
can be written as Pℓ
i=1 piai +
Pk
i=1 qici with 0 ≤ pi
, qi ≤ C
′∥x − x
∗∥ for all i and
some problem-dependent constant C
′ < ∞.
Proof of Claim 5. Notice that x−x
∗
∥x−x∗∥
∈ Mx∗ (because 0 ̸= x − x
∗ ∈ Mx∗ and Mx∗ is a cone). Furthermore, x−x
∗
∥x−x∗∥
∈ {v ∈ RM : ∥v∥∞ ≤ 1}. Therefore, x−x
∗
∥x−x∗∥
∈ Mx∗ ∩ {v ∈ RM : ∥v∥∞ ≤ 1}, which is a
bounded subset of the cone Mx∗ .
Below we argue that there exists a large enough C
′ > 0 such that
(X
ℓ
i=1
piai +
X
k
i=1
qici
: 0 ≤ pi
, qi ≤ C
′
, ∀i
)
⊇ Mx∗ ∩ {v ∈ R
M : ∥v∥∞ ≤ 1} ≜ P.
To see this, first note that P is a polytope. For every vertex vb of P, the smallest C
′
such that vb belongs
to the left-hand side is the solution of the following linear programming:
min
pi,qi,C′
vb
C
′
vb
s.t. vb =
X
ℓ
i=1
piai +
X
k
i=1
qici
, 0 ≤ pi
, qi ≤ C
′
vb
.
Since vb ∈ Mx∗ , this linear programming is always feasible and admits a finite solution C
′
vb < ∞. Now let
C
′ = maxvb∈V(P) C
′
vb
, where V(P) is the set of all vertices of P. Then since any v ∈ P can be expressed
167
as a convex combination of points in V(P), v can be also be expressed as Pℓ
i=1 piai +
Pk
i=1 qici with
0 ≤ pi
, qi ≤ C
′
.
To sum up, x−x
∗
∥x−x∗∥
can be represented as Pℓ
i=1 piai +
Pk
i=1 qici with 0 ≤ pi
, qi ≤ C
′
. This further
implies that x − x
∗
can be represented as Pℓ
i=1 piai +
Pk
i=1 qici with 0 ≤ pi
, qi ≤ C
′∥x − x
∗∥. Notice
that C
′ only depends on the set of tight constraints at x
∗
.
Finally, we are ready to combine all previous claims and prove the desired inequality.
Define Ai ≜ a
⊤
i
(x − x
∗
) and Ci ≜ c
⊤
i
(x − x
∗
). By Claim 5, we can write x − x
∗
as Pℓ
i=1 piai +
Pk
i=1 qici with 0 ≤ pi
, qi ≤ C
′∥x − x
∗∥, and thus,
X
ℓ
i=1
piAi +
X
k
i=1
qiCi =
X
ℓ
i=1
piai +
X
k
i=1
qici
!⊤
(x − x
∗
) = ∥x − x
∗
∥
2
.
On the other hand, since x − x
∗ ∈ Mx∗ by Claim 4, we have
X
ℓ
i=1
piAi =
X
ℓ
i=1
pia
⊤
i
(x − x
∗
) ≤ 0
and
X
k
i=1
qiCi ≤
max
i∈{1,...,k}
Ci
X
k
i=1
qi ≤
max
i∈{1,...,k}
Ci
kC′
∥x − x
∗
∥,
where in the first inequality we use the fact pi ≥ 0, and in the second inequality we use maxi∈{1,...,k} Ci >
0 (by Claim 2) and 0 ≤ qi ≤ C
′∥x − x
∗∥.
Combining the three inequalities above, we get
max
i∈{1,...,k}
Ci ≥
1
kC′
∥x − x
∗
∥.
168
Then by Claim 3,
max
y′∈Y
x
⊤Gy′ − ρ
≥ max
i∈{1,...,k}
Ci ≥
1
kC′
∥x − x
∗
∥.
Note that k and C
′ only depend on the set of tight constraints at the projection point x
∗
, and there are only
finitely many different sets of tight constraints. Therefore, we conclude that there exists a constant c > 0
such that maxy′∈Y
x
⊤Gy′ − ρ
≥ c∥x − x
∗∥ holds for all x and x
∗
, which completes the proof.
A.8 Proof of Theorem 6 and Theorem 7
Proof of Theorem 6. Suppose that f is γ-strongly-convex in x and γ-strongly-concave in y, and let(x
∗
, y
∗
) ∈
Z
∗
. Then for any (x, y) we have
f(x, y) − f(x
∗
, y) ≤ ∇xf(x, y)
⊤(x − x
∗
) −
γ
2
∥x − x
∗
∥
2
,
f(x, y
∗
) − f(x, y) ≤ ∇yf(x, y)
⊤(y
∗ − y) −
γ
2
∥y − y
∗
∥
2
.
Summing up the two inequalities, and noticing that f(x, y
∗
) − f(x
∗
, y) ≥ 0 for any (x
∗
, y
∗
) ∈ Z∗
, we
get
F(z)
⊤(z − z
∗
) ≥
γ
2
∥z − z
∗
∥
2
,
and therefore, for z ∈ Z/
∗
,
F(z)
⊤(z − z
∗
)
∥z − z
∗∥
≥
γ
2
∥z − z
∗
∥,
which implies SP-MS with β = 0 and C = γ/2.
16
Proof of Theorem 7. First, we show that f has a unique Nash Equilibrium z
∗ = (x
∗
, y
∗
) = ((0, 1),(0, 1)).
As f is a strictly monotone decreasing function with respect to y1, we must have y
∗
1 = 0 and y
∗
2 = 1. In
addition, if x = (0, 1), maxy∈Y f(x, y) = − miny∈Y y
2n
1 = 0. If x ̸= (0, 1), then by choosing y
∗ = (0, 1),
f(x, y
∗
) = x
2n
1 > 0. Therefore, we have x
∗ = (0, 1), which proves that the unique Nash Equilibrium is
x
∗ = (0, 1), y
∗ = (0, 1).
Second, we show that f satisfies SP-MS with β = 2n − 2. In fact, for any z = (x, y) ̸= z
∗
, we have
F(z)
⊤(z − z
∗
) =
2nx2n−1
1 − y1
0
2ny2n−1
1 + x1
0
⊤
x1
x2 − 1
y1
y2 − 1
= 2n
x
2n
1 + y
2n
1
≥ 4n ·
x
2
1 + y
2
1
2
n
(Jensen’s inequality)
=
n
2
n−2
x
2
1 + y
2
1
n
.
Note that ∥z−z
∗∥ =
p
x
2
1 + (1 − x2)
2 + y
2
1 + (1 − y2)
2 =
p
2x
2
1 + 2y
2
1
. Therefore, we have F(z)⊤(z−z
∗)
∥z−z∗∥ ≥
n
2
2n−2 ∥z − z
∗∥
2n−1
. This shows that f satisfies SP-MS with β = 2n − 2 and C =
n
2
2n−2 .
A.9 Proof of Theorem 8
Proof of Theorem 8. As argued in Section 2.4, with Θt = ∥zb
t −ΠZ∗ (zb
t
)∥
2 +
1
16 ∥zb
t −z
t−1∥
2
, ζ
t = ∥zb
t+1 −
z
t∥
2 + ∥z
t − zb
t∥
2
, we have (see Equation (2.4))
Θ
t+1 ≤ Θ
t −
15
16 ζ
t
. (A.12)
1
Below, we relate ζ
t
to Θt+1 using the SP-MS condition, and then apply Lemma 74 to show
Θ
t ≤
2dist2
(zb
1
, Z
∗
)(1 + C5)
−t
if β = 0,
1 + 4
4
β
1
β
dist2
(zb
1
, Z
∗
) + 2
2
C5β
1
β
t
− 1
β if β > 0,
(A.13)
where C5 = min n
16η
2C2
81 ,
1
2
o
as defined in the statement of the theorem. This is enough to prove the
theorem since
dist2
(z
t
, Z
∗
) ≤ ∥z
t − ΠZ∗ (zb
t+1)∥
2
≤ 2∥zb
t+1 − ΠZ∗ (zb
t+1)∥
2 + 2∥zb
t+1 − z
t
∥
2
≤ 32Θt+1 ≤ 32Θt
.
Next, we prove Equation (A.13). We first show a simple fact by Equation (A.12):
∥zb
t+1 − z
t
∥
2 ≤ ζ
t ≤
16
15
Θ
t ≤ · · · ≤
16
15
Θ
1
. (A.14)
171
Notice that
ζ
t ≥
1
2
∥zb
t+1 − z
t
∥
2 +
1
2
∥zb
t+1 − z
t
∥
2 + ∥z
t − zb
t
∥
2
≥
1
2
∥zb
t+1 − z
t
∥
2 +
16η
2
81
sup
z′∈Z
F(zb
t+1)
⊤(zb
t+1 − z
′
)
2
+
∥zbt+1 − z
′∥
2
( Lemma 4)
≥
1
2
∥zb
t+1 − z
t
∥
2 +
16η
2C
2
81
∥zb
t+1 − ΠZ∗ (zb
t+1)∥
2(β+1) (SP-MS condition)
≥ min (
16η
2C
2
81
,
1
2
15
16Θ1
β
)
∥zb
t+1 − z
t
∥
2(β+1) + ∥zb
t+1 − ΠZ∗ (zb
t+1)∥
2(β+1)
(by Equation (A.14))
≥ min (
16η
2C
2
2
β · 81
,
1
2
15
32Θ1
β
)
∥zb
t+1 − z
t
∥
2 + ∥zb
t+1 − ΠZ∗ (zb
t+1)∥
2
β+1
(by Hölder’s inequality: (a
β+1 + b
β+1)(1 + 1)β ≥ (a + b)
β+1)
≥ min (
C5
2
β
,
1
2
1
4Θ1
β
)
Θ
t+1β+1
(recall that C5 = min{
16η
2C2
81 ,
1
2
})
= C
′
Θ
t+1β+1
. (define C
′ = min n
C5
2
β ,
1
2
1
4Θ1
β
o
)
Combining this with Equation (A.12), we get
Θ
t+1 ≤ Θ
t − C
′
Θ
t+1β+1
(A.15)
When β = 0, Equation (A.15) implies Θt+1 ≤ (1 + C5)
−1Θt
, which immediately implies Θt ≤ (1 +
C5)
−t+1Θ1 ≤ 2Θ1
(1 + C5)
−t
. When β > 0, Equation (A.15) is of the form specified in Lemma 74 with
p = β and q = C
′
. Note that the second required condition is satisfied: C
′
(β + 1)(Θ1
)
β ≤
β+1
2·4
β ≤ 1.
Therefore, by the conclusion of Lemma 74,
Θ
t ≤ max (
Θ
1
,
2
C′β
1
β
)
t
− 1
β = max (
Θ
1
,
2 · 2
β
C5β
1
β
, 4Θ1
4
β
1
β
)
t
− 1
β
≤
" 1 + 4
4
β
1
β
!
Θ
1 + 2
2
C5β
1
β
#
t
− 1
β .
Equation (A.13) is then proven by noticing that Θ1 = dist2
(zb
1
, Z
∗
).
A.10 Proof of Theorem 9
Proof of Theorem 9. Consider the following 2 × 2 bilinear game with curved feasible sets:
f(x, y) = x
⊤Gy =
x1 x2
0 −1
1 0
y1
y2
,
X =
x : 0 ≤ x1 ≤
1
2
, 0 ≤ x2 ≤
1
4
, x2 ≥ x1
2
,
Y =
y : 0 ≤ y1 ≤
1
2
, 0 ≤ y2 ≤
1
4
, y2 ≥ y1
2
.
Below, we use Claim 1 - Claim 5 to argue that if the two players start from x
0 = y
0 = xb0 = yb0 = ( 1
2
,
1
4
),
and use any constant learning rate η ≤
1
64 , then the convergence is sublinear in the sense that ∥z
t−z
∗∥ ≥
Ω(1/t). Then, in Claim 6, we show that in this example, SP-MS holds with β = 3.
Claim 1. The unique equilibrium is x
∗ = 0, y
∗ = 0.
173
When x = 0, clearly maxy′∈Y f(x, y
′
) = 0. When x ̸= 0, we prove maxy′∈Y f(x, y
′
) > 0 below. If
x1 ̸= 0, we let y
′
1 =
1
2
x1 and y
′
2 =
1
4
x1
2
(which satisfies y
′ ∈ Y), and thus
f(x, y
′
) = x2y
′
1 − x1y
′
2 = x1
2
·
1
2
x1 − x1 ·
1
4
x1
2 =
1
4
x1
3 > 0.
If x1 = 0 but x2 ̸= 0, we let y
′
1 =
1
2
, y′
2 =
1
4
, and thus
f(x, y
′
) = x2y
′
1 − x1y
′
2 =
1
2
x2 > 0.
Thus, maxy′∈Y f(x, y
′
) > 0 if x ̸= 0, and x
∗ = 0 is the unique optimal solution for x. By the symmetry
between x and y (because G = −G⊤), we can also prove that the unique optimal solution for y is y
∗ = 0.
Claim 2. Suppose that x
0 = y
0 = xb0 = yb0 = ( 1
2
,
1
4
). Then, at any step t ∈ [[T]], we have x
t = y
t
and
xb
t = yb
t
, and all x
t
, y
t
, xb
t
, yb
t belong to {u ∈ R
2
: u2 = u
2
1
}.
174
We prove this by induction. The base case trivially holds. Suppose that for step t, we have x
t = y
t
,
xb
t = yb
t
, and x
t
, y
t
, xb
t
, yb
t ∈ {u ∈ R
2
: u2 = u
2
1
}. Then consider step t + 1. According to the dynamic of
OGDA, we have
xb
t+1 = ΠX
xb
t − η
−yt,2
yt,1
= ΠX
xbt,1 + ηyt,2
xbt,2 − ηyt,1
, (A.16)
x
t+1 = ΠX
xb
t+1 − η
−yt,2
yt,1
= ΠX
xbt+1,1 + ηyt,2,
xbt+1,2 − ηyt,1
,
yb
t+1 = ΠY
yb
t + η
xt,2
−xt,1
= ΠY
ybt,1 + ηxt,2
ybt,2 − ηxt,1
,
y
t+1 = ΠY
yb
t+1 + η
xt,2
−xt,1
= ΠY
ybt+1,1 + ηxt,2
ybt+1,2 − ηxt,1
.
According to induction hypothesis, we have xb
t+1 = yb
t+1, which further leads to x
t+1 = y
t+1
.
Now we prove that for any
x1
x2
such that x1 ≥ 0, x2 ≤
1
4
and x2 < x1
2
,
x1
x2
= ΠX
x1
x2
satisfies that x
2
1 = x2. Otherwise, suppose that x
2
1 < x2. Then according to the intermediate value
theorem, there exists
xe1
xe2
that lies in the line segment of
x1
x2
and
x1
x2
such that xe
2
1 = xe2. Moreover,
as x1 ≥ 0, xe1 ≥ 0, x2 ≤
1
4
, xe2 ≤
1
4
, we know that
xe1
xe2
∈ X . Therefore, we have ∥xe − x∥ < ∥x − x∥,
which leads to contradiction.
Now consider xb
t+1. According to induction hypothesis, we have (xbt,1 + ηyt,2)
2 ≥ xb
2
t,1 = xbt,2 ≥
xbt,2−ηyt,1. If equalities hold, trivially we have xb
2
t+1,1 = xb
2
t,1 = xbt,2 = xbt+1,2 according to Equation (A.16).
Otherwise, as xbt,1 + ηyt,2 ≥ 0, xbt,2 − ηyt,1 ≤
1
4
, according to the analysis above, we also have xb
2
t+1,1 =
xbt+1,2. Applying similar analysis to yb
t+1
, x
t+1 and y
t+1 finishes the induction proof.
175
Claim 3. With η ≤
1
64 , the following holds for all t ≥ 1,
xt,1 ∈
1
2
xbt,1, 2xbt,1
, (A.17)
xbt,1 ∈
xbt−1,1 − 4ηxb
2
t−1,1
, xbt−1,1 + 4ηxb
2
t−1,1
. (A.18)
We prove the claim by induction on t. The case t = 1 trivially holds. Suppose that Equation (A.17) and
Equation (A.18) hold at step t. Now consider step t + 1.
Induction to get Equation (A.18). According to Claim 2, we have
xb
t+1 = ΠX
xb
t − η
−yt,2
yt,1
= ΠX
xbt,1 + ηx2
t,1
xb
2
t,1 − ηxt,1
,
and xb
t+1 = (u, u2
) for some u ∈ [0, 1/2]. Using the definition of the projection function, we have
xbt+1,1 = argmin
u∈[0,
1
2
]
n
xbt,1 + ηx2
t,1 − u
2
+
xb
2
t,1 − ηxt,1 − u
2
2
o
≜ argmin
u∈[0,
1
2
]
g(u).
Now we show that argminu∈[0,
1
2
]
g(u) = argminu∈R g(u). Note that
∇g(u) = 2(u − xbt,1 − ηx2
t,1
) + 4u
u
2 + ηxt,1 − xb
2
t,1
, (A.19)
Therefore, when u > 1
2
, using xt,1 ≤
1
2
, we have
∇g(u) > −2ηx2
t,1 + 2ηxt,1 ≥ 0, (A.20)
which means g(u) > g(
1
2
). On the other hand, when u < 0, using xbt,1 ≤
1
2
, we have
∇g(u) < 2u − 4uxb
2
t,1 ≤ u < 0, (A.21)
which means g(u) > g(0). Combining Equation (A.20) and Equation (A.21), we know that
argmin
u∈[0,
1
2
]
g(u) = argmin
u∈R
g(u).
Therefore, xbt+1,1 is the unconstrained minimizer of convex function g(u), which means ∇g(xbt+1,1) = 0.
Below we use contradiction to prove that xbt+1,1 ≥ xbt,1 − 4ηxb
2
t,1
. If xbt+1,1 < xbt,1 − 4ηxb
2
t,1
, we use
Equation (A.19) and get
∇g(xbt+1,1) = 2(xbt+1,1 − xbt,1 − ηx2
t,1
) + 4xbt+1,1
xb
2
t+1,1 + ηxt,1 − xb
2
t,1
< 2(−4ηxb
2
t,1 − ηx2
t,1
) + 4xbt+1,1
ηxt,1 − 8ηxb
3
t,1 + 16η
2xb
4
t,1
≤ −
17
2
ηxb
2
t,1 + 4xbt+1,1
2ηxbt,1 − 8ηxb
3
t,1 + 16η
2xb
4
t,1
(Equation (A.17))
≤ −
17
2
ηxb
2
t,1 + 4xbt+1,1
2ηxbt,1 + 16η
2xb
4
t,1
≤ −
17
2
ηxb
2
t,1 + 4(xbt,1 − 4ηxb
2
t,1
)
2ηxbt,1 + 16η
2xb
4
t,1
(xbt+1,1 < xbt,1 − 4ηxb
2
t,1
)
= −
1
2
ηxb
2
t,1 + 64η
2xb
5
t,1 − 32η
2xb
3
t,1 − 256η
3xb
6
t,1
≤ −
1
2
ηxb
2
t,1 − 16η
2xb
3
t,1 − 256η
3xb
6
t,1
(xbt,1 ≤
1
2
)
≤ 0,
which leads to contradiction. Similarly, if xbt+1,1 > xbt,1 + 4ηxb
2
t,1
, we have
∇g(xbt+1,1) = 2(xbt+1,1 − xbt,1 − ηx2
t,1
) + 4xbt+1,1
xb
2
t+1,1 + ηxt,1 − xb
2
t,1
> 2(4ηxb
2
t,1 − ηx2
t,1
) + 4xbt+1,1
ηxt,1 + 8ηxb
3
t,1 + 16η
2xb
4
t,1
≥ 0. (Equation (A.17))
The calculations above conclude that
xbt+1,1 ∈
xbt,1 − 4ηxb
2
t,1
, xbt,1 + 4ηxb
2
t,1
. (A.22)
Induction to get Equation (A.17). Similarly, we have
xt+1,1 = argmin
u∈[0,
1
2
]
n
xbt+1,1 + ηx2
t,1 − u
2
+
xb
2
t+1,1 − ηxt,1 − u
2
2
o
≜ argmin
u∈[0,
1
2
]
h(u),
∇h(u) = 2(u − xbt+1,1 − ηx2
t,1
) + 4u(u
2 + ηxt,1 − xb
2
t+1,1
),
and ∇h(xt+1,1) = 0. If xt+1,1 <
1
2
xbt+1,1, we have
∇h(xt+1,1) = 2(xt+1,1 − xbt+1,1 − ηx2
t,1
) + 4xt+1,1
x
2
t+1,1 + ηxt,1 − xb
2
t+1,1
< −xbt+1,1 − 2ηx2
t,1 − 3xt+1,1xb
2
t+1,1 + 2ηxbt+1,1xt,1 (xt+1,1 <
1
2
xbt+1,1)
≤ 0. (η ≤
1
64 , xt,1 ≤
1
2
)
If xt+1,1 > 2xbt+1,1, we also have
∇h(xt+1,1) = 2(xt+1,1 − xbt+1,1 − ηx2
t,1
) + 4xt+1,1
x
2
t+1,1 + ηxt,1 − xb
2
t+1,1
> 2xbt+1,1 − 2ηx2
t,1 + 24xb
3
t+1,1 + 8ηxbt+1,1xt,1 (xt+1,1 > 2xbt+1,1)
≥ 2xbt+1,1 − 2ηx2
t,1 + 24xb
3
t+1,1 + 8η(xbt,1 − 4ηxb
2
t,1
)xt,1 (Equation (A.22))
≥ 2xbt+1,1 − 2ηx2
t,1 + 24xb
3
t+1,1 + 8η(
1
2
xt,1 − 4ηxb
2
t,1
)xt,1 (Equation (A.17))
= 2xbt+1,1 + 2ηx2
t,1 + 24xb
3
t+1,1 − 32η
2xb
2
t,1xt,1
≥ 2xbt+1,1 +
1
4
ηxb
2
t,1 + 24xb
3
t+1,1 − 32η
2xb
2
t,1xt,1 (Equation (A.17))
≥ 0. (η ≤
1
64 , xt,1 ≤
1
2
)
Both lead to contradiction. Therefore, we conclude that x
t+1 ∈ [
1
2
xbt+1,1, 2xbt+1,1], which finishes the
induction proof.
Claim 4. xt,1 ≥ xbt,1 − 4ηxb
2
t,1
, for all t ≥ 1.
The case t = 1 holds trivially. For t ≥ 2, we prove this by contradiction. Using the definition of the
projection function, we have:
xt+1,1 = argmin
u∈[0,
1
2
]
n
xbt+1,1 + ηx2
t,1 − u
2
+
xb
2
t+1,1 − ηxt,1 − u
2
2
o
≜ argmin
u∈[0,
1
2
]
h(u).
Similar to the analysis in Claim 3, we have argminu∈[0,
1
2
]
h(u) = argminu∈R h(u), which means that
∇h(xt+1,1) = 0. Note that η ≤
1
64 and 0 ≤ xbt,1 ≤
1
2
, according to Equation (A.17) and Equation (A.18),
we have
xbt+1,1 ∈
xbt,1 − 4ηxb
2
t,1
, xbt,1 + 4ηxb
2
t,1
⊆
31
32
xbt,1,
33
32
xbt,1
,
which means that
xt,1 ∈
1
2
xbt,1, 2xbt,1
⊆
16
33
xbt+1,1,
64
31
xbt+1,1
. (A.23)
If xt+1,1 < xbt+1,1 − 4ηxb
2
t+1,1
, we show that ∇h(xt+1,1) < 0. In fact,
∇h(xt+1,1) = 2(xt+1,1 − xbt+1,1 − ηx2
t,1
) + 4xt+1,1
x
2
t+1,1 + ηxt,1 − xb
2
t+1,1
< 2(−4ηxb
2
t+1,1 − ηx2
t,1
) + 4xt+1,1
ηxt,1 − 8ηxb
3
t+1,1 + 16η
2xb
4
t+1,1
≤ −
42
5
ηxb
2
t+1,1 + 4xt+1,1
64
31
ηxbt+1,1 − 8ηxb
3
t+1,1 + 16η
2xb
4
t+1,1
(Equation (A.23))
≤ −
42
5
ηxb
2
t+1,1 + 4xt+1,1
64
31
ηxbt+1,1 + 16η
2xb
4
t+1,1
< −
42
5
ηxb
2
t+1,1 + 4(xbt+1,1 − 4ηxb
2
t+1,1
)
64
31
ηxbt+1,1 + 16η
2xb
4
t+1,1
≤ 64η
2xb
5
t+1,1 − 32η
2xb
3
t+1,1 − 256η
3xb
6
t+1,1
≤ −16η
2xb
3
t,1 − 256η
3xb
6
t,1
(xbt,1 ≤
1
2
)
≤ 0,
which leads to contradiction. Therefore, we show that xt,1 ≥ xbt,1 − 4ηxb
2
t,1
for all t ≥ 1.
Claim 5. If η ≤
1
64 , we have ∥z
t − z
∗∥ ≥ Ω(1/t).
1
Now we are ready to prove ∥z
t − z
∗∥ ≥ Ω(1/t). First we show xbt,1 ≥
1
2t
for all t ≥ 1 by induction.
The case t = 1 trivially holds. Suppose that it holds at step t. Considering step t + 1, we have
xbt+1,1 ≥ xbt,1 − 4ηxb
2
t,1
(Claim 3)
≥ xbt,1 −
1
16
xb
2
t,1
(η ≤
1
64 )
≥
1
2t
−
1
64t
2
(
1
2t ≤ xbt,1 ≤
1
2
, and x −
1
16x
2
is increasing when x ≤ 8)
≥
1
2(t + 1). (t ≥ 1)
Therefore, xbt,1 ≥
1
2t
, ∀t ≥ 1. This, by Claim 4 and the analysis above, shows that
xt,1 ≥ xbt,1 − 4ηxb
2
t,1 ≥
1
2(t + 1).
Note that according to Claim 1, x
∗ = 0. Therefore, we have ∥z
t − z
∗∥ ≥ xt,1 ≥
1
2(t+1) , which finishes
the proof.
Claim 6. In this example, SP-MS holds with β = 3. This can be seen by the following:
181
max
z′∈Z
F(z)
⊤(z − z
′
)
∥z − z
′∥
≥ max
z′∈Z
F(z)
⊤(z − z
′
)
= max
x′∈X,y′∈Y
n
x
⊤Gy′ − x
′⊤Gyo
= max
x′∈X,y′∈Y
−x1y
′
2 + x2y
′
1 + x
′
1
y2 − x
′
2
y1
≥ −x1x
2
2 + x
2
2 + y
2
2 − y
2
2
y1 (picking y
′
1 = x2, y′
2 = x
2
2
, x′
1 = y2, x′
2 = y
2
2
)
≥
1
2
x
2
2 +
1
2
y
2
2
(x1, y1 ≤
1
2
)
≥
1
4
x
4
1 + x
4
2 + y
4
1 + y
4
2
(x2 ≥ {x
2
1
, x2
2
}, y2 ≥ {y
2
1
, y2
2
})
≥
1
16
x
2
1 + x
2
2 + y
2
1 + y
2
2
2
(Cauchy-Schwarz)
=
1
16
∥z − z
∗
∥
4
, (z
∗ = (0, 0, 0, 0))
which implies β = 3.
1
Appendix B
Omitted Details in Chapter 3
B.1 Examples of EFG and Treeplexes
In this section, we introduce Kuhn poker [101], a simple EFG as an example to introduce treeplexes and
the corresponding definitions. In this game, there are three cards in the deck: King, Queen, and Jack. Both
player x and player y are dealt one card, while the third card is put aside unseen. In the first round, player
x can bet or check. Then, if player x bets player y can choose to call or fold. If player x checks then player
y can bet or check. Finally, if player x checks and player y bets, then player x has a final round where
they can call or fold. If neither player folded, then the player with the higher card wins the pot. If a player
folded then the other player wins the pot.
We show a game tree of Kuhn poker in Figure B.1. Players’ imperfect information is modeled by
information sets. In Figure B.1, nodes with the same color belong to the same information set. A player
cannot distinguish among nodes in a given information set (that belongs to this player). For example,
player x cannot distinguish among the blue nodes, since in both nodes, player x was dealt Queen but does
not know whether player y was dealt Jack or King.
We can further separate the decision spaces and consider them individually on a per-player basis,
which is where the concept of a treeplex arises. We show player x’s decision space in Figure B.2. Here
each circular node is an information set, which is a decision node for player x where they choose an action.
183
For example, the blue node h3 corresponds to the initial state where player x is dealt Queen, and player
x can choose to bet or check at this node. Each square node is an observation node, where player x does
not make a decision, but the environment or player y makes decisions which determine the next decision
node for player x. Each triangular node is a terminal node, where the game ends.
Each index of treeplex X corresponds to a solid, directed, edge in the figure. In other words, each index
corresponds to an action in finite action set Ωh for every h in HX , the set of information sets that belongs
to player x. Indices (solid edges) are labeled from x1 to x12. More specifically, x ∈ X if x satisfies xi ≥ 0
for every index i and for every h ∈ HX ,
X
i∈Ωh
xi = xσ(h)
,
where index σ(h) ∈ Ωh′ is the unique action such that h can be reached immediately by taking σ(h)
when player x is in information set h
′
. When no such action exists, that is, h can be reached immediately in
the beginning, we set σ(h) = 0 and x0 = 1. For example, we must have x6 = x7+x8 and x5+x6 = x0 = 1
if x ∈ X . Intuitively, xi
is the probability taking action i, given that the sequential decisions from the
environment and player y can lead to the information set where action i is. Similarly, we show player y’s
decision space in Figure B.3, which illustrates treeplex Y.
B.2 Omitted Details of Section 3.6
In this section, we provide more details about the experiments.
B.2.1 Additional Experiments
As we mentioned in Section 3.6, although VOGDA and DOMWU have linear convergence rate in theory,
we use much larger step sizes η in the Leduc poker experiment than what Corollary 14 and Theorem 16
184
x x x x x x
JQ QJ KJ JK QK KQ
y y y y
x x
bet check bet check
call fold bet check call fold bet check
call fold call fold
2 1 1 -2 1 -1
2 -1 -2 -1
Figure B.1: A game tree for Kuhn poker. The edge labeled with JQ means that player x is dealt Jack and
player y is dealt Queen. The cases are similar at other edges. We omit branches stemming from the green
and red nodes, which are similar to what we present for the blue nodes. Circular nodes are player x’s
decision nodes, while square nodes are player y’s decision nodes. Triangular nodes are terminal nodes,
where the values denote the utility for player x (and thus the loss for player y). Nodes with the same
color belong to the same information set, and a player cannot distinguish among nodes within the same
information set, that is, they only know they are at one of these nodes but do not know which node they
are at exactly.
x0
h1 h3 h5
h2 h4 h6
J Q K
x1 (bet) x2 (check)
x3 (call) x4 (fold)
x5 (bet) x6 (check)
x7 (call) x8 (fold)
x9 (bet) x10 (check)
x11 (call) x12 (fold)
Figure B.2: The decision space for player x and treeplex X . Each circular node is an information set, each
square node is an observation node, and each triangular node is a termination node, where the decision
process ends. Each solid edge corresponds an action and one of the indexes in treeplex X . More specifically,
we have M = 13, HX = {h1, . . . , h6}, Ωhi = {2i−1, 2i} for i = 1, . . . , 6, σ(h1) = σ(h3) = σ(h5) = 0,
σ(h2) = 2, σ(h4) = 6, σ(h6) = 10.
185
y0
h
0
1 h
0
2 h
0
3 h
0
4 h
0
5 h
0
6
y1 (bet) y2 (check) y3 (bet) y4 (check) y5 (bet) y6 (check) y7 (bet) y8 (check) y9 (bet) y10 (check) y11 (bet) y12 (check)
J, bet J, check
Q, bet Q, check
K, bet K, check
Figure B.3: The decision space for player y and treeplex Y. Each square node is an information set and
each triangular node is a termination node. More specifically, we have N = 13, HY = {h
′
1
, . . . , h
′
6
},
Ωh′
i
= {2i − 1, 2i} and σ(h
′
i
) = 0 for i = 1, . . . , 6.
0 2 4 6 8 10
104
0.35
0.4
0.45
0.5
0.55
0.6
0.65
0.7
DOMWU eta=1e-4
0 2 4 6 8 10
104
-0.2
0
0.2
0.4
0.6
0.8
VOGDA eta=1e-4
Figure B.4: Experiments on Kuhn Leduc poker for DOMWU (left) and VOGDA (right) with small step sizes
and more time steps.
suggest, which explains why we were not able to observe the linear convergence. Here, we rerun this
experiment with a smaller step size for VOGDA and DOMWU. With more iterations, on the order of 105
,
we observe again that they exhibit fast convergence, as shown in Figure B.4.
B.2.2 Description of the Games
We briefly introduce the games in the experiments. Beside the rules of the games, we show the game size
by providing M, N, |HY |, and |HY | (recall G = RM×N ).
Kuhn poker Introduced in [101], the deck for Kuhn poker contains three playing cards: King, Queen,
and Jack. Each player is dealt one card, while the third card is unseen. Then a betting process proceeds.
186
Player x can check or raise, and then player y can also check or raise. Player x has a final round to call
or fold if player x checks but player y raises in the previous round. The player with the higher card wins
the pot. In this game, M = N = 13, |HX | = |HY | = 6.
Pursuit-evasion This is a search game considered in [94]. Given a direct graph, player x controls an
attacker to move in the graph, while player y controls two patrols, who are only allowed to move in their
patrol areas. The game has 4 rounds. In each round, the attacker and the patrols act simultaneously. The
attacker can move to any adjacent node or choose to wait in each round, with the final goal of going across
the patrol areas and reach the goal nodes without being caught by the patrols. On the other hand, player
y’s goal is to let one of the patrols to reach the same node as the attacker in any round. A patrol who visits
a node that was previously visited by the attacker will know that the attacker was there if the attacker did
not wait at that node (in order to clean up their traces). In this game, M = 52, N = 2029, |HX | = 34,
|HY | = 348.
Leduc poker Introduced in [147], the game is similar to Kuhn poker. There are total 6 cards in the deck
with two Kings, two Queens, and two Jacks. Each player is dealt a private card while there is another
unrevealed public card. In the first round, player y bets after player x bets. After that the public card will
be revealed and there is another betting stage. In a showdown stage, a player who has the same rank with
the public card wins. Otherwise, the player with the higher card wins. In this game, M = N = 337,
|HX | = |HY | = 144.
B.2.3 Parameter Selection in the Dilated Regularizers
We note that for the dilated regularizers, we use the unweighted version in the experiments. This is
sufficient as Lemma 85 shows there always exists an assignment α such that ψ
dil
α is 1-strongly convex and
α = β · 1 for some β > 0, where 1 is the all-one vector over HZ. In this way, the value of η we refer to
187
in Figure 3.1 is actually η/β. However, as we mentioned, the final η used in the experiments may still be
larger than what the theorems suggest. Showing similar results while allowing larger η is an interesting
future direction.
Lemma 85. For an assignment α such that ψ
dil
α is 1-strongly convex, ψ
dil
α′ is also 1-strongly convex where
α′ = ∥α∥∞1.
Proof. Recall the definition of regularizer ψ
dil
α (z) in Equation (3.3). Since each term zσ(h)
· ψ
zh
zσ(h)
is
convex in z (thus with a non-negative Bregman divergence), Dψdil
α
is an increasing function in any variable
αh (h ∈ HZ). Therefore, we have
∥z − z
′∥
2
2
≤ Dψdil
α
(z, z
′
) ≤ Dψ
dil
α′
(z, z
′
),
which completes the proof.
B.3 Omitted Details of Section 3.3
When introducing VOMWU in Section 3.3, we mention that Ψvan is 1-strongly convex with respect to the
2-norm. In the following, we formally show this result. Before that, we first show a technical lemma.
Lemma 86. For u, v ∈ [0, 1], the following inequalities hold:
(u − v)
2
2
≤ u ln u
v
− u + v ≤
(u − v)
2
v
. (B.1)
Proof. Define f(u, v) = u ln
u
v
− u + v −
(u−v)
2
2
and g(u, v) = (u−v)
2
v − u ln
u
v
+ u − v. To prove the
claim, it is sufficient to show that the minimum of each of the two functions is zero. Since both functions
have the only critical point (0, 0), there is no extreme point in the interior. Also, it is straightforward to
1
find a (u, v) such that f(u, v), g(u, v) > 0 in the interior. Thus, it remains to check if the boundary of the
domain satisfies f(u, v), g(u, v) ≥ 0. For u = 0, we have
(0 − v)
2
2
=
v
2
2
≤ v =
(0 − v)
2
v
.
The case for v = 0 is trivial. For u = 1, note that
(v − 1)2
2
≤ v − 1 − ln(v) ≤
(v − 1)2
v
when 0 ≤ v ≤ 1. For v = 1, we have
(u − 1)2
2
≤ u ln u − u + 1 ≤ (u − 1)2
when 0 ≤ u ≤ 1. Therefore, we conclude f(u, v), g(u, v) ≥ 0 and this finishes the proof.
Now we are ready to give the result.
Lemma 87. Ψvan is 1-strongly convex with respect to the 2-norm.
Proof. The result follows by the first inequality of Equation (B.1) and 0 ≤ zi ≤ 1 for all z ∈ Z.
B.4 CFR-Based Algorithms and Proof of Theorem 12
In this section, we first introduce CFR, CFR+, and their optimistic versions. Then we show Theorem 12 in
Appendix B.4.3 and their empirical last-iterate divergence in Appendix B.4.4.
189
B.4.1 CFR and Its Optimistic Version
Given P-dimensional treeplex Z, loss vector ℓ ∈ R
P , and z ∈ Z, we recursively define the value vector
L ∈ R
P , to be
Li = ℓi +
X
g∈Hi
⟨qg, Lg⟩
for every index i (recall that qi = zi/zpi
). At time t, given q
t
, loss vector ℓ
t
, and its value vector Lt
, denote
Regg
t,j = ⟨qt,g, Lt,g⟩ − Lt,j , Regg
0,j = 0, and Regg
t,j = Regg
t−1,j + Regg
t,j for every simplex g ∈ HZ and
index j ∈ Ωg. In the literature, Counterfactual Regret Minimization (CFR) [166] refers to the algorithm
running regret matching on every simplex in a treeplex. Specifically, on simplex g at time t + 1, regret
matching plays arbitrarily when P
j∈Ωg
h
Regg
t,ji+
= 0, where [x]
+ = max(0, x); otherwise, it plays
q
t+1
i =
h
Regg
t,ii+
P
j∈Ωg
h
Regg
t,ji+
,
for all i in Ωg. In the two-player zero-sum setting, player x runs CFR on every simplex in treeplex X along
with point x
t ∈ X and loss vector Lx
t = Gyt
, and player y runs CFR on every simplex in Y along with
point y
t ∈ Y and loss vector L
y
t = −G⊤x
t
for every time t.
The optimistic version of CFR [52] is running the optimistic version of regret matching on every simplex in a treeplex. Specifically, the algorithm plays arbitrarily when P
j∈Ωg
h
Regg
t,j + Regg
t,ji+
= 0; otherwise,
q
t+1
i =
h
Regg
t,i + Regg
t,ii+
P
j∈Ωg
h
Regg
t,j + Regg
t,ji+
.
190
To get an approximate Nash equilibrium at time t, CFR and its optimistic version consider the average
iterate, that is, they return
(x
t
, y
t
) =
1
t
X
t
τ=1
xτ ,
1
t
X
t
τ=1
yτ
!
.
B.4.2 CFR+ and Its Optimistic Version
To introduce CFR+, we first introduce another regret-minimization algorithm on simplex, regret matching+.
Similar to Regg
t,j , we define Reg dg
0,j = 0 and
Reg dg
t,j =
h
Reg dg
t−1,j + Regg
t,ji+
,
for every simplex g ∈ HZ and index j ∈ Ωg. On simplex g at time t+ 1, regret matching+ plays arbitrarily
when P
j∈Ωg
h
Reg dg
t,ji+
= 0; otherwise, it plays
q
t+1
i =
h
Reg dg
t,ii+
P
j∈Ωg
h
Reg dg
t,ji+
.
CFR+ [152] refers to running regret matching+ on every simplex in a treeplex. In the two-player zero-sum
setting, CFR+ usually refers to the one with alternating updates. Specifically, player x runs CFR+ on every
simplex in HX with loss vector Lx
t = Gyt
and player y runs CFR on every simplex in HY with loss vector
L
y
t = −G⊤x
t+1 (note that in the case with simultaneous updates, L
y
t = −G⊤x
t
).
191
The optimistic version of CFR+ [52] is running the optimistic version of regret matching+ on every
simplex in a treeplex. Specifically, the algorithm plays arbitrarily when P
j∈Ωg
h
Reg dg
t,j + Regg
t,ji+
= 0;
otherwise, it plays
q
t+1
i =
h
Reg dg
t,i + Regg
t,ii+
P
j∈Ωg
h
Reg dg
t,j + Regg
t,ji+
.
Regarding the averaging scheme, CFR+ usually refers to the version with linear averaging to get an approximate Nash equilibrium at time t. Specifically, it returns
(x
t
, y
t
) =
2
t(t + 1)
X
t
τ=1
τ · xτ ,
2
t(t + 1)
X
t
τ=1
τ · yτ
!
.
In summary, CFR+ and its optimistic version refer to running regret matching+ and optimistic regret
matching+ on every simplex with alternating updates and linear averaging.
B.4.3 Proof of Theorem 12
Proof of Theorem 12. Note that in this instance, there is only one simplex g
x for player x and one simplex
g
y
for player y. The game matrix G of the rock-paper-scissors is
G =
0 −1 1
1 0 −1
−1 1 0
= −G⊤.
We consider the case when x1 = y1. Recall that we have Lx
t = Gyt
and L
y
t = −G⊤x
t
. Therefore, we
know that
L
x
1 = G⊤y1 = G⊤x1 = −G⊤x1 = L
y
1
,
192
and
Regg
x
1,j = x
⊤
1 L
x
1 − L
x
1,j = x
⊤
1 Gx1 − L
x
1,j = −L
x
1,j = −L
y
1,j = Regg
y
1,j
for j = 1, 2, 3. It is not hard to see that in this case, we have x2 = y2, and thus x
t = y
t
for every t.
Consequently, it is sufficient to focus on the updates of x
t
. For notional convenience, we write
Regt = Regg
x
t = G⊤x
t = (xt,2 − xt,3, xt,3 − xt,1, xt,1 − xt,2)
⊤, (B.2)
Regt = Regg
x
t = Regg
x
t−1 + Regt =
X
t
τ=1
Regτ
, (B.3)
and thus
xt+1,j =
h
Regt,ji+
h
Regt,1
i+
+
Regt,2
+ +
h
Regt,3
i+
(B.4)
for j = 1, 2, 3. We call distribution x
t
imbalanced if there exists a permutation λ of {1, 2, 3} such that
xt,λ(1) ≥ xt,λ(2) ≥ 0 = xt,λ(3).
We prove that if x1 is imbalanced, then every x
t
is imbalanced. Suppose at some time t, x
t
is imbalanced
and the corresponding λ is the identity without loss of generality, that is,
xt,1 ≥ xt,2 ≥ 0 = xt,3. (B.5)
193
In this case, we know that Regt−1,1 > 0. By Equation (B.5) and Equation (B.2), we also know that Regt,1 ≥
0, Regt,3 ≥ 0, Regt,2 < 0, and Regt,1 > 0. Moreover, by Equation (B.2) and Equation (B.3), we can get
Regt,1 + Regt,2 + Regt,3 = 0, Regt,1 + Regt,2 + Regt,3 = 0.
Therefore, we have Regt,2+Regt,3 < 0, which means that at least one of xt+1,2 and xt+1,3 is zero. Moreover,
we have
Regt,1 = Regt−1,1 + Regt,1 ≥ Regt−1,1 ≥ Regt−1,2 > Regt−1,2 + Regt,2 = Regt,2
,
where the second inequality follows from xt,1 ≥ xt,2 and Equation (B.4). The inequalities above imply
that xt+1,1 > xt+1,2. Thus, we get one of the following three situations continues to hold at time t + 1:
xt+1,1 ≥ xt+1,2 ≥ 0 = xt+1,3, (B.6)
xt+1,1 ≥ xt+1,3 ≥ 0 = xt+1,2, (B.7)
xt+1,3 ≥ xt+1,1 ≥ 0 = xt+1,2. (B.8)
If Equation (B.6) holds at time t+1, the same argument implies one of the three arguments above continues
to hold; otherwise, if at some time step τ > t, Equation (B.7) holds, that is,
xτ,1 ≥ xτ,3 ≥ 0 = xτ,2. (B.9)
194
Similarly, we know that Regτ,1 ≤ 0, Regτ,2 ≤ 0 and Regτ,3 ≥ 0, and xτ+1,2 = 0. Thus, we get either
Equation (B.9) continues to hold at time τ + 1 or
xτ+1,3 ≥ xτ+1,1 ≥ 0 = xτ+1,2
holds, which is exactly the same permutation in Equation (B.8). With similar arguments, we know that
for every imbalanced distribution, either the same permutation holds in the next round, or it transits to
another imbalanced distribution with another permutation. Note that the average iterate of the sequence
{x
t}
t
converges to the uniform distribution as CFR is a no-regret algorithm, so x
t never converges to any
imbalanced distribution. Therefore, we conclude that x
t diverges if the algorithm starts from x1 = y1
being an arbitrary imbalanced distribution.
B.4.4 Experiments
Besides Theorem 12, we empirically observe divergence of CFR, CFR+ with simultaneous updates, and
their optimistic versions in the rock-paper-scissors game. The results are shown in Figure B.5. We remark
that in these experiments, we consider CFR+ with simultaneous updates instead of the more commonly
used ones (alternating updates). In fact, we observe that with alternating updates, the optimistic CFR+
empirically has last-iterate convergence in some matrix games. As all of our theoretical results are for
simultaneous updates, the theoretical justification of this observation is beyond the scope of this thesis,
but it is an interesting direction for future works.
B.5 Proofs of Theorem 15 and Theorem 16
In this section, we show the proof of Theorem 15 in Appendix B.5.4 and the proof of Theorem 16 in
Appendix B.5.5. We generally follow the outline in Section 3.5.3. We discuss the technical difficulty to
195
0 1000 2000 3000 4000 5000
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
0 100 200 300 400 500
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
0 1000 2000 3000 4000 5000
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
0 100 200 300 400 500
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
Figure B.5: Last-iterate divergence of CFR, CFR+ with simultaneous updates, and their optimistic versions
in the rock-paper-scissors game. The first row is the CFR algorithms while the second row is the CFR+
algorithms. For both rows, we put the vanilla version on the left and the optimistic version on the right.
All algorithms start with x1 = y1 = (1, 0, 0)⊤.
196
get a convergence rate for DOGDA in Appendix B.5.6. Throughout the section, we assume that G has a
unique Nash equilibrium z
∗ = (x
∗
, y
∗
) (call it the uniqueness assumption). For the sake of analysis, we
equivalently define Ψdil
α as
Ψ
dil
α (z) = X
i
αh(i)zi
ln P
zi
j∈Ωh(i)
zj
.
Note that under this definition, we have
∂Ψdil
α (z)
∂zi
= αh(i)
ln
P
zi
j∈Ωh(i)
zj
!
+ 1 −
X
k∈Ωh(i)
P
zk
j∈Ωh(i)
zj
= αh(i)
ln qi
, (B.10)
and
DΨdil
α
(z, z
′
) = X
i
αh(i)
zi
ln(qi) − z
′
i
ln(q
′
i
) − (zi − z
′
i
) ln(q
′
i
)
=
X
i
αh(i)zi
ln qi
q
′
i
. (B.11)
B.5.1 Strict Complementary Slackness and Proof of Equation (3.6) in EFGs
In this subsection, we prove Equation (3.6), which is restated in Lemma 88.
Lemma 88. Under the uniqueness assumption, for both VOMWU and DOMWU, there exists some constant
C12 > 0 that depends on the game, η, and zb
1
such that for any t,
ζ
t = Dψ(zb
t+1
, z
t
) + Dψ(z
t
, zb
t
) ≥ C12∥z
∗ − zb
t+1∥
2
. (B.12)
Before proving this lemma, we show some useful properties of EFGs. The first one is the strict complementary slackness.
197
B.5.1.1 Strict Complementary Slackness
We formulate the minimax problem in Equation (3.1) as a linear program. This is a standard procedure
in the literature (see, for example, [130, Section 3.11]), but we show its derivation here for completeness.
Based on Definition 2, we have x ∈ X if x satisfies xi ≥ 0 for every i and
x0 = 1,
X
i∈Ωh
xi = xσ(h)
, ∀h ∈ HX . (B.13)
We can write the constraints in Equation (B.13) using a matrix A ∈ R(|HX |+1)×M such that x ∈ X
if and only if Ax = e, x ≥ 0, where all entries in e are zero except for e0 = 1, which corresponds
to the constraint x0 = e0 = 1. Similarly, for player y, we have the constraint matrix B and vector f.
Consequently, we write the best response of x to a fixed y to as the following linear program:
min
x
x
⊤(Gy), subject to Ax = e, x ≥ 0.
The dual of this linear program is
\[
\max_{V}\; e^\top V \quad \text{subject to} \quad A^\top V \le Gy,
\]
where V ∈ R^{|H_X|+1}. Recall that all entries of e are zero except e_0 = 1, so the objective is e^\top V = V_0. On the other hand, player y tries to maximize x^\top G y, that is, maximize V_0 by strong duality. Therefore, every y* ∈ Y* is a solution to the following maximin problem
\[
\max_{V, y}\; V_0 \quad \text{subject to} \quad A^\top V \le Gy,\; By = f,\; y \ge 0, \tag{B.14}
\]
which is also a linear program. The dual of this linear program is
\[
\min_{U, x}\; U_0 \quad \text{subject to} \quad B^\top U \ge G^\top x,\; Ax = e,\; x \ge 0. \tag{B.15}
\]
It is also not hard to see that every x* ∈ X* is a solution to this dual. We conclude that Equation (B.14) and Equation (B.15) are primal-dual linear programs of the minimax problem in Equation (3.1). Given an optimal solution pair (x*, y*) along with V*, U*, by complementary slackness, there exist slackness variables w* ∈ R^M, s* ∈ R^N such that x* ⊙ w* = 0 and y* ⊙ s* = 0 (⊙ denotes the element-wise product), and
\[
A^\top V^* - Gy^* + w^* = 0, \qquad B^\top U^* - G^\top x^* - s^* = 0, \qquad w^*, s^* \ge 0. \tag{B.16}
\]
Thus, Equation (B.16) implies that for every index i of player x,
\[
V^*_{h(i)} - \sum_{g\in\mathcal{H}_i} V^*_g - (Gy^*)_i = (A^\top V^*)_i - (Gy^*)_i = (A^\top V^* - Gy^*)_i = -w^*_i. \tag{B.17}
\]
Additionally, strict complementary slackness (see, for example, [156, Theorem 10.7]) ensures that there exists an optimal solution (x*, y*) such that
\[
x^* + w^* > 0, \qquad y^* + s^* > 0. \tag{B.18}
\]
Under the uniqueness assumption, strict complementary slackness must hold for the unique optimal solution (x*, y*). Therefore, when x*_i > 0, we have w*_i = 0, which means that Equation (B.17) implies
\[
V^*_h = (Gy^*)_i + \sum_{g\in\mathcal{H}_i} V^*_g, \qquad \forall h\in\mathcal{H}_X,\; i\in\Omega_h,\; x^*_i > 0; \tag{B.19}
\]
otherwise, if x*_i = 0 and w*_i > 0, by Equation (B.17), we have
\[
V^*_h < (Gy^*)_i + \sum_{g\in\mathcal{H}_i} V^*_g, \qquad \forall h\in\mathcal{H}_X,\; i\in\Omega_h,\; x^*_i = 0. \tag{B.20}
\]
Note that for every terminal index i, H_i is empty and we set the term \sum_{g\in\mathcal{H}_i} V^*_g to zero correspondingly. The case is similar for U* and G^\top x*. We summarize this result as the following lemma.
Lemma 89. Under the uniqueness assumption, we have
\[
\sum_{g\in\mathcal{H}_i} V^*_g + (Gy^*)_i = V^*_{h(i)} \quad \forall i \in \mathrm{supp}(x^*),
\qquad
\sum_{g\in\mathcal{H}_i} V^*_g + (Gy^*)_i > V^*_{h(i)} \quad \forall i \notin \mathrm{supp}(x^*),
\]
\[
\sum_{g\in\mathcal{H}_j} U^*_g + (G^\top x^*)_j = U^*_{h(j)} \quad \forall j \in \mathrm{supp}(y^*),
\qquad
\sum_{g\in\mathcal{H}_j} U^*_g + (G^\top x^*)_j < U^*_{h(j)} \quad \forall j \notin \mathrm{supp}(y^*).
\]
B.5.1.2 Some Problem-Dependent Constants
After introducing the strict complementary slackness, we are ready to introduce some problem-dependent
constants. Note that by Lemma 89, we have the following constant ξ > 0.
Definition 9. Under the uniqueness assumption, we define
\[
\xi \triangleq \min\left\{ \min_{i\notin\mathrm{supp}(x^*)} \left( \sum_{g\in\mathcal{H}_i} V^*_g + (Gy^*)_i - V^*_{h(i)} \right),\;
\min_{j\notin\mathrm{supp}(y^*)} \left( U^*_{h(j)} - (G^\top x^*)_j - \sum_{g\in\mathcal{H}_j} U^*_g \right) \right\} \in (0, 2].
\]
Note that ξ ≤ 2 follows from the fact that for any information set h ∈ H_X and indices i, j ∈ Ω_h such that i ∉ supp(x*) and j ∈ supp(x*), by Lemma 89, we have
\[
\xi \le \sum_{g\in\mathcal{H}_i} V^*_g + (Gy^*)_i - V^*_h = (Gy^*)_i - (Gy^*)_j \le 2.
\]
Below, we define V*(Z) = V*(X) × V*(Y), where
\[
V^*(\mathcal{X}) \triangleq \{x : x \in \mathcal{X},\; \mathrm{supp}(x) \subseteq \mathrm{supp}(x^*)\}
\quad\text{and}\quad
V^*(\mathcal{Y}) \triangleq \{y : y \in \mathcal{Y},\; \mathrm{supp}(y) \subseteq \mathrm{supp}(y^*)\}.
\]
Definition 10.
\[
c_x \triangleq \min_{x\in\mathcal{X}\setminus\{x^*\}}\; \max_{y\in V^*(\mathcal{Y})} \frac{(x - x^*)^\top G y}{\|x - x^*\|_1},
\qquad
c_y \triangleq \min_{y\in\mathcal{Y}\setminus\{y^*\}}\; \max_{x\in V^*(\mathcal{X})} \frac{x^\top G (y^* - y)}{\|y^* - y\|_1}.
\]
The following lemma shows that cx and cy are well-defined even though the outer minimization is over
an open set. The proof generally follows [161, Lemma 14] but requires the results derived in Appendix B.5.1.
Lemma 90. cx and cy are well-defined, and 0 < cx, cy ≤ 1.
Proof. We first show that c_x and c_y are well-defined. To simplify notation, define x*_min ≜ min_{i∈supp(x*)} x*_i and X' ≜ {x : x ∈ X, ‖x − x*‖_1 ≥ x*_min}, and define y*_min and Y' similarly. We will show that
\[
c_x = \min_{x\in\mathcal{X}'}\; \max_{y\in V^*(\mathcal{Y})} \frac{(x - x^*)^\top G y}{\|x - x^*\|_1},
\qquad
c_y = \min_{y\in\mathcal{Y}'}\; \max_{x\in V^*(\mathcal{X})} \frac{x^\top G (y^* - y)}{\|y^* - y\|_1},
\]
which are well-defined as the outer minimization is now over a closed set. To prove the equality for c_x, it suffices to show that for any x ∈ X such that x ≠ x* and ‖x − x*‖_1 < x*_min, there exists x' ∈ X such that ‖x' − x*‖_1 = x*_min and
\[
\frac{(x - x^*)^\top G y}{\|x - x^*\|_1} = \frac{(x' - x^*)^\top G y}{\|x' - x^*\|_1} \qquad \forall y. \tag{B.21}
\]
In fact, we can simply choose x' = x* + (x − x*)·\frac{x^*_{\min}}{\|x - x^*\|_1}. We first argue that x' is still in X. For each index j, if x_j − x*_j ≥ 0, we surely have x'_j ≥ x*_j + 0 ≥ 0; otherwise, x*_j > x_j ≥ 0, and thus j ∈ supp(x*) and x*_j ≥ x*_min, which implies x'_j ≥ x*_j − |x_j − x*_j|·\frac{x^*_{\min}}{\|x - x^*\|_1} ≥ x*_j − x*_min ≥ 0. In addition, for any h ∈ H_X,
\[
\sum_{j\in\Omega_h} x'_j
= \frac{x^*_{\min}}{\|x - x^*\|_1}\sum_{j\in\Omega_h} x_j + \left(1 - \frac{x^*_{\min}}{\|x - x^*\|_1}\right)\sum_{j\in\Omega_h} x^*_j
= \frac{x^*_{\min}}{\|x - x^*\|_1}\, x_{\sigma(h)} + \left(1 - \frac{x^*_{\min}}{\|x - x^*\|_1}\right) x^*_{\sigma(h)}
= x'_{\sigma(h)}.
\]
Therefore, we conclude x' ∈ X. Moreover, according to the definition of x', ‖x' − x*‖_1 = x*_min holds. Also, since x* − x and x* − x' are parallel vectors, Equation (B.21) is satisfied. The arguments above show that c_x in Definition 10 is a well-defined real number. The case of c_y is similar.
Now we show 0 < c_x, c_y ≤ 1. The fact that c_x, c_y ≤ 1 is a direct consequence of the definitions. Below, we use contradiction to prove that c_y > 0. First, if c_y < 0, then there exists y ≠ y* such that x*^\top G y* < x*^\top G y. This contradicts the fact that (x*, y*) is the equilibrium.
On the other hand, if c_y = 0, then there is some y ≠ y* such that
\[
\max_{x\in V^*(\mathcal{X})} x^\top G(y^* - y) = 0. \tag{B.22}
\]
Consider the point y' = y* + \frac{\xi}{2N}(y − y*) (recall the definition of ξ in Definition 9 and that 0 < ξ ≤ 2), which is a convex combination of y* and y, and hence y' ∈ Y. Then, for any x ∈ X,
\[
\begin{aligned}
x^\top G y' &= \sum_{i\notin\mathrm{supp}(x^*)} x_i (Gy')_i + \sum_{i\in\mathrm{supp}(x^*)} x_i (Gy')_i \\
&\ge \sum_{i\notin\mathrm{supp}(x^*)} \Big( x_i (Gy^*)_i - x_i\|y' - y^*\|_1 \Big)
+ \sum_{i\in\mathrm{supp}(x^*)} \Big( \tfrac{\xi}{2N}\, x_i \big(G(y - y^*)\big)_i + x_i (Gy^*)_i \Big) \\
&\qquad \text{(using $G_{ij}\in[-1,1]$ for the first part and $y' = y^* + \tfrac{\xi}{2N}(y - y^*)$ for the second)} \\
&\ge \sum_{i\notin\mathrm{supp}(x^*)} \Big( x_i (Gy^*)_i - x_i\|y' - y^*\|_1 \Big)
+ \sum_{i\in\mathrm{supp}(x^*)} x_i \Big( V^*_{h(i)} - \sum_{g\in\mathcal{H}_i} V^*_g \Big),
\end{aligned}
\]
where the last inequality is due to Equation (B.22) and Lemma 89. We continue to bound the terms above, which are bounded by
\[
\begin{aligned}
&\ge \sum_{i\notin\mathrm{supp}(x^*)} x_i \big( (Gy^*)_i - \xi \big)
+ \sum_{i\in\mathrm{supp}(x^*)} x_i \Big( V^*_{h(i)} - \sum_{g\in\mathcal{H}_i} V^*_g \Big)
\quad \text{(using $y' - y^* = \tfrac{\xi}{2N}(y - y^*)$ and $\|y - y^*\|_1 \le 2N$)} \\
&\ge \sum_{i\notin\mathrm{supp}(x^*)} x_i \Big( V^*_{h(i)} - \sum_{g\in\mathcal{H}_i} V^*_g \Big)
+ \sum_{i\in\mathrm{supp}(x^*)} x_i \Big( V^*_{h(i)} - \sum_{g\in\mathcal{H}_i} V^*_g \Big)
\quad \text{(by the definition of $\xi$)} \\
&= \sum_i x_i \Big( V^*_{h(i)} - \sum_{g\in\mathcal{H}_i} V^*_g \Big).
\end{aligned}
\]
The last term can be written in matrix form as x^\top A^\top V^* = e^\top V^* = V^*_0. This shows that \min_{x\in\mathcal{X}} x^\top G y' \ge V^*_0, that is, y' ≠ y* is also a maximin point, contradicting the uniqueness assumption. Therefore, c_y > 0 has to hold, and so does c_x > 0 by the same argument.
We continue to define more constants in the following.
Definition 11. Define the constants z_min ≜ min_i ẑ_{1,i} ∈ (0, 1],
\[
\epsilon_{\mathrm{van}} \triangleq \min_{j\in\mathrm{supp}(z^*)} \exp\left( -\frac{P(1 + \ln(1/z_{\min}))}{z^*_j} \right) \in (0, 1),
\]
and
\[
\epsilon_{\mathrm{dil}} \triangleq \min_{j\in\mathrm{supp}(z^*)} \left\{ z^*_j \cdot \exp\left( -\frac{\|\alpha\|_\infty P^2 \ln(1/z_{\min})}{z^*_j} \right) \cdot \left(\frac{3}{4}\right)^{P} \right\} \in (0, 1).
\]
For all i ∈ supp(z*), we will show that ε_van is a lower bound on ẑ^t_i for VOMWU, while ε_dil is a lower bound on ẑ^t_i and z^t_i for DOMWU. We state the results in Lemma 95 and Lemma 96 and defer the proofs there.
B.5.1.3 Proof of Lemma 88
We are almost ready to prove Lemma 88. Before that, we first show the following auxiliary lemma.
Lemma 91. For any z ∈ Z, we have
\[
\max_{z'\in V^*(\mathcal{Z})} F(z)^\top (z - z') \ge C\,\|z^* - z\|_1,
\]
for C = min{c_x, c_y} ∈ (0, 1].
Proof. Recall that V^*_0 = x^{*\top} G y^* is the game value and note that
\[
\begin{aligned}
\max_{z'\in V^*(\mathcal{Z})} F(z)^\top (z - z')
&= \max_{z'\in V^*(\mathcal{Z})} \Big( (x - x')^\top G y + x^\top G (y' - y) \Big)
= \max_{z'\in V^*(\mathcal{Z})} \Big( -x'^\top G y + x^\top G y' \Big) \\
&= \max_{x'\in V^*(\mathcal{X})} \Big( V^*_0 - x'^\top G y \Big) + \max_{y'\in V^*(\mathcal{Y})} \Big( x^\top G y' - V^*_0 \Big) \\
&= \max_{x'\in V^*(\mathcal{X})} x'^\top G (y^* - y) + \max_{y'\in V^*(\mathcal{Y})} (x - x^*)^\top G y' \\
&\ge c_y \|y^* - y\|_1 + c_x \|x^* - x\|_1 \quad \text{(by Definition 10)} \\
&\ge \min\{c_x, c_y\}\,\|z^* - z\|_1,
\end{aligned}
\]
where the third equality is due to Equation (B.16), x' ∈ V*(X), and
\[
V^*_0 = e^\top V^* = (x'^\top A^\top) V^* = x'^\top (A^\top V^*) = x'^\top (G y^*);
\]
the case for y' is similar. This completes the proof.
Now we show the proof of Lemma 88.
Proof of Lemma 88. Below we consider any z' ∈ Z such that supp(z') ⊆ supp(z*), that is, z' ∈ V*(Z). Considering Equation (3.2), and using the first-order optimality condition of ẑ^{t+1}, we have
\[
\Big( \nabla\psi(\hat z^{t+1}) - \nabla\psi(\hat z^{t}) + \eta F(z^t) \Big)^\top (z' - \hat z^{t+1}) \ge 0. \tag{B.23}
\]
Rearranging the terms, we get
\[
\eta F(z^t)^\top \big( \hat z^{t+1} - z' \big) \le \big( \nabla\psi(\hat z^{t+1}) - \nabla\psi(\hat z^{t}) \big)^\top \big( z' - \hat z^{t+1} \big). \tag{B.24}
\]
The left-hand side of Equation (B.24) is lower bounded as
\[
\begin{aligned}
\eta F(z^t)^\top \big(\hat z^{t+1} - z'\big)
&= \eta F(\hat z^{t+1})^\top \big(\hat z^{t+1} - z'\big) + \eta\big( F(z^t) - F(\hat z^{t+1}) \big)^\top \big(\hat z^{t+1} - z'\big) \\
&\ge \eta F(\hat z^{t+1})^\top \big(\hat z^{t+1} - z'\big) - \eta\big\|F(z^t) - F(\hat z^{t+1})\big\|_\infty \big\|\hat z^{t+1} - z'\big\|_1 \\
&\ge \eta F(\hat z^{t+1})^\top \big(\hat z^{t+1} - z'\big) - 2P\eta\big\|z^t - \hat z^{t+1}\big\|_1 \\
&\ge \eta F(\hat z^{t+1})^\top \big(\hat z^{t+1} - z'\big) - \frac{1}{4}\big\|z^t - \hat z^{t+1}\big\|_1.
\end{aligned}
\]
When ψ = Ψ^van, we have
\[
\big( \nabla\Psi^{\mathrm{van}}(\hat z^{t+1}) - \nabla\Psi^{\mathrm{van}}(\hat z^{t}) \big)_i = (1 + \ln \hat z^{t+1}_i) - (1 + \ln \hat z^{t}_i) = \ln\frac{\hat z^{t+1}_i}{\hat z^{t}_i}. \tag{B.25}
\]
On the other hand, when ψ = Ψ^dil_α, by Equation (B.10), we have
\[
\big( \nabla\Psi^{\mathrm{dil}}_{\alpha}(\hat z^{t+1}) - \nabla\Psi^{\mathrm{dil}}_{\alpha}(\hat z^{t}) \big)_i = \alpha_i \ln\frac{\hat q^{t+1}_i}{\hat q^{t}_i}. \tag{B.26}
\]
Therefore, the right-hand side of Equation (B.24) for VOMWU is upper bounded by
\[
\begin{aligned}
\big( \nabla\Psi^{\mathrm{van}}(\hat z^{t+1}) - \nabla\Psi^{\mathrm{van}}(\hat z^{t}) \big)^\top (z' - \hat z^{t+1})
&= \sum_i \big( z'_i - \hat z^{t+1}_i \big) \ln\frac{\hat z^{t+1}_i}{\hat z^{t}_i}
\quad \text{(Equation (B.25))} \\
&\le \|\hat z^{t+1} - \hat z^{t}\|_1 - D_{\Psi^{\mathrm{van}}}(\hat z^{t+1}, \hat z^{t}) + \sum_{i\in\mathrm{supp}(z^*)} z'_i \ln\frac{\hat z^{t+1}_i}{\hat z^{t}_i}
\quad (\mathrm{supp}(z') \subseteq \mathrm{supp}(z^*)) \\
&\le \|\hat z^{t+1} - \hat z^{t}\|_1 + \sum_{i\in\mathrm{supp}(z^*)} \ln\frac{\hat z^{t+1}_i}{\hat z^{t}_i} \\
&\le \|\hat z^{t+1} - \hat z^{t}\|_1 + \sum_{i\in\mathrm{supp}(z^*)} \ln\left( 1 + \frac{|\hat z^{t+1}_i - \hat z^{t}_i|}{\min\{\hat z^{t+1}_i, \hat z^{t}_i\}} \right) \\
&\le \|\hat z^{t+1} - \hat z^{t}\|_1 + \sum_{i\in\mathrm{supp}(z^*)} \frac{|\hat z^{t+1}_i - \hat z^{t}_i|}{\min\{\hat z^{t+1}_i, \hat z^{t}_i\}}
\quad (\ln(1 + a) \le a) \\
&\le \frac{2}{\epsilon_{\mathrm{van}}} \|\hat z^{t+1} - \hat z^{t}\|_1;
\quad \text{(Lemma 95 and $\epsilon_{\mathrm{van}} \le 1$)}
\end{aligned}
\]
on the other hand, the right-hand side of Equation (B.24) for DOMWU is upper bounded by
\[
\begin{aligned}
\big( \nabla\Psi^{\mathrm{dil}}_{\alpha}(\hat z^{t+1}) - \nabla\Psi^{\mathrm{dil}}_{\alpha}(\hat z^{t}) \big)^\top (z' - \hat z^{t+1})
&= \sum_i \alpha_i \big( z'_i - \hat z^{t+1}_i \big) \ln\frac{\hat q^{t+1}_i}{\hat q^{t}_i}
\quad \text{(Equation (B.26))} \\
&= \sum_i \alpha_i z'_i \ln\frac{\hat q^{t+1}_i}{\hat q^{t}_i} - \sum_i \alpha_i \hat z^{t+1}_i \ln\frac{\hat q^{t+1}_i}{\hat q^{t}_i} \\
&= \sum_{i\in\mathrm{supp}(z^*)} \alpha_i z'_i \ln\frac{\hat q^{t+1}_i}{\hat q^{t}_i}
- \sum_{h\in\mathcal{H}_Z} \alpha_h \hat z_{t+1,\sigma(h)} \sum_{j\in\Omega_h} \hat q_{t+1,j} \ln\frac{\hat q_{t+1,j}}{\hat q^{t}_j} \\
&\le \|\alpha\|_\infty \sum_{i\in\mathrm{supp}(z^*)} \ln\frac{\hat q^{t+1}_i}{\hat q^{t}_i}
\quad \Big( \textstyle\sum_{j\in\Omega_h} \hat q_{t+1,j} \ln\tfrac{\hat q_{t+1,j}}{\hat q^{t}_j} \ge 0 \Big) \\
&\le \|\alpha\|_\infty \sum_{i\in\mathrm{supp}(z^*)} \ln\left( 1 + \frac{|\hat q^{t+1}_i - \hat q^{t}_i|}{\min\{\hat q^{t+1}_i, \hat q^{t}_i\}} \right) \\
&\le \|\alpha\|_\infty \sum_{i\in\mathrm{supp}(z^*)} \frac{|\hat q^{t+1}_i - \hat q^{t}_i|}{\min\{\hat q^{t+1}_i, \hat q^{t}_i\}}
\quad (\ln(1 + a) \le a) \\
&\le \frac{\|\alpha\|_\infty}{\epsilon_{\mathrm{dil}}} \sum_{i\in\mathrm{supp}(z^*)} \big| \hat q^{t+1}_i - \hat q^{t}_i \big|.
\quad \text{(Lemma 96)}
\end{aligned}
\]
Since z' can be chosen as any point in V*(Z), we further lower bound the left-hand side of Equation (B.24) using Lemma 91 and get, for VOMWU,
\[
\eta C \|z^* - \hat z^{t+1}\|_1
\le \frac{2}{\epsilon_{\mathrm{van}}} \|\hat z^{t+1} - \hat z^{t}\|_1 + \frac{\|z^t - \hat z^{t+1}\|_1}{4}
\le \frac{2}{\epsilon_{\mathrm{van}}} \Big( \|\hat z^{t+1} - \hat z^{t}\|_1 + \|z^t - \hat z^{t+1}\|_1 \Big), \tag{B.27}
\]
and for DOMWU,
\[
\begin{aligned}
\eta C \|z^* - \hat z^{t+1}\|_1
&\le \frac{1}{4}\|z^t - \hat z^{t+1}\|_1 + \frac{\|\alpha\|_\infty}{\epsilon_{\mathrm{dil}}} \sum_{i\in\mathrm{supp}(z^*)} \big| \hat q^{t+1}_i - \hat q^{t}_i \big|
\quad \text{(Lemma 96)} \\
&\le \frac{1}{4}\|z^t - \hat z^{t+1}\|_1 + \frac{\|\alpha\|_\infty}{\epsilon_{\mathrm{dil}}} \sum_{i\in\mathrm{supp}(z^*)}
\left( \frac{|\hat z^{t+1}_i - \hat z^{t}_i|}{\hat z_{t+1,p_i}} + \frac{\hat z^{t}_i\,|\hat z_{t+1,p_i} - \hat z_{t,p_i}|}{\hat z_{t+1,p_i}\,\hat z_{t,p_i}} \right) \\
&\le \frac{1}{4}\|z^t - \hat z^{t+1}\|_1 + \frac{\|\alpha\|_\infty}{\epsilon_{\mathrm{dil}}^2} \sum_{i\in\mathrm{supp}(z^*)}
\Big( |\hat z^{t+1}_i - \hat z^{t}_i| + |\hat z_{t+1,p_i} - \hat z_{t,p_i}| \Big)
\quad \text{(Lemma 96)} \\
&\le \frac{1}{4}\|z^t - \hat z^{t+1}\|_1 + \frac{P\|\alpha\|_\infty}{\epsilon_{\mathrm{dil}}^2} \|\hat z^{t+1} - \hat z^{t}\|_1 \\
&\le \left( \frac{1}{4} + \frac{P\|\alpha\|_\infty}{\epsilon_{\mathrm{dil}}^2} \right) \Big( \|\hat z^{t+1} - \hat z^{t}\|_1 + \|z^t - \hat z^{t+1}\|_1 \Big). \tag{B.28}
\end{aligned}
\]
Squaring both sides of Equation (B.27), we get
\[
\eta^2 C^2 \|z^* - \hat z^{t+1}\|_1^2
\le \frac{4}{\epsilon_{\mathrm{van}}^2} \Big( \|\hat z^{t+1} - \hat z^{t}\|_1 + \|z^t - \hat z^{t+1}\|_1 \Big)^2
\le \frac{8}{\epsilon_{\mathrm{van}}^2} \Big( \|\hat z^{t+1} - \hat z^{t}\|_1^2 + \|z^t - \hat z^{t+1}\|_1^2 \Big). \tag{B.29}
\]
Using the strong convexity of the regularizers, the left-hand side of Equation (B.12) can be bounded by
\[
\begin{aligned}
D_\psi(\hat z^{t+1}, z^t) + D_\psi(z^t, \hat z^t)
&\ge \frac{1}{2}\|\hat z^{t+1} - z^t\|_1^2 + \frac{1}{2}\|z^t - \hat z^t\|_1^2
\quad \big(a^2 + b^2 \ge \tfrac{1}{2}(a + b)^2\big) \\
&\ge \frac{1}{8}\|\hat z^{t+1} - z^t\|_1^2 + \frac{1}{4}\Big( \|\hat z^{t+1} - z^t\|_1^2 + \|z^t - \hat z^t\|_1^2 \Big) \\
&\ge \frac{1}{8}\Big( \|\hat z^{t+1} - \hat z^t\|_1^2 + \|\hat z^{t+1} - z^t\|_1^2 \Big).
\quad \big(a^2 + b^2 \ge \tfrac{1}{2}(a + b)^2 \text{ and triangle inequality}\big)
\end{aligned}
\]
Combining this with Equation (B.29) finishes the proof for VOMWU. A similar argument works for DOMWU by combining the inequality above with the square of Equation (B.28).
B.5.2 Proofs of Equation (3.12) and Equation (3.13)
In this subsection, we prove Equation (3.12) and Equation (3.13). We first show Lemma 92 and Lemma 93. Then we get Equation (3.12) and Equation (3.13) by substituting z with ẑ^{t+1} in these lemmas.
Lemma 92. For any z ∈ Z, we have
\[
D_{\Psi^{\mathrm{van}}}(z^*, z)
\le \sum_{i\in\mathrm{supp}(z^*)} \frac{(z^*_i - z_i)^2}{\min_{i\in\mathrm{supp}(z^*)} z_i} + \sum_{i\notin\mathrm{supp}(z^*)} z_i
\le \frac{3P\|z^* - z\|}{\min_{i\in\mathrm{supp}(z^*)} z_i}.
\]
Proof. By Equation (B.1), we have
\[
D_{\Psi^{\mathrm{van}}}(z^*, z)
\le \sum_i \frac{(z^*_i - z_i)^2}{z_i}
\le \sum_{i\in\mathrm{supp}(z^*)} \frac{(z^*_i - z_i)^2}{\min_{i\in\mathrm{supp}(z^*)} z_i} + \sum_{i\notin\mathrm{supp}(z^*)} z_i
\le \frac{\|z^* - z\|^2}{\min_{i\in\mathrm{supp}(z^*)} z_i} + \|z^* - z\|_1
\le \frac{3P\|z^* - z\|}{\min_{i\in\mathrm{supp}(z^*)} z_i},
\]
where the last inequality is because ‖z* − z‖ ≤ 2P and ‖z* − z‖_1 ≤ P‖z* − z‖.
Lemma 93. For any z ∈ Z, we have
\[
D_{\Psi^{\mathrm{dil}}_{\alpha}}(z^*, z)
\le \|\alpha\|_\infty \left( \sum_{i\in\mathrm{supp}(z^*)} \frac{4P}{z^*_i}\cdot\frac{(z^*_i - z_i)^2}{q_i}
+ \sum_{i\notin\mathrm{supp}(z^*)} z^*_{p_i} q_i \right)
\le \frac{4P\|\alpha\|_\infty}{\min_{i\in\mathrm{supp}(z^*)} z^*_i z_i}\, \|z^* - z\|_1.
\]
Proof. By direct calculation and [161, Lemma 16], we have
\[
\begin{aligned}
D_{\Psi^{\mathrm{dil}}_{\alpha}}(z^*, z)
&= \sum_i \alpha_{h(i)}\, z^*_{p_i} q^*_i \ln\frac{q^*_i}{q_i}
\quad \text{(Equation (B.11))} \\
&\le \|\alpha\|_\infty \sum_i z^*_{p_i} \left( \mathbb{1}\{i\in\mathrm{supp}(z^*)\}\, \frac{(q^*_i - q_i)^2}{q_i} + \mathbb{1}\{i\notin\mathrm{supp}(z^*)\}\, q_i \right) \\
&\le \|\alpha\|_\infty \sum_{i\in\mathrm{supp}(z^*)} \frac{2 z^*_{p_i}}{q_i} \left( \Big( \frac{z^*_i - z_i}{z^*_{p_i}} \Big)^2 + \Big( \frac{z_i}{z^*_{p_i}} - \frac{z_i}{z_{p_i}} \Big)^2 \right)
+ \|\alpha\|_\infty \sum_{i\notin\mathrm{supp}(z^*)} z^*_{p_i} q_i \\
&= \|\alpha\|_\infty \sum_{i\in\mathrm{supp}(z^*)} \left( \frac{2}{q_i z^*_{p_i}} (z^*_i - z_i)^2 + \frac{2 q_i}{z^*_{p_i}} \big( z^*_{p_i} - z_{p_i} \big)^2 \right)
+ \|\alpha\|_\infty \sum_{i\notin\mathrm{supp}(z^*)} z^*_{p_i} q_i \\
&\le \|\alpha\|_\infty \sum_{i\in\mathrm{supp}(z^*)} \left( \frac{2}{z^*_{p_i}}\cdot\frac{(z^*_i - z_i)^2}{q_i} + \frac{2P}{z^*_i} (z^*_i - z_i)^2 \right)
+ \|\alpha\|_\infty \sum_{i\notin\mathrm{supp}(z^*)} z^*_{p_i} q_i
\quad (p_i\in\mathrm{supp}(z^*) \text{ for all } i\in\mathrm{supp}(z^*)) \\
&\le \|\alpha\|_\infty \sum_{i\in\mathrm{supp}(z^*)} \frac{4P}{z^*_i}\cdot\frac{(z^*_i - z_i)^2}{q_i}
+ \|\alpha\|_\infty \sum_{i\notin\mathrm{supp}(z^*)} z^*_{p_i} q_i.
\end{aligned}
\]
This proves the first inequality. The second inequality in the lemma follows from
\[
\begin{aligned}
D_{\Psi^{\mathrm{dil}}_{\alpha}}(z^*, z)
&\le \|\alpha\|_\infty \sum_{i\in\mathrm{supp}(z^*)} \frac{4P}{z^*_i}\cdot\frac{(z^*_i - z_i)^2}{q_i}
+ \|\alpha\|_\infty \sum_{i\notin\mathrm{supp}(z^*)} z^*_{p_i} q_i \\
&\le \|\alpha\|_\infty \sum_{i\in\mathrm{supp}(z^*)} \frac{4P}{z^*_i z_i}\, |z^*_i - z_i|
+ \|\alpha\|_\infty \sum_{i\notin\mathrm{supp}(z^*)} z^*_{p_i}\, \frac{z_i}{z_{p_i}}
\quad (|z^*_i - z_i| \le 1) \\
&\le \|\alpha\|_\infty \sum_{i\in\mathrm{supp}(z^*)} \frac{4P}{z^*_i z_i}\, |z^*_i - z_i|
+ \|\alpha\|_\infty \sum_{i\notin\mathrm{supp}(z^*)} \frac{\mathbb{1}\{p_i\in\mathrm{supp}(z^*)\}}{z_{p_i}}\cdot|z_i - 0| \\
&\le \frac{4P\|\alpha\|_\infty}{\min_{i\in\mathrm{supp}(z^*)} z^*_i z_i}\, \|z^* - z\|_1.
\end{aligned}
\]
B.5.3 Lower Bounds on the Probability Masses
In this subsection, we show that for all i ∈ supp(z*) and t, ẑ^t_i computed by VOMWU can be lower bounded by ε_van, while ẑ^t_i and z^t_i computed by DOMWU can be lower bounded by ε_dil, where ε_van and ε_dil are defined in Definition 11. We state the results in Lemma 95 and Lemma 96, respectively. We first state the stability of q̂^t and q^t, which directly follows from the stability of OMWU on the simplex (see, for example, [161, Lemma 17]).
Lemma 94. For η ≤ \frac{1}{8P}, DOMWU guarantees \frac{3}{4}\hat q^t_i \le q^t_i \le \frac{4}{3}\hat q^t_i and \frac{3}{4}\hat q^t_i \le \hat q^{t+1}_i \le \frac{4}{3}\hat q^t_i.
Lemma 95. For all i ∈ supp(z*) and t, VOMWU guarantees that ẑ^t_i ≥ ε_van.
Proof. Using Equation (3.5), we have
\[
D_\psi(z^*, \hat z^t) \le \Theta^t \le \cdots \le \Theta^1
= \frac{1}{16}\, D_\psi(\hat z^1, z^0) + D_\psi(z^*, \hat z^1) = D_\psi(z^*, \hat z^1), \tag{B.30}
\]
where the last equality is because ẑ^1 = z^0. Thus, D_{Ψ^van}(z*, ẑ^t) ≤ D_{Ψ^van}(z*, ẑ^1). Then, for any i ∈ supp(z*), we have
\[
\begin{aligned}
z^*_i \ln\frac{1}{\hat z^t_i}
&\le \sum_j z^*_j \ln\frac{1}{\hat z^t_j}
= D_{\Psi^{\mathrm{van}}}(z^*, \hat z^t) + \sum_j \Big( z^*_j - \hat z^t_j - z^*_j \ln z^*_j \Big) \\
&\le D_{\Psi^{\mathrm{van}}}(z^*, \hat z^1) - \sum_j z^*_j \ln z^*_j + \sum_j (z^*_j - \hat z^t_j) \\
&\le \sum_j z^*_j \ln\frac{1}{\hat z_{1,j}} + \sum_j (\hat z_{1,j} - \hat z^t_j) \\
&\le P + \sum_j z^*_j \ln\frac{1}{\hat z_{1,j}}
\le P\big(1 + \ln(1/z_{\min})\big).
\end{aligned}
\]
Therefore, we conclude that for all t and i ∈ supp(z*), ẑ^t_i satisfies
\[
\hat z^t_i \ge \exp\left( -\frac{P(1 + \ln(1/z_{\min}))}{z^*_i} \right)
\ge \min_{j\in\mathrm{supp}(z^*)} \exp\left( -\frac{P(1 + \ln(1/z_{\min}))}{z^*_j} \right) = \epsilon_{\mathrm{van}}.
\]
Lemma 96. For all i ∈ supp(z*) and t, DOMWU guarantees that q̂^t_i ≥ ẑ^t_i ≥ ε_dil and q^t_i ≥ z^t_i ≥ ε_dil.
Proof. Similar to Lemma 95, applying Equation (B.30) gives D_{Ψ^dil_α}(z*, ẑ^t) ≤ D_{Ψ^dil_α}(z*, ẑ^1). Then, for any i ∈ supp(z*), we have
\[
\begin{aligned}
z^*_i \ln\frac{1}{\hat q^t_i}
&\le \sum_j \alpha_j z^*_j \ln\frac{1}{\hat q^t_j}
= D_{\Psi^{\mathrm{dil}}_{\alpha}}(z^*, \hat z^t) - \sum_j \alpha_j z^*_j \ln q^*_j
\le D_{\Psi^{\mathrm{dil}}_{\alpha}}(z^*, \hat z^1) - \sum_j \alpha_j z^*_j \ln q^*_j \\
&= \sum_j \alpha_j z^*_j \ln\frac{1}{\hat q_{1,j}}
\le \|\alpha\|_\infty P \ln(1/z_{\min}).
\end{aligned}
\]
Therefore, we conclude that for all t and i ∈ supp(z*), q̂^t_i satisfies
\[
\hat q^t_i \ge \exp\left( -\frac{\|\alpha\|_\infty P \ln(1/z_{\min})}{z^*_i} \right)
\ge \min_{j\in\mathrm{supp}(z^*)} \exp\left( -\frac{\|\alpha\|_\infty P \ln(1/z_{\min})}{z^*_j} \right),
\]
and ẑ^t_i satisfies
\[
\hat z^t_i = \hat z_{t,p_i}\, \hat q^t_i = \hat z_{t,p_{p_i}}\, \hat q_{t,p_i}\, \hat q^t_i
\ge \prod_{j\in\mathrm{supp}(z^*)} \hat q^t_j
\ge \left( \min_{j\in\mathrm{supp}(z^*)} \exp\left( -\frac{\|\alpha\|_\infty P \ln(1/z_{\min})}{z^*_j} \right) \right)^{P}.
\]
This finishes the first part of the proof. Finally, using Lemma 94, we have for i ∈ supp(z*) that q^t_i ≥ \frac{3}{4}\hat q^t_i ≥ ε_dil and
\[
z^t_i \ge \prod_{j\in\mathrm{supp}(z^*)} q^t_j
\ge \min_{j\in\mathrm{supp}(z^*)} \exp\left( -\frac{\|\alpha\|_\infty P^2 \ln(1/z_{\min})}{z^*_j} \right) \cdot \left(\frac{3}{4}\right)^{P}
\ge \epsilon_{\mathrm{dil}}.
\]
B.5.4 Proof of Theorem 15
Based on the results in the previous subsections and the discussion in Section 3.5.3, we can get Θ^t = O(1/t) for both VOMWU and DOMWU. In this subsection, we formally state the results in Theorem 97 for both VOMWU and DOMWU, and show the proof by combining all the components. In particular, the result for VOMWU implies Theorem 15.
Theorem 97. Under the uniqueness assumption, VOMWU and DOMWU with step size η ≤ \frac{1}{8P} guarantee D_{Ψ^van}(z*, ẑ^t) ≤ \frac{C_{13}}{t} and D_{Ψ^dil_α}(z*, ẑ^t) ≤ \frac{C'_{13}}{t}, respectively, where C_{13}, C'_{13} > 0 are some constants depending on the game, ẑ^1, and η.
Proof. We start from Lemma 88. Using Lemma 92 and Lemma 95, the right-hand side of Equation (B.12) can be bounded by
\[
\zeta^t \ge C_{12}\|z^* - \hat z^{t+1}\|^2 \ge \frac{\epsilon_{\mathrm{van}}^2 C_{12}}{9P^2}\, D_{\Psi^{\mathrm{van}}}(z^*, \hat z^{t+1})^2 \tag{B.31}
\]
for VOMWU. Similarly, using Lemma 93 and Lemma 96, we have for DOMWU,
\[
\zeta^t \ge C_{12}\|z^* - \hat z^{t+1}\|^2 \ge \frac{\epsilon_{\mathrm{dil}}^4 C_{12}}{16P^2\|\alpha\|_\infty^2}\, D_{\Psi^{\mathrm{dil}}_{\alpha}}(z^*, \hat z^{t+1})^2.
\]
On the other hand, applying Equation (3.5) repeatedly, we get
\[
D_\psi(z^*, \hat z^1) = \Theta^1 \ge \cdots \ge \Theta^{t+1} \ge \frac{1}{16}\, D_\psi(\hat z^{t+1}, z^t).
\]
Thus, \zeta^t \ge D_\psi(\hat z^{t+1}, z^t) \ge C_{10}\, D_\psi(\hat z^{t+1}, z^t)^2 for some C_{10} > 0 depending on D_\psi(z^*, \hat z^1). Combining this with Equation (B.31) gives
\[
\begin{aligned}
\zeta^t &= \frac{1}{2}\Big( D_\psi(\hat z^{t+1}, z^t) + D_\psi(z^t, \hat z^t) \Big) + \frac{1}{2}\Big( D_\psi(\hat z^{t+1}, z^t) + D_\psi(z^t, \hat z^t) \Big) \\
&\ge \frac{1}{2}\cdot\frac{\epsilon_{\mathrm{van}}^2 C_{12}}{9P^2}\, D_{\Psi^{\mathrm{van}}}(z^*, \hat z^{t+1})^2 + \frac{1}{2}\cdot C_{10}\, D_{\Psi^{\mathrm{van}}}(\hat z^{t+1}, z^t)^2
\quad \text{(Equation (B.31))} \\
&\ge \frac{\epsilon_{\mathrm{van}}^2 C_{12}}{36P^2}\cdot 2\, D_{\Psi^{\mathrm{van}}}(z^*, \hat z^{t+1})^2 + \frac{C_{10}}{4}\cdot 2\, D_{\Psi^{\mathrm{van}}}(\hat z^{t+1}, z^t)^2 \\
&\ge \min\left\{ \frac{\epsilon_{\mathrm{van}}^2 C_{12}}{36P^2}, \frac{C_{10}}{4} \right\} \big( \Theta^{t+1} \big)^2
\ge C_{11}\, \big( \Theta^{t+1} \big)^2,
\end{aligned}
\]
for some C_{11} > 0. Similarly, we have \zeta^t \ge C'_{11}\,(\Theta^{t+1})^2 for DOMWU and some constant C'_{11} > 0. Applying this to Equation (3.5), we obtain the recursion \Theta^{t+1} \le \Theta^t - \frac{15}{16} C_{11}\,(\Theta^{t+1})^2. This implies \Theta^t \le \frac{C_{13}}{t} for some constant C_{13} by [161, Lemma 12]. With the same argument, we can prove the case for DOMWU.
B.5.5 Proof of Theorem 16
B.5.5.1 The Significant Difference Lemma
In this subsection, we explain how to get the linear convergence result for DOMWU. As we discuss at the end of Section 3.5.3, this requires showing that \sum_{i\notin\mathrm{supp}(z^*)} z^*_{p_i} \hat q^{t+1}_i decreases significantly as ẑ^t gets close enough to z*. The argument is shown in Lemma 100. Before that, we first show a lemma stating that for any information set h ∈ H_Z and indices i, j ∈ Ω_h such that i ∉ supp(z*) and j ∈ supp(z*), \hat L^t_i is significantly larger than \hat L^t_j when ẑ^t is close to z*.
Lemma 98. Suppose \|z^* - z\| \le \frac{\eta^2 \xi \epsilon_{\mathrm{dil}}^2}{40P^3\|\alpha\|_\infty}. Then for i ∈ Ω_h, h ∈ H_X, we have
\[
\forall i \in \mathrm{supp}(x^*):\; L_i \le V^*_h + \frac{\eta\xi}{10};
\qquad
\forall i \notin \mathrm{supp}(x^*):\; L_i \ge V^*_h + \frac{9\eta\xi}{10}, \tag{B.32}
\]
where L_i = (Gy)_i + \sum_{g\in\mathcal{H}_i} \Big( -\frac{\alpha_g}{\eta} \ln \sum_{j\in\Omega_g} q_j \exp(-\eta L_j/\alpha_g) \Big) and q_j = z_j/z_{p_j}.
Proof. We first consider a terminal index i. By the assumption \|z^* - z\| \le \frac{\eta^2\xi\epsilon_{\mathrm{dil}}^2}{40P^3\|\alpha\|_\infty} and Lemma 89, we have \|y - y^*\|_1 \le \frac{\eta\xi}{10P}, and
\[
L_i = (Gy)_i \le (Gy^*)_i + \frac{\eta\xi}{10P} = V^*_h + \frac{\eta\xi}{10P} \tag{B.33}
\]
for i ∈ supp(x*), and
\[
L_i = (Gy)_i \ge (Gy^*)_i - \frac{\eta\xi}{10P} \ge V^*_h + \xi - \frac{\eta\xi}{10P} \ge V^*_h + \frac{9\eta\xi}{10}
\]
for i ∉ supp(x*), by the definition of ξ. Therefore, this shows Equation (B.32) for terminal indices. In the following, we complete the proof by backward induction. Specifically, for a nonterminal index i ∉ supp(x*), we assume L_j ≥ V^*_{h(j)} + \frac{9\eta\xi}{10} for every descendant j (we say that index j is a descendant of index i if there exists a sequence of indices s_0, …, s_K for some K > 0 such that s_0 = j, s_K = i, and p_{s_{k-1}} = s_k for every k ∈ [[K]]). Note that we always have j ∉ supp(x*). We will prove that L_i satisfies Equation (B.32), which completes the proof for i ∉ supp(x*) by induction. By assumption, we have
\[
\begin{aligned}
L_i &= (Gy)_i + \sum_{g\in\mathcal{H}_i} \Big( -\frac{\alpha_g}{\eta} \ln \sum_{j\in\Omega_g} q_j \exp(-\eta L_j/\alpha_g) \Big) \\
&\ge (Gy)_i + \sum_{g\in\mathcal{H}_i} \Big( -\frac{\alpha_g}{\eta} \ln \max_{j\in\Omega_g} \exp(-\eta L_j/\alpha_g) \Big)
= (Gy)_i + \sum_{g\in\mathcal{H}_i} \min_{j\in\Omega_g} L_j \\
&\ge (Gy)_i + \sum_{g\in\mathcal{H}_i} V^*_g + \frac{9\eta\xi}{10}
\quad \text{(by the induction hypothesis)} \\
&\ge (Gy^*)_i - \xi + \sum_{g\in\mathcal{H}_i} V^*_g + \frac{9\eta\xi}{10}
\quad (\|y - y^*\| \le \xi) \\
&\ge V^*_{h(i)} + \frac{9\eta\xi}{10}.
\quad \text{(Lemma 89)}
\end{aligned}
\]
Similarly, for a nonterminal index i ∈ supp(x*), we show that for every descendant j,
\[
L_j \le V^*_{h(j)} + \frac{\eta\xi}{10P}\, f(h(j)), \tag{B.34}
\]
where f : H_X → R_+ is defined recursively as follows. For an information set (simplex) g such that Ω_g contains terminal indices only, we let f(g) = 1. Otherwise, we define
\[
f(g) = \max_{k\in\Omega_g} \sum_{s\in\mathcal{H}_k} \left( f(s) + \frac{1}{P} \right). \tag{B.35}
\]
This shows Equation (B.32), as Lemma 99 guarantees
\[
f(h) \le (P - 1)\left( 1 + \frac{1}{P} \right) < P
\]
for every simplex h. It remains to prove Equation (B.34) by induction. For the base case where i is a terminal index, Equation (B.34) clearly holds by Equation (B.33). For a nonterminal index i, we have
\[
\begin{aligned}
L_i &= (Gy)_i + \sum_{g\in\mathcal{H}_i} \Big( -\frac{\alpha_g}{\eta} \ln \sum_{j\in\Omega_g} q_j \exp(-\eta L_j/\alpha_g) \Big) \\
&\le (Gy)_i + \sum_{g\in\mathcal{H}_i} \Big( -\frac{\alpha_g}{\eta} \ln \sum_{j\in\Omega_g\cap\mathrm{supp}(x^*)} q_j \exp(-\eta L_j/\alpha_g) \Big) \\
&\le (Gy)_i + \sum_{g\in\mathcal{H}_i} \left( -\frac{\alpha_g}{\eta} \ln\left( \exp\Big( -\eta\big( V^*_g + \tfrac{\eta\xi}{10P} f(g) \big)/\alpha_g \Big) \sum_{j\in\Omega_g\cap\mathrm{supp}(x^*)} q_j \right) \right)
\quad \text{(by the induction hypothesis)} \\
&= (Gy)_i + \sum_{g\in\mathcal{H}_i} \left( V^*_g + \frac{\eta\xi}{10P}\, f(g) - \frac{\alpha_g}{\eta} \ln \sum_{j\in\Omega_g\cap\mathrm{supp}(x^*)} q_j \right). \tag{B.36}
\end{aligned}
\]
We continue to bound the last term. Let c = \frac{\eta^2\xi\epsilon_{\mathrm{dil}}^2}{40P^3\|\alpha\|_\infty}. We have
\[
\begin{aligned}
-\frac{\alpha_g}{\eta} \ln \sum_{j\in\Omega_g\cap\mathrm{supp}(x^*)} q_j
&= -\frac{\alpha_g}{\eta} \ln\left( \sum_{j\in\Omega_g\cap\mathrm{supp}(x^*)} \frac{x_j}{x_{p_j}} \right) \\
&\le -\frac{\alpha_g}{\eta} \ln\left( \frac{\sum_{j\in\Omega_g\cap\mathrm{supp}(x^*)} (x^*_j - c)}{x^*_{p_j} + c} \right)
\quad \Big( \|z^* - z\| \le \tfrac{\eta^2\xi\epsilon_{\mathrm{dil}}^2}{40P^3\|\alpha\|_\infty} \Big) \\
&\le -\frac{\alpha_g}{\eta} \ln\left( \frac{x^*_{p_j} - c\,|\Omega_g|}{x^*_{p_j} + c} \right)
= -\frac{\alpha_g}{\eta} \ln\left( 1 - \frac{c(|\Omega_g| + 1)}{x^*_{p_j} + c} \right) \\
&\le -\frac{\alpha_g}{\eta} \ln\left( 1 - \frac{cP}{\epsilon_{\mathrm{dil}}} \right)
\quad (x^*_{p_j} + c \ge x^*_{p_j} \ge \epsilon_{\mathrm{dil}} \text{ and } |\Omega_g| + 1 \le P) \\
&\le -\frac{\alpha_g}{\eta} \ln\left( 1 - \frac{\eta^2\xi}{40\alpha_g P^2} \right)
\quad \text{(by the definition of $c$)} \\
&\le \frac{\eta\xi}{20P^2}.
\quad (-\ln(1 - x) < 2x \text{ for } 0 < x < 0.5)
\end{aligned}
\]
Plugging this back into the original inequalities, we get
\[
\begin{aligned}
L_i &\le (Gy)_i + \sum_{g\in\mathcal{H}_i} \left( V^*_g + \frac{\eta\xi}{10P}\, f(g) + \frac{\eta\xi}{20P^2} \right) \\
&\le (Gy^*)_i + \frac{\eta\xi}{20P^2} + \sum_{g\in\mathcal{H}_i} \left( V^*_g + \frac{\eta\xi}{10P}\, f(g) + \frac{\eta\xi}{20P^2} \right)
\quad \Big( \|y - y^*\|_1 \le \tfrac{\eta\xi}{10P^2} \Big) \\
&\le (Gy^*)_i + \sum_{g\in\mathcal{H}_i} \left( V^*_g + \frac{\eta\xi}{10P}\, f(g) + \frac{\eta\xi}{10P^2} \right)
\quad \text{($i$ is nonterminal)} \\
&= (Gy^*)_i + \sum_{g\in\mathcal{H}_i} V^*_g + \frac{\eta\xi}{10P} \sum_{g\in\mathcal{H}_i} \left( f(g) + \frac{1}{P} \right) \\
&\le V^*_h + \frac{\eta\xi}{10P} \sum_{g\in\mathcal{H}_i} \left( f(g) + \frac{1}{P} \right)
\quad \text{(Lemma 89)} \\
&\le V^*_h + \frac{\eta\xi}{10P}\, f(h(i)),
\quad \text{(Equation (B.35))}
\end{aligned}
\]
which shows Equation (B.34) by induction, and thus shows Equation (B.32).
Lemma 99. Define f : H_X → R_+ as follows:
\[
f(g) =
\begin{cases}
1 & \text{if } \Omega_g \text{ contains terminal indices only;} \\
\displaystyle\max_{k\in\Omega_g} \sum_{s\in\mathcal{H}_k} \left( f(s) + \frac{1}{P} \right) & \text{otherwise.}
\end{cases}
\]
Then for every g ∈ H_X, we have
\[
f(g) \le I_g \left( 1 + \frac{1}{P} \right), \tag{B.37}
\]
where I_g is the number of indices that are descendants of g (we say that index j is a descendant of simplex g if j ∈ Ω_g or index j is a descendant of index i for some i ∈ Ω_g).
Proof. If Ω_g contains terminal indices only, then since I_g ≥ 1, Equation (B.37) holds. Otherwise, suppose Equation (B.37) holds for all simplexes that are descendants of g (we say that simplex h is a descendant of simplex g if there exists a sequence of simplexes s_0, …, s_K for some K > 0 such that s_0 = h, s_K = g, and σ(s_{k-1}) ∈ Ω_{s_k} for every k ∈ [[K]]). We define
\[
k^* = \operatorname*{argmax}_{k\in\Omega_g} \sum_{s\in\mathcal{H}_k} \left( f(s) + \frac{1}{P} \right).
\]
Then we have
\[
\begin{aligned}
f(g) &= \sum_{s\in\mathcal{H}_{k^*}} \left( f(s) + \frac{1}{P} \right)
\quad \text{(by the definition of $f$)} \\
&\le \sum_{s\in\mathcal{H}_{k^*}} \left( I_s\Big( 1 + \frac{1}{P} \Big) + \frac{1}{P} \right)
\quad \text{(by assumption)} \\
&\le 1 + \sum_{s\in\mathcal{H}_{k^*}} I_s\left( 1 + \frac{1}{P} \right)
\quad (|\mathcal{H}_{k^*}| \le P) \\
&\le (I_g - 1)\left( 1 + \frac{1}{P} \right) + 1
\le I_g\left( 1 + \frac{1}{P} \right),
\end{aligned}
\]
where the third inequality is because k^* is not a descendant of any s ∈ \mathcal{H}_{k^*}, and thus
\[
\sum_{s\in\mathcal{H}_{k^*}} I_s \le I_g - 1.
\]
Therefore, we have shown Equation (B.37) by induction.
B.5.5.2 The Counterpart of Equation (3.11) for DOMWU
With Lemma 98, we can prove the following lemma, the counterpart of Equation (3.11) for DOMWU.
Lemma 100. Under the uniqueness assumption, there exists a constant C_{14} > 0 that depends on the game, η, and ẑ^1 such that for any t ≥ 1, DOMWU with step size η ≤ \frac{1}{8P} guarantees
\[
D_{\Psi^{\mathrm{dil}}_{\alpha}}(\hat z^{t+1}, z^t) + D_{\Psi^{\mathrm{dil}}_{\alpha}}(z^t, \hat z^t) \ge C_{14}\, D_{\Psi^{\mathrm{dil}}_{\alpha}}(z^*, \hat z^{t+1})
\]
as long as \max\{\|z^* - \hat z^t\|_1, \|z^* - z^t\|_1\} \le \frac{\eta^2\xi\epsilon_{\mathrm{dil}}^2}{40P^3\|\alpha\|_\infty}.
Proof. We define α_min = \min_{h\in\mathcal{H}_Z} α_h > 0. Note that
\[
\begin{aligned}
D_{\Psi^{\mathrm{dil}}_{\alpha}}(\hat z^{t+1}, z^t) + D_{\Psi^{\mathrm{dil}}_{\alpha}}(z^t, \hat z^t)
&= \sum_{g\in\mathcal{H}_Z} \Big( \alpha_g\, \hat z_{t+1,\sigma(g)}\, \mathrm{KL}(\hat q_{t+1,g}, q_{t,g}) + \alpha_g\, z_{t,\sigma(g)}\, \mathrm{KL}(q_{t,g}, \hat q_{t,g}) \Big)
\quad \text{(Equation (B.11))} \\
&\ge \alpha_{\min} \sum_{g\in\mathcal{H}_Z} \Big( \hat z_{t+1,\sigma(g)}\, \mathrm{KL}(\hat q_{t+1,g}, q_{t,g}) + z_{t,\sigma(g)}\, \mathrm{KL}(q_{t,g}, \hat q_{t,g}) \Big) \\
&\ge \alpha_{\min}\epsilon_{\mathrm{dil}} \sum_{g\in\mathcal{H},\,\sigma(g)\in\mathrm{supp}(z^*)} \Big( \mathrm{KL}(\hat q_{t+1,g}, q_{t,g}) + \mathrm{KL}(q_{t,g}, \hat q_{t,g}) \Big)
\quad \text{(Lemma 96)} \\
&\ge \frac{\alpha_{\min}\epsilon_{\mathrm{dil}}}{3} \sum_{i\notin\mathrm{supp}(z^*),\,p_i\in\mathrm{supp}(z^*)}
\left( \frac{(\hat q^{t+1}_i - q^t_i)^2}{\hat q^{t+1}_i} + \frac{(q^t_i - \hat q^t_i)^2}{q^t_i} \right)
\quad \text{([161, Lemma 18])} \\
&\ge \frac{\alpha_{\min}\epsilon_{\mathrm{dil}}}{4} \sum_{i\notin\mathrm{supp}(z^*),\,p_i\in\mathrm{supp}(z^*)}
\left( \frac{(\hat q^{t+1}_i - q^t_i)^2}{\hat q^t_i} + \frac{(q^t_i - \hat q^t_i)^2}{\hat q^t_i} \right)
\quad \text{(Lemma 94)} \\
&\ge \frac{\alpha_{\min}\epsilon_{\mathrm{dil}}}{8} \sum_{i\notin\mathrm{supp}(z^*),\,p_i\in\mathrm{supp}(z^*)} \frac{(\hat q^{t+1}_i - \hat q^t_i)^2}{\hat q^t_i}. \tag{B.38}
\end{aligned}
\]
Below we continue to bound \sum_{i\notin\mathrm{supp}(z^*),\,p_i\in\mathrm{supp}(z^*)} \frac{(\hat q^{t+1}_i - \hat q^t_i)^2}{\hat q^t_i}.
By the assumption, we have \|\hat q^t - q^*\|_1 \le P\|\hat z^t - z^*\|_1/\epsilon_{\mathrm{dil}}^2 \le \frac{\eta\xi}{10\alpha_i}, and for any index i such that i ∉ supp(x*) and p_i ∈ supp(x*),
\[
\sum_{j\in\Omega_{h(i)},\,j\notin\mathrm{supp}(x^*)} \hat q^t_j \le \frac{\eta\xi}{10\alpha_i}. \tag{B.39}
\]
Moreover, by Lemma 98, we have (denote h = h(i))
\[
\begin{aligned}
\hat q^{t+1}_i &= \frac{\hat q^t_i \exp(-\eta \hat L_i/\alpha_i)}{\sum_{j\in\Omega_h} \hat q^t_j \exp(-\eta \hat L_j/\alpha_i)}
\le \frac{\hat q^t_i \exp(-\eta \hat L_i/\alpha_i)}{\sum_{j\in\Omega_h\cap\mathrm{supp}(z^*)} \hat q^t_j \exp(-\eta \hat L_j/\alpha_i)} \\
&\le \frac{\hat q^t_i \exp\big(-\eta( V^*_h + \tfrac{9\xi}{10})/\alpha_i\big)}{\sum_{j\in\Omega_h\cap\mathrm{supp}(z^*)} \hat q^t_j \exp\big(-\eta( V^*_h + \tfrac{\xi}{10})/\alpha_i\big)}
\quad \text{(Lemma 98)} \\
&= \frac{\hat q^t_i \exp\big(-\tfrac{8}{10}\eta\xi/\alpha_i\big)}{1 - \sum_{j\in\Omega_h,\,j\notin\mathrm{supp}(z^*)} \hat q^t_j}
\le \frac{\hat q^t_i \exp\big(-\tfrac{8}{10}\eta\xi/\alpha_i\big)}{1 - \tfrac{\eta\xi/\alpha_i}{10}}
\quad \text{(Equation (B.39))} \\
&\le \hat q^t_i \left( 1 - \frac{1}{2}\cdot\frac{\eta\xi}{\alpha_i} \right).
\quad \Big( \tfrac{\exp(-0.8u)}{1 - 0.1u} \le 1 - 0.5u \text{ for } u\in[0,1] \Big)
\end{aligned}
\]
Rearranging gives
\[
\frac{|\hat q^{t+1}_i - \hat q^t_i|^2}{\hat q^t_i} \ge \frac{\eta^2\xi^2}{4\|\alpha\|_\infty^2}\, \hat q^t_i \ge \frac{\eta^2\xi^2}{8\|\alpha\|_\infty^2}\, \hat q^{t+1}_i,
\]
where the last step uses Lemma 94. The case for ŷ^t is similar. Combining this with Equation (B.38), we get
\[
D_{\Psi^{\mathrm{dil}}_{\alpha}}(\hat z^{t+1}, z^t) + D_{\Psi^{\mathrm{dil}}_{\alpha}}(z^t, \hat z^t)
\ge C'_{12} \sum_{i\notin\mathrm{supp}(z^*),\,p_i\in\mathrm{supp}(z^*)} \hat q^{t+1}_i
\ge C'_{12} \sum_{i\notin\mathrm{supp}(z^*)} z^*_{p_i}\, \hat q^{t+1}_i \tag{B.40}
\]
for some C'_{12} > 0. Now we combine the two lower bounds on D_{Ψ^dil_α}(ẑ^{t+1}, z^t) + D_{Ψ^dil_α}(z^t, ẑ^t). Using Lemma 88 and Equation (B.40), we get
\[
\begin{aligned}
D_{\Psi^{\mathrm{dil}}_{\alpha}}(\hat z^{t+1}, z^t) + D_{\Psi^{\mathrm{dil}}_{\alpha}}(z^t, \hat z^t)
&= \frac{1}{2}\Big( D_{\Psi^{\mathrm{dil}}_{\alpha}}(\hat z^{t+1}, z^t) + D_{\Psi^{\mathrm{dil}}_{\alpha}}(z^t, \hat z^t) \Big)
+ \frac{1}{2}\Big( D_{\Psi^{\mathrm{dil}}_{\alpha}}(\hat z^{t+1}, z^t) + D_{\Psi^{\mathrm{dil}}_{\alpha}}(z^t, \hat z^t) \Big) \\
&\ge \frac{C_{12}}{2}\|z^* - \hat z^{t+1}\|_1^2 + \frac{C'_{12}}{2} \sum_{i\notin\mathrm{supp}(z^*)} z^*_{p_i}\, \hat q^{t+1}_i. \tag{B.41}
\end{aligned}
\]
Also note that by Lemma 93, we have
\[
\begin{aligned}
D_{\Psi^{\mathrm{dil}}_{\alpha}}(z^*, \hat z^{t+1})
&\le \|\alpha\|_\infty \left( \sum_{i\in\mathrm{supp}(z^*)} \frac{4P}{z^*_i}\cdot\frac{(z^*_i - \hat z^{t+1}_i)^2}{\hat q^{t+1}_i}
+ \sum_{i\notin\mathrm{supp}(z^*)} z^*_{p_i}\, \hat q^{t+1}_i \right) \\
&\le \frac{4\|\alpha\|_\infty P}{\epsilon_{\mathrm{dil}}^2} \sum_{i\in\mathrm{supp}(z^*)} \big( z^*_i - \hat z^{t+1}_i \big)^2
+ \sum_{i\notin\mathrm{supp}(z^*)} z^*_{p_i}\, \hat q^{t+1}_i.
\quad \text{(Lemma 96)}
\end{aligned}
\]
Combining this with Equation (B.41), we conclude that
\[
\begin{aligned}
D_{\Psi^{\mathrm{dil}}_{\alpha}}(\hat z^{t+1}, z^t) + D_{\Psi^{\mathrm{dil}}_{\alpha}}(z^t, \hat z^t)
&\ge \frac{\min\{C_{12}, C'_{12}\}}{2} \left( \|z^* - \hat z^{t+1}\|_1^2 + \sum_{i\notin\mathrm{supp}(z^*)} z^*_{p_i}\, \hat q^{t+1}_i \right) \\
&\ge \frac{\epsilon_{\mathrm{dil}}^2}{8\|\alpha\|_\infty P}\, \min\{C_{12}, C'_{12}\}\, D_{\Psi^{\mathrm{dil}}_{\alpha}}(z^*, \hat z^{t+1}),
\end{aligned}
\]
which finishes the proof.
B.5.5.3 Proof of Theorem 16
With Lemma 100, we are ready to prove Theorem 16.
Proof of Theorem 16. Set T_0 = \frac{64 C'_{13}}{c^2}, where c = \frac{\eta^2\xi\epsilon_{\mathrm{dil}}^2}{40P^3\|\alpha\|_\infty}. For t ≥ T_0, using Theorem 97 we have
\[
\|z^* - \hat z^t\|_1^2 \le 2\, D_{\Psi^{\mathrm{dil}}_{\alpha}}(z^*, \hat z^t) \le \frac{2C'_{13}}{T_0} \le c^2,
\]
\[
\begin{aligned}
\|z^* - z^t\|_1^2 &\le 2\|z^* - \hat z^{t+1}\|_1^2 + 2\|\hat z^{t+1} - z^t\|_1^2
\le 4\, D_{\Psi^{\mathrm{dil}}_{\alpha}}(z^*, \hat z^{t+1}) + 4\, D_{\Psi^{\mathrm{dil}}_{\alpha}}(\hat z^{t+1}, z^t) \\
&\le 64\,\Theta^{t+1} \le \frac{64 C'_{13}}{T_0} \le c^2.
\end{aligned}
\]
Therefore, when t ≥ T_0, the condition of the second part of Lemma 100 is satisfied, and we have
\[
\begin{aligned}
\zeta^t &\ge \frac{1}{2}\, D_{\Psi^{\mathrm{dil}}_{\alpha}}(\hat z^{t+1}, z^t) + \frac{1}{2}\,\zeta^t \\
&\ge \frac{1}{2}\, D_{\Psi^{\mathrm{dil}}_{\alpha}}(\hat z^{t+1}, z^t) + \frac{C_{14}}{2}\, D_{\Psi^{\mathrm{dil}}_{\alpha}}(z^*, \hat z^{t+1})
\quad \text{(by Lemma 100)} \\
&\ge C_{15}\,\Theta^{t+1}
\end{aligned}
\]
for some constant C_{15} > 0. Therefore, when t ≥ T_0, \Theta^{t+1} \le \Theta^t - \frac{15}{16} C_{15}\,\Theta^{t+1}, which further leads to
\[
\Theta^t \le \Theta^{T_0}\cdot\left( 1 + \frac{15}{16} C_{15} \right)^{T_0 - t}
\le \Theta^1\cdot\left( 1 + \frac{15}{16} C_{15} \right)^{T_0 - t}
= D_{\Psi^{\mathrm{dil}}_{\alpha}}(z^*, \hat z^1)\cdot\left( 1 + \frac{15}{16} C_{15} \right)^{T_0 - t},
\]
where the second inequality uses Equation (B.30). The inequality trivially holds for t < T_0 as well, so it holds for all t. We finish the proof by relating D_{Ψ^dil_α}(z*, z^t) and Θ^{t+1}. Note that by Lemma 93,
\[
D_{\Psi^{\mathrm{dil}}_{\alpha}}(z^*, z^t)^2
\le \frac{16P\|\alpha\|_\infty^2}{\epsilon_{\mathrm{dil}}^4}\, \|z^* - z^t\|_1^2
\le \frac{32P\|\alpha\|_\infty^2}{\epsilon_{\mathrm{dil}}^4} \Big( \|z^* - \hat z^{t+1}\|_1^2 + \|\hat z^{t+1} - z^t\|_1^2 \Big)
\le \frac{1024P\|\alpha\|_\infty^2}{\epsilon_{\mathrm{dil}}^4}\,\Theta^{t+1}.
\]
Therefore, we conclude
\[
D_{\Psi^{\mathrm{dil}}_{\alpha}}(z^*, z^t)
\le \sqrt{\frac{1024P\|\alpha\|_\infty^2}{\epsilon_{\mathrm{dil}}^4}\,\Theta^{t+1}}
\le \sqrt{\frac{1024P\|\alpha\|_\infty^2\, D_{\Psi^{\mathrm{dil}}_{\alpha}}(z^*, \hat z^1)}{\epsilon_{\mathrm{dil}}^4}}
\left( 1 + \frac{15}{16} C_{15} \right)^{\frac{T_0 - t - 1}{2}},
\]
which finishes the proof by setting
\[
C_3 = \sqrt{\frac{1024P\|\alpha\|_\infty^2\, D_{\Psi^{\mathrm{dil}}_{\alpha}}(z^*, \hat z^1)}{\epsilon_{\mathrm{dil}}^4}}
\left( 1 + \frac{15}{16} C_{15} \right)^{\frac{T_0 - 1}{2}},
\qquad
C_4 = \left( 1 + \frac{15}{16} C_{15} \right)^{\frac{1}{2}} - 1.
\]
B.5.6 Remarks on DOGDA
In this subsection, we discuss the technical difficulties in obtaining a convergence rate for DOGDA. This is challenging even if we assume uniqueness of the Nash equilibrium. From the analysis of VOMWU and DOMWU, we can see that Lemma 95 and Lemma 96 play an important role. These lemmas lower bound ẑ^t_i by game-dependent constants for i ∈ supp(z*), and their proofs are based on the observation that D_{Ψ^van}(z*, z) and D_{Ψ^dil_α}(z*, z) approach infinity as z_i approaches zero for i ∈ supp(z*). This property of the entropy regularizers, however, does not hold for the dilated Euclidean regularizer Φ^dil_α in general. Lower bounding ẑ^t_i for DOGDA could be possible when ẑ^t is sufficiently close to z*. For example, when
\[
\|\hat z^t - z^*\| \le \frac{1}{2}\min_{i\in\mathrm{supp}(z^*)} z^*_i,
\]
we can lower bound ẑ^t_i by \frac{1}{2}\min_{i\in\mathrm{supp}(z^*)} z^*_i for i ∈ supp(z*). This must happen when t is large by Theorem 13, but the entire analysis would then depend on a potentially large "asymptotic" constant. Therefore, even though we know that asymptotically DOGDA has linear convergence, obtaining a concrete rate as for VOMWU and DOMWU remains an open question.
Another direction is to follow the analysis of VOGDA, which gives a linear convergence rate by Corollary 14. However, in the analysis of VOGDA, Wei et al. [161] implicitly use the fact that Φ^van is β-smooth, that is,
\[
D_{\Phi^{\mathrm{van}}}(z, z') \le \frac{\beta}{2}\|z - z'\|^2
\]
for some β > 0. In fact, Φ^van is 1-smooth. This property, unfortunately, does not hold for Φ^dil_α. We believe one can still show that Φ^dil_α is β-smooth for some game-dependent β once ẑ^t is sufficiently close to z*. However, this again involves the asymptotic result in Theorem 13 and may prevent us from getting a concrete rate. In summary, using the existing techniques, we encountered difficulties in obtaining a concrete convergence rate for DOGDA. Nevertheless, DOGDA performs well in the experiments. Moreover, it reduces to VOGDA in normal-form games and achieves linear convergence in that case. Therefore, we still believe that obtaining a (linear) convergence rate for DOGDA in theory is a promising direction.
Appendix C
Omitted Details in Chapter 4
C.1 Additional Related Work
C.1.1 More Results for Optimistic Algorithms in Games
For individual regret in multi-player general-sum NFGs, Syrgkanis et al. [150] first show O(T^{1/4}) regret for general optimistic OMD and FTRL algorithms. The result is improved to O(T^{1/6}) by [27], but only for OMWU in two-player NFGs. Daskalakis, Fishelson, and Golowich [36] show that OMWU enjoys O(log^4 T) regret in multi-player general-sum NFGs.
As for last-iterate convergence in two-player zero-sum games, Daskalakis and Panageas [39] show an asymptotic result for OMWU under the unique Nash equilibrium assumption. Wei et al. [161] further show a linear convergence rate while allowing larger learning rates under the same assumption. Hsieh, Antonakopoulos, and Mertikopoulos [82] show another asymptotic convergence result without the assumption. It is also worth noting that OGDA, another popular optimistic algorithm, has been shown to have last-iterate convergence in general polyhedral convex games [161].
C.1.2 Approaches in Online Combinatorial Optimization
Besides performing MWU/OMWU over vertices, we review two additional approaches in online combinatorial optimization:
OMD over the Convex Hull This approach runs Online Mirror Descent (OMD) over the convex hull [92, 5]. It is well known that OMD with the negative entropy regularizer results in a (dimension-wise) multiplicative weight update. When the set of vertices is the standard basis, this algorithm coincides with MWU over the probability simplex. For general cases, however, it requires projecting back to the convex hull, and this procedure may not be efficient. Helmbold and Warmuth [80] first used this approach for permutations, and Koolen, Warmuth, and Kivinen [92] studied it for arbitrary 0/1 polyhedral sets and showed its efficiency in more cases.
FTPL Another approach is called Follow the Perturbed Leader [88]. This approach adds a random perturbation to the cumulative loss vector and greedily selects the vertex with minimal perturbed loss. The latter procedure corresponds to linear optimization over the set of vertices, which can be solved efficiently for most cases of interest. We are not aware of any previous work using this approach for EFGs, though.
C.2 Pseudocode
Below we show pseudocode for OMWU and Vertex OMWU (Section 4.2).
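The original algorithm boxes are not reproduced in this transcript; as a reference point, the following is a minimal sketch of one standard formulation of the OMWU update on the probability simplex (the single-sequence form with optimistic prediction equal to the last loss). The function name, the interface, and the use of a generic loss vector are illustrative assumptions for this sketch; Vertex OMWU applies the same multiplicative update, but with coordinates indexing the vertices of the polytope rather than individual actions.
```python
import numpy as np

def omwu_step(p, loss, prev_loss, eta):
    """One optimistic multiplicative-weights (OMWU) step on the simplex.

    p         : current strategy (probability vector, all entries > 0)
    loss      : most recent loss vector l^t
    prev_loss : previous loss vector l^{t-1}
    eta       : step size

    The single-sequence optimistic update multiplies by exp(-eta*(2 l^t - l^{t-1})).
    """
    logits = np.log(p) - eta * (2.0 * loss - prev_loss)
    logits -= logits.max()          # subtract max for numerical stability
    w = np.exp(logits)
    return w / w.sum()
```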
C.3 Extensive-Form Games
In a tree-form sequential decision process (TFSDP) problem, the agent interacts with the environment in two ways: at decision points, the agent must act by picking an action from a set of legal actions; at observation points, the agent observes a signal drawn from a set of possible signals. Different decision points can have different sets of legal actions, and different observation points can have different sets of possible signals. Decision and observation points are structured as a tree: under the standard assumption that the agent is not forgetful, it is not possible for the agent to cycle back to a previously encountered decision or observation point by following the structure of the decision problem.
As an example, consider the simplified game of Kuhn poker [101], depicted in Figure C.1. Kuhn poker is a standard benchmark in the EFG-solving community. In Kuhn poker, each player puts an ante worth 1 into the pot. Each player is then privately dealt one card from a deck that contains 3 unique cards (Jack, Queen, King). Then, a single round of betting occurs, with the following dynamics. First, Player 1 decides to either check or bet 1. Then,
• If Player 1 checks, Player 2 can check or raise 1.
– If Player 2 checks, a showdown occurs; if Player 2 raises, Player 1 can fold or call.
∗ If Player 1 folds, Player 2 takes the pot; if Player 1 calls, a showdown occurs.
• If Player 1 raises, Player 2 can fold or call.
– If Player 2 folds, Player 1 takes the pot; if Player 2 calls, a showdown occurs.
When a showdown occurs, the player with the higher card wins the pot and the game immediately ends.
Figure C.1: Tree-form sequential decision making process of the first acting player in the game of Kuhn
poker.
As soon as the game starts, the agent observes a private card that has been dealt to them; this is observation point k1, whose set of possible signals is S_{k1} = {jack, queen, king}. Should the agent observe the 'jack' signal, the decision problem transitions to the decision point j1, where the agent must pick one action from the set A_{j1} = {check, raise}. If the agent picks 'raise', the decision process terminates; otherwise, if 'check' is chosen, the process transitions to observation point k2, where the agent will observe whether the opponent checks (at which point the interaction terminates) or raises. In the latter case, the process transitions to decision point j4, where the agent picks one action from the set A_{j4} = {fold, call}. In either case, after the action has been selected, the interaction terminates.
C.4 Experimental Evaluation
Game instances We numerically investigate agents learning under the COLS in Kuhn and Leduc poker [101, 147], standard benchmark games from the extensive-form games literature.
Kuhn poker The two-player variant of Kuhn poker first appeared in [101]. In this chapter, we use the multiplayer variant, as described by Farina et al. [47]. In a multiplayer Kuhn poker game with r ranks, a deck with r unique cards is used. At the beginning of the game, each player pays one chip to the pot (ante), and is dealt a single private card (their hand). The first player to act can check or bet, i.e., put an additional chip in the pot. Then, the second player can check or bet after the first player's check, or fold/call the first player's bet. If no bet was previously made, the third player can either check or bet, and so on in turn. If a bet is made by a player, each subsequent player needs to decide whether to fold or call the bet. The betting round ends once all players have checked, or once every player has had an opportunity to either fold or call the bet that was made. The player with the highest card who has not folded wins all the chips in the pot.
Leduc poker We use a multiplayer version of the classical Leduc hold'em poker introduced by Southey et al. [147]. We employ game instances of rank 3. The deck consists of three suits with 3 cards each. Our instances are parametric in the maximum number of bets, which in limit hold'em is not necessarily tied to the number of players. As in Kuhn poker, we set a cap on the number of raises to one bet. As the game starts, players pay one chip to the pot. Then, two betting rounds follow. In the first one, a single private card is dealt to each player, while in the second round a single board card is revealed. The raise amount is set to 2 and 4 in the first and second round, respectively.
For each game, we consider a 3-player and a 4-player variant. The 3-player Kuhn variant uses a deck with
r = 12 ranks. The 4-player variant uses a deck with a reduced number of ranks equal to r = 5 to avoid
excessive memory usage.
CFR and CFR(RM+) Modern variants of counterfactual regret minimization (CFR) are the current practical state-of-the-art in two-player zero-sum extensive-form game solving. We implemented both the original CFR algorithm by Zinkevich et al. [166], and a more modern variant (which we denote ‘CFR(RM+)’)
using the Regret Matching Plus regret minimization algorithm at each decision point [153].
Discussion of results We compare the maximum per-player regret cumulated by KOMWU for four different choices of constant learning rate η^t = η ∈ {0.1, 1, 5, 10} against that cumulated by CFR and CFR(RM+).
We remark that the payoff ranges of these games are not [0, 1] (i.e., the games have not been normalized). The payoff range of Kuhn poker is 6 for the 3-player variant and 8 for the 4-player variant. The payoff range of Leduc poker is 21 for the 3-player variant and 28 for the 4-player variant. So, a learning rate value of η = 0.1 corresponds to a significantly smaller learning rate in the normalized game where the payoffs have been shifted and rescaled to lie within [0, 1], as required in the statements of Properties 20 to 22.
Results are shown in Figure C.2. In all games, we observe that the maximum per-player regret cumulated by KOMWU plateaus and remains constant, unlike for the CFR variants. This behavior is consistent with the near-optimal per-player regret guarantees of KOMWU (Theorem 30). In the 3-player variant of Leduc poker, we observe that the largest learning rate we use, η = 10, leads to divergent behavior of the learning dynamics.

[Figure C.2 contains four panels (3-player Kuhn poker, 4-player Kuhn poker, 3-player Leduc poker, and 4-player Leduc poker) plotting the maximum individual regret against the iteration count for CFR, CFR(RM+), and KOMWU with η ∈ {0.1, 1, 5, 10}.]

Figure C.2: Maximum per-player regret cumulated by KOMWU for four different choices of constant learning rate η^t = η ∈ {0.1, 1, 5, 10}, compared to that cumulated by CFR and CFR(RM+) in two multiplayer poker games.
C.5 Proofs
Theorem 27. For any vectors x, y ∈ R^{Σ_i}, the two following recursive relationships hold:
\[
K_{\mathcal{Q}_i}(x, y) = x[\varnothing]\, y[\varnothing] \prod_{j\in\mathcal{C}_\varnothing} K_j(x, y), \tag{4.16}
\]
and, for all decision points j ∈ J_i,
\[
K_j(x, y) = \sum_{a\in A_j} \left( x[ja]\, y[ja] \prod_{j'\in\mathcal{C}_{ja}} K_{j'}(x, y) \right). \tag{4.17}
\]
In particular, Equations (4.16) and (4.17) give a recursive algorithm to evaluate the polyhedral kernel K_{\mathcal{Q}_i} associated with the sequence-form strategy space of any player i in an EFG in time linear in the number of sequences |Σ_i|.
Proof. In the proof of this result, we will make use of the following additional notation. Given any x ∈ R^{Σ_i} and a j ∈ J_i, we let x^{(j)} ∈ R^{Σ^*_{i,j}} denote the subvector obtained from x by only considering sequences σ ∈ Σ^*_{i,j}, that is, the vector whose entries are defined as x^{(j)}[σ] = x[σ] for all σ ∈ Σ^*_{i,j}.
Proof of (4.16) Direct inspection of the definitions of Π_i and Π_{i,j} (given in Section 4.4.1), together with the observation that the {Σ^*_{i,j} : j ∈ C_∅} form a partition of Σ^*_i, reveals that
\[
\Pi_i = \Big\{ \pi \in \{0,1\}^{\Sigma_i} \;:\; \pi[\varnothing] = 1,\;\; \pi^{(j)} \in \Pi_{i,j}\ \ \forall j \in \mathcal{C}_\varnothing \Big\}. \tag{C.1}
\]
The observation above can be summarized informally into the statement that "Π_i is equal, up to permutation of indices, to the Cartesian product ×_{j∈C_∅} Π_{i,j}". The idea for the proof is then to use that Cartesian-product structure in the definition of the 0/1-polyhedral kernel (4.7), as follows:
\[
\begin{aligned}
K_{\mathcal{Q}_i}(x, y) &= \sum_{\pi\in\Pi_i} \prod_{\sigma\in\pi} x[\sigma]\, y[\sigma]
= \sum_{\pi\in\Pi_i} x[\varnothing]\, y[\varnothing] \prod_{j'\in\mathcal{C}_\varnothing} \prod_{\sigma\in\pi^{(j')}} x[\sigma]\, y[\sigma] \\
&= \sum_{\pi^{(j)}\in\Pi_{i,j}\ \forall j\in\mathcal{C}_\varnothing} x[\varnothing]\, y[\varnothing] \prod_{j'\in\mathcal{C}_\varnothing} \prod_{\sigma\in\pi^{(j')}} x[\sigma]\, y[\sigma]
= x[\varnothing]\, y[\varnothing] \sum_{\pi^{(j)}\in\Pi_{i,j}\ \forall j\in\mathcal{C}_\varnothing} \prod_{j'\in\mathcal{C}_\varnothing} \prod_{\sigma\in\pi^{(j')}} x[\sigma]\, y[\sigma] \\
&= x[\varnothing]\, y[\varnothing] \prod_{j\in\mathcal{C}_\varnothing} \sum_{\pi^{(j)}\in\Pi_{i,j}} \prod_{\sigma\in\pi^{(j)}} x[\sigma]\, y[\sigma]
= x[\varnothing]\, y[\varnothing] \prod_{j\in\mathcal{C}_\varnothing} K_j(x, y),
\end{aligned}
\]
where the second equality follows from the fact that {∅} ∪ {Σ^*_{i,j} : j ∈ C_∅} form a partition of Σ_i, the third equality follows from (C.1), the fifth equality from the fact that each π_j ∈ Π_{i,j} can be chosen independently, and the last equality from the definition of the partial kernel function (4.15).
Proof of (4.17) Similarly to what we did for (4.16), we start by giving an inductive characterization of Π_{i,j} as a function of the children Π_{i,j'} for j' ∈ ∪_{a∈A_j} C_{ja}. Specifically, direct inspection of the definition of Π_{i,j}, together with the observation that the {Σ^*_{i,j'} : j' ∈ ∪_{a∈A_j} C_{ja}} form a partition of Σ^*_{i,j}, reveals that
\[
\Pi_{i,j} = \Big\{ \pi \in \{0,1\}^{\Sigma^*_{i,j}} \;:\; \sum_{a\in A_j} \pi[ja] = 1,\;\;
\pi^{(j')} \in \pi[ja]\cdot\Pi_{i,j'}\ \ \forall a\in A_j,\, j'\in\mathcal{C}_{ja} \Big\}. \tag{C.2}
\]
From the first constraint, together with the fact that π[ja] ∈ {0, 1} for all a ∈ A_j, we conclude that exactly one a* ∈ A_j is such that π[ja*] = 1, while π[ja] = 0 for all other a ∈ A_j, a ≠ a*. So, we can rewrite (C.2) as
\[
\Pi_{i,j} = \bigcup_{a^*\in A_j} \left\{ \pi \in \{0,1\}^{\Sigma^*_{i,j}} \;:\;
\begin{array}{l}
\pi[ja^*] = 1, \qquad \pi[ja] = 0\ \ \forall a\in A_j,\, a\ne a^*, \\
\pi^{(j')} \in \Pi_{i,j'}\ \ \forall j'\in\mathcal{C}_{ja^*}, \qquad \pi^{(j')} = \mathbf{0}\ \ \forall j'\in \textstyle\bigcup_{a\in A_j,\, a\ne a^*}\mathcal{C}_{ja}
\end{array}
\right\}, \tag{C.3}
\]
where the union is clearly disjoint. The above equality can be summarized informally into the statement that "Π_{i,j} is equal, up to permutation of indices, to a disjoint union over actions a* ∈ A_j of Cartesian products
×_{j∈C_{ja*}} Π_{i,j}". We can then use the same set of manipulations we already used in the proof of (4.16) to obtain
\[
\begin{aligned}
K_j(x, y) &= \sum_{\pi\in\Pi_{i,j}} \prod_{\sigma\in\pi} x[\sigma]\, y[\sigma]
= \sum_{\pi\in\Pi_{i,j}} x[ja^*]\, y[ja^*] \prod_{j'\in\mathcal{C}_{ja^*}} \prod_{\sigma\in\pi^{(j')}} x[\sigma]\, y[\sigma] \\
&= \sum_{a^*\in A_j}\; \sum_{\pi_{j'}\in\Pi_{i,j'}\ \forall j'\in\mathcal{C}_{ja^*}} x[ja^*]\, y[ja^*] \prod_{j'\in\mathcal{C}_{ja^*}} \prod_{\sigma\in\pi^{(j')}} x[\sigma]\, y[\sigma] \\
&= \sum_{a^*\in A_j} x[ja^*]\, y[ja^*] \prod_{j'\in\mathcal{C}_{ja^*}} \sum_{\pi^{(j')}\in\Pi_{i,j'}} \prod_{\sigma\in\pi^{(j')}} x[\sigma]\, y[\sigma]
= \sum_{a\in A_j} \left( x[ja]\, y[ja] \prod_{j'\in\mathcal{C}_{ja}} K_{j'}(x, y) \right),
\end{aligned}
\]
where the second equality follows from the fact that the {Σ^*_{i,j'} : j' ∈ ∪_{a∈A_j} C_{ja}} form a partition of Σ^*_{i,j}, the third equality follows from (C.3), the fourth equality from the fact that each π_{j'} ∈ Π_{i,j'} can be picked independently, and the last equality from the definition of the partial kernel function (4.15) as well as renaming a* into a.
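The recursion in Equations (4.16) and (4.17) can be evaluated with a direct depth-first traversal of the decision process. Below is a minimal sketch under the assumption that the tree is stored as decision-point objects holding, for each action, the index of the corresponding sequence and the list of decision points reachable below it; the class and function names are illustrative, not the implementation used in the thesis.
```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class DecisionPoint:
    # sequences[a]: index of sequence ja in the vectors x, y
    # children[a]:  decision points reachable after playing action a at j
    sequences: Dict[str, int]
    children: Dict[str, List["DecisionPoint"]] = field(default_factory=dict)

def partial_kernel(j: DecisionPoint, x, y) -> float:
    """K_j(x, y) as in Equation (4.17)."""
    total = 0.0
    for a, idx in j.sequences.items():
        term = x[idx] * y[idx]
        for child in j.children.get(a, []):
            term *= partial_kernel(child, x, y)
        total += term
    return total

def sequence_form_kernel(roots: List[DecisionPoint], x, y, empty_idx: int) -> float:
    """K_{Q_i}(x, y) as in Equation (4.16); roots are the decision points in C_empty."""
    value = x[empty_idx] * y[empty_idx]
    for j in roots:
        value *= partial_kernel(j, x, y)
    return value
```
Each sequence index is touched exactly once during the traversal, which matches the linear-time claim of Theorem 27.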
Proposition 28. For any player i ∈ [[m]], vector x ∈ R^{Σ_i}_{>0}, and sequence ja ∈ Σ^*_i,
\[
\frac{1 - K_{\mathcal{Q}_i}(x, \bar e_{ja})/K_{\mathcal{Q}_i}(x, \mathbf{1})}{1 - K_{\mathcal{Q}_i}(x, \bar e_{p_j})/K_{\mathcal{Q}_i}(x, \mathbf{1})}
= \frac{x[ja] \prod_{j'\in\mathcal{C}_{ja}} K_{j'}(x, \mathbf{1})}{K_j(x, \mathbf{1})}.
\]
Proof. Note that since x > 0, clearly K_{\mathcal{Q}_i}(x, \mathbf{1}), K_j(x, \mathbf{1}) > 0. Furthermore, from (4.14) we have that for all σ ∈ Σ_i,
\[
K_{\mathcal{Q}_i}(x, \mathbf{1}) - K_{\mathcal{Q}_i}(x, \bar e_\sigma)
= \big\langle \phi_{\mathcal{Q}_i}(\mathbf{1}) - \phi_{\mathcal{Q}_i}(\bar e_\sigma),\; \phi_{\mathcal{Q}_i}(x) \big\rangle
= \sum_{\substack{\pi\in\Pi_i \\ \pi[\sigma]=1}} \prod_{\sigma'\in\pi} x[\sigma'] > 0. \tag{C.4}
\]
The above inequality immediately implies that 0 < K_{\mathcal{Q}_i}(x, \bar e_{p_j})/K_{\mathcal{Q}_i}(x, \mathbf{1}) < 1, and therefore all denominators in the statement are nonzero, making the statement well-formed.
In light of (C.4), we further have
\[
\begin{aligned}
\frac{1 - K_{\mathcal{Q}_i}(x, \bar e_{ja})/K_{\mathcal{Q}_i}(x, \mathbf{1})}{1 - K_{\mathcal{Q}_i}(x, \bar e_{p_j})/K_{\mathcal{Q}_i}(x, \mathbf{1})}
= \frac{x[ja] \prod_{j'\in\mathcal{C}_{ja}} K_{j'}(x, \mathbf{1})}{K_j(x, \mathbf{1})}
&\iff
\frac{K_{\mathcal{Q}_i}(x, \mathbf{1}) - K_{\mathcal{Q}_i}(x, \bar e_{ja})}{K_{\mathcal{Q}_i}(x, \mathbf{1}) - K_{\mathcal{Q}_i}(x, \bar e_{p_j})}
= \frac{x[ja] \prod_{j'\in\mathcal{C}_{ja}} K_{j'}(x, \mathbf{1})}{K_j(x, \mathbf{1})} \\
&\iff
\frac{\sum_{\pi\in\Pi_i,\,\pi[ja]=1} \prod_{\sigma\in\pi} x[\sigma]}{\sum_{\pi\in\Pi_i,\,\pi[p_j]=1} \prod_{\sigma\in\pi} x[\sigma]}
= \frac{x[ja] \prod_{j'\in\mathcal{C}_{ja}} K_{j'}(x, \mathbf{1})}{K_j(x, \mathbf{1})}. \tag{C.5}
\end{aligned}
\]
We now prove (C.5). Let
\[
A = \{\pi\in\Pi_i : \pi[ja] = 1\}, \qquad B = \{\pi\in\Pi_i : \pi[p_j] = 1\}
\]
be the domains of the summations. From the definition of Π_i (specifically, constraints 2 in the definition of Q_i, of which Π_i is a subset; see Section 4.4.1), it is clear that A ⊆ B. Furthermore, it is straightforward to check, using the definitions of Π_{i,j}, Π_i, and B, that
\[
\pi^{(j)} \in \Pi_{i,j} \qquad \forall\,\pi \in B. \tag{C.6}
\]
We now introduce the function ((· | ·)) : B × Π_{i,j} → B defined as follows. Given any π ∈ B and π' ∈ Π_{i,j}, ((π | π')) is the vector obtained from π by replacing all sequences at or below decision point j with what is prescribed by π'; formally,
\[
((\pi\,|\,\pi'))[\sigma] =
\begin{cases}
\pi'[\sigma] & \text{if } \sigma\in\Sigma^*_{i,j} \\
\pi[\sigma] & \text{otherwise}
\end{cases}
\qquad \forall\,\pi\in B,\ \pi'\in\Pi_{i,j}. \tag{C.7}
\]
It is immediate to check that ((π | π')) is indeed an element of B. We now introduce the following result.
Lemma 101. There exists a set P ⊆ B such that every π'' ∈ B can be uniquely written as π'' = ((π | π')) for some π ∈ P and π' ∈ Π_{i,j}. Vice versa, given any π ∈ P and π' ∈ Π_{i,j}, we have ((π | π')) ∈ B.
Proof. The second part of the statement is straightforward. We now prove the first part.
Proof. The second part of the statement is straightforward. We now prove the first part.
Fix any π
∗ ∈ Πi,j and let P = {((π |π
∗
)) : π ∈ B}. It is straightforward to verify that for any
π
′′ ∈ B, the choices π = ((π
′′ |π
∗
)) ∈ P and π
′ = π(j) ∈ Πi,j satisfy the equality ((π |π
′
)) = π
′′. So,
every π
′′ ∈ B can be expressed in at least one way as π
′′ = ((π |π
′
)) for some π ∈ P and π
′ ∈ Πi,j . We
now show that the choice above is in fact the unique choice. First, it is clear from the definition of ((· | ·))
that π
′ must satisfy π
′ = π
′′
(j)
, and so it is uniquely determined. Suppose now that there exist π,π˜ ∈ P
such that ((π |π
′
)) = ((π˜ |π
′
)). Then, π and π˜ must coincide on all σ ∈ Σi \ Σ
∗
i,j . However, since all
236
elements of P are of the form ((b |π
∗
)) for some b ∈ B, then π and π˜ must also coincide on all σ ∈ Σ
∗
i,j .
So, π and π˜ coincide on all coordinates σ ∈ Σi
, and the statement follows.
Lemma 101 exposes a convenient combinatorial structure of the set B. In particular, it enables us to rewrite the denominator on the left-hand side of (C.5) as follows:
\[
\begin{aligned}
\sum_{\pi\in B} \prod_{\sigma\in\pi} x[\sigma]
&= \sum_{\pi'\in P} \sum_{\pi''\in\Pi_{i,j}} \prod_{\sigma\in((\pi'\,|\,\pi''))} x[\sigma]
= \sum_{\pi'\in P} \sum_{\pi''\in\Pi_{i,j}}
\Bigg( \prod_{\substack{\sigma\in((\pi'\,|\,\pi'')) \\ \sigma\in\Sigma_{i,j}}} x[\sigma] \Bigg)
\Bigg( \prod_{\substack{\sigma\in((\pi'\,|\,\pi'')) \\ \sigma\notin\Sigma_{i,j}}} x[\sigma] \Bigg) \\
&= \sum_{\pi'\in P} \sum_{\pi''\in\Pi_{i,j}}
\Bigg( \prod_{\sigma\in\pi''} x[\sigma] \Bigg)
\Bigg( \prod_{\substack{\sigma\in\pi' \\ \sigma\notin\Sigma_{i,j}}} x[\sigma] \Bigg)
= \Bigg( \sum_{\pi''\in\Pi_{i,j}} \prod_{\sigma\in\pi''} x[\sigma] \Bigg)
\Bigg( \sum_{\pi'\in P} \prod_{\substack{\sigma\in\pi' \\ \sigma\notin\Sigma_{i,j}}} x[\sigma] \Bigg) \\
&= K_j(x, \mathbf{1}) \cdot \sum_{\pi'\in P} \prod_{\substack{\sigma\in\pi' \\ \sigma\notin\Sigma_{i,j}}} x[\sigma], \tag{C.8}
\end{aligned}
\]
where we used (C.7) in the third equality.
We can use a similar technique to express the numerator of the left-hand side of (C.5). Let
\[
\Pi_{i,ja} = \{\pi\in\Pi_{i,j} : \pi[ja] = 1\}.
\]
Using the constraints that define Π_i and the definition of A, it follows immediately that for any π ∈ A, π^{(j)} ∈ Π_{i,ja}. Furthermore, a direct consequence of Lemma 101 is the following:
Corollary 102. The same set P ⊆ B introduced in Lemma 101 is such that every π'' ∈ A can be uniquely written as π'' = ((π | π')) for some π ∈ P and π' ∈ Π_{i,ja}.
Using Corollary 102 and following the same steps that led to (C.8), we express the numerator of the left-hand side of (C.5) as
\[
\begin{aligned}
\sum_{\pi\in A} \prod_{\sigma\in\pi} x[\sigma]
&= \sum_{\pi'\in P} \sum_{\pi''\in\Pi_{i,ja}} \prod_{\sigma\in((\pi'\,|\,\pi''))} x[\sigma]
= \sum_{\pi'\in P} \sum_{\pi''\in\Pi_{i,ja}}
\Bigg( \prod_{\substack{\sigma\in((\pi'\,|\,\pi'')) \\ \sigma\in\Sigma_{i,j}}} x[\sigma] \Bigg)
\Bigg( \prod_{\substack{\sigma\in((\pi'\,|\,\pi'')) \\ \sigma\notin\Sigma_{i,j}}} x[\sigma] \Bigg) \\
&= \sum_{\pi'\in P} \sum_{\pi''\in\Pi_{i,ja}}
\Bigg( \prod_{\sigma\in\pi''} x[\sigma] \Bigg)
\Bigg( \prod_{\substack{\sigma\in\pi' \\ \sigma\notin\Sigma_{i,j}}} x[\sigma] \Bigg)
= \Bigg( \sum_{\pi''\in\Pi_{i,ja}} \prod_{\sigma\in\pi''} x[\sigma] \Bigg)
\Bigg( \sum_{\pi'\in P} \prod_{\substack{\sigma\in\pi' \\ \sigma\notin\Sigma_{i,j}}} x[\sigma] \Bigg). \tag{C.9}
\end{aligned}
\]
The statement then follows immediately if we can prove that
\[
\sum_{\pi\in\Pi_{i,ja}} \prod_{\sigma\in\pi} x[\sigma] = x[ja] \prod_{j'\in\mathcal{C}_{ja}} K_{j'}(x, \mathbf{1}).
\]
To do so, we use the same approach as in the proof of Theorem 27. In fact, we can directly use the inductive characterization of Π_{i,j} obtained in (C.3) to write
\[
\Pi_{i,ja} = \left\{ \pi \in \{0,1\}^{\Sigma^*_{i,j}} \;:\;
\begin{array}{l}
\pi[ja] = 1, \qquad \pi[ja'] = 0\ \ \forall a'\in A_j,\, a'\ne a, \\
\pi^{(j')} \in \Pi_{i,j'}\ \ \forall j'\in\mathcal{C}_{ja}, \qquad \pi^{(j')} = \mathbf{0}\ \ \forall j'\in\textstyle\bigcup_{a'\in A_j,\, a'\ne a}\mathcal{C}_{ja'}
\end{array}
\right\},
\]
which fundamentally uncovers the Cartesian-product structure of Π_{i,ja}. Using the same technique as in Theorem 27, we then have
\[
\sum_{\pi\in\Pi_{i,ja}} \prod_{\sigma\in\pi} x[\sigma]
= \sum_{\pi^{(j')}\in\Pi_{i,j'}\ \forall j'\in\mathcal{C}_{ja}} x[ja] \prod_{j'\in\mathcal{C}_{ja}} \prod_{\sigma\in\pi^{(j')}} x[\sigma]
= x[ja] \prod_{j'\in\mathcal{C}_{ja}} \sum_{\pi^{(j')}\in\Pi_{i,j'}} \prod_{\sigma\in\pi^{(j')}} x[\sigma]
= x[ja] \prod_{j'\in\mathcal{C}_{ja}} K_{j'}(x, \mathbf{1}),
\]
and the statement is proven.
Proposition 26. The number of vertices of Q_i is upper bounded by A^{\|\mathcal{Q}_i\|_1}, where A = \max_{j\in\mathcal{J}_i} |A_j| is the largest number of possible actions, and \|\mathcal{Q}_i\|_1 = \max_{q\in\mathcal{Q}_i} \|q\|_1.
Proof. The proof is by induction. As the base case, consider a single decision point Δ^b with b ≤ A actions. Then the number of vertices is b ≤ A = A^{\|\Delta^b\|_1}.
For the induction step we consider two cases. First, consider a polytope Q whose root is a decision point with b ≤ A actions, with each action a leading to a polytope Q_a whose number of vertices v_a satisfies the inductive assumption (if some action a is a terminal action then we overload notation and let v_a = 1 and \|\mathcal{Q}_a\|_1 = 0). Then, the number of vertices of Q is
\[
\sum_{a=1}^b v_a \le \sum_{a=1}^b A^{\|\mathcal{Q}_a\|_1} \le b\cdot A^{\max_{a\in[[b]]}\|\mathcal{Q}_a\|_1}
\le A\cdot A^{\max_{a\in[[b]]}\|\mathcal{Q}_a\|_1} = A^{\|\mathcal{Q}\|_1}.
\]
Second, consider a polytope Q whose root is an observation point with b observations, with each observation o leading to a polytope Q_o with v_o vertices, such that the inductive assumption holds. Then, the number of vertices of Q is
\[
v = \prod_{o=1}^b v_o \le \prod_{o=1}^b A^{\|\mathcal{Q}_o\|_1} \le A^{\sum_{o=1}^b \|\mathcal{Q}_o\|_1} = A^{\|\mathcal{Q}\|_1}.
\]
C.6 Further Applications
In this appendix, we illustrate additional 0/1-polyhedral domains in which our polyhedral kernel can be computed efficiently.
C.6.1 n-sets
We start from n-sets, that is, the 0/1-polyhedral set Ω^d_n = co{π ∈ {0, 1}^d : ‖π‖_1 = n}. Learning over n-sets is a classic problem first considered by Warmuth and Kuzmin [160] with an application to online Principal Component Analysis. They proposed an Online Mirror Descent algorithm operating over the convex hull Ω^d_n, with per-iteration complexity O(d^2). The Follow-the-Perturbed-Leader approach [88] is even faster, with per-iteration complexity O(d log d), but it often leads to sub-optimal regret bounds (see the discussion in [92]). Simulating MWU over the vertices of Ω^d_n has been considered in, for example, [24], where the general approach of [151] was proposed to implement this algorithm, leading to per-iteration complexity O(d^2 n). Below, we show that our kernelized approach admits an even faster per-iteration complexity of O(d min{n, d − n}).
C.6.1.1 Polynomial, O(d min{n, d − n})-time kernel evaluation
Let x, y ∈ R^d, and assume for now that n ≤ d − n. Introduce the polynomial p_{x,y}(z) in z, defined as
\[
p_{x,y}(z) = (x[1]y[1]\,z + 1)\cdots(x[d]y[d]\,z + 1).
\]
It is immediate to see that the coefficient of z^n in the expansion of p_{x,y}(z) is exactly K_{Ω^d_n}(x, y). Such a coefficient can be computed by directly carrying out the multiplication of the binomial terms, keeping track of the terms of degree 0, …, n. So, each evaluation of K_{Ω^d_n}(x, y) can be carried out in O(nd) time under the assumption that n ≤ d − n.
If, on the other hand, n > d − n, we can repeat the whole argument above for the polynomial q_{x,y}(z) = (z + x[1]y[1])⋯(z + x[d]y[d]) instead. In that case, we are interested in the coefficient of z^{d−n}, which can be computed in O(d(d − n)) time using the same procedure described above.
Putting together the two cases, we conclude that the computation of K_{Ω^d_n}(x, y) requires O(d min{n, d − n}) time.
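A minimal sketch of the truncated polynomial multiplication above, under the assumption n ≤ d − n, is given next; the function name and array layout are illustrative assumptions.
```python
def nset_kernel(x, y, n):
    """Coefficient of z^n in prod_i (x[i]*y[i]*z + 1), i.e. K_{Omega^d_n}(x, y).

    Assumes n <= d - n; otherwise apply the symmetric argument from the text.
    coeff[h] stores the coefficient of z^h of the running (truncated) product.
    """
    d = len(x)
    coeff = [0.0] * (n + 1)
    coeff[0] = 1.0
    for i in range(d):
        w = x[i] * y[i]
        # multiply the running polynomial by (w*z + 1), truncated at degree n;
        # iterate h downward so coeff[h-1] is still the old value
        for h in range(min(i + 1, n), 0, -1):
            coeff[h] += w * coeff[h - 1]
    return coeff[n]
```
The double loop performs O(dn) work, matching the stated bound.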
C.6.1.2 Implementing KOMWU with O(d min{n, d − n}) Per-Iteration Complexity
The result described in the previous paragraph immediately implies that KOMWU can be implemented with O(d^2 min{n, d − n})-time iterations. In this subsection we refine that result by showing that it is possible to compute the d kernel evaluations {K_{Ω^d_n}(x, ē_k) : k = 1, …, d} required at every iteration of KOMWU so that they take cumulative O(d · min{n, d − n}) time.
To do so, we build on the technique described in the previous subsection. Assume again that n ≤ d − n. The key insight is that the coefficient of z^n of the polynomial p_{x,1}(z)/(x[j] z + 1) is exactly K_{Ω^d_n}(x, ē_j). So, to compute all {K_{Ω^d_n}(x, ē_k) : k = 1, …, d} we can do the following:
1. First, for all k = 0, …, d and h = 0, …, n, we compute the coefficient A[k, h] of z^h in the expansion of (x[1] z + 1)⋯(x[k] z + 1). We can compute all such values in O(dn) time by using dynamic programming. In particular, we have
\[
A[k, h] =
\begin{cases}
1 & \text{if } h = 0 \\
0 & \text{if } k = 0 \text{ and } h \ne 0 \\
A[k-1, h] + x[k]\cdot A[k-1, h-1] & \text{otherwise.}
\end{cases}
\]
2. Then, for all k = 1, …, d + 1 and h = 0, …, n, we compute the coefficient B[k, h] of z^h in the expansion of (x[k] z + 1)⋯(x[d] z + 1). Again, we can do that in O(dn) time by using dynamic programming. Specifically,
\[
B[k, h] =
\begin{cases}
1 & \text{if } h = 0 \\
0 & \text{if } k = d + 1 \text{ and } h \ne 0 \\
B[k+1, h] + x[k]\cdot B[k+1, h-1] & \text{otherwise.}
\end{cases}
\]
3. (Note that at this point, K_{Ω^d_n}(x, 1) is simply A[d, n].)
4. For each k = 1, …, d, K_{Ω^d_n}(x, ē_k) can be computed as
\[
K_{\Omega^d_n}(x, \bar e_k) = \sum_{h=0}^n A[k-1, h]\cdot B[k+1, n-h].
\]
The above formula takes O(n) time to evaluate (we need to iterate over h = 0, …, n), and we need to evaluate it d times (once for each k = 1, …, d). So, computing all {K_{Ω^d_n}(x, ē_k) : k = 1, …, d} takes cumulative O(dn) time, as we wanted to show.
As in the previous subsection, the case n > d − n is symmetric. In that case, the set of values {K_{Ω^d_n}(x, ē_k) : k = 1, …, d} can be computed in cumulative O(d(d − n)) time.
C.6.2 Unit Hypercube
Consider the hypercube [0, 1]^d, whose vertices are all the vectors in {0, 1}^d. In this case, the polyhedral kernel is simply
\[
K_{[0,1]^d}(x, y) = (x[1]\cdot y[1] + 1)\cdots(x[d]\cdot y[d] + 1),
\]
which can clearly be evaluated in O(d) time. Similarly to n-sets (Appendix C.6.1), we can avoid paying an extra d factor in the per-iteration complexity of KOMWU by using the following procedure:
1. For each k = 0, …, d, define A[k] = (x[1]·y[1] + 1)⋯(x[k]·y[k] + 1). Clearly, the A[k] values can be computed in O(d) cumulative time.
2. For each k = 1, …, d + 1, define B[k] = (x[k]·y[k] + 1)⋯(x[d]·y[d] + 1). Again, all B[k] values can be computed in O(d) cumulative time.
3. For each k = 1, …, d, we have that K_{[0,1]^d}(x, ē_k) = A[k−1]·B[k+1]. Hence, we can compute {K_{[0,1]^d}(x, ē_k) : k = 1, …, d} in cumulative O(d) time.
C.6.3 Flows in Directed Acyclic Graphs
The polytope F of flows in a generic directed acyclic graph (DAG) has vertices with 0/1 integer coordinates, corresponding to paths in the DAG. The 0/1-polyhedral kernel K_F corresponding to the set of flows in a DAG coincides with the kernel function introduced by Takimoto and Warmuth [151], which was shown to be computable in polynomial time in the size of the DAG. Consequently, K_F admits polynomial-time (in the size of the DAG) evaluation.
C.6.4 Permutations
When P is the convex hull of the set of all d × d permutation matrices, it is believed that K_P cannot be evaluated in time polynomial in d, since the computation of the permanent of a matrix A can be expressed as K_P(A, 1). However, an ε-approximate computation of K_P can be performed in time polynomial in (d, log(1/ε)) for any ε > 0 by using a landmark result by Jerrum, Sinclair, and Vigoda [87]. We refer the interested reader to the paper by Cesa-Bianchi and Lugosi [24].
C.6.5 Cartesian Product
Finally, we remark that when two 0/1-polyhedral sets have efficiently computable 0/1-polyhedral kernels, then so does their Cartesian product. Specifically, let Ω ⊆ R^d, Ω' ⊆ R^{d'} be 0/1-polyhedral sets, and let K_Ω, K_{Ω'} be their corresponding 0/1-polyhedral kernels. Then, it follows immediately from the definition that the polyhedral kernel of Ω × Ω' satisfies
\[
K_{\Omega\times\Omega'}\!\left( \begin{pmatrix} x \\ x' \end{pmatrix}, \begin{pmatrix} y \\ y' \end{pmatrix} \right)
= K_\Omega(x, y)\cdot K_{\Omega'}(x', y')
\qquad \forall\, \begin{pmatrix} x \\ y \end{pmatrix}, \begin{pmatrix} x' \\ y' \end{pmatrix} \in \mathbb{R}^d\times\mathbb{R}^{d'}.
\]
Appendix D
Omitted Details in Chapter 5
D.1 Omitted Proofs
In this section we include all of the proofs omitted from the main body. For the convenience of the reader,
we will restate each claim before proceeding with its proof.
D.1.1 Preliminary Proofs
We commence with the proof of Proposition 32.
Proposition 32. For any η ≥ 0 and at all times t ∈ N, the OFTRL optimization problem on Line 3 of Algorithm 2 admits a unique optimal solution (λ^{(t)}, y^{(t)}) ∈ X̃ ∩ R^{d+1}_{>0}.
Proof. Uniqueness follows immediately from strict convexity. In the rest of the proof we focus on the existence part.
We start by showing that there exists a point x̃ ∈ X̃ whose coordinates are all strictly positive. By hypothesis (see Section 5.2.1), for every coordinate r ∈ [[d]], there exists a point x_r such that x_r[r] > 0. Hence, by convexity of X ⊆ [0, +∞)^d and by definition of X̃, the point
\[
(1, x^\circ) = \left( 1,\; \frac{1}{d}\sum_{r=1}^d x_r \right)
\]
is such that (1, x°) ∈ X̃ ∩ R^{d+1}_{>0}.
Let now M be the ℓ_∞ norm of the linear part in the OFTRL step (Line 3 of Algorithm 2). Then, a lower bound on the optimal value v^⋆ of the objective, obtained by plugging in the point (1, x°), is
\[
v^\star \ge -M(1 + \|\mathcal{X}\|_1) + \sum_{r=1}^d \log x^\circ[r]. \tag{D.1}
\]
Let now
\[
m = \exp\left\{ -(2M + d)(1 + \|\mathcal{X}\|_1) + \sum_{r=1}^d \log x^\circ[r] \right\} > 0. \tag{D.2}
\]
We will show that any point (λ, y) ∉ [m, +∞)^{d+1} ∩ X̃ cannot be optimal for the OFTRL objective. Indeed, take a point (λ, y) ∉ [m, +∞)^{d+1} ∩ X̃. Then, at least one coordinate of (λ, y) is strictly less than m. If λ < m, then the objective value at (λ, y) is at most
\[
\begin{aligned}
M\lambda + M\|\mathcal{X}\|_1 + \log\lambda + \sum_{r=1}^d \log y[r]
&\le M(1 + \|\mathcal{X}\|_1) + \log m + \sum_{r=1}^d \log\|\mathcal{X}\|_1 \\
&\le M(1 + \|\mathcal{X}\|_1) + \log m + d(\|\mathcal{X}\|_1 - 1) \\
&< (M + d)(1 + \|\mathcal{X}\|_1) + \log m \\
&= -M(1 + \|\mathcal{X}\|_1) + \sum_{r=1}^d \log x^\circ[r]
\quad \text{(from (D.2))} \\
&\le v^*,
\quad \text{(from (D.1))}
\end{aligned}
\]
where the first inequality follows from upper bounding every coordinate of y by ‖X‖_1, and the second inequality follows from the inequality log z ≤ z − 1, valid for all z ∈ (0, +∞). Similarly, if y[s] < m for some s ∈ [[d]], then we can upper bound the objective value at (λ, y) as
\[
\begin{aligned}
M + M\|\mathcal{X}\|_1 + \log 1 + \sum_{r=1}^d \log y[r]
&\le M(1 + \|\mathcal{X}\|_1) + \log m + \sum_{r=1}^d \log\|\mathcal{X}\|_1 \\
&\le M(1 + \|\mathcal{X}\|_1) + (d - 1)(\|\mathcal{X}\|_1 - 1) + \log m \\
&< (M + d)(1 + \|\mathcal{X}\|_1) + \log m \le v^*.
\end{aligned}
\]
So, in either case, we see that no optimal point can have any coordinate strictly less than m. Consequently, the maximizer of the OFTRL step lies in the set S = [m, +∞)^{d+1} ∩ X̃. Since both [m, +∞)^{d+1} and X̃ are closed, and since X̃ is bounded by hypothesis, the set S is compact. Furthermore, note that S is nonempty, as (1, x°) ∈ S, since for any s ∈ [[d]]
\[
\begin{aligned}
\log m &= -(2M + d)(1 + \|\mathcal{X}\|_1) + \sum_{r=1}^d \log x^\circ[r] \\
&\le -(2M + d)(1 + \|\mathcal{X}\|_1) + \log x^\circ[s] + (d - 1)\log\|\mathcal{X}\|_1 \\
&\le -(2M + d)(1 + \|\mathcal{X}\|_1) + \log x^\circ[s] + (d - 1)(\|\mathcal{X}\|_1 - 1) \\
&\le \log x^\circ[s],
\end{aligned}
\]
implying that (1, x°) ∈ [m, +∞)^{d+1}. Since S is compact and nonempty and the objective function is continuous, the optimization problem attains an optimal solution on S by virtue of Weierstrass' theorem.
Theorem 33. For any time T ∈ N it holds that Reg ˜
T
= max{0, RegT
}. In particular, it follows that
Reg ˜
T
≥ 0 and RegT ≤ Reg ˜
T
for any T ∈ N.
247
Proof. First, by definition of u˜
t
in Line 6, it follows that for any t,
*
u˜
t
,
λ
t
y
t
+
=
*
u˜
t
,
1
x
t
+
= 0.
As a result, we have that max{0, RegT
} is equal to
max(
0, max
x∗∈X
X
T
t=1
⟨u
t
, x
∗ − x
t
⟩
)
= max
0, max
x∗∈X
X
T
t=1*
u˜
t
,
1
x
∗
−
1
x
t
+
= max
0, max
x∗∈X
X
T
t=1*
u˜
t
,
1
x
∗
+
= max
(λ∗,y∗)∈X˜
X
T
t=1*
u˜
t
,
λ
∗
y
∗
+
= max
(λ∗,y∗)∈X˜
X
T
t=1*
u˜
t
,
λ
∗
y
∗
−
λ
t
y
t
+
= Reg ˜
T
,
as we wanted to show.
D.1.2 Analysis of OFTRL with Logarithmic Regularizer
For notational convenience, we define the log-regularizer ψ : X →˜ R≥0 as
ψ(x˜) = −
1
η
X
d+1
r=1
log x˜[r],
and its induced Bregman divergence
gx˜z˜ =
1
η
X
d+1
r=1
h
x˜[r]
z˜[r]
, where h(a) = a − 1 − ln(a).
248
Moreover, we define
x˜
(t) = argmax
x˜∈X˜
−Ft(x˜) = argmin
x˜∈X˜
Ft(x˜), where Ft(x˜) = −
D
U˜ (t) + u˜
(t−1)
, x˜
E
+ ψ(x˜). (D.3)
We note that Ft
is a convex function for each t and x˜
(t)
is exactly equal to
λ
t
y
t
computed by Algorithm 2.
Further, we define an auxiliary sequence {z˜
(t)}t=1,2,... defined as follows.
z˜
(t) = argmax
x˜∈X˜
−Gt(x˜) = argmin
x˜∈X˜
Gt(x˜), where Gt(x˜) = −
D
U˜ (t)
, x˜
E
+ ψ(x˜). (D.4)
Similarly, Gt
is a convex function for each t. We also recall the primal and dual norm notation:
∥z˜∥t =
X
d+1
r=1
z˜[r]
x˜
t
[r]
2
, ∥z˜∥∗,t =
X
d+1
r=1
x˜
t
[r]z˜[r]
2
.
Finally, for a (d + 1) × (d + 1) positive definite matrix M, we use ∥z˜∥M to denote the induced quadratic
norm √
z˜⊤Mz˜. We are now ready to establish Proposition 34.
Proposition 34 (RVU bound of OFTRL in local norms). Let Reg ˜
T
be the regret cumulated up to time T by
the internal OFTRL algorithm. Then, for any time horizon T ∈ N and learning rate η ≤
1
50 ,
Reg ˜
T
≤ 4 + (d + 1) log T
η
+ 5η
X
T
t=1
u˜
t − u˜
t−1
2
∗,t −
1
27η
T
X−1
t=1
λ
t+1
y
t+1
−
λ
t
y
t
2
t
.
24
Proof of Proposition 34. For any comparator x˜ ∈ X˜, define x˜
′ =
T −1
T
· x˜ +
1
T
· x˜
(1) ∈ X˜, where we recall
x˜
(1) = argminx˜∈X˜ F1(x˜) = argminx˜∈X˜ ψ(x˜). Then, we have
X
T
t=1
D
x˜ − x˜
(t)
,u˜
(t)
E
=
X
T
t=1
D
x˜ − x˜
′
,u˜
(t)
E
+
X
T
t=1
D
x˜
′ − x˜
(t)
,u˜
(t)
E
=
1
T
X
T
t=1
D
x˜ − x˜
(1)
,u˜
(t)
E
+
X
T
t=1
D
x˜
′ − x˜
(t)
,u˜
(t)
E
≤ 4 +X
T
t=1
D
x˜
′ − x˜
(t)
,u˜
(t)
E
,
where the last inequality follows from Cauchy-Schwarz together with the assumption that ∥u
t∥∞ ≤
1
∥X ∥1
.
Now, by standard Optimistic FTRL analysis (see Lemma 103), the last term PT
t=1
x˜
′ − x˜
(t)
,u˜
(t)
(cumulative regret against x˜
′
) is bounded by
X
T
t=1
D
x˜
′ − x˜
(t)
,u˜
(t)
E
≤ ψ(x˜
′
) − ψ(x˜
(1)) +X
T
t=1
D
z˜
(t+1) − x˜
(t)
,u˜
(t) − u˜
(t−1)E
−
X
T
t=1
gx˜
(t)z˜
(t) + gz˜
(t+1)x˜
(t)
.
For the term ψ(x˜
′
) − ψ(x˜
(1)), a direct calculation using definitions shows
ψ(x˜
′
) − ψ(x˜
(1)) = 1
η
X
d+1
i=1
log x˜
(1)[i]
x˜
′
[i]
≤
d + 1
η
log T.
For the other terms, we apply Lemma 104 and Lemma 106, which completes the proof.
Lemma 103. The update rule (D.3) ensures the following for any x˜ ∈ X˜:
X
T
t=1
D
x˜ − x˜
(t)
,u˜
(t)
E
≤ ψ(x˜) − ψ(x˜
(1)) +X
T
t=1
D
z˜
(t+1) − x˜
(t)
,u˜
(t) − u˜
(t−1)E
−
X
T
t=1
gx˜
(t)z˜
(t) + gz˜
(t+1)x˜
(t)
.
250
Proof. First note that for any convex function F : X →˜ R and a minimizer x˜
⋆
, we have for any x˜ ∈ X˜:
F(x˜
⋆
) = F(x˜) − ⟨∇F(x˜
⋆
), x˜ − x˜
⋆
⟩ − DF (x˜, x˜
⋆
) ≤ F(x˜) − DF (x˜, x˜
⋆
),
where DF is the Bregman Divergence induced by F and the inequality is by the first-order optimality.
Using this fact and the optimality of z˜
(t)
, we have
Gt(z˜
(t)
) ≤ Gt(x˜
(t)
) − gx˜
(t)z˜
(t)
= Ft(x˜
(t)
) + D
x˜
(t)
,u˜
(t−1)E
− gx˜
(t)z˜
(t)
Similarly, using the optimality of x˜
(t)
, we have
Ft(x˜
(t)
) ≤ Ft(z˜
(t+1)) − gz˜
(t+1)x˜
(t)
= Gt+1(z˜
(t+1)) + D
z˜
(t+1)
,u˜
(t) − u˜
(t−1)E
− gz˜
(t+1)x˜
(t)
Combining the inequalities and summing over t, we have
G1(z˜
(1)) ≤ GT +1(z˜
(T +1)) +X
T
t=1
Dx˜
(t)
,u˜
(t)
E
+
D
z˜
(t+1) − x˜
(t)
,u˜
(t) − u˜
(t−1)E
+
X
T
t=1
−gx˜
(t)z˜
(t) − gz˜
(t+1)x˜
(t)
.
Observe that G1(z˜
(1)) = ψ(x˜
(1)) and GT +1(z˜
(T +1)) ≤ − D
x˜, U˜ (T +1)E
+ψ(x˜). Rearranging then proves
the lemma.
251
Lemma 104. If η ≤
1
50 , then we have
z˜
(t+1) − x˜
(t)
t
≤ 5η
u˜
(t) − u˜
(t−1)
∗,t
≤ 10√
2η ≤ 15η, (D.5)
x˜
(t+1) − x˜
(t)
t
≤ 5η
2u˜
(t) − u˜
(t−1)
∗,t
≤ 15√
2η ≤ 22η. (D.6)
Proof. The second part of both inequalities is clear by definitions:
u˜
(t) − u˜
(t−1)
2
∗,t
=
λ
t
x
t
,u
t
−
x
t−1
,u
t−1
2
+
X
d
r=1
y
t
[r]
u
t
[r] − u
t−1
[r]
2
≤ 4(λ
t
)
2 +
4
∥X ∥2
1
X
d
r=1
y
t
[r]
2 ≤ 8,
where we use ⟨x
τ
,u
τ
⟩ ≤ ∥x
τ ∥1∥u
τ ∥∞ ≤ 1 and |u
τ
[r]| ≤ 1
∥X ∥1
for any time τ and any coordinate r by
the assumption, and similarly,
2u˜
(t) − u˜
(t−1)
2
∗,t
=
λ
t
2
x
t
,u
t
−
x
t−1
,u
t−1
2
+
X
d
r=1
y
t
[r]
2u
t
[r] − u
t−1
[r]
2
≤ 9(λ
t
)
2 +
9
∥X ∥2
1
X
d
r=1
y
t
[r]
2 ≤ 18
To prove the first inequality in Equation (D.5), let Et =
n
x˜ :
x˜ − x˜
(t)
t
≤ 5η
u˜
(t) − u˜
(t−1)
∗,to
. Noticing that z˜
(t+1) is the minimizer of the convex function Gt+1, to show z˜
(t+1) ∈ Et
, it suffices to show
for all x˜ on the boundary of Et
, we have Gt+1(x˜) ≥ Gt+1(x˜
(t)
). Indeed, using Taylor’s theorem, for any
such x˜, there is a point ξ on the line segment between x˜
(t)
and x˜ such that
Gt+1(x˜) = Gt+1(x˜
(t)
) + D
∇Gt+1(x˜
(t)
), x˜ − x˜
(t)
E
+
1
2
x˜ − x˜
(t)
2
∇2Gt+1(ξ)
= Gt+1(x˜
(t)
) −
D
u˜
(t) − u˜
(t−1)
, x˜ − x˜
(t)
E
+
D
∇Ft(x˜
(t)
), x˜ − x˜
(t)
E
+
1
2
x˜ − x˜
(t)
2
∇2ψ(ξ)
≥ Gt+1(x˜
(t)
) −
D
u˜
(t) − u˜
(t−1)
, x˜ − x˜
(t)
E
+
1
2
x˜ − x˜
(t)
2
∇2ψ(ξ)
(by the optimality of x˜
(t)
)
≥ Gt+1(x˜
(t)
) −
u˜
(t) − u˜
(t−1)
∗,t
x˜ − x˜
(t)
t
+
1
2
x˜ − x˜
(t)
2
∇2ψ(ξ)
.
(by Hölder’s inequality)
≥ Gt+1(x˜
(t)
) −
u˜
(t) − u˜
(t−1)
∗,t
x˜ − x˜
(t)
t
+
2
9η
x˜ − x˜
(t)
2
t
(⋆)
= Gt+1(x˜
(t)
) + 5
9
η
u˜
(t) − u˜
(t−1)
2
∗,t
(
x˜ − x˜
(t)
t
= 5η
u˜
(t) − u˜
(t−1)
∗,t)
≥ Gt+1(x˜
(t)
).
Here, the inequality (⋆) holds because Lemma 105 (together with the condition η ≤
1
50 ) shows 1
2
x˜
t
[i] ≤
x˜[i] ≤
3
2
x˜
t
[i], which implies 1
2
x˜
t
[i] ≤ ξ[i] ≤
3
2
x˜
t
[i] as well, and thus ∇2ψ(ξ) ⪰
4
9∇2ψ(x˜
(t)
). This finishes
the proof for Equation (D.5). The first inequality of Equation (D.6) can be proven in the same manner.
Proposition 35 (Multiplicative Stability). For any time t ∈ N and learning rate η ≤
1
50 ,
λ
t+1
y
t+1
−
λ
t
y
t
t
≤ 22η.
Proof. The statement is proved in Lemma 104.
Lemma 105. If x˜ satisfies ∥x˜ − x˜
(t)∥t ≤
1
2
, then 1
2
x˜
t
[i] ≤ x˜[i] ≤
3
2
x˜
t
[i] for every coordinate i.
253
Proof. By definition, ∥x˜ − x˜
(t)∥t ≤
1
2
implies for any i,
|x˜[i]−x˜
t
[i]|
x˜
t
[i] ≤
1
2
, and thus 1
2
x˜
t
[i] ≤ x˜[i] ≤
3
2
x˜
t
[i].
Lemma 106. If η ≤
1
50 , then we have
X
T
t=1
gx˜
(t)z˜
(t) + gz˜
(t+1)x˜
(t)
≥
1
27η
T
X−1
t=1
x˜
(t+1) − x˜
(t)
2
t
.
Proof. Recall h(a) = a − 1 − ln(a) and gx˜z˜ =
1
η
Pd+1
i=1 h
x˜[i]
z˜[i]
. We proceed as
X
T
t=1
gx˜
(t)z˜
(t) + gz˜
(t+1)x˜
(t)
≥
T
X−1
t=1
gx˜
(t+1)z˜
(t+1) + gz˜
(t+1)x˜
(t)
=
1
η
T
X−1
t=1
X
d+1
i=1
h
x˜
t+1[i]
z˜
(t+1)[i]
+ h
z˜
(t+1)[i]
x˜
t
[i]
!!
≥
1
6η
T
X−1
t=1
X
d+1
i=1
(x˜
t+1[i] − z˜
(t+1)[i])2
z˜
(t+1)[i]
2 +
(z˜
(t+1)[i] − x˜
t
[i])2
(x˜
t
[i])2
!
(h(y) ≥
(y−1)2
6
for y ∈ [
1
3
, 3])
≥
2
27η
T
X−1
t=1
X
d+1
i=1
(x˜
t+1[i] − z˜
(t+1)[i])2
(x˜
t
[i])2 +
(z˜
(t+1)[i] − x˜
t
[i])2
(x˜
t
[i])2
!
≥
1
27η
T
X−1
t=1
X
d+1
i=1
(x˜
t+1[i] − x˜
t
[i])2
(x˜
t
[i])2
=
1
27η
T
X−1
t=1
x˜
(t+1) − x˜
(t)
2
t
.
Here, the second and the third inequality hold because by Lemma 104 and Lemma 105, we have 1
2 ≤
z˜
t+1[i]
x˜
t
[i] ≤
3
2
and 1
2 ≤
x˜
t+1[i]
x˜
t
[i] ≤
3
2
, and thus 1
3 ≤
x˜
t+1[i]
z˜
t+1[i] ≤ 3.
D.1.3 RVU Bound in the Original Space
Next, we establish an RVU bound in the original (unlifted) space, namely Corollary 37. To this end, we
first proceed with the proof of Lemma 36, which boils down to the following simple claim.
25
Lemma 107. Let (λ, y),(λ
′
, y
′
) ∈ X ∩˜ R
d+1
>0
be arbitrary points such that
λ
′
y
′
−
λ
y
(λ,y)
≤
1
2
.
Then,
y
λ
−
y
′
λ′
1
≤ 4∥X ∥1 ·
λ
′
y
′
−
λ
y
(λ,y)
.
Proof. Let µ be defined as
µ = max
λ
′
λ
− 1
, max
r∈[[d]]
y
′
[r]
y[r]
− 1
. (D.7)
By definition,
λ
′
λ
− 1
≤ µ,
which in turn implies that
(1 − µ)λ ≤ λ
′ ≤ (1 + µ)λ. (D.8)
Similarly, for any r ∈ [[d]],
(1 − µ)y[r] ≤ y
′
[r] ≤ (1 + µ)y[r]. (D.9)
As a result, combining (D.8) and (D.9) we get that for any r ∈ [[d]],
y
′
[r]
λ′
−
y[r]
λ
≤
1 + µ
1 − µ
− 1
y[r]
λ
≤ 4µ
y[r]
λ
= 4µx[r],
255
since µ ≤
1
2
. Similarly, by (D.8) and (D.9),
y[r]
λ
−
y
′
[r]
λ′
≤
1 −
1 − µ
1 + µ
y[r]
λ
≤ 2µ
y[r]
λ
= 2µx[r].
Thus, it follows that
y
′
[r]
λ′
−
y[r]
λ
≤ 4µx[r],
in turn implying that
∥x
′ − x∥1 =
X
d
r=1
y
′
[r]
λ′
−
y[r]
λ
≤ 4µ
X
d
r=1
x[r] ≤ 4µ∥X ∥1. (D.10)
Moreover, by definition of (D.7),
(µ)
2 ≤
λ
′
y
′
−
λ
y
2
t
.
Finally, combining this bound with (D.10) concludes the proof.
Lemma 36. For any time t ∈ N and learning rate η ≤
1
50 ,
∥x
t+1 − x
t
∥1 ≤ 4∥X ∥1
λ
t+1
y
t+1
−
λ
t
y
t
t
.
Proof. Since η ≤
1
50 by assumption, we have
x
t+1 − x
t
t
≤ 22η <
1
2
.
Hence, we are in the domain of applicability of Lemma 107, which immediately yields the statement.
256
Corollary 37 (RVU bound in the original (unlifted) space). For any time T ∈ N and η ≤
1
256 ,
Reg ˜
T
≤ 6 +
(d + 1) log T
η
+ 16η∥X ∥2
1
T
X−1
t=1
∥u
t+1 − u
t
∥
2
∞ −
1
512η∥X ∥2
1
T
X−1
t=1
∥x
t+1 − x
t
∥
2
1
.
Proof. First, by definition of the induced dual local norm in (5.3),
∥u˜
t − u˜
t−1
∥
2
∗,t ≤ (⟨x
t
,u
t
⟩ − ⟨x
t−1
,u
t−1
⟩)
2
(λ
t
)
2 +
X
d
r=1
(y[r])2
(u
t
[r] − u
t−1
[r])2
≤ (⟨x
t
,u
t
⟩ − ⟨x
t−1
,u
t−1
⟩)
2 +
X
d
r=1
(x[r])2
(u
t
[r] − u
t−1
[r])2
≤
⟨x
t
,u
t
⟩ − ⟨x
t−1
,u
t−1
⟩
2
+ ∥X ∥2
1∥u
t − u
t−1
∥
2
∞, (D.11)
for any t ≥ 2. Further, by Young’s inequality,
⟨x
t
,u
t
⟩ − ⟨x
t−1
,u
t−1
⟩
2
≤ 2
⟨x
t
,u
t − u
t−1
⟩
2
+ 2
⟨x
t − x
t−1
,u
t−1
⟩
2
≤ 2∥X ∥2
1∥u
t − u
t−1
∥
2
∞ +
2
∥X ∥2
1
∥x
t − x
t−1
∥
2
1
.
Combining with (D.11),
∥u˜
t − u˜
t−1
∥
2
∗,t ≤ 3∥X ∥2
1∥u
t − u
t−1
∥
2
∞ +
2
∥X ∥2
1
∥x
t − x
t−1
∥
2
1
,
for t ≥ 2, since ∥u∥∞ ≤
1
∥X ∥1
(by assumption). Further, ∥u˜
(1) − u˜
(0)∥
2
∗,t = ∥u˜
(1)∥
2
∗,t ≤ 2. Combining
with Proposition 34 and Lemma 36, we get that Reg ˜
T
is upper bounded by
6 +
(d + 1) log T
η
+ 16η∥X ∥2
1
T
X−1
t=1
∥u
t+1 − u
t
∥
2
∞ +
1
∥X ∥2
1
10η −
1
432η
T
X−1
t=1
∥x
t+1 − x
t
∥
2
1
≤ 6 +
(d + 1) log T
η
+ 16η∥X ∥2
1
T
X−1
t=1
∥u
t+1 − u
t
∥
2
∞ −
1
512η∥X ∥2
1
T
X−1
t=1
∥x
t+1 − x
t
∥
2
1
.
D.1.4 Main Result: Proof of Theorem 39
Finally, we are ready to establish Theorem 39. To this end, the main ingredient is the bound on the secondorder path lengths predicted by Theorem 38, which is recalled below.
Theorem 38. Suppose that Assumption 2 holds for some parameter L > 0. If all players follow LRL-OFTRL
with learning rate η ≤ min n
1
256 ,
1
128nL∥X ∥2
1
o
, where ∥X ∥1 = maxi∈[[n]] ∥Xi∥1, then
Xn
i=1
T
X−1
t=1
∥x
t+1
i − x
t
i∥
2
1 ≤ 6144nη∥X ∥2
1 + 1024n(d + 1)∥X ∥2
1
log T.
Proof. By Assumption 2, it follows that for any player i ∈ [[n]],
∥u
t+1
i − u
t
i∥∞
2
≤
L
Xn
j=1
∥x
t+1
j − x
t
j∥1
2
≤ L
2n
Xn
j=1
∥x
t+1
j − x
t
j∥
2
1
,
by Jensen’s inequality. Hence, by Corollary 37 the regret RegT
i of each player i ∈ [[n]] can be upper
bounded by
6 +
(d + 1) log T
η
+ 16η∥X ∥2
1L
2n
Xn
j=1
T
X−1
t=1
∥x
t+1
j − x
t
j∥
2
1 −
1
512η∥X ∥2
1
T
X−1
t=1
∥x
t+1
i − x
t
i∥
2
1
,
Summing over all players i ∈ [[n]], we have that
Xn
i=1
Reg ˜
T
i ≤ 6n + n
(d + 1) log T
η
+
Xn
i=1
16η∥X ∥2
1L
2n
2 −
1
512η∥X ∥2
1
T
X−1
t=1
∥x
t+1
i − x
t
i∥
2
1
≤ 6n + n
(d + 1) log T
η
−
1
1024η∥X ∥2
1
Xn
i=1
T
X−1
t=1
∥x
t+1
i − x
t
i∥
2
1
,
25
since η ≤
1
256nL∥X ∥2
1
. Finally, the theorem follows since Pn
i=1 Reg ˜
T
i ≥ 0, which in turn follows directly
from Theorem 33.
Theorem 39 (Detailed Version of Theorem 31). Suppose that Assumption 2 holds for some parameter L > 0.
If all players follow LRL-OFTRL with learning rate η = min n
1
256 ,
1
128nL∥X ∥2
1
o
, then for any T ∈ N the regret
RegT
i
of each player i ∈ [[n]] can be bounded as
RegT
i ≤ 12 + 256(d + 1) max
nL∥X ∥2
1
, 2
log T. (5.4)
Furthermore, the algorithm can be adaptive so that if player i is instead facing adversarial utilities, then
RegT
i = O(
√
T).
Proof. First of all, by Assumption 2 we have that for any player i ∈ [[n]],
∥u
t+1
i − u
t
i∥
2
∞ ≤
L
Xn
i=1
∥x
t+1
i − x
t
i∥1
!2
≤ L
2n
Xn
j=1
∥x
t+1
j − x
t
j∥
2
1
.
Hence, summing over all t ∈ [T],
T
X−1
t=1
∥u
t+1
i − u
t
i∥
2
∞ ≤ L
2n
T
X−1
t=1
Xn
j=1
∥x
t+1
j − x
t
j∥
2
1
≤ 6144n
2L
2
η∥X ∥2
1 + 1024n
2L
2
(d + 1)∥X ∥2
1
log T,
where the last bound uses Theorem 38. As a result, from Corollary 37, if η =
1
128nL∥X ∥2
1
,
Reg ˜
T
i ≤ 6 +
(d + 1) log T
η
+ 16η∥X ∥2
1
T
X−1
t=1
∥u
t+1
i − u
t
i∥
2
∞
≤ 12 + 256(d + 1)nL∥X ∥2
1
log T.
259
Thus, the bound on RegT
i
follows directly since RegT
i ≤ Reg ˜
T
i
by Theorem 33. The case where η =
1
256
is analogous.
Next, let us focus on the adversarial bound. Each player can simply check whether there exists a time
t ∈ [T] such that
X
t−1
τ=1
∥u
(τ+1)
i − u
(τ)
i
∥
2
∞ > 6144n
2L
2
η∥X ∥2
1 + 1024n
2L
2
(d + 1)∥X ∥2
1
log t. (D.12)
In particular, we know from Theorem 38 that when all players follow the prescribed protocol (D.12) will
never by satisfied. On the other hand, if there exists time t so that (D.12) holds, then it suffices to switch
to any no-regret learning algorithm tuned to face adversarial utilities.
D.1.5 Extending the Analysis under Approximate Iterates
In this subsection we describe how to extend our analysis, and in particular Theorem 39, when the OFTRL
step of Algorithm 2 at time t is only computed with tolerance ϵ
(t)
, in the sense of (5.5). We start by
extending Proposition 34 below.
Proposition 108 (Extension of Proposition 34). Let Reg ˜
T
be the regret cumulated up to time T by the
internal OFTRL algorithm producing approximate iterates (λ
(t)
, y
(t)
) ∈ X˜, for any t ∈ [[T]]. Then, for any
T ∈ N and learning rate η ≤
1
50 ,
Reg ˜
T
≤ 4 + (d + 1) log T
η
+ 5η
X
T
t=1
u˜
t − u˜
t−1
2
∗,t −
1
27η
T
X−1
t=1
λ
t+1
⋆
y
t+1
⋆
−
λ
t
⋆
y
t
⋆
(λt
⋆
,yt
⋆
)
+2X
T
t=1
λ
t
y
t
−
λ
t
⋆
y
t
⋆
(λt
⋆
,yt
⋆
)
,
260
where
λ
t
⋆
y
t
⋆
= argmax
(λ,y)∈X˜
η
*
U˜ t + u˜
t−1
,
λ
y
+
+ log λ +
X
d
r=1
log y[r]
.
Proof. Fix any (λ
∗
, y
∗
) ∈ X˜. Then,
X
T
t=1*
u˜
t
,
λ
∗
y
∗
−
λ
t
y
t
+
=
X
T
t=1*
u˜
t
,
λ
∗
y
∗
−
λ
t
⋆
y
t
⋆
+
+
X
T
t=1*
u˜
t
,
λ
(t)
⋆
y
(t)
⋆
−
λ
t
y
t
+
≤
X
T
t=1*
u˜
t
,
λ
∗
y
∗
−
λ
t
⋆
y
t
⋆
+
+ 2X
T
t=1
λ
t
y
t
−
λ
t
⋆
y
t
⋆
(λt
⋆
,yt
⋆
)
,
where the last inequality uses Hölder’s inequality along with the fact that ∥u˜
(t)∥∗,(λ(t)
,y(t)) ≤ 2, which
in turn follows since ∥u
(t)∥∞∥X ∥1 ≤ 1 (by assumption). Finally, the proof follows as an immediate
consequence of Proposition 34.
We next proceed with the extension of Lemma 36.
Lemma 109 (Extension of Lemma 36). Suppose that ϵ
(t) ≤
1
8
, for any t ∈ [[T]]. Then, for any time t ∈ [[T −1]]
and learning rate η ≤
1
256 ,
∥x
t+1 − x
t
∥1 ≤ 8∥X ∥1
λ
t+1
⋆
y
t+1
⋆
−
λ
t
⋆
y
t
⋆
(λt
⋆
,yt
⋆
)
+ 16∥X ∥1ϵ
(t+1) + 8∥X ∥1ϵ
(t)
,
where x
(t) = y
(t)/λ(t)
.
261
Proof. First, by the triangle inequality,
λ
t+1
⋆
y
t+1
⋆
−
λ
t
⋆
y
t
⋆
(λt
⋆
,yt
⋆
)
≥
λ
t+1
y
t+1
−
λ
t
y
t
(λt
⋆
,yt
⋆
)
−
λ
t+1
⋆
y
t+1
⋆
−
λ
t+1
y
t+1
(λt
⋆
,yt
⋆
)
−
λ
t
⋆
y
t
⋆
−
λ
t
y
t
(λt
⋆
,yt
⋆
)
.
Now given that η ≤
1
50 , it follows from Proposition 35 that
λ
t+1
⋆
y
t+1
⋆
−
λ
t
⋆
y
t
⋆
(λt
⋆
,yt
⋆
)
≤
1
2
,
which in turn—combined with Lemma 105—implies that
λ
t+1
⋆
y
t+1
⋆
−
λ
t+1
y
t+1
(λt
⋆
,yt
⋆
)
≤ 2
λ
t+1
⋆
y
t+1
⋆
−
λ
t+1
y
t+1
(λ
t+1
⋆ ,y
t+1
⋆ )
.
Similarly, since ϵ
(t) ≤
1
8
, it follows that
λ
t+1
y
t+1
−
λ
t
y
t
(λt
⋆
,yt
⋆
)
≥
1
2
λ
t+1
y
t+1
−
λ
t
y
t
(λt
,yt)
.
As a result,
λ
t+1
⋆
y
t+1
⋆
−
λ
t
⋆
y
t
⋆
(λt
⋆
,yt
⋆
)
≥
1
2
λ
t+1
y
t+1
−
λ
t
y
t
(λt
,yt)
− 2ϵ
(t+1) − ϵ
(t)
. (D.13)
262
Next, we will prove that
max (
λ
(t+1)
λ(t)
− 1
, max
r∈[[d]]
y
(t+1)[r]
y(t)
[r]
− 1
)
≤
1
2
. (D.14)
Indeed, since ϵ
(t)
, ϵ(t+1) ≤
1
8
, it holds that
1 −
λ
(t)
λ
(t)
⋆
≤
1
8
=⇒
7
8
λ
(t)
⋆ ≤ λ
(t) ≤
9
8
λ
(t)
⋆ ,
and
1 −
λ
(t+1)
λ
(t+1)
⋆
≤
1
8
=⇒
7
8
λ
(t+1)
⋆ ≤ λ
(t+1) ≤
9
8
λ
(t+1)
⋆ .
Furthermore, for η ≤
1
256 ,
1 −
λ
(t+1)
⋆
λ
(t)
⋆
≤
1
10
=⇒
9
10
λ
(t)
⋆ ≤ λ
(t+1)
⋆ ≤
11
10
λ
(t)
⋆ ,
by Proposition 35 and Lemma 105. Thus,
2
3
λ
(t+1) ≤
7
8
10
11
8
9
λ
(t+1) ≤ λ
(t) ≤
9
8
10
9
8
7
λ
(t+1) ≤
3
2
λ
(t+1)
,
in turn implying that
1 −
λ
(t+1)
λ(t)
≤
1
2
.
Similarly, we conclude that for any r ∈ [[d]],
1 −
y
(t+1)[r]
y(t)
[r]
≤
1
2
,
263
confirming (D.14). Hence, following the proof of Lemma 107, we derive that
λ
t+1
y
t+1
−
λ
t
y
t
(λt
,yt)
≥
1
4∥X ∥1
y
(t+1)
λ(t+1) −
y
(t)
λ(t)
1
=
1
4∥X ∥1
∥x
(t+1) − x
(t)
∥1.
Combining this bound with (D.13) concludes the proof.
We also state the following immediate implication of Lemma 109.
Corollary 110. Suppose that ϵ
(t) ≤
1
8
, for any t ∈ [[T]]. Then, for any t ∈ [[T −1]] and learning rate η ≤
1
256 ,
∥x
(t+1) − x
(t)
∥
2
1 ≤ 192∥X ∥2
1
λ
t+1
⋆
y
t+1
⋆
−
λ
t
⋆
y
t
⋆
2
(λt
⋆
,yt
⋆
)
+ 768∥X ∥2
1
(ϵ
(t+1))
2 + 192∥X ∥2
1
(ϵ
(t)
)
2
,
where x
(t) = y
(t)/λ(t)
.
As a result, combining this bound with Proposition 108 extends Corollary 37 with an error term proportional to PT
t=1 ϵ
(t)
. Finally, the rest of the extension is identical to our proof of Theorem 39.
D.2 Implementation via Proximal Oracles
In this section we provide the omitted proofs from Section 5.2.5 regarding the implementation of LRL-OFTRL
using proximal oracles (recall Equation (5.6)).
264
D.2.1 The Proximal Newton Method
In this subsection we describe the proximal Newton algorithm of Tran-Dinh, Kyrillidis, and Cevher [154],
leading to Theorem 42 we presented in Section 5.2.5. More precisely, Tran-Dinh, Kyrillidis, and Cevher [154]
studied the following composite minimization problem:
min
x˜∈Rd+1
{F(x˜) = f(x˜) + g(x˜)} , (D.15)
where f is a (standard) self-concordant and convex function, and g : R
d+1 → R ∪ {+∞} is a proper,
closed and convex function. In our setting, we will let g be defined as
g(x˜) =
0 if x˜ ∈ X˜,
+∞ otherwise.
Further, for a given time t ∈ N, we let
f : x˜ 7→ −η
D
U˜ t + u˜
t−1
, x˜
E
−
X
d+1
r=1
log x˜[r].
Before we describe the proximal Newton method, let us define s˜k as follows.
s˜k = argmin
x˜∈X˜
f(x˜k) + (∇f(x˜k))⊤(x˜ − x˜k) + 1
2
(x˜ − x˜k)
⊤∇2
f(x˜k)(x˜ − x˜k)
, (D.16)
for some x˜k ∈ R
d+1
>0
. We point out that the optimization problem (D.16) can be trivially solved when we
have access to a (local) proximal oracle—given in Equation (5.6).
265
In this context, the proximal Newton method of Tran-Dinh, Kyrillidis, and Cevher [154] is given in Algorithm 15. Their algorithm proceeds in two phases. In the first phase we perform damped steps of proximal
Newton until we reach the region of quadratic convergence. Afterwards, we perform full steps of proximal
Newton until the desired precision ϵ > 0 has been reached. Below we summarize the main guarantee
regarding Algorithm 15, namely [154, Theorem 9].
Theorem 111 ( [154]). Algorithm 15 returns x˜K ∈ R
d+1
>0
such that ∥x˜K − x˜
∗∥x˜
∗ ≤ 2ϵ after at most
K =
f(x˜0) − f(x˜
∗
)
0.017
+
1.5 ln ln
0.28
ϵ
+ 2
iterations, for any ϵ > 0, where x˜
∗ = argminx˜ F(x˜), for the composite function F defined in (D.15).
To establish Theorem 42 from this guarantee, it suffices to initialize Algorithm 15 at every iteration
t ≥ 2 with x˜0 = x˜
(t−1) = (λ
(t−1)
, y
(t−1)). Then, as long as ϵ
(t−1) is sufficiently small, the number of
iterations predicted by Theorem 111 will be bounded by O(log log(1/ϵ)), in turn establishing Theorem 42.
Algorithm 15: Proximal Newton [154]
Data: Initial point x˜0
Precision ϵ > 0
Constant σ = 0.2
1 for k = 1, . . . , K do
2 Obtain the proximal Newton direction d˜
k ← s˜k − x˜k, where s˜k is defined in (D.16)
3 Set λk ← ∥d˜
k∥x˜k
4 if λk > 0.2 then
5 x˜k+1 ← x˜k + αkd˜
k, where αk = (1 + λk)
−1
[▷ Damped Step]
6 else if λk > ϵ then
7 x˜k+1 ← x˜k + d˜
k [▷ Full Step]
8 else
9 return x˜k
266
D.2.2 Proximal Oracle for Normal-Form and Extensive-Form Games
In order to show that the proximal oracle of Section 5.2.5 can be implemented efficiently for probability
simplexes (i.e., the strategy sets of normal-form games) and sequence-form strategy spaces (i.e., the strategy
sets of extensive-form games), we will prove a slightly stronger result concerning treeplex sets, of which
sequence-form strategy spaces are instances.
Definition 12. A set Q ⊆ [0, +∞)
d
, d ≥ 1, is treeplex if it is:
1. a simplex Q = ∆d
;
2. a Cartesian product of treeplex sets Q1 × · · · × QK; or
3. (Branching operation) a set of the form
△(Q1, . . . , QK) = {(x, x[1]q1, . . . , x[K]qK) : x ∈ ∆K, qk ∈ Qk ∀k ∈ [[K]]},
where Q1, . . . , QK are treeplex.
We will show that any treeplex Q is such that [0, 1]Q admits an efficient (positive-definite) quadratic
optimization oracle. This is sufficient, since it is well-known that every sequence-form strategy space X
is treeplex (e.g., Hoda et al. [81]) and therefore, by definition, so is the set {(1, x) : x ∈ X }.
Introduce the value function
VQ(t; g, w) = min
x∈tQ (
−g
⊤x +
1
2
X
d
r=1
x[r]
w[r]
2
)
(t ≥ 0, w > 0) (D.17)
267
(note the rescaling by t in the domain of the minimization). We will be interested in the derivative of
VQ(t; g, w), which we will denote as∗
λQ(t; g, w) = d
dtVQ(t; g, w).
Preliminaries on strictly monotonic piecewise-linear (SMPL) functions
Definition 13 (SMPL function and standard representation). Given an interval I ⊆ R and a function
f : I → R, we say that f is SMPL if it is strictly monotonically increasing and piecewise-linear on I.
Definition 14 (Quasi-SMPL function). A quasi-SMPL function is a function f : R → [0, +∞) of the form
f(x) = [g(x)]+ where g(x) : R → R is SMPL and [ · ]
+ = max{0, · }.
Definition 15. Given a SMPL or quasi-SMPL function f, a standard representation for it is an expression of
the form
f(x) = ζ + α0x +
X
S
s=1
αs[x − βs]
+,
valid for all x in the domain of f, where S ∈ N and β1 < · · · < βS. The size of the standard representation
is defined as the natural number S.
We now mention four basic results about SMPL and quasi-SMPL functions. The proofs are elementary
and omitted.
Lemma 112. Let f : I → R be SMPL, and consider a standard representation of f of size S. Then, for any
ζ ∈ R and α ≥ 0, a standard representation for the SMPL function I ∋ x 7→ ζ + αf(x) can be computed in
O(S + 1) time.
∗
For t = 0 we define λQ(t; g, w) in the usual way as
λQ(0; g, w) = lim
t→0+
VQ(t; g, w) − VQ(0; g, w)
t
= lim
t→0+
VQ(t; g, w)
t
.
268
Lemma 113. The sum f1 + · · · + fn of n SMPL (resp., quasi-SMPL) functions fi
: I → R is a SMPL (resp.,
quasi-SMPL) function I → R. Furthermore, if each fi admits a standard representation of size Si
, then a
standard representation of size at most S1 + · · · + Sn for their sum can be computed in O((S1 + · · · + Sn +
1) log n) time.
Lemma 114. Let f : R → R be SMPL, and consider a standard representation of f of size S. Then, for any
β ∈ R, a standard representation of size at most S for the quasi-SMPL function I ∋ x 7→ [f(x) − β]
+ can be
computed in O(S + 1) time.
Lemma 115. The inverse f
−1
: range(f) → R of a SMPL function f : I → R is SMPL. Furthermore, if f
admits a standard representation of size S, then a standard representation for f
−1
of size at most S can be
computed in O(S + 1) time.
Lemma 116. Let f : R → [0, +∞) be quasi-SMPL. The restricted inverse f
−1
: (0, +∞) → R of f is SMPL,
where we restrict the domain to (0, +∞) because f
−1
(0) may be multivalued. Furthermore, if f admits a
standard representation of size S, then a standard representation of size at most S for f
−1
can be computed
in O(S + 1) time.
Proof. We have f(x) = [g(x)]+ where g is SMPL. It follows that the function g¯ : I → R defined as
g¯(x) = g(x) for the interval I = {x : g(x) > 0} is SMPL as well. For any x such that f(x) > 0 we have
x ∈ I, and thus f
−1 = g
−1
, and it follows from Lemma 115 that f
−1
is SMPL.
Lemma 117. Let f : [0, +∞) → R be a SMPL function, and consider the function g that maps y to the unique
solution to the equation x = [y−f(x)]+. Then, g is quasi-SMPL and satisfies g(y) = [(x+f)
−1
(y)]+, where
(x + f)
−1 denotes the inverse of the SMPL function x 7→ x + f([x]
+).
Proof. For any y ∈ R, the function hy : x 7→ x − [y − f(x)]+ is clearly SMPL on [0, +∞). Furthermore,
hy(0) ≤ 0 and hy(+∞) = +∞, implying that hy(x) = 0 has a unique solution. We now show that
269
g(y) = [(x + f)
−1
(y)]+ is that solution, that is, it satisfies g(y) = [y − f(g(y))]+ for all y ∈ R. Fix any
y ∈ R and let
g¯ = (x + f)
−1
(y) ⇐⇒ g¯ + f([¯g]
+) = y ⇐⇒ g¯ = y − f([¯g]
+) (D.18)
There are two cases:
• If g¯ ≥ 0, then g(y) = [¯g]
+ = ¯g, and so we have
g(y) = [¯g]
+ = [y − f([¯g]
+)]+ = [y − f(g(y))]+,
as we wanted to show.
• Otherwise, g <¯ 0 and g(y) = 0. From (D.18), the condition g <¯ 0 implies y < f([¯g]
+) = f(0). So,
it is indeed the case that
0 = g(y) = [y − f(0)]+ = [y − f(g(0))]+,
as we wanted to show.
Finally, we note that the function (x + f)
−1
: R → R is SMPL due to Lemma 115, implying that g(y) is
quasi-SMPL.
Central result The following result is central in our analysis.
Lemma 118. For any treeplex Q ⊆ R
d
, gradient g ∈ R
d
, and center w ∈ R
d
>0
, the function t 7→ λQ(t; g, w)
is SMPL, and a standard representation of it of size d can be computed in polynomial time in d.
Proof. We will prove the result by structural induction on Q.
270
• First, we consider the case where Q is a Cartesian product,
Q = Q1 × · · · × QK.
In that case, the value function decomposes as follows
VQ(t; g, w) = X
K
k=1
min
xk∈tQk
(
−g
⊤
k xk +
1
2
X
dk
r=1
xk[r]
wk[r]
2
)
=
X
K
k=1
VQk
(t; gk, wk).
By linearity of derivatives, we have
λQ(t; g, w) = X
K
k=1
λQk
(t; gk, wk).
From Lemma 113, we conclude that λQ(t; g, w)is a SMPL function with domain [0, +∞) which admits
a standard representation of size at most d = d1 + · · · + dK computable in time O(d log K) starting
from the standard representation of each of the λQk
(t; gk, wk).
• Second, consider the case where Q is a simplex or the result of a branching operation
△(Q1, . . . , QK) = {(x, x[1]q1, . . . , x[K]qK) : x ∈ ∆K, qk ∈ Qk ∀k ∈ [[K]]},
where Qk ∈ R
dk . With a slighty abuse of notation, we will treat the two cases together, considering
the K-simplex ∆K as a branching operation over empty sets Qk = ∅.
In this case, we can write
g = (g•[1], . . . , g•[K], g1 ∈ R
d1
, · · · , gK ∈ R
dK ), and
w = (w•[1], . . . , w•[K], w1 ∈ R
d1
>0
, · · · , wK ∈ R
dK
>0
).
271
The value function then decomposes recursively as
VQ(t; g, w) = min
x•∈t∆K
( −
X
K
r=1
g•[r]x•[r] + 1
2
X
K
r=1
x•[r]
w•[r]
2
!
+
X
K
k=1
min
xk∈x•[k]Qk
(
−g
⊤
k xk +
X
dk
r=1
xk[r]
wk[r]
2
))
= min
x•∈t∆K
( −
X
K
r=1
g•[r]x•[r] + 1
2
X
K
r=1
x•[r]
w•[r]
2
!
+ VQk
(x•[k]; gk, wk)
)
. (D.19)
Suppose that for each k ∈ [[K]], λQk
(t; gk, wk) is piecewise linear and monotonically increasing in t.
Now we consider the KKT conditions for x• in Equation (D.19):
−g•[k] + x•[k]
w•[k]
2
+ λQk
(x•[k]; gk, wk) = λ• + µ[k] ∀k ∈ [[K]] (Stationarity)
x• ∈ t · ∆K (Primal feasibility)
λ• ∈ R, µ ∈ R
d
≥0
(Dual feasibility)
µ[k]x•[k] = 0 ∀k ∈ [[K]] (Compl. slackness)
Solving for x•[k] in the stationarity condition, and using the conditions x•[k]µ[k] = 0 and µ[k] ≥ 0,
it follows that for all k ∈ [[K]]
x•[k] = w•[k]
2
λ• + µ[k] + g•[k] − λQk
(x•[k]; gk, wk)
= w•[k]
2
h
λ• + g•[k] − λQk
(x•[k]; gk, wk)
i+
. (D.20)
272
Strict monotonicity and piecewise-linearity of x•[k] as a function of λ•. Given the preliminaries on SMPL functions, it is now immediate to see that x•[k] is unique as a function of λ•. Indeed,
note that (D.20) can be rewritten as
x•[k] = h
(w•[k]
2
)λ• − w•[k]
2
(−g•[k] + λQk
(x•[k]; gk, wk))i+
,
which is a fixed-point problem of the form studied in Lemma 117 for y = (w•[k]
2
)λ• and function fk
defined as
fk(x•[k]) = w•[k]
2
(−g•[k] + λQk
(x•[k]; gk, wk)),
which is clearly SMPL by inductive hypothesis. Hence, the unique solution to the previous fixed-point
equation is given by the quasi-SMPL function
gk : λ• 7→
1
w•[k]
2
(x•[k] + fk)
−1
(λ•)
+
,
a standard representation of which can be computed in time O(d + 1) by combining the results of
Lemmas 112, 114 and 115 given that a standard representation of λQk
(t; gk, wk) of size d is available
by inductive hypothesis.
Strict monotonicity and piecewise-linearity of λ• as a function of t. At this stage, we know
that given any value of the dual variable λ•, the unique value of the coordinate x•[k] that solves the
KKT system can be computed using the SMPL function gk. In turn, this means that we can remove the
primal variables x• from the KKT system, leaving us a system in λ• and t only. We now show that the
solution λ
⋆
• of that system is a SMPL function of t ∈ [0, +∞).
273
Indeed, the value of λ
⋆
•
that solves the KKT system has to satisfy the primal feasibility condition
t =
X
K
k=1
x•[k] = X
K
k=1
gk(λ•).
Fix any t > 0. The right-hand side of the equation is a sum of quasi-SMPL functions. Hence,
from Lemma 113, we have that the right-hand side has a standard representation of size at most
K +
PK
k=1 dk = d can be computed in time O(d log K). Furthermore, from Lemma 116, we have
that the λ
⋆
•
that satisfies the equation is unique, and in fact that the mapping (0, +∞) ∋ t 7→ λ
⋆
•
is
SMPL with standard representation of size at most d.
Relating λ• and λQ(t; g, w). Since λ
⋆
•
(t) is the coefficient on t in the Lagrangian relaxation of
(D.19), it is a subgradient of VQ(t; g, w), and since there is a unique solution, we get that it is the
derivative, that is,
λ
⋆
•
(t) = λQ(t; g, w)
for all t ∈ (0, +∞). To conclude the proof by induction, we then need to analyze the case t = 0, which
has so far been excluded. When t = 0, the feasible set tQ is a singleton, and VQ(0; g, w) = 0. Since
VQ(t; g, w) is continuous on [0, +∞), and since limt→0+ λQ(t; g, w) = limt→0+ λ
⋆
•
(t) exists since
λ
⋆
•
(t) is piecewise-linear, then by the mean value theorem,
λQ(0; g, w) = lim
t→0+
λ
⋆
•
(t),
that is, the continuous extension of λ
⋆
• must be (right) derivative of VQ(t; g, w) in 0. As extending
continuously λ
⋆
•
(t) clearly does not alter its being SMPL nor its standard representation, we conclude
the proof of the inductive case.
274
Lemma 118 also provides a constructive way of computing the argmin of (D.17) in polynomial time
for any t ∈ [0, +∞). To conclude the construction of the proximal oracle, it is then enough to show how
to pick the optimal value of t ∈ [0, 1] that minimizes
min
x∈[0,1]Q
(
−g
⊤x +
1
2
X
d
r=1
x[r]
w[r]
2
)
= min
t∈[0,1]
VQ(t; g, w).
That is easy starting from the derivative λQ(t; g, w), which is a SMPL function by Lemma 118. Indeed,
if λQ(0; g, w) ≥ 0, then by monotonicity of the derivative we know that the optimal value of t is t = 0.
Else, if λQ(1; g, w) ≤ 0, again by monotonicity we know that the optimal value of t is t = 1. Else, there
exists a unique value of t ∈ (0, 1) at which the derivative of the objective is 0, and such a value can be
computed exactly using Lemma 115.
275
Appendix E
Omitted Details in Chapter 6
E.1 Proof of Lemma 44
Proof of Lemma 44. Let us write Rˆ = xˆ. Note that
RegT
(xˆ) = X
T
t=1
⟨x
t
, ℓ
t
⟩ −X
T
t=1
⟨xˆ, ℓ
t
⟩
= −
X
T
t=1
⟨xˆ, f(x
t
, ℓ
t
)⟩ (E.1)
= −
X
T
t=1
⟨Rˆ, f(x
t
, ℓ
t
)⟩
=
X
T
t=1
⟨Rt
, f(x
t
, ℓ
t
)⟩ −X
T
t=1
⟨Rˆ, f(x
t
, ℓ
t
)⟩ (E.2)
= RegT
Rˆ
(E.3)
where (E.1) follows from xˆ
⊤1d = 1 and the definition of the map f(·, ·), (E.2) follows from ⟨Rt
, f(x
t
, ℓ
t
)⟩ =
0 because x
t = Rt/∥Rt∥1 (note that this is also trivially true when Rt = 0), and (E.3) follows from the
definition of RegT
Rˆ
.
276
E.2 Instability of RM+ and predictive RM+
E.2.1 Proof of Theorem 45
Proof of Theorem 45. We first prove the case for RM+. Since we consider x
t ∈ R
2
, we can express x
t =
(p
t
, 1 − p
t
) for some scalar p
t ∈ [0, 1] (starting with p
1 = 1/2). In our counterexample, we set ℓ
t = (ℓ
t
, 0)
for some scalar ℓ
t
to be specified. Consequently, we have
f(x
t
, ℓ
t
) = ℓ
t − ⟨x
t
, ℓ
t
⟩12 = ((1 − p
t
)ℓ
t
, −p
t
ℓ
t
).
To make the algorithm highly unstable, we first provide an unbounded sequence of ℓ
t
so that the resulting
Rt
alternates between vectors with the same value on both entries and vectors with only the first entry
being 0, which means p
t by definition alternates between 1/2 and 0. Noting that RM+ is scale-invariant
to the loss sequence, our proof is completed by normalizing the losses so that they all lie in [−1, 1].
Specifically, we set ℓ
1 = 2, which gives f(x
1
, ℓ
1
) = (1, −1), R2 = (0, 1), and p
2 = 0. Then for t ≥ 2
we set ℓ
t = −2
(t−2)/2 when t is even and ℓ
t = 2(t−1)/2 when t is odd. By direct calculation it is not hard
to verify that
f(x
t
, ℓ
t
) = (−2
(t−2)/2
, 0), Rt+1 = (2(t−2)/2
, 2
(t−2)/2
),
p
t+1 =
1
2 when t is even, and
f(x
t
, ℓ
t
) = (2(t−3)/2
, −2
(t−3)/2
), Rt+1 = (0, 2
(t−1)/2
),
p
t+1 = 0 when t is odd, completing the counterexample for RM+.
277
It remains to prove the case for predictive RM+, where mt = f(x
t−1
, ℓ
t−1
). Initially, let ℓ
1 = 4, ℓ2 =
−1. Recall that
f(x
t
, ℓ
t
) = ℓ
t − ⟨x
t
, ℓ
t
⟩12 = ((1 − p
t
)ℓ
t
, −p
t
ℓ
t
).
By direct calculation, we have
f(x
1
, ℓ
1
) = (2, −2), R2 = (0, 2), Rˆ2 = (−2, 4), p2 = 0
f(x
2
, ℓ
2
) = (−1, 0), R3 = (1, 2), Rˆ3 = (2, 2), p3 =
1
2
Thereafter, we set ℓ
t = 2(t+1)/2 when t is odd and ℓ
t = −2
(t−2)/2 when t is even. The updates for the next
4 steps are:
f(x
3
, ℓ
3
) = (2, −2), R4 = (0, 4), Rˆ4 = (0, 6), p4 = 0
f(x
4
, ℓ
4
) = (−2, 0), R5 = (2, 4), Rˆ5 = (4, 4), p5 =
1
2
f(x
5
, ℓ
5
) = (4, −4), R6 = (0, 8), Rˆ6 = (0, 12), p6 = 0
f(x
6
, ℓ
6
) = (−4, 0), R7 = (4, 8), Rˆ7 = (8, 8), p7 =
1
2
.
It is not hard to verify (by induction) that
f(x
t
, ℓ
t
) = (2(t−1)/2
, −2
(t−1)/2
), Rt+1 = (0, 2
(t+1)/2
), Rˆt+1 = (0, 2
(t+1)/2 + 2(t−1)/2
), pt+1 = 0
when t is odd and
f(x
t
, ℓ
t
) = (−2
(t−2)/2
, 0), Rt+1 = (2(t−2)/2
, 2
t/2
), Rˆt+1 = (2t/2
, 2
t/2
), pt+1 =
1
2
278
when t is even. This completes the proof.
Remark 119. The losses are unbounded in the examples, but note that the update rules for the algorithms
imply that all the algorithms remain unchanged after scaling the losses, so we can rescale them accordingly. Specifically, if we have a loss sequence ℓ
1
, . . . , ℓT
, we can define LT = max{|ℓ
1
|, . . . , |ℓ
T
|} and
consider another loss sequence ℓ
1/LT , . . . , ℓT /LT , which is bounded in [−1, 1] and will make the algorithms produce the same outputs.
E.2.2 Counterexample on 3 × 3 matrix game for RM+
101 102 103 104 105 106 107 Number of iterations
10−6
10−4
10−2
Norm
||x
t − x
t+1||2
2
(a) ∥x
t+1 − x
t∥
2
2
101 102 103 104 105 106 107 Number of iterations
10−7
10−5
10−3
Norm
||y
t − y
t+1||2
2
(b) ∥y
t+1 − y
t∥
2
2
Figure E.1: ∥x
t+1 − x
t∥
2
2
(Figure E.1a) and ∥y
t+1 − y
t∥
2
2
(Figure E.1b) for RM+.
E.3 Proof of Proposition 46 and Corollary 47
We start with a couple of technical lemmas.
Lemma 120. Given any y ∈ R
d
+ and x ∈ R
d
such that 1
⊤x = 0,
(x
⊤y)
2 ≤
d − 1
d
∥x∥
2
2 ∥y∥
2
2
.
279
Proof. If x = 0 the claim is trivial, so we focus on the other case. Let ξ be the (Euclidean) projection of y
onto the orthogonal complement of span{x, 1}. Since by hypothesis 1 ⊥ x, it holds that
y =
y
⊤x
∥x∥
2
2
x +
1
⊤y
∥1∥
2
2
1 + ξ
and therefore
∥y∥
2
2 =
(y
⊤x)
2
∥x∥
2
2
+
(1
⊤y)
2
∥1∥
2
2
+ ∥ξ∥
2
2 ≥
(y
⊤x)
2
∥x∥
2
2
+
(1
⊤y)
2
∥1∥
2
2
(E.4)
Using the hypothesis that y ≥ 0, we can bound
1
⊤y = ∥y∥1 ≥ ∥y∥2,
where we used the well-known inequality between the ℓ1-norm and the ℓ2-norm. Substituting the previous
inequality into (E.4), and using the fact that ∥1∥
2
2 = d,
∥y∥
2
2 ≥
(y
⊤x)
2
∥x∥
2
2
+
∥y∥
2
2
d
.
Rearranging the terms yields the statement.
Lemma 121. For any yˆ ∈ R
d
+ such that ∥yˆ∥2 = 1, 1
⊤yˆ ̸= 0 and for any x ∈ R
d
such that 1
⊤x = 1,
1
1⊤yˆ
− x
⊤yˆ
2
≤ (d − 1) ·
(x
⊤yˆ)yˆ − x
2
2
.
280
Proof. The main idea of the proof is to introduce
z = x −
yˆ
1⊤yˆ
.
Note that 1
⊤z = 1
⊤x − 1 = 0. Furthermore,
x
⊤yˆ =
z +
yˆ
1⊤yˆ
⊤
yˆ = z
⊤yˆ +
1
1⊤yˆ
.
Substituting the previous equality in the statement, we obtain
1
1⊤yˆ
− x
⊤yˆ
2
− (d − 1) ·
(x
⊤yˆ)yˆ − x
2
2
= (z
⊤yˆ)
2 − (d − 1) ·
z
⊤yˆ +
1
1⊤yˆ
yˆ − z −
yˆ
1⊤yˆ
2
2
= (z
⊤yˆ)
2 − (d − 1) ·
(z
⊤yˆ)yˆ − z
2
2
= (z
⊤yˆ)
2 − (d − 1)
(z
⊤yˆ)
2 + ∥z∥
2
2 − 2(z
⊤yˆ)
2
= d
(z
⊤yˆ)
2 −
d − 1
d
∥z∥
2
2
,
where we used the hypothesis that ∥yˆ∥
2
2 = 1 in the third equality. Using the inequality of Lemma 120
concludes the proof.
We are now ready to prove Proposition 46.
281
Proof of Proposition 46. If y = 0, the statement holds trivially. Hence, we focus on the case y ̸= 0. Let
yˆ = y/∥y∥2 be the direction of y; clearly, ∥yˆ∥2 = 1. Note that
∥y − x∥
2
2 =
∥y∥2 − x
⊤yˆ
2
+
x − (x
⊤yˆ) yˆ
2
2
≥
x − (x
⊤yˆ) yˆ
2
2
= (1
⊤x)
2
g(x) −
g(x)
⊤yˆ
yˆ
2
2
≥
g(x) −
g(x)
⊤yˆ
yˆ
2
2
, (E.5)
where we used the hypothesis that 1
⊤x ≥ 1 in the last step. On the other hand, using Lemma 121 (note
that 1
⊤yˆ ̸= 0 since y ̸= 0 by hypothesis),
g(x) −
g(x)
⊤yˆ
yˆ
2
2
=
1
d
g(x) − (g(x)
⊤yˆ) yˆ
2
2
+
d − 1
d
g(x) − (g(x)
⊤yˆ) yˆ
2
2
≥
1
d
g(x) − (g(x)
⊤yˆ) yˆ
2
2
+
1
d
1
1⊤yˆ
− g(x)
⊤yˆ
2
=
1
d
∥g(x)∥
2
2 +
1
(1⊤yˆ)
2
− 2
g(x)
⊤yˆ
1⊤yˆ
=
1
d
∥g(x)∥
2
2 + ∥g(y)∥
2
2 − 2g(x)
⊤g(y)
=
1
d
∥g(y) − g(x)∥
2
2
. (E.6)
Combining (E.5) and (E.6), we obtain the statement.
282
Proof of Corollary 47. The condition means that 1
⊤ Rt
R0
≥ 1 and
x
t+1 − x
t
2
≤
√
d
Rt+1
R0
−
Rt
R0
2
(by Proposition 46)
≤
√
d
R0
f(x
t
, ℓ
t
)
2
≤
√
d
R0
ℓ
t
2
+
⟨x
t
, ℓ
t
⟩1d
2
≤
√
d
R0
Bu +
√
d
x
t
2
ℓ
t
2
(by (6.1))
≤
2dBu
R0
E.4 Proof of Theorem 48
Proof of Theorem 48. When the algorithm restarts, the accumulated regret is negative to all actions, so it is
sufficient to consider the regret from T0, the round when the last restart happens to the end. In that case,
we can analyze the algorithm as a normal predictive regret matching algorithm. By Proposition 5 in [52],
we have that the regret for player i is bounded by
RegT
i
(x
∗
) ≤
∥x
∗ − z
T0
i
∥
2
2
2η
+ η
X
T
t=T0
f
t
i − mt
i
2 −
1
8η
X
T
T =T0
z
t+1
i − z
t
i
2
, (E.7)
where z
T0
i = (R0, . . . , R0) and
f
t−1
i
i∈[n]
= F(z
t−1
). When setting mt
i = f
t−1
i
, then
f
t
i − mt
i
can
be bounded by
2
f
t
i − mt
i
2
=
⟨x
t
i
, ℓ
t
i
⟩1di − ⟨x
t−1
i
, ℓ
t−1
i
⟩1di −
ℓ
t
i − ℓ
t−1
i
2
=
⟨x
t
i − x
t−1
i
, ℓ
t
i
⟩1di − ⟨x
t−1
i
, ℓ
t−1
i − ℓ
t
i
⟩1di −
ℓ
t
i − ℓ
t−1
i
2
≤
x
t
i − x
t−1
i
2
ℓ
t
i
2
∥1di
∥2 +
x
t−1
i
2
ℓ
t−1
i − ℓ
t
i
2
∥1di
∥2 +
ℓ
t
i − ℓ
t−1
i
2
≤ Bu
p
di
x
t
i − x
t−1
i
2
+
p
di
ℓ
t
i − ℓ
t−1
i
2
+
ℓ
t
i − ℓ
t−1
i
2
(by (6.1))
≤ Bu
p
di
x
t
i − x
t−1
i
2
+
p
di
X
i
′̸=i
2Lu
x
t
i
′ − x
t−1
i
′
2
(by (6.1))
≤ 2
p
di(Bu + Lu)
X
i
′∈[n]
x
t
i
′ − x
t−1
i
′
2
≤
12η
√
diBu(Bu + Lu)
P
i
′∈[n]
di
′
R0
≤
12ηBu(Bu + Lu)d
3/2
R0
where the last-but-one inequality is because
x
t
i − x
t−1
i
2
=
g(z
t
i
) − g(z
t−1
i
)
2
≤
g(z
t
i
) − g(w
t−1
i
)
2
+
g(w
t−1
i
) − g(w
t−2
i
)
2
+
g(w
t−2
i
) − g(z
t−1
i
)
2
≤ 3 ·
2ηdiBu
R0
=
6ηdiBu
R0
and we bound each of RHS of the first inequality using a restatement of Corollary 47, shown in Lemma 122.
Therefore, we can further bound (E.7) it by dropping the negative terms and bounding the rest by
∥x
∗∥
2
2 + ∥z
T0
i
∥
2
2
η
+ η
X
T
t=T0
f
t
i − f
t−1
i
2
≤
1 + R2
0
d
η
+ η
3T ·
144B2
u
(Bu + Lu)
2d
3
R2
0
.
Choosing R0 = 1 and η =
d
2T
−1/4
finishes the proof.
Lemma 122. Let z = Πw,X (ηf(x, ℓ)) for x ∈ ∆d
, ℓ ∈ R
d
, ∥ℓ∥2 ≤ Bu. Suppose ∥w∥1 ≥ R0, then we
have
∥g(z) − g(w)∥2 ≤
√
d
R0
· ∥z − w∥2 ≤
2ηdBu
R0
.
Proof. The proof is essentially the same as Corollary 47. The condition means that 1
⊤ w
R0
≥ 1 and
∥g(z) − g(w)∥2 = ∥g(z/R0) − g(w/R0)∥2
≤
√
d
z
R0
−
w
R0
2
(by Proposition 46)
≤
√
d
R0
∥ηf(x, ℓ)∥2
≤
η
√
d
R0
(∥ℓ∥2 + ∥⟨x, ℓ⟩1d∥2
)
≤
η
√
d
R0
Bu +
√
d ∥x∥2
∥ℓ∥2
(by (6.1))
≤
2ηdBu
R0
.
E.5 Proof of Theorem 49
We write
Rt
1
, ..., Rt
n
= z
t
. Let us consider the regret RegT
i
(xˆi) of player i ∈ {1, ..., n}. Lemma 44
shows that
RegT
i
(xˆi) = RegT
i
Rˆ
i
28
with RegT
i
Rˆ
i
the regret against Rˆ
i = xˆi
incurred by a decision-maker choosing the decisions
Rt
i
t≥1
and facing the sequence of losses
f
t
i
t≥1
, with f
t
i = ℓ
t
i − ⟨x
t
i
, ℓ
t
i
⟩1di
:
RegT
i
Rˆ
i
=
X
T
t=1
⟨f
t
i
, Rt
i − Rˆ
i⟩ (E.8)
Note that R1
i
, ..., RT
i
is computed by Predictive OMD with ∆
di
≥
as a decision set, f
1
i
, ..., f
T
i
as the
sequence of losses and m1
i
, ...,mT
i
as the sequence of predictions. Therefore, Proposition 5 in [52] applies,
and we can write the following regret bound:
RegT
i
Rˆ
i
≤
∥w0
i − xˆi∥
2
2
2η
+ η
X
T
t=1
f
t
i − mt
i
2
2
−
1
8η
X
T
t=1
Rt+1
i − Rt
i
2
2
.
(E.9)
Since we maintain Rt
i ∈ ∆
di
≥
at every iteration, using Proposition 46 we find that
∥x
t+1
i − x
t
i∥
2
2 ≤ di
Rt+1
i − Rt
i
2
2
.
Plugging this into (E.9), we obtain
RegT
i
Rˆ
i
≤
∥w0
i − xˆi∥
2
2
2η
+ η
X
T
t=1
f
t
i − mt
i
2
2
−
1
8diη
X
T
t=1
∥x
t+1 − x
t
∥
2
2
.
2
Using ∥ · ∥2 ≤ ∥ · ∥1 ≤
√
di∥ · ∥2, we obtain
RegT
i
Rˆ
i
≤ α + β
X
T
t=1
f
t
i − mt
i
2
1
− γ
X
T
t=1
∥x
t+1 − x
t
∥
2
1
.
with α =
∥w0
i −xˆi∥
2
2
2η
, β = diη, γ =
1
8d
2
i
η
. To conclude as in Theorem 4 in [150] we need β ≤ γ/(n−1)2
, i.e.,
η =
1
2
√
2(n−1)d
3/2
i
. Therefore, using η =
1
2
√
2(n−1) maxi{d
3/2
i
}
, we conclude that the sum of the individual
regrets is bounded by
O
n
2 max
i=1,...,n
{d
3/2
i
} max
i=1,...,n
{∥w0
i − xˆi∥
2
2
.
E.6 Proof of Theorem 52
Proof of Lemma 50. The first-order optimality condition gives
⟨ηℓ
t + z
t − z
t−1
, zˆ − z
t
⟩ ≥ 0 ∀zˆ ∈ Z.
Rearranging gives that for any zˆ ∈ Z, we have
⟨ηℓ
t
, zˆ − z
t
⟩ ≥ ⟨z
t−1 − z
t
, zˆ − z
t
⟩ =
1
2
∥z
t − zˆ∥
2
2 −
1
2
∥z
t−1 − zˆ∥
2
2 +
1
2
∥z
t − z
t−1
∥
2
2
.
Multiplying by −1 and summing over all t = 1, ..., T gives the regret bound:
X
T
t=1
⟨ηℓ
t
, z
t − zˆ⟩ ≤ 1
2
∥z
0 − zˆ∥
2
2 −
1
2
∥z
T − zˆ∥
2
2 −
X
T
t=1
1
2
∥z
t − z
t−1
∥
2
2 ≤
1
2
∥z
0 − zˆ∥
2
2
.
287
Proof of Lemma 51. Let x, x
′ ∈ ∆ and i ∈ {1, ..., n}. Let us write ℓi = −∇xiui(x), ℓ
′
i = −∇xiui(x
′
). We
have, for i ∈ {1, ..., n},
∥f (xi
, ℓi) − f
x
′
i
, ℓ
′
i
∥
2
2
=
X
di
j=1
(xi − ej )
⊤ℓi − (x
′
i − ej )
⊤ℓ
′
i
2
=
X
di
j=1
((xi − ej )
⊤ℓi − (x
′
i − ej )
⊤ℓi + (x
′
i − ej )
⊤ℓi − (x
′
i − ej )
⊤ℓ
′
i
)
2
=
X
di
j=1
(xi − x
′
i
)
⊤ℓi + (x
′
i − ej )
⊤(ℓi − ℓ
′
i
)
2
≤ 2di
(xi − x
′
i
)
⊤ℓi
2
+ 2X
di
j=1
(x
′
i − ej )
⊤(ℓi − ℓ
′
i
)
2
≤ 2di∥xi − x
′
i∥
2
2∥ℓi∥
2
2 + 2X
di
j=1
∥x
′
i − ej∥
2
2∥ℓi − ℓ
′
i∥
2
2
,
where the last inequality follows from Cauchy-Schwarz inequality. Now from (6.1) and the definition of
ℓi
, ℓ
′
i
, we have
∥ℓi∥2 ≤ Bu, ∥ℓi − ℓ
′
i∥2 ≤ Lu∥x − x
′
∥2.
This yields
∥f (xi
, ℓi) − f
x
′
i
, ℓ
′
i
∥
2
2 ≤ 2diB
2
u∥xi − x
′
i∥
2
2 + 4diL
2
u∥x − x
′
∥
2
2
≤
2diB
2
u + 4diL
2
u
∥x − x
′
∥
2
2
.
Since the function g is √
di-Lipschitz continuous over each decision set ∆
di
≥
(Proposition 46), we have
shown that the Lipschitz constant of F is LF = (maxi di)
p
2B2
u + 4L2
u
.
We are now ready to prove Theorem 52. We write
Rt
1
, ..., Rt
n
= z
t
.
Proof of Theorem 52. We use the fact that
z
t
t≥1
is computed following the Cheating OMD update with
ℓ
t = F(z
t
) at every iteration t ≥ 1. Therefore, the first-order optimality condition in z
t = Πzt−1,X≥
(ηF(z
t
))
yields
⟨ηF(z
t
) + z
t − z
t−1
, zˆ − z
t
⟩ ≥ 0 ∀zˆ ∈ X≥.
Similarly as for the proof of Lemma 50, we obtain that for any zˆ ∈ X≥, we have
⟨ηF(z
t
), zˆ − z
t
⟩ ≥ 1
2
∥z
t − zˆ∥
2
2 −
1
2
∥z
t−1 − zˆ∥
2
2 +
1
2
∥z
t − z
t−1
∥
2
2
.
Let i ∈ {1, ..., n}. We apply the inequality above to the vector zˆ ∈ X≥ = ×n
i=1∆
di
≥
defined as zˆj = z
t
j
for j ̸= i and zˆi = Rˆ
i for some Rˆ
i ∈ ∆
di
≥
. This yields, for any Rˆ
i ∈ ∆
di
≥
, for x
t
j = g(Rt
j
) and
ℓ
t
1
, ..., ℓ
t
n
= G(x
t
) for all j ∈ {1, ..., n},
⟨ηf(x
t
, ℓ
t
), Rˆ
i − Rt
i
⟩ ≥ 1
2
∥Rt
i − Rˆ
i∥
2
2 −
1
2
∥Rt−1
i − Rˆ
i∥
2
2 +
1
2
∥Rt
i − Rt−1
i
∥
2
2
.
Summing the above inequality for t = 1, ..., T, we obtain our bound on the individual regrets of each
player: for any Rˆ
i ∈ ∆
di
≥
, we have
X
T
t=1
⟨ηf(x
t
, ℓ
t
), Rt
i − Rˆ
i⟩ ≤ 1
2
∥R0
i − Rˆ
i∥
2
2 −
1
2
∥RT
i − Rˆ
i∥
2
2 −
X
T
t=1
1
2
∥Rt
i − Rt−1
i
∥
2
2 ≤
1
2
∥R0
i − Rˆ
i∥
2
2
.
Note that from Lemma 44, we have that the individual regret of player i
RegT
i
(xˆi) = X
T
t=1
⟨∇xiu
t
i
(x
t
), xˆi − x
t
i
⟩
2
against a decision xˆi ∈ ∆di
is equal to PT
t=1⟨f(x
t
, ℓ
t
), Rt
i − Rˆ
i⟩ for Rˆ
i = xˆi
. Therefore, we conclude
that
RegT
i
(xˆi) ≤
1
2η
∥zˆ
0
i − xˆi∥
2
2
.
This concludes the proof of Theorem 52.
E.7 Proof of Theorem 54
Proof of Theorem 54. At iteration t ≥ 1, let wt ∈ X≥ such that ∥wt − Πzt−1,X≥
ηF(wt
)
∥2 ≤ ϵ
(t)
. Then
the first order optimality condition gives, for z
t = Πzt−1,X≥
(ηF(wt
)),
⟨ηF(wt
), zˆ − z
t
⟩ ≥ 1
2
∥z
t − zˆ∥
2
2 −
1
2
∥z
t−1 − zˆ∥
2
2 +
1
2
∥z
t − z
t−1
∥
2
2
, ∀ zˆ ∈ X≥.
Let us fix a player i ∈ {1, ..., n}. We apply the inequality above with zˆj = z
t
j
for j ̸= i. This yields, for
xp = g(wt
p
), ∀ p ∈ {1, ..., n} and ℓ
t
i = −∇xiui(x),
⟨ηf(g(wt
i
), ℓ
t
i
), zˆi − z
t
i
⟩ ≥ 1
2
∥z
t
i − zˆi∥
2
2 −
1
2
∥z
t−1
i − zˆi∥
2
2 +
1
2
∥z
t
i − z
t−1
i
∥
2
2
, ∀ zˆi ∈ ∆di
.
We now upper bound the left-hand side of the previous inequality. Note that
⟨ηf(g(wt
i
), ℓ
t
i
), zˆi − z
t
i
⟩ = ⟨ηf(g(wt
i
), ℓ
t
i
), zˆi − wt
i
⟩ + ⟨ηf(g(wt
i
), ℓ
t
i
), wt
i − z
t
i
⟩.
Cauchy-Schwarz inequality ensures that
⟨ηf(g(wt
i
), ℓ
t
i
), wt
i − z
t
i
⟩ ≤ η∥f(g(wt
i
), ℓ
t
i
)∥2∥wt
i − z
t
i∥2.
29
Note that Πzt−1,X≥
ηF(wt
)
= z
t
, so that ∥wt
i − z
t
i
∥2 ≤ ϵ
(t)
. To bound ∥f(g(wt
i
), ℓ
t
i
)∥2, we note that
by definition,
∥f(xi
, ℓi)∥
2
2 =
X
di
j=1
(xi − ej )
⊤ℓi
2
≤
X
di
j=1
∥xi − ej∥
2
2∥ℓi∥
2
2 ≤ 4diB
2
u
.
This gives
∥f(g(wt
i
), ℓ
t
i
)∥2 ≤ 2Bu
p
di
.
Overall, we have obtained that for all zˆi ∈ ∆
di
≥
, we have
⟨ηf(g(wt
i
), ℓ
t
i
), wt
i − zˆi⟩ ≤ −1
2
∥z
t
i − zˆi∥
2
2 +
1
2
∥z
t−1
i − zˆi∥
2
2 −
1
2
∥z
t
i − z
t−1
i
∥
2
2 + η2Bu
p
diϵ
(t)
.
We sum the previous inequality for t = 1, ..., T to obtain that for all zˆi ∈ ∆
di
≥
, we have
X
T
t=1
⟨ηf(g(wt
i
), ℓ
t
i
), wt
i − zˆi⟩ ≤ 1
2
∥z
0
i − zˆi∥
2
2 −
1
2
∥z
T
i − zˆi∥
2
2 −
X
T
t=1
1
2
∥z
t
i − z
t−1
i
∥
2
2 + η2Bu
p
di
X
T
t=1
ϵ
(t)
.
Overall, we conclude that
X
T
t=1
⟨f(g(wt
i
), ℓ
t
i
), wt
i − zˆi⟩ ≤ 1
2η
∥z
0
i − zˆi∥
2
2 + 2Bu
p
di
X
T
t=1
ϵ
(t)
, ∀ zˆi ∈ ∆
di
≥
.
From Lemma 44 the left-hand side is equal to RegT
i
(xˆi) for xˆi = zˆi
. This concludes the proof of Theorem
54.
29
E.8 Proof of Theorem 55
Proof of Theorem 54. We will show that for any wˆ ∈ X≥, we have
X
T
t=1
⟨F(wt
), wt − wˆ⟩ ≤ 1
2η
∥w0 − wˆ∥
2
2
.
Since x
t
i = wt
i
, ∀ t ≥ 1, ∀ i ∈ {1, ..., n}, this is enough to prove Theorem 54. Note that
⟨F(wt
), wt − wˆ⟩ = ⟨F(wt
), z
t − wˆ⟩ + ⟨F(wt
), wt − z
t
⟩.
We will independently analyze each term of the right-hand side of the above equality.
For the first term, we note that the first-order optimality condition for z
t = Πzt−1,X≥
ηF(wt
)
gives,
for any wˆ ∈ X≥,
⟨ηF(wt
), z
t − wˆ⟩ ≤ 1
2
∥wˆ − z
t−1
∥
2
2 −
1
2
∥wˆ − z
t
∥
2
2 −
1
2
∥z
t − z
t−1
∥
2
2
. (E.10)
For the second term, we will prove the following lemma.
Lemma 123. Let η > 0 such that w 7→ ηF(w) is 1/
√
2 Lipschitz continuous over X≥. Then
⟨ηF(wt
), wt − z
t
⟩ ≤ 1
2
∥z
t − z
t−1
∥
2
2
. (E.11)
Proof of Lemma 123. We write
⟨ηF(wt
), wt − z
t
⟩ = ⟨ηF(z
t−1
), wt − z
t
⟩ + ⟨ηF(wt
) − ηF(z
t−1
), wt − z
t
⟩.
29
We will bound independently each term in the above equation. From wt = Πzt−1,X≥
ηF(z
t−1
)
we have
⟨ηF(z
t−1
), wt − z
t
⟩ ≤ 1
2
∥z
t − z
t−1
∥
2
2 −
1
2
∥z
t − wt
∥
2
2 −
1
2
∥wt − z
t−1
∥
2
2
,
which gives
⟨ηF(z
t−1
), wt − z
t
⟩ ≤ 1
2
∥z
t − z
t−1
∥
2
2 −
1
2
∥z
t − wt
∥
2
2 −
1
2
∥wt − z
t−1
∥
2
2
, (E.12)
From Cauchy-Schwarz inequality, we have
⟨ηF(wt
) − ηF(z
t−1
), wt − z
t
⟩ ≤ ∥ηF(wt
) − ηF(z
t−1
)∥2∥wt − z
t
∥2.
Recall that
wt = Πzt−1,X
ηF(z
t−1
)
z
t = Πzt−1,X
ηF(wt
)
Since the proximal operator is 1-Lipschitz continuous, and since w 7→ ηF(w) is 1/
√
2-Lipschitz continuous, we obtain
⟨ηF(wt
) − ηF(z
t−1
), wt − z
t
⟩ ≤ 1
2
∥wt − z
t−1
∥
2
2
. (E.13)
We can now sum (E.12) and (E.13) to obtain
⟨ηF(wt
), wt − z
t
⟩ ≤ 1
2
∥z
t − z
t−1
∥
2
2 −
1
2
∥z
t − wt
∥
2
2 ≤
1
2
∥z
t − z
t−1
∥
2
2
.
We have shown in Lemma 51 that F is LF -Lipschitz continuous for normal-form games. Our choice
of step size η =
1
LF
√
2
ensures that that ω 7→ ηF(w) is 1/
√
2-Lipschitz continuous as in the assumptions
of Lemma 123.
Combining (E.10) with (E.11) yields
⟨ηF(wt
), wt − wˆ⟩ ≤ 1
2
∥wˆ − z
t−1
∥
2
2 −
1
2
∥wˆ − z
t
∥
2
2
.
Summing this inequality for t = 1, ..., T and telescoping, we obtain
X
T
t=1
⟨ηF(wt
), wt − wˆ⟩ ≤ 1
2
∥wˆ − z
0
∥
2
2 −
1
2
∥wˆ − z
T
∥
2
2
which directly yields
X
T
t=1
⟨ηF(wt
), wt − wˆ⟩ ≤ 1
2
∥wˆ − z
0
∥
2
2
. (E.14)
Overall, for any
Rˆ
1, ..., Rˆ
n
∈ X≥ we obtain that PT
i=1 RegT
i
(Rˆ
i) is upper bounded by Pn
i=1
1
2η
∥w0
i −
Rˆ
i∥
2
2
. Now from Lemma 44, for any (xˆ1, ..., xˆn) ∈ ∆, we conclude that
X
T
i=1
RegT
i
(xˆi) ≤
Xn
i=1
1
2η
∥w0
i − xˆi∥
2
2
.
This concludes the proof of Theorem 55.
294
E.9 Proof of Lemma 56
Proof of Lemma 56. The proof of Lemma 56 follows the lines of the proof of Lemma 51. Clearly, for matrix
games we have F = h ◦ g with h : R
d1 × R
d2 → R
d1 × R
d2 defined as Proposition 46.
h
x
y
=
f (x, Ay)
f
y, −A⊤x
(E.15)
The function g is Lipschitz continuous over ∆
d1
≥
(Proposition 46), with a Lipschitz constant of Lg =
√
d1.
Let us now compute the Lipschitz constant of h. Observe that:
∥f (x, Ay) − f
x
′
, Ay′
∥
2
2
=
X
d1
i=1
(x − ei)
⊤Ay − (x
′ − ei)
⊤Ay′
2
=
X
d1
i=1
((x − ei)
⊤Ay − (x
′ − ei)
⊤Ay + (x
′ − ei)
⊤Ay − (x
′ − ei)
⊤Ay′
)
2
=
X
d1
i=1
(x − x
′
)
⊤Ay + (x
′ − ei)
⊤A(y − y
′
)
2
≤ 2d1
(x − x
′
)
⊤Ay2
+ 2X
d1
i=1
(x
′ − ei)
⊤A(y − y
′
)
2
≤ 2d1∥A∥
2
op∥x − x
′
∥
2
2 + 4d1∥A∥
2
op∥y − y
′
∥
2
2
.
Similarly, we have that ∥f
y, −A⊤x
− f
y
′
, −A⊤x
′
∥
2
2
is upper bounded by
2d2∥A∥
2
op∥y − y
′
∥
2
2 + 4d2∥A∥
2
op∥x − x
′
∥
2
2
,
and thus
h
x
y
− h
x
′
y
′
2
≤ ∥A∥opp
6 max{d1, d2}
x
y
−
x
′
y
′
2
.
Therefore, the Lipschitz constant of h is Lh = ∥A∥opp
6 max{d1, d2}.
Since F = h◦g, we obtain that the Lipschitz constantLF of F isLF = Lh×Lg =
√
6∥A∥op max{d1, d2}.
E.10 Extensive-Form Games
In this section we show how to extend our convergence results for Conceptual RM+ from normal-form
games to EFGs. Briefly, an EFG is a game played on a tree, where each node belongs to some player,
and the player chooses a probability distribution over branches. Moreover, players have information sets,
which are groups of nodes belonging to a player such that they cannot distinguish among them, and thus
they must choose the same probability distribution at all nodes in an information set. When a leaf h is
reached, each player i receives some payoff vi(h) ∈ [0, 1]. In order to extend our results, we will use the
CFR regret decomposition [166, 53], and then show how to run the Conceptual RM+ algorithm on the
resulting set of strategy spaces (which will be a Cartesian product of positive orthants). The CFR regret
decomposition works in the space of behavioral strategies, which represents the strategy space of each
player as a Cartesian product of simplices, with each simplex corresponding to the set of possible ways
to randomize over actions at a given information set for the player. Formally, we write the polytope of
behavioral-form strategies as
X = ×i∈[n],j∈Di∆nj
,
296
where Di
is the set of information sets for player i and nj is the number of actions at information set j. Let
P =
P
i∈[n],j∈Di
nj be the dimension of X . In EFGs with perfect recall, meaning that a player never forgets
something they knew in the past, the sequence-form is an equivalent representation of the set of strategies,
which allows one to write the payoffs for each player as a multilinear function. This in turn enables
optimization and regret minimization approaches that exploit multilinearity, e.g. bilinearity in the twoplayer zero-sum setting [81, 98, 54]. Instead of working on this representation, the CFR approach minimizes
a notion of local regret at each information set, using so-called counterfactual values. The weighted sum
of counterfactual regrets at each information set is an upper bound on the sequence-form regret [166],
and thus a player in an EFG can minimize their regret by locally minimizing each counterfactual regret.
Informally, the counterfactual value is the expected value of an action at a information set, conditional on
the player at the information set playing to reach that information set and then taking the corresponding
action. The counterfactual value associated to each tuples of player i, information set j ∈ Di
, and action
a ∈ Aj is Gija(x) = P
h∈Lja
Q
(ˆj,aˆ)∈Pj (h)
x[ˆj, aˆ]vi(h), where Lja is the set of leaf nodes reachable from
information set j after taking action a, and Pj (h) is the set of pairs of information sets and actions (ˆj, aˆ)
on the path from the root to h, except that information sets belonging to player i are excluded, unless they
occur after j, a.
We will be concerned with the counterfactual regret, given by the operator H : X → R
P
i∈[n],j∈Di
nj
defined as Hija(x) = Gija(x) − ⟨Gij (x), x
j
⟩. Now we can show that the counterfactual regret operator
H is Lipschitz continuous. Intuitively, this should hold since H is multilinear.
Lemma 124. For any behavioral strategies x, x
′ ∈ X , ∥H(x) − H(x
′
)∥2 ≤
√
2P∥x − x
′∥2.
Proof. We start by showing a bound for G. We first analyze the change in a single coordinate of G for a
given i ∈ [n], j ∈ Di
, a ∈ Aj . We focus on how Gija changes with respect to the change in |x[ˆj, aˆ] −
297
x
′
[ˆj, aˆ]| for some arbitrary information set-action pair (ˆj, aˆ) ∈ Pj (h) for some h ∈ Lja. To alleviate inline
notation, let P
ˆj,aˆ
j
(h) = Pj (h) \ {(ˆj, aˆ)}.
Gija(x) = x[ˆj, aˆ]
X
h∈Lˆjaˆ∩Lja
Y
(¯j,a¯)∈Pˆj,aˆ
j
(h)
x[¯j, a¯]vi(h)
+
X
h∈Lja\Lˆjaˆ
Y
(¯j,a¯)∈Pj (h)
z[¯j, a¯]vi(h)
≤ |x[ˆj, aˆ] − x
′
[ˆj, aˆ]|
X
h∈Lˆjaˆ∩Lja
Y
(¯j,a¯)∈Pˆj,aˆ
j
(h)
x[¯j, a¯]vi(h)
+ x
′
[ˆj, aˆ]
X
h∈Lˆjaˆ∩Lja
Y
(¯j,a¯)∈Pj (h)
x[¯j, a¯]vi(h)
+
X
h∈Lja\Lˆjaˆ
Y
(¯j,a¯)∈Pj (h)
x[¯j, a¯]vi(h)
Now let us bound the error term by noting that vi(h) ≤ 1 for all h by assumption:
|x[ˆj, aˆ] − x
′
[ˆj, aˆ]|
X
h∈Lˆjaˆ∩Lja
Y
(¯j,a¯)∈Pˆj,aˆ
j
(h)
x[¯j, a¯]vi(h)
≤|x[ˆj, aˆ] − x
′
[ˆj, aˆ]|
X
h∈Lˆjaˆ∩Lja
Y
(¯j,a¯)∈Pˆj,aˆ
j
(h)
x[¯j, a¯]
≤|x[ˆj, aˆ] − x
′
[ˆj, aˆ]|,
where the last inequality is because the sum of reach probabilities on leaf nodes in Lˆjaˆ ∩ Lja after conditioning on player i playing to reach (j, a) and (ˆj, aˆ) being played with probability one, is less than or
equal to one.
298
By iteratively applying this argument to each (ˆj, aˆ) ∈ Pj (h), we get
Gija(x) ≤ Gija(x
′
) + X
h∈Lja
X
(ˆj,aˆ)∈Pj (h)
|x[ˆj, aˆ] − x
′
[ˆj, aˆ]| (E.16)
≤ Gija(x
′
) + ∥x − x
′
∥1.
Repeating the same argument for x
′ gives
|Gija(x) − Gija(x
′
)| ≤ ∥x − x
′
∥1.
Secondly, we bound the difference in the inner product terms.
⟨Gij (x), x
j
⟩ =
X
a∈Aj
x[j, a]Gija(x)
≤
X
a∈Aj
|x[j, a] − x
′
[j, a]| + x
′
[j, a]Gija(x)
≤ ∥x
j − x
j
′
∥1 +
X
a∈Aj
x
′
[j, a]Gija(x
′
) + X
a∈Aj
x
′
[j, a]
X
h∈Lja
X
(ˆj,aˆ)∈Pj (h)
|x[ˆj, aˆ] − x
′
[ˆj, aˆ]|
≤ ⟨Gij (x
′
), x
′
⟩ + ∥x − x
′
∥1
where the second-to-last line is by Equation (E.16). Again we can start from x
′
instead to get
|⟨Gij (x), x
j
⟩ − ⟨Gij (x
′
), x
′
⟩| ≤ ∥x − x
′
∥1.
299
Putting together all our bonds and applying norm equivalence, we get that
∥H(x) − H(x
′
)∥
2
2 ≤
X
i∈[n]
X
j∈Di,a∈Aj
2∥x − x
′
∥
2
1
≤ 2P∥x − x
′
∥
2
2
.
Taking square roots completes the proof.
Since we want to run smooth RM+, we will need to consider the lifted strategy space for each decision point. Let Z be the Cartesian product of the positive orthants for each information set, i.e. Z =
×i∈[n],j∈DiR
nj
+ . Now let gˆ : Z → X be the function that normalizes each vector from the positive orthant to the simplex such that we get a behavioral strategy, i.e. gˆj (z) = g(z
j
), where z
j
is the slice of z
corresponding to information set j. The function gˆ is also Lipschitz continuous.
Lemma 125. Suppose that z, z
′ ∈ Z satisfy ∥z
j∥1 ≥ R0,j , ∥z
j′∥1 ≥ R0,j for all i ∈ [n], j ∈ Di
. Then,
∥gˆ(z) − gˆ(z
′
)∥2 ≤ maxi∈[n],j∈Di
p
nj/R0,j∥z − z
′∥2.
Proof. We have from Proposition 46 that
∥gˆ(z) − gˆ(z
′
)∥
2
2 =
X
i∈[n],j∈Di
∥g(z) − g(z
′
)∥
2
2
≤
X
i∈[n],j∈Di
nj/R0,j∥z
j − z
j′
∥
2
2
≤ max
i∈[n],j∈Di
nj/R0,j∥z − z
′
∥
2
2
Now let us introduce the operator F : Z → R
P
i∈[n],j∈Di
nj
for EFGs. For a given z ∈ Z, the operator
will output the regret associated with the counterfactual values for each decision set j. F will be composed
300
of two functions, first gˆ maps a given z to some behavioral strategy x = ˆg(z), and then the operator
H : X → R
P
i∈[n],j∈Di
nj
outputs the regrets for the counterfactual values.
Now we can apply our bounds on the Lipschitz constant for gˆ and H to get that F is Lipschitz continuous with Lipschitz constant 2P maxi∈[n],j∈Di
p
nj/R0,j . Combining our Lipschitz result with our setup
of X and F, we can now run Algorithm 6 on X and F and apply Theorem 54 to get a smooth-RM+-based
algorithm that allows us to compute a sequence of iterates with regret at most ϵ in at O(1/ϵ) iterations
and using O(log(1/ϵ)/ϵ) gradient computations.
E.11 Details on the Numerical Experiments
Efficient orthogonal projection on $\Delta^n_{\ge}$. Recall that $\Delta^n_{\ge} = \{R \in \mathbb{R}^n \mid R \ge 0,\ \mathbf{1}_n^\top R \ge 1\}$. Let $y \in \mathbb{R}^n$ and let us consider
\[
\min_{x \ge 0,\ \mathbf{1}_n^\top x \ge 1} \frac{1}{2}\|x - y\|_2^2.
\]
Introducing a Lagrange multiplier $\mu \ge 0$ for the constraint $1 - \mathbf{1}_n^\top x \le 0$, we arrive at
\[
\min_{x \ge 0}\ \max_{\mu \ge 0}\ \frac{1}{2}\|x - y\|_2^2 + \mu\big(1 - \mathbf{1}_n^\top x\big).
\]
Let $(x, \mu) \in \mathbb{R}^n_+ \times \mathbb{R}_+$ be an optimal solution to the above saddle-point problem. Stationarity of the Lagrangian shows that $x_i = [y_i + \mu]^+$ for all $i \in [n]$. Therefore, we could simply use binary search to solve the following univariate concave problem:
\[
\max_{\mu \ge 0}\ \mu - \frac{1}{2}\big\|[y + \mu\mathbf{1}_n]^+\big\|_2^2.
\]
Let us use the Karush–Kuhn–Tucker conditions. Complementary slackness gives $\mu\,(1 - \mathbf{1}_n^\top x) = 0$. If $\mu = 0$, then $x = [y]^+$, and by primal feasibility we must have $\mathbf{1}_n^\top x \ge 1$, i.e., $\mathbf{1}_n^\top [y]^+ \ge 1$. If that is not the case, then we cannot have $\mu = 0$, and we must have $1 - \mathbf{1}_n^\top x = 0$, i.e., $x \in \Delta^n$. In this case, we obtain that $x$ is the orthogonal projection of $y$ onto $\Delta^n$. Overall, we see that $x$ is always either $[y]^+$, the orthogonal projection of $y$ onto $\mathbb{R}^n_+$, or the orthogonal projection of $y$ onto $\Delta^n$. Since $\Delta^n_{\ge} \subset \mathbb{R}^n_+$, we can compute the orthogonal projection onto $\Delta^n_{\ge}$ as follows: compute $x = [y]^+$; if $\mathbf{1}_n^\top x \ge 1$, then we have found the orthogonal projection of $y$ onto $\Delta^n_{\ge}$; otherwise, return the orthogonal projection of $y$ onto the simplex $\Delta^n$.
E.11.1 Performances of ExRM+, Stable PRM+, and Smooth PRM+ on our small matrix game example
In this section we provide detailed numerical results for ExRM+, Stable PRM+, and Smooth PRM+ on our 3×3 matrix-game counterexample. All algorithms use linear averaging, and Stable PRM+ and Smooth PRM+ additionally use alternation. We choose a step size of η = 0.1 for our implementation of these algorithms. The results are presented in Figure E.2 for ExRM+, in Figure E.3 for Stable PRM+, and in Figure E.4 for Smooth PRM+.
E.11.2 Extensive-Form Games Used in the Experiments
We used the following games in the experiments:
• 2-player Sheriff is a two-player general-sum game inspired by the Sheriff of Nottingham board game.
It was introduced as a benchmark for correlated equilibria by Farina et al. [56]. The variant of the
game we use has the following parameters:
– maximum number of items that can be smuggled: 10
– maximum bribe amount: 3
– number of bargaining rounds: 3
[Figure E.2: Empirical performance of ExRM+ (with linear averaging) on our 3 × 3 matrix game from Section 6.1. Panels (a)–(e) plot the duality gap, $\|x^t - x^{t+1}\|_2^2$, $\|y^t - y^{t+1}\|_2^2$, $\mathrm{Reg}^T_x$, and $\mathrm{Reg}^T_y$ against the number of iterations.]
[Figure E.3: Empirical performance of Stable PRM+ (with alternation and linear averaging) on our 3 × 3 matrix game from Section 6.1. Panels (a)–(e) plot the same quantities as Figure E.2.]
[Figure E.4: Empirical performance of Smooth PRM+ (with alternation and linear averaging) on our 3 × 3 matrix game from Section 6.1. Panels (a)–(e) plot the same quantities as Figure E.2.]
– value of each item: 5
– penalty for illegal item found in cargo: 1
– penalty for Sheriff if no illegal item found in cargo: 1
The number of nodes in this game is 9648.
• 3-player Leduc poker is a 3-player version of the standard benchmark of Leduc poker [147]. The
game has 15659 nodes.
• 4-player Kuhn poker is a 4-player version of the standard benchmark of Kuhn poker [101]. We use a larger variant than the standard one to assess the scalability of our algorithm; the variant we use has six ranks in the deck. The game has 23402 nodes.
• 4-player Liar's dice is a 4-player version of the game Liar's dice, already used as a benchmark by Lisý, Lanctot, and Bowling [111]. We use a variant with 1 die per player, each with two distinct faces. The game has 8178 nodes.
E.12 Useful Propositions
Proposition 126. Let $S \subseteq \mathbb{R}^n$ be a compact convex set and $\{z^t \in S\}$ be a sequence of points in $S$ such that $\lim_{t\to\infty}\|z^t - z^{t+1}\| = 0$. Then one of the following is true.
1. The sequence $\{z^t\}$ has infinitely many limit points.
2. The sequence $\{z^t\}$ has a unique limit point.
Proof. We denote a closed Euclidean ball in $\mathbb{R}^n$ by $B(x, \delta) := \{x' \in \mathbb{R}^n : \|x' - x\| \le \delta\}$. For the sake of contradiction, assume that neither of the two conditions is satisfied. This implies that the sequence $\{z^t\}$ has a finite number ($\ge 2$) of limit points. Then there exist $\epsilon > 0$ and two distinct limit points $\hat z_1$ and $\hat z_2$ such that
1. $\|\hat z_1 - \hat z_2\| \ge \epsilon$;
2. $\hat z_1$ is the only limit point of $\{z^t\}$ within $B(\hat z_1, \epsilon/2) \cap S$;
3. $\hat z_2$ is the only limit point of $\{z^t\}$ within $B(\hat z_2, \epsilon/2) \cap S$.
Given that $\lim_{t\to\infty}\|z^t - z^{t+1}\| = 0$, there exists $T > 0$ such that for any $t > T$, $\|z^t - z^{t+1}\| \le \epsilon/4$. Now, since $\hat z_1$ and $\hat z_2$ are limit points of $\{z^t\}$, there exist subsequences $\{z^{t_k}\}$ and $\{z^{t_j}\}$ that converge to $\hat z_1$ and $\hat z_2$, respectively. This means that there are infinitely many points of $\{z^{t_k}\}$ within $B(\hat z_1, \epsilon/4) \cap S$ and infinitely many points of $\{z^{t_j}\}$ within $B(\hat z_2, \epsilon/4) \cap S$. However, for $t > T$, since consecutive iterates move by at most $\epsilon/4$ while the sequence must travel between the two balls infinitely often, there must also be infinitely many points of $\{z^t\}$ within the region $(B(\hat z_1, \epsilon/2) \setminus B(\hat z_1, \epsilon/4)) \cap S$. This implies that there exists another limit point within $B(\hat z_1, \epsilon/2) \cap S$ and contradicts the assertion that $\hat z_1$ is the only limit point within $B(\hat z_1, \epsilon/2) \cap S$.
E.13 Proof of Theorem 57
Proof of Theorem 57. We will use an equivalent update rule of RM+ (Algorithm 8) proposed by Farina, Kroer, and Sandholm [52]: for all $t \ge 0$,
\[
z^{t+1} = \Pi_{\mathbb{R}^{d_1+d_2}_{\ge 0}}\big[z^t - \eta F(z^t)\big],
\]
where we use $z^t$ to denote $(R^t_x, R^t_y)$ and $F$ is as defined in (6.4). The step size $\eta > 0$ only scales the $z^t$ vectors and produces the same strategies $\{(x^t, y^t) \in \Delta^{d_1} \times \Delta^{d_2}\}$ as Algorithm 8.
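For intuition, here is a minimal sketch of this projected-gradient view of RM+ for a matrix game. The operator below is the standard matrix-game regret operator consistent with the identity ⟨F(z), z⟩ = 0 used later in the proof; it is intended only as an illustration of the update, not as a verbatim copy of (6.4), and the variable names are ours:

import numpy as np

def regret_operator(z, A, d1):
    """F(z) for the matrix game A, evaluated at the normalized strategies (x, y) = g(z)."""
    zx, zy = z[:d1], z[d1:]
    # This sketch assumes both blocks are nonzero; RM+ handles the all-zero case separately.
    x, y = zx / zx.sum(), zy / zy.sum()
    v = x @ A @ y
    return np.concatenate([A @ y - v * np.ones_like(zx), -A.T @ x + v * np.ones_like(zy)])

def rm_plus_step(z, A, d1, eta=1.0):
    """One step of the equivalent update z <- Pi_{R^{d1+d2}_{>=0}}[z - eta * F(z)]."""
    return np.maximum(z - eta * regret_operator(z, A, d1), 0.0)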
In the proof, we also use the following property of the operator $F$.
Proposition 127 (Adapted from Lemma 5.7 in [48]). For a matrix game $A \in \mathbb{R}^{d_1 \times d_2}$, let
\[
L_1 = \|A\|_{\mathrm{op}} \sqrt{6\max\{d_1, d_2\}}
\]
with $\|A\|_{\mathrm{op}} := \sup\{\|Av\|_2 / \|v\|_2 : v \in \mathbb{R}^{d_2},\, v \ne 0\}$. Then for any $z, z' \in \mathbb{R}^{d_1+d_2}_{\ge 0}$, it holds that
\[
\|F(z) - F(z')\| \le L_1\|g(z) - g(z')\|.
\]
For any $t \ge 1$, since $z^{t+1} = \Pi_{\mathbb{R}^{d_1+d_2}_{\ge 0}}[z^t - \eta F(z^t)]$, we have for any $z \in \mathbb{R}^{d_1+d_2}_{\ge 0}$,
\[
\big\langle z^t - \eta F(z^t) - z^{t+1},\, z - z^{t+1} \big\rangle \le 0
\;\Rightarrow\;
\big\langle z^t - z^{t+1},\, z - z^t \big\rangle \le -\big\langle z^t - z^{t+1},\, z^t - z^{t+1} \big\rangle + \big\langle \eta F(z^t),\, z - z^{t+1} \big\rangle. \tag{E.17}
\]
By the three-point identity, we have
\[
\begin{aligned}
\|z^{t+1} - z\|^2 &= \|z^t - z\|^2 + \|z^{t+1} - z^t\|^2 + 2\big\langle z^t - z^{t+1},\, z - z^t \big\rangle \\
&\le \|z^t - z\|^2 - \|z^{t+1} - z^t\|^2 + 2\big\langle \eta F(z^t),\, z - z^{t+1} \big\rangle. && \text{(by (E.17))}
\end{aligned}
\]
We define $k^t := \arg\min_{k \ge 1} \|z^t - k z^*\|$. With $z = k^t z^*$, the above inequality is equivalent to
\[
\begin{aligned}
&\|z^{t+1} - k^t z^*\|^2 + 2\eta\big\langle F(z^*), z^{t+1} \big\rangle \\
&\le \|z^t - k^t z^*\|^2 + 2\eta\big\langle F(z^*), z^t \big\rangle - \|z^{t+1} - z^t\|^2 + 2\eta\big\langle F(z^t) - F(z^*),\, z^t - z^{t+1} \big\rangle + 2\eta\big\langle F(z^t), k^t z^* \big\rangle && (\text{we use } \langle F(z^t), z^t \rangle = 0) \\
&\le \|z^t - k^t z^*\|^2 + 2\eta\big\langle F(z^*), z^t \big\rangle + \eta^2\|F(z^t) - F(z^*)\|^2 + 2\eta\big\langle F(z^t), k^t z^* \big\rangle && (-a^2 + 2ab \le b^2) \\
&\le \|z^t - k^t z^*\|^2 + 2\eta\big\langle F(z^*), z^t \big\rangle + \eta^2 L_1^2\,\|(x^t, y^t) - z^*\|^2 + 2\eta\big\langle F(z^t), k^t z^* \big\rangle.
\end{aligned}
\]
Since $z^* = (x^*, y^*)$ is a strict Nash equilibrium, by Lemma 128 there exists a constant $c > 0$ such that
\[
2\eta\big\langle F(z^t), k^t z^* \big\rangle = -2\eta k^t\big( (x^t)^\top A y^* - (x^*)^\top A y^t \big) \le -2\eta c k^t\,\|(x^t, y^t) - z^*\|^2 \le -2\eta c\,\|(x^t, y^t) - z^*\|^2,
\]
where in the last inequality we use $k^t \ge 1$ by definition. Thus, combining the above two inequalities gives
\[
\|z^{t+1} - k^t z^*\|^2 + 2\eta\big\langle F(z^*), z^{t+1} \big\rangle \le \|z^t - k^t z^*\|^2 + 2\eta\big\langle F(z^*), z^t \big\rangle + \big(\eta^2 L_1^2 - 2\eta c\big)\,\|(x^t, y^t) - z^*\|^2.
\]
Using the definition of $k^{t+1}$ and choosing $\eta < \frac{c}{L_1^2}$, we have $2\eta c - \eta^2 L_1^2 > 0$ and
\[
\big(2\eta c - \eta^2 L_1^2\big)\,\|(x^t, y^t) - z^*\|^2 \le \|z^t - k^t z^*\|^2 + 2\eta\big\langle F(z^*), z^t \big\rangle - \|z^{t+1} - k^{t+1} z^*\|^2 - 2\eta\big\langle F(z^*), z^{t+1} \big\rangle.
\]
Telescoping the above inequality for $t \in [1, T]$ gives
\[
\big(2\eta c - \eta^2 L_1^2\big)\sum_{t=1}^{T} \|(x^t, y^t) - z^*\|^2 \le \|z^1 - k^1 z^*\|^2 + 2\eta\big\langle F(z^*), z^1 \big\rangle < +\infty.
\]
It implies that the sequence $\{(x^t, y^t)\}$ converges to $z^*$. Moreover, it also gives an $O(\sqrt{1/T})$ best-iterate convergence rate with respect to $\|(x^t, y^t) - z^*\|_2$.
Lemma 128. If a matrix game $A$ has a strict Nash equilibrium $(x^*, y^*)$, then there exists a constant $c > 0$ such that for any $(x, y) \in \Delta^{d_1} \times \Delta^{d_2}$,
\[
x^\top A y^* - (x^*)^\top A y \ge c\,\|(x, y) - (x^*, y^*)\|^2.
\]
Proof. Since $(x^*, y^*)$ is strict, it is also the unique and pure Nash equilibrium. We denote the unit vector $e_{i^*} = x^*$ with $i^* \in [d_1]$* and the constant $c_1 := \min_{i \ne i^*} \big( e_i^\top A y^* - e_{i^*}^\top A y^* \big) > 0$. Then, by definition, we have
\[
\begin{aligned}
x^\top A y^* - (x^*)^\top A y^* &= \sum_{i \ne i^*} x[i]\big( e_i^\top A y^* - e_{i^*}^\top A y^* \big) \\
&\ge c_1 \sum_{i \ne i^*} x[i] \\
&= \frac{c_1}{2} \sum_{i \ne i^*} x[i] + \frac{c_1}{2}\big(1 - x[i^*]\big) && \textstyle\big(\sum_{i \in [d_1]} x[i] = 1\big) \\
&\ge \frac{c_1}{2} \sum_{i \ne i^*} x[i]^2 + \frac{c_1}{2}\big(1 - x[i^*]\big)^2 && \big(x[i] \in [0, 1]\big) \\
&= \frac{c_1}{2}\,\|x - x^*\|^2.
\end{aligned}
\]
Similarly, denoting $e_{j^*} = y^*$ with $j^* \in [d_2]$ and the constant $c_2 := (x^*)^\top A y^* - \max_{j \ne j^*} (x^*)^\top A e_j > 0$, we get
\[
(x^*)^\top A y^* - (x^*)^\top A y \ge \frac{c_2}{2}\,\|y - y^*\|^2.
\]
Combining the two inequalities above and letting $c = \frac{1}{2}\min(c_1, c_2)$, we have $x^\top A y^* - (x^*)^\top A y \ge c\,\|(x, y) - (x^*, y^*)\|^2$ for all $(x, y)$.
* Given a positive integer $d$, we denote the set $\{1, 2, \ldots, d\}$ by $[d]$.
E.14 Proofs for Section 6.7
E.14.1 Proofs for Section 6.7.1
Proof of Lemma 60. From Lemma 59, we know
\[
\big(1 - \eta^2 L_F^2\big)\sum_{t=0}^{\infty} \|z^{t+\frac{1}{2}} - z^t\|^2 \le \|z^0 - z^*\|^2.
\]
Since $1 - \eta^2 L_F^2 > 0$, this implies
\[
\lim_{t\to\infty} \|z^{t+\frac{1}{2}} - z^t\| = 0.
\]
Using Proposition 131, we know that $\|z^{t+1} - z^{t+\frac{1}{2}}\| \le \eta L_F \|z^{t+\frac{1}{2}} - z^t\|$. Thus
\[
\lim_{t\to\infty} \|z^{t+1} - z^t\| \le \lim_{t\to\infty} \Big( \|z^{t+1} - z^{t+\frac{1}{2}}\| + \|z^{t+\frac{1}{2}} - z^t\| \Big) \le \lim_{t\to\infty} (1 + \eta L_F)\|z^{t+\frac{1}{2}} - z^t\| = 0.
\]
If $\hat z$ is the limit point of a subsequence $\{z^t : t \in \kappa\}$, then we must have
\[
\lim_{t(\in\kappa)\to\infty} z^{t+\frac{1}{2}} = \hat z.
\]
By the definition of $z^{t+\frac{1}{2}}$ in Algorithm 11 and by the continuity of $F$ and of the projection, we get
\[
\hat z = \lim_{t(\in\kappa)\to\infty} z^{t+\frac{1}{2}} = \lim_{t(\in\kappa)\to\infty} \Pi_{Z_\ge}\big[z^t - \eta F(z^t)\big] = \Pi_{Z_\ge}\big[\hat z - \eta F(\hat z)\big].
\]
This shows that $\hat z \in \mathrm{SOL}(Z_\ge, F)$. Moreover, by $\hat z = \Pi_{Z_\ge}[\hat z - \eta F(\hat z)]$ and Lemma 130, we conclude that $(\hat x, \hat y) = g(\hat z) \in \Delta^{d_1} \times \Delta^{d_2}$ is a Nash equilibrium of $A$.
Proof of Lemma 62. Let $\{k_i \in \mathbb{Z}\}$ be an increasing sequence of indices such that $\{z^{k_i}\}$ converges to $\hat z$, and let $\{l_i \in \mathbb{Z}\}$ be an increasing sequence of indices such that $\{z^{l_i}\}$ converges to $\tilde z$. Let $\sigma : \{k_i\} \to \{l_i\}$ be a mapping that always maps $k_i$ to a larger index in $\{l_i\}$, i.e., $\sigma(k_i) > k_i$. Such a mapping clearly exists. Since Lemma 59 applies to $a z^*$ for any $a \ge 1$, we get
\[
\|a z^* - \hat z\|^2 = \lim_{i\to\infty} \|a z^* - z^{k_i}\|^2 \ge \lim_{i\to\infty} \|a z^* - z^{\sigma(k_i)}\|^2 = \|a z^* - \tilde z\|^2.
\]
By symmetry, the other direction also holds. Thus $\|a z^* - \hat z\|^2 = \|a z^* - \tilde z\|^2$ for all $a \ge 1$.
Expanding $\|a z^* - \hat z\|^2 = \|a z^* - \tilde z\|^2$ gives
\[
\|\hat z\|^2 - \|\tilde z\|^2 = 2a\langle z^*, \hat z - \tilde z \rangle, \quad \forall a \ge 1.
\]
It implies that $\|\hat z\|^2 = \|\tilde z\|^2$ and $\langle z^*, \hat z - \tilde z \rangle = 0$.
Proof of Lemma 63. For the sake of contradiction, let $\hat z = (\hat R\hat x, \hat Q\hat y)$ and $\tilde z = (\tilde R\tilde x, \tilde Q\tilde y)$ be any two distinct limit points of $\{z^t\}$. In the following, we will show a key equality.
Proposition 129. $\hat R + \tilde R = \hat Q + \tilde Q$.
Case 1: $\hat z = (\hat R x^*, \hat Q y^*)$ and $\tilde z = (\tilde R x^*, \tilde Q y^*)$ for a Nash equilibrium $z^* = (x^*, y^*)$. Part 2 of Lemma 62 gives
\[
\|\hat z\|^2 = \|\tilde z\|^2 \;\Rightarrow\; \big(\hat R^2 - \tilde R^2\big)\|x^*\|^2 = \big(\tilde Q^2 - \hat Q^2\big)\|y^*\|^2.
\]
Part 3 of Lemma 62 gives
\[
\langle z^*, \hat z - \tilde z \rangle = 0 \;\Rightarrow\; \big(\hat R - \tilde R\big)\|x^*\|^2 = \big(\tilde Q - \hat Q\big)\|y^*\|^2.
\]
Combining the above two equalities with the fact that $\hat z \ne \tilde z$, we get
\[
\hat R + \tilde R = \hat Q + \tilde Q.
\]
Case 2: $\hat z = (\hat R\hat x, \hat Q\hat y)$ and $\tilde z = (\tilde R\tilde x, \tilde Q\tilde y)$. Note that by Lemma 60, both $(\hat x, \hat y)$ and $(\tilde x, \tilde y)$ are Nash equilibria of $A$, and thus $(\hat x, \tilde y)$ is also a Nash equilibrium of $A$. Using Lemma 62, we have the following equalities:
1. $\|\hat z\|^2 = \|\tilde z\|^2 \;\Rightarrow\; \hat R^2\|\hat x\|^2 - \tilde R^2\|\tilde x\|^2 = \tilde Q^2\|\tilde y\|^2 - \hat Q^2\|\hat y\|^2$
2. $0 = \langle (\hat x, \hat y), \hat z - \tilde z \rangle = \hat R\|\hat x\|^2 - \tilde R\langle \hat x, \tilde x \rangle + \hat Q\|\hat y\|^2 - \tilde Q\langle \hat y, \tilde y \rangle$
3. $0 = \langle (\tilde x, \tilde y), \hat z - \tilde z \rangle = \hat R\langle \hat x, \tilde x \rangle - \tilde R\|\tilde x\|^2 + \hat Q\langle \hat y, \tilde y \rangle - \tilde Q\|\tilde y\|^2$
4. $0 = \langle (\hat x, \tilde y), \hat z - \tilde z \rangle = \hat R\|\hat x\|^2 - \tilde R\langle \hat x, \tilde x \rangle + \hat Q\langle \hat y, \tilde y \rangle - \tilde Q\|\tilde y\|^2$
Combining the last three equalities gives
\[
c := \hat R\|\hat x\|^2 - \tilde R\langle \hat x, \tilde x \rangle = \hat R\langle \hat x, \tilde x \rangle - \tilde R\|\tilde x\|^2,
\]
and
\[
-c = \hat Q\|\hat y\|^2 - \tilde Q\langle \hat y, \tilde y \rangle = \hat Q\langle \hat y, \tilde y \rangle - \tilde Q\|\tilde y\|^2.
\]
Further combining with the first equality gives
\[
\begin{aligned}
&\big(\hat R + \tilde R\big)c + \big(\hat Q + \tilde Q\big)(-c) \\
&= \hat R\big(\hat R\|\hat x\|^2 - \tilde R\langle \hat x, \tilde x \rangle\big) + \tilde R\big(\hat R\langle \hat x, \tilde x \rangle - \tilde R\|\tilde x\|^2\big) + \hat Q\big(\hat Q\|\hat y\|^2 - \tilde Q\langle \hat y, \tilde y \rangle\big) + \tilde Q\big(\hat Q\langle \hat y, \tilde y \rangle - \tilde Q\|\tilde y\|^2\big) \\
&= \hat R^2\|\hat x\|^2 - \tilde R^2\|\tilde x\|^2 - \tilde Q^2\|\tilde y\|^2 + \hat Q^2\|\hat y\|^2 \\
&= 0.
\end{aligned}
\]
If $c \ne 0$, then we get $\hat R + \tilde R = \hat Q + \tilde Q$. If $c = 0$, then $\hat R\tilde R\|\hat x\|^2\|\tilde x\|^2 = \hat R\tilde R\langle \hat x, \tilde x \rangle^2$ and $\hat Q\tilde Q\|\hat y\|^2\|\tilde y\|^2 = \hat Q\tilde Q\langle \hat y, \tilde y \rangle^2$, which give the equality condition of the Cauchy–Schwarz inequality, so we get $\hat x = \tilde x$ and $\hat y = \tilde y$. This reduces to Case 1, where we know $\hat R + \tilde R = \hat Q + \tilde Q$ holds. This completes the proof of the claim.
Note that $\hat R + \tilde R = \hat Q + \tilde Q$ must hold for any two distinct limit points $\hat z$ and $\tilde z$ of $\{z^t\}$. If $\hat R = \hat Q$, then $\hat z = \hat R(\hat x, \hat y)$ is colinear with a Nash equilibrium $(\hat x, \hat y)$, and by Proposition 61, $\{z^t\}$ converges to $\hat z$, so $\hat z$ is the unique limit point. If $\hat R \ne \hat Q$, we argue that $\hat z$ and $\tilde z$ are the only two possible limit points. To see why, suppose that there exists another limit point $z = (Rx, Qy)$ distinct from $\hat z$ and $\tilde z$. If $R = \hat R$, then by $R + \tilde R = Q + \tilde Q$, we know $Q = \hat Q$. However, $R + \hat R = 2\hat R \ne 2\hat Q = Q + \hat Q$, which contradicts Proposition 129. If $R \ne \hat R$, then combining
\[
\hat R + R = \hat Q + Q, \qquad \tilde R + R = \tilde Q + Q, \qquad \hat R + \tilde R = \hat Q + \tilde Q
\]
gives
\[
\hat R - \hat Q = Q - R = \tilde R - \tilde Q = \hat Q - \hat R.
\]
Thus we have $\hat R = \hat Q$, which contradicts the assumption that $\hat R \ne \hat Q$. We now conclude that $\{z^t\}$ has at most two distinct limit points, which, by Proposition 126, means $\{z^t\}$ has a unique limit point, given that $\{z^t\}$ is bounded and $\lim_{t\to\infty}\|z^{t+1} - z^t\| = 0$.
E.14.2 Proofs for Section 6.7.2
We will need the following lemma. Recall the normalization operator $g : \mathbb{R}^{d_1}_+ \times \mathbb{R}^{d_2}_+ \to Z$ such that for $z = (z_1, z_2) \in \mathbb{R}^{d_1}_+ \times \mathbb{R}^{d_2}_+$, we have $g(z) = (z_1/\|z_1\|_1,\, z_2/\|z_2\|_1) \in Z$.
Lemma 130. Let $z \in \Delta^{d_1} \times \Delta^{d_2}$ and $z_1, z_2, z_3 \in Z_\ge$ be such that $z_3 = \Pi_{Z_\ge}[z_1 - \eta F(z_2)]$. Denote $(x_3, y_3) = g(z_3)$. Then
\[
\mathrm{DualityGap}(x_3, y_3) := \max_{y \in \Delta^{d_2}} x_3^\top A y - \min_{x \in \Delta^{d_1}} x^\top A y_3 \le \frac{\big(\|z_1 - z_3\| + \eta L_F\|z_3 - z_2\|\big)\big(\|z_3 - z\| + 2\big)}{\eta}.
\]
Proof of Lemma 130. Let $\hat x \in \arg\min_x x^\top A y_3$ and $\hat y \in \arg\max_y x_3^\top A y$, and write $\hat z = (\hat x, \hat y)$. Then we have
\[
\begin{aligned}
\max_y x_3^\top A y - \min_x x^\top A y_3 &= x_3^\top A\hat y - \hat x^\top A y_3 = -\langle F(z_3), \hat z \rangle \\
&= \frac{1}{\eta}\Big( \big\langle z_1 - z_3 + \eta F(z_3) - \eta F(z_2),\, z_3 - \hat z \big\rangle + \big\langle z_1 - \eta F(z_2) - z_3,\, \hat z - z_3 \big\rangle \Big) \\
&\le \frac{1}{\eta}\big\langle z_1 - z_3 + \eta F(z_3) - \eta F(z_2),\, z_3 - \hat z \big\rangle && \big(z_3 = \Pi_{Z_\ge}[z_1 - \eta F(z_2)]\big) \\
&\le \frac{\big(\|z_1 - z_3\| + \eta L_F\|z_3 - z_2\|\big)\|z_3 - \hat z\|}{\eta}.
\end{aligned}
\]
Moreover, we have
\[
\|z_3 - \hat z\| \le \|z_3 - z\| + \|z - \hat z\| \le \|z_3 - z\| + 2.
\]
Combining the above two inequalities completes the proof.
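For reference, in a matrix game the two inner optimizations in this definition are attained at pure strategies, so the duality gap of a normalized pair can be computed directly. The small sketch below is our own code, not the thesis implementation:

import numpy as np

def duality_gap(A, x, y):
    """max_{y'} x^T A y' - min_{x'} x'^T A y  =  max_j (A^T x)_j - min_i (A y)_i."""
    return float(np.max(A.T @ x) - np.min(A @ y))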
We now show a few useful properties of the iterates.
Proposition 131. In the same setup as Lemma 59, for any $t \ge 1$ the following hold.
1. $\|z^{t+1} - z^{t+\frac{1}{2}}\| \le \eta L_F\|z^{t+\frac{1}{2}} - z^t\| \le \|z^{t+\frac{1}{2}} - z^t\|$;
2. $\|z^{t+1} - z^t\| \le (1 + \eta L_F)\|z^{t+\frac{1}{2}} - z^t\| \le 2\|z^{t+\frac{1}{2}} - z^t\|$.
Proof of Proposition 131. Using the update rule of ExRM+ and the non-expansiveness of the projection operator $\Pi_{Z_\ge}[\cdot]$, we have
\[
\|z^{t+1} - z^{t+\frac{1}{2}}\| \le \big\|z^t - \eta F(z^t) - \big(z^t - \eta F(z^{t+\frac{1}{2}})\big)\big\| = \eta\big\|F(z^{t+\frac{1}{2}}) - F(z^t)\big\| \le \eta L_F\|z^{t+\frac{1}{2}} - z^t\|.
\]
Using the triangle inequality, we have
\[
\|z^{t+1} - z^t\| \le \|z^{t+1} - z^{t+\frac{1}{2}}\| + \|z^{t+\frac{1}{2}} - z^t\| \le (1 + \eta L_F)\|z^{t+\frac{1}{2}} - z^t\|.
\]
We are now ready to prove Lemma 65.
Proof of Lemma 65. Let $z^*$ be any Nash equilibrium of the game. Since $z^{t+1} = \Pi_{Z_\ge}[z^t - \eta F(z^{t+\frac{1}{2}})]$, applying Lemma 130 gives
\[
\begin{aligned}
\mathrm{DualityGap}(x^{t+1}, y^{t+1}) &\le \frac{\big(\|z^{t+1} - z^t\| + \eta L_F\|z^{t+1} - z^{t+\frac{1}{2}}\|\big)\big(\|z^{t+1} - z^*\| + 2\big)}{\eta} \\
&\le \frac{3\|z^{t+\frac{1}{2}} - z^t\|\,\big(\|z^{t+1} - z^*\| + 2\big)}{\eta} && (\eta L_F \le 1 \text{ and Proposition 131}) \\
&\le \frac{3\|z^{t+\frac{1}{2}} - z^t\|\,\big(\|z^0 - z^*\| + 2\big)}{\eta} && (\text{Lemma 59}) \\
&\le \frac{12\|z^{t+\frac{1}{2}} - z^t\|}{\eta}, && (z^0 \text{ and } z^* \text{ are both in } \Delta^{d_1} \times \Delta^{d_2})
\end{aligned}
\]
which finishes the proof.
E.14.3 Proofs for Section 6.7.3
Proof of Theorem 68. We note that RS-ExRM+ always restarts in the first iteration since
\[
\|z^{\frac{1}{2}} - z^0\| \le \frac{\|z^0 - z^*\|}{\sqrt{1 - \eta^2 L_F^2}} \le \frac{2}{\sqrt{1 - \eta^2 L_F^2}}.
\]
Let $1 = t_1 < t_2 < \ldots$ be the iterations at which the algorithm restarts with $(x^{t_k}, y^{t_k})$. Using Lemma 65 and the restart condition, we know that for any $k \ge 1$,
\[
\mathrm{DualityGap}(x^{t_k}, y^{t_k}) \le \frac{12\,\|z^{t_k - \frac{1}{2}} - z^{t_k - 1}\|}{\eta} \le \frac{48}{\eta\sqrt{1 - \eta^2 L_F^2}} \cdot \frac{1}{2^k},
\]
\[
\big\|(x^{t_k}, y^{t_k}) - \Pi_{Z^*}[(x^{t_k}, y^{t_k})]\big\| \le \frac{\mathrm{DualityGap}(x^{t_k}, y^{t_k})}{c} \le \frac{48}{c\eta\sqrt{1 - \eta^2 L_F^2}} \cdot \frac{1}{2^k}.
\]
Moreover, for any iterate $t \in [t_k + 1, t_{k+1} - 1]$, using Lemma 65, we get
\[
\mathrm{DualityGap}(x^t, y^t) \le \frac{12\,\|z^{t - \frac{1}{2}} - z^{t-1}\|}{\eta} \le \frac{12\,\big\|z^{t_k} - \Pi_{Z^*}[(x^{t_k}, y^{t_k})]\big\|}{\eta\sqrt{1 - \eta^2 L_F^2}} \le \frac{576}{c\eta^2(1 - \eta^2 L_F^2)} \cdot \frac{1}{2^k}.
\]
Using metric subregularity (Proposition 67), we have for any iterate $t \in [t_k + 1, t_{k+1} - 1]$,
\[
\big\|(x^t, y^t) - \Pi_{Z^*}[(x^t, y^t)]\big\| \le \frac{576}{c^2\eta^2(1 - \eta^2 L_F^2)} \cdot \frac{1}{2^k}.
\]
If we can prove that $t_{k+1} - t_k$ is always bounded by a constant $C \ge 1$, then $t_k \le C(k-1) + 1 \le Ck$. Then for any $t \ge 1$, we have $t \ge t_{\lfloor t/C \rfloor}$ and
\[
\big\|(x^t, y^t) - \Pi_{Z^*}[(x^t, y^t)]\big\| \le \frac{576}{c^2\eta^2(1 - \eta^2 L_F^2)} \cdot \frac{1}{2^{\lfloor t/C \rfloor}} \le \frac{576}{c^2\eta^2(1 - \eta^2 L_F^2)} \cdot 2^{-\frac{t}{C}} \le \frac{576}{c^2\eta^2(1 - \eta^2 L_F^2)} \cdot \Big(1 - \frac{1}{3C}\Big)^{t}.
\]
Bounding $t_{k+1} - t_k$. If $t_{k+1} - t_k = 1$, then the claim holds. Now we assume $t_{k+1} - t_k \ge 2$. Before restarting, the update of $z^t$ for $t_k \le t \le t_{k+1} - 2$ is just ExRM+. By Lemma 59, we know
\[
\sum_{t = t_k}^{t_{k+1} - 2} \|z^{t+\frac{1}{2}} - z^t\|^2 \le \frac{\big\|(x^{t_k}, y^{t_k}) - \Pi_{Z^*}[(x^{t_k}, y^{t_k})]\big\|^2}{1 - \eta^2 L_F^2}.
\]
It implies that there exists $t \in [t_k, t_{k+1} - 2]$ such that
\[
\|z^{t+\frac{1}{2}} - z^t\| \le \frac{\big\|(x^{t_k}, y^{t_k}) - \Pi_{Z^*}[(x^{t_k}, y^{t_k})]\big\|}{\sqrt{1 - \eta^2 L_F^2}\,\sqrt{t_{k+1} - t_k - 1}} \le \frac{48}{c\eta(1 - \eta^2 L_F^2)\sqrt{t_{k+1} - t_k - 1} \cdot 2^k}.
\]
On the other hand, since the algorithm does not restart in $[t_k, t_{k+1} - 2]$, we know that for every $t \in [t_k, t_{k+1} - 2]$,
\[
\|z^{t+\frac{1}{2}} - z^t\| > \frac{4}{\sqrt{1 - \eta^2 L_F^2} \cdot 2^{k+1}}.
\]
Combining the above inequalities gives
\[
t_{k+1} - t_k - 1 \le \frac{48^2}{c^2\eta^2(1 - \eta^2 L_F^2)^2 \cdot 2^{2k}} \cdot \frac{(1 - \eta^2 L_F^2) \cdot 2^{2k+2}}{16} = \frac{48^2/4}{c^2\eta^2(1 - \eta^2 L_F^2)} = \frac{576}{c^2\eta^2(1 - \eta^2 L_F^2)}.
\]
Linear Last-Iterate Convergence. Combining all the above, we get for all $t \ge 1$,
\[
\mathrm{DualityGap}(x^t, y^t) \le \frac{576}{c\eta^2(1 - \eta^2 L_F^2)} \cdot \left(1 - \frac{1}{3\big(1 + \frac{576}{c^2\eta^2(1 - \eta^2 L_F^2)}\big)}\right)^{t},
\]
\[
\big\|(x^t, y^t) - \Pi_{Z^*}[(x^t, y^t)]\big\| \le \frac{576}{c^2\eta^2(1 - \eta^2 L_F^2)} \cdot \left(1 - \frac{1}{3\big(1 + \frac{576}{c^2\eta^2(1 - \eta^2 L_F^2)}\big)}\right)^{t}.
\]
This completes the proof.
E.15 Proofs for Section 6.8
E.15.1 Asymptotic Convergence of SPRM+
In the following, we will prove last-iterate convergence of SPRM+ (Algorithm 12). The proof follows the
same idea as that of ExRM+ in previous sections. Applying standard analysis of OG, we have the following
important lemma for SPRM+.
Lemma 132 (Adapted from Lemma 1 in [161]). Let $z^* \in Z_\ge$ be a point such that $\langle F(z), z - z^* \rangle \ge 0$ for all $z \in Z_\ge$. Let $\{z^t\}$ and $\{w^t\}$ be the sequences produced by SPRM+. Then for every iteration $t \ge 0$ it holds that
\[
\|w^{t+1} - z^*\|^2 + \frac{1}{16}\|w^{t+1} - z^t\|^2 \le \|w^t - z^*\|^2 + \frac{1}{16}\|w^t - z^{t-1}\|^2 - \frac{15}{16}\Big( \|w^{t+1} - z^t\|^2 + \|w^t - z^t\|^2 \Big).
\]
It also implies that for any $t \ge 0$,
\[
\|w^{t+1} - z^t\| + \|w^t - z^t\| \le 2\|w^0 - z^*\|, \qquad \|w^t - z^t\| + \|w^t - z^{t-1}\| \le 2\|w^0 - z^*\|.
\]
Lemma 133. Let $\{z^t\}$ and $\{w^t\}$ be the sequences produced by SPRM+. Then:
1. The sequences $\{w^t\}$ and $\{z^t\}$ are bounded and thus have at least one limit point.
2. If the sequence $\{w^t\}$ converges to $\hat w \in Z_\ge$, then the sequence $\{z^t\}$ also converges to $\hat w$.
3. If $\hat w$ is a limit point of $\{w^t\}$, then $\hat w \in \mathrm{SOL}(Z_\ge, F)$ and $g(\hat w)$ is a Nash equilibrium of the game.
Proof. By Lemma 132 and the fact that $w^0 = z^{-1}$, we know that for any $t \ge 0$,
\[
\|w^{t+1} - z^*\|^2 + \frac{1}{16}\|w^{t+1} - z^t\|^2 \le \|w^0 - z^*\|^2,
\]
which implies that the sequences $\{w^t\}$ and $\{z^t\}$ are bounded. Thus, both of them have at least one limit point.
By Lemma 132, we have
\[
\frac{15}{32}\sum_{t=1}^{T} \|w^{t+1} - w^t\|^2 \le \frac{15}{16}\sum_{t=1}^{T} \Big( \|w^{t+1} - z^t\|^2 + \|w^t - z^t\|^2 \Big) \le \|w^0 - z^*\|^2 + \frac{1}{16}\|w^0 - z^{-1}\|^2.
\]
This implies
\[
\lim_{t\to\infty}\|w^{t+1} - w^t\| = 0, \qquad \lim_{t\to\infty}\|w^t - z^t\| = 0, \qquad \lim_{t\to\infty}\|w^{t+1} - z^t\| = 0.
\]
If $\hat w$ is the limit point of a subsequence $\{w^t : t \in \kappa\}$, then we must have
\[
\lim_{(t\in\kappa)\to\infty} w^{t+1} = \hat w, \qquad \lim_{(t\in\kappa)\to\infty} z^t = \hat w.
\]
By the definition of $w^{t+1}$ in SPRM+ and by the continuity of $F$ and of the projection, we get
\[
\hat w = \lim_{t(\in\kappa)\to\infty} w^{t+1} = \lim_{t(\in\kappa)\to\infty} \Pi_{Z_\ge}\big[w^t - \eta F(z^t)\big] = \Pi_{Z_\ge}\big[\hat w - \eta F(\hat w)\big].
\]
This shows that $\hat w \in \mathrm{SOL}(Z_\ge, F)$. By Lemma 130, we know that $g(\hat w)$ is a Nash equilibrium of the game.
Proposition 134. Let $\{z^t\}$ and $\{w^t\}$ be the sequences produced by SPRM+. If $\{w^t\}$ has a limit point $\hat w$ such that $\hat w = a z^*$ for $z^* \in \Delta^{d_1} \times \Delta^{d_2}$ and $a \ge 1$ (equivalently, $\hat w$ is colinear with a pair of strategies in the simplex), then $\{w^t\}$ converges to $\hat w$.
Proof. By Proposition 58, we know the MVI condition holds for $\hat w$. Then by Lemma 132, $\{\|w^t - \hat w\|^2 + \frac{1}{16}\|w^t - z^{t-1}\|^2\}$ is monotonically decreasing and therefore converges. Let $\hat w$ be the limit of the subsequence $\{w^t : t \in \kappa\}$. Using the fact that $\lim_{t\to\infty}\|w^t - z^{t-1}\| = 0$, we know that $\{\|w^t - \hat w\|\}$ also converges and
\[
\lim_{t\to\infty}\Big( \|w^t - \hat w\|^2 + \frac{1}{16}\|w^t - z^{t-1}\|^2 \Big) = \lim_{t\to\infty}\|w^t - \hat w\|^2 = \lim_{t(\in\kappa)\to\infty}\|w^t - \hat w\|^2 = 0.
\]
Thus, $\{w^t\}$ converges to $\hat w$.
Lemma 135 (Structure of Limit Points). Let $\{z^t\}$ and $\{w^t\}$ be the sequences produced by SPRM+, and let $z^* \in \Delta^{d_1} \times \Delta^{d_2}$ be any Nash equilibrium of $A$. If $\hat w$ and $\tilde w$ are two limit points of $\{w^t\}$, then the following hold.
1. $\|a z^* - \hat w\|^2 = \|a z^* - \tilde w\|^2$ for all $a \ge 1$.
2. $\|\hat w\|^2 = \|\tilde w\|^2$.
3. $\langle z^*, \hat w - \tilde w \rangle = 0$.
Proof. Let $\{k_i \in \mathbb{Z}\}$ be an increasing sequence of indices such that $\{w^{k_i}\}$ converges to $\hat w$, and let $\{l_i \in \mathbb{Z}\}$ be an increasing sequence of indices such that $\{w^{l_i}\}$ converges to $\tilde w$. Let $\sigma : \{k_i\} \to \{l_i\}$ be a mapping that always maps $k_i$ to a larger index in $\{l_i\}$, i.e., $\sigma(k_i) > k_i$. Such a mapping clearly exists. Since Lemma 132 applies to $a z^*$ for any $a \ge 1$ and $\lim_{t\to\infty}\|w^t - z^{t-1}\| = 0$, we get
\[
\|a z^* - \hat w\|^2 = \lim_{i\to\infty}\Big( \|a z^* - w^{k_i}\|^2 + \frac{1}{16}\|w^{k_i} - z^{k_i - 1}\|^2 \Big) \ge \lim_{i\to\infty}\Big( \|a z^* - w^{\sigma(k_i)}\|^2 + \frac{1}{16}\|w^{\sigma(k_i)} - z^{\sigma(k_i) - 1}\|^2 \Big) = \|a z^* - \tilde w\|^2.
\]
By symmetry, the other direction also holds. Thus, $\|a z^* - \hat w\|^2 = \|a z^* - \tilde w\|^2$ for all $a \ge 1$.
Expanding $\|a z^* - \hat w\|^2 = \|a z^* - \tilde w\|^2$ gives
\[
\|\hat w\|^2 - \|\tilde w\|^2 = 2a\langle z^*, \hat w - \tilde w \rangle, \quad \forall a \ge 1.
\]
It implies that $\|\hat w\|^2 = \|\tilde w\|^2$ and $\langle z^*, \hat w - \tilde w \rangle = 0$.
With Lemma 132, Lemma 133, Proposition 134, and Lemma 135, by the same argument as the proof
for ExRM+ in Lemma 63, we can prove that {wt} produced by SPRM+ has a unique limit point and thus
converges, as shown in the following lemma.
Lemma 136 (Uniqueness of Limit Point). The iterates {wt} produced by SPRM+ have a unique limit point.
Thus, Theorem 69 directly follows from Lemma 133 and Lemma 136.
E.15.2 Best-Iterate Convergence of SPRM+
Lemma 137. Let $\{z^t\}$ and $\{w^t\}$ be the sequences produced by SPRM+ (Algorithm 12). Then:
1. $\mathrm{DualityGap}(g(w^{t+1})) \le \frac{5(\|w^t - z^t\| + \|w^{t+1} - z^t\|)}{\eta}$;
2. $\mathrm{DualityGap}(g(z^t)) \le \frac{9(\|w^t - z^t\| + \|w^t - z^{t-1}\|)}{\eta}$.
Proof of Lemma 137. Since $w^{t+1} = \Pi_{Z_\ge}[w^t - \eta F(z^t)]$, by Lemma 130 we know that for any $z^* \in Z^*$,
\[
\begin{aligned}
\mathrm{DualityGap}(g(w^{t+1})) &\le \frac{\big(\|w^{t+1} - w^t\| + \eta L_F\|w^{t+1} - z^t\|\big)\big(\|w^{t+1} - z^*\| + 2\big)}{\eta} \\
&\le \frac{(1 + \eta L_F)\big(\|w^t - z^t\| + \|w^{t+1} - z^t\|\big)\big(\|w^{t+1} - z^*\| + 2\big)}{\eta} \\
&\le \frac{(1 + \eta L_F)\big(\|w^t - z^t\| + \|w^{t+1} - z^t\|\big)\big(\|w^0 - z^*\| + 2\big)}{\eta} \\
&\le \frac{5\big(\|w^t - z^t\| + \|w^{t+1} - z^t\|\big)}{\eta},
\end{aligned}
\]
where in the second inequality we apply $\|w^{t+1} - w^t\| \le \|w^{t+1} - z^t\| + \|w^t - z^t\|$; in the third inequality we use $\|w^{t+1} - z^*\| \le \|w^0 - z^*\|$ by Lemma 132; and in the last inequality we use $\eta L_F \le \frac{1}{8}$ and $\|w^0 - z^*\| \le 2$ since $w^0, z^* \in \Delta^{d_1} \times \Delta^{d_2}$.
Similarly, since $z^t = \Pi_{Z_\ge}[w^t - \eta F(z^{t-1})]$, by Lemma 130 we know that for any $z^* \in Z^*$,
\[
\begin{aligned}
\mathrm{DualityGap}(g(z^t)) &\le \frac{\big(\|z^t - w^t\| + \eta L_F\|z^t - z^{t-1}\|\big)\big(\|z^t - z^*\| + 2\big)}{\eta} \\
&\le \frac{(1 + \eta L_F)\big(\|w^t - z^t\| + \|w^t - z^{t-1}\|\big)\big(\|w^t - z^*\| + \|w^t - z^t\| + 2\big)}{\eta} \\
&\le \frac{(1 + \eta L_F)\big(\|w^t - z^t\| + \|w^t - z^{t-1}\|\big)\big(\|w^0 - z^*\| + 2\|w^0 - z^*\| + 2\big)}{\eta} \\
&\le \frac{9\big(\|w^t - z^t\| + \|w^t - z^{t-1}\|\big)}{\eta},
\end{aligned}
\]
where in the second inequality we apply $\|z^t - z^{t-1}\| \le \|z^t - w^t\| + \|w^t - z^{t-1}\|$ and $\|z^t - z^*\| \le \|z^t - w^t\| + \|w^t - z^*\|$; in the third inequality we use $\|w^t - z^*\| \le \|w^0 - z^*\|$ and $\|w^t - z^t\| \le 2\|w^0 - z^*\|$ by Lemma 132; and in the last inequality we use $\eta L_F \le \frac{1}{8}$ and $\|w^0 - z^*\| \le 2$ since $w^0, z^* \in \Delta^{d_1} \times \Delta^{d_2}$.
Proof of Theorem 70. Fix any Nash equilibrium $z^*$ of the game. From Lemma 132, we know that
\[
\sum_{t=0}^{T-1}\Big( \|w^{t+1} - z^t\|^2 + \|w^t - z^t\|^2 \Big) \le \frac{16}{15}\,\|w^0 - z^*\|^2.
\]
This implies that there exists $0 \le t \le T-1$ such that $\|w^{t+1} - z^t\| + \|w^t - z^t\| \le \frac{2\|w^0 - z^*\|}{\sqrt{T}}$ (we use $a + b \le \sqrt{2(a^2 + b^2)}$). By applying Lemma 137, we then get
\[
\mathrm{DualityGap}(g(w^{t+1})) \le \frac{5\big(\|w^t - z^t\| + \|w^{t+1} - z^t\|\big)}{\eta} \le \frac{10\|w^0 - z^*\|}{\eta\sqrt{T}}.
\]
Similarly, using Lemma 132 and the fact that $w^0 = z^{-1}$, we know
\[
\sum_{t=0}^{T}\Big( \|w^t - z^t\|^2 + \|w^t - z^{t-1}\|^2 \Big) \le \sum_{t=0}^{T}\Big( \|w^t - z^t\|^2 + \|w^{t+1} - z^t\|^2 \Big) \le \frac{16}{15}\,\|w^0 - z^*\|^2.
\]
This implies that there exists $0 \le t \le T$ such that $\|w^t - z^t\| + \|w^t - z^{t-1}\| \le \frac{2\|w^0 - z^*\|}{\sqrt{T}}$ (we use $a + b \le \sqrt{2(a^2 + b^2)}$). By applying Lemma 137, we then get
\[
\mathrm{DualityGap}(g(z^t)) \le \frac{9\big(\|w^t - z^t\| + \|w^t - z^{t-1}\|\big)}{\eta} \le \frac{18\|w^0 - z^*\|}{\eta\sqrt{T}}.
\]
E.15.3 Linear Convergence of RS-SPRM+.
Proof of Theorem 71. We note that Algorithm 14 always restarts in the first iteration $t = 1$ since
\[
\|w^1 - z^0\| + \|w^0 - z^0\| \le \sqrt{\tfrac{32}{15}}\,\|w^0 - z^*\| \le 4 = \frac{8}{2^1}.
\]
We denote by $1 = t_1 < t_2 < \ldots < t_k < \ldots$ the iterations at which Algorithm 14 restarts for the $k$-th time, i.e., at which the "If" condition holds. Then by Lemma 137, we know that for every $t_k$, $w^{t_k} = g(w^{t_k}) \in Z = \Delta^{d_1} \times \Delta^{d_2}$ and
\[
\mathrm{DualityGap}(w^{t_k}) \le \frac{5\big(\|w^{t_k} - z^{t_k - 1}\| + \|w^{t_k - 1} - z^{t_k - 1}\|\big)}{\eta} \le \frac{40}{\eta \cdot 2^k},
\]
and by metric subregularity (Proposition 67), we have
\[
\big\|w^{t_k} - \Pi_{Z^*}[w^{t_k}]\big\| \le \frac{\mathrm{DualityGap}(g(w^{t_k}))}{c} \le \frac{40}{c\eta \cdot 2^k}.
\]
Then, for any iteration $t_k + 1 \le t \le t_{k+1} - 1$, using Lemma 132 and Lemma 137, we have
\[
\begin{aligned}
\mathrm{DualityGap}(g(w^t)) &\le \frac{5\big(\|w^t - z^{t-1}\| + \|w^{t-1} - z^{t-1}\|\big)}{\eta} && (\text{Lemma 137}) \\
&\le \frac{10\,\|w^{t_k} - \Pi_{Z^*}[w^{t_k}]\|}{\eta} && (\text{Lemma 132}) \\
&\le \frac{400}{c\eta^2 \cdot 2^k}.
\end{aligned}
\]
Then again by metric subregularity, we also get $\|g(w^t) - \Pi_{Z^*}[g(w^t)]\| \le \frac{400}{c^2\eta^2 \cdot 2^k}$ for every $t \in [t_k + 1, t_{k+1} - 1]$.
Similarly, for any iteration $t_k \le t \le t_{k+1} - 1$, using Lemma 132 and Lemma 137, we have
\[
\begin{aligned}
\mathrm{DualityGap}(g(z^t)) &\le \frac{9\big(\|w^t - z^t\| + \|w^t - z^{t-1}\|\big)}{\eta} && (\text{Lemma 137}) \\
&\le \frac{18\,\|w^{t_k} - \Pi_{Z^*}[w^{t_k}]\|}{\eta} && (\text{Lemma 132}) \\
&\le \frac{720}{c\eta^2 \cdot 2^k}.
\end{aligned}
\]
Then by metric subregularity, we also get $\|g(z^t) - \Pi_{Z^*}[g(z^t)]\| \le \frac{720}{c^2\eta^2 \cdot 2^k}$ for every $t \in [t_k, t_{k+1} - 1]$.
Bounding $t_{k+1} - t_k$. Fix any $k \ge 1$. If $t_{k+1} = t_k + 1$, then we are done. Now we assume $t_{k+1} > t_k + 1$. By Lemma 132, we know
\[
\sum_{t = t_k}^{t_{k+1} - 2} \Big( \|w^{t+1} - z^t\|^2 + \|w^t - z^t\|^2 \Big) \le \frac{16}{15}\,\big\|w^{t_k} - \Pi_{Z^*}[w^{t_k}]\big\|^2.
\]
This implies that there exists $t \in [t_k, t_{k+1} - 2]$ such that
\[
\|w^{t+1} - z^t\| + \|w^t - z^t\| \le \frac{2\,\|w^{t_k} - \Pi_{Z^*}[w^{t_k}]\|}{\sqrt{t_{k+1} - t_k - 1}} \le \frac{80}{c\eta\, 2^k} \cdot \frac{1}{\sqrt{t_{k+1} - t_k - 1}}.
\]
On the other hand, since Algorithm 14 does not restart in $[t_k, t_{k+1} - 2]$, we have for every $t \in [t_k, t_{k+1} - 2]$,
\[
\|w^{t+1} - z^t\| + \|w^t - z^t\| > \frac{8}{2^{k+1}}.
\]
Combining the above two inequalities, we get
\[
t_{k+1} - t_k - 1 \le \frac{400}{c^2\eta^2}.
\]
Linear Last-Iterate Convergence Rates. Define $C := 1 + \frac{400}{c^2\eta^2} \ge t_{k+1} - t_k$. Then $t_k \le Ck$, and for any $t \ge 1$ we have $t \ge t_{\lfloor t/C \rfloor}$ and
\[
\mathrm{DualityGap}(g(w^t)) \le \frac{400}{c\eta^2} \cdot 2^{-\lfloor t/C \rfloor} \le \frac{400}{c\eta^2}\Big(1 - \frac{1}{3C}\Big)^{t} = \frac{400}{c\eta^2} \cdot \left(1 - \frac{1}{3\big(1 + \frac{400}{c^2\eta^2}\big)}\right)^{t}.
\]
Using metric subregularity, we get for any $t \ge 1$,
\[
\big\|g(w^t) - \Pi_{Z^*}[g(w^t)]\big\| \le \frac{400}{c^2\eta^2} \cdot \left(1 - \frac{1}{3\big(1 + \frac{400}{c^2\eta^2}\big)}\right)^{t}.
\]
Similarly, we have for any $t \ge 1$,
\[
\mathrm{DualityGap}(g(z^t)) \le \frac{720}{c\eta^2} \cdot \left(1 - \frac{1}{3\big(1 + \frac{400}{c^2\eta^2}\big)}\right)^{t}, \qquad
\big\|g(z^t) - \Pi_{Z^*}[g(z^t)]\big\| \le \frac{720}{c^2\eta^2} \cdot \left(1 - \frac{1}{3\big(1 + \frac{400}{c^2\eta^2}\big)}\right)^{t}.
\]
This completes the proof.
E.16 Additional Numerical Experiments
E.16.1 Game Instances
Below we describe the extensive-form benchmark game instances we test in our experiments. These games
are solved in their normal-form representation. For each game, we report the size of the payoff matrix in
this representation.
Kuhn poker Kuhn poker is a widely used benchmark game introduced by Kuhn [101]. At the beginning
of the game, each player pays one chip to the pot, and each player is dealt a single private card from a deck
containing three cards: jack, queen, and king. The first player can check or bet, i.e., putting an additional
chip in the pot. Then, the second player can check or bet after the first player’s check, or fold/call the first
player’s bet. After a bet of the second player, the first player still has to decide whether to fold or to call
the bet. At the showdown, the player with the highest card who has not folded wins all the chips in the
pot.
The payoff matrix for Kuhn poker has dimension 27 × 64 and 690 nonzeros.
Goofspiel We use an imperfect-information variant of the standard benchmark game introduced by
Ross [138]. We use a 4-rank variant, that is, each player has a hand of cards with values {1, 2, 3, 4}. A
third stack of cards with values {1, 2, 3, 4} (in order) is placed on the table. At each turn, a prize card is
revealed, and each player privately chooses one of his/her cards to bid. The players do not reveal the cards
that they have selected. Rather, they show their cards to a fair umpire, which determines which player
has played the highest card and should receive the prize card as a result. In case of a tie, the prize is split
evenly among the winners. After 4 turns, all the prizes have been dealt out and the game terminates. The
payoff of each player is computed as the sum of the values of the cards they won.
The payoff matrix for Goofspiel has dimension 72 × 7,808 and 562,176 nonzeros.
Random instances. We consider random matrices of size (10, 15). The coefficients of the matrices
are normally distributed with mean 0 and variance 1 and are sampled using the numpy.random.normal
function from the numpy Python package [75]. In all figures, we average the last-iterate duality gaps over
the 25 random matrix instances, and we also show the confidence intervals.
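For reference, one such instance can be drawn as follows (the seed is our addition, only to make the sketch reproducible):

import numpy as np

np.random.seed(0)  # seed added for reproducibility of this sketch; not specified in the text
A = np.random.normal(loc=0.0, scale=1.0, size=(10, 15))  # one random 10 x 15 payoff matrix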
E.16.2 Experiments with Restarting
In this section, we compare the last-iterate convergence performances of ExRM+, SPRM+ and their restarting variants RS-ExRM+ and RS-SPRM+ as introduced in Section 6.7.3 and Section 6.8. We present our
results when choosing a stepsize η = 0.05 in Figure E.5.
[Figure E.5: For a stepsize η = 0.05, empirical performances of ExRM+, Smooth PRM+, RS-ExRM+, and RS-Smooth PRM+ on our 3 × 3 matrix game (left plot), Kuhn Poker and Goofspiel (center plots), and on random instances (right plot); each panel plots the duality gap against the number of iterations.]
Abstract
No-regret learning (or online learning) is a general framework for studying sequential decision-making. Within this framework, the learner iteratively makes decisions, receives feedback, and adjusts their strategies. In this thesis, we consider analyzing the learning dynamics of no-regret algorithms in game scenarios where players play a single game repeatedly with particular no-regret algorithms. This exploration not only raises fundamental questions at the intersection of machine learning and game theory but also stands as a vital element when developing recent breakthroughs in artificial intelligence.
A notable instance of this influence is the widespread adoption of the “self-play” concept in game AI development, exemplified in games such as Go and Poker. With this technique, AI agents learn how to play by competing against themselves to enhance their performance step by step. In the terminology of literature focused on learning in games, the method involves running a set of online learning algorithms for players in the game to compute and approximate their game equilibria. To learn more efficiently in games, it is critical to design better online learning algorithms. Standard notions evaluating online learning algorithms in games include “regret,” assessing the average quality of iterates, and “last-iterate convergence,” representing the quality of the final iterates.
In this thesis, we design online learning algorithms and prove that they achieve near-optimal regret or fast last-iterate convergence in various game settings. We start from the simplest two-player zero-sum normal-form games and extend the results to multi-player games, extensive-form games that capture sequential interaction and imperfect information, and finally, the most general convex games. Moreover, we also analyze the weaknesses of prevalent online learning algorithms widely employed in practice and propose a fix for them. This not only makes the algorithms more robust but also sheds light on getting better learning algorithms for artificial intelligence in the future.