ROBUST AND ADAPTIVE ONLINE DECISION MAKING

by Chen-Yu Wei

A Dissertation Presented to the FACULTY OF THE USC GRADUATE SCHOOL, UNIVERSITY OF SOUTHERN CALIFORNIA, in Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (COMPUTER SCIENCE)

August 2022

Copyright 2022 Chen-Yu Wei

Acknowledgements

The accomplishment of my doctoral degree would not have been possible without the help of many people.

First, I want to thank Haipeng for taking me as his PhD student despite my unattractive application profile. Being his first student, I had the privilege of working with him very closely, and I was so impressed by his research taste and his agile mind that I tried to develop the same qualities in myself. Apart from technical influences, I want to thank him for spending a lot of time with his students, always responding quickly to their questions and needs, and making efforts to build their connections. He will always be my role model in my future career as a researcher and an advisor.

I would like to thank Chi-Jen Lu and Yi-Te Hong, my supervisor and colleague during my stay at Academia Sinica before my PhD. They guided me from nowhere to the research field I am currently in, and introduced me to projects that still heavily influence my current research. Thanks to Yi-Te for teaching me the basics of online learning and brainstorming with me all day long.

I had three rewarding summer internships during my PhD. I thank Alina Beygelzimer for hosting me at Yahoo Research New York in summer 2018. Alina introduced me to a very interesting and challenging bandit problem. I enjoyed the weekly brainstorming with Alina, Dávid Pál, Balázs Szörényi, Devanathan Thiruvenkatachari, and Chicheng Zhang on the problem, which greatly broadened my horizons. Then, I thank Alekh Agarwal and John Langford for hosting me at Microsoft Research Redmond in summer 2019. Their persistence in designing practical algorithms has shaped my research interest. Finally, I thank Chris Dann and Julian Zimmert for hosting me at Google Research in summer 2021. I enjoyed the insightful weekly discussions, and I am glad that the collaboration is still ongoing.

A lot of memorable stories happened in our little lab space in Powell Hall at USC. Thanks to my long- or short-term labmates: He Jiang, Karishma Sharma, and Sulagna Mukherjee for entertaining chats and fun time with puzzle toys; Kai Zheng for hosting the birthday party and the trip along Highway One; Yifang Chen for organizing hotpots and being our driver; Chung-Wei Lee and Mengxiao Zhang for providing endless cookies and milk teas and great discussions; Liyu Chen and Tiancheng Jin for many deep discussions on MDPs; and Hikaru Ibayashi for sharing music, magic, and deep learning. I will definitely miss the time with them a lot.

I would like to thank Rahul Jain, David Kempe, and Jiapeng Zhang for serving on my thesis committee, and Vatsal Sharan, Yan Liu, Shanghua Teng, and Shaddin Dughmi for giving me useful suggestions on my qualifying exam and thesis proposal. Thanks to Haipeng Luo, Alekh Agarwal, Alina Beygelzimer, Chris Dann, Rahul Jain, and Csaba Szepesvari for generous help on my job search materials; to Yang Cai, Constantinos Daskalakis, Rahul Jain, Haipeng Luo, Vidya Muthukumar, Ioannis Panageas, Vijay Subramanian, and Manolis Vlatakis for helpful suggestions on my job talk; and to Vatsal Sharan for detailed feedback on my research statement.
I also thank Hsu Kao and Melody Kao for giving me guidance and polishing my materials for PhD applications, fellowship applications, and the job search. Thanks to other collaborators not mentioned above: James Preiss, Sébastien Arnold, Sébastien Bubeck, Yuanzhi Li, Mehdi Jafarnia-Jahromi, Hiteshi Sharma, Ehsan Emamjomeh-Zadeh, Xiaojin Zhang, Dongsheng Ding, Kaiqing Zhang, Alberto Bietti, Steven Wu, and Miro Dudík. From these discussions I grew a lot, and the results were fruitful.

Over the course of my PhD, I have developed a new interest – playing jazz. I would like to express my gratitude to David Arnay for being an inspiring and patient jazz piano teacher. The world he introduced to me is so exciting that I am going to spend a lot of time exploring it. Thanks also to Aaron Serfaty and John Thomas for super fun courses.

I thank Kai Zheng, Han-Jia Ye, Chun-Heng Huang, Denny Huang, Yi-Ting Chang, Ronan Hsieh, Hsu Kao, Melody Kao, Han Wang, and Chung-Wei Lee for planning wonderful road trips and inviting me to join. Those trips constitute the most beautiful memories of my PhD. Thanks to Karen, David, my roommates, and my neighbors for forming the warm Taiwanese community that acted like a family during my five-year stay in the US. Thanks to Hsu Kao for being someone I can always chat with without pressure, and for giving me useful suggestions when I make important decisions.

Finally, thanks to my family for always allowing me to freely dive into my interests and pursue my dreams. Especially, I thank them for their bittersweet support of my decisions to study and work in the US, which means we have to be apart most of the time.

Table of Contents

Acknowledgements
List of Tables
List of Figures
Abstract

Chapter 1: Introduction
  1.1 The expert problem and multi-armed bandits
  1.2 Online linear optimization and linear bandits
  1.3 Contextual bandits
  1.4 Markov decision processes
  1.5 Online decision making in dynamic worlds
  1.6 Outline of the thesis

Chapter 2: Adaptive Algorithms for Experts and Online Linear Optimization
  2.1 Overview: the impossible-tuning problem
  2.2 Impossible tuning made possible
  2.3 A new master algorithm
  2.4 Applications to the expert problem
  2.5 Applications to online linear optimization
  2.6 Open problems

Chapter 3: Adaptive Algorithms for Bandits
  3.1 Overview: bandits with predictions, and path-length bounds
  3.2 Path-length bounds for multi-armed bandits I (V_1)
  3.3 Path-length bounds for multi-armed bandits II (V_⋆)
  3.4 Path-length bounds for multi-armed bandits III (V_∞)
  3.5 Path-length bounds for linear bandits
  3.6 Contextual bandits with loss predictors
    3.6.1 Algorithms for adversarial environments
    3.6.2 Algorithms for stochastic environments
  3.7 Open problems

Chapter 4: Robust Algorithms against an Adversary
  4.1 Overview: best of both worlds, and learning under corruption
  4.2 Multi-armed bandits with best-of-all-worlds guarantees
  4.3 Linear bandits with best-of-all-world guarantees
  4.4 Robust algorithms for general decision making
    4.4.1 A model selection framework
    4.4.2 Gap-independent bounds
    4.4.3 Gap-dependent bounds
  4.5 Open problems

Chapter 5: Learning in Adversarial MDPs with a Large State Space
  5.1 Overview: adversarial MDPs with linear function approximation
  5.2 Dilated exploration bonuses
  5.3 The linear-Q case
  5.4 The linear MDP case
  5.5 Open problems

Chapter 6: Learning in Non-Stationary Environments
  6.1 Overview: parameter-free algorithms for non-stationary environments
  6.2 A reduction framework
  6.3 Multi-scale scheduling
  6.4 Equipping multi-scale scheduling with tests and restarts
  6.5 Open problems

References

Appendix A: Omitted Details in Chapter 2
  A.1 Useful lemmas for optimistic online mirror descent
  A.2 Omitted details in Section 2.2
  A.3 Omitted details in Section 2.3
  A.4 Omitted details in Section 2.4
    A.4.1 Impossible results for interval regret
  A.5 Omitted details in Section 2.5
    A.5.1 Combining Online Newton Step
    A.5.2 Combining Gradient Descent
    A.5.3 Combining AdaGrad
    A.5.4 Combining MetaGrad's base algorithm

Appendix B: Omitted Details in Chapter 3
  B.1 Lemmas for log-barrier OMD
  B.2 Omitted details in Section 3.2
  B.3 Omitted details in Section 3.3
  B.4 Omitted details in Section 3.4
  B.5 Omitted details in Section 3.6
    B.5.1 Concentration inequalities
    B.5.2 Proof of Theorem 17
    B.5.3 Proof of Theorem 18
    B.5.4 Proofs of Lemma 4 and Theorems 19 and 20

Appendix C: Omitted Details in Chapter 4
  C.1 Omitted details in Section 4.2
  C.2 Omitted details in Section 4.3
    C.2.1 Auxiliary lemmas
    C.2.2 Proof of Theorem 23
    C.2.3 Proof of Theorem 22
    C.2.4 Adversarial linear bandit algorithms with high-probability bounds
  C.3 Omitted details in Section 4.4
    C.3.1 Concentration inequalities
    C.3.2 Omitted proofs in Section 4.4.1
    C.3.3 Omitted proofs in Section 4.4.2
    C.3.4 Omitted proofs in Section 4.4.3
    C.3.5 The implementation of the leave-one-policy-out MDP

Appendix D: Omitted Details in Chapter 5
  D.1 Auxiliary lemmas
  D.2 Analysis for auxiliary procedures
    D.2.1 The guarantee of GeometricResampling
    D.2.2 The guarantee of PolicyCover
  D.3 Omitted details in Section 5.2
  D.4 Omitted details in Section 5.3
  D.5 Omitted details in Section 5.4

Appendix E: Omitted Details in Chapter 6
  E.1 Omitted details in Section 6.3
  E.2 Omitted details in Section 6.4
    E.2.1 Single-block regret analysis I
    E.2.2 Single-block regret analysis II
    E.2.3 Single-epoch regret analysis
    E.2.4 Proof of Theorem 28

List of Tables

1.1 A mapping from chapters/sections to papers
2.1 Summary of main results. w_t ∈ ℝ^d is the decision of the learner, ℓ_t is the loss vector, m_t is a prediction for ℓ_t, L_T = ∑_{t=1}^T (ℓ_t − m_t)(ℓ_t − m_t)^⊤, and r is the rank of L_T.
4.1 ∗ indicates computationally inefficient algorithms. G is the GapComplexity defined in (Simchowitz and Jamieson, 2019); ∆ is the gap between the expected reward of the best and second-best policy. It holds that G ≤ 1/∆. C_a = ∑_t c_t and C_r = √(T ∑_t c_t²), where c_t is the amount of corruption in round t. By definition, C_a ≤ C_r ≤ min{√(C_a T), T max_t c_t}. C_a is the standard notion of corruption in the literature. †: the bound reported in (Jin et al., 2021d) is min{G + √(G C_a), √T} under a different definition of regret. ♯: linearized corruption restricts the corruption on action a to equal c^⊤ a for some vector c shared among all actions.
6.1 A summary of our results and comparisons with the state of the art. Our algorithms are named in the form "MASTER + X", where X is the base algorithm used in our reduction. Here, Reg⋆_L = √(LT) and Reg⋆_∆ = ∆^{1/3} T^{2/3} + √T, where T is the number of rounds and L and ∆ are the number and amount of changes of the world, respectively. (Dependence on other parameters is omitted.)

List of Figures

4.1 Optimization Problem (OP)
4.2 Catoni's Estimator
6.1 An illustration of how we detect non-stationarity via multiple instances of ALG
6.2 An illustration of MALG with n = 4 (see detailed explanation in Section 6.3)

Abstract

Online learning (or online decision making) is a learning paradigm that involves real-time interactions between the learner and the environment. It can be used to model recommendation systems, marketing, advertising, etc. In online learning, the learner has to make instantaneous decisions based on past data and observations in order to predict future outcomes, obtain high reward, or acquire new and informative data. This is more challenging than the traditional machine learning framework, where the data is pre-collected and the learner has access to all of it in advance. Because the learner's decisions are involved in the data collection process, an important question is how to efficiently explore the world and find the best policy against it. Past theoretical research has developed algorithms that perform strategic exploration and achieve near-optimal performance in the worst-case environment. However, these algorithms designed for the worst case are often too pessimistic and do not exploit possible benign properties of the environment. In this thesis, we develop algorithms whose performance adapts to the easiness of the environment, thus saving training time or the number of required samples.

Since online learning is interactive, an adversary may exploit the learner's algorithm, corrupt the data, and make the learner fail to learn a good policy. If an algorithm fails under only a small amount of corruption, then it may be too unsafe to deploy in practice. In this thesis, we aim to make our algorithms minimally affected by data corruption, and we design robust algorithms whose performance scales optimally with the amount of corruption.

With adaptivity and robustness, an online learning algorithm can be used more efficiently and more safely in a wide spectrum of environments. We hope that the algorithmic techniques and insights developed in this thesis will be useful in improving existing algorithms for real applications.
Chapter 1: Introduction

In the past decade, machine learning has undergone astonishing progress — supervised learning, by far the most successful machine learning paradigm, enables machines to pick up intellectual tasks such as object detection, reading comprehension, speech recognition, etc. However, these successful examples rely heavily on access to a huge amount of training data and massive computational power. Therefore, a similar revolution has not occurred in data-hungry applications such as medicine and robotics. Furthermore, these successful machines usually specialize in a single task and lack the capability to generalize across tasks.

Comparing this with how humans learn, we can see that the learning paradigm of these machines is quite different. Humans, especially at the infant stage, do not learn from a huge amount of pre-collected data. Instead, they collect a few data samples from the world, internalize them, and then collect more new samples based on what they have learned. With such incremental and strategic exploration of the world and an internal model updated on the fly, humans are able to learn about the world with relatively few samples. Therefore, how to make machines learn as efficiently as humans remains an important open question on the way towards building more powerful artificial intelligence. This thesis aims to answer some fundamental questions in this interactive and data-hungry learning paradigm.

Let us take a deeper look at how an infant (or a learner) interacts with the world. An interaction involves the learner observing something from the surroundings (e.g., seeing some interesting object), taking an action (e.g., touching the object), and receiving reward or more observations (e.g., feeling hurt or seeing the movement of the object). Notice that this "observation → action → observation → action → ···" sequence might have long-term dependencies, i.e., an observation may depend on the previous ten observations and actions. Because the learner changes its strategy over time, the distribution of the observation sequence also changes accordingly, resulting in a non-i.i.d. observation sequence. This is in contrast with traditional supervised learning, where samples are assumed to be generated i.i.d. Such a more interactive learning paradigm is called online decision making or online reinforcement learning.

A core problem in online decision making is how to perform efficient exploration. During learning, the learner is faced with uncertainty (e.g., uncertainty about the optimal action), and must search by trial and error in order to reduce that uncertainty (e.g., figure out the optimal action). How should the learner choose actions in each round of interaction to quickly reduce uncertainty and identify the best action? This core problem can also be stated as the exploitation-exploration tradeoff: how to balance gaining new information (exploration) against maintaining good performance (exploitation)?

The first main theme of this thesis is to design online decision making algorithms that achieve near-optimal exploration efficiency. We not only target worst-case optimality, but also pursue instance optimality — if the environment is more benign, our algorithms can learn with far fewer samples. In most cases, our algorithms are parameter-free, i.e., they automatically achieve instance optimality without prior knowledge of the hardness of the environment.
This demonstrates the adaptivity of our algorithms.

The second main theme is to design algorithms that can tolerate adversarial manipulations and environment changes. When the learner interacts with the environment, there might be external or adversarial perturbations, so that the feedback to the learner does not faithfully reflect what happens in the real world. For example, a robot may suffer from sensor errors. Besides adversarial manipulations, there could also be non-stationarity due to changes in the world. For example, a robot may need to walk over different terrains during learning. Our goal is to design algorithms that are robust against these external factors.

In the rest of this chapter, we formalize the online decision making settings we consider in this thesis and state our contributions. Before that, we first introduce some basic notation.

Notation. ∆_d denotes the (d−1)-dimensional simplex; e_i, 0, 1 ∈ ℝ^d are respectively the i-th standard basis vector, the all-zero vector, and the all-one vector; [n] denotes the set {1, …, n}; KL(·,·) denotes the KL divergence; ∥u∥_A = √(u^⊤ A u) is the quadratic norm with respect to a matrix A; D_ψ(u, w) = ψ(u) − ψ(w) − ⟨∇ψ(w), u − w⟩ is the Bregman divergence of u and w with respect to a convex function ψ; and Õ(·) hides logarithmic dependence on T.

1.1 The expert problem and multi-armed bandits

One of the most basic formulations of online decision making is learning from experts, or simply the expert problem. In the expert problem, the learner faces a task (e.g., weather prediction) and is given a set of experts that can provide suggestions (e.g., experts that make weather predictions using different methods). The learner needs to dynamically select or combine the suggestions of the experts, with the goal of making her performance almost as good as that of the best expert. This procedure is formalized in Protocol 1. The set of experts is denoted by A. In each round t, each expert i ∈ A incurs a loss ℓ_{t,i} ∈ [−1, 1] that quantifies how badly expert i performs in round t. {ℓ_{t,i}}_i are hidden from the learner at the beginning of round t. Then the learner chooses an expert i_t ∈ A and suffers the loss of expert i_t (i.e., ℓ_{t,i_t}). At the end of round t, the learner observes the losses of all experts {ℓ_{t,i}}_{i∈A}.

Protocol 1: The Expert Problem
  1  Given: a finite set of experts A.
  2  for t = 1, 2, …, T do
  3      Environment determines the losses of experts ℓ_t ∈ [−1, 1]^A.
  4      Learner chooses an expert i_t ∈ A.
  5      Learner suffers loss ℓ_{t,i_t}.
  6      Learner observes ℓ_t.

The environment model we consider here is an adaptive adversary, meaning that the losses of the experts in round t can be arbitrary and may depend on all the history (including the losses and the learner's decisions) up to round t − 1.

The performance of the learner is measured through the regret, defined as the difference between the cumulative loss of the learner and that of the best single expert:

    Reg := ∑_{t=1}^T ℓ_{t,i_t} − min_{i∈A} ∑_{t=1}^T ℓ_{t,i}.    (1.1)

A regret that grows sublinearly in T implies that the average loss of the learner exceeds that of the best expert by an amount that vanishes over time, which is what the learner hopes to achieve. The minimax regret bound is Θ(√(T ln|A|)), which is achieved by the Hedge algorithm (Freund and Schapire, 1997).

In some applications, the learner can only observe the loss of the chosen expert. In this case, the expert problem becomes the multi-armed bandit problem (Protocol 2).
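Before turning to bandit feedback, the following is a minimal simulation sketch of the Hedge update just mentioned, under Protocol 1 with full-information feedback. The fixed tuning η = √(ln d / T), the random loss matrix, and the log-space implementation are illustrative assumptions, not details taken from the thesis.

```python
import numpy as np

def hedge(losses, eta):
    """Hedge (exponential weights) on a T x d loss matrix with entries in [-1, 1].

    Returns the learner's total expected loss and the regret against the best
    single expert, as in (1.1).
    """
    T, d = losses.shape
    log_w = np.zeros(d)                 # keep weights in log-space for stability
    learner_loss = 0.0
    for t in range(T):
        p = np.exp(log_w - log_w.max())
        p /= p.sum()                    # distribution over experts in round t
        learner_loss += p @ losses[t]   # expected loss of sampling i_t ~ p
        log_w -= eta * losses[t]        # multiplicative-weight update
    best_expert_loss = losses.sum(axis=0).min()
    return learner_loss, learner_loss - best_expert_loss

rng = np.random.default_rng(0)
T, d = 10_000, 10
losses = rng.uniform(-1, 1, size=(T, d))
print(hedge(losses, eta=np.sqrt(np.log(d) / T)))   # fixed tuning for known horizon
```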
In multi-armed bandits, we usually call the experts actions or arms instead. The regret for multi-armed bandits is also defined as (1.1). The minimax regret bound is Θ(√(|A|T)). A near-optimal regret bound of O(√(|A|T ln|A|)) was first achieved by the Exp3 algorithm (Auer et al., 2002b), and then tightened to O(√(|A|T)) by the Poly-INF algorithm (Audibert and Bubeck, 2009).

Protocol 2: Multi-Armed Bandit
  1  Given: a finite set of arms A.
  2  for t = 1, 2, …, T do
  3      Environment determines the losses of arms ℓ_t ∈ [−1, 1]^A.
  4      Learner chooses an arm i_t ∈ A.
  5      Learner suffers loss ℓ_{t,i_t}.
  6      Learner observes ℓ_{t,i_t}.

Protocol 3: Stochastic Multi-Armed Bandit
  1  Given: a finite set of arms A.
  2  Environment determines the losses of arms ℓ ∈ [−1, 1]^A.
  3  for t = 1, 2, …, T do
  4      Learner chooses an arm i_t ∈ A.
  5      Learner suffers loss ℓ_{i_t}.
  6      Learner observes ℓ_{i_t} + ε_t with E[ε_t] = 0.

One application of multi-armed bandits is the recommendation system, where actions are potential items to be recommended to incoming users; ℓ_{t,i} is 0 if the user buys item i at round t, and 1 otherwise. The recommendation system can only observe the loss corresponding to the item that it recommends, but not the others, so Protocol 2 would be a better model than Protocol 1 in this case.∗

∗ While this seems overly simplified for true recommendation systems, in Section 1.2 and Section 1.3 we will see how to generalize multi-armed bandits and make them more realistic.

Stochastic multi-armed bandits. A special case of the multi-armed bandit is when the losses are fixed over time (i.e., ℓ_{t,i} = ℓ_i for all t). In this case, the common assumption is that the learner only observes a noisy version of the loss. This is called the stochastic multi-armed bandit problem, whose protocol is presented in Protocol 3. The regret is defined the same as in (1.1), with ℓ_t replaced by ℓ. For stochastic multi-armed bandits, it is possible to achieve a regret that is logarithmic in T (cf. for general multi-armed bandits, the lower bound is Ω(√(|A|T))). Specifically, define ∆_i = ℓ_i − min_{i′∈A} ℓ_{i′} for all i ∈ A, which is the sub-optimality gap of arm i. Then the classic UCB algorithm (Auer et al., 2002a) achieves a regret bound of O(∑_{i: ∆_i > 0} lnT / ∆_i), which is tight (up to constants) in terms of the dependence on the sub-optimality gaps.

1.2 Online linear optimization and linear bandits

In the expert problem or multi-armed bandit introduced in Section 1.1, we do not assume any relation among the losses of the actions. This makes learning hard when there is a large number of actions. In real-world problems, however, actions are often correlated, and there exists some low-dimensional structure among actions that the learner can exploit. One simple way to capture this correlation is to assume that each action can be encoded as a vector w ∈ K ⊂ ℝ^d, and the loss of this action at round t can be represented as ⟨w, ℓ_t⟩ for some ℓ_t ∈ ℝ^d shared among all actions. With this structural assumption, we can potentially even deal with an infinite number of actions. Considering the linear structure, Protocol 1 and Protocol 2 can be generalized to Protocol 4 and Protocol 5, which are referred to as the online linear optimization problem and the linear bandit problem, respectively.

Protocol 4: Online Linear Optimization
  1  Given: a set of actions K ⊂ ℝ^d.
  2  for t = 1, 2, …, T do
  3      Environment decides a loss vector ℓ_t ∈ ℝ^d.
  4      Learner chooses an action w_t ∈ K.
  5      Learner suffers loss ⟨w_t, ℓ_t⟩.
  6      Learner observes ℓ_t.
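Before moving on to the linear protocols, the following sketch illustrates the optimism principle behind UCB in the stochastic multi-armed bandit setting just described (Protocol 3). Since the thesis works with losses, the rule picks the arm with the smallest lower confidence bound; the confidence width, Gaussian noise, and constants are textbook-style assumptions rather than the exact choices of Auer et al. (2002a).

```python
import numpy as np

def ucb_for_losses(mean_losses, T, rng):
    """Optimistic index policy for stochastic bandits with losses in [-1, 1].

    Written for costs: pick the arm whose empirical mean minus a confidence
    width (its lowest plausible loss) is smallest. Returns the pseudo-regret.
    """
    d = len(mean_losses)
    counts = np.zeros(d)
    sums = np.zeros(d)
    cumulative = 0.0
    for t in range(T):
        if t < d:
            i = t                                  # pull every arm once
        else:
            means = sums / counts
            width = np.sqrt(2.0 * np.log(t + 1) / counts)
            i = int(np.argmin(means - width))      # optimism in face of uncertainty
        observed = mean_losses[i] + rng.normal(scale=0.1)   # noisy feedback
        counts[i] += 1
        sums[i] += observed
        cumulative += mean_losses[i]
    return cumulative - T * mean_losses.min()

rng = np.random.default_rng(1)
print(ucb_for_losses(np.array([0.1, 0.2, 0.5]), T=20_000, rng=rng))
```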
One can verify that Protocol 1 and Protocol 2 are special cases of Protocol 4 and Protocol 5 (respectively) with K = {e_1, …, e_d}. The regret for online linear optimization or linear bandits is defined as follows:

    Reg := ∑_{t=1}^T ⟨w_t, ℓ_t⟩ − min_{w∈K} ∑_{t=1}^T ⟨w, ℓ_t⟩.    (1.2)

Protocol 5: Linear Bandit
  1  Given: a set of actions K ⊂ ℝ^d.
  2  for t = 1, 2, …, T do
  3      Environment determines a loss vector ℓ_t ∈ ℝ^d.
  4      Learner chooses an action w_t ∈ K.
  5      Learner suffers loss ⟨w_t, ℓ_t⟩.
  6      Learner observes ⟨w_t, ℓ_t⟩.

Protocol 6: Stochastic Linear Bandit
  1  Given: a set of actions K ⊂ ℝ^d.
  2  Environment determines a loss vector ℓ ∈ ℝ^d.
  3  for t = 1, 2, …, T do
  4      Learner chooses an action w_t ∈ K.
  5      Learner suffers loss ⟨w_t, ℓ⟩.
  6      Learner observes ⟨w_t, ℓ⟩ + ε_t with E[ε_t] = 0.

For online linear optimization, the tight regret bound Θ(√(dT)) is achievable through online gradient descent (Zinkevich, 2003). For linear bandits, the algorithm of Abernethy et al. (2008) achieves a bound of Õ(√(d³T)), which was later improved by Audibert et al. (2011) to Õ(√(d · min{d, ln|K|} · T)), matching the lower bound up to logarithmic factors (Dani et al., 2007).

Stochastic linear bandits. Similarly to stochastic multi-armed bandits, we also have the stochastic linear bandit problem, where ℓ_t = ℓ for all t and the learner only observes ⟨w_t, ℓ⟩ + ε_t for some zero-mean noise ε_t at round t. The protocol is given in Protocol 6. The regret is defined the same as (1.2), with ℓ_t replaced by ℓ. Similar to the case of multi-armed bandits, it is possible to achieve O(lnT) regret for stochastic linear bandits (hiding other dependencies). The tight regret bound is derived in Lattimore and Szepesvari (2017).

1.3 Contextual bandits

Consider the example at the end of Section 1.1 about recommendation systems. A drawback of using multi-armed bandits to model recommendation systems with regret defined as (1.1) is that the recommendation system will only try to find a single action that has the best average performance over all rounds. This corresponds to a single item that is bought by most users. This kind of recommendation is contextless, i.e., it ignores each user's preferences, income, the degree of match between the item and the user, etc. To take such side information into consideration, the model of contextual bandits is introduced (Protocol 7). In each round, the environment first generates a context x_t and reveals it to the learner. The learner, again faced with a finite set of actions, chooses an action a_t based on the observed context, and then receives the loss of the chosen action, ℓ_{t,a_t}.

Protocol 7: Contextual Bandit
  1  Given: a finite set of actions A.
  2  for t = 1, 2, …, T do
  3      Environment generates a context x_t and decides the losses of actions ℓ_t ∈ [−1, 1]^A.
  4      Learner observes x_t and decides an action a_t ∈ A.
  5      Learner suffers loss ℓ_{t,a_t}.
  6      Learner observes ℓ_{t,a_t}.

The learning process is associated with a policy set Π, which contains mappings from contexts to actions. The learner's goal is to perform almost as well as the best policy in Π. Formally, the regret for contextual bandits is defined as follows:

    Reg := ∑_{t=1}^T ℓ_{t,a_t} − min_{π∈Π} ∑_{t=1}^T ℓ_{t,π(x_t)},    (1.3)

where ℓ_{t,π(x_t)} is the loss of policy π in round t. The best-known algorithm for contextual bandits (and probably the only existing algorithm with a near-optimal regret bound) is the Exp4 algorithm (Auer et al., 2002b), which achieves the near-optimal regret bound of O(√(T|A| ln|Π|)).
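The following is a minimal sketch of the Exp4 idea just mentioned: Hedge over policies combined with importance-weighted loss estimates. Representing policies as plain functions, rescaling losses to [0, 1], and the uniform-exploration mix γ are illustrative assumptions, not the exact algorithm of Auer et al. (2002b).

```python
import numpy as np

def exp4(policies, contexts, losses, eta, gamma, rng):
    """Sketch of Exp4 for Protocol 7.

    policies: list of functions mapping a context to an action index.
    contexts: length-T list; losses: T x |A| array with entries in [0, 1].
    """
    n_pol, n_act = len(policies), losses.shape[1]
    log_q = np.zeros(n_pol)                 # weights over policies
    total_loss = 0.0
    for t, x in enumerate(contexts):
        q = np.exp(log_q - log_q.max()); q /= q.sum()
        recs = np.array([pi(x) for pi in policies])          # each policy's action
        p = np.bincount(recs, weights=q, minlength=n_act)    # induced action dist.
        p = (1 - gamma) * p + gamma / n_act                  # uniform exploration
        a = rng.choice(n_act, p=p)
        total_loss += losses[t, a]
        est = losses[t, a] / p[a]               # importance-weighted loss estimate
        log_q -= eta * est * (recs == a)        # only policies that picked a are charged
    return total_loss
```

Note that every round loops over all policies, so the per-round computation is linear in |Π| even though the regret scales only with ln|Π| — exactly the inefficiency discussed next.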
Although the dependence on |Π| in the regret is logarithmic, the computational complexity of Exp4 is linear in |Π|, which could be very high in many practical scenarios. Unfortunately, in order to achieve any sub-linear regret, a computational complexity of Ω(√|Π|) is necessary, as proven by Hazan and Koren (2016). To sidestep this difficulty, two variants of contextual bandits with extra assumptions have been developed, as introduced below.

Stochastic contextual bandits. In stochastic contextual bandits, the context-loss pair (x_t, ℓ_t) is drawn from an unknown fixed distribution D. Stochastic contextual bandits are studied by Langford and Zhang (2008) and Agarwal et al. (2014a), who provide oracle-efficient algorithms that reduce a contextual bandit problem to a cost-sensitive classification problem. Specifically, they assume that there is an oracle that can efficiently solve the minimization problem min_{π∈Π} ∑_{t=1}^T c_{t,π(x_t)} for any cost sequence {c_t}_{t=1}^T, and their algorithms only need poly(T) calls to such an oracle (with no poly(|Π|) dependence). Dudik et al. (2011) and Agarwal et al. (2014a) develop oracle-efficient algorithms that achieve the near-optimal regret bound O(√(T|A| ln|Π|)).

Realizable contextual bandits. In the general formulation of contextual bandits described above, there is no explicit connection among context, action, and loss; their connection is implicitly built through the policy class Π. While this gives a general formulation, it is challenging to design a practical algorithm under this framework using existing supervised learning techniques.† Therefore, the framework of contextual bandits with realizability was developed, which explicitly assumes a relation among context, action, and loss, and reduces contextual bandits to a regression problem. In this formulation, the learner is given a function class F, which contains mappings from (context, action) pairs to a number in [−1, 1]. It is assumed that there exists f⋆ ∈ F such that f⋆(x, a) gives the expected loss when the learner takes action a on a given context x. For this function class F, we can define the induced policy class Π_F: for each function f ∈ F, the corresponding induced policy π_f ∈ Π_F is defined as π_f(x) = argmin_{a∈A} f(x, a). For this formulation of contextual bandits, we can still define the regret as (1.3) using Π_F as the policy set. By the relation among context, action, and loss described above, we see that

    E[Reg] = E[ ∑_{t=1}^T f⋆(x_t, a_t) − ∑_{t=1}^T min_{a∈A} f⋆(x_t, a) ].    (1.4)

† Though the cost-sensitive classification problem used in stochastic contextual bandits is indeed a supervised learning problem, there is still difficulty in implementing it in practice. See Foster and Rakhlin (2020).

For the class of linear functions (linear contextual bandits), near-optimal algorithms with regret bound Õ(d√(T ln|A|)) are given by Abbasi-Yadkori et al. (2011). For general function classes, a near-optimal algorithm with an efficient implementation was recently given by Foster and Rakhlin (2020).

1.4 Markov decision processes

In decision making problems such as the game of Go, robot movement, and investment in stock markets, the actual consequence of an action is not completely reflected in an instantaneous loss feedback. Instead, the loss is a consequence of a sequence of actions. In these applications, the "bandit" model discussed in previous sections might not be the best model.
To formulate such a sequential decision making problem, one commonly used model is the Markov decision process (MDP), where states are introduced to capture the intermediate status of a sequence of actions. Throughout the thesis, we focus only on a special class of MDPs — the finite-horizon MDP. In finite-horizon MDPs, the consequence of actions can last for at most H steps, where H is the horizon length. The protocol is described in Protocol 8. In each episode t, the environment first determines the initial distribution ρ_t, the transition kernels p_t = {p_{t,h}}_{h=1}^H, and the loss functions ℓ_t = {ℓ_{t,h}}_{h=1}^H for that episode. Then the learner interacts with the environment for H steps in the following way. The environment first draws s_{t,1} ∈ S from ρ_t. In each step h, the learner first observes the current state s_{t,h}. Based on s_{t,h}, the learner chooses an action a_{t,h}. Then the learner observes ℓ_{t,h}(s_{t,h}, a_{t,h}) and transitions to the next state according to s_{t,h+1} ∼ p_{t,h}(·|s_{t,h}, a_{t,h}).

Protocol 8: Finite-Horizon Markov Decision Process
  1  Given: a set of states S and actions A.
  2  for t = 1, 2, …, T do
  3      The environment decides ρ_t ∈ ∆_S, p_{t,h}: S×A → ∆_S and ℓ_{t,h}: S×A → [−1, 1] for all h.
  4      Environment generates s_{t,1} ∼ ρ_t.
  5      for h = 1, …, H do
  6          Learner observes s_{t,h} and decides an action a_{t,h} ∈ A.
  7          Learner suffers loss ℓ_{t,h}(s_{t,h}, a_{t,h}).
  8          Learner observes ℓ_{t,h}(s_{t,h}, a_{t,h}).
  9          Environment generates the next state s_{t,h+1} ∼ p_{t,h}(·|s_{t,h}, a_{t,h}).

A policy π for a finite-horizon MDP is defined as a mapping from S × {1, 2, …, H} to ∆_A, with π_h(·|s) denoting the distribution over actions the policy uses in step h when visiting state s. The value of a policy starting from (state, step) = (s, h) under transition p = {p_k}_{k=1}^H and loss ℓ = {ℓ_k}_{k=1}^H is defined as

    V^π_h(s; p, ℓ) := E[ ∑_{k=h}^H ℓ_k(s_k, a_k) | s_h = s, a_k ∼ π_k(·|s_k), s_{k+1} ∼ p_k(·|s_k, a_k), ∀k = h, …, H ].

V^π_h(s; p, ℓ) is also called the "cost-to-go" function, because it is the cumulative cost suffered by the player if it starts from state s in step h and follows policy π. The regret is defined as the difference between the learner's expected value and that of the best policy:

    Reg := ∑_{t=1}^T V^{π_t}(s_{t,1}; p_t, ℓ_t) − min_π ∑_{t=1}^T V^π(s_{t,1}; p_t, ℓ_t),    (1.5)

where π_t is the policy used by the learner in episode t.

For MDPs, another useful notion is the state-action value function:

    Q^π_h(s, a; p, ℓ) := E[ ∑_{k=h}^H ℓ_k(s_k, a_k) | (s_h, a_h) = (s, a), a_k ∼ π_k(·|s_k), s_k ∼ p_{k−1}(·|s_{k−1}, a_{k−1}), ∀k = h+1, …, H ].

Q^π_h(s, a; p, ℓ) is the cumulative cost the player suffers if it starts from state s in step h, first chooses action a, and then follows policy π from step h + 1.

Linear-structured MDPs. Besides considering MDPs with a finite number of states and actions, in this thesis we also consider MDPs with a potentially infinite number of states and actions but with linear structure. We consider two types of linear-structured MDPs, characterized as follows:

• Linear MDPs: An MDP with transition p and loss ℓ is called a linear MDP if for all h ∈ [H], there exists a known feature mapping ϕ_h(·,·): S×A → ℝ^d such that p_h(s′|s, a) = ϕ_h(s, a)^⊤ µ_h(s′) and ℓ_h(s, a) = ϕ_h(s, a)^⊤ θ_h for some unknown µ_h(·): S → ℝ^d and θ_h ∈ ℝ^d.

• Linear-Q MDPs: An MDP with transition p and loss ℓ is called a linear-Q MDP if for all h, there exists a known feature mapping ϕ_h(·,·): S×A → ℝ^d such that for any policy π, Q^π_h(s, a; p, ℓ) = ϕ_h(s, a)^⊤ w^π_h for some unknown w^π_h ∈ ℝ^d.
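To connect the definitions of V^π_h and Q^π_h to something executable, here is a small policy-evaluation sketch by backward induction for a tabular finite-horizon MDP with fixed (p, ℓ), as in the value definitions above. The array layout and 0-based step indexing are illustrative assumptions; the linear settings just defined are not addressed here.

```python
import numpy as np

def evaluate_policy(p, loss, pi):
    """Backward-induction policy evaluation in a finite-horizon MDP.

    p:    shape (H, S, A, S); p[h, s, a, s'] is the transition probability.
    loss: shape (H, S, A); losses in [-1, 1].
    pi:   shape (H, S, A); pi[h, s, a] is the probability of action a at (s, h).
    Returns Q of shape (H, S, A) and V of shape (H, S), where index h here
    corresponds to step h+1 in the thesis's 1-based notation.
    """
    H, S, A = loss.shape
    Q = np.zeros((H, S, A))
    V = np.zeros((H + 1, S))                   # terminal value V[H] = 0
    for h in reversed(range(H)):
        Q[h] = loss[h] + p[h] @ V[h + 1]       # Q^pi_h = loss + E[V^pi_{h+1}]
        V[h] = (pi[h] * Q[h]).sum(axis=1)      # V^pi_h = E_{a ~ pi_h}[Q^pi_h]
    return Q, V[:H]
```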
It can be shown that any linear MDP is a linear-Q MDP.

1.5 Online decision making in dynamic worlds

In Section 1.1–Section 1.4, we introduced several online decision making settings and their corresponding definitions of regret. The benchmark in the regret definition is usually the cumulative loss of a fixed action or policy; hence these are also called static regret. This is reasonable if the environment is relatively stationary (or stationary but with adversarial perturbations), but becomes questionable when the environment is non-stationary. For example, in recommendation systems, users' preferences may change over time and the trend is always evolving, so the recommendation policies might have to adapt to these changes. In this case, a better benchmark would be one that changes over time.‡

There are two possible definitions for the adaptive benchmark. One is the cumulative loss of the best sequence of actions/policies. Regret defined with this benchmark is called the dynamic regret. In the case of multi-armed bandits, the dynamic regret is defined as

    Dynamic-Reg := ∑_{t=1}^T ℓ_{t,i_t} − ∑_{t=1}^T min_{i∈A} ℓ_{t,i}.

Comparing this with the static regret (1.1), we see that it simply moves the min_{i∈A} operator from outside the summation to inside the summation.

The other class of adaptive benchmarks is the best action/policy sequence that changes at most L − 1 times, for some L of interest. Regret defined with this benchmark is called the switching regret. In the case of multi-armed bandits, the switching regret is defined as

    Switching-Reg(L) := ∑_{t=1}^T ℓ_{t,i_t} − min_{(i⋆_1, …, i⋆_T) ∈ S(L)} ∑_{t=1}^T ℓ_{t,i⋆_t},

where S(L) = {(i_1, …, i_T) : ∑_{t=2}^T 1[i_t ≠ i_{t−1}] ≤ L − 1}.

‡ For a particular online decision making problem, it is sometimes unclear whether a fixed benchmark or a changing benchmark is better. The choice of benchmark depends on the application itself and the learner's prior knowledge about the environment. If the learner believes that the underlying environment is relatively fixed (but possibly with adversarial manipulations), then he should use the fixed benchmark; if the learner believes that the world is changing and a previously learned policy would not be useful in a later phase, then he should choose the changing benchmark. An interesting question is how to distinguish these two scenarios using the collected data, though this is out of the scope of this thesis.

1.6 Outline of the thesis

In Section 1.1–Section 1.4, we have seen that in many online decision making problems, the minimax static regret is of order O(√T), where T is the number of rounds the learner interacts with the environment. When the environment does not change over time (e.g., in stochastic multi-armed bandits, stochastic linear bandits, or stochastic MDPs), an improved O(lnT) regret is possible. The results in this thesis are motivated by the following two observations: 1) In many settings, a √T bound is the best regret bound against a worst-case environment; however, the learner may be able to achieve regret better than √T when the environment is more benign. Many existing algorithms that have a √T worst-case regret bound cannot achieve better regret even when the environment is benign. 2) The standard algorithms for the stochastic setting that achieve O(lnT) regret are different from those for the general setting that achieve O(√T) regret.
Therefore, in order to achieve the best of the stochastic and general settings, the learner has to know the type of environment in advance and use the corresponding algorithm. It is desirable to have an algorithm that requires no such knowledge and achieves the best of both worlds automatically.

In order to address the two issues above, we design adaptive algorithms, whose performance adapts to the type of the environment, and robust algorithms, whose performance is not only good in benign environments but remains optimal when the environment is adversarial. Besides these two properties, most of our algorithms are prior-knowledge-free, meaning that the learner does not need to know the type of the environment to achieve the desired regret bound.

In Chapter 2, we design prior-knowledge-free algorithms that achieve better-than-√T regret bounds for the expert problem and online linear optimization. Specifically, the regret bounds are usually of the form √□ for some □ ≤ T that quantifies the hardness of the environment. In Chapter 3, we obtain similar guarantees for multi-armed bandits, linear bandits, and contextual bandits. In Chapter 4, we design algorithms for bandits/MDPs that achieve logarithmic-in-T regret in the stochastic case, while achieving near-optimal regret in the general case. In Chapter 5, we deal with linear-Q MDPs and linear MDPs (defined in Section 1.4) with adversarial losses and fixed transitions. We obtain sub-linear-in-T regret bounds, which are among the first for these settings. In Chapter 6, we develop a general mechanism to deal with non-stationary environments, achieving optimal dynamic regret (see Section 1.5) for a wide range of online decision making problems. In Table 1.1, we provide a detailed mapping between the chapter/section index and the source of the content (which are all published conference papers).

Table 1.1: A mapping from chapters/sections to papers

  Chapter/Section            Paper
  Chapter 2                  Chen et al. (2021a)
  Section 3.2, Section 3.3   Wei and Luo (2018)
  Section 3.4, Section 3.5   Bubeck et al. (2019)
  Section 3.6                Wei et al. (2020b)
  Section 4.2                Wei and Luo (2018)
  Section 4.3                Lee et al. (2021)
  Section 4.4                Wei et al. (2022)
  Chapter 5                  Luo et al. (2021)
  Chapter 6                  Wei and Luo (2021)

Chapter 2: Adaptive Algorithms for Experts and Online Linear Optimization

2.1 Overview: the impossible-tuning problem

In this chapter, we consider the expert problem and online linear optimization. For the expert problem, we consider Protocol 9, which differs slightly from Protocol 1 in that the decision set is now a distribution over experts instead of a single expert. An algorithm designed for Protocol 9 can be directly applied to Protocol 1 by sampling i_t ∼ w_t. Suppose that the set of experts is A = {1, 2, …, d}. In each round, the learner determines a distribution over experts w_t ∈ ∆_d, and then the adversary decides a loss vector ℓ_t ∈ [−1, 1]^d. Finally, the learner suffers the loss ⟨w_t, ℓ_t⟩. We consider a more refined notion of regret, defined as follows: for any u ∈ ∆_d, the learner's regret against u is

    Reg(u) = ∑_{t=1}^T ⟨w_t − u, ℓ_t⟩.    (2.1)

It is not hard to see that E[max_{u∈∆_d} Reg(u)] = E[Reg], where Reg is defined in (1.1). The reason we define Reg(u) is that in this chapter we target refined regret bounds that adapt to the benchmark u.

Protocol 9: The Expert Problem
  1  Given: a finite set of experts A.
  2  for t = 1, 2, …, T do
  3      Learner chooses a distribution over experts w_t ∈ ∆_A.
  4      Environment decides the losses of experts ℓ_t ∈ [−1, 1]^A.
  5      Learner suffers loss ⟨w_t, ℓ_t⟩.
  6      Learner observes ℓ_t.

The standard Hedge algorithm (also called the exponential weights or multiplicative weights algorithm) (Freund and Schapire, 1997) with a fixed learning rate η is known to achieve the following regret bound for all u ∈ ∆_d simultaneously:

    Reg(u) ≤ ln d / η + η ∑_{t=1}^T ∑_{i=1}^d w_{t,i} ℓ_{t,i}².    (2.2)

With an oracle tuning of η, one can achieve the bound Reg(u) ≤ O(√((ln d) ∑_{t=1}^T ∑_{i=1}^d w_{t,i} ℓ_{t,i}²)). Without oracle tuning, it is also standard to use adaptive learning rates η_t = √(ln d / (1 + ∑_{τ=1}^{t−1} ∑_i w_{τ,i} ℓ_{τ,i}²)) to achieve the same bound. Notice that for all u ∈ ∆_d, the bound we can prove for Reg(u) is the same.

Yet another standard algorithm for the expert problem is the Prod algorithm (Cesa-Bianchi et al., 2007), which achieves the following bound for all u simultaneously:

    Reg(u) ≤ ln d / η + η ∑_{t=1}^T ∑_{i=1}^d u_i ℓ_{t,i}².    (2.3)

Specifically, if we let u = e_i, then we have Reg(e_i) ≤ ln d / η + η ∑_{t=1}^T ℓ_{t,i}², i.e., the regret bound against expert i only depends on the sum of squared losses of expert i. This is different from (2.2), where the regret depends on the losses of all experts, and the former can sometimes be much smaller. With the optimal tuning of η, this gives an adaptive bound Reg(e_i) = O(√((ln d) ∑_{t=1}^T ℓ_{t,i}²)). However, since different i requires different tuning, no method is known to achieve this bound simultaneously for all i. Several works discuss the difficulty of doing so, even with different η for different experts, and why all standard tuning techniques fail (Cesa-Bianchi et al., 2007; Hazan and Kale, 2010). Indeed, the problem is so challenging that it has been referred to as the "impossible tuning" issue (Gaillard et al., 2014).

Table 2.1: Summary of main results. w_t ∈ ℝ^d is the decision of the learner, ℓ_t is the loss vector, m_t is a prediction for ℓ_t, L_T = ∑_{t=1}^T (ℓ_t − m_t)(ℓ_t − m_t)^⊤, and r is the rank of L_T.

  Setting  |  Bounds on Reg(u)
  Expert   |  Õ(√((ln d) ∑_{t=1}^T ∑_{i=1}^d u_i (ℓ_{t,i} − m_{t,i})²)),
           |  Õ(√((ln d) ∑_{t=1}^T ⟨u − w_t, ℓ_t − m_t⟩²))
  OLO      |  Õ(√(r ∑_{t=1}^T ⟨u, ℓ_t − m_t⟩²)),
           |  Õ(∥u∥ √(∑_{t=1}^T ∥ℓ_t − m_t∥_⋆²)),
           |  Õ(√((∥u∥_2² + u^⊤ L_T^{1/2} u) · tr(L_T^{1/2}))),
           |  Õ(√(r ∑_{t=1}^T ⟨u − w_t, ℓ_t − m_t⟩²))

In Section 2.2, we show that, perhaps surprisingly, this impossible tuning is in fact possible (up to an additional lnT factor), via an algorithm combining ideas that have mostly appeared before. More concretely, we achieve this via Mirror Descent with a correction term similar to (Steinhardt and Liang, 2014a) and a weighted negative entropy regularizer with different learning rates for each expert (and each round), similar to (Bubeck et al., 2017b). Note that while natural, this algorithm has not been studied before, and it is not equivalent to using different learning rates for different experts in Prod or multiplicative weights, as it does not admit a closed "proportional" form (and instead needs to be computed via a line search). We present our result in a more general setting where the learner receives a predicted loss vector m_t before deciding on w_t (Rakhlin and Sridharan, 2013b), and show a bound Reg(e_i) = Õ(√((ln d) ∑_{t=1}^T (ℓ_{t,i} − m_{t,i})²)) simultaneously for all i (setting m_t = 0 resolves the original impossible tuning issue).
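As a concrete reference point for this prediction-based setting, the following is a sketch of the standard optimistic (prediction-aided) exponential-weights update, which plays against m_t and then updates with the realized ℓ_t. It is not the algorithm developed in this chapter — MsMwC additionally uses per-expert, time-varying learning rates and a second-order correction term — and the fixed η is an illustrative assumption.

```python
import numpy as np

def optimistic_hedge(losses, predictions, eta):
    """Optimistic exponential weights: play against m_t, update with the true ell_t.

    losses, predictions: T x d arrays with entries in [-1, 1].
    Returns the played distributions w_1, ..., w_T as rows.
    """
    T, d = losses.shape
    log_w_prime = np.zeros(d)                    # secondary sequence w'_t (log-space)
    played = np.zeros((T, d))
    for t in range(T):
        logits = log_w_prime - eta * predictions[t]
        w = np.exp(logits - logits.max()); w /= w.sum()
        played[t] = w                            # w_t: hedged toward the prediction m_t
        log_w_prime -= eta * losses[t]           # w'_{t+1}: updated with the real loss
    return played
```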
Using different m_t, we achieve various regret bounds that either recover the guarantees of existing algorithms such as (A,B)-Prod (Sani et al., 2014), Adapt-ML-Prod (Gaillard et al., 2014), and Optimistic-Adapt-ML-Prod (Wei et al., 2016), or improve over existing variance/path-length bounds in (Steinhardt and Liang, 2014a).

In Section 2.3, we use the same algorithmic framework to create a master algorithm that combines a set of base algorithms and learns the best one for different environments. Although similar ideas appear in many prior works with different masters (Koolen et al., 2014; van Erven and Koolen, 2016; Foster et al., 2017; Cutkosky, 2019b; Bhaskara et al., 2020), the new guarantee of our master allows us to derive many new results that have not been achieved before, for both the expert problem and, more generally, Online Linear Optimization (OLO). Specifically, for the expert problem, using the master to combine different instances of itself, we further generalize the aforementioned bound in different ways, including replacing the ln d factor with KL(u, π) when competing against u with a prior distribution π, adapting to the scale of each expert, extending the results to switching regret, and dealing with unknown loss ranges. These results improve over (Luo and Schapire, 2015; Koolen and Van Erven, 2015), (Bubeck et al., 2017b; Foster et al., 2017; Cutkosky and Orabona, 2018), (Cesa-Bianchi et al., 2012), and (Mhammedi et al., 2019), respectively. See Section 2.4 for detailed discussions.

Next, we consider the more general OLO problem, where the learner's decision set generalizes from ∆_d to an arbitrary closed convex set K ⊂ ℝ^d (other than this change, the learning protocol and the regret definition remain the same). Using our master to combine different types of base algorithms, we achieve four different bounds on Reg(u) simultaneously for all u, listed in Table 2.1. These bounds improve over a line of recent advances in OLO (van Erven and Koolen, 2016; Cutkosky and Orabona, 2018; Cutkosky, 2019a,b; Mhammedi et al., 2019; Mhammedi and Koolen, 2020; Cutkosky, 2020). See Section 2.5 for detailed discussions.

2.2 Impossible tuning made possible

Algorithm 10: Multi-scale Multiplicative-weight with Correction (MsMwC)
  1  Initialize: w′_1 ∈ ∆_d.
  2  for t = 1, …, T do
  3      Receive prediction m_t ∈ ℝ^d.
  4      Decide a compact convex decision subset Ω_t ⊆ ∆_d and learning rates η_t ∈ ℝ^d_{≥0}.
  5      Compute w_t = argmin_{w∈Ω_t} ⟨w, m_t⟩ + D_{ψ_t}(w, w′_t), where ψ_t(w) = ∑_{i=1}^d (1/η_{t,i}) w_i ln w_i.
  6      Play w_t, receive ℓ_t, and construct the correction term a_t ∈ ℝ^d with a_{t,i} = 32 η_{t,i} (ℓ_{t,i} − m_{t,i})².
  7      Compute w′_{t+1} = argmin_{w∈Ω_t} ⟨w, ℓ_t + a_t⟩ + D_{ψ_t}(w, w′_t).

We start by proposing a general algorithmic framework called Multi-scale Multiplicative-weight with Correction (MsMwC), shown in Algorithm 10. In this section, we instantiate the framework in a specific way to resolve the impossible tuning issue, and in Section 2.3 we instantiate it differently to obtain a new master algorithm, with more applications discussed in the following sections.

MsMwC is a variation of the standard Optimistic Mirror Descent (OMD) framework, which maintains two sequences w_1, …, w_T and w′_1, …, w′_T updated according to Line 5 and Line 7. The key new ingredients are the following. First, we adopt a time-varying decision subset Ω_t ⊆ ∆_d to which w_t and w′_{t+1} belong.
Second, our regularizer ψ_t(w) = ∑_{i=1}^d (1/η_{t,i}) w_i ln w_i is a negative entropy with an individual and time-varying learning rate η_{t,i} for each expert i. For most applications, η_{t,i} is the same for all t, in which case our regularizer is the same as that used in the MsMw algorithm of (Bubeck et al., 2017b). Finally, we adopt a second-order correction term a_t added to the loss vector ℓ_t in the update of w′_{t+1} (Line 7), which is the most important difference compared to MsMw (Bubeck et al., 2017b). Similar correction terms have been used in prior works such as (Hazan and Kale, 2010; Steinhardt and Liang, 2014a; Wei and Luo, 2018) and are known to be important to achieving a regret bound that depends on quantities only related to the expert being compared to.

We present a general lemma on the regret guarantee of MsMwC below, which holds under a condition on the magnitude of η_{t,i} |ℓ_{t,i} − m_{t,i}|; see Appendix A.2 for the proof. We also note that the last negative term in the regret bound is particularly important for some of the applications.

Lemma 1. Define f_KL(a, b) = a ln(a/b) − a + b for a, b ∈ [0, 1].∗ Suppose that for all t ∈ [T], 32 η_{t,i} |ℓ_{t,i} − m_{t,i}| ≤ 1 holds for all i such that w_{t,i} > 0. Then MsMwC ensures for any u ∈ ∩_{t=1}^T Ω_t,

    Reg(u) ≤ ∑_{i=1}^d (1/η_{1,i}) f_KL(u_i, w′_{1,i}) + ∑_{t=2}^T ∑_{i=1}^d (1/η_{t,i} − 1/η_{t−1,i}) f_KL(u_i, w′_{t,i})
             + 32 ∑_{t=1}^T ∑_{i=1}^d η_{t,i} u_i (ℓ_{t,i} − m_{t,i})² − 16 ∑_{t=1}^T ∑_{i=1}^d η_{t,i} w_{t,i} (ℓ_{t,i} − m_{t,i})².    (2.4)

∗ Define f_KL(0, b) = b for all b ∈ [0, 1].

To resolve the impossible tuning issue, we instantiate MsMwC in the following way, with the decision sets fixed to a truncated simplex and the learning rates tuned using the data observed so far.

Theorem 1. Suppose |ℓ_{t,i}| and |m_{t,i}| are bounded by 1 for all t ∈ [T] and i ∈ [d]. Then MsMwC with w′_1 = (1/d)·1, Ω_1 = ··· = Ω_T = {w ∈ ∆_d : w_i ≥ 1/(dT)}, and

    η_{t,i} = min{ √(ln(dT) / ∑_{s<t} (ℓ_{s,i} − m_{s,i})²), 1/64 }

ensures for all i⋆ ∈ [d], Reg(e_{i⋆}) = O(ln(dT) + √(ln(dT) ∑_{t=1}^T (ℓ_{t,i⋆} − m_{t,i⋆})²)).

Proof sketch. We apply (2.4) with u = (1 − 1/T) e_{i⋆} + (1/T) w′_1 ∈ ∩_{t=1}^T Ω_t, so that Reg(e_{i⋆}) ≤ Reg(u) + 2. Most of the calculations are straightforward, and the most important part is to realize that (1/η_{t,i} − 1/η_{t−1,i}) w′_{t,i}, a term coming from (1/η_{t,i} − 1/η_{t−1,i}) f_KL(u_i, w′_{t,i}), can be bounded as

    (1/η_{t,i} − 1/η_{t−1,i}) w′_{t,i} = [(1/η_{t,i}² − 1/η_{t−1,i}²) / (1/η_{t,i} + 1/η_{t−1,i})] · w′_{t,i} ≤ η_{t−1,i} w′_{t,i} (1/η_{t,i}² − 1/η_{t−1,i}²),

which is further bounded by (1/ln(dT)) η_{t−1,i} w′_{t,i} (ℓ_{t−1,i} − m_{t−1,i})² using the definition of η_{t,i}, and thus can be canceled by the last negative term in (2.4) (since w′_{t,i} and w_{t−1,i} are close). The complete proof can be found in Appendix A.2.

When m_t = 0, our bound exactly resolves the original impossible tuning issue (up to a lnT term). Below we discuss more implications of our bound obtained by choosing different m_t.

Implication 1: improved variance or path-length bounds. Similarly to (Steinhardt and Liang, 2014a), by setting m_t to be the running average of the loss vectors, (1/(t−1)) ∑_{s<t} ℓ_s, we obtain a bound that depends only on the variance of expert i⋆: O(√(ln(dT) ∑_{t=1}^T (ℓ_{t,i⋆} − µ_{i⋆})²)), where µ_{i⋆} = (1/T) ∑_{t=1}^T ℓ_{t,i⋆}. On the other hand, by setting m_t = ℓ_{t−1} (define ℓ_0 = 0), we obtain a bound that depends only on the "path-length" of expert i⋆: O(√(ln(dT) ∑_{t=1}^T (ℓ_{t,i⋆} − ℓ_{t−1,i⋆})²)).
The algorithm of (Steinhardt and Liang, 2014a) uses a fixed learning rate and only achieves these bounds with an oracle tuning of the fixed learning rate, while our algorithm is completely adaptive and parameter-free.

In the next few implications, we make use of the following trick: if all coordinates of m_t are the same, then ⟨w, m_t⟩ is a constant independent of w ∈ ∆_d, and thus w_t = argmin_{w∈Ω_t} ⟨w, m_t⟩ + D_{ψ_t}(w, w′_t) = argmin_{w∈Ω_t} D_{ψ_t}(w, w′_t), meaning that the algorithm and its guarantee are valid even if m_t is set in terms of ℓ_t, which is unknown at the beginning of round t.

Implication 2: recovering the (A,B)-Prod guarantee. If we set m_t = ℓ_{t,1}·1, then the regret against expert 1 becomes a constant O(ln(dT)) (while the regret against the others remains O(√(T ln(dT)))). This is exactly the guarantee of the (A,B)-Prod algorithm (Sani et al., 2014), useful for combining a set of base algorithms where one of them enjoys a regret bound significantly better than √T.

Implication 3: recovering the Adapt-ML-Prod guarantee. Next, we set m_t = ⟨w_t, ℓ_t⟩·1 (again, valid even though it is unknown at the beginning of round t), leading to a bound O(√(ln(dT) ∑_{t=1}^T r_{t,i⋆}²)), where r_{t,i} = ⟨w_t − e_i, ℓ_t⟩ is the instantaneous regret to expert i. A regret bound in terms of √(∑_{t=1}^T r_{t,i⋆}²) was first achieved by the Adapt-ML-Prod algorithm (Gaillard et al., 2014) (and later improved
One caveat is that mt,i is now in the range of [−3, +3], breaking the condition of Theorem 1, but this can be simply addressed by changing the constant 64 in the definition of ηt,i to 128 so that the condition of Lemma 1 still holds. 23 Algorithm 11: MsMwC-Master 1 Input: a set of (learning rate, base algorithm) pairsE. 2 Initialize: p ′ 1 ∈ ∆ E such that p ′ 1,k ∝η 2 k for each k∈E. 3 for t = 1,...,T do 4 Receive prediction m t ∈R d and feed it to all base algorithms. 5 For each k∈E, receive decision w k t ∈K from the base algorithm and define h t,k = D w k t ,m t E . 6 Decide a compact convex decision subset Λ t ⊆ ∆ E . 7 Compute p t = argmin p∈Λt ⟨p,h t ⟩ +D ψ (p,p ′ t ) where ψ(p) = P k∈E 1 η k p k lnp k . 8 Play w t = P k∈E p t,k w k t ∈K, receive ℓ t and feed it to all base algorithms. 9 For each k∈E, define g t,k = D w k t ,ℓ t E and b t,k = 32η k (g t,k −h t,k ) 2 . 10 Compute p ′ t+1 = argmin p∈Λt ⟨p,g t +b t ⟩ +D ψ (p,p ′ t ). Upon receiving the prediction m t ∈R d for the expert/OLO problem we are trying to solve, we feed it to all base algorithms, receive their decisions{w k t } k∈E , and then define the prediction h t ∈R E for the master expert problem with h t,k = D w k t ,m t E , that is, the predicted loss of the decision w k t . Next, MsMwC-Master decides a subset Λ t ∈ ∆ E and performs the OMD update with the regularizer ψ(p) = P k∈E 1 η k p k lnp k to compute p t ; note that the regularizer is now fixed over time. With p t , MsMwC-Master aggregates the decisions of all base algorithms by playing the convex combination P k∈E p t,k w k t . Afterseeingthelossvectorℓ t andfeedingittoallbasealgorithms, MsMwC- Master naturally defines the loss vector g t ∈R E for its own expert problem with g t,k = D w k t ,ℓ t E and the corresponding correction term b t with b t,k = 32η k (g t,k −h t,k ) 2 . Finally, p ′ t+1 is calculated according to the OMD update rule using g t +b t . To use MsMwC-Master, one simply designs a set of base algorithms with corresponding learning rates (and decides the subset Λ t which is usually the set of distributions over some or all of the experts). These base algorithms are usually different instances of the same algorithm with different parameters such as a different learning rate, which usually coincides with the learning rate η k for this expert. The point of having this construction is that MsMwC-Master can then learn the best 24 parameter setting of the base algorithm automatically. Indeed, with Reg B being the regret of base algorithmB, we have the following guarantee that is a direct corollary of Lemma 1. Theorem 2. Suppose that for all t, 32η k | D w k t ,ℓ t −m t E |≤ 1 holds for all k∈E with p t,k > 0. Then for any k ⋆ = (η ⋆ ,B ⋆ )∈E such that e k⋆ ∈ T T t=1 Λ t , MsMwC-Master ensures ∀u∈K, Reg(u)≤ Reg B⋆ (u) + 1 η ⋆ ln P k η 2 k η 2 ⋆ ! + P k η k P k η 2 k + 32η ⋆ T X t=1 D w k⋆ t ,ℓ t −m t E 2 . (2.5) The proof is deferred to Appendix A.2. In all our applications, the learning rates are chosen from an exponential grid such that P k η k and P k η 2 k are both constants. Moreover, the term 32η ⋆ P T t=1 ⟨w k⋆ t ,ℓ t −m t ⟩ 2 can usually be canceled by the a negative term from Reg B⋆ (u), making the overhead of the master simply beO( 1 η⋆ ln 1 η⋆ ), which is rather small. We remark that the idea of combining a set of base algorithms or more specifically “learning the learning rate” has appeared in many prior works such as (Koolen et al., 2014; van Erven and Koolen, 2016; Foster et al., 2017; Cutkosky, 2019b; Bhaskara et al., 2020). 
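To make the master loop concrete, here is a minimal sketch of Algorithm 11 with $\Lambda_t$ fixed to the full simplex over $\mathcal{E}$. The base-algorithm interface (.decision/.update), the bisection helper for the mirror step, and the driver callbacks supplying $m_t$ and $\ell_t$ are illustrative conventions of ours; the initialization $p'_{1,k}\propto\eta_k^2$ and the correction $b_{t,k}=32\eta_k(g_{t,k}-h_{t,k})^2$ follow the pseudocode.

```python
import numpy as np

def omd_step(p_prev, cost, eta):
    # argmin_{p in simplex} <p, cost> + D_psi(p, p_prev),  psi(p) = sum_k (1/eta_k) p_k ln p_k.
    def weights(mu):
        return p_prev * np.exp(-eta * (cost + mu))
    lo = -cost.max() - np.log(1.0 / p_prev.min()) / eta.min() - 1.0
    hi = -cost.min() + 1.0
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if weights(mid).sum() > 1.0 else (lo, mid)
    p = weights(0.5 * (lo + hi))
    return p / p.sum()

def msmwc_master(bases, etas, T, get_prediction, get_loss):
    # bases[k]: object with .decision(m_t) -> w_t^k and .update(l_t);
    # etas[k]: learning rate eta_k paired with base k;
    # get_prediction(t) -> m_t, get_loss(t, w_t) -> l_t (may depend on the played point).
    etas = np.asarray(etas, dtype=float)
    p_aux = etas ** 2 / np.sum(etas ** 2)                # p'_1,k proportional to eta_k^2
    total_loss = 0.0
    for t in range(T):
        m = get_prediction(t)
        decisions = [B.decision(m) for B in bases]       # w_t^k from every base
        h = np.array([w @ m for w in decisions])         # h_{t,k} = <w_t^k, m_t>
        p = omd_step(p_aux, h, etas)                     # optimistic step (Lambda_t = simplex)
        w_play = sum(pk * wk for pk, wk in zip(p, decisions))
        loss = get_loss(t, w_play)                       # observe l_t and feed it to the bases
        total_loss += float(w_play @ loss)
        for B in bases:
            B.update(loss)
        g = np.array([w @ loss for w in decisions])      # g_{t,k} = <w_t^k, l_t>
        b = 32.0 * etas * (g - h) ** 2                   # correction b_{t,k}
        p_aux = omd_step(p_aux, g + b, etas)             # compute p'_{t+1}
    return total_loss
```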
However, the special regret guarantee of MsMwC allows us to derive new applications as shown in the next two sections. 2.4 Applications to the expert problem In this section, we apply MsMwC-Master to derive yet another four new results for the expert problem (thusK = ∆ d throughout this section). These results improve over the guarantee of Theorem 1 by respectively adapting to an arbitrary competitor and a prior, the scale of each expert, a switching sequence of competitors, and unknown loss ranges. ‡ ‡ While we present all results using the master with appropriate base algorithms, it is actually possible to “flatten” this two-layer structure to just one layer by duplicating each expert and assigning each copy a different learning rate. We omit the details since this approach does not generalize to OLO. 25 Appication 1: adapting to an arbitrary competitor Typical regret bounds for the expert problem compete with an individual expert and pay for a √ lnd factor. Several works generalize this by replacing lnd with KL(u,π) when competing with an arbitrary competitor u∈ ∆ d , where π is a fixed prior distribution over the experts (Luo and Schapire, 2015; Koolen and Van Erven, 2015). Importantly, the bound holds simultaneously for all u. Inspired by these works, our goal here is to make the same generalization for Theorem 1. To do so, we again instantiate MsMwC differently to create a set of base algorithms, each with a fixed learning rate across all i andt (so both the master and the base algorithms are instances of MsMwC). Specifically, consider the following set of O (lnT ) experts: E KL = n (η k ,B k ) :∀k = 1,...,⌈log 2 T⌉,η k = 1 32·2 k ,B k is MsMwC with w ′ 1 =π, Ω t = ∆ d , and η t,i = 2η k for all t and i o . (2.6) By Lemma 1, we know thatB k guarantees for all u∈ ∆ d : Reg B k (u)≤ KL(u,π) 2η k + 64η k T X t=1 d X i=1 u i (ℓ t,i −m t,i ) 2 − 32η k T X t=1 d X i=1 w k t,i (ℓ t,i −m t,i ) 2 . (2.7) MsMwC-Master can then learn the bestη k to achieve the optimal tuning. Indeed, directly combining the guarantee of MsMwC-Master from Theorem 2 and noting that, importantly, the last term in (2.5) can be canceled by the last negative term in (2.7) by Cauchy-Schwarz inequality, we obtain the following result (full proof deferred to Appendix A.4). Theorem 3. Suppose∥ℓ t −m t ∥ ∞ ≤ 1,∀t. Then for any π∈ ∆ d , MsMwC-Master with expert set E KL and Λ t = ∆ E KL ensures Reg(u) =O KL(u,π) + lnV (u) + p (KL(u,π) + lnV (u))V (u) for all u∈ ∆ d , where V (u) = max n 3, P T t=1 P d i=1 u i (ℓ t,i −m t,i ) 2 o . 26 This result recovers the guarantee in Theorem 1 when u = e i⋆ and π is uniform (in fact, it also improves the lnT factor to lnV (e i⋆ )). § Note that the implications discussed in Section 2.2 by selecting different m t still apply here with the same improvement (from ln(dT ) to KL(u,π)+lnV (u)). In particular, this means that our results recover and improve those of (Luo and Schapire, 2015; Koolen and Van Erven, 2015) (which only cover the case with m t =⟨w t ,ℓ t ⟩1). Application 2: adapting to Multiple Scales Consider the “multi-scale” expert problem (Bubeck et al., 2017b; Foster et al., 2017; Cutkosky and Orabona, 2018) where each expert i has a different loss range c i > 0 such that|ℓ t,i |≤c i (and naturally|m t,i |≤c i ) for all t. Previous works all achieve a bound Reg(e i⋆ ) = ˜ O(c i⋆ √ T lnd), scaling only in terms of c i⋆ . The main term of our bound in Theorem 1 is already strictly better since the term q P T t=1 (ℓ t,i⋆ −m t,i⋆ ) 2 ≤ 2c i⋆ √ T inherently only scales with c i⋆ . 
The issue is that the lower-order term in the bound is in fact in terms of max i c i . To improve it to c i⋆ , we apply similar ideas developed in Section 2.4 and again use MsMwC-Master to learn the best learning rate for the base algorithm MsMwC. To this end, first define a set S = n k∈Z :∃i∈ [d],c i ≤ 2 k−2 ≤c i √ T o so that{ 1 32·2 k } k∈S contains all the learning rates we want to search over. Then define expert set: E MS = n (η k ,B k ) :∀k∈S,η k = 1 32·2 k ,B k is MsMwC with w ′ 1 being uniform overZ(k), Ω t ={w∈ ∆ d :w i = 0,∀i / ∈Z(k)}, and η t,i = 2η k for all t and i o , (2.8) whereZ(k) ={i∈ [d] : c i ≤ 2 k−2 }. Compared to (2.6), another difference is that we restrict each base algorithmB k to work with only a subsetZ(k) of arms, which ensures the condition 32η t,i |ℓ t,i −m t,i |≤ 128η k c i ≤ 1 (for i with w k t,i > 0) of Lemma 1 and similarly the condition of § However, we believe that the result of Theorem 1 is still valuable since the algorithm does not require maintaining multiple base algorithms and is more computationally efficient and practical. 27 Theorem 2. With this construction, we can then automatically learn the best instance and achieve the following multi-scale bound that is a strict improvement over previous works. Theorem 4. Suppose for all t,|ℓ t,i |≤c i and|m t,i |≤c i for some c i > 0. Define c min = min i c i and Γ i = ln( dTc i c min ). Then MsMwC-Master with expert setE MS defined in (2.8) and Λ t = ∆ S MS ensures: Reg(e i⋆ ) = ˜ O c i⋆ Γ i⋆ + q Γ i⋆ P T t=1 (ℓ t,i⋆ −m t,i⋆ ) 2 for all i ⋆ ∈ [d]. Application 3: adapting to a switching sequence So far, the regret measure we have considered compares with a fixed competitor across all T rounds. A more challenging notion of regret, called switching regret, compares with a sequence of changing competitors with a certain number of switches, which is a much more appropriate measure for non-stationary environments. Specifically, we useI to denote an interval of rounds (that is, a subset of [T ] of the form{s,s + 1,...,t− 1,t}) and Reg I (u) = P t∈I ⟨w t −u,ℓ t ⟩ to denote the regret against u on this interval. For a partition I 1 ,...,I S of [T] and competitors u 1 ,...,u S ∈ ∆ d , the corresponding switching regret is then P S j=1 Reg I j (u j ). Now, we show that almost the same construction as in Section 2.4 generalizes our result in Theorem 1 to switching regret as well. Specifically, we deploy the following expert set: E switch = n (η k ,B k ) :∀k = 1,...,⌈log 2 T⌉,η k = 1 32·2 k ,B k is MsMwC with w ′ 1 = 1 d 1, Ω t = n w∈ ∆ d :w i ≥ 1 dT o , and η t,i = 2η k for all t and i o , (2.9) where the only essential difference compared to E KL is the use of a truncated simplex for Ω t . We then have the following new switching regret guarantee. 28 Theorem 5. If∥ℓ t −m t ∥ ∞ ≤ 1 holds for all t∈ [T], then MsMwC-Master with expert setE switch defined in (2.9) and Λ t = n p∈ ∆ E KL :p k ≥ 1 T o ensures for any partitionI 1 ,...,I S of [T] and competitors u 1 ,...,u S ∈ ∆ d , S X j=1 Reg I j (u j ) =O S ln(dT ) + S X j=1 v u u u tln(dT ) X t∈I j d X i=1 u j,i (ℓ t,i −m t,i ) 2 . (2.10) Our bound is never worse than the typical oneO( p ST ln(dT )) (due to Cauchy-Schwarz inequal- ity) and significantly improves over previous works such as (Cesa-Bianchi et al., 2012; Luo and Schapire, 2015) by again choosing different m t according to previous discussions. 
It also resolves an open problem raised by Lu and Zhang (2019) on the possibility of making the switching regret bound adapt to the path length of the comparator sequence. The proof of Theorem 5 requires a more general version of Lemma 1 and is deferred to Appendix A.4. Impossibility for interval regret. Looking at (2.10), one might wonder whether the natural bound Reg I j (u j ) =O ln(dT ) + q ln(dT ) P t∈I j P d i=1 u j,i (ℓ t,i −m t,i ) 2 holds for each intervalI j separately. Indeed, (2.10) could be derived from this (by summing over S intervals). It turns out that, even if its special case with m t =⟨w t ,ℓ t ⟩ is achievable (Luo and Schapire, 2015), this cannot hold in general as shown in Appendix A.4.1. We find this intriguing (given that (2.10) is achievable) and reminiscent of the impossibility result for interval regret in bandits (Daniely et al., 2015). 2.5 Applications to online linear optimization We next discuss applications of MsMwC-Master to general OLO. For simplicity, we assume thatK is a compact convex set such that∥w∥≤D for all w∈K, and also max t ∥ℓ t −m t ∥≤ 1, where∥·∥ is the L 2 norm (extensions to general primal-dual norms are straightforward). 29 Application 1: combining Online Newton Step It is a folklore that one can reduce OLO to the expert problem by discretizing the decision set K into O T d points and treating each point as an expert. With this reduction, our result in Theorem 1 immediately implies a bound Reg(u) = ˜ O( q d P t ⟨u,ℓ t −m t ⟩ 2 ) for OLO. Of course, the caveat is that the reduction is compu- tationally inefficient. ¶ Below, we show that the same (or even better) bound can be achieved efficiently by using MsMwC-Master with a variant of Online Newton Step (ONS) (Hazan et al., 2007) as the base algorithm. Specifically, the ONS variant (denoted by A k and parameterized by a fixed learning rate η) can be presented in the OMD framework again using an auxiliary cost function c t (w) =⟨w,ℓ t ⟩ + 32η⟨w,ℓ t −m t ⟩ 2 and a time-varying regularizer ψ t (w) = 1 2 ∥w∥ 2 At whereA t =η 2I + P s<t (∇ s −m s )(∇ s −m s ) ⊤ and∇ s =∇c s (w k s ). This variant is similar to that in (Cutkosky and Orabona, 2018), but incorporates the predictionm t as well. We defer the details to Appendix A.5.1, which shows:A k ensures (withr being the rank ofL T = P T t=1 (ℓ t −m t )(ℓ t −m t ) ⊤ ) Reg(u)≤ ˜ O r η +η T X t=1 ⟨u,ℓ t −m t ⟩ 2 ! − 16η T X t=1 D w k t ,ℓ t −m t E 2 . (2.11) Therefore, using MsMwC-Master to learn the best learning rate and noting that the last negative term in (2.11) cancels the last term in (2.5), we obtain the following result. Theorem 6. Let r≤d be the rank ofL T = P T t=1 (ℓ t −m t )(ℓ t −m t ) ⊤ . MsMwC-Master with expert setE ONS defined in (A.6) and Λ t = ∆ E ONS ensures ∀u∈K, Reg(u) = ˜ O r∥u∥ + v u u t r T X t=1 ⟨u,ℓ t −m t ⟩ 2 . (2.12) Similar bounds have appeared before but only with m t = 0 (Cutkosky and Orabona, 2018; Cutkosky, 2020), and we have not been able to incorporate general m t into their algorithms. Our ¶ The reduction is efficient when d = 1 though. This gives an alternative algorithm with the same guarantee as (Cutkosky and Orabona, 2018, Theorem 1) and is useful already with their reduction from general d to d = 1. 30 bound has no explicit dependence on D at all, and its dependence on ℓ t −m t is only through its projection on u. Application 2: combining Gradient Descent Another natural choice of base algorithm is Optimistic Gradient Descent, which guarantees Reg(u) =O ∥u∥ 2 η +η P T t=1 ∥ℓ t −m t ∥ 2 (see Appendix A.5.2). 
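As a concrete reference point, a minimal sketch of such an optimistic gradient descent base learner is given below, written against the same .decision/.update interface as the master sketch in Section 2.3. Restricting the iterates to an L2 ball of radius $D$ is an illustrative simplification of the actual construction in (A.7).

```python
import numpy as np

class OptimisticGD:
    # Optimistic projected gradient descent over an L2 ball of radius D (sketch).
    # decision(m_t) returns w_t = Proj(w'_t - eta * m_t); update(l_t) sets
    # w'_{t+1} = Proj(w'_t - eta * l_t), the usual optimistic-OMD form with the
    # Euclidean regularizer.
    def __init__(self, dim, eta, D):
        self.eta, self.D = eta, D
        self.w_aux = np.zeros(dim)        # w'_t

    def _proj(self, x):
        n = np.linalg.norm(x)
        return x if n <= self.D else x * (self.D / n)

    def decision(self, m):
        return self._proj(self.w_aux - self.eta * m)

    def update(self, loss_vec):
        self.w_aux = self._proj(self.w_aux - self.eta * loss_vec)
```

Instances of this class with learning rates chosen from an exponential grid are exactly the kind of base algorithms that MsMwC-Master aggregates.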
Combining instances with different learning rates that operate over subsets of K of different sizes (necessary to ensure 32η k | D w k t ,ℓ t −m t E |≤ 1 for Theorem 2), we obtain: Theorem 7. MsMwC-Master with expert setE GD defined in (A.7) and Λ t = ∆ E GD ensures ∀u∈K, Reg(u) = ˜ O ∥u∥ +∥u∥ v u u t T X t=1 ∥ℓ t −m t ∥ 2 . (2.13) This bound has appeared before, first with m t = 0 in (Cutkosky and Orabona, 2018) and later with general m t in (Cutkosky, 2019b). We recover the bound easily with our framework. Similar to (2.12), this bound adapts to the size of the competitor u (with no dependence on D). An advantage of (2.13) is that it is dimension-free, while (2.12) is potentially large for high-dimensional data. Application 3: combining AdaGrad Inspired by the recent work of (Cutkosky, 2020) that provides an improved guarantee for the full-matrix version of AdaGrad (Duchi et al., 2011), we next design an optimistic version of AdaGrad and combine instances with different parameters to obtain the following new result. Theorem 8. MsMwC-Master with expert setE AG defined in (A.8) and Λ t = ∆ E AG ensures ∀u∈K, Reg(u) = ˜ O ∥u∥ + r u ⊤ (I +L T ) 1/2 u tr L 1/2 T ! . (2.14) 31 All details are deferred to Appendix A.5.3. Cutkosky (2020) achieves (2.14) for m t = 0 (again, we are not able to extend their algorithm to deal with general m t ). Application 4: combining MetaGrad’s base algorithm Finally, we discuss how to recover and generalize the regret bound of MetaGrad (van Erven and Koolen, 2016) which depends on the sum of squared instantaneous regret and is the analogue of the Adapt-ML-Prod guarantee for the expert problem. Our base algorithm is yet another variant of ONS that uses a different auxiliary cost function c t (w) =⟨w,ℓ t ⟩ + 32η⟨w−w t ,ℓ t −m t ⟩ 2 with an extra offset in terms of w t (the decision of the master). When m t = 0, this is the same base algorithm used in (van Erven and Koolen, 2016). Compared to (2.11), this variant ensures the following: Reg(u)≤ ˜ O r η +η T X t=1 ⟨u−w t ,ℓ t −m t ⟩ 2 ! − 16η T X t=1 D w k t −w t ,ℓ t −m t E 2 . (2.15) Note that the last negative term is now slightly different from the last term in (2.5). To make them match, we need to change the definition of h t,k in MsMwC-Master from h ′ t,k def = D w k t ,m t E to h ′ t,k +⟨p t ,g t −h ′ t ⟩, the same trick used in Application 4 of Section 2.2. We defer the details to Appendix A.5.4 and show the final bound below. Theorem 9. Letr≤d be the rank of P T t=1 (ℓ t −m t )(ℓ t −m t ) ⊤ . MsMwC-Master with the new definition of h t described above, expert setE MG defined in (A.9), and Λ t = ∆ E MG ensures∀u∈K,Reg(u) = rD + q r P T t=1 ⟨u−w t ,ℓ t −m t ⟩ 2 . This bound generalizes MetaGrad’s guarantee from m t = 0 to general m t and is the analogue of the bound discussed in Application 4 of Section 2.2 for the expert problem. When using m t =ℓ t−1 , our bound preserves all the fast rate consequences discussed in (van Erven and Koolen, 2016; Koolen et al., 2016), while ensuring a bound in terms of only the variation of the loss vectors P t ∥ℓ t −ℓ t−1 ∥ 2 . 32 We remark that MetaGrad also uses a master algorithm to combine similar ONS variants, but the master is “tilted exponential weight” and cannot incorporate general m t . 2.6 Open problems We mention two open questions for the expert problem. 
First, in the case when we are required to select one experti t randomly in each roundt, and the regret againsti is measured by P T t=1 (ℓ t,it −ℓ t,i ), it is unclear how to achieve our bounds such as ˜ O q (lnd) P T t=1 ℓ 2 t,i with high probability (even though our results clearly imply this in expectation). The difficulty lies in handling the deviation between P T t=1 ⟨w t ,ℓ t ⟩ and P T t=1 ℓ t,it and bounding it in terms of only P T t=1 ℓ 2 t,i . We conjecture that impossible tuning might indeed be impossible in this case. Second, note that even though we only focus on having one prediction sequence{m t } t∈[T ] , we can in fact also deal with multiple sequences and learn the best via another expert algorithm, similarly to (Rakhlin and Sridharan, 2013a). One caveat is that the trick we apply in Implications 2–4 of Section 2.2 (that m t can depend on ℓ t even though it is unknown) does not work anymore, since different experts might be using different sources of predictions and thus the calculation of w t does require knowing all predictions at the beginning of roundt. Due to this issue, we for example cannot achieve a bound in the form of ∀i∈ [d], Reg(e i ) = ˜ O v u u t (lnd) min ( T X t=1 ℓ 2 t,i , T X t=1 (ℓ t,i −⟨w t ,ℓ t ⟩) 2 ) . We leave the possibility of achieving such a bound as an open problem. 33 Chapter 3 Adaptive Algorithms for Bandits 3.1 Overview: bandits with predictions, and path-length bounds In Section 3.2–Section 3.4, we consider the multi-armed bandit problem defined in Protocol 2 with A = [K]. The classic Exp3 algorithm (Auer et al., 2002b) achieves a regret bound of order ˜ O( √ TK) after T rounds, which is worst-case optimal up to logarithmic factors. There are several existing works on deriving more adaptive bandit algorithms, replacing the dependence onT in the regret bound by some data-dependent quantity that isO(T ) in the worst-case but could be potentially much smaller in benign environments. Examples of such data-dependent quantities include the loss of the best arm (Allenberg et al., 2006; Foster et al., 2016) or the empirical variance of all arms (Hazan and Kale, 2011; Bubeck et al., 2017a). These adaptive algorithms not only enjoy better performance guarantees, but also have important applications for other areas such as game theory (Foster et al., 2016). In Section 3.2–Section 3.4, we focus on the path-length bound that has been extensively studied for the full-information setting (for example, in Chapter 2 and the references therein) but has never been studied for bandits prior to our work. Our algorithm falls into the Optimistic Online Mirror Descent (Optimistic OMD) framework (Rakhlin and Sridharan, 2013a) with the “log-barrier” as the 34 regularizer, originally proposed in (Foster et al., 2016), though there are several extra elements that are required to derive our results. In Section 3.2, we use V 1 = P T t=1 ∥ℓ t −ℓ t−1 ∥ 1 to quantify the path-length (with ℓ 0 := 0); in Section 3.3, we use V ⋆ = P T t=1 |ℓ t,i⋆ −ℓ t−1,i⋆ | where i ⋆ is the best arm; in Section 3.4, we use V ∞ = P T t=1 ∥ℓ t −ℓ t−1 ∥ ∞ . We obtain near-optimal regret bounds (up to logarithmic factors) with respect to these notions of path-length. Though these quantities are closely related, we have to use different tricks for them. For simplicity, we assume that these quantities are known when tuning the optimal learning rate. 
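For concreteness, the three path-length quantities can be computed from a loss sequence as follows; this is a small sketch, and we take the "best arm" $i_\star$ to be the arm with the smallest cumulative loss, which is how it enters the bounds.

```python
import numpy as np

def path_length_quantities(L):
    # L: (T, K) array of losses l_{t,i}; l_0 is taken to be the zero vector.
    diffs = np.diff(np.vstack([np.zeros(L.shape[1]), L]), axis=0)   # l_t - l_{t-1}
    V1 = np.abs(diffs).sum()                         # sum_t ||l_t - l_{t-1}||_1
    Vinf = np.abs(diffs).max(axis=1).sum()           # sum_t ||l_t - l_{t-1}||_inf
    i_star = int(np.argmin(L.sum(axis=0)))           # best arm in hindsight (assumed meaning)
    Vstar = np.abs(diffs[:, i_star]).sum()           # sum_t |l_{t,i*} - l_{t-1,i*}|
    return V1, Vstar, Vinf
```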
In Section 3.5, we consider linear bandits (Protocol 5), and design algorithms that achieve a path-length regret bound quantified by V 2 = P T t=1 ∥ℓ t −ℓ t−1 ∥ 2 . The algorithm is based on an optimistic version of the SCRiBLe algorithm (Abernethy et al., 2008; Rakhlin and Sridharan, 2013a). In Section 3.6, we consider contextual bandits. We consider a general setting where in each round before the learner chooses an action, he observes a loss predictor m t ∈ [−1, 1] A . Clearly, if m t ≈ℓ t , the learner should be able to achieve a better regret than the worst-case regret bound. We design algorithms that achieve the optimal regret in this setting. NotationusedinSection3.2–Section3.4. Wedenotethelog-barrierfunctionasψ(u) = P K i=1 1 η i ln 1 u i for some learning ratesη 1 ,...,η K ≥ 0 andu∈ ∆ K . Withh(y)≜y−1−lny, the Bregman divergence with respect to the log-barrier is: D ψ (u,v) = P K i=1 1 η i ln v i u i + u i −v i v i = P K i=1 1 η i h u i v i . We define α i (t) to be the most recent time when arm i is picked prior to round t, that is, α i (t) = max{s<t : i s =i} (or 0 if the set is empty). 35 Algorithm 12: Barrier-Regularized with Optimism and ADaptivity Online Mirror Descent (Broad- OMD) 1 Define : ψ t (w) = P K i=1 1 η t,i ln 1 w i . 2 Initialize: w ′ 1 = argmin w∈∆ K ψ 1 (w). 3 for t = 1, 2,...,T do 4 w t = argmin w∈∆ K ⟨w,m t ⟩ +D ψt (w,w ′ t ) . 5 Draw i t ∼w t , suffer loss ℓ t,it , and observe ℓ t,it . 6 Let ˆ ℓ t,i = (ℓ t,i −m t,i )1[it=i] w t,i +m t,i . 7 w ′ t+1 = argmin w∈∆ K ⟨w, ˆ ℓ t ⟩ +D ψt (w,w ′ t ) . 3.2 Path-length bounds for multi-armed bandits I (V 1 ) As mentioned, our algorithm falls into the optimistic OMD framework, which has been studied by (Rakhlin and Sridharan, 2013a). For the result in this section, we directly apply the vanilla optimistic OMD (see also Algorithm 12): w t = argmin w∈∆ K ⟨w,m t ⟩ +D ψt (w,w ′ t ) , (3.1) w ′ t+1 = argmin w∈∆ K n ⟨w, ˆ ℓ t ⟩ +D ψt (w,w ′ t ) o (3.2) with a specially chosen loss predictor m t generated by the algorithm itself (discussed later). It is well known that the classic Exp3 algorithm falls into this framework with m t =0 and ψ t being the (negative) entropy. To obtain our results, it is crucial to use the log-barrier as the regularizer instead, that is, ψ t (w) = P K i=1 1 η t,i ln 1 w i . For simplicity, we assume that V 1 is known by the learner, and we fix η t,i =η (η will be tuned according to V 1 ). We start with a general lemma that holds no matter what regularizer ψ t is used and what m t and ˆ ℓ t are. 36 Lemma 2. For the update rules (3.1) and (3.2), we have for all u∈ ∆ K , ⟨w t −u, ˆ ℓ t ⟩≤D ψt (u,w ′ t )−D ψt (u,w ′ t+1 ) +⟨w t −w ′ t+1 , ˆ ℓ t −m t ⟩−A t , where A t ≜D ψt (w ′ t+1 ,w t ) +D ψt (w t ,w ′ t )≥ 0. The proof is standard as in typical OMD analysis, and we defer it to Appendix B.2. The next theorem then shows how the term⟨w t −w ′ t+1 , ˆ ℓ t −m t ⟩ is further bounded whenψ t is the log-barrier as in Broad-OMD. Theorem 10. If the following three conditions hold for all t,i: (i) η≤ 1 162 , (ii) w t,i | ˆ ℓ t,i −m t,i |≤ 3, (iii) η P K i=1 w 2 t,i ( ˆ ℓ t,i −m t,i ) 2 ≤ 1 18 , then Broad-OMD guarantees for any u∈ ∆ K , T X t=1 ⟨w t −u, ˆ ℓ t ⟩≤ K X i=1 ln w ′ 1,i u i η + 3η T X t=1 K X i=1 w 2 t,i ( ˆ ℓ t,i −m t,i ) 2 − T X t=1 A t . (3.3) See Appendix B.2 for the proof. If we configure Broad-OMD with m t,i = ℓ α i (t),i and ˆ ℓ t,i = (ℓ t,i −m t,i )1{it=i} w t,i +m t,i , then the key term in Eq. 
(3.3) can be bounded as follows: T X t=1 K X i=1 w 2 t,i ( ˆ ℓ t,i −m t,i ) 2 = T X t=1 K X i=1 (ℓ t,i −ℓ α i (t),i ) 2 1{i t =i} = K X i=1 X t:it=i (ℓ t,i −ℓ α i (t),i ) 2 ≤ 2 K X i=1 X t:it=i |ℓ t,i −ℓ α i (t),i |≤ 2 K X i=1 X t:it=i t X s=α i (t)+1 |ℓ s,i −ℓ s−1,i |≤ 2 K X i=1 V T,i . (3.4) With this calculation, we obtain the following corollary. Corollary 11. Broad-OMD with m t,i = ℓ α i (t),i , ˆ ℓ t,i = (ℓ t,i −m t,i )1{it=i} w t,i +m t,i , and η t,i = η≤ 1 162 guarantees E [Reg]≤O K lnT η + 6η K X i=1 V T,i ≤O K lnT η +η K X i=1 V T,i ! . 37 Algorithm 13: Broad-OMD+ 1 Define: κ =e 1 lnT , ψ t (w) = P K i=1 1 η t,i ln 1 w i . 2 Initialize: w ′ 1,i = 1/K, ρ 1,i = 2K for all i∈ [K]. 3 for t = 1, 2,...,T do 4 w t = argmin w∈∆ K ⟨w,m t ⟩ +D ψt (w,w ′ t ) . 5 ¯ w t = (1− 1 T )w t + 1 KT 1. 6 Draw i t ∼ ¯ w t , suffer loss ℓ t,it , and let ˆ ℓ t,i = (ℓ t,i −m t,i )1{it=i} ¯ w t,i +m t,i . 7 Let a t,i = 6η t,i w t,i ( ˆ ℓ t,i −m t,i ) 2 . 8 w ′ t+1 = argmin w∈∆ K ⟨w, ˆ ℓ t +a t ⟩ +D ψt (w,w ′ t ) . 9 for i = 1,...,K do 10 if 1 ¯ w t,i >ρ t,i then ρ t+1,i = 2 ¯ w t,i , η t+1,i =κη t,i . 11 else ρ t+1,i =ρ t,i , η t+1,i =η t,i . Wepointoutthatwecanapplyadoublingtricktomakethealgorithmparameter-free(i.e., require no prior knowledge on V 1 ). The idea is that as long as the observable term 3η P t s=1 P K i=1 w 2 s,i ( ˆ ℓ s,i − m s,i ) 2 in (3.3) becomes larger than K lnT η at some round t, we halve the learning rate η and restart the algorithm. This gives the regret bound of ˜ O q K P K i=1 V T,i +K . 3.3 Path-length bounds for multi-armed bandits II (V ⋆ ) In this section we focus on a variant of Broad-OMD with a correction term a t (see Algorithm 13): w t = argmin w∈∆ K ⟨w,m t ⟩ +D ψt (w,w ′ t ) , (3.5) w ′ t+1 = argmin w∈∆ K n ⟨w, ˆ ℓ t +a t ⟩ +D ψt (w,w ′ t ) o . (3.6) We first show ageneral lemma that update rules (3.5) and(3.6) guarantee, no matter what regularizer ψ t is used and what a t ,m t , and ˆ ℓ t are. The proof is deferred to Appendix B.3. Lemma 3. For the update rules (3.5) and (3.6), if the following condition holds: ⟨w t −w ′ t+1 , ˆ ℓ t −m t +a t ⟩≤⟨w t ,a t ⟩, (3.7) 38 then for all u∈ ∆ K , we have ⟨w t −u, ˆ ℓ t ⟩≤D ψt (u,w ′ t )−D ψt (u,w ′ t+1 ) +⟨u,a t ⟩−A t , (3.8) where A t ≜D ψt (w ′ t+1 ,w t ) +D ψt (w t ,w ′ t )≥ 0. The important part of bound (3.8) is the term⟨u,a t ⟩, which allows us to derive regret bounds that depend on only the comparator u. The key is now how to configure the algorithm such that condition (3.7) holds, while leading to a reasonable bound (3.8) at the same time. In the work of (Steinhardt and Liang, 2014a) for full-information problems, a t can be defined as a t,i =η t,i (ℓ t,i −m t,i ) 2 , which suffices to derive many interesting results. However, in the bandit setting this is not applicable since ℓ t is unknown. The natural first attempt is to replace ℓ t by ˆ ℓ t , but one would quickly realize the common issue in the bandit literature: ˆ ℓ t,i is often constructed via inverse propensity weighting, and thus ( ˆ ℓ t,i −m t,i ) 2 can be of order 1/w 2 t,i , which is too large. Based on this observation, our choice for a t is a t,i = 6η t,i w t,i ( ˆ ℓ t,i −m t,i ) 2 (the constant 6 is merely for technical reasons). The extra term w t,i can then cancel the aforementioned large term 1/w 2 t,i in expectation, similar to the classic trick done in the analysis of Exp3 (Auer et al., 2002b). Note that with a smaller a t , condition (3.7) becomes more stringent. 
The entropy regularizer used in (Steinhardt and Liang, 2014a) no longer suffices to maintain such a condition. Instead, it turns out that the log-barrier regularizer addresses the issue, as shown below. Theorem 12. If the following three conditions hold for all t,i: (i) η t,i ≤ 1 162 , (ii) w t,i | ˆ ℓ t,i −m t,i |≤ 3, (iii) P K i=1 η t,i w 2 t,i ( ˆ ℓ t,i −m t,i ) 2 ≤ 1 18 , then Broad-OMD+ with a t,i = 6η t,i w t,i ( ˆ ℓ t,i −m t,i ) 2 guarantees condition (3.7). Moreover, it guarantees for any u∈ ∆ K (recall h(y) =y− 1− lny≥ 0), T X t=1 ⟨w t −u, ˆ ℓ t ⟩≤ K X i=1 ln w ′ 1,i u i η 1,i + T X t=1 1 η t+1,i − 1 η t,i ! h u i w ′ t+1,i ! + T X t=1 ⟨u,a t ⟩. (3.9) 39 See Appendix B.3 for the proof. Note that h(·) is always non-negative. Therefore, if the sequence{η t,i } T +1 t=1 is non-decreasing for alli, the term P T t=1 1 η t+1,i − 1 η t,i h u i w ′ t+1,i in bound (3.9) is non-positive. One can now derive different results using Theorem 12 with specific choices of ˆ ℓ t and m t . Corollary 13. Broad-OMD+ with a t,i = 6η t,i w t,i ( ˆ ℓ t,i − m t,i ) 2 , any m t,i ∈ [−1, 1], ˆ ℓ t,i = (ℓ t,i −m t,i )1{it=i} w t,i +m t,i , and η t,i = η ≤ 1 162 (ignoring Line 9–Line 11 in Algorithm 13) enjoys the following regret bound: E [Reg] =E " T X t=1 (ℓ t,it −ℓ t,i ∗) # ≤ K lnT η + 6ηE " T X t=1 (ℓ t,i ∗−m t,i ∗) 2 # . One can see that the expected regret in Corollary 13 only depends on the squared estimation error of m t,i ∗! This is exactly the counterpart of results in (Steinhardt and Liang, 2014a). Again, we set m t,i to be the most recent observed loss of arm i, that is, m t,i =ℓ α i (t),i , where α i (t) is defined in Section 3.1. We examine the key term P T t=1 ⟨u,a t ⟩ in Theorem 12 after plugging in u =e i for some arm i, m t,i =ℓ α i (t),i , and ˆ ℓ t,i = (ℓ t,i −m t,i )1{it=i} w t,i +m t,i . We assume η t,i =η for simplicity and also use the fact w t,i | ˆ ℓ t,i −m t,i |≤ 2. We then have T X t=1 ⟨u,a t ⟩ = 6η T X t=1 w t,i ( ˆ ℓ t,i −ℓ α i (t),i ) 2 ≤ 12η T X t=1 | ˆ ℓ t,i −ℓ α i (t),i | = 12η X t:it=i |ℓ t,i −ℓ α i (t),i | w t,i ≤ 12η X t:it=i P t s=α i (t)+1 |ℓ s,i −ℓ s−1,i | w t,i ≤ 12η max t∈[T ] 1 w t,i ! V T,i . (3.10) Therefore, the term P T t=1 ⟨u,a t ⟩ is close to the first-order path-length but with an extra factor max t∈[T ] 1 w t,i . To cancel this potentially large factor, we adopt the increasing learning rate schedule developed in (Agarwal et al., 2017). The idea is that the term h u i w ′ t+1,i in Eq. (3.9) is close to 40 1 w t+1,i ifu i is close to 1. If the learning rate is increased whenever a large 1 w t+1,i is encountered, then 1 η t+1,i − 1 η t,i h u i w ′ t+1,i becomes a large negative term in terms of −1 w t+1,i , which exactly compensates for the term P T t=1 ⟨u,a t ⟩. To avoid the learning rates from being increased by too much, similarly to (Agarwal et al., 2017) we use some individual threshold (ρ t,i ) to decide when to increase the learning rate and update these thresholds in some doubling manner. Also, we mix w t with a small amount of uniform exploration to further ensure that it cannot be too small. The final algorithm, called Broad-OMD+, is presented in Algorithm 13. We have the following theorem, whose proof can be found in Appendix B.3. Theorem 14. Broad-OMD+ with m t,i =ℓ α i (t),i and η 1,i =η≤ 1 810 guarantees E [Reg]≤ 2K lnT η +E[ρ T +1,i ∗] −1 40η lnT + 90ηV T,i ∗ +O (1) when T≥ 3. Picking η = min n 1 810 , 1 60 √ V T,i ∗ lnT o so that the second term is non-positive leads to E [Reg] = ˜ O K p V T,i ∗ +K . 
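Computationally, Broad-OMD and Broad-OMD+ reduce to the same primitive: a mirror-descent step under the log-barrier regularizer, plus, for Broad-OMD+, the correction $a_{t,i}=6\eta_{t,i}w_{t,i}(\hat{\ell}_{t,i}-m_{t,i})^2$ and the threshold rule of Lines 9-11. The sketch below shows one round; the bisection solver for the log-barrier step and all helper names are illustrative choices of ours (in Algorithm 13, $\kappa = e^{1/\ln T}$).

```python
import numpy as np

def logbarrier_omd_step(w_prev, cost, eta):
    # argmin_{w in simplex} <w, cost> + D_psi(w, w_prev) for psi(w) = sum_i (1/eta_i) ln(1/w_i).
    # The first-order condition gives w_i = 1 / (1/w_prev_i + eta_i * (cost_i + mu));
    # bisect on the multiplier mu until the weights sum to 1.
    def weights(mu):
        return 1.0 / (1.0 / w_prev + eta * (cost + mu))
    lo = np.max(-cost - 1.0 / (eta * w_prev))
    lo += 1e-9 * (abs(lo) + 1.0)                             # nudge just inside the feasible region
    hi = -cost.min() + len(cost) / eta.min() + 1.0           # sum(weights(hi)) < 1
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if weights(mid).sum() > 1.0 else (lo, mid)
    w = weights(0.5 * (lo + hi))
    return w / w.sum()

def broad_omd_plus_round(rng, w_aux, eta, rho, m, losses, T, kappa):
    # One round of Broad-OMD+ (Algorithm 13), sketch.  losses is the (unobserved)
    # loss vector l_t, of which only the pulled entry is used.
    K = len(w_aux)
    w = logbarrier_omd_step(w_aux, m, eta)                   # w_t (optimistic step)
    w_bar = (1.0 - 1.0 / T) * w + 1.0 / (K * T)              # mix in uniform exploration
    i_t = rng.choice(K, p=w_bar / w_bar.sum())
    ell_hat = m.copy()
    ell_hat[i_t] += (losses[i_t] - m[i_t]) / w_bar[i_t]      # hat l_{t,i}
    a = 6.0 * eta * w * (ell_hat - m) ** 2                   # correction a_{t,i}
    w_aux_next = logbarrier_omd_step(w_aux, ell_hat + a, eta)
    grow = (1.0 / w_bar) > rho                               # Lines 9-11: raise eta_i when w_bar_i is small
    rho_next = np.where(grow, 2.0 / w_bar, rho)
    eta_next = np.where(grow, kappa * eta, eta)
    return w_bar, i_t, w_aux_next, eta_next, rho_next
```

Dropping the correction, the threshold rule, and the uniform mixing recovers the plain Broad-OMD round used in Section 3.2.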
3.4 Path-length bounds for multi-armed bandits III ($V_\infty$)

We propose a new algorithm that improves the result in Section 3.2 from $\tilde{\mathcal{O}}(\sqrt{KV_1})$ to $\tilde{\mathcal{O}}(\sqrt{KV_\infty})$. Our algorithm is also based on the optimistic mirror descent framework:
$$
x_t = \operatorname*{argmin}_{x\in\Delta_K}\left\{\langle x,m_t\rangle + D_\psi(x,x'_t)\right\}, \qquad x'_{t+1} = \operatorname*{argmin}_{x\in\Delta_K}\left\{\langle x,\hat{\ell}_t\rangle + D_\psi(x,x'_t)\right\},
$$
where $\hat{\ell}_t$ is set to the unbiased estimator $\hat{\ell}_{t,i} = \frac{\ell_{t,i}-m_{t,i}}{w_{t,i}}\mathbf{1}\{i_t=i\} + m_{t,i}$.

Algorithm 14:
1 Define: $\psi(x) = \frac{1}{\eta}\sum_{i=1}^{K}\ln\frac{1}{x_i}$ for some learning rate $\eta$; parameter $\alpha\in(0,1)$.
2 Initialize: $w_1$ is the uniform distribution, $c_0 = 0$.
3 for $t = 1,2,\ldots,T$ do
4   Play $i_t \sim w_t$ and observe $c_t = \ell_{t,i_t}$.
5   Construct unbiased estimator $\hat{\ell}_t$ s.t. $\hat{\ell}_{t,i} = \frac{\ell_{t,i}-c_{t-1}}{w_{t,i}}\mathbf{1}\{i_t=i\} + c_{t-1}$ for all $i$.
6   Update $x_{t+1} = \operatorname*{argmin}_{x\in\Delta_K}\left\{\langle x,\hat{\ell}_t\rangle + D_\psi(x,x_t)\right\}$.
7   $w_{t+1} = (1-\alpha_{t+1})x_{t+1} + \alpha_{t+1}e_{i_t}$, where $\alpha_{t+1} = \frac{\alpha(1-c_t)}{1+\alpha(1-c_t)}$.
8 end

Our algorithm makes the following two modifications compared to the algorithm in Section 3.2 (see Algorithm 14 for the pseudocode). First, we simply use the observed loss at time $t$ as the optimistic prediction for all arms at time $t+1$. Formally, we set $m_{t+1,i} = c_t \triangleq \ell_{t,i_t}$ for all $i$. Note that in this case $\langle x,m_t\rangle = c_{t-1}$ for any $x\in\Delta_K$ and thus $x_t = \operatorname*{argmin}_{x\in\Delta_K}\{\langle x,m_t\rangle + D_\psi(x,x'_t)\} = x'_t$, meaning that we only need to maintain one sequence (Line 6). Second, instead of using $x_{t+1}$ to sample $i_{t+1}$, we slightly bias towards the most recently picked arm by moving a small fraction $\alpha_{t+1}$ of each arm's weight to arm $i_t$, where $\alpha_{t+1} = \frac{\alpha(1-c_t)}{1+\alpha(1-c_t)}$ for some fixed parameter $\alpha$ (Line 7). Note that the smaller the loss of arm $i_t$ is, the more we bias towards this arm, but the correlation is in some nonlinear form. Such bias is intuitive in a slowly changing environment where we expect a good arm to remain reasonably good for a while.

In the next theorem we formally show the improved regret bound of our algorithm. The proof is deferred to Appendix B.4.

Theorem 15. Algorithm 14 with $\eta\le\frac{1}{162}$ and $\alpha = 8\eta$ ensures
$$
\mathbb{E}[\mathrm{Reg}] = \mathcal{O}\left(\frac{K\ln T}{\eta} + \eta\,\mathbb{E}\left[\sum_{t=1}^{T-1}|\ell_{t+1,i_t}-\ell_{t,i_t}|\right]\right) = \mathcal{O}\left(\frac{K\ln T}{\eta} + \eta V_\infty\right)
$$
for an adaptive adversary. Picking the optimal $\eta$ leads to regret bound $\mathcal{O}(\sqrt{KV_\infty\ln T} + K\ln T)$.

Algorithm 15:
1 Define: $\psi(x)$ is a $\nu$-self-concordant barrier; learning rate $\eta$. $\mathcal{B}$ is the unit ball.
2 Initialize: $x_1 = x'_1 = \operatorname*{argmin}_{x\in\Omega}\psi(x)$ and $m_1 = \mathbf{0}$.
3 for $t = 1,2,\ldots,T$ do
4   Compute eigendecomposition $\nabla^2\psi(x_t) = \sum_{i=1}^{d}\lambda_{t,i}v_{t,i}v_{t,i}^\top$.
5   Sample $i_t\in[d]$ and $\sigma_t\in\{-1,+1\}$ uniformly at random.
6   Play $w_t = x_t + \frac{\sigma_t}{\sqrt{\lambda_{t,i_t}}}v_{t,i_t}$ and observe $c_t = \langle w_t,\ell_t\rangle$.
7   Construct unbiased estimator $\hat{\ell}_t = d(c_t - \langle w_t,m_t\rangle)\sigma_t\sqrt{\lambda_{t,i_t}}\,v_{t,i_t} + m_t$.
8   Set $m_{t+1} = \mathrm{Proj}_{\mathcal{B}}\left(m_t - \frac{1}{4}(\langle w_t,m_t\rangle - c_t)w_t\right)$.
9   Update $x'_{t+1} = \operatorname*{argmin}_{x\in\Omega}\left\{\eta\langle x,\hat{\ell}_t\rangle + D_\psi(x,x'_t)\right\}$.
10  Update $x_{t+1} = \operatorname*{argmin}_{x\in\Omega}\left\{\eta\langle x,m_{t+1}\rangle + D_\psi(x,x'_{t+1})\right\}$.

3.5 Path-length bounds for linear bandits

In this section we move on to the more general linear bandit problem (Protocol 5). We assume that the decision sets of the learner and the adversary are contained in unit balls. Our algorithm (see Algorithm 15 for the pseudocode) is based on the optimistic SCRiBLe algorithm (Abernethy et al., 2008; Hazan and Kale, 2011; Rakhlin and Sridharan, 2013a) with new optimistic predictions. Specifically, optimistic SCRiBLe is again an instance of general optimistic mirror descent. The regularizer is any $\nu$-self-concordant barrier of the decision set for some $\nu > 0$.
Having the point x t , the algorithm uniformly at random selects one of the 2d endpoints of the principal axes of the unit Dikin ellipsoid centered at x t , as the final action w t (Lines 4, 5 and 6). After observing the loss c t =⟨w t ,ℓ t ⟩, the algorithm then constructs an unbiased loss estimator (Line 7) and uses it in the next optimistic mirror descent update (Line 9 and Line 10; note that the learning rate η is explicitly spelled out here). We refer the readers to (Abernethy et al., 2008; Rakhlin and Sridharan, 2013a) for more detailed explanation of the (optimistic) SCRiBLe algorithm. 43 For any optimistic prediction sequence m 1 ,...,m T , Rakhlin and Sridharan (2013a) show that the regret of optimistic SCRiBLe is bounded as E[Reg] =O ν lnT η +ηd 2 E " T X t=1 ⟨w t ,ℓ t −m t ⟩ 2 #! . (3.11) It remains to specify how to pick the optimistic predictions m 1 ,...,m T such that the last term above P T t=1 ⟨w t ,ℓ t −m t ⟩ 2 is close to the path-length of the loss sequence ℓ 1 ,...,ℓ T . This is trivial in the full information setting where one observes ℓ t at the end of round t and can simply set m t =ℓ t−1 . In the bandit setting, however, only c t =⟨w t ,ℓ t ⟩ is observed and the problem becomes more challenging. In the next subsections we propose two approaches, one through a reduction to obtaining dynamic regret in an online learning problem with full information, and another via a further reduction to an instance of convex body chasing. As a side result, we obtain new dynamic regret bounds that may be of independent interest. Rakhlin and Sridharan (2013a) suggest treating the problem of selecting m t as another online learning problem. Specifically, consider the following online learning formulation: at each time t the algorithm selectsm t ∈B and then observes the loss functionf t (m) =⟨w t ,ℓ t −m⟩ 2 = (c t −⟨w t ,m⟩) 2 . Note that this is a full information problem even though ℓ t is unknown and is in fact the standard problem of online linear regression with squared loss. Further observe that applying Online Newton Step (Hazan et al., 2007) to learn m t ensures T X t=1 f t (m t )≤ min m ⋆ ∈B T X t=1 f t (m ⋆ ) +O(d lnT )≤ min m ⋆ ∈B T X t=1 ∥ℓ t −m ⋆ ∥ 2 2 +O(d lnT ). Picking m ⋆ = 1 T P T s=1 ℓ s and combining the above with (3.11) immediately recovers the main result of (Hazan and Kale, 2011) with a different approach. (This observation was not made in (Rakhlin and Sridharan, 2013a) though.) 44 However, competing with a fixed m ⋆ is not adequate for getting a path-length bound. Instead in this case we need a dynamic regret bound (Zinkevich, 2003) that allows the algorithm to compete with some sequence m ⋆ 1 ,...,m ⋆ T instead of a fixed m ⋆ . Typical dynamic regret bounds depend on either the variation of the loss functions or the competitor sequence (Jadbabaie et al., 2015; Mokhtari et al., 2016; Yang et al., 2016; Zhang et al., 2017, 2018), and here we need the latter one. Specifically, Yang et al. (2016) discover that projected gradient descent with a constant learning rate ensures for any minimizer sequence m ⋆ 1 ∈ argmin m∈B f 1 (m),...,m ⋆ T ∈ argmin m∈B f T (m), T X t=1 f t (m t )− T X t=1 f t (m ⋆ t )≤O L T X t=2 m ⋆ t −m ⋆ t−1 2 ! (3.12) as long as the following assumption holds: Assumption 1. Each f t is convex and L-smooth (that is, for any m,m ′ ∈B, f t (m)≤ f t (m ′ ) + ⟨∇f t (m ′ ),m−m ′ ⟩ + L 2 ∥m−m ′ ∥ 2 2 ). Additionally,∇f t (m ⋆ ) = 0 for any m ⋆ ∈ argmin m∈B f t (m). It is clear that f t (m) = (c t −⟨w t ,m⟩) 2 satisfies Assumption 1 with L = 4. 
Also note that Option I in Line 8 is exactly doing projected gradient descent with f t (we define Proj K (m) = argmin m ′ ∈K ∥m−m ′ ∥). Therefore picking m ⋆ t =ℓ t and combining (3.12) and (3.11) immediately imply the following. Corollary 16. Algorithm 15 ensures Reg =O ν lnT η +ηd 2 E h P T t=1 ∥ℓ t −ℓ t−1 ∥ 2 i , which is of order ˜ O d √ νV 2 with the optimal η. 3.6 Contextual bandits with loss predictors As we see in Chapter 2 and Section 3.2–Section 3.5, the framework of online learning with loss predictors is useful for the expert problem and multi-armed/linear bandits in obtaining adaptive regret bounds (we have mostly focused on path-length bounds). In these problems, the minimax 45 regret bound is Θ( √ T), but the adaptive algorithms we design achieve regret bounds of order O( √ E), whereE = P T t=1 ∥ℓ t −m t ∥ 2 ∞ is the total error of the predictions. However, is the same bound achievable in contextual bandits (Protocol 7)? In this section, we take the first attempt at addressing this question. The main message is that good predictors indeed help reduce regret for contextual bandits, but not to the same extent as in non-contextual settings. We consider both the adversarial setting and the stochastic setting. In the former, the sequence (x t ,ℓ t ,m t ) T t=1 can be arbitrary and even depend on the learner’s strategy. For simplicity we assume it is decided ahead of time before the game starts (i.e., the oblivious setting). In the latter, each triple (x t ,m t ,ℓ t ) is drawn independently from a fixed and unknown distribution D. It is well known that the optimal worst-case regret isO( √ dT) where we define d≜ K lnN with N =|Π| being the number of policies. The key question we address in this work is whether one could improve upon this worst-case bound when the predictor is accurate. More specifically, we denote the total loss of the predictor byE≜ P T t=1 ∥ℓ t −m t ∥ 2 ∞ for the adversarial setting and E≜TE (x,ℓ,m)∼D ∥ℓ−m∥ 2 ∞ for the stochastic setting, and we ask the following question: (Q1) Can we improve the regret overO( √ dT ) ifE =o(T )? Note that for the special case of multi-armed bandits where Π consists of K constant mappings that always pick one of the K actions (that is, contexts are ignored), in Section 3.2 we have shown thatO( √ dE) is achievable, an improvement overO( √ dT) as long asE =o(T). A natural guess would be that the same holds true for contextual bandits. However, somewhat surprisingly, in the following theorem we show that this is not the case (proofs for all lower bounds are deferred to Appendix B.5.2). Theorem 17. For any algorithm and any value V ∈ [0,T ], there exists an environment (which can be stochastic or adversarial) withE≤V such that Reg = e Ω min √ V (KT ) 1 4 , √ KT . 46 This theorem gives a negative answer to (Q1) whenE = Ω( √ T). Even whenE =O(1), the theorem shows that the best one can achieve isO(T 1 4 ), a sharp contrast with the non-contextual case. In Sections 3.6.1 and 3.6.2, we develop algorithms with matching upper bounds for adversarial and stochastic environments, respectively, thus completely answering (Q1) and confirming that loss predictors are helpful if and only if E =o( √ T ). 3.6.1 Algorithms for adversarial environments In this section, we describe our algorithm for the adversarial setting. Similar to existing works on online learning with loss predictors, our algorithm is based on the optimistic Online Mirror Descent (OMD) framework (Rakhlin and Sridharan, 2013a). 
In particular, with the entropy regularizer, the optimistic OMD update maintains a sequence of distributions Q ′ 1 ,...,Q ′ T ∈ ∆ Π such that Q ′ t+1 (π)∝ Q ′ t (π) exp −η b ℓ t (π(x t )) where η > 0 is the learning rate and b ℓ t is some estimator for ℓ t . Upon seeing a context x t and a predictor m t at time t, the algorithm computes Q t ∈ ∆ Π such that Q t (π)∝Q ′ t (π) exp −ηm t (π(x t )) , and samples a policy according to Q t and follows its suggestion to choose an action a t . Suppose p t ∈ ∆ K is the distribution of a t . Then the standard variance-reduced loss estimator is b ℓ t (a) = (ℓt(a)−mt(a))1[at=a] pt(a) +m t (a). When m t (a) = 0 for all t and a, this is exactly the Exp4 algorithm (Auer et al., 2002c). While optimistic OMD with entropy regularizer has been used for problems with full-information feedback (Steinhardt and Liang, 2014b; Syrgkanis et al., 2015), it in fact cannot be directly applied to the bandit setting since an analysis typically requires b ℓ t (a)−m t (a) to be lower-bounded by−1/η, which does not hold if ℓ t (a t )≤ m t (a t ) and p t (a t ) is too small. Intuitively this is also the hard case because the predictor over-predicts the loss of a good action and prevents the algorithm from realizing it due to the bandit feedback. A naive approach of enforcing uniform exploration so that p t (a t )≥η contributes ηTK regret already, which eventually leads to Ω( √ T ) regret. Indeed, to get 47 around this issue for multi-armed bandits, in Section 3.2–Section 3.4 we used a different regularizer called log-barrier, but this does not work for contextual bandits either since it inevitably introduces polynomial dependence on the number of policies N for the regret. Our solutions. Our first key observation is that, despite the range of the loss estimators, Optimistic Exp4 in fact always guarantees the following (cf. Lemma 33): for any π ∗ ∈ Π, T X t=1 X π∈Π Q t (π) b ℓ t (π(x t ))− T X t=1 b ℓ t (π ∗ (x t ))≤ lnN η + 2η T X t=1 ( b ℓ t (a t )−m t (a t )) 2 . (3.13) Readers familiar with the Exp4 analysis would find that p t (a t ) is missing in the last term compared to the standard analysis when b ℓ t (a)−m t (a)≥−1/η holds. To see why Eq. (3.13) is useful, first take the expectation over both sides so the last term is bounded by 2ηK P t ∥ℓt−mt∥ 2 ∞ minapt(a) . Then consider enforcing uniform exploration so that p t (a)≥µ/K holds for some µ ∈ [0, 1]. Since this contributes µT extra regret, using Eq. (3.13) we have Reg =O( lnN η + ηK 2 E µ +µT ), which, with the optimal tuning of η and µ , already gives a nontrivial bound Reg =O((ET) 1 3 ). This bound is also o( √ T) wheneverE =o( √ T ), but is worse than the bound Reg =O( √ ET 1 4 ) we are aiming for. To further improve the algorithm, we introduce a novel action remapping technique. Specifically, let a ∗ t = argmin a∈[K] m t (a) be the action with smallest predicted loss andA t ={a∈ [K] :m t (a)≤ m t (a ∗ t ) +σ} be the set of actions with predicted loss not larger than that of a ∗ t by σ, for some threshold σ≥ 0. Then, we rename the actions according to a mapping ϕ t : [K]→A t such that ϕ t (a) = a for a∈A t and ϕ t (a) = a ∗ t for a / ∈A t . In other words, we pretend that every action outsideA t was just a ∗ t . We call our algorithm Exp4.OAR and show its complete pseudocode in Algorithm 16. To see why this action remapping is useful, first consider the regret compared to P T t=1 ℓ t (ϕ t (π ∗ (x t ))) due to exploration. 
Algorithm 16: Exp4.OAR: Optimistic Exp4 with Action Remapping
1 Parameter: learning rate $\eta > 0$, threshold $\sigma > 0$, exploration probability $\mu\in[0,1]$.
2 Initialize: $Q'_1(\pi) = \frac{1}{N}$ for all $\pi\in\Pi$.
3 for $t = 1,\ldots,T$ do
4   Receive $x_t$ and $m_t$. Define $a^*_t = \operatorname*{argmin}_{a\in[K]} m_t(a)$, $\mathcal{A}_t = \{a\in[K] : m_t(a)\le m_t(a^*_t)+\sigma\}$, and
$$
\phi_t(a) = \begin{cases} a, & \text{if } a\in\mathcal{A}_t, \\ a^*_t, & \text{else.} \end{cases} \tag{3.14}
$$
5   Calculate $Q_t\in\Delta_\Pi$: $Q_t(\pi)\propto Q'_t(\pi)\exp\left(-\eta\, m_t(\phi_t(\pi(x_t)))\right)$.
6   Calculate $p_t\in\Delta_K$: $p_t(a) = (1-\mu)\sum_{\pi:\phi_t(\pi(x_t))=a} Q_t(\pi) + \frac{\mu}{|\mathcal{A}_t|}\mathbf{1}[a\in\mathcal{A}_t]$.
7   Sample $a_t\sim p_t$ and receive $\ell_t(a_t)$.
8   Construct estimator: $\hat{\ell}_t(a) = \frac{\ell_t(a)-m_t(a)}{p_t(a)}\mathbf{1}[a_t=a] + m_t(a)$ for all $a\in\mathcal{A}_t$.
9   Calculate $Q'_{t+1}\in\Delta_\Pi$: $Q'_{t+1}(\pi)\propto Q'_t(\pi)\exp\left(-\eta\,\hat{\ell}_t(\phi_t(\pi(x_t)))\right)$.

Note that only the actions in $\mathcal{A}_t$ are explored and all actions in this set have predicted loss $\sigma$-close to each other. Therefore, exploration leads to regret $\mu T\sigma + 2\mu\sum_t\|\ell_t-m_t\|_\infty \le \mu T\sigma + 2\mu\sqrt{\mathcal{E}T}$, instead of $\mu T$ for the naive approach. On the other hand, the bias due to remapping, $\ell_t(\phi_t(\pi^*(x_t))) - \ell_t(\pi^*(x_t))$, is either zero if $\pi^*(x_t)\in\mathcal{A}_t$ or at most $2\|\ell_t-m_t\|_\infty - \sigma$ otherwise (by adding and subtracting $m_t(a^*_t)$ and $m_t(\pi^*(x_t))$). Using the AM-GM inequality and summing over $t$ gives $\mathcal{E}/\sigma$. Combining everything we prove the following theorem.

Theorem 18. Exp4.OAR (Algorithm 16) ensures
$$
\mathrm{Reg} \le \frac{\ln N}{\eta} + \frac{2\eta K^2\mathcal{E}}{\mu} + \mu T\sigma + 2\mu\sqrt{\mathcal{E}T} + \frac{\mathcal{E}}{\sigma}.
$$
Picking $\mu = \min\left\{\frac{d}{\sqrt{T}}, 1\right\}$, $\eta = \sqrt{\frac{\mu\ln N}{K^2\mathcal{E}}}$, and $\sigma = \sqrt{\frac{\mathcal{E}}{\mu T}}$ gives $\mathrm{Reg} = \mathcal{O}\left(\sqrt{d\mathcal{E}}\,T^{\frac{1}{4}} + d\sqrt{\mathcal{E}}\right)$.

See Appendix B.5.3 for the complete proof. This theorem indicates that whenever the predictor is good enough with $\mathcal{E} = o(\sqrt{T})$, our algorithm improves over Exp4 and achieves $o(\sqrt{T})$ regret. Note that this bound requires setting the parameters in terms of the quantity $\mathcal{E}$, and in the case when $\mathcal{E} = \Omega(\sqrt{T})$, one can simply switch to Exp4 and achieve regret $\mathcal{O}(\sqrt{dT})$. Therefore, our result indeed matches the lower bound stated in Theorem 17 (except for a slightly worse dependence on $d$).

3.6.2 Algorithms for stochastic environments

In this subsection, we consider learning in a stochastic environment. Recall that a stochastic environment is parameterized by an unknown distribution $\mathcal{D}$ such that each triple $(x_t,\ell_t,m_t)$ is an i.i.d. sample of $\mathcal{D}$ and the total prediction error is $\mathcal{E} = T\,\mathbb{E}_{(x,\ell,m)\sim\mathcal{D}}\|\ell-m\|_\infty^2$. Clearly, this is a special case of the adversarial environment, and our goal is to derive the same results but with oracle-efficient algorithms. Specifically, an ERM oracle is a procedure that takes any set $S$ of context-loss pairs $(x,c)\in\mathcal{X}\times\mathbb{R}^K$ as input and outputs a policy $\mathrm{ERM}(S)\in\operatorname*{argmin}_{\pi\in\Pi}\sum_{(x,c)\in S} c(\pi(x))$. An algorithm is oracle-efficient if its total running time and the number of oracle calls are both polynomial in $T$ and $d$, excluding the running time of the oracle itself. Oracle-efficiency has been proven to be impossible for adversarial environments (Hazan and Koren, 2016), but is achievable for stochastic environments. The simplest oracle-efficient algorithm is $\epsilon$-greedy (Langford and Zhang, 2008), with suboptimal regret $\mathcal{O}(T^{\frac{2}{3}})$. However, somewhat surprisingly, we are able to build our algorithm on top of $\epsilon$-greedy and achieve optimal results when $\mathcal{E} = o(\sqrt{T})$. We first review the $\epsilon$-greedy algorithm and its analysis, and point out the difficulties of improving the regret with the help of loss predictors.
At each round t, with probability µ the algorithm samples an action a t uniformly at random, and with probability 1−µ the algorithm follows the empirically best policy π t = ERM {x s , b ℓ s } s<t by choosing a t =π t (x t ), where b ℓ s is the standard importance-weighted estimator for round s. By standard concentration arguments (Freedman inequality), it holds with high probability that the difference between the average estimated loss and the expected loss of any policy π, 1 t P t s=1 b ℓ s (π(x s ))−E (x,ℓ,m)∼D [ℓ(π(x))] , is at most ˜ O 1 t q (lnN) P t s=1 V s b ℓ s (π(x s )) + d µt where V s [ b ℓ s (π(x s ))] is the conditional variance (given everything before round s) and is at most K /µ . By the optimality of π t , it is then clear that the total regret of following the empirically best policy is 50 ˜ O P t p d /µt + d /µt = ˜ O p dT /µ + d /µ . Further taking the uniform exploration into account shows that the regret ofϵ-greedy has three components: the variance term ˜ O( p dT /µ ), the lower-order term ˜ O( d /µ ), and the exploration termO(µT ). Picking the optimal µ givesO T 2 3 regret. To improve the bound, we improve each of these three terms as described below. Improving variance/exploration terms via action remapping. One natural idea to improve the variance term is to deploy the same variance-reduced (also known as doubly-robust) estimator b ℓ t (a) = (ℓt(a)−mt(a))1[at=a] /pt(a) +m t (a) as in the adversarial case. However, by the law of total variance, we have V t h b ℓ t (π(x t )) i =E xt,mt,ℓt h V at h b ℓ t (π(x t )) x t ,m t ,ℓ t ii +V xt,mt,ℓt h E at h b ℓ t (π(x t )) x t ,m t ,ℓ t ii , where one can verify that the first term is at most KE µT , but the second term is justV xt,mt,ℓt [ℓ t (π(x t ))] and is not related toE. Simply bounding the second term by 1 leads to Ω( √ T ) regret already. We propose to address the issue above by first shifting the variance-reduced estimator by m t (a ∗ t ), where a ∗ t = argmin a∈[K] m t (a) is again the action with the smallest predicted loss. In other words, we use a new biased estimator: e ℓ t (a) = b ℓ t (a)−m t (a ∗ t ) = ℓ t (a)−m t (a) p t (a) 1[a t =a] +m t (a)−m t (a ∗ t ). Moreover, we apply the same action remapping technique using the mapping ϕ t : [K]→A t as in the adversarial case (Eq. (3.14)). To see why this is useful, note that the variance term now becomes V t h e ℓ t (ϕ t (π(x t ))) i ≤E xt,mt,ℓt,at b ℓ t (ϕ t (π(x t )))−m t (a ∗ t ) 2 ≤ 2E xt,mt,ℓt,at " (ℓ t (ϕ t (π(x t )))−m t (ϕ t (π(x t )))) 2 p 2 t (ϕ t (π(x t ))) 1 [a t =ϕ t (π(x t ))] # 51 + 2E xt,mt h (m t (ϕ t (π(x t )))−m t (a ∗ t )) 2 i (using (a +b) 2 ≤ 2a 2 + 2b 2 ) ≤ 2K µ E xt,mt,ℓt h (ℓ t (ϕ t (π(x t )))−m t (ϕ t (π(x t )))) 2 i + 2σ 2 ≤ 2KE µT + 2σ 2 , (3.15) which improves over the variance term V t [ b ℓ t (π(x t ))] if σ is small. Also note that with action remapping, we only explore actions in A t , and thus by the exact same arguments as in the adversarial case, the exploration term also becomes µTσ + 2µ √ ET, again better than the naive approach as long asσ is small. Therefore, remapping improves both the variance and the exploration term. It remains to analyze the bias from both the shifted estimator and the remapping. The former in fact does not introduce any bias for the regret since the shift m t (a ∗ t ) is the same for all actions. The latter introduces total biasO(E/σ), again by the same analysis as in the adversarial case. 
With these modifications, we achieve ˜ O((ET ) 1 3 ) regret already (even with the presence of the lower-order term). This is summarized in the following theorem (see Appendix B.5.4 for the proof). Theorem 19. ϵ-Greedy.AR (Algorithm 17 with Option I) ensures Reg = ˜ O q dE µ +σ √ dT + d µ + µTσ +µ √ ET + E σ . ForE ≤ √ T, picking µ = min d 2 3 (ET) − 1 3 , 1 and σ =E 2 3 (dT) − 1 3 gives Reg =O (dET ) 1 3 + √ dE +d . Removing the lower-order term via Catoni’s estimator. To further improve the regret bound to ˜ O( √ ET 1 4 ), we need to improve the lower-order term as well. Fortunately, it turns out that this lower- order term can be completely removed using robust mean estimators for heavy-tailed distributions, such as median of means, trimmed-mean, and Catoni’s estimator (see the survey (Lugosi and Mendelson, 2019)). In particular, we use Catoni’s estimator, as we show that it can be implemented efficiently via the ERM oracle. 52 Algorithm 17: ϵ-Greedy with Action Remapping (and Catoni’s estimator) 1 Parameters: threshold σ> 0, exploration probability µ ∈ [0, 1]. 2 for t = 1,...,T do 3 Receive x t and m t . Define a ∗ t ,A t and ϕ t as in Eq. (3.14). 4 Find π t = argmin π∈Π ERM {x s , e ℓ s } s<t , (Option I, termed ϵ-Greedy.AR) ≈ argmin π∈Π Catoni α n e ℓ s (ϕ s (π(x s ))) o s<t using Algorithm 18 with α = r 2 ln(TN) (σ 2 t+KE/µ ) . (Option II, termed ϵ-Greedy.ARC) 5 Calculate p t ∈ ∆ K : p t (a) = (1−µ )1[a =ϕ t (π t (x t ))] + µ |At| 1[a∈A t ]. 6 Sample a t ∼p t and receive ℓ t (a t ). 7 Construct estimator: e ℓ t (a) = ℓt(a)−mt(a) pt(a) 1[a t =a] +m t (a)−m t (a ∗ t ) for all a∈A t . Algorithm 18: Finding the Policy with the Smallest Catoni’s Mean 1 Input: context x s , loss estimator e ℓ s , remapping function ϕ s , for s = 1,...,t− 1, and parameter α. Define : ψ(y) = ( ln(1 +y +y 2 /2), if y≥ 0, − ln(1−y +y 2 /2), else. 2 Initialize: z right = K µ + 1, z left =−z right . 3 while z right −z left ≥ 1/T do 4 Let z mid = (z left +z right )/2. 5 Construct c s ∈R K for all s<t such that c s (a) =ψ α e ℓ s (ϕ s (a))−z mid . 6 Invoke oracle π = ERM ({x s ,c s } s<t ). 7 if P s<t c s (π(x s ))≥ 0 then z left =z mid ,; 8 else z right =z mid .; 9 Construct c s ∈R K for all s<t such that c s (a) =ψ α e ℓ s (ϕ s (a))−z right . 10 Return π t = ERM ({x s ,c s } s<t ). More specifically, instead of following the policy with the smallest average estimated loss, we follow the policy with the smallest Catoni’s mean: argmin π∈Π Catoni α e ℓ s (ϕ s (π(x s ))) s<t where Catoni α (y 1 ,...,y n ) is the root of the functionf(z) = P n j=1 ψ(α(y j −z)) for some increasing function ψ (defined in Algorithm 18) and coefficient α> 0. Generalizing the proof of (Lugosi and Mendelson, 2019, Theorem 5) for i.i.d. random variables to a martingale sequence, we obtain a concentration result without the lower-order term (see Lemma 32 in Appendix B.5.1). Furthermore, we prove that a close approximation of this policy can be found efficiently via a binary search invoking O(ln(TK/µ )) calls of the ERM oracle, detailed in Algorithm 18. 53 Lemma 4. Algorithm 18 invokes the ERM oracle at mostO(ln(TK/µ )) times and returns a policy π t such that Catoni α n e ℓ s (ϕ s (π t (x s ))) o s<t ≤ min π∈Π Catoni α n e ℓ s (ϕ s (π(x s ))) o s<t + 1 T . The proof is straightforward using the monotonicity of ψ and is deferred to Appendix B.5.4. We remark that this result might be of independent interest and useful for developing oracle-efficient algorithms for other problems. 
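For a single sequence of real numbers, Catoni's mean is itself computed by exactly the kind of bisection that Algorithm 18 runs at the policy level: $f(z)=\sum_j\psi(\alpha(y_j-z))$ is non-increasing in $z$, positive below $\min_j y_j$ and negative above $\max_j y_j$, so its root can be bracketed and halved. A small self-contained sketch (the search interval and tolerance are illustrative choices of ours):

```python
import numpy as np

def psi(y):
    # The influence function used by Catoni's estimator, as defined in Algorithm 18.
    return np.where(y >= 0, np.log1p(y + 0.5 * y * y), -np.log1p(-y + 0.5 * y * y))

def catoni_mean(y, alpha, tol=1e-9):
    # Root of f(z) = sum_j psi(alpha * (y_j - z)), found by bisection: f is
    # non-increasing in z, positive below min(y) and negative above max(y).
    y = np.asarray(y, dtype=float)
    lo, hi = y.min() - 1.0, y.max() + 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if psi(alpha * (y - mid)).sum() >= 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

rng = np.random.default_rng(0)
print(catoni_mean(rng.normal(0.3, 1.0, size=1000), alpha=0.05))   # close to the true mean 0.3
```

Algorithm 18 performs the same bisection, but evaluates the sign through the ERM oracle: at each midpoint it checks whether even the minimizing policy still has a nonnegative sum, which locates the smallest Catoni mean over the policy class without ever enumerating $\Pi$.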
Combining the two key techniques above, we improve all the three terms and prove the following theorem (see Algorithm 17 for the pseudocode and Appendix B.5.4 for the complete proof). Theorem 20. ϵ-Greedy.ARC (Algorithm 17 with Option II) ensures Reg = ˜ O q dE µ +σ √ dT +µTσ + µ √ ET + E σ . Picking µ = min p d/T, 1 and σ = √ E(dT ) − 1 4 gives Reg =O √ E(dT ) 1 4 + √ dE . Similarly, this requires setting the parameter σ in terms ofE, and whenE = Ω( √ T ), one could switch to the optimal algorithm (Agarwal et al., 2014b) and achieveO( √ T ) regret. Therefore, our bound again matches the lower bound in Theorem 17. In fact, it also enjoys a better dependence on d compared to the adversarial case (Theorem 18). 3.7 Open problems We state an open problem left for contextual bandits with predictions (Section 3.6). While our lower bound shows that e Ω ( √ ET 1 /4 ) regret is inevitable for general contextual bandits, it does not rule out the possibility of getting e O( √ E) in contextual bandits with certain structures. For example, it does not rule out the possibility of getting e O(d ′ √ E) for linear contextual bandit, where d ′ is the feature dimension. If we cast the example we construct for the e Ω ( √ ET 1 /4 ) lower bound to a linear contextual bandit problem, we need d ′ =T 1 /2 , so the extraT 1 /4 in the lower bound can be explained 54 away byd ′ . Finding the correct complexity measure for the policy class that allows a e O( √ E) bound is an important open direction. 55 Chapter 4 Robust Algorithms against an Adversary 4.1 Overview: best of both worlds, and learning under corruption Multi-armed bandits are traditionally studied under two separate settings: the general (adversarial) setting (Protocol 2) and the stochastic setting (Protocol 3). The tight bound for the former setting is Θ( p |A|T ) while for the latter is Θ( P i:∆ i >0 lnT ∆ i ) where ∆ i is the gap between the expected loss of arm i and that of the optimal arm. It is a natural question whether there is a single algorithm achieving “the best of both worlds,” i.e., having near-optimal regret guarantees in both settings. The study of this problem was initiated by Bubeck and Slivkins (2012), followed by Seldin and Slivkins (2014); Auer and Chiang (2016); Seldin and Lugosi (2017). The algorithms in these works are all much more complicated than the existing algorithms solely for the adversarial setting or the stochastic setting. Our result in Section 4.2 provides the first simple algorithm with near-optimal guarantees in both settings. The algorithm is based on the traditional OMD/FTRL framework with a special regularizer (i.e., log-barrier). In contrast to all previous algorithms, our algorithm does not perform any stationarity detection or gap estimation. The result has been improved later by Zimmert and Seldin (2019) who get optimal regret bounds in both settings up to constants, and extended to Markov decision processes (Jin and Luo, 2020; Jin et al., 2021d). Their OMD/FTRL 56 framework is similar to ours, though with a different regularizer (i.e., Tsallis entropy with parameter 1 2 ). Another line of research that tries to bridge stochastic and adversarial multi-armed bandits is the corruption setting (Lykouris et al., 2018; Gupta et al., 2019). In these works, the underlying environment is stochastic, but the loss feedback received by the learner is corrupted. The goal of the learner is to minimize the regret defined with the original (uncorrupted) loss. 
It turns out that the algorithms that achieve best-of-both-world bounds are highly related to those designed for the corruption setting (see, e.g., Zimmert and Seldin (2019)). The corruption setting has also been considered under linear bandits (Li et al., 2019), linear contextual bandits (Foster et al., 2020), and Markov decision processes (Lykouris et al., 2019), etc. In Section 4.3, we provide the first “best-of-both-worlds” result for linear bandits. The technique is similar to those of Bubeck and Slivkins (2012); Auer and Chiang (2016) for multi-armed bandits, which requires sophisticated gap estimation. It still remains open whether the simple approach of a single FTRL/OMD gets similar results. We show that the algorithm also achieves a near-optimal regret bound for the corruption setting. In Section 4.4, we consider general online decision making under the corruption setting. We provide reductions that can transfer an algorithm with a certain regret bound under a known amount of corruption to an algorithm with the same regret bound but without knowing the amount of corruption. An important implication of the result is the first tight bound for learning in Markov decision processes with corruption. 4.2 Multi-armed bandits with best-of-all-worlds guarantees We invoke Broad-OMD (Algorithm 12) with m t = 0 (therefore, ˆ ℓ t,i = ℓ t,i 1{it=i} w t,i ), with a certain restart and adaptive learning rate scheduling. Although the algorithm has been studied before (Foster et al., 2016), to show the best-of-both-world regret bound, two extra elements are required: 57 Algorithm 19: Doubling trick for Broad-OMD 1 Initialize: η = 1 162 ,T 0 = 0,t = 1. 2 for β = 0, 1,... do 3 w ′ t = argmin w∈Ω ψ 1 (w) (restart Broad-OMD). 4 while t≤T do 5 Update w t , sample b t ∼w t , and update w ′ t+1 as in Broad-OMD. 6 if P t s=T β +1 P K i=1 w 2 s,i ( ˆ ℓ s,i −ℓ s,is ) 2 ≥ K lnT 3η 2 then 7 η←η/2, T β+1 ←t, t←t + 1. 8 break. 9 t←t + 1. 1) the loss shifting technique in the analysis to derive aO(lnT ) regret in the stochastic setting, 2) a doubling trick to tune the learning rate η according to the quantity P T t=1 P K i=1 w 2 t,i ( ˆ ℓ t,i −ℓ t,it ) 2 = P T t=1 P K i=1 (ℓ t,i 1{i t = i}−w t,i ℓ t,it ) 2 . The modified algorithm is presented in Algorithm 19. The following theorem gives its regret bound guarantee. The proof is deferred to Appendix C.1. Theorem 21. Algorithm 19 with ˆ ℓ t,i = ℓ t,i 1{it=i} w t,i guarantees E [Reg] =O v u u t (K lnT )E " T X t=1 K X i=1 (ℓ t,i 1{i t =i}−w t,i ℓ t,it ) 2 # +K lnT . (4.1) This bound implies that in the stochastic setting, we have E [Reg] = O K lnT ∆ , while in the adversarial setting, we have E [Reg] =O p KL T,i ∗ lnT +K lnT assuming non-negative losses, where L T,i ≜ P T t=1 ℓ t,i . 4.3 Linear bandits with best-of-all-world guarantees Next, we proceed to the case of linear bandits (Section 1.2). We consider three different settings: stochastic, corrupted, and adversarial, explained in detail below. We assume that the action set K⊂R d is finite. 58 In the stochastic setting, ℓ t is fixed to some unknown vector θ∈R d . We assume that there exists a unique optimal arm x ∗ ∈K such that⟨x ∗ ,θ⟩< min x ∗ ̸=x∈K ⟨x,θ⟩, and define for each x∈K its sub-optimality gap as ∆ x =⟨x−x ∗ ,θ⟩. Also denote the minimum gap min x̸=x ∗ ∆ x by ∆ min . The corrupted setting is a generalization of the stochastic setting, where in addition to a fixed vector θ, the environment also decides on a corruption vector c t ∈R d for each round (before seeing x t ) so that ℓ t =θ +c t . 
∗ We define the total amount of corruption as C = P t max x∈K |⟨x,c t ⟩|. The stochastic setting is clearly a special case with C = 0. In both of these settings, we define the regret as Reg(T ) = max x∈K P T t=1 ⟨x t −x,θ⟩ = P T t=1 ∆ xt . Finally, in the adversarial setting, ℓ t can be chosen arbitrarily (possibly dependent on the learner’s algorithm and her previously chosen actions). The difference compared to the corrupted setting (which also has potentially arbitrary loss vectors) is that the regret is now defined in terms of ℓ t : Reg(T ) = max x∈K P T t=1 ⟨x t −x,ℓ t ⟩. In all settings, we assume⟨x,θ⟩,⟨x,c t ⟩,⟨x,ℓ t ⟩ and y t are all in [−1, 1] for all t and x∈K. We also denote⟨x,ℓ t ⟩ by ℓ t,x and similarly⟨x,c t ⟩ by c t,x . It is known that the minimax optimal regret in the adversarial setting is Θ(d √ T) (Dani et al., 2008; Bubeck et al., 2012). The instance-optimality in the stochastic case, on the other hand, is slightly more complicated. Specifically, an algorithm is called consistent if it guarantees E[Reg(T )] =o(T ϵ ) for any θ,K, and ϵ> 0. Then, a classic lower bound result (see e.g., (Lattimore and Szepesvari, 2017)) states that for a particular instance (K,θ), all consistent algorithms satisfy: † lim inf T→∞ E[Reg(T )] logT ≥ Ω( c(K,θ)), ∗ In other words, the environment corrupts the observation yt by adding⟨xt,ct⟩. The setting of (Li et al., 2019) is slightly more general with the corruption on yt being ct(xt) for some function ct that is not necessarily linear. † The original proof is under the Gaussian noise assumption. To meet our boundedness assumption onyt, it suffices to consider the case when yt is a Bernoulli random variable, which only affects the constant of the lower bound. 59 where c(K,θ) is the objective value of the following optimization problem: inf N∈[0,∞) K X x∈K\{x ∗ } N x ∆ x (4.2) subject to ∥x∥ 2 H −1 (N) ≤ ∆ 2 x 2 , ∀x∈K\{x ∗ } (4.3) and H(N) = P x∈K N x xx ⊤ (the notation∥x∥ M denotes the quadratic norm √ x ⊤ Mx with respect to a matrix M). This implies that the best instance-dependent bound for Reg(T ) one can hope for isO(c(K,θ) logT) (and more generallyO(c(K,θ) logT +C) for the corrupted setting). It can be shown that c(K,θ)≤O d ∆ min (see Lemma 42), but this upper bound can be arbitrarily loose as shown in (Lattimore and Szepesvari, 2017). The solution N x in the optimization problem above specifies the least number of times action x should be drawn in order to distinguish between the present environment and any other alternative environment with a different optimal action. Many previous instance-optimal algorithms try to match their number of pulls for x to the solution N x under some estimated gap b ∆ x (Lattimore and Szepesvari, 2017; Hao et al., 2020; Jun and Zhang, 2020). While these algorithms are asymptotically optimal, their regret usually grows linearly when T is small (Jun and Zhang, 2020). Furthermore, they are all deterministic algorithms and by design cannot tolerate corruptions. We will show how these issues can be addressed in the next section. Notation In this section, we use P S to denote the probability simplex over S: n p∈R |S| ≥0 : P s∈S p s = 1 o (to avoid the notation ∆ S that could be confused with the subopti- mality gap), and define the clipping operator Clip [a,b] (v) as min(max(v,a),b) for a≤b. 
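To make the optimization problem (4.2)-(4.3) concrete, here is a minimal numerical sketch using cvxpy (an assumed dependency, not part of the thesis code). The constraint ∥x∥²_{H(N)⁻¹} ≤ ∆²_x/2 is a matrix-fractional function of N and hence convex; since (4.2) is an infimum and N_{x∗} is cost-free, the sketch caps every N_x by a large constant to keep the solver bounded, and it assumes the action set spans R^d:

import numpy as np
import cvxpy as cp

def instance_complexity(actions, theta, cap=1e6):
    # Numerically approximate c(K, theta) from (4.2)-(4.3).
    # actions: list of d-dimensional numpy arrays (the finite action set K).
    # theta:   the loss vector of the stochastic instance.
    losses = [float(x @ theta) for x in actions]
    i_star = int(np.argmin(losses))          # the unique optimal arm x*
    gaps = [l - losses[i_star] for l in losses]

    N = cp.Variable(len(actions), nonneg=True)
    # H(N) = sum_x N_x x x^T is affine in N.
    H = sum(N[i] * np.outer(x, x) for i, x in enumerate(actions))

    constraints = [N <= cap]                 # numerical stand-in for the infimum over N_{x*}
    for i, x in enumerate(actions):
        if i != i_star:
            # ||x||^2_{H(N)^{-1}} <= Delta_x^2 / 2, i.e., constraint (4.3)
            constraints.append(cp.matrix_frac(x, H) <= gaps[i] ** 2 / 2)

    objective = cp.Minimize(sum(N[i] * gaps[i] for i in range(len(actions)) if i != i_star))
    prob = cp.Problem(objective, constraints)
    prob.solve()
    return prob.value

The returned value approximates c(K, θ), and N.value gives the corresponding allocation of arm pulls discussed above.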
In this section, we develop an algorithm that enjoys similar regret guarantees in the stochastic or corrupted setting, and additionally guarantees e O( √ T ) regret in the adversarial setting, without 60 having any prior knowledge on which environment it is facing. To the best of our knowledge, this kind of best-of-three-worlds guarantee has only appeared before for multi-armed bandits (Wei and Luo, 2018; Zimmert and Seldin, 2019) and Markov decision processes (Jin and Luo, 2020), but not for linear bandits. Our algorithm requires block-box access to an adversarial linear bandit algorithmB that satisfies the following: Assumption 2.B is a linear bandit algorithm that outputs a loss estimator b ℓ t,x for each action x after each time t. There exist L 0 , C 1 ≥ 2 15 d log(T|K|/δ), and universal constant C 2 ≥ 20, such that for all t≥L 0 ,B guarantees the following with probability at least 1− δ T :∀x∈K, t X s=1 (ℓ s,xs −ℓ s,x )≤ p C 1 t−C 2 t X s=1 (ℓ s,x − b ℓ s,x ) . (4.4) (4.4) states that the regret ofB against action x is bounded by a √ t-order term minus the deviation between the loss of x and its estimator. While this might not seem intuitive, in fact, all existing linear bandit algorithms with a near-optimal high-probability bound satisfy Assumption 2, even though this may not have been stated explicitly (and one may need to slightly change the constant parameters in these algorithms to satisfy the conditions on C 1 and C 2 ). Below, we give two examples of suchB and justify them in Appendix C.2.4. • A variant of GeometricHedge.P (Bartlett et al., 2008) with an improved exploration scheme satisfies Assumption 2 with ( δ ′ =δ/(|K| log 2 T )) C 1 = Θ d log(T/δ ′ ) , L 0 = Θ d log 2 (T/δ ′ ) . 61 • The algorithm of (Lee et al., 2020) satisfies Assumption 2 with ( lg = log(dT ), δ ′′ =δ/(|K|T )) C 1 = Θ d 6 lg 8 log(lg/δ ′′ ) , L 0 = Θ log(lg/δ ′′ ) . With such a black-box at hand, our algorithm BOTW is shown in Algorithm 20. We first present its formal guarantees in different settings. Theorem 22. Algorithm 20 guarantees that with probability at least 1−δ, in the stochastic setting (C = 0), Reg(T ) is at most O c(K,θ) logT log T|K| δ + C 1 √ logT ∆ min +M ∗ log 3 2 1 δ + p C 1 L 0 ! , where M ∗ is a problem-dependent constant; and in the corrupted setting (C > 0), Reg(T ) is at most O C 1 logT ∆ min +C + p C 1 L 0 . In the case whenB is the variant of GeometricHedge.P, the last bound is O d log(T|K|/δ) logT ∆ min +C . Therefore, Algorithm 20 enjoys the nearly instance-optimal regretO(c(K,θ) log 2 T) in the stochastic setting ‡ , but slightly worse regretO( d log 2 T ∆ min +C) in the corrupted setting (recall again c(K,θ)≤d/∆ min ). In exchange, however, Algorithm 20 enjoys the following worst-case robustness in the adversarial setting. ‡ Note that when we chooseB as the variant of GeometricHedge.P, C 1 √ logT ∆ min =O d ∆ min log 3 2 T which is dominated by the termO(c(K,θ) log 2 T ) when T is sufficiently large. 62 Algorithm 20: BOTW (Best of Three Worlds) 1 Input: an algorithmB satisfying Assumption 2. 2 Initialize: L←L 0 (L 0 defined in Assumption 2). 3 while true do 4 Run BOTW-SE with input L, and receive output t 0 . 5 L← 2t 0 . Theorem 23. In the adversarial setting, Algorithm 20 guarantees that with probability at least 1−δ, Reg(T ) is at mostO √ C 1 T logT + √ C 1 L 0 . Algorithm 20 BOTW takes a black-boxB satisfying Assumption 2 (with parameter L 0 ) as input, and then proceeds in epochs until the game ends. 
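Viewed as code, this outer loop is a simple doubling wrapper around the single-epoch procedure described next; the sketch below uses run_botw_se as an assumed stand-in for BOTW-SE that returns the Phase-1 length t0 of the epoch and the global time at which the epoch stopped:

def botw(run_botw_se, L0, T):
    # Epoch-doubling wrapper of Algorithm 20 (a sketch under the stated assumptions).
    L, t = L0, 1
    while t <= T:
        t0, t = run_botw_se(L, t)   # run BOTW-SE with minimum Phase-1 duration L
        L = 2 * t0                  # next epoch must spend at least twice as long in Phase 1

Since t0 >= L in every epoch, L at least doubles from one epoch to the next, which is why the number of epochs is O(log T).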
In each epoch, it runs its single-epoch version BOTW-SE (Algorithm 21) with a minimum duration L (initialized as L 0 ). Based on the results of some statistical tests, at some point BOTW-SE will terminate with an output t 0 ≥L. Then BOTW enters into the next epoch with L updated to 2t 0 , so that the number of epochs is alwaysO(logT ). BOTW-SE has two phases. In Phase 1, the learner executes the adversarial linear bandit algorithmB. Starting from t =L (i.e., after the minimum duration specified by the input), the algorithm checks in every round whether (4.6) and (4.7) hold for some action b x (Line 7). If there exists such an b x, Phase 1 terminates and the algorithm proceeds to Phase 2. This test is to detect whether the environment is likely stochastic. Indeed, (4.6) and (4.7) imply that the performance of the learner is significantly better than all but one action (i.e., b x). In the stochastic environment, this event happens at roughly t≈ Θ d ∆ 2 min with b x =x ∗ . This is exactly the timing when the learner should stop usingB whose regret grows as e O( √ t) and start doing more exploitation on the better actions, in order to keep the regret logarithmic in time for the stochastic environment. We define t 0 to be the time when Phase 1 ends, and b ∆ x to be the empirical gap for action x with respect to the estimators obtained fromB so far (Line 7). In the stochastic setting, we can show that b ∆ x = Θ(∆ x ) holds with high probability. 63 In the second phase, we calculate the action distribution using OP with the estimated gap { b ∆ x } x∈K . Indeed, if b ∆ x ’s are accurate, the distribution returned by OP is close to the optimal way of allocating arm pulls, leading to near-optimal regret. The loss estimator b ℓ t,x is defined as the following: b ℓ t,x = x ⊤e S −1 t x t y t , x̸=b x yt e p t,b x I{x t =b x}, x =b x where e S t = X x∈K e p t,x xx ⊤ . (4.5) OP(t, b ∆) : return any minimizer p ∗ of the following: min p∈P X X x p x b ∆ x , (4.10) s.t. ∥x∥ 2 S(p) −1≤ t b ∆ 2 x β t + 4d, ∀x∈K, (4.11) where S(p) = P x∈K p x xx ⊤ and β t = 2 15 log t|K| δ . Figure 4.1: Optimization Problem (OP) Catoni α ({X 1 ,X 2 ,...,X n }) : return b X, the unique root of the functionf(z) = P n i=1 ψ(α(X i −z)) where ψ(y) = ( ln(1 +y +y 2 /2), if y≥ 0, − ln(1−y +y 2 /2), else. Figure 4.2: Catoni’s Estimator Then, we define the average empirical gap in [1,t] for x̸=b x and t in Phase 2 as the following: b ∆ t,x = 1 t t 0 X s=1 b ℓ s,x + (t−t 0 )Rob t,x − t X s=1 b ℓ s,b x ! (4.12) 64 Algorithm 21: BOTW-SE (BOTW – Single Epoch) 1 Input: L (minimum duration) 2 Define : f T = logT 3 Initialize: a new instance ofB. 4 // Phase 1 5 for t = 1, 2,... do 6 Execute and updateB. Receive estimators{ b ℓ t,x } x∈K . 7 if t≥L and there exists an action b x such that t X s=1 y s − t X s=1 b ℓ s,b x ≥−5 p f T C 1 t, (4.6) t X s=1 y s − t X s=1 b ℓ s,x ≤−25 p f T C 1 t, ∀x̸=b x, (4.7) then t 0 ←t, b ∆ x ← 1 t 0 P t 0 s=1 b ℓ s,x − b ℓ s,b x , break. 8 // Phase 2 9 for t =t 0 + 1,... do 10 Let p t = OP(t, b ∆) and e p t = 1 2 e b x + 1 2 p t . 11 Sample x t ∼ e p t and observe y t . 12 Calculate b ℓ t,x and b ∆ t,x based on (4.5) and (4.12). 13 if ∃x̸=b x, b ∆ t,x / ∈ h 0.39 b ∆ x , 1.81 b ∆ x i or (4.8) t X s=t 0 +1 y s − b ℓ s,b x ≥ 20 p f T C 1 t 0 , (4.9) then break. 14 Return t 0 . where Rob t,x = Clip [−1,1] Catoni αx b ℓ τ,x t τ=t 0 +1 with α x = 4 log(t|K|/δ) t−t 0 + P t τ=t 0 +1 2∥x∥ 2 S −1 τ ! 1 2 . 
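As a concrete illustration, the numpy sketch below computes one round of the Phase-2 loss estimates in (4.5); the action set, the sampling distribution p̃_t, and the observed loss y_t are assumed inputs, and the chosen action is given by its index:

import numpy as np

def phase2_loss_estimates(actions, p_tilde, x_hat_idx, chosen_idx, y_t):
    # One round of estimator (4.5).
    # actions: list of d-dimensional numpy arrays; p_tilde: sampling probabilities
    # (half the mass on x_hat mixed with the OP solution); y_t: observed loss of x_t.
    S = sum(p * np.outer(x, x) for p, x in zip(p_tilde, actions))   # S_tilde_t
    S_inv = np.linalg.inv(S)          # assumes the support of p_tilde spans R^d
    x_t = actions[chosen_idx]

    est = np.empty(len(actions))
    for i, x in enumerate(actions):
        if i == x_hat_idx:
            # importance-weighted estimate for x_hat
            est[i] = (y_t / p_tilde[i]) if chosen_idx == x_hat_idx else 0.0
        else:
            # least-squares style estimate x^T S_tilde^{-1} x_t y_t
            est[i] = float(x @ S_inv @ x_t) * y_t
    return est

These per-round estimates are exactly what the running averages and the Catoni-based quantity Rob_{t,x} in (4.12) are built from.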
Note that we use a simple average estimator for b x, but a hybrid of average estimator of Phase 1 and robust estimator of Phase 2 for other actions. These gap estimators are useful in monitoring the non-stochasticity of the environment, which is done via the tests (4.8) and (4.9). The first condition ( (4.8)) checks whether the average empirical gap 65 b ∆ t,x is still close to the estimated gap b ∆ x at the end of Phase 1. The second condition ((4.9)) checks whether the regret against b x incurred in Phase 2 is still tolerable. It can be shown that (see Lemma 7), with high probability, (4.8) and (4.9) do not hold in a stochastic environment. Therefore, when either event is detected, BOTW-SE terminates and returns the value of t 0 to BOTW, which will then run BOTW-SE again from scratch with L = 2t 0 . In the following, we provide a sketch of the analysis for BOTW, further revealing the ideas behind our design. Analysis for the adversarial setting We first show that at any time t in Phase 2, with high probability, b x is always the best action so far. Lemma 5. With probability at least 1−δ, for any t in Phase 2, we have b x∈ argmin x∈K P t s=1 ℓ s,x . (See Appendix C.2.2 for the proof.) We then prove that, importantly, the regret in each epoch is bounded by e O( √ t 0 ) (not square root of the epoch length): Lemma 6. With probability at least 1−δ, for any time t in Phase 2, we have for any x∈K, t X s=1 (ℓ s,xs −ℓ s,x ) =O p C 1 t 0 f T . (See Appendix C.2.2 for the full proof.) Finally, to obtain Theorem 23, it suffices to apply Lemma 6 and the fact that the number of epochs isO(logT ). Analysis for the corrupted and stochastic settings The key for this analysis is the following lemma. 66 Lemma 7. In the corrupted setting, BOTW-SE ensures with probability at least 1− 15δ: • t 0 ≤ max n 900f T C 1 ∆ 2 min , 900C 2 f T C 1 ,L o . • If C≤ 1 30 √ f T C 1 L, then 1) b x =x ∗ ; 2) b ∆ x ∈ [0.7∆ x , 1.3∆ x ] for all x; and 3) Phase 2 never ends. Using this lemma, we can show Theorem 22. The full proof is deferred to Appendix C.2.3. 4.4 Robust algorithms for general decision making Having established the regret bounds for the “best-of-both-worlds” problem or the corruption setting for multi-armed bandits and linear bandits in Section 4.2 and Section 4.3, our next step is to obtain similar results for more general decision-making problems. In Markov decision processes where the transition and the loss can both be adversarial, the situation is, however, very different from those of bandits. It was first shown by Abbasi-Yadkori et al. (2013) that achieving sub-linear regret in this setting is computationally hard, and recently enhanced by Tian et al. (2021) showing that it is even information-theoretically hard. To establish meaningful guarantees, previous work aims to achieve a regret bound that smoothly degrades with the amount of corruption. When the total amount of corruption is given as prior knowledge to the learner, Wu et al. (2021) designed an algorithm with a regret upper bound that scales optimally with the amount of corruption. However, this kind of prior knowledge is rarely available in practice. When the total amount of corruption is unknown, efforts were made by Lykouris et al. (2019); Chen et al. (2021c); Cheung et al. (2020); Wei and Luo (2021); Zhang et al. (2021c) to obtain similar guarantees. Unfortunately, all their bounds scale sub-optimally in the amount of corruption. 
Therefore, the following question remains open: When both reward and transition can be corrupted, how can the learner achieve a regret bound that has optimal dependence on the unknown amount of corruption? We address this open problem by designing an efficient algorithm with the desired worst-case optimal bound. Specifically, in tabular MDPs, our regret bound scales with √T + C, where T is the number of rounds and C is the total amount of corruption (omitting dependencies on other quantities). This matches the lower bound of Wu et al. (2021). In contrast, the bounds obtained by Lykouris et al. (2019) and Chen et al. (2021c) are (1 + C)√T + C² and √T + C², respectively, which are non-vacuous only when C ≤ √T, a rather limited case. The bounds obtained by Cheung et al. (2020); Wei and Luo (2021); Zhang et al. (2021b) are (1 + C)^(1/4) T^(3/4), √T + C^(1/3) T^(2/3), and √((1 + C)T), respectively. Although these bounds are meaningful for all C ≤ T, the dependence on C is multiplicative in T, which is undesirable.

For tabular MDPs, we further show that the bound can be improved to min{1/∆, √T} + C, where ∆ is the gap between the expected rewards of the best and the second-best policies. This kind of refined instance-dependent regret bound was also established by Lykouris et al. (2019) and Chen et al. (2021c). The bound of Lykouris et al. (2019) is (1 + C) min{G, √T} + C² for some gap-complexity G ≤ 1/∆, while Chen et al. (2021c) obtained min{1/∆, √T} + C². It was left as an open question whether the best-of-all-worlds bound min{G, √T} + C is achievable.

Our method is based on the framework of model selection (Agarwal et al., 2017; Foster et al., 2019; Arora et al., 2021; Abbasi-Yadkori et al., 2020; Pacchiano et al., 2020a,b). In model selection problems, the learner is given a set of base algorithms, each with an underlying model or assumption about the world. However, the learner does not know in advance which model fits the real world best. The goal of the learner is to be comparable to the best base algorithm in hindsight. In our case, a model of the world corresponds to a hypothetical amount of corruption C; for a given C, there are algorithms with near-optimal bounds (e.g., Wu et al. (2021)) that can serve as base algorithms. Therefore, the problem of handling unknown corruption can be cast as a model selection problem. To get the bound of √T + C, we adopt the idea of regret balancing similar to that of Abbasi-Yadkori et al. (2020); Pacchiano et al. (2020a), while to get min{1/∆, √T} + C, we develop a novel two-model selection algorithm (see Section 4.4.3).

Table 4.1: ∗ indicates computationally inefficient algorithms. G is the GapComplexity defined in (Simchowitz and Jamieson, 2019); ∆ is the gap between the expected rewards of the best and second-best policies. It holds that G ≤ 1/∆. C^a = Σ_t c_t and C^r = √(T Σ_t c_t²), where c_t is the amount of corruption in round t. By definition, C^a ≤ C^r ≤ min{√(C^a T), T max_t c_t}. C^a is the standard notion of corruption in the literature. †: the bound reported in (Jin et al., 2021d) is min{G + √(G C^a), √T} under a different definition of regret. ♯: linearized corruption restricts the corruption on action a to equal c⊤a for some vector c shared among all actions.

Setting | Algorithm | Reg(T) in Õ(·) | Restrictions
Tabular MDP | (Lykouris et al., 2019) | (1 + C^a) min{G, √T} + (C^a)² |
Tabular MDP | (Chen et al., 2021c)∗ | min{1/∆, √T} + (C^a)² |
Tabular MDP | (Jin et al., 2021d)† | min{G, √T} + C^a | only for corruption in reward
Tabular MDP | G-COBE + UCBVI | min{1/∆, √T} + C^a |
Linear bandit | (Li et al., 2019) | 1/∆² + C^a/∆ |
Linear bandit | (Bogunovic et al., 2020) | (1 + C^a)√T |
Linear bandit | (Bogunovic et al., 2021) | √T + (C^a)² |
Linear bandit | (Lee et al., 2021) | min{1/∆, √T} + C^a | only for linearized corruption♯
Linear bandit | G-COBE + PE | min{1/∆, √T} + C^a |
Linear contextual bandit | (Foster et al., 2020) | √T + C^r |
Linear contextual bandit | COBE + OFUL | √T + C^r |
Linear contextual bandit | COBE + VOFUL∗ | √T + C^a |
Linear MDP | (Lykouris et al., 2019) | √T + (C^a)²√T |
Linear MDP | (Wei and Luo, 2021) | √T + (C^a)^(1/3) T^(2/3) |
Linear MDP | COBE + LSVI-UCB | √T + C^r |
Linear MDP | COBE + VARLin∗ | √T + C^a |
Low BE-dimension | COBE + GOLF∗ | √T + C^r |

Extensions to linear and general function approximation. Our model selection framework can be readily extended to the cases of linear contextual bandits and linear MDPs. However, even with at most C corrupted rounds, a straightforward extension of standard algorithms (i.e., OFUL (Abbasi-Yadkori et al., 2011), LSVI-UCB (Jin et al., 2020c)) results in an overall regret of Ω(√(CT)), a sharp contrast with the O(√T + C) bound in tabular MDPs. We find that the O(√T + C) bound can indeed be achieved efficiently in the non-contextual case (i.e., linear bandits with a fixed action set) by using the Phased Elimination (PE) approach developed by Lattimore et al. (2020); Bogunovic et al. (2021). We achieve the same bounds for linear contextual bandits and linear MDPs, but we resort to the idea of Zhang et al. (2021d), who deal with linear models through a sophisticated and computationally inefficient clipping technique. Note that the original purpose of Zhang et al. (2021d) is to get a variance-reduced bound for linear contextual bandits and a horizon-free bound for linear mixture MDPs, which are very different from our goal here. That their idea is applicable to improving robustness against corruption is surprising and of independent interest. The fact that additive dependence on C is possible under linear settings (though computationally inefficient) partially answers an open question by Zhang et al. (2021c).

We further extend our framework to general function approximation settings. We consider the class of MDPs that have low Bellman-eluder dimension (Jin et al., 2021a), and derive a corruption-robust version of their algorithm (GOLF). The algorithm achieves a regret bound of O(√((1 + C)T)). Whether the bound of O(√T + C) is possible is left as an open problem. In Table 4.1, we compare our bounds with those in previous works (omitting dependencies other than C and T). Note that C^r is best interpreted as √(CT) in the notation of prior work.

Problem Setting. We consider a general decision-making framework that covers a wide range of problems. We first describe the uncorrupted setting. The learner is given a policy set Π and a context set X. Ahead of time, the environment determines a context-to-expected-reward mapping µ^π : X → [0, 1] for every π ∈ Π, all of which are hidden from the learner. In each round t = 1, ..., T, the environment first arbitrarily generates a context x_t ∈ X, and generates a noisy reward r^π_t ∈ [0, 1] for every π such that E[r^π_t] = µ^π(x_t). The context x_t is revealed to the learner.
Then the learner chooses a policy π t , and receives r t ≜r πt t . The goal of the learner is to minimize the regret defined as Reg(T ) = max π∈Π T X t=1 µ π (x t )−µ πt (x t ) . (4.13) 70 In the corrupted setting, the protocol is similar, but in each roundt, an adversary can change the context-to-expected-reward mapping fromµ π (·) toµ π t (·). Thenr π t is drawn such thatE[r π t ] =µ π t (x t ). We assume that µ π t (x t ) and r π t still lie in [0, 1]. As before, the learner observes x t , chooses π t , and receives r t =r πt t . The goal of the learner remains to minimize the regret defined in (4.13) (notice that it is defined through the uncorrupted µ ). The adversary we consider falls into the category of an adaptive adversary, an adversary that can determine the corruption in round t based on the history up to round t− 1. We consider the realizable setting, where the following assumption holds: Assumption 3. There exists a policy π ⋆ ∈ Π such that µ π ⋆ (x t )≥µ π (x t ) for all t and all π∈ Π. Two Ways to Compute Aggregated Corruption In previous works of corruption-robust RL, the total corruption was defined as C = P T t=1 c t , where c t is the per-round corruption defined above. However, to unify the analysis under different settings, we introduce another notion of total regret defined as q T P T t=1 c 2 t . To distinguish them, we denote C a = P T t=1 c t and C r = q T P T t=1 c 2 t , for that C a is T times the arithmetic mean of c t ’s, while C r is T times the root mean square of c t ’s. By defining C r , we are able to recover the bounds in the “model misspecification” literature, in which the regret bound is often expressed through T max t c t , which is an upper bound of C r (see Table 4.1 for more details). We further define C a t ≜ P t τ=1 c τ and C r t ≜ q t P t τ=1 c 2 τ . 4.4.1 A model selection framework In this subsection, we develop a general corruption-robust algorithm based on model selection. The regret bound we achieve is of order ˜ O( √ T +C), where C is either C a or C r (see Table 4.1 for the choices in different settings). Model selection approaches rely on a meta algorithm learning over a set of base algorithms. We first specify the properties that each base algorithm should satisfy: 71 Assumption 4 (base algorithm, with either C t ≜C a t or C t ≜C r t ). ALG is an algorithm that takes as input a time horizon T, a confidence level δ, and a hypothetical corruption level θ. ALG ensures the following: with probability at least 1−δ, for all t≤T such that C t ≤θ, it holds that t X τ=1 (r π ⋆ τ −r τ )≤R(t,θ) for some functionR(t,θ). Without loss of generality, we assume thatR(t,θ) is non-decreasing in both t and θ, and thatR(t,θ)≥θ. If a base algorithm satisfies Assumption 4 with C t ≜ C a t , we call it a type-a base algorithm, while if C t ≜C r t , we call it a type-r. Base algorithms are essentially corruption-robust algorithms that require prior knowledge of the total corruption. Therefore, the algorithms developed by Lykouris et al. (2019, Appendix B) or Wu et al. (2021) can be readily used as our base algorithms. For example, for tabular MDPs, a variant of the UCBVI algorithm (Azar et al., 2017) satisfies Assumption 4 with C t ≜ C a t andR(t,θ) = poly(H, log(SAT/δ))( √ SAt +S 2 A +SAθ); for linear MDPs, a variant of the LSVI-UCB algorithm (Jin et al., 2020c) satisfies Assumption 4 with C t ≜C r t andR(t,θ) = poly(H, log(dT/δ))( √ d 3 t +dθ). More examples can be found in Appendix H of Wei et al. (2022). 
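Before turning to how the base algorithms are combined, the following sketch makes the two aggregation notions concrete: it computes the prefix quantities C^a_t and C^r_t from a sequence of per-round corruption amounts c_τ, and checks the ordering C^a_t ≤ C^r_t (a consequence of the Cauchy-Schwarz inequality) stated in the caption of Table 4.1.

import numpy as np

def corruption_aggregates(c):
    # Prefix aggregates C^a_t and C^r_t of per-round corruptions c_1, ..., c_T.
    c = np.asarray(c, dtype=float)
    t = np.arange(1, len(c) + 1)
    C_a = np.cumsum(c)                    # C^a_t = sum_{tau <= t} c_tau
    C_r = np.sqrt(t * np.cumsum(c ** 2))  # C^r_t = sqrt(t * sum_{tau <= t} c_tau^2)
    return C_a, C_r

C_a, C_r = corruption_aggregates([0.0, 1.0, 0.2, 0.0, 0.7])
assert np.all(C_a <= C_r + 1e-12)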
A base algorithm with a higher hypothetical corruption level θ is more robust, but incurs more regret overhead. In contrast, base algorithms with lower hypothetical corruption level introduce less overhead, but have higher possibility of mis-specifying the amount of corruption. When the true total corruption is unknown, just running a single base algorithm with a fixed θ is risky either way. The idea of our algorithm is to simultaneously run multiple base algorithms (in each round, sample one of the base algorithms and execute it), each with a different hypothesis on the total amount of corruption. This idea is also used by Lykouris et al. (2019). Intuitively, if two base 72 algorithms have a valid hypothesis for the total corruption (i.e., their hypotheses upper-bound the true total corruption), then the one with smaller hypothesis should learn faster than the larger one because its hypothesis is closer to the true value, and incurs less overhead. Therefore, if at some point we find that the average performance of a base algorithm with a smaller hypothesis is significantly worse than that of a larger one, it is an evidence that the former has mis-specified the amount of corruption. If this happens, this base algorithm terminates. There are two key questions to be answered. First, what distribution should we use to select among the base algorithms? Second, given this distribution, how should we detect mis-specification of the amount of corruption by comparing the performance of base algorithms? We will soon answer the second question here. The first question will be addressed in Section 4.4.2 and Section 4.4.3 slightly differently depending on our target regret bound. BASIC (Algorithm 22) is a building block of our final algorithms. In BASIC, the distribution over base algorithms is fixed and given as an input ( α in Algorithm 22). Other inputs include: a length parameter L that specifies the maximum number of rounds (the algorithm might terminate before finishing all L rounds though) and an indexk∈ [k max ] (k max is defined in Algorithm 22) that specifies the smallest index of base algorithms (the base algorithms are indexed by k,k +1,...,k max ). Below, we sometimes unify the statements for the two definitions of total corruption. The notation (C,C t ) refers to (C a ,C a t ) if the base algorithm is type-a, and refers to (C r ,C r t ) if it is type-r. We will explicitly write the superscripts if we have to distinguish them. The base algorithm with index i∈ [k,k max ] (denoted as ALG i ) hypothesizes that the total corruption C is upper-bounded by 2 i . We say ALG i is well-specified at round t if C t ≤ 2 i ; otherwise we say it is mis-specified at round t. Naively, we might want to set the θ parameter of ALG i to 2 i . However, we can actually set it to be smaller to reduce the overhead, as explained below. Since each base algorithm is sub-sampled according to the distribution α, the total corruption experienced by 73 Algorithm 22: Base Algorithms run Simultaneously with mIs-specification Check (BASIC) 1 input: base algorithm ALG satisfying Assumption 4, L∈ [T ], k∈ [k max ] where k max ≜⌈log 2 (c max L)⌉, δ∈ (0, 1), and a distributionα = (α k ,α k+1 ,...,α kmax ) satisfying: α k ≥α k+1 ≥···≥α kmax > 0 and kmax X i=k α i = 1. for i =k,...,k max do 2 Initiate an instance of ALG with inputs T,δ, and θ chosen as below: θ i ≜ ( 1.25·α i 2 i + 21c max log(T/δ) if ALG is type-a 1.25·α i 2 i + 8c max p α i L log(T/δ) + 21c max log(T/δ) if ALG is type-r (4.14) (We call this instance ALG i .) 
3 4 for t = 1,...,L do 5 Randomly pick a sub-algorithm i t ∼α, receive the context x t , and use ALG it to output π t . 6 Execute π t , receive feedback, and perform update on ALG it . 7 Define N t,i ≜ P t τ=1 1[i τ =i], R t,i ≜ P t τ=1 1[i τ =i]r τ . 8 if∃i,j∈ [k,k max ], i<j, such that R t,i α i + R(N t,i ,θ i ) α i < R t,j α j − 8 s t log(T/δ) α j + log(T/δ) +θ j α j ! , (4.15) 9 return false. 10 return true. ALG i in [1,t] is only roughly P τ≤t α i c τ ≤α i C a or q (α i t) P τ≤t α i c 2 τ ≤α i C r (for type-a and type-r base algorithms, respectively). This means that ALG i , which hypothesizes a total corruption of 2 i , only needs to set the θ parameter in Assumption 4 to roughly α i 2 i , instead of 2 i . Our choice of θ i in (4.14) is slightly larger than α i 2 i to accommodate the randomness in the sampling procedure. Besides performing sampling over base algorithms, BASIC also compares the performance any two base algorithms using (4.15). If all base algorithms hypothesize large enough corruption, then all of them enjoy the regret bound specified in Assumption 4 in the subset of rounds they are executed. In this case, we can show that with high probability, the termination condition (4.15) will not hold. This is formalized in Lemma 8. 74 Lemma 8. With probability at least 1−O(k max δ), the termination condition (4.15) of the BASIC algorithm does not hold in any round t such that C t ≤ 2 k . In other words, (4.15) is triggered only when C t > 2 k , i.e., ALG k is mis-specified at round t. Once this happens, the BASIC algorithm terminates. Checking condition (4.15) essentially ensures that the quantities R t,i α i of all base algorithms remain close. Notice that at all t, there is always a well-specified base algorithm i ⋆ with C t ≤ 2 i ⋆ which enjoys the regret guarantee of Assumption 4. Therefore, R t,i ⋆ α i ⋆ is not too low, and thus, testing condition (4.15) prevents R t,i α i of any i from falling too low. This directly controls the performance of every base algorithm before termination. The following lemma bounds the learner’s cumulative regret at termination. Lemma 9. LetL 0 ≤L be the round at which BASIC terminates, and leti ⋆ be the smallesti∈ [k,k max ] such that C L 0 ≤ 2 i . Then with probability at least 1−O(k max δ), L 0 X t=1 (r π ⋆ t −r t )≤ kmax X i=k R(N L 0 ,i ,θ i ) + ˜ O 1 [i ⋆ >k] s L 0 α i ⋆ + R(N L 0 ,i ⋆,θ i ⋆) α i ⋆ !! . where N L 0 ,i is the total number of rounds ALG i was played. 4.4.2 Gap-independent bounds Next, we use BASIC to build a corruption-robust algorithm with a regret bound of either √ T +C a or √ T +C r without prior knowledge of C a or C r . The algorithm is called COBE and presented in Algorithm 23. We consider base algorithms with the following concrete form ofR(t,θ): R(t,θ) = p β 1 t +β 2 θ +β 3 (4.16) 75 Algorithm 23: COrruption-robustness through Balancing and Elimination (COBE) 1 input: base algorithm ALG satisfying Assumption 4 with the form specified in (4.16). 2 define : Z≜c max if ALG is type-a, and Z≜c max √ T if ALG is type-r. 3 k init ≜ max log 2 √ β 1 T +β 2 Z+β 3 β 2 , 0 with β 1 ,β 2 ,β 3 defined in (4.16). 4 for k =k init ,... do 5 Run BASIC with input k and L =T, and (α i ) kmax i=k specified in (4.17), until it terminates or the total number of rounds reaches T. for some β 1 ,β 2 ,β 3 ≥ 1. COBE starts with k =k init (defined in Algorithm 23) and runs BASIC with inputs k and L =T and the following choice of (α i ) kmax i=k : α i = 2 k−i−1 for i>k, 1− P kmax i=k+1 α i for i =k. 
(4.17) Whenever the subroutine BASIC terminates beforeT, we eliminate ALG k and start a new instance of BASIC with k increased by 1 (see the for-loop in COBE). This is because as indicated by Lemma 8, early termination implies that ALG k mis-specifies the amount of corruption. Notice that 2 k init is roughly of order √ T, i.e., we start from assuming that the total amount of corruption is √ T. This is because we only target the worst-case regret rate of √ T +C here, so refinements for smaller corruption levels C≤ √ T do not improve the asymptotic bound. Our choice of α i makes α i 2 i ≈ 2 k for all i, and this further keeps the magnitudes ofR(N t,i ,θ i ) of all i’s roughly the same. This conforms with the regret balancing principle by Abbasi-Yadkori et al. (2020); Pacchiano et al. (2020a), as well as the sub-sampling idea of Lykouris et al. (2019). This makes the bound of the model selection algorithm only worse than the best base algorithm by a factor ofO(k max ) = ˜ O(1) if all base algorithms are well-specified. In the following theorem, we show guarantees of COBE for bothC≜C a andC≜C r . The proof essentially plugs the choices of parameters into Lemma 9, and sums up the regret over epochs. 76 Theorem 24. If ALG satisfies Assumption 4 and R(t,θ) in the form of (4.16), then with α i ’s specified in (4.17), COBE guarantees with probability at least 1−O(k max δ) that Reg(T ) = ˜ O p β 1 T +β 2 (C +Z) +β 3 , where Z =c max if C≜C a and Z =c max √ T if C≜C r . 4.4.3 Gap-dependent bounds Our next goal is to get instance-dependent bounds similar to those in Lykouris et al. (2019); Chen et al. (2021c). There are extra assumptions to be made in this section. First, we only deal with the case without contexts, i.e., the following assumption holds: Assumption 5. Assume that µ π (x t ) =µ π . This covers linear bandits and MDPs with a fixed initial state. In fact, our approach can handle a slightly more general case where the context is i.i.d. generated in the uncorrupted case, and the deviation of the context distribution from i.i.d. is considered as corruption (in contrast, in Section 4.4.1, such deviation is not considered as corruption). Besides, our bound depends on the sub-optimality gap defined in the following: Assumption 6. There exists a policy π ⋆ ∈ Π such that for all π∈ Π\{π ⋆ }, µ π ≤µ π ⋆ − ∆ . This gap assumption is in fact stronger than that made by Chen et al. (2021c). In (Chen et al., 2021c), ∆ := min π: ∆ π>0 ∆ π where ∆ π =µ π ⋆ −µ π . Their definition keeps ∆ > 0 when there are multiple optimal policies, while our Assumption 6 forces ∆ = 0 if there are two optimal policies with the same expected reward. This kind of stronger gap assumption is similar to those in (Lee et al., 2021; Jin et al., 2021d). Finally, we only focus on the case C =C a throughout this section. § § When C =C r , our approach produces a regret term of cmax √ T as in Theorem 24, spoiling the gap-dependent bound. 77 Algorithm 24: Gap-bound enhanced COBE (G-COBE) 1 k init = max log 2 √ β 1 +β 2 cmax+β 3 β 2 , 0 , β 4 = 10 4 (2β 1 + 42β 2 c max log(T/δ) + 2β 3 ). 2 for k =k init ,... do 3 // Phase 1 4 Let L be the smallest integer such that √ β 4 L≥β 2 2 k . 5 if L>T then break; 6 Run BASIC with inputk andL, and (α i ) kmax i=k specified in (4.17) until it terminates or the number of rounds reaches T. Let o be its output, and let b π be the policy that is executed most often by the base algorithm ALG k . 
7 if o = true then 8 // Phase 2 9 Run TwoModelSelect with L,b π,B b π , until it terminates or the number of rounds reaches T. 10 // Phase 3 11 Run COBE in the remaining rounds. Algorithm Overview Our algorithm G-COBE (Algorithm 24) consists of three phases where the first two phases are executed interleavingly. In Phase 1 (Line 4-Line 6 in G-COBE), we run BASIC with a type-a base algorithm that satisfies Assumption 4 with the following gap-dependent bound: R(t,θ) = min p β 1 t, β 1 ∆ +β 2 θ +β 3 (4.18) for some β 1 ,β 2 ,β 3 satisfying β 1 ≥ 16 log(T/δ), β 2 ≥ 1, β 3 ≥ 10 p β 1 log(T/δ). In every for-loop of k, if BASIC in Phase 1 returns true, the algorithm proceeds to Phase 2 (Line 9 in G-COBE). In Phase 2, TwoModelSelect (Algorithm 25) is executed. TwoModelSelect is a specially designed two-model selection algorithm that dynamically chooses between two instances. One of them is b π, a candidate optimal policy identified in Phase 1 (defined in Line 6 of G-COBE); the other is an algorithm with b π as input (we call this algorithmB b π ). We assume thatB b π has the following property: Assumption 7.B b π is a corruption-robust algorithm over the policy set Π\{b π} without prior knowledge of the total corruption. In other words, when running alone, in every round t, it chooses a policy 78 π t ∈ Π\{b π} and receives r t with E[r t ] =µ πt t . It ensures the following for all t with probability at least 1−δ: max π∈Π\{b π} t X τ=1 (r π τ −r τ )≤R B (t,C t )≜ p β 1 t +β 2 C t +β 3 . (4.19) Notice that in Section 4.4.2 we have already developed a corruption-robust algorithm COBE, whose guarantee is already in the form of (4.19), albeit over the original policy set Π (see Theorem 24). In Appendix C.3.5, we describe how to implementB b π through running COBE on a modified MDP. The TwoModelSelect in Phase 2 might end earlier than time T. This happens only when 1 ∆ +C is larger than the order of 2 k . In this case, the algorithm goes back to Phase 1 with k increased by 1. When 2 k grows to the order of √ T (implying that √ T ≳ 1 ∆ +C), the algorithm instead proceeds to Phase 3 and simply run COBE in the remaining rounds (Line 5 of G-COBE). The regret guarantee of G-COBE is summarized by the following theorem. Theorem 25. G-COBE ensures that (with β 4 defined in Algorithm 24) Reg(T ) = ˜ O min p β 4 T, β 4 ∆ +β 2 C +β 4 . Theorem 25 gives the first min{ 1 ∆ , √ T} +C bound in the literature of corrupted MDPs without the knowledge of C. To show Theorem 25, we establish some key lemmas for Phase 1 and Phase 2 in the remaining of this section. The complete proof of Theorem 25 is given in Appendix C.3.4. Note that within the sub-routines BASIC and TwoModelSelect, we re-index the time so that they both start from t = 1 for convenience. 79 Phase 1 of G-COBE In Phase 1 we run BASIC with base algorithms that achieve gap-dependent bounds (4.18). The regret bound of BASIC under general choices ofR(t,θ) andα i has already been derived in Lemma 9. Here, we apply it with the new form ofR(t,θ) in (4.18), and the new choice of α i as below: α i = min √ β 1 L/β 2 +2 k 2 i , 1 2(kmax−k) for i>k 1− P kmax i=k+1 α i for i =k (4.20) The regret bound of BASIC under such choices of parameters is summarized as the following: Lemma 10. Let L 0 ≤L be the round at which BASIC terminates. IfR(t,θ) is of the form (4.18), and the α i ’s follow (4.20), then with high probability, BASIC guarantees L 0 X t=1 (r π ⋆ t −r t ) = ˜ O p β 1 L 0 +β 2 C L 0 +β 2 c max +β 3 . 
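For concreteness, the weight schedule (4.20) used in Lemma 10 can be computed as follows (a direct transcription; β₁, β₂, and L are assumed inputs). The first branch of the min keeps α_i 2^i of order √(β₁L)/β₂ + 2^k so that the overheads θ_i are balanced across base algorithms, while the cap 1/(2(k_max − k)) guarantees that at least half of the probability mass stays on ALG_k.

import math

def gcobe_weights(k, k_max, beta1, beta2, L):
    # Sampling distribution (4.20) over the base algorithms ALG_k, ..., ALG_{k_max}.
    alpha = {}
    for i in range(k + 1, k_max + 1):
        alpha[i] = min((math.sqrt(beta1 * L) / beta2 + 2 ** k) / 2 ** i,
                       1.0 / (2 * (k_max - k)))
    alpha[k] = 1.0 - sum(alpha.values())  # remaining mass on the smallest hypothesis
    assert alpha[k] >= 0.5
    return alpha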
We see that even though our base algorithms achieve a gap-dependent bound ((4.18)), the advantage is not reflected in the final bound of BASIC (as can be seen in Lemma 10, we still do not achieve a gap-dependent bound). This is due to the fundamental limitation of general model selection problems (Pacchiano et al., 2020b). Therefore, Lemma 10 does not seem to give any advantage over Theorem 24. However, the hidden advantage of using base algorithms with gap-dependent bounds is that if a base algorithm well-specifies the total corruption, it will more quickly concentrate on the best policy. This enables the learner to identify the best policy faster. This is formalized in Lemma 11. 80 Lemma 11. Suppose that we run BASIC with base algorithms satisfying (4.18). Let L 0 ≤L be the round at which BASIC terminates. If 32 β 4 ∆ +β 2 C L 0 ≤β 2 2 k ≤ √ β 4 L, then with probability at least 1−O(δ), L 0 =L, and the following holds: L X t=1 1[i t =k]1[π t =π ⋆ ]> 1 2 L X t=1 1[i t =k]. (4.21) Lemma 11 ensures that if 2 k ≳ 1 ∆ +C, by looking at which policy is most frequently executed by ALG k , the learner can correctly identify the best policy b π =π ⋆ with high probability (by (4.21) and the definition of b π in G-COBE). Phase 2 of G-COBE In Phase 2, TwoModelSelect is executed, which is a model selection algorithm between b π andB b π . The high-level goal is to make the learner concentrate on executing b π until the end of T rounds if b π =π ⋆ and 1 ∆ +C is relatively small, and otherwise terminate the algorithm quickly before incurring too much regret. It proceeds in epochs of varying length, indexed with j. The quantity b ∆ j is an estimator of the gap between the average performance of b π andB b π at the beginning of epoch j; M j is the maximum possible length of epoch j, and p j is the probability that the learner choosesB b π in epoch j. The algorithm constantly monitors the difference between the average performance of b π andB b π (Line 9-Line 9 in TwoModelSelect). Whenever she finds that their performance gap is actually much smaller or larger than b ∆ j (i.e., if (4.22) or (4.23) holds), she updates b ∆ j ,M j , and p j , and restarts a new epoch. If at any time b ∆ j becomes smaller than b ∆ 1 , or j grows larger than 3 log 2 T, she terminates TwoModelSelect. We establish the following two key lemmas. Lemma 12. Let T 0 be the last round of TwoModelSelect, then with probability at least 1−O(δ), T 0 X t=1 (r π ⋆ t −r t ) = ˜ O p β 4 L +β 2 C T 0 +β 4 . 81 Algorithm 25: TwoModelSelect (L, b π,B b π ) 1 initialization: b ∆ 1 ← min n q β 4 L , 1 o , M 1 ← β 4 b ∆ 2 1 , t← 1. (β 4 defined in Algorithm 24) 2 for j = 1, 2,..., (3 log 2 T ) do 3 t j ←t, p j ← β 4 2M j b ∆ 2 j , and re-initializeB b π . 4 while t≤t j +M j − 1 do 5 Y t ← Bernoulli (p j ). 6 if Y t = 1 then ExecuteB b π for one round and updateB b π ; 7 else Execute b π for one round ; 8 t←t + 1. 9 Let b R 0 = 1 1−p j P t−1 τ=t j r τ 1[Y τ = 0], b R 1 = 1 p j P t−1 τ=t j r τ 1[Y τ = 1]. if b R 0 ≤ b R 1 + 1 2 (t−t j ) b ∆ j − 5 p j R B p j (t−t j ), p j √ β 1 L β 2 , (4.22) then b ∆ j+1 ← 1 1.25 b ∆ j and break; if b R 0 ≥ b R 1 + 3M j b ∆ j + 8 p β 1 L, (4.23) then b ∆ j+1 ← 1.25 b ∆ j and break. 10 11 if b ∆ j+1 < b ∆ 1 then return; 12 M j+1 ← 2(t−t j ) + β 4 b ∆ 2 j+1 . (4.24) Lemma 13. Let T 0 be last round of TwoModelSelect. If b π =π ∗ and √ β 4 L≥ 16 β 4 ∆ +β 2 C T 0 , then with probability at least 1−O(δ), it is terminated because the number of rounds reaches T. Theorem 25 can be obtained by combining Lemma 10–Lemma 13. 
All the proofs are in Appendix C.3.4. 4.5 Open problems Recent work by (He et al., 2022) has improved our result in Section 4.4 for linear contextual bandit with corruption. Specifically, they provide a computationally efficient algorithm that achieves ˜ O( √ T +C) regret with tight dependence on the feature dimension. Combining their algorithm with our model selection framework, the bound can even be achieved without knowledge of C. 82 These results are not the end of the story. Some open problems still remain (for the setting in Section 4.4): • For the tabular setting, our gap complexity measure is larger than those in (Simchowitz and Jamieson, 2019; Lykouris et al., 2019; Jin et al., 2021d). It is an important future direction to further improve our gap-dependent bound without sacrificing the worst-case dependence on T or C. • In the model mis-specification literature, Agarwal et al. (2020a); Zanette et al. (2021) define a new notion of local model mis-specification for the state aggregation scenario. It is much smaller and more favorable than the notion of model mis-specification defined in Jin et al. (2020c); Zanette et al. (2020b). Is there any counterpart for the corruption setting? If there is, how can we achieve robustness under such notion of corruption without prior knowledge? 83 Chapter 5 Learning in Adversarial MDPs with a Large State Space 5.1 Overview: adversarial MDPs with linear function approximation In this section, we investigate an MDP setting where the transition is fixed while the loss is adversarial. In the literature, this is usually called “adversarial MDPs.” Recall from the discussion in Section 4.4 that when the transition and loss are both adversarial, it is impossible to achieve any sublinear-in-T regret with polynomial dependencies on all parameters. However, as long as the transition is fixed, this becomes possible. Adversarial MDPs have been extensively studied under MDPs with finite number of states (Neu et al., 2013; Jin et al., 2020a; Shani et al., 2020). Similar to the case of adversarial bandits, the algorithmic framework of OMD/FTRL is used, which is usually referred to as the policy optimization approach in the reinforcement learning literature. However, due to the local-search nature of policy optimization approaches, global optimality guarantees often rely on unrealistic assumptions to ensure global exploration (see e.g., (Abbasi-Yadkori et al., 2019; Agarwal et al., 2020b; Neu and Olkhovskaya, 2020; Wei et al., 2021a)), making them theoretically less appealing compared to other methods. Motivated by this issue, a line of recent works (Cai et al., 2020; Shani et al., 2020; Agarwal et al., 2020a; Zanette et al., 2021) equip policy optimization with global exploration by adding exploration 84 bonuses to the update, and prove favorable guarantees even without making extra exploratory assumptions. Moreover, they all demonstrate some robustness aspect of policy optimization (such as being able to handle adversarial losses or a certain degree of model mis-specification). Despite such important progress, however, many limitations still exist, including worse regret rates comparing to the best value-based or model-based approaches (Shani et al., 2020; Agarwal et al., 2020a; Zanette et al., 2021), or requiring full-information feedback on the entire loss function (as opposed to the more realistic bandit feedback) (Cai et al., 2020). 
To address these issues, in this work we propose a new type of exploration bonuses called dilated bonuses, which satisfy a certain dilated Bellman equation and provably lead to improved exploration compared to existing works (Section 5.2). We apply this general idea to advance the state of the art of policy optimization for learning finite-horizon episodic MDPs with adversarial losses and bandit feedback. More specifically, our main results are:
• First, we consider a linear function approximation setting where the state-action values are linear in some known low-dimensional features and a simulator is also available, the same setting considered by (Neu and Olkhovskaya, 2020). We obtain the same Õ(T^(2/3)) regret while importantly removing their exploratory assumption (Section 5.3).
• Second, to remove the need for a sampling oracle, we further consider linear MDPs, a special case where the transition kernel is also linear in the features. To our knowledge, the only existing works that consider adversarial losses in this setup are (Cai et al., 2020), which obtains Õ(√T) regret but requires full-information feedback on the loss functions, and (Neu and Olkhovskaya, 2021) (an updated version of (Neu and Olkhovskaya, 2020)), which obtains Õ(√T) regret under bandit feedback but requires perfect knowledge of the transition as well as an exploratory assumption. We propose the first algorithm for the most challenging setting with bandit feedback and unknown transition, which achieves Õ(T^(14/15)) regret without any exploratory assumption (Section 5.4).
We emphasize that, unlike the tabular setting, in the two adversarial linear function approximation settings with bandit feedback that we consider, researchers had not been able to show any sublinear regret for policy optimization without exploratory assumptions before our work, which demonstrates the critical role of our proposed dilated bonuses. In fact, there are simply no existing algorithms with sublinear regret at all for these two settings, be they of policy-optimization type or not. This shows the advantage of policy optimization over other approaches when combined with our dilated bonuses.

Related work. In the tabular setting, except for (Shani et al., 2020), most algorithms apply the occupancy-measure-based framework to handle adversarial losses (e.g., (Rosenberg and Mansour, 2019a; Jin et al., 2020b; Chen et al., 2021b; Chen and Luo, 2021)), which as mentioned is computationally expensive. For stochastic losses, there are many more approaches, such as model-based ones (Jaksch et al., 2010; Dann and Brunskill, 2015; Azar et al., 2017; Fruit et al., 2018; Zanette and Brunskill, 2019) and value-based ones (Jin et al., 2018; Dong et al., 2019).
Theoretical studies of linear function approximation have gained increasing interest recently (Yang and Wang, 2020; Zanette et al., 2020a; Jin et al., 2020c). Most of them study stochastic/stationary losses, with the exception of (Cai et al., 2020; Neu and Olkhovskaya, 2020, 2021). Our algorithm for the linear MDP setting bears some similarity to those of (Agarwal et al., 2020a; Zanette et al., 2021), which consider stationary losses. However, in each episode, their algorithms first execute an exploratory policy (from a policy cover) and then switch to the policy suggested by the policy optimization algorithm, which inevitably leads to linear regret when facing adversarial losses.

Settings.
We consider an MDP specified by a state space X (possibly infinite), a finite action spaceA, and a transition functionP withP (·|x,a) specifying the distribution of the next state after taking action a in state x. In particular, we focus on the finite-horizon episodic setting in which X admits a layer structure and can be partitioned into X 0 ,X 1 ,...,X H for some fixed parameter H, where X 0 contains only the initial state x 0 , X H contains only the terminal state x H , and for any x∈X h , h = 0,...,H− 1, P(·|x,a) is supported on X h+1 for all a∈A (that is, transition is only possible from X h to X h+1 ). An episode refers to a trajectory that starts from x 0 and ends at x H following some series of actions and the transition dynamic. The MDP is assigned a loss function ℓ :X×A→ [0, 1] so that ℓ(x,a) specifies the loss suffered when selecting action a in state x. A policy π for the MDP is a mapping X→ ∆( A), where ∆( A) denotes the set of distributions over A and π(a|x) is the probability of choosing action a in state x. Given a loss function ℓ and a policy π, the expected total loss of π is given by V π (x 0 ;ℓ) = E P H−1 h=0 ℓ(x h ,a h ) a h ∼ π t (·|x h ),x h+1 ∼ P(·|x h ,a h ) . It can also be defined via the Bellman equation involving the state value function V π (x;ℓ) and the state-action value function Q π (x,a;ℓ) (a.k.a. Q-function) defined as below: V (x H ;ℓ) = 0, Q π (x,a;ℓ) =ℓ(x,a) +E x ′ ∼P (·|x,a) V π (x ′ ;ℓ) , and V π (x;ℓ) =E a∼π(·|x) [Q π (x,a;ℓ)]. We study online learning in such a finite-horizon MDP with unknown transition, bandit feedback, and adversarial losses. The learning proceeds through T episodes. Ahead of time, an adversary arbitrarily determines T loss functions ℓ 1 ,...,ℓ T , without revealing them to the learner. Then in each episode t, the learner generates a policy π t based on all information received prior to this episode, executes π t starting from the initial state x 0 , and generates and observes a trajectory 87 {(x t,h ,a t,h ,ℓ t (x t,h ,a t,h ))} H−1 h=0 . Importantly, the learner does not observe any other information about ℓ t (a.k.a. bandit feedback). ∗ The goal of the learner is to minimize the regret, defined as Reg = T X t=1 V πt t (x 0 )− min π T X t=1 V π t (x 0 ), where we use V π t (x) as a shorthand for V π (x;ℓ t ) (and similarly Q π t (x,a) as a shorthand for Q π (x,a;ℓ t )). Without further structure, the best existing regret bound is ˜ O(H|X| p |A|T) (Jin et al., 2020b), with an extra √ X factor compared to the best existing lower bound (Jin et al., 2018). Occupancy measures. For a policy π and a state x, we define q π (x) to be the probability (or probability measure when|X| is infinite) of visiting state x within an episode when following π. When it is necessary to highlight the dependence on the transition, we write it as q P,π (x). Further define q π (x,a) = q π (x)π(a|x) and q t (x,a) = q πt (x,a). Finally, we use q ⋆ as a shorthand for q π ⋆ where π ⋆ ∈ argmin π P T t=1 V π t (x 0 ) is one of the optimal policies. Note that by definition, we have V π (x 0 ;ℓ) = P x,a q π (x,a)ℓ(x,a). In fact, we will overload notation and let V π (x 0 ;b) = P x,a q π (x,a)b(x,a) for any function b :X×A→R (even though it might not correspond to a real loss function). 5.2 Dilated exploration bonuses In this section, we start with a general discussion on designing exploration bonuses (not specific to policy optimization), and then introduce our new dilated bonuses for policy optimization. 
For simplicity, the exposition in this section assumes a finite state space, but the idea generalizes to an infinite state space. ∗ Full-information feedback, on the other hand, refers to the easier setting where the entire loss function ℓt is revealed to the learner at the end of episode t. 88 When analyzing the regret of an algorithm, very often we run into the following form: Reg = T X t=1 V πt t (x 0 )− T X t=1 V π ⋆ t (x 0 )≤o(T ) + T X t=1 X x,a q ⋆ (x,a)b t (x,a) =o(T ) + T X t=1 V π ⋆ (x 0 ;b t ), (5.1) forsomefunctionb t (x,a)usuallyrelatedtosomeestimationerrororvariancethatcanbeprohibitively large. For example, in policy optimization, the algorithm performs local search in each state essentially using a multi-armed bandit algorithm and treating Q πt t (x,a) as the loss of action a in state x. Since Q πt t (x,a) is unknown, however, the algorithm has to use some estimator of Q πt t (x,a) instead, whose bias and variance both contribute to the b t function. Usually, b t (x,a) is large for a rarely visited state-action pair (x,a) and is inversely related to q t (x,a), which is exactly why most analyses rely on the assumption that some distribution mismatch coefficient related to q ⋆ (x,a) /qt(x,a) is bounded (see e.g., (Agarwal et al., 2020b; Wei et al., 2020a)). On the other hand, an important observation is that while V π ⋆ (x 0 ;b t ) can be prohibitively large, its counterpart with respect to the learner’s policyV πt (x 0 ;b t ) is usually nicely bounded. For example, if b t (x,a) is inversely related to q t (x,a) as mentioned, then V πt (x 0 ;b t ) = P x,a q t (x,a)b t (x,a) is small no matter how small q t (x,a) could be for some (x,a). This observation, together with the linearity property V π (x 0 ;ℓ t −b t ) =V π (x 0 ;ℓ t )−V π (x 0 ;b t ), suggests that we treat ℓ t −b t as the loss function of the problem, or in other words, add a (negative) bonus to each state-action pair, which intuitively encourages exploration due to underestimation. Indeed, assuming for a moment that (5.1) still roughly holds even if we treat ℓ t −b t as the loss function: T X t=1 V πt (x 0 ;ℓ t −b t )− T X t=1 V π ⋆ (x 0 ;ℓ t −b t )≲o(T ) + T X t=1 V π ⋆ (x 0 ;b t ). (5.2) 89 Then by linearity and rearranging, we have Reg = T X t=1 V πt t (x 0 )− T X t=1 V π ⋆ t (x 0 )≲o(T ) + T X t=1 V πt (x 0 ;b t ). (5.3) Due to the switch from π ⋆ to π t in the last term compared to (5.1), this is usually enough to prove a desirable regret bound without making extra assumptions. The caveat of this discussion is the assumption of (5.2). Indeed, after adding the bonuses, which itself contributes some more bias and variance, one should expect that b t on the right-hand side of (5.2) becomes something larger, breaking the desired cancellation effect to achieve (5.3). Indeed, the definition of b t essentially becomes circular in this sense. Dilated Bonuses for Policy Optimization To address this issue, we take a closer look at the policy optimization algorithm specifically. As mentioned, policy optimization decomposes the problem into individual multi-armed bandit problems in each state and then performs local optimization. This is based on the well-known performance difference lemma (Kakade and Langford, 2002): Reg = X x q ⋆ (x) T X t=1 X a π t (a|x)−π ⋆ (a|x) Q πt t (x,a), showing that in each statex, the learner is facing a bandit problem with Q πt t (x,a) being the loss for action a. 
Correspondingly, incorporating the bonuses b_t into policy optimization means subtracting the bonus Q^{π_t}(x,a; b_t) from Q^{π_t}_t(x,a) for each action a in each state x. Recall that Q^{π_t}(x,a; b_t) satisfies the Bellman equation Q^{π_t}(x,a; b_t) = b_t(x,a) + E_{x' ∼ P(·|x,a)} E_{a' ∼ π_t(·|x')}[ Q^{π_t}(x',a'; b_t) ]. To resolve the issue mentioned earlier, we propose to replace this bonus function Q^{π_t}(x,a; b_t) with its dilated version B_t(x,a), satisfying the following dilated Bellman equation:

B_t(x,a) = b_t(x,a) + (1 + 1/H) E_{x' ∼ P(·|x,a)} E_{a' ∼ π_t(·|x')}[ B_t(x',a') ]   (5.4)

(with B_t(x_H, a) = 0 for all a). The only difference compared to the standard Bellman equation is the extra (1 + 1/H) factor, which slightly increases the weight of deeper layers and thus intuitively induces more exploration for those layers.

Due to the extra bonus compared to Q^{π_t}(x,a; b_t), the regret bound also increases accordingly. In all our applications, this extra amount of regret turns out to be of the form (1/H) Σ_{t=1}^T Σ_{x,a} q^⋆(x) π_t(a|x) B_t(x,a), leading to

Σ_x q^⋆(x) Σ_{t=1}^T Σ_a ( π_t(a|x) − π^⋆(a|x) ) ( Q^{π_t}_t(x,a) − B_t(x,a) ) ≤ o(T) + Σ_{t=1}^T V^{π^⋆}(x_0; b_t) + (1/H) Σ_{t=1}^T Σ_{x,a} q^⋆(x) π_t(a|x) B_t(x,a).   (5.5)

With some direct calculation, one can show that this is enough to obtain a regret bound that is only a constant factor larger than the desired bound in (5.3). This is summarized in the following lemma.

Lemma 14. If (5.5) holds with B_t defined in (5.4), then Reg ≤ o(T) + 3 Σ_{t=1}^T V^{π_t}(x_0; b_t).

The high-level idea of the proof is to show that the bonus added to a layer h is enough to cancel the large bias/variance terms (including those coming from the bonus itself) from layer h+1. Cancellation therefore happens in a layer-by-layer manner, except for layer 0, where the total amount of bonus can be shown to be at most (1 + 1/H)^H Σ_{t=1}^T V^{π_t}(x_0; b_t) ≤ 3 Σ_{t=1}^T V^{π_t}(x_0; b_t). Recalling again that V^{π_t}(x_0; b_t) is usually nicely bounded, we thus arrive at a favorable regret guarantee without making extra assumptions. Of course, since the transition is unknown, we cannot compute B_t exactly. However, Lemma 14 is robust enough to handle either a good approximate version of B_t (see Lemma 67) or a version where (5.4) and (5.5) only hold in expectation (see Lemma 68), which is enough for us to handle unknown transition. In the next three sections, we apply this general idea to different settings, showing what b_t and B_t are concretely in each case.
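For concreteness, here is a minimal finite-state sketch of the dilated Bellman recursion (5.4), computing B_t by backward induction given per-pair bonuses b_t, the transition, and the current policy. The array shapes follow the earlier illustrative sketch and are hypothetical; in the actual algorithms the transition is unknown and B_t is only estimated.

```python
import numpy as np

def dilated_bonus(b, P, pi, H, layers):
    """Backward induction for the dilated Bellman equation (5.4):
       B(x,a) = b(x,a) + (1 + 1/H) * E_{x'~P(.|x,a), a'~pi(.|x')}[B(x',a')],
       with B(x_H, a) = 0. b[h], P[h], pi[h] are per-layer arrays."""
    B_next_avg = np.zeros(layers[H])          # E_{a'~pi}[B(x_H, a')] = 0
    B = [None] * H
    for h in reversed(range(H)):
        B[h] = b[h] + (1 + 1 / H) * (P[h] @ B_next_avg)
        B_next_avg = np.einsum('xa,xa->x', pi[h], B[h])
    return B

# Tiny hypothetical usage with random inputs.
H, A, layers = 3, 2, [1, 2, 2, 1]
rng = np.random.default_rng(1)
P = [rng.dirichlet(np.ones(layers[h + 1]), size=(layers[h], A)) for h in range(H)]
pi = [rng.dirichlet(np.ones(A), size=layers[h]) for h in range(H)]
b = [rng.uniform(0, 0.1, size=(layers[h], A)) for h in range(H)]
B = dilated_bonus(b, P, pi, H, layers)
```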
5.3 The linear-Q case

In this section, we consider the most basic linear function approximation scheme, where for any π, the Q-function Q^π_t(x,a) is linear in some known feature vector φ(x,a), formally stated below.

Assumption 8 (Linear-Q). Let φ(x,a) ∈ R^d be a known feature vector of the state-action pair (x,a). We assume that for any episode t, policy π, and layer h, there exists an unknown weight vector θ^π_{t,h} ∈ R^d such that for all (x,a) ∈ X_h × A, Q^π_t(x,a) = φ(x,a)^⊤ θ^π_{t,h}. Without loss of generality, we assume ∥φ(x,a)∥ ≤ 1 for all (x,a) and ∥θ^π_{t,h}∥ ≤ √dH for all t, h, π.

For a justification of the last condition on norms, see (Wei et al., 2021a, Lemma 8). This linear-Q assumption has been made in several recent works with stationary losses (Abbasi-Yadkori et al., 2019; Wei et al., 2021a) and also in (Neu and Olkhovskaya, 2020) with the same adversarial losses (their assumption is stated slightly differently, e.g., their feature vectors are independent of the action, but it is straightforward to verify that the two versions are equivalent). It is weaker than the linear MDP assumption (see Section 5.4), as it does not pose explicit structural requirements on the loss and transition functions. Due to this generality, however, our algorithm also requires access to a simulator to obtain samples drawn from the transition, formally stated below.

Assumption 9 (Simulator). The learner has access to a simulator which takes a state-action pair (x,a) ∈ X × A as input and generates a random outcome of the next state x' ∼ P(·|x,a).

Note that this assumption is also made by (Neu and Olkhovskaya, 2020) and other earlier works with stationary losses (see e.g., (Azar et al., 2012; Sidford et al., 2018)); in fact, the simulator required by Neu and Olkhovskaya (2020) is slightly weaker than ours and those from earlier works, as it only needs to be able to generate a trajectory starting from x_0 for any policy. In this setting, we propose a new policy optimization algorithm with Õ(T^{2/3}) regret. See Algorithm 26 for the pseudocode.

Algorithm 26: Policy Optimization with Dilated Bonuses (Linear-Q Case)
1  parameters: γ, β, η, ε ∈ (0, 1/2), M = ⌈24 ln(dHT)/(ε²γ²)⌉, N = ⌈(2/γ) ln(1/(εγ))⌉.
2  for t = 1, 2, ..., T do
3    Step 1: Interact with the environment. Execute π_t, which is defined such that for each x ∈ X_h,
         π_t(a|x) ∝ exp( −η Σ_{τ=1}^{t−1} ( φ(x,a)^⊤ θ̂_{τ,h} − Bonus(τ,x,a) ) ),   (5.6)
     and obtain the trajectory (x_{t,h}, a_{t,h}, ℓ_t(x_{t,h}, a_{t,h}))_{h=0}^{H−1}.
4    Step 2: Construct covariance matrix inverse estimators. Collect MN trajectories using the simulator and π_t. Let T_t be the set of trajectories. Compute {Σ̂⁺_{t,h}}_{h=0}^{H−1} = GeometricResampling(T_t, M, N, γ) (see Algorithm 28).
5    Step 3: Construct Q-function weight estimators. For h = 0, ..., H−1, compute
         θ̂_{t,h} = Σ̂⁺_{t,h} φ(x_{t,h}, a_{t,h}) L_{t,h},  where  L_{t,h} = Σ_{i=h}^{H−1} ℓ_t(x_{t,i}, a_{t,i}).   (5.7)

Algorithm 27: Bonus(t, x, a)
1  if Bonus(t,x,a) has been called before then return the value calculated last time.
2  Let h be such that x ∈ X_h. If h = H, return 0.
3  Compute π_t(·|x), defined in (5.6) (which involves recursive calls to Bonus for smaller t).
4  Get a sample of the next state x' ← Simulator(x,a).
5  Compute π_t(·|x') (again defined in (5.6)), and sample an action a' ∼ π_t(·|x').
6  return β∥φ(x,a)∥²_{Σ̂⁺_{t,h}} + E_{j∼π_t(·|x)}[ β∥φ(x,j)∥²_{Σ̂⁺_{t,h}} ] + (1 + 1/H) Bonus(t, x', a').

Algorithm 28: GeometricResampling(T, M, N, γ)
1  Denote the MN trajectories in T by (x_{i,0}, a_{i,0}, ..., x_{i,H−1}, a_{i,H−1}), i = 1, ..., MN. Let c = 1/2.
2  for m = 1, ..., M do
3    for n = 1, ..., N do
4      i = (m−1)N + n.
5      For all h, compute Y_{n,h} = γI + φ(x_{i,h}, a_{i,h}) φ(x_{i,h}, a_{i,h})^⊤.
6      For all h, compute Z_{n,h} = Π_{j=1}^n (I − cY_{j,h}).
7    For all h, set Σ̂^{+(m)}_h = cI + c Σ_{n=1}^N Z_{n,h}.
8  For all h, set Σ̂⁺_h = (1/M) Σ_{m=1}^M Σ̂^{+(m)}_h.
9  return Σ̂⁺_h for all h = 0, ..., H−1.

Algorithm design. The algorithm still follows the multiplicative weight update (5.6) in each state x ∈ X_h (for some h), but now with φ(x,a)^⊤ θ̂_{t,h} as an estimator for Q^{π_t}_t(x,a) = φ(x,a)^⊤ θ^{π_t}_{t,h}, and Bonus(t,x,a) as the dilated bonus B_t(x,a). Specifically, the construction of the weight estimator θ̂_{t,h} follows the idea of (Neu and Olkhovskaya, 2020) (which itself is based on the linear bandit literature) and is defined in (5.7) as θ̂_{t,h} = Σ̂⁺_{t,h} φ(x_{t,h}, a_{t,h}) L_{t,h}.
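The following Python sketch mirrors Algorithm 28 for a single layer, estimating (γI + Σ)^{-1} from i.i.d. feature samples. The synthetic features and the final comparison with the exact inverse are illustrations only; the estimate is accurate only in expectation and for sufficiently large M and N.

```python
import numpy as np

def geometric_resampling(phis, M, N, gamma, c=0.5):
    """Sketch of GeometricResampling for one layer: an estimate of (gamma*I + Sigma)^{-1},
       where Sigma = E[phi phi^T] and `phis` holds M*N i.i.d. feature vectors
       (in the algorithm, these come from trajectories generated with the simulator)."""
    d = phis.shape[1]
    est = np.zeros((d, d))
    idx = 0
    for _ in range(M):
        Z = np.eye(d)
        acc = np.zeros((d, d))
        for _ in range(N):
            phi = phis[idx]; idx += 1
            Y = gamma * np.eye(d) + np.outer(phi, phi)   # Y_{n,h}
            Z = Z @ (np.eye(d) - c * Y)                  # Z_{n,h}
            acc += Z
        est += c * np.eye(d) + c * acc                   # Sigma_hat^{+(m)}
    return est / M

# Numerical illustration on synthetic features with norm at most 1 (no accuracy claim).
rng = np.random.default_rng(0)
d, M, N, gamma = 3, 200, 400, 0.05
phis = rng.normal(size=(M * N, d))
phis /= np.maximum(1.0, np.linalg.norm(phis, axis=1, keepdims=True))
Sigma = (phis.T @ phis) / len(phis)
approx = geometric_resampling(phis, M, N, gamma)
exact = np.linalg.inv(gamma * np.eye(d) + Sigma)
print(np.linalg.norm(approx - exact))
```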
In (5.7), Σ̂⁺_{t,h} is an ε-accurate estimator of (γI + Σ_{t,h})^{−1}, where γ is a small parameter and Σ_{t,h} = E_t[ φ(x_{t,h}, a_{t,h}) φ(x_{t,h}, a_{t,h})^⊤ ] is the covariance matrix for layer h under policy π_t; L_{t,h} = Σ_{i=h}^{H−1} ℓ_t(x_{t,i}, a_{t,i}) is again the loss suffered by the learner starting from layer h, whose conditional expectation is Q^{π_t}_t(x_{t,h}, a_{t,h}) = φ(x_{t,h}, a_{t,h})^⊤ θ^{π_t}_{t,h}. Therefore, as γ and ε approach 0, one can see that θ̂_{t,h} is indeed an unbiased estimator of θ^{π_t}_{t,h}. We adopt the GeometricResampling procedure (Algorithm 28) of (Neu and Olkhovskaya, 2020) to compute Σ̂⁺_{t,h}, which requires calling the simulator multiple times.

Next, we explain the design of the dilated bonus. Following the general principle discussed in Section 5.2, we identify b_t(x,a) in this case as β∥φ(x,a)∥²_{Σ̂⁺_{t,h}} + E_{j∼π_t(·|x)}[ β∥φ(x,j)∥²_{Σ̂⁺_{t,h}} ] for some parameter β > 0. Further following the dilated Bellman equation (5.4), we define Bonus(t,x,a) recursively as in the last line of Algorithm 27, where we replace the expectation E_{(x',a')}[Bonus(t,x',a')] with a single sample for efficient implementation.

However, even more care is needed to actually implement the algorithm. First, since the state space is potentially infinite, one cannot calculate and store the value of Bonus(t,x,a) for all (x,a), but can only calculate them on the fly when needed. Moreover, unlike the estimators for Q^{π_t}_t(x,a), which can be succinctly represented and stored via the weight estimator θ̂_{t,h}, this is not possible for Bonus(t,x,a) due to the lack of any structure. Even worse, the definition of Bonus(t,x,a) itself depends on π_t(·|x) and also π_t(·|x') for the next state x', which, according to (5.6), further depends on Bonus(τ,x,a) for τ < t, resulting in a complicated recursive structure. This is also why we present it as a procedure in Algorithm 27 (instead of as a function B_t(x,a)). In total, this leads to (TAH)^{O(H)} calls to the simulator. Whether this can be improved is left as a future direction.

Regret guarantee. By showing that (5.5) holds in expectation for our algorithm, we obtain the following regret guarantee (see Appendix D.4 for the proof).

Theorem 26. Under Assumption 8 and Assumption 9, with appropriate choices of the parameters γ, β, η, ε, Algorithm 26 ensures E[Reg] = Õ(H²(dT)^{2/3}) (the dependence on |A| is only logarithmic).

This matches the Õ(T^{2/3}) regret of (Neu and Olkhovskaya, 2020, Theorem 1), without the need of their assumption that essentially says the learner is given an exploratory policy to start with (under an even stronger assumption that every policy is exploratory, they further improve the regret to Õ(√T); see (Neu and Olkhovskaya, 2020, Theorem 2)). To our knowledge, this is the first no-regret algorithm for the linear-Q setting (with adversarial losses and bandit feedback) when no exploratory assumptions are made.

5.4 The linear MDP case

To remove the need of a simulator, we further consider the linear MDP case, a special case of the linear-Q setting. It is equivalent to Assumption 8 plus the extra assumption that the transition function also has a low-rank structure, formally stated below.

Assumption 10 (Linear MDP). The MDP satisfies Assumption 8, and for any h and x' ∈ X_{h+1}, there exists an unknown weight vector ν^{x'}_h ∈ R^d such that P(x'|x,a) = φ(x,a)^⊤ ν^{x'}_h for all (x,a) ∈ X_h × A.

There have been several works studying this setting, with (Cai et al., 2020) being the closest to us.
They achieve Õ(√T) regret but require full-information feedback of the loss functions, and there are no existing results for the bandit feedback setting, except for the concurrent work (Neu and Olkhovskaya, 2021), which assumes perfect knowledge of the transition and an exploratory condition. We propose the first algorithm with sublinear regret for this problem with unknown transition and bandit feedback, shown in Algorithm 29. The structure of Algorithm 29 is similar to that of Algorithm 26, but importantly with the following modifications.

A succinct representation of dilated bonuses. Our definition of b_t remains the same as in the linear-Q case. However, thanks to the low-rank transition structure of linear MDPs, we are now able to efficiently construct estimators of B_t(x,a) even for unseen state-action pairs using function approximation, bypassing the requirement of a simulator. Specifically, observe that according to (5.4), under Assumption 10, for each x ∈ X_h, B_t(x,a) can be written as b_t(x,a) + φ(x,a)^⊤ Λ^{π_t}_{t,h}, where Λ^{π_t}_{t,h} = (1 + 1/H) ∫_{x' ∈ X_{h+1}} E_{a' ∼ π_t(·|x')}[ B_t(x',a') ] ν^{x'}_h dx' is a vector independent of (x,a). Thus, following the same idea of using θ̂_{t,h} to estimate θ^{π_t}_{t,h} as in Algorithm 26, we can construct Λ̂_{t,h} to estimate Λ^{π_t}_{t,h} as well, thus succinctly representing B_t(x,a) for all (x,a).

Epoch schedule. Recall that estimating θ^{π_t}_{t,h} (and thus also Λ^{π_t}_{t,h}) requires constructing the covariance matrix inverse estimate Σ̂⁺_{t,h}. Due to the lack of a simulator, another important change of the algorithm is to construct Σ̂⁺_{t,h} using online samples. To do so, we divide the entire horizon (or more accurately, the last T − T_0 rounds, since the first T_0 rounds are reserved for another purpose discussed next) into epochs of equal length W, and only update the policy optimization algorithm at the beginning of each epoch. We index epochs by k, and thus θ^{π_t}_{t,h}, Λ^{π_t}_{t,h}, Σ̂⁺_{t,h} are now denoted by θ^{π_k}_{k,h}, Λ^{π_k}_{k,h}, Σ̂⁺_{k,h}. Within an epoch, the algorithm keeps executing the same policy π_k (up to a small exploration probability δ_e) and collects W trajectories, which are then used to construct Σ̂⁺_{k,h} as well as estimates of θ^{π_k}_{k,h} and Λ^{π_k}_{k,h}. To decouple their dependence, the algorithm uniformly at random partitions these W trajectories into two sets S and S′ of equal size, and uses data from S to construct Σ̂⁺_{k,h} in Step 2 via the same GeometricResampling procedure, and data from S′ to construct the estimates of θ^{π_k}_{k,h} and Λ^{π_k}_{k,h} in Step 3 and Step 4, respectively.
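For concreteness, the succinct representation and the corresponding weight estimate can be sketched as below. This anticipates the form of the estimators (5.10) and (5.11) given in Algorithm 29, with the importance weights omitted; all names and shapes are hypothetical.

```python
import numpy as np

def dilated_future_bonus(b_along_traj, h, H):
    """D_{t,h} = sum_{i=h+1}^{H-1} (1 + 1/H)^{i-h} * b_k(x_{t,i}, a_{t,i}),
       the dilated sum of future bonuses along one trajectory (cf. Algorithm 29)."""
    return sum((1.0 + 1.0 / H) ** (i - h) * b_along_traj[i] for i in range(h + 1, H))

def bonus_weight_estimate(Sigma_plus_hat, feats_h, D_h):
    """Lambda_hat ~ Sigma_plus_hat @ mean_t[ phi(x_{t,h}, a_{t,h}) * D_{t,h} ]
       (the (1 - Y_t)/Y_t reweighting of (5.11) is omitted in this sketch).
       feats_h: shape (num_traj, d); D_h: shape (num_traj,)."""
    return Sigma_plus_hat @ np.mean(feats_h * D_h[:, None], axis=0)

def dilated_bonus_value(phi_xa, b_xa, Lambda_hat):
    """Succinct representation: B_hat(x, a) = b(x, a) + phi(x, a)^T Lambda_hat."""
    return b_xa + phi_xa @ Lambda_hat
```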
Exploration with a policy cover. Unfortunately, a technical difficulty arises when bounding the estimation error and the variance of Λ̂_{k,h}. Specifically, they can be large if the magnitude of the bonus term b_k(x,a) is large for some (x,a); furthermore, since Λ̂_{k,h} is constructed using empirical samples, its variance can be even larger in those directions of the feature space that are rarely visited. Due to the combined effect of these two facts, we are unable to prove any sublinear regret with only the ideas described so far.

To address this issue, we adopt the idea of a policy cover, recently introduced in (Agarwal et al., 2020a; Zanette et al., 2021). Specifically, the algorithm spends the first T_0 rounds finding an exploratory (mixture) policy π^cov (called the policy cover), which tends to reach all possible directions of the feature space. This is done via the procedure PolicyCover (Algorithm 30) (discussed in detail soon), which also returns Σ̂^cov_h for each layer h, an estimator of the true covariance matrix Σ^cov_h of the policy cover π^cov. PolicyCover guarantees that with high probability, for any policy π and any h we have

Pr_{x_h ∼ π}( ∃a, ∥φ(x_h,a)∥²_{(Σ̂^cov_h)^{−1}} ≥ α ) ≤ Õ(dH/α),   (5.12)

where x_h ∈ X_h is sampled from executing π; see Lemma 66. This motivates us to focus only on states x such that ∥φ(x,a)∥²_{(Σ̂^cov_h)^{−1}} ≤ α for all a (with h the layer to which x belongs). Doing so does not incur much regret, because no policy visits the other states often enough. We call such a state a known state and denote by K the set of all known states. To implement this idea, we simply introduce an indicator 1[x ∈ K] in the definition of b_k (that is, no bonus at all for unknown states). The benefit of doing so is that the aforementioned issue of b_k(x,a) having a large magnitude is alleviated, as long as the algorithm explores using π^cov with some small probability in each episode.

Specifically, in each episode of epoch k, with probability 1 − δ_e we execute π_k as suggested by policy optimization, and otherwise we explore using π^cov. The way the algorithm explores differs slightly for episodes in S and those in S′ (recall that an epoch is partitioned evenly into S and S′, where S is used to estimate Σ̂⁺_{k,h} and S′ is used to estimate θ^{π_k}_{k,h} and Λ^{π_k}_{k,h}). For an episode in S, we simply explore by executing π^cov for the entire episode, so that Σ̂⁺_{k,h} is an estimate of the inverse of γI + δ_e Σ^cov_h + (1 − δ_e) E_{(x_h,a) ∼ π_k}[ φ(x_h,a) φ(x_h,a)^⊤ ], and thus by its definition b_k(x,a) is bounded by roughly αβ/δ_e for all (x,a) (this improves over the trivial bound β/γ by our choice of parameters; see Lemma 71). On the other hand, for an episode in S′, we first uniformly at random draw a step h*_t, then execute π^cov for the first h*_t steps and continue with π_k for the rest. This leads to a slightly different form of the estimators θ̂_{k,h} and Λ̂_{k,h} compared to (5.7) (see (5.10) and (5.11), where the definition of D_{t,h} is in light of (5.4)), which is important to ensure that they have almost no bias. This also concludes the description of Step 1.

We note that the idea of dividing states into known and unknown parts is related to those of (Agarwal et al., 2020a; Zanette et al., 2021). However, our case is more challenging, because we are only allowed to mix a small amount of π^cov into our policy in order to get sublinear regret against an adversary, while their algorithms can always start by executing π^cov in each episode to maximally explore the feature space.
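The membership test defining the known set K is a simple quadratic-form check; a minimal sketch follows (the matrices and thresholds used here are random placeholders, not the quantities produced by Algorithm 30).

```python
import numpy as np

def is_known(phi_actions, Sigma_cov_hat, alpha):
    """A state x (in layer h) is 'known' iff ||phi(x,a)||^2_{(Sigma_cov_hat)^{-1}} <= alpha
       for every action a. `phi_actions` stacks phi(x, a) for all a, shape (|A|, d)."""
    inv = np.linalg.inv(Sigma_cov_hat)
    quad = np.einsum('ad,de,ae->a', phi_actions, inv, phi_actions)
    return bool(np.all(quad <= alpha))

# Hypothetical usage: the bonus b_k(x, a) is simply multiplied by this indicator.
rng = np.random.default_rng(0)
A_mat = np.eye(4) + 0.1 * rng.standard_normal((4, 4))
Sigma_cov_hat = A_mat @ A_mat.T                      # a positive-definite stand-in
print(is_known(0.1 * rng.standard_normal((3, 4)), Sigma_cov_hat, alpha=0.5))
```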
Constructing the policy cover. Finally, we describe how Algorithm 30 finds a policy cover π^cov. It is a procedure very similar to Algorithm 1 of (Wang et al., 2020b). The focus of Wang et al. (2020b) is reward-free exploration in linear MDPs, but it turns out that the same idea can be used for our purpose, and it is also related to the exploration strategy introduced in (Agarwal et al., 2020a; Zanette et al., 2021). More specifically, PolicyCover interacts with the environment for T_0 = M_0 N_0 rounds. At the beginning of episode (m−1)N_0 + 1, for every m = 1, ..., M_0, it computes a policy π_m using the LSVI-UCB algorithm of (Jin et al., 2020c) but with a fake reward function (5.13) (ignoring the true loss feedback from the environment). This fake reward function is designed to encourage the learner to explore unseen state-action pairs and to ensure (5.12) eventually. For this purpose, we could have set the fake reward for (x,a) to be 1[ ∥φ(x,a)∥²_{Γ^{−1}_{m,h}} ≥ α/(2M_0) ]. However, for technical reasons the analysis requires the reward function to be Lipschitz, and thus we approximate this indicator with a ramp function (with a large slope T). With π_m in hand, the algorithm then interacts with the environment for N_0 episodes, collecting trajectories to construct a good estimator of the covariance matrix of π_m. The design of the fake reward function and this extra covariance estimation step are the only differences compared to Algorithm 1 of (Wang et al., 2020b). At the end of the procedure, PolicyCover constructs π^cov as a uniform mixture of {π_m}_{m=1,...,M_0}. This means that whenever we execute π^cov, we first uniformly at random sample m ∈ [M_0] and then execute the (pure) policy π_m.

Regret guarantee. With all these elements, we successfully remove the need of a simulator and prove the following regret guarantee.

Theorem 27. Under Assumption 10, Algorithm 29 with appropriate choices of the parameters ensures E[Reg] = Õ(d²H⁴T^{14/15}).

Although our regret rate is significantly worse than that in the full-information setting (Cai et al., 2020), in the stochastic setting (Zanette et al., 2021), or in the case where the transition is known (Neu and Olkhovskaya, 2021), we emphasize again that our algorithm is the first with a provable sublinear regret guarantee for this challenging adversarial setting with bandit feedback and unknown transition.

5.5 Open problems

One future direction is to further improve our results, including reducing the number of simulator calls in the linear-Q setting and improving the regret bound for the linear MDP setting (which is currently far from optimal). A potential idea for the latter is to reuse data across different epochs, an idea adopted by several recent works (Zanette et al., 2021; Lazic et al., 2021) for different problems. Another key future direction is to investigate whether the idea of dilated bonuses is applicable beyond the finite-horizon setting (e.g., whether it is applicable to the more general stochastic shortest path model or the infinite-horizon setting).

Algorithm 29: Policy Optimization with Dilated Bonuses (Linear MDP Case)
1  Parameters: γ, β, η, ε, δ_e ∈ (0, 1/2), δ, M = ⌈96 ln(dHT)/(ε²γ²)⌉, N = ⌈(2/γ) ln(1/(εγ))⌉, W = 2MN, α = δ_e/(6β), M_0 = α²dH², N_0 = 100M_0⁴ log(T/δ)/α², T_0 = M_0 N_0.
2  Construct a mixture policy π^cov and its estimated covariance matrices (which requires interacting with the environment for the first T_0 rounds using Algorithm 30):
       (π^cov, {Σ̂^cov_h}_{h=0,...,H−1}) ← PolicyCover(M_0, N_0, α, δ).
3  Define the known state set K = { x ∈ X : ∀a ∈ A, ∥φ(x,a)∥²_{(Σ̂^cov_h)^{−1}} ≤ α, where h is such that x ∈ X_h }.
4  for k = 1, 2, ..., (T − T_0)/W do
5    Step 1: Interact with the environment. Define π_k as follows: for x ∈ X_h,
         π_k(a|x) ∝ exp( −η Σ_{τ=1}^{k−1} ( φ(x,a)^⊤ θ̂_{τ,h} − φ(x,a)^⊤ Λ̂_{τ,h} − b_τ(x,a) ) ),   (5.8)
     where b_τ(x,a) = ( β∥φ(x,a)∥²_{Σ̂⁺_{τ,h}} + β E_{a' ∼ π_τ(·|x)}[ ∥φ(x,a')∥²_{Σ̂⁺_{τ,h}} ] ) · 1[x ∈ K].
6    Randomly partition {T_0 + (k−1)W + 1, ..., T_0 + kW} into two parts S and S′ with |S| = |S′| = W/2.
7    for t = T_0 + (k−1)W + 1, ..., T_0 + kW do
8      Draw Y_t ∼ Bernoulli(δ_e).
9      if Y_t = 1 then
10       if t ∈ S then execute π^cov;
11       else draw h*_t uniformly
from {0, ..., H−1}; execute π^cov in steps 0, ..., h*_t − 1 and π_k in steps h*_t, ..., H−1.
12     else execute π_k.
13     Collect the trajectory {(x_{t,h}, a_{t,h}, ℓ_t(x_{t,h}, a_{t,h}))}_{h=0}^{H−1}.
14   Step 2: Construct inverse covariance matrix estimators. Let T_k = {(x_{t,0}, a_{t,0}, ..., x_{t,H−1}, a_{t,H−1})}_{t∈S} (the trajectories in S), and compute
       {Σ̂⁺_{k,h}}_{h=0}^{H−1} = GeometricResampling(T_k, M, N, γ).   (5.9)
15   Step 3: Construct Q-function weight estimators. Compute for all h (with L_{t,h} = Σ_{i=h}^{H−1} ℓ_t(x_{t,i}, a_{t,i})):
       θ̂_{k,h} = Σ̂⁺_{k,h} (1/|S′|) Σ_{t∈S′} ( (1−Y_t) + Y_t H·1[h = h*_t] ) φ(x_{t,h}, a_{t,h}) L_{t,h}.   (5.10)
16   Step 4: Construct bonus function weight estimators. Compute for all h:
       Λ̂_{k,h} = Σ̂⁺_{k,h} (1/|S′|) Σ_{t∈S′} ( (1−Y_t) + Y_t H·1[h = h*_t] ) φ(x_{t,h}, a_{t,h}) D_{t,h},   (5.11)
     where D_{t,h} = Σ_{i=h+1}^{H−1} (1 + 1/H)^{i−h} b_k(x_{t,i}, a_{t,i}).

Algorithm 30: PolicyCover(M_0, N_0, α, δ)
1  Let ξ = 60dH√(log(T/δ)).
2  Let Γ_{1,h} be the identity matrix in R^{d×d} for all h.
3  for m = 1, ..., M_0 do
4    Let V̂_m(x_H) = 0.
5    for h = H−1, H−2, ..., 0 do
6      For all (x,a) ∈ X_h × A, compute
         Q̂_m(x,a) = min{ r_m(x,a) + ξ∥φ(x,a)∥_{Γ^{−1}_{m,h}} + φ(x,a)^⊤ θ̂_{m,h}, H },
         V̂_m(x) = max_{a'} Q̂_m(x,a'),
         π_m(a|x) = 1[ a = argmax_{a'} Q̂_m(x,a') ]  (break ties in the argmax arbitrarily),
       with
         r_m(x,a) = ramp_{1/T}( ∥φ(x,a)∥²_{Γ^{−1}_{m,h}} − α/M_0 ),   (5.13)
         θ̂_{m,h} = Γ^{−1}_{m,h} (1/N_0) Σ_{t=1}^{(m−1)N_0} φ(x_{t,h}, a_{t,h}) V̂_m(x_{t,h+1}),
       where ramp_z(y) = 0 if y ≤ −z, 1 if y ≥ 0, and y/z + 1 if −z < y < 0.
7    for t = (m−1)N_0 + 1, ..., mN_0 do
8      Execute π_m in episode t and collect the trajectory {x_{t,h}, a_{t,h}}_{h=0}^{H−1}.
9    Compute Γ_{m+1,h} = Γ_{m,h} + (1/N_0) Σ_{t=(m−1)N_0+1}^{mN_0} φ(x_{t,h}, a_{t,h}) φ(x_{t,h}, a_{t,h})^⊤.
10 Let π^cov = Uniform({π_m}_{m=1}^{M_0}) and Σ̂^cov_h = (1/M_0) Γ_{M_0+1,h} for all h.
11 return π^cov and {Σ̂^cov_h}_{h=0,...,H−1}.
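The only non-standard ingredient of Algorithm 30 is the Lipschitz fake reward; a small sketch of the ramp function and of (5.13) is given below. Argument names are hypothetical, and Gamma_inv stands for the inverse of the running covariance matrix Γ_{m,h}.

```python
import numpy as np

def ramp(y, z):
    """ramp_z(y) = 0 for y <= -z, 1 for y >= 0, and y/z + 1 in between:
       a Lipschitz surrogate (slope 1/z) for the indicator 1{y >= 0}."""
    return float(np.clip(y / z + 1.0, 0.0, 1.0))

def fake_reward(phi_xa, Gamma_inv, alpha, M0, T):
    """Sketch of the exploration reward (5.13) used inside PolicyCover:
       r_m(x,a) = ramp_{1/T}( ||phi(x,a)||^2_{Gamma^{-1}} - alpha/M0 )."""
    bonus_sq = phi_xa @ Gamma_inv @ phi_xa
    return ramp(bonus_sq - alpha / M0, 1.0 / T)
```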
Chapter 6
Learning in Non-Stationary Environments

6.1 Overview: parameter-free algorithms for non-stationary environments

Most existing works on reinforcement learning consider a stationary environment and aim to find or be comparable to an optimal policy (known as having low static regret). In many applications, however, the environment is far from stationary. In these cases, it is much more meaningful to minimize dynamic regret, the gap between the total reward of the optimal sequence of policies and that of the learner. Indeed, there has been a surge of studies on this topic recently (Jaksch et al., 2010; Gajane et al., 2018; Li and Li, 2019; Ortner et al., 2020; Cheung et al., 2020; Fei et al., 2020; Domingues et al., 2021; Mao et al., 2021; Zhou et al., 2020; Touati and Vincent, 2020). One common issue of all these works, however, is that their algorithms crucially rely on having some prior knowledge of the degree of non-stationarity of the world, such as how much or how many times the distribution changes, which is often unavailable in practice. Cheung et al. (2020) develop a Bandit-over-Reinforcement-Learning (BoRL) framework to relax this assumption, but it introduces extra overhead and leads to suboptimal regret. Indeed, as discussed in their work, there are multiple aspects (which they call endogeneity, exogeneity, uncertainty, and bandit feedback) combined in non-stationary reinforcement learning that make the problem highly challenging.

For bandit problems, a special case of reinforcement learning, the works of Auer et al. (2019) and Chen et al. (2019) are the first to achieve near-optimal dynamic regret without any prior knowledge of the degree of non-stationarity. The same technique has later been adopted by Chen et al. (2020) for the case of combinatorial semi-bandits. Their algorithms maintain a distribution over arms (or policies/super-arms in the contextual/combinatorial case (Chen et al., 2019, 2020)) with properly controlled variance for all reward estimators. This approach is generally incompatible with standard reinforcement learning algorithms, which are usually built upon the optimism in the face of uncertainty principle and do not maintain a distribution over policies (see also (Lykouris et al., 2019; Wang et al., 2020a) for related discussions). Another drawback is that their algorithms are highly specialized to their problems, and it is unclear whether the ideas can be extended to other problems.

In this chapter, we address all these issues. Specifically, we propose a general approach that is applicable to various reinforcement learning settings (including bandits, episodic MDPs, infinite-horizon MDPs, etc.) and achieves optimal dynamic regret without any prior knowledge of the degree of non-stationarity. Our approach, called MASTER, is a black-box reduction that turns any algorithm with optimal performance in a (near-)stationary environment, satisfying some additional mild requirements, into another algorithm with optimal dynamic regret in a non-stationary environment, again without the need of any prior knowledge. For example, all existing UCB-based algorithms satisfy the conditions of our reduction and are ready to be plugged into our black-box.

Applications and comparisons. To showcase the versatility of our approach, we provide a list of examples by considering different settings and applying our reduction with different base algorithms. These examples, summarized in Table 6.1, recover the results of Auer et al. (2019) and Chen et al. (2019) for (contextual) multi-armed bandits (Section 1.1, Section 1.3), and more importantly, improve the best known results for (generalized) linear bandits (Section 1.2) and episodic MDPs (Section 1.4) in various ways.

Table 6.1: A summary of our results and comparisons with the state of the art. Our algorithms are named in the form of "MASTER + X", where X is the base algorithm used in our reduction. Here, Reg⋆_L = √(LT) and Reg⋆_∆ = ∆^{1/3}T^{2/3} + √T, where T is the number of rounds and L and ∆ are the number and amount of changes of the world, respectively. (Dependence on other parameters is omitted.)

Setting | Algorithm | Regret in Õ(·) | Required knowledge
Multi-armed bandits | (Auer et al., 2019) | Reg⋆_L | none
Multi-armed bandits | MASTER + UCB1 | min{Reg⋆_L, Reg⋆_∆} | none
Contextual bandits | (Chen et al., 2019) | min{Reg⋆_L, Reg⋆_∆} | none
Contextual bandits | MASTER + ILTCB | min{Reg⋆_L, Reg⋆_∆} | none
Contextual bandits | MASTER + FALCON | min{Reg⋆_L, Reg⋆_∆} | none
Linear bandits | (Cheung et al., 2018) | Reg⋆_∆ | ∆
Linear bandits | (Cheung et al., 2018) | Reg⋆_∆ + T^{3/4} | none
Linear bandits | MASTER + OFUL | min{Reg⋆_L, Reg⋆_∆} | none
Generalized linear bandits | (Russac et al., 2020) | L^{1/3}T^{2/3} | L
Generalized linear bandits | (Faury et al., 2021) | ∆^{1/5}T^{4/5} + T^{3/4} | none
Generalized linear bandits | MASTER + GLM-UCB | min{Reg⋆_L, Reg⋆_∆} | none
Episodic MDPs (tabular case) | (Mao et al., 2021) | Reg⋆_∆ | ∆
Episodic MDPs (tabular case) | MASTER + Q-UCB | min{Reg⋆_L, Reg⋆_∆} | none
Episodic MDPs (linear case) | (Touati and Vincent, 2020) | ∆^{1/4}T^{3/4} + √T | ∆
Episodic MDPs (linear case) | MASTER + LSVI-UCB | min{Reg⋆_L, Reg⋆_∆} | none

More specifically, let L and ∆ be the number and amount of changes of the environment, respectively (see Section 6.2 for formal definitions). For all settings, ignoring other parameters, our algorithms achieve dynamic regret min{Reg⋆_L, Reg⋆_∆} without knowing L and ∆, where Reg⋆_L = √(LT), Reg⋆_∆ = ∆^{1/3}T^{2/3} + √T, and T is the number of rounds.
These bounds are known to be optimal even when L and ∆ are known, and they improve over (Cheung et al., 2018, 2019; Russac et al., 2019; Kim and Tewari, 2020; Zhao et al., 2020; Zhao and Zhang, 2021) for linear bandits, (Russac et al., 2020; Faury et al., 2021) for generalized linear bandits, (Mao et al., 2021) for episodic tabular MDPs, and (Touati and Vincent, 2020; Zhou et al., 2020) for episodic linear MDPs.

In particular, we emphasize that achieving dynamic regret Reg⋆_L beyond (contextual) multi-armed bandits is one notable breakthrough we make. Indeed, even when L is known, previous approaches based on restarting after a fixed period, a sliding window of fixed size, or discounting with a fixed discount factor all lead to a suboptimal bound of Õ(L^{1/3}T^{2/3}) at best (Gajane et al., 2018). Since this bound is subsumed by Reg⋆_∆, related discussions are often omitted in previous works.

For non-stationary linear bandits, although several existing works (Russac et al., 2019; Kim and Tewari, 2020; Zhao et al., 2020) claim that their algorithms achieve the bound Reg⋆_∆ (when ∆ is known), there is in fact a technical flaw in all of them, as explained and corrected recently in (Zhao and Zhang, 2021; Touati and Vincent, 2020). After the correction, their bounds all deteriorate to ∆^{1/4}T^{3/4} + √T, which is no longer near-optimal. Recently, Cheung et al. (2018) sidestepped this difficulty by leveraging adversarial linear bandit algorithms, and achieved the tight bound Reg⋆_∆ when ∆ is known. Our approach, on the other hand, is based on stochastic linear bandit algorithms; it not only sidesteps the difficulty met in previous works, but also avoids the requirement of knowing ∆. When dealing with other linear-structured problems, including generalized linear bandits and linear MDPs, our bound Reg⋆_∆ is new even when ∆ is known. Previous results (Russac et al., 2020; Faury et al., 2021; Touati and Vincent, 2020; Zhou et al., 2020) cannot achieve the optimal bound due to the same technical difficulty mentioned above.

High-level ideas. The high-level idea of our reduction is to schedule multiple instances of the base algorithm with different durations in a carefully designed randomized scheme, which facilitates non-stationarity detection with little overhead. A related and well-known approach for non-stationary environments is to maintain multiple instances of a base algorithm with different parameter tunings or different starting points and to learn the best of them via another "expert" algorithm; this can be very successful when learning with full information (Hazan and Seshadhri, 2007; Luo and Schapire, 2015; Daniely et al., 2015; Jun et al., 2017), but is suboptimal and has many limitations when learning with partial information (Luo et al., 2018; Cheung et al., 2019, 2020). Our approach is different, as we do not try to learn the best instance; instead, we always follow the decision suggested by the instance with the currently shortest scheduled duration, and we only update this instance after receiving feedback from the environment. This is because base algorithms with shorter durations are responsible for detecting larger distribution changes, and always following the shortest one ensures that it is not blocked by the longer ones, so that every scale of distribution change is detected in a timely manner.
Another related approach is regret balancing, developed recently for model selection in bandit problems (Abbasi-Yadkori et al., 2020; Pacchiano et al., 2020a). The idea is also to run multiple base algorithms in parallel, each with a putative regret upper bound. In each round, the learner executes the base algorithm that has incurred the least regret so far, and it constantly compares the performance among base algorithms, eliminating those whose putative regret bounds are violated. While our algorithm resembles regret balancing in some aspects, the way it chooses the base algorithm in each round is quite different, which is also crucial for our problem.

Other related work. There is a series of works on learning MDPs with adversarial rewards and a fixed transition, as we studied in Chapter 5 (Even-Dar et al., 2009; Neu et al., 2010; Arora et al., 2012; Neu et al., 2012; Dekel and Hazan, 2013; Neu et al., 2013; Zimin and Neu, 2013; Dick et al., 2014; Rosenberg and Mansour, 2019b; Cai et al., 2020; Jin et al., 2020a; Shani et al., 2020; Rosenberg and Mansour, 2020; Lee et al., 2020; Jin and Luo, 2020; Chen et al., 2021b; Lancewicki et al., 2020). These models can potentially handle non-stationarity in the reward function but not in the transition kernel. (In fact, most of these works also only consider static regret.) Lykouris et al. (2019) investigate an episodic MDP setting where an adversary can corrupt both the reward and the transition for up to L′ episodes, and achieve dynamic regret Õ(min{L′√T, L′/gap}) without knowing L′, where gap is the minimal suboptimality gap and could be arbitrarily small. Since corruption of up to L′ episodes implies that the world changes at most L = 2L′ times, our result improves theirs from Õ(L′√T) to Õ(√(L′T)) when 1/gap > √T. On the other hand, it is possible that L is much smaller than L′ (e.g., L = Θ(1) while L′ = Θ(T)), in which case our results are also significantly better.

Recently, there has been a surge of theoretical works on multi-agent learning in Markov games (Wei et al., 2017; Xie et al., 2020; Bai et al., 2020; Liu et al., 2020; Tian et al., 2020; Daskalakis et al., 2020; Sayin et al., 2020; Wei et al., 2021b; Cen et al., 2021; Zhao et al., 2022; Jin et al., 2021c; Huang et al., 2021; Jafarnia-Jahromi et al., 2021; Chen et al., 2022a,b; Kao et al., 2022; Leonardos et al., 2021; Zhang et al., 2021a; Ding et al., 2022; Mao and Başar, 2022; Song et al., 2021; Jin et al., 2021b; Kao and Subramanian, 2022). In a multi-agent environment, each player is essentially facing a non-stationary Markov decision process, where the reward and transition are determined jointly by the policies of all players. While the foci of these works (e.g., convergence rates to equilibria) are usually different from ours, it is demonstrated by Mao et al. (2021) that algorithms designed for non-stationary environments have potential applications in multi-agent environments.

6.2 A reduction framework

Throughout this chapter, we fix a probability parameter δ of order 1/poly(T). We say "with high probability, h_1 = Õ(h_2(x))" if "with probability 1 − δ, h_1 = Õ(h_2(x))". For an integer n, we denote the set {1, 2, ..., n} by [n]; and for integers s and e, we denote the set {s, s+1, ..., e} by [s, e].

We consider the following general reinforcement learning (RL) framework, which covers a wide range of problems. Ahead of time, the learner is given a policy set Π, and the environment determines T reward functions f_1, ..., f_T : Π → [0, 1] unknown to the learner.
Then, in each round t = 1, ..., T, the learner chooses a policy π_t ∈ Π and receives a noisy reward R_t ∈ [0, 1] whose mean is f_t(π_t). (Different from previous chapters, where we consider the "cost setting", here we consider the "reward setting"; the two can be converted to each other straightforwardly.) The dynamic regret of the learner is defined as Dynamic-Reg = Σ_{t=1}^T (f⋆_t − R_t), where f⋆_t = max_{π∈Π} f_t(π) is the expected reward of the optimal policy for round t.

Many heavily studied problems fall into this framework. For example, in the classic multi-armed bandit problem (Lai and Robbins, 1985), it suffices to treat each arm as a policy; for finite-horizon episodic RL (e.g., (Jin et al., 2018)), each state-to-action mapping is considered a policy, and f_t(π) is the expected reward of executing π in the t-th episode's MDP with some transition kernel and some reward function. Note that our framework ignores many details of the actual problem we are trying to solve (e.g., not even mentioning the MDPs for RL). This is because our results only rely on certain guarantees provided by a base algorithm, making these details irrelevant to our presentation.

Non-stationarity measure. A natural way to measure the distribution drift between rounds t and t+1 is to see how much the expected reward of any policy could change, that is, max_{π∈Π} |f_t(π) − f_{t+1}(π)|. However, to make our results more general, we take a slightly more abstract way to define non-stationarity, whose exact form eventually depends on what guarantees the base algorithm can provide for a concrete problem. To this end, we define the following.

Definition 1. ∆ : [T] → R is a non-stationarity measure if it satisfies ∆(t) ≥ max_{π∈Π} |f_t(π) − f_{t+1}(π)| for all t. Define, for any interval I = [s, e], ∆_I = Σ_{τ=s}^{e−1} ∆(τ) (note ∆_{[s,s]} = 0) and L_I = 1 + Σ_{τ=s}^{e−1} 1[∆(τ) ≠ 0]. With slight abuse of notation, we write ∆ = ∆_{[1,T]} and L = L_{[1,T]}.

Base algorithm and requirements. As mentioned, our approach takes a base algorithm that tackles the problem when the environment is (near-)stationary and turns it into another algorithm that can deal with non-stationary environments. Throughout this chapter, we denote the base algorithm by ALG and assume that it satisfies the following mild requirements when run alone.

Assumption 11. ALG outputs an auxiliary quantity f̃_t ∈ [0, 1] at the beginning of each round t. There exist a non-stationarity measure ∆ and a non-increasing function ρ : [T] → R such that running ALG satisfies the following: for all t ∈ [T], as long as ∆_{[1,t]} ≤ ρ(t), without knowing ∆_{[1,t]}, ALG ensures with probability at least 1 − δ/T:

f̃_t ≥ min_{τ∈[1,t]} f⋆_τ − ∆_{[1,t]}   and   (1/t) Σ_{τ=1}^t ( f̃_τ − R_τ ) ≤ ρ(t) + ∆_{[1,t]}.   (6.1)

Furthermore, we assume that ρ(t) ≥ 1/√t and that C(t) = tρ(t) is a non-decreasing function.
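To fix ideas, here is one way the requirements of Assumption 11 could be expressed as a (hypothetical) programming interface for ALG; nothing here is part of MASTER itself, and the class and function names are illustrative only.

```python
from abc import ABC, abstractmethod

class BaseALG(ABC):
    """Hypothetical interface for a base algorithm satisfying Assumption 11: at the start
       of each round it must expose an auxiliary scalar f_tilde_t in [0, 1] (e.g., a
       UCB-style optimistic estimate of the optimal expected reward) along with its chosen
       policy, and it is assumed to satisfy (6.1) with some rho(.) whenever the environment
       is near-stationary."""

    @abstractmethod
    def act(self):
        """Return (pi_t, f_tilde_t) for the current round."""

    @abstractmethod
    def update(self, reward):
        """Feed back the observed noisy reward R_t for the policy just played."""

def rho(t, c1=1.0, c2=0.0, p=0.5):
    """An average-regret bound of the common form rho(t) = C(t)/t with C(t) = c1*t^p + c2."""
    return (c1 * t ** p + c2) / t
```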
We unpack the meaning of this assumption and explain why it is a mild requirement via a few remarks below, followed by examples of existing algorithms that do satisfy it.

First, consider choosing ∆(t) = max_{π∈Π} |f_t(π) − f_{t+1}(π)| and see what the assumption means for a stationary environment with f_t = f and ∆(t) = 0 for all t. In this case, (6.1) simply becomes f̃_t ≥ max_{π∈Π} f(π) and Σ_{τ=1}^t (f̃_τ − R_τ) ≤ C(t), which are standard properties of Upper-Confidence-Bound (UCB)-based algorithms, where f̃_t is an optimistic estimator of the optimal reward and C(t) is the regret bound, usually of order √t. In fact, even for non-UCB-based algorithms that do not explicitly maintain optimistic estimators, by looking into their analysis it is still possible to extract a quantity f̃_t satisfying these two properties. We also note that this requirement for the special stationary case is in fact all we need to achieve our claimed regret bound Reg⋆_L.

Second, to simultaneously achieve the regret bound Reg⋆_∆ as well, we require slightly more from the base algorithm: in a near-stationary environment with ∆_{[1,t]} ≤ ρ(t), the two aforementioned properties still hold approximately, with degradation ∆_{[1,t]} (that is, (6.1)). (We use min_{τ∈[1,t]} f⋆_τ instead of the more natural f⋆_t since the former is weaker, and the difference between the two is at most ∆_{[1,t]} anyway.) We call this a near-stationary environment because ∆_{[1,t]} can be of order Θ(t) in a highly non-stationary environment, while here we restrict it to be at most ρ(t), which is non-increasing in t (and in fact of order 1/√t in all our examples). To the best of our knowledge, all UCB-based algorithms satisfy Assumption 11 with some suitable choice of ∆. The fact that we only require (6.1) to hold for near-stationary environments is the key to bypassing the technical difficulty of obtaining the optimal bound Reg⋆_∆ encountered in (Russac et al., 2019; Zhao et al., 2020; Russac et al., 2020; Faury et al., 2021; Touati and Vincent, 2020; Zhou et al., 2020) for linear bandits, generalized linear bandits, and linear MDPs.

Finally, noting that ρ(t) and C(t) represent an average and a cumulative regret bound, respectively, the monotonicity requirements on them are very natural. The requirement ρ(t) ≥ 1/√t is also usually unavoidable without further structure in the problem. Note that while we write ρ and C as functions of t only, they can depend on log(1/δ), log T, the complexity of Π, and other problem-dependent parameters such as the number of states/actions of an MDP.

Main results. Our main result is that, with an algorithm satisfying Assumption 11 at hand, our proposed black-box reduction, MASTER (Algorithm 33), ensures the following dynamic regret bound.

Theorem 28. If Assumption 11 holds with C(t) = c_1 t^p + c_2 for some p ∈ [1/2, 1) and c_1, c_2 > 0, then MASTER (Algorithm 33), without knowing L and ∆, guarantees with high probability:

Dynamic-Reg = Õ( min{ (c_1 + c_2/c_1) √(LT),  (c_1^{2/3} + c_2 c_1^{−4/3}) ∆^{1/3} T^{2/3} + (c_1 + c_2/c_1) √T } )

when p = 1/2, and Dynamic-Reg = Õ( min{ c_1 L^{1−p} T^p,  c_1 ∆^{1−p} T^{1/(2−p)} + c_1 T^p } ) when p > 1/2 (omitting some lower-order terms).

For ease of presentation, in this theorem we assume that C(·) takes a certain form that is common in the literature and holds for all our examples with p = 1/2. Applying this theorem to all the examples discussed earlier, we achieve all the optimal min{Reg⋆_L, Reg⋆_∆} bounds listed in Table 6.1. Our definitions of L are the same as in previous works, and our definitions of ∆ are sometimes larger by some problem-dependent factors (such as d and H) in order to fit Assumption 11. More specifically, for (contextual) bandits, MASTER combined with UCB1 and ILTCB recovers the same optimal bounds (in terms of all parameters) achieved by (Auer et al., 2019; Chen et al., 2019), and MASTER with FALCON obtains a similar bound as in (Chen et al., 2019) but with a different definition of ∆ specific to the regressor setting.
For other settings, we present our results in terms of the common definition of the non-stationarity measure (denoted by ∆̂) and compare them with the state of the art:

• MASTER + OFUL: Dynamic-Reg = Õ(min{ d√(LT), d∆̂^{1/3}T^{2/3} + d√T }), where ∆̂ = Σ_t ∥θ_t − θ_{t+1}∥_2. This improves over (Cheung et al., 2019; Russac et al., 2019; Kim and Tewari, 2020; Zhao et al., 2020; Zhao and Zhang, 2021), which get Õ(d^{7/8}∆̂^{1/4}T^{3/4} + d√T) when ∆̂ is known.

• MASTER + GLM-UCB: Dynamic-Reg = Õ(min{ (k_µ/c_µ) d√(LT), (k_µ^{4/3}/c_µ) d∆̂^{1/3}T^{2/3} + (k_µ/c_µ) d√T }), where ∆̂ = Σ_t ∥θ_t − θ_{t+1}∥_2. This improves over (Russac et al., 2020), which gets Õ((k_µ/c_µ) d^{2/3}L^{1/3}T^{2/3}) when L is known, and (Faury et al., 2021), which gets Õ((k_µ/c_µ) d^{9/10}∆̂^{1/5}T^{4/5}).

• MASTER + Q-UCB: Dynamic-Reg = Õ(min{ √(H⁵SALT), (H⁷SA∆̂)^{1/3}T^{2/3} + √(H⁵SAT) }), where ∆̂ = Σ_{t,h} max_{s,a} ( |r^t_h(s,a) − r^{t+1}_h(s,a)| + ∥p^t_h(·|s,a) − p^{t+1}_h(·|s,a)∥_1 ). (Due to the scaling of the reward range, we first scale down C(·) and ∆ by an H factor, then apply Theorem 28, and finally scale up the final bound by an H factor.) (Mao et al., 2021, Theorem 3) gets Õ((H⁵SA∆̂)^{1/3}T^{2/3} + √(H³SAT)) when ∆̂ is known. (The bound reported in (Mao et al., 2021) is (H³SA∆̂)^{1/3}T^{2/3} + √(H²SAT); however, their T is the total number of timesteps while ours is the number of episodes, and we have performed the proper translation between notations here. Their bound has a better H dependency thanks to the use of Freedman-style confidence bounds; the same idea unfortunately does not improve our bound due to the lower-order term c_2 in the definition of C(t).)

• MASTER + LSVI-UCB: Dynamic-Reg = Õ(min{ √(d³H⁴LT), (d⁴H⁶∆̂)^{1/3}T^{2/3} + √(d³H⁴T) }), where ∆̂ = Σ_{t,h} ( ∥θ^t_h − θ^{t+1}_h∥_2 + ∥µ^t_h − µ^{t+1}_h∥_F ). This improves over (Zhou et al., 2020; Touati and Vincent, 2020), which get Õ((d⁵H⁸∆̂)^{1/4}T^{3/4} + √(d³H⁴T)) when ∆̂ is known. (The same scaling as in the tabular case has been performed here.)

Figure 6.1: An illustration of how we detect non-stationarity via multiple instances of ALG.

High-level ideas. To get a high-level idea of our approach, first consider what could go wrong when running ALG alone in a non-stationary environment and how to fix it intuitively. Decompose the dynamic regret as follows:

Σ_{τ=1}^t ( f⋆_τ − f̃_τ ) + Σ_{τ=1}^t ( f̃_τ − R_τ ),   (6.2)

where we refer to the first sum as term 1 and the second as term 2. As mentioned, in a stationary environment ALG ensures that term 1 is simply non-positive and that term 2 is bounded by C(t) directly. In a non-stationary environment, however, both terms can be substantially larger. If we can detect the event that either of them is abnormally large, we know that the environment has changed substantially, and we should restart ALG. This detection is easy for term 2, since both f̃_τ and R_τ are observable, but not for term 1, since f⋆_τ is of course unknown. Note that a large term 1 implies that a policy, possibly suboptimal in the past, now becomes the optimal one with a much larger reward. A single instance of ALG run from the beginning thus cannot detect this, because suboptimal policies are naturally selected very infrequently. To address this issue, our main idea is to maintain different instances of ALG to facilitate non-stationarity detection, illustrated via an example in Figure 6.1.
Here, there is one distribution change, happening in interval I, which makes the value of f⋆_τ (the blue curve) drastically increase. If within this interval we start running another instance of ALG (the red interval), then its performance (the black curve) will gradually approach f⋆_τ due to its regret guarantee in a stationary environment. Hypothetically, if another instance of ALG run from the beginning could coexist with this new instance, we would see that the latter significantly outperforms the former and infer that the environment has changed. The issue is that we cannot have multiple instances running and making decisions simultaneously, and here is where the optimistic estimators f̃_τ can help. Specifically, since the quantity U_τ = min_{s≤τ} f̃_s (the green non-increasing curve) should always be an upper bound on the learner's performance in a stationary environment, if we find that the new instance of ALG significantly outperforms this quantity at some point (as shown in Figure 6.1), we can also infer that the environment has changed, and we prevent term 1 ≤ Σ_{τ=1}^t (f⋆_τ − U_τ) from growing too large by restarting.

To formally implement these ideas, we need to decide when to start a new instance, how long it should last, which instance should be active if multiple exist, and so on. In Section 6.3, we propose a randomized multi-scale scheme to do so, which is reminiscent of the ideas of sampling obligation in (Auer et al., 2019) and replay phases in (Chen et al., 2019), although their mechanisms are highly specific to their algorithms and problems.

6.3 Multi-scale scheduling

In this section, we first introduce MALG, an algorithm that schedules and runs multiple instances of the base algorithm ALG in a multi-scale manner. Then, equipping MALG with non-stationarity detection, we introduce our final black-box reduction MASTER (Section 6.4).

MALG: Running the Base Algorithm with Multiple Scales. We always run MALG for an interval of length 2^n, which we call a block, for some integer n (unless it is terminated by the non-stationarity detection mechanism). During initialization, MALG uses Procedure 31 to schedule multiple instances of ALG within the block in the following way: for every m = n, n−1, ..., 0, partition the block equally into 2^{n−m} sub-intervals of length 2^m, and for each of these sub-intervals, with probability ρ(2^n)/ρ(2^m) ≤ 1 schedule an instance of ALG (otherwise skip this sub-interval). We call these instances of length 2^m order-m instances. Note that by definition there is always an order-n instance covering the entire block. We use alg to denote a particular instance of ALG, and use alg.s and alg.e to denote its start and end time.

Procedure 31: A procedure that randomly schedules ALG instances of different lengths within 2^n rounds
input: n, ρ(·)
1  for τ = 0, ..., 2^n − 1 do
2    for m = n, n−1, ..., 0 do
3      If τ is a multiple of 2^m, then with probability ρ(2^n)/ρ(2^m) schedule a new instance alg of ALG that starts at alg.s = τ + 1 and ends at alg.e = τ + 2^m.

Algorithm 32: MALG (Multi-scale ALG)
input: n, ρ(·)
1  Initialization: run Procedure 31 with inputs n and ρ(·).
2  At each time t, let the unique active instance be alg; output g̃_t (which is the f̃_t output by alg), follow alg's decision π_t, and update alg after receiving feedback from the environment.
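A minimal sketch of Procedure 31 and of MALG's active-instance rule follows; ρ is passed in as a function (non-increasing, so each scheduling probability is at most 1), and the returned (start, end) pairs are 1-indexed within the block. This illustrates the scheduling distribution only and is not a full implementation of Algorithm 32.

```python
import numpy as np

def schedule_instances(n, rho, rng):
    """Sketch of Procedure 31: within a block of length 2^n, for every order
       m = n, n-1, ..., 0 and every aligned sub-interval of length 2^m, start a new
       instance with probability rho(2^n)/rho(2^m). The order-n instance is always
       scheduled since its probability is 1."""
    instances = []
    for tau in range(2 ** n):
        for m in range(n, -1, -1):
            if tau % (2 ** m) == 0 and rng.random() < rho(2 ** n) / rho(2 ** m):
                instances.append((tau + 1, tau + 2 ** m))
    return instances

def active_instance(instances, t):
    """MALG follows the scheduled instance with the shortest length covering time t."""
    covering = [(e - s, (s, e)) for (s, e) in instances if s <= t <= e]
    return min(covering)[1]

rng = np.random.default_rng(0)
insts = schedule_instances(4, lambda t: 1 / np.sqrt(t), rng)
print(active_instance(insts, t=3))
```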
After the initialization, MALG starts interacting with the environment as follows. At each time t, the unique instance covering this time step with the shortest length is considered active, while all others are inactive. MALG follows the decision of the active instance, and updates it after receiving feedback from the environment. Inactive instances do not make any decisions or updates; that is, they are paused (though they might be resumed at some point). We use g̃_t to denote the scalar f̃_t output by the active instance. See Algorithm 32 for the pseudocode.

Figure 6.2: An illustration of MALG with n = 4 (see the detailed explanation in Section 6.3).

For better illustration, we give an example with n = 4 in Figure 6.2. Suppose that the realization of the random scheduling by Procedure 31 is: one order-4 instance (red), zero order-3 instances, two order-2 instances (green), two order-1 instances (blue), and five order-0 instances (purple). The bolder part of each segment indicates the period of time when the instance is active, while the thinner part indicates the inactive period. For example, the red order-4 instance is active for the first round, then paused for the next 8 rounds, and then resumed (from its frozen internal state) for another 3 rounds before becoming inactive again. The dashed black arrow marked with ① indicates that ALG is executed as if the two sides of the arrow were concatenated. As another example, the two purple instances on the two sides of the dashed line marked with ② are two different order-0 instances, so the second one starts from scratch even though they are consecutive. One can see that at any point in time, the active instance is always the one with the shortest length.

Regret analysis of MALG. The multi-scale nature of MALG allows the learner's regret to also enjoy a multi-scale structure, as shown in the next lemma (proof deferred to Appendix E.1).

Lemma 15. Let n̂ = log_2 T + 1 and ρ̂(t) = 6n̂ log(T/δ) ρ(t). MALG with input n ≤ log_2 T guarantees the following: for any instance alg that MALG maintains and any t ∈ [alg.s, alg.e], as long as ∆_{[alg.s, t]} ≤ ρ(t′) where t′ = t − alg.s + 1, we have with probability at least 1 − δ/T:

g̃_t ≥ min_{τ∈[alg.s, t]} f⋆_τ − ∆_{[alg.s, t]},   (1/t′) Σ_{τ=alg.s}^t ( g̃_τ − R_τ ) ≤ ρ̂(t′) + n̂ ∆_{[alg.s, t]},   (6.3)

and the number of instances started within [alg.s, t] is upper bounded by 6n̂ log(T/δ) C(t′)/C(1).

Note that (6.3) is essentially the analogue of (6.1) (up to logarithmic terms), with the starting time changed from 1 to alg.s. It shows that even with multiple instances interleaving in a complicated way, the regret on a specific interval is still almost the same as that of running ALG alone on this interval, thanks to the carefully chosen probabilities in Procedure 31. Recall that there is always an order-n instance starting from the beginning of the block, so MALG always provides a stronger multi-scale guarantee compared to running ALG alone. This richer guarantee facilitates non-stationarity detection, as we show next.

6.4 Equipping multi-scale scheduling with tests and restarts

MASTER: Equipping MALG with Stationarity Tests. We are now ready to present our final algorithm MASTER, short for MALG with Stationarity TEsts and Restarts (see Algorithm 33). MASTER runs MALG in a sequence of blocks with doubling lengths (2^0, 2^1, ...).
Within each block of length 2^n (with t_n being the starting time), MASTER simply runs a new instance of MALG and records the minimum optimistic predictor so far within the block, U_t = min_{τ∈[t_n, t]} g̃_τ. At the end of each time step, MASTER performs two tests (Test 1 and Test 2), and if either of them returns fail, MASTER restarts from scratch.

Algorithm 33: MALG with Stationarity TEsts and Restarts (MASTER)
input: ρ̂(·) (defined in Lemma 15)
1  Initialize: t ← 1.
2  for n = 0, 1, ... do
3    Set t_n ← t and initialize an MALG (Algorithm 32) for the block [t_n, t_n + 2^n − 1].
4    while t < t_n + 2^n do
5      Receive g̃_t and π_t from MALG, execute π_t, and receive reward R_t.
6      Update MALG with any feedback from the environment, and set U_t = min_{τ∈[t_n, t]} g̃_τ.
7      Perform Test 1 and Test 2 (see below). Increment t ← t + 1.
8      if either test returns fail then restart from Line 2.

Test 1: If t = alg.e for some order-m instance alg and (1/2^m) Σ_{τ=alg.s}^{alg.e} R_τ ≥ U_t + 9ρ̂(2^m), return fail.
Test 2: If (1/(t − t_n + 1)) Σ_{τ=t_n}^t ( g̃_τ − R_τ ) ≥ 3ρ̂(t − t_n + 1), return fail.

The two tests exactly follow the ideas described in Section 6.2 (recall Figure 6.1). Following (6.2), we decompose the regret on [t_n, t] as term 1 + term 2, where term 1 = Σ_{τ=t_n}^t (f⋆_τ − g̃_τ) and term 2 = Σ_{τ=t_n}^t (g̃_τ − R_τ). Test 1 prevents term 1 ≤ Σ_{τ=t_n}^t (f⋆_τ − U_τ) from growing too large by testing whether there is some order-m instance over whose interval the learner's average performance (1/2^m) Σ_{τ=alg.s}^{alg.e} R_τ exceeds the promised performance upper bound U_t by an amount of 9ρ̂(2^m). On the other hand, Test 2 prevents term 2 from growing too large by directly testing whether its average exceeds (something close to) the promised regret bound 3ρ̂(t − t_n + 1). It is now clear that MASTER indeed does not require knowledge of L or ∆ at all.

To analyze MASTER, we prove the following key lemma, which bounds the regret on a single block [t_n, E_n], where E_n is either t_n + 2^n − 1 or something smaller in the case where a restart is triggered.

Lemma 16. With high probability, the dynamic regret of MASTER on any block J = [t_n, E_n] with E_n ≤ t_n + 2^n − 1 is bounded as

Σ_{τ∈J} ( f⋆_τ − R_τ ) ≤ Õ( Σ_{i=1}^ℓ C(|I′_i|) + Σ_{m=0}^n (ρ(2^m)/ρ(2^n)) C(2^m) ),   (6.4)

where {I′_1, ..., I′_ℓ} is any partition of J such that ∆_{I′_i} ≤ ρ(|I′_i|) for all i.

See Appendix E.2.1 for the proof. When ρ(t) = Θ(1/√t) (as in all our examples), the first term is Õ(Σ_{i=1}^ℓ √|I′_i|) = Õ(√(ℓ|J|)) by Cauchy-Schwarz, and the second term is of order Õ(√(2^n)). To derive a bound in terms of L, we can simply choose the partition {I′_1, ..., I′_ℓ} such that ∆_{I′_i} = 0 and ℓ = L_J, while to derive a bound in terms of ∆, the partition needs to be chosen more carefully depending on the value of ∆_J. Noting that the number of blocks between two restarts is always at most log_2 T, to finally prove Theorem 28 it remains to bound the number of restarts, which intuitively should scale with L or ∆, because by design a restart is not triggered when the environment is stationary. The complete proof is deferred to Appendix E.2.2–Appendix E.2.4.
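For concreteness, the two tests can be sketched as the following checks. This is a simplified illustration: `block_rewards` and `block_f_tilde` are assumed to hold R_τ and g̃_τ for the current block [t_n, t], and the order-m instance that just ended is assumed to occupy the last 2^m of those rounds.

```python
import numpy as np

def test1_fails(block_rewards, U_t, m, rho_hat):
    """Test 1: the order-m instance ending now has average realized reward exceeding
       the running optimistic bound U_t by more than 9 * rho_hat(2^m)."""
    return float(np.mean(block_rewards[-2 ** m:])) >= U_t + 9 * rho_hat(2 ** m)

def test2_fails(block_f_tilde, block_rewards, rho_hat):
    """Test 2: the block-average of (g_tilde_tau - R_tau) exceeds 3 * rho_hat(t - t_n + 1)."""
    t_len = len(block_rewards)
    avg_gap = float(np.mean(np.asarray(block_f_tilde) - np.asarray(block_rewards)))
    return avg_gap >= 3 * rho_hat(t_len)
```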
6.5 Open problems

It would be interesting to see whether algorithms with data-dependent bounds work with our black-box approach. Previous work in this direction (Wei et al., 2016) achieves an improved dynamic regret bound for multi-armed bandits when the cumulative variance of the losses is small; however, their approach crucially relies on knowledge of the degree of non-stationarity as well as the cumulative variance. On the other hand, there are some immediate difficulties in applying our black-box approach to data-dependent algorithms. For example, the monotonicity of the average regret ρ(·) may no longer hold, and it is unclear how to set the probability of initiating a new base algorithm. Therefore, achieving data-dependent dynamic regret bounds without prior knowledge seems challenging and requires further innovations.

Another future direction is to study a class of contextual bandit problems where the context is adversarially generated (Abbasi-Yadkori et al., 2011; Cheung et al., 2019; Foster and Rakhlin, 2020). In this case, the expected reward of the optimal policy changes over time even if the environment is stationary, so our current algorithm cannot be directly applied. For linear contextual bandits with adversarial contexts (Abbasi-Yadkori et al., 2011; Cheung et al., 2019), the fix is straightforward, though: instead of requiring the base algorithm to generate a scalar f̃_t in each round, we let it generate a confidence set for the hidden parameter, and check for inconsistency of the confidence sets over time. However, for general contextual bandits with adversarial contexts, where algorithms do not necessarily maintain a confidence set for the hidden parameter (Foster and Rakhlin, 2020), the extension is less clear and is left for future investigation.

Finally, we are not aware of any near-optimal convex bandit algorithm satisfying our Assumption 11, so achieving a near-optimal dynamic regret bound for general convex bandits is also left open.

References

Abbasi-Yadkori, Y., Bartlett, P., Bhatia, K., Lazic, N., Szepesvari, C., and Weisz, G. (2019). Politex: Regret bounds for policy iteration using expert prediction. In International Conference on Machine Learning, pages 3692–3702. PMLR.

Abbasi-Yadkori, Y., Bartlett, P. L., and Szepesvári, C. (2013). Online learning in Markov decision processes with adversarially chosen transition probability distributions. In Conference on Neural Information Processing Systems.

Abbasi-Yadkori, Y., Pacchiano, A., and Phan, M. (2020). Regret balancing for bandit and RL model selection. In Conference on Neural Information Processing Systems.

Abbasi-Yadkori, Y., Pál, D., and Szepesvári, C. (2011). Improved algorithms for linear stochastic bandits. Advances in Neural Information Processing Systems, 24:2312–2320.

Abernethy, J. D., Hazan, E., and Rakhlin, A. (2008). Competing in the dark: An efficient algorithm for bandit linear optimization. In Conference on Learning Theory, pages 263–274.

Agarwal, A., Henaff, M., Kakade, S., and Sun, W. (2020a). PC-PG: Policy cover directed exploration for provable policy gradient learning. In Conference on Neural Information Processing Systems.

Agarwal, A., Hsu, D., Kale, S., Langford, J., Li, L., and Schapire, R. (2014a). Taming the monster: A fast and simple algorithm for contextual bandits. In International Conference on Machine Learning, pages 1638–1646. PMLR.

Agarwal, A., Hsu, D., Kale, S., Langford, J., Li, L., and Schapire, R. (2014b). Taming the monster: A fast and simple algorithm for contextual bandits. In Proceedings of the 31st International Conference on Machine Learning.

Agarwal, A., Kakade, S.
M., Lee, J. D., and Mahajan, G. (2020b). Optimality and approximation with policy gradient methods in markov decision processes. In Conference on Learning Theory, pages 64–66. PMLR. Agarwal, A., Luo, H., Neyshabur, B., and Schapire, R. E. (2017). Corralling a band of bandit algorithms. In Conference on Learning Theory, pages 12–38. Allenberg, C., Auer, P., Gyorfi, L., and Ottucsák, G. (2006). Hannan consistency in on-line learning in case of unbounded losses under partial monitoring. In Algorithmic Learning Theory, volume 4264, pages 229–243. Springer. Arora, R., Dekel, O., and Tewari, A. (2012). Deterministic mdps with adversarial rewards and bandit feedback. In Proceedings of the Twenty-Eighth Conference on Uncertainty in Artificial Intelligence, pages 93–101. Arora, R., Marinov, T. V., and Mohri, M. (2021). Corralling stochastic bandit algorithms. In International Conference on Artificial Intelligence and Statistics . Audibert, J.-Y. and Bubeck, S. (2009). Minimax policies for adversarial and stochastic bandits. In Conference on Learning Theory. Audibert, J.-Y., Bubeck, S., and Lugosi, G. (2011). Minimax policies for combinatorial prediction games. In Proceedings of the 24th Annual Conference on Learning Theory. Auer, P., Cesa-Bianchi, N., and Fischer, P. (2002a). Finite-time analysis of the multiarmed bandit problem. Machine learning, 47(2-3):235–256. 122 Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E. (2002b). The nonstochastic multiarmed bandit problem. SIAM journal on computing, 32(1):48–77. Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E. (2002c). The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77. Auer, P. and Chiang, C.-K. (2016). An algorithm with nearly optimal pseudo-regret for both stochastic and adversarial bandits. In Conference on Learning Theory, pages 116–120. Auer, P., Gajane, P., and Ortner, R. (2019). Adaptively tracking the best bandit arm with an unknown number of distribution changes. In Conference on Learning Theory, pages 138–158. Azar, M. G., Munos, R., and Kappen, B. (2012). On the sample complexity of reinforcement learning with a generative model. In International Conference on Machine Learning. Azar, M. G., Osband, I., and Munos, R. (2017). Minimax regret bounds for reinforcement learning. In International Conference on Machine Learning. Bai, Y., Jin, C., and Yu, T. (2020). Near-optimal reinforcement learning with self-play. Advances in Neural Information Processing Systems. Bartlett, P., Dani, V., Hayes, T., Kakade, S., Rakhlin, A., and Tewari, A. (2008). High-probability regret bounds for bandit online linear optimization. In Proceedings of the 21st Annual Conference on Learning Theory-COLT 2008, pages 335–342. Omnipress. Beygelzimer, A., Langford, J., Li, L., Reyzin, L., and Schapire, R. (2011). Contextual bandit algorithms with supervised learning guarantees. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics , pages 19–26. Bhaskara, A., Cutkosky, A., Kumar, R., and Purohit, M. (2020). Online linear optimization with many hints. Advances in neural information processing systems. 123 Bogunovic, I., Krause, A., and Scarlett, J. (2020). Corruption-tolerant gaussian process bandit optimization. In International Conference on Artificial Intelligence and Statistics . Bogunovic, I., Losalka, A., Krause, A., and Scarlett, J. (2021). Stochastic linear bandits robust to adversarial attacks. In International Conference on Artificial Intelligence and Statistics . 
Bubeck, S., Cesa-Bianchi, N., and Kakade, S. (2012). Towards minimax policies for online linear optimization with bandit feedback. In Conference on Learning Theory. Bubeck, S., Cohen, M. B., and Li, Y. (2017a). Sparsity, variance and curvature in multi-armed bandits. arXiv preprint arXiv:1711.01037. Bubeck, S., Devanur, N. R., Huang, Z., and Niazadeh, R. (2017b). Online auctions and multi-scale online learning. In Proceedings of the 2017 ACM Conference on Economics and Computation, pages 497–514. Bubeck, S., Li, Y., Luo, H., and Wei, C.-Y. (2019). Improved path-length regret bounds for bandits. In Proceedings of the 32nd Conference On Learning Theory. Bubeck, S. and Slivkins, A. (2012). The best of both worlds: stochastic and adversarial bandits. In Conference on Learning Theory, pages 42–1. Cai, Q., Yang, Z., Jin, C., and Wang, Z. (2020). Provably efficient exploration in policy optimization. In International Conference on Machine Learning, pages 1283–1294. PMLR. Cen, S., Wei, Y., and Chi, Y. (2021). Fast policy extragradient methods for competitive games with entropy regularization. Advances in Neural Information Processing Systems, 34. Cesa-Bianchi, N., Gaillard, P., Lugosi, G., and Stoltz, G. (2012). Mirror descent meets fixed share (and feels no regret). Advances in Neural Information Processing Systems, 25:980–988. 124 Cesa-Bianchi, N., Mansour, Y., and Stoltz, G. (2007). Improved second-order bounds for prediction with expert advice. Machine Learning, 66(2-3):321–352. Chen, L. and Luo, H. (2021). Finding the stochastic shortest path with low regret: The adversarial cost and unknown transition case. In International Conference on Machine Learning. Chen, L., Luo, H., and Wei, C.-Y. (2021a). Impossible tuning made possible: A new expert algorithm and its applications. In Conference on Learning Theory, pages 1216–1259. PMLR. Chen, L., Luo, H., and Wei, C.-Y. (2021b). Minimax regret for stochastic shortest path with adversarial costs and known transition. Conference on Learning Theory. Chen, W., Wang, L., Zhao, H., and Zheng, K. (2020). Combinatorial semi-bandit in the non- stationary environment. arXiv preprint arXiv:2002.03580. Chen, Y., Du, S. S., and Jamieson, K. (2021c). Improved corruption robust algorithms for episodic reinforcement learning. In International Conference on Machine Learning. Chen, Y., Lee, C.-W., Luo, H., and Wei, C.-Y. (2019). A new algorithm for non-stationary contextual bandits: Efficient, optimal and parameter-free. In Conference on Learning Theory, pages 696–726. Chen, Z., Ma, S., and Zhou, Y. (2022a). Sample efficient stochastic policy extragradient algorithm for zero-sum markov game. In International Conference on Learning Representations. Chen, Z., Zhou, D., and Gu, Q. (2022b). Almost optimal algorithms for two-player zero-sum linear mixture markov games. In International Conference on Algorithmic Learning Theory, pages 227–261. PMLR. Cheung, W. C., Simchi-Levi, D., and Zhu, R. (2018). Hedging the drift: Learning to optimize under non-stationarity. Available at SSRN 3261050. 125 Cheung, W. C., Simchi-Levi, D., and Zhu, R. (2019). Learning to optimize under non-stationarity. In The 22nd International Conference on Artificial Intelligence and Statistics , pages 1079–1087. PMLR. Cheung, W. C., Simchi-Levi, D., and Zhu, R. (2020). Reinforcement learning for non-stationary markov decision processes: The blessing of (more) optimism. In International Conference on Machine Learning. Cutkosky, A. (2019a). 
Artificial constraints and hints for unbounded online learning. In Conference on Learning Theory, pages 874–894. Cutkosky, A. (2019b). Combining online learning guarantees. Conference on Learning Theory. Cutkosky, A. (2020). Better full-matrix regret via parameter-free online learning. Advances in Neural Information Processing Systems, 33. Cutkosky, A. and Orabona, F. (2018). Black-box reductions for parameter-free online learning in banach spaces. In Conference on Learning Theory (COLT), pages 1493–1529. Dani, V., Kakade, S. M., and Hayes, T. (2007). The price of bandit information for online optimization. Advances in Neural Information Processing Systems, 20. Dani, V., Kakade, S. M., and Hayes, T. P. (2008). The price of bandit information for online optimization. In Advances in Neural Information Processing Systems, pages 345–352. Daniely, A., Gonen, A., and Shalev-Shwartz, S. (2015). Strongly adaptive online learning. In International Conference on Machine Learning, pages 1405–1411. Dann, C. and Brunskill, E. (2015). Sample complexity of episodic fixed-horizon reinforcement learning. In Conference on Neural Information Processing Systems. 126 Dann, C., Mansour, Y., Mohri, M., Sekhari, A., and Sridharan, K. (2020). Reinforcement learning with feedback graphs. In Conference on Neural Information Processing Systems. Daskalakis, C., Foster, D. J., and Golowich, N. (2020). Independent policy gradient methods for competitive reinforcement learning. Advances in neural information processing systems, 33:5527–5540. Dekel, O. and Hazan, E. (2013). Better rates for any adversarial deterministic mdp. In International Conference on Machine Learning, pages 675–683. Dick, T., Gyorgy, A., and Szepesvari, C. (2014). Online learning in markov decision processes with changing cost sequences. In International Conference on Machine Learning. Ding, D., Wei, C.-Y., Zhang, K., and Jovanović, M. R. (2022). Independent policy gradient for large-scale markov potential games: Sharper rates, function approximation, and game-agnostic convergence. arXiv preprint arXiv:2202.04129. Domingues, O. D., Ménard, P., Pirotta, M., Kaufmann, E., and Valko, M. (2021). A kernel-based approach to non-stationary reinforcement learning in metric spaces. In International Conference on Artificial Intelligence and Statistics , pages 3538–3546. PMLR. Dong, K., Wang, Y., Chen, X., and Wang, L. (2019). Q-learning with ucb exploration is sample efficient for infinite-horizon mdp. arXiv preprint arXiv:1901.09311. Duchi, J., Hazan, E., and Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. Journal of machine learning research, 12(7). Dudik, M., Hsu, D., Kale, S., Karampatziakis, N., Langford, J., Reyzin, L., and Zhang, T. (2011). Efficient optimal learning for contextual bandits. In Proceedings of the Twenty-Seventh Conference on Uncertainty in Artificial Intelligence , pages 169–178. 127 Even-Dar, E., Kakade, S. M., and Mansour, Y. (2009). Online markov decision processes. Mathe- matics of Operations Research, 34(3):726–736. Faury, L., Russac, Y., Abeille, M., and Calauzènes, C. (2021). Regret bounds for generalized linear bandits under parameter drift. arXiv preprint arXiv:2103.05750. Fei, Y., Yang, Z., Wang, Z., and Xie, Q. (2020). Dynamic regret of policy optimization in non- stationary environments. Advances in Neural Information Processing Systems, 33. Foster, D. and Rakhlin, A. (2020). Beyond ucb: Optimal and efficient contextual bandits with regression oracles. 
In International Conference on Machine Learning, pages 3199–3210. PMLR. Foster, D. J., Gentile, C., Mohri, M., and Zimmert, J. (2020). Adapting to misspecification in contextual bandits. In Conference on Neural Information Processing Systems. Foster, D. J., Kale, S., Mohri, M., and Sridharan, K. (2017). Parameter-free online learning via model selection. In Advances in Neural Information Processing Systems, pages 6020–6030. Foster, D. J., Krishnamurthy, A., and Luo, H. (2019). Model selection for contextual bandits. In Conference on Neural Information Processing Systems. Foster, D. J., Li, Z., Lykouris, T., Sridharan, K., and Tardos, E. (2016). Learning in games: Robustness of fast convergence. In Advances in Neural Information Processing Systems, pages 4734–4742. Freund, Y. and Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of computer and system sciences, 55(1):119–139. Fruit, R., Pirotta, M., Lazaric, A., and Ortner, R. (2018). Efficient bias-span-constrained exploration- exploitation in reinforcement learning. In International Conference on Machine Learning, pages 1578–1586. PMLR. 128 Gaillard, P., Stoltz, G., and Van Erven, T. (2014). A second-order bound with excess losses. In Conference on Learning Theory, pages 176–196. Gajane, P., Ortner, R., and Auer, P. (2018). A sliding-window algorithm for markov decision processes with arbitrarily changing rewards and transitions. arXiv preprint arXiv:1805.10066. Gupta, A., Koren, T., and Talwar, K. (2019). Better algorithms for stochastic bandits with adversarial corruptions. In Conference on Learning Theory. Hao, B., Lattimore, T., and Szepesvari, C. (2020). Adaptive exploration in linear contextual bandit. In International Conference on Artificial Intelligence and Statistics . PMLR. Hazan, E., Agarwal, A., and Kale, S. (2007). Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2-3):169–192. Hazan, E. and Kale, S. (2010). Extracting certainty from uncertainty: Regret bounded by variation in costs. Machine learning, 80(2-3):165–188. Hazan, E. and Kale, S. (2011). Better algorithms for benign bandits. Journal of Machine Learning Research, 12(Apr):1287–1311. Hazan, E. and Koren, T. (2016). The computational power of optimization in online learning. In Proceedings of the forty-eighth annual ACM symposium on Theory of Computing, pages 128–141. Hazan, E. and Seshadhri, C. (2007). Adaptive algorithms for online decision problems. In Electronic colloquium on computational complexity (ECCC), volume 14. He, J., Zhou, D., Zhang, T., and Gu, Q. (2022). Nearly optimal algorithms for linear contextual bandits with adversarial corruptions. arXiv preprint arXiv:2205.06811. 129 Huang, B., Lee, J. D., Wang, Z., and Yang, Z. (2021). Towards general function approximation in zero-sum markov games. arXiv preprint arXiv:2107.14702. Jadbabaie, A., Rakhlin, A., Shahrampour, S., and Sridharan, K. (2015). Online optimization: Competing with dynamic comparators. In Artificial Intelligence and Statistics , pages 398–406. Jafarnia-Jahromi, M., Jain, R., and Nayyar, A. (2021). Learning zero-sum stochastic games with posterior sampling. arXiv preprint arXiv:2109.03396. Jaksch, T., Ortner, R., and Auer, P. (2010). Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research. Jin, C., Allen-Zhu, Z., Bubeck, S., and Jordan, M. I. (2018). Is q-learning provably efficient? In Conference on Neural Information Processing Systems. 
Jin, C., Jin, T., Luo, H., Sra, S., and Yu, T. (2020a). Learning adversarial markov decision processes with bandit feedback and unknown transition. In International Conference on Machine Learning. Jin, C., Jin, T., Luo, H., Sra, S., and Yu, T. (2020b). Learning adversarial markov decision processes with bandit feedback and unknown transition. In International Conference on Machine Learning. Jin, C., Liu, Q., and Miryoosefi, S. (2021a). Bellman eluder dimension: New rich classes of rl problems, and sample-efficient algorithms. In Conference on Neural Information Processing Systems. Jin, C., Liu, Q., Wang, Y., and Yu, T. (2021b). V-learning–a simple, efficient, decentralized algorithm for multiagent rl. arXiv preprint arXiv:2110.14555. Jin, C., Liu, Q., and Yu, T. (2021c). The power of exploiter: Provable multi-agent rl in large state spaces. arXiv preprint arXiv:2106.03352. 130 Jin, C., Yang, Z., Wang, Z., and Jordan, M. I. (2020c). Provably efficient reinforcement learning with linear function approximation. In Conference on Learning Theory. Jin, T., Huang, L., and Luo, H. (2021d). The best of both worlds: stochastic and adversarial episodic mdps with unknown transition. In Conference on Neural Information Processing Systems. Jin, T. and Luo, H. (2020). Simultaneously learning stochastic and adversarial episodic mdps with known transition. Advances in Neural Information Processing Systems. Jun, K.-S., Orabona, F., Wright, S., and Willett, R. (2017). Improved strongly adaptive online learning using coin betting. In Artificial Intelligence and Statistics , pages 943–951. Jun, K.-S. and Zhang, C. (2020). Crush optimism with pessimism: Structured bandits beyond asymptotic optimality. Advances in Neural Information Processing Systems. Kakade, S. and Langford, J. (2002). Approximately optimal approximate reinforcement learning. In In Proc. 19th International Conference on Machine Learning. Citeseer. Kao, H. and Subramanian, V. (2022). Common information based approximate state representations in multi-agent reinforcement learning. In International Conference on Artificial Intelligence and Statistics, pages 6947–6967. PMLR. Kao, H., Wei, C.-Y., and Subramanian, V. (2022). Decentralized cooperative reinforcement learning with hierarchical information structure. In International Conference on Algorithmic Learning Theory, pages 573–605. PMLR. Kim, B. and Tewari, A. (2020). Randomized exploration for non-stationary stochastic linear bandits. In Uncertainty in Artificial Intelligence . 131 Koolen, W. M., Grünwald, P., and van Erven, T. (2016). Combining adversarial guarantees and stochastic fast rates in online learning. Advances in Neural Information Processing Systems, 29:4457–4465. Koolen, W. M. and Van Erven, T. (2015). Second-order quantile methods for experts and combina- torial games. In Conference on Learning Theory, pages 1155–1175. Koolen, W. M., Van Erven, T., and Grünwald, P. (2014). Learning the learning rate for prediction with expert advice. Advances in neural information processing systems, 27:2294–2302. Koren, T. and Livni, R. (2017). Affine-invariant online optimization and the low-rank experts problem. In Advances in Neural Information Processing Systems, pages 4747–4755. Lai, T. L. and Robbins, H. (1985). Asymptotically efficient adaptive allocation rules. Advances in applied mathematics, 6(1):4–22. Lancewicki, T., Rosenberg, A., and Mansour, Y. (2020). Learning adversarial markov decision processes with delayed feedback. arXiv preprint arXiv:2012.14843. Langford, J. and Zhang, T. 
(2008). The epoch-greedy algorithm for multi-armed bandits with side information. In Advances in neural information processing systems 21.

Lattimore, T. and Szepesvari, C. (2017). The end of optimism? an asymptotic analysis of finite-armed linear bandits. In Artificial Intelligence and Statistics, pages 728–737. PMLR.

Lattimore, T., Szepesvari, C., and Weisz, G. (2020). Learning with good feature representations in bandits and in rl with a generative model. In International Conference on Machine Learning.

Lazic, N., Yin, D., Abbasi-Yadkori, Y., and Szepesvari, C. (2021). Improved regret bound and experience replay in regularized policy iteration. arXiv preprint arXiv:2102.12611.

Lee, C.-W., Luo, H., Wei, C.-Y., and Zhang, M. (2020). Bias no more: high-probability data-dependent regret bounds for adversarial bandits and mdps. Advances in neural information processing systems.

Lee, C.-W., Luo, H., Wei, C.-Y., Zhang, M., and Zhang, X. (2021). Achieving near instance-optimality and minimax-optimality in stochastic and adversarial linear bandits simultaneously. In International Conference on Machine Learning.

Leonardos, S., Overman, W., Panageas, I., and Piliouras, G. (2021). Global convergence of multi-agent policy gradient in markov potential games. arXiv preprint arXiv:2106.01969.

Li, Y. and Li, N. (2019). Online learning for markov decision processes in nonstationary environments: A dynamic regret analysis. In 2019 American Control Conference (ACC), pages 1232–1237. IEEE.

Li, Y., Lou, E. Y., and Shan, L. (2019). Stochastic linear optimization with adversarial corruption. arXiv:1909.02109.

Liu, Q., Yu, T., Bai, Y., and Jin, C. (2020). A sharp analysis of model-based reinforcement learning with self-play. arXiv preprint arXiv:2010.01604.

Lu, S. and Zhang, L. (2019). Adaptive and efficient algorithms for tracking the best expert. arXiv preprint arXiv:1909.02187.

Lugosi, G. and Mendelson, S. (2019). Mean estimation and regression under heavy-tailed distributions: A survey. Foundations of Computational Mathematics, 19(5):1145–1190.

Luo, H. and Schapire, R. E. (2015). Achieving all with no parameters: Adanormalhedge. In Conference on Learning Theory, pages 1286–1304.

Luo, H., Wei, C.-Y., and Lee, C.-W. (2021). Policy optimization in adversarial mdps: Improved exploration via dilated bonuses. In Conference on Neural Information Processing Systems.

Luo, H., Wei, C.-Y., and Zheng, K. (2018). Efficient online portfolio with logarithmic regret. In Advances in Neural Information Processing Systems.

Lykouris, T., Mirrokni, V., and Paes Leme, R. (2018). Stochastic bandits robust to adversarial corruptions. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing.

Lykouris, T., Simchowitz, M., Slivkins, A., and Sun, W. (2019). Corruption robust exploration in episodic reinforcement learning. arXiv:1911.08689.

Mao, W. and Başar, T. (2022). Provably efficient reinforcement learning in decentralized general-sum markov games. Dynamic Games and Applications, pages 1–22.

Mao, W., Zhang, K., Zhu, R., Simchi-Levi, D., and Başar, T. (2021). Is model-free learning nearly optimal for non-stationary rl? In International Conference on Machine Learning.

Meng, L. and Zheng, B. (2010). The optimal perturbation bounds of the moore–penrose inverse under the frobenius norm. Linear algebra and its applications, 432(4):956–963.

Mhammedi, Z. and Koolen, W. M. (2020). Lipschitz and comparator-norm adaptivity in online learning. Conference on Learning Theory.

Mhammedi, Z., Koolen, W. M., and Van Erven, T. (2019).
Lipschitz adaptivity with multiple learning rates in online learning. Conference on Learning Theory. Mokhtari, A., Shahrampour, S., Jadbabaie, A., and Ribeiro, A. (2016). Online optimization in dynamic environments: Improved regret rates for strongly convex problems. In 55th IEEE Conference on Decision and Control, pages 7195–7201. Neu, G., Gyorgy, A., and Szepesvári, C. (2012). The adversarial stochastic shortest path problem with unknown transition probabilities. In Artificial Intelligence and Statistics , pages 805–813. 134 Neu, G., György, A., Szepesvári, C., and Antos, A. (2013). Online markov decision processes under bandit feedback. IEEE Transactions on Automatic Control, 59(3):676–691. Neu, G., György, A., Szepesvári, C., et al. (2010). The online loop-free stochastic shortest-path problem. In Conference on Learning Theory. Neu, G. and Olkhovskaya, J. (2020). Online learning in mdps with linear function approximation and bandit feedback. arXiv preprint arXiv:2007.01612. Neu, G. and Olkhovskaya, J. (2021). Online learning in mdps with linear function approximation and bandit feedback. arXiv preprint arXiv:2007.01612v2. Ortner, R., Gajane, P., and Auer, P. (2020). Variational regret bounds for reinforcement learning. In Uncertainty in Artificial Intelligence , pages 81–90. PMLR. Pacchiano, A., Dann, C., Gentile, C., and Bartlett, P. (2020a). Regret bound balancing and elimination for model selection in bandits and rl. arXiv preprint arXiv:2012.13045. Pacchiano, A., Phan, M., Abbasi-Yadkori, Y., Rao, A., Zimmert, J., Lattimore, T., and Szepesvari, C. (2020b). Model selection in contextual stochastic bandit problems. In Conference on Neural Information Processing Systems. Rakhlin, A. and Sridharan, K. (2013a). Online learning with predictable sequences. Conference on Learning Theory. Rakhlin, A. and Sridharan, K. (2013b). Optimization, learning, and games with predictable sequences. Advances in Neural Information Processing Systems, 26:3066–3074. Rosenberg, A. and Mansour, Y. (2019a). Online convex optimization in adversarial Markov decision processes. In Proceedings of the 36th International Conference on Machine Learning. 135 Rosenberg, A. and Mansour, Y. (2019b). Online convex optimization in adversarial markov decision processes. In International Conference on Machine Learning. Rosenberg, A. and Mansour, Y. (2020). Stochastic shortest path with adversarially changing costs. arXiv preprint arXiv:2006.11561. Russac, Y., Cappé, O., and Garivier, A. (2020). Algorithms for non-stationary generalized linear bandits. arXiv preprint arXiv:2003.10113. Russac, Y., Vernade, C., and Cappé, O. (2019). Weighted linear bandits for non-stationary environments. Advances in Neural Information Processing Systems. Sani, A., Neu, G., and Lazaric, A. (2014). Exploiting easy data in online optimization. Advances in Neural Information Processing Systems, 27:810–818. Sayin, M. O., Parise, F., and Ozdaglar, A. (2020). Fictitious play in zero-sum stochastic games. arXiv preprint arXiv:2010.04223. Seldin, Y. and Lugosi, G. (2017). An improved parametrization and analysis of the exp3++ algorithm for stochastic and adversarial bandits. In Conference on Learning Theory. Seldin, Y. and Slivkins, A. (2014). One practical algorithm for both stochastic and adversarial bandits. In International Conference on Machine Learning, pages 1287–1295. Shalev-Shwartz, S. and Ben-David, S. (2014). Understanding machine learning: From theory to algorithms. Cambridge university press. 
Shani, L., Efroni, Y., Rosenberg, A., and Mannor, S. (2020). Optimistic policy optimization with bandit feedback. In International Conference on Machine Learning, pages 8604–8613. PMLR. 136 Sidford, A., Wang, M., Wu, X., Yang, L. F., and Ye, Y. (2018). Near-optimal time and sample complexities for solving markov decision processes with a generative model. In Advances in Neural Information Processing Systems, pages 5192–5202. Simchowitz, M. and Jamieson, K. G. (2019). Non-asymptotic gap-dependent regret bounds for tabular mdps. In Conference on Neural Information Processing Systems. Song, Z., Mei, S., and Bai, Y. (2021). When can we learn general-sum markov games with a large number of players sample-efficiently? arXiv preprint arXiv:2110.04184. Steinhardt, J. and Liang, P. (2014a). Adaptivity and optimism: An improved exponentiated gradient algorithm. In International Conference on Machine Learning, pages 1593–1601. Steinhardt, J. and Liang, P. (2014b). Adaptivity and optimism: An improved exponentiated gradient algorithm. In Proceedings of the 31st International Conference on Machine Learning. Syrgkanis, V., Agarwal, A., Luo, H., and Schapire, R. E. (2015). Fast convergence of regularized learning in games. In Advances in Neural Information Processing Systems 28. Tian, Y., Wang, Y., Yu, T., and Sra, S. (2020). Provably efficient online agnostic learning in markov games. arXiv preprint arXiv:2010.15020. Tian, Y., Wang, Y., Yu, T., and Sra, S. (2021). Online learning in unknown markov games. In International Conference on Machine Learning. Touati, A. and Vincent, P. (2020). Efficient learning in non-stationary linear markov decision processes. arXiv preprint arXiv:2010.12870. Tropp, J. A. (2012). User-friendly tail bounds for sums of random matrices. Foundations of computational mathematics, 12(4):389–434. 137 van Erven, T. and Koolen, W. M. (2016). Metagrad: Multiple learning rates in online learning. Advances in Neural Information Processing Systems. Wang, R., Du, S. S., Yang, L. F., and Kakade, S. M. (2020a). Is long horizon reinforcement learning more difficult than short horizon reinforcement learning? In Conference on Neural Information Processing Systems. Wang, R., Du, S. S., Yang, L. F., and Salakhutdinov, R. (2020b). On reward-free reinforcement learning with linear function approximation. arXiv preprint arXiv:2006.11274. Wei, C.-Y., Dann, C., and Zimmert, J. (2022). A model selection approach for corruption robust reinforcement learning. In International Conference on Algorithmic Learning Theory, pages 1043–1096. PMLR. Wei, C.-Y., Hong, Y.-T., and Lu, C.-J. (2016). Tracking the best expert in non-stationary stochastic environments. Advances in neural information processing systems, 29:3972–3980. Wei, C.-Y., Hong, Y.-T., and Lu, C.-J. (2017). Online reinforcement learning in stochastic games. In Advances in Neural Information Processing Systems. Wei, C.-Y., Jahromi, M. J., Luo, H., and Jain, R. (2021a). Learning infinite-horizon average-reward mdps with linear function approximation. In International Conference on Artificial Intelligence and Statistics, pages 3007–3015. PMLR. Wei, C.-Y., Jahromi, M. J., Luo, H., Sharma, H., and Jain, R. (2020a). Model-free reinforcement learning in infinite-horizon average-reward markov decision processes. In International Conference on Machine Learning, pages 10170–10180. PMLR. 138 Wei, C.-Y., Lee, C.-W., Zhang, M., and Luo, H. (2021b). 
Last-iterate convergence of decentralized optimistic gradient descent/ascent in infinite-horizon competitive markov games. In Conference on Learning Theory, pages 4259–4299. PMLR. Wei, C.-Y. and Luo, H. (2018). More adaptive algorithms for adversarial bandits. In Conference On Learning Theory, pages 1263–1291. Wei, C.-Y. and Luo, H. (2021). Non-stationary reinforcement learning without prior knowledge: An optimal black-box approach. In Conference on Learning Theory. Wei, C.-Y., Luo, H., and Agarwal, A. (2020b). Taking a hint: How to leverage loss predictors in contextual bandits? In Conference on Learning Theory, pages 3583–3634. PMLR. Wintenberger, O. (2017). Optimal learning with bernstein online aggregation. Machine Learning, 106(1):119–141. Wu, T., Yang, Y., Du, S., and Wang, L. (2021). On reinforcement learning with adversarial corruption and its application to block mdp. In International Conference on Machine Learning. Xie, Q., Chen, Y., Wang, Z., and Yang, Z. (2020). Learning zero-sum simultaneous-move markov games using function approximation and correlated equilibrium. arXiv preprint arXiv:2002.07066. Yang, L. and Wang, M. (2020). Reinforcement learning in feature space: Matrix bandit, kernels, and regret bound. In International Conference on Machine Learning, pages 10746–10756. PMLR. Yang, T., Zhang, L., Jin, R., and Yi, J. (2016). Tracking slowly moving clairvoyant: Optimal dynamic regret of online learning with true and noisy gradient. In International Conference on Machine Learning, pages 449–457. 139 Zanette, A., Brandfonbrener, D., Brunskill, E., Pirotta, M., and Lazaric, A. (2020a). Frequentist regret bounds for randomized least-squares value iteration. In International Conference on Artificial Intelligence and Statistics , pages 1954–1964. PMLR. Zanette, A. and Brunskill, E. (2019). Tighter problem-dependent regret bounds in reinforcement learning without domain knowledge using value function bounds. In International Conference on Machine Learning. Zanette, A., Cheng, C.-A., and Agarwal, A. (2021). Cautiously optimistic policy optimization and exploration with linear function approximation. In Conference on Learning Theory. Zanette, A., Lazaric, A., Kochenderfer, M., and Brunskill, E. (2020b). Learning near optimal policies with low inherent bellman error. In Internation Conference on Machine Learning. Zhang, L., Lu, S., and Zhou, Z.-H. (2018). Adaptive online learning in dynamic environments. In Advances in Neural Information Processing Systems, pages 1330–1340. Zhang, L., Yang, T., Yi, J., Rong, J., and Zhou, Z.-H. (2017). Improved dynamic regret for non-degenerate functions. In Advances in Neural Information Processing Systems, pages 732–741. Zhang, R., Ren, Z., and Li, N. (2021a). Gradient play in stochastic games: stationary points, convergence, and sample complexity. arXiv preprint arXiv:2106.00198. Zhang, X., Chen, Y., Zhu, J., and Sun, W. (2021b). Corruption-robust offline reinforcement learning. arXiv preprint arXiv:2106.06630. Zhang, X., Chen, Y., Zhu, X., and Sun, W. (2021c). Robust policy gradient against strong data corruption. In International Conference on Machine Learning. 140 Zhang, Z., Yang, J., Ji, X., and Du, S. S. (2021d). Variance-aware confidence set: Variance-dependent bound for linear bandits and horizon-free bound for linear mixture mdp. In Conference on Neural Information Processing Systems. Zhao, P. and Zhang, L. (2021). Non-stationary linear bandits revisited. arXiv preprint arXiv:2103.05324. Zhao, P., Zhang, L., Jiang, Y., and Zhou, Z.-H. 
(2020). A simple approach for non-stationary linear bandits. In Chiappa, S. and Calandra, R., editors, Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics , volume 108 of Proceedings of Machine Learning Research, pages 746–755. PMLR. Zhao, Y., Tian, Y., Lee, J., and Du, S. (2022). Provably efficient policy optimization for two-player zero-sum markov games. In International Conference on Artificial Intelligence and Statistics , pages 2736–2761. PMLR. Zhou, H., Chen, J., Varshney, L. R., and Jagmohan, A. (2020). Nonstationary reinforcement learning with linear function approximation. arXiv preprint arXiv:2010.04244. Zimin, A. and Neu, G. (2013). Online learning in episodic markovian decision processes by relative entropy policy search. Advances in neural information processing systems, 26:1583–1591. Zimmert, J. and Seldin, Y. (2019). An optimal algorithm for stochastic and adversarial bandits. In The 22nd International Conference on Artificial Intelligence and Statistics . PMLR. Zinkevich, M. (2003). Online convex programming and generalized infinitesimal gradient ascent. In International Conference on Machine Learning. 141 Appendix A Omitted Details in Chapter 2 A.1 Useful lemmas for optimistic online mirror descent Lemma 17. Define w ⋆ = argmin w∈K ⟨w,x⟩+D ψ (w,w ′ ) for some compact convex setK⊂R d , convex function ψ, an arbitrary point x∈R d , and a point w ′ ∈K. Then for any u∈K: ⟨w ⋆ −u,x⟩≤D ψ (u,w ′ )−D ψ (u,w ⋆ )−D ψ (w ⋆ ,w ′ ). Proof. This is shown for example in the proof of (Wei and Luo, 2018, Lemma 1), and is by direct calculations plus the first-order optimality condition of w ⋆ . Lemma 18. Let w t = argmin w∈K ⟨w,m t ⟩ +D ψt (w,w ′ t ) and w ′ t+1 = argmin w∈K ⟨w,ℓ t ⟩ +D ψt (w,w ′ t ) for some compact convex setK⊂R d , convex function ψ t , arbitrary points ℓ t ,m t ∈R d , and a point w ′ t ∈K. Then, for any u∈K we have ⟨w t −u,ℓ t ⟩≤ w t −w ′ t+1 ,ℓ t −m t +D ψt (u,w ′ t )−D ψt (u,w ′ t+1 )−D ψt (w ′ t+1 ,w t )−D ψt (w t ,w ′ t ). 142 Proof. We apply Lemma 17 with w ⋆ =w t ,u =w ′ t+1 to obtain w t −w ′ t+1 ,m t ≤D ψt (w ′ t+1 ,w ′ t )−D ψt (w ′ t+1 ,w t )−D ψt (w t ,w ′ t ), and then with w ⋆ =w ′ t+1 to obtain: w ′ t+1 −u,ℓ t ≤D ψt (u,w ′ t )−D ψt (u,w ′ t+1 )−D ψt (w ′ t+1 ,w ′ t ). Summing the two inequalities above, we have: w t −w ′ t+1 ,m t + w ′ t+1 −u,ℓ t ≤D ψt (u,w ′ t )−D ψt (u,w ′ t+1 )−D ψt (w ′ t+1 ,w t )−D ψt (w t ,w ′ t ). Also note that the left-hand side is equal to: w t −w ′ t+1 ,m t + w ′ t+1 −u,ℓ t = w t −w ′ t+1 ,m t −ℓ t + w t −w ′ t+1 ,ℓ t + w ′ t+1 −u,ℓ t = w t −w ′ t+1 ,m t −ℓ t +⟨w t −u,ℓ t ⟩. Combining and reorganizing terms, we get the desired result. Lemma 19. For any convex function ψ defined on convex set K⊂R d and a point x∈R d , define F x (w) =⟨w,x⟩ +ψ(w) and w x = argmin w∈K F x (w). Suppose that for some x,x ′ ∈R d , there is a constant c such that for all ξ on the segment connecting w x and w x ′,∇ 2 ψ(ξ)≽ c∇ 2 ψ(w x ) holds (which means∇ 2 ψ(ξ)−c∇ 2 ψ(w x ) is positive semi-definite). Then, we have ⟨w x −w x ′,x ′ −x⟩≥ 0 and∥w x −w x ′∥ ∇ 2 ψ(wx) ≤ 2 c ∥x−x ′ ∥ ∇ −2 ψ(wx) . 143 Proof. Note that F x ′(w x )−F x ′(w x ′) = w x −w x ′,x ′ −x +F x (w x )−F x (w x ′) (definition of F) ≤ w x −w x ′,x ′ −x (optimality of w x ) ≤∥w x −w x ′∥ ∇ 2 ψ(wx) x ′ −x ∇ −2 ψ(wx) . 
(Hölder’s inequality) Using Taylor expansion, for some ξ on the segment connecting w x and w x ′, we have F x ′(w x )−F x ′(w x ′) = w x −w ′ x ,∇F x ′(w x ′) + 1 2 ∥w x −w x ′∥ 2 ∇ 2 ψ(ξ) ≥ 1 2 ∥w x −w x ′∥ 2 ∇ 2 ψ(ξ) (first-order optimality of w x ′) ≥ c 2 ∥w x −w x ′∥ 2 ∇ 2 ψ(wx) . (condition of the lemma) Combining we have,⟨w x −w x ′,x ′ −x⟩≥F x ′(w x )−F x ′(w x ′)≥c∥w x −w x ′∥ 2 ∇ 2 ψ(wx) ≥ 0, and also c 2 ∥w x −w x ′∥ 2 ∇ 2 ψ(wx) ≤∥w x −w x ′∥ ∇ 2 ψ(wx) ∥x ′ −x∥ ∇ −2 ψ(wx) , which implies ∥w x −w x ′∥ ∇ 2 ψ(wx) ≤ 2 c x−x ′ ∇ −2 ψ(wx) and finishes the proof. Lemma 20 (Multiplicative Stability). Let Ω = {w∈ ∆ d : w i ≥b i ,∀i∈ [d]} for some b i ∈ [0, 1], w ′ ∈ Ω be such that w ′ i > 0 for all i∈ [d], w = argmin w∈Ω {⟨w,ℓ⟩ +D ψ (w,w ′ )} where ψ(w) = P d i=1 1 η i w i lnw i ,|ℓ i |≤c max , and η i c max ≤ 1 32 for all i and some c max > 0. Then w i ∈ [ 1 √ 2 w ′ i , √ 2w ′ i ]. 144 Proof. Recall that D ψ (w,w ′ ) = P i 1 η i w i ln w i w ′ i −w i +w ′ i . By the KKT condition of the optimiza- tion problem, we have for some λ and µ i ≥ 0, ℓ i + 1 η i ln w i w ′ i −λ−µ i = 0 and µ i (w i −b i ) = 0 for all i. The above gives w i =w ′ i exp (η i (−ℓ i +λ +µ i )). We now separately discuss two cases. Case 1: min i (ℓ i −µ i )̸= max i (ℓ i −µ i ). In this case, we claim that min i (ℓ i −µ i )<λ< max i (ℓ i −µ i ). We prove it by contradiction: If λ≥ max i (ℓ i −µ i ), then X i w i = X i w ′ i exp (η i (−ℓ i +λ +µ i ))> X i w ′ i = 1 contradicting with w∈ ∆ d (the strict inequality is because there exists some j such that max i (ℓ i − µ i )> (ℓ j −µ j ) and w ′ j > 0). We can derive a similar contradiction if λ≤ min i (ℓ i −µ i ). Thus, we conclude min i (ℓ i −µ i )<λ< max i (ℓ i −µ i ). Our second claim is that for all i with µ i ̸= 0, ℓ i −µ i ≥ λ. Indeed, when µ i ̸= 0, we have b i =w i =w ′ i exp (η i (−ℓ i +λ +µ i )). Clearly, exp (η i (−ℓ i +λ +µ i ))≤ 1 must hold; otherwise we have w ′ i 0. By the first claim, λ>ℓ j −µ j , and this contradicts with the second claim. Thus, max i (ℓ i −µ i )− min i (ℓ i −µ i ) = max i (ℓ i −µ i )− min i ℓ i ≤ max i ℓ i − min i ℓ i ≤ 2c max (the inequality is by µ i ≥ 0). Since both λ and ℓ i −µ i are in the range [min i (ℓ i −µ i ), max i (ℓ i −µ i )], we 145 have|−ℓ i +λ +µ i |≤ max i (ℓ i −µ i )− min i (ℓ i −µ i )≤ 2c max . By the condition on η i , we then have w i ∈ h exp(− 1 16 )w ′ i , exp( 1 16 )w ′ i i ⊂ h 1 √ 2 w ′ i , √ 2w ′ i i . Case 2: min i (ℓ i −µ i ) = max i (ℓ i −µ i ). In this case, it is clear that λ =ℓ i −µ i must hold for all i to make w and w ′ both distributions. Thus, w t,i =w ′ t,i for all i. A.2 Omitted details in Section 2.2 Proof of Lemma 1. By Lemma 18, we have (dropping one non-positive term) T X t=1 ⟨w t −u,ℓ t +a t ⟩ ≤ T X t=1 D ψt (u,w ′ t )−D ψt (u,w ′ t+1 ) + T X t=1 w t −w ′ t+1 ,ℓ t −m t +a t −D ψt (w ′ t+1 ,w t ) . (A.1) For the first term, we reorder it and use D ψt (u,v) = P d i=1 1 η t,i f KL (u i ,v i ): T X t=1 D ψt (u,w ′ t )−D ψt (u,w ′ t+1 ) =D ψ 1 (u,w ′ 1 ) + T X t=2 D ψt (u,w ′ t )−D ψ t−1 (u,w ′ t ) −D ψ T (u,w ′ T +1 ) ≤ d X i=1 1 η 1,i f KL (u i ,w ′ 1,i ) + T X t=2 d X i=1 1 η t,i − 1 η t−1,i ! f KL (u i ,w ′ t,i ). For the second term, fix a particular t and define w ⋆ = argmax w∈R d + ⟨w t −w,ℓ t −m t +a t ⟩− D ψt (w,w t ). By the optimality of w ⋆ , we have: ℓ t −m t +a t =∇ψ t (w t )−∇ψ t (w ⋆ ) and thus w ⋆ i =w t,i e −η t,i (ℓ t,i −m t,i +a t,i ) . 
Therefore, we have w t −w ′ t+1 ,ℓ t −m t +a t −D ψt (w ′ t+1 ,w t ) ≤⟨w t −w ⋆ ,ℓ t −m t +a t ⟩−D ψt (w ⋆ ,w t ) =⟨w t −w ⋆ ,∇ψ t (w t )−∇ψ t (w ⋆ )⟩−D ψt (w ⋆ ,w t ) 146 =D ψt (w t ,w ⋆ ) = d X i=1 1 η t,i w t,i ln w t,i w ⋆ i −w t,i +w ⋆ i ! = d X i=1 w t,i η t,i η t,i (ℓ t,i −m t,i +a t,i )− 1 +e −η t,i (ℓ t,i −m t,i +a t,i ) ≤ d X i=1 η t,i w t,i (ℓ t,i −m t,i +a t,i ) 2 , where in the last inequality we apply e −x − 1 +x≤x 2 for x≥−1 and the condition of the lemma η t,i |ℓ t,i −m t,i |≤ 1 32 such thatη t,i |ℓ t,i −m t,i +a t,i |≤η t,i |ℓ t,i −m t,i |+32η 2 t,i (ℓ t,i −m t,i ) 2 ≤ 1 32 + 32 32 2 ≤ 1 16 . Using the definition of a t and the condition η t,i |ℓ t,i −m t,i |≤ 1 32 again, we also continue with w t −w ′ t+1 ,ℓ t −m t +a t −D ψt (w ′ t+1 ,w t )≤ d X i=1 η t,i w t,i ℓ t,i −m t,i + 32η t,i (ℓ t,i −m t,i ) 2 2 ≤ 4 d X i=1 η t,i w t,i (ℓ t,i −m t,i ) 2 . To sum up, combining everything, we have, T X t=1 ⟨w t −u,ℓ t +a t ⟩ ≤ d X i=1 1 η 1,i f KL (u i ,w ′ 1,i ) + T X t=2 d X i=1 1 η t,i − 1 η t−1,i ! f KL (u i ,w ′ t,i ) + 4 T X t=1 d X i=1 η t,i w t,i (ℓ t,i −m t,i ) 2 . Finally, moving P T t=1 ⟨w t −u,a t ⟩ to the right-hand side of the inequality and using the definition of a t again finishes the proof. Proof of Theorem 1. To apply Lemma 1, we notice that the condition 32η t,i |ℓ t,i −m t,i |≤ 1 of Lemma1holdstriviallybythedefinitionof η t,i . Therefore, applying(2.4)withu = (1− 1 T )e i⋆ + 1 T w ′ 1 ∈ T T t=1 Ω t , we have: Reg(e i⋆ ) = Reg(u) + T X t=1 ⟨u−e i⋆ ,ℓ t ⟩ 147 = Reg(u) + 1 T T X t=1 w ′ 1 −e i⋆ ,ℓ t ≤ Reg(u) + 2 ≤ d X i=1 1 η 1,i f KL (u i ,w ′ 1,i ) + T X t=2 d X i=1 1 η t,i − 1 η t−1,i ! f KL (u i ,w ′ t,i ) + 32 T X t=1 d X i=1 η t,i u i (ℓ t,i −m t,i ) 2 − 16 T X t=1 d X i=1 η t,i w t,i (ℓ t,i −m t,i ) 2 + 2. (A.2) For the first term, note that u i ≤w ′ 1,i when i̸=i ⋆ , and η 1,i = 1 64 . Thus, d X i=1 1 η 1,i f KL (u i ,w ′ 1,i ) = d X i=1 1 η 1,i u i ln u i w ′ 1,i −u i +w ′ 1,i ! ≤ 64u i⋆ ln u i⋆ w ′ 1,i⋆ + d X i=1 64· 1 d =O(lnd). For the second term, we proceed as T X t=2 d X i=1 1 η t,i − 1 η t−1,i ! f KL (u i ,w ′ t,i ) = T X t=2 d X i=1 1 η t,i − 1 η t−1,i ! u i ln u i w ′ t,i −u i +w ′ t,i ! ≤ T X t=2 1 η t,i⋆ − 1 η t−1,i⋆ ! u i⋆ ln u i⋆ w ′ t,i⋆ ! + T X t=2 d X i=1 1 η t,i − 1 η t−1,i ! w ′ t,i (u i = 1 dT ≤w ′ t,i for i̸=i ⋆ ) = T X t=2 1 η t,i⋆ − 1 η t−1,i⋆ ! u i⋆ ln u i⋆ w ′ t,i⋆ ! + T X t=2 d X i=1 1 η 2 t,i − 1 η 2 t−1,i 1 η t,i + 1 η t−1,i w ′ t,i ≤ ln(dT ) η T,i⋆ + T X t=2 d X i=1 η t−1,i 1 η 2 t,i − 1 η 2 t−1,i ! w ′ t,i (u i⋆ ln u i⋆ w ′ t,i⋆ ≤ ln(dT )) ≤ 64 ln(dT ) + v u u t ln(dT ) T X t=1 (ℓ t,i⋆ −m t,i⋆ ) 2 + T X t=2 d X i=1 1 ln(dT ) η t−1,i w ′ t,i (ℓ t−1,i −m t−1,i ) 2 (by the definition of η t,i ) ≤ 64 ln(dT ) + v u u t ln(dT ) T X t=1 (ℓ t,i⋆ −m t,i⋆ ) 2 + T X t=2 d X i=1 2η t−1,i w t−1,i (ℓ t−1,i −m t−1,i ) 2 , 148 where the last step uses the fact w ′ t,i ≤ 2w t−1,i according to the multiplicative stability lemma Lemma 20 (which asserts w ′ t,i ∈ [ 1 √ 2 w ′ t−1,i , √ 2w ′ t−1,i ] and w t−1,i ∈ [ 1 √ 2 w ′ t−1,i , √ 2w ′ t−1,i ]). Note that the last term is then canceled by the fourth term of (A.2). For the third term of (A.2), we have T X t=1 d X i=1 η t,i u i (ℓ t,i −m t,i ) 2 ≤ T X t=1 η t,i⋆ (ℓ t,i⋆ −m t,i⋆ ) 2 + 1 dT T X t=1 d X i=1 η t,i (ℓ t,i −m t,i ) 2 ≤ T X t=1 s ln(dT ) P s<t (ℓ s,i⋆ −m s,i⋆ ) 2 · (ℓ t,i⋆ −m t,i⋆ ) 2 + 1 ≤O v u u t ln(dT ) T X t=1 (ℓ t,i⋆ −m t,i⋆ ) 2 + 1 . Combining everything then proves Reg(e i⋆ ) =O ln(dT ) + v u u t ln(dT ) T X t=1 (ℓ t,i⋆ −m t,i⋆ ) 2 . 
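To make the two-step update analyzed in the proofs of Lemma 1 and Theorem 1 concrete, below is a small numerical sketch of one round of the optimistic weighted-entropy mirror-descent update with the correction term a_{t,i} = 32 η_{t,i} (ℓ_{t,i} − m_{t,i})^2, over a clipped simplex {w ∈ ∆_d : w_i ≥ lb}. The constrained minimizer is computed by a bisection over the simplex multiplier, mirroring the KKT system written out in the proof of Lemma 20. The function names, the scalar lower bound `lb`, and the bisection budget are illustrative choices for this sketch (which also assumes that w' lies in the simplex and that d · lb ≤ 1), not part of the algorithm as defined in Chapter 2.

```python
import numpy as np

def clipped_entropy_argmin(w_prev, eta, x, lb, iters=200):
    # Solves  min_{w in simplex, w_i >= lb}  <w, x> + D_psi(w, w_prev)
    # for psi(w) = sum_i (1/eta_i) * w_i * log(w_i).  The KKT system (as in the
    # proof of Lemma 20) gives w_i = max(lb, w_prev_i * exp(eta_i * (lam - x_i)))
    # with lam chosen so the coordinates sum to one; the sum is increasing in
    # lam, so a bisection suffices.  Assumes sum(w_prev) = 1 and len(x) * lb <= 1.
    def total(lam):
        expo = np.clip(eta * (lam - x), -500.0, 500.0)   # guard against overflow
        return np.sum(np.maximum(lb, w_prev * np.exp(expo)))
    lo, hi = np.min(x) - 500.0 / np.min(eta), np.max(x)  # total(lo) <= 1 <= total(hi)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if total(mid) < 1.0 else (lo, mid)
    expo = np.clip(eta * (0.5 * (lo + hi) - x), -500.0, 500.0)
    return np.maximum(lb, w_prev * np.exp(expo))

def optimistic_round(w_prime, eta, m_t, loss_t, lb):
    # One round of the two-step update analyzed above: play w_t built from the
    # optimistic prediction m_t, then move w' using the observed loss plus the
    # correction a_{t,i} = 32 * eta_i * (loss_i - m_i)^2.
    w_t = clipped_entropy_argmin(w_prime, eta, m_t, lb)
    a_t = 32.0 * eta * (loss_t - m_t) ** 2
    w_next = clipped_entropy_argmin(w_prime, eta, loss_t + a_t, lb)
    return w_t, w_next
```

A typical invocation would take eta to be the per-coordinate learning rates η_{t,i}, lb = 1/(dT), and w_prime initialized to the uniform distribution, matching the quantities appearing in the proof above.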
A.3 Omitted details in Section 2.3 Proof of Theorem 2. The regret Reg(u) = P T t=1 ⟨w t −u,ℓ t ⟩ can be decomposed as the regret of base algorithm k ⋆ : Reg B⋆ (u) = P T t=1 ⟨w k⋆ t −u,ℓ t ⟩, plus the regret of the master to this base algorithm: P T t=1 ⟨p t −e k⋆ ,g t ⟩ = P T t=1 ⟨w t −w k⋆ t ,ℓ t ⟩ (by the definition of w t andg t ). It thus remains to apply the regret guarantee of MsMwC from Lemma 1 (with u in that lemma set to e k⋆ ), since the conditions of the lemma hold by the fact g t,k −h t,k = D w k t ,ℓ t −m t E . The first term in (2.4) becomes 1 η⋆ ln 1 p ′ 1,k⋆ + P k p ′ 1,k η k , which is 1 η⋆ ln P k η 2 k η 2 ⋆ + P k η k P k η 2 k by the definition of p ′ 1 . The second term 149 in (2.4) is simply zero since the learning rate stays the same over time. The third term equals 32η ⋆ P T t=1 ⟨w k⋆ t ,ℓ t −m t ⟩ 2 . Dropping the last negative term then finishes the proof. A.4 Omitted details in Section 2.4 Proof of Theorem 3. By the construction, for any u there exists k ⋆ such that η k⋆ ≤ min 1 64 , r KL(u,π)+lnV (u) V (u) ≤ 2η k⋆ . Therefore, from (2.7) we have: Reg A k⋆ (u)≤ KL(u,π) 2η k⋆ + 64η k⋆ T X t=1 d X i=1 u i (ℓ t,i −m t,i ) 2 − 32η k⋆ T X t=1 d X i=1 w k⋆ t,i (ℓ t,i −m t,i ) 2 ≤O KL(u,π) + q (KL(u,π) + lnV (u))V (u) − 32η k⋆ T X t=1 d X i=1 w k⋆ t,i (ℓ t,i −m t,i ) 2 . (A.3) Further note that 32η k D w k t ,ℓ t −m t E ≤ 32η k ∥ℓ t −m t ∥ ∞ ≤ 1. Hence, we apply Theorem 2 with P k η k = Θ(1), P k η 2 k = Θ(1), P k η 2 k η 2 k⋆ =O 1/η 2 k⋆ =O V (u) KL(u,π)+lnV (u) =O (V (u)), and cancel the last term in (2.5) by the last negative term in (A.3) via Cauchy-Schwarz inequality, arriving at REG(u)≤O KL(u,π) + q (KL(u,π) + lnV (u))V (u) + 1 η k⋆ ln P k η 2 k η 2 k⋆ =O KL(u,π) + lnV (u) + q (KL(u,π) + lnV (u))V (u) and finishing the proof. Proof of Theorem 4. Bythedefinitionof S, itisclearthat|S|isatmostO (d log 2 T )soouralgorithm is efficient. For any i ⋆ ∈ [d], there exists a k ⋆ such that η k⋆ ≤ min ( 1 128c i⋆ , r Γ i⋆ P T t=1 (ℓ t,i⋆ −m t,i⋆ ) 2 ) ≤ 150 2η k⋆ . Moreover, 32·η t,i |ℓ t,i −m t,i |≤ 128η k c i ≤ 1 for alli∈Z(k). Hence, the conditions of Lemma 1 hold, and with|Z(k)|≤d we have Reg A k⋆ (e i⋆ )≤O lnd η k⋆ + 64η k⋆ T X t=1 (ℓ t,i⋆ −m t,i⋆ ) 2 − 32η k⋆ T X t=1 d X i=1 w t,i (ℓ t,i −m t,i ) 2 =O c i⋆ Γ i⋆ + v u u t Γ i⋆ T X t=1 (ℓ t,i⋆ −m t,i⋆ ) 2 − 32η k⋆ T X t=1 d X i=1 w t,i (ℓ t,i −m t,i ) 2 . Next, also note that the conditions of Theorem 2 hold since 32η k | D w k t ,ℓ t −m t E |≤ 64η k max i∈Z(k) c i ≤ 1. Thus, with the last negative term from the bound for Reg A k⋆ (e i⋆ ) above canceling the last term of (2.5), and P k η k = Θ(1/c min ), P k η 2 k = Θ(1/c 2 min ), and P k η 2 k η 2 k⋆ =O c 2 i⋆ T c 2 min , we obtain: Reg(e i⋆ ) =O c i⋆ Γ i⋆ + v u u t Γ i⋆ T X t=1 (ℓ t,i⋆ −m t,i⋆ ) 2 + 1 η k⋆ Γ i⋆ +c min =O c i⋆ Γ i⋆ + v u u t Γ i⋆ T X t=1 (ℓ t,i⋆ −m t,i⋆ ) 2 , which completes the proof. Proof of Theorem 5. We first focus on a specific j and bound the regret withinI j . The regret in this interval can be decomposed as X t∈I j ⟨w t −u j ,ℓ t ⟩ = X t∈I j ⟨w t −w r t ,ℓ t ⟩ + X t∈I j ⟨w r t −u j ,ℓ t ⟩ = X t∈I j ⟨p t −e r ,g t ⟩ + X t∈I j ⟨w r t −u j ,ℓ t ⟩ ≤ X t∈I j ⟨p t −e r ,g t ⟩ + X t∈I j ⟨w r t −u j ,ℓ t ⟩ +O(1) (define e r = (1− 1 T )e r + 1 ⌈log 2 T⌉T ) 151 for any r∈ [⌈log 2 T⌉]. The term P t∈I j ⟨w r t −u j ,ℓ t ⟩ corresponds to the regret of the r-th base algorithm in the interval I j . 
Let s j be the first time index in I j , and recall that the r-th expert is an MsMwC with a fixed learning rate 2η r , and a feasible set Ω t ={w∈ ∆ d :w i ≥ 1 dT }. To upper bound it, we follow the exact same arguments as in the proof of Lemma 1, except for replacing the summation range [1,T ] withI j . This leads to: X t∈I j ⟨w r t −u j ,ℓ t ⟩ ≤ 1 2η r d X i=1 f KL (u j,i ,w r′ s j ,i ) + 32 X t∈I j d X i=1 2η r u j,i (ℓ t,i −m t,i ) 2 − 16 X t∈I j d X i=1 2η r w r t,i (ℓ t,i −m t,i ) 2 = 1 2η r d X i=1 u j,i ln u j,i w r′ s j ,i + 32 X t∈I j d X i=1 2η r u j,i (ℓ t,i −m t,i ) 2 − 16 X t∈I j d X i=1 2η r w r t,i (ℓ t,i −m t,i ) 2 ≤ 1 2η r ln(dT ) + 32 X t∈I j d X i=1 2η r u j,i (ℓ t,i −m t,i ) 2 − 16 X t∈I j d X i=1 2η r w r t,i (ℓ t,i −m t,i ) 2 . Next, we deal with P t∈I j ⟨p t −e r ,g t ⟩. Recall that MsMwC-Master uses a regularizer ψ(p) = P ⌈log 2 T⌉ k=1 1 η k p k lnp k . Again, similarly to the proof of Lemma 1, considering the regret only inI j and dropping the negative term, we have X t∈I j ⟨p t −e r ,g t ⟩≤ X t∈I j D ψ (e r ,p ′ t )−D ψ (e r ,p ′ t+1 ) + 32 X t∈I j ⌈log 2 T⌉ X k=1 η k e r,k (g t,k −h t,k ) 2 ≤D ψ (e r ,p ′ s j )−D ψ (e r ,p ′ s j+1 ) + 32η r X t∈I j ⟨w r t ,ℓ t −m t ⟩ 2 +O(1), where s j+1 is defined as T + 1 if j is the last interval. We further deal with the first term above: D ψ (e r ,p ′ s j )−D ψ (e r ,p ′ s j+1 ) = ⌈log 2 T⌉ X k=1 1 η k e r,k ln p ′ s j+1 ,k p ′ s j ,k +p ′ s j ,k −p ′ s j+1 ,k ! 152 ≤ ln(⌈log 2 T⌉T ) η r + ⌈log 2 T⌉ X k=1 1 η k p ′ s j ,k −p ′ s j+1 ,k +O(ln(dT )). Combining all bounds above, we get that for any r∈⌈log 2 T⌉: X t∈I j ⟨w t −u j ,ℓ t ⟩ ≤ 1 2η r ln(dT ) + 32 X t∈I j d X i=1 2η r u j,i (ℓ t,i −m t,i ) 2 − 16 X t∈I j d X i=1 2η r w r t,i (ℓ t,i −m t,i ) 2 + ln(⌈log 2 T⌉T ) η r + ⌈log 2 T⌉ X k=1 1 η k p ′ s j ,k −p ′ s j+1 ,k + 32η r X t∈I j ⟨w r t ,ℓ t −m t ⟩ 2 +O(ln(dT )) ≤O ln(dT ) η r +η r X t∈I j d X i=1 u j,i (ℓ t,i −m t,i ) 2 +O(ln(dT )) + ⌈log 2 T⌉ X k=1 1 η k p ′ s j ,k −p ′ s j+1 ,k where we use Jenson’s inequality:⟨w r t ,ℓ t −m t ⟩ 2 ≤ P d i=1 w r t,i (ℓ t,i −m t,i ) 2 . Specifically, applying the above bound with the r such that η r ≤ min 1 64 , v u u t ln(dT ) P t∈I j P d i=1 u j,i (ℓ t,i −m t,i ) 2 ≤ 2η r , we get X t∈I j ⟨w t −u j ,ℓ t ⟩ =O v u u u tln(dT ) X t∈I j d X i=1 u j,i (ℓ t,i −m t,i ) 2 + ln(dT ) + ⌈log 2 T⌉ X k=1 1 η k p ′ s j ,k −p ′ s j+1 ,k . (A.4) Finally, summing the above bound over j = 1, 2,...,S and telescoping, we get S X j=1 X t∈I j ⟨w t −u j ,ℓ t ⟩ =O S X j=1 v u u u tln(dT ) X t∈I j d X i=1 u j,i (ℓ t,i −m t,i ) 2 +S ln(dT ) + ⌈log 2 T⌉ X k=1 p ′ 1,k η k =O S X j=1 v u u u tln(dT ) X t∈I j d X i=1 u j,i (ℓ t,i −m t,i ) 2 +S ln(dT ) , 153 finishing the proof. Note that importantly, the last term in (A.4) only disappears (mostly) after summed over all intervals. As mentioned, getting an interval regret bound like (A.4) but without the last term is impossible, proven in the next section. A.4.1 Impossible results for interval regret Theorem 29. For a two-expert problem with loss range [−1, 1], it is impossible to achieve the following regret bound for all intervalI⊆ [1,T ] and all comparators i∈{1, 2} simultaneously: X t∈I ⟨p t −e i ,ℓ t ⟩ = ˜ O s X t∈I |ℓ t,i | + 1 . Proof. Consider an envinronment where the losses of Expert 1 is a deterministic value ℓ t,1 = 0, and the losses of Expert 2 are i.i.d. chosen in each round according to the following: ℓ t,2 = 1 with probability 1 2 −ϵ −1 with probability 1 2 +ϵ where ϵ =T − 1 5 . We assume that ϵ≤ 1 4 (which is equivalent to assuming T≥ 4 5 ). 
For simplicity, we call this distributionD. Note that the expected loss of Expert 2 is−2ϵ, smaller than that of Expert 1. Therefore, in this environment, the expected regret of the learner during [1,T ] would be E[Reg [1,T ] (e 2 )] = 2ϵE " T X t=1 p t,1 # . 154 Define L = T 3 10 , and divide the whole horizon into T L = T 7 10 intervals. Denote them asI k = {(k− 1)L + 1,...,kL} for k = 1, 2,..., T L . Let k ⋆ = argmin k E X t∈I k p t,1 . That is, k ⋆ is the interval where the learner would put least weight on Expert 1 in expectation. We then create another environment, where the loss of Expert 2 is same as the previous environment in interval 1, 2,...,k ⋆ − 1, but change to the following starting from interval k ⋆ : ℓ t,2 = 1 with probability 1 2 +ϵ −1 with probability 1 2 −ϵ We call this distributionD ′ . In this alternative environment, starting from interval k ⋆ , the best expert becomes Expert 1, and the expected interval regret of the learner is E ′ [Reg I k ⋆ (e 1 )] = 2ϵE ′ X t∈I k ⋆ p t,2 = 2ϵL− 2ϵE ′ X t∈I k ⋆ p t,1 . (A.5) where we useE ′ [·] to denote the expectation under this alternative environment. Below we denote the probability measure under the two environments asP andP ′ respectively. Since p t,1 is a function of{ℓ τ } t−1 τ=1 , by standard arguments, E ′ X t∈I k ⋆ p t,1 −E X t∈I k ⋆ p t,1 ≤L P({ℓ τ } τ=1,...,k ⋆ L )−P ′ ({ℓ τ } τ=1,...,k ⋆ L ) TV (∥·∥ TV is the total variation) ≤ L 2 q KL(P({ℓ τ } τ=1,...,k ⋆ L ), P ′ ({ℓ τ } τ=1,...,k ⋆ L )) (Pinsker’s inequality) = L 2 q LKL(D, D ′ ) 155 = L 3 2 2 v u u t 1 2 +ϵ ln 1 2 +ϵ 1 2 −ϵ + 1 2 −ϵ ln 1 2 −ϵ 1 2 +ϵ ≤ L 3 2 2 v u u t 2ϵ ln 1 2 +ϵ 1 2 −ϵ ≤ L 3 2 2 s 4ϵ 2 1 2 −ϵ ≤ 2L 3 2 ϵ, where we use ln(1 +α) ≤ α and ϵ ≤ 1 4 . Notice that L T 1 2ϵ E[Reg [1,T ] (e 2 )] = L T E h P T t=1 p t,1 i ≥ E h P t∈I k ⋆ p t,1 i by the definition of k ⋆ , andE ′ h P t∈I k ⋆ p t,1 i =L− E ′ [Reg I k ⋆ (e 1 )] 2ϵ by (A.5). Using them in the above inequality, we get L− E ′ [Reg I k ⋆ (e 1 )] 2ϵ − L 2ϵT E[Reg [1,T ] (e 2 )]≤ 2L 3 2 ϵ. Using the values we choose, this is equivalent to T 3 10 − T 2 10 2 E ′ [Reg I k ⋆ (e 1 )]− 1 2T 5 10 E[Reg [1,T ] (e 2 )]≤ 2T 1 4 . When T is large enough, we see that either E[Reg [1,T ] (e 2 )]≥ Ω( T 8 10 ) orE ′ [Reg I k ⋆ (e 1 )]≥ Ω( T 1 10 ). However, the desired bound q P t∈I |ℓ t,i | isO( √ T ) andO(1) in the two cases respectively. One of them must be violated, thus the desired bound is impossible. A.5 Omitted details in Section 2.5 In this section, when presenting the base algorithms, we sometimes use Ω as its decision set, which should be seen as a subset ofK (thus its size is bounded by D as well) and will be set appropriately by the master. 156 Algorithm 34: A Variant of Online Newton Step 1 Parameters: learning rate η> 0, w ′ 1 = ⃗ 0. 2 Define: c t (w) =⟨w,ℓ t ⟩ + 16η⟨w,ℓ t −m t ⟩ 2 and ∇ t =∇c t (w t ) =ℓ t + 32η⟨w t ,ℓ t −m t ⟩ (ℓ t −m t ). 3 for t = 1,..., T do 4 Receive prediction m t and range hint z t . 5 Update w t = argmin w∈Ω {⟨w,m t ⟩ +D ψt (w,w ′ t )} where ψ t (w) = 1 2 ∥w∥ 2 At and A t =η 4z 2 1 ·I + t−1 X s=1 (∇ s −m s )(∇ s −m s ) ⊤ + 4z 2 t ·I ! . 6 Receive ℓ t . 7 Update w ′ t+1 = argmin w∈Ω {⟨w,∇ t ⟩ +D ψt (w,w ′ t )}. A.5.1 Combining Online Newton Step We first introduce a variant of the ONS algorithm (Algorithm 34) and present its regret guarantee. We consider a slightly more general setup where the algorithm receives a range hint z t at the beginning of round t, which is guaranteed to satisfy∥ℓ t −m t ∥≤z t . 
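As a concrete illustration, here is a minimal sketch of the main computations in Algorithm 34. It keeps the running matrix A_t, plays the optimistic iterate, and updates with the surrogate gradient ∇_t; for brevity the Bregman projection onto Ω is omitted (the sketch assumes the unconstrained minimizers already lie in Ω, whereas a real implementation would project). The class and method names are illustrative only.

```python
import numpy as np

class OnsVariantSketch:
    # Sketch of the ONS variant (Algorithm 34).  With the quadratic regularizer
    # psi_t(w) = (1/2) ||w||_{A_t}^2, the unconstrained minimizers have the
    # closed forms used below; the projection onto Omega is omitted.
    def __init__(self, dim, eta, z1):
        self.eta, self.dim = eta, dim
        # running sum 4 z_1^2 I + sum_{s < t} (grad_s - m_s)(grad_s - m_s)^T
        self.S = 4.0 * z1 ** 2 * np.eye(dim)
        self.w_prime = np.zeros(dim)

    def play(self, m_t, z_t):
        # w_t = argmin <w, m_t> + D_{psi_t}(w, w'_t)   (unconstrained form)
        self.A = self.eta * (self.S + 4.0 * z_t ** 2 * np.eye(self.dim))
        self.m_t = m_t
        self.w_t = self.w_prime - np.linalg.solve(self.A, m_t)
        return self.w_t

    def update(self, loss_t):
        # surrogate gradient of c_t(w) = <w, loss> + 16*eta*<w, loss - m>^2 at w_t
        diff = loss_t - self.m_t
        grad = loss_t + 32.0 * self.eta * np.dot(self.w_t, diff) * diff
        # w'_{t+1} = argmin <w, grad> + D_{psi_t}(w, w'_t)   (unconstrained form)
        self.w_prime = self.w_prime - np.linalg.solve(self.A, grad)
        self.S += np.outer(grad - self.m_t, grad - self.m_t)
```

Note that, as in the pseudocode, w'_{t+1} is computed from w'_t rather than from w_t, and the outer product added to the running sum uses the surrogate gradient ∇_t rather than the raw loss.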
For this section and the result of Theorem 6, it suffices to set z t = 1 for all t. The guarantee of this ONS variant is as follows. Lemma 21. Suppose∥ℓ t −m t ∥ 2 ≤z t ,z t is non-decreasing int, and 64ηDz T ≤ 1. Then Algorithm 34 ensures for any u∈ Ω (with r being the rank ofL T = P T t=1 (ℓ t −m t )(ℓ t −m t ) ⊤ ) T X t=1 ⟨w t −u,ℓ t ⟩ ≤O r ln(Tz T /z 1 ) η +z 1 ∥u∥ 2 +D(z T −z 1 ) +η T X t=1 ⟨u,ℓ t −m t ⟩ 2 ! − 11η T X t=1 ⟨w t ,ℓ t −m t ⟩ 2 . Proof. By Lemma 18 and Lemma 22, we have: T X t=1 ⟨w t −u,∇ t ⟩≤ T X t=1 w t −w ′ t+1 ,∇ t −m t +D ψt (u,w ′ t )−D ψt (u,w ′ t+1 ) ≤ 2 T X t=1 ∥∇ t −m t ∥ 2 A −1 t +D ψ 1 (u,w ′ 1 ) + T−1 X t=1 D ψ t+1 (u,w ′ t+1 )−D ψt (u,w ′ t+1 ) 157 ≤O r ln(Tz T /z 1 ) η +ηz 2 1 ∥u∥ 2 2 + T−1 X t=1 D ψ t+1 (u,w ′ t+1 )−D ψt (u,w ′ t+1 ). Note that ηz 2 1 ∥u∥ 2 2 ≤ηDz T z 1 ∥u∥ 2 =O (z 1 ∥u∥ 2 ). Moreover, T−1 X t=1 D ψ t+1 (u,w ′ t+1 )−D ψt (u,w ′ t+1 ) ≤ η 2 T−1 X t=1 u−w ′ t+1 ,∇ t −m t 2 +O ηD 2 T−1 X t=1 (z 2 t+1 −z 2 t ) ! ≤η T−1 X t=1 ⟨u−w t ,∇ t −m t ⟩ 2 +η T−1 X t=1 w t −w ′ t+1 ,∇ t −m t 2 +O ηD 2 z T T−1 X t=1 (z t+1 −z t ) ! ≤ 2η T X t=1 ⟨u,∇ t −m t ⟩ 2 + 2η T X t=1 ⟨w t ,∇ t −m t ⟩ 2 +O r ln(Tz T /z 1 ) η +O (D(z T −z 1 )) (by 0≤η w t −w ′ t+1 ,∇ t −m t ≤ 3ηDz t =O(1) and Lemma 22) ≤ 5η T X t=1 ⟨u,ℓ t −m t ⟩ 2 + 5η T X t=1 ⟨w t ,ℓ t −m t ⟩ 2 +O r ln(Tz T /z 1 ) η +O (D(z T −z 1 )). (by definition of ∇ t and 32η|⟨w t ,ℓ t −m t ⟩|≤ 32ηDz t ≤ 1 2 ) Since c t (w) is convex in w, we have T X t=1 c t (w t )−c t (u) = T X t=1 ⟨w t −u,ℓ t ⟩ + 16η T X t=1 ⟨w t ,ℓ t −m t ⟩ 2 −⟨u,ℓ t −m t ⟩ 2 ≤ T X t=1 ⟨w t −u,∇ t ⟩. Reorganizing terms then finishes the proof. Lemma 22. In Algorithm 34, we have 0 ≤ w t −w ′ t+1 ,∇ t −m t ≤ 2∥∇ t −m t ∥ 2 A −1 t and also P T t=1 ∥∇ t −m t ∥ 2 A −1 t =O r ln(Tz T /z 1 ) η . Proof. For any t, define F x (w) =⟨w,x⟩ +D ψt (w,w ′ t ). Then, we have w t = argmin w∈K F mt (w) and w ′ t+1 = argmin w∈K F ∇t (w). 158 Moreover, ∇ 2 w D ψt (w,w ′ t ) = A t is a constant matrix. Hence, by Lemma 19 with c = 1, 0≤ w t −w ′ t+1 ,∇ t −m t ≤ 2∥∇ t −m t ∥ 2 A −1 t . Define A ′ t =η 4z 2 1 ·I + P t s=1 (∇ s −m s )(∇ s −m s ) ⊤ . Note that∥∇ t −m t ∥ 2 2 ≤ 4∥ℓ t −m t ∥ 2 2 ≤ 4z 2 t . Thus, A t ≽A ′ t . By similar arguments in (Koren and Livni, 2017, Lemma 6), we have T X t=1 ∥∇ t −m t ∥ 2 A −1 t = 1 η T X t=1 tr A −1 t (A ′ t −A ′ t−1 ) ≤ 1 η T X t=1 tr (A ′ t ) −1 (A ′ t −A ′ t−1 ) ≤ 1 η ln |A ′ T | |A ′ 0 | = 1 η ln I + T X t=1 (∇ t −m t )(∇ t −m t ) ⊤ 4z 2 1 =O r ln 1 + P T t=1 ∥ℓt−mt∥ 2 2 rz 2 1 η =O r ln(Tz T /z 1 ) η , where r is rank of P T t=1 (ℓ t −m t )(ℓ t −m t ) ⊤ . To obtain the regret bound in Theorem 6, we instantiate MsMwC-Master with the following set of experts: E ONS = n (η k ,B k ) :∀k = (d k ,s k )∈{−⌈log 2 (dT )⌉,...,⌈log 2 D⌉}× [⌈log 2 T⌉],η k = 1 64·2 d k +s k , B k is Algorithm 34 with z t = 1 for all t, Ω = K∩{w :∥w∥ 2 ≤ 2 d k }, and η = 3η k o (A.6) Proof of Theorem 6. We first assume ∥u∥ 2 > 1 dT , and thus there exists k ⋆ such that η k⋆ ≤ min ( 1 192·2 d k⋆ , r r lnT P T t=1 ⟨u,ℓt−mt⟩ 2 ) ≤ 2η k⋆ , and 2 d k⋆ −1 ≤∥u∥ 2 ≤ 2 d k⋆ . Then by Lemma 21 with 64· 3η k⋆ · 2 d k⋆ ≤ 1: T X t=1 D w k⋆ t −u,ℓ t E ≤O r lnT η k⋆ +∥u∥ 2 +η k⋆ T X t=1 ⟨u,ℓ t −m t ⟩ 2 ! − 33η k⋆ T X t=1 D w k⋆ t ,ℓ t −m t E 2 159 = ˜ O r∥u∥ 2 + v u u t r T X t=1 ⟨u,ℓ t −m t ⟩ 2 − 33η k⋆ T X t=1 D w k⋆ t ,ℓ t −m t E 2 . 
Next, by Theorem 2 with 32η k D w k t ,ℓ t −m t E ≤ 32η k w k t 2 ≤ 1, P k η k = Θ(dT ), P k η 2 k = Θ(d 2 T 2 ), and P k η 2 k η 2 k⋆ =O(d 2 T 2 /η 2 k⋆ ) =O(d 2 D 2 T 4 ), we have: T X t=1 ⟨w t −u,ℓ t ⟩ = ˜ O r∥u∥ 2 + v u u t r T X t=1 ⟨u,ℓ t −m t ⟩ 2 + 1 η k⋆ = ˜ O r∥u∥ 2 + v u u t r T X t=1 ⟨u,ℓ t −m t ⟩ 2 . When∥u∥ 2 ≤ 1 dT ≤D (if D< 1 dT , we achieve constant regret by picking w t arbitrarily), pick any u ′ ∈K such that∥u ′ ∥ 2 = 1 dT (this is possible since 0∈K). Then: T X t=1 ⟨w t −u,ℓ t ⟩ = T X t=1 w t −u ′ ,ℓ t + T X t=1 u ′ −u,ℓ t ≤ ˜ O r u ′ 2 + v u u t r T X t=1 ⟨u ′ ,ℓ t −m t ⟩ 2 + u ′ 2 T = ˜ O (1). This finishes the proof. A.5.2 Combining Gradient Descent For gradient descent type of bound, we use the optimistic gradient descent algorithm (OptGD) as the base algorithm, which achieves the following regret bound with learning rate η (see (Rakhlin and Sridharan, 2013b, Lemma 3)): T X t=1 ⟨w t −u,ℓ t ⟩≤ ∥u∥ 2 2 η +η T X t=1 ∥ℓ t −m t ∥ 2 2 . 160 To obtain the regret bound in Theorem 7, it suffices to instantiate MsMwC-Master with the following set of experts: E GD = n (η k ,B k ) :∀k = (d k ,s k )∈{−⌈log 2 T⌉,...,⌈log 2 D⌉}× [⌈log 2 T⌉],η k = 1 32·2 d k +s k , B k is OptGD with decision set Ω = K∩{w :∥w∥ 2 ≤ 2 d k }, and η = 4 d k η k o . (A.7) Proof of Theorem 7. We first assume ∥u∥ 2 > 1 T , so that there exists k ⋆ such that η k⋆ ≤ min 1 64· 2 d k⋆ , 1 ∥u∥ 2 q P T t=1 ∥ℓ t −m t ∥ 2 2 ≤ 2η k⋆ , and 2 d k⋆ −1 ≤∥u∥ 2 ≤ 2 d k⋆ . By the regret guarantee of OptGD, we have: T X t=1 D w k⋆ t −u,ℓ t E ≤ ∥u∥ 2 2 4 d k⋆ η k⋆ + 4 d k⋆ η k⋆ T X t=1 ∥ℓ t −m t ∥ 2 2 =O ∥u∥ 2 +∥u∥ 2 v u u t T X t=1 ∥ℓ t −m t ∥ 2 2 . Next, by Theorem 2 with 32η k D w k t ,ℓ t −m t E ≤ 32η k w k t 2 ≤ 1, P k η k = Θ(T), P k η 2 k = Θ(T 2 ), and P k η 2 k η 2 k⋆ =O(D 2 T 3 ), we have: T X t=1 ⟨w t −u,ℓ t ⟩ = ˜ O ∥u∥ 2 +∥u∥ 2 v u u t T X t=1 ∥ℓ t −m t ∥ 2 2 +η k⋆ T X t=1 D w k⋆ t ,ℓ t −m t E 2 = ˜ O ∥u∥ 2 +∥u∥ 2 v u u t T X t=1 ∥ℓ t −m t ∥ 2 2 . When∥u∥ 2 ≤ 1 T , pick any u ′ ∈K such that∥u ′ ∥ 2 = 1 T , then: T X t=1 ⟨w t −u,ℓ t ⟩ = T X t=1 w t −u ′ ,ℓ t + T X t=1 u ′ −u,ℓ t ≤ ˜ O ∥u∥ 2 +∥u∥ 2 v u u t T X t=1 ∥ℓ t −m t ∥ 2 2 +∥u∥ 2 = ˜ O (1). 161 Algorithm 35: Optimistic AdaGrad 1 Parameters: learning rate η,η ′ > 0, w ′ 1 = ⃗ 0. 2 Define: c t (w) =⟨w,ℓ t ⟩ + 16η ′ ⟨w,ℓ t −m t ⟩ 2 ∇ t =∇c t (w t ) =ℓ t + 32η ′ ⟨w t ,ℓ t −m t ⟩ (ℓ t −m t ) ψ t (w) = 1 2 ∥w∥ 2 At , where A t ≜ 1 η (I +G t ) 1/2 , G t = t X s=1 (∇ t −m t )(∇ t −m t ) ⊤ . 3 for t = 1,..., T do 4 Receive prediction m t . 5 Compute w t = argmin w∈Ω nD w, P t−1 s=1 ∇ s +m t E +ψ t−1 (w) o . 6 Play w t and receive ℓ t . This finishes the proof. A.5.3 Combining AdaGrad We first introduce the base algorithm Algorithm 35, which is a variant of the AdaGrad algorithm with predictor m t incorporated. It guarantees the following. Theorem 30. Define A ′ t = (I + P t s=1 (ℓ s −m s )(ℓ s −m s ) ⊤ ) 1/2 . Assume 64η ′ |⟨w t ,ℓ t −m t ⟩|≤ 1 for all t, and η ′ ≤η/∥u∥ 2 A ′ T . Algorithm 35 ensures for any u∈ Ω , T X t=1 ⟨w t −u,ℓ t ⟩ =O ηtr L 1/2 T + u ⊤ (I +L T ) 1/2 u η ! − 16η ′ T X t=1 ⟨w t ,ℓ t −m t ⟩ 2 . Proof. For anyt, define F x (w) =⟨w,x⟩+ψ t−1 (w). Note thatw t = argmin w∈K F P t−1 s=1 ∇s+mt (w), and denote w ′ t = argmin w∈K F P t s=1 ∇s (w). Moreover,∇ 2 ψ t−1 (w) =A t−1 is a constant matrix. 
Hence, by Lemma 19 with c = 1,⟨w t −w ′ t ,∇ t −m t ⟩≤ 2∥∇ t −m t ∥ 2 A −1 t−1 , and for any u∈ Ω we have: T X t=1 ⟨w t −u,∇ t ⟩ = T X t=1 w t −w ′ t ,∇ t −m t + w t −w ′ t ,m t + w ′ t −u,∇ t ≤ 2 T X t=1 ∥∇ t −m t ∥ 2 A −1 t−1 + T X t=1 w t −w ′ t ,m t + w ′ t −u,∇ t . 162 We prove by induction that for any τ,u∈ Ω : τ X t=1 w t −w ′ t ,m t + w ′ t ,∇ t ≤ τ X t=1 ⟨u,∇ t ⟩ +ψ τ−1 (u). When τ = 1, it suffices to show: w 1 −w ′ 1 ,m 1 + w ′ 1 ,∇ 1 ≤ w ′ 1 ,∇ 1 +ψ 0 (w ′ 1 ). This is clearly true since⟨w 1 ,m 1 ⟩≤⟨w 1 ,m 1 ⟩ +ψ 0 (w 1 )≤⟨w ′ 1 ,m 1 ⟩ +ψ 0 (w ′ 1 ). Now suppose the result is true for τ =T, then for τ =T + 1: T +1 X t=1 w t −w ′ t ,m t + w ′ t ,∇ t ≤ * w T +1 , T X t=1 ∇ t + +ψ T−1 (w T +1 ) + w T +1 −w ′ T +1 ,m T +1 + w ′ T +1 ,∇ T +1 (induction step for τ =T with u =w T +1 ) ≤ * w ′ T +1 , T X t=1 ∇ t +m T +1 + +ψ T (w ′ T +1 )− w ′ T +1 ,m T +1 + w ′ T +1 ,∇ T +1 (by ψ T−1 (w)≤ψ T (w), and F P T t=1 ∇t+m T+1 (w T +1 )≤F P T t=1 ∇t+m T+1 (w ′ T +1 )) = * w ′ T +1 , T +1 X t=1 ∇ t + +ψ T (w ′ T +1 )≤ * u, T +1 X t=1 ∇ t + +ψ T (u), for any u∈ Ω by the definition of w ′ T +1 . Therefore, by (Cutkosky, 2020, Theorem 7), we have: T X t=1 ⟨w t −u,∇ t ⟩≤ 2 T X t=1 ∥∇ t −m t ∥ 2 A −1 t−1 +ψ T−1 (u)≤ 2 T X t=1 ∥∇ t −m t ∥ 2 A −1 t−1 + u ⊤ (I +G T ) 1/2 u η =O ηtr G 1/2 T + u ⊤ (I +G T ) 1/2 u η ! =O ηtr L 1/2 T + u ⊤ (I +L T ) 1/2 u η ! . 163 The reasoning of the last equality is as follows: note that∇ t −m t = (1 + 32η ′ ⟨w t ,ℓ t −m t ⟩)(ℓ t −m t ) has the same direction as ℓ t −m t . Thus by assumption on η ′ ,G t ≼ 3 2 L t . Finally, note that c t is a convex function. Therefore, P T t=1 c t (w t )−c t (u)≤ P T t=1 ⟨w t −u,∇ t ⟩. Reorganizing terms, we get: T X t=1 ⟨w t −u,ℓ t ⟩ ≤O ηtr L 1/2 T + u ⊤ (I +L T ) 1/2 u η ! − 16η ′ T X t=1 ⟨w t ,ℓ t −m t ⟩ 2 + 16η ′ T X t=1 ⟨u,ℓ t −m t ⟩ 2 . By η ′ ≤η/∥u∥ 2 A ′ T (note that∥u∥ 2 A ′ T =u ⊤ (I +L T ) 1/2 u), we have: η ′ T X t=1 ⟨u,ℓ t −m t ⟩ 2 ≤η ′ T X t=1 ∥u∥ 2 A ′ T ∥ℓ t −m t ∥ 2 (A ′ t−1 ) −1 =O ηtr L 1/2 T . Therefore, T X t=1 ⟨w t −u,ℓ t ⟩ =O ηtr L 1/2 T + u ⊤ (I +L T ) 1/2 u η ! − 16η ′ T X t=1 ⟨w t ,ℓ t −m t ⟩ 2 . Now we instantiate MsMwC-Master with the following set of experts to obtain the desired bound in Theorem 8. E AG = n (η k ,B k ) :∀k = (d k ,t k ,l k )∈S AG , η k = 1 64·2 d k +t k ,B k is Algorithm 35 with decision set Ω = K∩{w :∥w∥ 2 ≤ 2 d k }, η ′ = 2η k and η = 2 l k +1 η k o , (A.8) 164 whereS AG ={−⌈log 2 T⌉,...,⌈log 2 D⌉}× [⌈log 2 (dT )⌉]×{−⌈log 2 T⌉,...,⌈log 2 (2D 2 T )⌉}. Proof of Theorem 8. First assume∥u∥ 2 > 1 T , so that there exists k ⋆ such that: 2 d k⋆ −1 ≤∥u∥ 2 ≤ 2 d k⋆ ,η k⋆ ≤ min 1 128· 2 d k⋆ , 1 r ∥u∥ 2 (I+L T ) 1/2 tr L 1/2 T ≤ 2η k⋆ , and 2 l k −1 ≤ u ⊤ (I +L T ) 1/2 u ≤ 2 l k . Note that 64η ′ | D w k⋆ t ,ℓ t −m t E | ≤ 64η ′ w k⋆ t 2 ≤ 1, and ∥u∥ 2 A ′ T η ′ ≤ 2 l k⋆ · 2η k⋆ =η. Hence, by the regret guarantee of Algorithm 35, we have: T X t=1 D w k⋆ t −u,ℓ t E ≤O 2 l k +1 η k⋆ tr L 1/2 T + u ⊤ (I +L T ) 1/2 u 2 l k +1 η k⋆ ! − 32η k⋆ T X t=1 D w k⋆ t ,ℓ t −m t E 2 ≤O ∥u∥ + r (u ⊤ (I +L T ) 1/2 u)tr L 1/2 T ! − 32η k⋆ T X t=1 D w k⋆ t ,ℓ t −m t E 2 . Next, by Theorem 2 with 32η k D w k t ,ℓ t −m t E ≤ 32η k w k t 2 ≤ 1, P k η k = Θ(T), P k η 2 k = Θ(T 2 ), and P k η 2 k η 2 k⋆ =O(d 2 D 2 T 4 ), we have: T X t=1 ⟨w t −u,ℓ t ⟩ = ˜ O ∥u∥ 2 + r (u ⊤ (I +L T ) 1/2 u)tr L 1/2 T ! . When∥u∥ 2 ≤ 1 T , pick any u ′ ∈K such that∥u ′ ∥ 2 = 1 T , then: T X t=1 ⟨w t −u,ℓ t ⟩ = T X t=1 w t −u ′ ,ℓ t + T X t=1 u ′ −u,ℓ t ≤ ˜ O u ′ 2 + r (u ′⊤ (I +L T ) 1/2 u ′ )tr L 1/2 T + u ′ 2 ! = ˜ O (1). 
This finishes the proof. 165 Algorithm 36: MetaGrad 1 Parameters: learning rate η> 0, w ′ 1 = ⃗ 0. 2 Define: c t (w) =⟨w,ℓ t ⟩ + 16η⟨w− ¯ w t ,ℓ t −m t ⟩ 2 ∇ t =∇c t (w t ) =ℓ t + 32η⟨w t − ¯ w t ,ℓ t −m t ⟩ (ℓ t −m t ) ψ t (w) = 1 2 ∥w∥ 2 At , where A t ≜η 8I + t−1 X s=1 (∇ s −m s )(∇ s −m s ) ⊤ ! . 3 for t = 1,..., T do 4 Receive prediction m t . 5 Play w t = argmin w∈K {⟨w,m t ⟩ +D ψt (w,w ′ t )}. 6 Receive ℓ t and ¯ w t . 7 Compute w ′ t+1 = argmin w∈K {⟨w,∇ t ⟩ +D ψt (w,w ′ t )}. A.5.4 Combining MetaGrad’s base algorithm We first present the MetaGrad base algorithm (Algorithm 36) and its regret guarantee below (note that the algorithm receives ¯ w t at the end of round t, which will eventually be set to the master’s prediction in our construction). Lemma 23. Assume 64ηD≤ 1. Algorithm 36 ensures: T X t=1 ⟨w t −u,ℓ t ⟩≤O ∥u∥ 2 + r lnT η +η T X t=1 ⟨u− ¯ w t ,ℓ t −m t ⟩ 2 ! − 10η T X t=1 ⟨w t − ¯ w t ,ℓ t −m t ⟩ 2 . Proof. By Lemma 18 and Lemma 22 with z t = 1 for all t, we have: T X t=1 ⟨w t −u,∇ t ⟩ ≤ T X t=1 w t −w ′ t+1 ,∇ t −m t +D ψt (u,w ′ t )−D ψt (u,w ′ t+1 ) ≤ 2 T X t=1 ∥∇ t −m t ∥ 2 A −1 t +D ψ 1 (u,w ′ 1 ) + T−1 X t=1 D ψ t+1 (u,w ′ t+1 )−D ψt (u,w ′ t+1 ) ≤O r lnT η +η∥u∥ 2 2 + T−1 X t=1 D ψ t+1 (u,w ′ t+1 )−D ψt (u,w ′ t+1 ). 166 Note that η∥u∥ 2 2 =O (∥u∥ 2 ). Moreover, T−1 X t=1 D ψ t+1 (u,w ′ t+1 )−D ψt (u,w ′ t+1 ) = η 2 T−1 X t=1 u−w ′ t+1 ,∇ t −m t 2 ≤η T−1 X t=1 ⟨u−w t ,∇ t −m t ⟩ 2 +η T−1 X t=1 w t −w ′ t+1 ,∇ t −m t 2 ≤ 3η T−1 X t=1 ⟨u−w t ,ℓ t −m t ⟩ 2 +O r lnT η , where the last step is by 0≤η w t −w ′ t+1 ,∇ t −m t ≤ 3ηD =O(1) and Lemma 22. Since c t (w) is convex in w, we have P T t=1 c t (w t )−c t (u)≤ P T t=1 ⟨w t −u,∇ t ⟩. Re-organzing terms, we have: T X t=1 ⟨w t −u,ℓ t ⟩≤O r lnT η +∥u∥ 2 + 3η T X t=1 ⟨u−w t ,ℓ t −m t ⟩ 2 + 16η T X t=1 ⟨u− ¯ w t ,ℓ t −m t ⟩ 2 − 16η T X t=1 ⟨w t − ¯ w t ,ℓ t −m t ⟩ 2 ≤O r lnT η +∥u∥ 2 +η T X t=1 ⟨u− ¯ w t ,ℓ t −m t ⟩ 2 ! − 10η T X t=1 ⟨w t − ¯ w t ,ℓ t −m t ⟩ 2 . Then, we instantiate MsMwC-Master with the following set of experts to obtain the desired bound in Theorem 9. E MG = n (η k ,B k ) :∀k∈ [⌈log 2 (2DT )⌉],η k = 1 64D·2 k , B k is Algorithm 36 with ¯ w t =w t for all t and η = 4η k o . (A.9) 167 Proof of Theorem 9. There exists k ⋆ such that η k⋆ ≤ min ( 1 256D , r r lnT P T t=1 ⟨u−wt,ℓt−mt⟩ 2 ) ≤ 2η k⋆ . Then by Lemma 23 with 64· 4η k⋆ D≤ 1: T X t=1 D w k⋆ t −u,ℓ t E ≤O r lnT η k⋆ +∥u∥ 2 +η k⋆ T X t=1 ⟨u−w t ,ℓ t −m t ⟩ 2 ! − 40η k⋆ T X t=1 D w k⋆ t −w t ,ℓ t −m t E 2 = ˜ O rD + v u u t r T X t=1 ⟨u−w t ,ℓ t −m t ⟩ 2 − 40η k⋆ T X t=1 D w k⋆ t −w t ,ℓ t −m t E 2 . Next, by Theorem 2 with 32η k |g t,k −h t,k | = 32η k D w k t −w t ,ℓ t −m t E ≤ 64η k D≤ 1, P k η k = Θ(1/D), P k η 2 k = Θ(1/D 2 ), and P k η 2 k η 2 k⋆ =O(D 4 T 2 ), we have: T X t=1 ⟨w t −u,ℓ t ⟩ = ˜ O rD + v u u t r T X t=1 ⟨u−w t ,ℓ t −m t ⟩ 2 + 1 η k⋆ = ˜ O rD + v u u t r T X t=1 ⟨u−w t ,ℓ t −m t ⟩ 2 . This completes the proof. 168 Appendix B Omitted Details in Chapter 3 B.1 Lemmas for log-barrier OMD In this section we establish some useful lemmas for update rules (3.1) and (3.2) with log-barrier regularizer, which are used in the proofs of other theorems. We start with some definitions. Definition 2. For any h∈ R K , define norm ∥h∥ t,w = q h ⊤ ∇ 2 ψ t (w)h = r P K i=1 1 η t,i h 2 i w 2 i and its dual norm∥h∥ ∗ t,w = q h ⊤ ∇ −2 ψ t (w)h = q P K i=1 η t,i w 2 i h 2 i . For some radius r > 0, define ellipsoid E t,w (r) = n u∈R K :∥u−w∥ t,w ≤r o . Lemma 24. 
If w ′ ∈ E t,w (1) and η t,i ≤ 1 81 for all i, then w ′ i ∈ h 1 2 w i , 3 2 w i i for all i, and also 0.9∥h∥ t,w ≤∥h∥ t,w ′≤ 1.2∥h∥ t,w for any h∈R K . Proof. w ′ ∈E t,w (1) implies P K i=1 1 η t,i (w ′ i −w i ) 2 w 2 i ≤ 1. Thus for everyi, we have |w ′ i −w i | w i ≤ √ η t,i ≤ 1 9 , im- plying w ′ i ∈ h 8 9 w i , 10 9 w i i ⊂ h 1 2 w i , 3 2 w i i . Therefore,∥h∥ t,w ′ = r P K i=1 1 η t,i h 2 i w ′2 i ≥ r P K i=1 1 η t,i h 2 i ( 10 9 w i) 2 = 0.9∥h∥ t,w . Similarly, we have∥h∥ t,w ′≤ 1.2∥h∥ t,w . Lemma 25. Let w t ,w ′ t+1 follow (3.5) and (3.6) where ψ t is the log-barrier with η t,i ≤ 1 81 for all i. If ˆ ℓ t −m t +a t ∗ t,wt ≤ 1 3 , then w ′ t+1 ∈E t,wt (1). 169 Proof. Define F t (w) =⟨w,m t ⟩ +D ψt (w,w ′ t ) and F ′ t+1 (w) =⟨w, ˆ ℓ t +a t ⟩ +D ψt (w,w ′ t ). Then by definition we have w t = argmin w∈Ω F t (w) andw ′ t+1 = argmin w∈Ω F ′ t+1 (w). To showw ′ t+1 ∈E t,wt (1), it suffices to show that for all u on the boundary ofE t,wt (1), F ′ t+1 (u)≥F ′ t+1 (w t ). Indeed, using Taylor’s theorem, for any u∈∂E t,wt (1), there is an ξ on the line segment between w t and u such that (let h≜u−w t ) F ′ t+1 (u) =F ′ t+1 (w t ) +∇F ′ t+1 (w t ) ⊤ h + 1 2 h ⊤ ∇ 2 F ′ t+1 (ξ)h =F ′ t+1 (w t ) + ( ˆ ℓ t −m t +a t ) ⊤ h +∇F t (w t ) ⊤ h + 1 2 h ⊤ ∇ 2 ψ t (ξ)h ≥F ′ t+1 (w t ) + ( ˆ ℓ t −m t +a t ) ⊤ h + 1 2 ∥h∥ 2 t,ξ (by the optimality of w t ) ≥F ′ t+1 (w t ) + ( ˆ ℓ t −m t +a t ) ⊤ h + 1 2 × 0.9 2 ∥h∥ 2 t,wt (by Lemma 24) ≥F ′ t+1 (w t )− ˆ ℓ t −m t +a t ∗ t,wt ∥h∥ t,wt + 1 3 ∥h∥ 2 t,wt =F ′ t+1 (w t )− ˆ ℓ t −m t +a t ∗ t,wt + 1 3 (∥h∥ t,wt = 1) ≥F ′ t+1 (w t ). (by the assumption) Lemma 26. Let w t ,w ′ t+1 follow (3.5) and (3.6) where ψ t is the log-barrier with η t,i ≤ 1 81 for all i. If ˆ ℓ t −m t +a t ∗ t,wt ≤ 1 3 , then w ′ t+1 −w t t,wt ≤ 3 ˆ ℓ t −m t +a t ∗ t,wt . Proof. Define F t (w) and F ′ t+1 (w) to be the same as in Lemma 25. Then we have F ′ t+1 (w t )−F ′ t+1 (w ′ t+1 ) = (w t −w ′ t+1 ) ⊤ ( ˆ ℓ t −m t +a t ) +F t (w t )−F t (w ′ t+1 ) ≤ (w t −w ′ t+1 ) ⊤ ( ˆ ℓ t −m t +a t ) (optimality of w t ) ≤ w t −w ′ t+1 t,wt ˆ ℓ t −m t +a t ∗ t,wt . (B.1) 170 On the other hand, for some ξ on the line segment between w t and w ′ t+1 , we have by Taylor’s theorem and the optimality of w ′ t+1 , F ′ t+1 (w t )−F ′ t+1 (w ′ t+1 ) =∇F ′ t+1 (w ′ t+1 ) ⊤ (w t −w ′ t+1 ) + 1 2 (w t −w ′ t+1 ) ⊤ ∇ 2 F ′ t+1 (ξ)(w t −w ′ t+1 ) ≥ 1 2 w t −w ′ t+1 2 t,ξ . (B.2) Since the condition in Lemma 25 holds, w ′ t+1 ∈E t,wt (1), and thus ξ∈E t,wt (1). Using again Lemma 24, we have 1 2 w t −w ′ t+1 2 t,ξ ≥ 1 3 w t −w ′ t+1 2 t,wt . (B.3) Combining (B.1), (B.2), and (B.3), we have w t −w ′ t+1 t,wt ˆ ℓ t −m t +a t ∗ t,wt ≥ 1 3 w t −w ′ t+1 2 t,wt , which leads to the stated inequality. Lemma 27. When the three conditions in Theorem 12 hold, we have ˆ ℓ t −m t +a t ∗ t,wt ≤ 1 3 for either a t,i = 6η t,i w t,i ( ˆ ℓ t,i −m t,i ) 2 or a t,i = 0. Proof. For a t,i = 6η t,i w t,i ( ˆ ℓ t,i −m t,i ) 2 , we have ˆ ℓ t −m t +a t ∗2 t,wt = K X i=1 η t,i w 2 t,i ˆ ℓ t,i −m t,i + 6η t,i w t,i ( ˆ ℓ t,i −m t,i ) 2 2 = K X i=1 η t,i w 2 t,i ( ˆ ℓ t,i −m t,i ) 2 + 12η 2 t,i w 3 t,i ( ˆ ℓ t,i −m t,i ) 3 + 36η 3 t,i w 4 t,i ( ˆ ℓ t,i −m t,i ) 4 ≤ K X i=1 η t,i w 2 t,i ( ˆ ℓ t,i −m t,i ) 2 (1 + 36η t,i + 324η 2 t,i ) (condition (ii)) ≤ 2 K X i=1 η t,i w 2 t,i ( ˆ ℓ t,i −m t,i ) 2 (condition (i)) ≤ 2× 1 18 = 1 9 . (condition (iii)) 171 For a t,i = 0, we have ˆ ℓ t −m t +a t ∗2 t,wt = ˆ ℓ t −m t ∗2 t,wt = K X i=1 η t,i w 2 t,i ( ˆ ℓ t,i −m t,i ) 2 ≤ 1 18 < 1 9 . (condition (iii)) Lemma 28. 
If the three conditions in Theorem 12 hold, Broad-OMD satisfies 1 2 w t,i ≤w ′ t+1,i ≤ 3 2 w t,i . Proof. This is a direct application of Lemmas 27, 25, and 24. Lemma 29. For the MAB problem, if the three conditions in Theorem 12 hold, then 1 2 w t,i ≤w ′ t,i ≤ 3 2 w t,i . Proof. It suffices to prove w ′ t ∈E t,wt (1) by Lemma 24. Since we assume that the three conditions in Theorem 12 hold and w t ∈ ∆ K , we have∥m t ∥ ∗ t,wt = q P K i=1 η t,i w 2 t,i m 2 t,i ≤ q 1 162 P K i=1 w 2 t,i ≤ q 1 162 < 1 3 . This implies w ′ t ∈E t,wt (1) by a similar arguments as in the proof of Lemma 25 (one only needs to replace F ′ t+1 (w) there by G(w)≜D ψt (w,w ′ t ) and note that w ′ t = argmin w∈∆ K G(w)). B.2 Omitted details in Section 3.2 Proof of Lemma 2. We first state a useful property used in typical OMD analysis. Let Ω be a convex compact set inR K ,ψ be a convex function on Ω ,w ′ be an arbitrary point in Ω , andx∈R K . If w ∗ = argmin w∈Ω {⟨w,x⟩ +D ψ (w,w ′ )}, then for any u∈ Ω , ⟨w ∗ −u,x⟩≤D ψ (u,w ′ )−D ψ (u,w ∗ )−D ψ (w ∗ ,w ′ ). 172 This is by the first-order optimality condition of w ∗ and direct calculations. Applying this to update rule (3.6) and (3.5), we get ⟨w ′ t+1 −u, ˆ ℓ t ⟩≤D ψt (u,w ′ t )−D ψt (u,w ′ t+1 )−D ψt (w ′ t+1 ,w ′ t ); and ⟨w t −w ′ t+1 ,m t ⟩≤D ψt (w ′ t+1 ,w ′ t )−D ψt (w ′ t+1 ,w t )−D ψt (w t ,w ′ t ) respectively. Therefore, by expanding the instantaneous regret, we have ⟨w t −u, ˆ ℓ t ⟩ =⟨w t −w ′ t+1 , ˆ ℓ t −m t ⟩ +⟨w ′ t+1 −u, ˆ ℓ t ⟩ +⟨w t −w ′ t+1 ,m t ⟩ ≤⟨w t −w ′ t+1 , ˆ ℓ t −m t ⟩ +D ψt (u,w ′ t )−D ψt (u,w ′ t+1 )−D ψt (w ′ t+1 ,w t )−D ψt (w t ,w ′ t ). Proof of Theorem 10. Applying Lemma 2, we have T X t=1 ⟨w t −u, ˆ ℓ t ⟩≤ T X t=1 D ψt (u,w ′ t )−D ψt (u,w ′ t+1 ) +⟨w t −w ′ t+1 , ˆ ℓ t −m t ⟩−A t ≤ K X i=1 ln w ′ 1,i u i η + T X t=1 ⟨w t −w ′ t+1 , ˆ ℓ t −m t ⟩−A t . For the second term, using Lemma 27 and 26 we bound⟨w t −w ′ t+1 , ˆ ℓ t −m t ⟩ by w t −w ′ t+1 t,wt ˆ ℓ t −m t ∗ t,wt ≤ 3 ˆ ℓ t −m t ∗2 t,wt = 3η K X i=1 w 2 t,i ( ˆ ℓ t,i −m t,i ) 2 173 Finally we lower bound A t for the MAB case. Note h(y) =y− 1− lny≥ (y−1) 2 6 for y∈ [ 1 2 , 2]. By Lemma 28 and 29, w ′ t+1,i w t,i and w t,i w ′ t,i both belong to [ 1 2 , 2]. Therefore, A t =D ψt (w ′ t+1 ,w t ) +D ψt (w t ,w ′ t ) = 1 η K X i=1 h w ′ t+1,i w t,i ! +h w t,i w ′ t,i !! ≥ 1 6η K X i=1 (w ′ t+1,i −w t,i ) 2 w 2 t,i + (w t,i −w ′ t,i ) 2 w ′2 t,i ! ≥ 1 24η K X i=1 (w ′ t+1,i −w t,i ) 2 w 2 t,i + (w t,i −w ′ t,i ) 2 w 2 t−1,i ! , and T X t=1 A t ≥ 1 24η T X t=2 K X i=1 (w ′ t,i −w t−1,i ) 2 w 2 t−1,i + T X t=2 K X i=1 (w t,i −w ′ t,i ) 2 w 2 t−1,i ≥ 1 48η T X t=2 K X i=1 (w t,i −w t−1,i ) 2 w 2 t−1,i . Proof of Corollary 11. We first verify the three conditions in Theorem 10: η≤ 1 162 by assumption; w t,i ˆ ℓ t,i −m t,i = (ℓ t,i −ℓ α i (t),i )1{i t =i} ≤ 2< 3;η P K i=1 w 2 t,i ( ˆ ℓ t,i −m t,i ) 2 =ηw 2 t,it ( ˆ ℓ t,it −m t,it ) 2 ≤ 9 162 = 1 18 . Let u = 1− 1 T e i ∗ + 1 T w ′ 1 , which guarantees w ′ 1,i u i ≤ T. By Theorem 10 and some rearrangement, we have T X t=1 ⟨w t −e i ∗, ˆ ℓ t ⟩≤ K lnT η + 3η T X t=1 K X i=1 w 2 t,i ( ˆ ℓ t,i −m t,i ) 2 − T X t=1 A t +B, where B ≜ 1 T P T t=1 ⟨−e i ∗ +w ′ 1 , ˆ ℓ t ⟩. To get the stated bound, just note that E[B] =O(1), and replace P T t=1 P K i=1 w 2 t,i ( ˆ ℓ t,i −m t,i ) 2 by the upper bound at (3.4) and A t by the lower bound in Theorem 10. 174 B.3 Omitted details in Section 3.3 Proof of Lemma 3. We first state a useful property used in typical OMD analysis. 
Let Ω be a convex compact set inR K ,ψ be a convex function on Ω ,w ′ be an arbitrary point in Ω , andx∈R K . If w ∗ = argmin w∈Ω {⟨w,x⟩ +D ψ (w,w ′ )}, then for any u∈ Ω , ⟨w ∗ −u,x⟩≤D ψ (u,w ′ )−D ψ (u,w ∗ )−D ψ (w ∗ ,w ′ ). This is by the first-order optimality condition of w ∗ and direct calculations. Applying this to update rule (3.6) we have ⟨w ′ t+1 −u, ˆ ℓ t +a t ⟩≤D ψt (u,w ′ t )−D ψt (u,w ′ t+1 )−D ψt (w ′ t+1 ,w ′ t ); (B.4) while applying it to update rule (3.5) and picking u =w ′ t+1 we have ⟨w t −w ′ t+1 ,m t ⟩≤D ψt (w ′ t+1 ,w ′ t )−D ψt (w ′ t+1 ,w t )−D ψt (w t ,w ′ t ). (B.5) Now we bound the instantaneous regret as follows: ⟨w t −u, ˆ ℓ t ⟩ =⟨w t −u, ˆ ℓ t +a t ⟩−⟨w t ,a t ⟩ +⟨u,a t ⟩ =⟨w t −w ′ t+1 , ˆ ℓ t +a t ⟩−⟨w t ,a t ⟩ +⟨w ′ t+1 −u, ˆ ℓ t +a t ⟩ +⟨u,a t ⟩ =⟨w t −w ′ t+1 , ˆ ℓ t +a t −m t ⟩−⟨w t ,a t ⟩ +⟨w ′ t+1 −u, ˆ ℓ t +a t ⟩ +⟨w t −w ′ t+1 ,m t ⟩ +⟨u,a t ⟩ ≤D ψt (u,w ′ t )−D ψt (u,w ′ t+1 )−D ψt (w ′ t+1 ,w t )−D ψt (w t ,w ′ t ) +⟨u,a t ⟩, (B.6) 175 where last inequality is by the condition⟨w t −w ′ t+1 , ˆ ℓ t +a t −m t ⟩−⟨w t ,a t ⟩≤ 0, Eq. (B.4), and Eq. (B.5). Proof of Theorem 12. We first prove Eq. (3.7) holds: by Lemmas 27 and 26, we have ⟨w t −w ′ t+1 , ˆ ℓ t −m t +a t ⟩≤ w t −w ′ t+1 t,wt ˆ ℓ t −m t +a t ∗ t,wt ≤ 3 ˆ ℓ t −m t +a t ∗2 t,wt ≤ 3 K X i=1 η t,i w 2 t,i ( ˆ ℓ t,i −m t,i ) 2 (1 + 36η t,i + 324η 2 t,i ) ≤ 6 K X i=1 η t,i w 2 t,i ( ˆ ℓ t,i −m t,i ) 2 =⟨w t ,a t ⟩, where the last two inequalities are by the same calculations done in the proof of Lemma 27. Since Eq. (3.7) holds, using Lemma 3 we have (ignoring non-positive terms−A t ’s), T X t=1 ⟨w t −u, ˆ ℓ t ⟩≤ T X t=1 D ψt (u,w ′ t )−D ψt (u,w ′ t+1 ) + T X t=1 ⟨u,a t ⟩ ≤D ψ 1 (u,w ′ 1 ) + T X t=1 D ψ t+1 (u,w ′ t+1 )−D ψt (u,w ′ t+1 ) + T X t=1 ⟨u,a t ⟩. (B.7) In the last inequality, we add a term D ψ T+1 (u,w ′ T +1 )≥ 0 artificially. As mentioned, ψ T +1 , defined in terms of η T +1,i , never appears in the Broad-OMD algorithm. We can simply pick any η T +1,i > 0 for all i here. This is just to simplify some analysis later. The first term in (B.7) can be bounded by the optimality of w ′ 1 : D ψ 1 (u,w ′ 1 ) =ψ 1 (u)−ψ 1 (w ′ 1 )−⟨∇ψ 1 (w ′ 1 ),u−w ′ 1 ⟩ ≤ψ 1 (u)−ψ 1 (w ′ 1 ) = K X i=1 1 η 1,i ln w ′ 1,i u i . 176 The second term, by definition, is T X t=1 K X i=1 1 η t+1,i − 1 η t,i ! h u i w ′ t+1,i ! . Plugging the above two terms into (B.7) finishes the proof. Proof of Corollary 13. We first check the three conditions in Theorem 12 under our choice of η t,i and ˆ ℓ t,i : η t,i =η = 1 162 ; w t,i | ˆ ℓ t,i −m t,i | =|ℓ t,i −m t,i |1{i∈b t }≤ 2< 3; P K i=1 η t,i w 2 t,i ( ˆ ℓ t,i −m t,i ) 2 = 1 162 P K i=1 (ℓ t,i −m t,i ) 2 1{i =i t }≤ 4 162 < 1 18 . Applying Theorem 12 we then have T X t=1 ⟨w t −u, ˆ ℓ t ⟩≤ K X i=1 ln w ′ 1,i u i η + T X t=1 ⟨u,a t ⟩. As mentioned, if we let u = e i ∗, then ln w ′ 1,i u i becomes infinity for those i̸= i ∗ . Instead, we let u = 1− 1 T e i ∗ + 1 T w ′ 1 . With this choice ofu, we have w ′ 1,i u i ≤ w ′ 1,i 1 T w ′ 1,i =T. Pluggingu into the above inequality and rearranging, we get T X t=1 ⟨w t −e i ∗, ˆ ℓ t ⟩≤ K lnT η + T X t=1 ⟨e i ∗,a t ⟩ +B, (B.8) where B≜ 1 T P T t=1 ⟨−e i ∗ +w ′ 1 , ˆ ℓ t +a t ⟩. Now note that E it [a t,i ] = 6η(ℓ t,i −m t,i ) 2 =O(η) and E it [ ˆ ℓ t,i ] = ℓ t,i =O(1) for all i. Thus, E[B] = E h 1 T P T t=1 ⟨−e i ∗ +w ′ 1 ,E it [ ˆ ℓ t +a t ]⟩ i ≤ E h 1 T P T t=1 ∥−e i ∗ +w ′ 1 ∥ 1 E it [ ˆ ℓ t +a t ] ∞ i = O(1). 
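As a brief aside before the final step of the proof of Corollary 13: the updates (3.5) and (3.6) with the log-barrier regularizer have no closed form, but on the probability simplex each reduces to a one-dimensional search over the Lagrange multiplier of the equality constraint, since the optimality condition reads 1/w_i = 1/w'_i + eta_i (c_i - lambda) for the linearized loss c. The sketch below is illustrative rather than the thesis code (any clipping of the simplex is ignored, and the losses and learning rates in the example are made up).

# A hedged sketch of one log-barrier OMD step on the simplex, i.e. computing
#   argmin_{w in Delta_K}  <w, c> + D_psi(w, w_prev),   psi(w) = sum_i (1/eta_i) ln(1/w_i).
# The KKT system gives 1/w_i = 1/w_prev_i + eta_i * (c_i - lam), and lam is found
# by bisection so that the coordinates sum to one.
import numpy as np

def log_barrier_omd_step(w_prev, c, eta, iters=80):
    def w_of(lam):
        denom = 1.0 / w_prev + eta * (c - lam)
        return 1.0 / denom                           # positive whenever lam < min_i (c_i + 1/(eta_i w_prev_i))
    hi = np.min(c + 1.0 / (eta * w_prev)) - 1e-12    # the coordinate sum blows up as lam approaches hi
    lo = hi - 1.0
    while w_of(lo).sum() > 1.0:                      # push lo down until the sum drops below 1
        lo = hi - 2.0 * (hi - lo)
    for _ in range(iters):                           # bisection on the simplex multiplier
        mid = 0.5 * (lo + hi)
        if w_of(mid).sum() > 1.0:
            hi = mid
        else:
            lo = mid
    w = w_of(lo)
    return w / w.sum()                               # tiny renormalization for numerical safety

# Example with K = 3 arms, uniform previous iterate, illustrative inputs.
w_prev = np.ones(3) / 3
c = np.array([0.2, -0.1, 0.4])                       # plays the role of the (corrected) loss estimate
eta = np.full(3, 1.0 / 162)
print(log_barrier_omd_step(w_prev, c, eta))

The proof of Corollary 13 resumes below.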
Taking expectation on both sides of (B.8), we have E " T X t=1 ℓ t,it − T X t=1 ℓ t,i ∗ # ≤ K lnT η + 6ηE " T X t=1 (ℓ t,i ∗−m t,i ∗) 2 # +O(1). 177 Lemma 30. Let n i be such that η T +1,i =κ n i η 1,i , i.e., the number of times the learning rate of arm i changes in Broad-OMD+. Then n i ≤ log 2 T, and η t,i ≤ 5η 1,i for all t,i. Proof. Let t 1 ,t 2 ,...,t n i ∈ [T ] be the rounds the learning rate for arm i changes (i.e., η t+1,i =κη t,i for t =t 1 ,...,t n i ). By the algorithm, we have KT≥ 1 ¯ w tn i ,i >ρ tn i ,i > 2ρ t n i −1 ,i >···> 2 n i −1 ρ t 1 ,i = 2 n i K. Therefore, n i ≤ log 2 T. And we have η t,i ≤κ log 2 T η 1,i =e log 2 T lnT η 1,i ≤ 5η 1,i . Proof of Theorem 14. Again, we verify the three conditions stated in Theorem 12. By Lemma 30, η t,i ≤ 5η≤ 5× 1 810 = 1 162 ; also, w t,j ˆ ℓ t,j −m t,j =w t,j (ℓ t,j −m t,j )1{it=j} ¯ w t,j ≤w t,j 2 w t,j(1− 1 T ) ≤ 3 because we assumeT≥ 3; finally, P K j=1 η t,j w 2 t,j ( ˆ ℓ t,j −m t,j ) 2 =η t,it w 2 t,it ( ˆ ℓ t,it −m t,it ) 2 ≤ 1 162 ×3 2 = 1 18 . Let τ j denote the last round the learning rate for arm j is updated, that is, τ j ≜ max{t∈ [T] :η t+1,j =κη t,j }. We assume that the learning rate is updated at least once so that τ j is well defined, otherwise one can verify that the bound is trivial. For any arm i to compete with, let u = 1− 1 T e i + 1 T w ′ 1 = 1− 1 T e i + 1 KT 1, which guarantees w ′ 1,i u i ≤T. Applying Theorem 12, with B≜ 1 T P T t=1 ⟨−e i +w ′ 1 , ˆ ℓ t +a t ⟩ we have T X t=1 ⟨w t , ˆ ℓ t ⟩− ˆ ℓ t,i ≤ K lnT η + T X t=1 K X j=1 1 η t+1,j − 1 η t,j ! h u j w ′ t+1,j ! + T X t=1 a t,i +B ≤ K lnT η + 1 η τ i +1,i − 1 η τ i ,i ! h u i w ′ τ i +1,i ! + T X t=1 a t,i +B ≤ K lnT η + 1−κ η τ i +1,i h u i w ′ τ i +1,i ! + T X t=1 a t,i +B ≤ K lnT η − 1 5η lnT h u i w ′ τ i +1,i ! + T X t=1 a t,i +B, (B.9) 178 where the last inequality is by Lemma 30 and the fact κ− 1≥ 1 lnT . Now we bound the second and the third term in (B.9) separately. 1. For the second term, by Lemma 28 and T≥ 3 we have u i w ′ τ i +1,i ≥ 1− 1 T 3 2 w τ i ,i ≥ 1− 1 T 2 3 2 ¯ w τ i ,i = 1− 1 T 2 3 2 × ρ T +1,i 2 ≥ ρ T +1,i 8 ≥ 4K 8 ≥ 1. Noting that h(y) is an increasing function when y≥ 1, we thus have h u i w ′ τ i +1,i ! ≥h ρ T +1,i 8 = ρ T +1,i 8 − 1− ln ρ T +1,i 8 ≥ ρ T +1,i 8 − 1− ln KT 4 . (B.10) 2. For the third term, we proceed as T X t=1 a t,i = 6 T X t=1 η t,i w t,i ( ˆ ℓ t,i −m t,i ) 2 ≤ 90η T X t=1 | ˆ ℓ t,i −m t,i | ≤ 90η max t∈[T ] 1 ¯ w t,i ! T X t=1 |ℓ t,i −ℓ t−1,i |≤ 90ηρ T +1,i V T,i , (B.11) where in the first inequality, we use w t,i | ˆ ℓ t,i −m t,i |≤ 3 and η t,i ≤ 5η; in the second inequality, we do a similar calculation as in Eq. (3.10) (only replacing w t,i by ¯ w t,i ); and in the last inequality, we use the fact 1 ¯ w t,i ≤ρ T +1,i for all t∈ [T ] by the algorithm. Combining Eq. (B.10) and Eq. (B.11) and using the fact 1+ln( KT 4 ) 5 lnT ≤ K lnT, we continue from Eq. (B.9) to arrive at T X t=1 ⟨w t , ˆ ℓ t ⟩− ˆ ℓ t,i ≤ 2K lnT η +ρ T +1,i −1 40η lnT + 90ηV T,i +B, (B.12) 179 We are almost done here, but note that the left-hand side of (B.12) is not the desired regret. What we would like to bound is T X t=1 ⟨ ¯ w t , ˆ ℓ t ⟩− T X t=1 ˆ ℓ t,i = T X t=1 ⟨ ¯ w t −w t , ˆ ℓ t ⟩ + T X t=1 ⟨w t , ˆ ℓ t ⟩− ˆ ℓ t,i , (B.13) where the second summation on the right-hand side is bounded by Eq. (B.12). The first term can be written as P T t=1 ⟨− 1 T w t + 1 KT 1, ˆ ℓ t ⟩. 
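As a brief aside, the learning-rate increase rule analyzed in Lemma 30 can be written in a few lines. The sketch is illustrative and not the thesis code: the initialization rho_{1,i} = 2K is an assumption, kappa = e^{1/ln T} and eta_{1,i} = 1/810 follow the choices used above, and the surrounding Broad-OMD+ loop that produces the probabilities w_bar each round is assumed rather than shown.

# Illustrative sketch of the threshold / learning-rate increase rule of Lemma 30:
# whenever the inverse sampling probability of arm i exceeds its threshold rho_i,
# the threshold is reset to 2 / w_bar_i (at least doubling it) and the learning
# rate of that arm is multiplied by kappa = e^{1/ln T}.
import math

def update_thresholds(w_bar, rho, eta, kappa):
    """One round of the per-arm threshold and learning-rate update."""
    for i in range(len(w_bar)):
        if 1.0 / w_bar[i] > rho[i]:
            rho[i] = 2.0 / w_bar[i]     # at least doubles rho[i]
            eta[i] = kappa * eta[i]     # increase the learning rate of arm i
    return rho, eta

K, T = 5, 10_000
eta = [1.0 / 810] * K                   # eta_{1,i} = eta, as in Theorem 14
rho = [2.0 * K] * K                     # rho_{1,i} = 2K (assumed initialization)
kappa = math.exp(1.0 / math.log(T))

# Example: pretend arm 0 just received a very small sampling probability.
w_bar = [0.001, 0.3, 0.3, 0.2, 0.199]
rho, eta = update_thresholds(w_bar, rho, eta, kappa)

The proof of Theorem 14 continues below.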
Note that 1 T P T t=1 ⟨−w t , ˆ ℓ t ⟩≤ 1 T P T t=1 |⟨w t , ˆ ℓ t −m t ⟩| + 1 T P T t=1 |⟨w t ,m t ⟩|≤ 3 + 1 = 4, and E h 1 T P T t=1 ⟨ 1 K 1, ˆ ℓ t ⟩ i = 1 T P T t=1 ⟨ 1 K 1,ℓ t ⟩≤ 1. Therefore, taking expectation on both sides of (B.13), we get E " T X t=1 ℓ t,it # − T X t=1 ℓ t,i ≤ 2K lnT η +E[ρ T +1,i ] −1 40η lnT + 90ηV T,i +O(1), becauseE[B] is alsoO(1) as proved in Corollary 13. B.4 Omitted details in Section 3.4 Proof of Theorem 15. We first analyze the regret of the sequence x 1 ,...,x T using the analysis of (Wei and Luo, 2018). Specifically by their Theorem 7 and our choice of m t and ˆ ℓ t , we have for any arm i, E " T X t=1 ⟨x t −e i ,ℓ t ⟩ # ≤O K lnT η + 3ηE " T X t=1 K X i=1 x 2 t,i ( ˆ ℓ t,i −c t−1 ) 2 # =O K lnT η + 3ηE " T X t=1 x 2 t,it w 2 t,it (c t −c t−1 ) 2 # ≤O K lnT η + 4ηE " T X t=1 (c t −c t−1 ) 2 # , (B.14) 180 where in the last step we use x t,i w t,i ≤ 1 1−αt ≤ 1 +α = 1 + 8η≤ p 4/3 by our choice of η. In the rest of the proof we analyze the difference between using x t and w t . Specifically we prove E " T X t=1 ⟨w t −x t ,ℓ t ⟩ # ≤O(1) +αE " T X t=2 |ℓ t,i t−1 −ℓ t−1,i t−1 |− 1 2 T X t=2 (c t −c t−1 ) 2 # , (B.15) which finishes the proof by combining the two inequalities above and using α = 8η. Indeed, observe that for each time t> 1, we have by the definition of w t E [⟨w t −x t ,ℓ t ⟩] =E α t ⟨e i t−1 −x t ,ℓ t ⟩ =E α t ⟨e i t−1 −w t ,ℓ t ⟩ +E [α t ⟨w t −x t ,ℓ t ⟩]. Rearranging and plugging the definition of α t gives E [⟨w t −x t ,ℓ t ⟩] =E α t 1−α t ⟨e i t−1 −w t ,ℓ t ⟩ =αE (1−c t−1 )⟨e i t−1 −w t ,ℓ t ⟩ =αE (1−c t−1 )(ℓ t,i t−1 −c t ) =αE (1−c t−1 )(ℓ t,i t−1 −c t−1 +c t−1 −c t ) ≤αE |ℓ t,i t−1 −c t−1 | + (1−c t−1 )(c t−1 −c t ) =αE h |ℓ t,i t−1 −c t−1 | + (c t−1 −c t −c 2 t−1 +c t−1 c t ) i . Summing over t, and combining (B.14), we can bound the regret by (recall c t−1 =ℓ t−1,i t−1 ) O K lnT η + 8ηE " T X t=2 |ℓ t,i t−1 −ℓ t−1,i t−1 | # +E " 4η T X t=2 (c t −c t−1 ) 2 + 8η T X t=2 (−c 2 t−1 +c t−1 c t ) # =O K lnT η + 8ηE " T X t=2 |ℓ t,i t−1 −ℓ t−1,i t−1 | # , (telescoping) which finishes the proof. 181 B.5 Omitted details in Section 3.6 B.5.1 Concentration inequalities Lemma 31 (Freedman’s inequality, cf. Theorem 1 of (Beygelzimer et al., 2011)). LetF 0 ⊂···⊂F n be a filtration, and X 1 ,...,X n be real random variables such that X i isF i -measurable, E[X i |F i−1 ] = 0, |X i |≤b, and P n i=1 E[X 2 i |F i−1 ]≤V for some fixed b≥ 0 and V ≥ 0. Then for any δ∈ (0, 1), we have with probability at least 1−δ, n X i=1 X i ≤ 2 q V n log(1/δ) +b log(1/δ). Lemma 32 (Concentration inequality for Catoni’s estimator). LetF 0 ⊂···⊂F n be a filtration, and X 1 ,...,X n be real random variables such that X i isF i -measurable, E[X i |F i−1 ] =µ i for some fixed µ i , and P n i=1 E[(X i −µ i ) 2 |F i−1 ]≤V for some fixed V. Denote µ ≜ 1 n P n i=1 µ i and let b µ n,α be the Catoni’s robust mean estimator of X 1 ,...,X n with a fixed parameter α > 0, that is, b µ n,α is the unique root of the function f(z) = n X i=1 ψ(α(X i −z)) where ψ(y) = ln(1 +y +y 2 /2), if y≥ 0, − ln(1−y +y 2 /2), else. Then for anyδ∈ (0, 1), as long asn is large enough such thatn≥α 2 (V + P n i=1 (µ i −µ ) 2 )+2 log(1/δ), we have with probability at least 1− 2δ, |b µ n,α −µ |≤ α(V + P n i=1 (µ i −µ ) 2 ) n + 2 log(1/δ) αn . 182 In particular, if µ 1 =··· =µ n =µ , we have ∗ |b µ n,α −µ |≤ αV n + 2 log(1/δ) αn . Proof. The proof generalizes that of (Lugosi and Mendelson, 2019, Theorem 5) for i.i.d. random variables, following similar ideas used in (Beygelzimer et al., 2011, Theorem 1). 
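As a brief aside before the body of the proof: computing the Catoni estimator itself is straightforward, since f(z) = sum_i psi(alpha (X_i - z)) is strictly decreasing in z and its unique root can be found by bisection. The following sketch is illustrative rather than the thesis implementation; the sample data and the choice of alpha are made up.

# A short sketch of the Catoni robust mean: the root of
#   f(z) = sum_i psi(alpha * (X_i - z)),
# with psi as defined in Lemma 32, located by plain bisection.
import numpy as np

def psi(y):
    return np.where(y >= 0, np.log1p(y + y * y / 2), -np.log1p(-y + y * y / 2))

def catoni_mean(x, alpha, iters=80):
    lo, hi = np.min(x) - 1.0, np.max(x) + 1.0    # f(lo) > 0 > f(hi)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if psi(alpha * (x - mid)).sum() > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# One gross outlier barely moves the Catoni mean but drags the empirical mean.
rng = np.random.default_rng(3)
samples = np.concatenate([rng.normal(0.5, 0.1, size=200), [50.0]])
print("empirical mean:", samples.mean(), " Catoni mean:", catoni_mean(samples, alpha=1.0))

The proof of Lemma 32 now proceeds.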
First, one can verify that ψ(y)≤ ln(1 +y +y 2 /2) for all y∈R. Therefore, for any fixed z∈R and any i, we have E i [exp (ψ(α(X i −z)))] (E i [·]≜E[·|F i−1 ]) ≤E i " 1 +α(X i −z) + α 2 (X i −z) 2 2 # = 1 +α(µ i −z) + α 2 E i (X i −µ i ) 2 +α 2 (µ i −z) 2 2 ≤ exp α(µ i −z) + α 2 E i (X i −µ i ) 2 +α 2 (µ i −z) 2 2 ! . (1 +y≤e y ) Define random variables Z 0 = 1, and for i≥ 1, Z i =Z i−1 exp (ψ(α(X i −z))) exp − α(µ i −z) + α 2 E i (X i −µ i ) 2 +α 2 (µ i −z) 2 2 !! . Then the last calculation shows E i [Z i ]≤ Z i−1 . Therefore, taking expectation over all random variables X 1 ,...,X n , we have E[Z n ]≤E[Z n−1 ]≤···≤E[Z 0 ] = 1. ∗ In all our applications of this lemma, we have µ 1 =··· =µ n =µ . 183 Further define g(z)≜nα(µ −z) + 1 2 α 2 n X i=1 (µ i −z) 2 + 1 2 α 2 V + log 1 δ and note that f(z)≥g(z) implies n X i=1 ψ(α(X i −z))≥nα(µ −z) + 1 2 α 2 n X i=1 (µ i −z) 2 + 1 2 α 2 n X i=1 E i h (X i −µ i ) 2 i + log 1 δ (by the condition of V) = n X i=1 α(µ i −z) + α 2 (µ i −z) 2 +α 2 E i (X i −µ i ) 2 2 ! + log 1 δ , which further implies Z n ≥ 1/δ. By Markov’s inequality, we then have Pr[f(z)≥g(z)]≤ Pr[Z n ≥ 1/δ]≤ Pr[Z n ≥E[Z n ]/δ]≤δ. Note further that we can rewrite g(z) as g(z) =nα(µ −z) + 1 2 α 2 (nz 2 − 2nµz + n X i=1 µ 2 i ) + 1 2 α 2 V + log 1 δ =nα(µ −z) + 1 2 α 2 (n(z−µ ) 2 −nµ 2 + n X i=1 µ 2 i ) + 1 2 α 2 V + log 1 δ =nα(µ −z) + 1 2 nα 2 (z−µ ) 2 + 1 2 α 2 n X i=1 µ 2 i −nµ 2 ! + 1 2 α 2 V + log 1 δ Now we pick z to be the smaller root z 0 of the quadratic function g(z), that is, z 0 =µ + 1 α 1− s 1− α 2 (V + P n i=1 (µ i −µ ) 2 ) n − 2 n log 1 δ (which exists due to the condition on n). By the monotonicity of f and the fact f(b µ n,α ) = 0 we then have Pr [b µ n,α ≥z 0 ] = Pr [f(z 0 )≥ 0] = Pr [f(z 0 )≥g(z 0 )]≤δ. 184 In other words, with probability at least 1−δ, we have b µ n,α −µ ≤ 1 α 1− s 1− α 2 (V + P n i=1 (µ i −µ ) 2 ) n − 2 n log 1 δ ≤ 1 α α 2 (V + P n i=1 (µ i −µ ) 2 ) n + 2 n log 1 δ ! (1− √ 1−x≤x for x∈ [0, 1]) = α(V + P n i=1 (µ i −µ ) 2 ) n + 2 log(1/δ) αn . Finally, via a symmetric argument one can show that µ −b µ n,α ≤ α(V + P n i=1 (µ i −µ ) 2 ) n + 2 log(1/δ) αn holds with probability at least 1−δ as well. Applying a union bound then finishes the proof. B.5.2 Proof of Theorem 17 Proof of Theorem 17. We first prove that for any algorithm, any K≥ 2, any T≥ 8× 10 4 , and any value V ∈ [0,T ], there exists a stochastic environment withE≤V and N = (K− 1) p T /K + 1 such that Reg = e Ω min √ V (KT) 1 4 , √ KT . The construction is as follows. There are p T /K possible context-predictor-loss tuples{(x (i) ,m (i) ,ℓ (i) )} √ T /K i=1 , and in each round, (x t ,m t ,ℓ t ) is uniformly randomly drawn from this set. The policy set Π contains (K− 1) p T /K + 1 policies such that: there is a policy π (0) that always chooses action 1 given any context; other policies are indexed by (i,k)∈ [ p T /K]×{2,...,K} such that π (i,k) (x) = k if x =x (i) , 1 otherwise. Now first consider an environment with m (i) = ℓ (i) = ( 1 2 , 1 2 +σ,..., 1 2 +σ) for all i, where σ = min n 1 2 , √ V 2(KT ) 1/4 o . Note thatE = 0≤V. Under this environment and the given algorithm, if 185 for all (i,k)∈ [ p T /K]×{2,...,K}, the expected total number of times where (x t ,a t ) = (x (i) ,k) is larger than 1 2 , then the algorithm’s regret against π (0) is E " T X t=1 ℓ t (a t )−ℓ t (π (0) (x t )) # =E T X t=1 √ T /K X i=1 K X k=2 1[(x t ,a t ) = (x (i) ,k)] (ℓ t (k)−ℓ t (1)) =E T X t=1 √ T /K X i=1 K X k=2 1[(x t ,a t ) = (x (i) ,k)]σ ≥ s T K × (K− 1)× 1 2 ×σ≥ 1 4 √ KTσ. 
On the other hand, if there exists a pair (i ∗ ,k ∗ )∈ [ p T /K]×{2,...,K} such that E " T X t=1 1[(x t ,a t ) = (x (i ∗ ) ,k ∗ )] # ≤ 1 2 , then by Markov’s inequality, Pr " T X t=1 1[(x t ,a t ) = (x (i ∗ ) ,k ∗ )] = 0 # = Pr " T X t=1 1[(x t ,a t ) = (x (i ∗ ) ,k ∗ )]< 1 # = 1− Pr " T X t=1 1[(x t ,a t ) = (x (i ∗ ) ,k ∗ )]≥ 1 # ≥ 1 2 . That is, with probability at least 1 2 , the learner never chooses action k ∗ when she sees context x (i ∗ ) . In this case, consider another environment where all m (i) and ℓ (i) remain the same except that ℓ (i ∗ ) is changed to ( 1 2 , 1 2 +σ,..., 1 2 −σ,..., 1 2 +σ), where 1 2 −σ appears in thek ∗ -th coordinate. Note that in this new environment we again haveE =TE (x,ℓ,m) ∥ℓ−m∥ 2 ∞ = √ TK× 4σ 2 ≤V. Moreover, with probability at least 1 2 the learner never realizes the change of the environment and behaves exactly the same, since the only way to distinguish the two environments is to pick k ∗ under context x (i ∗ ) . 186 It remains to calculate the regret of the learner under this new environment. First, by Freedman’s inequality (Lemma 31), we have with probability at least 1− 1 T , T X t=1 1[x t =x (i ∗ ) ]≥ √ KT− 2 q √ KT logT− logT≥ √ KT 3 (B.16) where the last step uses the condition K≥ 2 and T≥ 8× 10 4 . Define events E 1 = ( T X t=1 1[(x t ,a t ) = (x (i ∗ ) ,k ∗ )] = 0 ) ,E 2 = ( T X t=1 1[x t =x (i ∗ ) ]≥ √ KT 3 ) , and useE ′ , Pr ′ to denote the expectation and probability under the new environment. Now we lower bound the regret against π (i ∗ ,k ∗ ) in this environment as E ′ T X t=1 √ T /K X i=1 1[x t =x (i) ] ℓ t (a t )−ℓ t (π (i ∗ ,k ∗ ) (x (i) )) ≥ Pr ′ [E 1 ∩E 2 ]×E ′ " T X t=1 1[x t =x (i ∗ ) ] ℓ t (a t )−ℓ t (π (i ∗ ,k ∗ ) (x (i ∗ ) )) E 1 ,E 2 # = Pr ′ [E 1 ∩E 2 ]×E ′ " T X t=1 1[x t =x (i ∗ ) ]σ E 1 ,E 2 # ≥ 1 2 − 1 T × √ KTσ 3 ≥ √ KTσ 12 . To summarize, in at least one of these two environments, the learner’s regret is Ω( √ KTσ) = e Ω min √ V (KT ) 1 4 , √ KT , finishing the lower bound proof for stochastic environments. For adversarial environments, the only change is to let each tuple (x (i) ,m (i) ,ℓ (i) ) appear for exactly p T/K times, so thatE≤V still holds 187 in these two constructions under the slightly different definition for E (which is P T t=1 ∥ℓ t −m t ∥ 2 ∞ ). It is clear that the same lower bound holds. B.5.3 Proof of Theorem 18 We first prove a lemma showing a somewhat non-conventional analysis of the optimistic Exp4 update. We denote the KL divergence of two distributions Q and P by D(Q,P ) = P π∈Π Q(π) ln Q(π) P (π) . Lemma 33. For any η > 0,M t ,L t ∈ R N , and distribution Q ′ t ∈ ∆ Π , define two distributions Q t ,Q ′ t+1 ∈ ∆ Π such that Q t (π)∝Q ′ t (π) exp (−ηM t (π)), Q ′ t+1 (π)∝Q ′ t (π) exp (−ηL t (π)). (B.17) Then there exists ξ t ∈ ∆ Π such that for any Q ∗ ∈ ∆ Π , we have ⟨Q t −Q ∗ ,L t ⟩≤ D(Q ∗ ,Q ′ t )−D(Q ∗ ,Q ′ t+1 ) η + 2η X π∈Π ξ t (π) (L t (π)−M t (π)) 2 . (B.18) Moreover, ifL t (π)−M t (π)≥− 1 η holds for all π, then we have for any Q ∗ ∈ ∆ Π , ⟨Q t −Q ∗ ,L t ⟩≤ D(Q ∗ ,Q ′ t )−D(Q ∗ ,Q ′ t+1 ) η +η X π∈Π Q t (π) (L t (π)−M t (π)) 2 . (B.19) Proof. First, we rewrite the updates in the standard optimistic online mirror descent framework: Q t = argmin Q∈∆ Π F t (Q) and Q ′ t+1 = argmin Q∈∆ Π F ′ t (Q) where F t (Q) =η⟨Q,M t ⟩ +D(Q,Q ′ t ), F ′ t (Q) =η⟨Q,L t ⟩ +D(Q,Q ′ t ). (B.20) 188 Applying Lemma 6 of (Wei and Luo, 2018) shows ⟨Q t −Q ∗ ,L t ⟩≤ D(Q ∗ ,Q ′ t )−D(Q ∗ ,Q ′ t+1 ) η + Q t −Q ′ t+1 ,L t −M t − 1 η D(Q ′ t+1 ,Q t ). Next, we prove Eq. (B.18). 
By Taylor expansion, there exists some convex combination of Q t and Q t+1 , denoted by ξ t , such that F ′ t (Q t )−F ′ t (Q ′ t+1 ) =∇F ′ t (Q ′ t+1 )(Q t −Q ′ t+1 ) + 1 2 (Q t −Q ′ t+1 ) ⊤ ∇ 2 F ′ t (ξ t )(Q t −Q ′ t+1 ) =∇F ′ t (Q ′ t+1 )(Q t −Q ′ t+1 ) + 1 2 X π∈Π (Q t (π)−Q ′ t+1 (π)) 2 ξ t (π) ≥ 1 2 X π∈Π (Q t (π)−Q ′ t+1 (π)) 2 ξ t (π) , where the last step is due to the optimality of Q ′ t+1 . On the other hand, we also have F ′ t (Q t )−F ′ t (Q ′ t+1 ) =F t (Q t )−F t (Q ′ t+1 ) +η Q t −Q ′ t+1 ,L t −M t ≤η⟨Q t −Q ′ t+1 ,L t −M t ⟩ (by optimality of Q t ) ≤η X π∈Π (Q t (π)−Q ′ t+1 (π)) 2 ξ t (π) 1/2 X π∈Π ξ t (π)(L t (π)−M t (π)) 2 1/2 . (Cauchy-Schwarz inequality) Combining the two inequalities shows ⟨Q t −Q ′ t+1 ,L t −M t ⟩≤ 2η X π∈Π ξ t (π)(L t (π)−M t (π)) 2 , which proves Eq. (B.18) (since D(Q ′ t+1 ,Q t ) is non-negative). 189 To proves Eq. (B.19), note that Q ′ t+1 (π) = 1 Z Q t (π) exp(−η(L t (π)−M t (π))) where Z = X π∈Π Q t (π) exp(−η(L t (π)−M t (π))) is the normalization factor. Direct calculation shows ⟨Q t −Q ′ t+1 ,L t −M t ⟩− 1 η D(Q ′ t+1 ,Q t ) = X π∈Π Q t −Q ′ t+1 ,L t −M t − 1 η X π∈Π Q ′ t+1 (π) lnQ ′ t+1 (π) + 1 η X π∈Π Q ′ t+1 (π) lnQ t (π) = X π∈Π ⟨Q t ,L t −M t ⟩ + 1 η lnZ ≤ X π∈Π ⟨Q t ,L t −M t ⟩ + 1 η ln X π∈Π Q t (π) 1−η(L t (π)−M t (π)) +η 2 (L t (π)−M t (π)) 2 (by e −z ≤ 1−z +z 2 for z≥−1 and the condition η(L t (π)−M t (π))≥−1) = X π∈Π ⟨Q t ,L t −M t ⟩ + 1 η ln 1−η⟨Q t ,L t −M t ⟩ +η 2 X π∈Π Q t (π)(L t (π)−M t (π)) 2 ≤η X π∈Π Q t (π) (L t (π)−M t (π)) 2 . (by ln(1 +z)≤z) This finishes the proof. Proof of Theorem 18. We directly apply Lemma 33 withM t (π) = m t (ϕ t (π(x t ))) andL t (π) = b ℓ t (ϕ t (π(x t ))) and use Eq. (B.18) withQ ∗ concentrating on the best policyπ ∗ . Summing overt gives T X t=1 X π∈Π Q t (π) b ℓ t (ϕ t (π(x t )))− T X t=1 b ℓ t (ϕ t (π ∗ (x t ))) ≤ D(Q ∗ ,Q ′ 1 ) η + 2η T X t=1 X a∈At X π:ϕ t(π(xt))=a ξ t (π) b ℓ t (a)−m t (a) 2 . ≤ lnN η + 2η T X t=1 X a∈At b ℓ t (a)−m t (a) 2 190 = lnN η + 2η T X t=1 b ℓ t (a t )−m t (a t ) 2 , where in the last step we use the fact that b ℓ t (a)−m t (a) is non-zero only if a =a t . Note that this basically proves Eq. (3.13) (with remapping). The rest of the proof follows the analysis sketch in Section 3.6.1. First, we plug in the definition of b ℓ t and continue to bound the last expression by lnN η + 2η T X t=1 (ℓ t (a t )−m t (a t )) 2 p t (a t ) 2 ≤ lnN η + 2ηK µ T X t=1 ∥ℓ t −m t ∥ 2 ∞ p t (a t ) , where the last step uses the fact p t (a t )≥µ/ |A t |≥µ/K . Taking expectation on both sides leads to E T X t=1 X π∈Π Q t (π)ℓ t (ϕ t (π(x t ))) − T X t=1 ℓ t (ϕ t (π ∗ (x t )))≤ lnN η + 2ηK 2 E µ . (B.21) Next, consider the expected loss of the algorithm at time t: X a∈At p t (a)ℓ t (a) = (1−µ ) X a∈At X π:ϕ t(π(xt))=a Q t (π) ℓ t (a) + µ |A t | X a∈At ℓ t (a) = (1−µ ) X π∈Π Q t (π)ℓ t (ϕ t (π(x t ))) + µ |A t | X a∈At ℓ t (a). Combining with Eq. (B.21) shows E " T X t=1 ℓ t (a t ) # ≤ (1−µ ) T X t=1 ℓ t (ϕ t (π ∗ (x t ))) + lnN η + 2ηK 2 E µ + T X t=1 µ |A t | X a∈At ℓ t (a) = T X t=1 ℓ t (ϕ t (π ∗ (x t ))) + lnN η + 2ηK 2 E µ + T X t=1 µ |A t | X a∈At (ℓ t (a)−ℓ t (ϕ t (π ∗ (x t )))), where the last term can be further bounded as (by the definition of A t ): ℓ t (a)−ℓ t (ϕ t (π ∗ (x t )) 191 =ℓ t (a)−m t (a) +m t (a)−m t (ϕ t (π ∗ (x t )) +m t (ϕ t (π ∗ (x t ))−ℓ t (ϕ t (π ∗ (x t )) ≤ 2∥ℓ t −m t ∥ ∞ +σ. 
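As a brief aside, the optimistic exponential-weights update (B.17) over the N policies is cheap to implement by keeping the weights in the log domain. The sketch below is illustrative rather than the thesis code; the number of policies, the learning rate, and the loss/predictor vectors in the example are made up.

# Hedged sketch of the update (B.17):
#   Q_t      proportional to  Q'_t * exp(-eta * M_t),
#   Q'_{t+1} proportional to  Q'_t * exp(-eta * L_t).
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    w = np.exp(z)
    return w / w.sum()

class OptimisticHedge:
    def __init__(self, n_policies, eta):
        self.eta = eta
        self.log_q = np.zeros(n_policies)        # log of the unnormalized Q'_t (uniform Q'_1)

    def play(self, M_t):
        """Return Q_t, using the optimistic predictor M_t (one entry per policy)."""
        return softmax(self.log_q - self.eta * M_t)

    def update(self, L_t):
        """Incorporate the (estimated) losses L_t to form Q'_{t+1}."""
        self.log_q = self.log_q - self.eta * L_t

# Example round with four policies and illustrative loss estimates.
hedge = OptimisticHedge(n_policies=4, eta=0.1)
Q_t = hedge.play(M_t=np.array([0.5, 0.4, 0.6, 0.5]))
hedge.update(L_t=np.array([0.7, 0.2, 0.9, 0.5]))
print(Q_t)

The proof of Theorem 18 resumes below with the bound just derived.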
This shows E " T X t=1 ℓ t (a t ) # ≤ T X t=1 ℓ t (ϕ t (π ∗ (x t ))) + lnN η + 2ηK 2 E µ +µTσ + 2µ T X t=1 ∥ℓ t −m t ∥ ∞ ≤ T X t=1 ℓ t (ϕ t (π ∗ (x t ))) + lnN η + 2ηK 2 E µ +µTσ + 2µ √ ET. (Cauchy-Schwarz inequality) It remains to bound the bias due to remapping: when π ∗ (x t )̸=ϕ t (π ∗ (x t )) we have ϕ t (π ∗ (x t )) =a ∗ t , m t (a ∗ t )≤m t (π ∗ (x t ))−σ, and ℓ t (ϕ t (π ∗ (x t )))−ℓ t (π ∗ (x t )) =ℓ t (a ∗ t )−m t (a ∗ t ) +m t (a ∗ t )−m t (π ∗ (x t )) +m t (π ∗ (x t ))−ℓ t (π ∗ (x t )), ≤ 2∥ℓ t −m t ∥ ∞ −σ≤ ∥ℓ t −m t ∥ 2 ∞ σ , (B.22) where the last step is by the AM-GM inequality. Whenπ ∗ (x t ) =ϕ t (π ∗ (x t )), the above holds trivially. Summing over t we have thus shown Reg≤ lnN η + 2ηK 2 E µ +µTσ + 2µ √ ET + E σ , finishing the proof. 192 B.5.4 Proofs of Lemma 4 and Theorems 19 and 20 First, we prove Lemma 4 which certifies the efficiency and (approximate) correctness of the binary search procedure for finding the policy with the smallest Catoni’s mean (Algorithm 18). Proof of Lemma 4. The fact that the algorithm stops after log 2 2T K µ + 1 =O(ln(KT/µ )) iterations is clear due to the initial value of z left and z right , and the precision 1/T. To prove the approximate optimality of the output π t , note that the algorithm maintains the following loop invariants: min π∈Π X s<t ψ α e ℓ s (ϕ s (π(x s )))−z left ≥ 0 and min π∈Π X s<t ψ α e ℓ s (ϕ s (π(x s )))−z right ≤ 0. Therefore, by the monotonicity of ψ, all policies have Catoni’s mean larger than z left , and there exists a policy argmin π∈Π X s<t ψ α e ℓ s (ϕ s (π(x s )))−z right withCatoni’smeansmallerthanz right . ThesetwofactsimplythatbothCatoni α n e ℓ s (ϕ s (π t (x s ))) o s<t and min π∈Π Catoni α n e ℓ s (ϕ s (π(x s ))) o s<t are betweenz left andz right , and are thus 1/T away from each other since we have z right −z left ≤ 1/T after the algorithm stops. To prove both Theorem 19 and Theorem 20, we introduce the following notation. Definition 3. Denote byL(π)≜E (xt,mt,ℓt)∼D [ℓ t (π(x t ))] the expected loss of policy π, and byL(π)≜ E (xt,mt,ℓt)∼D [ℓ t (ϕ t (π(x t )))] the expected loss of policy π after remapping. For both theorems we make use of the following lemmas. 193 Lemma 34. Algorithm 17 (with either Option I or Option II) ensures E " T X t=1 ℓ t (a t ) # ≤E " T X t=1 L(π t ) # +µTσ + 2µ √ ET. Proof. Denote the conditional expectation given the history up to the beginning of time t byE t [·]. By the choice of a t we have E t [ℓ t (a t )] = (1−µ )E t [ℓ(ϕ t (π t (x t ))] +E t µ |A t | X a∈At ℓ t (a) =L(π t ) +E t µ |A t | X a∈At (ℓ t (a)−ℓ t (ϕ t (π t (x t )))) ≤L(π t ) +µ E t " sup a,a ′ ∈At |ℓ t (a)−ℓ t (a ′ )| # ≤L(π t ) +µ E t " sup a,a ′ ∈At |ℓ t (a)−m t (a)| +|m t (a)−m t (a ′ )| +|m t (a ′ )−ℓ t (a ′ )| # ≤L(π t ) +µ E t [σ + 2∥ℓ t −m t ∥ ∞ ] (by the definition of A t ) =L(π t ) +µσ + 2µ E t [∥ℓ t −m t ∥ ∞ ]. Summing over T and applying Cauchy-Schwarz inequality: E " T X t=1 ∥ℓ t −m t ∥ ∞ # ≤ v u u t TE " T X t=1 ∥ℓ t −m t ∥ 2 ∞ # = √ ET finish the proof. Lemma 35. Algorithm 17 (with either Option I or Option II) ensures T (L(π ∗ )−L(π ∗ ))≤ E σ . 194 Proof. The proof is exactly the same as the adversarial case (cf. Eq. (B.22)). First rewrite L(π ∗ )−L(π ∗ ) asE [ℓ t (ϕ t (π ∗ (x t )))−ℓ t (π ∗ (x t ))]. 
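As a brief aside (the proof of Lemma 35 resumes below), the binary search certified by Lemma 4 admits the following illustrative reconstruction; it is not the thesis code. Here scores[pi] stands for the list of remapped loss estimates tilde-ell_s(phi_s(pi(x_s))) for s < t, psi and alpha are as in Lemma 32, and the initial bracket [z_lo, z_hi] is an assumed a-priori range containing every policy's Catoni mean (Algorithm 18's actual initial values are not reproduced here).

# Find a policy whose Catoni mean of past losses is within 1/T of the smallest
# one, using only an "argmin over policies" oracle, as in Lemma 4.
import numpy as np

def psi(y):
    return np.where(y >= 0, np.log1p(y + y * y / 2), -np.log1p(-y + y * y / 2))

def best_catoni_policy(scores, alpha, T, z_lo, z_hi):
    def oracle(z):                                   # argmin_pi sum_s psi(alpha * (score - z))
        vals = {pi: psi(alpha * (np.asarray(s) - z)).sum() for pi, s in scores.items()}
        pi = min(vals, key=vals.get)
        return pi, vals[pi]
    while z_hi - z_lo > 1.0 / T:
        z_mid = 0.5 * (z_lo + z_hi)
        _, v = oracle(z_mid)
        if v >= 0:                                   # every policy's Catoni mean lies above z_mid
            z_lo = z_mid
        else:                                        # some policy's Catoni mean lies below z_mid
            z_hi = z_mid
    return oracle(z_hi)[0]

# Toy usage: two policies, the second with smaller losses.
scores = {"pi1": [0.9, 0.8, 1.0, 0.7], "pi2": [0.3, 0.2, 0.4, 0.1]}
print(best_catoni_policy(scores, alpha=1.0, T=100, z_lo=-5.0, z_hi=5.0))

Each iteration makes one call to the argmin oracle and halves the bracket, so the number of oracle calls is logarithmic in T times the assumed bracket width, in line with Lemma 4. The proof of Lemma 35 now continues.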
Whenπ ∗ (x t )̸=ϕ t (π ∗ (x t )) we haveϕ t (π ∗ (x t )) =a ∗ t , m t (a ∗ t )≤m t (π ∗ (x t ))−σ, and ℓ t (ϕ t (π ∗ (x t )))−ℓ t (π ∗ (x t )) =ℓ t (a ∗ t )−m t (a ∗ t ) +m t (a ∗ t )−m t (π ∗ (x t )) +m t (π ∗ (x t ))−ℓ t (π ∗ (x t )), ≤ 2∥ℓ t −m t ∥ ∞ −σ≤ ∥ℓ t −m t ∥ 2 ∞ σ , where the last step is by the AM-GM inequality. Whenπ ∗ (x t ) =ϕ t (π ∗ (x t )), the above holds trivially. Plugging the definition of E then finishes the proof. We are now ready to prove Theorems 19 and 20, using different concentrations according to the two different ways of calculating π t . Proof of Theorem 19. First, for any fix π and t, we invoke Lemma 31 with X s = e ℓ s (ϕ s (π(x s )))− L(π) +E (x,ℓ,m)∼D [min a m(a)] for s = 1,...,t, b =O( K µ ), and V t =O( KEt µT +σ 2 t) (see Eq. (3.15)). Together with a union bound over all t and π, we have with probability at least 1− 1/T, 1 t t X s=1 e ℓ s (ϕ s (π(x s )))−L(π) +E (x,ℓ,m)∼D [min a m(a)] =O s KE µTt + σ 2 t log(NT ) + K log(NT ) µt (B.23) for all t∈ [T ] and π∈ Π. Therefore, we have L(π t ) 195 ≤ 1 t t X s=1 e ℓ s (ϕ s (π t (x s ))) +E[min a m(a)] +O s KE µTt + σ 2 t log(NT ) + K log(NT ) µt (by Eq. (B.23)) ≤ 1 t t X s=1 e ℓ s (ϕ s (π ∗ (x s ))) +E[min a m(a)] +O s KE µTt + σ 2 t log(NT ) + K log(NT ) µt (by the optimality of π t ) ≤L(π ∗ ) +O s KE µTt + σ 2 t log(NT ) + K log(NT ) µt . (by Eq. (B.23)) Combining Lemma 34, the inequality above, and Lemma 35, we arrive at E " T X t=1 ℓ t (a t ) # ≤E " T X t=1 L(π t ) # +µTσ + 2µ √ ET ≤TL(π ∗ ) +O µTσ +µ √ ET + T X t=1 s KE µTt + σ 2 t log(NT ) + K log(NT ) µt =TL(π ∗ ) + ˜ O µTσ +µ √ ET + s dE µ +σ √ dT + d µ ! (B.24) ≤E " T X t=1 ℓ t (π ∗ (x t )) # + ˜ O µTσ +µ √ ET + s dE µ +σ √ dT + d µ + E σ ! , which finishes the proof. Proof of Theorem 20. First, for any fix π and t, we invoke Lemma 32 with X s = e ℓ s (ϕ s (π(x s ))) for s = 1,...,t, µ 1 =··· = µ t = µ =L(π)−E (x,ℓ,m)∼D [min a m(a)], and V =O( KE µ +σ 2 t) (see Eq. (3.15) for the variance calculation). Together with a union bound over all t and π, and the value of α specified in Algorithm 17, we have with probability at least 1− 2/T, Catoni α e ℓ s (ϕ s (π(x s ))) s≤t −L(π) +E (x,ℓ,m)∼D [min a m(a)] = 1 t αV + 2 log(NT 2 ) α ! =O s KE µt 2 + σ 2 t log(NT ) (B.25) 196 for all t≥α 2 V + 2 log(NT 2 ) = 4 log(NT 2 ) and π∈ Π. Therefore, we have for t≥ 4 ln(NT 2 ), L(π t ) ≤ Catoni α e ℓ s (ϕ s (π t (x s ))) s≤t +E[min a m(a)] +O s KE µt 2 + σ 2 t log(NT ) (by Eq. (B.25)) ≤ Catoni α e ℓ s (ϕ s (π ∗ (x s ))) s≤t +E[min a m(a)] +O s KE µt 2 + σ 2 t log(NT ) + 1 T (by Lemma 4) ≤L(π ∗ ) +O s KE µt 2 + σ 2 t log(NT ) + 1 T . (by Eq. (B.25)) Combining Lemma 34, the inequality above, and Lemma 35, we arrive at E " T X t=1 ℓ t (a t ) # ≤E " T X t=1 L(π t ) # +µTσ + 2µ √ ET ≤TL(π ∗ ) +O 4 ln(NT 2 ) +µTσ +µ √ ET + T X t=1 s KE µt 2 + σ 2 t log(NT ) =TL(π ∗ ) + ˜ O µTσ +µ √ ET + s dE µ +σ √ dT ! ≤E " T X t=1 ℓ t (π ∗ (x t )) # + ˜ O µTσ +µ √ ET + s dE µ +σ √ dT + E σ ! , which finishes the proof. 197 Appendix C Omitted Details in Chapter 4 C.1 Omitted details in Section 4.2 Proof of Theorem 21. We first verify conditions (ii) and (iii) in Theorem 10 hold for ˆ ℓ t,i = ℓ t,i 1{it=i} w t,i and m t,i =ℓ t,it . Indeed, condition (ii) holds since w t,i | ˆ ℓ t,i −m t,i | =|ℓ t,i 1{i t =i}−w t,i ℓ t,it |≤ 2< 3. 
Other the other hand, condition (iii) also holds because η K X i=1 w 2 t,i ( ˆ ℓ t,i −m t,i ) 2 =η K X i=1 (ℓ t,i 1{i t =i}−w t,i ℓ t,it ) 2 =η K X i=1 (ℓ 2 t,i 1{i t =i}− 2ℓ t,i w t,i ℓ t,it 1{i t =i} +w 2 t,i ℓ 2 t,it ) ≤ 1 162 ℓ 2 t,it − 2w t,it ℓ 2 t,it + K X i=1 w 2 t,i ! ℓ 2 t,it ! ≤ 1 162 (1 + 0 + 1)< 1 18 . Thus, by Theorem 10 and the standard analysis for the doubling trick, we have E " T X t=1 ℓ t,it −ℓ t,i ∗ # =O v u u t (K lnT )E " T X t=1 K X i=1 w 2 t,i ( ˆ ℓ t,i −ℓ t,it ) 2 # +K lnT . (C.1) 198 Now we consider the stochastic setting. In this case, we further take expectations over ℓ 1 ,...,ℓ T on both sides of (C.1). The left-hand side of (C.1) can be lower bounded by E " T X t=1 ℓ t,it −ℓ t,i ∗ # =E " T X t=1 ℓ t,it − min j T X t=1 ℓ t,j # ≥E " T X t=1 ℓ t,it − T X t=1 ℓ t,a ∗ # =E " T X t=1 K X i=1 w t,i (ℓ t,i −ℓ t,a ∗) # ≥E T X t=1 X i̸=a ∗ w t,i ∆ = ∆ E " T X t=1 (1−w t,a ∗) # . (C.2) On the other hand, E it∼wt " K X i=1 w 2 t,i ( ˆ ℓ t,i −ℓ t,it ) 2 # =E it∼wt K X i=1 w 2 t,i ℓ t,i 1{i t =i} w t,i −ℓ t,it ! 2 =E it∼wt " K X i=1 (ℓ t,i 1{i t =i}−w t,i ℓ t,it ) 2 # = K X i=1 w t,i (ℓ t,i −w t,i ℓ t,i ) 2 + X j̸=i w t,j (w t,i ℓ t,j ) 2 ≤ K X i=1 w t,i (1−w t,i ) 2 + X j̸=i w t,j w 2 t,i = K X i=1 w t,i (1−w t,i ) ≤ (1−w t,a ∗) + X i̸=a ∗ w t,i = 2(1−w t,a ∗). (C.3) Therefore, the first term on the right-hand side of (C.1) can be upper bounded by v u u t (K lnT )E " T X t=1 K X i=1 w 2 t,i ( ˆ ℓ t,i −ℓ t,it ) 2 # ≤ v u u t (K lnT )E " T X t=1 2(1−w t,a ∗) # . (C.4) Let H =E h P T t=1 (1−w t,a ∗) i . Combining (C.2), (C.4), and (C.1), we have H∆ ≤O q (K lnT )H +K lnT , 199 which implies H =O K lnT ∆ 2 . Therefore, the expected regret is upper bounded by O q (K lnT )H +K lnT =O K lnT ∆ . For the adversarial setting, we continue from an intermediate step of (C.3): E it∼wt " K X i=1 w 2 t,i ( ˆ ℓ t,i −ℓ t,it ) 2 # = K X i=1 w t,i (1−w t,i ) 2 ℓ 2 t,i + X j̸=i w t,j w 2 t,i ℓ 2 t,j ≤ K X i=1 w t,i ℓ 2 t,i + K X j=1 X i̸=j w t,j w 2 t,i ℓ 2 t,j ≤ K X i=1 w t,i ℓ 2 t,i + K X j=1 w t,j ℓ 2 t,j = 2E it∼wt h ℓ 2 t,it i Assuming ℓ t,i ∈ [0, 1], we thus have ℓ 2 t,it ≤ℓ t,it and E " T X t=1 ℓ t,it # − T X t=1 ℓ t,i ∗ =O v u u t (K lnT )E " T X t=1 ℓ t,it # +K lnT . Solving for r E h P T t=1 ℓ t,it i and rearranging then give E " T X t=1 ℓ t,it # − T X t=1 ℓ t,i ∗ =O v u u t (K lnT ) T X t=1 ℓ t,i ∗ +K lnT =O q KL T,i ∗ lnT +K lnT . C.2 Omitted details in Section 4.3 C.2.1 Auxiliary lemmas Lemma 36. Let p be the solution of OP(t, b ∆ ), where b ∆ ∈ R |K| ≥0 . Then we have P x∈K p x b ∆ x = O d log(t|K|/δ) √ t . 200 Proof of Lemma 36. Consider the minimizer p ∗ of the following constrained minimization problem for some ξ> 0: min p∈∆ K X x∈K p x b ∆ x + 2 ξ (− ln(det(S(p)))), (C.5) where S(p) = P x∈K p x xx ⊤ . We will show that X x∈K p ∗ x b ∆ x ≤ 2d ξ , (C.6) ∥x∥ 2 S(p ∗ ) −1≤ ξ b ∆ x 2 +d,∀x∈K. (C.7) To prove this, first note that relaxing the constraints from p∈P K to the set of sub-distributions {p : P x∈K p x ≤ 1 and p x ≥ 0,∀x} does not change the solution of this problem. This is because for any sub-distribution, we can always make it a distribution by increasing the weight of some x with b ∆ x = 0 (at least one exists) while not increasing the objective value (since ln(det(S(p))) is non-decreasing in p x for each x). Therefore, applying the KKT conditions, we have b ∆ x − 2 ξ x ⊤ S(p ∗ ) −1 x−λ x +λ = 0, (C.8) whereλ x ,λ≥ 0 are Lagrange multipliers. 
Plugging in the optimal solutionp ∗ and taking summation over all x∈K, we have 0 = X x∈K p ∗ x b ∆ x − 2 ξ X x∈K p ∗ x x ⊤ S(p ∗ ) −1 x− X x∈K λ x p ∗ x +λ = X x∈K p ∗ x b ∆ x − 2 ξ Tr(S(p ∗ ) −1 S(p ∗ )) +λ (complementary slackness) = X x∈K p ∗ x b ∆ x − 2d ξ +λ 201 ≥ X x∈K p ∗ x b ∆ x − 2d ξ . (λ≥ 0) Therefore, we have P x∈K p ∗ x ∆ x ≤ 2d ξ and λ≤ 2d ξ as P x∈K p ∗ x b ∆ x ≥ 0. This proves (C.6). For (C.7), using (C.8), we have ∥x∥ 2 S(p ∗ ) −1 = ξ 2 b ∆ x −λ x +λ ≤ ξ 2 b ∆ x +λ ≤ ξ b ∆ x 2 +d, where the first inequality is due to λ x ≥ 0, and the second inequality is due to λ≤ 2d ξ . Now we show how to transform p ∗ into a distribution satisfying the constraint of OP. Choose ξ = √ t βt . Let G ={x : b ∆ x ≤ 1 √ t }. We construct the distribution q = 1 2 p ∗ + 1 2 q G,κ , where q G,κ is defined in Lemma 37 with κ = 1 √ t , and prove that q satisfies (4.11). Indeed, for all x / ∈G, we have by definition √ t b ∆ x ≥ 1 and thus ∥x∥ 2 S(q) −1≤∥x∥ 2 1 2 S(p ∗ ) −1 ≤ξ b ∆ x + 2d = √ t b ∆ x β t + 2d≤ t b ∆ 2 x β t + 4d; for x∈G, according to Lemma 37 below, we have∥x∥ 2 S(q) −1 ≤∥x∥ 2 1 2 S(q G,κ ) −1 ≤ 4d≤ t b ∆ 2 x βt + 4d. Combining the two cases above, we prove that q satisfies (4.11). According to the optimality of p, we thus have X x∈K p x b ∆ x ≤ X x∈K q x b ∆ x = 1 2 X x∈K p ∗ x b ∆ x + 1 2 X x∈K q G,κ x b ∆ x (by the definition of q) ≤ 1 2 2d ξ + 1 2 √ t + 1 2 √ t (by (C.6), b ∆ x ≤ 1, the definition of G, and the choice of κ) ≤ dβ t + 1 √ t =O dβ t √ t , (by the definition of ξ) 202 proving the lemma. The following lemma shows that for any G⊂K, there always exists a distribution p∈P K that puts most weights on actions from G, such that∥x∥ 2 ( P x∈K pxxx ⊤ ) −1 ≤O(d) for all x∈G. Lemma 37. Suppose thatK⊆R d spans R d and let p K be the uniform distribution overK. For any G⊆K and κ∈ (0, 1 2 ], there exists a distribution q∈P G such that∥x∥ 2 S(q G,κ ) −1 ≤ 2d for all x∈G, where q G,κ ≜κ·p K + (1−κ)·q and S(p) = P x∈K p x xx ⊤ . Proof. LetP κ G ={p∈P K | p = κ·p K + (1−κ)·q,q∈P G }. AsK spans the whole R d space, ∥x∥ 2 S(p) −1 is well-defined for all p∈P κ G . Then we have min p∈P κ G max x∈G ∥x∥ 2 S −1 (p) = min p∈P κ G max q∈P G * X x∈K q x xx ⊤ , X x∈K p x xx ⊤ ! −1 + = max q∈P G min p∈P κ G * X x∈K q x xx ⊤ , X x∈K p x xx ⊤ ! −1 + (C.9) ≤ max q∈P G * X x∈K q x xx ⊤ , X x∈K κ |K| + (1−κ)q x xx ⊤ ! −1 + ≤ 2 max q∈P G * (1−κ) X x∈K q x xx ⊤ , X x∈K κ |K| + (1−κ)q x xx ⊤ ! −1 + (κ≤ 1 2 ) ≤ 2 max q∈P G * κ |K| X x∈K xx ⊤ + (1−κ) X x∈K q x xx ⊤ , X x∈K κ |K| + (1−κ)q x xx ⊤ ! −1 + = 2d, where the second equality is by the Sion’s minimax theorem as (C.9) is linear in q and convex in p. Lemma 38. Given { b ∆ } x∈K , suppose there exists a unique b x such that b ∆ b x = 0, and b ∆ min = min x̸=b x b ∆ x > 0. Then P x p x b ∆ x ≤ 24dβt b ∆ min t when t≥ 16dβt b ∆ 2 min , where p is the solution to OP(t, b ∆) . 203 Proof. We divide actions into groups G 0 ,G 1 ,G 2 ,... based on the following rule: G 0 ={b x}, G i = n x : 2 i−1b ∆ 2 min ≤ b ∆ 2 x < 2 ib ∆ 2 min o . Letn be the largest index such that G n is not empty and z i = dβt 2 i−2b ∆ 2 min t fori≥ 1. For each group i, by Lemma 37, we find a distribution q G i ,κ with κ = 1 n·2 n |K| , such that∥x∥ 2 P y∈K q G i ,κ y yy ⊤ −1 ≤ 2d for all x∈G i . Then we define a distribution e p over actions as the following: e p x = X j≥1 z j q G j ,κ x if x̸=b x 1− X x ′ ̸=b x e p x ′ if x =b x. 
e p is a valid distribution as e p b x = 1− X i≥1 X x∈G i X j≥1 z j q G j ,κ x = 1− X i≥1 X x∈G i z i q G i ,κ x − X i≥1 X x∈G i X j̸=i,j≥1 z j q G j ,κ x ≥ 1− X i≥1 z i − X i≥1 X x∈G i X j̸=i,j≥1 z j n· 2 n ·|K| ( P x∈G i q G i ,κ x ≤ 1 and q G j ,κ x = 1 n·2 n |K| for x / ∈G j ) ≥ 1− X i≥1 z i − X i≥1 z i (by definition z j 2 n ≤z i for all i,j) ≥ 1 2 . (condition t≥ 16dβt b ∆ 2 min and thus P i≥1 2z i ≤ P ∞ i=1 2 2 i−2 ·16 = 1 2 ) 204 Now we show that e p also satisfies the constraint of OP(t, b ∆ x ). Indeed, for any x̸= b x and i such that x∈G i , we use the facts e p y ≥z i q G i ,κ y for y̸=b x by definition and e p b x ≥ 1 2 ≥ z i n·2 n |K| =z i q G i ,κ b x as well to arrive at: ∥x∥ 2 S(e p) −1 ≤∥x∥ 2 P y∈K z i q G i ,κ y yy ⊤ −1 = 1 z i ∥x∥ 2 P y∈K q G i ,κ y yy ⊤ −1≤ 2d z i = 2 i−1b ∆ 2 min t β t ≤ t b ∆ 2 x β t ; for x =b x, we have e p b x ≥ 1 2 as shown above and thus, ∥b x∥ 2 S(e p) −1 =∥S(e p) −1 b x∥ 2 S(e p) ≥∥S(e p) −1 b x∥ 2 1 2 b xb x ⊤ = 1 2 ∥b x∥ 4 S(e p) −1 =⇒ ∥b x∥ 2 S −1 (e p) ≤ 2. Thus, e p satisfies (4.11). Therefore, X x∈K p x b ∆ x ≤ X x∈K e p x b ∆ x (by the feasibility of e p and the optimality of p) = X x̸=b x e p x b ∆ x ≤ X i≥1 X x∈G i X j≥1 dβ t 2 j−2b ∆ 2 min t q G j ,κ x √ 2 ib ∆ min (by the definition of e p x and G i ) = X i≥1 X x∈G i X j̸=i,j≥1 dβ t q G j ,κ x 2 j− i /2−2b ∆ min t + X i≥1 X x∈G i dβ t q G i ,κ x 2 i /2−2b ∆ min t ≤ X i≥1 X x∈G i X j̸=i,j≥1 dβ t |K|·n· 2 n+j− i /2−2b ∆ min t + X i≥1 X x∈G i dβ t q G i ,κ x 2 i /2−2b ∆ min t (q G j ,κ x = 1 n·2 n ·|K| for x / ∈G j ) ≤ X i≥1 2dβ t 2 i /2−2b ∆ min t ≤ 24dβ t b ∆ min t , (n +j− i 2 ≥ i 2 ) proving the lemma. Lemma 39. Let b ∆ x ∈ h 1 √ r ∆ x , √ r∆ x i for all x∈K for some r > 1, and p = OP(t, b ∆ ) for some t≥ 16rdβt ∆ 2 min . Then P x∈K p x ∆ x ≤ 24rdβt ∆ min t . 205 Proof. By the condition on b ∆ x , we have t≥ 16rdβt ∆ 2 min ≥ 16dβt b ∆ 2 min . Also, the condition implies that b ∆ x ∗ = ∆ x ∗ = 0 and b ∆ x > 0 for all x̸=x ∗ . Therefore, X x∈K p x ∆ x ≤ √ r X x∈K p x b ∆ x ≤ √ r 24dβ t b ∆ min t ≤ 24rdβ t ∆ min t , where the second equality is due to Lemma 38 and the other inequalities follow from b ∆ x ∈ h 1 √ r ∆ x , √ r∆ x i for all x. In the following lemma, we define a problem-dependent quantity M. Lemma 40. Consider the optimization problem: min {Nx} x∈K ,Nx≥0 X x∈K N x ∆ x s.t. ∥x∥ 2 H(N) −1≤ ∆ 2 x 2 ,∀x∈K − , where H(N) = P x∈K N x xx ⊤ andK − =K\{x ∗ }. Define its optimal objective value as c(K,θ). Then, there exist{N ∗ x } x∈K satisfying the constraint of this optimization problem with P x∈K N ∗ x ∆ x ≤ 2c(K,θ) and N ∗ x ∗ being finite. (Define M = P x∈K N ∗ x .) Proof. If there exists an assignment of{N x } x∈K for the optimal objective value which has finite N x ∗, then the lemma trivially holds. Otherwise, consider the optimal solution{ e N x } x∈K with e N x ∗ =∞. According to the constraints, the following holds for all x∈K − : lim N→∞ ∥x∥ 2 Nx ∗ x ∗ ⊤ + P y∈K − e Nyyy ⊤ −1≤ ∆ 2 x 2 . 206 As|K| is finite, by definition, we know for any ϵ, there exists a positive value M ϵ such that for all N≥M ϵ ,∥x∥ 2 Nx ∗ x ∗ ⊤ + P y∈K − e Nyyy ⊤ −1 ≤ ∆ 2 x 2 +ϵ. Choosing ϵ = ∆ 2 min 2 , we have when N≥M ϵ , for all x∈K − , ∥x∥ 2 Nx ∗ x ∗ ⊤ + P x∈K − e Nxxx ⊤ −1 < ∆ 2 x 2 + ∆ 2 min 2 ≤ ∆ 2 x . Therefore, consider the solution{N ∗ x } x∈K where N ∗ x = 2 e N x if x∈K − and N ∗ x ∗ = 2M ϵ . We have ∥x∥ 2 H(N) −1 ≤ ∆ 2 x 2 . Moreover, the objective value is bounded by P x∈K N ∗ x ∆ x = 2 P x∈K −1 e N x ∆ x = 2c(K,θ). Lemma 41. 
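As a brief aside (the proof of Lemma 38 resumes below with the verification that the distribution just defined is valid), the regularized design problem (C.5) from the proof of Lemma 36 can be solved approximately by Frank-Wolfe, since the gradient coordinate at x is exactly hat-Delta_x minus (2/xi) times the squared S(p)^{-1}-norm of x, mirroring the stationarity condition (C.8). The sketch below is illustrative rather than the thesis code; the action set, the gaps, xi, and the iteration count are made up, and a tiny ridge is added for numerical safety.

# Frank-Wolfe sketch for
#   min_{p in simplex}  sum_x p_x * gap_x  -  (2/xi) * ln det( sum_x p_x x x^T ).
import numpy as np

def solve_design(X, gaps, xi, iters=500):
    n, d = X.shape                                    # n actions in R^d, one per row of X
    p = np.full(n, 1.0 / n)                           # start from the uniform design
    for k in range(1, iters + 1):
        S = X.T @ (p[:, None] * X) + 1e-9 * np.eye(d) # S(p), ridged for numerical safety
        S_inv = np.linalg.inv(S)
        grad = gaps - (2.0 / xi) * np.einsum("ij,jk,ik->i", X, S_inv, X)
        j = int(np.argmin(grad))                      # linear minimization over the simplex
        gamma = 2.0 / (k + 2)                         # standard Frank-Wolfe step size
        p = (1 - gamma) * p
        p[j] += gamma
    return p

rng = np.random.default_rng(4)
X = rng.normal(size=(20, 4))                          # 20 illustrative actions in R^4
gaps = np.abs(rng.normal(scale=0.3, size=20))
gaps[0] = 0.0                                         # pretend action 0 has zero estimated gap
p = solve_design(X, gaps, xi=10.0)
print("weight on the zero-gap action:", p[0], " <p, gaps> =", p @ gaps)

The proof of Lemma 38 now continues.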
Suppose b ∆ x ∈ h 1 √ r ∆ x , √ r∆ x i for all x∈K for some r> 1, and p = OP(t, b ∆ ) for some t≥rβ t M, where M is defined in Lemma 40. Then P x∈K p x ∆ x ≤ r 2 βt t c(K,θ). Proof. Recall N ∗ defined in Lemma 40. Define e p, a distribution overK, as the following: e p x = rβtN ∗ x 2t , x̸=x ∗ 1− P x ′ ̸=x ∗e p x ′ x =x ∗ . It is clear that e p is a valid distribution since t≥rβ t M. Also, note that by the definition of M and the condition of t, e p x ∗ = 1− X x ′ ̸=x ∗ rβ t N ∗ x 2t ≥ 1− rβ t M 2t ≥ rβ t M 2t ≥ rβ t N ∗ x ∗ 2t . Below we show that e p satisfies the constraint of OP(t, b ∆ x ). Indeed, for any x̸=x ∗ , ∥x∥ 2 S(e p) −1 ≤∥x∥ 2 P y∈K rβ t N ∗ y 2t yy ⊤ −1 (e p x ∗≥ rβtN ∗ x ∗ 2t ) 207 = 2t rβ t ∥x∥ 2 P y∈K N ∗ y yy ⊤ −1 ≤ t∆ 2 x rβ t (by the constraint in the definition of {N ∗ x } x∈K ) ≤ t b ∆ 2 x β t ; ( b ∆ x ∈ h 1 √ r ∆ x , √ r∆ x i ) for x =x ∗ , we have e p x ∗≥ 1− rβtM 2t ≥ 1 2 by the condition of t: ∥x ∗ ∥ 2 S(e p) −1 =∥S(e p) −1 x ∗ ∥ 2 S(e p) ≥∥S(e p) −1 x ∗ ∥ 2 1 2 x ∗ x ∗⊤ = 1 2 ∥x ∗ ∥ 4 S(e p) −1 =⇒ ∥x ∗ ∥ 2 S −1 (e p) ≤ 2. Notice that b ∆ x ∗∈ h 1 √ r ∆ x ∗, √ r∆ x ∗ i implies ∆ x ∗ = b ∆ x ∗ = 0. Thus, X x∈K p x ∆ x ≤ √ r X x∈K p x b ∆ x ( b ∆ x ∈ h 1 √ r ∆ x , √ r∆ x i ) ≤ √ r X x∈K e p x b ∆ x (by the feasibility of e p and the optimality of p) ≤r √ rβ t X x∈K N ∗ x 2t b ∆ x (by the definition of e p and b ∆ x ∗ = 0) ≤r 2 β t X x∈K N ∗ x 2t ∆ x ( b ∆ x ∈ h 1 √ r ∆ x , √ r∆ x i ) ≤ r 2 β t t c(K,θ), ( P x N ∗ x ∆ x ≤ 2c(K,θ) proven in Lemma 40) finishing the proof. Lemma 42. We have c(K,θ)≤ 48d ∆ min . Proof. The idea is similar to that of Lemma 38. Define G 0 ={x ∗ }, G i = x : ∆ 2 x ∈ [2 i−1 , 2 i )∆ 2 min andn be the largest index such thatG n is not empty. For eachi≥ 1, letq G i ,κ ∈P K withκ = 1 |K|·n·2 n be the distribution such that∥x∥ 2 S(q G i ,κ ) −1 ≤ 2d for all x∈G i (see Lemma 37). Let N x ∗ =∞ and for x∈G i , we let N x = P j≥1 4dq G j ,κ x 2 j−1 ∆ 2 min . Next we show that{N x } x∈K satisfies the constraint (4.3). 208 In fact, fix x∈G i ⊆K − , by definition of {N x } x∈K , we have ∥x∥ 2 ( P x∈K Nxxx ⊤ ) −1≤∥x∥ 2 P x∈K 4dq G i ,κ x xx ⊤ 2 i−1 ∆ 2 min −1 = 2 i−1 ∆ 2 min 4d ∥x∥ 2 S(q G i ,κ ) −1 ≤ 2 i−2 ∆ 2 min ≤ ∆ 2 x 2 , where the first inequality is because S(q G i ,κ ) is invertible. Therefore, the objective value of (4.2) is bounded as follows: X x∈K N x ∆ x = X i≥1 X x∈G i X j≥1 4dq ∗ G j ,x 2 j−1 ∆ 2 min ∆ x (by the definition of N x ) ≤ X i≥1 X x∈G i X j≥1 4dq ∗ G j ,x 2 j− i /2−1 ∆ min (by the definition of G i ) = X i≥1 X x∈G i X j̸=i,j≥1 4dq ∗ G j ,x 2 j− i /2−1 ∆ min + X i≥1 X x∈G i 4dq G i ,κ x 2 i /2−1 ∆ min ≤ X i≥1 X x∈G i X j̸=i,j≥1 4d |K|·n· 2 n+j− i /2−1 ∆ min + X i≥1 4d 2 i /2−1 ∆ min (q G j ,κ x = 1 |K|·n·2 n for j̸=i, x∈G i ) ≤ X i≥1 4d 2 n− i /2−1 ∆ min + X i≥1 4d 2 i /2−1 ∆ min (j≥ 1) ≤ X i 8d 2 i /2−1 ∆ min ≤ 48d ∆ min . (i≤n) Therefore, we have c(K,θ)≤ 48d ∆ min . Lemma 43. (Lemma A.2 of (Shalev-Shwartz and Ben-David, 2014)) Let a≥ 1 and b > 0. If x≥ 4a log(2a) + 2b, then we have x≥a log(x) +b. Lemma 44(ConcentrationinequalityforCatoni’sestimator(Weietal.,2020b)). LetF 0 ⊂···⊂F n be a filtration, and X 1 ,...,X n be real random variables such that X i isF i -measurable,E[X i |F i−1 ] =µ i for some fixed µ i , and P n i=1 E[(X i −µ i ) 2 |F i−1 ]≤V for some fixed V. 
Denote µ ≜ 1 n P n i=1 µ i and 209 let b µ n,α be the Catoni’s robust mean estimator of X 1 ,...,X n with a fixed parameter α> 0, that is, b µ n,α is the unique root of the function f(z) = n X i=1 ψ(α(X i −z)) where ψ(y) = ln(1 +y +y 2 /2), if y≥ 0, − ln(1−y +y 2 /2), else. Then for anyδ∈ (0, 1), as long asn is large enough such thatn≥α 2 (V + P n i=1 (µ i −µ ) 2 )+2 log(1/δ), we have with probability at least 1− 2δ, |b µ n,α −µ |≤ α(V + P n i=1 (µ i −µ ) 2 ) n + 2 log(1/δ) αn . Choosing α optimally, we have |b µ n,α −µ |≤ 2 n v u u t 2 V + n X i=1 (µ i −µ ) 2 ! log(1/δ). In particular, if µ 1 =··· =µ n =µ , we have |b µ n,α −µ |≤ 2 n q 2V log(1/δ). C.2.2 Proof of Theorem 23 To prove the guarantee in the adversarial setting, we first prove Lemma 5, which shows that at any time in Phase 2, b x has the smallest cumulative loss within [1,t]. 210 Proof of Lemma 5. By Assumption 2, for any x, and any t in Phase 1, t X s=1 (ℓ s,xs −ℓ s,x )≤ p C 1 t−C 2 t X s=1 (ℓ s,x − b ℓ s,x ) ≤ p C 1 t− (C 2 − 1) t X s=1 (ℓ s,x − b ℓ s,x ) − t X s=1 (ℓ s,x − b ℓ s,x ), which implies t X s=1 (ℓ s,x − b ℓ s,x ) ≤ 1 C 2 − 1 p C 1 t− t X s=1 (ℓ s,x − b ℓ s,x )− t X s=1 (ℓ s,xs −ℓ s,x ) ! ≤ 1 C 2 − 1 p C 1 t + t X s=1 b ℓ s,x − t X s=1 ℓ s,xs ! . (C.10) At time t 0 , we have with probability at least 1− 2δ, t 0 X s=1 (ℓ s,x − b ℓ s,x ) ≤ 1 C 2 − 1 p C 1 t 0 + t 0 X s=1 b ℓ s,x − t 0 X s=1 ℓ s,xs ! (by (C.10)) ≤ 1 C 2 − 1 2 p C 1 t 0 + t 0 X s=1 b ℓ s,x − t 0 X s=1 y s ! (by Azuma’s inequality) ≤ 1 C 2 − 1 2 p C 1 t 0 + 5 p f T C 1 t 0 + t 0 X s=1 b ℓ s,x − t 0 X s=1 b ℓ s,b x ! (by (4.6)) = 1 C 2 − 1 7 p f T C 1 t 0 +t 0 b ∆ x . (C.11) Bounding the deviation of (t−t 0 )Rob t,x for x̸=b x: For all x̸=b x, the variance of b ℓ τ,x is bounded as follows: Var( b ℓ τ,x )≤E h b ℓ 2 τ,x i ≤E h (x ⊤ e S −1 τ x t x ⊤ t e S −1 τ x) 2 i ≤∥x∥ 2 e S −1 τ ≤ 2∥x∥ 2 S −1 τ , (C.12) 211 where the last inequality is due to e S τ = 1 2 b xb x ⊤ + 1 2 S τ ⪰ 1 2 S τ . Therefore, using Lemma 44 with µ i =ℓ i,x , with probability at least 1− 2δ, for all t in Phase 2 and all x̸=b x, (t−t 0 )·Rob t,x − t X τ=t 0 +1 ℓ τ,x ≤α x 2 t X τ=t 0 +1 ∥x∥ 2 S −1 τ + t X τ=t 0 +1 ℓ τ,x − 1 t−t 0 t X τ ′ =t 0 +1 ℓ τ ′ ,x 2 + 2 log t 2 |K| δ α x ≤α x t X τ=t 0 +1 2∥x∥ 2 S −1 τ + 1 + 4 log t|K| δ α x ≤ 2 v u u t 4 t X τ=t 0 +1 2∥x∥ 2 S −1 τ + 1 log t|K| δ (Choose α x optimally) ≤ 2 v u u t 4 log t|K| δ t X τ=t 0 +1 2τ b ∆ 2 x β τ + 9d ! , (C.13) where the last inequality is due to (4.11). For τ≥t 0 , since b ∆ x = 1 t 0 t 0 X s=1 b ℓ s,x − b ℓ s,b x ! ≥ 20 s f T C 1 t 0 , (C.14) we have b ∆ x ≥ 20 s f T C 1 τ ≥ 20 s f T dβ T τ ≥ 20 s dβ T τ ≥ 20 s dβ τ τ ≥ 3 s dβ τ 2τ and 9d≤ 2τ b ∆ 2 x βτ . Note that h(τ) = τ log(τ|K|/δ) an increasing function when δ≤ 0.1. Using (C.13) and 9d≤ 2τ b ∆ 2 x βτ , we have (t−t 0 )·Rob t,x − t X s=t 0 +1 ℓ s,x ≤ 2 v u u t 4 log t|K| δ t X τ=t 0 +1 4τ b ∆ 2 x β τ ≤ 2 s 16t 2b ∆ 2 x log t|K| δ 1 β t ≤ t b ∆ x 16 . (C.15) 212 For the first t 0 rounds, according to (C.11), we have t 0 X s=1 (ℓ s,x − b ℓ s,x ) ≤ 1 C 2 − 1 7 p f T C 1 t 0 +t 0 b ∆ x ≤ 1.4t 0 b ∆ x C 2 − 1 , (C.16) where the last inequality is due to (C.14). Combining (C.15) and (C.16) and noticing that C 2 ≥ 20, we have for all x̸=b x, t 0 X s=1 (ℓ s,x − b ℓ s,x ) + t X s=t 0 +1 (ℓ s,x −Rob t,x ) ≤ 1.7t b ∆ x 10 . (C.17) Bounding the deviation of P t s=1 b ℓ s,b x (recall that we use the standard average estimator for b x): For the first t 0 rounds, according to (C.11), since b ∆ b x = 0, we have t 0 X s=1 (ℓ s,b x − b ℓ s,b x ) ≤ p f T C 1 t 0 . 
For t ≥ t 0 + 1, according to Freedman’s inequality and the fact that E t h b ℓ 2 s,b x i = E t " y 2 s e p 2 s,b x ·1{x s =b x} # ≤ 1 e p s,b x ≤ 2 as e p s,b x ≥ 1 2 , we have with probability at least 1−δ, t X s=t 0 +1 ℓ s,b x − t X s=t 0 +1 b ℓ s,b x ≤ 2 q 2t log(t|K|/δ) + 2 log(t|K|/δ)≤ p C 1 t. Combining the above two inequalities, we have t X s=1 (ℓ s,b x − b ℓ s,b x ) ≤ 3 p f T C 1 t. (C.18) In sum: combining the bounds for x̸=b x and x =b x, we have for all x̸=b x, t X s=1 (ℓ s,x −ℓ s,b x )≥ t−1 X s=1 (ℓ s,x −ℓ s,b x )− 2 213 ≥ t 0 X s=1 b ℓ s,x − b ℓ s,b x + (t−t 0 − 1)Rob t−1,x − t−1 X s=t 0 +1 b ℓ s,b x − 3 q f T C 1 (t− 1)− 1.7(t− 1) b ∆ x 10 − 2 ((C.17) and (C.18)) ≥ (t− 1) b ∆ t−1,x − 4 q f T C 1 (t− 1)− 1.7(t− 1) b ∆ x 10 (by the definition of b ∆ t−1,x in (4.12)) ≥ (t− 1) b ∆ t−1,x − 3.7(t− 1) b ∆ x 10 (by (C.14)) ≥ 0.02(t− 1) b ∆ x > 0, where the last inequality is because t belongs to Phase 2, which means that at time t− 1, (4.8) is satisfied. Now we are ready to prove our main lemma in the adversarial setting. Proof of Lemma 6. By Lemma 5, we know that the regret comparator is b x. By the regret bound of A and the fact that t 0 ≥L 0 (recall L 0 from Assumption 2), we have t 0 X s=1 (ℓ s,xs −ℓ s,b x )≤O p C 1 t 0 . For the regret in Phase 2, first note that it suffices to consider t not being the last round of this phase (since the last round contributes at most 2 to the regret). Then, consider the following decomposition: t X s=t 0 +1 (ℓ s,xs −ℓ s,b x )≤ t−1 X s=t 0 +1 (y s −ϵ s (x s )−ℓ s,b x ) + 2 = t−1 X s=t 0 +1 y s − b ℓ s,b x | {z } term 1 + t−1 X s=t 0 +1 b ℓ s,b x −ℓ s,b x −ϵ s (b x) | {z } term 2 + t−1 X s=t 0 +1 (ϵ s (b x)−ϵ s (x s )) | {z } term 3 +2. 214 term 1 is upper bounded byO √ f T C 1 t 0 since it corresponds to the termination condition (4.9). term 2 is a martingale difference sequence since E t h b ℓ s,b x −ℓ s,b x −ϵ s (b x) i =E t " (ℓ s,b x +ϵ s (b x))I{x s =b x} e p s,b x − (ℓ s,b x +ϵ s (b x)) # = 0. The variance is upper bounded by E t b ℓ s,b x −ℓ s,b x −ϵ s (b x) 2 =E t (ℓ s,b x +ϵ s (b x)) 2 I{x s =b x} e p s,b x − 1 ! 2 ≤ e p s,b x 1 e p s,b x − 1 ! 2 + (1−e p s,b x ) ≤ 2(1−e p s,b x ), (C.19) where the last term is because e p s,b x ≥ 1 2 . term 3 is also a martingale difference sequence. As ϵ s (x)∈ [−2, 2], its variance can be upper bounded by E t h (ϵ s (b x)−ϵ s (x s )) 2 i ≤ 16E t [I{x s ̸=b x}] = 16(1−e p s,b x ). (C.20) Therefore, with probability at least 1−δ/t, we have term 2 +term 3 =O v u u t t X s=t 0 +1 (1−e p s,b x ) log(t/δ) + log(t/δ) by Freedman’s inequality. As p t = OP(t, b ∆) and t≥t 0 ≥ 400C 1 f T b ∆ 2 min ≥ 16dβt b ∆ 2 min , by Lemma 38, we have 1−e p t,b x = 1 2 1−p t,b x ≤ 1 2 X x p t,x b ∆ x b ∆ min ≤ 12dβ t t b ∆ 2 min . (C.21) 215 Combining the above with b ∆ min ≥ 20 q C 1 f T t 0 , we get term 2 +term 3 =O v u u t log(t/δ) t X s=t 0 +1 dβ t s b ∆ 2 min + log(t/δ) =O s dt 0 β t log(t/δ) log(t) f T C 1 + log(t/δ) ! =O q t 0 log(t/δ) + log(t/δ) where the last step uses the definition of β t and C 1 ≥ 2 15 d log(T|K|/δ) from Assumption 2. Combining all bounds above, we have shown t X s=1 (ℓ s,xs −ℓ s,b x ) =O p C 1 t 0 f T , proving the lemma. Theorem 23 can then be proven by directly applying Lemma 6 to each epoch and using the fact that the number of epochs is at mostO(logT ). C.2.3 Proof of Theorem 22 Next, we prove our results in the corrupted setting. To prove the main lemma Lemma 7, we separate the proof into two parts, Lemma 45 and Lemma 47. Lemma 45. 
In the stochastic setting with corruptions, within a single epoch, 1. with probability at least 1− 4δ, t 0 ≤ max n 900f T C 1 ∆ 2 min , 900C 2 f T C 1 ,L o ; 2. if C≤ 1 30 √ f T C 1 L, then with probability at least 1−δ, b x =x ∗ ; 3. if C≤ 1 30 √ f T C 1 L, then with probability at least 1− 2δ, t 0 ≥ 64f T C 1 ∆ 2 min ; 4. if C≤ 1 30 √ f T C 1 L, then with probability at least 1− 3δ, b ∆ x ∈ [0.7∆ x , 1.3∆ x ] for all x̸=x ∗ . 216 Proof. In the corrupted setting, we can identify ℓ +c t asℓ t in the adversarial setting. We first show the following property: at any t in Phase 1 and with probability at least 1−δ, for any x, C 2 t X s=1 (ℓ s,x − b ℓ s,x ) ≤ p C 1 t +t∆ x + 2C. (C.22) By the guarantee ofA, we have with probability at least 1−δ, for any x and t∈ [T ] C 2 t X s=1 (ℓ s,x − b ℓ s,x ) ≤ p C 1 t + t X s=1 ℓ s,x − t X s=1 ℓ s,xs . (C.23) Since ℓ t,xt ≥ℓ t,x ∗− max x∈K |c t,x |, we have for any t, t X s=1 ℓ s,xs ≥ t X s=1 ℓ s,x ∗−C. (C.24) Combining (C.23) and (C.24), and using ℓ s,x −ℓ s,x ∗≤ ∆ x + max x ′ ∈K |c s,x ′| for any x∈K, we get (C.22). Below, we define Dev t,x ≜ P t s=1 (ℓ s,x − b ℓ s,x ) . Claim 1’s proof: Let t = max n 900f T C 1 ∆ 2 min , 900C 2 f T C 1 ,L o . Below we prove that if Phase 1 has not finished before time t, then for the choice of b x =x ∗ , both (4.6) and (4.7) hold with high probability at time t. Consider (4.6). With probability at least 1− 2δ, t X s=1 y s ≥ t X s=1 ℓ s,xs − p C 1 t (by Azuma’s inequality) ≥ t X s=1 ℓ s,x ∗− p C 1 t−C (by (C.24)) ≥ t X s=1 b ℓ s,x ∗− 2 p C 1 t− 3C (by (C.22) and ∆ x ∗ = 0) 217 ≥ t X s=1 b ℓ s,x ∗− 3 p f T C 1 t, (t≥ 900C 2 f T C 1 and √ f T C 1 t≥ 30C) showing that (4.6) holds for b x =x ∗ . For (4.7), by the regret bound ofA, with probability at least 1− 2δ, for x̸=x ∗ , t X s=1 y s − t X s=1 b ℓ s,x = t X s=1 (y s −ℓ s,xs ) + t X s=1 (ℓ s,xs −ℓ s,x ∗) + t X s=1 (ℓ s,x ∗−ℓ s,x ) + t X s=1 (ℓ s,x − b ℓ s,x ) ≤ p C 1 t + p C 1 t−C 2 Dev t,x ∗ + (−t∆ x +C) +Dev t,x (by the regret bound ofA and Azuma’s inequality) ≤ 2 + 1 30 p f T C 1 t−t∆ x + 1 C 2 p C 1 t +t∆ x + 2C (by (C.22) and that 30C≤ √ f T C 1 t) ≤−0.95t∆ x + 2.1 p f T C 1 t. (C 2 ≥ 20) By the condition oft, we havet∆ x ≥ 30 √ f T C 1 t for allx̸=x ∗ . Thus, the last expression can further be upper bounded by (−30× 0.95 + 2.1) √ f T C 1 t≤−25 √ f T C 1 t, indicating that (4.7) also holds for all x̸=x ∗ . Combining the two parts above finishes the proof. Claim 2’s proof: Note that (4.6) and (4.7) jointly imply that t 0 X s=1 ( b ℓ s,x − b ℓ s,b x )≥ 20 p f T C 1 t 0 ∀x̸=b x. (C.25) However, with probability at least 1−δ, for any x̸=x ∗ , t 0 X s=1 ( b ℓ s,x ∗− b ℓ s,x ) = t 0 X s=1 ( b ℓ s,x ∗−ℓ s,x ∗) + t 0 X s=1 (ℓ s,x ∗−ℓ s,x ) + t 0 X s=1 (ℓ s,x − b ℓ s,x ) ≤Dev t 0 ,x ∗ + (−t 0 ∆ x +C) +Dev t 0 ,x 218 ≤ 1 C 2 p C 1 t 0 + 2C + (−t 0 ∆ x +C) + 1 C 2 p C 1 t 0 +t 0 ∆ x + 2C (by (C.22)) ≤ 5 p f T C 1 t 0 . (using C≤ 1 30 √ f T C 1 t 0 and C 2 ≥ 20) Therefore, to make (C.25) hold, it must be that b x =x ∗ . Claim 3’s proof: Suppose that t 0 ≤ 64f T C 1 ∆ 2 min , and let x be such that ∆ x = ∆ min . Then we have t 0 X s=1 b ℓ s,x − b ℓ s,x ∗ ≤ t 0 X s=1 (ℓ s,x −ℓ s,x ∗) +Dev t 0 ,x +Dev t 0 ,x ∗ (hold w.p. 1−δ) ≤ (t 0 ∆ min +C) + 1 C 2 2 p C 1 t 0 +t 0 ∆ min + 4C (hold w.p. 1−δ by (C.22)) ≤ 2t 0 ∆ min + 2 p f T C 1 t 0 (by C≤ 1 30 √ f T C 1 t 0 and C 2 = 20) ≤ 16 p f T C 1 t 0 + 2 p f T C 1 t 0 (t 0 ≤ 64f T C 1 ∆ 2 min ) = 18 p f T C 1 t 0 . Recall that (C.25) needs to hold, and recall from Claim 2 that b x =x ∗ holds with probability 1−δ. 
Thus, the bound above is a contradiction. Therefore, with probability 1− 2δ, t 0 ≥ 64f T C 1 ∆ 2 min . Claim 4’s proof. For notational convenience, denote the set [a−b,a +b] by [a±b]. We have t 0 b ∆ x = t 0 X s=1 b ℓ s,x − b ℓ s,x ∗ ∈ " t 0 X s=1 (ℓ s,x −ℓ s,x ∗)± (Dev t 0 ,x +Dev t 0 ,x ∗) # ⊆ t 0 ∆ x ± C + 1 C 2 2 p C 1 t 0 +t 0 ∆ x + 4C (hold w.p. 1−δ by (C.22)) ⊆ t 0 ∆ x ± 1 C 2 t 0 ∆ x + p f T C 1 t 0 (using C≤ 1 30 √ f T C 1 t 0 and C 2 ≥ 20) ⊆ t 0 ∆ x ± 1 C 2 t 0 ∆ x + 1 8 t 0 ∆ x (by Claim 3, t 0 ≥ 64f T C 1 ∆ 2 min holds w.p. 1− 2δ) ⊆ [(1± 0.3)t∆ x ], 219 which finishes the proof. The next lemma shows that when L grows large enough compared to the total corruption C, the termination condition (4.9) will never be satisfied once the algorithm enters Phase 2. Lemma 46. Algorithm 20 guarantees that with probability at least 1− 10δ, for any t in Phase 2, when 0≤C≤ 1 30 √ f T C 1 L, we have t X s=t 0 +1 y s − b ℓ s,b x ≤ 20 p f T C 1 t 0 . Furthermore, when t≥M ′ = 10β t M (M is the constant defined in Lemma 40), we have t X s=t 0 +1 y s − b ℓ s,b x =O c(K,θ) logT log(T|K|/δ) + p f T C 1 t 0 +dβ M ′ √ M ′ . Proof. Recall that y s =ℓ s,xs +ϵ s (x s ) and b ℓ s,b x = ℓ s,b x +ϵs(b x) e p s,b x 1{x s =b x}. Thus, t X s=t 0 +1 y s − b ℓ s,b x = t X s=t 0 +1 ℓ s,xs −ℓ s,b x | {z } term 1 + t X s=t 0 +1 ℓ s,b x − ℓ s,b x e p s,b x 1{x s =b x} ! | {z } term 2 (C.26) + t X s=t 0 +1 (ϵ s (x s )−ϵ s (b x)) | {z } term 3 + t X s=t 0 +1 ϵ s (b x)− ϵ s (b x) e p s,b x 1{x s =b x} ! | {z } term 4 (C.27) 220 Except for term 1 , all terms are martingale difference sequences. Let E 0 be the expectation taken over the randomness before Phase 2. Similar to the calculation in (C.19) and (C.20), we have E s ℓ s,b x − ℓ s,b x e p s,b x 1{x s =b x} ! 2 ≤ 2E 0 [1−e p s,b x ], E s ϵ s (b x)− ϵ s (b x) e p s,b x 1{x s =b x} ! 2 ≤ 8E 0 [1−e p s,b x ] and E s h (ϵ s (x s )−ϵ s (b x)) 2 i ≤ 16E 0 h 1−e p s,b x i . By Freedman’s inequality, we have with probability at least 1− 3δ, for all t in Phase 2, term 2 +term 3 +term 4 ≤ 2 v u u t 2 t X s=s 0 +1 E 0 [1−e p s,b x ] log(T/δ) + log(T/δ) + 2 v u u t 16 t X s=s 0 +1 E 0 [1−e p s,b x ] log(T/δ) + 4 log(T/δ) + 2 v u u t 8 t X s=s 0 +1 E 0 [1−e p s,b x ] log(T/δ) + 2 log(T/δ) ≤ 20 v u u t t X s=s 0 +1 E 0 [1−e p s,b x ] log(T/δ) + 7 log(T/δ). Then we deal with term 1 . Again, by Freeman’s inequality with probability at least 1−δ, for all t in Phase 2, t X s=t 0 +1 (ℓ s,xs −ℓ s,b x )≤ t X s=t 0 +1 X x̸=b x e p s,x (ℓ s,x −ℓ s,b x ) + 4 v u u t t X s=t 0 +1 E 0 [1−e p s,b x ] log(T/δ) + 2 log(T/δ) ≤C + t X s=t 0 +1 X x̸=b x e p s,x (∆ x − ∆ b x ) + 4 v u u t t X s=t 0 +1 E 0 [1−e p s,b x ] log(T/δ) + 2 log(T/δ) ≤C + 1 2 t X s=t 0 +1 X x̸=b x p s,x ∆ x + 4 v u u t t X s=t 0 +1 E 0 [1−e p s,b x ] log(T/δ) + 2 log(T/δ). (e p s,x = 1 2 p s,x for x̸=b x) 221 When C ∈ [0, 1 30 √ f T C 1 L] ⊆ [0, 1 30 √ f T C 1 t 0 ], according to Lemma 45, we know that with probability 1− 4δ, b x =x ∗ and b ∆ x ∈ [0.7∆ x , 1.3∆ x ]. Also by Lemma 45, with probability 1− 2δ, for any s≥t 0 , we have s≥t 0 ≥ 64f T C 1 ∆ 2 min ≥ 48dβs ∆ 2 min . These conditions satisfy the requirement in Lemma 39 withr = 3. Therefore we can apply Lemma 39 and get X x∈K p s,x ∆ x ≤ 72dβ s ∆ min s (C.28) for all s≥t 0 . Combining all the above, we get t X s=t 0 +1 y s − b ℓ s,b x =C + 72dβ t logt ∆ min + 24 v u u t t X s=t 0 +1 E 0 [1−e p s,b x ] log(T/δ) + 9 log(T/δ). As argued in (C.21), 1−e p s,b x ≤ 12dβs b ∆ 2 min s . 
Therefore, the above can be further upper bounded by t X s=t 0 +1 y s − b ℓ s,b x ≤C + 96dβ t logt b ∆ min + 24 v u u t t X s=t 0 +1 12dβ s b ∆ 2 min s log(T/δ) ( b ∆ x ∈ [0.7∆ x , 1.3∆ x ], 1−e p s,b x ≤ 12dβs b ∆ 2 min s ) ≤C + 96dβ t logt b ∆ min + 144dβ T logT b ∆ min (by definition of β T ) ≤C + 10 s t 0 f T C 1 dβ T logT (t 0 ≥ 64f T C 1 ∆ 2 min ≥ 24f T C 1 b ∆ 2 min ) ≤ 1 30 p f T C 1 t 0 + 10 s t 0 f T C 1 dβ T logT (C≤ 1 30 √ f T C 1 t 0 ) ≤ 20 p f T C 1 t 0 . (C 1 ≥dβ T ) 222 Below, we use an alternative way to bound P x∈K p s,x ∆ x . Let M ′ ≥ 20β M M, which implies M ′ ≥ 10β M ′M. For s∈ [t 0 + 1,M ′ ], we use Lemma 36, and bound M ′ X s=t 0 +1 X x∈K p s,x ∆ x ≤ 1 0.7 M ′ X s=t 0 +1 X x∈K p s,x b ∆ x ≤ 1 0.7 M ′ X s=t 0 +1 dβ s √ s ≤O dβ M ′ √ M ′ . (C.29) For s>M ′ , we use Lemma 41 and bound t X s=M ′ +1 X x∈K p s,x ∆ x ≤ t X s=M ′ +1 O β s s c(K,θ) =O (c(K,θ)β t logt). (C.30) Combining (C.29) and (C.30) and following a similar analysis in the previous case, we have t X s=t 0 +1 ℓ s,xs −ℓ s,b x ≤C +O dβ M ′ √ M ′ +c(K;θ)β t logt + 4 v u u t t X s=t 0 +1 E 0 [1−e p s,b x ] log(T/δ) + 2 log(T/δ) ≤O c(K,θ) logT log(T|K|/δ) + p f T C 1 t 0 +dβ M ′ √ M ′ . Now we are ready to show that once L grows large enough, Phase 2 never ends. Lemma 47. If C≤ 1 30 √ f T C 1 L, then with probability at least 1− 15δ, Phase 2 never ends. Proof. It suffices to verify the two termination conditions (4.8) and (4.9) are never satisfied. (4.9) does not hold because of Lemma 46. Consider (4.8). Let t be in Phase 2 and x̸=b x. According to (C.17) and (C.18), we have with probability 1− 5δ, t 0 X s=1 (ℓ s,x − b ℓ s,x ) + t X s=t 0 +1 (ℓ s,x −Rob t,x ) ≤ 1.7t b ∆ x 10 , 223 t X s=1 (ℓ s,b x − b ℓ s,b x ) ≤ 3 p f T C 1 t≤ 0.15t b ∆ x . Therefore, we have t b ∆ t,x −t∆ x ≤ t 0 X s=1 (ℓ s,x − b ℓ s,x ) + t X s=t 0 +1 (ℓ s,x −Rob t,x ) + t X s=1 (ℓ s,b x − b ℓ s,b x ) +C ≤ 0.32t b ∆ x +C≤ 0.372t b ∆ x . (C≤ 1 30 √ f T C 1 t≤ 0.052t b ∆ min ) This means that t b ∆ t,x ≤t∆ x + 0.372t b ∆ x ≤ 1 0.7 t b ∆ x + 0.372t b ∆ x ≤ 1.81t b ∆ x , t b ∆ t,x ≥t∆ x − 0.372t b ∆ x ≥ 1 1.3 t b ∆ x − 0.372t b ∆ x ≥ 0.39t b ∆ x . Therefore, (4.8) is not satisfied. Finally, we prove the regret bound for the corrupted stochastic setting. Proof of Theorem 22. First, we consider the pure stochastic setting with C = 0. According to Lemma 45, we know that the algorithm has only one epoch as C≤ 1 30 √ f T C 1 L is satisfied in the first epoch. Specifically, after at most 900f T C 1 ∆ 2 min rounds in Phase 1, the algorithm goes to Phase 2 and never goes back to Phase 1. Then we can directly apply the second claim in Lemma 46 to get the regret bound in the stochastic setting. Specifically, we bound the regret in Phase 1 by O √ C 1 L 0 + r C 1 · 900f T C 1 ∆ 2 min =O √ C 1 L 0 + C 1 √ logT ∆ min . For the regret in Phase 2, according to the second claim in Lemma 46, we bound the regret byO c(K;θ) logT log T|K| δ +dβ M ′ √ M ′ = O c(K;θ) logT log T|K| δ +M ∗ log 3 2 1 δ for some problem-dependent constant M ∗ . Combining them together proves the first claim. 224 Now we consider the corrupted stochastic setting with C > 0. Suppose that we are in the epoch with L = L ∗ , which is the first epoch such that L ∗ ≥ max n 900f T C 1 ∆ 2 min , 900C 2 f T C 1 o . Therefore, in previous epochs, we have L≤ max n 900f T C 1 ∆ 2 min , 900C 2 f T C 1 o . According to Lemma 45, we have t 0 ≤ max n 900f T C 1 ∆ 2 min , 900C 2 f T C 1 o in the previous epoch and L ∗ = 2t 0 . 
We bound the regret before this epoch, as well as the regret in the first phase of this epoch by the adversarial regret bound: O p C 1 f T L ∗ + p C 1 L 0 =O p C 1 L 0 + p C 1 f T × √ f T C 1 ∆ min + C √ f T C 1 !! =O p C 1 L 0 + f T C 1 ∆ min +C . If we use GeometricHedge.P as the adversarial linear bandit algorithm and f T = logT, then the above is upper bounded by O d logT log(T|K|/δ) ∆ min +C . For Phase 2 of the epoch withL =L ∗ , according to Lemma 45, we know that this phase will never end and by definition of L ∗ , we have C≤ 1 30 √ f T C 1 L. Note that in this phase b ∆ x ∈ [0.7∆ x , 1.3∆ x ]. Therefore, by taking a summation over t on (C.28), we bound the regret in this interval by O dβ T logT ∆ min =O d logT log(T|K|/δ) ∆ min . Combining the regret bounds finishes the proof of the second claim. 225 C.2.4 Adversarial linear bandit algorithms with high-probability bounds In this section, we show that the algorithms of (Bartlett et al., 2008) and (Lee et al., 2020) both satisfy Assumption 2. Algorithm 37: GeometricHedge.P 1 Input:K,γ,η,δ ′ , and John’s exploration distribution q∈P K . 2 Set∀x∈K, w 1 (x) = 1, and W 1 =|K|. 3 for t = 1 to T do 4 Set p t (x) = (1−γ) wt(x) Wt +γq(x), ∀x∈K. 5 Sample x t according to distribution p t . 6 Observe loss y t =ℓ t,xt +ϵ t (x t ), where ℓ t,xt =⟨x t ,ℓ t ⟩. 7 Compute S(p t ) = P x∈K p t (x)xx ⊤ and b ℓ t =S(p t ) −1 x t ·y t . 8 ∀x∈K, compute b ℓ t,x = D x, b ℓ t E , e ℓ t,x = b ℓ t,x − 2∥x∥ 2 S(pt) −1 s log(1/δ ′ ) dT , w t+1 =w t (x) exp −η e ℓ t,x . Compute W t+1 = P x∈K w t+1 (x). GeometricHedge.P We first show GeometricHedge.P in Algorithm 37 for completeness. We remark the differences between the original version and one shown here. First, we consider the noisy feedback y t instead of the zero-noise feedback ℓ t,xt . However, most analysis in (Bartlett et al., 2008) still holds. Second, instead of using the barycentric spanner exploration (known to be suboptimal), we use John’s exploration shown to be optimal in Bubeck et al. (2012). With this replacement, Lemma 3 in Bartlett et al. (2008) can be improved to| b ℓ t,x |≤d/γ and∥x∥ 2 S(pt) −1 ≤d/γ. Now, consider martingale difference sequence M t (x) = b ℓ t,x −ℓ t,x . We have|M t (x)|≤ d γ + 1≜b and σ = v u u t T X t=1 Var t (M t )≤ v u u t T X t=1 ∥x∥ 2 S(pt) −1 . 226 Using Lemma 2 in (Bartlett et al., 2008), we have that with probability at least 1− 2δ ′ log 2 T (set δ ′ =δ/(|K| log 2 (T ))), T X t=1 ( b ℓ t,x −ℓ t,x ) ≤ 2 max 2σ,b q log(1/δ ′ ) q log(1/δ ′ ) ≤ 4σ q log(1/δ ′ ) + 2b· log(1/δ ′ ) ≤ 4 v u u t T X t=1 ∥x∥ 2 S(pt) −1 q log(1/δ ′ ) + 2 d γ + 1 log(1/δ ′ ) ≤ 1 C 2 T X t=1 ∥x∥ 2 S(pt) −1 s log(1/δ ′ ) dT + 4C 2 q dT log(1/δ ′ ) + 2 d γ + 1 log(1/δ ′ ) | {z } ≜Dev T,x , (C.31) where the last inequality is by AM-GM inequality. Note that b ℓ t,x = e ℓ t,x + 2∥x∥ 2 S(pt) −1 q log(1/δ ′ ) dT . Plugging this into (C.31), we have with probability at least 1− 2δ ′ log 2 T, T X t=1 e ℓ t,x ≤ T X t=1 ℓ t,x − T X t=1 ∥x∥ 2 S(pt) −1 s log(1/δ ′ ) dT + 4C 2 q dT log(1/δ ′ ) + 2 d γ + 1 log(1/δ ′ ). (C.32) The counterpart of Lemma 6 in Bartlett et al. (2008) shows that with probability at least 1−δ, T X t=1 ℓ t,xt − T X t=1 X x∈K p t (x) b ℓ t,x ≤ ( √ d + 1) q 2T log(1/δ) + 4 3 log(1/δ) d γ + 1 . (C.33) Using (C.32), we have the counterpart of Lemma 7 in Bartlett et al. (2008) as follows: with probability at least 1− 2δ, γ T X t=1 X x∈K q(x) e ℓ t,x ≤γ T X t=1 X x∈K q(x)ℓ t,x + 4C 2 γ q dT log(1/δ ′ ) + 2γ d γ + 1 log(1/δ ′ ) 227 ≤γT + 4C 2 γ q dT log(1/δ ′ ) + 2 (d +γ) log(1/δ ′ ). 
(C.34) The counterpart of Lemma 8 in Bartlett et al. (2008) is: with probability at least 1−δ, T X t=1 X x∈K p t (x) b ℓ 2 t,x ≤dT + d γ q 2T log(1/δ). (C.35) Plugging (C.33), (C.34), and (C.35), into Equation (2) in (Bartlett et al., 2008), with have with probability at least 1− 3δ, log W T +1 W 1 (C.36) ≤ η 1−γ − T X t=1 ℓ t,xt + 2 q dT log(1/δ ′ ) + ( √ d + 1) q 2T log(1/δ) + 4 3 log(1/δ) d γ + 1 +γT + 4C 2 γ q dT log(1/δ ′ ) + 2 (d +γ) log(1/δ ′ ) + 2ηdT + 2ηd γ q 2T log(1/δ) + 8η log(1/δ ′ ) √ dT . (C.37) Again using (C.32) and Equation (4) in (Bartlett et al., 2008), we have with probability at least 1−δ, for all x∈K, log W T +1 W 1 ≥−η T X t=1 e ℓ t,x ! − log|K| ≥−η T X t=1 ℓ t,x +η T X t=1 ∥x∥ 2 S(pt) −1 s log(1/δ ′ ) dT − 4ηC 2 q dT log(1/δ ′ )− 2η d γ + 1 log(1/δ ′ )− log|K|. 228 Combining this with (C.37) and assuming γ≤ 1 2 , we have that with probability at least 1− 5δ, for every x∈K, T X t=1 ℓ t,xt (C.38) ≤ T X t=1 ℓ t,x − 1 2 T X t=1 ∥x∥ 2 S(pt) −1 s log(1/δ ′ ) dT + 4C 2 q dT log(1/δ ′ ) + 2 d γ + 1 log(1/δ ′ ) + log|K| η + 2 q dT log(1/δ ′ ) + ( √ d + 1) q 2T log(1/δ) + 4 3 log(1/δ) d γ + 1 +γT + 2C 2 q dT log(1/δ ′ ) + 2 (d +γ) log(1/δ ′ ) + 2ηdT + 2ηd γ q 2T log(1/δ) + 8η log(1/δ ′ ) √ dT. Recalling the definition of Dev T,x in (C.31) and combining terms, we have T X t=1 ℓ t,xt ≤ T X t=1 ℓ t,x −C 2 ·Dev T,x +O q dT log(1/δ ′ ) + d γ log(1/δ ′ ) + log|K| η +γT + 2ηdT + 2ηd γ q 2T log(1/δ) + 8η log(1/δ ′ ) √ dT. (C.39) It remains to decide η and γ. Note that the analysis of (Bartlett et al., 2008) requires|η e ℓ t,x |≤ 1. From the proof of Lemma 4 in (Bartlett et al., 2008), we know that|η e ℓ t,x |≤ ηd γ 1 + 2 q log(1/δ ′ ) dT . Thus, we set η =γ/ d + 2d q log(1/δ ′ ) dT so that|η e ℓ t,x |≤ 1 always holds. Therefore, T X t=1 ℓ t,xt ≤ T X t=1 ℓ t,x −C 2 ·Dev T,x +O q dT log(1/δ ′ ) + d γ log(|K|/δ ′ ) + 2 γ s d log 3 (|K|/δ ′ ) T + 3γT + 8γ log(1/δ ′ ) s T d . Choosing γ = min 1 2 , q d log(|K|/δ ′ ) T , we have with probability at least 1− 7δ, for all x∈K, T X t=1 ℓ t,xt ≤ T X t=1 ℓ t,x −C 2 ·Dev T,x +O q dT log(|K|/δ ′ ) +d log 3 2 (|K|/δ ′ ) 229 ≤ T X t=1 ℓ t,x −C 2 ·Dev T,x +O q dT log(|K| log 2 (T )/δ) +d log 3 2 (|K| log 2 (T )/δ) (δ ′ =δ/(|K| log 2 (T ))) ≤ T X t=1 ℓ t,x −C 2 T X t=1 (ℓ t,x − b ℓ t,x ) +O q dT log(|K| log 2 (T )/δ) +d log 3 2 (|K| log 2 (T )/δ) , ((C.31)) which proves (4.4). The algorithm of (Lee et al., 2020) Now we introduce another high-probability adversarial linear bandit algorithm from (Lee et al., 2020). The regret bound of this algorithm is slightly worse than (Bartlett et al., 2008). However, the algorithm is efficient when there are infinite or exponentially many actions. For the concrete pseudocode of the algorithm, we refer the readers to Algorithm 2 of (Lee et al., 2020). Here, we focus on showing that it satisfies (4.4). We first restate Lemma B.15 in (Lee et al., 2020) with explicit logarithmic factors: Algorithm 2 of (Lee et al., 2020) with η≤ C 3 d 2 lg 3 log(lg/δ) for some universal constant C 3 > 0 guarantees that with probability at least 1−δ, T X t=1 ⟨x t −x,ℓ t ⟩ (C.40) ≤O d logT η +ηd 2 T + lg 2 q T log(lg/δ) +Dev T,x · C 4 − 1 C 5 ηd 2 lg 3 p T log(lg/δ) ! , (C.41) where lg = log(dT ),Dev T,x isanupperboundon P T t=1 (ℓ t,x − b ℓ t,x ) withprobability 1−δ,C 4 ,C 5 > 0 are two universal constants, and we replace the self-concordant parameter in their bound by a trivial upper bound d. 
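Before completing the tuning of the Lee et al. (2020) algorithm below, the GeometricHedge.P update of Algorithm 37, together with the noisy feedback y_t and the parameter choices for γ, η, and δ′ derived above, can be assembled into a short Python sketch. Computing John's exploration distribution is outside the scope of this sketch, so q is taken as an input (with a uniform placeholder), `loss_oracle` is a hypothetical interface standing in for the environment, and log-space weights are used purely for numerical stability.

```python
import numpy as np

def geometric_hedge_p(K, T, delta, loss_oracle, q=None, rng=None):
    """Sketch of GeometricHedge.P (Algorithm 37) for a finite action set.

    K: (n, d) array of action vectors; loss_oracle(t, x_t) returns the noisy loss y_t.
    q: exploration distribution over the n actions (John's exploration in the text);
       a uniform placeholder is used here if none is given.
    """
    rng = np.random.default_rng(rng)
    n, d = K.shape
    q = np.full(n, 1.0 / n) if q is None else q
    delta_p = delta / (n * np.log2(T))                       # delta' = delta / (|K| log2 T)
    gamma = min(0.5, np.sqrt(d * np.log(n / delta_p) / T))   # exploration rate from the analysis
    eta = gamma / (d + 2 * d * np.sqrt(np.log(1 / delta_p) / (d * T)))
    shift = 2 * np.sqrt(np.log(1 / delta_p) / (d * T))       # optimism coefficient in ell-tilde

    log_w = np.zeros(n)                                      # log weights, w_1(x) = 1
    total_loss = 0.0
    for t in range(1, T + 1):
        w = np.exp(log_w - log_w.max())
        p = (1 - gamma) * w / w.sum() + gamma * q
        i = rng.choice(n, p=p)
        y = loss_oracle(t, K[i])                             # y_t = <x_t, ell_t> + noise
        total_loss += y

        S = (K * p[:, None]).T @ K                           # S(p_t) = sum_x p_t(x) x x^T
        S_inv = np.linalg.pinv(S)
        ell_hat = K @ (S_inv @ (K[i] * y))                   # ell_hat_{t,x} = x^T S(p_t)^{-1} x_t y_t
        x_norms = np.einsum('ij,jk,ik->i', K, S_inv, K)      # ||x||^2_{S(p_t)^{-1}}
        ell_tilde = ell_hat - shift * x_norms                # optimistic shift as in Algorithm 37
        log_w -= eta * ell_tilde
    return total_loss
```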
230 Therefore, choosing η = min C 3 d 2 lg 3 log(lg/δ) , 1 2C 4 C 5 d 2 lg 3 √ T log(lg/δ) , 1 2C 2 C 5 d 2 lg 3 √ T log(lg/δ) for some C 2 ≥ 20, the coefficient of Dev T,x becomes at most−C 2 , leading to T X t=1 ⟨x t −x,ℓ t ⟩≤O d 3 lg 4 log(lg/δ) +d 3 lg 4 q T log(lg/δ) −C 2 ·Dev T,x ≤O d 3 lg 4 log(lg/δ) +d 3 lg 4 q T log(lg/δ) −C 2 · T X t=1 (ℓ t,x − b ℓ t,x ) . Finally, using a union bound over all x similar to Theorem B.16 of (Lee et al., 2020), we get with probability at least 1− 2δ, for every x∈K, T X t=1 ⟨x t −x,ℓ t ⟩≤O d 3 lg 4 log(lg/δ ′′ ) +d 3 lg 4 q T log(lg/δ ′′ ) −C 2 T X t=1 (ℓ t,x − b ℓ t,x ) where δ ′′ =δ/(|K|T ). Therefore, we conclude that this algorithm satisfies (4.4) as well. C.3 Omitted details in Section 4.4 C.3.1 Concentration Inequalities Lemma 48 (Freedman’s inequality, Theorem 1 of (Beygelzimer et al., 2011)). LetF 0 ⊂F 1 ⊂···⊂F n be a filtration, and X 1 ,...,X n be real random variables such thatX i isF i -measurable,E[X i |F i−1 ] = 0, |X i |≤b, and P n i=1 E[X 2 i |F i−1 ]≤V for some fixed b≥ 0 and V ≥ 0. Then with probability at least 1−δ, n X i=1 X i ≤ 2 q V log(1/δ) +b log(1/δ). Lemma 49 (Freedman’s inequality, Lemma 4.4 of (Bubeck and Slivkins, 2012)). LetF 0 ⊂F 1 ⊂ ···⊂F n be a filtration, and X 1 ,...,X n be real random variables such that X i isF i -measurable, 231 E[X i |F i−1 ] = 0,|X i |≤b, for some fixed b≥ 0. Let V n = P n i=1 E[X 2 i |F i−1 ]. Then with probability at least 1−δ, n X i=1 X i ≤ 2 q V n log(n/δ) + 3b log(n/δ). Lemma 50. LetF 0 ⊂F 1 ⊂···⊂F T be a filtration, and X 1 ,...,X T be real random variables such that X t isF t -measurable, E[X t |F t−1 ] = 0,|X t |≤b, for some fixed b≥ 0. Let z t ∼ Bernoulli(α) be an i.i.d. random variable independent of all other variables, and let 0≤y t ≤b be a deterministic scalar givenF t−1 . Then with probability at least 1−δ, the following holds for allI = [t 1 ,t 2 ]⊆ [1,T ]: X t∈I y t (z t −α) ≤ min 4 s α X t∈I y 2 t log(T/δ) + 9b log(T/δ), 1 4 α X t∈I y t + 21b log(T/δ) . Proof. Fixing an intervalI∈ [1,T], we apply Lemma 59 with X t =y t (z t −α). Then we get that with probability at least 1− 2δ ′ , X t∈I y t (z t −α) ≤ 2 s X t∈I y 2 t E t [(z t −α) 2 ] log(T/δ ′ ) + 3b log(T/δ ′ ) (define E t [·] =E[·|F t−1 ]) ≤ 2 s α X t∈I y 2 t log(T/δ ′ ) + 3b log(T/δ ′ ) (E t [(z t −α) 2 ] =α(1−α) 2 + (1−α)α 2 ≤α) ≤ 2 q b log(T/δ ′ ) s α X t∈I y t + 3b log(T/δ ′ ) (|y t |≤b) ≤ 1 4 α X t∈I y t + 7b log(T/δ ′ ) (AM-GM) Notice that there are T (T−1) 2 different I’s, so we pick δ ′ = δ T (T−1) , and take an union bound over I’s. This gives the desired bound. 232 C.3.2 Omitted proofs in Section 4.4.1 We start with some extra notations to be used in this section. Definition 4. For any time t, base algorithm i, and policy π, define C a t,i ≜ P t τ=1 1[i τ = i]c τ and C r t,i ≜ r P t τ=1 1[i τ =i] P t τ=1 1[i τ =i]c 2 τ . Similarly, when we write C t,i to indicate either C a t,i or C r t,i , depending on the type of base algorithms we use. Definition 5. For any time t, base algorithm i, and policy π, define R π t,i = P t τ=1 1[i τ =i]r π τ . Next, we prove some lemmas to be used in the later analysis. Lemma 51. In BASIC (Algorithm 22), for any fixed i, with probability at least 1− 2δ, the following holds for all t: 3 4 α i t− 21 log(T/δ)≤N t,i ≤ 5 4 α i t + 21 log(T/δ). Proof. This is by directly applying Lemma 50 with y t = 1 and α =α i Lemma 52. 
For any fixed i, with probability at least 1− 3δ, the following holds for all t: C a t,i ≤ 1.25α i C a t + 21c max log(T/δ), C r t,i ≤ 1.25α i C r t + 8c max q α i t log(T/δ) + 21c max log(T/δ). Proof. We prove the lemma for (C t,i ,C) = (C a t,i ,C a ) and (C t,i ,C) = (C r t,i ,C r ) cases separately. Case 1. (C t,i ,C) = (C a t,i ,C a ). C t,i = t X τ=1 c τ 1[i τ =i]≤ 5 4 α i C + 21c max log(T/δ). (holds w.p.≥ 1−δ by Lemma 50 with y τ =c τ , z τ =1[i τ =i]) 233 Case 2. (C t,i ,C) = (C r t,i ,C r ). C t,i = v u u t t X τ=1 1[i τ =i] ! t X τ=1 1[i τ =i]c 2 τ ! ≤ v u u t 5 4 α i t + 21 log(T/δ) 5 4 α i t X τ=1 c 2 τ + 21c 2 max log(T/δ) ! (holds w.p.≥ 1− 2δ by Lemma 50 with y τ = 1 and y τ =c 2 τ ) ≤ v u u t 25 16 α 2 i t t X τ=1 c 2 τ + 52.5α i tc 2 max log(T/δ) + 21 2 c 2 max log 2 (T/δ) ≤ 5 4 α i v u u t t t X τ=1 c 2 τ + 8c max q α i t log(T/δ) + 21c max log(T/δ) ( √ a +b +c≤ √ a + √ b + √ c) = 5 4 α i C + 8c max q α i t log(T/δ) + 21c max log(T/δ). Lemma 53. For any i, with probability at least 1−O(δ), the following holds for all t such that C t ≤ 2 i : R π ⋆ t,i −R t,i ≤R(N t,i ,θ i ). Proof. The total amount of corruption experienced by ALG i up to round t is C t,i , whose upper bound is given in Lemma 52 for both types of base algorithms. Comparing the upper bounds of C t,i with our choice of θ i in (4.14), we see that for a fixed i, under the condition C t ≤ 2 i , we have C t,i ≤ θ i with probability 1−O(δ). In other words, the condition specified in Assumption 4 is satisfied for ALG i in the rounds that it is executed. Therefore, by the regret bound in Assumption 4, we have R π ⋆ t,i −R t,i = t X τ=1 (r π ⋆ τ −r τ )1[i τ =i]≤R(N t,i ,θ i ). 234 Lemma 54. For any fixed i, with probability at least 1−δ, the following holds for all t: 1 α i R π ⋆ t,i − t X τ=1 r π ⋆ t,i ≤ 2 s t log(T/δ) α i + log(T/δ) α i . Proof. By Lemma 48, for a fixed i, with probability 1−δ, for all t, 1 α i R π ⋆ t,i − t X τ=1 r π ⋆ t,i = t X τ=1 1[i τ =i] α i − 1 r π ⋆ τ ≤ 2 s t log(T/δ) α i + log(T/δ) α i . Proof of Lemma 8. Notice that C t ≤ 2 k implies that C t ≤ 2 i for all i∈ [k,k max ]. Notice that R t,i −R π ⋆ t,i = t X τ=1 1[i τ =i](r τ −r π ⋆ τ ) = t X τ=1 1[i τ =i] r τ −µ πτ (x τ ) +µ πτ (x τ )−µ π ⋆ (x τ ) | {z } ≤0 (by Assumption 3) +µ π ⋆ (x τ )−r π ⋆ τ ≤ t X τ=1 1[i τ =i] r τ −µ πτ τ (x τ ) +µ π ⋆ τ (x τ )−r π ⋆ τ + 2C a t,i (|µ π τ (x τ )−µ π (x τ )|≤c τ for all π and τ) ≤ 2 q 2α i t log(T/δ) + 6 log(T/δ) + 2C t,i (by Lemma 59 with an union bound over t, and that C a t,i ≤C r t,i ) ≤ 2 q 2α i t log(T/δ) + 6 log(T/δ) + 2θ i . (C t,i ≤θ i with high probability by Lemma 52) (C.42) 235 Combining Lemma 53 and Lemma 54, we see that the performance of ALG i admits the following lower bound with probability at least 1−O(δ): R t,i α i ≥ R π ⋆ t,i −R(N t,i ,θ i ) α i ≥ t X τ=1 r π ⋆ τ − R(N t,i ,θ i ) α i − 2 s t log(T/δ) α i − log(T/δ) α i . (C.43) Combining (C.42) and Lemma 54, we also have the following with probability at least 1−O(δ): R t,i α i ≤ R π ⋆ t,i α i + 3 s t log(T/δ) α i + 6 log(T/δ) + 2θ i α i ≤ t X τ=1 r π ⋆ τ + 5 s t log(T/δ) α i + 7 log(T/δ) + 2θ i α i . (C.44) The bounds (C.43) and (C.44) together with an union bound over i’s indicate that the following holds for all i,j∈ [k,k max ] with probability 1−O(k max δ): R t,i α i + R(N t,i ,θ i ) α i + 2 s t log(T/δ) α i + log(T/δ) α i ≥ t X τ=1 r π ⋆ τ ≥ R t,j α j − 5 s t log(T/δ) α j − 7 log(T/δ) + 2θ j α j . 
Further combined with the fact that α i ≥ α j since i≤ j, the last inequality implies that the termination condition (4.15) will not hold. Proof of Lemma 9. L 0 X t=1 r π ⋆ t −r t ≤ 1 + kmax X i=k L 0 −1 X t=1 r π ⋆ t −r t 1[i t =i] = 1 + kmax X i=k R π ⋆ L 0 −1,i −R L 0 −1,i . (C.45) 236 For i≥ i ⋆ , since the corruption level is well-specified, by Lemma 53, with probability at least 1−O(δ), R π ⋆ L 0 −1,i −R L 0 −1,i ≤R(N L 0 −1,i ,θ i ). (C.46) For ik] s L 0 α i ⋆ + R(N L 0 −1,i ⋆,θ i ⋆) α i ⋆ !! (C.48) 237 where in the last inequality we use P i 1. By the way we update b ∆ j , it must be that b ∆ j ≥ 1 1.25 and that at the end of epoch j, (4.23) is triggered. 244 However, notice that in (4.23), the left-hand side b R 0 = 1 1−p j P t τ=t j r τ 1[Y τ = 0]≤ 1 1−p j M j ≤ 2M j since p j = β 4 2M j b ∆ 2 j ≤ 1 2 by (4.24), but the right-hand side of (4.23) involves a term 3M j b ∆ j ≥ 3M j × 1 1.25 > 2M j . Therefore, (4.23) is impossible to be triggered at this j, contradicting our assumption. Lemma 58. With probability at least 1−O(δ), for all intervalI = [t j ,t]⊆E j , 1 p j R B b N I,1 , Θ j ≤ 2 p j R B p j |I|, Θ j + p j √ β 1 L β 2 ! ≤ 0.02M j b ∆ j + 2.5β 2 C E j log(T/δ) + 2 p β 1 L. Proof. 1 p j R B b N I,1 , Θ j ≤ 2 p j R B p j |I|, Θ j + p j √ β 1 L β 2 ! (by Lemma 55) = 2 s β 1 |I| p j + β 2 Θ j p j + β 3 p j ! + 2 p β 1 L ≤ 2 s 2β 1 β 4 |I|M j b ∆ j + 5 4 β 2 C E j + 21β 2 c max log(T/δ) +β 3 β 4 × 2M j b ∆ 2 j ! + 2 p β 1 L ≤ 0.02M j b ∆ j + 2.5β 2 C E j + 2 p β 1 L. (by the definition of β 4 and that b ∆ j ≤ 1 by Lemma 57) Proof of Lemma 12. Let T 0 be the round at which TwoModelSelect terminates. In the following proof, we assume that the high-probability events defined in previous lemmas hold. Case 1. b π̸=π ⋆ . X t∈E ′ j (r π ⋆ t −r t ) 245 = (1−p j ) b R π ⋆ E ′ j ,0 − b R b π E ′ j ,0 +p j b R π ⋆ E ′ j ,1 − b R E ′ j ,1 ≤ (1−p j ) b R π ⋆ E ′ j ,1 − b R E ′ j ,0 + (1−p j ) 2 s 2|E ′ j | log(T/δ) p j + 2 log(T/δ) p j +p j b R π ⋆ E ′ j ,1 − b R E ′ j ,1 (when Y t = 0 we execute b π, and by Lemma 48) ≤ (1−p j ) b R E ′ j ,1 − b R E ′ j ,0 + 1 p j R B p j |E ′ j |, 0 + b R π ⋆ E ′ j ,1 − b R E ′ j ,1 (by the assumptions on β 1 ,β 3 ) ≤ 5 p j R B p j |E ′ j |, p j √ β 1 L β 2 ! − 1 2 |E ′ j | b ∆ j + 1 p j R B p j |E ′ j |, 0 + 1 p j R B b N E ′ j ,1 , Θ j (by (4.22), and Assumption 7 with the assumption that π ⋆ ∈ Π\{b π}) ≤ 8 p j R B p j |E ′ j |, Θ j + p j √ β 1 L β 2 ! − 1 2 |E j | b ∆ j (by Lemma 55) ≤ 0.16M j b ∆ j + 20β 2 C E j + 16 p β 1 L− 0.5|E j | b ∆ j . (by Lemma 58) For j≥ 2, the last expression is further upper bounded by 0.32|E j−1 | + β 4 b ∆ 2 j b ∆ j + 20β 2 C E j + 16 p β 1 L− 0.5|E j | b ∆ j (by the definition of M j ) ≤ β 4 b ∆ j + 20β 2 C E j + 16 p β 1 L + 0.4|E j−1 | b ∆ j−1 − 0.5|E j | b ∆ j ( b ∆ j ≤ 1.25 b ∆ j−1 ) ≤ ˜ O p β 4 L +β 2 C E j +β 4 + 0.4|E j−1 | b ∆ j−1 − 0.5|E j | b ∆ j ; ( b ∆ j ≥ b ∆ 1 = min n q β 4 L , 1 o ) for j = 1, it is upper bounded by 0.16β 4 b ∆ 1 + 20β 2 C E 1 + 16 p β 1 L− 0.5|E 1 | b ∆ 1 = ˜ O p β 4 L +β 2 C E 1 +β 4 − 0.5|E 1 | b ∆ 1 . 246 Summing up the above bound over j (and noticing that the number of epochs is upper bounded by 3 log 2 T), we see that T 0 X t=1 (r π ⋆ t −r t ) ≤ ˜ O p β 4 L +β 2 C E 1 +β 4 − 0.5|E 1 | b ∆ 1 + 3 log 2 T X j=2 ˜ O p β 4 L +β 2 C E j +β 4 + 0.4|E j−1 | b ∆ j−1 − 0.5|E j | b ∆ j = ˜ O p β 4 L +β 2 C T 0 +β 4 . Case 2. b π =π ⋆ . 
X t∈E ′ j r π ⋆ t −r t =p j b R π ⋆ t,1 − b R t,1 ≤p j b R π ⋆ t,0 − b R t,1 + ˜ O q p j |E ′ j | + 1 (Lemma 48) ≤p j b R t,0 − b R t,1 + ˜ O q p j M j + 1 (since b π =π ⋆ ) ≤ ˜ O p j M j b ∆ j + p β 1 L + q p j M j (by (4.23)) ≤ ˜ O β 4 b ∆ j + p β 1 L ! (by the definition of p j ) = ˜ O p β 4 L +β 4 . ( b ∆ j ≥ b ∆ 1 = min n q β 4 L , 1 o ) Similarly, summing over epochs and using the fact that the number of epochs is upper bounded by O(log 2 T ) we get the desired bound. 247 Proof of Lemma 13. The condition in the lemma implies ∆ ≥ 16 q β 4 L = 16 b ∆ 1 . Below we prove by induction that ∆ ≥ b ∆ j for all j. This holds for j = 1. Notice that b ∆ j only increases when the second break condition (4.23) holds. If (4.23) holds, we have |E j |∆ = |E j |(µ π ∗ −µ π ′ ) ≥ b R E j ,0 − b R E j ,1 − 4 p j R B b N E j ,1 , Θ j (Lemma 56) ≥ 3M j b ∆ j + 9 p β 1 L− 4 0.02M j b ∆ j + 2.5β 2 C E j + 2 p β 1 L . (by (4.23) and Lemma 58) ≥ 2.5M j b ∆ j . (by the condition specified in the lemma, we have √ β 1 L≥ 10β 2 C E j ) Because|E j |≤M j , we have b ∆ j ≤ 1 2.5 ∆ . Therefore, after the update, b ∆ j+1 = 1.25 b ∆ j ≤ ∆ still holds. Next, we show that the first break condition (4.22) will not hold with high probability: at any time t within epoch j, b R [t j ,t−1],0 − b R [t j ,t−1],1 − 1 2 (t−t j ) b ∆ j = b R [t j ,t−1],0 − (t−t j )µ π ∗ + (t−t j ) µ π ∗ −µ π ′ − 1 2 b ∆ j + (t−t j )µ π ′ − b R [t j ,t],1 ≥− 4 p j R B p j (t−t j ),p j C E j + (t−t j ) ∆ − 1 2 b ∆ j (Lemma 56) ≥− 4 p j R B p j (t−t j ), p j √ β 1 L β 2 ! . (β 2 C E j ≤ √ β 1 L as assummed in the lemma; ∆ ≥ b ∆ j for all j as we just showed above) Therefore, the first break condition will not be triggered. Overall, with high probability, b ∆ j is non-decreasing with j. Under this high-probability event, since b ∆ j never decreases, the number of times b ∆ j increases is upper bounded by log 1.25 ∆ b ∆ 1 ≤ log 1.25 q L β 4 ≤ 1 2 log 1.25 T≤ 2 log 2 T. Furthermore, between two times b ∆ j increases, since (4.22) and (4.23) are not triggered, the epoch length is at least two 248 times the previous one (by (4.24)). Therefore, between two times b ∆ j increases, the number of epochs is upper bounded by log 2 T. Overall, the total number of epochs is upper bounded by 2 log 2 T× log 2 T = 2 log 2 T. Since we allow the maximum number of epochs to be 3 log 2 T in Algorithm 25, it will not end before the number of rounds reaches T. Proof of Theorem 25. Let L ⋆ be the smallest L such that 32 β 4 ∆ +β 2 C ≤ p β 4 L. In the for-loops where the learner uses L≤L ⋆ , by Lemma 10 and Lemma 12, the sum of regret in Phase 1 and Phase 2 is upper bounded by ˜ O p β 4 L +β 2 C +β 4 = ˜ O p β 4 L ⋆ +β 2 C +β 4 = ˜ O β 4 ∆ +β 2 C . In the for-loop where the learner first time uses L>L ⋆ , we have β 2 2 k ≥ q β 4 (L− 1)≥ p β 4 L ⋆ ≥ 32 β 4 ∆ +β 2 C where the first inequality is by the choice of L in COBE. By Lemma 11, with probability at least 1−O(δ), b π =π ⋆ . Further by Lemma 13, with high probability, Phase 2 will continue until the total number of rounds reaches T. In this case, using Lemma 10 and Lemma 12, we can still bound the regret in the remaining steps by ˜ O p β 4 L +β 2 C +β 4 = ˜ O p β 4 L ⋆ +β 2 C +β 4 = ˜ O β 4 ∆ +β 2 C . 249 By the discussions above, we also see that with high probability, in all for-loops, the learner uses L< 2L ⋆ (because the algorithm will be locked in Phase 2 when the first time L>L ⋆ happens). Therefore, by the condition of starting Phase 3, Phase 3 can only be reached when L ⋆ = Ω( T ). 
In this case, the regret incurred in Phase 3, by Theorem 24, is upper bounded by ˜ O p β 1 T +β 2 C +β 3 = ˜ O p β 1 L ⋆ +β 2 C +β 3 = ˜ O β 4 ∆ +β 2 C +β 4 = ˜ O β 4 ∆ +β 2 C . Overall, after summing the regret in all phases and using the fact that the for-loop only repeat ˜ O(1) times, we see that the total regret can be upper bounded by ˜ O β 4 ∆ +β 2 C . To show that the algorithm also simultaneously guarantees a bound of ˜ O √ β 4 T +β 2 C +β 4 , simply bound the regret in all phases by ˜ O √ β 4 L +β 2 C +β 4 = ˜ O √ β 4 T +β 2 C +β 4 . C.3.5 The implementation of the leave-one-policy-out MDP We consider a tabular MDPM = (S,A,r,p,H) with a fixed initial state s 1 ∈S. Let Π M denote the set of all deterministic policies inM. Now, given a deterministic policy b π∈ Π M , our goal is to construct another MDPM ′ , such that the policy set ofM ′ includes all policies inM except for b π, and that for any π∈ Π M \{b π}, the expected reward inM andM ′ is the same. MDPM ′ has state space{s 0 }∪S×S and horizon H + 1. InM ′ , the agent starts in the initial state s 0 and takes one of S actions which makes it transition to one of S copies of the original MDPM. Thes-th copy ofM is denoted byM s and is identical toM except that the agent is not allowed to take the actions prescribed by b π in state s. Note that we can obtain samples forM ′ by playing inM, and that max π∈Π M ′ µ π M ′ = max π∈Π M \{b π} µ π M (C.53) 250 where µ π M denotes the expected reward of policy π under MDPM. To see this, simply notice that for any π∈ Π M \{b π} which differs from π ⋆ on state s, one can find a policy π ′ ∈ Π M ′ that first goes toM s inM ′ in the first step, and then follow π in the rest of the steps. This policy π ′ gives the same expected reward asπ. Conversely, for anyπ ′ ∈ Π M ′, there is a policyπ∈ Π M \{b π} which simply equals to π ′ on its 2 to H + 1 steps. This π gives the same expected reward as π ′ . AlthoughM ′ has S 2 + 1 states, and the total number of actions is (SA− 1)×S +S (where SA− 1 is the total number of actions in each copy ofM, and the additional S is the number of actions ons 0 ), running UCBVI onM ′ can in fact yield the same gap-independent bound as running it inM if we share the samples among different copies of M. To see this in the uncorrupted case, notice that in the analysis of the UCBVI algorithm (see (Azar et al., 2017)), the regret bound is a sum of terms of the form P t P h poly(S,A,H) √ nt(s t h ,a t h ) or P t P h poly(S,A,H) nt(s t h ,a t h ) . When the samples of the S copies ofM are shared, these sum will only scale with the original number of states and actions. This can also be proved formally through the use of feedback graphs (Dann et al., 2020). In the corrupted case, the amount of corruption (i.e., c t =H· sup s,a,V |(TV−T t V )(s,a)|) remains the same inM and inM ′ . Therefore, the overall regret bound inM ′ under corruption remains the same order as that inM. 251 Appendix D Omitted Details in Chapter 5 D.1 Auxiliary lemmas In this section, we list auxiliary lemmas that are useful in our analysis. First, we show some concentration inequalities. Lemma 59 ((A special form of) Freedman’s inequality, Theorem 1 of (Beygelzimer et al., 2011)). Let F 0 ⊂···⊂F n be a filtration, and X 1 ,...,X n be real random variables such thatX i isF i -measurable, E[X i |F i ] = 0,|X i |≤ b, and P n i=1 E[X 2 i |F i ]≤ V for some fixed b≥ 0 and V ≥ 0. Then for any δ∈ (0, 1), we have with probability at least 1−δ, n X i=1 X i ≤ V b +b log(1/δ). 
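Returning briefly to the leave-one-policy-out construction of Appendix C.3.5 above, the augmented MDP M′ can be realized as a thin simulator wrapper around the original MDP, which is how samples for M′ are obtained by playing in M. The Python sketch below is illustrative only: the `base` interface (fields S, A, H and methods reset/step), the class name, and all other names are assumptions, not part of the dissertation's code.

```python
class LeaveOneOutMDP:
    """Sketch of the leave-one-policy-out construction M' of Appendix C.3.5.

    base : object with integer fields S, A, H and methods reset() -> s,
           step(s, a) -> (reward, next_state), simulating the original MDP M.
    pihat: list of length S, the deterministic policy to exclude.
    In M', the first action selects a copy M_s; inside copy s, the action pihat[s]
    is forbidden in state s, so the reachable policies are exactly Pi_M \\ {pihat}.
    """
    def __init__(self, base, pihat):
        self.base, self.pihat = base, pihat
        self.H = base.H + 1                      # one extra layer for the copy-selection step

    def reset(self):
        return "s0"                              # auxiliary initial state of M'

    def valid_actions(self, state):
        if state == "s0":
            return list(range(self.base.S))      # choosing which copy M_s to enter
        ban, s = state
        acts = list(range(self.base.A))
        if s == ban:
            acts.remove(self.pihat[ban])         # action prescribed by pihat is disallowed here
        return acts

    def step(self, state, action):
        if state == "s0":
            s = self.base.reset()                # enter copy M_{action}
            return 0.0, (action, s)
        ban, s = state
        assert action in self.valid_actions(state)
        r, s_next = self.base.step(s, action)    # dynamics and rewards are those of M
        return r, (ban, s_next)
```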
Throughout the appendix, we letF t be the σ-algebra generated by the observations before episode t. 252 Lemma 60 (Adapted from Lemma 11 of (Jin et al., 2020b)). For all x,a, let{z t (x,a)} T t=1 be a sequence of functions where z t (x,a)∈ [0,R] isF t -measurable. Let Z t (x,a)∈ [0,R] be a random variable such that E t [Z t (x,a)] =z t (x,a). Then with probability at least 1−δ, T X t=1 X x,a 1 t (x,a)Z t (x,a) ˜ U(x,a) +γ − q t (x,a)z t (x,a) ˜ U(x,a) ! ≤ RH 2γ ln H δ . Lemma 61 (Matrix Azuma, Theorem 7.1 of (Tropp, 2012)). Consider an adapted sequence{X k } n k=1 of self-adjoint matrices in dimension d, and a fixed sequence {A k } n k=1 of self-adjoint matrices that satisfy E k [X k ] = 0 and X 2 k ⪯A 2 k almost surely Define the variance parameter σ 2 = 1 n n X k=1 A 2 k op . Then, for all τ > 0, Pr 1 n n X k=1 X k op ≥τ ≤de −nτ 2 /8σ 2 . Next, we show a classic regret bound for the exponential weight algorithm, which can be found, for example, in (?). 253 Lemma 62 (Regret bound of exponential weight, extracted from Theorem 1 of (?)). Let η> 0, and let π t ∈ ∆( A) and ℓ t ∈R A satisfy the following for all t∈ [T ] and a∈A: π 1 (a) = 1 |A| , π t+1 (a) = π t (a)e −ηℓt(a) P a ′ ∈A π t (a ′ )e −ηℓt(a ′ ) , |ηℓ t (a)|≤ 1. Then for any π ⋆ ∈ ∆( A), T X t=1 X a∈A (π t (a)−π ⋆ (a))ℓ t (a)≤ ln|A| η +η T X t=1 X a∈A π t (a)ℓ t (a) 2 . D.2 Analysis for auxiliary procedures In this section, we analyze two important auxiliary procedures for the linear function approximation settings: GeometricResampling and PolicyCover. D.2.1 The guarantee of GeometricResampling The GeometricResampling algorithm is shown in Algorithm 28, which is almost the same as that in Neu and Olkhovskaya (2020) except that we repeat the same procedure forM times and average the outputs (see the extra outer loop). This extra step is added to deal with some technical difficulties in the analysis. The following lemma summarizes some useful guarantees of this procedure. For generality, we present the lemma assuming a lower bound on the minimum eigenvalue λ of the covariance matrix, but it will simply be 0 in all our applications of this lemma in this work. Lemma 63. Let π be a policy (possibly a mixture policy) with a covariance matrix Σ h = E π [ϕ (x h ,a h )ϕ (x h ,a h ) ⊤ ] ⪰ λI for layer h and some constant λ ≥ 0. Further let ϵ > 0 and 254 γ≥ 0 be two parameters satisfying 0<γ +λ< 1. Define M = l 24 ln(dHT ) ϵ 2 min n 1 γ 2 , 4 λ 2 ln 2 1 ϵλ om and N = l 2 γ+λ ln 1 ϵ(γ+λ) m . LetT be a set of MN trajectories generated by π. Then GeometricResampling (Algorithm 28) with input (T,M,N,γ) ensures the following for all h: b Σ + h op ≤ min 1 γ , 2 λ ln 1 ϵλ . (D.1) E h b Σ + h i − (γI + Σ h ) −1 op ≤ϵ, (D.2) b Σ + h − (γI + Σ h ) −1 op ≤ 2ϵ, (D.3) b Σ + h Σ h op ≤ 1 + 2ϵ, (D.4) where∥·∥ op represents the spectral norm and the last two properties (D.3) and (D.4) hold with probability at least 1− 1 T 3 . Proof. To prove (D.1), notice that each one of b Σ +(m) h , m = 1,...,M, is a sum of N + 1 terms. Furthermore, the n-th term of them (cZ n,h in Algorithm 28) has an operator norm upper bounded by c(1−cγ) n . Therefore, b Σ +(m) h op ≤ N X n=0 c(1−cγ) n ≤ min 1 γ ,c(N + 1) ≤ min 1 γ , 2 λ ln 1 ϵλ (D.5) by the definition of N and that c = 1 2 . Since b Σ + h is an average of b Σ +(m) h , this implies (D.1). To show (D.2), observe thatE t [Y n,h ] =γI + Σ h and{Y n,h } N n=1 are independent. 
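As an aside, Algorithm 28 itself is not reproduced in this excerpt, but its computation can be read off from this proof: each estimate is the truncated geometric series c Σ_{n=0}^{N} Z_n with Z_0 = I, Z_n = Π_{k≤n}(I − c Y_k), where Y_k = γI + φ_k φ_k^⊤ is built from one sampled feature, c = 1/2, and the M repetitions are averaged. The Python sketch below is a reconstruction under that reading, not a verbatim transcription of Algorithm 28.

```python
import numpy as np

def geometric_resampling(features, M, N, gamma, c=0.5):
    """Reconstruction of GeometricResampling from the proof of Lemma 63.

    features: array of shape (M*N, d), one sampled feature phi(x_h, a_h) per trajectory,
              all drawn from the same (mixture) policy.
    Returns an estimate of (gamma * I + Sigma_h)^{-1}, averaged over M repetitions.
    """
    MN, d = features.shape
    assert MN == M * N
    estimates = []
    for m in range(M):
        Sigma_plus = c * np.eye(d)               # n = 0 term: c * Z_0 with Z_0 = I
        Z = np.eye(d)
        for n in range(1, N + 1):
            phi = features[m * N + n - 1]
            Y = gamma * np.eye(d) + np.outer(phi, phi)
            Z = Z @ (np.eye(d) - c * Y)          # Z_n = prod_{k <= n} (I - c * Y_k)
            Sigma_plus += c * Z                  # accumulate the n-th term c * Z_n
        estimates.append(Sigma_plus)
    return np.mean(estimates, axis=0)            # average over the M repetitions
```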
Therefore, we a have E h b Σ + t,h i =E h b Σ +(m) t,h i =cI +c N X i=1 (I−c (γI + Σ t,h )) i = (γI + Σ t,h ) −1 I− (I−c (γI + Σ t,h )) N+1 255 where the last step uses the formula: I + P N i=1 A i = (I−A) −1 (I−A N+1 ) withA =I−c(γI +Σ t,h ). Thus, E t h b Σ + h i − (γI + Σ h ) −1 op = (γI + Σ h ) −1 (I−c (γI + Σ h )) N+1 op ≤ (1−c(γ +λ)) N+1 γ +λ ≤ e −(N+1)c(γ+λ) γ +λ ≤ϵ, where the first inequality is by 0≺I−c(γI +I)⪯I−c(γI + Σ h )⪯I−c(γ +λ)I, and the last inequality is by our choice of N and that c = 1 2 . To show (D.3), we only further need b Σ + h −E h b Σ + h i op ≤ϵ and combine it with (D.2). This can be shown by applying Lemma 61 with X k = b Σ +(k) h − E h b Σ +(k) h i ,A k = min n 1 γ , 2 λ ln 1 ϵλ o I (recall (D.5) and thus X 2 k ⪯ A 2 k ), σ = min n 1 γ , 2 λ ln 1 ϵλ o , τ = ϵ, and n = M. This gives the following statement: the event b Σ + h −E t h b Σ + h i op > ϵ holds with probability less than d exp −M×ϵ 2 × 1 8 × max ( γ 2 , λ 2 4 ln 2 1 ϵλ )! ≤ 1 d 2 H 3 T 3 ≤ 1 HT 3 by our choice of M. The conclusion follows by a union bound over h. To prove (D.4), observe that with (D.3), we have b Σ + h Σ h op ≤ (γI + Σ h ) −1 Σ h op + b Σ + h − (γI + Σ h ) −1 Σ h op ≤ 1 + 2ϵ since∥Σ h ∥ op ≤ 1. 256 D.2.2 The guarantee of PolicyCover In this section, we analyze Algorithm 30, which returns a policy cover and its estimated covariance matrices. The final guarantee of the policy cover is provided in Lemma 66, but we need to establish a couple of useful lemmas before introducing that. Note that Algorithm 30 bears some similarity with (Wang et al., 2020b, Algorithm 1) (except for the design of the reward function r t ), and thus the analysis is also similar to theirs. We first define the following definitions, using notations defined in Algorithm 30 and Assump- tion 10. Definition 10. For any π and m, define V π m to be the state value function for π with respect to reward function r m . Precisely, this means V π m (x H ) = 0 and for (x,a)∈X h ×A, h =H− 1,..., 0: V π m (x) = P a π(a|x)Q π m (x,a) where Q π m (x,a) =r m (x,a) +ϕ (x,a) ⊤ θ π m,h and θ π m,h = Z x ′ ∈X h+1 V π m (x ′ )ν x ′ h dx ′ . Furthermore, let π ∗ m be the optimal policy satisfying π ∗ m = argmax π V π m (x) for all x, and define shorthands V ∗ m (x) =V π ∗ m m (x), Q ∗ m (x,a) =Q π ∗ m m (x,a), and θ ∗ m,h =θ π ∗ m m,h . The following lemma characterizes the optimistic nature of Algorithm 30. Lemma 64. With probability at least 1−δ, for all h, all (x,a)∈X h ×A, and all π, Algorithm 30 ensures 0≤ b Q m (x,a)−Q π m (x,a)≤E x ′ ∼P (·|x,a) h b V m (x ′ )−V π m (x ′ ) i + 2ξ∥ϕ (x,a)∥ Γ −1 m,h . 257 Proof. The proof mostly follows that of (Wei et al., 2021a, Lemma 4). For notational convenience, denote ϕ (x τ,h ,a τ,h ) as ϕ τ,h , and x ′ ∼P (·|x τ,h ,a τ,h ) as x ′ ∼ (τ,h). We then have b θ m,h −θ π m,h = Γ −1 m,h 1 N 0 (m−1)N 0 X τ=1 ϕ τ,h b V m (x τ,h+1 ) − Γ −1 m,h θ π m,h + 1 N 0 (m−1)N 0 X τ=1 ϕ τ,h ϕ ⊤ τ,h θ π m,h = Γ −1 m,h 1 N 0 (m−1)N 0 X τ=1 ϕ τ,h b V m (x τ,h+1 ) − Γ −1 m,h 1 N 0 (m−1)N 0 X τ=1 ϕ τ,h E x ′ ∼(τ,h) V π m (x ′ ) − Γ −1 m,h θ π m,h = Γ −1 m,h 1 N 0 (m−1)N 0 X τ=1 ϕ τ,h E x ′ ∼(τ,h) h b V m (x ′ )−V π m (x ′ ) i +ζ m,h − Γ −1 t,h θ π t,h (define ζ m,h = 1 N 0 Γ −1 m,h P (m−1)N 0 τ=1 b V m (x τ,h+1 )−E x ′ ∼(τ,h) b V m (x ′ ) ) = Γ −1 m,h 1 N 0 (m−1)N 0 X τ=1 ϕ τ,h ϕ ⊤ τ,h Z x ′ ∈X h+1 ν x ′ h b V m (x ′ )−V π m (x ′ ) dx ′ +ζ m,h − Γ −1 m,h θ π m,h = Z x ′ ∈X h+1 ν x ′ h b V m (x ′ )−V π m (x ′ ) dx ′ +ζ m,h − Γ −1 m,h θ π m,h − Γ −1 m,h Z x ′ ∈X h+1 ν x ′ h b V m (x ′ )−V π m (x ′ ) dx ′ . 
Therefore, for x∈X h , b Q m (x,a)−Q π m (x,a) =ϕ (x,a) ⊤ b θ m,h −θ π m,h +ξ∥ϕ (x,a)∥ Γ −1 m,h =ϕ (x,a) ⊤ Z x ′ ∈X h+1 ν x ′ h b V m (x ′ )−V π m (x ′ ) dx ′ +ϕ (x,a) ⊤ ζ m,h | {z } term 1 +ξ∥ϕ (x,a)∥ Γ −1 m,h −ϕ (x,a) ⊤ Γ −1 m,h Z x ′ ∈X h+1 ν x ′ h b V m (x ′ )−V π m (x ′ ) dx ′ | {z } term 2 −ϕ (x,a) ⊤ Γ −1 m,h θ π m,h | {z } term 3 =E x ′ ∼p(·|x,a) h b V m (x ′ )−V π m (x ′ ) i +ξ∥ϕ (x,a)∥ Γ −1 m,h +term 1 +term 2 +term 3 . (D.6) It remains to bound|term 1 +term 2 +term 3 |. To do so, we follow the exact same arguments as in (Wei et al., 2021a, Lemma 4) to bound each of the three terms. 258 Bounding term 1 . First we have|term 1 |≤∥ζ m,h ∥ Γ m,h ∥ϕ (x,a)∥ Γ −1 m,h . To bound∥ζ m,h ∥ Γ m,h , we use the exact same argument of (Wei et al., 2021a, Lemma 4) to arrive at (with probability at least 1−δ) ∥ζ m,h ∥ Γ m,h = 1 N 0 (m−1)N 0 X τ=1 b V m (x τ,h+1 )−E x ′ ∼(τ,h) b V m (x ′ ) Γ −1 m,h ≤ 2H s d 2 log(M 0 + 1) + log N ε δ + q 8M 2 0 ε 2 , (D.7) whereN ε is the ε-cover of the function class that b V m (·) lies in. Notice that for all m, b V m (·) can be expressed as the following: b V m (x) = min max a ramp 1 T ∥ϕ (x,a)∥ 2 Z − α M 0 +ξ∥ϕ (x,a)∥ Z +ϕ (x,a) ⊤ θ , H for some positive definite matrix Z ∈ R d×d with 1 1+M 0 I ⪯ Z ⪯ I and vector θ ∈ R d with ∥θ∥≤ sup m,τ,h Γ −1 m,h op ×M 0 ∥ϕ τ,h ∥H≤M 0 H. Therefore, we can write the class of functions that b V m (·) lies in as the following set: V = ( min max a ramp 1 T ∥ϕ (x,a)∥ 2 Z − α M 0 +ξ∥ϕ (x,a)∥ Z +ϕ (x,a) ⊤ θ , H : θ∈R d :∥θ∥≤M 0 H, Z∈R d×d : 1 1 +M 0 I⪯Z⪯I ) . Now we apply Lemma 12 of (Wei et al., 2021a) toV, with the following choices of parameters: P =d 2 +d, ε = 1 T 3 , B =M 0 H, and L =T +ξ √ 1 +M 0 + 1≤ 3T (without loss of generality, we assume that T is large enough so that the last inequality holds). The value of the Lipschitzness 259 parameter L is according to the following calculation that is similar to (Wei et al., 2021a): for any ∆ Z =ϵe i e ⊤ j , 1 |ϵ| q ϕ (x,a) ⊤ (Z + ∆ Z)ϕ (x,a)− q ϕ (x,a) ⊤ Zϕ (x,a) ≤ ϕ (x,a) ⊤ e i e ⊤ j ϕ (x,a) q ϕ (x,a) ⊤ Zϕ (x,a) ( √ u +v− √ u≤ |v| √ u ) ≤ ϕ (x,a) ⊤ 1 2 e i e ⊤ i + 1 2 e j e ⊤ j ϕ (x,a) q ϕ (x,a) ⊤ Zϕ (x,a) ≤ ϕ (x,a) ⊤ ϕ (x,a) q ϕ (x,a) ⊤ Zϕ (x,a) ≤ s 1 λ min (Z) ≤ p 1 +M 0 ; 1 |ϵ| ∥ϕ (x,a)∥ 2 Z+∆ Z −∥ϕ (x,a)∥ 2 Z =|e ⊤ i ϕ (x,a)ϕ (x,a) ⊤ e j |≤ 1; and that ramp 1 T (·) has a slope of T (this is why we need to use the ramp function to approximate an indication function that is not Lipschitz). Overall, this leads to logN ε ≤ 20(d 2 +d) logT. Using this fact in (D.7), we get ∥ζ m,h ∥ Γ m,h ≤ 20H s d 2 log T δ ≤ 1 3 ξ, and thus|term 1 |≤ ξ 3 ∥ϕ (x,a)∥ Γ −1 m,h . Bounding term 2 and term 3 . This is exactly the same as (Wei et al., 2021a, Lemma 4), and we omit the details. In summary, we can also prove|term 2 |≤ ξ 3 ∥ϕ (x,a)∥ Γ −1 m,h and term 3 ≤ ξ 3 ∥ϕ (x,a)∥ Γ −1 m,h . In sum, we can bound |term 1 +term 2 +term 3 |≤|term 1 | +|term 2 | +|term 3 |≤ξ∥ϕ (x,a)∥ Γ −1 m,h for all m,h and (s,a) with probability at least 1−δ. 260 Combining this with (D.6), we get b Q m (x,a)−Q π m (x,a)≤E x ′ ∼p(·|x,a) h b V m (x ′ )−V π m (x ′ ) i + 2ξ∥ϕ (x,a)∥ Γ −1 m,h , (D.8) b Q m (x,a)−Q π m (x,a)≥E x ′ ∼p(·|x,a) h b V m (x ′ )−V π m (x ′ ) i , (D.9) where (D.8) proves the second inequality in the lemma. To prove the first inequality in the lemma, we use and induction to show that b V m (x)≥V π m (x) for all x, which combined with (D.9) finishes the proof. Recall that we define b V m (x H ) = V π m (x H ) = 0. Assume that b V m (x)≥ V π m (x) holds for x∈ X h+1 . 
Then by (D.9), b Q m (x,a)−Q π m (x,a)≥ 0 for all (x,a)∈ X h ×A. Thus, b V m (x)−V π m (x) = max a b Q m (x,a)− P a π(a|x)Q π m (x,a)≥ 0, finishing the induction. The next lemma provides a “regret guarantee” for Algorithm 30 with respect to the fake rewards. Lemma 65. With probability at least 1− 2δ, Algorithm 30 ensures M 0 X m=1 V ∗ m (x 0 )− M 0 X m=1 V πm m (x 0 ) = ˜ O d 3/2 H 2 p M 0 . Proof. For any t∈ [(m− 1)N 0 + 1,mN 0 ] and any h, b V m (x t,h )−V πm m (x t,h ) = max a b Q m (x t,h ,a)−Q πm m (x t,h ,a t,h ) (π m is a deterministic policy) = b Q m (x t,h ,a t,h )−Q πm m (x t,h ,a t,h ) ≤E x ′ ∼(x t,h ,a t,h ) h b V m (x ′ )−V πm m (x ′ ) i + 2ξ∥ϕ (x t,h ,a t,h )∥ Γ −1 m,h (Lemma 64) = b V m (x t,h+1 )−V πm m (x t,h+1 ) +e t,h + 2ξ∥ϕ (x t,h ,a t,h )∥ Γ −1 m,h . (define e t,h to be the difference) 261 Thus, b V m (x 0 )−V πm m (x 0 )≤ X h 2ξ∥ϕ (x t,h ,a t,h )∥ Γ −1 m,h +e t,h . Summing over t, and using the fact V ∗ m (x 0 )≤ b V m (x 0 ) (from Lemma 64), we get 1 M 0 M 0 X m=1 (V ∗ m (x 0 )−V πm m (x 0 )) ≤ 1 M 0 N 0 M 0 N 0 X t=1 X h 2ξ∥ϕ (x t,h ,a t,h )∥ Γ −1 m,h +e t,h ≤ 2ξ √ M 0 N 0 X h v u u t M 0 N 0 X t=1 ∥ϕ (x t,h ,a t,h )∥ 2 Γ −1 m,h + 1 M 0 N 0 M 0 N 0 X t=1 X h e t,h . (Cauchy-Schwarz inequality) Further using the fact P M 0 N 0 t=1 ∥ϕ (x t,h ,a t,h )∥ 2 Γ −1 m,h =N 0 P M 0 m=1 D Γ m+1,h − Γ m,h , Γ −1 m,h , = E ˜ O (N 0 d) (see e.g., (Jin et al., 2020c, Lemma D.2)), we bound the first term above by ˜ O ξH p d/M 0 = ˜ O H 2 p d 3 /M 0 . For the second term, note that P M 0 N 0 t=1 e t,h is the sum of a martingale differ- ence sequence. By Azuma’s inequality, the entire second term is thus of order ˜ O H 2 log(1/δ) √ M 0 N 0 with probability at least 1−δ. This finishes the proof. Finally, we are ready to show the guarantee of the returned policy cover. Recall our definition of known state set: K = x∈X :∀a∈A,∥ϕ (x,a)∥ 2 ( b Σ cov h ) −1 ≤α where h is such that x∈X h . Lemma 66. For any h = 0,...,H− 1, with probability at least 1− 4δ (over the randomness in the first T 0 rounds), the covariance matrices b Σ cov h returned by Algorithm 30 satisfies that for any policy π, Pr x h ∼π [x h / ∈K]≤ ˜ O dH α . 262 where x h ∈X h is sampled from executing π. Proof. We define an auxiliary policy π ′ which only differs from π for unknown states in layer h. Specifically, for x∈X h not inK, let a be such that∥ϕ (x,a)∥ 2 ( b Σ cov h ) −1 ≥α (which must exist by the definition of K), then π ′ (a ′ |x) = 1[a ′ =a] for all a ′ ∈A. By doing so, we have Pr x h ∼π [x h / ∈K] = Pr (x h ,a)∼π ′ ∥ϕ (x h ,a)∥ 2 ( b Σ cov h ) −1 ≥α = Pr (x h ,a)∼π ′ ∥ϕ (x h ,a)∥ 2 Γ −1 M 0 +1,h ≥ α M 0 ≤ 1 M 0 M 0 X m=1 Pr (x h ,a)∼π ′ ∥ϕ (x h ,a)∥ 2 Γ −1 m,h ≥ α M 0 (Γ m,h ⪯ Γ M 0 +1,h ) ≤ 1 M 0 M 0 X m=1 E (x h ,a)∼π ′ ramp 1 T ∥ϕ (x,a)∥ 2 Γ −1 m,h − α M 0 (1[y≥ 0]≤ ramp z (y)) ≤ 1 M 0 M 0 X m=1 V π ′ m (x 0 ) (rewards r m (·,·) are non-negative) ≤ 1 M 0 M 0 X m=1 V πm m (x 0 ) + 1 M 0 × ˜ O d 3/2 H 2 p M 0 (Lemma 65) ≤ 1 M 0 N 0 M 0 N 0 X t=1 H−1 X h=0 r m (x t,h ,a t,h ) + ˜ O H √ M 0 N 0 + ˜ O d 3/2 H 2 √ M 0 ! (by Azuma’s inequality) ≤ 1 M 0 N 0 × 1 α M 0 M 0 N 0 X t=1 H−1 X h=0 ∥ϕ (x t,h ,a t,h )∥ 2 Γ −1 m,h + ˜ O d 3/2 H 2 √ M 0 ! (ramp z (y−y ′ )≤ y y ′ for y> 0,y ′ >z> 0) ≤ 1 M 0 N 0 × 1 α M 0 × ˜ O (N 0 dH) + ˜ O d 3/2 H 2 √ M 0 ! (same calculation as done in the proof of Lemma 65) ≤ ˜ O dH α + d 3/2 H 2 √ M 0 ! . Finally, using the definition of M 0 finishes the proof. 263 D.3 Omitted details in Section 5.2 In this section, we prove Lemma 14. In fact, we prove two generalized versions of it. 
Lemma 67 states that the lemma holds even when we replace the definition of B t (x,a) by an upper bound of the right hand side of (5.4). (Note that Lemma 14 is clearly a special case with b P =P.) Lemma 67. Let b t (x,a) be a non-negative loss function, and b P be a transition function. Suppose that the following holds for all x,a: B t (x,a) =b t (x,a) + 1 + 1 H E x ′ ∼ b P (·|x,a) E a ′ ∼πt(·|x ′ ) B t (x ′ ,a ′ ) (D.10) ≥b t (x,a) + 1 + 1 H E x ′ ∼P (·|x,a) E a ′ ∼πt(·|x ′ ) B t (x ′ ,a ′ ) with B t (x H ,a)≜ 0, and suppose that (5.5) holds. Then Reg≤o(T ) + 3 T X t=1 b V πt (x 0 ;b t ). where b V π is the state value function under the transition function b P and policy π. Proof of Lemma 67. By rearranging (5.5), we see that Reg≤o(T ) + T X t=1 X x,a q ⋆ (x)π ⋆ (a|x)b t (x,a) | {z } term 1 + 1 H T X t=1 X x,a q ⋆ (x)π t (a|x)B t (x,a) | {z } term 2 + T X t=1 X x,a q ⋆ (x) π t (a|x)−π ⋆ (a|x) B t (x,a) | {z } term 3 . We first focus on term 3 , and focus on a single layer 0≤h≤H− 1 and a single t: X x∈X h X a∈A q ⋆ (x) (π t (a|x)−π ⋆ (a|x))B t (x,a) 264 = X x∈X h X a∈A q ⋆ (x)π t (a|x)B t (x,a)− X x∈X h X a∈A q ⋆ (x)π ⋆ (a|x)B t (x,a) = X x∈X h X a∈A q ⋆ (x)π t (a|x)B t (x,a) − X x∈X h X a∈A q ⋆ (x)π ⋆ (a|x) b t (x,a) + 1 + 1 H E x ′ ∼ b P (·|x,a) E a ′ ∼πt(·|x ′ ) B t (x ′ ,a ′ ) ≤ X x∈X h X a∈A q ⋆ (x)π t (a|x)B t (x,a) − X x∈X h X a∈A q ⋆ (x)π ⋆ (a|x) b t (x,a) + 1 + 1 H E x ′ ∼P (·|x,a) E a ′ ∼πt(·|x ′ ) B t (x ′ ,a ′ ) = X x∈X h X a∈A q ⋆ (x)π t (a|x)B t (x,a)− X x∈X h+1 X a∈A q ⋆ (x)π t (a|x)B t (x,a) − X x∈X h X a∈A q ⋆ (x)π ⋆ (a|x)b t (x,a)− 1 H X x∈X h+1 X a∈A q ⋆ (x)π t (a|x)B t (x,a), where the last step uses the fact P x∈X h P a∈A q ⋆ (x)π ⋆ (a|x)P(x ′ |x,a) =q ⋆ (x ′ ) (and then changes the notation (x ′ ,a ′ ) to (x,a)). Now summing this over h = 0, 1,...,H− 1 and t = 1,...,T, and combining with term 1 and term 2 , we get term 1 +term 2 +term 3 = 1 + 1 H T X t=1 X a π t (a|x 0 )B t (x 0 ,a). Finally, we relate P a π t (a|x 0 )B t (x 0 ,a) to b V πt (x 0 ;b t ). Below, we show by induction that for x∈X h and any a, X a∈A π t (a|x)B t (x,a)≤ 1 + 1 H H−h−1 b V πt (x;b t ). 265 Whenh =H−1, P a π t (a|x)B t (x,a) = P a π t (a|x)b t (x,a) = b V πt (x;b t ). Suppose that the hypothesis holds for all x∈X h . Then for any x∈X h−1 , X a∈A π t (a|x)B t (x,a) = X a π t (a|x) b t (x,a) + 1 + 1 H E x ′ ∼ b P (·|x,a) E a ′ ∼πt(·|x ′ ) B t (x ′ ,a ′ ) ≤ X a π t (a|x) b t (x,a) + 1 + 1 H H−h E x ′ ∼ b P (·|x,a) h b V πt (x ′ ;b t ) i (induction hypothesis) ≤ 1 + 1 H H−h X a π t (a|x) b t (x,a) +E x ′ ∼ b P (·|x,a) h b V πt (x ′ ;b t ) i (b t (x,a)≥ 0) = 1 + 1 H H−h b V πt (x;b t ), finishing the induction. Applying the relation on x = x 0 and noticing that 1 + 1 H H ≤ e < 3 finishes the proof. Besides Lemma 67, we also show Lemma 68 below, which guarantees that Lemma 14 holds even if (5.4) and (5.5) only hold in expectation. Lemma 68. Let b t (x,a) be a non-negative loss function that is fixed at the beginning of episode t, and let π t be fixed at the beginning of episode t. Let B t (x,a) be a randomized bonus function that satisfies the following for all x,a: E t [B t (x,a)] =b t (x,a) + 1 + 1 H E x ′ ∼P (·|x,a) E a ′ ∼πt(·|x ′ ) E t h B t (x ′ ,a ′ ) i (D.11) with B t (x H ,a)≜ 0, and suppose that the following holds (simply taking expectations on (5.5)): E " X x q ⋆ (x) T X t=1 X a π t (a|x)−π ⋆ (a|x) Q πt t (x,a)−B t (x,a) # ≤o(T ) +E " T X t=1 V π ⋆ (x 0 ;b t ) # + 1 H E " T X t=1 X x,a q ⋆ (x)π t (a|x)B t (x,a) # . (D.12) 266 Then E [Reg]≤o(T ) + 3E " T X t=1 V πt (x 0 ;b t ) # . 
Proof. The proof of this lemma follows that of Lemma 67 line-by-line (with $\widehat{P}=P$), except that we take expectations in all steps.

D.4 Omitted details in Section 5.3

In this section, we analyze Algorithm 26 and prove Theorem 26. The analysis requires $\pi_t(a|x)$ and $B_t(x,a)$ to be defined for all $x,a,t$, but Algorithm 26 only defines them explicitly for states the learner has actually visited. Below, we construct a virtual process that is equivalent to Algorithm 26 but in which all $\pi_t(a|x)$ and $B_t(x,a)$ are well-defined.

Imagine a virtual process in which, at the end of episode $t$ (the moment when $\widehat{\Sigma}^{+}_t$ has been defined), $\textsc{Bonus}(t,x,a)$ is called once for every $(x,a)$, in order from layer $H-1$ down to layer $0$. Within $\textsc{Bonus}(t,x,a)$, other calls $\textsc{Bonus}(t',x',a')$ may be made, but either $t'<t$ or $x'$ lies in a later layer. Therefore, in this virtual process every recursive call returns immediately in the third line of Algorithm 27, because it has been made before and its value is already determined. Given that every $\textsc{Bonus}(t,x,a)$ is called once, $\pi_{t+1}$ is well-defined for all states at the beginning of episode $t+1$, since it only depends on $\textsc{Bonus}(t',x',a')$ with $t'\le t$ and on other quantities that are well-defined before episode $t+1$.

Comparing the virtual process and the real process, the virtual process calculates all entries of $\textsc{Bonus}(t,x,a)$, while the real process only calculates the subset of them needed to construct $\pi_t$ and $\widehat{\Sigma}^{+}_t$. However, the two processes define exactly the same policies as long as the random seed used for each entry of $\textsc{Bonus}(t,x,a)$ is the same in both. Therefore, we can unambiguously define $B_t(x,a)$ as the value returned by $\textsc{Bonus}(t,x,a)$ in the virtual process, and $\pi_t(a|x)$ as in (5.6) with $\textsc{Bonus}(\tau,x,a)$ replaced by $B_\tau(x,a)$.

Now, with the new definition $\widehat{Q}_t(x,a)\triangleq\phi(x,a)^\top\widehat{\theta}_{t,h}$ (for $x\in\mathcal{X}_h$) and $B_t(x,a)$ described above,
\begin{align}
&\sum_{t=1}^{T}\sum_{h=0}^{H-1}\mathbb{E}_{x_h\sim\pi^\star}\Big[\big\langle\pi_t(\cdot|x_h)-\pi^\star(\cdot|x_h),\,Q^{\pi_t}_t(x_h,\cdot)-B_t(x_h,\cdot)\big\rangle\Big]\nonumber\\
&=\underbrace{\sum_{t=1}^{T}\sum_{h=0}^{H-1}\mathbb{E}_{x_h\sim\pi^\star}\Big[\big\langle\pi_t(\cdot|x_h),\,Q^{\pi_t}_t(x_h,\cdot)-\widehat{Q}_t(x_h,\cdot)\big\rangle\Big]}_{\textbf{bias}}\tag{D.13}\\
&\quad+\underbrace{\sum_{t=1}^{T}\sum_{h=0}^{H-1}\mathbb{E}_{x_h\sim\pi^\star}\Big[\big\langle\pi^\star(\cdot|x_h),\,\widehat{Q}_t(x_h,\cdot)-Q^{\pi_t}_t(x_h,\cdot)\big\rangle\Big]}_{\textbf{bias-2}}\nonumber\\
&\quad+\underbrace{\sum_{t=1}^{T}\sum_{h=0}^{H-1}\mathbb{E}_{x_h\sim\pi^\star}\Big[\big\langle\pi_t(\cdot|x_h)-\pi^\star(\cdot|x_h),\,\widehat{Q}_t(x_h,\cdot)-B_t(x_h,\cdot)\big\rangle\Big]}_{\textbf{Reg-term}}.\nonumber
\end{align}
We then bound $\mathbb{E}[\textbf{bias}+\textbf{bias-2}]$ and $\mathbb{E}[\textbf{Reg-term}]$ in Lemma 69 and Lemma 70, respectively.

Lemma 69. If $\beta\le H$, then $\mathbb{E}[\textbf{bias}+\textbf{bias-2}]$ is upper bounded by
\[
\frac{\beta}{4}\,\mathbb{E}\left[\sum_{t=1}^{T}\sum_{h=0}^{H-1}\mathbb{E}_{x_h\sim\pi^\star}\left[\sum_{a}\big(\pi_t(a|x_h)+\pi^\star(a|x_h)\big)\,\|\phi(x_h,a)\|^2_{\widehat{\Sigma}^{+}_{t,h}}\right]\right]
+\mathcal{O}\!\left(\frac{\gamma dH^3T}{\beta}+\epsilon H^2T\right).
\]

Proof of Lemma 69. Consider a specific $(t,x,a)$, and let $h$ be such that $x\in\mathcal{X}_h$.
Then we proceed as E t h Q πt t (x,a)− b Q t (x,a) i =ϕ (x,a) ⊤ θ πt t,h −E t h b θ t,h i =ϕ (x,a) ⊤ θ πt t,h −E t h b Σ + t,h i E t [ϕ (x t,h ,a t,h )L t,h ] (definition of b θ t,h ) 268 =ϕ (x,a) ⊤ θ πt t,h − (γI + Σ t,h ) −1 E t [ϕ (x t,h ,a t,h )L t,h ] +O(ϵH) (by (D.2) of Lemma 63 and that∥ϕ (x,a)∥≤ 1 for all x,a and L t,h ≤H) =ϕ (x,a) ⊤ θ πt t,h − (γI + Σ t,h ) −1 Σ t,h θ πt t,h +O(ϵH) (E[L t,h ] =ϕ (x t,h ,a t,h ) ⊤ θ πt t,h ) =γϕ (x,a) ⊤ (γI + Σ t,h ) −1 θ πt t,h +O(ϵH) (θ πt t,h = (γI + Σ t,h ) −1 (γI + Σ t,h )θ πt t,h ) ≤γ∥ϕ (x,a)∥ 2 (γI+Σ t,h ) −1∥θ πt t,h ∥ 2 (γI+Σ t,h ) −1 +O(ϵH) (Cauchy-Schwarz inequality) ≤ β 4 ∥ϕ (x,a)∥ 2 (γI+Σ t,h ) −1 + γ 2 β ∥θ πt t,h ∥ 2 (γI+Σ t,h ) −1 +O(ϵH) (AM-GM inequality) ≤ β 4 E t ∥ϕ (x,a)∥ 2 b Σ + t,h + γdH 2 β +O (ϵ(H +β)) (D.14) where in the last inequality we use (D.2) again and also∥θ π t,h ∥ 2 ≤dH 2 according to Assumption 8. Taking expectation over x and summing over t,a with weights π t (a|x), we get E[bias]≤ β 4 E " T X t=1 H−1 X h=0 E x h ∼π ⋆ " X a π t (a|x h )∥ϕ (x h ,a)∥ 2 b Σ + t,h ## +O γdH 3 T β +ϵH 2 T ! . (using β≤H) By the same argument, we can show that E t [ b Q t (x,a)−Q πt t (x,a)] is also upper bounded by the right-hand side of (D.14), and thus E[bias-2]≤ β 4 E " T X t=1 H−1 X h=0 E x h ∼π ⋆ " X a π ⋆ (a|x h )∥ϕ (x h ,a)∥ 2 b Σ + t,h ## +O γdH 3 T β +ϵH 2 T ! . Summing them up finishes the proof. Lemma 70. If ηβ≤ γ 12H 2 and η≤ γ 2H , then E[Reg-term] is upper bounded by H ln|A| η + 2ηH 2 E " T X t=1 H−1 X h=0 E x h ∼π ⋆ " X a π t (a|x h )∥ϕ (x h ,a)∥ 2 b Σ + t,h ## 269 + 1 H E " T X t=1 H−1 X h=0 E x h ∼π ⋆ " X a π t (a|x h )B t (x,a) ## +O ηϵH 3 T + ηH 3 γ 2 T 2 ! . Proof of Lemma 70. Again, we will apply the regret bound of the exponential weight algorithm Lemma 62 to each state. We start by checking the required condition: η|ϕ (x,a) ⊤b θ τ,h −B t (x,a)|≤ 1. This can be seen by that η ϕ (x,a) ⊤ b θ τ,h =η ϕ (x,a) ⊤ b Σ + t,h ϕ (x t,h ,a t,h )L t,h ≤η× b Σ + t,h op ×L t,h ≤ ηH γ ≤ 1 2 , ((D.1) and the condition η≤ γ 2H ) and that by the definition of Bonus(t,x,a), we have ηB t (x,a)≤η×H 1 + 1 H H × 2β sup x,a,h ∥ϕ (x,a)∥ 2 b Σ + t,h ≤ 6ηβH γ ≤ 1 2H , (D.15) where the last inequality is by (D.1) again and the condition ηβ≤ γ 12H 2 . Thus, by Lemma 62, we have for any x, E " T X t=1 X a (π t (a|x)−π ⋆ (a|x)) b Q t (x,a) # ≤ ln|A| η + 2ηE " T X t=1 X a π t (a|x) b Q t (x,a) 2 # + 2ηE " T X t=1 X a π t (a|x)B t (x,a) 2 # . (D.16) Thelasttermin(D.16)canbeupperboundedbyE h 1 H P T t=1 P a π t (a|x)B t (x,a) i becauseηB t (x,a)≤ 1 2H as we verified in (D.15). To bound the second term in (D.16), we use the following: for (x,a)∈X h ×A, E t h b Q t (x,a) 2 i ≤H 2 E t h ϕ (x,a) ⊤ b Σ + t,h ϕ (x t,h ,a t,h )ϕ (x t,h ,a t,h ) ⊤ b Σ + t,h ϕ (x,a) i =H 2 E t h ϕ (x,a) ⊤ b Σ + t,h Σ t,h b Σ + t,h ϕ (x,a) i 270 ≤H 2 E t h ϕ (x,a) ⊤ b Σ + t,h Σ t,h (γI + Σ t,h ) −1 ϕ (x,a) i +O ϵH 2 + H 2 γ 2 T 3 ! (∗) ≤H 2 ϕ (x,a) ⊤ (γI + Σ t,h ) −1 Σ t,h (γI + Σ t,h ) −1 ϕ (x,a) +O ϵH 2 + H 2 γ 2 T 3 ! (by (D.2)) ≤H 2 ϕ (x,a) ⊤ (γI + Σ t,h ) −1 ϕ (x,a) +O ϵH 2 + H 2 γ 2 T 3 ! ≤H 2 E t h ϕ (x,a) ⊤ b Σ + t,h ϕ (x,a) i +O ϵH 2 + H 2 γ 2 T 3 ! (by (D.2) again) =H 2 E t ∥ϕ (x,a)∥ 2 b Σ + t,h +O ϵH 2 + H 2 γ 2 T 3 ! where (∗) is because by (D.3) and (D.4),∥(γI +Σ t,h ) −1 − b Σ + t,h ∥ op ≤ 2ϵ and∥ b Σ + t,h Σ t,h ∥ op ≤ 1+2ϵ hold with probability 1− 1 T 3 ; for the remaining probability, we upper boundH 2 ϕ (x,a) ⊤b Σ + t,h Σ t,h b Σ + t,h ϕ (x,a) by H 2 γ 2 . Combining them with (D.16) and taking expectation over states finishes the proof. 
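The condition η|φ(x,a)⊤θ̂_{τ,h} − B_t(x,a)| ≤ 1 verified in the proof above is what licenses the per-state application of the exponential-weights regret bound (Lemma 62). For concreteness, the snippet below sketches one numerically stable exponential-weights step at a single state on the loss Q̂_t(x,·) − B_t(x,·); treating the policy as proportional to the exponential of minus η times the cumulative loss is an assumption about the form of the update in (5.6), which is not reproduced in this appendix.

```python
import numpy as np

def exp_weights_update(cum_loss, q_hat, bonus, eta):
    """One per-state exponential-weights step on the loss q_hat - bonus.

    cum_loss: running sum over past episodes of (q_hat - bonus), shape (A,).
    q_hat, bonus: current estimates Q_hat_t(x, .) and B_t(x, .), shape (A,).
    Returns the updated cumulative loss and the new distribution pi_{t+1}(.|x).
    """
    loss = q_hat - bonus
    assert np.max(eta * np.abs(loss)) <= 1.0, "step size violates eta * |loss| <= 1"
    cum_loss = cum_loss + loss
    # Subtract the minimum before exponentiating for numerical stability.
    logits = -eta * (cum_loss - cum_loss.min())
    pi = np.exp(logits)
    return cum_loss, pi / pi.sum()
```

The assertion mirrors the boundedness condition checked in the proof; without it, the second-order terms in (D.16) need not be controlled.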
With Lemma 69 and Lemma 70, we can now prove Theorem 26. Proof of Theorem 26. Combining Lemma 69 and Lemma 70, we get (under the required conditions of the parameters): E [bias +bias-2 +Reg-term] ≤O H ln|A| η + γdH 3 T β +ϵH 2 T +ηϵH 3 T + ηH 3 γ 2 T 2 ! + 2ηH 2 + β 4 E " T X t=1 H−1 X h=0 E x h ∼π ⋆ " X a π t (a|x h ) +π ⋆ (a|x h ) ∥ϕ (x h ,a)∥ 2 b Σ + t,h ## + 1 H E " T X t=1 H−1 X h=0 E x h ∼π ⋆ " X a π t (a|x h )B t (x h ,a) ## . Weseethat(D.12)issatisfiedinexpectationaslongaswehave 2ηH 2 + β 4 ≤β anddefine b t (x,a)≜ β∥ϕ (x,a)∥ 2 b Σ + t,h +β P a ′π t (a ′ |x)∥ϕ (x,a ′ )∥ 2 b Σ + t,h (for x∈X h ). By the definition of Algorithm 27, (D.11) is also satisfied with this choice of b t (x,a). Therefore, we can apply Lemma 68 to obtain a regret 271 bound. To simply the presentation, we first pick ϵ = 1 H 3 T so that all ϵ-related terms becomeO(1). Then we have E[Reg] = ˜ O H η + γdH 3 T β + ηH 3 γ 2 T 2 +E " T X t=1 H−1 X h=0 E (x h ,a)∼πt [b t (x,a)] #! = ˜ O H η + γdH 3 T β + ηH 3 γ 2 T 2 +βE " T X t=1 H−1 X h=0 E (x h ,a)∼πt ∥ϕ (x,a)∥ 2 b Σ + t,h #! = ˜ O H η + γdH 3 T β + ηH 3 γ 2 T 2 +βE " T X t=1 H−1 X h=0 E (x h ,a)∼πt h ∥ϕ (x,a)∥ 2 (γI+Σ t,h ) −1 i #! ((D.2) and β≤H) = ˜ O H η + γdH 3 T β + ηH 3 γ 2 T 2 +βdHT ! , where the last step uses the fact E t " X h E (x h ,a)∼πt h ∥ϕ (x,a)∥ 2 (γI+Σ t,h ) −1 i # ≤E t " X h E (x h ,a)∼πt ∥ϕ (x,a)∥ 2 Σ −1 t,h # = X h D Σ t,h , Σ −1 t,h E =dH. (D.17) Finally, choosing the parameters under the specified constraints as: γ = (dT ) − 2 3 , β =H(dT ) − 1 3 , ϵ = 1 H 3 T , η = min γ 2H , 3β 8H 2 , γ 12βH 2 , we further bound the regret by ˜ O H 2 (dT ) 2 3 +H 4 (dT ) 1 3 . 272 D.5 Omitted details in Section 5.4 In this section, we analyze our algorithm for linear MDPs. First, we show the main benefit of exploring with the policy cover, that is, it ensures a small magnitude for b t (x,a), as shown below. Lemma 71. If γ≥ 36β 2 δe and βϵ≤ 1 8 , then b k (x,a)≤ 1 for all (x,a) and all k (with high probability). Proof. According to the definition of b k (x,a) (in Algorithm 29), it suffices to show that for x∈K, β∥ϕ (x,a)∥ 2 b Σ + k,h ≤ 1 2 for any a. To do so, note that the GeometricResampling procedure ensures that b Σ + k,h is an estimation of the inverse of γI + Σ mix k,h , where Σ mix k,h =δ e Σ cov h + (1−δ e )E (x h ,a)∼π k [ϕ (x h ,a)ϕ (x h ,a) ⊤ ] (D.18) and Σ cov h = 1 M 0 P M 0 m=1 E (x h ,a)∼πm h ϕ (x h ,a)ϕ (x h ,a) ⊤ i is the covariance matrix of the policy cover π cov . By (D.3), we have with probability at least 1− 1/T 3 , β∥ϕ (x,a)∥ 2 b Σ + k,h ≤β∥ϕ (x,a)∥ 2 (γI+Σ mix k,h ) −1 + 2βϵ≤β∥ϕ (x,a)∥ 2 (γI+Σ mix k,h ) −1 + 1 4 . The first term can be further bounded as β δe ∥ϕ (x,a)∥ 2 ( γ δe I+Σ cov h ) −1 ≤ β δe ∥ϕ (x,a)∥ 2 ( 1 M 0 I+Σ cov h ) −1 , where the last step is because γ δe M 0 ≥ γ δe × δ 2 e 36β 2 ≥ 1 by our condition. Finally, we show that 1 M 0 I + Σ cov h and b Σ cov h are close. Recall the definition of the latter: b Σ cov h = 1 M 0 I + 1 M 0 N 0 M 0 X m=1 mN 0 X t=(m−1)N 0 +1 ϕ (x t,h ,a t,h )ϕ (x t,h ,a t,h ) ⊤ . We now apply Lemma 61 with n =N 0 and X k = 1 M 0 M 0 X m=1 ϕ (x τ(m,k),h ,a τ(m,k),h )ϕ (x τ(m,k),h ,a τ(m,k),h ) ⊤ − 1 M 0 M 0 X m=1 E (x h ,a)∼πm h ϕ (x h ,a)ϕ (x h ,a) ⊤ i , 273 for k = 1,...,N 0 , where τ(m,k)≜ (m− 1)N 0 +k. Note that X 2 k ⪯ I. Therefore, we can pick A k =I and σ = 1. By Lemma 61, we have with probability at least 1−δ, b Σ cov h − 1 M 0 I− Σ cov h op ≤ s 8 log(d/δ) N 0 . 
Following the same proof as (Meng and Zheng, 2010, Theorem 2.1), we have 1 M 0 I + Σ cov h −1 − b Σ cov h −1 op ≤M 2 0 b Σ cov h − 1 M 0 I− Σ cov h op ≤M 2 0 s 8 log(d/δ) N 0 ≤ α 2 . (by our choice of N 0 and M 0 ) Consequently, for any vector ϕ with∥ϕ ∥≤ 1, we have ∥ϕ ∥ 2 1 M 0 I+Σ cov h −1−∥ϕ ∥ 2 b Σ cov h −1 ≤ 1 M 0 I + Σ cov h −1 − b Σ cov h −1 op ≤ α 2 . Therefore, combining everything we have β∥ϕ (x,a)∥ 2 b Σ + k,h ≤ β δ e ∥ϕ (x,a)∥ 2 b Σ cov h −1 + α 2 ! + 1 4 ≤ 3βα 2δ e + 1 4 ≤ 1 2 , where the last two steps use the fact x∈K and the value of α. This finishes the proof. Next, we define the following notations for convenience due to the epoch schedule of our algorithm, and then proceed to prove the main theorem. Definition 11. ℓ k (x,a) = 1 W T 0 +kW X t=T 0 +(k−1)W +1 ℓ t (x,a) Q π k (x,a) =Q π (x,a;ℓ k ) 274 θ π k,h is such that Q π k (x,a) =ϕ (x,a) ⊤ θ π k,h B k (x,a) =b k (x,a) + 1 + 1 H E x ′ ∼P (·|x,a) E a ′ ∼π k (·|x ′ ) [B k (x ′ ,a ′ )] b B k (x,a) =b k (x,a) +ϕ (x,a) ⊤b Λ k,h (for x∈X h ) Proof of Theorem 27. We first analyze the regret of policy optimization after the first T 0 rounds. Our goal is again to prove (D.12) which in this case bounds (T−T 0 )/W X k=1 X h E X h ∋x∼π ⋆ " X a π k (a|x)−π ⋆ (a|x) Q π k k (x,a)−B k (x,a) # . The first step is to separate known states and unknown states. For unknown states, we have (T−T 0 )/W X k=1 X h E X h ∋x∼π ⋆ " 1[x / ∈K] X a π k (a|x)−π ⋆ (a|x) Q π k k (x,a)−B k (x,a) # ≤ (T−T 0 )He W X h E X h ∋x∼π ⋆ [1[x / ∈K]] = ˜ O dH 3 T αW ! , where the first step is by the facts 0≤ Q π k k (x,a)≤ H and 0≤ B k (x,a)≤ (1 + 1 H ) H ×H≤ He (Lemma 71), and the second step applies Lemma 66. For known states, we apply a similar decomposition as previous analysis, but since we also use function approximation for bonus B t (x,a), we need to account for its estimation error, which results in two extra bias terms: (T−T 0 )/W X k=1 X h E X h ∋x∼π ⋆ " 1[x∈K] X a π k (a|x)−π ⋆ (a|x) Q π k k (x,a)−B k (x,a) # = (T−T 0 )/W X k=1 X h E X h ∋x∼π ⋆ " 1[x∈K] X a π k (a|x) Q π k k (x,a)− b Q k (x,a) # | {z } bias + (T−T 0 )/W X k=1 X h E X h ∋x∼π ⋆ " 1[x∈K] X a π ⋆ (a|x) b Q k (x,a)−Q π k k (x,a) # | {z } bias-2 275 + (T−T 0 )/W X k=1 X h E X h ∋x∼π ⋆ " 1[x∈K] X a π k (a|x) b B k (x,·)−B k (x,·) # | {z } bias-3 + (T−T 0 )/W X k=1 X h E X h ∋x∼π ⋆ " 1[x∈K] X a π ⋆ (a|x) B k (x,a)− b B k (x,a) # | {z } bias-4 + (T−T 0 )/W X k=1 X h E X h ∋x∼π ⋆ " 1[x∈K] X a π k (·|x)−π ⋆ (·|x) b Q k (x,·)− b B k (x,·) # | {z } Reg-term . Now we combine the bounds in Lemma 72, Lemma 73, and Lemma 74 (included after this proof). Suppose that the conditions on the parameters specified in Lemma 74 hold. We get E[bias +bias-2 +bias-3 +bias-4 +Reg-term] = ˜ O H η + ηϵH 4 T W + ηH 4 γ 2 T 2 W + γdH 3 T βW + ϵH 3 T W ! + β 2 + 2ηH 3 X k X h E X h ∋x∼π ⋆ " 1[x∈K] X a (π ⋆ (a|x) +π k (a|x))∥ϕ (x,a)∥ 2 b Σ + k,h # + 1 H X k X h E X h ∋x∼π ⋆ " X a π k (a|x)B k (x,a) # ≤ ˜ O H η + ηϵH 4 T W + ηH 4 γ 2 T 2 W + γdH 3 T βW + ϵH 3 T W ! + X k V π ⋆ (x 0 ;b k ) + 1 H X k X h E X h ∋x∼π ⋆ " X a π k (a|x)B k (x,a) # where the last inequality is because β 2 + 2ηH 3 ≤β (implied by η β ≤ 1 20H 4 , a condition specified in Lemma 74). Combining two cases and applying Lemma 68, we thus have E (T−T 0 )/W X k=1 V π k (x 0 ;ℓ k ) − (T−T 0 )/W X k=1 V π ⋆ (x 0 ;ℓ k ) ≤ ˜ O H η + ηϵH 4 T W + ηH 4 γ 2 T 2 W + γdH 3 T βW + ϵH 3 T W ! 276 + ˜ O βE (T−T 0 )/W X k=1 X h E (x h ,a)∼π k ∥ϕ (x h ,a)∥ 2 b Σ + k,h + dH 3 T αW = ˜ O H η + ηϵH 4 T W + ηH 4 γ 2 T 2 W + γdH 3 T βW + ϵH 3 T W + βdHT W + dH 3 T αW ! . 
(by similar calculation as (D.17)) Finally, to get the overall regret, it remains to multiply the bound above by W, add the trivial bound HT 0 = 2HM 0 N 0 =O δ 8 e d 10 H 11 β 8 for the initial T 0 rounds, and consider the exploration probability δ e , which leads to E[Reg] = ˜ O HW η +ηϵH 4 T + ηH 4 γ 2 T 2 + γdH 3 T β +ϵH 3 T +βdHT + dH 3 T α + d 10 H 11 δ 8 e β 8 +δ e HT ! = ˜ O H ηϵ 2 γ 3 + ηH 4 γ 2 T 2 + γdH 3 T β +ϵH 3 T +βdHT + dH 3 Tβ δ e + d 10 H 11 δ 8 e β 8 +δ e HT ! where we use the specified value of M, N, W = 2MN, α, and that ηH≤ 1 (so that the second term ηϵH 4 T is absorbed by the fifth term ϵH 3 T). Considering the constraints in Lemma 74 and Lemma 71, we choose γ = max n 16ηH 4 , 4β 2 δe o . This gives the following simplified regret ˜ O 1 ϵ 2 η 4 H 11 + 1 ηH 4 T 2 + ηdH 7 T β +ϵH 3 T +βdHT + dH 3 Tβ δ e + d 10 H 11 δ 8 e β 8 +δ e HT ! . Choosing δ e optimally, and supposing η≥ 1 T , the above is simplified to ˜ O 1 ϵ 2 η 4 H 11 + ηdH 7 T β +ϵH 3 T +βdHT +H 2 p dβT +d 2 H 35 /9 T 8 /9 ! = ˜ O 1 ϵ 2 η 4 H 11 + ηdH 7 T β +ϵH 3 T +H 2 p dβT +d 2 H 35 /9 T 8 /9 ! . (choosing β≤ H 2 d ) Picking optimal parameters in the last expression, we get ˜ O d 2 H 4 T 14 /15 . 277 Lemma 72. E[bias +bias-2] ≤ β 4 E (T−T 0 )/W X k=1 X h E X h ∋x∼π ⋆ " 1[x∈K] X a π ⋆ (a|x) +π k (a|x) ∥ϕ (x,a)∥ 2 b Σ + k,h # +O γdH 3 T βW + ϵH 3 T W ! . Proof. The proof of this lemma is similar to that of Lemma 69, except that we replace T by (T−T 0 )/W, and consider the averaged loss ℓ k in an epoch instead of the single episode loss ℓ t : E k h Q π k k (x,a)− b Q k (x,a) i =ϕ (x,a) ⊤ θ π k k,h −E k h b θ k,h i =ϕ (x,a) ⊤ θ π k k,h −E k h b Σ + k,h i E k 1 |S ′ k | X t∈S ′ k ((1−Y t ) +Y t H 1[h =h ∗ t ])ϕ (x t,h ,a t,h )L t,h (S ′ k is the S ′ in Algorithm 29 within epoch k) =ϕ (x,a) ⊤ θ π k k,h − γI + Σ mix k,h −1 E k 1 |S ′ k | X t∈S ′ k ((1−Y t ) +Y t H 1[h =h ∗ t ])ϕ (x t,h ,a t,h )L t,h +O(ϵH 2 ) (by Lemma 63 and that∥ϕ (x,a)∥≤ 1 for all x,a and L t,h ≤H; Σ mix k,h is defined in (D.18)) =ϕ (x,a) ⊤ θ π k k,h − γI + Σ mix k,h −1 E k 1 |S ′ k | X t∈S ′ k Σ mix k,h θ π k t,h +O(ϵH 2 ) =ϕ (x,a) ⊤ θ π k k,h − γI + Σ mix k,h −1 E k 1 W kW X t=(k−1)W +1 Σ mix k,h θ π k t,h +O(ϵH 2 ) (S ′ k is randomly chosen from epoch k) =ϕ (x,a) ⊤ θ π k k,h − γI + Σ mix k,h −1 Σ mix k,h θ π k k,h +O(ϵH 2 ) =γϕ (x,a) ⊤ γI + Σ mix k,h −1 θ π k k,h +O ϵH 2 278 ≤ β 4 ∥ϕ (x,a)∥ 2 (γI+Σ mix t,h ) −1 + γ 2 β θ π k k,h 2 (γI+Σ mix k,h ) −1 +O(ϵH 2 ) (AM-GM inequality) ≤ β 4 E k ∥ϕ (x,a)∥ 2 b Σ + k,h + γdH 2 β +O ϵH 2 . The same bound also holds forE k h b Q k (x,a)−Q π k k (x,a) i by the same reasoning. Taking expectation over x, summing over k,h and a (with weights π k (a|x) and π ⋆ (a|x) respectively) finishes the proof. Lemma 73. E[bias-3 +bias-4] ≤ β 4 E (T−T 0 )/W X k=1 X h E X h ∋x∼π ⋆ " 1[x∈K] X a π ⋆ (a|x) +π k (a|x) ∥ϕ (x,a)∥ 2 b Σ + k,h # +O γdH 3 T βW + ϵH 3 T W ! . Proof. The proof is almost identical to that of the previous lemma. The only difference is that L t,h is replaced by D t,h and θ π k t,h is replaced by Λ π k k,h (recall the definition of Λ π k k,h in Section 5.4). Note that b t (x,a)∈ [0, 1] (Lemma 71), so D t,h ∈ [0,He], which is also the same order for L t,h . Therefore, we get the same bound as in the previous lemma. Lemma 74. Let η γ ≤ 1 16H 4 and η β ≤ 1 40H 4 . Then E[Reg-term] = ˜ O H η + ηϵH 4 T W + ηH 4 γ 2 T 2 W ! + 2ηH 3 E X k,h E X h ∋x∼π ⋆ " 1[x∈K] X a π k (x,a)∥ϕ (x,a)∥ 2 b Σ + k,h # + 1 H E X k,h E X h ∋x∼π ⋆ " X a π k (x,a)B k (x,a) # . 279 Proof. 
We first check the condition for Lemma 62: η b Q k (x,a)− b B t (x,a) ≤ 1. In our case, η b Q k (x,a) =η ϕ (x,a) ⊤ b Σ + k,h 1 |S ′ | X t∈S ′ ((1−Y t ) +Y t H 1[h =h ∗ t ])ϕ (x t,h ,a t,h )L t,h ≤η×∥ b Σ + k,h ∥ op ×H× sup t∈S ′ L t,h ≤η× 1 γ ×H 2 (by Lemma 63) ≤ 1 2 (by the condition specified in the lemma) and η b B k (x,a) ≤η|b k (x,a)| +η ϕ (x,a) ⊤ b Σ + k,h 1 |S ′ | X t∈S ′ ((1−Y t ) +Y t H 1[h =h ∗ t ])ϕ (x t,h ,a t,h )D t,h ≤η +η×∥ b Σ + k,h ∥ op ×H× sup t∈S ′ D t,h ≤η +η×∥ b Σ + k,h ∥ op ×H× (H− 1) 1 + 1 H H (Lemma 71) ≤η + 3ηH 2 γ ≤ 4ηH 2 γ ≤ 1 2H . (by the condition specified in the lemma) Now we derive an upper bound forE k h b Q k (x,a) 2 i : E k h b Q k (x,a) 2 i ≤E k 1 |S ′ k | X t∈S ′ k H 2 ϕ (x,a) ⊤ b Σ + k,h ((1−Y t ) +Y t H 1[h =h ∗ t ]) 2 ϕ (x t,h ,a t,h )ϕ (x t,h ,a t,h ) ⊤ b Σ + k,h ϕ (x,a) (∗) =E k h H 2 ϕ (x,a) ⊤ b Σ + k,h (1−δ e )Σ k,h +δ e HΣ cov h b Σ + k,h ϕ (x,a) i ≤H 3 E k h ϕ (x,a) ⊤ b Σ + k,h Σ mix k,h b Σ + k,h ϕ (x,a) i 280 ≤H 3 E k h ϕ (x,a) ⊤ b Σ + k,h Σ mix k,h (γI + Σ mix k,h ) −1 ϕ (x,a) i + ˜ O ϵH 3 + H 3 γ 2 T 3 ! (Lemma 63) ≤H 3 ϕ (x,a) ⊤ (γI + Σ mix k,h ) −1 Σ mix k,h (γI + Σ mix k,h ) −1 ϕ (x,a) + ˜ O ϵH 3 + H 3 γ 2 T 3 ! (Lemma 63) ≤H 3 ϕ (x,a) ⊤ (γI + Σ mix k,h ) −1 ϕ (x,a) + ˜ O ϵH 3 + H 3 γ 2 T 3 ! =H 3 E k ∥ϕ (x,a)∥ 2 b Σ + k,h + ˜ O ϵH 3 + H 3 γ 2 T 3 ! , (D.19) where in (∗) we use 1 |S ′ k | P t∈S ′ k v t 2 ≤ 1 |S ′ k | P t∈S ′ k v 2 t with v t =ϕ (x,a) ⊤ b Σ + k,h ((1−Y t ) +Y t H 1[h =h ∗ t ])ϕ (x t,h ,a t,h )L t,h . Next, we boundE t h b B t (x,a) 2 i : E k h b B k (x,a) 2 i ≤ 2E k h b k (x,a) 2 i + 2E k h (ϕ (x,a) ⊤b Λ k,h ) 2 i ≤ 2E k [b k (x,a)] + 18H 3 E k ∥ϕ (x,a)∥ 2 b Σ + k,h + ˜ O ϵH 3 + H 3 γ 2 T 3 ! ≤ 20H 3 β b k (x,a) + ˜ O ϵH 3 + H 3 γ 2 T 3 ! , where in the second inequality we boundE k h (ϕ (x,a) ⊤b Λ k,h ) 2 i similarly as we boundE k h b Q k (x,a) 2 i in (D.19), except that we replace the upper bound H for L t,h by the upper bound for D t,h : H 1 + 1 H H sup t,x,a b t (x,a)≤ 3H (since b t (x,a)≤ 1 by Lemma 71). Thus, by Lemma 62, we have E[Reg-term] ≤ ˜ O H η + 2η X k,h E X h ∋x∼π ⋆ " 1[x∈K] X a π k (a|x)( b Q k (x,a) 2 + b B k (x,a) 2 ) # 281 ≤ ˜ O H η + ηϵH 4 T W + ηH 4 γ 2 T 2 W ! + 2ηH 3 E X k,h E X h ∋x∼π ⋆ " 1[x∈K] X a π k (x,a)∥ϕ (x,a)∥ 2 b Σ + k,h # + 40ηH 3 β E X k,h E X h ∋x∼π ⋆ " X a π k (a|x)b k (x,a) # ≤ ˜ O H η + ηϵH 4 T W + ηH 4 γ 2 T 2 W ! + 2ηH 3 E X k,h E X h ∋x∼π ⋆ " 1[x∈K] X a π k (x,a)∥ϕ (x,a)∥ 2 b Σ + k,h # + 1 H E X k,h E X h ∋x∼π ⋆ " X a π k (a|x)Bonus k (x,a) # where in the last inequality we use the conditions specified in the lemma and that B k (x,a)≥ b k (x,a). 282 Appendix E Omitted Details in Chapter 6 E.1 Omitted details in Section 6.3 Proof. of Lemma 15 Below, we fix an alg and fix a t ∈ [alg.s,alg.e], and consider the case ∆ [alg.s,t] ≤ρ(t ′ ) as specified in the lemma statement. For the first part of the lemma, note that e g t of MALG is defined as e f alg ′ t where alg ′ is the active instance of ALG at roundt. By Procedure 31, alg ′ can only be an instance that starts within [alg.s,t] (i.e., alg ′ .s≥ alg.s). Therefore, the distribution drift undergone by alg ′ up to t is upper bounded by ∆ [alg.s,t] ≤ρ(t ′ ), which is further upper bounded by ρ(t ′′ ) where t ′′ is the number of active rounds alg ′ runs within [alg.s,t], because ρ(·) is a decreasing function. Therefore, the conditions in Assumption 11 is satisfied for this alg ′ , and thus we have e g t = e f alg ′ t ≥ min τ≤t: alg ′ is active at τ f ⋆ τ − ∆ [alg ′ .s,t] ≥ min τ∈[alg.s,t] f ⋆ τ − ∆ [alg.s,t] , proving the first part. 
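Before continuing to the second part, it may help to recall how Procedure 31 schedules instances of different orders within a block, since the counting in the remaining parts rests on it. The sketch below reproduces only the scheduling probabilities used in this proof (an order-m instance may start at any round t with (t − t_n) mod 2^m = 0, with probability ρ(2^n)/ρ(2^m)); the rule for selecting the active instance is an illustrative assumption, not a quotation of Procedure 31.

```python
import numpy as np

def schedule_block(n, t_n, rho, rng=None):
    """Illustrative multi-scale scheduling for one block of length 2**n.

    At every round t with (t - t_n) % 2**m == 0, an order-m instance covering
    [t, t + 2**m - 1] starts with probability rho(2**n) / rho(2**m); an order-n
    instance always covers the whole block.  Returns (start, end, order) triples.
    """
    rng = np.random.default_rng() if rng is None else rng
    instances = [(t_n, t_n + 2 ** n - 1, n)]            # the base, order-n instance
    for m in range(n):                                   # shorter orders m = 0..n-1
        p = rho(2 ** n) / rho(2 ** m)                    # <= 1 since rho is decreasing
        for t in range(t_n, t_n + 2 ** n, 2 ** m):
            if rng.random() < p:
                instances.append((t, t + 2 ** m - 1, m))
    return instances

def active_instance(instances, t):
    # Among instances whose interval covers t, pick the most recently started one
    # (ties broken toward the smaller order).  This selection rule is an
    # illustrative assumption for the sketch only.
    covering = [inst for inst in instances if inst[0] <= t <= inst[1]]
    return max(covering, key=lambda inst: (inst[0], -inst[2]))
```

With this scheduling, the expected number of order-m starts within t′ rounds of a block is (ρ(2^n)/ρ(2^m)) · t′/2^m up to rounding, matching the bound (E.3) used below.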
Next, we prove the second part of the lemma. We use S m to denote the set of order-m instances which start within [alg.s,t]. Note that t X τ=alg.s (e g τ −R τ ) = t X τ=alg.s n X m=0 X alg ′ ∈Sm 1[alg ′ is active at τ] e f alg ′ τ −R τ 283 = n X m=0 X alg ′ ∈Sm t X τ=alg.s 1[alg ′ is active at τ] e f alg ′ τ −R τ | {z } (∗) . (E.1) The first equality holds because e g τ of MALG is defined as the e f τ of the active instance at round t. Next, we focus on a specific m, and bound the (∗) term in (E.1). Let|S m | = ℓ and S m = alg ′ 1 ,...,alg ′ ℓ , and letI i ≜ [alg ′ i .s,alg ′ i .e]∩ [alg.s,t] for i = 1,...,ℓ (i.e.,I i are the rounds within [alg.s,t] where alg ′ i is scheduled). Clearly,|I i |≤ min{alg ′ i .e−alg ′ i .s + 1,t−alg.s + 1} = min{2 m ,t ′ }. By Assumption 11, we have (∗) = ℓ X i=1 t X τ=alg.s 1[alg ′ i is active at τ] e f alg ′ i τ −R τ ≤ ℓ X i=1 (C(|I i |) +|I i |∆ I i ) ≤ℓC(min{2 m ,t ′ }) +t ′ ∆ [alg.s,t] , (E.2) where in the first inequality we use Assumption 11, and that alg ′ i updates for no more than |I i | rounds in the interval [alg.s,t] (also, the condition in Assumption 11 is satisfied because ∆ I i ≤ ∆ [alg.s,t] ≤ ρ(t ′ )≤ ρ(|I i |)). In the last inequality, for the first term, we use that C(·) is increasing; for the second term, we use|I i |≤t ′ , and that ∆ I 1 +··· + ∆ I ℓ ≤ ∆ [alg.s,t] sinceI 1 ,...,I ℓ are non-overlapping intervals lying within [alg.s,t]. By Procedure 31, for every m, the expected number of order-m ALG’s that starts within the interval [alg.s,t] can be upper bounded as E[|S m |]≤ ρ(2 n ) ρ(2 m ) t ′ 2 m ≤ ρ(2 n ) ρ(2 m ) t ′ 2 m + 1 ≤ ρ(2 n ) ρ(2 m ) t ′ 2 m + 1 (E.3) 284 By Bernstein’s inequality, with probability 1− δ T , |S m | ≤ E[|S m |] + p 2E[|S m |] log(T/δ) + log(T/δ)≤ 2E[|S m |] + 2 log(T/δ). Thus, continuing from (E.2), we have with probability at least 1− δ T , (∗)≤ 2· ρ(2 n ) ρ(2 m ) t ′ 2 m + 1 C(min{2 m ,t ′ }) + 2 log(T/δ)C(min{2 m ,t ′ }) +t ′ ∆ [alg.s,t] ≤ 2 C(t ′ ) C(2 m ) + 2 log(T/δ)C(min{2 m ,t ′ }) +t ′ ∆ [alg.s,t] (ρ(2 n )t ′ ≤ρ(t ′ )t ′ =C(t ′ )) ≤ 6C(t ′ ) log(T/δ) +t ′ ∆ [alg.s,t] (C(·) is an increasing function) (E.4) Finally, using this in (E.1), we get the second claim of the lemma: with probability at least 1− δ T , t X τ=alg.s (e g τ −R τ )≤ 6(n + 1)C(t ′ ) log(1/δ) +t ′ (n + 1)∆ [alg.s,t] . (E.5) For the third part of the lemma, as we calculated above, with probability at least 1− δ T , the number of instances started within [alg.s,t] is upper bounded by n X m=0 2· ρ(2 n ) ρ(2 m ) t ′ 2 m + 2 log(T/δ)≤ 2b n C(t ′ ) C(1) + 2 log(T/δ)≤ 6b n C(t ′ ) C(1) log(T/δ) where we use ρ(2 m )2 m =C(2 m )≥C(1) and ρ(2 n )t ′ ≤ρ(t ′ )t ′ =C(t ′ ). E.2 Omitted details in Section 6.4 E.2.1 Single-block regret analysis I In this section, we focus on the regret in a block of index n (Lemma 16). 285 For the purpose of conducting analysis, we divide [t n ,t n + 2 n − 1] into consecutive intervals I 1 = [s 1 ,e 1 ],I 2 = [s 2 ,e 2 ],...,I K = [s K ,e K ] (s 1 =t n ,e i + 1 =s i+1 ,e K =t n + 2 n − 1) in a way such that for all i: ∆ I i ≤ρ(|I i |) (E.6) One simple way to divide the intervals is to let ∆ I i = 0 in eachI i . Then the number of intervals K would be upper bounded by the number of stationary intervals within [t n ,t n + 2 n − 1]. Intuitively, the number of intervals can also be related to ∆ [tn,tn+2 n −1] . We defer the calculation of the required number of intervals to Lemma 77. For now, we only need the fact that the partition satisfies (E.6). 
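The sketch below illustrates one greedy way to construct such a partition, assuming the interval drift decomposes as a sum of hypothetical per-round drifts (consistent with (E.12)); the analysis only needs that some partition satisfying (E.6) exists, and Lemma 77 later bounds how many intervals this greedy construction can produce.

```python
def greedy_partition(drift, rho):
    """Greedily split rounds 0..T-1 into intervals I with Delta_I <= rho(|I|).

    drift[tau] is the (assumed) per-round drift between rounds tau and tau+1,
    so that Delta_[s,e] = sum(drift[s:e]); rho is the decreasing rate function.
    Each interval is extended until adding one more round would violate the
    test, mirroring the construction used in the proof of Lemma 77.
    """
    T = len(drift) + 1                        # number of rounds 0..T-1
    intervals, s = [], 0
    while s < T:
        e, total = s, 0.0                     # current interval [s, e]; total = Delta_[s,e]
        # Extend while appending round e+1 keeps Delta_[s,e+1] <= rho(new length).
        while e + 1 < T and total + drift[e] <= rho(e + 2 - s):
            total += drift[e]
            e += 1
        intervals.append((s, e))
        s = e + 1
    return intervals
```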
From a high level, this partition makes the distribution in each interval close to stationary. Notice that this partition is independent of the learner’s behavior in block n. For convenience, we further define the following quantities that depend on the learner’s behavior in block n: Definition 12. Define E n as the index of the last round in block n. Since the block might terminate earlier than planned, we have E n ≤ t n + 2 n − 1. Let ℓ∈ [K] be such that E n ∈I ℓ (that is, ℓ is the index of the interval where block n ends). Define e ′ i = min{e i ,E n } andI ′ i = [s i ,e ′ i ] (therefore, I ′ i =∅ for i>ℓ). Recall the definition of b n and b ρ(t) from Lemma 15. For simplicity, we define G m ≜ ρ(2 m ), b α m ≜ b ρ(2 m ), and also b C(t)≜tb ρ(t). Furthermore, we define the following technical quantities. Definition 13. For every i∈{1,...,K}, and every m∈{0, 1,...,n}, define τ i (m) = min τ∈I ′ i : f ⋆ τ −e g τ ≥ 12b α m ; 286 that is, τ i (m) is the first time τ inI ′ i =I i ∩ [t n ,E n ] such that f ⋆ τ −e g τ exceeds 12b α m . If such τ does not exist orI ′ i is empty, we let τ i (m) =∞. Besides, we define ξ i (m) = [e ′ i −τ i (m) + 1] + where [a] + = max{0,a} (which is the length of the interval [τ i (m),e ′ i ] when τ i (m) is not∞). The intuition for τ i (m) and ξ i (m) is as follows. Suppose that block n has not ended at τ. If there exists some τ∈I i such that f ⋆ τ −e g τ ≥ 12b α m (which first happens at τ i (m)), and ifI i is long enough (i.e., ξ i (m) is large enough) so that after τ i (m), an order-m instance of ALG can run entirely withinI i , then the learner is able to discover the fact that f ⋆ τ −e g τ is large, and then restart. This coincides with our explanation in Figure 6.1. The derivation in this section will formalize this intuition. Lemma 75. Let the high-probability events described in Lemma 15 hold. Then with high probability, En X τ=tn (e g τ −R τ )≤ 4 b C(2 n ), En X τ=tn (f ⋆ τ −e g τ )≤ 96b n ℓ X i=1 b C(|I ′ i |) + 60 n X m=0 G m G n b C(2 m ) log(T/δ) (notations are defined at the beginning of this section). Proof. P En τ=tn (e g τ −R τ ) is trivially upper bounded by 3 b C(E n −t n + 1) + 1≤ 4 b C(2 n ) because it is guarded by Test 2. Below we focus on the second claim. Note that we can write for all i = 1,...,K, X τ∈I ′ i (f ⋆ τ −e g τ ) ≤ 12 X τ∈I ′ i 1 h f ⋆ τ −e g τ ≤ 12b α n i b α n + n X m=1 1 h 12b α m <f ⋆ τ −e g τ ≤ 12b α m−1 i b α m−1 +1 h f ⋆ τ −e g τ > 12b α 0 i 1 ! ≤ 12 |I ′ i |b α n + n X m=1 b α m−1 ξ i (m) +ρ(1)ξ i (0) ! (ρ(1)≥ 1 by Assumption 11) 287 ≤ 12|I ′ i |b α n + 24 n X m=0 b α m ξ i (m) (b α m = b C(2 m ) 2 m ≤ b C(2 m+1 ) 2 m = 2b α m+1 ) where in the second-to-last inequality we use P τ∈I ′ i 1 f ⋆ τ −e g τ ≥ 12b α m = P τ∈[τ i (m),e ′ i ] 1 f ⋆ τ −e g τ ≥ 12b α m ≤ξ i (m) by the definition of τ i (m). Summing the above over intervals i and notice that P ℓ i=1 |I ′ i |≤ 2 n , we get En X τ=tn (f ⋆ τ −e g τ )≤ 12· 2 n b α n + 24 n X m=0 ℓ X i=1 b α m ξ i (m) = 12 b C(2 n ) + 24 n X m=0 ℓ X i=1 b α m ξ i (m). (E.7) Next, we upper bound P ℓ i=1 b α m ξ i (m) for each m. ℓ X i=1 b α m ξ i (m) = ℓ X i=1 b α m min{ξ i (m), 4· 2 m } + ℓ X i=1 b α m [ξ i (m)− 4· 2 m ] + . (E.8) (using a = min{a,b} + [a−b] + ) The first term on the right-hand side of (E.8) can be bounded as below: ℓ X i=1 b α m min{ξ i (m), 4· 2 m }≤ 4 ℓ X i=1 b ρ(2 m )× min{ξ i (m), 2 m } ≤ 4 ℓ X i=1 b ρ(min{ξ i (m), 2 m })× min{ξ i (m), 2 m } (b ρ(·) is a decreasing function) = 4 ℓ X i=1 b C(min{ξ i (m), 2 m }) ≤ 4 ℓ X i=1 b C(|I ′ i |). 
( b C(·) is an increasing function) The second term on the right-hand side of (E.8) is bounded using Lemma 76 below. Combining them into (E.7) finishes the proof. 288 Lemma 76. Let the high probability events described in Lemma 15 hold. Then with high probability, ℓ X i=1 b α m [ξ i (m)− 4· 2 m ] + ≤ 2G m G n b C(2 m ) log(T/δ). Proof. Using the fact that [[a] + −b] + = [a−b] + when b≥ 0, we have [ξ i (m)− 4· 2 m ] + = e ′ i −τ i (m) + 1− 4· 2 m + . (E.9) Next, weconsiderthefollowingquantity: “thenumberofroundsintheinterval [τ i (m),e ′ i −2·2 m B] which are candidate starting points of an order-m ALG”. By Procedure 31, this quantity can be written and lower bounded as A i ≜ X t∈I i 1 h t∈ [τ i (m), e ′ i − 2· 2 m ], (t−t n ) mod 2 m = 0 i ≥ [e ′ i −τ i (m) + 1− 4· 2 m ] + 2 m where we use the fact in an interval of length w, there are at least w+2−2u u points whose indices are multiples of u. Notice that the right-hand side is related to what we want to upper bound in the lemma according to (E.9). Thus we continue to upper bound the left-hand side above. We define the following events: W t ={τ i (m)≤t≤e i − 2· 2 m where i is such that t∈I i }, X t ={t≤E n − 2· 2 m }, Y t ={t≤E n and (t−t n ) mod 2 m = 0}, Z t ={∃ order-m alg such that alg.s =t}, V t ={∃τ∈ [t n ,t] such that W τ ∩Y τ ∩Z τ }. 289 Then we can write (recall the definition of K in the beginning of this section) ℓ X i=1 A i = K X i=1 A i = tn+2 n −1 X t=tn 1[W t ,X t ,Y t ]≤ tn+2 n −1 X t=tn 1[W t ,Y t ,V t ] | {z } term 3 + tn+2 n −1 X t=tn 1[X t ,V t ] | {z } term 4 For term 3 , notice that conditioned onW t ∩Y t , the eventZ t happens with a constant probability Gn Gm (by Procedure 31). Therefore, term 3 counts the number of trials up to the first success in a repeated trial with success probabiliy Gn Gm . Therefore, with probability 1− δ T , term 3 ≤ 1 + log(T/δ) − log 1− Gn Gm ≤ 2Gm Gn log(T/δ). Next, we deal with term 4 . Below we show that term 4 = 0. The eventV t implies that there exists some order-m alg which starts at alg.s =t ⋆ , where t ⋆ ≤t and τ i (m)≤t ⋆ ≤e i − 2· 2 m . Therefore, we have alg.e = alg.s + 2 m − 1 =t ⋆ + 2 m − 1≤e i − 2 m − 1<e i , and thus [alg.s,alg.e]⊆I i . Together with X t , the event V t ∩X t implies that alg.e = alg.s + 2 m − 1≤t + 2 m − 1<E n , and therefore, and time alg.e, block n has not ended. Since at time alg.e, block n is still on-going, the learner performs Test 1. By Lemma 15, with high probability, we have 1 2 m alg.e X τ=alg.s R τ ≥ 1 2 m alg.e X τ=alg.s e g τ −b α m −b n∆ [alg.s,alg.e] (Lemma 15) ≥ min τ∈I i f ⋆ τ −b α m − (b n + 1)∆ I i (because [alg.s,alg.e]⊆I i ) ≥f ⋆ τ i (m) −b α m − (b n + 3)∆ I i (| min τ∈I i f ⋆ τ −f ⋆ τ i (m) |≤ 2∆ I i ) ≥e g τ i (m) + 12b α m − 2b α m (by the definition of τ i (m) and ∆ I i ≤ρ(|I i |)≤ρ(2 m )≤ b αm 6b n ) ≥U alg.e + 10b α m (Because alg.e≥τ i (m), U alg.e ≤e g τ i (m) by the algorithm) 290 This should trigger the restart at time alg.e<E n , contradicting the definition of E n . Therefore, 1[X t ,V t ] = 0. Finally, combining all previous arguments, we have that with high probability, ℓ X i=1 b α m [ξ i (m)− 4· 2 m ] + = ℓ X i=1 b α m e ′ i −τ i (m) + 1− 4· 2 m + ≤ b α m 2 m ℓ X i=1 A i = b C(2 m ) ℓ X i=1 A i ≤ 2G m G n b C(2 m ) log(T/δ), finishing the proof. E.2.2 Single-block regret analysis II In Appendix E.2.1, we have derived the regret bound in a single block for both the standard setting and the infinite-horizon MDP setting (Lemma 16). 
They are both of the form X τ∈J (f ⋆ τ −R τ ) = ˜ O ℓ X i=1 C(|I ′ i |) + n X m=0 ρ(2 n ) ρ(2 m ) C(2 m ) ! . (E.10) (replacing C(·) and ρ(·) by C UCRL (·;D) and ρ(·;D) for the case of infinite-horizon MDP). In this section, we further derive more concrete dynamic regret bounds for both cases by assuming that C(·) is of some specific form. The form of C(·) we consider in this section is defined as follows: Definition 14. We define a form of C(t) as C(t) = min{c 1 t p +c 2 ,c 3 t} for some p∈ [ 1 2 , 1) and some c 1 , c 2 , c 3 (c 3 ≥ 1) that capture dependencies on log(T/δ) and other problem-dependent constants. In fact, usually, a regret bound is only written in the form of c 1 t p +c 2 . However, since the reward is bounded between 0 and 1, the regret bound of min{c 1 t p +c 2 ,t} is also trivially correct. Definition 14 is slightly more general than this by allowing a coefficient c 3 ≥ 1 (the regret bound would still be trivially correct). In some cases, we make c 1 ,c 2 ,c 3 larger than their tightest possible 291 values to make the final regret bound better — notice that the choice of c 1 ,c 2 ,c 3 affects the probability specified in Procedure 31, and thus smaller c 1 ,c 2 ,c 3 does not necessarily make the final regret bound smaller. This subtle issue can be observed from the analysis. To get a concrete bound, we also need to decide the number ℓ in the single-block regret bound above. In Appendix E.2.1, we have stated the condition (i.e., (E.6)) that should be satisfied by I ′ 1 ,...,I ′ ℓ (orI 1 ,...,I K ). In the next lemma, we upper bound the value of ℓ that is required to fulfill the condition. Lemma 77. LetJ = [t n ,E n ]. Then we have ℓ≤L J . Furthermore, if C(t) is in the form specified in Definition 14, we also have ℓ≤ 1 + 2 c −1 1 ∆ J |J| 1−p 1 2−p +c −1 3 ∆ J . Proof. The fact that ℓ≤L J is straightforward to see (and has been explained in Appendix E.2.1): to satisfy the condition (E.6), one way to divide the block is to make eachI i a stationary interval, which makes ∆ I i = 0 for all i∈ [K]. This way of division leads to ℓ≤L J . For the second claim, we follow the same procedure as decribed in the proof of Lemma 5 in (Chen et al., 2019). Basically, the procedure divides [t n ,t n + 2 n − 1] in a greedy way, making all I i = [s i ,e i ] satisfy ∆ [s i ,e i ] ≤ ρ(e i −s i + 1) and ∆ [s i ,e i +1] > ρ(e i −s i + 2) for all i∈ [K− 1] (i.e., except for the last interval). Then we have ∆ J ≥ ℓ−1 X i=1 ∆ [s i ,e i +1] (by the definition of ∆ [·,·] ) > ℓ−1 X i=1 ρ(e i −s i + 2) ≥ ℓ−1 X i=1 min n c 1 (e i −s i + 2) p−1 ,c 3 o (by Definition 14) ≥ ℓ−1 X i=1 min 1 2 c 1 (e i −s i + 1) p−1 ,c 3 ((x + 2) p−1 ≥ (2(x + 1)) p−1 ≥ 1 2 (x + 1) p−1 for any x≥ 0 and p≤ 1) 292 = 1 2 ℓ 1 X i=1 c 1 (e i −s i + 1) p−1 + ℓ 2 X i=1 c 3 where in the last equality we separate the intervals where min n 1 2 c 1 (e i −s i + 1) p−1 ,c 3 o takes the former or the latter value. Note that ℓ 1 +ℓ 2 =ℓ− 1. The above inequality implies that ∆ J upper bounds both 1 2 P ℓ 1 i=1 c 1 (e i −s i + 1) p−1 and P ℓ 2 i=1 c 3 . Thus, ℓ 2 ≤c −1 3 ∆ J , and by Hölder’s inequality, ℓ 1 ≤ ℓ 1 X i=1 (e i −s i + 1) p−1 1 2−p ℓ 1 X i=1 (e i −s i + 1) 1−p 2−p ≤ 2∆ J c 1 1 2−p |J| 1−p 2−p . Combining them finishes the proof. In the following Lemma 78, we bound the regret within a block by combining (E.10) and Lemma 77. We will frequently use the following two properties: let{S 1 ,S 2 ,...,S K } be a partition of the intervalS. Then K X i=1 L S i ≤L S + (K− 1), (E.11) K X i=1 ∆ S i ≤ ∆ S . 
(E.12) They can be derived using the definitions of L [·,·] and ∆ [·,·] . Lemma 78. If C(t) is of the form specified in Definition 14, then En X τ=tn (f ⋆ τ −R τ )≤ ˜ O min n Reg L (J ),Reg ∆ (J ) o +c 1 2 np + c 2 c 3 c 1 2 n(1−p) + c 2 2 c 3 ! , 293 where Reg L (J )≜c 1 L 1−p J |J| p +c 2 L J and Reg ∆ (J )≜ c 1 ∆ 1−p J |J| 1 2−p +c 1 |J| p +c 1 (c −1 3 ∆ J ) 1−p |J| p +c 2 c −1 1 ∆ J |J| 1−p 1 2−p +c 2 +c 2 c −1 3 ∆ J . Proof. We bound each term in (E.10) using Definition 14. First, notice that ˜ O ℓ X i=1 C(|I ′ i |) ! = ˜ O ℓ X i=1 min c 1 |I ′ i | p +c 2 , c 3 t ! ≤ ˜ O ℓ X i=1 (c 1 |I ′ i | p +c 2 ) ! ≤ ˜ O c 1 ℓ 1−p |J| p +c 2 ℓ . (E.13) Using the first upper bound for ℓ given in Lemma 77, (E.13) can be bounded by ˜ O (Reg L (J )); using the second upper bound, (E.13) can be bounded by ˜ O (Reg ∆ (J )). Next, we have ˜ O ρ(2 m ) ρ(2 n ) C(2 m ) = ˜ O c 1 2 np + c 2 c 3 c 1 2 n(1−p) + c 2 1 c 3 2 m(2p−1) + c 2 2 c 3 2 −m ! . by Lemma 79 below. Notice that because c 3 ≥ 1 and p≥ 1 2 , c 2 1 c 3 2 m(2p−1) ≤c 2 1 2 n(2p−1) ≤c 1 2 np when c 1 ≤ 2 n(1−p) . This is indeed the regime we care about since if c 1 > 2 n(1−p) then the first term c 1 2 np > 2 n , which is a vacuous bound for the regret of block n. Therefore, we can drop this term. Thus, the dynamic regret in block n can be summarized as the following based on (E.10): ˜ O min n Reg L (J ),Reg ∆ (J ) o +c 1 2 np + c 2 c 3 c 1 2 n(1−p) + c 2 2 c 3 ! , (E.14) finishing the proof. Lemma 79. Let C(t) be of the form in Definition 14. Then ρ(2 m ) ρ(2 n ) C(2 m ) =O c 1 2 np + c 2 c 3 c 1 2 n(1−p) + c 2 1 c 3 2 m(2p−1) + c 2 2 c 3 2 −m ! . 294 Proof. This is by direct calculation: ρ(2 m ) ρ(2 n ) C(2 m ) = C(2 m ) 2 C(2 n ) 2 n−m =O min{c 2 1 2 2mp +c 2 2 , c 2 3 2 2m } c 1 2 np +c 2 2 n−m + min{c 2 1 2 2mp +c 2 2 , c 2 3 2 2m } c 3 2 n 2 n−m ! =O min ( c 1 2 np 2 (n−m)(1−2p) + c 2 2 c 1 2 n(1−p)−m , c 2 3 c 1 2 n(1−p)+m ) + c 2 1 c 3 2 m(2p−1) + c 2 2 c 3 2 −m ! =O c 1 2 np + min ( c 2 2 c 1 2 n(1−p)−m , c 2 3 c 1 2 n(1−p)+m ) + c 2 1 c 3 2 m(2p−1) + c 2 2 c 3 2 −m ! =O c 1 2 np + c 2 c 3 c 1 2 n(1−p) + c 2 1 c 3 2 m(2p−1) + c 2 2 c 3 2 −m ! . E.2.3 Single-epoch regret analysis We call [t 0 ,E] an epoch if t 0 is the first step after restart (or t 0 = 1), and E is the first time after roundt 0 when the restart is triggered. In this section, we continue the discussion in Appendix E.2.2 and bound the regret in a single epoch. Recall that the we consider cases where the single-block regret can be written as (E.10) and C(·) is in the form of Definition 14. This holds both for the case of the standard setting and the infinite-horizon MDP setting. Lemma 80. LetE be an epoch. Then X τ∈E ≤ ˜ O min Reg L (E),Reg ∆ (E) + c 2 c 3 c 1 |E| 1−p + c 2 2 c 3 ! (Reg L (·) and Reg ∆ (·) are defined in Lemma 78) 295 Proof. LetE be an epoch whose last block is indexed by n. Then|E| = Θ(2 n ). LetJ 1 ,...,J n be blocks inE. Then by Lemma 78, the dynamic regret inE is upper bounded by ˜ O min ( n X m=0 Reg L (J m ), n X m=0 Reg ∆ (J m ) ) +c 1 n X m=0 2 mp + c 2 c 3 c 1 n X m=0 2 m(1−p) + n X m=0 c 2 2 c 3 ! . By Hölder’s inequality, n X m=0 Reg L (J m ) =c 1 n X m=0 L Jm ! 1−p n X m=0 |J m | ! p +c 2 n X m=0 L Jm ≤c 1 (L E +n) 1−p |E| p +c 2 (L E +n) (using (E.11)) ≤ ˜ O c 1 L 1−p E |E| p +c 2 L E = ˜ O (Reg L (E)) (because n =O(logT ) = ˜ O(1)) Similarly, P n m=0 Reg ∆ (J m ) = ˜ O (Reg ∆ (E)). 
On the other hand,c 1 P n m=0 2 mp + c 2 c 3 c 1 P n m=0 2 m(1−p) + P n m=0 c 2 2 c 3 = ˜ O c 1 2 np + c 2 c 3 c 1 2 n(1−p) + c 2 2 c 3 = ˜ O c 1 |E| p + c 2 c 3 c 1 |E| 1−p + c 2 2 c 3 . In summary, the dynamic regret within an epoch is of order ˜ O min Reg L (E),Reg ∆ (E) + c 2 c 3 c 1 |E| 1−p + c 2 2 c 3 ! (E.15) (the c 1 |E| p term is absorbed into min{Reg L (E),Reg ∆ (E)}). E.2.4 Proof of Theorem 28 We are now ready to prove Theorem 28 after showing the following two lemmas. Lemma 81. Let t be in an epoch starting from t 0 . If ∆ [t 0 ,t] ≤ρ(t−t 0 + 1), then with high probability, no restart would be triggered at time t. 296 Proof. We first verify that Test 1 would not fail with high probability. Let t = alg.e where alg is any order-m ALG in block n. Then with high probability, U t = min τ∈[tn,t] e g τ ≥ min τ∈[tn,t] f ⋆ τ − ∆ [tn,t] (by Lemma 15) ≥ 1 2 m X τ∈[alg.s,t] f ⋆ τ − 3∆ [tn,t] ([alg.s,t]⊆ [t n ,t]) ≥ 1 2 m X τ∈[alg.s,t] R τ − 2 s log(T/δ) 2 m − 3ρ(t−t 0 + 1) (E[R τ ] =E[f τ (π t )]≤f ⋆ τ and we use Azuma’s inequality) ≥ 1 2 m X τ∈[alg.s,t] R τ −b ρ(2 m )− 3ρ(t−t 0 + 1) (By Assumption 11, b ρ(2 m )≥ 6 log(T/δ)ρ(2 m )≥ 6 log(T/δ) q 1 2 m ) ≥ 1 2 m X τ∈[alg.s,t] R τ − 2b ρ(2 m ). (ρ(t−t 0 + 1)≤ρ(2 m ) because ρ(·) is decreasing) So with high probability, Test 1 will not return fail. Furthermore, by Lemma 15, with high probability, 1 t−t n + 1 t X τ=tn (e g τ −R τ )≤ b ρ(t−t n + 1) + ∆ [tn,t] ≤ 2b ρ(t−t n + 1). Therefore, with high probability, Test 2 will not return fail either. Lemma 82. With high probability, the number of epochs is upper bounded by L. If C(·) is in the form of Definition 14, the number of epochs is also upper bounded by 1 + 2 c −1 1 ∆ T 1−p 1 2−p +c −1 3 ∆ . Proof. By Lemma 81, if [t 0 ,E] is not the last epoch, then ∆ [t 0 ,E] >ρ(E−t 0 +1) with high probability. Then following the exact same arguments as in Lemma 77 proves the lemma. 297 Proof of Theorem 28. If C(t) =c 1 t p +c 2 satisfies Assumption 11, then C(t) = min{c 1 t p +c 2 ,t} also satisfies it (since the reward is bounded in [0, 1]). Below we use C(t) = min{c 1 t p +c 2 ,t} as the input to our algorithm. Notice that this is in the form of Definition 14 with c 3 = 1. LetE 1 ,...,E N be epochs in [1,T ]. Then by Lemma 80, the dynamic regret in [1,T ] is upper bounded by ˜ O min ( N X i=1 Reg L (E i ), N X i=1 Reg ∆ (E i ) ) + c 2 c 1 N X i=1 |E i | 1−p +c 2 2 N ! . (E.16) By Hölder’s inequality and (E.11), N X i=1 Reg L (E i )≤ ˜ O c 1 (L +N− 1) 1−p T p +c 2 (L +N− 1) ≤ ˜ O c 1 L 1−p T p +c 2 L , where in the last inequality we use Lemma 82 to bound N. Similarly, N X i=1 Reg ∆ (E i ) ≤ ˜ O c 1 ∆ 1−p T 1 2−p +c 1 N 1−p T p +c 1 ∆ 1−p T p +c 2 c −1 1 ∆ T 1−p 1 2−p +c 2 N +c 2 ∆ ≤ ˜ O c 1 ∆ 1−p T 1 2−p +c 1 T p +c 1 ∆ 1−p T p +c 2 c −1 1 ∆ T 1−p 1 2−p +c 2 +c 2 ∆ . (using Lemma 82 to bound N) Then we deal with the second term in (E.16): c 2 c 1 N X i=1 |E i | 1−p ≤ c 2 c 1 N p T 1−p , 298 which can be either bounded by ˜ O c 2 c 1 L p T 1−p or ˜ O c 2 c 1 T 1−p + c 2 c 1 c −p 1 ∆ p T 2−2p 1 2−p + c 2 c 1 ∆ p T 1−p using the upper bound for N in Lemma 82. Finally, the third term in (E.16) can be upper bounded either by ˜ O c 2 2 L or ˜ O c 2 2 +c 2 2 c −1 1 ∆ T 1−p 1 2−p +c 2 2 ∆ . With all terms expanded, below, we collect the dominant terms for the cases of p = 1 2 andp> 1 2 . We say term a(T ) is dominated by b(T ) if lim T→∞ a(T )/b(T ) = 0 under any sublinear growth rate of L or ∆ (e.g., √ ∆ T is dominated by ∆ 1 /3 T 2 /3 and L is dominated by √ LT). 
And below we only write down terms that are not dominated by other terms.

The case for $p = \frac{1}{2}$:
\[
\tilde{\mathcal{O}}\left( \min\left\{ \Big(c_1 + \tfrac{c_2}{c_1}\Big)\sqrt{LT},\;\; \Big(c_1^{2/3} + c_2 c_1^{-4/3}\Big)\Delta^{1/3} T^{2/3} + \Big(c_1 + \tfrac{c_2}{c_1}\Big)\sqrt{T} \right\} \right);
\]

The case for $p > \frac{1}{2}$:
\[
\tilde{\mathcal{O}}\left( \min\left\{ c_1 L^{1-p} T^{p},\;\; c_1 \Delta^{1-p} T^{\frac{1}{2-p}} + c_1 T^{p} \right\} \right).
\]

This finishes the proof.
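To make the two tests referenced throughout this analysis concrete, the sketch below shows their general shape: Test 1 compares the average reward collected by a just-finished order-m instance with the running estimate U_t, and Test 2 compares the promised values g̃_τ with the realized rewards since the start of the block. The thresholds and function names here are illustrative placeholders only; the analysis merely needs thresholds lying between the concentration level of roughly 2ρ̂(2^m) (Lemma 81) and the detection level of roughly 10ρ̂(2^m) used in the proof of Lemma 76.

```python
def test_1(rewards, U_t, rho_hat, m, thresh=3.0):
    """Test 1 (sketch): run when an order-m instance finishes at time alg.e.
    Trigger a restart if the realized average reward over [alg.s, alg.e]
    exceeds the running optimistic estimate U_t by a margin on the order of
    rho_hat(2**m).  `thresh` is a placeholder constant, not the one fixed in
    Chapter 6.  rewards: the 2**m rewards collected by this instance.
    """
    return sum(rewards) / (2 ** m) >= U_t + thresh * rho_hat(2 ** m)

def test_2(g_tilde, rewards, rho_hat, thresh=3.0):
    """Test 2 (sketch): run at every round t of the current block.
    Trigger a restart if the promised values g_tilde exceed the collected
    rewards on average by more than ~rho_hat of the elapsed block length.
    g_tilde, rewards: values for rounds t_n..t (same length).
    """
    t_len = len(rewards)
    gap = sum(g - r for g, r in zip(g_tilde, rewards)) / t_len
    return gap >= thresh * rho_hat(t_len)
```

A returned value of True means the corresponding test fails and the epoch restarts, which is the event bounded by Lemma 81 and counted by Lemma 82.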
Abstract
Online learning (or online decision making) is a learning paradigm that involves real-time interactions between the learner and the environment. It can be used to model recommendation systems, marketing, advertising, etc. In online learning, the learner has to make instantaneous decisions based on past data and observations, in order to predict future outcomes, get high reward, or acquire new and informative data. This is more challenging than the traditional machine learning framework where the data is pre-collected and the learner has access to all data in advance.
Because the learner's decisions are involved in the data collection process, an important question is how to explore the environment efficiently and find the best policy for it. Past theoretical research has developed algorithms that perform strategic exploration and achieve near-optimal performance in the worst-case environment. However, these worst-case algorithms are often too pessimistic and do not exploit possible benign properties of the environment. In this thesis, we develop algorithms whose performance adapts to the easiness of the environment, thus reducing the time or the number of samples required for training.
Since online learning is interactive, an adversary may exploit the learner's algorithm, corrupt the data, and make the learner fail to learn a good policy. If an algorithm fails under only a small amount of corruption, it may be too unsafe to deploy in practice. In this thesis, we aim to make our algorithms minimally affected by data corruption, and we design robust algorithms whose performance scales optimally with the amount of corruption.
With adaptivity and robustness, an online learning algorithm can be used more efficiently and more safely in a wide spectrum of environments. We hope that the algorithmic techniques and insights developed in this thesis will be useful for improving existing algorithms in real applications.