ROBUST AND ADAPTIVE ONLINE REINFORCEMENT LEARNING

by

Tiancheng Jin

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(Computer Science)

May 2024

Copyright 2024 Tiancheng Jin

Acknowledgements

The most important lesson I have learned during my Ph.D. is about courage. Through these five years, I have faced numerous challenges: I endured the pandemic for the first three years, suffered the loss of my grandmother in the fourth year, and am still deeply troubled by the devastating news of my mother's cancer diagnosis in the final year. From these experiences, I realized that optimism alone does not suffice: unlike the robust and adaptive algorithms we have created, one cannot always remain calm and optimistic, especially when facing uncertainties with potentially unaffordable losses. To keep moving forward, I not only needed courage of my own, but also had to draw enormous encouragement from many people. Without them, I could not be myself, let alone complete this doctoral degree. Here, I would like to express my profound and sincere gratitude to these important people in my life, and to wish everyone good health and happiness.

The first person I would like to thank is my Ph.D. advisor, Haipeng Luo, for his great guidance, patience, and encouragement. I feel very fortunate to be his student, as I had neither an attractive application profile nor a strong background in theory. Haipeng is very patient: he always listens carefully, asks critical questions to guide me back in the right direction, and walks me through the problem and the techniques. In addition, Haipeng has a rigorous and serious attitude towards every piece of work, which not only helped me improve my writing and presentation skills, but also helped me break bad habits and cultivate the right mindset. I am deeply grateful for his guidance of my research and professional development. More importantly, Haipeng is very warmhearted, understanding, and considerate. Besides research, I always sought his advice and suggestions on the important decisions I had to make. For this, I thank Haipeng for his help and support during the times I faced these challenges.

Then, I would like to thank the mentors and colleagues from the three wonderful internships of my Ph.D. I thank Longbo Huang for hosting me at Haihua Research Institute from 2020 to 2021. I really enjoyed the weekly discussions with Longbo, and the days of brainstorming with Kaixuan, Jingzhao, Tianle, Jiaye, Ruiqi, Qiwen, and many other talented colleagues, which greatly broadened my horizons. I also thank Yu Gan and Calvin Chi for hosting me at Amazon Ads in summer 2022, and Lin Zhong, Chulin Wang, and Qie Hu for hosting me again at Amazon Ads in summer 2023. Their persistence in applying theoretical techniques to modern recommendation systems impressed me a lot.

I would like to thank all of my labmates: Chen-Yu Wei, Chung-Wei Lee, Mengxiao Zhang, Yifang Chen, Liyu Chen, Junyan Liu, Yan Wen, Dongze Ye, Spandan Senapati, Soumita Hait, and Yue Wu. I really miss the discussions we had in our lab space in Powell Hall at USC, and the good memories of walking on the trail. Specifically, I would like to thank Chen-Yu for introducing me to the fundamentals of online learning. Chen-Yu is very talented and always willing to discuss problems and share his ideas.
Besides, I thank Chung-Wei, Mengxiao, Yifang, Liyu, and Junyan for spending tremendous time and effort brainstorming with me. I have benefited a lot from their great insights and innovative ideas.

I am thankful for the invaluable guidance of my thesis committee members, Rahul Jain and Vatsal Sharan, and for the insightful suggestions of Jiapeng Zhang and Renyuan Xu on my thesis proposal.

Thanks to other collaborators not mentioned above: Chi Jin, Suvrit Sra, Tiancheng Yu, Tal Lancewicki, Yishay Mansour, Aviv Rosenberg, Chloé Rouyer, and William Chang. I am very lucky to have worked with these talented people.

I would like to thank Zhaoheng Zheng, Haidong Zhu, Liangchuan Zou, Mengxiao Zhang, Jian Guo, Xihu Zhang, Zeyu Sun, Minghao Gu, and many other friends who have helped me during my Ph.D., for their suggestions and encouragement in my hard times. I hope our friendship continues to grow and lasts forever.

My gratitude extends to all the people who have given me a hand. In particular, I would like to thank Lizsl De Leon, Andy Chen, Rita Wiraatmadja, Ellecia Williams, and many other staff members of the University of Southern California. I am grateful for their effort in supporting the students and creating a wonderful campus.

Finally, I would like to express my sincere gratitude to my family members, Simin Guo, Jianzhong Jin, and Li Ling. I thank them for their support and understanding of my decisions to study and work abroad, which means we have to be apart most of the time. Especially, I would like to thank Simin for her love, company, and understanding. Last but not least, I would like to thank my brave mother, Li Ling, for giving birth to me and raising me as an honest, sympathetic, and responsible person who is interested in study and research. I remember the first lessons she gave me, which inspired my interest in science, and her encouragement to pursue higher education and become a scientist.

Table of Contents

Acknowledgements
List of Figures
Abstract
Chapter 1: Introduction
    1.1 Finite-Horizon MDPs
    1.2 Research Outline
Chapter 2: Preliminaries
    2.1 Occupancy Measures
    2.2 Maliciousness Measure of the Transitions
    2.3 Confidence Set
    2.4 Upper Occupancy Bound
Chapter 3: Robustness against Adversarial Losses with Fixed Unknown Transition
    3.1 Related Work
    3.2 Problem Formulation
        3.2.1 Occupancy Measures
    3.3 Algorithm
        3.3.1 Confidence Sets
        3.3.2 Online Mirror Descent (OMD)
        3.3.3 Loss Estimators
    3.4 Analysis
    3.5 Omitted Details for the Algorithm
        3.5.1 Updating Occupancy Measure
        3.5.2 Computing Upper Occupancy Bounds
    3.6 Omitted Details for the Analysis
        3.6.1 Auxiliary Lemmas
        3.6.2 Proof of the Key Lemma
        3.6.3 Bounding Reg and Bias2
Chapter 4: Adaptivity against Adversarial Losses with Fixed Unknown Transition
    4.1 Introduction
    4.2 Preliminaries
    4.3 Warm-up for Known Transition: A New Loss-shifting Technique
    4.4 Main Algorithms and Results
        4.4.1 Main Best-of-both-worlds Results
    4.5 Analysis Sketch
    4.6 Best of Both Worlds for MDPs with Known Transition
        4.6.1 Loss-shifting Technique
        4.6.2 Known Transition and Full-information Feedback: FTRL with Shannon Entropy
        4.6.3 Known Transition and Bandit Feedback: FTRL with Tsallis Entropy
    4.7 Best of Both Worlds for MDPs with Unknown Transition and Full Information
        4.7.1 Optimism of Adjusted Losses and Other Lemmas
        4.7.2 Proof for the Adversarial World (Proposition 4.7.1)
        4.7.3 Proof for the Stochastic World (Proposition 4.7.2)
    4.8 Best of Both Worlds for MDPs with Unknown Transition and Bandit Feedback
        4.8.1 Auxiliary Lemmas
        4.8.2 Proof for the Adversarial World (Proposition 4.8.1)
        4.8.3 Proof for the Stochastic World (Proposition 4.8.2)
    4.9 General Decomposition, Self-bounding Terms, and Supplementary Lemmas
        4.9.1 General Decomposition Lemma
        4.9.2 Self-bounding Terms
        4.9.3 Supplementary Lemmas
Chapter 5: Robustness and Adaptivity towards Adversarial Losses and Transitions
    5.1 Introduction
    5.2 Achieving O(√T + C^P) with Known C^P
    5.3 Achieving O(√T + C^P) with Unknown C^P
    5.4 Gap-Dependent Refinements with Known C^P
    5.5 Omitted Details for Section 5.2
        5.5.1 Equivalent Definition of Amortized Bonuses
        5.5.2 Bounding Error
        5.5.3 Bounding Bias1
        5.5.4 Bounding Reg
        5.5.5 Bounding Bias2
        5.5.6 Proof of Theorem 5.2.2
    5.6 Omitted Details for Section 5.3
        5.6.1 Bottom Layer Reduction: STABILISE
        5.6.2 Top Layer Reduction: Corral
    5.7 Omitted Details for Section 5.4
        5.7.1 Description of the Algorithm
        5.7.2 Self-Bounding Properties of the Regret
        5.7.3 Regret Decomposition of Reg_T(π⋆) and Proof of Lemma 5.7.2.3
        5.7.4 Proof of Lemma 5.7.3.1
        5.7.5 Proof of Lemma 5.7.3.2
        5.7.6 Proof of Lemma 5.7.3.3
            5.7.6.1 Bounding Term 1
            5.7.6.2 Bounding Term 2
            5.7.6.3 Bounding Term 3
        5.7.7 Proof of Lemma 5.7.3.4
            5.7.7.1 Properties of the Learning Rate
            5.7.7.2 Bounding EstReg_i(π⋆) for Varying Learning Rate
            5.7.7.3 Bounding EstReg
        5.7.8 Properties of Optimistic Transition
    5.8 Supplementary Lemmas
        5.8.1 Expectation
        5.8.2 Confidence Bound with Known Corruption
        5.8.3 Difference Lemma
        5.8.4 Loss Shifting Technique with Optimistic Transition
        5.8.5 Estimation Error
Chapter 6: Experiments
    6.1 Attacker
    6.2 Algorithms
        6.2.1 UCBVI
        6.2.2 UCBVI-C
        6.2.3 UOB-REPS
        6.2.4 UOB-REPS-C
    6.3 Environments
        6.3.1 Random MDP
        6.3.2 Diabolical Combination Lock Problem
        6.3.3 Inventory Control Problem
Bibliography

List of Figures

1.1 Research Outline
6.1 MDP Structure of Random MDPs
6.2 Cumulative Regret on Random Environment
6.3 MDP Structure of a Diabolical Combination Lock
6.4 Cumulative Regret on Diabolical Combination Lock
6.5 Cumulative Regret on Small Inventory Control
6.6 Cumulative Regret on Large Inventory Control

List of Algorithms

1 Learner-Environment Interaction
2 Upper Occupancy Bound Relative Entropy Policy Search (UOB-REPS)
3 Comp-UOB
4 Greedy
5 Best-of-both-worlds for Episodic MDPs with Unknown Transition
6 Best-of-both-worlds for MDPs with Known Transition and Full-information Feedback
7 Best-of-both-worlds for MDPs with Known Transition and Bandit Feedback
8 Algorithm for Adversarial Transitions (with Known C^P)
9 STable Algorithm By Independent Learners and Instance SElection (STABILISE)
10 (A Variant of) Corral
11 Algorithm with Optimistic Transition Achieving Gap-Dependent Bounds (Known C^P)
2 Attacker-Learner-Environment Interaction
12 Attacker

Abstract

Reinforcement learning (RL) is a machine learning (ML) technique for learning to make optimal sequential decisions via interactions with an environment. In recent years, RL has achieved great success in many artificial intelligence tasks, and it has been widely regarded as one of the keys towards Artificial General Intelligence (AGI). However, most RL models are trained on simulators and suffer from the reality gap [40, 24]: a mismatch between simulated and real-world performance. Moreover, recent work [6] has shown that RL models are especially vulnerable to adversarial attacks. This motivates research on improving the robustness of RL, that is, the ability to ensure worst-case guarantees. On the other hand, it is not desirable to be too conservative/pessimistic and sacrifice too much performance when the environment is not difficult to deal with.
In other words, adaptivity, the capability of automatically adapting to the maliciousness of the environment, is especially desirable for RL algorithms: they should not only target worst-case guarantees, but also pursue instance optimality and achieve better performance against benign environments. In this thesis, we focus on designing practical, robust, and adaptive reinforcement learning algorithms. Specifically, we take inspiration from the online learning literature and consider interacting with a sequence of Markov Decision Processes (MDPs), which captures the nature of a changing environment. We hope that the techniques and insights developed in this thesis can shed light on improving existing deep RL algorithms for future applications.

Chapter 1
Introduction

Reinforcement learning (RL) has achieved remarkable successes in many fields such as Robotics [58], Go [83], Video games [67, 87], Autonomous Driving [53], and Large Language Models [22]. However, RL researchers also encounter tremendous challenges in ensuring robust performance amid diverse condition changes, real-world perturbations, and malicious attacks. Most works only focus on learning against a fixed environment, and are unable to bridge the reality gap [40, 24] even with only slight changes or minor perturbations of the environment. Specifically, the agents are usually first trained on a simulator imitating the real-world environment, and suffer from performance degradation when deployed later, as the environment may have changed since the simulator was built, or may be attacked by some adversaries. This poses the problem of robustness, which refers to the algorithm's ability to ensure worst-case guarantees while the environment may be (maliciously) perturbed or affected.

On the other hand, real-world environments are not purely adversarial: the algorithm should not be too conservative and sacrifice good performance when the environment is less difficult to deal with. In other words, the algorithm should not only target worst-case optimality, but also pursue instance optimality: if the environment is more benign, the algorithm should achieve better performance. Moreover, parameter-free algorithms are more favorable, as they can automatically achieve instance optimality without prior knowledge of the hardness/maliciousness of the environment. This demonstrates the importance of adaptivity, which refers to the algorithm's capability to adapt to the environment automatically, in order to achieve better performance when the environment is benign.

One line of research, originating from [30] and later improved or generalized by e.g. [70, 73, 98, 79], takes inspiration from the online learning literature and considers interacting with a sequence of finite-horizon Markov Decision Processes (MDPs). This framework captures the nature of a changing environment and is particularly useful for developing and analyzing robust and adaptive RL algorithms. In this thesis, we propose several practical RL algorithms based on this framework, which obtain theoretical robustness and/or adaptivity guarantees as well as excellent empirical performance. In the rest of this chapter, we formalize this framework in Section 1.1 and state our contributions in Section 1.2.
1.1 Finite-Horizon MDPs

An episodic MDP M is defined by a tuple (S, A, L, P, ℓ), where S is the finite state space, A is the finite action space, L is the horizon, P : S × A × S → [0, 1] is the transition function, with P(s′|s, a) being the probability of transferring to state s′ when executing action a in state s, and ℓ : S × A → [0, 1] is the loss function. In this thesis, we consider interacting with a sequence of T episodic MDPs {M_t = (S, A, L, P_t, ℓ_t)}_{t=1}^{T}, all with the same state space S and action space A.

Layer Structure. We consider an episodic setting with finite horizons and assume that all MDPs share the same layered structure, satisfying the following conditions:

• The state space S consists of L + 1 layers S_0, . . . , S_L such that S = ∪_{k=0}^{L} S_k and S_i ∩ S_j = ∅ for i ≠ j.
• S_0 and S_L are singletons, that is, S_0 = {s_0} and S_L = {s_L}.
• Transitions are possible only between consecutive layers. In other words, if P_t(s′|s, a) > 0 for any t, then s′ ∈ S_{k+1} and s ∈ S_k for some k.

These assumptions were made in previous work [72, 97, 78] as well. They are not necessary but greatly simplify notation and analysis. Such a setup is sometimes referred to as the loop-free stochastic shortest path problem in the literature. It is clear that this is a strict generalization of the episodic setting studied in [15, 45] for example, where the number of states is the same for each layer (except for the first and the last one).∗ We also point out that our algorithms and results can be easily modified to deal with a more general setup where the first layer has multiple states and in each episode the initial state is decided adversarially, as in [45] (details omitted).

∗ In addition, some of these works (such as [15]) also assume that the states have the same name for different layers, and that the transition between the layers remains the same. Our setup does not make this assumption and is closer to that of [45]. We also refer the reader to footnote 2 of [45] for how to translate regret bounds between settings with and without this extra assumption.

Protocol 1: Learner-Environment Interaction
  Parameters: state space S and action space A (known to the learner), unknown transition functions {P_t}_{t=1}^{T} and loss functions {ℓ_t}_{t=1}^{T}
  for t = 1 to T do
    learner decides a policy π_t and starts in state s_{t,0} = s_0
    for k = 0 to L − 1 do
      learner selects action a_{t,k} ∼ π_t(·|s_{t,k})
      learner observes loss ℓ_t(s_{t,k}, a_{t,k})
      environment draws a new state s_{t,k+1} ∼ P_t(·|s_{t,k}, a_{t,k})
      learner observes state s_{t,k+1}

Learning Protocol. The interaction between the learner and the environment is presented in Protocol 1. Ahead of time, knowing the learner's algorithm, the environment decides the transition functions {P_t}_{t=1}^{T} and the loss functions {ℓ_t}_{t=1}^{T} for these T MDPs in an arbitrary manner (unknown to the learner). Then the learner sequentially interacts with these T MDPs: for each episode t = 1, . . . , T, the learner first decides a stochastic policy π_t : S × A → [0, 1], where π_t(a|s) is the probability of taking action a when visiting state s; then, starting from the initial state s_{t,0} = s_0, for each k = 0, . . . , L − 1, the learner repeatedly selects an action a_{t,k} sampled from π_t(·|s_{t,k}), suffers loss ℓ_t(s_{t,k}, a_{t,k}) ∈ [0, 1], and transits to the next state s_{t,k+1} sampled from P_t(·|s_{t,k}, a_{t,k}) (until reaching the terminal state); finally, the learner observes the losses of the visited state-action pairs (a.k.a. bandit feedback).
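To make the protocol concrete, here is a minimal Python sketch of the interaction loop of Protocol 1, written as a simulation in which the environment's per-episode transitions and losses are given as arrays; the layered structure is assumed to be encoded in `P` (each `P[t][s, a]` puts mass only on the next layer). All names (`run_protocol`, `policy_fn`, etc.) are illustrative and not part of the thesis.

```python
import numpy as np

def run_protocol(T, L, n_states, n_actions, P, loss, policy_fn, rng):
    """Minimal simulation of Protocol 1 (learner-environment interaction).

    P[t][s, a]    : probability vector over next states in episode t (decided ahead of time)
    loss[t][s, a] : loss of pair (s, a) in episode t, in [0, 1]
    policy_fn(t, history) -> pi, where pi[s] is a distribution over actions
    Only the losses of visited pairs are revealed to the learner (bandit feedback).
    """
    history = []
    for t in range(T):
        pi = policy_fn(t, history)               # learner commits to policy pi_t
        s = 0                                    # initial state s_0 (layer 0 is a singleton)
        episode_feedback = []
        for k in range(L):
            a = rng.choice(n_actions, p=pi[s])           # a_{t,k} ~ pi_t(.|s_{t,k})
            observed_loss = loss[t][s, a]                # learner observes only this loss
            s_next = rng.choice(n_states, p=P[t][s, a])  # s_{t,k+1} ~ P_t(.|s_{t,k}, a_{t,k})
            episode_feedback.append((s, a, observed_loss))
            s = s_next
        history.append(episode_feedback)         # the learner's observation o_t
    return history
```

For instance, with `rng = np.random.default_rng(0)`, a `policy_fn` that always returns the uniform policy `np.full((n_states, n_actions), 1.0 / n_actions)` gives a purely exploratory learner; the algorithms in later chapters instead derive π_t from an occupancy measure maintained by online mirror descent.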
Loss and Regret. Let ℓ_t(π) = E[ Σ_{h=0}^{L−1} ℓ_t(s_h, a_h) | P_t, π ] be the expected loss of executing policy π in the t-th MDP (that is, {(s_h, a_h)}_{h=0}^{L−1} is a stochastic trajectory generated according to transition P_t and policy π). Then, the regret of the learner against any policy π is defined as

Reg_T(π) = E[ Σ_{t=1}^{T} ( ℓ_t(π_t) − ℓ_t(π) ) ].

We denote by π̊ one of the optimal policies in hindsight, so that Reg_T(π̊) = max_π Reg_T(π), and use Reg_T ≜ Reg_T(π̊) as a shorthand.

Notation. For notational convenience, for any k < L, we denote the set of tuples S_k × A × S_{k+1} by W_k. We also use k(s) to denote the layer to which state s ∈ S belongs, and I{·} to denote the indicator function whose value is 1 if the input holds true and 0 otherwise. Let o_t = {(s_{t,k}, a_{t,k}, ℓ_t(s_{t,k}, a_{t,k}))}_{k=0}^{L−1} be the observation of the learner in episode t, and F_t be the σ-algebra generated by (o_1, . . . , o_{t−1}). Also let E_t[·] be a shorthand for E[·|F_t].

1.2 Research Outline

Based on the proposed framework, we design adaptive algorithms, whose performance adapts to how adversarial the environment is, and robust algorithms, whose performance remains optimal when the environment is adversarial. In this thesis, we present three key works ([46], [47], and [49]) and provide an empirical evaluation of our designed algorithms.

In Chapter 3, we present the work [46], which studies the setting where the transition function is fixed (that is, P_t = P for any t) and the losses {ℓ_t}_{t=1}^{T} are selected by an adversary, and proposes the first algorithm (Algorithm 2) that ensures Õ(√T) regret in this challenging setting. Importantly, this work laid the foundation of many studies on adversarial MDPs, and its techniques (especially the optimistic loss estimator) are widely used. Following this work, we further enhance the robustness and propose an algorithm in [48] which can handle a harder setting where the losses can be delayed adversarially.

In Chapter 4, we present the work [47], which considers the adaptivity of RL algorithms. In the previous work [50], we developed the first best-of-both-worlds result for MDPs, that is, we proposed an adaptive algorithm that automatically achieves near instance-optimal Õ(log T) regret while ensuring the worst-case guarantee. However, [50] requires a rather strong assumption that the fixed transition function P is known to the learner. In this work (Chapter 4), we drop this assumption by designing a new framework that incorporates a novel loss-shifting technique for MDPs, and we prove the best-of-both-worlds result with the help of our new analysis. These novel techniques shed light on the research of best-of-both-worlds results for MDPs.

In Chapter 5, we present the work [49], which considers an even more general setting in which the transitions {P_t}_{t=1}^{T} can change arbitrarily from episode to episode. This setting captures the essence of environment changes, and is important for developing and analyzing robust and adaptive algorithms. In this work, we first develop an algorithm that can handle adversarial loss functions and transitions without prior knowledge of the maliciousness of the environment. Second, we incorporate the idea of the previous work [47] to achieve best-of-both-worlds results with adversarial transitions, under the assumption that the maliciousness of the environment is known to the learner beforehand.
In Chapter 6, we carry out numerical experiments on three tasks to verify our theoretical results and demonstrate the robustness and effectiveness of our designed algorithms (Algorithm 2 and Algorithm 8), even with adversarially corrupted losses and transitions. Specifically, we compare to two well-studied RL algorithms with guarantees and show that: first, our algorithms can achieve competitive performance in a static environment; second, our algorithms are robust when the environment is corrupted by an adversary.

In Figure 1.1, we outline the contributions of our works towards robust and adaptive RL (which are all published conference papers).

[Figure 1.1: Research Outline. The figure organizes our works along the axes of robustness and adaptivity: Jin et al. [46] (adversarial ℓ_t with fixed/unknown P), Jin et al. [48] (delayed feedback on ℓ_t), Jin and Luo [50] (adapting to how adversarial the ℓ_t's are, with known P), Jin, Huang, and Luo [47] (adapting to how adversarial the ℓ_t's are, with unknown P), and Jin et al. [49] (adversarial ℓ_t and P_t, and adapting to how adversarial the ℓ_t's are).]

Chapter 2
Preliminaries

In this chapter, we introduce several important concepts and techniques that are commonly used to design robust and adaptive algorithms. In Section 2.1, we introduce the concept of occupancy measures, which reduces the online RL problem to online linear optimization and enables the application of online learning techniques. In Section 2.2, we formally define the maliciousness measure of the transition functions. In Section 2.3, we introduce the confidence set, a central technique for dealing with unknown transitions. Finally, in Section 2.4, we propose the upper occupancy bound, which is the key to constructing viable loss estimators for adversarial losses and transitions.

2.1 Occupancy Measures

Solving the learning problem of MDPs with techniques from online learning requires introducing the concept of occupancy measures [8, 72]. Specifically, the occupancy measure q^{P,π} : S × A × S → [0, 1] associated with a stochastic policy π and a transition function P is defined as

q^{P,π}(s, a, s′) = Pr[ s_k = s, a_k = a, s_{k+1} = s′ | P, π ],

where k = k(s) is the index of the layer to which s belongs. In other words, q^{P,π}(s, a, s′) is the marginal probability of encountering the triple (s, a, s′) when executing policy π in an MDP with transition function P.

Clearly, an occupancy measure q satisfies the following two properties. First, due to the loop-free structure, each layer is visited exactly once and thus for every k = 0, . . . , L − 1,

Σ_{s∈S_k} Σ_{a∈A} Σ_{s′∈S_{k+1}} q(s, a, s′) = 1.   (2.1)

Second, the probability of entering a state when coming from the previous layer is exactly the probability of leaving from that state to the next layer (except for s_0 and s_L). Therefore, for every k = 1, . . . , L − 1 and every state s ∈ S_k, we have

Σ_{s′∈S_{k−1}} Σ_{a∈A} q(s′, a, s) = Σ_{s′∈S_{k+1}} Σ_{a∈A} q(s, a, s′).   (2.2)

It turns out that these two properties suffice for any function q : S × A × S → [0, 1] to be an occupancy measure associated with some transition function and some policy.

Lemma 2.1.1 (Rosenberg and Mansour [78]). If a function q : S × A × S → [0, 1] satisfies Conditions (2.1) and (2.2), then it is a valid occupancy measure associated with the following induced transition function P^q and induced policy π^q:

P^q(s′|s, a) = q(s, a, s′) / Σ_{y∈S_{k(s)+1}} q(s, a, y),   π^q(a|s) = Σ_{s′∈S_{k(s)+1}} q(s, a, s′) / ( Σ_{b∈A} Σ_{s′∈S_{k(s)+1}} q(s, b, s′) ).
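The induced transition and policy of Lemma 2.1.1 are simple ratios of entries of q. Below is a minimal Python sketch of these two formulas, assuming q is stored as a dictionary keyed by triples (s, a, s′) and that `next_layer[s]` lists the states of layer k(s) + 1; these data structures and names are illustrative, not from the thesis.

```python
def induced_transition_and_policy(q, nonterminal_states, actions, next_layer):
    """Recover the induced P^q and pi^q of Lemma 2.1.1 from an occupancy measure q.

    q[(s, a, s_next)]  : occupancy measure satisfying Conditions (2.1) and (2.2)
    nonterminal_states : all states except the terminal state s_L
    next_layer[s]      : list of the states in layer k(s) + 1
    """
    P_q, pi_q = {}, {}
    for s in nonterminal_states:
        # q(s, a) = sum over s' of q(s, a, s'): the probability of visiting the pair (s, a)
        q_sa = {a: sum(q[(s, a, y)] for y in next_layer[s]) for a in actions}
        q_s = sum(q_sa.values())                 # probability of visiting state s
        for a in actions:
            if q_s > 0:
                pi_q[(s, a)] = q_sa[a] / q_s     # pi^q(a|s)
            for s_next in next_layer[s]:
                if q_sa[a] > 0:
                    P_q[(s, a, s_next)] = q[(s, a, s_next)] / q_sa[a]   # P^q(s'|s, a)
    return P_q, pi_q
```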
We denote by ∆ the set of valid occupancy measures, that is, the subset of [0, 1]^{S×A×S} satisfying Conditions (2.1) and (2.2). For a fixed transition function P, we denote by ∆(P) ⊂ ∆ the set of occupancy measures whose induced transition function P^q is exactly P. Similarly, for a set of transition functions 𝒫, we denote by ∆(𝒫) ⊂ ∆ the set of occupancy measures whose induced transition function P^q belongs to 𝒫.

2.2 Maliciousness Measure of the Transitions

If the transition functions are all the same, we can show that Reg_T = Õ(L|S|√(|A|T)) is achievable, no matter how the loss functions are decided (Chapter 3). However, when the transition functions are also arbitrary, Tian et al. [86] show that Reg_T = Ω(min{T, √(2^L T)}) is unavoidable. Therefore, a natural goal is to allow the regret to smoothly increase from order √T to T when some maliciousness measure of the transitions increases. Specifically, the measure we use is

C^P ≜ min_{P′∈𝒫} Σ_{t=1}^{T} Σ_{k=0}^{L−1} max_{(s,a)∈S_k×A} ‖P_t(·|s, a) − P′(·|s, a)‖_1,   (2.3)

where 𝒫 denotes the set of all valid transition functions. Let P be the transition that realizes the minimum in this definition. Then C^P can be regarded as the same corruption measure used in Chen, Du, and Jamieson [20]; there, it is assumed that a ground-truth MDP with transition P exists, and the adversary corrupts it arbitrarily in each episode to obtain P_t, making C^P the total amount of corruption measured in a certain norm. For simplicity, in the rest of this thesis, we will also take this perspective and call C^P the transition corruption. We also use C^P_t = Σ_{k=0}^{L−1} max_{(s,a)∈S_k×A} ‖P_t(·|s, a) − P(·|s, a)‖_1 to denote the per-round corruption (so C^P = Σ_{t=1}^{T} C^P_t). It is clear that C^P = 0 when the transition stays the same for all MDPs, while in the worst case it is at most 2TL. Our goal is to achieve Reg_T = O(√T + C^P) (ignoring other dependence), which smoothly interpolates between the result of Jin et al. [46] for C^P = 0 and that of Tian et al. [86] for C^P = O(T).

2.3 Confidence Set

A central technique to deal with unknown transitions is to maintain a shrinking confidence set that contains the ground truth with high probability [79, 80]. With a properly enlarged confidence set, the same idea extends to the case with adversarial transitions [62]. Specifically, all our algorithms deploy the following transition estimation procedure. It proceeds in epochs, indexed by i = 1, 2, . . . , and each epoch i includes some consecutive episodes. An epoch ends whenever we encounter a state-action pair whose total number of visits has doubled compared to the beginning of that epoch. At the beginning of each epoch i, we calculate an empirical transition P̄_i as

P̄_i(s′|s, a) = m_i(s, a, s′) / m_i(s, a),   ∀(s, a, s′) ∈ W_k, k = 0, . . . , L − 1,   (2.4)

where m_i(s, a) and m_i(s, a, s′) are the total numbers of visits to (s, a) and (s, a, s′) prior to epoch i.∗ In addition, we calculate the following transition confidence set.

Definition 2.3.1 (Enlarged Confidence Set of Transition Functions). Let δ ∈ (0, 1) be a confidence parameter. With known corruption C^P, we define the confidence set of transition functions for epoch i as

𝒫_i = { P ∈ 𝒫 : |P(s′|s, a) − P̄_i(s′|s, a)| ≤ B_i(s, a, s′), ∀(s, a, s′) ∈ W_k, k = 0, . . . , L − 1 },   (2.5)

where the confidence interval B_i(s, a, s′) is defined, with ι = |S||A|T/δ, as

B_i(s, a, s′) = min{ 1, 16 √( P̄_i(s′|s, a) log(ι) / m_i(s, a) ) + 64 (C^P + log(ι)) / m_i(s, a) }.   (2.6)

∗ When m_i(s, a) = 0, we simply let the empirical transition be uniform, that is, P̄_i(s′|s, a) = 1/|S_{k(s′)}|.
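As an illustration, the empirical transition of Eq. (2.4) and the enlarged confidence width of Eq. (2.6) are straightforward to compute from the visit counters; a minimal numpy sketch is given below (the fallback for unvisited pairs is simplified to a uniform distribution over all states rather than over the next layer, and all variable names are illustrative).

```python
import numpy as np

def empirical_transition_and_width(m_sa, m_sas, C_P, T, delta):
    """Empirical transition (Eq. 2.4) and enlarged confidence width (Eq. 2.6) for one epoch.

    m_sa[s, a]       : number of visits to (s, a) before this epoch
    m_sas[s, a, s']  : number of visits to (s, a, s') before this epoch
    C_P              : known total transition corruption
    """
    n_states, n_actions = m_sa.shape
    log_iota = np.log(n_states * n_actions * T / delta)   # log of iota = |S||A|T / delta
    counts = np.maximum(m_sa, 1)[..., None]                # avoid division by zero
    # Eq. (2.4): empirical transition; unvisited (s, a) fall back to uniform here.
    P_bar = np.where(m_sa[..., None] > 0, m_sas / counts, 1.0 / n_states)
    # Eq. (2.6): Bernstein-style term plus a corruption-dependent term, capped at 1.
    bernstein = 16.0 * np.sqrt(P_bar * log_iota / counts)
    corruption = 64.0 * (C_P + log_iota) / counts
    B = np.minimum(1.0, bernstein + corruption)
    return P_bar, B
```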
Note that the confidence interval is enlarged according to how large the corruption C^P is. We denote by E_con the event that P ∈ 𝒫_i for every epoch i, which is guaranteed to happen with high probability.

Lemma 2.3.2. With probability at least 1 − 2δ, the event E_con holds.

For the special case C^P = 0, we are able to use a tighter confidence interval defined as

min{ 1, 2 √( P̄_i(s′|s, a) ln(T|S||A|/δ) / max{1, m_i(s, a) − 1} ) + 14 ln(T|S||A|/δ) / ( 3 max{1, m_i(s, a) − 1} ) },   (2.7)

which ensures that the event E_con holds with probability at least 1 − 4δ (Lemma 3.3.1.1). We defer the details of this tighter confidence interval to Section 3.3.1.

Remark 2.3.3. In Chapter 3 and Chapter 4, where the transition functions {P_t}_{t=1}^{T} are assumed to be fixed (that is, P_t = P for every episode t), we use the tighter confidence intervals in Eq. (2.7). We will use the enlarged ones defined in Eq. (2.6) mainly in Chapter 5 to deal with adversarial transitions.

2.4 Upper Occupancy Bound

To handle partial feedback on the loss function ℓ_t, we need to construct loss estimators ℓ̂_t using the (efficiently computable) upper occupancy bound u_t:

ℓ̂_t(s, a) = I_t(s, a) ℓ_t(s, a) / u_t(s, a),   where u_t(s, a) = max_{P̂∈𝒫_{i(t)}} q^{P̂,π_t}(s, a),   (2.8)

where I_t(s, a) is 1 if (s, a) is visited during episode t (so that ℓ_t(s, a) is revealed) and 0 otherwise, and i(t) denotes the index of the epoch to which episode t belongs. As the upper occupancy bound is one of the key contributions of [46], we refer the reader to Chapter 3 (especially Section 3.3.3) for a detailed discussion.

Chapter 3
Robustness against Adversarial Losses with Fixed Unknown Transition

In this chapter, we consider the task of learning in episodic finite-horizon Markov decision processes with a fixed and unknown transition function (that is, P_t = P for every episode t), bandit feedback, and adversarial losses. In [46], we propose an efficient algorithm that achieves Õ(L|S|√(|A|T)) regret with high probability, where L is the horizon, |S| the number of states, |A| the number of actions, and T the number of episodes. To our knowledge, our algorithm (UOB-REPS) is the first to ensure Õ(√T) regret in this challenging setting; in fact, it achieves the same regret as [78], who consider the easier setting with full-information feedback. Our key contributions are two-fold: a tighter confidence set for the transition function, and an optimistic loss estimator that is inversely weighted by an upper occupancy bound.

Reinforcement learning studies the problem where a learner interacts with the environment sequentially and aims to improve her strategy over time. The environment dynamics are usually modeled as an MDP with a fixed and unknown transition function. We consider a general setting where the interaction proceeds in episodes with a fixed horizon. Within each episode, the learner sequentially observes her current state, selects an action, suffers and observes the loss corresponding to the chosen state-action pair, and then transits to the next state according to the underlying transition function.∗ The goal of the learner is to minimize her regret: the difference between her total loss and the total loss of an optimal fixed policy.

The majority of the literature on learning in MDPs assumes stationary losses, that is, the losses observed for a specific state-action pair follow a fixed and unknown distribution.
To better capture applications with non-stationary or even adversarial losses, the works [31, 93] are among the first to study the problem of learning adversarial MDPs, where the losses can change arbitrarily between episodes. There are several follow-ups in this direction, such as [93, 70, 72, 97, 28, 78]. See Section 3.1 for more related work. For an MDP with |S| states, |A| actions, T episodes, and L steps in each episode, the best existing result is the work [78], which achieves O(L|S| p |A|T) regret, assuming a fixed and unknown transition function, adversarial losses, but importantly full-information feedback: i.e., the loss for every state-action pair is revealed at the end of each episode. On the other hand, with the more natural and standard bandit feedback (where only the loss for each visited state-action pair is revealed), a later work by the same authors [79] achieves regret O(L 3/2 |S||A| 1/4T 3/4 ), which has a much worse dependence on the number of episodes T compared to the full-information setting. Our main contribution significantly improves on [79]. In particular, we propose an efficient algorithm that achieves O(L|S| p |A|T) regret in the same setting with bandit feedback, an unknown transition function, and adversarial losses. Although our regret bound still exhibits a gap compared to the best existing lower bound Ω(L p |S||A|T) [45], to the best of our knowledge, for this challenging setting our result is the first to achieve O( √ T) regret. Importantly, this also matches the regret upper bound of Rosenberg and Mansour [78], who consider the easier setting with full-information feedback. ∗As in previous work [78, 79], throughout we use the term “losses” instead of “rewards” to be consistent with the adversarial online learning literature. One can translate between losses and rewards by simply taking negation. 14 Our algorithm builds on the UC-O-REPS algorithm [78, 79]—we also construct confidence sets to handle the unknown transition function, and apply Online Mirror Descent over the space of occupancy measures to handle adversarial losses. The first key difference and challenge is that with bandit feedback, to apply Online Mirror Descent we must construct good loss estimators since the loss function is not completely revealed. However, the most natural approach of building unbiased loss estimators via inverse probability requires knowledge of the transition function, and is thus infeasible in our setting. We address this key challenge by proposing a novel biased and optimistic loss estimator (Section 3.3.3). Specifically, instead of inversely weighting the observation by the probability of visiting the corresponding state-action pair (which is unknown), we use the maximum probability among all plausible transition functions specified by a confidence set, which we call upper occupancy bound. This idea resembles the optimistic principle of using upper confidence bounds for many other problems of learning with bandit feedback, such as stochastic multi-armed bandits [11], stochastic linear bandits [23, 2], and reinforcement learning with stochastic losses [44, 15, 45]. However, as far as we know, applying optimism in constructing loss estimators for an adversarial setting is new. The second key difference of our algorithm from UC-O-REPS (Section 3.3.1) lies in a new confidence set for the transition function. 
Specifically, for each state-action pair, the confidence set used in UC-O-REPS and previous works such as [44, 15] imposes a total variation constraint on the transition probability, while our proposed confidence set imposes an independent constraint on the transition probability for each next state, and is strictly tighter. Indeed, with the former we can only prove an O(L|S| 2p |A|T) regret, while with the latter we improve it to O(L|S| p |A|T). Analyzing the non-trivial interplay between our optimistic loss estimators and the new confidence set is one of our key technical contributions. 15 Finally, we remark that our proposed upper occupancy bounds can be computed efficiently via backward dynamic programming and solving some linear programs greedily, and thus our algorithm can be implemented efficiently. 3.1 Related Work Stochastic losses. Learning MDPs with stochastic losses and bandit feedback is relatively wellstudied for the tabular case (that is, finite number of states and actions). For example, in the episodic setting, using our notation,† the UCRL2 algorithm of Jaksch, Ortner, and Auer [44] achieves O( p L3|S| 2|A|T) regret, and the UCBVI algorithm of Azar, Osband, and Munos [15] achieves the optimal bound O(L p |S||A|T), both of which are model-based algorithms and construct confidence sets for both the transition function and the loss function. The recent work [45] achieves a suboptimal bound O( p L3|S||A|T) via an optimistic Q-learning algorithm that is modelfree. Besides the episodic setting, other setups such as discounted losses or infinite-horizon averageloss setting have also been heavily studied; see for example [74, 35, 96, 89, 29] for some recent works. Adversarial losses. Based on whether the transition function is known and whether the feedback is full-information or bandit, we discuss four categories separately. Known transition and full-information feedback. Early works on adversarial MDPs assume a known transition function and full-information feedback. For example, Even-Dar, Kakade, and Mansour [31] propose the algorithm MDP-E and prove a regret bound of O(τ 2p T ln |A|) where τ is the mixing time of the MDP; another work [93] achieves O(T 2/3 ) regret. Both of these consider a continuous setting (as opposed to the episodic setting that we study). Later Zimin and Neu †We warn the reader that in some of these cited papers, the notation |S| or T might be defined differently (often L times smaller for |S| and L times larger for T). We have translated the bounds based on Table 1 of [45] using our notation defined in Section 3.2. 16 [97] consider the episodic setting and propose the O-REPS algorithm which applies Online Mirror Descent over the space of occupancy measures, a key component adopted by [78] and our work. O-REPS achieves the optimal regret O(L p T ln(|S||A|)) in this setting. Known transition and bandit feedback. Several works consider the harder bandit feedback model while still assuming known transitions. The work [70] achieves regret O(L 2p T|A|/α), assuming that all states are reachable with some probability α under all policies. Later, Neu et al. [71] eliminates the dependence on α but only achieves O(T 2/3 ) regret. The O-REPS algorithm of [97] again achieves the optimal regret O( p L|S||A|T). Another line of works [10, 28] assumes deterministic transitions for a continuous setting without some unichain structure, which is known to be harder and suffers Ω(T 2/3 ) regret [27]. Unknown transition and full-information feedback. 
To deal with unknown transitions, Neu, György, and Szepesvári [72] propose the Follow the Perturbed Optimistic Policy algorithm and achieve O(L|S||A| √ T) regret. Combining the idea of confidence sets and Online Mirror Descent, the UC-O-REPS algorithm of [78] improves the regret to O(L|S| p |A|T). We note that this work also studies general convex performance criteria, which we do not consider. Unknown transition and bandit feedback. This is the setting considered in this chapter. The only previous work we are aware of [79] achieves a regret bound of O(T 3/4 ), or O( √ T /α) under the strong assumption that under any policy, all states are reachable with probability α that could be arbitrarily small in general. Our algorithm achieves O( √ T) regret without this assumption by using a different loss estimator and by using a tighter confidence set. We also note that the lower bound of Ω(L p |S||A|T) [45] still applies. 17 3.2 Problem Formulation 3.2.1 Occupancy Measures With the concept of occupancy measure (Section 2.1), we can reduce the problem of learning a policy to the problem of learning an occupancy measure and apply online linear optimization techniques. Specifically, with slight abuse of notation, for an occupancy measure q we define q(s, a) = X s ′∈Sk(s)+1 q(s, a, s′ ) for all s ̸= sL and a ∈ A, which is the probability of visiting state-action pair (s, a). Then the expected loss of following a policy π for episode t can be rewritten as E " L X−1 k=0 ℓt(sk, ak) P, π# = L X−1 k=0 X s∈Sk X a∈A q P,π(s, a)ℓt(s, a) = X s∈S\{sL},a∈A q P,π(s, a)ℓt(s, a) ≜ ⟨q P,π, ℓt⟩, and accordingly the actual regret of the learner can be rewritten as RegT = X T t=1 ⟨q P,πt − q ⋆ , ℓt⟩, (3.1) where q ⋆ ∈ argminq∈∆(P) PT t=1⟨q, ℓt⟩ is the optimal occupancy measure in ∆(P). Remark 3.2.1.1. In this chapter, we will consider the regret defined in Eq. (3.1), which is different from the expected regret defined in Chapter 1. On the other hand, assume for a moment that the set ∆(P) were known and the loss function ℓt was revealed at the end of episode t. Consider an online linear optimization problem (see [39] for 18 example) with decision set ∆(P) and linear loss parameterized by ℓt at time t. In other words, at each time t, the learner proposes qt ∈ ∆(P) and suffers loss ⟨qt , ℓt⟩. The regret of this problem is X T t=1 ⟨qt − q ⋆ , ℓt⟩. (3.2) Therefore, if in the original problem, we set πt = π qt , then the two regret measures Eq. (3.1) and Eq. (3.2) are exactly the same by Lemma 2.1.1 and we have thus reduced the problem to an instance of online linear optimization. It remains to address the issues that ∆(P) is unknown and that we have only partial information on ℓt . The first issue can be addressed by constructing a confidence set P based on observations and replacing ∆(P) with ∆(P), and the second issue is addressed by constructing loss estimators with reasonably small bias and variance. For both issues, we propose new solutions compared to [79]. Note that importantly, the above reduction does not reduce the problem to an instance of the well-studied bandit linear optimization [3] where the quantity ⟨qt , ℓt⟩ (or a sample with this mean) is observed. Indeed, roughly speaking, what we observed in our setting are samples with mean ⟨q P,πqt , ℓt⟩. These two are different when we do not know P and have to operate over the set ∆(P). 3.3 Algorithm The complete pseudocode of our algorithm, UOB-REPS, is presented in Algorithm 2. 
The three key components of our algorithm are: 1) maintaining a confidence set of the transition function, 2) using Online Mirror Descent to update the occupancy measure, and 3) constructing loss estimators, each described in detail below. 19 3.3.1 Confidence Sets To ensure lower bias for our loss estimators, we propose a tighter confidence set which includes all transition functions with bounded distance compared to P¯ i(s ′ |s, a) for each triple (s, a, s′ ). More specifically, the confidence set Pi for epoch i is defined as‡ n Pb : Pb(s ′ |s, a) − P¯ i(s ′ |s, a) ≤ ϵi(s ′ |s, a), ∀(s, a, s′ ) ∈ Wk, k = 0, . . . , L − 1 o , (3.3) where the confidence width Bi(s, a, s′ ) is defined as min 1, 2 vuut P¯ i(s ′ |s, a) ln T|S||A| δ max{1, mi(s, a) − 1} + 14 ln T|S||A| δ 3 max{1, mi(s, a) − 1} (3.4) for some confidence parameter δ ∈ (0, 1). For the first epoch (i = 1), Pi is simply the set of all transition functions so that ∆(Pi) = ∆. § By the empirical Bernstein inequality and union bounds, one can show the following (see Section 3.6.1 for the proof): Lemma 3.3.1.1. With probability at least 1 − 4δ, we have P ∈ Pi for all i. Moreover, ignoring constants one can further show that our confidence bound is strictly tighter than those used in [78, 79], which is important for getting our final regret bound (more discussions to follow in Section 4.5). ‡ It is understood that in the definition of the confidence set (Eq. (3.3)), there is also an implicit constraint on Pb(·|s, a) being a valid distribution over the states in Sk(s)+1, for each (s, a) pair. This is omitted for conciseness. §To represent P1 in the form of Eq. (3.3), one can simply let P¯1(·|s, a) be any distribution and ϵ1(s ′ |s, a) = 1. 20 Algorithm 2 Upper Occupancy Bound Relative Entropy Policy Search (UOB-REPS) Input: state space S, action space A, episode number T, learning rate η, exploration parameter γ, and confidence parameter δ Initialization: Initialize epoch index i = 1 and confidence set P1 as the set of all transition functions. For all k = 0, . . . , L − 1 and all (s, a, s′ ) ∈ Wk, initialize counters m0(s, a) = m1(s, a) = m0(s ′ |s, a) = m1(s ′ |s, a) = 0 and occupancy measure qb1(s, a, s′ ) = 1 |Sk||A||Sk+1| . Initialize policy π1 = π qb1 . for t = 1 to T do Execute policy πt for L steps and obtain trajectory sk, ak, ℓt(sk, ak) for k = 0, . . . , L − 1. Compute upper occupancy bound for each k: ut(sk, ak) = Comp-UOB(πt , sk, ak,Pi). Construct loss estimators for all (s, a): ℓbt(s, a) = ℓt(s, a) ut(s, a) + γ I{sk(s) = s, ak(s) = a}. Update counters: for each k, mi(sk, ak) ← mi(sk, ak) + 1, mi(sk+1|sk, ak) ← mi(sk+1|sk, ak) + 1. if ∃k, mi(sk, ak) ≥ max{1, 2mi−1(sk, ak)} then Increase epoch index i ← i + 1. Initialize new counters: for all (s, a, s′ ), mi(s, a) = mi−1(s, a), mi(s ′ |s, a) = mi−1(s ′ |s, a). Update confidence set Pi based on Eq. (3.3). Update occupancy measure (D defined in Eq. (3.6)): qbt+1 = argmin q∈∆(Pi) η⟨q, ℓbt⟩ + D(q ∥ qbt). Update policy πt+1 = π qbt+1 . 21 Algorithm 3 Comp-UOB Input: a policy πt , a state-action pair (s, a) and a confidence set P of the form n Pb : Pb(s ′ |s, a) − P¯(s ′ |s, a) ≤ B(s ′ |s, a), ∀(s, a, s′ ) o Initialize: for all s˜ ∈ Sk(s) , set f(˜s) = I{s˜ = s}. for k = k(s) − 1 to 0 do for all s˜ ∈ Sk do Compute f(˜s) based on Eq. (3.7): f(˜s) = X a∈A πt(a|s˜) · Greedy f, P¯(·|s, a ˜ ), ϵ(·|s, a ˜ ) (see Section 3.5.2 for the procedure Greedy). Return: πt(a|s)f(s0). 3.3.2 Online Mirror Descent (OMD) The OMD component of our algorithm is the same as [79]. 
As discussed in Section 3.2.1, our problem is closely related to an online linear optimization problem over some occupancy measure space. In particular, our algorithm maintains an occupancy measure q̂_t for episode t and executes the induced policy π_t = π^{q̂_t}. We apply Online Mirror Descent, a standard algorithmic framework for tackling online learning problems, to update the occupancy measure as

q̂_{t+1} = argmin_{q∈∆(𝒫_i)} η⟨q, ℓ̂_t⟩ + D(q ∥ q̂_t),   (3.5)

where i is the index of the epoch to which episode t + 1 belongs, η > 0 is some learning rate, ℓ̂_t is some loss estimator for ℓ_t, and D(·∥·) is a Bregman divergence. Following [78, 79], we use the unnormalized KL-divergence as the Bregman divergence:

D(q ∥ q′) = Σ_{s,a,s′} q(s, a, s′) ln( q(s, a, s′) / q′(s, a, s′) ) − Σ_{s,a,s′} ( q(s, a, s′) − q′(s, a, s′) ).   (3.6)

Note that, as pointed out earlier, ideally one would use ∆(P) as the constraint set in the OMD update, but since P is unknown, using ∆(𝒫_i) in its place is a natural idea. Also note that the update can be implemented efficiently, similarly to Rosenberg and Mansour [78] (see Section 3.5.1 for details).

3.3.3 Loss Estimators

A common technique to deal with partial information in adversarial online learning problems (such as adversarial multi-armed bandits [12]) is to construct loss estimators based on observations. In particular, inverse importance-weighted estimators are widely applicable. For our problem, with a trajectory s_0, a_0, . . . , s_{L−1}, a_{L−1} for episode t, a common importance-weighted estimator for ℓ_t(s, a) would be

( ℓ_t(s, a) / q^{P,π_t}(s, a) ) · I{ s_{k(s)} = s, a_{k(s)} = a }.

Clearly this is an unbiased estimator for ℓ_t(s, a). Indeed, the conditional expectation E_t[ I{ s_{k(s)} = s, a_{k(s)} = a } ] is exactly q^{P,π_t}(s, a), since the latter is exactly the probability of visiting (s, a) when executing policy π_t in an MDP with transition function P.

The issue with this standard estimator is that we cannot compute q^{P,π_t}(s, a) since P is unknown. To address this issue, Rosenberg and Mansour [79] directly use q̂_t(s, a) in place of q^{P,π_t}(s, a), leading to an estimator that could be either an overestimate or an underestimate, and they can only show O(T^{3/4}) regret with this approach. Instead, since we have a confidence set 𝒫_i that contains P with high probability (where i is the index of the epoch to which t belongs), we propose to replace q^{P,π_t}(s, a) with an upper occupancy bound defined as

u_t(s, a) = max_{P̂∈𝒫_i} q^{P̂,π_t}(s, a),

that is, the largest possible probability of visiting (s, a) among all the plausible environments. In addition, we also adopt the idea of implicit exploration from [69] and further increase the denominator by some fixed amount γ > 0. Our final estimator for ℓ_t(s, a) is

ℓ̂_t(s, a) = ( ℓ_t(s, a) / ( u_t(s, a) + γ ) ) · I{ s_{k(s)} = s, a_{k(s)} = a }.

The implicit exploration is important for several technical reasons, such as obtaining a high-probability regret bound, the key motivation of the work [69] for multi-armed bandits. Clearly, ℓ̂_t(s, a) is a biased estimator and, in particular, underestimates ℓ_t(s, a) with high probability (since by definition q^{P,π_t}(s, a) ≤ u_t(s, a) if P ∈ 𝒫_i). The idea of using underestimates for adversarial learning with bandit feedback can be seen as an optimism principle which encourages exploration, and it appears in previous work such as [7, 69] in different forms and for different purposes. A key part of our analysis is to show that the bias introduced by these estimators is reasonably small, which eventually leads to a better regret bound compared to [79].
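The estimator only requires the realized trajectory, the upper occupancy bounds, and γ. Here is a minimal Python sketch of constructing ℓ̂_t for one episode, assuming u_t has already been computed (e.g. by Comp-UOB); the function and variable names are illustrative, not from the thesis.

```python
import numpy as np

def optimistic_loss_estimate(trajectory, u_t, gamma, n_states, n_actions):
    """Optimistic loss estimator with implicit exploration (Section 3.3.3).

    trajectory : list of (s, a, observed_loss) for the pairs visited in episode t
    u_t[s, a]  : upper occupancy bound, max over the confidence set of q^{P, pi_t}(s, a)
    gamma      : implicit exploration parameter, gamma > 0
    Returns hat_ell with hat_ell[s, a] = loss(s, a) * 1{(s, a) visited} / (u_t[s, a] + gamma).
    Since q^{P, pi_t}(s, a) <= u_t(s, a) whenever P lies in the confidence set, this
    underestimates ell_t(s, a) with high probability (the optimism discussed above).
    """
    hat_ell = np.zeros((n_states, n_actions))
    for s, a, observed_loss in trajectory:
        hat_ell[s, a] = observed_loss / (u_t[s, a] + gamma)
    return hat_ell
```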
Computing upper occupancy bound efficiently. It remains to discuss how to compute ut(s, a) efficiently. First note that ut(s, a) = πt(a|s) max Pb∈Pi q P,π b t (s) where once again we slightly abuse the notation and define q(s) = P a ′∈A q(s, a′ ) for any occupancy measure q, which is the marginal probability of visiting state s under the associated policy and transition function. Further define f(˜s) = max Pb∈Pi Pr h sk(s) = s sk(˜s) = ˜s, P, π b t i , 24 for any s˜ with k(˜s) ≤ k(s), which is the maximum probability of visiting s starting from state s˜, under policy πt and among all plausible transition functions in Pi . Clearly one has ut(s, a) = πt(a|s)f(s0), and also f(˜s) = I{s˜ = s} for all s˜ in the same layer as s. Moreover, since the confidence set Pi imposes an independent constraint on Pb(·|s, a) for each different pair (s, a), we have the following recursive relation: f(˜s) = X a∈A πt(a|s˜) max Pb(·|s,a ˜ ) X s ′∈Sk(˜s)+1 Pb(s ′ |s, a ˜ )f(s ′ ) (3.7) where the maximization is over the constraint that Pb(·|s, a ˜ ) is a valid distribution over Sk(˜s)+1 and also Pb(s ′ |s, a ˜ ) − P¯ i(s ′ |s, a ˜ ) ≤ Bi(˜s, a, s′ ), ∀s ′ ∈ Sk(˜s)+1. This optimization can be solved efficiently via a greedy approach after sorting the values of f(s ′ ) for all s ′ ∈ Sk(˜s)+1 (see Section 3.5.2 for details). This suggests computing ut(s, a) via backward dynamic programming from layer k(s) down to layer 0, detailed in Algorithm 3. 3.4 Analysis In this section, we analyze the regret of our algorithm and prove the following theorem. Theorem 3.4.1. With probability at least 1−9δ, UOB-REPS with η = γ = qL ln(L|S||A|/δ) T|S||A| ensures: RegT = O L|S| s |A|T ln T|S||A| δ ! . 25 The proof starts with decomposing the regret into four different terms. Specifically, by Eq. (3.1) the regret can be written as RegT = PT t=1⟨qt−q ⋆ , ℓt⟩ where we define q ⋆ ∈ argminq∈∆(P) PT t=1⟨q, ℓt⟩ and qt = q P,πt . We then add and subtract three terms and decompose the regret as RegT = X T t=1 ⟨qt − qbt , ℓt⟩ | {z } Error + X T t=1 D qbt , ℓt − ℓbt E | {z } Bias1 + X T t=1 D qbt − q ⋆ , ℓbt E | {z } Reg + X T t=1 D q ⋆ , ℓbt − ℓt E | {z } Bias2 . Here, the first term Error measures the error of using qbt to approximate qt ; the third term Reg is the regret of the corresponding online linear optimization problem and is controlled by OMD; the second and the fourth terms Bias1 and Bias2 correspond to the bias of the loss estimators. We bound Error and Bias1 in the rest of this section. Bounding Reg and Bias2 is relatively standard and we defer the proofs to Section 3.6.3. Combining all the bounds (specifically, Lemmas 3.4.3, 3.4.4, 3.6.3.2, and 3.6.3.4), applying a union bound, and plugging in the (optimal) values of η and γ prove Theorem 3.4.1. Throughout the analysis we use it to denote the index of the epoch to which episode t belongs. Note that Pit and qbt are both Ft-measurable. We start by stating a key technical lemma which essentially describes how our new confidence set shrinks over time and is critical for bounding Error and Bias1 (see Section 3.6.2 for the proof). Lemma 3.4.2. With probability at least 1 − 6δ, for any collection of transition functions {P s t }s∈S such that P s t ∈ Pit for all s, we have X T t=1 X s∈S,a∈A |q P s t ,πt (s, a) − qt(s, a)| = O L|S| s |A|T ln T|S||A| δ ! . Bounding Error. With the help of Lemma 3.4.2, we immediately obtain the following bound on Error. 26 Lemma 3.4.3. With probability at least 1 − 6δ, UOB-REPS ensures Error = O L|S| s |A|T ln T|S||A| δ ! . Proof. 
Since all losses are in [0, 1], we have Error ≤ X T t=1 X s,a |qbt(s, a) − qt(s, a)| = X T t=1 X s,a |q P s t ,πt (s, a) − qt(s, a)|, where we define P s t = P qbt ∈ Pit for all s so that qbt = q Pt,πt (by the definition of πt and Lemma 2.1.1). Applying Lemma 3.4.2 finishes the proof. Note that in the proof above, we set P s t to be the same for all s. In fact, in this case our Lemma 3.4.2 is similar to Lemmas B.2 and B.3 of [78] and it also suffices to use their looser confidence bound. However, in the next application of Lemma 3.4.2 to bounding Bias1, it turns out to be critical to set P s t to be different for different s and also to use our tighter confidence bound. Bounding Bias1. To bound the term Bias1 = PT t=1⟨qbt , ℓt − ℓbt⟩, we need to show that ℓbt is not underestimating ℓt by too much, which, at a high-level, is also ensured due to the fact that the confidence set becomes more and more accurate for frequently visited state-action pairs. Lemma 3.4.4. With probability at least 1 − 7δ, UOB-REPS ensures Bias1 = O L|S| s |A|T ln T|S||A| δ + γ|S||A|T ! . 27 Proof. First note that ⟨qbt , ℓbt⟩ is in [0, L] because P qbt ∈ Pit by the definition of qbt and thus qbt(s, a) ≤ ut(s, a) by the definition of ut , which implies X s,a qbt(s, a)ℓbt(s, a) ≤ X s,a I{sk(s) = s, ak(s) = a} = L. Applying Azuma’s inequality we thus have with probability at least 1 − δ, PT t=1⟨qbt , Et [ℓbt ] − ℓbt⟩ ≤ L q 2T ln 1 δ . Therefore, we can bound Bias1 by PT t=1⟨qbt , ℓt − Et [ℓbt ]⟩ + L q 2T ln 1 δ under this event. We then focus on the term P t ⟨qbt , ℓt − Et [ℓbt ]⟩ and rewrite it as (by the definition of ℓbt) X t,s,a qbt(s, a)ℓt(s, a) 1 − Et [I{sk(s) = s, ak(s) = a}] ut(s, a) + γ = X t,s,a qbt(s, a)ℓt(s, a) 1 − qt(s, a) ut(s, a) + γ = X t,s,a qbt(s, a) ut(s, a) + γ (ut(s, a) − qt(s, a) + γ) ≤ X t,s,a |ut(s, a) − qt(s, a)| + γ|S||A|T where the last step is again due to qbt(s, a) ≤ ut(s, a). Finally, note that by Eq. (4.4), one has ut = q P s t ,πt for P s t = argmaxPb∈Pit q P,π b t (x) (which is Ft-measurable and belongs to Pit clearly). Applying Lemma 3.4.2 together with a union bound then finishes the proof. We point out again that this is the only part that requires using our new confidence set. With the looser one used in previous work we can only show P t,s,a |ut(s, a)−qt(s, a)| = O L|S| 2 q |A|T ln T|S||A| δ , with an extra |S| factor. 3.5 Omitted Details for the Algorithm In this section, we provide omitted details on how to implement our algorithm efficiently. 2 3.5.1 Updating Occupancy Measure This subsection explains how to implement the update defined in Eq. (3.5) efficiently. We use almost the same approach as in [78] with the only difference being the choice of confidence set. We provide details of the modification here for completeness. It has been shown in [78] that Eq. (3.5) can be decomposed into two steps: (1) compute q˜t+1(s, a, s′ ) = qbt(s, a, s′ ) exp(−ηℓbt(s, a)) for any (s, a, s′ ), which is the optimal solution of the unconstrained problem; (2) compute the projection step: qbt+1 = argmin q∈∆(Pi) D(q ∥ q˜t+1), (3.8) Since our choice of confidence set ∆(Pi) is different, the main change lies in the second step, whose constraint set can be written explicitly using the following set of linear equations: ∀k : X s∈Sk,a∈A,s′∈Sk+1 q s, a, s′ = 1, ∀k, ∀s ∈ Sk : X a∈A,s′∈Sk+1 q s, a, s′ = X s ′∈Sk−1,a∈A q s ′ , a, s , ∀k, ∀ s, a, s′ ∈ Wk : q s, a, s′ ≤ P¯ i s ′ |s, a + Bi s, a, s′ X y∈Sk+1 q (s, a, y), q s, a, s′ ≥ P¯ i s ′ |s, a − Bi s, a, s′ X y∈Sk+1 q (s, a, y), q s, a, s′ ≥ 0. 
(3.9) Therefore, the projection step Eq. (3.8) is a convex optimization problem with linear constraints, which can be solved in polynomial time. This optimization problem can be further reformulated into a dual problem, which is a convex optimization problem with only non-negativity constraints, and thus can be solved more effic Lemma 3.5.1.1. The dual problem of Eq. (3.8) is to solve µt , βt = argmin µ,β≥0 L X−1 k=0 lnZ k t (µ, β) where β := {β(s)}s and µ := {µ +(s, a, s′ ), µ−(s, a, s′ )}(s,a,s′) are dual variables and Z k t (µ, β) = X s∈Sk,a∈A,s′∈Sk+1 qbt s, a, s′ exp n B µ,β t s, a, s′ o , B µ,β t s, a, s′ = β s ′ − β (s) + µ − − µ + s, a, s′ − ηℓbt (s, a) + X y∈Sk(s)+1 µ + − µ − (s, a, y) P¯ i (y|s, a) + µ + + µ − (s, a, y) Bi (s, a, y). Furthermore, the optimal solution to Eq. (3.8) is given by qbt+1 s, a, s′ = qbt (s, a, s′ ) Z k(s) t (µt , βt) exp n B µt,βt t s, a, s′ o . Proof. In the following proof, we omit the non-negativity constraint Eq. (3.9). This is without loss of generality, since the optimal solution for the modified version of Eq. (3.8) without the non-negativity constraint Eq. (3.9) turns out to always satisfy the non-negativity constraint. We write the Lagrangian as: L(q, λ, β, µ) =D (q||q˜t+1) + L X−1 k=0 λk X s∈Sk,a∈A,s′∈Sk+1 q s, a, s′ − 1 + L X−1 k=1 X s∈Sk β (s) X a∈A,s′∈Sk+1 q s, a, s′ − X s ′∈Sk−1,a∈A q s ′ , a, s + L X−1 k=0 X s∈Sk,a∈A,s′∈Sk+1 µ + s, a, s′ q s, a, s′ − P¯ i s ′ |s, a + Bi s, a, s′ X y∈Sk+1 q (s, a, y) + L X−1 k=0 X s∈Sk,a∈A,s′∈Sk+1 µ − s, a, s′ P¯ i s ′ |s, a − Bi s, a, s′ X y∈Sk+1 q (s, a, y) − q where λ := {λk}k, β := {β(s)}s and µ := {µ +(s, a, s′ ), µ−(s, a, s′ )}(s,a,s′) are Lagrange multipliers. We also define β (s0) = β (sL) = 0 for convenience. Now taking the derivative we have ∂L ∂q (s, a, s′) = ln q s, a, s′ − ln ˜qt+1 s, a, s′ + λk(s) + β (s) − β s ′ + µ + − µ − s, a, s′ − X y∈Sk(s)+1 µ + − µ − (s, a, y) P¯ i (y|s, a) + µ + + µ − (s, a, y) Bi (s, a, y) = ln q s, a, s′ − ln ˜qt+1 s, a, s′ + λk(s) − ηℓbt (s, a) − B µ,β t s, a, s′ . Setting the derivative to zero gives the explicit form of the optimal q ⋆ by q ⋆ s, a, s′ = ˜qt+1 s, a, s′ exp n −λk(s) + ηℓbt (s, a) + B µ,β t s, a, s′ o = qbt s, a, s′ exp n −λk(s) + B µ,β t s, a, s′ o . On the other hand, setting ∂L/∂λk = 0 shows that the optimal λ ⋆ satisfies exp {λ ⋆ k } = X s∈Sk,a∈A,s′∈Sk+1 qbt s, a, s′ exp n B µ,β t s, a, s′ o = Z k t (µ, β). It is straightforward to check that strong duality holds, and thus the optimal dual variables µ ⋆ , β⋆ are given by µ ⋆ , β⋆ = argmax µ,β≥0 max λ min q L(q, λ, β, µ) = argmax µ,β≥0 L(q ⋆ , λ⋆ , β, µ). Finally, we note the equality L(q, λ, β, µ) =D (q||q˜t+1) + L X−1 k=0 X s∈Sk,a∈A,s′∈Sk+1 ∂L ∂q (s, a, s′) − ln q s, a, s′ + ln ˜qt+1 s, a, s′ − L X−1 k=1 λk = L X−1 k=0 X s∈Sk,a∈A,s′∈Sk+1 ∂L ∂q (s, a, s′) − 1 q(s, a, s′ ) + ˜qt+1(s, a, s′ ) − L X−1 k=1 λk. This, combined with the fact that q ⋆ has zero partial derivative, gives L(q ⋆ , λ⋆ , β, µ) = − L + L X−1 k=0 X s∈Sk,a∈A,s′∈Sk+1 q˜t+1(s, a, s′ ) − L X−1 k=0 lnZ k t (µ, β). Note that the first two terms in the last expression are independent of (µ, β). We thus have: µ ⋆ , β⋆ = argmax µ,β≥0 L(q ⋆ , λ⋆ , β, µ) = argmin µ,β≥0 L X−1 k=0 lnZ k t (µ, β). Combining all equations for (q ⋆ , λ⋆ , µ⋆ , β⋆ ) finishes the proof. 3.5.2 Computing Upper Occupancy Bounds This subsection explains how to greedily solve the following optimization problem from Eq. 
(3.7): max Pb(·|s,a ˜ ) X s ′∈Sk(˜s)+1 Pb(s ′ |s, a ˜ )f(s ′ ) subject to Pb(·|s, a ˜ ) being a valid distribution over Sk(˜s)+1 and for all s ′ ∈ Sk(˜s)+1, Pb(s ′ |s, a ˜ ) − P¯ i(s ′ |s, a ˜ ) ≤ Bi(˜s, a, s′ ), 32 where (˜s, a) is some fixed state-action pair, Bi(˜s, a, s′ ) is defined in Eq. (3.3), and the value of f(s ′ ) for any s ′ ∈ Sk(˜s)+1 is known. To simplify notation, let n = |Sk(˜s)+1|, and σ : [n] → Sk(˜s)+1 be a bijection such that f(σ(1)) ≤ f(σ(2)) ≤ · · · ≤ f(σ(n)). Further let p¯ and B be shorthands of P¯ i(·|s, a ˜ ) and Bi(˜s, a, ·) respectively. With these notations, the problem becomes max p∈Rn +: P s′ p(s ′ )=1 |p(s ′ )−p¯(s ′ )|≤B(s ′ ) Xn j=1 p(σ(j))f(σ(j)). Clearly, the maximum is achieved by redistributing the distribution p¯ so that it puts as much weight as possible on states with large f value under the constraint. This can be implemented efficiently by maintaining two pointers j − and j + starting from 1 and n respectively, and considering moving as much weight as possible from state s − = σ(j −) to state s + = σ(j +). More specifically, the maximum possible weight change for s − and s + are δ − = min{p¯(s −), B(s −)} and δ + = min{1 − p¯(s +), B(s +)} respectively, and thus we move min{δ −, δ+} amount of weight from s − to s +. In the case where δ − ≤ δ +, no more weight can be decreased from s − and we increase the pointer j − by 1 as well as decreasing B(s +) by δ − to reflect the change in maximum possible weight increase for s +. The situation for the case δ − > δ+ is similar. The procedure stops when the two pointers coincide. See Algorithm 4 for the complete pseudocode. We point out that the step of sorting the values of f and finding σ can in fact be done only once for each layer (instead of every call of Algorithm 4). For simplicity, we omit this refinement. 3.6 Omitted Details for the Analysis In this section, we provide omitted proofs for the regret analysis of our algorithm. 33 Algorithm 4 Greedy Input: f : S → [0, 1], a distribution p¯ over n states of layer k , positive numbers {B(s)}s∈Sk Initialize: j − = 1, j+ = n, sort {f(s)}s∈Sk and find σ such that f(σ(1)) ≤ f(σ(2)) ≤ · · · ≤ f(σ(n)) while j − < j+ do s − = σ(j −), s+ = σ(j +) δ − = min{p¯(s −), B(s −)} ▷maximum weight to decrease for state s − δ + = min{1 − p¯(s +), B(s +)} ▷maximum weight to increase for state s + p¯(s −) ← p¯(s −) − min{δ −, δ+} p¯(s +) ← p¯(s +) + min{δ −, δ+} if δ− ≤ δ+ then B(s +) ← B(s +) − δ − j − ← j − + 1 else B(s −) ← B(s −) − δ + j + ← j + − 1 Return: Pn j=1 p¯(σ(j))f(σ(j)) 3.6.1 Auxiliary Lemmas First, we prove Lemma 3.3.1.1 which states that with probability at least 1−4δ, the true transition function P is within the confidence set Pi for all epoch i. Proof of Lemma 3.3.1.1. By the empirical Bernstein inequality [65, Theorem 4] and union bounds, we have with probability at least 1 − 4δ, for all (s, a, s′ ) ∈ Sk × A × Sk+1, k = 0, . . . , L − 1, and any i ≤ T, P(s ′ |s, a) − P¯ i(s ′ |s, a) ≤ vuut 2P¯ i(s ′ |s, a)(1 − P¯ i(s ′ |s, a)) ln T|S| 2|A| δ max{1, mi(s, a) − 1} + 7 ln T|S| 2 |A| δ 3 max{1, mi(s, a) − 1} ≤ 2 vuut P¯ i(s ′ |s, a) ln T|S||A| δ max{1, mi(s, a) − 1} + 14 ln T|S||A| δ 3 max{1, mi(s, a) − 1} = Bi(s, a, s′ ) which finishes the proof. 34 Next, we state three lemmas that are useful for the rest of the proof. The first one shows a convenient bound on the difference between the true transition function and any transition function from the confidence set. Lemma 3.6.1.1. Under the event of Lemma 3.3.1.1, for all epoch i, all Pb ∈ Pi, all k = 0, . . 
. , L−1 and (s, a, s′ ) ∈ Sk × A × Sk+1, we have Pb(s ′ |s, a) − P(s ′ |s, a) = O vuut P(s ′ |s, a) ln T|S||A| δ max{1, mi(s, a)} + ln T|S||A| δ max{1, mi(s, a)} ≜ B ⋆ i (s ′ |s, a). Proof. Under the event of Lemma 3.3.1.1, we have P¯ i(s ′ |s, a) ≤ P(s ′ |s, a) + 2 vuut P¯ i(s ′ |s, a) ln T|S||A| δ max{1, mi(s, a) − 1} + 14 ln T|S||A| δ 3 max{1, mi(s, a) − 1} . Viewing this as a quadratic inequality of p P¯ i(s ′ |s, a) and solving for P¯ i(s ′ |s, a) prove the lemma. The next one is a standard Bernstein-type concentration inequality for martingale. We use the version of Theorem 1 from [17]. Lemma 3.6.1.2. Let Y1, . . . , YT be a martingale difference sequence with respect to a filtration F1, . . . , FT . Assume Yt ≤ R a.s. for all i. Then for any δ ∈ (0, 1) and λ ∈ [0, 1/R], with probability at least 1 − δ, we have X T t=1 Yt ≤ λ X T t=1 Et [Y 2 t ] + ln(1/δ) λ . The last one is a based on similar ideas used for proving many other optimistic algorithms. Lemma 3.6.1.3. With probability at least 1 − 2δ, we have for all k = 0, . . . , L − 1, X T t=1 X s∈Sk,a∈A qt(s, a) max{1, mit (s, a)} = O (|Sk||A| ln T + ln(L/δ)) (3.10) 35 and X T t=1 X s∈Sk,a∈A qt(s, a) p max{1, mit (s, a)} = O p |Sk||A|T + |Sk||A| ln T + ln(L/δ) . (3.11) Proof. Let It(s, a) be the indicator of whether the pair (s, a) is visited in episode t so that Et [It(s, a)] = qt(s, a). We decompose the first quantity as X T t=1 X s∈Sk,a∈A qt(s, a) max{1, mit (s, a)} = X T t=1 X s∈Sk,a∈A It(s, a) max{1, mit (s, a)} + X T t=1 X s∈Sk,a∈A qt(s, a) − It(s, a) max{1, mit (s, a)} . The first term can be bounded as X s∈Sk,a∈A X T t=1 It(s, a) max{1, mit (s, a)} = X s∈Sk,a∈A O (ln T) = O (|Sk||A| ln T). To bound the second term, we apply Lemma 3.6.1.2 with Yt = P s∈Sk,a∈A qt(s,a)−It(s,a) max{1,mit (s,a)} ≤ 1, λ = 1/2, and the fact Et [Y 2 t ] ≤ Et X s∈Sk,a∈A It(s, a) max{1, mit (s, a)} 2 = Et X s∈Sk,a∈A It(s, a) max{1, m2 it (s, a)} (It(s, a)It(s ′ , a′ ) = 0 for s ̸= s ′ ∈ Sk) ≤ X s∈Sk,a∈A qt(s, a) max{1, mit (s, a)} , which gives with probability at least 1 − δ/L, X T t=1 X s∈Sk,a∈A qt(s, a) − It(s, a) max{1, mit (s, a)} ≤ 1 2 X T t=1 X s∈Sk,a∈A qt(s, a) max{1, mit (s, a)} + 2 ln L δ . Combining these two bounds, rearranging, and applying a union bound over k prove Eq. (3.10). 36 Similarly, we decompose the second quantity as X T t=1 X s∈Sk,a∈A qt(s, a) p max{1, mit (s, a)} = X T t=1 X s∈Sk,a∈A It(s, a) p max{1, mit (s, a)} + X T t=1 X s∈Sk,a∈A qt(s, a) − It(s, a) p max{1, mit (s, a)} . The first term is bounded by X s∈Sk,a∈A X T t=1 It(s, a) p max{1, mit (s, a)} = O X s∈Sk,a∈A q NiT (s, a) ≤ O s |Sk||A| X s∈Sk,a∈A NiT (s, a) = O p |Sk||A|T , where the second line uses the Cauchy-Schwarz inequality and the fact P s∈Sk,a∈A NiT (s, a) ≤ T. To bound the second term, we again apply Lemma 3.6.1.2 with Yt = P s∈Sk,a∈A √ qt(s,a)−It(s,a) max{1,mit (s,a)} ≤ 1, λ = 1, and the fact Et [Y 2 t ] ≤ Et X s∈Sk,a∈A It(s, a) p max{1, mit (s, a)} 2 = X s∈Sk,a∈A qt(s, a) max{1, mit (s, a)} , which shows with probability at least 1 − δ/L, X T t=1 X s∈Sk,a∈A qt(s, a) − It(s, a) p max{1, mit (s, a)} ≤ X T t=1 X s∈Sk,a∈A qt(s, a) max{1, mit (s, a)} + ln L δ . Combining Eq. (3.10) and a union bound proves Eq. (3.11). 3.6.2 Proof of the Key Lemma We are now ready to prove Lemma 3.4.2, the key lemma of our analysis which requires using our new confidence set. 37 Proof of Lemma 3.4.2. To simplify notation, let q s t = q P s t ,πt . 
Note that for any occupancy measure q, by definition we have for any (s, a) pair, q(s, a) = π q (s|a) X {sk∈Sk,ak∈A} k(s)−1 k=0 k( Ys)−1 h=0 π q (ah|sh) k( Ys)−1 h=0 P q (sh+1|sh, ah). where we define sk(s) = s for convenience. Therefore, we have |q s t (s, a)−qt(s, a)| = πt(s|a) X {sk,ak} k(s)−1 k=0 k( Ys)−1 h=0 πt(ah|sh) k( Ys)−1 h=0 P s t (sh+1|sh, ah) − k( Ys)−1 h=0 P(sh+1|sh, ah) . By adding and subtracting k(s) − 1 terms we rewrite the last term in the parentheses as k( Ys)−1 h=0 P s t (sh+1|sh, ah) − k( Ys)−1 h=0 P(sh+1|sh, ah) = k( Ys)−1 h=0 P s t (sh+1|sh, ah) − k( Ys)−1 h=0 P(sh+1|sh, ah) ± k( Xs)−1 m=1 mY−1 h=0 P(sh+1|sh, ah) k( Ys)−1 h=m P s t (sh+1|sh, ah) = k( Xs)−1 m=0 (P s t (sm+1|sm, am) − P(sm+1|sm, am)) mY−1 h=0 P(sh+1|sh, ah) k( Ys)−1 h=m+1 P s t (sh+1|sh, ah), which, by Lemma 3.6.1.1, is bounded by k( Xs)−1 m=0 B ⋆ it (sm, am, sm+1) mY−1 h=0 P(sh+1|sh, ah) k( Ys)−1 h=m+1 P s t (sh+1|sh, ah). We have thus shown |q s t (s, a) − qt(s, a)| ≤ πt(s|a) X {sk,ak} k(s)−1 k=0 k( Ys)−1 h=0 πt(ah|sh) k( Xs)−1 m=0 B ⋆ it (sm, am, sm+1) mY−1 h=0 P(sh+1|sh, ah) k( Ys)−1 h=m+1 P s t (sh+1|sh, ah) 38 = k( Xs)−1 m=0 X {sk,ak} k(s)−1 k=0 B ⋆ it (sm, am, sm+1) πt(am|sm) mY−1 h=0 πt(ah|sh)P(sh+1|sh, ah) ! · πt(s|a) k( Ys)−1 h=m+1 πt(ah|sh)P s t (sh+1|sh, ah) = k( Xs)−1 m=0 X sm,am,sm+1 B ⋆ it (sm, am, sm+1) X {sk,ak} m−1 k=0 πt(am|sm) mY−1 h=0 πt(ah|sh)P(sh+1|sh, ah) · X am+1 X {sk,ak} k(s)−1 k=m+2 πt(s|a) k( Ys)−1 h=m+1 πt(ah|sh)P s t (sh+1|sh, ah) = k( Xs)−1 m=0 X sm,am,sm+1 B ⋆ it (sm, am, sm+1)qt(sm, am)q s t (s, a|sm+1), (3.12) where we use q s t (s, a|sm+1) to denote the probability of encountering pair (s, a) given that sm+1 was visited in layer m + 1, under policy πt and transition P s t . By the exact same reasoning, we also have |q s t (s, a|sm+1) − qt(s, a|sm+1)| ≤ k( Xs)−1 h=m+1 X s ′ h ,a′ h ,s′ h+1 B ⋆ it (s ′ h , a′ h , s′ h+1)qt(s ′ h , a′ h |sm+1)q s t (s, a|s ′ h+1) ≤ πt(a|s) k( Xs)−1 h=m+1 X s ′ h ,a′ h ,s′ h+1 B ⋆ it (s ′ h , a′ h , s′ h+1)qt(s ′ h , a′ h |sm+1) (3.13) Combining Eq. (3.12) and Eq. (3.13), summing over all t and (s, a), and using the shorthands wm = (sm, am, sm+1) and w ′ h = (s ′ h , a′ h , s′ h+1), we have derived X T t=1 X s∈S,a∈A |q s t (s, a) − qt(s, a)| ≤ X t,s,a k( Xs)−1 m=0 X wm B ⋆ it (sm, am, sm+1)qt(sm, am)qt(s, a|sm+1) + X t,s,a k( Xs)−1 m=0 X wm B ⋆ it (sm, am, sm+1)qt(sm, am) πt(a|s) k( Xs)−1 h=m+1 X w′ h B ⋆ it (s ′ h , a′ h , s′ h+1)qt(s ′ h , a′ h |sm+1) 39 = X t X k<L X k−1 m=0 X wm B ⋆ it (sm, am, sm+1)qt(sm, am) X s∈Sk,a∈A qt(s, a|sm+1) + X t X k<L X k−1 m=0 X wm X k−1 h=m+1 X w′ h B ⋆ it (sm, am, sm+1)qt(sm, am)B ⋆ it (s ′ h , a′ h , s′ h+1)qt(s ′ h , a′ h |sm+1) X s∈Sk,a∈A πt(a|s) = X 0≤m<k<L X t,wm B ⋆ it (sm, am, sm+1)qt(sm, am) + X 0≤m<h<k<L |Sk| X t,wm,w′ h B ⋆ it (sm, am, sm+1)qt(sm, am)B ⋆ it (s ′ h , a′ h , s′ h+1)qt(s ′ h , a′ h |sm+1) ≤ X 0≤m<k<L X t,wm B ⋆ it (sm, am, sm+1)qt(sm, am) | {z } ≜B1 + |S| X 0≤m<h<L X t,wm,w′ h B ⋆ it (sm, am, sm+1)qt(sm, am)B ⋆ it (s ′ h , a′ h , s′ h+1)qt(s ′ h , a′ h |sm+1) | {z } ≜B2 . It remains to bound B1 and B2 using the definition of B⋆ it . For B1, we have B1 = O X 0≤m<k<L X t,wm qt(sm, am) vuut P(sm+1|sm, am) ln T|S||A| δ max{1, mit (sm, am)} + qt(sm, am) ln T|S||A| δ max{1, mit (sm, am)} ≤ O X 0≤m<k<L X t,sm,am qt(sm, am) vuut |Sm+1| ln T|S||A| δ max{1, mit (sm, am)} + qt(sm, am) ln T|S||A| δ max{1, mit (sm, am)} ≤ O X 0≤m<k<L s |Sm||Sm+1||A|T ln T|S||A| δ ≤ O X 0≤m<k<L (|Sm| + |Sm+1|) s |A|T ln T|S||A| δ = O L|S| s |A|T ln T|S||A| δ ! 
, where the second line uses the Cauchy-Schwarz inequality, the third line uses Lemma 3.6.1.3, and the fourth line uses the AM-GM inequality. 40 For B2, plugging the definition of B⋆ it and using trivial bounds (that is, B⋆ it and qt are both at most 1 regardless of the arguments), we obtain the following three terms (ignoring constants) X 0≤m<h<L X t,wm,w′ h vuut P(sm+1|sm, am) ln T|S||A| δ max{1, mit (sm, am)} qt(sm, am) vuut P(s ′ h+1|s ′ h , a′ h ) ln T|S||A| δ max{1, mit (s ′ h , a′ h )} qt(s ′ h , a′ h |sm+1) + X 0≤m<h<L X t,wm,w′ h qt(sm, am) ln T|S||A| δ max{1, mit (sm, am)} + X 0≤m<h<L X t,wm,w′ h qt(s ′ h , a′ h ) ln T|S||A| δ max{1, mit (s ′ h , a′ h )} . The last two terms are both of order O(ln T) by Lemma 3.6.1.3 (ignoring dependence on other parameters), while the first term can be written as ln T|S||A| δ multiplied by the following: X 0≤m<h<L X t,wm,w′ h s qt(sm, am)P(s ′ h+1|s ′ h , a′ h )qt(s ′ h , a′ h |sm+1) max{1, mit (sm, am)} s qt(sm, am)P(sm+1|sm, am)qt(s ′ h , a′ h |sm+1) max{1, mit (s ′ h , a′ h )} ≤ X 0≤m<h<L vuut X t,wm,w′ h qt(sm, am)P(s ′ h+1|s ′ h , a′ h )qt(s ′ h , a′ h |sm+1) max{1, mit (sm, am)} vuut X t,wm,w′ h qt(sm, am)P(sm+1|sm, am)qt(s ′ h , a′ h |sm+1) max{1, mit (s ′ h , a′ h )} = X 0≤m<h<L vuut|Sm+1| X t,sm,am qt(sm, am) max{1, mit (sm, am)} vuut|Sh+1| X t,s′ h ,a′ h qt(s ′ h , a′ h ) max{1, mit (s ′ h , a′ h )} = O |A| ln T|S||A| δ X 0≤m<h<L p |Sm||Sm+1||Sh||Sh+1| = O L 2 |S| 2 |A| ln T|S||A| δ , where the second line uses the Cauchy-Schwarz inequality and the last line uses Lemma 3.6.1.3 again. This shows that the entire term B2 is of order O(ln T). Finally, realizing that we have conditioned on the events stated in Lemmas 3.6.1.1 and 3.6.1.3, which happen with probability at least 1 − 6δ, finishes the proof. 3.6.3 Bounding Reg and Bias2 In this section, we complete the proof of our main theorem by bounding the terms Reg and Bias2. We first state the following useful concentration lemma which is a variant of [69, Lemma 1] and 41 is the key for analyzing the implicit exploration effect introduced by γ. The proof is based on the same idea of the proof for [69, Lemma 1]. Lemma 3.6.3.1. For any sequence of functions α1, . . . , αT such that αt ∈ [0, 2γ] S×A is Ftmeasurable for all t, we have with probability at least 1 − δ, X T t=1 X s,a αt(s, a) ℓbt(s, a) − qt(s, a) ut(s, a) ℓt(s, a) ≤ Lln L δ . Proof. Fix any t. For simplicity, let β = 2γ and It,s,a be a shorthand of I{sk(s) = s, ak(s) = a}. Then for any state-action pair (s, a), we have ℓbt(s, a) = ℓt(s, a)It,s,a ut(s, a) + γ ≤ ℓt(s, a)It,s,a ut(s, a) + γℓt(s, a) = It,s,a β · 2γℓt(s, a)/ut(s, a) 1 + γℓt(s, a)/ut(s, a) ≤ 1 β ln 1 + βℓt(s, a)It,s,a ut(s, a) , (3.14) where the last step uses the fact z 1+z/2 ≤ ln(1+z) for all z ≥ 0. For each layer k < L, further define Sbt,k = X s∈Sk,a∈A αt(s, a)ℓbt(s, a) and St,k = X s∈Sk,a∈A αt(s, a) qt(s, a) ut(s, a) ℓt(s, a). The following calculation shows Et h exp(Sbt,k) i ≤ exp(St,k): Et h exp(Sbt,k) i ≤ Et exp X s∈Sk,a∈A αt(s, a) β ln 1 + βℓt(s, a)It,s,a ut(s, a) (by Eq. (3.14)) ≤ Et Y s∈Sk,a∈A 1 + αt(s, a)ℓt(s, a)It,s,a ut(s, a) = Et 1 + X s∈Sk,a∈A αt(s, a)ℓt(s, a)It,s,a ut(s, a) = 1 + St,k ≤ exp(St,k). 42 Here, the second inequality is due to the fact z1 ln(1 + z2) ≤ ln(1 + z1z2) for all z2 ≥ −1 and z1 ∈ [0, 1], and we apply it with z1 = αt(s,a) β which is in [0, 1] by the condition αt(s, a) ∈ [0, 2γ]; the first equality holds since It,s,aIt,s′ ,a′ = 0 for any s ̸= s ′ or a ̸= a ′ (as only one state-action pair can be visited in each layer for an episode). 
Next we apply Markov inequality and show Pr "X T t=1 (Sbt,k − St,k) > ln L δ # ≤ δ L · E " exp X T t=1 (Sbt,k − St,k) !# = δ L · E " exp T X−1 t=1 (Sbt,k − St,k) ! ET h exp SbT,k − ST,ki# ≤ δ L · E " exp T X−1 t=1 (Sbt,k − St,k) !# ≤ · · · ≤ δ L . (3.15) Finally, applying a union bound over k = 0, . . . , L − 1 shows with probability at least 1 − δ, X T t=1 X s,a αt(s, a) ℓbt(s, a) − qt(s, a) ut(s, a) ℓt(s, a) = L X−1 k=0 X T t=1 (Sbt,k − St,k) ≤ Lln L δ , which completes the proof. Bounding Reg. To bound Reg = PT t=1⟨qbt −q ⋆ , ℓbt⟩, note that under the event of Lemma 3.3.1.1, q ⋆ ∈ ∩i ∆(Pi), and thus Reg is controlled by the standard regret guarantee of OMD. Specifically, we prove the following lemma. Lemma 3.6.3.2. With probability at least 1 − 5δ, UOB-REPS ensures Reg = O L ln(|S||A|) η + η|S||A|T + ηL ln(L/δ) γ . 43 Proof. By standard analysis (see Lemma 3.6.3.3 after this proof), OMD with KL-divergence ensures for any q ∈ ∩i ∆(Pi), X T t=1 ⟨qbt − q, ℓbt⟩ ≤ Lln(|S| 2 |A|) η + η X t,s,a qbt(s, a)ℓbt(s, a) 2 . Further note that qbt(s, a)ℓbt(s, a) 2 is bounded by qbt(s, a) ut(s, a) + γ ℓbt(s, a) ≤ ℓbt(s, a) by the fact qbt(s, a) ≤ ut(s, a). Applying Lemma 3.6.3.1 with αt(s, a) = 2γ then shows with probability at least 1 − δ, X t,s,a qbt(s, a)ℓbt(s, a) 2 ≤ X t,s,a qt(s, a) ut(s, a) ℓt(s, a) + Lln L δ 2γ . Finally, note that under the event of Lemma 3.3.1.1, we have q ⋆ ∈ ∩i ∆(Pi), qt(s, a) ≤ ut(s, a), and thus qt(s,a) ut(s,a) ℓt(s, a) ≤ 1. Applying a union bound then finishes the proof. Lemma 3.6.3.3. The OMD update with qb1(s, a, s′ ) = 1 |Sk||A||Sk+1| for all k < L and (s, a, s′ ) ∈ Sk × A × Sk+1, and qbt+1 = argmin q∈∆(Pit ) η⟨q, ℓbt⟩ + D(q ∥ qbt) where D(q ∥ q ′ ) = P s,a,s′ q(s, a, s′ ) ln q(s,a,s′ ) q ′(s,a,s′) − P s,a,s′ (q(s, a, s′ ) − q ′ (s, a, s′ )) ensures X T t=1 ⟨qbt − q, ℓbt⟩ ≤ Lln(|S| 2 |A|) η + η X t,s,a qbt(s, a)ℓbt(s, a) 2 for any q ∈ ∩i ∆(Pi), as long as ℓbt(s, a) ≥ 0 for all t, s, a. 44 Proof. Define q˜t+1 such that q˜t+1(s, a, s′ ) = qbt(s, a, s′ ) exp −ηℓbt(s, a) . It is straightforward to verify qbt+1 = argminq∈∆(Pit ) D(q ∥ q˜t+1) and also η⟨qbt − q, ℓbt⟩ = D(q ∥ qbt) − D(q ∥ q˜t+1) + D(qbt ∥ q˜t+1). By the condition q ∈ ∆(Pit ) and the generalized Pythagorean theorem we also have D(q ∥ qbt+1) ≤ D(q ∥ q˜t+1) and thus η X T t=1 ⟨qbt − q, ℓbt⟩ ≤ X T t=1 (D(q ∥ qbt) − D(q ∥ qbt+1) + D(qbt ∥ q˜t+1)) = D(q ∥ qb1) − D(q ∥ qbT +1) +X T t=1 D(qbt ∥ q˜t+1). The first two terms can be rewritten as L X−1 k=0 X s∈Sk X a∈A X s ′∈Sk+1 q(s, a, s′ ) ln qbT +1(s, a, s′ ) qb1(s, a, s′) ≤ L X−1 k=0 X s∈Sk X a∈A X s ′∈Sk+1 q(s, a, s′ ) ln(|Sk||A||Sk+1|) (by definition of qb1) = L X−1 k=0 ln(|Sk||A||Sk+1|) ≤ Lln(|S| 2 |A|). It remains to bound the term D(qbt ∥ q˜t+1): D(qbt ∥ q˜t+1) = L X−1 k=0 X s∈Sk X a∈A X s ′∈Sk+1 ηqbt(s, a, s′ )ℓbt(s, a) − qbt(s, a, s′ ) + qbt(s, a, s′ ) exp −ηℓbt(s, a) 45 ≤ η 2 L X−1 k=0 X s∈Sk X a∈A X s ′∈Sk+1 qbt(s, a, s′ )ℓbt(s, a) 2 = η 2 X s∈S,a∈A qbt(s, a)ℓbt(s, a) 2 where the inequality is due to the fact e −z ≤ 1 − z + z 2 for all z ≥ 0. This finishes the proof. Bounding Bias2. It remains to bound the term Bias2 = PT t=1⟨q ⋆ , ℓbt − ℓt⟩, which can be done via a direct application of Lemma 3.6.3.1. Lemma 3.6.3.4. With probability at least 1 − 5δ, UOB-REPS ensures Bias2 = O L ln(|S||A|/δ) γ . Proof. For each state-action pair (s, a), we apply Eq. (3.15) in Lemma 3.6.3.1 with αt(s ′ , a′ ) = 2γI{s ′ = s, a′ = a}, which shows that with probability at least 1 − δ |S||A| , X T t=1 ℓbt(s, a) − qt(s, a) ut(s, a) ℓt(s, a) ≤ 1 2γ ln |S||A| δ . 
Taking a union bound over all state-action pairs shows that with probability at least $1-\delta$, we have for all occupancy measures $q \in \Omega$,
$$\sum_{t=1}^T \big\langle q, \widehat{\ell}_t - \ell_t \big\rangle \le \sum_{t,s,a} q(s,a)\ell_t(s,a)\left(\frac{q_t(s,a)}{u_t(s,a)} - 1\right) + \sum_{s,a} q(s,a)\, \frac{\ln\frac{|S||A|}{\delta}}{2\gamma} = \sum_{t,s,a} q(s,a)\ell_t(s,a)\left(\frac{q_t(s,a)}{u_t(s,a)} - 1\right) + \frac{L\ln\frac{|S||A|}{\delta}}{2\gamma}.$$
Note again that under the event of Lemma 3.3.1.1, we have $q_t(s,a) \le u_t(s,a)$, so the first term of the bound above is nonpositive. Applying a union bound and taking $q = q^\star$ finishes the proof.

Chapter 4

Adaptivity against Adversarial Losses with Fixed Unknown Transition

In this chapter, we study the adaptivity of RL algorithms against adversarial losses with a fixed and unknown transition function (that is, $P_t = P$ for every episode $t$). Specifically, we consider the best-of-both-worlds problem for learning $T$ episodic Markov Decision Processes, with the goal of achieving $\widetilde{\mathcal{O}}(\sqrt{T})$ regret when the losses are adversarial and simultaneously $\mathcal{O}(\mathrm{polylog}(T))$ regret when the losses are (almost) stochastic. Our previous work [50] achieves this goal when the fixed transition is known, and leaves the case of unknown transition as a major open question. In a follow-up work [47], we resolve this open problem by using the same Follow-the-Regularized-Leader (FTRL) framework together with a set of new techniques. We first propose a loss-shifting trick in the FTRL analysis, which greatly simplifies the approach of [50] and already improves their results for the known transition case. Then, we extend this idea to the unknown transition case and develop a novel analysis which upper bounds the transition estimation error by (a fraction of) the regret itself in the stochastic setting, a key property to ensure $\mathcal{O}(\mathrm{polylog}(T))$ regret.

4.1 Introduction

When the losses are stochastically generated, [84, 92] show that $\mathcal{O}(\log T)$ regret is achievable (ignoring dependence on some gap-dependent quantities for simplicity). On the other hand, even when the losses are adversarially generated, [78, 46] show that $\widetilde{\mathcal{O}}(\sqrt{T})$ regret is achievable.∗ Given that the existing algorithms for these two worlds are substantially different, Jin and Luo [50] asked the natural question of whether one can achieve the best of both worlds, that is, enjoying (poly)logarithmic regret in the stochastic world while simultaneously ensuring some worst-case robustness in the adversarial world. Taking inspiration from the bandit literature and using the classic Follow-the-Regularized-Leader (FTRL) framework with a novel regularizer, they successfully achieved this goal, albeit under the strong restriction that the transition has to be known ahead of time. Since it is highly unclear how to ensure that the transition estimation error is only $\mathcal{O}(\mathrm{polylog}(T))$, extending their results to the unknown transition case is highly challenging and was left as a key open question.

In our work [47], we resolve this open question and propose the first algorithm with such a best-of-both-worlds guarantee under unknown transition. Specifically, our algorithm enjoys $\widetilde{\mathcal{O}}(\sqrt{T})$ regret always, and simultaneously $\mathcal{O}(\log^2 T)$ regret if the losses are i.i.d. samples of a fixed distribution. More generally, our polylogarithmic regret holds under a general condition similar to that of [50], which requires neither independence nor identical distributions. For example, it covers the corrupted i.i.d. setting, where our algorithm achieves $\widetilde{\mathcal{O}}(\sqrt{C})$ regret with $C \le T$ being the total amount of corruption.

∗Throughout the paper, we use $\widetilde{\mathcal{O}}(\cdot)$ to hide polylogarithmic terms.

Techniques. Our results are achieved via three new techniques.
First, we propose a new loss-shifting trick for the FTRL analysis when applied to MDPs. While similar ideas have been used for the special case of multi-armed bandits (e.g., [90, 100, 57, 101]), their extension to MDPs has eluded researchers, which is also the reason why [50] resorts to a different approach with a highly complex analysis involving the inverse of the non-diagonal Hessian of a complicated regularizer. Instead, inspired by the well-known performance difference lemma, we design a key shifting function in the FTRL analysis, which helps reduce the variance of the stability term and eventually leads to an adaptive bound with a certain self-bounding property known to be useful for the stochastic world. To better illustrate this idea, we use the known transition case as a warm-up example in Section 4.3, and show that the simple Tsallis entropy regularizer (with a diagonal Hessian) is already enough to achieve the best-of-both-worlds guarantee. This not only greatly simplifies the approach of [50] (paving the way for the extension to unknown transition), but also leads to bounds with better dependence on some parameters, which on its own is already a notable result.

Our second technique is a new framework to deal with unknown transition under adversarial losses, which is important for incorporating the loss-shifting trick mentioned above. Specifically, when the transition is unknown, prior works [78, 79, 46, 54] perform FTRL over the set of all plausible occupancy measures according to a confidence set of the true transition, which can be seen as a form of optimism encouraging exploration. Since our loss-shifting trick requires a fixed transition, we propose to move the optimism from the decision set of FTRL to the losses fed to FTRL. More specifically, we perform FTRL over the empirical transition in some doubling epoch schedule, and add (negative) bonuses to the loss functions so that the algorithm is optimistic and never underestimates the quality of a policy, an idea often used in the stochastic setting (e.g., [15]). See Section 4.4 for the details of our algorithm.

Finally, we develop a new analysis to show that the transition estimation error of our algorithm is only polylogarithmic in $T$, overcoming the most critical obstacle in achieving best-of-both-worlds. An important aspect of our analysis is to make use of the amount of underestimation of the optimal policy, a term that is often ignored since it is nonpositive for optimistic algorithms. We do so by proposing a novel decomposition of the regret inspired by the work of [84], and show that in the stochastic world, every term in this decomposition can be bounded by a fraction of the regret itself plus some polylogarithmic terms, which is enough to conclude the final polylogarithmic regret bound. See Section 4.5 for a formal summary of this idea.

Related work. For earlier results in each of the two worlds, we refer the readers to the systematic surveys in [84, 92, 46]. The work closest to ours is [50], which assumes known transition; as mentioned, we strictly improve their bounds and, more importantly, extend their results to the unknown transition case. Two recent works [63, 20] also consider the corrupted stochastic setting, where both the losses and the transition function can be corrupted by a total amount of $C$. This is more general than our results since we assume a fixed transition and only allow the losses to be corrupted.
On the other hand, their bounds are worse than ours when specialized to our setting — [63] ensures a gap-dependent polylogarithmic regret bound of $\mathcal{O}(C\log^3 T + C^2)$, while [20] achieves $\mathcal{O}(\log^3 T + C)$ but with a potentially larger gap-dependent quantity. Therefore, neither result provides a meaningful guarantee in the adversarial world when $C = T$, while our algorithm always ensures a robustness guarantee with $\widetilde{\mathcal{O}}(\sqrt{T})$ regret. Their algorithms are also very different from ours and are not based on FTRL.

The question of achieving best-of-both-worlds guarantees for the special case of multi-armed bandits was first proposed in [18]. Since then, many improvements using different approaches have been established over the years [82, 13, 81, 90, 61, 38, 99, 101, 56]. One notable and perhaps surprising approach is to use the FTRL framework, originally designed only for the adversarial setting but later found to be able to automatically adapt to the stochastic setting as long as certain regularizers are applied [90, 99, 101]. Our approach falls into this category, and our regularizer design is also based on these prior works. As mentioned, however, obtaining our results requires the new loss-shifting technique as well as the novel analysis on controlling the estimation error, both of which are critical to address the extra challenges presented in MDPs.

4.2 Preliminaries

We consider the problem of learning $T$ episodic MDPs with a fixed unknown transition $P$ (that is, $P_t = P$ for all $t$) in two feedback settings. In the full-information setting, the learner observes the entire loss function $\ell_t$, while in the more challenging bandit feedback setting, the learner only observes the losses of the visited state-action pairs, that is, $\ell_t(s_{t,0}, a_{t,0}), \ldots, \ell_t(s_{t,L-1}, a_{t,L-1})$. With slight abuse of notation, we denote the expected loss of a policy $\pi$ for episode $t$ by $\ell_t(\pi) = \mathbb{E}\big[\sum_{k=0}^{L-1}\ell_t(s_k, a_k) \,\big|\, P, \pi\big]$, where the trajectory $\{(s_k, a_k)\}_{k=0,\ldots,L-1}$ is generated by executing policy $\pi$ under transition $P$. The regret of the learner against some policy $\pi$ is then defined as $\text{Reg}_T(\pi) = \mathbb{E}\big[\sum_{t=1}^T \ell_t(\pi_t) - \ell_t(\pi)\big]$, and we denote by $\mathring{\pi}$ one of the optimal policies in hindsight, so that $\text{Reg}_T(\mathring{\pi}) = \max_\pi \text{Reg}_T(\pi)$.

Adversarial world versus stochastic world. We consider two different setups depending on how the loss functions $\ell_1, \ldots, \ell_T$ are generated. In the adversarial world, the environment decides the loss functions arbitrarily with knowledge of the learner's algorithm (but not her randomness). In this case, the goal is to minimize the regret against the best policy, $\text{Reg}_T(\mathring{\pi})$, with the best existing upper bound being $\widetilde{\mathcal{O}}(L|S|\sqrt{|A|T})$ [78, 46] and the best lower bound being $\Omega(L\sqrt{|S||A|T})$ [45] (for both full-information and bandit feedback).

In the stochastic world, following [50] (which generalizes the bandit case of [100, 101]), we assume that the loss functions satisfy the following condition: there exist a deterministic policy $\pi^\star : S \to A$, a gap function $\Delta : S \times A \to \mathbb{R}_+$, and a constant $C > 0$ such that
$$\text{Reg}_T(\pi^\star) \ge \mathbb{E}\left[\sum_{t=1}^T \sum_{s \ne s_L}\sum_{a \ne \pi^\star(s)} q_t(s,a)\Delta(s,a)\right] - C, \qquad (4.1)$$
where $q_t(s,a)$ is the probability of the learner visiting $(s,a)$ in episode $t$. This general condition covers the heavily-studied i.i.d. setting where $\ell_1, \ldots, \ell_T$ are i.i.d. samples of a fixed distribution, in which case $C = 0$, $\pi^\star$ is simply the optimal policy, and $\Delta$ is the gap function with respect to the optimal Q-function. More generally, the condition also covers the corrupted i.i.d.
setting, with $C$ being the total amount of corruption. We refer the readers to [50] for a detailed explanation. In this stochastic world, our goal is to minimize the regret against $\pi^\star$, that is, $\text{Reg}_T(\pi^\star)$.† With unknown transition, this general setup has not been studied before, but for specific examples such as the i.i.d. setting, regret bounds of order $\mathcal{O}\big(\frac{\log T}{\Delta_{\min}}\big)$, where $\Delta_{\min} = \min_{s, a \ne \pi^\star(s)} \Delta(s,a)$, have been derived [84, 92].

†Some works (such as [50]) still consider minimizing $\text{Reg}_T(\mathring{\pi})$ as the goal in this case. More discussions are deferred to the last paragraph of Section 4.4.1.

Occupancy measure and FTRL. To solve this problem with online learning techniques, we will use the concept of occupancy measure. Our earlier notation $q_t$ in Eq. (4.1) is thus simply a shorthand for $q^{P,\pi_t}$. Moreover, by definition, $\ell_t(\pi)$ can be rewritten as $\langle q^{P,\pi}, \ell_t\rangle$ by naturally treating $q^{P,\pi}$ and $\ell_t$ as vectors in $\mathbb{R}^{|S|\times|A|}$, and thus the regret $\text{Reg}_T(\pi)$ can be written as $\mathbb{E}\big[\sum_{t=1}^T \langle q_t - q^{P,\pi}, \ell_t\rangle\big]$, connecting the problem to online linear optimization.

Given a transition function $\bar{P}$, we denote by $\Omega(\bar{P}) = \{q^{\bar{P},\pi} : \pi \text{ is a stochastic policy}\}$ the set of all valid occupancy measures associated with the transition $\bar{P}$. It is known that $\Omega(\bar{P})$ is a simple polytope with $\mathcal{O}(|S||A|)$ constraints [97]. When $P$ is unknown, our algorithm uses an estimated transition $\bar{P}$ as a proxy and searches for a "good" occupancy measure within $\Omega(\bar{P})$. More specifically, this is done by the classic Follow-the-Regularized-Leader (FTRL) framework, which solves the following at the beginning of episode $t$:
$$\widehat{q}_t = \operatorname*{argmin}_{q \in \Omega(\bar{P})} \left\langle q, \sum_{\tau<t}\widehat{\ell}_\tau\right\rangle + \phi_t(q), \qquad (4.2)$$
where $\widehat{\ell}_\tau$ is some estimator for $\ell_\tau$ and $\phi_t$ is some regularizer. The learner's policy $\pi_t$ is then defined through $\pi_t(a|s) \propto \widehat{q}_t(s,a)$. Note that we have $\widehat{q}_t = q^{\bar{P},\pi_t}$, but not necessarily $\widehat{q}_t = q_t$ unless $\bar{P} = P$.

4.3 Warm-up for Known Transition: A New Loss-shifting Technique

One of the key components of our approach is a new loss-shifting technique for analyzing FTRL applied to MDPs. To illustrate the key idea in a clean manner, in this section we focus on the known transition setting with bandit feedback, the same setting studied by Jin and Luo [50]. As we will show, our method not only improves their bounds, but also significantly simplifies the analysis, which paves the way for extending the result to the unknown transition setting studied in the following sections.

First note that when $P$ is known, one can simply take $\bar{P} = P$ (so that $\widehat{q}_t = q_t$) and use the standard importance-weighted estimator $\widehat{\ell}_\tau(s,a) = \ell_\tau(s,a) I_\tau(s,a) / q_\tau(s,a)$ in the FTRL framework Eq. (4.2), where $I_\tau(s,a)$ is $1$ if $(s,a)$ is visited in episode $\tau$, and $0$ otherwise. It remains to determine the regularizer $\phi_t$. While there are many choices of $\phi_t$ leading to $\sqrt{T}$-regret in the adversarial world, obtaining logarithmic regret in the stochastic world requires some special property of the regularizer. Specifically, generalizing the idea of [100] for multi-armed bandits, [50] shows that it suffices to find $\phi_t$ such that the following adaptive regret bound holds:
$$\text{Reg}_T(\mathring{\pi}) \lesssim \mathbb{E}\left[\sum_{t=1}^T \sum_{s \ne s_L}\sum_{a \ne \pi^\star(s)} \sqrt{\frac{q_t(s,a)}{t}}\right], \qquad (4.3)$$
which then automatically implies logarithmic regret under Eq. (4.1). This is because Eq. (4.3) admits a self-bounding property under Eq. (4.1) — one can bound the right-hand side of Eq. (4.3) as follows using the AM-GM inequality (for any $z > 0$), which can then be related to the regret itself using Eq. (4.1):
$$\mathbb{E}\left[\sum_{t=1}^T \sum_{s \ne s_L}\sum_{a \ne \pi^\star(s)} \left(\frac{q_t(s,a)\Delta(s,a)}{2z} + \frac{z}{2t\Delta(s,a)}\right)\right] \le \frac{\text{Reg}_T(\mathring{\pi}) + C}{2z} + z\sum_{s \ne s_L}\sum_{a \ne \pi^\star(s)} \frac{\log T}{\Delta(s,a)}. \qquad (4.4)$$
Rearranging and picking the optimal $z$ then shows a logarithmic bound for $\text{Reg}_T(\mathring{\pi})$ (see Section 2 of [50] for detailed discussions).

To achieve Eq. (4.3), a natural candidate for $\phi_t$ would be a direct generalization of the Tsallis-entropy regularizer of [100], which takes the form $\phi_t(q) = -\frac{1}{\eta_t}\sum_{s,a}\sqrt{q(s,a)}$ with $\eta_t = 1/\sqrt{t}$. However, Jin and Luo [50] argued that it is highly unclear how to achieve Eq. (4.3) with this natural candidate; instead, inspired by [99], they ended up using a different regularizer with a complicated non-diagonal Hessian to achieve Eq. (4.3), which makes the analysis extremely complex since it requires analyzing the inverse of this non-diagonal Hessian.

Our first key contribution is to show that this natural and simple candidate is in fact (almost) enough to achieve Eq. (4.3) after all. To show this, we propose a new loss-shifting technique in the analysis. Similar techniques have been used for multi-armed bandits, but the extension to MDPs is much less clear. Specifically, observe that for any shifting function $g_\tau : S\times A \to \mathbb{R}$ such that the value of $\langle q, g_\tau\rangle$ is independent of $q$ for any $q \in \Omega(\bar{P})$, we have
$$\widehat{q}_t = \operatorname*{argmin}_{q \in \Omega(\bar{P})} \left\langle q, \sum_{\tau<t}\widehat{\ell}_\tau\right\rangle + \phi_t(q) = \operatorname*{argmin}_{q \in \Omega(\bar{P})} \left\langle q, \sum_{\tau<t}(\widehat{\ell}_\tau + g_\tau)\right\rangle + \phi_t(q). \qquad (4.5)$$
Therefore, we can pretend that the learner is performing FTRL over the shifted loss sequence $\{\widehat{\ell}_\tau + g_\tau\}_{\tau<t}$ (even when $g_\tau$ is unknown to the learner). The advantage of analyzing FTRL over this shifted loss sequence is usually that it helps reduce the variance of the loss functions. For multi-armed bandits, prior works [90, 100] pick $g_\tau$ to be a constant such as the negative loss of the learner in episode $\tau$. For MDPs, however, this is not enough to show Eq. (4.3), as already pointed out by Jin and Luo [50] (which is also the reason why they resorted to a different approach). Instead, we propose the following shifting function:
$$g_\tau(s,a) = \widehat{Q}_\tau(s,a) - \widehat{V}_\tau(s) - \widehat{\ell}_\tau(s,a), \qquad \forall (s,a) \in S\times A, \qquad (4.6)$$
where $\widehat{Q}_\tau$ and $\widehat{V}_\tau$ are the state-action and state value functions with respect to the transition $\bar{P}$, the loss function $\widehat{\ell}_\tau$, and the policy $\pi_\tau$, that is, $\widehat{Q}_\tau(s,a) = \widehat{\ell}_\tau(s,a) + \mathbb{E}_{s'\sim \bar{P}(\cdot|s,a)}[\widehat{V}_\tau(s')]$ and $\widehat{V}_\tau(s) = \mathbb{E}_{a\sim\pi_\tau(\cdot|s)}[\widehat{Q}_\tau(s,a)]$ (with $\widehat{V}_\tau(s_L) = 0$). This indeed satisfies the invariant condition, since using the well-known performance difference lemma one can show $\langle q, g_\tau\rangle = -\widehat{V}_\tau(s_0)$ for any $q \in \Omega(\bar{P})$ (Lemma 4.6.1.1). With this shifting function, the learner is equivalently running FTRL over the "advantage" functions ($\widehat{Q}_\tau(s,a) - \widehat{V}_\tau(s)$ is often called the advantage at $(s,a)$ in the literature). More importantly, it turns out that when seeing FTRL in this way, a standard analysis with some direct calculation already shows Eq. (4.3). One caveat is that since $\widehat{Q}_\tau(s,a) - \widehat{V}_\tau(s)$ can potentially have a large magnitude, we also need to stabilize the algorithm by adding a small amount of the so-called log-barrier regularizer to the Tsallis entropy regularizer, an idea that has appeared in several prior works (see [50] and references therein). We defer all details, including the concrete algorithm and analysis, to Section 4.6, and show the final results below.

Theorem 4.3.1.
When $P$ is known, Algorithm 7 (with parameter $\gamma = 1$) ensures the optimal regret $\text{Reg}_T(\mathring{\pi}) = \mathcal{O}(\sqrt{L|S||A|T})$ in the adversarial world, and simultaneously $\text{Reg}_T(\pi^\star) \le \text{Reg}_T(\mathring{\pi}) = \mathcal{O}(U + \sqrt{UC})$, where $U = \frac{L|S|\log T}{\Delta_{\min}} + L^4\sum_{s\ne s_L}\sum_{a\ne\pi^\star(s)}\frac{\log T}{\Delta(s,a)}$, in the stochastic world.

Our bound for the stochastic world is even better than that of [50] (their $U$ has an extra $|A|$ factor in the first term and an extra $L$ factor in the second term). By setting the parameter $\gamma$ differently, one can also improve $L^4$ to $L^3$, matching the best existing result from [84] for the i.i.d. setting with $C = 0$ (this would worsen the adversarial bound though). Besides this improvement, we emphasize again that the most important achievement of this approach is that it significantly simplifies the analysis, making the extension to the unknown transition setting possible.

4.4 Main Algorithms and Results

We are now ready to introduce our main algorithms and results for the unknown transition case, with either full-information or bandit feedback. The complete pseudocode is shown in Algorithm 5, which is built from two main components: a new framework to deal with unknown transitions and adversarial losses (important for incorporating our loss-shifting technique), and special regularizers for FTRL. We explain these two components in detail below.

A new framework for unknown transitions and adversarial losses. When the transition is unknown, a common practice (which we also follow) is to maintain an empirical transition along with a shrinking confidence set of the true transition, usually updated on some doubling epoch schedule. More specifically, a new epoch is started whenever the total number of visits to some state-action pair is doubled (compared to the beginning of the epoch), thus resulting in at most $\mathcal{O}(|S||A|\log T)$ epochs. We denote by $i(t)$ the epoch index to which episode $t$ belongs. At the beginning of each epoch $i$, we calculate the empirical transition $\bar{P}_i$ and fix it throughout the epoch. The confidence set of the true transition for this epoch is then defined as
$$\mathcal{P}_i = \Big\{ \widehat{P} : \big|\widehat{P}(s'|s,a) - \bar{P}_i(s'|s,a)\big| \le B_i(s,a,s'), \ \forall (s,a,s') \in S_k\times A\times S_{k+1}, \ k < L \Big\},$$
where $B_i$ is the Bernstein-style confidence width defined in Eq. (2.7) (taken from [46]):
$$B_i(s,a,s') = \min\left\{ 2\sqrt{\frac{\bar{P}_i(s'|s,a)\ln\frac{T|S||A|}{\delta}}{m_i(s,a)}} + \frac{14\ln\frac{T|S||A|}{\delta}}{3\, m_i(s,a)},\ 1 \right\} \qquad (4.7)$$
for some confidence parameter $\delta \in (0,1)$. As Lemma 3.3.1.1 shows, the true transition $P$ is contained in the confidence set $\mathcal{P}_i$ for all epochs $i$ with probability at least $1 - 4\delta$.

When dealing with adversarial losses, prior works [78, 79, 46, 54] perform FTRL (or a similar algorithm called Online Mirror Descent) over the set of all plausible occupancy measures $\Omega(\mathcal{P}_i) = \{q \in \Omega(\widehat{P}) : \widehat{P} \in \mathcal{P}_i\}$ during epoch $i$, which can be seen as a form of optimism and encourages exploration. This framework, however, does not allow us to apply the loss-shifting trick discussed in Section 4.3 — indeed, our key shifting function Eq. (4.6) is defined in terms of some fixed transition $\bar{P}$, and the required invariant condition on $\langle q, g_\tau\rangle$ only holds for $q \in \Omega(\bar{P})$ but not for $q \in \Omega(\mathcal{P}_i)$.

Inspired by this observation, we propose the following new approach. First, to directly fix the issue mentioned above, for each epoch $i$ we run a new instance of FTRL simply over $\Omega(\bar{P}_i)$. This is implemented by keeping track of the epoch starting time $t_i$ and only using the cumulative loss $\sum_{\tau=t_i}^{t-1}\widehat{\ell}_\tau$ in the FTRL update (Eq. (4.8)); a schematic sketch of this epoch bookkeeping is given below.
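The following sketch illustrates the per-epoch bookkeeping just described: the visit counters, the empirical transition and Bernstein-style width of Eq. (4.7), and the cumulative loss that a fresh FTRL instance consumes within the epoch. It is a simplified outline under our own variable names (with a max(·, 1) guard against empty counts), not the actual pseudocode of Algorithm 5.

```python
import numpy as np

class EpochState:
    """Per-epoch bookkeeping: empirical transition, Bernstein-style
    confidence width (Eq. (4.7)), and the cumulative loss estimator
    accumulated since the epoch started (fed to a fresh FTRL instance)."""

    def __init__(self, counts_sas, counts_sa, T, delta):
        S, A = counts_sa.shape
        self.start_counts = counts_sa.copy()       # counts at the epoch start
        m = np.maximum(counts_sa, 1)[:, :, None]   # guard against division by zero
        self.P_bar = counts_sas / m                # empirical transition \bar P_i
        log_term = np.log(T * S * A / delta)
        self.B = np.minimum(                       # confidence width B_i(s, a, s')
            2.0 * np.sqrt(self.P_bar * log_term / m) + 14.0 * log_term / (3.0 * m),
            1.0)
        self.cum_loss = np.zeros((S, A))           # sum of hat-ell within the epoch

    def doubled(self, counts_sa):
        # start a new epoch once visits to some (s, a) have doubled
        return np.any(counts_sa >= np.maximum(1, 2 * self.start_counts))
```

At every epoch boundary a new `EpochState` is constructed from the updated counters, so the cumulative loss handed to FTRL is reset exactly as in Eq. (4.8).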
Therefore, in each epoch, we are pretending to deal with a known transition problem, making the same loss-shifting technique discussed in Section 4.3 applicable. However, this removes the critical optimism from the algorithm and does not admit enough exploration. To fix this, our second modification is to feed FTRL with optimistic losses constructed by adding some (negative) bonus term, an idea often used in the stochastic setting. More specifically, we subtract $L \cdot B_i(s,a)$ from the loss for each $(s,a)$ pair, where $B_i(s,a) = \min\{1, \sum_{s'\in S_{k(s)+1}} B_i(s,a,s')\}$; see Eq. (4.9). In the full-information setting, this means using $\widehat{\ell}_t(s,a) = \ell_t(s,a) - L\cdot B_i(s,a)$. In the bandit setting, note that the importance-weighted estimator discussed in Section 4.3 is no longer applicable since the transition is unknown (making $q_t$ also unknown), and [46] proposes to use $\frac{\ell_t(s,a)\cdot I_t(s,a)}{u_t(s,a)}$ instead, where $I_t(s,a)$ is again the indicator of whether $(s,a)$ is visited during episode $t$, and $u_t(s,a)$ is the upper occupancy bound defined in Chapter 3, which can be efficiently computed via the Comp-UOB procedure there. Our final adjusted loss estimator is then $\widehat{\ell}_t(s,a) = \frac{\ell_t(s,a)\cdot I_t(s,a)}{u_t(s,a)} - L\cdot B_i(s,a)$. In our analysis, we show that these adjusted loss estimators indeed make sure that we only underestimate the loss of each policy, which encourages exploration.

With this new framework, it is not difficult to show $\sqrt{T}$-regret in the adversarial world using many standard choices of the regularizer $\phi_t$ (which recovers the results of [78, 46] with a different approach). To further ensure polylogarithmic regret in the stochastic world, however, we need the carefully designed regularizers discussed next.

Special regularizers for FTRL. Due to the new structure of our algorithm, which uses a fixed transition $\bar{P}_i$ during epoch $i$, the design of the regularizers is basically the same as in the known transition case. Specifically, in the bandit case, we use the same Tsallis entropy regularizer:
$$\phi_t(q) = -\frac{1}{\eta_t}\sum_{s\ne s_L}\sum_{a\in A}\sqrt{q(s,a)} + \beta\sum_{s\ne s_L}\sum_{a\in A}\ln\frac{1}{q(s,a)}, \qquad (4.10)$$
where $\eta_t = 1/\sqrt{t - t_{i(t)} + 1}$ and $\beta = 128L^4$. As discussed in Section 4.3, the small amount of log-barrier in the second part of Eq. (4.10) is used to stabilize the algorithm, similarly to [50].
In the full-information case, while we could still use Eq. (4.10) since the bandit setting is only more difficult, this leads to extra dependence on some parameters. Instead, we use the following Shannon entropy regularizer:
$$\phi_t(q) = \frac{1}{\eta_t}\sum_{s\ne s_L}\sum_{a\in A} q(s,a)\ln q(s,a). \qquad (4.11)$$
Although this is a standard choice for the full-information setting, the tuning of the learning rate $\eta_t$ requires some careful thought. In the special case of MDPs with one layer (known as the expert problem [34]), it has been shown that choosing $\eta_t$ to be of order $1/\sqrt{t}$ ensures best-of-both-worlds [68, 9]. However, in our general case, due to the use of the loss-shifting trick, we need to use the following data-dependent tuning (with $i$ denoting $i(t)$ for simplicity): $\eta_t = \sqrt{\frac{L\ln(|S||A|)}{64L^5\ln(|S||A|) + M_t}}$, where
$$M_t = \sum_{\tau=t_i}^{t-1}\min\left\{ \sum_{s\ne s_L}\sum_{a\in A}\widehat{q}_\tau(s,a)\widehat{\ell}_\tau(s,a)^2,\ \sum_{s\ne s_L}\sum_{a\in A}\widehat{q}_\tau(s,a)\big(\widehat{Q}_\tau(s,a)-\widehat{V}_\tau(s)\big)^2 \right\},$$
and, similar to the discussion in Section 4.3, $\widehat{Q}_\tau$ and $\widehat{V}_\tau$ are the state-action and state value functions with respect to the transition $\bar{P}_i$, the adjusted loss function $\widehat{\ell}_\tau$, and the policy $\pi_\tau$, that is, $\widehat{Q}_\tau(s,a) = \widehat{\ell}_\tau(s,a) + \mathbb{E}_{s'\sim\bar{P}_i(\cdot|s,a)}[\widehat{V}_\tau(s')]$ and $\widehat{V}_\tau(s) = \mathbb{E}_{a\sim\pi_\tau(\cdot|s)}[\widehat{Q}_\tau(s,a)]$ (with $\widehat{V}_\tau(s_L) = 0$). This particular tuning makes sure that FTRL enjoys an adaptive regret bound with a self-bounding property akin to Eq. (4.3), which is again the key to ensuring polylogarithmic regret in the stochastic world. This concludes the algorithm design; see Algorithm 5 again for the complete pseudocode.

Algorithm 5 Best-of-both-worlds for Episodic MDPs with Unknown Transition
Input: confidence parameter $\delta$.
Initialize: epoch index $i = 1$ and epoch starting time $t_i = 1$.
Initialize: $\forall (s,a,s')$, set counters $m_1(s,a) = m_1(s,a,s') = m_0(s,a) = m_0(s,a,s') = 0$.
Initialize: empirical transition $\bar{P}_1$ and confidence width $B_1$ based on Eq. (2.4) and Eq. (4.7).
for $t = 1, \ldots, T$ do
    Let $\phi_t$ be Eq. (4.11) for full-information feedback or Eq. (4.10) for bandit feedback, and compute
    $$\widehat{q}_t = \operatorname*{argmin}_{q\in\Omega(\bar{P}_i)}\left\langle q, \sum_{\tau=t_i}^{t-1}\widehat{\ell}_\tau\right\rangle + \phi_t(q). \qquad (4.8)$$
    Compute policy $\pi_t$ from $\widehat{q}_t$ such that $\pi_t(a|s) \propto \widehat{q}_t(s,a)$.‡
    Execute policy $\pi_t$ and obtain trajectory $(s_{t,k}, a_{t,k})$ for $k = 0, \ldots, L-1$.
    Construct the adjusted loss estimator $\widehat{\ell}_t$ such that
    $$\widehat{\ell}_t(s,a) = \begin{cases}\ell_t(s,a) - L\cdot B_i(s,a), & \text{for full-information feedback},\\[2pt] \frac{\ell_t(s,a)\cdot I_t(s,a)}{u_t(s,a)} - L\cdot B_i(s,a), & \text{for bandit feedback},\end{cases} \qquad (4.9)$$
    where $B_i(s,a) = \min\{1, \sum_{s'\in S_{k(s)+1}} B_i(s,a,s')\}$, $I_t(s,a) = \mathbb{I}\{\exists k, (s,a) = (s_{t,k}, a_{t,k})\}$, and $u_t$ is the upper occupancy measure defined in Eq. (4.4).
    Increment counters: for each $k < L$, $m_i(s_{t,k}, a_{t,k}, s_{t,k+1}) \,{+}{\leftarrow}\, 1$, $m_i(s_{t,k}, a_{t,k}) \,{+}{\leftarrow}\, 1$.§
    if $\exists k,\ m_i(s_{t,k}, a_{t,k}) \ge \max\{1, 2m_{i-1}(s_{t,k}, a_{t,k})\}$ then  ▷ entering a new epoch
        Increment epoch index $i \,{+}{\leftarrow}\, 1$ and set the new epoch starting time $t_i = t + 1$.
        Initialize new counters: $\forall (s,a,s')$, $m_i(s,a,s') = m_{i-1}(s,a,s')$, $m_i(s,a) = m_{i-1}(s,a)$.
        Update the empirical transition $\bar{P}_i$ and confidence width $B_i$ based on Eq. (2.4) and Eq. (4.7).

‡If $\sum_{b\in A}\widehat{q}_t(s,b) = 0$, we let $\pi_t(\cdot|s)$ be the uniform distribution.
§We use $x \,{+}{\leftarrow}\, y$ as a shorthand for the increment operation $x \leftarrow x + y$.

4.4.1 Main Best-of-both-worlds Results

We now present our main best-of-both-worlds results. As mentioned, proving $\sqrt{T}$-regret in the adversarial world is relatively straightforward. However, proving polylogarithmic regret bounds for the stochastic world is much more challenging due to the transition estimation error, which is usually of order $\sqrt{T}$. Fortunately, we are able to develop a new analysis that upper bounds some transition-estimation-related terms by the regret itself, establishing a self-bounding property again. We defer the proof sketch to Section 4.5, and state the main results in the following theorems.¶

¶For simplicity, for bounds in the stochastic world, we omit some $\widetilde{\mathcal{O}}(1)$ terms that are independent of the gap function, but they can be found in the full proof.

Theorem 4.4.1.1. In the full-information setting, Algorithm 5 with $\delta = \frac{1}{T^2}$ guarantees $\text{Reg}_T(\mathring{\pi}) = \widetilde{\mathcal{O}}\big(L|S|\sqrt{|A|T}\big)$ always, and simultaneously $\text{Reg}_T(\pi^\star) = \mathcal{O}(U + \sqrt{UC})$ under Condition (4.1), where
$$U = \mathcal{O}\left(\frac{\big(L^6|S|^2 + L^5|S||A|\log(|S||A|)\big)\log T}{\Delta_{\min}} + \sum_{s\ne s_L}\sum_{a\ne\pi^\star(s)}\frac{L^6|S|\log T}{\Delta(s,a)}\right).$$

Theorem 4.4.1.2. In the bandit feedback setting, Algorithm 5 with $\delta = \frac{1}{T^3}$ guarantees $\text{Reg}_T(\mathring{\pi}) = \widetilde{\mathcal{O}}\big((L + \sqrt{|A|})|S|\sqrt{|A|T}\big)$ always, and simultaneously $\text{Reg}_T(\pi^\star) = \mathcal{O}(U + \sqrt{UC})$ under Condition (4.1), where
$$U = \mathcal{O}\left(\frac{\big(L^6|S|^2 + L^3|S|^2|A|\big)\log^2 T}{\Delta_{\min}} + \sum_{s\ne s_L}\sum_{a\ne\pi^\star(s)}\frac{\big(L^6|S| + L^4|S||A|\big)\log^2 T}{\Delta(s,a)}\right).$$
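As a concrete illustration of the data-dependent tuning used in the full-information case above, the sketch below computes $M_t$ and $\eta_t$ from quantities maintained within the current epoch. The function and variable names are ours and purely illustrative; they are not part of Algorithm 5.

```python
import numpy as np

def shannon_entropy_lr(q_hats, ell_hats, Q_hats, V_hats, L, num_S, num_A):
    """Data-dependent learning rate for the Shannon entropy regularizer.

    All inputs are lists over episodes tau = t_i, ..., t-1 of the current epoch:
      q_hats[j]  : occupancy measure hat-q_tau,        shape (S, A)
      ell_hats[j]: adjusted loss hat-ell_tau,          shape (S, A)
      Q_hats[j]  : state-action values hat-Q_tau,      shape (S, A)
      V_hats[j]  : state values hat-V_tau,             shape (S,)
    """
    M_t = 0.0
    for q, ell, Q, V in zip(q_hats, ell_hats, Q_hats, V_hats):
        plain = np.sum(q * ell ** 2)                 # variance proxy for raw losses
        shifted = np.sum(q * (Q - V[:, None]) ** 2)  # variance proxy for advantages
        M_t += min(plain, shifted)                   # take the smaller of the two
    log_SA = np.log(num_S * num_A)
    eta_t = np.sqrt(L * log_SA / (64 * L ** 5 * log_SA + M_t))
    return eta_t
```

Taking the minimum of the two variance proxies corresponds to analyzing FTRL either with the raw estimated losses or with the shifted (advantage-based) losses, in line with the loss-shifting discussion of Section 4.3.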
While our bounds have some extra dependence on the parameters $L$, $|S|$, and $|A|$ compared to the best existing bounds in each of the two worlds, we emphasize that our algorithm is the first that is able to adapt to these two worlds simultaneously and achieve $\widetilde{\mathcal{O}}(\sqrt{T})$ and $\mathcal{O}(\mathrm{polylog}(T))$ regret, respectively. In fact, with some extra twists (such as treating differently the state-action pairs that are visited often enough and those that are not), we can improve the dependence on these parameters, but we omit these details since they make the algorithms much more complicated. Also, while [50] is able to obtain $\mathcal{O}(\log T)$ regret for the stronger benchmark $\text{Reg}_T(\mathring{\pi})$ under Condition (4.1) and known transition (same as our Theorem 4.3.1), here we only achieve this for $\text{Reg}_T(\pi^\star)$ due to some technical difficulty (see Section 4.5). However, recall that for the most interesting i.i.d. case, one simply has $\text{Reg}_T(\pi^\star) = \text{Reg}_T(\mathring{\pi})$ as discussed in Section 4.2; even for the corrupted i.i.d. case, since $\text{Reg}_T(\mathring{\pi})$ is at most $C + \text{Reg}_T(\pi^\star)$, our algorithms ensure $\text{Reg}_T(\mathring{\pi}) = \mathcal{O}(U + C)$ (note $\sqrt{UC} \le U + C$). Therefore, our bounds on $\text{Reg}_T(\pi^\star)$ are meaningful and strong.

4.5 Analysis Sketch

In this section, we provide a proof sketch for the full-information setting (which is simpler but enough to illustrate our key ideas). The complete proofs can be found in Section 4.7 (full-information) and Section 4.8 (bandit). We start with the following straightforward regret decomposition:
$$\text{Reg}_T(\pi) = \mathbb{E}\Bigg[\underbrace{\sum_{t=1}^T \big(V_t^{\pi_t}(s_0) - \widehat{V}_t^{\pi_t}(s_0)\big)}_{\textsc{Err}_1} + \underbrace{\sum_{t=1}^T \big(\widehat{V}_t^{\pi_t}(s_0) - \widehat{V}_t^{\pi}(s_0)\big)}_{\textsc{EstReg}} + \underbrace{\sum_{t=1}^T \big(\widehat{V}_t^{\pi}(s_0) - V_t^{\pi}(s_0)\big)}_{\textsc{Err}_2}\Bigg] \qquad (4.12)$$
for an arbitrary benchmark $\pi$, where $V_t^{\pi}$ is the state value function associated with the true transition $P$, the true loss $\ell_t$, and policy $\pi$, while $\widehat{V}_t^{\pi}$ is the state value function associated with the empirical transition $\bar{P}_{i(t)}$, the adjusted loss $\widehat{\ell}_t$, and policy $\pi$. Define the corresponding state-action value functions $Q_t^{\pi}$ and $\widehat{Q}_t^{\pi}$ similarly (our earlier notations $\widehat{V}_t$ and $\widehat{Q}_t$ are thus shorthands for $\widehat{V}_t^{\pi_t}$ and $\widehat{Q}_t^{\pi_t}$).

In the adversarial world, we bound each of the three terms in Eq. (4.12) as follows (see Proposition 4.7.1 for details). First, $\mathbb{E}[\textsc{Err}_1]$ measures the estimation error of the loss of the learner's policy $\pi_t$, which can be bounded by $\widetilde{\mathcal{O}}(L|S|\sqrt{|A|T})$ following the analysis of [46]. Second, as mentioned, our adjusted losses are optimistic in the sense that they underestimate the loss of every policy (with high probability), making $\mathbb{E}[\textsc{Err}_2]$ an $\mathcal{O}(1)$ term only. Finally, $\mathbb{E}[\textsc{EstReg}]$ is the regret measured with $\bar{P}_{i(t)}$ and $\widehat{\ell}_t$, which is controlled by the FTRL procedure and is of order $\widetilde{\mathcal{O}}(L\sqrt{|S||A|T})$. Put together, this proves the $\widetilde{\mathcal{O}}(L|S|\sqrt{|A|T})$ regret shown in Theorem 4.4.1.1.

In the stochastic world, we fix the benchmark $\pi = \pi^\star$. To obtain polylogarithmic regret, an important observation is that we now have to make use of the potentially negative term $\textsc{Err}_2$ instead of simply bounding it by $\mathcal{O}(1)$ (in expectation). Specifically, inspired by [84], we propose a new decomposition of $\textsc{Err}_1$ and $\textsc{Err}_2$ jointly as follows (see Section 4.9.1):
$$\textsc{Err}_1 + \textsc{Err}_2 = \textsc{ErrSub} + \textsc{ErrOpt} + \textsc{OccDiff} + \textsc{Bias}.$$
Here,
• $\textsc{ErrSub} = \sum_{t=1}^T\sum_{s\ne s_L}\sum_{a\ne\pi^\star(s)} q_t(s,a)\widehat{E}_t^{\pi^\star}(s,a)$ measures some estimation error contributed by the suboptimal actions, where $\widehat{E}_t^{\pi^\star}(s,a) = \ell_t(s,a) + \mathbb{E}_{s'\sim P(\cdot|s,a)}\big[\widehat{V}_t^{\pi^\star}(s')\big] - \widehat{Q}_t^{\pi^\star}(s,a)$ is a "surplus" function (a term taken from [84]);
• $\textsc{ErrOpt} = \sum_{t=1}^T\sum_{s\ne s_L}\sum_{a=\pi^\star(s)} \big(q_t(s,a) - q_t^\star(s,a)\big)\widehat{E}_t^{\pi^\star}(s,a)$ measures some estimation error contributed by the optimal action, where $q_t^\star(s,a)$ is the probability of visiting a trajectory of the form $(s_0,\pi^\star(s_0)), (s_1,\pi^\star(s_1)), \ldots, (s_{k(s)-1},\pi^\star(s_{k(s)-1})), (s,a)$ when executing policy $\pi_t$;
• $\textsc{OccDiff} = \sum_{t=1}^T\sum_{s\ne s_L}\sum_{a\in A} \big(q_t(s,a) - \widehat{q}_t(s,a)\big)\big(\widehat{Q}_t^{\pi^\star}(s,a) - \widehat{V}_t^{\pi^\star}(s)\big)$ measures the occupancy measure difference between $q_t$ and $\widehat{q}_t$;
• $\textsc{Bias} = \sum_{t=1}^T\sum_{s\ne s_L}\sum_{a\ne\pi^\star(s)} q_t^\star(s,a)\big(\widehat{V}_t^{\pi^\star}(s) - V_t^{\pi^\star}(s)\big)$ measures some estimation error for $\pi^\star$, which, similar to $\textsc{Err}_2$, is of order $\mathcal{O}(1)$ in expectation due to optimism.

The next key step is to show that the terms ErrSub, ErrOpt, OccDiff, and EstReg can all be upper bounded by quantities that admit a certain self-bounding property similar to the right-hand side of Eq. (4.3). We identify four such quantities and present them using functions $G_1$, $G_2$, $G_3$, and $G_4$, whose definitions are deferred to Section 4.9.2 due to space limitations. Combining these bounds for each term, we obtain the following important lemma.

Lemma 4.5.1. With $\delta = \frac{1}{T^2}$, Algorithm 5 ensures that $\text{Reg}_T(\pi^\star)$ is at most $\mathcal{O}(L^4|S|^3|A|^2\ln^2 T)$ plus
$$\mathbb{E}\Bigg[\mathcal{O}\bigg(\underbrace{G_1\big(L^4|S|\ln T\big)}_{\text{from ErrSub}} + \underbrace{G_2\big(L^4|S|\ln T\big)}_{\text{from ErrOpt}} + \underbrace{G_3\big(L^4\ln T\big)}_{\text{from OccDiff}} + \underbrace{G_4\big(L^5|S||A|\ln T\ln(|S||A|)\big)}_{\text{from EstReg}}\bigg)\Bigg].$$

Finally, as mentioned, each of the functions $G_1$, $G_2$, $G_3$, and $G_4$ can be shown to admit the following self-bounding property, such that, similarly to what we argue in Eq. (4.4), picking the optimal values of $\alpha$ and $\beta$ and rearranging leads to the polylogarithmic regret bound shown in Theorem 4.4.1.1.

Lemma 4.5.2 (Self-bounding property). Under Condition (4.1), we have for any $\alpha, \beta \in (0,1)$,
$$\mathbb{E}[G_1(J)] \le \alpha\cdot\big(\text{Reg}_T(\pi^\star) + C\big) + \mathcal{O}\bigg(\frac{1}{\alpha}\cdot\sum_{s\ne s_L}\sum_{a\ne\pi^\star(s)}\frac{J}{\Delta(s,a)}\bigg),$$
$$\mathbb{E}[G_2(J)] \le \beta\cdot\big(\text{Reg}_T(\pi^\star) + C\big) + \mathcal{O}\bigg(\frac{1}{\beta}\cdot\frac{L|S|J}{\Delta_{\min}}\bigg),$$
$$\mathbb{E}[G_3(J)] \le (\alpha+\beta)\cdot\big(\text{Reg}_T(\pi^\star) + C\big) + \mathcal{O}\bigg(\frac{1}{\alpha}\cdot\sum_{s\ne s_L}\sum_{a\ne\pi^\star(s)}\frac{L^2|S|J}{\Delta(s,a)}\bigg) + \mathcal{O}\bigg(\frac{1}{\beta}\cdot\frac{L^2|S|^2 J}{\Delta_{\min}}\bigg),$$
$$\mathbb{E}[G_4(J)] \le \beta\cdot\big(\text{Reg}_T(\pi^\star) + C\big) + \mathcal{O}\bigg(\frac{1}{\beta}\cdot\frac{J}{\Delta_{\min}}\bigg).$$

We emphasize again that the proposed joint decomposition of $\textsc{Err}_1 + \textsc{Err}_2$ plays a crucial role in this analysis and addresses the key challenge of bounding the transition estimation error by something better than $\sqrt{T}$. We also point out that in this analysis, only EstReg is related to the FTRL procedure, while the other three terms are purely based on our new framework for handling unknown transitions. In fact, the reason that we can only derive a $\mathrm{polylog}(T)$ bound on $\text{Reg}_T(\pi^\star)$, but not directly on $\text{Reg}_T(\mathring{\pi})$, is also due to these three terms — they can be related to the right-hand side of Condition (4.1) only when we use the benchmark $\pi = \pi^\star$, but not when $\pi = \mathring{\pi}$. This is not the case for EstReg, which is the reason why Jin and Luo [50] are able to derive a bound on $\text{Reg}_T(\mathring{\pi})$ directly when the transition is known. Whether this issue can be addressed is left as a future direction.

The rest of this chapter is organized as follows. In Section 4.6, we will introduce the loss-shifting technique and apply it to achieve the best of both worlds result for MDPs with known transition.
The rest of this chapter is organized as follows. In Section 4.6, we introduce the loss-shifting technique and apply it to achieve the best-of-both-worlds result for MDPs with known transition. We then prove the best-of-both-worlds results for the full-information setting (Theorem 4.4.1.1) in Section 4.7, and for the bandit feedback setting (Theorem 4.4.1.2) in Section 4.8. In Section 4.9, we provide the details of several key supplementary lemmas used throughout the analysis. Importantly, we make the following convention for the upcoming analysis.

An important convention. Note that the value of $m_i(s,a)$ changes within the algorithm. For the entire analysis, we treat $m_i(s,a)$ as its initial value, that is, the number of visits to $(s,a)$ from epoch $1$ to epoch $i-1$. In this sense, if we let $N$ be the total number of epochs, then $m_{N+1}(s,a)$ is naturally defined as the total number of visits to $(s,a)$ within the $T$ episodes.

4.6 Best of Both Worlds for MDPs with Known Transition

In this section, we show how to extend the loss-shifting technique to MDPs with known transition and obtain best-of-both-worlds results.

4.6.1 Loss-shifting Technique

First, we introduce a general invariant condition with a fixed transition in Lemma 4.6.1.1.

Lemma 4.6.1.1. Fix the transition function $P$. For any policy $\pi$ and loss function $\mathring{\ell}: S\times A \to \mathbb{R}$, define the invariant function $g: S\times A \to \mathbb{R}$ as
$$
g^{P,\pi,\mathring{\ell}}(s,a) \triangleq Q^{P,\pi,\mathring{\ell}}(s,a) - V^{P,\pi,\mathring{\ell}}(s) - \mathring{\ell}(s,a),
\tag{4.13}
$$
where $Q^{P,\pi,\mathring{\ell}}$ and $V^{P,\pi,\mathring{\ell}}$ are the state-action value and state value functions associated with $\mathring{\ell}$ and the fixed policy $\pi$. Then, it holds for any policy $\pi'$ that
$$
\Big\langle q^{P,\pi'}, g^{P,\pi,\mathring{\ell}}\Big\rangle \triangleq \sum_{s\neq s_L}\sum_{a\in A} q^{P,\pi'}(s,a)\cdot g^{P,\pi,\mathring{\ell}}(s,a) = -V^{P,\pi,\mathring{\ell}}(s_0),
$$
where $V^{P,\pi,\mathring{\ell}}(s_0)$ only depends on $\pi$ and $\mathring{\ell}$ (but not $\pi'$).

Proof. For notational convenience, we drop the superscripts for the fixed transition $P$ and loss function $\mathring{\ell}$. By the standard performance difference lemma (Theorem 5.2.1 of [52]), it holds for any policy $\pi'$ that
$$
V^{\pi'}(s_0) - V^{\pi}(s_0) = \sum_{s\neq s_L}\sum_{a\in A} q^{\pi'}(s,a)\big(Q^{\pi}(s,a) - V^{\pi}(s)\big).
\tag{4.14}
$$
On the other hand, it also holds that
$$
V^{\pi'}(s_0) = \sum_{s\neq s_L}\sum_{a\in A} q^{\pi'}(s,a)\,\mathring{\ell}(s,a).
\tag{4.15}
$$
Therefore, subtracting Eq. (4.15) from Eq. (4.14) yields
$$
-V^{\pi}(s_0) = \sum_{s\neq s_L}\sum_{a\in A} q^{\pi'}(s,a)\big(Q^{\pi}(s,a) - V^{\pi}(s) - \mathring{\ell}(s,a)\big),
$$
which completes the proof after putting back the superscripts for $P$ and $\mathring{\ell}$.

As discussed in Section 4.3, the invariant function $g^{P,\pi,\mathring{\ell}}$ defined in Eq. (4.13) allows us to treat FTRL as dealing with a hypothesized loss sequence, as restated below.

Corollary 4.6.1.2. Consider the occupancy measure $\widehat{q}_t$ selected via FTRL with respect to a regularizer $\phi_t(\cdot)$ and loss sequence $\{\widehat{\ell}_\tau\}_{\tau<t}$ (on the decision set $\Omega(\bar{P})$). Then it holds that
$$
\widehat{q}_t = \operatorname*{argmin}_{q\in\Omega(\bar{P})}\Big\langle q, \sum_{\tau<t}\widehat{\ell}_\tau\Big\rangle + \phi_t(q) = \operatorname*{argmin}_{q\in\Omega(\bar{P})}\Big\langle q, \sum_{\tau<t}\big(\widehat{\ell}_\tau + g_\tau\big)\Big\rangle + \phi_t(q)
$$
for any invariant function sequence $\{g_\tau\}_{\tau<t}$ constructed with hypothesized losses $\{\mathring{\ell}_\tau\}_{\tau<t}$ and policies $\{\pi'_\tau\}_{\tau<t}$.

Proof. By Lemma 4.6.1.1, one can verify that
$$
\Big\langle q, \sum_{\tau<t} g_\tau\Big\rangle = -\sum_{\tau<t} V^{\bar{P},\pi'_\tau,\mathring{\ell}_\tau}(s_0)
$$
for any occupancy measure $q\in\Omega(\bar{P})$. Therefore, this term does not affect the optimization.

Then, we consider the "loss-shifting function" defined in Eq. (4.6), that is, constructing $g_t$ via the loss estimator $\widehat{\ell}_t$ and the policy $\pi_t$ selected at episode $t$. Importantly, in the known transition setting where $\widehat{q}_t = q_t$, $\widehat{\ell}_t$ is the inverse propensity weighted estimator, in other words, $\widehat{\ell}_t(s,a) = I_t(s,a)\ell_t(s,a)/q_t(s,a)$. More specifically, we have
$$
g_t(s,a) = \widehat{Q}_t(s,a) - \widehat{V}_t(s) - \widehat{\ell}_t(s,a),
$$
where
$$
\widehat{Q}_t(s,a) = \widehat{\ell}_t(s,a) + \sum_{s'\in S_{k(s)+1}} \bar{P}(s'|s,a)\,\widehat{V}_t(s'), \qquad \widehat{V}_t(s) = \sum_{a\in A}\pi_t(a|s)\,\widehat{Q}_t(s,a)
$$
(with $\widehat{V}_t(s_L) = 0$).
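The invariance in Lemma 4.6.1.1 (and hence Corollary 4.6.1.2) is easy to check numerically. The following is a minimal sketch, not part of Algorithm 5 or Algorithm 6, that builds a small random layered MDP, computes $g^{P,\pi,\mathring{\ell}}$ by backward induction, and verifies that $\langle q^{P,\pi'}, g^{P,\pi,\mathring{\ell}}\rangle = -V^{P,\pi,\mathring{\ell}}(s_0)$ for several randomly drawn policies $\pi'$. The layer sizes, the random seed, and the helper names (`value_functions`, `occupancy`) are arbitrary illustrative choices, not notation from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny loop-free layered MDP: layer sizes |S_0|=1, |S_1|=3, |S_2|=2, plus a terminal layer (s_L).
layers = [1, 3, 2, 1]
A = 2

# P[k][s, a, s'] : probability of moving from state s in layer k to state s' in layer k+1.
P = [rng.dirichlet(np.ones(layers[k + 1]), size=(layers[k], A)) for k in range(len(layers) - 1)]
loss = [rng.random((layers[k], A)) for k in range(len(layers) - 1)]   # an arbitrary loss, playing the role of \mathring{\ell}

def value_functions(P, loss, pi):
    """Backward induction for Q^{P,pi,loss} and V^{P,pi,loss}; V at the terminal layer is 0."""
    V_next = np.zeros(layers[-1])
    Q, V = [], []
    for k in reversed(range(len(layers) - 1)):
        Qk = loss[k] + P[k] @ V_next          # Q(s,a) = loss(s,a) + sum_{s'} P(s'|s,a) V(s')
        Vk = (pi[k] * Qk).sum(axis=1)         # V(s)   = sum_a pi(a|s) Q(s,a)
        Q.insert(0, Qk); V.insert(0, Vk); V_next = Vk
    return Q, V

def occupancy(P, pi):
    """Occupancy measure q(s,a) layer by layer, starting from the single initial state s_0."""
    q_state, q = np.ones(1), []
    for k in range(len(layers) - 1):
        q_sa = q_state[:, None] * pi[k]
        q.append(q_sa)
        q_state = np.einsum('sa,sat->t', q_sa, P[k])
    return q

def random_policy():
    return [rng.dirichlet(np.ones(A), size=layers[k]) for k in range(len(layers) - 1)]

pi = random_policy()
Q, V = value_functions(P, loss, pi)
# Invariant function g(s,a) = Q(s,a) - V(s) - loss(s,a), cf. Eq. (4.13).
g = [Q[k] - V[k][:, None] - loss[k] for k in range(len(layers) - 1)]

for _ in range(5):                            # <q^{P,pi'}, g> should equal -V(s_0) for ANY pi'
    q_prime = occupancy(P, random_policy())
    inner = sum((q_prime[k] * g[k]).sum() for k in range(len(layers) - 1))
    assert np.isclose(inner, -V[0][0]), (inner, -V[0][0])
print("invariance verified: <q^{P,pi'}, g> = -V^{P,pi,loss}(s_0) for all pi'")
```

Since this inner product is the same for every occupancy measure in the decision set, adding any such $g_\tau$ to the cumulative loss shifts the FTRL objective by a constant, which is why the argmin in Corollary 4.6.1.2 is unchanged.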
Below we show several useful properties, which are key to achieve the best-ofboth-worlds guarantee in the known transition setting. Lemma 4.6.1.3. With P¯ = P being the true transition function (therefore, qbt = qt), we have • qt(s, a)Qbt(s, a) ≤ L, • qt(s)Vbt(s) ≤ L, • Et Qbt(s, a) − Vbt(s) 2 ≤ 2L2 (1−πt(a|s)) qt(s,a) , for all state-action pairs (s, a) (where Et denotes the conditional expectation given everything before episode t). Proof. Denote by qt(s ′ , a′ |s, a) the probability of visiting (s ′ , a′ ) after taking action a at state s and following πt afterwards. Then we have Qbt(s, a) = PL−1 k=k(s) P s ′∈Sk P a ′∈A qt(s ′ , a′ |s, a)ℓbt(s ′ , a′ ). Therefore, plugging in the definition of ℓbt(s, a), we verify the following: qt(s, a)Qbt(s, a) = L X−1 k=k(s) X s ′∈Sk X a ′∈A qt(s, a)qt(s ′ , a′ |s, a) qt(s ′ , a′) It s ′ , a′ ℓt(s ′ , a′ ) ≤ L X−1 k=k(s) X s ′∈Sk X a ′∈A It s ′ , a′ ≤ L, where the inequality is by qt(s, a)qt(s ′ , a′ |s, a) ≤ qt(s ′ , a′ ) and ℓt(s ′ , a′ ) ∈ [0, 1]. This also proves qt(s)Vbt(s) ≤ L using the definition of Vbt(s). To prove the last statement, we first note that Et Qbt(s, a) − Vbt(s) 2 ≤ 2Et (1 − πt(a|s))2 Qbt(s, a) 2 + X b̸=a πt(b|s)Qbt(s, b) 2 (4.16) by the fact (x − y) 2 ≤ 2x 2 + 2y 2 for all x, y ∈ R. For the first term in Eq. (4.16), we have: Et h Qbt(s, a) 2 i = Et L X−1 k=k(s) X s ′∈Sk X a ′∈A qt(s ′ , a′ |s, a) qt(s ′ , a′) It s ′ , a′ ℓt(s ′ , a′ ) 2 ≤ L · Et L X−1 k=k(s) X s ′∈Sk X a ′∈A qt(s ′ , a′ |s, a) qt(s ′ , a′) It s ′ , a′ ℓt(s ′ , a′ ) 2 ≤ L · Et L X−1 k=k(s) X s ′∈Sk X a ′∈A qt(s ′ , a′ |s, a) 2 qt(s ′ , a′) 2 It s ′ , a′ = L · L X−1 k=k(s) X s ′∈Sk X a ′∈A qt(s ′ , a′ |s, a) 2 qt(s ′ , a′) = L qt(s, a) · L X−1 k=k(s) X s ′∈Sk X a ′∈A qt(s, a)qt(s ′ , a′ |s, a) qt(s ′ , a′) · qt(s ′ , a′ |s, a) ≤ L qt(s, a) · L X−1 k=k(s) X s ′∈Sk X a ′∈A qt(s ′ , a′ |s, a) ≤ L 2 qt(s, a) , where the second line uses the Cauchy-Schwartz inequality; the third line follows from the fact It(s, a)It(s ′ , a′ ) = 0 for all (s, a),(s ′ , a′ ) ∈ Sk × A such that (s, a) ̸= (s ′ , a′ ); the fourth line uses Et [It(s ′ , a′ )] = qt(s ′ , a′ ); and the last line follows from the fact qt(s, a)qt(s ′ , a′ |s, a) ≤ qt(s ′ , a′ ). Repeating the similar arguments, we bound the second term as Et X b̸=a πt(b|s)Qbt(s, b) 2 = Et L X−1 k=k(s) X s ′∈Sk X a ′∈A X b̸=a πt(b|s)qt(s ′ , a′ |s, b) ℓbt(s ′ , a′ ) 2 ≤ L · Et L X−1 k=k(s) X s ′∈Sk X a ′∈A X b̸=a πt(b|s)qt(s ′ , a′ |s, b) ℓbt(s ′ , a′ ) 2 (Cauchy-Schwarz inequality) ≤ L · Et L X−1 k=k(s) X s ′∈Sk X a ′∈A X b̸=a πt(b|s)qt(s ′ , a′ |s, b) 2 It(s ′ , a′ ) qt(s ′ , a′) 2 (It(s, a)It(s ′ , a′ ) = 0 for (s, a) ̸= (s ′ , a′ )) = L · L X−1 k=k(s) X s ′∈Sk X a ′∈A P b̸=a πt(b|s)qt(s ′ , a′ |s, b) qt(s ′ , a′) ! · X b̸=a πt(b|s) · qt(s ′ , a′ |s, b) = L qt(s) · L X−1 k=k(s) X s ′∈Sk X a ′∈A P b̸=a qt(s, b)qt(s ′ , a′ |s, b) qt(s ′ , a′) ! · X b̸=a πt(b|s) · qt(s ′ , a′ |s, b) ≤ L qt(s) · L X−1 k=k(s) X s ′∈Sk X a ′∈A X b̸=a πt(b|s) · qt(s ′ , a′ |s, b) = L qt(s) · X b̸=a πt(b|s) · L X−1 k=k(s) X s ′∈Sk X a ′∈A qt(s ′ , a′ |s, b) ≤ L 2 qt(s) · X b̸=a πt(b|s) = L 2 (1 − πt(a|s)) qt(s) . Plugging these bounds into Eq. (4.16) concludes the proof: Et Qbt(s, a) − Vbt(s) 2 ≤ 2L 2 (1 − πt(a|s))2 qt(s, a) + 1 − πt(a|s) qt(s) ! = 2L 2 (1 − πt(a|s)) 1 − πt(a|s) qt(s, a) + 1 qt(s) = 2L 2 (1 − πt(a|s)) qt(s, a) . 
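As a sanity check on Lemma 4.6.1.3, which drives the stability bounds in the next two subsections, the following self-contained sketch samples episodes under a fixed policy $\pi_t$, forms the importance-weighted estimator $\widehat{\ell}_t(s,a) = I_t(s,a)\ell_t(s,a)/q_t(s,a)$, and checks the three properties: the first two hold for every realization, and the conditional second moment is compared against $2L^2(1-\pi_t(a|s))/q_t(s,a)$ by Monte Carlo. The tiny MDP, the particular pair being checked, the sample size, and all helper names are illustrative choices only.

```python
import numpy as np

rng = np.random.default_rng(1)
layers, A = [1, 2, 2, 1], 2                      # tiny layered MDP; the last layer is the terminal state s_L
L = len(layers) - 1                              # horizon
P = [rng.dirichlet(np.ones(layers[k + 1]), size=(layers[k], A)) for k in range(L)]
loss = [rng.random((layers[k], A)) for k in range(L)]                 # losses in [0,1]
pi = [rng.dirichlet(np.ones(A), size=layers[k]) for k in range(L)]    # the fixed policy pi_t

# Exact occupancy measure q_t(s,a), layer by layer.
q_state, q = np.ones(1), []
for k in range(L):
    q.append(q_state[:, None] * pi[k])
    q_state = np.einsum('sa,sat->t', q[k], P[k])

def sample_trajectory():
    s, visited = 0, []
    for k in range(L):
        a = rng.choice(A, p=pi[k][s])
        visited.append((k, s, a))
        s = rng.choice(layers[k + 1], p=P[k][s, a])
    return visited

def q_values(loss_hat):
    """Backward induction for Qhat_t, Vhat_t under the KNOWN transition P."""
    V_next, Q, V = np.zeros(1), [], []
    for k in reversed(range(L)):
        Qk = loss_hat[k] + P[k] @ V_next
        Vk = (pi[k] * Qk).sum(axis=1)
        Q.insert(0, Qk); V.insert(0, Vk); V_next = Vk
    return Q, V

# Monte Carlo over episodes; the third property is checked for one fixed pair (layer k0, state s0, action a0).
k0, s0, a0, n, second_moment = 1, 0, 0, 20000, 0.0
for _ in range(n):
    visited = sample_trajectory()
    loss_hat = [np.zeros((layers[k], A)) for k in range(L)]
    for (k, s, a) in visited:                    # IPW estimator: I_t(s,a) * loss(s,a) / q_t(s,a)
        loss_hat[k][s, a] = loss[k][s, a] / q[k][s, a]
    Q, V = q_values(loss_hat)
    assert np.all(q[k0] * Q[k0] <= L + 1e-9)                  # property 1: q_t(s,a) Qhat_t(s,a) <= L
    assert np.all(q[k0].sum(axis=1) * V[k0] <= L + 1e-9)      # property 2: q_t(s)   Vhat_t(s)   <= L
    second_moment += (Q[k0][s0, a0] - V[k0][s0]) ** 2 / n
bound = 2 * L**2 * (1 - pi[k0][s0, a0]) / q[k0][s0, a0]
print(f"E[(Qhat-Vhat)^2] ~ {second_moment:.3f} <= bound {bound:.3f}")  # property 3 (holds in conditional expectation)
```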
4.6.2 Known Transition and Full-information Feedback: FTRL with Shannon Entropy Although not mentioned in the main text, in this section, we discuss a simple application of the loss-shifting technique: achieving the best-of-both-worlds in the full-information feedback setting with known transition via the FTRL framework with the Shannon entropy regularizer. Some of the lemmas in this section are useful for proving similar results for the unknown transition case in Section 4.7. 70 Algorithm 6 Best-of-both-worlds for MDPs with Known Transition and Full-information Feedback for t = 1 to T do Compute qt = argminq∈Ω(P) q,P τ<t ℓτ + ϕt(q) where ϕt(q) is defined in Eq. (4.18). Execute policy πt where πt(a|s) = qt(s, a)/qt(s). Observe the entire loss function ℓt . Therefore, the specific state-action and state value functions defined in Lemma 4.6.1.3 are now constructed based on the received loss vector ℓt , instead of the loss estimator ℓbt . In other words, the loss-shifting function gt is defined as gt(s, a) = Qb(s, a) − Vb(s) − ℓt(s, a) where Qbt(s, a) = ℓt(s, a) + X s ′∈Sk(s)+1 P(s ′ |s, a)Vbt(s), Vbt(s) = X a∈A πt(a|s)Qbt(s, a). (4.17) Our goal is to show that, using an adaptive time-varying learning rate schedule, FTRL with Shannon entropy is able to attain a self-bounding regret guarantee with full-information feedback. This idea will be further discussed in Section 4.7 to address the unknown transition setting. In particular, the algorithm uses following regularizer for episode t: ϕt(q) = 1 ηt X s̸=sL X a∈A q(s, a) ln q(s, a) = 1 ηt ϕ(q), (4.18) where the adaptive learning rate ηt is defined as ηt = q L ln(|S||A|) Mt−1+64L3 ln(|S||A|) with Mt = X t τ=1 min X s̸=sL X a∈A qτ (s, a) Qbτ (s, a) − Vbτ (s) 2 , X s̸=sL X a∈A qτ (s, a)ℓτ (s, a) 2 . The pseudocode of our algorithm is presented in Algorithm 6. 71 In the known transition setting, we assume the loss functions satisfy a more general condition compared to Condition (4.1): there exists a deterministic policy π ⋆ : S → A, a gap function ∆ : S × A → R+ and a constant C L > 0 such that RegT (˚π) ≥ E X T t=1 X s̸=sL X a̸=π⋆(s) qt(s, a)∆(s, a) − C . (4.19) Note that this is only weaker than Condition (4.1) since RegT (˚π) ≥ RegT (π ⋆ ). Then, we show that Algorithm 6 ensures a worst-case guarantee RegT (˚π) = Oe(L √ T), and simultaneously an adaptive regret bound which further leads to logarithmic regret under Condition (4.19) (Corollary 4.6.2.2). Importantly, the worst-case regret bound matches the lower bound of learning MDPs with known transition and full-information feedback [97]. Theorem 4.6.2.1. Algorithm 6 ensures that RegT (˚π) is bounded by O vuuutmin L2T, L3E X T t=1 X s̸=sL X a̸=π(s) qt(s, a) ln(|S||A|) + L 2 ln(|S||A|) (4.20) for any mapping π : S → A. Proof. Due to the invariant property (that ⟨q, gt⟩ is independent of q ∈ Ω(P)), we can apply Lemma 4.6.2.3 with ℓbt being either ℓt or ℓt + gt for any t — note that the condition ηtℓbt(s, a) ≥ −1 is always satisfied since ℓt(s, a) ∈ [0, 1] and Qbt(s, a) − Vbt(s) ∈ [−L, L]. Therefore, we have for any u ∈ Ω(P), X T t=1 ⟨qt − u, ℓt⟩ ≤ Lln(|S||A|) ηT +1 + X T t=1 ηt min X s̸=sL X a∈A qt(s, a) Qbt(s, a) − Vbt(s) 2 , X s̸=sL X a∈A qt(s, a)ℓt(s, a) 2 72 = Lln(|S||A|) ηT +1 + X T t=1 ηt (Mt − Mt−1), (definition of Mt) = Lln(|S||A|) ηT +1 + X T t=1 ηt p Mt + p Mt−1 p Mt − p Mt−1 , ≤ Lln(|S||A|) ηT +1 + 2X T t=1 ηt p Mt−1 + L p Mt − p Mt−1 . 
(Mt ≤ Mt−1 + L) Further plugging in the definition of ηt and taking expectation, we arrive at RegT (˚π) ≤ E " Lln(|S||A|) ηT +1 + 2p Lln(|S||A|) X T t=1 p Mt − p Mt−1 # = E hp Lln(|S||A|) (MT + 64L3 ln |S||A|) + 2p LMT ln(|S||A|) i = O p LE [MT ] ln(|S||A|) + L 2 ln(|S||A|) . It remains to bound MT . First, we note that MT = X T t=1 min X s̸=sL X a∈A qt(s, a) Qbt(s, a) − Vbt(s) 2 , X s̸=sL X a∈A qt(s, a)ℓt(s, a) 2 ≤ min X T t=1 X s̸=sL X a∈A qt(s, a) Qbt(s, a) − Vbt(s) 2 , X T t=1 X s̸=sL X a∈A qt(s, a)ℓt(s, a) 2 ≤ min X T t=1 X s̸=sL X a∈A qt(s, a) Qbt(s, a) − Vbt(s) 2 , LT . where the second line follows from the fact min {a, b} + min {c, d} ≤ min {a + c, b + d}, and the third line uses the property 0 ≤ ℓt(s, a) ≤ 1 for all state-action pairs (s, a). On the other hand, we have Qbt(s, a) − Vbt(s) 2 ≤ 2 (1 − πt(a|s))2 Qbt(s, a) 2 + X b̸=a πt(b|s)Qbt(s, b) 2 ≤ 2L 2 · h (1 − πt(a|s))2 + (1 − πt(a|s))2 i 73 ≤ 4L 2 (1 − πt(a|s)), (4.21) where we use the facts (a − b) 2 ≤ 2(a 2 + b 2 ) and 0 ≤ Qbt(s, a) ≤ L for all state-action pairs (s, a). Therefore, we have for any mapping π : S → A, X T t=1 X s̸=sL X a∈A qt(s, a) Qbt(s, a) − Vbt(s) 2 ≤ 4L 2 · X T t=1 X s̸=sL X a∈A qt(s, a) (1 − πt(a|s)) ≤ 4L 2 · X T t=1 X s̸=sL qt(s) · (1 − πt(π(s)|s)) + X a̸=π(s) qt(s, a) = 8L 2 · X T t=1 X s̸=sL X a̸=π(s) qt(s, a), (4.22) which finishes the proof. Corollary 4.6.2.2. Suppose Condition (4.19) holds. Algorithm 6 guarantees that: RegT (˚π) = O U + √ CU , where U = L 3 ln(|S||A|) ∆min . Proof. By Theorem 4.6.2.1, RegT (˚π) is bounded by κ · vuuutL3 ln(|S||A|) · E X T t=1 X s̸=sL X a̸=π⋆(s) qt(s, a) + L 2 ln(|S||A|) where κ ≥ 1 is a universal constant, and π ⋆ is the mapping specified in Condition (4.19). For any z > 1, RegT (˚π) is bounded by κ vuuutL3 ln(|S||A|) · E X T t=1 X s̸=sL X a̸=π⋆(s) qt(s, a) + κL2 ln(|S||A|) 74 = vuuut zκ2L3 ln(|S||A|) 2∆min · 2 z · E X T t=1 X s̸=sL X a̸=π⋆(s) qt(s, a)∆min + κL2 ln(|S||A|) ≤ RegT (˚π) + C L z + zκ2L 3 ln(|S||A|) 4∆min + κL2 ln(|S||A|) ≤ RegT (˚π) + C L z + z · 2κ 2U, where the third line uses the AM-GM inequality and Eq. (4.19), and the last line uses the shorthand U and the facts κ, z > 1 and ∆min ≤ 1. Therefore, by defining x = z − 1 > 0, we can rearrange and arrive at RegT (˚π) ≤ C z − 1 + z 2 z − 1 · 2κ 2U = C x + (x + 1)2 x · 2κ 2U = 1 x · C L + 2κ 2U + x · 2κ 2U + 4κ 2U, where we replace all z’s in the second line. Picking the optimal x = q C+2κ2U 2κ2U gives RegT (˚π) ≤ 2 q (CL + 2κ 2U) · (2κ 2U) + 4κ 2U ≤ 8κ 2U + 2√ 2κ · √ CLU = O U + √ UC , where the second line follows from the fact √ x + y ≤ √ x + √y. Lemma 4.6.2.3. Suppose qt = argminq∈Ω(P) q,P τ<t ℓbτ + ϕt(q), where ϕt(q) = 1 ηt ϕ(q) for some ηt > 0, ϕ(q) = P s̸=sL P a∈A q(s, a) ln q(s, a), and ηtℓbt(s, a) ≥ −1 holds for all t and (s, a). Then X T t=1 D qt − u, ℓbt E ≤ Lln(|S||A|) ηT +1 + X T t=1 ηt · X s̸=sL X a∈A qt(s, a)ℓbt(s, a) 2 , holds for any u ∈ Ω(P). Proof. Let Φt = minq∈Ω(P) D q,Pt−1 τ=1 ℓbτ E + ϕt(q) and DF (u, v) being the Bregman divergence with convex function F, that is, DF (u, v) = F(u) − F(v) − ⟨u − v, ∇F(v)⟩. Then, we have Φt = * qt , X t−1 τ=1 ℓbτ + + ϕt(qt) = * qt+1, X t−1 τ=1 ℓbτ + + ϕt(qt+1) − *qt+1 − qt , X t−1 τ=1 ℓbτ + + ϕt(qt+1) − ϕt(qt) ! ≤ * qt+1, X t−1 τ=1 ℓbτ + + ϕt(qt+1) − (− ⟨qt+1 − qt , ∇ϕt(qt)⟩ + ϕt(qt+1) − ϕt(qt)) = * qt+1, X t−1 τ=1 ℓbτ + + ϕt(qt+1) − Dϕt (qt+1, qt) = Φt+1 − D qt+1, ℓbt E − (ϕt+1(qt+1) − ϕt(qt+1)) − Dϕt (qt+1, qt), where the third line follows from the first order optimality condition of qt , that is, D qt+1 − qt , ∇ϕt(qt) + Pt−1 τ=1 ℓbτ E ≥ 0. 
Taking the summation over all episodes gives Φ1 = ΦT +1 − X T t=1 D qt+1, ℓbt E − X T t=1 (ϕt+1(qt+1) − ϕt(qt+1)) − X T t=1 Dϕt (qt+1, qt). Therefore, we have X T t=1 D qt − u, ℓbt E = X T t=1 D qt − u, ℓbt E + ΦT +1 − Φ1 − X T t=1 D qt+1, ℓbt E − X T t=1 (ϕt+1(qt+1) − ϕt(qt+1)) − X T t=1 Dϕt (qt+1, qt) = X T t=1 Dqt − qt+1, ℓbt E − Dϕt (qt+1, qt) − X T t=1 D u, ℓbt E + ΦT +1 − Φ1 − X T t=1 (ϕt+1(qt+1) − ϕt(qt+1)) 76 ≤ X T t=1 Dqt − qt+1, ℓbt E − Dϕt (qt+1, qt) | {z } Stability + ϕT +1(u) − ϕ1(q1) − X T t=1 (ϕt+1(qt+1) − ϕt(qt+1)) | {z } Penalty where the last line follows from the optimality condition ΦT +1 ≤ PT t=1 D u, ℓbt E + ϕT +1(u). To bound the stability term, we first consider relaxing the constraint and taking the maximum as: D qt − qt+1, ℓbt E − Dϕt (qt+1, qt) ≤ max q∈R S×A + D qt − q, ℓbt E − Dϕt (q, qt). Denote by qet the maximizer of the right hand side. Setting the gradient to zero yields the equality ∇ϕt(qt) − ∇ϕt(qet) = ℓbt . By direction calculation, one can verify that qet(s, a) = qt(s, a) · exp −ηt · ℓbt(s, a) for all state-action pairs, and the following inequality that D qt − qt+1, ℓbt E − Dϕt (qt+1, qt) ≤ D qt − qet , ℓbt E − Dϕt (qet , qt) = D qt − qet , ℓbt E − ϕt(qet) + ϕt(qt) − ⟨qet − qt , ∇ϕt(qt)⟩ = Dϕt (qt , qet) where the second equality uses the equality ∇ϕt(qt) − ∇ϕt(qet) = ℓbt . Moreover, the term Dϕt (qt , qet) can be bounded as: Dϕt (qt , qet) = 1 ηt X s̸=sL X a∈A qt(s, a) ln qt(s, a) qet(s, a) − qt(s, a) + qet(s, a) = 1 ηt X s̸=sL X a∈A qt(s, a) · ηtℓbt(s, a) − 1 + exp −ηt · ℓbt(s, a) ≤ ηt X s̸=sL X a∈A qt(s, a)ℓbt(s, a) 2 77 where the last inequality follows from the facts y − 1 + e −y ≤ y 2 for y > −1 and ηt · ℓbt(s, a) ≥ −1 for all sate-action pairs. On the other hand, the penalty term is at most ϕT +1(u) − ϕ1(q1) − X T t=1 (ϕt+1(qt+1) − ϕt(qt+1)) ≤ − ϕ(q1) η1 − X T t=1 1 ηt+1 − 1 ηt ϕ(qt), since ϕ(u) ≤ 0. Moreover, note that for any valid occupancy measure q, it holds that ϕ(q) = L X−1 k=0 X s∈Sk X a∈A q(s, a) ≥ − L X−1 k=0 ln(|Sk||A|) ≥ −Lln(|S||A|). Therefore, the penalty term is bounded by − ϕ(q1) η1 − X T t=1 1 ηt+1 − 1 ηt ϕ(qt) ≤ Lln(|S||A|) · 1 η1 + X T t=1 1 ηt+1 − 1 ηt ! = Lln(|S||A|) ηT +1 . Finally, combining the bounds for the stability and penalty terms finishes the proof. 4.6.3 Known Transition and Bandit Feedback: FTRL with Tsallis Entropy In this section, we consider the bandit feedback setting with known transition. We use the following hybrid regularizer with learning rate ηt = γ/ √ t for episode t: ϕt(q) = ϕH(q) ηt + β X s̸=sL X a∈A log 1 q(s, a) | {z } =ϕL(q) , (4.23) 78 Algorithm 7 Best-of-both-worlds for MDPs with Known Transition and Bandit Feedback for t = 1 to T do compute qt = argminq∈Ω q,P τ<t ℓbτ + ϕt(q) where ϕt(q) is defined in Eq. (4.23). execute policy πt where πt(a|s) = qt(s, a)/qt(s). observe (s0, a0, ℓt(s0, a0)), . . . ,(sL−1, aL−1, ℓt(sL−1, aL−1)). construct estimator ℓbt such that: ∀(s, a), ℓbt(s, a) = ℓt(s,a) qt(s,a) I sk(s) = s, ak(s) = a . where ϕL is a fixed log-barrier regularizer, and ϕH(q) is the 1/2-Tsallis entorpy: ϕH(q) = − X s̸=sL X a∈A p q(s, a). We present the pseudocode of our algorithm in Algorithm 7, and show the ensured guarantees in Theorem 4.6.3.1, which is a more detailed version of Theorem 4.3.1. In particular, the adaptive regret bound Eq. (4.24) is a strict improvement of Theorem 1 of [50] and leads to the best-of-bothworlds guarantee automatically. We emphasize that the key to achieve such a guarantees is the loss-shifting function defined in Eq. (4.6). Theorem 4.6.3.1. 
With β = 64L and γ = 1, Algorithm 7 ensures that RegT (˚π) is bounded by X T t=1 Oe min E B X s̸=sL X a̸=π(s) r qt(s, a) t + D vuut X s̸=sL X a̸=π(s) qt(s, a) + ˚q(s, a) t , r L|S||A| t (4.24) for any mapping π : S → A, where B = L 2 and D = p L|S|. Therefore, the regret of Algorithm 7 is always bounded as RegT (˚π) = Oe p L|S||A|T . Moreover, under Condition (4.19), RegT (˚π) is bounded by O U + √ UC where U = L|S| log T ∆min + P s̸=sL P a̸=π⋆(a) L4 log T ∆(s,a) + L|S||A| log T. 79 Proof. By Lemma 5 of [50], with a sufficiently large log-barrier component (in particular, β = 64L suffices), the regret can be decomposed and bounded as: E "X T t=1 D qt − ˚q, ℓbt E # ≤ X T t=1 1 ηt − 1 ηt−1 E [ϕH(˚q) − ϕH(qt)] | {z } Penalty + 8X T t=1 ηtE ℓbt 2 ∇−2ϕ(qt) | {z } Stability + O (L|S||A| log T). where ˚q is the occupancy measure of an deterministic optimal policy ˚π : S → A. Moreover, with the help of Corollary 4.6.1.2, we can in fact bound RegT (˚π) as RegT (˚π) ≤ X T t=1 1 ηt − 1 ηt−1 E [ϕH(˚q) − ϕH(qt)] | {z } Penalty +O (L|S||A| log T) + 8X T t=1 ηtE " min Et ℓbt 2 ∇−2ϕ(qt) , Et ℓbt + gt 2 ∇−2ϕ(qt) # | {z } Stability . (4.25) where gt is the specific loss-shifting function defined in Eq. (4.6). This is again because adding the loss-shifting function gt does not influence the outcomes of FTRL and thus in the analysis, one can decide whether to add gt or not for episode t in hindsight to establish a tighter adaptive regret bound. Before analyzing the stability term, we point out that ϕH(˚q) − ϕH(qt) can be bounded as (ϕH(˚q) − ϕH(qt)) ≤ X s̸=sL X a̸=π(s) p qt(s, a) + 2s |S|L X s̸=sL X a̸=π(s) qt(s, a) + ˚q(s, a) (4.26) for any mapping π : S → A according to Lemma 6 of [50] (take α in their lemma to be 0). On the other hand, we also have ϕH(˚q)−ϕH(qt) ≤ −ϕH(qt) ≤ p L|S||A| by the Cauchy-Schwarz inequality. 80 Combining these two cases and the fact 1 ηt − 1 ηt−1 = 1 γ · √ t − √ t − 1 ≤ 1 γ · √ 1 t , the penalty term is bounded by 1 γ X T t=1 E min r L|S||A| t , X s̸=sL X a̸=π(s) r qt(s, a) t + 2 vuut|S|L X s̸=sL X a̸=π(s) qt(s, a) + ˚q(s, a) t We now bound the stability term. By direct calculation, we have Et ℓbt + gt 2 ∇−2ϕ(qt) = X s̸=sL X a∈A qt(s, a) 3/2Et ℓbt(s, a) + gt(s, a) 2 ≤ 2L 2 X s̸=sL X a∈A p qt(s, a) · (1 − πt(a|s)), (4.27) where the second line applies the properties of the loss-shifting function in Lemma 4.6.1.3. For any mapping π : S → A, we can further bound Eq. (4.27) as 2L 2 X s̸=sL X a∈A p qt(s, a) · (1 − πt(a|s)) ≤ 2L 2 X s̸=sL X a̸=π(s) p qt(s, a) + 2L 2 X s̸=sL p qt(s) · X a̸=π(s) πt(a|s) ≤ 4L 2 X s̸=sL X a̸=π(s) p qt(s, a), where the third line follows from the fact x ≤ √ x for x ∈ [0, 1]. Therefore, for any mapping π : S → A, the stability term is bounded by X T t=1 8ηtE ℓbt + gt 2 ∇−2ϕ(qt) ≤ 32L 2 · X T t=1 ηtE X s̸=sL X a̸=π(s) p qt(s, a) . (4.28) 8 On the other hand, without the loss-shifting function, the stability term is simultaneously bounded as X T t=1 8ηtE ℓbt 2 ∇−2ϕ(qt) = X T t=1 8ηtE X s̸=sL X a∈A qt(s, a) 3/2 · ℓbt(s, a) 2 ≤ X T t=1 8ηtE X s̸=sL X a∈A p qt(s, a) ≤ X T t=1 8ηt p L|S||A|. (Cauchy-Schwarz inequality) Plugging Eq. (4.26) and Eq. (4.28) into the Eq. (4.25) shows that Algorithm 7 ensures the following self-bounding regret bound for RegT (˚π): 1 γ X T t=1 E min r L|S||A| t , X s̸=sL X a̸=π(s) r qt(s, a) t + 2 vuut|S|L X s̸=sL X a̸=π(s) qt(s, a) + ˚q(s, a) t 32γ · X T t=1 E min r L|S||A| t , L2 X s̸=sL X a̸=π(s) r qt(s, a) t + O (L|S||A| log T) (4.29) for any mapping π : S → A. 
Picking γ = 1 and using min {a, b} + min {c, d} ≤ min {a + c, b + d} proves Eq. (4.24). The (optimal) worst-case bound RegT (˚π) = Oe( p L|S||A|T) can be obtained by using the second argument of the min operator in Eq. (4.24), while the logarithmic regret bound under Condition (4.19) is obtained by using the first argument of the min operator and the exact same reasoning as in Appendix A.1 of [50]. We point out that with a different choice γ = 1/L, Algorithm 7 achieves a regret bound of RegT (˚π) = O V + √ V C under Condition (4.19), where V = L 3 |S| log T ∆min + X s̸=sL X a̸=π⋆(a) L 2 log T ∆(s, a) + L|S||A| log T 82 which matches the best existing regret bound in [84]. This choice of γ worsens the worst-case bound though. 83 4.7 Best of Both Worlds for MDPs with Unknown Transition and Full Information In this part, we will prove the best of both worlds results for the full-information setting. We present the bound for the adversarial world in Proposition 4.7.1, and that for the stochastic world in Proposition 4.7.2 (part of which is a restatement of Lemma 4.5.1). Together, they prove Theorem 4.4.1.1. Proposition 4.7.1. Consider the decomposition RegT (˚π) = E [Err1 + EstReg + Err2] stated in Eq. (4.12). Then, with δ = 1 T2 Algorithm 5 ensures: • E [Err1] = Oe L|S| p |A|T + L 3 |S| 3 |A| , • E [Err2] = Oe (1), • E [EstReg] = Oe L p |S||A|T + L 2 |S| 2 |A| 3 2 + L 3 |S||A| . Proposition 4.7.2. With δ = 1 T2 , Algorithm 5 ensures that RegT (π ⋆ ) is bounded as O E " G1 L 4 |S| ln T | {z } ErrSub + G2 L 4 |S| ln T | {z } ErrOpt + G3 L 4 ln T | {z } OccDiff + G4 L 5 |S||A| ln T ln(|S||A|) | {z } # + O L 4 |S| 3 |A| 2 ln2 T , where G1-G4 are defined in Definition 4.9.2.1. Under Condition (4.1), this bound implies RegT (π ⋆ ) = O U + √ UC + V where U = L 6 |S| 2 + L 5 |S||A| log(|S||A|) log T ∆min + X s̸=sL X a̸=π⋆(s) L 6 |S| log T ∆(s, a) , V = L 4 |S| 3 |A| 2 ln2 Before diving into the proof details, we first give formal definitions of several notations mentioned in Section 4.4 and Section 4.5 for the full-information setting. Through out this paper, we denote by A the event that P ∈ Pi for all i, which happens with probability at least 1−4δ based on Lemma 3.3.1.1. We denote by N the total number of epochs, and set tN+1 = T + 1 for convenience (recall that ti is the first episode for epoch i). Then, recall the Qbπ t and Vb π t defined in Section 4.5, that is, the state-action and state value functions associated with the empirical transition P¯ i(t) and the adjusted loss ℓbt , formally defined as: Qbπ t (s, a) = ℓbt(s, a) + X s ′∈Sk(s)+1 P¯ i(t) (s ′ |s, a)Vbt(s ′ ), Vb π t (s) = X a∈A π(a|s)Qbπ t (s, a), (4.30) and Qbπ t (sL, a) = 0 for all a. Also recall that the notation Qbt and Vbt used in the loss-shifting function are shorthands for Qbπt t and Vb πt t . Similarly, the true state-action and state value functions of episode t are defined as: Q π t (s, a) = ℓt(s, a) + X s ′∈Sk(s)+1 P(s ′ |s, a)Vt(s ′ ), V π t (s) = X a∈A π(a|s)Q π t (s, a), (4.31) with Qπ t (sL, a) = 0 for all a. For notational convenience, we let ι = T|S||A| δ and assume that δ ∈ (0, 1), and denote by Tk the set of transition tuples at layer k, that is, Tk = {(s, a, s′ ) ∈ Sk × A × Sk+1}. 4.7.1 Optimism of Adjusted losses and Other Lemmas First, we show that the adjusted loss ℓbt defined in Eq. (4.9) ensures the optimism of the estimated state-action and state value functions as stated in Lemma 4.7.1.1. 
As discussed in Section 4.5, this certain kind of optimism ensures that E [Err2] is bounded by a constant with a sufficiently small confidence parameter δ. 85 Lemma 4.7.1.1. Using the notations in Eq. (4.30) and Eq. (4.31) and conditioning on the event A, we have Qbπ t (s, a) ≤ Q π t (s, a), ∀(s, a) ∈ S × A, t ∈ [T]. Proof. We prove this result via a backward induction from layer L to layer 0. Base case: for sL, Qbπ t (s, a) = Qπ t (s, a) = 0 holds always. Induction step: Suppose Qbπ t (s, a) ≤ Qπ t (s, a) holds for all the states s with k(s) > h. Then, for any state s in layer h, we have Qbπ t (s, a) = ℓt(s, a) + X s ′∈Sk(s)+1 P¯ i(t) (s ′ |s, a)Vb π t (s ′ ) − L · Bt(s, a) ≤ ℓt(s, a) + X s ′∈Sk(s)+1 P¯ i(t) (s ′ |s, a)V π t (s ′ ) − L · Bt(s, a) (Induction hypothesis) ≤ ℓt(s, a) + X s ′∈Sk(s)+1 P(s ′ |s, a)V π t (s ′ ) + X s ′∈Sk(s)+1 P¯ i(t) (s ′ |s, a) − P(s ′ |s, a) V π t (s ′ ) − L · Bi(t) (s, a) = Q π t (s, a) + X s ′∈Sk(s)+1 P¯ i(t) (s ′ |s, a) − P(s ′ |s, a) V π t (s ′ ) − L · Bi(t) (s, a) where the first line follows from the definition of ℓbt . Clearly, when Bi(t) (s, a) = 1, we have X s ′∈Sk(s)+1 P¯ i(t) (s ′ |s, a) − P(s ′ |s, a) V π t (s ′ ) − L · Bi(t) (s, a) ≤ X s ′∈Sk(s)+1 P¯ i(t) (s ′ |s, a) · L − L = 0 where the inequality follows from the fact 0 ≤ V π t (s ′ ) ≤ L. On the other hand, when P s ′∈Sk(s)+1 Bi(t) (s, a, s′ ) = Bi(t) (s, a), we have X s ′∈Sk(s)+1 P¯ i(t) (s ′ |s, a) − P(s ′ |s, a) V π t (s ′ ) − L · Bi(t) (s, a) ≤ X s ′∈Sk(s)+1 Bi(t) (s, a, s′ ) · L − L · Bi(t) (s, a) = 0 where the second line uses the definition of event A. Combining these two cases shows that Qbπ t (s, a) ≤ Qπ t (s, a) holds for all state-action pairs (s, a) at layer h, finishing the induction. Next, we analyze the estimated regret suffered within one epoch. With slightly abuse of notation, we denote by EstRegi (π) the difference between the total loss suffered within epoch i and that of the fixed policy π with respect to the empirical transition P¯ i and the adjusted losses within epoch i, that is, EstRegi (π) = E tiX +1−1 t=ti D q P¯ i,πt − q P¯ i,π , ℓbt E = E tiX +1−1 t=ti D qbt − q P¯ i,π , ℓbt E . (4.32) In addition, we let EstRegi = maxπ EstRegi (π) be the maximum regret suffered within epoch i. Lemma 4.7.1.2. For full-information feedback, Algorithm 5 ensures that EstRegi is bounded by O L 3 ln(|S||A|) plus: O E vuuutLln (|S||A|) · min L4 tiX +1−1 t=ti X s∈S X a̸=π(s) qbt(s, a), tiX +1−1 t=ti X s̸=sL X a∈A qbt(s, a)ℓbt(s, a) 2 . (4.33) Proof. The proof follows the same steps as in that of Theorem 4.6.2.1. Due to the invariant property, the loss of episode t fed to FTRL can be seen as either ℓbt(s, a) or Qbt(s, a) − Vbt(s, a). By the definition of ηt , we have both ηtℓbt(s, a) ≥ −1 and ηt(Qbt(s, a) − Vbt(s, a)) ≥ −1. Therefore, we can apply Lemma 4.6.2.3 and bound EstRegi by E Lln(|S||A|) ηti+1 + tiX +1−1 t=ti ηt min X s̸=sL X a∈A qbt(s, a) Qbt(s, a) − Vbt(s) 2 , X s̸=sL X a∈A qbt(s, a)ℓt(s, a) 2 . The tuning of ηt makes sure that the above is further bounded by O L 3 ln(|S||A|) plus p Lln (|S||A|) multiplied with O E vuuutmin tiX +1−1 t=ti X s̸=sL X a∈A qbt(s, a) Qbt(s, a) − Vbt(s) 2 , tiX +1−1 t=ti X s̸=sL X a∈A qbt(s, a)ℓt(s, a) 2 ; see the beginning of the proof of Theorem 4.6.2.1 for the same reasoning. Finally, it remains to bound P s̸=sL P a∈A qbt(s, a) Qbt(s, a) − Vbt(s) 2 by 8L 4 P s̸=sL P a̸=π(s) qbt(s, a). This is again by the same reasoning as Eq. (4.21) and Eq. (4.22), except that Qbt(s, a) now has a range in [−L 2 , L2 ] which explains the extra L 2 factor. 
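Before moving on to the two propositions, here is a minimal numerical sketch of the optimism mechanism behind Lemma 4.7.1.1: with an empirical transition $\bar{P}$ and any bonus $B(s,a)$ that dominates the $\ell_1$-error $\|\bar{P}(\cdot|s,a) - P(\cdot|s,a)\|_1$ (which is essentially what event $\mathcal{A}$ provides for the bonus used by Algorithm 5, up to the clipping at $1$), the adjusted loss $\ell_t - L\cdot B$ makes $\widehat{Q}^{\pi} \le Q^{\pi}$ everywhere. For a self-contained check, the sketch simply plugs in the exact $\ell_1$-error as the bonus; the actual algorithm of course uses an empirical, data-dependent bonus, and the MDP sizes, sample count, and helper names below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
layers, A = [1, 3, 2, 1], 2                      # tiny layered MDP; the last layer is the terminal state s_L
L = len(layers) - 1                              # horizon
P = [rng.dirichlet(np.ones(layers[k + 1]), size=(layers[k], A)) for k in range(L)]
loss = [rng.random((layers[k], A)) for k in range(L)]                 # losses in [0,1]
pi = [rng.dirichlet(np.ones(A), size=layers[k]) for k in range(L)]    # an arbitrary policy pi

# Empirical transition from m visits per (s,a), plus a bonus B(s,a).  For illustration we take the exact
# l1 error as the bonus; the actual algorithm uses an empirical bonus that dominates it with high probability.
m = 200
P_bar, B = [], []
for k in range(L):
    counts = np.array([[rng.multinomial(m, P[k][s, a]) for a in range(A)] for s in range(layers[k])])
    P_bar.append(counts / m)
    B.append(np.abs(P_bar[k] - P[k]).sum(axis=2))

def backward(trans, losses):
    """Backward induction for the state-action/state value functions under (trans, losses) and policy pi."""
    V_next, Q, V = np.zeros(1), [], []
    for k in reversed(range(L)):
        Qk = losses[k] + trans[k] @ V_next
        Vk = (pi[k] * Qk).sum(axis=1)
        Q.insert(0, Qk); V.insert(0, Vk); V_next = Vk
    return Q, V

Q_true, V_true = backward(P, loss)                          # Q^pi, V^pi under the true (P, ell)
adj_loss = [loss[k] - L * B[k] for k in range(L)]           # adjusted loss: ell(s,a) - L * B(s,a)
Q_hat, V_hat = backward(P_bar, adj_loss)                    # Qhat^pi, Vhat^pi under (Pbar, adjusted loss)

for k in range(L):                                          # optimism: Qhat^pi <= Q^pi for every (s,a)
    assert np.all(Q_hat[k] <= Q_true[k] + 1e-9)
print("optimism holds: Qhat^pi(s,a) <= Q^pi(s,a) for all (s,a)")
```

This is the backward induction argument of Lemma 4.7.1.1 in executable form: at every layer the subtracted term $L\cdot B(s,a)$ cancels the worst-case contribution of the transition estimation error, since the true value function is bounded by $L$.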
4.7.2 Proof for the Adversarial World (Proposition 4.7.1) We analyze the regret based on the decomposition in Eq. (4.12) and consider bounding the terms E [Err1], E [Err2] and E [EstReg] separately. Err1 Following the similar idea of [46], we decompose this term as: Err1 = X T t=1 V πt t (s0) − Vb πt t (s0) = X T t=1 ⟨qt , ℓt⟩ − D qbt , ℓbt E = X T t=1 ⟨qt , ℓt⟩ − ⟨qbt , ℓt⟩ + L · X T t=1 qbt , Bi(t) = X T t=1 ⟨qt , ℓt⟩ − ⟨qbt , ℓt⟩ + L · X T t=1 qt , Bi(t) + L · X T t=1 qbt − qt , Bi(t) 8 ≤ X T t=1 X s̸=sL X a∈A |qt(s, a) − qbt(s, a)| + L · X T t=1 qt , Bi(t) + L · X T t=1 qbt − qt , Bi(t) where the last line follows from the fact 0 ≤ ℓt(s, a) ≤ 1. According to this decomposition, we next consider bounding the expectation of these three terms separately. First, we focus on the second term: E " L · X T t=1 qt , Bi(t) # ≤ L · E L X−1 k=0 X s∈Sk X a∈A X T t=1 qt(s, a) 2 s |Sk(s)+1| ln ι max {mi(s, a), 1} + 14|Sk(s)+1| ln ι 3 max {mi(s, a), 1} = O L · L X−1 k=0 p |Sk||Sk+1||A|T ln ι + |Sk(s)+1||Sk||A|(2 + ln T) ln ι ! ≤ O L p |A|T ln ι · L X−1 k=0 (|Sk| + |Sk+1|) + L|S| 2 |A| ln2 ι ! = O L|S| p |A|T ln ι + L|S| 2 |A| ln2 ι = Oe L|S| p |A|T + L|S| 2 |A| (4.34) where the second line follows Lemma 4.9.3.2, the third line follows from Lemma 4.9.3.8 and the fourth line applies AM-GM inequality. Then, for the first term, with the help from residual term rt defined in Definition 4.9.3.9, we have E X T t=1 X s̸=sL X a∈A |qt(s, a) − qbt(s, a)| ≤ E 4 X T t=1 X s̸=sL X a∈A k( Xs)−1 k=0 X (u,v,w)∈Tk qt(u, v) s P(w|u, v) ln ι max mi(t) (u, v), 1 qt(s, a|w) +X T t=1 X s̸=sL X a∈A rt(s, a) ≤ E 4L · X T t=1 X u̸=sL X v∈A X w∈Sk(u)+1 qt(u, v) s P(w|u, v) ln ι max mi(t) (u, v), 1 + X T t=1 X s̸=sL X a∈A rt(s, a) 89 ≤ E 4L · X T t=1 X u̸=sL X v∈A qt(u, v) s Sk(u)+1 ln ι max mi(t) (u, v), 1 + X T t=1 X s̸=sL X a∈A rt(s, a) = O L|S| p |A|T ln ι + L 2 |S| 3 |A| 2 ln2 ι + δ|S||A|T = Oe L|S| p |A|T + L 2 |S| 3 |A| 2 (4.35) where the second line uses the bound of |qt(s, a) − qbt(s, a)| in Lemma 4.9.3.10; the third line follows from the fact P s̸=sL P a∈A qt(s, a|w) ≤ L; the forth line uses the Cauchy-Schwarz inequality; the fifth line follows the same argument in Eq. (4.34) and applies the expectation bound of residual terms in Lemma 4.9.3.10; and the last line plugs in the value of δ = 1/T2 . For the last term, using the bound of |qbt(s, a) − qt(s, a)| in Lemma 4.9.3.10, we arrive at E " L · X T t=1 qbt − qt , Bi(t) # ≤ E L · X T t=1 X s̸=sL X a∈A Bi(t) (s, a) · 4 k( Xs)−1 k=0 X (u,v,w)∈Tk qt(u, v) s P(w|u, v) ln ι max mi(t) (u, v), 1 qt(s, a|w) + rt(s, a) ≤ E 4L · X T t=1 X s̸=sL X a∈A X s ′∈Sk(s)+1 Bi(t) (s, a, s′ ) k( Xs)−1 k=0 X (u,v,w)∈Tk qt(u, v) s P(w|u, v) ln ι max mi(t) (u, v), 1 qt(s, a|w) + E L · X T t=1 X s̸=sL X a∈A rt(s, a) , where the last line follows from the fact Bi(t) (s, a) ≤ 1. According to the definition of the residual term in Definition 4.9.3.9, we have rt(s, a) ≥ X s ′∈Sk(s)+1 Bi(t) (s, a, s′ ) · k( Xs)−1 k=0 X (u,v,w)∈Tk qt(u, v) s P(w|u, v) ln ι max mi(t) (u, v), 1 qt(s, a|w) 90 (in particular, the second summand in the definition of rt(s, a) is an upper bound of the right-hand side above). Therefore, we have E h L · PT t=1 qbt − qt , Bi(t) i further bounded by E (4L + L) · X T t=1 X s̸=sL X a∈A rt(s, a) ≤ O L 3 |S| 3 |A| 2 ln2 ι + δ · L|S||A|T = Oe L 3 |S| 3 |A| 2 (4.36) where the last inequality uses the expectation bound of residual terms in Lemma 4.9.3.10. Combining all bounds yields E [Err1] = Oe L|S| p |A|T + L 3 |S| 3 |A| . 
Err2 According to Lemma 4.7.1.1, Lemma 4.9.3.5, and the fact Vb π t (s) ≤ L 2 , we have E [Err2] = E "X T t=1 Vb π t (s0) − V π t (s0) # ≤ L 2T Pr[A c ] ≤ 4L 2T δ = Oe(1). EstReg By Lemma 4.7.1.2, we have E [EstReg] bounded as E X N i=1 tiX +1−1 t=ti D qbt − q P¯ i,˚π , ℓbt E ≤ E "X N i=1 EstRegi # ≤ E Oe X N i=1 vuutL tiX +1−1 t=ti X s∈S X a∈A qbt(s, a)ℓbt(s, a) 2 + L 3 ≤ Oe vuutE " L|S||A| X T t=1 X s∈S X a∈A qbt(s, a)ℓbt(s, a) 2 # + L 3 |S||A| where the last line follows from the fact N ≤ 4|S||A|(log T + 1) according to Lemma 4.9.3.12 and uses Cauchy-Schwarz inequality. Next, we continue to bound the following key term: E "X T t=1 X s∈S X a∈A qbt(s, a)ℓbt(s, a) 2 # = E "X T t=1 X s∈S X a∈A qbt(s, a) ℓt(s, a) − L · Bi(t) (s, a) 2 # ≤ 2 · E "X T t=1 X s∈S X a∈A qbt(s, a) ℓt(s, a) 2 + L 2 · Bi(t) (s, a) 2 # ≤ 2LT + 2L 2 · E "X T t=1 qbt , Bi(t) # = 2LT + 2L · L · E "X T t=1 qbt − qt , Bi(t) # + L · E "X T t=1 qt , Bi(t) #! , where the third line uses (x + y) 2 ≤ 2 x 2 + y 2 and the fourth line uses Bi(t) (s, a) ≤ 1. Moreover, in the previous analysis of the term Err1, we bound the terms in the bracket with E " L · X T t=1 qt , Bi(t) # ≤ Oe L|S| p |A|T + L|S| 2 |A| , (from Eq. (4.34)) E " L · X T t=1 qbt − qt , Bi(t) # ≤ Oe L 3 |S| 3 |A| 2 . (from Eq. (4.36)) Therefore, we have E "X T t=1 X s∈S X a∈A qbt(s, a)ℓbt(s, a) 2 # = Oe LT + L|S| p |A|T + L 3 |S| 3 |A| 2 = Oe LT + L 3 |S| 3 |A| 2 , which further proves E[EstReg] = Oe L p |S||A|T + L 2 |S| 2 |A| 3 2 + L 3 |S||A| 4.7.3 Proof for the Stochastic World (Proposition 4.7.2) As discussed in Section 4.5, we decompose Err1 + Err2 as (see Corollary 4.9.1.2): Err1 + Err2 = X T t=1 X s̸=sL X a̸=π⋆(s) qt(s, a)Ebπ ⋆ t (s, a) (ErrSub) + X T t=1 X s̸=sL X a=π⋆(s) (qt(s, a) − q ⋆ t (s, a)) Ebπ ⋆ t (s, a) (ErrOpt) + X T t=1 X s̸=sL X a∈A (qt(s, a) − qbt(s, a)) Qbπ ⋆ t (s, a) − Vb π ⋆ t (s) (OccDiff) + X T t=1 X s̸=sL X a̸=π⋆(s) q ⋆ t (s, a) Vb π ⋆ t (s) − V π ⋆ t (s) (Bias) where Ebπ t (s, a) is defined as: Ebπ t (s, a) = ℓt(s, a) + X s ′∈Sk(s)+1 P(s ′ |s, a)Vb π t (s ′ ) − Qbπ t (s, a). Then, we proceed to bound each of the five terms: ErrSub, ErrOpt, OccDiff, Bias, and EstReg. ErrSub Conditioning on A, we know that Ebπ ⋆ t (s, a) = LBi(t) (s, a) + X s ′∈Sk(s)+1 P(s ′ |s, a) − P¯ i(t) (s ′ |s, a) Vb π ⋆ t (s ′ ) ≤ LBi(t) (s, a) + L 2 · X s ′∈Sk(s)+1 Bi(t) (s, a, s′ ) ≤ 4L 2 · X s ′∈Sk(s)+1 s P¯ i(t) (s ′ |s, a) ln ι max mi(t) (s, a), 1 + 7 ln ι 3 max mi(t) (s, a), 1 ! ≤ 4L 2 s |S| ln ι max mi(t) (s, a), 1 + 7|S| ln ι 3 max mi(t) (s, a), 1 ! , 9 where the second line follows from the event A and the fact Vb π t (s) ≤ L 2 , and the last line applies the Cauchy-Schwarz inequality. Therefore, under event A, ErrSub can be bounded as: ErrSub ≤ X T t=1 X s̸=sL X a̸=π⋆(s) qt(s, a) · 4L 2 s |S| ln ι max mi(t) (s, a), 1 + 7|S| ln ι 3 max mi(t) (s, a), 1 ! ≤ 4G1 L 4 |S| ln ι + 28|S|L 2 ln ι 3 X T t=1 X s̸=sL X a∈A qt(s, a) 3 max mi(t) (s, a), 1 , where the second line follows from the definition of G1(·) in Definition 4.9.2.1. With the help of Lemma 4.9.3.5 and the fact |ErrSub| ≤ L 3T, we have E [ErrSub] ≤ O L 3T δ + E G1 L 4 |S| ln ι + E 28|S|L 2 ln ι 3 X T t=1 X s̸=sL X a∈A qt(s, a) 3 max mi(t) (s, a), 1 = O E G1 L 4 |S| ln ι + L 2 |S| 2 |A| ln2 ι , (4.37) where the last line uses Lemma 4.9.3.8. ErrOpt By the similar arguments above, we have ErrOpt bounded by the following given event A: ErrOpt ≤ X T t=1 X s̸=sL X a=π⋆(s) (qt(s, a) − q ⋆ t (s, a)) · 4L 2 s |S| ln ι max mi(t) (s, a), 1 + 7|S| ln ι 3 max mi(t) (s, a), 1 ! . 
Using the definition of G2(·) in Definition 4.9.2.1 and Lemma 4.9.3.5, we have E [ErrOpt] ≤ O L 3T δ + E G2 L 4 |S| ln ι + E 28|S|L 2 ln ι 3 X T t=1 X s̸=sL X a∈A qt(s, a) 3 max mi(t) (s, a), 1 = O E G2 L 4 |S| ln ι + L 2 |S| 2 |A| ln2 ι . ( OccDiff First, we have OccDiff = X T t=1 X s̸=sL X a∈A (qt(s, a) − qbt(s, a)) Qbπ ⋆ t (s, a) − Vb π ⋆ t (s) = X T t=1 X s̸=sL X a̸=π⋆(s) (qt(s, a) − qbt(s, a)) Qbπ ⋆ t (s, a) − Vb π ⋆ t (s) ≤ 2L 2X T t=1 X s̸=sL X a̸=π⋆(s) |qt(s, a) − qbt(s, a)| , where the second line follows from the fact Vb π ⋆ t (s) = Qbπ ⋆ t (s, a) for all state-action pairs (s, a) satisfying a = π ⋆ (s), and the last line uses the fact Qbπ ⋆ t (s, a) − Vb π ⋆ t (s) ≤ 2L 2 for all state-action pairs. With the help of the residual terms in Definition 4.9.3.9 and Lemma 4.9.3.10, we further bound OccDiff as 2L 2X T t=1 X s̸=sL X a̸=π⋆(s) |qt(s, a) − qbt(s, a)| ≤ 2L 2X T t=1 X s̸=sL X a̸=π⋆(s) rt(s, a) + 8L 2X T t=1 X s̸=sL X a̸=π⋆(s) k( Xs)−1 k=0 X (u,v,w)∈Tk qt(u, v) s P(w|u, v) ln ι max mi(t) (u, v), 1 qt(s, a|w) = O L 4 |S| 3 |A| 2 ln2 ι + L 2 |S||A|T · δ + G3(L 4 ln ι) (4.39) where the last line is by the definition of G3(·) in Definition 4.9.2.1. Therefore, we conclude E [OccDiff] ≤ O L 4 |S| 3 |A| 2 ln2 ι + E G3(L 4 ln ι) . (4.40) Bias Conditioning on the event A, Bias is nonpositive due to Lemma 4.7.1.1. Then, by Lemma 4.9.3.5, we bound the expectation of Bias by E [Bias] ≤ 0 + E [I{Ac }] · L 3T = O (1). (4.41) EstReg By the analysis of estimated regret in Lemma 4.7.1.2, we have E [EstReg] bounded by (with C L EstReg = L 5 |S||A| ln T ln (|S||A|)) O E X N i=1 vuuut L5 ln (|S||A|) · tiX +1−1 t=ti X s∈S X a̸=π⋆(s) qbt(s, a) + L 3 ln(|S||A|) ≤ O E vuutCEstReg · X T t=1 X s∈S X a̸=π⋆(s) qbt(s, a) + L 3 |S||A| ln T ln(|S||A|) ≤ O E vuutCEstReg · X T t=1 X s∈S X a̸=π⋆(s) qt(s, a) + L 3 |S||A| ln T ln(|S||A|) + O E vuutCEstReg · X T t=1 X s∈S X a̸=π⋆(s) |qbt(s, a) − qt(s, a)| ≤ O E G4(L 5 |S||A| ln T ln (|S||A|)) + L 5 |S||A| ln T ln (|S||A|) ! + O E X T t=1 X s∈S X a̸=π⋆(s) |qbt(s, a) − qt(s, a)| where the second line uses the Cauchy-Schwarz inequality and the fact N ≤ 4|S||A|(log T + 1) according to Lemma 4.9.3.12; the third line uses the fact that √ x ≤ √y + p |x − y| for x, y > 0; the last line uses the definition of G4(·) in Definition 4.9.2.1 and the AM-GM inequality. Note that in the analysis of OccDiff (see Eq. (4.39)), we have already shown that X T t=1 X s̸=sL X a̸=π⋆(s) |qt(s, a) − qbt(s, a)| = O L 2 |S| 3 |A| 2 ln2 ι + E [G3(ln ι)] . (4.42) 9 Combining everything, we have E [EstReg] bounded by: O E G4(L 5 |S||A| ln T ln (|S||A|)) + G3(ln ι) + L 3 |S| 3 |A| 2 ln2 ι . (4.43) Finally, combining everything we have shown that Algorithm 5 ensures the following regret bound for RegT (π ⋆ ): O E G1 L 4 |S| ln ι (from Eq. (4.37) for ErrSub) + O E G2 L 4 |S| ln ι (from Eq. (4.38) for ErrOpt) + O E G3 L 4 ln ι (from Eq. (4.40) for OccDiff) + O E G4 L 5 |S||A| ln (|S||A|) ln T (from Eq. (4.43) for EstReg) + O L 4 |S| 3 |A| 2 ln2 ι . Now suppose that Condition (4.1) holds. For some universal constant κ > 0, RegT (π ⋆ ) is bounded as RegT (π ⋆ ) ≤ κ · E G1 L 4 |S| ln ι + E G2 L 4 |S| ln ι + E G3 L 4 ln ι + κ · E G4 L 5 |S||A| ln (|S||A|) ln T + κ · L 4 |S| 3 |A| 2 ln2 ι . 
For any z > 0, by Lemma 4.9.2.2, Lemma 4.9.2.3, Lemma 4.9.2.4 and Lemma 4.9.2.5 with α = β = 1 12zκ we have RegT (π ⋆ ) ≤ RegT (π ⋆ ) + C z + 12z · X s̸=sL X a̸=π⋆(s) 8κ 2 ∆(s, a) · L 4 |S| ln ι + L 6 + 12z · κ 2 ∆min · 8L 5 |S| ln ι + 8L 6 |S| 2 ln ι + L 5 |S||A| ln (|S||A|) ln T 4 + κ · L 4 |S| 3 |A| 2 ln2 ι ≤ RegT (π ⋆ ) + C z + 288zκ2 · U + 2κ · V, where the last line uses the shorthands U and V defined in Proposition 4.7.2. Rearranging the terms arrive at: RegT (π ⋆ ) ≤ C z − 1 + z 2 z − 1 · 288κ 2U + z z − 1 · 2κ · V = C x + (x + 1)2 x · 288κ 2U + x + 1 x · 2κ · V = 1 x · C L + 288κ 2U + 2κ · V + x · 288κ 2U + 2κ · V + 576κ 2U where we replace all z’s by x = z − 1 > 0 in the second line. Finally, by selecting the optimal x to balance the first two terms, we have RegT (π ⋆ ) ≤ 2 q (CL + 288κ 2U + 2κ · V ) · 288κ 2U + 2κV + 576κ 2U = O U + √ UC + V , finishing the entire proof for Proposition 4.7.2. 9 4.8 Best of Both Worlds for MDPs with Unknown Transition and Bandit Feedback In this section, we prove the best of both worlds results for the bandit feedback setting with unknown transition. We present the bound for the adversarial world in Proposition 4.8.1, and that for the stochastic world in Proposition 4.8.2. Together, they prove Theorem 4.4.1.2. Proposition 4.8.1. With δ = 1 T3 , Algorithm 5 ensures RegT (˚π) = Oe L + √ A |S| p |A|T . Proposition 4.8.2. Suppose Condition (4.1) holds. With δ = 1 T3 , Algorithm 5 ensures that RegT (π ⋆ ) is bounded by O U + √ CU + V where V = L 6 |S| 3 |A| 3 ln2 T and U is defined as U = X s̸=sL X a̸=π⋆(s) L 6 |S| ln T + L 4 |S||A| ln2 T ∆(s, a) + L 6 |S| 2 ln T + L 3 |S| 2 |A| ln2 T ∆min . The analysis is similar to that for the full-information setting, except that we need to handle some bias terms caused by the new loss estimators. To this end, we denote by ℓet the conditional expectation of ℓbt , that is ℓet(s, a) = Et h ℓbt(s, a) i = qt(s, a) ut(s, a) · ℓt(s, a) − L · Bi(t) (s, a). (4.44) Then we define the following: 99 Definition 4.8.3. For any policy π, the estimated state-action and state value functions associated with P¯ i(t) and loss function ℓet are defined as: Qeπ t (s, a) = ℓet(s, a) + X s ′∈Sk(s)+1 P¯ i(t) (s ′ |s, a)Ve π t (s ′ ), ∀(s, a) ∈ (S − {sL}) × A, Ve π t (s) = X a∈A π(a|s)Qeπ t (s, a), ∀s ∈ S, Qeπ t (sL, a) = 0, ∀a ∈ A. (4.45) On the other hand, the true state-action and value functions are again defined as: Q π t (s, a) = ℓt(s, a) + X s ′∈Sk(s)+1 P(s ′ |s, a)V π t (s ′ ), ∀(s, a) ∈ (S − {sL}) × A, V π t (s) = X a∈A π(a|s)Q π t (s, a), ∀s ∈ S, Q π t (sL, a) = 0, ∀a ∈ A. (4.46) where P denotes the true transition function. Besides the definition of event A, we also define Ai to be the event P ∈ Pi . Importantly, the value of I{Ai} is only based on observations prior to epoch i. For notational convenience, we again let ι = T|S||A| δ and assume δ ∈ (0, 1). Similarly to the full-information setting, we decompose the regret against policy π, Reg(π) = E hPT t=1 V πt t (s0) − V π t (s0) i , as E "X T t=1 V πt t (s0) − Ve πt t (s0) | {z } Err1 # + E "X T t=1 Ve πt t (s0) − Ve π t (s0) | {z } EstReg # + E "X T t=1 Ve π t (s0) − V π t (s0) | {z } Err2 # . (4.47) 100 Note that, the second term is exactly E [EstReg] = E "X T t=1 D q P¯ i(t) ,πt − q P¯ i(t) ,π , ℓet E # = E "X T t=1 D q P¯ i(t) ,πt − q P¯ i(t) ,π , ℓbt E # , which is controlled by the FTRL process. 4.8.1 Auxiliary Lemmas First, we show the following optimism lemma. Lemma 4.8.1.1. With the notations defined in Eq. (4.45) and Eq. 
(4.46), the following holds conditioning on event A: Qeπ t (s, a) ≤ Q π t (s, a), ∀(s, a) ∈ S × A, t ∈ [T]. Specifically, we have D q P¯ i(t) ,π , ℓet E = Ve π t (s0) ≤ V π t (s0) = q P,π, ℓt . Proof. We prove this result via a backward induction from layer L to layer 0. Base case: for sL, Qeπ t (s, a) = Qπ t (s, a) = 0 holds always. Induction step: Suppose Qeπ t (s, a) ≤ Qπ t (s, a) holds for all states s with k(s) > h. Then, for any state s with k(s) = h, we have Qeπ t (s, a) = qt(s, a) ut(s, a) · ℓt(s, a) + X s ′∈Sk(s)+1 P¯ i(t) (s ′ |s, a)Ve π t (s ′ ) − L · Bi(t) (s, a) (Eq. (4.44)) ≤ qt(s, a) ut(s, a) · ℓt(s, a) + X s ′∈Sk(s)+1 P¯ i(t) (s ′ |s, a)V π t (s ′ ) − L · Bi(t) (s, a) (induction hypothesis) ≤ qt(s, a) ut(s, a) · ℓt(s, a) + X s ′∈Sk(s)+1 P(s ′ |s, a)V π t (s ′ ) 101 + X s ′∈Sk(s)+1 P¯ i(t) (s ′ |s, a) − P(s ′ |s, a) V π t (s ′ ) − L · Bi(t) (s, a) ≤ qt(s, a) ut(s, a) · ℓt(s, a) + X s ′∈Sk(s)+1 P(s ′ |s, a)V π t (s ′ ) ≤ ℓt(s, a) + X s ′∈Sk(s)+1 P(s ′ |s, a)V π (s ′ ) = Q π t (s, a), where the forth step follows from the same arguments in Lemma 4.7.1.1, and the last step holds since under event A, we have qt(s, a) ≤ ut(s, a) by the definition of ut . This finishes the induction. Next, we provide a sequence of boundedness results, useful for regret analysis. Lemma 4.8.1.2 (Lower Bound of Upper Occupancy Bound). Algorithm 5 ensures ut(s) ≥ 1 |S|t for all t and s. Proof. We prove by constructing a special transition function Pb i(t) within the confidence set Pi(t) , which ensures q P¯ i(t) ,πt (s) ≥ 1 |S|t for all state-action pairs. Specifically, let Pb i(t) be such that Pb i(t) (s ′ |s, a) = 1 t · 1 |Sk(s)+1| + t − 1 t · P¯ i(t) (s ′ |s, a), ∀(s, a, s′ ) ∈ Tk, k < L. Clearly, Pb i(t) (·|s, a) is a valid transition distribution over Sk(s)+1 for all state-action pairs. Then, we prove that Pb i(t) ∈ Pi by Pb i(t) (s ′ |s, a) − P¯ i(t) (s ′ |s, a) = 1 t · P¯ i(t) (s ′ |s, a) − 1 |Sk(s)+1| ≤ 1 t ≤ 14 ln T|S||A| δ 3 max mi(t)(s,a) , 1 where the last inequality follows from the fact that mi(t) (s, a) ≤ t. 10 Then, for any state s ̸= s0, we have by the definition of occupancy measures q Pb i(t) ,πt (s) = X s ′∈Sk(s)−1 X a ′∈A q Pb i(t) ,πt (s ′ , a′ ) · Pb i(t) (s|s ′ , a′ ) ≥ X s ′∈Sk(s)−1 X a ′∈A q Pb i(t) ,πt (s ′ , a′ ) · 1 Sk(s) t = 1 Sk(s) t ≥ 1 |S|t Clearly, for s0 it holds that q Pb i(t) ,πt (s0) = 1 ≥ 1/|S|t, which finishes the proof. Corollary 4.8.1.3. Algorithm 5 ensures that, the adjusted loss ℓbt defined in Eq. (4.9) for banditfeedback is bounded as: ℓbt(s, a) ≤ L + It(s, a) qt(s, a) · |S|t. Also, we have E It(s, a) qt(s, a) Ai(t) = E It(s, a) qt(s, a) A c i(t) = 1. Proof. By Lemma 4.8.1.2, we have ℓbt(s, a) ≤ It(s, a) ut(s) · πt(a|s) + L ≤ It(s, a) qt(s) · πt(a|s) · |S|t + L = L + It(s, a) qt(s, a) · |S|t, where the first inequality follows from Bi(s, a) ≤ 1 and ℓt(s, a) ≤ 1, and the second inequality uses Lemma 4.8.1.2 and the fact qt(s) ≤ 1. For the second statement, we have E It(s, a) qt(s, a) Ai(t) = E " Et It(s, a) qt(s, a) Ai(t) # = E h 1| Ai(t) i = 1, By the same arguments we can prove E h It(s,a) qt(s,a) Ac i(t) i = 1 as well. 103 Lemma 4.8.1.4. Algorithm 5 ensures that, the expected adjusted loss ℓet defined in Eq. (4.44) is bounded as: ℓet(s, a) ≤ L + |S| · t ≤ 2|S| · t, ∀(s, a) ∈ S × A, t ∈ [T]. Proof. By Eq. (4.44), we know that ℓet(s, a) = qt(s, a) ut(s, a) · ℓt(s, a) − L · Bi(t) (s, a) ≤ qt(s) ut(s) + L ≤ L + |S| · t where the last inequality follows from Lemma 4.8.1.2. Combining with the fact |S| ≥ L finishes the proof. Corollary 4.8.1.5. 
Algorithm 5 ensures that, the estimated state-action value functions defined in Eq. (4.45) are bounded as: Qeπ t (s, a) ≤ 2L|S|t, ∀(s, a) ∈ S × A, t ∈ [T]. Proof. This is directly by Lemma 4.8.1.4 and the definition of Qeπ t (s, a). Next, we analyze the estimated regret in each epoch. Reloading the notation from the fullinformation setting, we define EstRegi (π) = E tiX +1−1 t=ti D q P¯ i,πt − q P¯ i,π , ℓbt E = E tiX +1−1 t=ti D qbt − q P¯ i,π , ℓbt E . Lemma 4.8.1.6. With β = 128L 4 , for any epoch i, Algorithm 5 ensures EstRegi (π) ≤ O E tiX +1−1 t=ti ηt · p L|S||A| + L 2 X s̸=sL X a∈A qbt(s, a) · Bi(t) (s, a) 2 + O L 4 |S||A| log T + δ · E [L|S|T (ti+1 − ti)] , (4.48) 10 for any policy π, and simultaneously EstRegi (π) ≤ O E p L|S| tiX +1−1 t=ti ηt · sX s̸=sL X a̸=π(s) qbt(s, a) + O L 2 · E tiX +1−1 t=ti ηt · X s̸=sL X a̸=π(s) p qbt(s, a) + O L 4 |A| · E tiX +1−1 t=ti ηt · X s̸=sL X a∈A qbt(s, a) · Bi(t) (s, a) 2 + O L 4 |S||A| log T + δ · E [L|S|T (ti+1 − ti)] , (4.49) for any deterministic policy π : S → A. Proof. The proof is largely based on that of Theorem 4.6.3.1, but with some careful treatments based one whether Ai holds or not. Let q = q P¯ i,π be the occupancy measure we want to compete against. When Ai does not hold, we first derive the following naive bound on Pti+1−1 t=ti D qbt − q, ℓbt E : tiX +1−1 t=ti D qbt − q, ℓbt E ≤ tiX +1−1 t=ti X s̸=sL X a∈A (qbt(s, a) + q(s, a)) · ℓbt(s, a) ≤ tiX +1−1 t=ti X s̸=sL X a∈A (qbt(s, a) + q(s, a)) · L + It(s, a) ut(s, a) · |S|t (Corollary 4.8.1.3) ≤ 2L 2 · (ti+1 − ti) + |S|T · tiX +1−1 t=ti X s̸=sL X a∈A (qbt(s, a) + q(s, a)) · It(s, a) qt(s, a) . Therefore, we have the conditional expectation E hPti+1−1 t=ti D qbt − q, ℓbt E Ac i i bounded by E 2L 2 · (ti+1 − ti) + |S|t · tiX +1−1 t=ti X s̸=sL X a∈A (qbt(s, a) + q(s, a)) · It(s, a) qt(s, a) A c i ≤ E (2L 2 + 2L|S|T) · (ti+1 − ti) A c i (Corollary 4.8.1.3) ≤ O E [L|S|T · (ti+1 − ti)| Ac i ] . 10 Next, we condition on event Ai . In this case, by the same argument as Lemma 5 of [50] and also our loss-shifting technique, Algorithm 5 with β = 128L 4 ensures that Pti+1−1 t=ti D qbt − q, ℓbt E is bounded by O L 4 |S||A| log T + tiX +1−1 t=ti+1 1 ηt − 1 ηt−1 (ϕH(q) − ϕH(qbt)) + 8 tiX +1−1 t=ti ηt min X s̸=sL X a∈A qbt(s, a) 3/2 Qbt(s, a) − Vbt(s) 2 , X s̸=sL X a∈A qbt(s, a) 3/2 ℓbt(s, a) 2 (4.50) where ϕH(q) = − P s̸=sL P a∈A p q(s, a), and Qbt and Vbt are state-action and state value functions associated with the loss estimator ℓbt and the empirical transition P¯ i(t) : Qbt(s, a) = ℓbt(s, a) + X s ′∈Sk(s)+1 P¯ i(t) (s ′ |s, a)Vbt(s ′ ), Vbt(s) = X a∈A πt(a|s)Qbt(s, a). Below, we discuss how to proceed from here to prove Eq. (4.48) and Eq. (4.49) respectively. Proving Eq. (4.48) In this case, we take the second argument of the min operator from Eq. 
(4.50) and bound ϕH(q) − ϕH(qbt) ≤ P s̸=sL P a∈A p qbt(s, a) trivially by p L|S||A| using Cauchy-Schwarz inequality, leading to tiX +1−1 t=ti D qbt − q, ℓbt E ≤ O (L|S||A| log T) + p L|S||A| · tiX +1−1 t=ti ηt + 8 tiX +1−1 t=ti ηt · X s̸=sL X a∈A qbt(s, a) 3/2 ℓbt(s, a) 2 ( 1 ηt − 1 ηt−1 ≤ ηt since 1 ηt = √ t − ti + 1) ≤ O (L|S||A| log T) + 2p L|S||A| · tiX +1−1 t=ti ηt + 16 tiX +1−1 t=ti ηt · X s̸=sL X a∈A qbt(s, a) 3/2 · It(s, a) ut(s, a) 2 + 16L 2 tiX +1−1 t=ti ηt · X s̸=sL X a∈A qbt(s, a) 3/2 · Bi(t) (s, a) 2 10 ≤ O (L|S||A| log T) + 2p L|S||A| · tiX +1−1 t=ti ηt + 16 tiX +1−1 t=ti ηt · X s̸=sL X a∈A p qbt(s, a) · It(s, a) qt(s, a) + 16L 2 tiX +1−1 t=ti ηt · X s̸=sL X a∈A qbt(s, a) · Bi(t) (s, a) 2 where the second step follows from the definition of ℓbt in Eq. (4.9) and the last step follows from the fact qbt(s, a) ≤ ut(s, a) and qt(s, a) ≤ ut(s, a) since P¯ i , P ∈ Pi according to event Ai . Therefore, by Lemma 4.9.3.6 we have for any policy π that, E [EstRegi (π)] ≤ E 2 p L|S||A| · tiX +1−1 t=ti ηt + 16 tiX +1−1 t=ti ηt · X s̸=sL X a∈A p qbt(s, a) · It(s, a) qt(s, a) + E 16L 2 tiX +1−1 t=ti ηt · X s̸=sL X a∈A qbt(s, a) · Bi(t) (s, a) 2 + O L 4 |S||A| log T + δ · E [L|S|T (ti+1 − ti)] ≤ O E p L|S||A| · tiX +1−1 t=ti ηt + E L 2 tiX +1−1 t=ti ηt · X s̸=sL X a∈A qbt(s, a) · Bi(t) (s, a) 2 + O L 4 |S||A| log T + δ · E [L|S|T (ti+1 − ti)] where the second step takes the conditional expectation of It(s, a) and applies the Cauchy-Schwarz inequality to get P s̸=sL P a∈A p qbt(s, a) ≤ p L|S||A|. This finishes the proof of Eq. (4.48). Proving Eq. (4.49) In this case, recall that π is a deterministic policy, so that ϕH(q) − ϕH(qbt) = X s̸=sL p qbt(s) X a∈A p πt(a|s) − 1 ! + X s̸=sL p qbt(s) − p q(s) . 1 Using Lemma 16 of [50] to bound the first term (take α in their lemma to be 0), and Lemma 19 of [50] to bound the second, we obtain ϕH(q) − ϕH(qbt) = X s̸=sL X a̸=π(s) p qbt(s, a) + s L|S| X s̸=sL X a̸=π(s) qbt(s, a). Therefore, taking the first argument of the min operator from Eq. (4.50) and using 1 ηt − 1 ηt−1 ≤ ηt again, we arrive at tiX +1−1 t=ti D qbt − q, ℓbt E ≤ p L|S| tiX +1−1 t=ti ηt · sX s̸=sL X a̸=π(s) qbt(s, a) + tiX +1−1 t=ti ηt · X s̸=sL X a̸=π(s) p qbt(s, a) + 8 tiX +1−1 t=ti ηt · X s̸=sL X a∈A qbt(s, a) 3/2 Qbt(s, a) − Vbt(s) 2 + O L 4 |S||A| log T . (4.51) Finally, we apply Lemma 4.8.1.7 to bound the term P s̸=sL P a∈A qbt(s, a) 3/2 Qbt(s, a) − Vbt(s) 2 , and use Lemma 4.9.3.6 again to take expectation and arrive at Eq. (4.49) (with the help of Eq. (4.52)). Lemma 4.8.1.7. Under event A, we have for any t, X s̸=sL X a∈A qbt(s, a) 3/2 Qbt(s, a) − Vbt(s) 2 ≤ 4L 4 |A| X s ′̸=sL X a ′∈A qbt(s ′ , a′ ) · Bi(t) (s ′ , a′ ) 2 + X s̸=sL X a∈A p qbt(s, a) · (Ot(s, a) + Wt(s, a)) 10 where Ot(s, a) = 4L · (1 − πt(a|s)) L X−1 k=k(s) X s ′∈Sk X a ′∈A qbt(s ′ , a′ |s, a) It(s ′ , a′ ) qt(s ′ , a′) , Wt(s, a) = 4L · X b̸=a πt(b|s) L X−1 k=k(s) X s ′∈Sk X a ′∈A qbt(s ′ , a′ |s, b) It(s ′ , a′ ) qt(s ′ , a′) , and qbt(s ′ , a′ |s, a) is the probability of visiting (s ′ , a′ ) starting from (s, a) under πt and P¯ i(t) . Moreover, we have Et X s̸=sL X a∈A p qbt(s, a) · (Ot(s, a) + Wt(s, a)) ≤ 16L 2 X s̸=sL X a̸=π(s) p qbt(s, a), (4.52) for any mapping π : S → A. Proof. First, Qbt(s, a) − Vbt(s) 2 is bounded by Qbt(s, a) − Vbt(s) 2 = (1 − πt(a|s)) Qbt(s, a) − X b̸=a πt(b|s)Qbt(s, b) 2 ≤ 2 (1 − πt(a|s))2 Qbt(s, a) 2 + 2 X b̸=a πt(b|s)Qbt(s, b) 2 . 
Following the same idea of Lemma 4.6.1.3, the first term can be bounded as (1 − πt(a|s))2 Qbt(s, a) 2 = (1 − πt(a|s))2 L X−1 k=k(s) X s ′∈Sk X a ′∈A qbt(s ′ , a′ |s, a)ℓbt(s ′ , a′ ) 2 ≤ 2 (1 − πt(a|s))2 L X−1 k=k(s) X s ′∈Sk X a ′∈A qbt(s ′ , a′ |s, a) It(s ′ , a′ ) ut(s ′ , a′) · ℓt(s ′ , a′ ) 2 109 + 2 (1 − πt(a|s))2 L X−1 k=k(s) X s ′∈Sk X a ′∈A qbt(s ′ , a′ |s, a) · L · Bi(t) (s ′ , a′ ) 2 ≤ 2L · (1 − πt(a|s))2 L X−1 k=k(s) X s ′∈Sk X a ′∈A qbt(s ′ , a′ |s, a) 2 · It(s ′ , a′ ) ut(s ′ , a′) 2 + 2L 3 (1 − πt(a|s))2 L X−1 k=k(s) X s ′∈Sk X a ′∈A qbt(s ′ , a′ |s, a) · Bi(t) (s ′ , a′ ) 2 (4.53) where the equality follows from the definition of Qbt ; the first inequality uses the fact (x+y) 2 ≤ 2(x 2+ y 2 ); the second inequality applies the Cauchy-Schwarz inequality with the facts It(s, a)It(s ′ , a′ ) = 0 for (s, a) ̸= (s ′ , a′ ) and PL−1 k=k(s) P s ′∈Sk P a ′∈A qbt(s ′ , a′ |s, a) ≤ L. By the same arguments, the second term is bounded as X b̸=a πt(b|s)Qbt(s, b) 2 = L X−1 k=k(s) X s ′∈Sk X a ′∈A X b̸=a πt(b|s)qbt(s ′ , a′ |s, b) ℓbt(s, a) 2 ≤ 2L · L X−1 k=k(s) X s ′∈Sk X a ′∈A X b̸=a πt(b|s) · qbt(s ′ , a′ |s, b) 2 · It(s ′ , a′ ) ut(s ′ , a′) 2 + 2L 3 L X−1 k=k(s) X s ′∈Sk X a ′∈A X b̸=a πt(b|s) · qbt(s ′ , a′ |s, b) · Bi(t) (s ′ , a′ ) 2 , (4.54) where in the last step we use PL−1 k=k(s) P s ′∈Sk P a ′∈A P b̸=a πt(b|s) · qbt(s ′ , a′ |s, b) ≤ L (after applying Cauchy-Schwarz). Combining Eq. (4.53) and Eq. (4.54), we show that qbt(s, a) Qbt(s, a) − Vbt(s) 2 can be bounded as qbt(s, a) Qbt(s, a) − Vbt(s) 2 ≤ 4L · qbt(s, a) (1 − πt(a|s))2 L X−1 k=k(s) X s ′∈Sk X a ′∈A qbt(s ′ , a′ |s, a) 2 · It(s ′ , a′ ) ut(s ′ , a′) 2 110 + 4L · qbt(s, a) L X−1 k=k(s) X s ′∈Sk X a ′∈A X b̸=a πt(b|s) · qbt(s ′ , a′ |s, b) 2 · It(s ′ , a′ ) ut(s ′ , a′) 2 + 4L 3 qbt(s, a) (1 − πt(a|s))2 L X−1 k=k(s) X s ′∈Sk X a ′∈A qbt(s ′ , a′ |s, a) · Bi(t) (s ′ , a′ ) 2 + 4L 3 qbt(s, a) L X−1 k=k(s) X s ′∈Sk X a ′∈A X b̸=a πt(b|s) · qbt(s ′ , a′ |s, b) · Bi(t) (s ′ , a′ ) 2 . Moreover, we have the summation of the first two terms bounded as 4L · qbt(s, a) (1 − πt(a|s))2 L X−1 k=k(s) X s ′∈Sk X a ′∈A qbt(s ′ , a′ |s, a) 2 · It(s ′ , a′ ) ut(s ′ , a′) 2 + 4L · qbt(s, a) L X−1 k=k(s) X s ′∈Sk X a ′∈A X b̸=a πt(b|s) · qbt(s ′ , a′ |s, b) 2 · It(s ′ , a′ ) ut(s ′ , a′) 2 ≤ 4L · (1 − πt(a|s))2 L X−1 k=k(s) X s ′∈Sk X a ′∈A qbt(s, a)qbt(s ′ , a′ |s, a) ut(s ′ , a′) · qbt(s ′ , a′ |s, a) It(s ′ , a′ ) qt(s ′ , a′) + 4L · L X−1 k=k(s) X s ′∈Sk X a ′∈A P b̸=a qbt(s, b) · qbt(s ′ , a′ |s, b) ut(s ′ , a′) · X b̸=a πt(b|s) · qbt(s ′ , a′ |s, b) It(s ′ , a′ ) qt(s ′ , a′) ≤ Ot(s, a) + Wt(s, a) where we use qt(s ′ , a′ ) ≤ ut(s ′ , a′ ) due to event Ai in the first step and P a∈A qbt(s, a)qbt(s ′ , a′ |s, a) ≤ qbt(s ′ , a′ ) ≤ ut(s ′ , a′ ) in the second step to bound the fractions by 1. On the other hand, the summation of the other two terms is bounded as 4L 3 qbt(s, a) (1 − πt(a|s))2 L X−1 k=k(s) X s ′∈Sk X a ′∈A qbt(s ′ , a′ |s, a) · Bi(t) (s ′ , a′ ) 2 + 4L 3 qbt(s, a) L X−1 k=k(s) X s ′∈Sk X a ′∈A X b̸=a πt(b|s) · qbt(s ′ , a′ |s, b) · Bi(t) (s ′ , a′ ) 2 ≤ 4L 3 qbt(s) L X−1 k=k(s) X s ′∈Sk X a ′∈A qbt(s ′ , a′ |s, a)πt(a|s) +X b̸=a πt(b|s) · qbt(s ′ , a′ |s, b) · Bi(t) (s ′ , a′ ) 2 111 = 4L 3 L X−1 k=k(s) X s ′∈Sk X a ′∈A qbt(s ′ , a′ |s)qbt(s) · Bi(t) (s ′ , a′ ) 2 . 
Note that, taking the summation of the last bound over all state-action pairs yields 4L 3 X s̸=sL X a∈A L X−1 k=k(s) X s ′∈Sk X a ′∈A qbt(s ′ , a′ |s)qbt(s) · Bi(t) (s ′ , a′ ) 2 = 4L 3 |A| X s ′̸=sL X a ′∈A k(s ′ X )−1 k=0 X s∈Sk qbt(s ′ , a′ |s)qbt(s) · Bi(t) (s ′ , a′ ) 2 ≤ 4L 4 |A| X s ′̸=sL X a ′∈A qbt(s ′ , a′ ) · Bi(t) (s ′ , a′ ) 2 . Therefore, combining everything, we have shown: X s̸=sL X a̸=π(s) qbt(s, a) 3/2 Qbt(s, a) − Vbt(s) 2 ≤ 4L 4 |A| X s ′̸=sL X a ′∈A qbt(s ′ , a′ ) · Bi(t) (s ′ , a′ ) 2 + X s̸=sL X a̸=π(s) p qbt(s, a) · (Ot(s, a) + Wt(s, a)), proving the first statement of the lemma. To prove the second statement, we first show Et [Ot(s, a) + Wt(s, a)] = 4L(1 − πt(a|s)) · L X−1 k=k(s) X s ′∈Sk X a ′∈A qbt(s ′ , a′ |s, a) + 4L · L X−1 k=k(s) X s ′∈Sk X a ′∈A X b̸=a πt(b|s) · qbt(s ′ , a′ |s, b) = 4L(1 − πt(a|s)) L X−1 k=k(s) 1 + 4L · L X−1 k=k(s) (1 − πt(a|s)) ≤ 8L 2 (1 − πt(a|s)), 112 and therefore Et X s̸=sL X a∈A p qbt(s, a) · (Ot(s, a) + Wt(s, a)) ≤ 8L 2 X s̸=sL X a∈A p qbt(s, a) (1 − πt(a|s)) ≤ 8L 2 X s̸=sL X a̸=π(s) p qbt(s, a) + 8L 2 X s̸=sL p qbt(s) (1 − πt(π(s)|s)) ≤ 16L 2 X s̸=sL X a̸=π(s) p qbt(s, a), which proves Eq. (4.52). Note that both Eq. (4.48) and Eq. (4.49) contain a term related to P s̸=sL P a∈A qbt(s, a) · Bi(t) (s, a) 2 . Below, we show that when summed over t, this is only logarithmic in T. Lemma 4.8.1.8. Algorithm 5 ensures the following: E X T t=1 X s̸=sL X a∈A qbt(s, a) · Bi(t) (s, a) 2 = O L 2 |S| 3 |A| 2 ln2 ι + |S||A|T · δ . (4.55) Proof. By Lemma 4.9.3.2, we know that Bi(s, a) 2 ≤ 2 s |Sk(s)+1| ln ι max {mi(s, a), 1} + 14|Sk(s)+1| ln ι 3 max {mi(s, a), 1} 2 ≤ O |Sk(s)+1| ln ι max {mi(s, a), 1} + |Sk(s)+1| 2 ln2 ι max {mi(s, a), 1} 2 ! . Then, we have E X T t=1 X s̸=sL X a∈A qbt(s, a) · Bi(t) (s, a) 2 11 = E X T t=1 X s̸=sL X a∈A (qbt(s, a) − qt(s, a)) · Bi(t) (s, a) 2 + E X T t=1 X s̸=sL X a∈A qt(s, a) · Bi(t) (s, a) 2 ≤ E X T t=1 X s̸=sL X a∈A rt(s, a) + E 4 X T t=1 X s̸=sL X a∈A k( Xs)−1 k=0 X (u,v,w)∈Tk qt(u, v) vuut P(w|u, v) ln T|S||A| δ max mi(t) (u, v), 1 qt(s, a|w) · Bi(t) (s, a) + O E X T t=1 X s̸=sL X a∈A qt(s, a) · |Sk(s)+1| ln ι max {mi(s, a), 1} + |Sk(s)+1| 2 ln2 ι max {mi(s, a), 1} 2 ! ≤ O E X T t=1 X s̸=sL X a∈A rt(s, a) + O E X T t=1 X s̸=sL X a∈A qt(s, a) · |Sk(s)+1| ln ι max {mi(s, a), 1} + |Sk(s)+1| 2 ln2 ι max {mi(s, a), 1} 2 ! where the first inequality uses Lemma 4.9.3.10 and Bi(s, a) ∈ [0, 1], and the last inequality follows from the observation that, the second term in the previous line is bounded by PT t=1 P s̸=sL P a∈A rt(s, a) according to the definition of residual terms in Definition 4.9.3.9. Finally, applying Lemma 4.9.3.10 and Lemma 4.9.3.8, we have E X T t=1 X s̸=sL X a∈A qbt(s, a) · Bi(t) (s, a) 2 = O L 2 |S| 3 |A| 2 ln2 ι + |S||A|T · δ + O L X−1 k=0 |Sk+1| |Sk| |A| ln T ln ι + |Sk(s)+1| 2 |Sk| |A| ln2 ι ! = O L 2 |S| 3 |A| 2 ln2 ι + |S||A|T · δ , which completes the proof. Finally, we provide a lemma regarding the learning rates. Lemma 4.8.1.9 (Learning Rates). According to the design of the learning rate ηt = √ 1 t−ti(t)+1 , the following inequalities hold: X T t=1 η 2 t ≤ O |S||A| log2 T , (4.56) X T t=1 ηt ≤ O p |S||A|T log T . (4.57) Proof. By direct calculation, we have tiX +1−1 t=ti η 2 t = ti+1 X−ti n=1 1 n ≤ 2 Z ti+1−ti+1 1 1 x dx = 2 ln (ti+1 − ti + 1) ≤ O (log T). Combining the inequality with the fact that the total number of epochs N is at most 4|S||A|(log T + 1) (Lemma 4.9.3.12) finishes the proof of Eq. (4.56). 
Following the similar idea, we have tiX +1−1 t=ti ηt = ti+1 X−ti n=1 1 √ n ≤ Z ti+1−ti 0 1 √ x dx ≤ 2 p ti+1 − ti . Taking the summation over N epochs and applying the Cauchy-Schwarz inequality yields Eq. (4.57). 4.8.2 Proof for the Adversarial World (Proposition 4.8.1) Recall the regret decomposition in Eq. (4.47): E "X T t=1 V πt t (s0) − Ve πt t (s0) | {z } Err1 # + E "X T t=1 Ve πt t (s0) − Ve π t (s0) | {z } EstReg # + E "X T t=1 Ve π t (s0) − V π t (s0) | {z } Err2 # . We bound each of them separately below. 11 Err1 Similarly to the proof for the full-information feedback setting, we have Err1 = X T t=1 ⟨qt , ℓt⟩ − D qbt , ℓet E = X T t=1 X s̸=sL X a∈A ℓt(s, a)qbt(s, a) ut(s, a) · (ut(s, a) − qt(s, a)) +X T t=1 ⟨qt − qbt , ℓt⟩ + L · X T t=1 qbt , Bi(t) where the last two terms have been shown to be at most Oe L|S| p |A|T + L 3 |S| 3 |A| according to the analysis of Err1 in Section 4.7.2 (see Eq. (4.34), Eq. (4.35) and Eq. (4.36)). Then, we bound the first term as E X T t=1 X s̸=sL X a∈A ℓt(s, a)qbt(s, a) ut(s, a) · (ut(s, a) − qt(s, a)) ≤ E X T t=1 X s̸=sL X a∈A |ut(s, a) − qt(s, a)| (qbt(s, a) ≤ ut(s, a)) ≤ E 4 X T t=1 X s̸=sL X a∈A rt(s, a) + 16X T t=1 X s̸=sL X a∈A k( Xs)−1 k=0 X (u,v,w)∈Tk qt(u, v) s P(w|u, v) ln ι max mi(t) (u, v), 1 qt(s, a|w) (Corollary 4.9.3.11) ≤ O L 2 |S| 3 |A| 2 ln2 ι + |S||A|T · δ + 4L · E X T t=1 X u̸=sL X v∈A qt(u, v) s Sk(u)+1 ln ι max mi(t) (u, v), 1 (Lemma 4.9.3.10 and Cauchy-Schwarz) ≤ O L 2 |S| 3 |A| 2 ln2 ι + |S||A|T · δ + L · L X−1 k=0 p |Sk| · |Sk+1| |A|T ln ι ! (Lemma 4.9.3.8) = O L|S| p |A|T ln ι + L 2 |S| 3 |A| 2 ln2 ι + |S||A|T · δ . Combining the bounds together, we have E [Err1] bounded by: E [Err1] = Oe L|S| p |A|T + L 3 |S| 3 |A| 2 . 11 Err2 Following the same idea of bounding Err2, by Lemma 4.8.1.1 and Lemma 4.9.3.5, we have the expectation of Err2 bounded as E [Err2] ≤ δ · 3L|S|T 2 + 0 = O L|S|T 2 · δ = O(1). EstReg According to Eq. (4.48) of Lemma 4.8.1.6, we have EstReg(˚π) = E "X T t=1 D qbt − q P¯ i(t) ,˚π , ℓbt E # = E "X N i=1 EstRegi (˚π) # ≤ O E X N i=1 tiX +1−1 t=ti ηt p L|S||A| + E L 2 · X T t=1 X s̸=sL X a∈A qbt(s, a) · Bi(t) (s, a) 2 + O L 4 |S| 2 |A| 2 ln2 T + δL|S|T 2 ≤ Oe E "X T t=1 ηt · p L|S||A| # + L 4 |S| 3 |A| 2 ln2 ι ! (Lemma 4.8.1.8) ≤ Oe |S||A| √ LT + L 4 |S| 3 |A| 2 . (Eq. (4.57)) Finally, we combine the bounds of Err1, Err2 and EstReg as: RegT (˚π) = Oe L|S| p |A|T + |S||A| √ LT + L 4 |S| 3 |A| 2 , finishing the proof. 4.8.3 Proof for the Stochastic World (Proposition 4.8.2) Similarly to the proof of Proposition 4.7.2, we decompose Err1 and Err2 jointly into four terms ErrSub, ErrOpt, OccDiff and Bias: Err1 + Err2 = X T t=1 X s̸=sL X a̸=π⋆(s) qt(s, a)Ebπ ⋆ t (s, a) (ErrSub) 1 + X T t=1 X s̸=sL X a=π⋆(s) (qt(s, a) − q ⋆ t (s, a)) Ebπ ⋆ t (s, a) (ErrOpt) + X T t=1 X s̸=sL X a∈A (qt(s, a) − qbt(s, a)) Qeπ ⋆ t (s, a) − Ve π ⋆ t (s) (OccDiff) + X T t=1 X s̸=sL X a̸=π⋆(s) q ⋆ t (s, a) Ve π ⋆ t (s) − V π ⋆ t (s) (Bias) where Ebπ t is defined as Ebπ t (s, a) = ℓt(s, a) + X s ′∈Sk(s)+1 P(s ′ |s, a)Ve π t (s ′ ) − Qeπ t (s, a). By the exact same reasoning as in the full-information setting (Section 4.7.3), we have E [OccDiff] = O L 4 |S| 3 |A| 2 ln2 ι + E G3(L 4 ln ι) and E [Bias] = O(1), but the first two terms ErrSub and ErrOpt are slightly different. 
To see this, note that under event A, we have Ebπ ⋆ t (s, a) = ℓt(s, a) − ℓet(s, a) + X s ′∈Sk(s)+1 P(s ′ |s, a) − P¯ i(t) (s ′ |s, a) Ve π ⋆ t (s ′ ) = ℓt(s, a) 1 − qt(s, a) ut(s, a) + L · Bi(t) (s, a) + X s ′∈Sk(s)+1 P(s ′ |s, a) − P¯ i(t) (s ′ |s, a) Ve π ⋆ t (s ′ ) ≤ ut(s, a) − qt(s, a) qt(s, a) + 2L 2 · Bi(t) (s, a) where the last line applies the definition of event A and the fact qt(s, a) ≤ ut(s, a) given this event. Importantly, the second term has been studied and bounded in the proof of Proposition 4.7.2 already, so we only need to focus on the first term. Before doing so, note that the range of Ebπ t is O (L|S|t) based on Corollary 4.8.1.5, and thus the range of ErrSub and ErrOpt is O L 2 |S|T 2 . Therefore, we only need to add a term O δ · L 2 |S|T 2 to address the event Ac . Extra term in ErrSub According to previous analysis, the extra term in ErrSub is X T t=1 X s̸=sL X a̸=π⋆(s) qt(s, a) · ut(s, a) − qt(s, a) qt(s, a) ≤ X T t=1 X s̸=sL X a̸=π⋆(s) |ut(s, a) − qt(s, a)| ≤ 4 X T t=1 X s̸=sL X a̸=π⋆(s) rt(s, a) + 16X T t=1 X s̸=sL X a̸=π⋆(s) k( Xs)−1 k=0 X (u,v,w)∈Tk qt(u, v) vuut P(w|u, v) ln T|S||A| δ max mi(t) (u, v), 1 qt(s, a|w) (Corollary 4.9.3.11) = 4X T t=1 X s̸=sL X a̸=π⋆(s) rt(s, a) + 16G3(ln ι) (Definition 4.9.2.1) = 16G3(ln ι) + O L 2S 3A 2 ln2 ι . (Lemma 4.9.3.10) Finally, using Lemma 4.9.3.6 and the bound on ErrSub for the full-information setting, we have E [ErrSub] = O G3(ln ι) + G1(L 4 |S| ln ι) + L 2 |S| 3 |A| 2 ln2 ι . Extra term in ErrOpt Similarly, we consider the extra term in ErrOpt: X T t=1 X s̸=sL X a=π⋆(s) (qt(s, a) − q ⋆ t (s, a)) · ut(s, a) − qt(s, a) qt(s, a) ≤ 4 X T t=1 X s̸=sL X a=π⋆(s) qt(s, a) − q ⋆ t (s, a) qt(s, a) rt(s, a) + X T t=1 X s̸=sL X a=π⋆(s) qt(s, a) − q ⋆ t (s, a) qt(s, a) · 16 X u,v,w qt(u, v) vuut P(w|u, v) ln T|S||A| δ max mi(t) (u, v), 1 qt(s, a|w) (Corollary 4.9.3.11) ≤ 4 X T t=1 X s̸=sL X a=π⋆(s) rt(s, a) + 16G6(ln ι) (Definition 4.9.2.1) = 16G6(ln ι) + O L 2S 3A 2 ln2 ι . (Lemma 4.9.3.10) Again, considering the term that appears in the full-information setting already, we have E [ErrOpt] = O G6(ln ι) + G2(L 4 |S| ln ι) + L 2 |S| 3 |A| 2 ln2 ι . It remains to bound EstReg with terms that enjoy self-bounding properties. Term EstReg According to Eq. (4.49) in Lemma 4.8.1.6, taking the summation of all the epochs, we have the following bound for E [EstReg]: O E p |S|L X N i=1 tiX +1−1 t=ti ηt · sX s̸=sL X a̸=π⋆(s) qbt(s, a) + L 2 · E X N i=1 tiX +1−1 t=ti ηt · X s̸=sL X a̸=π⋆(s) p qbt(s, a) + O E L 4 |A| X N i=1 tiX +1−1 t=ti X s̸=sL X a∈A qbt(s, a) · Bi(t) (s, a) 2 + O δ · E " L|S|T X N i=1 (ti+1 − ti) # + L 4 |S| 2 |A| 2 ln2 ι ! = O E p |S|L X T t=1 ηt · sX s̸=sL X a̸=π⋆(s) qbt(s, a) + O L 2 · E X T t=1 ηt · X s̸=sL X a̸=π⋆(s) p qbt(s, a) + O L 6 |S| 3 |A| 3 ln2 ι where the lase line applies Lemma 4.8.1.8. Then, for the first term, we have E p |S|L X T t=1 ηt · sX s̸=sL X a̸=π⋆(s) qbt(s, a) ≤ E p |S|L · vuutX T t=1 η 2 t · vuutX T t=1 X s̸=sL X a̸=π⋆(s) qbt(s, a) ≤ E q 4L|S| 2|A| log2 T · vuutX T t=1 X s̸=sL X a̸=π⋆(s) qbt(s, a) 1 where the second line follows from the Cauchy-Schwarz inequality, and the third line applies Eq. (4.56). Then, we separate the term into two parts: E q 4L|S| 2|A| log2 T · vuutX T t=1 X s̸=sL X a̸=π⋆(s) qt(s, a) + E q 4L|S| 2|A| log2 T · vuutX T t=1 X s̸=sL X a̸=π⋆(s) |qbt(s, a) − qt(s, a)| ≤ E 2 · G4(L|S| 2 |A| log2 T) + E X T t=1 X s̸=sL X a̸=π⋆(s) |qbt(s, a) − qt(s, a)| + 4|S| 2 |A|Llog2 T where second line follows from the fact √xy ≤ x + y for x, y ≥ 0. 
Note that, the second term above can be bounded by O G3 (ln ι) + L 2 |S| 3 |A| 2 ln2 ι just as in the full-information setting (see Eq. (4.42)). Therefore, we have finished bounding the first term: E p |S|L X T t=1 ηt · sX s̸=sL X a̸=π⋆(s) qbt(s, a) = O E G4(L|S| 2 |A| log2 T) + G3 (ln ι) + L 2 |S| 3 |A| 2 ln2 ι . On the other hand, the second term can be bounded similarly: L 2 · E X T t=1 ηt · X s̸=sL X a̸=π⋆(s) p qbt(s, a) ≤ L 2 · E X s̸=sL X a̸=π⋆(s) · vuutX T t=1 η 2 t · vuutX T t=1 qbt(s, a) ≤ L 2 q 4|S||A| log2 T · E X s̸=sL X a̸=π⋆(s) · vuutX T t=1 qbt(s, a) ≤ E 2 · G5(L 4 |S||A| log2 T) + E X T t=1 X s̸=sL X a̸=π⋆(s) |qbt(s, a) − qt(s, a)| + L 4 |S||A| log2 T 12 = O E G5(L 4 |S||A| log2 T) + G3 (ln ι) + L 2 |S| 3 |A| 2 ln2 ι . So we have the final bound on E [EstReg]: E [EstReg] = O E G4 L|S| 2 |A| log2 T + G5 L 4 |S||A| log2 T + G3 (ln ι) + L 6 |S| 3 |A| 3 ln2 ι Finally, by combining the bounds of each term, we finally have RegT (π ⋆ ) ≤ O E G1 L 4 |S| ln T + G3 (ln T) (from ErrSub) + E G2 L 4 |S| ln T + G6 (ln T) (from ErrOpt) + E G3 L 4 ln T (from OccDiff) + E G4 L|S| 2 |A| ln2 T + G5 L 4 |S||A| ln2 T + G3 (ln T) (from EstReg) + L 6 |S| 3 |A| 3 ln2 T . When Condition (4.1) holds, we apply similar self-bounding arguments to obtain a logarithmic regret bound. Specifically, for some universal constant κ > 0, we have RegT (π ⋆ ) ≤ κ E G1 L 4 |S| ln T + G2 L 4 |S| ln T + G3 L 4 ln T ! + κ E G4 L|S| 2 |A| log2 T + G5 L 4 |S||A| log2 T + G6 (ln T) ! + κ L 6 |S| 3 |A| 3 ln2 Then, for any z > 1, by applying all the self-bounding lemmas (Lemma 4.9.2.2-Lemma 4.9.2.7) with α = β = 1 32zκ , we arrive at RegT (π ⋆ ) ≤ 1 z · RegT (π ⋆ ) + C L + z · O X s̸=sL X a̸=π⋆(s) κ 2 ∆(s, a) · L 4 |S| ln T + L 6 |S| ln T + L 4 |S||A| log2 T + z · O κ 2 ∆min · L 5 |S| 2 ln T + L 6 |S| 2 ln T + L 3 |S| 2 |A| ln T + L|S| 2 |A| log2 T + κ · L 6 |S| 3 |A| 3 ln2 T ≤ 1 z · RegT (π ⋆ ) + C L + κ · L 6 |S| 3 |A| 3 ln2 T + z · O X s̸=sL X a̸=π⋆(s) L 6 |S| ln T + L 4 |S||A| log2 T ∆(s, a) + L 6 |S| 2 ln T + L 3 |S| 2 |A| log2 T ∆min ≤ 1 z · RegT (π ⋆ ) + C L + z · κ ′U + κ · V, where κ ′ is a universal constant hidden in the O(·) notation, and U and V are defined in Proposition 4.8.2). The last step is to rearrange and pick the optimal z, which is almost identical to that in the proof of Proposition 4.7.2 and finally shows RegT (π ⋆ ) = O U + √ UC + V . This completes the entire proof. 1 4.9 General Decomposition, Self-bounding Terms, and Supplementary Lemmas In this section, we provide details of our two key techniques: a general decomposition and selfbounding terms, as well as a set of supplementary Lemmas used throughout the analysis. 4.9.1 General Decomposition Lemma In this section, we consider measuring the performance difference between a policy π and a mapping (deterministic policy) π ⋆ , that is, V π (s0) − V π ⋆ (s0) where Q and V are the state-action and state value functions associated with some transition P and some loss function ℓ, that is, Q π (s, a) = ℓ(s, a) + X s ′∈Sk(s)+1 P(s ′ |s, a)V π (s ′ ), V π (s) = X a∈A π(a|s)Q π (s, a), for all state-action pairs (with V π (sL) = 0). Moreover, for some estimated transition Pb and estimated loss function ℓb, define similarly Qb and Vb as the corresponding state-action and state value functions: Qbπ (s, a) = ℓb(s, a) + X s ′∈Sk(s)+1 Pb(s ′ |s, a)Vb π (s ′ ), Vb π (s) = X a∈A π(a|s)Qbπ (s, a), for all state-action pairs (with Vb π (sL) = 0). Again, we denote by q ⋆ π (s, a) the probability of visiting a trajectory of the form (s0, π⋆ (s0)),(s1, π⋆ (s1)), . 
. . ,(sk(s)−1 , π⋆ (sk(s)−1 )),(s, a) when executing policy π. In other words, q ⋆ π can be formally defined as q ⋆ π (s, a) = π(a|s), s = s0, π(a|s) · P s ′∈Sk(s)−1 q ⋆ π (s ′ , π⋆ (s))P(s|s ′ , π⋆ (s)) , otherwise. 124 Note that our earlier notation q ⋆ t is thus a shorthand for q ⋆ πt . With slight abuse of notations, we define q ⋆ π (s) = P a∈A q ⋆ π (s, a). Now, we present a general decomposition for V π (s0) − V π ⋆ (s0). Lemma 4.9.1.1. (General Performance Decomposition) For any policies π and u, and a mapping (deterministic policy) π ⋆ : S → A, we have V π (s0) − V π ⋆ (s0) = X s̸=sL X a̸=π⋆(s) q(s, a)Ebu (s, a) (Error of Sub-opt actions) + X s̸=sL X a=π⋆(s) (q(s, a) − q ⋆ π (s, a)) Ebu (s, a) (Error of Opt actions) + X s̸=sL X a∈A q(s, a) Qbu (s, a) − Vb u (s) (Policy Difference) − X s̸=sL X a=π⋆(s) q ⋆ π (s, a) Qbu (s, a) − Vb u (s) (Estimation Bias 1) + X s̸=sL X a̸=π⋆(s) q ⋆ π (s, a) Vb u (s) − V π ⋆ (s) , (Estimation Bias 2) where q = q P,π is the occupancy measure associated with transition P and policy π, and Ebπ is a surplus function with: Ebπ (s, a) = ℓ(s, a) + X s ′∈Sk(s)+1 P(s ′ |s, a)Vb π (s ′ ) − Qbπ (s, a). Moreover, selecting the surrogate policy u as the mapping π ⋆ yields Corollary 4.9.1.2, which is the key decomposition lemma used in our analysis. 125 Corollary 4.9.1.2. Consider an arbitrary policy sequence {πt} T t=1, an arbitrary estimated transition sequence {Pbt} T t=1, and an arbitrary estimated loss sequence {ℓbt} T t=1. Then, we have X T t=1 V πt (s0) − Vb πt t (s0) | {z } Err1 + X T t=1 Vb π ⋆ t (s0) − V π ⋆ (s0) ! | {z } Err2 = X T t=1 X s̸=sL X a̸=π⋆(s) qt(s, a)Ebπ ⋆ t (s, a) (Error of Sub-opt actions) + X T t=1 X s̸=sL X a=π⋆(s) (qt(s, a) − q ⋆ t (s, a)) Ebπ ⋆ t (s, a) (Error of Opt actions) + X T t=1 X s̸=sL X a∈A (qt(s, a) − qbt(s, a)) Qbπ ⋆ t (s, a) − Vb π ⋆ t (s) (Occupancy Difference) + X T t=1 X s̸=sL X a̸=π⋆(s) q ⋆ t (s, a) Vb π ⋆ t (s) − V π ⋆ t (s) , (Estimation Bias) where qbt = q Pbt,πt , qt = q P,πt , q ⋆ t = q ⋆ πt , Qbπt t and Vb πt t are the state-action and state value functions associated with πt, ℓbt, and Pbt, and Ebπ t is the surplus function defined as: Ebπ t (s, a) = ℓ(s, a) + X s ′∈Sk(s)+1 P(s ′ |s, a)Vb π t (s ′ ) − Qbπ t (s, a). Proof. (Proof of Lemma 4.9.1.1) By direct calculation, for all states s, we have V π (s) − Vb u (s) = X a∈A π(a|s) Q π (s, a) − Qbu (s, a) + X a∈A π(a|s) Qbu (s, a) − Vb u (s) = X a∈A π(a|s) X s ′∈Sk(s)+1 P(s ′ |s, a) V π (s ′ ) − Vb u (s ′ ) + X a∈A π(a|s) ℓ(s, a) + X s ′∈Sk(s)+1 P(s ′ |s, a)Vb u (s ′ ) − Qbu (s, a) | {z } Ebu(s,a) + X a∈A π(a|s) Qbu (s, a) − Vb u (s) . 126 By repeatedly expanding V π (s ′ ) − Vb u (s ′ ) in the same way, we conclude V π (s0) − Vb u (s0) = X s̸=sL X a∈A q(s, a)Ebu (s, a) + X s̸=sL X a∈A q(s, a) Qbu (s, a) − Vb u (s) . (4.58) On the other hand, we also have for all states s: V π (s) − Vb u (s) = X a=π⋆(s) π(a|s) Q π (s, a) − Vb u (s) + X a̸=π⋆(s) π(a|s) Q π (s, a) − Vb u (s) = X a=π⋆(s) π(a|s) X s ′∈Sk(s)+1 P(s ′ |s, a) V π (s ′ ) − Vb u (s ′ ) + X a=π⋆(s) π(a|s) ℓ(s, a) + X s ′∈Sk(s)+1 P(s ′ |s, a)Vb u (s ′ ) − Qbu (s, a) | {z } Ebu(s,a) + X a=π⋆(s) π(a|s) Qbu (s, a) − Vb u (s) + X a̸=π⋆(s) π(a|s) Q π (s, a) − Vb u (s) . 
Using Lemma 4.9.1.3 (which repeatedly expands V π (s ′ ) − Vb u (s ′ ) in the same way) with C(s) = X a=π⋆(s) π(a|s)Ebu (s, a) + X a=π⋆(s) π(a|s) Qbu (s, a) − Vb u (s) + X a̸=π⋆(s) π(a|s) Q π (s, a) − Vb u (s) 127 we obtain V π (s0) − Vb u (s0) = X s̸=sL q ⋆ π (s)C(s) = X s̸=sL X a=π⋆(s) q ⋆ π (s, a)Ebu (s, a) + X s̸=sL X a̸=π⋆(s) q ⋆ π (s, a) Q π (s, a) − Vb u (s) + X s̸=sL X a=π⋆(s) q ⋆ π (s, a) Qbu (s, a) − Vb u (s) . (4.59) Combining Eq. (4.58) and Eq. (4.59), we have the following equality: X s̸=sL X a̸=π⋆(s) q ⋆ π (s, a) Q π (s, a) − Vb u (s) = X s̸=sL X a∈A q(a, s)Ebu (s, a) + X s̸=sL X a∈A q(s, a) Qbu (s, a) − Vb u (s) − X s̸=sL X a=π⋆(s) q ⋆ π (s, a)Ebu (s, a) − X s̸=sL X a=π⋆(s) q ⋆ π (s, a) Qbu (s, a) − Vb u (s) = X s̸=sL X a̸=π⋆(s) q(s, a)Ebu (s, a) (Error of Sub-opt actions) (4.60) + X s̸=sL X a=π⋆(s) (q(s, a) − q ⋆ π (s, a)) Ebu (s, a) (Error of Opt actions) + X s̸=sL X a∈A q(s, a) Qbu (s, a) − Vb u (s) (Policy Difference) − X s̸=sL X a=π⋆(s) q ⋆ π (s, a) Qbu (s, a) − Vb u (s) (Estimation Bias 1), (4.61) 128 Next, we consider the following: V π (s) − V π ⋆ (s) = X a=π⋆(s) π(a|s) (Q π (s, a) − Q ⋆ (s, a)) + X a̸=π⋆(s) π(a|s) Q π (s, a) − V π ⋆ (s) = X a=π⋆(s) π(a|s) X s ′∈Sk(s)+1 P(s ′ |s, a) V π (s ′ ) − V π ⋆ (s) + X a̸=π⋆(s) π(a|s) Q π (s, a) − V π ⋆ (s) . By Lemma 4.9.1.3 (which again repeatedly expands V π (s ′ ) − V π ⋆ (s) in the same way), we obtain V π (s0) − V π ⋆ (s0) = X s̸=sL X a̸=π⋆(s) q ⋆ π (s, a) Q π (s, a) − V π ⋆ (s) . (4.62) Finally, combining Eq. (4.60) and Eq. (4.62), we arrive at V π (s0) − V π ⋆ (s0) = X s̸=sL X a̸=π⋆(s) q ⋆ π (s, a) Q π (s, a) − Vb u (s) + X s̸=sL X a̸=π⋆(s) q ⋆ π (s, a) Vb u (s) − V π ⋆ (s) = X s̸=sL X a̸=π⋆(s) q(s, a)Ebu (s, a) (Transition Error of Sub-opt actions) + X s̸=sL X a=π⋆(s) (q(s, a) − q ⋆ π (s, a)) Ebu (s, a) (Transition Error of Opt actions) + X s̸=sL X a∈A q(s, a) Qbu (s, a) − Vb u (s) (Policy Difference) − X s̸=sL X a=π⋆(s) q ⋆ π (s, a) Qbu (s, a) − Vb u (s) (Estimation Bias 1) + X s̸=sL X a̸=π⋆(s) q ⋆ π (s, a) Vb u (s) − V π ⋆ (s) (Estimation Bias 2) finishing the proof. 129 Proof. (Proof of Corollary 4.9.1.2) By applying Lemma 4.9.1.1 with u = π ⋆ , we know that V πt t (s0)− V π ⋆ t (s0) equals to X s̸=sL X a̸=π⋆(s) qt(s, a)Ebπ ⋆ t (s, a) + X s̸=sL X a=π⋆(s) (qt(s, a) − q ⋆ t (s, a)) Ebπ ⋆ t (s, a) + X s̸=sL X a∈A qbt(s, a) Qbπ ⋆ t (s, a) − Vb π ⋆ t (s) + X s̸=sL X a∈A (qt(s, a) − qbt(s, a)) Qbπ ⋆ t (s, a) − Vb π ⋆ t (s) − X s̸=sL X a=π⋆(s) q ⋆ t (s, a) Qbπ ⋆ t (s, a) − Vb π ⋆ t (s) (Estimation Bias 1) + X s̸=sL X a̸=π⋆(s) q ⋆ t (s, a) Vb π ⋆ t (s) − V π ⋆ t (s) . (Estimation Bias 2) Now observe the following two facts. First, the third term above is in fact equal to Vb πt t (s0)−Vb π ⋆ t (s0) according to the standard performance difference lemma (Theorem 5.2.1 of [52]). Second, the first estimation bias term is simply 0 since Qbπ ⋆ t (s, a) = Vb π ⋆ t (s) when a = π ⋆ (s). Therefore, by taking the summation over t, we obtain Err1 + Err2 = X T t=1 V πt t (s0) − V π ⋆ t (s0) − Vb πt t (s0) − Vb π ⋆ t (s0) = X s̸=sL X a̸=π⋆(s) qt(s, a)Ebπ ⋆ t (s, a) + X s̸=sL X a=π⋆(s) (qt(s, a) − q ⋆ t (s, a)) Ebπ ⋆ t (s, a) + X s̸=sL X a∈A (qt(s, a) − qbt(s, a)) Qbπ ⋆ t (s, a) − Vb π ⋆ t (s) + X s̸=sL X a̸=π⋆(s) q ⋆ t (s, a) Vb π ⋆ t (s) − V π ⋆ t (s) 130 which finishes the proof. Lemma 4.9.1.3. For any functions F : S → R and C L : S → R satisfying the following condition: F(s) = X a=π⋆(s) π(a|s) X s ′∈Sk(s)+1 P(s ′ |s, a)F(s ′ ) + C(s) and F(sL) = 0, we have F(s0) = X s̸=sL q ⋆ π (s)C(s). Proof. 
By definition and direct calculation, we have F(s0) equal to X a=π⋆(s0) q(s0, a) X s ′∈S1 P(s ′ |s0, a)F(s ′ ) + C(s) (q(s0) = 1) = X s1∈S1 q ⋆ π (s1)F(s1) + q ⋆ π (s0)C(s) = X s1∈S1 q ⋆ π (s1) X a=π⋆(s) π(a|s) X s ′∈S2 P(s ′ |s, a)F(s ′ ) + X 1 k=0 X s∈Sk q ⋆ π (s)C(s) = X s2∈S2 q ⋆ π (s2)F(s2) +X 1 k=0 X s∈Sk q ⋆ π (s)C(s) (definition of q ⋆ π (s)) = X sL∈SL q ⋆ π (sL)F(sL) + L X−1 k=0 X s∈Sk q ⋆ π (s)C(s) (repeatedly expanding) = X s̸=sL q ⋆ π (s)C(s), (F(sL) = 0) which completes the proof. 4.9.2 Self-bounding Terms In this section, we summarize all the self-bounding terms we use in the proofs for the unknown transition settings. 131 Definition 4.9.2.1 (Self-bounding Terms). For some mapping π ⋆ : S → A, define the following: G1(J) = X T t=1 X s̸=sL X a̸=π⋆(s) qt(s, a) s J max mi(t) (s, a) , G2(J) = X T t=1 X s̸=sL X a=π⋆(s) (qt(s, a) − q ⋆ t (s, a)) s J max mi(t) (s, a), 1 , G3(J) = X T t=1 X s̸=sL X a̸=π⋆(s) k( Xs)−1 k=0 X (u,v,w)∈Tk qt(u, v) s P(w|u, v) · J max mi(t) (u, v), 1 qt(s, a|w), G4(J) = vuutJ · X T t=1 X s̸=sL X a̸=π⋆(s) qt(s, a), G5(J) = X s̸=sL X a̸=π⋆(s) vuutJ X T t=1 qt(s, a), G6(J) = X T t=1 X s̸=sL X a=π⋆(s) qt(s, a) − q ⋆ t (s, a) qt(s, a) k( Xs)−1 k=0 X (u,v,w)∈Tk qt(u, v) s P(w|u, v) · J max mi(t) (u, v), 1 qt(s, a|w) . In the next six lemmas, we show that each of these six functions enjoys a certain self-bounding property under Condition (4.1) so that they are small whenever the regret of the learner is small. In all these lemmas, the policy π ⋆ used in G1-G6 coincides with the π ⋆ in Condition (4.1). Also note that Lemma 4.5.2 is simply a collection of the first four lemmas. Lemma 4.9.2.2. Suppose Condition (4.1) holds. Then we have for any α ∈ R+, E [G1(J)] ≤ α · RegT (π ⋆ ) + C L + 1 α X s̸=sL X a̸=π⋆(s) 8J ∆(s, a) . Proof. Under the condition, for any α ∈ R+, we have G1(J) = X T t=1 X s̸=sL X a̸=π⋆(s) qt(s, a) s J max mi(t) (s, a), 1 − α∆(s, a) ! + α X T t=1 X s̸=sL X a̸=π⋆(s) qt(s, a)∆(s, a) 132 where the expectation of the last term is bounded by α · RegT (π ⋆ ) + C L . It thus remains to bound the first term. To this end, for a fixed state-action pair (s, a), we define Ns,a as the last epoch where the term in the bracket is still positive, so that: mNs,a+1(s, a) ≤ 2J α2∆(s, a) 2 due to the doubling epoch schedule. Then we have E "X T t=1 qt(s, a) s J max mi(t) (s, a), 1 − α∆(s, a) !# = E "X N i=1 (mi+1(s, a) − mi(s, a)) s J max {mi(s, a), 1} − α∆(s, a) !# ≤ E N Xs,a i=1 (mi+1(s, a) − mi(s, a)) s J max {mi(s, a), 1} − α∆(s, a) ! ≤ E " 2 Z mNs,a+1(s,a) 0 r J x dx# ≤ E " 2 Z 2J α2∆(s,a)2 0 r J x dx# ≤ 4 · √ J · s 2J α2∆(s, a) 2 ≤ 8J α∆(s, a) . Taking the summation over all state-action pairs (s, a) satisfying a ̸= π ⋆ (s), we thus have E [G2(J)] ≤ α · (RegT (π ⋆ ) + C) + X s̸=sL X a̸=π⋆(s) 8J α∆(s, a) . Lemma 4.9.2.3. Suppose Condition (4.1) holds. Then we have for any β ∈ R+, E [G2(J)] ≤ β · (RegT (π ⋆ ) + C) + 1 β · 8|S|LJ ∆min . 13 Proof. Clearly, under the condition, for any β ∈ R+, we have G2(J) = X T t=1 X s̸=sL X a=π⋆(s) (qt(s, a) − q ⋆ t (s, a)) s J max mi(t) (s, a), 1 − β · ∆min L ! + β X T t=1 X s̸=sL X a=π⋆(s) (qt(s, a) − q ⋆ t (s, a)) · ∆min L where the expectation of the last term is bounded by β·(RegT (π ⋆ ) + C) according to Lemma 4.9.2.8 (deferred to the end of this subsection). It thus remains to bound the first term. To this end, for a fixed state-action pair (s, a), we similarly define Ns,a as the last epoch where the term in the bracket is still positive, so that: mNs,a+1(s, a) ≤ 2JL2 β 2∆2 min due to the doubling epoch schedule. 
Then, we have E "X T t=1 (qt(s, a) − q ⋆ t (s, a)) s J max mi(t) (s, a), 1 − β · ∆min L !# ≤ E N Xs,a i=1 (mi+1(s, a) − mi(s, a)) s J max {mi(s, a), 1} − β · ∆min L ! (qt(s, a) ≥ q ⋆ t (s, a) by definition) ≤ E " 2 Z mNs,a+1(s,a) 0 r J x dx# ≤ E 2 Z 2JL2 β2∆2 min 0 r J x dx ≤ 4 · √ J · s 2JL2 β 2∆2 min ≤ 8LJ β∆min . Taking the summation over all state-action pairs satisfying a = π ⋆ (s), we have E [G2(J)] ≤ β · (RegT (π ⋆ ) + C) + X s̸=sL X a=π⋆(s) 8LJ β∆min = β · RegT (π ⋆ ) + C L + 8|S|LJ β∆min . 134 Lemma 4.9.2.4. Suppose Condition (4.1) holds. Then we have for any α, β ∈ R+, E [G3(J)] ≤ (α + β) · (RegT (π ⋆ ) + C L ) + 1 α · X s̸=sL X a̸=π⋆(s) 8L 2 |S|J ∆(s, a) + 1 β · 8L 2 |S| 2J ∆min . Proof. First we have G3(J) = X T t=1 L X−1 k=0 X (u,v,w)∈Tk qt(u, v) s P(w|u, v) · J max mi(t) (s, a) L X−1 l=k+1 X s∈Sl X a̸=π⋆(s) qt(s, a|w) = X T t=1 L X−1 k=0 X u∈Sk X v̸=π⋆(s) qt(u, v) X w∈Sk+1 s P(w|u, v) · J max mi(t) (s, a), 1 L X−1 l=k+1 X s∈Sl X a̸=π⋆(s) qt(s, a|w) + X T t=1 L X−1 k=0 X u∈Sk X v=π⋆(s) qt(u, v) X w∈Sk+1 s P(w|u, v) · J max mi(t) (s, a), 1 L X−1 l=k+1 X s∈Sl X a̸=π⋆(s) qt(s, a|w) ≤ X T t=1 L X−1 k=0 X u∈Sk X v̸=π⋆(s) qt(u, v) · s L2|S| · J max mi(t) (s, a), 1 + X T t=1 L X−1 k=0 X u∈Sk X v=π⋆(s) qt(u, v) X w∈Sk+1 s P(w|u, v) · J max mi(t) (s, a), 1 L X−1 l=k+1 X s∈Sl X a̸=π⋆(s) qt(s, a|w) where the second step separates the optimal and sub-optimal state-action pairs, and the inequality follows from the fact P s̸=sL P a∈A qt(s, a|w) ≤ L and the Cauchy-Schwarz inequality. Note that, the first term is simply G1(L 2 |S|) and can be applied using Lemma 4.9.2.2. To bound the last term, we first observe the following X T t=1 L X−1 k=0 X u∈Sk X v=π⋆(s) qt(u, v) X w∈Sk+1 P(w|u, v) · ∆min L L X−1 l=k+1 X s∈Sl X a̸=π⋆(s) qt(s, a|w) = X T t=1 L X−1 l=0 X s∈Sl X a̸=π⋆(s) ∆min L · X l−1 k=0 X u∈Sk X v=π⋆(s) X w∈Sk+1 qt(u, v)P(w|u, v)qt(s, a|w) ≤ X T t=1 L X−1 l=0 X s∈Sl X a̸=π⋆(s) ∆min L · X l−1 k=0 qt(s, a) ! 135 ≤ X T t=1 L X−1 l=0 X s∈Sl X a̸=π⋆(s) qt(s, a)∆min where the expectation of the last term is bounded by RegT (π ⋆ ) + C under Condition (4.1). Let cilp [x] = max {x, 0} be the clipping function that removes the negative value. By adding and subtracting β times the term above, we have X T t=1 L X−1 k=0 X u∈Sk X v=π⋆(s) qt(u, v) X w∈Sk+1 s P(w|u, v) · J max mi(t) (s, a), 1 L X−1 l=k+1 X s∈Sl X a̸=π⋆(s) qt(s, a|w) = β X T t=1 L X−1 k=0 X u∈Sk X v=π⋆(s) qt(u, v) X w∈Sk+1 P(w|u, v) · ∆min L L X−1 l=k+1 X s∈Sl X a̸=π⋆(s) qt(s, a|w) + X T t=1 L X−1 k=0 X u∈Sk X v=π⋆(s) qt(u, v) X w∈Sk+1 s P(w|u, v) · J max mi(t) (s, a), 1 − β · ∆minP(w|u, v) L ! L X−1 l=k+1 X s∈Sl X a̸=π⋆(s) qt(s, a|w) ≤ β X T t=1 L X−1 l=0 X s∈Sl X a̸=π⋆(s) qt(s, a)∆min + L X T t=1 L X−1 k=0 X u∈Sk X v=π⋆(s) X w∈Sk+1 qt(u, v)clip "s P(w|u, v) · J max mi(t) (s, a), 1 − β · ∆minP(w|u, v) L # where the last line follows from the facts x ≤ clip[x] and P s̸=sL P a∈A qt(s, a|w) ≤ L. Fix a tuple Nu,v,w where v = π ⋆ (u), we similarly define Nu,v,w as the last epoch where the argument of clip(·) is still positive, so that: mNu,v,w+1(s, a) ≤ 2JL2 P(w|u, v)β 2∆2 min due to the doubling epoch schedule. Then, we have E "X T t=1 qt(u, v)clip "s P(w|u, v) · J max mi(t) (s, a), 1 − β · ∆minP(w|u, v) L ## ≤ E N Xu,v,w i=1 (mi+1(u, v) − mi(u, v)) clip "s P(w|u, v) · J max mi(t) (s, a), 1 − β · ∆minP(w|u, v) L # 136 ≤ E " 2 Z mNu,v,w+1(s,a) 0 r P(w|u, v) · J x dx# ≤ E 2 Z 2JL2 P (w|u,v)β2∆2 min 0 r P(w|u, v)J x dx ≤ 4 · p P(w|u, v) · J · s 2JL2 P(w|u, v)β 2∆2 min ≤ 8LJ β∆min . 
Taking the summation over all transition tuple (u, v, w) satisfying v = π ⋆ (s) and adding E G1(L 2 |S|J) , we have E [G3(J)] ≤ β · (RegT (π ⋆ ) + C L ) + E G1(L 2 |S|J) + L L X−1 k=0 X u∈Sk X v=π⋆(u) X w∈Sk+1 8LJ β∆min ≤ (α + β) · (RegT (π ⋆ ) + C L ) + 1 α · X s̸=sL X a̸=π⋆(s) 8L 2 |S|J ∆(s, a) + 1 β · 8L 2 |S| 2J ∆min , where the last line follows from the fact PL−1 k=0 |Sk| |Sk + 1| ≤ |S| 2 . Lemma 4.9.2.5. Suppose Condition (4.1) holds. Then we have for any β ∈ R+, E [G4(J)] ≤ β · RegT (π ⋆ ) + C L + 1 β · J 4∆min . Proof. By the fact that 2 √xy ≤ x + y for all x, y ≥ 0, with Condition (4.1), we have E [G4(J)] = E vuut2β X T t=1 X s̸=sL X a̸=π⋆(s) qt(s, a)∆min · J 2β∆min ≤ β · E X T t=1 X s̸=sL X a̸=π⋆(s) qt(s, a)∆min + J 4β∆min ≤ β · RegT (π ⋆ ) + C L + J 4β∆min . 137 Lemma 4.9.2.6. Suppose Condition (4.1) holds. Then we have for any α ∈ R+, E [G5(J)] ≤ α · RegT (π ⋆ ) + C L + X s̸=sL X a̸=π⋆(s) J 4α∆(s, a) . Proof. By the fact that 2 √xy ≤ x + y for all x, y ≥ 0, with Condition (4.1), we have E [G4(J)] = E X s̸=sL X a̸=π⋆(s) vuut2α X T t=1 qt(s, a)∆(s, a) · J 2α∆(s, a) ≤ α · E X T t=1 X s̸=sL X a̸=π⋆(s) qt(s, a)∆(s, a) + X s̸=sL X a̸=π⋆(s) J 4α∆(s, a) ≤ α · RegT (π ⋆ ) + C L + X s̸=sL X a̸=π⋆(s) J 4α∆(s, a) . Lemma 4.9.2.7. Suppose Condition (4.1) holds. Then we have for any β ∈ R+, E [G6(J)] ≤ β · RegT (π ⋆ ) + C L + 1 β · 8L 3 |S| 2 |A| · J ∆min . Proof. By adding and subtracting terms, we have G6(J) equals to X T t=1 X s̸=sL X a=π⋆(s) qt(s, a) − q ⋆ t (s, a) qt(s, a) · k( Xs)−1 k=0 X (u,v,w)∈Tk qt(u, v) vuut P(w|u, v) ln T|S||A| δ max mi(t) (u, v), 1 qt(s, a|w) − βqt(s, a) · ∆min L + β L X T t=1 X s̸=sL X a=π⋆(s) (qt(s, a) − q ⋆ t (s, a)) ∆min where the expectation of the last term is bounded by β·(RegT (π ⋆ ) + C) according to Lemma 4.9.2.8. 138 To bound the first term, we observe that k( Xs)−1 k=0 X (u,v,w)∈Tk qt(u, v) vuut P(w|u, v) ln T|S||A| δ max mi(t) (u, v), 1 qt(s, a|w) − βqt(s, a) · ∆min L = k( Xs)−1 k=0 X (u,v,w)∈Tk qt(u, v) vuut P(w|u, v) ln T|S||A| δ max mi(t) (u, v), 1 qt(s, a|w) − β · ∆min L2 · k( Xs)−1 k=0 X (u,v,w)∈Tk qt(u, v)P(w|u, v)qt(s, a|w) = k( Xs)−1 k=0 X (u,v,w)∈Tk qt(u, v) vuut P(w|u, v) ln T|S||A| δ max mi(t) (u, v), 1 − P(w|u, v) · β · ∆min L2 · qt(s, a|w) ≤ k( Xs)−1 k=0 X (u,v,w)∈Tk qt(u, v) clip vuut P(w|u, v) ln T|S||A| δ max mi(t) (u, v), 1 − P(w|u, v) · β · ∆min L2 | {z } =ht(u,v,w) qt(s, a|w) where the first equality uses P (u,v,w)∈Tk qt(u, v)P(w|u, v)qt(s, a|w) = qt(s, a) for all layer k = 0, . . . k(s) − 1. (Recall clip[x] = max{x, 0}.) Therefore, with Condition (4.1), we bound the E [G6(J)] by E X T t=1 X s̸=sL X a=π⋆(s) qt(s, a) − q ⋆ t (s, a) qt(s, a) X u,v,w qt(u, v)ht(u, v, w)qt(s, a|w) ! + β · (RegT (π ⋆ ) + C) ≤ E X T t=1 X s̸=sL X a=π⋆(s) X u,v,w qt(u, v)ht(u, v, w)qt(s, a|w) ! + β · (RegT (π ⋆ ) + C) ≤ LE " · X T t=1 X u,v,w qt(u, v)ht(u, v, w) # + β · (RegT (π ⋆ ) + C) where the second line applies the fact qt(s,a)−q ⋆ t (s,a) qt(s,a) ≤ 1, and the third line changes summation order and uses the fact that P s̸=sL P a∈A qt(s, a|w) ≤ L. 139 Finally, following the similar idea of handing P t=1 qt(u, v)ht(u, v, w) as in Lemma 4.9.2.4, we have E "X t=1 qt(u, v)ht(u, v, w) # ≤ 8L 2J β∆min . By taking the summation over all transition triples, we have E [G6(J)] ≤ β · RegT (π ⋆ ) + C L + L · L X−1 k=0 X (u,v,w)∈Tk 1 β · 8L 2 · J ∆min ≤ β · RegT (π ⋆ ) + C L + 1 β · 8L 3 |S| 2 |A| · J ∆min , where the last line follows from the fact that PL k=0 |Sk| |Sk+1| ≤ |S| 2 . Lemma 4.9.2.8. 
Under Condition (4.1), we have E X T t=1 X s̸=sL X a=π⋆(s) (qt(s, a) − q ⋆ t (s, a)) ∆min ≤ L · E [RegT (π ⋆ ) + C] . Proof. For each k, we proceed as X s∈Sk X a=π⋆(s) (qt(s, a) − q ⋆ t (s, a)) ≤ 1 − X s∈Sk X a=π⋆(s) q ⋆ t (s, a) ( P s∈Sk P a∈A qt(s, a) = 1) = 1 − X s∈Sk X a=π⋆(s) πt(a|s) Pr " {sk = s} \ k \−1 τ=0 {aτ = π ⋆ (sτ )} ! P, πt # (definition of q ⋆ t ) = 1 − Pr " \ k τ=0 {aτ = π ⋆ (sτ )} ! P, πt # 140 = Pr " \ k τ=0 {aτ = π ⋆ (sτ )} !c P, πt # = Pr " [ k τ=0 {aτ ̸= π ⋆ (sτ )} ! P, πt # (De Morgan’s laws) ≤ X k τ=0 Pr [aτ ̸= π ⋆ (sτ )| P, πt ] (union bound) = X k τ=0 X s∈Sτ X a̸=π⋆(s) qt(s, a) = X s̸=sL X a̸=π⋆(s) qt(s, a). Therefore, we have X T t=1 X s̸=sL X a=π⋆(s) (qt(s, a) − q ⋆ π (s, a)) ∆min ≤ L · X T t=1 X s̸=sL X a̸=π⋆(s) qt(s, a) · ∆(s, a) ≤ L · E [RegT (π ⋆ ) + C] where the last line follows from Condition (4.1). 4.9.3 Supplementary Lemmas Lemma 4.9.3.1. (Occupancy Measure Difference) For any policy π and transition functions P1 and P2, with q1 = q P1,π and q2 = q P2,π we have for all s, q1(s) − q2(s) = k( Xs)−1 k=0 X u∈Sk X v∈A X w∈Sk+1 q1(u, v) [P1(w|u, v) − P2(w|u, v)] q2(s|w) = k( Xs)−1 k=0 X u∈Sk X v∈A X w∈Sk+1 q2(u, v) [P1(w|u, v) − P2(w|u, v)] q1(s|w) (4.63) 141 where the conditional occupancy measure q1(s ′ |s) (similarly for q2(s ′ |s)) is defined recursively as q1(s ′ |s) = 0, k(s ′ ) < k(s) or (k(s ′ ) = k(s) and s ′ ̸= s) 1, k(s ′ ) = k(s) and s ′ = s P u∈Sk(s′)−1 q1(u|s) P v∈A π(v|u)P(s ′ |u, v) , k(s ′ ) > k(s) (4.64) which is the conditional probability of visiting state s ′ from s under π and transition P1. Proof. Fix a state s. We proceed as: q1(s) − q2(s) = X s ′∈Sk(s)−1 X a ′∈A q1(s ′ , a′ )P1(s|s ′ , a′ ) − q2(s ′ , a′ )P2(s ′ , a′ ) = X s ′∈Sk(s)−1 X a ′∈A q1(s ′ ) − q2(s ′ ) P1(s|s ′ , a′ )π(a ′ |s ′ ) + X s ′∈Sk(s)−1 X a ′∈A q2(s ′ , a′ ) P1(s|s ′ , a′ ) − P2(s|s ′ , a′ ) where the second step follows by subtracting and adding q2(s ′ , a′ )P1(s|s ′ , a′ ). Note that, P a ′∈A π(a ′ |s ′ )P1(s|s ′ , a′ ) is exactly the conditional probability of transiting to state s from state s ′ with transition P1. Therefore, we have P a ′∈A π(a ′ |s ′ )P1(s|s ′ , a′ ) = q1(s|s ′ ) according to Eq. (4.64), and further expand q1(s) − q2(s) as: X s ′∈Sk(s)−1 X a ′∈A q1(s ′ ) − q2(s ′ ) P1(s|s ′ , a′ )π(a ′ |s ′ ) + X s ′∈Sk(s)−1 X a ′∈A q2(s ′ , a′ ) P1(s|s ′ , a′ ) − P2(s|s ′ , a′ ) = X s ′∈Sk(s)−1 q1(s|s ′ ) q1(s ′ ) − q2(s ′ ) + X s ′∈Sk(s)−1 X a ′∈A q2(s ′ , a′ ) P1(s|s ′ , a′ ) − P2(s|s ′ , a′ ) q1(s|s) where the second line follows from the fact that q1(s|s) = 1. Therefore, we can recursively expand q1(s) − q2(s) as: q1(s) − q2(s) = X s ′∈Sk(s)−1 q1(s ′ ) − q2(s ′ ) q1(s|s ′ ) + X s ′∈Sk(s)−1 X a ′∈A q2(s ′ , a′ ) P1(s|s ′ , a′ ) − P2(s|s ′ , a′ ) q1(s|s) = X s ′∈Sk(s)−1 q1(s ′ ) − q2(s ′ ) q1(s|s ′ ) + X k(s) k=k(s) X (u,v,w)∈Tk q2(u, v) [P1(w|u, v) − P2(w|u, v)] q1(s|w) = X s ′∈Sk(s)−1 X s ′′∈Sk(s)−2 q1(s ′′) − q2(s ′′) q1(s ′ |s ′′) q1(s|s ′ ) + X k(s) k=k(s)−1 X (u,v,w)∈Tk q2(u, v) P1(s|s ′ , a′ ) − P2(s|s ′ , a′ ) q1(s|w) = X s ′′∈Sk(s)−2 q1(s ′′) − q2(s ′′) q1(s|s ′′) + X k(s) k=k(s)−1 X (u,v,w)∈Tk q2(u, v) P1(s|s ′ , a′ ) − P2(s|s ′ , a′ ) q1(s|w) = k( Xs)−1 k=0 X u∈Sk X v∈A X w∈Sk+1 q2(u, v) [P1(w|u, v) − P2(w|u, v)] q1(s|w). (expand recursively) where the second step follows from the fact that q(s ′ |s) = 0 for all states s ̸= s ′ with k(s) = k(s ′ ), and the third step follows from the fact P s ′∈Sk q(s ′ |s ′′)q(s|s ′ ) = q(s|s ′′) for all state pairs that k(s) > k > k(s ′′). 
By applying the same technique, we also have q2(s) − q1(s) = k( Xs)−1 k=0 X u∈Sk X v∈A X w∈Sk+1 q1(u, v) [P2(w|u, v) − P1(w|u, v)] q2(s|w). Flipping this equality finishes the proof for the second statement of the lemma: q1(s) − q2(s) = k( Xs)−1 k=0 X u∈Sk X v∈A X w∈Sk+1 q1(u, v) [P1(w|u, v) − P2(w|u, v)] q2(s|w). Lemma 4.9.3.2. The following holds: Bi(s, a) ≤ 2 vuut |Sk(s)+1| ln T|S||A| δ max {mi(s, a), 1} + 14|Sk(s)+1| ln T|S||A| δ 3 max {mi(s, a), 1} . Proof. By the definition of Bi(s, a), we have Bi(s, a) = X s ′∈Sk(s)+1 Bi(s, a, s′ ) = X s ′∈Sk(s)+1 2 vuut P¯ i(s ′ |s, a) ln T|S||A| δ max {mi(s, a), 1} + 14 ln T|S||A| δ 3 max {mi(s, a), 1} ≤ 2 vuut |Sk(s)+1| ln T|S||A| δ max {mi(s, a), 1} + 14|Sk(s)+1| ln T|S||A| δ 3 max {mi(s, a), 1} where the last line follows from the Cauchy-Schwarz inequality. Lemma 4.9.3.3. Conditioning on event A, we have Bi(s, a, s′ ) ≤ 4 vuut P(s ′ |s, a) ln T|S||A| δ max {mi(s, a), 1} + 40 ln T|S||A| δ 3 max {mi(s, a), 1} . (4.65) 144 Proof. By direct calculation based on Eq. (4.7) and the condition of event A, we have Bi(s, a, s′ ) ≤ 2 vuut P¯ i(s ′ |s, a) ln T|S||A| δ max {mi(s, a), 1} + 14 ln T|S||A| δ 3 max {mi(s, a), 1} ≤ 2 vuut (P(s ′ |s, a) + Bi(s, a, s′)) ln T|S||A| δ max {mi(s, a), 1} + 14 ln T|S||A| δ 3 max {mi(s, a), 1} ≤ 2 vuut P(s ′ |s, a) ln T|S||A| δ max {mi(s, a), 1} + vuut 4Bi(s, a, s′) ln T|S||A| δ max {mi(s, a), 1} + 14 ln T|S||A| δ 3 max {mi(s, a), 1} ≤ 2 vuut P(s ′ |s, a) ln T|S||A| δ max {mi(s, a), 1} + Bi(s, a, s′ ) 2 + 20 ln T|S||A| δ 3 max {mi(s, a), 1} , where the third line applies the fact that √ x + y ≤ √ x+ √y, and the last line follows from the fact 2 √xy ≤ x + y for x, y > 0. Rearranging the terms yields that Bi(s, a, s′ ) ≤ 4 vuut P(s ′ |s, a) ln T|S||A| δ max {mi(s, a), 1} + 40 ln T|S||A| δ 3 max {mi(s, a), 1} . Combining with the fact Bi(s, a, s′ ) ≤ 1, we have the following tighter bound of confidence width. Corollary 4.9.3.4. Conditioning on event A, we have Bi(s, a, s′ ) ≤ min 4 vuut P(s ′ |s, a) ln T|S||A| δ max {mi(s, a), 1} + 40 ln T|S||A| δ 3 max {mi(s, a), 1} , 1 ≤ min 4 vuut P(s ′ |s, a) ln T|S||A| δ max {mi(s, a), 1} , 1 + min 40 ln T|S||A| δ 3 max {mi(s, a), 1} , 1 . 145 We often use the following two lemmas to deal with the small-probability event Ac when taking expectation. Lemma 4.9.3.5. Suppose that a random variable X satisfies the following conditions: • Conditioning on event E, X < Y where Y > 0 is another random variable; • X < CL holds always for some fixed C L ∈ R+. Then, we have E [X] ≤ C L · Pr [E c ] + E [Y ] . Proof. By writing the random variable X as X · I{E} + X · I{Ec}, and noting X · I{E} ≤ Y · I{E} ≤ Y, and X · I{Ec } ≤ C L · I{Ec }, we prove the statement after taking the expectations. Lemma 4.9.3.6. Suppose that a random variable X satisfies the following conditions: • Conditioning on event E, X < Y where Y > 0 is another random variable; • X < CL holds where C L is another random variable which ensures E [C|Ec ] ≤ D for some fixed D ∈ R+. Then, we have E [X] ≤ D · Pr [E c ] + E [Y ] . 146 Proof. By writing the random variable X as X · I{E} + X · I{Ec}, and noting X · I{E} ≤ Y · I{E} ≤ Y, X · I{Ec } ≤ C L · I{Ec }, E [C · I{Ec }] ≤ E [C|Ec ] , we prove the statement after taking the expectations. Lemma 4.9.3.7. ([46, Lemma 10]) With probability at least 1−2δ, we have for all k = 0, . . . L−1, X T t=1 X s∈Sk,a∈A qt(s, a) max{1, mi(t) (s, a)} = O (|Sk||A| ln T + ln(L/δ)) (4.66) and X T t=1 X s∈Sk,a∈A qt(s, a) q max{1, mi(t) (s, a)} = O p |Sk||A|T + |Sk||A| ln T + ln(L/δ) . 
(4.67) Simultaneously, for all k < h, we have X T t=1 X (u,v,w)∈Tk X (x,y,z)∈Th qt(u, v) s P(w|u, v) max{1, mi(t) (u, v)} · qt(x, y|w) s P(z|x, y) max{1, mi(t) (x, y)} = O (|A| ln T + ln (L/δ)) · p |Sk| |Sk+1| |Sh| |Sh+1| . (4.68) Proof. Eq. (4.66) and Eq. (4.67) are from [46]. For Eq. (4.68), by direct calculation we have X T t=1 X (u,v,w)∈Tk X (x,y,z)∈Th qt(u, v) s P(w|u, v) max{1, mi(t) (u, v)} · qt(x, y|w) s P(z|x, y) max{1, mi(t) (x, y)} = X T t=1 X (u,v,w)∈Tk X (x,y,z)∈Th s qt(u, v)P(z|x, y)qt(x, y|w) max{1, mi(t) (u, v)} · s qt(u, v)P(w|u, v)qt(x, y|w) max{1, mi(t) (x, y)} ≤ vuutX T t=1 X (u,v,w)∈Tk X (x,y,z)∈Th qt(u, v)P(z|x, y)qt(x, y|w) max{1, mi(t) (u, v)} · vuutX T t=1 X (u,v,w)∈Tk X (x,y,z)∈Th qt(u, v)P(w|u, v)qt(x, y|w) max{1, mi(t) (x, y)} 147 ≤ vuut|Sk+1| X T t=1 X u∈Sk X a∈A qt(u, v) max{1, mi(t) (u, v)} · vuut|Sh+1| X T t=1 X x∈Sh X a∈A qt(x, y) max{1, mi(t) (x, y)} ≤ O (|A| ln T + ln (L/δ)) · p |Sk| |Sk+1| |Sh| |Sh+1| . Lemma 4.9.3.8. For all k = 0, . . . , L − 1, we have E X T t=1 X s∈Sk,a∈A qt(s, a) max{1, mi(t) (s, a)} = O (|Sk||A| ln T + |Sk||A|) (4.69) and E X T t=1 X s∈Sk,a∈A qt(s, a) q max{1, mi(t) (s, a)} = O p |Sk||A|T + |Sk||A| . (4.70) Proof. For each state-action pair (s, a), we have E "X T t=1 qt(s, a) max{1, mi(t) (s, a)} # = E "X T t=1 It(s, a) max{1, mi(t) (s, a)} # = E X N i=1 tiX +1−1 t=ti It(s, a) max{1, mi(s, a)} = E "X N i=1 mi+1(s, a) − mi(s, a) max{1, mi(s, a)} # ≤ 2E " 1 + Z 1+mN+1(s,a) 1 dx x # ≤ 2 (2 ln T + 1) where the second line follows from the definition of the indicator and occupancy measure qt , and the last line applies the fact mi+1(s, a) ≤ 2mi(s, a) when mi(s, a) ≥ 1. Taking the summation over all state-action pairs at layer k finishes the proof of Eq. (4.69). 148 Similarly, we have E X T t=1 qt(s, a) q max{1, mi(t) (s, a)} = E X T t=1 It(s, a) q max{1, mi(t) (s, a)} = E X N i=1 tiX +1−1 t=ti It(s, a) p max{1, mi(s, a)} = E "X N i=1 mi+1(s, a) − mi(s, a) p max{1, mi(s, a)} # ≤ 2E " 1 + Z mN+1(s,a) 0 dx √ x # ≤ 2 2 p mN+1(s, a) + 1 where mN+1(s, a) is the total number of visiting state-action pair (s, a). Taking the summation over all state-action pairs of layer k yields that E X s∈Sk X a∈A X T t=1 qt(s, a) q max{1, mi(t) (s, a)} ≤ X s∈Sk X a∈A 2 2 p mN+1(s, a) + 1 ≤ 2 2 p |Sk||A|T + |Sk||A| where the last inequality follows from the Cauchy-Schwarz inequality. Definition 4.9.3.9. (Residual Term) We define the residual term rt(s, a) as rt(s, a) = 40 3 k( Xs)−1 k=0 X (u,v,w)∈Tk qt(u, v) · P(w|u, v) ln T|S||A| δ max mi(t) (u, v), 1 · qt(s, a|w) + k( Xs)−1 k=0 k( Xs)−1 h=k+1 X (u,v,w)∈Tk X (x,y,z)∈Th qt(u, v)Bi(t) (u, v, w)qt(x, y|w)Bi(t) (x, y, z) + I{Ac }. (4.71) for all state-action pair (s, a) ∈ S × A and all episodes t ∈ [T]. 149 Lemma 4.9.3.10. The following hold: |qt(s, a) − qbt(s, a)| ≤ rt(s, a) + 4 k( Xs)−1 k=0 X (u,v,w)∈Tk qt(u, v) s P(w|u, v) ln ι max mi(t) (u, v), 1 qt(s, a|w) and E X T t=1 X s̸=sL X a∈A rt(s, a) = O L 2 |S| 3 |A| 2 ln2 T|S||A| δ + |S||A|T · δ . Proof. For simplicity, we let ι = T|S||A| δ and assume δ ∈ (0, 1). 
According to the Lemma 4.9.3.1, conditioning on event A, we have |qt(s, a) − qbt(s, a)| = k( Xs)−1 k=0 X (u,v,w)∈Tk qt(u, v) P(w|u, v) − P¯ i(t) (w|u, v) qbt(s, a|w) ≤ k( Xs)−1 k=0 X (u,v,w)∈Tk qt(u, v) P(w|u, v) − P¯ i(t) (w|u, v) qbt(s, a|w) ≤ k( Xs)−1 k=0 X (u,v,w)∈Tk qt(u, v)Bi(t) (u, v, w)qbt(s, a|w) Moreover, we apply Lemma 4.9.3.1 again to conditional occupancy measure and obtain |qt(s, a|w) − qbt(s, a|w)| ≤ k( Xs)−1 h=k(w) X (x,y,z)∈Th qt(x, y|w)Bi(t) (x, y, z)qbt(s, a|z) ≤ k( Xs)−1 h=k(w) X (x,y,z)∈Th qt(x, y|w)Bi(t) (x, y, z) where the second line applies the fact qbt(s, a|z) ≤ 1. 15 Combining these inequalities yields (under the event A) |qt(s, a) − qbt(s, a)| ≤ k( Xs)−1 k=0 X (u,v,w)∈Tk qt(u, v)Bi(t) (u, v, w)qt(s, a|w) + k( Xs)−1 k=0 X (u,v,w)∈Tk qt(u, v)Bi(t) (u, v, w) k( Xs)−1 h=k(w) X (x,y,z)∈Th qt(x, y|w)Bi(t) (x, y, z) ≤ 4 k( Xs)−1 k=0 X (u,v,w)∈Tk qt(u, v) s P(w|u, v) ln ι max mi(t) (u, v), 1 qt(s, a|w) + 40 3 k( Xs)−1 k=0 X (u,v,w)∈Tk qt(u, v) · P(w|u, v) ln ι max mi(t) (u, v), 1 · qt(s, a|w) + k( Xs)−1 k=0 k( Xs)−1 h=k+1 X (u,v,w)∈Tk X (x,y,z)∈Th qt(u, v)Bi(t) (u, v, w)qt(x, y|w)Bi(t) (x, y, z) where the second line follows from Lemma 4.9.3.3. On the other hand, |qt(s, a) − qbt(s, a)| ≤ 1 holds always. Combining the bounds of these two cases finishes the first statement. Recall the definition of the residual terms, we decompose the following into three terms Sum1, Sum2 and Sum3: E X T t=1 X s̸=sL X a∈A rt(s, a) = 40 3 E X T t=1 X s̸=sL X a∈A k( Xs)−1 k=0 X (u,v,w)∈Tk qt(u, v) · P(w|u, v) ln ι max mi(t) (u, v), 1 · qt(s, a|w) | {z } ≜Sum1 + E X T t=1 X s̸=sL X a∈A I{Ac } | {z } ≜Sum2 151 + E X T t=1 X s̸=sL X a∈A k( Xs)−1 k=0 k( Xs)−1 h=k+1 X (u,v,w)∈Tk X (x,y,z)∈Th qt(u, v)Bi(t) (u, v, w)qt(x, y|w)Bi(t) (x, y, z) | {z } ≜Sum3 . Then, we show that these terms are all logarithmic in T. Sum1 By direct calculation, we have Sum1 = 40 3 E X T t=1 X s̸=sL X a∈A k( Xs)−1 k=0 X (u,v,w)∈Tk qt(u, v) · P(w|u, v) ln ι max mi(t) (u, v), 1 · qt(s, a|w) = 40 3 E X T t=1 L X−1 k=0 X (u,v,w)∈Tk qt(u, v) · ln ι max mi(t) (u, v), 1 · X s̸=sL X a∈A P(w|u, v)qt(s, a|w) ≤ 40L 3 ln ιE X T t=1 X u̸=sL X v∈A · qt(u, v) max mi(t) (u, v), 1 = 80L 3 ln ι L X−1 k=0 |Sk||A|(ln T + 1)! = O L|S||A| ln2 ι (4.72) where the first line follows from the property of occupancy measures, and the last line applies Eq. (4.69) of Lemma 4.9.3.8. Sum2 According to the definition of event A, we have Sum2 = E X T t=1 X s̸=sL X a∈A I{Ac } = |S||A|T · E [I{Ac }] = |S||A|T · δ. (4.73) Sum3 First, we consider the term inside the expectation bracket and show the following conditioning on event A: X T t=1 X s̸=sL X a∈A k( Xs)−1 k=0 k( Xs)−1 h=k+1 X (u,v,w)∈Tk X (x,y,z)∈Th qt(u, v)Bi(t) (u, v, w)qt(x, y|w)Bi(t) (x, y, z) 15 ≤ 4 X T t=1 X s̸=sL X a∈A k( Xs)−1 k=0 k( Xs)−1 h=k+1 X (u,v,w)∈Tk X (x,y,z)∈Th qt(u, v) s P(w|u, v) ln ι max mi(t) (u, v), 1 qt(x, y|w)Bi(t) (x, y, z) + 40 3 X T t=1 X s̸=sL X a∈A k( Xs)−1 k=0 k( Xs)−1 h=k+1 X (u,v,w)∈Tk X (x,y,z)∈Th qt(u, v) P(w|u, v) ln ι max mi(t) (u, v), 1 ! qt(x, y|w)Bi(t) (x, y, z) ≤ 16|S||A| ln ι X T t=1 X k<h X (u,v,w)∈Tk X (x,y,z)∈Th qt(u, v) s P(w|u, v) max mi(t) (u, v), 1 qt(x, y|w) s P(z|x, y) max mi(t) (x, y), 1 + 160|S||A| 3 X T t=1 X k<h X (u,v,w)∈Tk X (x,y,z)∈Th qt(u, v) s P(w|u, v) ln ι max mi(t) (u, v), 1 qt(x, y|w) min ( P(z|x, y) ln ι max mi(t) (x, y), 1 , 1 ) + 40|S||A| 3 X T t=1 X k<h X (u,v,w)∈Tk X (x,y,z)∈Th qt(u, v) P(w|u, v) ln ι max mi(t) (u, v), 1 ! qt(x, y|w) where the second inequality follows from Lemma 4.9.3.3 and Corollary 4.9.3.4. 
Then we consider bounding these three different terms with the help of previous analysis. According to Eq. (4.68) of Lemma 4.9.3.7, The first term is bounded with probability at least 1 − 2δ ′ : 16|S||A| ln ι X T t=1 X k<h X (u,v,w)∈Tk X (x,y,z)∈Th qt(u, v) s P(w|u, v) max mi(t) (u, v), 1 qt(x, y|w) s P(z|x, y) max mi(t) (x, y), 1 ≤ 16|S||A| ln ι · O |A| ln T + ln(L/δ′ ) X k<h p |Sk| |Sk+1| |Sh| |Sh+1| ! ≤ 16|S||A| ln ι · O |A| ln T + ln(L/δ′ ) X k<h (|Sk| |Sk+1| + |Sh| |Sh+1|) ! ≤ O |A| ln T + ln(L/δ′ ) L|S| 3 |A| ln ι , where the third line follows from the AM-GM inequality. Taking the expectation with δ ′ = L ι , we have the expectation of the first term bounded by O L|S| 3 |A| 2 ln2 ι using Lemma 4.9.3.5. On the other hand, for the second term, we have 160|S||A| 3 X T t=1 X k<h X (u,v,w)∈Tk X (x,y,z)∈Th qt(u, v) s P(w|u, v) ln ι max mi(t) (u, v), 1 qt(x, y|w) min ( P(z|x, y) ln ι max mi(t) (x, y), 1 , 1 ) ≤ 80|S||A| 3 X T t=1 X k<h X (u,v,w)∈Tk X (x,y,z)∈Th qt(u, v)P(w|u, v)qt(x, y|w) P(z|x, y) ln ι max mi(t) (x, y), 1 ! + 80|S||A| 3 X T t=1 X k<h X (u,v,w)∈Tk X (x,y,z)∈Th qt(u, v) ln ι max mi(t) (u, v), 1 qt(x, y|w) ≤ 80L|S||A| 3 ln ι X T t=1 X x∈S X y∈A qt(x, y) max mi(t) (x, y), 1 ! + 80L|S| 2 |A| 3 ln ι X T t=1 X u̸=sL X v∈A qt(u, v) max mi(t) (u, v), 1 ! ≤ 160L|S| 2 |A| 3 ln ι X T t=1 X u̸=sL X v∈A qt(u, v) max mi(t) (u, v), 1 where the expectation of the final term is bounded O L|S| 3 |A| 2 ln2 ι with the help from Lemma 4.9.3.8. Similarly, we have the expectation of the third term bounded by O L|S| 3 |A| 2 ln2 ι following the same idea. Therefore, we have Sum3 bounded as Sum3 = O L|S| 3 |A| 2 ln2 ι + L|S| 3 |A| 2 ln2 ι + |S||A|T · δ = O L|S| 3 |A| 2 ln2 ι + |S||A|T · δ (4.74) where the |S||A|T · δ comes from the range of Sum3 and the probability of event Ac . Combining the bounds of Sum1, Sum2, and Sum3 stated in Eq. (4.72), Eq. (4.73) and Eq. (4.74) finishes the proof. Corollary 4.9.3.11. The following holds: |qt(s, a) − ut(s, a)| ≤ 4rt(s, a) + 16 k( Xs)−1 k=0 X (u,v,w)∈Tk qt(u, v) vuut P(w|u, v) ln T|S||A| δ max mi(t) (u, v), 1 qt(s, a|w). where qt is the true occupancy measure of episode t, and ut is the upper occupancy bound of episode t associated with confidence set Pi(t) and policy πt. Proof. Fix the state-action pair (s, a) and episode t . Let Pb be the transition in Pi(t) that realizes the maximum in the definition of ut(s, a), and qet = q P,π b t bet the associated occupancy measure. Therefore, we have qet(s, a) = ut(s, a). Conditioning on event A, we have |qt(s, a) − qet(s, a)| = k( Xs)−1 k=0 X (u,v,w)∈Tk qt(u, v) P(w|u, v) − Pb(w|u, v) qet(s, a|w) ≤ k( Xs)−1 k=0 X (u,v,w)∈Tk qt(u, v) P(w|u, v) − Pb(w|u, v) qet(s, a|w) ≤ 2 k( Xs)−1 k=0 X (u,v,w)∈Tk qt(u, v)Bi(t) (u, v, w)qet(s, a|w). Moreover, we apply Lemma 4.9.3.1 to terms qbt(s, a|w) and obtain |qt(s, a|w) − qet(s, a|w)| ≤ 2 k( Xs)−1 h=k(w) X (x,y,z)∈Th qt(x, y|w)Bi(t) (x, y, z)qet(s, a|z) ≤ 2 k( Xs)−1 h=k(w) X (x,y,z)∈Th qt(x, y|w)Bi(t) (x, y, z) where the second line uses qbt(s, a|z) ≤ 1. 
Combining these inequalities yields (under the event A) |qt(s, a) − qbt(s, a)| ≤ 4 k( Xs)−1 k=0 X (u,v,w)∈Tk qt(u, v)Bi(t) (u, v, w)qt(s, a|w) + 4 k( Xs)−1 k=0 X (u,v,w)∈Tk qt(u, v)Bi(t) (u, v, w) k( Xs)−1 h=k(w) X (x,y,z)∈Th qt(x, y|w)Bi(t) (x, y, z) 155 ≤ 16 k( Xs)−1 k=0 X (u,v,w)∈Tk qt(u, v) s P(w|u, v) ln ι max mi(t) (u, v), 1 qt(s, a|w) + 160 3 k( Xs)−1 k=0 X (u,v,w)∈Tk qt(u, v) · P(w|u, v) ln ι max mi(t) (u, v), 1 · qt(s, a|w) + 4 k( Xs)−1 k=0 k( Xs)−1 h=k+1 X (u,v,w)∈Tk X (x,y,z)∈Th qt(u, v)Bi(t) (u, v, w)qt(x, y|w)Bi(t) (x, y, z) where the second line follows from Lemma 4.9.3.3. On the other hand, |qt(s, a) − qet(s, a)| ≤ 1 holds always. Combining the bounds of these two cases finishes the proof. Lemma 4.9.3.12. Algorithm 5 ensures N ≤ 4|S||A|(log T + 1) where N is the number of epochs. Proof. For a fixed state-action pair (s, a), let the i1 ≤ i2 ≤ . . . ≤ ik denotes the epochs that triggered by this state-action pair, that is {i1, i2, . . . , ik} = {i : i ∈ 1, . . . N, mi(s, a) ≥ max {1, 2 · mi−1(s, a)}} . Clearly, it holds that 1 = mi1 (s, a), and miτ (s, a) ≥ 2miτ−1 (s, a)τ ∈ 2, . . . , k which indicates that mik (s, a) ≥ 2 k−1 . Combining with the fact that mik (s, a) ≤ T, we have k = |{i1, i2, . . . , ik}| ≤ 4 log T + 4. Taking the summation over all state-action pairs finishes the proof. 156 Chapter 5 Robustness and Adaptivity towards Adversarial Losses and Transitions In this chapter, we develop algorithms that can handle both adversarial losses and adversarial transitions, with regret increasing smoothly in the degree of maliciousness of the adversary. More concretely, we first propose a robust algorithm that enjoys Oe( √ T + C P) regret where C P measures how adversarial the transition functions are and can be at most O(T). While this algorithm itself requires knowledge of C P, we further develop a black-box reduction approach that removes this requirement. Moreover, we also show that further refinements of the algorithm not only maintains the same regret bound, but also simultaneously adapts to easier environments (where losses are generated in a certain stochastically constrained manner as in Chapter 4) and achieves Oe(U + √ UCL + C P) regret, where U is some standard gap-dependent coefficient and C L is the amount of corruption on losses. 5.1 Introduction Recently, many studies on RL have been conducted to deal with time-evolving or even adversarial corrupted environments. In particular, one line of research, originated from [30] and later improved or generalized by e.g. [70, 73, 98, 79, 46], takes inspiration from the online learning literature and 157 considers interacting with a sequence of T MDPs, each with an adversarially chosen loss function. Despite facing such a challenging environment, the learner can still ensure O( √ T) regret (ignoring other dependence; same below) as shown by these works, that is, the learner’s average performance is close to that of the best fixed policy up to O(1/ √ T). However, one caveat of these studies is that they all still require the MDPs to have to the same transition function. This is not for no reason — Abbasi Yadkori et al. [1] shows that even with full information feedback, achieving sub-linear regret with adversarial transition functions is computationally hard, and Tian et al. [86] complements this result by showing that under the more challenging bandit feedback, this goal becomes even information-theoretically impossible without paying exponential dependence on the episode length. 
To get around such impossibility results, one natural idea is to allow the regret to depend on some measure of maliciousness C P of the transition functions, which is 0 when the transitions remain the same over time and O(T) in the worst case when they are completely arbitrary. We review several such attempts at the end of this section, and point out here that they all suffer one issue: even when C P = 0, the algorithms developed in these works all suffer linear regret when the loss functions are completely arbitrary, while, as mentioned above, O( √ T) regret is achievable in this case. This begs the question: when learning with completely adversarial MDPs, is O( √ T +C P) regret achievable? In this chapter, we not only answer this question affirmatively, but also show that one can perform even better sometimes. More concretely, our results are as follows. 1. In Section 5.2, we develop a variant of the UOB-REPS algorithm [46], achieving Oe( √ T + C P) regret in completely adversarial environments when C P is known. The algorithmic modifications we propose include an enlarged confidence set, using the log-barrier regularizer, and a novel amortized bonus term that leads to a critical “change of measure” effect in the analysis. 158 2. We then remove the requirement on the knowledge of C P in Section 5.3 by proposing a blackbox reduction that turns any algorithm with Oe( √ T + C P) regret under known C P into another algorithm with the same guarantee (up to logarithmic factors) even if C P is unknown. Our reduction improves that of Wei, Dann, and Zimmert [88] by allowing adversarial losses, which presents extra challenges as discussed in Pacchiano, Dann, and Gentile [75]. The idea of our reduction builds on top of previous adversarial model selection framework (a.k.a. Corral [5, 32, 60]), but is even more general and is of independent interest: it shows that the requirement from previous work on having a stable input algorithm is actually redundant, since our method can turn any algorithm into a stable one. 3. Finally, in Section 5.4 we also further refine our algorithm so that it simultaneously adapts to the maliciousness of the loss functions and achieves Oe(min{ √ T, U + √ UCL}+C P) regret, where U is some standard gap-dependent coefficient and C L ≤ T is the amount of corruption on losses (this result unfortunately requires the knowledge of C P, but not U or C L ). This generalizes the so-called best-of-both-worlds guarantee of Jin and Luo [50], Jin, Huang, and Luo [47], and Dann, Wei, and Zimmert [26] from C P = 0 to any C P, and is achieved by combining the ideas from Jin, Huang, and Luo [47] and Ito [42] with a novel optimistic transition technique. In fact, this technique also leads to improvement on the dependence of episode length even when C P = 0. Related Work Here, we review how existing studies deal with adversarially chosen transition functions and how our results compare to theirs. The closest line of research is usually known as corruption robust reinforcement learning [62, 20, 94, 88], which assumes a ground truth MDP and measures the maliciousness of the adversary via the amount of corruption to the ground truth — the amount of corruption is essentially our C L , while the amount of corruption is essentially our C P (these will become clear after we provide their formal definitions later). 
Naturally, the regret in 159 these works is defined as the difference between the learner’s total loss and that of the best policy with respect to the ground truth MDP, in which case Oe( √ T + C P + C L ) regret is unavoidable and is achieved by the state-of-the-art [88]. On the other hand, following the canonical definition in online learning, we define regret with respect to the corrupted MDPs, in which case Oe( √ T + C P) is achievable as we show (regardless how large C L is). To compare these results, note that the two regret definitions differ from each other by an amount of at most O(C P + C L ). Therefore, our result implies that of Wei, Dann, and Zimmert [88], but not vice versa — what Wei, Dann, and Zimmert [88] achieves in our definition of regret is again Oe( √ T + C P + C L ), which is never better than ours and could be Ω(T) even when C P = 0. In fact, our result also improves upon that of Wei, Dann, and Zimmert [88] in terms of the gapdependent refinement — their refined bound is Oe(min{ √ T, G}+C P +C L ) for some gap-dependent measure G that is known to be no less than our gap-dependent measure U; on the other hand, based on earlier discussion, our refined bound in their regret definition is Oe(min{ √ T, U + √ UCL}+ C P + C L ) = Oe(min{ √ T, U} + C P + C L ) and thus better. The caveat is that, as mentioned, for this refinement our result requires the knowledge of C P, but Wei, Dann, and Zimmert [88] does not.∗ However, we emphasize again that for the gap-independent bound, our result does not require knowledge of C P and is achieved via an even more general black-box reduction compared to the reduction of Wei, Dann, and Zimmert [88]. Finally, we mention that another line of research, usually known as non-stationary reinforcement learning, also allows arbitrary transition/loss functions and measures the difficulty by either the number of changes in the environment [14, 37] or some smoother measure such as the total variation across time [91, 21]. These results are less comparable to ours since their regret (known as dynamic ∗When C P + C L is known, Lykouris et al. [62] also achieves Oe(min{ √ T, U} + C P + C L ) regret, but similar to earlier discussions on gap-independent bounds, their regret definition is weaker than ours. 160 regret) measures the performance of the learner against the best sequence of policies, while ours (known as static regret) measures the performance against the best fixed policy. 5.2 Achieving O( √ T + C P) with Known C P As the first step, we develop an algorithm that achieves our goal when C P is known. To introduce our solution, we first briefly review the UOB-REPS algorithm of Jin et al. [46] (designed for C P = 0) and point out why simply using the enlarged confidence set Eq. (2.5) when C P ̸= 0 is far away from solving the problem. Specifically, UOB-REPS maintains a sequence of occupancy measures {qbt} T t=1 via OMD: qbt+1 = argminq∈Ω(Pi(t+1)) η⟨q, ℓbt⟩ + Dϕ(q, qbt). Here, η > 0 is a learning rate, ℓbt is the loss estimator defined in Eq. (2.8), ϕ is the negative entropy regularizer, and Dϕ is the corresponding Bregman divergence.† With qbt at hand, in episode t, the learner simply executes πt = π qbt . Standard analysis of OMD ensures a bound on the estimated regret Reg = E[ P t ⟨qbt − q P,˚π , ℓbt⟩], and the rest of the analysis of Jin et al. [46] boils down to bounding the difference between Reg and RegT . 
First Issue This difference between Reg and RegT leads to the first issue when one tries to analyze UOB-REPS against adversarial transitions — it contains the following bias term that measures the difference between the optimal policy’s estimated loss and its true loss: E "X T t=1 D q P,˚π , ℓbt − ℓt E # = E "X T t=1 X s,a q P,˚π (s, a)ℓt(s, a) q Pt,πt (s, a) − ut(s, a) ut(s, a) # . (5.1) When C P = 0, we have P = Pt , and thus under the high probability event Econ and by the definition of upper occupancy bound, we know q Pt,πt (s, a) ≤ ut(s, a), making Eq. (5.1) negligible. However, †The original loss estimator of Jin et al. [46] is slightly different, but that difference is only for the purpose of obtaining a high probability regret guarantee, which we do not consider in this work for simplicity. 161 this argument breaks when C P ̸= 0 and P ̸= Pt . In fact, Pt can be highly different from any transitions in Pi(t) with respect to which ut is defined, making Eq. (5.1) potentially huge. Solution: Change of Measure via Amortized Bonuses Given that q P,πt (s, a) ≤ ut(s, a) does still hold with high probability, Eq. (5.1) is (approximately) bounded by E "X T t=1 X s,a q P,˚π (s, a) |q Pt,πt (s, a) − q P,πt (s, a)| ut(s, a) # = E "X T t=1 X s q P,˚π (s) |q Pt,πt (s) − q P,πt (s)| ut(s) # which is at most E hPT t=1 P s q P,˚π (s) CP t ut(s) i since q Pt,πt (s) − q P,πt (s) is bounded by the per-round corruption C P t (see Corollary 5.8.3.6). While this quantity is potentially huge, if we could “change the measure” from q P,˚π to qbt , then the resulting quantity E hPT t=1 P s qbt(s) CP t ut(s) i is at most |S|C P since qbt(s) ≤ ut(s) by definition. The general idea of such a change of measure has been extensively used in the online learning literature (see Luo, Wei, and Lee [59] in a most related context) and can be realized by changing the loss fed to OMD from ℓbt to ℓbt − bt for some bonus term bt , which, in our case, should satisfy bt(s, a) ≈ CP t ut(s) . However, the challenge here is that C P t is unknown! Our solution is to introduce a type of efficiently computable amortized bonuses that do not change the measure per round, but do so overall. Specifically, our amortized bonus bt is defined as bt(s, a) = 4L ut(s) if Pt τ=1 I{⌈log2 uτ (s)⌉ = ⌈log2 ut(s)⌉} ≤ CP 2L , 0 else, (5.2) which we also write as bt(s) since it is independent of a. To understand this definition, note that −⌈log2 ut(s)⌉ is exactly the unique integer j such that ut(s) falls into the bin (2−j−1 , 2 −j ]. Therefore, the expression Pt τ=1 I{⌈log2 uτ (s)⌉ = ⌈log2 ut(s)⌉} counts, among all previous rounds τ = 1, . . . , t, how many times we have encountered a uτ (s) value that falls into the same bin as ut(s). If this 162 number does not exceed CP 2L , we apply a bonus of 4L ut(s) , which is (two times of) the maximum possible value of the unknown quantity CP t ut(s) ; otherwise, we do not apply any bonus. The idea is that by enlarging the bonus to it maximum value and stopping it after enough times, even though each bt(s) might be quite different from CP t ut(s) , overall they behave similarly after T episodes: Lemma 5.2.1. The amortized bonus defined in Eq. (5.2) satisfies PT t=1 CP t ut(s) ≤ PT t=1 bt(s) and PT t=1 qbt(s)bt(s) = O(C P log T) for any s. Therefore, the problematic term Eq. (5.1) is at most E[ P t ⟨q P,˚π , bt⟩], which, if “converted” to E[ P t ⟨qbt , bt⟩] (change of measure), is nicely bounded by O(|S|C P log T). 
As mentioned, such a change of measure can be realized by feeding ℓbt − bt instead of ℓbt to OMD, because now standard analysis of OMD ensures a bound on Reg = E[ P t ⟨qbt−q P,˚π , ℓbt−bt⟩], which, compared to the earlier definition of Reg, leads to a difference of E[ P t ⟨qbt − q P,˚π , bt⟩] (see Section 5.5.5 for details). Second Issue The second issue comes from analyzing Reg (which exists even if no bonuses are used). Specifically, standard analysis of OMD requires bounding a “stability” term, which, for the negative entropy regularizer, is in the form of E[ P t P s,a qbt(s, a)ℓbt(s, a) 2 ] = E[ P t P s,a qbt(s, a) q Pt ,πt (s,a)ℓt(s,a) 2 ut(s,a) 2 ] ≤ E[ P t P s,a q Pt ,πt (s,a) ut(s,a) ]. Once again, when C P = 0 and Pt = P, we have q Pt,πt bounded by ut(s, a) with high probability, and thus the stability term is O(T|S||A|); but this breaks if C P ̸= 0 and Pt can be arbitrarily different from transitions in Pi(t) . Solution: Log-Barrier Regularizer Resolving this second issue, however, is relatively straightforward — it suffices to switch the regularizer from negative entropy to log-barrier: ϕ(q) = − PL−1 k=0 P (s,a,s′)∈Wk log q(s, a, s′ ), which is first used by Lee et al. [55] in the context of learning adversarial MDPs but dates back to earlier work such as Foster et al. [33] for multi-armed bandits. An important property of log-barrier is that it leads to a smaller stability term in the form of E[ P t P s,a qbt(s, a) 2 ℓbt(s, a) 2 ] (with an extra qbt(s, a)), which is at most E[ P t P s,a q Pt,πt (s, a)ℓt(s, a) 2 ] = 163 Algorithm 8 Algorithm for Adversarial Transitions (with Known C P) Input: confidence parameter δ ∈ (0, 1), learning rate η > 0. Initialize: epoch index i = 1; counters m1(s, a) = m1(s, a, s′ ) = m0(s, a) = m0(s, a, s′ ) = 0 for all (s, a, s′ ); empirical transition P¯ 1 and confidence width B1 based on Eq. (2.4) and Eq. (2.6); occupancy measure qb1(s, a, s′ ) = 1 |Sk||A||Sk+1| for all (s, a, s′ ); and initial policy π1 = π qb1 . for t = 1, . . . , T do Execute policy πt and obtain trajectory (st,k, at,k) for k = 0, . . . , L − 1. Construct loss estimator ℓbt as defined in Eq. (2.8). Update bt(s) for all s based on Eq. (5.2). Increase counters: for each k < L, mi(st,k, at,k, st,k+1) +← 1, mi(st,k, at,k) +← 1. if ∃k, mi(st,k, at,k) ≥ max{1, 2mi−1(st,k, at,k)} then ▷ entering a new epoch Increase epoch index i +← 1. Initialize new counters: ∀(s, a, s′ ), mi(s, a, s′ ) = mi−1(s, a, s′ ), mi(s, a) = mi−1(s, a). Update confidence set Pi based on Eq. (2.5). Let Dϕ(·, ·) be the Bregman divergence with respect to log barrier (Eq. (5.6)) and compute qbt+1 = argmin q∈Ω(Pi) η D q, ℓbt − bt E + Dϕ(q, qbt). (5.3) Update policy πt+1 = π qbt+1 . O(T L) since qbt(s, a) ≤ ut(s, a). In fact, this also helps control the extra stability term when bonuses are used, which is in the form of E[ P t P s,a qbt(s, a) 2 bt(s, a) 2 ] and is at most 4LE[ P t ⟨qbt , bt⟩] = O(L|S|C P log T) according to Lemma 5.2.1. Putting these two ideas together leads to our final algorithm (see Algorithm 8). We prove the following regret bound in Section 5.5, which recovers that of Jin et al. [46] when C P = 0 and increases linearly in C P as desired. Theorem 5.2.2. With δ = 1/T and η = min q |S| 2|A| log(ι) LT , 1 8L , Algorithm 8 ensures RegT = O L|S| p |A|T log (ι) + L|S| 4 |A| log2 (ι) + C PL|S| 4 |A| log(ι) . 164 5.3 Achieving O( √ T + C P) with Unknown C P In this section, we address the case when the amount of corruption is unknown. 
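Before turning to the unknown-$C^P$ case, one implementation-level detail of Algorithm 8 worth isolating is its doubling epoch schedule, which decides when a new confidence set is built. The sketch below uses our own bookkeeping (it is not code from the dissertation) and mirrors the "entering a new epoch" test in Algorithm 8's main loop.

```python
from collections import defaultdict

class EpochSchedule:
    """Doubling epoch schedule from Algorithm 8's main loop (sketch).

    A new epoch (and hence a new confidence set) starts as soon as some
    state-action pair visited in the current episode has at least doubled
    its visit count since the last epoch boundary.
    """

    def __init__(self):
        self.epoch = 1
        self.m = defaultdict(int)        # current visit counts
        self.m_prev = defaultdict(int)   # counts at the last epoch boundary

    def observe_episode(self, trajectory):
        """trajectory: iterable of (s, a) pairs visited in this episode."""
        triggered = False
        for s, a in trajectory:
            self.m[(s, a)] += 1
            if self.m[(s, a)] >= max(1, 2 * self.m_prev[(s, a)]):
                triggered = True
        if triggered:
            self.epoch += 1
            self.m_prev = defaultdict(int, self.m)  # snapshot at the boundary
        return triggered
```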
We develop a black-box reduction which turns an algorithm that only deals with known C P to one that handles unknown C P. This is similar to Wei, Dann, and Zimmert [88] but additionally handles adversarial losses using a different approach. A byproduct of our reduction is that we develop an entirely black-box model selection approach for adversarial online learning problems, as opposed to the gray-box approach developed by the “Corral” literature [5, 32, 60] which requires checking if the base algorithm is stable. To achieve this, we essentially develop another layer of reduction that turns any standard algorithm with sublinear regret into a stable algorithm. This result itself might be of independent interest and useful for solving other model selection problems. More specifically, our reduction has two layers. The bottom layer is where our novelty lies: it takes as input an arbitrary corruption-robust algorithm that operates under known C P (e.g., the one we developed in Section 5.2), and outputs a stable corruption-robust algorithm (formally defined later) that still operates under known C P. The top layer, on the other hand, follows the standard Corral idea and takes as input a stable algorithm that operates under known C P, and outputs an algorithm that operates under unknown C P. Below, we explain these two layers of reduction in details. Bottom Layer (from an Arbitrary Algorithm to a Stable Algorithm) The input of the bottom layer is an arbitrary corruption-robust algorithm, formally defined as: 165 Definition 5.3.1. An adversarial MDP algorithm is corruption-robust if it takes θ (a guess on the corruption amount) as input, and achieves the following regret for any random stopping time t ′ ≤ T: max π E " t X′ t=1 (ℓt(πt) − ℓt(π))# ≤ E hp β1t ′ + (β2 + β3θ)I{t ′ ≥ 1} i + Pr[C P 1:t ′ > θ]LT for problem-dependent constants and log(T) factors β1 ≥ L 2 , β2 ≥ L, β3 ≥ 1, where C P 1:t ′ = Pt ′ τ=1 C P τ is the total corruption up to time t ′ . While the regret bound in Definition 5.3.1 might look cumbersome, it is in fact fairly reasonable: if the guess θ is not smaller than the true corruption amount, the regret should be of order √ t ′ + θ; otherwise, the regret bound is vacuous since LT is its largest possible value. The only extra requirement is that the algorithm needs to be anytime (i.e., the regret bound holds for any stopping time t ′ ), but even this is known to be easily achievable by using a doubling trick over a fixed-time algorithm. It is then clear that our algorithm in Section 5.2 (together with a doubling trick) indeed satisfies Definition 5.3.1. As mentioned, the output of the bottom layer is a stable robust algorithm. To characterize stability, we follow Agarwal et al. [5] and define a new learning protocol that abstracts the interaction between the output algorithm of the bottom layer and the master algorithm from the top layer: Protocol 1. In every round t, before the learner makes a decision, a probability wt ∈ [0, 1] is revealed to the learner. After making a decision, the learner sees the desired feedback from the environment with probability wt , and sees nothing with probability 1 − wt . In such a learning protocol, Agarwal et al. [5] defines a stable algorithm as one whose regret smoothly degrades with ρT = 1 mint∈[T ] wt . 
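To make Protocol 1 concrete, here is a small simulation sketch; the `learner` and `env` interfaces are hypothetical stand-ins introduced only for this illustration, not objects defined in this chapter.

```python
import random

def run_protocol1(learner, env, weights, seed=0):
    """Simulate Protocol 1.

    At each round a probability w_t is revealed to the learner before it
    acts; after acting, the feedback is shown to the learner only with
    probability w_t and is withheld otherwise.
    """
    rng = random.Random(seed)
    for w_t in weights:
        policy = learner.act(w_t)       # w_t is revealed before the decision
        feedback = env.step(policy)
        if rng.random() < w_t:          # feedback revealed with prob. w_t
            learner.update(feedback)
        # otherwise the learner sees nothing this round
```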
For our purpose here, we additionally require that the dependence on C P in the regret bound is linear, which results in the following definition: 166 Algorithm 9 STable Algorithm By Independent Learners and Instance SElection (STABILISE) Input: C P and a base algorithm satisfying Definition 5.3.1. Initialize: ⌈log2 T⌉ instances of the base algorithm ALG1, . . . , ALG⌈log2 T⌉ , where ALGj is configured with the parameter θ = θj ≜ 2 −j+1C P + 16Llog(T). for t = 1, 2, . . . do Receive wt . if wt ≤ 1 T then play an arbitrary policy πt continue (without updating any instances) Let jt be such that wt ∈ (2−jt−1 , 2 −jt ]. Let πt be the policy suggested by ALGjt . Output πt . If feedback is received, send it to ALGjt with probability 2−jt−1 wt , and discard it otherwise. Definition 5.3.2 ( 1 2 -stable corruption-robust algorithm). A 1 2 -stable corruption-robust algorithm is one that, with prior knowledge on C P, achieves RegT ≤ E √ β1ρT T + β2ρT + β3C P under Protocol 1 for problem-dependent constants and log(T) factors β1 ≥ L 2 , β2 ≥ L, and β3 ≥ 1. For simplicity, we only define and discuss the 1 2 -stability notion here (the parameter 1 2 refers to the exponent of T), but our result can be straightforwardly extended to the general α-stability notion for α ∈ [ 1 2 , 1) as in [5]. Our main result in this section is then that one can convert any corruption-robust algorithm into a 1 2 -stable corruption-robust algorithm: Theorem 5.3.3. If an algorithm is corruption robust according to Definition 5.3.1 for some constants (β1, β2, β3), then one can convert it to a 1 2 -stable corruption-robust algorithm (Definition 5.3.2) with constants (β ′ 1 , β′ 2 , β′ 3 ) where β ′ 1 = O(β1 log T), β′ 2 = O(β2 + β3Llog T), and β ′ 3 = O(β3 log T). This conversion is achieved by a procedure that we call STABILISE (see Algorithm 9 for details). The high-level idea of STABILISE is as follows. Noticing that the challenge when learning in Protocol 1 is that wt varies over time, we discretize the value of wt and instantiate one instance of the input algorithm to deal with one possible discretized value, so that it is learning in Protocol 1 167 but with a fixed wt , making it straightforward to bound its regret based on what it promises in Definition 5.3.1. More concretely, STABILISE instantiates O(log2 T) instances {ALGj} ⌈log2 T⌉ j=0 of the input algorithm that satisfies Definition 5.3.1, each with a different parameter θj . Upon receiving wt from the environment, it dispatches round t to the j-th instance where j is such that wt ∈ (2−j−1 , 2 −j ], and uses the policy generated by ALGj to interact with the environment (if wt ≤ 1 T , simply ignore this round). Based on Protocol 1, the feedback for this round is received with probability wt . To equalize the probability of ALGj receiving feedback as mentioned in the high-level idea, when the feedback is actually obtained, STABILISE sends it to ALGj only with probability 2−j−1 wt (and discards it otherwise). This way, every time ALGj is assigned to a round, it always receives the desired feedback with probability wt · 2−j−1 wt = 2−j−1 . This equalization step is the key that allows us to use the original guarantee of the base algorithm (Definition 5.3.1) and run it as it is, without requiring it to perform extra importance weighting steps as in Agarwal et al. [5]. The choice of θj is crucial in making sure that STABILISE only has C P regret overhead instead of ρT C P. 
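The dispatch-and-subsample rule of STABILISE can be sketched as follows; the base-instance interface here is hypothetical, and Algorithm 9 remains the authoritative description.

```python
import math
import random

class Stabilise:
    """Sketch of the STABILISE dispatch rule (Algorithm 9).

    base_algs[j] is a base instance meant to be configured with
    theta_j = 2 ** (-j + 1) * C_P + 16 * L * log(T); each instance is
    assumed to expose .act() -> policy and .update(feedback).
    """

    def __init__(self, base_algs, T, fallback_policy=None, seed=0):
        self.base_algs = base_algs
        self.T = T
        self.fallback_policy = fallback_policy
        self.rng = random.Random(seed)

    def act(self, w_t):
        """Return (policy, j); j is None when the round is ignored."""
        if w_t <= 1.0 / self.T:
            return self.fallback_policy, None
        j = int(math.floor(-math.log2(w_t)))   # w_t lies in (2^{-j-1}, 2^{-j}]
        return self.base_algs[j].act(), j

    def feed(self, j, w_t, feedback):
        """Call only when feedback was actually revealed (prob. w_t)."""
        if j is None:
            return
        # forward with prob. 2^{-j-1} / w_t, so ALG_j sees feedback w.p. 2^{-j-1}
        if self.rng.random() < 2.0 ** (-j - 1) / w_t:
            self.base_algs[j].update(feedback)
```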
Since ALGj only receives feedback with probability 2 −j−1 , the expected total corruption it experiences is on the order of 2 −j−1C P. Therefore, its input parameter θj only needs to be of this order instead of the total corruption C P. This is similar to the key idea of Wei, Dann, and Zimmert [88] and Lykouris, Mirrokni, and Paes Leme [61]. See Section 5.6.1 for more details and the full proof of Theorem 5.3.3. Top Layer (from Known C P to Unknown C P) With a stable algorithm and a regret guarantee in Definition 5.3.2, it is relatively standard to convert it to an algorithm with Oe( √ T + C P) regret without knowing C P. Similar arguments have been made in Foster et al. [32], and the idea is to have another specially designed OMD/FTRL-based master algorithm to choose on the fly among a set of instances of this stable base algorithm, each with a different guess on C P (the probability 168 wt in Protocol 1 is then decided by this master algorithm). We defer all details to Section 5.6. The final regret guarantee is the following (Oe(·) hides log(T) factors). Theorem 5.3.4. Using an algorithm satisfying Definition 5.3.2 as a base algorithm, Algorithm 10 (in the appendix) ensures RegT = Oe √ β1T + β2 + β3C P without knowing C P. 5.4 Gap-Dependent Refinements with Known C P Finally, we discuss how to further improve our algorithm so that it adapts to easier environments and enjoys a better bound when the loss functions satisfy a certain gap condition, while still maintaining the O( √ T + C P) robustness guarantee. This result unfortunately requires the knowledge of C P because the black-box approach introduced in the last section leads to √ T regret overhead already. We leave the possibility of removing this limitation for future work. More concretely, following prior work such as Jin and Luo [50], we consider the following general condition: there exists a mapping π ⋆ : S → A, a gap function ∆ : S × A → (0, L], and a constant C L ≥ 0, such that for any policies π1, . . . , πT generated by the learner, we have E "X T t=1 D q P,πt − q P,π⋆ , ℓt E # ≥ E X T t=1 X s̸=sL X a̸=π⋆(s) q P,πt (s, a)∆(s, a) − C L . (5.4) It has been shown that this condition subsumes the case when the loss functions are drawn from a fixed distribution (in which case π ⋆ is simply the optimal policy with respect to the loss mean and P, ∆ is the gap function with respect to the optimal Q-function, and C L = 0), or further corrupted by an adversary in an arbitrary manner subject to a budget of C L ; we refer the readers to Jin and Luo [50] for detailed explanation. Our main result for this section is a novel algorithm (whose pseudocode is deferred to Section 5.7 due to space limit) that achieves the following best-of-bothworld guarantee. 16 Theorem 5.4.1. Algorithm 11 (with δ = 1/T 2 and γt defined as in Definition 5.4.2) ensures RegT (˚π) = O L 2 |S||A| log (ι) √ T + C P + 1 L 2 |S| 4 |A| 2 log2 (ι) always, and simultaneously the following gap-dependent bound under Condition (5.4): RegT (π ⋆ ) = O U + √ UCL + C P + 1 L 2 |S| 4 |A| 2 log2 (ι) , where U = L3 |S| 2 |A| log2 (ι) ∆min + P s̸=sL P a̸=π⋆(s) L2 |S||A| log2 (ι) ∆(s,a) and ∆min = min s̸=sL,a̸=π⋆(s) ∆(s, a). Aside from having larger dependence on parameters L, S, and A, Algorithm 11 maintains the same O( √ T + C P) regret as before, no matter how losses/transitions are generated; additionally, the √ T part can be significantly improved to O(U + √ UCL) (which can be of order only log2 T when C L is small) under Condition (5.4). 
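As a quick numerical aid, the gap-dependent quantity U of Theorem 5.4.1 can be computed directly from a gap table; the sketch below uses our own (hypothetical) data layout for the gaps and the mapping π*.

```python
import math

def gap_complexity_U(gaps, pi_star, L, S, A, iota):
    """Compute U from Theorem 5.4.1.

    gaps    : dict mapping (s, a) to Delta(s, a) > 0 for non-terminal s
    pi_star : dict mapping s to the action pi*(s)
    Only pairs with a != pi*(s) enter the sum, matching the theorem.
    """
    log_sq = math.log(iota) ** 2
    off_gaps = [g for (s, a), g in gaps.items() if a != pi_star[s]]
    U = L**3 * S**2 * A * log_sq / min(off_gaps)          # Delta_min term
    U += sum(L**2 * S * A * log_sq / g for g in off_gaps)  # per-pair terms
    return U
```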
This result not only generalizes that of Jin, Huang, and Luo [47] and Dann, Wei, and Zimmert [26] from C P = 0 to any C P, but in fact also improves their results by having smaller dependence on L in the definition of U. In the rest of this section, we describe the main ideas of our algorithm. FTRL with Epoch Schedule Our algorithm follows a line of research originated from Wei and Luo [90] and Zimmert and Seldin [100] for multi-armed bandits and uses FTRL (instead of OMD) together with a certain self-bounding analysis technique. Since FTRL does not deal with varying decision sets easily, similar to Jin, Huang, and Luo [47], we restart FTRL from scratch at the beginning of each epoch i (recall the epoch schedule described in Chapter 2). More specifically, in an episode t that belongs to epoch i, we now compute qbt as argminq q,Pt−1 τ=ti (ℓbτ − bτ ) + ϕt(q), where ti is the first episode of epoch i, ℓbt is the same loss estimator defined in Eq. (2.8), bt is the amortized bonus defined in Eq. (5.2) (except that τ = 1 there is also changed to τ = ti due to 170 restarting), ϕt is a time-varying regularizer to be specified later, and the set that q is optimized over is also a key element to be discussed next. As before, the learner then simply executes πt = π qbt for this episode. Optimistic Transition An important idea from Jin, Huang, and Luo [47] is that if FTRL optimizes q over Ω(P¯ i) (occupancy measures with respect to a fixed transition P¯ i) instead of Ω(Pi) (occupancy measures with respect to a set of plausible transitions) as in UOB-REPS, then a critical loss-shifting technique can be applied in the analysis. However, the algorithm lacks “optimism” when not using a confidence set, which motivates Jin, Huang, and Luo [47] to instead incorporate optimism by subtracting a bonus term Bonus from the loss estimator (not to be confused with the amortized bonus bt we propose in this work). Indeed, if we define the value function V P,π ¯ (s; ℓ) as the expected loss one suffers when starting from s and following π in an MDP with transition P¯ and loss ℓ, then they show that the Bonus term is such that V P¯ i,π(s; ℓ − Bonus) ≤ V P,π(s; ℓ) for any state s and any loss function ℓ, that is, the performance of any policy is never underestimated. Instead of following the same idea, here, we propose a simpler and better way to incorporate optimism via what we call optimistic transitions. Specifically, for each epoch i, we simply define an optimistic transition function Pei such that Pei(s ′ |s, a) = max 0, P¯ i(s ′ |s, a) − Bi(s, a, s′ ) (recall the confidence interval Bi defined in Eq. (2.6)). Since this makes P s ′ Pei(s ′ |s, a) less than 1, we allocate all the remaining probability to the terminal state sL (which breaks the layer structure but does not really affect anything). This is a form of optimism because reaching the terminate state earlier can only lead to smaller loss. More formally, under the high probability event Econ, we prove V Pei,π(s; ℓ) ≤ V P,π(s; ℓ) for any policy π, any state s, and any loss function ℓ (see Lemma 5.7.8.3). With such an optimistic transition, we simply perform FTRL over Ω(Pei) without adding any additional bonus term (other than bt), making both the algorithm and the analysis much simpler than Jin, Huang, and Luo [47]. Moreover, it can also be shown that V P¯ i,π(s; ℓ−Bonus) ≤ V Pei,π(s; ℓ) 171 (see Lemma 5.7.8.4), meaning that while both loss estimation schemes are optimistic, ours is tighter than that of Jin, Huang, and Luo [47]. 
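A minimal sketch of the optimistic-transition construction for one layer is given below; the array layout is our assumption, and the formal statement is Definition 5.7.1.1 in Section 5.7.

```python
import numpy as np

def optimistic_transition(P_bar, B):
    """Optimistic transition for one layer (sketch of Definition 5.7.1.1).

    P_bar, B : arrays of shape (|S_k|, |A|, |S_{k+1}|) holding the empirical
               transition and the confidence widths for layer k.
    Each entry is shrunk by its confidence width (clipped at zero), and the
    probability mass removed this way is reassigned to the terminal state,
    returned here as an extra final column.
    """
    P_tilde = np.maximum(0.0, P_bar - B)
    to_terminal = 1.0 - P_tilde.sum(axis=-1, keepdims=True)
    return np.concatenate([P_tilde, to_terminal], axis=-1)


# toy usage: 2 states, 2 actions, 2 next states; rows still sum to one
P_bar = np.array([[[0.7, 0.3], [0.5, 0.5]],
                  [[0.9, 0.1], [0.2, 0.8]]])
B = np.full_like(P_bar, 0.05)
P_tilde = optimistic_transition(P_bar, B)
```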
This eventually leads to the aforementioned improvement in the U definition. Time-Varying Log-Barrier Regularizers The final element to be specified in our algorithm is the time-varying regularizer ϕt . Recall from discussions in Section 5.2 that using log-barrier as the regularizer is critical for bounding some stability terms in the presence of adversarial transitions. We thus consider the following log-barrier regularizer with an adaptive learning rate γt : S × A → R+: ϕt(q) = − P s̸=sL P a∈A γt(s, a) · log q(s, a). The learning rate design requires combining the lossshifting idea of [47] and the idea from [42], the latter of which is the first work to show that with adaptive learning rate tuning, the log-barrier regularizer leads to near-optimal best-of-both-world gaurantee for multi-armed bandits. More specifically, following the same loss-shifting argument of Jin, Huang, and Luo [47], we first observe that our FTRL update can be equivalently written as qbt = argmin q∈Ω(Pei) * q, X t−1 τ=ti (ℓbτ − bτ ) + + ϕt(q) = argmin x∈Ω(Pei) * q, X t−1 τ=ti (gτ − bτ ) + + ϕt(q), where gτ (s, a) = QPei,πτ (s, a; ℓbτ ) − V Pei,πτ (s; ℓbτ ) for any state-action pair (s, a) (Q is the standard Q-function; see Section 5.7 for formal definition). With this perspective, we follow the idea of Ito [42] and propose the following learning rate schedule: Definition 5.4.2. (Adaptive learning rate for log-barrier) For any t, if it is the starting episode of an epoch, we set γt(s, a) = 256L 2 |S|; otherwise, we set γt+1(s, a) = γt(s, a) + Dνt(s,a) 2γt(s,a) where D = 1/log(ι), νt(s, a) = q Pe i(t) ,πt (s, a) 2 Q Pe i(t) ,πt (s, a; ℓbt) − V Pe i(t) ,πt (s; ℓbt) 2 , and i(t) is the epoch index to which episode t belongs. 172 Such a learning rate schedule is critical for the analysis in obtaining a certain self-bounding quantity and eventually deriving the gap-dependent bound. This concludes the design of our algorithm; see Section 5.7 for more details. 173 5.5 Omitted Details for Section 5.2 In this section, we provide more details of the modified UOB-REPS algorithm as shown in Algorithm 8. Before diving in to the details, we first introduce several important notations for the rest of this chapter. Important Notations We denote the transition associated with the occupancy measure qbt by Pbt so that qbt = q Pbt,πt . Recall i(t) is the epoch index to which episode t belongs. Importantly, the notation mi(s, a) (and similarly mi(s, a, a′ )), which is a changing variable in the algorithms, denotes the initial value of this counter in all analysis, that is, the total number of visits to (s, a) prior to epoch i. Finally, for convenience, we define mb i(s, a) = max mi(s, a), CP + log (ι) , and it can be verified that the confidence interval, defined in Eq. (2.6) as Bi(s, a, s′ ) = min ( 1, 16s P¯ i(s ′ |s, a) log (ι) mi(s, a) + 64 · C P + log (ι) mi(s, a) ) , can be equivalently written as Bi(s, a, s′ ) = min ( 1, 16s P¯ i(s ′ |s, a) log (ι) mb i(s, a) + 64 · C P + log (ι) mb i(s, a) ) since whenever mb i(s, a) ̸= mi(s, a), the two definitions both lead to a value of 1. In the algorithm, the occupancy measure for each episode t is computed as qbt+1 = argmin q∈Ω(Pi(t+1)) η D q, ℓbt − bt E + Dϕ(q, qbt), (5.5) 174 where η > 0 is the learning rate and Dϕ(q, q′ ) is the Bregman divergence defined as Dϕ(q, q′ ) = ϕ(q) − ϕ(q ′ ) − ∇ϕ(q ′ ), q − q ′ . (5.6) The Bregman divergence is induced by the log-barrier regularizer ϕ given as: ϕ(q) = L X−1 k=0 X s∈Sk X a∈A X s ′∈Sk+1 log 1 q(s, a, s′) . 
(5.7) We note that the loss estimator is constructed based upon upper occupancy bound ut , which can be efficiently computed by Comp-UOB [46]. In the rest of this section, we prove Theorem 5.2.2 which shows that the expected regret is bounded by O L|S| p |A|T log(ι) + L|S| 4 |A| log2 (ι) + C PL|S| 4 |A| log(ι) . Our analysis starts from a regret decomposition similar to Jin et al. [46], but with the amortized bonuses taken into account: RegT = E "X T t=1 D q Pt,πt − q Pt,˚π , ℓt E # = E "X T t=1 D q Pt,πt − q Pbt,πt , ℓt E # | {z } Error + E "X T t=1 D q Pbt,πt , ℓt − ℓbt E # | {z } Bias1 + E "X T t=1 D q Pbt,πt − q P,˚π , ℓbt − bt E # | {z } Reg + E "X T t=1 D q P,˚π , ℓbt − ℓt E + X T t=1 D q Pbt,πt − q P,˚π , bt E # | {z } Bias2 + E "X T t=1 D q P,˚π − q Pt,˚π , ℓt E # , 175 where the last term can be directly bounded by O LCP using Corollary 5.8.3.7, and Error, Bias1, Reg, and Bias2 are analyzed in Section 5.5.2, Section 5.5.3, Section 5.5.4 and Section 5.5.5 respectively. 5.5.1 Equivalent Definition of Amortized Bonuses For ease of exposition, we show an alternative definition of the amortized bonus bt(s), which is equivalent to Eq. (5.2) but is useful for our following analysis. Recall from Eq. (5.2) that bt(s) = bt(s, a) = 4L ut(s) if Pt τ=1 I{⌈log2 uτ (s)⌉ = ⌈log2 ut(s)⌉} ≤ CP 2L , 0 else, (5.8) In Eq. (5.8), the value of bt(s) relies on the cumulative sum of I{⌈log2 uτ (s)⌉ = ⌈log2 ut(s)⌉} and whether the cumulative sum exceeds the threshold CP 2L . To reconstruct this, we define yt(s) as the unique integer value j = −⌈log2 ut(s)⌉ such that ut(s) ∈ (2−j−1 , 2 −j ] and define z j t (s) as the total number of times from episode τ = 1 to τ = t where yτ (s) = j holds. With these definitions, the condition under which bt(s) = 4L ut(s) can be represented by P j I{yt(s) = j}I n z j t (s) ≤ CP /2L o where the summation is over all integers that −⌈log2 ut(s)⌉ can take. Notice that since ut(s) ≥ 1 |S|T holds for all s and t (see Lemma 5.8.2.8), the largest possible integer value we need to consider is ⌈log2 (|S|T)⌉. Therefore, bt(s) can be equivalently defined as: bt(s) = bt(s, a) = 4L ut(s) ⌈log2X (|S|T)⌉ j=0 I{yt(s) = j}I z j t (s) ≤ C P 2L , ∀s ∈ S, ∀t ∈ {1, . . . , T}. (5.9) 17 5.5.2 Bounding Error Lemma 5.5.2.1 (Bound of Error). Algorithm 8 ensures Error = O L|S| p |A|T log(ι) + L|S| 2 |A| log(ι) log(T) + C PL|S||A| log(T) + δLT . Proof. For this proof, we consider two cases, whether the event Econ∧ Eest holds or not, where Econ, defined above Eq. (2.6), and Eest, defined in Proposition 5.8.5.2, are both high probability events. Suppose that Econ ∧ Eest holds. 
We have X T t=1 D q Pt,πt − q Pbt,πt , ℓt E ≤ X T t=1 L X−1 k=0 X s∈Sk X a∈A q Pt,πt (s, a) − q Pbt,πt (s, a) = X T t=1 L X−1 k=0 X s∈Sk q Pbt,πt (s) − q P,πt (s) ≤ X T t=1 L X−1 k=0 X s∈Sk X k−1 h=0 X u∈Sh X v∈A X w∈Sh+1 q Pt,πt (u, v) Pbt(w|u, v) − Pt(w|u, v) q Pbt,πt (s|w) ≤ L X T t=1 L X−1 h=0 X u∈Sh X v∈A q Pt,πt (u, v) Pbt(·|u, v) − Pt(·|u, v) 1 ≤ L X T t=1 L X−1 h=0 X u∈Sh X v∈A q Pt,πt (u, v) ∥Pt(·|u, v) − P(·|u, v)∥1 + Pbt(·|u, v) − P(·|u, v) 1 ≤ L X T t=1 L X−1 h=0 X u∈Sh X v∈A q Pt,πt (u, v) Pbt(·|u, v) − P(·|u, v) 1 + L X T t=1 L X−1 h=0 C P t,h = L X T t=1 L X−1 h=0 X u∈Sh X v∈A q Pt,πt (u, v) Pbt(·|u, v) − P(·|u, v) 1 + LCP , (5.10) where the first step follows from the fact that ℓt(s, a) ∈ [0, 1] for any (s, a); the third step applies Lemma 5.8.3.3; the fourth step rearranges the summation and uses the fact that P s,a q Pbt,πt (s, a|w) ≤ 177 L; and in the sixth step we define C P t,h = maxs∈Sh,a∈A ∥Pt(·|s, a) − P(·|s, a)∥1 , so that C P = PT t=1 PL−1 h=0 C P t,h. We continue to bound the first term above as: X T t=1 L X−1 h=0 X u∈Sh X v∈A q Pt,πt (u, v) Pbt(·|u, v) − P(·|u, v) 1 ≤ X T t=1 L X−1 h=0 X u∈Sh X v∈A q Pt,πt (u, v) Pbt(·|u, v) − P¯ i(t) (·|u, v) 1 + P¯ i(t) (·|u, v) − P(·|u, v) 1 ≤ L X−1 h=0 X T t=1 X u∈Sh X v∈A q Pt,πt (u, v) O s |Sh+1| log(ι) mb i(t) (u, v) + C P + |Sh+1| log(ι) mb i(t) (u, v) ! ≤ O L X−1 h=0 p |Sh||Sh+1||A|T log(ι) + |Sh||A| log(T) |Sh+1| log(ι) + C P + log(ι) ! ≤ O L X−1 h=0 (|Sh| + |Sh+1|) p |A|T log(ι) + |Sh||A| log(T) |Sh+1| log(ι) + C P + log(ι) ! ≤ O |S| p |A|T log(ι) + |S| 2 |A| log(ι) log(T) + C P |S||A| log(T) , (5.11) where the first step uses the triangle inequality; the second step applies Corollary 5.8.2.6 to bound two norm terms based on the fact that Pbt , P¯ i(t) ∈ Pi(t) ; the third step follows the definition of Eest from Proposition 5.8.5.2; the fourth step uses the AM-GM inequality. Putting the result of Eq. (5.11) into Eq. (5.10) yields the first three terms of the claimed bound. Now suppose that Econ∧Eest does not hold. We trivially bound PT t=1 D q Pt,πt − q Pbt,πt , ℓt E by LT. As the probability that this case occurs is at most O (δ), this case contributes to at most O (δLT) regret. Finally, combining these two cases and applying Lemma 5.8.1.1 finishes the proof. 5.5.3 Bounding Bias1 Lemma 5.5.3.1 (Bound of Bias1). Algorithm 8 ensures Bias1 = O L|S| p |A|T log(ι) + |S| 4 |A| log2 (ι) + C PL|S| 4 |A| log(ι) + δLT . 178 Proof. We can rewrite Bias1 as Bias1 = X T t=1 E hDq Pbt,πt , ℓt − ℓbt Ei = X T t=1 E hDq Pbt,πt , Et h ℓt − ℓbt iEi , where the second step uses the law of total expectation and the fact that q Pbt,πt is deterministic given all the history up to t. For any state-action pair (s, a), we have Et h ℓt(s, a) − ℓbt(s, a) i = ℓt(s, a) 1 − q Pt,πt (s, a) ut(s, a) = ℓt(s, a) ut(s, a) − q Pt,πt (s, a) ut(s, a) . Then, we can further rewrite and bound Bias1 as: Bias1 = E "X T t=1 X s,a q Pbt,πt (s, a)ℓt(s, a) ut(s, a) − q Pt,πt (s, a) ut(s, a) # ≤ E "X T t=1 X s,a q Pbt,πt (s, a) ut(s, a) − q Pt,πt (s, a) ut(s, a) !# ≤ E "X T t=1 X s∈S ut(s) − q Pt,πt (s) # , where the third step follows from the fact that q Pbt,πt (s, a) ≤ ut(s, a) according to the definition of the upper occupancy measure. 
Similar to the proof of Lemma 5.5.2.1, we first consider the case when the high probability event Econ ∧ Eest holds: X s∈S ut(s) − q Pt,πt (s) ≤ X s∈S ut(s) − q P,πt (s) + X s∈S q P,πt (s) − q Pt,πt (s) ≤ X s∈S ut(s) − q P,πt (s) + LCP t 179 ≤ O LCP t + X s∈S k( Xs)−1 k=0 X (u,v,w)∈Wk q P,πt (u, v) s P(w|u, v) log (ι) mb i(t) (s, a) q P,πt (s|w) + O |S| 3X s,a q P,πt (s, a) C P + log (ι) mb i(t) (s, a) ! ≤ O LCP t + L L X−1 k=0 X (u,v,w)∈Wk q P,πt (u, v) s P(w|u, v) log (ι) mb i(t) (s, a) + O |S| 3X s,a q P,πt (s, a) C P + log (ι) mb i(t) (s, a) ! , where the first step uses the triangle inequality; the second step uses Corollary 5.8.3.6; the third step uses Lemma 5.8.3.8; the last step follows from the fact that P s∈S q P,πt (s|w) ≤ L. Taking the summation over all episodes yields the following: X T t=1 X s∈S ut(s) − q Pt,πt (s) ≤ O X T t=1 LCP t + L X T t=1 L X−1 k=0 X (u,v,w)∈Wk q P,πt (u, v) s P(w|u, v) log (ι) mb i(t) (u, v) + O |S| 3X T t=1 X s,a q P,πt (s, a) C P + log (ι) mb i(t) (s, a) ! ≤ O LCP + |S| 3 L X−1 k=0 C P + log (ι) |Sk||A| log (ι) ! + O L X T t=1 L X−1 k=0 X (u,v)∈Sk×A q P,πt (u, v) s |Sk+1| log (ι) mb i(t) (u, v) ≤ O |S| 4 |A| C P + log (ι) log (ι) + L X T t=1 L X−1 k=0 X (u,v)∈Sk×A q Pt,πt (u, v) s |Sk+1| log (ι) mb i(t) (u, v) + O L X T t=1 L X−1 k=0 X (u,v)∈Sk×A q P,πt (u, v) − q Pt,πt (u, v) s |S| log (ι) mb i(t) (u, v) ≤ O |S| 4 |A| C P + log (ι) log (ι) + L X T t=1 L X−1 k=0 X (u,v)∈Sk×A q Pt,πt (u, v) s |Sk+1| log (ι) mb i(t) (u, v) + O L p |S| X T t=1 L X−1 k=0 X (u,v)∈Sk×A q P,πt (u, v) − q Pt,πt (u, v) 180 ≤ O |S| 4 |A| C P + log (ι) log (ι) + L L X−1 k=0 p |Sk+1| log (ι) p |Sk||A|T + |Sk||A| log (ι) ! = O |S| 4 |A| C P + log (ι) log (ι) + L|S| p |A|T log (ι) , where the second step uses the Cauchy-Schwartz inequality: P w∈Sk+1 p P(w|u, v) ≤ p |Sk+1| and also applies Lemma 5.8.5.3; the fourth step bounds mb i(t) (u, v) ≥ log(ι) for any (u, v) and |Sk+1| ≤ |S|; the fifth applies Proposition 5.8.5.2, and Corollary 5.8.3.7 to bound PT t=1 PL−1 k=0 P (u,v)∈Sk×A q P,πt (u, v) − q Pt,πt (u, v) by O LCP ; the last step uses the fact that √xy ≤ x + y for any x, y ≥ 0, and thus we have PL−1 k=0 p |Sk| |Sk+1| ≤ 2 PL−1 k=0 |Sk| = 2 |S|. Finally, using a similar argument in the proof of Lemma 5.5.2.1 to bound the case that Econ∧Eest does not hold, we complete the proof. 5.5.4 Bounding Reg Lemma 5.5.4.1 (Bound of Reg). If learning rate η satisfies η ∈ (0, 1 8L ], then, Algorithm 8 ensures Reg = O |S| 2 |A| log (ι) η + η L|S|C P log T + LT + δLT . Proof. We consider a specific transition P0 defined in [55, Lemma C.4] and occupancy measure u such that u = 1 − 1 T q P,˚π + 1 T|A| X a∈A q P0,πa , where πa is a policy such that action a is selected at every state. 18 By direct calculation, we have Dϕ(u, qb1) = L X−1 k=0 X (s,a,s′)∈Wk log qb1(s, a, s′ ) u(s, a, s′) + u(s, a, s′ ) qb1(s, a, s′) − 1 = L X−1 k=0 X (s,a,s′)∈Wk log qb1(s, a, s′ ) u(s, a, s′) + L X−1 k=0 X (s,a,s′)∈Wk |Sk||A||Sk+1|u(s, a, s′ ) − 1 = L X−1 k=0 X (s,a,s′)∈Wk log qb1(s, a, s′ ) u(s, a, s′) ≤ 3|S| 2 |A| log (ι), where the second step uses the definition qb1(s, a, s′ ) = 1 |Sk||A||Sk+1| for k = 0, · · · , L − 1 and the fourth step lower-bounds u(s, a, s′ ) ≥ 1 T3|S| 2|A| from [55, Lemma C.10], thereby u(s, a, s′ ) ≥ 1 ι 3 , and upper-bounds qb1(s, a, s′ ) ≤ 1. According to [55, Lemma C.4], we have q P0,πa ∈ ∩i Ω (Pi). 
Therefore, u is a convex combination of points in that convex set, and we can use Lemma 5.5.4.2 (included after this proof) to show D q Pbt,πt − u, ℓbt − bt E ≤ 3|S| 2 |A| log (ι) η + 2η X T t=1 X s,a q Pbt,πt (s, a) 2 ℓbt(s, a) 2 + X T t=1 X s,a q Pbt,πt (s, a) 2 bt(s) 2 ! . (5.12) On the one hand, we bound the first summation in Eq. (5.12) as X T t=1 X s,a q Pbt,πt (s, a) 2 ℓbt(s, a) 2 = X T t=1 X s,a q Pbt,πt (s, a) 2 · ℓt(s, a) 2 It{s, a} ut(s, a) 2 ≤ LT, since q Pbt,πt (s, a) ≤ ut(s, a) by definition of the upper occupancy bound. 18 On the other hand, we bound the second summation in Eq. (5.12) as X T t=1 X s,a q Pbt,πt (s, a) 2 bt(s) 2 ≤ 4L X T t=1 X s q Pbt,πt (s)bt(s) = O L|S|C P log T , (5.13) where the first step uses the facts that bt(s) 2 ≤ 4Lbt(s) and P a q Pbt,πt (s, a) 2 ≤ q Pbt,πt (s), and the last step applies Lemma 5.2.1. Putting these inequalities together concludes the proof. Lemma 5.5.4.2. With η ∈ (0, 1 8L ], Algorithm 8 ensures X T t=1 D qbt − u, ℓbt − bt E ≤ Dϕ(u, qb1) η + 2η X T t=1 X s,a qbt(s, a) 2 ℓbt(s, a) 2 + bt(s) 2 , (5.14) for any u ∈ ∩i Ω (Pi). Proof. To use the standard analysis of OMD with log-barrier (e.g.,see [5, Lemma 12]), we need to ensure that ηqbt(s, a, s′ ) ℓbt(s, a) − bt(s) ≥ −1 2 , since log (1 + x) ≥ x − x 2 holds for any x ≥ −1 2 . Clearly, by choosing η ∈ (0, 1 8L ], we have ηqbt(s, a, s′ ) ℓbt(s, a) − bt(s) ≥ −ηqbt(s, a, s′ )bt(s) ≥ −4Lη qbt(s, a, s′ ) ut(s) ≥ −4Lη ≥ − 1 2 , where the first step follows from the fact that ℓbt(s, a) ≥ 0 for all t, s, a; the second step follows from the definition of amortized bonus in Eq. (5.2); the third step bounds qbt(s, a, s′ ) ≤ ut(s). 183 Now, we are ready to apply the standard analysis to show that X T t=1 D qbt − u, ℓbt − bt E ≤ PT t=1 (Dϕ(u, qbt) − Dϕ(u, qbt+1)) η + η X T t=1 X s,a,s′ qbt(s, a, s′ ) 2 ℓbt(s, a) − bt(s) 2 = Dϕ(u, qb1) − Dϕ(u, qbT +1) η + η X T t=1 X s,a,s′ qbt(s, a)Pbt(s ′ |s, a) 2 ℓbt(s, a) − bt(s) 2 ≤ Dϕ(u, qb1) η + η X T t=1 X s,a qbt(s, a) 2 ℓbt(s, a) − bt(s) 2 · X s ′ Pbt(s ′ |s, a) 2 ! ≤ Dϕ(u, qb1) η + η X T t=1 X s,a qbt(s, a) 2 ℓbt(s, a) − bt(s) 2 ≤ Dϕ(u, qb1) η + 2η X T t=1 X s,a qbt(s, a) 2 ℓbt(s, a) 2 + bt(s) 2 , where the second step uses the fact that qbt(s, a, s′ ) = qbt(s, a)·Pbt(s ′ |s, a) (see [46, Lemma 1] for more details); the third step follows from the fact that the Bregman divergence is non-negative; the fourth step follows from the fact P s ′ Pbt(s ′ |s, a) = 1; the last step uses the fact that (x − y) 2 ≤ 2 x 2 + y 2 for any x, y ∈ R. 5.5.5 Bounding Bias2 Lemma 5.5.5.1 (Bound of Bias2). Algorithm 8 ensures Bias2 = O |S|C P log T + δLT . Proof. We first rewrite Bias2 as Bias2 = E "X T t=1 Et hDq P,˚π , ℓbt − ℓt Ei − X T t=1 D q P,˚π , bt E # | {z } =:(I) + E "X T t=1 D q Pbt,πt , bt E # | {z } =:(II) , 18 where (II) is bounded by O |S|C P log T by Lemma 5.2.1 (whose proof is included after this proof). Now, we show that (I) is bounded by O (δLT). Suppose that Econ holds. We then have Et hDq P,˚π , ℓbt − ℓt Ei = X s,a q P,˚π (s, a) q Pt,πt (s, a) − ut(s, a) ut(s, a) ≤ X s,a q P,˚π (s, a) q Pt,πt (s, a) − q P,πt (s, a) + q P,πt (s, a) − ut(s, a) ut(s, a) ! 
= X s,a q P,˚π (s, a) · q Pt,πt (s) − q P,πt (s) ut(s) + X s,a q P,˚π (s, a) q P,πt (s, a) − ut(s, a) ut(s, a) ≤ X s,a q P,˚π (s, a) · C P t ut(s) + X s,a q P,˚π (s, a) q P,πt (s, a) − ut(s, a) ut(s, a) ≤ X s,a q P,˚π (s, a) · C P t ut(s) , where the fourth step applies Corollary 5.8.3.6 to bound q Pt,πt (s) − q P,πt (s) ≤ C P t , and in the last step, the second term (in the fifth step) is bounded by 0 under Econ. Since q P,˚π (s, a) is fixed over all episodes, we apply Lemma 5.2.1 to show X T t=1 Et hDq P,˚π , ℓbt − ℓt Ei − X T t=1 D q P,˚π , bt E ≤ X s,a q P,˚π (s, a) X T t=1 C P t ut(s) − bt(s) ≤ 0, For the case that Econ does not occur, we bound the expected regret by O (δLT). Combining two cases via Lemma 5.8.1.1, we conclude the proof. Lemma 5.5.5.2 (Restatement of Lemma 5.2.1). The amortized bonus defined in Eq. (5.2) satisfies PT t=1 CP t ut(s) ≤ PT t=1 bt(s) and PT t=1 qbt(s)bt(s) = O(C P log T) for any s. 18 Proof. In the following proof, we use the equivalent definition of bt(s) given in Eq. (5.9). We first show PT t=1 CP t ut(s) ≤ PT t=1 bt(s). On the one hand, we have X T t=1 C P t ut(s) = X T t=1 ⌈log( X |S|T)⌉ j=0 I{yt(s) = j} C P t ut(s) ≤ X T t=1 ⌈log( X |S|T)⌉ j=0 I{yt(s) = j} C P t 2−j−1 = ⌈log( X |S|T)⌉ j=0 PT t=1 I{yt(s) = j}C P t 2−j−1 ≤ ⌈log( X |S|T)⌉ j=0 min{2L PT t=1 I{yt(s) = j}, CP} 2−j−1 , (5.15) where the second step uses the construction of the bin, i.e., if ut(s) falls into a bin (2−j−1 , 2 −j ], then, it is lower-bounded by 2 −j−1 ; the fourth step follows the facts that C P t ≤ 2L and PT t=1 C P t ≤ C P. On the other hand, one can show X T t=1 bt(s) = 4L X T t=1 ⌈log( X |S|T)⌉ j=0 I{yt(s) = j}I n z j t (s) ≤ CP 2L o ut(s) ≥ 4L ⌈log( X |S|T)⌉ j=0 PT t=1 I{yt(s) = j}I n z j t (s) ≤ CP 2L o 2−j ≥ 2L ⌈log( X |S|T)⌉ j=0 min{ PT t=1 I{yt(s) = j}, CP 2L } 2−j−1 = ⌈log( X |S|T)⌉ j=0 min{2L PT t=1 I{yt(s) = j}, CP} 2−j−1 , (5.16) where the second step uses the construction of the bin, i.e., if ut(s) falls into a bin (2−j−1 , 2 −j ], then, it is upper-bounded by 2 −j , and the third step follows the definitions of yt(s) and z j t (s). Combining Eq. (5.15) and Eq. (5.16), we complete the proof of PT t=1 CP t ut(s) ≤ PT t=1 bt(s). 186 For the proof of PT t=1 qbt(s)bt(s) = O(|S|C P log T), one can show X T t=1 qbt(s)bt(s) = 4L X T t=1 ⌈log( X |S|T)⌉ i=0 q Pbt,πt (s)I{yt(s) = i}I n z i t (s) ≤ CP 2L o ut(s) ≤ 4L X T t=1 ⌈log( X |S|T)⌉ i=0 I{yt(s) = i}I z i t (s) ≤ C P 2L = 4L ⌈log( X |S|T)⌉ i=0 X T t=1 I{yt(s) = i}I z i t (s) ≤ C P 2L ≤ 4L ⌈log( X |S|T)⌉ i=0 C P 2L = O C P log (|S|T) = O C P log (T) , where the first inequality uses the fact q Pbt,πt (s) ≤ ut(s), and the last equality uses |S| ≤ T. 5.5.6 Proof of Theorem 5.2.2 For Reg, we choose η = min q |S| 2|A| log(ι) LT , 1 8L . First consider the case η ̸= 1 8L : Reg = O |S| 2 |A| log (ι) η + η L|S|C P log T + LT + LT δ ≤ O |S| 2 |A| log (ι) η + ηLT + |S|C P log T + LT δ = O |S| p |A| log (ι)LT + |S|C P log T + LT δ , where the second step uses η ≤ 1 8L , and the third step applies choice of η in the case of η ̸= 1 8L . For the case of η = 1 8L , we have T ≤ 64L|S| 2 |A| ln(ι) and show Reg = O L|S| 2 |A| log (ι) + |S|C P log T + LT δ . 187 Finally, choosing δ = 1 T and putting the bound above together with Error, Bias1, and Bias2, we complete the proof of Theorem 5.2.2. 188 5.6 Omitted Details for Section 5.3 5.6.1 Bottom Layer Reduction: STABILISE Proof of Theorem 5.3.3. Define indicators gt,j = I{wt ∈ (2−j−1 , 2 −j ]} ht,j = I{ALGj receives the feedback for episode t}. Now we consider the regret of ALGj . 
Notice that ALGj makes an update only when gt,jht,j = 1. By the guarantee of the base algorithm (Definition 5.3.1), we have E "X T t=1 (ℓt(πt) − ℓt(π))gt,jht,j# ≤ E vuutβ1 X T t=1 gt,jht,j + (β2 + β3θj ) max t≤T gt,j + Pr "X T t=1 C P t gt,jht,j > θj # LT. (5.17) We first bound the last term: Notice that E[ht,j |gt,j ] = 2−j−1 gt,j by Algorithm 9. Therefore, X T t=1 C P t gt,jE[ht,j |gt,j ] = 2−j−1X T t=1 C P t gt,j ≤ 2 −j−1C P (5.18) By Freedman’s inequality, with probability at least 1 − 1 T2 , X T t=1 C P t gt,jht,j − X T t=1 C P t gt,jE[ht,j |gt,j ] ≤ 2 vuut2 X T t=1 (CP t ) 2gt,jE[ht,j |gt,j ] log(T) + 4Llog(T) ≤ 4 vuutL X T t=1 CP t gt,jE[ht,j |gt,j ] log(T) + 4Llog(T) (C P t ≤ 2L) 189 ≤ X T t=1 C P t gt,jE[ht,j |gt,j ] + 8Llog(T) (AM-GM inequality) which gives X T t=1 C P t gt,jht,j ≤ 2 X T t=1 C P t gt,jE[ht,j |gt,j ] + 8Llog(T) ≤ 2 −jC P + 8Llog(T) ≤ θj with probability at least 1 − 1 T2 using Eq. (5.18). Therefore, the last term in Eq. (5.17) is bounded by 1 T2LT ≤ L T . Next, we deal with other terms in Eq. (5.17). Again, by E[ht,j |gt,j ] = 2−j−1 gt,j , Eq. (5.17) implies 2 −j−1E "X T t=1 (ℓt(πt) − ℓt(π))gt,j# ≤ E vuut2−j−1β1 X T t=1 gt,j + (β2 + β3θj ) max t≤T gt,j + L T . which implies after rearranging: E "X T t=1 (ℓt(πt) − ℓt(π))gt,j# ≤ E vuut 1 2−j−1 β1 X T t=1 gt,j + β2 2−j−1 + β3θj 2−j−1 max t≤T gt,j + L T2−j−1 ≤ E vuutβ1 X T t=1 2gt,j wt + 2β2 + 16β3Llog(T) 2−j + 4β3C P max t≤T gt,j + L T2−j−1 . (using that when gt,j = 1, 1 2−j−1 ≤ 2 wt , and the definition of θj ) 190 Now, summing this inequality over all j ∈ {0, 1, . . . , ⌈log2 T⌉}, we get E "X T t=1 (ℓt(πt) − ℓt(π))I wt ≥ 1 T # ≤ O E vuutNβ1 X T t=1 1 wt + (β2 + β3Llog(T)) 1 mint≤T wt + Nβ3C P + L ≤ O E hp β1T log(T)ρT + (β2 + β3Llog(T))ρT i + β3C P log T + L where N ≤ O(log T) is the number of ALGj ’s that has been executed at least once. On the other hand, E "X T t=1 (ℓt(πt) − ℓt(π))I wt ≤ 1 T # ≤ LTE [I {ρT ≥ T}] ≤ LE [ρT ] . Combining the two parts and using the assumption β2 ≥ L finishes the proof. 5.6.2 Top Layer Reduction: Corral In this subsection, we use a base algorithm that satisfies Definition 5.3.2 to construct an algorithm with √ T + C P regret under unknown C P. The idea is to run multiple base algorithms, each with a different hypothesis on C P; on top of them, run another multi-armed bandit algorithm to adaptively choose among them. The goal is to let the top-level bandit algorithm perform almost as well as the best base algorithm. This is the Corral idea outlined in [5, 32, 60], and the algorithm is presented in Algorithm 10. The top-level bandit algorithm is an FTRL with log-barrier regularizer. We first state the standard regret bound of FTRL under log-barrier regularizer, whose proof can be found in, e.g., Theorem 7 of [90]. 191 Algorithm 10 (A Variant of) Corral Initialize: a log-barrier algorithm with each arm being an instance of any base algorithm satisfying Definition 5.3.2. The hypothesis on C P is set to 2 i for arm i (i = 1, 2, . . . , M ≜ ⌈log2 T⌉). Initialize: ρ0,i = M, ∀i for t = 1, 2, . . . , T do Let wt = argmin w∈∆(M),wi≥ 1 T ,∀i (*w,X t−1 τ=1 (ˆcτ − rτ ) + + 1 η X M i=1 log 1 wi ) where η = 1 4(√ β1T +β2) . For all i, send wt,i to instance i. Draw it ∼ wt . Execute the πt output by instance it Receive the loss ct,it for policy πt (whose expectation is ℓt(πt)) and send it to instance it . Define for all i: cˆt,i = ct,iI[it = i] wt,i , ρt,i = min τ≤t 1 wτ,i , rt,i = p β1T √ρt,i − √ρt−1,i + β2 (ρt,i − ρt−1,i). Lemma 5.6.2.1. 
The FTRL algorithm over a convex subset Ω of the (M −1)-dimensional simplex ∆(M): wt = argmin w∈Ω (*w,X t−1 τ=1 ℓτ + + 1 η X M i=1 log 1 wi ) ensures for all u ∈ Ω, X T t=1 ⟨w − u, ℓt⟩ ≤ M log T η + η X T t=1 X M i=1 w 2 t,iℓ 2 t,i as long as ηwt,i|ℓt,i| ≤ 1 2 for all t, i. 19 Proof of Theorem 5.3.4. The Corral algorithm is essential an FTRL with log-barrier regularizer. To apply Lemma 5.6.2.1, we first verify the condition ηwt,i|ℓt,i| ≤ 1 2 where ℓt,i = ˆct,i − rt,i. By our choice of η, ηwt,i|cˆt,i| ≤ ηct,i ≤ 1 4 , (because β2 ≥ L by Definition 5.3.2) ηwt,irt,i = η p β1T wt,i( √ρt,i − √ρt−1,i) + ηβ2wt,i (ρt,i − ρt−1,i). The right-hand side of the last equality is non-zero only when ρt,i > ρt−1,i, implying that ρt,i = 1 wt,i . Therefore, we further bound it by ηwt,irt,i ≤ η p β1T 1 ρt,i ( √ρt,i − √ρt−1,i) + ηβ2 1 ρt,i (ρt,i − ρt−1,i) = η p β1T 1 √ρt,i − √ρt−1,i ρt,i + ηβ2 1 − ρt−1,i ρt,i ≤ η p β1T 1 √ρt−1,i − 1 √ρt,i + ηβ2 1 − ρt−1,i ρt,i ( √ 1 a − √ b a ≤ √ 1 b − √ 1 a for a, b > 0) (5.19) ≤ η p β1T + ηβ2 (ρt,i ≥ 1) = 1 4 (definition of η) which can be combined to get the desired property ηwt,i|cˆt,i − rt,i| ≤ 1 2 . Hence, by the regret guarantee of log-barrier FTRL (Lemma 5.6.2.1), we have E "X T t=1 (ct,it − ct,i⋆ ) # 193 ≤ O M log T η + ηE "X T t=1 X M i=1 w 2 t,i(ˆct,i − rt,i) 2 | {z } stability-term #! + E "X T t=1 X M i=1 wt,irt,i − rt,i⋆ ! | {z } bonus-term # where i ⋆ is the smallest i such that 2 i upper bounds the true corruption amount C P. Bounding stability-term: stability term ≤ 2η X T t=1 X M i=1 w 2 t,i(ˆc 2 t,i + r 2 t,i) where 2η X T t=1 X M i=1 w 2 t,icˆ 2 t,i = 2η X T t=1 X M i=1 c 2 t,iI{it = i} ≤ O(ηL2T) and 2η X T t=1 X M i=1 w 2 t,ir 2 t,i ≤ 4η X T t=1 X M i=1 ( p β1T) 2 1 √ρt−1,i − 1 √ρt,i 2 + 4ηβ2 X T t=1 X M i=1 1 − ρt−1,i ρt,i 2 (continue from Eq. (5.19)) ≤ 4ηβ1T × X T t=1 X M i=1 1 √ρt−1,i − 1 √ρt,i + 4ηβ2 X T t=1 X M i=1 ln ρt,i ρt−1,i ( √ 1 ρt−1,i − √ 1 ρt,i ≤ 1 and 1 − a ≤ − ln a) ≤ 4ηβ1TM 3 2 + 4ηβ2M ln T. (telescoping and using ρ0,i = M and ρT,i ≤ T) Bounding bonus-term: bonus-term = X T t=1 X M i=1 wt,irt,i − X T t=1 rt,i⋆ ≤ p β1T X T t=1 X M i=1 1 √ρt−1,i − 1 √ρt,i + β2 X T t=1 X M i=1 log ρt,i ρt−1,i 194 − p ρT,i⋆ β1T + ρT,i⋆ β2 − p ρ0,i⋆ β1T − ρ0,i⋆ β2 (continue from Eq. (5.19) and using 1 − a ≤ − ln a) ≤ O p β1TM 3 2 + β2M log T − p ρT,i⋆ β1T + ρT,i⋆ β2 . Combining the two terms and using η = Θ √ 1 β1T +β2 , M = Θ(log T), we get E "X T t=1 (ℓt(πt) − ct,i⋆ ) # = E "X T t=1 (ct,it − ct,i⋆ ) # = O q β1T log3 T + β2 log2 T − E hp ρT,i⋆ β1T + ρT,i⋆ β2 i (5.20) On the other hand, by Definition 5.3.2 and that C P ∈ [2i ⋆−1 , 2 i ⋆ ], we have E "X T t=1 (ct,i⋆ − ℓt(˚π))# ≤ E hp ρT,i⋆ β1T + ρT,i⋆ β2 i + 2β3C P . (5.21) Combining Eq. (5.20) and Eq. (5.21), we get E "X T t=1 (ℓt(πt) − ℓt(˚π))# ≤ O q β1T log3 T + β2 log2 T + 2β3C P , which finishes the proof. 195 Algorithm 11 Algorithm with Optimistic Transition Achieving Gap-Dependent Bounds (Known C P) Input: confidence parameter δ ∈ (0, 1). Initialize: ∀(s, a), learning rate γ1(s, a) = 256L 2 |S|; epoch index i = 1 and epoch starting time ti = 1; ∀(s, a, s′ ), set counters m1(s, a) = m1(s, a, s′ ) = m0(s, a) = m0(s, a, s′ ) = 0; empirical transition P¯ 1 and confidence width B1 based on Eq. (2.4); optimistic transition Pei by Definition 5.7.1.1. for t = 1, . . . , T do Let ϕt be defined in Eq. (5.23) and compute qbt = argmin q∈Ω(Pei) * q, X t−1 τ=ti ℓbτ − bτ + + ϕt(q). (5.22) Compute policy πt from qbt such that πt(a|s) ∝ qbt(s, a). Execute policy πt and obtain trajectory (st,k, at,k) for k = 0, . . . 
, L − 1. Construct loss estimator ℓbt as defined in Eq. (2.8). Compute amortized bonus bt based on Eq. (5.24). Compute learning rate γt+1 according to Definition 5.4.2. Increment counters: for each k < L, mi(st,k, at,k, st,k+1) +← 1, mi(st,k, at,k) +← 1. if ∃k, mi(st,k, at,k) ≥ max{1, 2mi−1(st,k, at,k)} then ▷ entering a new epoch Increment epoch index i +← 1 and set new epoch starting time ti = t + 1. Initialize new counters: ∀(s, a, s′ ), mi(s, a, s′ ) = mi−1(s, a, s′ ), mi(s, a) = mi−1(s, a). Update empirical transition P¯ i and confidence width Bi based on Eq. (2.4). Update optimistic transition Pei based on Definition 5.7.1.1 5.7 Omitted Details for Section 5.4 In this section, we consider a variant of the algorithm proposed in [47], which instead ensures exploration via optimistic transitions and also switch the regularizer from Tsallis entropy to logbarrier. We present the pseudocode in Algorithm 11, and show that it automatically adapts to easier environments with improved gap-dependent regret bounds. Remark 5.7.1. Throughout the analysis, we denote by QP,π(s, a; r) the state-action value of stateaction pair (s, a) with respect to transition function P (P could be an optimistic transition function), policy π and loss function r; similarly, we denote by V P,π(s; r) the corresponding state-value function. 196 Specifically, the state value function V P,π(s; r) is computed in a backward manner from layer L to layer 0 as following: V P,π(s; r) = 0, s = sL, P a∈A π(a|s) · r(s, a) + P s ′∈Sk(s)+1 P(s ′ |s, a)V P,π(s ′ ; r) , s ̸= sL. Similarly, the state-action value function QP,π(s, a; r) is calculated in the same manner: Q P,π(s, a; r) = 0, s = sL, r(s, a) + P s ′∈Sk(s)+1 P(s ′ |s, a) P a ′∈A π(a ′ |s ′ )QP,π(s ′ , a′ ; r), s ̸= sL. Clearly, V P,π(s0; r), the state value of s0, is equal to V P,π(s0; r) = X s̸=sL X a∈A q P,π(s, a)r(s, a) ≜ q P,π, r . This equality can be further extended to the other state u as: V P,π(u; r) = X s̸=sL X a∈A q P,π(s, a|u)r(s, a), where q P,π(s, a|u) denotes the probability of arriving at (s, a) starting at state u under policy π and transition function P. 197 5.7.1 Description of the Algorithm The construction of this algorithm follows similar ideas to the work of Jin, Huang, and Luo [47], while also including several key differences. One of them is that we rely on the log-barrier regularizer, defined with a positive learning rate γt(s, a) as ϕt(q) = − X s,a γt(s, a) log (q(s, a)). (5.23) The choice of log-barrier is important for adversarial transitions as discussed in Section 5.2, while the adaptive choice of learning rate is important for adapting to easy environments. The formal definition of the learning rate is given in Definition 5.4.2, and further details and properties of the learning rate are provided in Section 5.7.7. As the algorithm runs a new instance of FTRL for each epoch i, we modify the definition of the amortized bonus bt accordingly, bt(s) = bt(s, a) = 4L ut(s) if Pt τ=ti(t) I{⌈log2 uτ (s)⌉ = ⌈log2 ut(s)⌉} ≤ CP 2L , 0 else. (5.24) The bonus bt(s) in Eq. (5.24) is defined based on each epoch i, in the sense that Pt τ=1 I{⌈log2 uτ (s)⌉ = ⌈log2 ut(s)⌉} counts, among all previous rounds τ = ti , . . . , t in epoch i, the number of times that the value of uτ (s) falls into the same bin as ut(s). Again, for the ease of analysis, in the analysis we use the equivalent definition of the bonus bt(s) defined in Eq. (5.9), except that the counter z j t (s) now will be reset to zero at the beginning of each epoch i. 
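To illustrate the adaptive learning-rate schedule referenced above (Definition 5.4.2), here is a per-(s, a) sketch; the inputs are the quantities named in the definition, passed in as plain floats.

```python
import math

def next_log_barrier_lr(gamma_t, q_sa, Q_sa, V_s, iota):
    """One step of the adaptive learning rate of Definition 5.4.2.

    gamma_t : current gamma_t(s, a)
    q_sa    : occupancy q^{P~, pi_t}(s, a) under the optimistic transition
    Q_sa    : Q^{P~, pi_t}(s, a; loss estimator)
    V_s     : V^{P~, pi_t}(s; loss estimator)

    Returns gamma_{t+1}(s, a) = gamma_t + D * nu_t / (2 * gamma_t) with
    D = 1 / log(iota) and nu_t = q_sa^2 * (Q_sa - V_s)^2.  At the first
    episode of an epoch the rate is instead reset to 256 * L^2 * |S|,
    which is not shown here.
    """
    D = 1.0 / math.log(iota)
    nu_t = q_sa ** 2 * (Q_sa - V_s) ** 2
    return gamma_t + D * nu_t / (2.0 * gamma_t)
```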
Next, we formally define the optimistic transition as follows. 198 Definition 5.7.1.1 (Optimistic Transition). For epoch i, the optimistic transition Pei : S×A×S → [0, 1] is defined as: Pei(s ′ |s, a) = max 0, P¯ i(s ′ |s, a) − Bi(s, a, s′ ) , (s, a, s′ ) ∈ Wk, P s ′∈Sk+1 min P¯ i(s ′ |s, a), Bi(s, a, s′ ) , (s, a) ∈ Sk × A and s ′ = sL, where P¯ i is the empirical transition defined in Eq. (2.4). Note that the optimistic transition Pei(·|s, a) is a valid distribution as we have X s ′∈Sk+1 min P¯ i(s ′ |s, a), Bi(s, a, s′ ) = 1 − X s ′∈Sk+1 max 0, P¯ i(s ′ |s, a) − Bi(s, a, s′ ) . We summarize the properties of the optimistic transition functions in Section 5.7.8. 5.7.2 Self-Bounding Properties of the Regret In order to achieve best-of-both-worlds guarantees, our goal is to bound RegT (π ⋆ ) in terms of two self-bounding quantities for some x > 0 (plus other minor terms): S1(x) = vuuutx · E X T t=1 X s̸=sL X a̸=π⋆(s) q Pt,πt (s, a) , S2(x) = X s̸=sL X a̸=π⋆(s) vuutx · E "X T t=1 q Pt,πt (s, a) # . (5.25) These two quantities enjoy a certain self-bounding property which is critical to achieve the gapdependent bound under Condition (5.4), as they can be related back to the regret against policy π ⋆ itself. To see this, we first show the following implication of Condition (5.4). 199 Lemma 5.7.2.1. Under Condition (5.4), the following holds. RegT (π ⋆ ) ≥ E X T t=1 X s̸=sL X a̸=π⋆(s) q Pt,πt (s, a)∆(s, a) − C L − 2LCP − L 2C P . (5.26) Proof. From the definition of RegT (π ⋆ ), we first note that: RegT (π ⋆ ) = E "X T t=1 D q Pt,πt − q Pt,π⋆ , ℓt E # = E "X T t=1 D q P,πt − q P,π⋆ , ℓt E # + E "X T t=1 q Pt,πt − q P,πt , ℓt # + E "X T t=1 D q P,π⋆ − q Pt,π⋆ , ℓt E # ≥ E "X T t=1 D q P,πt − q P,π⋆ , ℓt E # − 2LCP ≥ E X T t=1 X s̸=sL X a̸=π⋆(s) q P,πt (s, a)∆(s, a) − C L − 2LCP where the third step applies Corollary 5.8.3.7 and the last step uses the Condition (5.4). We continue to bound the first term above as: E X T t=1 X s̸=sL X a̸=π⋆(s) q P,πt (s, a)∆(s, a) ≥ E X T t=1 X s̸=sL X a̸=π⋆(s) q Pt,πt (s, a)∆(s, a) − E X T t=1 X s̸=sL X a̸=π⋆(s) q Pt,πt (s, a) − q P,πt (s, a) ∆(s, a) ≥ E X T t=1 X s̸=sL X a̸=π⋆(s) q Pt,πt (s, a)∆(s, a) − L 2C P , where the last step uses ∆(s, a) ∈ (0, L] and Corollary 5.8.3.6. 200 We are now ready to show the following important self-bounding properties (recall ∆min = mins̸=sL,a̸=π⋆(s) ∆(s, a)). Lemma 5.7.2.2 (Self-Bounding Quantities). Under Condition (5.26), we have for any z > 0: S1(x) ≤ z RegT (π ⋆ ) + C L + 2LCP + L 2C P + 1 z x 4∆min , S2(x) ≤ z RegT (π ⋆ ) + C L + 2LCP + L 2C P + 1 z X s̸=sL X a̸=π⋆(s) x 4∆(s, a) . Besides, it always holds that S1(x) ≤ √ x · LT and S2(x) ≤ p x · L|S||A|T. Proof. For any z > 0, we have S1 (x) = vuuut x 2z∆min · E 2z∆minX T t=1 X s̸=sL X a̸=π⋆(s) q Pt,πt (s, a) ≤ E z∆minX T t=1 X s̸=sL X a̸=π⋆(s) q Pt,πt (s, a) + x 4z∆min ≤ z RegT (π ⋆ ) + C L + 2LCP + L 2C P + x 4z∆min , where the second step follows from the AM-GM inequality: 2 √xy ≤ x + y for any x, y ≥ 0, and the last step follows from Condition (5.26). By similar arguments, we have S2(x) bounded for any z > 0 as: X s̸=sL X a̸=π⋆(s) vuut x 2z∆(s, a) · E " 2z∆(s, a) X T t=1 q Pt,πt (s, a) # ≤ X s̸=sL X a̸=π⋆(s) x 4z∆(s, a) + zE X s̸=sL X a̸=π⋆(s) X T t=1 q Pt,πt (s, a)∆(s, a) = z RegT (π ⋆ ) + C L + 2LCP + L 2C P + X s̸=sL X a̸=π⋆(s) x 4z∆(s, a) . 201 Finally, by direct calculation, we can show that S1(x) = vuuutx · E X T t=1 X s̸=sL X a̸=π⋆(s) q Pt,πt (s, a) ≤ √ x · LT, according to the fact that P s̸=sL P a∈A q Pt,πt (s, a) ≤ L. 
On the other hand, S2(x) is bounded as S2(x) ≤ vuuutx · |S||A|E X s̸=sL X a̸=π⋆(s) X T t=1 q Pt,πt (s, a) ≤ p x · L|S||A|T, with the help of the Cauchy-Schwarz inequality in the first step. Finally, we show that Algorithm 11 achieves the following adaptive regret bound, which directly leads to the best-of-both-worlds guarantee in Theorem 5.4.1. Lemma 5.7.2.3. Algorithm 11 with δ = 1/T 2 and learning rate defined in Definition 5.4.2 ensures that, for any mapping π ⋆ : S → A, the regret RegT (π ⋆ ) is bounded by O C P + 1 L 2 |S| 4 |A| 2 log2 (ι) plus O S1 L 3 |S| 2 |A| log2 (ι) + S2 L 2 |S||A| log2 (ι) . The proof of this result is detailed in Section 5.7.3. We emphasize that this bound holds for any mapping π ⋆ , and is not limited to that policy in Eq. (5.4). This is important for proving the robustness result when losses are arbitrary, as shown in the following proof of Theorem 5.4.1. Proof of Theorem 5.4.1. When losses are arbitrary, we simply select π ⋆ = ˚π where ˚π is one of the optimal deterministic policies in hindsight and obtain the following bound of RegT (˚π): O C P + 1 L 2 |S| 4 |A| 2 log2 (ι) + S1 L 3 |S| 2 |A| log2 (ι) + S2 L 2 |S||A| log2 (ι) = O C P + 1 L 2 |S| 4 |A| 2 log2 (ι) + q L4|S| 2|A|T log2 (ι) + q L3|S| 2|A| 2T log2 (ι) , where the first step follows from Lemma 5.7.2.3 and the second step follows from Lemma 5.7.2.2. Next, suppose that Condition (5.4) holds. We set π ⋆ as defined in the condition and use Lemma 5.7.2.3 to write the regret against π ⋆ as: RegT (π ⋆ ) ≤ µ S1 L 3 |S| 2 |A| log2 (ι) + S2 L 2 |S||A| log2 (ι) + ξ C P + 1 , where µ > 0 is an absolute constant and ξ = O L 2 |S| 4 |A| 2 log2 (ι) . For any z > 0, according to Lemma 5.7.2.2 (where we set the z there as z/2µ), we have: RegT (π ⋆ ) ≤ z RegT (π ⋆ ) + C L + 2LCP + L 2C P + µ 2 z L 3 |S| 2 |A| log2 (ι) ∆min + X s̸=sL X a̸=π⋆(s) L 2 |S||A| log2 (ι) ∆(s, a) + ξ C P + 1 . Let x = 1−z z and U = L3 |S| 2 |A| log2 (ι) ∆min + P s̸=sL P a̸=π⋆(s) L2 |S||A| log2 (ι) ∆(s,a) . Rearranging the above inequality leads to RegT (π ⋆ ) ≤ z C L + 2LCP + L 2C P 1 − z + µ 2U z (1 − z) + ξ C P + 1 1 − z = C L + 2LCP + L 2C P x + x + 2 + 1 x µ 2U + 1 + 1 x ξ C P + 1 = 1 x C L + 2LCP + L 2C P + µ 2U + ξ C P + 1 + x · µ 2U + 2µ 2U + ξ C P + 1 . Picking the optimal x to minimize the upper bound of RegT (π ⋆ ) yields RegT (π ⋆ ) = 2q (CL + 2LCP + L2CP + µ2U + ξ (CP + 1)) µ2U + 2µ 2U + ξ C P + ≤ 2 q (CL + 2LCP + L2CP + ξ (CP + 1)) µ2U + 4µ 2U + ξ C P + 1 = O q CL + L2|S| 4|A| 2 log2 (ι) CP U + U + L 2 |S| 4 |A| log2 (ι) C P + 1 ≤ O √ UCL + U + L 2 |S| 4 |A| 2 log2 (ι) C P + 1 , where the second step uses √ x + y ≤ √ x+ √y for any x, y ∈ R≥0 and the last step uses 2 √xy ≤ x+y for any x, y ≥ 0. 5.7.3 Regret Decomposition of RegT (π ⋆ ) and Proof of Lemma 5.7.2.3 In the following sections, we will first decompose RegT (π ⋆ ) for any mapping π ⋆ : S → A into several parts, and then bound each part separately from Section 5.7.4 to Section 5.7.7, in order to prove Lemma 5.7.2.3. For any mapping π ⋆ : S → A, we start from the following decomposition of RegT (π ⋆ ) as E "X T t=1 q Pt,πt , ℓt − D q Pbt,πt , ℓbt − bt E # | {z } Error1 + E "X T t=1 D q Pbt,πt , ℓbt − bt E − D q Pbt,π⋆ , ℓbt − bt E # | {z } EstReg + E "X T t=1 D q Pbt,π⋆ , ℓbt − bt E − D q Pt,π⋆ , ℓt E # | {z } Error2 , (5.27) where Pbt = Pe i(t) denotes the optimistic transition for episode t for simplicity (which is consistent with earlier definition such that qbt = q Pbt,πt ). 
Here, EstReg is the estimated regret controlled by FTRL, while Error1 and Error2 are estimation errors incurred on the selected policies {πt} T t=1, and that on the comparator policy π ⋆ respectively. 20 In order to achieve the gap-dependent bound under Condition (5.4), we consider these two estimation error terms Error1 and Error2 together as: E "X T t=1 q Pt,πt , ℓt − D q Pbt,πt , ℓbt − bt E + D q Pbt,π⋆ , ℓbt − bt E − D q Pt,π⋆ , ℓt E # = E "X T t=1 q Pt,πt , ℓt − D q Pbt,πt , Et h ℓbt i − bt E + D q Pbt,π⋆ , Et h ℓbt i − bt E − D q Pt,π⋆ , ℓt E # , (5.28) where Et [·] denotes the conditional expectation given the history prior to episode t. To better analyze the conditional expectation of the loss estimators, we define αt , βt : S×A → R as: αt(s, a) ≜ q P,πt (s, a)ℓt(s, a) ut(s, a) , βt(s, a) ≜ q Pt,πt (s, a) − q P,πt (s, a) ℓt(s, a) ut(s, a) , which ensures that Et h ℓbt(s, a) i = αt(s, a) + βt(s, a) for any state-action pair (s, a). With the help of αt , βt , we have D q Pbt,πt , Et h ℓbt i − bt E = D q Pbt,πt , αt E + D q Pbt,πt , βt − bt E , D q Pbt,π⋆ , Et h ℓbt i − bt E = D q Pbt,π⋆ , αt E + D q Pbt,π⋆ , βt − bt E , which helps us further rewrite Eq. (5.28) as E "X T t=1 q Pt,πt − q P,πt , ℓt + X T t=1 D q P,π⋆ − q Pt,π⋆ , ℓt E # + E "X T t=1 q P,πt , ℓt − D q Pbt,πt , αt E + D q Pbt,π⋆ , αt E − D q P,π⋆ , ℓt E # + E "X T t=1 D q Pbt,πt , bt − βt E + X T t=1 D q Pbt,π⋆ , βt − bt E # . (5.29) 20 Based on this decomposition of Error1 + Error2, we then bound each parts respectively in the following lemmas. Lemma 5.7.3.1. For any δ ∈ (0, 1) and any policy sequence {πt} T t=1, Algorithm 11 ensures that E "X T t=1 q Pt,πt − q P,πt , ℓt + X T t=1 D q P,π⋆ − q Pt,π⋆ , ℓt E # = O C PL . Lemma 5.7.3.2. For any δ ∈ (0, 1) and any mapping π ⋆ : S → A, Algorithm 11 ensures that E "X T t=1 D q Pbt,πt , bt − βt E + D q Pbt,π⋆ , βt − bt E # = O C PL|S| 2 |A| log2 (T) . Lemma 5.7.3.3. For any δ ∈ (0, 1) and any mapping π ⋆ : S → A, Algorithm 11 ensures that E "X T t=1 q P,πt , ℓt − D q Pbt,πt , αt E + D q Pbt,π⋆ , αt E − D q P,π⋆ , ℓt E # = O S1 L 3 |S| 2 |A| log2 (ι) + C P + 1 L 2 |S| 4 |A| log2 (ι) + δL|S| 2 |A|T 2 . Lemma 5.7.3.4. With the learning rates {γt} T t=1 defined in Definition 5.4.2, Algorithm 11 ensures that for any δ ∈ (0, 1) and any mapping π ⋆ that, EstReg(π ⋆ ) = E "X T t=1 D q Pbt,πt , ℓbt − bt E − D q Pbt,π⋆ , ℓbt − bt E # = O S2 L 2 |S||A| log2 (ι) + C P + 1 L|S| 2 |A| 2 log2 (ι) + δT L2 |S| 2 |A| log(ι) . 2 Proof of Lemma 5.7.2.3. 
According to the previous discussion, we have the regret against any mapping π ⋆ : S → A decomposed as: RegT (π ⋆ ) = E "X T t=1 q Pt,πt − q P,πt , ℓt + X T t=1 D q P,π⋆ − q Pt,π⋆ , ℓt E # + E "X T t=1 D q Pbt,πt , bt − βt E + X T t=1 D q Pbt,π⋆ , βt − bt E # + E "X T t=1 q P,πt , ℓt − D q Pbt,πt , αt E + D q Pbt,π⋆ , αt E − D q P,π⋆ , ℓt E # + E "X T t=1 D q Pbt,πt , ℓbt − bt E − D q Pbt,π⋆ , ℓbt − bt E # , where the first term is mainly caused by the difference between {Pt} T t=1 and P, which is unavoidable and can be bounded by O C PL as shown in Lemma 5.7.3.1; the second term is the extra cost of using the amortized losses bt to handle the biases of loss estimators, which is controlled by Oe(C P) as shown in Lemma 5.7.3.2; the third term measures the estimation error related to the optimistic transitions {Pbt} T t=1, which can be bounded by some self-bounding quantities in Lemma 5.7.3.2; the final term is the estimated regret calculated with respect to the optimistic transitions {Pbt} T t=1, which is controlled by FTRL as shown in Lemma 5.7.3.4. Putting all these bounds together finishes the proof. 5.7.4 Proof of Lemma 5.7.3.1 The result is immediate by directly applying Corollary 5.8.3.7. 5.7.5 Proof of Lemma 5.7.3.2 For this proof, we bound E hPT t=1 D q Pbt,πt , bt − βt Ei and E hPT t=1 D q Pbt,π⋆ , βt − bt Ei separately. 20 Bounding E hPT t=1 D q Pbt,πt , bt − βt Ei. We have X T t=1 D q Pbt,πt , bt − βt E = X T t=1 X s,a q Pbt,πt (s, a) bt(s) − q Pt,πt (s, a) − q P,πt (s, a) ℓt(s, a) ut(s, a) ! ≤ X T t=1 X s,a q Pbt,πt (s, a) bt(s) + q Pt,πt (s, a) − q P,πt (s, a) ut(s, a) ! ≤ X T t=1 X s,a q Pbt,πt (s, a)bt(s) +X T t=1 X s,a q Pt,πt (s, a) − q P,πt (s, a) ≤ X T t=1 X s,a q Pbt,πt (s, a)bt(s) + LCP , where the second step bounds ℓt(s, a) ≤ 1; the third step follows the fact that q Pbt,πt (s, a) ≤ ut(s, a) for all (s, a). Let Ei be a set of episodes that belong to epoch i and let N be the total number of epochs through T episodes. Then, we turn to bound X T t=1 X s,a q Pbt,πt (s, a)bt(s) = X N i=1 X t∈Ei X s,a q Pbt,πt (s, a)bt(s) ≤ O L|S| 2 |A| log2 (T) C P , where the last step repeats the same argument of Lemma 5.2.1 for every epoch and the number of epochs is at most O(|S||A| log T) according to [47, Lemma D.3.12]. Bounding E hPT t=1 D q Pbt,π⋆ , βt − bt Ei. For this term, we show that for any given epoch i, P t∈Ei D q Pbt,π⋆ , βt − bt E ≤ 0 which yields PT t=1 D q Pbt,π⋆ , βt − bt E ≤ 0. To this end, we first consider a fixed epoch i and upperbound D q Pbt,π⋆ , βt E = X s,a q Pbt,π⋆ (s, a) q Pt,πt (s) − q P,πt (s) ℓt(s, a) ut(s) ! ≤ X s,a q Pbt,π⋆ (s, a) C P t ut(s) , where the second step uses Corollary 5.8.3.6 to bound q Pt,πt (s) − q P,πt (s) ≤ C P t . For any epoch i and episode t ∈ Ei , we have Pbt = Pe i(t) , which gives that X T t=1 D q Pbt,π⋆ , βt − bt E = X N i=1 X t∈Ei D q Pei,π⋆ , βt − bt E ≤ X N i=1 X s,a q Pei,π⋆ (s, a) X t∈Ei C P t ut(s) − bt(s) ≤ 0, where the last step applies Lemma 5.2.1 for every epoch i. Thus, E hPT t=1 D q Pbt,π⋆ , βt − bt Ei ≤ 0 holds. 5.7.6 Proof of Lemma 5.7.3.3 We introduce the following lemma to evaluate the estimated performance via the true occupancy measure, which helps us analyze the the estimation error. Lemma 5.7.6.1. 
For any transition function pair (P, Pb) (P and Pb can be optimistic transition), policy π, and loss function ℓ, it holds that q P,π, ℓ = D q P,π b , Z P,P,π b ℓ E , where Z P,P,π b ℓ is defined as Z P,P,π b ℓ (s, a) = Q P,π(s, a; ℓ) − X s ′∈Sk(s)+1 Pb(s ′ |s, a)V P,π(s ′ ; ℓ) = ℓ(s, a) + X s ′∈Sk(s)+1 P(s ′ |s, a) − Pb(s ′ |s, a) V P,π(s ′ ; ℓ) for all state-action pairs (s, a). Proof. By direct calculation, we have: V P,π(s; ℓ) = X a π(a|s)Q P,π(s, a; ℓ) = X a π(a|s) Q P,π(s, a; ℓ) − X s ′∈Sk(s)+1 Pb(s ′ |s, a)V P,π(s; ℓ) 209 + X a π(a|s) X s ′∈Sk(s)+1 Pb(s ′ |s, a)V P,π(s ′ ; ℓ) = X s ′∈Sk(s)+1 q P,π b (s ′ |s)V P,π(s ′ ; ℓ) + X a π(a|s) Q P,π(s, a; ℓ) − X s ′∈Sk(s)+1 Pb(s ′ |s, a)V P,π(s; ℓ) = L X−1 k=k(s) X s ′∈Sk X a∈A q P,π b (s ′ , a|s) Q P,π(s ′ , a; ℓ) − X s ′′∈Sk(s′)+1 Pb(s ′′|s ′ , a)V P,π(s ′′; ℓ) = L X−1 k=k(s) X s ′∈Sk X a∈A q P,π b (s ′ , a|s)Z P,P,π b ℓ (s ′ , a) where the second to last step follows from recursively repeating the first three steps. The proof is completed by noticing q P,π, ℓ = V P,π(s0, ℓ). According to Lemma 5.7.6.1, we rewrite D q Pbt,πt , αt E and D q Pbt,π⋆ , αt E with q P,πt and q P,π⋆ for any policy π ⋆ as D q Pbt,πt , αt E = D q P,πt , Z Pbt,P,πt αt E , D q Pbt,π⋆ , αt E = D q P,π⋆ , Z Pbt,P,π⋆ αt E . Therefore, we can further decompose as E h q P,πt , ℓt − D q Pbt,πt , αt E + D q Pbt,π⋆ , αt E − D q P,π⋆ , ℓt Ei =E "X T t=1 D q P,πt , ℓt − ZPbt,P,πt αt E − D q P,π⋆ , ℓt − ZPbt,P,π⋆ αt E # =E "X T t=1 D q P,πt − q P,π⋆ , ℓt − ZPbt,P,π⋆ αt E + D q P,πt , Z Pbt,P,π⋆ αt − ZPbt,P,πt αt E # =E "X T t=1 X s̸=sL X a∈A q P,πt (s, a) − q P,π⋆ (s, a) (ℓt(s, a) − αt(s, a)) | {z } Term 1(π⋆) # 210 +E "X T t=1 X s̸=sL X a∈A q P,πt (s, a) − q P,π⋆ (s, a) X s ′∈Sk(s)+1 P(s ′ |s, a) − Pbt(s ′ |s, a) V Pbt,π⋆ (s ′ ; αt) | {z } Term 2(π⋆) # +E "X T t=1 D q P,πt , Z Pbt,P,π⋆ αt − ZPbt,P,πt αt E | {z } Term 3(π⋆) # . We will bound these terms with some self-bounding quantities in Lemma 5.7.6.2, Lemma 5.7.6.3 and Lemma 5.7.6.4 respectively. In these proofs, we will follow the idea of Lemma 5.8.1.1 to first bound these terms conditioning on the events Eest and Econ defined in Proposition 5.8.5.2 and Eq. (2.6), while ensuring that these terms are always bounded by O |S| 2 |A|T 2 in the worst case. 5.7.6.1 Bounding Term 1 Lemma 5.7.6.2. For any δ ∈ (0, 1) and any mapping π ⋆ : S → A, Algorithm 11 ensures that E [Term 1(π ⋆ )] is bounded by O S1 L 2 |S| 2 |A| log2 (ι) + δ|S| 2 |A|T 2 + C P + 1 L 2 |S| 4 |A| log2 (ι) . Proof. Clearly, we have αt(s, a) ≤ |S|T for every state-action pair (s, a), due to the fact that ut(s) ≥ 1/|S|T (Lemma 5.8.2.8). Therefore, we have Term 1(π ⋆ ) ≤ |S| 2 |A|T 2 holds always. In the remaining, our main goal is to bound Term 1(π ⋆ ) conditioning on Econ ∧ Eest. For any state-action pair (s, a) ∈ S × A and episode t, we have ℓt(s, a) − αt(s, a) = ut(s, a) − q P,πt (s, a) ℓt(s, a) ut(s, a) = ut(s) − q P,πt (s) ℓt(s, a) ut(s) ≥ 0, conditioning on the event Econ. 
Therefore, under event Econ ∧ Eest, Term 1(π ⋆ ) is bounded by X T t=1 X s̸=sL X a∈A q P,πt (s, a) − q P,π⋆ (s, a) (ℓt(s, a) − αt(s, a)) ≤ X T t=1 X s̸=sL X a∈A h q P,πt (s, a) − q P,π⋆ (s, a) i + ut(s) − q P,πt (s) ℓt(s, a) ut(s) ≤ X T t=1 X s̸=sL X a∈A q P,πt (s, a) − q P,π⋆ (s, a) + ut(s) · ut(s) − q P,πt (s) = O vuutL|S| 2|A| log2 (ι) X T t=1 X s̸=sL q P,πt (s) · X a [q P,πt (s, a) − q P,π⋆ (s, a)]+ ut(s) !2 + O C P + log (ι) L 2 |S| 4 |A| log (ι) ≤ O vuutL|S| 2|A| log2 (ι) X T t=1 X s̸=sL X a∈A [q P,πt (s, a) − q P,π⋆ (s, a)]+ + O C P + log (ι) L 2 |S| 4 |A| log (ι) ≤ O vuutL2|S| 2|A| log2 (ι) X T t=1 X s̸=sL X a̸=π⋆(s) q P,πt (s, a) + C P + log (ι) L 2 |S| 4 |A| log (ι) ≤ O vuutL2|S| 2|A| log2 (ι) X T t=1 X s̸=sL X a̸=π⋆(s) q Pt,πt (s, a) + C P + 1 L 2 |S| 4 |A| log2 (ι) , where the first step follows from the non-negativity of ℓt(s, a) − αt(s, a); the third step applies Lemma 5.8.5.4 with G = 1 as P a q P,πt (s, a) − q P,π⋆ (s, a) + ≤ q P,πt (s) ≤ ut(s); the fifth step follows from the fact that P s̸=sL P a∈A q P,πt (s, a) − q P,π⋆ (s, a) + ≤ 2L P s̸=sL P a̸=π⋆(s) q P,πt (s, a) according to Corollary 5.8.3.5; the last step applies Corollary 5.8.3.6. Applying Lemma 5.8.1.1 with event Econ ∧ Eest yields that E [Term 1(π ⋆ )] = O E vuutL2|S| 2|A| log2 (ι) X T t=1 X s̸=sL X a̸=π⋆(s) q Pt,πt (s, a) + O C P + log (ι) L 2 |S| 4 |A| log (ι) + δ|S| 2 |A|T 2 21 = O S1 L 2 |S| 2 |A| log2 (ι) + δ|S| 2 |A|T 2 + C P + log (ι) L 2 |S| 4 |A| log (ι) . 5.7.6.2 Bounding Term 2 Lemma 5.7.6.3. For any δ ∈ (0, 1) and any mapping π ⋆ : S → A, Algorithm 11 ensures that E [Term 2(π ⋆ )] = O S1 L|S| 2 |A| log2 (ι) + δL|S| 2 |A|T + L|S| 2 |A| C P + log (ι) log (T) . Proof. Suppose that Econ ∧ Eest occurs. We have P(s ′ |s, a) ≥ Pbt(s ′ |s, a) = Pe i(t) (s ′ |s, a), ∀(s, a, s′ ) ∈ Wk, k = 0, . . . , L − 1. Therefore, we can show that 0 ≤ X s ′∈Sk(s)+1 P(s ′ |s, a) − Pbt(s ′ |s, a) V Pbt,π⋆ (s ′ ; αt) = O L s |S| log (ι) mb i(t) (s, a) + |S| C P + log (ι) mb i(t) (s, a) !! , where the last step follows from the definition of optimistic transition and the fact that αt(s, a) ≤ 1. By direct calculation, we have Term 2(π ⋆ ) bounded by O L X s̸=sL X a∈A X T t=1 h q P,πt (s, a) − q P,π⋆ (s, a) i + s |S| log (ι) mb i(t) (s, a) + |S| C P + log (ι) mb i(t) (s, a) ! ≤ O L X s̸=sL X a∈A X T t=1 h q P,πt (s, a) − q P,π⋆ (s, a) i + s |S| log (ι) mb i(t) (s, a) + O L|S| X s̸=sL X a∈A X T t=1 q P,πt (s, a) C P + log (ι) mb i(t) (s, a) ! = O L X s̸=sL X a∈A X T t=1 h q P,πt (s, a) − q P,π⋆ (s, a) i + s |S| log (ι) mb i(t) (s, a) + L|S| 2 |A| C P + log (ι) log (T) , where the last step follows from Lemma 5.8.5.3. Moreover, we have X s̸=sL X a∈A X T t=1 h q P,πt (s, a) − q P,π⋆ (s, a) i + s |S| log (ι) mb i(t) (s, a) ≤ X s̸=sL X a∈A X T t=1 q [q P,πt (s, a) − q P,π⋆ (s, a)]+ · s q P,πt (s, a)|S| log (ι) mb i(t) (s, a) ≤ vuutX s̸=sL X a∈A X T t=1 [q P,πt (s, a) − q P,π⋆ (s, a)]+ · vuutX s̸=sL X a∈A X T t=1 q P,πt (s, a)|S| log (ι) mb i(t) (s, a) = O vuut|S| 2|A| log2 (ι) X s̸=sL X a∈A X T t=1 [q P,πt (s, a) − q P,π⋆ (s, a)]+ ≤ O vuutL|S| 2|A| log2 (ι) X T t=1 X s̸=sL X a̸=π⋆(s) q P,πt (s, a) ≤ O vuutL|S| 2|A| log2 (ι) X T t=1 X s̸=sL X a̸=π⋆(s) q Pt,πt (s, a) + q L2|S| 2|A| log2 CP , where the second step uses the Cauchy-Schwarz inequality; the third step applies Lemma 5.8.5.3; the fifth step follows from Corollary 5.8.3.5; the last step uses Corollary 5.8.3.6 and the fact that √ x + y ≤ √ x + √y for any x, y ≥ 0. Finally, applying Lemma 5.8.1.1 finishes the proof. 21 5.7.6.3 Bounding Term 3 Lemma 5.7.6.4. 
For any δ ∈ (0, 1) and the policy π ⋆ , Algorithm 11 ensures that E [Term 3(π ⋆ )] = O S1 L 3 |S| 2 |A| log2 (ι) + δL|S| 2 |A|T + C P + log (ι) L 2 |S| 4 |A| log (ι) . Proof. Suppose that Econ ∧ Eest occurs. We first have: D q P,πt , Z Pbt,P,π⋆ αt − ZPbt,P,πt αt E = X s̸=sL X a∈A q P,πt (s, a) X s ′∈Sk(s)+1 Pbt(s ′ |s, a) − P(s ′ |s, a) V Pbt,π⋆ (s ′ ; αt) − V Pbt,πt (s ′ ; αt) ≤ 2 X s̸=sL X a∈A X s ′∈Sk(s)+1 q P,πt (s, a)Bi(t) (s, a, s′ ) V Pbt,π⋆ (s ′ ; αt) − V Pbt,πt (s ′ ; αt) ≤ O L X−1 k=0 X (s,a,s′)∈Wk q P,πt (s, a) s P(s ′ |s, a) log (ι) mb i(t) (s, a) · V Pbt,π⋆ (s ′ ; αt) − V Pbt,πt (s ′ ; αt) + O L|S| X s̸=sL X a∈A q P,πt (s, a) C P + log (ι) mb i(t) (s, a) , where the last step follows from Lemma 5.8.2.7 and the fact that V Pbt,π(s ′ ; αt) ≤ L for any π. By applying Lemma 5.8.5.3, we have the second term bounded by O C P + log (ι) L 2 |S| 4 |A| log (ι) . On the other hand, for the first term, we can bound V Pbt,π⋆ (s ′ ; αt) − V Pbt,πt (s ′ ; αt) as V Pbt,π⋆ (s ′ ; αt) − V Pbt,πt (s ′ ; αt) ≤ X u̸=sL X v∈A q Pbt,πt (u|s ′ )|πt(v|u) − π ⋆ (v|u)| Q Pbt,π⋆ (u, v; αt) ≤ L X u∈S X v∈A q Pbt,πt (u|s ′ )|πt(v|u) − π ⋆ (v|u)| ≤ O L L X−1 k=k(s ′) X u∈Sk X v̸=π⋆(u) q Pbt,πt (u, v|s ′ ) ≤ O L L X−1 k=k(s ′) X u∈Sk X v̸=π⋆(u) q P,πt (u, v|s ′ ) where the first step follows from Lemma 5.8.3.2; the second step follows from the fact that QPbt,π⋆ (u, v; αt) ∈ [0, L]; the third step uses the same reasoning as the proof of Corollary 5.8.3.5; and the last step uses Corollary 5.7.8.2. Finally, we consider the following term X T t=1 L X−1 h=0 X (s,a,s′)∈Wh q P,πt (s, a) s P(s ′ |s, a) log (ι) mb i(t) (s, a) L X−1 k=h+1 X u∈Sk X v̸=π⋆(u) q P,πt (u, v|s ′ ) = X T t=1 L X−1 h=0 X (s,a,s′)∈Wh L X−1 k=h+1 X u∈Sk X v̸=π⋆(u) s q P,πt (s, a)q P,πt (u, v|s ′) log (ι) mb i(t) (s, a) · q q P,πt (s, a)P(s ′ |s, a)q P,πt (u, v|s ′) ≤ L X−1 h=0 vuutX T t=1 X (s,a,s′)∈Wh L X−1 k=h+1 X u∈Sk X v̸=π⋆(u) q P,πt (s, a)q P,πt (u, v|s ′) log (ι) mb i(t) (s, a) · vuutX T t=1 X (s,a,s′)∈Wh L X−1 k=h+1 X u∈Sk X v̸=π⋆(u) q P,πt (s, a)P(s ′ |s, a)q P,πt (u, v|s ′) ≤ L X−1 h=0 vuut|Sh+1|L X T t=1 X s∈Sh X a∈A q P,πt (s, a) log (ι) mb i(t) (s, a) · vuutX T t=1 X u∈S X v̸=π⋆(u) q P,πt (u, v) ≤ L X−1 h=0 q L|Sh+1||Sh||A| log2 (ι) ! · vuutX T t=1 X u∈S X v̸=π⋆(u) q P,πt (u, v) = O vuutL|S| 2|A| log2 (ι) X T t=1 X s̸=sL X a̸=π⋆(a) q P,πt (s, a) = O vuutL|S| 2|A| log2 (ι) X T t=1 X s̸=sL X a̸=π⋆(a) q Pt,πt (s, a) + q CP · L2|S| 2|A| log2 (ι) , where the second step applies Cauchy-Schwarz inequality; the third step follows from the fact that P s∈Sk P a∈A P s ′∈Sk(s)+1 q P,πt (s, a)P(s ′ |s, a)q P,πt (u, v|s ′ ) = q P,πt (u, v); the fourth step follows from 216 Lemma 5.8.5.3 conditioning on the event Eest; the fifth step uses the fact that √ x + y ≤ √ x + √y for x, y ≥ 0; the last step follows from Corollary 5.8.3.6. 5.7.7 Proof of Lemma 5.7.3.4 In this section we bound EstReg using a learning rate that depends on t and (s, a), which is crucial to obtain a self-bounding quantity. Another key observation is that the estimated transition function is constant within each epoch, so we first bound EstReg within one epoch before summing them. Recall that Ei is a set of episodes that belong to epoch i and N is the total number of epochs through T episodes. By using the fact that Pbt = Pei for episode t belonging to epoch i, we make the following decomposition EstReg(π ⋆ ) EstReg(π ⋆ ) = E X N i=1 X t∈Ei D q Pei,πt − q Pbt,π⋆ , ℓbt − bt E ≤ E X N i=1 Eti X t∈Ei D q Pei,πt − q Pbt,π⋆ , ℓbt − bt E = E "X N i=1 EstRegi(π ⋆ ) # . 
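Before restating the learning rate, the following is a hedged illustration (synthetic, uniformly drawn trajectories and assumed constants, not the actual algorithm) of the epoch structure underlying this decomposition: as stated later in Lemma 5.7.7.8, a new epoch starts whenever the visit counter of some state-action pair doubles, so the number of epochs N grows only as O(|S||A| log T).

```python
# Hedged sketch of the doubling-epoch schedule (Lemma 5.7.7.8): a new epoch
# starts whenever the visit counter of some state-action pair doubles.
# Trajectories are drawn uniformly at random purely for illustration.
import math
import random

random.seed(0)
nS, nA, L, T = 6, 3, 4, 10_000
counts = {}            # current visit counters m(s, a)
epoch_counts = {}      # counters recorded at the start of the current epoch
num_epochs = 1

for t in range(T):
    doubled = False
    for _ in range(L):  # one episode visits L state-action pairs (uniform here)
        sa = (random.randrange(nS), random.randrange(nA))
        counts[sa] = counts.get(sa, 0) + 1
        if counts[sa] >= 2 * max(epoch_counts.get(sa, 0), 1):
            doubled = True
    if doubled:          # some counter doubled: the next episode starts a new epoch
        num_epochs += 1
        epoch_counts = dict(counts)

print(f"epochs: {num_epochs}, |S||A| log2(T) = {nS * nA * math.ceil(math.log2(T))}")
```

With the epoch structure in place, the remaining ingredient of the analysis is the per-state-action learning rate.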
This learning rate is defined in Definition 5.4.2 and restated below. Definition 5.7.7.1. For any t, if it is the starting episode of an epoch, we set γt(s, a) = 256L 2 |S|; otherwise, we set γt+1(s, a) = γt(s, a) + Dνt(s, a) 2γt(s, a) , where D = 1 log(ι) and νt(s, a) = q Pe i(t) ,πt (s, a) 2 Q Pe i(t) ,πt (s, a; ℓbt) − V Pe i(t) ,πt (s; ℓbt) 2 . (5.30) 217 Importantly, both Q Pe i(t) ,πt (s, a; ℓbt) and V Pe i(t) ,πt (s; ℓbt) can be computed, which ensures that the learning rate is properly defined. 5.7.7.1 Properties of the Learning Rate In this section, we prove key properties of νt(s, a) and of γt(s, a). We first present some results in Lemma 5.7.7.2 that are useful to bound νt(s, a), and then use these results to bound γt(s, a). Lemma 5.7.7.2. For any state-action pair (s, a) and any episode t, it holds that q Pbt,πt (s, a)Q Pbt,πt (s, a; ℓbt) ≤ L, and q Pbt,πt (s)V Pbt,πt (s; ℓbt) ≤ L, ∀(s, a) ∈ S × A, which ensures q Pbt,πt (s, a) QPbt,πt (s, a; ℓbt) − V Pbt,πt (s; ℓbt) ∈ [−L, L]. Suppose that the high-probability event Econ holds. Then, it further holds for all state-action pair (s, a) that q Pbt,πt (s, a) 2 · Et Q Pbt,πt (s, a; ℓbt) − V Pbt,πt (s; ℓbt) 2 ≤ O L 2 q Pbt,πt (s, a) (1 − πt(a|s)) + L|S|C P t , where Pbt here is the optimistic transition Pe i(t) defined in Definition 5.7.1.1. Proof. We first verify q Pbt,πt (s, a)QPbt,πt (s, a; ℓbt) ≤ L: q Pbt,πt (s, a)Q Pbt,πt (s, a; ℓbt) = q Pbt,πt (s, a) X h=k(s) X u∈S X v∈A q Pbt,πt (u, v|s, a)ℓbt(u, v) = q Pbt,πt (s, a) X h=k(s) X u∈S X v∈A q Pbt,πt (u, v|s, a) · It(u, v)ℓt(u, v) ut(u, v) ≤ X h=k(s) X u∈Sh X v∈A It(u, v) · q Pbt,πt (s, a)q Pbt,πt (u, v|s, a) ut(u, v) ≤ X h=k(s) X u∈Sh X v∈A It(u, v) ≤ L, 218 where the second step follows from the definition of loss estimator, the fourth step follows from q Pbt,πt (s, a)q Pbt,πt (u, v|s, a) ≤ q Pbt,πt (u, v) ≤ ut(u, v), and the last step uses the fact that P u∈Sh P v∈A It(u, v) = 1. Following the same idea, we can show that q Pbt,πt (s)V Pbt,πt (s; ℓbt) ≤ L as well. Next, we have Et QPbt,πt (s, a; ℓbt) − V Pbt,πt (s; ℓbt) 2 bounded as Et Q Pbt,πt (s, a; ℓbt) − V Pbt,πt (s; ℓbt) 2 = Et Q Pbt,πt (s, a; ℓbt) − πt(a|s)Q Pbt,πt (s, a; ℓbt) − X a ′̸=a πt(a ′ |s)Q Pbt,πt (s, a′ ; ℓbt) 2 ≤ 2 · Et Q Pbt,πt (s, a; ℓbt) − πt(a|s)Q Pbt,πt (s, a; ℓbt) 2 + 2 · Et X a ′̸=a πt(a ′ |s)Q Pbt,πt (s, a′ ; ℓbt) 2 = 2 (1 − πt(a|s))2 Et Q Pbt,πt (s, a; ℓbt) 2 + 2Et X a ′̸=a πt(a ′ |s)Q Pbt,πt (s, a′ ; ℓbt) 2 , (5.31) where the second step follows from the fact that (x + y) 2 ≤ 2 x 2 + y 2 for any x, y ∈ R. By direct calculation, we have Et Q Pbt,πt (s, a; ℓbt) 2 = Et L X−1 k=k(s) X x∈Sk X y∈A q Pbt,πt (x, y|s, a)ℓbt(x, y) 2 ≤ L · Et L X−1 k=k(s) X x∈Sk X y∈A q Pbt,πt (x, y|s, a)ℓbt(x, y) 2 = L · Et L X−1 k=k(s) X x∈Sk X y∈A q Pbt,πt (x, y|s, a) 2 ℓbt(x, y) 2 ≤ L · L X−1 k=k(s) X x∈Sk X y∈A q Pbt,πt (x, y|s, a) 2 · q Pt,πt (x, y) ut(x, y) 2 21 = L q Pbt,πt (s, a) · L X−1 k=k(s) X x∈Sk X y∈A q Pbt,πt (x, y|s, a) q Pbt,πt (x, y|s, a)q Pbt,πt (s, a) ut(s, a) ! 
· q Pt,πt (x, y) ut(s, a) ≤ L q Pbt,πt (s, a) · L X−1 k=k(s) X x∈Sk X y∈A q Pbt,πt (x, y|s, a) q Pt,πt (x, y) ut(s, a) ≤ L q Pbt,πt (s, a) · L X−1 k=k(s) X x∈Sk X y∈A q Pbt,πt (x, y|s, a) q P,πt (x, y) ut(x, y) + C P t ut(x, y) , (5.32) where the second step uses Cauchy-Schwarz inequality; the third step uses the fact that ℓbt(s, a) · ℓbt(s ′ , a′ ) = 0 for any (s, a) ̸= (s ′ , a′ ); the fourth step takes the conditional expectation of ℓbt(x, y) 2 ; the sixth step follows from the fact that q Pbt,πt (x, y|s, a)q Pbt,πt (s, a) ≤ q Pbt,πt (x, y) ≤ ut(x, y) according to the definition of upper occupancy bound; the last step follows from Corollary 5.8.3.6. Similarly, for the second term in Eq. (5.31), we have Et X b̸=a πt(b|s)Q Pbt,πt (s, b; ℓbt) 2 ≤ L · Et L X−1 k=k(s) X x∈Sk X y∈A X b̸=a πt(b|s)q Pbt,πt (x, y|s, b) ℓbt(x, y) 2 = L · Et L X−1 k=k(s) X x∈Sk X y∈A X b̸=a πt(b|s)q Pbt,πt (x, y|s, b) 2 ℓbt(x, y) 2 ≤ L · L X−1 k=k(s) X x∈Sk X y∈A X b̸=a πt(b|s)q Pbt,πt (x, y|s, b) 2 q Pt,πt (x, y) ut(x, y) 2 ≤ L q Pbt,πt (s) L X−1 k=k(s) X x∈Sk X y∈A X b̸=a πt(b|s)q Pbt,πt (x, y|s, b) q Pt,πt (x, y) ut(x, y) ≤ L q Pbt,πt (s) X b̸=a πt(b|s) L X−1 k=k(s) X x∈Sk X y∈A q Pbt,πt (x, y|s, b) q P,πt (x, y) ut(x, y) + C P t ut(x, y) , (5.33) where the first three steps are following the same idea of previous analysis; the fourth step uses the fact that P b̸=a q Pbt,πt (s)πt(b|s)q Pbt,πt (x, y|s, b) ≤ q Pbt,πt (x, y); the last step follows from Corollary 5.8.3.6 as well. 220 Conditioning on the event Econ, we have the term in Eq. (5.32) further bounded as L q Pbt,πt (s, a) · L X−1 k=k(s) X x∈Sk X y∈A q Pbt,πt (x, y|s, a) q P,πt (x, y) ut(x, y) + C P t ut(x, y) ≤ L q Pbt,πt (s, a) · L X−1 k=k(s) X x∈Sk X y∈A q Pbt,πt (x, y|s, a) 1 + C P t ut(x, y) = L q Pbt,πt (s, a) · L X−1 k=k(s) X x∈Sk X y∈A q Pbt,πt (x, y|s, a) + L q Pbt,πt (s, a) · L X−1 k=k(s) X x∈Sk X y∈A q Pbt,πt (x, y|s, a) C P t ut(x, y) ≤ L 2 q Pbt,πt (s, a) + LCP t q Pbt,πt (s, a) · L X−1 k=k(s) X x∈Sk X y∈A q Pbt,πt (x, y|s, a) ut(x, y) ! , (5.34) where the second step follows from the fact that q P,πt (x, y) ≤ ut(x, y) as P ∈ Pi(t) according to the event Econ; the last step follows from the fact that P x∈Sk P y∈A q Pbt,πt (x, y|s, a) ≤ 1. Following the same argument, for the term in Eq. (5.33), we have L q Pbt,πt (s) X b̸=a πt(b|s) L X−1 k=k(s) X x∈Sk X y∈A q Pbt,πt (x, y|s, b) q P,πt (x, y) ut(x, y) + C P t ut(x, y) ≤ L q Pbt,πt (s) X b̸=a πt(b|s) L + L X−1 k=k(s) X x∈Sk X y∈A q Pbt,πt (x, y|s, b)C P t ut(x, y) ! ≤ L 2 (1 − πt(a|s)) q Pbt,πt (s) + LCP t q Pbt,πt (s) X b̸=a πt(b|s) L X−1 k=k(s) X x∈Sk X y∈A q Pbt,πt (x, y|s, b) ut(x, y) ! . (5.35) Plugging Eq. (5.34) and Eq. (5.35) into Eq. (5.31), we have q Pbt,πt (s, a) 2 · Et Q Pbt,πt (s, a; ℓbt) − V Pbt,πt (s; ℓbt) 2 ≤ 2q Pbt,πt (s, a) 2 (1 − πt(a|s))2 L 2 q Pbt,πt (s, a) + LCP t q Pbt,πt (s, a) · L X−1 k=k(s) X x∈Sk X y∈A q Pbt,πt (x, y|s, a) ut(x, y) ! 221 + 2q Pbt,πt (s, a) 2 L 2 (1 − πt(a|s)) q Pbt,πt (s) + LCP t q Pbt,πt (s) X b̸=a πt(b|s) L X−1 k=k(s) X x∈Sk X y∈A q Pbt,πt (x, y|s, b) ut(x, y) ! ≤ O L 2 q Pbt,πt (s, a) (1 − πt(a|s)) + O LCP t · q Pbt,πt (s, a) L X−1 k=k(s) X x∈Sk X y∈A q Pbt,πt (x, y|s, a) ut(x, y) ! + O LCP t · q Pbt,πt (s) X b̸=a πt(b|s) L X−1 k=k(s) X x∈Sk X y∈A q Pbt,πt (x, y|s, b) ut(x, y) ! ≤ O L 2 q Pbt,πt (s, a) (1 − πt(a|s)) + O LCP t · L X−1 k=k(s) X x∈Sk X y∈A q Pbt,πt (x, y|s, a)q Pbt,πt (s, a) ut(x, y) ! + O LCP t · L X−1 k=k(s) X x∈Sk X y∈A P b̸=a q Pbt,πt (s)πt(b|s)q Pbt,πt (x, y|s, b) ut(x, y) ! 
≤ O L 2 q Pbt,πt (s, a) (1 − πt(a|s)) + L|S|C P t , where the third step follows from the facts that q Pbt,πt (x, y|s, a)q Pbt,πt (s, a) ≤ q Pbt,πt (x, y) ≤ ut(x, y) for any (s, a),(x, y) ∈ S × A, and P b̸=a q Pbt,πt (s)πt(b|s)q Pbt,πt (x, y|s, b) ≤ q Pbt,πt (x, y) similarly. The first part of Lemma 5.7.7.2 ensures that νt(s, a) ∈ [0, L2 ], which can be used to bound the growth of the learning rate. Proposition 5.7.7.3. The learning rate γt defined in Definition 5.4.2 satisfies for any state-action pair (s, a): γt(s, a) ≤ vuuut D X t−1 j=ti(t) νj (s, a) + γti(t) (s, a) where i(t) is the epoch to which episode t belongs and ti is the first episode of epoch i. Proof. We prove this statement by induction on t. The equation trivially holds for t = ti(t) which is the first round of epoch i(t). For the induction step, we first note that Dνt(s, a) ∈ [0, L2 ]. We 222 introduce some notations to simplify the proof: we use c to denote Dνt(s, a), x to denote γt(s, a), and the induction hypothesis is x ≤ √ S + γ where S = D Pt−1 j=ti(t) νj (s, a) and γ = γti(t) (s, a). Proving the induction step is the same as proving: x + c 2x ≤ √ S + c + γ. (5.36) First, we can verify that f(x) = x + c 2x is an increasing function of x for x ≥ p c/2. As c ∈ [0, L2 ] and x ≥ γ ≥ L, we can use the induction hypothesis x ≤ √ S +γ to upper bound x in the left-hand side of Eq. (5.36), and we get: x + c 2x = x 2 + c/2 x ≤ S + 2γ √ S + γ 2 + c/2 √ S + γ = S + γ √ S + c/2 √ S + γ + γ √ S + γ 2 √ S + γ = √ S + γ √ S + γ √ S + c 2 √ S + 2γ + γ √ S + γ 2 √ S + γ = √ S + c 2 √ S + 2γ + γ ≤ √ S + c 2 √ S + c + γ ≤ √ S + c + γ, where the second to last step follows from c ≤ L 2 and γ ≥ L, which ensures that 2 √ S + 2√γ ≥ 2 p S + γ 2 ≥ 2 √ S + c. The last step follows from the concavity of the square-root function which ensures that √ S ≤ √ S + c − c 2 √ S+c . Then, we can use the second part of Lemma 5.7.7.2 to continue the bound of Proposition 5.7.7.3 and derive a bound that only depends on the suboptimal actions. 223 Proposition 5.7.7.4. If νt is defined as in Definition 5.4.2, we have for any deterministic policy π : S → A and any state s ̸= sL: Eti X a∈A vuut tiX +1−1 t=ti νt(s, a) ≤ O X a̸=π(s) vuuutEti L2 tiX +1−1 t=ti q Pt,πt (s, a) + O δL2 |S||A|(ti+1 − ti) + vuutL|S||A| 2 tiX +1−1 t=ti CP t . Proof. For each epoch i and state-action pair (s, a), we have: Eti vuut tiX +1−1 t=ti νt(s, a) ≤ vuuutEti tiX +1−1 t=ti νt(s, a) = O vuuutEti tiX +1−1 t=ti L2 q Pbt,πt (s, a)(1 − πt(a|s)) + L|S|CP t , (5.37) where the first step follows from Jensen’s inequality, and the second step uses Lemma 5.7.7.2. Note that, for a = π(s), we have: q Pbt,πt (s, π(s))(1 − πt(π(s)|s)) ≤ q Pbt,πt (s) X b̸=π(s) πt(b|s) ≤ X b̸=π(s) q Pbt,πt (s, b). Therefore, we have Eti X a∈A vuut tiX +1−1 t=ti νt(s, a) = O X a∈A vuuutEti tiX +1−1 t=ti L2 q Pbt,πt (s, a)(1 − πt(a|s)) + L|S|CP t = O X a∈A vuuutEti L2 tiX +1−1 t=ti q Pbt,πt (s, a)(1 − πt(a|s)) + vuutL|S| tiX +1−1 t=ti CP t 224 = O X a̸=π(s) vuuutEti L2 tiX +1−1 t=ti q Pbt,πt (s, a) + vuutL|S||A| 2 tiX +1−1 t=ti CP t ≤ O δL2 |S||A|(ti+1 − 1 − ti) + X a̸=π(s) vuuutEti L2 tiX +1−1 t=ti q P,πt (s, a) + vuutL|S||A| 2 tiX +1−1 t=ti CP t ≤ O δL2 |S||A|(ti+1 − 1 − ti) + X a̸=π(s) vuuutEti L2 tiX +1−1 t=ti q Pt,πt (s, a) + vuutL|S||A| 2 tiX +1−1 t=ti CP t , where the first step follows from Eq. (5.37); the second step uses the fact that √ x + y ≤ √ x + √y for x, y ≥ 0; the fourth step follows from Lemma 5.8.1.1 with the event Econ and Corollary 5.7.8.2; the last step applies Corollary 5.8.3.6. 
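To make the update in Definition 5.4.2 / Definition 5.7.7.1 concrete, here is a minimal numerical sketch (not the dissertation's implementation; the constants L, |S|, ι and the random draws of νt are illustrative assumptions) of the per-state-action learning-rate recursion, together with a check of the growth bound of Proposition 5.7.7.3. It only uses that νt(s, a) ∈ [0, L²], as guaranteed by the first part of Lemma 5.7.7.2.

```python
# Sketch of the adaptive learning rate: gamma is reset to 256 * L^2 * |S| at the
# first episode of an epoch and then grows as
#     gamma_{t+1} = gamma_t + D * nu_t / (2 * gamma_t),   D = 1 / log(iota).
# The assertion checks Proposition 5.7.7.3:
#     gamma_{t+1} <= sqrt(D * sum_{j<=t} nu_j) + gamma at the epoch start.
import math
import random

random.seed(0)
L, S_size, iota = 5, 10, 1e6                  # horizon, |S|, iota (illustrative values)
D = 1.0 / math.log(iota)
gamma_start = 256 * L**2 * S_size             # reset value at the start of the epoch

gamma, nu_sum = gamma_start, 0.0
for t in range(1000):                         # episodes within one epoch
    nu_t = random.uniform(0, L**2)            # stand-in for q(s,a)^2 (Q - V)^2 <= L^2
    gamma = gamma + D * nu_t / (2 * gamma)    # gamma_{t+1}
    nu_sum += nu_t
    assert gamma <= math.sqrt(D * nu_sum) + gamma_start + 1e-9

print(f"final gamma = {gamma:.6f}, bound = {math.sqrt(D * nu_sum) + gamma_start:.6f}")
```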
5.7.7.2 Bounding EstRegi (π ⋆ ) for Varying Learning Rate We now focus on bounding EstRegi(π ⋆ ) for an individual epoch i. Notations for FTRL Analysis. For any fixed epoch i and any integer t ∈ [ti , ti+1 − 1], we introduce the following notations. Ft(q) = * q, X t−1 τ=ti ℓbτ − bτ + + ϕt(q) , Gt(q) = * q, X t τ=ti ℓbτ − bτ + + ϕt(q), qt = argmin q∈Ω(Pei) Ft(q) , qet = argmin q∈Ω(Pei) Gt(q). (5.38) With these notations, we have qt = qbt = q Pbt,πt = q Pei,πt . Also, according to the loss shifting technique (specifically Corollary 5.8.4.2), qt and qet can be equivalently written as qt = argmin q∈Ω(Pei) * q, X t−1 τ=ti (gτ − bτ ) + + ϕt(q), qet = argmin q∈Ω(Pei) * q, X t τ=ti (gτ − bτ ) + + ϕt(q), (5.39) 225 where gt : S × A → R is defined as gt(s, a) = Q Pbt,πt (s, a; ℓbt) − V Pbt,πt (s; ℓbt), ∀(s, a) ∈ S × A. (5.40) Finally, We also define u = q Pei,π⋆ and v = 1 − 1 T2 u + 1 T2|A||S| X s,a q max s,a , (5.41) where q max s,a is the occupancy measure associated with a policy that maximizes the probability of visiting (s, a) given the optimistic transition Pei . Then, we have v ∈ Ω(Pei) because u, qmax s,a ∈ Ω(Pei) for all (s, a), and Ω(Pei) is convex [50, Lemma 10]. Therefore, EstRegi(π ⋆ ) can be rewritten as: EstRegi(π ⋆ ) = Eti tiX +1−1 t=ti D qt − u, ℓbt − bt E = Eti tiX +1−1 t=ti D qt − v, ℓbt − bt E | {z } =:(I) + Eti tiX +1−1 t=ti D v − u, ℓbt − bt E | {z } =:(II) , (5.42) where term (I) is equivalent to Eti tiX +1−1 t=ti ⟨qt − v, gt − bt⟩ , (5.43) because D qt − v, gt − ℓbt E = 0 according to Lemma 5.8.4.1. We then present the following lemma to provide a bound for EstRegi(π ⋆ ). 226 Lemma 5.7.7.5. For gt define in Eq. (5.40), v defined in Eq. (5.41), and any non-decreasing learning rate such that γ1(s, a) ≥ 256L 2 |S| for all (s, a), Algorithm 11 ensures that EstRegi(π ⋆ ) is bounded by O L + L|S|C P T + Eti X s̸=sL X a∈A γti+1−1(s, a) log (ι) + O min Eti tiX +1−1 t=ti ℓbt − bt 2 ∇−2ϕt(qt) , Eti tiX +1−1 t=ti ∥gt − bt∥ 2 ∇−2ϕt(qt) , (5.44) Proof. We start from the decomposition in Eq. (5.42). Bounding (I). By adding and subtracting Ft(qt)−Gt(˜qt), we decompose (I) into a stability term and a penalty term: Eti tiX +1−1 t=ti Dqt , ℓbt − bt E + Ft(qt) − Gt(˜qt) | {z } Stability term + Eti tiX +1−1 t=ti Gt(˜qt) − Ft(qt) − D v, ℓbt − bt E | {z } Penalty term . As γ1(s, a) ≥ 256L 2 |S|, Lemma 5.7.7.6 ensures 1 2 qt(s, a) ≤ qet(s, a) ≤ 2qt(s, a) for all (s, a), which allows us to apply [50, Lemma 13] to bound the stability term by Stability term = O Eti tiX +1−1 t=ti ℓbt − bt 2 ∇−2ϕt(qt) . 
(5.45) The penalty term without expectation is bounded as: tiX +1−1 t=ti Gt(˜qt) − Ft(qt) − D v, ℓbt − bt E = −Fti (qti ) + tiX +1−1 t=ti+1 (Gt−1(˜qt−1) − Ft(qt)) + Gti+1−1(˜qti+1−1) − * v, tiX +1−1 t=ti ℓbt − bt + 227 ≤ −Fti (qti ) + tiX +1−1 t=ti+1 (Gt−1(qt) − Ft(qt)) + Gti+1−1(v) − * v, tiX +1−1 t=ti ℓbt − bt + = −ϕti (qti ) + tiX +1−1 t=ti+1 (ϕt−1(qt) − ϕt(qt)) + ϕti+1−1(v) = −ϕti (qti ) + tiX +1−1 t=ti+1 X s̸=sL X a∈A (γt−1(s, a) − γt(s, a)) log 1 qt(s, a) + ϕti+1−1(v) = −ϕti (qti ) + ϕti (v) + tiX +1−1 t=ti+1 X s̸=sL X a∈A (γt(s, a) − γt−1(s, a)) log qt(s, a) v(s, a) = X s̸=sL X a∈A γti (s, a) log qti (s, a) v(s, a) + tiX +1−1 t=ti+1 X s̸=sL X a∈A (γt(s, a) − γt−1(s, a)) log qt(s, a) v(s, a) ≤ X s̸=sL X a∈A γti (s, a) log ι 2 qti (s, a) qmax s,a (s, a) + tiX +1−1 t=ti+1 X s̸=sL X a∈A (γt(s, a) − γt−1(s, a)) log ι 2 qt(s, a) qmax s,a (s, a) ≤ 2 X s̸=sL X a∈A γti (s, a) log (ι) + 2 tiX +1−1 t=ti+1 X s̸=sL X a∈A (γt(s, a) − γt−1(s, a)) log (ι) = 2 X s̸=sL X a∈A γti+1−1(s, a) log (ι), (5.46) where the second step uses the optimality of q˜t−1 and q˜ti+1−1 so that Gt−1(˜qt−1) ≤ Gt−1(qt) and Gti+1−1(˜qti+1−1) ≤ Gti+1−1(v); the third step follows the definitions in Eq. (5.38); the seventh step lower-bounds v(s, a) by 1 ι 2 q max s,a (s, a) for each (s, a); the eighth step uses the definition of q max s,a . Now, by taking the equivalent perspective in Eq. (5.39) and Eq. (5.43) and repeating the exact same argument, the same bound holds with ℓbt repalced by gt . Thus, we have shown (I) = O Eti X s̸=sL X a∈A γti+1−1(s, a) log (ι) + O min Eti tiX +1−1 t=ti ℓbt − bt 2 ∇−2ϕt(qt) , Eti tiX +1−1 t=ti ∥gt − bt∥ 2 ∇−2ϕt(qt) . (5.47) 228 Bounding (II). By direct calculation, we have (II) = Eti tiX +1−1 t=ti D v − u, ℓbt − bt E = 1 T2 Eti tiX +1−1 t=ti * 1 |A||S| X s,a q max s,a − u, ℓbt − bt + ≤ 1 T2 Eti 1 |A||S| X s,a q max s,a − u 1 tiX +1−1 t=ti (ℓt − bt) ∞ , where the first step uses the definition of v in Eq. (5.41) and the third step applies the Hölder’s inequality. Then, we further show: (II) ≤ 1 T2 Eti 1 |A||S| X s,a q max s,a − u 1 tiX +1−1 t=ti (ℓt − bt) ∞ ≤ 1 T2 Eti 1 |A||S| X s,a q max s,a 1 + ∥u∥1 ! tiX +1−1 t=ti (ℓt − bt) ∞ ≤ 1 T2 Eti 2L tiX +1−1 t=ti ℓt ∞ + tiX +1−1 t=ti bt ∞ ≤ Eti 2L(ti+1 − ti) + 4LCP|S|T T2 ≤ 2L + 4LCP|S| T , (5.48) where the second step repeatedly applies the triangle inequality, and those terms are bounded by 2L in the next step; the fourth step bounds ℓt(s, a) ≤ 1 for all (s, a) and bounds Pti+1−1 t=ti bt(s) ≤ 2C P|S|T by using the fact ut(s) ≥ 1 |S|T for all t, s (see Lemma 5.8.2.8); the fifth step bounds ti+1−ti by T. Finally, we combine the bounds in Eq. (5.47) and Eq. (5.48) to complete the proof. 229 Lemma 5.7.7.6 (Multiplicative Stability). For qet and qt defined in Eq. (5.39), if the learning rate fulfills γt(s, a) ≥ γ1(s, a) = 256L 2 |S| for all t and (s, a), then, 1 2 qt(s, a) ≤ qet(s, a) ≤ 2qt(s, a) holds for all (s, a). Proof. We first show that ℓbt − bt 2 ∇−2ϕt(qt) = X s,a It(s, a)ℓt(s, a) ut(s, a) − bt(s) 2 qt(s, a) 2 γt(s, a) ≤ X s,a It(s, a)ℓt(s, a) 2 ut(s, a) 2 + bt(s) 2 qt(s, a) 2 γt(s, a) ≤ X s,a It(s, a) ut(s, a) 2 + 16L 2 ut(s) 2 qt(s, a) 2 γt(s, a) ≤ X s,a It(s, a) 256L2|S| + X s,a 16L 2 · qt(s, a) 2 256L2|S| · ut(s) 2 ≤ 1 256L|S| + X s̸=sL 1 16|S| ≤ 1 8 , (5.49) where the second step uses (x−y) 2 ≤ x 2+y 2 for any x, y ≥ 0; the third step bounds ℓt(s, a) 2 ≤ 1 and bt(s) 2 ≤ 16L2 ut(s) 2 according to the definition of Eq. 
(5.2); the fourth step holds as γt(s, a) ≥ γ1(s, a) = 256L 2 |S| and qt(s, a) ≤ ut(s, a) for all (s, a); the fifth steps uses the fact that P a qt(s, a) 2 ≤ ( P a qt(s, a))2 ≤ ut(s) 2 . Once Eq. (5.49) holds, one only needs to repeat the same argument of [50, Lemma 12] to obtain the claimed multiplicative stability. Lemma 5.7.7.7. Under the conditions of Lemma 5.7.7.5, event Econ, and the learning γt(s, a) defined in Definition 5.4.2, the following holds for any deterministic policy π ⋆ : S → A: Eti X s̸=sL X a∈A γti+1−1(s, a) log (ι) + tiX +1−1 t=ti ∥gt − bt∥ 2 ∇−2ϕt(qt) 230 = O Eti L p log (ι) X s̸=sL X a̸=π⋆(s) vuuutE tiX +1−1 t=ti q Pt,πt (s, a) + C P log(|S|T) + O Eti δL2 |S| 2 |A| log (ι) (ti+1 − ti) + vuutL|S| 3|A| 2 log (ι) tiX +1−1 t=ti CP t . Proof. We start by bounding the second part for each t: ∥gt − bt∥ 2 ∇−2ϕt(qt) = X s,a 1 γt(s, a) qt(s, a) 2 (gt(s, a) − bt(s))2 ≤ 2 X s,a νt(s, a) γt(s, a) + 2X s,a 1 γt(s, a) qt(s, a) 2 bt(s) 2 , where the second step uses the definition of νt and (x − y) 2 ≤ 2(x 2 + y 2 ) for any x, y ∈ R. We first focus on bounding Pti+1−1 t=ti P s,a 1 γt(s,a) qt(s, a) 2 bt(s) 2 : tiX +1−1 t=ti X s,a 1 γt(s, a) qt(s, a) 2 bt(s) 2 ≤ 1 256L2|S| tiX +1−1 t=ti X s,a qt(s, a) 2 bt(s) 2 ≤ 4L 256L2|S| tiX +1−1 t=ti X s̸=sL qt(s)bt(s) = O C P log(|S|T) , (5.50) where the first step uses the fact that γt(s, a) ≥ 256L 2 |S|, the second step bounds bt(s) 2 ≤ 4Lbt(s) and P a qt(s, a) 2 ≤ qt(s), and the last step applies Lemma 5.2.1. Then, we bound the remaining term: Eti X s,a γti+1−1(s, a) log (ι) + 2 tiX +1−1 t=ti X s,a νt(s, a) γt(s, a) . (5.51) 231 Using Definition 5.4.2, we can bound Eq. (5.51) as: Eti X s,a γti+1−1(s, a) log (ι) + 2 tiX +1−1 t=ti X s,a νt(s, a) γt(s, a) = Eti "X s,a γti+1−1(s, a) log (ι) + 4 D (γti+1−1(s, a) − γti (s, a) # ≤ Eti "X s,a log (ι) + 4 D γti+1−1(s, a) # = Eti "X s,a 5 log (ι) γti+1−1(s, a) # ≤ Eti X s,a 5 log (ι) vuut 1 log (ι) tiX +1−2 j=ti νj (s, a) + 256L 2 |S| ≤ O p log (ι)Eti vuutX s,a tiX +1−1 j=ti νj (s, a) + L 2 |S| 2 |A| ≤ O Eti X s̸=sL X a̸=π(s) vuutL2 log (ι) tiX +1−1 t=ti q Pt,πt (s, a) + O Eti δL2 |S| 2 |A| log (ι) (ti+1 − ti) + vuutL|S| 3|A| 2 log (ι) tiX +1−1 t=ti CP t , (5.52) where the first step applies the definition of γt(s, a) which gives: νt(s,a) γt(s,a) = 2 D (γt+1(s, a) − γt(s, a)); the second step simplifies the telescopic sum and uses γti (s, a) ≥ 0; the third step uses D = 1 log(ι) ; the fourth step uses Proposition 5.7.7.3; the sixth step applies Proposition 5.7.7.4. 5.7.7.3 Bounding EstReg Using the equality EstReg(π ⋆ ) = E PN i=1 EstRegi(π ⋆ ) where N is the number of epochs, we are now ready to complete the proof of Lemma 5.7.3.4. Lemma 5.7.7.8. Over the course of T episodes, Algorithm 11 runs at most O (|S||A| log(T)) epochs. 232 Proof. By definition, the algorithm resets each time a counter of the number of visits to a specific state-action pair doubles. Each state-action pair is visited at most once for each of the T rounds, so it can trigger a new epoch at most log T times. Summing on all state-action pairs finishes the proof. Proof of Lemma 5.7.3.4. 
Using Lemma 5.7.7.5 and Lemma 5.7.7.7, we have EstReg(π ⋆ ) = E "X N i=1 EstRegi(π ⋆ ) # = O E X N i=1 L + LCP|S| T + δL2 |S| 2 |A| log (ι) (ti+1 − ti) + vuutL|S| 3|A| 2 log (ι) tiX +1−1 t=ti CP t + O E X N i=1 C P log(|S|T) + L p log (ι) X s̸=sL X a̸=π⋆(s) Eti vuut tiX +1−1 t=ti q Pt,πt (s, a) = O L + LCP|S| T |S||A| log(T) + δL2 |S| 2 |A| log (ι) T + |A||S| 2 log(ι) q L|A|CP + O C P |S||A| log(ι) 2 + p L2 log (ι) X s̸=sL X a̸=π⋆(s) E X N i=1 vuut tiX +1−1 t=ti q Pt,πt (s, a) = O L + LCP|S| T |S||A| log(T) + δL2 |S| 2 |A| log (ι) T + |A||S| 2 log(ι) |A| + LCP + O C P |S||A| log(ι) 2 + p L2 log (ι) X s̸=sL X a̸=π⋆(s) E X N i=1 vuut tiX +1−1 t=ti q Pt,πt (s, a) = O L|S||A| log(T) T + δL2 |S| 2 |A| log(ι)T + C PL|S| 2 |A| log(ι) + |A| 2 |S| 2 log(ι) + O C P |S||A| log(ι) 2 + q L2|S||A| log2 (ι) X s̸=sL X a̸=π⋆(s) E vuutX T t=1 q Pt,πt (s, a) , where the first step follows from the definition of EstReg (π ⋆ ); the third step uses the fact that N = O (|S||A| log (T)) and the Cauchy-Schwarz inequality; the fourth step uses √xy ≤ x + y for 233 any x, y ≥ 0 with x = LCP and y = |A|; the last step follows from Cauchy-Schwarz inequality again. 5.7.8 Properties of Optimistic Transition We summarize the properties guaranteed by the optimistic transition defined in Definition 5.7.1.1. Lemma 5.7.8.1. For any epoch i, any transition P ′ ∈ Pi, any policy π, and any initial state u ∈ S, it holds that q Pei,π(s, a|u) ≤ q P ′ ,π(s, a|u), ∀(s, a) ∈ S × A. Proof. We prove this result via a forward induction from layer k(u) to layer L − 1. Base Case: for the initial state u, q Pei,π(u, a|u) = q P ′ ,π(u, a|u) = π(a|u) for any action a ∈ A. For the other state s ∈ Sk(u) , we have q Pei,π(s, a|u) = q P ′ ,π(s, a|u) = 0. Induction step: Suppose q Pei,π(s, a|u) ≤ q P ′ ,π(s, a|u) holds for all the state-action pair (s, a) with k(s) < h. Then, for any (s, a) ∈ Sh × A, we have q Pei,π(s, a|u) = π(a|s) · X s ′∈Sh−1 X a ′∈A q Pei,π(s ′ , a′ |u)Pei(s|s ′ , a′ ) ≤ π(a|s) · X s ′∈Sh−1 X a ′∈A q P ′ ,π(s ′ , a′ |u)P ′ (s|s ′ , a′ ) = q P ′ ,π(s, a|u), where the second step follows from the induction hypothesis and the definition of optimistic transition in Definition 5.7.1.1. 234 Corollary 5.7.8.2. Conditioning on the event Econ, it holds for any epoch i and any policy π that q Pei,π(s, a) ≤ q P,π(s, a), ∀(s, a) ∈ S × A. Lemma 5.7.8.3. (Optimism of Optimistic Transition) Suppose the high-probability event Econ holds. Then for any policy π, any (s, a) ∈ S × A, and any valid loss function ℓ : S × A → R≥0, it holds that Q Pei,π (s, a; ℓ) ≤ Q P,π (s, a; ℓ), and V Pei,π (s; ℓ) ≤ V P,π (s; ℓ), ∀(s, a) ∈ S × A. Proof. According to Corollary 5.7.8.2, we have for all epoch i that q Pei,π(s, a|u) ≤ q P,π(s, a|u), ∀u ∈ S ∀(s, a) ∈ S × A. (5.53) Therefore, we have V Pei,π (s; ℓ) = X u∈S X v∈A q Pei,π(u, v|s)ℓ(u, v) ≤ X u∈S X v∈A q P,π(u, v|s)ℓ(u, v) = V P,π (s; ℓ), where the second step follows from Eq. (5.53). The statement for the Q-function can be proven in the same way. 235 Next, we argue that our optimistic transition provides a tighter performance estimation compared to the approach of Jin, Huang, and Luo [47]. Specifically, Jin, Huang, and Luo [47] proposes to subtract the following exploration bonuses Bonusi : S × A → R from the loss functions Bonusi(s, a) = L · min 1, X s ′∈Sk(s)+1 Bi(s, a, s′ ) , where Bi(s, a, s′ ) is the confidence bound defined in Eq. (2.6). This makes sure QP¯ i,π (s, a; ℓ − Bonusi) is no larger than the true Q-function QP,π(s, a; ℓ) as well, but is a looser lower bound as shown below. Lemma 5.7.8.4. 
(Tighter Performance Estimation) For any policy π, any (s, a) ∈ S ×A, and any bounded loss function ℓ : S × A → [0, 1], it holds that Q P¯ i,π (s, a; ℓ − Bonusi) ≤ Q Pei,π (s, a; ℓ), ∀(s, a) ∈ S × A. Proof. We prove this result via a backward induction from layer L to layer 0. Base Case: for the terminal state sL, we have QP¯ i,π (s, a; ℓ − Bonusi) = QPei,π (s, a; ℓ) = 0. Induction step: suppose the induction hypothesis holds for all the state-action pairs (s, a) ∈ S × A with k(s) > h. For any state-action pair (s, a) ∈ Sh × A, we first have Q P¯ i,π (s, a; ℓ − Bonusi) = ℓ(s, a) − Bonusi(s, a) + X u∈Sh+1 P¯ i(u|s, a)V P¯ i,π (u; ℓ − Bonusi) ≤ ℓ(s, a) − Bonusi(s, a) + X u∈Sh+1 P¯ i(u|s, a)V Pei,π (u; ℓ). 236 Clearly, when P u∈Sk(s)+1 Bi(s, a, u) ≥ 1, we have QP¯ i,π (s, a; ℓ − Bonusi) ≤ 0 by the definition of Bonusi , which directly implies that QP¯ i,π (s, a; ℓ − Bonusi) ≤ QPei,π (s, a; ℓ). So we continue the bound under the condition P u∈Sk(s)+1 Bi(s, a, u) < 1: Q P¯ i,π (s, a; ℓ − Bonusi) ≤ ℓ(s, a) − Bonusi(s, a) + X u∈Sh+1 P¯ i(u|s, a)V Pei,π (u; ℓ) ≤ ℓ(s, a) + X u∈Sh+1 P¯ i(u|s, a) − Bi(s, a, u) V Pei,π (u; ℓ) ≤ ℓ(s, a) + X u∈Sh+1 Pei(u|s, a)V Pei,π (u; ℓ) = Q Pei,π (s, a; ℓ), where the second step follows from the fact that V Pei,π (u; ℓ) ≤ L; the third step follows from the definition of optimistic transition Pei . Combining these two cases proves that QP¯ i,π (s, a; ℓ − Bonusi) ≤ QPei,π (s, a; ℓ) for any (s, a) ∈ Sh × A, finishing the induction. 23 5.8 Supplementary Lemmas 5.8.1 Expectation Lemma 5.8.1.1. ([47, Lemma D.3.6]) Suppose that a random variable X satisfies the following conditions: • X < R where R is a constant. • X < Y conditioning on event A, where Y ≥ 0 is a random variable. Then, it holds that E [X] ≤ E [Y ] + Pr [Ac ] · R where Ac is the complementary event of A. 5.8.2 Confidence Bound with Known Corruption In this subsection, we show that the empirical transition is centered around the transition P. Let [T] := {1, · · · , T}. Recall that Ei := {t ∈ [T] : episode t belongs to epoch i}, ι = |S||A|T δ , and we define the following quantities: Ti(s, a) = n t ∈ ∪i−1 j=1Ej : ∃k such that (st,k, at,k) = (s, a) o ∀(s, a) ∈ Sk × A, ∀i ≥ 2, ∀k < L; C P t (s, a, s′ ) = |P(s ′ |s, a) − Pt(s ′ |s, a)|, ∀(s, a, s′ ) ∈ Wk, ∀i ∈ [T], ∀k < L; C P i (s, a, s′ ) = X t∈Ti(s,a) C P t (s, a, s′ ), ∀(s, a, s′ ) ∈ Wk, ∀i ∈ [T], ∀k < L. Note that based on definition of Ti(s, a), we have mi(s, a) = |Ti(s, a)|. Then, we present the following lemma which shows the concentration bound between P(s ′ |s, a) and P¯ i(s ′ |s, a). Lemma 5.8.2.1. (Detailed restatement of Lemma 2.3.2) Event Econ occurs with probability at least 1 − δ where, Econ := ∀(s, a, s′ ) ∈ Wk, ∀i ∈ [T], ∀k < L : P(s ′ |s, a) − P¯ i(s ′ |s, a) ≤ Bi(s, a, s′ ) , (5.54) 238 and Bi(s, a, s′ ) is defined in Eq. (2.6) as: Bi(s, a, s′ ) = min ( 1, 16s P¯ i(s ′ |s, a) log (ι) mi(s, a) + 64 C P + log (ι) mi(s, a) ) . (5.55) Proving this lemma requires several auxiliary results stated below. For any episode t ∈ [T] and any layer k < L, we use s IMG t,k to denote an imaginary random state sampled from P(·|st,k, at,k). For any (s, a, s′ ) ∈ Wk, let P¯IMG i (s ′ |s, a) = 1 mi(s, a) X t∈Ti(s,a) I{s IMG t,k(s)+1 = s ′ }. We now proceed with a couple lemmas. Lemma 5.8.2.2 (Lemma 2, [46]). 
Event E 1 occurs with probability at least 1 − 3δ/4 where E 1 := n ∀(s, a, s′ ) ∈ Wk, ∀i, k : P(s ′ |s, a) − P¯IMG i (s ′ |s, a) ≤ ω¯i(s, a, s′ ) o , and ω¯i(s, a, s′ ) for any (s, a, s′ ) ∈ Wk and 0 ≤ k ≤ L − 1 is defined as ω¯i(s, a, s′ ) = min ( 1, 2 s P¯ i(s ′ |s, a) log ι mi(s, a) + 14 log ι 3mi(s, a) ) . (5.56) We note that as long as |A|T ≥ 16/3, event E 1 can occur with probability at least 1 − 3δ/4 (unlike 1 − 4δ in Lemma 2 of [46] ). Lemma 5.8.2.3. Event E 2 occurs with probability at least 1 − δ/4 where E 2 := n ∀(s, a, s′ ) ∈ Wk, ∀i, k : P¯IMG i (s ′ |s, a) − P¯ i(s ′ |s, a) ≤ ωi(s, a, s′ ) o , 23 and ωi(s, a, s′ ) for any (s, a, s′ ) ∈ Wk and 0 ≤ k ≤ L − 1 is defined as ωi(s, a, s′ ) = min ( 1, 4C P i (s, a, s′ ) mi(s, a) + s 24P(s ′ |s, a) log ι mi(s, a) + 6 log ι mi(s, a) ) . (5.57) Proof. For any fixed (s, a, s′ ), if mi(s, a) = 0, the claimed bound in E 2 holds trivially, so we consider the case mi(s, a) ̸= 0 below. By definition we have: P¯IMG i (s ′ |s, a) − P¯ i(s ′ |s, a) = 1 mi(s, a) X t∈Ti(s,a) I{s IMG t,k(s)+1 = s ′ } − I{st,k(s)+1 = s ′ } . Then, we construct the martingale difference sequence {Xt(s, a, s′ )}∞ t=1 w.r.t. filtration {Ft,k(s)}∞ t=1 (see Definition 4.9 of [62] for the formal definition of these filtrations) where Xt(s, a, s′ ) = I{s IMG t,k(s)+1 = s ′ } − I{st,k(s)+1 = s ′ } − P s ′ |s, a − Pt s ′ |s, a . With the definition of Xt(s, a, s′ ), one can show X t∈Ti(s,a) E Xt(s, a, s′ ) 2 |Ft,k(s) ≤ X t∈Ti(s,a) E I{s IMG t,k(s)+1 = s ′ } − I{st,k(s)+1 = s ′ } − P s ′ |s, a − Pt s ′ |s, a2 | Ft,k(s) ≤ X t∈Ti(s,a) E 2 I{s IMG t,k(s)+1 = s ′ } − I{st,k(s)+1 = s ′ } 2 + 2 P s ′ |s, a − Pt s ′ |s, a2 | Ft,k(s) ≤ X t∈Ti(s,a) E h 2I{s IMG t,k(s)+1 = s ′ } + 2I{st,k(s)+1 = s ′ } + 2C P t (s, a, s′ ) | Ft,k(s) i = 2 X t∈Ti(s,a) P(s ′ |s, a) + Pt(s ′ |s, a) + C P t (s, a, s′ ) , (5.58) where the second step uses (x−y) 2 ≤ 2(x 2+y 2 ) for any x, y ∈ R; the third step uses (x−y) 2 ≤ x 2+y 2 for x, y ∈ R≥0 and the fact that C P t (s, a, s′ ) ∈ [0, 1], thereby C P t (s, a, s′ ) 2 ≤ C P t (s, a, s′ ); the last holds based on the definitions of Ti(s, a), s IMG t,k(s)+1, and st,k(s)+1 as well as the fact that C P t (s, a, s′ ) is Ft,k(s) -measurable. By using the result in Eq. (5.58), we bound the average second moment σ 2 as σ 2 = P t∈Ti(s,a) E X2 t |Ft,k(s) mi(s, a) ≤ 2 P t∈Ti(s,a) P(s ′ |s, a) + Pt(s ′ |s, a) + C P t (s, a, s′ ) mi(s, a) . (5.59) By applying Lemma 5.8.2.4 with b = 2 and the upper bound of σ 2 shown in Eq. 
(5.59), as well as using the fact that mi(s, a) = |Ti(s, a)|, for any (s, a, s′ ), we have the following with probability at least 1 − δ/(4T|S| 2 |A|), |P¯IMG i (s ′ |s, a) − P¯ i(s ′ |s, a)| ≤ P(s ′ |s, a) − P t∈Ti(s,a) Pt(s ′ |s, a) mi(s, a) + 4 log 8mi(s,a)T|S| 2 |A| δ 3mi(s, a) + vuut 4 log 16mi(s,a) 2T|S| 2|A| δ P t∈Ti(s,a) P(s ′ |s, a) + Pt(s ′ |s, a) + CP t (s, a, s′) m2 i (s, a) ≤ P(s ′ |s, a) − P t∈Ti(s,a) Pt(s ′ |s, a) mi(s, a) + 8 log ι 3mi(s, a) + s 12 log ι P t∈Ti(s,a) P(s ′ |s, a) + Pt(s ′ |s, a) + CP t (s, a, s′) m2 i (s, a) ≤ P t∈Ti(s,a) P(s ′ |s, a) mi(s, a) − P t∈Ti(s,a) Pt(s ′ |s, a) mi(s, a) + 8 log ι 3mi(s, a) + s 12 log ι P t∈Ti(s,a) P(s ′ |s, a) + P(s ′ |s, a) + CP t (s, a, s′) + CP t (s, a, s′) m2 i (s, a) ≤ C P i (s, a, s′ ) mi(s, a) + s 24P(s ′ |s, a) log ι mi(s, a) + 8 log ι 3mi(s, a) + q 24CP i (s, a, s′) log ι mi(s, a) ≤ C P i (s, a, s′ ) mi(s, a) + s 24P(s ′ |s, a) log ι mi(s, a) + 8 log ι 3mi(s, a) + √ 6 C P i (s, a, s′ ) + log ι mi(s, a) ≤ ωi(s, a, s′ ), where the first inequality applies Lemma 5.8.2.4; the second inequality bounds all logarithmic terms by log ι with an appropriate constant factor (using mi(s, a) ≤ T and |A| ≥ 2); the third step follows the fact that Pt(s ′ |s, a) ≤ P(s ′ |s, a) + C P t (s, a, s′ ); the fourth step uses the definition C P i (s, a, s′ ) = P t∈Ti(s,a) C P t (s, a, s′ ); the fifth step applies the inequality √ 4xy ≤ x + y, ∀x, y ≥ 0 for x = C P i (s, a, s′ ) and y = log ι. Applying a union bound over all (s, a, s′ ) and epochs i ≤ T, we complete the proof. Lemma 5.8.2.4 (Anytime Version of Azuma-Bernstein). Let {Xi}∞ i=1 be b-bounded martingale difference sequence with respect to Fi. Let σ 2 = 1 N PN i=1 E[X2 i |Fi−1]. Then, with probability at least 1 − δ, for any N ∈ N +, it holds that: 1 N X N i=1 Xi ≤ r 2σ 2 log (4N2/δ) N + 2b log(2N/δ) 3N . Proof. This follows the same argument as Lemma G.2 in [62]. Lemma 5.8.2.5. Event E occurs with probability at least 1 − δ, where E := ∀(s, a, s′ ) ∈ Wk, ∀i, k : P(s ′ |s, a) − P¯ i(s ′ |s, a) ≤ ωi(s, a, s′ ) + ¯ωi(s, a, s′ ) . Proof. Conditioning on events E 1 and E 2 , we have P(s ′ |s, a) − P¯ i(s ′ |s, a) ≤ P(s ′ |s, a) − P¯IMG i (s ′ |s, a) + P¯IMG i (s ′ |s, a) − P¯ i(s ′ |s, a) ≤ ω¯i(s, a, s′ ) + ωi(s, a, s′ ). (5.60) Using a union bound for E 1 and E 2 , we complete the proof. Armed with above results, we are now ready to prove Lemma 5.8.2.1. 242 Proof of Lemma 5.8.2.1. Conditioning on event E, for any fixed (s, a, s′ ) with mi(s, a) ̸= 0 (otherwise the desired bound holds trivially), we have ωi(s, a, s′ ) ≤ s 24P(s ′ |s, a) log ι mi(s, a) + 6 log ι mi(s, a) + 4C P i (s, a, s′ ) mi(s, a) ≤ vuut 24 P¯ i(s ′ |s, a) + ¯ωi(s, a, s′) + 4CP i (s,a,s′) mi(s,a) + ωi(s, a, s′) log ι mi(s, a) + 6 log ι mi(s, a) + 4C P i (s, a, s′ ) mi(s, a) ≤ s 24P¯ i(s ′ |s, a) log ι mi(s, a) + s 24¯ωi(s, a, s′) log ι mi(s, a) + vuut 96CP i (s,a,s′) mi(s,a) log ι mi(s, a) + s 24ωi(s, a, s′) log ι mi(s, a) + 6 log ι mi(s, a) + 4C P i (s, a, s′ ) mi(s, a) ≤ s 24P¯ i(s ′ |s, a) log ι mi(s, a) + ¯ωi(s, a, s′ ) + q 96CP i (s, a, s′) log ι mi(s, a) + ωi(s, a, s′ ) 2 + 24 log ι + 4C P i (s, a, s′ ) mi(s, a) ≤ s 24P¯ i(s ′ |s, a) log ι mi(s, a) + ¯ωi(s, a, s′ ) + 28C P i (s, a, s′ ) mi(s, a) + ωi(s, a, s′ ) 2 + 25 log ι mi(s, a) , where the second step holds under E; the third step uses pPn i=1 xi ≤ Pn i=1 √ xi for all xi ∈ R≥0; the fourth step and the fifth step use 2 √xy ≤ x + y for x, y ≥ 0. 
Rearranging the above, we obtain ωi(s, a, s′ ) ≤ s 96P¯ i(s ′ |s, a) log ι mi(s, a) + 2¯ωi(s, a, s′ ) + 56C P i (s, a, s′ ) mi(s, a) + 50 log ι mi(s, a) (5.61) Thus, conditioning on event E, one can show for all (s, a, s′ ) ∈ Wk and k < L − 1 P(s ′ |s, a) − P¯ i(s ′ |s, a) ≤ ωi(s, a, s′ ) + ¯ωi(s, a, s′ ) ≤ s 96P¯ i(s ′ |s, a) log ι mi(s, a) + 3¯ωi(s, a, s′ ) + 56C P i (s, a, s′ ) mi(s, a) + 50 log ι mi(s, a) ≤ 16s P¯ i(s ′ |s, a) log ι mi(s, a) + 56C P i (s, a, s′ ) mi(s, a) + 64 log ι mi(s, a) 243 ≤ 16s P¯ i(s ′ |s, a) log ι mi(s, a) + 64C P i (s, a, s′ ) + log ι mi(s, a) , (5.62) where the second step uses Eq. (5.61), the third step applies the definition of ω¯i(s, a, s′ ). Finally, using the fact C P i (s, a, s′ ) ≤ C P, we complete the proof. Then, we present an immediate corollary of Lemma 5.8.2.1. Corollary 5.8.2.6. Consider any epoch i and any transition P ′ ∈ Pi. The following holds (recall mb i defined at the beginning of the appendix), ∥P ′ (·|s, a) − P¯ i(·|s, a)∥1 ≤ 2 · min 1, 32C P mb i(s, a) + 8s |Sk(s)+1| log ι mb i(s, a) + 32|Sk(s)+1| log ι mb i(s, a) . Proof. As P ′ ∈ Pi , we start from Eq. (5.62): ∥P ′ (·|s, a) − P¯ i(·|s, a)∥1 ≤ X s ′∈k(s)+1 64C P i (s, a, s′ ) mi(s, a) + 16s P¯ i(s ′ |s, a) log ι mi(s, a) + 64 log ι mi(s, a) ! ≤ 64C P mi(s, a) + 16s |Sk(s)+1| log ι mi(s, a) + 64|Sk(s)+1| log ι mi(s, a) , (5.63) where the last step uses the Cauchy-Schwarz inequality and the fact that P s ′∈k(s)+1 C P i (s, a, s′ ) ≤ C P. Since ∥P(·|s, a) − P¯ i(·|s, a)∥1 ≤ 2, we combine this trivial bound and the bound of ∥P(·|s, a) − P¯ i(·|s, a)∥1 in Eq. (5.63) to arrive at ∥P ′ (·|s, a) − P¯ i(·|s, a)∥1 ≤ 2 · min 1, 32C P mi(s, a) + 8s |Sk(s)+1| log ι mi(s, a) + 32|Sk(s)+1| log ι mi(s, a) = 2 · min 1, 32C P mb i(s, a) + 8s |Sk(s)+1| log ι mb i(s, a) + 32|Sk(s)+1| log ι mb i(s, a) , finishing the proof. 244 We conclude this subsection with two other useful lemmas. Lemma 5.8.2.7. Conditioning on event Econ, it holds for all tuple (s, a, s′ ) and epoch i that P(s ′ |s, a) − P¯ i(s ′ |s, a) ≤ O min ( 1, s P(s ′ |s, a) log(ι) mb i(s, a) + C P + log (ι) mb i(s, a) )! . Proof. Fix the epoch i and tuple (s, a, s′ ). According to the definitions of Econ in Eq. (5.54) and mb , we have P(s ′ |s, a) − P¯ i(s ′ |s, a) ≤ min ( 1, 16s P¯ i(s ′ |s, a) log(ι) mb i(s, a) + 64 · C P + log (ι) mb i(s, a) ) . Therefore, by direct calculation, we have P(s ′ |s, a) − P¯ i(s ′ |s, a) ≤ 16s P(s ′ |s, a) − P¯ i(s ′ |s, a) + P(s ′ |s, a) log (ι) mb i(s, a) + 64 · C P + log (ι) mb i(s, a) ≤ 8 s P(s ′ |s, a) log (ι) mb i(s, a) + s P(s ′ |s, a) − P¯ i(s ′ |s, a) log (ι) mb i(s, a) + 64 · C P + log (ι) mb i(s, a) = 8s P(s ′ |s, a) log (ι) mb i(s, a) + 64 · C P + log (ι) mb i(s, a) + s P(s ′ |s, a) − P¯ i(s ′ |s, a) · 64 log (ι) mb i(s, a) ≤ 8 s P(s ′ |s, a) log (ι) mb i(s, a) + 96 · C P + log (ι) mb i(s, a) + 1 2 P(s ′ |s, a) − P¯ i(s ′ |s, a) , where the second step and last step follow from the fact that √xy ≤ 1 2 (x + y) for any x, y ≥ 0. Finally, rearranging the above inequality finishes the proof. Lemma 5.8.2.8. (Lower Bound of Upper Occupancy Measure) For any episode t and state s ≠ sL, it always holds that ut(s) ≥ 1/|S|T. 24 Proof. Fix the episode t and state s. We prove the lemma by constructing a specific transition Pb ∈ Pi(t) , such that q P,π b (s) ≥ 1/|S|T for any policy π, which suffices due to the definition of ut(s). Specifically, Pb is defined as, for any tuple (s, a, s′ ) ∈ Wk and k = 0, . . . , L − 1: Pb(s ′ |s, a) = P¯ i(t) (s ′ |s, a) · 1 − 1 T + 1 |Sk+1|T . 
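As a quick numerical sanity check alongside the formal verification that follows (an illustration only, not part of the proof; the sample size and layer width below are arbitrary), the snippet confirms that this Pb is a valid probability distribution and deviates from P¯i(t) by at most 1/T per entry, which is why it lies in the confidence set whenever the confidence width Bi(s, a, s′) is at least 1/T.

```python
# Check Pb(s'|s,a) = (1 - 1/T) * Pbar(s'|s,a) + 1/(|S_{k+1}| T) on random data.
import numpy as np

rng = np.random.default_rng(1)
T, next_layer = 1000, 7
P_bar = rng.dirichlet(np.ones(next_layer), size=5)      # empirical transitions for 5 (s, a) pairs
P_hat = P_bar * (1 - 1 / T) + 1 / (next_layer * T)

assert np.allclose(P_hat.sum(axis=1), 1.0)              # each row is a probability distribution
assert np.all(np.abs(P_hat - P_bar) <= 1 / T + 1e-12)   # entrywise deviation at most 1/T
print("Pb is a valid transition within 1/T of Pbar")
```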
By direct calculation, one can verify that Pb is a valid transition function. Then, we show that Pb ∈ Pi(t) by verifying the condition for any transition tuple (s, a, s′ ) ∈ Wk and k = 0, . . . , L − 1: Pb(s ′ |s, a) − P¯ i(t) (s ′ |s, a) = 1 T · P¯ i(t) (s ′ |s, a) − 1 |Sk+1| ≤ 1 T ≤ Bi(t) (s, a, s′ ), where the last step follows from the definition of confidence intervals in Eq. (2.6). Finally, we show that q P,π b (s) ≥ 1/|S|T as: q P,π b (s) = X u∈Sk(s)−1 X v∈A q P,π b (u, v)Pb(s|u, v) ≥ X u∈Sk(s)−1 X v∈A q P,π b (u, v) · 1 T|S| = 1 T|S| , which concludes the proof. 5.8.3 Difference Lemma Lemma 5.8.3.1 (Theorem 5.2.1 of [52]). (Performance Difference Lemma) For any policies π1, π2 and any loss function ℓ : S × A → R, V P,π1 (s0; ℓ) − V P,π2 (s0; ℓ) = X s̸=sL X a∈A q P,π2 (s, a) V P,π1 (s; ℓ) − Q P,π1 (s, a; ℓ) = X s̸=sL X a∈A q P,π2 (s) (π1(a|s) − π2(a|s)) Q P,π1 (s, a; ℓ). 24 In fact, the same also holds for our optimistic transition where the layer structure is violated. For completeness, we include a proof below. Lemma 5.8.3.2. For any policies π1, π2, any loss function ℓ : S × A → R, and any transition Pe where P s ′∈Sk(s)+1 Pe(s ′ |s, a) ≤ 1 for all state-action pairs (s, a) (the remaining probability is assigned to sL), we have V P,π e 1 (s0; ℓ) − V P,π e 2 (s0; ℓ) = X s̸=sL X a∈A q P,π e 2 (s, a) V P,π e 1 (s; ℓ) − Q P,π e 1 (s, a; ℓ) . Proof. By direct calculation, we have for any state s ̸= sL: V P,π e 1 (s0; ℓ) − V P,π e 2 (s0; ℓ) = X a∈A π2(a|s0) ! V P,π e 1 (s0; ℓ) − X a∈A π2(a|s0)Q P,π e 1 (s0, a; ℓ) + X a∈A π2(a|s0) Q P,π e 1 (s0, a; ℓ) − Q P,π e 2 (s0, a; ℓ) = X a∈A q P,π e 2 (s0, a) V P,π e 1 (s0; ℓ) − Q P,π e 1 (s0, a; ℓ) + X a∈A X s ′∈S1 π2(a|s0)Pe(s ′ |s0, a) V P,π e 1 (s ′ ; ℓ) − V P,π e 2 (s ′ ; ℓ) = X a∈A q P,π e 2 (s0, a) V P,π e 1 (s0; ℓ) − Q P,π e 1 (s0, a; ℓ) + X a∈A X s ′∈S1 q P,π e 2 (s ′ ) V P,π e 1 (s ′ ; ℓ) − V P,π e 2 (s ′ ; ℓ) = X s̸=sL X a∈A q P,π e 2 (s, a) V P,π e 1 (s; ℓ) − Q P,π e 1 (s, a; ℓ) , where the last step follows from recursively repeating the first three steps. 247 Lemma 5.8.3.3. (Occupancy Measure Difference, [47, Lemma D.3.1]) For any transition functions P1, P2 and any policy π, q P1,π(s) − q P2,π(s) = k( Xs)−1 k=0 X (u,v,w)∈Wk q P1,π(u, v) (P1(w|u, v) − P2(w|u, v)) q P2,π(s|w) = k( Xs)−1 k=0 X (u,v,w)∈Wk q P2,π(u, v) (P1(w|u, v) − P2(w|u, v)) q P1,π(s|w), where q P ′ ,π(s|w) is the probability of visiting s starting w under policy π and transition P ′ . Lemma 5.8.3.4. ([26, Lemma 17]) For any policies π1, π2 and transition function P, X s̸=sL X a∈A q P,π1 (s, a) − q P,π2 (s, a) ≤ L X s̸=sL X a∈A q P,π1 (s)|π1(a|s) − π2(a|s)| . Corollary 5.8.3.5. For any policies π1, mapping π2 : S → A (that is, a deterministic policy), and transition function P, we have X s̸=sL X a∈A q P,π1 (s, a) − q P,π2 (s, a) ≤ 2L X s̸=sL X a̸=π2(s) q P,π1 (s, a). Proof. According to Lemma 5.8.3.4, we have X s̸=sL X a∈A q P,π1 (s, a) − q P,π2 (s, a) ≤ L X s̸=sL X a∈A q P,π1 (s)|π1(a|s) − π2(a|s)| . Note that, for every state s, it holds that X a∈A |π1(a|s) − π2(a|s)| = X a̸=π2(s) π1(a|s) + |π1(π2(s)|s) − 1| = 2 X a̸=π2(s) π1(a|s), 248 where the first step follows from the fact that π2(a|s) = 1 when a = π2(s), and π2(b|s) = 0 for any other action b ̸= π2(s). Therefore, we have L X s̸=sL X a∈A q P,π1 (s)|π1(a|s) − π2(a|s)| ≤ 2L X s̸=sL X a̸=π2(s) q P,π1 (s, a), which concludes the proof. According to Lemma 5.8.3.3, we can estimate the occupancy measure difference caused by the corrupted transition function Pt at episode t. Corollary 5.8.3.6. 
For any episode t and any policy π, we have q P,π(s) − q Pt,π(s) ≤ C P t , ∀s ̸= sL, and X s̸=sL q P,π(s) − q Pt,π(s) ≤ LCP t . Proof. By direct calculation, we have for any episode t and any s ̸= sL q P,π(s) − q Pt,π(s) ≤ k( Xs)−1 k=0 X (u,v,w)∈Wk q P,π(u, v)|P(w|u, v) − Pt(w|u, v)| q Pt,π(s|w) ≤ k( Xs)−1 k=0 X u∈Sk X v∈A q P,π(u, v) ∥P(·|u, v) − Pt(·|u, v)∥1 ≤ C P t , where the first step follows from Lemma 5.8.3.3, the second step bounds q Pt,π(s|w) ≤ 1, and the last two steps follows from the definition of C P t . Moreover, taking the summation over all states s ̸= sL, we have X s̸=sL q P,π(s) − q Pt,π(s) 249 ≤ X s̸=sL k( Xs)−1 k=0 X (u,v,w)∈Wk q P,π(u, v)|P(w|u, v) − Pt(w|u, v)| q Pt,π(s|w) = L X−1 k=0 X (u,v,w)∈Wk q P,π(u, v)|P(w|u, v) − Pt(w|u, v)| L X−1 h=k+1 X s∈Sh q Pt,π(s|w) ≤ L k( Xs)−1 k=0 X u∈Sk X v∈A q P,π(u, v) ∥P(·|u, v) − Pt(·|u, v)∥1 ≤ LCP t , where the third step follows from the fact that P s∈Sh q Pt,π(s|w) = 1 for any h ≥ k(w). Corollary 5.8.3.7. For any policy sequence {πt} T t=1 and loss functions {ℓt} T t=1 such that ℓt ∈ S × A → [0, 1] for any t ∈ {1, · · · , T}, it holds that X T t=1 q P,πt − q Pt,πt , ℓt ≤ LCP . Proof. By direct calculation, we have X T t=1 q P,πt − q Pt,πt , ℓt ≤ X T t=1 q P,πt − q Pt,πt 1 = X T t=1 X s̸=sL X a∈A q P,πt (s, a) − q Pt,πt (s, a) = X T t=1 X s̸=sL X a∈A q P,πt (s) − q Pt,πt (s) πt(a|s) = X T t=1 X s̸=sL q P,πt (s) − q Pt,πt (s) ≤ X T t=1 LCP t = LCP , 250 where the first step applies the Hölder’s inequality; the third step follows form the fact that q P ′ ,πt (s, a) = q P ′ ,πt (s)πt(a|s) for any state-action pair (s, a) and any transition function P ′ ; the fifth step applies Corollary 5.8.3.6; the last step follows from the definition of C P. Following the same idea in the proof of [26, Lemma 16], we also consider a tighter bound of the difference between occupancy measures in the following lemma. Lemma 5.8.3.8. Suppose the event Econ holds. For any state s ̸= sL, episode t and transition function P ′ t ∈ Pi(t) , we have q P ′ t ,πt (s) − q P,πt (s) ≤O k( Xs)−1 k=0 X (u,v,w)∈Wk q P,πt (u, v) s P(w|u, v) log (ι) mb i(t) (u, v) q P,πt (s|w) + O |S| 2 X s̸=sL X a∈A q P,πt (s, a) C P + log (ι) mb i(t) (s, a) . (5.64) Proof. According to Lemma 5.8.3.3, we have q P ′ t ,πt (s) − q P,πt (s) ≤ k( Xs)−1 k=0 X (u,v,w)∈Wk q P,πt (u, v) P(w|u, v) − P ′ t (w|u, v) q P ′ t ,πt (s|w) ≤ k( Xs)−1 k=0 X (u,v,w)∈Wk q P,πt (u, v) P(w|u, v) − P ′ t (w|u, v) q P,πt (s|w) + k( Xs)−1 k=0 X (u,v,w)∈Wk q P,πt (u, v) P(w|u, v) − P ′ t (w|u, v) k( Xs)−1 h=k+1 X (x,y,z)∈Wk q P,πt (x, y|w) P(z|x, y) − P ′ t (z|x, y) ≤ O k( Xs)−1 k=0 X (u,v,w)∈Wk q P,πt (u, v) s P(w|u, v) log (ι) mb i(t) (u, v) q P,πt (s|w) + O k( Xs)−1 k=0 X (u,v,w)∈Wk q P,πt (u, v) C P + log (ι) mb i(t) (u, v) q P,πt (s|w) | {z } Term (a) 251 + k( Xs)−1 k=0 X (u,v,w)∈Wk q P,πt (u, v) P(w|u, v) − P ′ t (w|u, v) k( Xs)−1 h=k+1 X (x,y,z)∈Wh q P,πt (x, y|w) P(z|x, y) − P ′ t (z|x, y) | {z } Term (b) , where the second step first subtracts and adds q P ′ t ,πt (s|w) and then applies Lemma 5.8.3.3 again for |q P ′ t ,πt (s|w) − q P,πt (s|w)|, and the third step follows from Lemma 5.8.2.7 . Clearly, we can bound term (a) as: k( Xs)−1 k=0 X (u,v,w)∈Wk q P,πt (u, v) C P + log (ι) mb i(t) (u, v) q P,πt (s|w) ≤ |S| X s̸=sL X a∈A q P,πt (s, a) C P + log (ι) mb i(t) (s, a) . On the other hand, for term (b), we decompose it into three terms in Eq. (5.65), Eq. (5.66), and Eq. 
(5.67): k( Xs)−1 k=0 X (u,v,w)∈Wk q P,πt (u, v) P(w|u, v) − P ′ t (w|u, v) k( Xs)−1 h=k+1 X (x,y,z)∈Wk q P,πt (x, y|w) P(z|x, y) − P ′ t (z|x, y) ≤ O k( Xs)−1 k=0 X (u,v,w)∈Wk q P,πt (u, v) s P(w|u, v) log (ι) mb i(t) (u, v) k( Xs)−1 h=k+1 X (x,y,z)∈Wh q P,πt (x, y|w) s P(z|x, y) log (ι) mb i(t) (x, y) (5.65) + O k( Xs)−1 k=0 X (u,v,w)∈Wk q P,πt (u, v) s P(w|u, v) log (ι) mb i(t) (u, v) k( Xs)−1 h=k+1 X (x,y,z)∈Wh q P,πt (x, y|w) C P + log (ι) mb i(t) (x, y) (5.66) + O k( Xs)−1 k=0 X (u,v,w)∈Wk q P,πt (u, v) C P + log (ι) mb i(t) (u, v) k( Xs)−1 h=k+1 X (x,y,z)∈Wk q P,πt (x, y|w) . (5.67) According to the AM-GM inequality, we have the term in Eq. (5.65) bounded as k( Xs)−1 k=0 X (u,v,w)∈Wk q P,πt (u, v) s P(w|u, v) log (ι) mb i(t) (u, v) k( Xs)−1 h=k+1 X (x,y,z)∈Wh q P,πt (x, y|w) s P(z|x, y) log (ι) mb i(t) (x, y) 252 = k( Xs)−1 k=0 X (u,v,w)∈Wk k( Xs)−1 h=k+1 X (x,y,z)∈Wh q P,πt (u, v) s P(w|u, v) log (ι) mb i(t) (u, v) q P,πt (x, y|w) s P(z|x, y) log (ι) mb i(t) (x, y) ≤ k( Xs)−1 k=0 X (u,v,w)∈Wk k( Xs)−1 h=k+1 X (x,y,z)∈Wh q P,πt (u, v)q P,πt (x, y|w)P(z|x, y) log (ι) mb i(t) (u, v) + k( Xs)−1 k=0 X (u,v,w)∈Wk k( Xs)−1 h=k+1 X (x,y,z)∈Wh q P,πt (u, v)P(w|u, v)q P,πt (x, y|w) log (ι) mb i(t) (x, y) = k( Xs)−1 k=0 X (u,v,w)∈Wk q P,πt (u, v) log (ι) mb i(t) (u, v) k( Xs)−1 h=k+1 X (x,y,z)∈Wh q P,πt (x, y|w)P(z|x, y) + k( Xs)−1 h=0 X (x,y,z)∈Wh log (ι) mb i(t) (x, y) X h−1 k=0 X (u,v,w)∈Wk q P,πt (u, v)P(w|u, v)q P,πt (x, y|w) ≤ L k( Xs)−1 k=0 X (u,v,w)∈Wk q P,πt (u, v) log (ι) mb i(t) (u, v) + L k( Xs)−1 h=0 X (x,y,z)∈Wh q P,πt (x, y) log (ι) mb i(t) (x, y) ≤ O L|S| L X−1 k=0 X s∈Sk X a∈A q P,πt (s, a) log (ι) mb i(t) (s, a) , where the first step follows from re-arranging the summation; the second step applies the fact that √xy ≤ x+y for any x, y ≥ 0; the third step rearranges the summation order, and the fourth step follows form the facts that P x∈Sh P y∈A q P,πt (x, y|w)P(z|x, y) = q P,πt (z|w) and P (u,v,w)∈Wk q P,πt (u, v)P(w|u, v)q P,πt (x, y|w) = q P,πt (x, y) for any k and (x, y, z). Similarly, we have the term in Eq. (5.66) bounded as k( Xs)−1 k=0 X (u,v,w)∈Wk q P,πt (u, v) s P(w|u, v) log (ι) mb i(t) (u, v) k( Xs)−1 h=k+1 X (x,y,z)∈Wh q P,πt (x, y|w) C P + log (ι) mb i(t) (x, y) ≤ k( Xs)−1 k=0 X (u,v,w)∈Wk k( Xs)−1 h=k+1 X (x,y,z)∈Wh q P,πt (u, v)q P,πt (x, y|w) log (ι) mb i(t) (u, v) + k( Xs)−1 k=0 X (u,v,w)∈Wk k( Xs)−1 h=k+1 X (x,y,z)∈Wh q P,πt (u, v)P(w|u, v)q P,πt (x, y|w) C P + log (ι) mb i(t) (x, y) ≤ k( Xs)−1 k=0 X (u,v,w)∈Wk q P,πt (u, v) log (ι) mb i(t) (u, v) k( Xs)−1 h=k+1 X (x,y,z)∈Wh q P,πt (x, y|w) 253 + L X−1 h=0 X (x,y,z)∈Wh k( Xs)−1 k=0 X (u,v,w)∈Wk q P,πt (u, v)P(w|u, v)q P,πt (x, y|w) C P + log (ι) mb i(t) (x, y) ≤ |S| k( Xs)−1 k=0 X (u,v,w)∈Wk q P,πt (u, v) log (ι) mb i(t) (u, v) + L X−1 h=0 X (x,y,z)∈Wh k( Xs)−1 k=0 q P,πt (x, y) C P + log (ι) mb i(t) (x, y) ≤ |S| 2 X s̸=sL X a∈A q P,πt (s, a) log (ι) mb i(t) (s, a) + L|S| X s̸=sL X a∈A q P,πt (s, a) C P + log (ι) mb i(t) (s, a) , where the first step follows from the fact that mb i(t) (s, a) ≥ C +log (ι) according to its definition; the third step follows from the facts that P x∈Sh P y∈A q P,πt (x, y|w) ≤ 1 and P (u,v,w)∈Wk q P,πt (u, v)P(w|u, v)q P,πt (x, y|w) = q P,πt (x, y). For the term in Eq. 
(5.67), we have k( Xs)−1 k=0 X (u,v,w)∈Wk q P,πt (u, v) C P + log (ι) mb i(t) (u, v) k( Xs)−1 h=k+1 X (x,y,z)∈Wk q P,πt (x, y|w) = k( Xs)−1 k=0 X u∈Sk X v∈A X w∈Sk+1 q P,πt (u, v) C P + log (ι) mb i(t) (u, v) k( Xs)−1 h=k+1 X z∈Sk+1 1 ≤ |S| 2 X u̸=sL X v∈A q P,πt (s, a) C P + log (ι) mb i(t) (u, v) , according to the fact that P x∈Sh P y∈A q P,πt (x, y|w) ≤ 1. Putting all the bounds for the terms in Eq. (5.65), Eq. (5.66), and Eq. (5.67) together yields the bound of Term (b) that k( Xs)−1 k=0 X (u,v,w)∈Wk q P,πt (u, v) P(w|u, v) − P ′ t (w|u, v) k( Xs)−1 h=k+1 X (x,y,z)∈Wh q P,πt (x, y|w) P(z|x, y) − P ′ t (z|x, y) = O |S| 2 X u̸=sL X v∈A q P,πt (u, v) C P + log (ι) mb i(t) (u, v) . Combining this bound with that of Term (a) finishes the proof. 254 5.8.4 Loss Shifting Technique with Optimistic Transition Lemma 5.8.4.1. Fix an optimistic transition function Pe (defined in Section 5.4). For any policy π and any loss function ℓ : S × A → R, we define function g : S × A → R similar to that of [47] as: g P,π e (s, a; ℓ) ≜ Q P,π e (s, a; ℓ) − V P,π e (s; ℓ) − ℓ(s, a) , where the state-action and state value function QP,π e and V P,π e are defined with respect to the optimistic transition Pe and π as following: Q P,π e (s, a; ℓ) = ℓ(s, a) + X s ′∈Sk(s)+1 Pe(s ′ |s, a)V P,π e (s ′ ; ℓ), V P,π e (s; ℓ) = 0, s = sL, P a∈A π(a|s)QP,π e (s, a; ℓ), s ̸= sL. Then, it holds for any policy π ′ that, D q P,π e ′ , gP,π e E = X s̸=sL X a∈A q P,π e ′ (s, a) · g P,π e (s, a; ℓ) = −V P,π e (s0; ℓ), where V P,π e (s0; ℓ) is only related to Pe, π and ℓ, and is independent with π ′ . Proof. By the extended performance difference lemma of the optimistic transition in Lemma 5.8.3.1, we have the following equality holds for any policy π ′ : V P,π e ′ (s0; ℓ) − V P,π e (s0; ℓ) = X s̸=sL X a∈A q P,π e ′ (s, a) Q P,π e (s, a; ℓ) − V P,π e (s; ℓ) . 255 On the other hand, we also have V P,π e ′ (s0; ℓ) = X s̸=sL X a∈A q P,π e ′ (s, a)ℓ(s, a). Therefore, subtracting V P,π e ′ (s0; ℓ) yields that −V P,π e (s0; ℓ) = X s̸=sL X a∈A q P,π e ′ (s, a) Q P,π e (s, a; ℓ) − V P,π e (s; ℓ) − ℓ(s, a) . The proof is finished after using the definition of g P,π e . Therefore, we have the following result for the FTRL framework which is similar to [47, Corollary A.1.2.]. Corollary 5.8.4.2. The FTRL update in Algorithm 11 can be equivalently written as: qbt = argmin q∈Ω(Pei) * q, X t−1 τ=ti (ℓbτ − bτ ) + + ϕt(q) = argmin x∈Ω(Pei) * q, X t−1 τ=ti (gτ − bτ ) + + ϕt(q), where gτ (s, a) = QPei,πτ (s, a; ℓbτ ) − V Pei,πτ (s; ℓbτ ) for any state-action pair (s, a). 5.8.5 Estimation Error Lemma 5.8.5.1. ([46, Lemma 10]) With probability at least 1−δ, we have for all k = 0, . . . , L−1, X T t=1 X (s,a)∈Sk×A q Pt,πt (s, a) max mi(t) (s, a), 1 = O |Sk||A| log (T) + log L δ , and X T t=1 X (s,a)∈Sk×A q Pt,πt (s, a) q max mi(t) (s, a), 1 = O p |Sk||A|T + |Sk||A| log (T) + log L δ . 256 Proof. Simply replacing the stationary transition P with the sequence of transitions {Pt} T t=1 in the proof of [46, Lemma 10] suffices. Proposition 5.8.5.2. Let Eest be the event such that we have for all k = 0, . . . , L−1 simultaneously X T t=1 X (s,a)∈Sk×A q Pt,πt (s, a) max mi(t) (s, a), 1 = O (|Sk||A| log (T) + log (ι)), and X T t=1 X (s,a)∈Sk×A q Pt,πt (s, a) q max mi(t) (s, a), 1 = O p |Sk||A|T + |Sk||A| log (T) + log (ι) . We have Pr[Eest] ≥ 1 − δ. Proof. The proof directly follows from the definition of ι, which ensures that ι ≥ L/δ. Based on the event Eest, we introduce the following lemma which is critical in analyzing the estimation error. Lemma 5.8.5.3. 
Suppose the event $\mathcal{E}_{\text{est}}$ defined in Proposition 5.8.5.2 holds. Then, we have for all $k = 0,\dots,L-1$,
\[
\sum_{t=1}^{T} \sum_{(s,a)\in S_k\times A} \frac{q^{P,\pi_t}(s,a)}{\widehat{m}_{i(t)}(s,a)} = \mathcal{O}\left( |S_k||A|\log(T) + \log(\iota) \right),
\]
and
\[
\sum_{t=1}^{T} \sum_{(s,a)\in S_k\times A} q^{P,\pi_t}(s,a)\,\frac{C^{P}+\log(\iota)}{\widehat{m}_{i(t)}(s,a)} = \mathcal{O}\left( \left(C^{P}+\log(\iota)\right) |S_k||A|\log(\iota) \right).
\]

Proof. According to Corollary 5.8.3.6, we have
\[
\begin{aligned}
\sum_{t=1}^{T} \sum_{(s,a)\in S_k\times A} \frac{q^{P,\pi_t}(s,a)}{\widehat{m}_{i(t)}(s,a)}
&\leq \sum_{t=1}^{T} \sum_{(s,a)\in S_k\times A} \frac{q^{P_t,\pi_t}(s,a)}{\widehat{m}_{i(t)}(s,a)} + \sum_{t=1}^{T} \sum_{(s,a)\in S_k\times A} \frac{\left| q^{P,\pi_t}(s,a) - q^{P_t,\pi_t}(s,a) \right|}{\widehat{m}_{i(t)}(s,a)} \\
&\leq \sum_{t=1}^{T} \sum_{(s,a)\in S_k\times A} \frac{q^{P_t,\pi_t}(s,a)}{\max\{1, m_{i(t)}(s,a)\}} + \sum_{t=1}^{T} \sum_{(s,a)\in S_k\times A} \frac{C^{P}_t}{C^{P}+\log(\iota)} \\
&\leq \mathcal{O}\left( |S_k||A|\log(T) + \log(\iota) \right),
\end{aligned}
\]
where the second step follows from the definitions of $C^{P}$ and $\widehat{m}_{i(t)}(s,a)$, and the last step applies Lemma 5.8.5.1. Similarly, we have
\[
\begin{aligned}
\sum_{t=1}^{T} \sum_{(s,a)\in S_k\times A} q^{P,\pi_t}(s,a)\,\frac{C^{P}+\log(\iota)}{\widehat{m}_{i(t)}(s,a)}
&\leq \sum_{t=1}^{T} \sum_{(s,a)\in S_k\times A} \left( \left| q^{P_t,\pi_t}(s,a) - q^{P,\pi_t}(s,a) \right| + q^{P_t,\pi_t}(s,a) \right) \frac{C^{P}+\log(\iota)}{\widehat{m}_{i(t)}(s,a)} \\
&\leq \sum_{t=1}^{T} \sum_{(s,a)\in S_k\times A} \left( C^{P}_t + q^{P_t,\pi_t}(s,a)\,\frac{C^{P}+\log(\iota)}{\widehat{m}_{i(t)}(s,a)} \right) \\
&\leq |S_k||A| \sum_{t=1}^{T} C^{P}_t + \left( C^{P}+\log(\iota) \right) \sum_{t=1}^{T} \sum_{(s,a)\in S_k\times A} \frac{q^{P_t,\pi_t}(s,a)}{\widehat{m}_{i(t)}(s,a)} \\
&= \mathcal{O}\left( |S_k||A|\,C^{P} + \left(C^{P}+\log(\iota)\right)\left( |S_k||A|\log(T) + \log(\iota) \right) \right) \\
&= \mathcal{O}\left( \left(C^{P}+\log(\iota)\right) |S_k||A|\log(\iota) \right),
\end{aligned}
\]
where the first step adds and subtracts $q^{P_t,\pi_t}$; the second step follows from Corollary 5.8.3.6; the fourth step uses Proposition 5.8.5.2 due to the fact that $\widehat{m}_i(s,a) \geq \max\{m_i(s,a),1\}$.

Lemma 5.8.5.4. (Extension of [26, Lemma 16] for adversarial transitions) Suppose the high-probability events $\mathcal{E}_{\text{est}}$ (defined in Proposition 5.8.5.2) and $\mathcal{E}_{\text{con}}$ (defined in Lemma 5.8.2.1) hold together. Let $P^s_t$ be a transition function in $\mathcal{P}_{i(t)}$ which depends on $s$, and let $g_t(s)\in[0,G]$ for some $G>0$. Then,
\[
\sum_{t=1}^{T}\sum_{s\neq s_L} \left| q^{P^s_t,\pi_t}(s) - q^{P,\pi_t}(s) \right| g_t(s)
= \mathcal{O}\left( \sqrt{ L|S|^2|A|\log^2(\iota) \sum_{t=1}^{T}\sum_{s\neq s_L} q^{P,\pi_t}(s)\,g_t(s)^2 } \right)
+ \mathcal{O}\left( G\left(C^{P}+\log(\iota)\right) L^2|S|^4|A|\log(\iota) \right).
\]

Proof. By Lemma 5.8.3.8, under the event $\mathcal{E}_{\text{con}}$, we have
\[
\begin{aligned}
\sum_{t=1}^{T}\sum_{s\neq s_L} \left| q^{P^s_t,\pi_t}(s) - q^{P,\pi_t}(s) \right| g_t(s)
&\leq \mathcal{O}\left( \sum_{t=1}^{T}\sum_{s\neq s_L} g_t(s) \sum_{k=0}^{k(s)-1} \sum_{(u,v,w)\in W_k} q^{P,\pi_t}(u,v) \sqrt{ \frac{P(w|u,v)\log(\iota)}{\widehat{m}_{i(t)}(u,v)} }\, q^{P,\pi_t}(s|w) \right) \\
&\quad + \mathcal{O}\left( G|S|^3 \sum_{t=1}^{T}\sum_{s\neq s_L}\sum_{a\in A} q^{P,\pi_t}(s,a)\,\frac{C^{P}+\log(\iota)}{\widehat{m}_{i(t)}(s,a)} \right). \qquad (5.68)
\end{aligned}
\]
By Lemma 5.8.5.3, the last term can be bounded under the event $\mathcal{E}_{\text{est}}$ as
\[
\mathcal{O}\left( G|S|^3 \sum_{t=1}^{T}\sum_{s\neq s_L}\sum_{a\in A} q^{P,\pi_t}(s,a)\,\frac{C^{P}+\log(\iota)}{\widehat{m}_{i(t)}(s,a)} \right)
\leq \mathcal{O}\left( G|S|^3 \sum_{k=0}^{L-1} \left(C^{P}+\log(\iota)\right)|S_k||A|\log(\iota) \right)
= \mathcal{O}\left( G\left(C^{P}+\log(\iota)\right) L|S|^4|A|\log(\iota) \right).
\]
For the first term in Eq. (5.68), we first rewrite it as
\[
\begin{aligned}
&\sum_{t=1}^{T}\sum_{s\neq s_L} \sum_{h=0}^{k(s)-1} \sum_{(u,v,w)\in W_h} q^{P,\pi_t}(u,v) \sqrt{ \frac{P(w|u,v)\log(\iota)}{\widehat{m}_{i(t)}(u,v)} }\, q^{P,\pi_t}(s|w)\,g_t(s) \\
&\leq \sum_{h=0}^{L-1} \sum_{t=1}^{T}\sum_{s\neq s_L} \sum_{(u,v,w)\in W_h} q^{P,\pi_t}(u,v) \sqrt{ \frac{P(w|u,v)\log(\iota)}{\widehat{m}_{i(t)}(u,v)} }\, q^{P,\pi_t}(s|w)\,g_t(s).
\end{aligned}
\]
For any $\theta > 0$ and layer $h$, conditioning on the event $\mathcal{E}_{\text{est}}$, we have
\[
\begin{aligned}
&\sum_{t=1}^{T}\sum_{s\neq s_L} \sum_{(u,v,w)\in W_h} q^{P,\pi_t}(u,v) \sqrt{ \frac{P(w|u,v)\log(\iota)}{\widehat{m}_{i(t)}(u,v)} }\, q^{P,\pi_t}(s|w)\,g_t(s) \\
&= \sum_{t=1}^{T}\sum_{s\neq s_L} \sum_{(u,v,w)\in W_h} \sqrt{ \frac{q^{P,\pi_t}(u,v)\,q^{P,\pi_t}(s|w)\log(\iota)}{\widehat{m}_{i(t)}(u,v)} \cdot q^{P,\pi_t}(u,v)\,P(w|u,v)\,q^{P,\pi_t}(s|w)\,g_t(s)^2 } \\
&\leq \sum_{t=1}^{T}\sum_{s\neq s_L} \sum_{(u,v,w)\in W_h} \left( \theta\cdot\frac{q^{P,\pi_t}(u,v)\,q^{P,\pi_t}(s|w)\log(\iota)}{\widehat{m}_{i(t)}(u,v)} + \frac{1}{\theta}\cdot q^{P,\pi_t}(u,v)\,P(w|u,v)\,q^{P,\pi_t}(s|w)\,g_t(s)^2 \right) \\
&= \theta \cdot \sum_{t=1}^{T}\sum_{u\in S_h}\sum_{v\in A} \frac{q^{P,\pi_t}(u,v)\log(\iota)}{\widehat{m}_{i(t)}(u,v)} \sum_{w\in S_{h+1}}\sum_{s\neq s_L} q^{P,\pi_t}(s|w)
+ \frac{1}{\theta} \cdot \sum_{t=1}^{T}\sum_{s\neq s_L} \sum_{(u,v,w)\in W_h} q^{P,\pi_t}(u,v)\,P(w|u,v)\,q^{P,\pi_t}(s|w)\,g_t(s)^2 \\
&\leq \theta \cdot L|S_{h+1}|\log(\iota) \sum_{t=1}^{T}\sum_{u\in S_h}\sum_{v\in A} \frac{q^{P,\pi_t}(u,v)}{\widehat{m}_{i(t)}(u,v)} + \frac{1}{\theta}\cdot \sum_{t=1}^{T}\sum_{s\neq s_L} q^{P,\pi_t}(s)\,g_t(s)^2 \\
&\leq \mathcal{O}\left( \theta \cdot L|S_{h+1}|\log(\iota)\left( |S_h||A|\log(T)+\log(\iota) \right) \right) + \frac{1}{\theta}\cdot \sum_{t=1}^{T}\sum_{s\neq s_L} q^{P,\pi_t}(s)\,g_t(s)^2 \\
&= \mathcal{O}\left( \theta L\,|S_h||S_{h+1}||A|\log^2(\iota) + \frac{1}{\theta}\cdot \sum_{t=1}^{T}\sum_{s\neq s_L} q^{P,\pi_t}(s)\,g_t(s)^2 \right),
\end{aligned}
\]
where the second step uses the fact that $\sqrt{xy}\leq x+y$ for all $x,y\geq 0$, and the fifth step follows from Lemma 5.8.5.3. Then, for any layer $h = 0,\dots,L-1$, picking the optimal $\theta$ gives
\[
\begin{aligned}
&\sum_{t=1}^{T}\sum_{s\neq s_L} \sum_{(u,v,w)\in W_h} q^{P,\pi_t}(u,v) \sqrt{ \frac{P(w|u,v)\log(\iota)}{\widehat{m}_{i(t)}(u,v)} }\, q^{P,\pi_t}(s|w)\,g_t(s)
= \mathcal{O}\left( \sqrt{ L|S_h||S_{h+1}||A|\log^2(\iota) \sum_{t=1}^{T}\sum_{s\neq s_L} q^{P,\pi_t}(s)\,g_t(s)^2 } \right) \\
&\leq \mathcal{O}\left( \left( |S_h| + |S_{h+1}| \right) \sqrt{ L|A|\log^2(\iota) \sum_{t=1}^{T}\sum_{s\neq s_L} q^{P,\pi_t}(s)\,g_t(s)^2 } \right).
\end{aligned}
\]
Finally, taking the summation over all the layers yields
\[
\sum_{h=0}^{L-1} \sum_{t=1}^{T}\sum_{s\neq s_L} \sum_{(u,v,w)\in W_h} q^{P,\pi_t}(u,v) \sqrt{ \frac{P(w|u,v)\log(\iota)}{\widehat{m}_{i(t)}(u,v)} }\, q^{P,\pi_t}(s|w)\,g_t(s)
\leq \mathcal{O}\left( \sqrt{ L|S|^2|A|\log^2(\iota) \sum_{t=1}^{T}\sum_{s\neq s_L} q^{P,\pi_t}(s)\,g_t(s)^2 } \right),
\]
which finishes the proof.

Chapter 6

Experiments

We provide experiments illustrating our algorithms' performance on challenging problems requiring both robustness and adaptivity. Specifically, we develop an efficient attacker which corrupts the loss and transition functions separately, up to fixed budgets, and show that our algorithms UOB-REPS (Algorithm 2) and UOB-REPS-C (Algorithm 8) significantly outperform both vanilla UCBVI [15] and UCBVI-C [62], an adaptation of UCBVI designed for corrupted losses and transitions, against the attacker in various problems. We aim to demonstrate the following two properties of our designed algorithms:

1. Whether our UOB-REPS algorithm is competitive with UCBVI in uncorrupted environments.
2. Whether our UOB-REPS-C algorithm is indeed robust to corruption and how it compares to UCBVI-C.

We organize the content of this chapter as follows: we first introduce the attacker in Section 6.1; then, we discuss the algorithms (including UCBVI and UCBVI-C) and their implementation details in Section 6.2; in Section 6.3, we evaluate their performance on three tasks (environments) and present the empirical results.

Protocol 2 Attacker-Learner-Environment Interaction
Parameters: state space S and action space A (known to the learner), transition function P, uncorrupted loss sequence $\{\tilde{\ell}_t\}_{t=1}^T$, mean loss function $\mu$, corruption budgets Budget_P and Budget_L.
Compute the optimal policy $\pi^\star$ and the optimal state-value function $V^\star$.
for t = 1 to T do
    the learner decides a policy $\pi_t$ and starts in state $s_0$
    the attacker computes the corrupted transition and loss functions:
        $P_t, \ell_t$, Budget_P, Budget_L $\leftarrow$ Attacker($\pi_t$, $P$, $\tilde{\ell}_t$, $\pi^\star$, $V^\star$, Budget_P, Budget_L)
    for k = 0 to L - 1 do
        the learner selects action $a_k \sim \pi_t(\cdot|s_k)$
        the learner observes loss $\ell_t(s_k, a_k)$
        the environment draws a new state $s_{k+1} \sim P_t(\cdot|s_k, a_k)$
        the learner observes state $s_{k+1}$

6.1 Attacker

As an important step towards robust RL, it is essential to understand the effects of adversarial attacks.
To achieve this, there have been many recent studies on adversarial attacks under various settings and approaches [16, 41, 64, 95, 85, 76, 77]. In the adversarial attack scenario, an adversary sits between the learner and the environment, and may contaminate the losses and transition functions that the learner observes. The goal of the adversary is to guide the learner toward a sub-optimal policy with a fixed amount of manipulation. Inspired by the previous work of Jun et al. [51], we design an attacker which can look into the learner's policy and corrupt the loss and transition functions efficiently.

The attacker is initialized with two parameters, Budget_L and Budget_P (non-negative integers), which denote the corruption budgets on losses and transitions separately. More precisely, the attacker can corrupt the losses of at most Budget_L state-action pairs and the transition functions of at most Budget_P state-action pairs. The interaction between the learner, the attacker, and the environment is presented in Protocol 2.

Prior to the interaction, the environment generates the sequence of loss functions $\{\tilde{\ell}_t\}_{t=1}^T$ by i.i.d. sampling from a fixed but unknown distribution; the attacker receives the mean loss function $\mu(s,a) = \mathbb{E}[\tilde{\ell}(s,a)]$ and the transition function $P$, in order to compute the optimal policy $\pi^\star$, the optimal state value function $V^\star$, and the set of good states $S_{\text{target}} = \{s : q^{P,\pi^\star}(s) > 0\}$, which includes all the states that are reachable by the optimal policy.

The pseudocode of the attacker is presented in Algorithm 12. In episode $t$, the attacker receives the uncorrupted loss function $\tilde{\ell}_t$ and transition function $P$, observes the learner's policy $\pi_t$, and computes the corrupted loss and transition $\ell_t$ and $P_t$ according to the following rule: for each state $s$ in $S_{\text{target}}$, if the learner takes the underlying optimal action with the highest probability, that is, $\pi^\star(s) = \operatorname{argmax}_{a\in A}\pi_t(a|s)$, the attacker modifies the loss and transition of this state-action pair $(s, \pi^\star(s))$; if the budget on the loss function Budget_L allows, the attacker sets $\ell_t(s, \pi^\star(s))$ to 1; if the budget on the transition function Budget_P allows, the attacker changes the transition dynamics of $(s, \pi^\star(s))$ to transit to the worst state in the next layer, that is, $P_t(s'|s, \pi^\star(s)) = 1$ where $s' = \operatorname{argmax}_{y\in S_{k(s)+1}} V^\star(y)$.

Algorithm 12 Attacker
Input: a policy $\pi_t$, uncorrupted transition function $P$, uncorrupted loss function $\tilde{\ell}_t$, optimal policy $\pi^\star$, optimal state value function $V^\star$, remaining budget on transition Budget_P, and remaining budget on loss Budget_L.
Initialize: for all state-action pairs $(s,a)$, set $P_t(\cdot|s,a) = P(\cdot|s,a)$ and $\ell_t(s,a) = \tilde{\ell}_t(s,a)$.
Compute the occupancy measure $q^{P,\pi^\star}$.
for k = 0 to L - 1 do
    for all $s \in S_k$ do
        if $\pi^\star(s) = \operatorname{argmax}_{a\in A}\pi_t(a|s)$ and $q^{P,\pi^\star}(s) > 0$ then
            if Budget_P > 0 then
                Decrease the budget on transition: Budget_P $\leftarrow$ Budget_P - 1.
                Corrupt the transition function of $(s, \pi^\star(s))$: for all $s' \in S_{k(s)+1}$,
                    $P_t(s'|s, \pi^\star(s)) \leftarrow 1$ if $s' = \operatorname{argmax}_{y\in S_{k(s)+1}} V^\star(y)$, and $0$ otherwise.
            if Budget_L > 0 then
                Decrease the budget on loss: Budget_L $\leftarrow$ Budget_L - 1.
                Corrupt the loss of $(s, \pi^\star(s))$: $\ell_t(s, \pi^\star(s)) \leftarrow 1$.
Return: transition function $P_t$, loss function $\ell_t$, updated budgets Budget_P and Budget_L.
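For illustration, the following Python sketch implements one episode of the corruption rule in Algorithm 12. It is only a sketch under simplifying assumptions (dictionary-based loss and transition tables, a deterministic optimal policy, and next-state probabilities indexed in the order of the next layer's state list); the function and variable names are illustrative rather than taken from our actual implementation.

```python
import numpy as np

def attack_episode(pi_t, P, loss_tilde, pi_star, V_star, S_target,
                   layers, budget_P, budget_L):
    """One episode of the attacker (sketch of Algorithm 12).

    pi_t[s]       : array of the learner's action probabilities at state s
    P[(s, a)]     : array of next-state probabilities, ordered as layers[k + 1]
    loss_tilde    : dict (s, a) -> uncorrupted loss in [0, 1]
    pi_star[s]    : optimal action at state s
    V_star[s]     : optimal state value (expected loss-to-go)
    S_target      : states s with q^{P, pi_star}(s) > 0
    layers[k]     : list of states in layer k
    """
    P_t = {sa: p.copy() for sa, p in P.items()}   # start from the true transition
    loss_t = dict(loss_tilde)                     # start from the uncorrupted loss

    for k in range(len(layers) - 1):
        for s in layers[k]:
            a_star = pi_star[s]
            # Attack only at states the optimal policy can reach, and only when
            # the learner currently favors the optimal action there.
            if s in S_target and int(np.argmax(pi_t[s])) == a_star:
                if budget_P > 0:
                    budget_P -= 1
                    # Send (s, a_star) deterministically to the worst next state,
                    # i.e., the one with the largest optimal loss-to-go.
                    worst = max(layers[k + 1], key=lambda y: V_star[y])
                    P_t[(s, a_star)] = np.array(
                        [1.0 if y == worst else 0.0 for y in layers[k + 1]])
                if budget_L > 0:
                    budget_L -= 1
                    loss_t[(s, a_star)] = 1.0     # maximal loss for the optimal action
    return P_t, loss_t, budget_P, budget_L
```

Note that the attacker only spends budget at states that the optimal policy visits, and only when the learner currently favors the optimal action, which is what makes the attack budget-efficient.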
Note that, for the multi-armed bandit problem (that is, the degenerate case with L = 1), this attacker is proven to be effective against any algorithm achieving an $\mathcal{O}(\log T)$ regret bound [51]: by corrupting the losses at most $\mathcal{O}(K\log T)$ times, the attacker is able to mislead the learner into taking a sub-optimal arm $\Omega(T)$ times.

6.2 Algorithms

We focus on four algorithms: UCBVI, UCBVI-C, UOB-REPS, and UOB-REPS-C.

6.2.1 UCBVI

The Upper Confidence Bound Value Iteration (UCBVI) algorithm [15] is based on the optimism in the face of uncertainty (OFU) principle. It is a direct generalization of the famous UCB algorithm for multi-armed bandits to the reinforcement learning setting. Augmented with carefully designed confidence sets, UCBVI achieves the optimal regret bound $\widetilde{\mathcal{O}}(L\sqrt{|S||A|T})$ when dealing with stochastic losses and a fixed transition function $P$ [15], which matches the lower bound in [45].

For episode $t$, the UCBVI algorithm computes the optimistic state and state-action value functions $V_t$ and $Q_t$, which are bonus-enhanced lower bounds of the optimal state and state-action value functions $V^\star$ and $Q^\star$ respectively, based on the empirical loss $\bar{\ell}_t$ and transition $\bar{P}_t$ adjusted by bonuses $b_t$ as follows:
\[
Q_t(s,a) = \max\left\{ 0,\; \bar{\ell}_t(s,a) + \sum_{s'\in S_{k(s)+1}} \bar{P}_t(s'|s,a)\cdot V_t(s') - b_t(s,a) \right\},
\qquad
V_t(s) = \min_{a\in A} Q_t(s,a),
\]
for all state-action pairs $(s,a)\in (S\setminus\{s_L\})\times A$, where the boundary condition is $V_t(s_L) = 0$.

The bonus function $b_t$, as the key component of UCBVI, is usually defined as
\[
b_t(s,a) = \mathcal{O}\left( L\cdot\min\left\{ 1,\; \sqrt{ \frac{\ln\frac{L^2|S||A|T}{\delta}}{n_t(s,a)} } \right\} \right),
\]
where $\delta$ is the confidence parameter and $n_t(s,a)$ is the number of visits to $(s,a)$ prior to episode $t$. To keep consistency with the confidence sets used by UOB-REPS, we adjust the bonus function to
\[
b_t(s,a) = \sum_{s'\in S_{k(s)+1}} B_t(s'|s,a)\cdot V_t(s') + \min\left\{ 1,\; \sqrt{ \frac{\ln\frac{L^2|S||A|T}{\delta}}{n_t(s,a)} } \right\}, \qquad (6.1)
\]
where $B_t(s'|s,a)$ is the confidence width defined in Eq. (2.5). The first term in Eq. (6.1) constructs an upper bound on the transition estimation error term $\sum_{s'\in S_{k(s)+1}} \left( \bar{P}_t(s'|s,a) - P(s'|s,a) \right) V_t(s')$, and the second term is an upper bound on the estimation error of the empirical loss $\bar{\ell}_t(s,a)$ according to Chernoff-style concentration inequalities.

6.2.2 UCBVI-C

To deal with corrupted losses and transitions, the previous work of Lykouris et al. [62] considers an adaptation of UCBVI [15] which applies enlarged confidence sets with prior knowledge of the corruption levels $C^P$ and $C^L$. Unlike [62], we distinguish the corruption measures on the loss and transition functions separately. For the confidence sets of transition functions, we simply use the enlarged confidence sets defined in Section 2.3. For loss functions, the UCBVI-C algorithm requires extra knowledge of the corresponding corruption level $C^L$ to modify the second term in Eq. (6.1) to
\[
\min\left\{ 1,\; \sqrt{ \frac{\ln\frac{L^2|S||A|T}{\delta}}{n_t(s,a)} } + \frac{C^L \ln\frac{L^2|S||A|T}{\delta}}{n_t(s,a)} \right\},
\]
where the maliciousness measure on the loss functions, $C^L$, is defined as
\[
C^L = \sum_{t=1}^{T} \sum_{k=0}^{L-1} \sum_{s\in S_k}\sum_{a\in A} \left| \ell_t(s,a) - \tilde{\ell}_t(s,a) \right|. \qquad (6.2)
\]
The attacker in Section 6.1 guarantees that $\text{Budget}_P \geq C^P$ and $\text{Budget}_L \geq C^L$. Accordingly, we initialize the UCBVI-C algorithm with Budget_P and Budget_L in the experiments on corrupted MDPs.
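Before turning to UOB-REPS, the following Python sketch illustrates the bonus-enhanced backward induction that UCBVI performs at the beginning of each episode; UCBVI-C runs the same recursion and differs only in how the bonus array is built. This is a simplified sketch under illustrative assumptions (dense per-layer arrays and the clipping at zero described above), not a description of our exact implementation.

```python
import numpy as np

def optimistic_backup(loss_bar, P_bar, bonus, layer_sizes, num_actions):
    """One backward pass of bonus-enhanced value iteration (sketch).

    loss_bar[k][s, a] : empirical mean loss of (s, a) in layer k
    P_bar[k][s, a, :] : empirical transition of (s, a) to the states of layer k + 1
    bonus[k][s, a]    : exploration bonus b_t(s, a) as in Eq. (6.1)
                        (possibly enlarged with the corruption level, as in UCBVI-C)
    Returns per-layer optimistic Q_t and V_t, with V_t = 0 at the final layer.
    """
    L = len(layer_sizes) - 1                      # layers 0, ..., L
    V = [np.zeros(layer_sizes[k]) for k in range(L + 1)]
    Q = [np.zeros((layer_sizes[k], num_actions)) for k in range(L)]
    for k in reversed(range(L)):                  # backward induction
        # Q_t(s, a) = max{0, empirical loss + expected next value - bonus}
        Q[k] = np.maximum(0.0, loss_bar[k] + P_bar[k] @ V[k + 1] - bonus[k])
        V[k] = Q[k].min(axis=1)                   # greedy (minimum-loss) value
    return Q, V

def greedy_policy(Q):
    """Greedy policy w.r.t. the optimistic Q: pick the minimum-loss action per state."""
    return [q.argmin(axis=1) for q in Q]
```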
6.2.3 UOB-REPS

We implement our UOB-REPS algorithm, which is designed for adversarial losses and a fixed transition. The pseudocode is presented in Algorithm 2. In addition, we make one modification in order to achieve better empirical results. Specifically, instead of using a fixed learning rate $\eta$, we adopt a simple learning rate schedule in which the learning rate of episode $t$ is set to $\eta_t = \eta_0/\sqrt{t}$, where $\eta_0$ is the initial learning rate. This schedule is the same as the one we used in Chapter 4, and it produces better results than a fixed learning rate.

6.2.4 UOB-REPS-C

We implement the UOB-REPS-C algorithm, which only requires the knowledge of $C^P$ to deal with the corrupted transitions. The pseudocode is presented in Algorithm 8. Similar to Section 6.2.3, we adopt the adaptive learning rate schedule. To be consistent with UCBVI-C, we initialize UOB-REPS-C with $C^P = \text{Budget}_P$ as well.

6.3 Environments

In this section, we evaluate and report the performance of the aforementioned algorithms on the following environments: Random MDP (Section 6.3.1), Diabolical Combination Lock (Section 6.3.2), and Inventory Control (Section 6.3.3).

6.3.1 Random MDP

The first environment we consider is a randomly generated MDP. We construct its layer structure as shown in Figure 6.1, with $L = 5$, $A = \{0, 1\}$, and $S = \{0, \dots, 9\}$, where each intermediate layer has exactly 2 states (that is, $|S_k| = 2$ for $k = 1, \dots, 4$). Then, we sample the loss distributions and transition function uniformly as follows: for each state-action pair $(s,a)$, the uncorrupted losses $\tilde{\ell}_t(s,a)$ are sampled i.i.d. from a fixed Bernoulli distribution with parameter $\mu(s,a)$, where $\mu(s,a)$ is initially sampled from the uniform distribution on $[0,1]$; the transition distribution $P(\cdot|s,a) \in \mathbb{R}^{|S_{k(s)+1}|}$ of $(s,a)$ is sampled uniformly over all valid distributions (a sketch of this sampling procedure is given at the end of this subsection).

Figure 6.1: MDP Structure of Random MDPs

We first evaluate the performance of the UCBVI and UOB-REPS algorithms (green and red dashed lines) on the uncorrupted losses and transitions for T = 160000 episodes. Then, we evaluate the UCBVI, UCBVI-C and UOB-REPS algorithms (solid lines) against the attacker with Budget_P = 0 and Budget_L = 120000 for the same number of episodes. Note that, because Budget_P = 0, we only use UOB-REPS instead of UOB-REPS-C for this experiment.

The experiment results are presented in Figure 6.2. Each curve shows the average cumulative regret over three random seeds, and the shaded region represents the standard deviation. Specifically, the cumulative regret with respect to a policy $\mathring{\pi}$ is defined as
\[
\text{Reg}_t(\mathring{\pi}) = \sum_{\tau=1}^{t} \ell_\tau(\pi_\tau) - \ell_\tau(\mathring{\pi}),
\]
where $\mathring{\pi}$ denotes the optimal policy with respect to the underlying transition $P$ and the average corrupted loss $\frac{1}{T}\sum_{t=1}^{T}\ell_t$.

Figure 6.2: Cumulative Regret on Random Environment

From the experiment results, we have the following findings: (1) Without corruption, the performance of UOB-REPS (green dashed line) is close to that of UCBVI (red dashed line), and the slopes of the corresponding curves decrease to nearly zero, which indicates that both algorithms achieve sub-linear regret. (2) With corruption on the loss functions, UOB-REPS (green solid line) greatly outperforms UCBVI (red solid line) and UCBVI-C (blue solid line); in particular, the curves of UCBVI and UCBVI-C are nearly linear before the 120000-th episode. This shows that our algorithm UOB-REPS is robust to corrupted losses and enjoys certain adaptivity with respect to $C^L$. (3) With corruption, UCBVI-C performs slightly better than UCBVI, which indicates that the enlarged confidence intervals provide robustness against the corrupted losses in this case.
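For completeness, the following Python sketch shows one way to generate such a random layered MDP. Drawing each transition row from a Dirichlet(1, ..., 1) distribution is the standard way to sample uniformly from the probability simplex, which is our reading of "uniformly sampled over all valid distributions"; the names below are illustrative.

```python
import numpy as np

def sample_random_mdp(layer_sizes=(1, 2, 2, 2, 2, 1), num_actions=2, rng=None):
    """Sample the Random MDP environment of Section 6.3.1 (sketch).

    Loss means mu(s, a) ~ Uniform[0, 1]; each transition row P(.|s, a) is a
    Dirichlet(1, ..., 1) draw, i.e., uniform over the simplex of the next layer.
    """
    rng = np.random.default_rng() if rng is None else rng
    mu, P = {}, {}
    for k in range(len(layer_sizes) - 1):
        for s in range(layer_sizes[k]):
            for a in range(num_actions):
                mu[(k, s, a)] = rng.uniform(0.0, 1.0)
                P[(k, s, a)] = rng.dirichlet(np.ones(layer_sizes[k + 1]))
    return mu, P

def sample_loss(mu, rng):
    """Draw one episode's uncorrupted Bernoulli losses from the means mu."""
    return {sa: float(rng.binomial(1, p)) for sa, p in mu.items()}
```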
6.3.2 Diabolical Combination Lock Problem

We provide experiments on an RL problem designed to be particularly hard for exploration: the Diabolical Combination Lock problem. This problem was first formally proposed in the work of Misra et al. [66], and later enhanced in [4]. The problem was originally designed for deep reinforcement learning tasks, in which high-dimensional observations encode the underlying states. Here, we drop those high-dimensional observations and use a fixed and finite set of states.

The MDP structure of the combination lock is shown in Figure 6.3: it consists of L layers, and each layer has exactly three states. The intermediate states are divided into dead states (in black) and good states (in gray). In each layer, one of the states is a dead state, from which the learner cannot recover and which leads to a loss of 1. The two good states in the last layer incur 0 loss. Finally, the following two design choices make the diabolical combination lock challenging for exploration. First, for the good states in the other layers, $|A|-1$ out of $|A|$ actions lead the learner to the dead state of the next layer (black arrows from the gray states), and the remaining action leads to each of the two good states in the next layer with probability 0.5 (gray arrows). By this design, uniform exploration has merely $|A|^{-L}$ probability of reaching one of the two good states in the last layer. Second, following the previous work of Agarwal et al. [4], we add a loss of $1/(6L)$ for transitioning to a good state and 0 for transitioning to a dead state, which misleads a locally optimal policy into transitioning to a dead state quickly. A sketch of this construction is given at the end of this section.

Figure 6.3: MDP Structure of a Diabolical Combination Lock

For this environment, we first evaluate the performance of the UCBVI and UOB-REPS algorithms (green and red dashed lines) on the uncorrupted losses and transitions for T = 160000 episodes. Then, we evaluate the UCBVI, UCBVI-C and UOB-REPS-C algorithms (solid lines) against the attacker with Budget_P = 1000 and Budget_L = 80000 for 160000 episodes as well. The performance is reported in Figure 6.4. For each algorithm, the corresponding curve shows the average cumulative regret across three random seeds, with the shaded region representing the standard deviation.

Figure 6.4: Cumulative Regret on Diabolical Combination Lock

From the experiment results, we have several findings: (1) With corrupted MDPs, UOB-REPS-C (green solid line) achieves better performance than UCBVI (red solid line) and UCBVI-C (blue solid line). Importantly, the slope of the curve of UOB-REPS-C is clearly decreasing, while the curves of UCBVI and UCBVI-C are nearly linear. This finding is expected and shows the robustness of UOB-REPS-C against corruption. (2) Compared to the performance of UCBVI with corrupted MDPs (red solid line), UCBVI-C performs slightly better (blue solid line), which shows the robustness brought by the enlarged confidence intervals. (3) The huge performance drop of UCBVI after the MDPs are corrupted (red solid line) shows the vulnerability of UCBVI against corruption. (4) Without corruption, UOB-REPS (green dashed line) achieves competitive performance with respect to UCBVI (red dashed line).
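To make the construction above concrete, here is a minimal Python sketch of the combination lock transitions and losses referenced earlier in this section. Several details are assumptions made only for illustration, in particular where the loss of 1 for the dead states is charged and the fact that each good state draws its own hidden good action; the names are illustrative.

```python
import numpy as np

GOOD_A, GOOD_B, DEAD = 0, 1, 2   # the three states of every layer after layer 0

def build_combination_lock(L, num_actions, rng=None):
    """Diabolical combination lock of Section 6.3.2 (illustrative sketch).

    At each good state of layer k < L, one hidden action leads to each of the two
    good states of layer k + 1 with probability 0.5; every other action leads to
    the dead state, which is absorbing. Losses: 1/(6L) for an action that moves
    to a good state, 0 for an action that moves to the dead state, and a loss of
    1 charged at the last dead state (an assumption on where that loss appears).
    """
    rng = np.random.default_rng() if rng is None else rng
    P, loss = {}, {}                 # keyed by (layer, state, action)
    anti_shaping = 1.0 / (6 * L)
    for k in range(L):
        good_states = [GOOD_A] if k == 0 else [GOOD_A, GOOD_B]   # layer 0: start state only
        for s in good_states:
            good_action = rng.integers(num_actions)              # the hidden "combination"
            for a in range(num_actions):
                if a == good_action:
                    P[(k, s, a)] = np.array([0.5, 0.5, 0.0])      # to the two good states
                    loss[(k, s, a)] = anti_shaping
                else:
                    P[(k, s, a)] = np.array([0.0, 0.0, 1.0])      # to the dead state
                    loss[(k, s, a)] = 0.0
        if k > 0:                                                 # dead state is absorbing
            for a in range(num_actions):
                P[(k, DEAD, a)] = np.array([0.0, 0.0, 1.0])
                loss[(k, DEAD, a)] = 1.0 if k == L - 1 else 0.0
    return P, loss
```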
6.3.3 Inventory Control Problem

Inventory control is one of the major problems in operations research and operations management, and it plays a critical role in supply chains and logistics. It depicts the realistic problem faced by a manager who has to decide how much to order at each time period in order to meet demand.

The stochastic inventory control problem is one of the most studied problems in inventory theory, initiated by the seminal work of [36]. In recent years, tremendous progress has been made on this problem, much of it related to the integration of reinforcement learning theory. We refer the readers to the recent survey [19] for more details.

In this section, we consider a single-product stochastic inventory control problem across $L$ days, whose dynamics are defined by
\[
i_{k+1} = \min\left\{ \max\{ i_k + o_k - d_k,\, 0 \},\, D \right\}, \qquad k = 0, \dots, L-1,
\]
where $i_k \in \{0, \dots, D\}$ is the amount of inventory available at the end of day $k$, with $D$ being the maximum number of units one can hold; $o_k \in \{0, \dots, M\}$ is the amount of inventory ordered on day $k$, with $M$ being the maximum number of units one can order per day; and $d_k$ is the demand that occurs during the daytime of the $(k+1)$-th day, which is assumed to follow a fixed but unknown distribution. Initially, $i_0 = 0$ indicates that there is no inventory at the beginning. The cost of day $k$ is defined as
\[
c(i_k, o_k, d_k) = \mathbb{I}\{o_k > 0\}\cdot f + o_k\cdot v + i_k\cdot h - \min\{d_k, i_k\}\cdot p,
\]
where $\mathbb{I}\{\cdot\}$ is the indicator function whose value is 1 if the input holds true and 0 otherwise, $f$ is the fixed ordering cost incurred per order, $v$ is the variable ordering cost per unit, $h$ is the holding cost per unit, and $p$ is the unit selling price. The objective of this problem is to minimize the expected total cost over the $L$ days, that is, $\mathbb{E}\left[\sum_{k=0}^{L-1} c(i_k, o_k, d_k)\right]$.

The inventory control problem involves making decisions sequentially, so it can be directly modeled in the MDP framework. The horizon $L$ denotes the number of days, and $T$ denotes the number of episodes. Here, an episode represents a regular time interval (such as a week, month, or season) after which the retailer needs to clear out the inventory and start a new round of selling (perhaps with new products). For example, clothing stores usually have seasonal clearance sales to make room for new items. The state space is constructed as follows: the initial state $s_0$ in layer 0 represents the initial amount of inventory, $i_0 = 0$; for layer $k = 1, \dots, L$, we use $D+1$ states to represent $i_k$ from 0 to $D$, that is, the $i$-th state in layer $k$ indicates that the learner holds $i$ units at the end of the $k$-th day. For notational convenience, we denote the number of units represented by state $s$ by $i(s)$. Then, we use $M+1$ actions to denote the $M+1$ options of ordering 0 to $M$ units, where $o(a)$ denotes the number of units ordered under action $a$. Finally, we construct the loss function of episode $t$ by $\ell_t(s,a) = c(i(s), o(a), d_{k(s)})$, where $d_{k(s)}$ is sampled i.i.d. from a fixed but unknown distribution, and we compute the transition function $P(s'|s,a)$ accordingly with respect to the given distribution of demands. A minimal sketch of these dynamics and costs is given after the instance descriptions below.

We consider two instances with $L = 6$ and $M = 2$: (1) a small inventory control problem with $D = 2$, $p = 0.2$, $f = 0.1$, $v = 0.05$, $h = 0.005$; and (2) a large inventory control problem with $D = 4$, $p = 0.1$, $f = 0.05$, $v = 0.025$, $h = 0.0075$. The demand distributions of the two instances are shown in Table 6.1, while the remaining parameters are as listed above.

Table 6.1: Distribution of Demand d

                             d = 0    d = 1    d = 2    d = 3
    Small Inventory Control   0.3      0.6      0.1      0
    Large Inventory Control   0.3      0.4      0.2      0.1
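The following Python sketch makes the dynamics and the daily cost above concrete, together with a small rollout under the demand distribution of the small instance in Table 6.1. The policy interface and function names are illustrative assumptions, not part of the actual experimental code.

```python
import numpy as np

def inventory_step(i_k, o_k, d_k, D):
    """One day of the dynamics: i_{k+1} = min{max{i_k + o_k - d_k, 0}, D}."""
    return min(max(i_k + o_k - d_k, 0), D)

def daily_cost(i_k, o_k, d_k, f, v, h, p):
    """Cost of day k: fixed + variable ordering cost + holding cost - sales revenue."""
    return (f if o_k > 0 else 0.0) + o_k * v + i_k * h - min(d_k, i_k) * p

def simulate_episode(policy, demand_probs, L, D, M, f, v, h, p, rng=None):
    """Roll out one episode (L days) under policy(k, i_k) -> order in {0, ..., M}."""
    rng = np.random.default_rng() if rng is None else rng
    i, total = 0, 0.0                                       # start with empty inventory
    for k in range(L):
        o = policy(k, i)
        d = rng.choice(len(demand_probs), p=demand_probs)   # demand during the next day
        total += daily_cost(i, o, d, f, v, h, p)
        i = inventory_step(i, o, d, D)
    return total

# Example: the small instance of Table 6.1 with a naive order-one-unit policy.
small_demand = [0.3, 0.6, 0.1, 0.0]
cost = simulate_episode(lambda k, i: 1, small_demand, L=6, D=2, M=2,
                        f=0.1, v=0.05, h=0.005, p=0.2)
```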
We first evaluate the performance of the UCBVI and UOB-REPS algorithms (green and red dashed lines) on the uncorrupted losses and transitions for T = 320000 episodes. Then, we evaluate the UCBVI, UCBVI-C and UOB-REPS-C algorithms (solid lines) against the attacker with Budget_P = 1000 and Budget_L = 160000. The experiment results for the two instances are reported in Figure 6.5 and Figure 6.6, where each curve represents the average cumulative regret and the shaded region represents the standard deviation.

Figure 6.5: Cumulative Regret on Small Inventory Control

Figure 6.6: Cumulative Regret on Large Inventory Control

From the experiment results, we have several findings: (1) With corrupted MDPs, UOB-REPS-C (green solid lines) achieves better performance than UCBVI (red solid lines) and UCBVI-C (blue solid lines) in both instances. While the curves of UCBVI and UCBVI-C are nearly linear, the slopes of the corresponding curves of UOB-REPS-C are gradually decreasing. Moreover, these two curves (green solid lines) gradually approach those of UOB-REPS without corruption (green dashed lines). This finding clearly shows the robustness of UOB-REPS-C against corruption. (2) Surprisingly, UCBVI (red solid lines) performs slightly better than UCBVI-C (blue solid lines) with corrupted MDPs in both instances. One possible explanation of this phenomenon is that the attacker works more effectively against UCBVI-C. (3) Without corruption, UOB-REPS (green dashed line) achieves competitive performance with respect to UCBVI (red dashed line), which indicates that UOB-REPS has certain adaptivity in this setup.

Bibliography

[1] Yasin Abbasi Yadkori, Peter L Bartlett, Varun Kanade, Yevgeny Seldin, and Csaba Szepesvari. “Online Learning in Markov Decision Processes with Adversarially Chosen Transition Probability Distributions”. In: Advances in Neural Information Processing Systems (NeurIPS). 2013. [2] Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. “Improved algorithms for linear stochastic bandits”. In: Proceedings of the International Conference on Neural Information Processing Systems. 2011. [3] Jacob D Abernethy, Elad Hazan, and Alexander Rakhlin. “Competing in the Dark: An Efficient Algorithm for Bandit Linear Optimization”. In: Proceedings of the Annual Conference on Learning Theory. 2008. [4] Alekh Agarwal, Mikael Henaff, Sham Kakade, and Wen Sun. “Pc-pg: Policy cover directed exploration for provable policy gradient learning”. In: Advances in Neural Information Processing Systems (NeurIPS) 33 (2020), pp. 13399–13412. [5] Alekh Agarwal, Haipeng Luo, Behnam Neyshabur, and Robert E Schapire. “Corralling a band of bandit algorithms”. In: Proceedings of the International Conference on Computational Learning Theory (COLT). 2017. [6] Naveed Akhtar, Ajmal Mian, Navid Kardan, and Mubarak Shah. “Advances in adversarial attacks and defenses in computer vision: A survey”. In: IEEE Access 9 (2021), pp. 155161–155196. [7] Chamy Allenberg, Peter Auer, László Györfi, and György Ottucsák. “Hannan consistency in on-line learning in case of unbounded losses under partial monitoring”. In: Proceedings of the international conference on Algorithmic Learning Theory. 2006, pp. 229–243. [8] Eitan Altman. Constrained Markov decision processes. Vol. 7. CRC Press, 1999. [9] Idan Amir, Idan Attias, Tomer Koren, Roi Livni, and Yishay Mansour. “Prediction with corrupted expert advice”. In: Advances in Neural Information Processing Systems (2020). [10] Raman Arora, Ofer Dekel, and Ambuj Tewari. “Deterministic MDPs with Adversarial Rewards and Bandit Feedback”. In: Proceedings of the Conference on Uncertainty in Artificial Intelligence. 2012, pp. 93–101. [11] Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer.
“Finite-time analysis of the multiarmed bandit problem”. In: Machine learning (2002). [12] Peter Auer, Nicolo Cesa-Bianchi, Yoav Freund, and Robert E Schapire. “The nonstochastic multiarmed bandit problem”. In: SIAM journal on computing 32.1 (2002), pp. 48–77. [13] Peter Auer and Chao-Kai Chiang. “An algorithm with nearly optimal pseudo-regret for both stochastic and adversarial bandits”. In: Proceedings of the Annual Conference on Learning Theory. 2016. [14] Peter Auer, Thomas Jaksch, and Ronald Ortner. “Near-optimal regret bounds for reinforcement learning”. In: Advances in Neural Information Processing Systems (NeurIPS) (2008). [15] Mohammad Gheshlaghi Azar, Ian Osband, and Rémi Munos. “Minimax regret bounds for reinforcement learning”. In: Proceedings of the 34th International Conference on Machine Learning-Volume 70. JMLR. org. 2017, pp. 263–272. [16] Vahid Behzadan and Arslan Munir. “Vulnerability of deep reinforcement learning to policy induction attacks”. In: International Conference on Machine Learning and Data Mining in Pattern Recognition. Springer. 2017, pp. 262–275. [17] Alina Beygelzimer, John Langford, Lihong Li, Lev Reyzin, and Robert Schapire. “Contextual bandit algorithms with supervised learning guarantees”. In: Proceedings of the International Conference on Artificial Intelligence and Statistics. 2011. [18] Sébastien Bubeck and Aleksandrs Slivkins. “The best of both worlds: Stochastic and adversarial bandits”. In: Proceedings of the 23rd Annual Conference on Learning Theory. 2012. [19] Vaibhav Chaudhary, Rakhee Kulshrestha, and Srikanta Routroy. “State-of-the-art literature review on inventory models for perishable products”. In: Journal of Advances in Management Research 15.3 (2018), pp. 306–346. [20] Yifang Chen, Simon Du, and Kevin Jamieson. “Improved corruption robust algorithms for episodic reinforcement learning”. In: Proceedings of the International Conference on Machine Learning (ICML). 2021. [21] Wang Chi Cheung, David Simchi-Levi, and Ruihao Zhu. “Nonstationary reinforcement learning: The blessing of (more) optimism”. In: Management Science (2023). [22] Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. “Deep reinforcement learning from human preferences”. In: Advances in Neural Information Processing Systems (NeurIPS) 30 (2017). 277 [23] Wei Chu, Lihong Li, Lev Reyzin, and Robert Schapire. “Contextual bandits with linear payoff functions”. In: Proceedings of the International Conference on Artificial Intelligence and Statistics. 2011. [24] Jack Collins, David Howard, and Jurgen Leitner. “Quantifying the reality gap in robotic manipulation tasks”. In: 2019 International Conference on Robotics and Automation (ICRA). IEEE. 2019, pp. 6706–6712. [25] Christoph Dann, Teodor V Marinov, Mehryar Mohri, and Julian Zimmert. “Beyond Value-Function Gaps: Improved Instance-Dependent Regret Bounds for Episodic Reinforcement Learning”. In: arXiv preprint arXiv:2107.01264 (2021). [26] Christoph Dann, Chen-Yu Wei, and Julian Zimmert. “Best of Both Worlds Policy Optimization”. In: arXiv preprint arXiv:2302.09408 (2023). [27] Ofer Dekel, Jian Ding, Tomer Koren, and Yuval Peres. “Bandits with switching costs: T 2/3 regret”. In: Proceedings of the annual ACM symposium on Theory of computing. 2014. [28] Ofer Dekel and Elad Hazan. “Better Rates for Any Adversarial Deterministic MDP”. In: Proceedings of the International Conference on Machine Learning. 2013. [29] Kefan Dong, Yuanhao Wang, Xiaoyu Chen, and Liwei Wang. 
“Q-learning with UCB Exploration is Sample Efficient for Infinite-Horizon MDP”. In: arXiv preprint arXiv:1901.09311 (2019). [30] Eyal Even-Dar, Sham M Kakade, and Yishay Mansour. “Online Markov decision processes”. In: Mathematics of Operations Research (2009). [31] Eyal Even-Dar, Sham M Kakade, and Yishay Mansour. “Online Markov decision processes”. In: Mathematics of Operations Research (2009). [32] Dylan J Foster, Claudio Gentile, Mehryar Mohri, and Julian Zimmert. “Adapting to misspecification in contextual bandits”. In: Advances in Neural Information Processing Systems (NeurIPS) (2020). [33] Dylan J Foster, Zhiyuan Li, Thodoris Lykouris, Karthik Sridharan, and Eva Tardos. “Learning in games: Robustness of fast convergence”. In: Advances in Neural Information Processing Systems (NeurIPS) (2016). [34] Yoav Freund and Robert E Schapire. “A decision-theoretic generalization of on-line learning and an application to boosting”. In: Journal of computer and system sciences 55.1 (1997), pp. 119–139. [35] Ronan Fruit, Matteo Pirotta, Alessandro Lazaric, and Ronald Ortner. “Efficient Bias-Span-Constrained Exploration-Exploitation in Reinforcement Learning”. In: ICML 2018-The 35th International Conference on Machine Learning. Vol. 80. 2018, pp. 1578–1586. 278 [36] Yoichiro Fukuda. “Optimal policies for the inventory problem with negotiable leadtime”. In: Management Science 10.4 (1964), pp. 690–708. [37] Pratik Gajane, Ronald Ortner, and Peter Auer. “A sliding-window algorithm for markov decision processes with arbitrarily changing rewards and transitions”. In: arXiv preprint arXiv:1805.10066 (2018). [38] Anupam Gupta, Tomer Koren, and Kunal Talwar. “Better algorithms for stochastic bandits with adversarial corruptions”. In: Proceedings of the Annual Conference on Learning Theory. 2019. [39] Elad Hazan et al. “Introduction to online convex optimization”. In: Foundations and Trendső in Optimization (). [40] Sebastian Höfer, Kostas Bekris, Ankur Handa, Juan Camilo Gamboa, Florian Golemo, Melissa Mozifian, Chris Atkeson, Dieter Fox, Ken Goldberg, John Leonard, et al. “Perspectives on sim2real transfer for robotics: A summary of the r: Ss 2020 workshop”. In: arXiv preprint arXiv:2012.03806 (2020). [41] Yunhan Huang and Quanyan Zhu. “Deceptive reinforcement learning under adversarial manipulations on cost signals”. In: International Conference on Decision and Game Theory for Security. Springer. 2019, pp. 217–237. [42] Shinji Ito. “Parameter-Free Multi-Armed Bandit Algorithms with Hybrid Data-Dependent Regret Bounds”. In: Proceedings of the International Conference on Computational Learning Theory (COLT). 2021. [43] Shinji Ito. “Parameter-free multi-armed bandit algorithms with hybrid data-dependent regret bounds”. In: Conference on Learning Theory. PMLR. 2021, pp. 2552–2583. [44] Thomas Jaksch, Ronald Ortner, and Peter Auer. “Near-optimal Regret Bounds for Reinforcement Learning.” In: Journal of Machine Learning Research 11.4 (2010). [45] Chi Jin, Zeyuan Allen-Zhu, Sebastien Bubeck, and Michael I Jordan. “Is q-learning provably efficient?” In: Advances in Neural Information Processing Systems. 2018, pp. 4863–4873. [46] Chi Jin, Tiancheng Jin, Haipeng Luo, Suvrit Sra, and Tiancheng Yu. “Learning adversarial markov decision processes with bandit feedback and unknown transition”. In: Proceedings of the International Conference on Machine Learning (ICML). 2020. [47] Tiancheng Jin, Longbo Huang, and Haipeng Luo. “The best of both worlds: stochastic and adversarial episodic MDPs with unknown transition”. 
In: Advances in Neural Information Processing Systems (NeurIPS) (2021). [48] Tiancheng Jin, Tal Lancewicki, Haipeng Luo, Yishay Mansour, and Aviv Rosenberg. “Near-optimal regret for adversarial mdp with delayed bandit feedback”. In: Advances in Neural Information Processing Systems (NeurIPS) 35 (2022), pp. 33469–33481. 279 [49] Tiancheng Jin, Junyan Liu, Chloé Rouyer, William Chang, Chen-Yu Wei, and Haipeng Luo. “No-Regret Online Reinforcement Learning with Adversarial Losses and Transitions”. In: Advances in Neural Information Processing Systems (NeurIPS) 36 (2024). [50] Tiancheng Jin and Haipeng Luo. “Simultaneously Learning Stochastic and Adversarial Episodic MDPs with Known Transition.” In: Advances in Neural Information Processing Systems (NeurIPS) (2020). [51] Kwang-Sung Jun, Lihong Li, Yuzhe Ma, and Jerry Zhu. “Adversarial attacks on stochastic bandits”. In: Advances in Neural Information Processing Systems (NeurIPS) 31 (2018). [52] Sham Machandranath Kakade. “On the sample complexity of reinforcement learning”. PhD thesis. University College London, 2003. [53] B Ravi Kiran, Ibrahim Sobh, Victor Talpaert, Patrick Mannion, Ahmad A Al Sallab, Senthil Yogamani, and Patrick Pérez. “Deep reinforcement learning for autonomous driving: A survey”. In: IEEE Transactions on Intelligent Transportation Systems 23.6 (2021), pp. 4909–4926. [54] Chung-Wei Lee, Haipeng Luo, Chen-Yu Wei, and Mengxiao Zhang. “Bias no more: high-probability data-dependent regret bounds for adversarial bandits and MDPs”. In: Advances in Neural Information Processing Systems (2020). [55] Chung-Wei Lee, Haipeng Luo, Chen-Yu Wei, and Mengxiao Zhang. “Bias no more: high-probability data-dependent regret bounds for adversarial bandits and mdps”. In: Advances in Neural Information Processing Systems (NeurIPS) (2020). [56] Chung-Wei Lee, Haipeng Luo, Chen-Yu Wei, Mengxiao Zhang, and Xiaojin Zhang. “Achieving Near Instance-Optimality and Minimax-Optimality in Stochastic and Adversarial Linear Bandits Simultaneously”. In: Proceedings of the International Conference on Machine Learning (2021). [57] Chung-Wei Lee, Haipeng Luo, and Mengxiao Zhang. “A Closer Look at Small-loss Bounds for Bandits with Graph Feedback”. In: Conference on Learning Theory. 2020. [58] Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. “End-to-end training of deep visuomotor policies”. In: Journal of Machine Learning Research 17.39 (2016), pp. 1–40. [59] Haipeng Luo, Chen-Yu Wei, and Chung-Wei Lee. “Policy optimization in adversarial mdps: Improved exploration via dilated bonuses”. In: Advances in Neural Information Processing Systems (NeurIPS) (2021). [60] Haipeng Luo, Mengxiao Zhang, Peng Zhao, and Zhi-Hua Zhou. “Corralling a larger band of bandits: A case study on switching regret for linear bandits”. In: Proceedings of the International Conference on Computational Learning Theory (COLT). 2022. 280 [61] Thodoris Lykouris, Vahab Mirrokni, and Renato Paes Leme. “Stochastic bandits robust to adversarial corruptions”. In: Proceedings of the Annual ACM SIGACT Symposium on Theory of Computing. 2018. [62] Thodoris Lykouris, Max Simchowitz, Aleksandrs Slivkins, and Wen Sun. “Corruption robust exploration in episodic reinforcement learning”. In: arXiv preprint arXiv:1911.08689 (2019). [63] Thodoris Lykouris, Max Simchowitz, Alex Slivkins, and Wen Sun. “Corruption-robust exploration in episodic reinforcement learning”. In: Conference on Learning Theory. 2021. [64] Yuzhe Ma, Xuezhou Zhang, Wen Sun, and Jerry Zhu. 
“Policy poisoning in batch reinforcement learning and control”. In: Advances in Neural Information Processing Systems (NeurIPS) 32 (2019). [65] Andreas Maurer and Massimiliano Pontil. “Empirical Bernstein Bounds and Sample Variance Penalization”. In: stat 1050 (2009), p. 21. [66] Dipendra Misra, Mikael Henaff, Akshay Krishnamurthy, and John Langford. “Kinematic state abstraction and provably efficient rich-observation reinforcement learning”. In: Proceedings of the International Conference on Machine Learning (ICML). PMLR. 2020, pp. 6961–6971. [67] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. “Human-level control through deep reinforcement learning”. In: Nature 518.7540 (2015), pp. 529–533. [68] Jaouad Mourtada and Stéphane Gaffas. “On the optimality of the Hedge algorithm in the stochastic regime”. In: Journal of Machine Learning Research 20 (2019), pp. 1–28. [69] Gergely Neu. “Explore no more: Improved high-probability regret bounds for non-stochastic bandits”. In: Advances in Neural Information Processing Systems 28 (2015), pp. 3168–3176. [70] Gergely Neu, Andras Antos, András György, and Csaba Szepesvári. “Online Markov Decision Processes under Bandit Feedback”. In: Advances in Neural Information Processing Systems (NeurIPS). 2010. [71] Gergely Neu, Andras Antos, András György, and Csaba Szepesvári. “Online Markov decision processes under bandit feedback”. In: IEEE Transactions on Automatic Control (2014). [72] Gergely Neu, András György, and Csaba Szepesvári. “The adversarial stochastic shortest path problem with unknown transition probabilities”. In: Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics, (AISTATS). 2012, pp. 805–813. 281 [73] Gergely Neu, András György, and Csaba Szepesvári. “The Online Loop-free Stochastic Shortest-Path Problem”. In: Proceedings of the International Conference on Computational Learning Theory (COLT). 2010. [74] Yi Ouyang, Mukul Gagrani, Ashutosh Nayyar, and Rahul Jain. “Learning unknown Markov Decision Processes: a thompson sampling approach”. In: Proceedings of the International Conference on Neural Information Processing Systems. 2017. [75] Aldo Pacchiano, Christoph Dann, and Claudio Gentile. “Best of Both Worlds Model Selection”. In: Advances in Neural Information Processing Systems (NeurIPS). 2022. [76] Amin Rakhsha, Goran Radanovic, Rati Devidze, Xiaojin Zhu, and Adish Singla. “Policy teaching via environment poisoning: Training-time adversarial attacks against reinforcement learning”. In: Proceedings of the International Conference on Machine Learning (ICML). PMLR. 2020, pp. 7974–7984. [77] Amin Rakhsha, Xuezhou Zhang, Xiaojin Zhu, and Adish Singla. “Reward poisoning in reinforcement learning: Attacks against unknown learners in unknown environments”. In: arXiv preprint arXiv:2102.08492 (2021). [78] Aviv Rosenberg and Yishay Mansour. “Online Convex Optimization in Adversarial Markov Decision Processes”. In: Proceedings of the 36th International Conference on Machine Learning. 2019, pp. 5478–5486. [79] Aviv Rosenberg and Yishay Mansour. “Online convex optimization in adversarial markov decision processes”. In: Proceedings of the International Conference on Machine Learning (ICML). 2019. [80] Aviv Rosenberg and Yishay Mansour. “Online Stochastic Shortest Path with Bandit Feedback and Unknown Transition Function”. 
In: Advances in Neural Information Processing Systems (NeurIPS) (2019). [81] Yevgeny Seldin and Gábor Lugosi. “An Improved Parametrization and Analysis of the EXP3++ Algorithm for Stochastic and Adversarial Bandits”. In: Proceedings of the Annual Conference on Learning Theory. 2017. [82] Yevgeny Seldin and Aleksandrs Slivkins. “One practical algorithm for both stochastic and adversarial bandits”. In: Proceedings of the International Conference on Machine Learning. 2014. [83] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. “Mastering the game of Go with deep neural networks and tree search”. In: Nature 529.7587 (2016), pp. 484–489. [84] Max Simchowitz and Kevin G Jamieson. “Non-asymptotic gap-dependent regret bounds for tabular MDPs”. In: Advances in Neural Information Processing Systems. 2019, pp. 1151–1160. 282 [85] Yanchao Sun, Da Huo, and Furong Huang. “Vulnerability-aware poisoning mechanism for online rl with unknown dynamics”. In: ICLR. 2021. [86] Yi Tian, Yuanhao Wang, Tiancheng Yu, and Suvrit Sra. “Online learning in unknown markov games”. In: Proceedings of the International Conference on Machine Learning (ICML). 2021. [87] Oriol Vinyals, Igor Babuschkin, Wojciech M Czarnecki, Michaël Mathieu, Andrew Dudzik, Junyoung Chung, David H Choi, Richard Powell, Timo Ewalds, Petko Georgiev, et al. “Grandmaster level in StarCraft II using multi-agent reinforcement learning”. In: Nature 575.7782 (2019), pp. 350–354. [88] Chen-Yu Wei, Christoph Dann, and Julian Zimmert. “A model selection approach for corruption robust reinforcement learning”. In: Proceedings of the International Conference on Machine Learning (ICML). 2022. [89] Chen-Yu Wei, Mehdi Jafarnia-Jahromi, Haipeng Luo, Hiteshi Sharma, and Rahul Jain. “Model-free Reinforcement Learning in Infinite-horizon Average-reward Markov Decision Processes”. In: arXiv preprint arXiv:1910.07072 (2019). [90] Chen-Yu Wei and Haipeng Luo. “More Adaptive Algorithms for Adversarial Bandits”. In: Proceedings of the International Conference on Computational Learning Theory (COLT). 2018. [91] Chen-Yu Wei and Haipeng Luo. “Non-stationary reinforcement learning without prior knowledge: An optimal black-box approach”. In: Proceedings of the International Conference on Computational Learning Theory (COLT). 2021. [92] Kunhe Yang, Lin Yang, and Simon Du. “Q-learning with Logarithmic Regret”. In: International Conference on Artificial Intelligence and Statistics. PMLR. 2021, pp. 1576–1584. [93] Jia Yuan Yu, Shie Mannor, and Nahum Shimkin. “Markov decision processes with arbitrary reward processes”. In: Mathematics of Operations Research 34.3 (2009), pp. 737–757. [94] Xuezhou Zhang, Yiding Chen, Xiaojin Zhu, and Wen Sun. “Robust policy gradient against strong data corruption”. In: International Conference on Machine Learning. PMLR. 2021, pp. 12391–12401. [95] Xuezhou Zhang, Yuzhe Ma, Adish Singla, and Xiaojin Zhu. “Adaptive reward-poisoning attacks against reinforcement learning”. In: Proceedings of the International Conference on Machine Learning (ICML). PMLR. 2020, pp. 11225–11234. [96] Zihan Zhang and Xiangyang Ji. “Regret Minimization for Reinforcement Learning by Evaluating the Optimal Bias Function”. In: Advances in Neural Information Processing Systems. 2019. 283 [97] Alexander Zimin and Gergely Neu. “Online Learning in Episodic Markovian Decision Processes by Relative Entropy Policy Search”. 
In: Proceedings of the International Conference on Neural Information Processing Systems. 2013. [98] Alexander Zimin and Gergely Neu. “Online learning in episodic Markovian decision processes by relative entropy policy search”. In: Advances in Neural Information Processing Systems (NeurIPS). 2013. [99] Julian Zimmert, Haipeng Luo, and Chen-Yu Wei. “Beating Stochastic and Adversarial Semi-bandits Optimally and Simultaneously”. In: Proceedings of the International Conference on Machine Learning. 2019. [100] Julian Zimmert and Yevgeny Seldin. “An Optimal Algorithm for Stochastic and Adversarial Bandits”. In: Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS). 2019. [101] Julian Zimmert and Yevgeny Seldin. “Tsallis-INF: An Optimal Algorithm for Stochastic and Adversarial Bandits”. In: Journal of Machine Learning Research 22.28 (2021), pp. 1–49.
Abstract
Reinforcement learning (RL) is a machine learning (ML) technique for learning to make optimal sequential decisions via interaction with an environment. In recent years, RL has achieved great success in many artificial intelligence tasks, and it has been widely regarded as one of the keys towards Artificial General Intelligence (AGI).
However, most RL models are trained on simulators and suffer from the reality gap: a mismatch between simulated and real-world performance. Moreover, recent work has shown that RL models are especially vulnerable to adversarial attacks. This motivates research on improving the robustness of RL, that is, the ability to ensure worst-case guarantees.
On the other hand, it is not desirable to be too conservative or pessimistic and sacrifice too much performance when the environment is not difficult to deal with. In other words, adaptivity, the capability of automatically adapting to the maliciousness of the environment, is especially desirable for RL algorithms: they should not only target worst-case guarantees, but also pursue instance optimality and achieve better performance against benign environments.
In this thesis, we focus on designing practical, robust and adaptive reinforcement learning algorithms. Specifically, we take inspiration from the online learning literature and consider interacting with a sequence of Markov Decision Processes (MDPs), which captures the nature of a changing environment. We hope that the techniques and insights developed in this thesis can shed light on improving existing deep RL algorithms for future applications.