UNDERSTANDING GOAL-ORIENTED REINFORCEMENT LEARNING

by

Liyu Chen

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)

May 2023

Copyright 2023 Liyu Chen

Acknowledgements

First of all, I would like to express my heartfelt gratitude to my PhD advisor, Haipeng Luo, for his support, guidance, and encouragement throughout my graduate studies. Haipeng has been an invaluable mentor. He has a rigorous attitude towards every piece of his work, always asks sharp and constructive questions, and never fails to guide me in the right direction when I get stuck. He works closely with every student and is always responsive to our questions and needs. I am deeply grateful for his crucial contributions to my research and professional development.

During my PhD studies, I was very fortunate to work with many wonderful collaborators, including Sebastien Arnold, Yan Dai, Boqing Gong, Hexiang Hu, Mehdi Jafarnia-Jahromi, Rahul Jain, Alessandro Lazaric, Zhiyun Lu, Matteo Pirotta, Aviv Rosenberg, Fei Sha, Andrea Tirinzoni, Chong Wang, and Chen-Yu Wei. In particular, I would like to thank Chen-Yu for spending hours and days with me thinking about research problems. I would like to thank Hexiang Hu for collaborating with me on my first paper and guiding me step by step into deep learning. I would like to thank Chong Wang for being a great internship mentor at ByteDance and leading me into the world of recommendation systems. I would also like to thank Matteo Pirotta, Andrea Tirinzoni, and Alessandro Lazaric for hosting my internship at Meta in Paris. I enjoyed the intense weekly discussions with them and learned a lot about how to do theoretical research in industry.

I am grateful to my labmates Yifang Chen, Tiancheng Jin, Hikaru Ibayashi, Chung-Wei Lee, Chen-Yu Wei, and Mengxiao Zhang for their camaraderie and support. Although I am a late member of this family, I have never struggled to fit in. I really enjoy the regular reading group where we share new research papers and interesting stories.

I would like to thank Rahul Jain, David Kempe, and Ashutosh Nayyar for being willing to serve on my thesis committee and offering useful feedback on this thesis. I am also thankful to Stefanos Nikolaidis, Vatsal Sharan, and Jiapeng Zhang for giving me helpful suggestions on my qualifying exam and thesis proposal.

I would like to extend my appreciation to my previous advisor, Fei Sha, who admitted me to USC even though I did not have much background in machine learning. Although things did not work out between us in the end, he taught me a lot of important lessons in both research and life. I am also grateful to my previous labmates in Sha Lab, including Melissa Ailem, Sebastien Arnold, Aaron Chan, Soravit Beer Changpinyo, Wei-Lun Chao, Chao-Kai Chiang, JP Francis, Chin-Cheng Hsu, Hexiang Hu, Shariq Iqbal, Michiel de Jong, Zhiyun Lu, Yiming Yan, Han-Jia Ye, Yury Zemlyanskiy, Bowen Zhang, and Ke Zhang. I would like to thank them for their friendship and support during my early years in grad school. I will always cherish the memories of the joy and great times we shared together.

I would like to thank David, Karen, my roommates, and neighbors for forming such a great community. I really enjoyed the two years staying in SoCalSLR.

Finally, I would like to thank my family for their love and support throughout my studies. We spent many nights video chatting and sharing details about our lives.
Their encouragement and understanding have been invaluable, and I couldn't have made it through these years without them. Thank you all so much.

Table of Contents

Acknowledgements
List of Tables
List of Figures
Abstract
Chapter 1: Introduction
  1.1 Stochastic Shortest Path
  1.2 Linear Function Approximation
  1.3 Adversarially Changing Costs
  1.4 Non-stationary Environments
  1.5 PAC Learning
  1.6 Outline of the thesis
  1.7 Notation
Chapter 2: Implicit Finite-Horizon Reduction: Efficient Optimal Algorithms for SSP
  2.1 Overview: Existing Algorithms for SSP with Stochastic Costs
  2.2 Implicit Finite-Horizon Approximation
  2.3 The First Model-free Algorithm: LCB-Advantage-SSP
  2.4 An Optimal and Efficient Model-based Algorithm: SVI-SSP
  2.5 Open Problems
Chapter 3: Learning in SSP with Linear Structure
  3.1 Overview: SSP with Linear Function Approximation
  3.2 An Efficient Algorithm for Linear SSP
    3.2.1 Finite-Horizon Approximation of SSP
    3.2.2 Applying an Efficient Finite-Horizon Algorithm for Linear MDPs
  3.3 Open Problems
Chapter 4: Algorithms for SSP with Adversarial Costs and Known Transition
  4.1 Overview: SSP with Adversarial Costs
  4.2 Minimax Regret for the Full-information Setting
    4.2.1 Optimal expected regret
    4.2.2 Optimal high-probability regret
  4.3 Minimax Regret for the Bandit Feedback Setting
  4.4 Open Problems
Chapter 5: Policy Optimization for SSP
  5.1 Stacked Discounted Approximation and Algorithm Template
  5.2 Algorithm for Stochastic Environments
  5.3 Algorithm for Adversarial Environments
  5.4 Open Problems
Chapter 6: Learning Non-stationary SSP
  6.1 Overview: Non-stationary SSP
  6.2 Lower Bound
  6.3 Basic Framework: Finite-Horizon Approximation
  6.4 A Simple Sub-Optimal Algorithm
  6.5 A Minimax Optimal Algorithm
  6.6 Open Problems
Chapter 7: Reaching Goals is Hard: Settling the Sample Complexity of SSP
  7.1 Overview: PAC Learning in SSP
  7.2 Lower Bounds with a Generative Model
    7.2.1 Lower Bound for ε-optimality
    7.2.2 Lower Bound for ε-optimality with Prior Knowledge on T_max
    7.2.3 Lower Bound for (ε, T)-optimality
  7.3 Algorithm with a Generative Model
  7.4 Open Problems
References
Appendix A: Omitted Details in Chapter 2
  A.1 Preliminaries for the Appendix
  A.2 Omitted Details for Section 2.2
  A.3 Omitted Details for Section 2.3
  A.4 Omitted Details for Section 2.4
  A.5 Experiments
Appendix B: Omitted Details in Chapter 3
  B.1 Omitted Details for Section 3.2
  B.2 Auxiliary Lemmas
Appendix C: Omitted Details in Chapter 4
  C.1 Omitted details for Section 4.2
  C.2 Omitted details for Section 4.3
Appendix D: Omitted Details in Chapter 5
  D.1 Preliminary for Appendix
  D.2 Omitted Details for Section 5.1
  D.3 Omitted Details for Section 5.2
  D.4 Omitted Details for Section 5.3
  D.5 Learning without Some Parameters
Appendix E: Omitted Details in Chapter 6
  E.1 Preliminaries
  E.2 Omitted Details in Section 6.2
  E.3 Omitted Details in Section 3.2.1
  E.4 Omitted Details in Section 6.4
  E.5 Omitted Details in Section 6.5
Appendix F: Omitted Details in Chapter 7
  F.1 Omitted Details in Section 7.2
  F.2 Omitted Details in Section 7.3
Appendix G: Auxiliary Lemmas
  G.1 Concentration Inequalities

List of Tables

1.1 A mapping from chapters to papers
2.1 Summary of existing regret minimization algorithms for SSP with their best achievable bounds (assuming necessary prior knowledge). Here, $D$ is the diameter, $T_{\max}$ is the maximum expected hitting time of the optimal policy over all states, and $B_\star$ is the maximum expected cost of the optimal policy over all states. Note that although LCB-Advantage-SSP has a larger lower-order term depending on $\tilde{O}(1/c_{\min}^4)$ among the minimax optimal algorithms, it actually nearly matches that of ULCVI when $T_{\max}$ is unknown, in which case their algorithm is run with $T_{\max}$ replaced by its upper bound $B_\star/c_{\min}$.
4.1 Summary of our minimax optimal results and comparisons with prior work. Here, $D, S, A$ are the diameter, number of states, and number of actions of the MDP, $c_{\min}$ is the minimum cost, $T_\star \le D/c_{\min}$ is the expected hitting time of the optimal policy, and $K$ is the number of episodes. Logarithmic terms are omitted. All algorithms can be implemented efficiently.
5.1 Comparison of regret bound, time complexity, and space complexity of different SSP algorithms. We consider two feedback types: SC (stochastic costs) and AF (adversarial, full information). The operator $\tilde{O}(\cdot)$ is hidden for simplicity. Time complexity of $\mathrm{poly}(S, A, T_{\max})$ is due to optimization in the occupancy measure space.
7.1 Result summary with a generative model. Here, $T$ is a known upper bound on the hitting time of the optimal policy ($T = \infty$ when such a bound is unknown), $T_z = B_\star / c_{\min}$, and $B_{\star,T}$ is the maximum expected cost over all starting states of the restricted optimal policy with hitting time bounded by $T$. Operators $\tilde{O}(\cdot)$ and $\Omega(\cdot)$ are hidden for simplicity.
A.1 Average time (in seconds) spent in updates in 3000 episodes for each algorithm. Our model-based algorithm SVI-SSP is the most efficient algorithm.
A.2 Hyper-parameters used in the experiments. We search the best parameters for each algorithm.

List of Figures

7.1 (a) Hard instance (simplified for the proof sketch) in Theorem 22 when $c_{\min} > 0$. (b) Hard instance in Theorem 22 when $c_{\min} = 0$. Here, $c$ represents the cost of an action, while $p$ represents the transition probability.
A.1 Accumulated regret of each algorithm on RandomMDP (left) and GridWorld (right) in 3000 episodes. Each plot is an average of 500 repeated runs, and the shaded area is the 95% confidence interval. Dotted lines represent model-free algorithms and solid lines represent model-based algorithms.
F.1 Hard instance in Theorem 24. Each arrow represents a possible transition of a state-action pair, and the value on the side is the expected cost of taking this state-action pair until the transition happens. The value $t$ represents the expected number of steps needed for the transition to happen.

Abstract

Reinforcement learning (RL) is about learning to make optimal sequential decisions in an unknown environment. In the past decades, RL has made astounding progress. With massive computation power, we can train agents that beat professional players in challenging games such as Go (Silver et al., 2016) and StarCraft (Vinyals et al., 2019). The objective of traditional RL models, such as finite-horizon models and infinite-horizon discounted models, is to minimize accumulated cost within an effective horizon. However, many real-world applications are goal-oriented, meaning that the objective is to achieve a certain goal while minimizing the accumulated cost. Examples of such applications include games (beat your opponent as quickly as possible), car navigation (reach a destination with minimum gas consumption), and robotics manipulations (move an object to a desired position with the least joint movements). Notably, there are two objectives in goal-oriented tasks: reaching the goal and minimizing cost. These two objectives may not always align, and it is often hard to specify goal-oriented tasks by traditional RL models. As a result, goal-oriented reinforcement learning, that is, applying RL to solve goal-oriented tasks, often requires heavy engineering effort, such as cost function design, determining the appropriate horizon or discount factor, and sophisticated exploration schemes to handle sparse rewards. In this thesis, we focus on resolving these issues by answering the following question: how can we perform goal-oriented reinforcement learning (GoRL) in a principled way? Specifically, we study learning in a Markov Decision Process (MDP) called Stochastic Shortest Path (SSP) (Bertsekas and Yu, 2013), which exactly captures the dual objectives of GoRL. Similar to the study of other RL models, we consider various learning settings, such as adversarial environments, non-stationary environments, and PAC learning. We also develop practical learning algorithms for SSP, such as model-free algorithms, algorithms incorporating function approximation, and policy optimization.

Chapter 1: Introduction

In the past decades, reinforcement learning (RL) has received much attention and achieved great success.
With tons of data and massive computation power, RL agents surpass human-level performance in complex games (Silver et al., 2016; Vinyals et al., 2019; Berner et al., 2019), and even discover novel algorithms that are more efficient than any human-designed ones (Fawzi et al., 2022). However, the success of RL is mostly limited to applications where the task can be easily specified. In most cases, this means that the task can be represented by a simple reward function, where the loss and gain of every decision of the agent can be clearly quantified. This is a very strong requirement, and it often demands extensive manual effort in reward design. A wealth of prior research tackles this problem from different perspectives, such as reward shaping (Ng, Harada, and Russell, 1999; Gupta et al., 2022), learning from human preference (Akrour et al., 2014; Christiano et al., 2017), and unifying task specification (White, 2017).

In fact, task specification can be challenging even for conceptually simple goal-oriented tasks, whose objective is "achieving a certain goal with minimal cost". This formalism subsumes many real-world applications, such as games (beat your opponent as soon as possible), car navigation (reach a destination with minimum gas consumption), and robotics manipulations (move an object to a desired position with the least joint movements). Specifying a goal-oriented task can be difficult since it inherently has two objectives, reaching the goal and minimizing cost, while traditional RL models, such as finite-horizon models (the agent interacts with the environment for a fixed number of steps) and infinite-horizon discounted models (reward/cost in the future shrinks exponentially w.r.t. some discount factor), focus solely on cost minimization. For example, consider training a robot to navigate through a maze to an exit. A natural cost function design is to provide a certain reward when the robot reaches the exit, and also incur a certain cost for locomotion. However, an agent without the goal (reaching the exit) "in mind" may easily converge to the sub-optimal behavior of staying still to avoid any locomotion cost (solely minimizing the cost).

In general, goal-oriented reinforcement learning (GoRL), that is, applying RL to solve goal-oriented tasks, often requires heavy engineering effort, such as cost function design (Liu, Zhu, and Zhang, 2022), determining the appropriate horizon or discount factor, and sophisticated exploration schemes to handle sparse rewards (Florensa et al., 2018). For example, in training AlphaTensor to discover matrix multiplication algorithms, the reward function gives both a per-step penalty and a sophisticated evaluation of the terminal state. The algorithm also incorporates a bonus term in its MCTS search for effective exploration. Although there are many engineering techniques that can alleviate this issue, in this thesis we take a different route and ask the following question:

What is the right model of goal-oriented reinforcement learning?

It turns out that a Markov Decision Process (MDP) model named Stochastic Shortest Path (SSP) (Bertsekas and Yu, 2013) exactly captures the dual objectives of GoRL. This model has been well studied in the control community, with the focus on computing an optimal policy in a known SSP model. On the other hand, reinforcement learning in SSP, which assumes an unknown environment, is much less understood. The main theme of this thesis is to design practical learning algorithms for SSP, which solve GoRL in a principled way.
Considering the SSP model instead of the others potentially frees us from the cost of making the wrong assumption when performing GoRL in practice. For example, in robotic manipulations, a reset scheme is often implemented to make the problem finite-horizon, while with an SSP algorithm this is no longer necessary. In the rest of this chapter, we formalize the SSP model and state our contributions.

1.1 Stochastic Shortest Path

SSP is a type of Markov Decision Process (MDP), which models decision making in stochastic sequential environments. An SSP instance is defined by a tuple $M = (S, s_{\text{init}}, g, A, P)$. Here, $S$ is the state space, $s_{\text{init}} \in S$ is the initial state, $g \notin S$ is the goal state, and $A = \{A_s\}_{s\in S}$ is a finite action space where $A_s$ is the available action set at state $s$. We also define $\Gamma = \{(s,a) : s \in S, a \in A_s\}$ to be the set of valid state-action pairs. The transition function $P: \Gamma \times (S \cup \{g\}) \to [0,1]$ is such that $P(s'|s,a)$ is the probability of transiting to $s'$ after taking action $a \in A_s$ at state $s$, and it satisfies $\sum_{s'\in S\cup\{g\}} P(s'|s,a) = 1$ for each $(s,a) \in \Gamma$, so that $P_{s,a} \in \Delta_{S^+}$, where $S^+ = S \cup \{g\}$ and $\Delta_{S^+}$ is the simplex over $S^+$. For simplicity, we often write $P_{s,a}(\cdot) = P(\cdot|s,a)$. Finally, we denote by $S = |S^+|$ and $A = (\sum_{s\in S} |A_s|)/|S|$ the total number of states (including the goal state) and the average number of available actions, respectively.

Proper Policies and Related Concepts. A stationary policy $\pi$ is a mapping that assigns to each state $s$ a distribution over actions $\pi(\cdot|s) \in \Delta_{A_s}$, and it is deterministic if $\pi(\cdot|s)$ concentrates on a single action (denoted by $\pi(s)$) for all $s$. Denote by $T^\pi(s)$ the expected number of steps it takes to reach $g$ starting from state $s$ and following policy $\pi$. Policy $\pi$ is proper if following $\pi$ from any initial state reaches the goal state with probability 1 (i.e., $T^\pi(s) < \infty$ for all $s$), and it is improper otherwise (i.e., there exists $s \in S$ such that $T^\pi(s) = \infty$). Denote by $\Pi$ the set of stationary policies and $\Pi_{\text{proper}}$ the set of proper policies (assumed to be non-empty). Given a proper policy $\pi$, a transition function $P$, and a cost function $c: \Gamma \to [0,1]$, we define its value function and action-value function as follows: $V^{\pi,P,c}(s) = \mathbb{E}\big[\sum_{i=1}^{I} c(s_i,a_i) \,\big|\, \pi, P, s_1 = s\big]$ and $Q^{\pi,P,c}(s,a) = c(s,a) + \mathbb{E}_{s'\sim P_{s,a}}[V^{\pi,P,c}(s')]$, where the expectation in $V^{\pi,P,c}$ is over the randomness of actions $a_i \sim \pi(\cdot|s_i)$, next states $s_{i+1} \sim P_{s_i,a_i}$, and the number of steps $I$ before reaching $g$.
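To make these definitions concrete, the value function of a proper policy in a known SSP can be computed by solving a linear system restricted to the non-goal states. The sketch below is a minimal illustration on a hypothetical three-state chain; the instance, costs, and probabilities are assumptions for the example, not taken from the thesis.

import numpy as np

# Assumed toy instance (not from the thesis): a chain 0 -> 1 -> 2 -> goal.
# Under the fixed policy pi, each step moves forward with probability 0.9
# and stays in place otherwise, so pi is proper. The goal has index 3.
n_states, goal = 3, 3
P_pi = np.zeros((n_states, n_states + 1))      # P_pi[s, s'] = P(s' | s, pi(s))
for s in range(n_states):
    P_pi[s, s + 1] = 0.9                       # move towards the goal
    P_pi[s, s] = 0.1                           # stay put
c_pi = np.full(n_states, 0.5)                  # c(s, pi(s)) in [0, 1]

# For a proper policy, V(s) = c(s, pi(s)) + P_{s, pi(s)} V with V(g) = 0.
# Restricted to non-goal states this reads (I - P_pi) V = c_pi, and the matrix
# is invertible because P_pi restricted to S is strictly sub-stochastic.
P_S = P_pi[:, :n_states]                       # drop the goal column since V(g) = 0
V_pi = np.linalg.solve(np.eye(n_states) - P_S, c_pi)

# One Bellman backup gives the action value of the policy's own action.
Q_pi = c_pi + P_S @ V_pi
print(V_pi, Q_pi)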
Note that the heavily studied nite-horizon setting is a special case of SSP whereI k is always guaranteed to be some xed number. Stochastic Costs In this thesis we consider various types of environments with dierent feedback mechanism. We start with the simpler environment with a xed “ground truth” cost: there exists an unknown mean cost functionc : ! [c min ; 1], and the costs incurred by the learner are i.i.d. samples from some distribution with support [c min ; 1] and meanc. Here,c min 2 [0; 1] is a global lower bound. Whenever the learner visits state-action pair (s;a), she immediately observes (and incurs) an i.i.d. cost sampled from some unknown distribution with meanc(s;a). This is summarized in Protocol 1. The learner’s objective is to minimize her regret, dened as the dierence between her total incurred cost and the total expected cost of the best proper policy: R K = K X k=1 I k X i=1 c k i V ? (s init ) ! ; (1.1) 4 whereV ? = V ? ;P;c and ? is the optimal proper policy satisfying ? 2 argmin 21 V ;P;c (s) for all s2S. We also deneR K =1 whenI k =1 for somek. This is to avoid the undesired behavior that the learner gets stuck in some zero-cost loop (such as a robot pausing forever to avoid any locomotion cost). Our goal is to develop no-regret algorithms, which ensure thatR K grows sub-linearly w.r.t. the number of episodesK, that is, lim K!1 R K =K = 0. Bellman Optimality We would like to mention some properties of the optimal (proper) policy ? . Without loss of generality, we always assume that ? is deterministic (a deterministic optimal proper policy always exists; see (Bertsekas and Yu, 2013)). For notational convenience, we also dene the optimal action-value functionQ ? (s;a) =Q ? ;P;c . It is known that ? satises the Bellman optimality equation: V ? (s) = min a Q ? (s;a) for alls2S (Bertsekas and Tsitsiklis, 1991). 1.2 LinearFunctionApproximation In most parts of the thesis, we assume the so-called tabular setting, where the state space is small, and algorithms with computational complexity and regret bound depending on S =jS + j are acceptable. However, modern reinforcement learning applications often need to handle a massive state space, in which function approximation is necessary. As a rst step towards general function approximation, we also study a linear function approximation setting (Vial et al., 2021), where the MDP enjoys a linear structure in both the transition and cost functions (known as linear or low-rank MDP). Assumption 1 (Linear SSP). For some d 2, there exist known feature mapsf(s;a)g (s;a)2 R d , unknown parameters ? 2R d andf(s 0 )g s 0 2S + R d , such that for any (s;a)2 ands 0 2S + , we have: c(s;a) =(s;a) > ? ; P s;a (s 0 ) =(s;a) > (s 0 ): 5 Protocol2 SSP with Adversarially Changing Costs for episodek = 1;:::;K do Environment chooses costc k , learner starts in states k 1 =s init 2S;i 1. whiles k i 6=g do Learner chooses actiona k i 2A, suers (but does not observe)c k i =c k (s k i ;a k i ), and transits to state s k i+1 P s k i ;a k i . i i + 1. Learner observesc k (full information) orfc(s k i ;a k i )g I k i=1 (bandit feedback). Moreover, we assumek(s;a)k 2 1 for all (s;a)2 ,k ? k 2 p d, and R h(s 0 )d(s 0 ) 2 p dkhk 1 for anyh2R S + . We refer the reader to (Vial et al., 2021) and references therein for justication on this widely used structural assumption (especially on the last few norm constraints). Under Assumption 1, by denition we haveQ ? (s;a) =(s;a) > w ? , wherew ? = ? + R V ? (s 0 )d(s 0 )2R d , that is,Q ? 
is also linear in the features. 1.3 AdversariallyChangingCosts Protocol 1 is suitable when the environment is nearly stationary, that is, the mean cost function is xed over time. In high-stakes applications, however, a practical algorithm should be robust to adversarial changes in the environment. One way to model such an adversarial setting is to assume an adversary which can choose the cost function in each episode adaptively, potentially depending on the learner’s algorithm and the randomness before the current episode. The learning protocol of this setting is shown in Protocol 2. To align with the feedback model of the simpler nite-horizon MDP, we assume that the learner receives feedback on the cost function only after the goal state is reached. We consider two dierent types of feedback on the cost functions. In the full-information setting, the entire cost functionc k is revealed to the learner, while in the bandit feedback setting, only the costs for the visited state-action pairs, that is, c k i =c k (s k i ;a k i ) fori = 1;:::;I k , are revealed to the learner. 6 Protocol3 Non-stationary SSP Environment picks cost functionsfc k g K k=1 and transition functionsfP k g K k=1 . for episodek = 1;:::;K do Learner starts in states k 1 =s init 2S;i 1. whiles k i 6=g do Learner chooses actiona k i 2A, suers and observes sampled costc k i with meanc k (s k i ;a k i ), and transits to states k i+1 P k;s k i ;a k i . i i + 1. The learner’s objective is again to minimize her regret, dened as the dierence between her total incurred cost and the total expected cost of the best xed proper policy in hindsight: R K = K X k=1 I k X i=1 c k i V ? k (s init ) ! ; (1.2) whereV ? k = V ? ;P;c k , and we abuse the notation to let ? denote the optimal proper policy satisfying ? 2 argmin 21 P K k=1 V ;P;c k (s) for alls2S. 1.4 Non-stationaryEnvironments Another common source of non-stationarity in environments is natural drift, such as variations of weather and trac condition. Such changes could be arbitrary, but are not adaptive to the actual interaction between the learner and the environment. We can model this by assuming an oblivious adversary which chooses the cost and transition functions of each episode before learning starts. Intuitively, in each episode, the learner is facing a new SSP with stochastic costs. This is summarized in Protocol 3. In such a dynamic environment, we would expect the learner to be adaptive to the changes, and it is undesirable to only compete against a xed policy. Thus, we adopt the notion of dynamic regret, where the learner competes against a sequence of optimal policies, dened as Dyn-R K = K X k=1 I k X i=1 c k i V ? k (s init ) ! ; (1.3) 7 whereV ? k = V ? k ;P k ;c k and ? k 2 argmin 21 V ;P k ;c k (s) for alls2S is the optimal proper policy in episodek. 1.5 PACLearning So far we have only considered the online reinforcement learning (i.e., regret minimization) setting. However, such learning settings are infeasible for many real-world applications where updating the learner’s policy online could be expensive. For example, in medical treatment, updating a policy means changing a patient’s prescription, which is time-consuming and even raises safety concerns. In such a scenario, it is desirable to invoke the least possible number of updates while maintaining a good performance guarantee. As a rst step, we study the PAC learning setting. 
The goal of the learner is to identify a near-optimal policy of desired accuracy with high probability, with or without a generative model. We formalize each component below. SampleCollection With a generative model (PAC-SSP), the learner directly selects a state-action pair (s;a)2 and collects a sample of the next states 0 drawn fromP s;a . Without a generative model (i.e., Best PolicyIdentication(BPI)), the learner directly interacts with the environment through episodes starting from an initial states init and sequentially taking actions untilg is reached. Cost and Value Functions For simplicity, we assume a known deterministic cost functionc : ! [c min ; 1] for some c min > 0. We write V ;P;c as V and Q ;P;c as Q . Dene optimal policy ? = argmin 21 V (s) for alls2S, and letV ? =V ? . -Optimality With a generative model, we say a policy is-optimal ifV (s)V ? (s) for alls2S. Without a generative model, a policy is-optimal ifV (s init )V ? (s init ). 8 Denition1 ((;)-Correctness). LetT be the random stopping time by when an algorithmterminates its interactionwiththeenvironmentandreturnsapolicyb . Wesaythatanalgorithmis (;)-correctwithsample complexityn(M)ifP M (T n(M);b is-optimal inM) 1 foranySSPinstanceM,wheren(M) is a deterministic function of the characteristic parameters of the problem (e.g., number of states and actions, inverse of the accuracy). 1.6 Outlineofthethesis In Section 1.1-Section 1.5, we have seen the formulation of stochastic shortest path and various learning settings of it. In the rest of the thesis, we design practical learning algorithms for each of these settings. In Chapter 2, we propose a generic template for developing regret minimization algorithms in SSP with stochastic costs (dened in Section 1.1), which achieves minimax optimal regret, has simple implementation, and admits highly sparse updates. In Chapter 3, we introduce a new no-regret algorithm for the SSP problem with a linear MDP (dened in Section 1.2) that signicantly improves over the only existing results of (Vial et al., 2021). In Chapter 4, we study SSP with adversarial costs (dened in Section 1.3). Assuming known transition, we develop minimax optimal algorithms for both the full information setting and the bandit feedback setting. In Chapter 5, we study policy optimization for SSP. We propose a unied algorithmic framework applicable to both stochastic and adversarial environments. Algorithms instantiated from our template are of simple form (multiplicative weight update), shown to achieve near-optimal regret guarantee, enjoy lower time complexity, and consistently reduce the space complexity. In Chapter 6, we initiate the study of dynamic regret minimization for goal-oriented reinforcement learning modeled by a non-stationary SSP (dened in Section 1.4). We establish a regret lower bound of this setting and propose two algorithms, including a minimax-optimal one. 9 Chapter Paper Chapter 2 Chen et al. (2021) Chapter 3 Chen, Jain, and Luo (2021) Chapter 4 Chen, Luo, and Wei (2021b) Chapter 5 Chen, Luo, and Rosenberg (2022) Chapter 6 Chen and Luo (2022) Chapter 7 Chen et al. (2022) Table 1.1: A mapping from chapters to papers All results above are in the online learning setting. In Chapter 7, we take a turn and study the PAC learning setting in SSP (Section 1.5). We discover some surprising negative results and settle the sample complexity of learning-optimal polices in SSP with a generative model. 
In Table 1.1, we provide a detailed mapping between the chapter index and the original paper (which are all published in conference proceedings). 1.7 Notation Before diving into our results, we rst introduce some frequently used notation. Forn2N + , we dene [n] =f1;:::;ng. The notation ~ O () hides all logarithmic terms including lnK and ln 1 for some condence level2 (0; 1). Dene (x) + = maxf0;xg. For a functionX :S + ! R and a distributionP overS + , denote byPX =E S 0 P [X(S 0 )],PX 2 =E S 0 P [X(S 0 ) 2 ], andV(P;X) = Var S 0 P [X(S 0 )] the expectation, second moment, and variance ofX(S 0 ) respectively whereS 0 is drawn fromP . In pseudocode,x + y is a shorthand for the increment operationx x +y. 10 Chapter2 ImplicitFinite-HorizonReduction: EcientOptimalAlgorithmsfor SSP 2.1 Overview: ExistingAlgorithmsforSSPwithStochasticCosts In this chapter, we consider online learning in SSP with stochastic costs. The learning protocol and regret are dened in Protocol 1 and Eq. (1.1) respectively. There are several existing works in this direction. Tarbouriech et al. (2020) develop the rst regret minimization algorithm for SSP with a regret bound of ~ O(D 3=2 S p AK=c min ), whereD is the diameter. Cohen et al. (2020) improve over their results and give a near optimal regret bound of ~ O(B ? S p AK), whereB ? D is the largest expected cost of the optimal policy starting from any state. Even more recently, Cohen et al. (2021) achieve minimax regret of ~ O(B ? p SAK) through a nite-horizon reduction technique, and concurrently Tarbouriech et al. (2021b) also propose minimax optimal and parameter-free algorithms. Notably, all existing algorithms are model-based with space complexity (S 2 A). Moreover, they all update the learner’s policy through full-planning (a term taken from (Efroni et al., 2019)), incurring a relatively high time complexity. In this chapter, we further advance the state of the art by proposing a generic template for regret minimization algorithms in SSP (Algorithm 4), which achieves minimax optimal regret as long as some 11 properties are ensured. By instantiating our template dierently, we make the following two key algorithmic contributions: • In Section 2.3, we develop the rst model-free SSP algorithm calledLCB-Advantage-SSP (Algorithm 5). Similar to most model-free reinforcement learning algorithms,LCB-Advantage-SSP does not estimate the transition directly, enjoys a space complexity of ~ O(SA), and also takes onlyO (1) time to update certain statistics in each step, making it a highly ecient algorithm. It achieves a regret bound of ~ O(B ? p SAK +B 5 ? S 2 A=c 4 min ), which is minimax optimal whenc min > 0. Moreover, it can be made parameter-free without worsening the regret bound. • In Section 2.4, we develop another simple model-based algorithm calledSVI-SSP (Algorithm 6), which achieves minimax regret ~ O(B ? p SAK +B ? S 2 A) even whenc min = 0, matching the best existing result by Tarbouriech et al. (2021b). ∗ Notably, compared to their algorithm (as well as other model-based algorithms), SVI-SSP is computationally much more ecient since it updates each state-action pair only logarithmically many times, and each update only performs one-step planning (again, a term taken from (Efroni et al., 2019)) as opposed to full-planning (such as value iteration or extended value iteration); see more concrete time complexity comparisons in Section 2.4. SVI-SSP can also be made parameter-free following the idea of (Tarbouriech et al., 2021b). 
We include a summary of regret bounds of all existing SSP algorithms as well as more complexity comparisons in Table 2.1. Assumption onc min Similar to many previous works, our analysis requiresc min being known and strictly positive. Whenc min is unknown or known to be 0, a simple workaround is to solve a modied SSP instance with all observed costs clipped to if they are below some > 0, so thatc min = > 0. ∗ Depending on the available prior knowledge, the nal bounds achieved bySVI-SSP are slightly dierent, but they all match that of EB-SSP. See (Tarbouriech et al., 2021b, Table 1) for more details. 12 Table 2.1: Summary of existing regret minimization algorithms for SSP with their best achievable bounds (assuming necessary prior knowledge). Here,D is the diameter,T max is the maximum expected hitting time of the optimal policy over all states, andB ? is the maximum expected costs of the optimal policy over all states. Note that althoughLCB-Advantage-SSP has a larger lower order term depending on ~ O(1=c 4 min ) among the minimax optimal algorithms, it actually nearly matches that of ULCVI whenT max is unknown, in which case their algorithm is run withT max replaced by its upper boundB ? =c min . Algorithm Regret Bound UC-SSP (Tarbouriech et al., 2020) ~ O DS p DAK=c min +S 2 AD 2 Bernstein-SSP (Cohen et al., 2020) ~ O B ? S p AK + p B 3 ? S 2 A 2 =c min ULCVI (Cohen et al., 2021) ~ O B ? p SAK +T 4 max S 2 A EB-SSP (Tarbouriech et al., 2021b) ~ O B ? p SAK +B ? S 2 A LCB-Advantage-SSP(Ours) ~ O B ? p SAK +B 5 ? S 2 A=c 4 min SVI-SSP(Ours) ~ O B ? p SAK +B ? S 2 A Then the regret in this modied SSP is similar to that in the original SSP up to an additive term of order O (K) (Tarbouriech et al., 2020). Therefore, throughout the paper we assume thatc min is known and strictly positive unless explicitly stated otherwise. Notation Here we introduce notation used in this chapter.B ? = max s2S V ? (s) is the maximum expected cost of the optimal policy over all starting states. For notational convenience, we writeV ;P;c asV and Q ;P;c asQ . We useC K = P K k=1 P I k i=1 c k i in the analysis to denote the total costs suered by the learner overK episodes. Denote bydxe 2 = 2 dlog 2 xe andbxc 2 = 2 blog 2 xc the closest power of two upper and lower boundingx respectively. 2.2 ImplicitFinite-HorizonApproximation In this section, we introduce our main analytical technique, that is, implicitly approximating the SSP problem with a nite-horizon counterpart. We start with a general template of our algorithms shown in Algorithm 4. For notational convenience, we concatenate state-action-cost trajectories of all episodes as one single sequence (s t ;a t ;c t ) fort = 1; 2;:::;T , wheres t 2S is one of the non-goal states,a t 2A st is 13 Algorithm4 A General Algorithmic Template for SSP Initialize:t 0,s 1 s init ,Q(s;a) 0 for all (s;a)2 . fork = 1;:::;K do repeat 1 Increment time stept + 1. 2 Take actiona t = argmin a Q(s t ;a), suer costc t , transit to and observes 0 t . 3 UpdateQ (so that it satises Property 1 and Property 2). 4 ifs 0 t 6=g thens t+1 s 0 t ;elses t+1 s init ,break. RecordT t (that is, the total number of steps). the action taken ats t , andc t is the resulting cost incurred by the learner. Note that the goal stateg is never included in this sequence (since no action is taken there), and we also use the notations 0 t 2S + to denote the next-state following (s t ;a t ), so thats t+1 is simplys 0 t unlesss 0 t =g (in which cases t+1 is reset to the initial states init ); see Line 4. 
The template follows a rather standard idea for many reinforcement learning algorithms: maintain an (optimistic) estimateQ of the optimal action-value functionQ ? , and act greedily by taking the action with the smallest estimate:a t = argmin a Q(s t ;a); see Line 2. The key of the analysis is often to bound the estimation errorQ ? (s t ;a t )Q(s t ;a t ), which is relatively straightforward in a discounted setting (where the discount factor controls the growth of the error) or a nite-horizon setting (where the error vanishes after a xed number of steps), but becomes highly non-trivial for SSP due to the lack of similar structures. A natural idea is to explicitly solve a discounted problem or a nite-horizon problem that approximates the original SSP well enough. Unfortunately, both approaches are problematic: approximating an undis- counted MDP by a discounted one often leads to suboptimal regret (Wei et al., 2020); on the other hand, while explicitly approximating SSP with a nite-horizon problem can lead to optimal regret (Chen, Luo, and Wei, 2021b; Cohen et al., 2021), it greatly increases the space complexity of the algorithm, and also produces non-stationary policies, which is unnatural and introduces unnecessary complexity since the optimal policy in SSP is stationary. 14 Therefore, we propose to approximate the original SSP instanceM with a nite-horizon counterpart f M implicitly (that is, only in the analysis). We defer the formal denition of f M to Appendix A.2, which is similar to those in (Chen, Luo, and Wei, 2021b; Cohen et al., 2021) and corresponds to interacting with the original SSP forH steps (for some integerH) and then teleporting to the goal. All we need in the analysis are the optimal value functionV ? h and optimal action-value functionQ ? h of f M for each steph2 [H], which can be dened recursively without resorting to the denition of f M: Q ? h (s;a) =c(s;a) +P s;a V ? h1 ; V ? h (s) = min a Q ? h (s;a); (2.1) withQ ? 0 (s;a) = 0 for all (s;a). † Intuitively,Q ? H approximatesQ ? well whenH is large enough. This is formally summarized in the lemma below, whose proof is similar to prior works (see Appendix A.2). Lemma 1. For any value of H, Q ? H (s;a) Q ? (s;a) holds for all (s;a). For any 2 (0; 1), if H 4B? c min ln(2=) + 1, thenQ ? (s;a)Q ? H (s;a) +B ? holds for all (s;a). In the remaining discussion, we x a particular value ofH. To carry out the regret analysis, we now specify two general requirements of the estimateQ. LetQ t be the value ofQ at the beginning of time step t (that is, the value used in ndinga t ). ThenQ t needs to satisfy: Property1 (Optimism). With high probability,Q t (s;a)Q ? (s;a) holds for all (s;a) andt 1. Property2 (Recursion). There exists a “bonus overhead” H > 0 and an absolute constantd> 0 such that the following holds with high probability: T X t=1 ( Q(s t ;a t )Q t (s t ;a t )) + H + 1 + d H T X t=1 ( V (s t )Q t (s t ;a t )) + ; for Q =Q ? h and V =V ? h1 (h = 1;:::;H) as well as Q =Q ? and V =V ? . ‡ † Note that our notation is perhaps unconventional compared to most works on nite-horizon MDPs, whereQ ? h usually refers to ourQ ? Hh . We make this switch (only in this chapter) since we want to highlight the dependence onH forQ ? H . ‡ Note thatH might be a random variable. In fact, it often depends onCK . 15 Property 1 is standard and can usually be ensured by using a certain “bonus” term derived from concentration equalities in the update. 
These bonus terms on (s t ;a t ) accumulate into some bonus overhead in the nal regret bound, which is exactly the role of H in Property 2. In both of our algorithms, H has a leading-order term ~ O( p B ? SAC K ) and a lower-order term that increases inH. Property 2 is a key property that provides a recursive form of the estimation error and allows us to connect it to the nite-horizon approximation. This is illustrated through the following two lemmas. Lemma2. Property 2 implies P T t=1 (Q ? H (s t ;a t )Q t (s t ;a t )) + O (H H ). Proof. With Q =Q ? H and V =V ? H1 , Property 2 implies T X t=1 (Q ? H (s t ;a t )Q t (s t ;a t )) + H + 1 + d H T X t=1 (V ? H1 (s t )Q t (s t ;a t )) + H + 1 + d H T X t=1 (Q ? H1 (s t ;a t )Q t (s t ;a t )) + ; where in the last step we use the optimality ofV ? H1 from Eq. (2.1). Repeatedly applying this argument, we eventually arrive at P T t=1 (Q ? H (s t ;a t )Q t (s t ;a t )) + H 1 + d H H H + 1 + d H H P T t=1 (Q ? 0 (s t ;a t ) Q t (s t ;a t )) + =O (H H ), where the last step uses the facts Q ? 0 (s t ;a t ) = 0 and 1 + d H H e d (an absolute constant). Lemma 3. For any2 (0; 1), ifH 4B? c min ln(2=) + 1, then Property 1 and Property 2 together imply P T t=1 Q ? (s t ;a t )V ? (s t ) =O (C K + H ). Proof. Applying Property 2 with Q = Q ? and V = V ? , we have P T t=1 (Q ? (s t ;a t )Q t (s t ;a t )) + H + 1 + d H P T t=1 (V ? (s t )Q t (s t ;a t )) + . Now note that by Property 1, the Bellman optimality equation V ? (s t ) = min a Q ? (s t ;a), and the factQ t (s t ;a t ) = min a Q t (s t ;a) (by the denition ofa t ), the arguments 16 within the clipping operation () + are all non-negative and thus the clipping can be removed. Rearranging terms then gives T X t=1 Q ? (s t ;a t )V ? (s t ) H + d H T X t=1 (V ? (s t )Q t (s t ;a t )) H + d H T X t=1 (Q ? (s t ;a t )Q t (s t ;a t )): (optimality ofV ? ) It remains to bound the last term using the nite-horizon approximationQ ? H as a proxy: T X t=1 (Q ? (s t ;a t )Q t (s t ;a t )) = T X t=1 (Q ? (s t ;a t )Q ? H (s t ;a t ) +Q ? H (s t ;a t )Q t (s t ;a t )) =O (TB ? +H H ); where the last step uses Lemma 1 and Lemma 2. Importantly, this term is nally scaled byd=H, which, together with the fact TB? H c min TC K , proves the claimed bound. Readers familiar with the literature might already recognize the term P T t=1 Q ? (s t ;a t )V ? (s t ) consid- ered in Lemma 3, which is closely related to the regret. Indeed, with this lemma, we can conclude a regret bound for our generic algorithm. Theorem1. For any2 (0; 1), ifH 4B? c min ln(2=) + 1, then Algorithm 4 ensures (with high probability) R K = ~ O p B ? C K +B ? +C K + H . Proof. We rst decompose the regret as follows, which holds generally for any algorithm: R K = K X k=1 I k X i=1 c k i V ? (s k 1 ) ! K X k=1 I k X i=1 c k i V ? (s k i ) +V ? (s k i+1 ) = T X t=1 (c t V ? (s t ) +V ? (s 0 t )) = T X t=1 (c t c(s t ;a t )) + T X t=1 (V ? (s 0 t )P st;at V ? ) + T X t=1 (Q ? (s t ;a t )V ? (s t )): (2.2) 17 The rst and the second term are the sum of a martingale dierence sequence (sinces 0 t is drawn fromP st;at ) and can be bounded by ~ O p C K and ~ O p B ? C K +B ? respectively using concentration inequalities; see Lemma 13, Lemma 125, and Lemma 14. The third term can be bounded using Lemma 3 directly, which nishes the proof. To get a sense of the regret bound in Theorem 1, rst note that since 1= only appears in a logarithmic term of the required lower bound ofH, one can pick to be small enough so that the termC K is dominated by others. 
Moreover, if H is ~ O( p B ? SAC K ) plus some lower-order term H (which as mentioned is the case for our algorithms), then by solving a quadratic of p C K , the regret bound of Theorem 1 implies R K = ~ O(B ? p SAK + H ), which is minimax optimal (ignoring H )! Based on this analytical technique, it remains to design algorithms satisfying the two required properties. In the following sections, we provide two such examples, leading to the rst model-free SSP algorithm and an improved model-based SSP algorithm. 2.3 TheFirstModel-freeAlgorithm: LCB-Advantage-SSP In this section, we present a model-free algorithm (the rst in the literature) calledLCB-Advantage-SSP that falls into our generic template and satises the required properties. It is largely inspired by the state-of-the-art model-free algorithmUCB-Advantage (Zhang, Zhou, and Ji, 2020) for the nite-horizon problem. The pseudocode is shown in Algorithm 5, with only the lines instantiating the update rule of the Q estimates numbered. Importantly, the space complexity of this algorithm is onlyO (SA) since we do not estimate the transition directly or conduct explicit nite-horizon reduction, and the time complexity is only O (1) in each step. Specically, for each state-action pair (s;a), we divide the samples received when visiting (s;a) into consecutive stages of exponentially increasing length, and only updateQ(s;a) at the end of a stage. The 18 Algorithm5LCB-Advantage-SSP Parameters: horizonH, threshold ? , and failure probability2 (0; 1). Dene:L ? =fE j g j2N + whereE j = P j i=1 e i ,e 1 =H ande j+1 =b(1 + 1=H)e j c. Initialize:t 0,s 1 s init ,B 1, for all (s;a);N(s;a) 0;M(s;a) 0. Initialize: for all (s;a);Q(s;a) 0;V (s) 0;V ref (s) V (s); b C(s;a) 0. Initialize: for all (s;a); ref (s;a) 0; ref (s;a) 0;(s;a) 0;(s;a) 0,v(s;a) 0. fork = 1;:::;K do repeat Increment time stept + 1. Take actiona t = argmin a Q(s t ;a), suer costc t , transit to and observes 0 t . 1 Increment visitation counters:n =N(s t ;a t ) + 1;m =M(s t ;a t ) + 1. 2 Update global accumulators: ref (s t ;a t ) + V ref (s 0 t ); ref (s t ;a t ) + V ref (s 0 t ) 2 , b C(s t ;a t ) + c t . 3 Update local accumulators:v(s t ;a t ) + V (s 0 t ); (s t ;a t ) + V (s 0 t )V ref (s 0 t ); (s t ;a t ) + (V (s 0 t ) V ref (s 0 t )) 2 . 4 ifn2L ? then 5 Compute 256 ln 6 (4SAB 8 ? n 5 =), cost estimatorb c = b C(st;at) n , bonusesb 0 2 q B 2 m + q b c n + n andb r ref (st;at) =n ( ref (st;at) =n) 2 n + r (st;at) =m ( (st;at) =m) 2 m + 4B n + 3B m + r b c n : 6 Q(s t ;a t ) max n b c + v(st;at) m b 0 ;Q(s t ;a t ) o . 7 Q(s t ;a t ) max n b c + ref (st;at) n + (st;at) m b;Q(s t ;a t ) o . 8 V (s t ) min a Q(s t ;a). 9 ifV (s t )>B thenB 2V (s t ). 10 Reset local accumulators:v(s t ;a t ) 0;(s t ;a t ) 0;(s t ;a t ) 0;M(s t ;a t ) 0. 11 if P a N(s t ;a) is a power of two not larger than ? thenV ref (s t ) V (s t ). ifs 0 t 6=g thens t+1 s 0 t ;elses t+1 s init ,break. number of samplese j in stagej is dened throughe 1 =H ande j+1 =b(1 + 1=H)e j c for some parameter H. Further deneL ? =fE j g j2N + withE j = P j i=1 e i , which contains all the indices indicating the end of some stage. As mentioned, the algorithm only updatesQ(s;a) when the total number of visits to (s;a) falls into the setL ? (Line 4). The algorithm also maintains an estimateV forV ? , which always satises V (s) = min a Q(s;a) (Line 8), and importantly another reference value functionV ref whose role and update rule are to be discussed later. 
19 In addition, some local and global accumulators are maintained in the algorithm. Local accumulators only store information related to the current stage. These include: M(s;a), the number of visits to (s;a) within the current stage;v(s;a), the cumulative value ofV (s 0 ) within the current stage, wheres 0 represents the next state after each visit to (s;a); and nally(s;a) and(s;a), the cumulative values of V (s 0 )V ref (s 0 ) and its square respectively within the current stage (Line 3). These local accumulators are reset to zero at the end of each stage (Line 10). On the other hand, global accumulators store information related to all stages and are never reset. These include:N(s;a), the number of visits to (s;a) since the beginning; b C(s;a), total cost incurred at (s;a) since the beginning; and ref (s;a) and ref (s;a), the cumulative value ofV ref (s 0 ) and its square respectively since the beginning, where agains 0 represents the next state after each visit to (s;a) (Line 2). We are now ready to describe the update rule ofQ. The rst update, Line 6, is intuitively based on the equalityQ ? (s;a) = c(s;a) +P s;a V ? and usesv(s;a)=M(s;a) as an estimate forP s;a V ? together with a (negative) bonusb 0 derived from Azuma’s inequality (Line 5). As mentioned, the bonus is necessary to ensure Property 1 (optimism) so thatQ is always a lower condence bound ofQ ? (hence the name “LCB”). Note that this update only uses data from the current stage (roughly 1=H fraction of the entire data collected so far), which leads to an extra p H factor in the regret. To address this issue, Zhang, Zhou, and Ji (2020) introduce a variance reduction technique via a reference-advantage decomposition, which we borrow here leading to the second update rule in Line 7. This is intuitively based on the decompositionP s;a V ? = P s;a V ref +P s;a (V ? V ref ), whereP s;a V ref is approximated by ref (s;a)=N(s;a) andP s;a (V ? V ref ) is approximated by(s;a)=M(s;a). In addition, a “variance-aware” bonus termb is applied, which is derived from a tighter Freedman’s inequality (Line 5). The reference functionV ref is some snapshot of the past value ofV , and is guaranteed to beO(c min ) close to V ? on a particular state as long as the number of visits to this state exceeds some threshold ? = ~ O B 2 ? H 3 SA=c 2 min (Line 11). Overall, this second update rule not only removes the extra p H factor 20 as in (Zhang, Zhou, and Ji, 2020), but also turns some terms of order ~ O( p T ) into ~ O( p C K ) in our context, which is important for obtaining the optimal regret. Despite the similarity, we emphasize several key dierences between our algorithm and that of (Zhang, Zhou, and Ji, 2020). First, (Zhang, Zhou, and Ji, 2020) maintains a dierentQ estimate for each step of an episode (which is natural for a nite-horizon problem), while we only maintain oneQ estimate (which is natural for SSP). Second, we update the reference functionV ref (s) whenever the number of visits to s doubles (while still below the threshold ? ; see Line 11), instead of only updating it once as in (Zhang, Zhou, and Ji, 2020). We show in Lemma 17 that this helps reduce the sample complexity and leads to a smaller lower-order term in the regret. Third, since there is no apriori known upper bound onV (unlike the nite-horizon setting), we maintain an empirical upper boundB (in a doubling manner) such that V (s)B 2B ? (Line 9), which is further used in computing the bonus termsb andb 0 . 
This is important for eventually developing a parameter-free algorithm. In Appendix A.3, we show that Algorithm 5 indeed satises the two required properties. Theorem2. LetH =d 4B? c min ln( 2 ) + 1e 2 for = c min 2B 2 ? SAK and ? = ~ O B 2 ? H 3 SA c 2 min be dened in Lemma 17. Then Algorithm 5 satises Property 1 and Property 2 withd = 3 and H = ~ O p B ? SAC K + B 2 ? H 3 S 2 A c min . Proof Sketch. The proof of Property 1 largely follows the analysis of (Zhang, Zhou, and Ji, 2020, Proposition 4) for the designed bonuses. To prove Property 2, similarly to (Zhang, Zhou, and Ji, 2020) we can show: T X t=1 ( Q(s t ;a t )Q t (s t ;a t )) + . H + T X t=1 1 m t mt X i=1 P s l t;i ;a l t;i ( VV l t;i ) + ; wherem t is the value ofm used in computingQ t (s t ;a t ), and l t;i is thei-th time step the agent visits (s t ;a t ) among those m t steps. Now it suces to show that P T t=1 1 mt P mt i=1 P s l t;i ;a l t;i ( V V l t;i ) + . (1 + 3 H ) P T t=1 ( V (s t )V t (s t )) + , which is proven in Lemma 22. As a direct corollary of Theorem 1, we arrive at the following regret guarantee. 21 Theorem3. WiththesameparametersasinTheorem2,withprobabilityatleast 160,Algorithm5ensures R K = ~ O B ? p SAK + B 5 ? S 2 A c 4 min . We make several remarks on our results. First, while Algorithm 5 requires setting the two parameters H and ? in terms ofB ? to obtain the claimed regret bound, one can in fact achieve the exact same bound without knowingB ? by slightly changing the algorithm. The high level idea is to rst apply the doubling trick from Tarbouriech et al. (2021b) to determine an upper bound onB ? , then try logarithmically many dierent values ofH and ? simultaneously, each leading to a dierent update rule forQ andV ref . This only increases the time and space complexity by a logarithmic factor, without hurting the regret (up to log factors); see details in (Chen et al., 2021, Section D.5). Second, as mentioned in Section 2.1, whenc min is unknown orc min = 0, one can clip all observed costs to if they are below> 0, which introduces an additive regret term of orderO (K). By picking to be of orderK 1=5 , our bound becomes ~ O K 4=5 ignoring other parameters. Although most existing works suer the same issue, this is certainly undesirable, and our second algorithm to be introduced in the next section completely avoids this issue by having only logarithmic dependence on 1=c min . Finally, we point out that, just as in the nite-horizon case, the variance reduction technique is crucial for obtaining the minimax optimal regret. For example, if one instead uses an update rule similar to the (suboptimal) Q-learning algorithm of (Jin et al., 2018), then this is essentially equivalent to removing the second update (Line 7) of our algorithm. While this still satises Property 2, the bonus overhead H would be p H times larger, resulting in a suboptimal leading term in the regret. 2.4 AnOptimalandEcientModel-basedAlgorithm: SVI-SSP In this section, we propose a simple model-based algorithm calledSVI-SSP (Sparse Value Iteration for SSP) following our template, which not only achieves the minimax optimal regret even whenc min = 0, matching the state-of-the-art by a recent work (Tarbouriech et al., 2021b), but also admits highly sparse updates, 22 making it more ecient than all existing model-based algorithms. The pseudocode is in Algorithm 6, again with only the lines instantiating the update rule forQ numbered. 
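Before turning to the details of SVI-SSP, we briefly make the clipping remark above concrete. Ignoring B_*, S, A and logarithmic factors, clipping all costs below \epsilon to \epsilon turns c_min into \epsilon in Theorem 3 and adds at most \epsilon K to the regret, so
\[
R_K = \tilde{O}\Big(\sqrt{K} + \frac{1}{\epsilon^4} + \epsilon K\Big),
\qquad
\frac{1}{\epsilon^4} = \epsilon K \;\Longrightarrow\; \epsilon = K^{-1/5} \;\Longrightarrow\; R_K = \tilde{O}\big(K^{4/5}\big),
\]
where the $1/\epsilon^4$ term comes from the $c_{\min}^{-4}$ lower-order term of Theorem 3 and the $\epsilon K$ term is the clipping bias.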
Similar to Algorithm 5, SVI-SSP divides samples of each (s;a) into consecutive stages of (roughly) exponentially increasing length, and only updatesQ(s;a) at the end of a stage (Line 2). However, the number of samplese j in stagej is dened slightly dierently throughe j =be e j c;e e 1 = 1, ande e j+1 =e e j + 1 H e j for some parameter H. In the long run, this is almost the same as the scheme used in Algorithm 5, but importantly, it forces more frequent updates at the beginning — for example, one can verify that e 1 = =e H = 1, meaning thatQ(s;a) is updated every time (s;a) is visited for the rstH visits. This slight dierence turns out to be important to ensure that the lower-order term in the regret has no poly(H) dependence, as shown in Lemma 24 and further discussed in Remark 5. The update rule forQ is very simple (Line 5). It is again based on the equalityQ ? (s;a) =c(s;a)+P s;a V ? , but this time uses P s;a Vb as an approximation forP s;a V ? , where P s;a is the empirical transition directly calculated from two countersn(s;a) andn(s;a;s 0 ) (number of visits to (s;a) and (s;a;s 0 ) respectively), V is such thatV (s) = min a Q(s;a), andb is a special bonus term (Line 4) adopted from (Tarbouriech et al., 2021b; Zhang, Ji, and Du, 2020) which ensures thatQ is an optimistic estimate ofQ ? and also helps remove poly(H) dependence in the regret. SVI-SSP exhibits a unique structure compared to existing algorithms. In each update, it modies only one entry ofQ (similarly to model-free algorithms), while other model-based algorithms such as (Tarbouriech et al., 2021b) perform value iteration for every entry ofQ repeatedly until convergence (concrete time complexity comparisons to follow). We emphasize that our implicit nite-horizon analysis is indeed the key to enable us to derive a regret guarantee for such a sparse value iteration algorithm. Specically, in Appendix A.4, we show thatSVI-SSP satises the two required properties. 23 Algorithm6SVI-SSP Parameters: horizonH, value function upper boundB, and failure probability2 (0; 1). Dene:L =fE j g j2N +, whereE j = P j i=1 e i ;e j =be e j c, ande e 1 = 1;e e j+1 =e e j + 1 H e j . Initialize:t 0;s 1 s init . Initialize: for all (s;a;s 0 );n(s;a;s 0 ) 0;n(s;a) 0,Q(s;a) 0,V (s) 0, b C(s;a) 0. fork = 1;:::;K do repeat Increment time stept + 1. Take actiona t = argmin a Q(s t ;a), suer costc t , transit to and observes 0 t . 1 Update accumulators:n =n(s t ;a t ) + 1;n(s t ;a t ;s 0 t ) + 1, b C(s t ;a t ) + c t . 2 ifn2Lthen 3 Update empirical transition: P st;at (s 0 ) n(st;at;s 0 ) n for alls 0 . 4 Compute 20 ln 2SAn , cost estimator b c b C(s;a) n , and bonus b max n 7 q V( Ps t ;a t ;V ) n ; 49B n o + q b c n . 5 Q(s t ;a t ) maxfb c + P st;at Vb;Q(s t ;a t )g. 6 V (s t ) argmin a Q(s t ;a). ifs 0 t 6=g thens t+1 s 0 t ;elses t+1 s init ,break. Theorem4. IfBB ? andH =d 4B c min ln( 2 ) + 1e 2 for = c min 2B 2 SAK , then Algorithm 6 satises Property 1 and Property 2 withd = 1 and H = ~ O( p B ? SAC K +BS 2 A +C K ), where the dependence onH in H is hidden in logarithmic terms. Proof Sketch. The proof of Property 1 largely follows the analysis of (Tarbouriech et al., 2021b, Lemma 15). To prove Property 2, we rst show P T t=1 ( Q(s t ;a t )Q t (s t ;a t )) + . H + P T t=1 P t ( VV lt ) + , wherel t is the last time stepQ(s t ;a t ) is updated. Then, the remaining main steps are shown below with all details deferred to the corresponding key lemmas: T X t=1 P t ( VV lt ) + . 1 + 1 H T X t=1 P t ( VV t ) + (Lemma 24) . 
1 + 1 H T X t=1 ( V (s t )V t (s t )) + + 1 + 1 H T X t=1 (P t I s 0 t )( VV t ) + . 1 + 1 H T X t=1 ( V (s t )V t (s t )) + + H ; (Lemma 30 and Lemma 29) which completes the proof. 24 Again, as a direct corollary of Theorem 1, we arrive at the following regret guarantee. Theorem5. WiththesameparametersasinTheorem4,withprobabilityatleast 112,Algorithm6ensures R K = ~ O(B ? p SAK +BS 2 A). SettingB =B ? , our bound becomes ~ O(B ? p SAK +B ? S 2 A), which is minimax optimal even when c min is unknown orc min = 0 (this is because the dependence on 1=c min is only logarithmic, and one can clip all observed costs to if they are below = 1=K in this case without introducing poly(K) overhead to the regret). WhenB ? is unknown, we can use the same doubling trick from Tarbouriech et al. (2021b) to obtain almost the same bound (with only the lower-order term increased to ~ O B 3 ? S 3 A ); see (Chen et al., 2021, Section E.5) for details. ComparisonwithEB-SSP(Tarbouriechetal.,2021b) Our regret bounds match exactly the state-of- the-art by Tarbouriech et al. (2021b). Thanks to the sparse update, however,SVI-SSP has a much better time complexity. Specically, forSVI-SSP, each (s;a) is updated at most ~ O(H) = ~ O( B? =c min ) times (Lemma 24), and each update takesO(S) time, leading to total complexity ~ O( B?S 2 A =c min ). On the other hand, for EB-SSP, although each (s;a) only causes ~ O(1) updates, each update runs value iteration on all entries ofQ until convergence, which takes ~ O( B 2 ? S 2 =c 2 min ) iterations (see their Appendix C) and leads to total complexity ~ O( B 2 ? S 5 A =c 2 min ), much larger than ours. Comparison with ULCVI (Cohen et al., 2021) Another recent work by Cohen et al. (2021) using explicit nite-horizon approximation also achieves minimax regret but requires the knowledge of some hitting time of the optimal policy. Without this knowledge, their bound has a large 1=c 4 min dependence in the lower-order term just as our model-free algorithm. Our results in this section show that implicit nite-horizon approximation has an advantage over explicit approximation apart from reducing space complexity: the former does not necessarily introduce poly(H) dependence even for the lower-order term, while the latter does under the current analysis. 25 2.5 OpenProblems The regret of our model-free algorithmLCB-Advantage-SSP has a large constant term that scales with B ? =c min . An important open problem is developing a horizon-free model-free algorithm, which avoids the undesirable 1=c min dependency in the lower order term. Notably, such an algorithm does not exist even for the simpler nite-horizon MDPs. Another open question is whether we can achieve a regret bound with ~ O(B ? SA) lower order term for the model-based algorithm. Ménard et al. (2021) show that a ~ O(H 4 SA) lower order term is possible in a nite-horizon MDP with horizonH, where they achieve a linearS dependency with the price of a quite sub-optimal horizon dependency. Overall, how to obtain a smaller lower order term is an important future direction as it determines the length of the "burn-in" phase. 26 Chapter3 LearninginSSPwithLinearStructure 3.1 Overview: SSPwithLinearFunctionApproximation Modern reinforcement learning applications often need to handle a massive state space, in which function approximation is necessary. In this chapter, we study SSP with linear structure satisfying Assumption 1. 
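For concreteness, we recall the standard linear MDP structure of (Jin et al., 2020b) and (Vial et al., 2021) underlying this chapter (a paraphrase for the reader's convenience; the precise statement we work with is Assumption 1): there is a known feature map $\phi: \mathcal{S}\times\mathcal{A} \to \mathbb{R}^d$ such that
\[
P(s' \mid s, a) = \phi(s, a)^{\top} \mu(s') \quad \text{and} \quad c(s, a) = \phi(s, a)^{\top} \theta
\]
for some unknown measures $\mu(\cdot)$ and an unknown vector $\theta \in \mathbb{R}^d$, so that both transitions and (expected) costs are linear in the $d$-dimensional features.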
There is huge progress in the study of linear function approximation, for both the nite-horizon setting (Ay- oub et al., 2020; Jin et al., 2020b; Yang and Wang, 2020; Zanette et al., 2020a; Zanette et al., 2020b; Zhou, Gu, and Szepesvari, 2021) and the innite-horizon setting (Wei et al., 2021; Zhou, Gu, and Szepesvari, 2021; Zhou, He, and Gu, 2021). Recently, Vial et al. (2021) took the rst step in considering linear function approximation for SSP. They study SSP dened over a linear MDP, and proposed a computationally inecient algorithm with regret ~ O( p d 3 B 3 ? K=c min ), as well as another ecient algorithm with regret ~ O(K 5=6 ) (omitting other dependency). Here,d is the dimension of the feature space andB ? is an upper bound on the expected costs of the optimal policy. Later, Min et al. (2021) study a related but dierent SSP problem dened over a linear mixture MDP and achieve a ~ O(dB 1:5 ? p K=c min ) regret bound. Despite leveraging the advances from both the nite-horizon and innite-horizon settings, the results above are still far from optimal in terms of regret guarantee or computational eciency, demonstrating the unique challenge of SSP problems. In this chapter, we further extend our understanding of SSP with linear function approximation (more specically, with linear MDPs). Specically, in Section 3.2, we propose a new analysis for the nite-horizon 27 approximation of SSP introduced in (Cohen et al., 2021), which is much simpler and achieves a smaller approximation error. Our analysis is alsomodelagnostic, meaning that it does not make use of the modeling assumption and can be applied to both the tabular setting and function approximation settings. Combining this new analysis with a simple nite-horizon algorithm similar to that of (Jin et al., 2020b), we achieve a regret bound of ~ O( p d 3 B 2 ? T max K), withT max B ? =c min being an upper bound of the hitting time of the optimal policy, which strictly improves over that of (Vial et al., 2021). Notably, unlike their algorithm, ours is computationally ecient without any extra assumption. Key parameters and notation For notational convenience, we writeV ;P;c asV andQ ;P;c asQ . Two extra parameters that play a key role in our analysis are:B ? = max s V ? (s), the maximum cost of the optimal policy starting from any state, andT max = max s T ? (s), the maximum hitting time of the optimal policy starting from any state. By denition, we haveT max B ? =c min . For simplicity, we assume thatB ? andT max are known to the learner for most discussions; see (Chen, Jain, and Luo, 2021) on what we can achieve when some of these parameters are unknown. We also assume B ? > 1 similar to (Cohen et al., 2021). For anylr, we dene [x] [l;r] = minfmaxfx;lg;rg as the projection ofx onto the interval [l;r]. 3.2 AnEcientAlgorithmforLinearSSP In this section, we introduce a computationally ecient algorithm for linear SSP. In Section 3.2.1, we rst develop an improved analysis for the nite-horizon approximation of (Cohen et al., 2021). Then in Section 3.2.2, we combine this approximation with a simple nite-horizon algorithm, which together achieves ~ O( p d 3 B 2 ? T max K) regret. 28 3.2.1 Finite-HorizonApproximationofSSP Finite-horizon approximation has been frequently used in solving SSP problems (Chen, Luo, and Wei, 2021b; Chen and Luo, 2021; Cohen et al., 2021; Chen et al., 2021). In particular, Cohen et al. 
(2021) proposed a black-box reduction from SSP to a nite-horizon MDP, which achieves minimax optimal regret bound in the tabular case when combining with a certain nite-horizon algorithm. We will make use of the same algorithmic reduction in our proposed algorithm, but with an improved analysis. Specically, for an SSP instanceM = (S;A;s init ;g;c;P ), dene its nite-horizon MDP counterpart as f M = (S + ; e A;e c;c f ; e P;H), where e A =A[fA g g withA g =fa g g (a g is a virtual action) is the extended action space,e c(s;a) =c(s;a)Ifs6=gg is the extended cost function,c f (s) = 2B ? Ifs6=gg is the terminal cost function (more details to follow), e P : e ! S + with e = [f(g;a g )g andP g;ag (s 0 ) =Ifs 0 =gg is the extended transition function, andH is a horizon parameter. Assume the access to a corresponding nite-horizon algorithmA which learns through a certain number of “intervals” following the protocol below. At the beginning of an intervalm, the learnerA is rst reset to an arbitrary states m 1 . Then, in each steph = 1;:::;H within this interval,A decides an actiona m h , transits tos m h+1 e P s m h ;a m h , and suers cost e c(s m h ;a m h ). At the end of the interval, the learner suers an additional terminal costc f (s m H+1 ), and then moves on to the next interval. With such a black-box access toA, the reduction of (Cohen et al., 2021) is depicted in Algorithm 14. The algorithm partitions the time steps into intervals of lengthH 4T max ln(4K) (such that ? reachesg withinH steps with high probability). In each step, the algorithm followsA in a natural way and feeds the observations toA (Line 5 and Line 6). If the goal state is not reached within an interval,A naturally enters the next interval with the initial state being the current state (Line 10). Otherwise, if the goal state is reached within some interval, we keep feedingg and zero cost toA until it nishes the current interval (Line 8 and Line 6), and after that, the next interval corresponds to the beginning of the next episode of the original SSP problem (Line 1). 29 Algorithm7 Finite-Horizon Approximation of SSP from (Cohen et al., 2021) Input: AlgorithmA for nite-horizon MDP f M with horizonH 4T max ln(4K). Initialize: interval counterm 1. fork = 1;:::;K do 1 Sets m 1 s init . 2 whiles m 1 6=g do 3 Feed initial states m 1 toA. 4 forh = 1;:::;H do 5 Receive actiona m h fromA. 6 ifs m h 6=g then 7 Play actiona m h , observe costc m h =c(s m h ;a m h ) and next states m h+1 . 8 else Setc m h = 0 ands m h+1 =g. 9 Feedc m h ands m h+1 toA. 10 Sets m+1 1 =s m H+1 andm m + 1. Analysis Cohen et al. (2021) showed that in this reduction, the regretR K of the SSP problem is very close to the regret of A in the nite-horizon MDP f M. Specically, dene e R M 0 = P M 0 m=1 ( P H h=1 c m h + c f (s m H+1 )V ? 1 (s m 1 )) as the regret ofA over the rstM 0 intervals of f M (note the inclusion of the terminal costs), whereV ? 1 is the optimal value function of the rst layer of f M (see Appendix B.1.1 for the formal denition). Denote byM the nal (random) number of intervals created during theK episodes. Then Cohen et al. (2021) showed the following (a proof is included in Appendix E.3 for completeness). Lemma4. Algorithm 14 ensuresR K e R M +B ? . This lemma suggests that it remains to bound the number of intervalsM. The analysis of Cohen et al. 
(2021) does so by marking state-action pairs as “known” or “unknown” based on how many times they have been visited, and showing that in each interval, the learner either reaches an “unknown” state-action pair or with high probability reaches the goal state. This analysis requiresA to be “admissible” (dened through a set of conditions) and also heavily makes use of the tabular setting to keep track of the status of each state-action pair, making it hard to be directly generalized to function approximation settings. Furthermore, it also introducesT max dependency in the upper bound ofM, since the total cost for an interval where an “unknown” state-action pair is visited is trivially bounded byH = (T max ). 30 Instead, we propose the following simple and improved analysis. The idea is to separate intervals into “good” ones within which the learner reaches the goal state, and “bad” ones within which the learner does not. Then, our key observation is that the regret in each bad interval is at leastB ? — this is because the learner’s cost is at least 2B ? in such intervals by the choice of the terminal costc f , and the optimal policy’s expected cost is at mostB ? . Therefore, ifA is a no-regret algorithm, the number of bad intervals has to be small. More formally, based on this idea we can boundM directly in terms of the regret guarantee ofA without requiring any extra properties fromA, as shown in the following lemma. Theorem 6. Suppose thatA enjoys the following regret guarantee: e R m = ~ O ( 0 + 1 p m) with certain probability for some problem-dependent coecients 0 and 1 (that are independent ofm) and any number of intervalsmM. Then, with the same probability, the number of intervals created by Algorithm 14 satises M = ~ O K + 2 1 B 2 ? + 0 B? . Proof. For any niteM y M, we will showM y = ~ O K + 2 1 B 2 ? + 0 B? , which then implies thatM has to be nite and is upper bounded by the same quantity. To do so, we dene the set of good intervals C g =fm2 [M y ] : s m H+1 = gg where the learner reaches the goal state, and also the total costs of the learner in intervalm of f M:C m = P H h=1 c m h +c f (s m H+1 ). By denition and the guarantee ofA, we have e R M y = X m2Cg (C m V ? 1 (s m 1 )) + X m= 2Cg (C m V ? 1 (s m 1 )) ~ O 0 + 1 p M y : (3.1) Next, we derive lower bounds on P m2Cg (C m V ? 1 (s m 1 )) and P m= 2Cg (C m V ? 1 (s m 1 )) respectively. First note that by Lemma 118 andH 4T max ln(4K), we have that ? reaches the goal withinH steps with probability at least 11=2K. Therefore, executing ? in an episode of f M leads to at mostB ? + 2B? 2K 3 2 B ? costs in expectation, which impliesV ? 1 (s) 3 2 B ? for anys. ByjC g jK, we thus have X m2Cg (C m V ? 1 (s m 1 )) 3 2 B ? K: 31 On the other hand, form = 2C g , we haveC m 2B ? due to the terminal costc f (s m H+1 ) = 2B ? , and thus X m= 2Cg (C m V ? 1 (s m 1 )) B ? 2 (M y jC g j) B ? 2 (M y K): Combining the two lower bounds above with Eq. (3.1), we arrive at B? 2 M y ~ O 0 + 1 p M y + 2B ? K. By Lemma 110, this impliesM y = ~ O K + 2 1 B 2 ? + 0 B? , nishing the proof. Now plugging in the bound onM in Theorem 6 into Lemma 11, we immediately obtain the following corollary on a general regret bound for the nite-horizon approximation. Corollary7. Under the same condition of Theorem 6, Algorithm 14 ensures (with the same probability stated in Theorem 6)R K = ~ O 1 p K + 2 1 B? + 0 +B ? . Proof. Combining Lemma 11 and Theorem 6, we have R K e R M +B ? ~ O( 1 p M + 0 +B ? ) ~ O 1 p K + 2 1 B? + 1 q 0 B? + 0 +B ? . Further realizing 1 q 0 B? 
1 2 2 1 B? + 0 by AM-GM inequality proves the statement. Note that the nal regret bound completely depends on the regret guarantee of the nite horizon algorithmA. In particular, in the tabular case, if we apply a variant of EB-SSP (Tarbouriech et al., 2021b) that achieves e R m = ~ O(B ? p SAm +B ? S 2 A) (note the lack of polynomial dependency onH), ∗ then Corollary 7 ensures thatR K = ~ O(B ? p SAK +B ? S 2 A), improving the results of (Cohen et al., 2021) and matching the best existing bounds of (Tarbouriech et al., 2021b; Chen et al., 2021); see (Chen, Jain, and Luo, 2021, Appendix B.5) for more details. This is not achievable by the analysis of (Cohen et al., 2021) due to theT max dependency in the lower order term mentioned earlier. More importantly, our analysis is model agnostic: it only makes use of the regret guarantee of the nite-horizon algorithm, and does not leverage any modeling assumption on the SSP instance. This enables ∗ This variant is equivalent to applying EB-SSP on a homogeneous nite-horizon MDP. 32 us to directly apply our result to settings with function approximation; see (Chen, Jain, and Luo, 2021, Appendix B.6) for an example for SSP with a linear mixture MDP. 3.2.2 ApplyinganEcientFinite-HorizonAlgorithmforLinearMDPs Similarly, if there were a horizon-free algorithm for nite-horizon linear MDPs, we could directly combine it with Algorithm 14 and obtain aT max -independent regret bound. However, to our knowledge, this is still open due to some unique challenge for linear MDPs. Nevertheless, even combining Algorithm 14 with a horizon-dependent linear MDP algorithm already leads to signicant improvement over the state-of-the-art for linear SSP. Specically, the nite-horizon algorithmA we apply is a variant of LSVI-UCB (Jin et al., 2020b), which performs Least-Squares Value Iteration with an optimistic modication. The pseudocode is shown in Algorithm 8. Utilizing the fact that action-value functions are linear in the features for a linear MDP, in each intervalm, we estimate the parametersfw m h g H h=1 of these linear functions by solving a set of least square linear regression problems using all observed data (Line 8), and we encourage exploration by subtracting a bonus term m k(s;a)k 1 m in the denition of b Q m h (s;a) (Line 2). Then, we simply act greedily with respect to the truncated action- value estimatesfQ m h g h (Line 3). Clearly, this is an ecient algorithm with polynomial (ind,H,m andA) time complexity for each intervalm. We refer the reader to (Jin et al., 2020b) for more explanation of the algorithm, and point out three key modications we make compared to their version. First, Jin et al. (2020b) maintain a separate covariance matrix m h for each layerh using data only from layerh, while we only maintain a single covariance matrix m using data across all layers (Line 3). This is possible (and resulting in a better regret bound) since the transition function is the same in each layer of f M. Another modication is to deneV m H+1 (s) asc f (s) simply for the purpose of incorporating the terminal cost. Finally, we project the action-value estimates onto [0;B] for some parameterB similar to Vial et al. (2021) (Line 2). In the main text we simply set 33 Algorithm8 Finite-Horzion Linear-MDP Algorithm Parameters: = 1, m = 50dB p ln(16BmHd=) where is the failure probability andB 1. Initialize: 1 =I. form = 1;:::;M do DeneV m H+1 (s) =c f (s). 
forh =H;:::; 1do 1 Compute w m h = 1 m m1 X m 0 =1 H X h 0 =1 m 0 h 0 (c m 0 h 0 +V m h+1 (s m 0 h 0 +1 )); where m h =(s m h ;a m h ). 2 Dene(g;a) = 0 and b Q m h (s;a) =(s;a) > w m h m k(s;a)k 1 m Q m h (s;a) = [ b Q m h (s;a)] [0;B] V m h (s) = min a Q m h (s;a) forh = 1;:::;H do 3 Playa m h = argmin a Q m h (s m h ;a), suerc m h , and transit tos m h+1 . Compute m+1 = m + P H h=1 m h m h > . B = 3B ? , and the upper bound truncation atB has no eect in this case. However, this projection will become important when learning without the knowledge ofB ? (see Appendix B.1.4). We show the following regret guarantee of Algorithm 8 following the analysis of (Vial et al., 2021) (see Appendix B.1.3). Lemma5. With probability at least 1 4, Algorithm 8 withB = 3B ? ensures e R m = ~ O( p d 3 B 2 ? Hm + d 2 B ? H) for anymM. Applying Corollary 7 we then immediately obtain the following new result for linear SSP. Theorem8. Applying Algorithm 14 withH = 4T max ln(4K) andA being Algorithm 8 withB = 3B ? to the linear SSP problem ensuresR K = ~ O( p d 3 B 2 ? T max K +d 3 B ? T max ) with probability at least 1 4. There is some gap between our result above and the existing lower bound (dB ? p K) for this prob- lem (Min et al., 2021). In particular, the dependency onT max inherited from theH dependency in Lemma 5 34 is most likely unnecessary. Nevertheless, this already strictly improves over the best existing bound ~ O( p d 3 B 3 ? K=c min ) from (Vial et al., 2021) sinceT max B ? =c min . Moreover, our algorithm is computation- ally ecient, while the algorithms of Vial et al. (2021) are either inecient or achieve a much worse regret bound such as ~ O(K 5=6 ) (unless some strong assumptions are made). This improvement comes from the fact that our algorithm uses non-stationary policies (due to the nite-horizon approximation), which avoids the challenging problem of solving the xed point of some empirical Bellman equation. This also demonstrates the power of nite-horizon approximation in solving SSP problems. On the other hand, obtaining the same regret guarantee by learning stationary policies only is an interesting future direction. LearningwithoutknowingB ? orT max Note that the result of Theorem 8 requires the knowledge of B ? andT max . Without knowing these parameters, we can still eciently obtain a regret bound of order ~ O( p d 3 B 3 ? K=c min +d 3 B 2 ? =c min ), matching the bound of (Vial et al., 2021) achieved by their inecient algorithm. See Appendix B.1.4 for details. 3.3 OpenProblems A natural future direction is to close the gap between existing upper bounds and lower bounds for linear SSP, especially with an ecient algorithm. Another interesting direction is to study SSP with adversarially changing costs under linear function approximation. Finally, it is a natural next step to study SSP under general function approximation. 35 Chapter4 AlgorithmsforSSPwithAdversariallyCostsandKnownTransition 4.1 Overview: SSPwithAdversarialCosts In this chapter, we consider Protocol 2 and regret dened in Eq. (1.2), where the cost function in each episode is chosen by an adaptive adversary. Rosenberg and Mansour (2021) considers adversarial costs that are revealed at the end of each episode (the so-called full-information setting). When the transition function is known, their algorithm achieves ~ O( D c min p K) regret whereD is the diameter of the MDP and c min 2 (0; 1] is a global lower bound of the cost for any state-action pair. Whenc min = 0, they provide a dierent algorithm with regret ~ O( p DT ? 
K 3=4 ) whereT ? is the expected time for the optimal policy to reach the goal state. They also further study the case with unknown transition. In this chapter, we signicantly improve the state-of-the-art for the general SSP problem with adversarial costs and known transition, by developing matching upper and lower bounds for both the full-information setting and the bandit feedback setting. More specically, our results are (see also Table 4.1 for a summary): • In the full-information setting, we show that the minimax regret is of order ( p DT ? K) (ignoring logarithmic terms), with no dependence on 1=c min (it can be shown thatT ? D=c min ). We develop two algorithms, one with optimal expected regret (Algorithm 9) and another with optimal high prob- ability regret (Algorithm 10). Note that, as pointed out by Rosenberg and Mansour (2021), achieving 36 Table 4.1: Summary of our minimax optimal results and comparisons with prior work. Here,D;S;A are the diameter, number of states, and number of actions of the MDP,c min is the minimum cost,T ? D=c min is the expected hitting time of the optimal policy, andK is the number of episodes. Logarithmic terms are omitted. All algorithms can be implemented eciently. Minimax Regret (thiswork) (Rosenberg and Mansour, 2021) Full information ( p DT ? K) ~ O D c min p K or ~ O p DT ? K 3 4 Algorithm 9 (expected bound) Algorithm 10 (high probability bound) Theorem 10 (lower bound) Bandit feedback ( p DT ? SAK) N/A Algorithm 11 (expected bound) Algorithm 12 (high probability bound) Theorem 13 (lower bound) high probability bounds for SSP is signicantly more challenging even in the full-information setting, since the learner is often not guaranteed to reach the goal within a xed number of steps with high probability. We complement our algorithms and upper bounds with a matching lower bound in Theorem 10. • Next, we further consider the more challenging bandit feedback setting where the learner only observes the cost for the visited state-action pairs, which has not been studied before in the adversarial cost case to the best of our knowledge. We show that the minimax regret is of order ( p DT ? SAK) (ignoring logarithmic terms). We again developed two algorithms, one with optimal expected regret (Algorithm 11) and another more complex one with optimal high probability regret (Algorithm 12). A matching lower bound is shown in Theorem 13. Notation The fast policy f is the (deterministic) policy that achieves the minimum expected hitting time starting from any state, and the diameter of the MDP is dened asD = max s2S min 21 T (s) = max s2S T f (s). Note that both f andD can be computed ahead of time since we consider the known transition setting. 37 For notational convenience, we writeV ;P;c k asV k . Two quantities related to ? play an important role in our analysis: its expected hitting time starting from the initial stateT ? =T ? (s init ) and its largest expected hitting time starting from any stateT max = max s T ? (s). We assume thatT ? is known to the learner unless stated explicitly. Letc min = min k min (s;a) c k (s;a) be the minimum cost, ands max 2S be such thatT max = T ? (s max ). We haveT max c min V ? k (s max ) andV f k (s max ) D by denition. Together with the fact P K k=1 V ? k (s max ) P K k=1 V f k (s max ), this impliesT ? T max D c min ifc min > 0 (which is one of the reasons whyc min shows up in existing results). 
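Since the transition is known, the fast policy and the diameter can indeed be precomputed by value iteration on expected hitting times. A minimal sketch, assuming the transition among non-goal states is given as a dense array P[s, a, s'] (the leftover probability mass goes to the goal) and that every state can reach the goal, is shown below; the example instance is our own.

```python
import numpy as np

def fast_policy_and_diameter(P, tol=1e-10, max_iter=100_000):
    """Compute the minimum expected hitting times T(s), the fast policy pi_f, and D.

    P has shape (S, A, S): P[s, a, s'] is the probability of moving to non-goal
    state s'; the remaining mass 1 - sum_{s'} P[s, a, s'] goes directly to the goal.
    The hitting times satisfy T(s) = min_a (1 + sum_{s'} P[s, a, s'] T(s')).
    """
    S, A, _ = P.shape
    T = np.zeros(S)
    for _ in range(max_iter):
        T_new = (1.0 + P @ T).min(axis=1)     # one value-iteration step on hitting times
        if np.max(np.abs(T_new - T)) < tol:
            T = T_new
            break
        T = T_new
    pi_fast = (1.0 + P @ T).argmin(axis=1)    # the fast policy pi_f
    return T, pi_fast, T.max()                # D = max_s min_pi T^pi(s)

# Tiny example with two states and two actions; action 1 reaches the goal faster.
P = np.array([[[0.9, 0.0], [0.3, 0.0]],
              [[0.0, 0.5], [0.0, 0.1]]])
T, pi_fast, D = fast_policy_and_diameter(P)
print(T, pi_fast, D)   # T ~ [1.43, 1.11], pi_fast = [1, 1], D ~ 1.43
```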
We letN k (s;a) denote the (random) number of visits of the learner to (s;a) during episodek, so that the regret can be re-written asR K = P K k=1 hN k q ?;c k i. We use the notationhf;gi as a shorthand for P s2S f(s)g(s), P (s;a) f(s;a)g(s;a), or P H h=1 P (s;a) f(s;a;h)g(s;a;h) whenf andg are functions in R S ,R , orR [H] (for someH) respectively. LetF k denote the-algebra of events up to the beginning of episodek, andE k be a shorthand ofE[jF k ]. For a convex function , the Bregman divergence between u andv is dened as:D (u;v) = (u) (v)hr (v);uvi. Occupancymeasure Our algorithm design heavily makes use of the concept of occupancy measure. For a xed MDP, a proper policy induces an occupancy measureq 2R 0 such thatq (s;a) is the expected number of visits to (s;a) when executing, that is: q (s;a) =E " I X i=1 Ifs i =s;a i =ag P;;s 1 =s 0 # : Similarly,q (s) = P a2As q (s;a) is the expected number of visits tos when executing. Clearly, we have V k (s init ) = P (s;a)2 q (s;a)c k (s;a) =hq ;c k i, and if the learner executes a stationary proper policy k in episodek, then the expected regret can be written as E[R K ] =E " K X k=1 V k k (s init )V ? k (s init ) # =E " K X k=1 hq k q ?;c k i # ; (4.1) 38 converting the problem into a form of online linear optimization and making Online Mirror Descent a natural solution to the problem. Note that, given a function q : ! [0;1), if it corresponds to an occupancy measure, then the corresponding policy q can clearly be obtained by q (ajs)/q(s;a). Also note thatT (s init ) = P (s;a) q (s;a) = P s2S q (s). 4.2 MinimaxRegretfortheFull-informationSetting In this section, we consider the simpler full-information setting where the learner observesc k in the end of episodek. Somewhat surprisingly, even in this case, ensuring optimal regret is rather challenging. We rst propose an algorithm with expected regret ~ O( p DT ? K) and a matching lower bound in Section 4.2.1. Next, in Section 4.2.2, by converting the problem into another loop-free SSP instance and using a skewed occupancy measure space, we develop an algorithm that achieves the same regret bound with high probability. 4.2.1 Optimalexpectedregret To introduce our algorithm, we rst briey review the SSP-O-REPS algorithm of Rosenberg and Mansour (2021), which only achieves regret ~ O( D c min p K). The idea is to run the standard Online Mirror Descent (OMD) algorithm over an appropriate occupancy measure space. Specically, they dene the occupancy measure space parameterized by sizeT > 0 as: (T ) = ( q2R 0 : X (s;a)2 q(s;a)T; X a2As q(s;a) X (s 0 ;a 0 )2 P (sjs 0 ;a 0 )q(s 0 ;a 0 ) =Ifs =s 0 g;8s2S ) : (4.2) It is shown that everyq2 (T ) is a valid occupancy measure induced by the policy q (recall q (ajs)/ q(s;a)). Therefore, as long asT is large enough such thatq ?2 (T ), based on Eq. (4.1), the problem is es- sentially translated to an instance of online linear optimization and can be solved by maintaining a sequence 39 Algorithm9 SSP-O-REPS Input: upper bound on expected hitting timeT . Dene: regularizer (q) = 1 P (s;a) q(s;a) lnq(s;a) and = min 1 2 ; q T ln(SAT ) DK . Initialization:q 1 = argmin q2(T ) (q) where (T ) is dened in Eq. (4.2). fork = 1;:::;K do Execute q k , receivec k , and updateq k+1 = argmin q2(T ) hq;c k i +D (q;q k ). of occupancy measuresq 1 ;:::;q K updated according to OMD:q k+1 = argmin q2(T ) hq;c k i +D (q;q k ); where is a regularizer with the default choice being the negative entropy (q) = 1 P (s;a) q(s;a) lnq(s;a) for some learning rate> 0. 
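To make the occupancy-measure viewpoint concrete, the sketch below (our own illustration, not part of SSP-O-REPS) builds the constraints of \Delta(T) from Eq. (4.2) and computes the best fixed occupancy measure in hindsight for a single cost vector via a linear program, then extracts the induced policy \pi_q(a|s) \propto q(s,a); scipy is assumed to be available.

```python
import numpy as np
from scipy.optimize import linprog

def best_occupancy_in_hindsight(P, c, s_init, T):
    """Minimize <q, c> over the occupancy-measure polytope Delta(T) of Eq. (4.2).

    P[s, a, s'] is the transition among non-goal states (remaining mass -> goal),
    c[s, a] is a cost vector, and T bounds the total expected hitting time.
    Variables are q(s, a) >= 0, flattened to a vector of length S * A.
    """
    S, A, _ = P.shape
    n = S * A
    # Flow constraints: sum_a q(s,a) - sum_{s',a'} P(s|s',a') q(s',a') = 1{s = s_init}.
    A_eq = np.zeros((S, n))
    for s in range(S):
        for a in range(A):
            A_eq[s, s * A + a] += 1.0
        for sp in range(S):
            for ap in range(A):
                A_eq[s, sp * A + ap] -= P[sp, ap, s]
    b_eq = np.zeros(S)
    b_eq[s_init] = 1.0
    # Size constraint: sum_{s,a} q(s,a) <= T.
    A_ub = np.ones((1, n))
    b_ub = np.array([T])
    res = linprog(c.reshape(-1), A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, None)] * n, method="highs")
    q = res.x.reshape(S, A)
    policy = q / np.maximum(q.sum(axis=1, keepdims=True), 1e-12)  # pi_q(a|s) ~ q(s,a)
    return q, policy
```

SSP-O-REPS operates over the same polytope, but replaces the one-shot LP above with the entropy-regularized OMD update just described, run once per episode.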
See Algorithm 9 for the pseudocode and (Rosenberg and Mansour, 2021) for the details of implementing it eciently. Rosenberg and Mansour (2021) show that as long as q ? 2 (T ), Algorithm 9 ensuresE[R K ] = ~ O(T p K). To ensureq ?2 (T ), they setT = D c min because P (s;a)2 q ?(s;a) =T ? D c min as discussed in Section 4.1. This leads to their nal regret bound ~ O( D c min p K). We show a more careful analysis for the same Algorithm 9 and use the fact that the total expected cost of ? is bounded byDK instead ofTK to obtain the following stronger guarantee. Theorem9. IfT is such thatq ?2 (T ), then Algorithm 9 guarantees:E[R K ] = ~ O( p DTK): By using the sameT = D c min , this already leads to a better bound ~ O(D p K=c min ). IfT ? was known, settingT =T ? would also immediately give the claimed bound ~ O( p DT ? K) (sinceq ?2 (T ? )), which is optimal as we show later. Lowerbound Our regret bound stated in Theorem 9 not only improves that of (Rosenberg and Mansour, 2021), but is also optimal up to logarithmic terms as shown in the following lower bound. Theorem10. For anyD;T ? ;K withKT ? D + 1, there exists an SSP instance such that its diameter is D + 2, theoptimalpolicyhashittingtimeT ? + 1, andtheexpectedregretofanylearnerafterK episodesisat least p DT ? K under the full-information and known transition setting. 40 Similarly to most lower bound proofs, our proof also constructs an environment with stochastic costs and with a slightly better state hidden among other equally good states, and argues that the expected regret of any learner with respect to the randomness of the environment has to be p DT ? K . At rst glance, this appears to be a contradiction to existing results for SSP with stochastic costs (Tarbouriech et al., 2020; Cohen et al., 2020), where the optimal regret is independent ofT ? . However, the catch is that “stochastic costs” has a dierent meaning in these works. Specically, it refers to a setting where the cost for each state-action pair is drawn independently from a xed distributioneverytime it is visited, and is revealed to the learner immediately. On the other hand, “stochastic costs” in our lower bound proof refers to a setting where at the beginning of each episodek,c k is sampled once from a xed distribution and then xed throughout the episode. Moreover, it is revealed only after the episode ends. It can be shown that our setting is harder due to the larger variance of costs, explaining our larger lower bound and the seemingly contradiction. 4.2.2 Optimalhigh-probabilityregret To obtain a high-probability regret bound, one needs to control the deviation between the actual total cost of the learner P K k=1 hN k ;c k i and its expectation P K k=1 hq k ;c k i. While for most online learning problems with full information, similar deviation can be easily controlled by the Azuma’s inequality, this is not true for SSP as pointed out in (Rosenberg and Mansour, 2021), due to the lack of an almost sure upper bound on the random variablehN k ;c k i. Rosenberg and Mansour (2021, Lemma E.1) point out that with high probability P (s;a) N ?(s;a) is bounded byT max , and thus it is natural to enforce the same forN k . However, this at best leads to a bound of order ~ O( p DT max K). To achieve the optimal regret, we start with a closer look at the variance of the actual cost of any policy, showing that it is in fact related to the corresponding value function. Lemma6. Consider executing a stationary policy in episodek. ThenE[hN k ;c k i 2 ] 2hq ;V k i. 
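As a sanity check of Lemma 6, both sides can be evaluated exactly on a small instance by solving linear systems: V^pi from the Bellman equation, q^pi from the flow equation, and the second moment of the cost-to-go from a similar one-step recursion. The following verification sketch, on an instance of our own choosing, is not part of the analysis.

```python
import numpy as np

def check_lemma6(P_pi, c_pi, s_init):
    """Exactly evaluate both sides of Lemma 6 for a fixed proper policy pi.

    P_pi[s, s'] is the transition among non-goal states under pi (remaining mass
    goes to the goal) and c_pi[s] in [0, 1] is the per-step cost under pi.
    """
    S = P_pi.shape[0]
    I = np.eye(S)
    V = np.linalg.solve(I - P_pi, c_pi)                    # V(s) = c(s) + (P_pi V)(s)
    q = np.linalg.solve((I - P_pi).T, np.eye(S)[s_init])   # expected visits starting from s_init
    # Second moment of the cost-to-go: M(s) = c(s)^2 + 2 c(s) (P_pi V)(s) + (P_pi M)(s).
    M = np.linalg.solve(I - P_pi, c_pi ** 2 + 2.0 * c_pi * (P_pi @ V))
    lhs = M[s_init]                                        # E[<N_k, c_k>^2]
    bound = 2.0 * q @ V                                    # 2 <q^pi, V_k^pi>
    return lhs, bound

# Small 3-state chain: from each state, stay with prob. 0.5, else move toward the goal.
P_pi = np.array([[0.5, 0.5, 0.0],
                 [0.0, 0.5, 0.5],
                 [0.0, 0.0, 0.5]])
c_pi = np.array([1.0, 0.2, 0.6])
print(check_lemma6(P_pi, c_pi, s_init=0))   # the first value should not exceed the second
```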
41 For the optimal policy ? , although q ?;V ? k can still be as large asT max , one key observation is that the sum of these quantities overK episodes is at mostDT ? K since K X k=1 D q ?;V ? k E = X s2S q ?(s) K X k=1 V ? k (s)DK X s2S q ?(s) =DT ? K; where the inequality is again due to the optimality of ? and the existence of the fast policy f : P K k=1 V ? k (s) P K k=1 V f k (s)DK. Given this observation, it is tempting to enforce that the learner’s policies 1 ;:::; K are also such that P K k=1 q k ;V k k DT ? K, which would be enough to control the deviation between P K k=1 hN k ;c k i and P K k=1 hq k ;c k i by ~ O( p DT ? K) as desired by Freedman’s inequality. However, it is unclear how to enforce this constraint since it depends on all the cost functions unknown ahead of time. In fact, even if the cost functions were known, the constraint is also non-convex due to the complicated dependence ofV k onq . To address these issues, we propose two novel ideas. Firstidea: aloop-freereduction Our rst idea is to reduce the problem to a loop-free MDP so that the varianceE[hN k ;c k i 2 ] takes a much simpler form that is linear in both the occupancy measure and the cost function. Moreover, the reduction only introduces a small bias in the regret between the original problem and its loop-free version. The construction of the loop-free MDP is basically to duplicate each state by attaching a time steph forH 1 steps, and then connect all states to some virtual fast state that lasts for anotherH 2 steps. Formally, we dene the following. Denition2. For an SSP instanceM = (S;s init ;g;A;P ) with cost functionsc 1:K , we dene, for horizon parametersH 1 ;H 2 2N, another loop-free SSP instance f M = ( e S;e s init ;g; e A; e P ) with cost functione c 1:K as follows: • e S = (S[fs f g) [H] wheres f is an articially added “fast” state andH =H 1 +H 2 ; • e s init = (s init ; 1) and the goal stateg remains the same; 42 • e A =A[fa f g, wherea f is an articially added action that is only available at (s f ;h) forh2 [H] (the available action set at (s;h) isA s for alls6=s f andh2 [H]); • transition from (s;h) to (s 0 ;h 0 ) is only possible whenh 0 =h + 1: for the rstH 1 layers, the transition follows the original MDP in the sensethat e P ((s 0 ;h + 1)j(s;h);a) =P (s 0 js;a) and e P (gj(s;h);a) = P (gjs;a) for allh<H 1 and (s;a)2 ; from layerH 1 to layerH, all states transit to the fast state: e P ((s f ;h + 1)j(s;h);a) = 1 for allH 1 h<H and (s;a)2 e , [f(s f ;a f )g; nally, the last layer transits to the goal state always: e P (gj(s;H);a) = 1 for all (s;a)2 e ; • cost function is such thate c k ((s;h);a) = c k (s;a) ande c k ((s f ;h);a f ) = 1 for all (s;a)2 and h2 [H]; for notational convenience, we also writee c k ((s;h);a) asc k (s;a;h). Note that in this denition, there are some redundant states such as (s;h) fors2S andh > H 1 or (s f ;h) forh H 1 since they will never be visited. However, having these redundant states greatly simplies our presentation. For notation related to the loop-free version, we often use a tilde symbol to distinguish them from the original counterparts (such as f M and e S), and for a function e f((s;h);a) that takes a state in f M and an action as inputs, we often simplify it asf(s;a;h) (such asc k andq k ). For such a function, we will also use the notation ~ hf2R e [H] such that ( ~ hf)(s;a;h) =hf(s;a;h). Similarly, for a functionf2R e , we use the same notation ~ hf2R e [H] such that ( ~ hf)(s;a;h) =hf(s;a). 
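Definition 2 can be implemented mechanically. The sketch below builds only the layered transition of the loop-free MDP from the original P (the cost extension of Definition 2 is immediate and omitted); the integer encoding of states and the handling of unavailable actions are our own conventions.

```python
import numpy as np

def build_loop_free(P, H1, H2):
    """Build the transition of the loop-free MDP of Definition 2.

    P[s, a, s'] is the original transition over S non-goal states (leftover mass
    goes to the goal). Loop-free states are (x, h) for h = 1..H with x in {0..S-1}
    (original states) or x = S (the fast state s_f), plus one absorbing goal index.
    Action index A plays the role of the extra action a_f; rows of actions that
    are unavailable at a state are simply left at zero.
    """
    S, A, _ = P.shape
    H = H1 + H2
    G = (S + 1) * H                                  # absorbing goal index
    idx = lambda x, h: (h - 1) * (S + 1) + x
    P_tilde = np.zeros((G + 1, A + 1, G + 1))
    for h in range(1, H + 1):
        for x in range(S + 1):
            for a in range(A + 1):
                if h == H:                           # last layer: always to the goal
                    P_tilde[idx(x, h), a, G] = 1.0
                elif x == S or h >= H1:              # fast state, or layers H1..H-1
                    P_tilde[idx(x, h), a, idx(S, h + 1)] = 1.0
                elif a < A:                          # first H1 layers: follow P
                    for xp in range(S):
                        P_tilde[idx(x, h), a, idx(xp, h + 1)] = P[x, a, xp]
                    P_tilde[idx(x, h), a, G] = 1.0 - P[x, a].sum()
    P_tilde[G, :, G] = 1.0
    return P_tilde
```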
As mentioned, one key reason of considering such a loop-free MDP is that the variance of the learner’s actual cost takes a much simpler form that is linear in both the occupancy measure and the cost function, as shown in the lemma below (which is an analogue of Lemma 6). Lemma7. Considerexecutingastationarypolicye in f Minepisodek andlet e N k (s;a;h)2f0; 1gdenote the number of visits to state-action pair ((s;h);a). ThenE[h e N k ;c k i 2 ] 2 D q e ; ~ hc k E . Next, we complete the reduction by describing how one can solve the original problem via solving its loop-free version. Given a policye for f M, we dene anon-stationary policy(e ) forM as follows: for 43 each stephH 1 , followe (j(s;h)) when at states; after the rstH 1 steps (if not reachingg yet), execute the fast policy f until reaching the goal stateg. When executing(e ) inM for episodek, we overload the notation e N k dened in Lemma 7 and let e N k (s;a;h) be 1 if (s;a) is visited at time stephH 1 , or 0 otherwise; and e N k (s f ;a f ;h) be 1 ifH 1 <hH and the goal stateg is not reached withinH 1 steps, or 0 otherwise. Clearly, e N k (s;a;h) indeed follows the same distribution as the number of visits to state-action pair ((s;h);a) when executinge in f M. We also dene a deterministic policye ? for f M that mimics the behavior of ? in the sense thate ? (s;h) = ? (s) fors2S andhH 1 (for largerh,s has to bes f and the only available action isa f ). The next lemma shows that, as long as the horizon parametersH 1 andH 2 are set appropriately, this reduction makes sure that the regret between these two problems are similar. Lemma8. SupposeH 1 8T max lnK;H 2 =d4D ln 4K eandKDforsome2 (0; 1). Lete 1 ;:::;e K be policiesfor f Mwithoccupancymeasuresq 1 ;:::;q K 2 [0; 1] e [H] . Thentheregretofexecuting(e 1 );:::;(e K ) inM satises: 1) for any2 (0; 2=H], with probability at least 1, R K K X k=1 D e N k q e ?;c k E + ~ O (1) K X k=1 hq k q e ?;c k i | {z } Reg + K X k=1 D q k ; ~ hc k E | {z } Var + 2 ln ( 2 =) + ~ O (1); and 2)E[R K ]E[Reg] + ~ O (1). Note that theReg term is the expected regret (toe ? ) in f M and can again be controlled by OMD. TheVar term comes from the derivation between the actual cost of the learner in f M and its expectation, according to Freedman’s inequality and Lemma 7. At this point, one might wonder whether directly applying an existing algorithm such as (Zimin and Neu, 2013) for loop-free MDPs solves the problem, since Lemma 8 shows that the regret in these two problems are close. Doing so, however, leads to a suboptimal bound of order ~ O(H p K) = ~ O((T max +D) p K). This is basically the same as trivially boundingVar byH 2 K. It is thus critical to better control this term using properties of the original problem, which requires the second idea described below. 44 Algorithm10 SSP-O-REPS with Loop-free Reduction and Skewed Occupancy Measure Input: Upper bound on expected hitting timeT , horizon parameterH 1 , condence level Parameters: = min n 1 2 ; q T DK o ; = q ln( 1 =) DTK ;H 2 =d4D ln 4K e Dene:H =H 1 +H 2 , regularizer () = 1 P H h=1 P (s;a)2 e (s;a;h) ln(s;a;h) Dene: decision set =f =q + ~ hq :q2 e (T )g (with e (T ) dened in Eq. (C.5)) Initialization: 1 =q 1 + ~ hq 1 = argmin 2 (). fork = 1;:::;K do Execute(e k ) wheree k is such thate k (aj(s;h))/q k (s;a;h), and receivec k . Update k+1 =q k+1 + ~ hq k+1 = argmin 2 h;c k i +D (; k ). Secondidea: skewedoccupancymeasurespace Similarly to earlier discussions, it can be shown that P K k=1 D q e ?; ~ hc k E =O (DT ? 
K) (Lemma 42), making it hopeful to bound Var by the same. However, even though the variance now takes a simpler form, it is still unclear how to directly enforce the algorithm to satisfy Var =O (DT ? K). Instead, we take a dierent route and make sure that the Reg term is at most ~ O p DT ? K +DT ? K Var, thus canceling the variance term. To do so, thanks to the simple form of Var, it suces to inject a small positive bias into the action space of OMD, making it a skewed occupancy measure space: =f =q + ~ hq :q2 e (T ? )g where e (T ? ) is the counterpart of (T ? ) for f M (see Eq. (C.5) in Appendix C.1 for the spelled out denition). Indeed, by similar arguments from Section 4.2.1, operating OMD over this space ensures a bound of orderO p DT ? K on the “skewed regret”: P K k=1 D (q k + ~ hq k ) (q e ? + ~ hq e ?);c k E = Reg +Var P K k=1 D q e ?; ~ hc k E ; and we already know that the last term is of orderO (DT ? K). Rearranging thus proves the desired bound onReg, and nally picking the optimal to trade o the term 2 ln( 2 =) leads to the optimal bound. We summarize the nal algorithm in Algorithm 10 and its regret guarantee below. (Note that the algorithm can be implemented eciently since is a convex polytope withO(SAH) constraints.) Theorem11. IfTT ? + 1,H 1 8T max lnK, andKH 2 ln 1 , then with probability at least 1, Algorithm 10 ensuresR K = ~ O( p DTK). 45 To obtain the optimal bound, we need to setT =cT ? + 1 for any constantc 1. As for the parameter H 1 , we can always set it to something large such asK 1=3 so that the conditions of the theorem hold for large enoughK (though leading to a larger time complexity of the algorithm). We also remark that instead of injecting bias to the occupancy measure space, one can obtain the same by injecting a similar positive bias to the cost function. However, we use the former approach because it turns out to be critical for the bandit feedback setting that we consider in the next section. 4.3 MinimaxRegretfortheBanditFeedbackSetting We now consider the more challenging case with bandit feedback, that is, at the end of each episode, the learner only receives the cost of the visited state-action pairs. A standard technique in the adversarial bandit literature is to construct an importance-weighted cost estimatorb c k forc k and then feed it to OMD, which is even applicable to learning loop-free SSP (Zimin and Neu, 2013; Jin et al., 2020a; Lee et al., 2020). For general SSP, the natural importance-weighted estimatorb c k is:b c k (s;a) = N k (s;a)c k (s;a) q k (s;a) whereN k (s;a) is the number of visits to (s;a) andq k is the occupancy measure of the policy executed in episodek. This is clearly unbiased sinceE k [N k (s;a)] =q k (s;a). However, it is well-known that unbiasedness alone is not enough — the variance of the estimator also plays a key role in the OMD analysis even if one only cares about expected regret. For example, if we still use the entropy regularizer as in Section 4.2, the so-called stability term of OMD is in terms of the weighted variance P (s;a) q k (s;a)E k [b c 2 k (s;a)] = P (s;a) E k [N 2 k (s;a)]c 2 k (s;a) q k (s;a) . While this term is nicely bounded in the loop-free case (sinceN k (s;a) is binary and thusE k [N 2 k (s;a)] = q k (s;a) cancels out the denominator), unfortunately it can be prohibitively large in the general case. In light of this, it might be tempting to use our loop-free reduction again and then directly apply an existing algorithm such as (Zimin and Neu, 2013). 
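For concreteness, the importance-weighted estimator itself takes only a few lines to form; a sketch is given below (the tiny numerical floor on q_k is our own addition to avoid division by zero and plays no role in the analysis).

```python
import numpy as np

def importance_weighted_costs(N_k, observed_cost, q_k):
    """Build the estimator hat-c_k(s, a) = N_k(s, a) * c_k(s, a) / q_k(s, a).

    N_k[s, a] is the number of visits to (s, a) in episode k, observed_cost[s, a]
    holds the revealed costs (its value on unvisited pairs is irrelevant since it
    is multiplied by N_k = 0 there), and q_k[s, a] is the occupancy measure of the
    executed policy. Since E[N_k(s, a)] = q_k(s, a), the estimator is unbiased:
    E[hat-c_k(s, a)] = c_k(s, a).
    """
    return N_k * observed_cost / np.maximum(q_k, 1e-12)   # floor is numerical only
```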
However, this again leads to a suboptimal bound with dependence onH = ~ O (T max ). It turns out that 46 this is signicantly more challenging than other bandit problems and requires a combination of various techniques, as described below. Log-barrierregularizer Although the entropy regularizer is a classic choice for OMD to deal with bandit problems, in recent years, a line of research discovers various advantages of using a dierent regularizer called log-barrier (see e.g. (Foster et al., 2016; Agarwal et al., 2017; Wei and Luo, 2018; Luo, Wei, and Zheng, 2018; Bubeck et al., 2019; Kotłowski and Neu, 2019; Lee, Luo, and Zhang, 2020)). In our context, the log-barrier regularizer is 1 P (s;a) lnq(s;a), and it indeed leads to a smaller stability term in terms of P (s;a) q 2 k (s;a)E k [b c 2 k (s;a)] = P (s;a) E k [N 2 k (s;a)]c 2 k (s;a) (note the extraq k (s;a) factor compared to the case of entropy). This term is further bounded byE k [hN k ;c k i 2 ], which is exactly the variance of the learner’s actual cost considered in Section 4.2.2! Loop-freereductionandskewedoccupancymeasure Based on the observation above, it is natural to apply the same ideas of loop-free reduction and skewed occupancy measure from Section 4.2.2 to deal with the stability termE k [hN k ;c k i 2 ]. However, some extra care is needed when using log-barrier in the loop-free instance f M. Indeed, directly using () = 1 P h P (s;a) ln(s;a;h) would lead to another term of order ~ O(HSA=) in the OMD analysis and ruin the bound. Instead, taking advantage of the fact thatc k (s;a;h) is the same for a xed (s;a) pair regardless of the value ofh, ∗ we propose to perform OMD with(s;a) = P h (s;a;h) for all (s;a)2 e as the variables, even though the skewed occupancy measure is still dened in terms of(s;a;h) as in Algorithm 10. More specically, this means that our regularizer is () = 1 P (s;a) ln(s;a), and the cost estimator isb c k (s;a) = e N k (s;a)c k (s;a) q k (s;a) where e N k (s;a) = P h e N k (s;a;h) andq k (s;a) = P h q k (s;a;h). This completely avoids the factorH in the analysis (other than lower order terms). ∗ This also explains why injecting the bias to the occupancy space instead of the cost vectors is important here, as mentioned in the end of Section 4.2, since the latter makes the cost dierent for dierenth. 47 Algorithm11 Log-barrier Policy Search for SSP Input: Upper bound on expected hitting timeT and horizon parameterH 1 . Parameters: = q SA DTK ; = 8;H 2 =d4D ln 4K e,H =H 1 +H 2 Dene: regularizer () = 1 P (s;a)2 e ln(s;a) where(s;a) = P H h=1 (s;a;h) Dene: decision set =f =q + ~ hq :q2 e (T )g (with e (T ) dened in Eq. (C.5)) Initialization: 1 =q 1 + ~ hq 1 = argmin 2 (). fork = 1;:::;K do Execute(e k ) wheree k is such thate k (aj(s;h))/q k (s;a;h). Construct cost estimatorb c k 2R e 0 such thatb c k (s;a) = e N k (s;a)c k (s;a) q k (s;a) where e N k (s;a) = P h e N k (s;a;h) andq k (s;a) = P h q k (s;a;h) ( e N k is dened after Lemma 7). Update k+1 =q k+1 + ~ hq k+1 = argmin 2 P (s;a) (s;a)b c k (s;a) +D (; k ). With the ideas above, we can already show an optimal expected regret bound for anoblivious adversary who selectsc k independent of the learner’s randomness. We summarize the algorithm in Algorithm 11 and its guarantee in the following theorem. Theorem12. IfTT ? + 1,H 1 8T max lnK,andK 64SAH 2 ,thenAlgorithm11ensuresE [R K ] = ~ O p DTSAK for an oblivious adversary. SettingT =T ? + 1 leads to ~ O( p DT ? 
SAK), which is optimal in light of the following lower bound theorem (the adversary is indeed oblivious in the lower bound construction). Theorem13. ForanyD;T ? ;K;S 4withKST ? andT ? D + 1,thereexistsanSSPprobleminstance withS states andA =O(1) actions such that its diameter isD + 2, the optimal policy has expecting hitting timeT ? + 1, and the expected regret of any learner afterK episodes is at least p DT ? SAK under the bandit feedback and known transition setting. To further obtain a high probability regret bound for general adaptive adversaries (thus also a more general expected regret bound), it is important to analyze the derivation between the optimal policy’s estimated total loss P k hq e ?;b c k i and its expectation P k hq e ?;c k i. Using Freedman’s inequality, we need to carefully control the conditional varianceE k [b c 2 k (s;a)] = E k [ e N 2 k (s;a)]c 2 k (s;a) q 2 k (s;a) for each (s;a), which is much more dicult than the aforementioned stability term due to the lack of the extraq 2 k (s;a) factor. To address 48 this, we rst utilize the simpler form ofE k [ e N 2 k (s;a)] in the loop-free setting and bound it by P h hq k (s;a;h) (see Lemma 46). Then, with K (s;a) = max k 1 q k (s;a) andb k (s;a) = P h hq k (s;a;h)c k (s;a) q k (s;a) , we bound the key term in the derivation P k hq e ?;b c k c k i by X (s;a) q e ?(s;a) v u u t K (s;a) K X k=1 b k (s;a) 1 hq e ?; K i + K X k=1 hq e ?;b k i; whereq e ?(s;a) = P h q e ?(s;a;h) and the last step is by AM-GM inequality (see Lemma 44 for details). The last two terms above are then handled by the following two ideas respectively. Increasinglearningrate The rst term 1 hq e ?; K i appears in the work of (Lee et al., 2020) already for loop-free MDPs and can be canceled by a negative term introduced by an increasing learning rate schedule. (See the last for loop of Algorithm 12 and Lemma 43.) Injectingnegativebiastothecosts To handle the second term P K k=1 hq e ?;b k i, note again that its counterpart P K k=1 hq k ;b k i is exactly P K k=1 D q k ; ~ hc k E , a term that can be canceled by the skewed occupancy measure as discussed. Therefore, if we could inject another negative bias term into the cost vectors, that is, replacingb c k withb c k b k , then this bias would cancel the term P K k=1 hq e ?;b k i while introducing the term P K k=1 hq k ;b k i that could be further canceled by the skewed occupancy measure. However, the issue is thatb k depends on the unknown true costc k . We address this by using b b k instead which replacesc k withb c k , that is, b b k (s;a) = P h hq k (s;a;h)b c k (s;a) q k (s;a) . This leads to yet another derivation term between b b k andb k that needs to be controlled in the analysis. Fortunately, this term is of lower order compared to others since it is multiplied by (see Lemma 45). Note that at this point we have used both the positive bias from the skewed occupancy measure space and the negative bias from the cost estimators, which we nd intriguing. 49 Algorithm12 Log-barrier Policy Search for SSP (High Probability) Input: Upper bound on expected hitting timeT , horizon parameterH 1 , and condence level Parameters: H 2 =d4D ln 4K e, H = H 1 +H 2 , C =dlog 2 (TK 4 )edlog 2 (T 2 K 9 )e; = e 1 7 lnK ; = q SA ln 1 = DT?K ; = 100 lnK 1 +C q 8 ln CSA 2 ; = 40 + 2 . Dene: regularizer k () = P (s;a)2 e 1 k (s;a) ln 1 (s;a) where(s;a) = P H h=1 (s;a;h) Dene: decision set =f =q + ~ hq :q2 e (T ); q(s;a) 1 TK 4 ;8(s;a)2 e g Initialization: 1 =q 1 + ~ hq 1 = argmin 2 1 (). 
Initialization: for all (s;a)2 e ; 1 (s;a) =; 1 (s;a) = 2T . fork = 1;:::;K do Execute(e k ) wheree k is such thate k (aj(s;h))/q k (s;a;h). Construct cost estimatorb c k 2R e 0 such thatb c k (s;a) = e N k (s;a)c k (s;a) q k (s;a) where e N k (s;a) = P h e N k (s;a;h) andq k (s;a) = P h q k (s;a;h) ( e N k is dened after Lemma 7). Construct bias term b b k 2R e 0 such that b b k (s;a) = P h hq k (s;a;h)b c k (s;a) q k (s;a) . Update k+1 =q k+1 + ~ hq k+1 = argmin 2 X (s;a) (s;a) b c k (s;a) b b k (s;a) +D k (; k ): for8(s;a)2 e do if 1 k+1 (s;a) > k (s;a)then k+1 (s;a) = 2 k+1 (s;a) ; k+1 (s;a) = k (s;a). else k+1 (s;a) = k (s;a); k+1 (s;a) = k (s;a). Combining everything, our nal algorithm is summarized in Algorithm 12 (see Appendix C.2 due to space limit). The following theorem shows that, with the knowledge ofT ? or a suitable upper bound, our algorithm again achieves the optimal regret bound with high probability. Theorem 14. If T T ? + 1, H 1 8T max lnK, and K is large enough (K & SAH 2 ln 1 ), then Algorithm 12 ensuresR K = ~ O p DTSAK with probability at least 1 6. 50 4.4 OpenProblems In this chapter, we develop matching upper and lower bounds for the stochastic shortest path problem with adversarial costs and known transition, signicantly improving previous results. Our algorithms are built on top of a variety of techniques that might be of independent interest. There are two key future directions. The rst one is to develop parameter-free and optimal algorithms without the knowledge ofT ? . We only achieve this in the full-information setting for expected regret bounds. Indeed, generalizing our techniques that learnT ? automatically to obtain a high-probability bound in the full-information setting boils down to getting the same multi-scale expert result with high probability, which is still open unfortunately (see also discussions in (Chen, Luo, and Wei, 2021a, Section 5)). The diculty lies in bounding the deviation between the learner’s expected loss and the actual loss in terms of the loss of the unknown comparator. On the other hand, it is also dicult to generalize our technique to obtain an expected bound in the bandit setting (without knowingT ? ), since this becomes a bandit-of-bandits type of framework and is known to suer some tuning issues; see for example (Foster, Krishnamurthy, and Luo, 2019, Appendix A.2). The second future direction is to gure out the minimax regret of the more challenging setting where the transition is unknown. We note that our loop-free reduction is readily to be applied to this case. Our follow-up work (Chen and Luo, 2021) makes some progress in this direction, but the minimax regret remains unknown in this case. 51 Chapter5 PolicyOptimizationforSSP Policy Optimization (PO) is among the most popular methods in reinforcement learning due to its strong empirical performance and favorable theoretical properties. Unlike value-based approaches such as Q learning, PO-type methods directly optimize the policy in an incremental manner. Many widely used practical algorithms fall into this category, such as REINFORCE (Williams, 1992), NPG (Kakade, 2001), and TRPO (Schulman et al., 2015). They are also easy to implement and computationally ecient compared to other methods such as those operating over the occupancy measure space (e.g., (Zimin and Neu, 2013)). 
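As a reminder of what "incremental" means here, PO-type methods typically update the policy state-by-state with a multiplicative-weights / mirror-descent step applied to current action-value estimates. The generic sketch below is only meant to convey this flavor; it is not the specific update used in our algorithms, which is developed in the following sections.

```python
import numpy as np

def po_step(policy, Q_hat, eta):
    """One generic policy-optimization step:
    pi(a|s) <- pi(a|s) * exp(-eta * Q_hat(s, a)), renormalized per state.
    This is the exponentiated-gradient style update that underlies (in various
    forms) methods such as those of Shani et al. (2020); costs are minimized,
    hence the negative sign.
    """
    new_policy = policy * np.exp(-eta * Q_hat)
    return new_policy / new_policy.sum(axis=1, keepdims=True)

# Example: 3 states, 2 actions, uniform initial policy.
policy = np.full((3, 2), 0.5)
Q_hat = np.array([[1.0, 0.3], [0.2, 0.9], [0.5, 0.5]])
policy = po_step(policy, Q_hat, eta=0.1)
```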
From a theoretical perspective, PO is a general framework that works for different types of environments, including stochastic costs or even adversarial costs (Shani et al., 2020), function approximation (Cai et al., 2020), and non-stationary environments (Fei et al., 2020). Despite its popularity in applications, most theoretical works on PO focus on simple models such as finite-horizon models (Cai et al., 2020; Shani et al., 2020; Luo, Wei, and Lee, 2021) and discounted models (Liu et al., 2019; Wang et al., 2020; Agarwal et al., 2021), which are often oversimplifications of real-life applications. In particular, PO methods have not been applied to regret minimization in SSP as far as we know.

Motivated by this gap, in this chapter, we systematically study policy optimization in SSP. We consider both stochastic and adversarial environments and for each of them discuss how to design a policy optimization algorithm with a strong regret bound. Specifically, our main results are as follows:

Table 5.1: Comparison of regret bound, time complexity, and space complexity of different SSP algorithms. We consider two feedback types: SC (stochastic costs) and AF (adversarial, full information). The $\tilde{O}(\cdot)$ operator is hidden for simplicity. The $\mathrm{poly}(S,A,T_{\max})$ time complexity is due to optimization in the occupancy measure space.

Feedback SC:
  Cohen et al. (2021): regret $B_\star\sqrt{SAK}$, time $S^3A^2T_{\max}$, space $S^2AT_{\max}$.
  Our work: regret $B_\star S\sqrt{AK}$, time $S^2AT_{\max}K$, space $S^2A$.
Feedback AF:
  Chen and Luo (2021): regret $\sqrt{S^2ADT_\star K}$, time $\mathrm{poly}(S,A,T_{\max})K$, space $S^2AT_{\max}$.
  Our work: regret $\sqrt{(S^2A+T_\star)DT_\star K}$, time $S^2AT_{\max}K$, space $S^2A$.

• In Section 5.1, we first propose an important technique used in all our algorithms: stacked discounted approximation. It reduces any SSP instance to a special Markov Decision Process (MDP) with a stack of $O(\ln K)$ layers ($K$ is the total number of episodes), each of which contains a discounted MDP (hence the name), such that the learner stays in the same layer with a certain probability $\gamma$ and proceeds to the next layer with probability $1-\gamma$. This approximation not only resolves the difficulty of having dynamic and potentially unbounded episode lengths in the PO analysis, but more importantly leads to near-stationary policies with only $O(\ln K)$ changes within an episode. Compared to the commonly used finite-horizon approximation (Chen, Luo, and Wei, 2021b; Chen and Luo, 2021; Cohen et al., 2021), which changes the policy at every step of an episode, our approach could lead to an exponential improvement in space complexity and is also more natural since the optimal policy for SSP is indeed stationary.

• Building on the stacked discounted approximation, in Section 5.2, we design a PO algorithm for stochastic costs. Our algorithm achieves $\tilde{O}(B_\star S\sqrt{AK})$ regret, close to the minimax bound $\tilde{O}(B_\star\sqrt{SAK})$ of (Cohen et al., 2021), where $B_\star$ is the maximum expected cost of the optimal policy starting from any state $s$.

• Finally, in Section 5.3, we further study SSP with adversarial costs and design a PO algorithm that achieves $\tilde{O}(T_\star\sqrt{DK}+\sqrt{DT_\star S^2AK})$ regret with full information. The best existing bound for this setting is $\tilde{O}(\sqrt{DT_\star S^2AK})$ (Chen and Luo, 2021).

We also include Table 5.1 with a comprehensive comparison between SSP algorithms for a better understanding of our contributions. While our regret bounds do not always match the state-of-the-art, we emphasize again that our algorithms are more space-efficient due to the stacked discounted approximation ($S^2A$ versus $S^2AT_{\max}$ in Table 5.1).
It is also more time-efficient in some cases (for feedback type AF in Table 5.1). We also note that in the analysis of the stacked discounted approximation, a regret bound starting from any state (not just the initial state) is important, and PO indeed provides such a guarantee while other methods based on occupancy measures do not. In other words, PO is especially compatible with our stacked discounted approximation. Moreover, our results also significantly improve our theoretical understanding of PO, and pave the way for future study on more challenging problems such as SSP with function approximation, where in some cases PO is the only method known to be computationally and statistically efficient (Luo, Wei, and Lee, 2021).

Key Parameters. Four parameters play a key role in our analysis and regret bounds: $B_\star=\max_s V^{\pi^\star,P,c}(s)$, the maximum expected cost of the optimal policy starting from any state; $T_\star=T^{\pi^\star}(s_{\text{init}})$, the hitting time of the optimal policy starting from the initial state; $T_{\max}=\max_s T^{\pi^\star}(s)$, the maximum hitting time of the optimal policy starting from any state; and $D=\max_s\min_\pi T^{\pi}(s)$, the SSP-diameter. We also define the fast policy $\pi^f$ such that $\pi^f\in\operatorname{argmin}_\pi T^{\pi}(s)$ for all states $s$. Similarly to previous works, in most discussions we assume the knowledge of all four parameters and the fast policy, and defer to Appendix D.5 what we can achieve when some of these are unknown. We also assume $B_\star\geq 1$ for simplicity.

5.1 Stacked Discounted Approximation and Algorithm Template

Policy optimization algorithms have been naturally derived for many MDP models. In the finite-horizon setting, one can update the policy at the end of each episode using the cost of that episode, which is always bounded. In the discounted setting, or the average-reward setting with some ergodic assumption, one can also update the policy after a certain fixed number of steps, since short-term information is enough to predict the long-term behavior reasonably well. However, this is not possible in SSP: the hitting time of an arbitrary policy can be arbitrarily large, and looking at only a fixed number of steps cannot always provide accurate information.

A natural solution would be to approximate SSP by other MDP models, and then apply PO in the reduced model. Approximating SSP instances by finite-horizon MDPs (Chen et al., 2021; Chen, Luo, and Wei, 2021b; Cohen et al., 2021) or discounted MDPs (Tarbouriech et al., 2021b; Min et al., 2021) is a common practice in the literature, but both have their pros and cons. Finite-horizon approximation shrinks the estimation error exponentially fast and usually leads to optimal regret (Chen, Luo, and Wei, 2021b; Cohen et al., 2021). However, it greatly increases the space complexity of the algorithm, as it needs to store non-stationary policies with horizon of order $\tilde{O}(T_{\max})$ or $\tilde{O}(B_\star/c_{\min})$ (as shown in Table 5.1). Discounted approximation, on the other hand, produces stationary policies, but the estimation error decreases only linearly in the effective horizon $(1-\gamma)^{-1}$, where $\gamma$ is the discount factor. This often leads to sub-optimal regret bounds and large time complexity (Tarbouriech et al., 2021b). We include a detailed discussion on limitations of existing approximation schemes in Appendix D.2.1. These issues greatly limit the practical potential of these methods, and PO methods built on top of them would be less interesting.
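To make the space-complexity gap concrete, the following back-of-the-envelope computation compares the number of tabular policy entries the two approximation schemes need to store. The problem sizes and the horizon formulas below are purely illustrative proxies for the order-level bounds quoted above, not values taken from the analysis.

```python
import math

# Illustrative (hypothetical) problem sizes.
S, A, K = 100, 10, 10**6
T_max = 200

# Finite-horizon approximation: a different policy table for every step of an
# episode, with horizon of order T_max * ln(K).
H_finite = math.ceil(4 * T_max * math.log(K))
space_finite = H_finite * S * A          # number of policy entries

# Stacked discounted approximation (introduced next): one stationary policy
# per layer, with only O(ln K) layers.
H_stacked = math.ceil(math.log2(K))
space_stacked = (H_stacked + 1) * S * A

print(f"finite-horizon policy entries : {space_finite:,}")
print(f"stacked-discounted entries    : {space_stacked:,}")
print(f"ratio ~ {space_finite / space_stacked:.0f}x")
```

The gap becomes even more pronounced in the $c_{\min}=0$ case discussed later, where the finite-horizon approximation needs a horizon of order $\tilde{O}(B_\star/c_{\min})$ with $c_{\min}=1/\mathrm{poly}(K)$, whereas the number of layers stays $O(\ln K)$.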
To address these issues and achieve optimal regret with small space complexity, we introduce a new approximation scheme called Stacked Discounted Approximation, which is a hybrid of finite-horizon and discounted approximations. The key idea is as follows: the finite-horizon approximation requires a horizon of order $O(T_{\max}\ln K)$, but one can imagine that policies at nearby layers are close to each other and can be approximated by one stationary policy. Thus, we propose to achieve the best of both worlds by dividing the layers into $O(\ln K)$ parts and performing discounted approximation within each part with an effective horizon $O(T_{\max})$. Formally, we define the following.

Definition 3. For an SSP instance $M=(\mathcal{S},s_{\text{init}},g,\mathcal{A},P)$, we define, for a number of layers $H$, discount factor $\gamma$, and terminal cost $c_f$, another SSP instance $\bar{M}=(\bar{\mathcal{S}},\bar{s}_{\text{init}},g,\mathcal{A},\bar{P})$ as follows:
1. $\bar{\mathcal{S}}=\mathcal{S}\times[H+1]$, $\bar{s}_{\text{init}}=(s_{\text{init}},1)$, and the goal state $g$ remains the same.
2. Transition from $(s,h)$ to $(s',h')$ is only possible for $h'\in\{h,h+1\}$: for any $h\le H$ and $(s,a,s')\in\Gamma\times\mathcal{S}$, we have $\bar{P}_{(s,h),a}(s',h)=\gamma P_{s,a}(s')$ (stay in the same layer with probability $\gamma$), $\bar{P}_{(s,h),a}(s',h+1)=(1-\gamma)P_{s,a}(s')$ (proceed to the next layer with probability $1-\gamma$), and $\bar{P}_{(s,h),a}(g)=P_{s,a}(g)$; for $h=H+1$, we have $\bar{P}_{(s,H+1),a}(g)=1$ for any $(s,a)$ (immediately reach the goal if at layer $H+1$). For notational convenience, we also write $\bar{P}_{(s,h),a}(s',h')$ as $P_{(s,h),a}(s',h')$ or $P_{s,a,h}(s',h')$, and $\bar{P}_{(s,h),a}(g)$ as $P_{(s,h),a}(g)$ or $P_{s,a,h}(g)$.
3. For any cost function $c:\Gamma\to[0,1]$ in $M$, we define a cost function $\bar{c}$ for $\bar{M}$ such that $\bar{c}((s,h),a)=c(s,a)$ for $h\in[H]$ and $\bar{c}((s,H+1),a)=c_f$ (terminal cost). For notational convenience, we also write $\bar{c}((s,h),a)$ as $c((s,h),a)$ or $c(s,a,h)$.

For any stationary policy $\pi$ in $\bar{M}$, we write $\pi(a|(s,h))$ as $\pi(a|s,h)$, and we often abuse the notation $Q^{\pi,\bar{P},\bar{c}}$ and $V^{\pi,\bar{P},\bar{c}}$ to represent the value functions with respect to policy $\pi$, transition $\bar{P}$, and cost function $\bar{c}$. We often use $(s,a,h)$ in place of $((s,h),a)$ for function input, that is, we write $f((s,h),a)$ as $f(s,a,h)$. We also define $\bar{\Gamma}=\{((s,h),a)\}_{(s,h)\in\bar{\mathcal{S}},a\in\mathcal{A}_s}$ as the set of available state-action pairs in $\bar{M}$. Define $\bar{\pi}^\star$ for $\bar{M}$ that mimics the behavior of $\pi^\star$, in the sense that $\bar{\pi}^\star(\cdot|s,h)=\pi^\star(\cdot|s)$.

If we set $\gamma=1-\frac{1}{2T_{\max}}$, by the definition of $T_{\max}$, it can be shown that the probability of $\bar{\pi}^\star$ transiting to the next layer before reaching $g$ is upper bounded by $1/2$. If we further set $H=O(\ln K)$, then the probability of transiting to the $(H+1)$-th layer before reaching $g$ is at most $\frac{1}{2^H}=\tilde{O}(1/K)$. As a result, the estimation error decreases exponentially in the number of layers while the policy only changes $O(\ln K)$ many times. More importantly, due to the discount factor, the expected hitting time of any policy is of order $O(\frac{H}{1-\gamma})=O(T_{\max}\ln K)$, which controls the cost of exploration and enables the learner to only update its policy at the end of an episode. We summarize the intuition above in the following lemma.

Lemma 9. For any cost function $c:\Gamma\to[0,1]$ and terminal cost $c_f$, we have $V^{\pi,\bar{P},\bar{c}}(s,h)\le\frac{H-h+1}{1-\gamma}+c_f$ for any $h\in[H]$, $s\in\mathcal{S}$, and policy $\pi$ in $\bar{M}$. Moreover, if $\gamma=1-\frac{1}{2T_{\max}}$, we further have $Q^{\bar{\pi}^\star,\bar{P},\bar{c}}(s,a,h)\le Q^{\pi^\star,P,c}(s,a)+\frac{c_f}{2^{H-h+1}}$ for any $h\in[H]$ and $(s,a)\in\Gamma$.

Proof. The first statement holds because in expectation it takes any policy $\frac{1}{1-\gamma}$ steps to transit from one layer to the next and each step incurs at most $1$ cost (except for the terminal cost).
For the second statement, note thatV ;P;c (s;H + 1) =Q ;P;c (s;a;H + 1) =c f for any (s;a)2 , and for anyh2 [H], V ;P;c (s;h) = P a2As (ajs;h)Q ;P;c (s;a;h) and Q ;P;c (s;a;h) =c(s;a) + P s;a V ;P;c (;h) + (1 )P s;a V ;P;c (;h + 1); where we abuse the notation and deneV ;P;c (g;h) = 0 for allh2 [H + 1]. Now we prove the second statement by induction forh =H + 1;:::; 1. The base caseh =H + 1 is clearly true. ForhH, we boundQ ? ;P;c (s;a;h)Q ? ;P;c (s;a) as follows: P s;a V ? ;P;c (;h) + (1 )P s;a V ? ;P;c (;h + 1)P s;a V ? ;P;c P s;a (V ? ;P;c (;h)V ? ;P;c ) + (1 ) c f 2 Hh (V ? ;P;c (s;h + 1)V ? ;P;c (s) c f 2 Hh by induction) = E s 0 Ps;a;a 0 ? (s 0 ) h Q ? ;P;c (s 0 ;a 0 ;h)Q ? ;P;c (s 0 ;a 0 ) i + (1 ) c f 2 Hh : By repeating the arguments above, we arrive at Q ? ;P;c (s;a;h)Q ? ;P;c (s;a)E " I X t=1 t1 (1 ) c f 2 Hh ? ;P;s 1 =s;a 1 =a # ; 57 whereI is the (random) number of steps it takes for ? to reach the goal inM starting from (s;a). Bounding t1 by 1 andE[I] byT max , we then obtain the upper bound (1 )Tmaxc f 2 Hh = c f 2 Hh+1 , which nishes the induction. Remark 1. Applying the rst statement of Lemma 9 withc(s;a) = 1 andc f = 1, we have the expected hitting time of any policy in M bounded by H 1 + 1 starting from any state in any layer. Now we complete the approximation by showing how to solve the original problem via solving its stacked discounted version. Given a policy for M, dene a non-stationary randomized policy() for M as follows: it maintains an internal counterh initialized as 1. In each time step before reaching the goal, it rst follows(js;h) for one step, wheres is the current state. Then, it samples a Bernoulli random variableX with mean , and it increasesh by 1 ifX = 0. Whenh =H + 1, it executes the fast policy f until reaching the goal state. Clearly, the trajectory of() indeed follows the same distribution of the trajectory of in M. We show that as long asH is large enough andc f is of order ~ O(D), this reduction makes sure that the regret between these two problems are similar. The proof is deferred to Appendix D.2. Lemma10. Let = 1 1 2Tmax ,H =dlog 2 (c f K)e,c f =d4D ln 2K eforsome2 (0; 1),and 1 ;:::; K be policies for M. Then the regret of executing( 1 );:::;( K ) inM satisesR K R K + ~ O(1) with probability at least 1, where R K = P K k=1 P J k i=1 c k i + c k J k +1 V ? ;P;c (s k 1 ; 1) for stochastic environ- ments,and R K = P K k=1 P J k i=1 c k i + c k J k +1 V ? ;P;c k (s k 1 ; 1) foradversarialenvironments. Here,J k isthe number of time steps in episodek before the learner reachingg or the counter of( k ) reachingH + 1, and c k J k +1 =c f Ifs k J k +1 6=gg. ComputingFastPolicyandEstimatingDiameter For simplicity, we assume knowledge of the diam- eter and the fast policy above. When these are unknown, one can follow the ideas in (Chen and Luo, 2021) for estimating the fast policy with constant overhead and then adopt their template for learning without knowing the diameter; see (Chen and Luo, 2021, Lemma 1, Appendix E). 58 Algorithm13 Template for Policy Optimization with Stacked Discounted Approximation Initialize:P 1 , the set of all possible transition functions in M (Eq. (D.1));> 0, some learning rate. fork = 1;:::;K do Compute k (ajs;h)/ exp P k1 j=1 ( e Q j (s;a;h)B j (s;a;h)) . Execute( k ) for one episode (see the paragraph before Lemma 10). Compute some optimistic action-value estimator e Q k and exploration bonus functionB k usingP k and observations from episodek. Compute transition condence setP k+1 , as dened in Eq. (D.2). 
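Algorithm 13 above executes $\sigma(\pi_k)$ in each episode. For concreteness, here is a minimal sketch of that execution procedure as described before Lemma 10: follow $\pi(\cdot|s,h)$ at layer $h$, advance the internal layer counter with probability $1-\gamma$ after each step, and switch to the fast policy once the counter exceeds $H$. The simulator interface `env` and the arrays `pi`, `pi_fast` are illustrative stand-ins, not notation from the thesis.

```python
import numpy as np

def run_sigma_pi(env, pi, pi_fast, H, gamma, rng=np.random.default_rng(0)):
    """Execute sigma(pi) in the original SSP (see the paragraph before Lemma 10).

    `env.reset()` returns the initial state and `env.step(a)` returns
    (next_state, cost, reached_goal); `pi[h][s]` is a distribution over actions
    for layer h = 1, ..., H, and `pi_fast[s]` is the fast policy's action.
    Returns the total cost of the episode.
    """
    s, total_cost, h = env.reset(), 0.0, 1
    done = False
    while not done:
        if h <= H:
            a = rng.choice(len(pi[h][s]), p=pi[h][s])   # follow pi(.|s, h)
        else:
            a = pi_fast[s]                              # layer H + 1: fast policy
        s, cost, done = env.step(a)
        total_cost += cost
        if h <= H and rng.random() > gamma:             # advance layer w.p. 1 - gamma
            h += 1
    return total_cost
```

The trajectory generated this way has the same distribution as a trajectory of $\pi$ in the stacked discounted MDP, which is exactly what Lemma 10 exploits.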
PolicyOptimizationinStackedDiscountedMDPs Now we describe a template of performing policy optimization with the stacked discounted approximation. The pseudocode is shown in Algorithm 13. To handle unknown transition, we maintain standard Bernstein-style transition condence setsfP k g K k=1 whose denition is deferred to Appendix D.1.1. In episodek, the algorithm rst computes policy k in M following the multiplicative weights update with some learning rate > 0, such that k (ajs;h)/ e P k1 j=1 ( e Q j (s;a;h)B j (s;a;h)) for some optimistic action-value estimator e Q j and exploration bonus function B j (computed from past observations and condence sets). Then, it executes( k ) for this episode. Finally, it computes condence setP k+1 . All algorithms introduced in this work follow this template and dier from each other in the denition of e Q k andB k . Ideally, e Q k B k should be the action-value function with respect to the true transition, the true cost function, and policy k , but since the transition and cost functions are unknown, the key challenge lies in constructing accurate estimators that simultaneously encourage sucient exploration. Optimistic Transitions Our algorithms require using some optimistic transitions. Specically, for a policy, a condence setP, and a cost functionc, let (;P;c) be the corresponding optimistic transition such that (;P;c)2 argmin P2P V ;P;c (s;h) for all state (s;h). The existence of such an optimistic transition and how it can be eciently approximated via Extended Value Iteration (in at most ~ O(T max ) iterations) are deferred to Appendix D.1.2. We abuse the notation and denote byV ;P;c andQ ;P;c the value functionV ;(;P;c);c and action-value functionQ ;(;P;c);c . 59 Occupancy Measure Another important concept for subsequent discussions is occupancy measure. Given a policy : S! A and a transition functionP =fP s;a;h g (s;h)2 S;a2A withP s;a;h 2 S + and S + = S[fgg, deneq ;P : S + !R + such thatq ;P ( s;a; s 0 ) =E[ P I i=1 Ifs i = s;a i =a;s i+1 = s 0 gj;P;s 1 = s init ] is the expected number of visits to ( s;a; s 0 ) following policy in a stacked discounted MDP with transitionP . We also letq ;P (s;a;h) = P s 0q ;P ((s;h);a; s 0 ) be the expected number of visits to ((s;h);a) andq ;P (s;h) = P a q ;P (s;a;h) be the number of visits to (s;h). Note that if a function q : S + !R + is an occupancy measure, then the corresponding policy q satises q (ajs;h)/q(s;a;h) and the corresponding transition functionP q satisesP q;s;a;h (s 0 ;h 0 )/ q((s;h);a; (s 0 ;h 0 )). Moreover, V ;P;c ( s init ) =hq ;P ;ci holds for any policy, transition functionP and cost functionc. OtherNotation Following Lemma 10 we set = 1 1 2Tmax ,H =dlog 2 (c f K)e, andc f =d4D ln 2K e for some failure probability2 (0; 1). Dene = 2HT max +c f as the value function upper bound in M (according to the rst statement of Lemma 9). Deneq k =q k ;P ,q ? =q ? ;P , andL =d 8H 1 ln(2T max K=)e. Also dene the advantage functionA ;P;c (s;a) =Q ;P;c (s;a)V ;P;c (s). 5.2 AlgorithmforStochasticEnvironments In this section, we consider policy optimization in stochastic environments dened in Protocol 1. We show that a simple policy optimization framework can be used to achieve near-optimal regret in this setting. Below, we start by describing the algorithm and its guarantees, followed by some explanation behind the algorithm design and then some key ideas and novelty in the analysis. Algorithm As mentioned, the only elements left to be specied in Algorithm 13 are e Q k andB k . 
For stochastic environments, we simply setB k (s;a;h) = 0 for all (s;a;h) since exploration is relatively easier in this case. We now discuss how to construct e Q k . 60 • Action-valueestimator e Q k is dened asQ k ;P k ;e c k for some corrected cost estimatore c k : e c k (s;a;h) = (1 + b Q k (s;a;h))b c k (s;a;h); (5.1) where is some parameter and b Q k =Q k ;P k ;b c k is another action-value estimator with respect to some optimistic cost estimatorb c k (all to be specied below). • Optimisticcostestimatorb c k is dened as b c k (s;a;h) =b c k (s;a)IfhHg +c f Ifh =H + 1g; b c k (s;a) = max 0; c k (s;a) 2 p c k (s;a) k (s;a) 7 k (s;a) ; where c k (s;a) is the average of all costs that are observed for (s;a) in episodej = 1;:::;k 1 before ( j ) switches to the fast policy, and k (s;a) is = ln(2SALK=) divided by the number of samples used in computing c k (s;a), such that 2 p c k (s;a) k (s;a) + 7 k (s;a) is a standard Bernstein-style deviation term (thus makingb c k (s;a) an optimistic underestimator). • Parametertuning: learning rate is set to minf 1 =3Tmax(8+ =Tmax) 2 ; 1 = p T 4 max Kg for the multiplicative weights update, and the parameter is set to minf 1 =Tmax; p S 2 A =B 2 ? Kg. We now state the regret guarantees of our algorithm (proofs are deferred to Appendix D.3.2.1). Theorem15. Forstochasticcosts,Algorithm13withtheinstantiationaboveachievesR K = ~ O(B ? S p AK + T 3 max (S 2 AK) 1=4 +S 4 A 2:5 T 4 max ) with probability at least 1 32. Ignoring lower-order terms, our bound almost matches the minimax bound ~ O(B ? p SAK) of (Cohen et al., 2021), with a p S factor gap. We emphasize again that besides the simplicity of PO, one algorithmic ad- vantage of our method compared to those based on nite-horizon approximation is its low space complexity to store policies — the horizonH for our method is onlyO(lnK), while the horizon for other works (Chen 61 and Luo, 2021; Cohen et al., 2021) is ~ O(T max ) whenT max is known or otherwise ~ O( B? =c min ). Note that when c min = 0, a common technique is to perturb the cost and deal with a modied problem withc min = 1 =poly(K), in which case our space complexity is exponentially better. In fact, even for time complexity, although our method requires calculating optimistic transition and might need ~ O(T max ) rounds of Extended Value Iteration, this procedure could terminate much earlier, while the nite-horizon approximation approaches always need at least (T max ) time complexity since that is the horizon of the MDP they are dealing with. Analysishighlights We start by explaining the design of the corrected cost estimator Eq. (5.1). Roughly speaking, standard analysis of PO (specically, by (Chen and Luo, 2021, Lemma 9) and then Lemma 59) leads to a term of order P K k=1 hq k ;c b Q k i due to the transition estimation error, which can be prohibitively large (for functionsf andg with the same domain, we dene (fg)(x) = f(x)g(x)). Introducing the correction bias b Q k (s;a;h)b c k (s;a;h) in Eq. (5.1), on the other hand, has the eect of transforming this problematic term into its counterpart P K k=1 hq ? ;c b Q k i in terms of q ? instead of q k . Bounding the latter term, however, requires a property that PO enjoys, that is, a regret bound for any initial state- action pair: P K k=1 ( b Q k Q ? ;P;c )(s;a;h) = ~ O( p K) for any (s;a;h). In contrast, approaches based on occupancy measure (Chen and Luo, 2021) only guarantee a regret bound starting froms init . This makes PO especially compatible with our stacked discounted approximation. 
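Before turning to the analysis, here is a minimal sketch of the Bernstein-style optimistic cost estimator $\hat{c}_k$ described in the bullets above, stacked over the $H+1$ layers with the terminal cost $c_f$ at the last layer. The correction of Eq. (5.1), which further multiplies this estimator by a term involving $\hat{Q}_k$, is omitted here since it requires the optimistic transitions; array names are illustrative, not the thesis's notation.

```python
import numpy as np

def optimistic_cost(sum_costs, num_samples, S, A, L, K, delta, c_f, H):
    """Sketch of the optimistic (under-)estimator of the cost function.

    `sum_costs[s, a]` and `num_samples[s, a]` accumulate costs observed for
    (s, a) in previous episodes before switching to the fast policy.
    Returns an array of shape (H + 1, S, A): layers 1..H use the optimistic
    estimate, layer H + 1 pays the terminal cost c_f.
    """
    n = np.maximum(num_samples, 1)
    c_bar = sum_costs / n                              # empirical mean cost
    iota = np.log(2 * S * A * L * K / delta) / n       # per-pair confidence width
    # Bernstein-style deviation keeps this an optimistic underestimator.
    c_hat = np.maximum(0.0, c_bar - 2 * np.sqrt(c_bar * iota) - 7 * iota)
    c_layers = np.repeat(c_hat[None, :, :], H, axis=0)  # shape (H, S, A)
    terminal = np.full((1, S, A), c_f)                   # layer H + 1
    return np.concatenate([c_layers, terminal], axis=0)
```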
Based on this observation, we further have P K k=1 hq ? ;c b Q k i P K k=1 q ? ;cQ ? ;P;c , where the latter term is only about the behavior of the optimal policy and is thus nicely bounded (see e.g. Lemma 53). To sum up, the correction term b Q k (s;a;h)b c k (s;a;h) in Eq. (5.1) together with a favorable property of PO helps us control the transition estimation error in a near-optimal way. Finally, we point out another novelty in our analysis. Compared to other approaches that act according to the exact optimal policy of an estimated MDP, PO incurs an additional cost due to only updating the policy incrementally in each episode. This cost is often of order ~ O( p K) and is one of the dominating terms in the regret bound; see e.g. (Shani et al., 2020; Wu et al., 2021) for the nite-horizon case. For SSP, this 62 is undesirable because it also depends onT ? or evenT max . Reducing this cost has been studied from the optimization perspective — for example, an improved ~ O( 1 =K) convergence rate of PO has been established recently by (Agarwal et al., 2021). However, adopting their analysis to regret minimization requires additional eorts. Specically, we need to carefully bound the bias from using an action-value estimator in the policy’s update, which can be shown to be approximately bounded by P K k=1 ( e Q k+1 e Q k )(s;a;h). In Lemma 58, we show that this term is of lower order by carefully analyzing the drift ( e Q k+1 e Q k )(s;a;h) in each episode. Remark2. Weremarkthatouralgorithmcanbeappliedtonite-horizonMDPswithinhomogeneoustransition and gives a ~ O( p S 2 AH 3 K) regret bound, improving over that of (Shani et al., 2020) by a factor of p H where H is the horizon. We omit the details but only mention that the improvement comes from two sources: rst, the aforementioned improved PO analysis turns a ~ O(H 2 p K) regret term into a lower order term; second, we use Bernstein-style transition condence set to obtain an improved ~ O( p S 2 AH 3 K) transition estimation error. 5.3 AlgorithmforAdversarialEnvironments We move on to consider the more challenging environments with adversarial costs, where the extra exploration bonus functionB k in Algorithm 13 now plays an important role. Even in the nite-horizon setting, developing ecient PO methods in this case can be challenging, and Luo, Wei, and Lee (2021) proposed the so-called “dilated bonuses” to guide better exploration, which we also adopt and extend to SSP. Specically, for a policy, a transition condence setP, and some bonus functionb : [H + 1]!R, we dene the corresponding dilated bonus functionB ;P;b : [H + 1]!R as:B ;P;b (s;a;H + 1) = b(s;a;H + 1) and forh2 [H], B ;P;b (s;a;h) =b(s;a;h) + 1 + 1 H 0 max b P2P b P s;a;h X a 0 (a 0 j;)B ;P;b (;a 0 ;) ! ; (5.2) 63 whereH 0 = 8(H+1) ln(2K) 1 is the dilated coecient. Intuitively,B ;P;b is the dilated (by a factor of 1 + 1 =H 0 ) and optimistic (by maximizing overP) version of the action-value function with respect to andb. In the nite-horizon setting (Luo, Wei, and Lee, 2021), this can be computed directly via dynamic programming, but how to compute it in a stacked discounted MDP (or even why it exists) is less clear. Fortunately, we show that this can indeed be computed eciently via a combination of dynamic programming and Extended Value Iteration; see Appendix D.4.3. We now describe our algorithm for the adversarial full-information case (wherec k is revealed at the end of episodek, i.e., Protocol 2). It suces to specify e Q k andB k in Algorithm 13. 
• Action-valueestimator e Q k is dened as e Q k =Q k ;P k ;e c k , wheree c k (s;a;h) = (1+ b Q k (s;a;h))c k (s;a;h) for some parameter and b Q k =Q k ;P k ;c k . • Dilated bonusB k is dened asB k ;P k ;b k withb k (s;a;h) = 2 P a2As k (ajs;h) e A k (s;a;h) 2 , where e A k (s;a;h) = e Q k (s;a;h) e V k (s;h) (advantage function) and e V k =V k ;P k ;e c k . • Parametertuning: = minf 1 =(64 2 p HH 0 ); 1 = p DKg and = minf 1 =; 48 + p S 2 A =DT?Kg. Our algorithm enjoys the following guarantee (whose proof can be found in Appendix D.4.1). Theorem16. For adversarial costs with full information, Algorithm 13 with the instantiation above achieves R K = ~ O(T ? p DK + p S 2 ADT ? K +S 4 A 2 T 5 max ) with probability at least 1 20. The best existing bound is ~ O( p S 2 ADT ? K) from (Chen and Luo, 2021). Ignoring the lower order term, our result matches theirs whenT ? S 2 A (and is worse by a p T? =S 2 A factor otherwise). Our algorithm enjoys better time and space complexity though, similar to earlier discussions. 64 Analysis highlights For simplicity we assume that the true transition is known, in which case our bound is only ~ O(T ? p DK) (the other term p S 2 ADT ? K is only due to transition estimation error). A naive way to implement PO would lead to a penalty term T? = plus a stability term K X k=1 X s;h q ? (s;h) X a k (ajs;h)Q k ;P;c k (s;a;h) 2 ; which eventually leads to a bound of order ~ O(T ? T max p K) if one boundsQ k ;P;c k (s;a;h) by ~ O(T max ). Our improvement comes from the following ve steps: 1) rst, through a careful shifting argument, we show that the stability term can be improved to P K k=1 P s;h q ? (s;h) P a k (ajs;h)A k ;P;c k (s;a;h) 2 (recall thatA is the advantage function); 2) second, similarly to (Luo, Wei, and Lee, 2021), the dilated bonusB k helps transformq ? toq k in the term above, leading to P K k=1 q k ; (A k ;P;c k ) 2 ; 3) third, in Lemma 59 we show that the previous term is bounded by the variance of the learner’s cost, which in turn is at most P K k=1 q k ;c k Q k ;P;c k ; 4) fourth, similarly to Section 5.2, the correction termc k b Q k in the denition ofe c k helps transformq k back toq ? , resulting in P K k=1 q ? ;c k Q k ;P;c k ; 5) nally, since PO guarantees a regret bound for any initial state (as mentioned in Section 5.2), the previous term is close to P K k=1 q ? ;c k Q ? k ;P;c k , which is now only related to the optimal policy and can be shown to be at most ~ O(DT ? K). Combining this with the penalty term T? = and picking the best then results in the claimed ~ O(T ? p DK) regret bound. 5.4 OpenProblems Our work initiates the study of policy optimization for SSP. Many questions remain open, such as closing the gap between some of our results and the best known results achieved by other types of methods. Moreover, as mentioned, one of the reasons to study PO for SSP is that PO usually works well when combined with function approximation. Our stacked discounted approximation scheme also does not make use of any 65 modeling assumption and should be applicable in more general settings. Although our work is only for the tabular setting, we believe that our results lay a solid foundation for future studies on SSP with function approximation. 66 Chapter6 LearningNon-stationarySSP 6.1 Overview: Non-stationarySSP All previous works on SSP consider minimizingstaticregret, a special case where the benchmark policy is the same for every episode. 
This is reasonable only for (near) stationary environments where one single policy performs well over all episodes. In reality, however, the environment is often non-stationary with both the cost function and the transition function changing over episodes, making static regret an unreasonable metric. Instead, the desired objective is to minimize dynamicregret, where the benchmark policy for each episode is the optimal policy for that corresponding environment, and the hope is to obtain sublinear dynamic regret whenever the non-stationarity is not too large. Based on this motivation, in this chapter, we initiate the study of dynamic regret minimization for non-stationary SSP and develop the rst set of results. Specically, we consider Protocol 3 and dynamic regret dened in Eq. (1.3). Our contributions are as follows: • To get a sense on the diculty of the problem, we start by establishing a dynamic regret lower bound in Section 6.2. Specically, we prove that ((B ? SAT ? ( c +B 2 ? P )) 1=3 K 2=3 ) regret is unavoidable, whereB ? is the maximum expected cost of the optimal policy of any episode starting from any state, T ? is the maximum hitting time of the optimal policy of any episode starting from the initial state, and 67 c , P are the amount of changes of the cost and transition functions respectively. Note the dierent roles of c and P here — the latter is multiplied with an extraB 2 ? factor, which we nd surprising for a technical reason discussed in Section 6.2. More importantly, this inspires us to estimate costs and transitions independently in subsequent algorithm design. • For algorithms, we rst present a simple one (Algorithm 15 in Section 6.4) that achieves sub-optimal regret of ~ O((B ? SAT max ( c +B 2 ? P )) 1=3 K 2=3 ), whereT max T ? is the maximum hitting time of the optimal policy of any episode starting from any state. Except for replacingT ? with the larger quantityT max , this bound is optimal in all other parameters. Moreover, this also translates to a minimax optimal regret bound in the nite-horizon setting (a special case of SSP), making Algorithm 15 the rst model-based algorithm with the optimal (SA) 1=3 dependency. • To improve theT max dependency toT ? , in Section 6.5, we present a more involved algorithm (Algo- rithm 17) that achieves a near minimax optimal regret bound matching the earlier lower bound up to logarithmic terms. Notation Several parameters play a key role in characterizing the diculty of this problem: B ? = max k;s V ? k (s), the maximum cost of the optimal policy of any episode starting from any state; T ? = max k T ? k k (s init ) (whereT k (s) is expected number of steps it takes for policy to reach the goal in episode k starting from states), the maximum hitting time of the optimal policy of any episode starting from the initial state;T max = max k;s T ? k k (s), the maximum hitting time of the optimal policy of any episode starting from any state; c = P K1 k=1 kc k+1 c k k 1 , the amount of non-stationarity in the cost functions; and nally P = P K1 k=1 max s;a kP k+1;s;a P k;s;a k 1 , the amount of non-stationarity in the transition functions. Throughout the paper we assume the knowledge ofB ? ,T ? , andT max , and alsoB ? 1 for simplicity. c and P are assumed to be known for the rst two algorithms we develop, but unknown for the last one. We also dene a value function upper boundB = 16B ? . 68 6.2 LowerBound To better understand the diculty of learning non-stationary SSP, we rst establish the following dynamic regret lower bound. Theorem17. 
In the worst case, the learner’s regret is at least ((B ? SAT ? ( c +B 2 ? P )) 1=3 K 2=3 ). The lower bound construction is similar to that in (Mao et al., 2020), where the environment is piecewise stationary. In each stationary period, the learner is facing a hard SSP instance with a slightly better hidden state. Details are deferred to Appendix E.2.2. In a technical lemma in Appendix E.2.1, we show that for any two episodesk 1 andk 2 , the change of the optimal value function due to non-stationarity satisesV ? k 1 (s init )V ? k 2 (s init ) ( c +B ? P )T ? , with only one extraB ? factor for the P -related term. We thus nd our lower bound somewhat surprising since an extraB 2 ? factor shows up for the P -related term. This comes from the fact that constructing the hard instance with perturbed costs requires a larger amount of perturbation compared to that with perturbed transitions; see Theorem 30 and Theorem 31 for details. More importantly, this observation implies that simply treating these two types of non-stationarity as a whole and only consider the non-stationarity in value function as done in (Wei and Luo, 2021) does not give the rightB ? dependency. This further inspires us to consider cost and transition estimation independently in our subsequent algorithm design. 6.3 BasicFramework: Finite-HorizonApproximation Our algorithms are all built on top of the nite-horizon approximation scheme of (Cohen et al., 2021), whose analysis is greatly simplied and improved by (Chen, Jain, and Luo, 2021), making it applicable to our non-stationary setting as well. This scheme makes use of an algorithmA that deals with a special case of SSP where each episode ends withinH = ~ O(T max ) steps, and applies it to the original SSP following 69 Algorithm14 Finite-Horizon Approximation of SSP Input: AlgorithmA for nite-horizon MDP M with horizonH = 4T max ln(8K). Initialize: interval counterm 1. fork = 1;:::;K do 1 Sets m 1 s init . 2 whiles m 1 6=g do 3 Feed initial states m 1 toA,h 1. 4 while True do 5 Receive actiona m h fromA, play it, and observe costc m h and next states m h+1 . 6 Feedc m h ands m h+1 toA. 7 ifh =H ors m h+1 =g orA requests to start a new interval then 8 H m h. break. 9 elseh h + 1. 10 Sets m+1 1 =s m Hm+1 andm m + 1. Algorithm 14. Specically, call each “mini-episode”A is facing aninterval. At each steph of intervalm, the learner receives the decisiona m h fromA, takes this action, observes the costc m h , transits to the next state s m h+1 , and then feed the observationc m h ands m h+1 toA (Line 5 and Line 6). The intervalm ends whenever one of the following happens (Line 7): the goal state is reached,H steps have passed, orA requests to start a new interval. ∗ In the rst case, the initial states m+1 1 of the next intervalm + 1 will be set tos init , while in the other two cases, it is naturally set to the learner’s current state, which is alsos m Hm+1 whereH m is the length of intervalm (see Line 10). At the end of each interval, we articially letA suer a terminal cost c f (s m Hm+1 ) wherec f (s) = 2B ? Ifs6=gg. This procedure (adaptively) generates a non-stationary nite-horizon Markov Decision Process (MDP) that A faces: M = (S;A;g;fc m g M m=1 ;fP m g M m=1 ;c f ;H). Here, c m = c k(m) andP m = P k(m) where k(m) is the unique episode that interval m belongs to, and M is the total number of intervals over K episodes, a random variable determined by the interactions. 
Note thatc m andP m always lie in the oblivious setsfc k g K k=1 andfP k g K k=1 respectively, butc m andP m are not oblivious since their values depend on the interaction history. LetV ;m 1 (s) be the expected cost (including the terminal cost) of following ∗ This last condition is not present in prior works. We introduce it since later our instantiation ofA will change its policy in the middle of an interval, and creating a new interval in this case allows us to make sure that the policy in each interval is always xed, which simplies the analysis. 70 policy starting from states in intervalm. Dene the regret ofA over the rstM 0 intervals in M as R M 0 = P M 0 m=1 ( P Hm+1 h=1 c m h V ? k(m) ;m 1 (s m 1 )) where we usec m Hm+1 as a shorthand for the terminal cost c f (s m Hm+1 ). Following similar arguments as in (Cohen et al., 2021; Chen, Jain, and Luo, 2021), the regret in M and M are close in the following sense. Lemma11. Algorithm 14 ensuresR K R M +B ? . See Appendix E.3 for the proof. Based on this lemma, in following sections we focus on developing the nite-horizon algorithmA and analyzing how large R M is. Note, however, that while this nite-horizon reduction is very useful, it does not mean that our problem is as easy as learning non-stationary nite- horizon MDPs and that we can directly plug in an existing algorithm asA. Great care is still needed when designingA in order to obtain tight regret bounds as we will show. 6.4 ASimpleSub-OptimalAlgorithm In this section, we present a relatively simple nite-horizon algorithmA for M which, in combination with the reduction of Algorithm 14, achieves a regret bound that almost matches our lower bound except thatT ? is replaced byT max . The key steps are shown in Algorithm 15. It follows the ideas of the MVP algorithm (Zhang, Ji, and Du, 2020) and adopts a UCBVI-style update rule (Line 6) with a Bernstein-type bonus term (Line 5) to maintain a set ofQ h functions, which then determines the action at each step in a greedy manner (Line 1). The two crucial new elements are the following. First, in the update rule Line 6, we subtract a positive valuex uniformly over all state-action pairs so thatkQ h k 1 is of orderO(B ? ) (recallB = 16B ? ), and we nd the (almost) smallest suchx via a doubling trick (Line 7). This is similar to the adaptive condence widening technique of (Wei and Luo, 2021), where they increase the size of the transition condence set to ensure a bounded magnitude on the estimated value function; our approach is an adaptation of their idea to the UCBVI style update rule. 71 Algorithm15 Non-Stationary MVP Parameters: window sizesW c (for costs) andW P (for transitions), and failure probability. Initialize: for all (s;a;s 0 ),C(s;a) 0,M(s;a) 0,N(s;a) 0,N(s;a;s 0 ) 0. Initialize: Update(1). form = 1;:::;M do forh = 1;:::;H do 1 Play actiona m h argmin a Q h (s m h ;a), receive costc m h and next states m h+1 . C(s m h ;a m h ) + c m h ,M(s m h ;a m h ) + 1,N(s m h ;a m h ) + 1,N(s m h ;a m h ;s m h+1 ) + 1. 2 ifs m h+1 =g orM(s m h ;a m h ) = 2 l orN(s m h ;a m h ) = 2 l for some integerl 0 then break (which starts a new interval). 3 ifW c dividesmthen resetC(s;a) 0 andM(s;a) 0 for all (s;a). 4 ifW P dividesmthen resetN(s;a;s 0 ) 0 andN(s;a) 0 for all (s;a;s 0 ). Update(m + 1). ProcedureUpdate(m) V H+1 (s) 2B ? Ifs6=gg,V h (g) 0 forhH, 2 11 ln 2SAHKm , andx 1 mH . 
for all (s;a)do N + (s;a) maxf1;N(s;a)g,M + (s;a) maxf1;M(s;a)g, c(s;a) C(s;a) M + (s;a) , b c(s;a) max n 0; c(s;a) q c(s;a) M + (s;a) M + (s;a) o , P s;a () N(s;a;) N + (s;a) . while True do forh =H;:::; 1do 5 b h (s;a) max n 7 q V( Ps;a;V h+1 ) N + (s;a) ; 49B p S N + (s;a) o for all (s;a). 6 Q h (s;a) maxf0;b c(s;a) + P s;a V h+1 b h (s;a)xg for all (s;a). V h (s) min a Q h (s;a) for alls. 7 if max s;a;h Q h (s;a)B=4thenbreak;elsex 2x. Second, we periodically restart the algorithm (by resetting some counters and statistics) in Line 3 and Line 4. While periodic restart is a standard idea to deal with non-stationarity, the novelty here is a two-scale restart schedule: we set one window sizeW c related to costs and another oneW P related to transitions, and restart after everyW c intervals or everyW P intervals. As mentioned, this two-scale schedule is inspired by the lower bound in Section 6.2, which indicates that cost estimation and transition estimation play dierent roles in the nal regret and should be treated separately. Another small modication is that we start a new interval when the visitation to some (s;a) doubles (Line 2), which helps removeT max dependency in lower-order terms and is important for following sections. With all these elements, we prove the following regret guarantee of Algorithm 15. 72 Theorem18. For anyM 0 M, with probability at least 1 22 Algorithm 15 ensures R M 0 = ~ O M 0 q B ? SA 1 =Wc + B? =W P +B ? SA ( 1 =Wc + S =W P ) + ( c W c +B ? P W P )T max : Thus, with a proper tunning ofW c andW P (that is in term ofM 0 ), Algorithm 15 ensures R M 0 = ~ O((B ? SAT max ( c +B 2 ? P )) 1=3 M 0 2=3 ). However, this does not directly imply a bound on R M sinceM is a random variable (and the tunning above would depend onM). Fortunately, to resolve this it suces to perform a doubling trick on the number of intervals, that is, rst make a guess onM, and then double the guess wheneverM exceeds it. We summarize this idea in Algorithm 16. Finally, combining it with Algorithm 14, Lemma 11, and the simplied analysis of (Chen, Jain, and Luo, 2021) which is able to bound the total number of intervalsM in terms of the total number of episodesK (Lemma 87), we obtain the following result (all proofs are deferred to Appendix E.4). Theorem19. With probability at least 1 22, applying Algorithm 14 withA being Algorithm 16 ensures R K 0 = ~ O((B ? SAT max ( c +B 2 ? P )) 1=3 K 0 2=3 ) (ignoring lower order terms) for anyK 0 K. Note that Theorem 19 actually provides an anytime regret guarantee (that is, holds for anyK 0 K), which is important in following sections. Compared to our lower bound in Theorem 17, the only sub- optimality is in replacingT ? with the larger quantityT max . Despite its sub-optimality for SSP, however, as a side result our algorithm in fact implies the rst model-based nite-horizon algorithm that achieves the optimal dependency onSA and matches the minimax lower bound of (Mao et al., 2020). Specically, in previous works, the optimalSA dependency is only achievable by model-free algorithms, which un- fortunately have sub-optimal dependency on the horizon by the current analysis (see (Mao et al., 2020, Lemma 10)). On the other hand, existing model-based algorithms for nite state-action space all follow the idea of extended value iteration, which gives sub-optimal dependency onS and also brings diculty in 73 Algorithm16 Non-Stationary MVP with a Doubling Trick forn = 1; 2;:::do Initialize an instance of Algorithm 15 with W c = d(B ? 
SA) 1=3 (2 n1 =( c T max )) 2=3 e and W P = d(SA) 1=3 (2 n1 =( P T max )) 2=3 e, and execute it in intervalsm = 2 n1 ;:::; 2 n 1. incorporating entry-wise Bernstein condence sets. † Our approach, however, resolves all these issues. See Appendix E.4.4 for more discussions. TechnicalHighlights The key step of our proof for Theorem 18 is to bound the sum of variance term P M 0 m=1 P Hm h=1 V(P m s m h ;a m h ;V ?;m h+1 V m h+1 ), whereV m h+1 is the value ofV h+1 at the beginning of intervalm, and V ?;m h+1 is the optimal value function of M in interval m (formally dened in Appendix E.1). The standard analysis on bounding this term requiresV ?;m h+1 (s)V m h+1 (s) 0, which is only true in a stationary environment due to optimism. To handle this in non-stationarity environments, we carefully choose a set of constantsfz m h g so thatV ?;m h+1 (s) +z m h V m h+1 (s) 0 (Lemma 89), and then apply similar analysis on P M 0 m=1 P Hm h=1 V(P m s m h ;a m h ;V ?;m h+1 V m h+1 ) = P M 0 m=1 P Hm h=1 V(P m s m h ;a m h ;V ?;m h+1 +z m h V m h+1 ). See Lemma 91 for more details. 6.5 AMinimaxOptimalAlgorithm In this section, we present an improved algorithm that achieves the minimax optimal regret bound up to logarithmic terms, starting with a rened version of Algorithm 15 shown in Algorithm 17. Below, we focus on describing the new elements introduced in Algorithm 17 (that is, Lines 1-3 and 6-8). ‡ The main challenge in replacingT max withT ? is that the regret due to non-stationarity accumulates along the learner’s trajectory, which can be as large asO(( c +B ? P )H) since the horizon isH (recall H = ~ O(T max )). Moreover, bounding the number of steps needed for the learner’s policy to reach the goal is highly non-trivial due to the changing transitions. Our main idea to address these issues is to incorporate † Note that the transition non-stationarity P is dened viaL1 norm. Thus, naively applying entry-wise condence widening to Bernstein condence sets introduces extra dependency onS. ‡ Line 4 and Line 5, althogh written in a dierent form, are similar to Line 3 and Line 4 of Algorithm 15. 74 a correction term (computed in Line 7) into the estimated cost (Line 8) to penalize policies that take too long to reach the goal. This correction term is set to be an upper bound of the learner’s average regret per interval (dened through c and P in Line 7). It introduces the eect of canceling the non-stationarity along the learner’s trajectory when it is not too large. When the non-stationarity is large, on the other hand, we detect it through two non-stationary tests (Line 2 and Line 3), and reset the knowledge of the environment (more details to follow). However, this correction leads to one issue: we cannot perform adaptive condence widening (that is, thex bias) anymore as it would cancel out the correction term. To address this, we introduce another test (Line 6,Test3) to directly check whether the magnitude of the estimated value function is bounded as desired. If not, we reset again since that is also an indication of large non-stationarity. We now provide some intuitions on the design of Test1 andTest2. First, one can show that the two quantitiesb c andb P we maintain in Line 1 are such that their sum is roughly an upper bound on the estimated accumulated regret. So directly checking whetherb c +b P is too large would be similar to the second test of the MASTER algorithm (Wei and Luo, 2021). 
Here, however, we again break it into two tests whereTest1 only guards the non-stationarity in cost, andTest2 mainly guards the non-stationarity in transition. Note thatTest2 also involves cost information through V , but our observation is that we can still achieve the desired regret bound as long as the ratio of the number of resets caused by procedures ResetC() and ResetP() is of order ~ O(B ? ). This inspires us to reset both the cost and the transition estimation whenTest2 fails, but reset the transition estimation only with some probabilityp (eventually set to 1=B ? ) whenTest3 fails. For analysis, we rst establish a regret guarantee of Algorithm 17 in an ideal situation where the rst state of each interval is alwayss init . (Proofs of this section are deferred to Appendix E.5.) 75 Theorem20. LetW c =d(B ? SA) 1=3 (K=( c T ? )) 2=3 e,W P =d(SA) 1=3 (K=( P T ? )) 2=3 e,c 1 = p B ? SA=T ? , c 2 = p SA=T ? , and p = 1=B ? . Suppose s m 1 = s init for all m K, then Algorithm 17 ensures R K = ~ O((B ? SAT ? ( c +B 2 ? P )) 1=3 K 2=3 ) (ignoring lower order terms) with probability at least 1 40. The reason that we only analyze this ideal case is that, if the initial state is nots init , then even the optimal policy does not guaranteeT ? hitting time by denition. This also inspires us to eventually deploy a two-phase algorithm slightly modifying Algorithm 14: feed the rst interval of each episode into an instance of Algorithm 17, and the rest of intervals into an instance of Algorithm 16 (see Algorithm 18). Thanks to the large terminal cost, we are able to show that the regret in the second phase is upper bounded by a constant, leading to the following nal result. Theorem21. Algorithm18withA 1 beingAlgorithm17andA 2 beingAlgorithm16ensuresR K = ~ O((B ? SAT ? ( c + B 2 ? P )) 1=3 K 2=3 ) (ignoring lower order terms) with probability at least 1 64. Ignoring logarithmic and lower-order terms, our bound is minimax optimal. Also note that the bound is sub-linear (inK) as long as c and P are sub-linear (that is, not the worst case). 6.6 OpenProblems In this work, we develop the rst set of results for dynamic regret minimization in non-stationary SSP. Our work opens up many other possible future directions on this topic, such as extension to more general settings with function approximation. It would also be interesting to study more adaptive dynamic regret bounds in this setting. For example, ourB ? andT ? are dened as the maximum optimal expected cost and hitting time over all episodes, which is undesirable if only a few episodes admit a large optimal expected cost or hitting time. Ideally, some kind of (weighted) average would be a more reasonable measure in these cases. 76 Algorithm17 MVP with Non-Stationarity Tests Parameters: window sizesW c andW P , coecientsc 1 ,c 2 , sample probabilityp, and failure probability. Initialize: ResetC(),ResetP(),Update(1). form = 1;:::;M do forh = 1;:::;H do Play actiona m h argmin a Q h (s m h ;a), receive costc m h and next states m h+1 . C(s m h ;a m h ) + c m h ,M(s m h ;a m h ) + 1,N(s m h ;a m h ) + 1,N(s m h ;a m h ;s m h+1 ) + 1. 1 b c + c m h b c(s m h ;a m h ),b P + V h+1 (s m h+1 ) P s m h ;a m h V h+1 . ifs m h+1 =g orM(s m h ;a m h ) = 2 l orN(s m h ;a m h ) = 2 l for some integerl 0then break (which start a new interval). 2 if b c > c m (dened in Lemma 95)thenResetC(). (Test1) 3 if b P > P m (dened in Lemma 96)thenResetC() andResetP(). (Test2) 4 if c =W c thenResetC(). 5 if P =W P thenResetC() andResetP(). c + 1, P + 1,Update(m + 1). 
6 if V h 1 >B=2 for someh(Test3)then ResetC(), with probabilityp executeResetP(), andUpdate(m + 1). ProcedureUpdate(m) V H+1 (s) 2B ? Ifs6=gg, V h (g) 0 for allhH, and 2 11 ln 2SAHKm . 7 c minf c 1 p c ; 1 2 8 H g, P minf c 2 p P ; 1 2 8 H g, c +B P . for all (s;a)do N + (s;a) maxf1;N(s;a)g,M + (s;a) maxf1;M(s;a)g, c(s;a) C(s;a) M + (s;a) , P s;a () N(s;a;) N + (s;a) ,b c(s;a) max n 0; c(s;a) q c(s;a) M + (s;a) M + (s;a) o , 8 c(s;a) b c(s;a) + 8. forh =H;:::; 1do b h (s;a) max 7 q V( Ps;a; V h+1 ) N + (s;a) ; 49B p S N + (s;a) for all (s;a). Q h (s;a) = maxf0; c(s;a) + P s;a V h+1 b h (s;a)g all (s;a). V h (s) = argmin a Q h (s;a) for alls. ProcedureResetC() c 1,b c 0,C(s;a) 0,M(s;a) 0 for all (s;a). ProcedureResetP() P 1,b P 0,N(s;a;s 0 ) 0,N(s;a) 0 for all (s;a;s 0 ). Algorithm18 A Two-Phase Variant of Algorithm 14 Initialize: Phase 1 algorithm instanceA 1 and Phase 2 algorithm instanceA 2 . Execute Algorithm 14 withA =A 1 for every rst interval of an episode, andA =A 2 otherwise. 77 Chapter7 ReachingGoalsisHard: SettlingtheSampleComplexityofSSP 7.1 Overview: PACLearninginSSP In this chapter, we study the probably approximately correct (PAC) objective in SSP, which is formally dened in Section 1.5. Most of the literature on learning in SSP focuses on the regret minimization objective, for which learning algorithms with minimax-optimal performance are available (e.g., our algorithms in Chapter 2) even when no prior knowledge about the optimal policy is provided (i.e., its hitting time or the range of its value function). On the other hand, the PAC objective, i.e., to learn an-optimal policy with high probability with as few samples as possible, has received little attention so far. One reason is that, as it is shown in (Tarbouriech et al., 2021a), in SSP it is not possible to convert regret into sample complexity bounds through an online-to-batch conversion (Jin et al., 2018) and PAC guarantees can only be derived by developing specic algorithmic and theoretical tools. Assuming access to a generative model, Tarbouriech et al. (2021a) derived the rst PAC algorithm for SSP with sample complexity upper bounded as ~ O( T z B 2 ? SA 2 ), where is the largest support of the transition distribution,B ? is the maximum expected cost of an optimal policy over all states,T z =B ? =c min , wherec min is the minimum cost over all state-action pairs, and is the desired accuracy. The most intriguing aspect of this bound is the dependency onT z , which represents a worst-case bound on the hitting timeT max of the optimal policy (i.e., the horizon of the SSP) and it depends on the inverse of the minimum cost. While some dependency on the horizon may be 78 unavoidable, as conjectured in (Tarbouriech et al., 2021a), we may expect the horizon to be independent of the cost function ∗ , as in nite-horizon and discounted problems. Moreover, in regret minimization, there are algorithms whose regret bound only scales withT max , with no dependency onc min , even whenc min = 0 and no prior knowledge is available. It is thus reasonable to conjecture that the sample complexity should also scale withT max instead ofT z . This leads us to the rst question addressed in this chapter: Question: Is the dependency onT z =B ? =c min in the sample complexity of learning with a generative model unavoidable? Surprisingly, we derive a lower bound providing an armative answer to the question. In particular, we show that ( T z B 2 ? 
SA 2 ) samples are needed to learn an-optimal policy, showing that a dependency on T z is indeed unavoidable and that it is not possible to adapt to the optimal policy hitting timeT max † . This result also implies that there exist SSP instances withc min = 0 (i.e.,T z =1) that arenotlearnable. This shows for the rst time that not only SSP is a strict generalization of the nite-horizon and discounted settings, but it is also strictly harder to learn. We then derive lower bounds when prior knowledge of the formTT max is provided or when an optimality criterion restricted to policies with bounded hitting time is dened. Finally, we propose a simple algorithm based on a nite-horizon reduction argument and we prove upper bounds for its sample complexity matching the lower bound in each of the cases considered above; see Table 7.1. Notation We denote byT an upper bound ofT ? known to the learner, and letT =1 if such knowledge is unavailable. For simplicity, we often writea = ~ O(b) asa.b. 7.2 LowerBoundswithaGenerativeModel In this section, we derive lower bounds on PAC-SSP in various cases. ∗ Notice that in generalTmaxB?=cmin. † In our proof, we construct SSP instances whereTmax <T z . 79 Performance (gen model) Lower Bound Upper Bound Tarbouriech et al. (2021a) (;) Denition 1 min T z ;T B 2 ? SA 2 min T z ;T B 2 ? SA 2 T z B 2 ? SA 2 (;;T ) Denition 5 TB 2 ?;T SA 2 when minfT z ;Tg =1 minfT z ;Tg B 2 ?;T SA 2 TB 3 ?;T SA 3 Table 7.1: Result summary with a generative model. Here,T is a known upper bound on the hitting time of the optimal policy (T =1 when such a bound is unknown),T z = B? cmin andB ?;T is the maximum expected cost over all starting states of the restricted optimal policy with hitting time bounded byT . Operators ~ O() and () are hidden for simplicity. g s 0 p 0 ' 1+=B? T? p i ' 1 T z ::: Action 0 c 0 ' B? T? Actioni 1 c i ' B? T z s 0 g s 1 ag c = 1 2 a 0 ;ag;c = 1 a 0 ;c = 0 s 0 g s 1 ag c = 1 2 a 0 ;ag;c = 1 a 0 ;c = 0 inM + inM (a) (b) Figure 7.1: (a) hard instance (simplied for proof sketch) in Theorem 22 whenc min > 0. (b) hard instance in Theorem 22 whenc min = 0. Here,c represents the cost of an action, whilep represents the transition probability. 7.2.1 LowerBoundfor-optimality We rst establish the sample complexity lower bound of any (;)-correct learning algorithm when no prior knowledge is available. Theorem 22. For any S 3, A 3, c min > 0, B 2, T 0 maxfB; log A S + 1g, 2 (0; 1 32 ), and 2 (0; 1 2e 4 ) such that T 0 B=c min , there exists an MDP with S states, A actions, minimum cost c min , B ? = (B), and T max = (T 0 ), such that any (;)-correct algorithm has sample complexity T z B 2 ? SA 2 ln 1 . ‡ There also exists an MDP withc min = 0,T max = 1,T =1, andB ? = 1 in which every (;)-correct algorithm with2 (0; 1 2 ) and2 (0; 1 16 ) has innite sample complexity. Details are deferred to Appendix F.1.1. We rst remark that the lower bound qualitatively matches known PAC bounds for the discounted and nite-horizon settings in terms of its dependency on the size of ‡ Formally, for any n 0, we say that an algorithm has sample complexity (n) on an SSP instanceM ifPM(T n;b is-optimal inM) 1. 80 the state-action space and on the inverse of the squared accuracy. As for the dependency onB ? andc min , it can be conveniently split in two terms: 1) a termB 2 ? and2) a factorT z =B ? =c min . DependencyonB 2 ? This term is connected to the range of the optimal value functionV ? . 
Interestingly, in nite-horizon and discounted settingsH and 1=(1 ) bound the range of the value function ofany policy, whereas in SSP a more rened analysis is required to avoid dependencies on, e.g., max V , which can be unbounded whenever an improper policy exists. DependencyonT z WhileT z is an upper bound on the hitting time of the optimal policy, in the construc- tion of the lower boundT ? is strictly smaller thanT z . For the casec min > 0, this shows that the algorithm proposed by Tarbouriech et al. (2021a) has an optimal dependency inB ? andT z . On the other hand, this reveals that in certain SSP instances no algorithm can return an-optimal policy after collecting a nite number of samples. ThisistherstevidencethatlearninginSSPsisstrictlyharderthanthenite-horizon and discounted settings, where the sample complexity is always bounded. This is also in striking contrast with results in regret minimization in SSP, where the regret is bounded even forc min = 0 and no prior knowledge aboutB ? orT max is provided. This is due to the fact that the regret measures performance in the cost dimension and the algorithm is allowed to change policies within and across episodes. On the other hand, in learning with a generative model the performance is evaluated in terms of the number of samples needed to condently commit to a policy with performance-close to the optimal policy. This requires to distinguish between proper and improper policies, which can become arbitrarily hard in certain SSPs wherec min = 0. ProofSketch In order to provide more insights about our result, here we present the main idea of our hard instances construction. We consider two cases separately: 1)c min > 0 and 2)c min = 0. Whenc min > 0, our construction is a variant of that in (Mannor and Tsitsiklis, 2004, Theorem 1); see an illustration in 81 Figure 7.1 (a). Let’s consider an MDPM with a multi-arm bandit structure: it has a single states 0 andN +1 actionsA =f0; 1;:::;Ng (in the general case this corresponds toN + 1 state-action pairs). Taking action 0 incurs a cost B T 0 and transits to the goal state with probability 1+=2 T 0 (stays ins 0 otherwise), where = 32 B . For eachi2 [N], taking actioni incurs a cost B T 1 and transits to the goal state with probability 1 T 1 . Note that inM the optimal action (deterministic policy) is 0, withB ? = (B),T max = (T 0 ),T z = (T 1 ), whereas all other actions are more than suboptimal. Also note that it takes ( T 1 B 2 ? 2 ) samples to estimate the expected cost of actioni2 [N] with accuracy. If an algorithmA spendso( T 1 B 2 ? 2 ) samples on some action i 0 , then we can consider an alternative MDPM 0 , whose only dierence compared toM is that taking action i 0 transits to the goal state with probability 1+ T 1 . Note that inM 0 the only-optimal action isi 0 . However, algorithmA cannot distinguish betweenM andM 0 since it does not have enough samples on actioni 0 , and thus has a high probability on outputting the wrong action in eitherM orM 0 . Applying this argument to each armi2 [N], we conclude that an (;)-correct algorithm needs at least ( NT 1 B 2 ? 2 ) = ( T z B 2 ? SA 2 ) samples. We emphasize that in our construction,T max (whose proxy isT 0 ) can be arbitrarily smaller thanT z (whose proxy isT 1 ) inM. However, the learner still needs ( T z B 2 ? 2 ) samples to exclude the alternativeM 0 in whichT max =T z . A natural question to ask is what if we have prior knowledge onT max , which could potentially reduce the space of alternative MDPs. We answer this in Section 7.2.2. 
When $c_{\min} = 0$, we consider a much simpler MDP $M$ with two states $\{s_0, s_1\}$ and two actions $\{a_0, a_g\}$; see an illustration in Figure 7.1 (b). At $s_0$, taking $a_0$ transits to $s_0$ with cost $0$ and taking $a_g$ transits to $g$ with cost $1/2$. At $s_1$, both actions transit to $g$ with cost $1$. Clearly $c_{\min} = 0$, $B_\star = T_{\max} = 1$, $V^\star(s_0) = 1/2$, and both actions in $s_0$ are $\epsilon$-optimal in $M$. Now consider any algorithm $\mathcal{A}$ with sample complexity $n < \infty$ on $M$, and without loss of generality assume that $\mathcal{A}$ outputs a deterministic policy $\hat\pi$. We consider two cases: 1) $\hat\pi(s_0) = a_0$ and 2) $\hat\pi(s_0) = a_g$. In the first case, consider an alternative MDP $M^+$, whose only difference compared to $M$ is that taking $a_0$ at $s_0$ transits to $s_1$ with probability $1/n$, and to $s_0$ otherwise. Note that the optimal action at $s_0$ is $a_g$ in $M^+$. Since $\mathcal{A}$ uses at most $n$ samples, with high probability it never observes the transition $(s_0, a_0, s_1)$ and is unable to distinguish between $M$ and $M^+$. Thus, it still outputs $\hat\pi$ with $\hat\pi(s_0) = a_0$ in $M^+$. This gives $V^{\hat\pi}(s_0) - V^\star(s_0) = 1 - \frac{1}{2}$, and $\hat\pi$ is not $\epsilon$-optimal for any $\epsilon \in (0, \frac{1}{2})$. In the second case, consider another alternative MDP $M^-$, whose only difference compared to $M$ is that taking $a_0$ at $s_0$ transits to $g$ with probability $1/n$, and to $s_0$ otherwise. The optimal action at $s_0$ is $a_0$ in $M^-$. Again, algorithm $\mathcal{A}$ cannot distinguish between $M$ and $M^-$ and still outputs $\hat\pi$ with $\hat\pi(s_0) = a_g$ in $M^-$. This gives $V^{\hat\pi}(s_0) - V^\star(s_0) = \frac{1}{2}$, and $\hat\pi$ is not $\epsilon$-optimal for any $\epsilon \in (0, \frac{1}{2})$. Combining these two cases, we conclude that any $(\epsilon,\delta)$-correct algorithm with $\epsilon \in (0, \frac{1}{2})$ cannot have finite sample complexity.

Remark. Our construction reveals that the potentially infinite horizon in SSP does bring hardness into learning when $c_{\min} = 0$. Indeed, we can treat $M$ as an infinite-horizon MDP due to the presence of the self-loop at $s_0$. Any algorithm that uses a finite number of samples cannot identify all proper policies in $M$; that is, it can never be sure whether $(s_0, a_0)$ has non-zero probability of reaching states other than $s_0$.

7.2.2 Lower Bound for $\epsilon$-optimality with Prior Knowledge on $T_{\max}$

Now we consider the case where the learning algorithm has some prior knowledge $T \ge T_{\max}$ on the hitting time of the optimal proper policy. Intuitively, we expect the algorithm to exploit the knowledge of the parameter $T$ to focus on the set of policies $\{\pi : \|T^\pi\|_\infty \le T\}$ with bounded hitting time.§

§ Notice that $\{\pi : \|T^\pi\|_\infty \le T\}$ includes the optimal policy by definition since $T \ge T_{\max}$.

Theorem 23. For any $S \ge 3$, $A \ge 3$, $c_{\min} \ge 0$, $B \ge 2$, $T_0 \ge \max\{B, \log_A S + 1\}$, $T > 0$, $\epsilon \in (0, \frac{1}{32})$, and $\delta \in (0, \frac{1}{2e^4})$ such that $T_0 \le \min\{T/2, B/c_{\min}\} < \infty$, there exists an MDP with $S$ states, $A$ actions, minimum cost $c_{\min}$, $B_\star = \Theta(B)$, and $T_{\max} = \Theta(T_0) \le T$, such that any $(\epsilon,\delta)$-correct algorithm has sample complexity $\Omega\big(\min\{T_z, T\}\,\frac{B_\star^2 SA}{\epsilon^2}\ln\frac{1}{\delta}\big)$.

Details are deferred to Appendix F.1.1. The main idea of proving Theorem 23 still follows that of Theorem 22. Also note that the bound in Theorem 23 subsumes that of Theorem 22, since we let $T = \infty$ when such knowledge is unavailable.

Dependency on $\min\{T_z, T\}$. We distinguish two regimes: 1) When $T \le T_z$, the bound reduces to $\Omega(T B_\star^2 SA/\epsilon^2)$ with no dependency on $c_{\min}$. In this case, an algorithm may benefit from its prior knowledge to effectively prune any policy with hitting time larger than $T$, thus reducing the sample complexity of the problem and avoiding infinite sample complexity when $c_{\min} = 0$. 2) When $T > T_z$, we recover the bound $\Omega(T_z B_\star^2 SA/\epsilon^2)$. In this case, an algorithm does not pay the price of a loose upper bound on $T_{\max}$. Again, in our construction it is possible that $T_{\max} < \min\{T_z, T\}$. This concludes that it is impossible to adapt to $T_{\max}$ for computing $\epsilon$-optimal policies in SSPs.
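Before moving on, here is a small numerical sanity check of the two-state construction used above for $c_{\min} = 0$. The values below are the closed-form expected costs from $s_0$ implied by the construction; the sample budget $n$ only enters through the $1/n$ perturbation probability of $M^+$ and $M^-$.

```python
# Sanity check of the two-state c_min = 0 construction (Figure 7.1 (b)).
# M+ / M- perturb (s_0, a_0) with probability 1/n for the algorithm's finite budget n;
# the limiting values below do not depend on n, only on 1/n being positive.

V_ag = 0.5                # taking a_g at s_0: pay 1/2 and reach g (same in M, M+, M-)
V_a0 = {                  # expected total cost of always taking a_0 at s_0
    "M":  0.0,            # loops at s_0 forever with cost 0 (improper, zero cost)
    "M+": 1.0,            # eventually moves to s_1, then pays cost 1 to reach g
    "M-": 0.0,            # eventually reaches g directly without paying anything
}
V_star = {"M": 0.5, "M+": 0.5, "M-": 0.0}   # optimal proper-policy values at s_0

print("commit to a_0 but truth is M+ : gap =", V_a0["M+"] - V_star["M+"])   # 1/2
print("commit to a_g but truth is M- : gap =", V_ag - V_star["M-"])         # 1/2
# Either way the output policy is not eps-optimal for any eps < 1/2, while the
# 1/n perturbation is (w.h.p.) never observed within the n samples collected.
```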
7.2.3 Lower Bound for $(\epsilon,T)$-optimality

Knowing that we cannot solve for an $\epsilon$-optimal policy when $\min\{T_z, T\} = \infty$, that is, $c_{\min} = 0$ and $T = \infty$, we now consider a restricted optimality criterion where we only seek $\epsilon$-optimality w.r.t. a set of proper policies.

Definition 4 (Restricted $(\epsilon,T)$-Optimality). For any $T \ge 1$, we define the set $\Pi_T = \{\pi : \|T^\pi\|_\infty \le T\}$. Also define $\pi^\star_{T,s} = \mathrm{argmin}_{\pi \in \Pi_T} V^\pi(s)$, $V^{\star,T}(s) = V^{\pi^\star_{T,s}}(s)$, and $B_{\star,T} = \max_s V^{\star,T}(s)$. We say that a policy $\pi$ is $(\epsilon,T)$-optimal if $V^\pi(s) \le V^{\star,T}(s) + \epsilon$ for all $s \in S$. We define $V^{\star,T}(s) = \infty$ for all $s$ when $\Pi_T = \emptyset$.¶

¶ Tarbouriech et al. (2021a) consider a slightly different notion of restricted optimality, where they let $T$ be a multiple of $D$ (with the multiplicative factor, in $[1,\infty)$, given as input to the algorithm) and $D$ is unknown.

When $T \ge T_{\max}$, we have $\pi^\star_{T,s} = \pi^\star$ for all $s$. When $D \le T < T_{\max}$, the policy $\pi^\star_{T,s}$ exists and may vary for different starting states due to the hitting time constraint. It can even be stochastic, as is known from the literature on constrained MDPs (Altman, 1999).∥ When $T < D$, we have $\Pi_T = \emptyset$, and $V^{\star,T}(s) = \infty$ for all $s$. Clearly, $V^{\star,T}(s) \ge V^\star(s)$ for any $s$ and $T$.

∥ Consider an MDP with one state and two actions. Taking action 1 incurs cost $1$ and transits directly to the goal state. Taking action 2 incurs cost $0$ and transits to the goal state with probability $1/3$. Now consider $T = 2$. Then the optimal constrained policy takes action 2 with probability $3/4$.

Definition 5 ($(\epsilon,\delta,T)$-Correctness). Let $\tau$ be the random stopping time at which an algorithm terminates its interaction with the environment and returns a policy $\hat\pi$. We say that an algorithm is $(\epsilon,\delta,T)$-correct with sample complexity $n(M)$ if $\mathbb{P}_M\big(\tau \le n(M),\ \hat\pi \text{ is } (\epsilon,T)\text{-optimal in } M\big) \ge 1 - \delta$ for any SSP instance $M$, where $n(M)$ is a deterministic function of the characteristic parameters of the problem (e.g., number of states and actions, inverse of the accuracy $\epsilon$).

Note that being $(\epsilon,T)$-optimal does not require $\pi \in \Pi_T$. For example, $\pi^\star$ is $(\epsilon,T)$-optimal for any $T \ge 1$. Similarly, the policy output by an $(\epsilon,\delta,T)$-correct algorithm is not required to be in $\Pi_T$, and it is allowed to return a better cost-oriented policy. Now we establish a sample complexity lower bound for any $(\epsilon,\delta,T)$-correct algorithm when $\min\{T_z, T\} = \infty$ (see Appendix F.1.2 for details).

Theorem 24. For any $S \ge 6$, $A \ge 8$, $B_\star \ge 2$, $T \ge 6(\log_{A-1}(S/2) + 1)$, $B_T \ge 2$, $\epsilon \in (0, \frac{1}{32})$, and $\delta \in (0, \frac{1}{8e^4})$ such that $B_\star \le B_T \le B_\star (A-1)^{S/2-1}/4$ and $B_T \le T/6$, and for any $(\epsilon,\delta,T)$-correct algorithm, there exists an MDP with $B_{\star,T} = \Theta(B_T)$, $c_{\min} = 0$, and parameters $S$, $A$, $B_\star$, such that with a generative model, the algorithm has sample complexity $\Omega\big(\frac{T B_{\star,T}^2 SA}{\epsilon^2}\ln\frac{1}{\delta}\big)$.

Note that when $T \ge T_{\max}$, the lower bound reduces to $\Omega(T B_\star^2 SA/\epsilon^2)$, which coincides with that of Theorem 23. On the other hand, the sample complexity lower bound for computing an $(\epsilon,T)$-optimal policy when $\min\{T_z, T\} < \infty$ is still unknown, and it is an interesting open problem.

Proof Sketch. We consider an MDP $M$ with state space $S = S_T \cup S_\star$. The learner can reach the goal state either through states in $S_T$ or $S_\star$, where in the first case the learner aims at learning an $(\epsilon,T)$-optimal policy, and in the second case the learner aims at learning an $\epsilon$-optimal policy. In $S_T$, we follow the construction in Theorem 22 so that learning an $\epsilon$-optimal policy on the sub-MDP restricted to $S_T$ takes $\Omega(T B_{\star,T}^2 SA/\epsilon^2)$ samples. In $S_\star$, we consider a sub-MDP that forms a chain similar to (Strens, 2000, Figure 1), where the optimal policy suffers $B_\star$ cost but a bad policy could suffer $\Omega(B_\star A^S)$ cost. For each state $s$ in $S_\star$, we make the probability of transiting back to $s$ under any action large enough, so that a learner with sample complexity of order $\widetilde{O}(T B_{\star,T}^2 SA/\epsilon^2)$ hardly receives any learning signal in $S_\star$. Therefore, any algorithm with $\widetilde{O}(T B_{\star,T}^2 SA/\epsilon^2)$ sample complexity should focus on learning the sub-MDP restricted to $S_T$. This proves the statement.
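As a concrete check of the footnote example above (one state, two actions, $T = 2$), the following snippet recovers the optimal constrained mixing probability by a simple grid search. The closed-form expressions in the comments follow from the geometric hitting time of a stationary randomized policy; the grid search is only for illustration.

```python
# Footnote example: action 1 has cost 1 and reaches g w.p. 1; action 2 has cost 0
# and reaches g w.p. 1/3.  A stationary policy playing action 2 with probability q has
#   hitting time  1 / (1 - 2q/3)    and    expected cost  (1 - q) / (1 - 2q/3).
# Under the constraint T = 2, the best feasible q is 3/4, matching the footnote.

def hitting_time(q):
    return 1.0 / ((1 - q) + q / 3)

def expected_cost(q):
    return (1 - q) * hitting_time(q)

T = 2.0
feasible = [q / 1000 for q in range(1001) if hitting_time(q / 1000) <= T]
q_best = min(feasible, key=expected_cost)
print(q_best, expected_cost(q_best), hitting_time(q_best))   # 0.75  0.5  2.0
```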
7.3 Algorithm with a Generative Model

Algorithm 19 Search Horizon
Input: hitting time bound $T$ (set to the known upper bound on $T_{\max}$ when such prior knowledge is available), accuracy $\epsilon \in (0,1)$, and failure probability $\delta \in (0,1)$.
Initialize: $i \leftarrow 1$.
1: Let $B_i = 2^i$, $H_i = 4\min\{B_i/c_{\min}, T\}\ln(48B_i/\epsilon)$, $c_{f,i}(s) = 0.6B_i\,\mathbb{I}\{s \neq g\}$, $\delta_i = \delta/(40i^2)$, $N^\star_i = N^\star(B_i, H_i, \epsilon/2, \delta_i)$, and $N_i = \widehat{N}(B_i, H_i, 0.1B_i, \delta_i)$, where $N^\star$ and $\widehat{N}$ are defined in Lemma 106 and Lemma 107 respectively.
// Estimate $B_{\star,T}$
while True do
  Reset counter $N$, and then draw $N_i$ samples for each $(s,a)$ to update $N$.
2:  $\pi_i, V_i = \mathrm{LCBVI}(H_i, N, B_i, c_{f,i}, \delta_i)$ (refer to Algorithm 21).
3:  if $\|V_i^1\|_\infty \le 0.1B_i$ and $\max_{h\in[H_i+1]}\|V_i^h\|_\infty \le 0.7B_i$ then break.
  $i \leftarrow i + 1$.
4:  if $B_i > 40T$ then output an arbitrary policy $\hat\pi$. // i.e., $T < D$ (every policy is $(\epsilon,T)$-optimal)
// Compute $\epsilon$-optimal policy
Reset counter $N$, and then draw $N^\star_i$ samples for each $(s,a)$ to update $N$.
5: $\hat\pi, \widehat{V} = \mathrm{LCBVI}(H_i, N, B_i, c_{f,i}, \delta_i)$ (refer to Algorithm 21).
Output: policy $\hat\pi$ extended to the infinite horizon.

In this section, we present an algorithm whose sample complexity matches all the lower bounds introduced in Section 7.2. We notice that the horizon (or hitting time) of the optimal policy plays an important role in the lower bounds. Thus, a natural algorithmic idea is to explicitly determine and control the horizon of the output policy. This leads us to the idea of finite-horizon reduction, which is frequently applied in previous works on SSP (e.g., Chen, Jain, and Luo, 2021; Chen, Luo, and Wei, 2021b; Cohen et al., 2020).

Now we formally describe the finite-horizon reduction scheme. Given an SSP $M$, let $M_{H,c_f}$ be a time-homogeneous finite-horizon MDP with horizon $H$ and terminal cost $c_f \in [0,\infty)^{S^+}$, which has the same state space, action space, cost function, and transition function as $M$. When interacting with $M_{H,c_f}$, the learner starts in some initial state; at each stage $h$, it observes state $s_h$, takes action $a_h$, incurs cost $c(s_h, a_h)$, and transits to the next state $s_{h+1} \sim P_{s_h, a_h}$. It also suffers cost $c_f(s_{H+1})$ before ending the interaction. When the finite-horizon MDP $M_{H,c_f}$ is clear from the context, we define $V^\pi_h(s)$ as the expected cost of following policy $\pi$ starting from state $s$ and stage $h$ in $M_{H,c_f}$.

Although the finite-horizon reduction has become a common technique in regret minimization for SSP, it is not straightforward to apply it in our setting. Indeed, even if we solve for a near-optimal policy in the finite-horizon MDP, it is unclear how to apply the finite-horizon policy in SSP, where a trajectory may be much longer than $H$. Our key result is a lemma that resolves this. It turns out that when the terminal cost in the finite-horizon MDP is large enough, all we need to do to apply the finite-horizon policy to SSP is to repeat it periodically. Specifically, given a finite-horizon policy $\pi \in (\Delta_A)^{S\times[H]}$, we abuse notation and define $\pi \in (\Delta_A)^{S\times\mathbb{N}_+}$ as an infinite-horizon non-stationary policy such that $\pi(a|s, h+iH) = \pi(a|s,h)$ for all $i \in \mathbb{N}_+$. The following lemma relates the performance of $\pi$ in $M$ to its performance in $M_{H,c_f}$ (see Appendix F.2.1 for details).

Lemma 12. For any SSP $M$, horizon $H$, and terminal cost function $c_f$, suppose $\pi$ is a policy in $M_{H,c_f}$ and $V^\pi_1(s) \le c_f(s)$ for all $s \in S^+$. Then $V^\pi(s) \le V^\pi_1(s)$.
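The lemma can be checked numerically. The sketch below builds a small randomly generated SSP (our own toy instance, not one from the thesis), evaluates an arbitrary finite-horizon policy in $M_{H,c_f}$ by backward induction, evaluates the same policy repeated every $H$ steps in the original SSP, and verifies $V^\pi(s) \le V^\pi_1(s)$ under the premise $V^\pi_1 \le c_f$.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, H = 4, 3, 30
# Toy SSP: every (s, a) reaches g with probability at least 5%, so all policies
# are proper and both evaluations below are well defined.
p_goal = 0.05 + 0.3 * rng.random((S, A))
P = rng.random((S, A, S))
P *= (1 - p_goal)[:, :, None] / P.sum(axis=2, keepdims=True)   # rows sum to 1 - p_goal
c = 0.1 + rng.random((S, A))                                   # strictly positive costs
c_f = np.full(S, 50.0)                                         # large terminal cost
pi = rng.integers(A, size=(H, S))                              # arbitrary H-horizon policy
idx = np.arange(S)

# V[h] = V^pi_{h+1} in M_{H, c_f} (0-indexed stages); V[H] is the terminal cost.
V = np.zeros((H + 1, S)); V[H] = c_f
for h in range(H - 1, -1, -1):
    V[h] = c[idx, pi[h]] + P[idx, pi[h]] @ V[h + 1]

# Value of the periodic extension of pi in the SSP M: fixed point of the
# policy-evaluation operator on the cyclic space S x [H] (no terminal cost).
W = np.zeros((H, S))
for _ in range(5000):
    W = np.stack([c[idx, pi[h]] + P[idx, pi[h]] @ W[(h + 1) % H] for h in range(H)])

assert np.all(V[0] <= c_f), "premise of Lemma 12: V^pi_1 <= c_f"
print("V^pi_1 in M_{H,c_f}:", np.round(V[0], 3))
print("V^pi   in M        :", np.round(W[0], 3))
print("Lemma 12 holds here:", bool(np.all(W[0] <= V[0] + 1e-8)))
```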
Thanks to the lemma above, for a given horizon $T$, we can first learn an $\epsilon$-optimal policy $\hat\pi$ in $M_{H,c_f}$ with $H = \widetilde{O}(T)$ and $c_f(s) = O(B_{\star,T}\,\mathbb{I}\{s \neq g\})$, and then extend it to an SSP policy with performance $V^{\hat\pi}(s) \le V^{\hat\pi}_1(s)$, which is in turn within order $\epsilon$ of $V^\star_1(s)$ and hence of $V^{\star,T}(s)$, where $V^\star_1$ is the optimal value function at stage $1$ in $M_{H,c_f}$, and the last step is by the fact that $H$ is sufficiently large compared to $T$. As a result, $\hat\pi$ is then an $(\epsilon,T)$-optimal policy in the original SSP problem.

Algorithm 19 builds on this idea. It takes a hitting time upper bound $T$ as input and aims at computing an $(\epsilon,T)$-optimal policy. The main idea is to search the range of $B_{\star,T}$ and $\min\{T_z, T\}$ via a doubling trick on the estimators $B_i$ and $H_i$ (Line 1).∗∗ For each possible value of $B_i$ and $H_i$, we compute an optimal value function estimate with $0.1B_i$ accuracy using $SA\,N_i \lesssim S^2 A H_i$ samples (Line 2), and stop if $B_i$ becomes a proper upper bound on the estimated value function (Line 3). Here we need different conditions bounding $V_i^1$ and $V_i^h$, as the terminal cost $c_f$ should be negligible starting from stage $1$ but not from an arbitrary stage. Once we determine their range, we compute an $\epsilon$-optimal finite-horizon policy with the final values of $B_i$ and $H_i$ using $SA\,N^\star_i \lesssim H_i B_i^2 SA/\epsilon^2$ samples (Line 5). On the other hand, if $B_i$ becomes unreasonably large, then the algorithm claims that $T < D$ (Line 4), in which case $V^{\star,T}(s) = \infty$ for any $s$ (see Definition 4), and any policy is $(\epsilon,T)$-optimal by definition.

∗∗ Note that $\|T^{\pi^\star_{T,s}}\|_\infty \le T_z$ for any $T \ge D$ and state $s$, since $\pi^\star_{T,s} = \pi^\star$ when $T \ge T_z \ge T_{\max}$.

In the procedure described above, we need to repeatedly compute a near-optimal policy with various accuracies and horizons. We use a simple variant of the UCBVI algorithm (Azar, Osband, and Munos, 2017; Zhang, Ji, and Du, 2020) to achieve this (see Algorithm 21 in Appendix F.2.2). The main idea is to compute an optimistic value function estimate by incorporating a Bernstein-style bonus (Line 1). We state the guarantee of Algorithm 19 in the following theorem (see Appendix F.2.3 for details).

Theorem 25. For any given $T \ge 1$, $\epsilon \in (0,1)$, and $\delta \in (0,1)$, with probability at least $1-\delta$, Algorithm 19 either uses $\widetilde{O}(S^2 A T)$ samples to confirm that $T < D$, or uses $\widetilde{O}\big(\min\{T_z, T\}\,\frac{B_{\star,T}^2 SA}{\epsilon^2}\big)$ samples to output an $(\epsilon,T)$-optimal policy (ignoring lower-order terms).

When a finite prior upper bound on $T_{\max}$ is available, we simply set $T$ to that bound. In this case, $(\epsilon,T)$-optimality is equivalent to $\epsilon$-optimality, $B_{\star,T} = B_\star$, and Algorithm 19 matches the lower bound in Theorem 23. When $\min\{T_z, T\} = \infty$, Algorithm 19 computes an $(\epsilon,T)$-optimal policy with $\widetilde{O}(T B_{\star,T}^2 SA/\epsilon^2)$ samples, which matches the lower bound in Theorem 24. Thus, our algorithm is minimax optimal in all cases considered in Section 7.2. Compared with the results in (Tarbouriech et al., 2021a), in terms of computing an $\epsilon$-optimal policy we remove an extra multiplicative factor and improve the dependency on $T_z$ to $\min\{T_z, T\}$; that is, our algorithm is able to leverage a given bound on $T_{\max}$ to improve sample efficiency while theirs cannot. In terms of computing an $(\epsilon,T)$-optimal policy, we greatly improve over their result by removing a factor of order $B_{\star,T}$ and improving the dependency on $T$ to $\min\{T_z, T\}$; that is, our algorithm automatically adapts to a smaller hitting time upper bound of the optimal policy. Finally, it is interesting to notice that even though the $(\epsilon,T)$-optimal policy is possibly stochastic, the policy output by Algorithm 19 is always deterministic, and it does not necessarily have hitting time bounded by $T$. In fact, Algorithm 19 puts no constraint on the hitting time of the output policy, except that the horizon for the reduction is $\widetilde{O}(T)$. Nevertheless, as shown in Theorem 25, we can still prove that the policy is $(\epsilon,T)$-optimal, since the requirement only evaluates the expected cost and does not constrain the hitting time.
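For illustration, here is a minimal sketch of finite-horizon optimistic value iteration with a Bernstein-style bonus, in the spirit of the LCBVI subroutine mentioned above. It is a generic illustration under our own simplifications, not the thesis's Algorithm 21: the bonus constants, the clipping threshold $B$, and the input format (raw counts from a generative model) are all placeholders.

```python
import numpy as np

def optimistic_vi(N_sas, C_sum, H, B, c_f, delta):
    """Optimistic (lower-confidence) backward induction for cost minimization.
    N_sas[s, a, s']: transition counts from the generative model.
    C_sum[s, a]:     sum of sampled costs for (s, a).
    Returns a greedy policy pi[h, s] and value estimates V[h, s] for h = 1..H."""
    S, A, _ = N_sas.shape
    n = np.maximum(N_sas.sum(axis=2), 1)                  # visit counts (at least 1)
    P_hat = N_sas / n[:, :, None]                         # empirical transitions
    c_hat = C_sum / n                                     # empirical mean costs
    L = np.log(2 * S * A * H / delta)                     # log confidence term

    V = np.zeros((H + 2, S)); V[H + 1] = c_f              # stage H+1 pays the terminal cost
    pi = np.zeros((H + 2, S), dtype=int)
    for h in range(H, 0, -1):
        EV = P_hat @ V[h + 1]                             # (S, A) expected next-stage value
        var = np.maximum(P_hat @ V[h + 1] ** 2 - EV ** 2, 0.0)
        bonus = np.sqrt(2 * var * L / n) + 5 * B * L / n  # Bernstein-style bonus
        Q = np.clip(c_hat + EV - bonus, 0.0, B)           # subtract bonus: optimism for costs
        pi[h] = Q.argmin(axis=1)
        V[h] = Q.min(axis=1)
    return pi[1:H + 1], V[1:H + 1]
```

An outer loop in the style of Algorithm 19 would call such a routine with doubled values of $B_i$ and $H_i$ until the stopping condition on the estimated values is met.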
7.4 Open Problems

In this chapter, we study the sample complexity of the SSP problem. We provide an almost complete characterization of the minimax sample complexity with a generative model. In particular, we show that an $\epsilon$-optimal policy may not be learnable in SSP even with a generative model. We complement the study of sample complexity with lower bounds for the learnable settings and matching upper bounds. Many interesting problems remain open, such as the minimax optimal sample complexity of computing an $(\epsilon,T)$-optimal policy when $\min\{T_z, T\} < \infty$. It is also interesting to study whether learning a stationary policy with minimax optimal sample complexity is possible under various learnable settings.

References

Abbasi-Yadkori, Yasin, Dávid Pál, and Csaba Szepesvári (2011). “Improved algorithms for linear stochastic bandits”. In: Advances in Neural Information Processing Systems 24, pp. 2312–2320.
Agarwal, Alekh, Haipeng Luo, Behnam Neyshabur, and Robert E Schapire (2017). “Corralling a Band of Bandit Algorithms”. In: Conference on Learning Theory.
Agarwal, Alekh, Sham M Kakade, Jason D Lee, and Gaurav Mahajan (2021). “On the theory of policy gradient methods: Optimality, approximation, and distribution shift”. In: Journal of Machine Learning Research 22.98, pp. 1–76.
Akrour, Riad, Marc Schoenauer, Michèle Sebag, and Jean-Christophe Souplet (2014). “Programming by feedback”. In: International Conference on Machine Learning. 32. JMLR.org, pp. 1503–1511.
Altman, Eitan (1999). Constrained Markov decision processes. Vol. 7. CRC Press.
Ayoub, Alex, Zeyu Jia, Csaba Szepesvari, Mengdi Wang, and Lin Yang (2020). “Model-based reinforcement learning with value-targeted regression”. In: International Conference on Machine Learning. PMLR, pp. 463–474.
Azar, Mohammad Gheshlaghi, Ian Osband, and Rémi Munos (2017). “Minimax regret bounds for reinforcement learning”. In: International Conference on Machine Learning. PMLR, pp. 263–272.
Berner, Christopher, Greg Brockman, Brooke Chan, Vicki Cheung, Przemysław Dębiak, Christy Dennison, David Farhi, Quirin Fischer, Shariq Hashme, Chris Hesse, et al. (2019). “Dota 2 with large scale deep reinforcement learning”. In: arXiv preprint arXiv:1912.06680.
Bertsekas, Dimitri P and John N Tsitsiklis (1991). “An analysis of stochastic shortest path problems”. In: Mathematics of Operations Research 16.3, pp. 580–595.
Bertsekas, Dimitri P and Huizhen Yu (2013). “Stochastic shortest path problems under weak conditions”. In: Lab. for Information and Decision Systems Report LIDS-P-2909, MIT.
Beygelzimer, Alina, John Langford, Lihong Li, Lev Reyzin, and Robert Schapire (2011). “Contextual bandit algorithms with supervised learning guarantees”. In: International Conference on Artificial Intelligence and Statistics.
Bubeck, Sébastien, Yuanzhi Li, Haipeng Luo, and Chen-Yu Wei (2019). “Improved Path-length Regret Bounds for Bandits”. In: Conference On Learning Theory.
Cai, Qi, Zhuoran Yang, Chi Jin, and Zhaoran Wang (2020). “Provably efficient exploration in policy optimization”. In: International Conference on Machine Learning. PMLR, pp. 1283–1294.
Chen, Liyu, Rahul Jain, and Haipeng Luo (2021).
“Improved No-Regret Algorithms for Stochastic Shortest Path with Linear MDP”. In: arXiv preprint arXiv:2112.09859. Chen, Liyu and Haipeng Luo (2021). “Finding the stochastic shortest path with low regret: The adversarial cost and unknown transition case”. In: International Conference on Machine Learning. — (2022). “Near-Optimal Goal-Oriented Reinforcement Learning in Non-Stationary Environments”. In: arXiv preprint arXiv:2205.13044. Chen, Liyu, Haipeng Luo, and Aviv Rosenberg (2022). “Policy Optimization for Stochastic Shortest Path”. In: arXiv preprint arXiv:2202.03334. Chen, Liyu, Haipeng Luo, and Chen-Yu Wei (2021a). “Impossible tuning made possible: A new expert algorithm and its applications”. In: Conference on Learning Theory. PMLR, pp. 1216–1259. — (2021b). “Minimax regret for stochastic shortest path with adversarial costs and known transition”. In: Conference on Learning Theory. PMLR. Chen, Liyu, Mehdi Jafarnia-Jahromi, Rahul Jain, and Haipeng Luo (2021). “Implicit Finite-Horizon Approximation and Ecient Optimal Algorithms for Stochastic Shortest Path”. In: Advances in Neural Information Processing Systems. Chen, Liyu, Andrea Tirinzoni, Matteo Pirotta, and Alessandro Lazaric (2022). “Reaching Goals is Hard: Settling the Sample Complexity of the Stochastic Shortest Path”. In: arXiv preprint arXiv:2210.04946. Christiano, Paul F, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei (2017). “Deep reinforcement learning from human preferences”. In: Advances in neural information processing systems 30. Cohen, Alon, Haim Kaplan, Yishay Mansour, and Aviv Rosenberg (2020). “Near-optimal Regret Bounds for Stochastic Shortest Path”. In: Proceedings of the 37th International Conference on Machine Learning. Vol. 119. PMLR, pp. 8210–8219. Cohen, Alon, Yonathan Efroni, Yishay Mansour, and Aviv Rosenberg (2021). “Minimax Regret for Stochastic Shortest Path”. In: Advances in Neural Information Processing Systems. Efroni, Yonathan, Nadav Merlis, Mohammad Ghavamzadeh, and Shie Mannor (2019). “Tight Regret Bounds for Model-Based Reinforcement Learning with Greedy Policies”. In: Advances in Neural Information Processing Systems. Vol. 32. Curran Associates, Inc. Efroni, Yonathan, Nadav Merlis, Aadirupa Saha, and Shie Mannor (2021). “Condence-Budget Matching for Sequential Budgeted Learning”. In: International Conference on Machine Learning. 92 Fawzi, Alhussein, Matej Balog, Aja Huang, Thomas Hubert, Bernardino Romera-Paredes, Mohammadamin Barekatain, Alexander Novikov, Francisco J R Ruiz, Julian Schrittwieser, Grzegorz Swirszcz, et al. (2022). “Discovering faster matrix multiplication algorithms with reinforcement learning”. In: Nature 610.7930, pp. 47–53. Fei, Yingjie, Zhuoran Yang, Zhaoran Wang, and Qiaomin Xie (2020). “Dynamic regret of policy optimization in non-stationary environments”. In: Advances in Neural Information Processing Systems 33, pp. 6743–6754. Florensa, Carlos, David Held, Xinyang Geng, and Pieter Abbeel (2018). “Automatic goal generation for reinforcement learning agents”. In: International conference on machine learning. PMLR, pp. 1515–1528. Foster, Dylan J., Akshay Krishnamurthy, and Haipeng Luo (2019). “Model Selection for Contextual Bandits”. In: Advances in Neural Information Processing Systems. Foster, Dylan J, Zhiyuan Li, Thodoris Lykouris, Karthik Sridharan, and Eva Tardos (2016). “Learning in Games: Robustness of Fast Convergence”. In: Advances in Neural Information Processing Systems. Gerchinovitz, Sébastien and Tor Lattimore (2016). 
“Rened lower bounds for adversarial bandits”. In: Advances in Neural Information Processing Systems 29. Gupta, Abhishek, Aldo Pacchiano, Yuexiang Zhai, Sham M Kakade, and Sergey Levine (2022). “Unpacking reward shaping: Understanding the benets of reward engineering on sample complexity”. In: arXiv preprint arXiv:2210.09579. Hazan, Elad (2019). “Introduction to online convex optimization”. In: arXiv preprint arXiv:1909.05207. Jaksch, Thomas, Ronald Ortner, and Peter Auer (2010). “Near-optimal Regret Bounds for Reinforcement Learning.” In: Journal of Machine Learning Research 11.4. Jin, Chi, Zeyuan Allen-Zhu, Sebastien Bubeck, and Michael I Jordan (2018). “Is Q-learning provably ecient?” In: Advances in Neural Information Processing Systems, pp. 4863–4873. Jin, Chi, Tiancheng Jin, Haipeng Luo, Suvrit Sra, and Tiancheng Yu (2020a). “Learning Adversarial Markov Decision Processes with Bandit Feedback and Unknown Transition”. In: Proceedings of the 37th International Conference on Machine Learning, pp. 4860–4869. Jin, Chi, Zhuoran Yang, Zhaoran Wang, and Michael I Jordan (2020b). “Provably ecient reinforcement learning with linear function approximation”. In: Conference on Learning Theory. PMLR. Kakade, Sham M (2001). “A natural policy gradient”. In: Advances in neural information processing systems 14. Kotłowski, Wojciech and Gergely Neu (2019). “Bandit principal component analysis”. In: Conference On Learning Theory. Lattimore, Tor and Csaba Szepesvári (2020). Bandit algorithms. Cambridge University Press. 93 Lee, Chung-Wei, Haipeng Luo, and Mengxiao Zhang (2020). “A Closer Look at Small-loss Bounds for Bandits with Graph Feedback”. In: Conference on Learning Theory. Lee, Chung-Wei, Haipeng Luo, Chen-Yu Wei, and Mengxiao Zhang (2020). “Bias no more: high-probability data-dependent regret bounds for adversarial bandits and MDPs”. In: Advances in Neural Information Processing Systems. Vol. 33. Curran Associates, Inc., pp. 15522–15533. Liu, Boyi, Qi Cai, Zhuoran Yang, and Zhaoran Wang (2019). “Neural trust region/proximal policy optimization attains globally optimal policy”. In: Advances in neural information processing systems 32. Liu, Minghuan, Menghui Zhu, and Weinan Zhang (2022). “Goal-conditioned reinforcement learning: Problems and solutions”. In: arXiv preprint arXiv:2201.08299. Luo, Haipeng, Chen-Yu Wei, and Chung-Wei Lee (2021). “Policy Optimization in Adversarial MDPs: Improved Exploration via Dilated Bonuses”. In: Advances in Neural Information Processing Systems 34. Luo, Haipeng, Chen-Yu Wei, and Kai Zheng (2018). “Ecient online portfolio with logarithmic regret”. In: Advances in Neural Information Processing Systems. Mannor, Shie and John N Tsitsiklis (2004). “The sample complexity of exploration in the multi-armed bandit problem”. In: Journal of Machine Learning Research 5.Jun, pp. 623–648. Mao, Weichao, Kaiqing Zhang, Ruihao Zhu, David Simchi-Levi, and Tamer Başar (2020). “Model-Free Non-Stationary RL: Near-Optimal Regret and Applications in Multi-Agent RL and Inventory Control”. In: arXiv preprint arXiv:2010.03161. Ménard, Pierre, Omar Darwiche Domingues, Xuedong Shang, and Michal Valko (2021). “UCB Momentum Q-learning: Correcting the bias without forgetting”. In: International Conference on Machine Learning. PMLR, pp. 7609–7618. Min, Yifei, Jiafan He, Tianhao Wang, and Quanquan Gu (2021). “Learning Stochastic Shortest Path with Linear Function Approximation”. In: arXiv preprint arXiv:2110.12727. Ng, Andrew Y, Daishi Harada, and Stuart Russell (1999). 
“Policy invariance under reward transformations: Theory and application to reward shaping”. In: Icml. Vol. 99, pp. 278–287. Rosenberg, Aviv and Yishay Mansour (2021). “Stochastic Shortest Path with Adversarially Changing Costs”. In: Proceedings of the Thirtieth International Joint Conference on Articial Intelligence. Schulman, John, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz (2015). “Trust region policy optimization”. In: International conference on machine learning. PMLR, pp. 1889–1897. Shani, Lior, Yonathan Efroni, Aviv Rosenberg, and Shie Mannor (2020). “Optimistic Policy Optimization with Bandit Feedback”. In: Proceedings of the 37th International Conference on Machine Learning, pp. 8604–8613. Silver, David, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. (2016). “Mastering the game of Go with deep neural networks and tree search”. In: nature 529.7587, pp. 484–489. 94 Strens, Malcolm (2000). “A Bayesian framework for reinforcement learning”. In: ICML. Vol. 2000, pp. 943–950. Tarbouriech, Jean, Evrard Garcelon, Michal Valko, Matteo Pirotta, and Alessandro Lazaric (2020). “No-regret exploration in goal-oriented reinforcement learning”. In: International Conference on Machine Learning. PMLR, pp. 9428–9437. Tarbouriech, Jean, Matteo Pirotta, Michal Valko, and Alessandro Lazaric (2021a). “Sample Complexity Bounds for Stochastic Shortest Path with a Generative Model”. In: Algorithmic Learning Theory. Tarbouriech, Jean, Runlong Zhou, Simon S Du, Matteo Pirotta, Michal Valko, and Alessandro Lazaric (2021b). “Stochastic Shortest Path: Minimax, Parameter-Free and Towards Horizon-Free Regret”. In: Advances in Neural Information Processing Systems. Vial, Daniel, Advait Parulekar, Sanjay Shakkottai, and R Srikant (2021). “Regret Bounds for Stochastic Shortest Path Problems with Linear Function Approximation”. In: arXiv preprint arXiv:2105.01593. Vinyals, Oriol, Igor Babuschkin, Wojciech M Czarnecki, Michaël Mathieu, Andrew Dudzik, Junyoung Chung, David H Choi, Richard Powell, Timo Ewalds, Petko Georgiev, et al. (2019). “Grandmaster level in StarCraft II using multi-agent reinforcement learning”. In: Nature 575.7782, pp. 350–354. Wang, Lingxiao, Qi Cai, Zhuoran Yang, and Zhaoran Wang (2020). “Neural policy gradient methods: Global optimality and rates of convergence”. In: International Conference on Learning Representations (ICLR). Wei, Chen-Yu and Haipeng Luo (2018). “More adaptive algorithms for adversarial bandits”. In: Conference On Learning Theory. PMLR, pp. 1263–1291. — (2021). “Non-stationary reinforcement learning without prior knowledge: An optimal black-box approach”. In: Conference on Learning Theory. PMLR, pp. 4300–4354. Wei, Chen-Yu, Mehdi Jafarnia Jahromi, Haipeng Luo, Hiteshi Sharma, and Rahul Jain (2020). “Model-free reinforcement learning in innite-horizon average-reward Markov decision processes”. In: International Conference on Machine Learning. PMLR, pp. 10170–10180. Wei, Chen-Yu, Mehdi Jafarnia Jahromi, Haipeng Luo, and Rahul Jain (2021). “Learning innite-horizon average-reward MDPs with linear function approximation”. In: International Conference on Articial Intelligence and Statistics. PMLR, pp. 3007–3015. White, Martha (2017). “Unifying task specication in reinforcement learning”. In: International Conference on Machine Learning. PMLR, pp. 3742–3750. Williams, Ronald J (1992). 
“Simple statistical gradient-following algorithms for connectionist reinforcement learning”. In: Machine learning 8.3, pp. 229–256. Wu, Tianhao, Yunchang Yang, Han Zhong, Liwei Wang, Simon S Du, and Jiantao Jiao (2021). “Nearly Optimal Policy Optimization with Stable at Any Time Guarantee”. In: arXiv preprint arXiv:2112.10935. 95 Yang, Lin and Mengdi Wang (2020). “Reinforcement learning in feature space: Matrix bandit, kernels, and regret bound”. In: International Conference on Machine Learning. PMLR, pp. 10746–10756. Yin, Ming and Yu-Xiang Wang (2021). “Towards instance-optimal oine reinforcement learning with pessimism”. In: Advances in neural information processing systems 34, pp. 4065–4078. Yu, Huizhen and Dimitri P Bertsekas (2013). “On boundedness of Q-learning iterates for stochastic shortest path problems”. In: Mathematics of Operations Research 38.2, pp. 209–227. Zanette, Andrea, David Brandfonbrener, Emma Brunskill, Matteo Pirotta, and Alessandro Lazaric (2020a). “Frequentist regret bounds for randomized least-squares value iteration”. In: International Conference on Articial Intelligence and Statistics. PMLR, pp. 1954–1964. Zanette, Andrea, Alessandro Lazaric, Mykel Kochenderfer, and Emma Brunskill (2020b). “Learning near optimal policies with low inherent bellman error”. In: International Conference on Machine Learning. PMLR, pp. 10978–10989. Zhang, Zihan, Xiangyang Ji, and Simon S Du (2020). “Is reinforcement learning more dicult than bandits? a near-optimal algorithm escaping the curse of horizon”. In: Conference On Learning Theory. Zhang, Zihan, Yuan Zhou, and Xiangyang Ji (2020). “Almost Optimal Model-Free Reinforcement Learning via Reference-Advantage Decomposition”. In: Advances in Neural Information Processing Systems. Vol. 33. Curran Associates, Inc., pp. 15198–15207. Zhou, Dongruo, Quanquan Gu, and Csaba Szepesvari (2021). “Nearly minimax optimal reinforcement learning for linear mixture markov decision processes”. In: Conference on Learning Theory. PMLR, pp. 4532–4576. Zhou, Dongruo, Jiafan He, and Quanquan Gu (2021). “Provably ecient reinforcement learning for discounted mdps with feature mapping”. In: International Conference on Machine Learning. PMLR, pp. 12793–12802. Zimin, Alexander and Gergely Neu (2013). “Online learning in episodic Markovian decision processes by relative entropy policy search”. In: Advances in Neural Information Processing Systems, pp. 1583–1591. 96 AppendixA OmittedDetailsinChapter2 A.1 PreliminariesfortheAppendix ExtraNotationinAppendix For conciseness, throughout the appendix, we use the following notational shorthands: • I s (s 0 ) =Ifs =s 0 g; • P t =P st;at ; • for a functionf t : !R, we often abuse the notation and usef t to denotef t (s t ;a t ) when there is no confusion from the context; in fact, in Lemma 18 and Lemma 26, we also usef t to denotef t (s;a) for a particular (s;a) pair; •V H =f(Q ? ;V ? )g[f(Q ? h ;V ? h1 )g H h=1 . Note that for any ( Q; V )2V H , we have Q(s;a) = c(s;a) +P s;a V , V (s)2 [0;B ? ], V (g) = 0 and V (s) min a Q(s;a). Throughout this part of Appendix, ~ O () also hides dependence on lnT whereT is a random variable but can be bounded by C K c min under strictly positive costs. Truncating the Interaction An important question in SSP is whether the algorithm halts in a nite number of steps. To implicitly show this, we do the following trick throughout the analysis. Fix any positive 97 integerT 0 and explicitly stop the algorithm afterT 0 steps. 
Our analysis will show that in this case the regret R K is bounded by something independent ofT 0 , which then allows us to takeT 0 to innity and recover the original setting while maintaining the same bound. This also implicitly shows that the algorithm must halt in a nite number of steps. A.2 OmittedDetailsforSection2.2 In this section, we provide omitted details and proofs for Section 2.2. We rst introduce the class of nite horizon MDPs used in the approximation: given an SSP modelM = (S;s init ;g;A;c;P ), we consider the costs of interacting with f M for at mostH steps and then directly teleporting to the goal state. Specically, we dene a nite-horizon SSP f M = ( e S;e s init ;g;A;e c; e P ) as follows: • e S =S [H];e s init = (s init ; 1) and the goal stateg remains the same; • transition from (s;h) to (s 0 ;h 0 ) is only possible whenh 0 = h + 1, and the transition follows the original MDP: e P ((s 0 ;h + 1)j(s;h);a) =P (s 0 js;a) forh2 [H 1] and e P (gj(s;H);a) = 1; • mean cost function also follows the original MDP:e c((s;h);a) =c(s;a). We also deneQ ? 0 (s;a) = V ? 0 (s) = 0;Q ? h (s;a) = e Q ? ((s;Hh + 1);a);V ? h (s) = e V ? (s;Hh + 1) for h2 [H], where e Q ? and e V ? are optimal state-action and state value functions in f M. Then, it is straightforward to verify thatQ ? h andV ? h satisfy Eq. (2.1). SinceM is equivalent to f M withH =1, intuitively we should haveQ ? (s;a)Q ? H (s;a) for a suciently largeH. The formal statement, shown in Lemma 1, is proven below: Proof of Lemma 1. By denitionQ ? h (s;a) Q ? (s;a) holds for all (s;a)2 andh2 [H], since f M is a truncated version ofM. Therefore,V ? h (s)B ? holds, and the expected hitting time (the number of steps needed to reach the goal) of the optimal policy in f M starting from any (s;h) is upper bounded by B? c min . By 98 (Rosenberg and Mansour, 2021, Lemma 6), whenh 4B? c min ln 2 , the probability of not reachingg inh steps is at most. Denote bye ? L the optimal policy of f M, and ? L a non-stationary policy inM which follows e ? L for the rstH steps, and then follows ? afterwards. We have for anys2S;V ? (s)V ? H1 (s) V ? L (s)V e ? L H1 (s)B ? , where we applyH 4B? c min ln 2 +1;V ? (s)V e ? L (s) andV ? H1 (s) =V e ? L H1 (s). Finally,Q ? (s;a)Q ? H (s;a) =P s;a (V ? V ? H1 )B ? . Lemma13. With probability at least 1 2, P T t=1 c t c(s t ;a t ) = ~ O p C K . Proof. By Eq. (G.5) of Lemma 125,kck 1 2 [0; 1], and Lemma 127 with = 1, with probability at least 1 2: T X t=1 c t c(s t ;a t ) = ~ O 0 @ v u u t T X t=1 E[c 2 t ] 1 A = ~ O 0 @ v u u t T X t=1 c(s t ;a t ) 1 A = ~ O p C K : The next lemma is used in the proof of Theorem 1, which shows that the sum of the variances of the optimal value function is of order ~ O(B ? C K ). It is also useful in bounding the overhead of Bernstein-style condence interval (see Lemma 20 and Cohen et al. (2020, Lemma 4.7) for example). Lemma14. With probability at least 1 2, P T t=1 V(P st;at ;V ? ) = ~ O B 2 ? +B ? C K . 99 Proof. Note that: T X t=1 V(P st;at ;V ? ) = T X t=1 P st;at (V ? ) 2 (P st;at V ? ) 2 = K X k=1 I k X i=1 P s k i ;a k i (V ? ) 2 V ? (s k i ) 2 + K X k=1 I k X i=1 V ? (s k i ) 2 (P s k i ;a k i V ? ) 2 K X k=1 I k X i=1 P s k i ;a k i (V ? ) 2 V ? (s k i+1 ) 2 + K X k=1 I k X i=1 Q ? (s k i ;a k i ) 2 (P s k i ;a k i V ? ) 2 : (V ? (s k I k +1 ) = 0 andV ? (s k i )Q ? (s k i ;a k i )) For the rst term, by Eq. (G.5) of Lemma 125 withV ? (s)B ? and Lemma 115 withX =V ? (S 0 );S 0 P st;at , we have with probability at least 1, K X k=1 I k X i=1 P s k i ;a k i (V ? 
) 2 V ? (s k i+1 ) 2 = ~ O 0 @ v u u t T X t=1 V(P st;at ; (V ? ) 2 ) +B 2 ? 1 A = ~ O 0 @ B ? v u u t T X t=1 V(P st;at ;V ? ) +B 2 ? 1 A : For the second term, note that: K X k=1 I k X i=1 Q ? (s k i ;a k i ) 2 (P s k i ;a k i V ? ) 2 = K X k=1 I k X i=1 Q ? (s k i ;a k i )P s k i ;a k i V ? Q ? (s k i ;a k i ) +P s k i ;a k i V ? K X k=1 I k X i=1 3B ? c(s k i ;a k i ): (Q ? (s;a) 2B ? andV ? (s)B ? for any (s;a)2 ) 100 Therefore, P T t=1 V(P st;at ;V ? ) = ~ O B ? q P T t=1 V(P st;at ;V ? ) +B 2 ? +B ? P K k=1 P I k i=1 c(s k i ;a k i ) . By Lemma 110 withx = P T t=1 V(P st;at ;V ? ) and Lemma 127, we have with probability at least 1, T X t=1 V(P st;at ;V ? ) = ~ O B 2 ? +B ? K X k=1 I k X i=1 c(s k i ;a k i ) ! = ~ O B 2 ? +B ? C K : A.3 OmittedDetailsforSection2.3 Before we present the proof of Theorem 3 (Section A.3.3), we rst quantify the sample complexity of the reference value function (Section A.3.1) and prove the two required properties (Section A.3.2). ExtraNotation Denote byQ t (s;a),V t (s),V ref t (s),B t ,N t (s;a) the value ofQ(s;a),V (s),V ref (s),B, N(s;a) at the beginning of time stept. DeneN t (s) = P a N t (s;a). Denote byn t (s;a),m t (s;a),b t (s;a), b 0 t (s;a), t (s;a),b c t (s;a) the value ofn;m;b;b 0 ;;b c used in computingQ t (s;a). Note that, these are not necessarily their values at time stept. For example,n t (s;a) is the number of visits to (s;a) before the current stage (not before timet);m t (s;a) the number of visits to (s;a) in the last stage;b t (s;a) andb 0 t (s;a) are the bonuses used in the last update ofQ t (s;a); andb c t (s;a) is the cost estimator used in the last update ofQ t (s;a) (b t (s;a),b 0 t (s;a) andb c t (s;a) are 0 whenn t (s;a) = 0). Denote byl t;i (s;a) thei-th time step the agent visits (s;a) among thosen t (s;a) steps before the current stage, and by l t;i (s;a) thei-th time step the agent visits (s;a) among thosem t (s;a) steps within the last stage. With these notation, we have by the update rule of the algorithm: Q t (s;a) = max ( Q t1 (s;a); b c t (s;a) + 1 m t mt X i=1 V l t;i (s 0 l t;i )b 0 t ; b c t (s;a) + 1 n t nt X i=1 V ref l t;i (s 0 l t;i ) + 1 m t mt X i=1 (V l t;i (s 0 l t;i )V ref l t;i (s 0 l t;i ))b t ) ; (A.1) 101 wherem t representsm t (s;a), l t;i represents l t;i (s;a), and similarly forn t ,l t;i ,b t andb 0 t . We also dene two empirical variances at time stept as: t = 1 m t mt X i=1 (V l t;i (s 0 l t;i )V ref l t;i (s 0 l t;i )) 2 1 m t mt X i=1 V l t;i (s 0 l t;i )V ref l t;i (s 0 l t;i ) ! 2 (A.2) and ref t = 1 n t nt X i=1 V ref l t;i (s 0 l t;i ) 2 1 n t nt X i=1 V ref l t;i (s 0 l t;i ) ! 2 : (A.3) Here, t and ref t should be treated as a function of state-action pair (s;a), so thatm t ,n t , l t;i , andl t;i in the formulas all representm t (s;a),n t (s;a), l t;i (s;a), andl t;i (s;a). Except for Lemma 18, this input (s;a) is simply (s t ;a t ). Further dene" t =Ifn t > 0g =Ifm t > 0g, and 0=0 to be 0 so that formula in the form 1 nt P nt i=1 X l t;i is treated as 0 ifn t = 0 (similarly form t ). A.3.1 SampleComplexityforReferenceValueFunction In this section, we assumeH =d 4B? c min ln( 2 ) + 1e 2 for some > 0 (the form used in Theorem 2). We show that to obtain a reference value with precision 2B ? at states (that is,jV ref (s)V ? (s)j), ~ O B 2 ? H 3 SA 2 number of visits to states is sucient (Corollary 26). Moreover, the total costs appeared in regret for a reference value function with maximum precision is ~ O B 2 ? H 3 S 2 A (Lemma 17). 
Note that if we only update the reference value function once as in (Zhang, Zhou, and Ji, 2020), instead of applying our “smoother” update, the total costs become ~ O B 2 ? H 3 S 2 A 2 . Lemma15. With probability at least 1 8, Algorithm 5 ensures for any non-negative weightsfw t g T t=1 , T X t=1 w t (Q ? (s t ;a t )Q t (s t ;a t ))B ? kwk 1 + ~ O H 2 SAB ? kwk 1 +B ? q H 3 SAkwk 1 kwk 1 : 102 Proof. Denew (0) t =w t andw (h+1) t+1 = P T t 0 =1 P m t 0 i=1 w (h) t 0 m t 0 Ift = l t 0 ;i g. We rst argue the following properties related tow (h) t and vectorw (h) = (w (h) 1 ;:::;w (h) T ). Denote byj t the stage to which time stept belongs. Whent = l t 0 ;i , we havem t 0 =e jt . Therefore, T X t 0 =1 m t 0 X i=1 1 m t 0 Ift = l t 0 ;i g e jt+1 e jt 1 + 1 H ; and thus, w (h) 1 (1 + 1 H ) w (h1) 1 (1 + 1 H ) h kwk 1 . Moreover, w (h+1) 1 = T X t=1 T X t 0 =1 m t 0 X i=1 w (h) t 0 m t 0 Ift = l t 0 ;i g = T X t 0 =1 w (h) t 0 m t 0 X i=1 T X t=1 Ift = l t 0 ;i g m t 0 w (h) 1 ; and thus w (h) 1 kwk 1 for anyh. Also note that for anyfX t g t such thatX t 0: T X t=1 w (h) t m t mt X i=1 X l t;i = T X t 0 =1 T X t=1 w (h) t m t mt X i=1 X t 0Ift 0 = l t;i g = T X t 0 =1 w (h+1) t 0 +1 X t 0: (A.4) Next, for a xed (s;a), by Lemma 122, with probability at least 1 SA , whenn t (s;a)> 0: jc(s;a)b c t (s;a)j 2 s 2b c t (s;a) n t (s;a) ln 2SAn t (s;a) + 19 ln 2SAnt(s;a) n t (s;a) s b c t (s;a) t n t (s;a) + t n t (s;a) : (A.5) Taking a union bound, we have Eq. (A.5) holds for all (s;a) whenn t (s;a)> 0 with probability at least 1. Then by denition ofb 0 t , we have c(s t ;a t )b c t (s t ;a t )Ifm t = 0g +b 0 t : (A.6) 103 Now we are ready to prove the lemma. First, we condition on Lemma 18, which happens with probability at least 1 7. Then for anyh2f0;:::;H 1g; Q =Q Hh ; V =Q Hh1 we have: T X t=1 w (h) t ( Q(s t ;a t )Q t (s t ;a t )) + T X t=1 w (h) t (c(s t ;a t )b c t (s t ;a t )) + +w (h) t P t V 1 m t mt X i=1 V l t;i (s 0 l t;i ) ! + +w (h) t b 0 t (by Eq. (A.1) and Q(s;a) =c(s;a) +P s;a V ) T X t=1 2B ? w (h) t Ifm t = 0g + T X t=1 w (h) t 1 m t mt X i=1 P l t;i V 1 m t mt X i=1 V l t;i (s 0 l t;i ) ! + + 2w (h) t b 0 t : (Eq. (A.6),P t =P l t;i andP t V B ? Ifm t = 0g + 1 mt P mt i=1 P l t;i V ) Sincee 1 =H, we have P T t=1 w (h) t Ifm t = 0gSAH w (h) 1 . Moreover, by Eq. (G.5) of Lemma 125 with X t = V (s 0 t ), we have with probability at least 1 H : 1 mt P mt i=1 P l t;i V 1 mt P mt i=1 V (s 0 l t;i ) + ~ O B?"t p mt . Plugging these back to the previous inequality and using the denition ofb 0 t gives: T X t=1 w (h) t ( Q(s t ;a t )Q t (s t ;a t )) + 2HSAB ? w (h) 1 + T X t=1 w (h) t m t mt X i=1 V (s 0 l t;i )V l t;i (s 0 l t;i ) + + ~ O B ? w (h) t " t p m t + w (h) t " t n t ! 3HSAB ? w (h) 1 + ~ O B ? q HSA w (h) 1 kwk 1 + T X t=1 w (h+1) t+1 V (s 0 t )V t (s 0 t ) + (Eq. (A.4) and Lemma 23) ~ O HSAB ? w (h) 1 +B ? q HSA w (h) 1 kwk 1 + T X t=1 w (h+1) t ( Q(s t ;a t )Q t (s t ;a t )) + ; 104 where in the last inequality we apply: T X t=1 w (h+1) t+1 V (s 0 t )V t (s 0 t ) + T X t=1 w (h+1) t+1 ( V (s 0 t )V t+1 (s 0 t )) + + ~ O w (h) 1 SB ? (apply Lemma 113 on P T t=1 V t+1 (s 0 t )V t (s 0 t )) T X t=1 w (h+1) t ( V (s t )V t (s t )) + + ~ O w (h) 1 SB ? (( V (s 0 t )V t+1 (s 0 t )) + ( V (s t+1 )V t+1 (s t+1 )) + andw (h+1) T +1 = 0) T X t=1 w (h+1) t ( Q(s t ;a t )Q t (s t ;a t )) + + ~ O w (h) 1 SB ? : ( V (s t ) Q(s t ;a t ) andV t (s t ) =Q t (s t ;a t )) By a union bound, the inequality above holds for Q =Q Hh ; V =Q Hh1 for allh2f0;:::;H 1g with probability at least 1. 
Applying the inequality above recursively starting fromh = 0, and by Q ? 0 (s;a)Q t (s;a) 0, (1 + 1 H ) H 3: T X t=1 w t (Q ? H (s t ;a t )Q t (s t ;a t )) + = ~ O H 2 SAB ? kwk 1 +B ? q H 3 SAkwk 1 kwk 1 : Therefore, by Lemma 1, T X t=1 w t (Q ? (s t ;a t )Q t (s t ;a t )) = T X t=1 w t (Q ? (s t ;a t )Q ? H (s t ;a t ) +Q ? H (s t ;a t )Q t (s t ;a t )) B ? kwk 1 + ~ O H 2 SAB ? kwk 1 +B ? q H 3 SAkwk 1 kwk 1 : Now by Lemma 15 withw t =IfV ? (s t )V t (s t )g for some threshold, we can bound the sample complexity of obtaining a value function with precision (Corollary 26), which is used to determine the value of ? (Lemma 17). However, one caveat here is that the bound in Lemma 15 has logarithmic 105 dependency onT from t , which should not appear in the denition of ? sinceT is a random variable. To deal with this, we obtain a loose bound onT in the following lemma. Lemma16. With probability at least 1 13,T = ~ O(B ? K=c min +B 2 ? H 3 SA=c 2 min ). Proof. By Lemma 15 withw t = 1, we have with probability at least 1 8: T X t=1 Q ? (s t ;a t )Q t (s t ;a t ) =B ? T + ~ O H 2 SAB ? +B ? p H 3 SAT : Now by Eq. (2.2), Lemma 13, Lemma 125, and Lemma 14, with probability at least 1 5, R K T X t=1 (c t c(s t ;a t )) + T X t=1 (V ? (s 0 t )P st;at V ? ) + T X t=1 (Q ? (s t ;a t )V ? (s t )) ~ O( p B ? C K +B ? ) + T X t=1 (Q ? (s t ;a t )Q t (s t ;a t )) (V t =Q t (s t ;a t ) and Lemma 18) =B ? T + ~ O H 2 SAB ? +B ? p H 3 SAT : (C K T ) Further usingc min TKB ? R K ,B ? c min 2 , and Lemma 110 proves the statement. Corollary26. With probability at least 1 13, Algorithm 5 ensures for any 2B ? : T X t=1 IfV ? (s t )V t (s t )g = ~ O B 2 ? H 3 SA 2 ,U 1; and for anys2S,N t (s)U implies 0V ? (s)V t (s). 106 Proof. We can assume B ? since P T t=1 IfV ? (s t )V t (s t ) g = 0 when > B ? . By Lemma 15 with w t = IfV ? (s t )V t (s t ) g, w t w t (V ? (s t )V t (s t )), 2B ? , and V ? (s t )V t (s t ) Q ? (s t ;a t )Q t (s t ;a t ), we have with probability at least 1 8: kwk 1 T X t=1 w t (V ? (s t )V t (s t )) 2 kwk 1 + ~ O H 2 SAB ? +B ? q H 3 SAkwk 1 : Therefore, by Lemma 110 and Lemma 16,kwk 1 = ~ O H 2 SAB? + B 2 ? H 3 SA 2 , which has no logarithmic dependency onT . We prove the second statement by contradiction: supposeN t (s) U andV ? (s) V t (s) > . Then sinceV t is non-decreasing int,N t (s)kwk 1 . Thus,U N t (s)kwk 1 < U , a contradiction. Lemma17. Dene i = B? 2 i ; e N 0 = 0; e N i =U i (dened in Corollary 26) fori 1 andq ? = inffi : i c min g. DeneV REF =V ref T +1 ; ? =d e N q ?e 2 , andB ref t such that: B ref t (s) = q ? X i=1 i1 Ifd e N i1 e 2 N t (s)<d e N i e 2 g: Then with probability at least 1 13,V REF (s)V ref t (s)B ref t (s), and T X t=1 V REF (s t )V ref t (s t ) T X t=1 B ref t (s t ) = ~ O B 2 ? H 3 S 2 A c min ,C REF ; T X t=1 V REF (s t )V ref t (s t ) 2 T X t=1 B ref t (s t ) 2 = ~ O B 2 ? H 3 S 2 A ,C REF, 2 : 107 Proof. We condition on Corollary 26, which happens with probability at least 1 13. By Corollary 26 with = i for eachi2 [q ? ], we haveV REF (s)V ref t (s)B ref t (s). Moreover,B ref t (s) 2 = P q ? i=1 2 i1 Ifd e N i1 e 2 N t (s)<d e N i e 2 g. Thus, T X t=1 B ref t (s t ) X s q ? X i=1 i1 d e N i e 2 = ~ O 0 @ X s q ? X i=1 B 2 ? H 3 SA i 1 A = ~ O B 2 ? H 3 S 2 A q ? : T X t=1 B ref t (s t ) 2 X s q ? X i=1 2 i1 d e N i e 2 = ~ O 0 @ X s q ? X i=1 B 2 ? H 3 SA 1 A = ~ O B 2 ? H 3 S 2 A : A.3.2 ProofsofRequiredProperties In this section, we prove Property 1 and Property 2 of Algorithm 5. Lemma18. 
Withprobabilityatleast 1 7,Algorithm5ensuresQ t (s;a)Q t+1 (s;a)Q ? (s;a)forany (s;a)2 ;t 1. Proof. We x a pair (s;a), and denoten t ;m t ;l t;i ; l t;i ;b t ;b 0 t ; t as shorthands of the corresponding functions evaluated at (s;a). The rst inequality is by the update rule ofQ t . Next, we proveQ t (s;a)Q ? (s;a) by induction ont. It is clearly true whent = 1. For the induction step, the statement is clearly true when n t =m t = 0. Whenn t > 0, it suces to consider two update rules, that is, the last two terms in the max operator of Eq. (A.1). For the second update rule, note that, b c t (s;a) + 1 n t nt X i=1 V ref l t;i (s 0 l t;i ) + 1 m t mt X i=1 V l t;i (s 0 l t;i )V ref l t;i (s 0 l t;i ) b t =b c t (s;a) + 1 n t nt X i=1 P s;a V ref l t;i + 1 m t mt X i=1 P s;a V l t;i V ref l t;i + 1 n t nt X i=1 I s 0 l t;i P s;a V ref l t;i | {z } 1 + 1 m t mt X i=1 I s 0 l t;i P s;a V l t;i V ref l t;i | {z } 2 b t : (A.7) 108 DeneC 0 t =dln(B 4 ? n t )e 2 minf4 ln 2 (B 4 ? n t );B 8 ? n 2 t g (in general, we can setC 0 t =dln( e B 4 n t )e 2 for some e BB ? ). For 1 , by Eq. (G.5) of Lemma 125 withb =B 2 ? andCC 0 t , we have with probability at least 1 SA : j 1 j = 1 n t nt X i=1 I s 0 l t;i P s;a V ref l t;i 4 ln 3 4SAB 8 ? n 5 t 0 @ s 8 P nt i=1 V(P s;a ;V ref l t;i ) n 2 t + 5B t n t 1 A ; Note that (recall that ref t represents ref t (s;a)) 1 n t nt X i=1 V(P s;a ;V ref l t;i ) ref t = 3 + 4 + 5 ; (A.8) where 3 = 1 n t nt X i=1 P s;a (V ref l t;i ) 2 V ref l t;i (s 0 l t;i ) 2 ; 4 = 1 n t nt X i=1 V ref l t;i (s 0 l t;i ) ! 2 1 n t nt X i=1 P s;a V ref l t;i ! 2 ; 5 = 1 n t nt X i=1 P s;a V ref l t;i ! 2 1 n t nt X i=1 (P s;a V ref l t;i ) 2 : By Eq. (G.5) of Lemma 125 withb =B 2 ? andCC 0 t , and Lemma 115 with V ref l t;i 1 B t , with probability at least 1 2 SA , j 3 j 4 ln 3 (4SAB 8 ? n 5 t =) n t 0 @ v u u t 8 nt X i=1 V(P s;a ; (V ref l t;i ) 2 ) + 5B 2 t 1 A 4 ln 3 (4SAB 8 ? n 5 t =) n t 0 @ 2B t v u u t 8 nt X i=1 V(P s;a ;V ref l t;i ) + 5B 2 t 1 A : (A.9) j 4 j 1 n t nt X i=1 V ref l t;i (s 0 l t;i ) + 1 n t nt X i=1 P s;a V ref l t;i 1 n t nt X i=1 V ref l t;i (s 0 l t;i ) 1 n t nt X i=1 P s;a V ref l t;i 2B t 4 ln 3 (4SAB 8 ? n 5 t =) n t 0 @ v u u t 8 nt X i=1 V(P s;a ;V ref l t;i ) + 5B t 1 A : (A.10) 109 Moreover, 5 0 by Cauchy-Schwarz inequality. Therefore, 1 n t nt X i=1 V(P s;a ;V ref l t;i ) ref t 4B t ln 3 (4SAB 8 ? n 5 t =) n t 0 @ 4 v u u t 8 nt X i=1 V(P s;a ;V ref l t;i ) + 15B t 1 A : Applying Lemma 110 withx = P nt i=1 V(P s;a ;V ref l t;i ), we obtain: 1 n t nt X i=1 V(P s;a ;V ref l t;i ) 2 ref t + 4216B 2 t ln 6 4SAB 8 ? n 5 t n t : Thus, 1 nt P nt i=1 I s 0 l t;i P s;a V ref l t;i q ref t nt t + 3Btt nt . By similar arguments,j 2 j q t mt t + 3Btt mt with probability at least 1 3 SA . Finally, by Eq. (A.5) andB t 1, we haveb c t (s;a)c(s;a) q b ct(s;a) nt + Bt nt . Therefore, jb c t (s;a)c(s;a)j +j 1 j +j 2 jb t : (A.11) Plugging Eq. (A.11) back to Eq. (A.7), and by the non-decreasing property ofV ref t andV l t;i (s)V ? (s) for anys2S + : b c t (s;a) + 1 n t nt X i=1 V ref l t;i (s 0 l t;i ) + 1 m t mt X i=1 V l t;i (s 0 l t;i )V ref l t;i (s 0 l t;i ) b t c(s;a) + 1 n t nt X i=1 P s;a V ref l t;i + 1 m t mt X i=1 P s;a V l t;i V ref l t;i c(s;a) +P s;a V ? =Q ? (s;a): For the rst update rule, by Eq. (G.5) of Lemma 125 withb = K andC C 0 t , with probability at least 1 SA , 1 mt P mt i=1 V l t;i (s 0 l t;i )P l t;i V l t;i 2 q B 2 t t mt . Therefore, by Eq. 
(A.5): b c t (s;a) + 1 m t mt X i=1 V l t;i (s 0 l t;i )b 0 t c(s;a) + 1 m t mt X i=1 P l t;i V l t;i c(s;a) +P s;a V ? =Q ? (s;a): Combining two cases, we haveQ t (s;a)Q ? (s;a) for the xed (s;a). By a union bound over (s;a)2 , we haveQ t (s;a)Q ? (s;a) for any (s;a)2 ;t 1. 110 Remark3. Note that the statement of Lemma 18 still holds if we use “compute 256 ln 6 (4SA e B 8 n 5 =)” inLine8ofAlgorithm5forsome e BB ? . Thisisusefulinderivingtheparameter-freeversionofAlgorithm5. Proof of Theorem 2. Property 1 is satised by Lemma 18. For Property 2, we conditioned on Lemma 18, Lemma 17, Lemma 19, and Lemma 20, which holds with probability at least 1 50. Then, for any ( Q; V )2V H : T X t=1 ( Q(s t ;a t )Q t (s t ;a t )) + T X t=1 c(s t ;a t )b c t (s t ;a t ) +P t V 1 n t nt X i=1 V ref l t;i (s 0 l t;i ) 1 m t mt X i=1 V l t;i (s 0 l t;i )V ref l t;i (s 0 l t;i ) +b t ! + (by Eq. (A.1) and Q(s;a) =c(s;a) +P s;a V ) T X t=1 2B ? Ifm t = 0g + T X t=1 1 m t mt X i=1 P l t;i V 1 n t nt X i=1 P l t;i V ref l t;i 1 m t mt X i=1 P l t;i V l t;i V ref l t;i ! + + 2b t (P t V B ? Ifm t = 0g + 1 mt P mt i=1 P l t;i V and Eq. (A.11)) 2B ? HSA + T X t=1 1 n t nt X i=1 P l t;i V REF V ref l t;i + 1 m t mt X i=1 P l t;i ( VV l t;i ) + + 2b t : ( P T t=1 Ifm t = 0gSAH,P t =P l t;i =P l t;i , andV ref l t;i (s)V REF (s) for anys2S (Lemma 17)) By Lemma 21 and Lemma 19, T X t=1 1 n t nt X i=1 P l t;i V REF V ref l t;i = ~ O T X t=1 P t (V REF V ref t ) ! = ~ O (C REF ): Moreover, by Lemma 22, with probability at least 1 H+1 , 1 m t mt X i=1 P l t;i ( VV l t;i ) + 1 + 1 H 2 T X t=1 ( V (s t )V t (s t )) + + ~ O (B ? (H +S)): 111 Plugging these back, and by (1 + 1 H ) 2 1 + 3 H , Lemma 20 and Lemma 17, we get: T X t=1 ( Q(s t ;a t )Q t (s t ;a t )) + ~ O (B ? HSA +C REF ) + 1 + 1 H 2 T X t=1 ( V (s t )V t (s t )) + + 2 T X t=1 b t 1 + 3 H T X t=1 ( V (s t )V t (s t )) + + ~ O p B ? SAC K + p SAHc min C K + B 2 ? H 3 S 2 A c min : Taking a union bound over ( Q; V )2V H and usingH = ~ O B? c min proves the claim. A.3.3 ProofofTheorem3 Proof. By Theorem 1 and Theorem 2, with probability at least 1 60 and = c min 2B 2 ? SAK : C K KV ? (s init ) =R K ~ O C K + p B ? SAC K + B 2 ? H 3 S 2 A c min : Then byV ? (s init )B ? ; 1 2 and Lemma 110, we haveC K = ~ O (B ? K). Substituting this back and by c min B?K ;H = ~ O(B ? =c min ), we getR K = ~ O B ? p SAK + B 5 ? S 2 A c 4 min . A.3.4 ExtraLemmas In this section, we gives proofs of auxiliary lemmas used in Section 2.3. Lemma 19 quanties the cost of using reference value function. Lemma 20 quanties the cost of using the variance-aware bonus termsb t . Lemma 21, Lemma 22, and Lemma 23 deal with the bias induced by the sparse update scheme. Lemma19. With probability at least 1 9, P T t=1 P t V REF V ref t P T t=1 P t B ref t = ~ O (C REF ), where C REF is dened in Lemma 17. 112 Proof. By Lemma 17, Lemma 127, Lemma 113 andB ref t+1 (s 0 t )B ref t+1 (s t+1 ) in each step: T X t=1 P t V REF V ref t T X t=1 P t B ref t 2 T X t=1 B ref t (s 0 t ) + ~ O (B ? ) = ~ O T X t=1 B ref t (s t ) +SB ? ! = ~ O (C REF ): Lemma20. With probability at least 1 21, T X t=1 b t = ~ O p B ? SAC K +B ? H 2 S 3 2 A + p SAHc min C K : Proof. We condition on Lemma 17, which holds with probability at least 1 8. By Eq. (A.12) and Eq. (A.13) of Lemma 23, T X t=1 b t T X t=1 s ref t " t n t t + r t " t m t t +B ? X t 4" t n t + 3" t m t t + s b c t " t t n t = ~ O 0 @ T X t=1 s ref t " t n t + r t " t m t +B ? HSA + s b c t " t n t 1 A : Note that by Eq. (A.8), Eq. (A.9) and Eq. 
(A.10), whenn t > 0, with probability at least 1 2, ref t 1 n t nt X i=1 V(P l t;i ;V ref l t;i )j 3 j +j 4 j 5 ~ O 0 @ B t n t v u u t nt X i=1 V(P l t;i ;V ref l t;i ) + B 2 t n t 1 A + 1 n t nt X i=1 (P l t;i V ref l t;i ) 2 1 n t nt X i=1 P l t;i V ref l t;i ! 2 (i) = ~ O 0 @ B t n t v u u t nt X i=1 V(P l t;i ;V ref l t;i ) + B 2 t n t + B ? n t nt X i=1 P l t;i B ref l t;i 1 A 1 n t nt X i=1 V(P l t;i ;V ref l t;i ) + ~ O B 2 t n t + B ? n t nt X i=1 P l t;i B ref l t;i ! ; (AM-GM Inequality) 113 where in (i) we apply: 1 n t nt X i=1 (P l t;i V ref l t;i ) 2 1 n t nt X i=1 P l t;i V ref l t;i ! 2 (P t V REF ) 2 1 n t nt X i=1 P l t;i V ref l t;i ! 2 (V ref l t;i (s)V REF (s) for anys2S) 2B ? n t nt X i=1 P l t;i V REF V ref l t;i 2B ? n t nt X i=1 P l t;i B ref l t;i : ( V REF 1 B ? and Lemma 17) Therefore, ref t 2 nt P nt i=1 V(P l t;i ;V ref l t;i ) = ~ O B 2 t nt + B? nt P nt i=1 P l t;i B ref l t;i , and ref t 2V(P t ;V ? ) = ref t 2 n t nt X i=1 V(P l t;i ;V ref l t;i ) + 2 n t nt X i=1 (V(P l t;i ;V ref l t;i )V(P l t;i ;V ? )) (P t =P l t;i ) (i) ~ O B 2 ? n t + B ? n t nt X i=1 P l t;i B ref l t;i ! + 4B ? n t nt X i=1 P l t;i V ? V ref l t;i = ~ O B 2 ? n t + B ? n t nt X i=1 P l t;i B ref l t;i +B ? q ? ! ; (V ? (s)V ref l t;i (s)B ref l t;i (s) + q ?;8s) where in (i) we apply the bound for ref t 2 nt P nt i=1 V(P l t;i ;V ref l t;i ),B t B ? and V(P l t;i ;V ref l t;i )V(P l t;i ;V ? ) (P l t;i V ? ) 2 (P l t;i V ref l t;i ) 2 2B ? P l t;i (V ? V ref l t;i ): Plugging the inequality above back, we have with probability at least 1 11, T X t=1 s ref t n t = ~ O 0 @ T X t=1 s V(P t ;V ? ) n t + B ? n t + 1 n t v u u t B ? nt X i=1 P l t;i B ref l t;i + s B ? q ? n t 1 A = ~ O 0 @ v u u t SA T X t=1 V(P t ;V ? ) +B ? SA + v u u t T X t=1 B ? n t v u u t T X t=1 1 n t nt X i=1 P l t;i B ref l t;i + p B ? q ?SAT 1 A (Lemma 23 and Cauchy-Schwarz inequality) = ~ O p B ? SAC K +B ? SA + p B ? SAC REF + p B ? q ?SAT : (Lemma 14, Lemma 23, Lemma 21 and Lemma 19) 114 Moreover, T X t=1 r t m t T X t=1 q P mt i=1 (V l t;i (s 0 l t;i )V ref l t;i (s 0 l t;i )) 2 m t T X t=1 q P mt i=1 (V ? (s 0 l t;i )V ref l t;i (s 0 l t;i )) 2 m t = ~ O 0 @ T X t=1 q P mt i=1 B ref l t;i (s 0 l t;i ) 2 m t + q P mt i=1 2 q ? m t 1 A (V ? (s 0 l t;i )V ref l t;i (s 0 l t;i )B ref l t;i (s 0 l t;i ) + q ?, (a +b) 2 2a 2 + 2b 2 , and p x +y p x + p y) = ~ O 0 @ v u u t T X t=1 1 m t v u u t T X t=1 1 m t mt X i=1 B ref l t;i (s 0 l t;i ) 2 + T X t=1 s 2 q ? m t 1 A : (Cauchy-Schwarz inequality) Note that by Lemma 21, Lemma 113,B ref t+1 (s 0 t )B ref t+1 (s t+1 ) and Lemma 17: T X t=1 1 m t mt X i=1 B ref l t;i (s 0 l t;i ) 2 1 + 1 H T X t=1 B ref t (s 0 t ) 2 = ~ O T X t=1 B ref t+1 (s 0 t ) +SB 2 ? ! = ~ O T X t=1 B ref t (s t ) +SB 2 ? ! = ~ O (C REF, 2 ): Plugging this back to the last inequality, and by Lemma 23, we have: T X t=1 r t m t = ~ O p SAHC REF, 2 + q SAH 2 q ?T : Finally, by Cauchy-Schwarz inequality, Eq. (A.13), Eq. (A.5) and Lemma 127: T X t=1 s b c t " t n t = ~ O 0 @ v u u t SA T X t=1 b c t " t 1 A = ~ O 0 @ v u u t SA T X t=1 c(s t ;a t ) + T X t=1 (b c t c(s t ;a t ))" t ! 1 A = ~ O 0 @ p SAC K + v u u t SA T X t=1 s b c t " t n t +SA 1 A : 115 Solving a quadratic equation gives P T t=1 q b ct"t nt = ~ O p SAC K +SA . Putting everything together, and by q ? =O (c min ); q ?T =O (c min T ) =O (C K ): T X t=1 b t = ~ O p B ? SAC K + p B ? SAC REF + p SAHC REF, 2 + p SAHc min C K +B ? HSA = ~ O p B ? SAC K +B ? H 2 S 3 2 A + p SAHc min C K : (H = B? 
c min and denition ofC REF ;C REF, 2 (Lemma 17)) Lemma21 (bias of the update scheme). AssumingX t 0, we have: T X t=1 1 m t mt X i=1 X l t;i 1 + 1 H T X t=1 X t ; T X t=1 1 n t nt X i=1 X l t;i =O ln(T ) T X t=1 X t ! : Proof. For the rst inequality, denote byj t the stage to which time stept belongs. Whent 0 = l t;i , we have m t =e j t 0 . Therefore, P T t=1 P mt i=1 1 mt Ift 0 = l t;i g e j t 0 +1 e j t 0 1 + 1 H , and T X t=1 1 m t mt X i=1 X l t;i = T X t=1 1 m t mt X i=1 T X t 0 =1 X t 0Ift 0 = l t;i g = T X t 0 =1 X t 0 T X t=1 mt X i=1 Ift 0 = l t;i g m t 1 + 1 H T X t 0 =1 X t 0: For the second inequality: T X t=1 1 n t nt X i=1 X l t;i = T X t=1 1 n t nt X i=1 T X t 0 =1 X t 0Ift 0 =l t;i g = T X t 0 =1 X t 0 T X t=1 nt X i=1 Ift 0 =l t;i g n t T X t 0 =1 X t 0 X z:t 0 E z1 T e z E z1 =O ln(T ) T X t 0 =1 X t 0 ! : 116 Lemma22. AssumingX t :S + ! [0;B] is monotonic int (i.e.,X t (s) is non-increasing or non-decreasing in t for anys2S + ) andX t (g) = 0, with probability at least 1, T X t=1 1 m t mt X i=1 P l t;i X l t;i 1 + 1 H 2 T X t=1 X t (s t ) + ~ O (B(H +S)): Proof. By Lemma 21, Lemma 127 and Lemma 113,X t+1 (s 0 t )X t+1 (s t+1 ) in each step, T X t=1 1 m t mt X i=1 P l t;i X l t;i 1 + 1 H T X t=1 P t X t 1 + 1 H 2 T X t=1 X t (s 0 t ) + ~ O (BH) 1 + 1 H 2 T X t=1 X t (s t ) + ~ O (B(H +S)): Lemma23. For any non-negative weightsfw t g t , and2 (0; 1), we have: T X t=1 w t " t n t =O (kwk 1 SA) kwk 1 1 ; T X t=1 w t " t m t =O (kwk 1 HSA) kwk 1 1 ln kwk 1 kwk 1 : Moreover, whenw t =v(s t ;a t ) for somev, T X t=1 w t " t n t = ~ O 0 @ X (s;a) v(s;a)N T +1 (s;a) 1 1 A ; T X t=1 w t " t m t = ~ O 0 @ H X (s;a) v(s;a)N T +1 (s;a) 1 1 A : In casew t = 1 for allt, it holds that when 0<< 1, T X t=1 " t n t = ~ O (SA) T 1 ; T X t=1 " t m t = ~ O (SAH) T 1 ; (A.12) 117 and when = 1, T X t=1 " t n t =O (SA lnT ); T X t=1 " t m t =O (SAH lnT ): (A.13) Proof. Dene n(s;a;j) = P t:(st;at)=(s;a);nt=E j w t , n(s;a) = P j0 n(s;a;j). Then, P (s;a) n(s;a) = kwk 1 ,n(s;a;j)kwk 1 e j+1 1 + 1 H kwk 1 e j . Moreover, by denitions ofe j andE j , X j1 I 1 + 1 H kwk 1 E j1 n(s;a) =O H ln kwk 1 kwk 1 : (A.14) X j1 e j I 1 + 1 H kwk 1 E j1 n(s;a) =O(n(s;a)=kwk 1 ): (A.15) Since 1 E j and 1 e j is decreasing, by “moving weights to earlier terms” (fromn(s;a;j) ton(s;a;i) fori<j), T X t=1 w t " t n t = X (s;a) X j1 n(s;a;j) E j X (s;a) X j1 1 + 1 H kwk 1 e j I 1 + 1 H kwk 1 E j1 n(s;a) E j =O 0 @ X (s;a) kwk 1 n(s;a) kwk 1 1 1 A ( P J j=1 e j E j =O E 1 J and Eq. (A.15)) =O (kwk 1 SA) kwk 1 1 ; (Hölder’s inequality) T X t=1 w t " t m t = X (s;a) X j1 n(s;a;j) e j X (s;a) X j1 1 + 1 H kwk 1 e 1 j I 1 + 1 H kwk 1 E j1 n(s;a) 1 + 1 H kwk 1 0 @ X (s;a) X j1 Ifkwk 1 E j1 n(s;a)g 1 A 0 @ X (s;a) n(s;a) kwk 1 1 A 1 (Hölder’s inequality and Eq. (A.15)) =O (kwk 1 HSA) kwk 1 1 ln kwk 1 kwk 1 : (Eq. (A.14)) 118 In casew t = 1 and2 (0; 1), we havekwk 1 = 1;kwk 1 =T , and Eq. (A.12) is proved. Whenw t =v(s t ;a t ) for somev,n(s;a;j)v(s;a)e j+1 IfjJ s;a g, whereJ s;a is such thatE Js;a =n T (s;a). 
Thus, T X t=1 w t " t n t X (s;a) v(s;a) Js;a X j=1 e j+1 E j =O 0 @ X (s;a) v(s;a) Js;a X j=1 e j E j 1 A =O 0 @ X (s;a) v(s;a)N T +1 (s;a) 1 1 A : T X t=1 w t " t m t X (s;a) v(s;a) Js;a X j=1 e j+1 e j =O 0 @ X (s;a) v(s;a) Js;a X j=1 e 1 j 1 A = ~ O 0 @ X (s;a) v(s;a)J s;a 0 @ Js;a X j=1 e j 1 A 1 1 A = ~ O 0 @ H X (s;a) v(s;a)N T +1 (s;a) 1 1 A : (Hölder’s inequality andJ s;a = ~ O (H) by howe j grows) In case = 1, we have: T X t=1 " t n t X (s;a) X j:0<E j1 T e j E j1 =O (SA lnT ): T X t=1 " t m t X (s;a) X j:0<E j1 T 1 + 1 H =O (SAH lnT ): A.4 OmittedDetailsforSection2.4 Extra Notation Denote byQ t (s;a);V t (s) the value ofQ(s;a);V (s) at the beginning of time stept, V 0 (s) = 0, andb t (s;a);n t (s;a); P t;s;a (s 0 ); t (s;a),b c t (s;a) the value ofb;n; P s;a (s 0 );;b c used in computing Q t (s;a) (note thatb t (s;a) = 0 andb c t (s;a) = 0 ifn t (s;a) = 0). Denote byl t (s;a) the last time step the agent visits (s;a) among thosen t (s;a) steps before the current stage, andl t (s;a) =t if the rst visit to 119 (s;a) is at time stept. Also dene P t = P t;st;at andn + t (s;a) = maxf1;n t (s;a)g. With these notation, we have by the update rule of the algorithm: Q t (s;a) = maxfQ t1 (s;a);b c t (s;a) + P t;s;a V lt b t g; (A.16) whereb t representsb t (s;a), andl t representsl t (s;a) for notational convenience. Before proving Theorem 5 (Section A.4.3), we rst show some basic properties of our proposed update scheme (Section A.4.1), and proves the two required properties for Algorithm 6 (Section A.4.2). A.4.1 PropertiesofProposedUpdateScheme In this section, we prove that our proposed update scheme has the desired properties, that is, it suers constant cost independent ofH, while maintaining sparse update in the long run similar to the update scheme of Algorithm 5 (Lemma 24). We also quantify the bias induced by the sparse update compared to full-planning (that is, update every state-action pair at every time step) in Lemma 25. Lemma24. The proposed update scheme satises the following: 1. ForfX t g t0 such that X t 2 [0; B] and t < t 0 ; (s t ;a t ) = (s t 0;a t 0) implies X t X t 0, we have: P T t=1 X lt BSA + (1 + 1 H ) P T t=1 X t . 2. Denotei ? h = inffiN + :e i hg forh2N + . Theni ? h =O(H ln(h)). Proof. For any givenn2N + , deney n as the index of the end of last stage, that is, the largest element in L that is smaller thann (also deney 1 = 1). For the rst property, we rst prove by induction that for any j2N + , there exist non-negative weightsfw n;i g n;i such that: 1. For allnE j , P yn i=1 w n;i =Ifn> 1g, andw n;i = 0 fori>y n . 2. P E j n=1 w n;i 1 + 1 H for anyiE j . 120 3. e e j+1 + P E j n=1 P E j n 0 =1 w n;n 0 = (1 + 1=H)E j . The base case ofj = 1 is clearly true byw 1;i = 0 for anyi2N + ande e 2 = 1 + 1 H . For the induction step, by the third property, there are in total (1 + 1 H )E j energy contributed by indices up toE j , wheree e j+1 is the amount of energy available to use for stages starting fromj + 1, and P E j n=1 P E j n 0 =1 w n;n 0 is the amount of energy consumed by indices up toE j (we use one of the possible assignments offw n;i g n;i fornE j from the previous induction step). We can easily distributee j+1 weights (frome e j+1 ) to indices in stage j + 1 so that P yn i=1 w n;i = 1 andw n;i = 0 fori>y n for allE j <n E j+1 (note thaty n = E j in this range), and P E j+1 n=1 w n;i 1 + 1 H for anyiE j+1 . 
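To make the update rule (A.16) above concrete, here is a minimal Python sketch of a single update for one state-action pair (the function and argument names are ours, and the Bernstein-style bonus $b_t$ is left as an abstract input rather than spelled out). Taking the maximum with the previous estimate is what keeps $Q$ monotonically non-decreasing, the first property established in Lemma 26 below.

```python
def lazy_q_update(Q_prev, c_hat, P_hat, V_ref, bonus):
    """One application of update rule (A.16) for a single (s, a):
    Q_t(s, a) = max{ Q_{t-1}(s, a), c_hat + sum_{s'} P_hat(s') * V_ref(s') - bonus }.

    c_hat : empirical mean cost of (s, a) over the samples used,
    P_hat : empirical next-state distribution, a dict s' -> probability,
    V_ref : value estimate V_{l_t} taken at the last visit to (s, a) before the current stage,
    bonus : exploration bonus b_t (its exact Bernstein form is omitted here).
    """
    target = c_hat + sum(p * V_ref[s_next] for s_next, p in P_hat.items()) - bonus
    return max(Q_prev, target)

# Example: an optimistic target below the current estimate leaves Q unchanged.
Q = lazy_q_update(Q_prev=1.3, c_hat=0.5, P_hat={"s1": 0.7, "g": 0.3},
                  V_ref={"s1": 1.0, "g": 0.0}, bonus=0.6)
print(Q)  # max(1.3, 0.5 + 0.7 - 0.6) = 1.3
```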
Moreover, e e j+2 + E j+1 X n=1 E j+1 X n 0 =1 w n;n 0 =e e j+1 + 1 H e j+1 + E j X n=1 E j X n 0 =1 w n;n 0 +e j+1 = 1 + 1 H E j + 1 + 1 H e j+1 = 1 + 1 H E j+1 : Thus, the induction step also holds. We are now ready to prove the rst property. Denote byt i (s;a) the time step of thei-th visit to (s;a), and byN(s;a) the total number of visits to (s;a) inK episodes. We have T X t=1 X lt = X (s;a) N(s;a) X n=1 X tyn (s;a) X (s;a) X t 1 (s;a) + X (s;a) N(s;a) X n=2 yn X i=1 w n;i X t i (s;a) (y 1 = 1,X t i (s;a) is non-increasing ini, andfw n;i g n;i is from the induction result) BSA + X (s;a) N(s;a) X i=1 X t i (s;a) N(s;a) X n=1 w n;i BSA + 1 + 1 H X (s;a) N(s;a) X i=1 X t i (s;a) (X t 1 (s;a) B and P N(s;a) n=1 w n;i 1 + 1 H ) = BSA + 1 + 1 H T X t=1 X t : 121 For the second property, note thati ? h = inffi2N + :e e i hg sinceh is an interger. Moreover, e e i+1 = 1 + 1 H e e i + 1 H (e i e e i ) 1 + 1 H e e i 1 H =) e e i+1 1 1 + 1 H (e e i 1) =) e e i (e e i ? 2 1) 1 + 1 H ii ? 2 + 1 1 + 1 H ii ? 2 + 1; 8ii ? 2 : Therefore,i ? h inf i fii ? 2 : (1 + 1=H) ii ? 2 + 1hg =i ? 2 +O(H ln(h)). Also, by inspectinge i for small i we observe thati ? 2 =O(H), which implies thati ? h =O(H ln(h)). Remark4. Lemma 24 implies that there are at mostO(minfSAH lnT;STg) updates inT steps. Remark5. Note that the update scheme in (Zhang, Zhou, and Ji, 2020) (also used in Algorithm 5) induces a constantcostoforder ~ O(B ? HSA),whichruinsthehorizonfreeregret. Thisisbecausetheirupdatescheme collectsH samples before the rst update. On the contrary, our update scheme updates frequently at the beginning, but has the same update frequency as that of (Zhang, Zhou, and Ji, 2020) in the long run. This reduces the constant cost to ~ O(B ? SA) while maintaining the ~ O(SAH) time complexity. The following lemma quanties the dominating bias introduced by the sparse update. Lemma 25 (bias of the update scheme). P T t=1 P t (V t V lt ) B ? SA + 1 H P T t=1 P t (V ? V t ) and P T t=1 V(P t ;V t V lt ) ~ O B 2 ? SA + B? H P T t=1 P t (V ? V t ). Proof. For the rst statement, we apply Lemma 24 andP t =P lt to obtain T X t=1 P t (V t V lt ) = T X t=1 P lt (V ? V lt ) T X t=1 P t (V ? V t )B ? SA + 1 H T X t=1 P t (V ? V t ): 122 Similarly, for the second statement T X t=1 V(P t ;V t V lt ) T X t=1 P t (V t V lt ) 2 B ? T X t=1 P t (V t V lt ) B 2 ? SA + B ? H T X t=1 P t (V ? V t ): A.4.2 ProofsofRequiredProperties In this section, we prove Property 1 (Lemma 26) and Property 2 of Algorithm 6, where Lemma 27 proves a preliminary form of Property 2. Lemma26. With probability at least 1,Q t (s;a)Q t+1 (s;a)Q ? (s;a), for any (s;a)2 ;t 1. Proof. The rst inequality is clearly true by the update rule. Next, we proveQ t (s;a) Q ? (s;a). By Eq. (A.16), it is clearly true whenn t (s;a) = 0. Whenn t (s;a)> 0, by Lemma 116: (here,l t ; t is a shorthand ofl t (s;a); t (s;a)): b c t (s;a) + P t;s;a V lt b t (s;a) =b c t (s;a) +f( P t;s;a ;V lt ;n t (s;a);B; t ) s b c t (s;a) t n t (s;a) c(s;a) +f( P t;s;a ;V ? ;n t (s;a);B; t ) + t n t (s;a) (Eq. (A.17)) =c(s;a) + P t;s;a V ? max 8 < : 7 s V( P t;s;a ;V ? ) t n t (s;a) ; 49B t n t (s;a) 9 = ; + t n t (s;a) Q ? (s;a) + ( P t;s;a P s;a )V ? 3 s V( P t;s;a ;V ? ) t n t (s;a) 24B t n t (s;a) + B t n t (s;a) (BB ? 1,Q ? (s;a) =c(s;a) +P s;a V ? and maxfa;bg a+b 2 ) Q ? (s;a) + (2 p 2 3) s V( P t;s;a ;V ? ) t n t (s;a) + (20 24) B t n t (s;a) Q ? (s;a): (Lemma 122) 123 Lemma27. 
With probability at least 1 9, for all ( Q; V )2V H T X t=1 ( Q(s t ;a t )Q t (s t ;a t )) + 1 + 1 H T X t=1 ( V (s t )V t (s t )) + + ~ O 0 @ p B ? SAC K +BS 2 A + v u u t B ? S 2 A H T X t=1 V ? (s t )V t (s t ) 1 A : Proof. We rst prove useful properties related to the cost estimator. For a xed (s;a), by Lemma 122, with probability at least 1 SA , whenn t (s;a)> 0: jc(s;a)b c t (s;a)j 2 s 2b c t (s;a) n t (s;a) ln 2SA + 19 ln 2SA n t (s;a) s b c t (s;a) t n t (s;a) + t n t (s;a) : (A.17) Taking a union bound, we have Eq. (A.17) holds for all (s;a) whenn t (s;a)> 0 with probability at least 1. Then by denition ofb t , we have c(s t ;a t )b c t (s t ;a t )Ifn t = 0g +b t : (A.18) Note that with probability at least 1 2, for all ( Q; V )2V H , T X t=1 ( Q(s t ;a t )Q t (s t ;a t )) + T X t=1 (c(s t ;a t )b c t (s t ;a t ) +P t V P t V lt ) + +b t ( Q(s t ;a t ) =c(s t ;a t ) +P t V and Eq. (A.16)) T X t=1 Ifn t = 0g + T X t=1 h (P t ( VV lt ) + (P t P t )V ? + (P t P t )(V lt V ? )) + + 2b t i SA + T X t=1 " P t ( VV lt ) + + ~ O s V(P t ;V ? ) n + t + s SV(P t ;V ? V lt ) n + t + SB ? n + t ! + 2b t # : ((x +y) + (x) + + (y) + , Lemma 122, and Lemma 31) 124 Note that: T X t=1 P t ( VV lt ) + 1 + 1 H T X t=1 P t ( VV t ) + +B ? SA (P lt =P t and Lemma 24) =B ? SA + 1 + 1 H T X t=1 ( V (s 0 t )V t (s 0 t )) + + (P t I s 0 t )( VV t ) + O (B ? SA) + 1 + 1 H T X t=1 ( V (s t )V t (s t )) + + (P t I s 0 t )( VV t ) + ; where the last step is by Lemma 113 and ( V (s 0 t )V t+1 (s 0 t )) + ( V (s t+1 )V t+1 (s t+1 )) + . Plugging this back to the previous inequality, and by Cauchy-Schwarz inequality and Lemma 32: T X t=1 ( Q(s t ;a t )Q t (s t ;a t )) + 1 + 1 H T X t=1 ( V (s t )V t (s t )) + + (P t I s 0 t )( VV t ) + +b t + ~ O 0 @ v u u t SA T X t=1 V(P t ;V ? ) + v u u t S 2 A T X t=1 V(P t ;V ? V lt ) +B ? S 2 A 1 A : 125 Next, we bound the term P T t=1 (P t I s 0 t )( V V t ) + . We condition on Lemma 28, which holds with probability at least 1. Then, for a given ( Q; V )2V H , by Lemma 30 withX t = ( VV t ) + =B ? , we have with probability 1 H+1 (F T ;Y T , and T are dened in Lemma 30): B ? F T (0) = T X t=1 (P t I s 0 t )( VV t ) + B ? ( p 3Y T T + 4 T ) = ~ O p B 2 ? Y T +B ? = ~ O 0 @ v u u t B 2 ? S + 1 + T X t=1 (X t (s t )P t X t ) + ! +B ? 1 A = ~ O 0 @ v u u t B 2 ? S +B ? T X t=1 ( V (s t )V t (s t )P t ( VV t )) + +B ? 1 A : ((x) + (y) + (xy) + ) (i) = ~ O 0 @ T X t=1 b t +B ? S p A + v u u t B ? H T X t=1 P t (V ? V t ) 1 A + ~ O 0 @ v u u t SA T X t=1 V(P t ;V ? ) + v u u t S 2 A T X t=1 V(P t ;V ? V lt ) 1 A ; where in (i) we apply: v u u t B ? T X t=1 ( V (s t )V t (s t )P t ( VV t )) + v u u t B ? T X t=1 2b t + P t (V ? V t ) H ! + ~ O 0 B @ v u u u t B ? 0 @ v u u t SA T X t=1 V(P t ;V ? ) + v u u t S 2 A T X t=1 V(P t ;V ? V lt ) 1 A +B ? S p A 1 C A (Lemma 28 and p x +y p x + p y) 2 T X t=1 b t + v u u t B ? H T X t=1 P t (V ? V t ) + ~ O 0 @ v u u t SA T X t=1 V(P t ;V ? ) + v u u t S 2 A T X t=1 V(P t ;V ? V lt ) +B ? S p A 1 A ; (AM-GM inequality and p x +y p x + p y) 126 Hence, by a union bound, the bound above for P T t=1 (P t I s 0 t )( VV t ) + holds for all ( Q; V )2V H with probability at least 1, and with probability at least 1 4, for all ( Q; V )2V H , T X t=1 ( Q(s t ;a t )Q t (s t ;a t )) + 1 + 1 H T X t=1 ( V (s t )V t (s t )) + + ~ O B ? S 2 A + T X t=1 b t ! + ~ O 0 @ v u u t SA T X t=1 V(P t ;V ? ) + v u u t S 2 A T X t=1 V(P t ;V ? V lt ) + v u u t B ? H T X t=1 P t (V ? 
V t ) 1 A 1 + 1 H T X t=1 ( V (s t )V t (s t )) + + ~ O 0 @ BS 2 A + v u u t SA T X t=1 V(P t ;V ? ) 1 A + ~ O 0 @ v u u t S 2 A T X t=1 V(P t ;V ? V lt ) + v u u t B ? SA H T X t=1 P t (V ? V t ) + p SAC K 1 A : (Lemma 29) Note that: v u u t S 2 A T X t=1 V(P t ;V ? V lt ) = ~ O 0 B B @ v u u u t B ? S 2 A v u u t SA T X t=1 V(P t ;V ? ) +B 2 S 4 A 2 + B ? S 2 A H T X t=1 P t (V ? V t ) +B ? S 2 A p SAC K 1 C C A (Lemma 29) = ~ O 0 B B @ v u u u t B ? S 2 A v u u t SA T X t=1 V(P t ;V ? ) +BS 2 A + v u u t B ? S 2 A H T X t=1 P t (V ? V t ) + p SAC K 1 C C A ( p x +y p x + p y and AM-GM inequality) = ~ O 0 @ v u u t SA T X t=1 V(P t ;V ? ) +BS 2 A + v u u t B ? S 2 A H T X t=1 P t (V ? V t ) + p SAC K 1 A : (AM-GM inequality) 127 Plug this back to the previous inequality, and then by Lemma 14 T X t=1 ( Q(s t ;a t )Q t (s t ;a t )) + 1 + 1 H T X t=1 ( V (s t )V t (s t )) + + ~ O 0 @ p B ? SAC K +BS 2 A + v u u t B ? S 2 A H T X t=1 P t (V ? V t ) 1 A : Finally, applying Lemma 127, Lemma 113 and (V ? V t+1 )(s 0 t ) (V ? V t+1 )(s t+1 ), the claim is proved by T X t=1 P t (V ? V t ) ~ O (B ? ) + 2 T X t=1 (V ? (s 0 t )V t (s 0 t )) ~ O (SB ? ) + 2 T X t=1 (V ? (s t )V t (s t )): Proof of Theorem 4. Property 1 is proved in Lemma 26. For Property 2, by Lemma 27, it suces to bound P T t=1 V ? (s t )V t (s t ). By Lemma 27, V ? h1 (s t ) Q ? h (s t ;a t ), and V t (s t ) = Q t (s t ;a t ), we have with probability at least 1 9, for all Q =Q ? h ; V =V ? h1 ;h2 [H]: T X t=1 (Q ? h (s t ;a t )Q t (s t ;a t )) + 1 + 1 H T X t=1 (Q ? h1 (s t ;a t )Q t (s t ;a t )) + + ~ O 0 @ p B ? SAC K +BS 2 A + v u u t B ? S 2 A H T X t=1 V ? (s t )V t (s t ) 1 A ; 8h2 [H]: Applying the inequality above recursively starting fromh =H and byQ ? 0 (s;a) = 0; (1 + 1 H ) H 3 we have: T X t=1 (Q ? H (s t ;a t )Q t (s t ;a t )) + = ~ O 0 @ H p B ? SAC K +BS 2 A + v u u t B ? HS 2 A T X t=1 V ? (s t )V t (s t ) 1 A : 128 Then by Lemma 1 withH =d 4B c min ln( 2 ) + 1e 2 : T X t=1 V ? (s t )V t (s t ) T X t=1 (Q ? (s t ;a t )Q ? H (s t ;a t )) + T X t=1 (Q ? H (s t ;a t )Q t (s t ;a t )) B ? T + ~ O 0 @ H p B ? SAC K +BS 2 A + v u u t BHS 2 A T X t=1 V ? (s t )V t (s t ) 1 A : Solving a quadratic equation w.r.t. P T t=1 V ? (s t )V t (s t ) (Lemma 110), we have: T X t=1 V ? (s t )V t (s t )B ? T + ~ O H p B ? SAC K +BS 2 A : Plug this back to the bound of Lemma 27 and by AM-GM inequality, we have for all ( Q; V )2V H : T X t=1 ( Q(s t ;a t )Q t (s t ;a t )) + 1 + 1 H T X t=1 ( V (s t )V t (s t )) + + B ? T H + ~ O p B ? SAC K +BS 2 A : Moreover, byH B? c min , we have B?T H c min TC K . Hence, Property 2 is satised withd = 1; H = C K + ~ O( p B ? SAC K +BS 2 A) with probability at least 1 9. A.4.3 ProofofTheorem5 Proof. By Theorem 1 and Theorem 4, with probability at least 1 12: C K KV ? (s init ) =R K C K + ~ O p B ? SAC K +BS 2 A : Then byV ? (s init )B ? ; 1 2 and Lemma 110, we haveC K = ~ O (B ? K). Substituting this back and by c min B?K ;H = ~ O(B ? =c min ), we getR K = ~ O B ? p SAK +BS 2 A . 129 A.4.4 ExtraLemmas In this section, we give full proofs of auxiliary lemmas used in Section 2.4. Notably, Lemma 28 and Lemma 29 bound the additional terms appears in the recursion in Lemma 27. Lemma 30 gives recursion-based analysis on bounding the sum of martingale dierence sequence, which is the key in obtaining horizon-free regret. Lemma28. With probability at least 1, we have for all ( Q; V )2V H , T X t=1 ((I st P t )( VV t )) + T X t=1 2b t + P t (V ? V t ) H + ~ O 0 @ v u u t SA T X t=1 V(P t ;V ? 
) + v u u t S 2 A T X t=1 V(P t ;V ? V lt ) +B ? S 2 A 1 A : Proof. With probability at least 1, for all ( Q; V )2V H , T X t=1 ( V (s t )V t (s t )P t ( VV t )) + T X t=1 ( Q(s t ;a t )P t V +P t V t V t (s t )) + T X t=1 (c(s t ;a t ) +P t V lt V t (s t )) + +P t (V t V lt ) ( Q(s t ;a t ) =c(s t ;a t ) +P t V , (x +y) + (x) + + (y) + , andV t is increasing int) B ? SA + T X t=1 (c(s t ;a t )b c t (s t ;a t )) + + ((P t P t )V lt ) + +b t + 1 H P t (V ? V t ) (V t (s t ) =Q t (s t ;a t ), Eq. (A.16), and Lemma 25) 2B ? SA + T X t=1 ((P t P t )V ? + (P t P t )(V lt V ? )) + + 2b t + 1 H P t (V ? V t ): (Eq. (A.18)) 130 Now by Lemma 122 and Lemma 31, we have with probability at least 1: (P t P t )V ? =O r V(Pt;V ? ) n + t + B? n + t and (P t P t )(V lt V ? ) = ~ O r SV(Pt;V ? V l t ) n + t + SB? n + t . Plugging these back to the previous inequality, we have for all ( Q; V )2V H : T X t=1 ( V (s t )V t (s t )P t ( VV t )) + 2B ? SA + T X t=1 ~ O s V(P t ;V ? ) n + t + s SV(P t ;V ? V lt ) n + t + SB ? n + t ! + 2b t + 1 H P t (V ? V t ) ~ O 0 @ v u u t SA T X t=1 V(P t ;V ? ) + v u u t S 2 A T X t=1 V(P t ;V ? V lt ) +B ? S 2 A 1 A + T X t=1 2b t + P t (V ? V t ) H : (Cauchy-Schwarz inequality and Lemma 32) This completes the proof. Lemma29. With probability at least 1 3, T X t=1 b t = ~ O 0 @ BS 3=2 A + v u u t SA T X t=1 V(P t ;V ? ) + v u u t B ? SA H T X t=1 P t (V ? V t ) + p SAC K 1 A ; T X t=1 V(P t ;V ? V lt ) = ~ O 0 @ B ? v u u t SA T X t=1 V(P t ;V ? ) +B 2 S 2 A + B ? H T X t=1 P t (V ? V t ) +B ? p SAC K 1 A : Proof. First note that: T X t=1 b t (i) = ~ O BSA + T X t=1 s V( P t ;V lt ) n + t + s b c t n + t ! (ii) = ~ O BSA + T X t=1 s V(P t ;V lt ) n + t + B ? p S n + t + s b c t n + t ! : 131 where in (i) we apply maxfa;bga +b and Lemma 32, and in (ii) we have with probability at least 1, V( P t ;V lt ) = P t (V lt P t V lt ) 2 P t (V lt P t V lt ) 2 ( P i p i x i P i p i = argmin z P i p i (x i z) 2 ) =V(P t ;V lt ) + (P t P t )(V lt P t V lt ) 2 V(P t ;V lt ) + ~ O X s 0 s P t (s 0 ) n + t + 1 n + t ! (V lt (s 0 )P t V lt ) 2 ! (Lemma 122) V(P t ;V lt ) + ~ O B ? s SV(P t ;V lt ) n + t + SB 2 ? n + t ! = ~ O V(P t ;V lt ) + SB 2 ? n + t : (Cauchy-Schwarz inequality and AM-GM inequality) Thus, by Lemma 114, Cauchy-Schwarz inequality, and Lemma 32, we have: T X t=1 b t = ~ O BS 3=2 A + T X t=1 s V(P t ;V ? ) n + t + T X t=1 s V(P t ;V ? V lt ) n + t + s b c t n + t ! = ~ O 0 @ BS 3=2 A + v u u t SA T X t=1 V(P t ;V ? ) + v u u t SA T X t=1 V(P t ;V ? V lt ) + p SAC K 1 A ; (A.19) where in the last inequality we apply: T X t=1 s b c t n + t v u u t SA T X t=1 c(s t ;a t ) + T X t=1 (c(s t ;a t )b c t ) ! (Cauchy-Schwarz inequality and Lemma 32) v u u t SA 2C K + ~ O (1) + T X t=1 s b c t t n + t + t n + t ! = ~ O 0 @ p SAC K + v u u t SA T X t=1 s b c t n + t +SA 1 A ; (Lemma 127 and Eq. (A.17)) 132 and by Lemma 110 we obtain: P T t=1 q b ct n + t = ~ O( p SAC K +SA). Applying Lemma 30 withX t (s) = (V ? (s)V t (s))=B ? , we have with probability at least 1 (G T ;Y T , and T are dened in Lemma 30), T X t=1 V(P t ;V ? V t ) =B 2 ? G T (0) 3B 2 ? Y T + 9B 2 ? T 3B ? T X t=1 ((I st P t )(V ? V t )) + + ~ O SB 2 ? : By Lemma 28 and Eq. (A.19), with probability at least 1, T X t=1 ((I st P t )(V ? V t )) + T X t=1 2b t + 1 H P t (V ? V t ) + ~ O 0 @ v u u t SA T X t=1 V(P t ;V ? ) + v u u t S 2 A T X t=1 V(P t ;V ? V lt ) +B ? S 2 A 1 A : = ~ O 0 @ BS 2 A + v u u t SA T X t=1 V(P t ;V ? ) + v u u t S 2 A T X t=1 V(P t ;V ? V lt ) + 1 H T X t=1 P t (V ? 
V t ) + p SAC K 1 A (i) = ~ O 0 @ BS 2 A + v u u t SA T X t=1 V(P t ;V ? ) + v u u t S 2 A T X t=1 V(P t ;V ? V t ) + 1 H T X t=1 P t (V ? V t ) + p SAC K 1 A ; where in (i) we apply v u u t S 2 A T X t=1 V(P t ;V ? V lt ) = ~ O 0 @ v u u t S 2 A T X t=1 V(P t ;V ? V t ) + v u u t S 2 A T X t=1 V(P t ;V t V lt ) 1 A (Var[X +Y ] 2Var[X] + 2Var[Y ] and p x +y p x + p y) = ~ O 0 @ v u u t S 2 A T X t=1 V(P t ;V ? V t ) + v u u t S 2 A B 2 ? SA + B ? H T X t=1 P t (V ? V t ) ! 1 A (Lemma 25) = ~ O 0 @ v u u t S 2 A T X t=1 V(P t ;V ? V t ) +B ? S 2 A + 1 H T X t=1 P t (V ? V t ) 1 A : ( p x +y p x + p y and AM-GM Inequality) 133 Plugging the bound on P T t=1 ((I st P t )(V ? V t )) + back, we have T X t=1 V(P t ;V ? V t ) = ~ O 0 @ B 2 S 2 A +B ? v u u t SA T X t=1 V(P t ;V ? ) +B ? v u u t S 2 A T X t=1 V(P t ;V ? V t ) 1 A + ~ O B ? H T X t=1 P t (V ? V t ) +B ? p SAC K ! : Solving a quadratic inequality w.r.t. P T t=1 V(P t ;V ? V t ) (Lemma 110), we obtain T X t=1 V(P t ;V ? V t ) = ~ O 0 @ B 2 S 2 A +B ? v u u t SA T X t=1 V(P t ;V ? ) + B ? H T X t=1 P t (V ? V t ) +B ? p SAC K 1 A ; and byVar[X +Y ] 2Var[X] + 2Var[Y ] and Lemma 25, T X t=1 V(P t ;V ? V lt ) = ~ O T X t=1 V(P t ;V ? V t ) +V(P t ;V t V lt ) ! = ~ O 0 @ B ? v u u t SA T X t=1 V(P t ;V ? ) +B 2 S 2 A + B ? H T X t=1 P t (V ? V t ) +B ? p SAC K 1 A : Moreover, by p x +y p x + p y and AM-GM inequality: v u u t SA T X t=1 V(P t ;V ? V lt ) = ~ O 0 B B @ v u u u t B ? SA v u u t SA T X t=1 V(P t ;V ? ) +BS 3=2 A + v u u t B ? SA H T X t=1 P t (V ? V t ) + q B ? SA p SAC K 1 C C A = ~ O 0 @ v u u t SA T X t=1 V(P t ;V ? ) +BS 3=2 A + v u u t B ? SA H T X t=1 P t (V ? V t ) + p SAC K 1 A : 134 Plug this back to Eq. (A.19): T X t=1 b t = ~ O 0 @ BS 3=2 A + v u u t SA T X t=1 V(P t ;V ? ) + v u u t B ? SA H T X t=1 P t (V ? V t ) + p SAC K 1 A : Lemma30. SupposeX t :S + ! [0; 1] is monotonicint(thatis,X t (s) isnon-decreasingornon-increasing int for alls2S + ), andX t (g) = 0. Dene: F n (d) = n X t=1 P t X 2 d t (X t (s 0 t )) 2 d ; G n (d) = n X t=1 V(P t ;X 2 d t ): Thenwithprobabilityatleast 1,foralln2N + simultaneously,G n (0) 3Y n + 9 n ;F n (0) p 3Y n n + 4 n , whereY n =S + 1 + P n t=1 (X t (s t )P t X t ) + ; n = 32 ln 3 4n 4 . Proof. Note that: G n (d) = n X t=1 P t X 2 d+1 t (P t X 2 d t ) 2 n X t=1 P t X 2 d+1 t (P t X t ) 2 d+1 (x p is convex forp> 1) = n X t=1 P t X 2 d+1 t X t (s 0 t ) 2 d+1 + n X t=1 X t (s 0 t ) 2 d+1 X t (s t ) 2 d+1 + n X t=1 X t (s t ) 2 d+1 (P t X t ) 2 d+1 (i) F n (d + 1) +S + 1 + 2 d+1 (X t (s t )P t X t ) + F (d + 1) + 2 d+1 Y n ; where in (i) we apply Lemma 111 and, n X t=1 X t (s 0 t ) 2 d+1 X t (s t ) 2 d+1 = n X t=1 X t (s 0 t ) 2 d+1 X t+1 (s 0 t ) 2 d+1 + n X t=1 X t+1 (s 0 t ) 2 d+1 X t (s t ) 2 d+1 S + n X t=1 X t+1 (s t+1 ) 2 d+1 X t (s t ) 2 d+1 =S +X n+1 (s n+1 ) 2 d+1 X 1 (s 1 ) 2 d+1 S + 1: (Lemma 113 andX t+1 (s 0 t )X t+1 (s t+1 )) 135 For a xedd;n, by Eq. (G.4) of Lemma 125, with probability 1 2n 2 dlog 2 n+1e , F n (d) p G n (d) n + n q (F n (d + 1) + 2 d+1 Y n ) n + n : Taking a union bound ond = 0;:::;dlog 2 ne, and by Lemma 112 with 1 =n; 2 = p n ; 3 =Y n ; 4 = n , we have: F n (1) maxf( p n + p 2 n ) 2 ; p 8Y n n + n g maxf6 n ; p 8Y n n + n g: Therefore,G n (0)F n (1) + 2Y n maxf6 n ;Y n + 9 n g + 2Y n 3Y n + 9 n , andF n (0) p G n (0) n + n p 3Y n n + 4 n . Taking a union bound overn2N + proves the claim. Lemma31. GivenX t :S + !RwithkX t k 1 B,withprobabilityatleast 1,itholdsthatforallt 1 simultaneously: (P t P t )X t = ~ O r SV(Pt;Xt) n + t + SB n + t : Proof. 
For a fixed state-action pair $(s,a)$, by Lemma 122, with probability $1-\frac{\delta}{SA}$, for any $t\geq 1$ such that $(s_t,a_t)=(s,a)$:
$$(\bar P_t - P_t)X_t = \sum_{s'}\big(\bar P_t(s') - P_t(s')\big)\big(X_t(s') - P_tX_t\big) \qquad \Big(\text{since }\textstyle\sum_{s'}\bar P_t(s') - P_t(s') = 0\Big)$$
$$= \tilde{\mathcal{O}}\left(\sum_{s'}\left(\sqrt{\frac{P_t(s')}{n_t^+}} + \frac{1}{n_t^+}\right)\big|X_t(s') - P_tX_t\big|\right) = \tilde{\mathcal{O}}\left(\sqrt{\frac{S\,\mathbb{V}(P_t,X_t)}{n_t^+}} + \frac{SB}{n_t^+}\right).$$
Taking a union bound over all state-action pairs, the statement is proved.

Lemma 32. $\sum_{t=1}^T \frac{1}{n_t^+} = \mathcal{O}(SA\ln T)$.

Proof. Define $J_{s,a}$ such that $E_{J_{s,a}} = n_T(s,a)$. It is easy to see that $e_{j+1}/e_j\leq 2$. Then,
$$\sum_{t=1}^T\frac{1}{n_t^+} \leq SA + \sum_{(s,a)}\sum_{j=1}^{J_{s,a}}\frac{e_{j+1}}{E_j} \leq SA + 2\sum_{(s,a)}\sum_{j=1}^{J_{s,a}}\frac{e_j}{E_j} = \mathcal{O}(SA\ln T).$$

A.5 Experiments

In this section, we benchmark known SSP algorithms empirically. We consider two environments, RandomMDP and GridWorld. In RandomMDP, there are 5 states and 2 actions, and both the transition and the cost function are chosen uniformly at random. In GridWorld, there are 12 states (including the goal state) and 4 actions (LEFT, RIGHT, UP, DOWN) forming a $3\times 4$ grid. The agent starts at the upper left corner of the grid, and the goal state is at the lower right corner. Each action initiates an attempt to move one step in the indicated direction with probability 0.85, and in one of the other three directions chosen at random with probability 0.15. The movement attempt fails if the agent tries to move out of the grid, in which case the agent stays at the same position. The cost is 1 for each state-action pair. In our experiments, $B_\star\approx 1.5$ and $c_{\min}\approx 0.04$ in RandomMDP, and $B_\star\approx 6$ and $c_{\min}=1$ in GridWorld. A minimal sketch of the GridWorld environment is given below.
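The following Python sketch is a reconstruction of the GridWorld SSP from the textual description above (the class and method names are ours, and the random misstep is split uniformly over the other three directions, which the description leaves implicit); it is not the code used for the experiments. RandomMDP is analogous, with its transition and cost tables drawn uniformly at random.

```python
import numpy as np

class GridWorldSSP:
    """3x4 GridWorld SSP: start at the upper-left corner, goal at the
    lower-right corner, unit cost per step (sketch, not the experiment code)."""

    ACTIONS = {0: (0, -1), 1: (0, 1), 2: (-1, 0), 3: (1, 0)}  # LEFT, RIGHT, UP, DOWN
    ROWS, COLS = 3, 4

    def __init__(self, slip=0.15, seed=0):
        self.slip = slip
        self.rng = np.random.default_rng(seed)
        self.goal = (self.ROWS - 1, self.COLS - 1)
        self.reset()

    def reset(self):
        self.pos = (0, 0)  # upper-left corner
        return self.pos

    def step(self, action):
        # With probability 0.85 attempt the chosen direction; otherwise attempt
        # one of the other three directions (assumed uniform among them).
        if self.rng.random() < 1 - self.slip:
            attempt = action
        else:
            attempt = int(self.rng.choice([a for a in self.ACTIONS if a != action]))
        dr, dc = self.ACTIONS[attempt]
        r, c = self.pos[0] + dr, self.pos[1] + dc
        # Moving off the grid fails: the agent stays where it is.
        if 0 <= r < self.ROWS and 0 <= c < self.COLS:
            self.pos = (r, c)
        return self.pos, 1.0, self.pos == self.goal  # (next state, cost, reached goal?)
```

A benchmark episode simply calls `reset()` once and then `step()` repeatedly until the goal flag comes back `True`.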
We implement two model-free algorithms, Q-learning with $\epsilon$-greedy exploration (Yu and Bertsekas, 2013) and LCB-Advantage-SSP, and five model-based algorithms: UC-SSP (Tarbouriech et al., 2020)*, Bernstein-SSP (Cohen et al., 2020), ULCVI (Cohen et al., 2021), EB-SSP (Tarbouriech et al., 2021b), and SVI-SSP. For each algorithm, we optimize hyper-parameters for the best possible results. Moreover, instead of incorporating the logarithmic terms from the confidence intervals suggested by the theory, we treat this term as a hyper-parameter and search for its best value. The hyper-parameters used in the experiments are shown in Table A.2. All experiments are performed on Google Cloud Platform on a compute engine with machine type "e2-medium".

*We implement a variant of UC-SSP with a fixed pivot horizon for much better empirical performance, where the pivot horizon is fixed to $10^6$ throughout (see their Algorithm 2 for its definition).

The plot of accumulated regret is shown in Figure A.1. Q-learning with $\epsilon$-greedy exploration suffers linear regret, indicating that naive $\epsilon$-greedy exploration is inefficient. UC-SSP and SVI-SSP show competitive results in both environments. SVI-SSP also consistently outperforms EB-SSP, both of which are minimax-optimal and horizon-free. In Table A.1, we also show the time spent in updates (policy, accumulators, etc.) over the whole learning process for each algorithm. Our model-based algorithm SVI-SSP spends the least time in updates among all algorithms, confirming our theoretical arguments. ULCVI and UC-SSP spend the most time in updates, which is reasonable since these two algorithms compute a new policy in every episode instead of performing exponentially sparse updates.

[Figure A.1: accumulated regret versus episode (0 to 3000) for each algorithm; left panel: RandomMDP, right panel: GridWorld.]
Figure A.1: Accumulated regret of each algorithm on RandomMDP (left) and GridWorld (right) in 3000 episodes. Each plot is an average of 500 repeated runs, and the shaded area is the 95% confidence interval. Dotted lines represent model-free algorithms and solid lines represent model-based algorithms.

Table A.1: Average time (in seconds) spent in updates in 3000 episodes for each algorithm. Our model-based algorithm SVI-SSP is the most efficient algorithm.

    Algorithm                     RandomMDP    GridWorld
    Q-learning with ε-greedy         0.3385       0.3773
    LCB-Advantage-SSP                0.3517       0.3982
    UC-SSP                          14.4472       8.6886
    Bernstein-SSP                    0.2918       0.4656
    ULCVI                           15.7128      22.8062
    EB-SSP                           0.2319       0.4619
    SVI-SSP                          0.1207       0.1419

Table A.2: Hyper-parameters used in the experiments. We search the best parameters for each algorithm.

    RandomMDP
    Q-learning with ε-greedy:   ε = 0.05
    LCB-Advantage-SSP:          H = 5, 0.05, 4096
    UC-SSP:                     1.0
    Bernstein-SSP:              2.0
    ULCVI:                      H = 80, 2.0
    EB-SSP:                     0.05
    SVI-SSP:                    H = 15, 0.05

    GridWorld
    Q-learning with ε-greedy:   ε = 0.05
    LCB-Advantage-SSP:          H = 5, 0.1, 4096
    UC-SSP:                     0.5
    Bernstein-SSP:              0.5
    ULCVI:                      H = 100, 1.0
    EB-SSP:                     0.01
    SVI-SSP:                    H = 10, 0.01

Appendix B
Omitted Details in Chapter 3

B.1 Omitted Details for Section 3.2

Notation. For $\widetilde M$, denote by $V^\pi_h(s)$ the expected cost of executing policy $\pi$ starting from state $s$ in layer $h$, and by $\pi_m$ the policy executed in interval $m$ (for example, $\pi_m(s,h)=\operatorname{argmin}_a Q^m_h(s,a)$ in Algorithm 8). For notational convenience, define $P^m_h=\widetilde P_{s^m_h,a^m_h}$, and $w^\star_h=\theta^\star+\int V^\star_{h+1}(s')\,d\mu(s')$ for $h\in[H]$, so that $Q^\star_h(s,a)=\phi(s,a)^\top w^\star_h$. Define the indicator $\mathbb{I}_s(s')=\mathbb{I}\{s=s'\}$, and the auxiliary feature $\phi(g,a)=0\in\mathbb{R}^d$ for all $a\in\mathcal{A}$, so that $\widetilde c(s,a)=\phi(s,a)^\top\theta^\star$ and $\widetilde P_{s,a}V=\phi(s,a)^\top\int V(s')\,d\mu(s')$ for any state-action pair $(s,a)$ and any $V:\mathcal{S}^+\to\mathbb{R}$ with $V(g)=0$. Finally, for Algorithm 8, define the stopping time $\inf_m\{m\leq M:\exists h\in[H],\ \widehat Q^m_h(s^m_h,a^m_h)>Q^m_h(s^m_h,a^m_h)\}$, which is the number of intervals until finishing $K$ episodes or until the upper-bound truncation on the $Q$ estimate is triggered.

B.1.1 Formal Definition of $Q^\star_h$ and $V^\star_h$

It is not hard to see that we can define $Q^\star_h$ and $V^\star_h$ recursively without resorting to the definition of $\widetilde M$:
$$Q^\star_h(s,a)=\widetilde c(s,a)+\widetilde P_{s,a}V^\star_{h+1},\qquad V^\star_h(s)=\min_a Q^\star_h(s,a),$$
with $Q^\star_{H+1}(s,a)=c_f(s)$ for all $(s,a)$.

B.1.2 Proof of Lemma 11

Proof. Denote by $I_k$ the set of intervals in episode $k$, and by $m_k$ the first interval in episode $k$. We bound the regret in episode $k$ as follows: by Lemma 118 and $H\geq 4T_{\max}\ln(4K)$, the probability that following $\pi^\star$ takes more than $H$ steps to reach $g$ in $\widetilde M$ is at most $\frac{1}{2K}$. Therefore,
$$V^{\pi^\star}_1(s)\leq V^{\pi^\star}(s)+2B_\star\,\mathbb{P}\big(s_{H+1}\neq g\,\big|\,\pi^\star,P,s_1=s\big)\leq V^{\pi^\star}(s)+\frac{B_\star}{K}.$$
Thus,
$$\sum_{m\in I_k}\sum_{h=1}^H c^m_h-V^{\pi^\star}(s^{m_k}_1)\leq\sum_{m\in I_k}\sum_{h=1}^H c^m_h-V^{\pi^\star}_1(s^{m_k}_1)+\frac{B_\star}{K}$$
$$=\left(\sum_{m\in I_k}\sum_{h=1}^H c^m_h-V^{\pi^\star}_1(s^m_1)\right)+\sum_{m\in I_k}\left(V^{\pi^\star}_1(s^m_1)-V^{\pi^\star}_1(s^{m_k}_1)\right)+\frac{B_\star}{K}$$
$$\leq\sum_{m\in I_k}\left(\sum_{h=1}^H c^m_h+c_f(s^m_{H+1})-V^{\pi^\star}_1(s^m_1)\right)+\frac{B_\star}{K},$$
where the last step uses $V^{\pi^\star}_1(s)\leq 2B_\star$ and $\sum_{m\in I_k}\big(V^{\pi^\star}_1(s^m_1)-V^{\pi^\star}_1(s^{m_k}_1)\big)\leq 2B_\star(|I_k|-1)=\sum_{m\in I_k}c_f(s^m_{H+1})$. Summing the inequality above over $k\in[K]$ and using the definitions of $R_K$ and $\widetilde R_M$, we obtain the desired result.

B.1.3 Proof of Lemma 5

We first bound the error of one-step value iteration w.r.t. $\widehat Q^m_h$ and $V^m_{h+1}$, which is essential to our analysis.

Lemma 33. For any $B\geq\max\{1,\max_s c_f(s)\}$, with probability at least $1-\delta$, we have $0\leq\widetilde c(s,a)+\widetilde P_{s,a}V^m_{h+1}-\widehat Q^m_h(s,a)\leq 2\beta_m\|\phi(s,a)\|_{\Lambda_m^{-1}}$ and $V^m_h(s)\leq V^\star_h(s)$ for any $m\in\mathbb{N}^+$, $h\in[H]$.

Proof. Define $\widetilde w^m_h=\theta^\star$
+ R V m h+1 (s 0 )d(s 0 ), so that(s;a) > e w m h =e c(s;a) + e P s;a V m h+1 . Then, e w m h w m h = 1 m m e w m h m1 X m 0 =1 H X h 0 =1 m 0 h 0 (c m 0 h 0 +V m h+1 (s m 0 h 0 +1 )) ! = 1 m e w m h + 1 m m1 X m 0 =1 H X h 0 =1 m 0 h 0 (P m 0 h 0 V m h+1 V m h+1 (s m 0 h 0 +1 )) | {z } m h : ByV m h+1 (s)B and Lemma 39, we have with probability at least 1, for anym,h2 [H]: k m h k m 2B s d 2 ln mH + + ln N " + p 8mH" p ; (B.1) whereN " is the"-cover of the function class ofV m h+1 with" = 1 mH . Note thatV m h+1 (s) is eitherc f (s) or V m h+1 (s) = min a (s;a) > w m q (s;a) > (s;a) [0;B] ; for some PSD matrix such that 1 +mH min () max () 1 by the denition of 1 m , and for some w2R d such thatkwk 2 max ()mHsup s;a k(s;a)k 2 (B +1) mH (B +1) by the denition of w m h . We denote byV the function class ofV m h+1 . Now we apply Lemma 40 toV with = (w; ),n =d 2 +d, D = mH p d(B + 1)= maxf mH (B + 1); p d= 2 g (note thatj i;j jkk F = q P d i=1 2 i () p d= 2 ), andL = m p +mH, which is given by [x] [0;B] [y] [0;B] jxyj (Vial et al., 2021, Claim 2) and the following calculation: for any w =e i for some6= 0, 1 jj (w + w) > (s;a)w > (s;a) = e > i (s;a) k(s;a)k 1; 142 and for any =e i e > j , 1 jj m q (s;a) > ( + )(s;a) m q (s;a) > (s;a) m (s;a) > e i e > j (s;a) p (s;a) > (s;a) ( p u +v p u jvj p u ) m (s;a) > ( 1 2 e i e > i + 1 2 e j e > j )(s;a) p (s;a) > (s;a) (jabj 1 2 (a 2 +b 2 )) m (s;a) > (s;a) p (s;a) > (s;a) m p min () m p +mH: Lemma 40 then implies lnN " (d 2 +d) ln 32d 2:5 Bm 2 H 2 m " . Plugging this back, we get k m h k m m 2 : (B.2) Moreover,ke w m h k 1 m ke w m h k 2 = p p d=(1 +B). Thus, kw m h e w m h k m ke w m h k 1 m +k m h k m m : Therefore,e c(s;a)+ e P s;a V m h+1 b Q m h (s;a) =(s;a) > (e w m h w m h )+ m k(s;a)k 1 m 2 [0; 2 m k(s;a)k 1 m ] by(s;a) > (e w m h w m h )2 [k(s;a)k 1 m kw m h e w m h k m ;k(s;a)k 1 m kw m h e w m h k m ], and the rst statement is proved. For anym2N + , we prove the second statement by induction onh =H + 1;:::; 1. The base caseh = H + 1 is clearly true byV m h+1 (s) = V ? h+1 (s) = c f (s). Forh H, we have by the induction step: b Q m h (s;a)e c(s;a) + e P s;a V m h+1 e c(s;a) + e P s;a V ? h+1 Q ? h (s;a): Thus,V m h (s) min a maxf0; b Q m h (s;a)g min a Q ? h (s;a) =V ? h (s). 143 Next, we prove a general regret bound, from which Lemma 5 is a direct corollary. Lemma34. Assumec f (s)H. Thenwithprobabilityatleast 1 2,Algorithm8ensuresforanyM 0 M e R M 0 = ~ O p d 3 B 2 HM 0 +d 2 BH : Proof. Denec m H+1 =c f (s m H+1 ). Note that form<M, we haveV m h (s m h ) = maxf0; b Q m h (s m h ;a m h )g, and with probability at least 1, H+1 X h=1 c m h V ? 1 (s m 1 ) H+1 X h=1 c m h V m 1 (s m 1 ) H+1 X h=1 c m h b Q m 1 (s m 1 ;a m 1 ) H+1 X h=2 c m h P m 1 V m 2 + 2 m k(s m 1 ;a m 1 )k 1 m (Lemma 33) = H+1 X h=2 c m h V m 2 (s m 2 ) + (I s m 2 P m 2 )V m 2 + 2 m k(s m 1 ;a m 1 )k 1 m H X h=1 (I s m h+1 P m h+1 )V m h+1 + 2 m k(s m h ;a m h )k 1 m ; where the last step is byc m H+1 =V m H+1 (s m H+1 ). Therefore, by Lemma 37 and Lemma 123, with probability at least 1: e R M 0 e R M 0 1 +H M 0 1 X m=1 H X h=1 (I s m h+1 P m h+1 )V m h+1 + 2 m k(s m h ;a m h )k 1 m +H = ~ O p d 3 B 2 HM 0 +d 2 BH : We are now ready to prove Lemma 5. Proof of Lemma 5. Note that whenB = 3B ? ,V m h (s)V ? h (s) 3B ? =B by Lemma 33. Thus,M =M, and the statement directly follows from Lemma 34 withM 0 =M. 144 Algorithm20 Adaptive Finite-Horizon Approximation of SSP Input: upper bound estimateB and functionU(B) from Lemma 36. 
Initialize:A an instance of nite-horizon algorithm with horizond 10B c min ln(8BK)e. Initialize:m = 1,m 0 = 0,k = 1,s =s init . whilekK do ExecuteA forH steps starting from states and receives m H+1 . ifs m H+1 =g then k k + 1,s s init ;elsem 0 m 0 + 1,s s m H+1 . 1 ifm 0 >U(B) orA detectsB H=2 c m h 2B. Denote byP m () the conditional probability of certain event conditioning on the history before intervalm. Then with probability at least 1, 2B M 0 X m=1 P m (s m H+1 6=g) + M 0 X m=1 ( e V m 1 (s m 1 ) e V ? 1 (s m 1 )) M 0 2K + M 0 X m=1 (V m 1 (s m 1 )V ? 1 (s m 1 )) (2B P M 0 m=1 P m (s m H+1 6=g) + e V m 1 (s m 1 )V m 1 (s m 1 ) andV ? 1 (s)V ? (s) e V ? 1 (s) + 1 4K ) 1 2K M 0 X m=1 Ifs m H+1 6=gg + ~ O 0 (B) + 1 (B) p M 0 (M 0 K + P M 0 m=1 Ifs m H+1 6=gg and guarantee ofA) 1 K M 0 X m=1 P m (s m H+1 6=g) + ~ O 0 (B) + 1 (B) p M 0 : (Lemma 126) Then by e V m 1 (s m 1 ) e V ? 1 (s m 1 ) and reorganizing terms, we get P M 0 m=1 P m (s m H+1 6= g) = ~ O( 0 (B)=B + 1 (B) p M 0 =B). Again by Lemma 126, we have with probability at least 1: M 0 X m=1 Ifs m H+1 6=gg = ~ O M 0 X m=1 P m (s m H+1 6=g) ! = ~ O 0 (B)=B + 1 (B) p M 0 =B : ByM 0 K + P M 0 m=1 Ifs m H+1 6=gg and solving a quadratic inequality w.r.t. q P M 0 m=1 Ifs m H+1 6=gg, we get P M 0 m=1 Ifs m H+1 6=gg = ~ O( 0 (B)=B + 1 (B) 2 =B 2 + 1 (B) p K=B). Thus, we also get the same bound for P M m=1 Ifs m H+1 6=gg. Note that Lemma 35 and Lemma 36 together implies a ~ O( p K) regret bound whenBB ? . Moreover, since the total number of “bad” intervals is of order ~ O( p K), we can properly bound the cost of running nite-horizon algorithm with wrong estimates onB ? . We now present an adaptive version of nite-horizon approximation of SSP (Algorithm 20) which does not require the knowledge ofB ? orT max . The main idea is to perform nite-horizon approximation with zero costs, and maintain an estimateB ofB ? . The learner runs a nite-horizon algorithm with horizon of order ~ O( B c min ). WheneverA detectsBB ? , or the number of “bad” intervals is more than expected (Line 1), it doubles the estimateB and start a new instance of 147 nite-horizon algorithm with the updated estimate. The guarantee of Algorithm 20 is summarized in the following theorem. Theorem27. SupposeAtakesanestimateB asinput,andwhenB 2 det(m) H X h=1 k m h k 1 m p 2 M 0 X m2I H X h=1 k m h k 1 m+1 +O M 0d ln(M 0 H=)H (2 m < m+1 by Lemma 38, and det( M 0)= det( 0 ) (( +M 0 H)=) d ) =O 0 @ M 0 v u u t HjIj X m2I H X h=1 m h 2 1 m+1 + M 0dH ln(M 0 H) 1 A (Cauchy-Schwarz inequality) =O p d 3 B 2 HjIj ln dBM 0 H +d 2 BH ln 1:5 dBM 0 H ; where the last step is by (Jin et al., 2020b, Lemma D.2), = 1, and denition of M 0. Lemma38. (Abbasi-Yadkori,Pál,andSzepesvári,2011,Lemma12)LetA,B bepositivesemi-denitematrices such thatA Ax x > Bx det(A) det(B) . Lemma 39. (Wei et al., 2021, Lemma 11) Letfx t g 1 t=1 be a martingale sequence on state spaceX w.r.t. a ltrationfF t g 1 t=0 ,f t g 1 t=1 be a sequence of random vectors inR d so that t 2F t1 andk t k 1, t =I + P t1 s=1 s > s ,andVR X beasetoffunctionsdenedonX withN " asits"-coveringnumberw.r.t. the distance dist(v;v 0 ) = sup x jv(x)v 0 (x)j for some"> 0. Then for any> 0, we have with probability at least 1, for allt> 0 andv2V so that sup x jv(x)jB: t1 X s=1 s (v(x s )E[v(x s )jF s1 ]) 2 1 t 4B 2 d 2 ln t + + ln N " + 8t 2 " 2 : 150 Lemma 40. (Wei et al., 2021, Lemma 12) LetV be a class of mappings fromX to R parameterized by 2 [D;D] n . 
Suppose that for anyv2V (parameterized by) andv 0 2V 0 (parameterized by 0 ), the following holds: sup x2X v(x)v(x 0 ) L 0 1 : Then, lnN " n ln 2DLn " , whereN " is the"-covering number ofV with respect to the distancedist(v;v 0 ) = sup x2X jv(x)v 0 (x)j. 151 AppendixC OmittedDetailsinChapter4 C.1 OmitteddetailsforSection4.2 In this section, we provide all proofs for Section 4.2. C.1.1 ProofofTheorem9 Proof. By standard OMD analysis (see for example Eq. (12) of (Rosenberg and Mansour, 2021)), for any q2 (T ) we have: K X k=1 hq k q;c k iD (q;q 1 ) + K X k=1 q k q 0 k+1 ;c k ; (C.1) whereq 0 k+1 = argmin q2R hq;c k i+D (q;q k ), or equivalently, with the particular choice of the regularizer, q 0 k+1 (s;a) =q k (s;a)e c k (s;a) . Applying the inequality 1e x x, we obtain K X k=1 q k q 0 k+1 ;c k K X k=1 X (s;a) q k (s;a)c 2 k (s;a) K X k=1 hq k ;c k i: 152 Substituting this back into Eq. (C.1), choosingq =q ? (recall the conditionq ?2 (T ) of the lemma), and rearranging, we arrive at K X k=1 hq k q ?;c k i 1 1 D (q ?;q 1 ) + K X k=1 hq ?;c k i ! 2D (q ?;q 1 ) + 2 K X k=1 hq ?;c k i: (C.2) It remains to bound the last two terms. For the rst one, since q 1 minimizes over (T ), we have hr (q 1 );q ?q 1 i 0, and thus D (q ?;q 1 ) (q ?) (q 1 ) = 1 X (s;a) q ?(s;a) lnq ?(s;a) 1 X (s;a) q 1 (s;a) lnq 1 (s;a) 1 X (s;a) q ?(s;a) lnT T X (s;a) q 1 (s;a) T ln q 1 (s;a) T T ln(T ) + T ln(SA) = T ln(SAT ) : For the second one, we use the fact P K k=1 hq ?;c k i P K k=1 hq f;c k iDK. Put together, this implies K X k=1 hq k q ?;c k i 2T ln(SAT ) + 2DK: With the optimal = min 1 2 ; q T ln(SAT ) DK , we have thus shown E[R k ] =E " K X k=1 hq k q ?;c k i # =O p DTK ln(SAT ) +T ln(SAT ) = ~ O p DTK ; completing the proof. 153 C.1.2 ProofofTheorem10 Proof. By Yao’s minimax principle, in order to obtain a regret lower bound, it suces to show that there exists a distribution of SSP instances that forces any deterministic learner to suer a regret bound of p DT ? K in expectation. Below we describe such a distribution (the MDP is xed but the costs are stochastic). • The state space isS =fs 0 ;s 1 ;:::;s N ;fg for anyN 2, ands init =s 0 . • At state s 0 , there are N available actions a 1 ;:::;a N ; at each state of s 1 ;:::;s N , there are two available actionsa g anda f ; and at statef, there is only one actiona g . • At states 0 , taking actiona j transits to states j deterministically for allj2 [N]. At any states j (j2 [N]), taking actiona f transits to statef deterministically, while taking actiona g transits to the goal stateg with probability 1=T ? and stays at the same state with probability 1 1=T ? . Finally, at statef, taking actiona g transits to the goal stateg with probability 1=D and stays with probability 1 1=D. • The cost at states 0 is always zero, that is,c k (s 0 ;a) = 0 for allk anda; the cost of actiona f is also always zero, that is,c k (s;a f ) = 0 for allk ands2fs 1 ;:::;s N g; the cost at statef is always one, that is,c k (f;a g ) = 1 for allk; nally, the cost of taking actiona g at states2fs 1 ;:::;s N g is generated stochastically as follows: rst, a good statej ? 2 [N] is sampled uniformly at random ahead of time and then xed throughout theK episodes; then, in each episodek,c k (s;a g ) is an independent sample of Bernoulli( D 2T? ) ifs =s j ?, and an independent sample of Bernoulli( D 2T? +) ifs6=s j ?, for some D 2T? to be specied later. 
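For concreteness, the following Python sketch samples the stochastic costs of the instance just described (the function name and the specific parameter values in the example call are ours; it assumes NumPy and requires $\epsilon\leq D/(2T_\star)$ as in the proof). Everything else about the instance, namely the deterministic first transition, the $1/T_\star$ and $1/D$ goal-reaching probabilities, and the zero costs at $s_0$ and at action $a_f$, is as listed above.

```python
import numpy as np

def sample_costs(N, D, T_star, eps, K, seed=0):
    """Sample c_k(s_j, a_g) for all episodes k and states s_1, ..., s_N in the
    lower-bound instance: a hidden good state j* is drawn once, its action a_g
    costs Bernoulli(D / (2 T_star)), and every other state's a_g costs
    Bernoulli(D / (2 T_star) + eps)."""
    rng = np.random.default_rng(seed)
    j_star = int(rng.integers(N))            # the good state, fixed for all K episodes
    means = np.full(N, D / (2 * T_star) + eps)
    means[j_star] = D / (2 * T_star)         # only the good state's action is cheaper
    costs = rng.binomial(1, means, size=(K, N))
    return j_star, costs

# Committing to state s_j costs about T_star * means[j] in expectation per episode,
# so any suboptimal first action loses T_star * eps; the adversary hides j* among N states.
j_star, costs = sample_costs(N=8, D=4, T_star=16, eps=0.05, K=10_000)
print("good state:", j_star, "empirical means:", costs.mean(axis=0).round(3))
```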
It is clear that in all these SSP instances, the diameter isD + 2 (since one can reach the goal state via the fast statef within at mostD + 2 steps in expectation), and the hitting time of the optimal policy is indeed T ? + 1 (in fact, the hitting time of any stationary deterministic policy is eitherT ? + 1 orD + 2T ? + 1). 154 It remains to argueE[R K ] = p DT ? K for any deterministic learner, where the expectation is over the randomness of the costs. To do so, letE j denote the conditional expectation given that the good statej ? is j. Then we have E[R K ] = 1 N N X j=1 E j " K X k=1 I k X i=1 c k (s k i ;a k i ) min 21 K X k=1 V k (s 0 ) #! 1 N N X j=1 E j " K X k=1 I k X i=1 c k (s k i ;a k i ) K X k=1 V j k (s 0 ) #! ; where j is the policy that picks actiona j at states 0 anda g at states j (other states are irrelevant). Note that it takesT ? steps in expectation for j to reachg froms j and each step incur expected cost D 2T? , which meansE j [V j k (s 0 )] = D 2T? T ? = D 2 . On the other hand, the learner is always better o not visitingf at all, since starting from statef, the expected cost before reachingg isD, while the expected cost of reaching the goal state via any other states is at most D 2T? + T ? D. Therefore, depending on whether the learner selects the good actiona j ? or not at the rst step, we further lower bound the expected regret as E[R K ] 1 N N X j=1 K X k=1 E j D 2 +T ? Ifa 1 k 6=a j g D 2 =T ? K T ? N N X j=1 E j [K j ]; whereK j = P K k=1 Ifa 1 k =a j g. It thus suces to upper bound P N j=1 E j [K j ]. To do so, consider a reference environment without a good state, that is,c k (s;a g ) is an independent sample of Bernoulli( D 2T? +) for allk and alls2fs 1 ;:::;s N g. Denote byE 0 the expectation with respect to this reference environment, and byP 0 the distribution of the learner’s observation in this environment (P j is dened similarly). Then with the factK j K and Pinsker’s inequality, we have E j [K j ]E 0 [K j ]KkP j P 0 k 1 K q 2KL(P 0 ;P j ): 155 By the divergence decomposition lemma (see e.g. (Lattimore and Szepesvári, 2020, Lemma 15.1)) and the nature of the full-information setting, we further have KL(P 0 ;P j ) = N X j 0 =1 E 0 [K j 0] KL Bernoulli D 2T ? + ; Bernoulli D 2T ? =K KL Bernoulli D 2T ? + ; Bernoulli D 2T ? K 2 (1) ; where the last step is by (Gerchinovitz and Lattimore, 2016, Lemma 6) with = D 2T? . Therefore, we have N X j=1 E j [K j ] N X j=1 E 0 [K j ] +NK s 2K 2 (1) =K +NK s 2K 2 (1) : This is enough to show the claimed lower bound: E[R K ]T ? K T ? N N X j=1 E j [K j ] T ? K T ? N " K +NK s 2K 2 (1) # =T ? K " 1 1 N s 2K (1) # T ? K " 1 2 s 2K (1) # = T ? K 16 r (1) 2K = p DT ? K ; where in the last line we choose = 1 4 q (1) 2K 1 8 q D T?K D 2T? to maximize the lower bound. 156 C.1.3 ProofofLemma6 Proof. With the inequality ( P I i=1 a i ) 2 2 P i a i ( P I i 0 =i a i 0), we proceed as E 2 4 0 @ X (s;a) N k (s;a)c k (s;a) 1 A 2 3 5 =E 2 4 0 @ I k X i=1 X (s;a) Ifs k i =s;a k i =agc k (s;a) 1 A 2 3 5 2E 2 4 I k X i=1 X (s;a) Ifs k i =s;a k i =ag 0 @ I k X i 0 =i X (s 0 ;a 0 ) Ifs k i 0 =s 0 ;a k i 0 =a 0 gc k (s 0 ;a 0 ) 1 A 3 5 = 2E 2 4 I k X i=1 X s2S Ifs k i =sgE 2 4 I k X i 0 =i X (s 0 ;a 0 ) Ifs k i 0 =s 0 ;a k i 0 =a 0 gc k (s 0 ;a 0 ) s k i =s 3 5 3 5 = 2E " I k X i=1 X s2S Ifs k i =sgV k (s) # = 2 X s2S q (s)V k (s) = 2hq ;V k i; completing the proof. C.1.4 ProofofLemma7 Proof. 
Applying Lemma 6 (to the loop-free instance), we have E D e N k ;c k E 2 2 X e s2 e S q e (e s)V e k (e s) = 2 H X h=1 X s2S[fs f g q e (s;h)V e k (s;h): Denoteq e ;(s;h) as the occupancy measure of policye with initial state (s;h), so that V e k (s;h) = X (s 0 ;a 0 )2 e X h 0 h q e ;(s;h) (s 0 ;a 0 ;h 0 )c k (s 0 ;a 0 ;h 0 ): 157 Then we continue with the following equalities: H X h=1 X s2S[fs f g q e (s;h)V e k (s;h) = H X h=1 X s2S[fs f g q e (s;h) X (s 0 ;a 0 )2 e X h 0 h q e ;(s;h) (s 0 ;a 0 ;h 0 )c k (s 0 ;a 0 ;h 0 ) = H X h=1 X (s 0 ;a 0 )2 e X h 0 h 0 @ X s2S[fs f g q e (s;h)q e ;(s;h) (s 0 ;a 0 ;h 0 ) 1 A c k (s 0 ;a 0 ;h 0 ) = H X h=1 X (s 0 ;a 0 )2 e X h 0 h q e (s 0 ;a 0 ;h 0 )c k (s 0 ;a 0 ;h 0 ) = H X h=1 X (s;a)2 e hq e (s;a;h)c k (s;a;h) = D q e ; ~ hc k E : (C.3) where in the third line we use the equality P s2S[fs f g q e (s;h)q e ;(s;h) (s 0 ;a 0 ;h 0 ) =q e (s 0 ;a 0 ;h 0 ) by denition (since both sides are the probability of visiting (s 0 ;a 0 ;h 0 )). This completes the proof. C.1.5 ProofofLemma8 Proof. We rst prove the second statementE[R K ] E[Reg] + ~ O (1). Since the fast policy reaches the goal state withinD steps in expectation starting from any state, by the denition of(e ) and f M, we have V (e ) k (s 0 )V e k (e s 0 ) for anye , that is, the expected cost of executing(e ) inM is not larger than that of executinge in f M. On the other hand, since the probability of not reaching the goal state withinH 1 steps when executing ? is at most: 2e H 1 4Tmax 2 K 2 by Lemma 118 and the choice ofH 1 , the expected cost of ? inM and the expected cost ofe ? in f M is very similar: V e ? k (e s init )V ? k (s init ) + 2H 2 K 2 =V ? k (s init ) + ~ O 1 K : (C.4) 158 This proves the second statement: E[R K ] =E " K X k=1 V (e k ) k (s init )V ? k (s init ) # E " K X k=1 V e k k (e s init )V e ? k (e s init ) # + ~ O (1) =E[Reg] + ~ O (1): To prove the rst statement, we apply Lemma 118 again to show that for each episodek, the probability of the learner not reachingg withinH steps is at most 2e H 2 4D = 2K . With a union bound, this means, with probability at least 1 2 , the learner reaches the goal withinH steps for all episodes and thus her actual loss inM is not larger than that in f M: P K k=1 hN k ;c k i P K k=1 D e N k ;c k E . Together with Eq. (C.4), this shows R K K X k=1 D e N k q e ?;c k E + ~ O (1): It thus remains to bound the deviation P K k=1 D e N k q k ;c k E , which is the sum of a martingale dierence sequence. We apply Freedman’s inequality (Beygelzimer et al., 2011) directly: the variable D e N k ;c k E is bounded byH always, and its conditional variance is bounded by 2 D q k ; ~ hc k E as shown in Lemma 7, which means for any2 (0; 2=H], K X k=1 D e N k q k ;c k E K X k=1 D q k ; ~ hc k E + 2 ln ( 2 =) holds with probability at least 1 2 . Applying another union bound nishes the proof. 159 C.1.6 ProofofTheorem11 For completeness, we rst spell out the denition of e (T ), which is the exact counterpart of (T ) dened in Eq. (4.2) for f M (the rst equality below), but can be simplied using the special structure of e P (the second equality below). 
e (T ) = ( q2 [0; 1] e [H] : H X h=1 X (s;a)2 e q(s;a;h)T; X a2 e A (s;h) q(s;a;h) H X h 0 =1 X (s 0 ;a 0 )2 e e P ((s;h)j(s 0 ;h 0 );a 0 )q(s 0 ;a 0 ;h 0 ) =If(s;h) =e s 0 g;8(s;h)2 e S ) = ( q2 [0; 1] e [H] : H X h=1 X (s;a)2 e q(s;a;h)T; X a2As 0 q(s 0 ;a; 1) = 1; q(s;a; 1) = 0;8s6=s 0 anda2A s ; q(s;a;h) = 0;8(s;a)2 andh>H 1 ; q(s f ;a f ;h) =Ifh>H 1 g X (s 0 ;a 0 )2 q(s 0 ;a 0 ;H 1 ); X a2As q(s;a;h) = X (s 0 ;a 0 )2 P (sjs 0 ;a 0 )q(s 0 ;a 0 ;h 1);8s2S and 1<hH 1 : ) (C.5) Note thatq e ? belongs to e (T ? + 1) as shown in the following lemma. Lemma 41. The policye ? satisesT e ? (e s init ) = P H h=1 P (s;a)2 e q e ?(s;a;h) T ? + 1 and thusq e ? 2 e (T ? + 1). Proof. This is a direct application of the factT ? (s init ) =T ? and Lemma 118: the probability of not reaching the goal state withinH 1 steps when executing ? is at most: 2e H 1 4Tmax 2 K 2 : Therefore,T e ? (e s init ) T ? (s init ) + 2H 2 K 2 T ? + 1, nishing the proof. We also need the following lemma. Lemma42. The policye ? satises P K k=1 D q e ?; ~ hc k E =O (DT ? K). 160 Proof. We proceed as follows: K X k=1 D q e ?; ~ hc k E = K X k=1 H X h=1 X s2S[fs f g q e ?(s;h)V e ? k (s;h) = H X h=1 X s2S[fs f g q e ?(s;h) K X k=1 V e ? k (s;h) H X h=1 X s2S[fs f g q e ?(s;h) K X k=1 V e ? k (s; 1) H X h=1 X s2S[fs f g q e ?(s;h) ~ O (1) + K X k=1 V ? k (s) ! H X h=1 X s2S[fs f g q e ?(s;h) ~ O (1) + K X k=1 V f k (s) ! ~ O (1) +DK H X h=1 X s2S[fs f g q e ?(s;h) ~ O (DT ? K); where the rst line is by Eq. (C.3), the fourth line is by the same reasoning of Eq. (C.4), and the last line is by Lemma 41. We are now ready to prove Theorem 11. Proof. Dene ? =q e ? + ~ hq e ?. which belongs to the set by Lemma 41 and the conditionTT ? + 1. By the exact same reasoning of Eq. (C.2) in the proof of Theorem 9, OMD ensures K X k=1 h k ? ;c k i 2D ( ? ; 1 ) + 2 K X k=1 h ? ;c k i: 161 The last two terms can also be bounded in a similar way as in the proof of Theorem 9: for the rst term, since 1 minimizes over , we havehr ( 1 ); ? 1 i 0, and thus with the fact P H h=1 P (s;a) (s;a;h) T +HT 2T for any2 we obtain D ( ? ; 1 ) ( ? ) ( 1 ) = 1 H X h=1 X (s;a) ? (s;a;h) ln ? (s;a;h) 1 H X h=1 X (s;a) 1 (s;a;h) ln 1 (s;a;h) 2T ln(2T ) 2T H X h=1 X (s;a) 1 (s;a;h) 2T ln 1 (s;a;h) 2T 2T ln(2T ) + 2T ln(j e jH) =O T ln(SAHT ) ; for the second term, we have K X k=1 h e ?;c k i 2 K X k=1 hq e ?;c k i 2 K X k=1 V ? k (s 0 ) + ~ O (1) 2 K X k=1 V f k (s 0 ) + ~ O (1) 2DK + ~ O (1); where the second inequality is by Eq. (C.4). Combining the above and plugging the choice of, we arrive at K X k=1 h k ? ;c k iO T ln(SAHT ) + 2DK + ~ O (1) = ~ O p DTK : 162 Finally, we apply Lemma 8: with probability at least 1, R K K X k=1 hq k q e ?;c k i + K X k=1 D q k ; ~ hc k E + 2 ln ( 2 =) + ~ O (1) = K X k=1 h k ? ;c k i + K X k=1 D q e ?; ~ hc k E + 2 ln ( 2 =) + ~ O (1) = ~ O p DTK + K X k=1 D q e ?; ~ hc k E + 2 ln ( 2 =) = ~ O p DTK + ~ O (DTK) + 2 ln ( 2 =) (Lemma 42) = ~ O p DTK ; (by the choice of) which nishes the proof. C.2 OmitteddetailsforSection4.3 In this section, we provide all omitted algorithms and proofs for Section 4.3. C.2.1 OptimalExpectedRegret of Theorem 12. Using the second statement of Lemma 8, we have E [R K ] =E " K X k=1 hq k q e ?;c k i # + ~ O(1): 163 As in all analysis for OMD with log-barrier regularizer, we consider a slightly perturbed benchmark q ? = (1 1 TK )q e ? + 1 TK q 1 which is in e (T ) by the convexity of e (T ), the conditionTT ? + 1, and Lemma 41. We then have E [R K ]E " K X k=1 hq k q ? 
;c k i # + 1 TK 1 E " K X k=1 hq 1 ;c k i # + ~ O(1) =E " K X k=1 hq k q ? ;c k i # + ~ O (1): It remains to boundE h P K k=1 hq k q ? ;c k i i . Let ? =q ? + ~ hq ? 2 . By the non-negativity and the unbiasedness of the cost estimator, the obliviousness of the adversary, and the same argument of (Agarwal et al., 2017, Lemma 12), OMD with log-barrier regularizer ensures E " K X k=1 h k ? ;c k i # =E " K X k=1 h k ? ;b c k i # D ( ? ; 1 ) +E 2 4 K X k=1 X (s;a) 2 k (s;a)b c 2 k (s;a) 3 5 : For the rst term, as 1 minimizes , we havehr ( 1 ); ? 1 i 0 and thus D ( ? ; 1 ) 1 X (s;a) ln 1 (s;a) ? (s;a) = SA ln(HT ) = ~ O SA : For the second term, we note that E 2 4 X (s;a) 2 k (s;a)b c 2 k (s;a) 3 5 4E 2 4 X (s;a) q 2 k (s;a)b c 2 k (s;a) 3 5 = 4E 2 4 X (s;a) e N 2 k (s;a)c 2 k (s;a) 3 5 4E D e N k ;c k E 2 8E hD q k ; ~ hc k Ei ; where the last step is by Lemma 7. Combining everything, we have shown E " K X k=1 h k ? ;c k i # = ~ O SA + 8E " K X k=1 D q k ; ~ hc k E # ; 164 and thus E " K X k=1 hq k q ? ;c k i # =E " K X k=1 h k ? ;c k i # +E " K X k=1 D q ? ; ~ hc k E # E " K X k=1 D q k ; ~ hc k E # = ~ O SA + 8E " K X k=1 D q ? ; ~ hc k E # ( = 8) = ~ O SA +DTK : (Lemma 42) Plugging the choice of nishes the proof. C.2.2 ProofofTheorem13 Proof. By Yao’s minimax principle, in order to obtain a regret lower bound, it suces to show that there exists a distribution of SSP instances that forces any deterministic learner to suer a regret bound of p DT ? SAK in expectation. We use the exact same construction as in Theorem 10 withN =S 2 (note that the average number of actionsA isO(1)). The proof is the same up to the point where we show E[R K ]T ? K T ? N N X j=1 E j [K j ]; withK j = P K k=1 Ifa k 1 =a j g, and E j [K j ]E 0 [K j ]KkP j P 0 k 1 K q 2KL(P 0 ;P j ): 165 What is dierent is the usage of the divergence decomposition lemma (see e.g. (Lattimore and Szepesvári, 2020, Lemma 15.1)) due to the dierent observation model: KL(P 0 ;P j ) = N X j 0 =1 E 0 [K j 0] KL Bernoulli D 2T ? + ; Bernoulli D 2T ? +Ifj 0 6=jg =E 0 [K j ] KL Bernoulli D 2T ? + ; Bernoulli D 2T ? E 0 [K j ] 2 (1) ; where the last step is again by (Gerchinovitz and Lattimore, 2016, Lemma 6) with = D 2T? . Therefore, we can upper bound P N j=1 E j [K j ] as: N X j=1 E j [K j ] N X j=1 E 0 [K j ] +K s 2 2 (1) N X j=1 q E 0 [K j ] N X j=1 E 0 [K j ] +K v u u t 2N 2 (1) N X j=1 E 0 [K j ] (Cauchy-Schwarz inequality) =K +K s 2NK 2 (1) : ( P N j=1 E 0 [K j ] =K) This shows the following lower bound: E[R K ]T ? K T ? N N X j=1 E j [K j ] T ? K T ? N " K +K s 2NK 2 (1) # =T ? K " 1 1 N s 2K N(1) # T ? K " 1 2 s 2K N(1) # = T ? K 16 r N(1) 2K = p DT ? NK = p DT ? SAK ; 166 where in the last step we set = 1 4 q N(1) 2K 1 8 q SD T?K D 2T? to maximize the lower bound. C.2.3 OptimalHigh-probabilityRegret We present our algorithm with optimal high-probability regret in Algorithm 12. The key dierence compared to Algorithm 11 is the use of the extra bias term b b k in the OMD update and the time-varying individual learning k (s;a) for each state-action pair together with an increasing learning rate schedule (see the last for loop). Note that, similar to (Lee et al., 2020), the decision set has the extra constraintq(s;a) 1 TK 4 compared to Algorithm 10 and Algorithm 11, and it is always non-empty as long asK is large enough and every state is reachable withinH steps starting froms 0 (states not satisfying this can simply be removed without aecting f M). Below we present the proof of Theorem 14. 
It decomposes the regret into several terms, each of which is bounded by a lemma included after the proof. Proof of Theorem 14. We apply the rst statement of Lemma 8: with probability 1, R K K X k=1 D e N k q e ?;c k E + ~ O (1): Similar to the proof of Theorem 12, we dene a slightly perturbed benchmarkq ? = (1 1 TK )q e ? + 1 TK q 0 2 e (T ) for someq 0 2 e (T ) withq 0 (s;a) 1 K 3 for all (s;a)2 e (which again exists as long asK is large enough), so thatR K P K k=1 D e N k q ? ;c k E + ~ O(1) still holds. Also dene ? =q ? + ~ hq ? 2 and 167 b k 2R e such thatb k (s;a) = P h hq k (s;a;h)c k (s;a) q k (s;a) , which clearly satisesE k [ b b k ] =b k . We then decompose P K k=1 D e N k q ? ;c k E as K X k=1 D e N k q ? ;c k E = K X k=1 hq k ;b c k i K X k=1 hq ? ;c k i (h e N k ;c k i =hq k ;b c k i) = K X k=1 h k ? ;b c k i + K X k=1 h ? ;b c k c k i + K X k=1 D ~ hq ? ;c k E K X k=1 D ~ hq k ;b c k E = K X k=1 h k ? ;b c k i + K X k=1 h ? ;b c k c k i + ~ O (DTK) K X k=1 D ~ hq k ;b c k E (Lemma 42) = K X k=1 h k ? ;b c k i + ~ O (DTK) +Dev 1 +Dev 2 K X k=1 D ~ hq k ;c k E (deneDev 1 = P K k=1 h ? ;b c k c k i andDev 2 = P K k=1 D ~ hq k ;c k b c k E ) = Reg + ~ O (DTK) +Dev 1 +Dev 2 + K X k=1 D k ? ; b b k E K X k=1 D ~ hq k ;c k E (deneReg = P K k=1 h k ? ;b c k b b k i) = Reg + ~ O (DTK) +Dev 1 +Dev 2 +Dev 3 +Dev 4 + K X k=1 h k ? ;b k i K X k=1 D ~ hq k ;c k E (deneDev 3 = P K k=1 h k ; b b k b k i andDev 4 = P K k=1 h ? ;b k b b k i) Reg + ~ O (DTK) +Dev 1 +Dev 2 +Dev 3 +Dev 4 + 2 K X k=1 hq k ;b k i K X k=1 h ? ;b k i K X k=1 D ~ hq k ;c k E = Reg + ~ O (DTK) +Dev 1 +Dev 2 +Dev 3 +Dev 4 + (2 ) K X k=1 D q k ; ~ hc k E K X k=1 h ? ;b k i: (hq k ;b k i = D q k ; ~ hc k E ) 168 The Reg term can be upper bounded by the OMD analysis (see Lemma 43), and the four deviation termsDev 1 ;Dev 2 ;Dev 3 , andDev 4 are all sums of martingale dierence sequences and can be bounded using Azuma’s or Freedman’s inequality (see Lemma 44 and Lemma 45). Combining everything, we obtain R K ~ O SA h ? ; K i 70 lnK + 40 K X k=1 D q k ; ~ hc k E + ~ O (DTK) + 1 +C s 8 ln CSA ! h ? ; K i 0 + 0 * ? ; K X k=1 b k +! + 4CH ln CSA h ? ; K i + (2 ) K X k=1 D q k ; ~ hc k E K X k=1 h ? ;b k i = ~ O SA +DTK + 0 @ 1 +C q 8 ln CSA 0 + 4CH ln CSA 1 70 lnK 1 A h ? ; K i + (40 + 2 ) K X k=1 D q k ; ~ hc k E + 1 +C s 8 ln CSA ! 0 ! K X k=1 h ? ;b k i: Finally, note that 0 0 from Lemma 44 and Lemma 45 can be chosen arbitrarily. Now setting 0 = = 1 +C q 8 ln CSA , and plugging the choice of = 100 lnK 1 +C q 8 ln CSA 2 and = 40 + 2 , one can see that the coecients multiplying the last three termsh ? ; K i, P K k=1 D q k ; ~ hc k E , and P K k=1 h ? ;b k i are all non-positive. Therefore, we arrive at R K = ~ O SA +DTK ln ( 1 =) = ~ O p DTSAK ; where the last step is by the choice of. Lemma43. Algorithm 12 ensures with probability at least 1: Reg ~ O SA h ? ; K i 70 lnK + 40 K X k=1 D q k ; ~ hc k E + ~ O H 2 p SA : 169 Proof. Denote byn(s;a) the number of times the learning rate for (s;a) increases, such that K (s;a) = n(s;a) , and byk 1 ;:::;k n(s;a) the episodes where k (s;a) is increased, such that kt+1 (s;a) = kt (s;a). Since 1 (s;a) = 2T and 1 (s;a)2 n(s;a)1 k n(s;a) (s;a) 1 k n(s;a) +1 (s;a) 1 q k n(s;a) +1 (s;a) TK 4 ; we haven(s;a) 1 + log 2 K 4 2 7 log 2 K. Therefore, K (s;a)e 7 log 2 K 7 lnK 5. 
Now, notice that b b k (s;a) H P h q k (s;a;h)b c k (s;a) q k (s;a) = Hb c k (s;a)b c k (s;a): This means that the costb c k b b k we feed to OMD is always non-negative, and thus by the same argument of (Agarwal et al., 2017, Lemma 12), we have Reg = K X k=1 D k ? ;b c k b b k E K X k=1 D k ( ? ; k )D k ( ? ; k+1 ) + K X k=1 X (s;a) k (s;a) 2 k (s;a)(b c k (s;a) b b k (s;a)) 2 D 1 ( ? ; 1 ) + K1 X k=1 D k+1 ( ? ; k+1 )D k ( ? ; k+1 ) + 5 K X k=1 X (s;a) 2 k (s;a)b c 2 k (s;a) D 1 ( ? ; 1 ) + K1 X k=1 D k+1 ( ? ; k+1 )D k ( ? ; k+1 ) + 20 K X k=1 X (s;a) q 2 k (s;a)b c 2 k (s;a) =D 1 ( ? ; 1 ) + K1 X k=1 D k+1 ( ? ; k+1 )D k ( ? ; k+1 ) + 20 K X k=1 X (s;a) e N 2 k (s;a)c 2 k (s;a): For the rst term, since 1 minimizes 1 and thushr 1 ( 1 ); ? 1 i 0, we have D 1 ( ? ; 1 ) 1 ( ? ) 1 ( 1 ) = 1 X (s;a) ln 1 (s;a) ? (s;a) 1 X (s;a) ln 2H q ? (s;a) = ~ O SA : 170 For the second term, we dene(y) =y 1 lny and proceed similarly to (Agarwal et al., 2017): K1 X k=1 D k+1 ( ? ; k+1 )D k ( ? ; k+1 ) = K1 X k=1 X (s;a) 1 k+1 (s;a) 1 k (s;a) ? (s;a) k+1 (s;a) X (s;a) 1 n(s;a) ? (s;a) k n(s;a) +1 (s;a) ! = X (s;a) 1 n(s;a) ? (s;a) k n(s;a) +1 (s;a) 1 ln ? (s;a) k n(s;a) +1 (s;a) ! 1 35 lnK X (s;a) ? (s;a) K (s;a) 2 1 ln ? (s;a) k n(s;a) +1 (s;a) ! SA(1 + 6 lnK) 35 lnK h ? ; K i 70 lnK = ~ O SA h ? ; K i 70 lnK ; where in the last two lines we use the facts 1 1 7 lnK ; n(s;a) 5, K (s;a) = 2 k n(s;a) +1 (s;a) , and ln q ? (s;a) q k n(s;a) +1 (s;a) ln(HTK 4 ) 6 lnK. Finally, for the third term, since P (s;a) e N 2 k (s;a)c 2 k (s;a) P (s;a) e N k (s;a) 2 H 2 , we apply Azuma’s inequality (Lemma 120) and obtain, with probability at least 1: K X k=1 X (s;a) e N 2 k (s;a)c 2 k (s;a) K X k=1 E k 2 4 X (s;a) e N 2 k (s;a)c 2 k (s;a) 3 5 + ~ O H 2 p K K X k=1 E k D e N k ;c k E 2 + ~ O H 2 p SA 2 K X k=1 D q k ; ~ hc k E + ~ O H 2 p SA : (Lemma 7) Combining everything shows Reg ~ O SA h ? ; K i 70 lnK + 40 K X k=1 D q k ; ~ hc k E + ~ O H 2 p SA : 171 nishing the proof. Lemma44. For any 0 > 0, with probability at least 1, Dev 1 C s 8 ln CSA h ? ; K i 0 + 0 * ? ; K X k=1 b k +! + 4CH ln CSA h ? ; K i: Also, with probability at least 1,Dev 2 = ~ O H 2 p SA : Proof. DeneX k (s;a) =b c k (s;a)c k (s;a). Note that X k (s;a) H q k (s;a) 2H k (s;a) 2H k (s;a) 4HTK 4 ; and K X k=1 E k X 2 k (s;a) K X k=1 E k h e N 2 k (s;a)c 2 k (s;a) i q 2 k (s;a) 2 k (s;a) K X k=1 E k h e N 2 k (s;a)c 2 k (s;a) i q k (s;a) = 4 k (s;a) K X k=1 b k (s;a): (Lemma 46) 172 Therefore, by applying a strengthened Freedman’s inequality (Lee et al., 2020, Theorem 2.2) withb = 4HTK 4 , B k = 2H k (s;a); max k B k = 2H K (s;a), andV = 4 k (s;a) P K k=1 b k (s;a), we have with probability 1=(SA), K X k=1 b c k (s;a)c k (s;a) C 0 @ v u u t 32 K (s;a) K X k=1 b k (s;a) ln CSA + 4H K (s;a) ln CSA 1 A C s 8 ln CSA K (s;a) 0 + 0 K X k=1 b k (s;a) ! + 4CH K (s;a) ln CSA ; where the last step is by AM-GM inequality. Further using a union bound shows that the above holds for all (s;a)2 e with probability 1 and thus Dev 1 = K X k=1 h ? ;b c k c k i C s 8 ln CSA h ? ; K i 0 + 0 * ? ; K X k=1 b k +! + 4CH ln CSA h ? ; K i: To boundDev 2 , simply note thatj D ~ hq k ;c k b c k E j 2H 2 and apply Azuma’s inequality (Lemma 120): with probability 1, Dev 2 = K X k=1 D ~ hq k ;c k b c k E =O H 2 p K = ~ O H 2 p SA : This completes the proof. 173 Lemma 45. With probability at least 1, we have Dev 3 = ~ O H 2 p SA . Also, for any 0 > 0, with probability at least 1, we have Dev 4 h ? ; K i 0 + 0 * ? ; K X k=1 b k + + ~ O (1): Proof. 
To boundDev 3 , simply note that D k ; b b k b k E 4Hhq k ;b c k i 4H 0 @ X (s;a) e N k (s;a) 1 A 4H 2 and apply Azuma’s inequality: with probability 1, Dev 3 = K X k=1 D k ; b b k b k E = ~ O H 2 p K = ~ O H 2 p SA : To boundDev 4 = P K k=1 D ? ;b k b b k E , we note thatb k (s;a) b b k (s;a)b k (s;a)H, and E k h b k (s;a) b b k (s;a) i 2 E k ( P h hq k (s;a;h)b c k (s;a)) 2 q 2 k (s;a) H 2 E k h e N 2 k (s;a)c 2 k (s;a) i q 2 k (s;a) 2H 2 K (s;a) E k h e N 2 k (s;a)c 2 k (s;a) i q k (s;a) 4H 2 K (s;a)b k (s;a): (Lemma 46) 174 Hence, applying a strengthened Freedman’s inequality (Lee et al., 2020, Theorem 2.2) withb =B i =H, V = 4H 2 K (s;a) P K k=1 b k (s;a), and C 0 =dlog 2 Hedlog 2 (H 2 K)e, we have with probability at least 1=(SA), K X k=1 b k (s;a) b b k (s;a) 2C 0 H s ln C 0 SA v u u t 8 K (s;a) K X k=1 b k (s;a) + 2C 0 H ln C 0 SA = 2C 0 H s ln C 0 SA 2 K (s;a) 0 + 0 X k b k (s;a) ! + 2C 0 H ln C 0 SA ; where the last step is by AM-GM inequality. Finally, applying a union bound shows that the above holds for all (s;a)2 e with probability at least 1 and thus Dev 4 = K X k=1 D ? ;b k b b k E h ? ; K i 0 + 0 * ? ; K X k=1 b k + + ~ O (1); where we bound C 0 H q ln C 0 SA by a constant since is of order 1= p K and is small enough whenK is large. Lemma46. For any episodek and (s;a)2 e :E k h e N k (s;a) 2 c k (s;a) 2 i 2q k (s;a)b k (s;a). 175 Proof. The proof is similar to those of Lemma 6 and uses ( P I i=1 a i ) 2 2 P i a i ( P I i 0 =i a i 0): E k h e N k (s;a) 2 c k (s;a) 2 i E k 2 4 H X h=1 e N k (s;a;h) ! 2 c k (s;a) 3 5 2E k 2 4 H X h=1 e N k (s;a;h) ! 0 @ H X h 0 h e N k (s;a;h 0 )c k (s;a) 1 A 3 5 2E k 2 4 H X h=1 H X h 0 h e N k (s;a;h 0 )c k (s;a) 3 5 ( e N k (s;a;h)2f0; 1g) = 2 H X h=1 H X h 0 h q k (s;a;h 0 )c k (s;a) = 2 H X h=1 hq k (s;a;h)c k (s;a) = 2q k (s;a)b k (s;a); where the last step is by the denition ofb k . 176 AppendixD OmittedDetailsinChapter5 D.1 PreliminaryforAppendix ExtraNotation Dene s k i = (s k i ;h k i ) as thei-th step in M in episodek. Denen k (s;a;h) as the number of visits to ((s;h);a) in M in episodek, andn k (s;a) = P hH n k (s;a;h) (excluding layerH + 1). Dene J k = minfL;J k g, n k (s;a) = minfL;n k (s;a)g, and n k (s;a;h) = minfL;n k (s;a;h)g. For any sequence of scalars or functionsfz k g k , denedz k =z k+1 z k . By default we assume P h = P H+1 h=1 . For inner product hu;vi, ifu(s;a),u(s;a;h),v(s;a), andv(s;a;h) are all dened, we lethu;vi = P s;a;h u(s;a;h)v(s;a;h). For functionsf andg with the same domain, dene function (fg)(x) =f(x)g(x). Denote byE k [] the conditional expectation given everything before episodek. For any random variableX, dene conditional variance Var k [X] =E k [(XE k [X]) 2 ]. For an occupancy measureq w.r.t. policy and transitionP , deneq (s;h) as the occupancy measure w.r.t. policy, transitionP , and initial state (s;h), andq (s;a;h) as the occupancy measure w.r.t. policy, transition P , initial state (s;h), and initial actiona. Denote byx k (s;a;h) the probability that ((s;h);a) is ever visited in episodek,x k (s;a) = P H h=1 x k (s;a;h) the probability that (s;a) is ever visited before layerH + 1 in episodek, andy k (s;a;h) the probability of visiting ((s;h);a) again if the agent starts from ((s;h);a). For any occupancy measureq(s;a;h), we deneq(s;a) = P hH q(s;a;h) (excluding layerH + 1). Note that q k (s;a;h) = x k (s;a;h) 1y k (s;a;h) andy k (s;a;h) . Thus, we haveq k (s;a;h) =O(T max x k (s;a;h)). 
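The occupancy-measure notation introduced above can be checked numerically. The following self-contained Python sketch (with small illustrative matrices that are not part of any construction in this chapter) computes the expected number of visits to each transient state of an absorbing chain by solving the linear system q = mu^T (I - P)^{-1}, and compares it with a Monte-Carlo estimate; the geometric-trials argument behind the identity q_k(s,a,h) = x_k(s,a,h) / (1 - y_k(s,a,h)) is recalled in the comments.

```python
# A minimal numerical sketch of the occupancy-measure identities above for a
# generic absorbing Markov chain induced by a fixed policy (the goal state
# and layer H+1 are absorbing).  P is the transition matrix restricted to
# the transient states and mu is the initial distribution; both are
# illustrative inputs, not objects defined elsewhere in the thesis.

import numpy as np

def occupancy_measure(P: np.ndarray, mu: np.ndarray) -> np.ndarray:
    """Expected number of visits to each transient state: q = mu^T (I - P)^{-1}."""
    n = P.shape[0]
    return np.linalg.solve((np.eye(n) - P).T, mu)

# Why q = x / (1 - y): with x[j] the probability of ever visiting j and y[j]
# the probability of returning to j when starting from j, the number of
# visits to j is geometric, so its expectation is x[j] / (1 - y[j]).

rng = np.random.default_rng(0)
P = np.array([[0.2, 0.5],
              [0.1, 0.3]])           # remaining mass goes to the absorbing goal
mu = np.array([1.0, 0.0])
q = occupancy_measure(P, mu)

# Monte-Carlo estimate of the same quantity for comparison.
episodes = 100_000
visits = np.zeros(2)
for _ in range(episodes):
    s = 0
    while True:
        visits[s] += 1
        u = rng.random()
        if u < P[s, 0]:
            s = 0
        elif u < P[s, 0] + P[s, 1]:
            s = 1
        else:
            break                    # absorbed in the goal state
print(q, visits / episodes)          # the two estimates should roughly agree
```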
177 Dene M as the set of possible transition functions of M: M = n P =fP s;a;h g (s;h)2 S;a2As ;P s;a;h 2 S + :P s;a;H+1 (g) = 1; X s 0 2S P s;a;h (s 0 ;h) ; X s 0 2S P s;a;h (s 0 ;h + 1) 1 ; P s;a;h (s 0 ;h 0 ) = 0;8(s;a)2 ;h2 [H];h 0 = 2fh;h + 1g o ; (D.1) where X =f x :x2Xg for some setX . By denition, the expected hitting time of any stationary policy in an MDP with transitionP2 M is upper bounded by (H + 1)(1 ) 1 starting from any state. Therefore, for any occupancy measureq withP q 2 M (for example,q k andq ? ), we have P s;a;h q(s;a;h) (H + 1)(1 ) 1 = ~ O(T max ). Finally deneC M as the set of possible cost functions of M: C M = n c : S!R + :c(s;a;h) = ~ O(1);8hH; and9C 0 = ~ O(T max );c(s;a;H + 1) =C 0 ;8a o : D.1.1 TransitionEstimation In this section, we present important lemmas regarding the transition condence setsfP k g K k=1 . We rst prove an auxiliary lemma saying that the number of steps taken by the learner before reaching g or switching to fast policy is well bounded with high probability. Lemma47. With probability at least 1, we haveJ k = J k for allk2 [K]. Proof. We want to show thatJ k L =d 8H 1 ln(2T max K=)e for allk2 [K] with probability at least 1. Letk2 [K], it suces to show that the expected hitting time of k is upper bounded by H 1 starting from any (s;h), because then we can apply Lemma 118 and take a union bound over allK episodes. 178 Note that the expected hitting time (w.r.t.J k ) is simply the value function with respect to a cost function that is 1 for all state-action pairs except for 0 cost in the goal stateg and layerH + 1 (i.e.,c f = 0). Thus, by Lemma 9, the expected hitting time starting from (s;h) is bounded by Hh+1 1 H 1 . DenitionofP k We deneP k = T s;a;hH P k;s;a;h , where: P k;s;a;h = P 0 2 M : P k;s;a (s 0 )P 0 s;a;h (s 0 ;h)= k (s;a;s 0 ); P k;s;a (s 0 )P 0 s;a;h (s 0 ;h + 1)=(1 ) k (s;a;s 0 ); P k;s;a (g)P 0 s;a;h (g) k (s;a;g);8s 0 2S ; (D.2) where k (s;a;s 0 ) = 4 q P k;s;a (s 0 ) 0 k (s;a) + 28 0 k (s;a), 0 k (s;a) = N + k (s;a) , P k;s;a (s 0 ) = N k (s;a;s 0 ) N + k (s;a) is the empirical transition,N + k (s;a) = maxf1;N k (s;a)g,N k (s;a) is the number of visits to (s;a) in episode j = 1;:::;k 1 before( j ) switches to the fast policy, andN k (s;a;s 0 ) is the number of visits to (s;a;s 0 ) in episodej = 1;:::;k 1 before( j ) switches to the fast policy. Lemma48. Under the event of Lemma 47, we have P2P k for anyk2 [K] with probability at least 1. Proof. Clearly P2 M . Moreover, for any (s;a)2 ;s 0 2S + by Lemma 121 andN K+1 (s;a) LK under the event of Lemma 47, we have with probability at least 1 2S 2 A , P s;a (s 0 ) P k;s;a (s 0 ) k (s;a;s 0 ): (D.3) By a union bound, we have Eq. (D.3) holds for any (s;a)2 ;s 0 2S + with probability at least 1. Then the statement is proved by P s;a;h (s 0 ;h) = P s;a (s 0 ), P s;a;h (s 0 ;h + 1) = (1 )P s;a (s 0 ), and P s;a;h (g) = P s;a (g). 179 Lemma49. Under the event of Lemma 48, for anyP 0 2P k , we have for any s 0 2 S + : P 0 s;a;h ( s 0 )P s;a;h ( s 0 ) 8 q P s;a;h ( s 0 ) 0 k (s;a) + 136 0 k (s;a), ? k (s;a;h; s 0 ): For simplicity, we also write ? k (s;a;h; (s 0 ;h 0 )) as ? k (s;a;h;s 0 ;h 0 ) for (s 0 ;h 0 )2 S. Proof. Under the event of Lemma 48 and by Eq. 
(D.3), we have for all (s;a)2 , ands 0 2S + : P k;s;a (s 0 )P s;a (s 0 ) + 4 q P k;s;a (s 0 ) 0 k (s;a) + 28 0 k (s;a): Applyingx 2 ax +b =) xa + p b witha = 4 p 0 k (s;a) andb =P s;a (s 0 ) + 28 0 k (s;a), we have q P k;s;a (s 0 ) 4 q 0 k (s;a) + q P s;a (s 0 ) + 28 0 k (s;a) q P s;a (s 0 ) + 10 q 0 k (s;a): Substituting this back to the denition of k , we have k (s;a;s 0 ) = 4 q P k;s;a (s 0 ) 0 k (s;a) + 28 0 k (s;a) 4 q P s;a (s 0 ) 0 k (s;a) + 68 0 k (s;a): Now we start to prove the statement. The statement is clearly true for s 0 = (s 0 ;h 0 ) withh 0 = 2fh;h + 1g since the left-hand side equals to 0. Moreover, by the denition ofP k , Lemma 48, andx p x forx2 (0; 1), P 0 s;a;h (s 0 ;h)P s;a;h (s 0 ;h) P 0 s;a;h (s 0 ;h) P k;s;a (s 0 ) + P k;s;a (s 0 )P s;a;h (s 0 ;h) 2 k (s;a;s 0 ) ? k (s;a;h;s 0 ;h); 180 P 0 s;a;h (s 0 ;h + 1)P s;a;h (s 0 ;h + 1) P 0 s;a;h (s 0 ;h + 1) (1 ) P k;s;a (s 0 ) + (1 ) P k;s;a (s 0 )P s;a;h (s 0 ;h + 1) 2(1 ) k (s;a;s 0 ) ? k (s;a;h;s 0 ;h); P 0 s;a;h (g)P s;a;h (g) P 0 s;a;h (g) P k;s;a (g) + P k;s;a (g)P s;a;h (g) 2 k (s;a;g) 2 ? k (s;a;h;g): This completes the proof. D.1.2 ApproximationofQ ;(;P;c);c We show thatQ ;(;P;c);c can be approximated eciently by Extended Value Iteration similar to (Jaksch, Ortner, and Auer, 2010). Note that nding (;P;c) is equivalent to computing the optimal policy in an augmented MDP M with state space S and extended action spaceP, such that for any extended actionP2P, the cost at ((s;h);P ) is P a (ajs;h)c(s;a;h), and the transition probability to s 0 2 S + is P a (ajs;h)P s;a;h ( s 0 ). In this work, we haveP2fP k g K k=1 , andP k = T s;a;h P k;s;a;h , whereP k;s;a;h is a convex set that species constraints on ((s;h);a). In other words,P k is a product of constraints on each ((s;h);a) (note that M can also be decomposed into shared constraints onP s;a;H+1 and independent constraints on eachs;a;hH). Thus, any policy in M can be represented by an elementP2P. We can now perform value iteration in M to approximateQ ;(;P;c);c . The Bellman operator of M isT 0 dened in Eq. (D.14) with min operator replaced by max operator. Also note that M is an SSP instance where all policies are proper. Thus,V ;(;P;c);c is the unique xed point ofT 0 (Bertsekas and Yu, 2013). It is straightforward to show that Lemma 72 still holds with min operator replaced by max operator in Eq. (D.14) and letV 0 (s;H + 1) = max a c(s;a;H + 1). Thus, we can approximateV ;(;P;c);c eciently. 181 Now suppose aftern iterations of modied Eq. (D.14), we obtainV n such that V n V ;(;P;c);c 1 . Then we can simply useQ(s;a;h) =c(s;a;h) + min P2P P s;a;h V n to approximateQ ;(;P;c);c , since Q(s;a;h)Q ;(;P;c);c (s;a;h) (i) = min P2P P s;a;h V n min P2P P s;a;h V ;(;P;c);c max P2P P s;a;h (V n V ;(;P;c);c ) V n V ;(;P;c);c 1 ; where (i) is by the denition of (;P;c). In this work, setting = 1=K is enough for obtaining the desired regret bounds. Lemma 72 (modied) then implies that ~ O(T max ) iterations of modied Eq. (D.14) suces. D.2 OmittedDetailsforSection5.1 In this section, we provide omitted discussions and proofs for Section 5.1. D.2.1 LimitationofExistingApproximationSchemes Finite-HorizonApproximation Thanks to Lemma 118, the approximation error under nite-horizon approximation decreases exponentially. Specically, we only need a horizon of orderO(T max lnK) to have approximation error of orderO( 1 K ). This gives optimal regret bound under both adversarial costs (Chen, Luo, and Wei, 2021b) and stochastic costs (Chen et al., 2021). 
However, it also clearly brings an extra ~ O(T max ) dependency in the space complexity since we need to store non-stationary policies changing in dierent layers. Chen et al. (2021) proposes an implicit nite-horizon approximation analysis that achieves optimal regret bound without storing non-stationary policies. Unfortunately, their approach does not work for adversarial costs. DiscountedApproximation Approximating an SSP by a discounted MDP clearly produces stationary policies. However, the approximation error scales with 1 (that is, inversely proportional to the eective horizon (1 ) 1 ) following similar arguments as in (Wei et al., 2020, Lemma 2), where is the discounted 182 factor. This leads to a sub-optimal regret bound when the achieved regret bound in the discounted MDP has polynomial dependency on the horizon even in the lower order term (Wei et al., 2020). In Tarbouriech et al. (2021b), they still achieve minimax optimal regret by deriving a horizon-free regret bound (no polynomial dependency on the horizon even in the lower order term), and approximately set 1 = ~ O( 1 K ) to achieve small approximation error. The drawback, however, is that the time complexity of updating the learner’s policy scales linearly w.r.t. the eective horizon, which is of order ~ O(K); see (Tarbouriech et al., 2021b, Remark 1). D.2.2 ProofofLemma10 Proof. We only prove the statement for adversarial environment, and the statement for stochastic environ- ment follows directly from settingc 1 =c K =c. By Lemma 9, we haveV ? ;P;c k (s; 1)V ? ;P;c k (s)+ 1 K for any k 2 [K]. Now by Lemma 118 and the fact that the expected hitting time of fast policy is upper bounded by D, we have with probability at least 1 , the learner reaches the goal within J k + c f steps for each episode k. Thus by a union bound, we have with probability at least 1 , P K k=1 P I k i=1 c k i P K k=1 P J k i=1 c k i + c k J k +1 . Putting everything together, we get: R K = K X k=1 I k X i=1 c k i V ? ;P;c k (s k 1 ) ! K X k=1 J k X i=1 c k i + c k J k +1 V ? ;P;c k (s k 1 ; 1) ! + ~ O (1) = R K + ~ O (1): This completes the proof. 183 D.3 OmittedDetailsforSection5.2 In this section, we provide all proofs for Section 5.2. We rst provide omitted details for cost estimation. Then, we establish the main results in Appendix D.3.2. Finally, we provide proofs of auxiliary lemmas in Appendix D.3.3. Extra Notation Dene optimistic transitions e P k = ( k ;P k ;e c k ) andP k = ( k ;P k ;b c k ), such that e Q k =Q k ; e P k ;e c k and b Q k =Q k ;P k ;b c k . Also denee q k =q k ; e P k andQ k =Q k ;P;b c k . D.3.1 CostEstimation We provide more details on the denition ofb c k for the subsequent analysis. Recall thatb c k (s;a) = maxf0; c k (s;a) 2 p c k (s;a) k (s;a) 7 k (s;a)g. Here, c k (s;a) = C k (s;a) N + k (s;a) , whereC k (s;a) is the accu- mulated costs that are observed at (s;a) in episodej = 1;:::;k1 before( j ) switches to the fast policy, k (s;a) = N + k (s;a) (recall = ln(2SALK=)),N + k (s;a) = maxf1;N k (s;a)g, andN k is the number of times the learner observes cost at (s;a) in episodej = 1;:::;k 1 before( j ) switches to the fast policy. For stochastic costs,C k (s;a) = P k1 j=1 P J k i=1 c j i Ifs j i =s;a j i =ag andN k =N k (s;a). Below we show a lemma quantifying the cost estimation error. Lemma50. Under the event of Lemma 47, we have with probability at least 1, 0c(s;a)b c k (s;a) 4 p b c k (s;a) k (s;a) + 34 k (s;a); for all denitions ofb c k . 184 Proof. Note that under the event of Lemma 47,N k+1 (s;a)LK. 
Applying Lemma 121 withX k =c k (s;a) for each (s;a)2 and then by a union bound over all (s;a)2 , we have with probability at least 1, for allk2 [K]: j c k (s;a)c(s;a)j 2 p k (s;a) c k (s;a) + 7 k (s;a): Hence, c(s;a) b c k (s;a) by the denition ofb c k . Applying x 2 ax +b =) x a + p b with x = p c k (s;a) to the inequality above (ignoring the absolute value operator), we obtain p c k (s;a) 2 p k (s;a) + p c(s;a) + 7 k (s;a) p c(s;a) + 5 p k (s;a); Therefore, 2 p k (s;a) c k (s;a) + 7 k (s;a) 2 p k (s;a)c(s;a) + 17 k (s;a), and c(s;a)b c k (s;a) =c(s;a) c k (s;a) + c k (s;a)b c k (s;a) 2 (2 p k (s;a) c k (s;a) + 7 k (s;a)) 4 p k (s;a)c(s;a) + 34 k (s;a): This completes the proof. D.3.2 MainResultsforStochasticCostsandStochasticAdversary We rst show a general regret bound agnostic to the feedback type (Theorem 29). Then, we present the proof of Theorem 15 (Appendix D.3.2.1) using the general regret bound. Theorem29. Assuming that there exists a constantG such that for anys;h: K1 X k=1 D k+1 (js;h);d e Q k (s;;h) E G: 185 Then, Algorithm 13 in stochastic environments with minf1=T max ; p S 2 A=Kg ensures with probability at least 1 22, R K = ~ O 0 @ K X k=1 J k X i=1 (c k i b c k ( s k i ;a k i )) + S 2 A +S 4 A 2:5 T 3 max + K X k=1 D q ? ;cQ ? ;P;b c k E 1 A + ~ O T max +T max G : Proof. For notational convenience, dene! =S 4 A 2:5 T 3 max . Byhq ? ;b c k ihq ? ;ci (Lemma 50) and Lemma 47 (under whichn k = n k ), we have with probability at least 1 2, R K = K X k=1 0 @ J k X i=1 c k i + c k J k +1 V ? ;P;c ( s k 1 ) 1 A K X k=1 0 @ J k X i=1 c k i + c k J k +1 V ? ;P;b c k ( s k 1 ) 1 A K X k=1 J k X i=1 (c k i b c k ( s k i ;a k i )) + K X k=1 h n k q ? ;b c k i: For the second term, by the denition ofe c k , K X k=1 h n k q ? ;b c k i = K X k=1 h n k q k ;b c k i + K X k=1 hq k e q k ;e c k i + K X k=1 he q k q ? ;e c k i K X k=1 D q k ;b c k b Q k E + K X k=1 D q ? ;b c k b Q k E K X k=1 h n k q k ;b c k i + K X k=1 hq k e q k ;e c k i K X k=1 hq k ;b c k Q k i | {z } 1 + K X k=1 D q k ;b c k (Q k b Q k ) E | {z } 2 + K X k=1 D q ? ;cQ ? ;P;b c k E + K X k=1 he q k q ? ;e c k i + K X k=1 D q ? ;c ( b Q k Q ? ;P;b c k ) E | {z } 3 : (b c k (s;a;h)c(s;a;h)) 186 For 1 , with probability at least 1 17: K X k=1 h n k q k ;b c k i + K X k=1 hq k e q k ;e c k i K X k=1 hq k ;b c k Q k i ~ O 0 @ v u u t K X k=1 hq k ;b c k Q k i +SAT max 1 A + K X k=1 D q k e q k ; (1 + b Q k )b c k E K X k=1 hq k ;b c k Q k i (E k [ n k (s;a;h)]q k (s;a;h), Lemma 123, Lemma 59, and n k (s;a;h)L = ~ O(T max )) = ~ O 0 @ v u u t S 2 A K X k=1 hq k ;b c k Q k i +! 1 A K X k=1 hq k ;b c k Q k i (Lemma 61 and (1 + b Q k (s;a;h))b c k (s;a;h) = ~ O(b c k (s;a;h))) = ~ O S 2 A +! 
: (AM-GM inequality) For 2 , by Lemma 63 and Lemma 49, with probability at least 1 2, Q k (s;a;h) b Q k (s;a;h) = X s 0 ;a 0 ;h 0 q k;(s;a;h) (s 0 ;a 0 ;h 0 )(P s 0 ;a 0 ;h 0P k;s 0 ;a 0 ;h 0)V k ;P k ;b c k = ~ O 0 @ X s 0 ;a 0 q k;(s;a;h) (s 0 ;a 0 ) 0 @ p ST max q N + k (s 0 ;a 0 ) + ST max N + k (s 0 ;a 0 ) 1 A 1 A : (D.4) Byq k (s;a;h) = x k (s;a;h) 1y k (s;a;h) andy k (s;a;h) = 1 1 2Tmax , we have X s;a;hH q k (s;a;h)q k;(s;a;h) (s 0 ;a 0 ) 2T max X s;a;hH x k (s;a;h)q k;(s;a;h) (s 0 ;a 0 ) 2T max X s;a;hH q k (s 0 ;a 0 ) = 2T max SAHq k (s 0 ;a 0 ): (D.5) 187 Therefore, with probability at least 1, 2 = K X k=1 D q k ;b c k (Q k b Q k ) E = ~ O 0 @ K X k=1 X s;a;h q k (s;a;h) X s 0 ;a 0 q k;(s;a;h) (s 0 ;a 0 ) 0 @ p ST max q N + k (s 0 ;a 0 ) + ST max N + k (s 0 ;a 0 ) 1 A 1 A = ~ O 0 @ T 2 max S 3=2 A X s 0 ;a 0 K X k=1 q k (s 0 ;a 0 ) q N + k (s 0 ;a 0 ) +T 2 max S 2 A X s 0 ;a 0 K X k=1 q k (s 0 ;a 0 ) N + k (s 0 ;a 0 ) 1 A (Eq. (D.5)) = ~ O T 2 max S 3=2 A p SAT max K +S 3 A 2 T 3 max = ~ O (!): (Lemma 64 and P s;a q k (s;a) = ~ O(T max )) For 3 , rst note that e Q 1 1 = ~ O(T max ) under all denitions ofe c k , and by Lemma 63: K X k=1 he q k q ? ;e c k i = K X k=1 X s;h q ? (s;h) X a ( k (ajs;h) ? (ajs;h)) e Q k (s;a;h) + K X k=1 X s;a;h q ? (s;a;h) e Q k (s;a;h)e c k (s;a;h)P s;a;h V k ; e P k ;e c k = ~ O T ? +T ? G +T 2 max : (Lemma 57, the denition of e P k and P s;a;h q ? (s;a;h) = ~ O(T ? ) by Lemma 9) Next, note that K X k=1 ( b Q k (s;a;h)Q ? ;P;b c k (s;a;h)) K X k=1 (Q k ; e P k ;b c k (s;a;h)Q ? ;P;b c k (s;a;h)) (P k ; e P k 2P k ) K X k=1 Q k ; e P k ;e c k (s;a;h)Q ? ;P;e c k (s;a;h) + K X k=1 Q ? ;P; b Q k (s;a;h) (denition ofe c k ) 188 Also note that P K k=1 D q ? ;Q ? ;P; b Q k E = ~ O( 2 T 3 max K) = ~ O(S 2 AT 3 max ) by p S 2 A=K. Thus, K X k=1 D q ? ;c ( b Q k Q ? ;P;b c k ) E = ~ O K X k=1 D q ? ;c ( e Q k Q ? ;P;e c k ) E +S 2 AT 3 max ! : Now by Lemma 63 and the denition of e P k : K X k=1 ( e Q k (s;a;h)Q ? ;P;e c k (s;a;h)) (D.6) K X k=1 X s 00 ;h 00 P s;a;h (s 00 ;h 00 ) X s 0 ;a 0 ;h 0 q ? (s 00 ;h 00 ) (s 0 ;h 0 ) k (a 0 js 0 ;h 0 ) ? (a 0 js 0 ;h 0 ) e Q k (s 0 ;a 0 ;h 0 ) = ~ O T max +T max G +T 2 max : (Lemma 57) Thus, byT max 1, we have P K k=1 D q ? ;c ( e Q k Q ? ;P;e c k ) E = ~ O( Tmax +T max G +T 2 max ). Putting everything together completes the proof. D.3.2.1 ProofofTheorem15 Proof. By Lemma 58 withn k =n k andN k =N k , with probability at least 1 2: K1 X k=1 D k+1 (js;h);d e Q k (s;;h) E = ~ O 0 @ T 2 max K X k=1 X s 0 ;a 0 Sn k (s 0 ;a 0 ) N + k (s 0 ;a 0 ) +T 4 max K 1 A = ~ O S 2 AT 3 max +T 2 max (S 2 AK) 1=4 : (denition of and,n k (s;a) = n k (s;a) under the event of Lemma 47, and Lemma 64) 189 Thus, by Theorem 29, Lemma 51, denition of, and replacingG by the bound above, we have with probability at least 1 28, R K = ~ O 0 @ v u u t SA K X k=1 J k X i=1 c k i + S 2 A +T 3 max (S 2 AK) 1=4 +S 4 A 2:5 T 4 max + K X k=1 D q ? ;cQ ? ;P;b c k E 1 A = ~ O 0 @ v u u t SA K X k=1 J k X i=1 c k i +B ? S p AK +T 3 max (S 2 AK) 1=4 +S 4 A 2:5 T 4 max 1 A : (Lemma 53) Now by R k = P K k=1 P J k i=1 c k i KV ? ;P;c ( s k 1 ) and Lemma 110, we have P K k=1 P J k i=1 c k i = ~ O(B ? K). Plugging this back, we get R K = ~ O(B ? S p AK +T 3 max (S 2 AK) 1=4 +S 4 A 2 T 4 max ). Applying Lemma 10 then completes the proof. D.3.3 ExtraLemmasforSection5.2 We give an outline of this section: Lemma 51 bound the term P K k=1 P J k i=1 (c k i b c k ( s k i ;a k i )). Lemma 53 and Lemma 54 bound the term P K k=1 q ? ;cQ ? ;P;c . 
Lemma 56 establishes stability of PO updates. Lemma 57 provide a rened analysis of PO. Lemma 58 bounds the drift of various quantities (such asdb c k andd e Q k ) across episodes. Lemma 59 provide bounds on variance of learner’s costs. Lemma 61 gives a bound on the estimation error of value functions due to transition estimation. Lemma51. Under stochastic costs, we have with probability at least 1 6: K X k=1 J k X i=1 (c k i b c k ( s k i ;a k i )) = ~ O 0 @ v u u t SA K X k=1 J k X i=1 c k i +SAT max 1 A : Proof. First note that: K X k=1 J k X i=1 (c k i b c k ( s k i ;a k i )) = K X k=1 J k X i=1 (c k i c(s k i ;a k i )) + K X k=1 J k X i=1 (c(s k i ;a k i )b c k (s k i ;a k i )): 190 For the rst term, by Lemma 123 and Lemma 126, we have with probability at least 1 2, K X k=1 J k X i=1 (c k i c(s k i ;a k i )) = ~ O 0 B @ v u u t K X k=1 J k X i=1 E[(c k i ) 2 js k i ;a k i ] 1 C A = ~ O 0 B @ v u u t K X k=1 J k X i=1 c(s k i ;a k i ) 1 C A = ~ O 0 @ v u u t K X k=1 J k X i=1 c k i 1 A : For the second term, with probability at least 1 4, K X k=1 J k X i=1 (c(s k i ;a k i )b c k ( s k i ;a k i )) = ~ O 0 @ K X k=1 J k X i=1 s c(s k i ;a k i ) N + k (s k i ;a k i ) + 1 N + k (s k i ;a k i ) ! 1 A (Lemma 50 andb c k (s;a)c(s;a)) = ~ O 0 @ X (s;a) K X k=1 n k (s;a) s c(s;a) N + k (s;a) + n k (s;a) N + k (s;a) ! 1 A = ~ O 0 B @ v u u t SA K X k=1 J k X i=1 c(s k i ;a k i ) +SAT max 1 C A (Lemma 64 andJ k = J k ) = ~ O 0 B @ v u u t SA K X k=1 J k X i=1 c k i +SAT max 1 C A: (Lemma 126) This completes the proof. Lemma52. Forh2 [H + 1], we have P s;a q ? (s;a;h) ( 1 2 ) h1 T max . Proof. Denote byp(s) the probability that the learner starts at states in layerh and eventually reaches layerh + 1 following ? . Clearly,p(g) = 0, and p(s) 1 + P s; ? (s) p (i) E " I X t=1 (1 ) t1 ? ;P;s 1 =s # 1 2 ; 191 where (i) is by repeatedly applying the rst inequality. By a recursive argument, we have the probability of reaching layerh is upper bounded by ( 1 2 ) h1 . Then by P s;a q ? (s 0 ;h) (s;a;h)T max for anys 0 , we have P s;a q ? (s;a;h) ( 1 2 ) h1 T max . Lemma53. Under stochastic costs, q ? ;cQ ? ;P;c 2B 2 ? + (H+1)Tmax K . Proof. By Lemma 9 and Lemma 52, we have: D q ? ;cQ ? ;P;c E = H X h=1 X s;a q ? (s;a;h)c(s;a)Q ? ;P;c (s;a;h) + X s q ? (s;H + 1)c f H X h=1 X s;a q ? (s;a;h)c(s;a) Q ? ;P;c (s;a) + c f 2 Hh+1 + c f T max 2 H 2B 2 ? + H X h=1 T max 2 h1 c f 2 Hh+1 + c f T max 2 H ( P H h=1 q ? (s;a;h)c(s;a)B ? andQ ? ;P;c (s;a) 1 +B ? ) 2B 2 ? + (H + 1) c f T max 2 H 2B 2 ? + (H + 1)T max K : Lemma54. For stochastic adversary, we have P K k=1 q ? ;cQ ? ;P;c = ~ O D 2 K . Proof. P K k=1 q ? ;cQ ? ;P;c = ~ O (DKhq ? ;ci) = ~ O D 2 K . Lemma55. e Q k 1 1 under all denitions ofe c k . Proof. It suces to bound e Q k 1 . By Lemma 9, b Q k (s;a;h) H 1 +c f =. This givese c k (s;a;h) (1 + b Q k (s;a;h)) 3(8+=T max ) forhH ande c k (s;a;H +1) (1+ b Q k (s;a;H +1))c f 3c f =T max . Lemma 9 then gives e Q k (s;a;h) H 1 3(8 +=T max ) + 3c f =T max 3T max (8 +=T max ) 2 , and the statement is proved by the denition of. 192 Lemma56. Underalldenitionsofe c k ,wehavejd k (ajs;h)j = ~ O(T max k (ajs;h))and dQ k ;P 0 ;c 0 1 = ~ O(T 3 max ) forP 0 2 M andc 0 2C M . Proof. Note that: k+1 (ajs;h) k (ajs;h) = k (ajs;h) exp( e Q k (s;a;h)) P a 0 k (a 0 js;h) exp( e Q k (s;a 0 ;h)) k (ajs;h) k (ajs;h) P a 0 k (a 0 js;h) exp(max a 0 j e Q k (s;a 0 ;h)j) k (ajs;h) = ~ O (T max k (ajs;h)): (Lemma 55 andje x 1j 2jxj forx2 [1; 1]) The other direction can be proved similarly. 
Then by Lemma 63, Q k+1 ;P 0 ;c 0 (s;a;h)Q k ;P 0 ;c 0 (s;a;h) = X s 00 ;h 00 P s;a;h (s 00 ;h 00 ) X s 0 ;a 0 ;h 0 q k ;P 0 ;(s 00 ;h 00 ) (s 0 ;h 0 ) d k (a 0 js 0 ;h 0 ) Q k+1 ;P 0 ;c 0 (s 0 ;a 0 ;h 0 ) = ~ O T 3 max : This completes the proof. Lemma57. Suppose k (ajs;h)/ exp( P j<k e Q j (s;a;h)). Then, K X k=1 X a2As ( k (ajs;h) ? (ajs;h)) e Q k (s;a;h) lnA + D 1 (js;h); e Q 1 (s;;h) E + K1 X k=1 D k+1 (js;h); e Q k+1 (s;;h) e Q k (s;;h) E : Proof. First note that: k+1 (js;h) = argmin (js;h)2(A) D (js;h); e Q k (s;;h) E + KL((js;h); k (js;h)); (D.7) 193 where KL(p;q) = P a (p(a) ln p(a) q(a) p(a) +q(a)), and k+1 (js;h)/ 0 k+1 (ajs;h), k (ajs;h) exp( e Q k (s;a;h)); where 0 k+1 is the solution of the unconstrained variant of Eq. (D.7) (that is, replacing argmin (js;h)2(A) by argmin (js;h)2R A). It is easy to verify that: KL( k (js;h); k+1 (js;h)) + KL( k+1 (js;h); k (js;h)) = k (js;h); ln k (js;h) k+1 (js;h) + k+1 (js;h); ln k+1 (js;h) k (js;h) = * k (js;h) k+1 (js;h); ln k (js;h) 0 k+1 (js;h) + ( k+1 (js;h)/ 0 k+1 (js;h)) = D k (js;h) k+1 (js;h); e Q k (s;;h) E 0: (D.8) By the standard OMD analysis (Hazan, 2019) (note that KL is the Bregman divergence w.r.t. the negative entropy regularizer), K X k=1 D k (js;h) ? (js;h); e Q k (s;;h) E = 1 K X k=1 KL( ? (js;h); k (js;h)) KL( ? (js;h); 0 k+1 (js;h)) + KL( k (js;h); 0 k+1 (js;h)) = 1 K X k=1 (KL( ? (js;h); k (js;h)) KL( ? (js;h); k+1 (js;h)) + KL( k (js;h); k+1 (js;h))) KL( ? (js;h); 1 (js;h)) + K X k=1 D k (js;h) k+1 (js;h); e Q k (s;;h) E (Eq. (D.8)) lnA + K1 X k=1 D k+1 (js;h); e Q k+1 (s;;h) e Q k (s;;h) E + D 1 (js;h); e Q 1 (s;;h) E D K+1 (js;h); e Q K (s;;h) E : 194 This completes the proof. Lemma58. Denen k (s;a) =N k+1 (s;a)N k (s;a). We have: jdb c k (s;a)j =O n k (s;a) N + k (s;a) ; d b Q k (s;a;h) =O 0 @ T 2 max X s 0 ;a 0 Sn k (s 0 ;a 0 ) N + k (s 0 ;a 0 ) +T max X s 0 ;a 0 n k (s 0 ;a 0 ) N + k (s 0 ;a 0 ) +T 3 max 1 A ; jde c k (s;a)j =O 0 @ n k (s;a) N + k (s;a) +T 2 max X s 0 ;a 0 Sn k (s 0 ;a 0 ) N + k (s 0 ;a 0 ) +T max X s 0 ;a 0 n k (s 0 ;a 0 ) N + k (s 0 ;a 0 ) +T 3 max 1 A ; d e Q k (s;a;h) =O 0 @ T 2 max X s 0 ;a 0 Sn k (s 0 ;a 0 ) N + k (s 0 ;a 0 ) +T 2 max X s 0 ;a 0 n k (s 0 ;a 0 ) N + k (s 0 ;a 0 ) +T 4 max 1 A : Proof. Firststatement: Note that for all denitions ofb c k used in this paper, we havekb c k k 1 1. Then by the denition ofb c k andj maxf0;ag maxf0;bgjjabj: jb c k+1 (s;a)b c k (s;a)j =O j c k+1 (s;a) c k (s;a)j + s c k (s;a) N + k (s;a) s c k+1 (s;a) N + k+1 (s;a) + N + k (s;a) N + k+1 (s;a) ! : Note that: j c k+1 (s;a) c k (s;a)j = C k+1 (s;a) N + k+1 (s;a) C k (s;a) N + k (s;a) C k+1 (s;a)C k (s;a) N + k+1 (s;a) +N k (s;a) 1 N + k (s;a) 1 N + k+1 (s;a) (C k (s;a)N k (s;a)) n k (s;a) N + k+1 (s;a) + N k (s;a)n k (s;a) N + k (s;a)N + k+1 (s;a) 2n k (s;a) N + k (s;a) ; 195 and byj p a p bj p jabj,n k (s;a)2N: s c k (s;a) N + k (s;a) s c k+1 (s;a) N + k+1 (s;a) s j c k (s;a) c k+1 (s;a)j N + k (s;a) + p c k+1 (s;a) 0 @ 1 q N + k (s;a) 1 q N + k+1 (s;a) 1 A 2n k (s;a) N + k (s;a) + 0 @ r N + k (s;a) s N + k+1 (s;a) 1 A =O n k (s;a) N + k (s;a) ; where in the last inequality we apply 1 q N + k (s;a) 1 q N + k+1 (s;a) = 1 N + k (s;a) 1 N + k+1 (s;a) ! = 0 @ 1 q N + k (s;a) + 1 q N + k+1 (s;a) 1 A q N + k+1 (s;a) n k (s;a) N + k (s;a)N + k+1 (s;a) n k (s;a) N + k (s;a) : (D.9) Thus,jdb c k (s;a)j =O n k (s;a) N + k (s;a) . Secondstatement: Dene k (P 0 ) = argmin P 00 2P k+1 P s;a;h P 00 s;a;h P 0 s;a;h 1 for anyP 0 2P k . 
By the denition ofP k , we have (note thatP 0 s;a;h (s 0 ;h 0 ) = 0 forh 0 = 2fh;h + 1g): k (P 0 ) s;a;h P 0 s;a;h 1 2 X s 0 P k;s;a (s 0 ) P k+1;s;a (s 0 ) + 2 X s 0 k+1 (s;a;s 0 ) k (s;a;s 0 ) : Denote byn k (s;a;s 0 ) the number of visits to (s;a;s 0 ) (before policy switch or goal state is reached) in episodek. Note that: P k;s;a (s 0 ) P k+1;s;a (s 0 ) = N k (s;a;s 0 ) +n k (s;a;s 0 ) N + k+1 (s;a) N k (s;a;s 0 ) N + k (s;a) N k (s;a;s 0 ) 1 N + k (s;a) 1 N + k+1 (s;a) ! + n k (s;a;s 0 ) N + k+1 (s;a) 2n k (s;a) N + k (s;a) : 196 and byj p a p bj p jabj, k (s;a;s 0 ) k+1 (s;a;s 0 ) =O s P k;s;a (s 0 ) N + k (s;a) s P k+1;s;a (s 0 ) N + k+1 (s;a) +d N + k (s;a) ! =O 0 @ s P k;s;a (s 0 ) P k+1;s;a (s 0 ) N + k (s;a) + q P k+1;s;a (s 0 )d 0 @ 1 q N + k (s;a) 1 A +d N + k (s;a) 1 A =O 0 @ n k (s;a) N + k (s;a) + q P k+1;s;a (s 0 )d 0 @ p q N + k (s;a) 1 A 1 A : Plugging these back, and by Cauchy-Schwarz inequality and Eq. (D.9) withN k =N k , we have k (P 0 ) s;a;h P 0 s;a;h 1 =O 0 @ Sn k (s;a) N + k (s;a) +d 0 @ p S q N + k (s;a) 1 A 1 A =O Sn k (s;a) N + k (s;a) : (D.10) Thus, for any policy 0 and cost functionc 0 2C M withc 0 (s;a;h)2 [0; 1] forhH, by Lemma 63 and Eq. (D.10), Q 0 ; k (P 0 );c 0 (s;a;h)Q 0 ;P 0 ;c 0 (s;a;h) = X s 0 ;a 0 ;h 0 q 0 ;P 0 ;(s;a;h) (s 0 ;a 0 ;h 0 )( k (P 0 ) s 0 ;a 0 ;h 0P 0 s 0 ;a 0 ;h 0)V 0 ; k (P 0 );c 0 =O 0 @ T 2 max X s 0 ;a 0 Sn k (s 0 ;a 0 ) N + k (s 0 ;a 0 ) 1 A : (D.11) 197 Now deneP 0 k = k (P k ). We have b Q k+1 (s;a;h) b Q k (s;a;h) =Q k+1 ;P k+1 ;b c k+1 (s;a;h)Q k ;P k ;b c k (s;a;h) Q k+1 ;P 0 k ;b c k+1 (s;a;h)Q k+1 ;P k ;b c k+1 (s;a;h) +Q k+1 ;P k ;b c k+1 (s;a;h)Q k ;P k ;b c k (s;a;h) =O 0 @ T 2 max X s 0 ;a 0 Sn k (s 0 ;a 0 ) N + k (s 0 ;a 0 ) 1 A + (Q k+1 ;P k ;b c k+1 (s;a;h)Q k+1 ;P k ;b c k (s;a;h)) (Eq. (D.11)) + (Q k+1 ;P k ;b c k (s;a;h)Q k ;P k ;b c k (s;a;h)) =O 0 @ T 2 max X s 0 ;a 0 Sn k (s 0 ;a 0 ) N + k (s 0 ;a 0 ) +T max X s 0 ;a 0 b c k+1 (s 0 ;a 0 )b c k (s 0 ;a 0 ) +T 3 max 1 A (Lemma 63 and Lemma 56) =O 0 @ T 2 max X s 0 ;a 0 Sn k (s 0 ;a 0 ) N + k (s 0 ;a 0 ) +T max X s 0 ;a 0 n k (s 0 ;a 0 ) N + k (s 0 ;a 0 ) +T 3 max 1 A : The other direction can be proved similarly. Thirdstatement: Note thatjde c k (s;a;H + 1)j = 0, and forhH, je c k+1 (s;a;h)e c k (s;a;h)j jdb c k (s;a)j + b c k+1 (s;a) b Q k+1 (s;a;h) b Q k (s;a;h)b c k (s;a) jdb c k (s;a)j + b Q k+1 (s;a;h)jdb c k (s;a)j +b c k (s;a) d b Q k (s;a;h) =O 0 @ n k (s;a) N + k (s;a) +T 2 max X s 0 ;a 0 Sn k (s 0 ;a 0 ) N + k (s 0 ;a 0 ) +T max X s 0 ;a 0 n k (s 0 ;a 0 ) N + k (s 0 ;a 0 ) +T 3 max 1 A : 198 Fourth statement: Dene e P 0 k = k ( e P k ). By Eq. (D.10), e P 0 k;s;a;h e P k;s;a;h 1 =O Sn k (s;a) N + k (s;a) , and 1=T max , we have e Q k+1 (s;a;h) e Q k (s;a;h)Q k+1 ; e P 0 k ;e c k+1 (s;a;h)Q k ; e P k ;e c k (s;a;h) = Q k+1 ; e P 0 k ;e c k+1 (s;a;h)Q k+1 ; e P k ;e c k+1 (s;a;h) + Q k+1 ; e P k ;e c k+1 (s;a;h)Q k+1 ; e P k ;e c k (s;a;h) + Q k+1 ; e P k ;e c k (s;a;h)Q k ; e P k ;e c k (s;a;h) (i) O 0 @ T 2 max X s 0 ;a 0 Sn k (s 0 ;a 0 ) N + k (s 0 ;a 0 ) 1 A + X s 0 ;a 0 ;h 0 q k+1 ; e P k ;(s;a;h) (s 0 ;a 0 ;h 0 ) e c k+1 (s 0 ;a 0 ;h 0 )e c k (s 0 ;a 0 ;h 0 ) =O 0 @ T 2 max X s 0 ;a 0 Sn k (s 0 ;a 0 ) N + k (s 0 ;a 0 ) +T 2 max X s 0 ;a 0 n k (s 0 ;a 0 ) N + k (s 0 ;a 0 ) +T 4 max 1 A ; where in (i) we apply Eq. 
(D.11), Lemma 63, and Q k+1 ; e P k ;e c k (s;a;h)Q k ; e P k ;e c k (s;a;h) = X s 00 ;h 00 e P k;s;a;h (s 00 ;h 00 ) X s 0 ;a 0 ;h 0 q k+1 ; e P k ;(s 00 ;h 00 ) (s 0 ;h 0 ) d k (a 0 js 0 ;h 0 ) Q k ; e P k ;e c k (s 0 ;a 0 ;h 0 ) (Lemma 63) 0: (Eq. (D.8)) This completes the proof. Lemma59. For any cost functionc in M such thatc((s;h);a) 0, we have: Var k [hn k ;ci] = X s;a;h q k (s;a;h)(A k ;P;c (s;a;h) 2 +V(P s;a;h ;V k ;P;c )) E k [hn k ;ci 2 ] 2 q k ;cQ k ;P;c : 199 Proof. LetQ =Q k ;P;c ,V =V k ;P;c ,A =A k ;P;c and denec(g;a) = 0. Then, Var k [hn k ;ci] =E k 2 4 J k +1 X i=1 c( s k i ;a k i )V ( s k 1 ) ! 2 3 5 =E k 2 4 J k +1 X i=2 c( s k i ;a k i ) +Q( s k 1 ;a k 1 )P s k 1 ;a k 1 VV ( s k 1 ) ! 2 3 5 (Q( s;a) =c( s;a) +P s;a V ) (i) =E k Q( s k 1 ;a k 1 )V ( s k 1 ) 2 +E k 2 4 J k +1 X i=2 c( s k i ;a k i )P s k 1 ;a k 1 V ! 2 3 5 (ii) =E k Q( s k 1 ;a k 1 )V ( s k 1 ) 2 +E k 2 4 J k +1 X i=2 c( s k i ;a k i )V ( s k 2 ) ! 2 3 5 +E k V ( s k 2 )P s k 1 ;a k 1 V 2 =E k " J k +1 X i=1 Q( s k i ;a k i )V ( s k i ) 2 + V ( s k i+1 )P s k i ;a k i V 2 # (recursive argument) = X s;a;h q k (s;a;h) A 2 (s;a;h) +V(P s;a;h ;V ) ; where (i) is byQ( s k 1 ;a 1 )V ( s k 1 )2( s k 1 ;a k 1 ) (the-algebra of events dened on ( s k 1 ;a k 1 )) and E k " J k +1 X i=2 c( s k i ;a k i )P s k 1 ;a k 1 V s k 1 ;a k 1 # = 0; (ii) is byV ( s k 2 )P s k 1 ;a k 1 V 2( s k 1 ;a k 1 ; s k 2 ) and E k " J k +1 X i=2 c( s k i ;a k i )V ( s k 2 ) s k 1 ;a k 1 ; s k 2 # = 0: 200 Moreover, by ( P n i=1 a i ) 2 2a i ( P n i 0 =i a i 0) for anyn 1 andP (J k =1) = 0, Var k [hn k ;ci]E k [hn k ;ci 2 ] =E k 2 4 J k +1 X i=1 c( s k i ;a k i ) ! 2 3 5 2E k " J k +1 X i=1 c( s k i ;a k i ) J k +1 X i 0 =i c( s k i 0;a k i 0) # = 2E k " 1 X i=1 IfJ k + 1igc( s k i ;a k i ) J k +1 X i 0 =i c( s k i 0;a k i 0) # (i) = 2E k " J k +1 X i=1 c( s k i ;a k i )Q( s k i ;a k i ) # = 2hq k ;cQi; where in (i) we applyQ( s k i ;a k i ) =E[ P J k +1 i 0 =i c( s k i 0 ;a k i 0 )j s k 1 ;a k 1 ;:::; s k i ;a k i ] andfJ k +1ig2( s k 1 ;a k 1 ;:::; s k i ;a k i ). Lemma60. For everyk2 [K] it holds thatq k (s;a;h)E k [ n k (s;a;h)] + ~ O (1=K). Proof. By denition ofn k (s;a;h),x k (s;a;h), andy k (s;a;h) we have: P (n k (s;a;h)>n) =P (n k (s;a;h)>njn k (s;a;h)>n 1)P (n k (s;a;h)>n 1) =P (return to (s;a;h))P (n k (s;a;h)>n 1) =y k (s;a;h)P (n k (s;a;h)>n 1) = =y n k (s;a;h)P (n k (s;a;h)> 0) =y n k (s;a;h)x k (s;a;h): Now, sinceq k (s;a;h) is the expected number of visits to (s;a;h), q k (s;a;h) =E k [n k (s;a;h)] = 1 X n=0 P (n k (s;a;h)>n) =x k (s;a;h) 1 X n=0 y n k (s;a;h) =x k (s;a;h) L1 X n=0 y n k (s;a;h) +x k (s;a;h) 1 X n=L y n k (s;a;h): 201 To nish we bound each of the sums separately. By denition of n k (s;a;h): x k (s;a;h) L1 X n=0 y n k (s;a;h) = L1 X n=0 P (n k (s;a;h)>n) = L1 X n=0 P (minfL;n k (s;a;h)g>n) 1 X n=0 P (minfL;n k (s;a;h)g>n) = 1 X n=0 P ( n k (s;a;h)>n) =E k [ n k (s;a;h)]: In each step there’s a probability of at most to stay in layerh. Soy k (s;a;h) , which implies: x k (s;a;h) 1 X n=L y n k (s;a;h) 1 X n=L n = L 1 8H 1 ln(2TmaxK=) 1 e 8H ln(2TmaxK=) 1 2T max 2T max K 8 log 2 (c f K) = ~ O (1=K); where the second inequality uses 1 1 e 1 . Lemma 61. Consider a sequence of cost functionsfc k g K k=1 and transition functionsfP k g K k=1 such that c k 2C M andP k 2P k . Also deneb q k =q k ;P k . Then with probability at least 1 8, K X k=1 jhq k b q k ;c k ij = ~ O 0 @ v u u t S 2 A K X k=1 hq k ;c k Q k ;P;c k i +S 2:5 A 1:5 T 3 max 1 A : Proof. 
Denev k;s;a;h ( s 0 ) =V k ;P;c k ( s 0 )P s;a;h V k ;P;c k for s 0 2 S + . Note that with probability at least 1 4: K X k=1 jhq k b q k ;c k ij = K X k=1 X s;a;h q k (s;a;h)(P s;a;h P k;s;a;h )V k ;P k ;c k (Lemma 63) = K X k=1 X s;a;h q k (s;a;h)(P s;a;h P k;s;a;h )V k ;P;c k + ~ O S 2:5 A 1:5 T 3 max : (Lemma 49 and Lemma 62) 202 Below we bound the rst term. We continue with: = K X k=1 X s;a H X h=1 q k (s;a;h)(P s;a;h P k;s;a;h )v k;s;a;h (P s;a;H+1 =P k;s;a;H+1 ) = ~ O 0 @ K X k=1 X s;a;hH; s 0 q k (s;a;h) s P s;a;h ( s 0 )v 2 k;s;a;h ( s 0 ) N + k (s;a) +ST max K X k=1 X s;a q k (s;a) N + k (s;a) 1 A : (Lemma 49) By Lemma 60, we haveq k (s;a;h)E k [ n k (s;a;h)] + ~ O(1=K). Therefore, we continue with = ~ O 0 @ K X k=1 X s;a;hH; s 0 E k [ n k (s;a;h)] s P s;a;h ( s 0 )v 2 k;s;a;h ( s 0 ) N + k (s;a) +S 2 AT max 1 A (Lemma 64) = ~ O 0 @ K X k=1 X s;a;hH; s 0 n k (s;a;h) s P s;a;h ( s 0 )v 2 k;s;a;h ( s 0 ) N + k (s;a) +S 2 AT 2 max 1 A (Lemma 126) = ~ O 0 @ K X k=1 X s;a;hH; s 0 n k (s;a;h) v u u t P s;a;h ( s 0 )v 2 k;s;a;h ( s 0 ) N + k+1 (s;a) 1 A + ~ O 0 @ ST 2 max X (s;a) K X k=1 0 @ 1 q N + k (s;a) 1 q N + k+1 (s;a) 1 A +S 2 AT 2 max 1 A = ~ O 0 @ v u u t K X k=1 X s;a; s 0 n k (s;a) N + k+1 (s;a) v u u t K X k=1 X s;a;hH; s 0 n k (s;a;h)P s;a;h ( s 0 )v 2 k;s;a;h ( s 0 ) +S 2 AT 2 max 1 A (Cauchy-Schwarz inequality) = ~ O 0 @ p S 2 A v u u t K X k=1 X s;a;hH; s 0 q k (s;a;h)P s;a;h ( s 0 )v 2 k;s;a;h ( s 0 ) +SAT 3 max +S 2 AT 2 max 1 A (Lemma 126) = ~ O 0 @ v u u t S 2 A K X k=1 Var k [hn k ;c k i] +S 2 AT 2 max 1 A ( P s 0P s;a;h ( s 0 )v 2 k;s;a;h ( s 0 ) =V(P s;a;h ;V k ;P;c k ) and Lemma 59) = ~ O 0 @ v u u t S 2 A K X k=1 hq k ;c k Q k ;P;c k i +S 2 AT 2 max 1 A : (Lemma 59) Substituting these back completes the proof. 203 Lemma 62. Consider a sequence of cost functionsfc k g K k=1 and transition functionsfP k g K k=1 such that c k 2C M andP k 2P k . Then, we have with probability at least 1 4: K X k=1 X s;a;h;s 0 ;h 0 q k (s;a;h) ? k (s;a;h;s 0 ;h 0 ) V k ;P;c k (s 0 ;h 0 )V k ;P k ;c k (s 0 ;h 0 ) = ~ O S 2:5 A 1:5 T 3 max : Proof. Below. is equivalent to ~ O (). Also denotez = (s;a;h;s 0 ;h 0 ) and ~ z = (e s;e a; e h;e s 0 ; e h 0 ). By Lemma 63 we have with probability at least 1 2: V k ;P;c k (s 0 ;h 0 )V k ;P k ;c k (s 0 ;h 0 ) . X e s;e a; e h q k;(s 0 ;h 0 ) (e s;e a; e h) P e s;e a; e h V k ;P;c k P k;e s;e a; e h V k ;P;c k .T max X e s;e a; e h q k;(s 0 ;h 0 ) (e s;e a; e h) P e s;e a; e h P k;e s;e a; e h 1 .T max X e s;e a; e h;e s 0 ; e h 0 q k;(s 0 ;h 0 ) (e s;e a; e h) ? k (e s;e a; e h;e s 0 ; e h 0 ); where the second inequality is by Lemma 9, and the third is by Lemma 49. Thus, using Lemma 49 and the Cauchy-Schwarz inequality, we get: K X k=1 X s;a;h;s 0 ;h 0 q k (s;a;h) ? k (s;a;h;s 0 ;h 0 ) V k ;P;c k (s 0 ;h 0 )V k ;P k ;c k (s 0 ;h 0 ) .T max K X k=1 X z q k (s;a;h) ? k (s;a;h;s 0 ;h 0 ) X ~ z q k;(s 0 ;h 0 ) (e s;e a; e h) ? k (e s;e a; e h;e s 0 ; e h 0 ) .T max K X k=1 X z q k (s;a;h) s P s;a;h (s 0 ;h 0 ) N + k (s;a) X ~ z q k;(s 0 ;h 0 ) (e s;e a; e h) v u u t P e s;e a; e h (e s 0 ; e h 0 ) N + k (e s;e a) .T max v u u t X k;z;~ z q k (s;a;h)P e s;e a; e h (e s 0 ; e h 0 )q k;(s 0 ;h 0 ) (e s;e a; e h) N + k (s;a) v u u t X k;z;~ z q k (s;a;h)P s;a;h (s 0 ;h 0 )q k;(s 0 ;h 0 ) (e s;e a; e h) N + k (e s;e a) : 204 Note that we ignore some lower order terms in the calculation above. To nish the proof we bound each of the terms separately. 
For the rst term we have with probability at least 1: X k;z;~ z q k (s;a;h)P e s;e a; e h (e s 0 ; e h 0 )q k;(s 0 ;h 0 ) (e s;e a; e h) N + k (s;a) = X k;s;a ( P h q k (s;a;h)) P s 0 ;h 0 ;e s;e a; e h q k;(s 0 ;h 0 ) (e s;e a; e h) P e s 0 ; e h 0 P e s;e a; e h (e s 0 ; e h 0 ) N + k (s;a) .T max S X k;s;a q k (s;a) N + k (s;a) .T 2 max S 2 A; where the last inequality is by Lemma 64. For the second term we have with probability at least 1: X k;z;~ z q k (s;a;h)P s;a;h (s 0 ;h 0 )q k;(s 0 ;h 0 ) (e s;e a; e h) N + k (e s;e a) .S X k;s;a;h;e s;e a; e h q k (s;a;h) P s 0 ;h 0P s;a;h (s 0 ;h 0 )q k;(s 0 ;h 0 ) (e s;e a; e h) N + k (e s;e a) .T max S X k;s;a;h;e s;e a; e h x k (s;a;h) P s 0 ;h 0P s;a;h (s 0 ;h 0 )q k;(s 0 ;h 0 ) (e s;e a; e h) N + k (e s;e a) .T max S X k;s;a;h;e s;e a; e h q k (e s;e a; e h) N + k (e s;e a) .T max S 2 A X k;e s;e a q k (e s;e a) N + k (e s;e a) .S 3 A 2 T 2 max ; where the second inequality follows byq k (s;a;h).T max x k (s;a;h), the third by x k (s;a;h) X s 0 ;h 0 P s;a;h (s 0 ;h 0 )q k;(s 0 ;h 0 ) (e s;e a; e h)q k (e s;e a; e h); and the last one by Lemma 64. 205 Lemma63 (Extended Value Dierence). For any policies; 0 , transitionsP;P 0 , and cost functionsc;c 0 in M, we have: Q ;P;c (s;a;h)Q 0 ;P 0 ;c 0 (s;a;h) = X s 00 ;h 00 P 0 s;a;h (s 00 ;h 00 ) X s 0 ;h 0 q 0 ;P 0 ;(s 00 ;h 00 ) (s 0 ;h 0 ) X a 0 (a 0 js 0 ;h 0 ) 0 (a 0 js 0 ;h 0 ) Q ;P;c (s 0 ;a 0 ;h 0 ) + X s 0 ;a 0 ;h 0 q 0 ;P 0 ;(s;a;h) (s 0 ;a 0 ;h 0 ) Q ;P;c (s 0 ;a 0 ;h 0 )c 0 (s 0 ;a 0 ;h 0 )P 0 s 0 ;a 0 ;h 0V ;P;c : and V ;P;c (s;h)V 0 ;P 0 ;c 0 (s;h) = X s 0 ;h 0 q 0 ;P 0 ;(s;h) (s 0 ;h 0 ) X a 0 (a 0 js 0 ;h 0 ) 0 (a 0 js 0 ;h 0 ) Q ;P;c (s 0 ;a 0 ;h 0 ) + X s 0 ;a 0 ;h 0 q 0 ;P 0 ;(s;h) (s 0 ;a 0 ;h 0 ) Q ;P;c (s 0 ;a 0 ;h 0 )c 0 (s 0 ;a 0 ;h 0 )P 0 s 0 ;a 0 ;h 0V ;P;c : Proof. We rst prove the second statement, note that: V ;P;c (s;h)V 0 ;P 0 ;c 0 (s;h) = X a 0 (a 0 js;h) 0 (a 0 js;h) Q ;P;c (s;a 0 ;h) + X a 0 0 (a 0 js;h)(Q ;P;c (s;a 0 ;h)Q 0 ;P 0 ;c 0 (s;a 0 ;h)) = X a 0 (a 0 js;h) 0 (a 0 js;h) Q ;P;c (s;a 0 ;h) + X a 0 0 (a 0 js;h) Q ;P;c (s;a 0 ;h)c 0 (s;a 0 ;h)P 0 s;a 0 ;h V ;P;c + X a 0 0 (a 0 js;h)P 0 s;a 0 ;h (V ;P;c V 0 ;P 0 ;c 0 ): 206 Applying the equality above recursively and by the denition ofq 0 ;P 0 ;(s;h) , we prove the second statement. For the rst statement, note that: Q ;P;c (s;a;h)Q 0 ;P 0 ;c 0 (s;a;h) = Q ;P;c (s;a;h)c 0 (s;a;h)P 0 s;a;h V ;P;c +P 0 s;a;h (V ;P;c V 0 ;P 0 ;c 0 ): Applying the second statement and the denition ofq 0 ;P 0 ;(s;a;h) completes the proof. Lemma64. For anyz k : ! [0; 1], with probability at least 1, K X k=1 X (s;a) n k (s;a) p z k (s;a) q N + k (s;a) = ~ O 0 @ SAT max + s SA X k X (s;a) n k (s;a)z k (s;a) 1 A = ~ O 0 @ SAT max + s SA X k X (s;a) q k (s;a)z k (s;a) 1 A ; K X k=1 X (s;a) q k (s;a) p z k (s;a) q N + k (s;a) = ~ O 0 @ SAT max + s SA X k X (s;a) n k (s;a)z k (s;a) 1 A = ~ O 0 @ SAT max + s SA X k X (s;a) q k (s;a)z k (s;a) 1 A ; K X k=1 X (s;a) n k (s;a) N + k (s;a) = ~ O (SAT max ); K X k=1 X (s;a) q k (s;a) N + k (s;a) = ~ O (SAT max ): Proof. 
Firststatement: Sincez k (s;a) 1 and n k (s;a)L = ~ O (T max ) we have: K X k=1 n k (s;a) p z k (s;a) q N + k (s;a) K X k=1 n k (s;a) p z k (s;a) q N + k+1 (s;a) + K X k=1 L 0 @ 1 q N + k (s;a) 1 q N + k+1 (s;a) 1 A K X k=1 n k (s;a) p z k (s;a) q N + k+1 (s;a) + ~ O (T max ): 207 By Cauchy-Schwarz inequality this implies: X s;a K X k=1 n k (s;a) p z k (s;a) q N + k (s;a) = ~ O 0 @ v u u t K X k=1 X (s;a) n k (s;a) N + k+1 (s;a) v u u t K X k=1 X (s;a) n k (s;a)z k (s;a) +SAT max 1 A = ~ O 0 @ s SA X k X s;a n k (s;a)z k (s;a) +SAT max 1 A : Finally, P s;a P k n k (s;a)z k (s;a) = ~ O P s;a P k q k (s;a)z k (s;a) +SAT max with high probability by Lemma 126. Secondstatement: By Lemma 60 we have: K X k=1 X (s;a) q k (s;a) p z k (s;a) q N + k (s;a) K X k=1 X (s;a) E k [ n k (s;a)] p z k (s;a) q N + k (s;a) + ~ O (1=K) K X k=1 X (s;a) p z k (s;a) q N + k (s;a) K X k=1 X (s;a) E k [ n k (s;a)] p z k (s;a) q N + k (s;a) + ~ O (SA) = ~ O 0 @ K X k=1 X (s;a) n k (s;a) p z k (s;a) q N + k (s;a) +T max SA 1 A ; where the last relation holds with high probability by Lemma 126. Now the statement follows by the rst statement. Thirdandforthstatements: Similarly to the rst statement, K X k=1 n k (s;a) N + k (s;a) K X k=1 n k (s;a) N + k+1 (s;a) + K X k=1 L 1 N + k (s;a) 1 N + k+1 (s;a) ! K X k=1 n k (s;a) maxf1; P ik n i (s;a)g + ~ O (T max ) = ~ O (T max ): Summing over (s;a) proves the third statement. The forth statement is then proved similarly to the second statement. 208 D.4 OmittedDetailsforSection5.3 ExtraNotation Denee q k =q k ; e P k ,Q k =Q k ;P;c k ,V k =V k ;P;c k , andA k =A k ;P;c k . D.4.1 ProofofTheorem16 In this part, dene e P k = ( k ;P k ;e c k ) andP k = ( k ;P k ;c k ), such that e Q k =Q k ; e P k ;e c k , e V k =V k ; e P k ;e c k , and b Q k =Q k ;P k ;c k . We rst provide bounds on some important quantities. Lemma65. e c k 2C M , e A k B k 1 1, andkB k k 1 1 2H 0 . Proof. For the rst statement, byP k 2 M , we have b Q k (s;a;h) H 1 +c f =. Therefore, b Q k (s;a;h) 1 ande c k 2 C M . For the second statement, bye c k (s;a;h) 2 for h H, we havej e A k (s;a;h)j j e Q k (s;a;h)j +j e V k (s;h)j 4( H 1 +c f ) = 4 forhH. Therefore,kb k k 1 32 2 , and by Lemma 70, we havekB k k 1 15Hkb k k 1 1 960HT max 2 . Thus by the denition of, we havekB k k 1 1 2H 0 and e A k B k 1 e A k 1 +kB k k 1 1. We are now ready to prove Theorem 16. The proof decomposes the regret into several terms, each of which is bounded by a lemma included after the proof. Proof of Theorem 16. With probability at least 1 10, we decompose the regret as follows: R K = K X k=1 hn k q k ;c k i +hq k q ? ;c k i (i) ~ O 0 @ v u u t K X k=1 hq k ;c k Q k i +SAT max 1 A + K X k=1 hq k e q k ;e c k i + K X k=1 he q k q ? ;e c k i K X k=1 D q k ;c k b Q k E + K X k=1 D q ? ;c k b Q k E (ii) = ~ O 0 @ v u u t S 2 A K X k=1 hq k ;c k Q k i +S 2:5 A 1:5 T 3 max 1 A + K X k=1 he q k q ? ;e c k i K X k=1 hq k ;c k Q k i + K X k=1 D q k ;c k (Q k b Q k ) E + K X k=1 D q ? ;c k Q ? ;P;c k E + K X k=1 D q ? ;c k ( b Q k Q ? 
;P;c k ) E ; 209 where in (i) we apply Lemma 47 and Lemma 123 to have K X k=1 hn k q k ;c k i = K X k=1 h n k q k ;c k i = ~ O 0 @ v u u t K X k=1 E k [h n k ;c k i 2 ] +SAT max 1 A = ~ O 0 @ v u u t K X k=1 hq k ;c k Q k i +SAT max 1 A ; (Lemma 59) and in (ii) we apply Lemma 61 ande c k 2C M forhH to have K X k=1 hq k e q k ;e c k i = ~ O 0 @ v u u t S 2 A K X k=1 q k ;e c k Q k ;P;e c k +S 2:5 A 1:5 T 3 max 1 A = ~ O 0 @ v u u t S 2 A K X k=1 hq k ;c k Q k ;P;c k i +S 2:5 A 1:5 T 3 max 1 A : (e c k (s;a;h) 2c k (s;a;h)) Dene 0 = q S 2 A DT?K . By Lemma 66, Lemma 67, Lemma 68, and denition of;, with probability at least 1 9: R K ~ O 0 @ v u u t S 2 A K X k=1 hq k ;c k Q k i + T ? +S 4 A 2 T 5 max 1 A + 24 K X k=1 q k ;A 2 k K X k=1 hq k ;c k Q k i + K X k=1 D q ? ;c k Q ? ;P;c k E = ~ O S 2 A 0 + 0 K X k=1 hq k ;c k Q k i + ~ O T ? +S 4 A 2 T 5 max + 48 K X k=1 hq k ;c k Q k i K X k=1 hq k ;c k Q k i +O (DT ? K) (AM-GM inequality, Lemma 59, and Lemma 69) = ~ O T ? p DK + p S 2 ADT ? K +S 4 A 2 T 5 max : (K = ~ O(S 2 AT 2 max ) when< 48 + 0 ) Applying Lemma 10 completes the proof. 210 Lemma66. With probability at least 1 6, K X k=1 he q k q ? ;e c k i = 24 q k ;A 2 k + ~ O T ? +S 4 A 2 T 3:5 max : Proof. Note that by Lemma 119 and Lemma 65: K X k=1 X s;h q ? (s;h) X a2As ( k (ajs;h) ? (ajs;h)) e A k (s;a;h)B k (s;a;h) X s;h q ? (s;h) lnA + K X k=1 X a2As k (ajs;h) e A k (s;a;h)B k (s;a;h) 2 ! ~ O T ? + 2 X s;h q ? (s;h) K X k=1 X a2As k (ajs;h) e A k (s;a;h) 2 + K X k=1 X a2As k (ajs;h)B k (s;a;h) 2 ! = ~ O T ? + K X k=1 hq ? ;b k i + 1 H 0 X s;h q ? (s;h) K X k=1 X a2As k (ajs;h)B k (s;a;h): (Lemma 65) Deneb q 0 k =q k ;P 0 k , whereP 0 k is the optimistic transition dened inB k . We have K X k=1 he q k q ? ;e c k i = K X k=1 X s;h q ? (s;h) X a2As ( k (ajs;h) ? (ajs;h)) e A k (s;a;h)B k (s;a;h) + K X k=1 X s;a;h q ? (s;a;h) e Q k (s;a;h)e c k (s;a;h)P s;a;h e V k + K X k=1 X s;h q ? (s;h) X a2As ( k (ajs;h) ? (ajs;h))B k (s;a;h) (shifting argument and Lemma 63) (i) ~ O T ? + 3 K X k=1 b q 0 k ;b k + ~ O (T max ) = ~ O T ? + 6 K X k=1 D q k ; e A 2 k E + 3 K X k=1 b q 0 k q k ;b k ; 211 where in (i) we apply Lemma 71,b k (s;a;h) = ~ O(1), and the denition of e P k so that K X k=1 X s;a;h q ? (s;a;h) e Q k (s;a;h)e c k (s;a;h)P s;a;h e V k 0: For the second term, by (a +b +c) 2 2a 2 + 2(b +c) 2 2a 2 + 4b 2 + 4c 2 , K X k=1 X s;a;h q k (s;a;h) e A k (s;a;h) 2 2 K X k=1 X s;a;h q k (s;a;h)A k ;P;e c k (s;a;h) 2 + 4 K X k=1 X s;a;h q k (s;a;h) Q k ; e P k ;e c k (s;a;h)Q k ;P;e c k (s;a;h) 2 + 4 K X k=1 X s;h q k (s;h) V k ; e P k ;e c k (s;h)V k ;P;e c k (s;h) 2 2 K X k=1 X s;a;h q k (s;a;h)A k ;P;e c k (s;a;h) 2 + 8 K X k=1 X s;a;h q k (s;a;h) Q k ; e P k ;e c k (s;a;h)Q k ;P;e c k (s;a;h) 2 ; where in the last step we apply Cauchy-Schwarz inequality to obtain V k ; e P k ;e c k (s;h)V k ;P;e c k (s;h) 2 = X a k (ajs;h)(Q k ; e P k ;e c k (s;a;h)Q k ;P;e c k (s;a;h)) ! 
2 X a k (ajs;h) Q k ; e P k ;e c k (s;a;h)Q k ;P;e c k (s;a;h) 2 : 212 Note that with probability at least 1 2, Q k ; e P k ;e c k (s;a;h)Q k ;P;e c k (s;a;h) (D.12) = ~ O 0 @ T max X s 0 ;a 0 ;h 0 H q k ;P;(s;a;h) (s 0 ;a 0 ;h 0 ) P s 0 ;a 0 ;h 0 e P k;s 0 ;a 0 ;h 0 1 1 A (Lemma 63, Hölder’s inequality, andV k ; e P k ;e c k = ~ O(T max )) = ~ O 0 @ T max S X s 0 ;a 0 q k ;P;(s;a;h) (s 0 ;a 0 ) q N + k (s 0 ;a 0 ) 1 A : (Lemma 49) Therefore, with probability at least 1, K X k=1 X s;a;h q k (s;a;h) Q k ; e P k ;e c k (s;a;h)Q k ;P;e c k (s;a;h) 2 K X k=1 X s;a;hH q k (s;a;h)T 3 max S 2 X s 0 ;a 0 q k ;P;(s;a;h) (s 0 ;a 0 ) N + k (s 0 ;a 0 ) (Cauchy-Schwarz inequality) (i) = ~ O 0 @ T 4 max S 3 A K X k=1 X s 0 ;a 0 q k ;P (s 0 ;a 0 ) N + k (s 0 ;a 0 ) 1 A (ii) = ~ O T 5 max S 4 A 2 ; where in (i) we applyq k (s;a;h) 2T max x k (s;a;h) andx k (s;a;h)q k ;P;(s;a;h) (s 0 ;a 0 )q k ;P (s 0 ;a 0 ), and in (ii) we apply Lemma 64. Plugging these back, we get: K X k=1 X s;a;h q k (s;a;h) e A k (s;a;h) 2 2 K X k=1 D q k ; (A k ;P;e c k ) 2 E + ~ O T 5 max S 4 A 2 4 K X k=1 q k ;A 2 k + 4 2 K X k=1 D q k ; (A k ;P; b Q k ) 2 E + ~ O T 5 max S 4 A 2 ((a +b) 2 2a 2 + 2b 2 ) 4 q k ;A 2 k + ~ O T 5 max S 4 A 2 + 2 T 5 max K = 4 q k ;A 2 k + ~ O S 4 A 2 T 3 max : 213 For the third term, with probability at least 1 3, K X k=1 b q 0 k q k ;b k K X k=1 X s;a;hH q k (s;a;h) b P 0 k;s;a;h P s;a;h 1 V k ; b P 0 k ;b k 1 (Lemma 63 and Hölder’s inequality) = ~ O 0 @ ST 3 max K X k=1 X s;a;hH q k (s;a;h) q N + k (s;a) 1 A (b k (s;a;h) = ~ O(T 2 max ) and Lemma 49) = ~ O ST 3 max p SAT max K +S 2 AT 4 max = ~ O S 2 AT 3:5 max : (Lemma 64) Putting everything together completes the proof. Lemma67. P K k=1 D q ? ;c k ( b Q k Q ? ;P;c k ) E = ~ O S 2 AT 5 max . 214 Proof. Deneq 0 s;a;h (s 0 ;h 0 ) = P s 00 ;h 00P s;a;h (s 00 ;h 00 )q ? (s 00 ;h 00 ) (s 0 ;h 0 ). We have forhH: K X k=1 ( b Q k (s;a;h)Q ? ;P;c k (s;a;h)) K X k=1 (Q k ; e P k ;c k (s;a;h)Q ? ;P;c k (s;a;h)) K X k=1 Q k ; e P k ;e c k (s;a;h)Q ? ;P;e c k (s;a;h) + 2 K X k=1 Q ? ;P; b Q k (s;a;h) K X k=1 Q k ; e P k ;e c k (s;a;h)Q ? ;P;e c k (s;a;h) + ~ O 2 T 2 max K X s 0 ;h 0 q 0 s;a;h (s 0 ;h 0 ) K X k=1 X a 0 k (a 0 js 0 ;h 0 ) ? (a 0 js 0 ;h 0 ) e A k (s 0 ;a 0 ;h 0 )B k (s 0 ;a 0 ;h 0 ) + X s 0 ;a 0 ;h 0 q ? (s;a;h) (s 0 ;a 0 ;h 0 ) K X k=1 Q k ; e P k ;e c k (s 0 ;a 0 ;h 0 )e c k (s 0 ;a 0 ;h 0 )P s 0 ;a 0 ;h 0V k ; e P k ;e c k + X s 0 ;h 0 q 0 s;a;h (s 0 ;h 0 ) K X k=1 X a 0 k (a 0 js 0 ;h 0 ) ? (a 0 js 0 ;h 0 ) B k (s 0 ;a 0 ;h 0 ) + ~ O 2 T 2 max K (Lemma 63) = ~ O 0 @ X s 0 ;h 0 q 0 s;a;h (s 0 ;h 0 ) T ? + K X k=1 X a2As k (ajs;h)T 2 max ! +T 4 max K + 2 T 2 max K 1 A ( e A k 1 = ~ O(T max ), denition of e P k , andB k (s;a;h) = ~ O(T 3 max )) = ~ O T 2 max +T 4 max K + 2 T 2 max K : Plugging this back and by the denition of;: K X k=1 D q ? ; b Q k Q ? ;P;c k E = ~ O T 3 max +T 5 max K + 2 T 3 max K = ~ O S 2 AT 5 max : This completes the proof. Lemma68. With probability at least 1 3, P K k=1 D q k ;c k (Q k b Q k ) E = ~ O S 3:5 A 2 T 3 max . 215 Proof. By similar arguments as in Eq. 
(D.12) with e P k replaced byP k ande c k replaced byc k , with probability at least 1 3: K X k=1 D q k ;Q k b Q k E =T max S K X k=1 X s;a;hH q k (s;a;h) X s 0 ;a 0 q k ;P;(s;a;h) (s 0 ;a 0 ) q N + k (s 0 ;a 0 ) = ~ O 0 @ S 2 AT 2 max K X k=1 X s 0 ;a 0 q k (s 0 ;a 0 ) q N + k (s 0 ;a 0 ) 1 A (q k (s;a;h) =O(T max x k (s;a;h)) andx k (s;a;h)q k ;P;(s;a;h) (s 0 ;a 0 )q k (s 0 ;a 0 )) = ~ O S 2 AT 2 max p SAT max K +S 3 A 2 T 3 max = ~ O S 3:5 A 2 T 3 max : (Lemma 64) Lemma69. P K k=1 q ? ;c k Q ? ;P;c k = ~ O (DT ? K) +T max . Proof. By Lemma 9, forhH, P K k=1 c k (s;a;h)Q ? ;P;c k (s;a;h) P K k=1 (Q ? ;P;c k (s;a)+c f ) = ~ O(DK). Therefore, K X k=1 D q ? ;c k Q ? ;P;c k E = K X k=1 X s;a;hH q ? (s;a;h)c k (s;a;h)Q ? ;P;c k (s;a;h) + K X k=1 X s;a q ? (s;a;H + 1)c f ~ O (DT ? K) +T max ; where the last step is by P s;a;hH q ? (s;a;h) =O(T ? ), P K k=1 Q ? ;P;c k (s;a;h) =O(DK), and Lemma 52. 216 D.4.2 DilatedBonusinSDA Below we present lemmas related to dilated bonus in M. We rst show that a form of dilated value function is well-dened. Lemma 70. For some policy in M, transitionP 2 M , and bonus functionb : [H]! [0;] for some > 0, dene B(s;a;h) = b(s;a;h) + 1 + 1 H 0 P s;a;h B, B(s;h) = P a (ajs;h)B(s;a;h) and B(g) =B(s;a;H + 1) = 0. Then, max s;a B(s;a;h) 15(Hh+1) 1 . Proof. Dene 0 = (1 + 1 H 0 ) and recall thatH 0 = 8(H+1) ln(2K) 1 . Now note that 1 1 0 1+ 1 H 1 by simple algebra. Finally, dene b(s;a;h) = 1 + 1 H 0 hP s;a;h (;h + 1);B(;h + 1)i forh H, andP 0 s;a;h (s 0 ) = (1 + 1 H 0 )P s;a;h (s 0 ;h). We prove thatB is well dened and the statement holds by induction onh =H + 1;:::; 1. The base case is true by denitionB(s;a;H + 1) = 0. ForhH we have: B(s;a;h) =b(s;a;h) + 1 + 1 H 0 (hP s;a;h (;h);B(;h)i +hP s;a;h (;h + 1);B(;h + 1)i) =b(s;a;h) + b(s;a;h) +P 0 s;a;h B(;h): Therefore,B(;;h) can be treated as the action-value function in an SSP with cost (b + b)(;;h) and transition functionP 0 (thus well dened). By P s 0P 0 s;a;h (s 0 ;h) 0 , we have the expected hitting time of any policy starting from any state in an SSP with transitionP 0 is upper bounded by 1 1 0 1+ 1 H 1 . Let R(h) = max s;a B(s;a;h) and note that R(H + 1) = 0. Since b(s;a;h) and b(s;a;h) 1 + 1 H 0 (1 )R(h + 1) by P s 0P s;a;h (s 0 ;h + 1) 1 , we have: R(h) + 1 + 1 H 0 (1 )R(h + 1) 1 0 1 0 + 1 + 1 H 0 1 + 1 H R(h + 1) 1 + 1 H 1 + 1 + 1 H 0 1 + 1 H R(h + 1); 217 where the two last inequalities follow because 1 1 0 1+ 1 H 1 . The proof is now nished by solving the recursion and obtaining: R(h) 1 Hh X i=0 1 + 1 H 0 i 1 + 1 H i+1 ; which implies thatR(h) 15(Hh+1) 1 since (1 + 1 H ) H+1 (1 + 1 H 0 ) H 2e 2 15. Lemma71. Let beapolicyin Mandbbeanon-negativecostfunctionin Msuchthatb(s;a;H + 1) = 0 andb(s;a;h). Moreover, let b P2 M be an optimistic transition so that B(s;a;h) =b(s;a;h) + 1 + 1 H 0 b P s;a;h Bb(s;a;h) + 1 + 1 H 0 P s;a;h B; whereB(s;h) = P a2As (ajs;h)B(s;a;h) andB(g) =B(s;H + 1) = 0. Then, X s;h q ? (s;h) X a2As ((ajs;h) ? (ajs;h))B(s;a;h) + 1 H 0 X s;h q ? (s;h)B(s;h) + X s;a;h q ? (s;a;h)b(s;a;h) 3V ; b P;b (s init ; 1) + ~ O H K(1 ) : 218 Proof. By the optimism property of b P , we have: X s;h q ? (s;h) X a2As ((ajs;h) ? (ajs;h))B(s;a;h) + 1 H 0 X s;h q ? (s;h) X a2As (ajs;h)B(s;a;h) + X s;a;h q ? (s;a;h)b(s;a;h) 1 + 1 H 0 X s;h q ? (s;h) X a2As (ajs;h)B(s;a;h) + X s;a;h q ? (s;a;h)b(s;a;h) X s;a;h q ? (s;a;h) 0 @ b(s;a;h) + 1 + 1 H 0 X s 0 ;h 0 P s;a;h (s 0 ;h 0 )B(s 0 ;h 0 ) 1 A = 1 + 1 H 0 X s 0 ;h 0 0 @ q ? (s 0 ;h 0 ) X s;a;h q ? 
(s;a;h)P s;a;h (s 0 ;h 0 ) 1 A B(s 0 ;h 0 ) = 1 + 1 H 0 B(s init ; 1): (D.13) The last relation is byq ? (s;h) P s 0 ;a 0 ;h 0q ? (s 0 ;a 0 ;h 0 )P s 0 ;a 0 ;h 0(s;h) =If(s;h) = (s init ; 1)g (see (Rosenberg and Mansour, 2021, Appendix B.1)). LetJ be the number of steps until the goal stateg is reached in M, andn = 8H 1 ln(2K). Now note that for any policy, the expected hitting time in an SSP with transition b P is upper bounded by H 1 + 1 by b P2 M . Therefore, by Lemma 118,P (Jn) 1 K , and B(s;h) =E " J X t=1 1 + 1 H 0 t1 b(s t ;a t ;h t ) ; b P; (s 1 ;h 1 ) = (s;h) # =E " n X t=1 1 + 1 H 0 t1 b(s t ;a t ;h t ) + 1 + 1 H 0 n B(s t+1 ;h t+1 ) ; b P; (s 1 ;h 1 ) = (s;h) # 1 + 1 H 0 n1 V ; b P;b (s;h) + ~ O H K(1 ) : (Lemma 70) Plugging this back into Eq. (D.13) and by (1 + 1=H 0 ) n e< 3, we get the desired result. 219 D.4.3 ComputationofB k We study an operator on value function, from whichB k can be computed as a xed point. For any policy, cost functionc, transition condence setP M , and interest factor 0, we dene the dilated Bellman operatorT that maps any value functionV : S + !R + to another value functionT V : S + !R + , such that: (T V )(s;h) = X a (ajs;h) c(s;a;h) + (1 +) max P2P P s;a;h V ; (T V )(g) = 0; (T V )(s;H + 1) = max a c(s;a;H + 1): (D.14) In this work, we haveP2fP k g K k=1 , andP k = T s;a;h P k;s;a;h , whereP k;s;a;h is a convex set that species constraints on ((s;h);a). In other words,P k is a product of constraints on each ((s;h);a) (note that M can also be decomposed into shared constraints onP s;a;H+1 and independent constraints on each s;a;h H). Thus, there exists P 0 2P that satises P 0 = argmax P2P P s;a;h V in Eq. (D.14) for all ((s;h);a) simultaneously. Moreover, nding suchP 0 can be done by linear programming for each ((s;h);a) independently. Now we show that iteratively applyingT to some initial value function converges to a xed point suciently fast. Lemma72. DenevaluefunctionV 0 : S + !R + suchthatV 0 (s;h) =V 0 (g) = 0forany (s;h)2S[H] and V 0 (s;H + 1) = max a c(s;a;H + 1). Then for any 0 such that 0 = (1 + ) < 1, the limit V = lim n!1 T n V 0 exists. Moreover, when n Hl with l =d ln 1 1 0 e for some > 0, we have T n V 0 V 1 H (1+)(1 ) 1 0 H1 , where = P H1 j=0 ( (1+)(1 ) 1 0 ) j kck 1 1 0 . 220 Proof. Dene a sequence of value functionsfV i g 1 i=0 such thatV i+1 =T V i . We rst show that V i (;h) 1 P Hh j=0 ( (1+)(1 ) 1 0 ) j kck 1 1 0 fori 0 andhH. We prove this by induction oni. Note that this is clearly true wheni = 0. Fori> 0, byP M and Eq. (D.14), we have: V i (s;h) = (T V i1 )(s;h)kck 1 + 0 V i1 (;h) 1 + (1 +)(1 ) V i1 (;h + 1) 1 kck 1 + 0 Hh X j=0 (1 +)(1 ) 1 0 j kck 1 1 0 + Hh X j=1 (1 +)(1 ) 1 0 j kck 1 Hh X j=0 (1 +)(1 ) 1 0 j kck 1 1 0 : Therefore, V i 1 . We now show thatfV i g i converges to a xed point. Specically, we show that for some> 0 and anyi;j2N, whenn (Hh+1)l, we have (T n V i )(;h) (T n V j )(;h) 1 (Hh+ 1)( (1+)(1 ) 1 0 ) Hh (note that (1+)(1 ) 1 0 > 1). Therefore, whennHl, we have T n V i T n V j 1 H( (1+)(1 ) 1 0 ) H1 . Setting! 0, the statement above implies that for any s2 S,fV i ( s)g 1 i=1 is a Cauchy sequence and thus converges. Moreover, lettingj!1 implies thatfV i g i converges toV with the rate shown above. We prove the statement above by induction onh =H;:::; 1. 
First note that for any s2S;h2 [H]: (T V i )(s;h) (T V j )(s;h) = (1 +) X a (ajs;h) max P2P P s;a;h V i max P2P P s;a;h V j (1 +) X a (ajs;h) max P2P P s;a;h (V i V j ) 0 V i (;h)V j (;h) 1 + (1 +)(1 ) V i (;h + 1)V j (;h + 1) 1 ; (D.15) where the last inequality is by P s 0P s;a;h (s 0 ;h) , P s 0P s;a;h (s 0 ;h + 1) 1 , andP s;a;h (s 0 ;h 0 ) = 0 forh 0 = 2fh;h + 1g, for anyP2 M . Now for the base caseh =H, Eq. (D.15) implies (T V i )(;H) (T V j )(;H) 1 0 V i (;H)V j (;H) 1 : 221 Thus forn l, (T n V i )(;H) (T n V j )(;H) 1 0 n . For the induction steph < H, if n (Hh + 1)l, then Eq. (D.15) implies: (T n V i )(s;h) (T n V j )(s;h) 0 l (T nl V i )(s;h) (T nl V j )(s;h) 1 + (1 0 ) (1 +)(1 ) 1 0 Hh l1 X i=0 0 i (Hh) (by the induction assumption) (Hh + 1) (1 +)(1 ) 1 0 Hh : This completes the proof of the statement above. Now note thatB k is a xed point ofT with = k ,P =P k ,c =b k , and = 1=H 0 . Thus,B k can be approximated eciently. D.5 LearningwithoutSomeParameters In this section, we discuss the achievable regret guarantee without knowing some of the parameters assumed to be known. For simplicity, we only describe the high level ideas. We rst describe the general ideas of dealing with each parameter being unknown, which are applicable under all types of feedback. • UnknownD andunknownfastpolicy: we can simply follow the ideas in (Chen and Luo, 2021) to estimate D and fast policy. For unknown fast policy, we maintain an instance of Bernstein- SSP (Cohen et al., 2020)B f . When we need to switch to the fast policy, we simply involveB f as if this is a new episode for this algorithm, follow its decision until reachingg, and always feed cost 1 for all state-action pairs. Following the arguments in (Chen and Luo, 2021, Lemma 1), the scheme above only incurs constant extra regret. For unknownD, we maintain an estimate of it and update the algorithm’s parameters whenever the estimate is updated. Specically, we separate the state 222 space into known states and unknown states. A state is known if the number of visits to it is more than some threshold, and it is unknown otherwise. Whenever the learner visits an unknown state, it involves a Bernstein-SSP instance to approximate the behavior of fast policy until reachingg. When an unknown states becomes known, we update the diameter estimate by incorporating an estimate ofT f (s), and then updates the algorithm’s parameters with respect to the new estimate. In terms of regret, this approach does not aect the transition estimation error, but brings an extra p S factor in the regret from policy optimization due to at mostS updates to the algorithm’s parameters. • UnknownB ? : We can estimateB ? following the procedure in (Cohen et al., 2021, Appendix C). The main idea is pretty similar to the unknownD case: we again maintain an estimate ofB ? and separate states into known states and unknown states based on how many times a state has been visited. The learner updates algorithm’s parameters whenever the estimate ofB ? is updated. Similarly, this approach brings an extra p S factor in the regret from policy optimization. • UnknownT ? : We can replaceT ? in parameters byB ? =c min in stochastic costs setting andD=c min in other settings sinceT ? B ? =c min (orT ? D=c min ). How to estimateD orB ? is discussed above. • UnknownT max : Similar to (Chen and Luo, 2021), we simply replaceT max in parameters byK p for somep2 (0; 1 2 ). 
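As a concrete, intentionally simplified illustration of the estimate-and-update idea sketched in the bullets above, the following Python snippet maintains a doubling estimate of an unknown scale parameter (such as D or B_?) and invokes a re-tuning callback whenever the estimate grows. The names and the doubling rule are hypothetical, and the known/unknown-state machinery and Bernstein-SSP sub-instances of Chen and Luo (2021) and Cohen et al. (2021) are omitted.

```python
# A schematic sketch of the "maintain an estimate and re-tune the parameters"
# idea described above.  ScaleEstimate, the doubling rule, and retune() are
# illustrative placeholders; the actual procedures additionally split states
# into known/unknown and run Bernstein-SSP sub-instances, which is omitted.

class ScaleEstimate:
    """Doubling estimate of an unknown scale parameter (e.g. D or B_*)."""

    def __init__(self, initial: float, retune):
        self.value = initial      # current optimistic guess
        self.retune = retune      # callback that resets learning rates, bonuses, ...

    def observe(self, statistic: float) -> None:
        """Called when new evidence (e.g. an observed hitting time or a
        cumulative cost) suggests the current guess is too small."""
        updated = False
        while self.value < statistic:
            self.value *= 2.0     # each update at least doubles the estimate
            updated = True
        if updated:
            self.retune(self.value)

# Example: re-tune a (hypothetical) learning rate that scales like 1/estimate.
params = {"lr": 1.0}
est = ScaleEstimate(initial=1.0, retune=lambda v: params.update(lr=1.0 / v))
for observed in [0.5, 3.0, 2.0, 10.0]:
    est.observe(observed)
print(est.value, params)          # -> 16.0 {'lr': 0.0625}
```

In the actual algorithms the parameters are re-tuned only when a state becomes known or the estimate grows, so the number of re-tunings stays small; this is what keeps the extra regret of these schemes at lower order, up to the additional factors noted above.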
Next, we describe under each setting, what regret guarantee we can achieve with each parameter being unknown by applying the corresponding method above. StochasticCosts In this setting, we need the knowledge ofD,B ? andT max . • UnknownD: Since the regret from policy optimization is a lower order term, the dominating term of the nal regret remains to be ~ O(B ? S p AK). 223 • UnknownB ? : Since the regret from policy optimization is a lower order term, the dominating term of the nal regret remains to be ~ O(B ? S p AK). • Unknown T max : We replace T max in parameters by K 1=12 . If K 1=12 T max , then clearly the regret is of order ~ O(LK) = ~ O(T 13 max ). Otherwise, by Theorem 15 we haveR K = ~ O(B ? S p AK + S 4 A 2:5 K 1=3 ). AdversarialCosts,FullInformation In this setting, we need the knowledge ofD,T ? , andT max . We consider the following cases: • UnknownD: With an extra p S factor in the policy optimization term, we haveR K = ~ O(T ? p SDK+ p S 2 ADT ? K) ignoring the lower order terms. • UnknownT ? : Ignoring the lower order terms, we haveR K = ~ O( D 1:5 c min p K +D p S 2 AK=c min ). • Unknown T max : We replace T max in parameters by K 1=11 . If K 1=11 T max , then clearly the regret is of order ~ O(LK) = ~ O(T 12 max ). Otherwise, by Theorem 16 we haveR K = ~ O(T ? p DK + p S 2 ADT ? K +S 4 A 2 K 5=11 ). 224 AppendixE OmittedDetailsinChapter6 E.1 Preliminaries ExtraNotation We rst dene (or restate) some notation used throughout the whole Appendix. • Let c;[i;j] = P j1 =i c +1 c 1 , P;[i;j] = P j1 =i max s;a P +1 s;a P s;a 1 . It is straightforward to verify that c;[1;M] = c and P;[1;M] = P . • Dene c;m = c;[i c m ;m] and P;m = P;[i P m ;m] , wherei c m andi P m are the rst intervals after the last resets ofM andN before intervalm respectively. • For all algorithms, denote byb c m , c m , P m s;a ,b m h ,N + m ,M + m , m the value ofb c, c, P s;a ,b h ,N + ,M + , at the beginning of intervalm, and deneb c m h =b c m (s m h ;a m h ), c m h = c(s m h ;a m h ),N m h =N + (s m h ;a m h ), andM m h = M + (s m h ;a m h ). We also slightly abuse the notation and writeb m (s m h ;a m h ) asb m h when there is no confusion. • Denee c m (s;a) = 1 M + m (s;a) P m1 m 0 =i c m P H m 0 h=1 c m 0 (s;a)If(s m 0 h ;a m 0 h ) = (s;a)g, e c m h = e c m (s m h ;a m h ), e P m s;a = 1 N + m (s;a) P m1 m 0 =i P m P H m 0 h=1 P m 0 s;a If(s m 0 h ;a m 0 h ) = (s;a)g, P m h = P m s m h ;a m h , and e P m h = e P m s m h ;a m h . • Denote by L c;[i;j] and L P;[i;j] one plus the number of resets of M and N within intervals [i;j] respectively, and deneL c;m =L c;[1;m] ,L P;m =L P;[1;m] ,L m =L c;m +L P;m for anym 1. 225 • Denef c (m) (orf P (m)) as the earliest interval at or after intervalm in which the learner resetsM (orN). • Denem m h =IfM m (s m h ;a m h ) = 0g,n m h =IfN m (s m h ;a m h ) = 0g,C M 0 = P M 0 m=1 P Hm+1 h=1 c m h , and bonus functionb m (s;a;V ) = max 7 r V( P m s;a ;V )m N + m (s;a) ; 49B p Sm N + m (s;a) . • DeneT ? ;m h (s) (orT ? ;m h (s;a)) as the hitting time (reachingg or layerH + 1) of ? k(m) starting from states (or state-action pair (s;a)) in layerh w.r.t. transitionP m , such thatT ? ;m h (s;a) = 1 + P m s;a T ? ;m h+1 , T ? ;m h (s) = T ? ;m h (s; ? k(m) (s)), and T ? ;m H+1 (s) = T ? ;m H+1 (s;a) = T ? ;m h (g) = T ? ;m h (g;a) = 0. • For notational convenience, we often writeV ? k(m) ;m h asV ? ;m h . • Dene (x) + = maxf0;xg. OptimalValueFunctionsof M We denote byQ ?;m h andV ?;m h the optimal value functions in interval m. 
It is not hard to see that they can be dened recursively as follows:V ?;m H+1 =c f and forhH, Q ?;m h (s;a) =c m (s;a) +P m s;a V ?;m h+1 ; V ?;m h (s) = min a Q ?;m h (s;a): For notational convenience, we also letQ ?;m H+1 (s;a) =V ?;m H+1 (s) for any (s;a)2 . Lemma73. For anym 1 andhH + 1,Q ?;m h (s;a)Q ? ;m h (s;a) 4B ? . Proof. This is simply byQ ? ;m h (s;a) 1 + max s V ? k (s) + 2B ? 4B ? . AuxiliaryLemmas Below we provide auxiliary lemmas used throughout the whole Appendix and for all algorithms. 226 Lemma74. With probability at least 1 3, for anyM 0 M, M 0 X m=1 Hm X h=1 (c m (s m h ;a m h )b c m h ) 3 M 0 X m=1 Hm X h=1 s c m h m M m h + m M m h ! + M 0 X m=1 Hm X h=1 c;m ~ O p SAL c;M 0C M 0 +SAL c;M 0 + 2 M 0 X m=1 Hm X h=1 c;m ; and M 0 X m=1 Hm X h=1 s c m h m M m h + m M m h ! ~ O 0 @ p SAL c;M 0C M 0 +SAL c;M 0 + v u u t SAL c;M 0 M 0 X m=1 Hm X h=1 c;m 1 A : Proof. First note that by Lemma 124, with probability at least 1, for anym 1 and (s;a)2 , e c m (s;a) c m (s;a) s c m (s m h ;a m h ) M + m (s;a) + 1 M + m (s;a) : (E.1) For the rst inequality in the rst statement, note that M 0 X m=1 Hm X h=1 (c m (s m h ;a m h )b c m h ) M 0 X m=1 Hm X h=1 e c m (s m h ;a m h ) c m (s m h ;a m h ) + s c m h m M m h + m M m h +m m h ! + M 0 X m=1 Hm X h=1 c;m (denition ofb c m h andc m (s m h ;a m h )e c m (s m h ;a m h ) + c;m +m m h ) 3 M 0 X m=1 Hm X h=1 s c m h m M m h + m M m h ! + M 0 X m=1 Hm X h=1 c;m : (Eq. (E.1) andm m h 1 M m h ) 227 The second inequality in the rst statement simply follows from applying AM-GM inequality on the second statement. To prove the second statement, rst note that by Lemma 124, Cauchy-Schwarz inequality, and Lemma 82, with probability at least 1, M 0 X m=1 Hm X h=1 c m h = ~ O M 0 X m=1 Hm X h=1 e c m h + s c m h M m h + 1 M m h !! = ~ O 0 @ M 0 X m=1 Hm X h=1 e c m h + v u u t SAL c;M 0 M 0 X m=1 Hm X h=1 c m h +SAL c;M 0 1 A : Solving a quadratic inequality w.r.t. P M 0 m=1 P Hm h=1 c m h (Lemma 110) gives P M 0 m=1 P Hm h=1 c m h = ~ O( P M 0 m=1 P Hm h=1 e c m h + SAL c;M 0). Therefore, with probability at least 1, M 0 X m=1 Hm X h=1 s c m h m M m h + m M m h ! = ~ O 0 @ v u u t SAL c;M 0 M 0 X m=1 Hm X h=1 c m h +SAL c;M 0 1 A (Cauchy-Schwarz inequality and Lemma 82) = ~ O 0 @ v u u t SAL c;M 0 M 0 X m=1 Hm X h=1 e c m h +SAL c;M 0 1 A = ~ O 0 @ v u u t SAL c;M 0 M 0 X m=1 Hm X h=1 c;m + v u u t SAL c;M 0 M 0 X m=1 Hm X h=1 c m (s m h ;a m h ) +SAL c;M 0 1 A = ~ O 0 @ v u u t SAL c;M 0 M 0 X m=1 Hm X h=1 c;m + p SAL c;M 0C M 0 +SAL c;M 0 1 A : (Lemma 126) This completes the proof. Lemma75. Withprobabilityatleast 1,foranym 1, (s;a)2 ands 0 2S + , e P m s;a (s 0 ) P m s;a (s 0 ) r e P m s;a (s 0 )m 2N + m (s;a) + m 2N + m (s;a) r P m s;a (s 0 )m N + m (s;a) + m N + m (s;a) . Proof. The rst inequality hold with probability at least 1=2 by applying Lemma 124 for each (s;a)2 ands 0 2S + . Also by Lemma 126, we have e P m s;a (s 0 ) 2 P m s;a (s 0 ) + m 2N + m (s;a) for any (s;a)2 ;s 0 2S + 228 with probability at least 1=2. Substituting this back and applying p a +b p a + p b proves the second inequality. Lemma76. With probability at least 1, for any (s;a)2 andm 1,b c m (s;a)c m (s;a) + c;m . Proof. For any (s;a) andm 1, whenM m (s;a) = 0, the statement clearly holds since c m (s;a) = 0. 
Otherwise, by Lemma 124 and Lemma 126, with probability at least 1, for all (s;a) and m 1 simultaneously, j c m (s;a)e c m (s;a)j 3 s e c m (s;a) M + m (s;a) ln 32SAm 5 + 2 ln 32SAm 5 M + m (s;a) 3 v u u u t 2 c m (s;a) + 12 ln 4SAm M + m (s;a) M + m (s;a) ln 32SAm 5 + 2 ln 32SAm 5 M + m (s;a) s c m (s;a) m M + m (s;a) + m M + m (s;a) : (E.2) Therefore, by maxf0;ag maxf0;bg maxf0;abg, b c m (s;a)c m (s;a)b c m (s;a)e c m (s;a) + c;m max ( 0; c m (s;a)e c m (s;a) s c m (s;a) m M + m (s;a) m M + m (s;a) ) + c;m c;m ; where the last step is by Eq. (E.2). Lemma 77. Given function V 2 [B;B] S + for some B > 0, we have with probability at least 1, j( e P m s;a P m s;a )Vj ~ O r SV(P m s;a ;V ) N + m (s;a) + SB N + m (s;a) + B P;m 64 for anym 1. 229 Proof. Note that with probability at least 1, j( e P m s;a P m s;a )Vj =j( e P m s;a P m s;a )(VP m s;a V )j = ~ O 0 @ X s 0 0 @ s e P m s;a (s 0 ) N + m (s;a) jV (s 0 )P m s;a Vj + B N + m (s;a) 1 A 1 A (Lemma 75) = ~ O 0 @ s S e P m h (VP m s;a V ) 2 N + m (s;a) + SB N + m (s;a) 1 A (Cauchy-Schwarz inequality) = ~ O s SP m h (VP m s;a V ) 2 N + m (s;a) + SB N + m (s;a) +B s S P;m N + m (s;a) ! : Applying AM-GM inequality completes the proof. Lemma78. With probability at least 1,V( P m h ;V m h+1 ) 2V(P m h ;V m h+1 ) + ~ O SB 2 N m h + 2B 2 P;m for anym 1. Proof. Note that: V( P m h ;V m h+1 ) P m h (V m h+1 P m h V m h+1 ) 2 ( P i p i x i P i p i = argmin z P i p i (x i z) 2 ) =V(P m h ;V m h+1 ) + ( P m h P m h )(V m h+1 P m h V m h+1 ) 2 V(P m h ;V m h+1 ) + ( P m h e P m h )(V m h+1 P m h V m h+1 ) 2 +B 2 P;m V(P m h ;V m h+1 ) + ~ O 0 @ B s S e P m h (V m h+1 P m h V m h+1 ) 2 N m h + SB 2 N m h 1 A +B 2 P;m (Lemma 75 and Cauchy-Schwarz inequality) V(P m h ;V m h+1 ) + ~ O B s SV(P m h ;V m h+1 ) N m h +B 2 s S P;m N m h + SB 2 N m h ! +B 2 P;m 2V(P m h ;V m h+1 ) + ~ O SB 2 N m h + 2B 2 P;m : (AM-GM inequality) 230 Lemma 79. Given an oblivious set of value functionsV withjVj (2HK) 6 andkVk 1 B for any V 2V, we have with probability at least 1, for anyV 2V, (s;a)2 , andm 1,j( P m s;a e P m s;a )Vj r V(P m s;a ;V )m N + m (s;a) + 17Bm N + m (s;a) + B P;m 64 andj( P m s;a e P m s;a )Vj r 2V( P m s;a ;V )m N + m (s;a) + 3B p Sm N + m (s;a) . Proof. For each (s;a)2 andV 2V, by Lemma 124, with probability at least 1 2SA(2HK) 6 , for any m 1 j( P m s;a e P m s;a )Vj 1 N + m (s;a) 0 B @ v u u t Nm(s;a) X i=1 V(P m i s;a ;V ) m +B m 1 C A: (E.3) Denote bym i the interval where thei-th visits to (s;a) lies in among thoseN m (s;a) visits, we have 1 N + m (s;a) Nm(s;a) X i=1 V(P m i s;a ;V ) = 1 N + m (s;a) Nm(s;a) X i=1 P m i s;a (VP m i s;a V ) 2 1 N + m (s;a) Nm(s;a) X i=1 P m i s;a (VP m s;a V ) 2 V(P m s;a ;V ) +B 2 P;m ; where the second last inequality is by P i p i x i P i p i = argmin z P i p i (x i z) 2 . Thus by Eq. (E.3), j( P m s;a e P m s;a )Vj s V(P m s;a ;V ) m N + m (s;a) + B m N + m (s;a) +B s P;m m N + m (s;a) s V(P m s;a ;V ) m N + m (s;a) + 17B m N + m (s;a) + B P;m 64 : (AM-GM inequality) Moreover, again by P i p i x i P i p i = argmin z P i p i (x i z) 2 , 1 N + m (s;a) Nm(s;a) X i=1 V(P m i s;a ;V ) 1 N + m (s;a) Nm(s;a) X i=1 P m i s;a (V P m s;a V ) 2 V( P m s;a ;V ) + ( e P m s;a P m s;a )(V P m s;a V ) 2 V( P m s;a ;V ) +B s SV( P m s;a ;V ) m N + m (s;a) + SB 2 m N + m (s;a) (Lemma 75 and Cauchy-Schwarz inequality) 2V( P m s;a ;V ) + 2SB 2 m N + m (s;a) : (AM-GM inequality) 231 Thus by Eq. (E.3),j( P m s;a e P m s;a )Vj r 2V( P m s;a ;V )m N + m (s;a) + 3B p Sm N + m (s;a) . Lemma80. 
For any sequence of value functionsfV m h g m;h withkV m h k 1 2 [0;B], we have with probability at least 1, for allM 0 1, M 0 X m=1 Hm X h=1 V(P m h ;V m h+1 ) = ~ O M 0 X m=1 V m Hm+1 (s m Hm+1 ) 2 + M 0 X m=1 Hm X h=1 B(V m h (s m h )P m h V m h+1 ) + +B 2 ! : Proof. We decompose the sum of variance as follows: M 0 X m=1 Hm X h=1 V(P m h ;V m h+1 ) = M 0 X m=1 Hm X h=1 P m h (V m h+1 ) 2 V m h+1 (s m h+1 ) 2 + M 0 X m=1 Hm X h=1 V m h+1 (s m h+1 ) 2 V m h (s m h ) 2 + M 0 X m=1 Hm X h=1 V m h (s m h ) 2 (P m h V m h+1 ) 2 : For the rst term, by Lemma 124 and Lemma 115, with probability at least 1, M 0 X m=1 Hm X h=1 P m h (V m h+1 ) 2 V m h+1 (s m h+1 ) 2 = ~ O 0 @ v u u t M 0 X m=1 Hm X h=1 V(P m h ; (V m h+1 ) 2 ) +B 2 1 A = ~ O 0 @ B v u u t M 0 X m=1 Hm X h=1 V(P m h ;V m h+1 ) +B 2 1 A : The second term is clearly upper bounded by P M 0 m=1 V m Hm+1 (s m Hm+1 ) 2 , and the third term is upper bounded by 2B P M 0 m=1 P Hm h=1 (V m h (s m h )P m h V m h+1 ) + bya 2 b 2 (a +b)(ab) + . Putting everything together and solving a quadratic inequality (Lemma 110) w.r.t. P M 0 m=1 P Hm h=1 V(P m h ;V m h+1 ) completes the proof. 232 Lemma81. ForanyvaluefunctionsfV m h g m;h suchthatkV m h k 1 B,withprobabilityatleast 1,for anyM 0 1, M 0 X m=1 Hm X h=1 b m (s m h ;a m h ;V m h+1 ) = ~ O 0 @ v u u t SAL P;M 0 M 0 X m=1 Hm X h=1 V(P m h ;V m h+1 ) +BS 1:5 AL P;M 0 +B v u u t SAL P;M 0 M 0 X m=1 Hm X h=1 P;m 1 A : Proof. Note that: M 0 X m=1 H X h=1 b m (s m h ;a m h ;V m h+1 ) = ~ O 0 @ M 0 X m=1 Hm X h=1 0 @ s V( P m h ;V m h+1 ) N m h + B p S N m h 1 A 1 A = ~ O 0 @ v u u t SAL P;M 0 M 0 X m=1 Hm X h=1 V( P m h ;V m h+1 ) +BS 1:5 AL P;M 0 1 A (Cauchy-Schwarz inequality and Lemma 82) = ~ O 0 @ v u u t SAL P;M 0 M 0 X m=1 Hm X h=1 V(P m h ;V m h+1 ) +BS 1:5 AL P;M 0 +B v u u t SAL P;M 0 M 0 X m=1 Hm X h=1 P;m 1 A : (Lemma 78, Lemma 82, and p a +b p a + p b) Lemma82. For anyM 0 1, P M 0 m=1 P Hm h=1 1 M m h = ~ O(SAL c;M 0) and P M 0 m=1 P Hm h=1 1 N m h = ~ O(SAL P;M 0). Proof. This simply follows from the fact that the sum of 1 M m h (or 1 N m h ) between consecutive resets ofM m h (orN m h ) is of order ~ O(SA). Lemma83. P M 0 m=1 IfH m <H;s m Hm+1 6=gg = ~ O(SAL M 0) for anyM 0 M. Proof. This simply follows from the fact that between consecutive resets ofM orN, the number of times that the number of visits to some (s;a) is doubled is ~ O(SA). 233 Lemma 84. Suppose r(m) = minf c 1 p m +c 2 ;c 3 g, 2 R N + + is a non-stationarity measure, and dene [i;j] = P j1 i=1 (i). If for a given intervalJ, there is a way to partitionJ into` intervalsfI i g ` i=1 with I i = [s i ;e i ] such that [s i ;e i +1] > r(jI i j + 1) for i ` 1 (note thatjI i j = e i s i + 1), then ` 1 + (2c 1 1 J ) 2=3 jJj 1=3 +c 1 3 J . Proof. Note that J `1 X i=1 [s i ;e i +1] > `1 X i=1 r(jI i j + 1) `1 X i=1 min n c 1 (jI i j + 1) 1=2 ;c 3 o `1 X i=1 min n c 1 2 jI i j 1=2 ;c 3 o = ` 1 X i=1 c 1 2 jI i j 1=2 +` 2 c 3 ; where in the last step we assumejI i j is decreasing ini without loss of generality and` 1 +` 2 =` 1. The inequality above implies` 2 c 1 3 J and ` 1 = ` 1 X i=1 jI i j 1 3 jI i j 1 3 ` 1 X i=1 jI i j 1=2 ! 2 3 ` 1 X i=1 jI i j ! 1 3 2 J c 1 2 3 jJj 1 3 (Hölder’s inequality withp = 3 2 andq = 3) Combining them completes the proof. E.2 OmittedDetailsinSection6.2 In this section we provide omitted proofs and discussions in Section 6.2. E.2.1 OptimalValueChangew.r.t. Non-stationarity Below we provide a bound on the change of optimal value functions w.r.t. cost and transition non- stationarity. Lemma85. 
For anyk 1 ;k 2 2 [K],V ? k 1 (s init )V ? k 2 (s init ) ( c +B ? P )T ? . 234 Proof. Denote by q ? k 2 (s;a) (or q ? k 2 (s)) the number of visits to (s;a) (or s) before reaching g following ? k 2 . By the extended value dierence lemma (Shani et al., 2020, Lemma 1) (note that their result is for nite-horizon MDP, but the nature generalization to SSP holds), we have V ? k 1 (s init )V ? k 2 (s init ) = X s q ? k 2 (s)(V ? k 1 (s)Q ? k 1 (s; ? k 2 (s))) + X s;a q ? k 2 (s;a)(Q ? k 1 (s;a)c k 2 (s;a)P k 2 ;s;a V ? k 1 ) X s;a q ? k 2 (s;a)(c k 1 (s;a)c k 2 (s;a) + (P k 1 ;s;a P k 2 ;s;a )V ? k 1 ) ( c +B ? P )T ? : where in the last inequality we applykc k 1 c k 2 k 1 c , (P k 1 ;s;a P k 2 ;s;a )V ? k 1 max s;a kP k 1 ;s;a P k 2 ;s;a k 1 V ? k 1 1 B ? P ; and P s;a q ? k 2 (s;a)T ? . We also give an example showing that the bound in Lemma 85 is tight up to a multiplication fac- tor. Consider an SSP instance with only one states init and one actiona g , such thatc(s init ;a g ) = B? T? , P (gjs init ;a g ) = 1 T? , andP (s init js init ;a g ) = 1P (gjs init ;a g ) with 1B ? T ? . The optimal value of this in- stance is clearlyB ? . Now consider another SSP instance with perturbed cost functionc 0 (s init ;a g ) = B? T? + c and perturbed transition functionP 0 (gjs init ;a g ) = 1 T? P 2 ,P 0 (s init js init ;a g ) = 1P 0 (gjs init ;a g ) with maxf c ; P g 1 T? . The optimal value function in this instance is B? T? + c 1 T? P 2 = B ? +T ? c 1 T? P 2 (B ? +T ? c )(1 +T ? P ) =B ? + ( c +B ? P )T ? +T 2 ? c P B ? + 2( c +B ? P )T ? ; 235 where in the rst inequality we apply 1 1x 1 + 2x forx2 [0; 1 2 ]. Thus the optimal value dierence between these two SSPs is of the same order of the upper bound in Lemma 85. E.2.2 ProofofTheorem17 For anyB ? ;T ? ;SA;K withB ? 1,T ? 3B ? , andK SA 10, we dene a set of SSP instances fM K i;j g i;j withi;j2f0; 1;:::;Ng andN =SA. The instanceM K i ? ;j ? is constructed as follows: • There areN + 1 statesfs init ;s 1 ;:::;s N g. • Ats init , there areN actionsa 1 ;:::;a N ; ats i fori2 [N] there is only one actiona g . • c(s init ;a i ) = 0 andc(s i ;a g ) Bernoulli( B?+ c;K Ifi6=i ? g T? ) fori2 [N], where c;K = 11=N 4 p NB ? =K. • P (s i js init ;a i ) = 1,P (gjs j ;a g ) = 1+ P;K Ifj=j ? g T? , andP (s j js j ;a g ) = 1P (gjs j ;a g ), where P;K = 11=N 4 p N=K. Note that for anyM K i;j , the expected hitting time is upper bounded byT ? + 1, the expected cost of optimal policy is upper bounded by 2B ? , and the number of state-action pairs is upper bounded by 2N. We then usefM K i;j g i;j to prove static regret lower bounds (note that static regret and dynamic regret are the same without non-stationarity, that is, c = P = 0) based on cost perturbation and transition perturbation respectively, which serve as the cornerstones of the proof of Theorem 17. Theorem 30. For anyB ? ;T ? ;SA;K withB ? 1, T ? 3B ? , K SA 10, and any learner, there exists an SSP instance based on cost perturbation such that the regret of the learner afterK episodes is at least ( p B ? SAK). Proof. Consider a distribution of SSP instances which is uniform overfM K i;0 g i fori2 [N]. LetE i be the expectation w.r.t.M K i;0 ,P i be the distribution of learner’s observations w.r.t.M K i;0 , andK i the number of 236 visits to statei inK episodes. Also let c = c;K . The expected regret over this distribution of SSPs can be lower bounded as E[R K ] = 1 N N X i=1 E i [R K ] 1 N N X i=1 E i [KK i ] c = c K 1 N N X i=1 E i [K i ] ! : Note thatM K 0;0 has no “good” state. 
By Pinsker’s inequality: E i [K i ]E 0 [K i ]KkP i P 0 k 1 K p 2KL(P 0 ;P i ): By the divergence decomposition lemma (Lattimore and Szepesvári, 2020, Lemma 15.1), we have: KL(P 0 ;P i ) =E 0 [K i ]T ? KL(Bernoulli((B ? + c )=T ? ); Bernoulli(B ? =T ? )) E 0 [K i ]T ? 2 c =T 2 ? B? T? (1 B? T? ) 2 2 c B ? E 0 [K i ]: ((Gerchinovitz and Lattimore, 2016, Lemma 6)) Therefore, by Cauchy-Schwarz inequality, N X i=1 E i [K i ] N X i=1 E 0 [K i ] + 2 c K p E 0 [K i ]=B ? K + 2 c K p NK=B ? : Plugging this back and by the denition of c , we obtain E[R K ] c K 1 1 N 2 c r K NB ? ! = (1 1=N) 2 8 p B ? NK = ( p B ? SAK): This completes the proof. Theorem31. For anyB ? ;T ? ;SA;K withB ? 1,T ? 3B ? ,KSA 10, and any learner, there exists an SSP instance based on transition perturbation such that the regret of the learner afterK episodes is at least (B ? p SAK). 237 Proof. Consider a distribution of SSP instances which is uniform overfM K 0;j g j forj2 [N]. LetE j be the expectation w.r.t.M K 0;j ,P j be the distribution of learner’s observations w.r.t.M K 0;j , andK j the number of visits to statej inK episodes. Also let P = P;K . The expected regret over this distribution of SSPs can be lower bounded as E[R K ] = 1 N N X j=1 E j [R K ] 1 N N X j=1 E j [KK j ]B ? 1 1 1 + P B ? P 2 0 @ K 1 N N X j=1 E j [K j ] 1 A : Note thatM K 0;0 has no “good” state. By Pinsker’s inequality: E j [K j ]E 0 [K j ]KkP j P 0 k 1 K q 2KL(P 0 ;P j ): By the divergence decomposition lemma (Lattimore and Szepesvári, 2020, Lemma 15.1), we have: KL(P 0 ;P j ) =E 0 [K j ] KL(Geometric(1=T ? ); Geometric((1 + P )=T ? )) =E 0 [K j ]T ? KL(Bernoulli(1=T ? ); Bernoulli((1 + P )=T ? )) E 0 [K j ]T ? 2 P =T 2 ? 1+ P T? (1 1+ P T? ) 2 2 P E 0 [K j ]: ((Gerchinovitz and Lattimore, 2016, Lemma 6) and P 1 4 ) Therefore, by Cauchy-Schwarz inequality, N X j=1 E j [K j ] N X j=1 E 0 [K j ] + 2 P K q E 0 [K j ] K + 2 P K p NK: 238 Plugging this back and by the denition of P , we obtain E[R K ] B ? P K 2 1 1 N 2 P r K N ! (1 1=N) 2 16 B ? p NK = (B ? p SAK): This completes the proof. Now we are ready to prove Theorem 17. Proof of Theorem 17. We construct a hard non-stationary SSP instance as follows: we divideK episodes intoL = L c +L P epochs. Each of the rstL c epochs has length K 2Lc , and the corresponding SSP is uniformly sampled fromfM K=(2Lc) i;0 g i2[N] independently; each of the last L P epochs has length K 2L P , and the corresponding SSP is uniformly sampled fromfM K=(2L P ) 0;j g j2[N] independently. By Theorem 30 and Theorem 31, the regrets in each of the rstL c epochs and each of the lastL P epochs are of order ( p B ? SAK=L c ) and (B ? p SAK=L P ) respectively. Moreover, the total change in cost and transition functions are upper bounded by cLc T? and 2 P L P T? respectively with c = c; K 2Lc and P = P; K 2L P . Now let cLc T? = c and 2 P L P T? = P , we haveL c = ( 4cT? 11=N ) 2=3 ( K 2NB? ) 1=3 andL P = ( 2 P T? 11=N ) 2=3 ( K 2N ) 1=3 , and the dynamic regret is of order (L c p B ? SAK=L c +L P B ? p SAK=L P ) = ((B ? SAT ? ( c + B 2 ? P )) 1=3 K 2=3 ). E.3 OmittedDetailsinSection3.2.1 Notation Under the protocol of Algorithm 14, for anyk2 [K], denote byM k the number of intervals in the rstk episodes. Clearly,M =M K . The following lemma is a more general version of Lemma 11. Lemma86. For anyK 0 2 [K],R K 0 R M K 0 +B ? . 239 Proof. LetI k be the set of intervals in episodek. Then the regret in episodek satises X m2I k Hm X h=1 c m h V ? k (s k 1 ) = X m2I k Hm X h=1 c m h V ? ;m 1 (s m 1 ) ! + X m2I k V ? ;m 1 (s m 1 )V ? 
k (s k 1 ) X m2I k (C m V ? ;m 1 (s m 1 )) + B ? 2K ; where the last step is by the denition ofc m Hm+1 andV ? ;m 1 (s m 1 )V ? k (s m 1 ) + B? 2K 3 2 B ? by Lemma 118. Summing up overk completes the proof. Lemma87. SupposealgorithmAensures R M 0 = ~ O( 0 + 1 M 0 1=3 + 1 2 M 0 1=2 + 2 M 0 2=3 )foranynumber of intervalsM 0 M with cetain probability. Then with the same probability,M K 0 = ~ O(K 0 + 0 =B ? + ( 1 =B ? ) 3=2 +( 1 2 =B ? ) 2 +( 2 =B ? ) 3 )and R M K 0 = ~ O( 1 K 0 1=3 + 1 2 K 0 1=2 + 2 K 0 2=3 + 3=2 1 =B 1=2 ? + 2 1 2 =B ? + 3 2 =B 2 ? + 0 ) for anyK 0 2 [K]. Proof. Fix aK 0 2 [K]. For anyM 0 M K 0, letC g =fm2 [M 0 ] :s m Hm+1 =gg. Then, R M 0 = X m2Cg (C m V ? ;m 1 (s m 1 )) + X m= 2Cg (C m V ? ;m 1 (s m 1 )) = ~ O 0 + 1 M 0 1=3 + 1 2 M 0 1=2 + 2 M 0 2=3 : (E.4) Note thatV ? ;m 1 (s m 1 ) V ? k(m) (s m 1 ) + B? 2K 3 2 B ? by Lemma 118. Moreover,C m 2B ? whenm = 2C g . Therefore,C m V ? ;m 1 (s m 1 ) 3B? 2 form2C g andC m V ? ;m 1 (s m 1 ) B? 2 form = 2C g . Reorganizing terms and byjC g jK 0 , we get: B ? M 0 2 2B ? K 0 + ~ O 0 + 1 M 0 1=3 + 1 2 M 0 1=2 + 2 M 0 2=3 : 240 Solving a quadratic inequality w.r.t..M 0 , we getM 0 = ~ O(K 0 + 0 =B ? +( 1 =B ? ) 3=2 +( 1 2 =B ? ) 2 +( 2 =B ? ) 3 ). Dene = 0 =B ? + ( 1 =B ? ) 3=2 + ( 1 2 =B ? ) 2 + ( 2 =B ? ) 3 . Plugging the bound onM 0 back to Eq. (E.4), we have R M 0 = ~ O 0 + 1 K 0 1=3 + 1 2 K 0 1=2 + 2 K 0 2=3 + 1 1=3 + 1 2 1=2 + 2 2=3 = ~ O 0 + 1 K 0 1=3 + 1 2 K 0 1=2 + 2 K 0 2=3 + 3=2 1 =B 1=2 ? + 2 1 2 =B ? + 3 2 =B 2 ? +B ? = ~ O 0 + 1 K 0 1=3 + 1 2 K 0 1=2 + 2 K 0 2=3 + 3=2 1 =B 1=2 ? + 2 1 2 =B ? + 3 2 =B 2 ? ; where in the second last step we apply Young’s inequality for product (xyx p =p +y q =q forx 0,y 0, p> 1,q> 1, and 1 p + 1 q = 1). Putting everything together and settingM 0 =M K 0 completes the proof. E.4 OmittedDetailsinSection6.4 Extra Notation LetQ m h , V m h , x m be the value ofQ h , V h , andx at the beginning of intervalm, and Q m H+1 (s;a) =V m H+1 (s) for any (s;a)2 . E.4.1 ProofofTheorem18 We rst prove two lemmas related to the optimism ofQ m h . Dene the following reference value function: Q m h (s;a) = (b c m (s;a)+ P m s;a V m h+1 b m (s;a; V m h+1 ) x m ) + forh2 [H], where V m h (s) = argmin a Q m h (s;a) forh2 [H], V m H+1 =c f , Q m H+1 (s;a) = V m H+1 (s) for any (s;a)2 , and x m = c;m + 4B ? P;m . Lemma88. With probability at least 1 2, Q m h (s;a)Q ?;m h (s;a) formM. 241 Proof. We prove this by induction onh. The base case ofh = H + 1 is clearly true. Forh H, by Lemma 116, for any (s;a)2 : Q m h (s;a) =b c m (s;a) + P m s;a V m h+1 b m (s;a; V m h+1 ) x m b c m (s;a) + P m s;a V ?;m h+1 b m (s;a;V ?;m h+1 ) x m (by the induction step) =b c m (s;a) + e P m s;a V ?;m h+1 + ( P m s;a e P m s;a )V ?;m h+1 b m (s;a;V ?;m h+1 ) x m (i) b c m (s;a) + e P m s;a V ?;m h+1 x m (ii) c m (s;a) +P m s;a V ?;m h+1 =Q ?;m h (s;a); where in (i) we apply Lemma 109 withjfV ?;m h g m;h jHK+1 to obtain ( P m s;a e P m s;a )V ?;m h+1 b m (s;a;V ?;m h+1 ) 0; in (ii) we apply Lemma 76, Lemma 73, and the denition of x m . Lemma89. With probability at least 1 2,Q m h (s;a)Q ?;m h (s;a) + ( c;m + 4B ? P;m )(Hh + 1) andx m maxf 1 mH ; 2( c;m + 4B ? P;m )g. Proof. The second statement simply follows from Lemma 88,Q ?;m h (s;a) Q ? ;m h (s;a) 4B ? = B=4 by Lemma 73, and the computing procedure of x m . We now prove Q m h (s;a) Q m h (s;a) + ( c;m + 4B ? P;m )(Hh+1) by induction onh, and the rst statement simply follows from Q m h (s;a)Q ?;m h (s;a) (Lemma 88). The statement is clearly true for h = H + 1. 
For h H, by the induction step and V m h+1 1 B=4 from the update rule, we haveV m h+1 (s) minfB=4; V m h+1 (s)+( c;m +4B ? P;m )(H h)g V m h+1 (s) +y m h+1 B for anys2S + , wherey m h = minfB=4; ( c;m + 4B ? P;m )(Hh + 1)g. Thus, P m s;a V m h+1 b m (s;a;V m h+1 )x m P m s;a ( V m h+1 +y m h+1 )b m (s;a; V m h+1 +y m h+1 ) (Lemma 116 andx m 0) P m s;a V m h+1 b m (s;a; V m h+1 ) x m + ( c;m + 4B m ? P;m )(Hh + 1); 242 where in the last inequality we apply denition of x m andb m (s;a; V m h+1 +y m h+1 ) =b m (s;a; V m h+1 ) since constant oset does not change the variance. Then,Q m h (s;a) Q m h (s;a)+( c;m +4B ? P;m )(Hh+1) by the update rule ofQ m h and the denition of Q m h . We are now ready to prove the main theorem, from which Theorem 18 is a simple corollary. Theorem32. Algorithm 15 ensures with probability at least 1 22, for anyM 0 M, R M 0 = ~ O p B ? SAL c;M 0M 0 +B ? p SAL P;M 0M 0 +B ? SAL c;M 0 +B ? S 2 AL P;M 0 + ~ O M 0 X m=1 ( c;m +B ? P;m )H ! : Proof. Note that with probability at least 1 2: R M 0 M 0 X m=1 Hm X h=1 c m h +c m Hm+1 V ?;m 1 (s m 1 ) ! (V ?;m 1 (s m 1 )V ? ;m 1 (s m 1 )) M 0 X m=1 Hm X h=1 c m h +c m Hm+1 V m 1 (s m 1 ) ! + M 0 X m=1 ( c;m + 4B ? P;m )H (Lemma 89) M 0 X m=1 Hm X h=1 c m h +V m h+1 (s m h+1 )V m h (s m h ) + M 0 X m=1 ( c;m + 4B ? P;m )H + ~ O (B ? SAL M 0) (c m Hm+1 = ~ O(B ? ) and Lemma 83) M 0 X m=1 Hm X h=1 (c m h b c m h ) + (V m h+1 (s m h+1 )P m h V m h+1 ) + (P m h P m h )V m h+1 +b m h + 2 M 0 X m=1 ( c;m + 4B ? P;m )H + ~ O (B ? SAL M 0); 243 where the last step is by the denitions ofV m h (s m h ),x m maxf 1 mH ; 2( c;m + 4B ? P;m )g (Lemma 89), maxfa;bg a+b 2 , and P M 0 m=1 P Hm h=1 1 mH = ~ O(1). Now we bound the rst three sums separately. For the rst term, with probability at least 1 4, M 0 X m=1 Hm X h=1 (c m h b c m h ) = M 0 X m=1 Hm X h=1 (c m h c m (s m h ;a m h )) + M 0 X m=1 Hm X h=1 (c m (s m h ;a m h )b c m h ) ~ O p C M 0 + p SAL c;M 0C M 0 +SAL c;M 0 + 2 M 0 X m=1 c;m H: (Lemma 124 and Lemma 74) For the second term, by Lemma 124, with probability at least 1, M 0 X m=1 Hm X h=1 (V m h+1 (s m h+1 )P m h V m h+1 ) = ~ O 0 @ v u u t M 0 X m=1 Hm X h=1 V(P m h ;V m h+1 ) +B ? 1 A = ~ O 0 @ v u u t M 0 X m=1 Hm X h=1 V(P m h ;V ?;m h+1 ) + v u u t M 0 X m=1 Hm X h=1 V(P m h ;V ?;m h+1 V m h+1 ) +B ? 1 A ; (Var[X +Y ] 2(Var[X] +Var[Y ]) and p a +b p a + p b) 244 which is dominated by the upper bound of the third term below. For the third term, by P m h V m h+1 e P m h V m h+1 + 4B ? ( P;m +n m h ), with probability at least 1 2, M 0 X m=1 Hm X h=1 (P m h P m h )V m h+1 M 0 X m=1 Hm X h=1 ( e P m h P m h )V m h+1 + M 0 X m=1 Hm X h=1 4B ? ( P;m +n m h ) M 0 X m=1 Hm X h=1 ( e P m h P m h )V ?;m h+1 + ( e P m h P m h )(V m h+1 V ?;m h+1 ) + 4B ? n m h + M 0 X m=1 4B ? P;m H = ~ O 0 @ M 0 X m=1 Hm X h=1 0 @ s V(P m h ;V ?;m h+1 ) N m h + SB ? N m h + s SV(P m h ;V m h+1 V ?;m h+1 ) N m h 1 A + M 0 X m=1 B ? P;m H 1 A (n m h 1 N m h , Lemma 109 withjfV ?;m h+1 g m;h jHK + 1, and Lemma 77) = ~ O 0 @ v u u t SAL P;M 0 M 0 X m=1 Hm X h=1 V(P m h ;V ?;m h+1 ) + v u u t S 2 AL P;M 0 M 0 X m=1 Hm X h=1 V(P m h ;V ?;m h+1 V m h+1 ) 1 A + ~ O B ? S 2 AL P;M 0 + M 0 X m=1 B ? P;m H ! : (Cauchy-Schwarz inequality and Lemma 82) Moreover, by Lemma 81, with probability at least 1, M 0 X m=1 Hm X h=1 b m h = ~ O 0 @ v u u t SAL P;M 0 M 0 X m=1 Hm X h=1 V(P m h ;V m h+1 ) +B ? S 1:5 AL P;M 0 +B ? 
v u u t SAHL P;M 0 M 0 X m=1 P;m 1 A = ~ O 0 @ v u u t SAL P;M 0 M 0 X m=1 Hm X h=1 V(P m h ;V ?;m h+1 ) + v u u t SAL P;M 0 M 0 X m=1 Hm X h=1 V(P m h ;V m h+1 V ?;m h+1 ) 1 A + ~ O B ? S 1:5 AL P;M 0 + M 0 X m=1 B ? P;m H ! : (Var[X +Y ] 2Var[X] + 2Var[Y ], p a +b p a + p b, and AM-GM inequality) 245 which is dominated by the upper bound of the third term above. Putting everything together, we have with probability at least 1 11, R M 0 = ~ O 0 @ p SAL c;M 0C M 0 +B ? SAL c;M 0 + v u u t SAL P;M 0 M 0 X m=1 Hm X h=1 V(P m h ;V ?;m h+1 ) +B ? S 2 AL P;M 0 1 A + ~ O 0 @ v u u t S 2 AL P;M 0 M 0 X m=1 Hm X h=1 V(P m h ;V ?;m h+1 V m h+1 ) + M 0 X m=1 ( c;m +B ? P;m )H 1 A = ~ O p SAL c;M 0C M 0 + p B ? SAL P;M 0C M 0 +B ? SAL c;M 0 +B ? S 2 AL P;M 0 + ~ O M 0 X m=1 ( c;m +B ? P;m )H ! : (Lemma 90, Lemma 91 and AM-GM inequality) Note that R M 0 = P M 0 m=1 (C m V ? ;m 1 (s m 1 )) C M 0 4B ? M 0 (Lemma 73). Reorganizing terms and solving a quadratic inequality (Lemma 110) w.r.t.C M 0 givesC M 0 = ~ O(B ? M 0 ) ignoring lower order terms. Plugging this back completes the proof. Proof of Theorem 18. Note that by by Line 3 and Line 4 of Algorithm 15, we haveL c d M 0 Wc e,L P d M 0 W P e, and the number of intervals between consecutive resets ofM (orN) are upper bounded byW c (orW P ), which gives M 0 X m=1 ( c;m +B ? P;m )H M 0 X m=1 ( c;f c (m) +B ? P;f P (m) )H (W c c +B ? W P P )H Applying Theorem 32 completes the proof. E.4.2 ProofofTheorem19 We rst show that Algorithm 16 ensures an anytime regret bound in M. 246 Theorem33. With probability at least 1 22, Algorithm 16 ensures for anyM 0 M, R M 0 = ~ O (B ? SAT max c ) 1=3 M 0 2=3 +B ? (SAT max P ) 1=3 M 0 2=3 + ~ O (B ? SAT max c ) 2=3 M 0 1=3 +B ? (S 2:5 AT max P ) 2=3 M 0 1=3 + ( c +B ? P )T max : Proof. It suces to prove the desired inequality forM 0 2f2 n 1g n2N + . SupposeM 0 = 2 N 1 for some N 1. By the doubling scheme, we run Algorithm 15 on intervals [2 n1 ; 2 n 1] forn = 1;:::;N, and the regret on intervals [2 n1 ; 2 n 1] is of order ~ O((B ? SAT max c ) 1=3 (2 n1 ) 2=3 +B ? (SAT max P ) 1=3 (2 n1 ) 2=3 + (B ? SAT max c ) 2=3 (2 n1 ) 1=3 +B ? (S 2:5 AT max c ) 2=3 (2 n1 ) 1=3 + ( c +B ? P )T max ) by Theorem 18 and the choice ofW c andW P . Summing overn completes the proof. Proof of Theorem 19. By Lemma 87 and Theorem 33 with 0 = ( c +B ? P )T max , 1 = (B ? SAT max c ) 2=3 + B ? (S 2:5 AT max P ) 2=3 , 1 2 = 0, and 2 = (B ? SAT max c ) 1=3 +B ? (SAT max P ) 1=3 , we have 3=2 1 =B 1=2 ? = ~ O(B 1=2 ? SAT max c +B ? S 2:5 AT max P ), 3 2 =B 2 ? = ~ O(SAT max c =B 2 ? +B ? SAT max P ), and thus for anyK 0 2 [K], R M K 0 = ~ O (B ? SAT max c ) 1=3 K 0 2=3 +B ? (SAT max P ) 1=3 K 0 2=3 + (B ? SAT max c ) 2=3 K 0 1=3 + ~ O B ? (S 2:5 AT max P ) 2=3 K 0 1=3 +B 1=2 ? SAT max c +B ? S 2:5 AT max P : Then by Lemma 86, we obtain the same bound as R M K 0 forR K 0. E.4.3 AuxiliaryLemmas Lemma 90. With probability at least 1 2, P M 0 m=1 P Hm h=1 V(P m h ;V ?;m h+1 ) = ~ O B ? C M 0 +B 2 ? for any M 0 M. 247 Proof. Applying Lemma 80 with V ?;m h 1 4B ? (Lemma 73), with probability at least 1 2, M 0 X m=1 Hm X h=1 V(P m h ;V ?;m h+1 ) = ~ O M 0 X m=1 V ?;m Hm+1 (s m Hm+1 ) 2 + M 0 X m=1 Hm X h=1 B ? (V ?;m h (s m h )P m h V ?;m h+1 ) + +B 2 ? ! = ~ O B ? C M 0 +B 2 ? ; where in the last step we apply (V ?;m h (s m h )P m h V ?;m h+1 ) + (Q ?;m h (s m h ;a m h )P m h V ?;m h+1 ) + c m (s m h ;a m h ); and P M 0 m=1 P Hm h=1 c m (s m h ;a m h ) = ~ O( P M 0 m=1 P Hm h=1 c m h ) by Lemma 126. Lemma91. 
With probability at least 1 9, for anyM 0 M, M 0 X m=1 Hm X h=1 V(P m h ;V ?;m h+1 V m h+1 ) = ~ O B ? p B ? SAL P;M 0C M 0 +B ? p SAL c;M 0C M 0 +B 2 ? S 2 AL P;M 0 + ~ O B 2 ? SAL c;M 0 + M 0 X m=1 B ? ( c;m +B ? P;m )H ! : Proof. Letz m h = minfB=4; ( c;m + 4B ? P;m )HgIfhHg. By Lemma 89 andkV m h k 1 B=4, we have V ?;m h (s) +z m h V m h (s) for alls2S + . Moreover, by Lemma 83, M 0 X m=1 (V ?;m Hm+1 (s m Hm+1 ) +z m Hm+1 V m Hm+1 (s m Hm+1 )) 2 M 0 X m=1 (z m Hm+1 ) 2 Ifs m Hm+1 =gg + ~ O B 2 ? M 0 X m=1 IfH m <H;s m Hm+1 6=gg ! = 4B ? M 0 X m=1 ( c;m + 4B ? P;m )H + ~ O B 2 ? SAL M 0 : 248 Also note that () = M 0 X m=1 B ? Hm X h=1 (V ?;m h (s m h )V m h (s m h )P m h V ?;m h+1 +P m h V m h+1 +z m h z m h+1 ) + M 0 X m=1 B ? Hm X h=1 c m (s m h ;a m h ) + e P m h V m h+1 V m h (s m h ) + 4B ? n m h + + 2 M 0 X m=1 B ? ( c;m + 4B ? P;m )H (V ?;m h (s m h )Q ?;m h (s m h ;a m h ),z m h z m h+1 , andP m h V m h+1 e P m h V m h+1 + 4B ? (n m h + P;m )) M 0 X m=1 B ? Hm X h=1 (c m (s m h ;a m h )b c m h + ( e P m h P m h )V ?;m h+1 + ( e P m h P m h )(V m h+1 V ?;m h+1 ) +b m h ) + + 3 M 0 X m=1 B ? ( c;m + 4B ? P;m )H + 4B 2 ? M 0 X m=1 Hm X h=1 n m h + ~ O (B ? ); where the last step is by the denitions ofV m h (s m h ),x m maxf 1 mH ; 2( c;m + 4B ? P;m )g (Lemma 89), maxfa;bg a+b 2 , and P M 0 m=1 P Hm h=1 1 mH = ~ O(1). Now by Lemma 74, Lemma 109, Lemma 77, and n m h 1 N m h , we continue with () = ~ O 0 @ B ? 0 @ p SAL c;M 0C M 0 +SAL c;M 0 + M 0 X m=1 Hm X h=1 0 @ s V(P m h ;V ?;m h+1 ) N m h + s SV(P m h ;V m h+1 V ?;m h+1 ) N m h 1 A 1 A 1 A + ~ O M 0 X m=1 Hm X h=1 B 2 ? S N m h + M 0 X m=1 B ? ( c;m +B ? P;m )H +B ? M 0 X m=1 Hm X h=1 b m h ! = ~ O 0 @ B ? p SAL c;M 0C M 0 +B ? SAL c;M 0 +B ? v u u t SAL P;M 0 M 0 X m=1 Hm X h=1 V(P m h ;V ?;m h+1 ) +B 2 ? S 2 AL P;M 0 1 A + ~ O 0 @ B ? v u u t S 2 AL P;M 0 M 0 X m=1 Hm X h=1 V(P m h ;V ?;m h+1 V m h+1 ) + M 0 X m=1 B ? ( c;m +B ? P;m )H 1 A ; 249 where in the last step we apply Cauchy-Schwarz inequality, Lemma 82, Lemma 81, Var[X + Y ] 2Var[X] + 2Var[Y ], and AM-GM inequality. Finally, by Lemma 90, we continue with () = ~ O B ? p SAL c;M 0C M 0 +B ? SAL c;M 0 +B ? p B ? SAL P;M 0C M 0 +B 2 ? S 2 AL P;M 0 + ~ O 0 @ B ? v u u t S 2 AL P;M 0 M 0 X m=1 Hm X h=1 V(P m h ;V ?;m h+1 V m h+1 ) + M 0 X m=1 B ? ( c;m +B ? P;m )H 1 A : Applying Lemma 80 on value functionsfV ?;m h +z m h V m h g m;h (constant oset does not change the variance) and plugging in the bounds above, we have M 0 X m=1 Hm X h=1 V(P m h ;V ?;m h+1 V m h+1 ) = M 0 X m=1 Hm X h=1 V(P m h ;V ?;m h+1 +z m h V m h+1 ) = ~ O B ? p SAL c;M 0C M 0 +B 2 ? SAL c;M 0 +B ? p B ? SAL P;M 0C M 0 +B 2 ? S 2 AL P;M 0 + ~ O 0 @ B ? v u u t S 2 AL P;M 0 M 0 X m=1 Hm X h=1 V(P m h ;V ?;m h+1 V m h+1 ) + M 0 X m=1 B ? ( c;m +B ? P;m )H 1 A : Then solving a quadratic inequality w.r.t. P M 0 m=1 P Hm h=1 V(P m h ;V ?;m h+1 V m h+1 ) (Lemma 110) completes the proof. E.4.4 MinimaxOptimalBoundinFinite-HorizonMDP Here we give a high level arguments on why Algorithm 15 implies a minimax optimal dynamic regret bound in the nite-horizon setting. To adapt Algorithm 15 to the non-homogeneous nite-horizon setting, we maintain empirical cost and transition functions for each layerh2 [H] and letc f (s) = 0. Following 250 similar arguments and substitutingB ? 
,T max by horizonH, Theorem 18 implies (ignoring lower order terms) R M 0 = ~ O p SAH 2 =W c M 0 + p SAH 3 =W P M 0 + ( c W c +H P W P )H = ~ O H(SA c ) 1=3 M 0 2=3 + (SAH 5 P ) 1=3 M 0 2=3 ; where the extra p H dependency in the rst two terms comes from estimating the cost and transition func- tions of each layer independently, and we setW c = (SA) 1=3 (M 0 = c ) 2=3 ,W P = (SA=H) 1=3 (M 0 = P ) 2=3 . Note that the lower bound construction in (Mao et al., 2020) only make use of non-stationary transition. The lower bound they prove is ((SA) 1=3 (HT ) 2=3 ) (their Theorem 5), which actually matches our upper bound ~ O((SAH 5 P ) 1=3 M 0 2=3 ) for non-stationary transition sinceT = M 0 H and = H P by their denition of non-stationarity. It is also straightforward to show that the lower bound for non-stationary cost matches our upper bound following similar arguments in proving Theorem 17. E.5 OmittedDetailsinSection6.5 Notation Denote by c m and P m the values of c and P at the beginning of intervalm respectively, that is, c m =g c ( c m ) and P m =g P ( P m ), whereg c (m) = minf c 1 p m ; 1 2 8 H g andg P (m) = minf c 2 p m ; 1 2 8 H g. Denote by c m the value of c at the beginning of intervalm and dene c m h = c(s m h ;a m h ). Dene Q ? ;m h and V ? ;m h as the action-value function and value function w.r.t. costc m + 8 m , transitionP m , and policy ? k(m) ; andC [i;j] = P j m=i P Hm h=1 C m . Let Q ?;m h and V ?;m h be the optimal value functions w.r.t. cost function c m + 8 m and transition functionP m . It is not hard to see that they can be dened recursively as follows: V ?;m H+1 =c f and forhH, Q ?;m h (s;a) =c m (s;a) + 8 m +P m s;a V ?;m h+1 ; V ?;m h (s) = min a Q ?;m h (s;a): 251 For notational convenience, dene Q m H+1 (s;a) = V m H+1 (s), Q ? ;m H+1 (s;a) = V ? ;m H+1 (s), and Q ?;m H+1 (s;a) = V ?;m H+1 (s) for any (s;a)2 ; letL c =L c;[1;K] andL P =L P;[1;K] . ProofSketchofTheorem20 We give a high level idea on the analysis of the main theorem and also point out the key technical challenges. We decompose the regret as follows: R K = K X m=1 (C m V m 1 (s m 1 )) + K X m=1 ( V m 1 (s m 1 ) V ? ;m 1 (s m 1 )) + 8T ? K X m=1 m . K X m=1 H X h=1 c m h b c m h + V m h+1 (s m h+1 ) P m h V m h+1 +b m h 8 m (denition of V m h (s m h )) + K X m=1 ( V m 1 (s m 1 ) V ? ;m 1 (s m 1 )) + 8T ? K X m=1 m : We bound the three terms above separately. For the second term, we rst show that V m 1 (s m 1 ) V ? ;m 1 (s m 1 ) ( c;m +B P;m )T ? , where c;m = c;[i c m ;m] , P;m = P;[i P m ;m] are the accumulated cost and transition non-stationarity since the last reset respectively. Although proving such a bound is straightforward when V m h is indeed a value function (similar to Lemma 85), it is non-trivial under the UCBVI update rule as the bonus termb depends on the next-step value function and can not be simply treated as part of the cost function. A key step here is to make use of the monotonic property (Lemma 116) of the bonus function; see Lemma 93 for more details. Now by the periodic resets of cost and transition counters (Line 4 and Line 5), the number of intervals between consecutive resets of cost and transition estimation is upper bounded by W c andW P respectively. Thus, K X m=1 ( c;m +B P;m )T ? K X m=1 ( c;f c (m) +B P;f P (m) )T ? (W c c +BW P P )T ? = ~ O (B ? SAT ? c ) 1=3 K 2=3 +B ? (SAT ? P ) 1=3 K 2=3 + ( c +B ? P )T ? : where the last step is simply by the chosen values ofW c andW P . 252 For the third term, we have: T ? K X m=1 m T ? K X m=1 c 1 p c m + Bc 2 p P m ! = ~ O T ? 
c 1 Lc X i=1 p M c i +B ? c 2 L P X i=1 q M P i !! = ~ O T ? (c 1 p L c K +B ? c 2 p L P K) = ~ O p B ? SAL c K +B ? p SAL P K ; whereM c i (orM P i ) is the number of intervals between thei-th and (i + 1)-th reset of cost (or transition) estimation, and the second last step is by Cauchy-Schwarz inequality. Finally we bound the rst term, simply byTest1 andTest2, we have (only keeping the dominating terms) K X m=1 H X h=1 c m h b c m h + V m h+1 (s m h+1 ) P m h V m h+1 +b m h 8 m = Lc X i=1 X m2I c i Hm X h=1 (c m h b c m h ) + L P X i=1 X m2I P i Hm X h=1 ( V m h+1 (s m h+1 ) P m h V m h+1 ) + M 0 X m=1 Hm X h=1 (b m h 8 m ) . M 0 X m=1 Hm X h=1 0 @ s c m h M m h + s V( P m h ; V m h+1 ) N m h 1 A = ~ O p B ? SAL c K +B ? p SAL P K : wherefI c i g Lc i=1 (orfI P i g L P i=1 ) is a partition ofK episodes such thatM (orN) is reseted in the last interval of eachI c i (orI P i ) fori<L c (ori<L P ) and the last interval ofI c Lc (orI P L P ) isK, and in the second last step we apply the denition of c m (Lemma 95) and P m (Lemma 96). Note that the regret of non-stationarity along the learner’s trajectory is cancelled out by the negative correction term8 m . Now it suces to boundL c andL P . It can be shown that the reset rules of the non-stationarity tests guarantee that L c = ~ O (K=W c +B ? K=W P ); L P = ~ O (K=W P +K=(B ? W c )): Details are deferred to Lemma 97. Putting everything together completes the proof. Next, we present three lemmas related to the optimism and magnitude (Test 3) of estimated value function. 253 Lemma92. Withprobabilityatleast 12,forallmK, Q m h (s;a) Q ?;m h (s;a)+( c;m +B P;m )(H h + 1). Proof. We prove this by induction onh. The base case ofh = H + 1 is clearly true. Forh H, by Test 3 and the induction step, we have V m h+1 (s) minfB=2; V ?;m h+1 (s) + ( c;m +B P;m )(Hh)g V ?;m h+1 (s) +x m h+1 B wherex m h = minfB=2; ( c;m +B P;m )(Hh + 1)g and V ?;m h (s) V ? ;m h (s) B 4 + 8H m B 3 . Thus, with probability at least 1 2, c m (s;a) + P m s;a V m h+1 b m (s;a; V m h+1 ) c m (s;a) + P m s;a ( V ?;m h+1 +x m h+1 )b m (s;a; V ?;m h+1 +x m h+1 ) (Lemma 116) (i) c m (s;a) + e P m s;a ( V ?;m h+1 +x m h+1 ) (Lemma 109) c m (s;a) + 8 m + c;m +P m s;a ( V ?;m h+1 +x m h+1 ) +B P;m (Lemma 76) Q ?;m h (s;a) + ( c;m +B P;m )(Hh + 1): Note that in (i) we use the fact thatjf V ?;m h +x m h g m;h j (HK +1) 6 sincejf(c m ;P m )g m jK,jf c m g m j K,jf P m g m jK,jf c;m g m jK + 1, andjf P;m g m jK + 1 ( c;m = P;m = 0 whenm is not the rst interval of some episode). Lemma 93. With probability at least 1 2, for all m K, Q m h (s;a) Q ? ;m h (s;a) + ( c;m + B P;m )T ? ;m h (s;a). Proof. We prove this by induction onh. The base case ofh = H + 1 is clearly true. Forh H, by Test3 and the induction step, we have V m h+1 (s) minfB=2; V ? ;m h+1 (s) + ( c;m +B P;m )T ? ;m h+1 (s)g 254 V ? ;m h+1 (s) +x m h+1 (s) B wherex m h (s) = minfB=2; ( c;m +B P;m )T ? ;m h (s)g and V ? ;m h (s) B 4 + 8 m T ? ;m h (s) B 4 + 8H m B 3 . Thus, with probability at least 1 2, c m (s;a) + P m s;a V m h+1 b m (s;a; V m h+1 ) c m (s;a) + P m s;a ( V ? ;m h+1 +x m h+1 )b m (s;a; V ? ;m h+1 +x m h+1 ) (Lemma 116) (i) c m (s;a) + e P m s;a ( V ? ;m h+1 +x m h+1 ) (Lemma 109) c m (s;a) + 8 m + c;m +P m s;a ( V ? ;m h+1 +x m h+1 ) +B P;m (Lemma 76) Q ? ;m h (s;a) + ( c;m +B P;m )T ? ;m h (s;a): Note that in (i) we use the fact thatjf V ? ;m h +x m h g m;h j (HK + 1) 6 sincejfV ? 
;m h g m;h j HK + 1, jf c m g m jK,jf P m g m jK,jf c;m g m jK + 1,jf P;m g m jK + 1 ( c;m = P;m = 0 whenm is not the rst interval of some episode), andjfT ? ;m h g m;h jHK + 1. Lemma94. Withprobabilityatleast 12,forallmK,if c;m c m and P;m P m ,then Q m h (s;a) Q ? ;m h (s;a) + m T ? ;m h (s;a)B=2. Moreover,ifTest3failsinintervalm,then c;[i c m ;m+1] >g c ( c m + 1) or P;[i P m ;m+1] >g P ( P m + 1). 255 Proof. First note that Q ? ;m h (s;a) B 4 +8 m T ? ;m h (s;a) B 4 +8H m B 3 . We prove the rst statement by induction onh. The base case ofh =H + 1 is clearly true. ForhH, note that: c m (s;a) + P m s;a V m h+1 b m (s;a; V m h+1 ) c m (s;a) + P m s;a ( V ? ;m h+1 + m T ? ;m h+1 )b m (s;a; V ? ;m h+1 + m T ? ;m h+1 ) (induction step and Lemma 116) (i) c m (s;a) + e P m s;a ( V ? ;m h+1 + m T ? ;m h+1 ) (Lemma 109) c m (s;a) + 8 m + c m +P m s;a ( V ? ;m h+1 + m T ? ;m h+1 ) + P m (B=3 +H m ) (Lemma 76, c;m c m , and P;m P m ) Q ? ;m h (s;a) + m T ? ;m h (s;a): (H m B=12) Note that in (i) we use the fact thatjf V ? ;m h + m T ? ;m h g m;h j (HK +1) 6 sincejfV ? ;m h g m;h jHK +1, jf c m g m j K,f P m g m K, andjfT ? ;m h g m;h j HK + 1. The second statement is simply by the contraposition of the rst statement. The next two lemmas are aboutTest1 andTest2. Lemma95. With probability at least 1 4, for anyM 0 K, if c;M 0 c M 0 , then M 0 X m=i c M 0 Hm X h=1 (c m h b c m h ) ~ O 0 @ q C [i c M 0 ;M 0 ] + M 0 X m=i c M 0 Hm X h=1 s c m h M m h + 1 M m h ! 1 A + M 0 X m=i c M 0 Hm X h=1 c m , c M 0: Moreover, ifTest1 fails in intervalM 0 , then c;M 0 > c M 0 . 256 Proof. Note that for any givenM 0 M, without loss of generality, we can oset the intervals and assume i c M 0 = 1. Then with probability at least 1 4, for anyM 0 K, assumingi c M 0 = 1 we have M 0 X m=1 Hm X h=1 (c m h b c m h ) = M 0 X m=1 Hm X h=1 (c m h c m (s m h ;a m h )) + M 0 X m=1 Hm X h=1 (c m (s m h ;a m h )b c m h ) ~ O p C M 0 + M 0 X m=1 Hm X h=1 (c m (s m h ;a m h )b c m h ) (Lemma 124 and Lemma 126) ~ O p C M 0 + M 0 X m=1 Hm X h=1 s c m h M m h + 1 M m h !! + M 0 X m=1 Hm X h=1 c m : (Lemma 74, and c;m c;M 0 c M 0 c m ) The rst statement is then proved by notingi c M 0 = 1. The second statement is simply by the contraposition of the rst statement. Lemma96. With probability at least 1 16, for anyM 0 K, if c;[i P M 0 ;M 0 ] c M 0 , minf B 1:5 ? c 1 q P M 0 ; 1 2 8 H g and P;M 0 P M 0 , then M 0 X m=i P M 0 Hm X h=1 V m h+1 (s m h+1 ) P m h V m h+1 ~ O 0 B @ v u u u t M 0 X m=i P M 0 Hm X h=1 V( P m h ; V m h+1 ) + M 0 X m=i P M 0 Hm X h=1 s V( P m h ; V m h+1 ) N m h 1 C A + ~ O q SA(B ? +L c;[i P M 0 ;M 0 ] )C [i P M 0 ;M 0 ] + q B ? SA P M 0 +B 2:5 ? S 2 AHL c;[i P M 0 ;M 0 ] + 4 M 0 X m=i P M 0 Hm X h=1 m , P M 0: Moreover, ifTest2 fails in intervalM 0 , then c;[i P M 0 ;M 0 ] > c M 0 or P;M 0 > P M 0 . 257 Proof. For any M 0 K, without loss of generality, we can oset the intervals and assume i P M 0 = 1. Moreover, for anymM 0 , we have P;m P;M 0 P M 0 P m . Thus, with probability at least 1 2, M 0 X m=1 Hm X h=1 V m h+1 (s m h+1 ) P m h V m h+1 M 0 X m=1 Hm X h=1 ( V m h+1 (s m h+1 )P m h V m h+1 ) + M 0 X m=1 Hm X h=1 ( e P m h P m h ) V m h+1 + M 0 X m=1 Hm X h=1 B( P m +n m h ) (P m h V m h+1 e P m h V m h+1 +B( P;m +n m h ) and P;m P m ) ~ O 0 @ v u u t M 0 X m=1 Hm X h=1 V(P m h ; V m h+1 ) +B ? 
SA 1 A + M 0 X m=1 Hm X h=1 ( e P m h P m h ) V m h+1 + M 0 X m=1 Hm X h=1 B P m (Lemma 124 and P M 0 m=1 P Hm h=1 n m h P M 0 m=1 P Hm h=1 1 N m h SA byL P;M 0 = 1) ~ O 0 @ v u u t M 0 X m=1 Hm X h=1 V( P m h ; V m h+1 ) +B ? SA 1 A + M 0 X m=1 Hm X h=1 ( e P m h P m h ) V m h+1 + M 0 X m=1 Hm X h=1 2B P m ; where the last inequality is by V(P m h ; V m h+1 )P m h ( V m h+1 P m h V m h+1 ) 2 e P m h ( V m h+1 P m h V m h+1 ) 2 +B 2 ( P;m +n m h ) ( P i p i x i P i p i = argmin z P i p i (x i z) 2 ) 2V( P m h ; V m h+1 ) + ~ O SB 2 N m h +B 2 P m ; ( e P m h (s 0 ) 2 P m h (s 0 ) + 1 N m h by Lemma 126,n m h 1 N m h , and P;m P m ) 258 Lemma 82,L P;M 0 = 1, and AM-GM inequality. Now note that with probability at least 1 3, M 0 X m=1 Hm X h=1 ( e P m h P m h ) V m h+1 = M 0 X m=1 Hm X h=1 ( e P m h P m h ) V ?;m h+1 + ( e P m h P m h )( V m h+1 V ?;m h+1 ) ~ O 0 @ M 0 X m=1 Hm X h=1 0 @ s V( P m h ; V ?;m h+1 ) N m h + SB ? N m h 1 A + v u u t S 2 A M 0 X m=1 Hm X h=1 V(P m h ; V m h+1 V ?;m h+1 ) 1 A + M 0 X m=1 Hm X h=1 B P m 32 (Lemma 109, Lemma 77, Cauchy-Schwarz inequality, Lemma 82, and P;m P m ) ~ O 0 @ M 0 X m=1 Hm X h=1 0 @ s V( P m h ; V m h+1 ) N m h + SB ? N m h 1 A + v u u t S 2 A M 0 X m=1 Hm X h=1 V(P m h ; V m h+1 V ?;m h+1 ) 1 A + M 0 X m=1 Hm X h=1 B P m 16 ; where in the last step we apply M 0 X m=1 Hm X h=1 s V( P m h ; V ?;m h+1 ) N m h M 0 X m=1 Hm X h=1 0 @ s V( P m h ; V m h+1 ) N m h + s V( P m h ; V m h+1 V ?;m h+1 ) N m h 1 A by p Var[X +Y ] p Var[X] + p Var[Y ] (Cohen et al., 2021, Lemma E.3) and M 0 X m=1 Hm X h=1 s V( P m h ; V m h+1 V ?;m h+1 ) N m h M 0 X m=1 Hm X h=1 s P m h (( V m h+1 V ?;m h+1 )P m h ( V m h+1 V ?;m h+1 )) 2 N m h ( P i p i x i P i p i = argmin z P i p i (x i z) 2 ) M 0 X m=1 Hm X h=1 s 2 e P m h (( V m h+1 V ?;m h+1 )P m h ( V m h+1 V ?;m h+1 )) 2 N m h + ~ O M 0 X m=1 Hm X h=1 B p S N m h ! ( P m h (s 0 ) 2 e P m h (s 0 ) + ~ O 1 N m h by Lemma 126) M 0 X m=1 Hm X h=1 s 2V(P m h ; V m h+1 V ?;m h+1 ) N m h + ~ O M 0 X m=1 Hm X h=1 B p S N m h + M 0 X m=1 Hm X h=1 B s P;m N m h ! ~ O 0 @ v u u t SA M 0 X m=1 Hm X h=1 V(P m h ; V m h+1 V ?;m h+1 ) + M 0 X m=1 Hm X h=1 B p S N m h 1 A + M 0 X m=1 Hm X h=1 B P m 32 : (Cauchy-Schwarz inequality, Lemma 82,L P;M 0 = 1, AM-GM inequality, and P;m P m ) 259 Now by Lemma 99,L P;M 0 = 1, and AM-GM inequality, we have with probability 1 10, v u u t S 2 A M 0 X m=1 Hm X h=1 V(P m h ; V m h+1 V ?;m h+1 ) ~ O p SAL c;M 0C M 0 + p B ? SA(C M 0 +M 0 ) + ~ O 0 @ B ? S 2 A +B ? S 1:5 AL c;M 0 + v u u t B ? S 2 A M 0 X m=1 ( c;m +B ? P;m )H 1 A : Moreover, byi c m i P m and c m P m due to the reset rules, we have c;m c;[i P M 0 ;m] c;[i P M 0 ;M 0 ] c M 0 c m B 1:5 ? minf c 1 p P m ; 1 2 8 H g B 1:5 ? minf c 1 p c m ; 1 2 8 H g B 1:5 ? c m . Therefore, by P;m P m and AM-GM inequality, v u u t B ? S 2 A M 0 X m=1 Hm X h=1 ( c;m +B ? P;m ) v u u t B 2:5 ? S 2 AH M 0 X m=1 ( c m +B ? P m )B 2:5 ? S 2 AH + M 0 X m=1 m : Plugging these back, and by Lemma 82,L P;M 0 = 1, we obtain M 0 X m=1 Hm X h=1 ( e P m h P m h ) V m h+1 ~ O 0 @ M 0 X m=1 Hm X h=1 0 @ s V( P m h ; V m h+1 ) N m h 1 A + p B ? SA(C M 0 +M 0 ) 1 A + ~ O p SAL c;M 0C M 0 +B ? S 1:5 AL c;M 0 +B 2:5 ? S 2 AH + 2 M 0 X m=1 Hm X h=1 m : Plugging this back and notingi P M 0 = 1 completes the proof of the rst statement. The second statement is simply by the contraposition of the rst statement. 260 E.5.1 ProofofTheorem20 Proof. Bys m 1 =s init , we decompose the regret as follows, with probability at least 1 2, R K = K X m=1 Hm X h=1 c m h +c m Hm+1 V ? 
;m 1 (s m 1 ) ! = K X m=1 Hm X h=1 c m h +c m Hm+1 V m 1 (s m 1 ) ! + K X m=1 V m 1 (s m 1 ) V ? ;m 1 (s m 1 ) + 8T ? K X m=1 m K X m=1 Hm X h=1 c m h +c m Hm+1 V m 1 (s m 1 ) ! + K X m=1 ( c;m +B P;m )T ? + 8T ? K X m=1 m (Lemma 93) We rst bound the rst and the third term above separately. For the third term, we have: T ? K X m=1 m T ? K X m=1 c 1 p c m + Bc 2 p P m ! = ~ O T ? c 1 Lc X i=1 p M c i +B ? c 2 L P X i=1 q M P i !! ( P j i=1 1 p i =O( p j)) = ~ O T ? (c 1 p L c K +B ? c 2 p L P K) = ~ O p B ? SAL c K +B ? p SAL P K ; whereM c i (orM P i ) is the number of intervals between thei-th and (i + 1)-th reset of cost (or transition) estimation, and the second last step is by Cauchy-Schwarz inequality. For the rst term, denefI c i g Lc i=1 (or fI P i g L P i=1 ) as a partition ofK episodes such thatM (orN) is reset in the last interval of eachI c i (orI P i ) for 261 i<L c (ori<L P ) and the last interval ofI c Lc (orI P L P ) isK. Also letL =L c +L P . Then with probability at least 1 20, K X m=1 Hm X h=1 c m h +c m Hm+1 V m 1 (s m 1 ) ! K X m=1 Hm X h=1 c m h + V m h+1 (s m h+1 ) V m h (s m h ) + ~ O (B ? SAL) (Lemma 83) K X m=1 Hm X h=1 c m h b c m h + V m h+1 (s m h+1 ) P m h V m h+1 +b m h 8 m + ~ O (B ? SAL) (denition of V m h (s m h )) = Lc X i=1 X m2I c i Hm X h=1 (c m h b c m h ) + L P X i=1 X m2I P i Hm X h=1 ( V m h+1 (s m h+1 ) P m h V m h+1 ) + K X m=1 Hm X h=1 (b m h 8 m ) + ~ O (B ? SAL) = ~ O 0 @ p L c C K + K X m=1 Hm X h=1 s c m h M m h + 1 M m h ! + v u u t L P K X m=1 Hm X h=1 V( P m h ; V m h+1 ) + K X m=1 Hm X h=1 b m h 1 A + ~ O B 2:5 ? S 2 AHL c + p B ? SAL P (C K +K) + p SAL c C K +HL c +B ? HL P ; (Test1 (Lemma 95),Test2 (Lemma 96), and Cauchy-Schwarz inequality) where ~ O(HL c +B ? HL P ) is upper bound of the costs in intervals whereTest1 fails orTest2 fails. By Lemma 74 and AM-GM inequality, with probability at least 1 3, K X m=1 Hm X h=1 s c m h M m h + 1 M m h ! = ~ O SAHL c + p SAL c C K + K X m=1 c;m : 262 Following the proof of Lemma 81, we have q L P P K m=1 P Hm h=1 V( P m h ; V m h+1 ) is dominated by the upper bound of P M 0 m=1 P Hm h=1 b m h . Thus with probability at least 1, v u u t L P K X m=1 Hm X h=1 V( P m h ; V m h+1 ) + K X m=1 Hm X h=1 b m h = ~ O 0 @ v u u t SAL P K X m=1 Hm X h=1 V(P m h ; V m h+1 ) +B ? S 1:5 AL P +B ? v u u t SAL P K X m=1 Hm X h=1 P;m 1 A = ~ O p B ? SAL P (C K +K) + p SAL c C K +B ? S 1:5 AHL + K X m=1 ( c;m +B ? P;m ); where in the last inequality we apply AM-GM inequality onB ? q SAL P P K m=1 P Hm h=1 P;m , and note that with probability at least 1 11, v u u t SAL P K X m=1 Hm X h=1 V(P m h ; V m h+1 ) = ~ O 0 @ v u u t SAL P K X m=1 Hm X h=1 V(P m h ; V ?;m h+1 ) + v u u t SAL P K X m=1 Hm X h=1 V(P m h ; V m h+1 V ?;m h+1 ) 1 A (Var[X +Y ] 2Var[X] + 2Var[Y ] and p a +b p a + p b) = ~ O p B ? SAL P (C K +K) + p SAL c C K +B ? S 1:5 AHL + M 0 X m=1 ( c;m +B ? P;m ): (Lemma 98, Lemma 99, and AM-GM inequality) Putting everything together, we have R K = ~ O p SA(L c +B ? L P )(C K +B ? K) +B 2:5 ? S 2 AHL + K X m=1 ( c;m +B ? P;m )T ? ! : 263 Now by R K C K 4B ? K, solving a quadratic inequality (Lemma 110) w.r.t.C K and plugging the bound onC K back, we obtain R K = ~ O p B ? SAL c K +B ? p SAL P K +B 2:5 ? S 2 AHL + K X m=1 ( c;m +B ? P;m )T ? ! : It suces to bound the last term above. By the periodic resets ofM andN (Line 4 and Line 5 of Algorithm 17), the number of intervals between consecutive resets of M and N are upper bounded by W c and W P respectively. Thus, K X m=1 ( c;m +B ? P;m )T ? 
K X m=1 ( c;f c (m) +B ? P;f P (m) )T ? (W c c +B ? W P P )T ? = ~ O (B ? SAT ? c ) 1=3 K 2=3 +B ? (SAT ? P ) 1=3 K 2=3 + ( c +B ? P )T ? ; where the last step is simply by the chosen values ofW c andW P . Plugging this back and applying Lemma 97 completes the proof. Lemma97. With probability at least 1 2, Algorithm 17 withp = 1=B ? ensures L c = ~ O (B ? SA) 1=3 (T ? c ) 2=3 K 1=3 +B ? (SA) 1=3 (T ? P ) 2=3 K 1=3 +H( c +B ? P ) ; L P = ~ O (B ? SA) 1=3 (T ? c ) 2=3 K 1=3 =B ? + (SA) 1=3 (T ? P ) 2=3 K 1=3 +H( c + P ) : Proof. We consider the number of resets ofM andN from each test separately. By Lemma 95 and Lemma 84, there are at most ~ O((c 1 1 c ) 2=3 K 1=3 +H c ) resets ofM triggered byTest1. By Lemma 96 and Lemma 84, there are at most ~ O(((B 1:5 ? c 1 1 c ) 2=3 + (c 1 2 P ) 2=3 )K 1=3 +H( c + P )) resets ofM andN triggered byTest2. 264 Next, we consider Test 3. DeneI c m = If c;[i c m ;m+1] > g c ( c m + 1)g andI P m = If P;[i P m ;m+1] > g P ( P m + 1)g. Note that whenever Test 3 fails in intervalm, we haveI c m = 1 orI P m = 1 by Lemma 94. We partitionK intervals into segmentsI 1 ;:::;I Nc , such that in the last interval of eachI i withi<N c denoted bym, Test 3 fails andI c m = 1. Since c is reset whenever Test 3 fails, we have I i [fm+1g [i c m ;m+1] >g c ( c m + 1)g c (jI i j + 1). By Lemma 84, we obtainN c = ~ O((c 1 1 c ) 2=3 K 1=3 +H c ). Now deneA m as the indicator that Test 3 fails in intervalm andI P m = 1. Also deneA 0 m as the indicator thatTest3 fails andN is reset in intervalm, andI P m = 1. We then partitionK intervals into segmentsI 0 1 ;:::;I 0 N P , such that in the last interval of eachI 0 i withi<N P denoted bym,A 0 m = 1. Since P is reset in intervalm whenA 0 m = 1, we have I 0 i [fm+1g [i P m ;m+1] >g P ( P m + 1)g P (jI 0 i j + 1). By Lemma 84, we haveN P = ~ O((c 1 2 P ) 2=3 K 1=3 +H P ). Moreover, by Lemma 126 and the reset rule of Test3, we havep P m A m = ~ O( P m A 0 m ) with probability at least 1, which gives P m A m = ~ O(N P =p). SinceI c m = 1 orI P m = 1 when Test 3 fails in intervalm, the total number of times that Test 3 fails N 3 N c + P m A m = ~ O((c 1 1 c ) 2=3 K 1=3 +B ? (c 1 2 P ) 2=3 K 1=3 +H( c +B ? P )). Now by the reset rule of Test3, the number of timesM is reset due toTest3 is upper bounded byN 3 , and the number of timesN is reset due toTest3 is upper bounded by ~ O(pN 3 ) with probability at least 1 by Lemma 126. Finally, by Line 4 and Line 5 of Algorithm 17, there are at most K Wc resets of M and K W P resets of N respectively due to periodic restarts. Putting all cases together, we have L c = ~ O (c 1 1 c ) 2=3 K 1=3 +B ? (c 1 2 P ) 2=3 )K 1=3 +H( c +B ? P ) +K=W c = ~ O (B ? SA) 1=3 (T ? c ) 2=3 K 1=3 +B ? (SA) 1=3 (T ? P ) 2=3 K 1=3 +H( c +B ? P ) ; 265 and L P = ~ O 1 B ? (c 1 1 c ) 2=3 K 1=3 + (c 1 2 P ) 2=3 K 1=3 +H( c + P ) +K=W P = ~ O (B ? SA) 1=3 (T ? c ) 2=3 K 1=3 B ? + (SA) 1=3 (T ? P ) 2=3 K 1=3 +H( c + P ) ! : This completes the proof. E.5.2 AuxiliaryLemmas Lemma98. Withprobabilityatleast 1,foranyM 0 K, P M 0 m=1 P Hm h=1 V(P m h ; V ?;m h+1 ) = ~ O B ? C M 0 +B ? M 0 +B 2 ? . Proof. Applying Lemma 80 with V ?;m h 1 B, with probability at least 1, M 0 X m=1 Hm X h=1 V(P m h ; V ?;m h+1 ) = ~ O M 0 X m=1 V ?;m Hm+1 (s m Hm+1 ) 2 + M 0 X m=1 Hm X h=1 B ? ( V ?;m h (s m h )P m h V ?;m h+1 ) + +B 2 ? ! = ~ O B ? C M 0 +B ? M 0 +B 2 ? ; where in the last step we apply ( V ?;m h (s m h )P m h V ?;m h+1 ) + ( Q ?;m h (s m h ;a m h )P m h V ?;m h+1 ) + c m (s m h ;a m h ) + 8 m c m (s m h ;a m h ) + 1=H; and also Lemma 126. 
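Before stating the next lemma, it may help to see in code the reset bookkeeping whose counts L_c and L_P are bounded in Lemma 97 above. The sketch below is only a schematic under assumed names: it records the periodic restarts every W_c and W_P intervals (Line 4 and Line 5 of Algorithm 17) together with the restarts triggered by failed tests, and it models the Test-3 reset of the transition counters as a coin flip with bias p, which is an assumption made for illustration; everything the algorithm recomputes after a reset is omitted.

```python
# A schematic of the reset schedule whose reset counts L_c and L_P are bounded
# in Lemma 97; names and the exact reset rules are illustrative assumptions
# (e.g., the Test-3 reset of N is modeled as a coin flip with bias p).
import random

class ResetScheduler:
    def __init__(self, W_c, W_P, p):
        self.W_c, self.W_P, self.p = W_c, W_P, p
        self.since_c_reset = 0      # intervals since the cost counters M were reset
        self.since_P_reset = 0      # intervals since the transition counters N were reset
        self.L_c, self.L_P = 1, 1   # one plus the number of resets, as defined in the text

    def end_of_interval(self, test1_failed, test2_failed, test3_failed):
        """Call once per interval with the outcomes of the three tests; returns
        which of the two counter families should be reset for the next interval."""
        self.since_c_reset += 1
        self.since_P_reset += 1
        reset_c = test1_failed or test2_failed or test3_failed
        reset_P = test2_failed or (test3_failed and random.random() < self.p)
        # Periodic restarts every W_c (resp. W_P) intervals.
        reset_c = reset_c or self.since_c_reset >= self.W_c
        reset_P = reset_P or self.since_P_reset >= self.W_P
        if reset_c:
            self.since_c_reset = 0
            self.L_c += 1
        if reset_P:
            self.since_P_reset = 0
            self.L_P += 1
        return reset_c, reset_P
```

With this bookkeeping in place, Lemma 97 is simply an upper bound on the final values of L_c and L_P accumulated over the K episodes.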
Lemma 99. With probability at least 1 10, for anyM 0 K, P M 0 m=1 P Hm h=1 V(P m h ; V ?;m h+1 V m h+1 ) = ~ O(B ? p SAL c;M 0C M 0 +B ? p B ? SAL P;M 0(C M 0 +M 0 )+B 2 ? S 2 AL P;M 0 +B 2 ? SAL c;M 0 + P M 0 m=1 B ? ( c;m + B ? P;m )H). 266 Proof. Letz m h = minfB=2; ( c;m +B P;m )HgIfhHg. By Lemma 92, we have V ?;m h (s)+z m h V m h (s). Moreover, by Lemma 83, P M 0 m=1 ( V ?;m Hm+1 (s m Hm+1 ) +z m Hm+1 V m Hm+1 (s m Hm+1 )) 2 is bounded by M 0 X m=1 (z m Hm+1 ) 2 Ifs m Hm+1 =gg + 64B 2 ? M 0 X m=1 IfH m <H;s m Hm+1 6=gg = ~ O B ? M 0 X m=1 ( c;m +B ? P;m )H +B 2 ? SAL M 0 ! : and () = M 0 X m=1 B ? Hm X h=1 ( V ?;m h (s m h ) V m h (s m h )P m h V ?;m h+1 +P m h V m h+1 +z m h z m h+1 ) + M 0 X m=1 B ? Hm X h=1 c m (s m h ;a m h ) + 8 m + e P m h V m h+1 V m h (s m h ) +B( P;m +n m h ) + +B ? M 0 X m=1 (z m 1 z m Hm+1 ) ( V ?;m h (s m h ) Q ?;m h (s m h ;a m h ),z m h z m h+1 , andP m h+1 V m h+1 e P m h+1 V m h+1 +B( P;m +n m h )) M 0 X m=1 B ? Hm X h=1 (c m (s m h ;a m h )b c m h + ( e P m h P m h ) V ?;m h+1 + ( e P m h P m h )( V m h+1 V ?;m h+1 ) +b m h ) + + ~ O M 0 X m=1 B ? ( c;m +B ? P;m )H +B 2 ? M 0 X m=1 Hm X h=1 n m h ! : (denition of V m h (s m h )) Now by Lemma 74, Lemma 109, Lemma 77, andn m h 1 N m h , we continue with () = ~ O 0 @ B ? 0 @ p SAL c;M 0C M 0 +SAL c;M 0 + M 0 X m=1 Hm X h=1 0 @ s V(P m h ; V ?;m h+1 ) N m h + s SV(P m h ; V m h+1 V ?;m h+1 ) N m h 1 A 1 A 1 A + ~ O M 0 X m=1 Hm X h=1 B 2 ? S N m h + M 0 X m=1 B ? ( c;m +B ? P;m )H +B ? M 0 X m=1 Hm X h=1 b m h ! = ~ O 0 @ B ? p SAL c;M 0C M 0 +B ? SAL c;M 0 +B ? v u u t SAL P;M 0 M 0 X m=1 Hm X h=1 V(P m h ; V ?;m h+1 ) +B 2 ? S 2 AL P;M 0 1 A + ~ O 0 @ B ? v u u t S 2 AL P;M 0 M 0 X m=1 Hm X h=1 V(P m h ; V ?;m h+1 V m h+1 ) + M 0 X m=1 B ? ( c;m +B ? P;m )H 1 A ; 267 where in the last step we apply Cauchy-Schwarz inequality, Lemma 82, Lemma 81, Var[X + Y ] 2Var[X] + 2Var[Y ], and AM-GM inequality. Finally, by Lemma 98, we continue with () = ~ O B ? p SAL c;M 0C M 0 +B ? SAL c;M 0 +B ? q B ? SAL P;M 0(C M 0 +M 0 ) +B 2 ? S 2 AL P;M 0 + ~ O 0 @ B ? v u u t S 2 AL P;M 0 M 0 X m=1 Hm X h=1 V(P m h ; V ?;m h+1 V m h+1 ) + M 0 X m=1 B ? ( c;m +B ? P;m )H 1 A : Applying Lemma 80 on value functionsf V ?;m h +z m h V m h g m;h (constant oset does not change the variance) and plugging in the bounds above, we have M 0 X m=1 Hm X h=1 V(P m h ; V ?;m h+1 V m h+1 ) = M 0 X m=1 Hm X h=1 V(P m h ; V ?;m h+1 +z m h V m h+1 ) = ~ O B ? p SAL c;M 0C M 0 +B 2 ? SAL c;M 0 +B ? q B ? SAL P;M 0(C M 0 +M 0 ) +B 2 ? S 2 AL P;M 0 + ~ O 0 @ B ? v u u t S 2 AL P;M 0 M 0 X m=1 Hm X h=1 V(P m h ; V ?;m h+1 V m h+1 ) + M 0 X m=1 B ? ( c;m +B ? P;m )H 1 A : Then solving a quadratic inequality w.r.t. P M 0 m=1 P Hm h=1 V(P m h ; V ?;m h+1 V m h+1 ) (Lemma 110) completes the proof. E.5.3 ProofofTheorem21 We rst prove a general regret guarantee of Algorithm 18, from which Theorem 21 is a direct corollary. Theorem34. SupposeA 1 ensures R K R 1 whens m 1 =s init formK, andA 2 ensuresR K 0R 2 (K 0 ) for anyK 0 K such thatR 2 (k) is sub-linear w.r.t.k. Then Algorithm 18 ensuresR K = ~ O(R 1 ) (ignoring lower order terms). 268 Proof. LetI k be the set of intervals in episodek, andm k i be thei-th interval of episodek (if exists). The regret is decomposed as: R K = K X k=1 2 6 4 H m k 1 X h=1 c m k 1 h +c m k 1 H m k 1 +1 V ? k (s k 1 ) 3 7 5 + K X k=1 2 4 X m2I k nfm k 1 g Hm X h=1 c m h c m k 1 H m k 1 +1 3 5 : Note thatV ? ;m k 1 1 (s m k 1 1 )V ? k (s k 1 ) +B ? =K by Lemma 118. 
Therefore, K X k=1 2 6 4 H m k 1 X h=1 c m k 1 h +c m k 1 H m k 1 +1 V ? k (s k 1 ) 3 7 5 K X k=1 2 6 4 H m k 1 X h=1 c m k 1 h +c m k 1 H m k 1 +1 V ? ;m k 1 1 (s m k 1 1 ) 3 7 5 +B ? R 1 +B ? : For the second term, note thatc m k 1 H m k 1 +1 = 2B ? ifs m k 2 1 exists. DeneK f = P K k=1 IfjI k j > 1g, we have (denes m k 2 1 =g ifm k 2 does not exist) K X k=1 2 4 X m2I k nfm k 1 g Hm X h=1 c m h c m k 1 H m k 1 +1 3 5 K X k=1 0 @ X m2I k nfm k 1 g Hm X h=1 c m h V ? k (s m k 2 1 ) 1 A B ? K f R 2 (K f )B ? K f ; which is a lower order term sinceR 2 (K f ) is sub-linear w.r.t.K f . Putting everything together completes the proof. We are now ready to prove Theorem 21. Proof. We simply apply Theorem 34 withR 1 determined by Theorem 20 andR 2 determined by Theorem 19. 269 AppendixF OmittedDetailsinChapter7 F.1 OmittedDetailsinSection7.2 In this section we provide omitted proofs and discussions in Section 7.2. F.1.1 ProofofTheorem22andTheorem23 It suces to prove Theorem 23 and the second statement in Theorem 22, since Theorem 23 subsumes the rst statement of Theorem 22. We decompose the proof into two cases: 1) minfT z ;Tg<1, and 2) minfT z ;Tg =1, and we prove each case in a separate theorem. F.1.1.1 LowerBoundfor min T z ;T <1 In case there is a nite upper bound on the hitting time of optimal policy, we construct a hard instance adapted from (Mannor and Tsitsiklis, 2004). of Theorem 23. Without loss of generality, we assume S = A l 1 A1 for some l 0. It is clear that l log A S + 1. We construct an MDPM 0 of fullA-ary tree structure: the root node iss 0 , each action at a non-leaf node transits to one of its children with costc min , and we denote the set of leaf nodes byS 0 . Since A 3, we havejS 0 j S 2 . The action space isA s = [A] for alls, and we partition the state-action pairs inS 0 into two parts: 0 =S 0 [1] and =S 0 f2;:::;Ag =f(s 1 ;a 1 );:::; (s N ;a N )g, whereN =jS 0 j(A1) 270 (note that here we index state-action pair instead of state, sos i ,s j withi6=j may refer to the same state inS 0 ). Now deneT 1 = minfT=2;B=c min g. The cost function satisesc(s; 1) = B T 0 fors2S 0 , and c(s i ;a i ) = B T 1 fori2 [N]. The transition function satisesP (gjs; 1) = 1 T 0 + T 1 2T 0 ,P (sjs; 1) = 1P (gjs; 1) fors2S 0 , andP (gjs i ;a i ) = 1 T 1 ,P (s i js i ;a i ) = 1P (gjs i ;a i ) fori2 [N], where = 32 T 1 B . Now we consider a class of alternative MDPsfM i g N i=1 . The only dierence betweenM 0 andM i is that the transition ofM i at (s i ;a i ) satisesP (gjs i ;a i ) = 1 T 1 + andP (s i js i ;a i ) = 1P (gjs i ;a i ), that is, (s i ;a i ) is a “good” state-action pair atM i . Denote byB i ? andT i ? the value ofB ? andT ? inM i respectively. Fori2f0;:::;Ng, we have B 2 B i ? c min l +B 2B by 1 2T 1 andB 2; T 0 2 T i ? l +T 1 T ; andc min is indeed the minimum cost byc min B T 1 . It is not hard to see that ats 0 , the optimal behavior in M 0 is to reach any leaf node and then take action 1 untilg is reached; while inM i fori2f1;:::;Ng, the optimal behavior is to reachs i and then take (s i ;a i ) untilg is reached. Thus,T 0 ? = (T 0 ) andT i ? = (T 1 ) fori2 [N]. Without loss of generality, we consider learning algorithms that output a deterministic policy, which can be represented byb v2S 0 A the unique state-action pair inS 0 reachable by following the output policy starting froms 0 . Dene eventE 1 =fb v2 0 g. Below we x az2 [N]. 
Let b T be the number of times the learner samples (s z ;a z ), andK t be the number of times the agent observes (s z ;a z ;g) among the rstt samples of (s z ;a z ). We introduce event E 2 = max 1tt ? jptK t j" ; where" = q 2p(1p)t ? ln d ,p = 1 T 1 ,d = e 4 , = exp(d 0 2 t ? =(p(1p))) < 1 for somet ? > 0 to be specied later, andd 0 = 128. Also dene eventsE 3 =f b T t ? g andE =E 1 \E 2 \E 3 . For each i2f0;:::;Ng, we denote byP i andE i the probability and expectation w.r.t.M i respectively. 271 Below we introduce two lemmas that characterize the behavior of the learner if it gathers insucient samples on (s z ;a z ). Lemma100. IfP 0 (E 3 ) 7 8 , thenP 0 (E 2 \E 3 ) 3 4 . Proof. Note that inM 0 , the probability of observing (s z ;a z ;g) isp for each sample of (s z ;a z ). Thus, ptK t is a sum of i.i.d. random variables, and the variance of ptK t for t = t ? is t ? p(1p). By Kolmogorov’s inequality, we have P 0 (E 2 ) =P 0 max 1tt ? jptK t j" 1 t ? p(1p) 2p(1p)t ? ln d = 1 1 2 ln d 7 8 : Thus,P 0 (E 2 \E 3 ) =P 0 (E 2 ) +P 0 (E 3 )P 0 (E 2 [E 3 ) 3 4 . Lemma101. IfP 0 (E 3 ) 7 8 andP 0 (E 1 ) 1 2d , thenP z (E 1 ) 2d . Proof. The range of ensures that p 2 1p 2 . By the assumptions of this lemma and Lemma 100, we haveP 0 (E 1 ) 1 1 2d 7 8 byd 1 16 , and thusP 0 (E) 1 2 . LetW be the interaction history of the learner and the generative model, andL j (w) =P j (W =w) forj2f0;:::;Ng. Note that the next-state distribution is identical inM 0 andM z unless (s z ;a z ) is sampled. DeneK =K b T . We have L z (W ) L 0 (W ) = (p +) K (1p) b TK p K (1p) b TK = 1 + p K 1 1p b TK = 1 + p K 1 1p K( 1 p 1) 1 1p b T K p : By 1ue uu 2 foru2 [0; 1 2 ],e u 1u, and 1p 2 , we have 1 1p 1 p 1 exp 1p p 1p 1p 2 !! = exp p exp 2 p(1p) 1 p 1 2 p(1p) : 272 Therefore, conditioned onE, we have L z (W ) L 0 (W ) I E 1 2 p 2 K 1 2 p(1p) K 1 1p b T K p I E 1 2 p 2 p b T +" 1 2 p(1p) p b T +" 1 1p " p I E ; where in the last inequality we applyjp b TKj" byE 2 andE 3 . Then by 1ue 2u foru2 [0; 1 2 ] and p 2 1p 2 , we have L z (W ) L 0 (W ) I E exp 2 2 p 2 (p b T +") + 2 p(1p) (p b T +") + 1p " p exp 2 b T 2 p + b T 2 1p + 2 " p 2 + 2 " p(1p) + " p(1p) !! I E exp 2 1 d 0 ln 1 + 3" p(1p) I E ( b Tt ? , 2 t ? = p(1p) d 0 ln 1 , and for any j 2 [N]. Thus, all -optimal deterministic policies inM 0 have the same behavior as 0 starting from s 0 . On the other hand,V 0 z (s 0 )V ? z (s 0 ) = B( 1 1+T 1 =2 1 1+T 1 ) > . Thus, all-optimal deterministic policies inM z have the same behavior as z starting froms 0 . Therefore, an (;)-correct algorithm should guarantee P 0 (E 1 ) 1 andP z (E 1 )<. WhenP 0 (E 3 ) 7 8 , this leads to a contradiction by Lemma 101 and the choice oft ? . Thus,P 0 (E 3 )< 7 8 . We are now ready to prove the main statement of Theorem 23. Note that inM 0 , an (;)-correct algorithm should guaranteeP 0 ( b T z t ? )< 7 8 for anyz2 [N] by Lemma 102, where b T z is the number of times the learner samples (s z ;a z ). DeneN = P z If b T z t ? g. Clearly, we haveNN andE 0 [N ]< 7N 8 . Moreover,P 0 (N 8N 9 ) 63 64 by Markov’s inequality. This implies that with probability at least 1 64 > 1 2e 4 , we havejfz2 [N] : b T z > t ? gj > N 9 and thus the total number of samples used inM 0 is at least Nt ? 9 . To conclude, there is no (;)-correct algorithm with2 (0; 1 32 ),2 (0; 1 2e 4 ), and sample complexity Nt ? 9 = ( NB 2 T 1 2 ln 1 ) = (minf B? c min ;Tg B 2 ? SA 2 ln 1 ) onM 0 byB ? = (B) and the denition ofT 1 . This completes the proof. Remark6. 
NotethatT isbothaparameteroftheenvironmentandtheknowledgegiventothelearner. In fact,T constrainsthehittingtimeoftheoptimalpolicyinallthepossiblealternativeMDPsfM j g j2[N] ,which 274 aectsthe nallowerbound. Alsonotethatthelowerboundholdsevenifthelearnerhasaccesstoanupper bound ofB ? (which is 2B in the proof above). WhyafasterrateisimpossiblewithTT max ? This result may seem unintuitive because when we have knowledge ofTT max , a nite-horizon reduction with horizon ~ O(T ) ensures that the estimation error shrinks at rateB ? p T max =n (Yin and Wang, 2021, Figure 1), wheren is the number of samples for each state-action pair. Then it seems that it might be possible to obtain a sample complexity of order TmaxB 2 ? SA 2 . However, our lower bound indicates that the sample complexity should scale withT instead ofT max . An intuitive explanation is that even if the estimation error shrinks with rateT max in hindsight, since the learner doesn’t know the exact value ofT max , it can only setn w.r.t.T so that the output policy is -optimal even in the worst case ofT =T max . F.1.1.2 LowerBoundfor min T z ;T =1 Now we show that when there is no nite upper bound onT max , it really takes innite number of samples to learn in the worst scenario. Theorem35 (Second statemnt of Theorem 22). There exist an SSP instance withc min = 0,T max = 1, and B ? = 1inwhichevery (;)-correctalgorithmwith2 (0; 1 2 )and2 (0; 1 16 )hasinnitesamplecomplexity. Proof. Consider an SSPM 0 withS =fs 0 ;s 1 g andA =fa 0 ;a g g. The cost function satisesc(s 0 ;a 0 ) = 0, c(s 0 ;a g ) = 1 2 , andc(s 1 ;a) = 1 for alla. The transition function satisesP (gjs 0 ;a g ) = 1,P (s 0 js 0 ;a 0 ) = 1, andP (gjs 1 ;a) = 1 for alla; see Figure 7.1 (b). Clearlyc min = 0,B ? =T max = 1, andV ? (s 0 ) = 1 2 inM 0 . Without loss of generality, we consider learning algorithm that outputs deterministic policyb and dene eventsE 1 =fb (s 0 ) =a 0 g andE 0 1 =fb (s 0 ) =a g g. If a learning algorithm is (;)-correct with2 (0; 1 8 ) and has sample complexityn2 [2;1) on M 0 , then consider two alternative MDPsM + andM . MDPM + is the same asM 0 except that 275 P (s 1 js 0 ;a 0 ) = 1 n andP (s 0 js 0 ;a 0 ) = 1 1 n . MDPM is the same asM 0 except thatP (gjs 0 ;a 0 ) = 1 n and P (s 0 js 0 ;a 0 ) = 1 1 n . Note that inM + , the optimal proper policy takesa g ats 0 , andV ? (s 0 ) = 1 2 ; while in M , the optimal proper policy takesa 0 ats 0 , andV ? (s 0 ) = 0. LetW be the interaction history between the learner and the generative model, and deneL j (w) =P j (W =w) forj2f0; +;g, whereP j is the probability w.r.t.M j . Also let b T be the number of times the learner samples (s 0 ;a 0 ) before outputtingb , and (w) =IfL 0 (w)> 0g. DeneE 2 =f b Tng,E =E 1 \E 2 andE 0 =E 0 1 \E 2 . For anyj2f+;g, we have L j (W ) L 0 (W ) I E (W ) (W ) = (1 1 n ) b T I E (W ) (W ) (1 1 n ) n I E (W ) (W ) I E (W ) (W ) 4 . Thus, P j (E) =E j [I E (W )]E j [I E (W ) (W )] =E 0 L j (W ) L 0 (W ) I E (W ) (W ) P 0 (E) 4 : By a similar arguments, we also haveP j (E 0 )P 0 (E 0 )=4 forj2f+;g. Now note thatP 0 (E 2 ) 7 8 by the sample complexity of the learner. SinceE[E 0 =E 2 andE\E 0 =?, we haveP 0 (E) 7 16 orP 0 (E 0 ) 7 16 . Combining withP j (E) P 0 (E)=4 andP j (E 0 ) P 0 (E 0 )=4, we have eitherP j (E) 7 64 forj2f+;g, orP j (E 0 ) 7 64 forj2f+;g. In the rst case, inM + , we haveV b (s 0 )V ? (s 0 ) = 1 1 2 = 1 2 with probability at least 7 64 . In the second case, inM , we haveV b (s 0 )V ? (s 0 ) = 1 2 0 = 1 2 with probability at least 7 64 . 
Therefore, for any2 (0; 1 2 ) and2 (0; 1 16 ), there is a contradiction in both cases if the learner is (;)-correct and has nite sample complexity onM 0 . This completes the proof. Remark7. Note that althoughT max = 1 inM 0 , the key of the analysis is thatT max can be arbitrarily large in the alternative MDPs. Indeed, if we have a nite upper boundT ofT max , then the learning algorithm only requires nite number of samples as shown in Theorem 23. F.1.2 ProofofTheorem24 Proof. Without loss of generality, assume thatS = 2((A1) l 1) A2 for somel 0. Consider a family of MDPs fM i;j g i2f0;:::;N 0 g;j2[A1] N with state spaceS =S T [S ? wherejS T j =jS ? j =N +1,N 0 = (A2)(A1) l , 276 g s 0 0 (full (A 1)-ary tree structure) ::: s ::: Action 1 Actioni2f2;:::;A 1g B T 1+"=2 ;tT B T or B T 1+" ;tT s ? 0 ActionA 2B T s ? 1 B? s ? 2 0 (correct action) 0 (wrong action) ::: s ? N Figure F.1: Hard instance in Theorem 24. Each arrow represents a possible transition of a state-action pair, and the value on the side is the expected cost of taking this state-action pair until the transition happens. Valuet represents the expected number of steps needed for the transition to happen. and action spaceA s = [A] for alls. States inS T forms a full (A 1)-ary tree on action subset [A 1] as in Theorem 23 with roots 0 ,T =T=3,T 0 =T=6,B =B T , andc min = 0. It is clear thatN 0 =jj (dened in Theorem 23) in the tree formed byS T . The transition ofM i;j inS T corresponds toM i in Theorem 23. We denoteS ? =fs ? 0 ;:::;s ? N g, and for each state inS T , the remaining unspecied action transits tos ? 0 with cost 0. Consider another set of MDPsfM 0 i g i2f0;:::;N 0 g with state spaceS T . The transition and cost functions ofM 0 i is the same asM i inS T except that its action space is restricted to [A 1]. Theorem 23 implies that there exists constants 1 , 2 , such that any ( 0 ; 0 )-correct algorithm with 0 2 (0; 1 32 ), 0 2 (0; 1 2e 4 ) has sample complexity at leastC( 0 ; 0 ) = 1 B 2 T TSA 02 ln 2 0 onfM 0 i g i (note that in Theorem 23 we only show the sample complexity lower bound inM 0 0 , but it not hard to show a similar bound for otherM 0 i following similar arguments). Now we specify the transition and cost functions inS ? for eachM i;j such that learning fM i;j g i;j is as hard as learningfM 0 i g i . Ats ? 0 , taking any action suers cost 1; taking any action in [A 1] transits tos ? 1 with probability 1 B? and stays ats ? 0 otherwise; taking actionA transits tos 0 with probability 1 2B T and stays ats ? 0 otherwise. Ats ? n forn2 [N], taking any action suers cost 0; taking actionj n (recall thatj2 [A 1] N ) transits tos ? n+1 (denes ? N+1 =g) with probabilityp = minf 1 2T ; 4C(;4) g and stays at 277 s ? n otherwise; taking any other action in [A 1] transits tos ? 0 with probabilityp and stays ats ? n otherwise; taking actionA directly transits tos ? 0 ; see illustration in Figure F.1. Note that anyM i;j has parametersB ? (transiting tos ? 0 from any state and then reachingg throughS ? ) and satisesB ?;T 2 [ B T 2 ; 3B T ] (transiting froms ? 0 tos 0 and then reachingg throughS T ). From now on we x the learner as an (;;T )-correct algorithm with sample complexityC(; 4) 1 onfM i;j g i;j . DeneE 1 as the event that the rstC(; 4) samples drawn by the learner from any (s ? n ;a) withn2 [N] anda2 [A 1] transits tos ? n , and denote by P i;j the probability distribution w.r.t.M i;j . 
By 1 +xe x 1+x forx1 ande x 1 +x, we have for any i;j, P i;j (E 1 ) = (1p) C(;4) e pC(;4) 1p e 2pC(;4) e 2 1 2 : Also deneE 2 as the event that the learner uses at mostC(; 4) 1 samples, andE =E 1 \E 2 . We have P i;j (E 2 ) 1 by the sample complexity of the learner, and thusP i;j (E) 1 3 2 for anyi;j. We rst bound the expected cost of the learner inS ? conditioned onE. Denote byV M the value function of policy inM. Lemma103. Givenanypolicydistribution,thereexistsj ? suchthatE [IfV M i;j ? (s ? 0 ) 2B T g] 1 2 for anyi. Proof. Below we x ani2 [N 0 ]. For any policy andj2 [A 1] N , denex j = Q N n=1 p (j n js ? n ) and y =(Ajs ? 0 ), wherep (ajs ? n ) is the probability that when following policy starting froms ? n , the last 278 action taken before leavings ? n isa. It is not hard to see that in our construction,p is independent ofj. Also denote byV j the value function of policy inM i;j . Note that V j (s ? 0 ) 1 + 1y B ? V j (s ? 1 ) + 1 y 2B T 1y B ? V j (s ? 0 ) = 1 + (1y )(1x j ) B ? V j (s ? 0 ) + 1 y 2B T 1y B ? V j (s ? 0 ) = 1 + 1 y 2B T (1y )x j B ? V j (s ? 0 ): Reorganizing terms givesV j (s ? 0 ) 1 y =(2B T )+(1y )x j =B? . Now ifV j (s ? 0 )< 2B T , then we havey +(1 y ) 2B T x j B? > 1, which givesx j > B? 2B T . LetX be the set ofj2 [A 1] N such thatV j (s ? 0 )< 2B T . By B? 2B T jX j P j x j 1, we havejX j 2B T B? . Denez (j) =Ifj2X g. We have P j z (j) =jX j 2B T B? for any, and thus P j R z (j)()d 2B T B? . Therefore, there existsj ? such that R z (j ? )()d 2B T B?(A1) N , which implies that E [IfV M i;j ? (s ? 0 ) 2B T g] = 1 Z z (j ? )()d 1 2B T B ? (A 1) N 1 2 : The proof is completed by noting that for the pickedj ? , the bound above holds for anyi, since the lower bound onV j (s ? 0 ) we applied above is independent ofi. Now consider another set of MDPsfM 00 i g i2f0;:::;N 0 g with state spaceS T . The transition and cost functions ofM 00 i is the same asM i;j restricted onS T for any j, except that taking action A at any state directly transits tog with cost 2B T . We show that any ( 0 ; 0 )-correct algorithm with 0 2 (0; 1 32 ), 0 2 (0; 1 2e 4 ) has sample complexity at leastC( 0 ; 0 ) onfM 00 i g i Given any policy onM 00 i , deneg as a policy onM 0 i andM 00 i such thatg (ajs)/(ajs) and P A1 a=1 g (ajs) = 1. It is straightforward to see that V g M 0 i (s) =V g M 00 i (s)V M 00 i (s) andV ? M 0 i (s) =V ? M 00 i (s), whereV ? M is the optimal value function inM. Thus, if there exists an algorithmA that is ( 0 ; 0 )-correct with sample complexity less thanC( 0 ; 0 ) onfM 00 i g i , 279 then we can obtain an ( 0 ; 0 )-correct algorithm onfM 0 i g i with sample complexity less thanC( 0 ; 0 ) as follows: executingA onfM 00 i g i to obtain policyb , and then outputg b . This leads to a contradiction to the denition ofC(;), and thus any ( 0 ; 0 )-correct algorithm with 0 2 (0; 1 32 ), 0 2 (0; 1 2e 4 ) has sample complexity at leastC( 0 ; 0 ) onfM 00 i g i . Since we assume that the learner has sample complexity less thanC(; 4) onM i;j , for a xedj 0 , there existsi ? such thatP i ? ;j 0 (E 3 ) > 4, whereE 3 =f9s : V b M 00 i ? (s)V ? M 00 i ? (s) > g (note thatb is computed onM i ? ;j 0 , but we can applyb restricted onS T toM 00 i ?). This also implies thatP i ? ;j (E\E 3 ) = P i ? ;j 0 (E\E 3 ) 5 2 for any j, since the value of P i;j (!) is independent of j when ! 2 E. Dene E 4 =f9s :V b M (s)V ?;T M (s)>g. By Lemma 103, there existsj ? such that P i ? ;j ?(E 4 jE\E 3 )E b P i ?(jE\E 3 ) [IfV b M i ? ;j ? (s ? 
0 ) 2B T g] 1 2 ; since the distribution ofb is independent ofj underE\E 3 ,V ?;T M i ? ;j (s) =V ? M 00 i ? (s) for anyj ands2S T , andV b M i ? ;j (s) V b M 00 i ? (s) fors2S T whenV b M i ? ;j (s ? 0 ) 2B T . Putting everything together, we have P i ? ;j ?(E 4 )P i ? ;j ?(E 4 \E\E 3 ) > , a contradiction. Therefore, there is no (;;T )-correct algorithm with sample complexity less thanC(; 4) onfM i;j g i;j . In other words, for any (;;T )-correct algorithm, there existsM2fM i;j g i;j such that this algorithm has sample complexity at leastC(; 4) onM. This completes the proof. F.2 OmittedDetailsinSection7.3 In this section, we present the omitted proofs of Lemma 12 and Theorem 25. To prove Theorem 25, we rst discuss the guarantee of the nite-horizon algorithm in Appendix F.2.2. Then, we bound the sample complexity of Algorithm 19 in Appendix F.2.3. 280 Algorithm21 LCBVI (H;N;B;c f ;) Input: horizonH, counterN, optimal value function upper boundB, terminal costc f , failure probability , and cost functionc, Dene: P s;a (s 0 ) = N(s;a;s 0 ) N + (s;a) andb(s;a;V ) = max 7 q V( Ps;a;V ) N + (s;a) ; 49B N + (s;a) , where = ln 2SAHn ,n = P s;a N(s;a), andN + (s;a) = maxf1;N(s;a)g. Initialize: b V H+1 =c f . forh =H;:::; 1do 1 b Q h (s;a) = c(s;a) + P s;a b V h+1 b(s;a; b V h+1 ) + . b V h (s) = min a b Q h (s;a). Output: (b ; b V ) withb (s;h) = argmin a b Q h (s;a). F.2.1 ProofofLemma12 Proof. LetV 1;h be the value functionV 1 inM h;c f . For anyn 0, we have V 1;(n+1)H (s) =E " nH X i=1 c(s i ;a i ) +V 1;H (s nH+1 ) s 1 =s # E " nH X i=1 c(s i ;a i ) +c f (s nH+1 ) s 1 =s # =V 1;nH (s): Therefore,V (s) lim n!1 V 1;nH (s)V 1;H (s) and this completes the proof. Note that the rst inequality may be strict. Indeed,V = lim H!1 V 1;H inM H;0 . Consider an improper policy behaving in a loop with zero cost. Then,V = 0 but lim H!1 V 1;H =c f inM H;c f . F.2.2 GuaranteeoftheFinite-HorizonAlgorithmLCBVI In this section, we discuss and prove the guarantee of Algorithm 21. Notation Within this section,H,N,B,c f , are inputs of Algorithm 21, andb , b Q, b V , P s;a ,N,N + ,, andb are dened in Algorithm 21. Value functionV h is w.r.t. MDPM H;c f , and we denote byV ? h ,Q ? h the optimal value function and action-value function, such thatV ? h (s) = argmin V h (s) andQ ? h (s;a) = c(s;a) +P s;a V ? h+1 for (s;a;h)2 [H]. We also deneV H+1 = V ? H+1 = c f for any policy, and 281 B ? H = max h2[H+1] kV ? h k 1 . For any ( s; h)2S [H] and (s;a;h)2 [H], denote byq ;( s; h) (s;a;h) the probability of visiting (s;a) in stageh if the learner starts in state s in stage h and follows policy afterwards. For any value functionV 2R S[H+1] , denekV k 1 = max h2[H+1] kV h k 1 . We rst prove optimism of the estimated value functions. Lemma104. WhenBB ? H ,wehave b Q h (s;a)Q ? h (s;a)forany (s;a;h)2 [H]withprobabilityat least 1. Proof. We prove this by induction. The case ofh =H + 1 is clearly true. ForhH, note that c(s;a) + P s;a b V h+1 b(s;a; b V h+1 )c(s;a) + P s;a V ? h+1 b(s;a;V ? h+1 ) (Lemma 116) =c(s;a) +P s;a V ? h+1 + ( P s;a P s;a )V ? h+1 max 8 < : 7 s V( P s;a ;V ? h+1 ) N + (s;a) ; 49B N + (s;a) 9 = ; c(s;a) +P s;a V ? h+1 + (2 p 2 3) s V( P s;a ;V ? h+1 ) N + (s;a) + (19 24) B N + (s;a) (Lemma 122 and maxfa;bg a+b 2 ) c(s;a) +P s;a V ? h+1 =Q ? h (s;a): The proof is then completed by the denition of b Q. Lemma105. For any state s2S and h2 [H], we have V b h ( s) b V h ( s) X s;a;h q b ;( s; h) (s;a;h) (P s;a P s;a ) b V h+1 +b(s;a; b V h+1 ) : 282 Proof. 
First note that V b h ( s) b V h ( s)P s;b ( s; h) V b h+1 P s;b ( s; h) b V h+1 +b( s;b ( s; h); b V h+1 ) (denition ofb and b V ) =P s;b ( s; h) (V b h+1 b V h+1 ) + (P s;b ( s; h) P s;b ( s; h) ) b V h+1 +b( s;b ( s; h); b V h+1 ) = X s;a;h q b ;( s; h) (s;a;h) (P s;a P s;a ) b V h+1 +b(s;a; b V h+1 ) : (expandP s;b ( s; h) (V b h+1 b V h+1 ) recursively andV b H+1 = b V H+1 ) For the other direction, ( b V h ( s)V b h ( s)) + P s;b ( s; h) b V h+1 P s;b ( s; h) V b h+1 + ((a) + (b) + (ab) + ) P s;b ( s; h) ( b V h+1 V b h+1 ) + + (P s;b ( s; h) P s;b ( s; h) ) b V h+1 ((a +b) + (a) + + (b) + ) X s;a;h q b ;( s; h) (s;a;h) (P s;a P s;a ) b V h+1 : (expandP s;b ( s; h) ( b V h+1 V b h+1 ) + recursively) Combining both directions completes the proof. Remark8. Note that the inequality in Lemma 105 holds even if optimism (Lemma 104) does not hold, which is very important for estimatingB ? . Lemma106. There exists a functionN ? (B 0 ;H 0 ; 0 ; 0 ). B 02 H 0 02 + SB 0 H 0 0 +SH 0 2 such that whenBB ? H and N(s;a) = N N ? (B;H;;) for all s;a for some integer N, we have V b V ? 1 with probability at least 1. 283 Proof. Below we assume thatB B ? H . Fix any state s2S and h2 [H], we writeq b ;( s; h) asq b for simplicity. We have with probability at least 1 4, V b h ( s) b V h ( s) X s;a;h q b (s;a;h) (P s;a P s;a )V ? h+1 + (P s;a P s;a )( b V h+1 V ? h+1 ) +b(s;a; b V h+1 ) (ja +bjjaj +jbj and Lemma 105) . X s;a;h q b (s;a;h) 0 @ s V(P s;a ;V ? h+1 ) N + SB N + s SV(P s;a ; b V h+1 V ? h+1 ) N + s V( P s;a ; b V h+1 ) N 1 A (Lemma 109 and maxfa;bg (a) + + (b) + ) . v u u t H N X s;a;h q b (s;a;h)V(P s;a ;V ? h+1 ) + v u u t SH N X s;a;h q b (s;a;h)V(P s;a ; b V h+1 V ? h+1 ) + SBH N ; where in the last inequality we apply Var[X +Y ] 2(Var[X] +Var[Y ]), Cauchy-Schwarz inequality, Lemma 108, and P s;a;h q b (s;a;h)H. Now note that: X s;a;h q b (s;a;h)V(P s;a ;V ? h+1 ) =E b 2 4 H X h= h V(P s h ;a h ;V ? h+1 ) s h = s 3 5 =E b 2 4 H X h= h P s h ;a h (V ? h+1 ) 2 (P s h ;a h V ? h+1 ) 2 s h = s 3 5 =E b 2 4 H X h= h V ? h+1 (s h+1 ) 2 V ? h (s h ) 2 + H X h= h V ? h (s h ) 2 (P s h ;a h V ? h+1 ) 2 s h = s 3 5 B 2 + 3BE b 2 4 H X h= h Q ? h (s h ;a h )P s h ;a h V ? h+1 + 3 5 (a 2 b 2 (a +b)(ab) + fora;b> 0 andV ? (s h )Q ? h (s h ;a h )) =B 2 + 3BE b 2 4 H X h= h c(s h ;a h ) s h = s 3 5 =B 2 + 3BV b h ( s): 284 Plugging this back and byV(P s;a ; b V h+1 V ? h+1 ) b V h+1 V ? h+1 2 1 , we have with probability at least 1, 0V b h ( s) b V h ( s).B r H N + s BHV b h ( s) N + r SH 2 N b V V ? 1 + SBH N (Lemma 104) .B r H N + s BH(V b h ( s) b V h ( s)) N + r SH 2 N b V V b 1 + SBH N ; where in the last step we apply b V h ( s)B and b V V ? 1 b V V b 1 since b V (s)V ? (s)V b (s) for alls2S by Lemma 104. Solving a quadratic inequality w.r.t.V b h ( s) b V h ( s), we have V b h ( s) b V h ( s).B r H N + r SH 2 N b V V b 1 + SBH N : The inequality above implies that there exist quantityN ? .SH 2 , such that whenNN ? , we have V b h ( s) b V h ( s).B r H N + 1 2 b V V b 1 + SBH N ; for any ( s; h). Taking maximum of the left-hand-side over ( s; h), reorganizing terms and by Lemma 104, we obtain V b V ? 1 V b b V 1 .B r H N + SBH N : (F.1) Now denen ? =N ? +inf n fright-hand-side of Eq. (F.1) whenN =ng. We haven ? . B 2 H 2 + SBH + SH 2 . This implies that whenB B ? H andN(s;a) = N n ? for alls;a, we have V b V ? 1 with probability at least 1 5. The proof is then completed by treatingn ? as a function with inputB,H, , and replace by=5 in the arguments above. 285 Lemma107. 
There exists functions b N(B 0 ;H 0 ; 0 ; 0 ). B 02 SH 0 02 + SB 0 H 0 0 such that whenN(s;a) =N b N(B;H;;) for alls;a for someN and b V 1 B, we have V b b V 1 with probability at least 1. Proof. Below we assume that b V 1 B. For any state xed s2S and h2 [H], we writeq b ;( s; h) asq b for simplicity. Note that with probability at least 1 2, V b h ( s) b V h ( s) X s;a;h q b (s;a;h) (P s;a P s;a ) b V h+1 +b(s;a; b V h+1 ) (Lemma 105) . X s;a;h q b (s;a;h) 0 @ s SV(P s;a ; b V h+1 ) N + s V( P s;a ; b V h+1 ) N + SB N 1 A (Lemma 109 and maxfa;bg (a) + + (b) + ) . X s;a;h q b (s;a;h) 0 @ s SV(P s;a ; b V h+1 ) N + SB N 1 A (Lemma 108) . v u u t SH N X s;a;h q b (s;a;h)V(P s;a ; b V h+1 ) + SBH N : (Cauchy-Schwarz inequality and P s;a;h q b (s;a;h)H) 286 Now note that with probability at least 1, X s;a;h q b (s;a;h)V(P s;a ; b V h+1 ) =E b 2 4 H X h= h V(P s h ;a h ; b V h+1 ) s h = s 3 5 =E b 2 4 H X h= h b V h+1 (s h+1 ) 2 b V h (s h ) 2 + H X h= h b V h (s h ) 2 (P s h ;a h b V h+1 ) 2 s h = s 3 5 B 2 + 3BE b 2 4 H X h= h b Q h (s h ;a h )P s h ;a h b V h+1 + s h = s 3 5 (a 2 b 2 (a +b)(ab) + fora;b> 0 and b V h (s h ) = b Q h (s h ;a h )) B 2 + 3BE b 2 4 H X h= h c(s h ;a h ) + ( P s h ;a h P s h ;a h ) b V h+1 + s h = s 3 5 (denition of b Q h and (a) + (b) + (ab) + ) .B 2 +BV b h ( s) +B v u u t SH N X s;a;h q b (s;a;h)V(P s;a ; b V h+1 ) + SB 2 H N ; where the last step is by (a +b) + (a) + + (b) + , the denition ofV b h ( s) , and E b 2 4 H X h= h ( P s h ;a h P s h ;a h ) b V h+1 + s h = s 3 5 .E b 2 4 H X h= h 0 @ s SV(P s h ;a h ; b V h+1 ) N + SB N 1 A s h = s 3 5 (Lemma 109) = X s;a;h q b (s;a;h) 0 @ s SV(P s;a ; b V h+1 ) N + SB N 1 A v u u t SH N X s;a;h q b (s;a;h)V(P s;a ; b V h+1 ) + SBH N : (Cauchy-Schwarz inequality and P s;a;h q b (s;a;h)H) Solving a quadratic inequality w.r.t. P s;a;h q b (s;a;h)V(P s;a ; b V h+1 ), we have X s;a;h q b (s;a;h)V(P s;a ; b V h+1 ).B 2 +BV b h ( s) + SB 2 H N : 287 Plugging this back, we have V b h ( s) b V h ( s) .B r SH N + s BSHV b h ( s) N + SBH N .B r SH N + s BSHjV b h ( s) b V h ( s)j N + SBH N : ( b V h ( s)B) Again solving a quadratic inequality w.r.t.jV b h ( s) b V h ( s)j and taking maximum over ( s; h) on the left- hand-side, we have V b b V 1 .B r SH N + SBH N : (F.2) Now deneb n = inf n fright-hand-side of Eq. (F.2) whenN =ng. We haveb n. B 2 SH 2 + SBH . This implies that whenN(s;a) =Nb n for alls;a and b V 1 B, we have V b b V 1 with probability at least 1 4. The proof is then completed by treatingb n as a function with inputB,H,, and replace by=4 in the arguments above. Lemma108. For any (s;a)2 andV 2 [B;B] S + for someB > 0, with probability at least 1, we haveV( P s;a ;V ).V(P s;a ;V ) + SB 2 N + (s;a) for all (s;a), whereN + (s;a) = maxf1;N(s;a)g. Proof. Note that V( P s;a ;V ) P s;a (VP s;a V ) 2 ( P i p i x i P i p i = argmin z P i p i (x i z) 2 ) =V(P s;a ;V ) + ( P s;a P s;a )(VP s;a V ) 2 .V(P s;a ;V ) +B s SV(P s;a ;V ) N + (s;a) + SB 2 N + (s;a) (Lemma 109) .V(P s;a ;V ) + SB 2 N + (s;a) : (AM-GM inequality) This completes the proof. 288 Lemma109. Given any value functionV 2 [B;B] S + , with probability at least 1,j(P s;a P s;a )Vj. q SV(Ps;a;V ) N + (s;a) + SB N + (s;a) for any (s;a)2 , whereN + (s;a) = maxf1;N(s;a)g. Proof. For any (s;a)2 , by Lemma 122, with probability at least 1 SA , we have j(P s;a P s;a )Vj X s 0 jP s;a (s 0 ) P s;a (s 0 )jjV (s 0 )P s;a Vj ( P s 0(P s;a (s 0 ) P s;a (s 0 )) = 0) . X s 0 s P s;a (s 0 ) N + (s;a) + 1 N + (s;a) ! V (s 0 )P s;a V . 
s SV(P s;a ;V ) N + (s;a) + SB N + (s;a) ; where the last step is by Cauchy-Schwarz inequality. Taking a union bound over (s;a) completes the proof. F.2.3 ProofofTheorem25 Proof. For each indexi, dene nite-horizon MDPM i =M H i ;c f;i . Also deneV h;i andV ? h;i as value functionV h and optimal value functionV ? h inM i respectively. We rst assume thatT D such that B ?;T <1. In this case, we haveT ? T;s (s) minfT z ;Tg for anys by ? T;s = ? whenTT z T max . Note that when B i 2 [20B ?;T ; 40B ?;T ], by T ? T;s (s) minfT z ;Tg for any s, denition of H i , and Lemma 118, we haveV ? 1;i (s) V ? T;s 1;i (s) V ?;T (s) + 0:6B i 24B i 0:1B i andV ? h;i (s) V ? T;s h;i (s) V ?;T (s) + 0:6B i 0:7B i for anys2S andh2 [H], where applying stationary policy ? T;s inM i means executing ? T;s in each steph2 [H]. This impliesB i V ? ;i 1 . Then according to Line 2 and by Lemma 104, with probability at least 1 i , we have V i 1 1 V ? 1;i 1 0:1B i , V i 1 V ? ;i 1 0:7B i , and the while loop should break (Line 3). Leti ? be the value ofi when the while loop breaks, we thus have B i ? 40B ?;T . Moreover, by Lemma 107 and the denition of N i , with probability at least 1 i ?, we have V i ? 1;i ? (s) (V i ? 1 (s) + 0:1B i ?)Ifs6= gg c f;i ?(s) for any s2S + . Thus by Lemma 12, we have V ? (s) V i ? (s) V i ? 1;i ? (s) V i ? 1 (s) + 0:1B i ? B i ? for any s2S. This gives B i ? B ? . If T < T z B i ? c min , then H i ? . minf B i ? c min ;Tg = minfT z ;Tg. Otherwise, T T z , B ?;T = B ? , and 289 H i ? . minf B i ? c min ;Tg . minfT z ;Tg by B i ? . B ?;T . Therefore, H i ? . minfT z ;Tg. By Lemma 107, V i ? 1 1 0:1B i ?, V i ? 1 0:7B i ? (breaking condition of the while loop), and the denition ofN i , we have with probability at least 1 i ?, V ? 1;i ? 1 V i ? 1;i ? 1 V i ? 1 1 + V i ? 1;i ?V i ? 1 1 0:2B i ? and V ? ;i ? 1 V i ? ;i ? 1 V i ? 1 + V i ? ;i ? V i ? 1 0:8B i ?. Therefore, by Lemma 106 and the denition ofN ? i ?, we have with probability at least 1 i ?, V b ;i ?V ? ;i ? 1 2 . Moreover, by Lemma 12 andV b 1;i ?(s) (V ? 1;i ?(s)+ 2 )Ifs6=gg (0:2B i ? + 1 2 )Ifs6=ggc f;i ?(s) for alls2S + sinceB i ? 2, we haveV b (s)V b 1;i ?(s) for alls. Thus,V b (s)V b 1;i ?(s)V ? 1;i ?(s) + 2 V ?;T (s) + 0:6B i ? 24B i ? + 2 V ?;T (s) + by the denition ofH i ? and Lemma 118 for anys2S. Finally, by the denition ofN i andN ? i , the total number of samples spent is of order ~ O (SA(N i ? +N ? i ?)) = ~ O H i ?B 2 i ?SA 2 + H i ?B i ?S 2 A +H 2 i ?S 2 A = ~ O minfT z ;Tg B 2 ?;T SA 2 + minfT z ;Tg B ?;T S 2 A + minfT z ;Tg 2 S 2 A ! : Moreover, the bound above holds with probability at least 1 since 20 P i i . Now we consider the caseT <D. From the arguments above we know thatB i 40B ?;T 40T for allii ? ifTD. Thus, we can conclude thatT <D ifB i > 40T for somei still in the while loop, and the total number samples used is of order ~ O(SAN i ) = ~ O(S 2 AT ) by the denition ofN i . This completes the proof. 290 AppendixG AuxiliaryLemmas Lemma110. Ifx (a p x+b) ln p (cx)forsomea;b;c> 0andabsoluteconstantp 0,thenx = ~ O(a 2 +b). Specically,xa p x +b impliesx (a + p b) 2 2a 2 + 2b. Lemma111. For anya;b2 [0; 1] andk2N + , we have:a k b k k(ab) + . Proof. a k b k = (ab)( P k i=1 a i1 b ki ) (ab) + P k i=1 1 =k(ab) + . Lemma 112. (Zhang, Ji, and Du (2020, Lemma 11)) Let 1 ; 2 ; 4 0; 3 1 andi 0 = log 2 ( 1 ). Let a 1 ;a 2 ;:::;a i 0 benon-negativerealssuchthata i 1 anda i 2 p a i+1 + 2 i+1 3 + 4 forany 1ii 0 . Then,a 1 maxf( 2 + p 2 2 + 4 ) 2 ; 2 p 8 3 + 4 g. Lemma113. Assumev t :S + ! 
[0;B] is monotonic int (i.e.,v t (s) is non-increasing or non-decreasing int for anys2S + ). Then, for any state sequencefs t g n t=1 ;n2N + , we have:j P n t=1 v t+1 (s t )v t (s t )jSB. Proof. n X t=1 v t+1 (s t )v t (s t ) X s2S + n X t=1 (v t+1 (s)v t (s))Ifs t =sg X s2S + n X t=1 v t+1 (s)v t (s) X s2S + jv n+1 (s)v 1 (s)jSB: (v t (s) is monotonic int) 291 Lemma114. (Cohenetal.(2021,LemmaC.3))ForanytworandomvariablesX;Y withVar[X]<1;Var[Y ]< 1. We have: p Var[X] p Var[Y ] p Var[XY ]. Lemma115. For any two random variablesX;Y, we have: Var[XY ] 2Var[X]kYk 2 1 + 2(E[X]) 2 Var[Y ]: Consequently,kXk 1 C impliesVar[X 2 ] 4C 2 Var[X]. Proof. First note that for any two random variablesU;V , we have Var[U +V ] 2Var[U] + 2Var[V ]. Now letU = (XE[X])Y andV =E[X]Y , we have: Var[XY ] 2Var[(XE[X])Y ] + 2Var[E[X]Y ] 2E[(XE[X]) 2 Y 2 ] + 2(E[X]) 2 Var[Y ] 2Var[X]kYk 2 1 + 2(E[X]) 2 Var[Y ]: Lemma 116. ((Tarbouriech et al., 2021b, Lemma 14)) Dene = fv 2 [0;B] S + : v(g) = 0g. Let f : S + R + R + R + !R + withf(p;v;n;B;) =pv max n c 1 q V(p;v) n ;c 2 B n o ,withc 2 1 c 2 . Thenf satises for allp2 S + ;v2 andn;> 0, 1. f(p;v;n;B;) is non-decreasing inv(s), that is, 8v;v 0 2 ;v(s)v 0 (s);8s2S + =) f(p;v;n;B;)f(p;v 0 ;n;B;); 2. f(p;v;n;B;)pv c 1 2 q V(p;v) n c 2 2 B n . 292 Lemma117. ((Jaksch,Ortner,andAuer,2010,Lemma19),(Cohenetal.,2020,LemmaB.18))Foranysequence of numbersz 1 ;:::;z n with 0z t Z t1 = maxf1; P t1 i=1 z i g: n X t=1 z t Z t1 2 lnZ n ; n X t=1 z t p Z t1 3 p Z n : Lemma118. (RosenbergandMansour,2021,Lemma6)Let beapolicywithexpectedhittingtimeatmost starting from any state. Then for any2 (0; 1), with probability at least 1, takes no more than 4 ln 2 steps to reach the goal state. Lemma119. (Luo,Wei,andLee,2021,LemmaA.4)Let> 0, k 2 (A),and` k 2R A satisfythefollowing for allk2 [K] anda2 [A]: 1 (a) = 1 A ; k+1 (a)/ k (a) exp(` k (a)); j` k (a)j 1: Then for any ? 2 (A), P K k=1 h k ? ;` k i lnA + P K k=1 P a2As k (a)` 2 k (a). G.1 ConcentrationInequalities Lemma 120. ((Cohen et al., 2020, Theorem D.1)) LetfX t g t be a martingale dierence sequence such that jX t jB. Then with probability at least 1, n X t=1 X t B r n ln 2n ; 8n 1: 293 Lemma121. (Cohen et al., 2020, Theorem D.3) LetfX n g 1 n=1 be a sequence of i.i.d. random variables with expectation andX n 2 [0;B] almost surely. Then with probability at least 1, for anyn 1: n X i=1 (X i ) min 8 < : 2 r Bn ln 2n +B ln 2n ; 2 v u u t B n X i=1 X i ln 2n + 7B ln 2n 9 = ; : Lemma122. LetfX t g t be a sequence of i.i.d. random variables with mean, variance 2 , and 0X t B. Then with probability at least 1, the following holds for alln 1 simultaneously: n X t=1 (X t ) 2 r 2 2 n ln 2n + 2B ln 2n : n X t=1 (X t ) 2 r 2^ 2 n n ln 2n + 19B ln 2n : where ^ 2 n = 1 n P n t=1 X 2 t ( 1 n P n t=1 X t ) 2 . Proof. For a xedn, the rst inequality holds with probability at least 1 4n 2 by Freedman’s inequality. Then by (Efroni et al., 2021, Lemma 19), with probability at least 1 4n 2 ,j ^ n j q 36B 2 ln(2n=) n + . Therefore, p n = p n^ n + p n( ^ n ) p n^ n + 6B p ln(2n=). Plugging this back to the rst inequality gives the second inequality. Lemma123. (Anytime Freedman’s inequality) LetfX i g 1 i=1 be amartingale dierence sequence adapted to the ltrationfF i g 1 i=0 andjX i j B for someB > 0. Then with probability at least 1, for alln 1 simultaneously, n X i=1 X i 3 v u u t n X i=1 E[X 2 i jF i1 ] ln 4B 2 n 3 + 2B ln 4B 2 n 3 : 294 Proof. 
For eachn 1, applying Freedman’s inequality (Jin et al., 2020a, Lemma 9) tofX i g n i=1 andfX i g n i=1 with each2 =f 1 B2 i g dlog 2 ne i=0 , we have with probability at least 1 2n 2 , for any2 , n X i=1 X i n X i=1 E[X 2 i jF i1 ] + ln 4Bn 3 ; (G.1) Note that there exists ? 2 such that ? = min 1=B; r ln(4Bn 3 =) P n i=1 E[X 2 i jF i1 ] 2 ( 1 2 ; 1]. Plugging ? into Eq. (G.1), we getj P n i=1 X i j 3 q P n i=1 E[X 2 i jF i1 ] ln 4Bn 3 + 2B ln 4Bn 3 . By a union bound overn, the statement is proved. Lemma124 (Any interval Freedman’s inequality). LetfX i g 1 i=1 be amartingale dierence sequencew.r.t. the ltrationfF i g 1 i=0 andjX i jB for someB > 0. Then with probability at least 1, for all 1ln simultaneously, n X i=l X i 3 v u u t n X i=l E[X 2 i jF i1 ] ln 16B 2 n 5 + 2B ln 16B 2 n 5 (G.2) 3 v u u t 2 n X i=l X 2 i ln 16B 2 n 5 + 18B ln 16B 2 n 5 : (G.3) Proof. For eachl 1, by (Chen, Jain, and Luo, 2021, Lemma 38), with probability at least 1 4l 2 , Eq. (G.2) holds for alln l. Then by Lemma 126, with probability at least 1 4l 2 , Eq. (G.3) holds for alln l. Applying a union bound overl completes the proof. Lemma125. (AnyIntervalStrengthenedFreedman’sinequality)LetX 1:1 beamartingaledierencesequence with respect to a ltrationfF t g t such thatE[X t jF t1 ] = 0. Suppose B t 2 [1;b] for a xed constant b, B t 2F t1 andX t B t almost surely. Then for a givenn, with probability at least 1: n X t=1 X t C q 8V 1;n ln (2C=) + 5B 1;n ln (2C=) ; (G.4) 295 and with probability at least 1 we have for all 1ln simultaneously l+n1 X t=l X t C q 8V l;n ln (4Cn 3 =) + 5B l;n ln 4Cn 3 = 8CB l;n p n ln(4Cn 3 =); (G.5) whereV l;n = P l+n1 t=l E[X 2 t jF t1 ];B l;n = max lt<l+n B t , andC =dln(b)edln(nb 2 )e. Proof. Eq. (G.4) is simply from applying (Lee et al., 2020, Theorem 2.2) tofX t g t andfX t g t . Fix some l;n 1. Eq. (G.5) holds with probability at least 1 2n 3 by Eq. (G.4). By a union bound (rst sum overl, then sum overn), the statement is proved. Lemma126. (Cohen et al., 2020, Lemma D.4) and (Cohen et al., 2021, Lemma E.2) LetfX i g 1 i=1 be a sequence ofrandomvariablesw.r.t. to theltrationfF i g 1 i=0 andX i 2 [0;B] almostsurely. Thenwithprobabilityat least 1, for alln 1 simultaneously: n X i=1 E[X i jF i1 ] 2 n X i=1 X i + 4B ln 4n ; n X i=1 X i 2 n X i=1 E[X i jF i1 ] + 8B ln 4n : Lemma 127. Given 1 and a martingale sequencefX t g t such that X t 2F t ; 0 X t B, with probability at least 1: n X t=1 E[X t jF t1 ] 1 + 1 n X t=1 X t + 8B ln 2n ; 8n 1: Proof. DeneY t =E[X t jF t1 ]X t . For a givenn, by Freedman’s inequality, with probability at least 1 2n 2 : n X t=1 Y t n X t=1 E[(X t E[X t jF t1 ]) 2 jF t1 ] + 2 ln(2n=) BE[X t jF t1 ] + 2 ln(2n=) ; 296 for some< 1 B . Reorganizng terms, we get when = 1 2B < 1 B (note thatB 1 2 ): n X t=1 E[X t jF t1 ] 1 1B n X t=1 X t + 2 ln(2n=) ! (1 + 2B) n X t=1 X t + 4 ln(2n=) 1 + 1 n X t=1 X t + 8B ln 2n : ( 1 1x 1 + 2x whenx2 [0; 1 2 ]) By a union bound overn, we obtain the desired bound. 297
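As a purely illustrative supplement (not part of the original analysis), the following minimal Python sketch numerically sanity-checks an anytime empirical-Bernstein bound of the form stated in Lemma 122. The placement of the failure probability $\delta$ inside the logarithms is our reading of the statement, and the Bernoulli cost distribution, the constants $B$, $p$, $\delta$, and the sample sizes in the script are arbitrary choices made only for this illustration.

```python
import numpy as np

# Monte Carlo sanity check (illustrative only) of an anytime empirical-Bernstein
# bound in the form of Lemma 122: for i.i.d. X_t in [0, B] with mean m and
# empirical variance var_n after n samples, with probability at least 1 - delta,
# simultaneously for all n >= 1,
#   |sum_{t<=n} (X_t - m)| <= 2*sqrt(2*var_n*n*ln(2n/delta)) + 19*B*ln(2n/delta).
# (The delta placement inside the logarithms is inferred from the statement.)

rng = np.random.default_rng(0)
B, p, delta = 1.0, 0.3, 0.1             # costs are B * Bernoulli(p), so the mean is B * p
n_max, n_runs = 2000, 500

failures = 0
for _ in range(n_runs):
    x = B * rng.binomial(1, p, size=n_max).astype(float)          # i.i.d. samples in [0, B]
    n = np.arange(1, n_max + 1)
    csum = np.cumsum(x)
    mean_n = csum / n
    var_n = np.maximum(np.cumsum(x ** 2) / n - mean_n ** 2, 0.0)  # empirical variance after n samples
    dev = np.abs(csum - B * p * n)                                 # |sum_t (X_t - mean)|
    bound = 2 * np.sqrt(2 * var_n * n * np.log(2 * n / delta)) + 19 * B * np.log(2 * n / delta)
    failures += np.any(dev > bound)     # the bound must hold for all n simultaneously

print(f"empirical failure rate: {failures / n_runs:.3f}  (target: at most delta = {delta})")
```

With these settings the observed failure rate should be far below $\delta$, which mostly reflects the slack in the explicit constants rather than the tightness of the bound.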
Abstract
Reinforcement learning (RL) is about learning to make optimal sequential decisions in an unknown environment. In the past decades, RL has made astounding progress. With massive computation power, we can train agents that beat professional players in challenging games such as Go (Silver et al., 2016) and StarCraft (Vinyals et al., 2019).
The objective of traditional RL models, such as finite-horizon models and infinite-horizon discounted models, is to minimize the accumulated cost within an effective horizon. However, many real-world applications are goal-oriented, meaning that the objective is to achieve a certain goal while minimizing the accumulated cost. Examples of such applications include games (beat your opponent as quickly as possible), car navigation (reach a destination with minimum gas consumption), and robotic manipulation (move an object to a desired position with the least joint movement). Notably, goal-oriented tasks have two objectives: reaching the goal and minimizing cost. These two objectives may not always align, and it is often hard to specify goal-oriented tasks with traditional RL models. As a result, goal-oriented reinforcement learning, that is, applying RL to solve goal-oriented tasks, often requires heavy engineering effort, such as designing cost functions, choosing an appropriate horizon or discount factor, and building sophisticated exploration schemes to handle sparse rewards.
In this thesis, we focus on resolving these issues by answering the following question: how can we perform goal-oriented reinforcement learning (GoRL) in a principled way? Specifically, we study learning in a Markov decision process (MDP) model called the Stochastic Shortest Path (SSP) model (Bertsekas and Yu, 2013), which exactly captures the dual objectives of GoRL. As in the study of other RL models, we consider various learning settings, such as adversarial environments, non-stationary environments, and PAC learning. We also develop practical learning algorithms for SSP, including model-free algorithms, algorithms with function approximation, and policy optimization methods.