ONLINE REINFORCEMENT LEARNING FOR MARKOV DECISION PROCESSES AND GAMES

by

Mehdi Jafarnia Jahromi

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
ELECTRICAL ENGINEERING

December 2021

Copyright 2021 Mehdi Jafarnia Jahromi

Contents

List of Tables
List of Figures
Abstract

1 Introduction

2 Model-free Reinforcement Learning for Infinite-horizon Average-reward Markov Decision Processes
   2.1 Introduction
   2.2 Related Work
   2.3 Preliminaries
   2.4 Optimistic Q-Learning
   2.5 Analysis
   2.6 Experiments
   Appendices
   2.A Proof of Lemma 2.1
   2.B Proof of Lemma 2.2
   2.C Proof of Lemma 2.3

3 A Model-free Learning Algorithm for Infinite-horizon Average-reward Markov Decision Processes with Near-optimal Regret
   3.1 Introduction
   3.2 The Exploration Enhanced Q-learning Algorithm
   3.3 Analysis
   3.4 Experiments
   Appendices
   3.A Proof of Lemma 3.1
   3.B Proof of Lemma 3.2
   3.C Proof of Lemma 3.3
   3.D Proof of Lemma 3.4
   3.E Proof of Lemma 3.5
   3.F Proof of Lemma 3.6

4 Online Learning for Stochastic Shortest Path Model via Posterior Sampling
   4.1 Introduction
   4.2 Preliminaries
   4.3 A Posterior Sampling RL Algorithm for SSP Models
   4.4 Theoretical Analysis
   4.5 Experiments
   Appendices
   4.A Proof of Lemma 4.2
   4.B Proof of Lemma 4.3
   4.C Proof of Lemma 4.4
   4.D Proof of Lemma 5.8
   4.E Proof of Theorem 4.1
   4.F Proof of Theorem 4.2

5 Online Learning for Unknown Partially Observable MDPs
   5.1 Introduction
   5.2 Preliminaries
   5.3 The PSRL-POMDP Algorithm
   5.4 Finite-Parameter Case (|Θ| < ∞)
   5.5 General Case (|Θ| = ∞)
   Appendices
   5.A Regret Decomposition
   5.B Proofs of Section 5.4
   5.C Proofs of Section 5.5
   5.D Other Proofs

6 Learning Zero-sum Stochastic Games with Posterior Sampling
   6.1 Introduction
   6.2 Preliminaries
   6.3 Posterior Sampling for Stochastic Games
   6.4 Analysis
   Appendices
   6.A Proof of Lemma 6.1
   6.B Proof of Lemma 6.2

7 Non-indexability of the Stochastic Appointment Scheduling Problem
   7.1 Introduction
   7.2 Problem Formulation
   7.3 Non-indexability of the Sequencing Problem
   7.4 Existence of Solution to the Scheduling Problem
   7.5 Numerical Results
   Appendices
   7.A Supplementary Material
   7.B Proof of Lemma 7.1
   7.C Proof of Lemma 7.2
   7.D Proof of Theorem 7.3
   7.E Proof of Theorem 7.4
   7.F Useful Theorems and Propositions
8 Concluding Remarks and Future Directions

List of Publications

Bibliography

List of Tables

2.1 Regret comparisons for RL algorithms in infinite-horizon average-reward MDPs with S states, A actions, and T steps. D is the diameter of the MDP, sp(v*) ≤ D is the span of the optimal value function, V⋆_{s,a} := Var_{s'∼p(·|s,a)}[v*(s')] ≤ sp(v*)² is the variance of the optimal value function, t_mix is the mixing time, and t_hit is the hitting time. B is an upper bound on the estimates of the Q function.

2.2 Hyperparameters used in the experiments. These hyperparameters are optimized to obtain the best possible result for each algorithm. All experiments are averaged over 10 independent runs for a horizon of 5×10⁶. For the Politex algorithm, τ and τ' are the lengths of the two stages defined in Figure 3 of Abbasi-Yadkori et al. (2019a).

3.1 The hyperparameters used in the algorithms. These hyperparameters are optimized to obtain the best performance of each algorithm. We simulate 10 independent Monte Carlo runs over a horizon of T = 5×10⁶ steps. For the UCRL2 algorithm, C is a coefficient that scales the confidence interval. For the Politex algorithm, τ and τ' are the lengths of the two stages defined in Figure 3 of Abbasi-Yadkori et al. (2019a).

7.1 Non-optimality of least-newsvendor-first for c_σ^1 and least-variance-first for c_σ^2. The optimal sequence found by exhaustive search differs from the sequence given by heuristic index-based policies.

List of Figures

2.1 Performance of model-free algorithms on RandomMDP (left) and JumpRiverSwim (right). The standard Q-learning algorithm with ε-greedy exploration suffers linear regret. The Optimistic Q-learning algorithm achieves sub-linear regret. The shaded area denotes the standard deviation of regret over multiple runs.

3.1 Regret comparison of model-free and model-based RL algorithms in RiverSwim (left) and RandomMDP (right). In RiverSwim, our algorithm substantially outperforms Optimistic QL, the best existing model-free algorithm, and performs as well as PSRL (Ouyang et al., 2017b), which is among the best known model-based algorithms in practice. Politex and MDP-OOMD did not achieve sub-linear regret in RiverSwim and are thus removed from the left figure. In RandomMDP, our algorithm together with Optimistic QL outperforms the other model-free algorithms and is comparable to the model-based algorithms.

4.1 Cumulative regret of existing SSP algorithms on RandomMDP (left) and GridWorld (right) for 10,000 episodes. The results are averaged over 10 runs and the 95% confidence interval is shown with the shaded area. Our proposed PSRL-SSP algorithm considerably outperforms all existing algorithms. The performance gap is even more significant in the more challenging GridWorld environment (right).

7.1 Appointment scheduling. s_i denotes the appointment time of job i. For the realization shown in this figure, the server remains idle between E_σ(1) and s_2, and the third job is delayed by E_σ(2) − s_3.

7.2 Examples of the function g.

7.3 Distribution of X_1 and X_2 for (a) Example 7.5 and (b) Example 7.6.

7.4 SAA running time (in seconds) to find an approximately optimal schedule for a given sequence. 30 samples/job are used for SAA, though there is no appreciable difference even with 10× more samples.

7.5 Upper and lower bounds on the optimal cost for the (a) c_σ^1 and (b) c_σ^2 cost functions. As shown in the figure, the upper bound is quite loose on the USC Keck dataset. Table 7.1 provides numerical values for n ≤ 6 to compare the optimal sequence with the index-based heuristic policy.

7.6 The optimality gap of the newsvendor index increases as the ratio α/β increases on the Keck dataset. The optimal sequence is found by exhaustive search over all n! possible sequences. β = 1 is fixed and α varies from 0.1 to 100. The dashed lines show the cost of the sequence obtained by least-newsvendor-first; the cost of the optimal sequence is shown by solid lines.

Abstract

Reinforcement learning (RL) refers to the problem of an agent interacting with an unknown environment while maximizing its cumulative reward. Recent advances in applying RL to playing the game of Go, StarCraft, navigation, robotics, etc., have demonstrated the capability of this broad class of algorithms to learn complicated tasks. An important factor in applying RL algorithms is data efficiency, which is usually achieved by balancing the trade-off between exploration and exploitation: should the agent explore the unknown environment to gather more information for future decisions, or should it exploit the available information to maximize the reward?

RL algorithms can be divided into two large classes: offline and online. Offline RL algorithms require access to a simulator that can be used to gain information before interacting with the environment. When a simulator is available, the problem is much easier than standard RL since no exploration is needed. However, building simulators is very difficult (if not impossible) for many applications. Online RL algorithms, on the other hand, gather information about the unknown environment as the interaction occurs. Thus, efficient exploration plays a central role in designing online RL algorithms and is known to be significantly more challenging. In this dissertation, we address efficient exploration in online RL algorithms in various settings.
The performance of a learning algorithm is measured through the notion of regret, which compares the cumulative reward of the learning algorithm with that of an oracle. First, we start with Markov Decision Processes (MDPs), a standard mathematical framework for RL, and design two algorithms that achieve sub-linear regret bounds. These are the first model-free algorithms that achieve sub-linear regret for infinite-horizon MDPs with the average-reward criterion. Second, a simple online RL algorithm is proposed for the Stochastic Shortest Path (SSP) model, a framework for goal-oriented RL. The proposed algorithm is simple, numerically outperforms previously proposed algorithms in the literature, and enjoys near-optimal regret guarantees. Third, partially observable MDPs (POMDPs) are considered, where the underlying state of the agent is not known and only a partial observation is available. Our algorithm is the first online RL algorithm that achieves sub-linear regret in this setting. Finally, a simple algorithm is proposed for infinite-horizon zero-sum Markov games with the average-reward criterion. This algorithm achieves a near-optimal regret bound and improves the best existing result.

Chapter 1

Introduction

Reinforcement Learning (RL) refers to the problem of an agent interacting with an unknown environment. The environment is usually modeled as a Markov Decision Process (MDP) whose transition kernel and reward function are unknown. The agent interacts with the environment by observing the state, taking an action, and receiving a reward at each time step. The goal of the agent is to maximize its cumulative reward. If the agent knew the transition kernel and the reward function, it could derive and follow the optimal policy that maximizes the cumulative reward. However, in the RL setting, this information is not available to the agent before interacting with the environment.
There is a fundamental trade-off in RL, referred to as the exploration-exploitation trade-off: should the agent explore the environment to gather more information for better decisions in the future, or should it exploit the current information to maximize the immediate performance? Efficient exploration is a key concept in RL to balance this trade-off and facilitate the learning process. It is measured through the notion of regret, which compares the performance of the learning algorithm with that of an oracle.

A general technique to balance the exploration-exploitation trade-off is the Optimism in the Face of Uncertainty (OFU) principle (Lai and Robbins, 1985). Under this principle, the agent constructs a set of plausible models based on the available information, selects the model associated with the maximum reward, and follows the optimal policy with respect to the selected model. This idea is widely used in the RL literature (e.g., Jaksch et al. (2010); Azar et al. (2017); Jin et al. (2018); Wei et al. (2020, 2021)). In Chapters 2 and 3, the OFU principle is used to design model-free algorithms for infinite-horizon average-reward MDPs.

An alternative fundamental idea to encourage exploration is Posterior Sampling (PS), also known as Thompson Sampling (Thompson, 1933). The idea is to maintain a posterior distribution over the unknown model parameters based on the available information and the prior distribution. PS algorithms usually proceed in episodes. At the beginning of an episode, a model is sampled from the posterior. The actions during the episode are then selected according to the optimal policy associated with the sampled model. PS algorithms have two main advantages over OFU-type algorithms. First, prior knowledge of the environment can be incorporated through the prior distribution.
Second, PS algorithms have shown superior numerical performance on multi-armed bandit problems (Scott, 2010; Chapelle and Li, 2011) and on MDPs (Osband et al., 2013; Osband and Van Roy, 2017; Ouyang et al., 2017b). Chapters 4, 5, and 6 develop such algorithms for SSPs, POMDPs, and stochastic games, respectively.

There are two broad classes of RL algorithms: model-based and model-free. Model-based algorithms maintain an estimate of the environment and use it to obtain a policy. Model-free algorithms, on the other hand, directly estimate the value function or the policy without explicitly estimating a model of the environment. Model-based algorithms are known for their efficient exploration, though they require more memory. Model-free algorithms, however, are faster, more flexible, and require less memory. Until recently, it was believed that efficient exploration is not possible with model-free algorithms. However, recent advances in model-free RL (Jin et al., 2018) invalidated this belief in the finite-horizon setting. Chapters 2 and 3 address the problem of efficient exploration in the infinite-horizon average-reward setting using model-free algorithms. This setting is challenging because techniques such as backward induction in the finite-horizon setting and the contraction mapping of the discounted setting are not available in the infinite-horizon average-reward setting. In Chapter 2, the Optimistic Q-learning algorithm is proposed, which achieves a regret bound of Õ(T^{2/3}) for weakly communicating MDPs. This is a significant step towards designing model-free algorithms for the infinite-horizon average-reward setting. However, there is a gap between this regret bound and the worst-case lower bound of Ω(√T). Chapter 3 takes the first steps to close this gap by proposing a second version of this algorithm (Exploration Enhanced Q-learning) that obtains an Õ(√T) regret bound, although under some assumptions in the analysis that are left for future work.
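The posterior-sampling loop sketched above (sample a model from the posterior, act optimally for that model, update the posterior) is easiest to see in the multi-armed bandit special case cited here. The following is a minimal illustrative sketch, not code from this dissertation; the Beta(1, 1) priors, the arm means, and the horizon are assumptions chosen for the demo.

```python
import random

def thompson_bernoulli(true_means, horizon, seed=0):
    """Posterior sampling for a Bernoulli bandit with Beta(1, 1) priors."""
    rng = random.Random(seed)
    n_arms = len(true_means)
    # Beta posterior parameters per arm: (successes + 1, failures + 1).
    alpha = [1] * n_arms
    beta = [1] * n_arms
    total_reward = 0
    for _ in range(horizon):
        # Sample a model (a mean for each arm) from the posterior ...
        samples = [rng.betavariate(alpha[i], beta[i]) for i in range(n_arms)]
        # ... and play the arm that is optimal for the sampled model.
        arm = max(range(n_arms), key=lambda i: samples[i])
        reward = 1 if rng.random() < true_means[arm] else 0
        alpha[arm] += reward
        beta[arm] += 1 - reward
        total_reward += reward
    return total_reward

# With a clear gap between the arms, the posterior concentrates on the
# better arm quickly and most pulls go to it.
print(thompson_bernoulli([0.2, 0.8], horizon=2000))
```

Note how exploration arises with no explicit bonus term: an under-sampled arm has a wide posterior, so it occasionally produces a large sample and gets pulled.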
In many practical scenarios, the interaction between the agent and the environment ends only when the agent reaches a predefined goal state. This includes a wide variety of goal-oriented control and RL problems such as navigation and game playing, and can be formulated as the Stochastic Shortest Path (SSP) problem. Unlike finite/infinite-horizon MDPs, the duration of the interaction between the agent and the environment is a random variable and may be infinite (if the goal is not reached). In Chapter 4, we propose PSRL-SSP, a simple posterior sampling-based RL algorithm for the SSP problem, and establish a Bayesian regret bound of Õ(B⋆S√(AK)), where B⋆ is an upper bound on the expected cost of the optimal policy, S is the size of the state space, A is the size of the action space, and K is the number of episodes. The algorithm only requires knowledge of the prior distribution and has no hyperparameters to tune. Although optimism-based algorithms such as (Cohen et al., 2021; Tarbouriech et al., 2021b) provide a better regret bound (by a factor of √S), the proposed PSRL-SSP algorithm is much simpler and has superior numerical performance.

In many real-world applications such as robotics, healthcare, and finance, only a partial observation of the state is available. MDPs, however, can only model scenarios where the state is perfectly observable. Such scenarios are instead modeled by Partially Observable Markov Decision Processes (POMDPs). In addition to the uncertainty in the environment dynamics, RL in POMDPs must deal with uncertainty about the underlying state. Online learning of optimal controllers for unknown POMDPs, which requires efficient learning using regret-minimizing algorithms that effectively trade off exploration and exploitation, is extremely challenging, and no solution currently exists. In Chapter 5, we take the first steps to tackle this problem.
We consider infinite-horizon average-cost POMDPs with an unknown transition model but a known observation model. We propose a natural posterior sampling-based RL algorithm (PSRL-POMDP) and show that it achieves a regret bound of O(log T), where T is the time horizon, when the parameter set is finite. In the general case (continuous parameter set), we show that the algorithm achieves Õ(T^{2/3}) regret under two technical assumptions. To the best of our knowledge, this is the first online RL algorithm for POMDPs with sub-linear regret.

MDPs are useful for modeling the interaction of a single agent with the environment. In recent years, multi-agent RL has attracted many researchers due to its applications in game playing, robotic control, autonomous driving, etc. Chapter 6 proposes PSRL-ZSG, a posterior sampling-based algorithm for efficient exploration in two-player zero-sum stochastic games in the infinite-horizon average-reward setting. We consider the online setting where the opponent cannot be controlled and may follow any arbitrary time-adaptive, history-dependent strategy. The proposed algorithm improves the best existing regret bound of Õ((DS²AT²)^{1/3}) by Wei et al. (2017) under the same assumption, and matches the theoretical lower bound in A and T.

In Chapter 7, a completely different topic is addressed: the problem of stochastic sequencing and scheduling, motivated by operating room scheduling. The challenge is to determine the optimal sequence and appointment times of jobs to minimize some function of the server idle time and the service start-time delay. It was conjectured for many years that sequencing jobs in increasing order of variance is optimal. A key result of this chapter is that the optimal sequencing problem is non-indexable, i.e., neither the variance nor any other such index can be used to determine the optimal sequence in which to schedule jobs.
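To make the sequencing question concrete: given job-duration distributions and fixed appointment times, one can estimate the expected cost α·(idle time) + β·(delay) of every job order by Monte Carlo and pick the best by exhaustive search. The sketch below is an illustrative formalization under assumed two-point duration distributions; it is not the dissertation's model, data, or cost parameters.

```python
import itertools
import random

def expected_cost(durations, times, alpha=1.0, beta=1.0, n_samples=2000, seed=0):
    """Monte Carlo estimate of E[alpha * idle + beta * delay] for one sequence.

    durations: callables, each returning a random service time when given an rng.
    times: appointment times s_1 <= s_2 <= ... for the jobs in this order.
    """
    rng = random.Random(seed)  # common random numbers across sequences
    total = 0.0
    for _ in range(n_samples):
        finish = 0.0
        cost = 0.0
        for draw, s in zip(durations, times):
            cost += alpha * max(0.0, s - finish)   # server idle before the job
            cost += beta * max(0.0, finish - s)    # job start-time delay
            finish = max(finish, s) + draw(rng)
        total += cost
    return total / n_samples

# Hypothetical two-point duration distributions (not from the thesis data).
jobs = {
    "A": lambda rng: 1.0 if rng.random() < 0.9 else 9.0,   # high variance
    "B": lambda rng: 3.0,                                  # deterministic
    "C": lambda rng: 2.0 if rng.random() < 0.5 else 4.0,   # moderate variance
}
times = [0.0, 3.0, 6.0]
best = min(itertools.permutations(jobs),
           key=lambda seq: expected_cost([jobs[j] for j in seq], times))
print(best)
```

Exhaustive search over all n! orders, as used for Table 7.1, is exactly this `min` over `permutations`; the non-indexability result says no per-job index can replace it in general.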
Chapter 2

Model-free Reinforcement Learning for Infinite-horizon Average-reward Markov Decision Processes

2.1 Introduction

Reinforcement learning (RL) refers to the problem of an agent interacting with an unknown environment with the goal of maximizing its cumulative reward over time. The environment is usually modeled as a Markov Decision Process (MDP) with an unknown transition kernel and/or an unknown reward function. The fundamental trade-off between exploration and exploitation is the key challenge for RL: should the agent exploit the available information to optimize the immediate performance, or should it explore the poorly understood states and actions to gather more information and improve future performance?

There are two broad classes of RL algorithms: model-based and model-free. Model-based algorithms maintain an estimate of the underlying MDP and use that estimate to determine a policy during the learning process. Examples include UCRL2 (Jaksch et al., 2010), REGAL (Bartlett and Tewari, 2009), PSRL (Ouyang et al., 2017b), SCAL (Fruit et al., 2018b), UCBVI (Azar et al., 2017), EBF (Zhang and Ji, 2019), and EULER (Zanette and Brunskill, 2019). Model-based algorithms are well known for their sample efficiency. However, they have two general disadvantages: first, model-based algorithms require large memory to store the estimate of the model parameters; second, it is hard to extend model-based approaches to non-parametric settings, e.g., continuous-state MDPs.

Model-free algorithms, on the other hand, try to resolve these issues by directly maintaining an estimate of the optimal Q-value function or the optimal policy. Examples include Q-learning (Watkins, 1989), Delayed Q-learning (Strehl et al., 2006), TRPO (Schulman et al., 2015), DQN (Mnih et al., 2013), A3C (Mnih et al., 2016), and more. Model-free algorithms are not only computation- and memory-efficient, but also easier to extend to large-scale problems by incorporating function approximation.

Table 2.1: Regret comparisons for RL algorithms in infinite-horizon average-reward MDPs with S states, A actions, and T steps. D is the diameter of the MDP, sp(v*) ≤ D is the span of the optimal value function, V⋆_{s,a} := Var_{s'∼p(·|s,a)}[v*(s')] ≤ sp(v*)² is the variance of the optimal value function, t_mix is the mixing time, and t_hit is the hitting time. B is an upper bound on the estimates of the Q function.

  Model-based:
    REGAL (Bartlett and Tewari, 2009)         | Õ(sp(v*)√(SAT))               | no efficient implementation
    UCRL2 (Jaksch et al., 2010)               | Õ(DS√(AT))                    | -
    PSRL (Ouyang et al., 2017b)               | Õ(sp(v*)S√(AT))               | Bayesian regret
    OSP (Ortner, 2018)                        | Õ(√(t_mix SAT))               | ergodic assumption, no efficient implementation
    SCAL (Fruit et al., 2018b)                | Õ(sp(v*)S√(AT))               | -
    KL-UCRL (Talebi and Maillard, 2018)       | Õ(√(S Σ_{s,a} V⋆_{s,a} T))    | -
    UCRL2B (Fruit et al., 2019)               | Õ(S√(DAT))                    | -
    EBF (Zhang and Ji, 2019)                  | Õ(√(DSAT))                    | no efficient implementation
  Model-free:
    Politex (Abbasi-Yadkori et al., 2019a)    | t_mix³ t_hit √(SA) T^{3/4}    | ergodic assumption
    MDP-OOMD (Wei et al., 2019)               | Õ(√(t_mix³ A T))              | ergodic assumption
    EE-Politex (Abbasi-Yadkori et al., 2019b) | Õ(T^{4/5})                    | unichain assumption
    AAPI (Hao et al., 2020)                   | Õ(T^{2/3})                    | ergodic assumption
    Optimistic Q-learning (Chapter 2)         | Õ(sp(v*)(SA)^{1/3} T^{2/3})   | -
    EE-QL (Chapter 3)                         | Õ((sp(v*) + B)√(SAT))         | estimator of J* and bounded Q estimates
  Lower bound (Jaksch et al., 2010)           | Ω(√(DSAT))                    | -

It was believed that model-free algorithms are less sample-efficient than model-based algorithms. However, Jin et al. (2018) recently showed that the (model-free) Q-learning algorithm with UCB exploration achieves a nearly-optimal regret bound, implying the possibility of designing algorithms with the advantages of both model-free and model-based methods. Jin et al. (2018) addressed the problem for episodic finite-horizon MDPs.
Following this work, Dong et al. (2019) extended the result to the infinite-horizon discounted-reward setting. However, a Q-learning-based model-free algorithm with low regret for infinite-horizon average-reward MDPs, an equally heavily studied setting in the RL literature, remained unknown. Designing such algorithms has proven rather challenging, since the Q-value function estimate may grow unbounded over time and it is hard to control its magnitude in a way that guarantees efficient learning. Moreover, techniques such as backward induction in the finite-horizon setting or the contraction mapping in the infinite-horizon discounted setting cannot be applied to the infinite-horizon average-reward setting.

In this chapter, we make significant progress in this direction and propose Optimistic Q-learning, a model-free algorithm for learning infinite-horizon average-reward MDPs. Optimistic Q-learning (Section 2.4) achieves a regret bound of Õ(T^{2/3}) with high probability for the broad class of weakly communicating MDPs.¹ This is the first model-free algorithm in this setting under only the minimal weakly communicating assumption. The key idea of this algorithm is to artificially introduce a discount factor for the reward, to avoid the aforementioned unbounded Q-value estimate issue, and to trade off this effect against the approximation error introduced by the discount factor. We remark that this is very different from the R-learning algorithm of Schwartz (1993), which is a variant of Q-learning with no discount factor for the infinite-horizon average-reward setting.

To the best of our knowledge, the only existing model-free algorithm for this setting is the Politex algorithm (Abbasi-Yadkori et al., 2019a,b), which achieves Õ(T^{3/4}) regret for ergodic MDPs only. Our algorithm enjoys a better bound compared to Politex, and even removes the ergodic assumption completely.² For comparisons with other existing model-based approaches to this problem, see Table 2.1.
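The discounting trick rests on a standard fact: for a fixed policy in an aperiodic unichain, the normalized discounted value (1 − γ)·V_γ(s) approaches the average reward J as γ = 1 − 1/H → 1, with error controlled by H. The sketch below checks this numerically on a hypothetical two-state chain; the chain and its numbers are illustrative assumptions, not an example from this chapter.

```python
# Two-state Markov chain under a fixed policy: transition matrix P, rewards r.
P = [[0.9, 0.1],
     [0.2, 0.8]]
r = [1.0, 0.0]

def discounted_value(P, r, gamma, iters=20000):
    """Iterate v <- r + gamma * P v to get the discounted value of the policy."""
    v = [0.0, 0.0]
    for _ in range(iters):
        v = [r[s] + gamma * sum(P[s][t] * v[t] for t in range(2)) for s in range(2)]
    return v

# Stationary distribution of a 2-state chain in closed form:
# pi_0 * P[0][1] = pi_1 * P[1][0]  =>  pi_0 = P[1][0] / (P[0][1] + P[1][0]).
pi0 = P[1][0] / (P[0][1] + P[1][0])
J = pi0 * r[0] + (1 - pi0) * r[1]

gamma = 0.999  # effective horizon H = 1/(1 - gamma) = 1000
v = discounted_value(P, r, gamma)
print(round((1 - gamma) * v[0], 2), round(J, 2))
```

The gap between the two printed numbers shrinks as H grows, while larger H makes the value estimates (of order H) harder to learn: exactly the trade-off that fixes the Õ(T^{2/3}) rate.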
2.2 Related Work

We review the related literature with regret guarantees for learning MDPs with finite state and action spaces (there are many other works on asymptotic convergence or sample complexity, a different focus from ours). Three common settings have been studied: 1) the finite-horizon episodic setting, 2) the infinite-horizon discounted setting, and 3) the infinite-horizon average-reward setting. For the first two settings, previous works have designed efficient algorithms with regret bounds or sample complexity that are (almost) information-theoretically optimal, using either model-based approaches such as (Azar et al., 2017) or model-free approaches such as (Jin et al., 2018; Dong et al., 2019).

For the infinite-horizon average-reward setting, many model-based algorithms have been proposed, such as (Auer and Ortner, 2007; Jaksch et al., 2010; Ouyang et al., 2017b; Agrawal and Jia, 2017; Talebi and Maillard, 2018; Fruit et al., 2018a,b). These algorithms either conduct posterior sampling or follow the optimism in the face of uncertainty principle to build an estimate of the MDP model and then plan according to that estimate (hence model-based). They all achieve Õ(√T) regret, but the dependence on the other parameters is suboptimal. Recent works have made progress toward obtaining the optimal bound (Ortner, 2018; Zhang and Ji, 2019); however, their algorithms are not computationally efficient: the time complexity scales exponentially with the number of states. On the other hand, except for the naive approach of combining Q-learning with ε-greedy exploration (which is known to suffer regret exponential in some parameters (Osband et al., 2014)), the only existing model-free algorithm for this setting is Politex, which only works for ergodic MDPs.

¹ Throughout the chapter, we use the notation Õ(·) to suppress log terms.
² Politex is studied in a more general setup with function approximation, though.
2.3 Preliminaries

An infinite-horizon average-reward Markov Decision Process (MDP) can be described by (S, A, r, p), where S is the state space, A is the action space, r: S × A → [0, 1] is the reward function, and p: S² × A → [0, 1] is the transition probability, such that p(s'|s, a) := P(s_{t+1} = s' | s_t = s, a_t = a) for s_t ∈ S, a_t ∈ A, and t = 1, 2, 3, .... We assume that S and A are finite sets with cardinalities S and A, respectively. The average reward per stage of a deterministic/stationary policy π: S → A starting from state s is defined as

    J^π(s) := liminf_{T→∞} (1/T) E[ Σ_{t=1}^T r(s_t, π(s_t)) | s_1 = s ],

where s_{t+1} is drawn from p(·|s_t, π(s_t)). Let J*(s) := max_{π∈A^S} J^π(s). A policy π* is said to be optimal if it satisfies J^{π*}(s) = J*(s) for all s ∈ S.

We consider weakly communicating MDPs, defined in Section 2.4. The weakly communicating assumption is in fact known to be necessary for learning infinite-horizon MDPs with low regret (Bartlett and Tewari, 2009). Standard MDP theory (Puterman, 2014) shows that for weakly communicating MDPs there exist q*: S × A → R (unique up to an additive constant) and a unique J* ∈ [0, 1] such that J*(s) = J* for all s ∈ S and the following Bellman equation holds:

    J* + q*(s, a) = r(s, a) + E_{s'∼p(·|s,a)}[v*(s')],     (2.1)

where v*(s) := max_{a∈A} q*(s, a), and the optimal policy is π*(s) = argmax_a q*(s, a).

We consider a learning problem where S, A, and the reward function r are known to the agent, but the transition probability p is not (so one cannot directly solve the Bellman equation). Knowledge of the reward function is a typical assumption, as in Bartlett and Tewari (2009); Gopalan and Mannor (2015); Ouyang et al. (2017b), and can be removed at the expense of a constant factor in the regret bound. Specifically, the learning protocol is as follows. An agent starts at an arbitrary state s_1 ∈ S.
At each time step t = 1, 2, 3, ..., the agent observes the state s_t ∈ S and takes an action a_t ∈ A, which is a function of the history s_1, a_1, s_2, a_2, ..., s_{t−1}, a_{t−1}, s_t. The environment then determines the next state by drawing s_{t+1} according to p(·|s_t, a_t).

The performance of a learning algorithm is evaluated through the notion of cumulative regret, defined as the difference between the total reward of the optimal policy and that of the algorithm:

    R_T := Σ_{t=1}^T ( J* − r(s_t, a_t) ).

Since r ∈ [0, 1] (and consequently J* ∈ [0, 1]), the regret can at worst grow linearly with T. If a learning algorithm achieves sub-linear regret, then R_T/T goes to zero, i.e., the average reward of the algorithm converges to the optimal per-stage reward J*. The best existing regret bound is Õ(√(DSAT)), achieved by a model-based algorithm (Zhang and Ji, 2019) (where D is the diameter of the MDP), and it matches the lower bound of Jaksch et al. (2010).

Algorithm 1 Optimistic Q-learning
  Parameters: H ≥ 2, confidence level δ ∈ (0, 1)
  Initialization: γ = 1 − 1/H; ∀s: V̂_1(s) = H; ∀(s, a): Q_1(s, a) = Q̂_1(s, a) = H, n_1(s, a) = 0
  Define: ∀τ: α_τ = (H + 1)/(H + τ), b_τ = 4 sp(v*) √((H/τ) ln(2T/δ))
  for t = 1, ..., T do
    1. Take action a_t = argmax_{a∈A} Q̂_t(s_t, a).
    2. Observe s_{t+1}.
    3. Update:
       n_{t+1}(s_t, a_t) ← n_t(s_t, a_t) + 1
       τ ← n_{t+1}(s_t, a_t)
       Q_{t+1}(s_t, a_t) ← (1 − α_τ) Q_t(s_t, a_t) + α_τ [ r(s_t, a_t) + γ V̂_t(s_{t+1}) + b_τ ]     (2.2)
       Q̂_{t+1}(s_t, a_t) ← min{ Q̂_t(s_t, a_t), Q_{t+1}(s_t, a_t) }
       V̂_{t+1}(s_t) ← max_{a∈A} Q̂_{t+1}(s_t, a)
    (All other entries of n_{t+1}, Q_{t+1}, Q̂_{t+1}, V̂_{t+1} remain the same as those in n_t, Q_t, Q̂_t, V̂_t.)

2.4 Optimistic Q-Learning

In this section, we introduce Optimistic Q-learning (see Algorithm 1 for pseudocode). The algorithm works for any weakly communicating MDP.
An MDP is weakly communicating if its state space S can be partitioned into two subsets: in the first subset, all states are transient under any stationary policy; in the second subset, every two states are accessible from each other under some stationary policy. It is well known that the weakly communicating condition is necessary for ensuring low regret in this setting (Bartlett and Tewari, 2009). Define sp(v*) = max_s v*(s) − min_s v*(s) to be the span of the value function, which is known to be bounded for weakly communicating MDPs. In particular, it is bounded by the diameter of the MDP (see Lattimore and Szepesvári, 2018, Lemma 38.1). We assume that sp(v*) is known and use it to set the parameters. In the case when it is unknown, we can replace sp(v*) with any upper bound on it (e.g., the diameter) in both the algorithm and the analysis.

The key idea of Algorithm 1 is to solve the undiscounted problem via learning a discounted MDP (with the same states, actions, reward function, and transitions) for some discount factor γ (defined in terms of a parameter H). Define V* and Q* to be the optimal value function and Q-function of the discounted MDP, satisfying the Bellman equations:

∀(s, a): Q*(s, a) = r(s, a) + γ E_{s′ ∼ p(·|s,a)}[V*(s′)],
∀s: V*(s) = max_{a∈A} Q*(s, a).

The way we learn this discounted MDP is essentially the same as the algorithm of Dong et al. (2019), which itself is based on the idea of Jin et al. (2018). Specifically, the algorithm maintains an estimate V̂_t of the optimal value function V* and an estimate Q̂_t of the optimal Q-function Q*, where Q̂_t is a clipped version of another estimate Q_t. At each time step, the algorithm takes a greedy action with the maximum estimated Q-value (Line 1). After seeing the next state, the algorithm makes a stochastic update of Q_t based on the Bellman equation, importantly with an extra bonus term b_τ and a carefully chosen step size α_τ (Eq. (2.2)).
Here, τ is the number of times the current state-action pair has been visited, and the bonus term b_τ scales as O(√(H/τ)), which encourages exploration since it shrinks every time a state-action pair is executed. The choice of the step size α_τ is also crucial, as pointed out in Jin et al. (2018), and determines a certain effective period of the history for the current update.

While the algorithmic idea is similar to Dong et al. (2019), we emphasize that our analysis is different and novel:
• First, Dong et al. (2019) analyze the sample complexity of their algorithm while we analyze the regret.
• Second, we need to deal with the approximation effect due to the difference between the discounted MDP and the original undiscounted one (Lemma 2.1).
• Finally, part of our analysis improves over that of Dong et al. (2019) (specifically our Lemma 2.2). Following the original analysis of Dong et al. (2019) would lead to a worse bound here.

We now state the main regret guarantee of Algorithm 1.

Theorem 2.1. If the MDP is weakly communicating, Algorithm 1 with

H = min{ √( sp(v*) T / (SA) ), ( T / (SA ln(4T/δ)) )^{1/3} }

ensures that with probability at least 1 − δ, R_T is of order

O( √( sp(v*) SAT ) + sp(v*) T^{2/3} ( SA ln(T/δ) )^{1/3} + √( T ln(1/δ) ) ).

Our regret bound scales as Õ(T^{2/3}) and is suboptimal compared to model-based approaches with Õ(√T) regret (such as UCRL2) that match the information-theoretic lower bound (Jaksch et al., 2010). However, this is the first model-free algorithm with sub-linear regret (under only the weakly communicating condition), and how to achieve Õ(√T) regret via model-free algorithms remains unknown. Also note that our bound depends on sp(v*) instead of the potentially much larger diameter of the MDP. To our knowledge, existing approaches that achieve sp(v*) dependence are all model-based (Bartlett and Tewari, 2009; Ouyang et al., 2017b; Fruit et al., 2018b) and use very different arguments.
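To make the update rule concrete, Algorithm 1 can be sketched in a few lines of Python on a simulated MDP. The simulator interface (a transition tensor P and reward matrix r), the horizon, and the parameter values are illustrative placeholders, not part of the thesis's specification:

```python
import numpy as np

def optimistic_q_learning(P, r, T, H, sp_v, delta=0.1, seed=0):
    """Sketch of Algorithm 1 (Optimistic Q-learning) on a known simulator.

    P: transition tensor of shape (S, A, S); r: rewards of shape (S, A) in [0, 1].
    H sets the discount gamma = 1 - 1/H; sp_v is (an upper bound on) sp(v*).
    """
    rng = np.random.default_rng(seed)
    S, A = r.shape
    gamma = 1.0 - 1.0 / H
    Q = np.full((S, A), float(H))        # Q_t, initialized at H
    Q_hat = np.full((S, A), float(H))    # clipped estimate \hat{Q}_t
    n = np.zeros((S, A), dtype=int)      # visit counts
    s = 0
    total_reward = 0.0
    for t in range(T):
        a = int(np.argmax(Q_hat[s]))                 # greedy in \hat{Q}_t
        s_next = int(rng.choice(S, p=P[s, a]))
        n[s, a] += 1
        tau = n[s, a]
        alpha = (H + 1) / (H + tau)                  # step size alpha_tau
        b = 4 * sp_v * np.sqrt(H / tau * np.log(2 * T / delta))  # bonus b_tau
        V_hat_next = Q_hat[s_next].max()
        Q[s, a] = (1 - alpha) * Q[s, a] + alpha * (r[s, a] + gamma * V_hat_next + b)
        Q_hat[s, a] = min(Q_hat[s, a], Q[s, a])      # \hat{Q}_t is non-increasing
        total_reward += r[s, a]
        s = s_next
    return Q_hat, total_reward
```

Because Q̂_t is only ever clipped downward, it never exceeds its initial value H; this monotonicity is exactly what the proof of Lemma 2.2 uses to bound how much Q̂_t can change over time.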
2.5 Analysis

The proof starts by decomposing the regret as

R_T = Σ_{t=1}^{T} (J* − r(s_t, a_t))
    = Σ_{t=1}^{T} (J* − (1 − γ)V*(s_t)) + Σ_{t=1}^{T} (V*(s_t) − Q*(s_t, a_t)) + Σ_{t=1}^{T} (Q*(s_t, a_t) − γV*(s_t) − r(s_t, a_t)).

Each of these three terms is handled through Lemmas 2.1, 2.2, and 2.3, whose proofs are deferred to the appendix. Plugging in γ = 1 − 1/H and picking the optimal H finishes the proof. One can see that the Õ(T^{2/3}) regret comes from the bound T/H on the first term and the bound √(HT) on the second.

Lemma 2.1. The optimal value function V* of the discounted MDP satisfies
1. |J* − (1 − γ)V*(s)| ≤ (1 − γ) sp(v*) for all s ∈ S,
2. sp(V*) ≤ 2 sp(v*).

This lemma shows that the difference between the optimal value in the discounted setting (scaled by 1 − γ) and that of the undiscounted setting is small as long as γ is close to 1. The proof combines the Bellman equations of the two settings with direct calculation.

Lemma 2.2. With probability at least 1 − δ, we have

Σ_{t=1}^{T} (V*(s_t) − Q*(s_t, a_t)) ≤ 4HSA + 24 sp(v*) √( HSAT ln(2T/δ) ).

This lemma is one of our key technical contributions. To prove it, one can write

Σ_{t=1}^{T} (V*(s_t) − Q*(s_t, a_t)) = Σ_{t=1}^{T} (V*(s_t) − V̂_t(s_t)) + Σ_{t=1}^{T} (Q̂_t(s_t, a_t) − Q*(s_t, a_t)),

using the fact that V̂_t(s_t) = Q̂_t(s_t, a_t) by the greedy policy. The main part of the proof is to show that the second summation can in fact be bounded by Σ_{t=2}^{T+1} (V̂_t(s_t) − V*(s_t)) plus a small sub-linear term, which cancels with the first summation.

Lemma 2.3. With probability at least 1 − δ,

Σ_{t=1}^{T} (Q*(s_t, a_t) − γV*(s_t) − r(s_t, a_t)) ≤ 2 sp(v*) √( 2T ln(1/δ) ) + 2 sp(v*).

This lemma is proven via the Bellman equation for the discounted setting and Azuma's inequality.

2.6 Experiments

In this section, we compare the performance of our proposed algorithm with previous model-free algorithms. We note that model-based algorithms (UCRL2, PSRL, . . .
) typically have better performance in terms of regret but require more memory. For a fair comparison, we restrict our attention to model-free algorithms.

Two environments are considered: a randomly generated MDP and JumpRiverSwim. Both environments consist of 6 states and 2 actions. The reward function and the transition kernel of the random MDP are chosen uniformly at random. The JumpRiverSwim environment is a modification of the RiverSwim environment (Strehl and Littman, 2008; Ouyang et al., 2017a) with a small probability of jumping to an arbitrary state at each time step. The standard RiverSwim models a swimmer who can choose to swim either left or right in a river. The states are arranged in a chain and the swimmer starts from the leftmost state (s = 1). If the swimmer chooses to swim left, i.e., in the direction of the river current, he is always successful. If he chooses to swim right, he may fail with a certain probability. The reward function is: r(1, left) = 0.2, r(6, right) = 1, and r(s, a) = 0 for all other states and actions. The optimal policy is to always swim right to gain the maximum reward of state s = 6. The standard RiverSwim is not an ergodic MDP and does not satisfy the assumption of the MDP-OOMD algorithm. To handle this issue, we consider the JumpRiverSwim

Figure 2.1: Performance of model-free algorithms on the random MDP (left) and JumpRiverSwim (right). The standard Q-learning algorithm with ε-greedy exploration suffers from linear regret. The Optimistic Q-learning algorithm achieves sub-linear regret. The shaded area denotes the standard deviation of regret over multiple runs. [Each panel plots cumulative regret against time steps up to 5 × 10^6 for Optimistic Q-learning, Q-learning with ε-greedy, MDP-OOMD, and Politex.]

Table 2.2: Hyper parameters used in the experiments.
These hyper parameters are optimized to produce the best possible result for each algorithm. All experiments are averaged over 10 independent runs for a horizon of 5 × 10^6. For the Politex algorithm, τ and τ′ are the lengths of the two stages defined in Figure 3 of Abbasi-Yadkori et al. (2019a).

Random MDP:
  Q-learning with ε-greedy:  ε = 0.05
  Optimistic Q-learning:     H = 100, c = 1, b_τ = c√(H/τ)
  MDP-OOMD:                  N = 2, B = 4, η = 0.01
  Politex:                   τ = 1000, τ′ = 1000, η = 0.2

JumpRiverSwim:
  Q-learning with ε-greedy:  ε = 0.03
  Optimistic Q-learning:     H = 100, c = 1, b_τ = c√(H/τ)
  MDP-OOMD:                  N = 10, B = 30, η = 0.01
  Politex:                   τ = 3000, τ′ = 3000, η = 0.2

environment, which has a small probability 0.01 of moving to an arbitrary state at each time step. This small modification yields an ergodic environment.

We compare our algorithm with three benchmark model-free algorithms. The first benchmark is standard Q-learning with ε-greedy exploration. Figure 2.1 shows that this algorithm suffers from linear regret, indicating that naive ε-greedy exploration is not efficient. The second benchmark is the Politex algorithm by Abbasi-Yadkori et al. (2019a); the implementation of Politex is based on the variant designed for the tabular case, presented in their Appendix F and Figure 3. The third benchmark is the MDP-OOMD algorithm (Wei et al., 2019). Politex usually requires a longer episode length than MDP-OOMD (see Table 3.1) because in each episode it needs to accurately estimate the Q-function, rather than merely obtain an unbiased estimator of it as in MDP-OOMD. Figure 2.1 shows that the proposed Optimistic Q-learning, MDP-OOMD, and Politex algorithms all achieve similar performance in the RandomMDP environment. In the JumpRiverSwim environment, the Optimistic Q-learning algorithm outperforms the other three algorithms.
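For reference, the JumpRiverSwim kernel described above can be constructed as follows. The rewards and the 0.01 jump probability are as stated in the text; the right-swim success/stay/fail split is an illustrative assumption, since the text only says that swimming right may fail:

```python
import numpy as np

def jump_river_swim(n_states=6, p_jump=0.01):
    """Build (P, r) for JumpRiverSwim: RiverSwim mixed with a uniform jump.

    Action 0 = left (always succeeds), action 1 = right (may fail).
    The 0.6 / 0.35 / 0.05 right-swim split is illustrative, not from the thesis.
    """
    S = n_states
    P = np.zeros((S, 2, S))
    r = np.zeros((S, 2))
    r[0, 0] = 0.2          # r(1, left) = 0.2
    r[S - 1, 1] = 1.0      # r(6, right) = 1
    for s in range(S):
        P[s, 0, max(s - 1, 0)] = 1.0          # left always succeeds
        P[s, 1, min(s + 1, S - 1)] += 0.6     # right: succeed
        P[s, 1, s] += 0.35                    # right: stay
        P[s, 1, max(s - 1, 0)] += 0.05        # right: pushed back
    # Mix in a uniform jump: with prob. p_jump, move to a uniformly random state.
    P = (1 - p_jump) * P + p_jump / S
    return P, r
```

After the uniform mixing, every transition probability is strictly positive, which is exactly the ergodicity repair motivating JumpRiverSwim.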
Although the regret upper bound for Optimistic Q-learning scales as e O(T 2/3 ) (Theorem 2.1), which is worse than that of MDP-OOMD, Figure 2.1 suggests that in the environments that lack good mix- ing properties, Optimistic Q-learning algorithm may perform better. The detail of the experiments is listed in Table 3.1. 13 Appendices In this section, we provide detailed proof for the lemmas used in Section 2.4. Recall that the learning rate α τ = H+1 H+τ is similar to the one used by Jin et al. (2018). For notational convenience, let α 0 τ := τ Y j=1 (1−α j ), α i τ :=α i τ Y j=i+1 (1−α j ). (2.3) It can be verified that α 0 τ = 0 for τ≥ 1 and we define α 0 0 = 1. These quantities are used in the proof of Lemma 2.2 and have some nice properties summarized in the following lemma. Lemma 2.4 (Jin et al. (2018)). The following properties hold for α i τ : 1. 1 √ τ ≤ P τ i=1 α i τ √ i ≤ 2 √ τ for every τ≥ 1. 2. P τ i=1 (α i τ ) 2 ≤ 2H τ for every τ≥ 1. 3. P τ i=1 α i τ = 1 for every τ≥ 1 and P ∞ τ=i α i τ = 1 + 1 H for every i≥ 1. Also recall the well-known Azuma’s inequality: Lemma 2.5 (Azuma’s inequality). LetX 1 ,X 2 ,··· be a martingale difference sequence with |X i |≤c i for all i. Then, for any 0<δ< 1, P T X i=1 X i ≥ r 2¯ c 2 T ln 1 δ ! ≤δ, where ¯ c 2 T := P T i=1 c 2 i . 2.A Proof of Lemma 2.1 Lemma 2.1 (Restated). Let V ∗ be the optimal value function in the discounted MDP with discount factor γ and v ∗ be the optimal value function in the undiscounted MDP. Then, 14 1. |J ∗ − (1−γ)V ∗ (s)|≤ (1−γ)sp(v ∗ ),∀s∈S, 2. sp(V ∗ )≤ 2sp(v ∗ ). Proof. 1. Let π ∗ and π γ be the optimal policy under undiscounted and discounted set- tings, respectively. By Bellman’s equation, we have v ∗ (s) =r(s,π ∗ (s))−J ∗ +E s 0 ∼p(·|s,π ∗ (s)) v ∗ (s 0 ). Consider a state sequence s 1 ,s 2 ,··· generated by π ∗ . 
Then, by sub-optimality of π ∗ for the discounted setting, we have V ∗ (s 1 )≥E " ∞ X t=1 γ t−1 r(s t ,π ∗ (s t )) s 1 # =E " ∞ X t=1 γ t−1 (J ∗ +v ∗ (s t )−v ∗ (s t+1 )) s 1 # = J ∗ 1−γ +v ∗ (s 1 )−E " ∞ X t=2 (γ t−2 −γ t−1 )v ∗ (s t ) s 1 # ≥ J ∗ 1−γ + min s v ∗ (s)− max s v ∗ (s) ∞ X t=2 (γ t−2 −γ t−1 ) = J ∗ 1−γ − sp(v ∗ ), where the first equality is by the Bellman equation for the undiscounted setting. Similarly, for the other direction, let s 1 ,s 2 ,··· be generated by π γ . We have V ∗ (s 1 ) =E " ∞ X t=1 γ t−1 r(s t ,π γ (s t )) s 1 # ≤E " ∞ X t=1 γ t−1 (J ∗ +v ∗ (s t )−v ∗ (s t+1 )) s 1 # = J ∗ 1−γ +v ∗ (s 1 )−E " ∞ X t=2 (γ t−2 −γ t−1 )v ∗ (s t ) s 1 # ≤ J ∗ 1−γ + max s v ∗ (s)− min s v ∗ (s) ∞ X t=2 (γ t−2 −γ t−1 ) = J ∗ 1−γ + sp(v ∗ ), where the first inequality is by sub-optimality of π γ for the undiscounted setting. 2. Using previous part, for any s 1 ,s 2 ∈S, we have |V ∗ (s 1 )−V ∗ (s 2 )|≤ V ∗ (s 1 )− J ∗ 1−γ + V ∗ (s 2 )− J ∗ 1−γ ≤ 2sp(v ∗ ). 15 Thus, sp(V ∗ )≤ 2sp(v ∗ ). 2.B Proof of Lemma 2.2 Lemma 2.2 (Restated). With probability at least 1−δ, T X t=1 (V ∗ (s t )−Q ∗ (s t ,a t ))≤ 4HSA + 24sp(v ∗ ) s HSAT ln 2T δ . Proof. We condition on the statement of Lemma 2.6, which happens with probability at least 1−δ. Let n t ≥ 1 denote n t+1 (s t ,a t ), that is, the total number of visits to the state- action pair (s t ,a t ) for the first t rounds (including round t). Also let t i (s,a) denote the timestep at which (s,a) is visited thei-th time. Recalling the definition ofα i nt in Eq. (2.3), we have T X t=1 ˆ V t (s t )−V ∗ (s t ) + T X t=1 (V ∗ (s t )−Q ∗ (s t ,a t )) (2.4) = T X t=1 ˆ Q t (s t ,a t )−Q ∗ (s t ,a t ) (because a t = arg max a ˆ Q t (s t ,a)) = T X t=1 ˆ Q t+1 (s t ,a t )−Q ∗ (s t ,a t ) + T X t=1 ˆ Q t (s t ,a t )− ˆ Q t+1 (s t ,a t ) (2.5) ≤ 12sp(v ∗ ) T X t=1 s H n t ln 2T δ +γ T X t=1 nt X i=1 α i nt h ˆ V t i (st,at) (s t i (st,at)+1 )−V ∗ (s t i (st,at)+1 ) i +SAH. 
(2.6) Here, we apply Lemma 2.6 to bound the first term of Eq .(2.5) (note α 0 nt = 0 by definition since n t ≥ 1), and also bound the second term of Eq .(2.5) by SAH since for each fixed (s,a), ˆ Q t (s,a) is non-increasing int and overall cannot decrease by more thanH (the initial value). To bound the third term of Eq. (2.6) we write: γ T X t=1 nt X i=1 α i nt h ˆ V t i (st,at) (s t i (st,at)+1 )−V ∗ (s t i (st,at)+1 ) i =γ T X t=1 X s,a 1 [st=s,at=a] n t+1 (s,a) X i=1 α i n t+1 (s,a) h ˆ V t i (s,a) (s t i (s,a)+1 )−V ∗ (s t i (s,a)+1 ) i =γ X s,a n T+1 (s,a) X j=1 j X i=1 α i j h ˆ V t i (s,a) (s t i (s,a)+1 )−V ∗ (s t i (s,a)+1 ) i . 16 By changing the order of summation on i and j, the latter is equal to γ X s,a n T+1 (s,a) X i=1 n T+1 (s,a) X j=i α i j h ˆ V t i (s,a) (s t i (s,a)+1 )−V ∗ (s t i (s,a)+1 ) i =γ X s,a n T+1 (s,a) X i=1 h ˆ V t i (s,a) (s t i (s,a)+1 )−V ∗ (s t i (s,a)+1 ) i n T+1 (s,a) X j=i α i j Now, we can upper bound P n T+1 (s,a) j=i α i j by P ∞ j=i α i j where the latter is equal to 1 + 1 H by Lemma 2.4. Since ˆ V t i (s,a) (s t i (s,a)+1 )−V ∗ (s t i (s,a)+1 )≥ 0 (by Lemma 2.6), we can write: γ X s,a n T+1 (s,a) X i=1 h ˆ V t i (s,a) (s t i (s,a)+1 )−V ∗ (s t i (s,a)+1 ) i n T+1 (s,a) X j=i α i j ≤γ X s,a n T+1 (s,a) X i=1 h ˆ V t i (s,a) (s t i (s,a)+1 )−V ∗ (s t i (s,a)+1 ) i ∞ X j=i α i j =γ X s,a n T+1 (s,a) X i=1 h ˆ V t i (s,a) (s t i (s,a)+1 )−V ∗ (s t i (s,a)+1 ) i 1 + 1 H = 1 + 1 H γ T X t=1 h ˆ V t (s t+1 )−V ∗ (s t+1 ) i = 1 + 1 H γ T X t=1 h ˆ V t+1 (s t+1 )−V ∗ (s t+1 ) i + 1 + 1 H T X t=1 h ˆ V t (s t+1 )− ˆ V t+1 (s t+1 ) i ≤ T +1 X t=2 h ˆ V t (s t )−V ∗ (s t ) i + 1 + 1 H SH. The last inequality is because 1 + 1 H γ ≤ 1 and that for any state s, ˆ V t (s)≥ ˆ V t+1 (s) and the value can decrease by at most H (the initial value). Substituting in Eq. 
(2.6) and telescoping with the left hand side, we have T X t=1 (V ∗ (s t )−Q ∗ (s t ,a t ))≤ 12sp(v ∗ ) T X t=1 s H n t ln 2T δ + ˆ V T +1 (s T +1 )−V ∗ (s T +1 ) + 1 + 1 H SH +SAH ≤ 12sp(v ∗ ) T X t=1 s H n t ln 2T δ + 4SAH. 17 Moreover, P T t=1 1 √ nt ≤ 2 √ SAT because T X t=1 1 p n t+1 (s t ,a t ) = T X t=1 X s,a 1 [st=s,at=a] p n t+1 (s,a) = X s,a n T+1 (s,a) X j=1 1 √ j ≤ X s,a 2 q n T +1 (s,a)≤ 2 s SA X s,a n T +1 (s,a) = 2 √ SAT, where the last inequality is by Cauchy-Schwarz inequality. This finishes the proof. Lemma 2.6. With probability at least 1−δ, for any t = 1,...,T and state-action pair (s,a), the following holds 0≤ ˆ Q t+1 (s,a)−Q ∗ (s,a)≤Hα 0 τ +γ τ X i=1 α i τ h ˆ V t i (s t i +1 )−V ∗ (s t i +1 ) i + 12sp(v ∗ ) s H τ ln 2T δ , where τ = n t+1 (s,a) (i.e., the total number of visits to (s,a) for the first t timesteps), α i τ is defined by (2.3), and t 1 ,...,t τ ≤t are the timesteps on which (s,a) is taken. Proof. Recursively substituting Q t (s,a) in Eq. (2.2) of the algorithm, we have Q t+1 (s,a) =Hα 0 τ + τ X i=1 α i τ h r(s,a) +γ ˆ V t i (s t i +1 ) i + τ X i=1 α i τ b i . Moreover, since P τ i=1 α i τ = 1 (Lemma 2.4), By Bellman equation we have Q ∗ (s,a) =α 0 τ Q ∗ (s,a) + τ X i=1 α i τ h r(s,a) +γE s 0 ∼p(·|s,a) V ∗ (s 0 ) i . Taking their difference and adding and subtracting a term γ P τ i=1 α i τ V ∗ (s t i +1 ) lead to: Q t+1 (s,a)−Q ∗ (s,a) =α 0 τ (H−Q ∗ (s,a)) +γ τ X i=1 α i τ h ˆ V t i (s t i +1 )−V ∗ (s t i +1 ) i +γ τ X i=1 α i τ h V ∗ (s t i +1 )−E s 0 ∼p(·|s,a) V ∗ (s 0 ) i + τ X i=1 α i τ b i . The first term is upper bounded by α 0 τ H clearly and lower bounded by 0 since Q ∗ (s,a)≤ P ∞ i=0 γ i = 1 1−γ =H. The third term is a martingale difference sequence where absolute value of each term is bounded by γα i τ sp(V ∗ ). 
Therefore, by Azuma’s inequality (Lemma 2.5), its absolute value is bounded by γsp(V ∗ ) v u u t 2 τ X i=1 (α i τ ) 2 ln 2T δ ≤ 2γsp(V ∗ ) s H τ ln 2T δ ≤ 4γsp(v ∗ ) s H τ ln 2T δ 18 with probability at least 1− δ T , where the first inequality is by Lemma 2.4 and the last inequality is by Lemma 2.1. Note that when t varies from 1 to T and (s,a) varies over all possible state-action pairs, the third term only takes T different forms. Therefore, by taking a union bound over these T events, we have: with probability 1−δ, the third term is bounded by 4γsp(v ∗ ) q H τ ln 2T δ in absolute value for all t and (s,a). The forth term is lower bounded by 4sp(v ∗ ) q H τ ln 2T δ and upper bounded by 8sp(V ∗ ) q H τ ln 2T δ , by Lemma 2.4. Combining all aforementioned upper bounds and that ˆ Q t+1 (s,a) = min n ˆ Q t (s,a),Q t+1 (s,a) o ≤ Q t+1 (s,a) we prove the upper bound in the lemma statement. To prove the lower bound, further note that the second term can be written as γ τ X i=1 α i τ h max a ˆ Q t i (s t i +1 ,a)− max a Q ∗ (s t i +1 ,a) i . Using a direct induction with all aforementioned lower bounds and the fact ˆ Q t+1 (s,a) = min n ˆ Q t (s,a),Q t+1 (s,a) o we prove the lower bound in the lemma statement as well. 2.C Proof of Lemma 2.3 Lemma 2.3 (Restated). With probability at least 1−δ, T X t=1 (Q ∗ (s t ,a t )−γV ∗ (s t )−r(s t ,a t ))≤ 2sp(v ∗ ) r 2T ln 1 δ + 2sp(v ∗ ). Proof. By Bellman equation for the discounted problem, we have Q ∗ (s t ,a t )−γV ∗ (s t )− r(s t ,a t ) = γ E s 0 ∼p(·|st,at) [V ∗ (s 0 )]−V ∗ (s t ) . Adding and subtracting V ∗ (s t+1 ) and sum- ming over t we will get T X t=1 (Q ∗ (s t ,a t )−γV ∗ (s t )−r(s t ,a t )) =γ T X t=1 E s 0 ∼p(·|st,at) [V ∗ (s 0 )]−V ∗ (s t+1 ) +γ T X t=1 (V ∗ (s t+1 )−V ∗ (s t )) The summands of the first term on the right hand side constitute a martingale difference sequence. 
Thus, by Azuma's inequality (Lemma 2.5) and the fact that sp(V*) ≤ 2 sp(v*) (Lemma 2.1), this term is upper bounded by 2γ sp(v*) √(2T ln(1/δ)) with probability at least 1 − δ. The second term is equal to γ(V*(s_{T+1}) − V*(s_1)), which is upper bounded by 2γ sp(v*). Recalling γ < 1 completes the proof.

Chapter 3

A Model-free Learning Algorithm for Infinite-horizon Average-reward Markov Decision Processes with Near-optimal Regret

3.1 Introduction

In the previous chapter, we provided a model-free algorithm that obtains a regret bound of Õ(T^{2/3}) for infinite-horizon average-reward MDPs. However, there is a gap between this bound and the worst-case lower bound of Ω(√(DSAT)) (Jaksch et al., 2010). While model-based algorithms have been able to close this gap (Zhang and Ji, 2019), model-free algorithms have only achieved Õ(√T) under an ergodicity assumption (Wei et al., 2019). It is still an open question whether model-free algorithms can achieve near-optimal regret bounds for the more general class of weakly communicating MDPs. This is desirable since model-free algorithms are faster, need less storage, and are more amenable to extension to continuous state spaces. In this chapter, we take the first steps in this direction. We propose Exploration Enhanced Q-learning (EE-QL), which achieves a high-probability regret bound of Õ((sp(v*) + B)√(SAT)), where B is an upper bound on the estimates of the Q-value function, sp(v*) is the span of the bias function in the Bellman equation, and S, A are the state and action space sizes, respectively.¹ Unlike the original Optimistic Q-learning algorithm proposed in the previous chapter, this version of the algorithm does not use the discounted setting to approximate the average-reward scenario. Instead, it directly estimates the Q-value function in the average-reward setting. In the analysis of this algorithm, we make two assumptions: first, we assume that the estimates of the Q-value function are uniformly bounded.
Second, we assume that an estimate J_t of J* (the long-term average reward of the optimal policy) is available whose error diminishes at the rate 1/√t. The analysis without these assumptions is left for future work.

¹ Note that sp(v*) ≤ D and in most cases sp(v*) ≪ D.

EE-QL (read "equal") uses stochastic approximation to estimate the Q-value function, assuming that a concentrating estimate of the optimal gain is available. The key idea of this algorithm is the careful design of the learning rate to efficiently balance the effect of new and old observations. In contrast to the typical learning rate of 1/τ (where τ is the number of visits to the corresponding state-action pair) in standard Q-learning-type algorithms, the proposed EE-QL algorithm uses the learning rate 1/√τ. This learning rate provides nice properties (listed in Lemma 3.2) that are central to our analysis. In addition, experiments show that EE-QL significantly outperforms the existing model-free algorithms and performs similarly to the best model-based algorithms. This is due to the fact that, unlike previous model-free algorithms in the tabular setting that optimistically estimate each entry of the optimal Q-value function (Jin et al., 2018; Dong et al., 2019; Wei et al., 2019), EE-QL optimistically estimates a single scalar (the optimal gain), avoiding unnecessary optimism. For the preliminaries and the related work, please refer to the previous chapter.

3.2 The Exploration Enhanced Q-learning Algorithm

In this section, we introduce the Exploration Enhanced Q-learning (EE-QL) algorithm (see Algorithm 2). The algorithm works for the broad class of weakly communicating MDPs. It is well known that the weakly communicating condition is necessary to achieve sublinear regret (Bartlett and Tewari, 2009). EE-QL approximates the Q-value function for the infinite-horizon average-reward setting using stochastic approximation with carefully chosen learning rates.
The algorithm takes greedy actions with respect to the current estimate Q_t. After visiting the next state, a stochastic update of Q_t is made based on the Bellman equation. The quantity J_t in the algorithm is an estimate of J* that satisfies the following assumption.

Assumption 3.1 (Concentrating J_t). There exists a constant c ≥ 0 such that |J_t − J*| ≤ c/√t for all t ≥ 1.

In some applications, J* is known a priori. For example, in the infinite-horizon version of Cartpole described in Hao et al. (2020), the optimal policy keeps the pole upright throughout the horizon, which leads to a known J*. In such cases, one can simply set J_t = J*. In applications where J* is not known, one can set J_t = J̃_t + C/√t for some constant C ≥ 0, where J̃_0 = 0 and J̃_t is stochastically updated as J̃_t = (1 − β_t) J̃_{t−1} + β_t r(s_t, a_t) for some decaying learning rate β_t. In particular, β_t = 1/t yields

J_t = (1/t) Σ_{t′=1}^{t} r(s_{t′}, a_{t′}) + C/√t.   (3.1)

Algorithm 2 EE-QL
Initialization: ∀s, a: Q_1(s, a) = 0, n_1(s, a) = 0
Define: ∀τ: α_τ = 1/√τ
for t = 1, ..., T do
  1. Take action a_t = argmax_{a∈A} Q_t(s_t, a) and observe s_{t+1}.
  2. Update:
     n_{t+1}(s_t, a_t) ← n_t(s_t, a_t) + 1
     τ ← n_{t+1}(s_t, a_t)
     Q_{t+1}(s_t, a_t) ← (1 − α_τ) Q_t(s_t, a_t) + α_τ [ r(s_t, a_t) − J_t + max_{b∈A} Q_t(s_{t+1}, b) ]
(All other entries of n_{t+1}, Q_{t+1} remain the same as those in n_t, Q_t.)

We have numerically verified that this choice of J_t with C = 2 satisfies Assumption 3.1 for c ≥ 5 in the RiverSwim and RandomMDP environments (see Section 4.5 for more details). The choice of the learning rate α_τ is particularly important. Choosing α_τ ∝ 1/√τ (rather than 1/τ) efficiently combines new and old observations and provides the properties listed in Lemma 3.2, which play a central role in the analysis. The widely used learning rate of 1/τ in the standard Q-learning algorithm (Abounadi et al., 2001) may not satisfy these properties.
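The effect of this choice can be seen numerically. The weight that the i-th observation carries after τ updates is α^i_τ = α_i ∏_{j=i+1}^{τ} (1 − α_j) (Eq. (3.2) in the analysis). The following sketch, with an arbitrary horizon, checks that both α_i = 1/√i and α_i = 1/i produce weights summing to one, but that 1/√i places far more mass on recent observations:

```python
import numpy as np

def step_weights(alphas):
    """Return w[i] = alpha_i * prod_{j > i} (1 - alpha_j): the weight that
    the i-th observation carries in the estimate after len(alphas) updates."""
    tau = len(alphas)
    w = np.empty(tau)
    for i in range(tau):
        w[i] = alphas[i] * np.prod(1.0 - alphas[i + 1:])
    return w

tau = 200
w_sqrt = step_weights(1.0 / np.sqrt(np.arange(1, tau + 1)))  # alpha_i = 1/sqrt(i)
w_inv = step_weights(1.0 / np.arange(1, tau + 1))            # alpha_i = 1/i

# Both weight sequences form a convex combination of the tau observations...
print(w_sqrt.sum(), w_inv.sum())          # both approximately 1
# ...but 1/sqrt(i) concentrates much more mass on recent observations:
print(w_sqrt[-20:].sum(), w_inv[-20:].sum())
```

With α_i = 1/i the weights are exactly uniform (each observation gets weight 1/τ), whereas with α_i = 1/√i roughly the last O(√τ) observations dominate; this sharper effective window is what the 1/√τ analysis exploits.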
In addition, unlike Q-learning algorithms with UCB exploration (Jin et al., 2018; Dong et al., 2019; Wei et al., 2019), EE-QL does not optimistically estimate the Q-value function. In the case that J_t = J*, the algorithm need not follow the optimism-in-the-face-of-uncertainty principle as in (Jin et al., 2018; Dong et al., 2019; Wei et al., 2019; Jaksch et al., 2010). However, our numerical experiments show that if J* is not known, J_t has to be an optimistic estimate of the average reward as in (3.1). Thus, EE-QL is economical in using optimism: instead of maintaining optimistic confidence intervals around each entry of the Q_t function, our algorithm is optimistic only about the single scalar J*. This leads to significant improvement in numerical performance compared to the literature (see Section 4.5).

We now state the main regret guarantee of Algorithm 2. Our current theoretical analysis works by assuming that the Q_t functions remain bounded. Thus, we impose the following assumption for now and leave the analysis without this assumption for future work.

Assumption 3.2. The Q_t-value functions in the algorithm are uniformly bounded, i.e., there exists B > 0 such that ||Q_t||_∞ ≤ B for all t.

Theorem 3.1. Under Assumptions 3.1 and 3.2, the EE-QL algorithm ensures that with probability at least 1 − δ,

R_T = O( (sp(v*) + B)√(SAT) + (c + sp(v*))√(T ln(1/δ)) ),

where c is defined in Assumption 3.1 and B is defined in Assumption 3.2.

This result improves the previous best known regret bound of Õ(sp(v*)(SA)^{1/3} T^{2/3}) by Wei et al. (2019) and matches the lower bound of Ω(√(DSAT)) (Jaksch et al., 2010) in terms of T up to logarithmic factors. To the best of our knowledge, this is the first model-free algorithm that achieves an Õ(√T) regret bound for the general class of weakly communicating MDPs in the infinite-horizon average-reward setting.

3.3 Analysis

In this section, we provide the proof of Theorem 3.1.
Before we start the analysis, let’s define α i τ :=α i τ Y j=i+1 (1−α j ) (3.2) for i≥ 1, where α i = 1/ √ i is the learning rate used in Algorithm 2. α i τ determines the effect of the i-th step on τ-th update. This quantity has nice properties that are listed in Lemma 3.2 and are central to our analysis. In particular, the √ T regret bound is merely due to properties 2 and 4 in Lemma 3.2. Proof of Theorem 3.1 Proof. We start by decomposing the regret using Lemma 3.5. With probability at least 1−δ, the regret of any algorithm can be bounded by R T ≤ sp(v ∗ ) + sp(v ∗ ) r 1 2 T ln 1 δ + T X t=1 Δ(s t ,a t ), (3.3) where Δ(s,a) :=v ∗ (s)−q ∗ (s,a). Suffices to bound P T t=1 Δ(s t ,a t ). Letn t+1 (s,a) denote the number of visits to state-action pair (s,a) before timet + 1 (including timet and excluding time t + 1). For notational simplicity, let n t+1 :=n t+1 (s t ,a t ) and t i (s,a) be the time step at which (s,a) is visited for the ith time. We can write: T X t=1 h v t (s t )−v ∗ (s t ) + Δ(s t ,a t ) i = T X t=1 h Q t (s t ,a t )−v ∗ (s t ) + Δ(s t ,a t ) i = T X t=1 h Q t (s t ,a t )−q ∗ (s t ,a t ) i = T X t=1 h Q t+1 (s t ,a t )−q ∗ (s t ,a t ) i + T X t=1 h Q t (s t ,a t )−Q t+1 (s t ,a t ) i , (3.4) where the first equality is by the fact thata t = arg max a∈A Q t (s t ,a) and the second equality is by the definition of Δ(s t ,a t ). The second term on the right hand side can be bounded 23 by 2(2B + 1) √ SAT +cSA(1 + lnT ) by using line (3.2) of the Algorithm (see Lemma 3.1) where B if from Assumption 3.2. The rest of the proof proceeds to write the first term on the right hand side in terms of v t+1 (s t+1 )−v ∗ (s t+1 ) (to telescope with the left hand side) plus some sublinear additive terms. 
We can write: T X t=1 h Q t+1 (s t ,a t )−q ∗ (s t ,a t ) i = T X t=1 h Q t+1 (s t ,a t )−q ∗ (s t ,a t ) i 1(n t+1 (s t ,a t )≥ 1) = X s,a T X t=1 1(s t =s,a t =a) h Q t+1 (s,a)−q ∗ (s,a) i 1(n t+1 (s,a)≥ 1) =:R 1 By Lemma 3.4, the term R 1 can be written as: R 1 = X s,a T X t=1 1(s t =s,a t =a) ( n t+1 (s,a) X i=1 α i n t+1 (s,a) [J ∗ −J t i (s,a) ] + n t+1 (s,a) X i=1 α i n t+1 (s,a) [v t i (s,a) (s t i (s,a)+1 )−v ∗ (s t i (s,a)+1 )] + n t+1 (s,a) X i=1 α i n t+1 (s,a) [v ∗ (s t i (s,a)+1 )−E s 0 ∼p(·|s,a) v ∗ (s 0 )] ) = X s,a n T+1 (s,a) X j=1 ( j X i=1 α i j [J ∗ −J t i (s,a) ] + j X i=1 α i j [v t i (s,a) (s t i (s,a)+1 )−v ∗ (s t i (s,a)+1 )] + j X i=1 α i j [v ∗ (s t i (s,a)+1 )−E s 0 ∼p(·|s,a) v ∗ (s 0 )] ) By changing the order of summation on j and i, we can write: R 1 = X s,a n T+1 (s,a) X i=1 ( [J ∗ −J t i (s,a) ] n T+1 (s,a) X j=i α i j + [v t i (s,a) (s t i (s,a)+1 )−v ∗ (s t i (s,a)+1 )] n T+1 (s,a) X j=i α i j + [v ∗ (s t i (s,a)+1 )−E s 0 ∼p(·|s,a) v ∗ (s 0 )] n T+1 (s,a) X j=i α i j ) . We proceed by upper bounding each term in the latter by using Lemma 3.2(3). Note that |J ∗ −J t i (s,a)|≤ c p t i (s,a) ≤ c √ i , − sp(v ∗ )≤v ∗ (s t i (s,a)+1 )−E s 0 ∼p(·|s,a) v ∗ (s 0 )≤ sp(v ∗ ). Moreover, note that v ∗ is unique upto a constant. So, without loss of generality, we choose v ∗ such that max s v ∗ (s) =−B, whereB is the uniform bound onkQ t k ∞ (andkv t k ∞ ) as in Assumption 3.2. This choice of v ∗ implies that 0≤v t (s)−v ∗ (s)≤ 2B + sp(v ∗ ) for all s,t. 24 Replacing these bounds for M + and−M − in Lemma 3.2(3) implies R 1 ≤ X s,a n T+1 (s,a) X i=1 ( J ∗ −J t i (s,a) + 5c 2i + c √ i (1− 1 √ i + 1 ) n T+1 (s,a)−i+1 +v t i (s,a) (s t i (s,a)+1 )−v ∗ (s t i (s,a)+1 ) + 2B + sp(v ∗ ) 5 2 √ i +v ∗ (s t i (s,a)+1 )−E s 0 ∼p(·|s,a) v ∗ (s 0 ) + sp(v ∗ ) 5 2 √ i + (1− 1 √ i + 1 ) n T+1 (s,a)−i+1 ) . 
(3.5) To simplify the right hand side of the above inequality, observe that X s,a n T+1 (s,a) X i=1 (J ∗ −J t i (s,a) ) = X s,a T X t=1 1(s t =s,a t =a)(J ∗ −J t ) = T X t=1 (J ∗ −J t ). (3.6) Similarly, X s,a n T+1 (s,a) X i=1 (v t i (s,a) (s t i (s,a)+1 )−v ∗ (s t i (s,a)+1 )) = T X t=1 (v t (s t+1 )−v ∗ (s t+1 )), (3.7) X s,a n T+1 (s,a) X i=1 (v ∗ (s t i (s,a)+1 )−E s 0 ∼p(·|s,a) v ∗ (s 0 )) = T X t=1 (v ∗ (s t+1 )−E s 0 ∼p(·|st,at) v ∗ (s 0 )). (3.8) Using the inequalities in Lemma 3.3 and Lemma 3.2(5), replacing the equalities (3.6), (3.7), (3.8) into the right hand side of (3.5), and adding and subtracting v t+1 (s t+1 ) implies R 1 ≤ T X t=1 (J ∗ −J t ) + 5cSA 2 (1 + lnT ) + 2 √ 2cSA + T X t=1 (v t+1 (s t+1 )−v ∗ (s t+1 )) + T X t=1 (v t (s t+1 )−v t+1 (s t+1 )) + 5(2B + sp(v ∗ )) √ SAT + T X t=1 (v ∗ (s t+1 )−E s 0 ∼p(·|st,at) [v ∗ (s 0 )]) + 6sp(v ∗ ) √ SAT. (3.9) Note that by Assumption 3.1, P T t=1 |J ∗ −J t |≤ P T t=1 c/ √ t≤ 2c √ T . Furthermore, P T t=1 (v ∗ (s t+1 )− E s 0 ∼p(·|st,at) [v ∗ (s 0 )]) is a martingale difference sequence and can be bounded by sp(v ∗ ) q 1 2 T ln 1/δ with probability at least 1− δ, using Azuma’s inequality. Moreover, P T t=1 (v t (s t+1 )− v t+1 (s t+1 ))≤ 2 √ SAT +cSA(1 + lnT ) by Lemma 3.6. Replacing these bounds on the right hand side of the above inequality, simplifying the result and plugging back into (3.4) implies T X t=1 h v t (s t )−v ∗ (s t ) + Δ(s t ,a t ) i ≤ T X t=1 (v t+1 (s t+1 )−v ∗ (s t+1 )) + (14B + 11sp(v ∗ ) + 4) √ SAT 25 + 2c √ T + sp(v ∗ ) r 1 2 T ln 1 δ + 9cSA 2 lnT + ( 9 2 + 2 √ 2)cSA, with probability at least 1−δ. Telescoping the left hand side with the right hand side and noting that v T +1 (s T +1 )−v 1 (s 1 )≤ 2B (Assumption 3.2) and v ∗ (s 1 )−v ∗ (s T +1 )≤ sp(v ∗ ), implies that T X t=1 Δ(s t ,a t )≤ (14B + 11sp(v ∗ ) + 4) √ SAT + 2c √ T + sp(v ∗ ) r 1 2 T ln 1 δ + 9cSA 2 lnT + ( 9 2 + 2 √ 2)cSA + 2B + sp(v ∗ ), with probability at least 1−δ. 
Replacing this bound into (3.3) implies that
\[
R_T\le \big(14B+11\,\mathrm{sp}(v^*)+4\big)\sqrt{SAT}+2c\sqrt{T}+2\,\mathrm{sp}(v^*)\sqrt{\tfrac12 T\ln\tfrac2\delta}+\frac{9cSA}{2}\ln T+\Big(\frac92+2\sqrt2\Big)cSA+2B+2\,\mathrm{sp}(v^*),
\]
with probability at least $1-\delta$, which completes the proof.

Auxiliary Lemmas

In this section, we provide some auxiliary lemmas that are used in the proof of Theorem 3.1. The proofs of these lemmas can be found in the appendix.

Lemma 3.1. The second term of (3.4) can be bounded as
\[
\sum_{t=1}^{T}\big[Q_t(s_t,a_t)-Q_{t+1}(s_t,a_t)\big]\le 2(2B+1)\sqrt{SAT}+cSA(1+\ln T).
\]

Lemma 3.2. The following properties hold:
1. $\sum_{i=1}^{\tau}\alpha^i_\tau=1$ for any $\tau\ge 1$.
2. For any $i\ge 1$ and any $K\ge i$, we have $1-\big(1-\frac{1}{\sqrt{i+1}}\big)^{K-i+1}\le \sum_{\tau=i}^{K}\alpha^i_\tau\le 1+\frac{5}{2\sqrt{i}}$.
3. Let $M$ be a scalar and define $M^+=\max(M,0)$ and $M^-=\max(-M,0)$. Then, for any $i\ge 1$ and any $K\ge i$, we have $M\sum_{\tau=i}^{K}\alpha^i_\tau\le M+\frac{5M^+}{2\sqrt{i}}+M^-\big(1-\frac{1}{\sqrt{i+1}}\big)^{K-i+1}$.
4. For any $K\ge 0$, we have $\sum_{i=1}^{K}\big(1-\frac{1}{\sqrt{i+1}}\big)^{K-i+1}\le \sqrt{K}$.
5. For any $K\ge 1$, we have $\sum_{i=1}^{K}\frac{1}{\sqrt{i}}\big(1-\frac{1}{\sqrt{i+1}}\big)^{K-i+1}\le 2\sqrt{2}$.

Lemma 3.3 (Frequently used inequalities). The following inequalities hold:
1. $\sum_{t=1}^{T}\frac{1}{\sqrt{n_{t+1}(s_t,a_t)}}=\sum_{s,a}\sum_{i=1}^{n_{T+1}(s,a)}\frac{1}{\sqrt{i}}\le 2\sqrt{SAT}$.
2. $\sum_{s,a}\sum_{i=1}^{n_{T+1}(s,a)}\big(1-\frac{1}{\sqrt{i+1}}\big)^{n_{T+1}(s,a)-i+1}\le \sqrt{SAT}$.
3. $\sum_{t=1}^{T}\frac{1}{\sqrt{t\,n_{t+1}(s_t,a_t)}}\le SA(1+\ln T)$.

Lemma 3.4. For a fixed $(s,a)\in\mathcal{S}\times\mathcal{A}$, let $\tau=n_t(s,a)$, and let $t_i$ be the time step at which $(s,a)$ is taken for the $i$th time. Then,
\[
\big(Q_t(s,a)-q^*(s,a)\big)\mathbb{1}(\tau\ge 1)=\sum_{i=1}^{\tau}\alpha^i_\tau\big[J^*-J_{t_i}\big]+\sum_{i=1}^{\tau}\alpha^i_\tau\big[v_{t_i}(s_{t_i+1})-v^*(s_{t_i+1})\big]+\sum_{i=1}^{\tau}\alpha^i_\tau\big[v^*(s_{t_i+1})-\mathbb{E}_{s'\sim p(\cdot|s,a)}v^*(s')\big].
\]

Lemma 3.5. With probability at least $1-\delta$, the regret of any algorithm is bounded as
\[
R_T\le \mathrm{sp}(v^*)+\mathrm{sp}(v^*)\sqrt{\tfrac12 T\ln\tfrac1\delta}+\sum_{t=1}^{T}\big[v^*(s_t)-q^*(s_t,a_t)\big].
\]

Lemma 3.6. $\sum_{t=1}^{T}\big[v_t(s_{t+1})-v_{t+1}(s_{t+1})\big]\le 2\sqrt{SAT}+cSA(1+\ln T)$.

3.4 Experiments

In this section, we numerically evaluate the performance of our proposed EE-QL algorithm. Two environments are considered: RandomMDP and RiverSwim.
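The RiverSwim dynamics described next can be written down as a small transition/reward model. The text only says that swimming right "may fail with some probability", so the exact slip probabilities in this sketch are assumptions borrowed from common RiverSwim implementations, not the values used in the dissertation's experiments.

```python
import numpy as np

def make_riverswim(S=6, p_fail=0.6, p_stay=0.35):
    """Build a RiverSwim instance: transition tensor P[a, s, s'] and
    reward table r[s, a].

    Actions: 0 = left (swim with the current, always succeeds),
             1 = right (against the current, may fail).
    p_fail / p_stay are assumed values; the text does not specify them.
    """
    P = np.zeros((2, S, S))
    r = np.zeros((S, 2))
    for s in range(S):
        # Left: deterministic move toward the leftmost state.
        P[0, s, max(s - 1, 0)] = 1.0
        # Right: succeed, stay in place, or get pushed back.
        P[1, s, min(s + 1, S - 1)] += 1.0 - p_fail
        P[1, s, s] += p_stay
        P[1, s, max(s - 1, 0)] += p_fail - p_stay
    r[0, 0] = 0.2      # r(1, left) = 0.2 in the text's 1-based indexing
    r[S - 1, 1] = 1.0  # r(6, right) = 1
    return P, r
```

With these rewards, always swimming right is optimal for the average-reward criterion, matching the description below.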
The RandomMDP environment is an ergodic MDP with $S=6$ states and $A=2$ actions, where the transition kernel and the rewards are chosen uniformly at random. The RiverSwim environment is a weakly communicating MDP with $S=6$ states arranged in a chain and $A=2$ actions (left and right) that simulates an agent swimming in a river. If the agent swims left (i.e., in the direction of the river current), it always succeeds. If it decides to swim right, it may fail with some probability. The reward function is as follows: $r(1,\text{left})=0.2$, $r(6,\text{right})=1$, and $r(s,a)=0$ for all other state-action pairs. The agent starts from the leftmost state ($s_1=1$). The optimal policy is to always swim right to reach the high-reward state $s=6$.

We compare our algorithm against Optimistic QL (Wei et al., 2019), MDP-OOMD (Wei et al., 2019), and Politex (Abbasi-Yadkori et al., 2019a) as model-free algorithms, and against UCRL2 (Jaksch et al., 2010) and PSRL (Ouyang et al., 2017b) as model-based benchmarks. The hyperparameters for these algorithms are tuned to obtain the best performance (see Table 3.1 for more details). $J_t$ is chosen as in (3.1) with an appropriate $C$ (see Table 3.1). We numerically verified that this choice of $J_t$ satisfies Assumption 3.1 with $c=5$.

Figure 3.1 shows that in the RiverSwim environment, our algorithm significantly outperforms Optimistic QL, the only existing model-free algorithm with low regret for weakly communicating MDPs. The reason is that the proposed algorithm does not spend optimism on the entire $Q$-function; rather, the optimism-in-the-face-of-uncertainty principle is applied only to a single scalar, $J^*$. Note that other model-free algorithms such as Politex and MDP-OOMD did not yield sub-linear regret in RiverSwim and are thus removed from the figure. This is because RiverSwim does not satisfy the ergodicity assumption required by these algorithms. Moreover, in both the RiverSwim and RandomMDP environments, our algorithm performs as well as the best existing model-based algorithms in practice, though with less memory.

Table 3.1: The hyperparameters used in the algorithms, optimized to obtain the best performance of each algorithm. We simulate 10 independent Monte Carlo runs over a horizon of $T=5\times 10^6$ steps. For the UCRL2 algorithm, $C$ is a coefficient that scales the confidence interval. For the Politex algorithm, $\tau$ and $\tau'$ are the lengths of the two stages defined in Figure 3 of Abbasi-Yadkori et al. (2019a).

RandomMDP:
- EE-QL: $J_t=\frac1t\sum_{t'=1}^{t} r(s_{t'},a_{t'})+1.2/\sqrt{t}$
- Optimistic Q-learning: $H=2$, $c=0.1$, $b_\tau=c\sqrt{H/\tau}$
- MDP-OOMD: $N=2$, $B=4$, $\eta=0.01$
- Politex: $\tau=1000$, $\tau'=1000$, $\eta=0.2$
- UCRL2: $C=0.1$
- PSRL: Dirichlet prior with parameters $[0.1,\cdots,0.1]$

RiverSwim:
- EE-QL: $J_t=\frac1t\sum_{t'=1}^{t} r(s_{t'},a_{t'})+2/\sqrt{t}$
- Optimistic Q-learning: $H=1000$, $c=1$, $b_\tau=c\sqrt{H/\tau}$
- UCRL2: $C=0.1$
- PSRL: Dirichlet prior with parameters $[0.1,\cdots,0.1]$

Figure 3.1: Regret comparison of model-free and model-based RL algorithms in RiverSwim (left) and RandomMDP (right). In RiverSwim, our algorithm substantially outperforms Optimistic QL, the best existing model-free algorithm, and performs as well as PSRL (Ouyang et al., 2017b), which is among the best known model-based algorithms in practice. Politex and MDP-OOMD did not achieve sub-linear regret in RiverSwim and are thus removed from the left figure. In RandomMDP, our algorithm together with Optimistic QL outperform the other model-free algorithms and are similar to the model-based algorithms.

Appendices

3.A Proof of Lemma 3.1

Lemma 3.1 (Restated).
The second term of (3.4) can be bounded as
\[
\sum_{t=1}^{T}\big[Q_t(s_t,a_t)-Q_{t+1}(s_t,a_t)\big]\le 2(2B+1)\sqrt{SAT}+cSA(1+\ln T).
\]

Proof. Rearranging line (3.2) of Algorithm 2 implies that
\[
Q_t(s_t,a_t)-Q_{t+1}(s_t,a_t)=\frac{1}{\sqrt{n_{t+1}(s_t,a_t)}}\big[J_t-r(s_t,a_t)+Q_t(s_t,a_t)-v_t(s_{t+1})\big].
\]
Note that by Assumption 3.2, $Q_t(s_t,a_t)-v_t(s_{t+1})\le 2B$. Moreover, $J_t-r(s_t,a_t)\le J_t\le J^*+c/\sqrt{t}\le 1+c/\sqrt{t}$ by Assumption 3.1. Thus,
\[
\sum_{t=1}^{T}\big[Q_t(s_t,a_t)-Q_{t+1}(s_t,a_t)\big]\le \sum_{t=1}^{T}\frac{2B+1}{\sqrt{n_{t+1}(s_t,a_t)}}+\sum_{t=1}^{T}\frac{c}{\sqrt{t\,n_{t+1}(s_t,a_t)}}. \tag{3.10}
\]
For the first term on the right-hand side, we have
\[
\sum_{t=1}^{T}\frac{2B+1}{\sqrt{n_{t+1}(s_t,a_t)}}=\sum_{t=1}^{T}\sum_{s,a}\mathbb{1}(s_t=s,a_t=a)\frac{2B+1}{\sqrt{n_{t+1}(s,a)}}=\sum_{s,a}\sum_{t=1}^{T}\mathbb{1}(s_t=s,a_t=a)\frac{2B+1}{\sqrt{n_{t+1}(s,a)}}=\sum_{s,a}\sum_{i=1}^{n_{T+1}(s,a)}\frac{2B+1}{\sqrt{i}}\le 2(2B+1)\sqrt{SAT},
\]
where the last step uses Lemma 3.3(1). For the second term on the right-hand side of (3.10), note that $t\ge n_{t+1}(s_t,a_t)$. Thus,
\[
\sum_{t=1}^{T}\frac{c}{\sqrt{t\,n_{t+1}(s_t,a_t)}}\le \sum_{t=1}^{T}\frac{c}{n_{t+1}(s_t,a_t)}=\sum_{s,a}\sum_{t=1}^{T}\mathbb{1}(s_t=s,a_t=a)\frac{c}{n_{t+1}(s,a)}=\sum_{s,a}\sum_{i=1}^{n_{T+1}(s,a)}\frac{c}{i}\le \sum_{s,a}\big(c+c\ln n_{T+1}(s,a)\big)\le cSA(1+\ln T).
\]

3.B Proof of Lemma 3.2

Lemma 3.2 (Restated). The following properties hold:
1. $\sum_{i=1}^{\tau}\alpha^i_\tau=1$ for any $\tau\ge 1$.
2. For any $i\ge 1$ and any $K\ge i$,
\[
1-\Big(1-\frac{1}{\sqrt{i+1}}\Big)^{K-i+1}\le \sum_{\tau=i}^{K}\alpha^i_\tau\le 1+\frac{5}{2\sqrt{i}}.
\]
3. Let $M$ be a scalar and define $M^+=\max(M,0)$ and $M^-=\max(-M,0)$. Then, for any $i\ge 1$ and any $K\ge i$,
\[
M\sum_{\tau=i}^{K}\alpha^i_\tau\le M+\frac{5M^+}{2\sqrt{i}}+M^-\Big(1-\frac{1}{\sqrt{i+1}}\Big)^{K-i+1}.
\]
4. For any $K\ge 0$, $\sum_{i=1}^{K}\big(1-\frac{1}{\sqrt{i+1}}\big)^{K-i+1}\le \sqrt{K}$.
5. For any $K\ge 1$, $\sum_{i=1}^{K}\frac{1}{\sqrt{i}}\big(1-\frac{1}{\sqrt{i+1}}\big)^{K-i+1}\le 2\sqrt{2}$.

Proof. 1. We prove this by induction on $\tau$. For $\tau=1$, $\alpha^1_1=\alpha_1=1$. For the induction step, note that $\alpha^i_\tau=(1-\alpha_\tau)\alpha^i_{\tau-1}$ for $i<\tau$. Thus,
\[
\sum_{i=1}^{\tau}\alpha^i_\tau=\alpha^\tau_\tau+\sum_{i=1}^{\tau-1}\alpha^i_\tau=\alpha_\tau+(1-\alpha_\tau)\sum_{i=1}^{\tau-1}\alpha^i_{\tau-1}=\alpha_\tau+(1-\alpha_\tau)=1,
\]
where the third equality is by the induction hypothesis.

2.
To prove the lower bound, we can write
\[
\alpha^i_\tau=\alpha_i\prod_{j=i+1}^{\tau}(1-\alpha_j)\ge \alpha_i\Big(1-\frac{1}{\sqrt{i+1}}\Big)^{\tau-i}.
\]
Thus,
\[
\sum_{\tau=i}^{K}\alpha^i_\tau\ge \alpha_i\sum_{\tau=i}^{K}\Big(1-\frac{1}{\sqrt{i+1}}\Big)^{\tau-i}=\alpha_i\sqrt{i+1}\bigg(1-\Big(1-\frac{1}{\sqrt{i+1}}\Big)^{K-i+1}\bigg)\ge 1-\Big(1-\frac{1}{\sqrt{i+1}}\Big)^{K-i+1}.
\]
To prove the upper bound, note that
\[
\alpha^i_\tau=\alpha_i\prod_{j=i+1}^{\tau}(1-\alpha_j)\le \alpha_i\exp\Big(-\sum_{j=i+1}^{\tau}\alpha_j\Big)\le \alpha_i\exp\Big(-\int_{i+1}^{\tau+1}\frac{1}{\sqrt{x}}\,dx\Big)=\alpha_i\exp\big(-2(\sqrt{\tau+1}-\sqrt{i+1})\big),
\]
where the first inequality is by $1+x\le e^x$. Thus,
\[
\sum_{\tau=i}^{K}\alpha^i_\tau\le \sum_{\tau=i}^{\infty}\alpha^i_\tau\le \alpha_i\sum_{\tau=i}^{\infty}\exp\big(-2(\sqrt{\tau+1}-\sqrt{i+1})\big)=\alpha_i\bigg(1+\sum_{\tau=i+1}^{\infty}\exp\big(-2(\sqrt{\tau+1}-\sqrt{i+1})\big)\bigg)
\le \alpha_i\bigg(1+\int_{i}^{\infty}\exp\big(-2(\sqrt{x+1}-\sqrt{i+1})\big)\,dx\bigg)=\alpha_i\Big(1+\sqrt{i+1}+\frac12\Big)\le \alpha_i\Big(\sqrt{i}+\frac52\Big)=1+\frac{5}{2\sqrt{i}},
\]
where the last inequality uses $\sqrt{i+1}\le \sqrt{i}+1$.

3. We can write $M=M^+-M^-$. Thus, by the previous part, we have
\[
M\sum_{\tau=i}^{K}\alpha^i_\tau=M^+\sum_{\tau=i}^{K}\alpha^i_\tau-M^-\sum_{\tau=i}^{K}\alpha^i_\tau\le M^+\Big(1+\frac{5}{2\sqrt{i}}\Big)-M^-+M^-\Big(1-\frac{1}{\sqrt{i+1}}\Big)^{K-i+1}=M+\frac{5M^+}{2\sqrt{i}}+M^-\Big(1-\frac{1}{\sqrt{i+1}}\Big)^{K-i+1}.
\]

4. Let $j=K-i+1$. We can write
\[
\sum_{i=1}^{K}\Big(1-\frac{1}{\sqrt{i+1}}\Big)^{K-i+1}=\sum_{j=1}^{K}\Big(1-\frac{1}{\sqrt{K-j+2}}\Big)^{j}\le \sum_{j=1}^{K}\Big(1-\frac{1}{\sqrt{K+1}}\Big)^{j}\le \sum_{j=1}^{\infty}\Big(1-\frac{1}{\sqrt{K+1}}\Big)^{j}=\sqrt{K+1}-1\le \sqrt{K}.
\]

5. We have
\[
\sum_{i=1}^{K}\frac{1}{\sqrt{i}}\Big(1-\frac{1}{\sqrt{i+1}}\Big)^{K-i+1}=\sum_{i=1}^{K}\frac{\sqrt{i+1}}{\sqrt{i}}\cdot\frac{1}{\sqrt{i+1}}\Big(1-\frac{1}{\sqrt{i+1}}\Big)^{K-i+1}\le \sqrt{2}\sum_{i=1}^{K}\frac{1}{\sqrt{i+1}}\Big(1-\frac{1}{\sqrt{i+1}}\Big)^{K-i+1}\le \sqrt{2}\sum_{i=1}^{K}\frac{1}{\sqrt{i+1}}\,e^{\frac{i-K-1}{\sqrt{i+1}}},
\]
where the last inequality is by $1-x\le e^{-x}$. We proceed by upper bounding the latter:
\[
\sum_{i=1}^{K}\frac{1}{\sqrt{i+1}}\,e^{\frac{i-K-1}{\sqrt{i+1}}}=\sum_{i=1}^{K}\frac{1}{\sqrt{i+1}}\,e^{\sqrt{i+1}}\,e^{-\frac{K+2}{\sqrt{i+1}}}\le e^{-\frac{K+2}{\sqrt{K+1}}}\sum_{i=1}^{K}\frac{1}{\sqrt{i+1}}\,e^{\sqrt{i+1}}\le e^{-\frac{K+2}{\sqrt{K+1}}}\int_{1}^{K+1}\frac{1}{\sqrt{x+1}}\,e^{\sqrt{x+1}}\,dx=2e^{-\frac{K+2}{\sqrt{K+1}}}\big(e^{\sqrt{K+2}}-e^{\sqrt{2}}\big).
\]
Note that $e^{-\frac{K+2}{\sqrt{K+1}}}\le e^{-\sqrt{K+2}}$. Thus,
\[
2e^{-\frac{K+2}{\sqrt{K+1}}}\big(e^{\sqrt{K+2}}-e^{\sqrt{2}}\big)\le 2e^{-\sqrt{K+2}}\big(e^{\sqrt{K+2}}-e^{\sqrt{2}}\big)\le 2.
\]

3.C Proof of Lemma 3.3

Lemma 3.3 (Frequently used inequalities, Restated). The following inequalities hold:
1. $\sum_{t=1}^{T}\frac{1}{\sqrt{n_{t+1}(s_t,a_t)}}=\sum_{s,a}\sum_{i=1}^{n_{T+1}(s,a)}\frac{1}{\sqrt{i}}\le 2\sqrt{SAT}$.
2. $\sum_{s,a}\sum_{i=1}^{n_{T+1}(s,a)}\big(1-\frac{1}{\sqrt{i+1}}\big)^{n_{T+1}(s,a)-i+1}\le \sqrt{SAT}$.
3. $\sum_{t=1}^{T}\frac{1}{\sqrt{t\,n_{t+1}(s_t,a_t)}}\le SA(1+\ln T)$.

Proof. 1.
\[
\sum_{t=1}^{T}\frac{1}{\sqrt{n_{t+1}(s_t,a_t)}}=\sum_{t=1}^{T}\sum_{s,a}\mathbb{1}(s_t=s,a_t=a)\frac{1}{\sqrt{n_{t+1}(s,a)}}=\sum_{s,a}\sum_{t=1}^{T}\mathbb{1}(s_t=s,a_t=a)\frac{1}{\sqrt{n_{t+1}(s,a)}}=\sum_{s,a}\sum_{i=1}^{n_{T+1}(s,a)}\frac{1}{\sqrt{i}}\le 2\sum_{s,a}\sqrt{n_{T+1}(s,a)}\le 2\sqrt{SA\sum_{s,a}n_{T+1}(s,a)}=2\sqrt{SAT},
\]
where the last inequality is by Cauchy–Schwarz.

2. Lemma 3.2(4) implies that
\[
\sum_{s,a}\sum_{i=1}^{n_{T+1}(s,a)}\Big(1-\frac{1}{\sqrt{i+1}}\Big)^{n_{T+1}(s,a)-i+1}\le \sum_{s,a}\sqrt{n_{T+1}(s,a)}\le \sqrt{SAT}, \tag{3.11}
\]
where the last inequality is by Cauchy–Schwarz, similar to the previous part.

3. Note that $t\ge n_{t+1}(s_t,a_t)$. Thus,
\[
\sum_{t=1}^{T}\frac{1}{\sqrt{t\,n_{t+1}(s_t,a_t)}}\le \sum_{t=1}^{T}\frac{1}{n_{t+1}(s_t,a_t)}=\sum_{s,a}\sum_{t=1}^{T}\mathbb{1}(s_t=s,a_t=a)\frac{1}{n_{t+1}(s,a)}=\sum_{s,a}\sum_{i=1}^{n_{T+1}(s,a)}\frac{1}{i}\le \sum_{s,a}\big(1+\ln n_{T+1}(s,a)\big)\le SA(1+\ln T).
\]

3.D Proof of Lemma 3.4

Lemma 3.4 (Restated). For a fixed $(s,a)\in\mathcal{S}\times\mathcal{A}$, let $\tau=n_t(s,a)$, and let $t_i$ be the time step at which $(s,a)$ is taken for the $i$th time. Then,
\[
\big(Q_t(s,a)-q^*(s,a)\big)\mathbb{1}(\tau\ge 1)=\sum_{i=1}^{\tau}\alpha^i_\tau\big[J^*-J_{t_i}\big]+\sum_{i=1}^{\tau}\alpha^i_\tau\big[v_{t_i}(s_{t_i+1})-v^*(s_{t_i+1})\big]+\sum_{i=1}^{\tau}\alpha^i_\tau\big[v^*(s_{t_i+1})-\mathbb{E}_{s'\sim p(\cdot|s,a)}v^*(s')\big].
\]

Proof. If $\tau\ge 1$, Lemma 3.2 implies that $\sum_{i=1}^{\tau}\alpha^i_\tau=1$. Thus, by the Bellman equation,
\[
q^*(s,a)=\sum_{i=1}^{\tau}\alpha^i_\tau\big[r(s,a)-J^*+\mathbb{E}_{s'\sim p(\cdot|s,a)}v^*(s')\big].
\]
Combining this with Lemma 3.7 completes the proof.

Lemma 3.7. For a fixed $(s,a)\in\mathcal{S}\times\mathcal{A}$, let $\tau=n_t(s,a)$ and let $t_i$ be the time step at which $(s,a)$ is taken for the $i$th time. Then,
\[
Q_t(s,a)=\sum_{i=1}^{\tau}\alpha^i_\tau\big[r(s,a)-J_{t_i}+v_{t_i}(s_{t_i+1})\big]. \tag{3.12}
\]

Proof. Note that $Q_t(s,a)$ remains unchanged during $[t_{j-1}+1,t_j]$. Thus, it suffices to prove
\[
Q_{t_j+1}(s,a)=\sum_{i=1}^{j}\alpha^i_j\big[r(s,a)-J_{t_i}+v_{t_i}(s_{t_i+1})\big]
\]
for $j\ge 0$, with the convention that $t_0=0$. We proceed by induction on $j$. For $j=0$, $Q_1(s,a)=0$ by the initialization of the algorithm.
For the induction step, we write
\[
Q_{t_j+1}(s,a)=(1-\alpha_j)Q_{t_j}(s,a)+\alpha_j\big[r(s,a)-J_{t_j}+v_{t_j}(s_{t_j+1})\big]
=(1-\alpha_j)Q_{t_{j-1}+1}(s,a)+\alpha_j\big[r(s,a)-J_{t_j}+v_{t_j}(s_{t_j+1})\big]
=(1-\alpha_j)\sum_{i=1}^{j-1}\alpha^i_{j-1}\big[r(s,a)-J_{t_i}+v_{t_i}(s_{t_i+1})\big]+\alpha_j\big[r(s,a)-J_{t_j}+v_{t_j}(s_{t_j+1})\big]
=\sum_{i=1}^{j}\alpha^i_j\big[r(s,a)-J_{t_i}+v_{t_i}(s_{t_i+1})\big],
\]
where the first equality is by line (3.2) of the algorithm, the second equality is by the fact that $Q_t(s,a)$ remains unchanged during $[t_{j-1}+1,t_j]$, the third equality is by the induction hypothesis, and the last equality follows from $\alpha^i_j=(1-\alpha_j)\alpha^i_{j-1}$ and $\alpha^j_j=\alpha_j$.

3.E Proof of Lemma 3.5

Lemma 3.5 (Restated). The regret of any algorithm is bounded as
\[
R_T\le \mathrm{sp}(v^*)+\mathrm{sp}(v^*)\sqrt{\tfrac12 T\ln\tfrac1\delta}+\sum_{t=1}^{T}\big[v^*(s_t)-q^*(s_t,a_t)\big],
\]
with probability at least $1-\delta$.

Proof. Write the regret $R_T$ as
\[
R_T=\sum_{t=1}^{T}\big(J^*-r(s_t,a_t)\big)=\sum_{t=1}^{T}\big(\mathbb{E}_{s'\sim p(\cdot|s_t,a_t)}[v^*(s')]-q^*(s_t,a_t)\big)
=\underbrace{\sum_{t=1}^{T}\big(\mathbb{E}_{s'\sim p(\cdot|s_t,a_t)}[v^*(s')]-v^*(s_{t+1})\big)}_{R_1}+\underbrace{\sum_{t=1}^{T}\big(v^*(s_{t+1})-v^*(s_t)\big)}_{R_2}+\sum_{t=1}^{T}\big(v^*(s_t)-q^*(s_t,a_t)\big),
\]
where the second equality is by the Bellman equation. Note that $R_2=v^*(s_{T+1})-v^*(s_1)\le \mathrm{sp}(v^*)$. The summands in $R_1$ constitute a martingale difference sequence. Thus, by the Azuma–Hoeffding inequality, $R_1\le \mathrm{sp}(v^*)\sqrt{\tfrac12 T\ln\tfrac1\delta}$ with probability at least $1-\delta$, which completes the proof.

3.F Proof of Lemma 3.6

Lemma 3.6 (Restated).
\[
\sum_{t=1}^{T}\big[v_t(s_{t+1})-v_{t+1}(s_{t+1})\big]\le 2\sqrt{SAT}+cSA(1+\ln T).
\]

Proof. Note that $v_t$ and $v_{t+1}$ differ only at state $s_t$. So, only terms with $s_{t+1}=s_t$ contribute to the summation. Moreover, if $s_{t+1}=s_t$, then
\[
Q_{t+1}(s_t,a_t)=(1-\alpha_{n_{t+1}(s_t,a_t)})Q_t(s_t,a_t)+\alpha_{n_{t+1}(s_t,a_t)}\big[r(s_t,a_t)-J_t+v_t(s_{t+1})\big]
=(1-\alpha_{n_{t+1}(s_t,a_t)})v_t(s_t)+\alpha_{n_{t+1}(s_t,a_t)}\big[r(s_t,a_t)-J_t+v_t(s_{t+1})\big]
=v_t(s_t)+\alpha_{n_{t+1}(s_t,a_t)}\big[r(s_t,a_t)-J_t\big].
\]
Thus, $v_t(s_t)-Q_{t+1}(s_t,a_t)=\alpha_{n_{t+1}(s_t,a_t)}\big[J_t-r(s_t,a_t)\big]\le \alpha_{n_{t+1}(s_t,a_t)}J_t$, and
\[
\sum_{t=1}^{T}\big[v_t(s_{t+1})-v_{t+1}(s_{t+1})\big]=\sum_{t=1}^{T}\mathbb{1}(s_{t+1}=s_t)\big[v_t(s_{t+1})-v_{t+1}(s_{t+1})\big]=\sum_{t=1}^{T}\mathbb{1}(s_{t+1}=s_t)\big[v_t(s_t)-v_{t+1}(s_t)\big]
\le \sum_{t=1}^{T}\mathbb{1}(s_{t+1}=s_t)\big[v_t(s_t)-Q_{t+1}(s_t,a_t)\big]\le \sum_{t=1}^{T}\alpha_{n_{t+1}(s_t,a_t)}J_t\le \sum_{t=1}^{T}\frac{1}{\sqrt{n_{t+1}(s_t,a_t)}}+\sum_{t=1}^{T}\frac{c}{\sqrt{t\,n_{t+1}(s_t,a_t)}},
\]
where the last inequality is by the fact that $J_t\le J^*+c/\sqrt{t}\le 1+c/\sqrt{t}$. Using Lemma 3.3(1) and (3) completes the proof.

Chapter 4

Online Learning for Stochastic Shortest Path Model via Posterior Sampling

4.1 Introduction

The Stochastic Shortest Path (SSP) model considers the problem of an agent interacting with an environment to reach a predefined goal state while minimizing the cumulative expected cost. Unlike the finite-horizon and discounted Markov Decision Process (MDP) settings, in the SSP model the horizon of interaction between the agent and the environment depends on the agent's actions and can possibly be unbounded (if the goal is not reached). A wide variety of goal-oriented control and reinforcement learning (RL) problems, such as navigation and game playing, can be formulated as SSP problems. In the RL setting, where the SSP model is unknown, the agent interacts with the environment in $K$ episodes. Each episode begins at a predefined initial state and ends when the agent reaches the goal (note that it might never reach the goal). We consider the setting where the state and action spaces are finite, the cost function is known, but the transition kernel is unknown. The performance of the agent is measured through the notion of regret, i.e., the difference between the cumulative cost of the learning algorithm and that of the optimal policy during the $K$ episodes.
The agent has to balance the well-known trade-off between exploration and exploitation: should the agent explore the environment to gain information for future decisions, or should it exploit the current information to minimize the cost? A general way to balance the exploration-exploitation trade-off is to use the Optimism in the Face of Uncertainty (OFU) principle (Lai and Robbins, 1985). The idea is to construct a set of plausible models based on the available information, select the model associated with the minimum cost, and follow the optimal policy with respect to the selected model. This idea is widely used in the RL literature for MDPs (e.g., Jaksch et al., 2010; Azar et al., 2017; Fruit et al., 2018b; Jin et al., 2018; Wei et al., 2020, 2021) and also for SSP models (Tarbouriech et al., 2020; Rosenberg et al., 2020; Rosenberg and Mansour, 2020; Chen and Luo, 2021; Tarbouriech et al., 2021b).

An alternative fundamental idea to encourage exploration is Posterior Sampling (PS), also known as Thompson Sampling (Thompson, 1933). The idea is to maintain a posterior distribution over the unknown model parameters based on the available information and the prior distribution. PS algorithms usually proceed in epochs. At the beginning of an epoch, a model is sampled from the posterior. The actions during the epoch are then selected according to the optimal policy associated with the sampled model. PS algorithms have two main advantages over OFU-type algorithms. First, prior knowledge of the environment can be incorporated through the prior distribution. Second, PS algorithms have shown superior numerical performance on multi-armed bandit problems (Scott, 2010; Chapelle and Li, 2011) and MDPs (Osband et al., 2013; Osband and Van Roy, 2017; Ouyang et al., 2017b).

The main difficulty in designing PS algorithms is the design of the epochs. In the basic setting of bandit problems, one can simply sample at every time step (Chapelle and Li, 2011).
In finite-horizon MDPs, where the length of an episode is predetermined and fixed, the epochs and episodes coincide, i.e., the agent can sample from the posterior distribution at the beginning of each episode (Osband et al., 2013). However, in the general SSP model, where the length of each episode is not predetermined and can possibly be unbounded, these natural choices for the epoch do not work. Indeed, the agent needs to switch policies during an episode if the current policy cannot reach the goal.

In this chapter, we propose PSRL-SSP, the first PS-based RL algorithm for the SSP model. PSRL-SSP starts a new epoch based on two criteria. According to the first criterion, a new epoch starts if the number of episodes within the current epoch exceeds that of the previous epoch. The second criterion is triggered when the number of visits to some state-action pair is doubled during an epoch, similar to the criterion used by Bartlett and Tewari (2009); Jaksch et al. (2010); Filippi et al. (2010); Dann and Brunskill (2015); Ouyang et al. (2017b); Rosenberg et al. (2020). Intuitively speaking, in the early stages of the interaction between the agent and the environment, the second criterion triggers more often; this criterion is responsible for switching policies during an episode if the current policy cannot reach the goal. In the later stages of the interaction, the first criterion triggers more often and encourages exploration.

We prove a Bayesian regret bound of $\widetilde O(B_\star S\sqrt{AK})$, where $S$ is the number of states, $A$ is the number of actions, $K$ is the number of episodes, and $B_\star$ is an upper bound on the expected cost of the optimal policy. This is similar to the regret bound of Rosenberg et al. (2020) and has a gap of $\sqrt S$ with the minimax lower bound. We note that the concurrent works of Tarbouriech et al. (2021b) and Cohen et al. (2021) have closed the gap via OFU algorithms and a black-box reduction to the finite-horizon setting, respectively.
However, the goal of this chapter is not to match the minimax regret bound, but rather to introduce the first PS algorithm that achieves a near-optimal regret bound with better numerical performance than OFU algorithms. This is verified by the experiments in Section 4.5. The $\sqrt S$ gap with the lower bound also exists for the PS algorithms in the finite-horizon setting (Osband et al., 2013) and in infinite-horizon average-cost MDPs (Ouyang et al., 2017b). Thus, it remains an open question whether it is possible to achieve the lower bound via PS algorithms in these settings.

Related Work. Posterior Sampling. The idea of PS algorithms dates back to the pioneering work of Thompson (1933). The algorithm was ignored for several decades until recently. In the past two decades, PS algorithms have successfully been developed for various settings, including multi-armed bandits (e.g., Scott (2010); Chapelle and Li (2011); Kaufmann et al. (2012); Agrawal and Goyal (2012, 2013)), MDPs (e.g., Strens (2000); Osband et al. (2013); Fonteneau et al. (2013); Gopalan and Mannor (2015); Osband and Van Roy (2017); Kim (2017); Ouyang et al. (2017b); Banjević and Kim (2019)), partially observable MDPs (Jafarnia-Jahromi et al., 2021), and linear quadratic control (e.g., Abeille and Lazaric (2017); Ouyang et al. (2017a)). The interested reader is referred to Russo et al. (2017) and the references therein for a more comprehensive literature review.

Online Learning in SSP. Another related line of work is online learning in the SSP model, which was introduced by Tarbouriech et al. (2020). They proposed an algorithm with an $\widetilde O(K^{2/3})$ regret bound. The subsequent work of Rosenberg et al. (2020) improved the regret bound to $\widetilde O(B_\star S\sqrt{AK})$. The concurrent works of Cohen et al. (2021); Tarbouriech et al. (2021b) proved a minimax regret bound of $\widetilde O(B_\star\sqrt{SAK})$. However, none of these works propose a PS-type algorithm. We refer the interested reader to Rosenberg and Mansour (2020); Chen et al. (2020); Chen and Luo (2021) for the SSP model with adversarial costs, and to Tarbouriech et al. (2021a) for the sample complexity of the SSP model with a generative model.

Comparison with Ouyang et al. (2017b). Our work is related to Ouyang et al. (2017b), which proposes TSDE, a PS algorithm for infinite-horizon average-cost MDPs. However, clear distinctions exist both in the algorithm and in the analysis. From the algorithmic perspective, our first criterion for determining the epoch length is different from TSDE. Note that using the same epochs as TSDE leads to a sub-optimal regret bound of $O(K^{2/3})$ in the SSP setting. Moreover, following a Hoeffding-type concentration as in TSDE also yields a regret bound of $O(K^{2/3})$ in the SSP setting. Instead, we propose a different analysis using Bernstein-type concentration, inspired by the work of Rosenberg et al. (2020), to achieve the $O(\sqrt K)$ regret bound (see Lemma 5.8).

4.2 Preliminaries

A Stochastic Shortest Path (SSP) model is denoted by $M=(\mathcal S,\mathcal A,c,\theta,s_{\text{init}},g)$, where $\mathcal S$ is the state space, $\mathcal A$ is the action space, $c:\mathcal S\times\mathcal A\to[0,1]$ is the cost function, $s_{\text{init}}\in\mathcal S$ is the initial state, $g\notin\mathcal S$ is the goal state, and $\theta:\mathcal S^+\times\mathcal S\times\mathcal A\to[0,1]$ represents the transition kernel, i.e., $\theta(s'|s,a)=\mathbb P(s'_t=s'\,|\,s_t=s,a_t=a)$, where $\mathcal S^+=\mathcal S\cup\{g\}$ includes the goal state as well. Here $s_t\in\mathcal S$ and $a_t\in\mathcal A$ are the state and action at time $t=1,2,3,\cdots$, and $s'_t\in\mathcal S^+$ is the subsequent state. We assume that the initial state $s_{\text{init}}$ is a fixed and known state, and that $\mathcal S$ and $\mathcal A$ are finite sets of size $S$ and $A$, respectively. A stationary policy is a deterministic map $\pi:\mathcal S\to\mathcal A$ that maps a state to an action.
The value function (also called the cost-to-go function) associated with policy $\pi$ is a function $V^\pi(\cdot;\theta):\mathcal S^+\to[0,\infty]$ given by $V^\pi(g;\theta)=0$ and $V^\pi(s;\theta):=\mathbb E\big[\sum_{t=1}^{\tau^\pi(s)}c(s_t,\pi(s_t))\,\big|\,s_1=s\big]$ for $s\in\mathcal S$, where $\tau^\pi(s)$ is the number of steps before reaching the goal state (a random variable) if the initial state is $s$ and policy $\pi$ is followed throughout the episode. Here, we use the notation $V^\pi(\cdot;\theta)$ to explicitly show the dependence of the value function on $\theta$. Furthermore, the optimal value function can be defined as $V(s;\theta)=\min_\pi V^\pi(s;\theta)$. Policy $\pi$ is called proper if the goal state is reached with probability 1, starting from any initial state and following $\pi$ (i.e., $\max_s\tau^\pi(s)<\infty$ almost surely); otherwise it is called improper.

We consider the reinforcement learning problem of an agent interacting with an SSP model $M=(\mathcal S,\mathcal A,c,\theta_*,s_{\text{init}},g)$ whose transition kernel $\theta_*$ is randomly generated according to the prior distribution $\mu_1$ at the beginning and is then fixed. We will focus on SSP models with transition kernels in the set $\Theta_{B_\star}$ with the following standard properties:

Assumption 4.1. For all $\theta\in\Theta_{B_\star}$, the following holds: (1) there exists a proper policy; (2) for all improper policies $\pi_\theta$, there exists a state $s\in\mathcal S$ such that $V^{\pi_\theta}(s;\theta)=\infty$; and (3) the optimal value function $V(\cdot;\theta)$ satisfies $\max_s V(s;\theta)\le B_\star$.

Bertsekas and Tsitsiklis (1991) prove that the first two conditions in Assumption 4.1 imply that for each $\theta\in\Theta_{B_\star}$, the optimal policy is stationary, deterministic, proper, and can be obtained as the minimizer of the Bellman optimality equations given by
\[
V(s;\theta)=\min_{a}\Big\{c(s,a)+\sum_{s'\in\mathcal S^+}\theta(s'|s,a)V(s';\theta)\Big\},\qquad \forall s\in\mathcal S. \tag{4.1}
\]
Standard techniques such as Value Iteration and Policy Iteration can be used to compute the optimal policy if the SSP model is known (Bertsekas, 2017). Here, we assume that $\mathcal S$, $\mathcal A$, and the cost function $c$ are known to the agent; however, the transition kernel $\theta_*$ is unknown.
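As a concrete illustration of (4.1), value iteration for a known SSP model can be sketched as follows. This is a minimal sketch, not the dissertation's implementation; the 2-state toy instance at the bottom is an assumption chosen purely so the fixed point can be checked by hand.

```python
import numpy as np

def ssp_value_iteration(P, c, n_iters=2000):
    """Value iteration for the SSP Bellman optimality equation (4.1).

    P: transitions of shape (S, A, S+1); the last state index is the
       goal g, whose value is pinned to V(g) = 0.
    c: costs of shape (S, A) with entries in [0, 1].
    Returns the value vector V over S and the greedy stationary policy.
    """
    S, A, _ = P.shape
    V = np.zeros(S + 1)            # V[-1] is the goal, always 0
    for _ in range(n_iters):
        Q = c + P @ V              # Q[s,a] = c(s,a) + sum_s' P(s'|s,a) V(s')
        V[:S] = Q.min(axis=1)      # Bellman optimality update
    return V[:S], Q.argmin(axis=1)

# Toy 2-state SSP (illustrative assumption): action 0 loops forever at
# cost 0.5 (improper); action 1 costs 1 and reaches the goal w.p. 0.9.
P = np.zeros((2, 2, 3))
P[:, 0, 0] = 1.0                   # action 0: stay in state 0
P[:, 1, 2] = 0.9                   # action 1: reach goal
P[:, 1, :2] = 0.05                 # ...or land in either non-goal state
c = np.array([[0.5, 1.0], [0.5, 1.0]])
V, pi = ssp_value_iteration(P, c)  # fixed point: V = 1 + 0.1*V, i.e. 10/9
```

Here the improper looping action accumulates unbounded cost, so the iteration settles on the proper action with $V(s)=10/9$, consistent with conditions (1) and (2) of Assumption 4.1.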
Moreover, we assume that the support of the prior distribution $\mu_1$ is a subset of $\Theta_{B_\star}$. The agent interacts with the environment in $K$ episodes. Each episode starts from the initial state $s_{\text{init}}$ and ends at the goal state $g$ (note that the agent may never reach the goal). At each time $t$, the agent observes state $s_t$ and takes action $a_t$. The environment then yields the next state $s'_t\sim\theta_*(\cdot|s_t,a_t)$. If the goal is reached (i.e., $s'_t=g$), then the current episode completes, a new episode starts, and $s_{t+1}=s_{\text{init}}$. If the goal is not reached (i.e., $s'_t\ne g$), then $s_{t+1}=s'_t$. The goal of the agent is to minimize the expected cumulative cost after $K$ episodes, or equivalently, to minimize the Bayesian regret defined as
\[
R_K:=\mathbb E\bigg[\sum_{t=1}^{T_K}c(s_t,a_t)-KV(s_{\text{init}};\theta_*)\bigg],
\]
where $T_K$ is the total number of time steps before reaching the goal state for the $K$th time, and $V(s_{\text{init}};\theta_*)$ is the optimal value function from (4.1). Here, the expectation is with respect to the prior distribution $\mu_1$ for $\theta_*$, the horizon $T_K$, the randomness in the state transitions, and the randomness of the algorithm. If the agent does not reach the goal state in some episode (i.e., $T_K=\infty$), we define $R_K=\infty$.

4.3 A Posterior Sampling RL Algorithm for SSP Models

In this section, we propose the Posterior Sampling Reinforcement Learning (PSRL-SSP) algorithm (Algorithm 3) for the SSP model. The input of the algorithm is the prior distribution $\mu_1$. At time $t$, the agent maintains the posterior distribution $\mu_t$ on the unknown parameter $\theta_*$, given by $\mu_t(\Theta)=\mathbb P(\theta_*\in\Theta\,|\,\mathcal F_t)$ for any set $\Theta\subseteq\Theta_{B_\star}$. Here $\mathcal F_t$ is the information available at time $t$ (i.e., the sigma algebra generated by $s_1,a_1,\cdots,s_{t-1},a_{t-1},s_t$). Upon observing state $s'_t$ by taking action $a_t$ at state $s_t$, the posterior can be updated according to
\[
\mu_{t+1}(d\theta)=\frac{\theta(s'_t|s_t,a_t)\,\mu_t(d\theta)}{\int\theta'(s'_t|s_t,a_t)\,\mu_t(d\theta')}. \tag{4.2}
\]
The PSRL-SSP algorithm proceeds in epochs $\ell=1,2,3,\cdots$.
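The posterior update (4.2) takes a simple conjugate form when the prior factorizes into independent Dirichlet distributions over the transition rows $\theta(\cdot|s,a)$. This factorized-Dirichlet choice is an assumption made here for illustration (it is the standard choice for tabular PSRL, and a Dirichlet prior is also the one used for the PSRL baseline in Table 3.1); (4.2) itself holds for an arbitrary prior.

```python
import numpy as np

def dirichlet_posterior_update(counts, s, a, s_next):
    """Conjugate form of the posterior update (4.2): with independent
    Dirichlet priors over each row theta(.|s,a), observing
    s' ~ theta*(.|s,a) simply increments the corresponding count."""
    counts[s, a, s_next] += 1.0
    return counts

def sample_transition_kernel(rng, counts):
    """Draw theta_l ~ mu_{t_l}: one Dirichlet sample per (s, a) row.

    counts has shape (S, A, S+1); the last next-state index is the goal.
    """
    theta = np.empty_like(counts)
    S, A, _ = counts.shape
    for s in range(S):
        for a in range(A):
            theta[s, a] = rng.dirichlet(counts[s, a])
    return theta
```

In the full algorithm the sampled kernel would additionally have to be projected onto (or rejection-sampled from) the support $\Theta_{B_\star}$, which this sketch omits.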
Let $t_\ell$ denote the start time of epoch $\ell$. At the beginning of epoch $\ell$, parameter $\theta_\ell$ is sampled from the posterior distribution $\mu_{t_\ell}$, and the actions within that epoch are chosen according to the optimal policy with respect to $\theta_\ell$. Each epoch ends when either of two stopping criteria is satisfied. The first criterion is triggered if the number of visits to the goal state during the current epoch (denoted by $K_\ell$) exceeds that of the previous epoch. This ensures that $K_\ell\le K_{\ell-1}+1$ for all $\ell$. The second criterion is triggered if the number of visits to any of the state-action pairs is doubled compared to the beginning of the epoch. This guarantees that $n_t(s,a)\le 2n_{t_\ell}(s,a)$ for all $(s,a)$, where $n_t(s,a)=\sum_{\tau=1}^{t-1}\mathbb 1\{s_\tau=s,a_\tau=a\}$ denotes the number of visits to state-action pair $(s,a)$ before time $t$. The second stopping criterion is similar to that used by Jaksch et al. (2010); Rosenberg et al. (2020), and is one of the two stopping criteria used in the posterior sampling algorithm (TSDE) for infinite-horizon average-cost MDPs (Ouyang et al., 2017b). This stopping criterion is crucial, since it allows the algorithm to switch policies if the generated policy is improper and cannot reach the goal. We note that updating the policy only at the beginning of an episode (as done in posterior sampling for finite-horizon MDPs (Osband et al., 2013)) does not work for SSP models, because if the policy generated at the beginning of the episode is improper, the goal is never reached and the regret is infinite. The first stopping criterion is novel. A similar stopping criterion used in posterior sampling for infinite-horizon MDPs (Ouyang et al., 2017b) is based on the length of the epochs, i.e., a new epoch starts if the length of the current epoch exceeds the length of the previous epoch. This leads to a bound of $O(\sqrt{T_K})$ on the number of epochs, which translates to a final regret bound of $O(K^{2/3})$ in SSP models.
However, our first stopping criterion allows us to bound the number of epochs by $O(\sqrt K)$ rather than $O(\sqrt{T_K})$ (see Lemma 4.2). This is one of the key steps in avoiding dependency on $c_{\min}^{-1}$ (i.e., the inverse of a lower bound on the cost function) in the main term of the regret, and in achieving a final regret bound of $O(\sqrt K)$.

Remark 4.1. The PSRL-SSP algorithm only requires knowledge of the prior distribution $\mu_1$. It does not require knowledge of $B_\star$ or of $T_\star$ (an upper bound on the expected time the optimal policy takes to reach the goal), as required in Cohen et al. (2021).

Algorithm 3 PSRL-SSP
  Input: $\mu_1$
  Initialization: $t\leftarrow 1$, $\ell\leftarrow 0$, $K_{-1}\leftarrow 0$, $t_0\leftarrow 0$, $k_{t_0}\leftarrow 0$
  for episodes $k=1,2,\cdots,K$ do
      $s_t\leftarrow s_{\text{init}}$
      while $s_t\ne g$ do
          if $k-k_{t_\ell}>K_{\ell-1}$ or $n_t(s,a)>2n_{t_\ell}(s,a)$ for some $(s,a)\in\mathcal S\times\mathcal A$ then
              $K_\ell\leftarrow k-k_{t_\ell}$; $\ell\leftarrow\ell+1$; $t_\ell\leftarrow t$; $k_{t_\ell}\leftarrow k$
              Generate $\theta_\ell\sim\mu_{t_\ell}(\cdot)$ and compute $\pi_\ell(\cdot)=\pi^*(\cdot;\theta_\ell)$ according to (4.1)
          Choose action $a_t=\pi_\ell(s_t)$ and observe $s'_t\sim\theta_*(\cdot|s_t,a_t)$
          Update $\mu_{t+1}$ according to (4.2)
          $s_{t+1}\leftarrow s'_t$; $t\leftarrow t+1$

Main Results. We now provide our main results for the PSRL-SSP algorithm for unknown SSP models. Our first result considers the case where the cost function is strictly positive for all state-action pairs. We then extend the result to the general case by adding a small positive perturbation to the cost function and running the algorithm with the perturbed costs. We first assume the following:

Assumption 4.2. There exists $c_{\min}>0$ such that $c(s,a)\ge c_{\min}$ for all state-action pairs $(s,a)$.

This assumption allows us to bound the total time spent in the $K$ episodes by the total cost, i.e., $c_{\min}T_K\le C_K$, where $C_K:=\sum_{t=1}^{T_K}c(s_t,a_t)$ is the total cost during the $K$ episodes. To facilitate the presentation of the results, we assume that $S\ge 2$, $A\ge 2$, and $K\ge S^2A$. The first main result is as follows.

Theorem 4.1. Suppose Assumptions 4.1 and 4.2 hold. Then, the regret of the PSRL-SSP algorithm is upper bounded as
\[
R_K=O\bigg(B_\star S\sqrt{KA}\,L^2+S^2A\sqrt{\frac{B_\star^3}{c_{\min}}}\,L^2\bigg),
\]
where $L=\log(B_\star SAKc_{\min}^{-1})$.

Note that when $K\gg B_\star S^2Ac_{\min}^{-1}$, the regret bound scales as $\widetilde O(B_\star S\sqrt{KA})$. A crucial point about the above result is that the dependency on $c_{\min}^{-1}$ appears only in the lower-order term. This allows us to extend the $O(\sqrt K)$ bound to the general case, where Assumption 4.2 does not hold, by using the perturbation technique of Rosenberg et al. (2020) (see Theorem 4.2). Avoiding dependency on $c_{\min}^{-1}$ in the main term is achieved by using a Bernstein-type confidence set in the analysis, inspired by Rosenberg et al. (2020). We note that using a Hoeffding-type confidence set in the analysis as in Ouyang et al. (2017b) gives a regret bound of $O(\sqrt{K/c_{\min}})$, which results in an $O(K^{2/3})$ regret bound if Assumption 4.2 is violated.

Theorem 4.2. Suppose Assumption 4.1 holds. Running the PSRL-SSP algorithm with costs $c_\epsilon(s,a):=\max\{c(s,a),\epsilon\}$ for $\epsilon=(S^2A/K)^{2/3}$ yields
\[
R_K=O\Big(B_\star S\sqrt{KA}\,\widetilde L^2+(S^2A)^{\frac23}K^{\frac13}\big(B_\star^{\frac32}\widetilde L^2+T_\star\big)+S^2AT_\star^{\frac32}\widetilde L^2\Big),
\]
where $\widetilde L:=\log(KB_\star T_\star SA)$.

Note that when $K\gg S^2A\big(B_\star^3+T_\star(T_\star/B_\star)^6\big)$, the regret bound scales as $\widetilde O(B_\star S\sqrt{KA})$. These results have similar regret bounds as the Bernstein-SSP algorithm (Rosenberg et al., 2020), and have a gap of $\sqrt S$ with the lower bound of $\Omega(B_\star\sqrt{SAK})$.

4.4 Theoretical Analysis

In this section, we prove Theorem 4.1. The proof of Theorem 4.2 can be found in the Appendix. A key property of posterior sampling is that, conditioned on the information at time $t$, $\theta_*$ and $\theta_t$ have the same distribution if $\theta_t$ is sampled from the posterior distribution at time $t$ (Osband et al., 2013; Russo and Van Roy, 2014). Since the PSRL-SSP algorithm samples $\theta_\ell$ at the stopping time $t_\ell$, we use the stopping-time version of the posterior sampling property, stated as follows.

Lemma 4.1 (Adapted from Lemma 2 of Ouyang et al. (2017b)). Let $t_\ell$ be a stopping time with respect to the filtration $(\mathcal F_t)_{t=1}^{\infty}$, and let $\theta_\ell$ be the sample drawn from the posterior distribution at time $t_\ell$.
Then, for any measurable function f and any F_{t_ℓ}-measurable random variable X, we have

    E[f(θ_ℓ, X) | F_{t_ℓ}] = E[f(θ*, X) | F_{t_ℓ}].

We now sketch the proof of Theorem 4.1. Let 0 < δ < 1 be a parameter to be chosen later. We distinguish between known and unknown state-action pairs. A state-action pair (s,a) is known if the number of visits to (s,a) is at least α · (B⋆S/c_min) log(B⋆SA/(δc_min)) for some large enough constant α (to be determined in Lemma 4.11), and unknown otherwise. We divide each epoch into intervals. The first interval starts at time t = 1. Each interval ends when any of the following conditions holds: (i) the total cost during the interval is at least B⋆; (ii) an unknown state-action pair is visited; (iii) the goal state is reached; or (iv) the current epoch ends. The idea behind introducing intervals is that once all state-action pairs are known, the cost accumulated during an interval is at least B⋆ (ignoring conditions (iii) and (iv)), which allows us to bound the number of intervals by the total cost divided by B⋆. Note that the intervals and the distinction between known and unknown state-action pairs are used only in the analysis; knowledge of B⋆ is therefore not required. Instead of bounding R_K, we bound

    R_M := E[ Σ_{t=1}^{T_M} c(s_t, a_t) − K V(s_init; θ*) ],

for any number of intervals M, as long as the K episodes have not been completed. Here, T_M is the total time of the first M intervals. Let C_M denote the total cost of the algorithm after M intervals and let L_M denote the number of epochs in the first M intervals. Observe that the numbers of times conditions (i), (ii), (iii), and (iv) trigger the start of a new interval are bounded by C_M/B⋆, O((B⋆S²A/c_min) log(B⋆SA/(δc_min))), K, and L_M, respectively. Therefore, the number of intervals can be bounded as

    M ≤ C_M/B⋆ + K + L_M + O( (B⋆S²A/c_min) log(B⋆SA/(δc_min)) ).    (4.3)

Moreover, since the cost function is lower bounded by c_min, we have c_min T_M ≤ C_M. Our argument proceeds as follows.
We bound R_M ≲ B⋆S√(MA), which implies E[C_M] ≲ K E[V(s_init; θ*)] + B⋆S√(MA) (lower-order terms are neglected in this sketch). From the definition of the intervals, once all state-action pairs are known, the cost accumulated within each interval is at least B⋆ (ignoring intervals that end when the epoch or episode ends). This allows us to bound the number of intervals M by C_M/B⋆ (or E[C_M]/B⋆). Solving for E[C_M] in the quadratic inequality E[C_M] ≲ K E[V(s_init; θ*)] + B⋆S√(MA) ≲ K E[V(s_init; θ*)] + S√(E[C_M] B⋆A) implies that E[C_M] ≲ K E[V(s_init; θ*)] + B⋆S√(AK). Since this bound holds for any number of intervals M as long as the K episodes have not been completed, it holds for E[C_K] as well. Moreover, since c_min > 0, this implies that the K episodes eventually terminate, which proves the final regret bound.

Bounding the Number of Epochs. Before proceeding to bound R_M, we first prove that the number of epochs is bounded as O(√(SAK log T_M) + SA log T_M). Recall that the length of the epochs is determined by two stopping criteria. If we ignore the second criterion for a moment, the first stopping criterion ensures that the number of episodes within each epoch grows at a linear rate, which implies that the number of epochs is bounded by O(√K). If we ignore the first stopping criterion for a moment, the second stopping criterion triggers at most O(SA log T_M) times. The following lemma shows that the number of epochs remains of the same order even when the two criteria are considered simultaneously.

Lemma 4.2. The number of epochs is bounded as L_M ≤ √(2SAK log T_M) + SA log T_M.

We now provide the proof sketch for bounding R_M. With abuse of notation, define t_{L_M+1} := T_M + 1. We can write

    R_M := E[ Σ_{t=1}^{T_M} c(s_t, a_t) − K V(s_init; θ*) ] = E[ Σ_{ℓ=1}^{L_M} Σ_{t=t_ℓ}^{t_{ℓ+1}−1} c(s_t, a_t) ] − K E[V(s_init; θ*)].    (4.4)

Note that within epoch ℓ, action a_t is taken according to the optimal policy with respect to θ_ℓ.
Thus, with the Bellman equation we can write c(s t ,a t ) =V (s t ;θ ` )− X s 0 θ ` (s 0 |s t ,a t )V (s 0 ;θ ` ). Substituting this and adding and subtracting V (s t+1 ;θ ` ) and V (s 0 t ;θ ` ), decomposes R M as R M =R 1 M +R 2 M +R 3 M , where R 1 M :=E L M X `=1 t `+1 −1 X t=t ` [V (s t ;θ ` )−V (s t+1 ;θ ` )] R 2 M :=E L M X `=1 t `+1 −1 X t=t ` V (s t+1 ;θ ` )−V (s 0 t ;θ ` ) −KE [V (s init ;θ ∗ )] R 3 M :=E L M X `=1 t `+1 −1 X t=t ` " V (s 0 t ;θ ` )− X s 0 θ ` (s 0 |s t ,a t )V (s 0 ;θ ` ) # . We proceed by bounding these terms separately. Proof of these lemmas can be found in the supplementary material. R 1 M is a telescopic sum and can be bounded by the following lemma. Lemma 4.3. The first term R 1 M is bounded as R 1 M ≤B ? E[L M ]. To bound R 2 M , recall that s 0 t ∈S + is the next state of the environment after applying action a t at state s t , and that s 0 t = s t+1 for all time steps except the last time step of an episode (right before reaching the goal). In the last time step of an episode, s 0 t = g while s t+1 =s init . This proves that the inner sum of R 2 M can be written as V (s init ;θ ` )K ` , where K ` is the number of visits to the goal state during epoch `. Using K ` ≤K `−1 + 1 and the property of posterior sampling completes the proof. This is formally stated in the following lemma. Lemma 4.4. The second term R 2 M is bounded as R 2 M ≤B ? E[L M ]. The rest of the proof proceeds to bound the third term R 3 M which contributes to the dom- inant term of the final regret bound. The detailed proof can be found in Lemma 5.8. Here we provide the proof sketch. R 3 M captures the difference between V (·;θ ` ) at the next state 45 s 0 t ∼θ ∗ (·|s t ,a t ) and its expectation with respect to the sampledθ ` . Applying the Hoeffding- type concentration bounds (Weissman et al., 2003), as used by Ouyang et al. (2017b) yields a regret bound ofO(K 2/3 ) which is sub-optimal. 
To achieve the optimal dependency on K, we use a technique based on the Bernstein concentration bound inspired by the work of Rosenberg et al. (2020). This requires a more careful analysis. Let n t ` (s,a,s 0 ) be the number of visits to state-action pair (s,a) followed by state s 0 before time t ` . For a fixed state-action pair (s,a), define the Bernstein confidence set using the empirical transition probability b θ ` (s 0 |s,a) := nt ` (s,a,s 0 ) nt ` (s,a) as B ` (s,a) := θ(·|s,a) : θ(s 0 |s,a)− b θ ` (s 0 |s,a) ≤ 4 q b θ ` (s 0 |s,a)A ` (s,a) + 28A ` (s,a),∀s 0 ∈S + . (4.5) Here A ` (s,a) := log(SAn + ` (s,a)/δ) n + ` (s,a) and n + ` (s,a) := max{n t ` (s,a), 1}. This confidence set is similar to the one used by Rosenberg et al. (2020) and contains the true transition probability θ ∗ (·|s,a) with high probability (see Lemma 4.7). Note that B ` (s,a) isF t ` - measurable which allows us to use the property of posterior sampling (Lemma 4.1) to conclude that B ` (s,a) contains the sampled transition probability θ ` (·|s,a) as well with high probability. With some algebraic manipulation, R 3 M can be written as (with abuse of notation ` :=`(t) is the epoch at time t) R 3 M =E T M X t=1 X s 0 ∈S + θ ∗ (s 0 |s t ,a t )−θ ` (s 0 |s t ,a t ) V (s 0 ;θ ` )− X s 00 ∈S + θ ∗ (s 00 |s t ,a t )V (s 00 ;θ ` ) . Under the event that bothθ ∗ (·|s t ,a t ) andθ ` (·|s t ,a t ) belong to the confidence setB ` (s t ,a t ), Bernstein bound can be applied to obtain R 3 M ≈O E T M X t=1 q SA ` (s t ,a t )V ` (s t ,a t ) =O M X m=1 E t m+1 −1 X t=tm q SA ` (s t ,a t )V ` (s t ,a t ) , where t m denotes the start time of interval m andV ` is the empirical variance defined as V ` (s t ,a t ) := X s 0 ∈S + θ ∗ (s 0 |s t ,a t ) V (s 0 ;θ ` )− X s 00 ∈S + θ ∗ (s 00 |s t ,a t )V (s 00 ;θ ` ) 2 . 
(4.6) Applying Cauchy Schwarz on the inner sum twice implies that R 3 M ≈O M X m=1 v u u u tSE t m+1 −1 X t=tm A ` (s t ,a t ) · v u u u tE t m+1 −1 X t=tm V ` (s t ,a t ) Using the fact that all the state-action pairs (s t ,a t ) within an interval except possibly the first one are known, and that the cumulative cost within an interval is at most 2B ? , one can boundE h P t m+1 −1 t=tm V ` (s t ,a t ) i =O(B 2 ? ) (see Lemma 4.10 for details). Applying Cauchy 46 Schwarz again implies R 3 M ≈O B ? v u u u tMSE T M X t=1 A ` (s t ,a t ) ≈O B ? S √ MA . This argument is formally presented in the following lemma. Lemma 4.5. The third term R 3 M can be bounded as R 3 M ≤ 288B ? S s MA log 2 SAE[T M ] δ + 1632B ? S 2 A log 2 SAE[T M ] δ + 4SB ? δE[L M ]. Detailed proofs of all lemmas and the theorem can be found in the appendix in the supple- mentary material. 4.5 Experiments In this section, the performance of our PSRL-SSP algorithm is compared with existing OFU- type algorithms in the literature. Two environments are considered: RandomMDP and GridWorld. RandomMDP (Ouyang et al., 2017b; Wei et al., 2020) is an SSP with 8 states and 2 actions whose transition kernel and cost function are generated uniformly at random. GridWorld (Tarbouriech et al., 2020) is a 3× 4 grid (total of 12 states including the goal state) and 4 actions (LEFT, RIGHT, UP, DOWN) with c(s,a) = 1 for any state-action pair (s,a)∈S×A. The agent starts from the initial state located at the top left corner of the grid, and ends in the goal state at the bottom right corner. At each time step, the agent attempts to move in one of the four directions. However, the attempt is successful only with probability 0.85. With probability 0.15, the agent takes any of the undesired directions uniformly at random. If the agent tries to move out of the boundary, the attempt will not be successful and it remains in the same position. 
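These GridWorld dynamics are simple to reproduce. The following is a minimal Python sketch of the environment just described; the state indexing, helper names, and the simple roll-out policy are our own illustrations, not the thesis code.

```python
import random

ROWS, COLS = 3, 4                       # 3 x 4 grid; goal at bottom-right
GOAL = (ROWS - 1, COLS - 1)
MOVES = {"LEFT": (0, -1), "RIGHT": (0, 1), "UP": (-1, 0), "DOWN": (1, 0)}

def step(state, action, rng=random):
    """One transition: the chosen direction succeeds with probability 0.85;
    otherwise one of the three other directions is taken uniformly at
    random. Moves off the grid leave the state unchanged. Every step
    costs 1."""
    if rng.random() < 0.85:
        direction = action
    else:
        direction = rng.choice([a for a in MOVES if a != action])
    dr, dc = MOVES[direction]
    r, c = state[0] + dr, state[1] + dc
    if 0 <= r < ROWS and 0 <= c < COLS:
        state = (r, c)
    return state, 1.0                   # (next state, cost)

# One episode from the top-left corner, steering right, then down.
random.seed(0)
state, total_cost = (0, 0), 0.0
while state != GOAL:
    state, cost = step(state, "RIGHT" if state[1] < COLS - 1 else "DOWN")
    total_cost += cost
```

Since the Manhattan distance from the start to the goal is 5 and every step costs 1, any episode costs at least 5; the 0.15 slip probability is what makes the cumulative cost random.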
In the experiments, we evaluate the frequentist regret of PSRL-SSP for a fixed environment (i.e., the environment is not sampled from a prior distribution). A Dirichlet prior with parameters [0.1,··· , 0.1] is considered for the transition kernel. The Dirichlet distribution is a common prior in Bayesian statistics since it is a conjugate prior for categorical and multinomial distributions. We compare the performance of our proposed PSRL-SSP against existing online learning algorithms for the SSP problem (UC-SSP (Tarbouriech et al., 2020), Bernstein-SSP (Rosenberg et al., 2020), ULCVI (Cohen et al., 2021), and EB-SSP (Tarbouriech et al., 2021b)). The algorithms are evaluated at K = 10,000 episodes and the results are averaged over 10 runs. A 95% confidence interval is used to compare the performance of the algorithms. All the experiments are performed on a 2015 MacBook Pro with a 2.7 GHz Dual-Core Intel Core i5 processor and 16GB RAM. Figure 4.1: Cumulative regret of existing SSP algorithms on RandomMDP (left) and GridWorld (right) for 10,000 episodes. The results are averaged over 10 runs and the 95% confidence interval is shown with the shaded area. Our proposed PSRL-SSP algorithm outperforms all the existing algorithms considerably. The performance gap is even more significant in the more challenging GridWorld environment (right). Figure 4.1 shows that PSRL-SSP significantly outperforms all the previously proposed algorithms for the SSP problem. In particular, it outperforms the recently proposed ULCVI (Cohen et al., 2021) and EB-SSP (Tarbouriech et al., 2021b), which match the theoretical lower bound.
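The evaluation loop can be sketched end-to-end. Below is a minimal, self-contained Python version of PSRL-SSP on a RandomMDP-style toy instance (2 states and 2 actions instead of 8 and 2, with a fixed toy kernel, purely for brevity). The Dirichlet posterior update and the two epoch-stopping criteria mirror Algorithm 3; the planner is plain value iteration, and all names are ours, not the thesis code.

```python
import random

random.seed(1)
S, A, GOAL = [0, 1], [0, 1], "g"        # toy sizes; the thesis uses 8 states

# A fixed toy "true" kernel theta*(.|s,a) over next states [0, 1, g] and a
# strictly positive cost function (both unknown to the learner).
true_theta = {(0, 0): [0.5, 0.3, 0.2], (0, 1): [0.2, 0.4, 0.4],
              (1, 0): [0.4, 0.3, 0.3], (1, 1): [0.1, 0.2, 0.7]}
cost = {(0, 0): 0.3, (0, 1): 0.8, (1, 0): 0.5, (1, 1): 0.9}

def sample_dirichlet(alpha):
    g = [random.gammavariate(a, 1.0) for a in alpha]
    return [x / sum(g) for x in g]

def greedy_policy(theta, iters=100):
    # Value iteration for the SSP Bellman equation, with V(g) = 0.
    V = {s: 0.0 for s in S}
    for _ in range(iters):
        V = {s: min(cost[s, a] + sum(theta[s, a][j] * V[j] for j in S)
                    for a in A) for s in S}
    return {s: min(A, key=lambda a: cost[s, a]
                   + sum(theta[s, a][j] * V[j] for j in S)) for s in S}

def psrl_ssp(K):
    alpha = {sa: [0.1, 0.1, 0.1] for sa in true_theta}  # Dirichlet prior
    n = {sa: 0 for sa in true_theta}    # visit counts n_t(s, a)
    n_ep, K_prev, k_ep, epochs, policy = dict(n), 0, 0, 0, None
    for k in range(1, K + 1):
        s = 0                           # s_init
        while s != GOAL:
            # The two epoch-stopping criteria of Algorithm 3.
            if policy is None or k - k_ep > K_prev or any(
                    n[sa] > 2 * n_ep[sa] for sa in n):
                K_prev, k_ep, n_ep, epochs = k - k_ep, k, dict(n), epochs + 1
                theta = {sa: sample_dirichlet(alpha[sa]) for sa in alpha}
                policy = greedy_policy(theta)
            a = policy[s]
            j = random.choices([0, 1, 2], weights=true_theta[s, a])[0]
            n[s, a] += 1
            alpha[s, a][j] += 1         # conjugate Dirichlet update, cf. (4.2)
            s = [0, 1, GOAL][j]
    return epochs

epochs = psrl_ssp(K=50)
```

Averaging the realized cumulative cost minus K·V(s_init; θ*) over repeated runs of such a loop is exactly the frequentist regret estimate reported in Figure 4.1.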
Our numerical evaluation reveals that the ULCVI algorithm does not show any evidence of learning even after 80,000 episodes (not shown here). The poor empirical performance of these algorithms underscores the value of considering PS algorithms in practice. The gap between the performance of PSRL-SSP and OFU algorithms is even more apparent in the GridWorld environment, which is more challenging than RandomMDP. Note that in RandomMDP it is possible to reach the goal state from any state in a single step, since the transition kernel is generated uniformly at random. In the GridWorld environment, however, the agent has to take a sequence of actions to the right and down to reach the goal at the bottom right corner. Figure 4.1 (right) verifies that PSRL-SSP is able to learn this pattern significantly faster than OFU algorithms. Since these plots are generated for a fixed environment (not generated from a prior), we conjecture that PSRL-SSP enjoys the same regret bound under the non-Bayesian setting.

Appendices

4.A Proof of Lemma 4.2

Lemma (restatement of Lemma 4.2). The number of epochs is bounded as L_M ≤ √(2SAK log T_M) + SA log T_M.

Proof. Define macro epoch i with start time t_{u_i} given by t_{u_1} = t_1, and

    t_{u_{i+1}} = min{ t_ℓ > t_{u_i} : n_{t_ℓ}(s,a) > 2n_{t_{ℓ−1}}(s,a) for some (s,a) },  i = 2, 3, ···.

A macro epoch starts when the second criterion for determining the epoch length triggers. Let N_M be a random variable denoting the total number of macro epochs by the end of interval M, and define u_{N_M+1} := L_M + 1. Recall that K_ℓ is the number of visits to the goal state in epoch ℓ. Let K̃_i := Σ_{ℓ=u_i}^{u_{i+1}−1} K_ℓ be the number of visits to the goal state in macro epoch i. By definition of macro epochs, all the epochs within a macro epoch except the last one are triggered by the first criterion, i.e., K_ℓ = K_{ℓ−1} + 1 for ℓ = u_i, ···, u_{i+1} − 2.
Thus, ˜ K i = u i+1 −1 X `=u i K ` =K u i+1 −1 + u i+1 −u i −1 X j=1 (K u i −1 +j)≥ u i+1 −u i −1 X j=1 j = (u i+1 −u i − 1)(u i+1 −u i ) 2 . Solving for u i+1 −u i implies that u i+1 −u i ≤ 1 + q 2 ˜ K i . We can write L M =u N M +1 − 1 = N M X i=1 (u i+1 −u i )≤ N M X i=1 1 + q 2 ˜ K i =N M + N M X i=1 q 2 ˜ K i ≤N M + v u u t 2N M N M X i=1 ˜ K i =N M + p 2N M K, where the second inequality follows from Cauchy-Schwarz. It suffices to show that the number of macro epochs is bounded asN M ≤ 1 +SA logT M . LetT s,a be the set of all time 49 steps at which the second criterion is triggered for state-action pair (s,a), i.e., T s,a := t ` ≤T M :n t ` (s,a)> 2n t `−1 (s,a) . We claim that|T s,a |≤ logn T M +1 (s,a). To see this, assume by contradiction that|T s,a |≥ 1 + logn T M +1 (s,a), then n t L M (s,a) = Y t ` ≤T M ,nt `−1 (s,a)≥1 n t ` (s,a) n t `−1 (s,a) ≥ Y t ` ∈Ts,a,nt `−1 (s,a)≥1 n t ` (s,a) n t `−1 (s,a) > 2 |Ts,a|−1 ≥n T M +1 (s,a), which is a contradiction. Thus,|T s,a |≤ logn T M +1 (s,a) for all (s,a). In the above argument, the first inequality is by the fact thatn t (s,a) is non-decreasing int, and the second inequality is by the definition ofT s,a . Now, we can write N M = 1 + X s,a |T s,a |≤ 1 + X s,a logn T M +1 (s,a) ≤ 1 +SA log P s,a n T M +1 (s,a) SA = 1 +SA log T M SA ≤SA logT M , where the second inequality follows from Jensen’s inequality. 4.B Proof of Lemma 4.3 Lemma (restatement of Lemma 4.3). The first term R 1 M is bounded as R 1 M ≤B ? E[L M ]. Proof. Recall R 1 M =E L M X `=1 t `+1 −1 X t=t ` [V (s t ;θ ` )−V (s t+1 ;θ ` )] Observe that the inner sum is a telescopic sum, thus R 1 M =E L M X `=1 V (s t ` ;θ ` )−V (s t `+1 ;θ ` ) ≤B ? E[L M ], where the inequality is by Assumption 4.1. 4.C Proof of Lemma 4.4 Lemma (restatement of Lemma 4.4). The second termR 2 M is bounded asR 2 M ≤B ? E[L M ]. 50 Proof. Recall that K ` is the number of times the goal state is reached during epoch `. 
By definition, the only time steps that s 0 t 6=s t+1 is right before reaching the goal. Thus, with V (g;θ ` ) = 0, we can write R 2 M =E L M X `=1 t `+1 −1 X t=t ` V (s t+1 ;θ ` )−V (s 0 t ;θ ` ) −KE [V (s init ;θ ∗ )] =E L M X `=1 V (s init ;θ ` )K ` −KE [V (s init ;θ ∗ )] = ∞ X `=1 E h 1 {m(t ` )≤M} V (s init ;θ ` )K ` i −KE [V (s init ;θ ∗ )], where the last step is by Monotone Convergence Theorem. Here m(t ` ) is the interval at timet ` . Note that from the first stopping criterion of the algorithm we have K ` ≤K `−1 + 1 for all `. Thus, each term in the summation can be bounded as E h 1 {m(t ` )≤M} V (s init ;θ ` )K ` i ≤E h 1 {m(t ` )≤M} V (s init ;θ ` )(K `−1 + 1) i . 1 {m(t ` )≤M} (K `−1 + 1) isF t ` measurable. Therefore, applying the property of posterior sampling (Lemma 4.1) implies E h 1 {m(t ` )≤M} V (s init ;θ ` )(K `−1 + 1) i =E h 1 {m(t ` )≤M} V (s init ;θ ∗ )(K `−1 + 1) i Substituting this into R 2 M , we obtain R 2 M ≤ ∞ X `=1 E h 1 {m(t ` )≤M} V (s init ;θ ∗ )(K `−1 + 1) i −KE [V (s init ;θ ∗ )] =E L M X `=1 V (s init ;θ ∗ )(K `−1 + 1) −KE [V (s init ;θ ∗ )] =E V (s init ;θ ∗ ) L M X `=1 K `−1 −K +E [V (s init ;θ ∗ )L M ]≤B ? E[L M ]. In the last inequality we have used the fact that 0≤ V (s init ;θ ∗ )≤ B ? and P L M `=1 K `−1 ≤ K. 4.D Proof of Lemma 5.8 Lemma (restatement of Lemma 5.8). The third term R 3 M can be bounded as R 3 M ≤ 288B ? S s MA log 2 SAE[T M ] δ + 1632B ? S 2 A log 2 SAE[T M ] δ + 4SB ? δE[L M ]. 51 Proof. With abuse of notation let ` := `(t) denote the epoch at time t and m(t) be the interval at time t. We can write R 3 M =E T M X t=1 " V (s 0 t ;θ ` )− X s 0 θ ` (s 0 |s t ,a t )V (s 0 ;θ ` ) # =E " ∞ X t=1 1 {m(t)≤M} " V (s 0 t ;θ ` )− X s 0 θ ` (s 0 |s t ,a t )V (s 0 ;θ ` ) ## = ∞ X t=1 E " 1 {m(t)≤M} E " V (s 0 t ;θ ` )− X s 0 θ ` (s 0 |s t ,a t )V (s 0 ;θ ` ) F t ,θ ∗ ,θ ` ## . 
The last equality follows from Dominated Convergence Theorem, tower property of con- ditional expectation, and that 1 {m(t)≤M} is measurable with respect to F t . Note that conditioned onF t ,θ ∗ andθ ` , the only random variable in the inner expectation iss 0 t . Thus, E[V (s 0 t ;θ ` )|F t ,θ ∗ ,θ ` ] = P s 0θ ∗ (s 0 |s t ,a t )V (s 0 ;θ ` ). Using Dominated Convergence Theorem again implies that R 3 M =E T M X t=1 X s 0 ∈S + θ ∗ (s 0 |s t ,a t )−θ ` (s 0 |s t ,a t ) V (s 0 ;θ ` ) =E T M X t=1 X s 0 ∈S + θ ∗ (s 0 |s t ,a t )−θ ` (s 0 |s t ,a t ) V (s 0 ;θ ` )− X s 00 ∈S + θ ∗ (s 00 |s t ,a t )V (s 00 ;θ ` ) , (4.7) where the last equality is due to the fact that θ ∗ (·|s t ,a t ) and θ ` (·|s t ,a t ) are probability distributions and that P s 00 ∈S +θ ∗ (s 00 |s t ,a t )V (s 00 ;θ ` ) is independent of s 0 . Recall the Bernstein confidence set B ` (s,a) defined in (4.5) and let Ω ` s,a be the event that both θ ∗ (·|s,a) and θ ` (·|s,a) are in B ` (s,a). If Ω ` s,a holds, then the difference between θ ∗ (·|s,a) and θ ` (·|s,a) can be bounded by the following lemma. Lemma 4.6. Denote A ` (s,a) = log(SAn + ` (s,a)/δ) n + ` (s,a) . If Ω ` s,a holds, then θ ∗ (s 0 |s,a)−θ ` (s 0 |s,a) ≤ 8 q θ ∗ (s 0 |s,a)A ` (s,a) + 136A ` (s,a). Proof. Since Ω ` s,a holds, by (4.5) we have that b θ ` (s 0 |s,a)−θ ∗ (s 0 |s,a)≤ 4 q b θ ` (s 0 |s,a)A ` (s,a) + 28A ` (s,a). Using the primary inequality that x 2 ≤ ax +b implies x≤ a + √ b with x = q b θ ` (s 0 |s,a), a = 4 p A ` (s,a), and b =θ ∗ (s 0 |s,a) + 28A ` (s,a), we obtain q b θ ` (s 0 |s,a)≤ 4 q A ` (s,a) + q θ ∗ (s 0 |s,a) + 28A ` (s,a)≤ q θ ∗ (s 0 |s,a) + 10 q A ` (s,a), 52 where the last inequality is by sub-linearity of the square root. Substituting this bound into (4.5) yields θ ∗ (s 0 |s,a)− b θ ` (s 0 |s,a) ≤ 4 q θ ∗ (s 0 |s,a)A ` (s,a) + 68A ` (s,a). Similarly, θ ` (s 0 |s,a)− b θ ` (s 0 |s,a) ≤ 4 q θ ∗ (s 0 |s,a)A ` (s,a) + 68A ` (s,a). Using the triangle inequality completes the proof. 
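The width in Lemma 4.6 decays roughly like √(log n / n) in the visit count, which is what ultimately drives the √K main term. A small Python helper, with the constants taken directly from (4.5) and Lemma 4.6 and illustrative parameter values, makes this concrete:

```python
import math

def A_ell(n, S, A, delta):
    """A_ell(s, a) = log(S * A * n_plus / delta) / n_plus, n_plus = max(n, 1)."""
    n_plus = max(n, 1)
    return math.log(S * A * n_plus / delta) / n_plus

def lemma46_width(theta, n, S, A, delta):
    """Bound of Lemma 4.6 on |theta*(s'|s,a) - theta_ell(s'|s,a)| when both
    kernels lie in the Bernstein set B_ell(s, a) of (4.5)."""
    a = A_ell(n, S, A, delta)
    return 8.0 * math.sqrt(theta * a) + 136.0 * a

# Width as a function of the visit count, for a transition probability of
# 0.25 in a toy instance with S = 8, A = 2 (numbers illustrative).
widths = [lemma46_width(0.25, n, S=8, A=2, delta=0.01)
          for n in (10, 100, 1000, 10000)]
```

The widths shrink monotonically with n; for small probabilities θ*(s′|s,a), the first (variance) term dominates, which is precisely why the Bernstein set is tighter than a Hoeffding set.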
Note that if either of θ ∗ (·|s t ,a t ) or θ ` (·|s t ,a t ) is not in B ` (s t ,a t ), then the inner term of (4.7) can be bounded by 2SB ? (note that|S + |≤ 2S and V (·;θ ` )≤ B ? ). Thus, applying Lemma 4.6 implies that X s 0 ∈S + θ ∗ (s 0 |s t ,a t )−θ ` (s 0 |s t ,a t ) V (s 0 ;θ ` )− X s 00 ∈S + θ ∗ (s 00 |s t ,a t )V (s 00 ;θ ` ) ≤ 8 X s 0 ∈S + v u u u t A ` (s t ,a t )θ ∗ (s 0 |s t ,a t ) V (s 0 ;θ ` )− X s 00 ∈S + θ ∗ (s 00 |s t ,a t )V (s 00 ;θ ` ) 2 1 Ω ` s t ,a t + 136 X s 0 ∈S + A ` (s t ,a t ) V (s 0 ;θ ` )− X s 00 ∈S + θ ∗ (s 00 |s t ,a t )V (s 00 ;θ ` ) 1 Ω ` s t ,a t + 2SB ? 1 {θ∗(·|st,at)/ ∈B ` (st,at)} + 1 {θ ` (·|st,at)/ ∈B ` (st,at)} ≤ 16 q SA ` (s t ,a t )V ` (s t ,a t )1 Ω ` s t ,a t + 272SB ? A ` (s t ,a t )1 Ω ` s t ,a t + 2SB ? 1 {θ∗(·|st,at)/ ∈B ` (st,at)} + 1 {θ ` (·|st,at)/ ∈B ` (st,at)} . whereA ` (s,a) = log(SAn + ` (s,a)/δ) n + ` (s,a) andV ` (s,a) is defined in (4.6). Here the last inequality fol- lows from Cauchy-Schwarz,|S + |≤ 2S,V (·;θ ` )≤B ? and the definition ofV ` . Substituting this into (4.7) yields R 3 M ≤ 16 √ SE T M X t=1 q A ` (s t ,a t )V ` (s t ,a t )1 Ω ` s t ,a t (4.8) + 272SB ? E T M X t=1 A ` (s t ,a t )1 Ω ` s t ,a t (4.9) + 2SB ? E T M X t=1 1 {θ∗(·|st,at)/ ∈B ` (st,at)} + 1 {θ ` (·|st,at)/ ∈B ` (st,at)} . (4.10) 53 The inner sum in (4.9) is bounded by 6SA log 2 (SAT M /δ) (see Lemma 4.9). To bound (4.10), we first show that B ` (s,a) contains the true transition probability θ ∗ (·|s,a) with high probability: Lemma 4.7. For any epoch ` and any state-action pair (s,a)∈S×A, θ ∗ (·|s,a)∈B ` (s,a) with probability at least 1− δ 2SAn + ` (s,a) . Proof. Fix (s,a,s 0 )∈S×A×S + and 0 < δ 0 < 1 (to be chosen later). Let (Z i ) ∞ i=1 be a sequence of random variables drawn from the probability distribution θ ∗ (·|s,a). 
Apply Lemma 4.8 below with X i = 1 {Z i =s 0 } and δ t = δ 0 4St 2 to a prefix of length t of the sequence (X i ) ∞ i=1 , and apply union bound over all t and s 0 to obtain ˆ θ ` (s 0 |s,a)−θ ∗ (s 0 |s,a) ≤ 2 v u u u t ˆ θ ` (s 0 |s,a) log 8Sn + ` 2 (s,a) δ 0 n + ` (s,a) + 7 log 8Sn + ` 2 (s,a) δ 0 with probability at least 1− δ 0 /2 for all s 0 ∈ S + and ` ≥ 1, simultaneously. Choose δ 0 =δ/SAn + ` (s,a) and use S≥ 2, A≥ 2 to complete the proof. Lemma 4.8 (Theorem D.3 (Anytime Bernstein) of Rosenberg et al. (2020)). Let (X n ) ∞ n=1 be a sequence of independent and identically distributed random variables with expectation μ. Suppose that 0≤ X n ≤ B almost surely. Then with probability at least 1−δ, the following holds for all n≥ 1 simultaneously: n X i=1 (X i −μ) ≤ 2 v u u t B n X i=1 X i log 2n δ + 7B log 2n δ . Now, by rewriting the sum in (4.10) over epochs, we have E T M X t=1 1 {θ∗(·|st,at)/ ∈B ` (st,at)} + 1 {θ ` (·|st,at)/ ∈B ` (st,at)} =E L M X `=1 t `+1 −1 X t=t ` 1 {θ∗(·|st,at)/ ∈B ` (st,at)} + 1 {θ ` (·|st,at)/ ∈B ` (st,at)} = X s,a E L M X `=1 t `+1 −1 X t=t ` 1 {st=s,at=a} 1 {θ∗(·|s,a)/ ∈B ` (s,a)} + 1 {θ ` (·|s,a)/ ∈B ` (s,a)} = X s,a E L M X `=1 n t `+1 (s,a)−n t ` (s,a) 1 {θ∗(·|s,a)/ ∈B ` (s,a)} + 1 {θ ` (·|s,a)/ ∈B ` (s,a)} . Note that n t `+1 (s,a)−n t ` (s,a)≤n t ` (s,a) + 1 by the second stopping criterion. Moreover, observe thatB ` (s,a) isF t ` measurable. Thus, it follows from the property of posterior sam- pling (Lemma 4.1) that E[1 {θ ` (·|s,a)/ ∈B ` (s,a)} |F t ` ] = E[1 {θ∗(·|s,a)/ ∈B ` (s,a)} |F t ` ] = P(θ ∗ (·|s,a) / ∈ 54 B ` (s,a)|F t ` )≤ δ/(2SAn + ` (s,a)), where the inequality is by Lemma 4.7. 
Using Monotone Convergence Theorem and that 1 {m(t ` )≤M} isF t ` measurable, we can write X s,a E L M X `=1 n t `+1 (s,a)−n t ` (s,a) 1 {θ∗(·|s,a)/ ∈B ` (s,a)} + 1 {θ ` (·|s,a)/ ∈B ` (s,a)} ≤ X s,a ∞ X `=1 E h 1 {m(t ` )≤M} (n t ` (s,a) + 1)E h 1 {θ∗(·|s,a)/ ∈B ` (s,a)} + 1 {θ ` (·|s,a)/ ∈B ` (s,a)} |F t ` ii ≤ X s,a ∞ X `=1 E " 1 {m(t ` )≤M} (n t ` (s,a) + 1) δ SAn + ` (s,a) # ≤ 2δE[L M ], where the last inequality is byn t ` (s,a)+1≤ 2n + ` (s,a) and Monotone Convergence Theorem. We proceed by bounding (4.8). Denote by t m the start time of interval m, define t M+1 := T M + 1, and rewrite the sum in (4.8) over intervals to get E T M X t=1 q A ` (s t ,a t )V ` (s t ,a t )1 Ω ` s t ,a t = M X m=1 E t m+1 −1 X t=tm q A ` (s t ,a t )V ` (s t ,a t )1 Ω ` s t ,a t Applying Cauchy-Schwarz twice on the inner expectation implies E t m+1 −1 X t=tm q A ` (s t ,a t )V ` (s t ,a t )1 Ω ` s t ,a t ≤E v u u t t m+1 −1 X t=tm A ` (s t ,a t )· v u u t t m+1 −1 X t=tm V ` (s t ,a t )1 Ω ` s t ,a t ≤ v u u u tE t m+1 −1 X t=tm A ` (s t ,a t ) · v u u u tE t m+1 −1 X t=tm V ` (s t ,a t )1 Ω ` s t ,a t ≤ 7B ? v u u u tE t m+1 −1 X t=tm A ` (s t ,a t ) , where the last inequality is by Lemma 4.10. Summing over M intervals and applying Cauchy-Schwarz, we get M X m=1 E t m+1 −1 X t=tm q A ` (s t ,a t )V ` (s t ,a t )1 Ω ` s t ,a t ≤ 7B ? M X m=1 v u u u tE t m+1 −1 X t=tm A ` (s t ,a t ) ≤ 7B ? v u u u tM M X m=1 E t m+1 −1 X t=tm A ` (s t ,a t ) 55 = 7B ? v u u u tME T M X t=1 A ` (s t ,a t ) ≤ 18B ? s MSAE log 2 SAT M δ , where the last inequality follows from Lemma 4.9. Substituting these bounds in (4.8), (4.9), (4.10), concavity of log 2 x for x≥ 3, and applying Jensen’s inequality completes the proof. Lemma 4.9. P T M t=1 A ` (s t ,a t )≤ 6SA log 2 (SAT M /δ). Proof. RecallA ` (s,a) = log(SAn + ` (s,a)/δ) n + ` (s,a) . Denote byL := log(SAT M /δ), an upper bound on the numerator of A ` (s t ,a t ). 
we have T M X t=1 A ` (s t ,a t )≤ T M X t=1 L n + ` (s t ,a t ) =L X s,a T M X t=1 1 {st=s,at=a} n + ` (s,a) ≤ 2L X s,a T M X t=1 1 {st=s,at=a} n + t (s,a) = 2L X s,a 1 {n T M +1 (s,a)>0} + 2L X s,a n T M +1 (s,a)−1 X j=1 1 j ≤ 2LSA + 2L X s,a (1 + logn T M +1 (s,a)) ≤ 4LSA + 2LSA logT M ≤ 6LSA logT M . Here the second inequality is byn + ` (s,a)≥ 0.5n + t (s,a) (the second criterion in determining the epoch length), the third inequality is by P n x=1 1/x≤ 1+logn, and the fourth inequality is by n T M +1 (s,a)≤T M . The proof is complete by noting that logT M ≤L. Lemma 4.10. For any interval m, E[ P t m+1 −1 t=tm V ` (s t ,a t )1 Ω `]≤ 44B 2 ? . Proof. To proceed with the proof, we need the following two technical lemmas. Lemma 4.11. Let (s,a) be a known state-action pair and m be an interval. If Ω ` s,a holds, then for any state s 0 ∈S + , θ ∗ (s 0 |s,a)−θ ` (s 0 |s,a) ≤ 1 8 s c min θ ∗ (s 0 |s,a) SB ? + c min 4SB ? . Proof. From Lemma 4.6, we know that if Ω ` s,a holds, then θ ∗ (s 0 |s,a)−θ ` (s 0 |s,a) ≤ 8 q θ ∗ (s 0 |s,a)A ` (s,a) + 136A ` (s,a), 56 withA ` (s,a) = log(SAn + ` (s,a)/δ) n + ` (s,a) . The proof is complete by noting that log(x)/x is decreasing, and thatn + ` (s,a)≥α· B?S c min log B?SA δc min for some large enough constantα since (s,a) is known. Lemma 4.12 (Lemma B.15. of Rosenberg et al. (2020)). Let (X t ) ∞ t=1 be a martingale dif- ference sequence adapted to the filtration (F t ) ∞ t=0 . Let Y n = ( P n t=1 X t ) 2 − P n t=1 E[X 2 t |F t−1 ]. Then (Y n ) ∞ n=0 is a martingale, and in particular if τ is a stopping time such that τ ≤ c almost surely, then E[Y τ ] = 0. By the definition of the intervals, all the state-action pairs within an interval except possibly the first one are known. Therefore, we bound E t m+1 −1 X t=tm V ` (s t ,a t )1 Ω ` s t ,a t F tm =E h V ` (s tm ,a tm )1 Ω ` s t ,a t |F tm i +E t m+1 −1 X t=tm+1 V ` (s t ,a t )1 Ω ` s t ,a t F tm . The first summand is upper bounded by B 2 ? . 
To bound the second term, define Z t ` := [V (s 0 t ;θ ` )− P s 0 ∈S θ ∗ (s 0 |s t ,a t )V (s 0 ;θ ` )]1 Ω ` s t ,a t . Conditioned onF tm ,θ ∗ andθ ` , (Z t ` ) t≥tm con- stitutes a martingale difference sequence with respect to the filtration (F m t+1 ) t≥tm , where F m t is the sigma algebra generated by{(s tm ,a tm ),··· , (s t ,a t )}. Moreover, t m+1 − 1 is a stopping time with respect to (F m t+1 ) t≥tm and is bounded by t m + 2B ? /c min . Therefore, Lemma 4.12 implies that E t m+1 −1 X t=tm+1 V ` (s t ,a t )1 Ω ` s t ,a t F tm ,θ ∗ ,θ ` =E t m+1 −1 X t=tm+1 Z t ` 1 Ω ` s t ,a t 2 F tm ,θ ∗ ,θ ` . (4.11) We proceed by bounding P t m+1 −1 t=tm+1 Z t ` 1 Ω ` s t ,a t in terms of P t m+1 −1 t=tm+1 V ` (s t ,a t )1 Ω ` s t ,a t and com- bine with the left hand side to complete the proof. We have t m+1 −1 X t=tm+1 Z t ` 1 Ω ` s t ,a t = t m+1 −1 X t=tm+1 V (s 0 t ;θ ` )− X s 0 ∈S θ ∗ (s 0 |s t ,a t )V (s 0 ;θ ` ) 1 Ω ` s t ,a t ≤ t m+1 −1 X t=tm+1 V (s 0 t ;θ ` )−V (s t ;θ ` ) (4.12) + t m+1 −1 X t=tm+1 V (s t ;θ ` )− X s 0 ∈S θ ` (s 0 |s t ,a t )V (s 0 ;θ ` ) (4.13) + t m+1 −1 X t=tm+1 X s 0 ∈S + θ ` (s 0 |s t ,a t )−θ ∗ (s 0 |s t ,a t ) V (s 0 ;θ ` )− X s 00 ∈S + θ ∗ (s 00 |s t ,a t )V (s 00 ;θ ` ) 1 Ω ` s t ,a t . (4.14) 57 where (4.14) is by the fact that θ ` (·|s t ,a t ),θ ∗ (·|s t ,a t ) are probability distributions and P s 00 ∈S +θ ∗ (s 00 |s t ,a t )V (s 00 ;θ ` ) is independent of s 0 and V (g;θ ` ) = 0. (4.12) is a telescopic sum (recall that s t+1 = s 0 t if s 0 t 6= g) and is bounded by B ? . It follows from the Bellman equation that (4.13) is equal to P t m+1 −1 t=tm+1 c(s t ,a t ). By definition, the interval ends as soon as the cost accumulates to B ? during the interval. Moreover, since V (·;θ ` )≤ B ? , the al- gorithm does not choose an action with instantaneous cost more than B ? . This implies that P t m+1 −1 t=tm+1 c(s t ,a t )≤ 2B ? . 
To bound (4.14) we use the Bernstein confidence set, but taking into account that all the state-action pairs in the summation are known, we can use Lemma 4.11 to obtain X s 0 ∈S + θ ` (s 0 |s t ,a t )−θ ∗ (s 0 |s t ,a t ) V (s 0 ;θ ` )− X s 00 ∈S + θ ∗ (s 00 |s t ,a t )V (s 00 ;θ ` ) 1 Ω ` s t ,a t ≤ X s 0 ∈S + 1 8 v u u t c min θ ∗ (s 0 |s t ,a t ) (V (s 0 ;θ ` )− P s 00 ∈S +θ ∗ (s 00 |s t ,a t )V (s 00 ;θ ` )) 2 1 Ω ` s t ,a t SB ? + X s 0 ∈S + c min 4SB ? V (s 0 ;θ ` )− X s 00 ∈S + θ ∗ (s 00 |s t ,a t )V (s 00 ;θ ` ) ≤ 1 4 s c min V ` (s t ,a t )1 Ω ` s t ,a t B ? + c(s t ,a t ) 2 . The last inequality follows from Cauchy-Schwarz inequality,|S + |≤ 2S,|V (·;θ ` )|≤B ? , and c min ≤c(s t ,a t ). Summing over the time steps in interval m and applying Cauchy-Schwarz, we get t m+1 −1 X t=tm+1 1 4 s c min V ` (s t ,a t )1 Ω ` s t ,a t B ? + c(s t ,a t ) 2 ≤ 1 4 v u u t (t m+1 −t m ) c min P t m+1 −1 t=tm+1 V ` (s t ,a t )1 Ω ` s t ,a t B ? + P t m+1 −1 t=tm+1 c(s t ,a t ) 2 ≤ 1 4 v u u u t2 t m+1 −1 X t=tm+1 V ` (s t ,a t )1 Ω ` s t ,a t +B ? . The last inequality follows from the fact that duration of interval m is at most 2B ? /c min and its cumulative cost is at most 2B ? . Substituting these bounds into (4.11) implies that E t m+1 −1 X t=tm+1 V ` (s t ,a t )1 Ω ` s t ,a t F tm ,θ ∗ ,θ ` ≤E 4B ? + 1 4 v u u u t2 t m+1 −1 X t=tm+1 V ` (s t ,a t )1 Ω ` s t ,a t 2 F tm ,θ ∗ ,θ ` ≤ 32B 2 ? + 1 4 E t m+1 −1 X t=tm+1 V ` (s t ,a t )1 Ω ` s t ,a t F tm ,θ ∗ ,θ ` , 58 where the last inequality is by (a +b) 2 ≤ 2(a 2 +b 2 ) withb = 1 4 r 2 P t m+1 −1 t=tm+1 V ` (s t ,a t )1 Ω ` s t ,a t anda = 4B ? . Rearranging implies thatE h P t m+1 −1 t=tm+1 V ` (s t ,a t )1 Ω ` s t ,a t |F tm ,θ ∗ ,θ ` i ≤ 43B 2 ? and the proof is complete. 4.E Proof of Theorem 4.1 Theorem (restatement of Theorem 4.1). Suppose Assumptions 4.1 and 4.2 hold. Then, the regret bound of the PSRL-SSP algorithm is bounded as R K =O B ? S √ KAL 2 +S 2 A s B ? 3 c min L 2 , where L = log(B ? SAKc −1 min ). 
Proof. Denote by $C_M$ the total cost after $M$ intervals. Recall that
\[
\mathbb E[C_M]=K\,\mathbb E[V(s_{\mathrm{init}};\theta^*)]+R_M=K\,\mathbb E[V(s_{\mathrm{init}};\theta^*)]+R_M^1+R_M^2+R_M^3.
\]
Using Lemmas 4.3, 4.4, and 5.8 with $\delta=1/K$ gives
\[
\mathbb E[C_M]\le K\,\mathbb E[V(s_{\mathrm{init}};\theta^*)]+O\Big(B_\star\,\mathbb E[L_M]+B_\star S\sqrt{MA\log^2(SAK\,\mathbb E[T_M])}+B_\star S^2A\log^2(SAK\,\mathbb E[T_M])\Big). \tag{4.15}
\]
Recall that $L_M\le\sqrt{2SAK\log T_M}+SA\log T_M$. Taking expectation from both sides and using Jensen's inequality gives
\[
\mathbb E[L_M]\le\sqrt{2SAK\log\mathbb E[T_M]}+SA\log\mathbb E[T_M].
\]
Moreover, taking expectation from both sides of (4.3), plugging in the bound on $\mathbb E[L_M]$, and using concavity of $\log(x)$ implies
\[
M\le\frac{\mathbb E[C_M]}{B_\star}+K+\sqrt{2SAK\log\mathbb E[T_M]}+SA\log\mathbb E[T_M]+O\Big(\frac{B_\star S^2A}{c_{\min}}\log\frac{B_\star KSA}{c_{\min}}\Big).
\]
Substituting this bound in (4.15), using subadditivity of the square root, and simplifying yields
\[
\mathbb E[C_M]\le K\,\mathbb E[V(s_{\mathrm{init}};\theta^*)]+O\Bigg(B_\star S\sqrt{KA\log^2(SAK\,\mathbb E[T_M])}+S\sqrt{B_\star\,\mathbb E[C_M]\,A\log^2(SAK\,\mathbb E[T_M])}
\]
\[
\qquad+B_\star S^{\frac54}A^{\frac34}K^{\frac14}\log^{\frac54}(SAK\,\mathbb E[T_M])+S^2A\sqrt{\frac{B_\star^3}{c_{\min}}\log^3\frac{B_\star SAK\,\mathbb E[T_M]}{c_{\min}}}\Bigg).
\]
Solving for $\mathbb E[C_M]$ (by using the primary inequality that $x\le a\sqrt x+b$ implies $x\le(a+\sqrt b)^2$ for $a,b>0$), using $K\ge S^2A$, $V(s_{\mathrm{init}};\theta^*)\le B_\star$, and simplifying the result gives
\[
\mathbb E[C_M]\le\Bigg(O\Big(S\sqrt{B_\star A\log^2(SAK\,\mathbb E[T_M])}\Big)+\sqrt{K\,\mathbb E[V(s_{\mathrm{init}};\theta^*)]+O\Big(B_\star S\sqrt{KA}\log^{2.5}(SAK\,\mathbb E[T_M])+S^2A\sqrt{\tfrac{B_\star^3}{c_{\min}}\log^3\tfrac{B_\star SAK\,\mathbb E[T_M]}{c_{\min}}}\Big)}\Bigg)^2
\]
\[
\le O\Big(B_\star S^2A\log^2\frac{SA\,\mathbb E[T_M]}{\delta}\Big)+K\,\mathbb E[V(s_{\mathrm{init}};\theta^*)]
+O\Bigg(B_\star S\sqrt{KA}\log^{2.5}(SAK\,\mathbb E[T_M])+S^2A\sqrt{\frac{B_\star^3}{c_{\min}}\log^3\frac{B_\star SAK\,\mathbb E[T_M]}{c_{\min}}}
\]
\[
\qquad+B_\star S\sqrt{KA\log^4(SAK\,\mathbb E[T_M])}+S^2A\Big(\frac{B_\star^5}{c_{\min}}\log^7\frac{B_\star SAK\,\mathbb E[T_M]}{c_{\min}}\Big)^{\frac14}\Bigg)
\]
\[
\le K\,\mathbb E[V(s_{\mathrm{init}};\theta^*)]+O\Bigg(B_\star S\sqrt{KA\log^4(SAK\,\mathbb E[T_M])}+S^2A\sqrt{\frac{B_\star^3}{c_{\min}}\log^4\frac{B_\star SAK\,\mathbb E[T_M]}{c_{\min}}}\Bigg). \tag{4.16}
\]
Note that by simplifying this bound, we can write $\mathbb E[C_M]\le O\big(\sqrt{B_\star^3S^4A^2K^2\,\mathbb E[T_M]/c_{\min}}\big)$. On the other hand, we have $c_{\min}T_M\le C_M$, which implies $\mathbb E[T_M]\le\mathbb E[C_M]/c_{\min}$. Isolating $\mathbb E[T_M]$ implies $\mathbb E[T_M]\le O\big(B_\star^3S^4A^2K^2/c_{\min}^3\big)$. Substituting this bound into (4.16) yields
\[
\mathbb E[C_M]\le K\,\mathbb E[V(s_{\mathrm{init}};\theta^*)]+O\Bigg(B_\star S\sqrt{KA\log^4\frac{B_\star SAK}{c_{\min}}}+S^2A\sqrt{\frac{B_\star^3}{c_{\min}}\log^4\frac{B_\star SAK}{c_{\min}}}\Bigg).
\]
We note that this bound holds for any number $M$ of intervals as long as the $K$ episodes have not elapsed. Since $c_{\min}>0$, the $K$ episodes eventually terminate and the claimed bound of the theorem for $R_K$ holds.

4.F Proof of Theorem 4.2

Theorem (restatement of Theorem 4.2). Suppose Assumption 4.1 holds. Running the PSRL-SSP algorithm with costs $c_\epsilon(s,a):=\max\{c(s,a),\epsilon\}$ for $\epsilon=(S^2A/K)^{2/3}$ yields
\[
R_K=O\Big(B_\star S\sqrt{KA}\,\tilde L^2+(S^2A)^{\frac23}K^{\frac13}\big(B_\star^{\frac32}\tilde L^2+T_\star\big)+S^2AT_\star^{\frac32}\tilde L^2\Big),
\]
where $\tilde L:=\log(KB_\star T_\star SA)$ and $T_\star$ is an upper bound on the expected time the optimal policy takes to reach the goal from any initial state.

Proof. Denote by $T_K$ the time to complete the $K$ episodes if the algorithm runs with the perturbed costs $c_\epsilon(s,a)$, and let $V_\epsilon(s_{\mathrm{init}};\theta^*)$, $V_\epsilon^\pi(s_{\mathrm{init}};\theta^*)$ be the optimal value function and the value function of policy $\pi$ in the SSP with cost function $c_\epsilon(s,a)$ and transition kernel $\theta^*$. We can write
\[
R_K=\mathbb E\Big[\sum_{t=1}^{T_K}c(s_t,a_t)-KV(s_{\mathrm{init}};\theta^*)\Big]
\le\mathbb E\Big[\sum_{t=1}^{T_K}c_\epsilon(s_t,a_t)-KV(s_{\mathrm{init}};\theta^*)\Big]
\]
\[
=\mathbb E\Big[\sum_{t=1}^{T_K}c_\epsilon(s_t,a_t)-KV_\epsilon(s_{\mathrm{init}};\theta^*)\Big]+K\,\mathbb E\big[V_\epsilon(s_{\mathrm{init}};\theta^*)-V(s_{\mathrm{init}};\theta^*)\big]. \tag{4.17}
\]
Theorem 4.1 implies that the first term is bounded by
\[
\mathbb E\Big[\sum_{t=1}^{T_K}c_\epsilon(s_t,a_t)-KV_\epsilon(s_{\mathrm{init}};\theta^*)\Big]=O\Big(B_\star^\epsilon S\sqrt{KA}\,L_\epsilon^2+S^2A\sqrt{\frac{(B_\star^\epsilon)^3}{\epsilon}}\,L_\epsilon^2\Big),
\]
with $L_\epsilon=\log(B_\star^\epsilon SAK/\epsilon)$ and $B_\star^\epsilon\le B_\star+\epsilon T_\star$ (to see this, note that $V_\epsilon(s;\theta^*)\le V_\epsilon^{\pi^*}(s;\theta^*)\le B_\star+\epsilon T_\star$). To bound the second term of (4.17), we have
\[
V_\epsilon(s_{\mathrm{init}};\theta^*)\le V_\epsilon^{\pi^*}(s_{\mathrm{init}};\theta^*)\le V(s_{\mathrm{init}};\theta^*)+\epsilon T_\star.
\]
Combining these bounds, we can write
\[
R_K=O\Big(B_\star S\sqrt{KA}\,L_\epsilon^2+\epsilon T_\star S\sqrt{KA}\,L_\epsilon^2+S^2A\sqrt{\frac{(B_\star+\epsilon T_\star)^3}{\epsilon}}\,L_\epsilon^2+\epsilon KT_\star\Big).
\]
Substituting $\epsilon=(S^2A/K)^{2/3}$ and simplifying the result with $K\ge S^2A$ and $B_\star\le T_\star$ (since $c(s,a)\le 1$) implies
\[
R_K=O\Big(B_\star S\sqrt{KA}\,\tilde L^2+(S^2A)^{\frac23}K^{\frac13}\big(B_\star^{\frac32}\tilde L^2+T_\star\big)+S^2AT_\star^{\frac32}\tilde L^2\Big),
\]
where $\tilde L=\log(KB_\star T_\star SA)$. This completes the proof.
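The whole construction in Theorem 4.2 rests on the simple perturbation $c_\epsilon(s,a)=\max\{c(s,a),\epsilon\}$ with $\epsilon=(S^2A/K)^{2/3}$, which enforces a strictly positive minimum cost so that Theorem 4.1 applies. A minimal sketch of this step in Python (the function name and array layout are illustrative, not part of the thesis):

```python
import numpy as np

def perturb_costs(c, S, A, K):
    """Cost perturbation from Theorem 4.2: c_eps(s,a) = max{c(s,a), eps}
    with eps = (S^2 A / K)^(2/3). The perturbed instance has minimum
    cost at least eps > 0, at the price of an extra eps*K*T_star term.

    c: cost matrix of shape (S, A) with entries in [0, 1].
    """
    eps = (S**2 * A / K) ** (2.0 / 3.0)
    return np.maximum(c, eps), eps

# Example: K >= S^2 A guarantees eps <= 1.
c = np.array([[0.0, 0.5],
              [1.0, 0.2]])
c_eps, eps = perturb_costs(c, S=2, A=2, K=800)
assert 0 < eps <= 1
assert (c_eps >= eps).all() and (c_eps >= c).all()
```

The choice of exponent $2/3$ balances the $\epsilon K T_\star$ overestimation term against the $\sqrt{1/\epsilon}$ factor inherited from Theorem 4.1, which is what produces the $K^{1/3}$ term in the final bound.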
Chapter 5

Online Learning for Unknown Partially Observable MDPs

5.1 Introduction

Markov Decision Processes (MDPs) assume that the state is perfectly observable by the agent and that the only uncertainty is about the underlying dynamics of the environment. However, in many real-world scenarios such as robotics, healthcare, and finance, the state is not fully observed by the agent, and only a partial observation is available. These scenarios are modeled by Partially Observable Markov Decision Processes (POMDPs). In addition to the uncertainty in the environment dynamics, the agent has to deal with the uncertainty about the underlying state. It is well known (Kumar and Varaiya, 2015) that introducing an information or belief state (a posterior distribution over the states given the history of observations and actions) allows the POMDP to be recast as an MDP over the belief state space. The resulting algorithm requires a posterior update of the belief state, which needs the transition and observation models to be fully known. This presents a significant difficulty when the model parameters are unknown. Thus, managing the exploration-exploitation trade-off for POMDPs is a significant challenge, and to the best of our knowledge, no online RL algorithm with sub-linear regret is known.

In this chapter, we consider infinite-horizon average-cost POMDPs with finite states, actions, and observations. The underlying state transition dynamics are unknown, though we assume the observation kernel to be known. We propose a Posterior Sampling Reinforcement Learning algorithm (PSRL-POMDP) and prove that it achieves a Bayesian expected regret bound of $O(\log T)$ in the finite (transition kernel) parameter set case, where $T$ is the time horizon. We then show that in the general (continuous parameter set) case, it achieves $\tilde O(T^{2/3})$ under some technical assumptions.
The PSRL-POMDP algorithm is a natural extension of the TSDE algorithm for MDPs (Ouyang et al., 2017b), with two main differences. First, in addition to the posterior distribution on the environment dynamics, the algorithm maintains a posterior distribution on the underlying state. Second, since the state is not fully observable, the agent cannot keep track of the number of visits to state-action pairs, a quantity that is crucial in the design of algorithms for tabular MDPs. Instead, we introduce a notion of pseudo count and carefully handle its relation with the true counts to obtain sub-linear regret. To the best of our knowledge, PSRL-POMDP is the first online RL algorithm for POMDPs with sub-linear regret.

Related Literature

We review the related literature in two main domains: efficient exploration for MDPs, and learning in POMDPs.

Efficient exploration in MDPs. To balance exploration and exploitation, two general techniques are used in basic tabular MDPs: optimism in the face of uncertainty (OFU), and posterior sampling. Under the OFU technique, the agent constructs a confidence set around the system parameters, selects an optimistic parameter associated with the minimum cost from the confidence set, and takes actions with respect to the optimistic parameter. This principle is widely used in the literature to achieve optimal regret bounds (Bartlett and Tewari, 2009; Jaksch et al., 2010; Azar et al., 2017; Fruit et al., 2018b; Jin et al., 2018; Zhang and Ji, 2019; Zanette and Brunskill, 2019; Wei et al., 2020; Chen et al., 2021a). An alternative technique to encourage exploration is posterior sampling (Thompson, 1933).
In this approach, the agent maintains a posterior distribution over the system parameters, samples a parameter from the posterior distribution, and takes actions with respect to the sampled parameter (Strens, 2000; Osband et al., 2013; Fonteneau et al., 2013; Gopalan and Mannor, 2015; Ouyang et al., 2017b; Jafarnia-Jahromi et al., 2021). In particular, Ouyang et al. (2017b) propose TSDE, a posterior sampling-based algorithm for infinite-horizon average-cost MDPs. Extending these results to continuous state MDPs has recently been addressed with general function approximation (Osband and Van Roy, 2014; Dong et al., 2020; Ayoub et al., 2020; Wang et al., 2020), or in the special cases of linear function approximation (Abbasi-Yadkori et al., 2019a,b; Jin et al., 2020; Hao et al., 2020; Wei et al., 2021; Wang et al., 2021) and Linear Quadratic Regulators (Ouyang et al., 2017a; Dean et al., 2018; Cohen et al., 2019; Mania et al., 2019; Simchowitz and Foster, 2020; Lale et al., 2020a). In general, POMDPs can be formulated as continuous state MDPs by considering the belief as the state. However, computing the belief requires knowledge of the model parameters, which is unavailable in the RL setting. Hence, learning algorithms for continuous state MDPs cannot be directly applied to POMDPs.

Learning in POMDPs. To the best of our knowledge, the only existing work with regret analysis in POMDPs is Azizzadenesheli et al. (2017). However, their definition of regret is not with respect to the optimal policy, but with respect to the best memoryless policy (a policy that maps the current observation to an action). With our natural definition of regret, their algorithm suffers linear regret.
Other learning algorithms for POMDPs either consider linear dynamics (Lale et al., 2020b; Tsiamis and Pappas, 2020) or do not consider regret (Shani et al., 2005; Ross et al., 2007; Poupart and Vlassis, 2008; Cai et al., 2009; Liu et al., 2011, 2013; Doshi-Velez et al., 2013; Katt et al., 2018; Azizzadenesheli et al., 2018), and are not directly comparable to our setting. Subsequent to our work, Xiong et al. (2021) also proved a regret bound of $\tilde O(T^{2/3})$ in infinite-horizon average-cost POMDPs with an OFU-type algorithm. Their approach is based on spectral method-of-moments estimation for hidden Markov models and uses a different set of assumptions.

5.2 Preliminaries

An infinite-horizon average-cost Partially Observable Markov Decision Process (POMDP) can be specified by $(\mathcal S,\mathcal A,\theta,C,\mathcal O,\eta)$, where $\mathcal S$ is the state space, $\mathcal A$ is the action space, $C:\mathcal S\times\mathcal A\to[0,1]$ is the cost function, and $\mathcal O$ is the set of observations. Here $\eta:\mathcal S\to\Delta_{\mathcal O}$ is the observation kernel and $\theta:\mathcal S\times\mathcal A\to\Delta_{\mathcal S}$ is the transition kernel, such that $\eta(o|s)=\mathbb P(o_t=o|s_t=s)$ and $\theta(s'|s,a)=\mathbb P(s_{t+1}=s'|s_t=s,a_t=a)$, where $o_t\in\mathcal O$, $s_t\in\mathcal S$ and $a_t\in\mathcal A$ are the observation, state and action at time $t=1,2,3,\cdots$. Here, for a finite set $\mathcal X$, $\Delta_{\mathcal X}$ is the set of all probability distributions on $\mathcal X$. We assume that the state space, the action space and the observation set are finite, with sizes $|\mathcal S|$, $|\mathcal A|$, $|\mathcal O|$, respectively.

Let $\mathcal F_t$ be the information available at time $t$ (prior to action $a_t$), i.e., the sigma algebra generated by the history of actions and observations $a_1,o_1,\cdots,a_{t-1},o_{t-1},o_t$, and let $\mathcal F_{t+}$ be the information after choosing action $a_t$. Unlike in MDPs, the state is not observable by the agent and the optimal policy cannot be a function of the state. Instead, the agent maintains a belief $h_t(\cdot;\theta)\in\Delta_{\mathcal S}$, given by $h_t(s;\theta):=\mathbb P(s_t=s|\mathcal F_t;\theta)$, as a sufficient statistic for the history of observations and actions. Here we use the notation $h_t(\cdot;\theta)$ to explicitly show the dependency of the belief on $\theta$.
After taking action $a_t$ and observing $o_{t+1}$, the belief $h_t$ can be updated as
\[
h_{t+1}(s';\theta)=\frac{\sum_s\eta(o_{t+1}|s')\theta(s'|s,a_t)h_t(s;\theta)}{\sum_{s'}\sum_s\eta(o_{t+1}|s')\theta(s'|s,a_t)h_t(s;\theta)}. \tag{5.1}
\]
This update rule is compactly denoted by $h_{t+1}(\cdot;\theta)=\tau(h_t(\cdot;\theta),a_t,o_{t+1};\theta)$, with the initial condition
\[
h_1(s;\theta)=\frac{\eta(o_1|s)h(s)}{\sum_s\eta(o_1|s)h(s)},
\]
where $h(\cdot)$ is the distribution of the initial state $s_1$ (denoted by $s_1\sim h$). A deterministic stationary policy $\pi:\Delta_{\mathcal S}\to\mathcal A$ maps a belief to an action. The long-term average cost of a policy $\pi$ is defined as
\[
J_\pi(h;\theta):=\limsup_{T\to\infty}\frac1T\sum_{t=1}^T\mathbb E\Big[C\big(s_t,\pi(h_t(\cdot;\theta))\big)\Big]. \tag{5.2}
\]
Let $J(h,\theta):=\inf_\pi J_\pi(h,\theta)$ be the optimal long-term average cost, which in general may depend on the initial state distribution $h$, though we will assume it is independent of the initial distribution $h$ (and thus denote it by $J(\theta)$) and that the following Bellman equation holds:

Assumption 5.1 (Bellman optimality equation). There exist $J(\theta)\in\mathbb R$ and a bounded function $v(\cdot;\theta):\Delta_{\mathcal S}\to\mathbb R$ such that for all $b\in\Delta_{\mathcal S}$,
\[
J(\theta)+v(b;\theta)=\min_{a\in\mathcal A}\Big\{c(b,a)+\sum_{o\in\mathcal O}P(o|b,a;\theta)\,v(b';\theta)\Big\}, \tag{5.3}
\]
where $v$ is called the relative value function, $b'=\tau(b,a,o;\theta)$ is the updated belief, $c(b,a):=\sum_s C(s,a)b(s)$ is the expected cost, and $P(o|b,a;\theta)$ is the probability of observing $o$ in the next step, conditioned on the current belief $b$ and action $a$, i.e.,
\[
P(o|b,a;\theta)=\sum_{s'\in\mathcal S}\sum_{s\in\mathcal S}\eta(o|s')\theta(s'|s,a)b(s). \tag{5.4}
\]
Various conditions are known under which Assumption 5.1 holds, e.g., when all the entries of the transition and observation kernels are positive (Xiong et al., 2021), or when the MDP is weakly communicating (Bertsekas, 2017). Note that if Assumption 5.1 holds, the policy $\pi^*$ that minimizes the right-hand side of (5.3) is the optimal policy. More precisely,

Lemma 5.1. Suppose Assumption 5.1 holds. Then, the policy $\pi^*(\cdot,\theta):\Delta_{\mathcal S}\to\mathcal A$ given by
\[
\pi^*(b;\theta):=\arg\min_{a\in\mathcal A}\Big\{c(b,a)+\sum_{o\in\mathcal O}P(o|b,a;\theta)\,v(b';\theta)\Big\} \tag{5.5}
\]
is the optimal policy, with $J_{\pi^*}(h;\theta)=J(\theta)$ for all $h\in\Delta_{\mathcal S}$.
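The filtering update (5.1) is a standard predict-then-correct step: push the belief through the transition kernel for the chosen action, reweight by the observation likelihood, and renormalize. A minimal sketch in Python, assuming numpy array representations of the kernels (variable names are illustrative, not notation from the thesis):

```python
import numpy as np

def belief_update(h, a, o_next, theta, eta):
    """One step of the belief filter tau(h, a, o; theta) from Eq. (5.1).

    h:      current belief over states, shape (S,)
    a:      action index
    o_next: next observation index
    theta:  transition kernel, theta[s, a, s'] = P(s'|s,a), shape (S, A, S)
    eta:    observation kernel, eta[s, o] = P(o|s), shape (S, O)
    """
    # Predict: sum_s theta(s'|s,a) h(s) for every next state s'.
    predicted = h @ theta[:, a, :]
    # Correct: weight each s' by the likelihood of the new observation.
    unnormalized = eta[:, o_next] * predicted
    # Normalize over s' (denominator of (5.1)).
    return unnormalized / unnormalized.sum()

# Tiny sanity check on a 2-state, 1-action, 2-observation POMDP.
theta = np.array([[[0.9, 0.1]],
                  [[0.2, 0.8]]])          # theta[s, a, s']
eta = np.array([[0.8, 0.2],
                [0.3, 0.7]])              # eta[s, o]
h = np.array([0.5, 0.5])
h_next = belief_update(h, a=0, o_next=1, theta=theta, eta=eta)
assert np.isclose(h_next.sum(), 1.0)
```

Note that computing this update requires both $\theta$ and $\eta$; this is exactly why, in the learning setting with unknown $\theta$, the algorithm must carry one belief $h_t(\cdot;\theta)$ per candidate parameter.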
Note that if $v$ satisfies the Bellman equation, so does $v$ plus any constant. Therefore, without loss of generality, and since $v$ is bounded, we can assume that $\inf_{b\in\Delta_{\mathcal S}}v(b;\theta)=0$ and define the span of a POMDP as $\mathrm{sp}(\theta):=\sup_{b\in\Delta_{\mathcal S}}v(b;\theta)$. Let $\Theta_H$ be the class of POMDPs that satisfy Assumption 5.1 and have $\mathrm{sp}(\theta)\le H$ for all $\theta\in\Theta_H$. In Section 5.4, we consider a finite subset $\Theta\subseteq\Theta_H$ of POMDPs. In Section 5.5, the general class $\Theta=\Theta_H$ is considered.

The learning protocol. We consider the problem of an agent interacting with an unknown, randomly generated POMDP $\theta^*$, where $\theta^*\in\Theta$ is randomly generated according to the probability distribution $f(\cdot)$.¹ After the initial generation of $\theta^*$, it remains fixed, but unknown to the agent. The agent interacts with the POMDP $\theta^*$ over $T$ steps. Initially, the agent starts from state $s_1$, which is randomly generated according to the conditional probability mass function $h(\cdot;\theta^*)$. At time $t=1,2,3,\cdots,T$, the agent observes $o_t\sim\eta(\cdot|s_t)$, takes action $a_t$ and suffers cost $C(s_t,a_t)$. The environment then determines the next state $s_{t+1}$, which is randomly drawn from the probability distribution $\theta^*(\cdot|s_t,a_t)$. Note that although the cost function $C$ is assumed to be known, the agent cannot observe the value of $C(s_t,a_t)$, since the state $s_t$ is unknown to the agent. The goal of the agent is to minimize the expected cumulative regret defined as
\[
R_T:=\mathbb E_{\theta^*}\Big[\sum_{t=1}^T\big(C(s_t,a_t)-J(\theta^*)\big)\Big], \tag{5.6}
\]
where the expectation is with respect to the prior distribution $h(\cdot;\theta^*)$ for $s_1$, the randomness in the state transitions, and the randomness in the algorithm. Here, $\mathbb E_{\theta^*}[\cdot]$ is a shorthand for $\mathbb E[\cdot|\theta^*]$. In Section 5.4, a regret bound is provided on $R_T$; Section 5.5, however, considers $\mathbb E[R_T]$ (also called Bayesian regret) as the performance measure for the learning algorithm.

¹In Section 5.4, $f(\cdot)$ should be viewed as a probability mass function.
We note that Bayesian regret is widely considered in the MDP literature (Osband et al., 2013; Gopalan and Mannor, 2015; Ouyang et al., 2017b,a).

5.3 The PSRL-POMDP Algorithm

We propose a general Posterior Sampling Reinforcement Learning for POMDPs (PSRL-POMDP) algorithm (Algorithm 5) for both the finite-parameter and the general case. The algorithm maintains a joint distribution on the unknown parameter $\theta^*$ as well as on the state $s_t$. PSRL-POMDP takes the prior distributions $h$ and $f$ as input. At time $t$, the agent computes the posterior distribution $f_t(\cdot)$ on the unknown parameter $\theta^*$ as well as the posterior conditional probability mass function (pmf) $h_t(\cdot;\theta)$ on the state $s_t$ for $\theta\in\Theta$. Upon taking action $a_t$ and observing $o_{t+1}$, the posterior distribution at time $t+1$ can be updated by applying Bayes' rule as²
\[
f_{t+1}(\theta)=\frac{\sum_{s,s'}\eta(o_{t+1}|s')\theta(s'|s,a_t)h_t(s;\theta)f_t(\theta)}{\int_\theta\sum_{s,s'}\eta(o_{t+1}|s')\theta(s'|s,a_t)h_t(s;\theta)f_t(\theta)\,d\theta},\qquad
h_{t+1}(\cdot;\theta)=\tau\big(h_t(\cdot;\theta),a_t,o_{t+1};\theta\big), \tag{5.7}
\]
with the initial condition
\[
f_1(\theta)=\frac{\sum_s\eta(o_1|s)h(s;\theta)f(\theta)}{\int_\theta\sum_s\eta(o_1|s)h(s;\theta)f(\theta)\,d\theta},\qquad
h_1(s;\theta)=\frac{\eta(o_1|s)h(s;\theta)}{\sum_s\eta(o_1|s)h(s;\theta)}. \tag{5.8}
\]
Recall that $\tau(h_t(\cdot;\theta),a_t,o_{t+1};\theta)$ is a compact notation for (5.1). In the special case of perfect observation at time $t$, $h_t(s;\theta)=\mathbb 1(s_t=s)$ for all $\theta\in\Theta$ and $s\in\mathcal S$. Moreover, the update rule of $f_{t+1}$ reduces to that of fully observable MDPs (see Eq. (4) of Ouyang et al. (2017b)) in the special case of perfect observation at times $t$ and $t+1$.

Let $n_t(s,a)=\sum_{\tau=1}^{t-1}\mathbb 1(s_\tau=s,a_\tau=a)$ be the number of visits to state-action pair $(s,a)$ by time $t$. The count $n_t$ plays an important role in learning for MDPs (Jaksch et al., 2010; Ouyang et al., 2017b) and is one of the two criteria that determine the length of the episodes in the TSDE algorithm for MDPs (Ouyang et al., 2017b). However, in POMDPs, $n_t$ is not $\mathcal F_{(t-1)+}$-measurable, since the states are not observable.
Instead, let $\tilde n_t(s,a):=\mathbb E[n_t(s,a)|\mathcal F_{(t-1)+}]$, and define the pseudo-count $\tilde m_t$ as follows.

²When the parameter set is finite, $\int_\theta$ should be replaced with $\sum_\theta$.
Consequently, the input to the policy should also be the conditional belief with respect to the sampled θ k . A key factor in designing posterior sampling based algorithms is the design of episodes. Let T k denote the length of episode k. In PSRL-POMDP, a new episode starts if either t > SCHED(t k ,T k−1 ) or ˜ m t (s,a) > 2 ˜ m t k (s,a). In the finite parameter case (Section 5.4), we consider SCHED(t k ,T k−1 ) = 2t k and ˜ m t (s,a) = t. With these choices, the two criteria coincide and ensure that the start time and the length of the episodes are deterministic. In Section 5.5, we use SCHED(t k ,T k−1 ) =t k +T k−1 and ˜ m t (s,a) := max{ ˜ m t−1 (s,a),d˜ n t (s,a)e}. This guarantees thatT k ≤T k−1 +1 and ˜ m t (s,a)≤ 2 ˜ m t k (s,a). These criteria are previously introduced in the TSDE algorithm (Ouyang et al., 2017b) except that TSDE uses the true count n t rather than ˜ m t . 67 5.4 Finite-Parameter Case (|Θ|<∞) In this section, we consider Θ⊆ Θ H such that|Θ|<∞. When Θ is finite, the posterior dis- tribution concentrates on the true parameter exponentially fast if the transition kernels are separated enough (see Lemma 5.2). This allows us to achieve a regret bound ofO(H logT ). Leto 1:t ,a 1:t be shorthand for the history of observationso 1 ,··· ,o t and the history of actions a 1 ,··· ,a t , respectively. Letν o 1:t ,a 1:t θ (o) be the probability of observingo at timet + 1 if the action history is a 1:t , the observation history is o 1:t , and the underlying transition kernel is θ, i.e., ν o 1:t ,a 1:t θ (o) :=P(o t+1 =o|o 1:t ,a 1:t ,θ ∗ =θ). The distance betweenν o 1:t ,a 1:t θ andν o 1:t ,a 1:t γ is defined by Kullback Leibler (KL-) divergence as follows. 
For any $\theta,\gamma\in\Theta$, the Kullback–Leibler (KL) divergence between the probability distributions $\nu^{o_{1:t},a_{1:t}}_\theta$ and $\nu^{o_{1:t},a_{1:t}}_\gamma$ is given by
\[
K\big(\nu^{o_{1:t},a_{1:t}}_\theta\,\big\|\,\nu^{o_{1:t},a_{1:t}}_\gamma\big):=\sum_o\nu^{o_{1:t},a_{1:t}}_\theta(o)\log\frac{\nu^{o_{1:t},a_{1:t}}_\theta(o)}{\nu^{o_{1:t},a_{1:t}}_\gamma(o)}.
\]
It can be shown that $K(\nu^{o_{1:t},a_{1:t}}_\theta\|\nu^{o_{1:t},a_{1:t}}_\gamma)\ge 0$, with equality if and only if $\nu^{o_{1:t},a_{1:t}}_\theta=\nu^{o_{1:t},a_{1:t}}_\gamma$. Thus, the KL divergence can be thought of as a measure of the divergence of $\nu^{o_{1:t},a_{1:t}}_\gamma$ from $\nu^{o_{1:t},a_{1:t}}_\theta$. In this section, we need to assume that the transition kernels in $\Theta$ are distant enough in the following sense.

Assumption 5.2. There exist positive constants $\epsilon>0$ and $B>0$ such that for any time step $t$, any history of possible observations $o_{1:t}$ and actions $a_{1:t}$, and any two transition kernels $\theta,\gamma\in\Theta$ such that $\nu^{o_{1:t-1},a_{1:t-1}}_\gamma(o_t)>0$, we have $K(\nu^{o_{1:t},a_{1:t}}_\theta\|\nu^{o_{1:t},a_{1:t}}_\gamma)\ge\epsilon$ and $\nu^{o_{1:t-1},a_{1:t-1}}_\theta(o_t)/\nu^{o_{1:t-1},a_{1:t-1}}_\gamma(o_t)\le B$.

This assumption is similar to the one used in Kim (2017).

Theorem 5.1. Suppose Assumptions 5.1 and 5.2 hold. Then, the regret of Algorithm 5 with $\text{SCHED}(t_k,T_{k-1})=2t_k$ and $\tilde m_t(s,a)=t$ for all state-action pairs $(s,a)$ is bounded as
\[
R_T\le H\log T+\frac{4(H+1)}{(e^{-\beta}-1)^2},
\]
where $\beta>0$ is a universal constant defined in Lemma 5.2.

Observe that with $\text{SCHED}(t_k,T_{k-1})=2t_k$ and $\tilde m_t(s,a)=t$, the two stopping criteria in Algorithm 5 coincide and ensure that $T_k=2T_{k-1}$ with $T_0=1$. In other words, the length of the episodes grows exponentially: $T_k=2^k$.

Proof of Theorem 5.1

In this section, the proof of Theorem 5.1 is provided. A key factor in achieving an $O(H\log T)$ regret bound in the case of finitely many parameters is that the posterior distribution $f_t(\cdot)$ concentrates on the true $\theta^*$ exponentially fast.

Lemma 5.2. Suppose Assumption 5.2 holds. Then, there exist constants $\alpha>1$ and $\beta>0$ such that
\[
\mathbb E[1-f_t(\theta^*)|\theta^*]\le\alpha\exp(-\beta t).
\]
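The separation condition in Assumption 5.2 is stated in terms of the KL divergence between the one-step predictive observation distributions. A minimal sketch of this quantity for discrete distributions (names and test values are illustrative only):

```python
import math

def kl_divergence(p, q):
    """KL divergence K(p || q) = sum_o p(o) log(p(o)/q(o)) between two
    discrete next-observation distributions, given as lists of
    probabilities. Requires q(o) > 0 wherever p(o) > 0, as guaranteed
    in Assumption 5.2 by the bounded likelihood ratio B."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]
q = [0.4, 0.4, 0.2]
assert kl_divergence(p, q) >= 0.0        # Gibbs' inequality
assert abs(kl_divergence(p, p)) < 1e-12  # zero iff the distributions coincide
```

Assumption 5.2 asks that this quantity stay at least $\epsilon$ for every pair of distinct kernels in $\Theta$ and every reachable history, which is what drives the exponential posterior concentration in Lemma 5.2.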
Equipped with this lemma, we are now ready to prove Theorem 5.1.

Proof. Note that the regret $R_T$ can be decomposed as $R_T=H\mathbb E_{\theta^*}[K_T]+R_1+R_2+R_3$, where
\[
R_1:=\mathbb E_{\theta^*}\Big[\sum_{k=1}^{K_T}T_k\big(J(\theta_k)-J(\theta^*)\big)\Big],
\]
\[
R_2:=H\,\mathbb E_{\theta^*}\Big[\sum_{k=1}^{K_T}\sum_{t=t_k}^{t_{k+1}-1}\Big(\sum_{s'}\big|\theta^*(s'|s_t,a_t)-\theta_k(s'|s_t,a_t)\big|+\sum_s\big|h_t(s;\theta^*)-h_t(s;\theta_k)\big|\Big)\Big],
\]
\[
R_3:=\mathbb E_{\theta^*}\Big[\sum_{k=1}^{K_T}\sum_{t=t_k}^{t_{k+1}-1}\big(c(h_t(\cdot;\theta^*),a_t)-c(h_t(\cdot;\theta_k),a_t)\big)\Big].
\]
Note that the start times and lengths of the episodes in Algorithm 5 are deterministic with the choices of SCHED and $\tilde m_t$ in the statement of the theorem, i.e., $t_k$, $T_k$ and hence $K_T$ are deterministic. Note that if $\theta_k=\theta^*$, then $R_1=R_2=R_3=0$. Moreover, we have $J(\theta_k)-J(\theta^*)\le 1$, $\sum_{s'}|\theta^*(s'|s_t,a_t)-\theta_k(s'|s_t,a_t)|\le 2$, $\sum_s|h_t(s;\theta^*)-h_t(s;\theta_k)|\le 2$, and $c(h_t(\cdot;\theta^*),a_t)-c(h_t(\cdot;\theta_k),a_t)\le 1$. Therefore,
\[
R_1\le\mathbb E_{\theta^*}\Big[\sum_{k=1}^{K_T}T_k\mathbb 1(\theta_k\ne\theta^*)\Big]=\sum_{k=1}^{K_T}T_k\,\mathbb P_{\theta^*}(\theta_k\ne\theta^*),
\]
\[
R_2\le 4H\,\mathbb E_{\theta^*}\Big[\sum_{k=1}^{K_T}\sum_{t=t_k}^{t_{k+1}-1}\mathbb 1(\theta_k\ne\theta^*)\Big]=4H\sum_{k=1}^{K_T}T_k\,\mathbb P_{\theta^*}(\theta_k\ne\theta^*),
\]
\[
R_3\le\mathbb E_{\theta^*}\Big[\sum_{k=1}^{K_T}\sum_{t=t_k}^{t_{k+1}-1}\mathbb 1(\theta_k\ne\theta^*)\Big]=\sum_{k=1}^{K_T}T_k\,\mathbb P_{\theta^*}(\theta_k\ne\theta^*).
\]
Note that $\mathbb P_{\theta^*}(\theta_k\ne\theta^*)=\mathbb E_{\theta^*}[1-f_{t_k}(\theta^*)]\le\alpha\exp(-\beta t_k)$ by Lemma 5.2. Combining all these bounds, we can write
\[
R_T\le HK_T+(4H+2)\alpha\sum_{k=1}^{K_T}T_k\exp(-\beta t_k).
\]
With the episode schedule provided in the statement of the theorem, it is easy to check that $K_T=O(\log T)$. Let $n=2^{K_T}$ and write
\[
\sum_{k=1}^{K_T}T_k\exp(-\beta t_k)=\sum_{k=1}^{K_T}2^ke^{-\beta(2^k-1)}\le\sum_{j=2}^{n}je^{-\beta(j-1)}=\frac{d}{dx}\Big[\frac{x^{n+1}-1}{x-1}\Big]\bigg|_{x=e^{-\beta}}-1.
\]
The last equality is by the geometric series. Simplifying the derivative yields
\[
\frac{d}{dx}\Big[\frac{x^{n+1}-1}{x-1}\Big]\bigg|_{x=e^{-\beta}}=\frac{nx^{n+1}-(n+1)x^n+1}{(x-1)^2}\bigg|_{x=e^{-\beta}}
\le\frac{nx^n-(n+1)x^n+1}{(x-1)^2}\bigg|_{x=e^{-\beta}}
=\frac{-x^n+1}{(x-1)^2}\bigg|_{x=e^{-\beta}}
\le\frac{2}{(e^{-\beta}-1)^2}.
\]
Substituting these values implies $R_T\le H\log T+\frac{4(H+1)}{(e^{-\beta}-1)^2}$.

5.5 General Case ($|\Theta|=\infty$)

We now consider the general case, where the parameter set may be infinite; in particular, $\Theta=\Theta_H$, an uncountable set.
We make the following two technical assumptions on the belief and the transition kernel.

Assumption 5.3. Denote by $k(t)$ the episode at time $t$. The true conditional belief $h_t(\cdot;\theta^*)$ and the approximate conditional belief $h_t(\cdot;\theta_{k(t)})$ satisfy
\[
\mathbb E\Big[\sum_s\big|h_t(s;\theta^*)-h_t(s;\theta_{k(t)})\big|\Big]\le\frac{K_1(|\mathcal S|,|\mathcal A|,|\mathcal O|,\iota)}{\sqrt{t_{k(t)}}}, \tag{5.9}
\]
with probability at least $1-\delta$, for any $\delta\in(0,1)$. Here $K_1(|\mathcal S|,|\mathcal A|,|\mathcal O|,\iota)$ is a constant that is polynomial in its input parameters, and $\iota$ hides the logarithmic dependency on $|\mathcal S|,|\mathcal A|,|\mathcal O|,T,\delta$.

Assumption 5.3 states that the gap between the conditional posterior function for the sampled POMDP $\theta_k$ and that for the true POMDP $\theta^*$ decreases over the episodes, as better approximations of the true POMDP become available. There has been recent work on the computation of approximate information states as required in Assumption 5.3 (Subramanian et al., 2020).

Assumption 5.4. There exists an $\mathcal F_t$-measurable estimator $\hat\theta_t:\mathcal S\times\mathcal A\to\Delta_{\mathcal S}$ such that
\[
\sum_{s'}\big|\theta^*(s'|s,a)-\hat\theta_t(s'|s,a)\big|\le\frac{K_2(|\mathcal S|,|\mathcal A|,|\mathcal O|,\iota)}{\sqrt{\max\{1,\tilde m_t(s,a)\}}} \tag{5.10}
\]
with probability at least $1-\delta$, for any $\delta\in(0,1)$, uniformly for all $t=1,2,3,\cdots,T$, where $K_2(|\mathcal S|,|\mathcal A|,|\mathcal O|,\iota)$ is a constant that is polynomial in its input parameters and $\iota$ hides the logarithmic dependency on $|\mathcal S|,|\mathcal A|,|\mathcal O|,T,\delta$.

There has been extensive work on the estimation of the transition dynamics of MDPs, e.g., Grunewalder et al. (2012). Two examples where Assumptions 5.3 and 5.4 hold are:

• Perfect observation. In the case of perfect observation, where $h_t(s;\theta)=\mathbb 1(s_t=s)$, Assumption 5.3 is clearly satisfied. Moreover, with perfect observation, one can choose $\tilde m_t(s,a)=n_t(s,a)$ and select $\hat\theta_k(s'|s,a)=\frac{n_t(s,a,s')}{n_t(s,a)}$ to satisfy Assumption 5.4 (Jaksch et al., 2010; Ouyang et al., 2017b). Here $n_t(s,a,s')$ denotes the number of visits to $(s,a)$ such that the next state is $s'$, before time $t$.

• Finite-parameter case.
In the finite-parameter case, with the choice of $\tilde m_t(s,a)=t$ for all state-action pairs $(s,a)$ and $\text{SCHED}(t_k,T_{k-1})=t_k+T_{k-1}$ or $\text{SCHED}(t_k,T_{k-1})=2t_k$, both assumptions are satisfied (see Lemma 5.6 for details). Note that in this case a more refined analysis is performed in Section 5.4 to achieve an $O(H\log T)$ regret bound.

Now, we state the main result of this section.

Theorem 5.2. Under Assumptions 5.1, 5.3 and 5.4, running the PSRL-POMDP algorithm with $\text{SCHED}(t_k,T_{k-1})=t_k+T_{k-1}$ yields $\mathbb E[R_T]\le\tilde O\big(HK_2(|\mathcal S||\mathcal A|T)^{2/3}\big)$, where $K_2:=K_2(|\mathcal S|,|\mathcal A|,|\mathcal O|,\iota)$ is the constant in Assumption 5.4. The exact constants are known (see the proof and Appendix 5.C), though we have hidden the dependence above.

Proof Sketch of Theorem 5.2

We provide the proof sketch of Theorem 5.2 here. A key property of posterior sampling is that, conditioned on the information at time $t$, the sampled $\theta_t$ and the true $\theta^*$ have the same distribution (Osband et al., 2013; Russo and Van Roy, 2014). Since the episode start time $t_k$ is a stopping time with respect to the filtration $(\mathcal F_t)_{t\ge 1}$, we use a stopping-time version of this property:

Lemma 5.3 (Lemma 2 in Ouyang et al. (2017b)). For any measurable function $g$ and any $\mathcal F_{t_k}$-measurable random variable $X$, we have $\mathbb E[g(\theta_k,X)]=\mathbb E[g(\theta^*,X)]$.

Introducing the pseudo count $\tilde m_t(s,a)$ in the algorithm requires a novel analysis to achieve a low regret bound. The following key lemma states that the pseudo count $\tilde m_t$ cannot be much smaller than the true count $n_t$.

Lemma 5.4. Fix a state-action pair $(s,a)\in\mathcal S\times\mathcal A$. For any pseudo count $\tilde m_t$ and any $\alpha\in[0,1]$,
\[
\mathbb P\big(\tilde m_t(s,a)<\alpha n_t(s,a)\big)\le\alpha. \tag{5.11}
\]

Proof. We show that $\mathbb P\big(\tilde n_t(s,a)<\alpha n_t(s,a)\big)\le\alpha$. Since by definition $\tilde m_t(s,a)\ge\tilde n_t(s,a)$, the claim of the lemma follows. For any $\alpha\in[0,1]$,
\[
\tilde n_t(s,a)\,\mathbb 1\big(\alpha n_t(s,a)>\tilde n_t(s,a)\big)\le\alpha n_t(s,a). \tag{5.12}
\]
By taking the conditional expectation with respect to $\mathcal F_{(t-1)+}$ on both sides, and using the fact that $\mathbb E[n_t(s,a)|\mathcal F_{(t-1)+}]=\tilde n_t(s,a)$, we have
\[
\tilde n_t(s,a)\,\mathbb E\big[\mathbb 1\big(\alpha n_t(s,a)>\tilde n_t(s,a)\big)\,\big|\,\mathcal F_{(t-1)+}\big]\le\alpha\tilde n_t(s,a). \tag{5.13}
\]
We claim that
\[
\mathbb E\big[\mathbb 1\big(\alpha n_t(s,a)>\tilde n_t(s,a)\big)\,\big|\,\mathcal F_{(t-1)+}\big]\le\alpha,\quad\text{a.s.} \tag{5.14}
\]
If this claim is true, taking another expectation on both sides completes the proof. To prove the claim, let $\Omega^0$, $\Omega^+$ be the subsets of the sample space where $\tilde n_t(s,a)=0$ and $\tilde n_t(s,a)>0$, respectively. We consider these two cases separately: (a) on $\Omega^+$, one can divide both sides of (5.13) by $\tilde n_t(s,a)$ and reach (5.14); (b) note that by definition $\tilde n_t(s,a)=0$ on $\Omega^0$. Thus, $n_t(s,a)\mathbb 1(\Omega^0)=0$ almost surely (this is because $\mathbb E[n_t(s,a)\mathbb 1(\Omega^0)]=\mathbb E[\mathbb E[n_t(s,a)\mathbb 1(\Omega^0)|\mathcal F_{(t-1)+}]]=\mathbb E[\tilde n_t(s,a)\mathbb 1(\Omega^0)]=0$). Therefore,
\[
\mathbb 1(\Omega^0)\,\mathbb 1\big(\alpha n_t(s,a)>\tilde n_t(s,a)\big)=0,\quad\text{a.s.},
\]
which implies
\[
\mathbb 1(\Omega^0)\,\mathbb E\big[\mathbb 1\big(\alpha n_t(s,a)>\tilde n_t(s,a)\big)\,\big|\,\mathcal F_{(t-1)+}\big]=0,\quad\text{a.s.},
\]
which means that on $\Omega^0$ the left-hand side of (5.14) is indeed zero almost surely, proving the claim.

The parameter $\alpha$ will be tuned later to balance two terms and achieve the $\tilde O(T^{2/3})$ regret bound (see Lemma 5.9). We are now ready to provide the proof sketch of Theorem 5.2. By Lemma 5.5, $R_T$ can be decomposed as $R_T=H\mathbb E_{\theta^*}[K_T]+R_1+R_2+R_3$, where
\[
R_1:=\mathbb E_{\theta^*}\Big[\sum_{k=1}^{K_T}T_k\big(J(\theta_k)-J(\theta^*)\big)\Big],
\]
\[
R_2:=H\,\mathbb E_{\theta^*}\Big[\sum_{k=1}^{K_T}\sum_{t=t_k}^{t_{k+1}-1}\Big(\sum_{s'}\big|\theta^*(s'|s_t,a_t)-\theta_k(s'|s_t,a_t)\big|+\sum_s\big|h_t(s;\theta^*)-h_t(s;\theta_k)\big|\Big)\Big],
\]
\[
R_3:=\mathbb E_{\theta^*}\Big[\sum_{k=1}^{K_T}\sum_{t=t_k}^{t_{k+1}-1}\big(c(h_t(\cdot;\theta^*),a_t)-c(h_t(\cdot;\theta_k),a_t)\big)\Big].
\]
It follows from the first stopping criterion that $T_k\le T_{k-1}+1$. Using this along with the property of posterior sampling (Lemma 5.3) proves that $\mathbb E[R_1]\le\mathbb E[K_T]$ (see Lemma 5.7 for details). $\mathbb E[R_3]$ is bounded by $K_1\,\mathbb E\big[\sum_{k=1}^{K_T}\frac{T_k}{\sqrt{t_k}}\big]+1$ (see Lemma 5.8), where $K_1:=K_1(|\mathcal S|,|\mathcal A|,|\mathcal O|,\iota)$ is the constant in Assumption 5.3.
To bound $\mathbb E[R_2]$, we use Assumption 5.3 and follow the proof steps of Lemma 5.8 to conclude that
\[
\mathbb E[R_2]\le\bar R_2+HK_1\,\mathbb E\Big[\sum_{k=1}^{K_T}\frac{T_k}{\sqrt{t_k}}\Big]+1,\qquad\text{where}\quad
\bar R_2:=H\,\mathbb E\Big[\sum_{k=1}^{K_T}\sum_{t=t_k}^{t_{k+1}-1}\sum_{s'}\big|\theta^*(s'|s_t,a_t)-\theta_k(s'|s_t,a_t)\big|\Big].
\]
$\bar R_2$ is the dominating term in the final $\tilde O(T^{2/3})$ regret bound and can be bounded by $H+12HK_2(|\mathcal S||\mathcal A|T)^{2/3}$, where $K_2:=K_2(|\mathcal S|,|\mathcal A|,|\mathcal O|,\iota)$ is the constant in Assumption 5.4. The detailed proof can be found in Lemma 5.9; here we sketch its main steps. By Assumption 5.4, one can show that
\[
\bar R_2\le\tilde O\Big(\mathbb E\Big[\sum_{t=1}^T\frac{HK_2}{\sqrt{\max\{1,\tilde m_t(s_t,a_t)\}}}\Big]\Big).
\]
Now, let $E_2$ be the event that $\tilde m_t(s,a)\ge\alpha n_t(s,a)$ for all $s,a$. Note that by Lemma 5.4 and the union bound, $\mathbb P(E_2^c)\le|\mathcal S||\mathcal A|\alpha$. Thus,
\[
\bar R_2\le\tilde O\Big(\mathbb E\Big[\sum_{t=1}^T\frac{HK_2}{\sqrt{\max\{1,\tilde m_t(s_t,a_t)\}}}\big(\mathbb 1(E_2)+\mathbb 1(E_2^c)\big)\Big]\Big)
\le\tilde O\Big(H\,\mathbb E\Big[\sum_{t=1}^T\frac{K_2}{\sqrt{\alpha\max\{1,n_t(s_t,a_t)\}}}\Big]+HK_2|\mathcal S||\mathcal A|T\alpha\Big).
\]
Algebraic manipulation of the inner summation yields
\[
\bar R_2\le\tilde O\Big(HK_2\sqrt{\frac{|\mathcal S||\mathcal A|T}{\alpha}}+HK_2|\mathcal S||\mathcal A|T\alpha\Big).
\]
Optimizing over $\alpha$ implies $\bar R_2=\tilde O\big(HK_2(|\mathcal S||\mathcal A|T)^{2/3}\big)$. Substituting the upper bounds for $\mathbb E[R_1]$, $\mathbb E[R_2]$ and $\mathbb E[R_3]$, we get
\[
\mathbb E[R_T]=H\mathbb E[K_T]+\mathbb E[R_1]+\mathbb E[R_2]+\mathbb E[R_3]
\le(1+H)\mathbb E[K_T]+12HK_2(|\mathcal S||\mathcal A|T)^{2/3}+(H+1)K_1\,\mathbb E\Big[\sum_{k=1}^{K_T}\frac{T_k}{\sqrt{t_k}}\Big]+2+H.
\]
From Lemma 6.3, we know that $\mathbb E[K_T]=\tilde O(\sqrt{|\mathcal S||\mathcal A|T})$ and $\sum_{k=1}^{K_T}\frac{T_k}{\sqrt{t_k}}=\tilde O(|\mathcal S||\mathcal A|\sqrt T)$. Therefore, $\mathbb E[R_T]\le\tilde O\big(HK_2(|\mathcal S||\mathcal A|T)^{2/3}\big)$.

Appendices

5.A Regret Decomposition

Lemma 5.5. $R_T$ can be decomposed as $R_T=H\mathbb E_{\theta^*}[K_T]+R_1+R_2+R_3$, where
\[
R_1:=\mathbb E_{\theta^*}\Big[\sum_{k=1}^{K_T}T_k\big(J(\theta_k)-J(\theta^*)\big)\Big],
\]
\[
R_2:=H\,\mathbb E_{\theta^*}\Big[\sum_{k=1}^{K_T}\sum_{t=t_k}^{t_{k+1}-1}\Big(\sum_{s'}\big|\theta^*(s'|s_t,a_t)-\theta_k(s'|s_t,a_t)\big|+\sum_s\big|h_t(s;\theta^*)-h_t(s;\theta_k)\big|\Big)\Big],
\]
\[
R_3:=\mathbb E_{\theta^*}\Big[\sum_{k=1}^{K_T}\sum_{t=t_k}^{t_{k+1}-1}\big(c(h_t(\cdot;\theta^*),a_t)-c(h_t(\cdot;\theta_k),a_t)\big)\Big].
\]

Proof. First, note that $\mathbb E_{\theta^*}[C(s_t,a_t)|\mathcal F_{t+}]=c(h_t(\cdot;\theta^*),a_t)$ for any $t\ge 1$. Thus, we can write
\[
R_T=\mathbb E_{\theta^*}\Big[\sum_{t=1}^T\big(C(s_t,a_t)-J(\theta^*)\big)\Big]=\mathbb E_{\theta^*}\Big[\sum_{t=1}^T\big(c(h_t(\cdot;\theta^*),a_t)-J(\theta^*)\big)\Big].
\]
During episode k, by the Bellman equation for the sampled POMDP θ k and that a t = π ∗ (h t (·;θ k );θ k ), we can write: c(h t (·;θ k ),a t )−J(θ k ) =v(h t (·;θ k );θ k )− X o P (o|h t (·;θ k ),a t ;θ k )v(h 0 ;θ k ), whereh 0 =τ(h t (·;θ k ),a t ,o;θ k ). Using this equation, we proceed by decomposing the regret as R T =E θ∗ h T X t=1 h c(h t (·;θ ∗ ),a t )−J(θ ∗ ) ii =E θ∗ h K T X k=1 t k+1 −1 X t=t k h c(h t (·;θ ∗ ),a t )−J(θ ∗ ) ii 74 =E θ∗ h K T X k=1 t k+1 −1 X t=t k h v(h t (·;θ k );θ k )−v(h t+1 (·;θ k );θ k ) i | {z } telescopic sum i +E θ∗ K T X k=1 T k h J(θ k )−J(θ ∗ ) i | {z } =:R 1 +E θ∗ h K T X k=1 t k+1 −1 X t=t k h v(h t+1 (·;θ k );θ k )− X o∈O P (o|h t (·;θ k ),a t ;θ k )v(h 0 ;θ k ) ii | {z } =:R 0 2 +E θ∗ h K T X k=1 t k+1 −1 X t=t k h c(h t (·;θ ∗ ),a t )−c(h t (·;θ k ),a t ) ii | {z } =:R 3 whereK T is the number of episodes upto timeT ,t k is the start time of episodek (we lett k = T +1 for allk>K T ). The telescopic sum is equal tov(h t k (·;θ k );θ k )−v(h t k+1 (·;θ k );θ k )≤H. Thus, the first term on the right hand side is upper bounded byHE θ∗ [K T ]. Suffices to show that R 0 2 ≤R 2 . Throughout the proof, we change the order of expectation and summation at several points. A rigorous proof for why this is allowed in the case that K T and t k are random variables is presented in the proof of Lemma 5.8. We proceed by bounding the termR 0 2 . Recall thath 0 =τ(h t (·;θ k ),a t ,o;θ k ) andh t+1 (·;θ k ) = τ(h t (·;θ k ),a t ,o t+1 ;θ k ). Conditioned onF t ,θ ∗ ,θ k , the only random variable in h t+1 (·;θ k ) is o t+1 (a t = π ∗ (h t (·;θ k );θ k ) is measurable with respect to the sigma algebra generated by F t ,θ k ). Therefore, E θ∗ h v(h t+1 (·;θ k );θ k )|F t ,θ k i = X o∈O v(h 0 ;θ k )P θ∗ (o t+1 =o|F t ,θ k ). 
(5.15) We claim that P θ∗ (o t+1 = o|F t ,θ k ) = P (o|h t (·;θ ∗ ),a t ;θ ∗ ): by the total law of probability and thatP θ∗ (o t+1 =o|s t+1 =s 0 ,F t ,θ k ) =η(o|s 0 ), we can write P θ∗ (o t+1 =o|F t ,θ k ) = X s 0 η(o|s 0 )P θ∗ (s t+1 =s 0 |F t ,θ k ). Note that P θ∗ (s t+1 =s 0 |F t ,θ k ) = X s P θ∗ (s t+1 =s 0 |s t =s,F t ,a t ,θ k )P θ∗ (s t =s|F t ,θ k ) = X s θ ∗ (s 0 |s,a t )P θ∗ (s t =s|F t ). Thus, P θ∗ (o t+1 =o|F t ,θ k ) = X s,s 0 η(o|s 0 )θ ∗ (s 0 |s,a t )h t (s;θ ∗ ) =P (o|h t (·;θ ∗ ),a t ;θ ∗ ). (5.16) 75 Combining (5.16) with (5.15) and substituting into R 0 2 , we get R 0 2 =E θ∗ h K T X k=1 t k+1 −1 X t=t k h X o∈O P (o|h t (·;θ ∗ ),a t ;θ ∗ )−P (o|h t (·;θ k ),a t ;θ k ) v(h 0 ;θ k ) ii . Recall that for any θ∈ Θ, P (o|h t (·;θ),a t ;θ) = P s 0η(o|s 0 ) P s θ(s 0 |s,a t )h t (s;θ). Thus, R 0 2 =E θ∗ h K T X k=1 t k+1 −1 X t=t k X o,s 0 v(h 0 ;θ k )η(o|s 0 ) X s θ ∗ (s 0 |s,a t )h t (s;θ ∗ ) i −E θ∗ h K T X k=1 t k+1 −1 X t=t k X o,s 0 v(h 0 ;θ k )η(o|s 0 ) X s θ k (s 0 |s,a t )h t (s;θ ∗ ) i +E θ∗ h K T X k=1 t k+1 −1 X t=t k X o,s 0 v(h 0 ;θ k )η(o|s 0 ) X s θ k (s 0 |s,a t ) h t (s;θ ∗ )−h t (s;θ k ) i . (5.17) For the first term, note that conditioned onF t ,θ ∗ , the distribution of s t is h t (·;θ ∗ ) by the definition ofh t . Furthermore,a t is measurable with respect to the sigma algebra generated byF t ,θ k since a t =π ∗ (h t (·;θ k );θ k ). Thus, we have E θ∗ h v(h 0 ;θ k ) X s θ ∗ (s 0 |s,a t )h t (s;θ ∗ ) F t ,θ k i =v(h 0 ;θ k )E θ∗ h θ ∗ (s 0 |s t ,a t ) F t ,θ k i . (5.18) Similarly, for the second term on the right hand side of (5.17), we have E θ∗ h v(h 0 ;θ k ) X s θ k (s 0 |s,a t )h t (s;θ ∗ ) F t ,θ k i =v(h 0 ;θ k )E θ∗ h θ k (s 0 |s t ,a t ) F t ,θ k i . 
(5.19)

Substituting (5.18) and (5.19) into (5.17) and using the tower property of conditional expectation, we get
\begin{align*}
R_2' &= \mathbb{E}_{\theta^*}\Big[\sum_{k=1}^{K_T}\sum_{t=t_k}^{t_{k+1}-1} \sum_{s'}\sum_o v(h';\theta_k)\,\eta(o|s')\,\big(\theta^*(s'|s_t,a_t) - \theta_k(s'|s_t,a_t)\big)\Big] \\
&\quad + \mathbb{E}_{\theta^*}\Big[\sum_{k=1}^{K_T}\sum_{t=t_k}^{t_{k+1}-1} \sum_{s'}\sum_o v(h';\theta_k)\,\eta(o|s') \sum_s \theta_k(s'|s,a_t)\,\big(h_t(s;\theta^*) - h_t(s;\theta_k)\big)\Big]. \tag{5.20}
\end{align*}
Since $\sup_{b\in\Delta_S} v(b,\theta_k) \le H$ and $\sum_o \eta(o|s') = 1$, the inner summation of the first term on the right hand side of (5.20) can be bounded as
$$\sum_{o\in O} v(h';\theta_k)\,\eta(o|s')\,\big(\theta^*(s'|s_t,a_t) - \theta_k(s'|s_t,a_t)\big) \le H\,\big|\theta^*(s'|s_t,a_t) - \theta_k(s'|s_t,a_t)\big|. \tag{5.21}$$
Using $\sup_{b\in\Delta_S} v(b,\theta_k) \le H$, $\sum_o \eta(o|s') = 1$ and $\sum_{s'} \theta_k(s'|s,a_t) = 1$, the second term on the right hand side of (5.20) can be bounded as
$$\sum_{s'}\sum_{o\in O} v(h';\theta_k)\,\eta(o|s') \sum_s \theta_k(s'|s,a_t)\,\big(h_t(s;\theta^*) - h_t(s;\theta_k)\big) \le H \sum_s \big|h_t(s;\theta^*) - h_t(s;\theta_k)\big|. \tag{5.22}$$
Substituting (5.21) and (5.22) into (5.20) proves that $R_2' \le R_2$.

5.B Proofs of Section 5.4

Proof of Lemma 5.2

Lemma (restatement of Lemma 5.2). Suppose Assumption 5.2 holds. Then, there exist constants $\alpha > 1$ and $\beta > 0$ such that $\mathbb{E}[1 - f_t(\theta^*)\,|\,\theta^*] \le \alpha\exp(-\beta t)$.

Proof. Let $\tau_t$ be the trajectory $\{a_1,o_1,\cdots,a_{t-1},o_{t-1},o_t\}$ and define the likelihood function
$$\mathcal{L}(\tau_t|\theta) := P(\tau_t|\theta) = P(o_1|\theta)\prod_{\tau=2}^{t} P(o_\tau|o_{1:\tau-1},a_{1:\tau-1},\theta) = P(o_1|\theta)\prod_{\tau=2}^{t} \nu_\theta^{o_{1:\tau-1},a_{1:\tau-1}}(o_\tau). \tag{5.23}$$
Note that $P(o_1|\theta) = \sum_s h(s)\,\eta(o_1|s)$ is independent of $\theta$; thus for any $\theta,\gamma\in\Theta$ such that $\mathcal{L}(\tau_t|\theta)\neq 0$ and $\mathcal{L}(\tau_t|\gamma)\neq 0$, we can write
$$\frac{\mathcal{L}(\tau_t|\theta)}{\mathcal{L}(\tau_t|\gamma)} = \prod_{\tau=2}^{t} \frac{\nu_\theta^{o_{1:\tau-1},a_{1:\tau-1}}(o_\tau)}{\nu_\gamma^{o_{1:\tau-1},a_{1:\tau-1}}(o_\tau)}.$$
Recall that $f_t(\cdot)$ is the posterior associated with the likelihood, given by
$$f_t(\theta) = \frac{\mathcal{L}(\tau_t|\theta)\, f(\theta)}{\sum_{\gamma\in\Theta}\mathcal{L}(\tau_t|\gamma)\, f(\gamma)}.$$
In the denominator, we exclude those $\gamma$ such that $\mathcal{L}(\tau_t|\gamma) = 0$ without loss of generality. We now proceed to lower bound $f_t(\theta)$ for those $\theta$ such that $\mathcal{L}(\tau_t|\theta) > 0$.
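Before the formal lower bound, the exponential posterior concentration that Lemma 5.2 asserts can be illustrated empirically. The sketch below is purely illustrative and not part of the proof: it uses two made-up Bernoulli observation models in place of the true conditional distributions $\nu_\theta$, a uniform prior in place of $f$, and checks that the residual posterior mass $1 - f_t(\theta^*)$ shrinks as $t$ grows.

```python
import math
import random

# Toy illustration of Lemma 5.2: with finitely many candidate models whose
# observation distributions are KL-separated, the posterior mass outside the
# true model decays exponentially in t. Both models below are made up.
random.seed(0)
models = {"theta*": 0.7, "theta1": 0.4}   # P(o = 1) under each candidate model
prior = {m: 0.5 for m in models}          # uniform prior f

def posterior_after(t):
    """Draw t observations from theta* and return the posterior f_t."""
    logpost = {m: math.log(prior[m]) for m in models}
    for _ in range(t):
        o = 1 if random.random() < models["theta*"] else 0   # data from theta*
        for m, p in models.items():
            logpost[m] += math.log(p if o == 1 else 1.0 - p)
    z = max(logpost.values())                                # stabilize exp
    w = {m: math.exp(v - z) for m, v in logpost.items()}
    total = sum(w.values())
    return {m: v / total for m, v in w.items()}

f_50 = posterior_after(50)["theta*"]
f_500 = posterior_after(500)["theta*"]
assert f_500 >= f_50        # mass on the true model grows ...
assert 1 - f_500 < 1e-3     # ... and the residual is tiny after 500 steps
```

The rate of decay in the exponent is governed by the KL separation between the models, which is exactly the role the constant from Assumption 5.2 plays in the proof.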
We can write
$$f_t(\theta) = \frac{\mathcal{L}(\tau_t|\theta)\, f(\theta)}{\sum_\gamma \mathcal{L}(\tau_t|\gamma)\, f(\gamma)} = \frac{1}{1 + \sum_{\gamma\neq\theta}\frac{f(\gamma)}{f(\theta)}\frac{\mathcal{L}(\tau_t|\gamma)}{\mathcal{L}(\tau_t|\theta)}} = \frac{1}{1 + \sum_{\gamma\neq\theta}\frac{f(\gamma)}{f(\theta)}\exp\big(-\sum_{\tau=1}^{t}\log\Lambda^{\theta,\gamma}_\tau\big)},$$
where we define $\Lambda^{\theta,\gamma}_1 := 1$ and, for $\tau\ge 2$,
$$\Lambda^{\theta,\gamma}_\tau := \frac{\nu_\theta^{o_{1:\tau-1},a_{1:\tau-1}}(o_\tau)}{\nu_\gamma^{o_{1:\tau-1},a_{1:\tau-1}}(o_\tau)}.$$
Denote $Z^{\theta,\gamma}_t := \sum_{\tau=1}^{t}\log\Lambda^{\theta,\gamma}_\tau$ and decompose it as $Z^{\theta,\gamma}_t = M^{\theta,\gamma}_t + A^{\theta,\gamma}_t$, where
$$M^{\theta,\gamma}_t := \sum_{\tau=1}^{t}\Big(\log\Lambda^{\theta,\gamma}_\tau - \mathbb{E}\big[\log\Lambda^{\theta,\gamma}_\tau\,\big|\,\mathcal{F}_{\tau-1},\theta^*=\theta\big]\Big), \qquad A^{\theta,\gamma}_t := \sum_{\tau=1}^{t}\mathbb{E}\big[\log\Lambda^{\theta,\gamma}_\tau\,\big|\,\mathcal{F}_{\tau-1},\theta^*=\theta\big].$$
Note that the terms inside the first summation constitute a martingale difference sequence with respect to the filtration $(\mathcal{F}_\tau)_{\tau\ge1}$ and the conditional probability $P(\cdot\,|\,\theta^*=\theta)$. Each term is bounded as $|\log\Lambda^{\theta,\gamma}_\tau - \mathbb{E}[\log\Lambda^{\theta,\gamma}_\tau\,|\,\mathcal{F}_{\tau-1},\theta^*=\theta]| \le d$ for some $d>0$ by Assumption 5.2. The second term, $A^{\theta,\gamma}_t$, can be lower bounded using Assumption 5.2 as follows:
$$\mathbb{E}\big[\log\Lambda^{\theta,\gamma}_\tau\,\big|\,\mathcal{F}_{\tau-1},\theta^*=\theta\big] = \mathbb{E}\Big[\mathbb{E}\big[\log\Lambda^{\theta,\gamma}_\tau\,\big|\,\mathcal{F}_{\tau-1},a_{\tau-1},\theta^*=\theta\big]\,\Big|\,\mathcal{F}_{\tau-1},\theta^*=\theta\Big] = \mathbb{E}\Big[\mathcal{K}\big(\nu_\theta^{o_{1:\tau-1},a_{1:\tau-1}}\,\big\|\,\nu_\gamma^{o_{1:\tau-1},a_{1:\tau-1}}\big)\,\Big|\,\mathcal{F}_{\tau-1},\theta^*=\theta\Big] \ge \epsilon.$$
Summing over $\tau$ implies that
$$A^{\theta,\gamma}_t \ge \epsilon t. \tag{5.24}$$
To bound $M^{\theta,\gamma}_t$, let $0<\delta<\epsilon$ and apply Azuma's inequality to obtain
$$P\big(|M^{\theta,\gamma}_t| \ge \delta t\,\big|\,\theta^*=\theta\big) \le 2\exp\Big(-\frac{\delta^2 t}{2d^2}\Big).$$
Fix $\theta$. A union bound over all $\gamma\neq\theta$ implies that the event $B^{\theta,\delta}_t := \cap_{\gamma\neq\theta}\{|M^{\theta,\gamma}_t| \le \delta t\}$ happens with probability at least $1 - 2(|\Theta|-1)\exp(-\frac{\delta^2 t}{2d^2})$. If $B^{\theta,\delta}_t$ holds, then $-M^{\theta,\gamma}_t \le \delta t$ for all $\gamma\neq\theta$. Combining this with (5.24) implies that $\exp(-M^{\theta,\gamma}_t - A^{\theta,\gamma}_t) \le \exp(\delta t - \epsilon t)$. Therefore,
\begin{align*}
\mathbb{E}[f_t(\theta)\,|\,\theta^*=\theta] &= \mathbb{E}\bigg[\frac{1}{1 + \sum_{\gamma\neq\theta}\frac{f(\gamma)}{f(\theta)}\exp(-M^{\theta,\gamma}_t - A^{\theta,\gamma}_t)}\,\bigg|\,\theta^*=\theta\bigg] \ge \mathbb{E}\bigg[\frac{\mathbb{1}(B^{\theta,\delta}_t)}{1 + \sum_{\gamma\neq\theta}\frac{f(\gamma)}{f(\theta)}\exp(\delta t - \epsilon t)}\,\bigg|\,\theta^*=\theta\bigg] \\
&= \frac{P(B^{\theta,\delta}_t\,|\,\theta)}{1 + \frac{1-f(\theta)}{f(\theta)}\exp(\delta t - \epsilon t)} \ge \frac{1 - 2(|\Theta|-1)\exp(-\frac{\delta^2 t}{2d^2})}{1 + \frac{1-f(\theta)}{f(\theta)}\exp(\delta t - \epsilon t)}.
\end{align*}
Now, by choosing $\delta = \epsilon/2$ and the constants $\alpha = 2\max\{\max_{\theta\in\Theta}\frac{1-f(\theta)}{f(\theta)},\ 2(|\Theta|-1)\}$ and $\beta = \min\{\frac{\epsilon}{2}, \frac{\epsilon^2}{8d^2}\}$, we have
\begin{align*}
\mathbb{E}[1 - f_t(\theta)\,|\,\theta^*=\theta] &\le 1 - \frac{1 - 2(|\Theta|-1)\exp(-\frac{\delta^2 t}{2d^2})}{1 + \frac{1-f(\theta)}{f(\theta)}\exp(\delta t - \epsilon t)} = \frac{\frac{1-f(\theta)}{f(\theta)}\exp(\delta t - \epsilon t) + 2(|\Theta|-1)\exp(-\frac{\delta^2 t}{2d^2})}{1 + \frac{1-f(\theta)}{f(\theta)}\exp(\delta t - \epsilon t)} \\
&\le \frac{1-f(\theta)}{f(\theta)}\exp(\delta t - \epsilon t) + 2(|\Theta|-1)\exp\Big(-\frac{\delta^2 t}{2d^2}\Big) \\
&= \frac{1-f(\theta)}{f(\theta)}\exp\Big(-\frac{\epsilon t}{2}\Big) + 2(|\Theta|-1)\exp\Big(-\frac{\epsilon^2 t}{8d^2}\Big) \le \alpha\exp(-\beta t).
\end{align*}

5.C Proofs of Section 5.5

Full Upper Bound on the Expected Regret of Theorem 5.2

The exact expression for the upper bound on the expected regret in Theorem 5.2 is
\begin{align*}
\mathbb{E}[R_T] &= H\,\mathbb{E}[K_T] + \mathbb{E}[R_1] + \mathbb{E}[R_2] + \mathbb{E}[R_3] \\
&\le (1+H)\,\mathbb{E}[K_T] + 12HK_2(|S||A|T)^{2/3} + (H+1)K_1\,\mathbb{E}\Big[\sum_{k=1}^{K_T}\frac{T_k}{\sqrt{t_k}}\Big] + 2 + H \\
&\le (1+H)\sqrt{2T(1+|S||A|\log(T+1))} + 12HK_2(|S||A|T)^{2/3} + 7(H+1)K_1\sqrt{2T}\,\big(1+|S||A|\log(T+1)\big)\log\sqrt{2T} + 2 + H.
\end{align*}

Finite-parameter Case Satisfies Assumptions 5.3 and 5.4

In this section, we show that Assumptions 5.3 and 5.4 are satisfied in the finite-parameter case, i.e., $|\Theta| < \infty$, as long as PSRL-POMDP generates a deterministic schedule. For instance, a deterministic schedule can be generated by choosing $\tilde m_t(s,a) = t$ for all state-action pairs $(s,a)$ and running Algorithm 5 with either $\mathrm{SCHED}(t_k,T_{k-1}) = 2t_k$ or $\mathrm{SCHED}(t_k,T_{k-1}) = t_k + T_{k-1}$.

Lemma 5.6. Assume $|\Theta| < \infty$. If Algorithm 5 generates a deterministic schedule, then Assumptions 5.3 and 5.4 are satisfied.

Proof. Observe that the left hand side of (5.9) is zero if $\theta_{k(t)} = \theta^*$, and is upper bounded by 2 if $\theta_{k(t)} \neq \theta^*$. Thus, we can write
$$\mathbb{E}\Big[\sum_s \big|h_t(s;\theta^*) - h_t(s;\theta_{k(t)})\big|\,\Big|\,\theta^*\Big] \le 2P(\theta_{k(t)}\neq\theta^*\,|\,\theta^*) = 2\,\mathbb{E}\big[1 - f_{t_{k(t)}}(\theta^*)\,\big|\,\theta^*\big] \le 2\alpha\exp(-\beta t_{k(t)}),$$
which satisfies Assumption 5.3 by choosing a large enough constant $K_1$. Here, the last equality uses the fact that the start time of episode $k(t)$ is deterministic, and the last inequality is by Lemma 5.2. To see why Assumption 5.4 is satisfied, let $\hat\theta_t$ be the Maximum a Posteriori (MAP) estimator, i.e., $\hat\theta_t = \arg\max_{\theta\in\Theta} f_t(\theta)$.
Then, the left hand side of (5.10) is equal to zero if $\hat\theta_t = \theta^*$. This happens with high probability by the following argument:
$$P(\hat\theta_t \neq \theta^*\,|\,\theta^*) \le P\big(f_t(\theta^*) \le 0.5\,\big|\,\theta^*\big) = P\big(1 - f_t(\theta^*) \ge 0.5\,\big|\,\theta^*\big) \le 2\,\mathbb{E}[1 - f_t(\theta^*)\,|\,\theta^*] \le 2\alpha\exp(-\beta t).$$
Here the first inequality is by the fact that if $f_t(\theta^*) > 0.5$, then the MAP estimator chooses $\hat\theta_t = \theta^*$; the second inequality is by Markov's inequality; and the last inequality is by Lemma 5.2. Note that $\tilde m_t(s,a) \le t$ by definition. We claim that Assumption 5.4 is satisfied by choosing $K_2 = 2\sqrt{(-1/\beta)\log(\delta/2\alpha)}$. To see this, note that $2\alpha\exp(-\beta t) \le \delta$ for $t \ge (-1/\beta)\log(\delta/2\alpha)$. In this case, (5.10) automatically holds since with probability at least $1-\delta$ the left hand side is zero. For $t < (-1/\beta)\log(\delta/2\alpha)$, note that the left hand side of (5.10) is at most 2. Therefore, $K_2$ can be found by solving $2 \le K_2/\sqrt{(-1/\beta)\log(\delta/2\alpha)}$.

Auxiliary Lemmas for Section 5.5

Lemma 5.7 (Lemma 3 in Ouyang et al. (2017b)). The term $\mathbb{E}[R_1]$ can be bounded as $\mathbb{E}[R_1] \le \mathbb{E}[K_T]$.

Proof.
$$\mathbb{E}[R_1] = \mathbb{E}\Big[\sum_{k=1}^{K_T} T_k\big(J(\theta_k) - J(\theta^*)\big)\Big] = \mathbb{E}\Big[\sum_{k=1}^{\infty} \mathbb{1}(t_k\le T)\,T_k\,J(\theta_k)\Big] - T\,\mathbb{E}[J(\theta^*)].$$
By the monotone convergence theorem and the facts that $J(\theta_k) \ge 0$ and $T_k \le T_{k-1}+1$ (the first criterion determining the episode length in Algorithm 5), the first term can be bounded as
$$\mathbb{E}\Big[\sum_{k=1}^{\infty} \mathbb{1}(t_k\le T)\,T_k\,J(\theta_k)\Big] = \sum_{k=1}^{\infty}\mathbb{E}\big[\mathbb{1}(t_k\le T)\,T_k\,J(\theta_k)\big] \le \sum_{k=1}^{\infty}\mathbb{E}\big[\mathbb{1}(t_k\le T)\,(T_{k-1}+1)\,J(\theta_k)\big].$$
Note that $\mathbb{1}(t_k\le T)(T_{k-1}+1)$ is $\mathcal{F}_{t_k}$-measurable. Thus, by the property of posterior sampling (Lemma 5.3), $\mathbb{E}[\mathbb{1}(t_k\le T)(T_{k-1}+1)J(\theta_k)] = \mathbb{E}[\mathbb{1}(t_k\le T)(T_{k-1}+1)J(\theta^*)]$. Therefore,
\begin{align*}
\mathbb{E}[R_1] &\le \mathbb{E}\Big[\sum_{k=1}^{\infty}\mathbb{1}(t_k\le T)(T_{k-1}+1)J(\theta^*)\Big] - T\,\mathbb{E}[J(\theta^*)] = \mathbb{E}\Big[J(\theta^*)\Big(K_T + \sum_{k=1}^{K_T}T_{k-1}\Big)\Big] - T\,\mathbb{E}[J(\theta^*)] \\
&= \mathbb{E}[J(\theta^*)K_T] + \mathbb{E}\Big[J(\theta^*)\Big(\sum_{k=1}^{K_T}T_{k-1} - T\Big)\Big] \le \mathbb{E}[K_T],
\end{align*}
where the last inequality is by the facts that $\sum_{k=1}^{K_T}T_{k-1} - T \le 0$ and $0 \le J(\theta^*) \le 1$.

Lemma 5.8.
The term $\mathbb{E}[R_3]$ can be bounded as
$$\mathbb{E}[R_3] \le K_1\,\mathbb{E}\Big[\sum_{k=1}^{K_T}\frac{T_k}{\sqrt{t_k}}\Big] + 1,$$
where $K_1 := K_1(|S|,|A|,|O|,\iota)$ is the constant in Assumption 5.3.

Proof. Recall that
$$\mathbb{E}[R_3] = \mathbb{E}\Big[\sum_{k=1}^{K_T}\sum_{t=t_k}^{t_{k+1}-1}\big(c(h_t(\cdot;\theta^*),a_t) - c(h_t(\cdot;\theta_k),a_t)\big)\Big].$$
Let $k(t)$ be a random variable denoting the episode number at time $t$, i.e., $t_{k(t)} \le t < t_{k(t)+1}$ for all $t\le T$. By the definition of $c$, we can write
\begin{align*}
\mathbb{E}[R_3] &= \mathbb{E}\Big[\sum_{k=1}^{K_T}\sum_{t=t_k}^{t_{k+1}-1}\sum_s C(s,a_t)\big(h_t(s;\theta^*) - h_t(s;\theta_k)\big)\Big] = \mathbb{E}\Big[\sum_{t=1}^{T}\sum_s C(s,a_t)\big(h_t(s;\theta^*) - h_t(s;\theta_{k(t)})\big)\Big] \\
&\le \sum_{t=1}^{T}\mathbb{E}\Big[\sum_s \big|h_t(s;\theta^*) - h_t(s;\theta_{k(t)})\big|\Big] = \mathbb{E}\Big[\sum_{k=1}^{K_T}\sum_{t=t_k}^{t_{k+1}-1}\mathbb{E}\Big[\sum_s \big|h_t(s;\theta^*) - h_t(s;\theta_k)\big|\Big]\Big],
\end{align*}
where the inequality is by $0 \le C(s,a_t) \le 1$. Let $K_1 := K_1(|S|,|A|,|O|,\iota)$ be the constant in Assumption 5.3 and define the event $E_1$ as the success event of Assumption 5.3, on which $\mathbb{E}\big[\sum_s |h_t(s;\theta^*) - h_t(s;\theta_k)|\big] \le K_1/\sqrt{t_k}$ holds. We can write
$$\mathbb{E}\Big[\sum_s \big|h_t(s;\theta^*) - h_t(s;\theta_k)\big|\Big] = \mathbb{E}\Big[\sum_s \big|h_t(s;\theta^*) - h_t(s;\theta_k)\big|\Big]\big(\mathbb{1}(E_1) + \mathbb{1}(E_1^c)\big) \le \frac{K_1}{\sqrt{t_k}} + 2\,\mathbb{1}(E_1^c).$$
Recall that by Assumption 5.3, $P(E_1^c) \le \delta$. Therefore,
$$\mathbb{E}[R_3] \le K_1\,\mathbb{E}\Big[\sum_{k=1}^{K_T}\frac{T_k}{\sqrt{t_k}}\Big] + 2T\delta.$$
Choosing $\delta = \min(1/(2T), 1/(2HT))$ completes the proof.

Lemma 5.9. The term $\bar R_2$ can be bounded as
$$\bar R_2 \le H + 12HK_2(|S||A|T)^{2/3},$$
where $K_2 := K_2(|S|,|A|,|O|,\iota)$ is the constant in Assumption 5.4.

Proof. Recall that
$$\bar R_2 = H\,\mathbb{E}\Big[\sum_{k=1}^{K_T}\sum_{t=t_k}^{t_{k+1}-1}\sum_{s'}\big|\theta^*(s'|s_t,a_t) - \theta_k(s'|s_t,a_t)\big|\Big]. \tag{5.25}$$
We proceed by bounding the inner term of the above equation. For notational simplicity, define $z := (s,a)$ and $z_t := (s_t,a_t)$. Let $\hat\theta_{t_k}$ be the estimator in Assumption 5.4 and define the confidence set $B_k$ as
$$B_k := \Big\{\theta\in\Theta_H : \sum_{s'\in S}\big|\theta(s'|z) - \hat\theta_{t_k}(s'|z)\big| \le \frac{K_2}{\sqrt{\max\{1,\tilde m_{t_k}(z)\}}},\ \forall z\in S\times A\Big\},$$
where $K_2 := K_2(|S|,|A|,|O|,\iota)$ is the constant in Assumption 5.4. Note that $B_k$ reduces to the confidence set used in Jaksch et al. (2010); Ouyang et al. (2017b) in the case of perfect observation by choosing $\tilde m_t(s,a) = n_t(s,a)$.
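The quantities $h_t(\cdot;\theta)$ appearing above are Bayesian beliefs over the hidden state. As a concrete toy illustration of the two primitives used throughout this chapter, the predicted observation distribution $P(o|b,a;\theta) = \sum_{s'}\eta(o|s')\sum_s \theta(s'|s,a)\,b(s)$ and the belief update $\tau(b,a,o;\theta)$, the following sketch uses a made-up two-state, two-observation model; all numbers are illustrative and not from the thesis.

```python
def predict_obs(b, theta_a, eta):
    """P(o | b, a; theta): push belief b through theta(.|., a), then through eta."""
    n = len(b)
    next_s = [sum(theta_a[s][sp] * b[s] for s in range(n)) for sp in range(n)]
    p_obs = {o: sum(eta[sp][o] * next_s[sp] for sp in range(n)) for o in eta[0]}
    return p_obs, next_s

def update_belief(b, theta_a, eta, o):
    """tau(b, a, o; theta): Bayes rule on the predicted next-state distribution."""
    p_obs, next_s = predict_obs(b, theta_a, eta)
    return [eta[sp][o] * next_s[sp] / p_obs[o] for sp in range(len(b))]

b = [0.6, 0.4]                                       # current belief h_t
theta_a = [[0.9, 0.1], [0.2, 0.8]]                   # theta(s'|s, a) for one action a
eta = [{"x": 0.7, "y": 0.3}, {"x": 0.1, "y": 0.9}]   # eta(o|s')
p_obs, _ = predict_obs(b, theta_a, eta)
b_next = update_belief(b, theta_a, eta, "x")
assert abs(sum(p_obs.values()) - 1.0) < 1e-12        # valid distribution over o
assert abs(sum(b_next) - 1.0) < 1e-12                # valid updated belief
```

The identity computed by `predict_obs` is exactly the expansion of $P(o|h_t(\cdot;\theta),a_t;\theta)$ used in the proof of (5.16).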
By the triangle inequality, the inner term in (5.25) can be bounded by
$$\sum_{s'}\big|\theta^*(s'|z_t) - \theta_k(s'|z_t)\big| \le \sum_{s'}\big|\theta^*(s'|z_t) - \hat\theta_{t_k}(s'|z_t)\big| + \sum_{s'}\big|\theta_k(s'|z_t) - \hat\theta_{t_k}(s'|z_t)\big| \le 2\big(\mathbb{1}(\theta^*\notin B_k) + \mathbb{1}(\theta_k\notin B_k)\big) + \frac{2K_2}{\sqrt{\max\{1,\tilde m_{t_k}(z_t)\}}}.$$
Substituting this into (5.25) implies
$$\bar R_2 \le 2H\,\mathbb{E}\Big[\sum_{k=1}^{K_T}\sum_{t=t_k}^{t_{k+1}-1}\big(\mathbb{1}(\theta^*\notin B_k) + \mathbb{1}(\theta_k\notin B_k)\big)\Big] + 2H\,\mathbb{E}\Big[\sum_{k=1}^{K_T}\sum_{t=t_k}^{t_{k+1}-1}\frac{K_2}{\sqrt{\max\{1,\tilde m_{t_k}(z_t)\}}}\Big]. \tag{5.26}$$
We bound these two terms separately.

Bounding the first term. For the first term we can write
$$\mathbb{E}\Big[\sum_{k=1}^{K_T}\sum_{t=t_k}^{t_{k+1}-1}\big(\mathbb{1}(\theta^*\notin B_k) + \mathbb{1}(\theta_k\notin B_k)\big)\Big] = \mathbb{E}\Big[\sum_{k=1}^{K_T}T_k\big(\mathbb{1}(\theta^*\notin B_k) + \mathbb{1}(\theta_k\notin B_k)\big)\Big] \le T\,\mathbb{E}\Big[\sum_{k=1}^{K_T}\big(\mathbb{1}(\theta^*\notin B_k) + \mathbb{1}(\theta_k\notin B_k)\big)\Big] \le T\sum_{k=1}^{T}\mathbb{E}\big[\mathbb{1}(\theta^*\notin B_k) + \mathbb{1}(\theta_k\notin B_k)\big],$$
where the last inequality is by the fact that $K_T \le T$. Now, observe that since $B_k$ is $\mathcal{F}_{t_k}$-measurable, Lemma 5.3 implies that $\mathbb{E}[\mathbb{1}(\theta_k\notin B_k)] = \mathbb{E}[\mathbb{1}(\theta^*\notin B_k)]$. Moreover, by Assumption 5.4, $\mathbb{E}[\mathbb{1}(\theta^*\notin B_k)] = P(\theta^*\notin B_k) \le \delta$. Choosing $\delta = \frac{1}{4T^2}$, we get
$$\mathbb{E}\Big[\sum_{k=1}^{K_T}\sum_{t=t_k}^{t_{k+1}-1}\big(\mathbb{1}(\theta^*\notin B_k) + \mathbb{1}(\theta_k\notin B_k)\big)\Big] \le \frac{1}{2}. \tag{5.27}$$

Bounding the second term. To bound the second term of (5.26), observe that by the second criterion of the algorithm for choosing the episode length, we have $2\tilde m_{t_k}(z_t) \ge \tilde m_t(z_t)$. Thus,
\begin{align*}
\mathbb{E}\Big[\sum_{k=1}^{K_T}\sum_{t=t_k}^{t_{k+1}-1}\frac{K_2}{\sqrt{\max\{1,\tilde m_{t_k}(z_t)\}}}\Big] &\le \mathbb{E}\Big[\sum_{t=1}^{T}\frac{\sqrt{2}K_2}{\sqrt{\max\{1,\tilde m_t(z_t)\}}}\Big] = \sum_{t=1}^{T}\sum_z \mathbb{E}\Big[\frac{\sqrt{2}K_2\,\mathbb{1}(z_t=z)}{\sqrt{\max\{1,\tilde m_t(z)\}}}\Big] \\
&= \sum_{t=1}^{T}\sum_z \mathbb{E}\Big[\frac{\sqrt{2}K_2\,\mathbb{1}(z_t=z)}{\sqrt{\max\{1,\tilde m_t(z)\}}}\,\mathbb{1}\big(\tilde m_t(z)\ge\alpha n_t(z)\big)\Big] + \sum_{t=1}^{T}\sum_z \mathbb{E}\Big[\frac{\sqrt{2}K_2\,\mathbb{1}(z_t=z)}{\sqrt{\max\{1,\tilde m_t(z)\}}}\,\mathbb{1}\big(\tilde m_t(z)<\alpha n_t(z)\big)\Big] \\
&\le \sum_{t=1}^{T}\sum_z \mathbb{E}\Big[\frac{\sqrt{2}K_2\,\mathbb{1}(z_t=z)}{\sqrt{\max\{1,\alpha n_t(z)\}}}\Big] + \sum_{t=1}^{T}\sum_z \mathbb{E}\Big[\sqrt{2}K_2\,\mathbb{1}\big(\tilde m_t(z)<\alpha n_t(z)\big)\Big]. \tag{5.28}
\end{align*}
Lemma 5.4 implies that $\mathbb{E}[\mathbb{1}(\tilde m_t(z)<\alpha n_t(z))] = P(\tilde m_t(z)<\alpha n_t(z)) \le \alpha$. Thus, the second term in (5.28) can be bounded by $\sqrt{2}K_2|S||A|T\alpha$.
To bound the first term of (5.28), we can write
$$\sum_{t=1}^{T}\sum_z \mathbb{E}\Big[\frac{\sqrt{2}K_2\,\mathbb{1}(z_t=z)}{\sqrt{\max\{1,\alpha n_t(z)\}}}\Big] \le \sqrt{\frac{2}{\alpha}}\,K_2\,\mathbb{E}\Big[\sum_z\sum_{t=1}^{T}\frac{\mathbb{1}(z_t=z)}{\sqrt{\max\{1,n_t(z)\}}}\Big].$$
Observe that whenever $z_t = z$, $n_t(z)$ increases by 1. Since $n_t(z)$ is the number of visits to $z$ by time $t-1$ (including $t-1$ and excluding $t$), the denominator equals 1 for the first two times that $z_t = z$. Therefore, the term inside the expectation can be bounded by
$$\sum_z\sum_{t=1}^{T}\frac{\mathbb{1}(z_t=z)}{\sqrt{\max\{1,n_t(z)\}}} = \sum_z\Big(\mathbb{1}(n_{T+1}(z)>0) + \sum_{j=1}^{n_{T+1}(z)-1}\frac{1}{\sqrt{j}}\Big) \le \sum_z\Big(\mathbb{1}(n_{T+1}(z)>0) + 2\sqrt{n_{T+1}(z)}\Big) \le 3\sum_z\sqrt{n_{T+1}(z)}.$$
Since $\sum_z n_{T+1}(z) = T$, the Cauchy-Schwarz inequality implies
$$3\sum_z\sqrt{n_{T+1}(z)} \le 3\sqrt{|S||A|\sum_z n_{T+1}(z)} = 3\sqrt{|S||A|T}.$$
Therefore, the first term of (5.28) can be bounded by
$$\sum_{t=1}^{T}\sum_z \mathbb{E}\Big[\frac{\sqrt{2}K_2\,\mathbb{1}(z_t=z)}{\sqrt{\max\{1,\alpha n_t(z)\}}}\Big] \le 3K_2\sqrt{\frac{2|S||A|T}{\alpha}}.$$
Substituting this bound into (5.28), along with the bound on the second term of (5.28), we obtain
$$\mathbb{E}\Big[\sum_{k=1}^{K_T}\sum_{t=t_k}^{t_{k+1}-1}\frac{K_2}{\sqrt{\max\{1,\tilde m_{t_k}(z_t)\}}}\Big] \le 3K_2\sqrt{\frac{2|S||A|T}{\alpha}} + \sqrt{2}K_2|S||A|T\alpha.$$
The choice $\alpha = (3/2)^{2/3}(|S||A|T)^{-1/3}$ minimizes the upper bound, and thus
$$\mathbb{E}\Big[\sum_{k=1}^{K_T}\sum_{t=t_k}^{t_{k+1}-1}\frac{K_2}{\sqrt{\max\{1,\tilde m_{t_k}(z_t)\}}}\Big] \le 6K_2(|S||A|T)^{2/3}. \tag{5.29}$$
Substituting (5.27) and (5.29) into (5.26), we get $\bar R_2 \le H + 12HK_2(|S||A|T)^{2/3}$.

Lemma 5.10. The following inequalities hold:
1. The number of episodes can be bounded as $K_T \le \sqrt{2T(1+|S||A|\log(T+1))} = \tilde O(\sqrt{|S||A|T})$.
2. $\sum_{k=1}^{K_T} T_k/\sqrt{t_k} \le 7\sqrt{2T}\,(1+|S||A|\log(T+1))\log\sqrt{2T} = \tilde O(|S||A|\sqrt{T})$.

Proof. We first provide intuition for why these results should hold. The length of the episodes is determined by two criteria: the first criterion triggers when $T_k = T_{k-1}+1$, and the second triggers when the pseudo count doubles for some state-action pair compared to the beginning of the episode. Intuitively, the second criterion occurs only logarithmically often, while the first occurs more frequently.
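The pigeonhole bound used in the proof of Lemma 5.9 above, $\sum_{t\le T} \mathbb{1}(z_t=z)/\sqrt{\max\{1,n_t(z)\}} \le 3\sqrt{|S||A|T}$ after summing over $z$, holds for every trajectory, so it can be sanity-checked numerically. The uniformly random trajectory below is purely illustrative; only the counting argument matters.

```python
import math
import random

# Sanity check of the visit-count (pigeonhole) bound:
#   sum_{t=1}^T 1 / sqrt(max(1, n_t(z_t))) <= 3 * sqrt(|S||A| * T),
# where n_t(z) is the number of visits to state-action pair z before time t.
random.seed(1)
S, A, T = 5, 3, 10_000
pairs = [(s, a) for s in range(S) for a in range(A)]
counts = {z: 0 for z in pairs}
total = 0.0
for _ in range(T):
    z = random.choice(pairs)                  # illustrative random trajectory
    total += 1.0 / math.sqrt(max(1, counts[z]))
    counts[z] += 1                            # n_{t+1}(z) = n_t(z) + 1
assert total <= 3 * math.sqrt(S * A * T)      # the bound holds path-wise
```

Because the argument is a deterministic counting bound (each pair contributes at most $1 + 2\sqrt{n_{T+1}(z)}$, then Cauchy-Schwarz), the assertion holds regardless of how the trajectory is generated.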
This means that one can consider only the first criterion for an intuitive argument. Ignoring the second criterion gives $T_k = O(k)$, $K_T = O(\sqrt{T})$ and $t_k = O(k^2)$, which implies $\sum_{k=1}^{K_T} T_k/\sqrt{t_k} = O(K_T) = O(\sqrt{T})$. The rigorous proof follows.

1. Define macro episodes with start times $t_{m_i}$ given by $t_{m_1} = t_1$ and
$$t_{m_i} := \min\big\{t_k > t_{m_{i-1}} : \tilde m_{t_k}(s,a) > 2\tilde m_{t_{k-1}}(s,a)\ \text{for some } (s,a)\big\}.$$
Note that a new macro episode starts when the second criterion of the episode length in Algorithm 5 triggers. Let $M_T$ be the random variable denoting the number of macro episodes by time $T$ and define $m_{M_T+1} = K_T + 1$. Let $\tilde T_i$ denote the length of macro episode $i$; note that $\tilde T_i = \sum_{k=m_i}^{m_{i+1}-1} T_k$. Moreover, by the definition of macro episodes, all the episodes within a macro episode except possibly the last one are triggered by the first criterion, i.e., $T_k = T_{k-1}+1$ for all $m_i \le k \le m_{i+1}-2$. This implies that
$$\tilde T_i = \sum_{k=m_i}^{m_{i+1}-1} T_k = T_{m_{i+1}-1} + \sum_{j=1}^{m_{i+1}-m_i-1}\big(T_{m_i-1}+j\big) \ge 1 + \sum_{j=1}^{m_{i+1}-m_i-1}(1+j) = \frac{(m_{i+1}-m_i)(m_{i+1}-m_i+1)}{2},$$
which implies $m_{i+1}-m_i \le \sqrt{2\tilde T_i}$. Now, we can write
$$K_T = m_{M_T+1}-1 = \sum_{i=1}^{M_T}(m_{i+1}-m_i) \le \sum_{i=1}^{M_T}\sqrt{2\tilde T_i} \le \sqrt{2M_T\sum_i \tilde T_i} = \sqrt{2M_T T}, \tag{5.30}$$
where the last inequality is by Cauchy-Schwarz. It now suffices to show that $M_T \le 1 + |S||A|\log(T+1)$. Let $\mathcal{T}_{s,a}$ be the set of start times at which the second criterion is triggered at the state-action pair $(s,a)$, i.e.,
$$\mathcal{T}_{s,a} := \{t_k \le T : \tilde m_{t_k}(s,a) > 2\tilde m_{t_{k-1}}(s,a)\}.$$
We claim that $|\mathcal{T}_{s,a}| \le \log(\tilde m_{T+1}(s,a))$. To prove this claim, assume by contradiction that $|\mathcal{T}_{s,a}| \ge \log(\tilde m_{T+1}(s,a))+1$. Then
$$\tilde m_{t_{K_T}}(s,a) \ge \prod_{t_k\le T,\ \tilde m_{t_{k-1}}(s,a)\ge 1}\frac{\tilde m_{t_k}(s,a)}{\tilde m_{t_{k-1}}(s,a)} \ge \prod_{t_k\in\mathcal{T}_{s,a},\ \tilde m_{t_{k-1}}(s,a)\ge 1}\frac{\tilde m_{t_k}(s,a)}{\tilde m_{t_{k-1}}(s,a)} > \prod_{t_k\in\mathcal{T}_{s,a},\ \tilde m_{t_{k-1}}(s,a)\ge 1} 2 = 2^{|\mathcal{T}_{s,a}|-1} \ge \tilde m_{T+1}(s,a),$$
which is a contradiction.
Here the second inequality is by the fact that $\tilde m_t(s,a)$ is non-decreasing, and the third inequality is by the definition of $\mathcal{T}_{s,a}$. Therefore,
$$M_T \le 1 + \sum_{s,a}|\mathcal{T}_{s,a}| \le 1 + \sum_{s,a}\log(\tilde m_{T+1}(s,a)) \le 1 + |S||A|\log\Big(\sum_{s,a}\tilde m_{T+1}(s,a)\big/|S||A|\Big) \le 1 + |S||A|\log(T+1), \tag{5.31}$$
where the third inequality is due to the concavity of $\log$ and the last inequality is by the fact that $\tilde m_{T+1}(s,a) \le T+1$.

2. First, we claim that $T_k \le \sqrt{2T}$ for all $k \le K_T$. To see this, assume by contradiction that $T_{k^*} > \sqrt{2T}$ for some $k^* \le K_T$. By the first stopping criterion, we can conclude that $T_{k^*-1} > \sqrt{2T}-1$, $T_{k^*-2} > \sqrt{2T}-2$, $\ldots$, $T_1 > \max\{\sqrt{2T}-k^*+1, 0\}$, since the episode length can increase by at most one compared to the previous episode. Note that $k^* \ge \sqrt{2T}-1$, because otherwise $T_1 > 2$, which is infeasible since $T_1 \le T_0+1 = 2$. Thus, $\sum_{k=1}^{k^*} T_k > 0.5\sqrt{2T}(\sqrt{2T}+1) > T$, which is a contradiction.

We now proceed to lower bound $t_k$. By the definition of macro episodes in part 1, within a macro episode the lengths of all episodes except the last one are determined by the first criterion, i.e., for macro episode $i$, $T_k = T_{k-1}+1$ for $m_i \le k \le m_{i+1}-2$. Hence, for $m_i \le k \le m_{i+1}-2$,
$$t_{k+1} = t_k + T_k = t_k + T_{m_i-1} + k - (m_i-1) \ge t_k + k - m_i + 1.$$
Recursive substitution for $t_k$ implies that $t_k \ge t_{m_i} + 0.5(k-m_i)(k-m_i+1)$ for $m_i \le k \le m_{i+1}-1$. Thus,
$$\sum_{k=1}^{K_T}\frac{T_k}{\sqrt{t_k}} \le \sqrt{2T}\sum_{i=1}^{M_T}\sum_{k=m_i}^{m_{i+1}-1}\frac{1}{\sqrt{t_k}} \le \sqrt{2T}\sum_{i=1}^{M_T}\sum_{k=m_i}^{m_{i+1}-1}\frac{1}{\sqrt{t_{m_i} + 0.5(k-m_i)(k-m_i+1)}}. \tag{5.32}$$
The denominator of the summand at $k = m_i$ equals $\sqrt{t_{m_i}}$; for other values of $k$, the term under the square root can be lower bounded by $0.5(k-m_i)^2$.
Thus,
\begin{align*}
\sum_{i=1}^{M_T}\sum_{k=m_i}^{m_{i+1}-1}\frac{1}{\sqrt{t_{m_i}+0.5(k-m_i)(k-m_i+1)}} &\le \sum_{i=1}^{M_T}\frac{1}{\sqrt{t_{m_i}}} + \sum_{i=1}^{M_T}\sum_{k=m_i+1}^{m_{i+1}-1}\frac{\sqrt{2}}{k-m_i} \le M_T + \sum_{i=1}^{M_T}\sum_{j=1}^{m_{i+1}-m_i-1}\frac{\sqrt{2}}{j} \\
&\le M_T + \sqrt{2}\Big(M_T + \sum_{i=1}^{M_T}\log(m_{i+1}-m_i)\Big) \le M_T(1+\sqrt{2}) + \sqrt{2}\,M_T\log\Big(\frac{1}{M_T}\sum_{i=1}^{M_T}(m_{i+1}-m_i)\Big) \\
&\le M_T(1+\sqrt{2}) + \sqrt{2}\,M_T\log\sqrt{2T} \le 7M_T\log\sqrt{2T},
\end{align*}
where the second inequality is by $t_{m_i} \ge 1$, the third is by the fact that $\sum_{j=1}^{K}1/j \le 1 + \int_1^K dx/x = 1+\log K$, the fourth is by the concavity of $\log$, and the fifth is by the facts that $\sum_{i=1}^{M_T}(m_{i+1}-m_i) = m_{M_T+1}-1 = K_T$ and $K_T/M_T \le \sqrt{2T/M_T} \le \sqrt{2T}$ (see (5.30)). Substituting this bound into (5.32) and using the upper bound (5.31) on $M_T$, we can write
$$\sum_{k=1}^{K_T}\frac{T_k}{\sqrt{t_k}} \le \sqrt{2T}\cdot 7M_T\log\sqrt{2T} \le 7\sqrt{2T}\,\big(1+|S||A|\log(T+1)\big)\log\sqrt{2T}.$$

5.D Other Proofs

Proof of Lemma 5.1

Lemma (restatement of Lemma 5.1). Suppose Assumption 5.1 holds. Then, the policy $\pi^*(\cdot,\theta): \Delta_S \to A$ given by
$$\pi^*(b;\theta) := \arg\min_{a\in A}\Big\{c(b,a) + \sum_{o\in O} P(o|b,a;\theta)\, v(b';\theta)\Big\}$$
is the optimal policy, with $J_{\pi^*}(h;\theta) = J(\theta)$ for all $h\in\Delta_S$.

Proof. We prove that for any policy $\pi$, $J_\pi(h,\theta) \ge J_{\pi^*}(h,\theta) = J(\theta)$ for all $h\in\Delta_S$. Let $\pi: \Delta_S\to A$ be an arbitrary policy. We can write
\begin{align*}
J_\pi(h,\theta) &= \limsup_{T\to\infty}\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}[C(s_t,\pi(h_t))\,|\,s_1\sim h] = \limsup_{T\to\infty}\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\big[\mathbb{E}[C(s_t,\pi(h_t))\,|\,\mathcal{F}_t,s_1\sim h]\,\big|\,s_1\sim h\big] \\
&= \limsup_{T\to\infty}\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}[c(h_t,\pi(h_t))\,|\,s_1\sim h] \ge \limsup_{T\to\infty}\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}[J(\theta) + v(h_t,\theta) - v(h_{t+1},\theta)\,|\,s_1\sim h] = J(\theta),
\end{align*}
with equality attained by $\pi^*$, completing the proof.

Chapter 6

Learning Zero-sum Stochastic Games with Posterior Sampling

6.1 Introduction

Recent advances in playing the game of Go (Silver et al. (2017)) and Starcraft (Vinyals et al. (2019)) have demonstrated the capability of self-play to achieve super-human performance in competitive reinforcement learning (competitive RL; Crandall and Goodrich (2005)), a special case of multi-agent RL where each player tries to maximize its own reward.
These self-play algorithms learn by repeatedly playing against themselves and updating their policy based on the observed trajectory, in the absence of human supervision. Despite this empirical success, the theoretical understanding of these algorithms is limited, and their analysis is significantly more challenging than in single-agent RL due to the multi-agent nature of the problem.

Self-play can be considered a special category of offline competitive RL, where the learning algorithm controls both the agent and the opponent during the learning process (Bai and Jin (2020); Bai et al. (2020)). In the more general and challenging online learning case, the opponent can take arbitrary history-dependent strategies, and the agent has no control over the opponent during the learning process (Wei et al. (2017); Xie et al. (2020); Tian et al. (2021)). In this chapter, we consider the online learning case, where the agent learns against an arbitrary opponent that may follow a time-variant history-dependent policy and may switch its policy at any time.

We consider infinite-horizon two-player zero-sum stochastic games (SGs) with the average-reward criterion. At each time step, both players determine their actions simultaneously upon observing the state of the environment. The reward and the probability distribution of the next state are then determined by the chosen actions and the current state. The players' payoffs sum to zero, i.e., the reward of one player (the agent) is exactly the loss of the other player (the opponent). The agent's goal is to maximize its cumulative reward while the opponent tries to minimize the total loss.

We propose Posterior Sampling Reinforcement Learning for Zero-sum Stochastic Games (PSRL-ZSG), a learning algorithm that achieves an $\tilde O(HS\sqrt{AT})$ Bayesian regret bound. Here $H$ is an upper bound on the bias-span, $S$ is the number of states, $A$ is the number of possible action pairs for the two players, $T$ is the horizon, and $\tilde O$ hides logarithmic factors.
The best existing result in this setting is achieved by the UCSG algorithm (Wei et al. (2017)), which obtains a regret bound of $\tilde O(\sqrt[3]{DS^2AT^2})$, where $D \ge H$ is the diameter of the SG. As stochastic games generalize Markov decision processes (MDPs), our regret bound is optimal in $A$ and $T$ (up to logarithmic factors) due to the lower bound of Jaksch et al. (2010).

Related Literature

SGs were first formulated by Shapley (1953). A large body of work focuses on finding Nash equilibria in SGs with a known transition kernel (Littman (2001); Hu and Wellman (2003); Hansen et al. (2013)), or on learning with a generative model (Jia et al. (2019); Sidford et al. (2020); Zhang et al. (2020)) that can simulate the transition for an arbitrary state-action pair. In these settings, no exploration is needed. There is a long line of research on exploration and regret analysis in single-agent RL (see, e.g., Jaksch et al. (2010); Osband et al. (2013); Gopalan and Mannor (2015); Azar et al. (2017); Ouyang et al. (2017b); Jin et al. (2018); Zhang and Ji (2019); Zanette and Brunskill (2019); Wei et al. (2020, 2021); Chen et al. (2021a); Jafarnia-Jahromi et al. (2021) and references therein). Extending these results to SGs is non-trivial, since the opponent's actions also affect the state transition and cannot be controlled by the agent. We review the literature on exploration in SGs and refer the interested reader to Zhang et al. (2021); Yang and Wang (2020) for an extensive review of multi-agent RL in various settings.

Stochastic Games. A few recent works use self-play as a method to learn stochastic games (Bai and Jin (2020); Bai et al. (2020); Liu et al. (2021); Chen et al. (2021b)). However, self-play requires controlling both the agent and the opponent, and cannot be applied in the online setting where the agent plays against an arbitrary opponent.
All of these works consider the finite-horizon SG setting, where the interaction of the players with the environment terminates after a fixed number of steps. In the online setting with an arbitrary opponent, Xie et al. (2020); Jin et al. (2021) achieve regret bounds of $\tilde O(\sqrt{T})$ in finite-horizon SGs with linear and general function approximation, respectively. However, in applications where the interaction between the players and the environment does not terminate (e.g., stock trading), the infinite-horizon SG is more suitable. The lack of a fixed horizon makes the problem more challenging, since backward induction, a technique widely used in the finite-horizon setting, is not applicable in the infinite-horizon setting.

In the infinite-horizon average-reward setting, the early work of Brafman and Tennenholtz (2002), which proposes R-max, does not consider regret. A special case of online learning in general-sum games is studied by DiGiovanni and Tewari (2021), where the opponent is allowed to switch its stationary policy a limited number of times. They achieve a regret bound of $\tilde O(\ell + \sqrt{\ell T})$ via posterior sampling, where $\ell$ is the number of switches. Their result is not directly comparable to ours because their definition of regret is different; moreover, they assume the transition kernel is known and the opponent adopts stationary policies. To the best of our knowledge, the only existing algorithm that considers online learning against an arbitrary opponent in the infinite-horizon average-reward SG is UCSG (Wei et al. (2017)).

Comparison with UCSG (Wei et al. (2017)).
Our work is closely related to UCSG; however, clear distinctions exist in the result, the algorithm, and the technical contribution:

• UCSG achieves a regret bound of $\tilde O(\sqrt[3]{DS^2AT^2})$ under the finite-diameter assumption (i.e., for any two states and every stationary randomized policy of the opponent, there exists a stationary randomized policy for the agent that moves from one state to the other in finite expected time). Under the much stronger ergodicity assumption (i.e., for any two states and every pair of stationary randomized policies of the agent and the opponent, it is possible to move from one state to the other in finite expected time), UCSG obtains a regret bound of $\tilde O(DS\sqrt{AT})$. Note that the ergodicity assumption greatly alleviates the challenge of exploration. Our algorithm significantly improves on this result and achieves a regret bound of $\tilde O(HS\sqrt{AT})$ under the finite-diameter assumption alone.

• UCSG is an optimism-based algorithm inspired by Jaksch et al. (2010) and requires a complicated maximin extended value iteration. Our algorithm, by contrast, is the first posterior sampling-based algorithm for SGs, leveraging the ideas of Ouyang et al. (2017b) for MDPs, and is much simpler in both the algorithm and the analysis.

• From the analysis perspective, under the finite-diameter assumption, UCSG approximates the average-reward SG with a sequence of finite-horizon SGs, which leads to the sub-optimal regret bound of $O(T^{2/3})$. Our analysis avoids the finite-horizon approximation by directly using the Bellman equation of the infinite-horizon SG, and achieves a near-optimal regret bound.
6.2 Preliminaries

Let $M = (\mathcal{S}, \mathcal{A}, r, \theta)$ be a zero-sum stochastic game, where $\mathcal{S}$ is the state space, $\mathcal{A} = \mathcal{A}^1\times\mathcal{A}^2$ is the joint action space, $r: \mathcal{S}\times\mathcal{A}^1\times\mathcal{A}^2 \to [-1,0]$ is the reward function, and $\theta$ is the transition kernel, i.e., $\theta(s'|s,a^1,a^2) = P(s_{t+1}=s'\,|\,s_t=s, a^1_t=a^1, a^2_t=a^2)$, where $s_t\in\mathcal{S}$, $a^1_t\in\mathcal{A}^1$, $a^2_t\in\mathcal{A}^2$ are the state, the agent's action and the opponent's action at time $t = 1,2,3,\cdots$, respectively. We assume that $\mathcal{S}$ and $\mathcal{A}$ are finite sets with sizes $S = |\mathcal{S}|$ and $A = |\mathcal{A}|$. The game starts at some initial state $s_1$. At time $t = 1,2,3,\cdots$, the players observe the state $s_t$ and take actions $a^1_t, a^2_t$. The agent (the maximizer) receives the reward $r(s_t,a^1_t,a^2_t)$ from the opponent (the minimizer). The state then evolves to $s_{t+1}$ according to the probability distribution $\theta(\cdot|s_t,a^1_t,a^2_t)$. The goal of the agent is to maximize its cumulative reward while the opponent tries to minimize it. For ease of notation, we write $a := (a^1,a^2)$ and $a_t := (a^1_t,a^2_t)$, and accordingly denote $r(s_t,a^1_t,a^2_t)$ and $\theta(\cdot|s_t,a^1_t,a^2_t)$ by $r(s_t,a_t)$ and $\theta(\cdot|s_t,a_t)$, respectively.

The players' actions may depend on the history. Namely, denote by $\pi^1_t$ (resp. $\pi^2_t$) the mappings from the history $h_t = (s_1,a_1,\cdots,s_{t-1},a_{t-1},s_t)$ to the probability distributions over $\mathcal{A}^1$ (resp. $\mathcal{A}^2$). Let $\pi^1 := (\pi^1_1,\pi^1_2,\cdots)$ (resp. $\pi^2 := (\pi^2_1,\pi^2_2,\cdots)$) be the sequence of history-dependent randomized policies, whose class is denoted by $\Pi^{HR}$. When $\pi^1_t$ (resp. $\pi^2_t$) is independent of time (a stationary randomized policy), we drop the subscript $t$ and, with abuse of notation, write $\pi^1 := (\pi^1,\pi^1,\cdots)$ (resp. $\pi^2 := (\pi^2,\pi^2,\cdots)$). The class of stationary randomized policies is denoted by $\Pi^{SR}$.

For ease of presentation, we introduce some further notation. Let $A^1 = |\mathcal{A}^1|$ and $A^2 = |\mathcal{A}^2|$ denote the sizes of the action spaces. For an integer $k\ge 1$, denote by $\Delta_k$ the probability simplex of dimension $k$.
Let $q^1\in\Delta_{A^1}$ and $q^2\in\Delta_{A^2}$. With abuse of notation, let $r(s,q^1,q^2) := \mathbb{E}_{a^1\sim q^1,\,a^2\sim q^2}[r(s,a^1,a^2)]$ and $\theta(s'|s,q^1,q^2) := \mathbb{E}_{a^1\sim q^1,\,a^2\sim q^2}[\theta(s'|s,a^1,a^2)]$.

To achieve a low-regret algorithm, it is necessary to assume that all states are accessible by the agent under some policy. In the special case of MDPs, this is captured by the notion of a "weakly communicating" MDP (or "finite diameter", Jaksch et al. (2010)), which is known to be the minimal assumption for sub-linear regret (Bartlett and Tewari (2009)). The following assumption generalizes this notion to stochastic games.

Assumption 6.1 (Finite Diameter). There exists $D\ge 0$ such that for any stationary randomized policy $\pi^2\in\Pi^{SR}$ of the opponent and any $s,s'\in\mathcal{S}$, there exists a stationary randomized policy $\pi^1\in\Pi^{SR}$ of the agent such that the expected time of reaching $s'$ starting from $s$ under policy $\pi = (\pi^1,\pi^2)$ does not exceed $D$, i.e.,
$$\max_{s,s'}\ \max_{\pi^2\in\Pi^{SR}}\ \min_{\pi^1\in\Pi^{SR}}\ T^\pi_{s\to s'} \le D,$$
where $T^\pi_{s\to s'}$ is the expected time of reaching $s'$ starting from $s$ under policy $\pi = (\pi^1,\pi^2)$.

This assumption was first introduced by Federgruen (1978) and is essential for low-regret algorithms in the adversarial setting (Wei et al. (2017)). To see this, suppose the opponent had a way to lock the agent in a "bad" state. In the initial stages of the game, when the agent has limited knowledge of the environment, it may not be possible to avoid such a state, and linear regret is unavoidable. The assumption thus states that, regardless of the strategy used by the opponent, the agent has a way to recover from bad states.

For a general matrix game with an $m\times n$ matrix $G$, the game value is denoted by
$$\mathrm{val}(G) = \max_{p\in\Delta_m}\min_{q\in\Delta_n} p^\top G q = \min_{q\in\Delta_n}\max_{p\in\Delta_m} p^\top G q.$$
Moreover, a Nash equilibrium $p^*\in\Delta_m$, $q^*\in\Delta_n$ always exists (Nash et al. (1950)). For SGs, under Assumption 6.1, Federgruen (1978); Wei et al.
(2017) prove that there exist a unique $J(\theta)\in\mathbb{R}$ and a unique (up to an additive constant) function $v(\cdot,\theta): \mathcal{S}\to\mathbb{R}$ that satisfy the Bellman equation: for all $s\in\mathcal{S}$,
$$J(\theta) + v(s,\theta) = \mathrm{val}\Big\{r(s,\cdot,\cdot) + \sum_{s'}\theta(s'|s,\cdot,\cdot)\,v(s',\theta)\Big\}. \tag{6.1}$$
In particular, the Nash equilibrium of the right hand side at each $s\in\mathcal{S}$ yields maximin stationary policies $\pi^* = (\pi^{1*},\pi^{2*})$ such that
$$J(\theta) + v(s,\theta) = \max_{q^1\in\Delta_{A^1}}\Big\{r(s,q^1,\pi^{2*}(\cdot|s)) + \sum_{s'}\theta(s'|s,q^1,\pi^{2*}(\cdot|s))\,v(s',\theta)\Big\}, \tag{6.2}$$
$$J(\theta) + v(s,\theta) = \min_{q^2\in\Delta_{A^2}}\Big\{r(s,\pi^{1*}(\cdot|s),q^2) + \sum_{s'}\theta(s'|s,\pi^{1*}(\cdot|s),q^2)\,v(s',\theta)\Big\}. \tag{6.3}$$
Moreover, $J(\theta)$ is the maximin average reward obtained by the agent and is independent of the initial state $s_1$, i.e.,
$$J(\theta) = \sup_{\pi^1\in\Pi^{HR}}\inf_{\pi^2\in\Pi^{HR}}\liminf_{T\to\infty}\frac{1}{T}\mathbb{E}\Big[\sum_{t=1}^{T}r(s_t,a_t)\,\Big|\,s_1=s\Big],$$
where $a_t = (a^1_t,a^2_t)$ with $a^1_t\sim\pi^1_t(\cdot|h_t)$ and $a^2_t\sim\pi^2_t(\cdot|h_t)$. Note that $J(\theta)\in[-1,0]$ because the range of the reward function is $[-1,0]$. Define the span of the stochastic game with transition kernel $\theta$ as the span of the corresponding value function $v$, i.e., $\mathrm{sp}(\theta) := \max_s v(s,\theta) - \min_s v(s,\theta)$. We restrict our attention to stochastic games whose transition kernel $\theta$ satisfies Assumption 6.1 and $\mathrm{sp}(\theta)\le H$, where $H$ is a known scalar. Let $\Omega^*$ denote the set of all such $\theta$. Moreover, observe that if $v$ satisfies the Bellman equation, then $v+c$ also satisfies it for any scalar $c$. Thus, without loss of generality, we can assume $0\le v(s,\theta)\le H$ for all $s\in\mathcal{S}$ and $\theta\in\Omega^*$.

We consider the problem of an agent playing the stochastic game $(\mathcal{S},\mathcal{A},r,\theta^*)$ against an opponent who can take time-adaptive policies. We assume the opponent knows the history of states and actions and can play time-adaptive, history-dependent policies; recall that the class of such policies is denoted by $\Pi^{HR}$. The sets $\mathcal{S}$, $\mathcal{A}$ and the reward function $r$ are completely known to the agent. However, the transition kernel $\theta^*$ is unknown. At the beginning of the game, $\theta^*$ is drawn from an initial distribution $\mu_1$ and is then fixed.
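The $\mathrm{val}\{\cdot\}$ operator in the Bellman equation above requires solving a zero-sum matrix game at every state. As a self-contained illustration (not the method used in the thesis; exact values are typically computed by linear programming), fictitious play approximates $\mathrm{val}(G) = \max_p\min_q p^\top G q$ by letting each player best-respond to the opponent's empirical mixture. The $2\times 2$ payoff matrix below is a toy example with known value $1.5$.

```python
# Fictitious play for a zero-sum matrix game: the row player's best-response
# payoff against the column player's empirical mixture upper bounds val(G),
# and symmetrically for the column player; both bounds converge to val(G).
def fictitious_play(G, iters=20000):
    m, n = len(G), len(G[0])
    row_payoff = [0.0] * m   # cumulative payoff of each pure row vs. column history
    col_payoff = [0.0] * n   # cumulative payoff of each pure column vs. row history
    for _ in range(iters):
        i = max(range(m), key=lambda r: row_payoff[r])   # row best response
        j = min(range(n), key=lambda c: col_payoff[c])   # column best response
        for r in range(m):
            row_payoff[r] += G[r][j]
        for c in range(n):
            col_payoff[c] += G[i][c]
    # lower <= val(G) <= upper always holds, and the gap shrinks with iters
    return min(col_payoff) / iters, max(row_payoff) / iters

G = [[3.0, 1.0], [0.0, 2.0]]   # toy game: val(G) = 1.5
lower, upper = fictitious_play(G)
assert lower <= 1.5 <= upper   # the two bounds bracket the value
assert upper - lower < 0.2     # and are close after enough iterations
```

The bracketing property is deterministic: $\max_r (G\bar q)_r \ge \mathrm{val}(G) \ge \min_c (\bar p^\top G)_c$ for any empirical mixtures $\bar p,\bar q$, which makes this a convenient sanity check for any matrix-game solver.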
We assume that the support of $\mu_1$ is a subset of $\Omega^*$. The performance of the agent is measured by the notion of regret defined as
$$R_T := \sup_{\pi^2\in\Pi^{HR}}\mathbb{E}\Big[\sum_{t=1}^{T}\big(J(\theta^*) - r(s_t,a_t)\big)\Big], \tag{6.4}$$
where $a^2_t\sim\pi^2_t(\cdot|h_t)$. Here the expectation is with respect to the prior distribution $\mu_1$, the randomization of the algorithm, and the randomness of the state transitions. Note that this regret guarantee is against an arbitrary opponent that can change its policy at each time step and has perfect knowledge of the history of states and actions. The only information hidden from the opponent is the realization of the agent's current action (which is revealed after both players have chosen their actions). We note that self-play, and the case where the opponent uses the same learning algorithm as the agent, are two special cases of the scenario considered here.

Algorithm 5 PSRL-ZSG
Input: $\mu_1$
Initialization: $t\leftarrow 1$, $t_1\leftarrow 0$
for episodes $k = 1,2,\cdots$ do
  $T_{k-1}\leftarrow t - t_k$
  $t_k\leftarrow t$
  Generate $\theta_k\sim\mu_{t_k}$ and compute $\pi^1_k(\cdot)$ using (6.1)
  while $t\le t_k + T_{k-1}$ and $N_t(s,a)\le 2N_{t_k}(s,a)$ for all $(s,a)\in\mathcal{S}\times\mathcal{A}$ do
    Choose action $a^1_t\sim\pi^1_k(\cdot|s_t)$ and observe $a^2_t$, $s_{t+1}$
    Update $\mu_{t+1}$ according to (6.5)
    $t\leftarrow t+1$

6.3 Posterior Sampling for Stochastic Games

In this section, we propose the Posterior Sampling algorithm for Zero-sum SGs (PSRL-ZSG). The agent maintains a posterior distribution $\mu_t$ on the parameter $\theta^*$. More precisely, the learning algorithm receives an initial distribution $\mu_1$ as input and updates the posterior distribution upon observing the new state according to
$$\mu_{t+1}(d\theta) \propto \theta(s_{t+1}|s_t,a_t)\,\mu_t(d\theta). \tag{6.5}$$
PSRL-ZSG proceeds in episodes. Let $t_k$ and $T_k$ denote the start time and the length of episode $k$, respectively. At the beginning of each episode, the agent draws a sample of the transition kernel from the posterior distribution $\mu_{t_k}$. The maximin strategy is then derived for the sampled transition kernel according to (6.1) and used by the agent throughout the episode.
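The episode loop of Algorithm 5 can be sketched schematically as follows. This is a structural sketch only: the posterior sample and the maximin-policy solver are random placeholders standing in for a draw $\theta_k\sim\mu_{t_k}$ and the solution of (6.1), and the environment transition is a stand-in for $\theta^*$. The final assertion checks the episode-count bound of Lemma 6.3, which holds for any trajectory.

```python
import math
import random

# Schematic sketch of the PSRL-ZSG episode loop and its two stopping criteria.
random.seed(0)
S, A, T = 3, 4, 2000          # A plays the role of |A^1 x A^2| joint action pairs
N = {(s, a): 0 for s in range(S) for a in range(A)}   # visit counts N_t(s, a)

def sample_kernel():          # placeholder for theta_k ~ mu_{t_k}
    return {(s, a): [random.random() for _ in range(S)] for (s, a) in N}

def maximin_policy(theta_k):  # placeholder for the maximin solution of (6.1)
    return lambda s: random.randrange(A)

t, s, prev_len, K = 1, 0, 0, 0
while t <= T:
    K += 1                    # start of episode k
    t_k, N_k = t, dict(N)     # freeze counts at the episode start
    pi_k = maximin_policy(sample_kernel())
    # stay in the episode while t <= t_k + T_{k-1} and no count has doubled
    while t <= T and t <= t_k + prev_len and all(N[z] <= 2 * N_k[z] for z in N):
        a = pi_k(s)           # opponent's action folded into the joint action a
        N[(s, a)] += 1
        s = random.randrange(S)   # environment transition (stand-in for theta*)
        t += 1
    prev_len = t - t_k        # T_k

assert K <= math.sqrt(2 * S * A * T * math.log(T))   # Lemma 6.3
```

Early on, the doubling criterion fires frequently and keeps episodes short (exploration); later, episodes lengthen and the sampled maximin policy is exploited for longer stretches.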
Let $N_t(s,a)$ be the number of visits to the state-action pair $(s,a) = (s, a^1, a^2)$ before time $t$, i.e.,
$$N_t(s,a) = \sum_{\tau=1}^{t-1} \mathbb{1}(s_\tau = s, a_\tau = a).$$
As described in Algorithm 5, a new episode starts if $t > t_k + T_{k-1}$ or $N_t(s,a) > 2 N_{t_k}(s,a)$ for some $(s,a)$. The first criterion, $t > t_k + T_{k-1}$, states that the length of an episode grows by at most 1 if the other criterion is not triggered; this ensures that $T_k \le T_{k-1} + 1$ for all $k$. The second criterion is triggered when the number of visits to some state-action pair has doubled. These stopping criteria balance the trade-off between exploration and exploitation. At the beginning of the game, the episodes are short to motivate exploration, since the agent is uncertain about the underlying environment. As the game proceeds, the episodes grow to exploit the information gathered about the environment. These stopping criteria are the same as those used for MDPs in Ouyang et al. (2017b). Algorithm 5 achieves a regret bound of $\tilde{O}(HS\sqrt{AT})$. This result improves upon the previous best known result, the UCSG algorithm, which achieves $\tilde{O}(\sqrt[3]{DS^2 A T^2})$ under the same assumption (Wei et al. (2017)).

Theorem 6.1. Under Assumption 6.1, Algorithm 5 achieves the regret bound
$$R_T \le (H+1)\sqrt{2SAT\log T} + H + H\big(SA + 2\sqrt{SAT}\big)\sqrt{224\, S \log(2AT)}. \quad (6.6)$$

6.4 Analysis

In this section, we provide the proof of Theorem 6.1. A central observation in our analysis is that at the beginning of each episode, $\theta^*$ and $\theta_k$ are identically distributed conditioned on the history. This key property of posterior sampling relates quantities that depend on the unknown $\theta^*$ to those of the sampled $\theta_k$, which is fully observed by the agent. Posterior sampling ensures that if $t_k$ is a stopping time, then for any measurable function $f$ and any $h_{t_k}$-measurable random variable $X$,
$$\mathbb{E}[f(\theta^*, X) \,|\, h_{t_k}] = \mathbb{E}[f(\theta_k, X) \,|\, h_{t_k}]$$
(Ouyang et al. (2017b); Osband et al. (2013)). The key challenge in the analysis of stochastic games is that the opponent is also making decisions.
If the opponent follows a fixed stationary policy, it can be considered part of the environment, and thus the SG reduces to an MDP. However, when the opponent uses a dynamic, history-dependent policy during the agent's learning phase, this reduction is not possible. The key lemma in our analysis is Lemma 6.2, which overcomes this difficulty through the Bellman equation for the SG.

Proof of Theorem 6.1

Let $K_T := \max\{k : t_k \le T\}$ be the number of episodes up to time $T$, and define $t_{K_T + 1} = T + 1$. Recall that $R_T = \sup_{\pi^2 \in \Pi_{\mathrm{HR}}} R_T(\pi^2)$, where
$$R_T(\pi^2) = \mathbb{E}\Big[ T J(\theta^*) - \sum_{t=1}^{T} r(s_t, a_t) \Big]. \quad (6.7)$$
Let $\pi^2 \in \Pi_{\mathrm{HR}}$ be an arbitrary history-dependent randomized strategy followed by the opponent. We start by decomposing the regret into two terms:
$$R_T(\pi^2) = \mathbb{E}\Big[ T J(\theta^*) - \sum_{k=1}^{K_T} \sum_{t=t_k}^{t_{k+1}-1} J(\theta_k) \Big] + \mathbb{E}\Big[ \sum_{k=1}^{K_T} \sum_{t=t_k}^{t_{k+1}-1} \big( J(\theta_k) - r(s_t, a_t) \big) \Big]. \quad (6.8)$$
Lemma 6.1 uses the property of posterior sampling to bound the first term. The second term is handled by combining the Bellman equation, concentration inequalities, and the property of posterior sampling, as detailed in Lemma 6.2. The proofs can be found in the appendix.

Lemma 6.1. The first term of (6.8) can be bounded by
$$\mathbb{E}\Big[ T J(\theta^*) - \sum_{k=1}^{K_T} \sum_{t=t_k}^{t_{k+1}-1} J(\theta_k) \Big] \le \mathbb{E}[K_T].$$

Lemma 6.2. The second term of (6.8) can be bounded by
$$\mathbb{E}\Big[ \sum_{k=1}^{K_T} \sum_{t=t_k}^{t_{k+1}-1} \big( J(\theta_k) - r(s_t, a_t) \big) \Big] \le H \mathbb{E}[K_T] + H + \sqrt{224\, S \log(2AT)}\,\big( HSA + 2H\sqrt{SAT} \big).$$

It remains to bound the number of episodes. The following lemma completes the proof of Theorem 6.1.

Lemma 6.3 (Lemma 1 of Ouyang et al. (2017b)). The number of episodes can be bounded by $K_T \le \sqrt{2SAT\log T}$.

Appendices

In this section, we provide detailed proofs of the lemmas used in Section 6.4.

6.A Proof of Lemma 6.1

Lemma 6.1 (Restated). The first term of (6.8) can be bounded by
$$\mathbb{E}\Big[ T J(\theta^*) - \sum_{k=1}^{K_T} \sum_{t=t_k}^{t_{k+1}-1} J(\theta_k) \Big] \le \mathbb{E}[K_T].$$

Proof.
$$\sum_{k=1}^{K_T} \sum_{t=t_k}^{t_{k+1}-1} J(\theta_k) = \sum_{k=1}^{K_T} T_k J(\theta_k) = \sum_{k=1}^{\infty} \mathbb{1}(t_k \le T)\, T_k J(\theta_k) \ge \sum_{k=1}^{\infty} \mathbb{1}(t_k \le T)\,(T_{k-1}+1)\, J(\theta_k), \quad (6.9)$$
where the last inequality holds because $J(\theta_k) \le 0$ and $T_k \le T_{k-1} + 1$ due to the first criterion for starting new episodes. Now, note that $t_k$ is a stopping time, and $\mathbb{1}(t_k \le T)$ and $T_{k-1}$ are $h_{t_k}$-measurable random variables. Thus, by the property of posterior sampling, the tower property, and the monotone convergence theorem,
$$\mathbb{E}\Big[ \sum_{k=1}^{\infty} \mathbb{1}(t_k \le T)(T_{k-1}+1)\, J(\theta_k) \Big] = \sum_{k=1}^{\infty} \mathbb{E}\big[ \mathbb{E}[\mathbb{1}(t_k \le T)(T_{k-1}+1)\, J(\theta_k) \,|\, h_{t_k}] \big] = \sum_{k=1}^{\infty} \mathbb{E}\big[ \mathbb{E}[\mathbb{1}(t_k \le T)(T_{k-1}+1)\, J(\theta^*) \,|\, h_{t_k}] \big] = \mathbb{E}\Big[ \sum_{k=1}^{K_T} (T_{k-1}+1)\, J(\theta^*) \Big].$$
Substituting this in (6.9) implies that
$$\mathbb{E}\Big[ T J(\theta^*) - \sum_{k=1}^{K_T} \sum_{t=t_k}^{t_{k+1}-1} J(\theta_k) \Big] \le \mathbb{E}\Big[ \Big( T - \sum_{k=1}^{K_T} T_{k-1} \Big) J(\theta^*) \Big] - \mathbb{E}[K_T J(\theta^*)] \le \mathbb{E}[K_T].$$
The last inequality holds because $T - \sum_{k=1}^{K_T} T_{k-1} \ge 0$, so the first term is nonpositive, and $J(\theta^*) \in [-1, 0]$.

6.B Proof of Lemma 6.2

Lemma 6.2 (Restated). The second term of (6.8) can be bounded by
$$\mathbb{E}\Big[ \sum_{k=1}^{K_T} \sum_{t=t_k}^{t_{k+1}-1} \big( J(\theta_k) - r(s_t, a_t) \big) \Big] \le H\mathbb{E}[K_T] + H + \sqrt{224\, S \log(2AT)}\,\big( HSA + 2H\sqrt{SAT} \big).$$

Proof. The policy $\pi^1_k$ used by the agent in episode $k$ is the solution of the Nash equilibrium in (6.1). Thus, for $t_k \le t \le t_{k+1}-1$ and any $s \in \mathcal{S}$, (6.3) implies that
$$J(\theta_k) + v(s, \theta_k) \le r(s, \pi^1_k(\cdot|s), q^2) + \sum_{s'} \theta_k(s'|s, \pi^1_k(\cdot|s), q^2)\, v(s', \theta_k)$$
for any distribution $q^2 \in \Delta_{\mathcal{A}^2}$. Let $\pi^2 = (\pi^2_1, \pi^2_2, \cdots) \in \Pi_{\mathrm{HR}}$ be an arbitrary history-dependent randomized strategy for the opponent. Note that for any $t \ge 1$, $\pi^2_t$ is $h_t$-measurable. Replacing $s$ by $s_t$ and $q^2$ by $\pi^2_t(\cdot|h_t)$ implies that
$$J(\theta_k) - r(s_t, \pi^1_k(\cdot|s_t), \pi^2_t(\cdot|h_t)) \le \sum_{s'} \theta_k(s'|s_t, \pi^1_k(\cdot|s_t), \pi^2_t(\cdot|h_t))\, v(s', \theta_k) - v(s_t, \theta_k).$$
Adding and subtracting $v(s_{t+1}, \theta_k)$ on the right-hand side and summing over the time steps within episode $k$ implies that
$$\sum_{t=t_k}^{t_{k+1}-1} \big( J(\theta_k) - r(s_t, \pi^1_k(\cdot|s_t), \pi^2_t(\cdot|h_t)) \big) \le \sum_{t=t_k}^{t_{k+1}-1} \Big( \sum_{s'} \theta_k(s'|s_t, \pi^1_k(\cdot|s_t), \pi^2_t(\cdot|h_t))\, v(s', \theta_k) - v(s_{t+1}, \theta_k) \Big) + \sum_{t=t_k}^{t_{k+1}-1} \big( v(s_{t+1}, \theta_k) - v(s_t, \theta_k) \big). \quad (6.10)$$
The second term on the right-hand side of (6.10) telescopes and can be bounded as
$$\sum_{t=t_k}^{t_{k+1}-1} \big( v(s_{t+1}, \theta_k) - v(s_t, \theta_k) \big) = v(s_{t_{k+1}}, \theta_k) - v(s_{t_k}, \theta_k) \le H, \quad (6.11)$$
where the last inequality holds because $\theta_k$ is chosen from the posterior distribution, whose support is a subset of $\Omega^*$. Substituting (6.11) into (6.10), summing over episodes, and taking expectations implies that
$$\mathbb{E}\Big[ \sum_{k=1}^{K_T} \sum_{t=t_k}^{t_{k+1}-1} \big( J(\theta_k) - r(s_t, a_t) \big) \Big] = \mathbb{E}\Big[ \sum_{k=1}^{K_T} \sum_{t=t_k}^{t_{k+1}-1} \big( J(\theta_k) - r(s_t, \pi^1_k(\cdot|s_t), \pi^2_t(\cdot|h_t)) \big) \Big] \le H\mathbb{E}[K_T] + \mathbb{E}\Big[ \sum_{k=1}^{K_T} \sum_{t=t_k}^{t_{k+1}-1} \Big( \sum_{s'} \theta_k(s'|s_t, \pi^1_k(\cdot|s_t), \pi^2_t(\cdot|h_t))\, v(s', \theta_k) - v(s_{t+1}, \theta_k) \Big) \Big].$$
We proceed to bound the last term on the right-hand side of the above inequality:
$$\mathbb{E}\Big[ \sum_{k=1}^{K_T} \sum_{t=t_k}^{t_{k+1}-1} \Big( \sum_{s'} \theta_k(s'|s_t, \pi^1_k(\cdot|s_t), \pi^2_t(\cdot|h_t))\, v(s', \theta_k) - v(s_{t+1}, \theta_k) \Big) \Big] = \mathbb{E}\Big[ \sum_{k=1}^{K_T} \sum_{t=t_k}^{t_{k+1}-1} \Big( \sum_{s'} \theta_k(s'|s_t, a^1_t, a^2_t)\, v(s', \theta_k) - v(s_{t+1}, \theta_k) \Big) \Big] = \mathbb{E}\Big[ \sum_{k=1}^{K_T} \sum_{t=t_k}^{t_{k+1}-1} \sum_{s'} \big( \theta_k(s'|s_t, a_t) - \theta^*(s'|s_t, a_t) \big)\, v(s', \theta_k) \Big] \le H\, \mathbb{E}\Big[ \sum_{k=1}^{K_T} \sum_{t=t_k}^{t_{k+1}-1} \sum_{s'} \big| \theta_k(s'|s_t, a_t) - \theta^*(s'|s_t, a_t) \big| \Big]. \quad (6.12)$$
To bound the inner summation, similar to Ouyang et al. (2017b) and Jaksch et al. (2010), we define a confidence set $\mathcal{C}_k$ around the empirical transition kernel $\hat{\theta}_k(s'|s,a) := N_{t_k}(s',s,a) / N_{t_k}(s,a)$. Here $N_{t_k}(s',s,a) := \sum_{t=1}^{t_k - 1} \mathbb{1}(s_t = s, a_t = a, s_{t+1} = s')$ is the number of visits to the state-action pair $(s,a)$ whose next state is $s'$. The confidence set $\mathcal{C}_k$ is defined as
$$\mathcal{C}_k := \Big\{ \theta : \sum_{s'} \big| \theta(s'|s,a) - \hat{\theta}_k(s'|s,a) \big| \le b_k(s,a) \ \ \forall s, a \Big\},$$
where $b_k(s,a) := \sqrt{\dfrac{14\, S \log(2 A t_k T)}{\max\{1, N_{t_k}(s,a)\}}}$. Weissman et al.
(2003) show that the true transition kernel $\theta^*$ belongs to $\mathcal{C}_k$ with high probability. We use this fact to show concentration of $\hat{\theta}_k$ around $\theta^*$; concentration of $\hat{\theta}_k$ around $\theta_k$ then follows from the property of posterior sampling. More precisely, we can write
$$\sum_{s'} \big|\theta_k(s'|s_t,a_t) - \theta^*(s'|s_t,a_t)\big| \le \sum_{s'} \big|\theta_k(s'|s_t,a_t) - \hat{\theta}_k(s'|s_t,a_t)\big| + \sum_{s'} \big|\theta^*(s'|s_t,a_t) - \hat{\theta}_k(s'|s_t,a_t)\big| \le 2 b_k(s_t,a_t) + 2\big( \mathbb{1}(\theta_k \notin \mathcal{C}_k) + \mathbb{1}(\theta^* \notin \mathcal{C}_k) \big).$$
Substituting this upper bound into the inner sum of (6.12) implies
$$H\, \mathbb{E}\Big[ \sum_{k=1}^{K_T} \sum_{t=t_k}^{t_{k+1}-1} \sum_{s'} \big|\theta_k(s'|s_t,a_t) - \theta^*(s'|s_t,a_t)\big| \Big] \le 2H\, \mathbb{E}\Big[ \sum_{k=1}^{K_T} \sum_{t=t_k}^{t_{k+1}-1} b_k(s_t,a_t) \Big] + 2H\, \mathbb{E}\Big[ \sum_{k=1}^{K_T} T_k \big\{ \mathbb{1}(\theta_k \notin \mathcal{C}_k) + \mathbb{1}(\theta^* \notin \mathcal{C}_k) \big\} \Big]. \quad (6.13)$$
The first term on the right-hand side of (6.13) can be bounded as
$$\sum_{k=1}^{K_T} \sum_{t=t_k}^{t_{k+1}-1} b_k(s_t,a_t) = \sum_{k=1}^{K_T} \sum_{t=t_k}^{t_{k+1}-1} \sqrt{\frac{14\, S \log(2 A t_k T)}{\max\{1, N_{t_k}(s_t,a_t)\}}} \le \sum_{k=1}^{K_T} \sum_{t=t_k}^{t_{k+1}-1} \sqrt{\frac{28\, S \log(2 A T^2)}{\max\{1, N_t(s_t,a_t)\}}} = \sum_{t=1}^{T} \sqrt{\frac{28\, S \log(2 A T^2)}{\max\{1, N_t(s_t,a_t)\}}} \le \sqrt{56\, S \log(2AT)}\, \big( SA + 2\sqrt{SAT} \big), \quad (6.14)$$
where the first inequality uses $t_k \le T$ and $N_t(s,a) \le 2 N_{t_k}(s,a)$ for all $(s,a)$ within an episode, and the second inequality follows from the argument below:
$$\sum_{t=1}^{T} \frac{1}{\sqrt{\max\{1, N_t(s_t,a_t)\}}} = \sum_{s,a} \sum_{t=1}^{T} \frac{\mathbb{1}(s_t = s, a_t = a)}{\sqrt{\max\{1, N_t(s,a)\}}} = \sum_{s,a} \Big( 1 + \sum_{j=1}^{N_{T+1}(s,a)-1} \frac{1}{\sqrt{j}} \Big) \le \sum_{s,a} \big( 1 + 2\sqrt{N_{T+1}(s,a)} \big) = SA + 2 \sum_{s,a} \sqrt{N_{T+1}(s,a)} \le SA + 2\sqrt{SA \sum_{s,a} N_{T+1}(s,a)} = SA + 2\sqrt{SAT},$$
where the last inequality is by the Cauchy-Schwarz inequality and the last equality uses $\sum_{s,a} N_{T+1}(s,a) = T$.
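The confidence width $b_k(s,a)$ and the $L_1$ membership test defining $\mathcal{C}_k$ are straightforward to compute. A minimal sketch (the specific sizes and visit counts below are illustrative):

```python
import numpy as np

def confidence_width(S, A, t_k, T, n_visits):
    """b_k(s,a) = sqrt(14 S log(2 A t_k T) / max(1, N_{t_k}(s,a)))."""
    return np.sqrt(14.0 * S * np.log(2.0 * A * t_k * T) / max(1, n_visits))

def in_confidence_set(theta_row, theta_hat_row, b):
    """C_k membership at one (s,a): L1 distance between transition rows <= b."""
    return np.abs(np.asarray(theta_row) - np.asarray(theta_hat_row)).sum() <= b

# The width shrinks as the visit count of (s,a) grows.
wide = confidence_width(S=5, A=4, t_k=10, T=1000, n_visits=1)
narrow = confidence_width(S=5, A=4, t_k=10, T=1000, n_visits=100)
```

The shrinking width is exactly what drives the $\sum_t 1/\sqrt{N_t}$ summation bounded above.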
To bound the second term on the right-hand side of (6.13), we can write
$$\mathbb{E}\Big[ \sum_{k=1}^{K_T} T_k \big\{ \mathbb{1}(\theta_k \notin \mathcal{C}_k) + \mathbb{1}(\theta^* \notin \mathcal{C}_k) \big\} \Big] \le \mathbb{E}\Big[ \sum_{k=1}^{\infty} T \big\{ \mathbb{1}(\theta_k \notin \mathcal{C}_k) + \mathbb{1}(\theta^* \notin \mathcal{C}_k) \big\} \Big] = T \sum_{k=1}^{\infty} \mathbb{E}\big[ \mathbb{1}(\theta_k \notin \mathcal{C}_k) + \mathbb{1}(\theta^* \notin \mathcal{C}_k) \big] = 2T \sum_{k=1}^{\infty} \mathbb{E}\big[ \mathbb{1}(\theta^* \notin \mathcal{C}_k) \big] = 2T \sum_{k=1}^{\infty} \mathbb{P}(\theta^* \notin \mathcal{C}_k),$$
where the second equality is by the property of posterior sampling, since $\mathcal{C}_k$ is $h_{t_k}$-measurable. Note that $\mathbb{P}(\theta^* \notin \mathcal{C}_k) \le \frac{1}{15 T t_k^6}$ (Lemma 17 of Jaksch et al. (2010)). Thus,
$$2T \sum_{k=1}^{\infty} \mathbb{P}(\theta^* \notin \mathcal{C}_k) \le \frac{2}{15} \sum_{k=1}^{\infty} \frac{1}{t_k^6} \le \frac{2}{15} \sum_{k=1}^{\infty} \frac{1}{k^6} \le \frac{1}{2}. \quad (6.15)$$
Combining (6.14) and (6.15) in (6.13) completes the proof.

Chapter 7
Non-indexability of the Stochastic Appointment Scheduling Problem

7.1 Introduction

Scheduling is an important aspect of efficient resource utilization, and there is a vast literature on the topic dating back several decades (see, for example, Baker (1974); Conway et al. (2003); Erdogan et al. (2015); Pinedo (2012)). The problem considered in this chapter is stochastic appointment scheduling, which has various applications in the scheduling of surgeries at operating rooms, appointments in outpatient clinics, cargo ships at seaports, etc.

Problem Statement: The stochastic appointment scheduling problem (ASP) has a simple statement: consider a finite set of $n$ jobs to be scheduled on a single machine. Job durations are random with known distributions. If a job completes before the appointment time of the subsequent job, the server remains idle; conversely, if it lasts beyond its allocated slot, the following job is delayed. We need to determine optimal appointment times so that the expectation of a function of idle time and delay is minimized. Thus, the ASP addresses two important questions. First, given the sequence of jobs, what are the optimal appointment times? This is called the scheduling problem. Second, in what sequence should the jobs be served? This is called the sequencing problem.
We note that while we call it the sequencing problem for brevity, it really refers to the joint sequencing-scheduling problem of determining both the optimal sequence and the optimal appointment times. An example of the scheduling problem is cargo ships at seaports, which must be given time to berth in the order in which they arrive. An example of the sequencing problem is surgeries in a single operating room, which must be scheduled the day before in an order that minimizes certain metrics such as wait time and idle time.

Intuitively speaking, scheduling jobs with more uncertain durations first may lead to delay propagation through the schedule. This intuition has motivated many researchers to attempt to prove optimality of the 'least variance first' (LVF) policy. However, efforts beyond the case of $n = 2$ have not been fruitful, even for specific distributions such as uniform and exponential (Weiss (1990); Wang (1999)). Recent numerical work of Mansourifard et al. (2018) has argued that LVF is not the best heuristic for the sequencing problem when idle time and delay unit costs are not balanced. They introduce the 'newsvendor' index as another heuristic that outperforms variance; however, no proof of optimality is provided. This controversy raises an important open question: is there an index (a map from a random variable to the reals) that yields the optimal sequence?

In this chapter, both the scheduling and sequencing problems are addressed. For the sequencing problem, we introduce an index and prove that it is the only possible candidate to return the optimal sequence. This candidate index reduces to the 'newsvendor' index and the variance index for $l_1$- and $l_2$-type cost functions, respectively. However, by providing counterexamples to the optimality of the variance and newsvendor indices, we show that the sequencing problem is not indexable in general.
Moreover, the candidate index shows that the heuristic sub-optimal indexing policy one might use depends on the objective function. In contrast to what has long been assumed in the literature, variance is not the candidate index for the $l_1$-type cost function; instead, the newsvendor index should be used. This theoretical result confirms the numerical evidence of Mansourifard et al. (2018) that the newsvendor index outperforms variance.

For the scheduling problem, we prove that the $l_1$-type objective function introduced by Weiss (1990) is convex and that a solution to the stochastic optimization problem exists. Moreover, sample average approximation (SAA) can be used to approximate the optimization problem, and we prove that SAA gives a solution that is statistically consistent. Together, these provide a feasible computational method for computing optimal appointment times given the sequence. To the best of our knowledge, this result is new from two perspectives. First, the result is proved for a general objective function $c$ (as long as it is convex), which includes the previously considered objective function in the literature (the $l_1$-type objective) as a special case of interest. Second, the assumptions for consistency of SAA are considerably relaxed compared to the standard SAA literature. In fact, we prove that SAA gives a consistent result as long as there exists a schedule with finite cost.

The main contributions of this chapter are:

• It is rigorously proved that there exists no index that yields the optimal sequence in the stochastic appointment scheduling problem (Theorem 7.1). However, a candidate index is introduced that yields a heuristic sub-optimal sequence. This candidate index reduces to the newsvendor and variance indices for $l_1$-type and $l_2$-type objective functions, respectively.

• It is proved that the objective function of Weiss (1990) for the stochastic appointment scheduling problem is convex and that an optimal schedule exists (Proposition 7.2 in the appendix and Theorem 7.3).
• It is proved that for a fixed sequence of jobs, sample average approximation yields an approximate solution for the optimal schedule that is statistically consistent (Theorem 7.4).

Literature Review

Extensive application of appointment scheduling in transportation, such as bus scheduling (Wu et al. (2019)), as well as in healthcare (Erdogan and Denton (2013)), has led to a vast body of literature on the topic. We note that the problem considered in this chapter is a variant of a broader scheme of railway scheduling (Herroelen and Leus (2004); Tian and Demeulemeester (2014)) in which jobs are not allowed to start earlier than their scheduled times. Other variants of scheduling, such as roadrunner scheduling (Newbold (1998)), are beyond the scope of this work. Among all the related papers, we focus on the most relevant work and classify it into sequencing and scheduling. The interested reader is referred to Hulshof et al. (2012); Gupta and Denton (2008); Cayirli and Veral (2003); Ahmadi-Javid et al. (2017); Cardoen et al. (2010); Kuiper et al. (2017); Chen and Robinson (2014); Baker (1974); Conway et al. (2003); Erdogan et al. (2015); Pinedo (2012); Kuiper and Mandjes (2015a,b); Wu et al. (2019); Erdogan and Denton (2013); Tian and Demeulemeester (2014) and the references therein for other aspects of the problem.

Sequencing. The sequencing problem we consider was first formulated by Weiss (1990). Intuitively speaking, jobs with less uncertainty should be placed first to avoid delay propagation throughout the schedule. Motivated by this intuition, a large body of literature suggested the 'least variance first' (LVF) rule as a sequencing policy (see Weiss (1990); Wang (1999); Mak et al. (2014); Denton et al. (2007); Qi (2016)). However, optimality of the LVF rule has only been proved for the case of two jobs ($n = 2$) and certain distributions (Weiss (1990); Wang (1999)). Mak et al. (2014) and Guda et al.
(2016) tried to impose conditions under which the LVF rule is optimal, but the conditions are relatively restrictive and unlikely to hold in most scenarios of interest. A variant of the problem in which jobs are allowed to start before their scheduled appointments (no idle time is allowed) is studied by Guda et al. (2016); Baker (2014). In particular, Guda et al. (2016) show that the LVF rule is optimal if there exists a dilation ordering for service durations. Gupta (2007) and Berg et al. (2014) proved that if there exists a convex ordering for job durations, it is optimal to schedule the job that is smaller in convex order first for $n = 2$; however, their efforts for $n > 2$ have not been fruitful. Kong et al. (2016) considered the likelihood ratio as a measure of variability and obtained some insight into why smallest variability first may not be optimal. Based on these insights, they provided a counterexample to the optimality of the LVF rule in the case of $n = 6$.

Besides the theoretical work, some papers have resorted to extensive simulation studies to investigate the optimality of heuristics (see Denton et al. (2007); Klassen and Rohleder (1996); Lebowitz (2003); Marcon and Dexter (2006)). In particular, Denton et al. (2007) numerically showed that LVF outperforms sequencing in increasing order of mean and of coefficient of variation. However, Mancilla and Storer (2012); Mansourifard et al. (2018) argued that LVF is not the best sequencing policy, especially when idle time and delay cost units are not balanced. Alternatively, Mansourifard et al. (2018) proposed a 'newsvendor' index and supported its better performance in simulations; no proof of optimality was provided.

Scheduling. The scheduling problem has also been intensively studied in the literature. The seminal work of Bailey (1952) recommended setting appointment intervals equal to the average service time of each job. This approach was further pursued by Soriano (1966) and Choi and Banerjee (2016).
However, setting job slots equal to the average service time can be near-optimal only when the waiting cost is about 10% to 50% of the idle cost (see Denton and Gupta (2003)). Starting with Weiss (1990), some papers modeled the problem using stochastic optimization over slot durations. Weiss considered a weighted sum of idle time and delay as the objective function and noticed that for the case of $n = 2$ the problem is equivalent to the newsvendor problem; based on this, he proposed a heuristic estimate of the job start times for $n > 2$. This heuristic was extended by Kemper et al. (2014) to general convex functions of idle time and delay. Wang (1993) and Wang (1999) considered another objective function, the weighted sum of the jobs' flow times (delay plus service time) and the server completion time, and proved its convexity. Assuming that job durations are exponentially distributed, Wang provided a set of nonlinear equations for deriving the optimal slot durations. Vink et al. (2015) presented a lag-order approximation that ignores the effect of previous jobs beyond a certain point. Kong et al. (2013) adopted a robust method over all distributions with a given mean and covariance matrix of job durations; they computationally solved instances with 36 jobs and showed that their solution is within 2% of the approximate optimal solution given by Denton and Gupta (2003). We refer the reader to Begen and Queyranne (2011) and Kaandorp and Koole (2007) for approaches in discrete time.

7.2 Problem Formulation

We start by providing a mathematical formulation of the problem. Let $\mathcal{X}^+$ be the space of nonnegative random variables, and let $X = (X_1, \cdots, X_n)$ be a vector of independent random variables with components in $\mathcal{X}^+$ and known distributions, denoting the durations of jobs to be served on a single server. Let $\sigma \cdot X$ denote a permutation of the jobs in the order they are served, so that $\sigma(i)$ is the $i$th job receiving service.
Without loss of generality, we can assume that the first job starts at time zero, i.e., $s_1 = 0$. Let $s = (s_2, \cdots, s_n)$ be the appointment times for jobs 2 through $n$ in the order $\sigma$, and let $E_{\sigma(i)}$ be a random variable denoting the end time of the $i$th job in this order (see Figure 7.1). Job $\sigma(i)$ may finish before or after the scheduled start time of the subsequent job. If $E_{\sigma(i)} \le s_{i+1}$, job $\sigma(i+1)$ starts according to the schedule and the server is idle between $E_{\sigma(i)}$ and $s_{i+1}$. If $E_{\sigma(i)} > s_{i+1}$, job $\sigma(i+1)$ is delayed by $E_{\sigma(i)} - s_{i+1}$ and starts as soon as the previous job finishes. Hence,
$$E_{\sigma(1)} = X_{\sigma(1)}, \qquad E_{\sigma(i)} = \max(E_{\sigma(i-1)}, s_i) + X_{\sigma(i)}, \quad i = 2, \cdots, n. \quad (7.1)$$

Figure 7.1: Appointment scheduling. $s_i$ denotes the appointment time of job $i$. For the realization shown in this figure, the server remains idle between $E_{\sigma(1)}$ and $s_2$, and the third job is delayed for $E_{\sigma(2)} - s_3$ amount of time.

Our goal is to determine appointment times such that a combination of delay and idle time is optimized. Consider an objective function of the form
$$C(s, \sigma \cdot X) = \sum_{i=2}^{n} g(E_{\sigma(i-1)} - s_i), \quad (7.2)$$
where $g : \mathbb{R} \to \mathbb{R}$ is a nonnegative, continuous, and coercive function (i.e., $\lim_{|t| \to \infty} g(t) = \infty$). Furthermore, we can assume that $g(0) = 0$, since the perfect scenario $E_{\sigma(i-1)} = s_i$ should impose no cost; however, this assumption is not technically necessary. For the special case $g = g_1$ of Example 7.1, this objective function reduces to that of Weiss (1990). However, (7.2) is not the most general objective function one can consider. For example, in some applications it is useful to distinguish between jobs by assigning different per-unit delay costs to different jobs. Moreover, (7.2) does not account for overtime, a related quantity that is important in some applications. Given the schedule $s$, $C(s, \sigma \cdot X)$ captures the cost associated with the realization of job durations $X$ in the order $\sigma$.
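The recursion (7.1) and the cost (7.2) are easy to evaluate by Monte Carlo for a generic penalty $g$. A minimal sketch (the uniform durations and the choice of $g$ below are illustrative assumptions):

```python
import numpy as np

def expected_cost(s, samples, g):
    """Monte-Carlo estimate of c(s) = E[ sum_{i>=2} g(E_{i-1} - s_i) ],
    using the end-time recursion of (7.1).
    samples: (num_draws, n) array of job durations in served order;
    s: appointment times s_2, ..., s_n (s_1 = 0 is implicit)."""
    total = 0.0
    for x in samples:
        end = x[0]                      # E_1 = X_1
        for i, s_i in enumerate(s, start=1):
            total += g(end - s_i)       # g(E_{i-1} - s_i)
            end = max(end, s_i) + x[i]  # E_i = max(E_{i-1}, s_i) + X_i
    return total / len(samples)

rng = np.random.default_rng(0)
durations = rng.uniform(0.0, 1.0, size=(2000, 3))   # three U(0,1) jobs
cost = expected_cost([0.5, 1.0], durations, g=abs)  # g(t) = |t|
```

Minimizing such a sample average over $s$ is exactly the SAA approach discussed for the scheduling problem.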
Thus, $c^\sigma(s) = \mathbb{E}[C(s, \sigma \cdot X)]$ denotes the expected cost of schedule $s$ when the jobs are served in the order $\sigma$. In the scheduling problem of Section 7.4, we assume that the sequence of jobs is given and look for a schedule that minimizes the expected cost, i.e.,
$$\inf_{s \in \mathcal{S}} c^\sigma(s), \quad (7.3)$$
where $\mathcal{S} = \{(s_2, \cdots, s_n) \in \mathbb{R}^{n-1} \mid 0 \le s_2 \le \cdots \le s_n\}$ is a closed and convex subset of $\mathbb{R}^{n-1}$. The sequencing problem, discussed in Section 7.3, addresses the question of finding the optimal order and appointment times of the jobs, i.e., $\min_\sigma \inf_{s \in \mathcal{S}} c^\sigma(s)$. Before proceeding with the sequencing problem, let us look at some possible choices of the function $g$.

Example 7.1. Let $g_1(t) = \beta (t)^+ + \alpha (-t)^+$, where $(\cdot)^+ = \max(\cdot, 0)$ and $\alpha, \beta > 0$. The objective function is then
$$c^\sigma_1(s) = \sum_{i=2}^{n} \mathbb{E}\big[ \alpha (s_i - E_{\sigma(i-1)})^+ + \beta (E_{\sigma(i-1)} - s_i)^+ \big]. \quad (7.4)$$

Figure 7.2: Examples of the function $g$.

Here $(s_i - E_{\sigma(i-1)})^+$ denotes the idle time before job $i$ and $(E_{\sigma(i-1)} - s_i)^+$ indicates its possible delay. The cost function $c^\sigma_1$ is the same cost function used by Weiss (1990). If $\alpha \ne \beta$, it captures potentially different costs associated with idle time and delay. We call this the $l_1$-type objective function.

Example 7.2. Let $g_2(t) = t^2$. The objective function reduces to
$$c^\sigma_2(s) = \sum_{i=2}^{n} \mathbb{E}\big[ (E_{\sigma(i-1)} - s_i)^2 \big]. \quad (7.5)$$
The cost function $c^\sigma_2$ penalizes idle time and delay equally; however, due to its nonlinearity, long idle times and delays are less tolerable. We call this the $l_2$-type objective function.

Example 7.3. Let
$$g_3(t) = \begin{cases} \beta (t - T_D), & \text{if } t \ge T_D \\ -\alpha (t + T_I), & \text{if } t \le -T_I \\ 0, & \text{otherwise,} \end{cases} \quad (7.6)$$
where $T_D, T_I \ge 0$ are the delay and idle-time tolerances, respectively (see Figure 7.2). In this case, no cost is imposed for delay or idle time below a certain threshold. This situation arises in some applications such as operating room scheduling, where a small amount of delay is tolerable.
7.3 Non-indexability of the Sequencing Problem

Non-indexability

In this section, we first consider the joint sequencing-scheduling problem (referred to simply as the 'sequencing problem', since the optimal sequence cannot be determined without also determining the optimal appointment times). Intuitively, scheduling jobs with higher uncertainty in durations first may lead to delay propagation through the schedule. For the objective function $c^\sigma_1$, this intuition has motivated many researchers to attempt to prove optimality of the least variance first (LVF) policy. However, the efforts have not been fruitful beyond the case of two jobs ($n = 2$) for some typical distributions such as exponential and uniform. Most related papers have therefore resorted to numerical evaluation to analyze the performance of the LVF rule. In particular, Denton et al. (2007) compared three ordering policies, namely increasing mean, increasing variance, and increasing coefficient of variation. Using numerical experiments with real surgery-duration data, they argued that ordering by increasing variance outperforms the other two heuristics. However, Mansourifard et al. (2018) claimed that variance does not capture the potential difference between idle-time and delay costs in $c^\sigma_1$. They introduced the newsvendor index, defined as
$$I^*_1(X) = \alpha\, \mathbb{E}[(s^* - X)^+] + \beta\, \mathbb{E}[(X - s^*)^+], \quad (7.7)$$
where $F_X$ is the cumulative distribution function of $X$ and $s^* := F_X^{-1}\big(\frac{\beta}{\alpha+\beta}\big)$. They numerically verified that sequencing in increasing order of $I^*_1$ outperforms LVF and conjectured that it returns the optimal sequence; no proof of optimality was given. These conjectures are evaluated in this section. In particular, we prove that there exists no index (a map from a random variable to the reals) that yields the optimal sequence for the objective functions $c^\sigma_1$ and $c^\sigma_2$.
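The newsvendor index (7.7) can be estimated from samples of a job's duration via the empirical $\frac{\beta}{\alpha+\beta}$-quantile. A minimal sketch (the sample-based estimator is an illustrative assumption, not a procedure from the thesis):

```python
import numpy as np

def newsvendor_index(samples, alpha=1.0, beta=1.0):
    """Estimate I*_1(X) = alpha E[(s* - X)^+] + beta E[(X - s*)^+],
    with s* the beta/(alpha+beta) quantile of X, from duration samples."""
    samples = np.asarray(samples, dtype=float)
    s_star = np.quantile(samples, beta / (alpha + beta))
    return (alpha * np.maximum(s_star - samples, 0.0).mean()
            + beta * np.maximum(samples - s_star, 0.0).mean())

# For X ~ U(0,1) with alpha = beta = 1, s* is the median 1/2 and
# I*_1(X) = E|X - 1/2| = 1/4.
x = np.linspace(0.0, 1.0, 100001)   # dense stand-in for U(0,1) samples
idx = newsvendor_index(x)
```

Sorting jobs by this quantity implements the increasing-$I^*_1$ heuristic of Mansourifard et al. (2018).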
Moreover, we rigorously prove that the only candidate to provide the optimal sequence is the newsvendor index for the objective function $c^\sigma_1$ and variance for the objective function $c^\sigma_2$. This provides theoretical support for the numerical evidence of Mansourifard et al. (2018). Moreover, it completely eliminates variance as a candidate heuristic for the objective function $c^\sigma_1$. Let us first start with a simple example of sequencing two jobs.

Example 7.4. Consider the case of scheduling two jobs with durations $X_1, X_2$. The optimization problem determining the optimal appointment time given the sequence $(X_1, X_2)$ is
$$\inf_{s_2 \ge 0} \mathbb{E}[g(X_1 - s_2)]. \quad (7.8)$$
The optimal cost given by the above equation is indeed an index that maps the random variable $X_1$ to a real number. Moreover, sorting in increasing order of this index yields the optimal sequence for $n = 2$.

Motivated by this example, we have a candidate index for general $n$:
$$I^*_g(X) = \inf_{s \ge 0} \mathbb{E}[g(X - s)]. \quad (7.9)$$
One can verify that this index reduces to variance ($I^*_2$) and the newsvendor index ($I^*_1$) in the cases $g(t) = t^2$ and $g(t) = \beta(t)^+ + \alpha(-t)^+$, respectively. The natural question is whether this index provides the optimal sequence for $n > 2$, and if not, whether any other index yields the optimal sequence. In what follows, we show that the answer to both questions is negative. In fact, we first prove in Proposition 7.1 that $I^*_g$ is the only possible candidate to return the optimal sequence, and then show through Counterexamples 7.5 and 7.6 that it is not optimal.

To prepare the setup for Proposition 7.1, let $\bar{\mathbb{R}} = \mathbb{R} \cup \{+\infty\}$ be the extended real line. We say $I : \mathcal{X}^+ \to \bar{\mathbb{R}}$ is an index and denote the space of all indices by $\mathcal{I}$. For example, the mean, variance, newsvendor index, and $I^*_g$ are elements of $\mathcal{I}$. First, we define an equivalence relation on $\mathcal{I}$.

Definition 7.1. Let $I_1, I_2 \in \mathcal{I}$.
We say $I_1$ is in relation with $I_2$, denoted $I_1 \mathcal{R} I_2$, if for any $X_1, X_2 \in \mathcal{X}^+$, $I_1(X_1) \le I_1(X_2)$ if and only if $I_2(X_1) \le I_2(X_2)$.

It is straightforward to check that $\mathcal{R}$ is an equivalence relation on $\mathcal{I}$; hence $\mathcal{R}$ splits $\mathcal{I}$ into disjoint equivalence classes. Next, we define a notation for sorting random variables in increasing order of an index.

Definition 7.2. Let $X = (X_1, \cdots, X_n)$ be a random vector where $X_i \in \mathcal{X}^+$ for all $i$, and let $I \in \mathcal{I}$ be an index. We say $\sigma \cdot X = (X_{\sigma(1)}, \cdots, X_{\sigma(n)})$ is a valid permutation of $X$ with respect to $I$ if $I(X_{\sigma(1)}) \le \cdots \le I(X_{\sigma(n)})$. We denote the set of valid permutations by $\mathcal{P}_I(X)$.

In the case that $I(X_1), \cdots, I(X_n)$ take distinct values, $\mathcal{P}_I(X)$ includes only one element. If $I_1$ is equivalent to $I_2$, then $\mathcal{P}_{I_1}(X) = \mathcal{P}_{I_2}(X)$ for any random vector $X$ with components in $\mathcal{X}^+$.

Definition 7.3. An index $I$ is optimal for the cost function $c$ if for any $n \ge 2$ and any random vector $X = (X_1, \cdots, X_n)$ with components in $\mathcal{X}^+$, $\inf_s \mathbb{E}[C(s, \sigma \cdot X)] \le \inf_s \mathbb{E}[C(s, X)]$ for all $\sigma \in \mathcal{P}_I(X)$.

Thus, by the above remark, if one index of a class is optimal, all equivalent indices are also optimal; hence optimality is a class property. We already observed that $I^*_g$ is optimal for the case $n = 2$. The following proposition provides a result for general $n$.

Proposition 7.1. If there exists an optimal index for the cost function $c$, it is equivalent to $I^*_g$.

Proof. Assume by contradiction that there exists an index $J$ that is optimal but not equivalent to $I^*_g$. Hence, there exist random variables $X_1, X_2 \in \mathcal{X}^+$ such that $I^*_g(X_1) < I^*_g(X_2)$ but $J(X_1) \ge J(X_2)$.

Figure 7.3: Distribution of $X_1$ and $X_2$ for (a) Example 7.5 and (b) Example 7.6.

Note that $I^*_g(X_1) = \inf_{s \ge 0} \mathbb{E}[g(X_1 - s)]$ and $I^*_g(X_2) = \inf_{s \ge 0} \mathbb{E}[g(X_2 - s)]$.
Hence, $I^*_g(X_1) < I^*_g(X_2)$ implies that $\inf_{s \ge 0} \mathbb{E}[g(X_1 - s)] < \inf_{s \ge 0} \mathbb{E}[g(X_2 - s)]$. However, optimality of $J$ implies that $\inf_{s \ge 0} \mathbb{E}[g(X_1 - s)] \ge \inf_{s \ge 0} \mathbb{E}[g(X_2 - s)]$, which is a contradiction.

Note that $I^*_g$ reduces to $I^*_1$ and $I^*_2$ for the objective functions $c^\sigma_1$ and $c^\sigma_2$, respectively. We also notice that, contrary to widely believed conjectures in the literature that $I^*_2$ (the LVF rule) is an optimal index-based policy for the cost function $c^\sigma_1$, Proposition 7.1 states that variance can only be a candidate for $c^\sigma_2$. Note, however, that this proposition says nothing about the existence of an optimal index. In the following, we provide counterexamples showing that sequencing (and optimally scheduling) in increasing order of $I^*_1$ and $I^*_2$ is not optimal for $c^\sigma_1$ and $c^\sigma_2$, respectively.

Example 7.5. Let $X_1, X_2, X_3$ be independent random variables in $L^1$, and assume that $X_1 \sim U(0,1)$ and $X_2$ follows the distribution (see Figure 7.3)
$$F_{X_2}(x) = \begin{cases} 0, & \text{if } x \le 0 \\ 2x^2, & \text{if } 0 < x < 0.5 \\ 2(x - 0.5)^2 + 0.5, & \text{if } 0.5 \le x < 1 \\ 1, & \text{otherwise.} \end{cases}$$
Consider the objective function $c^\sigma_1$ with $\alpha = \beta = 1$; then $I^*_1$ reduces to $\mathbb{E}[|X - F_X^{-1}(\frac{1}{2})|]$. First we claim that $I^*_1(X_1) = I^*_1(X_2) = \frac{1}{4}$:
$$\mathbb{E}\big[|X_1 - F_{X_1}^{-1}(\tfrac{1}{2})|\big] = \mathbb{E}\big[|X_1 - \tfrac{1}{2}|\big] = \int_0^{1/2} (\tfrac{1}{2} - x)\,dx + \int_{1/2}^{1} (x - \tfrac{1}{2})\,dx = \int_0^{1/2} (\tfrac{1}{2} - x)\,dx + \int_0^{1/2} x\,dx = \tfrac{1}{4},$$
$$\mathbb{E}\big[|X_2 - F_{X_2}^{-1}(\tfrac{1}{2})|\big] = \mathbb{E}\big[|X_2 - \tfrac{1}{2}|\big] = \int_0^{1/2} (\tfrac{1}{2} - x)\,4x\,dx + \int_{1/2}^{1} (x - \tfrac{1}{2})(4x - 2)\,dx = \int_0^{1/2} (\tfrac{1}{2} - x)\,4x\,dx + \int_0^{1/2} 4x^2\,dx = \tfrac{1}{4}.$$
The distribution of $X_3$ can be arbitrary as long as $I^*_1(X_3) > \frac{1}{4}$, ensuring that it comes last. For $I^*_1$ to be the optimal index, exchanging the order of $X_1$ and $X_2$ should not affect the optimal value of $c^\sigma_1$. However, for the sequence $\sigma_1 \cdot X = (X_1, X_2, X_3)$, $\inf_{s \in \mathcal{S}} c^{\sigma_1}_1(s) \approx 0.3946$, while the sequence $\sigma_2 \cdot X = (X_2, X_1, X_3)$ yields $\inf_{s \in \mathcal{S}} c^{\sigma_2}_1(s) \approx 0.3872$. Thus, the index $I^*_1$ is not optimal for the cost function $c^\sigma_1$.

Example 7.6.
Consider objective function $c^\sigma_2$ and let $X_1 \sim \ln\mathcal{N}(1, 1)$ and $X_2 \sim \ln\mathcal{N}(\tfrac{1}{2}\ln(\tfrac{e}{e+1}), 2)$ be independent (see Figure 7.3). Note that $I^*_2(X_1) = I^*_2(X_2) = e^3(e - 1)$. The distribution of $X_3$ can be arbitrary as long as $I^*_2(X_3) > I^*_2(X_1) = I^*_2(X_2)$, to make sure that it comes last. In order for $I^*_2$ to be the optimal index, changing the order of $X_1$ and $X_2$ should not affect the optimal value of $c^\sigma_2$. However, for the sequence $\sigma_1 \cdot X = (X_1, X_2, X_3)$, $\inf_{s\in S} c^{\sigma_1}_2(s) \approx 94.158$, but the sequence $\sigma_2 \cdot X = (X_2, X_1, X_3)$ yields $\inf_{s\in S} c^{\sigma_2}_2(s) \approx 99.096$. Thus, the index $I^*_2$ is not optimal for cost function $c^\sigma_2$.

The above leads us to the following conclusion.

Theorem 7.1. There exists no index that yields the optimal sequence for cost functions $c^\sigma_1$ and $c^\sigma_2$.

Proof. Proposition 7.1 implies that $I^*_k$ is the only possible optimal index for cost function $c_k$, $k = 1, 2$. But Examples 7.5 and 7.6 show that these need not be optimal. This leads us to the conclusion that optimal indices may not exist, i.e., the problem is non-indexable.

Remark 7.1. It is worth mentioning that Proposition 7.1 still holds even if we restrict the space of random variables to a certain family. Therefore, although Theorem 7.1 states that the sequencing problem is not indexable in general, it does not preclude the possibility of indexability in a restricted space. Nevertheless, Proposition 7.1 ensures that one should not investigate indices other than $I^*_g$. Finding a family of distributions for which $I^*_g$ is an optimal index remains an open research problem. In particular, Example 7.6 shows that even if we restrict the space of random variables to the exponential family, the problem remains non-indexable. In fact, we are unable to conclude about indexability if we further restrict to the exponential distribution. Moreover, Theorem 7.1 does not exclude the possibility of the existence of non-index-based optimal policies.
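The claim in Example 7.6 that the two lognormal random variables share the same variance index can be verified with the standard closed-form variance of a lognormal, $\operatorname{Var}[\ln\mathcal{N}(\mu, \sigma^2)] = (e^{\sigma^2} - 1)e^{2\mu + \sigma^2}$. A minimal sketch (the parameter values are exactly those of Example 7.6):

```python
import math

def lognormal_var(mu, sigma2):
    # Variance of lnN(mu, sigma2): (e^{sigma2} - 1) * e^{2*mu + sigma2}
    return (math.exp(sigma2) - 1.0) * math.exp(2.0 * mu + sigma2)

v1 = lognormal_var(1.0, 1.0)                                    # X1 ~ lnN(1, 1)
v2 = lognormal_var(0.5 * math.log(math.e / (math.e + 1)), 2.0)  # X2 ~ lnN((1/2)ln(e/(e+1)), 2)
target = math.exp(3) * (math.e - 1.0)                           # e^3 (e - 1)
print(v1, v2, target)  # all three coincide (≈ 34.51)
```

Despite having equal variance indices, the two orders of $X_1$ and $X_2$ produce different optimal costs, which is the content of the counterexample.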
Bounds on the optimal cost

It is disappointing that, contrary to long-held conjectures in the literature, the sequencing problem is non-indexable in general. Nevertheless, $I^*_g$ can be used as a heuristic to order the random variables and obtain a suboptimal solution. We next provide lower and upper bounds on the optimal cost for the $c^\sigma_1$ and $c^\sigma_2$ objective functions. Note that the increasing order of $I^*_k$, $k = 1, 2$, minimizes the upper bound.

Theorem 7.2. For $k = 1, 2$, the optimal cost of objective function $c^\sigma_k$ can be bounded by
$$\sum_{i=1}^{n-1} I^*_k(X_{\sigma(i)}) \;\le\; \inf_{s\in S} c^\sigma_k(s) \;\le\; \sum_{i=1}^{n-1} (n - i)\, I^*_k(X_{\sigma(i)}). \tag{7.10}$$

Proof. We need two lemmas for the proof of the theorem; their proofs are relegated to the appendix. Lemma 7.1 proves a sub-additivity property of the index functions, while Lemma 7.2 is a technical lemma.

Lemma 7.1. Let $X_1, X_2 \in \mathcal{X}^+$ be independent. Then, for $k = 1, 2$,
$$I^*_k(X_1 + X_2) \le I^*_k(X_1) + I^*_k(X_2). \tag{7.11}$$

Lemma 7.2. Assume $g(0) = 0$.
(i) Let $X \in \mathcal{X}^+$. Then $\sup_{x\in\mathbb{R}} I^*_g(\max(x, X)) \le I^*_g(X)$.
(ii) Let $X_1, X_2 \in \mathcal{X}^+$ be independent. Then $\max(I^*_g(X_1), I^*_g(X_2)) \le I^*_g(X_1 + X_2)$.

Lemmas 7.1 and 7.2 can now be used to bound $I^*_k(E_{\sigma(j)})$:
$$I^*_k(E_{\sigma(j)}) = I^*_k\big(\max(s_j, E_{\sigma(j-1)}) + X_{\sigma(j)}\big) \tag{7.12}$$
$$\le I^*_k\big(\max(s_j, E_{\sigma(j-1)})\big) + I^*_k(X_{\sigma(j)}) \tag{7.13}$$
$$\le I^*_k(E_{\sigma(j-1)}) + I^*_k(X_{\sigma(j)}) \tag{7.14}$$
for $k = 1, 2$, where the first and second inequalities follow from Lemmas 7.1 and 7.2, respectively. Using the fact that $E_{\sigma(1)} = X_{\sigma(1)}$, one can write
$$I^*_k(E_{\sigma(j)}) \le \sum_{i=1}^{j} I^*_k(X_{\sigma(i)}). \tag{7.15}$$
By the lower bound in Lemma 7.2, $I^*_k(\max(s_j, E_{\sigma(j-1)}) + X_j) \ge I^*_k(X_j)$. Hence, $I^*_k(E_{\sigma(j)})$ can be bounded by
$$I^*_k(X_{\sigma(j)}) \le I^*_k(E_{\sigma(j)}) \le \sum_{i=1}^{j} I^*_k(X_{\sigma(i)}). \tag{7.16}$$
Now, to prove the upper bound, let $\tilde{s} = (\tilde{s}_2, \cdots, \tilde{s}_n)$ where $\tilde{s}_i = F^{-1}_{E_{\sigma(i-1)}}(\frac{\beta}{\alpha+\beta})$ for the case that $k = 1$ and $\tilde{s}_i = \mathbb{E}[E_{\sigma(i-1)}]$ for the case that $k = 2$.
Note that $\tilde{s}_i$ can be calculated recursively because $E_{\sigma(i-1)}$ is a function of $\tilde{s}_2$ through $\tilde{s}_{i-1}$. We have
$$\inf_s c^\sigma_k(s) \le c^\sigma_k(\tilde{s}) = \sum_{j=2}^{n} I^*_k(E_{\sigma(j-1)}) \le \sum_{j=1}^{n-1} \sum_{i=1}^{j} I^*_k(X_{\sigma(i)}) = \sum_{i=1}^{n-1} \sum_{j=i}^{n-1} I^*_k(X_{\sigma(i)}) = \sum_{i=1}^{n-1} (n - i)\, I^*_k(X_{\sigma(i)}).$$
To prove the lower bound, note that $\mathbb{E}[g_k(E_{\sigma(i-1)} - s_i)] \ge I^*_k(E_{\sigma(i-1)}) \ge I^*_k(X_{\sigma(i-1)})$, where $g_1(t) = \beta t^+ + \alpha(-t)^+$ and $g_2(t) = t^2$. Thus,
$$c^\sigma_k(s) = \sum_{i=2}^{n} \mathbb{E}[g_k(E_{\sigma(i-1)} - s_i)] \ge \sum_{i=2}^{n} I^*_k(X_{\sigma(i-1)}).$$

Remark 7.2. Note that the upper and lower bounds in (7.10) coincide when $n = 2$; this is the result we already expected from Example 7.4. For general $n$, sequencing in increasing order of $I^*_k$ minimizes the upper bound in (7.10).

7.4 Existence of Solution to the Scheduling Problem

In many problems, the sequence in which to schedule is given and only the appointment times are to be determined optimally. In this section, we assume that the sequence of $n$ random variables $X = (X_1, \cdots, X_n)$ is fixed and, without loss of generality (by possibly renaming jobs), drop the notation $\sigma$ for simplicity. We call this problem the scheduling problem. We propose sample average approximation (SAA) as an algorithm to find the optimal appointment times and prove that it is statistically consistent in the case that the objective function is convex (e.g., $c^\sigma_1$). This result is significant because the only assumption required for consistency of SAA is the existence of a schedule with finite cost. This assumption significantly relaxes the typical assumptions required for consistency of SAA in the literature (see, e.g., Theorem 5.4 of Shapiro et al. (2009)).

Existence of Solution

We first show that there exists a solution to the optimization problem in (7.3).

Theorem 7.3. (i) For any particular realization of $X$, $C(\cdot, X)$ is nonnegative and coercive. (ii) $c(\cdot)$ is nonnegative, coercive and lower semi-continuous.
Furthermore, if $c(s) < \infty$ for some $s \in S$, then there exists a solution to the optimization problem in (7.3) and the set of minimizers is compact.

The proof is relegated to the appendix. One of the essential conditions in Theorem 7.3 is that $c(s) < \infty$ for some $s \in S$. The question is how to check whether this condition is satisfied: should we explore the entire set $S$ in the hope of finding such an $s$? Let us examine this condition. First of all, it is easy to see that for $p \ge 1$ and $g(t) = |t|^p$, this condition is equivalent to $X_i \in L^p$ (i.e., $\mathbb{E}[|X_i|^p] < \infty$) for $i = 1, \cdots, n-1$. This is also true for some other variations where $g$ is a piecewise function of the form $|\cdot|^p$, such as $g_1$ and $g_3$ in Examples 7.1 and 7.3. Moreover, if $c(s) < \infty$ for some $s \in S$, it is finite for all $s \in S$, mainly because $L^p$ is a vector space. Therefore, in such cases, there is no need to explore the set $S$. However, for general $g$, the set $\{X \in \mathcal{X}^+ \mid \mathbb{E}[g(X)] < \infty\}$ may not be a vector space (see Birnbaum–Orlicz space, Birnbaum and Orlicz (1931)), and $c(s)$ may be infinite for some $s$. In that case, random exploration may yield $s \in S$ such that $c(s) < \infty$.

Sample Average Approximation

The next question is how to calculate the optimal appointment times. Theorem 7.3 assures that there exists an optimal schedule under a mild condition. However, calculating the expectation is very costly in our problem due to the convolution nature of the distribution of the service completion times. In fact, for a given schedule $s$, the distribution of $E_{\sigma(i)}$ is the convolution of the distributions of $\max(s_i, E_{\sigma(i-1)})$ and $X_i$. An alternative is to use sample average approximation (SAA) to approximate the optimization problem. SAA is a well-studied topic in stochastic programming (see, for example, Shapiro et al. (2009); Royset (2013)). In the following, we discuss SAA and provide a theoretical guarantee for convergence of the solution in the stochastic appointment scheduling problem. We assume:

Assumption 7.1. For any realization of $X$, $C(\cdot, X)$ is convex.
This assumption holds for the $l_1$-type objective function (see Proposition 7.2 in the appendix), which is widely considered in the literature. However, it does not hold for the case of the $l_2$-type objective (see Example 7.7 in the appendix). Let $(X^j)_{j=1}^{m}$ be an independently and identically distributed (i.i.d.) random sample of size $m$ for the durations $X$ and define
$$C_m(s) = \frac{1}{m} \sum_{j=1}^{m} C(s, X^j). \tag{7.17}$$
Instead of solving the optimization problem in (7.3), we solve
$$\inf_{s\in S} C_m(s). \tag{7.18}$$
Convexity and coercivity of $C(\cdot, X)$ imply convexity and coercivity of $C_m(\cdot)$. Therefore, there exists a solution to the optimization problem in (7.18). In addition, the Strong Law of Large Numbers implies that for each $s$, $C_m(s) \to c(s)$ a.s. as $m \to \infty$. Nevertheless, optimization over the set $S$ requires a stronger result to guarantee $\inf_{s\in S} C_m(s) \to \inf_{s\in S} c(s)$ a.s. as $m \to \infty$. Moreover, it would be useful to see whether the set of minimizers of the SAA also converges to the set of true minimizers in some sense. To reach that goal, we need the following definition of deviation for sets (see equation (7.4) in Shapiro et al. (2009)).

Definition 7.4. Let $(M, d)$ be a metric space and $A, B \subseteq M$. We define the distance of $a \in A$ from $B$ by
$$\mathrm{dist}(a, B) := \inf\{d(a, b) \mid b \in B\} \tag{7.19}$$
and the deviation of $A$ from $B$ by
$$\mathbb{D}(A, B) := \sup_{a\in A} \mathrm{dist}(a, B). \tag{7.20}$$
Note that $\mathbb{D}(A, B) = 0$ implies $A \subseteq \mathrm{cl}(B)$ (i.e., $A$ is a subset of the closure of $B$ with respect to $M$). The next theorem guarantees that SAA is a consistent estimator for the scheduling problem.

Theorem 7.4. Suppose Assumption 7.1 holds and $c(s) < \infty$ for some $s \in S$, and let $S^* = \arg\inf_{s\in S} c(s)$ and $S^*_m = \arg\inf_{s\in S} C_m(s)$. Then, $\inf_{s\in S} C_m(s) \to \inf_{s\in S} c(s)$ and $\mathbb{D}(S^*_m, S^*) \to 0$ a.s. as $m \to \infty$.

The proof is available in the appendix. Theorem 7.4 establishes the consistency of SAA as the number of samples tends to infinity. Let us now consider its bias. For any $s_0 \in S$, we can write $\inf_{s\in S} C_m(s) \le C_m(s_0)$.
By taking the expectation and then minimizing over $s_0$, we conclude that $\mathbb{E}[\inf_{s\in S} C_m(s)] \le \inf_{s\in S} \mathbb{E}[C_m(s)]$. Since the samples are i.i.d., $\mathbb{E}[C_m(s)] = c(s)$. Therefore, $\mathbb{E}[\inf_{s\in S} C_m(s)] \le \inf_{s\in S} c(s)$, which means SAA is negatively biased. Does this bias decrease as the number of samples increases? The answer is affirmative: Theorem 2 of Mak et al. (1999) proves that $\mathbb{E}[\inf_{s\in S} C_m(s)] \le \mathbb{E}[\inf_{s\in S} C_{m+1}(s)]$.

7.5 Numerical Results

It has become standard practice to evaluate performance on operating room data due to the immediate application of stochastic appointment scheduling in healthcare. Denton et al. (2007) used real surgery scheduling data collected at Fletcher Allen Health Care of New York. In this chapter, we consider a surgery scheduling dataset from Keck Hospital of USC. The dataset includes 38,000 surgeries performed in 25 operating rooms over the course of 3 years, with more than 800 different procedure types performed by 200 surgeons. Surgeries with the same procedure type performed by the same surgeon are assumed to be samples of the same distribution. Our numerical analysis is restricted to those distributions that have at least 30 samples. We use 30 samples because we observed that they are sufficient for a close enough SAA of the optimal solution; this is much fewer than the theoretically required number of samples given by Begen et al. (2012). In some practical scenarios, there are not enough samples to directly apply SAA. In such scenarios, similar cases can be aggregated based on the nature of the procedure type to build distributions with enough samples. In this chapter, we focus on the surgeon–procedure pairs that have enough samples. We first show that, given a sequence, the SAA-based optimization algorithm is fast enough for all practical purposes to find an approximate solution. To do so, we use the Powell method (Brent (2013)) to solve the SAA-based optimization problem numerically.
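As a concrete illustration of this pipeline, the sketch below minimizes the SAA objective $C_m$ of (7.17) for the $l_1$-type cost with $\alpha = \beta = 1$. The synthetic $U(0,1)$ durations are an assumption (the real experiments use Keck data), and a simple coordinate search over a grid stands in for the Powell method:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 2000, 3
X = rng.uniform(0.0, 1.0, size=(m, n))  # synthetic durations (stand-in for Keck data)

def c_m(s):
    """SAA objective C_m(s) for the l1-type cost with alpha = beta = 1:
    average over sample paths of sum_i |E_{i-1} - s_i|, where
    E_1 = X_1 and E_i = max(s_i, E_{i-1}) + X_i."""
    e = X[:, 0].copy()
    total = np.zeros(m)
    for i, s_i in enumerate(s):
        total += np.abs(e - s_i)
        e = np.maximum(s_i, e) + X[:, i + 1]
    return total.mean()

# Coordinate search over a grid: a simple stand-in for the Powell method.
s = np.zeros(n - 1)
grid = np.linspace(0.0, 3.0, 301)
for _ in range(5):
    for i in range(n - 1):
        costs = []
        for v in grid:
            s[i] = v
            costs.append(c_m(s))
        s[i] = grid[int(np.argmin(costs))]
print(s, c_m(s))  # approximate appointment times (s_2, s_3) and their SAA cost
```

Since $C_m$ is convex for the $l_1$-type cost (Proposition 7.2), coordinate-wise minimization decreases the objective monotonically; derivative-free methods such as Powell's follow the same principle with a smarter choice of search directions.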
The experiments are performed in Python on a 2015 MacBook Pro with a 2.7 GHz Intel Core i5 processor and 16 GB of 1867 MHz DDR3 memory. Figure 7.4 confirms that appointments for a given sequence of $n = 80$ jobs can be calculated in about 3 minutes. Moreover, we observed that changing the number of samples from 10 to 300 does not change the run time of the SAA-based optimization significantly.

Secondly, the bounds provided in Theorem 7.2 are evaluated. These bounds hold for general distributions and may be useful in worst-case scenarios. However, Figure 7.5 shows that the upper bound becomes loose as the number of jobs increases on the Keck dataset. The upper bound of Theorem 7.2 assumes complete delay propagation through the schedule, i.e., the potential cost of each job affects all future jobs equally. Although this situation might arise in the worst case, we have observed that it does not happen on the Keck dataset: the gaps between jobs prevent a delay from having full effect on subsequent jobs.

The non-indexability shown in Theorem 7.1 is for general distributions. One might wonder whether non-indexability is actually observed in practice. Using the Keck dataset, we verify that the optimal sequence is indeed different from the one given by heuristic policies (see Table 7.1). Newsvendor and LVF indexes are considered as heuristic policies for the $c^\sigma_1$ and $c^\sigma_2$ objective

Figure 7.4: SAA running time (in seconds) to find an approximately optimal schedule for a given sequence. 30 samples per job are used for SAA, though there is no appreciable difference even if 10x more samples are used.

Figure 7.5: Upper and lower bounds on the optimal cost for (a) $c^\sigma_1$ and (b) $c^\sigma_2$ cost functions.
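The bound evaluation reported in Figure 7.5 can be sketched as follows, using hypothetical per-job index values in place of estimates from surgery data. Per equation (7.10), the last job's index enters neither bound, and sequencing in increasing order of the index minimizes the upper bound:

```python
def theorem_7_2_bounds(indices):
    """Bounds of eq. (7.10) on inf_s c_k for a sequence whose per-job index
    values I*_k(X_{sigma(i)}), listed in scheduled order, are `indices`:
      sum_{i=1}^{n-1} I_i  <=  inf_s c_k(s)  <=  sum_{i=1}^{n-1} (n - i) * I_i."""
    n = len(indices)
    head = indices[:-1]  # the last job's index never enters either bound
    lower = sum(head)
    upper = sum((n - i) * v for i, v in enumerate(head, start=1))
    return lower, upper

# Hypothetical index values for three jobs, in two different orders.
lo_inc, up_inc = theorem_7_2_bounds([1.0, 2.0, 5.0])  # increasing order
lo_dec, up_dec = theorem_7_2_bounds([5.0, 2.0, 1.0])  # decreasing order
print(up_inc, up_dec)  # 4.0 vs 12.0: increasing order minimizes the upper bound
```

For $n = 2$ the two bounds coincide, consistent with Remark 7.2 and the first column of Table 7.1.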
As shown in the figure, the upper bound is quite loose on the USC Keck dataset. Table 7.1 provides numerical values for $n \le 6$ to compare the optimal sequence with the index-based heuristic policies.

functions, respectively, since they are the only possible candidates to return the optimal sequence (Proposition 7.1). The true optimal sequence is calculated by comparing all $n!$ choices. In operating room scheduling, the number of surgeries performed in a typical day hardly exceeds 6, which leaves the door open for exhaustive search to find the optimal sequence. However, other applications such as outpatient clinics involve a much larger number of jobs, and it may not be feasible to exhaustively search over all possible sequences.

The cost function $c^\sigma_1$ depends on the idle-time and delay per-unit costs $\alpha$ and $\beta$. Mansourifard et al. (2018) analyzed how the newsvendor index outperforms variance in different regimes of these parameters. In Figure 7.6, we evaluate the gap between the newsvendor index and the optimal sequence as $\alpha$ and $\beta$ change. The optimal sequence is obtained by exhaustive search over all $n!$ possible sequences. It can be seen that as the ratio $\alpha/\beta$ increases, the sub-optimality gap of the newsvendor index increases on the Keck dataset.

Figure 7.6: The optimality gap of the newsvendor index increases as the ratio $\alpha/\beta$ increases on the Keck dataset. The optimal sequence is found by exhaustive search over all $n!$ possible sequences. $\beta = 1$ is fixed and $\alpha$ changes from 0.1 to 100. The dashed lines show the cost for the sequence obtained by least newsvendor first; the cost of the optimal sequence is shown by solid lines.

Table 7.1: Non-optimality of least newsvendor first for $c^\sigma_1$ and least variance first for $c^\sigma_2$.
The optimal sequence found by exhaustive search is different from the sequence given by the heuristic index-based policies.

n (Number of jobs)           2       3       4       5       6
c^σ_1  lower bound        14.9    32.8    51.2    71.4    92.3
       optimal cost       14.9    36.6    51.6    79.0   105.3
       newsvendor cost    14.9    36.6    64.4    95.4   126.5
       upper bound        14.9    47.7    98.9   170.4   262.7
c^σ_2  lower bound       368.0   817.5  1534.3  2430.0  3562.3
       optimal cost      368.0  1036.3  1763.8  2760.1  3912.5
       variance cost     368.0  1081.7  1923.1  2853.7  3939.1
       upper bound       368.0  1185.5  2719.9  5149.8  8712.2

Appendices

7.A Supplementary Material

Proposition 7.2. Let $X$ be a fixed sequence of jobs and
$$C_1(\cdot, X) = \sum_{i=2}^{n} \left[\alpha(s_i - E_{i-1})^+ + \beta(E_{i-1} - s_i)^+\right]$$
as in Example 7.1. For any realization of $X$, $C_1(\cdot, X)$ is convex, and thus $c_1(\cdot) = \mathbb{E}[C_1(\cdot, X)]$ is also convex.

Proof. We proceed by writing $C_1(\cdot, X)$ as the maximum of $2^{n-1}$ affine functions. Since an affine function is convex, so is the maximum. To define these functions, we first split $\mathbb{R}^{n-1}$ into $2^{n-1}$ regions and then define an affine function in each region. These functions are then extended to the entire $\mathbb{R}^{n-1}$. The details are given in the following.

Fix a realization of $X$ and note that for any schedule $s = (s_2, \cdots, s_n) \in \mathbb{R}^{n-1}$, at most one of $(s_i - E_{i-1})^+$ and $(E_{i-1} - s_i)^+$ is positive for $i = 2, \cdots, n$. Let $b_i$ be a binary variable that indicates which of the two happens. More specifically, $b_i = 1$ if $s_i \ge E_{i-1}$, and $b_i = 0$ if $s_i < E_{i-1}$. These binary variables are used to split $\mathbb{R}^{n-1}$ into $2^{n-1}$ regions $R_{b_2,\cdots,b_n}$. More precisely, if $b_i = 1$, then $s_i \ge E_{i-1}$ denotes the range of $s_i$ in $R_{b_2,\cdots,b_n}$, and if $b_i = 0$, then $s_i < E_{i-1}$ determines its range. For example, for $n = 4$, region $R_{101}$ would be
$$R_{101} := \{(s_2, s_3, s_4) \in \mathbb{R}^3 \mid s_2 \ge E_1,\; s_3 < E_2,\; s_4 \ge E_3\}.$$
Corresponding to each region, one can define a function $\bar{f}_{b_2,\cdots,b_n} : R_{b_2,\cdots,b_n} \to \mathbb{R}$ consisting of the sum of $n-1$ terms associated with each $b_i$: if $b_i = 1$, the corresponding term is $\alpha(s_i - E_{i-1})$, and if $b_i = 0$, it is $\beta(E_{i-1} - s_i)$.
For example, for $n = 4$, $\bar{f}_{101}$ would be
$$\bar{f}_{101}(s) := \alpha(s_2 - E_1) + \beta(E_2 - s_3) + \alpha(s_4 - E_3) = \alpha(s_2 - X_1) + \beta(s_2 + X_2 - s_3) + \alpha(s_4 - s_2 - X_2). \tag{7.21}$$
Note that $C_1(s, X) = \bar{f}_{b_2,\cdots,b_n}(s)$ on $R_{b_2,\cdots,b_n}$. Moreover, restricting the domain of $\bar{f}_{b_2,\cdots,b_n}$ to $R_{b_2,\cdots,b_n}$ allowed us to write the last equality in (7.21), which can now be used for an affine extension to the entire $\mathbb{R}^{n-1}$. Let $f_{b_2,\cdots,b_n} : \mathbb{R}^{n-1} \to \mathbb{R}$ be such an extension. We claim that $C_1(s, X) = \max_{b_2,\cdots,b_n} f_{b_2,\cdots,b_n}(s)$ for all $s \in \mathbb{R}^{n-1}$, and thus $C_1(\cdot, X)$ is convex. To prove this claim, it suffices to show that on $R_{b_2,\cdots,b_n}$, $\bar{f}_{b_2,\cdots,b_n}(s) \ge f_{b'_2,\cdots,b'_n}(s)$ (because $C_1(s, X) = \bar{f}_{b_2,\cdots,b_n}(s)$ on $R_{b_2,\cdots,b_n}$). This is indeed true because if $b'_i \ne b_i$, the corresponding term in $f_{b'_2,\cdots,b'_n}(s)$ is negative. Finally, note that $c_1(\cdot) = \mathbb{E}[C_1(\cdot, X)]$ is also convex since expectation preserves convexity.

Proposition 7.2 shows that the $l_1$-type objective function is convex. The following example shows that this may not be true for the objective function $c_2$.

Example 7.7. Consider the special case of $n = 3$ and let $X_1, X_2 > 0$ be positive scalars (which can be seen as degenerate distributions). We show that the function $c_2(s_2, s_3) := (X_1 - s_2)^2 + (\max\{X_1, s_2\} + X_2 - s_3)^2$ is not convex. Let $t = 0.5$ and $s^1 = (X_1 - \gamma, X_1 + 10X_2)$, $s^2 = (X_1, X_1 + 10X_2)$, and $s^3 = (X_1 + \gamma, X_1 + 10X_2)$ for some $0 < \gamma < \min\{X_1, 6X_2\}$. Substituting these values, we observe that $c_2(s^2) = c_2(t s^1 + (1-t) s^3) > t\, c_2(s^1) + (1-t)\, c_2(s^3)$.

7.B Proof of Lemma 7.1

For $k = 2$ the statement is obvious.
For $k = 1$, using the fact that $(a + b)^+ \le a^+ + b^+$ for $a, b \in \mathbb{R}$, we have:
$$\begin{aligned}
I^*_1(X_1 + X_2) &= \inf_s \mathbb{E}\left[\alpha(s - X_1 - X_2)^+ + \beta(X_1 + X_2 - s)^+\right] \\
&\le \mathbb{E}\Big[\alpha\big(F^{-1}_{X_1}(\tfrac{\beta}{\alpha+\beta}) + F^{-1}_{X_2}(\tfrac{\beta}{\alpha+\beta}) - X_1 - X_2\big)^+ + \beta\big(X_1 + X_2 - F^{-1}_{X_1}(\tfrac{\beta}{\alpha+\beta}) - F^{-1}_{X_2}(\tfrac{\beta}{\alpha+\beta})\big)^+\Big] \\
&\le \mathbb{E}\Big[\alpha\big(F^{-1}_{X_1}(\tfrac{\beta}{\alpha+\beta}) - X_1\big)^+ + \beta\big(X_1 - F^{-1}_{X_1}(\tfrac{\beta}{\alpha+\beta})\big)^+\Big] + \mathbb{E}\Big[\alpha\big(F^{-1}_{X_2}(\tfrac{\beta}{\alpha+\beta}) - X_2\big)^+ + \beta\big(X_2 - F^{-1}_{X_2}(\tfrac{\beta}{\alpha+\beta})\big)^+\Big] \\
&= I^*_1(X_1) + I^*_1(X_2).
\end{aligned}$$

7.C Proof of Lemma 7.2

(i) We can write $g(t) = g_r(t) + g_l(t)$, where $g_r(t) = g(t^+)$ and $g_l(t) = g(-(-t)^+)$ capture $g$ for positive and negative values of $t$, respectively. Since $g$ is nonnegative, convex and $g(0) = 0$, we can conclude that $g_r$ is nondecreasing and $g_l$ is nonincreasing. Moreover, $I^*_g(X) = \inf_{s\ge0}\mathbb{E}[g(X - s)] = \inf_{s\in\mathbb{R}}\mathbb{E}[g(X - s)]$. Suppose $s^*$ is a minimizer for $I^*_g$ and let $\mathcal{X} = \{x \in \mathbb{R} : x \le s^*\}$. We prove the lemma for $x \in \mathcal{X}$ and $x \notin \mathcal{X}$ separately.

Let $x \in \mathcal{X}$. We can write:
$$\begin{aligned}
I^*_g(\max(x, X)) &= \inf_s \mathbb{E}[g(\max(x, X) - s)] \\
&= \inf_s \mathbb{E}[g_r(\max(x, X) - s) + g_l(\max(x, X) - s)] \\
&\le \inf_{s\ge x} \mathbb{E}[g_r(\max(x, X) - s) + g_l(\max(x, X) - s)] \\
&= \inf_{s\ge x} \mathbb{E}[g_r(X - s) + g_l(\max(x, X) - s)] \\
&\le \inf_{s\ge x} \mathbb{E}[g_r(X - s) + g_l(X - s)] \\
&= \inf_{s\ge x} \mathbb{E}[g(X - s)] = \mathbb{E}[g(X - s^*)] = I^*_g(X).
\end{aligned}$$
For the case that $x \notin \mathcal{X}$, we can write:
$$\begin{aligned}
I^*_g(\max(x, X)) &= \inf_s \mathbb{E}[g(\max(x, X) - s)] \\
&= \inf_s \mathbb{E}[g_r(\max(x, X) - s) + g_l(\max(x, X) - s)] \\
&\le \inf_{s<x} \mathbb{E}[g_r(\max(x, X) - s) + g_l(\max(x, X) - s)] \\
&= \inf_{s<x} \mathbb{E}[g_r(\max(x, X) - s)] = \mathbb{E}[g_r(\max(x, X) - x)] = \mathbb{E}[g_r(X - x)] \\
&\le \mathbb{E}[g_r(X - s^*)] \le \mathbb{E}[g_r(X - s^*) + g_l(X - s^*)] = \mathbb{E}[g(X - s^*)] = I^*_g(X).
\end{aligned}$$
(ii) Note that $I^*_g(X) = \inf_{s\ge0}\mathbb{E}[g(X - s)] = \inf_{s\in\mathbb{R}}\mathbb{E}[g(X - s)]$. To prove $\max(I^*_g(X_1), I^*_g(X_2)) \le I^*_g(X_1 + X_2)$, by symmetry, it suffices to prove $I^*_g(X_1) \le I^*_g(X_1 + X_2)$. Let $x_2 \ge 0$; we have:
$$I^*_g(X_1) = \inf_s \mathbb{E}[g(X_1 - s)] = \inf_s \mathbb{E}[g(X_1 + x_2 - s)] = \inf_s \mathbb{E}[g(X_1 + X_2 - s) \mid X_2 = x_2] = \inf_s \phi(s, x_2),$$
where $\phi(s, x_2) = \mathbb{E}[g(X_1 + X_2 - s) \mid X_2 = x_2]$. The above equality holds for any value of $x_2 \ge 0$. Hence, $I^*_g(X_1) \le \phi(s, X_2)$ for any $s \in \mathbb{R}$.
Therefore, $I^*_g(X_1) \le \mathbb{E}[\phi(s, X_2)] = \mathbb{E}[g(X_1 + X_2 - s)]$ by the smoothing property of conditional expectation. Thus, $I^*_g(X_1) \le \inf_s \mathbb{E}[g(X_1 + X_2 - s)] = I^*_g(X_1 + X_2)$.

7.D Proof of Theorem 7.3

(i) Since $g$ is nonnegative, it is obvious that $C(\cdot, X)$ is also nonnegative. To prove coercivity of $C(\cdot, X)$, let $(s^m)_{m\ge1} \subseteq \mathbb{R}^{n-1}$ be a sequence such that $\|s^m\| \to \infty$. We need to show that $C(s^m, X) \to \infty$ as $m \to \infty$. Let $j$ be the smallest integer such that $|s^m_j| \to \infty$. Note that for any particular realization of $X$, there exists $M \in \mathbb{R}$ such that $|E^m_{j-1}| \le M$ for all $m$, where $E^m_{j-1}$ denotes the finish time of job $j-1$ under schedule $s^m$. By the triangle inequality,
$$|E^m_{j-1} - s^m_j| \ge |s^m_j| - |E^m_{j-1}| \ge |s^m_j| - M \to \infty \tag{7.22}$$
as $m \to \infty$. Coercivity of $g$ implies that $g(E^m_{j-1} - s^m_j) \to \infty$ as $m \to \infty$. On the other hand, since $g$ is nonnegative, we can write $C(s^m, X) \ge g(E^m_{j-1} - s^m_j)$ for all $m$. Hence, $C(s^m, X) \to \infty$ as $m \to \infty$.

(ii) Clearly, $c(\cdot)$ is nonnegative. To prove coercivity, let $(s^m)_{m\ge1}$ be as defined in the previous part; by Fatou's lemma and coercivity of $C(\cdot, X)$ we have:
$$\liminf_m c(s^m) = \liminf_m \mathbb{E}[C(s^m, X)] \ge \mathbb{E}[\liminf_m C(s^m, X)] = \infty.$$
To prove lower semi-continuity, let $(s^k)_{k\ge1} \subseteq \mathbb{R}^{n-1}$ be a sequence converging to $s \in \mathbb{R}^{n-1}$. By Fatou's lemma we can write
$$\liminf_k c(s^k) = \liminf_k \mathbb{E}[C(s^k, X)] \ge \mathbb{E}[\liminf_k C(s^k, X)] \ge \mathbb{E}[C(s, X)] = c(s).$$
Since $c$ is coercive and $c(s) < \infty$ for some $s \in S$, without loss of generality we can assume that the minimization is over a compact set. Moreover, $c(s)$ is lower semi-continuous. Thus, the set of minimizers is nonempty and compact.

7.E Proof of Theorem 7.4

Define the extended real-valued functions
$$\bar{C}_m(s) = C_m(s) + I_S(s), \qquad \bar{c}(s) = c(s) + I_S(s), \qquad \text{where } I_S(s) = \begin{cases} 0, & \text{if } s \in S \\ +\infty, & \text{otherwise.} \end{cases}$$
Note that $\bar{C}_m$ and $\bar{c}$ are nonnegative, convex and lower semicontinuous because $C_m$ and $c$ are lower semicontinuous and $S$ is closed and convex.
By Theorem 2.3 of Artstein and Wets (1994) (see Appendix B), $\bar{C}_m(\cdot)$ epi-converges to $\bar{c}(\cdot)$ (denoted by $\bar{C}_m(\cdot) \xrightarrow{e} \bar{c}(\cdot)$) for a.e. $\omega \in \Omega$. Note that $S^* = \arg\inf_{s\in S} c(s) = \arg\inf_{s\in\mathbb{R}^{n-1}} \bar{c}(s)$ and $S^*_m = \arg\inf_{s\in S} C_m(s) = \arg\inf_{s\in\mathbb{R}^{n-1}} \bar{C}_m(s)$. Since $c(s) < \infty$ for some $s \in S$, by Theorem 7.3 we know that $S^*$ is nonempty and compact. Let $K$ be a compact subset of $\mathbb{R}^{n-1}$ such that $S^*$ lies in the interior of $K$, and let $\hat{S}^*_m = \arg\inf_{s\in K} \bar{C}_m(s)$.

We first show that for a.e. $\omega \in \Omega$, $\hat{S}^*_m$ is nonempty for large enough $m$. Let $s^* \in S^*$ and consider $\omega \in \Omega$ for which $\bar{C}_m(\cdot) \xrightarrow{e} \bar{c}(\cdot)$. By the definition of epi-convergence, $\limsup_m \bar{C}_m(s^m) \le \bar{c}(s^*)$ for some $s^m \to s^*$. Therefore, there exists $M \ge 1$ such that for $m \ge M$, $\bar{C}_m(s^m) \le \bar{c}(s^*) + 1 < \infty$. Moreover, it follows from $s^m \to s^*$ that for large enough $m$, $s^m$ lies in the interior of $K$. Since $\bar{C}_m(\cdot)$ is convex and lower semicontinuous and $K$ is compact, $\hat{S}^*_m$ is nonempty a.s. (see Appendix B for Proposition 2.3.2 of Bertsekas et al. (2003)).

Now, let us show that $\mathbb{D}(\hat{S}^*_m, S^*) \to 0$ a.s. Consider $\omega \in \Omega$ for which $\bar{C}_m(\cdot) \xrightarrow{e} \bar{c}(\cdot)$. We claim that for such $\omega$, $\mathbb{D}(\hat{S}^*_m, S^*) \to 0$. Assume by contradiction that $\mathbb{D}(\hat{S}^*_m, S^*) \not\to 0$. Then there exist $\epsilon > 0$ and $y^m \in \hat{S}^*_m$ (for large enough $m$) such that $\mathrm{dist}(y^m, S^*) \ge \epsilon$. Let $y^{m_l} \to y$ be a convergent subsequence of $(y^m)_{m\ge1}$; such a subsequence exists because $K$ is compact. It follows from $\mathrm{dist}(y^m, S^*) \ge \epsilon$ that $y \notin S^*$. On the other hand, Proposition 7.26 of Shapiro et al. (2009) (see Appendix B) implies that $y \in \arg\inf_{s\in K} \bar{c}(s) = S^*$, which is a contradiction.

Note that $S^*$ is in the interior of $K$. It follows from $\mathbb{D}(\hat{S}^*_m, S^*) \to 0$ that for large enough $m$, $\hat{S}^*_m$ lies in the interior of $K$. Hence, $\hat{S}^*_m$ is a set of local minimizers, and convexity of $\bar{C}_m(\cdot)$ implies that they are global minimizers, i.e., $\hat{S}^*_m = S^*_m$. Therefore, $\mathbb{D}(S^*_m, S^*) \to 0$ a.s. as $m \to \infty$. It remains to prove that $\inf_{s\in S} C_m(s) \to \inf_{s\in S} c(s)$ a.s.
Fix $\omega \in \Omega$ for which $\bar{C}_m(\cdot) \xrightarrow{e} \bar{c}(\cdot)$ and let $s^*_m \in S^*_m$ be a convergent sequence. Such a sequence exists because for large enough $m$, $S^*_m$ falls inside the compact set $K$. Then, by Proposition 7.26 of Shapiro et al. (2009) (see below), $\inf \bar{C}_m(s) \to \inf \bar{c}(s)$, or equivalently, $\inf_{s\in S} C_m(s) \to \inf_{s\in S} c(s)$.

7.F Useful Theorems and Propositions

Theorem 7.5 (Theorem 2.3 of Artstein and Wets (1994)). Let $F : S \times \Xi \to (-\infty, \infty]$ be a measurable function and $P(d\xi)$ be a probability measure over the space $\Xi$ of random elements. We assume that $S$ is a metric space. Define $f(s) := \mathbb{E}[F(s, \xi)] = \int F(s, \xi)\, P(d\xi)$ and let $\xi^1, \cdots, \xi^m$ be independent samples of $\Xi$ drawn according to $P$. Suppose (1) $F(\cdot, \xi)$ is lower semicontinuous for fixed $\xi \in \Xi$, and (2) for each $s_0 \in S$ there exist an open set $N_0 \subseteq S$ and an integrable function $g_0 : \Xi \to (-\infty, \infty)$ such that the inequality $F(s, \xi) \ge g_0(\xi)$ holds for all $s \in N_0$. Then $\frac{1}{m}\sum_{j=1}^{m} F(\cdot, \xi^j)$ almost surely epi-converges to $f(\cdot)$.

Proposition 7.3 (Proposition 7.26 of Shapiro et al. (2009)). Let $f_m, f : S \to (-\infty, \infty]$ where $S \subseteq \mathbb{R}^n$. Suppose that $f_m(\cdot)$ epi-converges to $f(\cdot)$. Then
$$\limsup_m \left[\inf_s f_m(s)\right] \le \inf_s f(s).$$
Suppose further that for some $\epsilon_m \downarrow 0$ there exists an $\epsilon_m$-minimizer $s^m$ of $f_m(\cdot)$ such that the sequence $s^m$ converges to a point $\bar{s}$. Then $\bar{s} \in \arg\min f$ and $\lim_{m\to\infty} [\inf_s f_m(s)] = \inf_s f(s)$.

Proposition 7.4 (Proposition 2.3.2 of Bertsekas et al. (2003)). Let $S$ be a closed convex subset of $\mathbb{R}^n$, and let $f : \mathbb{R}^n \to (-\infty, \infty]$ be a closed convex function such that $f(s) < \infty$ for some $s \in S$. The set of minimizing points of $f$ over $S$ is nonempty and compact if and only if $S$ and $f$ have no common nonzero direction of recession.

Chapter 8

Concluding Remarks and Future Directions

In this dissertation, we studied efficient exploration for online reinforcement learning in various settings. We conclude the dissertation in this chapter by providing a summary of the results and possible future directions.
In Chapters 2 and 3, the first model-free algorithms were proposed for infinite-horizon average-reward weakly communicating MDPs. In Chapter 2, the algorithm reduces the problem to the discounted version and balances the trade-off between the performance of the discounted algorithm and the approximation error. In Chapter 3, the first attempts are made to improve this bound and match the information-theoretic lower bound of $\Omega(\sqrt{T})$. The proposed EE-QL algorithm in that chapter achieves this goal, yet with two additional assumptions: the availability of an estimate of $J^*$ (the gain of the optimal policy) and boundedness of the estimated $Q$ values. These assumptions are verified numerically, and the algorithm has excellent numerical performance, significantly better than the existing model-free algorithms and similar to the best model-based algorithms, yet with less memory. The key to obtaining such numerical performance is to avoid optimistic estimation of each entry of the $Q$ function; instead, EE-QL uses optimism for a single scalar $J^*$. The main open question is how to prove these assumptions and achieve the information-theoretically optimal regret bound via a model-free algorithm, if that is possible at all. We believe that the techniques developed in these chapters would be useful in answering this question.

Chapter 4 addressed the problem of efficient exploration in Stochastic Shortest Path (SSP) problems. We proposed the first posterior sampling-based reinforcement learning algorithm for SSP models with unknown transition probabilities. The algorithm is very simple compared to the optimism-based algorithms proposed for SSP models recently (Tarbouriech et al., 2020; Rosenberg et al., 2020; Cohen et al., 2021; Tarbouriech et al., 2021b). It achieves a Bayesian regret bound of $\tilde{O}(B_\star S\sqrt{AK})$, where $B_\star$
is an upper bound on the expected cost of the optimal policy, $S$ is the size of the state space, $A$ is the size of the action space, and $K$ is the number of episodes. This has a $\sqrt{S}$ gap with the best known bound for an optimism-based algorithm, but numerical experiments suggest better performance in practice. The $\sqrt{S}$ gap has been observed in posterior-sampling algorithms for the finite-horizon (Osband et al., 2013) and infinite-horizon average-reward (Ouyang et al., 2017b) settings as well. An important open question is whether it is possible to close the $\sqrt{S}$ gap of posterior sampling-based algorithms.

In Chapter 5, we presented one of the first online reinforcement learning algorithms for POMDPs. Solving POMDPs is a hard problem; designing an efficient learning algorithm that achieves sub-linear regret is even harder. We showed that the proposed PSRL-POMDP algorithm achieves a Bayesian regret bound of $O(\log T)$ when the parameter set is finite. When the parameter set is uncountable, we showed a $\tilde{O}(T^{2/3})$ regret bound under two technical assumptions on the belief state approximation and transition kernel estimation. We also assume that the observation kernel is known; without it, it is very challenging to design online learning algorithms for POMDPs. This result is just the first step in designing online RL algorithms for POMDPs, and there are several possible future directions. First, our results rely on some technical assumptions that should ideally be relaxed. Second, proving a lower bound can shed light on the limits of the problem. Third, evaluating the performance of the algorithm in practice would also be useful. One impediment is that, to the best of our knowledge, there is no solver for known POMDPs in the infinite-horizon average-reward setting; developing such a solver would pave the way for implementing PSRL-POMDP in practice.

Chapter 6 addressed online RL in infinite-horizon zero-sum stochastic games with the average-reward criterion.
We proposed PSRL-ZSG, a posterior sampling algorithm that achieves a Bayesian regret bound of $\tilde{O}(HS\sqrt{AT})$ in this setting. No structure is imposed on the opponent's strategy. The best existing result achieves a high-probability regret bound of $\tilde{O}(DS\sqrt{AT})$ only under a strong ergodicity assumption (Wei et al., 2017). PSRL-ZSG relaxes that assumption and improves the previous best known high-probability regret bound of $\tilde{O}(\sqrt[3]{DS^2AT^2})$ obtained by the UCSG algorithm (Wei et al., 2017) under the same finite-diameter assumption. This bound is order-optimal in terms of $A$ and $T$. The framework and analysis developed in this chapter might be useful for designing regret-optimal algorithms based on the optimism-in-the-face-of-uncertainty principle.

In this dissertation, posterior sampling is the general framework used in designing algorithms for SSPs, POMDPs, and stochastic games. The simplicity of the algorithms, their similarity across various settings, and superb numerical performance compared to their optimism-based competitors are important factors for practitioners. However, all of the posterior sampling algorithms designed in this dissertation are for the tabular setting, where the state and action spaces are finite. A natural next step would be to extend these algorithms to continuous state and action spaces, and to propose model-free algorithms for such settings. Designing posterior sampling-based model-free algorithms, even for the tabular setting, remains an open problem.

The proposed algorithms follow a common framework. First, the algorithms proceed in epochs. At the beginning of an epoch, the agent samples the parameter from the posterior distribution and obtains a policy by solving the Bellman equation with respect to the sampled parameter. This policy is then followed during the epoch. The main component that distinguishes these algorithms is the design of the epochs. More precisely, the epochs in all of these algorithms are determined based on two criteria.
The first criterion triggers when the (pseudo) number of visits to a state-action pair doubles. The second criterion varies depending on the setting.

The analysis of these algorithms also follows a common theme. First, the key property of posterior sampling (namely, that the sample and the true parameter have the same distribution conditioned on the information available at the time of sampling) is used to connect quantities that depend on the sampled parameter to those that depend on the true parameter. Second, the Bellman equation plays a crucial role in relating the instantaneous costs to the value functions. The main part that distinguishes the analyses of these algorithms is proving that the value function of the sampled parameter concentrates around that of the true parameter.

In Chapter 7, we studied a different topic: the optimal stochastic appointment scheduling problem. Each job potentially has a different service time distribution, and the objective is to minimize the expectation of a function of idle time and start-time delay. There are two sub-problems. (i) The sequencing problem: finding the optimal sequence in which to schedule the jobs; we show that this problem is non-indexable in general. (ii) The scheduling problem: finding the optimal appointment times given a sequence or order of jobs; we show that a solution to the scheduling problem exists and, moreover, that the l1-type objective function is convex. Further, we give a sample average approximation-based algorithm that yields an approximately optimal solution and is asymptotically consistent. It has been an open problem for many years to find the index that yields the optimal sequence of jobs. Following the work of Weiss (1990), who showed that Least Variance First (LVF) is optimal in two cases with specific distributions, it had been conjectured that the problem is indexable and that LVF may be optimal for the general problem with the l1-type objective.
In fact, several simulation studies and approximation algorithms are based on such policies. In this chapter, we settled the open question of the optimal index-type policy: the problem is non-indexable in general, and no such index exists. Indeed, we showed that if the problem were indexable, then a 'Newsvendor index' would be optimal for the l1 cost objective and a variance index would be optimal for the l2 objective, and we gave the form of an index I*_g that would be optimal for a generalized cost function g. But we provided counterexamples showing that an optimal index-based policy does not exist for some problems. It is quite possible that the problem is indexable for specific distribution classes; that remains an open research question.

List of Publications

Mehdi Jafarnia-Jahromi and Rahul Jain. Non-indexability of the stochastic appointment scheduling problem. Automatica, 118:109016, 2020.
Mehdi Jafarnia-Jahromi, Rahul Jain, and Ashutosh Nayyar. Learning zero-sum stochastic games via posterior sampling. To be submitted.
Mehdi Jafarnia-Jahromi, Chen-Yu Wei, Rahul Jain, and Haipeng Luo. A model-free learning algorithm for infinite-horizon average-reward mdps with near-optimal regret. arXiv preprint arXiv:2006.04354, 2020.
Mehdi Jafarnia-Jahromi, Liyu Chen, Rahul Jain, and Haipeng Luo. Online learning for stochastic shortest path model via posterior sampling. arXiv preprint arXiv:2106.05335, 2021a.
Mehdi Jafarnia-Jahromi, Rahul Jain, and Ashutosh Nayyar. Online learning for unknown partially observable mdps. arXiv preprint arXiv:2102.12661, 2021b.
Chen-Yu Wei, Mehdi Jafarnia-Jahromi, Haipeng Luo, Hiteshi Sharma, and Rahul Jain. Model-free reinforcement learning in infinite-horizon average-reward markov decision processes. In International Conference on Machine Learning, pages 10170–10180. PMLR, 2020.

Bibliography

Yasin Abbasi-Yadkori, Peter Bartlett, Kush Bhatia, Nevena Lazic, Csaba Szepesvari, and Gellért Weisz.
Politex: Regret bounds for policy iteration using expert prediction. In International Conference on Machine Learning, pages 3692–3702, 2019a.
Yasin Abbasi-Yadkori, Nevena Lazic, Csaba Szepesvari, and Gellért Weisz. Exploration-enhanced politex. arXiv preprint arXiv:1908.10479, 2019b.
Marc Abeille and Alessandro Lazaric. Thompson sampling for linear-quadratic control problems. In Artificial Intelligence and Statistics, pages 1246–1254. PMLR, 2017.
Jinane Abounadi, D Bertsekas, and Vivek S Borkar. Learning algorithms for markov decision processes with average cost. SIAM Journal on Control and Optimization, 40(3):681–698, 2001.
Shipra Agrawal and Navin Goyal. Analysis of thompson sampling for the multi-armed bandit problem. In Conference on Learning Theory, pages 39–1. JMLR Workshop and Conference Proceedings, 2012.
Shipra Agrawal and Navin Goyal. Thompson sampling for contextual bandits with linear payoffs. In International Conference on Machine Learning, pages 127–135. PMLR, 2013.
Shipra Agrawal and Randy Jia. Optimistic posterior sampling for reinforcement learning: worst-case regret bounds. In Advances in Neural Information Processing Systems, pages 1184–1194, 2017.
Amir Ahmadi-Javid, Zahra Jalali, and Kenneth J Klassen. Outpatient appointment systems in healthcare: A review of optimization studies. European Journal of Operational Research, 258(1):3–34, 2017.
Zvi Artstein and Roger J-B Wets. Consistency of minimizers and the SLLN for stochastic programs. IBM Thomas J. Watson Research Division, 1994.
Peter Auer and Ronald Ortner. Logarithmic online regret bounds for undiscounted reinforcement learning. In Advances in Neural Information Processing Systems, pages 49–56, 2007.
Alex Ayoub, Zeyu Jia, Csaba Szepesvari, Mengdi Wang, and Lin Yang. Model-based reinforcement learning with value-targeted regression. In International Conference on Machine Learning, pages 463–474. PMLR, 2020.
Mohammad Gheshlaghi Azar, Ian Osband, and Rémi Munos.
Minimax regret bounds for reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 263–272. JMLR.org, 2017.
Kamyar Azizzadenesheli, Alessandro Lazaric, and Animashree Anandkumar. Experimental results: Reinforcement learning of pomdps using spectral methods. arXiv preprint arXiv:1705.02553, 2017.
Kamyar Azizzadenesheli, Yisong Yue, and Animashree Anandkumar. Policy gradient in partially observable environments: Approximation and convergence. arXiv e-prints, pages arXiv–1810, 2018.
Yu Bai and Chi Jin. Provable self-play algorithms for competitive reinforcement learning. In International Conference on Machine Learning, pages 551–560. PMLR, 2020.
Yu Bai, Chi Jin, and Tiancheng Yu. Near-optimal reinforcement learning with self-play. In Advances in Neural Information Processing Systems, pages 2159–2170, 2020.
Norman TJ Bailey. A study of queues and appointment systems in hospital out-patient departments, with special reference to waiting-times. Journal of the Royal Statistical Society. Series B (Methodological), pages 185–199, 1952.
Kenneth R Baker. Introduction to sequencing and scheduling. John Wiley & Sons, 1974.
Kenneth R Baker. Minimizing earliness and tardiness costs in stochastic scheduling. European Journal of Operational Research, 236(2):445–452, 2014.
Dragan Banjević and Michael Jong Kim. Thompson sampling for stochastic control: The continuous parameter case. IEEE Transactions on Automatic Control, 64(10):4137–4152, 2019.
Peter L Bartlett and Ambuj Tewari. Regal: A regularization based algorithm for reinforcement learning in weakly communicating mdps. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pages 35–42. AUAI Press, 2009.
Mehmet A Begen and Maurice Queyranne. Appointment scheduling with discrete random durations. Mathematics of Operations Research, 36(2):240–257, 2011.
Mehmet A Begen, Retsef Levi, and Maurice Queyranne.
A sampling-based approach to appointment scheduling. Operations Research, 60(3):675–681, 2012.
Bjorn P Berg, Brian T Denton, S Ayca Erdogan, Thomas Rohleder, and Todd Huschka. Optimal booking and scheduling in outpatient procedure centers. Computers & Operations Research, 50:24–37, 2014.
Dimitri P Bertsekas. Dynamic programming and optimal control, vol. I and II, 4th edition. Belmont, MA: Athena Scientific, 2017.
Dimitri P Bertsekas and John N Tsitsiklis. An analysis of stochastic shortest path problems. Mathematics of Operations Research, 16(3):580–595, 1991.
D.P. Bertsekas, A. Nedić, and A.E. Ozdaglar. Convex Analysis and Optimization. Athena Scientific optimization and computation series. Athena Scientific, 2003. ISBN 9781886529458. URL https://books.google.com/books?id=DaOFQgAACAAJ.
Z Birnbaum and W-f Orlicz. Über die verallgemeinerung des begriffes der zueinander konjugierten potenzen. Studia Mathematica, 3(1):1–67, 1931.
Ronen I Brafman and Moshe Tennenholtz. R-max-a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3(Oct):213–231, 2002.
Richard P Brent. Algorithms for minimization without derivatives. Courier Corporation, 2013.
Chenghui Cai, Xuejun Liao, and Lawrence Carin. Learning to explore and exploit in pomdps. Advances in Neural Information Processing Systems, 22:198–206, 2009.
Brecht Cardoen, Erik Demeulemeester, and Jeroen Beliën. Operating room planning and scheduling: A literature review. European Journal of Operational Research, 201(3):921–932, 2010.
Tugba Cayirli and Emre Veral. Outpatient scheduling in health care: a review of literature. Production and Operations Management, 12(4):519–549, 2003.
Olivier Chapelle and Lihong Li. An empirical evaluation of thompson sampling. Advances in Neural Information Processing Systems, 24:2249–2257, 2011.
Liyu Chen and Haipeng Luo.
Finding the stochastic shortest path with low regret: The adversarial cost and unknown transition case. arXiv preprint arXiv:2102.05284, 2021.
Liyu Chen, Haipeng Luo, and Chen-Yu Wei. Minimax regret for stochastic shortest path with adversarial costs and known transition. arXiv preprint arXiv:2012.04053, 2020.
Liyu Chen, Mehdi Jafarnia-Jahromi, Rahul Jain, and Haipeng Luo. Implicit finite-horizon approximation and efficient optimal algorithms for stochastic shortest path. arXiv preprint arXiv:2106.08377, 2021a.
Rachel R Chen and Lawrence W Robinson. Sequencing and scheduling appointments with potential call-in patients. Production and Operations Management, 23(9):1522–1538, 2014.
Zixiang Chen, Dongruo Zhou, and Quanquan Gu. Almost optimal algorithms for two-player markov games with linear function approximation. arXiv preprint arXiv:2102.07404, 2021b.
Sangdo Sam Choi and Amarnath Andy Banerjee. Comparison of a branch-and-bound heuristic, a newsvendor-based heuristic and periodic bailey rules for outpatients appointment scheduling systems. Journal of the Operational Research Society, 67(4):576–592, 2016.
Alon Cohen, Tomer Koren, and Yishay Mansour. Learning linear-quadratic regulators efficiently with only √t regret. In International Conference on Machine Learning, pages 1300–1309. PMLR, 2019.
Alon Cohen, Yonathan Efroni, Yishay Mansour, and Aviv Rosenberg. Minimax regret for stochastic shortest path. arXiv preprint arXiv:2103.13056, 2021.
Richard Walter Conway, William L Maxwell, and Louis W Miller. Theory of scheduling. Courier Corporation, 2003.
Jacob W Crandall and Michael A Goodrich. Learning to compete, compromise, and cooperate in repeated general-sum games. In Proceedings of the 22nd International Conference on Machine Learning, pages 161–168, 2005.
Christoph Dann and Emma Brunskill. Sample complexity of episodic fixed-horizon reinforcement learning. In Advances in Neural Information Processing Systems, pages 2818–2826, 2015.
Sarah Dean, Horia Mania, Nikolai Matni, Benjamin Recht, and Stephen Tu. Regret bounds for robust adaptive control of the linear quadratic regulator. arXiv preprint arXiv:1805.09388, 2018.
Brian Denton and Diwakar Gupta. A sequential bounding approach for optimal appointment scheduling. IIE Transactions, 35(11):1003–1016, 2003.
Brian Denton, James Viapiano, and Andrea Vogl. Optimization of surgery sequencing and scheduling decisions under uncertainty. Health Care Management Science, 10(1):13–24, 2007.
Anthony DiGiovanni and Ambuj Tewari. Thompson sampling for markov games with piecewise stationary opponent policies. In Proceedings of the 37th Annual Conference on Uncertainty in Artificial Intelligence, 2021.
Kefan Dong, Yuanhao Wang, Xiaoyu Chen, and Liwei Wang. Q-learning with ucb exploration is sample efficient for infinite-horizon mdp. arXiv preprint arXiv:1901.09311, 2019.
Kefan Dong, Jian Peng, Yining Wang, and Yuan Zhou. Root-n-regret for learning in markov decision processes with function approximation and low bellman rank. In Conference on Learning Theory, pages 1554–1557. PMLR, 2020.
Finale Doshi-Velez, David Pfau, Frank Wood, and Nicholas Roy. Bayesian nonparametric methods for partially-observable reinforcement learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(2):394–407, 2013.
S Ayca Erdogan and Brian Denton. Dynamic appointment scheduling of a stochastic server with uncertain demand. INFORMS Journal on Computing, 25(1):116–132, 2013.
S Ayca Erdogan, Alexander Gose, and Brian T Denton. Online appointment sequencing and scheduling. IIE Transactions, 47(11):1267–1286, 2015.
Awi Federgruen. On n-person stochastic games by denumerable state space. Advances in Applied Probability, 10(2):452–471, 1978.
Sarah Filippi, Olivier Cappé, and Aurélien Garivier. Optimism in reinforcement learning and kullback-leibler divergence.
In 2010 48th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 115–122. IEEE, 2010.
Raphaël Fonteneau, Nathan Korda, and Rémi Munos. An optimistic posterior sampling strategy for bayesian reinforcement learning. In NIPS 2013 Workshop on Bayesian Optimization (BayesOpt2013), 2013.
Ronan Fruit, Matteo Pirotta, and Alessandro Lazaric. Near optimal exploration-exploitation in non-communicating markov decision processes. In Advances in Neural Information Processing Systems, pages 2994–3004, 2018a.
Ronan Fruit, Matteo Pirotta, Alessandro Lazaric, and Ronald Ortner. Efficient bias-span-constrained exploration-exploitation in reinforcement learning. In International Conference on Machine Learning, pages 1573–1581, 2018b.
Ronan Fruit, Matteo Pirotta, and Alessandro Lazaric. Improved analysis of ucrl2b, 2019. Available at rlgammazero.github.io/docs/ucrl2b_improved.pdf.
Aditya Gopalan and Shie Mannor. Thompson sampling for learning parameterized markov decision processes. In Conference on Learning Theory, pages 861–898, 2015.
Steffen Grunewalder, Guy Lever, Luca Baldassarre, Massi Pontil, and Arthur Gretton. Modelling transition dynamics in mdps with rkhs embeddings. arXiv preprint arXiv:1206.4655, 2012.
Harish Guda, Milind Dawande, Ganesh Janakiraman, and Kyung Sung Jung. Optimal policy for a stochastic scheduling problem with applications to surgical scheduling. Production and Operations Management, 25(7):1194–1202, 2016.
Diwakar Gupta. Surgical suites’ operations management. Production and Operations Management, 16(6):689–700, 2007.
Diwakar Gupta and Brian Denton. Appointment scheduling in health care: Challenges and opportunities. IIE Transactions, 40(9):800–819, 2008.
Thomas Dueholm Hansen, Peter Bro Miltersen, and Uri Zwick. Strategy iteration is strongly polynomial for 2-player turn-based stochastic games with a constant discount factor. Journal of the ACM (JACM), 60(1):1–16, 2013.
Botao Hao, Nevena Lazic, Yasin Abbasi-Yadkori, Pooria Joulani, and Csaba Szepesvari. Provably efficient adaptive approximate policy iteration. arXiv preprint arXiv:2002.03069, 2020.
Willy Herroelen and Roel Leus. The construction of stable project baseline schedules. European Journal of Operational Research, 156(3):550–565, 2004.
Junling Hu and Michael P Wellman. Nash q-learning for general-sum stochastic games. Journal of Machine Learning Research, 4(Nov):1039–1069, 2003.
Peter JH Hulshof, Nikky Kortbeek, Richard J Boucherie, Erwin W Hans, and Piet JM Bakker. Taxonomic classification of planning decisions in health care: a structured review of the state of the art in or/ms. Health Systems, 1(2):129–175, 2012.
Mehdi Jafarnia-Jahromi, Rahul Jain, and Ashutosh Nayyar. Online learning for unknown partially observable mdps. arXiv preprint arXiv:2102.12661, 2021.
Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(Apr):1563–1600, 2010.
Zeyu Jia, Lin F Yang, and Mengdi Wang. Feature-based q-learning for two-player stochastic games. arXiv preprint arXiv:1906.00423, 2019.
Chi Jin, Zeyuan Allen-Zhu, Sebastien Bubeck, and Michael I Jordan. Is Q-learning provably efficient? In Advances in Neural Information Processing Systems, pages 4863–4873, 2018.
Chi Jin, Zhuoran Yang, Zhaoran Wang, and Michael I Jordan. Provably efficient reinforcement learning with linear function approximation. In Conference on Learning Theory, pages 2137–2143. PMLR, 2020.
Chi Jin, Qinghua Liu, and Tiancheng Yu. The power of exploiter: Provable multi-agent rl in large state spaces. arXiv preprint arXiv:2106.03352, 2021.
Guido C Kaandorp and Ger Koole. Optimal outpatient appointment scheduling. Health Care Management Science, 10(3):217–229, 2007.
Sammie Katt, Frans Oliehoek, and Christopher Amato. Bayesian reinforcement learning in factored pomdps. arXiv preprint arXiv:1811.05612, 2018.
Emilie Kaufmann, Nathaniel Korda, and Rémi Munos. Thompson sampling: An asymptotically optimal finite-time analysis. In International Conference on Algorithmic Learning Theory, pages 199–213. Springer, 2012.
Benjamin Kemper, Chris AJ Klaassen, and Michel Mandjes. Optimized appointment scheduling. European Journal of Operational Research, 239(1):243–255, 2014.
Michael Jong Kim. Thompson sampling for stochastic control: The finite parameter case. IEEE Transactions on Automatic Control, 62(12):6415–6422, 2017.
Kenneth J Klassen and Thomas R Rohleder. Scheduling outpatient appointments in a dynamic environment. Journal of Operations Management, 14(2):83–101, 1996.
Qingxia Kong, Chung-Yee Lee, Chung-Piaw Teo, and Zhichao Zheng. Scheduling arrivals to a stochastic service delivery system using copositive cones. Operations Research, 61(3):711–726, 2013.
Qingxia Kong, Chung-Yee Lee, Chung-Piaw Teo, and Zhichao Zheng. Appointment sequencing: Why the smallest-variance-first rule may not be optimal. European Journal of Operational Research, 255(3):809–821, 2016.
Alex Kuiper and Michel Mandjes. Appointment scheduling in tandem-type service systems. Omega, 57:145–156, 2015a.
Alex Kuiper and Michel Mandjes. Practical principles in appointment scheduling. Quality and Reliability Engineering International, 31(7):1127–1135, 2015b.
Alex Kuiper, Michel Mandjes, and Jeroen de Mast. Optimal stationary appointment schedules. Operations Research Letters, 45(6):549–555, 2017.
Panqanamala Ramana Kumar and Pravin Varaiya. Stochastic systems: Estimation, identification, and adaptive control. SIAM Classic, 2015.
Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.
Sahin Lale, Kamyar Azizzadenesheli, Babak Hassibi, and Anima Anandkumar. Explore more and improve regret in linear quadratic regulators. arXiv preprint arXiv:2007.12291, 2020a.
Sahin Lale, Kamyar Azizzadenesheli, Babak Hassibi, and Anima Anandkumar. Logarithmic regret bound in partially observable linear dynamical systems. arXiv preprint arXiv:2003.11227, 2020b.
Tor Lattimore and Csaba Szepesvári. Bandit algorithms. Cambridge University Press, 2018.
Philip Lebowitz. Schedule the short procedure first to improve or efficiency. AORN Journal, 78(4):651–659, 2003.
Michael L Littman. Friend-or-foe q-learning in general-sum games. In ICML, volume 1, pages 322–328, 2001.
Miao Liu, Xuejun Liao, and Lawrence Carin. The infinite regionalized policy representation. In ICML, 2011.
Miao Liu, Xuejun Liao, and Lawrence Carin. Online expectation maximization for reinforcement learning in pomdps. In IJCAI, pages 1501–1507, 2013.
Qinghua Liu, Tiancheng Yu, Yu Bai, and Chi Jin. A sharp analysis of model-based reinforcement learning with self-play. In International Conference on Machine Learning, pages 7001–7010. PMLR, 2021.
Ho-Yin Mak, Ying Rong, and Jiawei Zhang. Appointment scheduling with limited distributional information. Management Science, 61(2):316–334, 2014.
Wai-Kei Mak, David P Morton, and R Kevin Wood. Monte carlo bounding techniques for determining solution quality in stochastic programs. Operations Research Letters, 24(1):47–56, 1999.
Camilo Mancilla and Robert Storer. A sample average approximation approach to stochastic appointment sequencing and scheduling. IIE Transactions, 44(8):655–670, 2012.
Horia Mania, Stephen Tu, and Benjamin Recht. Certainty equivalence is efficient for linear quadratic control. arXiv preprint arXiv:1902.07826, 2019.
Farzaneh Mansourifard, Parisa Mansourifard, Morteza Ziyadi, and Bhaskar Krishnamachari. A heuristic policy for outpatient surgery appointment sequencing: newsvendor ordering. 2nd IEOM European Conference on Industrial Engineering and Operations Management, Paris, 2018.
Eric Marcon and Franklin Dexter. Impact of surgical sequencing on post anesthesia care unit staffing.
Health Care Management Science, 9(1):87–98, 2006.
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928–1937, 2016.
John F Nash et al. Equilibrium points in n-person games. Proceedings of the National Academy of Sciences, 36(1):48–49, 1950.
Robert C Newbold. Project management in the fast lane: applying the theory of constraints. CRC Press, 1998.
Ronald Ortner. Regret bounds for reinforcement learning via markov chain concentration. arXiv preprint arXiv:1808.01813, 2018.
Ian Osband and Benjamin Van Roy. Model-based reinforcement learning and the eluder dimension. arXiv preprint arXiv:1406.1853, 2014.
Ian Osband and Benjamin Van Roy. Why is posterior sampling better than optimism for reinforcement learning? In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 2701–2710. JMLR.org, 2017.
Ian Osband, Daniel Russo, and Benjamin Van Roy. (more) efficient reinforcement learning via posterior sampling. In Advances in Neural Information Processing Systems, pages 3003–3011, 2013.
Ian Osband, Benjamin Van Roy, and Zheng Wen. Generalization and exploration via randomized value functions. arXiv preprint arXiv:1402.0635, 2014.
Yi Ouyang, Mukul Gagrani, and Rahul Jain. Learning-based control of unknown linear systems with thompson sampling. arXiv preprint arXiv:1709.04047, 2017a.
Yi Ouyang, Mukul Gagrani, Ashutosh Nayyar, and Rahul Jain. Learning unknown markov decision processes: A thompson sampling approach. In Advances in Neural Information Processing Systems, pages 1333–1342, 2017b.
Michael Pinedo. Scheduling. Springer, 2012.
Pascal Poupart and Nikos Vlassis. Model-based bayesian reinforcement learning in partially observable domains. In Proc. Int. Symp. on Artificial Intelligence and Mathematics, pages 1–2, 2008.
Martin L Puterman. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 2014.
Jin Qi. Mitigating delays and unfairness in appointment systems. Management Science, 63(2):566–583, 2016.
Aviv Rosenberg and Yishay Mansour. Stochastic shortest path with adversarially changing costs. arXiv preprint arXiv:2006.11561, 2020.
Aviv Rosenberg, Alon Cohen, Yishay Mansour, and Haim Kaplan. Near-optimal regret bounds for stochastic shortest path. In International Conference on Machine Learning, pages 8210–8219. PMLR, 2020.
Stephane Ross, Brahim Chaib-draa, and Joelle Pineau. Bayes-adaptive pomdps. In NIPS, pages 1225–1232, 2007.
Johannes O Royset. On sample size control in sample average approximations for solving smooth stochastic programs. Computational Optimization and Applications, 55(2):265–309, 2013.
Daniel Russo and Benjamin Van Roy. Learning to optimize via posterior sampling. Mathematics of Operations Research, 39(4):1221–1243, 2014.
Daniel Russo, Benjamin Van Roy, Abbas Kazerouni, Ian Osband, and Zheng Wen. A tutorial on thompson sampling. arXiv preprint arXiv:1707.02038, 2017.
John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897, 2015.
Anton Schwartz. A reinforcement learning method for maximizing undiscounted rewards. In Proceedings of the Tenth International Conference on Machine Learning, volume 298, pages 298–305, 1993.
Steven L Scott. A modern bayesian look at the multi-armed bandit. Applied Stochastic Models in Business and Industry, 26(6):639–658, 2010.
Guy Shani, Ronen I Brafman, and Solomon E Shimony. Model-based online learning of pomdps. In European Conference on Machine Learning, pages 353–364.
Springer, 2005.
Alexander Shapiro, Darinka Dentcheva, and Andrzej Ruszczynski. Lectures on stochastic programming. MPS-SIAM Series on Optimization, 9:1, 2009.
Lloyd S Shapley. Stochastic games. Proceedings of the National Academy of Sciences, 39(10):1095–1100, 1953.
Aaron Sidford, Mengdi Wang, Lin Yang, and Yinyu Ye. Solving discounted stochastic two-player games with near-optimal time and sample complexity. In International Conference on Artificial Intelligence and Statistics, pages 2992–3002. PMLR, 2020.
David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of go without human knowledge. Nature, 550(7676):354–359, 2017.
Max Simchowitz and Dylan Foster. Naive exploration is optimal for online lqr. In International Conference on Machine Learning, pages 8937–8948. PMLR, 2020.
Alfonso Soriano. Comparison of two scheduling systems. Operations Research, 14(3):388–397, 1966.
Alexander L Strehl and Michael L Littman. An analysis of model-based interval estimation for markov decision processes. Journal of Computer and System Sciences, 74(8):1309–1331, 2008.
Alexander L Strehl, Lihong Li, Eric Wiewiora, John Langford, and Michael L Littman. Pac model-free reinforcement learning. In Proceedings of the 23rd International Conference on Machine Learning, pages 881–888. ACM, 2006.
Malcolm Strens. A bayesian framework for reinforcement learning. In ICML, volume 2000, pages 943–950, 2000.
Jayakumar Subramanian, Amit Sinha, Raihan Seraj, and Aditya Mahajan. Approximate information state for approximate planning and reinforcement learning in partially observed systems. arXiv preprint arXiv:2010.08843, 2020.
Mohammad Sadegh Talebi and Odalric-Ambrym Maillard. Variance-aware regret bounds for undiscounted reinforcement learning in mdps. In Algorithmic Learning Theory, pages 770–805, 2018.
Jean Tarbouriech, Evrard Garcelon, Michal Valko, Matteo Pirotta, and Alessandro Lazaric. No-regret exploration in goal-oriented reinforcement learning. In International Conference on Machine Learning, pages 9428–9437. PMLR, 2020.
Jean Tarbouriech, Matteo Pirotta, Michal Valko, and Alessandro Lazaric. Sample complexity bounds for stochastic shortest path with a generative model. In Algorithmic Learning Theory, pages 1157–1178. PMLR, 2021a.
Jean Tarbouriech, Runlong Zhou, Simon S Du, Matteo Pirotta, Michal Valko, and Alessandro Lazaric. Stochastic shortest path: Minimax, parameter-free and towards horizon-free regret. arXiv preprint arXiv:2104.11186, 2021b.
William R Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.
Wendi Tian and Erik Demeulemeester. Railway scheduling reduces the expected project makespan over roadrunner scheduling in a multi-mode project scheduling environment. Annals of Operations Research, 213(1):271–291, 2014.
Yi Tian, Yuanhao Wang, Tiancheng Yu, and Suvrit Sra. Online learning in unknown markov games. In International Conference on Machine Learning, pages 10279–10288. PMLR, 2021.
Anastasios Tsiamis and George Pappas. Online learning of the kalman filter with logarithmic regret. arXiv preprint arXiv:2002.05141, 2020.
Wouter Vink, Alex Kuiper, Benjamin Kemper, and Sandjai Bhulai. Optimal appointment scheduling in continuous time: The lag order approximation method. European Journal of Operational Research, 240(1):213–219, 2015.
Oriol Vinyals, Igor Babuschkin, Wojciech M Czarnecki, Michaël Mathieu, Andrew Dudzik, Junyoung Chung, David H Choi, Richard Powell, Timo Ewalds, Petko Georgiev, et al. Grandmaster level in starcraft ii using multi-agent reinforcement learning. Nature, 575(7782):350–354, 2019.
P Patrick Wang. Static and dynamic scheduling of customer arrivals to a single-server system.
Naval Research Logistics (NRL), 40(3):345–360, 1993.
P Patrick Wang. Sequencing and scheduling n customers for a stochastic server. European Journal of Operational Research, 119(3):729–738, 1999.
Ruosong Wang, Russ R Salakhutdinov, and Lin Yang. Reinforcement learning with general value function approximation: Provably efficient approach via bounded eluder dimension. Advances in Neural Information Processing Systems, 33, 2020.
Tianhao Wang, Dongruo Zhou, and Quanquan Gu. Provably efficient reinforcement learning with linear function approximation under adaptivity constraints. arXiv preprint arXiv:2101.02195, 2021.
Christopher John Cornish Hellaby Watkins. Learning from delayed rewards. PhD Thesis, King’s College, Cambridge, 1989.
Chen-Yu Wei, Yi-Te Hong, and Chi-Jen Lu. Online reinforcement learning in stochastic games. In Advances in Neural Information Processing Systems, pages 4987–4997, 2017.
Chen-Yu Wei, Mehdi Jafarnia-Jahromi, Haipeng Luo, Hiteshi Sharma, and Rahul Jain. Model-free reinforcement learning in infinite-horizon average-reward markov decision processes. arXiv preprint arXiv:1910.07072, 2019.
Chen-Yu Wei, Mehdi Jafarnia-Jahromi, Haipeng Luo, Hiteshi Sharma, and Rahul Jain. Model-free reinforcement learning in infinite-horizon average-reward markov decision processes. In International Conference on Machine Learning, pages 10170–10180. PMLR, 2020.
Chen-Yu Wei, Mehdi Jafarnia-Jahromi, Haipeng Luo, and Rahul Jain. Learning infinite-horizon average-reward mdps with linear function approximation. International Conference on Artificial Intelligence and Statistics, 2021.
Elliott N Weiss. Models for determining estimated start times and case orderings in hospital operating rooms. IIE Transactions, 22(2):143–150, 1990.
Tsachy Weissman, Erik Ordentlich, Gadiel Seroussi, Sergio Verdu, and Marcelo J Weinberger. Inequalities for the l1 deviation of the empirical distribution. Hewlett-Packard Labs, Tech. Rep, 2003.
Weitiao Wu, Ronghui Liu, Wenzhou Jin, and Changxi Ma. Stochastic bus schedule coordination considering demand assignment and rerouting of passengers. Transportation Research Part B: Methodological, 121:275–303, 2019.
Qiaomin Xie, Yudong Chen, Zhaoran Wang, and Zhuoran Yang. Learning zero-sum simultaneous-move markov games using function approximation and correlated equilibrium. In Conference on Learning Theory, pages 3674–3682. PMLR, 2020.
Yi Xiong, Ningyuan Chen, Xuefeng Gao, and Xiang Zhou. Sublinear regret for learning pomdps. arXiv preprint arXiv:2107.03635, 2021.
Yaodong Yang and Jun Wang. An overview of multi-agent reinforcement learning from game theoretical perspective. arXiv preprint arXiv:2011.00583, 2020.
Andrea Zanette and Emma Brunskill. Tighter problem-dependent regret bounds in reinforcement learning without domain knowledge using value function bounds. In International Conference on Machine Learning, 2019.
Kaiqing Zhang, Sham M Kakade, Tamer Başar, and Lin F Yang. Model-based multi-agent rl in zero-sum markov games with near-optimal sample complexity. arXiv preprint arXiv:2007.07461, 2020.
Kaiqing Zhang, Zhuoran Yang, and Tamer Başar. Multi-agent reinforcement learning: A selective overview of theories and algorithms. Handbook of Reinforcement Learning and Control, pages 321–384, 2021.
Zihan Zhang and Xiangyang Ji. Regret minimization for reinforcement learning by evaluating the optimal bias function. In Advances in Neural Information Processing Systems, 2019.
Asset Metadata
Creator: Jafarnia Jahromi, Mehdi (author)
Core Title: Online reinforcement learning for Markov decision processes and games
School: Viterbi School of Engineering
Degree: Doctor of Philosophy
Degree Program: Electrical Engineering
Publication Date: 12/03/2021
Defense Date: 08/11/2021
Publisher: University of Southern California (original); University of Southern California. Libraries (digital)
Tags: efficient exploration, game, Markov decision process, online reinforcement learning
Format: application/pdf (imt)
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisors: Luo, Haipeng (committee member); Nayyar, Ashutosh (committee member)
Creator Email: mehdi.jafarnia@gmail.com, mjafarni@usc.edu
Permanent Link (DOI): https://doi.org/10.25549/usctheses-oUC17895961
Unique Identifier: UC17895961
Legacy Identifier: etd-JafarniaJa-10276
Document Type: Dissertation
Rights: Jafarnia Jahromi, Mehdi
Type: texts
Source: 20211208-wayne-usctheses-batch-902-nissen (batch); University of Southern California (contributing entity); University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright. The original signature page accompanying the original submission of the work to the USC Libraries is retained by the USC Libraries and a copy of it may be obtained by authorized requesters contacting the repository e-mail address given.
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email: cisadmin@lib.usc.edu