Reinforcement Learning with Generative Model for Non-parametric MDPs

by

Hiteshi Sharma

A Dissertation Presented to the FACULTY OF THE GRADUATE SCHOOL, UNIVERSITY OF SOUTHERN CALIFORNIA, in Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (Electrical Engineering)

August 2020

Copyright 2020 Hiteshi Sharma

Dedication

In loving memory of my Grandfather.

Acknowledgements

First and foremost, I would like to express my gratitude to my advisor, Prof. Rahul Jain, for his guidance and patience. Without his constant encouragement, I would not be submitting this dissertation. His excitement for learning anything new is infectious. His thoughtful criticism has helped me immensely not only in my research, technical writing, and presentations, but also in asking bold, insightful questions. I would spend days working on a problem and walk into his office confident that I had everything covered, but he would always manage to ask questions for which I had no answer!

I would also like to thank Prof. William Haskell, with whom I had the opportunity to collaborate during the beginning of my graduate school. Working with him has enhanced my mathematical maturity and helped me develop the results presented in this dissertation. I would also like to thank my committee, Prof. Mahdi Soltanolkotabi and Prof. Haipeng Luo, for all their help and input along the way.

Most of my time at USC has been spent in EEB 335 with some amazing colleagues: Naumaan Nayyar, Mehdi Jafarnia-Jahromi, Nathan Dahlin, Mohammad Asgharipari and Krishna Chaitanya Kalagarla. I have especially enjoyed working with Mehdi. Our discussions were a great experience and led to some very exciting research. Sometimes we would just stare at the equations on the whiteboard during our discussions. I fondly remember lunch breaks with Nate and numerous discussions with Krishna.
I would like to thank all the EE department staff members and student workers for making my life easier by handling all the administrative tasks, especially Diane Demetras, Annie Yu, Tracy Charles and Jennifer Gerson. I would also like to thank everybody in my improv group, USC Vidushak. In particular, Bhamati, Satya, Kiran, Sanmukh and Vishnu helped me form some of my favorite moments at USC. Thanks to Swati, Satish and Saket for looking out for me when I was feeling low.

My family deserves the biggest thanks of all, especially my baba (grandfather). He taught me the importance of education and supported me immensely in pursuing higher education. My parents have been my pillars of support. They have never wavered in their support and belief in me. I am also grateful to my husband Peeyush, who has always trusted me, supported me during the hardest times, and helped me in running some experiments. Peeyush, you are the wind beneath my wings!

Table of Contents

Dedication
Acknowledgements
List of Tables
List of Figures
Abstract
Chapter 1: Introduction
Chapter 2: Iteration of Random Operators
  2.1 Introduction
  2.2 Contraction in Banach space
  2.3 Random operators on Banach space
    2.3.1 Convergence of random operators
      2.3.1.1 Main ingredients of the proof
    2.3.2 Application: Empirical dynamic programming
  2.4 Conclusions
Chapter 3: MDPs with continuous state space: Discounted reward criteria
  3.1 Introduction
  3.2 Preliminaries
  3.3 Offline Value Learning
    3.3.1 Random Parametric Basis Function (RPBF) Approximation
    3.3.2 Non-parametric Function Approximation in RKHS
    3.3.3 Convergence Analysis
    3.3.4 Stochastic dominance
    3.3.5 Numerical Experiments
  3.4 Online RL with generative model
    3.4.1 Algorithm
    3.4.2 Convergence Analysis
    3.4.3 Numerical Experiments
      3.4.3.1 Optimal replacement
      3.4.3.2 Cart-Pole
  3.5 Conclusion
  3.6 Proofs
    3.6.1 Supplement for Section 3.3
    3.6.2 Supplement for Section 3.3.3
    3.6.3 Supplement for Section 3.3.3: Function approximation
    3.6.4 Bellman error
    3.6.5 Supplement for Section 3.4.2
Chapter 4: MDPs with continuous state space: Average reward criteria
  4.1 Introduction
  4.2 Preliminaries
  4.3 Algorithm and analysis with truncation
    4.3.1 The Algorithm and Main Result
    4.3.2 Convergence Analysis
    4.3.3 Numerical Performance
  4.4 Algorithm and analysis without truncation
    4.4.1 Approximate Bellman Operator
    4.4.2 Approximate Relative Value Learning (ARVL)
    4.4.3 Convergence Analysis
  4.5 Conclusion
  4.6 Proofs
    4.6.1 Proof of Theorem 29
    4.6.2 Proof of Theorem 28
    4.6.3 Proof of Theorem 52
Chapter 5: MDPs with continuous state and action spaces
  5.1 Empirical Optimization
  5.2 Empirical Dynamic Programming with Empirical Optimization
    5.2.1 Preliminaries
    5.2.2 The Algorithm
    5.2.3 Convergence Analysis
      5.2.3.1 Bound on one-step error
      5.2.3.2 Stochastic Dominance
    5.2.4 Numerical Experiments
  5.3 Online Value Learning
    5.3.1 Convergence Analysis
  5.4 Empirical Policy Learning
    5.4.1 Convergence Analysis
    5.4.2 Error in one iteration
    5.4.3 Stochastic dominance
    5.4.4 Numerical Experiments
      5.4.4.1 Proof of Concept
      5.4.4.2 Minitaur
  5.5 Conclusions
  5.6 Proofs
    5.6.1 Proof of Lemma 44
    5.6.2 Proof of Lemma 51
    5.6.3 Proof of Lemma 57
Bibliography

List of Tables

3.1 Runtime performance comparison for the cart-pole problem (m = minutes)
3.2 β-mixing coefficients
3.3 CPU per-iteration runtime (in milliseconds)
5.1 Maximal reward after 3M steps

List of Figures

3.1 Relative error with iterations for various algorithms
3.2 Performance on the Acrobot problem
3.3 Performance of ONEVaL for different sample sizes
3.4 Performance of ONEVaL on Cart-Pole. Expected (discounted) reward, averaged over 100 trials (solid), with the range of highest and lowest values in faded colors, for N = 1000 with M = 1
3.5 Performance of ONEVaL on Cart-Pole. Expected (discounted) reward, averaged over 100 trials (solid), with the range of highest and lowest values in faded colors, for N = 5000 with M = 1
3.6 ONEVaL vs DQN
3.7 Q-function over angle states. Q-function for a = −1 (move to the left). Rows (top to bottom): Ridge Regression, RKHS, Nystrom, RPBF, NN. Left: x = v = 0, episode 50. Middle: x = v = 0, episode 10. Right: x = −2.0, v = −2.5, episode 50
4.1 Optimal replacement with nearest neighbors
4.2 Probability transition kernel for the optimal machine replacement problem
5.1 Sinusoid function f_1
5.2 Concave function f_2
5.3 Performance of RAEVL for different sample sizes
5.4 Performance of RANDPOL on a synthetic example. Left: optimal and approximate policy. Right: error with iterations
5.5 Left: Minitaur environment in PyBullet. Right: average reward for RANDPOL, PPO and DDPG, averaged over five random seeds

Abstract

In this dissertation, we present empirical dynamic programming for dynamical systems modeled as Markov Decision Processes (MDPs). In particular, we focus on MDPs with continuous state and action spaces. The first part deals with continuous state and finite action spaces, while the second part addresses the scenario where both the state and action spaces are continuous. We study two popular optimality criteria for infinite-horizon MDPs: the discounted and average reward criteria. The former is very common in robotics, while the latter is used for online learning. We propose off-policy mini-batch empirical value learning algorithms under the different criteria. To deal with continuous spaces, we leverage sampling-based ideas. We first introduce an empirical Bellman operator in which the expectation is replaced by a sample average.
Next, we rely on function approximation, focusing on function spaces which can provide arbitrarily good approximation for any problem. Furthermore, we are interested in studying finite-time convergence of the proposed algorithms. The convergence analysis requires a random operator framework, since the operator we deal with is no longer deterministic. This requires a notion of "probabilistic contraction" and hence some sort of probabilistic fixed point. We define probabilistic (weak) fixed points and establish the convergence of the proposed algorithms. We also use "stochastic dominance", constructing a Markov chain that stochastically dominates the error introduced by the approximation, to provide a rate of convergence.

For MDPs with a continuous action space, we first propose an optimization via randomization method. This makes it possible to find the best action for a given action-value function by uniform sampling over the action space. Combining the idea of optimization by sampling with empirical dynamic programming and function approximation, we can handle continuous state and action MDPs. We also develop online variants of the empirical value learning algorithms. These algorithms need a memory buffer in which previous experiences can be collected and sampled later for function reconstruction. For exploration, we use ε-greedy policies. We test our algorithms on benchmark problems like cart-pole balancing and mountain car. The numerical results show good performance of the proposed algorithms.

We next extend the empirical dynamic programming framework to an actor-critic architecture, which is more akin to policy iteration. We perform the two steps of policy evaluation and policy improvement approximately. This provides an efficient way to leverage continuous action spaces and allows us to scale to high-dimensional problems like the Minitaur quadrupedal robot in PyBullet.
For this environment, we show that the policy learning based algorithm is able to accomplish the locomotion task.

Chapter 1: Introduction

Iterative algorithms are often used in stochastic control. The analysis of such algorithms usually relies on a contraction property of the underlying operators. In Chapter 2, we consider approximations of such operators and analyze their "probabilistic contraction" property. That chapter defines probabilistic fixed points and establishes convergence of such random operators to these points.

Real-time decision making in uncertain environments is often modeled via Markov Decision Processes (MDPs) [55]. There exists a wide variety of approximate dynamic programming (DP) [12, Chapter 6], [52] and reinforcement learning (RL) algorithms [63] for finite state space MDPs. But many real-world problems of interest have continuous or very large state and action spaces. We will first focus on continuous state and finite action spaces. Approximate DP and RL algorithms do exist for continuous state space MDPs, but choosing which one to employ is an art form: different techniques (state space aggregation and function approximation [16]) and algorithms work for different problems [10, 11, 53], and universally applicable algorithms are lacking. For example, fitted value iteration [48] is very effective for some problems but requires the choice of appropriate basis functions for good approximation. Most of the existing work on empirical dynamic programming (EDP) requires domain knowledge of the problem at hand for effective implementation. Here, we are interested in EDP methods which are effective without any prior problem knowledge.

In Chapter 3, we propose approximate DP algorithms for MDPs with continuous state space and finite action space under the discounted reward optimality criterion.
The proposed algorithms are universal (the approximating function space can provide arbitrarily good approximation for any problem), computationally tractable, and simple to implement, and yet come with non-asymptotic sample complexity bounds. The first property is accomplished by picking function spaces for approximation that are dense in the space of continuous functions (i.e., for any continuous function $f$ and any $\epsilon > 0$, there is an element of our approximating function space that is within $\epsilon$ of $f$ in the sup-norm). The second goal is achieved by relying on randomized selection of basis functions for approximation and also by 'empirical' dynamic programming [24]. The third is enabled because standard Python routines can be used for function fitting, and the fourth by analysis in a random operator framework which provides non-asymptotic rates of convergence and sample complexity bounds.

Chapter 4 is focused on MDPs with continuous state and finite action space under the average reward optimality criterion. For infinite-horizon MDPs, while there are many reinforcement learning (RL) and approximate dynamic programming (ADP) algorithms available [53] for the discounted reward criterion [9], the average reward criterion is harder. Indeed, MDPs with the average reward criterion are more difficult to analyze because establishing the existence of a stationary optimal policy itself requires some restrictions on the underlying Markov chains [6]. The goal of this chapter is to introduce an off-policy empirical (or approximate) (relative) value learning algorithm for computing optimal policies for non-parametric MDPs with continuous state space and the average reward criterion. Specifically, we aim to design RL algorithms that are universal (arbitrarily good approximation for any MDP), computationally simple, and easy to implement, and yet we would like to have non-asymptotic sample complexity bounds, even if the guarantees are probabilistic.
We would also like such algorithms to work well numerically. The algorithm we propose is inspired by the empirical dynamic programming (EDP) framework [24]: an off-policy mini-batch empirical value learning algorithm for discounted MDPs with finite state and action spaces. The key idea is to replace the expectation in the Bellman operator with a sample average approximation obtained from a mini-batch of samples of the next state. The convergence analysis requires a random operator framework and the construction of a Markov chain that stochastically dominates the error introduced by the approximation.

Chapter 5 focuses on MDPs with continuous state and action spaces. Such MDPs are very common in robotics, ranging from simple tasks like pendulum swing-up to locomotion tasks. Model-free reinforcement learning has been successfully applied to such problems [40, 59]. However, the sample complexity of model-free algorithms, particularly when using very high-dimensional function approximators (policy and value function), tends to be high [59]. Hence, while working with real physical systems, such as robots and autonomous vehicles, such algorithms need a lot of manual engineering. In this chapter, we focus on the generative-model-based algorithms developed in the previous chapters and extend them to continuous action spaces. This in turn requires an efficient way to optimize the value function over the action space. For this, we propose optimization via randomization, where the idea is to uniformly sample a set of points from the underlying space and find the maximum among them. The function is assumed to be Lipschitz. We tested this idea on a couple of functions while varying the dimension, and the results suggest good numerical performance. Later in the chapter, we propose a generalized policy learning algorithm which fits a function in the policy domain, leveraging the continuous action space. It is an actor-critic architecture which maintains both a policy and a value network.
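The optimization-via-randomization idea above can be sketched in a few lines. The following is a minimal, self-contained illustration; the objective `f` and the interval `[0, 1]` are hypothetical stand-ins for a Lipschitz action-value function over a continuous action space, not the benchmark functions used in Chapter 5.

```python
import random

def randomized_maximize(f, low, high, num_samples, seed=0):
    """Approximate the maximum of f over [low, high] by evaluating f
    at uniformly sampled points and keeping the best one."""
    rng = random.Random(seed)
    best_x, best_val = None, float("-inf")
    for _ in range(num_samples):
        x = rng.uniform(low, high)
        v = f(x)
        if v > best_val:
            best_x, best_val = x, v
    return best_x, best_val

# A Lipschitz test function with known maximum f(0.5) = 1.
f = lambda x: 1.0 - abs(x - 0.5)
x_hat, v_hat = randomized_maximize(f, 0.0, 1.0, num_samples=200)
```

For a Lipschitz function, the gap between the returned value and the true maximum shrinks as the number of uniform samples grows, which is the basis of the finite-sample guarantees developed later.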
Using randomized function approximation, we give finite-time guarantees in probability. Furthermore, we show its good performance on a challenging quadrupedal robot environment.

Chapter 2: Iteration of Random Operators

2.1 Introduction

In this chapter, we focus on iterative algorithms that often arise in stochastic optimization and control. Each iteration has some randomness, and often the convergence guarantees are asymptotic. The analysis of each algorithm involves a different method in spite of various commonalities. It would be desirable to have a common framework within which we can think of various algorithms, with a common analysis technique, which can then also be useful for the design of new algorithms. In general, if we have a contraction operator which is computationally expensive, one can design an approximate version of the exact operator; one can then view each iteration as the application of a random function or operator. This chapter presents an analysis of iterations of random operators by appropriately defining probabilistic fixed points. The convergence analysis is via a novel stochastic dominance argument. We can specify the number of samples needed per iteration and the number of iterations needed to achieve a small error with high probability. Furthermore, empirical DP algorithms have shown superior numerical and actual run-time performance compared to popularly used reinforcement learning algorithms [24]. The random operator framework developed here is quite general. Stochastic approximation schemes and most reinforcement learning algorithms can be viewed as iterations of a random operator, and their convergence follows without additional effort.

2.2 Contraction in Banach space

Let $(\mathcal{X}, \|\cdot\|)$ be a Banach space, and let $T : \mathcal{X} \to \mathcal{X}$ be a contraction, i.e., $\|Tx_1 - Tx_2\| \le \alpha\|x_1 - x_2\|$ for some $\alpha \in (0,1)$ and all $x_1, x_2 \in \mathcal{X}$. We next present the Banach fixed point theorem.

Theorem 1. Suppose $(\mathcal{X}, \|\cdot\|)$ is a Banach space and $T : \mathcal{X} \to \mathcal{X}$ is a contraction mapping. Then:

1.
There exists a unique $x^* \in \mathcal{X}$ such that $Tx^* = x^*$.

2. For arbitrary $x_0 \in \mathcal{X}$, the sequence defined by $x_{n+1} = Tx_n = T^{n+1}x_0$ converges to $x^*$.

Proof. Let $\{x_n\}_{n \ge 0}$ be the sequence defined by $x_{n+1} = Tx_n$, with $x_1 = Tx_0$. Then for any $m \ge 1$,
$$\|x_{n+m} - x_n\| \le \sum_{k=0}^{m-1} \|x_{n+k+1} - x_{n+k}\| = \sum_{k=0}^{m-1} \|T^{n+k}x_1 - T^{n+k}x_0\| \le \sum_{k=0}^{m-1} \alpha^{n+k}\|x_1 - x_0\| = \frac{\alpha^n(1-\alpha^m)}{1-\alpha}\|x_1 - x_0\|.$$
Since $0 < \alpha < 1$, for $n$ sufficiently large $\|x_{n+m} - x_n\|$ is arbitrarily small; hence $\{x_n\}$ is a Cauchy sequence. Since $\mathcal{X}$ is complete, $\{x_n\}$ has a limit $x^* \in \mathcal{X}$, i.e., $\lim_{n\to\infty}\|x_n - x^*\| = 0$. Now observe that
$$0 \le \|Tx^* - x^*\| \le \|Tx^* - x_n\| + \|x_n - x^*\| \le \alpha\|x^* - x_{n-1}\| + \|x_n - x^*\|.$$
The right side can be made arbitrarily small, hence $Tx^* = x^*$.

2.3 Random operators on Banach space

Let $(\Omega, \mathcal{F}, \mathbb{P})$ be the standard probability space, and let $\hat{T}_N : \Omega \times \mathcal{X} \to \mathcal{X}$ be an approximation of $T$. For instance, if $T$ involves the expectation of some random variable, then $\hat{T}_N$ uses an average of $N$ i.i.d. samples. Since $\hat{T}_N$ is a random operator, we define the "probabilistic contraction" property
$$\mathbb{P}\left(\|Tx - \hat{T}_N x\| < \epsilon\right) > p_N(\epsilon),$$
where $p_N(\epsilon) \uparrow 1$ as $N \to \infty$ for all $\epsilon > 0$. This says that the approximate operator $\hat{T}_N$ contracts with high probability, and it leads us to define strong and weak probabilistic fixed points.

Definition 1 (Strong probabilistic fixed point). $\hat{x} \in \mathcal{X}$ is a strong probabilistic fixed point for the sequence $\{\hat{T}_N\}_{N\ge 1}$ if for all $\epsilon > 0$,
$$\lim_{N\to\infty} \mathbb{P}\left(\|\hat{T}_N \hat{x} - \hat{x}\| > \epsilon\right) = 0.$$

Definition 2 (Weak probabilistic fixed point). $\hat{x} \in \mathcal{X}$ is a weak probabilistic fixed point for the sequence $\{\hat{T}_N\}_{N\ge 1}$ if for all $\epsilon > 0$ and all $x \in \mathcal{X}$,
$$\lim_{N\to\infty} \lim_{k\to\infty} \mathbb{P}\left(\|\hat{T}_N^k x - \hat{x}\| > \epsilon\right) = 0.$$

Note that probabilistic contraction can also be defined in other ways, for example when the contraction parameter itself is random. Consider a random operator $\hat{S}_N$ with the property
$$\|\hat{S}_N(\omega)x_1 - \hat{S}_N(\omega)x_2\| \le \beta_N(\omega)\|x_1 - x_2\|,$$
such that $\mathbb{P}\left(\beta_N(\omega) \in (1-\epsilon, 1)\right) < \delta_N(\epsilon)$ for all $\epsilon < \epsilon_0$ for some $\epsilon_0 > 0$, where $\delta_N(\epsilon) \downarrow 0$ as $N \to \infty$ and $\beta_N(\omega) < 1$ almost surely. We will see that this property is useful for analyzing the average-reward case for MDPs.
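A minimal numerical illustration of a random operator with the probabilistic contraction property: the scalar map and noise distribution below are toy assumptions chosen for this sketch, not operators analyzed in this dissertation. Here $T(x) = 0.5x + \mathbb{E}[W]$ with $W \sim \mathrm{Uniform}(0,2)$, so $T$ is a contraction with $\alpha = 0.5$ and fixed point $x^* = 2$, and $\hat{T}_N$ replaces the expectation with an average of $N$ samples.

```python
import random

def hat_T(x, n, rng):
    """Empirical version of T(x) = 0.5*x + E[W], W ~ Uniform(0, 2):
    the expectation (which equals 1) is replaced by an n-sample average."""
    w_bar = sum(rng.uniform(0.0, 2.0) for _ in range(n)) / n
    return 0.5 * x + w_bar

def iterate(n, k, seed=0):
    """Apply hat_T_n k times starting from x0 = 0."""
    rng = random.Random(seed)
    x = 0.0
    for _ in range(k):
        x = hat_T(x, n, rng)
    return x

# Larger mini-batches concentrate the iterates near the fixed point x* = 2.
coarse = abs(iterate(n=5, k=50) - 2.0)
fine = abs(iterate(n=5000, k=50) - 2.0)
```

The iterates never converge exactly (each step injects fresh noise), but they concentrate in a neighborhood of $x^*$ whose size shrinks as $N$ grows — exactly the behavior that the weak probabilistic fixed point definition captures.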
In addition, one can also define a fixed point in the mean-square sense, i.e., $\hat{v}$ is a mean-square probabilistic fixed point of $\hat{T}_N$ if $\mathbb{E}\left[\|\hat{T}_N^k v_0 - \hat{v}\|\right] \to 0$ as $N \to \infty$. This is of mathematical interest, although it has not arisen in the algorithms we have seen.

2.3.1 Convergence of random operators

The convergence of random operators on a Banach space was studied in [24]. The proof argument relies on constructing a finite-state irreducible Markov chain that 'stochastically dominates' the error process. The main theorem is presented below.

Theorem 2. [24, Theorem 4.2] Let $(\mathcal{X}, \|\cdot\|)$ be a Banach space. Consider the random sequence generated by iteratively applying the random operator $\hat{T}_N$ for some $N \in \mathbb{N}$, $X_k^N = \hat{T}_N^k X_0^N$ with $X_0^N = 0$. Let $x^*$ denote the fixed point of an operator $T$. Suppose that the following conditions hold:

1. $T : \mathcal{X} \to \mathcal{X}$ is a contraction operator with contraction coefficient $\alpha < 1$.
2. For any $\epsilon > 0$, we have $\lim_{N\to\infty} \mathbb{P}\left(\|\hat{T}_N - T\| > \epsilon\right) = 0$.
3. There exists $\kappa > 0$ such that $\|x^*\| \le \kappa$ and $\|X_k^N\| \le \kappa$ for all $k \in \mathbb{N}$ almost surely.
4. We have $\lim_{N\to\infty} \hat{T}_N = T$ almost surely.

Then for any $\epsilon > 0$,
$$\lim_{N\to\infty} \lim_{k\to\infty} \mathbb{P}\left(\|X_k^N - x^*\| > \epsilon\right) = 0.$$

2.3.1.1 Main ingredients of the proof

The proof of Theorem 2 consists of two main parts: bounding the one-step error probability, and then using that bound to construct a Markov chain which "stochastically dominates" the error process. The error process is defined (on a grid of resolution $\epsilon_g$) as
$$X_k^N = \begin{cases} 0 & \text{if } \|\hat{v}_k^N - v^*\|_\infty = 0, \\ \eta & \text{if } (\eta-1)\,\epsilon_g \le \|\hat{v}_k^N - v^*\|_\infty \le \eta\,\epsilon_g. \end{cases}$$
We first need a bound on the one-step probability, i.e., on $\mathbb{P}\left(\|Tv - \hat{T}_N v\| < \epsilon\right)$ for a given $v \in \mathcal{V}$. This can be obtained using a concentration inequality such as Hoeffding's. Hence, for a given $\epsilon > 0$, one can find a sequence $p_N(\epsilon) \uparrow 1$ such that
$$\mathbb{P}\left(\|Tv - \hat{T}_N v\| < \epsilon\right) > p_N(\epsilon).$$
The next part is to construct a Markov chain, using the above bound $p_N$, which stochastically dominates the error process. Stochastic dominance is defined as follows.

Definition 3.
Let $X$ and $Y$ be two random variables. Then $Y$ stochastically dominates $X$, written $X \le_{st} Y$, when $\mathbb{P}(X \ge \theta) \le \mathbb{P}(Y \ge \theta)$ for all $\theta$ in the support of $Y$.

Since the iterates are bounded by $\kappa$, we can construct the following Markov chain with state space $\{0, 1, 2, \ldots, \kappa\}$:
$$Y_{k+1} = \begin{cases} \max(Y_k - 1,\, 0) & \text{w.p. } p_N, \\ \kappa & \text{otherwise.} \end{cases}$$
Note that by construction $X_k^N \le_{st} Y_k$. Furthermore, $Y_k$ is a finite-state irreducible Markov chain, so one can compute its steady-state distribution; $\mathbb{P}(Y_k \ge \eta)$ can then be bounded by a mixing-time argument, which yields a bound on the error process.

2.3.2 Application: Empirical dynamic programming

In this section, we see how the convergence result for random operators, Theorem 2, can be applied to dynamic programming. Consider an MDP with finite state space $S$, finite action space $A$, reward function $r : S \times A \to \mathbb{R}$, and transition probability $P(j|i,a)$ of going to state $j$ from state $i$ when action $a$ is executed. Consider the infinite-horizon discounted setting with discount parameter $0 < \gamma < 1$. We assume that the reward function $r$ satisfies $|r(s,a)| \le r_{\max} < \infty$ for all $(s,a) \in S \times A$. Under this assumption, $\|v^\pi\|_\infty \le v_{\max} := r_{\max}/(1-\gamma)$, where $v^\pi$ is the value function for policy $\pi \in \Pi$, defined as $v^\pi(s) = \mathbb{E}_s^\pi\left[\sum_{t=0}^\infty \gamma^t r(s_t, a_t)\right]$ for all $s \in S$. The optimal value function is
$$v^*(s) := \sup_{\pi \in \Pi} \mathbb{E}_s^\pi\left[\sum_{t=0}^\infty \gamma^t r(s_t, a_t)\right], \quad \forall s \in S.$$
Let $\mathcal{V}$ be the space of value functions. To characterize the optimal value function, we define the Bellman operator $T : \mathcal{V} \to \mathcal{V}$ via
$$[Tv](s) := \max_{a \in A}\left\{ r(s,a) + \gamma \sum_{j=1}^{|S|} P(j|s,a)\, v(j) \right\}, \quad \forall s \in S.$$
It is well known that $T$ is a contraction with contraction parameter $\gamma$, and that the optimal value function $v^*$ is a fixed point of $T$, i.e., $Tv^* = v^*$ [55, Theorem 6.2.5]. Classical value iteration is based on iterating $T$ to obtain the fixed point: it produces a sequence $(v_k)_{k\ge 0} \subset \mathcal{V}$ given by $v_{k+1} = Tv_k$, $k \ge 0$, and $(v_k)_{k\ge 0}$ converges to $v^*$ geometrically in $\|\cdot\|_\infty$.
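Classical value iteration, and the sample-based variant in which the expectation is replaced by an average over sampled next states, can be sketched on a small synthetic MDP. The instance below (4 states, 2 actions, random rewards and transitions, and the sample counts) is an illustrative assumption, not one of the dissertation's benchmarks.

```python
import random

def make_mdp(num_s=4, num_a=2, gamma=0.9, seed=1):
    """Random finite MDP: rewards r[s][a] in [0, 1] and transition rows P[s][a]."""
    rng = random.Random(seed)
    r = [[rng.uniform(0, 1) for _ in range(num_a)] for _ in range(num_s)]
    P = []
    for s in range(num_s):
        row = []
        for a in range(num_a):
            w = [rng.uniform(0, 1) for _ in range(num_s)]
            z = sum(w)
            row.append([x / z for x in w])
        P.append(row)
    return r, P, gamma

def T(v, r, P, gamma):
    """Exact Bellman operator: [Tv](s) = max_a { r(s,a) + gamma * sum_j P(j|s,a) v(j) }."""
    return [max(r[s][a] + gamma * sum(p * vj for p, vj in zip(P[s][a], v))
                for a in range(len(r[s]))) for s in range(len(v))]

def hat_T(v, r, P, gamma, n, rng):
    """Empirical Bellman operator: the expectation is replaced by an
    average over n sampled next states y ~ P(.|s,a)."""
    out = []
    for s in range(len(v)):
        vals = []
        for a in range(len(r[s])):
            ys = rng.choices(range(len(v)), weights=P[s][a], k=n)
            vals.append(r[s][a] + gamma * sum(v[y] for y in ys) / n)
        out.append(max(vals))
    return out

r, P, gamma = make_mdp()
# Exact value iteration converges to (numerically) the fixed point v*.
v_star = [0.0] * 4
for _ in range(300):
    v_star = T(v_star, r, P, gamma)
# Empirical value iteration with a large mini-batch stays in a small
# neighborhood of v*, as Theorem 2 predicts.
rng = random.Random(2)
v = [0.0] * 4
for _ in range(200):
    v = hat_T(v, r, P, gamma, n=1000, rng=rng)
err = max(abs(a - b) for a, b in zip(v, v_star))
```

The empirical iterates fluctuate (the operator is random), but the sup-norm gap `err` to the exact fixed point remains small for large mini-batch sizes.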
Now consider the empirical Bellman operator
$$[\hat{T}_N v](s) := \max_{a \in A}\left\{ r(s,a) + \frac{\gamma}{N} \sum_{j=1}^{N} v(y_j) \right\}, \quad \forall s \in S,$$
where $y_j \sim P(\cdot|s,a)$ are sampled next states. The iterates of this operator are bounded by $v_{\max}$. Moreover, as $N \to \infty$, $\hat{T}_N \to T$ almost surely by the law of large numbers. Hence all the assumptions of Theorem 2 are satisfied, and one can define empirical value iteration via the iterates $\hat{v}_k^N = \hat{T}_N \hat{v}_{k-1}^N$, which converge to a neighborhood of $v^*$ with high probability by Theorem 2.

2.4 Conclusions

In this chapter, we defined probabilistic contraction operators on Banach spaces. Such random operators arise when one wants to approximate a contraction operator, and probabilistic fixed points are defined to understand their convergence. The theoretical analysis is via a stochastic dominance argument, in which we construct an irreducible, finite-state Markov chain that stochastically dominates the error process; the steady-state distribution and a mixing-time analysis are then needed to conclude the theorem. We presented one application to MDPs, but this random operator theory can also be applied to analyze stochastic optimization algorithms.

Chapter 3: MDPs with continuous state space: Discounted reward criteria

3.1 Introduction

There is a large body of well-known literature on reinforcement learning and approximate dynamic programming for continuous state space MDPs; we discuss the most directly related work. In [58], a sampling-based state space aggregation scheme combined with a sample average approximation of the expectation in the Bellman operator is proposed. Under some regularity assumptions, the approximate value function can be computed at any state and an estimate of the expected error is given, but the algorithm seems to suffer from poor numerical performance. A linear programming-based constraint-sampling approach was introduced in [15].
Finite-sample error guarantees with respect to the constraint-sampling distribution are provided, but the method suffers from feasibility issues. The closest work to ours is [48], which does function fitting with a given basis and performs 'empirical' value iteration in each step. Unfortunately, it is not a universal method, as the approximation quality depends on the function basis picked. Other works worth noting are [50], which discusses kernel-based value iteration and the bias-variance tradeoff, and [21], which proposed a kernel-based algorithm with random sampling of the state and action spaces and proved asymptotic convergence. Other related works worth mentioning are [66, 47] (approximate value iteration), [14, 13] (the LP approach to approximate DP) and [46, 36, 34] (approximate policy iteration). Recent applications stress policy gradient methods [65, 51] and deep learning-based function approximation [45], for which theoretical performance guarantees for general problems are not available. The method presented in this chapter may be seen as another alternative.

This chapter is inspired by the 'random function' approach that uses randomization to (nearly) solve otherwise intractable problems (see e.g., [56, 57]) and the 'empirical' approach that reduces the computational complexity of working with expectations [24]. We propose two new algorithms. For the first, parametric, approach, we pick a parametric function family; in each iteration, a number of functions are picked randomly for function fitting by sampling the parameters. A preliminary version of this for $\ell_2$ function fitting appeared in [27]. For the second, non-parametric, approach, we pick an RKHS for approximation. Both function spaces are dense in the space of continuous functions. In each iteration, we sample a few states from the state space, and empirical value learning (EVL) is then performed on these states.
Each step of EVL involves approximating the Bellman operator with an empirical (random) Bellman operator, plugging in a sample average approximation from simulation for the expectation. This is akin to doing stochastic approximation with step size one. We employ a probabilistic convergence analysis technique for iterated random operators based on stochastic dominance, which we developed in [24]. This method is general in the sense that we can handle not only various norms, but also various random contractive operators.

The main contribution of this chapter is the development of randomized function approximation-based (offline) dynamic programming algorithms that are universally applicable (i.e., they do not require an appropriate choice of basis functions for good approximation). A secondary contribution is the further development of the random operator framework for convergence analysis in the $L_p$-norm, which also yields finite-time sample complexity bounds.

The chapter is organized as follows. Section 3.2 presents preliminaries, including the continuous state space MDP model and the empirical dynamic programming framework for finite-state MDPs introduced in [24]. Section 3.3 presents two empirical value iteration algorithms: first, a randomized parametric function fitting method, and second, a non-parametric randomized function fitting in an RKHS. We also provide statements of the main theorems giving non-asymptotic error guarantees. Section 3.3.3 presents a unified analysis of the two algorithms in a random operator framework. Numerical results are reported in Section 3.3.5. Supplemental proofs are relegated to the appendix.

3.2 Preliminaries

Consider a discrete-time discounted MDP given by the 5-tuple $(S, A, Q, c, \gamma)$. The state space $S$ is a compact subset of $\mathbb{R}^d$ with the Euclidean norm, with corresponding Borel $\sigma$-algebra $\mathcal{B}(S)$. Let $F(S)$ be the space of all $\mathcal{B}(S)$-measurable bounded functions $f : S \to \mathbb{R}$ in the supremum norm $\|f\|_\infty := \sup_{s \in S} |f(s)|$.
Moreover, let $M(S)$ be the space of all probability distributions over $S$, and define the $L_p$ norm as $\|f\|_{p,\mu}^p := \int_S |f(s)|^p \,\mu(ds)$ for $p \in [1,\infty)$ and a given $\mu \in M(S)$. We assume that the action space $A$ is finite. The transition law $Q$ governs the system evolution: for $B \in \mathcal{B}(S)$, $Q(B|s,a)$ is the probability of next visiting the set $B$ given that action $a \in A$ is chosen in state $s \in S$. The cost function $c: S \times A \to \mathbb{R}$ is a bounded measurable function of state-action pairs. Finally, $\gamma \in (0,1)$ is the discount factor.

We denote by $\Pi$ the class of stationary deterministic Markov policies: mappings $\pi: S \to A$ which depend on the history only through the current state. For a given state $s \in S$, $\pi(s) \in A$ is the action chosen in state $s$ under the policy $\pi$. The state and action at time $t$ are denoted $s_t$ and $a_t$, respectively. Any policy $\pi \in \Pi$ and initial state $s \in S$ determine a probability measure $P^\pi_s$ and a stochastic process $\{(s_t, a_t),\ t \ge 0\}$ defined on the canonical measurable space of trajectories of state-action pairs. The expectation operator with respect to $P^\pi_s$ is denoted $E^\pi_s[\cdot]$.

We assume that the cost function $c$ satisfies $|c(s,a)| \le c_{\max} < \infty$ for all $(s,a) \in S \times A$. Under this assumption, $\|v_\pi\|_\infty \le v_{\max} := c_{\max}/(1-\gamma)$, where $v_\pi$ is the value function for policy $\pi \in \Pi$, defined as $v_\pi(s) = E^\pi_s\left[\sum_{t=0}^\infty \gamma^t c(s_t, a_t)\right]$ for all $s \in S$. For later use, we define $F(S; v_{\max})$ to be the space of all functions $f \in F(S)$ such that $\|f\|_\infty \le v_{\max}$. The optimal value function is $v^*(s) := \inf_{\pi\in\Pi} E^\pi_s\left[\sum_{t=0}^\infty \gamma^t c(s_t, a_t)\right]$ for all $s \in S$. To characterize the optimal value function, we define the Bellman operator $T: F(S) \to F(S)$ via
$$[Tv](s) := \min_{a\in A}\left\{ c(s,a) + \gamma\, E_{X\sim Q(\cdot|s,a)}[v(X)] \right\}, \quad \forall s \in S.$$
It is well known that the optimal value function $v^*$ is a fixed point of $T$, i.e., $Tv^* = v^*$ [55, Theorem 6.2.5]. Classical value iteration is based on iterating $T$ to obtain this fixed point: it produces a sequence $(v_k)_{k\ge 0} \subset F(S)$ given by $v_{k+1} = Tv_k$, $k \ge 0$.
Also, we know that $(v_k)_{k\ge 0}$ converges to $v^*$ geometrically in $\|\cdot\|_\infty$.

We are interested in approximating the optimal value function $v^*$ within a tractable class of approximating functions $\mathcal{F} \subset F(S)$. We have the following definitions, which we use to measure the approximation power of $\mathcal{F}$ with respect to $T$. We define $d_{p,\mu}(\mathcal{G},\mathcal{F}) := \sup_{g\in\mathcal{G}} \inf_{f\in\mathcal{F}} \|f - g\|_{p,\mu}$ to be the distance between two function classes; then $d_{p,\mu}(T\mathcal{F},\mathcal{F})$ is the inherent $L_p$ Bellman error of the function class $\mathcal{F}$. Similarly, defining $d_\infty(\mathcal{G},\mathcal{F}) := \sup_{g\in\mathcal{G}} \inf_{f\in\mathcal{F}} \|f - g\|_\infty$ gives $d_\infty(T\mathcal{F},\mathcal{F})$ as the inherent $L_\infty$ Bellman error of an approximating class $\mathcal{F}$. We often compare $\mathcal{F}$ to the Lipschitz continuous functions $\mathrm{Lip}(L)$, defined as $\{f \in F(S) : |f(s) - f(s')| \le L\|s - s'\|,\ \forall s, s' \in S\}$. In our case, we say that an approximation class $\mathcal{F}$ is universal if $d_\infty(\mathrm{Lip}(L),\mathcal{F}) = 0$ for all $L \ge 0$. Note that on a compact state space $S$, universality in the supremum norm implies universality in the $L_1$ and $L_2$ norms as well.

One of the difficulties of dynamic programming algorithms like value iteration above is that each iteration of the Bellman operator involves the computation of an expectation, which may be expensive. Thus, in [24], we proposed replacing the Bellman operator with an empirical (or random) Bellman operator,
$$[\hat T_n v](s) := \min_{a\in A}\left\{ c(s,a) + \frac{\gamma}{n} \sum_{i=1}^n v(X_i) \right\},$$
where the $X_i$ are samples of the next state from $Q(\cdot|s,a)$, which can be obtained from simulation. We can then iterate the empirical Bellman operator, $v_{k+1} = \hat T_n v_k$ for all $k \ge 0$, an algorithm we called Empirical Value Iteration (EVL). The sequence of iterates $\{v_k\}$ is a random process. Since $T$ is a contractive operator, its iterates converge to its fixed point $v^*$. The random operator $\hat T_n$ may be expected to inherit the contractive property in a probabilistic sense, with its iterates converging to some sort of probabilistic fixed point. We introduce $(\epsilon,\delta)$ versions of two such notions introduced in [24].

Definition 4.
A function $\hat v: S \to \mathbb{R}$ is an $(\epsilon,\delta)$-strong probabilistic fixed point of a sequence of random operators $\{\hat T_n\}$ if there exists an $N$ such that for all $n > N$,
$$P\left(\|\hat T_n \hat v - \hat v\| > \epsilon\right) < \delta.$$
It is called a strong probabilistic fixed point if the above holds for every positive $\epsilon$ and $\delta$.

Definition 5. A function $\hat v: S \to \mathbb{R}$ is an $(\epsilon,\delta)$-weak probabilistic fixed point of a sequence of random operators $\{\hat T_n\}$ if there exist $N$ and $K$ such that for all $n > N$ and all $k > K$,
$$P\left(\|\hat T_n^k v_0 - \hat v\| > \epsilon\right) < \delta, \quad \forall v_0 \in F(S).$$
It is called a weak probabilistic fixed point if the above holds for every positive $\epsilon$ and $\delta$.

Note that stochastic iterative algorithms such as EVL often find the weak probabilistic fixed point of $\{\hat T_n\}$, whereas what we are looking for is $v^*$, the fixed point of $T$. In [24], it was shown that, asymptotically, the weak probabilistic fixed points of $\{\hat T_n\}$ coincide with its strong probabilistic fixed points, which in turn coincide with the fixed point of $T$, under certain fairly weak assumptions and a natural relationship between $T$ and $\{\hat T_n\}$:
$$\lim_{n\to\infty} P\left(\|\hat T_n v - Tv\| > \epsilon\right) = 0, \quad \forall v \in F(S).$$
This implies that stochastic iterative algorithms such as EVL will find approximate fixed points of $T$ with high probability.

3.3 Offline Value Learning

When the state space $S$ is very large, or even uncountable, exact dynamic programming methods are not practical, or even feasible. Instead, one must use a variety of approximation methods. In particular, function approximation (fitting the value function with a fixed function basis) is a common technique. The idea is to sample a finite set of states from $S$, approximate the Bellman update at these states, and then extend to the rest of $S$ through function fitting, similar to [48]. Furthermore, the expectation in the Bellman operator is also approximated by taking a number of samples of the next state.
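As a concrete illustration of the sampled Bellman update just described, the following minimal sketch replaces the expectation in the Bellman operator with a sample average over a few next-state draws. The function names, the toy two-state MDP in the usage note, and all numerical values are illustrative assumptions, not the dissertation's code.

```python
import numpy as np

def empirical_bellman(v, states, actions, cost, sample_next, n_samples, gamma, rng):
    """One application of the empirical Bellman operator: E[v(X)] is
    replaced by a sample average over n_samples simulated next states."""
    v_new = {}
    for s in states:
        q_vals = []
        for a in actions:
            xs = [sample_next(s, a, rng) for _ in range(n_samples)]
            q_vals.append(cost(s, a) + gamma * np.mean([v[x] for x in xs]))
        # costs are minimized over actions, as in the Bellman operator above
        v_new[s] = min(q_vals)
    return v_new
```

As a quick sanity check, on a deterministic two-state chain with cost $c(s,a) = s + a$, next state equal to the chosen action, and $\gamma = 0.5$, iterating this operator from $v \equiv 0$ converges to $v^*(0) = 0$, $v^*(1) = 1$.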
There are two main difficulties with this approach. First, the function fitting depends on the function basis chosen, making the results problem-dependent. Second, with a large basis (needed for good approximation), function fitting can be computationally expensive. In this chapter, we aim to address these issues by first picking universal approximating function spaces, and then using randomization to pick a smaller basis, thereby reducing the computational burden of the function fitting step. We consider two functional families: a parametric family $\mathcal{F}(\Theta)$ parameterized over a parameter space $\Theta$, and a non-parametric regularized RKHS.

By $\mu \in M(S)$ we denote a probability distribution from which to sample states in $S$, and by $\mathcal{F} \subset F(S; v_{\max})$ we denote a functional family in which to do value function approximation. We denote by $(v_k)_{k\ge 0} \subset F(S; v_{\max})$ the iterates of the value functions produced by an algorithm, and a sample of size $N \ge 1$ from $S$ is denoted $s_{1:N} = (s_1,\ldots,s_N)$. The empirical $p$-norm of $f$ is defined as $\|f\|_{p,\hat\mu}^p := \frac{1}{N}\sum_{n=1}^N |f(s_n)|^p$ for $p \in [1,\infty)$, and as $\|f\|_{\infty,\hat\mu} := \sup_{n=1,\ldots,N} |f(s_n)|$ for $p = \infty$, where $\hat\mu$ is the empirical measure corresponding to the samples $s_{1:N}$. We make the following technical assumptions for the rest of the chapter, similar to those made in [48].

Assumption 1. (i) For all $(s,a) \in S \times A$, $Q(\cdot|s,a)$ is absolutely continuous with respect to $\mu$ and $C_\mu := \sup_{(s,a)\in S\times A} \|dQ(\cdot|s,a)/d\mu\|_\infty < \infty$.
(ii) Given any sequence of policies $\{\pi_m\}_{m\ge 1}$, the future state distribution $\rho Q^{\pi_1}\cdots Q^{\pi_m}$ is absolutely continuous with respect to $\mu$, $c_{\rho,\mu}(m) := \sup_{\pi_1,\ldots,\pi_m} \|d(\rho Q^{\pi_1}\cdots Q^{\pi_m})/d\mu\|_\infty < \infty$, and $C_{\rho,\mu} := \sum_{m\ge 0} \gamma^m c_{\rho,\mu}(m) < \infty$.

The above assumptions are conditions on the transition probabilities, the first being a sufficient condition for the second. Here $\rho$ can be regarded as an "importance" distribution on $S$, possibly different from the distribution $\mu$ on $S$ that is used to sample states.
Assumption 1 is essentially a regularity condition on the MDP: it ensures that the MDP cannot make arbitrary transitions with high probability with respect to the initial state distribution $\mu$. $C_{\rho,\mu}$ is called the discounted-average concentrability coefficient of the future-state distributions in [48]. Note that the assumption is satisfied when $\mu$ is the Lebesgue measure on $S$ and the transition kernel has a bounded density with respect to $\mu$.

3.3.1 Random Parametric Basis Function (RPBF) Approximation

We introduce an empirical value iteration algorithm with function approximation using random parametrized basis functions (EVL+RPBF). It requires a parametric family $\mathcal{F}$ built from a set of parameters $\Theta$ with probability distribution $\nu$, and a feature function $\phi: S \times \Theta \to \mathbb{R}$ (depending on both states and parameters) with the assumption that $\sup_{(s,\theta)\in S\times\Theta} |\phi(s;\theta)| \le 1$. This can easily be met in practice by scaling $\phi$ whenever $S$ and $\Theta$ are both compact and $\phi$ is continuous in $(s,\theta)$. Let $\alpha: \Theta \to \mathbb{R}$ be a weight function and define
$$\mathcal{F}(\Theta) := \left\{ f(\cdot) = \int_\Theta \phi(\cdot;\theta)\,\alpha(\theta)\,d\theta \,:\, |\alpha(\theta)| \le C\,\nu(\theta),\ \forall \theta \in \Theta \right\}.$$
We note that the condition $|\alpha(\theta)| \le C\nu(\theta)$ for all $\theta \in \Theta$ is equivalent to requiring $\|\alpha\|_{\infty,\nu} := \sup_{\theta\in\Theta} |\alpha(\theta)/\nu(\theta)| \le C$, where $\|\alpha\|_{\infty,\nu}$ is the $\nu$-weighted supremum norm of $\alpha$ and $C$ is a constant. The function space $\mathcal{F}(\Theta)$ may be chosen to have the 'universal' function approximation property, in the sense that any Lipschitz continuous function can be approximated arbitrarily closely in this space, as shown in [56]. By [56, Theorem 2], many such choices of $\mathcal{F}(\Theta)$ are possible and are developed in [56, Section 5]. For example, $\mathcal{F}(\Theta)$ is universal in the following two cases:
• $\phi(s;\theta) = \cos(\langle \omega, s\rangle + b)$, where $\theta = (\omega, b) \in \mathbb{R}^{d+1}$, and $\nu(\theta)$ is given by $\omega \sim \mathrm{Normal}(0, 2\gamma I)$ and $b \sim \mathrm{Uniform}[-\pi,\pi]$;
• $\phi(s;\theta) = \mathrm{sign}(s_k - t)$, where $\theta = (t,k) \in \mathbb{R} \times \{1,\ldots,d\}$, and $\nu(\theta)$ is given by $k \sim \mathrm{Uniform}\{1,\ldots,d\}$ and $t \sim \mathrm{Uniform}[-a,a]$.
In this approach we have a parametric function family $\mathcal{F}(\Theta)$, but instead of optimizing over parameters in $\Theta$, we randomly sample them first and then do function fitting, which involves optimizing over finite weighted combinations $\sum_{j=1}^J \alpha_j \phi(\cdot;\theta_j)$. Optimizing over $\theta_{1:J} = (\theta_1,\ldots,\theta_J)$ and $\alpha_{1:J} = (\alpha_1,\ldots,\alpha_J)$ jointly is a non-convex problem. Hence, we first randomize over $\theta_{1:J}$ and then optimize over $\alpha_{1:J}$, as in [57], to bypass the non-convexity inherent in optimizing over $\theta_{1:J}$ and $\alpha_{1:J}$ simultaneously. This approach allows us to deploy rich parametric families without much additional computational cost. Once we draw a random sample $\{\theta_j\}_{j=1}^J$ from $\Theta$ according to $\nu$, we obtain a random function space:
$$\hat{\mathcal{F}}_{\theta_{1:J}} := \left\{ f(\cdot) = \sum_{j=1}^J \alpha_j \phi(\cdot;\theta_j) \,:\, \|(\alpha_1,\ldots,\alpha_J)\|_\infty \le C/J \right\}.$$
Step 1 of such an algorithm (Algorithm 1) involves sampling states $s_{1:N}$ over which to do value iteration, and sampling parameters $\theta_{1:J}$ to pick the basis functions $\phi(\cdot;\theta)$ used for function fitting. Step 2 performs an empirical value iteration over the states $s_{1:N}$ by sampling next states $(X_m^{s_n,a})_{m=1}^M$ according to the transition kernel $Q$ and using the current iterate of the value function $v_k$. Note that fresh (i.i.d.) samples of the next state are regenerated in each iteration. Step 3 finds the best fit to $\tilde v_k$, the iterate from Step 2, within $\hat{\mathcal{F}}_{\theta_{1:J}}$, wherein the randomly sampled parameters $\theta_{1:J}$ specify the basis functions and the weights $\alpha_{1:J}$ are optimized, which is a convex optimization problem.

Algorithm 1 EVL with random parameterized basis functions (EVL+RPBF)
Input: probability distributions $\mu$ on $S$ and $\nu$ on $\Theta$; sample sizes $N \ge 1$, $M \ge 1$, $J \ge 1$; initial seed $v_0$.
For $k = 1,\ldots,K$:
1. Sample $(s_n)_{n=1}^N \sim \mu^N$ and $(\theta_j)_{j=1}^J \sim \nu^J$.
2. Compute
$$\tilde v_k(s_n) = \min_{a\in A}\left\{ c(s_n,a) + \frac{\gamma}{M} \sum_{m=1}^M v_k(X_m^{s_n,a}) \right\},$$
where $X_m^{s_n,a} \sim Q(\cdot|s_n,a)$, $m = 1,\ldots,M$, are i.i.d.
3.
α k = arg min α 1 N P N n=1 P J j=1 α j φ (s n ; θ j )− ˜ v (s n ) 2 s.t.k (α 1 ,..., α J )k ∞ ≤C/J. v k+1 (s) = P J j=1 α k j φ(s;θ j ). 4. Increment k←k + 1 and return to Step 1. We note that Step 3 of the algorithm can be replaced by another method for function fitting (as we do in the next subsection). The above algorithm differs from Fitted Value Iteration (FVI) algorithm of [48] in how it does function fitting. FVI does function fitting with a deterministic and given set of basis functions which limits its universality while we do function fitting in a much larger space which has the universal function approximation property but are able to reduce computational complexity by exploiting randomization. 24 In [48, Section 7], it is shown that if the transition kernel and cost are smooth such that there exist L Q and L c for which kQ (·|s, a)−Q (·|s 0 , a)k TV ≤L Q ks−s 0 k 2 (3.1) and |c (s, a)−c (s 0 , a)|≤L c ks−s 0 k 2 (3.2) hold for all s, s 0 ∈ S and a∈ A, then the Bellman operator T maps bounded functions to Lipschitz continuous functions. In particular, if v is uniformly bounded by v max then Tv is (L c +γv max L Q )−Lipschitz continuous. Subsequently, the inherentL ∞ Bellman error satisfies d ∞ (TF,F)≤d ∞ (Lip (L),F) since TF⊂ Lip (L). So, it only remains to choose anF (Θ) that is dense in Lip (L) in the supremum norm, for which many examples exist. We now provide non-asymptotic sample complexity bounds to establish that Algorithm 1 yields an approximately optimal value function with high probability. We provide guarantees for both theL 1 andL 2 metrics on the error. 25 Denote N 2 (ε,δ 0 ) = 2 7 5 2 ¯ v 4 max log " 40e (J 2 + 1) δ 10e ¯ v 2 max J # , M 2 (ε,δ 0 ) = ¯ v 2 max 2 ! log " 10N 2 |A| δ 0 # , J 2 (ε,δ 0 ) = 5C ε 1 + v u u u u t 2 log 5 δ 0 2 , and K ∗ 2 = 2 ln C 1/2 ρ,μ ε − ln (2v max ) ln γ , where ¯ v max = v max /ε. Set δ 0 := 1− (1−δ/2) 1/(K ∗ 2 −1) . Then, we have the following sample complexity bound on Algorithm 1 withL 2 error. 
We note thatL 2,μ (S) is a Hilbert space and that many powerful function approximation results exist for this setting because of the favorable properties of a Hilbert space. Theorem 3. Given an ε> 0, and a δ∈ (0, 1), choose J≥J 2 (ε,δ 0 ), N≥N 2 (ε,δ 0 ), M≥ M 2 (ε,δ 0 ), Then, for K≥ log (4/ (δμ ∗ (δ;K ∗ 2 ))), we have kv K −v ∗ k 2,ρ ≤ 2˜ γ 1/2 C 1/2 ρ,μ (d 2,μ (TF (Θ),F (Θ)) + 2ε) with probability at least 1−δ. Remarks. 1. That is, if we choose enough samples N 2 of the states, enough samples M 2 of the next state, and enough random samples J 2 of the parameter θ, and then for large enough number of iterations K 2 , theL 2 error in the value function is determined by the inherent Bellman error of the function classF (Θ). 2. For the function familiesF(Θ) 26 discussed earlier (RPBF), the inherent Bellman error,d 2,μ (TF (Θ),F (Θ)) = 0 indeed, and so the value function will have smallL 2 error with high probability. 3. Note that the sample complexity bounds are independent of the state space dimension though the computational complexity of sampling from the state space does indeed depend on that dimension. Next we give a similar guarantee forL 1 error for Algorithm 1 by considering approximation inL 1,μ (S). Denote N 1 (ε,δ 0 ) = 2 7 5 2 ¯ v 2 max log " 40e (J 1 + 1) δ (10e ¯ v max ) J # , M 1 (ε,δ 0 ) = ¯ v 2 max 2 ! log " 10N 1 |A| δ 0 # , J 1 (ε,δ 0 ) = 5C ε 1 + v u u u u t 2 log 5 δ 0 2 , K ∗ 1 = & ln (C ρ,μ ε)− ln (2v max ) ln γ ' , and μ ∗ (p; K ∗ ) = (1−p)p (K ∗ −1) , where C is the same constant that appears in the definition ofF (Θ) (see [57]) and ¯ v max = v max /ε. Set δ 0 := 1− (1−δ/2) 1/(K ∗ 1 −1) . Theorem 4. Given an ε> 0, and a δ∈ (0, 1), choose J≥J 1 (ε,δ 0 ), N≥N 1 (ε,δ 0 ), M≥ M 1 (ε,δ 0 ), Then, for K≥ log (4/ (δμ ∗ (δ;K ∗ 1 ))), we have kv K −v ∗ k 1,ρ ≤ 2C ρ,μ (d 1,μ (TF (Θ),F (Θ)) + 2ε) 27 with probability at least 1−δ. Remarks. 1. 
Again, note that the above result implies that the RBPF function family F(Θ) has inherent Bellman error d 1,μ (TF (Θ),F (Θ)) = 0, so that for enough samples N 1 of the states, enough samples M 1 of the next state, and enough random samples J 1 of the parameterθ, and then for large enough number of iterationsK 1 , the value function will have smallL 1 error with high probability. 2. As above, note that there is no dependence on state space dimension in the sample complexity bounds though computational complexity of sampling states from the state space indeed depends on it. 3.3.2 Non-parametric Function Approximation in RKHS We now consider non-parametric function approximation combined with EVL. We employ a Reproducing Kernel Hilbert Space (RKHS) for function approximation since for suitably chosen kernels, it is dense in the space of continuous functions and hence has a ‘universal’ function approximation property. In the RKHS setting, we can obtain guarantees directly with respect to the supremum norm. We will consider a regularized RKHS setting with a continuous, symmetric and positive semidefinite kernel K :S×S→R and a regularization constant λ > 0. The RKHS space, H K is defined to be the closure of the linear span of{K(s,·)} s∈S endowed with an inner prod- ucth·,·i H K . The inner producth·,·i H K forH K is defined such thathK (x,·), K (y,·)i H K = K (x, y) for all x, y∈S, i.e.,h P i α i K (x i ,·), P j β j K (y j ,·)i H K = P i,j α i β j K (x i , y j ). Sub- sequently, the inner product satisfies the reproducing property: hK (s,·), fi H K = f (s) for all s ∈ S and f ∈ H K . The corresponding RKHS norm is defined in terms of the 28 inner productkfk H K := q hf, fi H K . We assume that our kernel K is bounded so that κ := sup s∈S q K (s, s)<∞. To find the best fit f ∈ H K to a function with data{(s n , ˜ v (s n ))} N n=1 , we solve the regularized least squares problem: min f∈H K ( 1 N N X n=1 (f (s n )− ˜ v (s n )) 2 +λkfk 2 H K ) . 
(3.3) This is a convex optimization problem (the squared norm is convex) and has a closed-form solution by the Representer Theorem. In particular, the optimal solution is of the form $\hat f(s) = \sum_{n=1}^N \alpha_n K(s_n, s)$, where the weights $\alpha_{1:N} = (\alpha_1,\ldots,\alpha_N)$ solve the linear system
$$\left( [K(s_i,s_j)]_{i,j=1}^N + \lambda N I \right) (\alpha_n)_{n=1}^N = (\tilde v(s_n))_{n=1}^N. \quad (3.4)$$
This yields the EVL algorithm with randomized function fitting in a regularized RKHS (EVL+RKHS), displayed as Algorithm 2.

Note that the optimization problem in Step 3 of Algorithm 2 is analogous to that in Step 3 of Algorithm 1, which finds an approximate best fit within the finite-dimensional space $\hat{\mathcal{F}}_{\theta_{1:J}}$ rather than the entire space $\mathcal{F}(\Theta)$, while Problem (3.3) in Algorithm 2 optimizes over the entire space $\mathcal{H}_K$. This difference is reconciled by the Representer Theorem, since it states that optimization over $\mathcal{H}_K$ in Problem (3.3) is equivalent to optimization over the finite-dimensional space spanned by $\{K(s_n,\cdot) : n = 1,\ldots,N\}$. Note that the regularization term $\lambda\|f\|_{\mathcal{H}_K}^2$ is a requirement of the Representer Theorem.

Algorithm 2 EVL with regularized RKHS (EVL+RKHS)
Input: probability distribution $\mu$ on $S$; sample sizes $N \ge 1$, $M \ge 1$; penalty $\lambda$; initial seed $v_0$; counter $k = 0$.
For $k = 1,\ldots,K$:
1. Sample $\{s_n\}_{n=1}^N \sim \mu$.
2. Compute
$$\tilde v_k(s_n) = \min_{a\in A}\left\{ c(s_n,a) + \frac{\gamma}{M} \sum_{m=1}^M v_k(X_m^{s_n,a}) \right\},$$
where $\{X_m^{s_n,a}\}_{m=1}^M \sim Q(\cdot|s_n,a)$ are i.i.d.
3. $v_{k+1}(\cdot)$ is given by
$$\arg\min_{f\in\mathcal{H}_K}\ \frac{1}{N}\sum_{n=1}^N \left(f(s_n) - \tilde v(s_n)\right)^2 + \lambda \|f\|_{\mathcal{H}_K}^2.$$
4. Increment $k \leftarrow k+1$ and return to Step 1.

We define the regression function $f_M: S \to \mathbb{R}$ via
$$f_M(s) := E\left[ \min_{a\in A}\left\{ c(s,a) + \frac{\gamma}{M}\sum_{m=1}^M v(X_m^{s,a}) \right\} \right], \quad \forall s \in S;$$
it is the expected value of our empirical estimator of $Tv$. As expected, $f_M \to Tv$ as $M \to \infty$. We note that $f_M$ is not necessarily equal to $Tv$, by Jensen's inequality. We require the following assumption on $f_M$ to continue.

Assumption 2.
For every M≥ 1, f M (s) = R S K (s, y)α (y)μ (dy) for some α∈L 2,μ (S). Regression functions play a key role in statistical learning theory, Assumption 2 states that the regression function lies in the span of the kernel K. It is satisfied whenever K is a universal kernel. Some examples of universal kernels follow. Additionally, whenH K is dense in the space of Lipschitz functions, then the inherent Bellman error is zero. For example, K (s, s 0 ) = exp (−γks−s 0 k 2 ), K (s, s 0 ) = 1− 1 a ks−s 0 k 1 , and K (s, s 0 ) = exp (γks−s 0 k 1 ) are all universal kernels. Denote N ∞ (ε,δ 0 ) = 4C K κ ε (1−γ) ! 6 log 4 δ 0 2 M ∞ (ε) = 160v 2 max (ε (1−γ)) 2 log 2|A|γ (8v max −ε (1−γ)) ε (1−γ) (2−γ) ! K ∗ ∞ = & ln (ε)− ln (4v max ) ln γ ' where C K is a constant independent of the dimension of S (see [62] for the details on how C K depends on the kernel K) and set δ 0 = 1− (1−δ/2) 1/(K ∗ ∞ −1) . Theorem 5. Suppose Assumption 2 holds. Given any ε > 0 and δ ∈ (0, 1), choose an N≥N ∞ (ε,δ 0 ) and an M≥M ∞ (ε). Then, for any K≥ log (4/ (δμ ∗ (δ;K ∗ ∞ ))), kv K −v ∗ k ∞ ≤ε with probability at least 1−δ. Note that we provide guarantees onL 1 andL 2 error (can be generalized toL p ) with the RPBF method and forL ∞ error with the RKHS-based randomized function fitting method. 31 Getting guarantees for theL p error with the RKHS method has proved quite difficult, as has bounds on theL ∞ error with the RBPF method. 3.3.3 Convergence Analysis We will analyze Algorithms 1 and 2 in terms of random operators since this framework is general enough to encompass many such algorithms. The reader can see that Step 2 of both algorithms involves iteration of the empirical Bellman operator while Step 3 involves a randomized function fitting step which is done differently and in different spaces in both algorithms. We use random operator notation to write these algorithms in a compact way, and then derive a clean and to a large-extent unified convergence analysis. 
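For concreteness, the RKHS fitting step of Algorithm 2, i.e., the regularized least-squares problem (3.3) solved through the Representer Theorem linear system (3.4), can be sketched as follows. This is a minimal illustration assuming a Gaussian kernel; the function names and numerical values are illustrative, not the dissertation's code.

```python
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    """Pairwise Gaussian kernel K(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def rkhs_fit(S, targets, lam, sigma=1.0):
    """Regularized RKHS regression: solve (K + lambda*N*I) alpha = targets,
    as dictated by the Representer Theorem, and return the fitted function
    f(.) = sum_n alpha_n K(s_n, .)."""
    N = len(S)
    K = gaussian_kernel(S, S, sigma)
    alpha = np.linalg.solve(K + lam * N * np.eye(N), targets)
    return lambda X: gaussian_kernel(X, S, sigma) @ alpha
```

With a small penalty $\lambda$, the fitted function nearly interpolates smooth training targets, which is the behavior the one-step error bounds quantify.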
The key idea is to use the notion of stochastic dominance to bound the error process with an easy to analyze “dominating” Markov chain. Then, we can infer the solution quality of our algorithms via the probability distribution of the dominating Markov chain. This analysis idea refines (and in fact, simplifies) the idea we introduced in [24] for MDPs with finite state and action spaces (where there is no function fitting) in the supremum norm. In this chapter, we develop the technique further, give a stronger convergence rate, account for randomized function approximation, and also generalize the technique toL p norms. We introduce a probability space (Ω,B (Ω),P) on which to define random operators, where Ω is a sample space with elements denoted ω∈ Ω,B (Ω) is the Borel σ−algebra on Ω, and P is a probability distribution on (Ω,B (Ω)). A random operator is an operator- valued random variable on (Ω,B (Ω), P ). We define the first random operator onF (S) as 32 b T (v) = (s n , ˜ v (s n )) N n=1 where (s n ) N n=1 is chosen fromS according to a distributionμ∈M (S) and ˜ v (s n ) = min a∈A ( c (s n , a) + γ M M X m=1 v (X sn,a m ) ) , n = 1,...,N is an approximation of [Tv] (s n ) for all n = 1,...,N. In other words, b T maps fromv∈F (S; v max ) to a randomly generated sample ofN input-output pairs (s n , ˜ v (s n )) N n=1 of the function Tv. Note that b T depends on sample sizes N and M. Next, we have the function reconstruction operator b Π F which maps the data (s n , ˜ v (s n )) N n=1 to an element in F. Note that b Π F is not necessarily deterministic since Algorithms 1 and 2 use randomized function fitting. We can now write both algorithms succinctly as v k+1 = b Gv k := b Π F b Tv k , (3.5) which can be further written in terms of residual error ε k = b Gv k −Tv k as v k+1 = b Gv k =Tv k +ε k . 
(3.6) Iteration of these operators corresponds to repeated samples from (Ω,B(Ω),P), so we define the space of sequences (Ω ∞ ,B(Ω ∞ ),P) where Ω ∞ =× ∞ k=0 Ω with elements denoted ω = (ω k ) k≥0 ,B (Ω ∞ ) =× ∞ k=0 B (Ω), andP is the probability measure on (Ω ∞ ,B (Ω ∞ )) guaranteed by the Kolmogorov extension theorem applied toP. 33 The random sequences (v k ) k≥0 in Algorithms 1 and 2 given by v k+1 = b Π F b T (ω k )v k = b Π F b T (ω k ) b Π F b T (ω k−1 )··· b Π F b T (ω 0 )v 0 , for all k ≥ 0 is a stochastic process defined on (Ω ∞ ,B(Ω ∞ ),P). We now analyze error propagation over the iterations. Let us now bound how the Bellman residual at each iteration of EVL is changing. There have already been some results which address the error propagation both inL ∞ andL p (p≥ 1) norms [47]. After adapting [48, Lemma 3], we obtain the following p-norm error bounds on v K −v ∗ in terms of the errors{ε k } k≥0 . Lemma 6. For anyK≥ 1, andε> 0, supposekε k k p,μ ≤ε for allk = 0, 1,..., K−1, then kv K −v ∗ k p,ρ ≤ 2 1−γ K+1 1−γ !p−1 p h C 1/p ρ,μ ε +γ K/p (2v max ) i . (3.7) whereC ρ,μ is as defined in Assumption 3. Note that Lemma 6 assumes thatkε k k p,μ ≤ε which we will show subsequently that it is true with high probability. The second inequality is for the supremum norm. Lemma 7. For any K≥ 1 and ε> 0, supposekε k k ∞ ≤ε for all k = 0, 1,..., K− 1, then kv K −v ∗ k ∞ ≤ε/ (1−γ) +γ K (2v max ). (3.8) Inequalities (3.7) and (3.8) are the key to analyzing iteration of Equation (3.6). 34 3.3.4 Stochastic dominance We now provide a (unified) convergence analysis for iteration of a sequence of random operators given by (3.5) and (3.6). Later, we will show how it can be applied to Algorithms 1 and 2. We will usek·k to denote a general norm in the following discussion, since our idea applies to all instances of p∈ [1,∞) andp =∞ simultaneously. The magnitude of the error in iteration k≥ 0 is thenkε k k. We make the following key assumption for a general EVL algorithm. 
Assumption 3. For ε> 0, there is a q∈ (0, 1) such that Pr{kε k k≤ε}≥q for all k≥ 0. Assumption 3 states that we can find a lower bound on the probability of the event {kε k k≤ε} that is independent of k and (v k ) k≥0 (but does depend on ε). Equivalently, we are giving a lower bound on the probability of the event n kTv k − b Gv k k≤ε o . This is possible for all of the algorithms that we proposed earlier. In particular, we can control q in Assumption 3 through the sample sizes in each iteration of EVL. Naturally, for a given ε, q increases as the number of samples grows. We first choose ε> 0 and the number of iterations K ∗ for our EVL algorithms to reach a desired accuracy (this choice of K ∗ comes from the inequalities (3.7) and (3.8)). We call iteration k “good” if the errorkε k k is within our desired tolerance ε and “bad” when the error is greater than our desired tolerance. We then construct a stochastic process (X k ) k≥0 on (Ω ∞ ,B (Ω ∞ ),P) with state spaceK :={1, 2,..., K ∗ } such that X k+1 = max{X k − 1, 1}, if iteration k is ”good”, K ∗ , otherwise. 35 The stochastic process (X k ) k≥0 is easier to analyze than (v k ) k≥0 because it is defined on a finite state space, however (X k ) k≥0 is not necessarily a Markov chain. We next construct a “dominating” Markov chain (Y k ) k≥0 to help us analyze the behavior of (X k ) k≥0 . We construct (Y k ) k≥0 on (K ∞ ,B), the canonical measurable space of trajectories onK, soY k :K ∞ →R, and we letQ denote the probability measure of (Y k ) k≥0 on (R ∞ ,B). Since (Y k ) k≥0 will be a Markov chain by construction, the probability measureQ is completely determined by an initial distribution on R and a transition kernel for (Y k ) k≥0 . We always initialize Y 0 =K ∗ , and then construct the transition kernel as follows Y k+1 = max{Y k − 1, 1}, w.p. q, K ∗ , w.p. 1−q, where q is the probability of a “good” iteration with respect to the corresponding norm. 
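A small simulation makes the construction of the dominating chain concrete. The sketch below (function name and parameter values are illustrative) starts $Y_0 = K^*$, moves the chain down by one (floored at 1) with probability $q$ on a "good" iteration, and resets it to $K^*$ otherwise; with $q = 1$ the chain marches down to 1 in $K^* - 1$ steps and stays there, while with $q = 0$ it stays stuck at $K^*$.

```python
import numpy as np

def simulate_dominating_chain(K_star, q, K, rng):
    """Simulate the dominating Markov chain (Y_k): Y_0 = K*; a 'good'
    iteration (probability q) decrements the state (floored at 1),
    a 'bad' one (probability 1 - q) resets it to K*."""
    y = K_star
    path = [y]
    for _ in range(K):
        if rng.random() < q:
            y = max(y - 1, 1)
        else:
            y = K_star
        path.append(y)
    return path
```

Estimating $\Pr\{Y_K = 1\}$ from such sample paths for moderate $q$ gives an empirical feel for how large $q$ (i.e., how many samples per iteration) is needed before the error process is driven into the "good" absorbing region with high probability.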
Note that the (Y k ) k≥0 we introduce here is different and has much smaller state space than the one we introduced in [24] leading to stronger convergence guarantees. We now describe a stochastic dominance relationship between the two stochastic processes (X k ) k≥0 and (Y k ) k≥0 . We will establish that (Y k ) k≥0 is “larger” than (X k ) k≥0 in a stochastic sense. Definition 6. Let X and Y be two real-valued random variables, then X is stochastically dominated by Y , written X≤ st Y , when E [f (X)]≤ E [f (Y )] for all increasing functions f : R→R. Equivalently, X≤ st Y when Pr{X≥θ}≤ Pr{Y≥θ} for all θ in the support of Y . 36 Let{F k } k≥0 be the filtration on (Ω ∞ ,B (Ω ∞ ),P) corresponding to the evolution of information about (X k ) k≥0 , and let [X k+1 |F k ] denote the conditional distribution of X k+1 given the informationF k . We have the following initial results on the relationship between (X k ) k≥0 and (Y k ) k≥0 . The following theorem, our main result for our random operator analysis, establishes the relationship between the stochastic process{X k } k≥0 and the Markov chain{Y k } k≥0 . Under Assumption 3, this result allows us to bound the stochastic process{X k } k≥0 which keeps track of the error in EVL with the dominating Markov chain{Y k } k≥0 . Theorem 8. Under Assumption 3: (i) X k ≤ st Y k for all k≥ 0. (ii) Pr{Y k ≤η}≤ Pr{X k ≤η} for any η∈R and all k≥ 0. The proof is relegated to Appendix 3.6.3. By Theorem 8, ifX K ≤ st Y K and we can make Pr{Y K ≤η} large, then we will also obtain a meaningful bound on Pr{X K ≤η}. Following this observation, the next two corollaries are the main mechanisms for our general sample complexity results for EVL. The following corollary follows from bounding the mixing time of the dominating Markov chain{Y k } k≥0 and employing our general p−norm error bound Lemma 6. Corollary 9. For a given p∈ [1,∞), and any ε > 0, and δ∈ (0, 1), suppose Assumption 3 holds for this ε, and choose any K ∗ ≥ 1. 
Then for q≥ (1/2 +δ/2) 1/(K ∗ −1) and K ≥ log 4/ (1/2−δ/2) (1−q)q K ∗ −1 , we have kv K −v ∗ k p,ρ ≤ 2 1−γ K ∗ +1 1−γ ! p−1 p h C 1/p ρ,μ ε +γ K ∗ /p (2v max ) i 37 with probability at least δ. The proof is relegated to Appendix 3.6.3. The next Corollary uses the same reasoning for the supremum norm case. It follows from bounding the mixing time of the dominating Markov chain{Y k } k≥0 and employing our general∞−norm error bound Lemma 7. Corollary 10. Given any ε> 0 and δ∈ (0, 1), suppose Assumption 3 holds for this ε, and choose anyK ∗ ≥ 1. Forq≥ (1/2 +δ/2) 1/(K ∗ −1) andK≥ log 4/ (1/2−δ/2) (1−q)q K ∗ −1 , we have Pr n kv K −v ∗ k ∞ ≤ε/ (1−γ) +γ K ∗ (2v max ) o ≥δ. The sample complexity results for both EVL algorithms from Section 3.3 follow from Corollaries 9 and 10. This is shown next. We now apply our random operator framework to both EVL algorithms. We will see that it is easy to check the conditions of Corollaries 9 and 10, from which we obtain specific sample complexity results. We will use Theorems 23, 22, and 25 which are all “one-step” results which bound the error in a single step of Algorithm 1 (in the 1- and 2-norm) and Algorithm 2 (in the∞−norm) compared to the true Bellman operator. We first give the proof of Theorem 3. We letp (N, M, J, ε) denote the a lower bound on the probability of the event n k b Tv−Tvk 2,μ ≤ε o . 38 Proof. (of Theorem 3) Starting with inequality (3.7) for p = 2 and using the statement of Theorem 23 in Appendix 3.6.3, we havekv K −v ∗ k 2,ρ ≤ 2 1 1−γ ! 1/2 C 1/2 ρ,μ (d 2,μ (TF (Θ),F (Θ)) +ε) +4 1 1−γ ! 1/2 v max γ K/2 , whenkε k k 2,μ ≤d 2,μ (TF (Θ),F (Θ)) +ε for all k = 0, 1,..., K− 1. We choose K ∗ ≥ 1 to satisfy 4 1 1−γ ! 1/2 v max γ K ∗ /2 ≤ 2 1 1−γ ! 1/2 C 1/2 ρ,μ ε which implies K ∗ = 2 ln C 1/2 ρ,μ ε −ln(2vmax) ln γ . Based on Corollary 9, we just need to choose N, M, J such that p (N, M, J, ε) ≥ (1−δ/2) 1/(K ∗ −1) . We then apply the statement of Theorem 22 with p = 1− (1−δ/2) 1/(K ∗ −1) . 
We now give the proof of Theorem 4 along the same lines as for Theorem 4. Let p (N, M, J, ε) denote the lower bound on the probability of the event n k b Tv−Tvk 1,μ ≤ε o for ε> 0. We also note that d 1,μ (Tv,F (Θ))≤d 1,μ (TF (Θ),F (Θ)) for all v∈F (Θ). Proof. (of Theorem 4) Starting with inequality (3.7) for p = 1 and using the statement of Theorem 22 in Appendix 3.6.4, we havekv K −v ∗ k 1,ρ ≤ 2C ρ,μ (d 1,μ (TF (Θ),F (Θ)) +ε) + 4v max γ K 39 whenkε k k 1,μ ≤d 1,μ (TF (Θ),F (Θ)) +ε for all k = 0, 1,..., K− 1. Choose K ∗ such that 4v max γ K ≤ 2C ρ,μ ε⇒K ∗ = & ln (C ρ,μ ε)− ln (2v max ) ln γ ' . Based on Corollary 9, we just need to chooseN, M, J such thatp (N, M, J, ε)≥ (1−δ/2) 1/(K ∗ −1) . We then apply the statement of Theorem 22 with probability 1− (1−δ/2) 1/(K ∗ −1) . We now provide proof ofL ∞ function fitting in RKHS based on Theorem 25 in Appendix 3.6.3. For this proof, we letp (N, M, ε) denote a lower bound on the probability of the event n k b Tv−Tvk ∞ ≤ε o . Proof. (of Theorem 5) By inequality (3.8), we chooseε andK ∗ ≥ 1 such thatε/ (1−γ)≤/2 and γ K ∗ (2v max )≤/2 by setting K ∗ ≥ & ln ()− ln (4v max ) ln (γ) ' . Based on Corollary 9, we next choose N and M such that p (N, M, ε)≥ (1−δ/2) 1/(K ∗ −1) . We then apply the statement of Theorem 25 with error (1−γ)/2 and probability 1− (1−δ/2) 1/(K ∗ −1) . 3.3.5 Numerical Experiments We now present numerical performance of our algorithm by testing it on the benchmark optimal replacement problem [58, 48]. The setting is that a product (such as a car) becomes more costly to maintain with time/miles, and must be replaced it some point. Here, the state s t ∈R + represents the accumulated utilization of the product. Thus, s t = 0 denotes a 40 brand new durable good. Here,A ={0, 1}, so at each time step,t, we can either replace the product (a t = 0) or keep it (a t = 1). Replacement incurs a costC while keeping the product has a maintenance cost,c(s t ), associated with it. 
The transition probabilities are as follows:
$$q(s_{t+1}\mid s_t, a_t) = \begin{cases} \lambda e^{-\lambda(s_{t+1}-s_t)}, & \text{if } s_{t+1}\ge s_t \text{ and } a_t = 1,\\ \lambda e^{-\lambda s_{t+1}}, & \text{if } s_{t+1}\ge 0 \text{ and } a_t = 0,\\ 0, & \text{otherwise},\end{cases}$$
and the reward function is
$$r(s_t, a_t) = \begin{cases} -c(s_t), & \text{if } a_t = 1,\\ -C - c(0), & \text{if } a_t = 0.\end{cases}$$
For our computation, we use $\gamma = 0.6$, $\lambda = 0.5$, $C = 30$ and $c(s) = 4s$. The optimal value function and the optimal policy can be computed analytically for this problem.

For EVL+RPBF, we use $J$ random parameterized Fourier functions $\{\phi(s,\theta_j) = \cos(\theta_j^T s + b)\}_{j=1}^J$ with $\theta_j\sim\mathcal N(0, 0.01)$ and $b\sim\mathrm{Unif}[-\pi,\pi]$; we fix $J = 5$. For EVL+RKHS, we use the Gaussian kernel $k(x,y) = \exp(-\|x-y\|^2/(2\sigma^2))$ with $1/\sigma^2 = 0.01$ and $L_2$ regularization; we fix the regularization coefficient at $10^{-2}$. The underlying function space for FVI is polynomials of degree 4. The results are plotted after 20 iterations.

The error in each iteration for the different algorithms with $N = 100$ states and $M = 5$ is shown in Figure 3.1. The Y-axis shows the relative error, computed as $\sup_{s\in\mathcal S}|(v^*(s) - v^{\pi_k}(s))/v^*(s)|$, with the iteration count $k$ on the X-axis.

Figure 3.1 Relative Error with iterations for various algorithms.

The figure shows that EVL+RPBF has relative error below 10% after 20 iterations. FVI is close to it, but EVL+RKHS has larger relative error, though it may improve with a higher $M$ or by using other kernels. This is also reflected in the actual runtime performance: EVL+RPBF takes 8,705s, FVI 8,654s, and EVL+RKHS 42,173s to get within 0.1 relative error. The computational complexity of kernel methods grows quadratically with the number of samples and requires a matrix inversion, resulting in slower performance. Note that the performance of FVI depends on being able to choose suitable basis functions, which is easy for the optimal replacement problem. For other problems, we may expect both EVL algorithms to perform better.
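The replacement MDP described above is straightforward to simulate. The following minimal sketch (the function and constant names are our own) implements the reward and the exponential transition kernel with the parameters used in the experiment:

```python
import numpy as np

# Sketch of the optimal replacement MDP above (names are ours):
# gamma = 0.6, lambda = 0.5, replacement cost C = 30, maintenance cost c(s) = 4s.
GAMMA, LAM, COST_C = 0.6, 0.5, 30.0

def maintenance_cost(s):
    return 4.0 * s

def reward(s, a):
    # a = 1: keep (pay maintenance c(s)); a = 0: replace (pay C plus c(0)).
    return -maintenance_cost(s) if a == 1 else -(COST_C + maintenance_cost(0.0))

def step(s, a, rng):
    # Keeping adds an Exp(lambda) utilization increment; replacing resets the
    # state to a fresh Exp(lambda) draw, matching the density q above.
    inc = rng.exponential(1.0 / LAM)
    return s + inc if a == 1 else inc
```

A generative model in the sense used later in the chapter is exactly this `step` function: it returns an i.i.d. next-state sample for any queried state-action pair.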
So, we tested the algorithms on the cart-pole balancing problem, another benchmark problem, but one for which the optimal value function is unknown. We formulate it as an MDP with a continuous 4-dimensional state space and 2 actions. The state comprises the position of the cart, $x$, the velocity of the cart, $\dot x$, the angle of the pole in radians, $\theta$, and the angular velocity of the pole, $\dot\theta$. The actions are to add a force of $-10$N or $+10$N to the cart, pushing it left or right. We add $\pm 50\%$ noise to these actions.

Goal   EVL+RPBF   FVI     EVL+RKHS
50     5.4m       4.8m    8.7m
100    18.3m      23.7m   32.1m
150    36.7m      41.5m   54.3m
Table 3.1 Runtime performance comparison for the cart-pole problem (m = minutes)

For the system dynamics, let $m_c$ and $m_p$ be the masses of the cart and pole respectively, and let $l$ be the length of the pole. If $F_t$ is the force applied to the cart at time $t$, then the angular acceleration of the pole is
$$\ddot\theta_t = \frac{g\sin\theta_t + \cos\theta_t\left(\dfrac{-F_t - m_p l\,\dot\theta_t^2\sin\theta_t}{m_c+m_p}\right)}{l\left(\dfrac{4}{3} - \dfrac{m_p\cos^2\theta_t}{m_c+m_p}\right)}$$
and the acceleration of the cart is
$$\ddot x_t = \frac{F_t + m_p l\left(\dot\theta_t^2\sin\theta_t - \ddot\theta_t\cos\theta_t\right)}{m_c+m_p}.$$
Letting $\tau$ be the time step for Euler's method, we have the following state transition equations:
$$x_{t+1} = x_t + \tau\dot x_t,\qquad \dot x_{t+1} = \dot x_t + \tau\ddot x_t,\qquad \theta_{t+1} = \theta_t + \tau\dot\theta_t,\qquad \dot\theta_{t+1} = \dot\theta_t + \tau\ddot\theta_t.$$
Rewards are zero except in the failure state (the position of the cart beyond $\pm 2.4$, or the pole exceeding an angle of $\pm 12$ degrees), where the reward is $-1$.

For our experiments, we choose $N = 100$ and $M = 1$. For RPBF, we consider parameterized Fourier basis functions of the form $\cos(w^T s + b)$, where $w = [w_1, w_2]$, $w_1, w_2\sim\mathcal N(0,1)$ and $b\sim\mathrm{Unif}[-\pi,\pi]$; we fix $J = 10$ for EVL+RPBF. For RKHS, we consider the Gaussian kernel $K(s_1,s_2) = \exp(-\sigma\|s_1-s_2\|^2/2)$ with $\sigma = 0.01$. We limit each episode to 1000 time steps and compute the average length of the episode for which we are able to balance the pole without hitting the failure state. This is the goal in Table 3.1; the other columns show the run-time each algorithm needs to learn to achieve that goal.
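The dynamics and Euler step above can be sketched directly in code. The constants below are the usual Barto/Sutton cart-pole values and are our own assumption; the equations are exactly those stated in the text:

```python
import math

# One Euler step of the cart-pole dynamics above. The specific constants
# (g, masses, half-length, tau) are assumed standard values, not from the text.
G, M_C, M_P, L, TAU = 9.8, 1.0, 0.1, 0.5, 0.02

def euler_step(x, x_dot, theta, theta_dot, force):
    sin_t, cos_t = math.sin(theta), math.cos(theta)
    total_m = M_C + M_P
    temp = (force + M_P * L * theta_dot**2 * sin_t) / total_m
    # Angular acceleration of the pole (the theta-double-dot equation above).
    theta_acc = (G * sin_t - cos_t * temp) / (
        L * (4.0 / 3.0 - M_P * cos_t**2 / total_m))
    # Acceleration of the cart (the x-double-dot equation above).
    x_acc = temp - M_P * L * theta_acc * cos_t / total_m
    # Euler update with time step tau.
    return (x + TAU * x_dot, x_dot + TAU * x_acc,
            theta + TAU * theta_dot, theta_dot + TAU * theta_acc)
```

Note that with the pole upright at rest and zero force, all accelerations vanish and the state is a fixed point of the update, which is a quick sanity check on the implementation.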
From the table, we can see that EVL+RPBF outperforms FVI and EVL+RKHS. Note that guarantees for FVI are available only for the $L_2$-error, and for EVL+RPBF for the $L_p$-error; EVL+RKHS is the only algorithm that can provide guarantees on the sup-norm error. Also note that for problems where the value functions are not so regular, and good basis functions are difficult to guess, the EVL+RKHS method is likely to perform better, but as of now we do not have a numerical example to demonstrate this.

We also tested our algorithms on the Acrobot problem, a 2-link pendulum with only the second joint actuated. The objective is to swing the end-effector to a height at least the length of one link above the base, starting with both links pointing downwards. The state is six dimensional: $\sin(\cdot)$ and $\cos(\cdot)$ of the two rotational joint angles, and the two joint angular velocities. There are three actions available, $+1$, $0$ or $-1$, corresponding to the torque on the joint between the two pendulum links. We modify the environment available from OpenAI by injecting uniform noise into the actions so that the transitions are not deterministic. The reward is 1 if the goal state is reached, and 0 otherwise. We choose $N = 2000$, $M = 1$, $J = 100$. Fig. 3.2 shows the reward for both of the proposed algorithms. Not only does EVL+RPBF perform better, it is also faster than EVL+RKHS by an average of 3.67 minutes per iteration. The reason is that the EVL+RKHS algorithm is designed to provide guarantees on the sup-error, a much more stringent requirement than the $L_p$-error for which the EVL+RPBF algorithm provides guarantees.

Figure 3.2 Performance on the Acrobot problem

3.4 Online RL with generative model

In the previous sections, we proposed offline algorithms based on dynamic programming. Recently, reinforcement learning (RL) algorithms have been shown to have remarkably good performance on simulated control problems ranging from video games to locomotion tasks [44, 59, 40].
Unfortunately, these methods are rather complicated, require a lot of data to perform well [30], and do not yield much insight. Furthermore, it has been shown [33] that many of these RL algorithms are not robust to changes in hyper-parameters, network architectures, random seeds, or even different implementations of the same algorithm. In fact, more time is typically spent on hyper-parameter tuning than on actually training the algorithm. Furthermore, [41] showed that a simple linear policy-based method, with weights updated by a random search method, can outperform some of these state-of-the-art results. A key question is whether the remarkable performance of these algorithms is more than just random search, and whether algorithms based on more principled approaches could be designed that could potentially match this remarkable performance (eventually).

There is a large class of model-based algorithms based on dynamic programming (DP) ideas [54], including for continuous state space MDPs. Unfortunately, in many problems (e.g., robotics), the system model is unknown, or simply too complicated to be succinctly stated and used in DP algorithms; usually, the latter is the more likely case. Thus, model-free algorithms such as Q-Learning have been in popular usage [64]. The problem with such stochastic approximation algorithms is that they are too slow to converge, as shown in [24, 17, 7]. An alternative has been 'empirical' algorithms such as [48, 24], which replace the expectation in the Bellman operator with a sample average approximation obtained by getting multiple samples of the next state for each state-action pair. This requires access to a generative model. At first glance, this seems restrictive, and somewhat short of the goal of an 'online' algorithm for reinforcement learning.
But in a large variety of applications, from video games [44] to robotics, we do have access to a generative model of the system (even if it is too complicated for theoretical analysis). In fact, in robotics, for example, training of RL algorithms is first done offline in high-fidelity simulators, as the training takes too long to converge for it to be done directly on the physical robot.

In this section, we propose an 'online' RL algorithm for continuous state space MDPs that is quasi-model-free in the sense of only needing access to a generative model (e.g., in a simulator), and unlike other approaches [44, 59, 40] provably converges to the optimal policy with finite time (or non-asymptotic) guarantees. The idea is inspired by the 'empirical' algorithms proposed in [48, 24]. While [24] only considered finite state MDPs, [48] considered continuous MDPs but with a user-specified deterministic function class, which may result in a large approximation error. Recently, in [27], the authors got around this problem by leveraging [57] and doing randomized function approximation in an RKHS. Though this algorithm for continuous MDPs is 'quasi-model-free' (essentially model-free but needing a generative model), computes nearly-optimal policies, and comes with theoretical performance guarantees, it is still an 'offline' algorithm. An 'online' version of [48] was presented in [3], but that algorithm can have a large optimality gap for continuous MDPs.

The main contribution of this section is an 'online' quasi-model-free reinforcement learning algorithm for continuous MDPs that computes nearly-optimal policies and also comes with non-asymptotic theoretical performance guarantees. We assume that we can use a randomized stationary policy that yields a β-mixing sample trajectory. At each instant, we do a value function update by doing a randomized function approximation in an RKHS.
This involves getting multiple next-state samples by use of the generative model, and then using them for an 'empirical' estimate of the expectation in the Bellman operator. The function approximation is performed on the base-points in a finite window of the sample trajectory. We prove convergence and give finite time sample complexity bounds by combining the β-mixing analysis with the random operator theoretic analysis and randomized function approximation analysis in [24, 27].

Consider a discounted MDP $(\mathcal X, \mathcal A, P, r, \gamma)$, where $\mathcal X$ is the state space and $\mathcal A$ is the action space. The transition probability kernel is given by $P(\cdot|x,a)$, i.e., if action $a$ is executed in state $x$, the probability that the next state is in a Borel-measurable set $B$ is $P(x_{t+1}\in B\mid x_t = x, a_t = a)$, where $x_t$ and $a_t$ are the state and action at time $t$. The reward function is $r:\mathcal X\times\mathcal A\to\mathbb R$. We are interested in maximizing the infinite horizon expected discounted reward, where the discount parameter is $\gamma$. Let $\Pi$ denote the class of stationary deterministic Markov policies, i.e., mappings $\pi:\mathcal X\to\mathcal A$ which depend on history only through the current state. We only consider such policies since it is well known that this class contains an optimal MDP policy. When the initial state is given, any policy $\pi$ determines a probability measure $P^\pi$; let the expectation with respect to this measure be $E^\pi$. We focus on the infinite horizon discounted reward criterion. The expected infinite horizon discounted reward, or value function, for a policy $\pi$ and initial state $x$ is given as
$$v^\pi(x) = E^\pi\left[\sum_{t=0}^\infty \gamma^t r(x_t, a_t)\,\Big|\, x_0 = x\right].$$
The optimal value function is given as $v^*(x) = \sup_{\pi\in\Pi} v^\pi(x)$, and the policy which maximizes the value function is the optimal policy, $\pi^*$. We make the following regularity assumption on the MDP.

Assumption 4. (Regularity of MDP) The state space $\mathcal X$ is a compact subset of $d$-dimensional Euclidean space. The rewards are uniformly bounded by $r_{\max}$, i.e., $|r(x,a)|\le r_{\max}$ for all $(x,a)\in\mathcal X\times\mathcal A$.
Furthermore, there exist $L_P, L_r > 0$ such that for all $(x, x', a)\in\mathcal X\times\mathcal X\times\mathcal A$, we have
$$\|P(\cdot|x,a) - P(\cdot|x',a)\|_{TV} \le L_P\|x - x'\|_2,\qquad |r(x,a) - r(x',a)| \le L_r\|x - x'\|_2,$$
where $TV$ denotes the total variation norm.

The assumption above implies that for any policy $\pi$, $v^\pi \le v_{\max} = r_{\max}/(1-\gamma)$. Let $B(\mathcal X)$ be the set of functions $f:\mathcal X\to\mathbb R$ such that $\|f\|_\infty\le v_{\max}$. Let us now define the Bellman operator $T: B(\mathcal X)\to B(\mathcal X)$ as follows:
$$Tv(x) = \max_a\left[r(x,a) + \gamma\,E_{x'\sim P(\cdot|x,a)}\,v(x')\right].$$
It is well known that the operator $T$ is a contraction with respect to the $\|\cdot\|_\infty$ norm, with contraction parameter the discount factor $\gamma$. Hence, the sequence of iterates $v_k = T v_{k-1}$ converges to $v^*$ geometrically. It is easy to check that $T$ maps bounded functions to Lipschitz functions with Lipschitz constant $L_r + \gamma L_P v_{\max}$.

We can pick a function space in which to do function approximation which can provide an arbitrarily close approximation to any continuous function. As in [57, 27], we pick a parametrized reproducing kernel Hilbert space (RKHS) and do randomized function approximation via random features, which we call random parametric basis functions (RPBF). Let $\Theta$ be a set of parameters and let $\phi:\mathcal X\times\Theta\to\mathbb R$ be a feature function. The feature functions need to satisfy $\sup_{(x,\theta)\in\mathcal X\times\Theta}|\phi(x;\theta)|\le 1$; one example of such basis functions is Fourier features. Let $\mathcal F(\Theta)$ be defined as
$$\mathcal F(\Theta) = \left\{f(\cdot) = \int_\Theta \phi(\cdot;\theta)\,\alpha(\theta)\,d\theta\ :\ |\alpha(\theta)|\le C\nu(\theta)\ \ \forall\theta\in\Theta\right\}.$$
We are interested in finding the best fit within finite sums of the form $\sum_{j=1}^J \alpha_j\phi(s;\theta_j)$. Doing classical function fitting with $\sum_{j=1}^J\alpha_j\phi(s;\theta_j)$ leads to non-convex optimization problems because of the joint dependence on $\alpha$ and $\theta$. Instead, we fix a density $\nu$ on $\Theta$ and draw a random sample $\theta_j$ from $\Theta$ for $j = 1,2,\ldots,J$. Once these $(\theta_j)_{j=1}^J$ are fixed, consider the function space
$$\hat{\mathcal F}_{\theta_{1:J}} = \left\{f(\cdot) = \sum_{j=1}^J \alpha_j\phi(\cdot;\theta_j)\ :\ \|(\alpha_1,\ldots,\alpha_J)\|_\infty \le C/J\right\}.$$
Now, it remains to calculate the weights $\alpha$ by minimizing a convex loss.
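As a concrete illustration of the randomized class $\hat{\mathcal F}_{\theta_{1:J}}$, the following minimal sketch (names and parameter choices are our own assumptions) draws the parameters $\theta_j$ once from a Gaussian $\nu$, uses Fourier features, and evaluates members of the class:

```python
import numpy as np

# Sketch of F_hat: theta_1..theta_J are drawn ONCE from nu; after that, any
# member of the class is a linear combination of the J fixed random features.
rng = np.random.default_rng(0)
J, d = 10, 4
thetas = rng.normal(0.0, 1.0, size=(J, d))     # theta_j ~ nu (our choice of nu)
biases = rng.uniform(-np.pi, np.pi, size=J)

def features(x):
    # phi(x; theta_j) = cos(theta_j^T x + b_j), uniformly bounded by 1.
    return np.cos(thetas @ x + biases)

def f_hat(x, alpha):
    # A member of F_hat; the weights must satisfy ||alpha||_inf <= C/J.
    return float(features(x) @ alpha)
```

Because only the weights $\alpha$ remain free, fitting a member of this class to data is a convex problem, which is exactly the point of the construction.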
Furthermore, for a given probability distribution $\mu$ on $\mathcal X$, let us define the $L_{2,\mu}$ norm of a function as $\|f\|_{2,\mu}^2 = \int_{\mathcal X}|f(x)|^2\,\mu(dx)$. The empirical norm at given samples $(x_1, x_2,\ldots,x_N)$ is defined as $\|f\|_{2,\hat\mu}^2 = \frac{1}{N}\sum_{i=1}^N |f(x_i)|^2$. Recall two distance measures for function spaces: $d_{2,\mu}(Tf,\mathcal F) := \inf_{f'\in\mathcal F}\|f' - Tf\|_{2,\mu}$ is the approximation error for a specific $f$, and $d_{2,\mu}(T\mathcal F,\mathcal F) := \sup_{f\in\mathcal F} d_{2,\mu}(Tf,\mathcal F)$ is the inherent Bellman error for the entire class $\mathcal F$. Now, because of the Lipschitz continuity assumption on the MDP (Assumption 4), we need to choose a function space that is dense in the space of Lipschitz functions. In fact, $\mathcal F(\Theta)$ is shown to be dense in the space of continuous functions [56]. Hence, for such a function class, $d_{2,\mu}(T\mathcal F,\mathcal F) = 0$.

3.4.1 Algorithm

We now present our ONline Empirical Value Learning (ONEVaL) algorithm. We assume that we have interactions $(x_1, x_2,\ldots,x_N, x_{N+1},\ldots)$ generated by a randomized policy $\pi_g$, i.e., $x_{t+1}\sim P(\cdot|x_t,a_t)$ where $a_t\sim\pi_g(\cdot|x_t)$. We assume that for any $a$ and $x$, $\pi_g(a|x)$ is strictly positive. We keep a window of size $N$ which moves forward one step in every iteration. These $N$ samples serve as the states at which we compute our approximate value function; we then use function approximation to generalize. It is clear that these samples are not i.i.d., so we need some mixing conditions on this stochastic process. Mixing conditions quantify the decrease in dependence as the future moves farther from the past. There are various types of mixing conditions, of which the following are popular in statistical learning theory: α-mixing, β-mixing and φ-mixing. As mentioned in [67, Chapter 2], φ-mixing is too strong an assumption while α-mixing is too weak, but β-mixing is "just right". Let us now revisit the definition of β-mixing:

Definition 7 (β-mixing). Let $\{X_t\}_{t=1}^\infty$ be a sequence of random variables from a probability space $(\Omega,\mathcal F, P)$.
Let $P_i^j$ denote the joint distribution of the random variables $(X_i, X_{i+1},\ldots,X_j)$, $j\ge i$. Then, the β-mixing coefficient is defined as
$$\beta_m := \sup_{t\ge 1}\left\|P_1^t\times P_{t+m}^\infty - P^{t,m}\right\|_{TV},$$
where $\|\cdot\|_{TV}$ is the total variation norm and $P^{t,m}$ is the joint distribution of $(X_1^t, X_{t+m}^\infty)$. A stochastic process is said to be β-mixing if $\beta_m\to 0$ as $m\to\infty$.

The β-mixing coefficient measures the distance between the joint distribution of random variables separated by $m$ time units and the distribution under which they are independent. β-mixing processes can be either algebraically mixing or exponentially mixing (we focus on the latter) [43]. A β-mixing process is said to mix at an exponential rate with parameters $b, c > 0$ if $\beta_m = O(\exp(-b m^c))$. We now state the mixing assumption on our samples:

Assumption 5 (β-mixing). The sequence of samples $\{x_t\}_{t\ge 1}$ is β-mixing at an exponential rate, i.e., there exist positive constants $b, c, c_1$ such that $\beta_m = c_1\exp(-b m^c)$. Furthermore, this is a strictly stationary process and $x_t\sim\mu$.

As mentioned earlier, we focus on a fully randomized policy (i.e., each action is taken with positive probability in each state) for generating these interactions, which is known to induce a Markov process satisfying the β-mixing properties [49]. One could instead sample from these interactions, for instance from an experience replay buffer [45]; as long as the mixing properties are satisfied, we are guaranteed to generalize well. In fact, in the experiments section, we show that sampling from previous samples satisfies the mixing conditions. Since we will be analyzing the convergence in the $L_2$ norm, we no longer have the contraction property. Hence, we need bounded Radon-Nikodym derivatives of the transitions, which we state in the next assumption. Such an assumption has been used earlier in [48, 27]:

Assumption 6. (Stochastic Transitions) For all $(x,a)\in\mathcal X\times\mathcal A$, $P(\cdot|x,a)$ is absolutely continuous with respect to $\mu$ and
$$C_\mu := \sup_{(x,a)\in\mathcal X\times\mathcal A}\left\|\frac{dP(\cdot|x,a)}{d\mu}\right\|_\infty < \infty.$$
When there is uncertainty in the underlying environment, computing the expectation in the Bellman operator is expensive. If we have a generative model of the environment, we can replace the expectation by an empirical mean, leading to the definition of the empirical Bellman operator:
$$\hat T_M v(x) := \max_a\left[r(x,a) + \frac{\gamma}{M}\sum_{i=1}^M v\left(x_i^{x,a}\right)\right],\qquad (3.9)$$
where $x_i^{x,a}\sim P(\cdot|x,a)$ for $i = 1,2,\ldots,M$. Note that the next-state samples $x'$ are i.i.d. If the environment is deterministic, like Atari games or locomotion tasks, a single next state suffices and we don't need a generative model.

Given the data $\{(x_n,\hat v(x_n))\}_{n=1}^N$, we fit the value function over the state space by computing a best fit within $\hat{\mathcal F}_{\theta_{1:J}}$ by solving
$$\min_\alpha\ \frac{1}{N}\sum_{n=1}^N\Big|\sum_{j=1}^J\alpha_j\phi(x_n;\theta_j) - \hat v(x_n)\Big|^2\quad\text{s.t.}\quad\|(\alpha_1,\ldots,\alpha_J)\|_\infty\le C/J.\qquad (3.10)$$
This optimization problem only optimizes over the weights $\alpha_{1:J}$, since the parameters $\theta_{1:J}$ have already been randomly sampled from the given distribution $\nu$. Let $\Pi_{\hat{\mathcal F}}(J,N)$ denote this optimization problem, which we write as $\Pi_{\hat{\mathcal F}}$ for compactness. We are now ready to present our algorithm, ONEVaL, shown in Algorithm 3. Step 1 selects the samples at which we compute the approximate value function. Step 2 samples the next states for a given state and action; this is not difficult when we have a generative model, as in various learning tasks, for instance in robotics. Step 3 computes the approximate labels, which Step 4 generalizes via function approximation. ONEVaL can be seen as the iteration of a composition of two random operators. Let us define $\hat G(N,M,J,\mu,\nu) = \Pi_{\hat{\mathcal F}}(N,J)\circ\hat T_M$, where $\circ$ denotes composition.

Algorithm 3 ONEVaL
Input: sample sizes $N, M, J\ge 1$; initial seed $v_0$; interactions from policy $\pi_g$: $(x_1, x_2,\ldots,x_N, x_{N+1},\ldots,x_{2N-1}, x_{2N},\ldots)$
For $k = 0, 1, 2,\ldots$
1. Select samples $(x_n)_{n=k+1}^{k+N}$ from the given interactions
2. For each $n$ and action $a$, sample i.i.d. next states $x_i^{x_n,a}\sim P(\cdot|x_n,a)$ for $i = 1,2,\ldots,M$
3.
Empirical value iteration: $\hat v_{k+1}(x_n) = \hat T_M v_k(x_n)$ for each $n$, according to (3.9)
4. Function approximation: $v_{k+1} = \Pi_{\hat{\mathcal F}}\,\hat v_{k+1}$, according to (3.10)

Let $\hat G$ be compact notation for this operator; hence, in our algorithm, $v_{k+1} = \hat G v_k$. We will use the random operator framework developed in [24, 27] for the analysis. Denote
$$J_0(\epsilon,\delta) := \left[\frac{5C}{\epsilon}\left(1 + \sqrt{2\log\frac{5}{\delta}}\right)\right]^2,$$
$$M_0(\epsilon,\delta) := \frac{512\, v_{\max}^2}{(\epsilon/5)^4}\log\left[\frac{40\, e\, N|\mathcal A|(J+1)}{\delta}\left(\frac{4\, e\, v_{\max}}{(\epsilon/5)^2}\right)^J\right],$$
and
$$\eta_{N_0}(\epsilon,\delta) := \frac{512\, v_{\max}^2}{(\epsilon/5)^4}\log\left[\frac{160\, e\,(J+1)}{\delta}\left(\frac{200\, C\gamma e\, v_{\max}}{\epsilon^3}\right)^J\right].$$

Theorem 11. Suppose Assumptions 4, 5 and 6 hold, and let $N = 2 l_N\eta_N$. Choose an $\epsilon > 0$ and $\delta\in(0,1)$. Set $\delta' = 1 - (1/2+\delta/2)^{1/(K^*-1)}$ and denote
$$\tilde C := 4\left(\frac{1-\gamma^{K+1}}{1-\gamma}\right)^{1/2} C_\mu^{1/2}\quad\text{and}\quad K^* := \left\lceil\frac{\log(C_\mu^{1/2}\epsilon) - \log(2 v_{\max})}{\log\gamma}\right\rceil.\qquad (3.11)$$
Then, if $\eta_N\ge\eta_{N_0}(\epsilon,\delta')$, $l_N\ge\left[\frac{1}{b}\log\frac{20\,\eta_N c_1}{\delta'}\right]^{1/c}$, $M\ge M_0(\epsilon,\delta')$, $J\ge J_0(\epsilon,\delta')$ and $K\ge\frac{\log\left(4/(1/2-\delta/2)\right)}{(1-q)\,q^{K^*-1}}$, we have, with probability at least $1-\delta$,
$$\|v_K - v^*\|_{2,\mu}\le\tilde C\,\epsilon.$$

Remark: The above theorem states that if we have sufficiently large sample sizes, in particular $J = O\left(\frac{1}{\epsilon^2}\log\frac{1}{\delta}\right)$, $\eta_N = O\left(\frac{1}{\epsilon^4}\log\frac{1}{\epsilon^{3J}\delta}\right)$, $M = O\left(\frac{1}{\epsilon^4}\log\frac{1}{\epsilon^{2J}\delta}\right)$ and $l_N = O\left(\log\frac{1}{\delta}\right)$, where $N = 2\eta_N l_N$, then after sufficiently many iterations the approximation error can be made arbitrarily small with high probability. Moreover, if the Lipschitz continuity assumption is not satisfied, then the result can be presented in the more general form
$$\|v_K - v^*\|_{2,\mu}\le 2\left(\frac{1-\gamma^{K+1}}{1-\gamma}\right)^{1/2}\left[C_\mu^{1/2}\,d_{2,\mu}(T\mathcal F(\Theta),\mathcal F(\Theta)) + 2\epsilon\right].$$

3.4.2 Convergence Analysis

There are two approximations interleaved in ONEVaL: approximation by sample average, and function approximation. In addition, the samples used for function approximation are not independent. We will use the technique developed in [69] to transform a sequence of dependent variables into a subsequence of nearly i.i.d. ones. Recall that the state samples for function approximation in iteration $k$ are $(x_{k+1}, x_{k+2},\ldots,x_{k+N})$.
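Before turning to the analysis, steps 3 and 4 of Algorithm 3, the empirical backup (3.9) followed by the box-constrained fit (3.10), can be sketched as follows. The function names are our own, and we solve (3.10) by projected gradient descent (a bound-constrained least-squares solver such as scipy's `lsq_linear` would be an alternative):

```python
import numpy as np

# Step 3: empirical Bellman backup (3.9),
# v_hat(x_n) = max_a [ r(x_n, a) + (gamma/M) * sum_i v(x_i^{x_n, a}) ].
def empirical_bellman(v, states, actions, sample_next, r, gamma, M, rng):
    vals = np.empty(len(states))
    for n, x in enumerate(states):
        q = [r(x, a) + gamma * np.mean([v(sample_next(x, a, rng))
                                        for _ in range(M)]) for a in actions]
        vals[n] = max(q)
    return vals

# Step 4: the convex fit (3.10),
# min_alpha (1/N)||Phi alpha - targets||^2  s.t.  ||alpha||_inf <= C/J,
# solved by projected gradient descent with a stable step size.
def fit_weights(Phi, targets, C, iters=500):
    N, J = Phi.shape
    alpha = np.zeros(J)
    lr = N / (2.0 * np.linalg.norm(Phi, 2) ** 2 + 1e-12)
    for _ in range(iters):
        grad = (2.0 / N) * Phi.T @ (Phi @ alpha - targets)
        alpha = np.clip(alpha - lr * grad, -C / J, C / J)  # project onto the box
    return alpha
```

The projection step is just a coordinate-wise clip, since the feasible set $\|\alpha\|_\infty\le C/J$ is a box; this is what keeps the per-iteration cost of the fit linear in $NJ$.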
For convenience, let us rewrite the sequence as $(\tilde x_{k,n})_{n=1}^N$ for a given iteration $k$. Let $\eta_N$ and $l_N$ be positive integers such that this sequence is divided into $2\eta_N$ blocks, each of length $l_N$, for all $k\ge 0$. We assume that $2\eta_N l_N = N$ (the remainder terms become insignificant as the sample size increases). Let us now construct the blocks, for $1\le j\le\eta_N$:
$$H_{k,j} = \{\tilde x_{k,i} : 2(j-1)l_N + 1\le i\le (2j-1)l_N\},$$
$$T_{k,j} = \{\tilde x_{k,i} : (2j-1)l_N + 1\le i\le 2j l_N\}.$$
Let $H_k$ be the entire sequence of odd blocks $\{H_{k,j}\}$, and let $T_k$ be the sequence of even blocks $\{T_{k,j}\}$. Finally, let $H_k'$ be a sequence of blocks which are independent of $\tilde x_{k,1:N}$ but such that each block has the same distribution as a block from the original sequence. All the results for i.i.d. variables can now be applied to the blocks $H_k'$. This technique has been used to extend the convergence of function approximation to a β-mixing sequence.

Lemma 12. [3, Lemma 2] Suppose that $X_0,\ldots,X_N\in\mathcal X$ is a stationary β-mixing process with mixing coefficients $\beta_m$, $H'$ are the independent block samples (as shown above), and $\mathcal F$ is a permissible class of $\mathcal X\to[-K,K]$ functions. Let $N = 2\eta_N l_N$. Then
$$P\left(\sup_{f\in\mathcal F}\Big|\frac{1}{N}\sum_{t=1}^N f(Z_t) - E[f(Z)]\Big| > \epsilon\right)\le 16\,E[\mathcal N(\epsilon/8,\mathcal F, H')]\exp\left(-\frac{\eta_N\epsilon^2}{128 K^2}\right) + 2\eta_N\beta_{l_N}.$$

Step 1: Bound on the one-step error. Let us now bound the error in one iteration of ONEVaL; we will then use the stochastic dominance argument to analyze the iteration of the random operator. Let $\epsilon_k$ be the gap between the random operator $\hat G$ and the exact Bellman operator at iteration $k$, i.e., $v_{k+1} = \hat G v_k = T v_k + \epsilon_k$. We want to bound the error term $\epsilon_k$. Note that the functions from the previous iteration, $v_k$, are random. Also, they are not independent of the function in the next iteration, $v_{k+1}$, because the interaction samples are dependent. The next lemma bounds the error in one iteration of ONEVaL; the proof is in the appendix.

Lemma 13. Choose $\epsilon > 0$ and $\delta\in(0,1)$.
Let $N = 2\eta_N l_N$ for some positive integers $\eta_N$ and $l_N$. Also choose $N\ge N_0(\epsilon,\delta)$, $M\ge M_0(\epsilon,\delta)$ and $J\ge J_0(\epsilon,\delta)$. Then, for $\hat G(N,M,J,\mu,\nu)v$, the output of one iteration of ONEVaL, we have with probability at least $1-\delta$,
$$\|\hat G v - T v\|_{2,\mu}\le d_{2,\mu}(T\mathcal F(\Theta),\mathcal F(\Theta)) + \epsilon$$
for any arbitrary function $v\in\mathcal F(\Theta)$.

Step 2: Stochastic dominance. This step is the same as in the previous section. Having bounded the error in one step, we now bound the error when the random operator $\hat G$ is applied iteratively, by constructing a dominating Markov chain. Since we do not have a contraction with respect to the $L_2$ norm, we need an upper bound on how the errors propagate with iterations. Recalling that $\hat G v_k = T v_k + \epsilon_k$, we use the point-wise error bounds of [48, Lemma 3]. For a given error tolerance, this gives a bound on the number of iterations, which we call $K^*$, as shown in equation (3.11); the details of the choice of $K^*$ are given in the appendix.

We then construct a stochastic process as follows. We call iteration $k$ "good" if the error $\|\epsilon_k\|_{2,\mu}\le\epsilon$ and "bad" otherwise. We then construct a stochastic process $\{X_k\}_{k\ge 0}$ with state space $\mathcal K := \{1,2,\ldots,K^*\}$ such that
$$X_{k+1} = \begin{cases}\max\{X_k - 1,\ 1\}, & \text{if iteration } k \text{ is "good"},\\ K^*, & \text{otherwise}.\end{cases}$$
The stochastic process $\{X_k\}_{k\ge 0}$ is easier to analyze than $\{v_k\}_{k\ge 0}$ because it is defined on a finite state space; however, $\{X_k\}_{k\ge 0}$ is not necessarily a Markov chain. Whenever $X_k = 1$, it means that we just had a string of $K^*$ "good" iterations in a row, and that $\|v_k - v^*\|_{2,\mu}$ is as small as desired. We next construct a "dominating" Markov chain $\{Y_k\}_{k\ge 0}$ to help us analyze the behavior of $\{X_k\}_{k\ge 0}$, and we let $\mathcal Q$ denote the probability measure of $\{Y_k\}_{k\ge 0}$. Since $\{Y_k\}_{k\ge 0}$ will be a Markov chain by construction, the probability measure $\mathcal Q$ is completely determined by an initial distribution and a transition kernel for $\{Y_k\}_{k\ge 0}$.

We now use the bound on the one-step error as presented in Lemma 13, which states that when the sample sizes are sufficiently large, for all $k$,
$$P\left(\|\epsilon_k\|_{2,\mu}\le\epsilon\right) > q(N,M,J,L).$$
Denote this probability by $q$ for compactness. Initialize $Y_0 = K^*$, and construct the process
$$Y_{k+1} = \begin{cases}\max\{Y_k - 1,\ 1\}, & \text{w.p. } q,\\ K^*, & \text{w.p. } 1-q,\end{cases}$$
where $q$ is the probability of a "good" iteration, which increases with the sample sizes $N, M, J$ and $L$. We now describe a stochastic dominance relationship between the two stochastic processes $\{X_k\}_{k\ge 0}$ and $\{Y_k\}_{k\ge 0}$: we will establish that $\{Y_k\}_{k\ge 0}$ is "larger" than $\{X_k\}_{k\ge 0}$ in a stochastic sense. This relationship is the key to our analysis of $\{X_k\}_{k\ge 0}$.

Definition 8. Let $X$ and $Y$ be two real-valued random variables. Then $X$ is stochastically dominated by $Y$, written $X\le_{st} Y$, when $\Pr\{X\ge\theta\}\le\Pr\{Y\ge\theta\}$ for all $\theta$ in the support of $Y$.

The next lemma uses stochastic dominance to show that if the error in each iteration is small, then after sufficiently many iterations we will have a small approximation error. The proof is given in the appendix.

Lemma 14. Choose $\epsilon > 0$ and $\delta\in(0,1)$, and suppose $N, M, J$ and $L$ are chosen sufficiently large that $P(\|\epsilon_k\|_{2,\mu}\le\epsilon) > q$ for all $k\ge 0$. Then for $q\ge(1/2+\delta/2)^{1/(K^*-1)}$ and $K\ge\frac{\log\left(4/(1/2-\delta/2)\right)}{(1-q)\,q^{K^*-1}}$, we have, with probability at least $1-\delta$,
$$\|v_K - v^*\|_{2,\mu}\le 2\left(\frac{1-\gamma^{K+1}}{1-\gamma}\right)^{1/2}\left[C_\mu^{1/2}\,\epsilon + \gamma^{K/2}(2 v_{\max})\right].\qquad (3.12)$$
Combining Lemma 13 and Lemma 14 concludes the proof of Theorem 11.

3.4.3 Numerical Experiments

3.4.3.1 Optimal replacement

As a proof of concept for the performance of our proposed algorithm, we pick the optimal replacement problem, where we can explicitly compute the optimal value function. This allows us to compute the error in the approximation of the value function across iterations. In the optimal machine replacement problem, there are two actions: keep the machine, or replace it. The state space comprises the non-negative real numbers and the transition kernel is exponential. Further details can be found in [48].
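The dominating chain $\{Y_k\}$ constructed in the analysis above is also easy to simulate, which is useful for sanity-checking a candidate pair $(q, K^*)$ numerically. The following minimal sketch (the function name is ours) runs the chain for $K$ steps:

```python
import random

# Simulate the dominating chain {Y_k}: with probability q the chain steps down
# toward 1 ("good" iteration), otherwise it resets to K_star ("bad" iteration).
# Y_K = 1 corresponds to a run of K_star consecutive good iterations.
def simulate_dominating_chain(q, k_star, K, rng):
    y = k_star
    for _ in range(K):
        y = max(y - 1, 1) if rng.random() < q else k_star
    return y
```

In the extreme cases the behavior is as expected: with $q = 1$ the chain reaches state 1 after $K^*-1$ steps and stays there, and with $q = 0$ it never leaves $K^*$.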
For this experiment, we choose $J = 10$ random parameterized Fourier functions $\phi(x;\theta_j) = \cos(\theta_j^T x + b)$ with $\theta_j\sim\mathcal N(0,0.01)$ and $b\sim\mathrm{Unif}[-\pi,\pi]$. Fig. 3.3 shows the performance of ONEVaL for different $N$ and $M$. Additionally, it shows the performance of two different methods of selecting the samples from the interactions: the method "consecutive" is as presented in Algorithm 3, while the method "sampled" refers to a variant where, instead of moving the window by one sample, we sample uniformly from the interactions. We observe that increasing $N$ and $M$ improves the performance of the algorithm, as we would expect.

We also verify the β-mixing assumption stated in Assumption 5. Table 3.2 presents the estimated β-mixing coefficients for $N = 40000$ and $N = 80000$; to compute these coefficients, we follow the procedure in [42]. It shows that increasing the block length indeed decreases the mixing coefficients.

Figure 3.3 Performance of ONEVaL for different sample sizes

l_N    N       Consecutive   Sampled
1      40000   15.88         0.93
10     40000   0.9281        0.77
100    40000   0.714         0.89
1000   40000   0.702         0.87
10     80000   0.52          0.59
100    80000   0.49          0.55
1000   80000   0.47          0.54
Table 3.2 β-mixing coefficients

3.4.3.2 Cart-Pole

We now evaluate our methods on the cart-pole problem, using the OpenAI gym codebase. We compare our approximation methods with RPBF and RKHS against linear methods (ridge regression) and a popular nonlinear method (10-nearest neighbors)¹, evaluating their learning ability and per-iteration computational complexity. We also add a random policy (R) as a baseline, to see whether a method has learned anything at all.

Computer specifications and code libraries: the experiments are run on a 32-core machine with an AMD Opteron processor and 256 GB RAM. We use the Python libraries

¹ We experimented with 1, 2, 5, 10, and 20 neighbors, finding that 10 neighbors usually gave the best performance in these tests.
openai/gym to simulate the cart-pole environment, and scikit-learn to run the various function-approximating machine learning algorithms. We borrow heavily from the environment constructed by the OpenAI gym for CartPole-v0 (see https://gym.openai.com/envs/CartPole-v0/ for documentation). The state space contains four variables: position and velocity of the cart, and angle and angular velocity of the pole. We modify the code slightly to inject noise, such that actual force = force + $z$, $z\sim\mathcal N(0,1)$. The reward function is $+1$ if the pole is balanced (within $\pm 12$ degrees of vertical), and 0 otherwise. Since this is an episodic task, we sample uniformly from the interactions, similar to an experience replay buffer [45]. The exploration scheme used is $\epsilon$-greedy with decaying $\epsilon$. We compare the approximation methods against linear methods (ridge regression) and popular nonlinear methods: 10-nearest neighbors (NN), kernel ridge regression (RKHS), the Nystrom method [68], and random parametric basis functions (RPBF). For RKHS, we use the standard Gaussian kernel.

method      TPI (batchsize 1000)   TPI (batchsize 5000)
Ridge Reg   0.21                   0.21
RKHS        0.70                   1.18
Nystrom     0.97                   0.94
RPBF        0.38                   0.37
NN          2.21                   2.23
Table 3.3 CPU per-iteration runtime (in milliseconds).

Figure 3.4 Performance of ONEVaL on Cart-Pole. Expected (discounted) reward, averaged over 100 trials (solid), with the range of highest and lowest values in faded colors, for N = 1000 with M = 1

Figs. 3.4 and 3.5 give the average achieved value at each episode for all methods, for different sample sizes. The only methods that do significantly better than random are the nonlinear methods. Of these, RPBF appears a clear winner, with full kernel and Nystrom regression second.
Nearest neighbor can perform very well when the sample size is low, but has trouble when the sample size is high. To give some intuition into the function approximation ability of each method, Figure 3.7 gives the Q-function for a single action ($a = -1$) over different angles ($\theta$) and angular velocities ($\dot\theta$). A linear Q-function is clearly inadequate to capture the value function. At the other extreme, the nearest neighbor method appears jagged and unstable, explaining its subpar performance in Figures 3.4 and 3.5. RKHS, Nystrom, and RPBF all manage to capture the space well when the state is desirable (position and velocity are 0) and enough episodes have passed for them to learn (left column). For fewer episodes (middle column), the approximation methods struggle a bit compared to the full kernel. For bad states (right column), all methods show much lower values, but for RKHS, Nystrom, and RPBF, the shape is still intuitively correct.

Figure 3.5 Performance of ONEVaL on Cart-Pole. Expected (discounted) reward, averaged over 100 trials (solid), with the range of highest and lowest values in faded colors, for N = 5000 with M = 1

Table 3.3 shows the average time per iteration (averaged between episodes 500 and 1000 of a single run) for each method, for sample sizes of 1000 and 5000. The approximation methods have about the same per-iteration runtime for batch sizes of 1000 and 5000, suggesting scalability; in contrast, the full kernel method runtime almost doubles. Additionally, the nearest neighbor method's complexity is large in both cases. To compare memory requirements: KR requires $O(N^2)$ for the kernel matrix, while Nystrom (with $L$ selected columns) and RPBF require $O(NL)$ and $O(NJ)$ respectively to store the approximation parameters.
For L ≪ N and J ≪ N, there is a clear memory improvement of the approximation methods over the full kernel method. We also compare our algorithm with DQN, as presented in [44], in Fig. 3.6, which shows the average performance with the shaded region indicating the variance over 50 experiments. We use a 2-hidden-layer architecture with 24 nodes per layer and ReLU activations. We notice that when the underlying environment is noisy, there is a lot of variance in the performance of DQN, as also noted in [30].

Figure 3.6 ONEVaL vs DQN.

3.5 Conclusion

In this chapter, we have introduced universally applicable approximate dynamic programming algorithms for continuous state space MDPs with finite action spaces. The algorithms introduced are based on using randomization to improve computational tractability and reduce the 'curse of dimensionality' via the synthesis of the 'random function approximation' and 'empirical' approaches. Our first algorithm is based on random parametric function fitting, sampling the parameters in each iteration. The second is based on sampling states, which then yield a set of basis functions in an RKHS from the kernel.

Figure 3.7 Q-function over angle states. Q-function for a = −1 (move to the left). Rows (top to bottom): Ridge Regression, RKHS, Nystrom, RPBF, NN. Left: x = v = 0, episode 50. Middle: x = v = 0, episode 10. Right: x = −2.0, v = −2.5, episode 50.

Both function-fitting steps involve convex optimization problems and can be implemented with standard packages. Both algorithms can be viewed as iteration of a type of random Bellman operator followed by a random projection operator. Iterated random operators are, in general, difficult to analyze. Nevertheless, we can construct Markov chains that stochastically dominate the error sequences, which simplifies the analysis [24].
In fact, the introduced method may be viewed as a 'probabilistic contraction analysis' method, in contrast to stochastic Lyapunov techniques and other methods for analyzing stochastic iterative algorithms. It yields not only convergence but also non-asymptotic sample complexity bounds.

Next, we presented ONEVaL, an 'online' quasi-model-free reinforcement learning algorithm for continuous state space MDPs. It is essentially model-free, as it only needs access to a generative model, and hence is suitable for a variety of applications where RL training is conducted in simulation, for example robotics. The algorithm also comes with finite-time theoretical performance guarantees, and sample complexity bounds have been provided. It performs quite competitively in terms of actual runtime against deep RL algorithms such as DQN on some benchmark problems. In fact, given that it needs no hyper-parameter tuning, it may even be regarded as better in some sense. In the future, we plan to do a more thorough numerical performance investigation of the ONEVaL algorithm against the others on more complicated benchmark problems, including some robotic problems. We note that the algorithm is 'off-policy'; how to design efficient exploration schemes is part of future work.

3.6 Proofs

3.6.1 Supplement for Section 3.3

The following computation shows that T maps bounded functions to Lipschitz continuous functions when Q and c are both Lipschitz continuous in the sense of (3.1) and (3.2). Suppose ||v||_∞ ≤ v_max; then Tv is Lipschitz continuous with constant L_c + γ v_max L_Q. We have

|[Tv](s) − [Tv](s')| ≤ max_{a∈A} |c(s,a) − c(s',a)| + γ max_{a∈A} | ∫ v(y) Q(dy|s,a) − ∫ v(y) Q(dy|s',a) |
≤ L_c ||s − s'||_2 + γ v_max max_{a∈A} ∫ |Q(dy|s,a) − Q(dy|s',a)|
≤ (L_c + γ v_max L_Q) ||s − s'||_2.

3.6.2 Supplement for Section 3.3.3

First, we need to adapt [48, Lemma 3] to obtain point-wise error bounds on v_K − v* in terms of the errors {ε_k}_{k≥0}.
These bounds are especially useful when analyzing the performance of EVL with respect to other norms besides the supremum norm, since T does not have a contractive property with respect to any other norm. For any π∈ Π, we define the operator Q π :F (S)→F (S) (which gives the transition mapping as a function of π) via (Q π v) (s), Z S v (y)Q (dy|s, π (s)),∀s∈S. 68 Then we define the operator T π :F (S)→F (S) via [T π v] (s),c (s, π (s)) +γ Z S v (x)Q (dx|s, π (s)),∀s∈S. For later use, we let π ∗ ∈ Π be an optimal policy satisfying π ∗ (s)∈ arg min a∈A c (s, a) +γ Z S v ∗ (x)Q (dx|s, a) , ∀s∈S,, i.e., it is greedy with respect to v ∗ . More generally, a policy π∈ Π is greedy with respect to v∈F (S) if T π v =Tv. For use throughout this section, we let π k be a greedy policy with respect to v k so that T π k v k =Tv k for all k≥ 0. Then, for fixed K≥ 1 we define the operators A K , 1 2 Q π ∗ K +Q π K−1 Q π K−2 ···Q π 0 , A k , 1 2 Q π ∗ K−k−1 +Q π K−1 Q π K−2 ···Q π k+1 , fork = 0,..., K− 1, formed by composition of transition kernels. We let ~ 1 be the constant function equal to one on S, and we define the constant ˜ γ = 2(1−γ K+1 ) 1−γ for use shortly. We note that{A k } K k=0 are all linear operators and A k ~ 1 = ~ 1 for all k = 0,..., K. Lemma 15. For any K≥ 1, (i) v K −v ∗ ≤ P K−1 k=0 γ K−k−1 Q π ∗ K−k−1 ε k +γ K Q π ∗ K (v 0 −v ∗ ). (ii)v K −v ∗ ≥ P K−1 k=0 γ K−k−1 (Q π K−1 Q π K−2 ···Q π k+1 )ε k +γ K (Q π K−1 Q π K−2 ···Q π 0 ) (v 0 −v ∗ ). (iii)|v K −v ∗ |≤ 2 h P K−1 k=0 γ K−k−1 A k |ε k | +γ K A K (2v max ) i . 69 Proof. (i) For any k≥ 1, we have Tv k ≤ T π ∗ v k and T π ∗ v k −T π ∗ v ∗ = γQ π ∗ (v k −v ∗ ), so v k+1 −v ∗ = Tv k +ε k −T π ∗ v k +T π ∗ v k −T π ∗ v ∗ ≤γQ π ∗ (v k −v ∗ ) +ε k . The result then follows by induction. (ii) Similarly, for any k≥ 1, we have Tv ∗ ≤T π k v ∗ and Tv k −T π k v ∗ =T π k v k −T π k v ∗ = γQ π k (v k −v ∗ ), so v k+1 −v ∗ = Tv k +ε k −T π k v ∗ +T π k v ∗ −Tv ∗ ≥γQ π k (v k −v ∗ ) +ε k . 
Again, the result follows by induction. (iii) If f≤g≤h inF (S), then|g|≤|f| +|h|, so combining parts (i) and (ii) gives |v K −v ∗ |≤ 2 K−1 X k=0 γ K−k−1 A k |ε k | + 2γ K A K |v 0 −v ∗ |. Then we note that|v 0 −v ∗ |≤ 2v max . Now we use Lemma 15 to derive p−norm bounds. Proof. (of Lemma 6) Using P K k=0 γ k = 1−γ K+1 / (1−γ), we define the constants α k = (1−γ)γ K−k−1 1−γ K+1 ,∀k = 0,..., K− 1, α K = (1−γ)γ K 1−γ K+1 , 70 and we note that P K k=0 α k = 1. Then, we obtain|v K −v ∗ | ≤ ˜ γ " K−1 X k=0 α k A k |ε k | +α K A K (2v max ) # from Lemma 15(iii). Next, we computekv K −v ∗ k p p,ρ = Z S |v K (s)−v ∗ (s)| p ρ (ds) ≤ ˜ γ p Z S " K−1 X k=0 α k A k |ε k | +α K A K (2v max ) ~ 1 # p (s)ρ (ds) ≤ ˜ γ p Z S " K−1 X k=0 α k A k |ε k | p +α K A K (2v max ) p ~ 1 # (s)ρ (ds), using Jensen’s inequality and convexity ofx→|x| p . Now, we haveρA k ≤c ρ,μ (K−k− 1)μ for k = 0,..., K− 1 by Assumption 1(ii) and so for all k = 0,..., K− 1, Z S [A k |ε k | p ] (s)ρ (ds)≤c ρ,μ (K−k− 1)kε k k p p,μ . We arrive atkv K −v ∗ k p p,ρ ≤ ˜ γ p " K−1 X k=0 α k c ρ,μ (K−k− 1)kε k k p p,μ +α K (2v max ) p # = 2 p ˜ γ p−1 " K−1 X k=0 γ K−k−1 c ρ,μ (K−k− 1)kε k k p p,μ +γ K (2v max ) p # , 71 where we use|v 0 −v ∗ | p ≤ (2v max ) p . Now, by subadditivity of x→|x| t for t = 1/p∈ (0, 1] with p∈ [1,∞), assumption thatkε k k p,μ ≤ ε for all k = 0, 1,..., K− 1, and since P K−1 k=0 γ K−k−1 c ρ,μ (K−k− 1)≤C ρ,μ by Assumption 1(ii), we see kv K −v ∗ k p,ρ ≤ 2 1−γ K+1 1−γ !p−1 p h C 1/p ρ,μ ε +γ K/p (2v max ) i , which gives the desired result. Supremum norm error bounds follow more easily from Lemma 15. Proof. (Proof of Lemma 7) We have kv K −v ∗ k ∞ ≤ max{k K−1 X k=0 γ K−k−1 Q π ∗ K−k−1 ε k +γ K Q π ∗ K (v 0 −v ∗ )k ∞ , k K−1 X k=0 γ K−k−1 (Q π K−1 Q π K−2 ···Q π K+1 )ε k +γ K (Q π K−1 Q π K−2 ···Q π 0 ) (v 0 −v ∗ )k ∞ }, by Lemma 15. 
Now, k K−1 X k=0 γ K−k−1 Q π ∗ K−k−1 ε k +γ K Q π ∗ K (v 0 −v ∗ )k ∞ ≤ K−1 X k=0 γ K−k−1 kε k k ∞ +γ K kv 0 −v ∗ k ∞ 72 and k K−1 X k=0 γ K−k−1 (Q π K−1 Q π K−2 ···Q π K+1 )ε k +γ K (Q π K−1 Q π K−2 ···Q π 0 ) (v 0 −v ∗ )k ∞ ≤ K−1 X k=0 γ K−k−1 kε k k ∞ +γ K kv 0 −v ∗ k ∞ , where we use the triangle inequality, the fact that| (Qf) (s)|≤ R S |f (y)|Q (dy|s)≤kfk ∞ for any transition kernel Q onS and f∈F (S), and|v 0 −v ∗ |≤ 2v max . For any K≥ 1, kv K −v ∗ k ∞ ≤ K−1 X k=0 γ K−k−1 kε k k ∞ +γ K (2v max ). (3.13) Follows immediately since P K−1 k=0 γ K−k−1 ε≤ε/ (1−γ) for all K≥ 1. We emphasize that Lemma 7 does not require any assumptions on the transition proba- bilities, in contrast to Lemma 6 which requires Assumption 1(ii). 3.6.3 Supplement for Section 3.3.3: Function approximation We record several pertinent results here on the type of function reconstruction used in our EVL algorithms. The first lemma is illustrative of approximation results in Hilbert spaces, it gives an O 1/ √ J convergence rate on the error from using b F θ 1:J compared toF (Θ) inL 2,μ (S) in probability. Lemma 16. [57, Lemma 1] Fix f ∗ ∈F (Θ), for any δ ∈ (0, 1) there exists a function ˆ f∈ b F θ 1:J such that kf ∗ − ˆ fk 2,μ ≤ C √ J 1 + s 2 log 1 δ 73 with probability at least 1−δ. The next result is an easy consequence of [57, Lemma 1] and bounds the error from using b F θ 1:J compared toF (Θ) inL 1,μ (S). Lemma 17. Fix f ∗ ∈F (Θ), for any δ∈ (0, 1) there exists a function ˆ f∈ b F θ 1:J such that k ˆ f−f ∗ k 1,μ ≤ C √ J 1 + s 2 log 1 δ with probability at least 1−δ. Proof. Choosef, g∈F (S), then by Jensen’s inequality we havekf−gk 1,μ =E μ [|f (S)−g (S)|] =E μ (f (S)−g (S)) 2 1/2 ≤ r E μ h (f (S)−g (S)) 2 i . The desired result then follows by [57, Lemma 1]. Now we consider function approximation in the supremum norm. Recall the definition of the regression function f M (s) =E " min a∈A ( c (s, a) + γ M M X m=1 v (X s,a m ) )# , ∀s∈S. 
Then we have the following approximation result, for which we recall the constant κ := sup s∈S q K (s, s). Corollary 18. [62, Corollary 5] For any δ∈ (0, 1), kf z,λ −f M k H K 74 ≤ ˜ Cκ log (4/δ) 2 N ! 1/6 for λ = log (4/δ) 2 N ! 1/3 , with probability at least 1−δ. Proof. Uses the fact that for any f ∈H K ,kfk ∞ ≤ κkfk H K . For any s∈ S, we have |f (s)| =|hK (s,·), f (·)i H K | and subsequently |hK (s,·), f (·)i H K |≤kK (s,·)k H K kfk H K = q hK (s,·), K (s,·)i H K kfk H K = q K (s, s)kfk H K ≤ sup s∈S q K (s, s)kfk H K , where the first inequality is by Cauchy-Schwartz and the second is by assumption that K is a bounded kernel. The preceding result is about the error when approximating the regression function f M , butf M generally is not equal toTv. We bound the error betweenf M andTv as well in the next subsection. Proof. (Theorem 8) First we note that, by [24, Lemma A.1], [Y k+1 |Y k =η] is stochastically increasing in η for all k≥ 0, i.e. [Y k+1 |Y k =η]≤ st [Y k+1 |Y k =η 0 ] for all η≤ η 0 . Then, by [24, Lemma A.2], [X k+1 |X k =η,F k ]≤ st [Y k+1 |Y k =η] for all η∈f andF k for all k≥ 0. 75 (i) Trivially, X 0 ≤ st Y 0 since X 0 ≤ as Y 0 . Next, we see that X 1 ≤ st Y 1 by [24, Lemma A.1]. We prove the general case by induction. Suppose X k ≤ st Y k for k≥ 1, and for this proof define the random variable Y (θ) = max{θ− 1, 1}, w.p. q, K ∗ , w.p. 1−q. to be the conditional distribution of Y k conditional on θ, as a function of θ. We see that Y k+1 has the same distribution as [Y (θ)|θ =Y k ] by definition. SinceY (θ) are stochastically increasing by Lemma [24, Lemma A.1], we see that [Y (θ)|θ =Y k ]≥ st [Y (θ)|θ =X k ] by [61, Theorem 1.A.6] and our induction hypothesis. Now, [Y (θ)|θ =X k ]≥ st [X k+1 |X k ,F k ] by [61, Theorem 1.A.3(d)] and Lemma [24, Lemma A.2] for all historiesF k . It follows that Y k+1 ≥ st X k+1 by transitivity of≥ st . (ii) Follows from part (i) by the definition of≤ st . Proof. 
(Corollary 9) Since (Y k ) k≥0 is an irreducible Markov chain on a finite state space, its steady state distribution μ = (μ (i)) K ∗ i=1 onK exists. By [24, Lemma 4.3], the steady state distribution of (Y k ) k≥0 is μ = (μ (i)) K ∗ i=1 given by: μ (1) =q K ∗ −1 μ (i) = (1−q)q K ∗ −i , ∀i = 2,...,K ∗ − 1, μ (K ∗ ) = 1−q. 76 The constant μ min (q; K ∗ ) := min n q K ∗ −1 , (1−q)q (K ∗ −2) , (1−q) o , for allq∈ (0, 1) andK ∗ ≥ 1, which is the minimum of the steady state probabilities appears shortly in the Markov chain mixing time bound for (Y k ) k≥0 . We note that μ ∗ (q; K ∗ ) = (1−q)q K ∗ −1 ≤μ min (q; K ∗ ) is a simple lower bound forμ min (q; K ∗ ) (we definedμ ∗ (q; K ∗ ) = (1−q)q K ∗ −1 earlier). Now, recall thatkμ−νk TV = 1 2 P K ∗ η=1 |μ (η)−ν (η)| is the total variation distance for probability distributions onK. Let Q k be the marginal distribution of Y k for k≥ 0. By a Markov chain mixing time argument, e.g., [38, Theorem 12.3], we have that t mix (δ 0 ) := min n k≥ 0 :kQ k −μk TV ≤δ 0 o ≤ log 1 δ 0 μ min (q; K ∗ ) ! ≤ log 1 δ 0 (1−q)q K ∗ −1 ! for any δ 0 ∈ (0, 1). So, for K≥ log 1/ δ 0 (1−q)q K ∗ −1 we have|Pr{Y K = 1}−μ (1)| = |Pr{Y K = 1}−q K ∗ −1 |≤ 2kQ K −μk TV ≤ 2δ 0 , where we use μ (1) =q K ∗ −1 . By Theorem 8, Pr{X K = 1}≥ Pr{Y K = 1} and so Pr{X K = 1}≥q K ∗ −1 − 2δ 0 . 77 Choose q and δ 0 to satisfy q K ∗ −1 = 1/2 +δ/2 and 2δ 0 = q K ∗ −1 −δ = 1/2−δ/2 to get q K ∗ −1 − 2δ 0 ≥δ, and the desired result follows. 3.6.4 Bellman error The layout of this subsection is modeled after the arguments in [48], but with the added consideration of randomized function fitting. We use the following easy-to-establish fact. Fact 19. LetX be a given set, andf 1 : X→R andf 2 : X→R be two real-valued functions on X. Then, (i)| inf x∈X f 1 (x)− inf x∈X f 2 (x)|≤ sup x∈X |f 1 (x)−f 2 (x)|, and (ii)| sup x∈X f 1 (x)− sup x∈X f 2 (x)|≤ sup x∈X |f 1 (x)−f 2 (x)|. For example, Fact 19 can be used to show that T is contractive in the supremum norm. 
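Fact 19 and the resulting sup-norm contraction of T can be checked numerically on a small finite MDP. The sketch below generates an arbitrary (hypothetical) finite MDP and verifies that ||Tv_1 − Tv_2||_∞ ≤ γ ||v_1 − v_2||_∞ on random value functions.

```python
import random

rng = random.Random(0)
S, A, gamma = 5, 3, 0.9

# Arbitrary finite MDP: costs c[s][a] and transition rows Q[s][a] (each sums to 1).
c = [[rng.uniform(0, 1) for _ in range(A)] for _ in range(S)]
Q = []
for s in range(S):
    rows = []
    for a in range(A):
        w = [rng.uniform(0, 1) for _ in range(S)]
        tot = sum(w)
        rows.append([x / tot for x in w])
    Q.append(rows)

def T(v):
    """Bellman operator: [Tv](s) = min_a { c(s,a) + gamma * sum_y Q(y|s,a) v(y) }."""
    return [min(c[s][a] + gamma * sum(Q[s][a][y] * v[y] for y in range(S))
                for a in range(A)) for s in range(S)]

def sup_norm(u, v):
    return max(abs(x - y) for x, y in zip(u, v))

# Fact 19 gives |min_a f1(a) - min_a f2(a)| <= max_a |f1(a) - f2(a)|,
# which combined with the averaging in Q yields the gamma-contraction below.
for _ in range(200):
    v1 = [rng.uniform(-10, 10) for _ in range(S)]
    v2 = [rng.uniform(-10, 10) for _ in range(S)]
    assert sup_norm(T(v1), T(v2)) <= gamma * sup_norm(v1, v2) + 1e-9
```

Shifting a value function by a constant shifts Tv by γ times that constant, so the bound is tight for constant differences.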
The next result is about b T , it uses Hoeffding’s inequality to bound the estimation error between{˜ v (s n )} N n=1 and{[Tv] (s n )} N n=1 in probability. Lemma 20. For any p∈ [1,∞], f, v∈F (S; v max ), and ε> 0, Pr{|kf−Tvk p, ˆ μ −kf− ˜ vk p, ˆ μ |>ε} ≤ 2N|A| exp −2Mε 2 v 2 max ! . Proof. First we have|kf−Tvk p, ˆ μ −kf− ˜ vk p, ˆ μ | ≤ kTv− ˜ vk p, ˆ μ by the reverse triangle inequality. Then, for any s∈S we have|[Tv] (s)− ˜ v (s)| = max a∈A | c (s, a) +γ Z S v (x)Q (dx|s, a) 78 − ( c (s, a) + γ M M X m=1 v (X s,a m ) ) | ≤ γ max a∈A Z S v (x)Q (dx|s, a)− 1 M M X m=1 v (X s,a m ) by Fact 19. We may also take v (s)∈ [0, v max ] for all s∈ S by assumption on the cost function, so by the Hoeffding inequality and the union bound we obtain Pr max n=1,...,N |[Tv] (s n )− ˜ v (s n )|≥ε ≤ 2N|A| exp −2Mε 2 v max 2 ! and thus Pr{kTv− ˜ vk p, ˆ μ ≥ε} = Pr 1 N P N n=1 |[Tv] (s n )− ˜ v (s n )| p 1/p ≥ε ≤ Pr{max n=1,...,N |[Tv] (s n )− ˜ v (s n )|≥ε}, which gives the desired result. To continue, we introduce the following additional notation corresponding to a set of functionsF⊂F (S): •F s 1:N ,{(f (s 1 ),..., f (s N )) : f∈F}; •N ε,F s 1:N is theε−covering number ofF s 1:N with respect to the 1−norm on R N . 79 The next lemma uniformly bounds the estimation error between the true expectation and the empirical expectation over the set b F θ 1:J (in the following statement,e is Euler’s number). Lemma 21. For any ε> 0 and N≥ 1, Pr sup f∈ b F(θ 1:J ) 1 N N X n=1 f (S n )−E μ [f (S)] >ε ≤ 8e (J + 1) 2ev max ε J exp −Nε 2 128v 2 max ! . Proof. For anyF⊂F (S; v max ), ε> 0, and N≥ 1, we have Pr sup f∈ b F(θ 1:J ) 1 N N X n=1 f (S n )−E μ [f (S)] >ε ≤ 8E h N ε/8, b F θ 1:J s 1:N i exp −Nε 2 128v 2 max ! . It remains to boundE h N ε/8, b F θ 1:J s 1:N i . We note that b F θ 1:J is a subset of f (·) = J X j=1 α j φ (·; θ j ) : (α 1 ,..., α J )∈R J , which is a vector space with dimension J. By [2, Corollary 11.5], the pseudo-dimension of b F θ 1:J is bounded above by J. 
Furthermore, N ε, b F θ 1:J s 1:N ≤e (J + 1) 2ev max ε J by [29, Corollary 3] which gives the desired result. 80 To continue, we letv 0 =v 0 (v, N, M, J, μ, ν) denote the (random) output of one iteration of EVL applied tov∈F (S) as a function of the parametersN, M, J≥ 1 and the probability distributionsμ andν. The next theorem bounds the error betweenTv andv 0 in one iteration of EVL with respect toL 1,μ (S), it is a direct adaptation of [48, Lemma 1] modified to account for the randomized function fitting and the effective function space beingF θ 1:J . Theorem 22. Choosev∈F (S; v max ),ε> 0, andδ∈ (0, 1). Also chooseJ≥ h 5C ε 1 + q 2 log 5 δ i 2 , N≥ 2 7 5 2 ¯ v 2 max log h 40e(J+1) δ (10e ¯ v max ) J i , and M≥ v 2 max 2ε 2 log h 10N|A| δ i . Then, for v 0 = v 0 (v, N, M, J, μ, ν) we havekv 0 −Tvk 1,μ ≤ d 1,μ (Tv,F (Θ)) +ε with probability at least 1−δ. Proof. Letε 0 > 0 be arbitrary and choosef ∗ ∈F (Θ) such thatkf ∗ −Tvk 1,μ ≤ inf f∈F(Θ) kf− Tvk 1,μ +ε 0 . Then, choose ˆ f∈ b F θ 1:J such thatk ˆ f−Tvk 1,μ ≤kf ∗ −Tvk 1,μ +ε/5 with probability at least 1−δ/5 by Lemma 17 by choosing J≥ 1 to satisfy C √ J 1 + s 2 log 1 (δ/5) ! ≤ ε 5 ⇒J≥ 5C ε 1 + s 2 log 5 δ 2 . 81 Now consider the inequalities: kv 0 −Tvk 1,μ ≤kv 0 −Tvk 1, ˆ μ +ε/5 (3.14) ≤kv 0 − ˜ vk 1, ˆ μ + 2ε/5 (3.15) ≤k ˆ f− ˜ vk 1, ˆ μ + 2ε/5 (3.16) ≤k ˆ f−Tvk 1, ˆ μ + 3ε/5 (3.17) ≤k ˆ f−Tvk 1,μ + 4ε/5 (3.18) ≤kf ∗ −Tvk 1,μ +ε (3.19) ≤d 1,μ (Tv,F (Θ)) +ε +ε 0 . (3.20) First, note that inequality (3.16) is immediate sincekv 0 − ˜ vk 1, ˆ μ ≤kf− ˜ vk 1, ˆ μ for all f ∈ b F θ 1:J by the choice of v 0 as the minimizer in Step 3 of Algorithm 1. Second, inequalities (3.14) and (3.18) follow from Lemma 21 by choosing N≥ 1 to satisfy 8e (J + 1) 2ev max ε/5 ! J exp −N (ε/5) 2 128v 2 max ! ≤ δ 5 ⇒N≥ 2 7 5 2 ¯ v 2 max log " 40e (J + 1) δ (10e ¯ v max ) J # . Third, inequality (3.20) follows from the choice of f ∗ ∈F. 
Finally, inequalities (3.15) and (3.17) follow from Lemma 20 by choosing M≥ 1 to satisfy 2N|A| exp −2Mε 2 v max 2 ! ≤ δ 5 ⇒M≥ v 2 max 2ε 2 ! log " 10N|A| δ # . 82 Since ε 0 was arbitrary, the desired result then follows by the union bound. Using similar steps as Theorem 22, the next theorem bounds the error in one iteration of EVL with respect toL 2,μ (S). Theorem 23. Choosev∈F (S; v max ),ε> 0, andδ∈ (0, 1). Also chooseJ≥ h 5C ε 1 + q 2 log 5 δ i 2 , N≥ 2 7 5 2 ¯ v 4 max log h 40e(J+1) δ (10e ¯ v max ) J i , and M≥ v 2 max 2ε 2 log h 10N|A| δ i . Then, for v 0 = v 0 (v, N, M, J, μ, ν) we havekv 0 −Tvk 2,μ ≤ d 2,μ (Tv,F (Θ)) +ε with probability at least 1−δ. In the next lemma we show that we can make the bias between the regression function f M and the Bellman updateTv arbitrarily small uniformly overs∈S through the choice of M≥ 1. Lemma 24. For any ε> 0 and M≥ 1, kf M −Tvk ∞ ≤γ " ε + 2|A| exp −2Mε 2 v max 2 ! (v max −ε) # . Proof. For any s∈S, we compute |f M (s)− [Tv] (s)| ≤E[| min a∈A ( c (s, a) + γ M M X m=1 v (X s,a m ) ) − min a∈A n c (s, a) +γE X∼Q(·|s,a) [v (X)] o |] ≤γE " max a∈A | 1 M M X m=1 v (X s,a m )−E X∼Q(·|s,a) [v (X)]| # ≤γ " ε + 2|A| exp −2Mε 2 v max 2 ! (v max −ε) # , 83 where the second inequality follows from Fact 19 and the third is by the Hoeffding inequality. We make use of the following RKHS function fitting result for the one step Bellman error in the supremum norm. Theorem 25. Fixv∈F (S; v max ),ε> 0, andδ∈ (0, 1). Also chooseN≥ 2C K κ ε 6 log (4/δ) 2 and M≥ v 2 max 2(ε/4) 2 log 4|A|γ(vmax−ε/4) (4−2γ)ε , where C K is a constant independent of the dimension of S. Then for ˆ f λ , arg min f∈H K ( 1 N N X n=1 (f (S n )−Y n ) 2 +λkfk 2 H K ) , we havek ˆ f λ −Tvk ∞ ≤ε with probability at least 1−δ. Proof. By the triangle inequality,k ˆ f λ −Tvk ∞ ≤k ˆ f λ −f M k ∞ +kf M −Tvk ∞ . We choose N≥ 1 to satisfy C K κ log (4/δ) 2 N ! 
1/6 ≤ ε 2 ⇒N≥ 2C K κ ε 6 log (4/δ) 2 , so thatk ˆ f λ −f M k ∞ ≤ ε/2 with probability at least 1−δ by [62, Corollary 5] and the fact thatkfk ∞ ≤κkfk H K . Then, we choose M≥ 1 to satisfy γ " ε/4 + 2|A| exp −2M (ε/4) 2 v 2 max ! (v max −ε/4) # ≤ ε 2 ⇒M≥ v 2 max 2 (ε/4) 2 log 4|A|γ (v max −ε/4) (4− 2γ)ε ! , so thatkf M −Tvk ∞ ≤ε/2 by Lemma 24. 84 3.6.5 Supplement for Section 3.4.2 Let us present some auxiliary results for proving Lemma 13. Lemma 26. Choose > 0, δ∈ (0, 1). If M≥ 512v 2 max 4 log 4ev max 2 J 8e|A| (J + 1) δ then P sup f∈ b F(θ 1:J ) | b T M f(x)−Tf(x)|> <δ for all x∈X . Proof of Lemma 26. This is a direct application of Pollard’s inequality for next state samples since they are i.i.d. It states that ifF is a class of functionsX→ [−K,K], then P sup f∈F 1 M M X t=1 f(Z t )−E[f(Z)] > ! ≤ 8E[N (/8,F,Z 1:M )] exp − M 2 128K 2 ! (3.21) whereZ 1 ,Z 2 ,Z M are i.i.d. random variables andN denotes the empirical covering number. | b T M f(x)−Tf(x)| = max a∈A r(x,a) + γ M M X m=1 f(x x,a m ) − max a∈A [r(x,a) +γE[f(x x,a )]] ≤ max a∈A γ 1 M M X m=1 f(x x,a m )−E[f(x x,a )] 85 Since our function space has pseudo-dimesionJ, we can bound the covering number by using Corollary 3 from [28] as follows: E[N , b F θ 1:J ,Z 1:M ]≤e(J + 1) 4ev max J (3.22) The proof concludes by using (5.41) followed by a union bound over action spaceA. Note that|A| denotes the cardinality of the action spaceA. Since we assume finite actions,|A|<∞. Proof of Lemma 13. To begin, let 0 > 0 be arbitrary and choose f ∗ ∈F (Θ) such that kf ∗ −Tvk 2,μ ≤ inf f∈F(Θ) kf−Tvk 2,μ + 0 . Let e F ={f−Tg :f∈ b F θ 1:J ,g∈ b F θ 1:J } and b F T ={Tf : f∈ b F θ 1:J }. From the definition of covering numbers, we have for any random variables X 1:N =X 1 ,X 2 ,...X N : N , e F,X 1:N ≤N /2, b F θ 1:J ,X 1:N N /2, b F T ,X 1:N (3.23) Now, let us bound the covering number of function class b F T . 
For any functions f 1 ,f 2 ∈ b F θ 1:J , we have P N i=1 |Tf 1 (X i )−Tf 2 (X i )|≤NkTf−Tf 2 k ∞ ≤γkf 1 −f 2 k ∞ Note that, |f 1 (x)− f 2 (x)| = |α T φ(θ 1:J ;x)− β T φ(θ 1:J ;x)| ≤ |α− β| Since, the length of weights is 86 bounded by C/ √ J, the covering can now be bounded as: N , b F T ,X 1:N ≤ (2Cγ/) J . Using (x +y) 2 ≥x 2 +y 2 for x,y≥ 0, we have P sup b f∈ b F(θ 1:J ) sup g∈ b F(θ 1:J ) k b f−Tgk 2,μ −k b f−Tgk 2,ˆ μ >/5 ≤P sup b f∈ b F(θ 1:J ) sup g∈ b F(θ 1:J ) k b f−Tgk 2 2,μ −k b f−Tgk 2 2,ˆ μ > (/5) 2 ≤ 16e (J + 1) 200Cγev max 3 J exp −η N (/5) 4 512v 2 max + 2η N c 1 exp(−bl c N ) (3.24) where the last inequality follows by using Lemma 12 by bounding the covering numbers using (5.40) and [28, Corollary 3]. Furthermore, the β-mixing coefficient is O(exp(−bl c N )) Then, choose ˆ f∈ b F θ 1:J such thatk ˆ f−Tvk 2,μ ≤kf ∗ −Tvk 2,μ +/5 with probability at least 1−δ/5 by choosing J≥ 1 to satisfy C √ J 1 + s 2 log 1 (δ/5) ! ≤ 5 ⇒J≥ 5C 1 + s 2 log 5 δ 2 by [57, Lemma 1]. Now we have the following string of inequalities, each of which hold with probability 1−δ/5: k b Gv−Tvk 2,μ (a) ≤k b Gv−Tvk 2,ˆ μ +/5 (b) ≤k b Gv− b T M vk 2,ˆ μ + 2/5 (c) ≤k ˆ f− b T M vk 2,ˆ μ + 2/5 (d) ≤kf ∗ − b T M vk 2,ˆ μ + 3/5 (e) ≤kf ∗ −Tvk 2,ˆ μ + 4/5 (f) ≤kf ∗ −Tvk 2,μ + ≤ inf f∈F(Θ) kf−Tvk 2,μ + 0 +. 87 We chooseη N andl N from inequality (a) such that inequalities (f) hold with at least proba- bility 1−δ/5. Similarly, one can prove inequalities (b) and (e) using Lemma 45 followed by union bound argument on the state samples, giving us a bound onM. Inequality (c) follows from the fact that b G gives the least approximation error compared to any other function b f∈ b F θ 1:J . Inequality (d) is due to the choice ofJ. The last inequality is by the choice of f ∗ . Lemma 27. [48, Lemma 3] For any K ≥ 1, and > 0, supposek k k 2,μ ≤ for all k = 0, 1,..., K− 1, then kv K −v ∗ k 2,μ ≤ 2 1−γ K+1 1−γ !1 2h C 1/2 μ +γ K/2 (2v max ) i . 
(3.25) Choice of K ∗ : Now, from (3.25), we have kv K −v ∗ k 2,μ ≤ 2 1 1−γ !1 2h C 1/2 μ +γ K/2 (2v max ) i which gives a bound on K such that γ K/2 (2v max )≤C 1/2 μ . Denote K ∗ = log C 1/2 μ − log (2v max ) logγ . Proof of Lemma 14. First, we show that X k ≤ st Y k holds for all k ≥ 0. The stochastic dominance relation is the key to our analysis, since if we can show that Y K is “small” with high probability, thenX K must also be small and we infer thatkv K −v ∗ k 2,μ must be close to zero. By construction, X k ≤ st Y k for all k≥ 0 (see [24, Lemma A.1] and [24, Lemma A.2]). 88 Next, we compute the steady state distribution of{Y k } k≥0 and its mixing time. In par- ticular, chooseK so that the distribution ofY K is close to its steady state distribution. Since {Y k } k≥0 is an irreducible Markov chain on a finite state space, its steady state distribution μ ={μ (i)} K ∗ i=1 onK exists. By [24, Lemma 4.3], the steady state distribution of{Y k } k≥0 is μ ={μ (i)} K ∗ i=1 given by: μ (1) =q K ∗ −1 μ (i) = (1−q)q K ∗ −i , ∀i = 2,...,K ∗ − 1, μ (K ∗ ) = 1−q. The constant μ min (q; K ∗ ) = min n q K ∗ −1 , (1−q)q (K ∗ −2) , (1−q) o ∀q∈ (0, 1) andK ∗ ≥ 1 appears shortly in the Markov chain mixing time bound for{Y k } k≥0 . We note that (1−q)q K ∗ −1 ≤μ min (q; K ∗ ) is a simple lower bound for μ min (q; K ∗ ). Let Q k be the marginal distribution of Y k for k≥ 0. By a Markov chain mixing time argument, we have t mix (δ 0 ) , min n k≥ 0 :kQ k −μk TV ≤δ 0 o ≤ log 1 δ 0 μ min (q; K ∗ ) ! ≤ log 1 δ 0 (1−q)q K ∗ −1 ! for any δ 0 ∈ (0, 1). 89 Finally, we conclude the argument by using the previous part to find the probability that Y K = 1, which is an upper bound on the probability that X K = 1, which is an upper bound on the probability thatkv K −v ∗ k 2,μ is below our desired error tolerance. For K≥ log 1/ δ 0 (1−q)q K ∗ −1 we have|Pr{Y K = 1}−μ (1)|≤ 2δ 0 . Since X K ≤ st Y K , we have Pr{X K = 1}≥ Pr{Y K = 1} and so Pr{X K = 1}≥q K ∗ −1 − 2δ 0 . 
Choose q and δ_0 to satisfy q^{K*−1} = 1/2 + δ/2 and 2δ_0 = q^{K*−1} − δ = 1/2 − δ/2, to get q^{K*−1} − 2δ_0 ≥ δ, and the desired result follows.

Chapter 4
MDPs with continuous state space: Average reward criteria

4.1 Introduction

In this chapter, we will introduce an off-policy empirical (or approximate) relative value learning algorithm for computing optimal policies for non-parametric MDPs with continuous state space and the average reward criterion. Specifically, we aim to design RL algorithms that are universal (arbitrarily good approximation for any MDP), computationally simple, and easy to implement, while still admitting non-asymptotic sample complexity bounds, even if the guarantees are probabilistic. We would also like such algorithms to work well numerically. The algorithm we propose is inspired by the empirical dynamic programming (EDP) framework [24], an off-policy mini-batch empirical value learning algorithm for discounted MDPs with finite state and action spaces. The key idea is to replace the expectation in the Bellman operator with a sample average approximation obtained from a mini-batch of samples of the next state. Convergence analysis required a random operator framework and the construction of a Markov chain that stochastically dominates the error introduced by the approximation. An ERVL algorithm for average reward MDPs was proposed in [23], but only for finite state spaces. Our problem is non-parametric and substantially harder. The second inspiration is the work on non-parametric function approximation [18, 8]. [18] provides a uniform convergence guarantee for nearest neighbor regression under a smoothness assumption on the regression function. This enables us to plug in the function fitting part and give guarantees in the sup-norm. In this chapter, we combine both of the above ideas to develop the ERVL algorithm for non-parametric, average-reward MDPs with continuous state spaces.
This requires the use of non-parametric regression techniques to generalize the value iterates from a small set of sampled states to the entire state space. We are able to provide finite-time sample complexity bounds for the algorithm, which is near-universal (it requires only minimal assumptions on the class of MDPs we can handle) and yields arbitrarily good approximation with high probability. A kernel RL algorithm was introduced earlier in which the expectation operator is approximated by local averagers [50].

4.2 Preliminaries

Consider an MDP with state space X and action space A. We assume that X is a compact subset of R^d and A is finite. Let C_B(X) be the set of continuous and bounded functions over X. The transition probability kernel is given by P(·|x,a), i.e., if action a is executed in state x, the probability that the next state lies in a Borel-measurable set B is P(X_{t+1} ∈ B | X_t = x, a_t = a). The reward function is r : X × A → R. For a stationary policy π : X → A, we are interested in maximizing the long-run average expected reward defined as

J^π(x) = liminf_{T→∞} (1/T) E[ Σ_{t=0}^{T−1} r(x_t, a_t) | x_0 = x, a_t = π(x_t) ].

Let J* = sup_π J^π(x). A policy π* is said to be optimal if it satisfies J^{π*}(x) = J* for all x ∈ X. We make the following assumptions.

Assumption 7. (a) For every (x,a), |r(x,a)| ≤ r_max, and for every a, r(·,a) is continuous. (b) For every a, the transition kernel P(·|x,a) is continuous in x. (c) There exists α < 1 such that sup_{(x,a),(x',a')} ||P(·|x,a) − P(·|x',a')||_TV ≤ 2α, where ||·||_TV denotes the total variation norm.

Assumption 7(a) establishes that for every a, r(·,a) ∈ C_B(X); (b) implies that if v ∈ C_B(X), then for any action a, ∫ v(y) P(dy|·,a) ∈ C_B(X); and (c) implies that under any stationary and deterministic policy, the t-step transition probability converges to a unique invariant probability measure (over the state process {x_t}) in total variation norm, uniformly in x and at a geometric rate.
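The long-run average reward J^π defined above can be estimated by simulating a single long trajectory. A minimal pure-Python sketch on a hypothetical two-state chain (its identical transition rows trivially satisfy Assumption 7(c) with α = 0; the chain and rewards are illustrative, not from the text):

```python
import random

def average_reward(P, r, policy, x0=0, T=100_000, seed=1):
    """Estimate J^pi(x0) = liminf (1/T) E[sum_t r(x_t, a_t)] by simulation.

    P[x][a][y] is the probability of moving to state y from state x under
    action a; r[x][a] is the reward; policy[x] is the action taken in x.
    """
    rng = random.Random(seed)
    x, total = x0, 0.0
    for _ in range(T):
        a = policy[x]
        total += r[x][a]
        # Draw the next state from the row P[x][a] by inverse-CDF sampling.
        u, cum = rng.random(), 0.0
        for y, p in enumerate(P[x][a]):
            cum += p
            if u < cum:
                x = y
                break
    return total / T

# Hypothetical two-state chain with identical rows; the stationary
# distribution is (1/2, 1/2), so J^pi = 0.5 from any starting state.
P_toy = [[[0.5, 0.5]], [[0.5, 0.5]]]
r_toy = [[1.0], [0.0]]
```

The geometric ergodicity guaranteed by Assumption 7(c) is what makes such a single-trajectory average converge to the same J^π regardless of x0.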
Under these assumptions, there exists (J*, v*) ∈ R × C_B(X) such that the following optimality equation holds:

J* + v*(x) = sup_{a ∈ A} [ r(x,a) + ∫ v*(x') P(dx'|x,a) ].   (4.1)

Define the Bellman operator T : C_B(X) → C_B(X) as

Tv(x) = max_{a ∈ A} [ r(x,a) + E_{x' ∼ P(·|x,a)} v(x') ].

Hence, J* = Tv* − v*. Note that v* is unique up to a constant. The exact computation of the expectation in the Bellman operator is infeasible over a continuous state space for most problems, so we replace it by a sample average approximation, which leads us to define the empirical Bellman operator T̂_M as

T̂_M v(x) = max_{a ∈ A} [ r(x,a) + (1/M) Σ_{i=1}^M v(x'_i) ],

where x'_i ∼ P(·|x,a) for i = 1, 2, ..., M. Note that T̂_M v is a biased estimator of Tv, i.e., E_{x'}[T̂_M v] ≠ Tv.

Iteration on a Quotient Space. Let us now define the span semi-norm and the quotient space. For a function f ∈ C_B(X), span(f) := sup_x f(x) − inf_x f(x). Clearly this is a semi-norm, and span(f) = 0 for any constant function f. Define an equivalence relation ∼ on C_B(X) by f ∼ g if and only if there exists a constant c such that f(x) − g(x) = c for all x ∈ X. Let C̃_B(X) = C_B(X)/∼ be the quotient space. The following fact about the quotient space is then not difficult to show [23].

Lemma 28. (C̃_B(X), span) is a Banach space.

The proof is available in [23] and hence omitted. Furthermore, we can show that the operator T is a contraction in the span semi-norm. The next theorem is from [31]; the proof is given in the supplementary section for the sake of completeness.

Theorem 29. Suppose that Assumption 7 holds. Then the operator T : C_B(X) → C_B(X) is a span-contraction operator, i.e., span(Tv_1 − Tv_2) ≤ α span(v_1 − v_2), where v_1, v_2 ∈ C_B(X) and α is defined in Assumption 7(c).

Now consider v ∈ C_B(X), let ṽ be the corresponding element of C̃_B(X), and define T̃ : C̃_B(X) → C̃_B(X) by T̃ṽ = (Tv)~.
Since T is a span-contraction on C_B(X), T̃ is a contraction on the Banach space C̃_B(X), which by the Banach fixed point theorem has a unique fixed point; this fixed point can be found by a simple iterative procedure on the quotient space that is easy to translate into an operation on the space C_B(X). Note that the empirical Bellman operator T̂_M is not a span-contraction, since the contraction coefficient α_M in span(T̂_M v_1 − T̂_M v_2) ≤ α_M span(v_1 − v_2) is random. But α_M < 1 always, and intuitively we can expect T̂_M to probabilistically contract to a probabilistic fixed point [24]. Indeed, we introduce the following definition.

Definition 9. A function v ∈ C_B(X) is an (ε, δ)-weak probabilistic fixed point for a sequence of random operators {Ĝ_{N,M}} with respect to a given norm ||·|| if there exist N_0(ε,δ), M_0(ε,δ) and K_0(ε,δ) such that for all N > N_0(ε,δ), M > M_0(ε,δ) and all k > K_0(ε,δ),

P( ||Ĝ^k_{N,M} v_0 − v|| > ε ) < δ for all v_0 ∈ C_B(X).

Let us now define a function space G = {f : X → R}. Let Π^N_G be the projection operator (why it is indexed by N will become clear later) which projects a function onto the space G. While various non-parametric function spaces can be considered, we will choose nearest neighbors for function approximation (other non-parametric function approximation methods, e.g., kernel regression, will also work). In each case, we can show that the composition operator Π^N_G T̂_M is a contraction with respect to the span semi-norm. We can also establish a probabilistic bound on the one-step error for each such method. Furthermore, note that for any function f, span(f) ≤ 2||f||_∞.

4.3 Algorithm and analysis with truncation

As mentioned earlier, the iterates in MDPs with the average reward criterion are not bounded. In this section we will focus on methods which deploy a truncation operator in the algorithm to ensure boundedness of the iterates.
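For intuition, the span semi-norm and the (truncation-free) empirical Bellman operator T̂_M can be sketched on a finite grid of sampled states. This is a toy sketch: the reward table and next-state sampler are user-supplied placeholders, whereas the dissertation's operator acts on a continuous state space.

```python
import random

def span(v):
    """Span semi-norm from Section 4.2: span(v) = max v - min v (0 on constants)."""
    return max(v) - min(v)

def empirical_bellman(v, r, sample_next, M, rng):
    """One sweep of the empirical Bellman operator T_M over sampled states.

    The expectation over next states is replaced by an average of M sampled
    next states. r[x][a] is the reward table; sample_next(x, a, rng) draws
    one next-state index.
    """
    n_states, n_actions = len(r), len(r[0])
    return [max(r[x][a] + sum(v[sample_next(x, a, rng)] for _ in range(M)) / M
                for a in range(n_actions))
            for x in range(n_states)]
```

For a constant seed function v ≡ c, the sampling noise cancels and T̂_M v(x) = max_a r(x,a) + c exactly, so span(T̂_M v) = span(max_a r(·,a)): the span semi-norm ignores the additive constant, which is precisely why the iteration is carried out on the quotient space.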
4.3.1 The Algorithm and Main Result

We now present the Empirical Relative Value Learning (ERVL) algorithm, a mini-batch off-policy algorithm for non-parametric MDPs with continuous state space. It is an 'empirical' or mini-batch variant of the relative value iteration algorithm for continuous state space MDPs. Note that if $\alpha$ is the contraction parameter, then $\mathrm{span}(v^*) \leq \frac{r_{\max}}{1-\alpha} = v_{\max}$. Since it is not possible to argue that the iterates of relative value iteration are bounded, we introduce the truncation operator $\Gamma$:
$$\Gamma v := \begin{cases} v - \min v, & \text{if } \mathrm{span}(v) \leq \frac{r_{\max}}{1-\alpha}, \\[4pt] \dfrac{v - \min v}{\mathrm{span}(v)}, & \text{otherwise.} \end{cases}$$
The operator $\Gamma \widehat{T}_M$ maintains the contraction property of $\widehat{T}_M$, since $\Gamma$ is just a projection operator [23]. With a slight abuse of notation, we will denote by $\widehat{T}_M$ the empirical Bellman operator with truncation in the subsequent sections. Let us denote the composed operator by $\widehat{G}_{N,M} = \Pi^N_{\mathcal{G}} \widehat{T}_M$. Algorithm 4 iterates the random operator $\widehat{G}_{N,M}$, i.e., $v_{k+1} = \widehat{G}_{N,M} v_k = \widehat{G}^{k+1}_{N,M} v_0$. Note that this operator depends on the sample sizes $N$ and $M$. Note also that $\widehat{T}_M$ contracts probabilistically, i.e., with high probability it is a contraction. Further, we will argue that $\Pi^N_{\mathcal{G}}$ is non-expansive, which implies that the random operator $\widehat{G}_{N,M}$ also contracts probabilistically.

Function Approximation using nearest neighbors. $k$-NN is a powerful yet simple approach to non-parametric regression. In this case, we first fix $x \in \mathcal{X}$ and reorder the samples $\{x_1, x_2, \ldots, x_N\}$ according to increasing distance of $x_i$ from $x$. Let the reordered samples be $\{x_{(i)}\}$ for $1 \leq i \leq N$. We then pick the $k$ nearest neighbors and estimate the function as
$$v(x) = \Pi^N_{\mathcal{G}} \hat{v} = \frac{1}{k} \sum_{i=1}^{k} \hat{v}(x_{(i)}).$$
It is worth mentioning that for a smoother average of the outputs, one could use kernel regression, which also scales better with dimension.

Now we specify the ERVL algorithm. We first sample $N$ points from $\mathcal{X}$ uniformly (or according to another probability measure).
Then we perform an 'empirical' value iteration step on these sampled points by obtaining mini-batches of next states. Then we truncate the function at these points to ensure boundedness. Finally, we do function-fitting using nearest neighbors, which gives us the next iterate of the value function.

Algorithm 4 ERVL
Input: sample sizes $N \geq 1$; $M \geq 1$; initial seed $v_0$; total iterations $K \geq 1$.
For $k = 1, \ldots, K$:
1. Sample $\{x_n\}_{n=1}^N$ from $\mathcal{X}$ uniformly
2. Value iteration: $\hat{v}'_k(x_n) \leftarrow \widehat{T}_M v_{k-1}(x_n)$ for $n = 1, 2, \ldots, N$
3. Truncation: $\hat{v}_k(\cdot) = \Gamma \hat{v}'_k(\cdot)$
4. Function fitting: $v_k \leftarrow \Pi^N_{\mathcal{G}} \hat{v}_k$
5. Increment $k \leftarrow k+1$ and return to Step 1.

We can now establish that the iterates $v_k$ of the algorithm are an $(\epsilon,\delta)$-weak probabilistic fixed point of the operator $\widehat{G}_{N,M} = \Pi^N_{\mathcal{G}} \widehat{T}_M$, and hence a good approximation to $v^*$, the fixed point of $T$, in the span semi-norm with high probability if $N$, $M$ and $k$ are large enough.

Theorem 30. Given $\epsilon, \delta > 0$, choose $\delta_1, \delta_2 > 0$ such that $\delta_1 + 2\delta_2 < \delta$. Let $\kappa^* = \left\lceil \frac{2 v_{\max}}{\epsilon} \right\rceil$ and let $\mu_{\min}$ be given by
$$\mu_{\min} = \min\left\{ \delta_1^{\kappa^*},\ (1-\delta_1),\ (1-\delta_1)\delta_1,\ \cdots,\ (1-\delta_1)\delta_1^{\kappa^*-1} \right\}.$$
Then for any
$$k \geq \frac{\log(1/\delta_2)}{\mu_{\min}}, \qquad N \geq 2\left(\frac{16 v_{\max} C}{\epsilon}\right)^{2d}\left(\log\frac{2}{\delta} + d \log\frac{32 v_{\max} C}{\epsilon}\right) \qquad \text{and} \qquad M \geq \frac{v_{\max}^2}{2(\epsilon/8)^2}\log\frac{16 |A| v_{\max}}{\epsilon},$$
we have $\mathbb{P}\left(\mathrm{span}(v_k - v^*) > \epsilon\right) \leq \delta$.

4.3.2 Convergence Analysis

We now prove Theorem 30. The analysis proceeds in three steps. First, we bound the function approximation error; next, we bound the one-step error of the algorithm by viewing each iteration as the application of a random operator; and then we use a stochastic dominance argument to establish convergence and obtain the rate of convergence.

Error Analysis: Function Approximation with nearest neighbors. We first define the regression function $f_M : \mathcal{X} \to \mathbb{R}$ via
$$f_M(x) \triangleq \mathbb{E}\left[ \max_{a \in \mathcal{A}} \left\{ r(x,a) + \frac{1}{M} \sum_{m=1}^{M} v(x_m) \right\} \right] \tag{4.2}$$
where the $x_m$ are the next states generated given $x$ and $a$. It is the expected value of our empirical estimator of $Tv$. As expected, $f_M \to Tv$ as $M \to \infty$. We note that $f_M$ is not necessarily equal to $Tv$, by Jensen's inequality.
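The bias $f_M \neq Tv$ is easy to see numerically: the expectation of a maximum of sample means exceeds the maximum of the expectations, and the gap shrinks as $M$ grows. A small Monte Carlo sketch of this Jensen gap, on a toy two-action problem that is not from the text (zero rewards, next-state values distributed Uniform(0,1) under both actions, so $Tv = 0.5$):

```python
import random

random.seed(1)

def f_M_minus_Tv(M, trials=4000):
    """Monte-Carlo estimate of f_M(x) - Tv(x) for a toy state with two actions.
    Under both actions v(x') ~ Uniform(0,1) with mean 0.5, so Tv = 0.5."""
    total = 0.0
    for _ in range(trials):
        mean_a = sum(random.random() for _ in range(M)) / M   # sample mean, action a
        mean_b = sum(random.random() for _ in range(M)) / M   # sample mean, action b
        total += max(mean_a, mean_b)                          # empirical Bellman max
    return total / trials - 0.5                               # bias relative to Tv

bias_small, bias_large = f_M_minus_Tv(5), f_M_minus_Tv(500)
print(bias_small, bias_large)   # both positive; the bias shrinks with M
```

This is exactly the effect Lemma 31 below controls: the bias is positive for finite $M$ but decays as the mini-batch size grows.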
In the next lemma we show that we can make the bias between the regression function $f_M$ and the Bellman update $Tv$ arbitrarily small, uniformly over $x \in \mathcal{X}$, through the choice of $M \geq 1$.

Lemma 31. For any $\epsilon > 0$ and $M \geq 1$,
$$\|f_M - Tv\|_\infty \leq \epsilon + 2|A| v_{\max} \exp\left( \frac{-2 M \epsilon^2}{v_{\max}^2} \right).$$

Proof. For any $x \in \mathcal{X}$, we compute
$$|f_M(x) - Tv(x)| \leq \mathbb{E}\left[ \left| \max_{a \in \mathcal{A}} \left\{ r(x,a) + \frac{1}{M}\sum_{m=1}^M v(x_m) \right\} - \max_{a \in \mathcal{A}} \left\{ r(x,a) + \mathbb{E}_{x' \sim P(\cdot|x,a)}[v(x')] \right\} \right| \right]$$
$$\leq \mathbb{E}\left[ \max_{a \in \mathcal{A}} \left| \frac{1}{M}\sum_{m=1}^M v(x^{x,a}_m) - \mathbb{E}_{x' \sim P(\cdot|x,a)}[v(x')] \right| \right] \leq \epsilon + 2|A| v_{\max} \exp\left( \frac{-2 M \epsilon^2}{v_{\max}^2} \right)$$
where the last inequality is due to Hoeffding's inequality. $\square$

Let us denote the number of nearest neighbors by $k_N$, since $k$ denotes the iteration number. We can easily establish that $\Pi^N_{\mathcal{G}}$ is in this case a non-expansive mapping:
$$\|\Pi^N_{\mathcal{G}} \hat{v}_1 - \Pi^N_{\mathcal{G}} \hat{v}_2\|_\infty \leq \|\hat{v}_1 - \hat{v}_2\|_\infty. \tag{4.3}$$
Note that if $\min \hat{v}_1 = \min \hat{v}_2 = 0$, then the above inequality also holds in the span semi-norm. The next lemma, from [18], presents the rate of convergence in sup-norm for $k$-NN under an additional assumption of Lipschitz continuity.

Lemma 32. If, for any $v \in \widetilde{C}(\mathcal{X})$, $f_M$ is Lipschitz with constant $C$, then for $\delta, \epsilon > 0$,
$$k_N = \frac{\gamma N}{2}\left(\frac{\epsilon}{4 v_{\max} C}\right)^d \ \text{for } 0 \le \gamma \le 1 \qquad \text{and} \qquad N \geq N_0(\epsilon,\delta) = 2\left(\frac{8 v_{\max} C}{\epsilon}\right)^{2d} \log\left(\frac{2}{\delta}\left(\frac{16 v_{\max} C}{\epsilon}\right)^d\right)$$
imply $\mathbb{P}\left( \|\widehat{G}_{N,M} v - f_M\|_\infty \geq \epsilon \right) \leq \delta$.

The following lemma provides a probabilistic bound on the function approximation error.

Lemma 33. Given $v \in C_B(\mathcal{X})$, $\epsilon > 0$, and $\delta \in (0,1)$, choose $N \geq N_0(\epsilon,\delta)$ and $M \geq M_0(\epsilon) = \frac{v_{\max}^2}{2(\epsilon/4)^2}\log\frac{8|A| v_{\max}}{\epsilon}$. Then we have
$$\mathbb{P}\left( \|\widehat{G}_{N,M} v - Tv\|_\infty \geq \epsilon \right) \leq \delta.$$

Proof. By the triangle inequality,
$$\|\widehat{G}_{N,M} v - Tv\|_\infty \leq \|\widehat{G}_{N,M} v - f_M\|_\infty + \|f_M - Tv\|_\infty.$$
Note that $\widehat{G}_{N,M}$ minimizes the empirical loss over random samples. From Lemma 32, if $N \geq N_0(\epsilon/2, \delta)$ then with probability $1-\delta$, $\|\widehat{G}_{N,M} v - f_M\|_\infty < \epsilon/2$. From Lemma 31, if
$$M \geq \frac{v_{\max}^2}{2(\epsilon/4)^2}\log\left(\frac{8|A| v_{\max}}{\epsilon}\right),$$
then $\|f_M - Tv\|_\infty \leq \epsilon/2$. Hence $\|\widehat{G}_{N,M} v - Tv\|_\infty \leq \epsilon$ with probability at least $1-\delta$ if $N$ and $M$ are chosen appropriately. $\square$

One-step error analysis of the random operator.
We now analyze iteration of the random operator $\widehat{G}_{N,M} = \Pi^N_{\mathcal{G}} \widehat{T}_M$ to bound the error in one iteration. In particular, we are interested in the stochastic process $\{\mathrm{span}(v_k - v^*)\}_{k \geq 0}$:
$$\mathrm{span}(v_k - v^*) \leq \mathrm{span}(\widehat{G}_{N,M} v_{k-1} - T v_{k-1}) + \mathrm{span}(T v_{k-1} - T v^*) \leq \mathrm{span}(\widehat{G}_{N,M} v_{k-1} - T v_{k-1}) + \alpha\, \mathrm{span}(v_{k-1} - v^*).$$
As mentioned earlier, the composed operator $\widehat{G}_{N,M}$ is a probabilistic contraction in the span semi-norm. Let
$$p(\epsilon, N, M) \triangleq \mathbb{P}\left( \mathrm{span}(\widehat{G}_{N,M} v - Tv) \leq \epsilon \right), \quad \forall v \in C_B(\mathcal{X}). \tag{4.4}$$
For compact notation, we write $p(\epsilon,N,M)$ as $p_{N,M}$. Note that if $\mathrm{span}(v_{k-1} - v^*) = \eta\epsilon$ and $\mathrm{span}(\widehat{G}_{N,M} v_{k-1} - T v_{k-1}) < \epsilon$, then using the previous decomposition we have $\mathrm{span}(v_k - v^*) \leq (\alpha\eta + 1)\epsilon$ with probability $p_{N,M}$. Now define
$$\eta^* \triangleq \min\{\eta \in \mathbb{N} : \lceil \alpha\eta + 1 \rceil < \eta\} = \left\lceil \frac{2}{1-\alpha} \right\rceil.$$
Now, to track the progress of the error process, we construct a Markov chain which stochastically dominates it. The one-step transition probabilities of this chain depend on $p_{N,M}$, as will become clear in the following section. Let us first define the concept of stochastic dominance.

Definition 10. Let $X$ and $Y$ be two random variables. Then $Y$ stochastically dominates $X$, written $X \leq_{st} Y$, when $\mathbb{P}(X \geq \theta) \leq \mathbb{P}(Y \geq \theta)$ for all $\theta$ in the support of $Y$.

The iterates of Algorithm 4 are bounded in the span semi-norm by $v_{\max}$. Choose $\epsilon > 0$ and let $\kappa^* = \lceil \frac{2 v_{\max}}{\epsilon} \rceil$. Define the error process $\{X_k\}_{k \geq 0}$ as follows:
$$X_k = \begin{cases} \eta, & \text{if } \eta\epsilon \leq \mathrm{span}(v_k - v^*) < (\eta+1)\epsilon, \\ 0, & \text{if } \mathrm{span}(v_k - v^*) = 0. \end{cases}$$
Next, we construct a Markov chain which stochastically dominates the process $\{X_k\}_{k \geq 0}$. Similar to [24], we define the Markov chain as follows:
$$Y_k = \begin{cases} \max\{Y_{k-1} - 1, \eta^*\}, & \text{w.p. } p_{N,M}, \\ \kappa^*, & \text{w.p. } 1 - p_{N,M}. \end{cases}$$
Note that the choice of $p_{N,M}$ is governed by the underlying function space and the sample sizes $N$ and $M$. Next, we establish stochastic dominance of the Markov chain $\{Y_k\}_{k \geq 0}$ over the error process $\{X_k\}_{k \geq 0}$, provided that both stochastic processes start in the same initial state.

Lemma 34. [24, Theorem 4.1] For all $k \geq 0$, $X_k \leq_{st} Y_k$ if $X_0 = Y_0$.
Furthermore, the process $\{Y_k\}_{k \geq 0}$ is a finite-state, irreducible Markov chain, and hence has a steady-state distribution. The state space of this chain is $\{\eta^*, \eta^*+1, \eta^*+2, \ldots, \kappa^*\}$. Let $\mu$ denote the steady-state distribution of the Markov chain $\{Y_k\}_{k \geq 0}$. It can be computed as follows:
$$\mu(0) = \delta_1^{\kappa^*}, \qquad \mu(\kappa^*) = 1 - \delta_1, \qquad \mu(i) = (1-\delta_1)\,\delta_1^{\kappa^* - i} \ \ \forall\, 0 < i < \kappa^*,$$
and let $\mu_{\min} = \min_{0 \leq i \leq \kappa^*} \mu(i)$. We are now ready to prove the main theorem, which relies on analyzing the dominating Markov chain $\{Y_k\}_{k \geq 0}$ and its mixing time.

Proof. First recall that $\mathrm{span}(f) \leq 2\|f\|_\infty$. Since the one-step error is bounded for $N_0(\epsilon/2, \delta_1)$ and $M_0(\epsilon/2)$, we construct our Markov chain $\{Y_k\}_{k \geq 0}$ with $p_{N,M} = 1 - \delta_1$. Note that in (4.4) we defined $p_{N,M}$ for a deterministic value function, whereas after iteration $k$, $v_k$ is a random function. Because the samples generated are independent across iterations, however, (4.4) extends to randomized functions. Let us denote the transition matrix for $\{Y_k\}_{k \geq 0}$ by $P_Y$, given as follows:
$$P_Y = \begin{bmatrix} \delta_1 & 0 & 0 & \cdots & 0 & 1-\delta_1 \\ \delta_1 & 0 & 0 & \cdots & 0 & 1-\delta_1 \\ 0 & \delta_1 & 0 & \cdots & 0 & 1-\delta_1 \\ \vdots & & & \ddots & & \vdots \\ 0 & 0 & 0 & \cdots & \delta_1 & 1-\delta_1 \end{bmatrix}.$$
The eigenvalues of $P_Y$ are 0 and 1. Then, from [37, Theorem 12.3], we can bound the mixing time for any $\delta_2 > 0$ as
$$t_{\mathrm{mix}}(\delta_2) \leq \frac{\log(1/\delta_2)}{\mu_{\min}}.$$
Using Theorem 3.1 from [24], we conclude the theorem. $\square$

4.3.3 Numerical Performance

We now show numerical performance on a benchmark machine replacement problem. This problem has been studied in the discounted setting [48]; we work out the details under the average reward criterion. In this problem, the state space is the non-negative real numbers, and two actions are available in each state: keep the machine, or replace it. Let the action of keeping the machine be denoted by 0 and replacement by 1. The transition dynamics are given as follows:
$$P(s'|s,0) = \begin{cases} \beta \exp(-\beta(s'-s)), & \text{if } s' \geq s, \\ 0, & \text{otherwise,} \end{cases} \qquad P(s'|s,1) = \begin{cases} \beta \exp(-\beta s'), & \text{if } s' \geq 0, \\ 0, & \text{otherwise.} \end{cases}$$
If we decide to keep the machine, we must pay a maintenance cost that increases with the state; for replacement, we pay a fixed amount. Hence, the reward function is given by $r(s,0) = -\alpha s$ and $r(s,1) = -C$. The optimality equation is
$$J^* + v^*(s) = \max(T_0, T_1)$$
where
$$T_0 = -\alpha s + \int_s^\infty \beta \exp(-\beta(s'-s))\, v^*(s')\, ds' \qquad \text{and} \qquad T_1 = -C + \int_0^\infty \beta \exp(-\beta s')\, v^*(s')\, ds'.$$
One could guess that the optimal policy is a threshold policy, i.e., there exists an $\bar{s}$ such that
$$\pi^*(s) = \begin{cases} 0, & \text{if } s \leq \bar{s}, \\ 1, & \text{otherwise.} \end{cases}$$
For $s \in [0, \bar{s}]$,
$$J^* + v^*(s) = -\alpha s + \int_s^\infty \beta \exp(-\beta(y-s))\, v^*(y)\, dy.$$
Differentiating both sides, we have
$$(v^*)'(s) = -\alpha + \beta^2 \exp(\beta s) \int_s^\infty \exp(-\beta y)\, v^*(y)\, dy - \beta v^*(s) = -\alpha + \alpha\beta s + \beta J^*.$$
Recall that for $s \geq \bar{s}$, the value function does not depend on $s$. Hence for $s = \bar{s}$ we have
$$J^* + v^*(\bar{s}) = -\alpha\bar{s} + \beta \int_{\bar{s}}^\infty \exp(-\beta(y - \bar{s}))\, v^*(y)\, dy = -\alpha\bar{s} + v^*(\bar{s}).$$
Hence, $J^* = -\alpha\bar{s}$. To compute $\bar{s}$, we need to solve the following equation:
$$\int_0^{\bar{s}} \beta\left( \frac{\alpha\beta}{2} s^2 - \alpha s (1 + \beta\bar{s}) \right) \exp(-\beta s)\, ds + \int_{\bar{s}}^\infty \beta\left( -\alpha\bar{s} - \frac{\alpha\beta}{2}\bar{s}^2 \right) \exp(-\beta s)\, ds + 2\alpha\bar{s} + \frac{\alpha\beta}{2}\bar{s}^2 - C = 0.$$
For our experiments, we use $\beta = 2/3$, $\alpha = 3$, $C = 15$. This gives the optimal policy $\pi^*(s) = 0$ if $s \leq 2.654$ and $\pi^*(s) = 1$ otherwise, with $J^* = -7.962$. Note that one could use any reference state instead of the minimum of the function. For each iteration $k$, one can compute $\pi_k$, the greedy policy with respect to $v_k$. We used the Gaussian kernel $K(x,y) = \exp(-\gamma(x-y)^2)$. Fig. 4.1 presents the error $|J^{\pi_k} - J^*|$ against iteration $k$ for kernel regression with regularization with $N = 200$.

Figure 4.1 Optimal replacement with nearest neighbors

4.4 Algorithm and analysis without truncation

In the previous section, we proposed the empirical relative value learning (ERVL) algorithm, Algorithm 4. The truncation operator $\Gamma$ was introduced to ensure boundedness of the iterates. In order to use $\Gamma$, we need to know the contraction parameter $\alpha$, which depends on the total variation norm of the underlying transition kernel.
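The threshold equation for $\bar{s}$ above has no closed form, but the two exponential integrals do, so it reduces to one-dimensional root finding. A sketch that recovers the reported values $\bar{s} \approx 2.654$ and $J^* = -\alpha\bar{s} \approx -7.962$ by bisection; the closed-form integrals $\int_0^t \beta y e^{-\beta y} dy$ and $\int_0^t \beta y^2 e^{-\beta y} dy$ are standard and worked out here rather than taken from the text:

```python
import math

ALPHA, BETA, C = 3.0, 2.0 / 3.0, 15.0

def residual(t):
    """LHS of the threshold equation at candidate threshold t = s_bar.
    Uses: int_0^t beta*y*e^{-beta y} dy   = (1/beta)  * (1 - E*(1 + u)),
          int_0^t beta*y^2*e^{-beta y} dy = (2/beta^2)* (1 - E*(1 + u + u^2/2)),
    with u = beta*t and E = e^{-u}."""
    u = BETA * t
    E = math.exp(-u)
    int_y = (1.0 / BETA) * (1.0 - E * (1.0 + u))
    int_y2 = (2.0 / BETA ** 2) * (1.0 - E * (1.0 + u + u * u / 2.0))
    # first integral: beta * ((alpha*beta/2) s^2 - alpha s (1 + beta t)) e^{-beta s}
    term1 = (ALPHA * BETA / 2.0) * int_y2 - ALPHA * (1.0 + BETA * t) * int_y
    # second integral: a constant times int_t^inf beta e^{-beta s} ds = e^{-beta t}
    term2 = (-ALPHA * t - (ALPHA * BETA / 2.0) * t * t) * E
    return term1 + term2 + 2.0 * ALPHA * t + (ALPHA * BETA / 2.0) * t * t - C

# bisection on [1, 4], where the residual changes sign
lo, hi = 1.0, 4.0
for _ in range(60):
    mid = 0.5 * (lo + hi)
    if residual(lo) * residual(mid) <= 0.0:
        hi = mid
    else:
        lo = mid
s_bar = 0.5 * (lo + hi)
J_star = -ALPHA * s_bar
print(s_bar, J_star)   # approximately 2.653 and -7.96
```

This exact solution is what the plotted error $|J^{\pi_k} - J^*|$ is measured against in the experiments.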
This computation can be difficult, as one needs to calculate the total variation norm, and it is impossible when the kernel is not known (as in an online algorithm). For instance, consider our example of optimal machine replacement with two actions $\mathcal{A} = \{\text{keep}, \text{replace}\}$ and transition density
$$p(y|x,a) := \begin{cases} \beta \exp(-\beta(y-x)), & \text{if } y \geq x,\ a = \text{keep}, \\ \beta \exp(-\beta y), & \text{if } a = \text{replace}. \end{cases}$$
Now, following the definition of $\alpha$ given in Assumption 7(c), fix $a = \text{replace}$ and $a' = \text{keep}$. Fig. 4.2 shows the distributions corresponding to $(x,a)$ and $(x',a')$. It is clear from the figure that as $x'$ gets large, $\alpha$ moves closer to 1. So if the state space is truncated, then $\alpha$ can be computed accordingly. Hence, to get rid of this scaling operator, we need to relax

Figure 4.2 Probability transition kernel for optimal machine replacement problem

the assumption of bounded iterates required while constructing the Markov chain. Recently, [22] studied the approximate (or, in our case, empirical) version of a contraction operator in a Banach space without assuming that the iterates are bounded, by constructing the Markov chain on the set of natural numbers and then analyzing its invariant distribution under some conditions. This enables us to plug in their argument for convergence of a random contraction to a probabilistic fixed point for the algorithm without truncation.

4.4.1 Approximate Bellman Operator

Let us first extend Assumption 7 to Lipschitz functions:

Assumption 8. (a) For every $(x,a)$, $|r(x,a)| \leq r_{\max}$, and for every $a$, $r(\cdot,a)$ is Lipschitz continuous.
(b) For every $a \in \mathcal{A}$, the transition kernel $P(\cdot|x,a)$ has a positive Radon-Nikodym derivative $p(y|x,a)$ with respect to the Lebesgue measure $\lambda$ on $\mathbb{R}^d$, for all $x, y \in \mathbb{R}^d$.
(c) The transition probability density is Lipschitz continuous in the present state, i.e., for all $a \in \mathcal{A}$ and $x, y, z \in \mathcal{X}$, there exists $L'_p(z)$ such that
$$|p(z|x,a) - p(z|y,a)| \leq L'_p(z)\, \|x - y\|$$
where $\int_{\mathcal{X}} L'_p(z)\, \lambda(dz) = L_p$.
(d) There exists $\alpha < 1$ such that
$$\sup_{(x,a),(x',a')} \|P(\cdot|x,a) - P(\cdot|x',a')\|_{TV} = 2\alpha$$
where $\|\cdot\|_{TV}$ denotes the total variation norm.

Assumption 8(a) establishes that for every $a$, $r(\cdot,a) \in \mathrm{Lip}(\mathcal{X})$; (b) and (c) imply that if $v \in \mathrm{Lip}(\mathcal{X})$ then for any action $a$, $\int v(y)\, P(dy|\cdot,a) \in \mathrm{Lip}(\mathcal{X})$; and (d) implies that under any stationary and deterministic policy, the $t$-step transition probability converges to a unique invariant probability measure (over the state process $\{x_t\}$) in total variation norm, uniformly in $x$ and at a geometric rate. Under these assumptions, there exists $(J^*, v^*) \in \mathbb{R} \times C(\mathcal{X})$ such that the following optimality equation holds:
$$J^* + v^*(x) = \sup_{a \in \mathcal{A}} \left[ r(x,a) + \int v^*(x')\, P(dx'|x,a) \right]. \tag{4.5}$$
Define the Bellman operator $T : C(\mathcal{X}) \to C(\mathcal{X})$ as
$$Tv(x) = \max_{a \in \mathcal{A}} \left[ r(x,a) + \mathbb{E}_{x' \sim P(\cdot|x,a)}\, v(x') \right].$$
Hence, $J^* = Tv^* - v^*$.

As in previous sections, we assume that the transition kernel is unknown, but that for a given state-action pair we can obtain samples of the next state from the generative model. Using these samples, we approximate the dynamics by non-parametric density estimation. We begin with a smoothing kernel $K : \mathcal{X} \to \mathbb{R}$, defined as any smooth function such that $\int K(x)\, dx = 1$, $\int x K(x)\, dx = 0$ and $\int x^2 K(x)\, dx < \infty$. Assume that for any $(x,a) \in \mathcal{X} \times \mathcal{A}$, we have access to $M$ independent and identically distributed samples $Y^{x,a}_i \sim P(\cdot|x,a)$, $i = 1, 2, \ldots, M$. Let $h_M$ be the bandwidth; then the kernel density estimator is defined as
$$\hat{p}_M(y|x,a) = \frac{1}{M h_M^d} \sum_{i=1}^{M} K\left( \frac{y - Y^{x,a}_i}{h_M} \right).$$
Commonly used kernels include the Gaussian kernel, $K(x) = \frac{1}{\sqrt{2\pi}} \exp(-\|x\|^2/2)$, and the tophat kernel, $K(x) = \frac{1}{2}\mathbb{I}(\|x\| < 1)$, where $\mathbb{I}$ is an indicator function. In this chapter, we focus on Gaussian kernels so that the Lipschitz property is preserved. The bandwidth $h_M$ controls the smoothness of the estimate and hence needs to be chosen carefully. Let the estimated distribution be $\widehat{P}_M$.
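The kernel density estimator above is straightforward to implement. A minimal 1-d sketch with a Gaussian kernel, estimating the 'replace' transition density $p(y) = \beta e^{-\beta y}$ from the machine replacement example using generative-model samples; the bandwidth rule $h_M = M^{-1/5}$ is a common default and an assumption here, not a choice prescribed by the text:

```python
import math
import random

def gaussian_kernel(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kde(samples, h):
    """Return p_hat(y) = (1 / (M h)) * sum_i K((y - Y_i) / h) for 1-d samples."""
    M = len(samples)
    return lambda y: sum(gaussian_kernel((y - yi) / h) for yi in samples) / (M * h)

# Draw M next-state samples from P(.|x, replace) = Exp(beta), independent of x.
random.seed(0)
beta, M = 2.0 / 3.0, 2000
samples = [random.expovariate(beta) for _ in range(M)]
p_hat = kde(samples, h=M ** (-0.2))                 # bandwidth h_M ~ M^{-1/5}
for y in [0.5, 1.0, 2.0]:
    print(y, p_hat(y), beta * math.exp(-beta * y))  # estimate vs. true density
```

Because each summand is a probability density in $y$, the estimate $\hat{p}_M$ integrates to one automatically, which is what makes the total-variation comparison in the next subsection meaningful.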
Let us now define our approximate Bellman operator $\widehat{T}_M : C(\mathcal{X}) \to C(\mathcal{X})$ as follows:
$$\widehat{T}_M v(x) = \max_{a \in \mathcal{A}} \left[ r(x,a) + \mathbb{E}_{x' \sim \widehat{P}_M(\cdot|x,a)}\, v(x') \right].$$
Clearly, $\widehat{T}_M$ is a random operator. Let $\hat{\alpha}_M$ be the random variable defined by
$$\sup_{(x,a),(x',a')} \|\widehat{P}_M(\cdot|x,a) - \widehat{P}_M(\cdot|x',a')\|_{TV} = 2\hat{\alpha}_M.$$
Then one can show that for all $v_1, v_2 \in C(\mathcal{X})$,
$$\mathrm{span}(\widehat{T}_M v_1 - \widehat{T}_M v_2) \leq \hat{\alpha}_M\, \mathrm{span}(v_1 - v_2).$$
We analyze the probabilistic contraction of the approximate Bellman operator $\widehat{T}_M$ by arguing that $\hat{\alpha}_M < 1$ with high probability (as presented in detail in Section 4.4.3).

Let us also compare our approximate Bellman operator to the empirical Bellman operator defined in [23, 26]. In their case, the estimated distribution for a state-action pair $(x,a)$ is given by
$$\widehat{P}_M(dy|x,a) = \frac{1}{M} \sum_{i=1}^{M} \delta_{Y_i}(dy)$$
where the $Y_i$, $i = 1, \ldots, M$, are i.i.d. samples from the true distribution. It can be shown that for the empirical distribution, $\hat{\alpha}_M$ always equals 1. To see this, fix $(x,a)$ and $(x',a')$; then
$$\|\widehat{P}_M(\cdot|x,a) - \widehat{P}_M(\cdot|x',a')\|_{TV} = 2 \sup_{B \subset \mathcal{B}(\mathcal{X})} \left( \widehat{P}_M(B|x,a) - \widehat{P}_M(B|x',a') \right).$$
Since $M$ is finite, one can choose $B$ such that the difference between the two distributions is 1 (for continuous distributions, the two finite sets of atoms are almost surely disjoint).

4.4.2 Approximate Relative Value Learning (ARVL)

We now present the Approximate Relative Value Learning (ARVL) algorithm, a non-parametric off-policy algorithm for MDPs with continuous state space. It is a sampling-based algorithm combined with non-parametric density estimation and function approximation. It first samples the state space uniformly and estimates the probability density for each sampled state and action. The approximate Bellman operator then gives samples of the value function, which are used for regression. Recall that in relative value iteration, a bias term is subtracted at each iteration. This does not change the span semi-norm but keeps the iterates bounded. In our algorithm, we make the samples for regression non-negative by subtracting the minimum of the function.
In other words, we are choosing a non-negative optimal value function, since it is not unique. This makes the samples for regression non-negative. Let the number of state samples be $N$, the number of next-state samples be $M$, and the number of neighbors for function approximation be $n$. Let $\Gamma_N : \mathbb{R}^N \to \mathbb{R}^N$ be the operator
$$\Gamma_N \hat{v}' = \hat{v}' - (\min \hat{v}')\, \mathbf{1}_N$$
where $\mathbf{1}_N$ is the vector of all ones of size $N$. Let us denote the composed operator by $\widehat{G}(N,M,n) = \Pi^n_{\mathcal{F}} \Gamma_N \widehat{T}_M$, where we use the fact that the function approximation depends on both $N$ and $n$. Algorithm 5 iterates the random operator $\widehat{G}(N,M,n)$ (or simply $\widehat{G}$ for compact notation), i.e., $v_{k+1} = \widehat{G} v_k = \widehat{G}^{k+1} v_0$. Using the non-expansive property of nearest-neighbor regression, we will establish that the composed operator $\widehat{G}$ is a contraction with high probability.

Now we specify the ARVL algorithm in detail. We first sample $N$ points from $\mathcal{X}$ uniformly (or according to another probability measure). Then we perform an 'approximate' value iteration step on these sampled points by estimating the density via mini-batches of next states. Then we do function-fitting using nearest neighbors, which gives us the next iterate of the value function.

Algorithm 5 ARVL
Input: sample sizes $N \geq 1$; $M \geq 1$; $n \geq 1$; initial seed $v_0$; total iterations $K \geq 1$.
For $k = 1, \ldots, K$:
1. Sample $\{x_i\}_{i=1}^N$ from $\mathcal{X}$ uniformly
2. Kernel density estimation $\hat{p}_M(\cdot|x_i, a)$ for each $i = 1, 2, \ldots, N$ and $a \in \mathcal{A}$
3. Approximate value iteration: $\hat{v}'_k(x_i) \leftarrow \widehat{T}_M v_{k-1}(x_i)$, then $\hat{v}_k(x_i) \leftarrow \hat{v}'_k(x_i) - \min_{x_j} \hat{v}'_k$ for $i, j = 1, 2, \ldots, N$
4. Function fitting: $v_k \leftarrow \Pi^n_{\mathcal{F}} \hat{v}_k$
5. Increment $k \leftarrow k+1$ and return to Step 1.

We can now establish that the iterates $v_k$ of the algorithm converge to a weak probabilistic fixed point of the operator $\widehat{G}(N,M,n) = \Pi^n_{\mathcal{F}} \Gamma_N \widehat{T}_M$, and hence a good approximation to $v^*$, the fixed point of $T$, in the span semi-norm with high probability if $N$, $M$ and $k$ are large enough.

Theorem 35. Suppose that Assumptions 8 and 9 hold.
Given $\epsilon, \delta > 0$, there exist constants $B$ and $C$ such that for any
$$N \geq 2\left(\frac{8BC}{\epsilon}\right)^{2d} \log\left(\frac{2}{\delta}\left(\frac{16BC}{\epsilon}\right)^d\right) \qquad \text{and} \qquad n \geq \frac{N}{2}\left(\frac{\epsilon}{4BC}\right)^d,$$
we have
$$\lim_{M\to\infty} \lim_{k\to\infty} \mathbb{P}\left( \mathrm{span}(v_k - v^*) > \epsilon \right) \leq \delta.$$

Note that nearest neighbors scale poorly with dimension, which is reflected in our bounds. This can be improved by using kernel regression (e.g., Nadaraya-Watson kernel regression). Furthermore, the asymptotic dependence on the next-state sample size $M$ is due to the asymptotic convergence of kernel density estimation.

4.4.3 Convergence Analysis

We now prove Theorem 35. There are two approximations in ARVL: one due to density estimation and another due to function fitting. We first bound the errors due to these approximations. As mentioned before, each iteration of ARVL can be viewed as an application of a random operator, so we then bound the error in one iteration. Finally, we use a stochastic dominance argument to establish convergence.

Error due to density estimation. We first want to establish that when $M$ is large enough, $\hat{\alpha}_M < 1$ with high probability. Recall that the $L_1$ distance between any two densities $\mu$ and $\nu$ over $\mathcal{X}$ is given by
$$\|\mu - \nu\|_1 = \int_{\mathcal{X}} |\mu(x) - \nu(x)|\, dx.$$
A bound on the $L_1$ norm yields a bound on the total variation norm as well, since if $\int |\mu - \nu|\, dx < \delta$ then $|\mu(B) - \nu(B)| < \delta$ for all $B$. Next, we present the convergence of the estimated density to the true density in the $L_1$ norm, as shown in [19], which needs the following assumptions:

Assumption 9. 1. Let $K : \mathbb{R}^d \to \mathbb{R}$ be such that $\int K(x)\, dx = 1$, and let $L(u) = \sup_{\|x\| \geq u} K(x)$ for $u \geq 0$.
2. $h_M$ is a sequence of positive numbers such that $h_M \to 0$ and $M h_M^d \to \infty$ as $M \to \infty$.
3. The density $p(\cdot|x,a)$ is almost everywhere continuous for all $(x,a) \in \mathcal{X} \times \mathcal{A}$, and $\|x\|^d K(x) \to 0$ as $\|x\| \to \infty$.

Theorem 36. Let $K$ be a smoothing kernel such that Assumption 9 holds. Then the following holds with probability 1:
$$\lim_{M\to\infty} \|p(\cdot|x,a) - \hat{p}_M(\cdot|x,a)\|_1 = 0$$
for all $(x,a) \in \mathcal{X} \times \mathcal{A}$.

This now leads to the following lemma:

Lemma 37. Assume that Assumption 9 holds. Then for any $\delta \in (0, 1-\alpha)$,
$$\lim_{M\to\infty} \mathbb{P}(\hat{\alpha}_M \geq 1 - \delta) = 0.$$

Proof.
The proof is a direct application of Theorem 36. For any $(x,a), (x',a') \in \mathcal{X} \times \mathcal{A}$,
$$\|\widehat{P}_M(\cdot|x,a) - \widehat{P}_M(\cdot|x',a')\|_{TV} \leq \|\widehat{P}_M(\cdot|x,a) - P(\cdot|x,a)\|_{TV} + \|P(\cdot|x',a') - \widehat{P}_M(\cdot|x',a')\|_{TV} + \|P(\cdot|x,a) - P(\cdot|x',a')\|_{TV}.$$
Using the ergodicity of the transition kernel, as stated in Assumption 8(d), and Theorem 36, we conclude the lemma. $\square$

Error due to function approximation with nearest neighbors. In the previous section, we defined $\Gamma_N$ for vectors in $\mathbb{R}^N$, but it can be extended to $\mathcal{X}$ as $\Gamma : C(\mathcal{X}) \to C(\mathcal{X})$ defined by $\Gamma f = f - \min f$. Let $g_M : \mathcal{X} \to \mathbb{R}$ be such that
$$g_M(x) = \max_{a \in \mathcal{A}} \left\{ r(x,a) + \mathbb{E}_{x' \sim \widehat{P}_M(\cdot|x,a)}\, v(x') \right\}$$
for any value function $v \in \mathrm{Lip}(\mathcal{X})$. Now, define $f_M : \mathcal{X} \to \mathbb{R}$ and $\tilde{f}_M : \mathcal{X} \to \mathbb{R}$ via
$$f_M(x) = \mathbb{E}[g_M] \qquad \text{and} \qquad \tilde{f}_M(x) = \mathbb{E}[\Gamma g_M].$$
Here $\tilde{f}_M$ is the regression function. It is the expected value of our approximate estimator of $Tv$. As expected, $f_M \to Tv$ as $M \to \infty$. We note that $f_M$ is not necessarily equal to $Tv$, by Jensen's inequality. In the next lemma we show that the bias between the regression function $f_M$ and the Bellman update $Tv$ can be made arbitrarily small, uniformly over $x \in \mathcal{X}$, when $M$ is large enough.

Lemma 38. Under Assumption 9, the following holds:
$$\lim_{M\to\infty} \|f_M - Tv\|_\infty = 0.$$

Proof. For any $x \in \mathcal{X}$, we compute
$$|f_M(x) - Tv(x)| \leq \mathbb{E}\left[ \left| \max_{a \in \mathcal{A}} \left\{ r(x,a) + \mathbb{E}_{x' \sim \widehat{P}_M(\cdot|x,a)}\, v(x') \right\} - \max_{a \in \mathcal{A}} \left\{ r(x,a) + \mathbb{E}_{x' \sim P(\cdot|x,a)}[v(x')] \right\} \right| \right]$$
$$\leq \mathbb{E}\left[ \max_{a \in \mathcal{A}} \left| \mathbb{E}_{x' \sim \widehat{P}_M(\cdot|x,a)}\, v(x') - \mathbb{E}_{x' \sim P(\cdot|x,a)}[v(x')] \right| \right].$$
Note that the value function $v$ is a continuous function on a compact set $\mathcal{X}$, hence $\sup_{x\in\mathcal{X}} v(x) = \|v\|_\infty < \infty$. Let the action which maximizes the inner term be $a^*_x$; then by Jensen's and Cauchy-Schwarz inequalities we have
$$\lim_{M\to\infty} |f_M(x) - Tv(x)| \leq \|v\|_\infty \lim_{M\to\infty} \mathbb{E}\left[ \int_{\mathcal{X}} \left| \hat{p}_M(x'|x,a^*_x) - p(x'|x,a^*_x) \right| \lambda(dx') \right].$$
Using the bounded convergence theorem and Theorem 36, the proof concludes. $\square$

The next lemma is from [18], which presents the rate of convergence in sup-norm for nearest neighbor regression.

Lemma 39.
Suppose that for a value function $v \in \mathrm{Lip}(\mathcal{X})$, there exist constants $B$ and $C$ such that $\|v\|_\infty < B$ and the regression function $f_M$ is Lipschitz with constant $C$ for any $M$. Then for $\delta, \epsilon > 0$,
$$n \geq n_0(\epsilon) = \frac{N}{2}\left(\frac{\epsilon}{4BC}\right)^d \qquad \text{and} \qquad N \geq N_0(\epsilon,\delta) = 2\left(\frac{8BC}{\epsilon}\right)^{2d} \log\left(\frac{2}{\delta}\left(\frac{16BC}{\epsilon}\right)^d\right)$$
imply
$$\lim_{M\to\infty} \mathbb{P}\left( \|\widehat{G} v - \tilde{f}_M\|_\infty \geq \epsilon \right) \leq \delta.$$

One-step error analysis of the random operator. The following lemma provides a probabilistic bound on the one-step error of ARVL, which shows that the error in one iteration can be controlled if the sample sizes are sufficiently large.

Lemma 40. Given $v \in \mathrm{Lip}(\mathcal{X})$, $\epsilon > 0$, and $\delta \in (0,1)$, choose $N \geq N_0(\epsilon,\delta)$ and $n \geq n_0(\epsilon)$. Then we have
$$\lim_{M\to\infty} \mathbb{P}\left( \mathrm{span}(\widehat{G} v - Tv) \geq \epsilon \right) \leq \delta.$$

Proof. By the triangle inequality,
$$\mathrm{span}(\widehat{G} v - Tv) \leq \mathrm{span}(\widehat{G} v - \tilde{f}_M) + \mathrm{span}(f_M - Tv),$$
where the last inequality follows from the fact that $\mathrm{span}(\tilde{f}_M - f_M) = \mathrm{span}(\mathbb{E}[\Gamma g_M - g_M]) = 0$. From Lemma 39, if $n \geq n_0(\epsilon)$ and $N \geq N_0(\epsilon,\delta)$, then with probability $1-\delta$, $\|\widehat{G} v - \tilde{f}_M\|_\infty < \epsilon$. Combining with Lemma 38 concludes the proof. $\square$

Next we establish that $\widehat{G}$ is indeed a random contraction.

Lemma 41. For given $N, M, n \geq 1$, the operator $\widehat{G}(N,M,n) = \Pi^n_{\mathcal{F}} \Gamma_N \widehat{T}_M$ is a random contraction operator, i.e., for any $v_1, v_2 \in \mathrm{Lip}(\mathcal{X})$,
$$\mathrm{span}(\widehat{G} v_1 - \widehat{G} v_2) \leq \hat{\alpha}_M\, \mathrm{span}(v_1 - v_2)$$
where $\hat{\alpha}_M$ is the random contraction coefficient.

Proof. Since we use $n$-NN for function fitting, we can easily establish that $\Pi^n_{\mathcal{F}}$ is in this case a non-expansive mapping:
$$\|\Pi^n_{\mathcal{F}} \hat{v}_1 - \Pi^n_{\mathcal{F}} \hat{v}_2\|_\infty \leq \|\hat{v}_1 - \hat{v}_2\|_\infty.$$
Note that if $\min \hat{v}_1 = \min \hat{v}_2 = 0$, the above inequality also holds in the span semi-norm. Hence,
$$\mathrm{span}(\Pi^n_{\mathcal{F}} \Gamma_N \hat{v}_1 - \Pi^n_{\mathcal{F}} \Gamma_N \hat{v}_2) = \mathrm{span}(\hat{v}_1 - \hat{v}_2).$$
Thus $\widehat{G}$ is the composition of a non-expansive mapping with a (random) contraction, i.e., for any $v_1, v_2 \in \mathrm{Lip}(\mathcal{X})$,
$$\mathrm{span}(\Pi^n_{\mathcal{F}} \Gamma_N \widehat{T}_M v_1 - \Pi^n_{\mathcal{F}} \Gamma_N \widehat{T}_M v_2) = \mathrm{span}(\widehat{T}_M v_1 - \widehat{T}_M v_2) \leq \hat{\alpha}_M\, \mathrm{span}(v_1 - v_2). \ \square$$

Stochastic Dominance. The following theorem is from [22], which enables us to analyze iteration of the composed operator.

Theorem 42. Assume that the following hold:
1.
$T : C(\mathcal{X}) \to C(\mathcal{X})$ is a contraction operator in the span semi-norm with contraction coefficient $\alpha < 1$.
2. For any $v \in \mathrm{Lip}(\mathcal{X})$, we have $\lim_{M,N,n\to\infty} \mathbb{P}(\mathrm{span}(\widehat{G} v - Tv) \geq \epsilon) = 0$.
3. Let $\hat{\alpha}_M$ be the contraction coefficient of $\widehat{G}$; then for $\delta \in (0, 1-\alpha)$, $\lim_{M\to\infty} \mathbb{P}(\hat{\alpha}_M \geq 1-\delta) = 0$.
4. There exists $w > 0$ such that $\mathrm{span}(\widehat{G} v^* - T v^*) \leq w$ almost surely.

Then $v^*$ is a weak probabilistic fixed point of the random operator $\widehat{G}(N,M,n)$.

Sketch of the proof: The key element of the proof is stochastic dominance, by a Markov chain over the natural numbers, of the error process $\{\mathrm{span}(v_k - v^*)\}_{k \geq 0}$. Recalling that $v_k = \widehat{G} v_{k-1}$, we decompose the process as
$$\mathrm{span}(v_k - v^*) \leq \mathrm{span}(\widehat{G} v_{k-1} - \widehat{G} v^*) + \mathrm{span}(\widehat{G} v^* - T v^*) \leq \hat{\alpha}_M\, \mathrm{span}(v_{k-1} - v^*) + \mathrm{span}(\widehat{G} v^* - T v^*).$$
For all $v \in \mathrm{Lip}(\mathcal{X})$, let us now define, for $\epsilon > 0$, $\delta \in (0, 1-\alpha)$ and $n, N, M \geq 1$,
$$q(\epsilon, \delta, N, M, n) \triangleq \mathbb{P}\left( \hat{\alpha}_M \leq 1 - \delta,\ \mathrm{span}(\widehat{G} v - Tv) \leq \epsilon \right), \tag{4.6}$$
which we denote by $q$. By the Hoeffding-Fréchet bound,
$$q \geq \mathbb{P}(\hat{\alpha}_M \leq 1 - \delta) + \mathbb{P}\left( \mathrm{span}(\widehat{G} v - Tv) \leq \epsilon \right) - 1.$$
Fix $\kappa > 0$, $\epsilon \in (0, \kappa/2]$, and $\delta \in (0, 1-\alpha)$ such that $\eta = \lceil 2/\delta \rceil \leq \kappa/\epsilon$. A Markov chain is then constructed over the natural numbers as follows:
$$Y_k = \begin{cases} \eta, & \text{w.p. } q \text{ if } Y_{k-1} = \eta, \\ Y_{k-1} - 1, & \text{w.p. } q \text{ if } Y_{k-1} \geq \eta + 1, \\ Y_{k-1} + \lceil w/\epsilon \rceil, & \text{w.p. } 1 - q. \end{cases}$$
The next step is to show that this Markov chain stochastically dominates the error process. Let us first recall the definition of stochastic dominance:

Definition 11. Let $X$ and $Y$ be two random variables. Then $Y$ stochastically dominates $X$, written $X \leq_{st} Y$, when $\mathbb{P}(X \geq \theta) \leq \mathbb{P}(Y \geq \theta)$ for all $\theta$ in the support of $Y$.

This yields, for any $t > 0$,
$$\mathbb{P}(Y_k \geq t) \geq \mathbb{P}\left( \mathrm{span}(v_k - v^*) \geq t \right).$$
Now it remains to show that the Markov chain admits an invariant distribution which concentrates at state 1 when the sample sizes are sufficiently large.

4.5 Conclusion

In this chapter, we proposed an empirical relative value learning (ERVL) algorithm combined with non-parametric function approximation. In particular, we focused on nearest neighbors regression.
The framework developed here can be extended to any non-parametric setting as long as the function approximation is non-expansive and convergence to the regression function can be established in sup-norm. One can then bound the one-step error and use the stochastic dominance argument to establish convergence. The proposed algorithm needs to know the contraction parameter, which is required for truncation. We also proposed a way to remove that truncation, which can lead to unbounded iterates, and studied the corresponding convergence analysis, for which the dominating Markov chain is constructed differently.

4.6 Proofs

4.6.1 Proof of Theorem 29

Proof. First, note that
$$\mathrm{span}(Tv_1 - Tv_2) = \sup_{x \in \mathcal{X}} (Tv_1 - Tv_2)(x) - \inf_{x \in \mathcal{X}} (Tv_1 - Tv_2)(x).$$
Recall that $Tv(x) = \max_a \{ r(x,a) + \int_{\mathcal{X}} v(x')\, P(dx'|x,a) \}$. Consider any state-action pairs $(x_1,a_1)$ and $(x_2,a_2)$ and define a new signed measure on $\mathcal{X}$ as
$$\lambda(\cdot) \triangleq P(\cdot|x_1,a_1) - P(\cdot|x_2,a_2).$$
Using the Jordan-Hahn decomposition, decompose $\mathcal{X} = \mathcal{X}^+ \cup \mathcal{X}^-$ with $\mathcal{X}^+ \cap \mathcal{X}^- = \emptyset$. Hence
$$\lambda(\mathcal{X}) = \lambda(\mathcal{X}^+) + \lambda(\mathcal{X}^-) = 0. \tag{4.7}$$
From Assumption 7, it is clear that
$$\|\lambda\|_{TV} = \lambda(\mathcal{X}^+) - \lambda(\mathcal{X}^-) \leq 2\alpha. \tag{4.8}$$
Hence, from (4.7) and (4.8), $\lambda(\mathcal{X}^+) \leq \alpha$. Now,
$$(Tv_1 - Tv_2)(x_1) - (Tv_1 - Tv_2)(x_2) \leq \int_{\mathcal{X}} (v_1(x') - v_2(x'))\, \lambda(dx')$$
$$= \int_{\mathcal{X}^+} (v_1(x') - v_2(x'))\, \lambda(dx') + \int_{\mathcal{X}^-} (v_1(x') - v_2(x'))\, \lambda(dx')$$
$$\leq \int_{\mathcal{X}^+} \left( \sup_{x'} (v_1(x') - v_2(x')) \right) \lambda(dx') + \int_{\mathcal{X}^-} \left( \inf_{x'} (v_1(x') - v_2(x')) \right) \lambda(dx')$$
$$= \mathrm{span}(v_1 - v_2)\, \lambda(\mathcal{X}^+) + \inf_{x} (v_1(x) - v_2(x))\, \lambda(\mathcal{X}) \leq \alpha\, \mathrm{span}(v_1 - v_2). \ \square$$

4.6.2 Proof of Lemma 28

Proof. We need to show that every Cauchy sequence in $\widetilde{C}_B(\mathcal{X})$ converges to a limit point in this space. Consider a Cauchy sequence $\{\tilde{v}_n\}_{n \geq 1} \subset \widetilde{C}_B(\mathcal{X})$ such that for every $\epsilon > 0$ there exists $N_\epsilon$ such that for all $m, n \geq N_\epsilon$, $\mathrm{span}(\tilde{v}_m - \tilde{v}_n) \leq \epsilon$. Fix $x_0 \in \mathcal{X}$ and define $f_n(x) = \tilde{v}_n(x) - \tilde{v}_n(x_0)$ for all $n \geq 1$. Clearly $f_n \sim \tilde{v}_n$ and $f_n \in \widetilde{C}_B(\mathcal{X})$ for all $n \geq 1$.
Now consider $f_m - f_n$ for all $m, n \geq N_\epsilon$. Since $f_m(x_0) - f_n(x_0) = 0$, we have
$$\sup_x (f_m(x) - f_n(x)) \geq 0 \geq \inf_x (f_m(x) - f_n(x)).$$
Hence, for all $x \in \mathcal{X}$,
$$|f_m(x) - f_n(x)| \leq \mathrm{span}(f_m - f_n) = \mathrm{span}(\tilde{v}_m - \tilde{v}_n) \leq \epsilon.$$
This establishes that $\{f_n(x)\}_{n \geq 1}$ is a Cauchy sequence in $\mathbb{R}$ for each $x$. Hence it converges; let $f_\infty = \lim_{n\to\infty} f_n$. Define, for any $c \in \mathbb{R}$, $\tilde{v}_\infty(x) = f_\infty(x) - c$ for all $x \in \mathcal{X}$. Then
$$0 = \lim_{n\to\infty} \mathrm{span}(f_n - f_\infty) = \lim_{n\to\infty} \mathrm{span}(\tilde{v}_n - \tilde{v}_\infty). \ \square$$

4.6.3 Proof of Theorem 35

Proof. The proof uses Theorem 42. Note that the first assumption of that theorem is satisfied by the ergodicity condition in Assumption 8. The second and third assumptions are satisfied by Lemma 40 and Lemma 37, respectively. The fourth follows from bounded rewards and the fact that $v^*$ is a fixed point of the operator $T$. Hence, Theorem 42 can be applied to conclude convergence. $\square$

Chapter 5 MDPs with continuous state and action spaces

In this chapter, we focus on MDPs with continuous action spaces. The algorithms seen so far work for discrete action spaces, since the maximization step in the algorithm can then be computed efficiently. Continuous action spaces are common in robotics, from simple tasks like pendulum swing-up to locomotion tasks. In many robotics problems, the action space (and state space) is continuous, with dimensions in the low double digits [32], and it is not possible to discretize the state or action space without losing control performance (while also running into issues of scalability). Continuous action space problems are much harder. Current methods for such settings include considering parametric families of policies and then either following a policy gradient [1, 35] or doing policy optimization [59, 60]. While proofs of convergence are available for some of these algorithms, they can have arbitrary gaps to optimality, since it is nearly impossible to know which parametric policy family is close enough to an optimal policy.
Moreover, training such algorithms requires a large amount of data, which may not be available. In the reinforcement learning (RL) literature, the closest useful techniques come from deep RL, which rely on deep neural networks for function approximation but can work with continuous action spaces. As with deep learning methods generally, such techniques suffer from long training times, no guarantees on optimality or performance, and the need to tune a large number of hyper-parameters before one can get reasonable performance on a given problem [33].

Here, we take a different approach. Our goal is to develop algorithms that are universal (work on any problem), simple to implement, computationally tractable (and fairly easy), can provide arbitrarily good approximations to optimal performance, and come with some performance guarantees, ideally non-asymptotic, even if probabilistic. We build on our work in [26] for continuous state space but finite action space problems. The idea is very simple: do function approximation in a universal function approximation space such as an RKHS; do randomized function fitting by picking basis functions randomly in each iteration; replace the expectation in the Bellman operator with a sample average approximation by drawing samples of the next state; and, to optimize over the actions in the Bellman equation, sample a few actions and optimize over just those. We call this algorithm Random Actions for Empirical Value Learning (or RAEVL). Thus, the algorithm we introduce for continuous action and state space problems is quite simple, and in fact performs quite well numerically, as we show subsequently. But due to the randomization and sampling of various kinds, its convergence analysis becomes intricate. This is addressed by viewing each iteration as the application of a random Bellman operator. In [25], probabilistic contraction analysis techniques were developed for analyzing iterated random operators.
Convergence is proved by constructing a simple stochastically dominating Markov chain (which also yields the rate of convergence). We use the same framework but modify some details for continuous spaces.

In the latter part of this chapter, we present a policy-iteration-based algorithm, which is different from all the value-iteration-based algorithms we have seen so far. The function approximation is the same as before, but we now need function approximation for both the policy and the value function. Since we consider randomized networks, the connections in the bottom layer(s) are left untrained after initialization and only the last layer is finely tuned. Such networks have been studied extensively, both for shallow [56, 57] and deep architectures [20]. Thus, the training algorithm only needs to operate on a reduced set of weights, with similar or even better performance relative to fully trained architectures. This is different from using the last hidden layer of a deep neural network as a feature extractor and updating the last layer with a different algorithm [39], which still trains a fully connected network. Furthermore, [56] shows that such random bases correspond to a reproducing kernel Hilbert space (RKHS), which is known to be dense in the space of continuous functions. They also provide theoretical bounds on the error due to approximation by finitely many random features.

5.1 Empirical Optimization

In this section, we use random sampling methods for optimization. In particular, we focus on Lipschitz continuous functions and uniform sampling. Given a space $\mathcal{X}$ and a function $f : \mathcal{X}\to\mathbb{R}$, we are looking for a good approximation of its maximum. We first state our assumptions.

Assumption 10.
(a) $\mathcal{X}$ is compact.

(b) $f$ is Lipschitz continuous, i.e., there exists $L$ such that for all $x, y \in \mathcal{X}$, $|f(x) - f(y)| \leq L\|x - y\|_2$.

Now, to approximate the maximum of the function, we sample $N$ points from $\mathcal{X}$ and take the one with the largest function value. This is our sampling-based approximation of the maximum. We next present a theorem giving a sample complexity bound on the approximation error. This idea of optimization via sampling is computationally very efficient. For instance, consider two functions with support $[0,1]^2$: $f_1(x_1, x_2) = 0.5(\sin(13(x_1+x_2))\sin(27(x_1+x_2)) + 1)$ and $f_2(x_1, x_2) = -(x_1 - 0.1)^2 - (x_2 - 0.1)^2$. The functions are shown in Fig. 5.1 (the sinusoid function $f_1$) and Fig. 5.2 (the concave function $f_2$), respectively; the figures also show the maxima. We need $N = 100$ to reach 1% error for both $f_1$ and $f_2$, with runtime in milliseconds.

5.2 Empirical Dynamic Programming with Empirical Optimization

In this section, we deploy the optimization-via-sampling idea for MDPs with continuous action spaces. The algorithm is similar to the empirical dynamic programming based algorithms we have seen so far.

5.2.1 Preliminaries

Consider an MDP $(\mathcal{X}, \mathcal{U}, P, r, \gamma)$, where $\mathcal{X}$ is the state space and $\mathcal{U}$ is the action space. The transition probability kernel is $P(\cdot\,|x,u)$: if action $u$ is executed in state $x$, the probability that the next state lies in a Borel-measurable set $B$ is $P(x_{t+1}\in B \,|\, x_t = x, u_t = u)$, where $x_t$ and $u_t$ are the state and action at time $t$. The reward function is $r : \mathcal{X}\times\mathcal{U}\to\mathbb{R}$. We are interested in maximizing the long-run expected discounted reward, where the discount parameter is $\gamma$.

Let $\Pi$ denote the class of stationary deterministic Markov policies, i.e., mappings $\pi : \mathcal{X}\to\mathcal{U}$ which depend on history only through the current state. We consider only such policies since it is well known that there is an optimal MDP policy in this class. When the initial state is given, any policy $\pi$ determines a probability measure $P_\pi$.
Let $\mathbb{E}_\pi$ denote expectation with respect to this measure. We focus on the infinite-horizon discounted reward criterion. The expected discounted reward, or value function, for a policy $\pi$ and initial state $x$ is
$$v_\pi(x) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty}\gamma^t r(x_t, u_t) \,\middle|\, x_0 = x\right].$$
The optimal value function is
$$v^*(x) = \sup_{\pi\in\Pi}\mathbb{E}_\pi\left[\sum_{t=0}^{\infty}\gamma^t r(x_t, u_t) \,\middle|\, x_0 = x\right],$$
and the policy which maximizes the value function is the optimal policy, $\pi^*$. We now make the following assumptions on the regularity of the MDP.

Assumption 11 (Regularity of MDP). The state space $\mathcal{X}$ and the action space $\mathcal{U}$ are compact subsets of $d_X$- and $d_U$-dimensional Euclidean space, respectively. The rewards are uniformly bounded by $r_{\max}$, i.e., $r(x,u)\leq r_{\max}$ for all $(x,u)\in\mathcal{X}\times\mathcal{U}$. Furthermore, $\mathcal{U}$ is convex.

The assumption above implies that for any policy $\pi$, $v_\pi \leq v_{\max} = r_{\max}/(1-\gamma)$. The next assumption is Lipschitz continuity of the MDP in the action variable.

Assumption 12 (Lipschitz continuity). The reward and the transition kernel are Lipschitz continuous with respect to the action, i.e., there exist constants $L_r$ and $L_p$ such that for all $(x,u,u')\in\mathcal{X}\times\mathcal{U}\times\mathcal{U}$ and every measurable set $B$ of $\mathcal{X}$, the following holds:
$$|r(x,u) - r(x,u')| \leq L_r\|u - u'\|, \qquad |P(B|x,u) - P(B|x,u')| \leq L_p\|u - u'\|.$$

The compactness of the action space combined with Lipschitz continuity implies that greedy policies do exist. Let $\mathcal{B}(\mathcal{X})$ be the set of functions on $\mathcal{X}$ such that $\|f\|_\infty \leq v_{\max}$. Let us now define the Bellman operator $T : \mathcal{B}(\mathcal{X})\to\mathcal{B}(\mathcal{X})$ as
$$Tv(x) = \max_u\left[ r(x,u) + \gamma\,\mathbb{E}_{x'\sim P(\cdot|x,u)}\, v(x') \right].$$
It is well known that the operator $T$ is a contraction with respect to the $\|\cdot\|_\infty$ norm, with contraction parameter the discount factor $\gamma$. Hence the sequence of iterates $v_k = Tv_{k-1}$ converges to $v^*$ geometrically. Since we will be analyzing the $L_2$ norm, under which we do not have the contraction property, we need a bounded Radon-Nikodym derivative of the transitions, which we state in the next assumption.
Such an assumption has been used earlier for finite action spaces [48, 27] and for continuous action spaces in [5].

Assumption 13 (Stochastic Transitions). For all $(x,u)\in\mathcal{X}\times\mathcal{U}$, $P(\cdot|x,u)$ is absolutely continuous with respect to $\mu$, and
$$C_\mu \triangleq \sup_{(x,u)\in\mathcal{X}\times\mathcal{U}}\left\|\frac{dP(\cdot|x,u)}{d\mu}\right\|_\infty < \infty.$$

Since we have a sampling-based algorithm, we need a function space in which to approximate the value function. As in Chapter 3, we focus on randomized function approximation via random features. Let $\Theta$ be a set of parameters and let $\varphi : \mathcal{X}\times\Theta\to\mathbb{R}$ be a feature function. The feature functions need to satisfy $\sup_{(x,\theta)\in\mathcal{X}\times\Theta}|\varphi(x;\theta)|\leq 1$, e.g., Fourier features. Let $\mathcal{F}(\Theta)$ be defined as
$$\mathcal{F}(\Theta) = \left\{ f(\cdot) = \int_\Theta \varphi(\cdot\,;\theta)\,\alpha(\theta)\,d\theta \;:\; |\alpha(\theta)| \leq C\nu(\theta)\;\;\forall\theta\in\Theta \right\}.$$
But we are interested in finding the best fit within finite sums of the form $\sum_{j=1}^J \alpha_j\varphi(x;\theta_j)$. Doing classical function fitting with $\sum_{j=1}^J \alpha_j\varphi(x;\theta_j)$ leads to nonconvex optimization problems because of the joint dependence on $\alpha$ and $\theta$. Instead, we fix a density $\nu$ on $\Theta$ and draw random samples $\theta_j$ from $\Theta$ for $j = 1, 2, \dots, J$. Once these $(\theta_j)_{j=1}^J$ are fixed, we consider the space of functions
$$\widehat{\mathcal{F}}(\theta_{1:J}) \triangleq \left\{ f(\cdot) = \sum_{j=1}^J \alpha_j\varphi(\cdot\,;\theta_j) \;:\; \|(\alpha_1,\dots,\alpha_J)\|_\infty \leq C/J \right\}.$$
It then remains to calculate the weights $\alpha$ by minimizing a convex loss. Furthermore, for a given probability distribution $\mu$ on $\mathcal{X}$, define the $L_{2,\mu}$ norm of a function as $\|f\|_{2,\mu}^2 = \int_{\mathcal{X}}|f(x)|^2\mu(dx)$. The empirical norm at given samples $(x_1, x_2, \dots, x_N)$ is defined as $\|f\|_{2,\hat\mu}^2 = \frac{1}{N}\sum_{i=1}^N |f(x_i)|^2$. Recall two distance measures for function spaces:

- $d_{2,\mu}(Tf, \mathcal{F}) \triangleq \inf_{f'\in\mathcal{F}}\|f' - Tf\|_{2,\mu}$ is the approximation error for a specific $f$;
- $d_{2,\mu}(T\mathcal{F}, \mathcal{F}) \triangleq \sup_{f\in\mathcal{F}} d_{2,\mu}(Tf, \mathcal{F})$ is the inherent Bellman error for the entire class $\mathcal{F}$.

5.2.2 The Algorithm

We now present our RAndomized Empirical Value Learning (RAEVL) algorithm. It is an empirical variant of value iteration for continuous state and action spaces: it samples both states and actions.
Note that the Bellman operator computes an expectation with respect to the next state and then optimizes over the action space for each state. This is not feasible for continuous state and action spaces. Instead, we replace the expectation with an empirical average and use empirical optimization in the original Bellman operator. This means for a given state $x\in\mathcal{X}$ and sample sizes $M$ and $L$, we first sample $L$ actions uniformly and then generate $M$ samples of the next state (for each sampled action). This leads us to define an empirical Bellman operator $\widehat{T}_{M,L} : \mathcal{B}(\mathcal{X})\to\mathcal{B}(\mathcal{X})$ as
$$\widehat{T}_{M,L}v(x) = \max_{1\leq l\leq L}\left[ r(x,u_l) + \frac{\gamma}{M}\sum_{m=1}^M v(x'_{lm}) \right],$$
where $x'_{lm}\sim P(\cdot|x,u_l)$ for $l = 1, 2, \dots, L$ and $m = 1, 2, \dots, M$. Instead of evaluating this operator at each state $x$, we sample $\{x_n\}_{n=1}^N$ from the state space $\mathcal{X}$ according to a distribution $\mu$. Then we compute $\widehat{v}(x_n) = [\widehat{T}_{M,L}v](x_n)$. Given the data $\{(x_n, \widehat{v}(x_n))\}_{n=1}^N$, we generalize the value function over the state space by computing a best fit within $\widehat{\mathcal{F}}(\theta_{1:J})$, solving
$$\min_\alpha \frac{1}{N}\sum_{n=1}^N\Big|\sum_{j=1}^J \alpha_j\varphi(x_n;\theta_j) - \widehat{v}(x_n)\Big|^2 \quad\text{s.t.}\quad \|(\alpha_1,\dots,\alpha_J)\|_\infty \leq C/J.$$
This optimization problem only optimizes over the weights $\alpha_{1:J}$, since the parameters $\theta_{1:J}$ have already been randomly sampled. Let $\Pi_{\widehat{\mathcal{F}}}(J,N)$ denote this optimization problem, which we write as $\Pi_{\widehat{\mathcal{F}}}$ for compact notation. We are now ready to present our algorithm.

Algorithm 6 RAEVL
Input: probability distribution $\mu$ on $\mathcal{X}$; sample sizes $N, M, J, L\geq 1$; counter $k = 0$; initial seed $v_0$.
For $k = 1, 2, \dots$:
1. Sample $(x_n)_{n=1}^N$ from $\mathcal{X}$ according to $\mu$.
2. For each $n$, sample $(u_l)_{l=1}^L$ uniformly from the action space.
3. For each $n$ and $l$, sample i.i.d. next states $y_{lm}\sim P(\cdot|x_n,u_l)$.
4. Empirical value iteration: $\widehat{v}_k(x_n) = \widehat{T}_{M,L}v_{k-1}(x_n)$.
5. Function approximation: $v_{k+1} = \Pi_{\widehat{\mathcal{F}}}\,\widehat{v}_k(x_n)$.
6. Increment $k\leftarrow k+1$ and return to Step 1.

RAEVL can be seen as the iteration of a composition of two random operators.
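A minimal end-to-end sketch of one such iteration may help make the steps concrete. Everything below (the toy MDP with reward $-(x-u)^2$ and action-independent uniform transitions, the Gaussian feature distribution, the sample sizes, and the ridge-plus-clipping fit standing in for the exact $\infty$-norm-constrained least squares) is an illustrative assumption, not the dissertation's code. For this toy MDP the optimal value function is $v^*\equiv 0$ (pick $u=x$; the next state does not depend on the action), so the iterates should stay near zero.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative toy MDP: states X = [0,1], actions U = [0,1],
# r(x,u) = -(x-u)^2, next state ~ Unif[0,1] (action-independent).
gamma = 0.9
N, L, M, J, C = 200, 25, 10, 50, 50.0

# Frozen random Fourier-type features phi(x; theta_j) = cos(w_j x + b_j)
W = rng.normal(0.0, 3.0, size=J)
b = rng.uniform(-np.pi, np.pi, size=J)
feats = lambda x: np.cos(np.outer(x, W) + b)   # shape (len(x), J)

def raevl_iteration(alpha):
    v = lambda x: feats(x) @ alpha             # current value function
    xs = rng.uniform(0.0, 1.0, size=N)         # step 1: sample states
    targets = np.empty(N)
    for n, x in enumerate(xs):
        us = rng.uniform(0.0, 1.0, size=L)     # step 2: sample actions
        ys = rng.uniform(0.0, 1.0, size=(L, M))  # step 3: next states
        # step 4: empirical Bellman operator \hat{T}_{M,L}
        lookahead = gamma * v(ys.ravel()).reshape(L, M).mean(axis=1)
        targets[n] = np.max(-(x - us) ** 2 + lookahead)
    # step 5: fit new weights; ridge + clipping onto the box
    # ||alpha||_inf <= C/J stands in for the exact constrained program
    Phi = feats(xs)
    new_alpha = np.linalg.solve(Phi.T @ Phi + 0.1 * np.eye(J), Phi.T @ targets)
    return np.clip(new_alpha, -C / J, C / J)

alpha = np.zeros(J)                            # seed value function v_0 = 0
for _ in range(3):
    alpha = raevl_iteration(alpha)

v_vals = feats(np.linspace(0.0, 1.0, 11)) @ alpha
```

After a few iterations `v_vals` stays close to the true fixed point $v^* \equiv 0$, since the empirical maximization over $L$ sampled actions loses only about the squared gap to the nearest sampled action.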
Let us define $\widehat{G}(N,M,L,J) = \Pi_{\widehat{\mathcal{F}}}(N,J)\circ\widehat{T}_{M,L}$, where $\circ$ denotes composition, and write $\widehat{G}$ for short. Hence, in our algorithm, $v_{k+1} = \widehat{G}v_k$. We will use the random operator framework to analyze our algorithm. We now present our main theorem, for which we define
$$J_0(\epsilon,\delta) = \left(\frac{5C}{\epsilon}\left(1 + \sqrt{2\log\frac{5}{\delta}}\right)\right)^2, \qquad N_0(\epsilon,\delta) = \frac{512\,v_{\max}^2}{(\epsilon/7)^4}\log\left[\frac{56e(J+1)}{\delta}\left(\frac{2e\,v_{\max}}{(\epsilon/7)^2}\right)^J\right],$$
$$L_0(\epsilon,\delta) = \left(\frac{(L_r + \gamma v_{\max}L_p)\,\mathrm{diam}(\mathcal{U})}{\epsilon/7}\right)^{d_U}\log\frac{7N}{\delta}, \qquad M_0(\epsilon,\delta) = \frac{2\,v_{\max}^2}{(\epsilon/7)^2}\log\frac{14NL}{\delta}.$$

Theorem 43. Suppose Assumptions 11, 12 and 13 hold. Choose an $\epsilon > 0$ and $\delta\in(0,1)$. Set $\delta' = 1 - (1/2 + \delta/2)^{1/(K^*-1)}$ and
$$K^* = \frac{\log(\epsilon\, C_\mu^{1/2}) - \log(2v_{\max})}{\log\gamma}.$$
Then if $N\geq N_0(\epsilon,\delta')$, $M\geq M_0(\epsilon,\delta')$, $J\geq J_0(\epsilon,\delta')$, $L\geq L_0(\epsilon,\delta')$ and $K\geq \log\left[4/(1/2-\delta/2)\right]/\left[(1-q)q^{K^*-1}\right]$, the following holds with probability at least $1-\delta$:
$$\|v_K - v^*\|_{2,\mu} \leq \widetilde{C}\left[ d_{2,\mu}(T\mathcal{F}(\Theta), \mathcal{F}(\Theta)) + 2\epsilon \right] \qquad (5.1)$$
where $\widetilde{C} = 2\left((1-\gamma^{K+1})/(1-\gamma)\right)^{1/2} C_\mu^{1/2}$.

5.2.3 Convergence Analysis

There are three sources of error in RAEVL: empirical optimization, the sample average, and function approximation. We will get a handle on each of these approximation errors to bound the error in one iteration, and then use a stochastic dominance argument to analyze the error process. As mentioned before, we view RAEVL as the iteration of the random operator $\widehat{G}$. Let $\epsilon_k$ be the gap between this random operator and the exact Bellman operator at iteration $k$, i.e.,
$$v_{k+1} = \widehat{G}v_k = Tv_k + \epsilon_k.$$
Let us also define a (random) operator $\widetilde{T}_L$ as
$$\widetilde{T}_L v(x) = \max_{1\leq l\leq L}\left[ r(x,u_l) + \gamma\,\mathbb{E}_{x'\sim P(\cdot|x,u_l)}\,v(x') \right],$$
where $u_l\sim\mathrm{Unif}(\mathcal{U})$ for $l = 1, 2, \dots, L$. Let the Q-value function be $Q(x,u) = r(x,u) + \gamma\,\mathbb{E}_{x'\sim P(\cdot|x,u)}\,v(x')$ for $v\in\mathcal{B}(\mathcal{X})$. We now argue that the Q-value function is $L_U$-Lipschitz
For all (x,u,u 0 )∈X×U×U and v∈B(X ), |Q(x,u)−Q(x,u 0 )| ≤|r(x,u)−r(x,u 0 )|+γ Z X |(P (dy|x,u)−P (dy|x,u 0 ))v(y)| ≤L r |u−u 0 | +γv max Z X |P (dy|x,u)−P (dy|x,u 0 )| ≤ (L r +γv max L p )ku−u 0 k where the last inequalities follow from Assumption 12. The next lemma bounds the error due to sampling for finding the best action. The proof is given in the appendix. Lemma 44. Choose > 0 and δ∈ (0, 1). Let diam(U) be the diameter of the action space U. Then for all v∈B(X ), if L≥ L U diam(U) d U log 1 δ (5.2) then P |Tv(x)− e T L v(x)|> <δ for all x∈X. Next, we bound the error due to sample average. This will give us a sample complexity bound for next state samples. The proof is a simple application of Hoeffding’s inequality followed by an union bound. 140 Lemma 45. Choose > 0, δ∈ (0, 1) and L≥ 1. Then for all v∈B(X ), if M≥ 2v 2 max 2 log 2L δ then P | b T M,L v(x)− e T L v(x)|> <δ for all x∈X. 5.2.3.1 Bound on one-step error Now, we will bound the error in one iteration of RAEVL. We will use the bounds on M and L developed in the previous section. The choice of N comes from bounding the gap between the empirical norm and expected norm by Pollard’s inequality. Lastly, the bound on J comes through an application of the bounded difference concentration inequality as given in [57]. Lemma 46. Choose v∈F (Θ), > 0, and δ∈ (0, 1). Also choose N ≥ N 0 (,δ),M ≥ M 0 (,δ),J≥J 0 (,δ) and L≥L 0 (,δ). Then, for b G (N, M, J, L, μ, ν) v, the output of one iteration of our algorithm, we have k b Gv−Tvk 2,μ ≤d 2,μ (TF (Θ),F (Θ)) + with probability at least 1−δ. 141 Proof. To begin, let 0 > 0 be arbitrary and choose f ∗ ∈F (Θ) such thatkf ∗ −Tvk 2,μ ≤ inf f∈F(Θ) kf−Tvk 2,μ + 0 . 
Using (x +y) 2 ≥x 2 +y 2 for x,y≥ 0, we have P sup b f∈ b F(θ 1:J ) k b f−Tvk 2,μ −k b f−Tvk 2,ˆ μ >/7 ≤P sup b f∈ b F(θ 1:J ) k b f−Tvk 2 2,μ −k b f−Tvk 2 2,ˆ μ > (/7) 2 ≤ 8e (J + 1) 4ev max (/7) 2 J exp −N (/7) 4 512v 2 max (5.3) where the last inequality follows from Pollard’s inequality and the fact that the psuedo- dimension for the function class b F θ 1:J is J. Then, choose ˆ f∈ b F θ 1:J such thatk ˆ f− Tvk 2,μ ≤kf ∗ −Tvk 2,μ +/7 with probability at least 1−δ/7 by choosing J≥ 1 to satisfy C √ J 1 + s 2 log 1 (δ/7) ! ≤ 7 ⇒ J≥ 7C 1 + s 2 log 7 δ 2 142 by Lemma [57, Lemma 1]. Now we have the following string of inequalities, each of which hold with probability 1−δ/7: k b Gv−Tvk 2,μ ≤k b Gv−Tvk 2,ˆ μ +/7 (5.4) ≤k b Gv− e T L vk 2,ˆ μ + 2/7 (5.5) ≤k b Gv− b T M,L vk 2,ˆ μ + 3/7 (5.6) ≤k ˆ f− b T M,L vk 2,ˆ μ + 3/7 (5.7) ≤kf ∗ − b T M,L vk 2,ˆ μ + 4/7 (5.8) ≤kf ∗ − e T L vk 2,ˆ μ + 5/7 (5.9) ≤kf ∗ −Tvk 2,ˆ μ + 6/7 (5.10) ≤kf ∗ −Tvk 2,μ + (5.11) ≤ inf f∈F(Θ) kf−Tvk 2,μ + 0 + (5.12) We chooseN from inequality (5.41) such that inequalities (5.23) and (5.28) hold with atleast probability 1−δ/7. Now, using Lemma 44 followed by an union bound argument we choose L such that P max x 1 ,x 2 ,...x N Tv(x n )− e T L v(x n ) < {x n } N n=1 ! ≥ 1−δ/7 Hence the empirical norm can also be bounded. This proves inequalities (5.5) and (5.27). Similarly, one can prove inequalities (5.24) and (5.9) using Lemma 45 followed by union 143 bound argument on both the sampled, giving us a bound on M. Inequality (5.25) follows from the fact that b G gives the least approximation error compared to any other function b f∈ b F θ 1:J . The last inequality is by the choice of f ∗ . The previous Lemma 46 gives a bound on the error in one step of RAEVL. We will now extend this result to understand the convergence of{kv k −v ∗ k 2,μ } k≥0 in the next section. 
5.2.3.2 Stochastic Dominance

Since we do not have a contraction with respect to the $L_2$ norm, we need a bound on how the errors propagate across iterations. Recalling that $\widehat{G}v_k = Tv_k + \epsilon_k$, we have the following by analyzing the point-wise error bounds.

Lemma 47. [48, Lemma 3] For any $K\geq 1$ and $\epsilon > 0$, suppose $\|\epsilon_k\|_{2,\mu}\leq\epsilon$ for all $k = 0, 1, \dots, K-1$. Then
$$\|v_K - v^*\|_{2,\mu} \leq 2\left(\frac{1-\gamma^{K+1}}{1-\gamma}\right)^{1/2}\left[ C_\mu^{1/2}\,\epsilon + \gamma^{K/2}(2v_{\max}) \right]. \qquad (5.13)$$

Now, from (5.13), we have
$$\|v_K - v^*\|_{2,\mu} \leq 2\left(\frac{1}{1-\gamma}\right)^{1/2}\left[ C_\mu^{1/2}\,\epsilon + \gamma^{K/2}(2v_{\max}) \right],$$
which gives a bound on $K$ such that $\gamma^{K/2}(2v_{\max}) \leq \epsilon\,C_\mu^{1/2}$. Denote
$$K^* = \frac{\log(\epsilon\,C_\mu^{1/2}) - \log(2v_{\max})}{\log\gamma}. \qquad (5.14)$$
We then construct a stochastic process as follows. We call iteration $k$ "good" if the error $\|\epsilon_k\|_{2,\mu}$ is within our desired tolerance, and "bad" when the accuracy is greater than our desired tolerance. We then construct a stochastic process $\{X_k\}_{k\geq 0}$ with state space $\mathcal{K} = \{1, 2, \dots, K^*\}$ such that
$$X_{k+1} = \begin{cases} \max\{X_k - 1,\,1\}, & \text{if iteration } k \text{ is "good"},\\ K^*, & \text{otherwise.}\end{cases}$$
The stochastic process $\{X_k\}_{k\geq 0}$ is easier to analyze than $\{v_k\}_{k\geq 0}$ because it is defined on a finite state space; however, $\{X_k\}_{k\geq 0}$ is not necessarily a Markov chain. Whenever $X_k = 1$, we have just had a string of $K^*$ "good" iterations in a row, and $\|v_k - v^*\|_{2,\mu}$ is as small as desired. We next construct a "dominating" Markov chain $\{Y_k\}_{k\geq 0}$ to help us analyze the behavior of $\{X_k\}_{k\geq 0}$. Let $\mathcal{Q}$ denote the probability measure of $\{Y_k\}_{k\geq 0}$. Since $\{Y_k\}_{k\geq 0}$ will be a Markov chain by construction, $\mathcal{Q}$ is completely determined by an initial distribution and a transition kernel for $\{Y_k\}_{k\geq 0}$. We now use the bound on the one-step error from Lemma 46, which states that when the sample sizes are sufficiently large, for all $k$,
$$\mathbb{P}\left(\|\epsilon_k\|_{2,\mu}\leq\epsilon\right) > q(N, M, J, L).$$
Let us denote this probability by $q$ for compact notation.
Let us initialize $Y_0 = K^*$ and construct the transition kernel as
$$Y_{k+1} = \begin{cases} \max\{Y_k - 1,\,1\}, & \text{w.p. } q,\\ K^*, & \text{w.p. } 1-q,\end{cases}$$
where $q$ is the probability of a "good" iteration, which increases with the sample sizes $N, M, J$ and $L$. We now describe a stochastic dominance relationship between the two stochastic processes $\{X_k\}_{k\geq 0}$ and $\{Y_k\}_{k\geq 0}$: we will establish that $\{Y_k\}_{k\geq 0}$ is "larger" than $\{X_k\}_{k\geq 0}$ in a stochastic sense. This relationship is the key to our analysis of $\{X_k\}_{k\geq 0}$.

Definition 12. Let $X$ and $Y$ be two real-valued random variables. Then $X$ is stochastically dominated by $Y$, written $X\leq_{st}Y$, when $\Pr\{X\geq\theta\}\leq\Pr\{Y\geq\theta\}$ for all $\theta$ in the support of $Y$.

Let $\{\mathcal{F}_k\}_{k\geq 0}$ be the filtration on $(\Omega^\infty, \mathcal{B}(\Omega^\infty), \mathbb{P})$ corresponding to the evolution of information about $\{X_k\}_{k\geq 0}$, and let $[X_{k+1}\,|\,\mathcal{F}_k]$ denote the conditional distribution of $X_{k+1}$ given the information $\mathcal{F}_k$. We infer the following key result from the relationship between $\{X_k\}_{k\geq 0}$ and $\{Y_k\}_{k\geq 0}$.

Lemma 48. Choose $\epsilon > 0$ and $\delta\in(0,1)$, and suppose $N, M, J$ and $L$ are chosen sufficiently large that $\mathbb{P}(\|\epsilon_k\|_{2,\mu}\leq\epsilon) > q$ for all $k\geq 0$. Then for $q\geq(1/2+\delta/2)^{1/(K^*-1)}$ and $K\geq\log\left[4/(1/2-\delta/2)\right]/\left[(1-q)q^{K^*-1}\right]$, we have
$$\|v_K - v^*\|_{2,\mu} \leq 2\left(\frac{1-\gamma^{K^*+1}}{1-\gamma}\right)^{1/2}\left[ C_\mu^{1/2}\,\epsilon + \gamma^{K^*/2}(2v_{\max}) \right] \qquad (5.15)$$
with probability at least $1-\delta$.

Proof. This proof proceeds in three steps. First, we show that $X_k\leq_{st}Y_k$ holds for all $k\geq 0$. This stochastic dominance relation is the key to our analysis: if we can show that $Y_K$ is "small" with high probability, then $X_K$ must also be small, and we infer that $\|v_K - v^*\|_{2,\mu}$ must be close to zero. By construction, $X_k\leq_{st}Y_k$ for all $k\geq 0$ (see [25, Lemma A.1] and [25, Lemma A.2]). Second, we compute the steady-state distribution of $\{Y_k\}_{k\geq 0}$ and its mixing time; in particular, we use the mixing time to choose $K$ so that the distribution of $Y_K$ is close to its steady-state distribution.
Since $\{Y_k\}_{k\geq 0}$ is an irreducible Markov chain on a finite state space, its steady-state distribution $\mu = \{\mu(i)\}_{i=1}^{K^*}$ on $\mathcal{K}$ exists. By [25, Lemma 4.3], the steady-state distribution of $\{Y_k\}_{k\geq 0}$ is given by
$$\mu(1) = q^{K^*-1}, \qquad \mu(i) = (1-q)\,q^{K^*-i}\;\;\forall i = 2, \dots, K^*-1, \qquad \mu(K^*) = 1-q.$$
The constant
$$\mu_{\min}(q;K^*) = \min\left\{ q^{K^*-1},\;(1-q)\,q^{K^*-2},\;(1-q) \right\}, \quad \forall q\in(0,1) \text{ and } K^*\geq 1,$$
appears shortly in the Markov chain mixing-time bound for $\{Y_k\}_{k\geq 0}$. We note that $(1-q)q^{K^*-1}\leq\mu_{\min}(q;K^*)$ is a simple lower bound for $\mu_{\min}(q;K^*)$. Let $Q^k$ be the marginal distribution of $Y_k$ for $k\geq 0$. By a Markov chain mixing-time argument, for any $\delta'\in(0,1)$,
$$t_{\mathrm{mix}}(\delta') \triangleq \min\left\{ k\geq 0 : \|Q^k - \mu\|_{TV}\leq\delta' \right\} \leq \log\left(\frac{1}{\delta'\,\mu_{\min}(q;K^*)}\right) \leq \log\left(\frac{1}{\delta'(1-q)q^{K^*-1}}\right).$$
Finally, we conclude the argument by using the previous part to find the probability that $Y_K = 1$, which bounds the probability that $X_K = 1$, which in turn bounds the probability that $\|v_K - v^*\|_{2,\mu}$ is below our desired error tolerance. For $K\geq\log\left[1/(\delta'(1-q)q^{K^*-1})\right]$ we have $|\Pr\{Y_K = 1\} - \mu(1)|\leq 2\delta'$. Since $X_K\leq_{st}Y_K$, we have $\Pr\{X_K = 1\}\geq\Pr\{Y_K = 1\}$, and so $\Pr\{X_K = 1\}\geq q^{K^*-1} - 2\delta'$. Choose $q$ and $\delta'$ to satisfy $q^{K^*-1} = 1/2+\delta/2$ and $2\delta' = q^{K^*-1} - \delta = 1/2-\delta/2$ to get $q^{K^*-1} - 2\delta'\geq\delta$, and the desired result follows.

Now, putting Lemmas 46 and 48 together, along with the choice of $K^*$, we can conclude Theorem 43.

5.2.4 Numerical Experiments

In this section, we try RAEVL on a synthetic problem for which we can compute the optimal value function. This allows us to compute the error at each iteration. Let $\mathcal{X} = [0,1]$ and $\mathcal{U} = [0,1]$. The reward is $r(x,u) = -(x-u)^2$, with transition probability density $p(y|x,u)$. Then the optimal value function can be found by solving the fixed-point equation
$$v^*(x) = \max_{0\leq u\leq 1}\left[ -(x-u)^2 + \frac{\gamma}{1-u}\int_u^1 v^*(y)\,dy \right],$$
which gives $v^*(x) = 0$ and $\pi^*(x) = x$ for all $x\in\mathcal{X}$. For our experiments, we choose $\varphi(x,\theta) = \cos(w^T x + b)$ where $\theta = (w,b)$.
We sample $w\sim\mathcal{N}(0,1)$ and $b\sim\mathrm{Unif}[-1,1]$, and fix the number of features at $J = 10$. Fig. 5.3 shows the error $\|v_k - v^*\|_\infty$ (y-axis) against the number of iterations $k$ (x-axis) for different choices of $N$, $M$ and $L$. The error reduces to the order of $10^{-4}$ when the sample sizes are sufficiently large ($N = M = L = 50$). But even for low values of the sample sizes, we obtain an error of $\approx 0.1$, which indicates that we can get a good approximation with little computation.

[Figure 5.3: Performance of RAEVL for different sample sizes]

5.3 Online Value Learning

In this section, we propose the ONEVaL-C algorithm, an extension of the ONEVaL algorithm (presented in Chapter 3) to continuous action spaces. We use empirical optimization in the original Bellman operator to handle continuous actions, similar to the previous section. This means for a given state $x\in\mathcal{X}$ and sample sizes $M$ and $L$, we first sample $L$ actions uniformly and then generate $M$ samples of the next state (for each sampled action). This leads us to define an empirical Bellman operator $\widehat{T}_{M,L}:\mathcal{B}(\mathcal{X})\to\mathcal{B}(\mathcal{X})$ as
$$\widehat{T}_{M,L}v(x) = \max_{1\leq l\leq L}\left[ r(x,u_l) + \frac{\gamma}{M}\sum_{m=1}^M v(x'_{lm}) \right],$$
where $x'_{lm}\sim P(\cdot|x,u_l)$ for $l = 1, 2, \dots, L$ and $m = 1, 2, \dots, M$. Instead of evaluating this operator at each state $x$, we first select $\{x_n\}_{n=1}^N$ from the previous interactions under a given policy. Then we compute $\widehat{v}(x_n) = [\widehat{T}_{M,L}v](x_n)$. Before presenting the algorithm, we state the $\beta$-mixing assumption, similar to ONEVaL.

Assumption 14 ($\beta$-mixing). The sequence of samples $\{x_t\}_{t\geq 1}$ is $\beta$-mixing at an exponential rate, i.e., there exist positive constants $b, c, c_1$ such that $\beta_m = c_1\exp(-bm^c)$. Furthermore, this is a strictly stationary process and $x_t$ has distribution $\mu$.

Algorithm 7 ONEVaL-C
Input: sample sizes $N, M, J\geq 1$; initial seed $v_0$; interactions from policy $\pi_g$: $(x_1, x_2, \dots, x_N, x_{N+1}, \dots, x_{2N-1}, x_{2N}, \dots)$.
For $k = 0, 1, 2, \dots$:
1. Select samples $\{x_n\}_{n=k+1}^{k+N}$ from the given interactions.
2. Sample actions $\{u_l\}_{l=1}^L$ uniformly from the action space $\mathcal{U}$.
3.
For each $n$ and $l$, sample i.i.d. next states $x_i^{x_n,u_l}\sim P(\cdot|x_n,u_l)$ for $i = 1, 2, \dots, M$.
4. Empirical value iteration: $\widehat{v}_{k+1}(x_n) = \widehat{T}_{M,L}v_k(x_n)$ for each $n$.
5. Function approximation: $v_{k+1} = \Pi_{\widehat{\mathcal{F}}}\,\widehat{v}_{k+1}(x_n)$.

Denote the following:
$$J_0(\epsilon,\delta) := \left(\frac{7C}{\epsilon}\left(1 + \sqrt{2\log\frac{7}{\delta}}\right)\right)^2, \qquad M_0(\epsilon,\delta) := \frac{512\,v_{\max}^2}{(\epsilon/7)^4}\log\left[\frac{56eNL(J+1)}{\delta}\left(\frac{4e\,v_{\max}}{(\epsilon/7)^2}\right)^J\right],$$
$$L_0(\epsilon,\delta) := \left(\frac{L_U\,\mathrm{diam}(\mathcal{U})}{\epsilon}\right)^{d_U}\log\left[\frac{7Ne(J+1)}{\delta}\left(\frac{2e\,v_{\max}}{\epsilon/7}\right)^J\right], \quad\text{and}\quad \eta_{N_0}(\epsilon,\delta) := \frac{512\,v_{\max}^2}{(\epsilon/7)^4}\log\left[\frac{112e(J+1)}{\delta}\left(\frac{C\gamma e\,v_{\max}}{(\epsilon^2/49)^2}\right)^J\right].$$
Now, we can give the following finite-time (non-asymptotic) guarantees on Algorithm 7 (ONEVaL-C).

Theorem 49. Suppose Assumptions 14, 13, 11 and 12 hold and let $N = 2l_N\eta_N$. Choose an $\epsilon > 0$ and $\delta\in(0,1)$. Set $\delta' = 1 - (1/2+\delta/2)^{1/(K^*-1)}$ and denote
$$\widetilde{C} := 4\left(\frac{1-\gamma^{K+1}}{1-\gamma}\right)^{1/2}C_\mu^{1/2} \quad\text{and}\quad K^* := \frac{\log(\epsilon\,C_\mu^{1/2}) - \log(2v_{\max})}{\log\gamma}. \qquad (5.16)$$
Then, if $\eta_N\geq\eta_{N_0}(\epsilon,\delta')$, $l_N\geq l_{N_0}(\epsilon,\delta')$, $L\geq L_0(\epsilon,\delta')$, $M\geq M_0(\epsilon,\delta')$, $J\geq J_0(\epsilon,\delta')$ and $K\geq\log\left[4/(1/2-\delta/2)\right]/\left[(1-q)q^{K^*-1}\right]$, we have that with probability at least $1-\delta$, Algorithm 7 (ONEVaL-C) achieves
$$\|v_K - v^*\|_{2,\mu} \leq \widetilde{C}\,\epsilon.$$

5.3.1 Convergence Analysis

In this section, we will prove Theorem 49, proceeding similarly to ONEVaL. For the sake of completeness, we restate some previously used results.

Lemma 50. [3, Lemma 2] Suppose that $X_0, \dots, X_N\in\mathcal{X}$ is a stationary $\beta$-mixing process with mixing coefficients $\beta_m$, that $H_0$ are the independent block samples (as shown in the previous section), and that $\mathcal{F}$ is a permissible class of $\mathcal{X}\to[-K,K]$ functions. Let $N = 2\eta_N l_N$. Then
$$\mathbb{P}\left(\sup_{f\in\mathcal{F}}\left|\frac{1}{N}\sum_{t=1}^N f(Z_t) - \mathbb{E}[f(Z)]\right| > \epsilon\right) \leq 16\,\mathbb{E}\left[\mathcal{N}(\epsilon/8, \mathcal{F}, H_0)\right]\exp\left(-\frac{\eta_N\epsilon^2}{128K^2}\right) + 2\eta_N\beta_{l_N}.$$
We now bound the error in one step of ONEVaL-C. In ONEVaL-C, we also have empirical optimization, where we uniformly sample actions and take the best among those samples. Hence, the algorithm can be viewed as the iteration of a random operator $\widehat{H}$, which depends on the sample sizes $N, M, L$ and $J$ and the distributions $\mu$ and $\nu$.
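The independent-blocks device behind Lemma 50 can be sketched as follows. The exact construction is in the earlier section of the dissertation; the block sizes here are illustrative assumptions. A length-$N$ trajectory, $N = 2\eta_N l_N$, is split into $2\eta_N$ consecutive blocks of length $l_N$, and every other block is kept; for a $\beta$-mixing chain, the kept blocks behave nearly independently when $l_N$ is large.

```python
import numpy as np

def odd_blocks(trajectory, eta_N, l_N):
    # Split x_1, ..., x_N (N = 2*eta_N*l_N) into 2*eta_N blocks of
    # length l_N and keep every other block: these are the
    # "nearly independent" block samples used in Lemma 50.
    assert len(trajectory) == 2 * eta_N * l_N
    blocks = np.asarray(trajectory).reshape(2 * eta_N, l_N)
    return blocks[0::2]

eta_N, l_N = 4, 5
traj = np.arange(2 * eta_N * l_N)      # stand-in trajectory x_1, ..., x_40
H0 = odd_blocks(traj, eta_N, l_N)      # eta_N blocks, shape (4, 5)
```

The coupling argument then compares empirical averages over these blocks with averages over truly independent copies, which is where the residual $2\eta_N\beta_{l_N}$ term in Lemma 50 comes from.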
As in the previous section, let $\epsilon_k$ be the gap between the random operator $\widehat{H}$ and the exact Bellman operator at iteration $k$, i.e., $v_{k+1} = \widehat{H}v_k = Tv_k + \epsilon_k$. We want to bound the error term $\epsilon_k$. The next lemma bounds the error in one iteration of ONEVaL-C; the proof is in the supplementary section.

Lemma 51. Choose $\epsilon > 0$ and $\delta\in(0,1)$. Let $N = 2\eta_N l_N$ and let $\eta_N\geq\eta_{N_0}(\epsilon,\delta)$, $l_N\geq l_{N_0}(\epsilon,\delta)$, $M\geq M_0(\epsilon,\delta)$, $J\geq J_0(\epsilon,\delta)$ and $L\geq L_0(\epsilon,\delta)$. Then, for $\widehat{H}(N,M,L,J,\mu,\nu)\,v$, the output of one iteration of ONEVaL-C, we have with probability at least $1-\delta$,
$$\|\widehat{H}v - Tv\|_{2,\mu} \leq d_{2,\mu}(T\mathcal{F}(\Theta), \mathcal{F}(\Theta)) + \epsilon$$
for any arbitrary function $v\in\mathcal{F}(\Theta)$.

Since our underlying function space is dense in the space of Lipschitz functions [56], the function approximation error is zero; hence $d_{2,\mu}(T\mathcal{F}(\Theta), \mathcal{F}(\Theta)) = 0$. Combining Lemma 51 with the stochastic dominance argument presented in the earlier section (Lemma 48) proves Theorem 49.

5.4 Empirical Policy Learning

In all the previous chapters, we have focused only on value-iteration-flavored algorithms. For continuous actions, working directly in policy spaces has proved more efficient [40]. In this section, we present a generalized policy iteration algorithm based on the actor-critic architecture. We will now use the action-value, or Q-value, function. The expected discounted reward (action-value function) for a policy $\pi$, initial state $x$ and action $u$ is
$$Q_\pi(x,u) := \mathbb{E}_\pi\left[\sum_{t=0}^{\infty}\gamma^t r(x_t,u_t) \,\middle|\, x_0 = x,\, u_0 = u\right].$$
The optimal action-value function is
$$Q^*(x,u) := \sup_{\pi\in\Pi}\mathbb{E}_\pi\left[\sum_{t=0}^{\infty}\gamma^t r(x_t,u_t) \,\middle|\, x_0 = x,\, u_0 = u\right],$$
and the policy which maximizes it is the optimal policy, $\pi^*$. The regularity assumptions on the MDP are as before. Let us now define the Bellman operator for action-value functions, $G:\mathcal{B}(\mathcal{X},\mathcal{U})\to\mathcal{B}(\mathcal{X},\mathcal{U})$, as
$$GQ(x,u) := r(x,u) + \gamma\,\mathbb{E}_{x'\sim P(\cdot|x,u)}\max_{u'}Q(x',u').$$
It is well known that the operator $G$ is a contraction with respect to the $\|\cdot\|_\infty$ norm, with contraction parameter the discount factor $\gamma$. Hence the sequence of iterates $Q_k = GQ_{k-1}$ converges to $Q^*$ geometrically. We now present our RANDomized POlicy Learning (RANDPOL) algorithm. It approximates both the action-value function and the policy, similar to actor-critic methods, and comprises two main steps: policy evaluation and policy improvement. Given a policy $\pi:\mathcal{X}\to\mathcal{U}$, we can define a policy evaluation operator $G_\pi:\mathcal{B}(\mathcal{X},\mathcal{U})\to\mathcal{B}(\mathcal{X},\mathcal{U})$ as
$$G_\pi Q(x,u) = r(x,u) + \gamma\,\mathbb{E}_{x'\sim P(\cdot|x,u)}\,Q(x',\pi(x')).$$
Note that if $\pi$ is a greedy policy with respect to $Q$, i.e., $\pi(x)\in\arg\max_{u\in\mathcal{U}}Q(x,u)$, then indeed $G_\pi Q = GQ$. When there is uncertainty in the underlying environment, computing the expectation in the Bellman operator is expensive. If we have a generative model of the environment, we can replace the expectation by an empirical mean, leading to the definition of an empirical Bellman operator for policy evaluation for a given policy $\pi$:
$$\widehat{G}_M^\pi Q(x,u) := r(x,u) + \frac{\gamma}{M}\sum_{i=1}^M Q\big(x_i^{x,u}, \pi(x_i^{x,u})\big), \qquad (5.17)$$
where $x_i^{x,u}\sim P(\cdot|x,u)$ for $i = 1, 2, \dots, M$. Note that the next-state samples $x'$ are i.i.d. If the environment is deterministic, like Atari games or locomotion tasks, having a single next state suffices and we do not need a generative model. For each iteration, we first sample $N_Q$ state-action pairs $\{(x_1,u_1), (x_2,u_2), \dots, (x_{N_Q},u_{N_Q})\}$ independently from $\mathcal{X}\times\mathcal{U}$. Then, for each sample, we compute $\widehat{Q}_M(x_n,u_n) = \widehat{G}_M^\pi Q(x_n,u_n)$ for the given action-value function $Q$ and policy $\pi$. Given the data $\{(x_n,u_n), \widehat{Q}_M(x_n,u_n)\}_{n=1}^{N_Q}$, we fit the value function over the state and action space by computing a best fit within $\widehat{\mathcal{F}}(\theta_{1:J_Q})$, solving
$$\min_\alpha \frac{1}{N_Q}\sum_{n=1}^{N_Q}\Big|\sum_{j=1}^{J_Q}\alpha_j\varphi(x_n,u_n;\theta_j) - \widehat{Q}_M(x_n,u_n)\Big|^2 \quad\text{s.t.}\quad \|(\alpha_1,\dots,\alpha_{J_Q})\|_\infty\leq C/J_Q. \qquad (5.18)$$
This optimization problem only optimizes over the weights $\alpha_{1:J_Q}$, since the parameters $\theta_{1:J_Q}$ have already been randomly sampled from a given distribution $\nu$. This completes the policy evaluation step. Next, the algorithm performs the policy improvement step. For a fixed value function $Q\in\mathcal{B}(\mathcal{X},\mathcal{U})$, define $\tilde\pi(x)\in\arg\max_{u\in\mathcal{U}}Q(x,u)$. If the action space were discrete and we had a good approximation of the value function, we could simply follow the greedy policy. But for our setting, we will need to approximate the greedy policy too. Given $N_\pi$ independent samples $\{x_1, x_2, \dots, x_{N_\pi}\}$ and a value function $Q\in\mathcal{B}(\mathcal{X},\mathcal{U})$, we compute a greedy policy empirically as
$$\pi(x)\in\arg\max_{\hat\pi\in\widehat{\Pi}(\theta_{1:J_\pi})}\frac{1}{N_\pi}\sum_{i=1}^{N_\pi}Q(x_i,\hat\pi(x_i)), \qquad (5.19)$$
where $\widehat{\Pi} = \{\hat\pi : \hat\pi = (\hat\pi_1, \dots, \hat\pi_{d_U}),\;\hat\pi_k\in\widehat{\Pi}_k(\theta_{1:J_{\pi_k}})\}$ and $J_\pi = \sum_{k=1}^{d_U}J_{\pi_k}$. With this empirical policy improvement step, we now present the complete RANDPOL algorithm, shown in Algorithm 8, where we initialize the algorithm with a random value function $Q_0$, and $\pi_0$ is the approximate greedy policy with respect to $Q_0$ computed by equation (5.19). Define the policy improvement operator $H:\mathcal{B}(\mathcal{X},\mathcal{U})\to\mathcal{B}(\mathcal{X})$ as $HQ(x) := \sup_{u\in\mathcal{U}}Q(x,u)$. If the policy $\pi$ were fixed, then $H_\pi Q(x) = Q(x,\pi(x))$ would give the performance of the policy.

Algorithm 8 RANDomized POlicy Learning (RANDPOL)
Input: sample sizes $N_Q, M, J_Q, N_\pi, J_\pi$; initial value function $Q_0$.
For $k = 0, 1, 2, \dots$:
1. Sample $\{x_n, u_n\}_{n=1}^{N_Q}$ from the state and action spaces.
2. Empirical policy evaluation:
(a) For each sample $(x_n,u_n)$, compute $\widehat{Q}_M(x_n,u_n) = \widehat{G}_M^{\pi_k}Q_k(x_n,u_n)$.
(b) Approximate $Q_{k+1}$ according to (5.18).
3. Sample $\{x_n\}_{n=1}^{N_\pi}$ from the state space.
4.
Empirical policy improvement:
(a) Approximate $\pi_{k+1}$ according to (5.19).

To measure the function approximation error, we next define distance measures for function spaces:

- $d_1(\pi,\mathcal{F}) := \sup_{f\in\mathcal{F}}\inf_{f'\in\mathcal{F}}\|f' - G_\pi f\|_{1,\mu}$ is the approximation error for a specific policy $\pi$;
- $d_1(\Pi,\mathcal{F}) := \sup_{\pi\in\Pi}d_1(\pi,\mathcal{F})$ is the inherent Bellman error for the entire class $\Pi$; and
- $e_1(\Pi,\mathcal{F}) := \sup_{Q\in\mathcal{F}}\inf_{\pi\in\Pi}\|HQ - H_\pi Q\|_{1,\mu}$ is the worst-case approximation error of the greedy policy.

Let us define:
$$J_Q^0(\epsilon,\delta) := \left(\frac{5C}{\epsilon}\left(1 + \sqrt{2\log\frac{5}{\delta}}\right)\right)^2, \qquad J_\pi^0(\epsilon,\delta) := \left(\frac{3L_UC'}{\epsilon}\left(1 + \sqrt{2\log\frac{3}{\delta}}\right)\right)^2, \qquad M^0(\epsilon,\delta) := \frac{2v_{\max}^2}{(\epsilon/5)^2}\log\frac{10N}{\delta},$$
$$N_Q^0(\epsilon,\delta) := \frac{128\,v_{\max}^2}{(\epsilon/5)^2}\log\left[\frac{40e(J_Q+1)}{\delta}\left(\frac{2e\,v_{\max}}{\epsilon/5}\right)^{J_Q}\right], \quad\text{and}\quad N_\pi^0(\epsilon,\delta) := \frac{128\,v_{\max}^2}{(\epsilon/3)^2}\log\left[\frac{24e(J_\pi+1)}{\delta}\left(\frac{2e\,v_{\max}}{(\epsilon/3)^2}\right)^{J_\pi}\right].$$

Theorem 52. Let Assumptions 11, 12 and 13 hold. Choose an $\epsilon > 0$ and $\delta\in(0,1)$. Set $\delta' = 1 - (1/2+\delta/2)^{1/(K^*-1)}$ and denote
$$\widetilde{C} := 4\,\frac{1-\gamma^{K+1}}{(1-\gamma)^2}\,C_\mu \quad\text{and}\quad K^* := \frac{\log(\epsilon\,C_\mu) - \log(2Q_{\max})}{\log\gamma}. \qquad (5.20)$$
Then, if $N_Q\geq N_Q^0(\epsilon,\delta')$, $N_\pi\geq N_\pi^0(\epsilon,\delta')$, $M\geq M^0(\epsilon,\delta')$, $L\geq L^0(\epsilon,\delta')$, $J_Q\geq J_Q^0(\epsilon,\delta')$, $J_\pi\geq J_\pi^0(\epsilon,\delta')$ and $K\geq\log\left[4/(1/2-\delta/2)\right]/\left[(1-q)q^{K^*-1}\right]$, we have that with probability at least $1-\delta$,
$$\|Q_{\pi_K} - Q^*\|_{1,\mu} \leq \widetilde{C}\,\epsilon.$$

Remark: The theorem above states that if we have a sufficiently large number of samples, in particular $J_\pi, J_Q, M = O\left(\frac{1}{\epsilon^2}\log\frac{1}{\delta}\right)$ and $N_Q, N_\pi = O\left(\frac{1}{\epsilon^2}\log\frac{1}{\epsilon^J\delta}\right)$, then for sufficiently many iterations, the approximation error can be made arbitrarily small with high probability. Moreover, if the Lipschitz continuity assumption is not satisfied, the result can be stated in the more general form
$$\|Q_{\pi_K} - Q^*\|_{1,\mu} \leq 2\,\frac{1-\gamma^{K+1}}{(1-\gamma)^2}\,C_\mu\left[ d_1(\Pi(\Theta),\mathcal{F}(\Theta)) + \gamma C_\mu\,e_1(\Pi(\Theta),\mathcal{F}(\Theta)) + \epsilon \right].$$

5.4.1 Convergence Analysis

In this section, we will analyze Algorithm 8. First, we will bound the error in one iteration of RANDPOL, and then analyze how the errors propagate through iterations.
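The empirical policy improvement step (5.19) can be sketched concretely: among a small randomized class of candidate policies, pick the one maximizing the average Q-value over sampled states. The Q-function, the affine-clipped policy class, and all sizes below are illustrative assumptions made for the sketch, not the dissertation's experiment.

```python
import numpy as np

rng = np.random.default_rng(2)

def Q(x, u):
    # Illustrative Q-function on X = U = [0,1]; greedy action is u = x
    return -(x - u) ** 2

def improve(candidates, states):
    # Empirical policy improvement (5.19): maximize the sample-average
    # Q-value over the finite set of candidate policies
    scores = [np.mean(Q(states, pi(states))) for pi in candidates]
    return candidates[int(np.argmax(scores))]

# Randomized policy class: pi(x) = clip(a*x + b, 0, 1), (a, b) random
params = rng.uniform(-1.0, 2.0, size=(50, 2))
candidates = [
    (lambda x, a=a, b=b: np.clip(a * x + b, 0.0, 1.0)) for a, b in params
]

states = rng.uniform(0.0, 1.0, size=200)   # the N_pi sampled states
best = improve(candidates, states)
score = float(np.mean(Q(states, best(states))))
```

Since the true greedy policy here is $\pi(x) = x$ (i.e., $a=1$, $b=0$, with value $0$), the selected candidate's score should be close to $0$ whenever some candidate lands near those parameters, mirroring how a richer randomized class $\widehat{\Pi}(\theta_{1:J_\pi})$ shrinks the policy improvement error.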
5.4.2 Error in one iteration

Since RANDPOL approximates at two levels, policy evaluation and policy improvement, we decompose the error in one iteration as the sum of the approximation errors at both levels. If a function $Q$ is given as input to RANDPOL for an iteration, the resulting value function $Q'$ can be written as the application of a random operator $\hat{G}$. This random operator depends on the input sample sizes $N_Q, M, J_Q, N_\pi, J_\pi$ and on the input value function $Q$. Let $\pi$ be the approximate greedy policy with respect to $Q$. Thus, we have $Q' = \hat{G}(N_Q, M, J_Q, N_\pi, J_\pi)Q$; for concise notation, we simply write $Q' = \hat{G}Q$. Let us now decompose this error into the policy evaluation and policy improvement approximation errors:
$$Q' = GQ + (Q' - G_\pi Q) + (G_\pi Q - GQ) = GQ + \varepsilon' + \varepsilon'' = GQ + \varepsilon,$$
where $\varepsilon' = Q' - G_\pi Q$, $\varepsilon'' = G_\pi Q - GQ$ and $\varepsilon = \varepsilon' + \varepsilon''$. In other words, $\varepsilon'$ is the approximation error in policy evaluation and $\varepsilon''$ is the error in policy improvement. If we can control both of these errors, we can bound $\varepsilon$, the error in one step of our algorithm.

Policy evaluation approximation. Let us first bound the approximation error in policy evaluation, $\varepsilon'$.

Lemma 53. Choose $Q \in \mathcal{F}(\Theta)$, $\varepsilon > 0$, and $\delta \in (0,1)$. Also choose $N_Q \ge N^0_Q(\varepsilon,\delta)$, $M \ge M^0(\varepsilon,\delta)$ and $J_Q \ge J^0_Q(\varepsilon,\delta)$. Then, for $Q' = \hat{G}Q$, the output of one iteration of our algorithm, we have
$$\|Q' - G_\pi Q\|_{1,\mu} \le d_1(\Pi(\Theta), \mathcal{F}(\Theta)) + \varepsilon$$
with probability at least $1-\delta$.

Proof. To begin, let $\varepsilon_0 > 0$ be arbitrary and choose $f^* \in \mathcal{F}(\Theta)$ such that $\|f^* - G_\pi Q\|_{1,\mu} \le \inf_{f\in\mathcal{F}(\Theta)} \|f - G_\pi Q\|_{1,\mu} + \varepsilon_0$. Using Pollard's inequality, we have
$$P\left(\sup_{\hat{f}\in\hat{\mathcal{F}}(\theta_{1:J})} \left|\|\hat{f} - G_\pi Q\|_{1,\mu} - \|\hat{f} - G_\pi Q\|_{1,\hat{\mu}}\right| > \varepsilon/5\right) \le 8e\,(J_Q+1)\left(\frac{4e\,Q_{\max}}{\varepsilon/5}\right)^{J_Q} \exp\left(\frac{-N_Q\,(\varepsilon/5)^2}{128\, Q_{\max}^2}\right), \qquad (5.21)$$
where the inequality uses the fact that the pseudo-dimension of the function class $\hat{\mathcal{F}}(\theta_{1:J_Q})$ is $J_Q$. Now, for a given sample $(x_i, u_i)$, we have
$$\left|G_\pi Q(x_i,u_i) - \hat{G}^\pi_M Q(x_i,u_i)\right| = \gamma\left|\mathbb{E}_{x'}\left[Q(x',\pi(x'))\right] - \frac{1}{M}\sum_{j=1}^M Q(x'_j, \pi(x'_j))\right|,$$
where the $x'_j \sim P(\cdot\,|\,x_i,u_i)$ are i.i.d.
Using Hoeffding's concentration inequality followed by a union bound, we have
$$P\left(\max_{i=1,\ldots,N} \left|G_\pi Q(x_i,u_i) - \hat{G}^\pi_M Q(x_i,u_i)\right| > \varepsilon/5\right) \le \gamma N \exp\left(\frac{-M\,(\varepsilon/5)^2}{2Q_{\max}^2}\right).$$
Hence, we have
$$P\left(\|G_\pi Q - \hat{G}^\pi_M Q\|_{1,\hat{\mu}} > \varepsilon/5\right) < \gamma N \exp\left(\frac{-M\,(\varepsilon/5)^2}{2Q_{\max}^2}\right). \qquad (5.22)$$
Then, choose $\hat{f} \in \hat{\mathcal{F}}(\theta_{1:J_Q})$ such that $\|\hat{f} - G_\pi Q\|_{1,\mu} \le \|f^* - G_\pi Q\|_{1,\mu} + \varepsilon/5$ with probability at least $1-\delta/5$, by choosing $J_Q \ge 1$ to satisfy
$$\frac{C}{\sqrt{J_Q}}\left(1 + \sqrt{2\log\frac{1}{\delta/5}}\right) \le \frac{\varepsilon}{5} \;\Longrightarrow\; J_Q \ge \left(\frac{5C}{\varepsilon}\left(1+\sqrt{2\log\frac{5}{\delta}}\right)\right)^2,$$
by [57, Lemma 1] and the fact that $\|\hat{f}-f^*\|_{1,\mu} \le \|\hat{f}-f^*\|_{2,\mu}$ by Jensen's inequality.

Now we have the following string of inequalities, each of which holds with probability $1-\delta/5$:
$$\begin{aligned}
\|Q' - G_\pi Q\|_{1,\mu} &\le \|Q' - G_\pi Q\|_{1,\hat{\mu}} + \varepsilon/5 &(5.23)\\
&\le \|Q' - \hat{G}^\pi_M Q\|_{1,\hat{\mu}} + 2\varepsilon/5 &(5.24)\\
&\le \|\hat{f} - \hat{G}^\pi_M Q\|_{1,\hat{\mu}} + 2\varepsilon/5 &(5.25)\\
&\le \|f^* - \hat{G}^\pi_M Q\|_{1,\hat{\mu}} + 3\varepsilon/5 &(5.26)\\
&\le \|f^* - G_\pi Q\|_{1,\hat{\mu}} + 4\varepsilon/5 &(5.27)\\
&\le \|f^* - G_\pi Q\|_{1,\mu} + \varepsilon &(5.28)\\
&\le \inf_{f\in\mathcal{F}(\Theta)} \|f - G_\pi Q\|_{1,\mu} + \varepsilon_0 + \varepsilon. &(5.29)
\end{aligned}$$
We choose $N_Q$ from inequality (5.21) such that inequalities (5.23) and (5.28) hold with probability at least $1-\delta/5$. Inequalities (5.24) and (5.27) follow by bounding the right side of (5.22) by $\delta/5$ and choosing $M$ appropriately. Inequality (5.25) follows from the fact that $Q' = \hat{G}Q$ attains the least approximation error among all functions $\hat{f} \in \hat{\mathcal{F}}(\theta_{1:J_Q})$. Inequality (5.26) follows by the choice of $J_Q$. The last inequality is by the choice of $f^*$.

Policy improvement approximation. The second step in the algorithm is policy improvement, and we now bound the approximation error in this step, $\varepsilon''$.

Lemma 54. Under Assumption 13, for any action-value function $Q$ and policy $\pi$, we have
$$\|GQ - G_\pi Q\|_{1,\mu} \le \gamma C_\mu \|HQ - H_\pi Q\|_{1,\mu}.$$

Proof. For any state-action pair $(x,u) \in \mathcal{X}\times\mathcal{U}$, we have
$$GQ(x,u) - G_\pi Q(x,u) = \gamma\int \left[\max_{u'} Q(y,u') - Q(y,\pi(y))\right] dP(y\,|\,x,u) \le \gamma C_\mu \int \left(HQ(y) - H_\pi Q(y)\right) d\mu,$$
where we used Assumption 13 for the inequality. Note also that $GQ(x,u) - G_\pi Q(x,u)$ is non-negative. Now,
$$\|GQ - G_\pi Q\|_{1,\mu} = \int \left(GQ(x,u) - G_\pi Q(x,u)\right) d\mu \le \gamma C_\mu \int \left(HQ(y) - H_\pi Q(y)\right) d\mu = \gamma C_\mu \|HQ - H_\pi Q\|_{1,\mu},$$
where the final equality uses the non-negativity of $HQ - H_\pi Q$.

Lemma 55.
Under Assumption 12, for any $Q \in \mathcal{F}(\Theta)$ and $\pi \in \Pi(\Theta)$, we have for all $x \in \mathcal{X}$,
$$HQ(x) - H_\pi Q(x) \le L_\mathcal{U}\,\|\tilde{\pi}(x) - \pi(x)\|,$$
where $\tilde{\pi}(x) = \arg\max_u Q(x,u)$ and $L_\mathcal{U} = L_r + \gamma Q_{\max} L_p$.

Proof. Using the Lipschitz Assumption 12, we have
$$\begin{aligned}
|Q(x,u) - Q(x,u')| &\le |r(x,u) - r(x,u')| + \gamma\left|\int_\mathcal{X} \left(P(dy\,|\,x,u) - P(dy\,|\,x,u')\right)\max_{\cdot}\, Q(y,\cdot)\right|\\
&\le L_r\|u - u'\| + \gamma Q_{\max}\int_\mathcal{X} \left|P(dy\,|\,x,u) - P(dy\,|\,x,u')\right|\\
&\le (L_r + \gamma Q_{\max} L_p)\|u - u'\|.
\end{aligned}$$
Now, we use this Lipschitz property of the $Q$ function as follows:
$$HQ(x) - H_\pi Q(x) = Q(x,\tilde{\pi}(x)) - Q(x,\pi(x)) \le (L_r + \gamma Q_{\max} L_p)\,\|\tilde{\pi}(x) - \pi(x)\|.$$

Now, we proceed similarly to how we bounded the policy evaluation error.

Lemma 56. Choose $Q \in \mathcal{F}(\Theta)$, $\varepsilon > 0$, and $\delta \in (0,1)$. Let the greedy policy with respect to $Q$ be $\tilde{\pi}(x) \in \arg\max_u Q(x,u)$. Also choose $N_\pi \ge N^0_\pi(\varepsilon,\delta)$ and $J_\pi \ge J^0_\pi(\varepsilon,\delta)$. Then, if the policy $\pi$ is computed with respect to $Q$ by equation (5.19), the policy improvement step of our algorithm, we have
$$\|G_\pi Q - GQ\|_{1,\mu} \le \gamma C_\mu\left[e_1(\Pi(\Theta), \mathcal{F}(\Theta)) + \varepsilon\right]$$
with probability at least $1-\delta$.

Proof. Let $\varepsilon_0 > 0$ be arbitrary and choose $f^* \in \Pi(\Theta)$ such that $\|HQ - H_{f^*}Q\|_{1,\mu} \le \inf_{f\in\Pi(\Theta)} \|HQ - H_f Q\|_{1,\mu} + \varepsilon_0$. Similarly to the policy evaluation step, we have
$$P\left(\sup_{\pi\in\hat{\Pi}(\theta_{1:J_\pi})} \left|\|HQ - H_\pi Q\|_{1,\mu} - \|HQ - H_\pi Q\|_{1,\hat{\mu}}\right| > \varepsilon/3\right) \le 8e\,(J_\pi+1)\left(\frac{4e\,U_{\max}}{\varepsilon/3}\right)^{J_\pi} \exp\left(\frac{-N_\pi\,(\varepsilon/3)^2}{128\, Q_{\max}^2}\right), \qquad (5.30)$$
where the inequality follows from Pollard's inequality and the fact that the pseudo-dimension of the underlying function class is $J_\pi$.
Also, note that since $HQ(x) - H_\pi Q(x)$ is non-negative for any $x \in \mathcal{X}$ and $\pi$ maximizes the empirical mean of the action-value functions, we have for any $f \in \hat{\Pi}(\theta_{1:J_\pi})$:
$$\|HQ - H_\pi Q\|_{1,\hat{\mu}} = \frac{1}{N_\pi}\sum_{i=1}^{N_\pi} HQ(x_i) - \frac{1}{N_\pi}\sum_{i=1}^{N_\pi} H_\pi Q(x_i) \le \|HQ - H_f Q\|_{1,\hat{\mu}}. \qquad (5.31)$$
Now we have the following string of inequalities, each of which holds with probability $1-\delta/3$:
$$\begin{aligned}
\|HQ - H_\pi Q\|_{1,\mu} &\le \|HQ - H_\pi Q\|_{1,\hat{\mu}} + \varepsilon/3 &(5.32)\\
&\le \|HQ - H_f Q\|_{1,\hat{\mu}} + \varepsilon/3 &(5.33)\\
&\le \|HQ - H_f Q\|_{1,\mu} + 2\varepsilon/3 &(5.34)\\
&\le \|HQ - H_{f^*} Q\|_{1,\mu} + \|H_{f^*} Q - H_f Q\|_{1,\mu} + 2\varepsilon/3 &(5.35)\\
&\le \|HQ - H_{f^*} Q\|_{1,\mu} + L_\mathcal{U}\,\|f^* - f\|_{1,\mu} + 2\varepsilon/3 &(5.36)\\
&\le \inf_{f\in\Pi(\Theta)} \|HQ - H_f Q\|_{1,\mu} + \varepsilon_0 + \varepsilon. &(5.37)
\end{aligned}$$
Inequalities (5.32) and (5.34) follow by choosing $N_\pi$ such that (5.30) holds with probability at least $1-\delta/3$. Inequality (5.33) follows from (5.31), and (5.35) is the triangle inequality. To prove inequality (5.36), we first use Lemma 55 for the policies $f^*$ and $f$, and then [57, Lemma 1], so that the following holds with probability at least $1-\delta/3$:
$$\|f^* - f\|_{1,\mu} \le \|f^* - f\|_{2,\mu} \le \frac{C'}{\sqrt{J_\pi}}\left(1 + \sqrt{2\log\frac{1}{\delta/3}}\right). \qquad (5.38)$$
Bounding the right side by $\varepsilon/(3L_\mathcal{U})$ gives the bound on $J_\pi$. The last inequality is by the choice of $f^*$. Applying Lemma 54 concludes the proof.

5.4.3 Stochastic dominance

Having bounded the error in one step, we now bound the error when the random operator $\hat{G}$ is applied iteratively, by constructing a dominating Markov chain. Since we do not have a contraction with respect to the $L_1$ norm, we need an upper bound on how the errors propagate across iterations. Recalling that $\hat{G}Q_k = GQ_k + \varepsilon_k$, we use the point-wise error bounds computed in the previous section. For a given error tolerance, this yields a bound on the number of iterations, which we call $K^*$, as shown in equation (5.20); the details of the choice of $K^*$ are given in the appendix. We then construct a stochastic process as follows. We call iteration $k$ "good" if the error satisfies $\|\varepsilon_k\|_{1,\mu} \le \varepsilon$, and "bad" otherwise.
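Before the formal construction, the following toy simulation previews the object we are about to build: a state in $\{1, \ldots, K^*\}$ that decreases by one after a "good" iteration and resets to $K^*$ after a "bad" one, so that reaching state 1 corresponds to a run of $K^*$ good iterations in a row. The values of $q$ (the probability of a good iteration) and $K^*$ below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

K_STAR = 5   # number of states; stands in for K* in the text
q = 0.95     # probability of a "good" iteration (illustrative)

def step(y):
    """One transition: move down one state on a good iteration, reset on a bad one."""
    return max(y - 1, 1) if rng.random() < q else K_STAR

n_runs, horizon = 5000, 100
hits = 0
for _ in range(n_runs):
    y = K_STAR
    for _ in range(horizon):
        y = step(y)
    hits += (y == 1)

p_hat = hits / n_runs             # empirical Pr(chain ends in state 1)
p_stationary = q ** (K_STAR - 1)  # stationary mass at state 1, as derived in the proof of Lemma 57
```

For $q = 0.95$ and $K^* = 5$, the fraction of runs ending in state 1 concentrates near the stationary mass $q^{K^*-1} \approx 0.815$ computed later in the proof of Lemma 57.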
We then construct a stochastic process $\{X_k\}_{k\ge 0}$ with state space $\mathcal{K} := \{1, 2, \ldots, K^*\}$ such that
$$X_{k+1} = \begin{cases} \max\{X_k - 1,\, 1\}, & \text{if iteration } k \text{ is "good"},\\ K^*, & \text{otherwise}. \end{cases}$$
The stochastic process $\{X_k\}_{k\ge 0}$ is easier to analyze than $\{v_k\}_{k\ge 0}$ because it is defined on a finite state space; however, $\{X_k\}_{k\ge 0}$ is not necessarily a Markov chain. Whenever $X_k = 1$, we have just had a string of $K^*$ "good" iterations in a row, and $\|Q^{\pi_k} - Q^*\|_{1,\mu}$ is as small as desired.

We next construct a "dominating" Markov chain $\{Y_k\}_{k\ge 0}$ to help us analyze the behavior of $\{X_k\}_{k\ge 0}$, and we let $\mathcal{Q}$ denote the probability measure of $\{Y_k\}_{k\ge 0}$. Since $\{Y_k\}_{k\ge 0}$ will be a Markov chain by construction, the probability measure $\mathcal{Q}$ is completely determined by an initial distribution on $\mathcal{K}$ and a transition kernel for $\{Y_k\}_{k\ge 0}$. We now use the bound on the one-step error presented in the previous section, which states that when the sample sizes are sufficiently large, for all $k$,
$$P\left(\|\varepsilon_k\|_{1,\mu} \le \varepsilon\right) > q(N_Q, M, J_Q, N_\pi, J_\pi).$$
Denote this probability by $q$ for compact notation. Initialize $Y_0 = K^*$, and construct the process
$$Y_{k+1} = \begin{cases} \max\{Y_k - 1,\, 1\}, & \text{w.p. } q,\\ K^*, & \text{w.p. } 1-q, \end{cases}$$
where $q$, the probability of a "good" iteration, increases with the sample sizes $N$, $M$, $J$ and $L$.

We now describe a stochastic dominance relationship between the two stochastic processes $\{X_k\}_{k\ge 0}$ and $\{Y_k\}_{k\ge 0}$: we will establish that $\{Y_k\}_{k\ge 0}$ is "larger" than $\{X_k\}_{k\ge 0}$ in a stochastic sense. This relationship is the key to our analysis of $\{X_k\}_{k\ge 0}$.

Definition 13. Let $X$ and $Y$ be two real-valued random variables. Then $X$ is stochastically dominated by $Y$, written $X \le_{st} Y$, when $\Pr\{X \ge \theta\} \le \Pr\{Y \ge \theta\}$ for all $\theta$ in the support of $Y$.

The next lemma uses stochastic dominance to show that if the error in each iteration is small, then after sufficiently many iterations we will have a small approximation error. The proof is given in the supplementary material.

Lemma 57.
Choose $\varepsilon > 0$ and $\delta \in (0,1)$, and suppose $N$, $M$, $J$ and $L$ are chosen sufficiently large that $P(\|\varepsilon_k\|_{1,\mu} \le \varepsilon) > q$ for all $k \ge 0$. Then, for $q \ge (1/2 + \delta/2)^{1/(K^*-1)}$ and
$$K \ge \log\frac{4}{(1/2-\delta/2)(1-q)q^{K^*-1}},$$
we have with probability at least $1-\delta$,
$$\|Q^{\pi_K} - Q^*\|_{1,\mu} \le 2\left(\frac{1-\gamma^{K+1}}{1-\gamma}\right)^{1/2}\left[\varepsilon^{1/2} C_\mu^{1/2} + \gamma^{K/2}(2Q_{\max})\right].$$

Now, to prove Theorem 52, we combine Lemmas 53 and 56 to bound the error in one iteration. Next, we use Lemma 57 to bound the error after a sufficient number of iterations. Finally, since our function class forms a reproducing kernel Hilbert space [56], which is dense in the space of continuous functions, the function approximation error is zero.

5.4.4 Numerical Experiments

In this section, we present experiments showcasing the improved performance attained by our proposed algorithm compared to state-of-the-art deep RL methods. In the first part, we try it on a simpler environment where the optimal policy can be computed theoretically. The second part focuses on a challenging, high-dimensional environment: a quadrupedal robot.

5.4.4.1 Proof of Concept

We first test our proposed algorithm on a synthetic example where we can calculate the optimal value function and optimal policy analytically. In this example, $\mathcal{X} = [0,1]$ and $\mathcal{U} = [0,1]$. The reward is $r(x,u) = -(x-u)^2$ and $P(\cdot\,|\,x,u) = \mathrm{Unif}[u,1]$. The optimality equation for the action-value function can be written as
$$Q(x,u) = -(x-u)^2 + \frac{\gamma}{1-u}\int_u^1 \max_w Q(y,w)\, dy.$$
The value function is $v(x) = \max_u Q(x,u)$. For this example, $v^*(x) = 0$ and $\pi^*(x) = x$. We ran the experiment with $N_Q = 100$, $N_\pi = 100$, $J_Q = 20$, $J_\pi = 20$ and discount factor $\gamma = 0.7$. Fig. 5.4 shows the optimal policy and the error in the performance of the policy, $\|v^\pi - v^*\|_\infty$. The approximate policy is computed with $M = 10$ and is very close to the optimal policy. The performance-error plot shows that even with a small number of next-state samples in the empirical policy evaluation step, the algorithm achieves good performance.
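The analytic solution can be verified directly by Monte Carlo rollouts: under $\pi^*(x) = x$ the per-step reward $-(x-u)^2$ is identically zero, so the discounted return is exactly zero, while any other fixed policy earns a strictly negative return. A minimal check (the horizon is truncated at 50 steps, which is negligible since rewards are bounded and $\gamma = 0.7$):

```python
import numpy as np

rng = np.random.default_rng(2)
gamma = 0.7

def rollout(policy, x0, horizon=50):
    """Discounted return in the example: r(x, u) = -(x - u)^2, x' ~ Unif[u, 1]."""
    x, ret, disc = x0, 0.0, 1.0
    for _ in range(horizon):
        u = policy(x)
        ret += disc * -((x - u) ** 2)
        disc *= gamma
        x = rng.uniform(u, 1.0)
    return ret

pi_star = lambda x: x      # optimal policy: every reward is exactly zero
pi_bad = lambda x: 0.5     # an arbitrary fixed suboptimal policy

v_star = np.mean([rollout(pi_star, rng.uniform()) for _ in range(1000)])
v_bad = np.mean([rollout(pi_bad, rng.uniform()) for _ in range(1000)])
```

The estimate `v_star` is exactly zero, matching $v^*(x) = 0$, while `v_bad` is strictly negative.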
Figure 5.4: Performance of RANDPOL on a synthetic example. Left: optimal and approximate policy. Right: error with iterations.

Table 5.1: Maximal reward after 3M steps.
Algorithm: Avg. Maximal Reward
RANDPOL: 13.894
DDPG: 12.432
PPO: 13.683
RANDPOL_N: 11.581

5.4.4.2 Minitaur

In this example, we focus on forward locomotion of a quadrupedal robot. The state is a 28-dimensional vector consisting of position, roll, pitch, velocity, etc. The action is an 8-dimensional vector of torques on the legs. The physics engine for this environment is PyBullet.¹ The reward has four components: a reward for moving forward, and penalties for sideways translation, sideways rotation, and energy expenditure. In our experiments, we maintain an experience replay buffer, as previously used in [44, 40], which stores data from past policies. Sampling data from this buffer also helps break the correlations among them. For exploration, we use an Ornstein-Uhlenbeck process [40]. We compare against the popular deep RL algorithms DDPG [40] and PPO [60].

¹https://github.com/bulletphysics/bullet3/blob/004dcc34041d1e5a5d92f747296b0986922ebb96/examples/pybullet/gym/pybullet_envs/minitaur/envs/minitaur_gym_env.py

Figure 5.5: Left: Minitaur environment in PyBullet. Right: average reward for RANDPOL, PPO and DDPG, averaged over five random seeds.

In both baseline algorithms, the policy and the value function are represented by fully connected deep neural networks. DDPG uses a deterministic policy, while PPO uses a stochastic policy, Gaussian in particular. We also include a random policy, in which actions are sampled uniformly from the action space. For RANDPOL, we use randomized networks with two hidden layers, where only the top layer is tuned; the weights of the bottom layers are drawn uniformly at random with zero mean and standard deviation inversely proportional to the number of units. These random connections are not trained in subsequent iterations. Fig. 5.5 shows the learning curves for DDPG, PPO, RANDPOL and the uniformly random policy. We found that RANDPOL gives better performance than both DDPG and PPO. We also tried a variation of RANDPOL, which we call RANDPOL_N, in which the fixed weights are drawn from a normal distribution. Table 5.1 shows the average maximal score of each algorithm. We found that RANDPOL_N sometimes draws a large weight for a connection, which can degrade performance compared to uniformly sampled weights.

5.5 Conclusions

In this chapter, we focused on MDPs with continuous state and continuous action spaces. The previous chapters dealt with finite action spaces, where the optimization step (finding the best action given an action-value function) is straightforward. For continuous action spaces, we need to optimize over the action space, and for this we proposed a uniform sampling method; this method can be used in conjunction with empirical dynamic programming algorithms. We also presented an actor-critic style algorithm with randomized networks. This algorithm has the flavor of policy iteration, which is efficient when the actions are continuous. We evaluated the policy learning algorithm in a challenging high-dimensional environment, Minitaur, against state-of-the-art deep learning algorithms.

5.6 Proofs

5.6.1 Proof of Lemma 44

Let $f(u) = r(x,u) + \gamma\,\mathbb{E}_{x'}[v(x')]$ for a given $x \in \mathcal{X}$, and let $u^*$ be a maximizer of $f$. Let $\mathcal{B}(u,r)$ denote the ball centered at $u$ with radius $r$; the volume of this $d_\mathcal{U}$-dimensional ball satisfies $\mathrm{vol}(\mathcal{B}(u,r)) \propto r^{d_\mathcal{U}}$. Let $\mathcal{U}_\varepsilon = \{u \in \mathcal{U} : f(u) \le \max_u f(u) - \varepsilon\}$. Moreover, let $\mathcal{U}_{\varepsilon,L_\mathcal{U}} = \{u \in \mathcal{U} : \|u^* - u\| \le \varepsilon/L_\mathcal{U}\}$. Since $f$ is $L_\mathcal{U}$-Lipschitz, $u \in \mathcal{U}_{\varepsilon,L_\mathcal{U}}$ implies $u \notin \mathcal{U}_\varepsilon$. Hence, for a uniformly sampled $u$,
$$P\left(u \notin \mathcal{U}_{\varepsilon,L_\mathcal{U}}\right) = 1 - \frac{\mathrm{vol}(\mathcal{U}_{\varepsilon,L_\mathcal{U}})}{\mathrm{vol}(\mathcal{U})} \le 1 - \left(\frac{\varepsilon}{L_\mathcal{U}\,\mathrm{diam}(\mathcal{U})}\right)^{d_\mathcal{U}}, \qquad (5.39)$$
where the inequality follows from Lemma 5.2 in [70] and $\mathrm{diam}(\mathcal{U}) = \sup_{u,u'} \|u - u'\|$.
Now,
$$P\left(f(u^*) - \max_{1\le l\le L} f(u_l) \le \varepsilon\right) = 1 - P\left(\bigcap_{l=1}^L \{u_l \in \mathcal{U}_\varepsilon\}\right) = 1 - P\left(u_1 \in \mathcal{U}_\varepsilon\right)^L \ge 1 - P\left(u_1 \notin \mathcal{U}_{\varepsilon,L_\mathcal{U}}\right)^L,$$
where the second equality holds because $\{u_1, u_2, \ldots, u_L\}$ are i.i.d., and the inequality follows from the Lipschitz continuity of $f$, which gives $\{u \in \mathcal{U}_\varepsilon\} \subseteq \{u \notin \mathcal{U}_{\varepsilon,L_\mathcal{U}}\}$. Now, using (5.39), we have
$$P\left(f(u^*) - \max_{1\le l\le L} f(u_l) \le \varepsilon\right) \ge 1 - \left(1 - \left(\frac{\varepsilon}{L_\mathcal{U}\,\mathrm{diam}(\mathcal{U})}\right)^{d_\mathcal{U}}\right)^L.$$
Setting $\left(\frac{\varepsilon}{L_\mathcal{U}\,\mathrm{diam}(\mathcal{U})}\right)^{d_\mathcal{U}} = \frac{1}{L}\log\frac{1}{\delta}$ and using $1-x \le e^{-x}$, we have $P\left(f(u^*) - \max_{1\le l\le L} f(u_l) \le \varepsilon\right) \ge 1-\delta$ for the choice of $L$ presented in (5.2).

5.6.2 Proof of Lemma 51

To begin, let $\varepsilon_0 > 0$ be arbitrary and choose $f^* \in \mathcal{F}(\Theta)$ such that $\|f^* - Tv\|_{2,\mu} \le \inf_{f\in\mathcal{F}(\Theta)} \|f - Tv\|_{2,\mu} + \varepsilon_0$. Let $\tilde{\mathcal{F}} = \{f - Tg : f \in \hat{\mathcal{F}}(\theta_{1:J}),\ g \in \hat{\mathcal{F}}(\theta_{1:J})\}$ and $\hat{\mathcal{F}}_T = \{Tf : f \in \hat{\mathcal{F}}(\theta_{1:J})\}$. From the definition of covering numbers, we have for random variables $X_{1:N} = X_1, X_2, \ldots, X_N$:
$$\mathcal{N}\left(\varepsilon, \tilde{\mathcal{F}}, X_{1:N}\right) \le \mathcal{N}\left(\varepsilon/2, \hat{\mathcal{F}}(\theta_{1:J}), X_{1:N}\right)\,\mathcal{N}\left(\varepsilon/2, \hat{\mathcal{F}}_T, X_{1:N}\right). \qquad (5.40)$$
Now, let us bound the covering number of the function class $\hat{\mathcal{F}}_T$. For any functions $f_1, f_2 \in \hat{\mathcal{F}}(\theta_{1:J})$, we have
$$\sum_{i=1}^N |Tf_1(X_i) - Tf_2(X_i)| \le N\,\|Tf_1 - Tf_2\|_\infty \le \gamma N\,\|f_1 - f_2\|_\infty.$$
Note that $|f_1(x) - f_2(x)| = |\alpha^\top \phi(\theta_{1:J};x) - \beta^\top \phi(\theta_{1:J};x)| \le \|\alpha - \beta\|$. Since the norm of the weights is bounded by $C/\sqrt{J}$, the covering number can now be bounded as
$$\mathcal{N}\left(\varepsilon, \hat{\mathcal{F}}_T, X_{1:N}\right) \le \left(\frac{2C\gamma}{\varepsilon}\right)^J.$$
Using $(x+y)^2 \ge x^2 + y^2$ for $x, y \ge 0$, we have
$$\begin{aligned}
&P\left(\sup_{\hat{f}\in\hat{\mathcal{F}}(\theta_{1:J})}\ \sup_{g\in\hat{\mathcal{F}}(\theta_{1:J})} \left|\|\hat{f} - Tg\|_{2,\mu} - \|\hat{f} - Tg\|_{2,\hat{\mu}}\right| > \varepsilon/7\right)\\
&\quad\le P\left(\sup_{\hat{f}\in\hat{\mathcal{F}}(\theta_{1:J})}\ \sup_{g\in\hat{\mathcal{F}}(\theta_{1:J})} \left|\|\hat{f} - Tg\|^2_{2,\mu} - \|\hat{f} - Tg\|^2_{2,\hat{\mu}}\right| > \varepsilon^2/49\right)\\
&\quad\le 16e\,(J+1)\left(\frac{200\,C\gamma e\, v_{\max}}{\varepsilon^3}\right)^J \exp\left(\frac{-\eta_N\,(\varepsilon/7)^4}{512\, v_{\max}^2}\right) + 2\eta_N c_1 \exp(-b\, l_N^c), \qquad (5.41)
\end{aligned}$$
where the last inequality follows by using Lemma 50, bounding the covering numbers using (5.40) and [28, Corollary 3]. Furthermore, the $\beta$-mixing coefficient is $O(\exp(-b\, l_N^c))$. Then, choose $\hat{f} \in \hat{\mathcal{F}}(\theta_{1:J})$ such that $\|\hat{f} - Tv\|_{2,\mu} \le \|f^* - Tv\|_{2,\mu} + \varepsilon/5$ with probability at least $1-\delta/5$, by choosing $J \ge 1$ to satisfy
$$\frac{C}{\sqrt{J}}\left(1 + \sqrt{2\log\frac{1}{\delta/5}}\right) \le \frac{\varepsilon}{5} \;\Longrightarrow\; J \ge \left(\frac{5C}{\varepsilon}\left(1+\sqrt{2\log\frac{5}{\delta}}\right)\right)^2,$$
by [57, Lemma 1].
Now we have the following string of inequalities, each of which holds with probability at least $1-\delta/7$:
$$\begin{aligned}
\|\hat{G}v - Tv\|_{2,\mu} &\overset{(a)}{\le} \|\hat{G}v - Tv\|_{2,\hat{\mu}} + \varepsilon/7\\
&\overset{(b)}{\le} \|\hat{G}v - \tilde{T}_L v\|_{2,\hat{\mu}} + 2\varepsilon/7\\
&\overset{(c)}{\le} \|\hat{G}v - \hat{T}_{M,L}\, v\|_{2,\hat{\mu}} + 3\varepsilon/7\\
&\overset{(d)}{\le} \|\hat{f} - \hat{T}_{M,L}\, v\|_{2,\hat{\mu}} + 3\varepsilon/7\\
&\overset{(e)}{\le} \|f^* - \hat{T}_{M,L}\, v\|_{2,\hat{\mu}} + 4\varepsilon/7\\
&\overset{(f)}{\le} \|f^* - \tilde{T}_L v\|_{2,\hat{\mu}} + 5\varepsilon/7\\
&\overset{(g)}{\le} \|f^* - Tv\|_{2,\hat{\mu}} + 6\varepsilon/7\\
&\overset{(h)}{\le} \|f^* - Tv\|_{2,\mu} + \varepsilon\\
&\le \inf_{f\in\mathcal{F}(\Theta)} \|f - Tv\|_{2,\mu} + \varepsilon_0 + \varepsilon.
\end{aligned}$$
We choose $\eta_N$ and $l_N$ from inequality (5.41) so that inequalities (a) and (h) hold with probability at least $1-\delta/7$. Inequalities (b) and (g) follow from Lemma 44, together with a union bound over the state samples, which gives the bound on $L$; similarly, inequalities (c) and (f) follow from Lemma 45, together with a union bound over the state samples, which gives the bound on $M$. Inequality (d) follows from the fact that $\hat{G}v$ attains the least approximation error among all functions $\hat{f} \in \hat{\mathcal{F}}(\theta_{1:J})$. Inequality (e) is due to the choice of $J$. The last inequality is by the choice of $f^*$.

5.6.3 Proof of Lemma 57

Before proving Lemma 57, we examine how the error propagates through the iterations.

Lemma 58 ([4, Lemma 7]). For any $K \ge 1$ and $\varepsilon > 0$, suppose $\|\varepsilon_k\|_{1,\mu} \le \varepsilon$ for all $k = 0, 1, \ldots, K-1$. Then
$$\|Q^{\pi_K} - Q^*\|_{1,\mu} \le 2\left(\frac{1-\gamma^{K+1}}{(1-\gamma)^2}\right)\left[C_\mu \varepsilon + \gamma^{K/2}(2Q_{\max})\right]. \qquad (5.42)$$

Choice of $K^*$: Now, from (5.42), we have
$$\|Q^{\pi_K} - Q^*\|_{1,\mu} \le \frac{2}{(1-\gamma)^2}\left[C_\mu \varepsilon + \gamma^{K/2}(2Q_{\max})\right],$$
which gives a bound on $K$ such that $\gamma^{K/2}(2Q_{\max}) \le \varepsilon C_\mu$. Denote
$$K^* = \frac{\log(\varepsilon C_\mu) - \log(2Q_{\max})}{\log\gamma}.$$

Proof of Lemma 57. First, we show that $X_k \le_{st} Y_k$ holds for all $k \ge 0$. The stochastic dominance relation is the key to our analysis: if we can show that $Y_K$ is "small" with high probability, then $X_K$ must also be small, and we can infer that $\|Q^{\pi_K} - Q^*\|_{1,\mu}$ must be close to zero. By construction, $X_k \le_{st} Y_k$ for all $k \ge 0$ (see [24, Lemma A.1] and [24, Lemma A.2]). Next, we compute the steady state distribution of $\{Y_k\}_{k\ge 0}$ and its mixing time.
In particular, we choose $K$ so that the distribution of $Y_K$ is close to its steady state distribution. Since $\{Y_k\}_{k\ge 0}$ is an irreducible Markov chain on a finite state space, its steady state distribution $\mu = \{\mu(i)\}_{i=1}^{K^*}$ on $\mathcal{K}$ exists. By [24, Lemma 4.3], the steady state distribution of $\{Y_k\}_{k\ge 0}$ is $\mu = \{\mu(i)\}_{i=1}^{K^*}$ given by
$$\mu(1) = q^{K^*-1}, \qquad \mu(i) = (1-q)\,q^{K^*-i}\ \ \text{for } i = 2, \ldots, K^*-1, \qquad \mu(K^*) = 1-q.$$
The constant
$$\mu_{\min}(q; K^*) = \min\left\{q^{K^*-1},\ (1-q)\,q^{K^*-2},\ (1-q)\right\}$$
for $q \in (0,1)$ and $K^* \ge 1$ appears shortly in the Markov chain mixing time bound for $\{Y_k\}_{k\ge 0}$. We note that $(1-q)\,q^{K^*-1} \le \mu_{\min}(q; K^*)$ is a simple lower bound for $\mu_{\min}(q; K^*)$. Let $\mathcal{Q}_k$ be the marginal distribution of $Y_k$ for $k \ge 0$. By a Markov chain mixing time argument, we have
$$t_{\mathrm{mix}}(\delta') := \min\left\{k \ge 0 : \|\mathcal{Q}_k - \mu\|_{TV} \le \delta'\right\} \le \log\left(\frac{1}{\delta'\,\mu_{\min}(q;K^*)}\right) \le \log\left(\frac{1}{\delta'(1-q)\,q^{K^*-1}}\right)$$
for any $\delta' \in (0,1)$.

Finally, we conclude the argument by using the previous part to find the probability that $Y_K = 1$; this is a lower bound on the probability that $X_K = 1$, which in turn lower-bounds the probability that $\|Q^{\pi_K} - Q^*\|_{1,\mu}$ is below our desired error tolerance. For
$$K \ge \log\frac{1}{\delta'(1-q)\,q^{K^*-1}},$$
we have $|\Pr\{Y_K = 1\} - \mu(1)| \le 2\delta'$. Since $X_K \le_{st} Y_K$, we have $\Pr\{X_K = 1\} \ge \Pr\{Y_K = 1\}$, and so $\Pr\{X_K = 1\} \ge q^{K^*-1} - 2\delta'$. Choose $q$ and $\delta'$ to satisfy $q^{K^*-1} = 1/2 + \delta/2$ and $2\delta' = q^{K^*-1} - \delta = 1/2 - \delta/2$ to get $q^{K^*-1} - 2\delta' \ge \delta$, and the desired result follows.

Bibliography

[1] Aberdeen, D. (2006). Policy-gradient methods for planning. In Advances in Neural Information Processing Systems, pp. 9–16.

[2] Anthony, M. and P. L. Bartlett (2009). Neural Network Learning: Theoretical Foundations. Cambridge University Press.

[3] Antos, A., C. Szepesvári, and R. Munos (2006). Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. In International Conference on Computational Learning Theory, pp. 574–588. Springer.

[4] Antos, A., C.
Szepesvári, and R. Munos (2007). Value-iteration based fitted policy iteration: learning with a single trajectory. In Approximate Dynamic Programming and Reinforcement Learning (ADPRL 2007), IEEE International Symposium on, pp. 330–337. IEEE.

[5] Antos, A., C. Szepesvári, and R. Munos (2008). Fitted Q-iteration in continuous action-space MDPs. In Advances in Neural Information Processing Systems, pp. 9–16.

[6] Arapostathis, A., V. S. Borkar, E. Fernández-Gaucherand, M. K. Ghosh, and S. I. Marcus (1993). Discrete-time controlled Markov processes with average cost criterion: a survey. SIAM Journal on Control and Optimization 31(2), 282–344.

[7] Azar, M. G., R. Munos, M. Ghavamzadeh, and H. J. Kappen (2011). Speedy Q-learning.

[8] Berlinet, A. and C. Thomas-Agnan (2011). Reproducing Kernel Hilbert Spaces in Probability and Statistics. Springer Science & Business Media.

[9] Bertsekas, D. P. Dynamic Programming and Optimal Control, Volume 2.

[10] Bertsekas, D. P. (2005). Dynamic programming and suboptimal control: A survey from ADP to MPC.

[11] Bertsekas, D. P. (2010). Dynamic Programming and Optimal Control.

[12] Bertsekas, D. P. (2011). Dynamic Programming and Optimal Control, 3rd edition, Volume II. Belmont, MA: Athena Scientific.

[13] Bhat, N., V. Farias, and C. C. Moallemi (2012). Non-parametric approximate dynamic programming via the kernel method. In Advances in Neural Information Processing Systems, pp. 386–394.

[14] De Farias, D. P. and B. Van Roy (2003). The linear programming approach to approximate dynamic programming. Operations Research 51(6), 850–865.

[15] De Farias, D. P. and B. Van Roy (2004). On constraint sampling in the linear programming approach to approximate dynamic programming. Mathematics of Operations Research 29(3), 462–478.

[16] DeVore, R. A. (1998). Nonlinear approximation. Acta Numerica 7, 51–150.

[17] Devraj, A. M. and S. Meyn (2017). Zap Q-learning. In Advances in Neural Information Processing Systems, pp. 2235–2244.

[18] Devroye, L. (1978). The uniform convergence of nearest neighbor regression function estimators and their application in optimization. IEEE Transactions on Information Theory 24(2), 142–151.

[19] Devroye, L. and T. Wagner (1979). The L1 convergence of kernel density estimates. The Annals of Statistics, 1136–1139.

[20] Gallicchio, C. and S. Scardapane (2020, February). Deep randomized neural networks. arXiv e-prints, arXiv:2002.12287.

[21] Grunewalder, S., G. Lever, L. Baldassarre, M. Pontil, and A. Gretton (2012). Modelling transition dynamics in MDPs with RKHS embeddings. arXiv preprint arXiv:1206.4655.

[22] Gupta, A., R. Jain, and P. Glynn (2018, April). A fixed point theorem for iterative random contraction operators over Banach spaces. arXiv e-prints.

[23] Gupta, A., R. Jain, and P. W. Glynn (2015, December). An empirical algorithm for relative value iteration for average-cost MDPs. In 2015 54th IEEE Conference on Decision and Control (CDC), pp. 5079–5084.

[24] Haskell, W. B., R. Jain, and D. Kalathil (2016a). Empirical dynamic programming. Mathematics of Operations Research 41(2), 402–429.

[25] Haskell, W. B., R. Jain, and D. Kalathil (2016b). Empirical dynamic programming. Mathematics of Operations Research.

[26] Haskell, W. B., R. Jain, H. Sharma, and P. Yu (2017, September). An empirical dynamic programming algorithm for continuous MDPs. arXiv e-prints.

[27] Haskell, W. B., P. Yu, H. Sharma, and R. Jain (2017). Randomized function fitting-based empirical value iteration. In Decision and Control (CDC), 2017 IEEE 56th Annual Conference on, pp. 2467–2472. IEEE.

[28] Haussler, D. (1992). Decision theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation 100(1), 78–150.

[29] Haussler, D. (1995). Sphere packing numbers for subsets of the Boolean n-cube with bounded Vapnik-Chervonenkis dimension. Journal of Combinatorial Theory, Series A 69(2), 217–232.

[30] Henderson, P., R. Islam, P. Bachman, J. Pineau, D. Precup, and D. Meger (2018). Deep reinforcement learning that matters. In Thirty-Second AAAI Conference on Artificial Intelligence.

[31] Hernández-Lerma, O. (2012). Adaptive Markov Control Processes, Volume 79. Springer Science & Business Media.

[32] Hwangbo, J., J. Lee, A. Dosovitskiy, D. Bellicoso, V. Tsounis, V. Koltun, and M. Hutter (2019). Learning agile and dynamic motor skills for legged robots. Science Robotics 4(26).

[33] Islam, R., P. Henderson, M. Gomrokchi, and D. Precup (2017). Reproducibility of benchmarked deep reinforcement learning tasks for continuous control. arXiv preprint arXiv:1708.04133.

[34] Jain, R. and P. Varaiya (2010). Simulation-based optimization of Markov decision processes: An empirical process theory approach. Automatica 46(8), 1297–1304.

[35] Kakade, S. M. (2002). A natural policy gradient. In Advances in Neural Information Processing Systems, pp. 1531–1538.

[36] Konda, V. R. and J. N. Tsitsiklis (2000). Actor-critic algorithms. In Advances in Neural Information Processing Systems, pp. 1008–1014.

[37] Levin, D. A. and Y. Peres (2017). Markov Chains and Mixing Times, Volume 107. American Mathematical Society.

[38] Levin, D. A., Y. Peres, and E. L. Wilmer (2008). Markov Chains and Mixing Times. American Mathematical Society.

[39] Levine, N., T. Zahavy, D. J. Mankowitz, A. Tamar, and S. Mannor (2017). Shallow updates for deep reinforcement learning. In Advances in Neural Information Processing Systems, pp. 3135–3145.

[40] Lillicrap, T. P., J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra (2015). Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.

[41] Mania, H., A. Guy, and B. Recht (2018). Simple random search provides a competitive approach to reinforcement learning. arXiv preprint arXiv:1803.07055.

[42] McDonald, D., C. Shalizi, and M. Schervish (2011). Estimating beta-mixing coefficients. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 516–524.

[43] Meir, R. (2000). Nonparametric time series prediction through adaptive model selection. Machine Learning 39(1), 5–34.

[44] Mnih, V., K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller (2013). Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.

[45] Mnih, V., K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. (2015). Human-level control through deep reinforcement learning. Nature 518(7540), 529.

[46] Munos, R. (2003). Error bounds for approximate policy iteration. In ICML, Volume 3, pp. 560–567.

[47] Munos, R. (2007). Performance bounds in Lp-norm for approximate value iteration. SIAM Journal on Control and Optimization 46(2), 541–561.

[48] Munos, R. and C. Szepesvári (2008). Finite-time bounds for fitted value iteration. The Journal of Machine Learning Research 9, 815–857.

[49] Nummelin, E. and P. Tuominen (1982). Geometric ergodicity of Harris recurrent Markov chains with applications to renewal theory. Stochastic Processes and Their Applications 12(2), 187–202.

[50] Ormoneit, D. and Ś. Sen (2002). Kernel-based reinforcement learning. Machine Learning 49(2-3), 161–178.

[51] Peters, J. and J. A. Bagnell (2016). Policy gradient methods. Encyclopedia of Machine Learning and Data Mining, 1–4.

[52] Powell, W. B. (2007a). Approximate Dynamic Programming: Solving the Curses of Dimensionality, Volume 703. John Wiley & Sons.

[53] Powell, W. B. (2007b). Approximate Dynamic Programming: Solving the Curses of Dimensionality (Wiley Series in Probability and Statistics). Wiley-Interscience.

[54] Puterman, M. L. (2005). Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons.

[55] Puterman, M. L. (2014). Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons.

[56] Rahimi, A. and B. Recht (2008). Uniform approximation of functions with random bases. In Communication, Control, and Computing, 2008 46th Annual Allerton Conference on, pp. 555–561. IEEE.

[57] Rahimi, A. and B. Recht (2009). Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning. In Advances in Neural Information Processing Systems, pp. 1313–1320.

[58] Rust, J. (1997). Using randomization to break the curse of dimensionality. Econometrica: Journal of the Econometric Society, 487–516.

[59] Schulman, J., S. Levine, P. Abbeel, M. Jordan, and P. Moritz (2015). Trust region policy optimization. In International Conference on Machine Learning, pp. 1889–1897.

[60] Schulman, J., F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.

[61] Shaked, M. and J. G. Shanthikumar (2007). Stochastic Orders. Springer.

[62] Smale, S. and D.-X. Zhou (2005). Shannon sampling II: Connections to learning theory. Applied and Computational Harmonic Analysis 19(3), 285–302.

[63] Sutton, R. S. and A. G. Barto (1998). Reinforcement Learning: An Introduction, Volume 1. MIT Press, Cambridge.

[64] Sutton, R. S. and A. G. Barto (2018). Reinforcement Learning: An Introduction. MIT Press.

[65] Sutton, R. S., D. A. McAllester, S. P. Singh, and Y. Mansour (2000). Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, pp. 1057–1063.

[66] Szepesvári, C. (2001). Efficient approximate planning in continuous space Markovian decision problems. AI Communications 14(3), 163–176.

[67] Vidyasagar, M. (2002). A Theory of Learning and Generalization. Springer-Verlag.

[68] Yang, T., Y.-F. Li, M. Mahdavi, R. Jin, and Z.-H. Zhou (2012). Nyström method vs random Fourier features: A theoretical and empirical comparison. In Advances in Neural Information Processing Systems, pp. 476–484.

[69] Yu, B. (1994). Rates of convergence for empirical processes of stationary mixing sequences. The Annals of Probability, 94–116.

[70] Zabinsky, Z. B. and R. L. Smith (1992). Pure adaptive search in global optimization. Mathematical Programming 53(1-3), 323–338.
Asset Metadata
Creator: Sharma, Hiteshi (author)
Core Title: Reinforcement learning with generative model for non-parametric MDPs
School: Viterbi School of Engineering
Degree: Doctor of Philosophy
Degree Program: Electrical Engineering
Publication Date: 08/05/2020
Defense Date: 11/19/2019
Publisher: University of Southern California (original), University of Southern California. Libraries (digital)
Tag: machine learning, OAI-PMH Harvest, reinforcement learning
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: Jain, Rahul (committee chair), Luo, Haipeng (committee member), Soltanolkotabi, Mahdi (committee member)
Creator Email: hi.hiteshi@gmail.com, hiteshis@usc.edu
Permanent Link (DOI): https://doi.org/10.25549/usctheses-c89-360159
Unique Identifier: UC11666310
Identifier: etd-SharmaHite-8887.pdf (filename), usctheses-c89-360159 (legacy record id)
Legacy Identifier: etd-SharmaHite-8887.pdf
Dmrecord: 360159
Document Type: Dissertation
Rights: Sharma, Hiteshi
Type: texts
Source: University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA