I. ASYNCHRONOUS OPTIMIZATION OVER WEAKLY COUPLED RENEWAL SYSTEMS

by Xiaohan Wei

Presented to the FACULTY OF THE USC GRADUATE SCHOOL, UNIVERSITY OF SOUTHERN CALIFORNIA, in partial fulfillment of the requirements for the degree DOCTOR OF PHILOSOPHY (ELECTRICAL ENGINEERING)

December 2019

Copyright 2019 Xiaohan Wei

Approved by:
Professor Michael Neely, Committee Chair, Department of Electrical Engineering, University of Southern California
Professor Stanislav Minsker, Committee Chair, Department of Mathematics, University of Southern California
Professor Larry Goldstein, Department of Mathematics, University of Southern California
Professor Mihailo Jovanovic, Department of Electrical Engineering, University of Southern California
Professor Ashutosh Nayyar, Department of Electrical Engineering, University of Southern California

Dedication

To my parents and my wife, Yuhong, who supported me both mentally and financially over the years.

Acknowledgements

First, I would like to thank my advisor, Professor Michael J. Neely, for guiding me throughout the PhD journey since Summer 2013. He is a man of accuracy and rigor, always passionate about discussing concrete research problems and willing to roll up his sleeves and grind through technical details with me. His way of approaching research topics has influenced me significantly. Rather than blindly following existing works and doing incremental work when trying to enter a new area, I learned to ask fundamental mathematical questions, to make connections to the tools and theories we are already familiar with, and not to be afraid of getting my hands dirty. His blazing new ideas were my morale boost when groping in the dark.

Next, I would like to thank Professor Stanislav Minsker, my advisor on high-dimensional statistics research. I got to know him during the Math-547 statistical learning course in Fall 2015. Though not much more senior than I am, he is already extremely knowledgeable in the area of statistical learning and has been widely recognized for his work on robust high-dimensional statistics. He is a quick thinker and can always point out meaningful new directions hiding rather deeply, which eventually led to high-quality publications. I would not have published any paper in this area had I never met him. Along the way, he also taught me how to present my work and helped me practice my seminar talks, which led to impressive presentations and Ming-Hsieh scholarships.

Also, I would like to thank Professor Larry Goldstein, whom I met during a small paper-reading group in Spring 2016. He is an expert on Stein's method and, as a senior professor, surprisingly accessible to PhD students and active in various research areas. Together with Prof. Minsker, we had quite a few fruitful discussions and made some nice progress on robust statistics.

I would also like to thank Professors Mihailo Jovanovic and Ashutosh Nayyar for discussing research problems with me and sitting on my qualifying exam committee. I appreciate their valuable comments and suggestions.

Moreover, I thank my senior lab mates Hao Yu and Sucha Supittayapornpong, who were always available to discuss problems with me and come up with new research ideas. Also, Ruda Zhang, Lang Wang, and Jie Ruan studied various math courses and interesting math problems with me and helped me clear hurdles at different stages, for which I am truly grateful.
Special thanks to Professor Qing Ling, who was my undergraduate advisor and continues to influence me in various aspects of my academic career. Last but not least, I would like to take this chance to express my gratitude to everyone who contributed at various stages of my research. In particular, I thank Zhuoran Yang, for lighting up new areas and expanding my research horizons; Dongsheng Ding, who brings ideas from a control perspective and is always eager to try out research ideas with me; Sheng Chen, for sharing his perspective on robust LASSO problems; Professor Jason D. Lee, for working on the geometric median problem with me; and Jianshu Chen from Tencent AI, who introduced me to the area of reinforcement learning.

Table of Contents

Dedication
Acknowledgements
Abstract
1 Introduction to Renewal Systems
  1.1 Optimization over a single renewal system: A review
    1.1.1 Optimization over i.i.d. actions
    1.1.2 Ergodic Markov decision process (MDP): An example
    1.1.3 The Drift-plus-penalty (DPP) ratio algorithm
    1.1.4 A (somewhat) simple illustrative performance analysis
  1.2 The coupled renewal systems
  1.3 Example Applications and previous works
    1.3.1 Multi-server energy-aware scheduling
    1.3.2 Coupled ergodic MDPs
    1.3.3 Why this problem is difficult
    1.3.4 Other works related to renewal and asynchronous optimization
  1.4 Outline and our contributions
2 Asynchronous Optimization over Weakly Coupled Renewal Systems
  2.1 Technical preliminaries
  2.2 Algorithm
    2.2.1 Proposed algorithm
    2.2.2 Computing subproblems
  2.3 Limiting Performance
    2.3.1 Convexity, performance region and other properties
    2.3.2 Main result and near optimality analysis
    2.3.3 Key-feature inequality and supermartingale construction
    2.3.4 Synchronization lemma
  2.4 Convergence Time Analysis
    2.4.1 Lagrange Multipliers
    2.4.2 Convergence time theorem
  2.5 Simulation Study in Energy-aware Scheduling
  2.6 Additional lemmas and proofs
    2.6.1 Proof of Lemma 2.3.1
    2.6.2 Proof of Lemma 2.3.3
    2.6.3 Proof of Lemma 2.3.5
    2.6.4 Proof of Lemma 2.3.7
    2.6.5 Proof of Lemma 2.3.2
3 Data Center Server Provision via Theory of Coupled Renewal Systems
  3.1 System model and problem formulation
    3.1.1 Related works
    3.1.2 Front-end load balancing
    3.1.3 Server model
    3.1.4 Performance Objective
  3.2 Coupled renewal optimization
    3.2.1 Prelude: The original intuition
    3.2.2 Coupled renewal optimization
    3.2.3 Solving (3.8) and (3.9)
    3.2.4 The proposed online control algorithm
  3.3 Probability 1 Performance Analysis of Algorithm 4
    3.3.1 Bounded request queues
    3.3.2 Optimal randomized stationary policy
    3.3.3 Key features of thresholding algorithm
    3.3.4 Bounded average of supermartingale difference sequences
    3.3.5 Near optimal time average cost
  3.4 Delay improvement via virtualization
    3.4.1 Delay improvement
    3.4.2 Performance guarantee
  3.5 Simulation
    3.5.1 Near optimality in N queues system
    3.5.2 Real data center traffic trace and performance evaluation
  3.6 Additional lemmas and proofs
4 Power Aware Wireless File Downloading and Restless Bandit via Renewal Optimization
  4.1 System model and problem formulation
  4.2 Single user scenario
    4.2.1 The memoryless file size assumption
    4.2.2 DPP ratio optimization
    4.2.3 Average power constraints via queue bounds
    4.2.4 Optimality over randomized algorithms
    4.2.5 Key feature of the drift-plus-penalty ratio
    4.2.6 Performance theorem
  4.3 Multi-user file downloading
    4.3.1 DPP ratio indexing algorithm
    4.3.2 Theoretical performance analysis
  4.4 Multi-user optimality in a special case
    4.4.1 A system with N single-buffer queues
    4.4.2 Optimality of the indexing algorithm
    4.4.3 Preliminaries on stochastic coupling
    4.4.4 Stochastic ordering of buffer state process
    4.4.5 Extending to non-work-conserving policies
  4.5 Simulation experiments
    4.5.1 DPP ratio indexing with geometric file length
    4.5.2 DPP ratio indexing with non-memoryless file lengths
  4.6 Additional lemmas and proofs
    4.6.1 Comparison of Max-λ and Min-λ
    4.6.2 Proof of Lemma 4.4.2
5 Opportunistic Scheduling over Renewal Systems
  5.1 Introduction
    5.1.1 Example applications
    5.1.2 Previous approaches on renewal systems
    5.1.3 Other related works
    5.1.4 Our contributions
  5.2 Problem Formulation and Preliminaries
    5.2.1 Assumptions
  5.3 An Online Algorithm
  5.4 Feasibility Analysis
    5.4.1 The drift-plus-penalty bound
    5.4.2 Bounds on the virtual queue process and feasibility
  5.5 Optimality Analysis
    5.5.1 Relation between $\hat{\theta}[n]$ and $\theta[n]$
    5.5.2 Towards near optimality (I): Truncation
    5.5.3 Towards near optimality (II): Exponential supermartingale
    5.5.4 An asymptotic upper bound on $\theta[n]$
    5.5.5 Finishing the proof of near optimality
  5.6 Simulation experiments
  5.7 Additional proofs
  5.8 Computation of Asymptotics
6 Online Learning in Weakly Coupled Markov Decision Processes
  6.1 Problem formulation and related works
  6.2 Preliminaries
    6.2.1 Basic Definitions
    6.2.2 Technical assumptions
    6.2.3 The state-action polyhedron
    6.2.4 Preliminary results on MDPs
    6.2.5 The blessing of slow-update property in online MDPs
  6.3 OCMDP algorithm
    6.3.1 Intuition of the algorithm and roadmap of analysis
  6.4 Convergence time analysis
    6.4.1 Stationary state performance: An online linear program
    6.4.2 Markov analysis
  6.5 A more general regret bound against policies with arbitrary starting state
  6.6 Additional lemmas and proofs
    6.6.1 Missing proofs in Section 6.2.4
    6.6.2 Missing proofs in Section 6.4.1
    6.6.3 Missing proofs in Section 6.5
Bibliography

Abstract

A renewal system divides the slotted timeline into back-to-back time periods called "renewal frames". At the beginning of each frame, it chooses a policy from a set of options for that frame. The policy determines the duration of the frame, the penalty incurred during the frame (such as energy expenditure), and a vector of performance metrics (such as the instantaneous number of jobs served). The starting points of this line of research are Chapter 7 of the book [Nee10a], the seminal work [Nee13a], and Chapter 5 of the PhD thesis of Chih-ping Li [Li11], who graduated before I came to USC. These works consider stochastic optimization over a single renewal system. By way of contrast, this thesis considers optimization over multiple parallel renewal systems, which is computationally more challenging and admits many more applications. The goal is to minimize the time average overall penalty subject to time average overall constraints on the corresponding performance metrics. The main difficulty, which is not present in earlier works, is that these systems act asynchronously: the renewal frames of different renewal systems are not aligned. The goal of the thesis is to resolve this difficulty head-on via a new asynchronous algorithm and a novel supermartingale stopping-time analysis, which shows that our algorithms not only converge to the optimal solution but also enjoy fast convergence rates. Based on this general theory, we further develop novel algorithms for data center server provision problems with performance guarantees, as well as new heuristics for multi-user file downloading problems.

We start by reviewing existing works on optimization over a single renewal system in Chapter 1. Then, in Chapter 2, we propose a new algorithm for asynchronous renewal optimization in which each system makes its own decision after observing a global multiplier that is updated every slot. We show that this algorithm satisfies the desired constraints and achieves $O(\varepsilon)$ near optimality with $O(1/\varepsilon^2)$ convergence time. Based on the new algorithm, we formulate the data center server provision problem as an asynchronous renewal optimization in Chapter 3 and develop a corresponding algorithm which outperforms the state of the art. In Chapter 4, we look at another application, namely multi-user file downloading, which can be formulated as a constrained multi-armed bandit problem. We show that our proposed algorithm leads to a useful heuristic that approximately solves the problem with experimentally near optimal performance.
In Chapter 5, we consider constrained optimization over a renewal system with random events observed at the beginning of each renewal frame. We propose an online algorithm which does not need knowledge of the distributions of the random events. We prove that this algorithm is feasible and achieves $O(\varepsilon)$ near optimality by constructing an exponential supermartingale. Simulation experiments demonstrate the near optimal performance of the proposed algorithm.

Finally, in Chapter 6, we consider online learning over weakly coupled Markov decision processes. We develop a new distributed online algorithm where each MDP makes its own decision each slot after observing a multiplier computed from past information. While the scenario is significantly more challenging than the classical online learning context, the algorithm is shown to have tight $O(\sqrt{T})$ regret and constraint violations simultaneously over a time horizon $T$.

Chapter 1

Introduction to Renewal Systems

1.1 Optimization over a single renewal system: A review

[Figure 1.1: The sample timeline of a renewal system.]

Renewal systems are generalizations of the renewal processes studied in probability and random processes courses. Parallel to Markov decision processes versus Markov chains, renewal systems are controlled renewal processes. Since this is not a widely used term, to set the tone of the thesis, we start with a review of optimization over a single renewal system.

Consider a dynamical system operating over a discrete slotted timeline $t\in\{0,1,2,\ldots\}$. The timeline is segmented into back-to-back intervals of time slots called renewal frames. The start of each renewal frame for a system is called a renewal time, or simply a renewal, for that system. The duration of each renewal frame is a random positive integer whose distribution depends on a control action chosen at the start of the frame. We use $k=0,1,2,\cdots$ to index the renewals. Let $t_k$ be the time slot corresponding to the $k$-th renewal, with the convention that $t_0=0$. Let $\mathcal{T}_k$ be the set of all slots from $t_k$ to $t_{k+1}-1$. See Fig. 1.1 for a graphical illustration. At time $t_k$, the decision maker chooses a possibly random decision $\alpha_k$ in a set $\mathcal{A}$. This action determines the distributions of the following random variables:

• The duration of the $k$-th renewal frame $T_k := t_{k+1}-t_k$, which is a positive integer.
• A vector of performance metrics at each slot of that frame $\mathbf{z}[t] := (z_1[t], z_2[t], \cdots, z_L[t])$, $t\in\mathcal{T}_k$, where $L$ is a fixed positive integer.
• A penalty incurred at each slot of the frame $y[t]$, $t\in\mathcal{T}_k$.

In the special case where $T_k=1,\ \forall k$, this reduces to the classical slotted stochastic system, which has been relatively well understood. Let $\mathcal{F}_k$ be the system history up to $t_k-1$, which includes $\{y[t]\}_{t=0}^{t_k-1}$, $\{\mathbf{z}[t]\}_{t=0}^{t_k-1}$, and $\{T_j\}_{j=0}^{k-1}$. The key property we rely on is as follows.

Definition 1.1.1 (Renewal property). A system is said to satisfy the renewal property if the random $T_k$, $\mathbf{z}[t]$, and $y[t]$, $t\in\mathcal{T}_k$, are conditionally independent of the history $\mathcal{F}_k$ given $\alpha_k=\alpha\in\mathcal{A}$.

The goal is to minimize the time average penalty subject to $L$ time average constraints on the performance metrics, i.e., we aim to solve the following optimization problem:

$$\min\ \limsup_{T\to\infty} \frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}(y[t]) \qquad (1.1)$$

$$\text{s.t.}\ \limsup_{T\to\infty} \frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}(z_l[t]) \le d_l,\quad l\in\{1,2,\cdots,L\}, \qquad (1.2)$$

where $\{d_l\}_{l=1}^L$ are known constants. Let

$$y(\alpha_k) := \sum_{t\in\mathcal{T}_k} y[t],\qquad z_l(\alpha_k) := \sum_{t\in\mathcal{T}_k} z_l[t],\qquad T(\alpha_k) := T_k$$

be realizations during the $k$-th frame using an action $\alpha_k$.
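One identity is worth keeping in mind (our addition; it motivates the reformulation that follows): evaluating the time average in (1.1) at a frame boundary turns it into a ratio of frame sums. With $t_K = \sum_{k=0}^{K-1} T_k$,

$$\frac{1}{t_K}\sum_{t=0}^{t_K-1} y[t] \;=\; \frac{\sum_{k=0}^{K-1}\sum_{t\in\mathcal{T}_k} y[t]}{\sum_{k=0}^{K-1} T_k} \;=\; \frac{\sum_{k=0}^{K-1} y(\alpha_k)}{\sum_{k=0}^{K-1} T(\alpha_k)},$$

which is the sample-path analogue of the fractional program introduced next.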
Under mild technical conditions (e.g., existence of second moments; see Section 2.1 for details), the problem (1.1)-(1.2) can also be written in fractional program form:

$$\min\ \limsup_{K\to\infty} \frac{\mathbb{E}\left(\sum_{k=0}^{K-1} y(\alpha_k)\right)}{\mathbb{E}\left(\sum_{k=0}^{K-1} T(\alpha_k)\right)} \qquad (1.3)$$

$$\text{s.t.}\ \limsup_{K\to\infty} \frac{\mathbb{E}\left(\sum_{k=0}^{K-1} z_l(\alpha_k)\right)}{\mathbb{E}\left(\sum_{k=0}^{K-1} T(\alpha_k)\right)} \le d_l,\quad l\in\{1,2,\cdots,L\}, \qquad (1.4)$$

$$\alpha_k\in\mathcal{A},\ \forall k.$$

1.1.1 Optimization over i.i.d. actions

Suppose the system adopts an i.i.d. sequence of random actions $\{\alpha_k^*\}_{k=0}^{\infty}$, where the decision $\alpha_k^*\in\mathcal{A}$ made on frame $k$ is independent of the past. Then, by the renewal property, it is easy to see that $\{y(\alpha_k^*), \mathbf{z}(\alpha_k^*), T(\alpha_k^*)\}$ are i.i.d. random variables. We have

$$\limsup_{K\to\infty} \frac{\mathbb{E}\left(\sum_{k=0}^{K-1} y(\alpha_k^*)\right)}{\mathbb{E}\left(\sum_{k=0}^{K-1} T(\alpha_k^*)\right)} = \frac{\lim_{K\to\infty}\frac{1}{K}\mathbb{E}\left(\sum_{k=0}^{K-1} y(\alpha_k^*)\right)}{\lim_{K\to\infty}\frac{1}{K}\mathbb{E}\left(\sum_{k=0}^{K-1} T(\alpha_k^*)\right)} = \frac{\mathbb{E}(y(\alpha_k^*))}{\mathbb{E}(T(\alpha_k^*))},$$

$$\limsup_{K\to\infty} \frac{\mathbb{E}\left(\sum_{k=0}^{K-1} z_l(\alpha_k^*)\right)}{\mathbb{E}\left(\sum_{k=0}^{K-1} T(\alpha_k^*)\right)} = \frac{\lim_{K\to\infty}\frac{1}{K}\mathbb{E}\left(\sum_{k=0}^{K-1} z_l(\alpha_k^*)\right)}{\lim_{K\to\infty}\frac{1}{K}\mathbb{E}\left(\sum_{k=0}^{K-1} T(\alpha_k^*)\right)} = \frac{\mathbb{E}(z_l(\alpha_k^*))}{\mathbb{E}(T(\alpha_k^*))}.$$

As a consequence, if we consider solving (1.3)-(1.4) over the set of i.i.d. random actions, then it suffices to solve:

$$\min\ \frac{\mathbb{E}(y(\alpha_k^*))}{\mathbb{E}(T(\alpha_k^*))} \qquad (1.5)$$

$$\text{s.t.}\ \frac{\mathbb{E}(z_l(\alpha_k^*))}{\mathbb{E}(T(\alpha_k^*))} \le d_l,\quad l\in\{1,2,\cdots,L\}. \qquad (1.6)$$

Assumption 1.1.1. The problem (1.5)-(1.6) is feasible, i.e., there exists $\alpha_k^*$ such that the constraints (1.6) are satisfied. Furthermore, we assume the set of all feasible performance vectors $\left(\frac{\mathbb{E}(y(\alpha_k^*))}{\mathbb{E}(T(\alpha_k^*))}, \frac{\mathbb{E}(\mathbf{z}(\alpha_k^*))}{\mathbb{E}(T(\alpha_k^*))}\right)$ over all i.i.d. actions $\alpha_k^*$ is compact.

The compactness assumption is adopted so that there exists at least one i.i.d. action which solves (1.5)-(1.6). In fact, one can show that under proper technical conditions the minimum achieved by (1.3)-(1.4) is the same as that of (1.5)-(1.6) (see, for example, Lemma 2.3.2 in Chapter 2).
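To make the ratio-of-expectations limit concrete, the following minimal Monte Carlo sketch (our addition, not part of the original text) simulates a toy single renewal system with two actions, each giving a geometric frame length and a constant per-slot penalty; all names and numbers are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy action set: action a gives frame length T ~ Geometric(p_a) and a
# constant per-slot penalty y_a.  (Illustrative numbers, not from the thesis.)
actions = [(0.5, 2.0), (0.25, 1.0)]   # (p_a, y_a)

def time_average_penalty(prob_action0, num_frames=100000):
    """Simulate i.i.d. frame decisions and return the time average penalty."""
    total_penalty = total_slots = 0.0
    for _ in range(num_frames):
        p, y = actions[0] if rng.random() < prob_action0 else actions[1]
        T = rng.geometric(p)              # frame length, a positive integer
        total_penalty += y * T            # penalty y accrues on every slot of the frame
        total_slots += T
    return total_penalty / total_slots

# Renewal reward prediction: E[y(a*)] / E[T(a*)] for the mixture.
q = 0.3
num = q * 2.0 / 0.5 + (1 - q) * 1.0 / 0.25   # E[y(a*)] = sum_a Pr(a) * y_a / p_a
den = q / 0.5 + (1 - q) / 0.25               # E[T(a*)] = sum_a Pr(a) / p_a
print(time_average_penalty(q), num / den)    # the two numbers nearly agree
```

The simulated time average converges to the ratio of per-frame expectations, not to the average of per-frame ratios, which is exactly why (1.3)-(1.6) take the fractional form.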
1.1.2 Ergodic Markov decision process (MDP): An example

As one of the main motivations for this line of research, in this section we show that the well-known MDP is a special case of the renewal system. Consider a discrete time MDP over an infinite horizon. It consists of a finite state space $\mathcal{S}$ and an action space $\mathcal{U}$ at each state $s\in\mathcal{S}$. (To simplify the notation, we assume each state has the same action space; all our analysis generalizes trivially to states with different action spaces.) For each state $s\in\mathcal{S}$, we use $P_u(s,s')$ to denote the transition probability from $s\in\mathcal{S}$ to $s'\in\mathcal{S}$ when taking action $u\in\mathcal{U}$, i.e.,

$$P_u(s,s') = Pr(s[t+1]=s' \mid s[t]=s,\ u[t]=u),$$

where $s[t]$ and $u[t]$ are the state and action at time slot $t$. At time slot $t$, after observing the state $s[t]\in\mathcal{S}$ and choosing the action $u[t]\in\mathcal{U}$, the MDP receives a penalty $y(u[t],s[t])$ and $L$ types of resource costs $z_1(u[t],s[t]),\cdots,z_L(u[t],s[t])$, where these functions are all bounded mappings from $\mathcal{S}\times\mathcal{U}$ to $\mathbb{R}$. For simplicity, we write $y[t] = y(u[t],s[t])$ and $z_l[t] = z_l(u[t],s[t])$. The goal is to minimize the time average penalty with constraints on the time average overall costs. This problem can be written in the form (1.1)-(1.2).

In order to define the renewal frame, we need one more assumption on the MDP. We assume the MDP is ergodic, i.e., there exists a state which is recurrent and the corresponding Markov chain is aperiodic under any randomized stationary policy, with bounded expected recurrence time. (A randomized stationary policy $\pi$ is an algorithm which chooses actions at state $s\in\mathcal{S}$ according to a fixed conditional distribution $\pi(u\mid s)$, $u\in\mathcal{U}$, independently of all other past information, i.e., $Pr(u[t]\mid\mathcal{F}_t) = \pi(u[t]\mid s[t])$, where $\mathcal{F}_t$ is the past information up to time $t$.) Under this assumption, the renewals for the MDP can be defined as successive revisitations to the recurrent state, and the action set $\mathcal{A}$ in such a scenario is defined as the set of all randomized stationary policies that can be implemented in one renewal frame. Thus, our renewal system formulation includes ergodic MDPs. We refer to [Alt99a], [Ber01], and [Ros02] for more details on MDP theory and related topics. We also refer readers to Chapter 5 for more MDP-specific algorithms and analysis.

1.1.3 The Drift-plus-penalty (DPP) ratio algorithm

In this section, we introduce the classical DPP ratio algorithm for solving (1.3)-(1.4) ([Nee10a], [Nee13a]). It is a frame-based algorithm which updates parameters at the beginning of each frame. We start by defining the "virtual queues" $Q_l[k]$ for each constraint, with $Q_l[0]=0$ and

$$Q_l[k+1] = \max\{Q_l[k] + z_l(\alpha_k) - d_l T(\alpha_k),\ 0\},$$

which is updated per frame. Let $\mathbf{Q}[k]$ be the vector of virtual queues. Define the drift as

$$\Delta[k] := \frac12\left(\|\mathbf{Q}[k+1]\|_2^2 - \|\mathbf{Q}[k]\|_2^2\right).$$

Let $\mathcal{F}_k$ be the system history up to $t_k-1$, which includes $\{y(\alpha_j)\}_{j=0}^{k-1}$, $\{\mathbf{z}(\alpha_j)\}_{j=0}^{k-1}$, and $\{T_j\}_{j=0}^{k-1}$. Assuming that the second moment of $z_l(\alpha_k)-d_l T(\alpha_k)$ exists, there is a constant $B$ such that

$$B \ge \frac12\sum_{l=1}^L \mathbb{E}\left( \left(z_l(\alpha_k) - d_l T(\alpha_k)\right)^2 \,\middle|\, \mathcal{F}_k \right),$$

and it is easy to show that

$$\mathbb{E}(\Delta[k] \mid \mathcal{F}_k) \le B + \sum_{l=1}^L Q_l[k]\,\mathbb{E}(z_l(\alpha_k) - d_l T(\alpha_k) \mid \mathcal{F}_k).$$

We define the DPP expression as $\Delta[k] + V y(\alpha_k)$, where $V>0$ is a trade-off parameter. It has the following bound:

$$\mathbb{E}(\Delta[k] + V y(\alpha_k) \mid \mathcal{F}_k) \le B + \sum_{l=1}^L Q_l[k]\,\mathbb{E}(z_l(\alpha_k) - d_l T(\alpha_k)\mid\mathcal{F}_k) + V\,\mathbb{E}(y(\alpha_k)\mid\mathcal{F}_k) \qquad (1.7)$$

$$= B + \mathbb{E}(T(\alpha_k)\mid\mathcal{F}_k)\cdot\underbrace{\frac{V\,\mathbb{E}(y(\alpha_k)\mid\mathcal{F}_k) + \sum_{l=1}^L Q_l[k]\,\mathbb{E}(z_l(\alpha_k) - d_l T(\alpha_k)\mid\mathcal{F}_k)}{\mathbb{E}(T(\alpha_k)\mid\mathcal{F}_k)}}_{\text{minimize this}}. \qquad (1.8)$$

The algorithm (Algorithm 1) aims at minimizing the ratio on the right-hand side.

Algorithm 1 (DPP ratio algorithm). Fix a trade-off parameter $V>0$.

• At the beginning of each frame, take an action $\alpha_k$ that minimizes the ratio
$$\frac{V\,\mathbb{E}(y(\alpha_k)\mid\mathcal{F}_k) + \sum_{l=1}^L Q_l[k]\,\mathbb{E}(z_l(\alpha_k)\mid\mathcal{F}_k)}{\mathbb{E}(T(\alpha_k)\mid\mathcal{F}_k)}. \qquad (1.9)$$

• Update the virtual queues via
$$Q_l[k+1] = \max\{Q_l[k] + z_l(\alpha_k) - d_l T(\alpha_k),\ 0\}. \qquad (1.10)$$

Note that, due to the renewal property of the system, minimizing the ratio (1.9) is the same as minimizing the ratio
$$\frac{V\,\mathbb{E}(y(\alpha_k)) + \sum_{l=1}^L Q_l[k]\,\mathbb{E}(z_l(\alpha_k))}{\mathbb{E}(T(\alpha_k))}.$$
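For concreteness, here is a minimal sketch of Algorithm 1 for a finite action set with known per-frame expectations (our addition; the interface and variable names are illustrative assumptions, and realized frame outcomes are replaced by their conditional expectations for brevity).

```python
import numpy as np

def dpp_ratio(actions, d, V, num_frames):
    """Minimal sketch of Algorithm 1 over a finite action set.

    Each entry of `actions` is (y_bar, z_bar, T_bar): the known per-frame
    expectations E[y(a)], E[z(a)] (a length-L array), and E[T(a)].  `d` is
    the length-L constraint vector.  These names are illustrative.
    """
    d = np.asarray(d, dtype=float)
    Q = np.zeros(len(d))                       # virtual queues Q[k], Q[0] = 0
    for _ in range(num_frames):
        # Choose alpha_k minimizing the ratio (1.9) by enumeration.
        i = int(np.argmin([(V * y + Q @ np.asarray(z)) / T
                           for (y, z, T) in actions]))
        y, z, T = actions[i]
        # Queue update (1.10); a real run would use the realized frame
        # outcomes, here their conditional expectations stand in for them.
        Q = np.maximum(Q + np.asarray(z) - d * T, 0.0)
    return Q
```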
1.1.4 A (somewhat) simple illustrative performance analysis

The performance of this algorithm has been established in a number of works ([Nee10a, Nee13a]). We reproduce the proof here, but from a somewhat different perspective than previous works, since it is more illustrative for our purposes and serves as the foundation of our new analysis later. The key step, which is used repeatedly throughout the thesis, is as follows. Since the proposed algorithm minimizes (1.9), it must satisfy

$$\frac{V\,\mathbb{E}(y(\alpha_k)\mid\mathcal{F}_k) + \sum_{l=1}^L Q_l[k]\,\mathbb{E}(z_l(\alpha_k)\mid\mathcal{F}_k)}{\mathbb{E}(T(\alpha_k)\mid\mathcal{F}_k)} \le \frac{V\,\mathbb{E}(y(\alpha_k^*)) + \sum_{l=1}^L Q_l[k]\,\mathbb{E}(z_l(\alpha_k^*))}{\mathbb{E}(T(\alpha_k^*))} \qquad (1.11)$$

for any i.i.d. decisions $\alpha_k^*$, where we use the fact that $\alpha_k^*$ is independent of the history $\mathcal{F}_k$, so the conditioning on the right-hand side can be omitted. In particular, we can choose $\alpha_k^*$ to be the solution to (1.5)-(1.6), and let $[f^*, \mathbf{g}^*] = [\mathbb{E}(y(\alpha_k^*))/\mathbb{E}(T(\alpha_k^*)),\ \mathbb{E}(\mathbf{z}(\alpha_k^*))/\mathbb{E}(T(\alpha_k^*))]$ be the optimal performance vector. Rearranging terms in the above inequality gives

$$\mathbb{E}\left( V\left(y(\alpha_k) - f^* T(\alpha_k)\right) + \sum_{l=1}^L Q_l[k]\left(z_l(\alpha_k) - g_l^* T(\alpha_k)\right) \,\middle|\, \mathcal{F}_k \right) \le 0.$$

This implies that the expression inside the expectation is a supermartingale difference sequence (that is, the partial sums $S[K] := \sum_{k=0}^{K-1}\left[V(y(\alpha_k)-f^*T(\alpha_k)) + \sum_{l=1}^L Q_l[k](z_l(\alpha_k)-g_l^*T(\alpha_k))\right]$ form a supermartingale with respect to $\{\mathcal{F}_K\}$), a fact not strictly needed here but the key to our new analysis later. Now, taking expectations of both sides and summing from $k=0$ to $K-1$ gives

$$\sum_{k=0}^{K-1}\mathbb{E}\left( V(y(\alpha_k)-f^*T(\alpha_k)) + \sum_{l=1}^L Q_l[k](z_l(\alpha_k)-g_l^*T(\alpha_k)) \right) \le 0.$$

Substituting $g_l^*\le d_l$ gives

$$\sum_{k=0}^{K-1}\mathbb{E}\left( V(y(\alpha_k)-f^*T(\alpha_k)) + \sum_{l=1}^L Q_l[k](z_l(\alpha_k)-d_l T(\alpha_k)) \right) \le 0. \qquad (1.12)$$

On the other hand, taking expectations of both sides of inequality (1.7) and summing from $k=0$ to $K-1$ gives

$$\mathbb{E}\left(\frac{\|\mathbf{Q}[K]\|_2^2}{2}\right) + \sum_{k=0}^{K-1}\mathbb{E}(Vy(\alpha_k)) \le \sum_{k=0}^{K-1}\mathbb{E}\left( Vy(\alpha_k) + \sum_{l=1}^L Q_l[k](z_l(\alpha_k)-d_l T(\alpha_k)) \right) + BK.$$

Summing the above inequality and (1.12) gives

$$\mathbb{E}\left(\frac{\|\mathbf{Q}[K]\|_2^2}{2}\right) + \sum_{k=0}^{K-1}\mathbb{E}(Vy(\alpha_k)) \le Vf^*\sum_{k=0}^{K-1}\mathbb{E}(T(\alpha_k)) + BK. \qquad (1.13)$$

This bound kills two birds with one stone, immediately yielding both the objective bound and the constraint violation bound. On one hand, since $\mathbb{E}\left(\|\mathbf{Q}[K]\|_2^2\right)\ge0$ and $\sum_{k=0}^{K-1}\mathbb{E}(T(\alpha_k))\ge K$ (each frame lasts at least one slot), we have

$$\frac{\sum_{k=0}^{K-1}\mathbb{E}(y(\alpha_k))}{\sum_{k=0}^{K-1}\mathbb{E}(T(\alpha_k))} \le f^* + \frac{B}{V}. \qquad (1.14)$$

On the other hand, let $C$ and $T_{\max}$ be constants such that $|\mathbb{E}(y(\alpha))|\le C$ and $\mathbb{E}(T(\alpha))\le T_{\max}$ for all $\alpha\in\mathcal{A}$. Then (1.13) implies

$$\mathbb{E}(\|\mathbf{Q}[K]\|_2) \le \sqrt{2BK + 4VK(C+T_{\max})}\ \Rightarrow\ \frac{\sum_{k=0}^{K-1}\mathbb{E}(z_l(\alpha_k))}{\sum_{k=0}^{K-1}\mathbb{E}(T(\alpha_k))} \le d_l + \sqrt{\frac{2B + 4V(C+T_{\max})}{K}}, \qquad (1.15)$$

which follows from the virtual queue updating rule (1.10), which implies $\mathbb{E}(\|\mathbf{Q}[K]\|_2) \ge \mathbb{E}(Q_l[K]) \ge \sum_{k=0}^{K-1}\mathbb{E}(z_l(\alpha_k)-d_l T(\alpha_k))$, together with $\sum_{k=0}^{K-1}\mathbb{E}(T(\alpha_k))\ge K$.

Remark 1.1.1. The bounds (1.14) and (1.15) are not the tightest possible, but are (I believe) simple enough to highlight the key steps.

1.2 The coupled renewal systems

So far, readers should have gained some understanding of the renewal systems discussed throughout the thesis. In this section, we introduce our coupled renewal system model. Most of the notation is the same as in the last section, except that we add a superscript $n$ to index the renewal systems. Consider $N$ renewal systems that operate over a slotted timeline ($t\in\{0,1,2,\ldots\}$). The timeline for each system $n\in\{1,\ldots,N\}$ is segmented into back-to-back intervals, which are renewal frames. The duration of each renewal frame is a random positive integer with a distribution that depends on a control action chosen by the system at the start of the frame. The decision at each renewal frame also determines the penalty and a vector of performance metrics during this frame. The systems are coupled by time average constraints placed on these metrics over all systems. The goal is to design a decision strategy for each system so that the overall time average penalty is minimized subject to the time average constraints.

We use $k=0,1,2,\cdots$ to index the renewals. Let $t_k^n$ be the time slot corresponding to the $k$-th renewal of the $n$-th system, with the convention that $t_0^n=0$. Let $\mathcal{T}_k^n$ be the set of all slots from $t_k^n$ to $t_{k+1}^n-1$. At time $t_k^n$, the $n$-th system chooses a possibly random decision $\alpha_k^n$ in a set $\mathcal{A}^n$. This action determines the distributions of the following random variables:

• The duration of the $k$-th renewal frame $T_k^n := t_{k+1}^n - t_k^n$, which is a positive integer.
• A vector of performance metrics at each slot of that frame $\mathbf{z}^n[t] := (z_1^n[t], z_2^n[t], \cdots, z_L^n[t])$, $t\in\mathcal{T}_k^n$.
• A penalty incurred at each slot of the frame $y^n[t]$, $t\in\mathcal{T}_k^n$.
We assume each system has the renewal property as defined in Definition 1.1.1: given $\alpha_k^n = \alpha^n\in\mathcal{A}^n$, the random variables $T_k^n$, $\mathbf{z}^n[t]$, and $y^n[t]$, $t\in\mathcal{T}_k^n$, are independent of the information of all systems from the slots before $t_k^n$, with the known conditional expectations $\mathbb{E}(T_k^n\mid\alpha_k^n=\alpha^n)$, $\mathbb{E}\left(\sum_{t\in\mathcal{T}_k^n} y^n[t] \,\middle|\, \alpha_k^n=\alpha^n\right)$, and $\mathbb{E}\left(\sum_{t\in\mathcal{T}_k^n} \mathbf{z}^n[t] \,\middle|\, \alpha_k^n=\alpha^n\right)$. Fig. 1.2 plots a sample timeline of three parallel renewal systems.

[Figure 1.2: The sample timelines of three asynchronous parallel renewal systems, where the numbers underneath the figure index time slots and the numbers inside the blocks index the renewals of each system.]

To make the framework a bit more general, we introduce an uncontrollable external i.i.d. random process $\{\mathbf{d}[t]\}_{t=0}^{\infty}\subseteq\mathbb{R}^L$ which can be observed during each time slot. Let $d_l := \mathbb{E}(d_l[t])$. The expectation of $\mathbf{d}[t]$ often serves as the constraint level on the corresponding performance metrics. As we shall see in the example application on energy-aware scheduling, $\mathbf{z}^n[t]$ and $\mathbf{d}[t]$ could represent vectors of job services and arrivals for the different classes, respectively, and the constraints are that the time average service be no less than the time average of arrivals for all classes of jobs.

The goal is to minimize the total time average penalty of these $N$ renewal systems subject to $L$ total time average constraints on the performance metrics relative to the external i.i.d. process, i.e., we aim to solve the following optimization problem:

$$\min\ \limsup_{T\to\infty}\frac1T\sum_{t=0}^{T-1}\sum_{n=1}^N\mathbb{E}(y^n[t]) \qquad (1.16)$$

$$\text{s.t.}\ \limsup_{T\to\infty}\frac1T\sum_{t=0}^{T-1}\sum_{n=1}^N\mathbb{E}(z_l^n[t]) \le d_l,\quad l\in\{1,2,\cdots,L\}. \qquad (1.17)$$

1.3 Example Applications and previous works

1.3.1 Multi-server energy-aware scheduling

Consider a slotted time system with $L$ classes of jobs and $N$ servers. Job arrivals are Poisson distributed with rates $\lambda_1,\cdots,\lambda_L$, respectively. These jobs are stored in separate queues, denoted $Q_1[t],\cdots,Q_L[t]$, in a router, waiting to be served. Assume the system is empty at time $t=0$, so that $Q_l[0]=0,\ \forall l\in\{1,2,\cdots,L\}$. Let $\lambda_l[t]$ be the precise number of class-$l$ job arrivals at slot $t$; then $\mathbb{E}(\lambda_l[t])=\lambda_l,\ \forall l\in\{1,2,\cdots,L\}$. Let $\mu_l^n[t]$ and $e^n[t]$ be the number of class-$l$ jobs served and the energy consumption of server $n$ at time slot $t$, respectively. Fig. 1.3 sketches an example architecture of the system with 3 classes of jobs and 10 servers.

Each server makes decisions over renewal frames, and the first frame starts at time slot $t=0$. Successive renewals can happen at different slots for different servers. For the $n$-th server, at the beginning of the $k$-th frame ($k\in\mathbb{N}$), it chooses a processing mode $m_k^n$ from the set of all modes $\mathcal{M}^n$. The processing mode $m_k^n$ determines distributions on the number of jobs served, the service time, and the energy expenditure, with conditional expectations:

• $\widehat{T}^n(m_k^n) := \mathbb{E}(T_k^n\mid m_k^n)$, the expected frame size;
• $\widehat{\mu}_l^n(m_k^n) := \mathbb{E}\left(\sum_{t\in\mathcal{T}_k^n}\mu_l^n[t] \,\middle|\, m_k^n\right)$, the expected number of class-$l$ jobs served;
• $\widehat{e}^n(m_k^n) := \mathbb{E}\left(\sum_{t\in\mathcal{T}_k^n}e^n[t] \,\middle|\, m_k^n\right)$, the expected energy consumption.

The goal is to minimize the time average energy consumption subject to the queue stability constraints, i.e.,

$$\min\ \limsup_{T\to\infty}\frac1T\sum_{t=0}^{T-1}\sum_{n=1}^N\mathbb{E}(e^n[t]) \qquad (1.18)$$

$$\text{s.t.}\ \liminf_{T\to\infty}\frac1T\sum_{t=0}^{T-1}\sum_{n=1}^N\mathbb{E}(\mu_l^n[t]) \ge \lambda_l,\quad \forall l\in\{1,2,\cdots,L\}. \qquad (1.19)$$

Thus, we have formulated the problem in the form (1.16)-(1.17): taking $y^n[t]=e^n[t]$, $z_l^n[t]=-\mu_l^n[t]$, and $d_l[t]=-\lambda_l[t]$ turns the "$\ge$" constraint (1.19) into a "$\le$" constraint of the form (1.17).
Note that the external process in this example is the arrival process of the $L$ classes of jobs, with potentially unknown arrival rates $\lambda_l$.

[Figure 1.3: Illustration of an energy-aware scheduling system with 3 classes of jobs and 10 parallel servers.]

1.3.2 Coupled ergodic MDPs

Consider $N$ discrete time Markov decision processes (MDPs) over an infinite horizon. Each MDP consists of a finite state space $\mathcal{S}^n$ and an action space $\mathcal{U}^n$ at each state $s\in\mathcal{S}^n$. (To simplify the notation, we assume each state of MDP $n$ has the same action space $\mathcal{U}^n$; all our analysis generalizes trivially to states with different action spaces.) For each state $s\in\mathcal{S}^n$, we use $P_u^n(s,s')$ to denote the transition probability from $s\in\mathcal{S}^n$ to $s'\in\mathcal{S}^n$ when taking action $u\in\mathcal{U}^n$, i.e.,

$$P_u^n(s,s') = Pr(s[t+1]=s' \mid s[t]=s,\ u[t]=u),$$

where $s[t]$ and $u[t]$ are the state and action at time slot $t$. At time slot $t$, after observing the state $s[t]\in\mathcal{S}^n$ and choosing the action $u[t]\in\mathcal{U}^n$, the $n$-th MDP receives a penalty $y^n(u[t],s[t])$ and $L$ types of resource costs $z_1^n(u[t],s[t]),\cdots,z_L^n(u[t],s[t])$, where these functions are all bounded mappings from $\mathcal{S}^n\times\mathcal{U}^n$ to $\mathbb{R}$. For simplicity, we write $y^n[t]=y^n(u[t],s[t])$ and $z_l^n[t]=z_l^n(u[t],s[t])$. The goal is to minimize the time average overall penalty with constraints on the time average overall costs, where the MDPs are weakly coupled through the time average constraints. This problem can be written in the form (1.16)-(1.17).

In order to define the renewal frames, we need one more assumption on the MDPs. We assume each MDP is ergodic, i.e., there exists a state which is recurrent and the corresponding Markov chain is aperiodic under any randomized stationary policy, with bounded expected recurrence time. Under this assumption, the renewals for each MDP can be defined as successive revisitations to the recurrent state, and the action set $\mathcal{A}^n$ in such a scenario is defined as the set of all randomized stationary policies that can be implemented in one renewal frame. Thus, our formulation includes coupled ergodic MDPs. We refer to [Alt99a], [Ber01], and [Ros02] for more details on MDP theory and related topics.

As a side remark, this multi-MDP problem can be viewed as a single MDP on an enlarged state space. Constrained MDPs are discussed in [Alt99a]. One can show that, under the ergodicity assumption above, the minimum of (1.16)-(1.17) is achieved by a randomized stationary policy, and furthermore, such a policy can be obtained by solving offline a linear program reformulated from (1.16)-(1.17). However, formulating such an LP requires knowledge of all the parameters of the problem, including the statistics of the external process $\{\mathbf{d}[t]\}_{t=0}^{\infty}$, and the resulting LP is often computationally intractable when the number of MDPs is very large.
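For reference, the LP alluded to above can be written, for a single ergodic constrained MDP, in terms of a stationary state-action occupancy measure $\theta(s,u)$; this is our paraphrase of the standard formulation (see [Alt99a]), not text from the thesis:

$$\begin{aligned}
\min_{\theta\ge0}\quad & \sum_{s\in\mathcal{S}}\sum_{u\in\mathcal{U}}\theta(s,u)\,y(s,u)\\
\text{s.t.}\quad & \sum_{s,u}\theta(s,u)\,z_l(s,u)\le d_l,\quad l=1,\dots,L,\\
& \sum_{u}\theta(s',u)=\sum_{s,u}\theta(s,u)\,P_u(s,s'),\quad\forall s'\in\mathcal{S},\qquad \sum_{s,u}\theta(s,u)=1.
\end{aligned}$$

An optimal randomized stationary policy is recovered as $\pi(u\mid s)=\theta(s,u)/\sum_{u'}\theta(s,u')$. For $N$ weakly coupled MDPs, one such LP is needed per system, tied together only through the $L$ shared budget constraints; writing it down requires the unknown means $d_l$, which is precisely what the algorithms in this thesis avoid.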
1.3.3 Why this problem is difficult

Compared to (1.1)-(1.2), this problem is much more challenging because the $N$ systems are weakly coupled by the time average constraints (1.17), yet each of them operates over its own renewal frames. The renewals of different systems do not have to be synchronized, and they do not have to occur at the same rate (e.g., see Fig. 1.2). Our goal is to develop an algorithm that does not need knowledge of $d_l=\mathbb{E}(d_l[t])$ and has a provable performance guarantee.

Note that due to the asynchronicity, the DPP ratio algorithm (Algorithm 1) does not apply. More specifically, in order to cope with the time average constraints, Algorithm 1 introduces virtual queues to penalize constraint violations. These virtual queues are updated frame-wise, and the analysis is also on the per-frame scale of that particular system. For parallel renewal systems, however, it is not clear what the proper scale for updating the virtual queues is. Naturally, one might introduce a virtual queue for each constraint and update it whenever at least one of the systems starts a new renewal frame. However, this means that for the systems that have yet to reach a renewal, we would be updating algorithm parameters in the middle of their renewals, which creates grave difficulties in piecing together the analyses of the individual systems. On the other hand, since time is slotted, one could also think of "giving up" the notion of renewals, synchronizing all systems on the slot scale, and designing a slot-based algorithm. However, this does not make the problem any simpler, since such an algorithm can still update parameters in the middle of renewals.

Prior approaches treat this challenge only in special cases. The works [Nee12a] and [Nee12b] consider a special case where all quantities introduced above are deterministic functions of the actions. The work [Nee11] develops a two-stage algorithm for stochastic multi-renewal systems, but the first stage must be solved offline. For the special case where the system is a collection of coupled Markov decision processes, classical methods for MDPs, such as dynamic programming and linear programming [Ber95][Put14][Ros02], can be used to solve this problem. However, they can be impractical for two reasons: first, the state space has dimension that depends on the number of renewal systems, making solutions difficult when the number of renewal systems is large; second, some statistics of the system, such as the mean of the $\mathbf{d}[t]$ process governing the resource constraints, can be unknown.

1.3.4 Other works related to renewal and asynchronous optimization

The problem considered here is a generalization of optimization over a single renewal system. It is shown in [Nee13b] that for a single renewal system with a finite action set, the problem can be solved (offline) via a linear fractional program. Methods for solving linear fractional programs can be found in [BV04] and [Sch83]. The drift-plus-penalty ratio approach is also developed in [Nee10b] and [Nee13a] for the single renewal system.

Note that there are also many other algorithms which consider "asynchronous optimization" in a different sense than ours. More specifically, the works [BT97][BGPS06][SN11][PXYY16] consider scenarios where the asynchronicity shown in Fig. 1.2 results from uncontrollable delays due to environmental uncertainties. These delays have fixed distributions, independent of the actions (or are even deterministic); thus, the delays do not appear in the optimization objectives.

On the other hand, our problem is also related to multi-server scheduling, as shown in one of the example applications. When assuming proper statistics of the arrivals and/or services, energy optimization problems in multi-server systems can also be treated via queueing theory. Specifically, by assuming that both arrivals and services are Poisson distributed, [GDHBSW13] treats the multi-server system as an M/M/k/setup queue and explicitly computes several performance metrics via the renewal reward theorem.
By assuming Poisson arrivals and a single server, [LN14] and [Yao02] treat the system as a multi-class M/G/1 queue and optimize the average energy consumption via polymatroid optimization.

1.4 Outline and our contributions

The rest of the thesis is organized as follows:

• Chapter 2 (published in [WN18]): We develop a new algorithm for general asynchronous renewal optimization, where each system operates on its own renewal frames. It is fully analyzed, with convergence as well as convergence time results. As a first technical contribution, we fully characterize the fundamental performance region of the problem (1.16)-(1.17). We then construct a supermartingale along with a stopping time to "synchronize" all systems on a slot basis, by which we can piece together the analyses of the individual systems to prove the convergence of the proposed algorithm. Furthermore, encapsulating this new idea into convex analysis tools, we prove the $O(1/\varepsilon^2)$ convergence time of the proposed algorithm to reach $O(\varepsilon)$ near optimality under a mild assumption on the existence of a Lagrange multiplier. Specifically, we show that for any accuracy $\varepsilon>0$ and any time $T\ge 1/\varepsilon^2$, the sequences $\{y^n[t]\}$ and $\{\mathbf{z}^n[t]\}$ produced by our algorithm satisfy

$$\frac1T\sum_{t=0}^{T-1}\sum_{n=1}^N\mathbb{E}(y^n[t]) \le f^* + O(\varepsilon),\qquad \frac1T\sum_{t=0}^{T-1}\sum_{n=1}^N\mathbb{E}(z_l^n[t]) \le d_l + O(\varepsilon),\ l\in\{1,2,\cdots,L\},$$

where $f^*$ denotes the optimal objective value of (1.16)-(1.17). Simulation experiments on the aforementioned multi-server energy-aware scheduling problem also demonstrate the effectiveness of the proposed algorithm.

• Chapter 3, Data center server provision (published in [WN17]): We consider a cost minimization problem for data centers with $N$ servers and randomly arriving service requests. A central router decides which server to use for each new request. We formulate this problem as an asynchronous renewal optimization and develop a distributed control algorithm so that each server makes its own decisions, the request queues are bounded, and the overall time average cost is near optimal with probability 1. The algorithm does not need probability information for the arrival rate or job sizes. Next, an improved algorithm that uses a single queue is developed via a "virtualization" technique and is shown to provide the same (near optimal) costs. Simulation experiments on a real data center traffic trace demonstrate the efficiency of our algorithm compared to other existing algorithms.

• Chapter 4, Multi-user file downloading (published in [WN15]): We treat power-aware throughput maximization in a multi-user file downloading system. Each user can receive a new file only after its previous file is finished. The file state processes for the users act as coupled Markov chains that form a generalized restless bandit system. First, an optimal algorithm is derived for the case of one user. The algorithm maximizes throughput subject to an average power constraint. Next, the one-user algorithm is extended to a low-complexity heuristic for the multi-user problem. The heuristic uses a simple online index policy. In a special case with no power constraint, the multi-user heuristic is shown to be throughput optimal. Simulations are used to demonstrate the effectiveness of the heuristic in the general case. For simple cases where the optimal solution can be computed offline, the heuristic is shown to be near optimal for a wide range of parameters.
• Chapter 5, Opportunistic scheduling over renewal systems: We consider constrained optimization over a renewal system with random events observed at the beginning of each renewal frame. We propose an online algorithm which does not need knowledge of the distributions of the random events, and prove that it is feasible and achieves $O(\varepsilon)$ near optimality by constructing an exponential supermartingale.

• Chapter 6, Online learning in weakly coupled Markov decision processes (published in [WYN18]): In this chapter, we consider a special case of multiple parallel renewal systems, namely parallel Markov decision processes coupled by global constraints, where the time varying objective and constraint functions can only be observed after the decision is made. Special attention is given to how well the decision maker can perform in $T$ slots, starting from any state, compared to the best feasible randomized stationary policy in hindsight. We develop a new distributed online algorithm where each MDP makes its own decision each slot after observing a multiplier computed from past information. While the scenario is significantly more challenging than the classical online learning context, the algorithm is shown to have tight $O(\sqrt{T})$ regret and constraint violations simultaneously. To obtain such a bound, we combine several new ingredients, including an ergodicity and mixing time bound for weakly coupled MDPs, a new regret analysis for online constrained optimization, a drift analysis for queue processes, and a perturbation analysis based on Farkas' Lemma.

Chapter 2

Asynchronous Optimization over Weakly Coupled Renewal Systems

In this chapter, we present our asynchronous algorithm along with the new analysis. Along the way, we try to provide some intuition and the high-level ideas of the analysis. Consider $N$ renewal systems that operate over a slotted timeline ($t\in\{0,1,2,\ldots\}$). The timeline for each system $n\in\{1,\ldots,N\}$ is segmented into back-to-back intervals, which are renewal frames. The duration of each renewal frame is a random positive integer with a distribution that depends on a control action chosen by the system at the start of the frame. The decision at each renewal frame also determines the penalty and a vector of performance metrics during this frame. The systems are coupled by time average constraints placed on these metrics over all systems. The goal is to design a decision strategy for each system so that the overall time average penalty is minimized subject to the time average constraints.

Recall that we use $k=0,1,2,\cdots$ to index the renewals. Let $t_k^n$ be the time slot corresponding to the $k$-th renewal of the $n$-th system, with the convention that $t_0^n=0$. Let $\mathcal{T}_k^n$ be the set of all slots from $t_k^n$ to $t_{k+1}^n-1$. At time $t_k^n$, the $n$-th system chooses a possibly random decision $\alpha_k^n$ in a set $\mathcal{A}^n$. This action determines the distributions of the following random variables:

• The duration of the $k$-th renewal frame $T_k^n := t_{k+1}^n - t_k^n$, which is a positive integer.
• A vector of performance metrics at each slot of that frame $\mathbf{z}^n[t] := (z_1^n[t], z_2^n[t], \cdots, z_L^n[t])$, $t\in\mathcal{T}_k^n$.
• A penalty incurred at each slot of the frame $y^n[t]$, $t\in\mathcal{T}_k^n$.
Following this assumption, we define f ∗ as the infimum objective value for (5.1)-(1.17) over all decision sequences that satisfy the constraints. Assumption 2.1.2 (Boundedness). For any k ∈ N and any n ∈ {1, 2,··· ,N}, there exist absolute constants y max , z max and d max such that |y n [t]|≤y max , |z n l [t]|≤z max , |d l [t]|≤d max , ∀t∈T n k , ∀l∈{1, 2,··· ,L}. Furthermore, there exists an absolute constant B≥ 1 such that for every fixed α n ∈A n and every s∈N for which Pr(T n k ≥s|α n k =α n )> 0, E (T n k −s) 2 α n k =α n ,T n k ≥s ≤B. (2.1) Remark 2.1.1. The quantityT n k −s is usually referred to as the residual lifetime. In the special case where s = 0, (2.1) gives the uniform second moment bound of the renewal frames as E (T n k ) 2 α n k =α n ≤B. Note that (2.1) is satisfied for a large class of problems. In particular, it can be shown to hold in the following three cases: 1. If the inter-renewal T n k is deterministically bounded. 2. If the inter-renewal T n k is geometrically distributed. 17 3. If each system is a finite state ergodic MDP with a finite action set. Definition 2.1.1. For any α n ∈A n , let b y n (α n ) :=E X t∈T n k y n [t] α n k =α n , b z n l (α n ) :=E X t∈T n k z n l [t] α n k =α n , and b T n (α n ) :=E(T n k |α n k =α n ). Define b f n (α n ) :=b y n (α n )/ b T n (α n ), b g n l (α n ) :=b z n l (α n )/ b T n (α n ), ∀l∈{1, 2,··· ,L}, and let b f n (α n ), b g n (α n ) be a performance vector under the action α n . Note that by Assumption 5.2.1,b y n (α n ) andb z n (α n ) in Definition 2.1.1 are both bounded, and T n k ≥ 1, ∀k∈N, thus, the set n b f n (α n ), b g n (α n ) , α n ∈A n o is also bounded. The following mild assumption states that this set is also closed. Assumption 2.1.3. The set n b f n (α n ), b g n (α n ) , α n ∈A n o is compact. The motivation of this assumption is to guarantee that there always exists at least one solution to each subproblem in our algorithm. Finally, we define the performance region of each individual system as follows. Definition 2.1.2. LetS n be the convex hull of n b y n (α n ), b z n (α n ), b T n (α n ) : α n ∈A n o ⊆ R L+2 . Define P n :={(y/T, z/T ) : (y, z,T )∈S n }⊆R L+1 as the performance region of system n. 2.2 Algorithm 2.2.1 Proposed algorithm In this section, we propose an algorithm where each system can make its own decision after observing a global vector of multipliers which is updated using the global information from all 18 systems. We start by defining a vector of virtual queues Q[t] := (Q 1 [t], Q 2 [t], ··· , Q L [t]), which are 0 at t = 0 and updated as follows, Q l [t + 1] = max ( Q l [t] + N X n=1 z n l [t]−d l [t], 0 ) , l∈{1, 2,··· ,L}. (2.2) These virtual queues will serve as global multipliers to control the growth of corresponding resource consumptions. Then, the proposed algorithm is presented in Algorithm 2. Algorithm 2. Fix a trade-off parameter V > 0: • At the beginning of k-th frame of system n, the system observes the vector of virtual queues Q[t n k ] and makes a decision α n k ∈A n so as to solve the following subproblem: D n k := min α n ∈A n E P t∈T n k (Vy n [t] +hQ[t n k ], z n [t]i) α n k =α n , Q[t n k ] E(T n k |α n k =α n , Q[t n k ]) . (2.3) • Update the virtual queue after each slot: Q l [t + 1] = max ( Q l [t] + N X n=1 z n l [t]−d l [t], 0 ) , l∈{1, 2,··· ,L}. 
(2.4) Note that using the notation specified in Definition 2.1.1, we can rewrite (2.3) in a more concise way as follows: min α n ∈A n n V b f n (α n ) +hQ[t n k ],b g n (α n )i o , (2.5) which is a deterministic optimization problem. Then, by the compactness assumption (Assump- tion 2.1.3), there always exists a solution to this subproblem. Remark 2.2.1. We would like to compare this algorithm to the DPP ratio algorithm (Algorithm 1). For each renewal system, both algorithms update the decision variable frame-wise based on the virtual queue value at the beginning of each frame. The major difference is that the proposed algorithm updates virtual queue slot-wise while Algorithm 1 updates virtual queues per frame. Such a seemingly small change, somewhat surprisingly, requires significant generalizations of the analysis on Algorithm 1. This algorithm requires knowledge of the conditional expectations associated with the per- formance vectors b f n (α n ), b g n (α n ) , α n ∈A n , but only requires individual systems n to know 19 their own b f n (α n ), b g n (α n ) , α n ∈A n , and therefore decouples these systems. Furthermore, the virtual queue update uses observed d l [t] and does not require knowledge of distribution or mean of d l [t]. In addition, we introduce Q[t] as “virtual queues” for the following two reasons: First, it can be mapped to real queues in applications (such as the server scheduling problem mentioned in Section 1.3.1), where d[t] stands for the arrival process and z[t] is the service process. Second, stabilizing these virtual queues implies the constraints (1.17) are satisfied, as is illustrated in the following lemma. Lemma 2.2.1. IfQ l [0] = 0 and lim T→∞ 1 T E(Q l [T ]) = 0, then, lim sup T→∞ 1 T P T−1 t=0 P N n=1 E(z n l [t])≤ d l . Proof of Lemma 2.2.1. Fix l∈{1, 2,··· ,L}. For any fixed T , Q l [T ] = P T−1 t=0 (Q l [t + 1]−Q l [t]). For each summand, by queue updating rule (5.5), Q l [t + 1]−Q l [t] = max ( Q l [t] + N X n=1 z n l [t]−d l [t], 0 ) −Q l [t] ≥Q l [t] + N X n=1 z n l [t]−d l [t]−Q l [t] = N X n=1 z n l [t]−d l [t]. Thus, by the assumption Q l [0] = 0, Q l [T ]≥ T−1 X t=0 N X n=1 z n l [t]−d l [t] ! . Taking expectations of both sides withE(d l [t]) =d l , ∀l, gives E(Q l [T ])≥ T−1 X t=0 N X n=1 E(z n l [t])−d l ! . Dividing both sides by T and passing to the limit gives lim sup T→∞ 1 T T−1 X t=0 N X n=1 E(z n l [t])−d l ! ≤ lim T→∞ 1 T E(Q l [T ]) = 0, finishing the proof. 20 2.2.2 Computing subproblems Since a key step in the algorithm is to solve the optimization problem (2.5), we make several comments on the computation of the ratio minimization (2.5). In general, one can solve the ratio optimization problem (2.3) (therefore (2.5)) via a bisection search algorithm. For more details, see section 7 of [Nee10b]. However, more often than not, bisection search is not the most efficient one. We will discuss two special cases arising from applications where we can find a simpler way of solving the subproblem. First of all, when there are only a finite number of actions in the setA n , one can solve (2.5) simply via enumerating. This is a typical scenario in energy-aware scheduling where a finite action set consists of different processing modes that can be chosen by servers. 
Second, when the set n b y n (α n ), b z n (α n ), b T n (α n ) : α n ∈A n o specified in Definition 2.1.2 is itself a convex hull of a finite sequence{(y j , z j ,T j )} m j=1 , then, (2.5) can be rewritten as a simple enumeration: min i∈{1,2,···,m} V y i T i + Q[t n k ], z i T i . To see this, note that by definition of convex hull, for anyα n ∈A n , b y n (α n ), b z n (α n ), b T n (α n ) = P m j=1 p j · (y j ,z j ,T j ) for some{p j } m j=1 , p j ≥ 0 and P m j=1 p j = 1. Thus, V b f n (α n ) +hQ[t n k ],b g n (α n )i =V P m j=1 p j y j P m j=1 p j T j + * Q[t n k ], P m j=1 p j z j P m j=1 p j T j + = m X i=1 p i T i P m j=1 p j T j V y i T i + Q[t n k ], z i T i =: m X i=1 q i V y i T i + Q[t n k ], z i T i , where we let q i = piTi P m j=1 pjTj . Note that q i ≥ 0 and P m i=1 q i = 1 because T i ≥ 1. Hence, solving (2.5) is equivalent to choosing{q i } m i=1 to minimize the above expression, which boils down to choosing a single (y i , z i ,T i ) among{(y j , z j ,T j )} m j=1 which achieves the minimum. Note that such a convex hull case stands out not only because it yields a simple solution, but also because of the fact that ergodic coupled MDPs discussed in Section 1.3.2 have the region n b y n (α n ), b z n (α n ), b T n (α n ) : α n ∈A n o being the convex hull of a finite sequence of points {(y j , z j ,T j )} m j=1 , where each point (y j , z j ,T j ) results from a pure stationary policy ([Alt99a]). 21 1 Thus, solving (2.5) for the ergodic coupled MDPs reduces to choosing a pure policy among a finite number of pure policies. 2.3 Limiting Performance In this section, we provide the performance analysis of Algorithm 2. Let f ∗ be the optimal objective value for problem (5.1)-(1.17). The goal is to show the following bound similar to that of Algorithm 1: 1 T T−1 X t=0 N X n=1 E(y n [t])≤f ∗ + C V , E(kQ[T ]k)≤C 0 √ VT, for some constant C,C 0 > 0. Then, by Lemma 2.2.1, one readily obtains the constraint satisfac- tion result. For the rest of the chapter, the underlying probability space is denoted as the tuple (Ω,F, P ). LetF[t] be the system history up until time slot t. Formally, {F[t]} ∞ t=0 is a filtration with F[0] ={∅, Ω} and eachF[t], t≥ 1 is the σ-algebra generated by all random variables from slot 0 to t− 1. For the rest of the chapter, we always assume Assumptions 2.1.1-2.1.3 hold without explicitly mentioning them. 2.3.1 Convexity, performance region and other properties In this section, we present several lemmas on the fundamental properties of the optimization problem (5.1)-(1.17). The following lemma demonstrates the convexity ofP n in Definition 2.1.2. Lemma 2.3.1. The performance regionP n specified in Definition 2.1.2 is convex for any n∈ {1, 2,··· ,N}. Furthermore, it is the convex hull of the set n b f n (α n ), b g n (α n ) :α n ∈A n o and thus compact, where b f n (α n ), b g n (α n ) is specified Definition 2.1.1. 1 A pure stationary policy is an algorithm where the decision to be taken at any time t is a deterministic function of the state at time t, and independent of all other past information. 22 First of all, we have the following fundamental performance lemma which states that the optimality of (5.1)-(1.17) is achievable withinP n specified in Definition 2.1.2. Lemma 2.3.2. For each n∈{1, 2,··· ,N}, there exists a pair f n ∗ , g n ∗ ∈P n such that the following hold: N X n=1 f n ∗ =f ∗ N X n=1 g n l,∗ ≤d l , l∈{1, 2,··· ,L}, where f ∗ is the optimal objective value for problem (5.1)-(1.17), i.e. 
the optimality is achievable within⊗ N n=1 P n , the Cartesian product ofP n . Furthermore, for any f n , g n ∈P n , n ∈{1, 2,··· ,N}, satisfying P N n=1 g n l ≤ d l , l ∈ {1, 2,··· ,L}, we have P N n=1 f n ≥ f ∗ , i.e. one cannot achieve better performance than (5.1)- (1.17) in⊗ N n=1 P n . The proof of this Lemma is delayed to Section 2.6. In particular, the proof uses the following lemma, which also plays an important role in several lemmas later. Lemma 2.3.3. Suppose{y n [t]} ∞ t=0 ,{z n [t]} ∞ t=0 and{T n k } ∞ k=0 are processes resulting from any algorithm, 2 then,∀T∈N, 1 T T−1 X t=0 E(f n [t]−y n [t])≤ B 1 T , (2.6) 1 T T−1 X t=0 E(g n l [t]−z n l [t])≤ B 2 T , l∈{1, 2,··· ,L}, (2.7) where B 1 = 2y max √ B, B 2 = 2z max √ B and f n [t], g n [t] are constant over each renewal frame for system n defined by f n [t] = b f n (α n ), if t∈T n k ,α n k =α n g n [t] =b g n (α n ), if t∈T n k ,α n k =α n , and b f n (α n ),b g n (α n ) are defined in Definition 2.1.1. 2 Note that this algorithm might make decisions using the past information. 23 The proof of this lemma is delayed to Section 2.6. Remark 2.3.1. Note that directly computing f n ∗ and g n l,∗ indicated by Lemma 2.3.2 would be difficult because of the fractional nature ofP n , the coupling between different systems through time average constraints and the fact that d l = E(d l [t]) might be unknown. However, Lemma 2.3.2 can be used to prove important performance theorems regarding our proposed algorithm as is indicated by the following lemma. 2.3.2 Main result and near optimality analysis The following theorem gives the performance bound of our proposed algorithm. Theorem 2.3.1. The sequences{y n [t]} ∞ t=0 and{z n [t]} ∞ t=0 produced by the proposed algorithm satisfy all the constraints in (1.17) and achievesO(1/V ) near optimality, i.e. lim sup T→∞ 1 T T−1 X t=0 N X n=1 E(y n [t])≤f ∗ + NC 1 +C 3 V , where f ∗ is the optimal objective of (5.1)-(1.17), C 1 = 6Lz max (Nz max +d max )B i and C 3 := (Nz max +d max ) 2 L/2. Proof of Theorem 2.3.1. Define the drift-plus-penalty (DPP) expression at time slot t as P [t] :=E N X n=1 Vy n [t] + 1 2 kQ[t + 1]k 2 −kQ[t]k 2 ! . (2.8) By the queue updating rule (5.5), we have P [t]≤E N X n=1 Vy n [t] + 1 2 L X l=1 N X n=1 z n l [t]−d l [t] ! 2 + L X l=1 Q l [t] N X n=1 z n l [t]−d l [t] ! ≤ 1 2 (Nz max +d max ) 2 L +E N X n=1 Vy n [t] + L X l=1 Q l [t] N X n=1 z n l [t]−d l [t] !! = 1 2 (Nz max +d max ) 2 L +E N X n=1 Vy n [t] + L X l=1 Q l [t] N X n=1 z n l [t]−d l !! where the second inequality follows from the boundedness assumption (Assumption 5.2.1) that P L l=1 P N n=1 z n l [t]−d l [t] 2 ≤ (Nz max +d max ) 2 L, and the equality follows from the fact thatd l [t] 24 is i.i.d. and independent of Q l [t], thus, E(Q l [t]d l [t]) =E(Q l [t]·E(d l [t]|Q l [t])) =E(Q l [t]d l ). For simplicity, define C 3 = 1 2 (Nz max +d max ) 2 L. Now, by the achievability of optimality in ⊗ N n=1 P n (Lemma 2.3.2), we have P N n=1 g n l,∗ ≤ d l , thus, substituting this inequality into the above bound for P [t] gives P [t]≤C 3 +E N X n=1 Vy n [t] + N X n=1 L X l=1 Q l [t] z n l [t]−g n l,∗ ! =C 3 + N X n=1 E(Vy n [t] +hQ[t], z n [t]− g n ∗ i) =C 3 + N X n=1 E(X n [t]) +V N X n=1 f n ∗ =C 3 + N X n=1 E(X n [t]) +Vf ∗ , where we use the definition ofX n [t] in (2.15) by substituting (f n , g n ) with (f n ∗ , g n ∗ ), i.e. X n [t] = V (y n [t]−f n ∗ )+hQ[t], z n [t]− g n ∗ i, in the second from last equality and use the optimality condition (Lemma 2.3.2) in the final equality. 
Thus, it follows 1 T T−1 X t=0 P [t]≤C 3 +Vf ∗ + N X n=1 1 T T−1 X t=0 E(X n [t]). By the virtual queue updating rule (2.4) and the trivial bound Q l [t]≤O(t), we readily get T−1 X t=0 E(X n [t]) = T−1 X t=0 E V (y n [t]−f n ∗ ) + L X l=1 Q l [t](z n l [t]−g n ∗ ) ! ≤C(T 2 +VT ), for some constant C > 0. However, this bound is too weak to allow us proving the convergence result. The key to this proof is to improve such a bound so that T−1 X t=0 E(X n [t])≤C 1 T +C 2 V. whereC 1 andC 2 are two constants independent ofV orT . This is Lemma 2.3.8. As a consequence 25 for any T∈N, 1 T T−1 X t=0 P [t]≤ (NC 1 +C 3 ) + NC 2 V T . (2.9) On the other hand, by the definition ofP [t] in (2.8) and then telescoping sums with Q[0] = 0, we have 1 T T−1 X t=0 P [t] = 1 T T−1 X t=0 E N X n=1 Vy n [t] + 1 2 kQ[t + 1]k 2 −kQ[t]k 2 ! = 1 T T−1 X t=0 N X n=1 VE(y n [t]) + 1 2T E kQ[T ]k 2 . Combining this with inequality (2.9) gives 1 T T−1 X t=0 N X n=1 VE(y n [t]) + 1 2T E kQ[T ]k 2 ≤NC 1 +C 3 +Vf ∗ + NC 2 V T . (2.10) Since 1 2T E kQ[T ]k 2 ≥ 0, we can throw away the term and the inequality still holds, i.e. 1 T T−1 X t=0 N X n=1 E(y n [t])≤f ∗ + NC 1 +C 3 V + NC 2 T . (2.11) Taking lim sup T→∞ from both sides gives the near optimality in the theorem. To get the constraint violation bound, we use Assumption 5.2.1 that|y n [t]|≤y max , then, by (2.10) again, we have 1 T E kQ[T ]k 2 ≤ 2(NC 1 +C 3 ) + 4Vy max + 2NC 2 V T . By Jensen’s inequalityE kQ[T ]k 2 ≥E(kQ[T ]k) 2 . This implies that E(kQ[T ]k)≤ p (2(NC 1 +C 3 ) + 4Vy max )T + 2NC 2 V, which implies 1 T E(kQ[T ]k)≤ r 2(NC 1 +C 3 ) + 4Vy max T + 2NC 2 V T 2 . (2.12) Sending T→∞ gives lim T→∞ 1 T E(Q l [T ]) = 0, ∀l∈{1, 2,··· ,L}. 26 Finally, by Lemma 2.2.1, all constraints are satisfied. Note that the above proof implies a more refined result that illustrates the convergence time. Fix an ε> 0, let V = 1/ε, then, for all T≥ 1/ε, (2.11) implies that 1 T T−1 X t=0 N X n=1 E(y n [t])≤f ∗ +O(ε). However, (2.12) suggests a larger convergence time is required for constraint satisfaction! For V = 1/ε, it can be shown that (2.12) implies that 1 T T−1 X t=0 N X n=1 E(z n l [t])≤d l +O(ε), wheneverT≥ 1/ε 3 . The next section shows a tighter 1/ε 2 convergence time with a mild Lagrange multiplier assumption. The rest of this section is devoted to proving Lemma 2.3.8. 2.3.3 Key-feature inequality and supermartingale construction In this section and the next section, our goal is to show that the term T−1 X t=0 E V (y n [t]−f n ∗ ) + L X l=1 Q l [t](z n l [t]−g n ∗ ) ! ≤C 0 (V +T ). (2.13) Learning from the single renewal analysis (equation (1.11)), we have the following key-feature inequality connecting our proposed algorithm with the performance vectors insideP n . Lemma 2.3.4. Consider the stochastic processes{y n [t]} ∞ t=0 ,{z n [t]} ∞ t=0 , and{T n k } ∞ k=0 resulting from the proposed algorithm. For any system n, the following holds for any k ∈ N and any (f n , g n )∈P n , E P t∈T n k (Vy n [t] +hQ[t n k ], z n [t]i) Q[t n k ] E(T n k |Q[t n k ]) ≤Vf n +hQ[t n k ], g n i, (2.14) Proof of Lemma 5.5.4. First of all, since the proposed algorithm solves (2.3) over all possible decisions inA n , it must achieve value less than or equal to that of any action α n ∈A n at the 27 same frame. 
This gives, D n k ≤ E P t∈T n k (Vy n [t] +hQ[t n k ], z n [t]i) Q[t n k ],α n k =α n E(T n k | Q[t n k ],α n k =α n ) = Vb y n (α n ) +hQ[t n k ],b z n (α n )i b T n (α n ) , where D n k is defined in (2.3) and the equality follows from the renewal property of the system that T n k , P t∈T n k y n [t] and P t∈T n k z n [t] are conditionally independent of Q[t n k ] given α n k =α n . Since T n k ≥ 1, this implies b T n (α n )·D n k ≤Vb y n (α n ) +hQ[t n k ],b z n (α n )i, thus, for any α n ∈A n , Vb y n (α n ) +hQ[t n k ],b z n (α n )i−D n k · b T n (α n )≥ 0. SinceS n specified in Definition 2.1.2 is the convex hull of n (b y n (α n ), b z n (α n ), b T n (α n )), α n ∈A n o , it follows for any vector (y, z,T )∈S n , we have Vy +hQ[t n k ], zi−D n k ·T≥ 0. Dividing both sides by T and using the definition ofP n in Definition 2.1.2 give D n k ≤Vf n +hQ[t n k ], g n i, ∀(f n , g n )∈P n . Finally, since{y n [t]} ∞ t=0 ,{z n [t]} ∞ t=0 , and{T n k } ∞ k=0 result from the proposed algorithm and the action chosen is determined by Q[t n k ] as in (2.3), D n k = E P t∈T n k (Vy n [t] +hQ[t n k ], z n [t]i) Q[t n k ] E(T n k |Q[t n k ]) . This finishes the proof. Our next step is to give a frame-based analysis for each system by constructing a supermartin- gale on the per-frame timescale. We start with a definition of supermartingale: Definition 2.3.1 (Supermartingale). Consider a probability space (Ω,F,P) and a filtration 28 {F i } ∞ i=0 on this space withF 0 ={∅, Ω},F i ⊆F i+1 , ∀i andF i ⊆F, ∀i. Consider a process {X i } ∞ i=0 ⊆ R adapted to this filtration, i.e. X i ∈F i+1 , ∀i. Then, we have{X i } ∞ i=0 is a su- permartigale if E(|X i |)<∞ and E(X i+1 |F i+1 )≤X i . Furthermore,{X i+1 −X i } ∞ i=0 is called a supermartingale difference sequence. Note that by definition of supermartigale, we always haveE(X i+1 −X i |F i+1 )≤ 0. Along the way, we also have a standard definition of stopping time which will be used later: Definition 2.3.2 (Stopping time). Given a probability space (Ω,F,P ) and a filtration{?, Ω} = F 0 ⊆F 1 ⊆F 2 ··· inF. A stopping time τ with respect to the filtration{F i } ∞ i=0 is a random variable such that for any i∈N, {τ =i}∈F i , i.e. the stopping time occurring at timei is contained in the information during slots 0, 1, 2, ··· , i− 1. Recall that{F[t]} ∞ t=0 is a filtration (withF[t] representing system history during slots{0,··· ,t− 1}). Fix a system n and recall that t n k is the time slot where the k-th renewal occurs for system n. We would like to define a filtration corresponding to the random times t n k . To this end, define the collection of sets{F n k } ∞ k=0 such that for each k, F n k :={A∈F :A∩{t n k ≤t}∈F[t],∀t∈{0, 1, 2,···}} For example, the following set A is an element ofF n 3 : A ={t n 3 = 5}∩{y[0] =y 0 ,y[1] =y 1 ,y[2] =y 2 ,y[3] =y 3 ,y[4] =y 4 } wherey 0 ,··· ,y 4 are specific values. ThenA∈F n 3 because fori∈{0, 1, 2, 3, 4} we haveA∩{t n 3 ≤ i} =∅∈F[i], and for i∈{5, 6, 7,···} we have A∩{t≤i} =A∈F[i]. The following technical lemma is proved in Section 2.6. Lemma 2.3.5. The sequence{F n k } ∞ k=0 is a valid filtration, i.e. F n k ⊆F n k+1 , ∀k≥ 0. Further- 29 more, for any real-valued adapted process{Z n [t− 1]} ∞ t=1 with respect to{F[t]} ∞ t=1 , 3 n G t n k (Z n [0], Z n [1], ··· ,Z n [t n k − 1]) o ∞ k=1 is also adapted to{F n k } ∞ k=1 , where for anyt∈N,G t (·) is a fixed real-valued measurable mappings. That is, for any k, it holds that the value of any measurable function of (Z n [0],··· ,Z[t n k − 1]) is determined by events inF n k . 
With Lemma 5.5.4 and Lemma 2.3.5, we can construct a supermartingale as follows, Lemma 2.3.6. Consider the stochastic processes{y n [t]} ∞ t=0 ,{z n [t]} ∞ t=0 , and{T n k } ∞ k=0 resulting from the proposed algorithm. For any (f n , g n )∈P n , let X n [t] :=V y n [t]−f n +hQ[t], z n [t]− g n i, (2.15) then, E X t∈T n k X n [t] F n k ≤Lz max (Nz max +d max )B :=C 0 , where B, z max and d max are as defined in Assumption 5.2.1. Furthermore, define a real-valued process{Y n K } ∞ K=0 on the frame such that Y n 0 = 0 and Y n K = K−1 X k=0 X t∈T n k X n [t]−C 0 , K≥ 1. Then,{Y n K } ∞ K=0 is a supermartingale adapted to the aforementioned filtration{F n k } ∞ K=0 . Remark 2.3.2. Note that in the above lemma the quantity X n [t] is the term we aim to bound in (2.13). Having{Y n K } ∞ K=0 being a supermartingale implies E(Y n K )≤ 0, ∀K. This implies E t n K −1 X τ=0 X n [τ] ≤C 0 K≤C 0 t n K . Thus, this lemma proves (2.13) is true when T is taken to be the end of any renewal frame of system n. Our goal in the next section is to get rid of this restriction and finish the proof via a stopping time argument. 3 Meaning that for each t in{1, 2, 3,···}, the random variable Z n [t− 1] is determined by events inF[t]. 30 Proof of Lemma 2.3.6. Consider any t∈T n k , then, we can decompose X n [t] as follows X n [t] =V (y n [t]−f n ) +hQ[t n k ], z n [t]− g n i +hQ[t]− Q[t n k ], z n [t]− g n i. (2.16) By the queue updating rule (5.5), we have for any l∈{1, 2,··· ,L} and any t>t n k , |Q l [t]−Q l [t n k ]|≤ t−1 X s=t n k N X m=1 z m l [s]−d l [t] ≤ (t−t n k )(Nz max +d max ) (2.17) Thus, for the last term in (2.16), by H¨ older’s inequality, hQ[t]− Q[t n k ], z n [t]− g n i≤kQ[t]− Q[t n k ]k 1 ·kz n [t]− g n k ∞ ≤ t−1 X s=t n k N X m=1 z n [s]− d[t] 1 ·kz n [t]− g n k ∞ ≤(t−t n k )L(Nz max +d max )· 2z max , where the second inequality follows from (2.17) and the last inequality follows from the bounded- ness assumption (Assumption 5.2.1) of corresponding quantities. Substituting the above bound into (2.16) gives a bound onE P t∈T n k X n [t] F n k as E X t∈T n k X n [t] F n k ≤E X t∈T n k V y n [t]−f n +hQ[t n k ], z n [t]− g n i F n k +E X t∈T n k (t−t n k ) F n k · 2L(Nz max +d max )z max ≤E X t∈T n k V y n [t]−f n +hQ[t n k ], z n [t]− g n i F n k +E (T n k ) 2 F n k ·L(Nz max +d max )z max , (2.18) where we use the fact that 0 + 1 +··· +T n k − 1 = (T n k − 1)T n k /2≤ (T n k ) 2 in the last inequality. Next, by the queue updating rule (5.5), Q l [t n k ] is determined by z n l [0],··· ,z n l [t n k − 1] (n = 1, 2,··· ,N) and d l [0],··· ,d l [t n k − 1] for any l∈{1, 2,··· ,L}. Thus, by Lemma 2.3.5, Q[t n k ] is determined byF n k . For the proposed algorithm, each system makes decisions purely based on the virtual queue state Q[t n k ], and by the renewal property of each system, given the decision 31 at the k-th renewal, the random quantities T n k , z n [t] and y n [t], t∈T n k are independent of the outcomes from the slots before t n k . This implies the following display, E X t∈T n k V y n [t]−f n +hQ[t n k ], z n [t]− g n i F n k =E X t∈T n k V y n [t]−f n F n k + * Q[t n k ],E X t∈T n k (z n [t]− g n ) F n k + =E X t∈T n k V y n [t]−f n Q[t n k ] + * Q[t n k ],E X t∈T n k (z n [t]− g n ) Q[t n k ] + =E X t∈T n k V y n [t]−f n +hQ[t n k ], z n [t]− g n i Q[t n k ] , (2.19) By Lemma 5.5.4, we have the following: E X t∈T n k (Vy n [t] +hQ[t n k ], z n [t]i) Q[t n k ] ≤ Vf n +hQ[t n k ], g n i ·E(T n k |Q[t n k ]). 
Thus, rearranging terms in above inequality gives the expectation on the right hand side of (2.19) is no greater than 0 and hence the first expectation on the right hand side of (2.18) is also no greater than 0. For the second expectation in (2.18), using (2.1) in Assumption 5.2.1 gives E (T n k ) 2 F n k ≤B and the first part of the lemma is proved. For the second part of the lemma, by Lemma 2.3.5 and the definition of Y n K , the process {Y n K } ∞ K=0 is adapted to{F n k } ∞ K=0 . Moreover, by Assumption 5.2.1, E X t∈T n k X n [t] ≤E X t∈T n k |X n [t]| <∞, ∀k. Thus, E(|Y n K |) <∞, ∀K∈N, i.e. it is absolutely integrable. Furthermore, by the first part of the lemma, E Y n K+1 |F n k =Y n K +E X t∈T n K X n [t]−C 0 F n k ≤Y n K , finishing the proof. 32 2.3.4 Synchronization lemma So far, we have analyzed the processes related to each individual system over its renewal frames. However, due the asynchronous behavior of different systems, the supermartingales of each system cannot be immediately summed. In order to prove the result (2.13) and get a global performance bound, we have to get rid of any index related to individual renewal frames only. In other words, we need to look at the system property at any time slot T as opposed to any renewal t n k . For any fixed slotT > 0, letS n [T ] be the number of renewals up to (and including) time slot T , with the convention that the first renewal occurs at time t = 0, so t n 0 = 0 and S n [0] = 1, i.e. t n 0 = 0. The next lemma shows S n [T ] is a valid stopping time, whose proof is in the appendix. Lemma 2.3.7. For each n∈{1, 2,··· ,N}, the random variable S n [T ] is a stopping time with respect to the filtration{F n k } ∞ k=0 , i.e.{S n [T ] =k}∈F n k , ∀k∈N. The following theorem tells us a stopping-time truncated supermartingale is still a super- martingale. Theorem 2.3.2 (Theorem 5.2.6 in [Dur13]). Ifτ is a stopping time andZ[i] is a supermartingale with respect to{F i } ∞ i=0 , then Z[i∧τ] is also a supermartingale, where a∧b, min{a,b}. With this theorem and the above stopping time construction, we have the following lemma which finishes the argument proving (2.13): Lemma 2.3.8. For each n∈{1, 2,··· ,N} and any fixed T∈N, we have 1 T T−1 X t=0 E(X n [t])≤C 1 + C 2 V T , where X n [t] is defined in (2.16) and C 1 := 6Lz max (Nz max +d max )B, C 2 := 2y max √ B. Proof. First, note that the renewal index k starts from 0. Thus, for any fixed T∈N, t n S n [T]−1 ≤ 33 T <t n S n [T] , and E T−1 X t=0 X n [t] ! =E t n S n [T] −1 X t=0 X n [t]− t n S n [T] −1 X t=T X n [t] =E t n S n [T] −1 X t=1 X n [t] −E t n S n [T] −1 X t=T X n [t] =E Y n S n [T] +C 0 E(S n [T ])−E t n S n [T] −1 X t=T X n [t] ≤E Y n S n [T] +C 0 (T + 1)−E t n S n [T] −1 X t=T X n [t] , (2.20) where the third equality follows from the definition ofY n K in Lemma 2.3.6 and the last inequality follows from the fact that the number of renewals up to time slot T is no more than the total number of slots, i.e. S n [T ]≤ T + 1. For the term E Y n S n [T] , we apply Theorem 2.3.2 with τ =S n [T ] and index K to obtain{Y n K∧S n [T] } ∞ K=0 is a supermartingale. This implies E Y n K∧S n [T] ≤E Y n 0∧S n [T] =E(Y n 0 ) = 0, ∀K∈N. Since S n [T ]≤T + 1, it follows by substituting K =T + 1, E Y n S n [T] =E Y n (T+1)∧S n [T] ≤ 0. 
For the last term in (2.20), by queue updating rule (5.5), for any l∈{1, 2,··· ,L}, |Q l [t]|≤ t−1 X s=0 N X m=1 z m l [s]−d l [t] ≤t(Nz max +d max ), 34 it then follows from H¨ older’s inequality again that E t n S n [T] −1 X t=T X n [t] =E t n S n [T] −1 X t=T V (y n [t]−f n ) +hQ[t], z n [t]− g n i ≤E t n S n [T] −1 X t=T V y n [t]−f n +kQ[t]k 1 ·kz n [t]− g n k ∞ ≤E t n S n [T] −1 X t=T (2Vy max +L(Nz max +d max )t· 2z max ) =2Vy max ·E t n S n [T] −T +Lz max (Nz max +d max ) · (2T− 1)·E t n S n [T] −T +E t n S n [T] −T 2 ≤2Vy max √ B + 2Lz max (Nz max +d max ) √ BT +Lz max (Nz max +d max )B ≤2Vy max √ B + 2Lz max (Nz max +d max )B(T + 1), where in the second from last inequality we use (2.1) of Assumption 5.2.1 that the residual life t n S n [T] −T satisfies E (t n S n [T] −T ) 2 =E E (t n S n [T] −T ) 2 t n S n [T] −t n S n [T]−1 ≥T−t n S n [T]−1 ≤B andE t n S n [T] −T ≤ √ B, and in the last inequality we use the fact that B≥ 1, thus, √ B≤B. Substitute the above bound into (2.20) gives E T−1 X t=0 X n [t] ! ≤C 0 (T + 1) + 2Vy max B + 2Lz max (Nz max +d max )B(T + 1) =2Vy max √ B + 3Lz max (Nz max +d max )B(T + 1) ≤2Vy max √ B + 6Lz max (z max +d max )BT where we use the definition C 0 = Lz max (z max +d max )B from Lemma 2.3.6 in the equality and use T + 1≤ 2T in the final equality. Dividing both sides by T finishes the proof. 35 2.4 Convergence Time Analysis 2.4.1 Lagrange Multipliers Consider the following optimization problem: min N X n=1 f n (2.21) s.t. N X n=1 g n l ≤d l , ∀l∈{1, 2,··· ,L}, (2.22) (f n , g n )∈P n , ∀n∈{1, 2,··· ,N}. (2.23) SinceP n is convex, it followsP n is convex and⊗ N n=1 P n is also convex. Thus, (2.21)-(2.23) is a convex program. Furthermore, by Lemma 2.3.2, we have (2.21)-(2.23) is feasible if and only if (5.1)-(1.17) is feasible, and when assuming feasibility, they have the same optimality f ∗ as is specified in Lemma 2.3.2. SinceP n is convex, one can show (see Proposition 5.1.1 of [Ber09a]) that there always exists a sequence (γ 0 ,γ 1 ,··· ,γ L ) so that γ i ≥ 0, i = 0, 1,··· ,L and N X n=1 γ 0 f n + L X l=1 γ l N X n=1 g n l ≥γ 0 f ∗ + L X l=1 γ l d l , ∀(f n , g n )∈P n , i.e. there always exists a hyperplane parametrized by (γ 0 ,γ 1 ,··· ,γ L ), supported at (f ∗ ,d 1 ,··· ,d L ) and containing the set n P N n=1 f n , P N n=1 g n : (f n , g n )∈P n , ∀n∈{1, 2,··· ,N} o on one side. This hyperplane is called “separating hyperplane” . The following assumption stems from this property and simply assumes this separating hyperplane to be non-vertical (i.e. γ 0 > 0): Assumption 2.4.1. There exists non-negative finite constants γ 1 , γ 2 , ··· , γ L such that the following holds, N X n=1 f n + L X l=1 γ l N X n=1 g n l ≥f ∗ + L X l=1 γ l d l , ∀(f n , g n )∈P n , i.e. there exists a separating hyperplane parametrized by (1,γ 1 ,··· ,γ L ). Remark 2.4.1. The parametersγ 1 , ··· , γ L are called Lagrange multipliers and this assumption 36 is equivalent to the existence of Lagrange multipliers for constrained convex program (2.21)- (2.23). It is known that Lagrange multipliers exist if the Slater’s condition holds ([Ber09a]), which states that there exists a nonempty interior of the feasible region for the convex program. Slater’s condition is very common in convex optimization theory and plays an important role in convergence rate analysis, such as the analysis of the interior point algorithm ([BV04]). 
In the current context, this condition is satisfied, for example, in energy aware server scheduling problems, if the highest possible sum of service rates from all servers is strictly higher than the arrival rate. Lemma 2.4.1. Suppose{y n [t]} ∞ t=0 , {z n [t]} ∞ t=0 and{T n k } ∞ k=0 are processes resulting from the proposed algorithm. Under the Assumption 2.4.1, 1 T T−1 X t=0 f ∗ − N X n=1 E(y n [t]) ! ≤ 1 T T−1 X t=0 L X l=1 γ l N X n=1 E(z n l [t])−d l ! + C 4 T , where C 4 =B 1 N +B 2 N P L l=1 γ l , and B 1 , B 2 are defined in Lemma 2.3.3. Proof. First of all, from the statement of Lemma 2.3.3, for the proposed algorithm, we can define the corresponding processes (f n [t], g n [t]) for all n as f n [t] = b f n (α n ) =b y n (α n )/ b T n (α n ), if t∈T n k ,α n k =α n g n [t] =b g n (α n ) =b z n (α n )/ b T n (α n ), if t∈T n k ,α n k =α n , where the last equality follows from the definition of b f n (α n ) andb g n (α n ) in Definition 2.1.1. Since b y n (α n ), b z n (α n ), b T n (α n ) ∈S n , by definition ofP n in Definition 2.1.2, (f n [t], g n [t])∈P n ⊆ P n , ∀n, ∀t. SinceP n is a convex set by Lemma 2.3.1, it follows (E(f n [t]), E(g n [t]))∈P n , ∀t, ∀n. By Assumption 2.4.1, we have N X n=1 E(f n [t]) + L X l=1 γ l N X n=1 E(g n l [t])≥f ∗ + L X l=1 γ l d l , ∀t. 37 Rearranging terms gives f ∗ − N X n=1 E(f n [t])≤ L X l=1 γ l N X n=1 E(g n l [t])−d l ! , ∀t. Taking the time average from 0 to T− 1 gives 1 T T−1 X t=0 f ∗ − N X n=1 E(f n [t]) ! ≤ 1 T T−1 X t=0 L X l=1 γ l N X n=1 E(g n l [t])−d l ! . (2.24) For the left hand side of (2.24), we have l.h.s. = 1 T T−1 X t=0 f ∗ − N X n=1 E(y n [t]) ! + 1 T T−1 X t=0 N X n=1 E(y n [t]−f n [t]) ≥ 1 T T−1 X t=0 f ∗ − N X n=1 E(y n [t]) ! − B 1 N T . (2.25) where the inequality follows from (2.6) in Lemma 2.3.3. For the right hand side of (2.24), we have r.h.s. = 1 T T−1 X t=0 L X l=1 γ l N X n=1 E(z n l [t])−d l ! + 1 T T−1 X t=0 L X l=1 γ l N X n=1 E(g n l [t]−z n l [t]) ≤ 1 T T−1 X t=0 L X l=1 γ l N X n=1 E(z n l [t])−d l ! + B 2 N P L l=1 γ l T , (2.26) where the inequality follows from the fact thatγ l ≥ 0,∀l and (2.7) in Lemma 2.3.3. Substituting (2.25) and (2.26) into (2.24) finishes the proof. 2.4.2 Convergence time theorem Theorem 2.4.1. Fix ε∈ (0, 1) and define V = 1/ε. If the problem (5.1)-(1.17) is feasible and the Assumption 2.4.1 holds, then, for all T≥ 1/ε 2 , 1 T T−1 X t=0 N X n=1 E(y n [t])≤f ∗ +O(ε), (2.27) 1 T T−1 X t=0 N X n=1 E(z n l [t])≤d l +O(ε),l∈{1, 2,··· ,L}. (2.28) Thus, the algorithm providesO(ε) approximation with the convergence timeO(1/ε 2 ). 38 Proof. First of all, by queue updating rule (5.5), T−1 X t=0 N X n=1 E(z n l [t])−d l ! ≤E(Q l [T ]). (2.29) By Lemma 2.4.1, we have 1 T T−1 X t=0 f ∗ − N X n=1 E(y n [t]) ! ≤ 1 T T−1 X t=0 L X l=1 γ l N X n=1 E(z n l [t])−d l ! + C 4 T , ≤ L X l=1 γ l T E(Q l [T ]) + C 4 T . (2.30) Combining this with (2.10) gives 1 2T E kQ[T ]k 2 ≤NC 1 +C 3 + V T T−1 X t=0 f ∗ − N X n=1 E(y n [t]) ! + NC 2 V T ≤NC 1 +C 3 + (NC 2 +C 4 )V T +V L X l=1 γ l T E(Q l [T ]) ≤NC 1 +C 3 + (NC 2 +C 4 )V T + V T kγk·kE(Q[T ])k, (2.31) where γ := (γ 1 , ··· , γ L ), the second inequality follows from (2.30) and the final inequality follows from Cauchy-Schwarz. Then, by Jensen’s inequality, we have kE(Q[T ])k 2 ≤E kQ[T ]k 2 . Thus, it follows by (2.31) that kE(Q[T ])k 2 − 2Vkγk·kE(Q[T ])k− 2(NC 1 +C 3 )T− 2(NC 2 +C 4 )V ≤ 0. 
The left hand side is a quadratic form onkE(Q[T ])k, and the inequality implies thatkE(Q[T ])k is deterministically upper bounded by the largest root of the equation x 2 −bx−c = 0 with 39 b = 2Vkγk and c = 2(NC 1 +C 3 )T + 2(NC 2 +C 4 )V . Thus, kE(Q[T ])k≤ b + √ b 2 + 4c 2 =Vkγk + p V 2 kγk 2 + 2(NC 1 +C 3 )T + 2(NC 2 +C 4 )V ≤2Vkγk + p 2(NC 1 +C 3 )T + p 2(NC 2 +C 4 )V. Thus, for any l∈{1, 2,··· ,L}, 1 T E(Q l [T ])≤ 2Vkγk T + r 2(NC 1 +C 3 ) T + p 2(NC 2 +C 4 )V T . By (2.29) again, 1 T T−1 X t=0 N X n=1 E(z n l [t])≤d l + 2Vkγk T + r 2(NC 1 +C 3 ) T + p 2(NC 2 +C 4 )V T . Substituting V = 1/ε and T≥ 1/ε 2 into the above inequality gives∀l∈{1, 2,··· ,L}, 1 T T−1 X t=0 N X n=1 E(z n l [t])≤d l + 2kγk + p 2(NC 1 +C 3 ) ε + p 2(NC 2 +C4)ε 3/2 =d l +O(ε). Finally, substituting V = 1/ε and T≥ 1/ε 2 into (2.11) gives 1 T T−1 X t=0 N X n=1 E(y n [t])≤f ∗ +O(ε), finishing the proof. 2.5 Simulation Study in Energy-aware Scheduling Here, we apply the algorithm introduced in Section 2.2 to deal with the energy-aware schedul- ing problem described in Section 1.3. To be specific, we consider a scenario with 5 homogeneous servers and 3 different classes of jobs, i.e. N = 5 and L = 3. We assume that each server can only choose one class of jobs to serve during each frame. So the mode setM n contains three actions{1, 2, 3} and the action i stands for serving the i-th class of jobs and we count the num- 40 Table 2.1: Problem parameters λ i b H n (i) b μ n (i) b e n (i) b I n (i) Class 1 2 5.5 15 (Uniform [9, 21]∩N) 16 2.5 Class 2 3 4.6 21 (Uniform [15, 27]∩N) 20 4.3 Class 3 4 3.8 17 (Uniform [11, 23]∩N) 13 3.7 ber of serviced jobs at the end of each service duration. The action m n k determines the following quantities: • The uniformly distributed total number of class l jobs that can be served with expectation E P t∈T n k μ n l [t] m n k :=b μ n l (m n k ). • The geometrically distributed service duration H n k slots with expectation E(H n k | m n k ) := b H n (m n k ). • The energy consumptionb e n (m n k ) for serving all these jobs. • The geometrically distributed idle/setup time I n k slots with constant energy consumption p n per slot and zero job service. The expectation E(I n k | m n k ) := b I n (m n k ). The idle/setup cost is p n = 3 units per slot and the rest of the parameters are listed in Table 1. Following the algorithm description in Section 2.2, the proposed algorithm has the queue updating rule Q l [t + 1] = max ( Q l [t] +λ l [t]− N X n=1 μ n l [t], 0 ) , and each system minimizes (2.3) each frame, which can be written as min m n k ∈M n V b e n l (m n k ) +p n b I n (m n k ) −hQ[t n k ],b μ n (m n k )i b H n (m n k ) + b I n (m n k ) . Each plot for the proposed algorithm is the result of running 1 million slots and taking the time average as the performance of the proposed algorithm. The benchmark is the optimal stationary performance obtained by performing a change of variable and solving a linear program, knowing the arrival rates (see also [Nee12b] for details). Fig. 5.3 shows as the trade-off parameterV gets larger, the time average energy consumptions under the proposed algorithm approaches the optimal energy consumption. Fig. 5.4 shows as V gets large, the time average number of services also approaches the optimal service rate for each 41 class of jobs. In Fig. 5.5, we plot the time average queue backlog for each class of jobs verses V parameter. We see that the queue backlog for the first class is always low whereas the rest queue backlogs scale up linearly with V . 
This is because the service rate for the first class is always strictly larger than the arrival rate whereas for the rest classes, asV gets larger, the service rates approach the arrival rates. This plot, together with Fig. 5.3, also demonstrate that V is indeed a trade-off parameter which trades queue backlog for near optimality. Figure 2.1: Time average energy consumption verses V parameter over 1 millon slots. 2.6 Additional lemmas and proofs. 2.6.1 Proof of Lemma 2.3.1 Proof. We first prove the convexity ofP n . Consider any two points (f 1 , g 1 ), (f 2 , g 2 )∈P n . We aim to show that for any q∈ (0, 1), (qf 1 + (1−q)f 2 ,qg 1 + (1−q)g 2 )∈P n . Notice that by definition ofP n , there exists (y 1 , z 1 ,T 1 ), (y 2 , z 2 ,T 2 )∈S n such that f 1 = y 1 /T 1 , g 1 = z 1 /T 1 , f 2 =y 2 /T 2 , and g 2 = z 2 /T 2 . Thus, it is enough to show q y 1 T 1 + (1−q) y 2 T 2 ,q z 1 T 1 + (1−q) z 2 T 2 ∈P n . (2.32) 42 Figure 2.2: Time average services verses V parameter over 1 millon slots. Figure 2.3: Time average queue size verses V parameter over 1 million slots. 43 To show this, we make a change of variable by letting p = qT2 (1−q)T1+qT2 . It is obvious that p∈ (0, 1). Furthermore, q = pT1 pT1+(1−p)T2 and q y 1 T 1 + (1−q) y 2 T 2 = py 1 + (1−p)y 2 pT 1 + (1−p)T 2 , q z 1 T 1 + (1−q) z 2 T 2 = pz 1 + (1−p)z 2 pT 1 + (1−p)T 2 . SinceS n is convex, (py 1 + (1−p)y 2 , pz 1 + (1−p)z 2 , pT 1 + (1−p)T 2 )∈S n . Thus, by definition ofP n again, (2.32) holds and the first part of the proof is finished. To show the second part of the claim, let Q n := n b f n (α n ), b g n (α n ) :α n ∈A n o = n b y n (α n ) . b T n (α n ), b z n (α n ) . b T n (α n ) :α n ∈A n o and let conv(Q n ) be the convex hull ofQ n . First of all, By Definition 2.1.2, P n ={(y/T, z/T ) : (y, z,T )∈S n }⊆R L+1 , forS n being the convex hull of n b y n (α n ), b z n (α n ), b T n (α n ) : α n ∈A n o , thus, in view of the def- inition ofQ n , we haveQ n ⊆P n . Since bothP n and conv(Q n ) are convex, by definition of convex hull ([Roc15]) that conv(Q n ) is the smallest convex set containingQ n , we have conv(Q n )⊆P n . To show the reverse inclusionP n ⊆ conv(Q n ), note that any point inP n can be written in the form y T , z T , where (y, z,T )∈S n . SinceS n by definition is the convex hull of n b y n (α n ), b z n (α n ), b T n (α n ) : α n ∈A n o ⊆R L+2 , by the definition of convex hull, (y, z,T ) can be written as a convex combination of m points in 44 the above set. Let n b y n (α n i ), b z n (α n i ), b T n (α n i ) o m i=1 be these points, so that (y, z,T ) = m X i=1 p i · b y n (α n i ), b z n (α n i ), b T n (α n i ) , p i ≥ 0, m X i=1 p i = 1. As a result, we have y T , z T = P m i=1 p i y n (α n i ) P m i=1 p i T n (α n i ) , P m i=1 p i z n (α n i ) P m i=1 p i T n (α n i ) . We make a change of variable by letting q j = pjT n (α n j ) P m i=1 piT n (α n i ) , ∀j = 1, 2,··· ,m, then, p j = q j T n (α n j ) · m X i=1 p i T n (α n i ), it follows, y T , z T = m X i=1 q i · y n (α n i ) T n (α n i ) , z n (α n i ) T n (α n i ) = m X i=1 q i · b f n (α n i ), b g n (α n i ) . Since P m i=1 q i = 1 andq i ≥ 0, it follows any point inP n can be written as a convex combination of finite number of points inQ n , which impliesP n ⊆ conv(Q n ). Overall, we haveP n = conv(Q n ). Finally, by Assumption 2.1.3, we haveQ n = n b f n (α n ), b g n (α n ) :α n ∈A n o is compact. Thus,P n , being a convex hull of a compact set, is also compact. 2.6.2 Proof of Lemma 2.3.3 Proof. We prove bound (2.6) ((2.7) is proved similarly). 
By definition of b f n (α n ) in Definition 2.1.1, we have for any α n ∈A n , b f n (α n ) = E P t∈T n k y n [t] α n k =α n E(T n k | α n k =α n ) , thus, E X t∈T n k b f n (α n k )−y n [t] α n k =α n = 0. 45 By the renewal property of the system, given α n k =α n , T n k and P t∈T n k y n [t] are independent of the past information before t n k . Thus, the same equality holds if conditioning also onF n k , i.e. E X t∈T n k b f n (α n k )−y n [t] α n k =α n , F n k = 0. Hence, E X t∈T n k b f n (α n k )−y n [t] F n k = 0. By the definition of f n [t], this further implies that E X t∈T n k (f n [t]−y n [t]) F n k = 0. Since|y n [t]|≤y max andE(T n k )≤ √ B, it followsE P t∈T n k (f n [t]−y n [t]) <∞ and the process {F n K } ∞ K=0 defined as F n K = K−1 X k=0 X t∈T n k (f n [t]−y n [t]), K≥ 1, F n 0 = 0 is a martingale. Consider any fixed T ∈ N and define S n [T ] as the number of renewals up to T . Lemma 2.3.7 shows S n [T ] is a valid stopping time with respect to the filtration{F n k } ∞ k=0 . Furthermore, {F n K∧S n [T] } ∞ K=0 is a supermartingale by Theorem 2.3.2, where a∧b := min{a,b}. For this fixed T , we have E T−1 X t=0 (f n [t]−y n [t]) ! =E t n S n [T] −1 X t=0 (f n [t]−y n [t]) −E t n S n [T] −1 X t=T (f n [t]−y n [t]) =E F n S n [T] −E t n S n [T] −1 X t=T (f n [t]−y n [t]) . Since the number of renewals is always bounded by the number of slots at any time, i.e. S n [T ]≤ T + 1, it follows E F n S n [T] =E F n (T+1)∧S n [T] ≤ 0. 46 On the other hand, E t n S n [T] −1 X t=T (f n [t]−y n [t]) ≤E t n S n [T] −T · 2y max ≤ 2y max √ B. where the last inequality follows from Assumption 5.2.1 for the residual life time. Thus, E T−1 X t=0 (f n [t]−y n [t]) ! ≤ 2y max √ B. Dividing both sides by T finishes the proof. 2.6.3 Proof of Lemma 2.3.5 Proof. Recall that t n k is the time slot where the k-th renewal occurs (k = 0, 1, 2,··· ), then, it follows from the definition of stopping time ([Dur13]) that{t n k } ∞ k=0 is a sequence of stopping times with respect to{F[t]} ∞ t=0 satisfying t n k <t n k+1 , ∀k. Thus, by definition ofF n k , for any set A∈F n k , A∩{t n k+1 ≤t} =A∩{t n k ≤t}∩{t n k+1 ≤t}∈F[t]. Thus,A∈F n k+1 , which impliesF n k ⊆F n k+1 ,∀k, and{F n k } ∞ k=0 is indeed a filtration. This finishes the first part of the proof. Next, we would like to show that G t n k (Z n 0 ,··· ,Z n [t n k − 1]) is measurable with respect to F n k , ∀k≥ 1, i.e. n G t n k (Z n 0 ,··· ,Z n [t n k − 1])∈B o ∈F n k , for any Borel set B⊆R. By definition ofF n k , this is equivalent to showing{G t n k (Z n 0 ,··· ,Z n [t n k − 1])∈ B}∩{t n k ≤ s}∈F[s] for any slot s≥ 0. For s = 0, this is obvious because{t n k ≤ 0} =∅, ∀k≥ 1. Consider any s≥ 1, n G t n k (Z n 0 ,··· ,Z n [t n k − 1])∈B o ∩{t n k ≤s} = s [ i=1 {G i (Z n 0 ,··· ,Z n [i− 1])∈B} \ {t n k =i} = s [ i=1 (Z n 0 ,··· ,Z n [i− 1])∈G −1 i (B) \ {t n k =i} ∈F[s], ∀k≥ 1, where the last step follows from the assumption that the random variableZ n [t−1] is measurable with respect toF[t] for any t > 0 and t n k is a stopping time with respect to{F[t]} ∞ t=0 for all k≥ 1. This gives the second part of the claim. 47 2.6.4 Proof of Lemma 2.3.7 Proof. We aim to prove{S n [T ] = k}∈F n k , ∀k∈ N. First of all, recall that the index of the renewal starts fromk = 0 andt n 0 = 0, thus, for anyk∈N,{S n [T ] =k} ={t n k >T}∩{t n k−1 ≤T}, and any t∈N, {S n [T ] =k}∩{t n k ≤t} ={t n k >T}∩{t n k−1 ≤T}∩{t n k ≤t}. (2.33) Consider two cases as follows: 1. t≤T . In this case, the set (2.33) is empty and obviously belongs toF[t]. 2. t > T . 
In this case, we have{t n k > T}∩{t n k ≤ t} ={T < t n k ≤ t}∈F[t] as well as {t n k−1 ≤T}∈F[T ]⊆F[t]. Thus, the set (2.33) belongs toF[t]. Overall, we have{S n [T ] =k}∩{t n k ≤t}∈F[t], ∀t∈N. Thus,{S n [T ] =k}∈F n k and S n [T ] is indeed a valid stopping time with respect to the filtration{F n k } ∞ k=0 . 2.6.5 Proof of Lemma 2.3.2 Proof. To prove the first part of the claim, we define the following notation: N M n=1 P n := ( N X n=1 p n , p n ∈P n , ∀n ) is the Minkowski sum of setsP n , n∈{1, 2,··· ,N}, and for any sequence{x[t]} ∞ t=0 taking values inR d , define lim sup T→∞ x[T ] := lim sup T→∞ x 1 [T ], ··· , lim sup T→∞ x d [T ] is a vector of lim sups. By definition, any vector in⊕ N n=1 P n can be constructed from⊗ N n=1 P n , thus, it is enough to show that there exists a vector r ∗ ∈⊕ N n=1 P n such thatr ∗ 0 =f ∗ and the rest of the entries r ∗ l ≤d l , l = 1, 2,··· ,L. By the feasibility assumption for (5.1)-(1.17), we can consider any algorithm that achieves the optimality of (5.1)-(1.17) and the corresponding process{(f n [t], g n [t])} ∞ t=0 defined in Lemma 2.3.3 for any system n. Notice that (f n [t], g n [t])∈P n , ∀n, ∀t. This follows from the definition 48 of b f n (α n ) andb g n (α n ) in Definition 2.1.1 that f n [t] = b f n (α n ) =b y n (α n )/ b T n (α n ), if t∈T n k ,α n k =α n g n [t] =b g n (α n ) =b z n (α n )/ b T n (α n ), if t∈T n k ,α n k =α n , and b y n (α n ), b z n (α n ), b T n (α n ) ∈S n . By definition ofP n in Definition 2.1.2, (f n [t], g n [t])∈ P n , ∀n, ∀t. SinceP n is convex by Lemma 2.3.1, it follows that (E(f n [t]),E(g n [t]))∈P n ,∀n,∀t. Hence, 1 T T−1 X t=1 E(f n [t]), 1 T T−1 X t=1 E(g n [t]) ! ∈P n , ∀T,∀n. This further implies that r(T ) := 1 T T−1 X t=1 N X n=1 E(f n [t]), 1 T T−1 X t=1 N X n=1 E(g n [t]) ! ∈ N M n=1 P n . By Lemma 2.3.1,P n is compact in R L+1 . Thus,⊕ N n=1 P n is also compact. This implies that the sequence{r(T )} ∞ T=1 has at least one limit point, and any such limit point is contained in ⊕ N n=1 P n . We consider a specific limit point of{r(T )} ∞ T=1 denoted as r ∗ ∈⊕ N n=1 P n , with the first entry denoted as r ∗ 0 satisfying r ∗ 0 = lim sup T→∞ 1 T T−1 X t=0 N X n=1 E(f n [t]). Then, we have the rest of the entries of r ∗ must satisfy r ∗ l ≤ lim sup T→∞ 1 T T−1 X t=0 N X n=1 E(g n [t]), ∀l∈{1, 2,··· ,L}. Now, by Lemma 2.3.3, we can connect the lim sup with respect to f n [t] and g n [t] to that ofy n [t] 49 and z n [t] as follows: lim sup T→∞ 1 T T−1 X t=0 N X n=1 E(y n [t]) = lim sup T→∞ 1 T T−1 X t=0 N X n=1 (E(y n [t]−f n [t]) +E(f n [t])) = lim T→∞ 1 T T−1 X t=0 N X n=1 E(y n [t]−f n [t]) + lim sup T→∞ 1 T T−1 X t=0 N X n=1 E(f n [t]) = lim sup T→∞ 1 T T−1 X t=0 N X n=1 E(f n [t]). Similarly, we can show that lim sup T→∞ 1 T T−1 X t=0 N X n=1 E(z n [t]) = lim sup T→∞ 1 T T−1 X t=0 N X n=1 E(g n [t]). Thus, by our preceeding assumption that the algorithm under consideration achieves the opti- mality of (5.1)-(1.17), we have r ∗ 0 = lim sup T→∞ 1 T T−1 X t=0 N X n=1 E(y n [t]) =f ∗ r ∗ l ≤ lim sup T→∞ 1 T T−1 X t=0 N X n=1 E(z n l [t])≤d l , ∀i∈{1, 2,··· ,L}. Overall, we have shown that r ∗ ∈⊕ N n=1 P n achieves the optimality of (5.1)-(1.17), and the first part of the lemma is proved. To prove the second part of the lemma, we show that any point in⊗ N n=1 P n is achievable by the corresponding time averages of some algorithm. 
Specifically, consider the following class of ran- domized stationary algorithms: For each systemn, at the beginning ofk-th frame, the controller independently chooses an action α n k from the setA n with a fixed probability distribution. Thus, the actions{α n k } ∞ k=0 result from any randomized stationary algorithm is i.i.d.. By the renewal property of each system, we have X t∈T n k y n [t], X t∈T n k z n [t], T n k ∞ k=0 , is also an i.i.d. process for each system n. 50 Next, we would like to show that any point inS n can be achieved by the corresponding expectations of some randomized stationary algorithm. Recall thatS n defined in Definition 2.1.2 is the convex hull of G n := n b y n (α n ), b z n (α n ), b T n (α n ) , α n ∈A n o ⊆R L+2 , By definition of convex hull, any point (y, z,T )∈S n , can be written as a convex combination of a finite number of points from the setG n . Let n b y n (α n i ), b z n (α n i ), b T n (α n i ) o m i=1 be these points, then, we have there exists a finite sequence{p i } m i=1 , such that (y, z,T ) = m X i=1 p i · b y n (α n i ), b z n (α n i ), b T n (α n i ) , p i ≥ 0, m X i=1 p i = 1. We can then use{p i } m i=1 to construct the following randomized stationary algorithm: At the start of each frame k, the controller independently chooses action α i ∈A n with probability p i defined above for i = 1, 2,··· ,m. Then, the one-shot expectation of this particular randomized stationary algorithm on system n satisfies E X t∈T n k y n [t] , E X t∈T n k z n [t] , E(T n k ) = m X i=1 p i · b y n (α n i ), b z n (α n i ), b T n (α n i ) = (y, z,T ), which implies any point inS n can be achieved by the corresponding expectations of a randomized stationary algorithm. Next, by definition ofP n in Definition 2.1.2, any (f n , g n )∈P n can be written as (f n , g n ) = (y/T, z/T ), where (y, z,T )∈S n . Thus, it is achievable by the ratio of one-shot expectations from a randomized stationary algorithm, i.e. E P t∈T n k y n [t] E(T n k ) = y T =f n , E P t∈T n k z n [t] E(T n k ) = z T = g n . 51 Now we claim that for y n [t], z n [t] and T n k result from the randomized stationary algorithm, lim T→∞ 1 T T−1 X t=0 E(y n [t]) = E P t∈T n k y n [t] E(T n k ) , (2.34) lim T→∞ 1 T T−1 X t=0 E(z n [t]) = E P t∈T n k z n [t] E(T n k ) . (2.35) We prove (2.34) and (2.35) is shown in a similar way. Consider any fixed T , and let S n [T ] be the number of renewals up to (and including) time T . Then, from Lemma 2.3.7 in Section 2.3, S n [T ] is a valid stopping time with respect to the filtration{F n k } ∞ k=0 . We write 1 T T−1 X t=0 E(y n [t]) = 1 T E S n [T] X k=0 X t∈T n k y n [t] − 1 T E t n S n [T] −1 X t=T y n [t] . (2.36) For the first part on the right hand side of (2.36), since n P t∈T n k y n [t] o ∞ k=0 is an i.i.d. process, by Wald’s equality (Theorem 4.1.5 of [Dur13]), 1 T E S n [T] X k=0 X t∈T n k y n [t] =E X t∈T n k y n [t] · E(S n [T ]) T . By renewal reward theorem (Theorem 4.4.2 of [Dur13]), lim T→∞ E(S n [T ]) T = 1 E(T n k ) . Thus, lim T→∞ 1 T E S n [T] X k=0 X t∈T n k y n [t] = E P t∈T n k y n [t] E(T n k ) . For the second part on the right hand side of (2.36), by Assumption 5.2.1, E t n S n [T] −1 X t=T y n [t] ≤y max ·E t n S n [T] −T ≤ √ By max , which implies lim T→∞ 1 T E Pt n S n [T] −1 t=T y n [t] = 0. Overall, we have (2.34) holds. 
To this point, we have shown that for any (f n , g n )∈P n , n∈{1, 2,··· ,N}, there exists a 52 randomized stationary algorithm so that lim T→∞ 1 T T−1 X t=0 E(y n [t]) =f n , lim T→∞ 1 T T−1 X t=0 E(z n [t]) = g n , for any n∈{1, 2,··· ,N}. Since f ∗ is the optimal solution to (5.1)-(1.17) over all algorithms, it follows for any (f n , g n )∈P n , n∈{1, 2,··· ,N} satisfying P N n=1 g n l ≤d l , ∀l∈{1, 2,··· ,L}, we have P N n=1 f n ≥f ∗ , and the second part of the lemma is proved. 53 Chapter 3 Data Center Server Provision via Theory of Coupled Re- newal Systems The previous chapter introduces a new algorithm and analysis framework for coupled parallel renewal systems. In this chapter, we show that the previous algorithm can be applied (extended) to solve a data center power minimization problem consisting of a central controller who makes load balancing decisions per slot and parallel servers having multiple states making decisions per renewal frame. In particular, the analysis in this chapter, which is customized to the data center application, is stronger than that of previous general algorithm in the sense that we obtain a probability 1 convergence of the algorithm rather than an expected convergence. 3.1 System model and problem formulation Consider a data center that consists of a central controller andN servers that serve randomly arriving requests. The system operates in slotted time with time slots t∈{0, 1, 2,...}. Each server n∈{1,...,N} has three basic states: • Active: The server is available to serve requests. Server n incurs a cost of e n ≥ 0 on every active slot, regardless of whether or not requests are available to serve. In data center applications, such cost often represents the power consumption of each individual server. • Idle: A low cost sleep state where no requests can be served. The idle state is actually comprised of a choice of multiple sleep modes with different per-slot costs. The specific sleep mode also affects the setup time required to transition from the idle state to the active state. For the rest of the paper, we use “idle” and “sleep” exchangeably. 54 • Setup: A transition period from idle to active during which no requests can be served. The setup cost and duration depend on the preceding sleep mode. The setup duration is typically more than one slot, and can be a random variable that depends on the server n and on the preceding sleep mode. An active server can choose to transition to the idle state at any time. When it does so, it chooses the specific sleep mode to use and the amount of time to sleep. For example, deeper sleep modes can shut down more electronics and thereby save on per-slot idling costs. However, a deeper sleep incurs a longer setup time when transitioning back to the active state. Each server makes separate decisions about when to transition and what sleep mode to use. The resulting transition times for each server are asynchronous. On top of this, a central controller makes slot-wise decisions for routing requests to servers. It can also reject requests (with a certain amount of cost) if it decides they cannot be supported. The goal is to minimize the overall time average cost. This problem is challenging mainly for two reasons: First, since each setup state generates cost but serves no request, it is not clear whether or not transitioning to idle from the active state indeed saves power. It is also not clear which sleep mode the server should switch to. 
Second, if one server is currently in a setup state, it cannot make another decision until it reaches the active state (which typically takes more than one slot), whereas other active servers can make decisions during this time. Thus, this problem can be viewed as a system with coupled Markov decision processes (MDPs) making decisions asynchronously. 3.1.1 Related works Experimental work on power and delay minimization in data centers is treated in [Gan13], which proposes to turn each server ON and OFF according to the rule of anM/M/k/setup queue. The work in [UKIN10] applies Lyapunov optimization to optimize power in virtualized data centers. However, it assumes each server has negligible setup time and that ON/OFF decisions are made synchronously at each server. The works [YHS + 12], [LWAT13] focus on power-aware provisioning over a time scale large enough so that the whole data center can adjust its service capacity. Specifically, [YHS + 12] considers load balancing across geographically distributed data centers, and [LWAT13] considers provisioning over a finite time interval and introduces an online 3-approximation algorithm. 55 Prior works [HS08, MGW09, MSB + 11] consider servers with multiple hypothetical sleep states with different levels of power consumption and setup times. Although empirical evaluations in these works show significant power saving by introducing sleep states, they are restricted to the scenario where the setup time from sleep to active is on the order of milliseconds, which is not realistic for today’s data center. Realistic sleep states with setup time on the order of seconds are considered in [GHBK12], where effective heuristic algorithms are proposed and evaluated via extensive testbed simulations. However, little is known about the theoretical performance bound regarding these algorithms. 3.1.2 Front-end load balancing At each time slot t∈{0, 1, 2,...}, λ(t) new requests arrive at the system (see Fig. 3.1). We assume λ(t) takes values in a finite set Λ. Let R n (t), n∈N denote the number of requests routed into server n at time t. In addition, the system is allowed to reject requests. Let r(t) be the number of requests that are rejected on slot t, and let c(t) be the corresponding per-request cost for such rejection. Assume c(t) takes values in a finite state spaceC. The R n (t) and r(t) decision variables on slot t must be nonnegative integers that satisfy: N X n=1 R n (t) +r(t) =λ(t) N X n=1 R n (t)≤R max for a given integerR max > 0. The vector process (λ(t),c(t)) takes values in Λ×C and is assumed to be an independent and identically distributed (i.i.d.) vector over slots t∈{0, 1, 2,...} with an unknown probability mass function. Each servern maintains a request queue Q n (t) that stores the requests that are routed to it. Requests are served in a FIFO manner with queueing dynamics as follows: Q n (t + 1) = max{Q n (t) +R n (t)−μ n (t)H n (t), 0}. (3.1) whereH n (t) is an indicator variable that is 1 if servern is active on slott, and 0 else, andμ n (t) is a random variable that represents the number of requests can be served on slot t. Each queue is initialized toQ n (0) = 0. Assume that, every slot in which servern is active,μ n (t) is independent 56 Figure 3.1: Illustration of a data center structure which contains a front-end load balancer, N application servers with N request queues and a backend database (omitted here for brevity). and identically distributed with a known mean μ n . This randomness can model variation in job sizes. Assumption 3.1.1. 
The process{(λ(t),c(t))} ∞ t=0 is observable, i.e. the router can observe the (λ(t),c(t)) realization each time slott before making decisions. In contrast, the process{μ n (t)} ∞ t=0 is not observable, i.e. given that H n (t) = 1, the server n cannot observe the realization of μ n (t) until the end of slot t. Moreover, λ(t), c(t) and μ n (t) are all bounded by λ max , c max and μ max respectively. 3.1.3 Server model Each server n∈N has three types of states: active, idle, and setup (see Fig. 3.2). The idle state of each server n is further decomposed into a collection of distinct sleep modes. Each server n∈N makes decisions over its own renewal frames. Define the renewal frame for server n as the time period between successive visits to active state (with each renewal period ending in an active state). Let T n [f] denote the frame size of the f-th renewal frame for server n, for f∈{0, 1, 2,...}. Let t n f denote the start of frame f, so that T n [f] = t n f+1 −t n f . Assume that t n 0 = 0 for all n∈N , so that time slot 0 is the start of the first renewal frame (labeled frame f = 0) for all servers. For simplicity, assume all servers are “active” on slot t =−1. Thus, the slot just before each renewal frame is an active slot. Fix a server n∈N and a frame index f∈{0, 1, 2,...}. Time t n f marks the start of renewal framef. At this time, server n must decide whether to remain active or to go idle. If it remains active then the renewal frame lasts for one slot, so that T n [f] = 1. If it goes idle, it chooses an 57 Figure 3.2: Illustration of a typical renewal frame construction, whereT n [i] is the length of frame i and t (n) i is the start slot of frame i. idle mode from a finite setL n , representing the set of idle mode options. Let α n [f] represent this initial decision for server n at the start of frame f, so that: α n [f]∈{active}∪L n whereα n [f] =active means the server chooses to remain active. If the server chooses to go idle, so that α n [f]∈L n , it then chooses a variable I n [f] that represents how much time it remains idle. The decision variable I n [f] is chosen as an integer in the set{1,...,I max } for some given integer I max > 0. The consequences of these decisions are described below. • Case α n [f] = active. The frame starts at time t n f and has size T n [f] = 1. The active variable becomes H n (t n f ) = 1 and an activation cost of e n is incurred on this slot t n f . A random service variableμ n (t n f ) is generated and requests are served according to the queue update (3.1). Recall that, under Assumption 3.1.1, the value of μ n (t) is not known until the end of the slot. • Caseα n [f]∈L n . In this case, the server chooses to go idle andα n [f] represents the specific sleep mode chosen. The idle duration I n [f] is also chosen as an integer in the set [1,I max ]. After the idle duration completes, the setup duration starts and has an independent and random duration τ n [f] = ˆ τ(α n [f]), where ˆ τ(α n [f]) is an integer random variable with a known mean and variance that depends on the sleep mode α n [f]. At the end of the setup time the system goes active and serves with a random μ n (t) as before. The active variable is H n (t) = 0 for all slots t in the idle and setup times, and is 1 at the very last slot of the frame. Further: – Idle cost: Every slot t of the idle time of frame f, an idle cost of g n (t) = ˆ g n (α n [f]) is incurred (so that the idle cost depends on the sleep mode). We have g n (t) = 0 if 58 server n is not idle on slot t. 
The idle cost can be zero, but can also be a small but positive value if some electronics are still running in the sleep mode chosen. – Setup cost: Every slot t of the setup time of frame f, a cost of W n (t) = ˆ W n (α n [f]) is incurred. We have W n (t) = 0 if server n is not in a setup duration on slot t. Thus, the length of frame f for server n is: T n [f] = 1, if α n [f] =active; I n [f] +τ n [f] + 1, if α n [f]∈L n . (3.2) In summary, the costs ˆ g n (α n ), ˆ W n (α n ) and the setup time ˆ τ n (α n ) are functions ofα n ∈L n . We further make the following assumption regarding ˆ τ n (α n ): Assumption 3.1.2. For any α n ∈L n , the function ˆ τ n (α n ) is an integer random variable with known mean and variance, as well as bounded first four moments. DenoteE(τ n (α n )) =m αn and Var[τ n (α n )] =σ 2 αn . Note that this is a very mild assumption in view of the fact that the setup time of a real server is always bounded. The motivation behind emphasizing the fourth moment here instead of simply proceeding with boundedness assumption is more of theoretical interest than practical importance. Table I summarizes the parameters introduced in this section. The data center architecture is shown is Fig. 3.1. Since different servers might make different decisions, the renewal frames are not necessarily aligned. 3.1.4 Performance Objective For eachn∈N , letC,W n ,E n ,G n be the time average costs resulting from rejection, setup, service and idle, respectively. They are defined as follows: C = lim T→∞ 1 T P T−1 t=0 E(r(t)c(t)), W n = lim T→∞ 1 T P T−1 t=0 E(W n (t)),E n = lim T→∞ 1 T P T−1 t=0 E(e n H n (t)),G n = lim T→∞ 1 T P T−1 t=0 E(g n (t)). The goal is to design a joint routing and service policy so that the time average overall cost is minimized and all queues are stable, i.e. min C + N X n=1 W n +E n +G n , s.t. Q n (t) stable∀n. (3.3) 59 Table 3.1: Parameters Control parameters Control objectives R n (t) Requests routed to server n at slot t r(t) Requests rejected at slot t α n [f] The option (active/idle) server n takes in frame f I n [f] Number of slots server n stays idle in frame f Other parameters Meaning λ(t) Number of arrivals at time t c(t) Per request rejection cost at time t e n Per slot active service cost for server n T n [f] The length of frame f for server n t (n) [f] Starting slot of frame f for server n τ n [f] Setup duration in frame f μ n (t) Number of requests served on server n at time t H n (t) Server active indicator (equal to 1 if active, 0 if not) g n (t) Idle cost of server n at time t W n (t) Setup cost of server n at time t Notice that the constraint in (3.3) is not easy to work with. In order to get an optimization prob- lem one can deal with, we further define the time average request rate, rejection rate, routing rate and service rate as λ, d, R n , and μ n respectively: λ = lim T→∞ 1 T P T−1 t=0 λ(t) =E(λ(t)), r = lim T→∞ 1 T P T−1 t=0 E(r(t)),R n = lim T→∞ 1 T P T−1 t=0 E(R n (t)),μ n = lim T→∞ 1 T P T−1 t=0 E(μ n (t)H n (t)). Then, rewrite the problem (3.3) as follows min C + N X n=1 W n +E n +G n (3.4) s.t. R n ≤μ n , ∀n∈N (3.5) N X n=1 R n (t)≤R max , N X n=1 R n (t) +r(t) =λ(t)∀t (3.6) Constraint (3.5) requires the time average arrival rate to server n to be less than the time average service rate. We aim to develop an algorithm so that each server can make its own decision (without looking at the workload or service decision of any other server) and prove its near optimality. 
60 3.2 Coupled renewal optimization In this section, we show one can apply the algorithm introduced in the previous section to solve (3.4)-(3.6). But before jumping into details, we would like to discuss some intuitions behind solving this problem. As a side remark, this data center work is written and published before the general algorithm introduced in the last section, so this intuition is the origin of thesis. 3.2.1 Prelude: The original intuition First of all, from the queueing model described in the last section and Fig. 3.1, it is intuitive that an efficient algorithm would have each server make decisions regarding its own queue state Q n (t), whereas the front-end load-balancer make routing and rejection decisions slot-wise based on the global information (λ(t),c(t), Q(t)). Next, to get an idea on what exactly the decision should be, by virtue of Lyapunov optimiza- tion, one would introduce a trade-off parameter V > 0 and penalize the time average constraint (3.5) via Q(t) to solve the following slotwise optimization problem min V c(t)r(t) + N X n=1 (W n (t) +e n H n (t) +g n (t)) ! (3.7) + N X n=1 Q n (t)(R n (t)−μ n (t)) s.t. constraint (3.6), which is naturally separable regarding the load-balancing decision (r(t), R n (t)), and the service decision (W n (t), H n (t), g n (t), μ n (t)). However, because of the existence of a setup state (on which no decision could be made), the server does not have an identical decision set every slot and furthermore, the decision set itself depends on previous decisions. This poses a significant difficulty analyzing the above optimization (3.7). In order to resolve this difficulty, we try to find the smallest “identical time unit” for each individual server in lieu of slots. This motivates the notion of renewal frame in the previous section (see Fig. 3.2). Specifically, from Fig. 3.2 and the related renewal frame construction, at the starting slot of each renewal, the server faces the identical decision set (remain active or go to idle with certain slots) regardless of previous decisions. Following this idea, we modify (3.7) as follows: 61 • For the front-end load balancer, we observe (λ(t),c(t), Q(t)) and solve min Vc(t)r(t) + P N n=1 Q n (t)R n (t), s.t. (3.6), which is detailed in Section 3.2.3. • For each server, instead of per slot optimization min V (W n (t) + e n H n (t) + g n (t))− Q n (t)μ n (t), we propose to minimize the time average of this quantity per renewal frame T n [f]. 3.2.2 Coupled renewal optimization In order to apply Algorithm 2 to this scenario, we can view the admission control (which choosesr(t) andR n (t)) as one another system besidesN servers. Thus, this problem is equivalent to an asynchronous optimization over N + 1 parallel renewal systems where one of them is just a slotted system. This falls into the form of (5.1)-(1.17) when setting l =N, y n [t] =r(t)c(t) +W n (t) +e n H n (t) +g n (t), z l [t] =R l (t)−μ l (t), l∈{1, 2,··· ,N} d l [t] =0, and the control variable r(t), R n (t) are non-negative, and must satisfy the following instant constraints: N X n=1 R n (t)≤R max , N X n=1 R n (t) +r(t) =λ(t). The only difference compared to (3.4)-(3.6) is that here the decision variables r(t) and R n (t) must take values from time-varying ranges per slot and they must be chosen after observing the random variable c(t). However, since r(t) andR n (t) are updated slot-wise, this minor difference is easy to handle via our renewal optimization framework and we have the following Algorithm 3. 62 Algorithm 3. 
Algorithm 3. Fix a trade-off parameter $V > 0$, and at each time slot $t$:
• The admission controller chooses $r(t)$ and $R_n(t)$ according to
$$\min \; Vc(t)r(t) + \sum_{n=1}^N Q_n(t)R_n(t), \quad \text{s.t. } \sum_{n=1}^N R_n(t) \le R_{\max}, \; \sum_{n=1}^N R_n(t) + r(t) = \lambda(t). \quad (3.8)$$
• Each server chooses its service options $\alpha_n[f]$ and $I_n[f]$ via the following:
$$\min \; \frac{\mathbb{E}\left[\sum_{t=t_f^n}^{t_{f+1}^n - 1} \left(VW_n(t) + Ve_nH_n(t) + Vg_n(t) - Q_n(t_f^n)\mu_n(t)H_n(t)\right) \,\middle|\, Q_n(t_f^n)\right]}{\mathbb{E}\left[T_n[f] \,\middle|\, Q_n(t_f^n)\right]} \quad (3.9)$$
• Update $Q_n(t)$:
$$Q_n(t+1) = \max\{Q_n(t) + R_n(t) - \mu_n(t)H_n(t), \; 0\}.$$

3.2.3 Solving (3.8) and (3.9)

Note first that in Algorithm 3, the solution to problem (3.8) admits a simple thresholding rule (with shortest-queue ties broken arbitrarily):
$$r(t) = \begin{cases} \max\{\lambda(t) - R_{\max}, 0\}, & \text{if } \exists n \in \mathcal{N} \text{ s.t. } Q_n(t) \le Vc(t); \\ \lambda(t), & \text{otherwise.} \end{cases} \quad (3.10)$$
$$R_n(t) = \begin{cases} \min\{\lambda(t), R_{\max}\}, & \text{if } Q_n(t) \text{ is the shortest queue and } Q_n(t) \le Vc(t); \\ 0, & \text{otherwise.} \end{cases} \quad (3.11)$$
Next, for the problem (3.9), recall the definition of $T_n[f]$ and $\alpha_n[f] \in \{\text{active}\} \cup \mathcal{L}_n$. If the server chooses to remain active, then the frame length is exactly 1; otherwise, the server is allowed to choose how long it stays idle, with $\mathbb{E}(T_n[f] \mid Q(t_f^n)) = I_n[f] + m_{\alpha_n[f]} + 1$, where $I_n[f] \in \{1, \cdots, I_{\max}\}$. It can easily be shown that, over all randomized decisions between staying active and going to different idle states, it is optimal to make a pure decision which either stays active or goes to one of the idle states with probability 1.

More specifically, let
$$D_n[f] = \frac{\mathbb{E}\left[\sum_{t=t_f^n}^{t_{f+1}^n - 1} \left(VW_n(t) + Ve_nH_n(t) + Vg_n(t) - Q_n(t_f^n)\mu_n(t)H_n(t)\right) \,\middle|\, Q_n(t_f^n)\right]}{\mathbb{E}\left[T_n[f] \,\middle|\, Q_n(t_f^n)\right]}. \quad (3.12)$$
When server $n$ chooses to be active, then
$$D_n[f] = Ve_n - Q_n(t_f^n)\mu_n. \quad (3.13)$$
Otherwise, choosing a specific idle option $\alpha_n[f] \in \mathcal{L}_n$ gives
$$D_n[f] = \frac{V\hat{W}_n(\alpha_n[f])m_{\alpha_n[f]} + Ve_n - Q_n(t_f^n)\mu_n + V\hat{g}(\alpha_n[f])I_n[f]}{I_n[f] + m_{\alpha_n[f]} + 1}, \quad (3.14)$$
which follows from the fact that if the server goes idle, then $H_n(t)$ is zero during the frame except for the last slot. Then, solving (3.9) is equivalent to choosing the option which achieves the smaller value of $D_n[f]$ between (3.13) and (3.14).

A closer look at the optimization problem (3.14) indicates that the best idle period $I_n[f]$ solving (3.14) is either 1 or $I_{\max}$. This is unfortunately problematic for the data center application, since it means that a server either does not idle at all or goes idle for a very long time. When the arriving task stream is highly volatile, this could cause significant delay. In the next section, we introduce our proposed algorithm for the servers, which makes relatively "smooth" decisions.

3.2.4 The proposed online control algorithm

Our main idea for pushing the server away from the binary decision is to add a term to the ratio (3.12) which is quadratic in the renewal frame length. Specifically, for server $n$, at the beginning of its $f$-th renewal frame $t_f^n$, it observes its current queue state $Q_n(t_f^n)$ and makes decisions on $\alpha_n[f] \in \{\text{active}\} \cup \mathcal{L}_n$ and $I_n[f]$ so as to minimize the following ratio of expectations:
$$D_n[f] \triangleq \frac{\mathbb{E}\left[\sum_{t=t_f^n}^{t_{f+1}^n - 1} \left(VW_n(t) + Ve_nH_n(t) + Vg_n(t) - Q_n(t_f^n)\mu_n(t)H_n(t) + (t - t_f^n)B_0\right) \,\middle|\, Q_n(t_f^n)\right]}{\mathbb{E}\left[T_n[f] \,\middle|\, Q_n(t_f^n)\right]}, \quad (3.15)$$
where $B_0 = \frac{1}{2}(R_{\max} + \mu_{\max})\mu_{\max}$. Compared to the objective (3.12), the quantity $D_n[f]$ has an extra term $\sum_{t=t_f^n}^{t_{f+1}^n - 1}(t - t_f^n)B_0 = \frac{T_n[f](T_n[f]-1)}{2}B_0$ in the numerator that is quadratic in $T_n[f]$.
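Before turning to the server-side ratio, note that the front-end thresholding rule (3.10)-(3.11) above reduces to a few lines of code. A minimal sketch, with illustrative names and shortest-queue ties broken by index:

def admission_control(lam_t, c_t, queues, V, R_max):
    # Returns (r_t, R): the rejected amount and per-server routed amounts at
    # slot t, per the thresholding rule (3.10)-(3.11).
    N = len(queues)
    R = [0] * N
    n_min = min(range(N), key=lambda n: queues[n])   # shortest queue
    if queues[n_min] <= V * c_t:    # below threshold: admit as much as possible
        R[n_min] = min(lam_t, R_max)
        r_t = max(lam_t - R_max, 0)
    else:                           # every queue exceeds V*c(t): reject all
        r_t = lam_t
    return r_t, R

# Example: with V = 100 and c(t) = 2, queue 1 (length 90) is under the
# threshold 200, so it receives all 25 requests and nothing is rejected.
print(admission_control(lam_t=25, c_t=2, queues=[150, 90, 300], V=100, R_max=40))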
Similar to the last section, we are then able to simplify the problem by computing $D_n[f]$ for the active and idle options separately.
• If the server chooses to go active, i.e., $\alpha_n[f] = \text{active}$, then
$$D_n[f] = Ve_n - Q_n(t_f^n)\mu_n. \quad (3.16)$$
• If the server chooses to go idle, i.e., $\alpha_n[f] \in \mathcal{L}_n$, then
$$D_n[f] = \frac{V\hat{W}_n(\alpha_n[f])m_{\alpha_n[f]} + Ve_n - Q_n(t_f^n)\mu_n + \mathbb{E}\left[V\hat{g}(\alpha_n[f])I_n[f] + \frac{B_0}{2}T_n[f](T_n[f]-1) \,\middle|\, Q_n(t_f^n)\right]}{\mathbb{E}\left[T_n[f] \,\middle|\, Q_n(t_f^n)\right]}, \quad (3.17)$$
which follows from the fact that if the server goes idle, then $H_n(t)$ is zero during the frame except for the last slot.

Now we compute the optimal idle option $\alpha_n[f] \in \mathcal{L}_n$ and idle time length $I_n[f]$, given that the server chooses to go idle. The following lemma shows that the decision on $I_n[f]$ can also be reduced to a pure decision.

Lemma 3.2.1. The best decision minimizing (3.17) is a pure decision which takes one $\alpha_n[f] \in \mathcal{L}_n$ and one integer value $I_n[f] \in \{1, \cdots, I_{\max}\}$ minimizing the deterministic function:
$$D_n[f] = \frac{V\hat{W}_n(\alpha_n[f])m_{\alpha_n[f]} + Ve_n - Q_n(t_f^n)\mu_n + \frac{B_0}{2}\sigma^2_{\alpha_n[f]} + V\hat{g}(\alpha_n[f])I_n[f]}{I_n[f] + m_{\alpha_n[f]} + 1} + \frac{B_0}{2}\left(I_n[f] + m_{\alpha_n[f]} + 1\right). \quad (3.18)$$

The proof of the above lemma is given in Appendix A.

Then, the server computes the minimum of (3.18), which is a deterministic optimization problem solved in the following two steps:
1. For each $\alpha_n \in \mathcal{L}_n$, first differentiate (3.18) with respect to $I[f]$ to obtain a real-valued minimizer. Then choose $I[f]$ as whichever of the two integer values bracketing the real minimizer achieves the smaller value in (3.18).
2. Compare (3.18) across the different $\alpha_n \in \mathcal{L}_n$ and choose the one achieving the minimum.

Thus, the server compares (3.16) with the minimum of (3.18). If (3.16) is less than the minimum of (3.18), then the server chooses to go active. Otherwise, the server chooses to go idle and stays idle for $I_n[f]$ time slots. Overall, our final algorithm is summarized in Algorithm 4.

Algorithm 4.
• At each time slot $t$, the data center observes $\lambda(t)$, $c(t)$, and $Q(t)$, chooses the rejection decision $r(t)$ according to (3.10), and chooses the routing decision $R_n(t)$ according to (3.11).
• For each server $n \in \mathcal{N}$, at the beginning of its $f$-th frame $t_f^n$, observe its queue state $Q_n(t_f^n)$ and compute (3.16) and the minimum of (3.18). If (3.16) is less than the minimum of (3.18), then the server stays active. Otherwise, the server switches to the idle state minimizing (3.18) and stays idle for the $I_n[f]$ achieving the minimum of (3.18).
• Update $Q_n(t)$, $\forall n \in \mathcal{N}$, according to
$$Q_n(t+1) = \max\{Q_n(t) + R_n(t) - \mu_n(t)H_n(t), \; 0\}.$$
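For concreteness, here is a brute-force sketch of the server-side step of Algorithm 4, scanning all values of I rather than using the derivative shortcut of step 1; all names are illustrative. Each idle option alpha is described by its mean setup time, setup-time variance, per-slot idle cost, and per-slot setup cost.

def server_decision(Q_n, V, e_n, mu_n, B0, idle_options, I_max):
    # Compare the active value (3.16) with the minimum of (3.18) over all
    # idle options alpha and idle lengths I in {1, ..., I_max}.
    D_active = V * e_n - Q_n * mu_n                          # (3.16)
    best = None                                              # (alpha, I, value)
    for alpha, (m, var, g, W) in idle_options.items():
        for I in range(1, I_max + 1):                        # brute-force scan
            num = V*W*m + V*e_n - Q_n*mu_n + 0.5*B0*var + V*g*I
            val = num / (I + m + 1) + 0.5 * B0 * (I + m + 1)   # (3.18)
            if best is None or val < best[2]:
                best = (alpha, I, val)
    return ("active", None) if D_active <= best[2] else best[:2]

opts = {"sleep": (5.0, 20.0, 0.0, 2.0)}   # (m, variance, idle cost, setup cost)
print(server_decision(Q_n=80.0, V=100, e_n=4, mu_n=3, B0=10.0,
                      idle_options=opts, I_max=1000))   # -> ('sleep', 10)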
3.3 Probability 1 Performance Analysis of Algorithm 4

In this section, we prove a probability 1 convergence result for the proposed algorithm (Algorithm 4). More specifically, we prove that the online algorithm introduced in the last section keeps all request queues $Q_n(t)$ bounded (on the order of $V$) and achieves near optimality, with a sub-optimality gap on the order of $1/V$, with probability 1.

3.3.1 Bounded request queues

In this section, we show that the request queues are deterministically bounded due to the special thresholding nature of the admission control. Such a result is stronger (yet simpler) than the expected virtual queue analysis presented in the last section.

Lemma 3.3.1. If $Q_n(0) = 0$, $\forall n \in \mathcal{N}$, then each request queue $Q_n(t)$ is deterministically bounded, with
$$Q_n(t) \le Vc_{\max} + R_{\max}, \quad \forall t, \; \forall n \in \mathcal{N},$$
where $c_{\max} \triangleq \max_{c \in \mathcal{C}} c$.

Proof. We use induction to prove the claim. The base case is trivial, since $Q_n(0) = 0 \le Vc_{\max} + R_{\max}$. Suppose the claim holds at the beginning of $t = i$ for some $i > 0$, so that $Q_n(i) \le Vc_{\max} + R_{\max}$. Then:
1. If $Q_n(i) \le Vc_{\max}$, then it is possible for the queue to increase during slot $i$. However, the increase within one slot is bounded by $R_{\max}$, which implies that at the beginning of slot $i+1$, $Q_n(i+1) \le Vc_{\max} + R_{\max}$.
2. If $Vc_{\max} < Q_n(i) \le Vc_{\max} + R_{\max}$, then, according to (3.11), it is impossible to route any request to server $n$ during slot $i$, so $R_n(i) = 0$, which results in $Q_n(i+1) \le Vc_{\max} + R_{\max}$.
This completes the proof of the lemma.

Lemma 3.3.2. The proposed algorithm meets the constraint (3.5) with probability 1.

Proof. From the queue update rule (3.1), it follows that $Q_n(t+1) \ge Q_n(t) + R_n(t) - \mu_n H_n(t)$. Taking a telescoping sum from 0 to $T-1$ gives
$$Q_n(T) \ge Q_n(0) + \sum_{t=0}^{T-1} R_n(t) - \sum_{t=0}^{T-1} \mu_n H_n(t).$$
Since $Q_n(0) = 0$, dividing both sides by $T$ gives
$$\frac{Q_n(T)}{T} \ge \frac{1}{T}\sum_{t=0}^{T-1} R_n(t) - \frac{1}{T}\sum_{t=0}^{T-1} \mu_n H_n(t).$$
Substituting the bound $Q_n(T) \le Vc_{\max} + R_{\max}$ from Lemma 3.3.1 into the above inequality and taking the limit as $T \to \infty$ gives the desired result.
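Lemma 3.3.1's deterministic bound is easy to check numerically. A minimal sketch, assuming a single queue fed by the thresholding rule (3.11) and an arbitrary random service pattern (the bound does not depend on the service process); the constants are illustrative:

import random

random.seed(2)
V, R_max, c_max, mu_max = 50, 40, 6, 5
Q, bound = 0, V * c_max + R_max            # bound from Lemma 3.3.1
for t in range(100000):
    lam, c = random.randint(10, 30), random.randint(1, c_max)
    R = min(lam, R_max) if Q <= V * c else 0   # admit only below V*c(t)
    served = random.randint(0, mu_max)         # arbitrary service pattern
    Q = max(Q + R - served, 0)
    assert Q <= bound                          # never violated
print("bound", bound, "held over 100000 slots")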
3.3.2 Optimal randomized stationary policy

In this section, we introduce a class of algorithms which are theoretically helpful for the analysis, but practically impossible to implement. Since servers are coupled only through the time average constraint (3.5), each server $n$ can be viewed as a separate renewal system; thus, it can be shown that any achievable time average service rate $\mu_n$ can be attained through a frame-based stationary randomized service decision, meaning that the decisions are i.i.d. over frames. Furthermore, it can be shown that the optimality of (3.4)-(3.6) can be achieved over the following randomized stationary algorithms: At the beginning of each time slot $t$, the data center observes the incoming requests $\lambda(t)$ and the rejection cost $c(t)$, then routes $R_n^*(t)$ incoming requests to server $n$ and rejects $d^*(t)$ requests, both of which are random functions of $(\lambda(t), c(t))$. They satisfy the same instantaneous relation as (3.6). Meanwhile, server $n$ chooses a frame-based stationary randomized service decision $(\alpha_n^*[f], I_n^*[f])$ so that the optimal service rate is achieved. If one knows the stationary distribution of $(\lambda(t), c(t))$, then this optimal control algorithm can be computed using dynamic programming or linear programming.

Moreover, the optimal setup cost $W_n^*(t)$, idle cost $g_n^*(t)$, and active state indicator $H_n^*(t)$ can also be deduced. Since the algorithm is stationary, these three cost processes are all ergodic Markov processes. Let $T_n^*[f]$ be the frame length process under this algorithm. It then follows from the renewal reward theorem that
$$\left\{\sum_{t=t_f^n}^{t_{f+1}^n - 1} W_n^*(t)\right\}_{f=0}^{\infty}, \quad \left\{\sum_{t=t_f^n}^{t_{f+1}^n - 1} g_n^*(t)\right\}_{f=0}^{\infty}, \quad \left\{\sum_{t=t_f^n}^{t_{f+1}^n - 1} e_n H_n^*(t)\right\}_{f=0}^{\infty}, \quad \left\{\sum_{t=t_f^n}^{t_{f+1}^n - 1} \mu_n(t)H_n^*(t)\right\}_{f=0}^{\infty}$$
and $\{T_n^*[f]\}_{f=0}^{\infty}$ are all i.i.d. random variables over frames. Let $C^*$, $W_n^*$, $G_n^*$ and $E_n^*$ be the optimal time average costs. Let $R_n^*$, $\mu_n^*$ and $d^*$ be the optimal time average routing rate, service rate and rejection rate, respectively. Then, by the strong law of large numbers,
$$W_n^* = \frac{\mathbb{E}\left(\sum_{t=t_f^n}^{t_f^n + T_n^*[f]-1} W_n^*(t)\right)}{\mathbb{E}(T_n^*[f])}, \quad (3.19)$$
$$E_n^* = \frac{\mathbb{E}\left(\sum_{t=t_f^n}^{t_f^n + T_n^*[f]-1} e_n H_n^*(t)\right)}{\mathbb{E}(T_n^*[f])}, \quad (3.20)$$
$$G_n^* = \frac{\mathbb{E}\left(\sum_{t=t_f^n}^{t_f^n + T_n^*[f]-1} g_n^*(t)\right)}{\mathbb{E}(T_n^*[f])}, \quad (3.21)$$
$$\mu_n^* = \frac{\mathbb{E}\left(\sum_{t=t_f^n}^{t_f^n + T_n^*[f]-1} \mu_n(t)H_n^*(t)\right)}{\mathbb{E}(T_n^*[f])}. \quad (3.22)$$
Also, notice that $R_n^*(t)$ and $d^*(t)$ depend only on the random variables $\lambda(t)$ and $c(t)$, which are i.i.d. over slots. Thus, $R_n^*(t)$ and $d^*(t)$ are also i.i.d. random variables over slots. By the law of large numbers,
$$R_n^* = \mathbb{E}(R_n^*(t)), \quad (3.23)$$
$$C^* = \mathbb{E}(c(t)d^*(t)). \quad (3.24)$$

Remark 3.3.1. Since the idle time $I_n^*[f] \in [1, I_{\max}]$ and the first two moments of the setup time are bounded, it follows that the first two moments of $T_n^*[f]$ are bounded.

3.3.3 Key features of the thresholding algorithm

In this part, we compare the algorithm deduced from the two optimization problems (3.8) and (3.15) to the best stationary algorithm of Section 3.3.2, illustrating the key features of the proposed online algorithm. Define $\mathcal{F}(t)$ as the system history up to slot $t$, which includes all the decisions taken and all the random events before slot $t$.

We first consider (3.8). For simplicity of notation, define two random processes $\{X_n[f]\}_{f=0}^{\infty}$ and $\{Z[t]\}_{t=0}^{\infty}$ as follows:
$$X_n[f] = \sum_{t=t_f^n}^{t_{f+1}^n - 1} \left( V(W_n(t) - W_n^*) + V(e_nH_n(t) - E_n^*) + V(g_n(t) - G_n^*) - Q_n(t_f^n)(\mu_nH_n(t) - \mu_n^*) + (t - t_f^n)B_0 - \Psi_n \right),$$
$$Z[t] = V(c(t)r(t) - C^*) + \sum_{n=1}^N Q_n(t)(R_n(t) - R_n^*),$$
where $\Psi_n = \frac{B_0}{2}\cdot\frac{\mathbb{E}(T_n^*[f](T_n^*[f]-1))}{\mathbb{E}(T_n^*[f])}$ and $B_0 = \frac{1}{2}(R_{\max}+\mu_{\max})\mu_{\max}$.

Given the system history $\mathcal{F}(t)$ and the random events $c(t)$ and $\lambda(t)$, the solutions (3.10) and (3.11) make rejection and routing decisions so as to minimize (3.8) over all possible routing and rejection decisions at time slot $t$. Thus, the proposed algorithm achieves a smaller value in (3.8) than the best stationary algorithm of Section 3.3.2. Formally, this idea can be stated as the following inequality:
$$\mathbb{E}\left[Vc(t)r(t) + \sum_{n=1}^N Q_n(t)R_n(t) \,\middle|\, c(t), \lambda(t), \mathcal{F}(t)\right] \le \mathbb{E}\left[Vc(t)d^*(t) + \sum_{n=1}^N Q_n(t)R_n^*(t) \,\middle|\, c(t), \lambda(t), \mathcal{F}(t)\right].$$
Taking expectations with respect to $c(t)$ and $\lambda(t)$, using the fact that under the best stationary algorithm $R_n^*(t)$ and $d^*(t)$ are i.i.d. over slots (independent of $\mathcal{F}(t)$), together with (3.23) and (3.24), we get
$$\mathbb{E}(Z(t) \mid \mathcal{F}(t)) \le 0. \quad (3.25)$$
Similarly, for (3.15), the proposed service decisions within frame $f$ minimize $D_n[f]$ in (3.15); thus, compared to the best stationary policy, the following inequality holds:
$$\frac{\mathbb{E}\left[\sum_{t=t_f^n}^{t_{f+1}^n-1}\left(V(W_n(t)+e_nH_n(t)+g_n(t)) - Q_n(t_f^n)\mu_n(t)H_n(t) + (t-t_f^n)B_0\right) \,\middle|\, \mathcal{F}(t_f^n)\right]}{\mathbb{E}\left[T_n[f] \,\middle|\, \mathcal{F}(t_f^n)\right]} \le \frac{\mathbb{E}\left[\sum_{t=t_f^n}^{t_f^n+T_n^*[f]-1}\left(V(W_n^*(t)+e_nH_n^*(t)+g_n^*(t)) - Q_n(t_f^n)\mu_nH_n^*(t)\right) + \frac{B_0}{2}T_n^*[f](T_n^*[f]-1) \,\middle|\, \mathcal{F}(t_f^n)\right]}{\mathbb{E}\left[T_n^*[f] \,\middle|\, \mathcal{F}(t_f^n)\right]}. \quad (3.26)$$
Again, using the fact that the optimal stationary algorithm gives i.i.d. $W_n^*(t)$, $g_n^*(t)$, $H_n^*(t)$ and $T_n^*[f]$ over frames (independent of $\mathcal{F}(t_f^n)$), as well as (3.19), (3.20) and (3.22), we get
$$\frac{\mathbb{E}\left[X_n[f] \,\middle|\, \mathcal{F}(t_f^n)\right]}{\mathbb{E}\left[T_n[f] \,\middle|\, \mathcal{F}(t_f^n)\right]} \le 0. \quad (3.27)$$

3.3.4 Bounded averages of supermartingale difference sequences

The key feature inequalities (3.25) and (3.27) provide us with bounds on expectations. The following lemma serves as a stepping stone for passing from expectation bounds to probability 1 bounds. Recall the basic definition of a supermartingale in Definition 2.3.1. We have the following strong law of large numbers for supermartingale difference sequences:

Lemma 3.3.3 (Corollary 4.2 of [Nee12c]).
Let $\{X_t\}_{t=0}^{\infty}$ be a supermartingale difference sequence. If
$$\sum_{t=1}^{\infty} \frac{\mathbb{E}(X_t^2)}{t^2} < \infty,$$
then
$$\limsup_{T\to\infty} \frac{1}{T}\sum_{t=0}^{T-1} X_t \le 0, \quad \text{with probability 1.}$$

With this lemma, we are ready to prove the following result:

Lemma 3.3.4. Under the proposed algorithm, the following hold with probability 1:
$$\limsup_{F\to\infty} \frac{1}{F}\sum_{f=0}^{F-1} X_n[f] \le 0, \quad (3.28)$$
$$\limsup_{T\to\infty} \frac{1}{T}\sum_{t=0}^{T-1} Z[t] \le 0. \quad (3.29)$$

Proof. The key to the proof is treating these two sequences as supermartingale difference sequences and applying the law of large numbers for supermartingale difference sequences (Lemma 3.3.3). We first look at the sequence $\{X_n[f]\}_{f=0}^{\infty}$. Let $Y_n[F] = \sum_{f=0}^{F-1} X_n[f]$. We first prove that $Y_n[F]$ is a supermartingale. Notice that $Y_n[F]$ is measurable with respect to $\mathcal{F}(t_F^n)$, i.e., it is determined by the information available before frame $F$, and $|Y_n[F]| < \infty$, $\forall F < \infty$. Furthermore,
$$\mathbb{E}\left(Y_n[F+1] - Y_n[F] \,\middle|\, \mathcal{F}(t_F^n)\right) = \mathbb{E}\left(X_n[F] \,\middle|\, \mathcal{F}(t_F^n)\right) \le 0\cdot\mathbb{E}\left(T_n[F] \,\middle|\, \mathcal{F}(t_F^n)\right) = 0,$$
where the inequality follows from (3.27). Thus, $Y_n[F]$ is a supermartingale. Next, we show that the second moment of the supermartingale differences, i.e., $\mathbb{E}(X_n[f]^2)$, is deterministically bounded by a fixed constant for any $f$. This part of the proof is given in Appendix B. Thus, the following holds:
$$\sum_{f=1}^{\infty} \frac{\mathbb{E}(X_n[f]^2)}{f^2} < \infty.$$
Now, applying Lemma 3.3.3 immediately gives (3.28). Similarly, we can prove (3.29) by showing that $M[T] = \sum_{t=0}^{T-1} Z[t]$ is a supermartingale with bounded second moments of its differences, using (3.23), (3.24) and (3.25). The procedure is almost the same as above, and we omit the details for brevity.

Corollary 3.3.1. The following ratio of time averages is upper bounded with probability 1:
$$\limsup_{F\to\infty} \frac{\sum_{f=0}^{F-1} X_n[f]}{\sum_{f=0}^{F-1} T_n[f]} \le 0.$$

Proof. From (3.28), it follows that for any $\epsilon > 0$ there exists an $F_0(\epsilon)$ such that $F \ge F_0(\epsilon)$ implies $\frac{1}{F}\sum_{f=0}^{F-1} X_n[f] \le \epsilon$. Thus,
$$\frac{\sum_{f=0}^{F-1} X_n[f]}{\sum_{f=0}^{F-1} T_n[f]} \le \frac{\epsilon}{\frac{1}{F}\sum_{f=0}^{F-1} T_n[f]} \le \epsilon,$$
where the last inequality uses $T_n[f] \ge 1$. Thus, $\limsup_{F\to\infty} \sum_{f=0}^{F-1} X_n[f] \big/ \sum_{f=0}^{F-1} T_n[f] \le \epsilon$. Since $\epsilon$ is arbitrary, taking $\epsilon \to 0$ gives the result.

3.3.5 Near optimal time average cost

The ratio of time averages in Corollary 3.3.1 and the true time average share the same bound, as shown by the following lemma:

Lemma 3.3.5. The following time average is bounded with probability 1:
$$\limsup_{T\to\infty} \frac{1}{T}\sum_{t=0}^{T-1}\left(V(W_n(t) + e_nH_n(t) + g_n(t)) - Q_n(t_f^n)(\mu_nH_n(t) - \mu_n^*) + (t - t_f^n)B_0\right) \le V(W_n^* + E_n^* + G_n^*) + \Psi_n, \quad (3.30)$$
where $\Psi_n = \frac{B_0}{2}\cdot\frac{\mathbb{E}(T_n^*[f](T_n^*[f]-1))}{\mathbb{E}(T_n^*[f])}$ and $B_0 = \frac{1}{2}(R_{\max}+\mu_{\max})\mu_{\max}$.

The idea of the proof is similar to that of basic renewal theory: using Corollary 3.3.1, we derive upper and lower bounds on the partial sum for each $T$ within any frame $F$, and show that as $T \to \infty$ the upper and lower bounds meet. See Appendix C for details. With the help of this lemma, we are able to prove the following near optimality theorem:

Theorem 3.3.1. If $Q_n(0) = 0$, $\forall n \in \mathcal{N}$, then the time average total cost under the algorithm is near optimal with a gap of order $O(1/V)$, i.e., with probability 1,
$$\limsup_{T\to\infty} \frac{1}{T}\sum_{t=0}^{T-1}\left(c(t)r(t) + \sum_{n=1}^N (W_n(t) + e_nH_n(t) + g_n(t))\right) \le \underbrace{C^* + \sum_{n=1}^N (W_n^* + E_n^* + G_n^*)}_{\text{Optimal cost}} + \frac{\sum_{n=1}^N \Psi_n + B_3}{V}, \quad (3.31)$$
where $B_3 \triangleq \frac{1}{2}\sum_{n=1}^N (R_{\max} + \mu_n)^2$, $\Psi_n = \frac{B_0}{2}\cdot\frac{\mathbb{E}(T_n^*[f](T_n^*[f]-1))}{\mathbb{E}(T_n^*[f])}$ and $B_0 = \frac{1}{2}(R_{\max}+\mu_{\max})\mu_{\max}$. See Appendix D for the details of the proof.

3.4 Delay improvement via virtualization

3.4.1 Delay improvement

The algorithm in the previous sections optimizes the time average cost. However, it can route requests to idle queues, which increases system delay.
This section considers an improvement to the algorithm that maintains the same average cost guarantees but reduces delay. This is done by a "virtualization" technique that reduces the $N$ server request queues to a single request queue $Q(t)$. Specifically, the same algorithm (Algorithm 4) is run, with queue updates (3.1) for each of the $N$ queues $Q_n(t)$. However, the $Q_n(t)$ processes are now virtual queues rather than actual queues: their values are only kept in software. Every slot $t$, the data center observes the incoming requests $\lambda(t)$, the rejection cost $c(t)$, and the virtual queue values, making the rejection decision according to (3.10) as before. The admitted requests are queued in $Q(t)$. Meanwhile, each server $n$ makes active/idle decisions by observing its own virtual queue $Q_n(t)$, the same as before. Whenever a server is active, it grabs requests from the request queue $Q(t)$ and serves them. This results in the following actual queue update for the system:
$$Q(t+1) = \max\left\{Q(t) + \lambda(t) - r(t) - \sum_{n=1}^N \mu_n(t)H_n(t), \; 0\right\}. \quad (3.32)$$
Fig. 3.3 shows this data center architecture.

Figure 3.3: Illustration of the one-queue data center architecture.

3.4.2 Performance guarantee

Since this algorithm does not look at the actual queue $Q(t)$, it is not immediately clear whether the actual request queue is stabilized under the proposed algorithm. The following lemma answers this question. For simplicity, we call the system with $N$ queues, where our algorithm applies, the virtual system, and the system with only one queue the actual system.

Lemma 3.4.1. If $Q(0) = 0$ and $Q_n(0) = 0$, $\forall n \in \mathcal{N}$, then the virtualization technique stabilizes the queue $Q(t)$ with the bound
$$Q(t) \le N(Vc_{\max} + R_{\max}).$$

Proof. Notice that this bound is $N$ times the individual queue bound in Lemma 3.3.1. We prove the lemma by showing that the sum of the virtual queues $\sum_{n=1}^N Q_n(t)$ always dominates the actual queue length $Q(t)$. We proceed by induction. The base case is obvious, since $Q(0) = \sum_{n=1}^N Q_n(0) = 0$. Suppose that at the beginning of time $t$, $Q(t) \le \sum_{n=1}^N Q_n(t)$. Then, during time $t$, we distinguish between the following two cases:
1. Not all active servers in the actual system have requests to serve. This happens if and only if there are not enough requests in $Q(t)$ to be served, i.e., $\lambda(t) - r(t) + Q(t) < \sum_{n=1}^N \mu_n(t)H_n(t)$. Thus, according to the queue update rule (3.32), at the beginning of slot $t+1$ there will be no requests in the actual queue, i.e., $Q(t+1) = 0$. Hence, it is guaranteed that $Q(t+1) \le \sum_{n=1}^N Q_n(t+1)$.
2. All active servers in the actual system have requests to serve. Notice that the virtual system and the actual system have exactly the same arrivals, rejections and server active/idle states. Thus, the following holds:
$$Q(t+1) = Q(t) + \lambda(t) - r(t) - \sum_{n=1}^N \mu_n(t)H_n(t) \le \sum_{n=1}^N Q_n(t) + \sum_{n=1}^N R_n(t) - \sum_{n=1}^N \mu_n(t)H_n(t) \le \sum_{n=1}^N \max\{Q_n(t) + R_n(t) - \mu_n(t)H_n(t), \; 0\} = \sum_{n=1}^N Q_n(t+1),$$
where the first inequality follows from the induction hypothesis together with the fact that $\sum_{n=1}^N R_n(t) = \lambda(t) - r(t)$.
Altogether, we have proved $Q(t) \le \sum_{n=1}^N Q_n(t)$, $\forall t$. Since each $Q_n(t) \le Vc_{\max} + R_{\max}$, $\forall t$, the lemma follows.

Since the virtual system and the actual system have exactly the same cost, and it can be shown that the optimal cost in the one-queue system is lower bounded by the optimal cost in the $N$-queue system, the near optimal performance guarantee is preserved.
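The domination argument in Lemma 3.4.1 can also be checked numerically. A sketch, where routing, activity, and service are random stand-ins for the actual algorithm (the domination only uses the shared arrivals, rejections, and activity states):

import random

random.seed(3)
N, Qv, Q = 4, [0] * 4, 0     # virtual queues Q_n(t) and actual queue Q(t)
for t in range(50000):
    lam = random.randint(0, 20)
    r = random.randint(0, lam)                   # rejected requests
    R = [0] * N                                  # routing of the admitted ones
    for _ in range(lam - r):
        R[random.randrange(N)] += 1
    H = [random.random() < 0.5 for _ in range(N)]   # active/idle indicators
    mu = [random.randint(1, 4) for _ in range(N)]
    serve = sum(m for m, h in zip(mu, H) if h)
    Q = max(Q + (lam - r) - serve, 0)               # actual update (3.32)
    Qv = [max(q + a - (m if h else 0), 0)
          for q, a, m, h in zip(Qv, R, mu, H)]      # virtual updates (3.1)
    assert Q <= sum(Qv)                             # Lemma 3.4.1 domination
print("Q(t) <= sum_n Q_n(t) held on every slot")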
3.5 Simulation

In this section, we demonstrate the performance of our proposed algorithm via extensive simulations. The first simulation runs over i.i.d. traffic. We show that our algorithm indeed achieves $O(1/V)$ near optimality with $O(V)$ delay (an $[O(1/V), O(V)]$ trade-off), as predicted by Lemma 3.3.1 and Theorem 3.3.1. We then apply our algorithm to a real data center traffic trace with realistic scale and setup time, with the cost being the power consumption. We compare the performance of the proposed algorithm with several other heuristic algorithms and show that our algorithm indeed delivers lower delay and saves power.

Table 3.2: Problem parameters
Server | μ_n | e_n | Ŵ_n(α_n) | E(τ̂(α_n))
1 | {2, 3, 4, 5, 6} | 4 | 2 | 5.893
2 | {2, 3, 4} | 2 | 3 | 4.342
3 | {2, 3, 4} | 3 | 3 | 27.397
4 | {1, 2, 3} | 4 | 2 | 5.817
5 | {2, 3, 4} | 2 | 4 | 6.211

3.5.1 Near optimality in the N-queue system

In the first simulation, we consider a relatively small-scale problem with i.i.d. generated traffic. We set the number of servers $N = 5$. The incoming requests $\lambda(t)$ are integers following a uniform distribution on $[10, 30]$. The request rejection costs $c(t)$ are also integers, following a uniform distribution on $[1, 6]$. The maximum admission amount is $R_{\max} = 40$ and the maximum idle time is $I_{\max} = 1000$. There is only one idle option $\alpha_n$ for each server, with idle cost $\hat{g}(\alpha_n) = 0$. The setup time follows a geometric distribution with mean $\mathbb{E}(\hat{\tau}(\alpha_n))$, with setup cost $\hat{W}_n(\alpha_n)$ per slot and service cost $e_n$ per slot, and the service amount $\mu_n$ follows a uniform distribution over a set of integers. The values $1/\mathbb{E}(\hat{\tau}(\alpha_n))$ are generated uniformly at random within $[0, 1]$; the resulting parameters are specified in Table 3.2. The algorithm is run for 1 million slots in each trial, and each plot takes the average over these 1 million slots. We compare our algorithm to the optimal stationary algorithm, which is computed using a linear program [Fox66a] with full knowledge of the statistics of the requests and rejection costs.

In Fig. 3.4, we show that as the trade-off parameter $V$ gets larger, the average cost approaches the optimal value, achieving near optimal performance. Furthermore, the cost curve drops rapidly when $V$ is small and becomes relatively flat when $V$ gets large, demonstrating the $O(1/V)$ optimality gap of Theorem 3.3.1. Fig. 3.5 plots the average sum queue size $\sum_{n=1}^5 Q_n(t)$ and shows that as $V$ gets larger, the average sum queue size grows. We also plot the sum of the individual queue bounds from Lemma 3.3.1 for comparison. We can see that the real queue size grows linearly in $V$ (although the constant in Lemma 3.3.1 is not tight, given the much better delay obtained here), which demonstrates the $O(V)$ delay bound.

We then tune the requests $\lambda(t)$ to be uniform on $[20, 40]$ and keep the other parameters unchanged. In Fig. 3.6, we see that since the request rate is larger, $V$ needs to be larger in order to obtain near optimality, but the optimality gap still scales roughly as $O(1/V)$. Fig. 3.7 gives the average sum queue length in this case. The average queue length is larger than that of Fig. 3.5, again with linear growth with respect to $V$.

Figure 3.4: Time average cost versus the V parameter over 1 million slots.
Figure 3.5: Time average sum request queue length versus the V parameter over 1 million slots.
Figure 3.6: Time average cost versus the V parameter over 1 million slots (higher request rate).
Figure 3.7: Time average sum request queue length versus the V parameter over 1 million slots (higher request rate).
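As a condensed illustration of the pipeline used in these experiments, the following self-contained sketch runs a one-server instance of Algorithm 4 (the thresholding front end plus the frame-based server decision), assuming a single idle option with zero idle cost and a geometric setup time. All constants are illustrative; the actual experiments use N = 5 servers with the parameters of Table 3.2.

import random

random.seed(5)
V, R_max, I_max = 200, 40, 1000
e_n, W_n, g_n = 2.0, 3.0, 0.0       # active / setup / idle cost per slot
m, var = 5.0, 20.0                  # setup mean and variance (geometric p = 0.2)
p_setup = 1.0 / m
mu_n = 6                            # requests served per active slot
B0 = 0.5 * (R_max + mu_n) * mu_n

def best_idle(Q):
    # Step 1 of the two-step method: differentiate (3.18) in I, then test the
    # two bracketing integers (exact here since the idle cost g_n is 0).
    K = V * W_n * m + V * e_n - Q * mu_n + 0.5 * B0 * var
    x = (2.0 * max(K, 0.0) / B0) ** 0.5        # real minimizer of I + m + 1
    cands = {1, I_max}
    base = int(x - m - 1)
    for I in (base, base + 1):
        if 1 <= I <= I_max:
            cands.add(I)
    f = lambda I: K / (I + m + 1) + 0.5 * B0 * (I + m + 1)
    I_best = min(cands, key=f)
    return I_best, f(I_best)

Q, state, timer, cost = 0.0, "active", 0, 0.0
T_sim = 200000
for t in range(T_sim):
    lam, c = random.randint(4, 10), random.randint(1, 6)
    if Q <= V * c:                              # thresholding (3.10)-(3.11)
        R, r = min(lam, R_max), max(lam - R_max, 0)
    else:
        R, r = 0, lam
    cost += c * r
    H = 0
    if state == "active":                       # frame start: (3.16) vs (3.18)
        I_star, D_idle = best_idle(Q)
        if V * e_n - Q * mu_n <= D_idle:
            H = 1                               # one-slot active frame
        else:
            state, timer = "idle", I_star       # this slot is the 1st idle slot
    if H:
        cost += e_n
    elif state == "idle":
        cost += g_n
        timer -= 1
        if timer == 0:
            state = "setup"
    elif state == "setup":
        cost += W_n
        if random.random() < p_setup:
            state = "final"                     # serve on the next slot
    elif state == "final":
        H, state = 1, "active"
        cost += e_n
    Q = max(Q + R - mu_n * H, 0)
print("time average cost:", cost / T_sim)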
3.5.2 Real data center traffic trace and performance evaluation

The second simulation considers a real data center traffic trace obtained from the open-source data sets of [BAM10]. The trace is plotted in Fig. 3.8. We synthesize different data chunks from the source so that the trace contains both a steady phase and an increasing phase. The total time duration is 2800 seconds, with each slot equal to 20 ms. The peak traffic is 2120 requests per 20 ms, and the time average traffic over the whole interval is 654 requests per 20 ms.

We consider a data center consisting of 1060 homogeneous servers. We assume each server has only one sleep state, and the service quantity of each server in each slot follows a Zipf's law with parameters $K = 10$ and $p = 1.9$. (The pmf of Zipf's law with parameters $K, p$ is $f(n; K, p) = \frac{1/n^p}{\sum_{i=1}^K 1/i^p}$ for $n = 1, 2, \cdots, K$; thus, the mean of the distribution is $\frac{\sum_{i=1}^K 1/i^{p-1}}{\sum_{i=1}^K 1/i^p}$.) This gives a service rate of each server equal to $1.9933 \approx 2$ requests per 20 ms, so the full capacity of the data center is able to support the peak traffic. Zipf's law has previously been introduced to model a wide range of phenomena in physics, biology, computer science and social science ([New05]), and it is adopted in various works to simulate empirical data center service rates ([Gan13, GHBK12]). The setup time of each server is geometrically distributed with success probability 0.001. This gives a mean setup time of 1000 slots (20 seconds), previously shown in [GHBK12] to be a typical duration for a desktop to recover from the suspend or hibernate state.

Furthermore, to make a fair comparison with several existing algorithms, we enforce the front-end balancer to accept all requests at each time slot (so the rejection rate is always 0). The only cost in the system is then the power consumption. We assume that a server consumes 10 W per slot when active and 0 W per slot when idle. The setup cost is also 10 W per slot. Moreover, we apply the one-queue model described in Section 3.4 for all the remaining simulations. Following the problem formulation, the maximum idle time of a server for the proposed algorithm is $I_{\max} = 5000$, while no such limit is imposed on any of the benchmark algorithms.

We first run our proposed algorithm over the trace with virtualization (Section 3.4) for different $V$ values. We set the initial virtual queue backlogs $Q_n(0) = 2000$ $\forall n$, and keep 20 servers always on. Fig. 3.9 and Fig. 3.10 plot the running average power consumption and the corresponding queue length for $V = 400, 600, 800$ and $1200$, respectively. It can be seen that as $V$ gets large, the average power consumption does not improve much, but the queue length changes drastically. This phenomenon results from the $[O(1/V), O(V)]$ trade-off of the proposed algorithm. In view of this, we choose $V = 600$, which gives a reasonable delay performance in Fig. 3.10.

Figure 3.8: Synthesized traffic trace from [BAM10].
Figure 3.9: Running average power consumption from slot 1 to the current slot for different V values.
Figure 3.10: Instantaneous queue length for different V values.

Next, we compare our proposed algorithm, with the same initial setup and $V = 600$, to the following algorithms:
• Always-on with $N = 327$ active servers and the remaining servers staying in sleep mode. Note that 327 servers can support the average traffic over the whole interval, which is 654 requests per 20 ms.
• Always-on with full capacity. This corresponds to keeping all 1060 servers on at every slot.
• Reactive. This algorithm is developed in [GHBK12]; it reacts to the current traffic estimate $\overline{\lambda}(t)$ and maintains $k_{\text{react}}(t) = \overline{\lambda}(t)/2$ servers on. In the simulation, we choose $\overline{\lambda}(t)$ to be the average of the traffic over the latest 10 slots. If the current number of active servers $k(t) > k_{\text{react}}(t)$, then we turn $k(t) - k_{\text{react}}(t)$ servers off; otherwise, we move $k_{\text{react}}(t) - k(t)$ servers into the setup state. (A sketch of this rule follows the list.)
• Reactive with extra capacity. This algorithm is similar to Reactive, except that we introduce a virtual traffic flow of $p$ jobs per slot, so during each slot $t$ the algorithm maintains $k_{\text{react}}(t) = (\overline{\lambda}(t) + p)/2$ servers on.
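Here is a sketch of the Reactive rule referenced above, assuming a 10-slot moving average and a per-server service rate of about 2 requests per slot; the extra-capacity variant simply inflates the traffic estimate. Names are illustrative, and turning servers on/off (including the setup delay) is left to the surrounding simulator.

from collections import deque

def make_reactive(extra_jobs=0, window=10, rate_per_server=2.0):
    history = deque(maxlen=window)          # last `window` traffic samples
    def target_servers(lam_t):
        history.append(lam_t)
        lam_bar = sum(history) / len(history)
        return round((lam_bar + extra_jobs) / rate_per_server)  # k_react(t)
    return target_servers

reactive, reactive_extra = make_reactive(), make_reactive(extra_jobs=200)
for lam in (600, 650, 700, 900, 1200):
    print(lam, reactive(lam), reactive_extra(lam))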
Figs. 3.11-3.13 plot the average power consumption, the queue length, and the number of active servers, respectively. It can be seen that all algorithms perform well during the first half of the trace. During the second half of the trace, the traffic load increases. The Always-on algorithm with mean capacity does not adapt to the traffic, so its queue length blows up quickly. Because of the long setup time, the number of active servers under the Reactive algorithm fails to catch up with the increasing traffic, so its queue length also blows up. Our proposed algorithm minimizes the power consumption while stabilizing the queues, thereby outperforming both the Always-on and the Reactive algorithms. Note that Reactive with 200 jobs of extra capacity achieves a delay performance similar to that of our proposed algorithm, but with significant extra power consumption.

Figure 3.11: Running average power consumption from slot 1 to the current slot for different algorithms.
Figure 3.12: Instantaneous queue length for different algorithms.
Figure 3.13: Number of active servers over time.

Finally, we evaluate the influence of different sleep modes on the performance. We keep all other settings the same as before and consider sleep modes with sleep power consumption equal to 2 W and 4 W per slot, respectively. Since the Always-on and Reactive algorithms do not look at the sleep power consumption, their decisions remain the same as before; thus, we superpose the queue length of our proposed algorithm onto the previous Fig. 3.12 and obtain the queue length comparison in Fig. 3.14. We see from the plot that increasing the power consumption during the sleep mode only slightly increases the queue length of our proposed algorithm. Fig. 3.16 plots the running average power consumption under the different sleep modes. Despite spending more power in the sleep mode, the proposed algorithm still saves a considerable amount of power compared to the other algorithms while keeping the request queue stable. This shows that our algorithm is empirically robust to the choice of sleep mode.

Figure 3.14: Instantaneous queue length for different algorithms.
Figure 3.16: Running average power consumption for 0 W sleep cost (left), 2 W sleep cost (middle) and 4 W sleep cost (right).

3.6 Additional lemmas and proofs

Appendix A — Proof of Lemma 3.2.1

We have that the following holds:
$$D_n[f] = \frac{V\hat{W}_n(\alpha_n[f])m_{\alpha_n[f]} + Ve_n - Q_n(t_f^n)\mu_n + \mathbb{E}\left[\frac{B_0}{2}(I_n[f]+\tau_n[f]+1)^2 + V\hat{g}(\alpha_n[f])I_n[f] \,\middle|\, Q_n(t_f^n)\right]}{\mathbb{E}\left[I_n[f]+\tau_n[f]+1 \,\middle|\, Q_n(t_f^n)\right]} - \frac{B_0}{2}$$
$$= \frac{V\hat{W}_n(\alpha_n[f])m_{\alpha_n[f]} + Ve_n - Q_n(t_f^n)\mu_n + \mathbb{E}\left[\frac{B_0}{2}(I_n[f]+m_{\alpha_n}+1)^2 + \frac{B_0}{2}\sigma^2_{\alpha_n[f]} + V\hat{g}(\alpha_n[f])I_n[f] \,\middle|\, Q_n(t_f^n)\right]}{\mathbb{E}\left[I_n[f]+m_{\alpha_n}+1 \,\middle|\, Q_n(t_f^n)\right]} - \frac{B_0}{2}, \quad (3.33)$$
where the first equality follows from the definition $T_n[f] = I_n[f] + \tau_n[f] + 1$ and the second equality follows from iterated expectations, conditioning on $I_n[f]$ and $\alpha_n[f]$.
For simplicity of notation, let
$$F(\alpha_n[f], I_n[f]) = V\hat{W}_n(\alpha_n[f])m_{\alpha_n[f]} + Ve_n - Q_n(t_f^n)\mu_n + \frac{B_0}{2}(I_n[f] + m_{\alpha_n[f]} + 1)^2 + V\hat{g}(\alpha_n[f])I_n[f] + \frac{B_0}{2}\sigma^2_{\alpha_n[f]},$$
$$G(\alpha_n[f], I_n[f]) = I_n[f] + m_{\alpha_n[f]} + 1,$$
so that
$$D_n[f] = \frac{\mathbb{E}\left[F(\alpha_n[f], I_n[f]) \mid Q_n(t_f^n)\right]}{\mathbb{E}\left[G(\alpha_n[f], I_n[f]) \mid Q_n(t_f^n)\right]} - \frac{B_0}{2}.$$
Meanwhile, given the queue length $Q_n(t_f^n)$ at frame $f$, denote the benchmark value over pure decisions as
$$m \triangleq \min_{I_n[f] \in \mathbb{N},\; I_n[f] \in [1, I_{\max}],\; \alpha_n[f] \in \mathcal{L}_n} \frac{F(\alpha_n[f], I_n[f])}{G(\alpha_n[f], I_n[f])}. \quad (3.34)$$
Then, for any randomized decision on $\alpha_n[f]$ and $I_n[f]$, its realization within frame $f$ satisfies
$$\frac{F(\alpha_n[f], I_n[f])}{G(\alpha_n[f], I_n[f])} \ge m,$$
which implies $F(\alpha_n[f], I_n[f]) \ge m\,G(\alpha_n[f], I_n[f])$. Taking conditional expectations on both sides gives
$$\mathbb{E}\left[F(\alpha_n[f], I_n[f]) \mid Q_n(t_f^n)\right] \ge m\,\mathbb{E}\left[G(\alpha_n[f], I_n[f]) \mid Q_n(t_f^n)\right] \;\Rightarrow\; \frac{\mathbb{E}\left[F(\alpha_n[f], I_n[f]) \mid Q_n(t_f^n)\right]}{\mathbb{E}\left[G(\alpha_n[f], I_n[f]) \mid Q_n(t_f^n)\right]} \ge m.$$
Thus, it is enough to consider pure decisions only, which boils down to computing (3.34). This proves the lemma.

Appendix B — Proof of Lemma 3.3.4

This section is dedicated to proving that $\mathbb{E}(X_n[f]^2)$ is bounded. First of all, since the idle option set $\mathcal{L}_n$ is finite, denote
$$W_{\max} = \max_{\alpha_n \in \mathcal{L}_n} W_n(\alpha_n), \quad g_{\max} = \max_{\alpha_n \in \mathcal{L}_n} g_n(\alpha_n).$$
It is then clear that $|W_n(t) - W_n^*| \le W_{\max}$, $|g_n(t) - G_n^*| \le g_{\max}$, $|e_nH_n(t) - E_n^*| \le e_n$, and $|\mu_nH_n(t) - \mu_n^*| \le \mu_n$. Combining these with the boundedness of the queues in Lemma 3.3.1, it follows that
$$|X_n[f]| \le \sum_{t=t_f^n}^{t_{f+1}^n - 1}\left( V(W_{\max} + e_n + g_{\max}) + (Vc_{\max} + R_{\max})\mu_n + (t - t_f^n)B_0 + \Psi_n \right) \le \left( V(W_{\max} + e_n + g_{\max}) + (Vc_{\max} + R_{\max})\mu_n + \Psi_n \right)T_n[f] + \frac{T_n[f](T_n[f]-1)B_0}{2}.$$
Letting $B_1 \triangleq V(W_{\max} + e_n + g_{\max}) + (Vc_{\max} + R_{\max})\mu_n + \Psi_n + B_0/2$, it follows that
$$|X_n[f]| \le B_1 T_n[f] + \frac{B_0}{2}T_n[f]^2.$$
Thus,
$$\mathbb{E}(X_n[f]^2) \le B_1^2\,\mathbb{E}(T_n[f]^2) + B_1B_0\,\mathbb{E}(T_n[f]^3) + \frac{B_0^2}{4}\,\mathbb{E}(T_n[f]^4).$$
Notice that $T_n[f] \le I_n[f] + \tau_n[f] + 1$ by (3.2), where $I_n[f]$ is upper bounded by $I_{\max}$ and $\tau_n[f]$ has its first four moments bounded by Assumption 3.1.2. Thus, $\mathbb{E}(X_n[f]^2)$ is bounded by a fixed constant.

Appendix C — Proof of Lemma 3.3.5

Proof. Let us first abbreviate notation by defining
$$Y(t) = V(W_n(t) + e_nH_n(t) + g_n(t)) - Q_n(t_f^n)(\mu_nH_n(t) - \mu_n^*) + (t - t_f^n)B_0.$$
For any $T \in [t_F^n, t_{F+1}^n)$, we can bound the partial sums from above as follows:
$$\sum_{t=0}^{T-1} Y(t) \le \sum_{t=0}^{t_F^n - 1} Y(t) + B_2T_n[F] + \frac{B_0}{2}T_n[F]^2,$$
where $B_0 = \frac{1}{2}(R_{\max}+\mu_{\max})\mu_{\max}$ is defined in (3.15) and $B_2 \triangleq VW_n + V\mu_ne_n + (Vc_{\max}+R_{\max})\mu_n + B_0/2$. Thus,
$$\frac{1}{T}\sum_{t=0}^{T-1} Y(t) \le \frac{1}{T}\sum_{t=0}^{t_F^n - 1} Y(t) + \frac{1}{T}\left(B_2T_n[F] + \frac{B_0}{2}T_n[F]^2\right) \le \max\{a[F],\, b[F]\},$$
where
$$a[F] \triangleq \frac{1}{t_F^n}\sum_{t=0}^{t_F^n - 1} Y(t) + \frac{1}{t_F^n}\left(B_2T_n[F] + \frac{B_0}{2}T_n[F]^2\right), \quad b[F] \triangleq \frac{1}{t_{F+1}^n}\sum_{t=0}^{t_F^n - 1} Y(t) + \frac{1}{t_{F+1}^n}\left(B_2T_n[F] + \frac{B_0}{2}T_n[F]^2\right).$$
This implies that
$$\limsup_{T\to\infty} \frac{1}{T}\sum_{t=0}^{T-1} Y(t) \le \limsup_{F\to\infty} \max\{a[F], b[F]\} = \max\left\{\limsup_{F\to\infty} a[F],\; \limsup_{F\to\infty} b[F]\right\}.$$
We then work out upper bounds for $\limsup_{F\to\infty} a[F]$ and $\limsup_{F\to\infty} b[F]$ separately.
1. Bound for $\limsup_{F\to\infty} a[F]$:
$$\limsup_{F\to\infty} a[F] \le \limsup_{F\to\infty} \frac{1}{t_F^n}\sum_{t=0}^{t_F^n - 1} Y(t) + \limsup_{F\to\infty} \frac{1}{t_F^n}\left(B_2T_n[F] + \frac{B_0}{2}T_n[F]^2\right) \le V(W_n^* + E_n^* + G_n^*) + \Psi_n + \limsup_{F\to\infty} \frac{1}{t_F^n}\left(B_2T_n[F] + \frac{B_0}{2}T_n[F]^2\right),$$
where the second inequality follows from Corollary 3.3.1. It remains to show that
$$\limsup_{F\to\infty} \frac{1}{t_F^n}\left(B_2T_n[F] + \frac{B_0}{2}T_n[F]^2\right) \le 0. \quad (3.35)$$
Since $t_F^n \ge F$, it is enough to show that
$$\limsup_{F\to\infty} \frac{T_n[F]}{F} = 0, \quad (3.36)$$
$$\limsup_{F\to\infty} \frac{T_n[F]^2}{F} = 0. \quad (3.37)$$
We prove (3.37); the proof of (3.36) is similar. Since each $T_n[F] = I_n[F] + \tau_n[F] + 1$, where $I_n[F] \le I_{\max}$ and $\tau_n[F]$ has bounded first four moments, the first four moments of $T_n[F]$ must also be bounded, and there exists a constant $C > 0$ such that $\mathbb{E}(T_n[F]^4) \le C$. For any $\epsilon > 0$, define the sequence of events $A_F \triangleq \{T_n[F]^2 > \epsilon F\}$. By Markov's inequality,
$$\Pr[A_F] \le \frac{\mathbb{E}(T_n[F]^4)}{\epsilon^2F^2} \le \frac{C}{\epsilon^2F^2}.$$
Thus,
$$\sum_{F=1}^{\infty} \Pr[A_F] \le \frac{C}{\epsilon^2}\sum_{F=1}^{\infty}\frac{1}{F^2} \le \frac{2C}{\epsilon^2} < \infty.$$
By the Borel-Cantelli lemma (Lemma 1.6.1 in [Dur13]), $\Pr[A_F \text{ occurs infinitely often}] = 0$, which implies
$$\Pr\left[\limsup_{F\to\infty} \frac{T_n[F]^2}{F} > \epsilon\right] = 0.$$
Since $\epsilon$ is arbitrary, this implies (3.37); (3.36) is proved similarly. Thus, (3.35) holds and
$$\limsup_{F\to\infty} a[F] \le V(W_n^* + E_n^* + G_n^*) + \Psi_n.$$
2. Bound for $\limsup_{F\to\infty} b[F]$:
$$\limsup_{F\to\infty} b[F] \le \limsup_{F\to\infty}\left(\frac{1}{t_F^n}\sum_{t=0}^{t_F^n-1} Y(t)\cdot\frac{t_F^n}{t_{F+1}^n}\right) + \limsup_{F\to\infty} \frac{1}{t_{F+1}^n}\left(B_2T_n[F] + \frac{B_0}{2}T_n[F]^2\right) \le \limsup_{F\to\infty}\left(\frac{1}{t_F^n}\sum_{t=0}^{t_F^n-1} Y(t)\right)\cdot\limsup_{F\to\infty}\frac{t_F^n}{t_{F+1}^n} \le \left(V(W_n^* + E_n^* + G_n^*) + \Psi_n\right)\cdot\limsup_{F\to\infty}\frac{t_F^n}{t_{F+1}^n} \le V(W_n^* + E_n^* + G_n^*) + \Psi_n,$$
where the second inequality follows from (3.35), the third from Corollary 3.3.1, and the last from the fact that $V(W_n^* + E_n^* + G_n^*) + \Psi_n > 0$. Altogether, this proves the lemma.

Appendix D — Proof of Theorem 3.3.1

Proof. Define the drift-plus-penalty (DPP) expression $P(t)$ as
$$P(t) = V\left(c(t)r(t) + \sum_{n=1}^N (W_n(t) + e_nH_n(t) + g_n(t))\right) + \frac{1}{2}\sum_{n=1}^N \left(Q_n(t+1)^2 - Q_n(t)^2\right).$$
By simple algebra using the queue update rule (3.1), we can upper bound $P(t)$ as follows:
$$P(t) \le \frac{1}{2}\sum_{n=1}^N (R_n(t)+\mu_n)^2 + V\left(c(t)r(t) + \sum_{n=1}^N (W_n(t)+e_nH_n(t)+g_n(t))\right) + \sum_{n=1}^N Q_n(t)(R_n(t) - \mu_nH_n(t))$$
$$\le B_3 + V\left(c(t)r(t) + \sum_{n=1}^N (W_n(t)+e_nH_n(t)+g_n(t))\right) + \sum_{n=1}^N Q_n(t)(R_n(t) - \mu_nH_n(t))$$
$$\le B_3 + Vc(t)r(t) + \sum_{n=1}^N Q_n(t)(R_n(t) - R_n^*) + V\sum_{n=1}^N (W_n(t)+e_nH_n(t)+g_n(t)) + \sum_{n=1}^N Q_n(t)(\mu_n^* - \mu_nH_n(t)),$$
where $B_3 = \frac{1}{2}\sum_{n=1}^N (R_{\max}+\mu_n)^2$, and the last inequality follows from adding $\sum_{n=1}^N Q_n(t)\mu_n^*$ and subtracting $\sum_{n=1}^N Q_n(t)R_n^*$, using the fact that the best randomized stationary algorithm must also satisfy the constraint (3.5), i.e., $\mu_n^* \ge R_n^*$. Now we take the partial average of $P(t)$ from 0 to $T-1$ and take $\limsup_{T\to\infty}$:
$$\limsup_{T\to\infty}\frac{1}{T}\sum_{t=0}^{T-1} P(t) \le B_3 + \limsup_{T\to\infty}\frac{1}{T}\sum_{t=0}^{T-1}\left(Vc(t)r(t) + \sum_{n=1}^N Q_n(t)(R_n(t)-R_n^*)\right) + \sum_{n=1}^N \limsup_{T\to\infty}\frac{1}{T}\sum_{t=0}^{T-1}\left(V(W_n(t)+e_nH_n(t)+g_n(t)) + Q_n(t)(\mu_n^*-\mu_nH_n(t))\right). \quad (3.38)$$
According to (3.29),
$$\limsup_{T\to\infty}\frac{1}{T}\sum_{t=0}^{T-1}\left(Vc(t)r(t) + \sum_{n=1}^N Q_n(t)(R_n(t)-R_n^*)\right) \le VC^*. \quad (3.39)$$
On the other hand,
$$\limsup_{T\to\infty}\frac{1}{T}\sum_{t=0}^{T-1}\left(V(W_n(t)+e_nH_n(t)+g_n(t)) + Q_n(t)(\mu_n^*-\mu_nH_n(t))\right) \le \limsup_{T\to\infty}\frac{1}{T}\sum_{t=0}^{T-1}\left(V(W_n(t)+e_nH_n(t)+g_n(t)) + Q_n(t_f^n)(\mu_n^*-\mu_nH_n(t)) + (t-t_f^n)B_0\right) \le V(W_n^*+E_n^*+G_n^*) + \Psi_n, \quad (3.40)$$
where $B_0 = \frac{1}{2}(R_{\max}+\mu_{\max})\mu_{\max}$ as defined below (3.15). The first inequality follows from the fact that for any $t \in [t_f^n, t_{f+1}^n)$,
$$Q_n(t)(\mu_n^*-\mu_nH_n(t)) \le Q_n(t_f^n)(\mu_n^*-\mu_nH_n(t)) + (Q_n(t)-Q_n(t_f^n))(\mu_n^*-\mu_nH_n(t)) \le Q_n(t_f^n)(\mu_n^*-\mu_nH_n(t)) + \sum_{s=t_f^n}^{t_{f+1}^n-1}(R_n(s)-\mu_nH_n(s))(\mu_n^*-\mu_nH_n(t)) \le Q_n(t_f^n)(\mu_n^*-\mu_nH_n(t)) + (t-t_f^n)B_0,$$
and the second inequality follows from Lemma 3.3.5. Substituting (3.39) and (3.40) into (3.38) gives
$$\limsup_{T\to\infty}\frac{1}{T}\sum_{t=0}^{T-1} P(t) \le V\left(C^* + \sum_{n=1}^N (W_n^*+E_n^*+G_n^*)\right) + B_3 + \sum_{n=1}^N \Psi_n. \quad (3.41)$$
Finally, notice that by telescoping sums,
$$\limsup_{T\to\infty}\frac{1}{T}\sum_{t=0}^{T-1} P(t) = \limsup_{T\to\infty}\left(\frac{V}{T}\sum_{t=0}^{T-1}\left(c(t)r(t) + \sum_{n=1}^N (W_n(t)+e_nH_n(t)+g_n(t))\right) + \frac{1}{2T}\sum_{n=1}^N Q_n(T)^2\right) \ge V\cdot\limsup_{T\to\infty}\frac{1}{T}\sum_{t=0}^{T-1}\left(c(t)r(t) + \sum_{n=1}^N (W_n(t)+e_nH_n(t)+g_n(t))\right).$$
Substituting the above inequality into (3.41) and dividing both sides by $V$ gives the desired result.

Chapter 4

Power Aware Wireless File Downloading and Restless Bandit via Renewal Optimization

In this chapter, we look at another application of renewal optimization, namely wireless file downloading. We start with a simple single-user file downloading problem and show that it can be characterized by a two-state Markov decision process (MDP) with constraints, to which the drift-plus-penalty (DPP) ratio algorithm (Algorithm 1) applies. We then consider a more realistic multi-user file downloading problem and show that it is a constrained version of the well-known restless bandit problem, for which we develop a DPP ratio indexing heuristic based on coupled renewal optimization.

4.1 System model and problem formulation

Consider a wireless access point, such as a base station or femto node, that delivers files to $N$ different wireless users. The system operates in slotted time with time slots $t \in \{0, 1, 2, \ldots\}$. Each user can download at most one file at a time. File sizes are random, and complete delivery of a file requires a random number of time slots. A new file request is made by each user at a random time after it finishes its previous download. Let $F_n(t) \in \{0, 1\}$ represent the binary file state process for user $n \in \{1, \ldots, N\}$. The state $F_n(t) = 1$ means that user $n$ is currently downloading a file, while the state $F_n(t) = 0$ means that user $n$ is currently idle. Idle times are assumed to be independent and geometrically distributed with parameter $\lambda_n$ for each user $n$, so that the average idle time is $1/\lambda_n$. Active times depend on the random file size and the transmission decisions that are made. Every slot $t$, the access point observes which users are active and decides to serve a subset of at most $M$ users, where $M$ is the maximum number of simultaneous transmissions allowed in the system ($M < N$ is assumed throughout). The goal is to maximize a weighted sum of throughput subject to a total average power constraint.

The file state processes $F_n(t)$ are coupled controlled Markov chains that form a total state $(F_1(t), \ldots, F_N(t))$, which can be viewed as a restless multi-armed bandit system. Such problems are complex due to the inherent curse of dimensionality. We first compute an online optimal algorithm for 1-user systems, i.e., the case $N = 1$. This simple case avoids the curse of dimensionality and provides valuable intuition. The optimal policy here is computed via the drift-plus-penalty (DPP) ratio algorithm. The resulting algorithm makes a greedy transmission decision that affects success probability and power usage. Next, the algorithm is extended as a low-complexity online heuristic for the N-user problem, which we call "DPP ratio indexing." The heuristic has the following desirable properties:
• Implementation of the N-user heuristic is as simple as comparing indices for N different 1-user problems.
• The N-user heuristic is analytically shown to meet the desired average power constraint.
• The N-user heuristic is shown in simulation to perform well over a wide range of parameters. Specifically, it is very close to optimal for example cases where an offline optimal can be computed.
• The N-user heuristic is shown to be optimal in a special case with no power constraint and with certain additional assumptions. The optimality proof uses a theory of stochastic coupling for queueing systems [TE93]. Prior work on wireless optimization uses Lyapunov functions to maximize throughput in cases where the users are assumed to have an infinite amount of data to send [NML08, ES07, GNT + 06, Sto05, TE93], or when data arrives according to a fixed rate process that does not depend on delays in the network (which necessitates dropping data if the arrival rate vector is outside of the capacity region, e.g. [NML08]). These models do not consider the interplay between arrivals at the transport layer and file delivery at the network layer. For example, a web user in a coffee shop may want to evaluate the file she downloaded before initiating another download. The current work captures this interplay through the binary file state processes F n (t). This creates 92 a complex problem of coupled Markov chains. This problem is fundamental to file downloading systems. The modeling and analysis of these systems is a significant contribution of the current thesis. To understand this issue, suppose the data arrival rate is fixed and does not adapt to the service received over the network. If this arrival rate exceeds network capacity by a factor of two, then at least half of all data must be dropped. This can result in an unusable data stream, possibly one that contains every odd-numbered packet. A more practical model assumes that full files must be downloaded and that new downloads are only initiated when previous ones are completed. A general model in this direction would allow each user to download up to K files simultaneously. This thesis considers the case K = 1, so that each user is either actively downloading a file, or is idle. 1 The resulting system forN users has a nontrivial Markov structure with 2 N states. Since the current problem includes both time-average constraints (on average power expen- diture) and instantaneous constraints which restrict the number of users that can be served on one slot, it is more complicated than the weakly coupled systems discussed in previous chapters. More specifically, The latter service restriction is similar to a traditional restless multi-armed bandit (RMAB) system [Whi88]. RMAB problem considers a population ofN parallel MDPs that continue evolving whether in operation or not (although in different rules). The goal is to choose the MDPs in operation during each time slot so as to maximize the expected reward subject to a constraint on the number of MDPs in operation. The problem is in general complex (see P-SPACE hardness results in [PT99]). A standard low-complexity heuristic for such problems is the Whittle’s index technique [Whi88]. However, the Whittle’s index framework applies only when there are two options on each state (active and passive). Further, it does not consider the additional time average cost constraints. The DPP ratio indexing algorithm developed in the current work can be viewed as an alternative indexing scheme that can always be implemented and that incorporates additional time average constraints. It is likely that the techniques of the current work can be extended to other constrained RMAB problems. Prior work in [TE93] develops a Lyapunov drift method 1 One way to allow a usern to download up toK files simultaneously is as follows: DefineK virtual users with separate binary file state processes. 
The transition probability from idle to active in each of these virtual users is λn/K. The conditional rate of total new arrivals for user n (given that m files are currently in progress) is then λn(1−m/K) for m∈{0, 1,...,M}. 93 for queue stability, and work in [Nee10b] develops a drift-plus-penalty (DPP) ratio method for optimization over renewal systems. The current work is the first to use these techniques as a low complexity heuristic for multidimensional Markov problems. Work in [TE93] uses the theory of stochastic coupling to show that a longest connected queue algorithm is delay optimal in a multi-dimensional queueing system with special symmetric assumptions. The problem in [TE93] is different from that of the current work. However, a similar coupling approach is used below to show that, for a special case with no power constraint, the DPP ratio indexing algorithm is throughput optimal in certain asymmetric cases. As a consequence, the proof shows the policy is also optimal for a different setting with M servers, N single-buffer queues, and arbitrary packet arrival rates (λ 1 ,...,λ N ). 4.2 Single user scenario Consider a file downloading system that consists of only one user that repeatedly downloads files. Let F (t)∈{0, 1} be the file state process of the user. State “1” means there is a file in the system that has not completed its download, and “0” means no file is waiting. The length of each file is independent and is either exponentially distributed or geometrically distributed (described in more detail below). Let B denote the expected file size in bits. Time is slotted. At each slot in which there is an active file for downloading, the user makes a service decision that affects both the downloading success probability and the power expenditure. After a file is downloaded, the system goes idle (state 0) and remains in the idle state for a random amount of time that is independent and geometrically distributed with parameter λ> 0. A transmission decision is made on each slot t in which F (t) = 1. The decision affects the number of bits that are sent, the probability these bits are successfully received, and the power usage. Let α(t) denote the decision variable at slot t and letA represent an abstract action set. The setA can represent a collection of modulation and coding options for each transmission. Assume also thatA contains an idle action denoted as “0. ” The decision α(t) determines the following two values: • The probability of successfully downloading a fileφ(α(t)), whereφ(·)∈ [0, 1] withφ(0) = 0. • The power expenditure p(α(t)), where p(·) is a nonnegative function with p(0) = 0. 94 The user choosesα(t) = 0 wheneverF (t) = 0. The user choosesα(t)∈A for each slott in which F (t) = 1, with the goal of maximizing throughput subject to a time average power constraint. The example where the decision setA is finite can be found in the simulation experiment section. Here is a simple example where the decision can be continuous: Example 1. LetA be the set of all possible power allocation options, i.e.A := [p min ,p max ]∪{0} where p min ,p max > 0 are constants. Then, α(t)∈ [p min ,p max ]∪{0}, p(α(t)) = α(t) and the success probability of downloading a file can be φ(α(t)) = 1− exp(−α(t)). The problem can be described by a two state Markov decision process with binary stateF (t). Given F (t) = 1, a file is currently in the system. This file will finish its download at the end of the slot with probability φ(α(t)). 
Hence, the transition probabilities out of state 1 are: Pr[F (t + 1) = 0|F (t) = 1] = φ(α(t)) (4.1) Pr[F (t + 1) = 1|F (t) = 1] = 1−φ(α(t)) (4.2) Given F (t) = 0, the system is idle and will transition to the active state in the next slot with probability λ, so that: Pr[F (t + 1) = 1|F (t) = 0] = λ (4.3) Pr[F (t + 1) = 0|F (t) = 0] = 1−λ (4.4) Define the throughput, measured by bits per slot, as: lim inf T→∞ 1 T T−1 X t=0 Bφ(α(t)) The file downloading problem reduces to the following: Maximize: lim inf T→∞ 1 T T−1 X t=0 Bφ(α(t)) (4.5) Subject to: lim sup T→∞ 1 T T−1 X t=0 p(α(t))≤β (4.6) α(t)∈A∀t∈{0, 1, 2,...} such that F (t) = 1 (4.7) Transition probabilities satisfy (4.1)-(4.4) (4.8) 95 where β is a positive constant that determines the desired average power constraint. 4.2.1 The memoryless file size assumption The above model assumes that file completion success on slott depends only on the transmis- sion decision α(t), independent of history. This implicitly assumes that file length distributions have a memoryless property where the residual file length is independent of the amount already delivered. Further, it is assumed that if the controller selects a transmission rate that is larger than the residual bits in the file, the remaining portion of the transmission is padded with fill bits. This ensures error events provide no information about the residual file length beyond the already known 0/1 binary file state. Of course, error probability might be improved by removing padded bits. However, this affects only the last transmission of a file and has negligible impact when expected file size is large in comparison to the amount that can be transmitted in one slot. Note that padding is not needed in the special case when all transmissions send one fixed length packet. The memoryless property holds when each file i has independent length B i that is exponen- tially distributed with mean length B bits, so that: Pr[B i >x] =e −x/B for x> 0 For example, suppose the transmission rate r(t) (in units of bits/slot) and the transmission success probability q(t) are given by general functions of α(t): r(t) = ˆ r(α(t)) q(t) = ˆ q(α(t)) Then the file completion probabilityφ(α(t)) is the probability that the residual amount of bits in the file is less than or equal to r(t), and that the transmission of these residual bits is a success. By the memoryless property of the exponential distribution, the residual file length is distributed 96 the same as the original file length. Thus: φ(α(t)) = ˆ q(α(t))Pr[B i ≤ ˆ r(α(t))] = ˆ q(α(t)) Z ˆ r(α(t)) 0 1 B e −x/B dx (4.9) Alternatively, history independence holds when each file i consists of a random number Z i of fixed length packets, where Z i is geometrically distributed with mean Z = 1/μ. Assume each transmission sends exactly one packet, but different power levels affect the transmission success probability q(t) = ˆ q(α(t)). Then: φ(α(t)) =μˆ q(α(t)) (4.10) The memoryless file length assumption allows the file state to be modeled by a simple binary- valued processF (t)∈{0, 1}. However, actual file sizes may not have an exponential or geometric distribution. One way to treat general distributions is to approximate the file sizes as being memoryless by using a φ(α(t)) function defined by either (4.9) or (4.10), formed by matching the average file size B or average number of packets Z. The decisions α(t) are made according to the algorithm below, but the actual event outcomes that arise from these decisions are not memoryless. 
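As a quick illustration of the two memoryless models, the following sketch evaluates the file-completion probability φ for exponential file lengths, per (4.9), and for geometric packet counts, per (4.10); here r_hat and q_hat stand for the rate and link-success values of some chosen action, and all numbers are illustrative.

import math

def phi_exponential(r_hat, q_hat, B):
    # (4.9): prob. the residual length <= r_hat (memoryless), times link success
    return q_hat * (1.0 - math.exp(-r_hat / B))

def phi_geometric(q_hat, Z):
    # (4.10): one fixed-size packet per slot; file has mean Z packets (mu = 1/Z)
    return q_hat / Z

print(phi_exponential(r_hat=5e4, q_hat=0.9, B=1e5))  # 0.9 * (1 - e^{-0.5})
print(phi_geometric(q_hat=0.9, Z=20.0))              # 0.045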
A simulation comparison of this approximation is provided in Section 4.5, where it is shown to be remarkably accurate (see Fig. 4.7). The algorithm in this section optimizes over the class of all algorithms that do not use residual file length information. This maintains low complexity by ensuring each user has a binary-valued Markov state $F(t) \in \{0, 1\}$. While a system controller might know the residual file length, incorporating this knowledge creates a Markov decision problem with an infinite number of states (one for each possible value of the residual length), which significantly complicates the scenario.

4.2.2 DPP ratio optimization

This subsection develops an online algorithm for problem (4.5)-(4.8). The algorithm follows from Algorithm 1 in Chapter 1, with some customizations for this application. First, notice that file state "1" is recurrent under any decisions for $\alpha(t)$. Denote by $t_k$ the $k$-th time the system returns to state "1." Define the renewal frame as the time period between $t_k$ and $t_{k+1}$, with frame size
$$T[k] = t_{k+1} - t_k.$$
Notice that $T[k] = 1$ for any frame $k$ in which the file does not complete its download. If the file is completed during frame $k$, then $T[k] = 1 + G_k$, where $G_k$ is a geometric random variable with mean $\mathbb{E}(G_k) = 1/\lambda$. Each frame $k$ involves only a single decision $\alpha(t_k)$, made at the beginning of the frame. Thus, the total power used over the duration of frame $k$ is
$$\sum_{t=t_k}^{t_{k+1}-1} p(\alpha(t)) = p(\alpha(t_k)). \quad (4.11)$$
We treat the time average constraint (4.6) using a virtual queue $Q[k]$ that is updated every frame $k$ by
$$Q[k+1] = \max\{Q[k] + p(\alpha(t_k)) - \beta T[k], \; 0\}, \quad (4.12)$$
with initial condition $Q[0] = 0$. The algorithm is parameterized by a constant $V \ge 0$ which affects a performance tradeoff. At the beginning of the $k$-th renewal frame, the user observes the virtual queue $Q[k]$ and chooses $\alpha(t_k)$ to maximize the following drift-plus-penalty (DPP) ratio:
$$\max_{\alpha(t_k) \in \mathcal{A}} \; \frac{VB\phi(\alpha(t_k)) - Q[k]p(\alpha(t_k))}{\mathbb{E}[T[k] \mid \alpha(t_k)]}. \quad (4.13)$$
The numerator of the above ratio adds a "queue drift term" $-Q[k]p(\alpha(t_k))$ to the "current reward term" $VB\phi(\alpha(t_k))$. The intuition is that it is desirable to have a large value of current reward, but it is also desirable to have a large drift (since this tends to decrease queue size). Creating a weighted sum of these two terms and dividing by the expected frame size gives a simple index. The next subsections show that, in the context of the current work, this index leads to an algorithm that pushes throughput arbitrarily close to optimal (depending on the chosen $V$ parameter), with a strong sample path guarantee on average power expenditure. The denominator in (4.13) is easily computed via the transition model (4.1)-(4.4):
$$\mathbb{E}[T[k] \mid \alpha(t_k)] = 1 - \phi(\alpha(t_k)) + \phi(\alpha(t_k))\left(1 + \frac{1}{\lambda}\right) = 1 + \frac{\phi(\alpha(t_k))}{\lambda}. \quad (4.14)$$
Thus, (4.13) is equivalent to
$$\max_{\alpha(t_k) \in \mathcal{A}} \; \frac{VB\phi(\alpha(t_k)) - Q[k]p(\alpha(t_k))}{1 + \phi(\alpha(t_k))/\lambda}. \quad (4.15)$$
This gives the following Algorithm 5 for the single-user case:

Algorithm 5.
• At each time $t_k$, the user observes the virtual queue $Q[k]$ and chooses $\alpha(t_k)$ as the solution to (4.15) (with ties broken arbitrarily).
• The value $Q[k+1]$ is computed according to (4.12) at the end of the $k$-th frame.

The expected performance analysis of this algorithm follows from that of Section 1.1.4, and we omit the details for brevity. In the following, we give a stronger probability 1 performance analysis that takes into account the special properties of the algorithm in this customized setting.
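A compact sketch of the per-frame step of Algorithm 5 for a finite action set, where each action maps to a (completion probability, power) pair; the action names and numbers are illustrative.

def dpp_ratio_action(Q, V, B, lam, actions):
    # Maximize (4.15): [V*B*phi(a) - Q*p(a)] / (1 + phi(a)/lam) over a in A.
    best_a, best_val = None, float("-inf")
    for a, (phi, p) in actions.items():
        val = (V * B * phi - Q * p) / (1.0 + phi / lam)
        if val > best_val:
            best_a, best_val = a, val
    return best_a

def queue_update(Q, p_used, T_frame, beta):
    # Virtual queue update (4.12) at the end of the frame.
    return max(Q + p_used - beta * T_frame, 0.0)

acts = {"idle": (0.0, 0.0), "low": (0.3, 1.0), "high": (0.7, 3.0)}
a = dpp_ratio_action(Q=0.0, V=100.0, B=8.0, lam=0.25, actions=acts)
print(a)  # with an empty queue the high-phi action wins; larger Q favors "idle"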
4.2.3 Average power constraints via queue bounds In this section, we show that the proposed algorithm makes the virtual queue deterministically bounded. Lemma 4.2.1. If there is a constant C≥ 0 such that Q[k]≤C for all k∈{0, 1, 2,...}, then: lim sup T→∞ 1 T T−1 X t=0 p(α(t))≤β Proof. From (4.12), we know that for each frame k: Q[k + 1]≥Q[k] +p(α(t k ))−T [k]β Rearranging terms and using T [k] =t k+1 −t k gives: p(α(t k ))≤ (t k+1 −t k )β +Q[k + 1]−Q[k] Fix K > 0. Summing over k∈{0, 1,··· ,K− 1} gives: K−1 X k=0 p(α(t k )) ≤ (t K −t 0 )β +Q[K]−Q[0] ≤ t K β +C The sum power over the first K frames is the same as the sum up to time t K − 1, and so: t K −1 X t=0 p(α(t))≤t K β +C 99 Dividing by t K gives: 1 t K t K −1 X t=0 p(α(t))≤β +C/t K . Taking K→∞, then, lim sup K→∞ 1 t K t K −1 X t=0 p(α(t))≤β (4.16) Now for each positive integer T , let K(T ) be the integer such that t K(T) ≤T <t K(T)+1 . Since power is only used at the first slot of a frame, one has: 1 T T−1 X t=0 p(α(t))≤ 1 t K(T) t K(T) −1 X t=0 p(α(t)) Taking a lim sup as T→∞ and using (4.16) yields the result. In order to show that the queue process under our proposed algorithm is deterministically bounded, we need the following assumption: Assumption 4.2.1. The following quantities are finite and strictly positive: p min = min α∈A\{0} p(α) p max = max α∈A\{0} p(α). Lemma 4.2.2. Suppose Assumption 4.2.1 holds. If Q[0] = 0, then under our algorithm we have for all k> 0: Q[k]≤ max VB p min +p max −β, 0 Proof. First, consider the case when p max ≤β. From (4.12) and the fact that T [k]≥ 1 for allk, it is clear the queue can never increase, and so Q[k]≤Q[0] = 0 for all k> 0. Next, consider the case whenp max >β. We prove the assertion by induction onk. The result trivially holds for k = 0. Suppose it holds at k =l for l> 0, so that: Q[l]≤ VB p min +p max −β We are going to prove that the same holds for k =l + 1. There are two cases: 100 1. Q[l]≤ VB p min . In this case we have by (4.12): Q[l + 1] ≤ Q[l] +p max −β ≤ VB p min +p max −β 2. VB p min <Q[l]≤ VB p min +p max −β. In this case, we use proof by contradiction. If p(α(t l )) = 0 then the queue cannot increase, so: Q[l + 1]≤Q[l]≤ VB p min +p max −β On the other hand, if p(α(t l )) > 0 then p(α(t l ))≥ p min and so the numerator in (4.15) satisfies: VBφ(α(t l ))−Q[l]p(α(t l )) ≤ VB−Q[l]p min < 0 and so the maximizing ratio in (4.15) is negative. However, the maximizing ratio in (4.15) cannot be negative because the alternative choice α(t l ) = 0 increases the ratio to 0. This contradiction implies that we cannot have p(α(t l ))> 0. The above is a sample path result that only assumes parameters satisfy λ > 0, B > 0, and 0≤ φ(·)≤ 1. Thus, the algorithm meets the average power constraint even if it uses incorrect values for these parameters. The next subsection provides a throughput optimality result when these parameters match the true system values. 4.2.4 Optimality over randomized algorithms Consider the following class of i.i.d. randomized algorithms: Let θ(α) be non-negative num- bers defined for each α∈A, and suppose they satisfy P α∈A θ(α) = 1. Let α ∗ (t) represent a policy that, every slot t for which F (t) = 1, chooses α ∗ (t)∈A by independently selecting strategy α with probability θ(α). Then (p(α ∗ (t k )),φ(α ∗ (t k ))) are independent and identically distributed (i.i.d.) over frames k. 
4.2.4 Optimality over randomized algorithms

Consider the following class of i.i.d. randomized algorithms: Let $\theta(\alpha)$ be non-negative numbers defined for each $\alpha\in\mathcal{A}$, and suppose they satisfy $\sum_{\alpha\in\mathcal{A}}\theta(\alpha)=1$. Let $\alpha^*(t)$ represent a policy that, on every slot $t$ for which $F(t)=1$, chooses $\alpha^*(t)\in\mathcal{A}$ by independently selecting strategy $\alpha$ with probability $\theta(\alpha)$. Then $(p(\alpha^*(t_k)),\phi(\alpha^*(t_k)))$ are independent and identically distributed (i.i.d.) over frames $k$. Under this algorithm, it follows by the law of large numbers that the throughput and power expenditure satisfy (with probability 1):
$$\lim_{T\to\infty} \frac{1}{T}\sum_{t=0}^{T-1} B\phi(\alpha^*(t)) = \frac{B\,\mathbb{E}(\phi(\alpha^*(t_k)))}{1+\mathbb{E}(\phi(\alpha^*(t_k)))/\lambda},$$
$$\lim_{T\to\infty} \frac{1}{T}\sum_{t=0}^{T-1} p(\alpha^*(t)) = \frac{\mathbb{E}(p(\alpha^*(t_k)))}{1+\mathbb{E}(\phi(\alpha^*(t_k)))/\lambda}.$$
It can be shown that optimality of problem (4.5)-(4.8) can be achieved over this class. Thus, there exists an i.i.d. randomized algorithm $\alpha^*(t)$ that satisfies:
$$\frac{B\,\mathbb{E}(\phi(\alpha^*(t_k)))}{1+\mathbb{E}(\phi(\alpha^*(t_k)))/\lambda} = \mu^* \quad (4.17)$$
$$\frac{\mathbb{E}(p(\alpha^*(t_k)))}{1+\mathbb{E}(\phi(\alpha^*(t_k)))/\lambda} \leq \beta \quad (4.18)$$
where $\mu^*$ is the optimal throughput for the problem (4.5)-(4.8).

4.2.5 Key feature of the drift-plus-penalty ratio

Define $\mathcal{F}(t_k)$ as the system history up to frame $k$, which includes the actions taken $\alpha(t_0),\cdots,\alpha(t_{k-1})$, the frame lengths $T[0],\cdots,T[k-1]$, the busy period in each frame, the idle period in each frame, and the queue value $Q[k]$ (since this is determined by the random events before frame $k$). Consider the algorithm that, on frame $k$, observes $Q[k]$ and chooses $\alpha(t_k)$ according to (4.15). The following key feature of this algorithm can be shown (see [Nee10b] for related results):
$$\frac{\mathbb{E}\left(-VB\phi(\alpha(t_k)) + Q[k]p(\alpha(t_k)) \mid \mathcal{F}(t_k)\right)}{\mathbb{E}\left(1+\phi(\alpha(t_k))/\lambda \mid \mathcal{F}(t_k)\right)} \leq \frac{\mathbb{E}\left(-VB\phi(\alpha^*(t_k)) + Q[k]p(\alpha^*(t_k)) \mid \mathcal{F}(t_k)\right)}{\mathbb{E}\left(1+\phi(\alpha^*(t_k))/\lambda \mid \mathcal{F}(t_k)\right)}$$
where $\alpha^*(t_k)$ is any (possibly randomized) alternative decision that is based only on $\mathcal{F}(t_k)$. This is an intuitive property: By design, the algorithm in (4.15) observes $\mathcal{F}(t_k)$ and then chooses a particular action $\alpha(t_k)$ to minimize the ratio over all deterministic actions. Thus, as can be shown, it also minimizes the ratio over all potentially randomized actions. Using the (randomized) i.i.d. decision $\alpha^*(t_k)$ from (4.17)-(4.18) in the above, and noting that this alternative decision is independent of $\mathcal{F}(t_k)$, gives:
$$\frac{\mathbb{E}\left(-VB\phi(\alpha(t_k)) + Q[k]p(\alpha(t_k)) \mid \mathcal{F}(t_k)\right)}{\mathbb{E}\left(1+\phi(\alpha(t_k))/\lambda \mid \mathcal{F}(t_k)\right)} \leq -V\mu^* + Q[k]\beta. \quad (4.19)$$

4.2.6 Performance theorem

Theorem 4.2.1. Suppose Assumption 4.2.1 holds. The proposed algorithm achieves the constraint $\limsup_{T\to\infty}\frac{1}{T}\sum_{t=0}^{T-1}p(\alpha(t))\leq\beta$ and yields throughput satisfying (with probability 1):
$$\liminf_{T\to\infty} \frac{1}{T}\sum_{t=0}^{T-1} B\phi(\alpha(t)) \geq \mu^* - \frac{C_0}{V} \quad (4.20)$$
where $C_0$ is a constant that is independent of $V$ and is given in the proof.

Proof. First, for any fixed $V$, Lemma 4.2.2 implies that the queue is deterministically bounded. Thus, according to Lemma 4.2.1, the proposed algorithm achieves the constraint
$$\limsup_{T\to\infty} \frac{1}{T}\sum_{t=0}^{T-1} p(\alpha(t)) \leq \beta.$$
The rest is devoted to proving the throughput guarantee (4.20). Define:
$$L(Q[k]) = \frac{1}{2}Q[k]^2.$$
We call this a Lyapunov function. Define a frame-based Lyapunov drift as:
$$\Delta[k] = L(Q[k+1]) - L(Q[k]).$$
According to (4.12) we get $Q[k+1]^2 \leq (Q[k]+p(\alpha(t_k))-T[k]\beta)^2$. Thus:
$$\Delta[k] \leq \frac{(p(\alpha(t_k))-T[k]\beta)^2}{2} + Q[k]\left(p(\alpha(t_k))-T[k]\beta\right).$$
Taking a conditional expectation of the above given $\mathcal{F}(t_k)$, and recalling that $\mathcal{F}(t_k)$ includes the information $Q[k]$, gives:
$$\mathbb{E}(\Delta[k]\mid\mathcal{F}(t_k)) \leq C_0 + Q[k]\,\mathbb{E}(p(\alpha(t_k))-\beta T[k]\mid\mathcal{F}(t_k)) \quad (4.21)$$
where $C_0$ is a constant that satisfies the following for all possible histories $\mathcal{F}(t_k)$:
$$\mathbb{E}\left(\frac{(p(\alpha(t_k))-T[k]\beta)^2}{2} \,\Big|\, \mathcal{F}(t_k)\right) \leq C_0.$$
Such a constant $C_0$ exists because the power $p(\alpha(t_k))$ is deterministically bounded due to Assumption 4.2.1, and the frame sizes $T[k]$ are bounded in second moment regardless of history according to (4.14).
Adding the "penalty" $-\mathbb{E}(VB\phi(\alpha(t_k))\mid\mathcal{F}(t_k))$ to both sides of (4.21) gives:
$$\mathbb{E}\left(\Delta[k]-VB\phi(\alpha(t_k))\mid\mathcal{F}(t_k)\right) \leq C_0 + \mathbb{E}\left(-VB\phi(\alpha(t_k)) + Q[k](p(\alpha(t_k))-\beta T[k])\mid\mathcal{F}(t_k)\right)$$
$$= C_0 - Q[k]\beta\,\mathbb{E}(T[k]\mid\mathcal{F}(t_k)) + \mathbb{E}(T[k]\mid\mathcal{F}(t_k))\cdot\frac{\mathbb{E}\left(-VB\phi(\alpha(t_k)) + Q[k]p(\alpha(t_k))\mid\mathcal{F}(t_k)\right)}{\mathbb{E}(T[k]\mid\mathcal{F}(t_k))}.$$
Expanding $T[k]$ in the denominator of the last term gives:
$$\mathbb{E}\left(\Delta[k]-VB\phi(\alpha(t_k))\mid\mathcal{F}(t_k)\right) \leq C_0 - Q[k]\beta\,\mathbb{E}(T[k]\mid\mathcal{F}(t_k)) + \mathbb{E}(T[k]\mid\mathcal{F}(t_k))\cdot\frac{\mathbb{E}\left(-VB\phi(\alpha(t_k)) + Q[k]p(\alpha(t_k))\mid\mathcal{F}(t_k)\right)}{\mathbb{E}(1+\phi(\alpha(t_k))/\lambda\mid\mathcal{F}(t_k))}.$$
Substituting (4.19) into the above expression gives:
$$\mathbb{E}\left(\Delta[k]-VB\phi(\alpha(t_k))\mid\mathcal{F}(t_k)\right) \leq C_0 - Q[k]\beta\,\mathbb{E}(T[k]\mid\mathcal{F}(t_k)) + \mathbb{E}(T[k]\mid\mathcal{F}(t_k))\left(-V\mu^* + \beta Q[k]\right) = C_0 - V\mu^*\,\mathbb{E}(T[k]\mid\mathcal{F}(t_k)). \quad (4.22)$$
Rearranging gives:
$$\mathbb{E}\left(\Delta[k] + V(\mu^* T[k] - B\phi(\alpha(t_k)))\mid\mathcal{F}(t_k)\right) \leq C_0. \quad (4.23)$$
This implies that $\Delta[k] + V(\mu^* T[k] - B\phi(\alpha(t_k))) - C_0$ is a supermartingale difference sequence. Furthermore, since we already know the queue $Q[k]$ is deterministically bounded, it follows that:
$$\sum_{k=1}^{\infty} \frac{\mathbb{E}\left(\Delta[k]^2\right)}{k^2} < \infty.$$
This, together with (4.23), implies by Lemma 3.3.3 that (with probability 1):
$$\limsup_{K\to\infty} \frac{1}{K}\sum_{k=0}^{K-1}\left(\mu^* T[k] - B\phi(\alpha(t_k))\right) \leq \frac{C_0}{V}.$$
Thus, for any $\epsilon>0$ one has for all sufficiently large $K$:
$$\frac{1}{K}\sum_{k=0}^{K-1}\left[\mu^* T[k] - B\phi(\alpha(t_k))\right] \leq \frac{C_0}{V} + \epsilon.$$
Rearranging implies that for all sufficiently large $K$:
$$\frac{\sum_{k=0}^{K-1} B\phi(\alpha(t_k))}{\sum_{k=0}^{K-1} T[k]} \geq \mu^* - \frac{C_0/V + \epsilon}{\frac{1}{K}\sum_{k=0}^{K-1}T[k]} \geq \mu^* - (C_0/V + \epsilon)$$
where the final inequality holds because $T[k]\geq 1$ for all $k$. Thus:
$$\liminf_{K\to\infty} \frac{\sum_{k=0}^{K-1} B\phi(\alpha(t_k))}{\sum_{k=0}^{K-1} T[k]} \geq \mu^* - (C_0/V + \epsilon).$$
The above holds for all $\epsilon>0$. Taking a limit as $\epsilon\to 0$ implies:
$$\liminf_{K\to\infty} \frac{\sum_{k=0}^{K-1} B\phi(\alpha(t_k))}{\sum_{k=0}^{K-1} T[k]} \geq \mu^* - C_0/V.$$
Notice that $\phi(\alpha(t))$ only changes at the boundary of each frame and remains 0 within the frame. Thus, we can replace the sum over frames $k$ by a sum over slots $t$. The desired result follows. $\Box$

The theorem shows that throughput can be pushed within $O(1/V)$ of the optimal value $\mu^*$, where $V$ can be chosen as large as desired to ensure throughput is arbitrarily close to optimal. The tradeoff is a queue bound that grows linearly with $V$ according to Lemma 4.2.2, which affects the convergence time required for the constraints to be close to the desired time averages (as described in the proof of Lemma 4.2.1).

4.3 Multi-user file downloading

Figure 4.1: A system with N users. The shaded node for each user n indicates the current file state F_n(t) of that user. There are 2^N different state vectors.

This section considers a multi-user file downloading system that consists of $N$ single-user subsystems. Each subsystem is similar to the single-user system described in the previous section. Specifically, for the $n$-th user (where $n\in\{1,\ldots,N\}$):
• The file state process is $F_n(t)\in\{0,1\}$.
• The transmission decision is $\alpha_n(t)\in\mathcal{A}_n$, where $\mathcal{A}_n$ is an abstract set of transmission options for user $n$.
• The power expenditure on slot $t$ is $p_n(\alpha_n(t))$.
• The success probability on a slot $t$ for which $F_n(t)=1$ is $\phi_n(\alpha_n(t))$, where $\phi_n(\cdot)$ is the function that describes file completion probability for user $n$.
• The idle period parameter is $\lambda_n>0$.
• The average file size is $B_n$ bits.

Assume that the random variables associated with different subsystems are mutually independent. The resulting Markov decision problem has $2^N$ states, as shown in Fig. 4.1. The transition probabilities for each active user depend on which users are selected for transmission and on the corresponding transmission modes.
This is a restless bandit system because there can also be transitions for non-selected users (specifically, it is possible to transition from inactive to active). To control the downloading process, there is a central server with only $M$ threads ($M<N$), meaning that at most $M$ jobs can be processed simultaneously. So at each time slot, the server must select at most $M$ of the $N$ users to transmit a portion of their files. These decisions are further restricted by a global time average power constraint. The goal is to maximize the aggregate throughput, defined as:
$$\liminf_{T\to\infty} \frac{1}{T}\sum_{t=0}^{T-1}\sum_{n=1}^{N} c_n B_n \phi_n(\alpha_n(t))$$
where $c_1,c_2,\ldots,c_N$ are a collection of positive weights that can be used to prioritize users. Thus, this multi-user file downloading problem reduces to the following:
$$\text{Max:}\quad \liminf_{T\to\infty} \frac{1}{T}\sum_{t=0}^{T-1}\sum_{n=1}^{N} c_n B_n\phi_n(\alpha_n(t)) \quad (4.24)$$
$$\text{S.t.:}\quad \limsup_{T\to\infty} \frac{1}{T}\sum_{t=0}^{T-1}\sum_{n=1}^{N} p_n(\alpha_n(t)) \leq \beta \quad (4.25)$$
$$\sum_{n=1}^{N} I(\alpha_n(t)) \leq M \quad \forall t\in\{0,1,2,\cdots\} \quad (4.26)$$
$$\Pr[F_n(t+1)=1\mid F_n(t)=0] = \lambda_n \quad (4.27)$$
$$\Pr[F_n(t+1)=0\mid F_n(t)=1] = \phi_n(\alpha_n(t)) \quad (4.28)$$
where the constraints (4.27)-(4.28) hold for all $n\in\{1,\ldots,N\}$ and $t\in\{0,1,2,\ldots\}$, and where $I(\cdot)$ is the indicator function defined as $I(x)=0$ if $x=0$, and $I(x)=1$ otherwise.

4.3.1 DPP ratio indexing algorithm

This section develops our indexing algorithm for the multi-user case using the single-user case as a stepping stone. The major difficulty is the instantaneous constraint $\sum_{n=1}^N I(\alpha_n(t))\leq M$. Temporarily neglecting this constraint, we use Lyapunov optimization to deal with the time average power constraint first. We introduce a virtual queue $Q(t)$, which is again 0 at $t=0$. Instead of updating it on a frame basis, the server updates this queue every slot as follows:
$$Q(t+1) = \max\left\{Q(t) + \sum_{n=1}^{N} p_n(\alpha_n(t)) - \beta,\ 0\right\}. \quad (4.29)$$
Define $\mathcal{N}(t)$ as the set of users beginning their renewal frames at time $t$, so that $F_n(t)=1$ for all such users. In general, $\mathcal{N}(t)$ is a subset of $\mathcal{N}=\{1,2,\cdots,N\}$. Define $|\mathcal{N}(t)|$ as the number of users in the set $\mathcal{N}(t)$. At each time slot $t$, the server observes the queue state $Q(t)$ and chooses $(\alpha_1(t),\ldots,\alpha_N(t))$ in a manner similar to the single-user case. Specifically, for each user $n\in\mathcal{N}(t)$ define:
$$g_n(\alpha_n(t)) \triangleq \frac{Vc_nB_n\phi_n(\alpha_n(t)) - Q(t)p_n(\alpha_n(t))}{1+\phi_n(\alpha_n(t))/\lambda_n}. \quad (4.30)$$
This is similar to the expression (4.15) used in the single-user optimization. Call $g_n(\alpha_n(t))$ a reward. Now define an index for each subsystem $n$ by:
$$\gamma_n(t) \triangleq \max_{\alpha_n(t)\in\mathcal{A}_n} g_n(\alpha_n(t)) \quad (4.31)$$
which is the maximum possible reward one can get from the $n$-th subsystem at time slot $t$. Thus, it is natural to define the following myopic algorithm: Find the (at most) $M$ subsystems in $\mathcal{N}(t)$ with the greatest rewards, and serve these with their corresponding optimal $\alpha_n(t)$ options in $\mathcal{A}_n$ that maximize $g_n(\alpha_n(t))$.

Algorithm 6.
• At each time slot $t$, the server observes the virtual queue state $Q(t)$ and computes the indices using (4.31) for all $n\in\mathcal{N}(t)$.
• Activate the $\min[M,|\mathcal{N}(t)|]$ subsystems with greatest indices, using their corresponding actions $\alpha_n(t)\in\mathcal{A}_n$ that maximize $g_n(\alpha_n(t))$.
• Update $Q(t)$ according to (4.29) at the end of each slot $t$.
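The per-slot logic of Algorithm 6 can be summarized by the following Python sketch (a minimal illustration with names of our own choosing; it assumes finite action sets and that each $\mathcal{A}_n$ contains an idle action 0 with $p_n(0)=0$):

```python
import numpy as np

def algorithm6_slot(Q, active, V, c, B, lam, actions, p, phi, M, beta):
    """One slot of the DPP ratio indexing algorithm (Algorithm 6).

    Q: current virtual queue value; active: users n with F_n(t) = 1.
    c, B, lam: per-user weights, mean file sizes, idle rates (dicts).
    actions[n]: finite action set of user n; p[n][a], phi[n][a]: power
    and success probability of action a for user n.
    """
    index = {}
    for n in active:
        # reward g_n(a) from (4.30); index gamma_n(t) from (4.31)
        g = lambda a, n=n: ((V * c[n] * B[n] * phi[n][a] - Q * p[n][a])
                            / (1.0 + phi[n][a] / lam[n]))
        a_star = max(actions[n], key=g)
        index[n] = (g(a_star), a_star)
    # activate the min(M, |N(t)|) users with the greatest indices
    chosen = set(sorted(index, key=lambda n: index[n][0], reverse=True)[:M])
    alpha = {n: (index[n][1] if n in chosen else 0) for n in active}
    # virtual queue update (4.29); idle users consume no power since p_n(0) = 0
    total_power = sum(p[n][alpha[n]] for n in active)
    Q_next = max(Q + total_power - beta, 0.0)
    return alpha, Q_next
```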
4.3.2 Theoretical performance analysis

In this subsection, we show that the above algorithm always satisfies the desired time average power constraint. We adopt the following assumption:

Assumption 4.3.1. The following quantities are finite and strictly positive:
$$p_n^{\min} = \min_{\alpha_n\in\mathcal{A}_n\setminus\{0\}} p_n(\alpha_n), \qquad p^{\min} = \min_n p_n^{\min}, \qquad p_n^{\max} = \max_{\alpha_n\in\mathcal{A}_n} p_n(\alpha_n),$$
$$c^{\max} = \max_n c_n, \qquad B^{\max} = \max_n B_n.$$

Lemma 4.3.1. Suppose Assumption 4.3.1 holds. Then the queue $\{Q(t)\}_{t=0}^\infty$ is deterministically bounded under Algorithm 6. Specifically, we have for all $t\in\{0,1,2,\ldots\}$:
$$Q(t) \leq \max\left\{\frac{Vc^{\max}B^{\max}}{p^{\min}} + \sum_{n=1}^{N}p_n^{\max} - \beta,\ 0\right\}.$$

Proof. First, consider the case when $\sum_{n=1}^N p_n^{\max}\leq\beta$. Since $Q(0)=0$, it is clear from the updating rule (4.29) that $Q(t)$ will remain 0 for all $t$.

Next, consider the case when $\sum_{n=1}^N p_n^{\max}>\beta$. We prove the assertion by induction on $t$. The result trivially holds for $t=0$. Suppose at $t=t_0$ we have:
$$Q(t_0) \leq \frac{Vc^{\max}B^{\max}}{p^{\min}} + \sum_{n=1}^{N}p_n^{\max} - \beta.$$
We are going to prove that the same statement holds for $t=t_0+1$. We further divide it into two cases:

1. $Q(t_0)\leq \frac{Vc^{\max}B^{\max}}{p^{\min}}$. In this case, since the queue increases by at most $\sum_{n=1}^N p_n^{\max}-\beta$ in one slot, we have:
$$Q(t_0+1) \leq \frac{Vc^{\max}B^{\max}}{p^{\min}} + \sum_{n=1}^{N}p_n^{\max} - \beta.$$

2. $\frac{Vc^{\max}B^{\max}}{p^{\min}} < Q(t_0) \leq \frac{Vc^{\max}B^{\max}}{p^{\min}} + \sum_{n=1}^N p_n^{\max} - \beta$. In this case, since $\phi_n(\alpha_n(t_0))\leq 1$, there is no possibility that $Vc_nB_n\phi_n(\alpha_n(t_0)) \geq Q(t_0)p_n(\alpha_n(t_0))$ unless $\alpha_n(t_0)=0$. Thus, the DPP ratio indexing algorithm maximizing (4.30) chooses $\alpha_n(t_0)=0$ for all $n$, so all indices are 0. This implies that $Q(t_0+1)$ cannot increase, and we get $Q(t_0+1)\leq \frac{Vc^{\max}B^{\max}}{p^{\min}} + \sum_{n=1}^N p_n^{\max} - \beta$. $\Box$

Theorem 4.3.1. The proposed DPP ratio indexing algorithm achieves the constraint:
$$\limsup_{T\to\infty} \frac{1}{T}\sum_{t=0}^{T-1}\sum_{n=1}^{N} p_n(\alpha_n(t)) \leq \beta.$$

Proof. First of all, similar to Lemma 4.2.1, one can show that if $Q(t)\leq C$ for some constant $C>0$ and all $t\in\{0,1,2,\cdots\}$, then $\limsup_{T\to\infty}\frac{1}{T}\sum_{t=0}^{T-1}\sum_{n=1}^N p_n(\alpha_n(t))\leq\beta$. Using Lemma 4.3.1 finishes the proof. $\Box$

4.4 Multi-user optimality in a special case

In general, it is very difficult to prove optimality of the above multi-user algorithm, for two main reasons. The first is that multiple users might renew themselves asynchronously, making it difficult to define a "renewal frame" for the whole system; thus, the proof technique of Theorem 4.2.1 is infeasible. The second is that, even without the time average constraint, the problem degenerates into a standard restless bandit problem, where the optimality of indexing is not guaranteed.

This section considers a special case of the multi-user file downloading problem where the DPP ratio indexing algorithm is provably optimal. The special case has no time average power constraint. Further, for each user $n\in\{1,\ldots,N\}$:
• Each file consists of a random number of fixed length packets with mean $B_n = 1/\mu_n$.
• The decision set is $\mathcal{A}_n=\{0,1\}$, where 0 stands for "idle" and 1 stands for "download". If $\alpha_n(t)=1$, then user $n$ successfully downloads a single packet.
• $\phi_n(\alpha_n(t)) = \mu_n\alpha_n(t)$.
• Idle time is geometrically distributed with mean $1/\lambda_n$.
• The special case $\mu_n = 1-\lambda_n$ is assumed.

The assumption that the file length and idle time parameters $\mu_n$ and $\lambda_n$ satisfy $\mu_n=1-\lambda_n$ is restrictive. However, there exists a queueing system that admits exactly the same Markov dynamics as the system considered here when the assumption holds (described in Section 4.4.1 below). More importantly, it allows us to use a stochastic coupling argument to prove optimality. The goal is to maximize the sum throughput (in units of packets/slot), defined as:
$$\liminf_{T\to\infty} \frac{1}{T}\sum_{t=0}^{T-1}\sum_{n=1}^{N} B_n\phi_n(\alpha_n(t)). \quad (4.32)$$
In this special case, the multi-user file downloading problem reduces to the following:
$$\text{Max:}\quad \liminf_{T\to\infty}\frac{1}{T}\sum_{t=0}^{T-1}\sum_{n=1}^{N}\alpha_n(t) \quad (4.33)$$
$$\text{S.t.:}\quad \sum_{n=1}^{N}\alpha_n(t)\leq M \quad\forall t\in\{0,1,2,\cdots\} \quad (4.34)$$
$$\alpha_n(t)\in\{0,F_n(t)\} \quad (4.35)$$
$$\Pr[F_n(t+1)=1\mid F_n(t)=0]=\lambda_n \quad (4.36)$$
$$\Pr[F_n(t+1)=0\mid F_n(t)=1]=\alpha_n(t)(1-\lambda_n) \quad (4.37)$$
where the equality (4.37) uses the fact that $\mu_n=1-\lambda_n$. The Markov structure of constraints (4.35)-(4.37) is illustrated in Fig. 4.2.

4.4.1 A system with N single-buffer queues

The above model, with the assumption $\mu_n=1-\lambda_n$, is structurally equivalent to the following: Consider a system of $N$ single-buffer queues, $M$ servers, and independent Bernoulli packet arrivals with rates $\lambda_n$ to each queue $n\in\{1,\ldots,N\}$. This interpretation considers packet arrivals rather than file arrivals, so there are no file length variables and no parameters $\mu_n$. Let $A(t)=(A_1(t),\ldots,A_N(t))$ be the binary-valued vector of packet arrivals on slot $t$, assumed to be i.i.d. over slots and independent in each coordinate. Assume all packets have the same size and each queue has a single buffer that can store just one packet. Let $F_n(t)$ be 1 if queue $n$ has a packet at the beginning of slot $t$, and 0 else. Each server can transmit at most 1 packet per slot. Let $\alpha_n(t)$ be 1 if queue $n$ is served on slot $t$, and 0 else. An arrival $A_n(t)$ occurs at the end of slot $t$ and is accepted only if queue $n$ is empty at the end of the slot (such as when it was served on that slot). Packets that are not accepted are dropped.

Figure 4.2: Markovian dynamics of the n-th system, with states 0 (idle) and 1 (download).

The Markov dynamics are described by the same figure as before, namely, Fig. 4.2. Further, the problem of maximizing throughput is given by the same equations (4.33)-(4.37). Thus, although the variables of the two problems have different interpretations, the problems are structurally equivalent. For simplicity of exposition, the remainder of this section uses this single-buffer queue interpretation.

4.4.2 Optimality of the indexing algorithm

Since there is no power constraint, for any $V>0$ the DPP ratio indexing policy (4.31) in Section 4.3.1 reduces to the following (using $c_n=1$, $Q(t)\equiv 0$): If there are fewer than $M$ non-empty queues, serve all of them. Else, serve the $M$ non-empty queues with the largest values of $\gamma_n$, where (up to the positive factor $V$, which does not affect the ranking):
$$\gamma_n = \frac{1}{1+(1-\lambda_n)/\lambda_n} = \lambda_n.$$
Thus, the DPP ratio indexing algorithm in this context reduces to serving the (at most $M$) non-empty queues with the largest $\lambda_n$ values each time slot. For the remainder of this section, this is called the Max-λ policy. The following theorem shows that Max-λ is optimal in this context.

Theorem 4.4.1. The Max-λ policy is optimal for the problem (4.33)-(4.37). In particular, under the single-buffer queue interpretation, it maximizes throughput over all policies that transmit on each slot $t$ without knowledge of the arrival vector $A(t)$.

For the $N$ single-buffer queue interpretation, the total throughput is equal to the raw arrival rate $\sum_{i=1}^N\lambda_i$ minus the packet drop rate. Intuitively, the reason Max-λ is optimal is that it chooses to leave packets in the queues that are least likely to induce packet drops. An example comparison of the throughput gap between the Max-λ and Min-λ policies is given in Section 4.6.
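Before giving the formal coupling proof, a short simulation of the single-buffer interpretation (our own sketch, with hypothetical rates) makes the claim of Theorem 4.4.1 easy to check empirically: serving the largest-λ non-empty queues yields higher throughput than serving the smallest-λ ones.

```python
import numpy as np

def simulate(lams, M, greedy_max, slots=10**6, seed=0):
    # N single-buffer queues, M servers, Bernoulli(lams) arrivals at slot ends;
    # serve (at most M) non-empty queues with largest (or smallest) lambda.
    rng = np.random.default_rng(seed)
    lams = np.asarray(lams, dtype=float)
    F = np.zeros(len(lams), dtype=int)      # buffer states F_n(t)
    served = 0
    for _ in range(slots):
        busy = np.flatnonzero(F)
        order = np.argsort(-lams[busy]) if greedy_max else np.argsort(lams[busy])
        serve = busy[order[:M]]
        F[serve] = 0
        served += len(serve)
        arrivals = (rng.random(len(lams)) < lams).astype(int)
        F = np.maximum(F, arrivals)         # an arrival is accepted only if empty
    return served / slots

lams = [0.1, 0.2, 0.3, 0.4, 0.5]
print(simulate(lams, M=2, greedy_max=True))   # Max-lambda: higher throughput
print(simulate(lams, M=2, greedy_max=False))  # Min-lambda: lower throughput
```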
The proof of Theorem 4.4.1 is divided into two parts. The first part uses stochastic coupling techniques to prove that Max-λ dominates all alternative work-conserving policies. A policy is work-conserving if it does not allow any server to be idle when it could be used to serve a non-empty queue. The second part of the proof shows that throughput cannot be increased by considering non-work-conserving policies.

4.4.3 Preliminaries on stochastic coupling

Consider two discrete time processes $\mathcal{X}\triangleq\{X(t)\}_{t=0}^\infty$ and $\mathcal{Y}\triangleq\{Y(t)\}_{t=0}^\infty$. The notation $\mathcal{X}=_{st}\mathcal{Y}$ means that $\mathcal{X}$ and $\mathcal{Y}$ are stochastically equivalent, in that they are described by the same probability law. Formally, this means that their joint distributions are the same, so for all $t\in\{0,1,2,\ldots\}$ and all $(z_0,\ldots,z_t)\in\mathbb{R}^{t+1}$:
$$\Pr[X(0)\leq z_0,\ldots,X(t)\leq z_t] = \Pr[Y(0)\leq z_0,\ldots,Y(t)\leq z_t].$$
The notation $\mathcal{X}\leq_{st}\mathcal{Y}$ means that $\mathcal{X}$ is stochastically less than or equal to $\mathcal{Y}$, as defined by the following theorem.

Theorem 4.4.2. ([TE93]) The following three statements are equivalent:
1. $\mathcal{X}\leq_{st}\mathcal{Y}$.
2. $\Pr[g(X(0),X(1),\cdots,X(t))>z] \leq \Pr[g(Y(0),Y(1),\cdots,Y(t))>z]$ for all $t\in\mathbb{Z}^+$, all $z$, and all functions $g:\mathbb{R}^n\to\mathbb{R}$ that are measurable and nondecreasing in all coordinates.
3. There exist two stochastic processes $\mathcal{X}'$ and $\mathcal{Y}'$ on a common probability space that satisfy $\mathcal{X}=_{st}\mathcal{X}'$, $\mathcal{Y}=_{st}\mathcal{Y}'$, and $X'(t)\leq Y'(t)$ for every $t\in\mathbb{Z}^+$.

The following additional notation is used in the proof of Theorem 4.4.1:
• Arrival vector $\{A(t)\}_{t=0}^\infty$, where $A(t)\triangleq[A_1(t)\ A_2(t)\ \cdots\ A_N(t)]$. Each $A_n(t)$ is an independent binary random variable that takes the value 1 w.p. $\lambda_n$ and 0 w.p. $1-\lambda_n$.
• Buffer state vector $\{F(t)\}_{t=0}^\infty$, where $F(t)\triangleq[F_1(t)\ F_2(t)\ \cdots\ F_N(t)]$. So $F_n(t)=1$ if queue $n$ has a packet at the beginning of slot $t$, and $F_n(t)=0$ else.
• Total packet process $\mathcal{U}\triangleq\{U(t)\}_{t=0}^\infty$, where $U(t)\triangleq\sum_{n=1}^N F_n(t)$ represents the total number of packets in the system on slot $t$. Since each queue can hold at most one packet, we have $0\leq U(t)\leq N$ for all slots $t$.

4.4.4 Stochastic ordering of buffer state process

The next lemma is the key to proving Theorem 4.4.1. The lemma considers the multi-queue system with a fixed but arbitrary initial buffer state $F(0)$. The arrival process $A(t)$ is as defined above. Let $\mathcal{U}^{\text{Max-}\lambda}$ be the total packet process under the Max-λ policy. Let $\mathcal{U}^\pi$ be the corresponding process starting from the same initial state $F(0)$ and having the same arrivals $A(t)$, but with an arbitrary work-conserving policy $\pi$.

Lemma 4.4.1. The total packet processes $\mathcal{U}^\pi$ and $\mathcal{U}^{\text{Max-}\lambda}$ satisfy:
$$\mathcal{U}^\pi \leq_{st} \mathcal{U}^{\text{Max-}\lambda}. \quad (4.38)$$

Proof. Without loss of generality, assume the queues are sorted so that $\lambda_n\leq\lambda_{n+1}$, $n=1,2,\cdots,N-1$. Define $\{F^\pi(t)\}_{t=0}^\infty$ as the buffer state vector under policy $\pi$, and $\{F^{\text{Max-}\lambda}(t)\}_{t=0}^\infty$ as the corresponding buffer states under the Max-λ policy. By assumption, the initial states satisfy $F^\pi(0)=F^{\text{Max-}\lambda}(0)$. Next, we construct a third process $\mathcal{U}^\lambda$ with a modified arrival vector process $\{A^\lambda(t)\}_{t=0}^\infty$ and a corresponding buffer state vector $\{F^\lambda(t)\}_{t=0}^\infty$ (with the same initial state $F^\lambda(0)=F^\pi(0)$), which satisfies:
1. $\mathcal{U}^\lambda$ is also generated from the Max-λ policy.
2. $\mathcal{U}^\lambda =_{st} \mathcal{U}^{\text{Max-}\lambda}$. Since the total packet process is completely determined by the initial state, the scheduling policy, and the arrival process, it is enough to construct $\{A^\lambda(t)\}_{t=0}^\infty$ so that it has the same probability law as $\{A(t)\}_{t=0}^\infty$.
3. $U^\pi(t)\leq U^\lambda(t)$ for all $t\geq 0$.

Since the arrival process $A(t)$ is i.i.d. over slots, in order to guarantee 2) and 3) it is sufficient to construct $A^\lambda(t)$ coupled with $A(t)$ for each $t$ so that the following two properties hold for all $t\geq 0$:
• The random variables $A(t)$ and $A^\lambda(t)$ have the same probability law.
Specifically, both produce arrivals according to Bernoulli processes that are independent over queues and over time, with $\Pr[A_n(t)=1]=\Pr[A_n^\lambda(t)=1]=\lambda_n$ for all $n\in\{1,\ldots,N\}$.
• For all $j\in\{1,2,\cdots,N\}$:
$$\sum_{n=1}^{j} F_n^\pi(t) \leq \sum_{n=1}^{j} F_n^\lambda(t). \quad (4.39)$$

The construction is based on induction. At $t=0$ we have $F^\pi(0)=F^\lambda(0)$, so (4.39) naturally holds for $t=0$. Now fix $\tau\geq 0$ and assume (4.39) holds for all slots up to time $t=\tau$. If $\tau\geq 1$, further assume the arrivals $\{A^\lambda(t)\}_{t=0}^{\tau-1}$ have been constructed to have the same probability law as $\{A(t)\}_{t=0}^{\tau-1}$. Since arrivals on slot $\tau$ occur at the end of slot $\tau$, the arrivals $A^\lambda(\tau)$ must be constructed. We are going to show there exists an $A^\lambda(\tau)$ that is coupled with $A(\tau)$ so that it has the same probability law and also ensures (4.39) holds for $t=\tau+1$.

Since arrivals occur after the transmitting action, we divide the analysis into two parts. First, we analyze the temporary buffer states after the transmitting action but before arrivals occur. Then, we define the arrivals $A^\lambda(\tau)$ at the end of slot $\tau$ to achieve the desired coupling.

Define $\tilde F^\pi(\tau)$ and $\tilde F^\lambda(\tau)$ as the temporary buffer states right after the transmitting action at slot $\tau$ but before arrivals occur, under policy $\pi$ and policy Max-λ, respectively. Thus, for each queue $n\in\{1,\ldots,N\}$:
$$\tilde F_n^\pi(\tau) = F_n^\pi(\tau) - \alpha_n^\pi(\tau) \quad (4.40)$$
$$\tilde F_n^\lambda(\tau) = F_n^\lambda(\tau) - \alpha_n^\lambda(\tau) \quad (4.41)$$
where $\alpha_n^\pi(\tau)$ and $\alpha_n^\lambda(\tau)$ are the slot-$\tau$ decisions under policy $\pi$ and Max-λ, respectively. Since (4.39) holds for $j=N$ on slot $\tau$, the total number of packets at the start of slot $\tau$ under policy $\pi$ is less than or equal to that under Max-λ. Since both policies $\pi$ and Max-λ are work-conserving, it is impossible for policy $\pi$ to transmit more packets than Max-λ during slot $\tau$. This implies:
$$\sum_{n=1}^{N}\tilde F_n^\pi(\tau) \leq \sum_{n=1}^{N}\tilde F_n^\lambda(\tau). \quad (4.42)$$
Indeed, if $\pi$ transmits the same number of packets as Max-λ on slot $\tau$, then (4.42) clearly holds. On the other hand, if $\pi$ transmits fewer packets than Max-λ, it must transmit fewer than $M$ packets (since $M$ is the number of servers). In this case, the work-conserving nature of $\pi$ implies that all non-empty queues were served, so that $\tilde F_n^\pi(\tau)=0$ for all $n$ and (4.42) again holds. We now claim the following holds:

Lemma 4.4.2.
$$\sum_{n=1}^{j}\tilde F_n^\pi(\tau) \leq \sum_{n=1}^{j}\tilde F_n^\lambda(\tau) \quad \forall j\in\{1,2,\cdots,N\}. \quad (4.43)$$
Proof. See Section 4.6. $\Box$

Now let $j^\pi(l)$ and $j^\lambda(l)$ be the index of the $l$-th empty temporary buffer (in order starting from the first queue) corresponding to $\tilde F^\pi(\tau)$ and $\tilde F^\lambda(\tau)$, respectively. It follows from (4.43) that the $\pi$ system on slot $\tau$ has at least as many empty temporary buffer states as the Max-λ policy, and:
$$j^\pi(l) \leq j^\lambda(l) \quad \forall l\in\{1,2,\cdots,K(\tau)\} \quad (4.44)$$
where $K(\tau)\leq N$ is the number of empty temporary buffer states under Max-λ at time slot $\tau$. Since $\lambda_i\leq\lambda_j$ if and only if $i\leq j$, (4.44) further implies that:
$$\lambda_{j^\pi(l)} \leq \lambda_{j^\lambda(l)} \quad \forall l\in\{1,2,\cdots,K(\tau)\}. \quad (4.45)$$
Now construct the arrival vector $A^\lambda(\tau)$ for the system with the Max-λ policy in the following way:
$$A_{j^\pi(l)}(\tau)=1 \;\Rightarrow\; A^\lambda_{j^\lambda(l)}(\tau)=1 \text{ w.p. } 1 \quad (4.46)$$
$$A_{j^\pi(l)}(\tau)=0 \;\Rightarrow\; \begin{cases} A^\lambda_{j^\lambda(l)}(\tau)=0, & \text{w.p. } \dfrac{1-\lambda_{j^\lambda(l)}}{1-\lambda_{j^\pi(l)}};\\[6pt] A^\lambda_{j^\lambda(l)}(\tau)=1, & \text{w.p. } \dfrac{\lambda_{j^\lambda(l)}-\lambda_{j^\pi(l)}}{1-\lambda_{j^\pi(l)}}. \end{cases} \quad (4.47)$$
Notice that (4.47) uses valid probability distributions because of (4.45). This establishes the slot-$\tau$ arrivals for the Max-λ policy for all of its $K(\tau)$ queues with empty temporary buffer states. The slot-$\tau$ arrivals for its queues with non-empty temporary buffers will be dropped and hence do not affect the queue states on slot $\tau+1$.
Thus, we define the arrivals $A_j^\lambda(\tau)$ to be independent of all other quantities and to be Bernoulli with $\Pr[A_j^\lambda(\tau)=1]=\lambda_j$ for all $j$ in the set:
$$j\in\{1,2,\cdots,N\}\setminus\{j^\lambda(1),\cdots,j^\lambda(K(\tau))\}.$$
Now we verify that $A(\tau)$ and $A^\lambda(\tau)$ have the same probability law. First condition on knowledge of $K(\tau)$ and the particular $j^\pi(l)$ and $j^\lambda(l)$ values for $l\in\{1,\ldots,K(\tau)\}$. All queues $j$ with non-empty temporary buffer states on slot $\tau$ under Max-λ were defined to have arrivals $A_j^\lambda(\tau)$ as independent Bernoulli variables with $\Pr[A_j^\lambda(\tau)=1]=\lambda_j$. It remains to verify those queues within $\{j^\lambda(1),\cdots,j^\lambda(K(\tau))\}$. According to (4.47), for any queue $j^\lambda(l)$ in the set $\{j^\lambda(1),\cdots,j^\lambda(K(\tau))\}$, it follows that:
$$\Pr\left[A^\lambda_{j^\lambda(l)}(\tau)=0\right] = (1-\lambda_{j^\pi(l)})\cdot\frac{1-\lambda_{j^\lambda(l)}}{1-\lambda_{j^\pi(l)}} = 1-\lambda_{j^\lambda(l)},$$
and so $\Pr[A_j^\lambda(\tau)=1]=\lambda_j$ for all $j\in\{j^\lambda(l)\}_{l=1}^{K(\tau)}$. Further, mutual independence of $\{A_{j^\pi(l)}(\tau)\}_{l=1}^{K(\tau)}$ implies mutual independence of $\{A_{j^\lambda(l)}(\tau)\}_{l=1}^{K(\tau)}$. Finally, these quantities are conditionally independent of events before slot $\tau$, given knowledge of $K(\tau)$ and the particular $j^\pi(l)$ and $j^\lambda(l)$ values for $l\in\{1,\ldots,K(\tau)\}$. Thus, conditioned on this knowledge, $A(\tau)$ and $A^\lambda(\tau)$ have the same probability law. This holds for all possible values of the conditional knowledge $K(\tau)$, $j^\pi(l)$, and $j^\lambda(l)$. It follows that $A(\tau)$ and $A^\lambda(\tau)$ have the same (unconditioned) probability law.

Finally, we show that the coupling relations (4.46) and (4.47) produce an $F^\lambda(\tau+1)$ satisfying:
$$\sum_{n=1}^{j} F_n^\pi(\tau+1) \leq \sum_{n=1}^{j} F_n^\lambda(\tau+1), \quad \forall j\in\{1,2,\cdots,N\}. \quad (4.48)$$
According to (4.46) and (4.47):
$$A_{j^\pi(l)}(\tau) \leq A^\lambda_{j^\lambda(l)}(\tau), \quad \forall l\in\{1,\cdots,K(\tau)\},$$
thus,
$$\sum_{i=1}^{l} A_{j^\pi(i)}(\tau) \leq \sum_{i=1}^{l} A^\lambda_{j^\lambda(i)}(\tau), \quad \forall l\in\{1,\cdots,K(\tau)\}. \quad (4.49)$$
Pick any $j\in\{1,2,\cdots,N\}$. Let $l^\pi$ be the number of empty temporary buffers within the first $j$ queues under policy $\pi$, i.e.:
$$l^\pi = \max_{j^\pi(l)\leq j} l.$$
Similarly define:
$$l^\lambda = \max_{j^\lambda(l)\leq j} l.$$
Then, it follows that:
$$\sum_{n=1}^{j} F_n^\pi(\tau+1) = \sum_{n=1}^{j}\tilde F_n^\pi(\tau) + \sum_{i=1}^{l^\pi} A_{j^\pi(i)}(\tau) \quad (4.50)$$
$$\sum_{n=1}^{j} F_n^\lambda(\tau+1) = \sum_{n=1}^{j}\tilde F_n^\lambda(\tau) + \sum_{i=1}^{l^\lambda} A^\lambda_{j^\lambda(i)}(\tau) \quad (4.51)$$
We know that $l^\pi\geq l^\lambda$. So there are two cases:
• If $l^\pi = l^\lambda$, then from (4.50):
$$\sum_{n=1}^{j} F_n^\pi(\tau+1) = \sum_{n=1}^{j}\tilde F_n^\pi(\tau) + \sum_{i=1}^{l^\lambda} A_{j^\pi(i)}(\tau) \leq \sum_{n=1}^{j}\tilde F_n^\lambda(\tau) + \sum_{i=1}^{l^\lambda} A^\lambda_{j^\lambda(i)}(\tau) = \sum_{n=1}^{j} F_n^\lambda(\tau+1),$$
where the inequality follows from (4.43) and from (4.49) with $l=l^\lambda$. Thus, (4.48) holds.
• If $l^\pi > l^\lambda$, then from (4.50):
$$\sum_{n=1}^{j} F_n^\pi(\tau+1) = \sum_{n=1}^{j}\tilde F_n^\pi(\tau) + \sum_{i=1}^{l^\lambda} A_{j^\pi(i)}(\tau) + \sum_{i=l^\lambda+1}^{l^\pi} A_{j^\pi(i)}(\tau) \leq \sum_{n=1}^{j}\tilde F_n^\lambda(\tau) + \sum_{i=1}^{l^\lambda} A_{j^\pi(i)}(\tau) \leq \sum_{n=1}^{j}\tilde F_n^\lambda(\tau) + \sum_{i=1}^{l^\lambda} A^\lambda_{j^\lambda(i)}(\tau) = \sum_{n=1}^{j} F_n^\lambda(\tau+1),$$
where the first inequality follows from the fact that:
$$\sum_{i=l^\lambda+1}^{l^\pi} A_{j^\pi(i)}(\tau) \leq l^\pi - l^\lambda = (j-l^\lambda)-(j-l^\pi) = \sum_{n=1}^{j}\tilde F_n^\lambda(\tau) - \sum_{n=1}^{j}\tilde F_n^\pi(\tau),$$
and the second inequality follows from (4.49). Thus, (4.39) holds for $t=\tau+1$ and the induction step is done. $\Box$

Corollary 4.4.1. The Max-λ policy maximizes throughput within the class of work-conserving policies.

Proof. Let $S^\pi(t)$ be the number of packets transmitted under any work-conserving policy $\pi$ on slot $t$, and let $S^{\text{Max-}\lambda}(t)$ be the corresponding process under policy Max-λ. Lemma 4.4.1 implies $\mathcal{U}^\pi \leq_{st} \mathcal{U}^{\text{Max-}\lambda}$. Then:
$$\mathbb{E}(S^\pi(t)) = \mathbb{E}(\min[U^\pi(t),M]) \leq \mathbb{E}(\min[U^{\text{Max-}\lambda}(t),M]) = \mathbb{E}(S^{\text{Max-}\lambda}(t)),$$
where the inequality follows from Theorem 4.4.2, with the understanding that $g(U(0),\ldots,U(t))\triangleq \min[U(t),M]$ is a function that is nondecreasing in all coordinates. $\Box$
4.4.5 Extending to non-work-conserving policies

Corollary 4.4.1 establishes optimality of Max-λ over the class of all work-conserving policies. To complete the proof of Theorem 4.4.1, it remains to show that throughput cannot be increased by allowing non-work-conserving policies. It suffices to show that for any non-work-conserving policy, there exists a work-conserving policy that achieves the same or better throughput. The proof is straightforward and we give only a sketch for brevity.

Consider any non-work-conserving policy $\pi$, and let $F_n^\pi(t)$ be its buffer state process on slot $t$ for each queue $n$. For the same initial buffer state and arrival process, define the work-conserving policy $\pi'$ as follows: On every slot $t$, policy $\pi'$ initially allocates the $M$ servers to exactly the same queues as policy $\pi$. However, if some of these queues are empty under policy $\pi'$, it reallocates those servers to any non-empty queues that are not yet allocated servers (in keeping with the work-conserving property). Let $F_n^{\pi'}(t)$ be the buffer state process for queue $n$ under policy $\pi'$. It is not difficult to show that $F_n^\pi(t)\geq F_n^{\pi'}(t)$ for all queues $n$ and all slots $t$. Therefore, on every slot $t$, the amount of blocked arrivals under policy $\pi$ is always greater than or equal to that under policy $\pi'$. This implies the throughput under policy $\pi$ is less than or equal to that of policy $\pi'$.

4.5 Simulation experiments

In this section, we demonstrate the near optimality of the multi-user DPP ratio indexing algorithm through extensive simulations. In the first part, we simulate the case in which the file length distribution is geometric and show that the suboptimality gap is extremely small. In the second part, we test the robustness of the algorithm in more general scenarios where the file length distribution is not geometric. For simplicity, it is assumed throughout that all transmissions send a fixed sized packet, all files are an integer number of these packets, and the decisions $\alpha_n(t)\in\mathcal{A}_n$ affect the success probability of the transmission as well as the power expenditure.

4.5.1 DPP ratio indexing with geometric file length

In the first simulation we use $N=8$, $M=4$ with action set $\mathcal{A}_n=\{0,1\}$ for all $n$. The settings are generated randomly and specified in Table 4.1, and the constraint is $\beta=5$.

Table 4.1: Problem parameters

User   λ_n      μ_n      φ_n(1)   c_n      p_n(1)
1      0.0028   0.5380   0.4842   4.7527   3.9504
2      0.4176   0.5453   0.4908   2.0681   3.7391
3      0.0888   0.5044   0.4540   2.8656   3.5753
4      0.3181   0.6103   0.5493   2.4605   2.1828
5      0.4151   0.9839   0.8855   4.5554   3.1982
6      0.2546   0.5975   0.5377   3.9647   3.5290
7      0.1705   0.5517   0.4966   1.5159   2.5226
8      0.2109   0.7597   0.6837   3.6364   2.5376

The algorithm is run for 1 million slots in each trial and each point is the average of 100 trials. We compare the performance of our algorithm with the optimal randomized policy. The optimal policy is computed by constructing composite states (i.e., if there are three users where user 1 is at state 0, user 2 is at state 1 and user 3 is at state 1, we view 011 as a composite state), and then reformulating this MDP into a linear program (see [Fox66a]) with 5985 variables and 258 constraints.

In Fig. 4.3, we show that as the tradeoff parameter $V$ gets larger, the objective value approaches the optimal value and achieves near optimal performance. Fig. 4.4 and Fig. 4.5 show that $V$ also affects the virtual queue size and the constraint gap: as $V$ gets larger, the average virtual queue size becomes larger and the gap becomes smaller.
We also plot the upper bound on the queue size derived from Lemma 4.3.1 in Fig. 4.5, demonstrating that the queue is bounded. In order to show that $V$ is indeed a trade-off parameter affecting the convergence time, we plot Fig. 4.6. It can be seen from the figure that as $V$ gets larger, the number of time slots needed for the running average to roughly converge to the optimal power expenditure becomes larger.

In the second simulation, we explore the parameter space and demonstrate that in general the suboptimality gap of our algorithm is negligible. First, we define the relative error as:
$$\text{relative error} = \frac{|OBJ - OPT|}{OPT} \quad (4.52)$$
where $OBJ$ is the objective value after running 1 million slots of our algorithm and $OPT$ is the optimal value. We first explore the system parameters by letting the $\lambda_n$'s and $\mu_n$'s take random values between 0 and 1, letting $c_n$ take random values between 1 and 5, choosing $V=70$, and fixing the remaining parameters the same as in the last experiment. We conduct 1000 Monte Carlo experiments and calculate the average relative error, which is 0.00083.

Figure 4.3: Throughput versus tradeoff parameter V.
Figure 4.4: The time average power consumption versus tradeoff parameter V.
Figure 4.5: Average virtual queue backlog versus tradeoff parameter V.
Figure 4.6: Running average power consumption versus tradeoff parameter V.

Table 4.2: Problem parameters under geometric, uniform and Poisson distributions

User   μ_n   Unif. interval   Poiss. mean   λ_n      φ_n(1)   c_n      p_n(1)
1      1/3   [1,5]            3             0.4955   0.1832   4.3261   2.8763
2      1/2   [1,3]            2             0.1181   0.4187   1.6827   2.0549
3      1/2   [1,3]            2             0.1298   0.4491   1.9483   2.1469
4      1/7   [1,13]           7             0.4660   0.0984   2.7495   3.4472
5      1/4   [1,7]            4             0.1661   0.1742   1.5535   3.2801
6      1/3   [1,5]            3             0.2124   0.3101   4.3151   3.5648
7      1/2   [1,3]            2             0.5295   0.4980   3.6701   2.4680
8      1/5   [1,9]            5             0.2228   0.1971   4.0185   2.2984
9      1/4   [1,7]            4             0.0332   0.1986   3.0411   2.5747

Next, we explore the control parameters by letting $p_n(1)$ take random values between 2 and 4, letting the ratios $\phi_n(1)/\mu_n$ take random values between 0 and 1, choosing $V=70$, and fixing the remaining parameters the same as in the first simulation. The relative error is 0.00057. Both experiments show that the suboptimality gap is extremely small.

4.5.2 DPP ratio indexing with non-memoryless file lengths

In this part, we test the sensitivity of the algorithm to different file length distributions. In particular, the uniform distribution and the Poisson distribution are implemented, while the algorithm still treats them as geometric distributions with the same mean. We then compare their throughputs with the geometric case. We use $N=9$, $M=4$ with action set $\mathcal{A}_n=\{0,1\}$ for all $n$. The settings are specified in Table 4.2, with constraint $\beta=5$. Notice that for the geometric and uniform distributions, the file lengths are taken to be integer values. The algorithm is run for 1 million slots in each trial and each point is the average of 100 trials.
While the decisions are made using these values, the effect of these decisions incorporates the actual (non-memoryless) file sizes. Fig. 4.7 shows the throughput-versus-$V$ relation for the two non-memoryless cases and the memoryless case with matched means. The performance of all three is similar. This illustrates that the indexing algorithm is robust under different file length distributions.

Figure 4.7: Throughput versus tradeoff parameter V under different file length distributions (geometric, uniform, Poisson).

4.6 Additional lemmas and proofs

4.6.1 Comparison of Max-λ and Min-λ

This section shows that different work-conserving policies can give different throughput for the $N$ single-buffer queue problem of Section 4.4.1. Suppose we have two single-buffer queues and one server. Let $\lambda_1,\lambda_2$ be the arrival rates of the i.i.d. Bernoulli arrival processes for queues 1 and 2, and assume $\lambda_1\neq\lambda_2$. There are 4 system states: $(0,0)$, $(0,1)$, $(1,0)$, $(1,1)$, where state $(i,j)$ means queue 1 has $i$ packets and queue 2 has $j$ packets. Consider the (work-conserving) policy of giving queue 1 strict priority over queue 2. This is equivalent to the Max-λ policy when $\lambda_1>\lambda_2$, and is equivalent to the Min-λ policy when $\lambda_1<\lambda_2$. Let $\theta(\lambda_1,\lambda_2)$ be the steady state throughput. Then:
$$\theta(\lambda_1,\lambda_2) = p_{1,0} + p_{0,1} + p_{1,1}$$
where $p_{i,j}$ is the steady state probability of the resulting discrete time Markov chain. One can solve the global balance equations to show that $\theta(1/2,1/4) > \theta(1/4,1/2)$, so that the Max-λ policy has a higher throughput than the Min-λ policy. In particular, it can be shown that:
• Max-λ throughput: $\theta(1/2,1/4) = 0.7$
• Min-λ throughput: $\theta(1/4,1/2) \approx 0.6786$
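These two values are easy to reproduce. The following short Python sketch (ours, purely illustrative) builds the four-state transition matrix of the strict-priority policy, solves the global balance equations for the stationary distribution, and recovers the throughputs above:

```python
import numpy as np

def priority_throughput(l1, l2):
    # Two single-buffer queues, one server, strict priority to queue 1,
    # i.i.d. Bernoulli(l1), Bernoulli(l2) arrivals at the end of each slot.
    P = np.zeros((4, 4))                    # state (F1, F2) -> index 2*F1 + F2
    for s in range(4):
        f1, f2 = s >> 1, s & 1
        if f1:
            f1 = 0                          # serve queue 1 first
        elif f2:
            f2 = 0                          # otherwise serve queue 2
        for a1, p1 in ((0, 1 - l1), (1, l1)):
            for a2, p2 in ((0, 1 - l2), (1, l2)):
                n1 = a1 if f1 == 0 else 1   # arrival accepted only if empty
                n2 = a2 if f2 == 0 else 1   # (blocked arrivals are dropped)
                P[s, 2 * n1 + n2] += p1 * p2
    # stationary distribution: solve pi P = pi with sum(pi) = 1
    A = np.vstack([P.T - np.eye(4), np.ones(4)])
    pi = np.linalg.lstsq(A, np.array([0, 0, 0, 0, 1.0]), rcond=None)[0]
    return 1.0 - pi[0]                      # a packet is served unless state (0,0)

print(priority_throughput(0.5, 0.25))       # Max-lambda:  0.7
print(priority_throughput(0.25, 0.5))       # Min-lambda: ~0.6786
```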
4.6.2 Proof of Lemma 4.4.2

This section proves that:
$$\sum_{n=1}^{j}\tilde F_n^\pi(\tau) \leq \sum_{n=1}^{j}\tilde F_n^\lambda(\tau) \quad \forall j\in\{1,2,\cdots,N\}. \quad (4.53)$$
The case $j=N$ is already established by (4.42). Fix $j\in\{1,2,\ldots,N-1\}$. Since $\pi$ cannot transmit more packets than Max-λ during slot $\tau$, inequality (4.53) is proved by considering two cases:

1. Policy $\pi$ transmits fewer packets than policy Max-λ. Then $\pi$ transmits fewer than $M$ packets during slot $\tau$. The work-conserving nature of $\pi$ implies all non-empty queues were served, so $\tilde F_n^\pi(\tau)=0$ for all $n$ and (4.53) holds.

2. Policy $\pi$ transmits the same number of packets as policy Max-λ. In this case, consider the temporary buffer states of the last $N-j$ queues under policy Max-λ. If $\sum_{n=j+1}^N \tilde F_n^\lambda(\tau)=0$, then clearly the following holds:
$$\sum_{n=j+1}^{N}\tilde F_n^\pi(\tau) \geq \sum_{n=j+1}^{N}\tilde F_n^\lambda(\tau). \quad (4.54)$$
Subtracting (4.54) from (4.42) immediately gives (4.53). If $\sum_{n=j+1}^N\tilde F_n^\lambda(\tau)>0$, then all $M$ servers of the Max-λ system were devoted to serving the largest-$\lambda_n$ queues. So only packets in the last $N-j$ queues could be transmitted by Max-λ during slot $\tau$. In particular, $\alpha_n^\lambda(\tau)=0$ for all $n\in\{1,\ldots,j\}$, and so (by (4.41)):
$$\sum_{n=1}^{j}\tilde F_n^\lambda(\tau) = \sum_{n=1}^{j} F_n^\lambda(\tau). \quad (4.55)$$
Thus:
$$\sum_{n=1}^{j}\tilde F_n^\pi(\tau) \leq \sum_{n=1}^{j} F_n^\pi(\tau) \quad (4.56)$$
$$\leq \sum_{n=1}^{j} F_n^\lambda(\tau) \quad (4.57)$$
$$= \sum_{n=1}^{j}\tilde F_n^\lambda(\tau), \quad (4.58)$$
where (4.56) holds by (4.40), (4.57) holds because (4.39) is true on slot $t=\tau$, and the last equality holds by (4.55). This proves (4.53). $\Box$

Chapter 5

Opportunistic Scheduling over Renewal Systems

This chapter considers an opportunistic scheduling problem over a single renewal system. Different from previous chapters, we consider the scenario where at the beginning of each renewal frame, the controller observes a random event and then chooses an action in response to the event, which affects the duration of the frame, the amount of resources used, and a penalty metric. The goal is to make frame-wise decisions so as to minimize the time average penalty subject to time average resource constraints. This problem has applications to task processing and communication in data networks, as well as to certain classes of Markov decision problems. We formulate the problem as a dynamic fractional program and propose an adaptive algorithm which uses an empirical accumulation as a feedback parameter. A key feature of the proposed algorithm is that it does not require knowledge of the random event statistics and potentially allows (uncountably) infinite event sets. We prove that the algorithm satisfies all desired constraints and achieves $O(\epsilon)$ near optimality with probability 1.

5.1 Introduction

Consider a system that operates over the timeline of real numbers $t\geq 0$. The timeline is divided into back-to-back periods called renewal frames, and the start of each frame is called a renewal (see Fig. 5.1). The system state is refreshed at each renewal. At the start of each renewal frame $n\in\{0,1,2,\ldots\}$, the controller observes a random event $\omega[n]\in\Omega$ and then takes an action $\alpha[n]$ from an action set $\mathcal{A}$ in response to $\omega[n]$. The pair $(\omega[n],\alpha[n])$ affects: (i) the duration of that renewal frame; (ii) a vector of resource expenditures for that frame; (iii) a penalty incurred on that frame. The goal is to choose actions over time to minimize the time average penalty subject to time average constraints on the resources, without knowing any statistics of $\omega[n]$. We call such a problem opportunistic scheduling over renewal systems.

Figure 5.1: An illustration of a sequence of renewal frames.

5.1.1 Example applications

This problem has applications to task processing in computer networks and to certain generalizations of Markov decision problems.

• Task processing networks: Consider a device that processes tasks back-to-back. Each renewal period corresponds to the time required to complete a single task. The random event $\omega[n]$ observed corresponds to a vector of task parameters, including the type, size, and resource requirements for that particular task. The action set $\mathcal{A}$ consists of different processing mode options, and the specific action $\alpha[n]$ determines the processing time, energy expenditure, and task quality. In this case, task quality can be defined as a negative penalty, and the goal is to maximize time average quality subject to power constraints and task completion rate constraints. A specific example of this sort is the following file downloading problem: Consider a wireless device that repeatedly downloads files. The device has two states: active (wants to download a file) and idle (does not want to download a file). Renewals occur at the start of each new active state. Here, $\omega[n]$ denotes the observed wireless channel state, which affects the success probability of downloading a file (and thereby affects the transition probability from active to idle). This example is discussed further in the simulation section (Section 5.6).

• Hierarchical Markov decision problems: Consider a slotted two-timescale Markov decision process (MDP) over an infinite horizon with constraints on the average cost per slot. An MDP is run on the lower level, with a special state that is recurrent under any sequence of actions. The renewals are defined as revisitation times to that state. On a higher level,
The renewals are defined as revisitation times to that state. On a higher level, 129 a random event ω is observed upon each revisitation to the renewal state on the lower level. Then, a decision is made on the higher level in response to ω, which in turn affects the transition probability and penalty/cost received per slot on the lower level until the next renewal. Such a problem is a generalization of classical MDP problem (e.g. [Ros02], [Ber01]) and has been considered previously in [Wer13], [CFMS03] with discrete finite state and full information on both levels. A heuristic method is also proposed in [Wer13] when some of the information is unknown. The algorithm of the current chapter does not require knowledge of the statistics of ω and allows the event set Ω to be potentially (uncountably) infinite. 5.1.2 Previous approaches on renewal systems Most works on optimization over renewal systems consider the simpler scenario of knowing the probability distribution of ω[n]. In such a case, one can show via the renewal-reward theory that the problem can be solved (offline) by finding the solution to a linear fractional program. This idea has been applied to solve MDPs in the seminal work [Fox66b]. Methods for solving linear fractional programs can also be found, for example, in [Sch83, BV04]. However, the practical limitations of such an offline algorithm are twofold: First, if the event set Ω is large, then, there are too many probabilities Pr(ω[n] =ω), ω∈ Ω to estimate and the corresponding offline optimization problem may be difficult to solve even if all probabilities are estimated accurately. Second, generic offline optimization solvers may not take advantage of the special renewal structure of the system. One notable example is the treatment of power and delay minimization for a multi-class M/G/1 queue in [Yao02, LN14], where the renewal structure allows a well known c-μ rule for delay minimization to be extended to treat both power and delay constraints. The work in [Nee10b, Nee13b] presents a new drift-plus-penalty (DPP) ratio algorithm solving renewal optimizations knowing the distribution of ω[n]. The algorithm treats the constraints via virtual queues so that one only requires to minimize an unconstrained ratio during every renewal frame. The algorithm provably meets all constraints and achieves asymptotic near-optimality. The works [WUZ + 15, UWH + 15] show that the edge cloud server migration problem can be for- mulated as a specific renewal optimization. Using a variant of the DPP ratio algorithm, they show that solving a simple stochastic shortest path problem during every renewal frame gives 130 near-optimal performance. The work [WN18] solves a more general asynchronous optimization over parallel renewal systems, though the knowledge of the random event statistics is still re- quired. It is worth noting that the work [Nee13b] also proposes a heuristic algorithm when the distribution ofω[n] is not known. That algorithm is partially analyzed: It is shown that if a cer- tain process converges, then the algorithm converges to a near-optimal point. However, whether or not such a process converges is unknown. 5.1.3 Other related works The renewal optimization problem considered in this chapter is a generalization of stochastic optimization over fixed time slots. Such problems are categorized based on whether or not the random event is observed before the decision is made. 
Cases where the random event is observed before taking actions are often referred to as opportunistic scheduling problems. Over the past decades, many algorithms have been proposed, including max-weight ([TE90, TE93]), Lyapunov optimization ([ES06, ES07, Nee10b, GNT+06]), fluid model methods ([Sto05, ES07]), and dual subgradient methods ([LS04, Rib10]). Cases where the random events are not observed are referred to as online learning problems. Various algorithms have been developed for unconstrained learning, including the weighted majority algorithm ([LW94]), the multiplicative weighting algorithm ([FS99]), following the perturbed leader ([HP05]), and online gradient descent ([Zin03, HK14]). The resource constrained learning problem is studied in [MJY12] and [WSLJ15]. Online learning with an underlying MDP structure is also treated using modified multiplicative weighting ([EDKM05]) and improved following the perturbed leader ([YMS09]).

5.1.4 Our contributions

In this work, we focus on opportunistic scheduling over renewal systems and propose a new algorithm that runs online (i.e., takes actions in response to each observed $\omega[n]$). Unlike prior works, the proposed algorithm requires neither the statistics of $\omega[n]$ nor explicit estimation of them, and it is fully analyzed with convergence properties that hold with probability 1. From a technical perspective, we prove near-optimality of the algorithm by showing asymptotic stability of a customized process, relying on a novel construction of exponential supermartingales which could be of independent interest. We complement our theoretical results with simulation experiments on a time varying constrained MDP.

5.2 Problem Formulation and Preliminaries

Consider a system where the timeline is divided into back-to-back time periods called frames. At the beginning of frame $n$ ($n\in\{0,1,2,\cdots\}$), a controller observes the realization of a random variable $\omega[n]$, which is an i.i.d. copy of a random variable taking values in a compact set $\Omega\subseteq\mathbb{R}^q$, with distribution function unknown to the controller. Then, after observing the random event $\omega[n]$, the controller chooses an action vector $\alpha[n]\in\mathcal{A}$. The tuple $(\omega[n],\alpha[n])$ induces the following random variables:
• The penalty received during frame $n$: $y[n]$.
• The length of frame $n$: $T[n]$.
• A vector of resource consumptions during frame $n$: $z[n] = [z_1[n],\ z_2[n],\ \cdots,\ z_L[n]]$.

We assume that given $\alpha[n]=\alpha$ and $\omega[n]=\omega$ at frame $n$, $(y[n],T[n],z[n])$ is a random vector independent of the outcomes of previous frames, with known expectations. We denote these conditional expectations as:
$$\hat y(\omega,\alpha) = \mathbb{E}(y[n]\mid\omega,\alpha), \qquad \hat T(\omega,\alpha) = \mathbb{E}(T[n]\mid\omega,\alpha), \qquad \hat z(\omega,\alpha) = \mathbb{E}(z[n]\mid\omega,\alpha),$$
which are all deterministic functions of $\omega$ and $\alpha$. This notation is useful when we want to highlight the action $\alpha$ we choose. The analysis assumes a single action in response to the observed $\omega[n]$ at each frame. Nevertheless, an ergodic MDP can fit into this model by defining the action as a selection of a policy to implement over that frame, so that the corresponding $\hat y(\omega,\alpha)$, $\hat T(\omega,\alpha)$ and $\hat z(\omega,\alpha)$ are expectations over the frame under the chosen policy.

Let
$$\overline{y}[N] = \frac{1}{N}\sum_{n=0}^{N-1} y[n], \qquad \overline{T}[N] = \frac{1}{N}\sum_{n=0}^{N-1} T[n], \qquad \overline{z}_l[N] = \frac{1}{N}\sum_{n=0}^{N-1} z_l[n], \quad l\in\{1,2,\cdots,L\}.$$
The goal is to minimize the time average penalty subject to $L$ constraints on resource consumptions. Specifically, we aim to solve the following fractional programming problem:
$$\text{min}\quad \limsup_{N\to\infty}\ \overline{y}[N]\big/\overline{T}[N] \quad (5.1)$$
$$\text{s.t.}\quad \limsup_{N\to\infty}\ \overline{z}_l[N]\big/\overline{T}[N] \leq c_l, \quad \forall l\in\{1,2,\cdots,L\}, \quad (5.2)$$
$$\alpha[n]\in\mathcal{A}, \quad \forall n\in\{0,1,2,\cdots\}, \quad (5.3)$$
where $c_l$, $l\in\{1,2,\cdots,L\}$, are nonnegative constants, and both the minimum and the constraints are taken in an almost sure sense. Finally, we use $\theta^*$ to denote the minimum achieved by the above optimization problem. For simplicity of notation, let:
$$K[n] = \sqrt{\sum_{l=1}^{L}\left(z_l[n]-c_lT[n]\right)^2}. \quad (5.4)$$

5.2.1 Assumptions

Our main results require the following assumptions; their importance will become clear as we proceed. We begin with a boundedness assumption:

Assumption 5.2.1 (Exponential type). Given $\omega[n]=\omega\in\Omega$ and $\alpha[n]=\alpha\in\mathcal{A}$ for a fixed $n$, it holds that $T[n]\geq 1$ with probability 1, and $y[n]$, $K[n]$, $T[n]$ are of exponential type, i.e. there exists a constant $\eta>0$ such that:
$$\mathbb{E}\left(e^{\eta y[n]}\mid\omega,\alpha\right) \leq B+1, \qquad \mathbb{E}\left(e^{\eta K[n]}\mid\omega,\alpha\right) \leq B+1, \qquad \mathbb{E}\left(e^{\eta T[n]}\mid\omega,\alpha\right) \leq B+1,$$
where $B$ is a positive constant.

The following proposition is a simple consequence of the above assumption:

Proposition 1. Suppose Assumption 5.2.1 holds. Let $X[n]$ be any of the three random variables $y[n]$, $K[n]$ and $T[n]$ for a fixed $n$. Then, given any $\omega[n]=\omega\in\Omega$ and $\alpha[n]=\alpha\in\mathcal{A}$:
$$\mathbb{E}\left(X[n]\mid\omega,\alpha\right) \leq B/\eta, \qquad \mathbb{E}\left(X[n]^2\mid\omega,\alpha\right) \leq 2B/\eta^2.$$
The proof follows from the inequality:
$$B+1 \geq \mathbb{E}\left(e^{\eta X[n]}\mid\omega,\alpha\right) \geq 1 + \eta\,\mathbb{E}(X[n]\mid\omega,\alpha) + \frac{\eta^2}{2}\,\mathbb{E}(X[n]^2\mid\omega,\alpha).$$

Assumption 5.2.2. There exists a positive constant $\theta_{\max}$ large enough so that the optimal objective of (5.1)-(5.3), denoted $\theta^*$, falls into $[0,\theta_{\max})$ with probability 1.

Remark 5.2.1. If $\theta^*<0$, then we can find a constant $c$ large enough so that $\theta^*+c\geq 0$, and define a new penalty $y'[n]=y[n]+cT[n]$. It is easy to see that minimizing $\limsup_{N\to\infty}\overline{y}[N]/\overline{T}[N]$ is equivalent to minimizing $\limsup_{N\to\infty}\overline{y'}[N]/\overline{T}[N]$, and the optimal objective of the new problem is $\theta^*+c$, which is nonnegative.

Assumption 5.2.3. Let $\hat y(\omega,\alpha)$, $\hat T(\omega,\alpha)$, $\hat z(\omega,\alpha)$ be the performance vector under a given $(\omega,\alpha)$ pair. Then, for any fixed $\omega\in\Omega$, the set of achievable performance vectors over all $\alpha\in\mathcal{A}$ is compact.

In order to state the next assumption, we need the notion of a randomized stationary policy. We start with the definition:

Definition 5.2.1 (Randomized stationary policy). A randomized stationary policy is an algorithm in which, at the beginning of each frame $n$, after observing the random event $\omega[n]$, the controller chooses $\alpha^*[n]$ with a conditional probability distribution that is the same for all $n$.

Assumption 5.2.4 (Bounded achievable region). Let
$$(\overline{y},\overline{T},\overline{z}) \triangleq \mathbb{E}\left(\hat y(\omega[0],\alpha^*[0]),\ \hat T(\omega[0],\alpha^*[0]),\ \hat z(\omega[0],\alpha^*[0])\right)$$
be the one-shot average of a randomized stationary policy. Let $\mathcal{R}\subseteq\mathbb{R}^{L+2}$ be the set of all achievable one-shot averages $(\overline{y},\overline{T},\overline{z})$. Then $\mathcal{R}$ is bounded.

Assumption 5.2.5 (ξ-slackness). There exists a randomized stationary policy $\alpha^{(\xi)}[n]$ such that the following holds:
$$\frac{\mathbb{E}\left(\hat z_l(\omega[n],\alpha^{(\xi)}[n])\right)}{\mathbb{E}\left(\hat T(\omega[n],\alpha^{(\xi)}[n])\right)} = c_l - \xi, \quad \forall l\in\{1,2,\cdots,L\},$$
where $\xi>0$ is a constant.

Remark 5.2.2 (Measurability issue). We implicitly assume that the policies for choosing $\alpha$ in reaction to $\omega$ result in a measurable $\alpha$, so that $T[n]$, $y[n]$, $z[n]$ are valid random variables and the expectations in Assumptions 5.2.4 and 5.2.5 are well defined. This assumption is mild. For example, when the sets $\Omega$ and $\mathcal{A}$ are finite, it holds for any randomized stationary policy. More generally, if $\Omega$ and $\mathcal{A}$ are measurable subsets of some separable metric spaces, it holds whenever the conditional probability in Definition 5.2.1 is "regular" (see [Dur13] for discussions of regular conditional probability) and $T[n]$, $y[n]$, $z[n]$ are continuous functions on $\Omega\times\mathcal{A}$.
5.3 An Online Algorithm

We define a vector of virtual queues $Q[n] = [Q_1[n]\ Q_2[n]\ \cdots\ Q_L[n]]$, initialized to 0 at $n=0$ and updated as follows:
$$Q_l[n+1] = \max\{Q_l[n] + z_l[n] - c_lT[n],\ 0\}. \quad (5.5)$$
The intuition behind this virtual queue idea is that if the algorithm can stabilize $Q_l[n]$, then the "arrival rate" $\overline{z}_l[N]/\overline{T}[N]$ is below the "service rate" $c_l$ and the constraint is satisfied. The proposed algorithm (Algorithm 5.1 below) proceeds via two fixed parameters $V>0$, $\delta>0$, and an additional process $\theta[n]$ that is initialized to $\theta[0]=0$. For any real number $x$, the notation $[x]_0^{\theta_{\max}}$ stands for the ceiling/floor truncation:
$$[x]_0^{\theta_{\max}} = \begin{cases} \theta_{\max}, & \text{if } x\in(\theta_{\max},+\infty);\\ x, & \text{if } x\in[0,\theta_{\max}];\\ 0, & \text{if } x\in(-\infty,0). \end{cases}$$

Algorithm 5.1 Online renewal optimization:
• At the beginning of each frame $n$, the controller observes $Q_l[n]$, $\theta[n]$, $\omega[n]$ and chooses the action $\alpha[n]\in\mathcal{A}$ to minimize the following function:
$$\mathbb{E}\left(V(y[n]-\theta[n]T[n]) + \sum_{l=1}^{L}Q_l[n](z_l[n]-c_lT[n])\ \Big|\ Q_l[n],\theta[n],\omega[n]\right). \quad (5.6)$$
• Update $\theta[n]$:
$$\theta[n+1] = \left[\frac{1}{(n+1)^\delta}\sum_{i=0}^{n}\left(y[i]-\theta[i]T[i] + \frac{1}{V}\sum_{l=1}^{L}Q_l[i](z_l[i]-c_lT[i])\right)\right]_0^{\theta_{\max}}.$$
• Update the virtual queues $Q_l[n]$:
$$Q_l[n+1] = \max\{Q_l[n]+z_l[n]-c_lT[n],\ 0\}, \quad l=1,2,\cdots,L.$$

Note that we can rewrite (5.6) in the following deterministic form:
$$V\left(\hat y(\omega[n],\alpha[n]) - \theta[n]\hat T(\omega[n],\alpha[n])\right) + \sum_{l=1}^{L}Q_l[n]\left(\hat z_l(\omega[n],\alpha[n]) - c_l\hat T(\omega[n],\alpha[n])\right).$$
Thus, Algorithm 5.1 proceeds by observing $\omega[n]$ on each frame $n$ and then choosing $\alpha[n]\in\mathcal{A}$ to minimize the above deterministic function. We can now see that the algorithm only uses knowledge of the current realization $\omega[n]$, not the statistics of $\omega[n]$. Also, the compactness assumption (Assumption 5.2.3) guarantees that the minimum of (5.6) is always achievable.
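The following is a minimal Python sketch of Algorithm 5.1 (our illustration, with placeholder names). It assumes a finite action set and user-supplied functions `y_hat`, `T_hat`, `z_hat` for the conditional expectations; for simplicity the realized frame outcomes are replaced by these conditional means, whereas a real system would plug in the observed $(y[n],T[n],z[n])$ instead.

```python
import numpy as np

def algorithm_5_1(omega_seq, actions, y_hat, T_hat, z_hat, c, V, delta, theta_max):
    """Sketch of Algorithm 5.1 over an observed sequence of events omega_seq.

    y_hat(w, a), T_hat(w, a): scalars; z_hat(w, a): length-L numpy array;
    c: length-L resource budgets. Returns the chosen actions.
    """
    c = np.asarray(c, dtype=float)
    Q = np.zeros(len(c))                    # virtual queues, Q_l[0] = 0
    theta, running_sum = 0.0, 0.0
    chosen = []
    for n, w in enumerate(omega_seq):
        # choose alpha[n] minimizing the deterministic form of (5.6)
        cost = lambda a: (V * (y_hat(w, a) - theta * T_hat(w, a))
                          + np.dot(Q, z_hat(w, a) - c * T_hat(w, a)))
        a = min(actions, key=cost)
        chosen.append(a)
        y, T, z = y_hat(w, a), T_hat(w, a), z_hat(w, a)
        # theta update: truncated running average [.]_0^theta_max
        running_sum += y - theta * T + np.dot(Q, z - c * T) / V
        theta = min(max(running_sum / (n + 1) ** delta, 0.0), theta_max)
        Q = np.maximum(Q + z - c * T, 0.0)  # virtual queue update (5.5)
    return chosen
```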
5.4 Feasibility Analysis

In this section, we prove that the proposed algorithm gives a sequence of actions $\{\alpha[n]\}_{n=0}^\infty$ that satisfies all desired constraints with probability 1. Specifically, we show that all virtual queues are stable with probability 1, leveraging an important lemma from [Haj82] to obtain an exponential bound for the norm of $Q[n]$.

5.4.1 The drift-plus-penalty bound

The start of our proof uses the drift-plus-penalty methodology (see [Nee12c] for a general introduction). We define the squared 2-norm of the virtual queue vector as:
$$\|Q[n]\|^2 = \sum_{l=1}^{L}Q_l[n]^2.$$
Define the Lyapunov drift $\Delta(Q[n])$ as:
$$\Delta(Q[n]) = \frac{1}{2}\left(\|Q[n+1]\|^2 - \|Q[n]\|^2\right).$$
Next, define the penalty at frame $n$ as $V(y[n]-\theta[n]T[n])$, where $V>0$ is a fixed trade-off parameter. The drift-plus-penalty methodology then suggests that we can stabilize the virtual queues by choosing an action $\alpha[n]\in\mathcal{A}$ to greedily minimize the following drift-plus-penalty expression, given the observed $Q[n]$, $\omega[n]$ and $\theta[n]$:
$$\mathbb{E}\left(V(y[n]-\theta[n]T[n]) + \Delta(Q[n])\ \big|\ Q_l[n],\theta[n],\omega[n]\right).$$
The penalty term $V(y[n]-\theta[n]T[n])$ uses the $\theta[n]$ variable, which depends on events from all previous frames. This penalty does not fit the rubric of [Nee12c], and convergence of the algorithm does not follow from prior work; a significant thrust of the current chapter is convergence analysis under such a penalty function.

In order to obtain an upper bound on $\Delta(Q[n])$, we square both sides of (5.5) and use the fact that $\max\{x,0\}^2\leq x^2$:
$$Q_l[n+1]^2 \leq Q_l[n]^2 + (z_l[n]-c_lT[n])^2 + 2Q_l[n](z_l[n]-c_lT[n]). \quad (5.7)$$
Summing the above over all $l\in\{1,\ldots,L\}$ and dividing by 2 gives:
$$\Delta(Q[n]) \leq \frac{1}{2}\sum_{l=1}^{L}(z_l[n]-c_lT[n])^2 + \sum_{l=1}^{L}Q_l[n](z_l[n]-c_lT[n]).$$
Adding $V(y[n]-\theta[n]T[n])$ to both sides and taking conditional expectations gives:
$$\mathbb{E}\left(V(y[n]-\theta[n]T[n])+\Delta(Q[n])\ \big|\ Q_l[n],\theta[n],\omega[n]\right)$$
$$\leq \mathbb{E}\left(V(y[n]-\theta[n]T[n]) + \sum_{l=1}^{L}Q_l[n](z_l[n]-c_lT[n])\ \Big|\ Q_l[n],\theta[n],\omega[n]\right) + \frac{1}{2}\sum_{l=1}^{L}\mathbb{E}\left((z_l[n]-c_lT[n])^2\right)$$
$$\leq \mathbb{E}\left(V(y[n]-\theta[n]T[n]) + \sum_{l=1}^{L}Q_l[n](z_l[n]-c_lT[n])\ \Big|\ Q_l[n],\theta[n],\omega[n]\right) + \frac{B^2}{\eta^2}, \quad (5.8)$$
where the last inequality follows from Proposition 1. Thus, as we have already seen in Algorithm 5.1, the proposed algorithm observes the vector $Q[n]$, the random event $\omega[n]$ and $\theta[n]$ at frame $n$, and minimizes the right hand side of (5.8).

5.4.2 Bounds on the virtual queue process and feasibility

In this section, we show how the bound (5.8) leads to feasibility of the proposed algorithm. Define $\mathcal{H}_n$ as the system history up until frame $n$. Formally, $\{\mathcal{H}_n\}_{n=0}^\infty$ is a filtration where each $\mathcal{H}_n$ is the $\sigma$-algebra generated by all the random variables before frame $n$. Notice that since $Q[n]$ and $\theta[n]$ depend only on events before frame $n$, $\mathcal{H}_n$ contains both $Q[n]$ and $\theta[n]$. The following important lemma gives a stability criterion for any real random process with a certain negative drift property:

Lemma 5.4.1 (Theorem 2.3 of [Haj82]). Let $R[n]$ be a real random process over $n\in\{0,1,2,\cdots\}$ satisfying the following two conditions for a fixed $r>0$:
1. For any $n$, $\mathbb{E}\left(e^{r(R[n+1]-R[n])}\mid\mathcal{H}_n\right)\leq\Gamma$ for some $\Gamma>0$.
2. Given $R[n]\geq\sigma$, $\mathbb{E}\left(e^{r(R[n+1]-R[n])}\mid\mathcal{H}_n\right)\leq\rho$ for some $\rho\in(0,1)$.

Suppose further that $R[0]\in\mathbb{R}$ is given and finite. Then, for every $n\in\{0,1,2,\cdots\}$, the following bound holds:
$$\mathbb{E}\left(e^{rR[n]}\right) \leq \rho^n e^{rR[0]} + \frac{1-\rho^n}{1-\rho}\,\Gamma e^{r\sigma}.$$

Thus, in order to show the stability of the virtual queue process, it is enough to verify these two conditions with $R[n]=\|Q[n]\|$. The following lemma shows that $\|Q[n]\|$ satisfies them:

Lemma 5.4.2 (Drift condition). Let $R[n]=\|Q[n]\|$. Then $R[n]$ satisfies the two conditions of Lemma 5.4.1 with the following constants:
$$\Gamma = B, \qquad r = \min\left\{\eta,\ \frac{\xi\eta^2}{4B}\right\}, \qquad \sigma = C_0V, \qquad \rho = 1 - \frac{r\xi}{2} + \frac{2B}{\eta^2}r^2 < 1,$$
where $C_0 = \frac{2B^2}{V\xi\eta^2} + \frac{2(\theta_{\max}+1)B}{\xi\eta} - \frac{\xi}{4V}$.

The central idea of the proof is to plug the ξ-slackness policy of Assumption 5.2.5 into the right hand side of (5.8). A similar idea was presented in Lemma 6 of [WYN15] under bounded increments of the virtual queue process. Here, we generalize the idea to the case where the increment of the virtual queues contains exponential type random variables $z_l[n]$ and $T[n]$. Note that the boundedness of $\theta[n]$ is crucial for the argument to hold, which justifies the truncation of the pseudo average in the algorithm. Lemma 5.4.2 is proved in Appendix 5.7.

Combining the above two lemmas, we immediately have the following corollary:

Corollary 5.4.1 (Exponential decay). Given $Q[0]=0$, the following holds for any $n\in\{0,1,2,\cdots\}$ under the proposed algorithm:
$$\mathbb{E}\left(e^{r\|Q[n]\|}\right) \leq D, \quad (5.9)$$
where $D = 1 + \frac{B}{1-\rho}e^{rC_0V}$, and $r$, $\rho$, $C_0$ are as defined in Lemma 5.4.2. Furthermore, we have $\mathbb{E}(\|Q[n]\|) \leq \frac{1}{r}\log\left(1+\frac{B}{1-\rho}e^{rC_0V}\right)$, i.e. the queue size is $O(V)$.

The bound on $\mathbb{E}(\|Q[n]\|)$ follows readily from (5.9) via Jensen's inequality. With Corollary 5.4.1 in hand, we can prove the following theorem:

Theorem 5.4.1 (Feasibility). All constraints in (5.1)-(5.3) are satisfied under the proposed algorithm with probability 1.

Proof of Theorem 5.4.1. By the queue updating rule (5.5), for any $n$ and any $l\in\{1,2,\cdots,L\}$, one has:
$$Q_l[n+1] \geq Q_l[n] + z_l[n] - c_lT[n].$$
With Corollary 5.4.1 in hand, we can prove the following theorem:

Theorem 5.4.1 (Feasibility). All constraints in (5.1)-(5.3) are satisfied under the proposed algorithm with probability 1.

Proof of Theorem 5.4.1. By the queue updating rule (5.5), for any $n$ and any $l\in\{1,2,\cdots,L\}$, one has
$$Q_l[n+1] \ge Q_l[n] + z_l[n] - c_l T[n].$$
Fix $N$ as a positive integer. Then, summing over all $n\in\{0,1,2,\cdots,N-1\}$,
$$Q_l[N] \ge Q_l[0] + \sum_{n=0}^{N-1}(z_l[n]-c_l T[n]).$$
Since $Q_l[0]=0$, $\forall l$, and $T[n]\ge1$, $\forall n$,
$$\frac{\sum_{n=0}^{N-1}z_l[n]}{\sum_{n=0}^{N-1}T[n]} - c_l \le \frac{Q_l[N]}{\sum_{n=0}^{N-1}T[n]} \le \frac{Q_l[N]}{N}. \quad (5.10)$$
Define the event $A_N^{(\varepsilon)} = \{Q_l[N]>\varepsilon N\}$. By the Markov inequality and Corollary 5.4.1, for any $\varepsilon>0$, we have
$$\Pr(Q_l[N]>\varepsilon N) \le \Pr(r\|\mathbf{Q}[N]\|>r\varepsilon N) = \Pr\big(e^{r\|\mathbf{Q}[N]\|}>e^{r\varepsilon N}\big) \le \frac{\mathbb{E}\big(e^{r\|\mathbf{Q}[N]\|}\big)}{e^{r\varepsilon N}} \le De^{-r\varepsilon N},$$
where $r$ is defined in Corollary 5.4.1. Thus,
$$\sum_{N=0}^\infty\Pr(Q_l[N]>\varepsilon N) \le D\sum_{N=0}^\infty e^{-r\varepsilon N} < +\infty.$$
Thus, by the Borel-Cantelli lemma [Dur13],
$$\Pr\big(A_N^{(\varepsilon)}\text{ occurs infinitely often}\big) = 0.$$
Since $\varepsilon>0$ is arbitrary, letting $\varepsilon\to0$ gives
$$\Pr\Big(\lim_{N\to\infty}\frac{Q_l[N]}{N}=0\Big)=1.$$
Finally, taking $\limsup_{N\to\infty}$ on both sides of (5.10) and substituting in the above equation gives the claim. □

5.5 Optimality Analysis

In this section, we show that the proposed algorithm achieves a time average penalty within $\mathcal{O}(1/V)$ of the optimal objective $\theta^*$. Since the algorithm meets all the constraints, it follows that
$$\limsup_{n\to\infty}\frac{\sum_{i=0}^{n-1}y[i]}{\sum_{i=0}^{n-1}T[i]} \ge \theta^*,\quad\text{w.p.1}.$$
Thus, it is enough to prove the following theorem:

Theorem 5.5.1 (Near optimality). For any $\delta\in(1/3,1)$ and $V\ge1$, the objective value produced by the proposed algorithm is near optimal, with
$$\limsup_{n\to\infty}\frac{\sum_{i=0}^{n-1}y[i]}{\sum_{i=0}^{n-1}T[i]} \le \theta^* + \frac{B^2}{\eta^2V},\quad\text{w.p.1},$$
i.e., the algorithm achieves $\mathcal{O}(1/V)$ near optimality.

Remark 5.5.1. Combining Theorem 5.5.1 with Corollary 5.4.1, we see that the tuning parameter $V$ trades off the sub-optimality gap against the virtual queue bound (i.e., the constraint violation). In particular, our result recovers the classical $[\mathcal{O}(1/V),\mathcal{O}(V)]$ trade-off from the opportunistic scheduling literature [Nee10b].

In order to prove Theorem 5.5.1, we introduce the following notation:
$$\text{original pseudo average:}\quad \hat\theta[n] \triangleq \frac{1}{(n+1)^\delta}\sum_{i=0}^n\Big(y[i]-\theta[i]T[i]+\frac1V\sum_{l=1}^L Q_l[i](z_l[i]-c_lT[i])\Big),$$
$$\text{tamed pseudo average:}\quad \theta[n] \triangleq \Bigg[\frac{1}{(n+1)^\delta}\sum_{i=0}^n\Big(y[i]-\theta[i]T[i]+\frac1V\sum_{l=1}^L Q_l[i](z_l[i]-c_lT[i])\Big)\Bigg]_0^{\theta_{\max}}.$$

5.5.1 Relation between $\hat\theta[n]$ and $\theta[n]$

We start with a preliminary lemma illustrating that the original pseudo average $\hat\theta[n]$ behaves almost the same as the tamed pseudo average $\theta[n]$. Note that $\theta[n]$ can be written as $\theta[n]=[\hat\theta[n]]_0^{\theta_{\max}}$.

Lemma 5.5.1 (Equivalence relation). For any $x\in(0,\theta_{\max})$:
1. $\theta[n]\ge x$ if and only if $\hat\theta[n]\ge x$.
2. $\theta[n]\le x$ if and only if $\hat\theta[n]\le x$.
3. $\limsup_{n\to\infty}\theta[n]\le x$ if and only if $\limsup_{n\to\infty}\hat\theta[n]\le x$.
4. $\limsup_{n\to\infty}\theta[n]\ge x$ if and only if $\limsup_{n\to\infty}\hat\theta[n]\ge x$.

This lemma is intuitive and the proof is given in Appendix 5.7. We will prove results on $\hat\theta[n]$ which extend naturally to $\theta[n]$ via Lemma 5.5.1.

The key idea in proving Theorem 5.5.1 is to bound the original pseudo average process $\hat\theta[n]$ asymptotically from above by $\theta^*$, which is Theorem 5.5.2 below. We prove Theorem 5.5.2 through the following three steps:
• We construct a truncated version of $\hat\theta[n]$, namely $\tilde\theta[n]$, which has the same limit as $\hat\theta[n]$ (Lemma 5.5.3 below), so that it is enough to show $\tilde\theta[n]\le\theta^*$ asymptotically.
• For the process $\tilde\theta[n]$, we bound the moments of the hitting time, namely, the time interval between two consecutive visits to the region $\{\tilde\theta[n]\le\theta^*\}$, by constructing a dominating exponential supermartingale and bounding its size (Lemmas 5.5.6 and 5.5.7 below).
• We show that $\tilde\theta[n]>\theta^*$ occurs only finitely often asymptotically (with probability 1) using the bounded moments of the hitting time.
5.5.2 Towards near optimality (I): Truncation

The following lemma states that the optimum of (5.1)-(5.3) is achievable within the closure of the set of all one-shot averages specified in Assumption 5.2.4:

Lemma 5.5.2 (Stationary optimality). Let $\theta^*$ be the optimal objective of (5.1)-(5.3). Then, there exists a tuple $(y^*,T^*,\mathbf{z}^*)\in\overline{\mathcal{R}}$, the closure of $\mathcal{R}$, such that the following hold:
$$y^*/T^* = \theta^* \quad (5.11)$$
$$z_l^*/T^* \le c_l,\quad\forall l\in\{1,2,\cdots,L\}, \quad (5.12)$$
i.e., the optimum is achievable within $\overline{\mathcal{R}}$.

The proof of this lemma is similar to the proofs of Theorem 4.5 and Lemma 7.1 of [Nee10b]; we omit the details for brevity.

We start the truncation by picking an $\varepsilon_0>0$ small enough so that $\theta^*+\varepsilon_0/V<\theta_{\max}$. We aim to show $\limsup_{n\to\infty}\theta[n]\le\theta^*+\varepsilon_0/V$. By Lemma 5.5.1, it is enough to show $\limsup_{n\to\infty}\hat\theta[n]\le\theta^*+\varepsilon_0/V$. The following lemma tells us it is enough to prove this for a further term-wise truncated version of $\hat\theta[n]$.

Lemma 5.5.3 (Truncation lemma). Consider the following alternative pseudo average $\{\tilde\theta[n]\}_{n=0}^\infty$ obtained by truncating each summand, with $\tilde\theta[0]=0$ and
$$\tilde\theta[n+1] = \frac{1}{(n+1)^\delta}\sum_{i=0}^n\Bigg[\Big(y[i]-\theta[i]T[i]+\frac1V\sum_{l=1}^L Q_l[i](z_l[i]-c_lT[i])\Big)\wedge\Big(\frac2\eta+\frac{4\sqrt L}{\eta rV}\Big)\log^2(i+1)\Bigg],$$
where $a\wedge b\triangleq\min\{a,b\}$, $\eta$ is defined in Assumption 5.2.1 and $r$ is defined in Lemma 5.4.2. Then, we have
$$\limsup_{n\to\infty}\hat\theta[n] = \limsup_{n\to\infty}\tilde\theta[n].$$

Proof of Lemma 5.5.3. Consider any frame $i\in\{0,1,2,\dots\}$ on which the summands of $\hat\theta[n]$ and $\tilde\theta[n]$ differ, i.e.,
$$y[i]-\theta[i]T[i]+\frac1V\sum_{l=1}^L Q_l[i](z_l[i]-c_lT[i]) > \Big(\frac2\eta+\frac{4\sqrt L}{\eta rV}\Big)\log^2(i+1). \quad (5.13)$$
By the Cauchy-Schwarz inequality, this implies
$$y[i]-\theta[i]T[i]+\frac1V\sqrt{\sum_{l=1}^L Q_l[i]^2}\sqrt{\sum_{l=1}^L(z_l[i]-c_lT[i])^2} > \Big(\frac2\eta+\frac{4\sqrt L}{\eta rV}\Big)\log^2(i+1).$$
Thus, at least one of the following three events has occurred:
1. $A_i \triangleq \big\{y[i]-\theta[i]T[i] > \frac2\eta\log^2(i+1)\big\}$.
2. $B_i \triangleq \big\{\sqrt{\sum_{l=1}^L Q_l[i]^2} > \frac{2\sqrt L}{r}\log(i+1)\big\}$.
3. $E_i \triangleq \big\{K[i] > \frac2\eta\log(i+1)\big\}$,
where $K[i]$ is defined in (5.4). Indeed, the occurrence of at least one of these three events is necessary for (5.13) to happen. We then argue that the union of these three events occurs only finitely many times; thus, as $n\to\infty$, the discrepancies are negligible.

Suppose the event $A_i$ occurs. Since $y[i]-\theta[i]T[i]\le y[i]$, it follows that $y[i]>\frac2\eta\log^2(i+1)$. Then, we have
$$\Pr(A_i) \le \Pr\Big(y[i]>\frac2\eta\log^2(i+1)\Big) = \Pr\big(e^{\eta y[i]}>e^{2\log^2(i+1)}\big) \le \frac{\mathbb{E}\big(e^{\eta y[i]}\big)}{(i+1)^{2\log(i+1)}} \le \frac{B}{(i+1)^{2\log(i+1)}},$$
where the second-to-last inequality follows from the Markov inequality and the last inequality follows from Assumption 5.2.1.

Suppose the event $B_i$ occurs. Then, we have
$$\|\mathbf{Q}[i]\| = \sqrt{\sum_{l=1}^L Q_l[i]^2} > \frac{2\sqrt L}{r}\log(i+1) \ge \frac2r\log(i+1).$$
Thus,
$$\Pr(B_i) \le \Pr\Big(\|\mathbf{Q}[i]\|>\frac2r\log(i+1)\Big) = \Pr\big(e^{r\|\mathbf{Q}[i]\|}>e^{2\log(i+1)}\big) \le \frac{\mathbb{E}\big(e^{r\|\mathbf{Q}[i]\|}\big)}{(i+1)^2} \le \frac{D}{(i+1)^2},$$
where the second-to-last inequality follows from the Markov inequality and the last inequality follows from Corollary 5.4.1.

Suppose the event $E_i$ occurs. Again, by Assumption 5.2.1 and the Markov inequality,
$$\Pr(E_i) = \Pr\Big(K[i]>\frac2\eta\log(i+1)\Big) = \Pr\big(e^{\eta K[i]}>e^{2\log(i+1)}\big) \le \frac{\mathbb{E}\big(e^{\eta K[i]}\big)}{(i+1)^2} \le \frac{B}{(i+1)^2},$$
where the last inequality follows from Assumption 5.2.1 again. Now, by a union bound,
$$\Pr(A_i\cup B_i\cup E_i) \le \Pr(A_i)+\Pr(B_i)+\Pr(E_i) \le \frac{B}{(i+1)^{2\log(i+1)}}+\frac{B+D}{(i+1)^2},$$
and thus
$$\sum_{i=0}^\infty\Pr(A_i\cup B_i\cup E_i) \le \sum_{i=0}^\infty\Big(\frac{B}{(i+1)^{2\log(i+1)}}+\frac{B+D}{(i+1)^2}\Big) < \infty.$$
By the Borel-Cantelli lemma, the event $A_i\cup B_i\cup E_i$ occurs only finitely many times with probability 1, and the proof is finished. □
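The three pseudo averages $\hat\theta[n]$, $\theta[n]$ and $\tilde\theta[n]$ differ only in where the truncation is applied. The following minimal sketch computes all three from a realized sequence of summands; the input `u` and the function name are illustrative stand-ins, not quantities produced by any specific routine in this chapter.

```python
import numpy as np

def pseudo_averages(u, delta, theta_max, eta, r, V, L):
    """Sketch: u[i] is the realized summand y[i] - theta[i]T[i]
    + (1/V) sum_l Q_l[i](z_l[i] - c_l T[i]). Returns the original
    pseudo average theta_hat, the tamed version [theta_hat]_0^{theta_max},
    and the term-wise truncated theta_tilde of Lemma 5.5.3."""
    u = np.asarray(u, dtype=float)
    i1 = np.arange(1, len(u) + 1)                   # i + 1, for i = 0,1,...
    cap = (2/eta + 4*np.sqrt(L)/(eta*r*V)) * np.log(i1)**2
    theta_hat = np.cumsum(u) / i1**delta            # whole-sum, no truncation
    theta_tame = np.clip(theta_hat, 0.0, theta_max) # truncate the average
    theta_tilde = np.cumsum(np.minimum(u, cap)) / i1**delta  # truncate terms
    return theta_hat, theta_tame, theta_tilde
```

Since `np.minimum(u, cap) <= u` entrywise, the code makes the inequality $\tilde\theta[n]\le\hat\theta[n]$ used below immediate.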
Lemma 5.5.3 is crucial for the rest of the proof. Specifically, it creates an alternative sequence $\tilde\theta[n]$ which has the following two properties:
1. We know exactly what the upper bound of each summand is, whereas in $\hat\theta[n]$ there is no exact bound on the summand due to $Q_l[i]$ and the other exponential-type random variables.
2. For any $n\in\mathbb{N}$, we have $\tilde\theta[n]\le\hat\theta[n]$. Thus, if $\tilde\theta[n]\ge\theta^*+\varepsilon_0/V$ for some $n$, then $\hat\theta[n]\ge\theta^*+\varepsilon_0/V$.

5.5.3 Towards near optimality (II): Exponential supermartingale

The following preliminary lemma demonstrates a negative drift property for each summand of $\tilde\theta[n]$.

Lemma 5.5.4 (Key feature inequality). For any $\varepsilon_0>0$, if $\theta[i]\ge\theta^*+\varepsilon_0/V$, then we have
$$\mathbb{E}\Bigg(\Big(y[i]-\theta[i]T[i]+\frac1V\sum_{l=1}^L Q_l[i](z_l[i]-c_lT[i])\Big)\wedge\Big(\frac2\eta+\frac{4\sqrt L}{\eta rV}\Big)\log^2(i+1)\ \Bigg|\ \mathcal{H}_i\Bigg) \le -\varepsilon_0/V.$$

Proof of Lemma 5.5.4. Since the proposed algorithm minimizes (5.6) over all possible decisions in $\mathcal{A}$, it must achieve a value less than or equal to that of any randomized stationary algorithm $\alpha^*[i]$. This in turn implies
$$\mathbb{E}\Big(y[i]-\theta[i]T[i]+\frac1V\sum_{l=1}^L Q_l[i](z_l[i]-c_lT[i])\ \Big|\ \mathcal{H}_i,\omega[i]\Big) \le \mathbb{E}\Big(\hat y(\omega[i],\alpha^*[i])-\theta[i]\hat T(\omega[i],\alpha^*[i])+\frac1V\sum_{l=1}^L Q_l[i]\big(\hat z_l(\omega[i],\alpha^*[i])-c_l\hat T(\omega[i],\alpha^*[i])\big)\ \Big|\ \mathcal{H}_i,\omega[i]\Big).$$
Taking expectations of both sides with respect to $\omega[i]$ and using the fact that randomized stationary algorithms are i.i.d. over frames and independent of $\mathcal{H}_i$, we have
$$\mathbb{E}\Big(y[i]-\theta[i]T[i]+\frac1V\sum_{l=1}^L Q_l[i](z_l[i]-c_lT[i])\ \Big|\ \mathcal{H}_i\Big) \le y-\theta[i]T+\frac1V\sum_{l=1}^L Q_l[i](z_l-c_lT)$$
for any $(y,T,\mathbf{z})\in\mathcal{R}$. Since the tuple $(y^*,T^*,\mathbf{z}^*)$ specified in Lemma 5.5.2 is in the closure of $\mathcal{R}$, we can replace $(y,T,\mathbf{z})$ by $(y^*,T^*,\mathbf{z}^*)$ and the inequality still holds. This gives
$$\mathbb{E}\Big(y[i]-\theta[i]T[i]+\frac1V\sum_{l=1}^L Q_l[i](z_l[i]-c_lT[i])\ \Big|\ \mathcal{H}_i\Big) \le y^*-\theta[i]T^*+\frac1V\sum_{l=1}^L Q_l[i](z_l^*-c_lT^*) = T^*\Big(y^*/T^*-\theta[i]+\frac1V\sum_{l=1}^L Q_l[i](z_l^*/T^*-c_l)\Big) \le T^*(\theta^*-\theta[i]) \le -\varepsilon_0/V,$$
where the second-to-last inequality follows from (5.11) and (5.12), and the last inequality follows from $\theta[i]\ge\theta^*+\varepsilon_0/V$ and $T^*\ge1$. Finally, since $a\wedge b\le a$ for any real numbers $a,b$, it follows that
$$\mathbb{E}\Bigg(\Big(y[i]-\theta[i]T[i]+\frac1V\sum_{l=1}^L Q_l[i](z_l[i]-c_lT[i])\Big)\wedge\Big(\frac2\eta+\frac{4\sqrt L}{\eta rV}\Big)\log^2(i+1)\ \Bigg|\ \mathcal{H}_i\Bigg) \le \mathbb{E}\Big(y[i]-\theta[i]T[i]+\frac1V\sum_{l=1}^L Q_l[i](z_l[i]-c_lT[i])\ \Big|\ \mathcal{H}_i\Big) \le -\varepsilon_0/V,$$
and the claim follows. □

Define $n_k$ as the frame on which $\tilde\theta[n]$ visits the set $(-\infty,\theta^*+\varepsilon_0/V)$ for the $k$-th time, with the following conventions:
1. If $\tilde\theta[n]\in(-\infty,\theta^*+\varepsilon_0/V)$ and $\tilde\theta[n+1]\in(-\infty,\theta^*+\varepsilon_0/V)$, then these count as two visits.
2. For $k=1$, $n_1=0$.
Define the hitting time $S_{n_k} = n_{k+1}-n_k$. The goal is to obtain a moment bound on this quantity when $\tilde\theta[n_k+1]\ge\theta^*+\varepsilon_0/V$ (otherwise, this quantity is 1). In order to do so, we introduce a new process: for any $n_k$, define
$$F[n] \triangleq \sum_{i=n_k}^{n-1}\Bigg(\Big(y[i]-\theta[i]T[i]+\frac1V\sum_{l=1}^L Q_l[i](z_l[i]-c_lT[i])\Big)\wedge\Big(\frac2\eta+\frac{4\sqrt L}{\eta rV}\Big)\log^2(i+1)\Bigg),\quad\forall n>n_k. \quad (5.14)$$
The following lemma shows that this $F[n]$ is indeed closely related to $\tilde\theta[n]$; it plays an important role in proving Lemma 5.5.7:

Lemma 5.5.5. For any $n>n_k$, if $\tilde\theta[n]\ge\theta^*+\varepsilon_0/V$, then $F[n]\ge0$.

Proof of Lemma 5.5.5. Suppose $\tilde\theta[n]\ge\theta^*+\varepsilon_0/V$. Then, the following holds:
$$\theta^*+\varepsilon_0/V \le \tilde\theta[n] = \frac{n_k^\delta}{n^\delta}\tilde\theta[n_k] + \frac{1}{n^\delta}F[n].$$
Thus, $F[n]\ge n^\delta(\theta^*+\varepsilon_0/V)-n_k^\delta\tilde\theta[n_k]$. Since at frame $n_k$ we have $\tilde\theta[n_k]<\theta^*+\varepsilon_0/V$, it follows that
$$F[n] \ge \big(n^\delta-n_k^\delta\big)(\theta^*+\varepsilon_0/V).$$
Since $\theta^*+\varepsilon_0/V\ge0$, it follows that $F[n]\ge0$, and the claim follows. □
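The visit times $n_k$ and hitting times $S_{n_k}$ are purely a bookkeeping device on the trajectory of $\tilde\theta[n]$. The sketch below extracts them from a realized trajectory under the two conventions above; it is an illustration of the definitions, with names chosen for readability.

```python
def visit_and_hitting_times(theta_tilde, threshold):
    """Sketch: n_k is the frame of the k-th visit of theta_tilde to
    (-inf, threshold); consecutive frames in the set count as separate
    visits, and n_1 = 0 by convention. S_{n_k} = n_{k+1} - n_k."""
    visits = [0] + [n for n in range(1, len(theta_tilde))
                    if theta_tilde[n] < threshold]
    hitting = [visits[k + 1] - visits[k] for k in range(len(visits) - 1)]
    return visits, hitting
```

Note that whenever $\tilde\theta[n_k+1]$ is already below the threshold, the corresponding hitting time is 1, matching the parenthetical remark above.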
Recall that our goal is to bound the hitting time $S_{n_k}$ of the process $\tilde\theta[n]$ on the event $\{\tilde\theta[n_k+1]\ge\theta^*+\varepsilon_0/V\}$, exploiting the strictly negative drift property of Lemma 5.5.4. A classical approach to analyzing the hitting time of a stochastic process comes from Wald's construction of a martingale for sequential analysis (see, for example, [Wal44] for details). Later, [Haj82] extended this idea to analyze the stability of a queueing system with a drift condition via a supermartingale construction. Here, we take one step further by considering the following supermartingale construction based on $F[n]$:

Lemma 5.5.6 (Exponential supermartingale). Fix $\varepsilon_0>0$ and $V\ge\max\big\{\frac{\varepsilon_0\eta}{4\log^2 2}-\frac{2\sqrt L}{r},\ 1\big\}$ such that $\theta^*+\varepsilon_0/V<\theta_{\max}$. Define a new random process, starting from $n_k+1$,
$$G[n] \triangleq \frac{\exp\big(\lambda_n F[n\wedge(n_k+S_{n_k})]\big)}{\prod_{i=n_k+1}^{n\wedge(n_k+S_{n_k})}\rho_i}\,\mathbf{1}_{\{\tilde\theta[n_k+1]\ge\theta^*+\varepsilon_0/V\}},$$
where for any set $A$, $\mathbf{1}_A$ is the indicator function taking value 1 if $A$ is true and 0 otherwise. For any $n\ge n_k+1$, $\lambda_n$ and $\rho_n$ are defined as follows:
$$\lambda_n = \frac{\varepsilon_0}{2Ve\big(\frac2\eta+\frac{4\sqrt L}{\eta rV}\big)^2\log^4(n+1)},\qquad \rho_n = 1-\frac{\varepsilon_0^2}{4V^2e\big(\frac2\eta+\frac{4\sqrt L}{\eta rV}\big)^2\log^4(n+1)}.$$
Then, the process $G[n]$ is measurable with respect to $\mathcal{H}_n$, $\forall n\ge n_k+1$; furthermore, it is a supermartingale with respect to the filtration $\{\mathcal{H}_n\}_{n\ge n_k+1}$.

The proof of Lemma 5.5.6 is given in Appendix 5.7.

Remark 5.5.2. If the increments $F[n+1]-F[n]$ were bounded, we could adopt a construction similar to that of [Haj82]. However, in our scenario $F[n+1]-F[n]$ is of order $\log^2(n+1)$, which is increasing and unbounded. Thus, we need decreasing exponents $\lambda_n$ and increasing weights $\rho_n$ to account for this. Furthermore, the indicator function reflects that we are only interested in the scenario $\{\tilde\theta[n_k+1]\ge\theta^*+\varepsilon_0/V\}$.

The following lemma uses the previous result to bound the conditional fourth moment of the hitting time $S_{n_k}$.

Lemma 5.5.7. Given $V\ge\max\big\{\frac{\varepsilon_0\eta}{4\log^2 2}-\frac{2\sqrt L}{r},\ 1\big\}$ as in Lemma 5.5.6, for any $\beta\in(0,1/5)$ and any $\varepsilon_0>0$ such that $\theta^*+\varepsilon_0/V<\theta_{\max}$, there exists a positive constant $C_{\beta,V,\varepsilon_0}\simeq\mathcal{O}\big(V^{10}\beta^{-20}\varepsilon_0^{-10}\big)$ such that
$$\mathbb{E}\big(S_{n_k}^4\mid\mathcal{H}_{n_k}\big) \le C_{\beta,V,\varepsilon_0}(n_k+2)^{4\beta},\quad\forall k\ge1.$$

Proof of Lemma 5.5.7. First of all, Lemma 5.5.6 gives that $G[n]$ is a supermartingale starting from $n_k+1$; thus, we have the following chain of inequalities for any $n\ge n_k+1$:
$$G[n_k+1] = \mathbb{E}(G[n_k+1]\mid\mathcal{H}_{n_k+1}) \ge \mathbb{E}(G[n]\mid\mathcal{H}_{n_k+1}) = \mathbb{E}\Bigg(\frac{e^{\lambda_n F[n\wedge(n_k+S_{n_k})]}}{\prod_{i=n_k+1}^{n\wedge(n_k+S_{n_k})}\rho_i}\,\mathbf{1}_{\{\tilde\theta[n_k+1]\ge\theta^*+\varepsilon_0/V\}}\ \Bigg|\ \mathcal{H}_{n_k+1}\Bigg) \ge \mathbb{E}\Bigg(\frac{e^{\lambda_n F[n\wedge(n_k+S_{n_k})]}}{\prod_{i=n_k+1}^n\rho_i}\,\mathbf{1}_{\{S_{n_k}\ge n-n_k+1\}}\mathbf{1}_{\{\tilde\theta[n_k+1]\ge\theta^*+\varepsilon_0/V\}}\ \Bigg|\ \mathcal{H}_{n_k+1}\Bigg) \ge \frac{1}{\prod_{i=n_k+1}^n\rho_i}\Pr\big(S_{n_k}\ge n-n_k+1,\ \tilde\theta[n_k+1]\ge\theta^*+\varepsilon_0/V\ \big|\ \mathcal{H}_{n_k+1}\big),$$
where the first inequality uses the supermartingale property and the last inequality uses Lemma 5.5.5: on the set $\{S_{n_k}\ge n-n_k+1\}$, $n\wedge(n_k+S_{n_k})=n$ and $F[n]\ge0$. By the definition of $G[n_k+1]$,
$$G[n_k+1] = \frac{e^{\lambda_{n_k+1}F[n_k+1]}}{\rho_{n_k+1}} \le \frac{e^{\lambda_{n_k+1}\big(\frac2\eta+\frac{4\sqrt L}{\eta rV}\big)\log^2(n_k+2)}}{\rho_{n_k+1}} \le \frac43 e,$$
where the first inequality follows from the definition of $F[n]$, and the second from the assumption $V\ge\frac{\varepsilon_0\eta}{4\log^2 2}-\frac{2\sqrt L}{r}$, which gives $\lambda_{n_k+1}\le\frac{1}{(\frac2\eta+\frac{4\sqrt L}{\eta rV})\log^2(n_k+2)}$ and $\rho_{n_k+1}\ge1-\frac{\log^2 2}{2e}>\frac34$. Thus, it follows that
$$\Pr\big(S_{n_k}\ge n-n_k+1,\ \tilde\theta[n_k+1]\ge\theta^*+\varepsilon_0/V\ \big|\ \mathcal{H}_{n_k+1}\big) \le \Bigg(\prod_{i=n_k+1}^n\rho_i\Bigg)\cdot\frac43 e.$$
Now, we bound the fourth moment of the hitting time:
$$\mathbb{E}\big(S_{n_k}^4\ \big|\ \mathcal{H}_{n_k+1}\big) = \sum_{m=1}^\infty m^4\Pr\big(S_{n_k}=m\mid\mathcal{H}_{n_k+1}\big) \le \sum_{m=1}^\infty\big((m+1)^4-m^4\big)\Pr\big(S_{n_k}\ge m+1,\ \tilde\theta[n_k+1]\ge\theta^*+\varepsilon_0/V\ \big|\ \mathcal{H}_{n_k+1}\big) + 1 \le 4\sum_{m=1}^\infty(m+1)^3\Pr\big(S_{n_k}\ge m+1,\ \tilde\theta[n_k+1]\ge\theta^*+\varepsilon_0/V\ \big|\ \mathcal{H}_{n_k+1}\big) + 1 \le 1+\frac{16}{3}e\sum_{m=1}^\infty(m+1)^3\prod_{i=n_k+1}^{n_k+m}\rho_i.$$
Thus, it remains to show that there exists a constant $C$ of order $\mathcal{O}\big(V^{10}\beta^{-20}\varepsilon_0^{-10}\big)$ such that
$$\sum_{m=1}^\infty(m+1)^3\prod_{i=n_k+1}^{n_k+m}\rho_i \le C(n_k+2)^{4\beta},$$
which is given in Appendix 5.8. This implies that there exists a $C_{\beta,V,\varepsilon_0}$ so that $\mathbb{E}\big(S_{n_k}^4\mid\mathcal{H}_{n_k+1}\big)\le C_{\beta,V,\varepsilon_0}(n_k+2)^{4\beta}$. Thus,
$$\mathbb{E}\big(S_{n_k}^4\mid\mathcal{H}_{n_k}\big) = \mathbb{E}\big(\mathbb{E}\big(S_{n_k}^4\mid\mathcal{H}_{n_k+1}\big)\ \big|\ \mathcal{H}_{n_k}\big) \le \mathbb{E}\big(C_{\beta,V,\varepsilon_0}(n_k+2)^{4\beta}\mid\mathcal{H}_{n_k}\big) = C_{\beta,V,\varepsilon_0}(n_k+2)^{4\beta},$$
where the last equality follows from the fact that $n_k\in\mathcal{H}_{n_k}$. This finishes the proof. □

5.5.4 An asymptotic upper bound on $\theta[n]$

So far, we have proved that if we pick any $\varepsilon_0>0$ such that $\theta^*+\varepsilon_0/V<\theta_{\max}$, then the inter-visit time has a bounded conditional fourth moment. We aim to show that $\limsup_{n\to\infty}\hat\theta[n]\le\theta^*$ with probability 1. By Lemma 5.5.3, it is enough to show $\limsup_{n\to\infty}\tilde\theta[n]\le\theta^*$. To do so, we need the following second Borel-Cantelli lemma:

Lemma 5.5.8 (Theorem 5.3.2 of [Dur13]). Let $\mathcal{F}_k$, $k\ge1$, be a filtration with $\mathcal{F}_1=\{\emptyset,\Omega\}$, and let $A_k$, $k\ge1$, be a sequence of events with $A_k\in\mathcal{F}_{k+1}$. Then,
$$\{A_k\text{ occurs infinitely often}\} = \Big\{\sum_{k=1}^\infty\Pr(A_k\mid\mathcal{F}_k)=\infty\Big\}.$$

Theorem 5.5.2 (Asymptotic upper bound). For any $\delta\in(1/3,1)$ and $V\ge1$, the following hold:
$$\limsup_{n\to\infty}\hat\theta[n]\le\theta^*\ \text{ w.p.1},\qquad\text{and}\qquad \limsup_{n\to\infty}\theta[n]\le\theta^*\ \text{ w.p.1}.$$

Proof of Theorem 5.5.2. First of all, since the inter-hitting time $S_{n_k}$ has a finite fourth moment, each inter-hitting time is finite with probability 1, and thus the process $\{\tilde\theta[n]\}_{n=0}^\infty$ visits $(-\infty,\theta^*+\varepsilon_0/V)$ infinitely many times with probability 1. Then, we pick any $\epsilon>0$ and define the following sequence of events:
$$A_k \triangleq \Big\{\frac{S_{n_k}}{n_k^{1/3}}>\epsilon\Big\},\quad k=1,2,\cdots. \quad (5.15)$$
For any fixed $k$, by the conditional Markov inequality, the following holds with probability 1:
$$\Pr(A_k\mid\mathcal{H}_{n_k}) = \Pr\big(S_{n_k}^4>\epsilon^4 n_k^{4/3}\ \big|\ \mathcal{H}_{n_k}\big) \le \frac{\mathbb{E}\big(S_{n_k}^4\mid\mathcal{H}_{n_k}\big)}{\epsilon^4 n_k^{4/3}} \le \frac{C_{\beta,V,\varepsilon_0}(n_k+2)^{4\beta}}{\epsilon^4 n_k^{4/3}} \le \frac{C_{\beta,V,\varepsilon_0}}{\epsilon^4}n_k^{-4/3+4\beta}+\frac{C_{\beta,V,\varepsilon_0}2^{4\beta}}{\epsilon^4}n_k^{-4/3} \le \frac{C_{\beta,V,\varepsilon_0}}{\epsilon^4}k^{-4/3+4\beta}+\frac{C_{\beta,V,\varepsilon_0}2^{4\beta}}{\epsilon^4}k^{-4/3},$$
where the second inequality follows from Lemma 5.5.7 with $\beta\in(0,1/5)$, the third inequality follows from the fact that $(a+b)^x\le a^x+b^x$, $\forall a,b\ge0$ and $x\in(0,1)$, and the last inequality follows from the fact that the inter-hitting time takes at least one frame and thus $n_k\ge k$.

Choose $\mathcal{F}_k=\mathcal{H}_{n_k}$ and $A_k$ as defined in (5.15). Then, for any $\beta\in(0,1/12)$, we have with probability 1,
$$\sum_{k=1}^\infty\Pr(A_k\mid\mathcal{H}_{n_k}) \le \sum_{k=1}^\infty\Big(\frac{C_{\beta,V,\varepsilon_0}}{\epsilon^4}k^{-4/3+4\beta}+\frac{C_{\beta,V,\varepsilon_0}2^{4\beta}}{\epsilon^4}k^{-4/3}\Big) < \infty.$$
Now, by Lemma 5.5.8, $\Pr(A_k\text{ occurs infinitely often})=0$. Since the process $\{\tilde\theta[n]\}_{n=0}^\infty$ visits $(-\infty,\theta^*+\varepsilon_0/V)$ infinitely many times with probability 1,
$$\limsup_{n\to\infty}\frac{S_{n_k}}{n_k^{1/3}} = \limsup_{k\to\infty}\frac{S_{n_k}}{n_k^{1/3}} \le \epsilon,\quad\text{w.p.1}.$$
Since $\epsilon>0$ is arbitrary, letting $\epsilon\to0$ gives
$$\lim_{n\to\infty}\frac{S_{n_k}}{n_k^{1/3}} = 0,\quad\text{w.p.1}. \quad (5.16)$$
Finally, we show how this convergence result leads to the bound on $\tilde\theta[n]$. According to the updating rule of $\tilde\theta[n]$, for any frame $n$ such that $n_k<n\le n_{k+1}$,
$$\tilde\theta[n] = \Big(\frac{n_k}{n}\Big)^\delta\tilde\theta[n_k] + \frac{1}{n^\delta}\sum_{i=n_k}^{n-1}\Bigg(\Big(y[i]-\theta[i]T[i]+\frac1V\sum_{l=1}^L Q_l[i](z_l[i]-c_lT[i])\Big)\wedge\Big(\frac2\eta+\frac{4\sqrt L}{\eta rV}\Big)\log^2(i+1)\Bigg) \le \Big(\frac{n_k}{n}\Big)^\delta\Big(\theta^*+\frac{\varepsilon_0}{V}\Big) + \frac{1}{n^\delta}\sum_{i=n_k}^{n-1}\Big(\frac2\eta+\frac{4\sqrt L}{\eta rV}\Big)\log^2(i+1) \le \Big(\frac{n_k}{n}\Big)^\delta\Big(\theta^*+\frac{\varepsilon_0}{V}\Big) + \frac{S_{n_k}}{n^\delta}\Big(\frac2\eta+\frac{4\sqrt L}{\eta rV}\Big)\log^2 n,$$
where the first inequality follows from the fact that $\tilde\theta[n_k]<\theta^*+\varepsilon_0/V$.
Now, we take $\limsup_{n\to\infty}$ on both sides and analyze each term on the right-hand side:
$$1 \ge \limsup_{n\to\infty}\Big(\frac{n_k}{n}\Big)^\delta \ge \limsup_{k\to\infty}\Big(\frac{n_k}{n_k+S_{n_k}}\Big)^\delta = \limsup_{k\to\infty}\Big(\frac{1}{1+S_{n_k}/n_k}\Big)^\delta = 1,\quad\text{w.p.1},$$
$$\limsup_{n\to\infty}\frac{S_{n_k}}{n^\delta}\Big(\frac2\eta+\frac{4\sqrt L}{\eta rV}\Big)\log^2 n \le \limsup_{n\to\infty}\frac{S_{n_k}}{n_k^{1/3}}\cdot\limsup_{n\to\infty}\Big(\frac2\eta+\frac{4\sqrt L}{\eta rV}\Big)\frac{\log^2 n}{n^{\delta-1/3}} = 0,\quad\text{w.p.1},$$
where we apply the convergence result (5.16) in the second line. Thus,
$$\limsup_{n\to\infty}\tilde\theta[n] \le \theta^*+\frac{\varepsilon_0}{V},\quad\text{w.p.1}.$$
By Lemma 5.5.3, we have $\limsup_{n\to\infty}\hat\theta[n]\le\theta^*+\varepsilon_0/V$. Finally, by Lemma 5.5.1 and the fact that $\theta^*+\varepsilon_0/V\in(0,\theta_{\max})$, we have $\limsup_{n\to\infty}\theta[n]\le\theta^*+\varepsilon_0/V$. Since this holds for any $\varepsilon_0>0$ small enough, letting $\varepsilon_0\to0$ finishes the proof. □

5.5.5 Finishing the proof of near optimality

With the preceding analysis of $\theta[n]$ in hand, we are ready to prove our main theorem, with the help of the following strong law of large numbers for martingale difference sequences:

Lemma 5.5.9 (Corollary 4.2 of [Nee12c]). Let $\{\mathcal{F}_i\}_{i=0}^\infty$ be a filtration and let $\{X(i)\}_{i=0}^\infty$ be a real-valued random process such that $X(i)\in\mathcal{F}_{i+1}$, $\forall i$. Suppose there is a finite constant $C$ such that $\mathbb{E}(X(i)\mid\mathcal{F}_i)\le C$, $\forall i$, and
$$\sum_{i=1}^\infty\frac{\mathbb{E}\big(X(i)^2\big)}{i^2} < \infty.$$
Then,
$$\limsup_{n\to\infty}\frac1n\sum_{i=0}^{n-1}X(i) \le C,\quad\text{w.p.1}.$$

Proof of Theorem 5.5.1. Recall that for any $n$, the pseudo average without truncation is
$$\hat\theta[n] = \frac{1}{n^\delta}\sum_{i=0}^{n-1}\Big(y[i]-\theta[i]T[i]+\frac1V\sum_{l=1}^L Q_l[i](z_l[i]-c_lT[i])\Big).$$
Dividing both sides by $\sum_{i=0}^{n-1}T[i]/n^\delta$ yields
$$\frac{\hat\theta[n]}{\frac{1}{n^\delta}\sum_{i=0}^{n-1}T[i]} = \frac{\sum_{i=0}^{n-1}\big(y[i]-\theta[i]T[i]+\frac1V\sum_{l=1}^L Q_l[i](z_l[i]-c_lT[i])\big)}{\sum_{i=0}^{n-1}T[i]} = \frac{\sum_{i=0}^{n-1}\big(y[i]+\frac1V\sum_{l=1}^L Q_l[i](z_l[i]-c_lT[i])\big)}{\sum_{i=0}^{n-1}T[i]} - \frac{\sum_{i=0}^{n-1}\theta[i]T[i]}{\sum_{i=0}^{n-1}T[i]}.$$
Moving the last term to the left-hand side and taking $\limsup_{n\to\infty}$ on both sides gives
$$\limsup_{n\to\infty}\Bigg(\frac{\hat\theta[n]}{\frac{1}{n^\delta}\sum_{i=0}^{n-1}T[i]} + \frac{\sum_{i=0}^{n-1}\theta[i]T[i]}{\sum_{i=0}^{n-1}T[i]}\Bigg) \ge \limsup_{n\to\infty}\Bigg(\frac{\sum_{i=0}^{n-1}y[i]}{\sum_{i=0}^{n-1}T[i]} + \frac{\sum_{i=0}^{n-1}\frac1V\sum_{l=1}^L Q_l[i](z_l[i]-c_lT[i])}{\sum_{i=0}^{n-1}T[i]}\Bigg) \ge \limsup_{n\to\infty}\Bigg(\frac{\sum_{i=0}^{n-1}y[i]}{\sum_{i=0}^{n-1}T[i]} + \frac{\frac12\|\mathbf{Q}[n]\|^2-\frac12\sum_{i=0}^{n-1}\sum_{l=1}^L(z_l[i]-c_lT[i])^2}{V\sum_{i=0}^{n-1}T[i]}\Bigg) \ge \limsup_{n\to\infty}\frac{\sum_{i=0}^{n-1}y[i]}{\sum_{i=0}^{n-1}T[i]} - \frac{1}{2V}\limsup_{n\to\infty}\frac1n\sum_{i=0}^{n-1}K[i]^2,$$
where the second inequality follows from inequality (5.7) and telescoping sums, and the last inequality follows from $T[n]\ge1$, $\|\mathbf{Q}[n]\|^2\ge0$ and $K[i]=\sqrt{\sum_{l=1}^L(z_l[i]-c_lT[i])^2}$.

Now we use Lemma 5.5.9 with $X(i)=K[i]^2$ to bound the last term. Since $K[i]$ is of exponential type by Assumption 5.2.1, we know that $\mathbb{E}\big(K[i]^2\mid\mathcal{H}_i\big)\le 2B^2/\eta^2$. Furthermore, $\mathbb{E}\big(K[i]^4\big)\le 24B^4/\eta^4$. Thus,
$$\sum_{i=1}^\infty\frac{\mathbb{E}\big(K[i]^4\big)}{i^2} < \infty.$$
Thus, all assumptions of Lemma 5.5.9 are satisfied and we conclude that
$$\limsup_{n\to\infty}\frac1n\sum_{i=0}^{n-1}K[i]^2 \le \frac{2B^2}{\eta^2},\quad\text{w.p.1}.$$
This implies
$$\limsup_{n\to\infty}\Bigg(\frac{\hat\theta[n]}{\frac{1}{n^\delta}\sum_{i=0}^{n-1}T[i]} + \frac{\sum_{i=0}^{n-1}\theta[i]T[i]}{\sum_{i=0}^{n-1}T[i]}\Bigg) \ge \limsup_{n\to\infty}\frac{\sum_{i=0}^{n-1}y[i]}{\sum_{i=0}^{n-1}T[i]} - \frac{B^2}{\eta^2V}.$$
By Theorem 5.5.2, $\hat\theta[n]$ is asymptotically upper bounded. Since $\delta<1$ and $T[n]\ge1$, the quantity $\frac{1}{n^\delta}\sum_{i=0}^{n-1}T[i]$ grows at least as fast as $n^{1-\delta}$, which goes to infinity as $n\to\infty$. Thus,
$$\limsup_{n\to\infty}\frac{\hat\theta[n]}{\frac{1}{n^\delta}\sum_{i=0}^{n-1}T[i]} \le 0,$$
and thus
$$\limsup_{n\to\infty}\frac{\sum_{i=0}^{n-1}\theta[i]T[i]}{\sum_{i=0}^{n-1}T[i]} \ge \limsup_{n\to\infty}\frac{\sum_{i=0}^{n-1}y[i]}{\sum_{i=0}^{n-1}T[i]} - \frac{B^2}{\eta^2V}.$$
By Theorem 5.5.2 again, $\theta[n]$ is asymptotically upper bounded by $\theta^*$; based on this result, it is easy to show that
$$\limsup_{n\to\infty}\frac{\sum_{i=0}^{n-1}\theta[i]T[i]}{\sum_{i=0}^{n-1}T[i]} \le \theta^*.$$
Thus, we finally get
$$\limsup_{n\to\infty}\frac{\sum_{i=0}^{n-1}y[i]}{\sum_{i=0}^{n-1}T[i]} \le \theta^*+\frac{B^2}{\eta^2V},$$
finishing the proof. □
5.6 Simulation experiments

In this section, we demonstrate the performance of our proposed algorithm through an application to single-user file downloading. We show that this problem can be formulated as a two-state constrained online MDP and solved using our proposed algorithm.

Consider a slotted time system with $t\in\{0,1,2,\cdots\}$, where one user repeatedly downloads files. We use $F(t)\in\{0,1\}$ to denote the system file state at time slot $t$: state "1" indicates there is an active file in the system for downloading, and state "0" means there is no file and the system is idle. Suppose the user can only download one file at a time and cannot observe the file length. Each file contains an integer number of packets, independent and geometrically distributed with expected length 1.

During each time slot in which there is an active file for downloading (i.e., $F(t)=1$), the user first observes the channel state $\omega(t)$, an i.i.d. random variable taking values in $\Omega=\{0.2,0.5,0.8\}$ with equal probabilities, and the delay penalty $s(t)$, also an i.i.d. random variable taking values in $\{1,3,5\}$ with equal probabilities. Then, the user chooses a service action $\alpha(t)\in\mathcal{A}=\{0,0.3,0.6,0.9\}$. The pair $(\omega(t),\alpha(t))$ affects the following quantities:
• The success probability of finishing the file download at time $t$: $\phi(\alpha(t),\omega(t)) \triangleq \alpha(t)\cdot\omega(t)$.
• The resource consumption $p(\alpha(t))$ at time $t$. We assume $p(0)=0$, $p(0.3)=1$, $p(0.6)=2$ and $p(0.9)=4$.
After a file is downloaded, the system goes idle (i.e., $F(t)=0$) and stays there for a random amount of time that is independent and geometrically distributed with mean 2. The goal is to minimize the time average delay penalty subject to the constraint that the time average resource consumption cannot exceed 1.

In [WN15], a similar optimization problem is considered, but without the random events $\omega(t)$ and $s(t)$; it can be formulated as a two-state constrained MDP. Here, using the same logic, we can formulate our optimization problem as a two-state constrained online MDP. Given $F(t)=1$, the file finishes its download at the end of the slot with probability $\phi(\alpha(t),\omega(t))$. Thus, the transition probabilities out of state 1 are:
$$\Pr[F(t+1)=0\mid F(t)=1] = \phi(\alpha(t),\omega(t)),\qquad \Pr[F(t+1)=1\mid F(t)=1] = 1-\phi(\alpha(t),\omega(t)).$$
On the other hand, given $F(t)=0$, the system is idle and transitions to the active state in the next slot with probability $\lambda$:
$$\Pr[F(t+1)=1\mid F(t)=0] = \lambda,\qquad \Pr[F(t+1)=0\mid F(t)=0] = 1-\lambda.$$
Now, we characterize this online MDP through renewal frames and show that it can be solved using the proposed algorithm of Section 5.2. First, notice that state "1" is recurrent under any action $\alpha(t)$. We denote $t_n$ as the $n$-th time slot at which the system returns to state "1". Define the renewal frame as the time period between $t_n$ and $t_{n+1}$, with frame size $T[n]=t_{n+1}-t_n$. Furthermore, since the system has no control options in state "0", the controller makes exactly one decision during each frame, at the beginning of the frame. Thus, we can write the optimization problem as follows:
$$\min\ \limsup_{N\to\infty}\frac{\sum_{n=0}^{N-1}\alpha(t_n)s(t_n)}{\sum_{n=0}^{N-1}T[n]}\quad\text{s.t.}\quad \limsup_{N\to\infty}\frac{\sum_{n=0}^{N-1}p(\alpha(t_n))}{\sum_{n=0}^{N-1}T[n]}\le1,\qquad \alpha(t_n)\in\mathcal{A}.$$
Subsequently, in order to apply our algorithm, we define the virtual queue $Q[n]$ by $Q[0]=0$ with the updating rule
$$Q[n+1] = \max\{Q[n]+p(\alpha(t_n))-T[n],\ 0\}.$$
Notice that for any particular action $\alpha(t_n)\in\mathcal{A}$ and random event $\omega(t_n)\in\Omega$, we can always compute $\mathbb{E}(T[n])$ as
$$\mathbb{E}(T[n]) = 1-\phi(\alpha(t_n),\omega(t_n)) + \phi(\alpha(t_n),\omega(t_n))\Big(1+\frac1\lambda\Big) = 1+2\alpha(t_n)\omega(t_n),$$
where the second equality follows by substituting $\lambda=0.5$ and $\phi(\alpha(t_n),\omega(t_n))=\alpha(t_n)\omega(t_n)$. Thus, for each $\alpha(t_n)\in\mathcal{A}$, the expression (5.6) can be computed.
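Putting the pieces together, the per-frame decision in this example reduces to scoring the four actions and picking the minimizer. The sketch below evaluates expression (5.6) with $\mathbb{E}[T\mid\alpha,\omega]=1+2\alpha\omega$, penalty $y=\alpha s$, constraint cost $z=p(\alpha)$, and $c=1$; the function name and example arguments are illustrative.

```python
def choose_action(Q, theta, omega, s, V):
    """Sketch: per-frame minimization of (5.6) for the file-downloading
    example (lambda = 0.5, so E[T | alpha, omega] = 1 + 2*alpha*omega)."""
    p = {0.0: 0, 0.3: 1, 0.6: 2, 0.9: 4}   # resource costs p(alpha)
    def score(alpha):
        ET = 1 + 2 * alpha * omega          # E[T[n]] for this action
        return V * (alpha * s - theta * ET) + Q * (p[alpha] - ET)
    return min(p, key=score)                # iterate over actions (dict keys)

# Example (hypothetical values): choose_action(Q=5.0, theta=1.2,
#                                              omega=0.5, s=3, V=300)
```

Larger $Q$ tilts the score toward low-resource actions, while larger $V$ weights the delay penalty more heavily, which is exactly the trade-off observed in the plots below.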
In each of the simulations, each data point is the time average over 2 million slots. We compare the performance of the proposed algorithm with the optimal randomized policy. The optimal policy is computed by formulating the MDP as a linear program using knowledge of the distributions of $\omega(t)$ and $s(t)$; see [Fox66b] for details of this linear program formulation.

In Fig. 5.2, we plot the performance of our algorithm versus the parameter $V$ for different values of $\delta$. We see from the plots that as $V$ gets larger, the time average approaches the optimal value, and the algorithm achieves near-optimal performance for $\delta$ roughly between 0.4 and 1. A more direct relation between performance and the $\delta$ value is shown in Fig. 5.3, where we fix $V=300$ and plot the performance of the algorithm versus $\delta$. It is clear from the plots that the algorithm fails whenever $\delta$ is too small ($\delta<0.3$) or too big ($\delta>1$). This matches the statement of Theorem 5.5.1 that the algorithm works for $\delta\in(1/3,1)$.

[Figure 5.2: Time average penalty versus trade-off parameter V.]

In Fig. 5.4, we plot the time average resource consumption versus $V$. We see from the plots that the algorithm is always feasible for different $V$'s and $\delta$'s, which matches the statement of Theorem 5.4.1. Also, as $V$ gets larger, the constraint gap tends to be smaller. In Fig. 5.5, we plot the average virtual queue size versus $V$. It shows that the average queue size grows as $V$ grows. To see the implications, recall from the proof of Theorem 5.4.1 that inequality (5.10) implies the virtual queue size $Q_l[N]$ affects the rate at which the algorithm converges down to the feasible region. Thus, if the average virtual queue size is large, it takes longer for the algorithm to converge. This demonstrates that $V$ is indeed a trade-off parameter, trading the sub-optimality gap for the convergence rate.

[Figure 5.3: Time average penalty versus δ parameter with fixed V = 300.]
[Figure 5.4: Time average resource consumption versus trade-off parameter V.]
[Figure 5.5: Time average virtual queue size versus trade-off parameter V.]

5.7 Additional proofs

Proof of Lemma 5.4.2. We begin by bounding the difference $\big|\|\mathbf{Q}[n+1]\|-\|\mathbf{Q}[n]\|\big|$ for any $n$:
$$\big|\|\mathbf{Q}[n+1]\|-\|\mathbf{Q}[n]\|\big| \le \|\mathbf{Q}[n+1]-\mathbf{Q}[n]\| = \sqrt{\sum_{l=1}^L\big(\max\{Q_l[n]+z_l[n]-c_lT[n],\,0\}-Q_l[n]\big)^2} \le \sqrt{\sum_{l=1}^L(z_l[n]-c_lT[n])^2} = K[n],$$
where the first inequality follows from the triangle inequality and the last inequality follows from the fact that for any $a,b\in\mathbb{R}$, $|\max\{a+b,0\}-a|\le|b|$. Thus, it follows that
$$\big|\mathbb{E}\big(\|\mathbf{Q}[n+1]\|-\|\mathbf{Q}[n]\|\ \big|\ \mathcal{H}_n\big)\big| \le \mathbb{E}(K[n]\mid\mathcal{H}_n) \le \frac B\eta,$$
which follows from Proposition 1. Also, we have
$$\mathbb{E}\big(e^{r(\|\mathbf{Q}[n+1]\|-\|\mathbf{Q}[n]\|)}\ \big|\ \mathcal{H}_n\big) \le \mathbb{E}\big(e^{rK[n]}\mid\mathcal{H}_n\big) \le \mathbb{E}\big(e^{\eta K[n]}\mid\mathcal{H}_n\big) \le B \triangleq \Gamma,$$
where the second-to-last inequality follows by substituting the definition $r=\min\{\eta,\frac{\xi\eta^2}{4B}\}\le\eta$ and the last inequality follows from Assumption 5.2.1.

Next, suppose $\|\mathbf{Q}[n]\|>\sigma\triangleq C_0V$. Then, since the proposed algorithm minimizes the term on the right-hand side of (5.8) over all possible decisions at frame $n$, it must achieve a smaller value on that term compared to the $\xi$-slackness policy $\alpha^{(\xi)}[n]$ specified in Assumption 5.2.5.
Formally, this is
$$\mathbb{E}\Big(\sum_{l=1}^L Q_l[n](z_l[n]-c_lT[n])+V(y[n]-\theta[n]T[n])\ \Big|\ \mathcal{H}_n,\omega[n]\Big) \le \mathbb{E}\Big(\sum_{l=1}^L Q_l[n]\big(z_l^{(\xi)}[n]-c_lT^{(\xi)}[n]\big)+V\big(y^{(\xi)}[n]-\theta[n]T^{(\xi)}[n]\big)\ \Big|\ \mathcal{H}_n,\omega[n]\Big),$$
where we used the fact that $\theta[n]$ and $\mathbf{Q}[n]$ are in $\mathcal{H}_n$. Substituting this bound into the right-hand side of (5.8) and taking expectations of both sides with respect to $\omega[n]$ gives
$$\mathbb{E}\big(\Delta[n]+V(y[n]-\theta[n]T[n])\mid\mathcal{H}_n\big) \le \mathbb{E}\Big(\sum_{l=1}^L Q_l[n]\big(z_l^{(\xi)}[n]-c_lT^{(\xi)}[n]\big)+V\big(y^{(\xi)}[n]-\theta[n]T^{(\xi)}[n]\big)\ \Big|\ \mathcal{H}_n\Big)+B^2/\eta^2.$$
Since $\Delta[n]=\frac12\big(\|\mathbf{Q}[n+1]\|^2-\|\mathbf{Q}[n]\|^2\big)$, this implies
$$\mathbb{E}\big(\|\mathbf{Q}[n+1]\|^2-\|\mathbf{Q}[n]\|^2\ \big|\ \mathcal{H}_n\big) \le 2B^2/\eta^2 + 2\mathbb{E}\Big(\sum_{l=1}^L Q_l[n]\big(z_l^{(\xi)}[n]-c_lT^{(\xi)}[n]\big)+V\big(y^{(\xi)}[n]-\theta[n]T^{(\xi)}[n]\big)-V(y[n]-\theta[n]T[n])\ \Big|\ \mathcal{H}_n\Big) \le 2B^2/\eta^2 + 2\sum_{l=1}^L Q_l[n]\,\mathbb{E}\big(z_l^{(\xi)}[n]-c_lT^{(\xi)}[n]\ \big|\ \mathcal{H}_n\big) + 2V\frac{B+\theta_{\max}B}{\eta} \le 2B^2/\eta^2 + 2V\frac{B+\theta_{\max}B}{\eta} - 2\xi\sum_{l=1}^L Q_l[n] \le 2B^2/\eta^2 + 2V\frac{B+\theta_{\max}B}{\eta} - 2\xi\|\mathbf{Q}[n]\|,$$
where the second inequality follows from applying Proposition 1 to bound $\mathbb{E}(T[n]\mid\mathcal{H}_n)$ as well as the fact that $0<\theta[n]<\theta_{\max}$, and the third inequality follows from the $\xi$-slackness property together with the assumption that $z_l^{(\xi)}[n]$ is i.i.d. over slots and hence independent of $Q_l[n]$.

This further implies
$$\mathbb{E}\big(\|\mathbf{Q}[n+1]\|^2\ \big|\ \mathcal{H}_n\big) \le \|\mathbf{Q}[n]\|^2 - 2\xi\|\mathbf{Q}[n]\| + 2B^2/\eta^2 + 2V\frac{B+\theta_{\max}B}{\eta} = \|\mathbf{Q}[n]\|^2 - 2\xi\|\mathbf{Q}[n]\| + \frac{2B^2/\eta^2 + 2V\frac{B+\theta_{\max}B}{\eta} - \frac{\xi^2}{4}}{\xi}\cdot\xi + \frac{\xi^2}{4} = \|\mathbf{Q}[n]\|^2 - 2\xi\|\mathbf{Q}[n]\| + C_0V\xi + \frac{\xi^2}{4} \le \|\mathbf{Q}[n]\|^2 - \xi\|\mathbf{Q}[n]\| + \frac{\xi^2}{4} = \Big(\|\mathbf{Q}[n]\|-\frac\xi2\Big)^2,$$
where we use the fact that $C_0 = \frac{2B^2}{V\xi\eta^2} + \frac2\xi\cdot\frac{B+\theta_{\max}B}{\eta} - \frac{\xi}{4V}$ and also the assumption that $\|\mathbf{Q}[n]\|\ge C_0V$. Now taking square roots of both sides gives
$$\sqrt{\mathbb{E}\big(\|\mathbf{Q}[n+1]\|^2\mid\mathcal{H}_n\big)} \le \|\mathbf{Q}[n]\|-\frac\xi2.$$
By concavity of the $\sqrt x$ function, we have $\mathbb{E}(\|\mathbf{Q}[n+1]\|\mid\mathcal{H}_n)\le\sqrt{\mathbb{E}(\|\mathbf{Q}[n+1]\|^2\mid\mathcal{H}_n)}$; thus,
$$\mathbb{E}\big(\|\mathbf{Q}[n+1]\|\mid\mathcal{H}_n\big) \le \|\mathbf{Q}[n]\|-\frac\xi2. \quad (5.17)$$
Finally, we claim that under the condition $\|\mathbf{Q}[n]\|>\sigma\triangleq C_0V$, this gives
$$\mathbb{E}\big(e^{r(\|\mathbf{Q}[n+1]\|-\|\mathbf{Q}[n]\|)}\ \big|\ \mathcal{H}_n\big) \le \rho \triangleq 1-\frac{r\xi}{2}+\frac{2B}{\eta^2}r^2 < 1. \quad (5.18)$$
To see this, we expand $\mathbb{E}\big(e^{r(\|\mathbf{Q}[n+1]\|-\|\mathbf{Q}[n]\|)}\mid\mathcal{H}_n\big)$ using a Taylor series:
$$\mathbb{E}\big(e^{r(\|\mathbf{Q}[n+1]\|-\|\mathbf{Q}[n]\|)}\ \big|\ \mathcal{H}_n\big) = 1 + r\mathbb{E}\big(\|\mathbf{Q}[n+1]\|-\|\mathbf{Q}[n]\|\mid\mathcal{H}_n\big) + r^2\sum_{k=2}^\infty\frac{r^{k-2}\mathbb{E}\big((\|\mathbf{Q}[n+1]\|-\|\mathbf{Q}[n]\|)^k\mid\mathcal{H}_n\big)}{k!} \le 1-\frac{r\xi}{2} + r^2\sum_{k=2}^\infty\frac{r^{k-2}\mathbb{E}\big((\|\mathbf{Q}[n+1]\|-\|\mathbf{Q}[n]\|)^k\mid\mathcal{H}_n\big)}{k!} \le 1-\frac{r\xi}{2} + r^2\sum_{k=2}^\infty\frac{\eta^{k-2}\mathbb{E}\big((\|\mathbf{Q}[n+1]\|-\|\mathbf{Q}[n]\|)^k\mid\mathcal{H}_n\big)}{k!} = 1-\frac{r\xi}{2} + r^2\,\frac{\mathbb{E}\big(e^{\eta(\|\mathbf{Q}[n+1]\|-\|\mathbf{Q}[n]\|)}\mid\mathcal{H}_n\big) - \eta\mathbb{E}\big(\|\mathbf{Q}[n+1]\|-\|\mathbf{Q}[n]\|\mid\mathcal{H}_n\big) - 1}{\eta^2} \le 1-\frac{r\xi}{2} + \frac{B+\eta\cdot\frac B\eta}{\eta^2}r^2 \le 1-\frac{r\xi}{2}+\frac{2B}{\eta^2}r^2 = \rho,$$
where the first inequality follows from (5.17), the second inequality follows from $r\le\eta$, and the second-to-last inequality follows from Proposition 1. Finally, notice that the above quadratic function of $r$ attains its minimum at the point $r=\frac{\xi\eta^2}{4B}$ with value $1-\frac{\xi^2\eta^2}{8B}<1$, and this function is strictly decreasing for $r\in\big(0,\frac{\xi\eta^2}{4B}\big)$. Thus, our choice $r=\min\{\eta,\frac{\xi\eta^2}{4B}\}\le\frac{\xi\eta^2}{4B}$ ensures that $\rho$ is strictly less than 1, and the proof is finished. □

Proof of Lemma 5.5.1. If $\theta[n]=y$ for some $y\in[0,\theta_{\max}]$, then $\hat\theta[n]$ falls into one of the following three cases:
• $\hat\theta[n]=y$;
• $y=\theta_{\max}$ and $\hat\theta[n]>\theta_{\max}$;
• $y=0$ and $\hat\theta[n]<0$.
We prove the four properties of the lemma based on these three cases.
1) If $\theta[n]=y\ge x$ for some $y$, then the first two cases immediately imply $\hat\theta[n]\ge x$. If $y=0$, then we have $x\le0$, which violates the assumption that $x\in(0,\theta_{\max})$; thus, the third case is ruled out. On the other hand, if $\hat\theta[n]\ge x$, then obviously $\theta[n]\ge x$.
2) If $\theta[n]=y\le x$ for some $y$, then the last two cases immediately imply $\hat\theta[n]\le x$. If $y=\theta_{\max}$, then we have $x\ge\theta_{\max}$, which violates the assumption that $x\in(0,\theta_{\max})$; thus, the first case is ruled out.
On the other hand, if $\hat\theta[n]\le x$, then obviously $\theta[n]\le x$.
3) If $\limsup_{n\to\infty}\theta[n]\le x$, then for any $\epsilon>0$ such that $x+\epsilon<\theta_{\max}$, there exists an $N$ large enough so that $\theta[n]\le x+\epsilon$, $\forall n\ge N$. Then, by property 2), $\hat\theta[n]\le x+\epsilon$, $\forall n\ge N$, which implies $\limsup_{n\to\infty}\hat\theta[n]\le x+\epsilon$. Letting $\epsilon\to0$ gives $\limsup_{n\to\infty}\hat\theta[n]\le x$. On the other hand, if $\limsup_{n\to\infty}\hat\theta[n]\le x$, then obviously $\limsup_{n\to\infty}\theta[n]\le x$.
4) If $\limsup_{n\to\infty}\theta[n]\ge x$, then for any $\epsilon>0$ such that $x-\epsilon>0$, there exist infinitely many $n$ with $\theta[n]\ge x-\epsilon$. Then, by property 1), $\hat\theta[n]\ge x-\epsilon$ for those $n$, which implies $\limsup_{n\to\infty}\hat\theta[n]\ge x-\epsilon$. Letting $\epsilon\to0$ gives $\limsup_{n\to\infty}\hat\theta[n]\ge x$. On the other hand, if $\limsup_{n\to\infty}\hat\theta[n]\ge x$, the same argument with property 1) applied in the reverse direction gives $\limsup_{n\to\infty}\theta[n]\ge x$. □

Proof of Lemma 5.5.6. The proof is divided into two parts: the first contains technical preliminaries showing that $G[n]$ is measurable with respect to $\mathcal{H}_n$, $\forall n\ge n_k+1$; the second contains the computations proving the supermartingale claim.

• Technical preliminaries: First of all, for any fixed $k$, since $n_k$ is a random variable on the integers, we need to justify that $\{\mathcal{H}_n\}_{n\ge n_k+1}$ is indeed a filtration. First, it is obvious that $n_k$ is a valid stopping time, i.e., $\{n_k\le t\}\in\mathcal{H}_t$, $\forall t\in\mathbb{N}$. Then, any $n=n_k+s$ with some constant $s\in\mathbb{N}^+$ is also a valid stopping time, because
$$\{n\le t\} = \{n_k\le t-s\}\in\mathcal{H}_{(t-s)\vee0}\subseteq\mathcal{H}_t,\quad\forall t\in\mathbb{N},$$
where $a\vee b\triangleq\max\{a,b\}$. Thus, by the definition of the stopping time $\sigma$-algebra from [Dur13], we know that for any $n\ge n_k+1$, $\mathcal{H}_n$ can be written as the collection of all sets $A$ that satisfy $A\cap\{n\le t\}\in\mathcal{H}_t$, $\forall t\in\mathbb{N}$. (An intuitive interpretation is that when $n\le t$, the set $A$ is contained in the information known up to time $t$.) Now, pick constants $1\le s_1\le s_2$; if a set $A\in\mathcal{H}_{n_k+s_1}$, then
$$A\cap\{n_k+s_2\le t\} = A\cap\{n_k+s_1\le t-(s_2-s_1)\}\in\mathcal{H}_{(t-(s_2-s_1))\vee0}\subseteq\mathcal{H}_t.$$
Thus, $\mathcal{H}_{n_k+s_1}\subseteq\mathcal{H}_{n_k+s_2}$ and $\{\mathcal{H}_n\}_{n\ge n_k+1}$ is indeed a filtration.

Since $\tilde\theta[n_k+1]$ is determined by the realization up to frame $n_k$, it follows that for any $t\in\mathbb{N}^+$,
$$\{\tilde\theta[n_k+1]\ge\theta^*+\varepsilon_0/V\}\cap\{n_k+1\le t\} = \cup_{s=1}^t\{\tilde\theta[s]\ge\theta^*+\varepsilon_0/V\}\in\mathcal{H}_t,$$
which implies that $\{\tilde\theta[n_k+1]\ge\theta^*+\varepsilon_0/V\}\in\mathcal{H}_{n_k+1}$. Since $\{\mathcal{H}_n\}_{n\ge n_k+1}$ is a filtration, it follows that $\{\tilde\theta[n_k+1]\ge\theta^*+\varepsilon_0/V\}\in\mathcal{H}_n$ for any $n\ge n_k+1$. By the same methodology, we can show that $\{\tilde\theta[n]<\theta^*+\varepsilon_0/V\}\in\mathcal{H}_n$, $\forall n\ge n_k+1$, which in turn implies $\{S_{n_k}+n_k\le n\}\in\mathcal{H}_n$ and $\{S_{n_k}\ge n-n_k+1\}\in\mathcal{H}_n$. Overall, the function $G[n]$ is measurable with respect to $\mathcal{H}_n$, $\forall n\ge n_k+1$.

• Proof of the supermartingale claim: It is obvious that $|G[n]|<\infty$; thus, in order to prove $G[n]$ is a supermartingale, it is enough to show that
$$\mathbb{E}\big(G[n+1]-G[n]\mid\mathcal{H}_n\big) \le 0,\quad\forall n\ge n_k+1. \quad (5.19)$$
First, on the set $\{S_{n_k}\le n-n_k\}$, we have
$$\mathbb{E}\big((G[n+1]-G[n])\mathbf{1}_{\{S_{n_k}+n_k\le n\}}\ \big|\ \mathcal{H}_n\big) = \mathbb{E}\big((G[n]-G[n])\mathbf{1}_{\{S_{n_k}+n_k\le n\}}\ \big|\ \mathcal{H}_n\big) = 0.$$
It is then sufficient to show that inequality (5.19) holds on the set $\{S_{n_k}\ge n-n_k+1\}$. Since
$$\mathbb{E}\big(G[n+1]\mathbf{1}_{\{S_{n_k}\ge n-n_k+1\}}\ \big|\ \mathcal{H}_n\big) = \mathbb{E}\Bigg(\frac{e^{\lambda_{n+1}F[(n+1)\wedge(n_k+S_{n_k})]}}{\prod_{i=n_k+1}^{(n+1)\wedge(n_k+S_{n_k})}\rho_i}\ \Bigg|\ \mathcal{H}_n\Bigg)\mathbf{1}_{\{\tilde\theta[n_k+1]\ge\theta^*+\varepsilon_0/V\}}\mathbf{1}_{\{S_{n_k}\ge n-n_k+1\}} = \mathbb{E}\Bigg(\frac{e^{\lambda_{n+1}F[n+1]}}{\prod_{i=n_k+1}^{n+1}\rho_i}\ \Bigg|\ \mathcal{H}_n\Bigg)\mathbf{1}_{\{\tilde\theta[n_k+1]\ge\theta^*+\varepsilon_0/V\}}\mathbf{1}_{\{S_{n_k}\ge n-n_k+1\}} = \frac{e^{\lambda_{n+1}F[n]}}{\prod_{i=n_k+1}^n\rho_i}\,\mathbb{E}\Big(\frac{e^{\lambda_{n+1}(F[n+1]-F[n])}}{\rho_{n+1}}\ \Big|\ \mathcal{H}_n\Big)\mathbf{1}_{\{\tilde\theta[n_k+1]\ge\theta^*+\varepsilon_0/V\}}\mathbf{1}_{\{S_{n_k}\ge n-n_k+1\}} \le \frac{e^{\lambda_nF[n]}}{\prod_{i=n_k+1}^n\rho_i}\,\mathbb{E}\Big(\frac{e^{\lambda_{n+1}(F[n+1]-F[n])}}{\rho_{n+1}}\ \Big|\ \mathcal{H}_n\Big)\mathbf{1}_{\{\tilde\theta[n_k+1]\ge\theta^*+\varepsilon_0/V\}}\mathbf{1}_{\{S_{n_k}\ge n-n_k+1\}} = G[n]\,\mathbb{E}\Big(\frac{e^{\lambda_{n+1}(F[n+1]-F[n])}}{\rho_{n+1}}\ \Big|\ \mathcal{H}_n\Big)\mathbf{1}_{\{\tilde\theta[n_k+1]\ge\theta^*+\varepsilon_0/V\}}\mathbf{1}_{\{S_{n_k}\ge n-n_k+1\}},$$
where $\mathbf{1}_{\{\tilde\theta[n_k+1]\ge\theta^*+\varepsilon_0/V\}}$ and $\mathbf{1}_{\{S_{n_k}\ge n-n_k+1\}}$ can be moved out of the expectation because $\{\tilde\theta[n_k+1]\ge\theta^*+\varepsilon_0/V\}\in\mathcal{H}_n$ and $\{S_{n_k}\ge n-n_k+1\}\in\mathcal{H}_n$, and the only inequality follows from the following argument: on the set $\{S_{n_k}\ge n-n_k+1\}$ we have $\tilde\theta[n]\ge\theta^*+\varepsilon_0/V$; thus, by Lemma 5.5.5, $F[n]\ge0$, and using the fact that $\lambda_n>\lambda_{n+1}$, we have $\lambda_{n+1}F[n]\le\lambda_nF[n]$.

Thus, it is sufficient to show that on the set $\{S_{n_k}\ge n-n_k+1\}\cap\{\tilde\theta[n_k+1]\ge\theta^*+\varepsilon_0/V\}$, we have
$$\mathbb{E}\Big(\frac{e^{\lambda_{n+1}(F[n+1]-F[n])}}{\rho_{n+1}}\ \Big|\ \mathcal{H}_n\Big) \le 1.$$
By Taylor expansion, we have
$$\mathbb{E}\big(e^{\lambda_{n+1}(F[n+1]-F[n])}\ \big|\ \mathcal{H}_n\big) = 1 + \lambda_{n+1}\mathbb{E}\big(F[n+1]-F[n]\mid\mathcal{H}_n\big) + \lambda_{n+1}^2\sum_{k=2}^\infty\frac{\lambda_{n+1}^{k-2}}{k!}\mathbb{E}\big((F[n+1]-F[n])^k\mid\mathcal{H}_n\big) \le 1 - \frac{\lambda_{n+1}\varepsilon_0}{V} + \lambda_{n+1}^2\sum_{k=2}^\infty\frac{\lambda_{n+1}^{k-2}}{k!}\mathbb{E}\big((F[n+1]-F[n])^k\mid\mathcal{H}_n\big),$$
where the inequality comes from the following argument: on the set $\{S_{n_k}\ge n-n_k+1\}$ we have $\tilde\theta[n]\ge\theta^*+\varepsilon_0/V$; thus $\hat\theta[n]\ge\tilde\theta[n]\ge\theta^*+\varepsilon_0/V$, and Lemma 5.5.1 gives $\theta[n]\ge\theta^*+\varepsilon_0/V$; then, by Lemma 5.5.4, we have
$$\mathbb{E}\big(F[n+1]-F[n]\mid\mathcal{H}_n\big) \le -\frac{\varepsilon_0}{V}.$$
Now, by the assumption that $V\ge\frac{\varepsilon_0\eta}{4\log^2 2}-\frac{2\sqrt L}{r}$, we have $\lambda_{n+1}\le\frac{1}{(\frac2\eta+\frac{4\sqrt L}{\eta rV})\log^2(n+1)}$, which follows from simple algebraic manipulations. Using the fact that $|F[n+1]-F[n]|\le(\frac2\eta+\frac{4\sqrt L}{\eta rV})\log^2(n+1)$, we have
$$\mathbb{E}\big(e^{\lambda_{n+1}(F[n+1]-F[n])}\ \big|\ \mathcal{H}_n\big) \le 1-\frac{\lambda_{n+1}\varepsilon_0}{V} + \lambda_{n+1}^2\sum_{k=2}^\infty\frac1{k!}\Bigg(\frac{1}{(\frac2\eta+\frac{4\sqrt L}{\eta rV})\log^2(n+1)}\Bigg)^{k-2}\Bigg(\Big(\frac2\eta+\frac{4\sqrt L}{\eta rV}\Big)\log^2(n+1)\Bigg)^k = 1-\frac{\lambda_{n+1}\varepsilon_0}{V} + \lambda_{n+1}^2\sum_{k=2}^\infty\frac1{k!}\Bigg(\Big(\frac2\eta+\frac{4\sqrt L}{\eta rV}\Big)\log^2(n+1)\Bigg)^2 \le 1-\frac{\lambda_{n+1}\varepsilon_0}{V} + \lambda_{n+1}^2e\Big(\frac2\eta+\frac{4\sqrt L}{\eta rV}\Big)^2\log^4(n+1) = \rho_{n+1},$$
where the final inequality follows by completing the third term to the full Taylor series of $e$ (i.e., $\sum_{k=2}^\infty 1/k!\le e$). Overall, inequality (5.19) holds and $G[n]$ is a supermartingale. □

5.8 Computation of Asymptotics

In this appendix, we show that there exists a constant $C$ such that
$$\sum_{m=1}^\infty(m+1)^3\prod_{i=n_k+1}^{n_k+m}\rho_i \le C(n_k+2)^{4\beta}.$$
We first bound $\rho_i$. Let $C_1 = \frac{96V^2e\big(\frac2\eta+\frac{4\sqrt L}{\eta rV}\big)^2}{\varepsilon_0^2\beta^4}$. Then,
$$\rho_i = 1-\frac{\varepsilon_0^2}{4V^2e\big(\frac2\eta+\frac{4\sqrt L}{\eta rV}\big)^2\log^4(i+1)} = 1-\frac{1}{C_1}\cdot\frac{24}{\beta^4\log^4(i+1)} < 1-\frac{1}{C_1(i+1)^\beta},$$
where we used the fact that $\frac{\beta^4}{24}\log^4(i+1)<(i+1)^\beta$, $\forall\beta>0$, $i\ge0$. Next, to bound $\prod_{i=n_k+1}^{n_k+m}\rho_i$, we take the logarithm:
$$\log\Bigg(\prod_{i=n_k+1}^{n_k+m}\rho_i\Bigg) = \sum_{i=n_k+1}^{n_k+m}\log\rho_i \le \sum_{i=n_k+1}^{n_k+m}\log\Big(1-\frac{1}{C_1(i+1)^\beta}\Big) \le -\sum_{i=n_k+1}^{n_k+m}\frac{1}{C_1(i+1)^\beta} \le -\frac{1}{C_1}\int_{n_k+2}^{n_k+m+1}\frac{dx}{x^\beta},$$
where the second inequality follows from the first-order Taylor expansion $\log(1-x)\le-x$. Since $\beta<1$, we compute the integral, which gives
$$-\frac{1}{C_1}\int_{n_k+2}^{n_k+m+1}\frac{dx}{x^\beta} = -\frac{1}{C_1(1-\beta)}\big((n_k+m+1)^{1-\beta}-(n_k+2)^{1-\beta}\big).$$
Thus,
$$\sum_{m=1}^\infty(m+1)^3\prod_{i=n_k+1}^{n_k+m}\rho_i \le \sum_{m=1}^\infty(m+1)^3e^{-\frac{1}{C_1(1-\beta)}\big((n_k+m+1)^{1-\beta}-(n_k+2)^{1-\beta}\big)} \le \int_0^\infty(x+2)^3e^{-\frac{1}{C_1(1-\beta)}\big((x+n_k+2)^{1-\beta}-(n_k+2)^{1-\beta}\big)}dx + \big(3C_1(1-\beta)\big)^4,$$
where the last inequality follows from the fact that the integrand is monotonically decreasing for $x>3C_1(1-\beta)$, so the integral dominates the sum on the tail $x>3C_1(1-\beta)$; for the part $x\le3C_1(1-\beta)$, the maximum of the integrand is bounded by $(3C_1(1-\beta))^3$, so the total error of this approximation is bounded by $(3C_1(1-\beta))^4$.
Then, we estimate the integral. Noticing that
$$\frac{d}{dx}\,e^{-\frac{1}{C_1(1-\beta)}(x+n_k+2)^{1-\beta}} = -\frac{1}{C_1}e^{-\frac{1}{C_1(1-\beta)}(x+n_k+2)^{1-\beta}}(x+n_k+2)^{-\beta},$$
we integrate by parts, which gives
$$\int_0^\infty(x+2)^3e^{-\frac{1}{C_1(1-\beta)}\big((x+n_k+2)^{1-\beta}-(n_k+2)^{1-\beta}\big)}dx = e^{\frac{1}{C_1(1-\beta)}(n_k+2)^{1-\beta}}\int_0^\infty(x+2)^3(x+n_k+2)^\beta\cdot(x+n_k+2)^{-\beta}e^{-\frac{1}{C_1(1-\beta)}(x+n_k+2)^{1-\beta}}dx = 8C_1(n_k+2)^\beta + \int_0^\infty C_1\big(3(x+2)^2(x+n_k+2)^\beta+\beta(x+2)^3(x+n_k+2)^{\beta-1}\big)e^{-\frac{1}{C_1(1-\beta)}\big((x+n_k+2)^{1-\beta}-(n_k+2)^{1-\beta}\big)}dx.$$
Since $5\beta\le1$ and $n_k\ge1$, we have $x+n_k+2\ge x+2$, which implies $(x+2)^3(x+n_k+2)^{\beta-1}\le(x+2)^2(x+n_k+2)^\beta$; thus,
$$\int_0^\infty(x+2)^3e^{-\frac{1}{C_1(1-\beta)}\big((x+n_k+2)^{1-\beta}-(n_k+2)^{1-\beta}\big)}dx \le 8C_1(n_k+2)^\beta + \int_0^\infty 4C_1(x+2)^2(x+n_k+2)^\beta e^{-\frac{1}{C_1(1-\beta)}\big((x+n_k+2)^{1-\beta}-(n_k+2)^{1-\beta}\big)}dx.$$
Repeating the above procedure three more times, we have
$$\int_0^\infty(x+2)^3e^{-\frac{1}{C_1(1-\beta)}\big((x+n_k+2)^{1-\beta}-(n_k+2)^{1-\beta}\big)}dx \le 8C_1(n_k+2)^\beta + 16C_1^2(n_k+2)^{2\beta} + 24C_1^3(n_k+2)^{3\beta} + 24C_1^4(n_k+2)^{4\beta} + \int_0^\infty 24C_1^4(x+n_k+2)^{4\beta-1}e^{-\frac{1}{C_1(1-\beta)}\big((x+n_k+2)^{1-\beta}-(n_k+2)^{1-\beta}\big)}dx \le 8C_1(n_k+2)^\beta + 16C_1^2(n_k+2)^{2\beta} + 24C_1^3(n_k+2)^{3\beta} + 24C_1^4(n_k+2)^{4\beta} + 24C_1^5 \le C(n_k+2)^{4\beta},$$
for some $C$ of the order of $C_1^5$ (which is $\mathcal{O}\big(V^{10}\beta^{-20}\varepsilon_0^{-10}\big)$), where the second-to-last inequality follows from $4\beta-1\le-\beta$: we replace $(x+n_k+2)^{4\beta-1}$ with $(x+n_k+2)^{-\beta}$ and do a direct integration. Overall, we have proved the claim.

Chapter 6
Online Learning in Weakly Coupled Markov Decision Processes

In this chapter, we consider online learning over weakly coupled Markov decision processes. We develop a new distributed online algorithm where each MDP makes its own decision each slot after observing a multiplier computed from past information. While the scenario is significantly more challenging than the classical online learning context, the algorithm is shown to achieve a tight $\mathcal{O}(\sqrt T)$ regret and constraint violations simultaneously over a time horizon $T$.

6.1 Problem formulation and related works

This chapter considers online constrained Markov decision processes (OCMDP), where both the objective and constraint functions can vary each time slot after the decision is made. We assume a slotted time scenario with time slots $t\in\{0,1,2,\dots\}$. The OCMDP consists of $K$ parallel Markov decision processes with indices $k\in\{1,2,\dots,K\}$. The $k$-th MDP has state space $\mathcal{S}^{(k)}$, action space $\mathcal{A}^{(k)}$, and transition probability matrix $P_a^{(k)}$ which depends on the chosen action $a\in\mathcal{A}^{(k)}$. Specifically, $P_a^{(k)} = \big(P_a^{(k)}(s,s')\big)$, where
$$P_a^{(k)}(s,s') = \Pr\big(s_{t+1}^{(k)}=s'\ \big|\ s_t^{(k)}=s,\ a_t^{(k)}=a\big),$$
and $s_t^{(k)}$ and $a_t^{(k)}$ are the state and action for system $k$ on slot $t$. We assume that both the state space and the action space are finite for all $k\in\{1,2,\cdots,K\}$.

After each MDP $k\in\{1,\dots,K\}$ makes its decision at time $t$ (and assuming the current state is $s_t^{(k)}=s$ and the action is $a_t^{(k)}=a$), the following information is revealed:
1. The next state $s_{t+1}^{(k)}$.
2. A penalty function $f_t^{(k)}(s,a)$ that depends on the current state $s$ and the current action $a$.
3. A collection of $m$ constraint functions $g_{1,t}^{(k)}(s,a),\dots,g_{m,t}^{(k)}(s,a)$ that depend on $s$ and $a$.
The functions $f_t^{(k)}$ and $g_{i,t}^{(k)}$ are all bounded mappings from $\mathcal{S}^{(k)}\times\mathcal{A}^{(k)}$ to $\mathbb{R}$ and represent different types of costs incurred by system $k$ on slot $t$ (depending on the current state and action).
For example, in a multi-server data center, the different systems $k\in\{1,\dots,K\}$ can represent different servers, the cost function for a particular server $k$ might represent energy or monetary expenditure for that server, and the constraint costs for server $k$ can represent negative rewards such as service rates or qualities. Coupling between the server systems comes from using all of them to collectively support a common stream of arriving jobs.

A key aspect of this general problem is that the functions $f_t^{(k)}$ and $g_{i,t}^{(k)}$ are unknown until after the slot-$t$ decision is made. Thus, the precise costs incurred by each system are only known at the end of the slot. For a fixed time horizon of $T$ slots, the overall penalty and constraint accumulations resulting from a policy $\mathcal{P}$ are
$$F_T(d_0,\mathcal{P}) := \mathbb{E}\Bigg(\sum_{t=1}^T\sum_{k=1}^K f_t^{(k)}\big(a_t^{(k)},s_t^{(k)}\big)\ \Bigg|\ d_0,\mathcal{P}\Bigg), \quad (6.1)$$
and
$$G_{i,T}(d_0,\mathcal{P}) := \mathbb{E}\Bigg(\sum_{t=1}^T\sum_{k=1}^K g_{i,t}^{(k)}\big(a_t^{(k)},s_t^{(k)}\big)\ \Bigg|\ d_0,\mathcal{P}\Bigg),$$
where $d_0$ represents a given distribution on the initial joint state vector $(s_0^{(1)},\cdots,s_0^{(K)})$. Note that $(a_t^{(k)},s_t^{(k)})$ denotes the state-action pair of the $k$th MDP, which is a pair of random variables determined by $d_0$ and $\mathcal{P}$. Define a constraint set
$$\mathcal{G} := \big\{(\mathcal{P},d_0):\ G_{i,T}(d_0,\mathcal{P})\le0,\ i=1,2,\cdots,m\big\}. \quad (6.2)$$
Define the regret of a policy $\mathcal{P}$ with respect to a particular joint randomized stationary policy $\Pi$, along with an arbitrary starting state distribution $d_0$, as
$$F_T(d_0,\mathcal{P}) - F_T(d_0,\Pi).$$
The goal of OCMDP is to choose a policy $\mathcal{P}$ so that both the regret and the constraint violations grow sublinearly with respect to $T$, where regret is measured against all feasible joint randomized stationary policies $\Pi$.

Here we give a brief review of the works related to online optimization and online MDPs.
• Online convex optimization (OCO): This concerns multi-round cost minimization with arbitrarily-varying convex loss functions. Specifically, on each slot $t$ the decision maker chooses a decision $x(t)$ within a convex set $\mathcal{X}$ (before observing the loss function $f_t(x)$) in order to minimize the total regret compared to the best fixed decision in hindsight, expressed as
$$\text{regret}(T) = \sum_{t=1}^T f_t(x(t)) - \min_{x\in\mathcal{X}}\sum_{t=1}^T f_t(x).$$
See [H+16] for an introduction to OCO. Zinkevich introduced OCO in [Zin03] and showed that an online projected gradient descent (OGD) algorithm achieves $\mathcal{O}(\sqrt T)$ regret. This $\mathcal{O}(\sqrt T)$ regret is proven to be the best possible in [HAK07], although improved performance is possible if all convex loss functions are strongly convex. The OGD decision requires computing a projection of a vector onto the set $\mathcal{X}$. For complicated sets $\mathcal{X}$ with functional equality constraints, e.g., $\mathcal{X}=\{x\in\mathcal{X}_0:\ g_k(x)\le0,\ k\in\{1,2,\dots,m\}\}$, the projection can have high complexity. To circumvent the projection, work in [MJY12, JHA16, YN16, CLG17] proposes alternative algorithms with simpler per-slot complexity that satisfy the inequality constraints in the long term (rather than on every slot). Recently, new primal-dual-type algorithms with low complexity were proposed in [NY17, YNW17] to solve the more challenging OCO problem with time-varying functional inequality constraints.
• Online Markov decision processes: This extends OCO to systems with a more complex Markov structure. It is similar to the setup of the current chapter of minimizing the expression (6.1), but without the constraint set (6.2).
Unlike traditional OCO, the current penalty depends not only on the current action and the current (unknown) penalty function, but also on the current system state (which depends on the history of previous actions). Further, the number of policies can grow exponentially in the sizes of the state and action spaces, so solutions can be computationally intensive. The work [EDKM09] develops an algorithm in this context with $\mathcal{O}(\sqrt T)$ regret. Extended algorithms and regularization methods are developed in [YMS09][GRW14][DGS14] to reduce complexity and improve the dependencies on the number of states and actions. Online MDPs under bandit feedback (where the decision maker can only observe the penalty corresponding to the chosen action) are considered in [YMS09][NAGS10].
• Constrained MDPs: This aims to solve classical MDP problems with known cost functions but subject to additional constraints on budget or resources. Linear programming methods for MDPs are found, for example, in [Alt99b], and algorithms beyond LP are found in [Nee11][CDM14]. The formulations closest to our setup appear in recent work on weakly coupled MDPs in [BL16][WN16a], which assumes known cost and resource functions.
• Reinforcement learning (RL): This concerns MDPs with some unknown parameters (such as unknown functions and transition probabilities). Typically, RL makes stronger assumptions than the online setting, such as an environment that is unknown but fixed, whereas the unknown environment in the online context can change over time. Methods for RL are developed in [Ber95][SB98][LHS+13][CW16].

6.2 Preliminaries

6.2.1 Basic Definitions

Throughout this chapter, given an MDP with state space $\mathcal{S}$ and action space $\mathcal{A}$, a policy $\mathcal{P}$ defines a (possibly probabilistic) method of choosing actions $a\in\mathcal{A}$ at state $s\in\mathcal{S}$ based on past information. We start with some basic definitions of important classes of policies:

Definition 6.2.1. For an MDP, a randomized stationary policy $\pi$ defines an algorithm which, whenever the system is in state $s\in\mathcal{S}$, chooses an action $a\in\mathcal{A}$ according to a fixed conditional probability function $\pi(a|s)$, defined for all $a\in\mathcal{A}$ and $s\in\mathcal{S}$.

Definition 6.2.2. For an MDP, a pure policy $\pi$ is a randomized stationary policy with all probabilities equal to either 0 or 1. That is, a pure policy is defined by a deterministic mapping between states $s\in\mathcal{S}$ and actions $a\in\mathcal{A}$: whenever the system is in a state $s\in\mathcal{S}$, it always chooses a particular action $a_s\in\mathcal{A}$ (with probability 1).

Note that if an MDP has finite state and action spaces, the set of all pure policies is also finite. Consider the MDP associated with a particular system $k\in\{1,\dots,K\}$. For any randomized stationary policy $\pi$, it holds that $\sum_{a\in\mathcal{A}^{(k)}}\pi(a|s)=1$ for all $s\in\mathcal{S}^{(k)}$. Define the transition probability matrix $P_\pi^{(k)}$ under policy $\pi$ to have components
$$P_\pi^{(k)}(s,s') = \sum_{a\in\mathcal{A}^{(k)}}\pi(a|s)P_a^{(k)}(s,s'),\quad s,s'\in\mathcal{S}^{(k)}. \quad (6.3)$$
It is easy to verify that $P_\pi^{(k)}$ is indeed a stochastic matrix, that is, its rows have nonnegative components that sum to 1. Let $d_0^{(k)}\in[0,1]^{|\mathcal{S}^{(k)}|}$ be an (arbitrary) initial distribution for the $k$-th MDP. Define the state distribution at time $t$ under $\pi$ as $d_{\pi,t}^{(k)}$. By the Markov property of the system, we have $d_{\pi,t}^{(k)} = d_0^{(k)}\big(P_\pi^{(k)}\big)^t$. A transition probability matrix $P_\pi^{(k)}$ is ergodic if it gives rise to a Markov chain that is irreducible and aperiodic.
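The mixture in (6.3) is a simple row-wise weighted sum, and the stationary distribution of an ergodic $P_\pi$ can be read off from a left eigenvector at eigenvalue 1. The sketch below shows both computations; the function names and data layout (`pi[s, a]`, list of per-action matrices) are illustrative conventions, not part of the chapter's notation.

```python
import numpy as np

def policy_transition_matrix(pi, P):
    """Sketch of (6.3): pi[s, a] = pi(a|s); P[a] is the |S| x |S|
    transition matrix of action a. Returns P_pi with
    P_pi(s, s') = sum_a pi(a|s) * P_a(s, s')."""
    S, A = pi.shape
    P_pi = np.zeros((S, S))
    for a in range(A):
        P_pi += pi[:, [a]] * P[a]   # scale row s of P_a by pi(a|s)
    return P_pi

def stationary_distribution(P_pi):
    """Unique d solving d = d P_pi for an ergodic chain, obtained from
    the eigenvector of P_pi^T at the eigenvalue closest to 1."""
    w, v = np.linalg.eig(P_pi.T)
    d = np.real(v[:, np.argmin(np.abs(w - 1.0))])
    return d / d.sum()
```

For an ergodic $P_\pi$, the returned vector is the unique stationary distribution $d_\pi$ discussed next.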
Since the state space is finite, an ergodic matrix $P_\pi^{(k)}$ has a unique stationary distribution, denoted $d_\pi^{(k)}$, so that $d_\pi^{(k)}$ is the unique probability vector solving $d=dP_\pi^{(k)}$.

Assumption 6.2.1 (Unichain model). There exists a universal integer $\hat r\ge1$ such that for any integer $r\ge\hat r$ and every $k\in\{1,\dots,K\}$, the product $P_{\pi_1}^{(k)}P_{\pi_2}^{(k)}\cdots P_{\pi_r}^{(k)}$ is a transition matrix with strictly positive entries for any sequence of pure policies $\pi_1,\pi_2,\cdots,\pi_r$ associated with the $k$th MDP.

Remark 6.2.1. Assumption 6.2.1 implies that each MDP $k\in\{1,\dots,K\}$ is ergodic under any pure policy. This follows by taking $\pi_1,\pi_2,\cdots,\pi_r$ all the same in Assumption 6.2.1. Since the transition matrix of any randomized stationary policy can be formed as a convex combination of those of pure policies, any randomized stationary policy results in an ergodic MDP, for which there is a unique stationary distribution.

Assumption 6.2.1 is easy to check via the following simple sufficient condition.

Proposition 6.2.1. Assumption 6.2.1 holds if, for every $k\in\{1,\dots,K\}$, there is a fixed ergodic matrix $P^{(k)}$ (i.e., a transition probability matrix that defines an irreducible and aperiodic Markov chain) such that for any pure policy $\pi$ on MDP $k$ we have the decomposition
$$P_\pi^{(k)} = \delta_\pi P^{(k)} + (1-\delta_\pi)Q_\pi^{(k)},$$
where $\delta_\pi\in(0,1]$ depends on the pure policy $\pi$ and $Q_\pi^{(k)}$ is a stochastic matrix depending on $\pi$.

Proof. Fix $k\in\{1,\dots,K\}$ and assume every pure policy on MDP $k$ has the above decomposition. Since there are only finitely many pure policies, there exists a lower bound $\delta_{\min}>0$ such that $\delta_\pi\ge\delta_{\min}$ for every pure policy $\pi$. Since $P^{(k)}$ is an ergodic matrix, there exists an integer $r^{(k)}>0$ large enough such that $\big(P^{(k)}\big)^r$ has strictly positive components for all $r\ge r^{(k)}$. Fix $r\ge r^{(k)}$ and let $\pi_1,\dots,\pi_r$ be any sequence of $r$ pure policies on MDP $k$. Then
$$P_{\pi_1}^{(k)}\cdots P_{\pi_r}^{(k)} \ge \delta_{\min}^r\big(P^{(k)}\big)^r > 0,$$
where the inequality is treated entrywise. The universal integer $\hat r$ can be taken as the maximum of the integers $r^{(k)}$ over all $k\in\{1,\dots,K\}$. □

Definition 6.2.3. A joint randomized stationary policy $\Pi$ on $K$ parallel MDPs defines an algorithm which chooses a joint action $\mathbf{a}:=\big(a^{(1)},a^{(2)},\cdots,a^{(K)}\big)\in\mathcal{A}^{(1)}\times\mathcal{A}^{(2)}\times\cdots\times\mathcal{A}^{(K)}$, given the joint state $\mathbf{s}:=\big(s^{(1)},s^{(2)},\cdots,s^{(K)}\big)\in\mathcal{S}^{(1)}\times\mathcal{S}^{(2)}\times\cdots\times\mathcal{S}^{(K)}$, according to a fixed conditional probability $\Pi(\mathbf{a}|\mathbf{s})$.

The following special class of separable policies can be implemented separately over each of the $K$ MDPs and plays a role in both algorithm design and performance analysis.

Definition 6.2.4. A joint randomized stationary policy $\pi$ is separable if the conditional probabilities $\pi:=\big(\pi^{(1)},\pi^{(2)},\cdots,\pi^{(K)}\big)$ decompose as a product
$$\pi(\mathbf{a}|\mathbf{s}) = \prod_{k=1}^K\pi^{(k)}\big(a^{(k)}|s^{(k)}\big)$$
for all $\mathbf{a}\in\mathcal{A}^{(1)}\times\cdots\times\mathcal{A}^{(K)}$, $\mathbf{s}\in\mathcal{S}^{(1)}\times\cdots\times\mathcal{S}^{(K)}$.

6.2.2 Technical assumptions

The functions $f_t^{(k)}$ and $g_{i,t}^{(k)}$ are determined by random processes defined over $t=0,1,2,\cdots$. Specifically, let $\Omega$ be a finite-dimensional vector space. Let $\{\omega_t\}_{t=0}^\infty$ and $\{\mu_t\}_{t=0}^\infty$ be two sequences of random vectors in $\Omega$. Then, for all $a\in\mathcal{A}^{(k)}$, $s\in\mathcal{S}^{(k)}$, $i\in\{1,2,\cdots,m\}$, we have
$$g_{i,t}^{(k)}(a,s) = \hat g_i^{(k)}(a,s,\omega_t),\qquad f_t^{(k)}(a,s) = \hat f^{(k)}(a,s,\mu_t),$$
where $\hat g_i^{(k)}$ and $\hat f^{(k)}$ formally define the time-varying functions in terms of the random processes $\omega_t$ and $\mu_t$. It is assumed that the processes $\{\omega_t\}_{t=0}^\infty$ and $\{\mu_t\}_{t=0}^\infty$ are generated at the start of slot 0 (before any control actions are taken) and revealed gradually over time, so that the functions $g_{i,t}^{(k)}$ and $f_t^{(k)}$ are only revealed at the end of slot $t$.
Remark 6.2.2. The functions generated at time 0 in this way are also called oblivious functions, because they are not influenced by control actions. Such an assumption is commonly adopted in previous unconstrained online MDP works (e.g., [EDKM09], [YMS09] and [DGS14]). Further, it is shown in [YMS09] that without this assumption, one can choose a sequence of objective functions against the decision maker in a specifically designed MDP scenario such that sublinear regret can never be achieved.

The functions are also assumed to be bounded by a universal constant $\Psi$, so that
$$\big|\hat g_i^{(k)}(a,s,\omega)\big|\le\Psi,\quad \big|\hat f^{(k)}(a,s,\mu)\big|\le\Psi,\quad \forall k\in\{1,\dots,K\},\ \forall a\in\mathcal{A}^{(k)},\ s\in\mathcal{S}^{(k)},\ \forall\omega,\mu\in\Omega. \quad (6.4)$$
It is assumed that $\{\omega_t\}_{t=0}^\infty$ is independent, identically distributed (i.i.d.) and independent of $\{\mu_t\}_{t=0}^\infty$. Hence, the constraint functions can be arbitrarily correlated within the same slot, but appear i.i.d. over different slots. On the other hand, no specific model is imposed on $\{\mu_t\}_{t=0}^\infty$; thus, the functions $f_t^{(k)}$ can be arbitrarily time-varying. Let $\mathcal{H}_t$ be the system information up to time $t$; then, for any $t\in\{0,1,2,\cdots\}$, $\mathcal{H}_t$ contains the state and action information up to time $t$, i.e., $\mathbf{s}_0,\cdots,\mathbf{s}_t$, $\mathbf{a}_0,\cdots,\mathbf{a}_t$, as well as $\{\omega_t\}_{t=0}^\infty$ and $\{\mu_t\}_{t=0}^\infty$. Throughout this chapter, we make the following assumptions.

Assumption 6.2.2 (Independent transition). For each MDP, given the state $s_t^{(k)}\in\mathcal{S}^{(k)}$ and action $a_t^{(k)}\in\mathcal{A}^{(k)}$, the next state $s_{t+1}^{(k)}$ is independent of all other past information up to time $t$ as well as the state transitions $s_{t+1}^{(j)}$, $\forall j\ne k$; i.e., for all $s\in\mathcal{S}^{(k)}$ it holds that
$$\Pr\big(s_{t+1}^{(k)}=s\ \big|\ \mathcal{H}_t,\ s_{t+1}^{(j)},\ \forall j\ne k\big) = \Pr\big(s_{t+1}^{(k)}=s\ \big|\ s_t^{(k)},a_t^{(k)}\big),$$
where $\mathcal{H}_t$ contains all past information up to time $t$.

Intuitively, this assumption means that all MDPs run independently in the joint probability space, so the only coupling among them comes from the constraints, which reflects the notion of weakly coupled MDPs in our title. Furthermore, by the definition of $\mathcal{H}_t$, given $s_t^{(k)},a_t^{(k)}$, the next transition $s_{t+1}^{(k)}$ is also independent of the function paths $\{\omega_t\}_{t=0}^\infty$ and $\{\mu_t\}_{t=0}^\infty$.

The following assumption states that the constraint set is strictly feasible.

Assumption 6.2.3 (Slater's condition). There exist a real value $\eta>0$ and a fixed separable randomized stationary policy $\tilde\pi$ such that
$$\mathbb{E}\Bigg[\sum_{k=1}^K g_{i,t}^{(k)}\big(a_t^{(k)},s_t^{(k)}\big)\ \Bigg|\ d_{\tilde\pi},\tilde\pi\Bigg] \le -\eta,\quad\forall i\in\{1,2,\cdots,m\},$$
where the initial state distribution $d_{\tilde\pi}$ is the unique stationary distribution of policy $\tilde\pi$, and the expectation is taken with respect to the random initial state and the stochastic function $g_{i,t}^{(k)}(a,s)$ (i.e., $\omega_t$).

Slater's condition is a common assumption in the convergence time analysis of constrained convex optimization (e.g., [NO09], [Ber09b]). Note that this assumption readily implies that the constraint set $\mathcal{G}$ can be achieved by the above randomized stationary policy. Specifically, take $d_0^{(k)}=d_{\tilde\pi^{(k)}}$ and $\mathcal{P}=\tilde\pi$; then we have
$$G_{i,T}(d_0,\tilde\pi) = \sum_{t=0}^{T-1}\mathbb{E}\Bigg[\sum_{k=1}^K g_{i,t}^{(k)}\big(a_t^{(k)},s_t^{(k)}\big)\ \Bigg|\ d_{\tilde\pi},\tilde\pi\Bigg] \le -\eta T < 0.$$

6.2.3 The state-action polyhedron

In this section, we recall the well-known linear program formulation of an MDP (see, for example, [Alt99b] and [Fox66a]). Consider an MDP with state space $\mathcal{S}$ and action space $\mathcal{A}$. Let $\Delta\subseteq\mathbb{R}^{|\mathcal{S}||\mathcal{A}|}$ be the probability simplex, i.e.,
$$\Delta = \Bigg\{\theta\in\mathbb{R}^{|\mathcal{S}||\mathcal{A}|}:\ \sum_{(s,a)\in\mathcal{S}\times\mathcal{A}}\theta(s,a)=1,\ \theta(s,a)\ge0\Bigg\}.$$
Given a randomized stationary policy $\pi$ with stationary state distribution $d_\pi$, the MDP is a Markov chain with transition matrix $P_\pi$ given by (6.3).
Thus, it must satisfy the following balance equation:
$$\sum_{s\in\mathcal{S}}d_\pi(s)P_\pi(s,s') = d_\pi(s'),\quad\forall s'\in\mathcal{S}.$$
Defining $\theta(a,s)=\pi(a|s)d_\pi(s)$ and substituting the definition of the transition probabilities (6.3) into the above equation gives
$$\sum_{s\in\mathcal{S}}\sum_{a\in\mathcal{A}}\theta(s,a)P_a(s,s') = \sum_{a\in\mathcal{A}}\theta(s',a),\quad\forall s'\in\mathcal{S}.$$
The variable $\theta(a,s)$ is often interpreted as the stationary probability of being in state $s\in\mathcal{S}$ and taking action $a\in\mathcal{A}$ under some randomized stationary policy. The state-action polyhedron $\Theta$ is then defined as
$$\Theta := \Bigg\{\theta\in\Delta:\ \sum_{s\in\mathcal{S}}\sum_{a\in\mathcal{A}}\theta(s,a)P_a(s,s') = \sum_{a\in\mathcal{A}}\theta(s',a),\ \forall s'\in\mathcal{S}\Bigg\}.$$
Given any $\theta\in\Theta$, one can recover a randomized stationary policy $\pi$ at any state $s\in\mathcal{S}$ as
$$\pi(a|s) = \begin{cases}\dfrac{\theta(a,s)}{\sum_{a\in\mathcal{A}}\theta(a,s)}, & \text{if }\sum_{a\in\mathcal{A}}\theta(a,s)\ne0,\\[1ex] 0, & \text{otherwise}.\end{cases} \quad (6.5)$$
Given any fixed penalty function $f(a,s)$, the best policy minimizing the penalty (without constraints) is a randomized stationary policy given by the solution to the following linear program (LP):
$$\min\ \langle f,\theta\rangle\quad\text{s.t.}\quad\theta\in\Theta, \quad (6.6)$$
where $f:=[f(a,s)]_{a\in\mathcal{A},\,s\in\mathcal{S}}$. Note that for any policy $\pi$ given by the state-action pair $\theta$ according to (6.5),
$$\langle f,\theta\rangle = \mathbb{E}_{s\sim d_\pi,\,a\sim\pi(\cdot|s)}[f(a,s)];$$
thus, $\langle f,\theta\rangle$ is often referred to as the stationary state penalty of policy $\pi$. It can also be shown that any state-action pair in the set $\Theta$ can be achieved by a convex combination of state-action vectors of pure policies, and thus all corner points of the polyhedron $\Theta$ come from pure policies. As a consequence, the best randomized stationary policy solving (6.6) is always a pure policy.
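For concreteness, the LP (6.6) can be written out directly over the flattened variables $\theta(s,a)$. The sketch below does so with `scipy.optimize.linprog` (an assumed available dependency); the data layout and function name are illustrative.

```python
import numpy as np
from scipy.optimize import linprog

def best_stationary_policy(f, P):
    """Sketch of LP (6.6): f has shape (S, A); P[a] is the |S| x |S|
    transition matrix of action a. Variables theta(s, a) are flattened
    row-major, so theta(s, a) sits at index s*A + a."""
    S, A = f.shape
    # Balance constraints: sum_{s,a} theta(s,a) P_a(s,s') = sum_a theta(s',a)
    A_eq = np.zeros((S + 1, S * A))
    for sp in range(S):
        for s in range(S):
            for a in range(A):
                A_eq[sp, s * A + a] = P[a][s, sp] - (1.0 if s == sp else 0.0)
    A_eq[S, :] = 1.0                          # normalization: sum theta = 1
    b_eq = np.zeros(S + 1)
    b_eq[S] = 1.0
    res = linprog(f.ravel(), A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, None)] * (S * A), method="highs")
    return res.x.reshape(S, A)
```

Consistent with the corner-point discussion above, the LP solver returns a vertex of $\Theta$, i.e., a state-action vector corresponding to a pure policy.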
This in turn implies that such stationary state-action probabilities $\{\theta^{(k)}\}_{k=1}^K$ can also be realized via a separable randomized stationary policy $\pi$ with
$$\pi^{(k)}(a|s) = \frac{\theta^{(k)}(a,s)}{\sum_{a\in\mathcal{A}^{(k)}}\theta^{(k)}(a,s)},\qquad a\in\mathcal{A}^{(k)},\ s\in\mathcal{S}^{(k)},\qquad (6.7)$$
and the corresponding stationary penalty and constraint values can also be achieved via this policy. This fact implies that, when considering stationary state performance only, the class of separable randomized stationary policies is large enough to cover all possible stationary penalty and constraint values.

In particular, let $\tilde\pi = \big(\tilde\pi^{(1)},\cdots,\tilde\pi^{(K)}\big)$ be the separable randomized stationary policy associated with the Slater condition (Assumption 6.2.3). Using the fact that the constraint functions $g^{(k)}_{i,t}$, $k=1,2,\cdots,K$ (i.e., $\omega_t$) are i.i.d., together with Assumption 6.2.2 on the independence of probability transitions, the constraint functions $g^{(k)}_{i,t}$ and the state-action pairs at any time $t$ are mutually independent. Thus,
$$\mathbb{E}\left[\left.\sum_{k=1}^K g^{(k)}_{i,t}\big(a^{(k)}_t,s^{(k)}_t\big)\,\right|\, d_{\tilde\pi},\tilde\pi\right] = \sum_{k=1}^K\left\langle \mathbb{E}\,g^{(k)}_{i,t},\ \tilde\theta^{(k)}\right\rangle,$$
where $\tilde\theta^{(k)}$ corresponds to $\tilde\pi$ according to (6.7). Then, Slater's condition can be translated into the following: there exists a sequence of state-action probabilities $\{\tilde\theta^{(k)}\}_{k=1}^K$ from a separable randomized stationary policy such that $\tilde\theta^{(k)}\in\Theta^{(k)}$, $\forall k$, and
$$\sum_{k=1}^K\left\langle \mathbb{E}\,g^{(k)}_{i,t},\ \tilde\theta^{(k)}\right\rangle \le -\eta,\qquad i=1,2,\cdots,m. \qquad (6.8)$$
The assumption of separability loses no generality in the sense that if no separable randomized stationary policy satisfies (6.8), then no joint randomized stationary policy satisfies (6.8) either.

6.2.5 The blessing of the slow-update property in online MDPs

The current state of an MDP depends on previous states and actions. As a consequence, the slot-$t$ penalty depends not only on the current penalty function and current action, but also on the system history. This complication does not arise in classical online convex optimization ([H+16], [Zin03]), as there is no notion of "state" and the slot-$t$ penalty depends only on the slot-$t$ penalty function and action.

Now imagine a virtual system where, on each slot $t$, a policy $\pi_t$ is chosen (rather than an action). Further imagine the MDP immediately reaching its corresponding stationary distribution $d_{\pi_t}$. Then the states and actions on previous slots do not matter, and the slot-$t$ performance depends only on the chosen policy $\pi_t$ and on the current penalty and constraint functions. This imaginary system now has a structure similar to classical online convex optimization in the Zinkevich scenario [Zin03].

A key feature of online convex optimization algorithms as in [Zin03] is that they update their decision variables slowly. For a fixed time scale $T$ over which $O(\sqrt{T})$ regret is desired, the decision variables are typically changed no more than a distance $O(1/\sqrt{T})$ from one slot to the next. An important insight in prior (unconstrained) MDP works (e.g. [DGS14], [EDKM09], and [YMS09]) is that such slow updates also guarantee the "approximate" convergence of an MDP to its stationary distribution. As a consequence, one can design the decision policies under the imaginary assumption that the system instantly reaches its stationary distribution, and later bound the error between the true system and the imaginary system. If the error is on the same order as the desired $O(\sqrt{T})$ regret, then this approach works; a small numerical illustration of this phenomenon follows.
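To make the preceding discussion concrete, the following self-contained Python sketch (illustrative only, not part of the thesis' development; the chain size, horizon, and the two endpoint policies are arbitrary assumptions) simulates a single ergodic MDP whose randomized stationary policy drifts by $O(1/T)$ per slot, and tracks the gap $\|v_t - d_{\pi_t}\|_1$ between the true state distribution and the stationary distribution of the instantaneous policy.

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, T = 5, 3, 2000

# Random transition kernel P[a, s, s'] with all entries positive, so every
# policy induces an ergodic chain (a stand-in for Assumption 6.2.1).
P = rng.uniform(0.1, 1.0, size=(nA, nS, nS))
P /= P.sum(axis=2, keepdims=True)

def policy_matrix(pi):
    # P_pi(s, s') = sum_a pi(a|s) P_a(s, s'), as in (6.3).
    return np.einsum('sa,asx->sx', pi, P)

def stationary(M):
    # Solve d = d M, sum(d) = 1: drop one redundant balance equation
    # and append the normalization row.
    n = M.shape[0]
    lhs = np.vstack([(M.T - np.eye(n))[:-1], np.ones(n)])
    rhs = np.zeros(n); rhs[-1] = 1.0
    return np.linalg.solve(lhs, rhs)

# Two arbitrary randomized stationary policies; pi_t walks from one to the
# other, so consecutive policies differ by O(1/T) per slot.
piA = rng.uniform(size=(nS, nA)); piA /= piA.sum(axis=1, keepdims=True)
piB = rng.uniform(size=(nS, nA)); piB /= piB.sum(axis=1, keepdims=True)

v = np.full(nS, 1.0 / nS)   # true state distribution v_t
gaps = []
for t in range(T):
    pi_t = (1 - t / T) * piA + (t / T) * piB
    M = policy_matrix(pi_t)
    gaps.append(np.abs(v - stationary(M)).sum())  # ||v_t - d_{pi_t}||_1
    v = v @ M                                     # one-slot chain update

print(f"initial gap {gaps[0]:.2e}, max gap after burn-in {max(gaps[50:]):.2e}")
```

Consistent with the discussion above, after a short transient the gap settles at a level governed by the per-slot policy change times the mixing time, rather than growing with $t$.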
This idea serves as a cornerstone of our algorithm design in the next section, which treats the case of multiple weakly coupled systems with both objective and constraint functions.

6.3 OCMDP algorithm

Our proposed algorithm is distributed in the sense that, on each time slot, each MDP solves its own subproblem, while the constraint violations are controlled by a simple update of global multipliers, called "virtual queues," at the end of each slot. Let $\Theta^{(1)},\Theta^{(2)},\cdots,\Theta^{(K)}$ be the state-action polyhedra of the $K$ MDPs, respectively. Let $\theta^{(k)}_t\in\Theta^{(k)}$ be a state-action vector at time slot $t$. At $t=0$, each MDP chooses its initial state-action vector $\theta^{(k)}_0$ resulting from any separable randomized stationary policy $\pi^{(k)}_0$. For example, one could choose the uniform policy $\pi^{(k)}_0(a|s) = 1/|\mathcal{A}^{(k)}|$, $\forall s\in\mathcal{S}^{(k)}$, solve the equation $d_{\pi^{(k)}_0} = d_{\pi^{(k)}_0}P^{(k)}_{\pi^{(k)}_0}$ to get a probability vector $d_{\pi^{(k)}_0}$, and obtain $\theta^{(k)}_0(a,s) = d_{\pi^{(k)}_0}(s)/|\mathcal{A}^{(k)}|$.

For each constraint $i\in\{1,2,\cdots,m\}$, let $Q_i(t)$ be a virtual queue defined over slots $t=0,1,2,\cdots$ with the initial condition $Q_i(0)=Q_i(1)=0$ and the update equation
$$Q_i(t+1) = \max\left\{Q_i(t) + \sum_{k=1}^K\left\langle g^{(k)}_{i,t-1},\ \theta^{(k)}_t\right\rangle,\ 0\right\},\qquad \forall t\in\{1,2,3,\cdots\}. \qquad (6.9)$$
Our algorithm uses two parameters $V>0$ and $\alpha>0$ and makes decisions as follows. At the start of each slot $t\in\{1,2,3,\cdots\}$:

• The $k$-th MDP observes $Q_i(t)$, $i=1,2,\cdots,m$, and chooses $\theta^{(k)}_t$ to solve the subproblem
$$\theta^{(k)}_t = \operatorname*{argmin}_{\theta\in\Theta^{(k)}}\ \left\langle Vf^{(k)}_{t-1} + \sum_{i=1}^m Q_i(t)\,g^{(k)}_{i,t-1},\ \theta\right\rangle + \alpha\left\|\theta-\theta^{(k)}_{t-1}\right\|_2^2. \qquad (6.10)$$

• Construct the randomized stationary policy $\pi^{(k)}_t$ according to (6.5) with $\theta=\theta^{(k)}_t$, and choose the action $a^{(k)}_t$ at the $k$-th MDP according to the conditional distribution $\pi^{(k)}_t\big(\cdot\,\big|\,s^{(k)}_t\big)$.

• Update the virtual queues $Q_i(t)$ according to (6.9) for all $i=1,2,\cdots,m$.

Remark 6.3.1. Note that for any slot $t\ge 1$, this algorithm gives a separable randomized stationary policy: each MDP chooses its own policy based on its own functions $f^{(k)}_{t-1}$, $g^{(k)}_{i,t-1}$, $i\in\{1,2,\cdots,m\}$, and a common multiplier vector $Q(t) := (Q_1(t),\cdots,Q_m(t))$. Furthermore, note that (6.10) is a convex quadratic program (QP). Standard QP theory (e.g. [YT89]) shows that the computational complexity of solving (6.10) is $\mathrm{poly}\big(|\mathcal{S}^{(k)}||\mathcal{A}^{(k)}|\big)$ for each $k$. Thus, the total computational complexity over all MDPs during each round is $\mathrm{poly}\big(K|\mathcal{S}^{(k)}||\mathcal{A}^{(k)}|\big)$.

Remark 6.3.2. The quadratic term $\alpha\|\theta-\theta^{(k)}_{t-1}\|_2^2$ in (6.10) penalizes the deviation of $\theta$ from the previous decision variable $\theta^{(k)}_{t-1}$. Thus, under a proper choice of $\alpha$, the distance between $\theta^{(k)}_t$ and $\theta^{(k)}_{t-1}$ is very small, which is the slow-update condition we need according to Section 6.2.5.

The next lemma shows that solving (6.10) is in fact a projection onto the state-action polyhedron. For any set $\mathcal{X}\subseteq\mathbb{R}^n$ and any vector $y\in\mathbb{R}^n$, define the projection operator
$$\mathcal{P}_{\mathcal{X}}(y) = \operatorname*{arginf}_{x\in\mathcal{X}}\ \|x-y\|_2.$$

Lemma 6.3.1. Fix an $\alpha>0$ and $t\in\{1,2,3,\cdots\}$. The $\theta^{(k)}_t$ that solves (6.10) is
$$\theta^{(k)}_t = \mathcal{P}_{\Theta^{(k)}}\left(\theta^{(k)}_{t-1} - \frac{w^{(k)}_t}{2\alpha}\right),\qquad\text{where } w^{(k)}_t = Vf^{(k)}_{t-1} + \sum_{i=1}^m Q_i(t)\,g^{(k)}_{i,t-1}\in\mathbb{R}^{|\mathcal{A}^{(k)}||\mathcal{S}^{(k)}|}.$$

Proof. By definition, we have
$$\begin{aligned}
\theta^{(k)}_t &= \operatorname*{argmin}_{\theta\in\Theta^{(k)}}\ \left\langle w^{(k)}_t,\theta\right\rangle + \alpha\left\|\theta-\theta^{(k)}_{t-1}\right\|_2^2\\
&= \operatorname*{argmin}_{\theta\in\Theta^{(k)}}\ \left\langle w^{(k)}_t,\theta-\theta^{(k)}_{t-1}\right\rangle + \alpha\left\|\theta-\theta^{(k)}_{t-1}\right\|_2^2 + \left\langle w^{(k)}_t,\theta^{(k)}_{t-1}\right\rangle\\
&= \operatorname*{argmin}_{\theta\in\Theta^{(k)}}\ \alpha\left(\left\langle w^{(k)}_t/\alpha,\ \theta-\theta^{(k)}_{t-1}\right\rangle + \left\|\theta-\theta^{(k)}_{t-1}\right\|_2^2\right) + \left\langle w^{(k)}_t,\theta^{(k)}_{t-1}\right\rangle\\
&= \operatorname*{argmin}_{\theta\in\Theta^{(k)}}\ \alpha\left\|\theta-\theta^{(k)}_{t-1} + \frac{w^{(k)}_t}{2\alpha}\right\|_2^2\\
&= \mathcal{P}_{\Theta^{(k)}}\left(\theta^{(k)}_{t-1} - \frac{w^{(k)}_t}{2\alpha}\right),
\end{aligned}$$
finishing the proof.
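Lemma 6.3.1 says each per-slot decision (6.10) is a Euclidean projection onto the state-action polyhedron. The sketch below is a minimal illustration, not the thesis' implementation: it assumes a single MDP (i.e. $K=1$), random placeholder functions in place of $f_{t-1}$ and $g_{i,t-1}$, an illustrative previous iterate, and SciPy's general-purpose SLSQP solver in place of a dedicated QP solver. It assembles $w_t$, solves one instance of (6.10) over $\Theta$, recovers the policy via (6.5), and performs the queue update (6.9).

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
nS, nA = 4, 3
n = nS * nA                        # theta is flattened as theta[s * nA + a]

P = rng.uniform(0.1, 1.0, size=(nA, nS, nS))
P /= P.sum(axis=2, keepdims=True)  # transition kernel P_a(s, s')

def balance(x):
    # Theta membership: sum_{s,a} theta(s,a) P_a(s,s') = sum_a theta(s',a).
    # One balance equation is redundant given the simplex constraint, so
    # we drop the last one to keep the constraints independent.
    th = x.reshape(nS, nA)
    inflow = np.einsum('sa,asx->x', th, P)
    return (inflow - th.sum(axis=1))[:-1]

def solve_subproblem(w, theta_prev, alpha):
    # The per-slot problem (6.10); by Lemma 6.3.1 this equals the
    # projection of theta_prev - w/(2*alpha) onto Theta.
    cons = [{'type': 'eq', 'fun': balance},
            {'type': 'eq', 'fun': lambda x: x.sum() - 1.0}]  # simplex
    obj = lambda x: w @ x + alpha * np.sum((x - theta_prev) ** 2)
    res = minimize(obj, theta_prev, method='SLSQP',
                   bounds=[(0.0, 1.0)] * n, constraints=cons)
    return res.x

def recover_policy(theta):
    # Randomized stationary policy pi(a|s) from (6.5); zero rows stay zero.
    th = theta.reshape(nS, nA)
    mass = th.sum(axis=1, keepdims=True)
    return np.divide(th, mass, out=np.zeros_like(th), where=mass > 1e-12)

# One slot with m = 2 constraints and illustrative parameters.
T, m = 10_000, 2
V, alpha = np.sqrt(T), float(T)
f_prev = rng.uniform(-1.0, 1.0, size=n)          # placeholder f_{t-1}
g_prev = rng.uniform(-1.0, 1.0, size=(m, n))     # placeholder g_{i,t-1}
Q = np.zeros(m)                                  # Q_i(0) = Q_i(1) = 0
theta_prev = np.full(n, 1.0 / n)                 # illustrative previous
                                                 # iterate (the algorithm
                                                 # starts from a separable
                                                 # policy instead)

w = V * f_prev + Q @ g_prev                      # w_t from Lemma 6.3.1
theta = solve_subproblem(w, theta_prev, alpha)
pi = recover_policy(theta)                       # used to draw a_t ~ pi(.|s_t)
Q = np.maximum(Q + g_prev @ theta, 0.0)          # virtual queue update (6.9)
print("||theta_t - theta_{t-1}||_2 =", np.linalg.norm(theta - theta_prev))
```

With $\alpha = T$ the quadratic term dominates, so the printed step size stays small, consistent with Remark 6.3.2.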
6.3.1 Intuition of the algorithm and roadmap of analysis

The intuition behind this algorithm follows from the discussion in Section 6.2.5. Instead of the Markovian regret (6.1) and constraint set (6.2), we work with the imaginary system in which, after the decision maker chooses any joint policy $\Pi_t$ and the penalty/constraint functions are revealed, the $K$ parallel Markov chains reach their stationary state distributions right away, with state-action probability vectors $\{\theta^{(k)}_t\}_{k=1}^K$ for the $K$ parallel MDPs. There is thus no Markov state in such a system anymore, and the corresponding stationary penalty and constraint function values at time $t$ can be expressed as $\sum_{k=1}^K\langle f^{(k)}_t,\theta^{(k)}_t\rangle$ and $\sum_{k=1}^K\langle g^{(k)}_{i,t},\theta^{(k)}_t\rangle$, $i=1,2,\cdots,m$, respectively. As a consequence, we now face the relatively easier task of minimizing the following regret:
$$\sum_{t=0}^{T-1}\sum_{k=1}^K\mathbb{E}\left\langle f^{(k)}_t,\theta^{(k)}_t\right\rangle - \sum_{t=0}^{T-1}\sum_{k=1}^K\mathbb{E}\left\langle f^{(k)}_t,\theta^{(k)}_*\right\rangle,\qquad (6.11)$$
where $\{\theta^{(k)}_*\}_{k=1}^K$ are the state-action probabilities corresponding to the best fixed joint randomized stationary policy within the following stationary constraint set
$$\mathcal{G} := \left\{\theta^{(k)}\in\Theta^{(k)},\ k\in\{1,2,\cdots,K\}:\ \sum_{k=1}^K\left\langle\mathbb{E}\,g^{(k)}_{i,t},\ \theta^{(k)}\right\rangle\le 0,\ i=1,2,\cdots,m\right\},\qquad (6.12)$$
under the assumption that Slater's condition (6.8) holds.

To analyze the proposed algorithm, we need to tackle the following two major challenges:

• whether the policy decisions of the proposed algorithm yield $O(\sqrt{T})$ regret and constraint violation on the imaginary system that reaches steady state instantaneously on each slot;

• whether the error between the imaginary and true systems can be bounded by $O(\sqrt{T})$.

In the next section, we answer these questions via a multi-stage analysis piecing together the results on MDPs from Section 6.2.4 with multiple ingredients from convex analysis and stochastic queue analysis. We first show the $O(\sqrt{T})$ regret and constraint violation in the imaginary online linear program, combining a new regret analysis procedure with a stochastic drift analysis for queue processes. Then, we show that if the benchmark randomized stationary algorithm always starts from its stationary state, the discrepancy between the regrets of the imaginary and true systems can be controlled via the slow-update property of the proposed algorithm together with the properties of MDPs developed in Section 6.2.4. Finally, for the problem with an arbitrary non-stationary starting state, we reformulate it as a perturbation of the aforementioned stationary-state problem and analyze the perturbation via Farkas' Lemma.

6.4 Convergence time analysis

6.4.1 Stationary state performance: an online linear program

Let $Q(t) := [Q_1(t),Q_2(t),\cdots,Q_m(t)]$ be the virtual queue vector and let $L(t) = \frac{1}{2}\|Q(t)\|_2^2$. Define the drift $\Delta(t) := L(t+1)-L(t)$.

Sample-path analysis

This section develops a couple of bounds given a sequence of penalty functions $f^{(k)}_0,f^{(k)}_1,\cdots,f^{(k)}_{T-1}$ and constraint functions $g^{(k)}_{i,0},g^{(k)}_{i,1},\cdots,g^{(k)}_{i,T-1}$. The following lemma provides bounds for the virtual queue processes.

Lemma 6.4.1. For any $i\in\{1,2,\cdots,m\}$ and $T\in\{1,2,\cdots\}$, the following holds under the virtual queue update (6.9):
$$\sum_{t=1}^{T}\sum_{k=1}^K\left\langle g^{(k)}_{i,t-1},\ \theta^{(k)}_{t-1}\right\rangle \le Q_i(T+1) - Q_i(1) + \Psi\sum_{t=1}^{T}\sum_{k=1}^K\sqrt{|\mathcal{A}^{(k)}||\mathcal{S}^{(k)}|}\left\|\theta^{(k)}_t-\theta^{(k)}_{t-1}\right\|_2,$$
where $\Psi>0$ is the constant defined in (6.4).

Proof.
By the queue updating rule (6.9), for any t∈N, Q i (t + 1) = max ( Q i (t) + K X k=1 D g (k) i,t−1 ,θ (k) t E , 0 ) ≥Q i (t) + K X k=1 D g (k) i,t−1 ,θ (k) t E =Q i (t) + K X k=1 D g (k) i,t−1 ,θ (k) t−1 E + K X k=1 D g (k) i,t−1 ,θ (k) t −θ (k) t−1 E ≥Q i (t) + K X k=1 D g (k) i,t−1 ,θ (k) t−1 E − K X k=1 g (k) i,t−1 2 θ (k) t −θ (k) t−1 2 , Note that the constraint functions are deterministically bounded, g (k) i,t−1 2 2 ≤ A (k) S (k) Ψ 2 . Substituting this bound into the above queue bound and rearranging the terms finish the proof. The next lemma provides a bound for the drift Δ(t). Lemma 6.4.2. For any slot t≥ 1, we have Δ(t)≤ 1 2 mK 2 Ψ 2 + m X i=1 Q i (t) K X k=1 D g (k) i,t−1 ,θ (k) t E . Proof. By definition, we have Δ(t) = 1 2 kQ(t + 1)k 2 2 − 1 2 kQ(t)k 2 2 ≤ 1 2 m X i=1 Q i (t) + K X k=1 D g (k) i,t−1 ,θ (k) t E ! 2 −Q i (t) 2 = m X i=1 Q i (t) K X k=1 D g (k) i,t−1 ,θ (k) t E + 1 2 m X i=1 K X k=1 D g (k) i,t−1 ,θ (k) t E ! 2 . 184 Note that by the queue update (6.9), we have K X k=1 D g (k) i,t−1 ,θ (k) t E ≤K g (k) i,t−1 ∞ θ (k) t 1 ≤KΨ. Substituting this bound into the drift bound finishes the proof. Consider a convex setX⊆R n . Recall that for a fixed real numberc> 0, a functionh :X→R is said to be c-strongly convex, if h(x)− c 2 kxk 2 2 is convex over x∈X . It is easy to see that if q :X → R is convex, c > 0 and b∈ R n , the function q(x) + c 2 kx−bk 2 2 is c-strongly convex. Furthermore, if the function h is c-strongly convex that is minimized at a point x min ∈X , then (see, e.g., Corollary 1 in [YN17]): h(x min )≤h(y)− c 2 ky−x min k 2 2 , ∀y∈X. (6.13) The following lemma is a direct consequence of the above strongly convex result. It also demon- strates the key property of our minimization subproblem (6.10). Lemma 6.4.3. The following bound holds for any k∈{1, 2,··· ,K} and any fixed θ (k) ∗ ∈ Θ (k) : V D f (k) t−1 ,θ (k) t −θ (k) t−1 E + m X i=1 Q i (t) D g (k) i,t−1 ,θ (k) t E +αkθ (k) t −θ (k) t−1 k 2 2 ≤V D f (k) t−1 ,θ (k) ∗ −θ (k) t−1 E + m X i=1 Q i (t) D g (k) i,t−1 ,θ (k) ∗ E +αkθ (k) ∗ −θ (k) t−1 k 2 2 −αkθ (k) ∗ −θ (k) t k 2 2 . (6.14) This lemma follows easily from the fact that the proposed algorithm (6.10) gives θ (k) t ∈ Θ (k) minimizing the left hand side, which is a strongly convex function, and then, applying (6.13), with h θ (k) ∗ =V D f (k) t−1 ,θ (k) ∗ −θ (k) t−1 E + m X i=1 Q i (t) D g (k) i,t−1 ,θ (k) ∗ E +α θ (k) ∗ −θ (k) t−1 2 2 Combining the previous two lemmas gives the following “drift-plus-penalty” bound. Lemma 6.4.4. For any fixed{θ (k) ∗ } K k=1 such that θ (k) ∗ ∈ Θ (k) and t∈N, we have the following 185 bound, Δ(t) +V K X k=1 D f (k) t−1 ,θ (k) t −θ (k) t−1 E +α K X k=1 kθ (k) t −θ (k) t−1 k 2 2 ≤ 3 2 mK 2 Ψ 2 +V K X k=1 D f (k) t−1 ,θ (k) ∗ −θ (k) t−1 E + m X i=1 Q i (t− 1) · K X k=1 D g (k) i,t−1 ,θ (k) ∗ E +α K X k=1 kθ (k) ∗ −θ (k) t−1 k 2 2 −α K X k=1 kθ (k) ∗ −θ (k) t k 2 2 (6.15) Proof. Using Lemma 6.4.2 and then Lemma 6.4.3, we obtain Δ(t) +V K X k=1 D f (k) t−1 ,θ (k) t −θ (k) t−1 E +α K X k=1 kθ (k) t −θ (k) t−1 k 2 2 ≤ 1 2 mK 2 Ψ 2 + m X i=1 Q i (t) K X k=1 D g (k) i,t−1 ,θ (k) t E +V K X k=1 D f (k) t−1 ,θ (k) t −θ (k) t−1 E +α K X k=1 kθ (k) t −θ (k) t−1 k 2 2 ≤ 1 2 mK 2 Ψ 2 + K X k=1 D f (k) t−1 ,θ (k) ∗ −θ (k) t−1 E + m X i=1 Q i (t) K X k=1 D g (k) i,t−1 ,θ (k) ∗ E +α K X k=1 kθ (k) ∗ −θ (k) t−1 k 2 2 −α K X k=1 kθ (k) ∗ −θ (k) t k 2 2 . 
(6.16) Note that by the queue updating rule (6.9), we have for any t≥ 2, |Q i (t)−Q i (t− 1)|≤ K X k=1 D g (k) i,t−2 ,θ (k) t−1 E ≤K g (k) i,t−2 ∞ θ (k) t−1 1 ≤KΨ, and for t = 1, Q i (t)−Q i (t− 1) = 0 by the initial condition of the algorithm. Also, we have for any θ (k) ∗ ∈ Θ (k) , K X k=1 D g (k) i,t−1 ,θ (k) ∗ E ≤K g (k) i,t−2 ∞ θ (k) ∗ 1 ≤KΨ. Thus, we have m X i=1 Q i (t) K X k=1 D g (k) i,t−1 ,θ (k) ∗ E ≤ m X i=1 Q i (t− 1) K X k=1 D g (k) i,t−1 ,θ (k) ∗ E +mK 2 Ψ 2 . Substituting this bound into (6.16) finishes the proof. 186 Objective bound Theorem 6.4.1. For any{θ (k) ∗ } K k=1 in the constraint set (6.12) and any T∈{1, 2, 3,···}, the proposed algorithm has the following stationary state performance bound: 1 T T−1 X t=0 E K X k=1 D f (k) t ,θ (k) t E ! ≤ 1 T T−1 X t=0 E K X k=1 D f (k) t ,θ (k) ∗ E ! + 2αK TV + mK 2 Ψ 2 T + V Ψ 2 2α K X k=1 S (k) A (k) + 3 2 mK 2 Ψ 2 V , In particular, choosing α =T and V = √ T gives theO( √ T ) regret 1 T T−1 X t=0 E K X k=1 D f (k) t ,θ (k) t E ! ≤ 1 T T−1 X t=0 E K X k=1 D f (k) t ,θ (k) ∗ E ! + 2K + Ψ 2 2 K X k=1 S (k) A (k) + 5 2 mK 2 Ψ 2 ! 1 √ T . Proof. First of all, note that{g (k) i,t−1 } K k=1 is i.i.d. and independent of all system history up to t− 1, and thus independent of Q i (t− 1), i = 1, 2,··· ,m. We have E Q i (t− 1) D g (k) i,t−1 ,θ (k) ∗ E =E(Q i (t− 1))E K X k=1 D g (k) i,t−1 ,θ (k) ∗ E ! ≤ 0 (6.17) where the last inequality follows from the assumption that{θ (k) ∗ } K k=1 is in the constraint set (6.12). Substituting θ (k) ∗ into (6.15), taking expectation with respect to both sides and using (6.17) give E(Δ(t)) +VE K X k=1 D f (k) t−1 ,θ (k) t −θ (k) t−1 E ! +αE K X k=1 kθ (k) t −θ (k) t−1 k 2 2 ! ≤ 3 2 mK 2 Ψ 2 +VE K X k=1 D f (k) t−1 ,θ (k) ∗ −θ (k) t−1 E ! +αE K X k=1 kθ (k) ∗ −θ (k) t−1 k 2 2 ! −αE K X k=1 kθ (k) ∗ −θ (k) t k 2 2 ! , where the second inequality follows from (6.17). Note that for any k, completing the squares 187 gives V D f (k) t−1 ,θ (k) t −θ (k) t−1 E +αkθ (k) t −θ (k) t−1 k 2 2 ≥ r α 2 θ (k) t −θ (k) t−1 + V 2 p α/2 f (k) t−1 2 2 − V 2 Ψ 2 S (k) A (k) 2α . Substituting this inequality into the previous bound and rearranging the terms give VE K X k=1 D f (k) t−1 ,θ (k) t−1 E ! ≤VE K X k=1 D f (k) t−1 ,θ (k) ∗ E ! −E(Δ(t))+ V 2 P K k=1 Ψ 2 S (k) A (k) 2α + 3 2 mK 2 Ψ 2 +αE K X k=1 kθ (k) ∗ −θ (k) t−1 k 2 2 ! −αE K X k=1 kθ (k) ∗ −θ (k) t k 2 2 ! . Taking telescoping sums from 1 to T and dividing both sides by TV gives, 1 T T X t=1 E K X k=1 D f (k) t−1 ,θ (k) t−1 E ! ≤E K X k=1 D f (k) t−1 ,θ (k) ∗ E ! + L(0)−L(T + 1) VT + V P K k=1 Ψ 2 S (k) A (k) 2α + 3 2 mK 2 Ψ 2 V + αE P K k=1 kθ (k) ∗ −θ (k) T−1 k 2 2 −αE P K k=1 kθ (k) ∗ −θ (k) T k 2 2 VT ≤E K X k=1 D f (k) t−1 ,θ (k) ∗ E ! + V P K k=1 Ψ 2 S (k) A (k) 2α + 3 2 mK 2 Ψ 2 V + 2αK VT , where we use the fact that L(0) = 0 andkθ (k) ∗ −θ (k) T−1 k 2 2 ≤kθ (k) ∗ −θ (k) T−1 k 1 ≤ 2. A drift lemma and its implications From Lemma 6.4.1, we know that in order to get the constraint violation bound, we need to look at the size of the virtual queueQ i (T + 1), i = 1, 2,··· ,m. The following drift lemma serves as a cornerstone for our goal. Lemma 6.4.5 (Lemma 5 of [YNW17]). Let{Ω,F,P} be a probability space. Let{Z(t),t≥ 1} be a discrete time stochastic process adapted to a filtration{F t−1 ,t≥ 1} with Z(1) = 0 and F 0 ={∅, Ω}. Suppose there exist integert 0 > 0, real constantsλ∈R,δ max > 0 and 0<ζ≤δ max 188 such that |Z(t + 1)−Z(t)|≤δ max , (6.18) E[Z(t +t 0 )−Z(t)|F t−1 ]≤ t 0 δ max , if Z(t)<λ −t 0 ζ, if Z(t)≥λ . (6.19) hold for all t∈{1, 2,...}. 
Then, the following holds: E[Z(t)]≤λ +t 0 δ max +t 0 4δ 2 max ζ log 8δ 2 max ζ 2 ,∀t∈{1, 2,...}. Note that a special case of above drift lemma for t 0 = 1 dates back to the seminal paper of Hajek ([Haj82]) bounding the size of a random process with strongly negative drift. Since then, its power has been demonstrated in various scenarios ranging from steady state queue bound ([ES12]) to feasibility analysis of stochastic optimization ([WN16b]). The current generalization to a multi-step drift is first considered in [YNW17]. This lemma is useful in the current context due to the following lemma, whose proof can be found in Appendix 6.6.2. Lemma 6.4.6. LetF t , t≥ 1 be the system history functions up to timet, includingf (k) 0 ,··· ,f (k) t−1 , g (k) 0,i ,··· ,g (k) t−1,i , i = 1, 2,··· ,m, k = 1, 2,··· ,K, andF 0 is a null set. Let t 0 be an arbitrary positive integer, then, we have kQ(t + 1)k 2 −kQ(t)k 2 ≤ √ mKΨ, E[kQ(t +t 0 )k 2 −kQ(t)k 2 F(t− 1)]≤ t 0 √ mKΨ, ifkQ(t)k<λ −t 0 η 2 , ifkQ(t)k≥λ where λ = 8VKΨ+3mK 2 Ψ 2 +4Kα+t0(t0−1)mΨ+2mKΨηt0+η 2 t 2 0 ηt0 . Combining the previous two lemmas gives the virtual queue bound as E(kQ(t)k 2 )≤ 8VKΨ + 3mK 2 Ψ 2 + 4Kα +t 0 (t 0 − 1)mΨ + 2mKΨηt 0 +η 2 t 2 0 ηt 0 +t 0 √ mKΨ + 4t 0 mK 2 Ψ 2 η log 8mK 2 Ψ 2 η 2 . 189 We then choose t 0 = √ T , V = √ T and α =T , which implies that E(kQ(t)k 2 )≤C(m,K, Ψ,η) √ T, (6.20) whereC(m,K, Ψ,η) = 8KΨ η + 3mK 2 Ψ 2 η 2 + 4K+mΨ η + 2mKΨ +η + √ mKΨ + 4mK 2 Ψ 2 η log 8mK 2 Ψ 2 η 2 . The slow-update condition and constraint violation In this section, we prove the slow-update property of the proposed algorithm, which not only implies the theO( √ T ) constraint violation bound, but also plays a key role in Markov analysis. Lemma 6.4.7. The sequence of state-action vectors θ (k) t , t∈{1, 2,··· ,T} satisfies E kθ (k) t −θ (k) t−1 k 2 ≤ p m|A (k) ||S (k) |ΨE(kQ(t)k 2 ) 2α + p |A (k) ||S (k) |ΨV 2α . In particular,choosing V = √ T and α =T gives a slow-update condition E kθ (k) t −θ (k) t−1 k 2 ≤ p |A (k) ||S (k) |Ψ +C p m|A (k) ||S (k) |Ψ 2 √ T , (6.21) where C =C(m,K, Ψ,η) is defined in (6.20). Proof of Lemma 6.4.7. First, choosing θ =θ t−1 in (6.14) gives V D f (k) t−1 ,θ (k) t −θ (k) t−1 E + m X i=1 Q i (t) D g (k) i,t−1 ,θ (k) t E +αkθ (k) t −θ (k) t−1 k 2 2 ≤ m X i=1 Q i (t)hg (k) i,t−1 ,θ (k) t−1 i−αkθ (k) t−1 −θ (k) t k 2 2 . Rearranging the terms gives 2αkθ (k) t −θ (k) t−1 k 2 2 ≤−Vhf (k) t−1 ,θ (k) t −θ (k) t−1 i− m X i=1 Q i (t)hg (k) i,t−1 ,θ (k) t −θ (k) t−1 i ≤Vkf (k) t−1 k 2 ·kθ (k) t −θ (k) t−1 k 2 + m X i=1 Q i (t)kg (k) i,t−1 k 2 ·kθ (k) t −θ (k) t−1 k 2 ≤Vkf t−1 k 2 ·kθ (k) t −θ (k) t−1 k 2 +kQ(t)k 2 v u u t m X i=1 kg (k) i,t−1 k 2 2 kθ (k) t −θ (k) t−1 k 2 , 190 where the second and third inequality follow from Cauchy-Schwarz inequality. Thus, it follows θ (k) t −θ (k) t−1 2 ≤ Vkf (k) t−1 k 2 +kQ(t)k 2 · q P m i=1 kg (k) i,t−1 k 2 2 2α . Applying the fact thatkf (k) t−1 k 2 ≤ p |A (k) ||S (k) |Ψ,kg (k) i,t−1 k 2 ≤ p |A (k) ||S (k) |Ψ and taking expec- tation from both sides give the first bound in the lemma. The second bound follows directly from the first bound by further substituting (6.20). Theorem 6.4.2. The proposed algorithm has the following stationary state constraint violation bound: 1 T T−1 X t=0 E K X k=1 D g (k) i,t ,θ (k) t E ! ≤ 1 √ T C + K X k=1 q m|A (k) ||S (k) |ΨC + K X k=1 |A (k) ||S (k) |Ψ 2 ! , where C =C(m,K, Ψ,η) is defined in (6.20). Proof. Taking expectation from both sides of Lemma 6.4.1 gives T X t=1 E K X k=1 D g (k) i,t−1 ,θ (k) t−1 E ! 
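The role of Lemma 6.4.5 above is to convert the strictly negative conditional drift supplied by Slater's condition into the bound (6.20) on $\mathbb{E}\|Q(t)\|_2$. The toy sketch below (illustrative parameters; the case $t_0 = 1$, i.e. the classical Hajek setting, rather than the $t_0 = \sqrt{T}$ choice used in the analysis) simulates a nonnegative bounded-increment process satisfying (6.18)-(6.19) and compares its empirical mean with the bound of the lemma.

```python
import numpy as np

rng = np.random.default_rng(1)

# Drift-lemma parameters (t0 = 1): |Z(t+1) - Z(t)| <= delta_max always, and
# the conditional mean increment is <= -zeta whenever Z(t) >= lam.
lam, delta_max, zeta = 10.0, 1.0, 0.25
T, N = 5000, 2000                      # horizon and number of sample paths

Z = np.zeros(N)
for _ in range(T):
    below = Z < lam
    # Below lam: mean-zero bounded step. At or above lam: uniform step on
    # [-delta_max, delta_max - 2*zeta], whose mean is exactly -zeta.
    step = np.where(below,
                    rng.uniform(-delta_max, delta_max, size=N),
                    rng.uniform(-delta_max, delta_max - 2 * zeta, size=N))
    Z = np.maximum(Z + step, 0.0)      # queue-like processes are nonnegative

bound = lam + delta_max + (4 * delta_max**2 / zeta) \
        * np.log(8 * delta_max**2 / zeta**2)
print(f"empirical E[Z(T)] ~ {Z.mean():.2f}  vs  lemma bound {bound:.2f}")
```

The empirical mean settles near the threshold $\lambda$, comfortably below the lemma's (deliberately loose) bound, which is what the analysis exploits after substituting the queue-specific values of $\lambda$, $\delta_{\max}$, and $\zeta$ from Lemma 6.4.6.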
≤E(Q i (T + 1)) + Ψ T X t=1 K X k=1 q A (k) S (k) E θ (k) t −θ (k) t−1 2 . Substituting the bounds (6.20) and (6.21) in to the above inequality gives the desired result. 6.4.2 Markov analysis So far, we have shown that our algorithm achieves anO( √ T ) regret and constraint violation simultaneously regarding the stationary online linear program (6.11) with constraint set given by (6.12) in the imaginary system. In this section, we show how these stationary state results lead to a tight performance bound on the original true online MDP problem (6.1) and (6.2) comparing to any joint randomized stationary algorithm starting from its stationary state. Approximate mixing of MDPs LetF t , t≥ 1 be the set of system history functions up to time t, including f (k) 0 ,··· ,f (k) t−1 , g (k) 0,i ,··· ,g (k) t−1,i ,i = 1, 2,··· ,m, k = 1, 2,··· ,K, andF 0 is a null set. Letd π (k) t be the stationary state distribution at k-th MDP under the randomized stationary policy π (k) t in the proposed algorithm. Letv (k) t be the true state distribution at time slott under the proposed algorithm given 191 the function pathF T and starting state d (k) 0 , i.e. for any s∈S (k) , v (k) t (s) :=Pr s (k) t =s|F T and v (k) 0 =d (k) 0 . The following lemma provides a key estimate on the distance between stationary distribution and true distribution at each time slott. It builds upon the slow-update condition (Lemma 6.4.7) of the proposed algorithm and uniform mixing bound of general MDPs (Lemma 6.2.1). Lemma 6.4.8. Consider the proposed algorithm with V = √ T and α =T . For any initial state distribution{d (k) 0 } K k=1 and any t∈{0, 1, 2,··· ,T− 1}, we have E d π (k) t −v (k) t 1 ≤ τr A (k) S (k) Ψ +C √ m A (k) S (k) Ψ . 2 √ T + 2e − t τr +1 , where τ and r are mixing parameters defined in Lemma 6.2.1 and C is an absolute constant defined in (6.20). Proof of Lemma 6.4.8. By Lemma 6.4.7 we know that for any t∈{1, 2,··· ,T}, E θ (k) t −θ (k) t−1 2 ≤ q A (k) S (k) Ψ +C q m A (k) S (k) Ψ 2 √ T , Thus, E θ (k) t −θ (k) t−1 1 ≤ A (k) S (k) Ψ +C √ m A (k) S (k) Ψ 2 √ T , Since for anys∈S (k) , d π (k) t (s)−d π (k) t−1 (s) = P a∈A (k)θ (k) t (a,s)−θ (k) t−1 (a,s) ≤ P a∈A (k) θ (k) t (a,s)− θ (k) t−1 (a,s) , it then follows E d π (k) t −d π (k) t−1 1 ≤ E θ (k) t −θ (k) t−1 1 ≤ A (k) S (k) Ψ +C √ m A (k) S (k) Ψ 2 √ T . (6.22) 192 Now, we use the above relation to boundE d π (k) t −v (k) t 1 for any t≥r. E d π (k) t −v (k) t 1 ≤E d π (k) t −d π (k) t−1 1 +E d π (k) t−1 −v (k) t 1 ≤ A (k) S (k) Ψ +C √ m A (k) S (k) Ψ 2 √ T +E d π (k) t−1 −v (k) t 1 = A (k) S (k) Ψ +C √ m A (k) S (k) Ψ 2 √ T +E d π (k) t−1 −v (k) t−1 P (k) π (k) t−1 1 , (6.23) where the second inequality follows from the slow-update condition (6.22) and the final equality follows from the fact that given the function pathF T , the following holds d π (k) t−1 −v (k) t = d π (k) t−1 −v (k) t−1 P (k) π (k) t−1 . (6.24) To see this, note that from the proposed algorithm, the policy π (k) t is determined byF T . Thus, by definition of stationary distribution, givenF T , we know that d π (k) t−1 = d π (k) t−1 P (k) π (k) t−1 , and it is enough to show that givenF T , v (k) t =v (k) t−1 P (k) π (k) t−1 . First of all, the state distributionv (k) t is determined byv (k) t−1 ,π (k) t−1 and probability transition from s t−1 to s t , which are in turn determined byF T . 
Thus, givenF T , for any s∈S (k) , v (k) t (s) = X s 0 ∈S (k) Pr(s t =s|s t−1 =s 0 ,F T )v (k) t−1 (s 0 ), and Pr(s t =s|s t−1 =s 0 ,F T ) = X a∈A (k) Pr(s t =s|a t =a,s t−1 =s 0 ,F T )Pr(a t =a|s t−1 =s 0 ,F T ) = X a∈A (k) P a (s 0 ,s)Pr(a t =a|s t−1 =s 0 ,F T ) = X a∈A (k) P a (s 0 ,s)π (k) t−1 (a|s 0 ) =P π (k) t−1 (s 0 ,s), where the second inequality follows from the Assumption 6.2.2, the third equality follows from 193 the fact that π (k) t−1 is determined byF T , thus, for any t, π (k) t (a s 0 ) =Pr(a t =a|s t−1 =s 0 ,F T ), ∀a∈A (k) , s 0 ∈S (k) , and the last equality follows from the definition of transition probability (6.3). This gives v (k) t (s) = X s 0 ∈S (k) P π (k) t−1 (s 0 ,s)v (k) t−1 (s 0 ), and thus (6.24) holds. We can iteratively apply the procedure (6.23) r times as follows E d π (k) t −v (k) t 1 ≤ A (k) S (k) Ψ +C √ m A (k) S (k) Ψ 2 √ T +E d π (k) t−1 −d π (k) t−2 P (k) π (k) t−1 1 +E d π (k) t−2 −v (k) t−1 P (k) π (k) t−1 1 ≤2· A (k) S (k) Ψ +C √ m A (k) S (k) Ψ 2 √ T +E d π (k) t−2 −v (k) t−1 P (k) π (k) t−1 1 =2· A (k) S (k) Ψ +C √ m A (k) S (k) Ψ 2 √ T +E d π (k) t−2 −v (k) t−2 P (k) π (k) t−2 P (k) π (k) t−1 1 ≤···≤r· A (k) S (k) Ψ +C √ m A (k) S (k) Ψ 2 √ T +E d π (k) t−r −v (k) t−r P (k) π (k) t−r ··· P (k) π (k) t−1 1 , where the second inequality follows from the nonexpansive property in ` 1 norm of the stochastic matrix P (k) π (k) t−1 that d π (k) t−1 −d π (k) t−2 P (k) π (k) t−1 1 ≤ d π (k) t−1 −d π (k) t−2 1 , and then using the slow-update condition (6.22) again. By Lemma 6.2.1, we have E d π (k) t −v (k) t 1 ≤r· A (k) S (k) Ψ +C √ m A (k) S (k) Ψ 2 √ T +e −1/τ E d π (k) t−r −v (k) t−r 1 . 194 Iterating this inequality down to t = 0 gives E d π (k) t −v (k) t 1 ≤ bt/τc X j=0 e −j/τ ·r· A (k) S (k) Ψ +C √ m A (k) S (k) Ψ 2 √ T +E d π (k) 0 −v (k) 0 1 e −bt/rc/τ ≤ bt/τc X j=0 e −j/τ ·r· A (k) S (k) Ψ +C √ m A (k) S (k) Ψ 2 √ T + 2e −bt/rc/τ ≤ Z ∞ x=0 e −x/τ dx·r· A (k) S (k) Ψ +C √ m A (k) S (k) Ψ 2 √ T + 2e − t rτ +1 ≤τr· A (k) S (k) Ψ +C √ m A (k) S (k) Ψ 2 √ T + 2e − t rτ +1 finishing the proof. Benchmarking against policies starting from stationary state Combining the results derived so far, we have the following regret bound regarding any randomized stationary policy Π starting from its stationary state distribution d Π such that (d Π , Π) in the constraint setG defined in (6.2). Theorem 6.4.3. LetP be the sequence of randomized stationary policies resulting from the proposed algorithm with V = √ T and α = T . Let d 0 be the starting state of the proposed algorithm. For any randomized stationary policy Π starting from its stationary state distribution d Π such that (d Π , Π)∈G, we have F T (d 0 ,P)−F T (d Π , Π)≤O m 3/2 K 2 K X k=1 A (k) S (k) · √ T ! , G i,T (d 0 ,P)≤O m 3/2 K 2 K X k=1 A (k) S (k) · √ T ! , i = 1, 2,··· ,m. Proof of Theorem 6.4.3. First of all, by Lemma 6.2.2, for any randomized stationary policy Π, there exists some stationary state-action probability vectors{θ (k) ∗ } K k=1 such that θ (k) ∗ ∈ Θ (k) , F T (d Π , Π) = T−1 X t=0 K X k=1 D E(f t ),θ (k) ∗ E , andG i,T (d Π , Π) = P T−1 t=0 P K k=1 D E(g i,t ),θ (k) ∗ E . As a consequence, (d Π , Π)∈G impliesG i,T (d Π , Π) = 195 P T−1 t=0 P K k=1 D E(g i,t ),θ (k) ∗ E ≤ 0, ∀i∈{1, 2,··· ,m} and it follows{θ (k) ∗ } K k=1 is in the imaginary constraint setG defined in (6.12). Thus, we are in a good shape applying Theorem 6.4.1 from imaginary systems. 
We then split F T (d 0 ,P)−F T (d Π , Π) into two terms: F T (d 0 ,P)−F T (d 0 , Π)≤ E T−1 X t=0 K X k=1 f (k) t (a (k) t ,s (k) t ) d 0 ,P ! − T−1 X t=0 K X k=1 E D f (k) t ,θ (k) t E | {z } (I) + T−1 X t=0 K X k=1 E D f (k) t ,θ (k) t E − D E(f t ),θ (k) ∗ E | {z } (II) . By Theorem 6.4.1, we get (II)≤ 2K + Ψ 2 2 K X k=1 S (k) A (k) + 5 2 mK 2 Ψ 2 ! √ T. (6.25) We then bound (I). Consider each time slot t∈{0, 1,··· ,T− 1}. We have E D f (k) t ,θ (k) t E = X s∈S (k) X a∈A (k) E d π (k) t (s)π (k) t (a|s)f (k) t (a,s) E f (k) t (a (k) t ,s (k) t ) d 0 ,P = X s∈S (k) X a∈A (k) E v (k) t (s)π (k) t (a|s)f (k) t (a,s) , where the first equality follows from the definition of θ (k) t and the second equality follows from the following: Given a specific function pathF T , the policy π (k) t and the true state distribution v (k) t are fixed. Thus, we have, E f (k) t (a (k) t ,s (k) t ) d 0 ,P,F T = X s∈S (k) X a∈A (k) v (k) t (s)π (k) t (a|s)f (k) t (a,s). 196 Taking the full expectation regarding the function path gives the result. Thus, E f (k) t (a (k) t ,s (k) t ) d 0 ,P −E D f (k) t ,θ (k) t E ≤ X s∈S (k) X a∈A (k) E v (k) t (s)−d π (k) t (s) π (k) t (a|s) Ψ ≤E v (k) t −d π (k) t 1 Ψ ≤ τr (1 +C √ m) A (k) S (k) Ψ 2 2 √ T + 2e − t τr +1 Ψ where the last inequality follows from Lemma 6.4.8. Thus, it follows, (I)≤ T−1 X t=0 K X k=1 τr (1 +C √ m) A (k) S (k) Ψ 2 2 √ T + 2e − t τr +1 Ψ ! ≤ K X k=1 τr 1 +C √ m A (k) S (k) Ψ 2 √ T + 2ΨK Z T−1 t=0 e − x τr +1 dx ≤τrΨ 2 1 +C √ m K X k=1 A (k) S (k) · √ T + 2eΨKτr. (6.26) Overall, combining (6.25),(6.26) and substituting the constant C =C(m,K, Ψ,η) defined in (6.20) gives the objective regret bound. For the constraint violation, we have G i,T (d 0 ,P) =E T−1 X t=0 K X k=1 g (k) i,t (a t ,s t ) d 0 ,P ! − T X t=1 K X k=1 D E g (k) i,t ,θ t E | {z } (IV) + T X t=1 K X k=1 D E g (k) i,t ,θ t E | {z } (V) . The term (V) can be readily bounded using Theorem 6.4.2 as T−1 X t=0 E K X k=1 D g (k) i,t ,θ (k) t E ! ≤ C + K X k=1 q m|A (k) ||S (k) |ΨC + K X k=1 |A (k) ||S (k) |Ψ 2 ! √ T. For the term (IV), we have E D g (k) i,t ,θ (k) t E = X s∈S (k) X a∈A (k) E d π (k) t (s)π (k) t (a|s)g (k) i,t (a,s) 197 E g (k) i,t (a (k) t ,s (k) t ) d 0 ,P = X s∈S (k) X a∈A (k) E v (k) t (s)π (k) t (a|s)g (k) i,t (a,s) , where the first equality follows from the definition of θ (k) t and the second equality follows from the following: Given a specific function pathF T , the policy π (k) t and the true state distribution v (k) t are fixed. Thus, we have, E g (k) t (a (k) t ,s (k) t ) d 0 ,P,F T = X s∈S (k) X a∈A (k) v (k) t (s)π (k) t (a|s)g (k) t (a,s). Taking the full expectation regarding the function path gives the result. Then, repeat the same proof as that of (6.26) gives (IV)≤τrΨ 2 1 +C √ m K X k=1 A (k) S (k) · √ T + 2eΨKτr. This finishes the proof of constraint violation. 6.5 A more general regret bound against policies with ar- bitrary starting state Recall that Theorem 6.4.3 compares the proposed algorithm with any randomized stationary policy Π starting from its stationary state distribution d Π , so that (d Π , Π)∈G. In this section, we generalize Theorem 6.4.3 and obtain a bound of the regret against all (d 0 , Π)∈G whered 0 is an arbitrary starting state distribution (not necessarily the stationary state distribution). 
The main technical difficulty doing such a generalization is as follows: For any randomized stationary policy Π such that (d 0 , Π)∈G, let{θ (k) ∗ } K k=1 be the stationary state-action probabilities such thatθ (k) ∗ ∈ Θ (k) andG i,T (d Π , Π) = P T−1 t=0 P K k=1 D E(g i,t ),θ (k) ∗ E . For some finite horizon T , there might exist some “low-cost” starting state distribution d 0 such that G i,T (d 0 , Π) < G i,T (d Π , Π) for some i∈{1, 2,··· ,m}. As a consequence, one coud have G i,T (d 0 , Π)≤ 0, and T−1 X t=0 K X k=1 D E(g i,t ),θ (k) ∗ E > 0. 198 This implies although (d 0 , Π) is feasible for our true system, its stationary state-action probabil- ities{θ (k) ∗ } K k=1 can be infeasible with respect to the imaginary constraint set (6.12), and all our analysis so far fails to cover such randomized stationary policies. To resolve this issue, we have to “enlarge” the imaginary constraint set (6.12) so as to cover all state-action probabilities{θ (k) ∗ } K k=1 arising from any randomized stationary policy Π such that (d 0 , Π)∈G. But a perturbation of constraint set would result in a perturbation of objective in the imaginary system also. Our main goal in this section is to bound such a perturbation and show that the perturbation bound leads to the finalO( √ T ) regret bound. A relaxed constraint set We begin with a supporting lemma on the uniform mixing time bound over all joint random- ized stationary policies. The proof is given in Appendix 6.6.3. Lemma 6.5.1. Consider any randomized stationary policy Π in (6.2) with arbitrary starting state distribution d 0 ∈S (1) ×···×S (K) . Let P Π be the corresponding transition matrix on the product state space. Then, the following holds (d 0 −d Π ) (P Π ) t 1 ≤ 2e (r1−t)/r1 ,∀t∈{0, 1, 2,···}, (6.27) where r 1 is fixed positive constant independent of Π. The following lemma shows a relaxation ofO(1/T ) on the imaginary constraint set (6.12) is enough to cover all the{θ (k) ∗ } K k=1 discussed at the beginning of this section. The proof is given in Appendix 6.6.3. Lemma 6.5.2. For any T∈{1, 2,···} and any randomized stationary policies Π in (6.2), with arbitrary starting state distribution d 0 ∈S (1) ×···×S (K) and stationary state-action probability {θ (k) ∗ } K k=1 , T−1 X t=0 E K X k=1 f (k) t (a (k) t ,s (k) t ) d 0 , Π ! − K X k=1 D E f (k) t ,θ (k) ∗ E ≤C 1 KΨ (6.28) T−1 X t=0 E K X k=1 g (k) i,t (a (k) t ,s (k) t ) d 0 , Π ! − K X k=1 D E g (k) i,t ,θ (k) ∗ E ≤C 1 KΨ (6.29) where C 1 is an absolute constant. In particular,{θ (k) ∗ } K k=1 is contained in the following relaxed 199 constraint set G + := ( θ (k) ∈ Θ (k) , k = 1, 2,··· ,K : K X k=1 D E g (k) i,t ,θ (k) E ≤ C 1 KΨ T ,i = 1, 2,··· ,m . Best stationary performance over the relaxed constraint set Recall that the best stationary performance in hindsight over all randomized stationary poli- cies in the constraint setG can be obtained as the minimum achieved by the following linear program. min 1 T T−1 X t=0 K X k=1 D E f (k) t ,θ (k) E (6.30) s.t. K X k=1 D E g (k) i,t ,θ (k) E ≤ 0, i = 1, 2,··· ,m. (6.31) On the other hand, if we consider all the randomized stationary policies contained in the original constraint set (6.2), then, By Lemma 6.5.2, the relaxed constraint setG contains all such policies and the best stationary performance over this relaxed set comes from the minimum achieved by the following perturbed linear program: min 1 T T−1 X t=0 K X k=1 D E f (k) t ,θ (k) E (6.32) s.t. K X k=1 D E g (k) i,t ,θ (k) E ≤ C 1 KΨ T , i = 1, 2,··· ,m. 
(6.33) We aim to show that the minimum achieved by (6.32)-(6.33) is not far away from that of (6.30)-(6.31). In general, such a conclusion is not true due to the unboundedness of Lagrange multipliers in constrained optimization. However, since Slater’s condition holds in our case, the perturbation can be bounded via the following well-known Farkas’ lemma ([Ber09b]): Lemma 6.5.3 (Farkas’ Lemma). Consider a convex program with objective f(x) and constraint 200 function g i (x), i = 1, 2,··· ,m: min f(x), (6.34) s.t. g i (x)≤b i , i = 1, 2,··· ,m, (6.35) x∈X, (6.36) for some convex setX⊆R n . Letx ∗ be one of the solutions to the above convex program. Suppose there exists e x∈X such that g i (e x) < 0, ∀i∈{1, 2,··· ,m}. Then, there exists a separation hyperplane parametrized by (1,μ 1 ,μ 2 ,··· ,μ m ) such that μ i ≥ 0 and f(x) + m X i=1 μ i g i (x)≥f(x ∗ ) + m X i=1 μ i b i , ∀x∈X. The parameter μ = (μ 1 ,μ 2 ,··· ,μ m ) is usually referred to as a Lagrange multiplier. From the geometric perspective, Farkas’ Lemma states that if Slater’s condition holds, then, there exists a non-vertical separation hyperplane supported at f(x ∗ ),b 1 ,··· ,b m and contains the set n f(x),g 1 (x),··· ,g m (x) , x∈X o on one side. Thus, in order to bound the perturbation of objective with respect to the perturbation of constraint level, we need to bound the slope of the supporting hyperplane from above, which boils down to controlling the magnitude of the Lagrange multiplier. This is summarized in the following lemma: Lemma 6.5.4 (Lemma 1 of [NO09]). Consider the convex program (6.34)-(6.36), and define the Lagrange dual function q(μ) = inf x∈X ( f(x) + m X i=1 μ i (g i (x)−b i ) ) . Suppose there exists e x∈X such that g i (e x)−b i ≤−η, ∀i∈{1, 2,··· ,m} for some positive constant η > 0. Then, the level setV ¯ μ ={μ 1 ,μ 2 ,··· ,μ m ≥ 0, q(μ)≥q(¯ μ)} is bounded for any nonnegative ¯ μ. Furthermore, we have max μ∈V ¯ μ kμk 2 ≤ 1 min 1≤i≤m {−g i (e x) +b i } (f(e x)−q(¯ μ)). The technical importance of these two lemmas in the current context is contained in the following corollary. Corollary 6.5.1. Let n θ (k) ∗ o K k=1 and n θ (k) ∗ o K k=1 be solutions to (6.30)-(6.31) and (6.32)-(6.33), 201 respectively. Then, the following holds 1 T T−1 X t=0 K X k=1 D E f (k) t ,θ (k) ∗ E ≥ 1 T T−1 X t=0 K X k=1 D E f (k) ,θ (k) ∗ E − C 1 K 2 √ mΨ 2 ηT where η is the constant defined in Assumption 6.2.3. Proof of Corollary 6.5.1. Take f θ (1) ,··· ,θ (K) = 1 T T−1 X t=0 K X k=1 D E f (k) ,θ (k) E , g i θ (1) ,··· ,θ (K) = K X k=1 D E g (k) i,t ,θ (k) E , X = Θ (1) × Θ (2) ×···× Θ (K) , and b i = 0 in Farkas’ Lemma and we have the following display 1 T T−1 X t=0 K X k=1 D E f (k) ,θ (k) E + m X i=1 μ i K X k=1 D E g (k) i,t ,θ (k) E ≥ 1 T T−1 X t=0 K X k=1 D E f (k) ,θ (k) ∗ E , for any θ (1) ,··· ,θ (K) ∈X and someμ 1 ,μ 2 ,··· ,μ m ≥ 0. In particular, substituting θ (1) ∗ ,··· ,θ (K) ∗ into the above display gives 1 T T−1 X t=0 K X k=1 D E f (k) ,θ (k) ∗ E ≥ 1 T T−1 X t=0 K X k=1 D E f (k) ,θ (k) ∗ E − m X i=1 μ i K X k=1 D E g (k) i,t ,θ (k) ∗ E ≥ 1 T T−1 X t=0 K X k=1 D E f (k) ,θ (k) ∗ E − C 1 KΨ T m X i=1 μ i , (6.37) where the final inequality follows from the fact that θ (1) ∗ ,··· ,θ (K) ∗ satisfies the relaxed con- straint P K k=1 D E g (k) i,t ,θ (k) ∗ E ≤ C1KΨ T and μ i ≥ 0, ∀i∈{1, 2,··· ,m}. Now we need to bound the magnitude of Lagrange multiplier (μ 1 ,··· ,μ m ). 
Note that in our scenario, f θ (1) ,··· ,θ (K) = 1 T T−1 X t=0 K X k=1 D E f (k) ,θ (k) E ≤ ΨK, 202 and the Lagrange multiplier μ is the solution to the maximization problem max μi≥0,i∈{1,2,···,m} q(μ), where q(μ) is the dual function defined in Lemma 6.5.4. thus, it must be in any super level set V ¯ μ ={μ 1 ,μ 2 ,··· ,μ m ≥ 0, q(μ)≥q(¯ μ)}. In particular, taking ¯ μ = 0 in Lemma 6.5.4 and using Slater’s condition (6.8), we have there exists e θ (1) ,··· , e θ (K) such that m X i=1 μ i ≤ √ mkμk 2 ≤ √ m η f e θ (1) ,··· , e θ (K) − inf (θ (1) ,···,θ (K) )∈X f θ (1) ,··· ,θ (K) ! ≤ 2 √ mΨK η , where the final inequality follows from the deterministic bound of|f(θ (1) ,··· ,θ (K) )| by ΨK. Substituting this bound into (6.37) gives the desired result. As a simple consequence of the above corollary, we have our final bound on the regret and constraint violation regarding any (d 0 , Π)∈G. Theorem 6.5.1. LetP be the sequence of randomized stationary policies resulting from the proposed algorithm with V = √ T and α = T . Let d 0 be the starting state of the proposed algorithm. For any randomized stationary policy Π starting from the state d 0 such that (d 0 , Π)∈ G, we have F T (d 0 ,P)−F T (d 0 , Π)≤O m 3/2 K 2 K X k=1 A (k) S (k) · √ T ! , G i,T (d 0 ,P)≤O m 3/2 K 2 K X k=1 A (k) S (k) · √ T ! , i = 1, 2,··· ,m. Proof. Let Π ∗ be the randomized stationary policy corresponding to the solution {θ (k) ∗ } K k=1 to (6.30)-(6.31) and let Π be any randomized stationary policy such that (d 0 , Π)∈G. Since G i,T (d Π∗ , Π ∗ ) = P T−1 t=0 P K k=1 D E(g i,t ),θ (k) ∗ E ≤ 0, it follows (d Π∗ , Π ∗ )∈G. By Theorem 6.4.3, we know that F T (d 0 ,P)−F T (d Π∗ , Π ∗ )≤O m 3/2 K 2 K X k=1 A (k) S (k) · √ T ! , andG i,T (d 0 ,P) satisfies the bound in the statement. It is then enough to bound F T (d Π∗ , Π ∗ )− 203 F T (d 0 , Π). We split it in to two terms: F T (d Π∗ , Π ∗ )−F T (d 0 , Π)≤F T (d Π∗ , Π ∗ )−F T (d Π , Π) | {z } (I) +F T (d Π , Π)−F T (d 0 , Π) | {z } (II) . By (6.28) in Lemma 6.5.2, the term (II) is bounded by C 1 KΨ. It remains to bound the first term. Since (d 0 , Π)∈G, by Lemma 6.5.2, the corresponding state-action probabilities{θ (k) } K k=1 of Π satisfies P K k=1 E(g i,t ),θ (k) ≤ C 1 KΨ/T and{θ (k) } K k=1 is feasible for (6.32)-(6.33). Since {θ (k) ∗ } K k=1 is the solution to (6.32)-(6.33), we must have F T (d Π , Π) = T−1 X t=0 K X k=1 D E f (k) t ,θ (k) E ≥ T−1 X t=0 K X k=1 D E f (k) t ,θ (k) ∗ E On the other hand, by Corollary 6.5.1, T−1 X t=0 K X k=1 D E f (k) t ,θ (k) ∗ E ≥ T−1 X t=0 K X k=1 D E f (k) ,θ (k) ∗ E − C 1 K 2 √ mΨ 2 η =F T (d Π∗ , Π ∗ )− C 1 K 2 √ mΨ 2 η . Combining the above two displays gives (I)≤ C1K 2 √ mΨ 2 η and the proof is finished. 6.6 Additional lemmas and proofs 6.6.1 Missing proofs in Section 6.2.4 We prove Lemma 6.2.1 and 6.2.2 in this section. Proof of Lemma 6.2.1. For simplicity of notations, we drop the dependencies on k through- out this proof. We first show that for any r ≥ b r, where b r is specified in Assumption 6.2.1, P π1 P π2 ··· P πr is a strictly positive stochastic matrix. Since the MDP is finite state with a finite action set, the set of all pure policies (Definition 6.2.2) is finite. Let P 1 , P 2 ,··· , P N be probability transition matrices corresponding to these pure policies. Consider any sequence of randomized stationary policies π 1 ,··· ,π r . Then, it follows their transition matrices can be expressed as convex combinations of pure policies, i.e. 
P π1 = N X i=1 α (1) i P i , P π2 = N X i=1 α (2) i P i , ··· , P πr = N X i=1 α (r) i P i , 204 where P N i=1 α (j) i = 1, ∀j∈{1, 2,··· ,r} and α (j) i ≥ 0. Thus, we have the following display P π1 P π2 ··· P πr = N X i=1 α (1) i P i ! N X i=1 α (2) i P i ! ··· N X i=1 α (r) i P i ! = X (i1,···,ir )∈Gr α (1) i1 ···α (r) ir · P i1 P i2 ··· P ir , (6.38) whereG r ranges over all N r configurations. Since P N i=1 α (1) i ··· P N i=1 α (r) i = 1, it follows (6.38) is a convex combination of all possible sequences P i1 P i2 ··· P ir . By assumption 6.2.1, we have P i1 P i2 ··· P ir is strictly positive for any (i 1 ,··· ,i r )∈G r , and there exists a universal lower bound δ > 0 of all entries of P i1 P i2 ··· P ir ranging over all configurations in (i 1 ,··· ,i r )∈G r . This implies P π1 P π2 ··· P πr is also strictly positive with the same lower bound δ > 0 for any sequences of randomized stationary policies π 1 ,··· ,π r . Now, we proceed to prove the mixing bound. Choose r = b r and we can decompose any P π1 P π2 ··· P πr as follows: P π1 ··· P πr =δΠ + (1−δ)Q, where Π has each entry equal to 1/|S| (recall that|S| is the number of states which equals the size of the matrix) and Q depends on π 1 ,··· ,π r . Then, Q is also a stochastic matrix (nonnegative and row sum up to 1) because both P π1 ··· P πr and Π are stochastic matrices. Thus, for any two distribution vectors d 1 and d 2 , we have (d 1 −d 2 ) P π1 ··· P πr =δ (d 1 −d 2 ) Π + (1−δ) (d 1 −d 2 ) Q = (1−δ) (d 1 −d 2 ) Q, where we use the fact that for distribution vectors (d 1 −d 2 ) Π = 1 |S| 1− 1 |S| 1 = 0. Since Q is a stochastic matrix, it is non-expansive on` 1 -norm, namely, for any vectorx,kxQk 1 ≤ kxk 1 . To see this, simply compute kxQk 1 = |S| X j=1 |S| X i=1 x i Q ij ≤ |S| X j=1 |S| X i=1 |x i Q ij | = |S| X j=1 |S| X i=1 |x i |Q ij = |S| X i=1 |x i | =kxk 1 . (6.39) 205 Overall, we obtain, k(d 1 −d 2 ) P π1 ··· P πr k 1 = (1−δ)k(d 1 −d 2 ) Qk 1 ≤ (1−δ)kd 1 −d 2 k 1 . We can then take τ =− 1 log(1−δ) to finish the proof. Proof of Lemma 6.2.2. Since the probability transition matrix of any randomized stationary pol- icy is a convex combination of those of pure policies, it is enough to show that the product MDP is irreducible and aperiodic under any joint pure policy. For simplicity, let s t = s (1) ,··· ,s (K) and a t = a (1) ,··· ,a (K) . Consider any joint pure policy Π which select a fixed joint action a∈A (1) ×···×A (K) given a joint state s∈S (1) ×···×S (K) , with probability 1. By Assumption 6.2.2, we have Pr s (1) t+1 ,··· ,s (K) t+1 s (1) t ,··· ,s (K) t ,a (1) t ,··· ,a (K) t =Pr s (1) t+1 s (1) t ,··· ,s (K) t ,a (1) t ,··· ,a (K) t ,s (2) t+1 ,··· ,s (K) t+1 ·Pr s (2) t+1 ,··· ,s (K) t+1 s (1) t ,··· ,s (K) t ,a (1) t ,··· ,a (K) t =Pr s (1) t+1 s (1) t ,a (1) t Pr s (2) t+1 ,··· ,s (K) t+1 s (1) t ,··· ,s (K) t ,a (1) t ,··· ,a (K) t =··· = K−1 Y k=1 Pr s (k) t+1 s (k) t ,a (k) t ·Pr s (K) t+1 s (1) t ,··· ,s (K) t ,a (1) t ,··· ,a (K) t = K Y k=1 Pr s (k) t+1 s (k) t ,a (k) t , (6.40) where the second equality follows from the independence relation in Assumption 6.2.2. 
Thus, we obtain the equality, Pr(s t+1 = s 0 s t = s, a t = a) = K Y k=1 Pr s (k) t+1 = ˜ s (k) s (k) t =s (k) ,a (k) t =a (k) , Then, the one step transition probability between any two states s, ˜ s∈S (1) ×···×S (K) can be 206 computed as Pr(s t+1 = ˜ s s t = s) = X a Pr(s t+1 = ˜ s s t = s, a t = a)·Pr(a t = a s t = s) = X a K Y k=1 Pr s (k) t+1 = ˜ s (k) s (k) t =s (k) ,a (k) t =a (k) ·Pr(a t = a s t = s) = K Y k=1 P a (k) (s) s (k) , ˜ s (k) , where we can remove the summation on a due to the fact that a t is a pure policy. The notation a (k) (s) denotes a fixed mapping from product state spaceS (1) ×···×S (K) to an individual action spaceA (k) resulting from the pure policy, and P a (k) (s) s (k) , ˜ s (k) is the Markov transition probability from state s (k) to ˜ s (k) under the action a (k) (s). One can then further compute the r (r≥ 2) step transition probability from between any two states s, ˜ s∈S (1) ×···×S (K) as Pr(s t+r = ˜ s s t = s) = X st+r−1 ··· X st+1 K Y k=1 P a (k) (s) s (k) ,s (k) t+1 · K Y k=1 P a (k) (st+1) s (k) t+1 ,s (k) t+2 ··· K Y k=1 P a (k) (st+r−1) s (k) t+r−1 , ˜ s (k) = X st+r−1 ··· X st+1 K Y k=1 P a (k) (s) s (k) ,s (k) t+1 ·P a (k) (st+1) s (k) t+1 ,s (k) t+2 ···P a (k) (st+r−1) s (k) t+r−1 , ˜ s (k) . (6.41) For any k∈{1, 2,··· ,K}, the term P a (k) (s) s (k) ,s (k) t+1 ·P a (k) (st+1) s (k) t+1 ,s (k) t+2 ···P a (k) (st+r−1) s (k) t+r−1 , ˜ s (k) denotes the probability of moving froms (k) to ˜ s (k) along a certain path under a certain sequence of fixed decisions a (k) (s), a (k) (s t+1 ),··· , a (k) (s t+r−1 ). Let s (k) = s (k) t+1 ,s (k) t+2 ,··· ,s (k) t+r−1 ∈S (k) ×···×S (k) , k∈{1, 2,··· ,K} be the state path of k-th MDP. One can then change the order of summation in (6.41) and sum over state paths of each MDP as follows: 207 (6.41) = X s (K) ··· X s (1) K Y k=1 P a (k) (s) s (k) ,s (k) t+1 ·P a (k) (st+1) s (k) t+1 ,s (k) t+2 ···P a (k) (st+r−1) s (k) t+r−1 , ˜ s (k) We would like to exchange the order of the product and the sums so that we can take the path sum over each individual MDP respectively. However, the problem is that the transition probabilities are coupled through the actions. The idea to proceed is to first apply a “hard” decoupling by taking the infimum of transition probabilities of each MDP over all pure policies, and use Assumption 6.2.1, to bound the transition probability from below uniformly. We have (6.41)≥ inf s (1) X s (K) ··· X s (2) K Y k=2 P a (k) (s) s (k) ,s (k) t+1 ···P a (k) (st+r−1) s (k) t+r−1 , ˜ s (k) · inf s (j) , j6=1 X s (1) P a (1) (s) s (1) ,s (1) t+1 ···P a (1) (st+r−1) s (1) t+r−1 , ˜ s (1) ≥ inf s (1) X s (K) ··· X s (2) K Y k=2 P a (k) (s) s (k) ,s (k) t+1 ···P a (k) (st+r−1) s (k) t+r−1 , ˜ s (k) · inf π (1) 1 ,···,π (1) r X s (1) P π (1) 1 s (1) ,s (1) t+1 ···P π (1) r s (1) t+r−1 , ˜ s (1) , where π (1) 1 ,··· ,π (1) r range over all pure policies, and the second inequality follows from the fact that fix any path of other MDPs (i.e. s (j) , j6= 1), the term X s (1) P a (1) (s) s (1) ,s (1) t+1 ···P a (1) (st+r−1) s (k) t+r−1 , ˜ s (1) is the probability of reaching ˜ s (1) froms (1) inr steps using a sequence of actionsa (1) (s (1) ),··· ,a (1) (s (1) t+r−1 ), where each action is a deterministic function of the previous state at the 1-st MDP only. Thus, it dominates the infimum over all sequences of pure policies π (1) 1 ,··· ,π (1) r on this MDP. 
Similarly, we can decouple the rest of the sums and obtain the follow display: (6.41)≥ K Y k=1 inf π (k) 1 ,···,π (k) r X s (k) P π (k) 1 s (k) ,s (k) t+1 ···P π (k) r s (k) t+r−1 , ˜ s (k) = K Y k=1 inf π (k) 1 ,···,π (k) r P π (k) 1 ,···,π (k) r s (k) , ˜ s (k) , 208 whereP π (k) 1 ,···,π (k) r s (k) , ˜ s (k) denotes the s (k) , ˜ s (k) -th entry of the product matrix P (k) π (k) 1 ··· P (k) π (k) r . Now, by Assumption 6.2.1, there exists a large enough integer b r such that P (k) π (k) 1 ··· P (k) π (k) r is a strictly positive matrix for any sequence ofr≥b r randomized stationary policy. As a consequence, the above probability is strictly positive and (6.41) is also strictly positive. This implies, if we choose ˜ s = s, then, starting from any arbitrary product state s∈S (1) × ···×S (K) , there is a positive probability of returning to this state afterr steps for allr≥b r, which gives the aperiodicity. Similarly, there is a positive probability of reaching any other composite state after r steps for all r≥ b r, which gives the irreducibility. This implies the product state MDP is irreducible and aperiodic under any joint pure policy, and thus, any joint randomized stationary policy. For the second part of the claim, we consider any randomized stationary policy Π and the corresponding joint transition probability matrix P Π , there exists a stationary state-action prob- ability vector Φ(a, s), a∈A (1) ×···×A (K) , s∈S (1) ×···×S (K) , such that X a Φ(a, ˜ s) = X s X a Φ(a, s)P a (s, ˜ s), ∀˜ s∈S (1) ×···×S (K) . (6.42) Then, the state-action probability of the k-th MDP is θ (k) (a (k) , ˜ s (k) ) = P ˜ s (j) ,a (j) , j6=k Φ(a, ˜ s). Thus, X a (k) θ (k) (a (k) , ˜ s (k) ) = X ˜ s (j) , j6=k X a Φ(a, ˜ s) = X s X a Φ(a, s) X ˜ s (j) , j6=k P a (s, ˜ s) = X s X a Φ(a, s)·Pr ˜ s (k) |a, s = X s X a Φ(a, s)·Pr ˜ s (k) |a (k) ,s (k) = X a (k) X s (k) θ (k) (a (k) , ˜ s (k) )·Pr ˜ s (k) |a (k) ,s (k) = X a (k) X s (k) θ (k) (a (k) , ˜ s (k) )·P a (k) s (k) , ˜ s (k) where the third from the last inequality follows from Assumption 6.2.2. This finishes the proof. 6.6.2 Missing proofs in Section 6.4.1 Proof of Lemma 6.4.6. Consider the state-action probabilities{ ˜ θ (k) } K k=1 which achieves the Slater’s condition in (6.8). First of all, note that Q i (t)∈F t−1 ,∀t≥ 1. Then, using the assumption that 209 {g (k) i,t−1 } K k=1 is i.i.d. and independent of all system information up to t− 1, we have E Q i (t− 1) K X k=1 D g (k) i,t−1 , ˜ θ E F t−1 ! =E K X k=1 D g (k) i,t−1 , ˜ θ E ! Q i (t− 1)≤−ηQ i (t− 1). (6.43) Now, by the drift-plus-penalty bound (6.15), with θ (k) = ˜ θ (k) , Δ(t)≤−V K X k=1 D f (k) t−1 ,θ (k) t −θ (k) t−1 E −α K X k=1 kθ (k) t −θ (k) t−1 k 2 2 + 3 2 mK 2 Ψ 2 +V K X k=1 D f (k) t−1 , ˜ θ (k) −θ (k) t−1 E + m X i=1 Q i (t− 1) K X k=1 D g (k) i,t−1 , ˜ θ (k) E +α K X k=1 k ˜ θ (k) −θ (k) t−1 k 2 2 −α K X k=1 k ˜ θ (k) −θ (k) t k 2 2 ≤4VKΨ + 3 2 mK 2 Ψ 2 + m X i=1 Q i (t− 1) K X k=1 D g (k) i,t−1 , ˜ θ (k) E +α K X k=1 k ˜ θ (k) −θ (k) t−1 k 2 2 −α K X k=1 k ˜ θ (k) −θ (k) t k 2 2 where the second inequality follows from Holder’s inequality that D f (k) t−1 ,θ (k) t −θ (k) t−1 E ≤kf (k) t−1 k ∞ θ (k) t −θ (k) t−1 1 ≤ 2Ψ. Summing up the drift from t to t +t 0 − 1 and taking a conditional expectationE(·|F t−1 ) give E kQ(t +t 0 )k 2 2 −kQ(t)k 2 2 F t−1 ≤8VKΨ + 3mK 2 Ψ 2 + 2 m X i=1 E t+t0−1 X τ=t Q i (τ− 1) K X k=1 D g (k) i,τ−1 , ˜ θ (k) E F t−1 ! + 2αE K X k=1 k ˜ θ (k) −θ (k) t−1 k 2 2 −k ˜ θ (k) −θ (k) t+t0 k 2 2 F t−1 ! 
≤8VKΨ + 3mK 2 Ψ 2 + 4Kα + 2 m X i=1 E t+t0−1 X τ=t Q i (τ− 1) K X k=1 D g (k) i,τ−1 , ˜ θ (k) E F t−1 ! . 210 Using the tower property of conditional expectations (further taking conditional expectations E · F t+t0−1 ··· F t inside the conditional expectation) and the bound (6.43), we have E t+t0−1 X τ=t Q i (τ− 1) K X k=1 D g (k) i,τ−1 , ˜ θ (k) E F t−1 ! ≤−ηE t+t0−1 X τ=t Q i (τ− 1) F t−1 ! ≤−ηt 0 Q i (t− 1) + t 0 (t 0 − 1) 2 Ψ≤−ηt 0 Q i (t) + t 0 (t 0 − 1) 2 Ψ +ηt 0 KΨ, where the last inequality follows from the queue updating rule (6.9) that |Q i (t− 1)−Q i (t)|≤ K X k=1 D g (k) i,t−2 ,θ (k) t−1 E ≤Kkg (k) i,t−2 k ∞ kθ (k) t−1 k 1 ≤KΨ. Thus, we have E kQ(t +t 0 )k 2 2 −kQ(t)k 2 2 F t−1 ≤ 8VKΨ + 3mK 2 Ψ 2 + 4Kα +t 0 (t 0 − 1)mΨ + 2mKΨηt 0 − 2ηt 0 m X i=1 Q i (t) ≤ 8VKΨ + 3mK 2 Ψ 2 + 4Kα +t 0 (t 0 − 1)mΨ + 2mKΨηt 0 − 2ηt 0 kQ i (t)k 2 . SupposekQ i (t)k 2 ≥ 8VKΨ+3mK 2 Ψ 2 +4Kα+t0(t0−1)mΨ+2mKΨηt0+η 2 t 2 0 ηt0 , then, it follows, E kQ(t +t 0 )k 2 2 −kQ(t)k 2 2 F t−1 ≤−ηt 0 kQ i (t)k 2 , which implies E kQ(t +t 0 )k 2 2 F t−1 ≤ kQ i (t)k 2 − ηt 0 2 2 SincekQ i (t)k 2 ≥ ηt0 2 , taking square root from both sides using Jensen’ inequality gives E kQ(t +t 0 )k 2 F t−1 ≤kQ i (t)k 2 − ηt 0 2 . 211 On the other hand, we always have kQ(t + 1)k 2 −kQ(t)k 2 = v u u t m X i=1 max ( Q i (t) + K X k=1 D g (k) i,t−1 ,θ (k) t E , 0 ) 2 − v u u t m X i=1 Q i (t) 2 ≤ m X i=1 K X k=1 D g (k) i,t−1 ,θ (k) t E ! 2 1/2 ≤ √ mKΨ. Overall, we finish the proof. 6.6.3 Missing proofs in Section 6.5 Proof of Lemma 6.5.1. Consider any joint randomized stationary policy Π and a starting state probabilityd 0 on the product state spaceS (1) ×S (2) ×···×S (K) . Let P Π be the corresponding transition matrix on the product state space. Let d t be the state distribution at time t under Π and d Π be the stationary state distribution. By Lemma 6.2.2, we know that this product state MDP is irreducible and aperiodic (ergodic) under any randomized stationary policy. In particular, it is ergodic under any pure policy. Since there are only finitely many pure policies, let P Π1 ,··· , P Π N be probability transition matrices corresponding to these pure policies. By Proposition 1.7 of [LPW06] , for any Π i , i∈{1, 2,··· ,N}, there exists integer τ i > 0 such that (P Πi ) t is strictly positive for any t≥τ i . Let τ 1 = max i τ i , then, it follows (P Πi ) τ1 is strictly positive uniformly for all Π i ’s. Let δ > 0 be the least entry of (P Πi ) τ1 over all Π i ’s. Following from the fact that the probability transition matrix P Π is a convex combination of those of pure policies, i.e. P Π = P N i=1 α i P Πi , α i ≥ 0, P N i=1 α i = 1, we have (P Π ) τ1 is also strictly positive. To see this, note that (P Π ) τ1 = N X i=1 α i P Πi ! τ1 ≥ N X i=1 α τ1 i (P Πi ) τ1 > 0, where the inequality is taken to be entry-wise. Furthermore, the least entry of (P Π ) τ1 is lower bounded by δ/N τ1−1 uniformly over all joint randomized stationary policies Π, which follows 212 from the fact that the least entry of 1 N (P Π ) τ1 is bounded as 1 N N X i=1 α τ1 i δ≥ 1 N N X i=1 α i ! τ1 δ = δ N τ1 . The rest is a standard bookkeeping argument following from the Markov chain mixing time theory (Theorem 4.9 of [LPW06]). Let D Π be a matrix of the same size as P Π and each row equal to the stationary distribution d Π . Let ε =δ/N τ1−1 . We claim that for any integer n> 0, and any Π, P τ1n Π = (1− (1−ε) n )D Π + (1−ε) n Q n , (6.44) for some stochastic matrix Q. We use induction to prove this claim. 
First of all, for n = 1, from the fact that (P Π ) τ1 is a positive matrix and the least entry is uniformly lower bounded by ε over all policies Π, we can write (P Π ) τ1 as (P Π ) τ1 =εD Π + (1−ε)Q, for some stochastic matrix Q, where we use the fact that ε∈ (0, 1]. Suppose (6.44) holds for n = 1, 2,··· ,`, we show that it also holds for n =` + 1. Using the fact that D Π P Π = D Π and QD Π = D Π for any stochastic matrix Q, we can write out P τ1(`+1) Π : P τ1(`+1) Π =P τ1` Π P τ1 Π = 1− (1−ε) ` D Π + (1−ε) ` Q ` P τ1 Π = 1− (1−ε) ` D Π P τ1 Π + (1−ε) ` Q ` P τ1 Π = 1− (1−ε) ` D Π + (1−ε) ` Q ` (εD Π + (1−ε)Q) = 1− (1−ε) ` D Π + (1−ε) ` Q ` ((1− (1−ε))D Π + (1−ε)Q) =(1− (1−ε) `+1 )D Π + (1−ε) `+1 Q `+1 . Thus, (6.44) holds. For any integer t> 0, we write t =τ 1 n +j for some integer j∈ [0,τ 1 ) and n≥ 0. Then, (P Π ) t − D Π = (P Π ) t − D Π = (1−ε) n Q n P j Π − D Π . 213 Let P t Π (i,·) be the i-th row of P t Π , then, we obtain max i kP t Π (i,·)−d Π k 1 ≤ 2(1−ε) n , where we use the fact that the ` 1 -norm of the row difference is bounded by 2. Finally, for any starting state distribution d 0 , we have d 0 P t Π −d Π 1 = X i d 0 (i) P t Π (i,·)−d Π 1 = X i d 0 (i) P t Π (i,·)−d Π 1 ≤ max i kP t Π (i,·)−d Π k 1 ≤ 2(1−ε) n . Take r 1 = log 1 1−ε finishes the proof. Proof of Lemma 6.5.2. Let v t ∈S (1) ×···×S (K) be the joint state distribution at time t under policy Π. Using the fact that Π is a fixed policy independent of g (k) i,t and Assumption 6.2.2 that the probability transition is also independent of function path given any state and action, the function g (k) i,t and state-action pair (a (k) t ,s (k) t ) are mutually independent. Thus, for any t∈{0, 1, 2,··· ,T− 1} E K X k=1 g (k) i,t (a (k) t ,s (k) t ) d 0 , Π ! = X s∈S (1) ×···×S (K) X a∈A (1) ×···×A (K) v t (s)Π(a|s) K X k=1 E g (k) i,t (a (k) ,s (k) ) , where s = [s (1) ,··· ,s (K) ] and a = [a (1) ,··· ,a (K) ] and the latter expectation is taken with respect to g (k) i,t (i.e. the random variable w t ). On the other hand, by Lemma 6.2.2, we know that for any randomized stationary policy Π, the corresponding stationary state-action probability can be expressed as{θ (k) ∗ } K k=1 with θ (k) ∗ ∈ Θ (k) . Thus, K X k=1 D E g (k) i,t ,θ (k) E = X s∈S (1) ×···×S (K) X a∈A (1) ×···×A (K) d Π (s)Π(a|s) K X k=1 E g (k) i,t (a (k) ,s (k) ) . 214 Hence, we can control the difference: T−1 X t=0 E K X k=1 g (k) i,t (a (k) t ,s (k) t ) d 0 , Π ! − K X k=1 D E g (k) i,t ,θ (k) ∗ E ≤ T−1 X t=0 X s∈S (1) ×···×S (K) X a∈A (1) ×···×A (K) (v t (s)−d Π (s)) Π(a|s) KΨ ≤KΨ T−1 X t=0 kv t −d Π k 1 ≤ 2KΨ T−1 X t=0 e (r1−t)/r1 ≤ 2eKΨ Z T−1 0 e −t/r1 dt = 2er 1 KΨ, where the third inequality follows from Lemma 6.5.1. Taking C 1 = 2er 1 finishes the proof of (6.29) and (6.28) can be proved in a similar way. In particular, we have for any randomized stationary policy Π that satisfies the constraint (6.2), we have T· K X k=1 D E g (k) i,t ,θ (k) ∗ E ≤ T−1 X t=0 E K X k=1 g (k) i,t (a (k) t ,s (k) t ) d 0 , Π ! − K X k=1 D E g (k) i,t ,θ (k) ∗ E + T−1 X t=0 E K X k=1 g (k) i,t (a (k) t ,s (k) t ) d 0 , Π ! ≤ 2er 1 KΨ + 0 = 2er 1 KΨ, finishing the proof. 215 Bibliography [Alt99a] E. Altman. Constrained Markov decision processes. Chapman and Hall/CRC Press, 1999. [Alt99b] E. Altman. Constrained Markov decision processes, volume 7. CRC Press, 1999. [BAM10] T. Benson, A. Akella, and D. A. Maltz. Network traffic characteristics of data centers in the wild. In Proceedings of the 10th ACM SIGCOMM conference on Internet measurement, pages 267–280. ACM, 2010. 
Bibliography

[Alt99a] E. Altman. Constrained Markov Decision Processes. Chapman and Hall/CRC Press, 1999.
[Alt99b] E. Altman. Constrained Markov Decision Processes, volume 7. CRC Press, 1999.
[BAM10] T. Benson, A. Akella, and D. A. Maltz. Network traffic characteristics of data centers in the wild. In Proceedings of the 10th ACM SIGCOMM Conference on Internet Measurement, pages 267–280. ACM, 2010.
[Ber95] D. P. Bertsekas. Dynamic Programming and Optimal Control, volume 1. Athena Scientific, Belmont, MA, 1995.
[Ber01] D. P. Bertsekas. Dynamic Programming and Optimal Control, volume I, 2nd edition. Athena Scientific, Nashua, NH, 2001.
[Ber09a] D. Bertsekas. Convex Optimization Theory. Athena Scientific, 2009.
[Ber09b] D. P. Bertsekas. Convex Optimization Theory. Athena Scientific, Belmont, 2009.
[BGPS06] S. Boyd, A. Ghosh, B. Prabhakar, and D. Shah. Randomized gossip algorithms. IEEE/ACM Transactions on Networking, 14:2508–2530, 2006.
[BL16] C. Boutilier and T. Lu. Budget allocation using weakly coupled, constrained Markov decision processes. In UAI, 2016.
[BT97] D. P. Bertsekas and J. N. Tsitsiklis. Parallel and Distributed Computation: Numerical Methods. Athena Scientific, Nashua, NH, 1997.
[BV04] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[CDM14] C. Caramanis, N. B. Dimitrov, and D. P. Morton. Efficient algorithms for budget-constrained Markov decision processes. IEEE Transactions on Automatic Control, 59(10):2813–2817, 2014.
[CFMS03] H. S. Chang, P. J. Fard, S. I. Marcus, and M. Shayman. Multitime scale Markov decision processes. IEEE Transactions on Automatic Control, 48(6):976–987, 2003.
[CLG17] T. Chen, Q. Ling, and G. B. Giannakis. An online convex optimization approach to dynamic network resource allocation. arXiv preprint arXiv:1701.03974, 2017.
[CW16] Y. Chen and M. Wang. Stochastic primal-dual methods and sample complexity of reinforcement learning. arXiv preprint arXiv:1612.02516, 2016.
[DGS14] T. Dick, A. György, and C. Szepesvári. Online learning in Markov decision processes with changing cost sequences. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 512–520, 2014.
[Dur13] R. Durrett. Probability: Theory and Examples, 4th edition. Cambridge University Press, 2013.
[EDKM05] E. Even-Dar, S. M. Kakade, and Y. Mansour. Experts in a Markov decision process. In Advances in Neural Information Processing Systems, pages 401–408, 2005.
[EDKM09] E. Even-Dar, S. M. Kakade, and Y. Mansour. Online Markov decision processes. Mathematics of Operations Research, 34(3):726–736, 2009.
[ES06] A. Eryilmaz and R. Srikant. Joint congestion control, routing, and MAC for stability and fairness in wireless networks. IEEE Journal on Selected Areas in Communications, 24(8):1514–1524, 2006.
[ES07] A. Eryilmaz and R. Srikant. Fair resource allocation in wireless networks using queue-length-based scheduling and congestion control. IEEE/ACM Transactions on Networking (TON), 15(6):1333–1344, 2007.
[ES12] A. Eryilmaz and R. Srikant. Asymptotically tight steady-state queue length bounds implied by drift conditions. Queueing Systems, 72(3-4):311–359, 2012.
[Fox66a] B. Fox. Markov renewal programming by linear fractional programming. SIAM Journal on Applied Mathematics, 14(6):1418–1432, 1966.
[Fox66b] B. Fox. Markov renewal programming by linear fractional programming. SIAM Journal on Applied Mathematics, 14(6):1418–1432, 1966.
[FS99] Y. Freund and R. E. Schapire. Adaptive game playing using multiplicative weights. Games and Economic Behavior, 29(1-2):79–103, 1999.
[Gan13] A. Gandhi. Dynamic Server Provisioning for Data Center Power Management. PhD thesis, Carnegie Mellon University, 2013.
[GDHBSW13] A. Gandhi, S. Doroudi, M. Harchol-Balter, and A. Scheller-Wolf. Exact analysis of the M/M/k/setup class of Markov chains via recursive renewal reward. Proc. ACM Sigmetrics, pages 153–166, 2013.
[GHBK12] A. Gandhi, M. Harchol-Balter, and M. A. Kozuch. Are sleep states effective in data centers? In Green Computing Conference (IGCC), 2012 International, pages 1–10. IEEE, 2012.
[GNT+06] L. Georgiadis, M. J. Neely, L. Tassiulas, et al. Resource allocation and cross-layer control in wireless networks. Foundations and Trends in Networking, 1(1):1–144, 2006.
[GRW14] P. Guan, M. Raginsky, and R. M. Willett. Online Markov decision processes with Kullback–Leibler control cost. IEEE Transactions on Automatic Control, 59(6):1423–1438, 2014.
[H+16] E. Hazan et al. Introduction to online convex optimization. Foundations and Trends in Optimization, 2(3-4):157–325, 2016.
[Haj82] B. Hajek. Hitting-time and occupation-time bounds implied by drift analysis with applications. Advances in Applied Probability, 14(3):502–525, 1982.
[HAK07] E. Hazan, A. Agarwal, and S. Kale. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69:169–192, 2007.
[HK14] E. Hazan and S. Kale. Beyond the regret minimization barrier: optimal algorithms for stochastic strongly-convex optimization. The Journal of Machine Learning Research, 15(1):2489–2512, 2014.
[HP05] M. Hutter and J. Poland. Adaptive online prediction by following the perturbed leader. Journal of Machine Learning Research, 6(Apr):639–660, 2005.
[HS08] T. Horvath and K. Skadron. Multi-mode energy management for multi-tier server clusters. In Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques, pages 270–279. ACM, 2008.
[JHA16] R. Jenatton, J. Huang, and C. Archambeau. Adaptive algorithms for online convex optimization with long-term constraints. In International Conference on Machine Learning, pages 402–411, 2016.
[LHS+13] T. Lattimore, M. Hutter, P. Sunehag, et al. The sample-complexity of general reinforcement learning. In Proceedings of the 30th International Conference on Machine Learning. Journal of Machine Learning Research, 2013.
[Li11] C.-P. Li. Stochastic Optimization over Parallel Queues: Channel-Blind Scheduling, Restless Bandit, and Optimal Delay. Citeseer, 2011.
[LN14] C. Li and M. J. Neely. Solving convex optimization with side constraints in a multi-class queue by adaptive cμ rule. Queueing Systems, 77(3):331–372, 2014.
[LPW06] D. A. Levin, Y. Peres, and E. L. Wilmer. Markov Chains and Mixing Times. American Mathematical Society, 2006.
[LS04] X. Lin and N. B. Shroff. Joint rate control and scheduling in multihop wireless networks. In 2004 43rd IEEE Conference on Decision and Control (CDC) (IEEE Cat. No. 04CH37601), volume 2, pages 1484–1489. IEEE, 2004.
[LW94] N. Littlestone and M. K. Warmuth. The weighted majority algorithm. Information and Computation, 108(2):212–261, 1994.
[LWAT13] M. Lin, A. Wierman, L. L. Andrew, and E. Thereska. Dynamic right-sizing for power-proportional data centers. IEEE/ACM Transactions on Networking, 21(5):1378–1391, 2013.
[MGW09] D. Meisner, B. T. Gold, and T. F. Wenisch. PowerNap: eliminating server idle power. In ACM Sigplan Notices, volume 44, pages 205–216. ACM, 2009.
[MJY12] M. Mahdavi, R. Jin, and T. Yang. Trading regret for efficiency: online convex optimization with long term constraints. Journal of Machine Learning Research, 13(Sep):2503–2528, 2012.
[MSB+11] D. Meisner, C. M. Sadler, L. A. Barroso, W.-D. Weber, and T. F. Wenisch. Power management of online data-intensive services. In ACM SIGARCH Computer Architecture News, volume 39, pages 319–330. ACM, 2011.
[NAGS10] G. Neu, A. Antos, A. György, and C. Szepesvári. Online Markov decision processes under bandit feedback. In Advances in Neural Information Processing Systems, pages 1804–1812, 2010.
[Nee10a] M. J. Neely. Stochastic network optimization with application to communication and queueing systems. Synthesis Lectures on Communication Networks, 3(1):1–211, 2010.
[Nee10b] M. J. Neely. Stochastic Network Optimization with Application to Communication and Queueing Systems. Morgan & Claypool, 2010.
[Nee11] M. J. Neely. Online fractional programming for Markov decision systems. In Communication, Control, and Computing (Allerton), 2011 49th Annual Allerton Conference on, pages 353–360. IEEE, 2011.
[Nee12a] M. J. Neely. Asynchronous control for coupled Markov decision systems. Information Theory Workshop (ITW), 2012.
[Nee12b] M. J. Neely. Asynchronous scheduling for energy optimality in systems with multiple servers. Proceedings of the 46th Annual Conference on Information Sciences and Systems (CISS), 2012.
[Nee12c] M. J. Neely. Stability and probability 1 convergence for queueing networks via Lyapunov optimization. Journal of Applied Mathematics, 2012, 2012.
[Nee13a] M. J. Neely. Dynamic optimization and learning for renewal systems. IEEE Transactions on Automatic Control, 58(1):32–46, 2013.
[Nee13b] M. J. Neely. Dynamic optimization and learning for renewal systems. IEEE Transactions on Automatic Control, 58(1):32–46, 2013.
[New05] M. E. Newman. Power laws, Pareto distributions and Zipf's law. Contemporary Physics, 46(5):323–351, 2005.
[NML08] M. J. Neely, E. Modiano, and C.-P. Li. Fairness and optimal stochastic control for heterogeneous networks. IEEE/ACM Transactions on Networking, 16(2):396–409, 2008.
[NO09] A. Nedić and A. Ozdaglar. Approximate primal solutions and rate analysis for dual subgradient methods. SIAM Journal on Optimization, 19(4):1757–1780, 2009.
[NY17] M. J. Neely and H. Yu. Online convex optimization with time-varying constraints. arXiv preprint arXiv:1702.04783, 2017.
[PT99] C. H. Papadimitriou and J. N. Tsitsiklis. The complexity of optimal queuing network control. Mathematics of Operations Research, 24(2):293–305, 1999.
[Put14] M. L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2014.
[PXYY16] Z. Peng, Y. Xu, M. Yan, and W. Yin. ARock: an algorithmic framework for asynchronous parallel coordinate updates. To appear in SIAM Journal on Scientific Computing, 2016.
[Rib10] A. Ribeiro. Ergodic stochastic optimization algorithms for wireless communication and networking. IEEE Transactions on Signal Processing, 58(12):6369–6386, 2010.
[Roc15] R. T. Rockafellar. Convex Analysis. Princeton University Press, 2015.
[Ros02] S. Ross. Introduction to Probability Models, 8th edition. Academic Press, 2002.
[SB98] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction, volume 1. MIT Press, Cambridge, 1998.
[Sch83] S. Schaible. Fractional programming. Zeitschrift für Operations Research, 27(1):39–54, 1983.
[SN11] K. Srivastava and A. Nedić. Distributed asynchronous constrained stochastic optimization. IEEE Journal of Selected Topics in Signal Processing, 5(4):772–790, 2011.
[Sto05] A. L. Stolyar. Maximizing queueing network utility subject to stability: Greedy primal-dual algorithm. Queueing Systems, 50(4):401–457, 2005.
[TE90] L. Tassiulas and A. Ephremides. Stability properties of constrained queueing systems and scheduling policies for maximum throughput in multihop radio networks. In 29th IEEE Conference on Decision and Control, pages 2130–2132. IEEE, 1990.
[TE93] L. Tassiulas and A. Ephremides. Dynamic server allocation to parallel queues with randomly varying connectivity. IEEE Transactions on Information Theory, 39(2):466–478, 1993.
[UKIN10] R. Urgaonkar, U. C. Kozat, K. Igarashi, and M. J. Neely. Dynamic resource allocation and power management in virtualized data centers. In Network Operations and Management Symposium (NOMS), 2010 IEEE, pages 479–486. IEEE, 2010.
[UWH+15] R. Urgaonkar, S. Wang, T. He, M. Zafer, K. Chan, and K. K. Leung. Dynamic service migration and workload scheduling in edge-clouds. Performance Evaluation, 91:205–228, 2015.
[Wal44] A. Wald. On cumulative sums of random variables. The Annals of Mathematical Statistics, 15(3):283–296, 1944.
[Wer13] C. Wernz. Multi-time-scale Markov decision processes for organizational decision-making. EURO Journal on Decision Processes, 1(3-4):299–324, 2013.
[Whi88] P. Whittle. Restless bandits: Activity allocation in a changing world. Journal of Applied Probability, 25(A):287–298, 1988.
[WN15] X. Wei and M. J. Neely. Power-aware wireless file downloading: A Lyapunov indexing approach to a constrained restless bandit problem. IEEE/ACM Transactions on Networking, 24(4):2264–2277, 2015.
[WN16a] X. Wei and M. J. Neely. On the theory and application of distributed asynchronous optimization over weakly coupled renewal systems. arXiv preprint arXiv:1608.00195, 2016.
[WN16b] X. Wei and M. J. Neely. Online constrained optimization over time varying renewal systems: An empirical method. arXiv preprint arXiv:1606.03463, 2016.
[WN17] X. Wei and M. J. Neely. Data center server provision: Distributed asynchronous control for coupled renewal systems. IEEE/ACM Transactions on Networking (TON), 25(4):2180–2194, 2017.
[WN18] X. Wei and M. J. Neely. Asynchronous optimization over weakly coupled renewal systems. Stochastic Systems, 8(3):167–191, 2018.
[WSLJ15] H. Wu, R. Srikant, X. Liu, and C. Jiang. Algorithms with logarithmic or sublinear regret for constrained contextual bandits. In Advances in Neural Information Processing Systems, pages 433–441, 2015.
[WUZ+15] S. Wang, R. Urgaonkar, M. Zafer, T. He, K. Chan, and K. K. Leung. Dynamic service migration in mobile edge-clouds. In 2015 IFIP Networking Conference (IFIP Networking), pages 1–9. IEEE, 2015.
[WYN15] X. Wei, H. Yu, and M. J. Neely. A probabilistic sample path convergence time analysis of drift-plus-penalty algorithm for stochastic optimization. arXiv preprint arXiv:1510.02973, 2015.
[WYN18] X. Wei, H. Yu, and M. J. Neely. Online learning in weakly coupled Markov decision processes: A convergence time study. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 2(1):12, 2018.
[Yao02] D. D. Yao. Dynamic scheduling via polymatroid optimization. Proceedings of Performance Evaluation of Complex Systems: Techniques and Tools, pages 89–113, 2002.
[YHS+12] Y. Yao, L. Huang, A. Sharma, L. Golubchik, and M. Neely. Data centers power reduction: A two time scale approach for delay tolerant workloads. In INFOCOM, 2012 Proceedings IEEE, pages 1431–1439. IEEE, 2012.
[YMS09] J. Y. Yu, S. Mannor, and N. Shimkin. Markov decision processes with arbitrary reward processes. Mathematics of Operations Research, 34(3):737–757, 2009.
[YN16] H. Yu and M. J. Neely. A low complexity algorithm with O(√T) regret and finite constraint violations for online convex optimization with long term constraints. arXiv preprint arXiv:1604.02218, 2016.
[YN17] H. Yu and M. J. Neely. A simple parallel algorithm with an O(1/t) convergence rate for general convex programs. SIAM Journal on Optimization, 27(2):759–783, 2017.
[YNW17] H. Yu, M. Neely, and X. Wei. Online convex optimization with stochastic constraints. arXiv preprint arXiv:1708.03741, 2017.
[YT89] Y. Ye and E. Tse. An extension of Karmarkar's projective algorithm for convex quadratic programming. Mathematical Programming, 44(1):157–179, 1989.
[Zin03] M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pages 928–936, 2003.