USC Digital Library / University of Southern California Dissertations and Theses / On practical network optimization: convergence, finite buffers, and load balancing (USC Thesis)
ON PRACTICAL NETWORK OPTIMIZATION: CONVERGENCE, FINITE BUFFERS, AND LOAD BALANCING

by

Sucha Supittayapornpong

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY (ELECTRICAL ENGINEERING)

December 2017

Copyright 2017 Sucha Supittayapornpong

Dedication

To my mom and dad: Sudsaijai and Kovit Supittayapornpong

Acknowledgements

This work would not have been possible without the support and guidance of my advisor, Prof. Michael J. Neely. He has shown me the proper way to conduct theoretical research in network optimization, which was new territory for me when I first joined the Ph.D. program. His patience and understanding during the entire doctoral study were invaluable. He spent his valuable time guiding me through insightful comments as well as reviewing my papers. Importantly, his openness allowed me to pursue my own endeavor of practically grounded network optimization theory, which will be my new journey after this thesis. I feel that I cannot thank him enough for everything he gave me.

For the practical part of this work, I would like to thank Prof. Ramesh Govindan, who introduced me to network systems research and served on my dissertation committee. I took his Computer Communications course (CSCI-551) in Spring 2013. I learned a lot from his course, including the design principles of the Internet, the importance of practicality, and the early development of the TCP protocols. The class project, a latency-optimized TCP for datacenters, is one of the key elements that helped me with the practical load-balancing work, where theory and practice merge. I truly appreciate Prof. Bhaskar Krishnamachari, who served on both my dissertation committee and my qualifying exam committee. Prof.
Konstantinos Psounis posed an important question about the practicality of theoretical algorithms, which led to a new design of an algorithm that works gracefully with TCP traffic. Prof. Ashutosh Nayyar gave me another perspective on my finite-buffer algorithm. Prof. Andreas Molisch and Prof. Minlan Yu gave guidance on wireless systems and software-defined networks.

I am very grateful for the financial support from the Annenberg Graduate Fellowship during my first four years at USC, and for the assistantship support from Prof. Michael J. Neely and the Electrical Engineering department at USC. I want to thank Hao Feng, Chun-Ting Huang, Sunav Choudhary, Kuan-Wen Huang, Hao Yu, and Xiaohan Wei, as well as Gerrielyn Ramos, Corine Wong, and Diane Demetras, for the memorable experience at USC.

Table of Contents

Dedication
Acknowledgements
List Of Tables
List Of Figures
Abstract

Chapter 1: Introduction
  1.1 Objectives
    1.1.1 Convergence of time-average optimization
    1.1.2 Stochastic network optimization with finite buffers
    1.1.3 Traffic load balancing for intra datacenter networks
    1.1.4 Additional works
  1.2 Organization
  1.3 Notation

Chapter 2: Convergence of Deterministic Time-Average Optimization
  2.1 Time-average deterministic optimization
    2.1.1 The extended set Y
    2.1.2 Lipschitz continuity and Slater condition
    2.1.3 Relation to a dual subgradient algorithm
  2.2 General convergence result
  2.3 Convergence under the uniqueness assumption
    2.3.1 Locally-polyhedral dual function
    2.3.2 Locally-quadratic dual function
    2.3.3 Staggered time averages
    2.3.4 Summary of convergence results
  2.4 Sample problems
  2.5 Chapter summary

Chapter 3: Convergence of Stochastic Time-Average Optimization
  3.1 Time-average stochastic optimization
    3.1.1 Auxiliary formulation
    3.1.2 Lyapunov optimization
    3.1.3 Drift-plus-penalty algorithm
  3.2 Behaviors of the drift-plus-penalty algorithm
    3.2.1 Embedded formulation
    3.2.2 T-slot convergence
    3.2.3 Concentration bound
  3.3 Locally-polyhedral dual function
    3.3.1 Transient time
    3.3.2 Convergence time in a steady state
  3.4 Locally-quadratic dual function
    3.4.1 Transient time
    3.4.2 Convergence time in a steady state
  3.5 Sample problems
    3.5.1 Staggered time averages
    3.5.2 Results
  3.6 Chapter summary

Chapter 4: Stochastic Network Optimization with Finite Buffers
  4.1 System model
    4.1.1 Network state
    4.1.2 Control decision
    4.1.3 Standard queue
    4.1.4 Stochastic formulation
  4.2 Drift-plus-penalty algorithm
    4.2.1 The algorithm
    4.2.2 Deterministic problem
  4.3 Floating-queue algorithm
    4.3.1 Queue transformation
    4.3.2 Real and fake queueing dynamics
  4.4 Performance analysis
    4.4.1 Lower-bound policy
    4.4.2 Sample path analysis
    4.4.3 Performance of the floating-queue algorithm
  4.5 Simulation
    4.5.1 Power minimization
    4.5.2 Throughput maximization
    4.5.3 Dynamic state distributions
  4.6 Chapter summary

Chapter 5: Traffic Load Balancing for Intra Datacenter Networks
  5.1 System model and design
    5.1.1 Topology and routing
    5.1.2 Traffic
    5.1.3 Decision variables
    5.1.4 Queues
    5.1.5 Stability and assumption
  5.2 Throughput-optimal algorithm
    5.2.1 The algorithm
    5.2.2 Intuitions
    5.2.3 Correctness of Algorithm 5
    5.2.4 Stability analysis
  5.3 System realization
    5.3.1 Approximation of common queues
    5.3.2 Additional packet headers
    5.3.3 Weighted fair queueing
    5.3.4 Traffic splitting by hashing
  5.4 Simulations
    5.4.1 Ideal simulation
    5.4.2 Network simulator
  5.5 Chapter summary

Chapter 6: Quality of Information Optimization in Wireless Multi-Hop Networks
  6.1 Single-hop system model
    6.1.1 Format selection
    6.1.2 Uplink scheduling
    6.1.3 Stochastic network optimization
  6.2 Dynamic algorithm of the uplink network
    6.2.1 Lyapunov optimization
    6.2.2 The separable quadratic policy
    6.2.3 Separability
  6.3 Performance and simulation of the uplink network
    6.3.1 Performance analysis
    6.3.2 Simulation
  6.4 System model with relay
    6.4.1 Routing and scheduling
    6.4.2 Stochastic network optimization
  6.5 Dynamic algorithm
    6.5.1 Lyapunov optimization
    6.5.2 Separability
    6.5.3 Algorithm
  6.6 Stability and performance bounds
    6.6.1 Performance analysis
    6.6.2 Deterministic bounds of queue lengths
  6.7 Simulation
  6.8 Chapter summary

Chapter 7: Staggered Algorithm for Non-smooth Optimization
  7.1 Preliminaries
    7.1.1 Staggered time averages
    7.1.2 Basic results
    7.1.3 Concentration bound
  7.2 Locally polyhedral structure
    7.2.1 Drift and transient time
    7.2.2 Convergence rate
  7.3 General convex function
    7.3.1 Drift and transient time
    7.3.2 Convergence rate
  7.4 Fast convergence for deterministic problems
  7.5 Experiments
  7.6 Chapter summary

Chapter 8: Conclusion

Bibliography

List Of Tables

2.1 Summary of Convergence Times
5.1 Average backlogs under Algorithm 5 and MaxWeight

List Of Figures

1.1 A network of queues
2.1 Wireless downlink example with non-convex set X
2.2 Illustration of locally-polyhedral and locally-quadratic functions
2.3 Convergence of Algorithm 1 and the staggered algorithm that solve problem (2.48) with f(x) = 1.5x_1 + x_2
2.4 Convergence of Algorithm 1 and the staggered algorithm that solve problem (2.48) with f(x) = x_1^2 + x_2^2
2.5 Convergence of Algorithm 1 and the staggered algorithm that solve problem (2.48) with f(x) = 1.5x_1 + x_2 and an additional constraint x_1 + x_2 ≤ 1
2.6 Convergence of Algorithm 1 and the staggered algorithm that solve problem (2.48) with f(x) = x_1^2 + x_2^2 and an additional constraint x_1 + x_2 ≤ 1
3.1 Illustrations of locally-polyhedral and locally-quadratic dual functions
3.2 Convergence of Algorithm 2 and the staggered algorithm that solve problem (3.61) with f(x) = 1.5x_1 + x_2
3.3 Convergence of Algorithm 2 and the staggered algorithm that solve problem (3.61) with f(x) = x_1^2 + x_2^2
3.4 Convergence of Algorithm 2 and the staggered algorithm that solve problem (3.61) with f(x) = 1.5x_1 + x_2 and an additional constraint E[x_1 + x_2] ≤ 1
3.5 Convergence of Algorithm 2 and the staggered algorithm that solve problem (3.61) with f(x) = x_1^2 + x_2^2 and an additional constraint E[x_1 + x_2] ≤ 1
4.1 Arrivals and services at a standard queue
4.2 Transformation of a standard queue to a floating queue
4.3 Time interval T(T) is partitioned into T_H(T) and T_L(T)
4.4 Set T_L(T) is partitioned into sub-intervals, starting from t_L to t_L^+
4.5 Line network
4.6 Average delay and average drop rate of the power minimization problem with V = 200
4.7 Average throughput, average delay, and average drop rate of the throughput maximization problem with V = 200
4.8 Throughput maximization with dynamic state distributions and buffer size B = 18. Results are averaged with a window size of 500.
4.9 Time interval between t_L and t_L^+ is decomposed into decreasing and non-decreasing sub-intervals
5.1 MaxWeight example: Let w_ij^d be the weight of commodity d over the link from switch i to switch j. All weights in this figure are (w_12^1, w_12^2, w_23^1, w_23^2) = (0, 1, 3, 1).
5.2 Timeline of queue occupancy under MaxWeight: a small box represents a packet in a queue. Switch i is represented by the number i ∈ {1, 2, 3} under the long line. The short line under the number indicates that the commodity at the numbered switch is served in that particular time slot. The occupancy pattern repeats after t = 14, which is similar to the pattern at t = 11.
5.3 Timeline of queue occupancy under the ideal algorithm
5.4 An example network with N = {1, 2, ..., 14} and D = {1, 2, ..., 8}
5.5 Example of sets of switches at switch 9. Note that P_9^8 must not contain 10 to avoid loops, which imposes 9 ∉ H_10^8.
5.6 Packet-filling Algorithm 6 iteratively fulfills the requests. An iteration number is indicated in a gray box. In this example, Tc_ij = 9 and the first iteration (the plot on the left) allocates rate to commodity 4. The algorithm allocates rates 2, 0, 3, 4 to commodities 1, 2, 3, 4.
5.7 Line network with N = {1, 2, 3, 4}, D = {1, 2}, and H_1^d = {2}, H_2^d = {3}, H_3^d = {4} for d ∈ D
5.8 Intra datacenter network with N = {1, 2, ..., 14}, D = {1, 2, ..., 9}. Each next-hop set H_i^d contains next-hop switches with the shortest distance to commodity d, e.g., H_1^8 = {9, 10}, H_9^8 = {13, 14}, H_13^8 = {11, 12}, H_11^8 = {8}, H_1^9 = {9, 10} = H_1^e for e ∈ {2, 3, 4}. Arrivals are E[a_i^d(t)] = 2 and E[a_i^9(t)] = 1 for d, i ∈ {1, ..., 8}; otherwise 0. The departure rate is b_i^d = 20 if commodity d connects to switch i; otherwise 0.
5.9 The FCTs from the network in Figure 5.8 without commodity 9
5.10 The FCTs of all flows in the network in Figure 5.8 where commodity 9 is omitted and the link between switches 12 and 14 fails
5.11 A network with a priority flow in each direction and 600 normal flows between commodities 1 and 2
5.12 The FCTs of normal flows between commodities 1 and 2 in Figure 5.11
6.1 A network with N devices as queues Q_1(t), ..., Q_N(t) and a receiver station
6.2 Small network with orthogonal channels
6.3 Quality of information versus V and averaged queue lengths under the quadratic (QD) and max-weight (MW) policies
6.4 Averaged backlog in queues versus V under the quadratic and max-weight policies
6.5 An example network consisting of devices with uplink and relay capabilities and a receiver station
6.6 Small network with independent channels and distributions
6.7 Quality of information versus V under the quadratic (QD) and max-weight (MW) policies
6.8 Averaged backlog in device 1's queues versus V under the quadratic (QD) and max-weight (MW) policies
6.9 The system obtains average quality of information while having average total queue length
6.10 Larger network with independent channels with distributions shown
6.11 Convergence of time-averaged quality of information. The interval of the moving average is 500 slots.
7.1 Structures of function F
7.2 Results of algorithms and a locally polyhedral function
7.3 Results of algorithms and a general convex function
7.4 Results of algorithms and the function (7.38)
Abstract

Practical network optimization is increasingly important as a way to provide better services in the era of cloud computing and programmable networks. An emerging challenge in practice is how to control and allocate network resources optimally. Fortunately, many problems related to this challenge have already been solved by stochastic network optimization in theoretical research. An open problem is how to make the theory pragmatic. This thesis investigates and improves three practical aspects of stochastic network optimization as steps towards a framework for practical network optimization with theoretical guarantees.

Firstly, the convergence speed of a drift-plus-penalty algorithm that solves a class of problems called time-average optimization is studied. This class of problems arises in routing, scheduling, and service placement for cloud computing, where the convergence speed can represent a reaction time to network changes. The convergence speed is further improved from O(1/ε^2) to O(1/ε) or O(1/ε^1.5), depending on the structure of a problem, where ε is the proximity to optimality.

Secondly, a practical constraint on queues with finite buffers is considered. A new finite-buffer algorithm for stochastic network optimization is developed as a practically implementable algorithm with a performance guarantee. Specifically, when the buffer size of every queue in a network is B, the new algorithm achieves an O(e^{-B}) optimality gap, O(B) delay, and an O(e^{-B}) average drop rate. The algorithm operates in first-in-first-out (FIFO) manner and attains near optimality while the network experiences small delay and negligible packet drops.

Thirdly, the issues of interoperability with transmission control protocol (TCP) traffic and practical implementation via software-defined networking (SDN) are addressed as parts of the development of a new traffic load-balancing algorithm for datacenter networks.
The load-balancing algorithm is inspired by a new throughput-optimal algorithm and is fully distributed. It outperforms the conventional equal-cost multi-path (ECMP) algorithm. This illustrates the practicality of network optimization theory.

The results developed in this thesis highlight the potential of stochastic network optimization for solving practical network problems. Specifically, we have shown that the practical load-balancing problem can be solved by applying insights from the theory. Further, the insights from the floating-queue algorithm and the convergence results provide general guidelines on the practical implementation and behavior of the drift-plus-penalty algorithm for stochastic network optimization.

Chapter 1
Introduction

Network optimization has been a challenging problem since the beginning of telecommunications and the Internet. The challenge stems from both theoretical aspects, such as computational complexity and stochastic behavior, and practical aspects, including heterogeneity, implementation, and compatibility. In practice, the challenge is tackled by a separation approach that divides a network into various components, each of which deals with a smaller challenge. The Open Systems Interconnection (OSI) model and the Software-Defined Networking (SDN) architecture (see, for example, [KR12] and [ONF12] respectively) are good examples of this approach. Although the separation approach facilitates implementation and compatibility, performance might be sub-optimal if a network is not properly designed.

A good example to illustrate practical network optimization is the suite of Transmission Control Protocol (TCP) methods for networks. TCP has been actively optimized since 1974 to improve throughput, latency, stability, reliability, etc. An early version of TCP was created and tested by experiments [Pos81, RJ88].
The more recent TCP Vegas [LPW01] and FAST TCP [JWL+05] are optimized using insights from mathematical optimization and are shown to outperform the original TCP. This suggests the potential of mathematical optimization for improving the performance of a practical network.

Stochastic network optimization in [Nee10] is a mathematical technique that can be used to design an algorithm that controls a network system with an O(1/V) optimality gap and O(V) delay, where V > 0 is a tradeoff parameter. The algorithm is called drift-plus-penalty. An intuition of the technique can be illustrated by considering a joint flow control and routing problem. A network, consisting of four switches, a source, and a destination, is modeled as a network of queues in Figure 1.1.

Figure 1.1: A network of queues

The sending rate at the source becomes a rate control problem, which attempts to find the highest sending rate that does not overwhelm the network. The routing and forwarding mechanisms at each switch also become rate control problems, which determine the sending rates of the switch's outputs. The drift-plus-penalty algorithm can be derived from this queueing model and the rate control problems. As a result, the algorithm controls the source rate and the routing and forwarding mechanisms of every switch. It can be shown that, under the drift-plus-penalty algorithm, all queues in the network are bounded and the source rate achieves the maximum throughput, which equals the min-cut max-flow value of the network.

The relationship between the drift-plus-penalty algorithm and a dual subgradient method is noted in [NMR05, HN11]. This result provides a connection between stochastic network optimization and a dual problem (see, for example, [BV04]) in deterministic convex optimization. Further, the average queue backlog under the algorithm concentrates around the Lagrange multipliers of a dual problem, as established in [HN11].
Specifically, the average queue backlog converges asymptotically to a scaled version of the Lagrange multipliers. Although this result characterizes the asymptotic properties of the algorithm, it does not specify transient behavior, which is important in practice.

Despite this knowledge, the drift-plus-penalty algorithm is still far from practical: (i) The convergence speed of the algorithm is unknown. (ii) The algorithm assumes infinite buffer space at every queue, while a switch has only limited buffer memory. (iii) It is unclear whether the algorithm is compatible with TCP flows, which constitute the majority of real-world traffic.

1.1 Objectives

In this thesis, three practical aspects of the stochastic network optimization technique are investigated to develop more realistic algorithms that can be applied to real-world network problems. (i) The convergence behavior of the drift-plus-penalty algorithm is investigated in the context of time-average optimization; specifically, both its transient and steady-state behaviors are analyzed. (ii) The practical finite-buffer constraint is considered to develop a new algorithm for stochastic network optimization that requires only small buffer space. (iii) Practical TCP compatibility and implementation on SDN devices are investigated in the context of traffic load balancing in datacenter networks. These three objectives aim to bridge theoretical network optimization and practical networked systems, and to illustrate the practicality of the theory. Each objective is described below.

1.1.1 Convergence of time-average optimization

A time-average optimization problem consists of a time-average objective function, time-average constraints, and non-convex decision sets. It is a general formulation for network optimization problems. Let ε > 0 be the optimality gap such that the time-average objective is within O(ε) of the optimal cost and the constraint violation is within O(ε).
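To make the flavor of a time-average problem concrete, the following is a minimal sketch of a drift-plus-penalty-style iteration on a toy instance (the set X, cost f, and constraint g below are hypothetical illustrations, not taken from this thesis): minimize the time-average of f(x) = x over the discrete set X = {0, 1} subject to the time-average constraint x̄ ≥ 0.7. A virtual queue Q(t) tracks constraint violation, and each slot greedily minimizes V·f(x) + Q(t)·g(x) over X.

```python
# Sketch of a drift-plus-penalty-style iteration on a hypothetical toy
# instance (illustrative only, not this thesis's exact algorithm).
# Minimize time-average of f(x) = x over X = {0, 1}
# subject to the time-average constraint g(x_bar) = 0.7 - x_bar <= 0.
# The optimum x_bar = 0.7 lies in the convex hull of X, not in X itself.

def drift_plus_penalty(T=100000, V=50.0):
    X = [0.0, 1.0]
    f = lambda x: x            # cost to be minimized on average
    g = lambda x: 0.7 - x      # constraint in the form g(x_bar) <= 0
    Q = 0.0                    # virtual queue for the constraint
    total = 0.0
    for t in range(T):
        # Greedy slot decision over the (non-convex) set X.
        x = min(X, key=lambda x: V * f(x) + Q * g(x))
        Q = max(Q + g(x), 0.0)  # queue grows while the constraint is violated
        total += x
    return total / T

x_bar = drift_plus_penalty()
# x_bar is close to 0.7 (within roughly O(1/V) after the transient).
```

The point 0.7 is unattainable by any fixed decision in X, yet the time average of the greedy slot decisions approaches it; the virtual queue hovers near the scaled Lagrange multiplier of the constraint, which is exactly the connection with dual variables discussed above.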
The drift-plus-penalty algorithm in [Nee10] can be applied to solve this class of time-average problems and asymptotically achieves an O(ε) optimality gap. However, this asymptotic result does not provide insight into the speed of the algorithm.

In this thesis, the convergence time of the drift-plus-penalty algorithm is investigated and improved. The connection with the Lagrange multipliers established in [HN11] is utilized in the analysis. The algorithm is proven to have O(1/ε^2) convergence time. However, further analysis shows that the algorithm has two phases of operation: a transient phase and a steady state. The convergence time can be improved by performing averages in the steady state. When the problem has a locally-polyhedral structure, the transient time is O(1/ε) and the convergence time in the steady state is O(1/ε), so the total convergence time is O(1/ε). When the problem has a locally-quadratic structure, the transient time is O(1/ε^1.5), and the convergence time in the steady state is O(1/ε^1.5).

These convergence speeds represent how fast the algorithm converges to a new optimal operating point after network changes, such as link failures and traffic shifts. The convergence results hold under both the deterministic setting [SHN14, SHN17] and the stochastic setting [SN15d, SN14].

1.1.2 Stochastic network optimization with finite buffers

The drift-plus-penalty algorithm assumes that every queue in a network has infinite buffer space. However, network devices, such as switches and routers, have limited memory and can store only a finite number of packets at any given time. Dynamic programming (see, for example, [Ber05]) for finite queues suffers from the curse of dimensionality due to the exponential number of states. Existing algorithms with performance tradeoffs in [HN11, HMNK13] either require knowledge of the Lagrange multipliers or operate in last-in-first-out (LIFO) manner.
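To make the finite-buffer difficulty concrete: a queue with buffer size B can still track an unbounded virtual queue by pairing a real FIFO buffer with a counter of packets that are accounted for but not stored. The sketch below is one plausible reading of such a real-queue-plus-counter scheme, for illustration only; it is not the exact floating-queue algorithm developed in this thesis.

```python
from collections import deque

class FloatingQueue:
    """Illustrative sketch: a finite FIFO buffer of size B plus a counter
    of 'fake' packets, so the total tracks an unbounded virtual queue.
    Hypothetical reading of a real-queue-plus-counter scheme, not the
    thesis's exact algorithm."""

    def __init__(self, B):
        self.B = B
        self.real = deque()   # packets actually stored (FIFO, at most B)
        self.fake = 0         # packets counted but not stored (dropped)

    @property
    def virtual_length(self):
        # The length that a drift-plus-penalty controller would see.
        return len(self.real) + self.fake

    def arrive(self, pkt):
        """Admit a packet; True if stored, False if dropped (but counted)."""
        if len(self.real) < self.B:
            self.real.append(pkt)
            return True
        self.fake += 1        # buffer full: drop the data, keep the count
        return False

    def serve(self):
        """Serve one packet; returns the packet, or None for a fake one."""
        if self.real:
            return self.real.popleft()   # real packets leave in FIFO order
        if self.fake > 0:
            self.fake -= 1               # 'serve' a fake packet (no data)
        return None
```

Control decisions are driven by `virtual_length`, so the controller behaves as if the queue were unbounded while the physical buffer never exceeds B; intuitively, the rarity of touching the fake counter is what an exponentially small drop rate in B corresponds to.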
In this thesis, a new floating-queue algorithm is developed for a network of queues having small buffer space [SN15a, SN15b]. Specifically, when each queue can store at most B packets, the floating-queue algorithm achieves an O(e^{-B}) optimality gap, O(B) delay, and an O(e^{-B}) average rate of packet drops. Conceptually, the algorithm uses a real queue and a counter to track the corresponding queue of the drift-plus-penalty algorithm. Hence, the new algorithm (i) is adaptive to network changes, (ii) does not require any prior statistics or knowledge of the Lagrange multipliers, (iii) operates naturally in FIFO manner, and (iv) is simple and does not suffer from the curse of dimensionality. It is general enough to be used for stochastic network optimization problems.

1.1.3 Traffic load balancing for intra datacenter networks

Traffic load balancing in datacenter networks is an open problem in systems networking when the network is asymmetric. The conventional approach, equal-cost multi-path routing (ECMP), which splits traffic equally among equal-cost next hops, is not optimal and has several practical issues [TH00]. Distributed and heuristic approaches in [AED+14, KHK+16, VPA+17] require special hardware, and their optimality is not guaranteed.

In this thesis, a new practical algorithm for traffic load balancing is inspired by a new theoretical algorithm. The theoretical algorithm, proven to be throughput-optimal, is designed to (i) work gracefully with TCP flows and (ii) approximate the weighted fair queuing (WFQ) scheme in an SDN switch [SN16b]. The algorithm differs from the traditional MaxWeight algorithm in [TE92], which performs poorly with TCP's congestion control [SM16b, SN16b]. The practical algorithm is distributed and can be implemented on production SDN switches. Simulation results show performance improvements of up to 4.4x in comparison to the ECMP algorithm.

1.1.4 Additional works

Two additional works, related to the convergence investigation, are studied in this thesis.
Quality of information maximization for wireless networks has been studied as a prior investigation of the time-average convergence. The goal is to maximize the total quality of information while maintaining stability of the network. Devices generate information with different qualities and sizes. The information is sent to a base station through wireless communications. A new joint format selection, routing, and scheduling algorithm is developed. Instead of using the MaxWeight algorithm, a new separable quadratic drift is developed to reduce queue occupancy and delay [SN12, SN15c].

A staggered algorithm is developed as a spin-off from the convergence research. The algorithm solves a class of non-smooth optimization problems and achieves an O(ε)-solution with O(1/ε) convergence time [SN16a]. This has applications in machine learning and operations research.

1.2 Organization

The study of the convergence and improvement of the drift-plus-penalty algorithm has two parts. Chapter 2 studies the convergence under a deterministic setting, while a stochastic setting is investigated in Chapter 3. The constraint of finite buffers is investigated in Chapter 4. Then the practicality of theoretical network optimization is illustrated in Chapter 5, which develops a load-balancing algorithm for datacenter networks. Additional works on quality of information maximization and the staggered algorithm are described in Chapter 6 and Chapter 7 respectively. Finally, the thesis is concluded in Chapter 8.

1.3 Notation

Below is the list of common notations used in this thesis.
ℝ : the set of real numbers
ℝ^n : the set of n-dimensional real vectors
ℝ_+ : the set of non-negative real numbers
ℝ^n_+ : the set of n-dimensional real vectors whose elements are non-negative
ℤ : the set of integers
ℤ_+ : the set of non-negative integers
conv(X) : the convex hull of a set X
‖x‖ : the Euclidean norm (or the l2 norm) of a column vector x
‖x‖₁ : the l1 norm of a column vector x
xᵀ : the transpose of a column vector x
[x]⁺ : the projection of a column vector x onto the non-negative orthant
⌊x⌋ : the greatest integer that is less than or equal to a real number x
⌈x⌉ : the least integer that is greater than or equal to a real number x
1{X} : the indicator function of a statement X

Chapter 2
Convergence of Deterministic Time-Average Optimization

In this chapter, the convergence speeds and behaviors of the drift-plus-penalty algorithm that solves deterministic time-average optimization are analyzed. The results in this chapter are based in part on [SHN14, SHN17].

Convex optimization is often used to optimally control communication networks (see [CLCD07] and references therein) and distributed multi-agent systems [NO09b]. This framework utilizes both convexity properties of an objective function and a feasible decision set. However, various systems have inherently discrete (and hence non-convex) decision sets. For example, a wireless system might constrain transmission rates to a finite set corresponding to a fixed set of coding options. Further, distributed agents might only have finitely many decision options. This discreteness restrains the application of convex optimization.

This chapter considers a class of problems called time-average optimization. Let I and J be positive integers. Decision vectors x(t) = (x₁(t), ..., x_I(t)) are chosen sequentially over time slots t ∈ {0, 1, 2, ...} from a decision set X (possibly non-convex and discrete), which is a closed and bounded subset of ℝ^I.

Figure 2.1: Wireless downlink example with non-convex set X
The average of decisions x̄ = lim_{T→∞} (1/T) Σ_{t=0}^{T−1} x(t) solves the following problem:

    Minimize    f(x̄)                                            (2.1)
    Subject to  g_j(x̄) ≤ 0,   j ∈ {1, ..., J}
                x(t) ∈ X,     t ∈ {0, 1, 2, ...},

where f: conv(X) → ℝ and g_j: conv(X) → ℝ are convex functions and conv(X) is the convex hull of X. This time-average optimization reflects scenarios where an objective is in the time-average sense. For example, network users are interested in average bit rates or throughput, and distributed agents are concerned with average actions. The formulation can be considered a fine-granularity version of a one-shot average formulation, where an average decision is chosen, and can be used to extend several convex optimization problems in the literature (see for example [CLCD07] and references therein) to have non-convex decision sets.

An example of the formulation (2.1) is the wireless downlink system shown in Figure 2.1. At each iteration t, a base station sends x_n(t) packets to user n for n ∈ {1, 2, 3}. The objective of the system is to maximize the total rate received by the three users with proportional fairness. Further, the second user is guaranteed a minimum rate of 0.7. This system can be formulated as the following time-average optimization problem:

    Maximize    log(x̄₁) + log(x̄₂) + log(x̄₃)
    Subject to  x̄₂ ≥ 0.7
                (x₁(t), x₂(t), x₃(t)) ∈ X,

where X is defined in Figure 2.1.

The formulation (2.1) has an optimal solution which can be converted (by averaging) to the following convex optimization problem:

    Minimize    f(x)                                            (2.2)
    Subject to  g_j(x) ≤ 0,   j ∈ {1, ..., J}
                x ∈ conv(X).

Note that an optimal solution to formulation (2.2) may not be in the non-convex decision set X. Nevertheless, problems (2.1) and (2.2) have the same optimal value. In addition, directly applying a primal-average technique to the non-convex formulation (2.3), in which the convex hull in (2.2) is removed, may lead to a local optimal solution with respect to the time-average problem (2.1).
For example, when X = {0, 1}, J = 1, f(x) = (x − 2/3)², and g₁(x) = 2/3 − x, the primal-average solution of the technique in [NO09a] is 1, while a solution to problem (2.1) is x̄ = 2/3.

    Minimize    f(x)                                            (2.3)
    Subject to  g_j(x) ≤ 0,   j ∈ {1, ..., J}
                x ∈ X.

This chapter develops an algorithm for the formulation (2.1) and analyzes its convergence time. The algorithm is shown to have O(1/ε²) convergence time under a mild Slater condition. However, inspired by results in [HN11], under a uniqueness assumption on Lagrange multipliers the algorithm is shown to enter two phases: a transient phase and a steady-state phase. Convergence time can be significantly improved by starting the time averages after the transient phase. Specifically, when the dual function satisfies a locally-polyhedral assumption, the modified algorithm has O(1/ε) convergence time (including the time spent in the transient phase), which equals the best known convergence time for constrained convex optimization via first-order methods. On the other hand, when the dual function satisfies a locally-quadratic assumption, the algorithm has O(1/ε^1.5) convergence time. Furthermore, simulations show that these fast convergence times are robust even without the uniqueness assumption. One application of these improved convergence times is the effective implementation of decisions, where decisions are implemented online after an offline calculation during a transient period.

The contributions of this chapter are summarized below.

1. The connection between Lyapunov optimization and a dual subgradient algorithm for a time-average problem with a non-convex decision set is established.

2. The modeling of the one-shot convex optimization (2.2), extensively used in [CLCD07], is generalized to the time-average formulation (2.1) that allows a non-convex decision set, while optimality and complexity are preserved.

3.
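The gap in the example above is easy to verify numerically. The sketch below (illustrative code, not from the thesis) compares the best fixed decision in X = {0, 1} against a schedule that time-shares between 0 and 1 so that its average is 2/3:

```python
# Example from the text: X = {0, 1}, f(x) = (x - 2/3)^2, g(x) = 2/3 - x <= 0.
f = lambda x: (x - 2/3) ** 2
g = lambda x: 2/3 - x

# Best single point of X satisfying the constraint: only x = 1 qualifies.
best_fixed = min((x for x in (0, 1) if g(x) <= 0), key=f)

# Time-average schedule: choose 1 on two of every three slots, 0 otherwise.
T = 3000
schedule = [1 if t % 3 < 2 else 0 for t in range(T)]
xbar = sum(schedule) / T  # equals 2/3

# f(best_fixed) = 1/9, while the time average attains f(xbar) = 0 with
# g(xbar) = 0: no fixed decision in X matches the time-average optimum.
```

This illustrates why the non-convex set X must be convexified (or time-shared over) before primal averaging makes sense.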
Transient and steady-state behaviors of the algorithm solving the time-average problem (2.1) are investigated, and these behaviors are exploited to obtain sequences of decisions that achieve O(ε)-optimal solutions within O(1/ε) and O(1/ε^1.5) iterations under the locally-polyhedral and locally-quadratic assumptions, instead of the standard O(1/ε²) iterations in [NO09a, Nee05].

The chapter is organized as follows. Section 2.1 constructs an algorithm to solve the time-average problem. The general O(1/ε²) convergence time is proven in Section 2.2. Section 2.3 explores faster convergence times of O(1/ε) and O(1/ε^1.5) under the unique Lagrange multiplier assumption. Example problems are given in Section 2.4, including cases when the uniqueness condition fails. Section 2.5 concludes the chapter.

Related Works

Although there have been several techniques utilizing time-average solutions [Nes09, NO09a, Nee05], those works are limited to convex formulations. In fact, this work can be considered a generalization of [NO09a, Nee05], as decisions are allowed to be chosen from a non-convex set. A non-convex optimization problem is considered in [ZM13], where an approximate problem is solved under the assumption of a unique vector of Lagrange multipliers. In comparison, when f(x) and the g_j(x)'s are Lipschitz continuous, the algorithm proposed in this chapter solves problem (2.1) without the uniqueness assumption.

This chapter is inspired by the Lyapunov optimization technique [Nee10], which solves stochastic and time-average optimization problems, including problems such as (2.1). This chapter removes the stochastic characteristic and focuses on the connection between the technique and general convex optimization. This allows a convergence-time analysis of a drift-plus-penalty algorithm that solves problem (2.1). Importantly, this chapter shows that faster convergence can be achieved by starting time averages after a suitable transient period.
Another area of literature focuses on the convergence time of first-order algorithms to an O(ε)-optimal solution of a convex problem, including problem (2.2). For unconstrained optimization without strong convexity of the objective function, the accelerated method (with Lipschitz continuous gradients) has O(1/√ε) convergence time [Nes04, Tse08], while gradient and subgradient methods take O(1/ε) and O(1/ε²) respectively [BV04, NO09a]. Two O(1/ε) first-order methods for constrained optimization are developed in [BNOT14, WO13], but the results rely on special convex formulations. A second-order method for constrained optimization [LXSS13] has a fast convergence rate but also relies on a special convex formulation. All of these results rely on convexity assumptions that do not hold in the formulation (2.1).

2.1 Time-average deterministic optimization

In order to solve problem (2.1), an embedded problem with a similar solution is formulated under the following assumptions.

2.1.1 The extended set Y

Let Y be a closed, bounded, and convex subset of ℝ^I that contains X. Assume the functions f(x), g_j(x) for j ∈ {1, ..., J} extend as real-valued convex functions over x ∈ Y. The set Y can be defined as conv(X) itself. However, choosing Y as a larger set helps to ensure a Slater condition is satisfied (defined below). Further, choosing Y to have a simple structure helps to simplify the resulting optimization. For example, the set Y might be chosen as a closed and bounded hyper-rectangle that contains X in its interior.

2.1.2 Lipschitz continuity and Slater condition

In addition to assuming that f(x) and g_j(x) are convex over x ∈ Y, assume they are Lipschitz continuous, so there is a constant M > 0 such that for all x, y ∈ Y:

    |f(x) − f(y)| ≤ M‖x − y‖                                    (2.4)
    |g_j(x) − g_j(y)| ≤ M‖x − y‖,                               (2.5)

where ‖x‖ = √(x₁² + ⋯ + x_I²) is the Euclidean norm. Further, assume that there exists a vector x̂ ∈ conv(X) that satisfies g_j(x̂) < 0 for all j ∈ {1, ..., J}, and is such that x̂ is in the interior of the set Y.
This is a Slater condition that, among other things, ensures the constraints are feasible for the problem of interest.

2.1.3 Relation to a dual subgradient algorithm

Problem (2.1) can be solved by the Lyapunov optimization technique [Nee10]. It is known that the drift-plus-penalty algorithm in Lyapunov optimization is identical to a classic dual subgradient method [BNO03, NO09a] that solves problem (2.6), with the exception that it takes a time average of primal values.

    Minimize    f(y)                                            (2.6)
    Subject to  g_j(y) ≤ 0,   j ∈ {1, ..., J}
                x_i = y_i,    i ∈ {1, ..., I}
                x ∈ conv(X),  y ∈ Y.

This was noted in [NMR05, HN11] for related problems. Problem (2.6) is called the embedded formulation of the time-average problem (2.1) and is convex. It is not difficult to show that the above problem has an optimal value f^(opt) that is the same as that of problems (2.1) and (2.2). Compared to a formulation in [NO09a], problem (2.6) contains additional equality constraints and the set conv(X) derived from the original decision set. This makes the subsequent analysis and algorithm slightly different from [NO09a], whose results cannot be applied directly.

Now consider the dual of the embedded formulation (2.6). Let vectors w and z be dual variables of the first and second constraints in problem (2.6), where the feasible set of (w, z) is denoted by Λ = ℝ^J_+ × ℝ^I. Let g(y) = (g₁(y), ..., g_J(y)) denote a J-dimensional column vector of the functions g_j(y). The Lagrangian has the following expression:

    L(x, y, w, z) = f(y) + wᵀg(y) + zᵀ(x − y),

where the notation xᵀ denotes the transpose of vector x. Define:

    x*(z) = arg inf_{x ∈ conv(X)} zᵀx   (with x*(z) ∈ X)
    y*(w, z) = arg inf_{y ∈ Y} [f(y) + wᵀg(y) − zᵀy].

Notice that x*(z) may have multiple candidates, including extreme-point solutions, since zᵀx is a linear function. We restrict x*(z) to any of these extreme solutions, which implies x*(z) ∈ X.
Then the dual function is defined as

    d(w, z) = inf_{x ∈ conv(X), y ∈ Y} L(x, y, w, z)            (2.7)
            = f(y*(w, z)) + wᵀg(y*(w, z)) + zᵀ[x*(z) − y*(w, z)].

A pair of subgradients [BNO03] with respect to w and z is:

    ∂_w d(w, z) = g(y*(w, z)),    ∂_z d(w, z) = x*(z) − y*(w, z).

Finally, the dual formulation of the embedded problem (2.6) is

    Maximize    d(w, z)                                         (2.8)
    Subject to  (w, z) ∈ Λ.

Let the optimal value of problem (2.8) be d*. Since problem (2.6) is convex, the duality gap is zero, and d* = f^(opt). Problem (2.8) can be treated by a dual subgradient method [BNO03] with a fixed stepsize 1/V and the restriction x(t) ∈ X, where V > 0 is a parameter. This leads to Algorithm 1, summarized below and called the dual subgradient algorithm, where [·]⁺ is the projection onto the non-negative orthant. Note that the algorithm is different from the one in [NO09a] due to the equality constraints and the restriction on x(t).

    Initialize w(0) and z(0)
    for t = 0, 1, 2, ... do
        Choose x(t) = arg inf_{x ∈ conv(X)} z(t)ᵀx   (with x(t) ∈ X)
        Choose y(t) = arg inf_{y ∈ Y} [f(y) + w(t)ᵀg(y) − z(t)ᵀy]
        Update w(t + 1) = [w(t) + (1/V) g(y(t))]⁺
        Update z(t + 1) = z(t) + (1/V)[x(t) − y(t)]
    end for
    Algorithm 1: Dual subgradient algorithm with a restriction on x(t)

Traditionally, the dual subgradient algorithm of [BNO03] is intended to produce primal vector estimates that converge to a desired result. However, this requires additional assumptions. Indeed, for our problem, the primal vectors x(t) and y(t) do not converge to anything near a solution in many cases, such as when the f(x) and g_j(x) functions are linear or piecewise linear. However, Algorithm 1 ensures that the time averages of x(t) and y(t) converge as desired.

We use the notation w(t) and z(t) from Algorithm 1, with the update rules for w(t + 1) and z(t + 1) given there:

    w(t + 1) = [w(t) + (1/V) g(y(t))]⁺                          (2.9)
    z(t + 1) = z(t) + (1/V)[x(t) − y(t)].                       (2.10)

For ease of notation, define λ(t) ≜ (w(t), z(t)) as the concatenation of these vectors.
Let C be some positive constant such that

    ‖g(y)‖² ≤ C,   ‖x − y‖² ≤ C,   for all x ∈ conv(X), y ∈ Y;  (2.11)

such a constant exists since X and Y are closed and bounded. We first provide some useful properties. It holds that

    ‖λ(t + 1) − λ(t)‖ ≤ √(2C)/V   for all t,                    (2.12)

because

    ‖λ(t + 1) − λ(t)‖² = ‖w(t + 1) − w(t)‖² + ‖z(t + 1) − z(t)‖²
                       ≤ (1/V²)‖g(y(t))‖² + (1/V²)‖x(t) − y(t)‖²   (2.13)
                       ≤ 2C/V²,                                    (2.14)

where (2.13) follows from (2.9)–(2.10) and the non-expansiveness of the projection, and (2.14) follows from the definition of C. Further,

    ‖λ(t + 1)‖² − ‖λ(t)‖² = ‖w(t + 1)‖² + ‖z(t + 1)‖² − ‖w(t)‖² − ‖z(t)‖²
                          ≤ 2C/V² + (2/V) w(t)ᵀg(y(t)) + (2/V) z(t)ᵀ[x(t) − y(t)],

where the last inequality uses the result of expanding the squared norms of (2.9) and (2.10). Since Algorithm 1 chooses x(t) and y(t) to attain the infimum defining d(λ(t)) = d(w(t), z(t)) in (2.7), the above bound and (2.7) imply that

    d(λ(t)) = f(y(t)) + w(t)ᵀg(y(t)) + z(t)ᵀ[x(t) − y(t)]
            ≥ f(y(t)) + (V/2)[‖λ(t + 1)‖² − ‖λ(t)‖²] − C/V.     (2.15)

From convex analysis, the dual function d(λ), defined in (2.7), has the following properties [BNO03]:

- d(λ) ≤ f^(opt) for all λ ∈ Λ.
- If the Slater condition holds, then there are real numbers F > 0 and σ > 0 such that d(λ) ≤ F − σ‖λ‖ for all λ ∈ Λ.
- If the Slater condition holds, then there is an optimal point λ* ∈ Λ, called a Lagrange multiplier vector [BNO03], that maximizes d(λ). Specifically, d(λ*) = f^(opt).

The first two properties can be substituted into inequality (2.15) to ensure that, under Algorithm 1, the following inequalities hold for all time slots t ∈ {0, 1, 2, ...}:

    (V/2)[‖λ(t + 1)‖² − ‖λ(t)‖²] + f(y(t)) ≤ C/V + f^(opt)      (2.16)
    (V/2)[‖λ(t + 1)‖² − ‖λ(t)‖²] + f(y(t)) ≤ C/V + F − σ‖λ(t)‖. (2.17)

2.2 General convergence result

Define the average of variables {a(t)}_{t=0}^{T−1} for any T ∈ {1, 2, ...} as

    ā(T) ≜ (1/T) Σ_{t=0}^{T−1} a(t).

Theorem 1 Let {x(t), w(t), z(t)}_{t=0}^{∞} be a sequence generated by Algorithm 1. For T ∈ {1, 2, ...}, we have

    f(x̄(T)) − f^(opt) ≤ (V/2T)[‖λ(0)‖² − ‖λ(T)‖²] + C/V + (VM/T)‖z(T) − z(0)‖   (2.18)

    g_j(x̄(T)) ≤ (V/T)|w_j(T) − w_j(0)| + (VM/T)‖z(T) − z(0)‖   for all j ∈ {1, ..., J},   (2.19)

where M is the Lipschitz constant from (2.4)–(2.5).
Proof: For the first part, we have from the Lipschitz property (2.4):

    f(x̄(T)) − f^(opt) ≤ [f(ȳ(T)) − f^(opt)] + M‖ȳ(T) − x̄(T)‖.   (2.20)

We first upper bound f(ȳ(T)) − f^(opt) on the right-hand side of (2.20). Let {x(t), y(t), w(t), z(t)}_{t=0}^{∞} be a sequence generated by Algorithm 1. Relation (2.16) can be rewritten as

    f(y(t)) − f^(opt) ≤ C/V + (V/2)[‖λ(t)‖² − ‖λ(t + 1)‖²].

Summing from t = 0 to t = T − 1 and dividing by T gives:

    (1/T) Σ_{t=0}^{T−1} f(y(t)) − f^(opt) ≤ C/V + (V/2T)[‖λ(0)‖² − ‖λ(T)‖²].

Using Jensen's inequality and the convexity of f gives:

    f(ȳ(T)) − f^(opt) ≤ (V/2T)[‖λ(0)‖² − ‖λ(T)‖²] + C/V.        (2.21)

For ‖ȳ(T) − x̄(T)‖ in (2.20), we consider the update equation of z(t) in (2.10). Summing from t = 0 to t = T − 1 yields z_i(T) − z_i(0) = (1/V) Σ_{t=0}^{T−1} [x_i(t) − y_i(t)] for every i. Rearranging and dividing by T gives:

    x̄_i(T) − ȳ_i(T) = (V/T)[z_i(T) − z_i(0)]   for all i ∈ {1, ..., I}.   (2.22)

Substituting (2.21) and (2.22) into (2.20) proves (2.18).

For the second part, we have from (2.5):

    g_j(x̄(T)) ≤ g_j(ȳ(T)) + M‖ȳ(T) − x̄(T)‖.                    (2.23)

We first bound g_j(ȳ(T)). The update equation of w(t) in (2.9) implies, for every j, that

    w_j(t + 1) = [w_j(t) + (1/V) g_j(y(t))]⁺ ≥ w_j(t) + (1/V) g_j(y(t)),

and hence w_j(t + 1) − w_j(t) ≥ (1/V) g_j(y(t)). Summing from t = 0 to t = T − 1, we have w_j(T) − w_j(0) ≥ (1/V) Σ_{t=0}^{T−1} g_j(y(t)). Dividing by T and using Jensen's inequality and the convexity of g_j gives

    (1/T)[w_j(T) − w_j(0)] ≥ (1/VT) Σ_{t=0}^{T−1} g_j(y(t)) ≥ (1/V) g_j(ȳ(T)).

This shows that

    g_j(ȳ(T)) ≤ (V/T)|w_j(T) − w_j(0)|   for all j ∈ {1, ..., J}.   (2.24)

Substituting (2.24) and (2.22) into (2.23) proves (2.19). ∎

Recall λ(t) = (w(t), z(t)). Theorem 1 can be interpreted when the magnitude of λ(t) is bounded by some finite constant. The next lemma shows that such a constant exists when the Slater condition in Section 2.1.2 holds.
Lemma 1 Suppose V ≥ 1 and w_j(0) = z_i(0) = 0 for all i and j. Then, under Algorithm 1, the Slater condition implies there exists a constant D > 0 (independent of V) such that

    ‖λ(t)‖ = √( Σ_{j=1}^{J} w_j(t)² + Σ_{i=1}^{I} z_i(t)² ) ≤ D   for all t ∈ {0, 1, 2, ...}.

Proof: The Slater condition in Section 2.1.2 implies the existence of σ > 0 such that inequality (2.17) holds. With f^(min) = inf_{y ∈ Y} f(y), we have

    (V/2)[‖λ(t + 1)‖² − ‖λ(t)‖²] ≤ C/V + F − f(y(t)) − σ‖λ(t)‖
                                 ≤ C + F − f^(min) − σ‖λ(t)‖,

where the final inequality uses V ≥ 1. Now assume ‖λ(t)‖ ≥ (C + F − f^(min))/σ. Then:

    (V/2)[‖λ(t + 1)‖² − ‖λ(t)‖²] ≤ 0.

This implies that ‖λ(t + 1)‖ ≤ ‖λ(t)‖ when ‖λ(t)‖ ≥ (C + F − f^(min))/σ. Since ‖λ(0)‖ = 0, we only need to show that ‖λ(t + 1)‖ is always at most (C + F − f^(min))/σ + √(2C) when ‖λ(t)‖ < (C + F − f^(min))/σ. Suppose ‖λ(t)‖ < (C + F − f^(min))/σ. We know from the triangle inequality that

    ‖λ(t + 1)‖ ≤ ‖λ(t)‖ + ‖λ(t + 1) − λ(t)‖ < (C + F − f^(min))/σ + √(2C)/V,

where the last inequality uses (2.12). Since V ≥ 1, we have ‖λ(t + 1)‖ < (C + F − f^(min))/σ + √(2C). This implies that ‖λ(t)‖ ≤ (C + F − f^(min))/σ + √(2C) for all t. Letting D ≜ (C + F − f^(min))/σ + √(2C) proves the lemma. ∎

Therefore, Theorem 1 can be interpreted by fixing ε > 0. One can then set V = 1/ε and fix an integer T ≥ 1/ε². Substituting these values into the bounds (2.18)–(2.19) gives:

    f(x̄(T)) − f^(opt) ≤ O(ε)
    g_j(x̄(T)) ≤ O(ε),

which bound, respectively, the deviation from optimality and the constraint violation. The above implies that Algorithm 1 produces a sequence of actions x(t) ∈ X whose average is within O(ε) of optimality within O(1/ε²) iterations. The next section shows that it is possible to generate an O(ε)-optimal sequence of decisions with a smaller number of iterations by analyzing a transient phase and a steady-state phase of Algorithm 1. Specifically, the number of iterations is O(1/ε) under a locally-polyhedral assumption and O(1/ε^1.5) under a locally-quadratic assumption.
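Before moving on, note that identity (2.22) in the proof of Theorem 1 is purely structural: it follows from telescoping the z-update (2.10) and holds for any realized primal sequences. A quick numerical check of one coordinate (all values here are illustrative):

```python
import random

random.seed(1)
V, T = 10.0, 1000
z0 = 0.3                         # arbitrary z_i(0)
z = z0
xs, ys = [], []
for _ in range(T):
    # arbitrary primal choices; identity (2.22) holds for any of them
    x = random.choice([0, 1, 2, 3])
    y = random.uniform(0.0, 3.0)
    z += (x - y) / V             # the update (2.10) for one coordinate
    xs.append(x)
    ys.append(y)

xbar, ybar = sum(xs) / T, sum(ys) / T
# Telescoping: xbar - ybar = (V/T) * (z(T) - z(0)), as in (2.22).
assert abs((xbar - ybar) - (V / T) * (z - z0)) < 1e-9
```

This is why the gap between the averages of x(t) and y(t) shrinks like O(V/T) whenever the dual variable z(t) stays bounded.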
2.3 Convergence under the uniqueness assumption

With this idea, we analyze the convergence time in the case when the dual function satisfies a locally-polyhedral assumption and the case when it satisfies a locally-quadratic assumption. Both cases use the following mild assumption:

Assumption 1 The dual formulation (2.8) has a unique vector of Lagrange multipliers, denoted by λ* ≜ (w*, z*).

This assumption is assumed throughout Section 2.3 and replaces the Slater assumption (which is no longer needed). Note that this is a mild assumption when practical systems are considered, e.g., [HN11, ES07]. In addition, simulations in Section 2.4 suggest that the algorithm derived in this section still has desirable performance without this uniqueness assumption.

We first provide a general result that will be used later.

Lemma 2 Let {λ(t)}_{t=0}^{∞} be a sequence generated by Algorithm 1. It holds that:

    ‖λ(t + 1) − λ*‖² ≤ ‖λ(t) − λ*‖² + (2/V)[d(λ(t)) − d(λ*)] + 2C/V²,   t ∈ {0, 1, 2, ...}.   (2.25)

Proof: Recall that λ(t) = (w(t), z(t)) and C is a constant that satisfies ‖g(y)‖² ≤ C and ‖x − y‖² ≤ C as in (2.11). Define h(t) ≜ (g(y(t)), x(t) − y(t)) as the concatenated vector of the constraint functions. From the non-expansive property of the projection, we have that

    ‖λ(t + 1) − λ*‖² = ‖([w(t) + (1/V) g(y(t))]⁺, z(t) + (1/V)[x(t) − y(t)]) − λ*‖²
                     ≤ ‖(w(t) + (1/V) g(y(t)), z(t) + (1/V)[x(t) − y(t)]) − λ*‖²
                     = ‖λ(t) + (1/V) h(t) − λ*‖²
                     = ‖λ(t) − λ*‖² + (1/V²)‖h(t)‖² + (2/V)[λ(t) − λ*]ᵀh(t)
                     ≤ ‖λ(t) − λ*‖² + 2C/V² + (2/V)[d(λ(t)) − d(λ*)],   (2.26)

where the last inequality uses the definition of C and the concavity of the dual function (2.7), i.e., d(λ₁) ≤ d(λ₂) + ∂d(λ₂)ᵀ[λ₁ − λ₂] for any λ₁, λ₂ ∈ Λ, with ∂d(λ(t)) = h(t). ∎

2.3.1 Locally-polyhedral dual function

Throughout Section 2.3.1, the dual function (2.7) is assumed to have a locally-polyhedral property, introduced in [HN11], as stated in Assumption 2. A dual function with this property is illustrated in Figure 2.2. The property holds, for example, when f and every g_j are either linear or piecewise linear.
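As an aside on the proof of Lemma 2: the only property of the projection [·]⁺ that it uses is non-expansiveness, |[a]⁺ − [b]⁺| ≤ |a − b| in each coordinate. This is easy to sanity-check directly (illustrative code):

```python
# Non-expansiveness of the projection onto the non-negative orthant,
# checked per coordinate: |max(a,0) - max(b,0)| <= |a - b|.
def proj(v):
    return max(v, 0.0)

cases = [(-1.0, 2.0), (3.0, -0.5), (-2.0, -4.0), (1.0, 5.0), (0.0, -0.0)]
for a, b in cases:
    assert abs(proj(a) - proj(b)) <= abs(a - b)
```

Since λ* has non-negative w-components, projecting the w-update can only move λ(t + 1) closer to λ*, which is exactly the first inequality in (2.26).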
Assumption 2 There exists an L_p > 0 such that the dual function (2.7) satisfies

    d(λ*) ≥ d(λ) + L_p‖λ* − λ‖   for all λ ∈ Λ,                 (2.27)

where λ* is the unique vector of Lagrange multipliers.

Figure 2.2: Illustration of locally-polyhedral and locally-quadratic functions

The "p" subscript in L_p represents "polyhedral." Furthermore, concavity of the dual function (2.7) ensures that if this property holds locally about λ*, it also holds globally for all λ ∈ Λ (see Figure 2.2).

Suppose problem (2.6) satisfies the locally-polyhedral assumption. Define:

    B_p(V) ≜ max{ L_p/(2V), 2C/(V L_p) }.

Lemma 3 Under Assumptions 1 and 2, whenever ‖λ(t) − λ*‖ ≥ B_p(V), it follows that

    ‖λ(t + 1) − λ*‖ ≤ ‖λ(t) − λ*‖ − L_p/(2V).                   (2.28)

Proof: From Lemma 2, suppose the following condition holds:

    (2/V)[d(λ(t)) − d(λ*)] + 2C/V² ≤ −(L_p/V)‖λ(t) − λ*‖ + L_p²/(4V²);   (2.29)

then (2.25) becomes

    ‖λ(t + 1) − λ*‖² ≤ ‖λ(t) − λ*‖² − (L_p/V)‖λ(t) − λ*‖ + L_p²/(4V²)
                     = [‖λ(t) − λ*‖ − L_p/(2V)]².

It follows that if ‖λ(t) − λ*‖ ≥ B_p(V) ≥ L_p/(2V), then (2.28) holds.

It remains to show that condition (2.29) holds when ‖λ(t) − λ*‖ ≥ B_p(V). To this end, we have by the locally-polyhedral property (2.27):

    d(λ(t)) − d(λ*) ≤ −L_p‖λ(t) − λ*‖
                    = −(L_p/2)‖λ(t) − λ*‖ − (L_p/2)‖λ(t) − λ*‖
                    ≤ −(L_p/2)‖λ(t) − λ*‖ − (L_p/2) B_p(V)
                    ≤ −(L_p/2)‖λ(t) − λ*‖ − C/V.

The above implies condition (2.29) and proves the lemma. ∎

Lemma 3 implies that, if the distance between λ(t) and λ* is at least B_p(V), the successor λ(t + 1) will be closer to λ*. This suggests the existence of a convergence set in which a subsequence of {λ(t)}_{t=0}^{∞} resides. Note that √(2C)/V bounds ‖λ(t + 1) − λ(t)‖ for all t, as shown in (2.12). The steady state of Algorithm 1 is defined from this convergence set, which is

    R_p(V) ≜ { λ ∈ Λ : ‖λ − λ*‖ ≤ B_p(V) + √(2C)/V }.           (2.30)

Let T_p be the first iteration at which the generated dual variable enters this set:

    T_p ≜ inf{ t ≥ 0 : λ(t) ∈ R_p(V) }.                         (2.31)

Intuitively, T_p is the end of the transient phase and the beginning of the steady-state phase.

Lemma 4 Under Assumptions 1 and 2, it holds that T_p ≤ O(V).
Then we show that dual variables generated after iteration T p never leaveR p (V ). Lemma 5 Under Assumptions 1 and 2, the generated dual variables from Algorithm 1 satisfy (t)2R p (V ) for all tT p . Proof: We prove the lemma by induction. First we note that (T p )2R p (V ) by the denition of T p in (2.31). Suppose that (t)2R p (V ). Then two cases are considered. i) Ifk(t) kB p (V ), it follows from (2.28) that k(t + 1) kk(t) k L p 2V B p (V ) + p 2C V : ii) Ifk(t) kB p (V ), it follows from the triangle inequality that k(t + 1) kk(t + 1)(t)k +k(t) k p 2C V +B p (V ); by (2.12) and the assumption ofk(t) k. Hence, (t + 1)2R p (V ) in both cases. This proves the lemma by induction. Finally, a convergence result is ready to be stated. Dene an average of sequence fa(t)g Tp+T1 t=Tp that starts from T p as a Tp (T ), 1 T Tp+T1 X t=Tp a(t): 29 Theorem 2 Under Assumptions 1 and 2, forT > 0, letfx(t);w(t)g 1 t=Tp be a subsequence generated by Algorithm 1, where T p is dened in (2.31). The following bounds hold: f(x Tp (T ))f (opt) C V + 2VM T p 2C V +B p (V ) + V 2T ( p 2C V +B p (V ) 2 + 4k k " p 2C V +B p (V ) #) (2.32) g j (x Tp (T )) 2V (1 +M) T p 2C V +B p (V ) ; for all j2f1;:::;Jg: (2.33) Proof: The rst part of the theorem follows from (2.18) with the average starting from T p that f(x Tp (T ))f (opt) C V + V 2T h k(T p )k 2 k(T p +T )k 2 i + VM T kz(T p +T )z(T p )k: (2.34) For any 2 , it holds that: kk 2 =k k 2 +k k 2 + 2[ ] > : The second term on the right-hand-side of (2.34) can be upper bounded by applying this equality. k(T p )k 2 k(T p +T )k 2 =k(T p ) k 2 + 2[(T p ) ] > k(T p +T ) k 2 2[(T p +T ) ] > k(T p ) k 2 + 2[(T p )(T p +T )] > k(T p ) k 2 + 2k(T p )(T p +T )kk k (2.35) 30 From Lemma 5, the rst term of (2.35) is bounded byk(T p ) k 2 [ p 2C=V +B p (V )] 2 . 
From the triangle inequality and Lemma 5, the last term of (2.35) is bounded via

    ‖λ(T_p + T) − λ(T_p)‖ ≤ ‖λ(T_p + T) − λ*‖ + ‖λ* − λ(T_p)‖ ≤ 2[√(2C)/V + B_p(V)].   (2.36)

Therefore, inequality (2.35) is bounded from above by

    [√(2C)/V + B_p(V)]² + 4‖λ*‖[√(2C)/V + B_p(V)].

Substituting this bound into (2.34) and using the fact that

    ‖z(T_p + T) − z(T_p)‖ ≤ ‖λ(T_p + T) − λ(T_p)‖ ≤ 2[√(2C)/V + B_p(V)]

proves the first part of the theorem.

The last part follows from (2.19):

    g_j(x̄_{T_p}(T)) ≤ (V/T)|w_j(T_p + T) − w_j(T_p)| + (VM/T)‖z(T_p + T) − z(T_p)‖.

Since ‖λ(T_p + T) − λ(T_p)‖ upper bounds both |w_j(T_p + T) − w_j(T_p)| and ‖z(T_p + T) − z(T_p)‖, the above inequality is upper bounded by

    g_j(x̄_{T_p}(T)) ≤ (V(1 + M)/T)‖λ(T_p + T) − λ(T_p)‖ ≤ (2V(1 + M)/T)[√(2C)/V + B_p(V)],

where the last inequality uses (2.36). This proves the last part of the theorem. ∎

Theorem 2 can be interpreted as follows. The deviation from optimality in (2.32) is bounded above by O(1/V + 1/T). The constraint violation in (2.33) is bounded above by O(1/T). To have both bounds be within O(ε), we set V = 1/ε and T = 1/ε, and the convergence time of Algorithm 1 is O(1/ε). Note that both bounds consider the average starting after reaching the steady state at time T_p, and this transient time T_p is at most O(1/ε).

2.3.2 Locally-quadratic dual function

Throughout Section 2.3.2, the dual function (2.7) is assumed to have a locally-quadratic property, introduced in [HN11], as stated in Assumption 3 and illustrated in Figure 2.2.

Assumption 3 Let λ* be the unique Lagrange multiplier. There exist N > 0 and L_q > 0 such that whenever λ ∈ Λ and ‖λ − λ*‖ ≤ N, the dual function (2.7) satisfies

    d(λ*) ≥ d(λ) + L_q‖λ − λ*‖².                                (2.37)

Also, there exists D_q > 0 such that whenever λ ∈ Λ and d(λ*) − d(λ) ≤ D_q, the dual variable satisfies ‖λ − λ*‖ ≤ N.

The "q" subscript in L_q represents "quadratic." Suppose problem (2.6) satisfies the locally-quadratic assumption.
Define:

    B_q(V) ≜ max{ 1/V^1.5, [√V + √(V + 4L_q C V)] / (2 L_q V) }.

Lemma 6 Under Assumptions 1 and 3, for V sufficiently large that B_q(V) < N, whenever B_q(V) ≤ ‖λ(t) − λ*‖ ≤ N, it follows that

    ‖λ(t + 1) − λ*‖ ≤ ‖λ(t) − λ*‖ − 1/V^1.5.                    (2.38)

Proof: From Lemma 2, suppose the following condition holds:

    (2/V)[d(λ(t)) − d(λ*)] + 2C/V² ≤ −(2/V^1.5)‖λ(t) − λ*‖ + 1/V³;   (2.39)

then (2.25) becomes

    ‖λ(t + 1) − λ*‖² ≤ ‖λ(t) − λ*‖² − (2/V^1.5)‖λ(t) − λ*‖ + 1/V³
                     = [‖λ(t) − λ*‖ − 1/V^1.5]².

Furthermore, if ‖λ(t) − λ*‖ ≥ B_q(V) ≥ 1/V^1.5, then the desired inequality (2.38) holds.

It remains to show that condition (2.39) holds when B_q(V) ≤ ‖λ(t) − λ*‖ ≤ N. Condition (2.39) holds when

    d(λ(t)) − d(λ*) ≤ −C/V − (1/√V)‖λ(t) − λ*‖.

By the locally-quadratic property (2.37), if −L_q‖λ(t) − λ*‖² ≤ −C/V − (1/√V)‖λ(t) − λ*‖, then the above inequality holds. This means that condition (2.39) holds when

    L_q‖λ(t) − λ*‖² − (1/√V)‖λ(t) − λ*‖ − C/V ≥ 0.

This last inequality holds when

    ‖λ(t) − λ*‖ ≥ [1/√V + √(1/V + 4L_q C/V)] / (2L_q) = [√V + √(V + 4L_q C V)] / (2 L_q V).

This proves the lemma. ∎

Lemma 6 suggests the existence of a convergence set. The steady state of Algorithm 1 is defined from this set, which is

    R_q(V) ≜ { λ ∈ Λ : ‖λ − λ*‖ ≤ B_q(V) + √(2C)/V }.           (2.40)

Let T_q denote the first iteration at which the generated dual variable arrives at the convergence set:

    T_q ≜ inf{ t ≥ 0 : λ(t) ∈ R_q(V) }.                         (2.41)

Lemma 7 Under Assumptions 1 and 3, when V is sufficiently large that B_q(V) < N, it holds that T_q ≤ O(V^1.5).

Proof: We first show that there exists t₀ ≤ O(V) such that ‖λ(t₀) − λ*‖ ≤ N. Specifically, we show that, for any δ > 0,

    d(λ*) − max_{0 ≤ t ≤ E_δ(V)} d(λ(t)) ≤ C/V + δ/2,           (2.42)

where E_δ(V) ≜ ⌈V‖λ(0) − λ*‖²/δ⌉.

This is proved by contradiction. Suppose inequality (2.42) does not hold, i.e.,

    d(λ*) − d(λ(t)) > C/V + δ/2   for all 0 ≤ t ≤ E_δ(V).

From (2.25), it follows that for 0 ≤ t ≤ E_δ(V)

    ‖λ(t + 1) − λ*‖² ≤ ‖λ(t) − λ*‖² + 2C/V² − (2/V)[C/V + δ/2] ≤ ‖λ(t) − λ*‖² − δ/V.

Summing from t = 0 to t = E_δ(V) yields:

    ‖λ(E_δ(V) + 1) − λ*‖² ≤ ‖λ(0) − λ*‖² − [E_δ(V) + 1]δ/V,

and hence E_δ(V) + 1 ≤ V‖λ(0) − λ*‖²/δ. This contradicts the definition of E_δ(V). Thus, property (2.42) holds.

Letting δ = D_q and V > 2C/D_q, we have d(λ*) − d(λ(t)) ≤ D_q for some 0 ≤ t ≤ E_δ(V).
Then, from Assumption 3, we have ‖λ(t) − λ*‖ ≤ N, and by the definition of E_δ(V), it takes at most O(V) iterations to arrive where the locally-quadratic assumption holds. Lemma 6 then implies that the algorithm needs at most O(V^1.5) further iterations to enter the convergence set. ∎

Next we show that, once the sequence of dual variables enters R_q(V), it never leaves the set.

Lemma 8 Under Assumptions 1 and 3, when V is sufficiently large that B_q(V) + √(2C)/V < N, the dual variables generated by Algorithm 1 satisfy λ(t) ∈ R_q(V) for all t ≥ T_q.

Proof: We prove the lemma by induction. First, note that λ(T_q) ∈ R_q(V) by its definition. Suppose that λ(t) ∈ R_q(V), which implies that ‖λ(t) − λ*‖ ≤ B_q(V) + √(2C)/V < N. Two cases are considered.

i) If ‖λ(t) − λ*‖ > B_q(V), it follows from (2.38) that

    ‖λ(t + 1) − λ*‖ ≤ ‖λ(t) − λ*‖ − 1/V^1.5 < B_q(V) + √(2C)/V.

ii) If ‖λ(t) − λ*‖ ≤ B_q(V), it follows from the triangle inequality and (2.12) that

    ‖λ(t + 1) − λ*‖ ≤ ‖λ(t + 1) − λ(t)‖ + ‖λ(t) − λ*‖ ≤ √(2C)/V + B_q(V).

Hence, λ(t + 1) ∈ R_q(V) in both cases. This proves the lemma by induction. ∎

Now the convergence result for the steady state is ready to be stated.

Theorem 3 Under Assumptions 1 and 3, when V is sufficiently large that B_q(V) + √(2C)/V < N, for T > 0, let {x(t), w(t)}_{t=T_q}^{∞} be a subsequence generated by Algorithm 1, where T_q is defined in (2.41). The following bounds hold:

    f(x̄_{T_q}(T)) − f^(opt) ≤ C/V + (2VM/T)[√(2C)/V + B_q(V)]
        + (V/2T){ [√(2C)/V + B_q(V)]² + 4‖λ*‖[√(2C)/V + B_q(V)] }   (2.43)

    g_j(x̄_{T_q}(T)) ≤ (2V(1 + M)/T)[√(2C)/V + B_q(V)]   for all j ∈ {1, ..., J}.   (2.44)

Proof: The first part of the theorem follows from (2.18), with the average starting from T_q, so that

    f(x̄_{T_q}(T)) − f^(opt) ≤ C/V + (V/2T)[‖λ(T_q)‖² − ‖λ(T_q + T)‖²]
        + (VM/T)‖z(T_q + T) − z(T_q)‖.                          (2.45)

The second term on the right-hand side of (2.45) can be bounded from above by

    ‖λ(T_q)‖² − ‖λ(T_q + T)‖² ≤ ‖λ(T_q) − λ*‖² + 2[λ(T_q) − λ(T_q + T)]ᵀλ*
      ≤ ‖λ(T_q) − λ*‖² + 2‖λ(T_q) − λ(T_q + T)‖‖λ*‖
      ≤ [√(2C)/V + B_q(V)]² + 4‖λ*‖[√(2C)/V + B_q(V)],          (2.46)

where the derivation follows the same steps as (2.35) and (2.36).
The last term on the right-hand side of (2.45) can be bounded from above by

    ‖z(T_q + T) − z(T_q)‖ ≤ 2[√(2C)/V + B_q(V)].                (2.47)

Substituting bounds (2.46) and (2.47) into (2.45) proves the first part of the theorem.

The last part follows from (2.19):

    g_j(x̄_{T_q}(T)) ≤ (V/T)|w_j(T_q + T) − w_j(T_q)| + (VM/T)‖z(T_q + T) − z(T_q)‖.

Since ‖λ(T_q + T) − λ(T_q)‖ upper bounds both |w_j(T_q + T) − w_j(T_q)| and ‖z(T_q + T) − z(T_q)‖, the above inequality is upper bounded by

    g_j(x̄_{T_q}(T)) ≤ (V(1 + M)/T)‖λ(T_q + T) − λ(T_q)‖ ≤ (2V(1 + M)/T)[√(2C)/V + B_q(V)].

This proves the last part of the theorem. ∎

Theorem 3 can be interpreted as follows. The deviation from optimality in (2.43) is at most O(1/V + √V/T). The constraint violation in (2.44) is bounded above by O(√V/T). To have both bounds be within O(ε), we set V = 1/ε and T = 1/ε^1.5, and the convergence time of Algorithm 1 is O(1/ε^1.5). Note that both bounds consider the average starting after reaching the steady state at time T_q, and this transient time T_q is at most O(1/ε^1.5).

2.3.3 Staggered time averages

In order to take advantage of the improved convergence rates, the computation of time averages must be started after the transient phase. To achieve this performance without determining the exact end time of the transient phase, time averages can be restarted over successive frames whose lengths increase geometrically. For example, if one triggers a restart at times 2^k for integers k, then a restart is guaranteed to occur within a factor of 2 of the actual end of the transient phase.

2.3.4 Summary of convergence results

The results in Theorems 1, 2, and 3 (denoted by General, Polyhedral, and Quadratic) are summarized in Table 2.1. Note that the general convergence result is considered to be in the steady state from the beginning.

Table 2.1: Summary of Convergence Times

                     General      Polyhedral    Quadratic
    Transient state  N/A          O(1/ε)        O(1/ε^1.5)
    Steady state     O(1/ε²)      O(1/ε)        O(1/ε^1.5)
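A minimal sketch of the restart rule from Section 2.3.3 (the function name and the toy stream are illustrative, not from the thesis): the running average is reset at times 2^k, so one of the resets lands within a factor of two of the unknown transient length.

```python
def staggered_averages(stream):
    """Yield the running average of the current frame; frames restart at t = 2^k."""
    total, count, next_restart = 0.0, 0, 1
    for t, value in enumerate(stream):
        if t == next_restart:        # restart the average at t = 1, 2, 4, 8, ...
            total, count = 0.0, 0
            next_restart *= 2
        total += value
        count += 1
        yield total / count

# Toy stream: a transient of large samples followed by steady-state samples of 1.
stream = [10.0] * 5 + [1.0] * 59
averages = list(staggered_averages(stream))
# After the restart at t = 32, the frame contains only steady-state samples,
# so the final running averages equal 1 exactly.
```

The geometric frame lengths cost at most a constant factor in convergence time, which is why the transient times in Table 2.1 do not change the overall orders.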
2.4 Sample problems

This section illustrates the convergence times of Algorithm 1 and the staggered algorithm under the locally-polyhedral and locally-quadratic assumptions. The considered formulation is

Minimize    f(x̄)                                             (2.48)
Subject to  2x̄_1 + x̄_2 ≥ 1.5,   x̄_1 + 2x̄_2 ≥ 1.5
            x_1(t), x_2(t) ∈ {0, 1, 2, 3},   t ∈ {0, 1, 2, …},

where the function f will be given for the different cases.

Under the locally-polyhedral assumption, let f(x̄) = 1.5x̄_1 + x̄_2 be the objective function of problem (2.48). In this setting, the optimal value is 1.25, attained when x̄_1 = x̄_2 = 0.5. Figure 2.3 shows the values of the objective and constraint functions of time-average solutions.

[Figure 2.3: Convergence of Algorithm 1 and the staggered algorithm that solve problem (2.48) with f(x̄) = 1.5x̄_1 + x̄_2; the two panels plot the objective cost and the constraint values over t ∈ [0, 20000].]

It is easy to see the faster convergence time O(1/ε) of the polyhedral result (T_p = 2048) compared to the general result with convergence time O(1/ε²).

Under the locally-quadratic assumption, let f(x̄) = x̄_1² + x̄_2² be the objective function of problem (2.48). Note that the optimal value of this problem is 0.5, where x̄_1 = x̄_2 = 0.5. Figure 2.4 shows the values of the objective and constraint functions of time-average solutions. The quadratic result, starting the average at the (T_q =) 8192nd iteration, converges faster than the general result. This illustrates the difference between O(1/ε²) and O(1/ε^1.5).

Figures 2.5 and 2.6 illustrate the convergence times of problems, defined in each figure's caption, without the uniqueness assumption. The comparison of Figures 2.3 and 2.5 shows that there is no difference in the order of convergence time. Similarly, Figures 2.4 and 2.6 show no difference in the order of convergence.
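The behavior in Figure 2.3 can be reproduced qualitatively in a few lines. The sketch below is our own re-implementation of a standard drift-plus-penalty (dual subgradient) iteration for problem (2.48) with f(x̄) = 1.5x̄_1 + x̄_2; it may differ from Algorithm 1 in minor details, and V and T are illustrative choices, but its running time average approaches the optimal value 1.25.

```python
import itertools

V, T = 100.0, 20000
X = list(itertools.product(range(4), repeat=2))  # non-convex decision set {0,1,2,3}^2

def g(x):
    """Constraint functions of (2.48) written in the form g_j(x) <= 0."""
    return (1.5 - 2 * x[0] - x[1], 1.5 - x[0] - 2 * x[1])

w = [0.0, 0.0]                # dual variables (virtual queues)
sum_x = [0.0, 0.0]
for t in range(T):
    # primal step: minimize V*f(x) + w^T g(x) over the discrete set X
    xt = min(X, key=lambda x: V * (1.5 * x[0] + x[1])
             + w[0] * g(x)[0] + w[1] * g(x)[1])
    w = [max(0.0, wj + gj) for wj, gj in zip(w, g(xt))]   # dual/queue update
    sum_x = [s + v for s, v in zip(sum_x, xt)]

xbar = [s / T for s in sum_x]         # time-average solution
fbar = 1.5 * xbar[0] + xbar[1]        # approaches the optimum 1.25
```

Note that every per-slot decision is a corner of the discrete set, yet the time average lands near the fractional optimizer (0.5, 0.5), which is the central point of the chapter.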
[Figure 2.4: Convergence of Algorithm 1 and the staggered algorithm that solve problem (2.48) with f(x̄) = x̄_1² + x̄_2².]

[Figure 2.5: Convergence of Algorithm 1 and the staggered algorithm that solve problem (2.48) with f(x̄) = 1.5x̄_1 + x̄_2 and an additional constraint x̄_1 + x̄_2 ≥ 1.]

[Figure 2.6: Convergence of Algorithm 1 and the staggered algorithm that solve problem (2.48) with f(x̄) = x̄_1² + x̄_2² and an additional constraint x̄_1 + x̄_2 ≥ 1.]

2.5 Chapter summary

We consider the time-average optimization problem with a non-convex (possibly discrete) decision set. We show that the problem has a corresponding (one-shot) convex optimization formulation. This connects the Lyapunov optimization technique and convex optimization theory. Using convex analysis, we prove a general convergence time of O(1/ε²) when the Slater condition holds.
Under an assumption on the uniqueness of Lagrange multipliers, we prove that faster convergence times O(1/ε) and O(1/ε^1.5) are possible for problems having locally-polyhedral and locally-quadratic structures, respectively.

Chapter 3

Convergence of Stochastic Time-Average Optimization

In this chapter, a stochastic time-average optimization is considered instead of the deterministic setting in Chapter 2. The convergence speed and behavior of the drift-plus-penalty algorithm are analyzed. The results in this chapter are based in part on [SN15d, SN14].

Stochastic network optimization can be used to design dynamic algorithms that optimally control communication networks [Nee10]. The technique has several unique properties which do not exist in a traditional convex optimization setting. In particular, the technique allows for a time-varying and possibly non-convex decision set. For example, it can treat a packet switch that makes binary (0/1) scheduling decisions, or a wireless system with varying channels and decision sets.

This chapter considers time-average stochastic optimization, which is useful for example problems of network utility maximization [NML08, GNT06, Sto06, ES07], energy minimization [Nee06a, LLS07], and quality-of-information maximization [SN15c]. The time-average stochastic optimization is more complex than the time-average deterministic optimization in Chapter 2, and the proofs of convergence differ due to the stochastic setting.

Time t ∈ {0, 1, 2, …} is slotted. Define S to be a finite or countably infinite sample space of random states. Let S(t) ∈ S denote the random state at time t. The random state S(t) is assumed to be independent and identically distributed (i.i.d.) across time slots. The steady-state probability of s ∈ S is denoted by π_s. Let I and J be any positive integers. Each slot t, a decision vector x(t) = (x_1(t), …, x_I(t)) is chosen from a decision set X_{S(t)}.
For any positive integer T, define the expected time average x̄(T) as

x̄(T) ≜ (1/T) Σ_{t=0}^{T−1} E[x(t)].

The goal is to make decisions over time to solve:

Minimize    lim sup_{T→∞} f(x̄(T))                                   (3.1)
Subject to  lim sup_{T→∞} g_j(x̄(T)) ≤ 0,   j ∈ {1,…,J}
            x(t) ∈ X_{S(t)},   t ∈ {0, 1, 2, …}.

Here it is assumed that X_s is a compact subset of R^I for each s ∈ S. Assume ∪_{s∈S} X_s is bounded, and let 𝒞 be a compact set that contains it. The functions f and g_j are convex functions from 𝒞̄ to R, where Ā denotes the convex hull of a set A. Results in [Nee10] imply that the optimal point can be achieved by an ergodic policy for which the limiting time-average expectation exists.

An example of formulation (3.1) is a resource allocation problem in a stochastic wireless uplink network. The goal is to maximize the rates of three users with proportional fairness:

Minimize    lim sup_{T→∞} −Σ_{i=1}^{3} log(x̄_i(T))
Subject to  lim sup_{T→∞} [x_i^(min) − x̄_i(T)] ≤ 0,   i ∈ {1, 2, 3}
            x(t) ∈ X_{S(t)},   t ∈ {0, 1, 2, …},

where x_i^(min) is the minimum rate for user i. In this example, S = {1, 2}, π_1 = 0.3, π_2 = 0.7, X_1 = {(0,0,0), (2,1,0), (0,2,2)}, and X_2 = {(0,0,0), (0,1,2), (1,1,1)}.

Solving formulation (3.1) using the stochastic network optimization framework does not require any statistical knowledge of the random states. However, if the steady-state probabilities are known, the optimal objective cost of formulation (3.1) is identical to the optimal cost of the following problem:

Minimize    f(x)                                                      (3.2)
Subject to  g_j(x) ≤ 0,   j ∈ {1,…,J}
            x ∈ X̂,

where X̂ ≜ Σ_{s∈S} π_s X̄_s. Note that, for any α, β ∈ R and any sets A and B, the notation αA + βB = {αa + βb : a ∈ A, b ∈ B}.

Formulation (3.2) is convex; however, its optimal solution may not be in any of the sets X_s. In fact, determining whether a point x is a member of X̂ may already be a difficult task. This illustrates that traditional and state-of-the-art techniques for solving convex optimization cannot be applied directly to solve problem (3.1). This chapter considers a drift-plus-penalty algorithm, developed in [Nee10], that solves formulation (3.1).
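For the finite uplink example above, the averaged set appearing in (3.2) is easy to tabulate before any convex hulls are taken: each choice of one point per state contributes the point π_1 x + π_2 x′. A small sketch (variable names are ours):

```python
from itertools import product

pi1, pi2 = 0.3, 0.7
X1 = [(0, 0, 0), (2, 1, 0), (0, 2, 2)]
X2 = [(0, 0, 0), (0, 1, 2), (1, 1, 1)]

# Points of the weighted Minkowski sum pi1*X1 + pi2*X2; the set X-hat of
# problem (3.2) is built from these (after convexification of each X_s).
points = sorted({
    tuple(round(pi1 * a + pi2 * b, 10) for a, b in zip(x1, x2))
    for x1, x2 in product(X1, X2)
})
# The 3 x 3 combinations give 9 distinct averaged rate vectors here.
```

Even in this tiny example, membership in X̂ is a question about a weighted sum of convex hulls rather than about any single X_s, which is why per-slot decisions cannot simply be read off an optimizer of (3.2).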
The algorithm is shown to have an O(1/ε²) convergence time in [Nee14]. Note that a deterministic version of formulation (3.1) and its corresponding convergence are studied in the previous chapter. Despite the similar analysis procedures, the analysis in this chapter is more challenging due to stochastic events and multiple decision sets.

Inspired by the analysis in [HN11], the drift-plus-penalty algorithm is shown to have a transient phase and a steady-state phase. These phases are analyzed in two cases that depend on the structure of a dual function. The first case is when the dual function satisfies a locally-polyhedral assumption, and the transient time is O(1/ε). The second case is when the dual function satisfies a locally-quadratic assumption, and the transient time is O(1/ε^1.5). Then, under a uniqueness assumption on the vector of Lagrange multipliers, if the time average starts in the steady-state phase, a solution converges in O(1/ε) and O(1/ε^1.5) time slots under the locally-polyhedral and locally-quadratic assumptions, respectively.

Even though this characterization of the time complexity is important in its own right, it can also be used to improve the convergence speed of a system if enough prior samples of random states are available. This can be done by using those samples in the transient phase and only implementing decisions in the steady-state phase. Recent work in this direction is [HLH14], which considers methods to improve transient times that can likely be used in conjunction with the results in this chapter.

The chapter is organized as follows. Section 3.1 constructs an algorithm solving problem (3.1). The behavior and properties of the algorithm are analyzed in Section 3.2. Section 3.3 analyzes the transient phase and the steady-state phase under the locally-polyhedral assumption. Results under the locally-quadratic assumption are analyzed in Section 3.4. Simulations are performed in Section 3.5. Section 3.6 concludes the chapter.
3.1 Time-average stochastic optimization

A solution to problem (3.1) can be obtained through an auxiliary problem, which is formulated in such a way that its optimal solution is also an optimal solution to the time-average problem. To formulate this auxiliary problem, an additional set and mild assumptions are defined. First of all, it can be shown that X̂ is compact.

Assumption 4 There exists a vector x̂ in the interior of X̂ that satisfies g_j(x̂) < 0 for all j ∈ {1,…,J}.

In convex optimization, Assumption 4 is a Slater condition, which is a sufficient condition for strong duality [BNO03].

Define the extended set Y to be a compact and convex subset of R^I that contains X̂. The set Y can be X̂ itself, but it can also be chosen as a hyper-rectangle to simplify the algorithm given later. Define ‖·‖ as the Euclidean norm.

Assumption 5 The functions f and g_j for j ∈ {1,…,J} are convex and Lipschitz continuous on the extended set Y, so there exists a constant M > 0 such that for any x, y ∈ Y:

|f(x) − f(y)| ≤ M‖x − y‖   (3.3)
|g_j(x) − g_j(y)| ≤ M‖x − y‖.   (3.4)

We assume that Assumptions 4 and 5 hold throughout this chapter.

3.1.1 Auxiliary formulation

For a function a(x(t)) of the vector x(t), define the average of function values as

ā(x) ≜ lim_{T→∞} (1/T) Σ_{t=0}^{T−1} E[a(x(t))].

Recall that problem (3.1) can be achieved by an ergodic policy for which the limiting time-average expectation exists. The time-average stochastic optimization (3.1) is solved by considering an auxiliary formulation, which is stated in terms of well-defined limiting expectations for simplicity.

Minimize    f̄(y)                                                      (3.5)
Subject to  ḡ_j(y) ≤ 0,   j ∈ {1,…,J}
            lim_{T→∞} x̄_i(T) = lim_{T→∞} ȳ_i(T),   i ∈ {1,…,I}
            (x(t), y(t)) ∈ X_{S(t)} × Y,   t ∈ {0, 1, 2, …}.

This formulation introduces the auxiliary vector y(t). The second constraint ties lim_{T→∞} x̄(T) and lim_{T→∞} ȳ(T) together, so the original objective function and constraints of problem (3.1) are preserved in problem (3.5). Let f^(opt) be the optimal objective cost of problem (3.1).
Theorem 4 The time-average stochastic problem (3.1) and the auxiliary problem (3.5) have the same optimal cost, f^(opt).

Proof: Let f̂^(opt) be the optimal objective cost of the auxiliary problem (3.5). We show that f̂^(opt) = f^(opt).

Let {x*(t)}_{t=0}^∞ be an optimal solution, generated by an ergodic policy, to problem (3.1) such that:

lim_{T→∞} f(x̄*(T)) = f^(opt)
lim_{T→∞} g_j(x̄*(T)) ≤ 0,   j ∈ {1,…,J}
x*(t) ∈ X_{S(t)},   t ∈ {0, 1, 2, …}.

Consider the solution {x(t), y(t)}_{t=0}^∞ to problem (3.5) given by:

x(t) = x*(t),   y(t) = lim_{T→∞} x̄*(T),   t ∈ {0, 1, 2, …}.

It is easy to see that this solution satisfies the last two constraints of problem (3.5). For the first constraint, it follows from Lipschitz continuity that

ḡ_j(y) = g_j(lim_{T→∞} x̄*(T)) = lim_{T→∞} g_j(x̄*(T)) ≤ 0   for all j ∈ {1,…,J}.

Therefore, this solution is feasible, and the objective cost of problem (3.5) is

f̄(y) = f(lim_{T→∞} x̄*(T)) = lim_{T→∞} f(x̄*(T)) = f^(opt).

This implies that f̂^(opt) ≤ f^(opt).

Alternatively, let {x*(t), y*(t)}_{t=0}^∞ be an optimal solution to problem (3.5) such that:

f̄(y*) = f̂^(opt)
ḡ_j(y*) ≤ 0,   j ∈ {1,…,J}
lim_{T→∞} x̄*(T) = lim_{T→∞} ȳ*(T)
(x*(t), y*(t)) ∈ X_{S(t)} × Y,   t ∈ {0, 1, 2, …}.

Consider the solution {x(t)}_{t=0}^∞ to problem (3.1) given by x(t) = x*(t) for t ∈ {0, 1, 2, …}. It is easy to see that this solution satisfies the last constraint of problem (3.1). For the first constraint, the convexity of g_j implies that

g_j(lim_{T→∞} x̄(T)) = g_j(lim_{T→∞} ȳ*(T)) ≤ ḡ_j(y*) ≤ 0   for all j ∈ {1,…,J}.

Hence, this solution is feasible. The objective cost of problem (3.1) follows from the convexity of f:

f(lim_{T→∞} x̄(T)) = f(lim_{T→∞} ȳ*(T)) ≤ f̄(y*) = f̂^(opt).

This implies that f^(opt) ≤ f̂^(opt). Combining the above results gives f̂^(opt) = f^(opt). ∎

3.1.2 Lyapunov optimization

The auxiliary problem (3.5) can be solved by the Lyapunov optimization technique [Nee10].
Define W_j(t) and Z_i(t) to be virtual queues of the first and second constraints of problem (3.5) with the update dynamics:

W_j(t+1) = [W_j(t) + g_j(y(t))]^+,   j ∈ {1,…,J}   (3.6)
Z_i(t+1) = Z_i(t) + x_i(t) − y_i(t),   i ∈ {1,…,I},   (3.7)

where the operator [·]^+ is the projection onto the non-negative orthant. For ease of notation, let

W(t) ≜ (W_1(t),…,W_J(t)),   Z(t) ≜ (Z_1(t),…,Z_I(t)),   g(y) ≜ (g_1(y),…,g_J(y))

be the vectors of virtual queues W_j(t), Z_i(t), and functions g_j(y), respectively. Define the Lyapunov function (3.8) and Lyapunov drift (3.9) as

L(t) ≜ (1/2)[‖W(t)‖² + ‖Z(t)‖²]   (3.8)
Δ(t) ≜ L(t+1) − L(t).   (3.9)

Let xᵀ denote the transpose of a vector x. Define the finite constant

C ≜ (1/2) sup_{x∈𝒞, y∈Y} [‖g(y)‖² + ‖x − y‖²].   (3.10)

Lemma 9 For every t ∈ {0, 1, 2, …}, the Lyapunov drift is upper bounded by

Δ(t) ≤ C + W(t)ᵀg(y(t)) + Z(t)ᵀ[x(t) − y(t)].   (3.11)

Proof: From (3.6) and (3.7), it follows from the non-expansive projection [BNO03] that

‖W(t+1)‖² ≤ ‖W(t) + g(y(t))‖² = ‖W(t)‖² + ‖g(y(t))‖² + 2W(t)ᵀg(y(t))
‖Z(t+1)‖² = ‖Z(t) + x(t) − y(t)‖² = ‖Z(t)‖² + ‖x(t) − y(t)‖² + 2Z(t)ᵀ[x(t) − y(t)].

Adding the above relations and using definitions (3.8) and (3.9) yields

2Δ(t) ≤ ‖g(y(t))‖² + 2W(t)ᵀg(y(t)) + ‖x(t) − y(t)‖² + 2Z(t)ᵀ[x(t) − y(t)].

Using the definition of C in (3.10) proves the lemma. ∎

Let V > 0 be any positive real number representing a parameter of an algorithm solving problem (3.5). The drift-plus-penalty term is defined as Δ(t) + Vf(y(t)). Applying Lemma 9, the drift-plus-penalty term is bounded for every time t by

Δ(t) + Vf(y(t)) ≤ C + W(t)ᵀg(y(t)) + Z(t)ᵀ[x(t) − y(t)] + Vf(y(t)).   (3.12)

3.1.3 Drift-plus-penalty algorithm

Let W_0 and Z_0 be the initial conditions of W(0) and Z(0), respectively. Every time step, the Lyapunov optimization technique observes the current realization of the random state S(t) before choosing decisions x(t) ∈ X_{S(t)} and y(t) ∈ Y that minimize the right-hand side of (3.12). The drift-plus-penalty algorithm is summarized in Algorithm 2.
Initialize V > 0, W(0) = W_0, Z(0) = Z_0
for t ∈ {0, 1, 2, …} do
    Observe S(t)
    Choose x(t) = arginf_{x ∈ X_{S(t)}} Z(t)ᵀx
    Choose y(t) = arginf_{y ∈ Y} [Vf(y) + W(t)ᵀg(y) − Z(t)ᵀy]
    Update W(t+1) = [W(t) + g(y(t))]^+
    Update Z(t+1) = Z(t) + x(t) − y(t)
end for

Algorithm 2: Drift-plus-penalty algorithm that solves problem (3.5)

3.2 Behaviors of the drift-plus-penalty algorithm

Starting from (W(0), Z(0)), Algorithm 2 reaches the steady state when the vector (W(t), Z(t)) concentrates around a specific set (defined in Section 3.2.1). The transient phase is the period before this concentration. Note that this behavior is different from the deterministic case in Chapter 2, where (w(t), z(t)) of Algorithm 1 is contained in a specific set once the algorithm is in the steady state.

3.2.1 Embedded formulation

A convex optimization problem, called the embedded formulation, is considered. This idea is inspired by [HN11].

Minimize    f(y)                                                      (3.13)
Subject to  g_j(y) ≤ 0,   j ∈ {1,…,J}
            y = Σ_{s∈S} π_s x_s
            y ∈ Y,   x_s ∈ X̄_s,   s ∈ S.

Note that formulation (3.13) contains multiple sets X̄_s and is more complex than its deterministic version (2.6) in the previous chapter.

This formulation has a dual problem, whose properties are used in the convergence analysis. Let w ∈ R_+^J and z ∈ R^I be the vectors of dual variables associated with the first and second constraints of problem (3.13). The Lagrangian is defined as

Λ({x_s}_{s∈S}, y, w, z) = Σ_{s∈S} π_s [f(y) + wᵀg(y) + zᵀ(x_s − y)].

The dual function of problem (3.13) is

d(w, z) = inf_{y∈Y, x_s∈X̄_s: ∀s∈S} Λ({x_s}_{s∈S}, y, w, z)
        = Σ_{s∈S} π_s inf_{y_s∈Y, x_s∈X̄_s} [f(y_s) + wᵀg(y_s) + zᵀ(x_s − y_s)]
        = Σ_{s∈S} π_s d_s(w, z),   (3.14)

where d_s(w, z) is defined in (3.15) and all of the minimizing y_s take the same value.
d_s(w, z) ≜ inf_{y∈Y, x∈X̄_s} [f(y) + wᵀg(y) + zᵀ(x − y)].   (3.15)

Define the solutions to the infimum in (3.15) as

y*(w, z) ≜ arginf_{y∈Y} [f(y) + wᵀg(y) − zᵀy]   (3.16)
x*_s(z) ≜ arginf_{x∈X̄_s} zᵀx.   (3.17)

Finally, the dual problem of formulation (3.13) is

Maximize    d(w, z)                                                    (3.18)
Subject to  (w, z) ∈ R_+^J × R^I.

Problem (3.18) has an optimal solution that may not be unique. The set of these optimal solutions, which are vectors of Lagrange multipliers, can be used to analyze the transient time. However, to simplify the proofs and notation, a uniqueness assumption is adopted. Define λ ≜ (w, z) as the concatenation of the vectors w and z.

Assumption 6 Dual problem (3.18) has a unique vector of Lagrange multipliers, denoted by λ* ≜ (w*, z*).

This assumption is assumed throughout Sections 3.3 and 3.4. Note that it is a mild assumption when practical systems are considered, e.g., [HN11, ES07]. Furthermore, the simulation results in Section 3.5 suggest that this assumption may not be needed.

To prove the main result of this section, a useful property of d_s(w, z) is derived. Define h(x, y) ≜ (g(y), x − y).

Lemma 10 For any λ = (w, z) and λ′ = (w′, z′) in R_+^J × R^I and any s ∈ S, it holds that

d_s(λ′) ≤ d_s(λ) + h(x*_s(z), y*(w, z))ᵀ[λ′ − λ].   (3.19)

Proof: From (3.15), it follows, for any λ′ = (w′, z′) ∈ R_+^J × R^I and (x, y) ∈ X̄_s × Y, that

d_s(λ′) ≤ f(y) + h(x, y)ᵀλ′ = f(y) + h(x, y)ᵀλ + h(x, y)ᵀ[λ′ − λ].

Setting (x, y) = (x*_s(z), y*(w, z)), as defined in (3.16) and (3.17), and using (3.15) proves the lemma. ∎

The following lemma ties the virtual queues of Algorithm 2 to the Lagrange multipliers. Given the generated W(t) and Z(t) of Algorithm 2, define Q(t) ≜ (W(t), Z(t)) as the concatenation of these vectors. The queue dynamics (3.6) and (3.7) are equivalent to

Q(t+1) = P[Q(t) + h(x(t), y(t))],

where P[(W, Z)] denotes the projection of the concatenated vector (W, Z) onto the set R_+^J × R^I.
Lemma 11 The following holds for every t ∈ {0, 1, 2, …}:

E[‖Q(t+1) − Vλ*‖² | Q(t)] ≤ ‖Q(t) − Vλ*‖² + 2C + 2V[d(Q(t)/V) − d(λ*)].

Proof: The non-expansive projection [BNO03] implies that

‖Q(t+1) − Vλ*‖² ≤ ‖Q(t) + h(x(t), y(t)) − Vλ*‖²
 = ‖Q(t) − Vλ*‖² + ‖h(x(t), y(t))‖² + 2h(x(t), y(t))ᵀ[Q(t) − Vλ*]
 ≤ ‖Q(t) − Vλ*‖² + 2C + 2h(x(t), y(t))ᵀ[Q(t) − Vλ*].   (3.20)

From (3.15), when λ = Q(t)/V, we have

d_{S(t)}(Q(t)/V) = inf_{y∈Y, x∈X̄_{S(t)}} [f(y) + (W(t)/V)ᵀg(y) + (Z(t)/V)ᵀ(x − y)],

so y*(W(t)/V, Z(t)/V) = y(t) and x*_{S(t)}(Z(t)/V) = x(t), where the left-hand sides are defined in (3.16) and (3.17), and (y(t), x(t)) is the decision of Algorithm 2. Therefore, property (3.19) implies that

h(x(t), y(t))ᵀ[Q(t) − Vλ*] ≤ V[d_{S(t)}(Q(t)/V) − d_{S(t)}(λ*)].

Applying the above inequality to the last term of (3.20) gives

‖Q(t+1) − Vλ*‖² ≤ ‖Q(t) − Vλ*‖² + 2C + 2V[d_{S(t)}(Q(t)/V) − d_{S(t)}(λ*)].

Taking the conditional expectation given Q(t) proves the lemma:

E[‖Q(t+1) − Vλ*‖² | Q(t)] ≤ ‖Q(t) − Vλ*‖² + 2C + 2V Σ_{s∈S} π_s [d_s(Q(t)/V) − d_s(λ*)]. ∎

The analysis of the transient and steady-state phases in Sections 3.3 and 3.4 utilizes Lemma 11. The convergence analysis of the steady state requires the following results.

3.2.2 T-slot convergence

For any positive integer T and any starting time t_0, define the T-slot average starting at t_0 as

x̄(t_0, T) ≜ (1/T) Σ_{t=t_0}^{t_0+T−1} x(t).

This average leads to the following convergence bounds.

Theorem 5 Let {Q(t)}_{t=0}^∞ be a sequence generated by Algorithm 2. For any positive integer T and any starting time t_0, the objective cost converges as

E[f(x̄(t_0, T))] ≤ f^(opt) + (M/T) E[‖Z(t_0+T) − Z(t_0)‖] + (1/(2TV)) E[‖Q(t_0)‖² − ‖Q(t_0+T)‖²] + C/V,   (3.21)

and the constraint violation for every j ∈ {1,…,J} is

E[g_j(x̄(t_0, T))] ≤ (1/T) E[W_j(t_0+T) − W_j(t_0)] + (M/T) E[‖Z(t_0+T) − Z(t_0)‖].   (3.22)

Proof: The proof is in the appendix of this chapter. ∎

To interpret Theorem 5, the following concentration bound is provided. It is proven in [Nee16].
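Algorithm 2 itself is simple to run. Below is a minimal sketch on a tiny instance of our own devising (not from the thesis): I = J = 1, f(y) = y², g(y) = 1 − y (so the constraint asks the time average to be at least 1), Y = [0, 4], and two hypothetical equally likely states with finite sets X_1 = {0, 2} and X_2 = {0, 4}. The y-step has a closed form only because of this toy choice of f and Y.

```python
import random

random.seed(0)
V, T = 50.0, 20000
X = {0: (0.0, 2.0), 1: (0.0, 4.0)}   # hypothetical decision sets X_{S(t)}

W = Z = 0.0                          # virtual queues (3.6) and (3.7)
sum_x = sum_y = 0.0
for t in range(T):
    s = random.randrange(2)                         # observe S(t), each state w.p. 1/2
    x = min(X[s], key=lambda v: Z * v)              # x-step: minimize Z*x over X_s
    # y-step: minimize V*y^2 + W*(1 - y) - Z*y over Y = [0, 4]  (closed form)
    y = min(max((W + Z) / (2 * V), 0.0), 4.0)
    W = max(0.0, W + (1.0 - y))                     # W-update for g(y) = 1 - y
    Z = Z + (x - y)                                 # Z-update ties xbar to ybar
    sum_x += x
    sum_y += y

# Both running averages should settle near the optimal time-average value 1.
xbar, ybar = sum_x / T, sum_y / T
```

Watching W and Z along the run also illustrates the point of this section: the queue vector Q(t) = (W(t), Z(t)) climbs during a transient phase and then hovers around V times a fixed multiplier vector.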
3.2.3 Concentration bound

Let I{A} be the indicator function whose value is 1 if A is true and 0 otherwise.

Theorem 6 Let K(t) be a real random process over t ∈ {0, 1, 2, …} satisfying

|K(t+1) − K(t)| ≤ δ

and

E[K(t+1) − K(t) | K(t)] ≤ δ if K(t) < β,   E[K(t+1) − K(t) | K(t)] ≤ −ε if K(t) ≥ β,

for some positive real-valued δ, β, and 0 < ε ≤ δ. Suppose K(0) = k_0 (with probability 1) for some k_0 ∈ R. Then for every time t ∈ {0, 1, 2, …}, the following holds:

E[e^{rK(t)}] ≤ D + (e^{rk_0} − D)ρ^t,

where 0 < ρ < 1 and the constants r, ρ, and D are:

r ≜ ε/(δ² + δε/3),   ρ ≜ 1 − rε/2,   D ≜ e^{rβ} e^{rδ} ρ/(1 − ρ).

Later, the random process K(t) is defined to be the distance between Q(t) and V times the vector of Lagrange multipliers, so K(t) ≜ ‖Q(t) − Vλ*‖ for every t ∈ {0, 1, 2, …}.

Lemma 12 It holds for every t ∈ {0, 1, 2, …} that

|K(t+1) − K(t)| ≤ √(2C)
E[K(t+1) − K(t) | K(t)] ≤ √(2C).

Proof: The first part is proven in two cases. i) If K(t+1) ≥ K(t), the non-expansive projection implies

|K(t+1) − K(t)| = K(t+1) − K(t) ≤ ‖Q(t) + h(x(t), y(t)) − Vλ*‖ − ‖Q(t) − Vλ*‖ ≤ ‖h(x(t), y(t))‖ ≤ √(2C).

ii) If K(t+1) < K(t), then

|K(t+1) − K(t)| = K(t) − K(t+1) ≤ ‖Q(t) − Q(t+1)‖ + ‖Q(t+1) − Vλ*‖ − K(t+1) ≤ ‖h(x(t), y(t))‖ ≤ √(2C).

Therefore, |K(t+1) − K(t)| ≤ √(2C). Using K(t+1) − K(t) ≤ |K(t+1) − K(t)| proves the second part. ∎

[Figure 3.1: Illustrations of locally-polyhedral and locally-quadratic dual functions.]

Lemma 12 prepares K(t) for Theorem 6. The only constants left to be specified are ε and β, which depend on the properties of the dual function (3.14).

3.3 Locally-polyhedral dual function

This section analyzes the transient time and a convergence result in the steady state. The dual function (3.14) in this section is assumed to satisfy a locally-polyhedral property, introduced in [HN11]. This property is illustrated in Figure 3.1. It holds when f and each g_j for j ∈ {1,…,J} are either linear or piece-wise linear.

Assumption 7 Let λ* be the unique vector of Lagrange multipliers.
There exists L_p > 0 such that the dual function (3.14) satisfies, for any λ ∈ R_+^J × R^I,

d(λ*) ≥ d(λ) + L_p‖λ* − λ‖.   (3.23)

Note that, by the concavity of the dual function, if inequality (3.23) holds locally about λ*, it must also hold globally. The subscript p denotes "polyhedral."

3.3.1 Transient time

The progress of Q(t) at each step can be analyzed. Define

B_p ≜ max[ L_p/2, 2C/L_p ].   (3.24)

Lemma 13 Under Assumptions 6 and 7, whenever ‖Q(t) − Vλ*‖ ≥ B_p, the following holds:

E[‖Q(t+1) − Vλ*‖ | Q(t)] ≤ ‖Q(t) − Vλ*‖ − L_p/2.

Proof: If the condition

2C + 2V[d(Q(t)/V) − d(λ*)] ≤ −L_p‖Q(t) − Vλ*‖ + L_p²/4   (3.25)

is true, Lemma 11 implies that

E[‖Q(t+1) − Vλ*‖² | Q(t)] ≤ ‖Q(t) − Vλ*‖² − L_p‖Q(t) − Vλ*‖ + L_p²/4 = (‖Q(t) − Vλ*‖ − L_p/2)².

Applying Jensen's inequality [BNO03] on the left-hand side yields

E[‖Q(t+1) − Vλ*‖ | Q(t)]² ≤ (‖Q(t) − Vλ*‖ − L_p/2)².

When ‖Q(t) − Vλ*‖ ≥ L_p/2, it follows that

E[‖Q(t+1) − Vλ*‖ | Q(t)] ≤ ‖Q(t) − Vλ*‖ − L_p/2.   (3.26)

It remains to show that condition (3.25) holds whenever ‖Q(t) − Vλ*‖ ≥ B_p. Assumption 7 implies that

2C + 2V[d(Q(t)/V) − d(λ*)] ≤ 2C − 2L_p‖Q(t) − Vλ*‖.

From the definition of B_p in (3.24), ‖Q(t) − Vλ*‖ ≥ B_p implies L_p‖Q(t) − Vλ*‖ ≥ 2C and

2C + 2V[d(Q(t)/V) − d(λ*)] ≤ −L_p‖Q(t) − Vλ*‖,

which implies (3.25). ∎

Lemma 13 implies that Q(t) moves closer to Vλ* in expectation at the next step whenever the distance between them is at least B_p. This means that Q(t) concentrates around Vλ* in the steady state.

3.3.2 Convergence time in a steady state

Define constants r_p, ρ_p, D_p, U_p, U′_p as

r_p ≜ 3L_p/(12C + L_p√(2C)),   ρ_p ≜ 1 − r_p L_p/4,   (3.27)
D_p ≜ e^{r_p B_p} e^{r_p √(2C)} ρ_p/(1 − ρ_p),   (3.28)
U_p ≜ log(D_p + 1)/r_p,   U′_p ≜ 2(D_p + 1)/r_p².   (3.29)

Given the initial condition Q_0 ≜ (W_0, Z_0), define

T_p ≜ ⌈ r_p‖Q_0 − Vλ*‖ / log(1/ρ_p) ⌉,   (3.30)

where the constants r_p and ρ_p are defined in (3.27). The value T_p is O(V). The next lemma shows that T_p can be interpreted as the transient time, so that desirable "steady state" bounds hold after this time.

Lemma 14 Suppose Assumptions 6 and 7 hold.
Given the initial condition Q_0 ≜ (W_0, Z_0), for any time t ≥ T_p, where T_p is defined in (3.30), the following holds:

E[‖Q(t) − Vλ*‖] ≤ U_p   (3.31)
E[‖Q(t) − Vλ*‖²] ≤ U′_p,   (3.32)

where the constants U_p and U′_p are defined in (3.29).

Proof: Recall that K(t) ≜ ‖Q(t) − Vλ*‖. From Lemmas 12 and 13, the constants in Theorem 6 are δ = √(2C), β = B_p, and ε = L_p/2. Theorem 6 implies for any t ≥ 0 that

E[e^{r_p K(t)}] ≤ D_p + [e^{r_p k_0} − D_p]ρ_p^t ≤ D_p + e^{r_p k_0}ρ_p^t,   (3.33)

where k_0 = K(0) = ‖Q_0 − Vλ*‖, and the constants r_p, ρ_p, D_p are defined in (3.27) and (3.28). We then show that

e^{r_p k_0}ρ_p^t ≤ 1   for all t ≥ T_p.   (3.34)

Inequality e^{r_p k_0}ρ_p^t ≤ 1 is equivalent to t ≥ r_p‖Q_0 − Vλ*‖/log(1/ρ_p) by arithmetic and the fact that log(1/ρ_p) > 0. From the definition of T_p in (3.30), it holds that T_p ≥ r_p‖Q_0 − Vλ*‖/log(1/ρ_p), and the result (3.34) follows. From (3.34), inequality (3.33) becomes

E[e^{r_p K(t)}] ≤ D_p + 1   for all t ≥ T_p.   (3.35)

Jensen's inequality implies that e^{r_p E[K(t)]} ≤ E[e^{r_p K(t)}], and we have e^{r_p E[K(t)]} ≤ D_p + 1. Taking the logarithm and dividing by r_p proves (3.31).

The Chernoff bound (see for example [Ros96]) implies that, for any m ∈ R_+,

P{K(t) ≥ m} ≤ e^{−r_p m} E[e^{r_p K(t)}] ≤ e^{−r_p m}(D_p + 1)   for all t ≥ T_p,   (3.36)

where the last inequality uses (3.35). Since K(t)² is always non-negative, it can be shown by integration by parts that E[K(t)²] = 2∫_0^∞ m P{K(t) ≥ m} dm. Using (3.36), we have

E[K(t)²] ≤ 2(D_p + 1) ∫_0^∞ m e^{−r_p m} dm.

Performing the integration proves (3.32). ∎

The above lemma implies that, in the steady state, the expected distance and squared distance between Q(t) and V times the vector of Lagrange multipliers are bounded by constants that do not depend on V. This phenomenon leads to an improved convergence time when the average is performed in the steady state. A useful result is proven next and is used in the main theorem.
Lemma 15 For any times t_1 and t_2, it holds that

E[‖Q(t_1)‖² − ‖Q(t_2)‖²] ≤ E[‖Q(t_1) − Vλ*‖²] + 2‖Vλ*‖ E[‖Q(t_1) − Vλ*‖ + ‖Q(t_2) − Vλ*‖].

Proof: It holds for any Q ∈ R_+^J × R^I that

‖Q‖² = ‖Q − Vλ*‖² + ‖Vλ*‖² + 2(Q − Vλ*)ᵀ(Vλ*).

Using the above equality with Q_1, Q_2 ∈ R_+^J × R^I leads to

‖Q_1‖² − ‖Q_2‖² = ‖Q_1 − Vλ*‖² − ‖Q_2 − Vλ*‖² + 2(Q_1 − Q_2)ᵀ(Vλ*)
 ≤ ‖Q_1 − Vλ*‖² + 2‖Q_1 − Q_2‖‖Vλ*‖
 ≤ ‖Q_1 − Vλ*‖² + 2‖Vλ*‖[‖Q_1 − Vλ*‖ + ‖Q_2 − Vλ*‖].

Taking an expectation proves the lemma. ∎

Finally, the convergence in the steady state is analyzed.

Theorem 7 Suppose Assumptions 6 and 7 hold. For any time t_0 ≥ T_p and any positive integer T, the objective cost converges as

E[f(x̄(t_0, T))] ≤ f^(opt) + 2MU_p/T + (U′_p + 4VU_p‖λ*‖)/(2TV) + C/V   (3.37)

and the constraint violation is upper bounded by

E[g_j(x̄(t_0, T))] ≤ 2U_p/T + 2MU_p/T.   (3.38)

Proof: From Theorem 5, the objective cost converges as in (3.21). Since T_p ≤ t_0 < t_0 + T, we use the results in Lemma 14 to upper bound E[‖Q(t) − Vλ*‖] and E[‖Q(t) − Vλ*‖²] at times t_0 and t_0 + T. Terms on the right-hand side of (3.21) are bounded by

E[‖Z(t_0+T) − Z(t_0)‖] ≤ E[‖Q(t_0+T) − Q(t_0)‖] ≤ E[K(t_0+T) + K(t_0)] ≤ 2U_p.   (3.39)

Lemma 15 implies that

E[‖Q(t_0)‖² − ‖Q(t_0+T)‖²] ≤ E[K(t_0)²] + 2‖Vλ*‖ E[K(t_0) + K(t_0+T)] ≤ U′_p + 4VU_p‖λ*‖.   (3.40)

Substituting (3.39) and (3.40) into (3.21) proves (3.37).

The constraint violation converges as in (3.22), where T_p ≤ t_0 < t_0 + T. Using Lemma 14, the last term on the right-hand side of (3.22) is bounded as in (3.39). The first term is bounded by

E[W_j(t_0+T) − W_j(t_0)] ≤ E[|W_j(t_0+T) − Vw*_j| + |W_j(t_0) − Vw*_j|] ≤ E[K(t_0+T) + K(t_0)] ≤ 2U_p.

Substituting the above bound and (3.39) into (3.22) proves (3.38). ∎

The implication of Theorem 7 is as follows. When the average starts in the steady state, the deviation from the optimal cost is O(1/T + 1/V), and the constraint violation is bounded by O(1/T). By setting V = 1/ε and T = 1/ε, both the optimality gap and the constraint violation achieve an O(ε)-approximation, and the convergence time is O(1/ε) slots.
Note that this setting yields an O(1/ε) transient time, since T_p = O(V) = O(1/ε).

3.4 Locally-quadratic dual function

The dual function (3.14) in this section is assumed to satisfy a locally-quadratic property, modified from [HN11]. This property is illustrated in Figure 3.1.

Assumption 8 Let λ* be the unique vector of Lagrange multipliers. There exist N > 0 and L_q > 0 such that, whenever λ ∈ R_+^J × R^I and ‖λ* − λ‖ ≤ N, the dual function (3.14) satisfies

d(λ*) ≥ d(λ) + L_q‖λ* − λ‖².

Note that the subscript q denotes "quadratic." It can be shown that Assumption 8 implies d(λ*) ≥ d(λ) + NL_q‖λ* − λ‖ for all λ ∈ R_+^J × R^I with ‖λ* − λ‖ > N. [1]

[1] We would like to thank Hao Yu for noticing this fact.

3.4.1 Transient time

The progress of Q(t) at each step can be analyzed. Define

B_q(V) ≜ max[ 1/√V, √V(1 + √(1 + 4L_qC))/(2L_q) ]   (3.41)
B′_q ≜ max[ NL_q/2, 2C/(NL_q) ].

Lemma 16 Suppose Assumptions 6 and 8 hold. When V is large enough to ensure both B_q(V) ≤ NV and B′_q ≤ NV, the following holds:

E[‖Q(t+1) − Vλ*‖ | Q(t)] ≤ ‖Q(t) − Vλ*‖ − 1/√V,   if B_q(V) ≤ ‖Q(t) − Vλ*‖ ≤ NV,
E[‖Q(t+1) − Vλ*‖ | Q(t)] ≤ ‖Q(t) − Vλ*‖ − NL_q/2,   if ‖Q(t) − Vλ*‖ > NV.   (3.42)

Proof: If the condition

2C + 2V[d(Q(t)/V) − d(λ*)] ≤ −(2/√V)‖Q(t) − Vλ*‖ + 1/V   (3.43)

is true, Lemma 11 implies that

E[‖Q(t+1) − Vλ*‖² | Q(t)] ≤ ‖Q(t) − Vλ*‖² − (2/√V)‖Q(t) − Vλ*‖ + 1/V = (‖Q(t) − Vλ*‖ − 1/√V)².

Applying Jensen's inequality [BNO03] on the left-hand side yields

E[‖Q(t+1) − Vλ*‖ | Q(t)]² ≤ (‖Q(t) − Vλ*‖ − 1/√V)².

When ‖Q(t) − Vλ*‖ ≥ 1/√V, it follows that

E[‖Q(t+1) − Vλ*‖ | Q(t)] ≤ ‖Q(t) − Vλ*‖ − 1/√V.   (3.44)

However, condition (3.43) holds when

2C − 2VL_q‖Q(t)/V − λ*‖² ≤ −(2/√V)‖Q(t) − Vλ*‖,   (3.45)

because Assumption 8 implies that, when ‖Q(t) − Vλ*‖ ≤ NV,

2V[d(Q(t)/V) − d(λ*)] ≤ −2VL_q‖Q(t)/V − λ*‖².

Therefore, condition (3.43) holds when condition (3.45) holds. Condition (3.45) requires that

‖Q(t) − Vλ*‖ ≥ [√V + √(V + 4L_qVC)]/(2L_q).

Thus, inequality (3.44) holds when ‖Q(t) − Vλ*‖ ≥ max[ 1/√V, (√V + √(V + 4L_qVC))/(2L_q) ] = B_q(V). This proves the first part of (3.42).
For the last part of (3.42), if the following condition, with a parameter η > 0:

2C + 2V[d(Q(t)/V) − d(λ*)] ≤ −2η‖Q(t) − Vλ*‖ + η²   (3.46)

is true, Lemma 11 implies that

E[‖Q(t+1) − Vλ*‖² | Q(t)] ≤ ‖Q(t) − Vλ*‖² − 2η‖Q(t) − Vλ*‖ + η² = (‖Q(t) − Vλ*‖ − η)².

Applying Jensen's inequality [BNO03] on the left-hand side yields

E[‖Q(t+1) − Vλ*‖ | Q(t)]² ≤ (‖Q(t) − Vλ*‖ − η)².

When ‖Q(t) − Vλ*‖ ≥ η, it follows that

E[‖Q(t+1) − Vλ*‖ | Q(t)] ≤ ‖Q(t) − Vλ*‖ − η.   (3.47)

However, writing L′_q ≜ NL_q, condition (3.46) holds when

2C − 2VL′_q‖Q(t)/V − λ*‖ ≤ −2η‖Q(t) − Vλ*‖,   (3.48)

because Assumption 8 implies that, when ‖Q(t) − Vλ*‖ > NV,

2V[d(Q(t)/V) − d(λ*)] ≤ −2VL′_q‖Q(t)/V − λ*‖.

Therefore, condition (3.46) holds when condition (3.48) holds. Condition (3.48) requires that ‖Q(t) − Vλ*‖ ≥ C/(L′_q − η) when η < L′_q. Thus, inequality (3.47) holds when ‖Q(t) − Vλ*‖ ≥ max[ η, C/(L′_q − η) ]. Choosing η = L′_q/2 and using the fact that B′_q ≤ NV proves the last part of (3.42). ∎

The interpretation of Lemma 16 is similar to that of Lemma 13, except that B_q(V) and the negative drift in (3.42) are functions of V. Nevertheless, Lemma 16 implies that Q(t) concentrates around Vλ* in the steady state.

3.4.2 Convergence time in a steady state

Recall that B_q(V) is defined in (3.41). Define constants r_q(V), ρ_q(V), D_q(V), U_q(V), U′_q(V) as

r_q(V) ≜ 3/(6C√V + √(2C)),   ρ_q(V) ≜ 1 − r_q(V)/(2√V),   (3.49)
D_q(V) ≜ e^{r_q(V)B_q(V)} e^{r_q(V)√(2C)} ρ_q(V)/(1 − ρ_q(V)),   (3.50)
U_q(V) ≜ log(D_q(V) + 1)/r_q(V),   U′_q(V) ≜ 2(D_q(V) + 1)/r_q(V)².   (3.51)

Given the initial condition Q_0 ≜ (W_0, Z_0), define the transient time for a locally-quadratic dual function as

T_q ≜ ⌈ r_q(V)‖Q_0 − Vλ*‖ / log(1/ρ_q(V)) ⌉,   (3.52)

where r_q(V) and ρ_q(V) are defined in (3.49). Definition (3.52) implies that the transient time under the locally-quadratic assumption is O(V^1.5). This T_q can be interpreted as the transient time, so that desirable bounds hold after this time.
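The O(V^1.5) scaling of T_q in (3.52) can be checked numerically from the definitions (3.49)–(3.52). The sketch below uses illustrative constants of our choosing (C = 1 and ‖Q_0 − Vλ*‖ = V, i.e. ‖λ*‖ = 1 with Q_0 = 0); quadrupling V should multiply T_q by roughly 4^1.5 = 8.

```python
import math

C = 1.0  # illustrative constant from (3.10); any fixed value shows the scaling

def transient_time(V, k0):
    """T_q from (3.52), with r_q(V) and rho_q(V) as in (3.49)."""
    r = 3.0 / (6.0 * C * math.sqrt(V) + math.sqrt(2.0 * C))
    rho = 1.0 - r / (2.0 * math.sqrt(V))
    return math.ceil(r * k0 / math.log(1.0 / rho))

# With k0 = ||Q_0 - V*lambda*|| growing linearly in V, T_q grows like V^1.5.
t100 = transient_time(100.0, 100.0)
t400 = transient_time(400.0, 400.0)
ratio = t400 / t100          # close to 4**1.5 = 8
```

The same computation with the polyhedral constants (3.27)–(3.30) gives a ratio near 4, reflecting the O(V) transient time of Section 3.3.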
When $V$ is large enough to ensure $B_q(V) \le NV$, $B_q' \le NV$, and $\frac{1}{\sqrt{V}} \le \frac{N L_q}{2}$, for any time $t \ge T_q$, the following holds:
$$\mathbb{E}\left[\|Q(t) - V\lambda^*\|\right] \le U_q(V) \qquad (3.53)$$
$$\mathbb{E}\left[\|Q(t) - V\lambda^*\|^2\right] \le U_q'(V) \qquad (3.54)$$
where $U_q(V)$ and $U_q'(V)$ are defined in (3.51).

Proof: Recall that $K(t) \triangleq \|Q(t) - V\lambda^*\|$. From Lemmas 12 and 16, the constants in Theorem 6 are $\delta = \sqrt{2C}$, $B = B_q(V)$, and $\beta = 1/\sqrt{V}$. Theorem 6 implies, for any $t \ge 0$, that
$$\mathbb{E}\left[e^{r_q(V) K(t)}\right] \le D_q(V) + \left[e^{r_q(V) k_0} - D_q(V)\right]\rho_q(V)^t \le D_q(V) + e^{r_q(V) k_0}\,\rho_q(V)^t, \qquad (3.55)$$
where $k_0 = K(0) = \|Q_0 - V\lambda^*\|$, and $r_q(V), \rho_q(V), D_q(V)$ are defined in (3.49) and (3.50). We then show that
$$e^{r_q(V) k_0}\,\rho_q(V)^t \le 1 \quad \text{for all } t \ge T_q. \qquad (3.56)$$
Inequality $e^{r_q(V) k_0}\rho_q(V)^t \le 1$ is equivalent to $t \ge \frac{r_q(V)\|Q_0 - V\lambda^*\|}{\log(1/\rho_q(V))}$ by arithmetic and the fact that $\log(1/\rho_q(V)) > 0$. From the definition of $T_q$ in (3.52), it holds that $T_q \ge \frac{r_q(V)\|Q_0 - V\lambda^*\|}{\log(1/\rho_q(V))}$, and the result (3.56) follows. From (3.56), inequality (3.55) becomes
$$\mathbb{E}\left[e^{r_q(V) K(t)}\right] \le D_q(V) + 1 \quad \text{for all } t \ge T_q. \qquad (3.57)$$
Jensen's inequality implies that $e^{r_q(V)\mathbb{E}[K(t)]} \le \mathbb{E}\left[e^{r_q(V) K(t)}\right]$, and we have $e^{r_q(V)\mathbb{E}[K(t)]} \le D_q(V) + 1$. Taking the logarithm and dividing by $r_q(V)$ proves (3.53).

The Chernoff bound implies that, for any $m \in \mathbb{R}_+$,
$$\mathbb{P}\{K(t) \ge m\} \le e^{-r_q(V) m}\,\mathbb{E}\left[e^{r_q(V) K(t)}\right] \le e^{-r_q(V) m}\left(D_q(V) + 1\right) \quad \text{for all } t \ge T_q, \qquad (3.58)$$
where the last inequality uses (3.57). Since $K(t)^2$ is always non-negative, it can be shown by integration by parts that $\mathbb{E}\left[K(t)^2\right] = 2\int_0^\infty m\,\mathbb{P}\{K(t) \ge m\}\,dm$. Using (3.58), we have
$$\mathbb{E}\left[K(t)^2\right] \le 2\left(D_q(V) + 1\right)\int_0^\infty m\,e^{-r_q(V) m}\,dm.$$
Performing the integration by parts proves (3.54). □

The convergence results in the steady state are as follows.

Theorem 8 Suppose Assumptions 6 and 8 hold.
When $V$ is large enough to ensure $B_q(V) \le NV$, $B_q' \le NV$, and $\frac{1}{\sqrt{V}} \le \frac{N L_q}{2}$, then for any time $t_0 \ge T_q$ and any positive integer $T$, the objective cost converges as
$$\mathbb{E}\left[f(\overline{x}(t_0, T))\right] - f^{(\mathrm{opt})} \le \frac{2 M U_q(V)}{T} + \frac{U_q'(V) + 4 V U_q(V)\|\lambda^*\|}{2 T V} + \frac{C}{V} \qquad (3.59)$$
and the constraint violation is upper bounded by
$$\mathbb{E}\left[g_j(\overline{x}(t_0, T))\right] \le \frac{2 U_q(V)}{T} + \frac{2 M U_q(V)}{T}. \qquad (3.60)$$

Proof: The procedure is similar to the proof of Theorem 7. □

The implication of Theorem 8 is as follows. When the average starts in the steady state, the deviation from the optimal cost in (3.59) is $O(\sqrt{V}/T + 1/V)$, and the constraint violation in (3.60) is at most $O(\sqrt{V}/T)$. Note that this can be shown by substituting $B_q(V), r_q(V), \rho_q(V), D_q(V)$ into (3.59) and (3.60). By setting $V = 1/\epsilon$ and $T = 1/\epsilon^{1.5}$, this achieves an $O(\epsilon)$-approximation with transient time and convergence time of $O(1/\epsilon^{1.5})$.

3.5 Sample problems

3.5.1 Staggered time averages

Recall that the staggered algorithm defined in Section 2.3.3 restarts time averages at times $2^k$ for integers $k$, and a restart is guaranteed to occur within a factor of 2 of the time of the actual end of the transient phase.

3.5.2 Results

This section illustrates the convergence times of the drift-plus-penalty in Algorithm 2 and the staggered algorithm under the locally-polyhedral and locally-quadratic assumptions. Let $\mathcal{S} = \{0, 1, 2\}$, $\mathcal{X}_0 = \{(0, 0)\}$, $\mathcal{X}_1 = \{(5, 0), (0, 10)\}$, $\mathcal{X}_2 = \{(0, -10), (-5, 0)\}$, and $(\pi_0, \pi_1, \pi_2) = (0.1, 0.6, 0.3)$. A formulation is
$$\text{Minimize} \quad \limsup_{T\to\infty} f(\overline{x}(T)) \qquad (3.61)$$
$$\text{Subject to} \quad \limsup_{T\to\infty}\,\left[-2\overline{x}_1(T) - \overline{x}_2(T)\right] \le -1.5$$
$$\qquad\qquad\quad \limsup_{T\to\infty}\,\left[-\overline{x}_1(T) - 2\overline{x}_2(T)\right] \le -1.5$$
$$\qquad\qquad\quad (x_1(t), x_2(t)) \in \mathcal{X}_{S(t)}, \quad t \in \{0, 1, 2, \ldots\},$$
where function $f$ will be given for the different cases.

Under the locally-polyhedral assumption, let $f(x) = 1.5 x_1 + x_2$ be the objective function of problem (3.61). In this setting, the optimal value is $1.25$, where $\lim_{T\to\infty}\overline{x}_1(T) = \lim_{T\to\infty}\overline{x}_2(T) = 0.5$. Figure 3.2 shows the values of the objective and the constraint functions of time-average solutions.
It is easy to see the improved $O(1/\epsilon)$ convergence time of the staggered time averages compared to the $O(1/\epsilon^2)$ convergence time of Algorithm 2.

Under the locally-quadratic assumption, let $f(x) = x_1^2 + x_2^2$ be the objective function of problem (3.61). Note that the optimal value of this problem is $0.5$, where $\lim_{T\to\infty}\overline{x}_1(T) = \lim_{T\to\infty}\overline{x}_2(T) = 0.5$. Figure 3.3 shows the values of the objective and the constraint functions of time-average solutions.

Figure 3.2: Convergence of Algorithm 2 and the staggered algorithm that solve problem (3.61) with $f(x) = 1.5 x_1 + x_2$

It can be seen from the plot of constraints that the staggered time averages converge faster than Algorithm 2. This illustrates the difference between $O(1/\epsilon^{1.5})$ convergence time and $O(1/\epsilon^2)$ convergence time.

Figure 3.4 illustrates the convergence time of problem (3.61) with $f(x) = 1.5 x_1 + x_2$ and an additional constraint $\mathbb{E}[x_1 + x_2] \le 1$. The dual function of this formulation has non-unique vectors of Lagrange multipliers. The comparison of Figures 3.4 and 3.2 shows that there is no difference in the order of convergence time.

Figure 3.5 illustrates the convergence time of problem (3.61) with $f(x) = x_1^2 + x_2^2$ and an additional constraint $\mathbb{E}[x_1 + x_2] \le 1$. The dual function of this formulation has non-unique vectors of Lagrange multipliers. The comparison of Figures 3.5 and 3.3 shows that there is no difference in the order of convergence time.
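The qualitative behavior in these experiments can be reproduced with a short simulation. The sketch below runs the drift-plus-penalty iteration on a deterministic analogue of problem (3.61) with $f(x) = 1.5x_1 + x_2$: a single network state, a small grid action set, and one virtual queue per constraint. The grid action set and the parameter values here are illustrative assumptions, not the exact decision sets used in the experiments above.

```python
# Deterministic analogue of problem (3.61): minimize 1.5*x1 + x2
# subject to 2*x1 + x2 >= 1.5 and x1 + 2*x2 >= 1.5,
# whose optimum is (0.5, 0.5) with value 1.25.
# The grid action set is an illustrative assumption.
ACTIONS = [(i / 4, j / 4) for i in range(5) for j in range(5)]

def f(x):
    return 1.5 * x[0] + x[1]

def g(x):
    # Constraints written in the standard form g_j(x) <= 0.
    return (1.5 - 2 * x[0] - x[1], 1.5 - x[0] - 2 * x[1])

def drift_plus_penalty(V, T):
    W = [0.0, 0.0]       # virtual queues, one per constraint
    total = [0.0, 0.0]   # running sum of decisions
    for _ in range(T):
        # Greedily minimize V*f(x) + sum_j W_j * g_j(x) over the action set.
        x = min(ACTIONS,
                key=lambda a: V * f(a) + sum(w * gj for w, gj in zip(W, g(a))))
        W = [max(w + gj, 0.0) for w, gj in zip(W, g(x))]  # queue update
        total = [s + xi for s, xi in zip(total, x)]
    return tuple(s / T for s in total)  # time-average decision

x_avg = drift_plus_penalty(V=50, T=50_000)
```

For large $V$ and long horizons, the time average approaches the optimum $(0.5, 0.5)$ with objective near $1.25$, while the virtual queues keep the time-averaged constraint functions bounded by $O(W_{\max}/T)$, mirroring the $O(1/V)$ optimality gap discussed in this chapter.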
Figure 3.3: Convergence of Algorithm 2 and the staggered algorithm that solve problem (3.61) with $f(x) = x_1^2 + x_2^2$

Figure 3.4: Convergence of Algorithm 2 and the staggered algorithm that solve problem (3.61) with $f(x) = 1.5 x_1 + x_2$ and an additional constraint $\mathbb{E}[x_1 + x_2] \le 1$

Figure 3.5: Convergence of Algorithm 2 and the staggered algorithm that solve problem (3.61) with $f(x) = x_1^2 + x_2^2$ and an additional constraint $\mathbb{E}[x_1 + x_2] \le 1$

3.6 Chapter summary

We consider the time-average stochastic optimization problem with non-convex decision sets. The problem can be solved by the drift-plus-penalty algorithm, which has $O(1/\epsilon^2)$ convergence time. After analyzing the transient and steady state phases, the convergence time can be improved by performing the averaging in the steady state. We prove that the improved convergence time is $O(1/\epsilon)$ under the locally-polyhedral assumption and is $O(1/\epsilon^{1.5})$ under the locally-quadratic assumption.

Appendix

To prove Theorem 5, the convergence of the objective cost as a function of vectors $y(t)$ is first proven. This proof requires the following theorem proven in [Nee10].
Theorem 9 There exists a randomized policy $(x^*(t), y^*(t))$ that only depends on $S(t)$ such that for all $t \in \{0, 1, 2, \ldots\}$:
$$\mathbb{E}[f(y^*(t))] = f^{(\mathrm{opt})}$$
$$\mathbb{E}[g(y^*(t))] \le 0$$
$$\mathbb{E}[x^*(t)] = \mathbb{E}[y^*(t)]$$
$$(x^*(t), y^*(t)) \in \mathcal{X}_{S(t)} \times \mathcal{Y}.$$

Lemma 18 Let $\{(x(t), y(t), W(t), Z(t))\}_{t=0}^\infty$ be a sequence generated by Algorithm 2. For any positive integer $T$ and any starting time $t_0$, it holds that
$$\mathbb{E}\left[f(\overline{y}(t_0, T))\right] - f^{(\mathrm{opt})} \le \frac{C}{V} + \frac{1}{2TV}\mathbb{E}\left[\|Q(t_0)\|^2 - \|Q(t_0 + T)\|^2\right].$$

Proof: Since decision $(x(t), y(t))$ minimizes the right-hand side of (3.12), the following holds for any other decisions, including the randomized policy in Theorem 9:
$$\Delta(t) + V f(y(t)) \le C + Q(t)^\top h(x(t), y(t)) + V f(y(t)) \le C + Q(t)^\top h(x^*(t), y^*(t)) + V f(y^*(t)).$$
Taking the conditional expectation of the above bound gives
$$\mathbb{E}[\Delta(t) \mid Q(t)] + V\,\mathbb{E}[f(y(t)) \mid Q(t)] \le C + Q(t)^\top \mathbb{E}[h(x^*(t), y^*(t)) \mid Q(t)] + V\,\mathbb{E}[f(y^*(t)) \mid Q(t)] \le C + V f^{(\mathrm{opt})},$$
where the last inequality uses properties of the randomized policy in Theorem 9. Taking expectation and using iterated expectation leads to
$$\frac{1}{2}\mathbb{E}\left[\|Q(t+1)\|^2 - \|Q(t)\|^2\right] + V\,\mathbb{E}[f(y(t))] \le C + V f^{(\mathrm{opt})}.$$
Summing from $t = t_0$ to $t = t_0 + T - 1$ yields
$$\frac{1}{2}\mathbb{E}\left[\|Q(t_0 + T)\|^2 - \|Q(t_0)\|^2\right] + V\,\mathbb{E}\left[\sum_{t=t_0}^{t_0+T-1} f(y(t))\right] \le CT + VT f^{(\mathrm{opt})}.$$
Dividing by $T$, using the convexity of $f$, and rearranging terms proves the lemma. □

The constraint violation as a function of vectors $y(t)$ is the following.

Lemma 19 Let $\{(x(t), y(t), W(t), Z(t))\}_{t=0}^\infty$ be a sequence generated by Algorithm 2. For any positive integer $T$ and any starting time $t_0$, it holds for all $j \in \{1, \ldots, J\}$ that
$$\mathbb{E}\left[g_j(\overline{y}(t_0, T))\right] \le \frac{1}{T}\mathbb{E}\left[W_j(t_0 + T) - W_j(t_0)\right].$$

Proof: Dynamic (3.6) implies that $W_j(t+1) \ge W_j(t) + g_j(y(t))$. Taking expectation gives $\mathbb{E}[W_j(t+1)] \ge \mathbb{E}[W_j(t)] + \mathbb{E}[g_j(y(t))]$. Summing from $t = t_0$ to $t = t_0 + T - 1$ yields $\mathbb{E}[W_j(t_0 + T)] \ge \mathbb{E}[W_j(t_0)] + \mathbb{E}\left[\sum_{t=t_0}^{t_0+T-1} g_j(y(t))\right]$. Dividing by $T$, using the convexity of $g_j$, and rearranging terms proves the lemma. □
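The virtual-queue argument in Lemma 19 rests on the sample-path inequality $W_j(t+1) \ge W_j(t) + g_j(y(t))$, which telescopes into a bound on the time-averaged constraint value. The following quick numerical check illustrates this telescoping step; a $\max[\cdot, 0]$ queue update is driven by arbitrary bounded values standing in for $g_j(y(t))$ (an illustration of the inequality, not the algorithm itself):

```python
import random

random.seed(0)

# Simulate a virtual queue W(t+1) = max[W(t) + g(t), 0] driven by arbitrary
# bounded values g(t), standing in for the constraint values g_j(y(t)).
T = 10_000
g_vals = [random.uniform(-1.0, 1.0) for _ in range(T)]

W = 0.0
for g_t in g_vals:
    # The max[., 0] update implies W(t+1) >= W(t) + g(t), the inequality
    # used in Lemma 19; summing over t telescopes the W terms.
    W = max(W + g_t, 0.0)

avg_g = sum(g_vals) / T
bound = W / T  # equals (W(T) - W(0))/T since W(0) = 0
```

Dividing the telescoped inequality $\sum_t g(t) \le W(T) - W(0)$ by $T$ gives exactly the form of the bound in Lemma 19; when $\mathbb{E}[W_j(t)]$ stays bounded, the right-hand side vanishes as $T \to \infty$.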
The following result is used to translate the results in Lemmas 18 and 19 to the bounds in Theorem 5, which are functions of vector $x(t)$.

Lemma 20 Let $\{(x(t), y(t), W(t), Z(t))\}_{t=0}^\infty$ be a sequence generated by Algorithm 2. For any positive integer $T$ and starting time $t_0$, it holds that
$$\overline{x}(t_0, T) - \overline{y}(t_0, T) = \frac{1}{T}\left[Z(t_0 + T) - Z(t_0)\right].$$

Proof: Dynamic (3.7) implies that $Z(t+1) = Z(t) + [x(t) - y(t)]$. Summing from $t = t_0$ to $t = t_0 + T - 1$ yields $Z(t_0 + T) = Z(t_0) + \sum_{t=t_0}^{t_0+T-1}[x(t) - y(t)]$. Dividing by $T$ and rearranging terms proves the lemma. □

Finally, Theorem 5 is proven.

Proof: For the first part, Lipschitz continuity (3.3) implies
$$f(\overline{x}(t_0, T)) - f^{(\mathrm{opt})} \le f(\overline{y}(t_0, T)) - f^{(\mathrm{opt})} + M\|\overline{x}(t_0, T) - \overline{y}(t_0, T)\|.$$
Taking expectation yields
$$\mathbb{E}\left[f(\overline{x}(t_0, T))\right] - f^{(\mathrm{opt})} \le \mathbb{E}\left[f(\overline{y}(t_0, T))\right] - f^{(\mathrm{opt})} + M\,\mathbb{E}\left[\|\overline{x}(t_0, T) - \overline{y}(t_0, T)\|\right].$$
Applying the results in Lemmas 18 and 20 proves (3.21).

For the last part, Lipschitz continuity (3.4) implies that
$$g_j(\overline{x}(t_0, T)) \le g_j(\overline{y}(t_0, T)) + M\|\overline{x}(t_0, T) - \overline{y}(t_0, T)\| \quad \text{for all } j \in \{1, \ldots, J\}.$$
Taking expectation yields
$$\mathbb{E}\left[g_j(\overline{x}(t_0, T))\right] \le \mathbb{E}\left[g_j(\overline{y}(t_0, T))\right] + M\,\mathbb{E}\left[\|\overline{x}(t_0, T) - \overline{y}(t_0, T)\|\right].$$
Applying the results in Lemmas 19 and 20 proves (3.22). □

Chapter 4

Stochastic Network Optimization with Finite Buffers

In this chapter, a constraint on finite buffers in practice is considered in order to develop a more practical algorithm that solves stochastic network optimization problems. The results in this chapter are based in part on [SN15a, SN15b].

Stochastic network optimization is a general framework for network optimization with randomness [Nee10]. The framework generates a control algorithm that achieves a specified objective, such as minimizing power cost or maximizing throughput utility. It is assumed that the network has random states that evolve over discrete time. Every time slot, a network controller observes a current network state and makes a control decision.
The network state and control decision together incur some cost and, at the same time, serve some amount of traffic from network queues. The algorithm is designed to greedily minimize a drift-plus-penalty expression every slot. This greedy procedure is known to minimize time-average network cost subject to queue stability.

This general framework has been used to solve several network optimization problems, such as network routing [TE92], throughput maximization [ES07], dynamic power allocation [NMR05], and quality of information maximization [SN15c]. The framework yields low-complexity algorithms which do not require any statistical knowledge of the network states. Therefore, these algorithms are easily implemented and are robust to environment changes. Further, they achieve an $O(1/V)$ optimality gap and $O(V)$ delay, which is denoted by an $[O(1/V), O(V)]$ utility-delay tradeoff, where $V > 0$ is a parameter that can be chosen as desired to achieve a specific operating point on the $[O(1/V), O(V)]$ utility-delay tradeoff curve.

This chapter develops a floating-queue approach to general stochastic network optimization with a constraint on finite buffers. Our algorithm is inspired by the finite-buffer heuristic in [MSKG10] and the steady state analysis in [HN11]. We propose the floating-queue algorithm to solve the learning issue in [HN11]. The result obtains the best of both worlds: it achieves the desired steady state performance but is just as adaptive to network changes as LIFO scheduling. For finite buffers of size $B$ at every queue in a network, deviation from utility optimality is shown to decrease like $O(e^{-B})$ and packet drops are shown to have rate $O(e^{-B})$, while average per-hop delay is $O(B)$.

The chapter is organized as follows. The system model is described in Section 4.1. Section 4.2 describes the standard drift-plus-penalty approach. The floating-queue algorithm is introduced in Section 4.3.
Performance of the floating-queue algorithm is analyzed in Section 4.4 and is validated by simulation in Section 4.5. Section 4.6 concludes the chapter.

Related Works

Prior works attempt to improve network delay without sacrificing reliability, which is measured by a rate of packet drops. Previous works [Nee06b] and [Nee07] use an exponential Lyapunov function and assumed knowledge of an $\epsilon$ parameter, where $\epsilon$ measures a distance associated with the optimal operation point. They achieve an optimal $[O(1/V), O(\log(V))]$ utility-delay tradeoff. A simpler methodology allows packet drops in order to obtain an $[O(1/V), O([\log(V)]^2)]$ utility-delay tradeoff [HN11, HMNK13]. In [HN11], a steady state behavior is observed in order to learn a placeholder parameter that achieves the tradeoff in steady state. However, the algorithm does not gracefully adapt to changes of the network state distribution. It would need another mechanism to sense changes and then recompute a new placeholder parameter with each change. The Last-In-First-Out (LIFO) queue discipline is employed to resolve this issue [HMNK13]. However, these works, which achieve average queue size that grows logarithmically in $V$, still assume the availability of infinite buffer space [Nee07, HN11, HMNK13].

A practical implementation of the LIFO scheme is developed in [MSKG10]. The work in [MSKG10] also introduces a floating-queue algorithm, operating under the LIFO scheme, to deal with finite buffers. The algorithm in [MSKG10] is heuristic, and it is not clear how to analyze its behavior. The work in this chapter is inspired by the floating-queue idea of [MSKG10] and adopts the same "floating queue" terminology, even though the floating-queue algorithm developed here is different from [MSKG10]. Indeed, the floating-queue technique of this chapter operates under the First-In-First-Out (FIFO) scheme. It splits each queue into two queues (one for real and one for fake packets) and yields analytical guarantees on utility, delay, and packet drops.
In particular, if $1/V$ is the desired distance to optimality, the required buffer size grows only logarithmically in $V$.

Several backpressure approaches [RBPS10, ABJ+13, JJS13] attempt to improve network delay. However, those focus on specific aspects and do not have the theoretical utility-delay tradeoff. Stochastic network optimization with finite buffers has been studied previously in [LMS12]. That work uses a non-standard Lyapunov function and knowledge of an $\epsilon$ parameter to derive an upper bound on the required buffer size. However, the $\epsilon$ parameter can be difficult to determine in practice, and the resulting utility-delay tradeoff is still $[O(1/V), O(V)]$. An implementation of that work is studied in [XME12].

4.1 System model

The network model of this chapter is similar to that of [HN11]. Consider a network with $N$ queues that evolve in discrete (slotted) time $t \in \{0, 1, 2, \ldots\}$. At each time slot, a network controller observes the current network state before making a decision. The goal of the controller is to minimize a time-average cost subject to network stability. An example of time-average cost is average power incurred over the network. Utility maximization can be treated by defining the slot-$t$ cost as $-1$ times a slot-$t$ reward. An example utility maximization problem is to maximize time-average network throughput. The rest of the chapter deals with cost minimization, with the understanding that this can also treat utility maximization.

4.1.1 Network state

The network experiences randomness every time slot. This randomness is called the network state and can represent a vector of channel conditions and/or random arrivals for slot $t$. Assume there are $M$ different network states. Define $\mathcal{S} = \{s_1, s_2, \ldots, s_M\}$ as the set of all possible states. Let $S(t)$ denote the network state experienced by the network at time $t$. Let $\pi_m \in [0, 1]$ be the steady state probability that $S(t) = s_m$, i.e., $\pi_m = \mathbb{P}\{S(t) = s_m\}$.
For simplicity, it is assumed that $S(t)$ is independent and identically distributed (i.i.d.) over slots. The same results can be shown in the general case of ergodic but non-i.i.d. processes (see [HN11]). The network controller can observe $S(t)$ before making the slot-$t$ decision, but the $\pi_m$ probabilities are not necessarily known to the controller.

4.1.2 Control decision

Every time slot, the network controller chooses a decision from a set of feasible actions which depends on the current network state. Formally, define $\mathcal{X}_{S(t)}$ as the decision set depending on $S(t)$, and let $x(t)$ denote the decision chosen by the controller at time $t$, where $x(t) \in \mathcal{X}_{S(t)}$. Assume the action set $\mathcal{X}_{s_m}$ is finite for every $s_m \in \mathcal{S}$.

On slot $t$, the pair $(x(t), S(t))$ affects the network in two aspects:

1) A cost is incurred. The cost is $f(t) \triangleq f(x(t), S(t)) : \mathcal{X}_{S(t)} \to \mathbb{R}$. An example cost is energy expenditure. Another example is $-1$ times the amount of newly admitted packets.

2) Queues are served. The service variable is $\mu_{ij}(t)$, representing the integer amount of packets taken from queue $i$ and transmitted to queue $j$, for all $i, j \in \mathcal{N} \triangleq \{1, \ldots, N\}$. This is determined by a function $\mu_{ij}(t) \triangleq \mu_{ij}(x(t), S(t)) : \mathcal{X}_{S(t)} \to \mathbb{Z}_+$, where $\mathbb{Z}_+$ is the set of non-negative integers. Further, the decision admits an integer amount $\mu_{0i}(t) \triangleq \mu_{0i}(x(t), S(t)) : \mathcal{X}_{S(t)} \to \mathbb{Z}_+$ of exogenous packets to queue $i \in \mathcal{N}$. Packets depart from the network at queue $j \in \mathcal{N}$ with an integer amount $\mu_{j0}(t) \triangleq \mu_{j0}(x(t), S(t)) : \mathcal{X}_{S(t)} \to \mathbb{Z}_+$. Note that we set $\mu_{ii}(t) = 0$ for all $i \in \mathcal{N} \cup \{0\}$ and for all $t$. The transmission, admission, and departure are shown in Figure 4.1.

Figure 4.1: Arrivals and services at a standard queue

Let $\mu^{(\max)}$ be a positive integer that bounds the magnitudes of $\sum_{i=0}^N \mu_{in}(x, s_m)$ and $\sum_{j=0}^N \mu_{nj}(x, s_m)$ for every $s_m \in \mathcal{S}$, every $x \in \mathcal{X}_{s_m}$, and every $n \in \mathcal{N}$. Furthermore, the network optimization is assumed to satisfy the following Slater condition [HN11]. Let $|\mathcal{X}|$ be the cardinality of set $\mathcal{X}$.
For every $s_m \in \mathcal{S}$ and $k \in \{1, \ldots, |\mathcal{X}_{s_m}|\}$, there exist probabilities $\alpha_k^{s_m}$ (such that $\sum_{k=1}^{|\mathcal{X}_{s_m}|} \alpha_k^{s_m} = 1$ for all $s_m \in \mathcal{S}$) that define a stationary and randomized algorithm. Whenever the network controller observes state $S(t) = s_m$, the stationary and randomized algorithm chooses action $x_k^{s_m}$ with conditional probability $\alpha_k^{s_m}$. The Slater condition assumes there exists such a stationary and randomized algorithm that satisfies:
$$\sum_{m=1}^M \sum_{k=1}^{|\mathcal{X}_{s_m}|} \pi_m \alpha_k^{s_m} \left[\sum_{i=0}^N \mu_{in}(x_k^{s_m}, s_m) - \sum_{j=0}^N \mu_{nj}(x_k^{s_m}, s_m)\right] \le -\eta \quad \text{for all } n \in \mathcal{N},$$
for some $\eta > 0$. In fact, this assumption is the standard Slater condition of convex optimization [BNO03].

4.1.3 Standard queue

The network consists of $N$ standard queues. Let $Q_n(t)$ denote the backlog in queue $n$ at time $t$, and let $Q(t) \triangleq (Q_1(t), \ldots, Q_N(t))$ be the vector of these backlogs. The backlog dynamic of queue $n \in \{1, \ldots, N\}$ is
$$Q_n(t+1) = \max\left[Q_n(t) - \sum_{j=0}^N \mu_{nj}(t),\ 0\right] + \sum_{i=0}^N \mu_{in}(t). \qquad (4.1)$$
When there are not enough packets in a queue, i.e., $Q_n(t) < \sum_{j=0}^N \mu_{nj}(t)$, blank packets are used to fill up transmissions.

4.1.4 Stochastic formulation

The controller seeks to minimize the expected time-average cost while maintaining queue stability. The expected time-average cost is defined by
$$\overline{f} \triangleq \limsup_{T\to\infty} \frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}[f(t)],$$
and queue stability is satisfied when
$$\limsup_{T\to\infty} \frac{1}{T}\sum_{t=0}^{T-1}\sum_{n=1}^N \mathbb{E}[Q_n(t)] < \infty.$$
The stochastic network optimization problem is
$$\text{Minimize} \quad \overline{f} \qquad (4.2)$$
$$\text{Subject to} \quad \text{queue stability}$$
$$\qquad\qquad\quad x(t) \in \mathcal{X}_{S(t)} \quad \text{for all } t.$$

4.2 Drift-plus-penalty algorithm

4.2.1 The algorithm

The drift-plus-penalty method in [Nee10] can solve problem (4.2) via a greedy decision at each time slot that does not require knowledge of the steady state probabilities. The method has parameter $V \ge 0$. In the special case of $V = 0$, this algorithm is also called "MaxWeight" or "Backpressure."

Drift-Plus-Penalty Algorithm: At every time $t \in \{0, 1, 2, \ldots\}$, the network controller observes network state $S(t)$ and backlog vector $Q(t)$.
Decision $x(t) \in \mathcal{X}_{S(t)}$ is chosen to solve:
$$x(t) = \operatorname*{argmin}_{x \in \mathcal{X}_{S(t)}} \left\{ V f(x, S(t)) + \sum_{n=1}^N Q_n(t)\left[\sum_{i=0}^N \mu_{in}(x, S(t)) - \sum_{j=0}^N \mu_{nj}(x, S(t))\right] \right\}. \qquad (4.3)$$
Depending on the separability structure of problem (4.3), it can be decomposed into smaller subproblems that can be solved distributively. The algorithm is summarized in Algorithm 3.

Initialize $Q(0) = 0$
for $t \in \{0, 1, 2, \ldots\}$ do
  Observe $S(t)$ and $Q(t)$
  Choose $x(t)$ according to (4.3)
  Update $Q_n(t+1)$ according to (4.1) $\forall n \in \mathcal{N}$
end for
Algorithm 3: The drift-plus-penalty algorithm that solves problem (4.2)

It has been shown in [Nee10] that
$$f^{(\mathrm{opt})} \le f^{(\mathrm{dpp})} \le f^{(\mathrm{opt})} + O(1/V) \qquad (4.4)$$
$$\limsup_{T\to\infty} \frac{1}{T}\sum_{t=0}^{T-1}\sum_{n=1}^N \mathbb{E}[Q_n(t)] = O(V), \qquad (4.5)$$
where $f^{(\mathrm{dpp})}$ is the expected time-average cost achieved by the drift-plus-penalty algorithm, and $f^{(\mathrm{opt})}$ is the optimal cost of problem (4.2). The inequality (4.4) implies that the drift-plus-penalty algorithm achieves cost within $O(1/V)$ of the optimal cost, which can be made as small as desired by choosing a sufficiently large value of $V$. The equality (4.5) implies that average queue backlog grows linearly with $V$. Applying Little's law gives the $[O(1/V), O(V)]$ utility-delay tradeoff (see [BG92] for a standard description of Little's law). Notice that the drift-plus-penalty algorithm assumes infinite buffer size at each queue, even though the average queue size is bounded by $O(V)$.

4.2.2 Deterministic problem

In order to consider a finite buffer regime, the steady-state behavior of the drift-plus-penalty algorithm is considered. In [HN11], the stochastic problem (4.2) is shown to have an associated deterministic problem as follows:
$$\text{Minimize} \quad V \sum_{m=1}^M \pi_m f(x^{s_m}, s_m) \qquad (4.6)$$
$$\text{Subject to} \quad \sum_{m=1}^M \pi_m \sum_{i=0}^N \mu_{in}(x^{s_m}, s_m) \le \sum_{m=1}^M \pi_m \sum_{j=0}^N \mu_{nj}(x^{s_m}, s_m) \quad \forall n \in \mathcal{N} \qquad (4.7)$$
$$\qquad\qquad\quad x^{s_m} \in \mathcal{X}_{s_m} \quad \forall m \in \{1, \ldots, M\}.$$
Let $\gamma = (\gamma_1, \ldots, \gamma_N)$ be a vector of dual variables associated with constraint (4.7).
The dual function of problem (4.6) is defined as:
$$g(\gamma) = \sum_{m=1}^M \pi_m \inf_{x^{s_m} \in \mathcal{X}_{s_m}} \left\{ V f(x^{s_m}, s_m) + \sum_{n=1}^N \gamma_n \left[\sum_{i=0}^N \mu_{in}(x^{s_m}, s_m) - \sum_{j=0}^N \mu_{nj}(x^{s_m}, s_m)\right] \right\}. \qquad (4.8)$$
This dual function (4.8) is concave. Therefore, the following dual problem is a convex optimization problem:
$$\text{Maximize} \quad g(\gamma) \qquad (4.9)$$
$$\text{Subject to} \quad \gamma \in \mathbb{R}_+^N.$$
Let $\gamma^V = (\gamma_1^V, \ldots, \gamma_N^V)$ be a vector of Lagrange multipliers, which solves the dual problem (4.9) with parameter $V$. The following theorem from [HMNK13] describes a steady state property of the drift-plus-penalty algorithm. Define $t_0$ as the first time index in a steady state of the drift-plus-penalty algorithm.

Theorem 10 Suppose $\gamma^V$ is unique, the Slater condition holds, and the dual function $g(\gamma)$ satisfies:
$$g(\gamma^V) \ge g(\gamma) + L\|\gamma^V - \gamma\| \quad \text{for all } \gamma \in \mathbb{R}_+^N,$$
for some constant $L > 0$, independent of $V$. Then under the drift-plus-penalty algorithm, there exist constants $D, K, c^*$, independent of $V$, such that for any $\nu \ge 0$, the following upper bound holds:
$$\mathcal{P}(D, K\nu) \le c^* e^{-\nu}, \qquad (4.10)$$
where
$$\mathcal{P}(D, K\nu) \triangleq \limsup_{T\to\infty} \frac{1}{T}\sum_{t=t_0}^{t_0+T-1} \mathbb{P}\left\{\exists n :\ |Q_n(t) - \gamma_n^V| > D + K\nu\right\}. \qquad (4.11)$$

Proof: Please see the proof in [HN11]. □

As all transmissions, admissions, and departures are integers, the queue vector has a countably infinite number of possibilities. Under a mild ergodic assumption that the steady state distribution of $\{Q(t) : t \ge 0\}$ exists, the $\mathcal{P}(D, K\nu)$ value in (4.11) becomes the steady state probability that backlog deviates more than $D + K\nu$ away from the vector of Lagrange multipliers. Note that the probability in (4.10) decays exponentially in $\nu$. This implies that, in steady state, a large portion of arrivals and services occur when the queue backlog vector is close to the Lagrange multiplier vector $\gamma^V$. Thus, if
The algorithm preserves the dynamics of the drift-plus-penalty algorithm and hence inherits several of its performance guarantees. The oating-queue algorithm is adaptive and does not need to know or learn the underlying Lagrange multipliers. Recall that standard queue n2N has dynamic (4.1). To simplify notation, let a n (t) and b n (t) denote respectively aggregated arrivals and services at queue n and time t: a n (t), N X i=0 in (t); b n (t), N X j=0 nj (t): (4.12) This implies that (max) upper bounds both a n (t) and b n (t). The dynamic (4.1) can be written as Q n (t + 1) = max [Q n (t)b n (t); 0] +a n (t): (4.13) For the rest of this chapter, the above dynamic is considered for a standard queue. Note that the values ofa n (t) andb n (t) are realized after knowing all ij (t) from the drift-plus- penalty algorithm. 96 Figure 4.2: Transformation of a standard queue to a oating queue 4.3.1 Queue transformation Recall that the processes ij (t), a n (t), b n (t), and Q n (t) are dened by the drift-plus- penalty algorithm and take non-negative integer values. We now break ij (t), a n (t), b n (t) into non-negative integer components that represent real and fake parts: ij (t) = r ij (t) + f ij (t) for all i2N;j2N[f0g (4.14) a n (t) =a r n (t) +a f n (t) for all n2N (4.15) b n (t) =b r n (t) +b f n (t) for all n2N (4.16) whereN =f1;:::;Ng. We also dene Q r n (t) and Q f n (t) as real and fake components of the Q n (t) process. An illustration of this decomposition is given in Figure 4.2. To precisely establish this decomposition we shall recursively specify the real components. Recall that Q n (0) = 0 for all n2N . Dene Q r n (0) = Q f n (0) = 0 for all n2N . For a given slott2f0; 1; 2;:::g, assume thatQ r n (t) andQ f n (t) have been dened for alln2N . 
For each $n \in \mathcal{N}$, let $b_n^r(t)$ represent the amount of service given to the real data at queue $n$ on slot $t$, defined by:
$$b_n^r(t) = \min\left[Q_n^r(t),\ b_n(t)\right]. \qquad (4.17)$$
This equation means that real data has strict priority over fake data, so that all of the available service $b_n(t)$ is allocated to serve real data. Notice that $b_n^r(t)$ is an integer that satisfies $0 \le b_n^r(t) \le b_n(t)$, and so the corresponding fake component $b_n^f(t)$ defined by (4.16) is indeed a non-negative integer.

We next specify the $\mu_{ij}^r(t)$ variables. All exogenous arrivals are considered to be real, so we define $\mu_{0j}^r(t) = \mu_{0j}(t)$ and $\mu_{0j}^f(t) = 0$ for all $j \in \mathcal{N}$ and all $t$. Next, for $n \in \mathcal{N}$ and $j \in \mathcal{N}\cup\{0\}$, we choose the $\mu_{nj}^r(t)$ as any arbitrary integers that satisfy the following constraints (defined in terms of the known $b_n^r(t)$ integers):
$$\sum_{j=0}^N \mu_{nj}^r(t) = b_n^r(t) \quad \text{for all } n \in \mathcal{N} \qquad (4.18)$$
$$\mu_{nj}^r(t) \in \{0, 1, \ldots, \mu_{nj}(t)\} \quad \text{for all } n \in \mathcal{N},\ j \in \mathcal{N}\cup\{0\}. \qquad (4.19)$$
The constraints (4.19) ensure that the corresponding fake components $\mu_{nj}^f(t)$ defined by (4.14) are indeed non-negative integers. The constraints (4.18)-(4.19) allow freedom in the assignments of the particular $\mu_{nj}^r(t)$ quantities, provided that (4.18)-(4.19) are upheld. Our analytical results shall hold for any such assignments. Finally, define $a_n^r(t)$ by:
$$a_n^r(t) = \sum_{i=0}^N \mu_{in}^r(t), \qquad (4.20)$$
and notice that $0 \le a_n^r(t) \le a_n(t)$, so the corresponding $a_n^f(t)$ processes defined by (4.15) are indeed non-negative.

Before specifying the update equations for $Q_n^r(t)$ and $Q_n^f(t)$, we must account for the finite buffer size $B$ in the real network queues, where $B$ is a positive integer. Indeed, real arrivals $a_n^r(t)$ must be interpreted as attempted arrivals, some of which may be dropped due to the buffer constraint.
Specifically, the process $a_n^r(t)$ is decomposed as:
$$a_n^r(t) = a_n^{r'}(t) + d_n(t), \qquad (4.21)$$
where $d_n(t)$ is the non-negative amount dropped, and $a_n^{r'}(t)$ is the amount of real data actually admitted, defined by:
$$a_n^{r'}(t) = \min\left[B - Q_n^r(t),\ a_n^r(t)\right]. \qquad (4.22)$$
This definition implies that $a_n^{r'}(t)$ is an integer that satisfies $0 \le a_n^{r'}(t) \le a_n^r(t)$, and so $d_n(t)$ defined by (4.21) is also a non-negative integer. Define $a_n^{f'}(t)$ as the admitted fake arrivals, including the original fake arrivals $a_n^f(t)$ and the dropped amount $d_n(t)$:
$$a_n^{f'}(t) = a_n^f(t) + d_n(t). \qquad (4.23)$$
It follows immediately from (4.21) and (4.23) that
$$a_n(t) = a_n^r(t) + a_n^f(t) = a_n^{r'}(t) + a_n^{f'}(t). \qquad (4.24)$$

4.3.2 Real and fake queueing dynamics

The update equations for $Q_n^r(t)$ and $Q_n^f(t)$ are defined:
$$Q_n^r(t+1) = Q_n^r(t) - b_n^r(t) + a_n^{r'}(t) \qquad (4.25)$$
$$Q_n^f(t+1) = \max\left[Q_n^f(t) - b_n^f(t),\ 0\right] + a_n^{f'}(t). \qquad (4.26)$$

Lemma 21 Under the above dynamics (4.25)-(4.26), the $Q_n^r(t)$ and $Q_n^f(t)$ processes take non-negative integer values that satisfy the following for all $n \in \mathcal{N}$ and $t \in \{0, 1, 2, \ldots\}$:
$$Q(t) = Q^r(t) + Q^f(t). \qquad (4.27)$$

Proof: Fix $n \in \mathcal{N}$. Recall that initial conditions are defined by $Q_n(0) = Q_n^r(0) = Q_n^f(0) = 0$, and so the result holds for $t = 0$. Suppose it holds at some slot $t \in \{0, 1, 2, \ldots\}$. It is easy to see from (4.25)-(4.26) that $Q_n^r(t+1)$ and $Q_n^f(t+1)$ are non-negative integers. It suffices to show they sum to $Q_n(t+1)$. From (4.16) and (4.24), we have
$$Q_n(t+1) = \max\left[Q_n(t) - b_n(t),\ 0\right] + a_n(t) = \max\left[Q_n^r(t) + Q_n^f(t) - b_n^r(t) - b_n^f(t),\ 0\right] + a_n^{r'}(t) + a_n^{f'}(t). \qquad (4.28)$$
When there are not enough real packets, i.e., $Q_n^r(t) < b_n(t)$, it follows that $Q_n^r(t) - b_n^r(t) = 0$ from (4.17). Equation (4.28) becomes
$$Q_n(t+1) = \max\left[Q_n^f(t) - b_n^f(t),\ 0\right] + a_n^{f'}(t) + a_n^{r'}(t) = Q_n^f(t+1) + Q_n^r(t+1).$$
When there are enough real packets, i.e., $Q_n^r(t) \ge b_n(t)$, we have that $b_n^f(t) = 0$ from (4.17) and (4.16).
Equation (4.28) becomes
$$Q_n(t+1) = Q_n^f(t) + a_n^{f'}(t) + Q_n^r(t) - b_n^r(t) + a_n^{r'}(t) = Q_n^f(t+1) + Q_n^r(t+1).$$
Thus, $Q_n(t+1) = Q_n^f(t+1) + Q_n^r(t+1)$ for all $n \in \mathcal{N}$. By induction, this proves the lemma. □

The implication of Lemma 21 is that, although the floating-queue algorithm implements these real and fake queues instead of the standard queues, the dynamics of $Q(t)$ and $Q^r(t) + Q^f(t)$ are the same. Hence, when decision $x(t)$ is chosen by solving (4.3) with $Q^r(t) + Q^f(t)$ instead of $Q(t)$, all decisions $\{x(t)\}_{t=0}^\infty$ under the standard algorithm (in Algorithm 3) are identical to the decisions $\{x(t)\}_{t=0}^\infty$ under the floating-queue algorithm (in Algorithm 4), given that $Q^r(0) + Q^f(0) = Q(0)$. Yet, the buffer size of each real queue in the floating-queue algorithm is $B$. The floating-queue algorithm is summarized in Algorithm 4.

Initialize $Q^r(0) = Q^f(0) = 0$
for $t \in \{0, 1, 2, \ldots\}$ do
  Observe $S(t)$ and let $Q(t) = Q^r(t) + Q^f(t)$
  Choose $x(t)$ according to (4.3)
  Calculate $(a_n(t), b_n(t))$ according to (4.12) $\forall n \in \mathcal{N}$
  Calculate $(b_n^r(t), b_n^f(t))$ according to (4.16)-(4.17) $\forall n \in \mathcal{N}$
  Adjust $(a_n^{r'}(t), a_n^{f'}(t))$ according to (4.22)-(4.23) $\forall n \in \mathcal{N}$
  Update $(Q_n^r(t+1), Q_n^f(t+1))$ according to (4.25)-(4.26) $\forall n \in \mathcal{N}$
end for
Algorithm 4: The floating-queue algorithm that solves problem (4.2)

We prove a useful lemma of the floating-queue algorithm, which will be used in Section 4.4.2.

Lemma 22 Under the floating-queue algorithm with buffer size $B \ge 2\mu^{(\max)}$ at every real queue $n \in \mathcal{N}$, if $d_n(t) > 0$, then $Q_n^f(t+1) > Q_n^f(t)$.

Proof: Event $d_n(t) > 0$ implies that $a_n^r(t) > a_n^{r'}(t)$ from (4.21) and $a_n^{r'}(t) = B - Q_n^r(t)$ from (4.22), so $Q_n^r(t) > B - a_n^r(t)$. When $B \ge 2\mu^{(\max)}$, we have $Q_n^r(t) > 2\mu^{(\max)} - a_n^r(t) \ge \mu^{(\max)}$, and there are enough real packets for all services. Therefore, all services take real packets, and $b_n^r(t) = b_n(t)$, $b_n^f(t) = 0$ from (4.17) and (4.16), respectively.
From (4.26) and (4.23), we have
$$Q_n^f(t+1) = Q_n^f(t) + a_n^{f'}(t) = Q_n^f(t) + a_n^f(t) + d_n(t) > Q_n^f(t). \qquad \Box$$

The interpretation of Lemma 22 is that, for any queue with buffer size $B \ge 2\mu^{(\max)}$, if real packets are dropped at time $t$, then the fake backlog always increases from time $t$ to $t+1$.

4.4 Performance analysis

The steady-state performance of the floating-queue algorithm is analyzed by bounding from below the admitted real arrivals at each queue $n \in \mathcal{N}$. Define $\chi_n(t)$ as a sample path of the arrivals, services, and backlogs of queue $n$ that is generated by the floating-queue algorithm at time $t$:
$$\chi_n(t) \triangleq \left(a_n^r(t),\ a_n^f(t),\ b_n^r(t),\ b_n^f(t),\ Q_n^r(t),\ Q_n^f(t)\right).$$
Recall that $t_0$ is the first time index at which the drift-plus-penalty algorithm enters a steady state. For any positive integer $T$, a sample path of queue $n$ from $t_0$ to $t_0 + T$ is denoted by $\{\chi_n(t)\}_{t=t_0}^{t_0+T}$. Note that $(Q_n(t), d_n(t), a_n^{r'}(t), a_n^{f'}(t))$ can be determined from $\chi_n(t)$.

From a sample path $\chi_n(t)$, the amount of real arrivals is $a_n^r(t)$, and the amount of admitted real arrivals is $a_n^{r'}(t)$, which depends on the floating-queue mechanism (4.22). To lower bound this admitted real arrival $a_n^{r'}(t)$, we construct another mechanism, called the lower-bound policy, that operates over the sample path. It has a different rule for counting admitted real packets (later defined as $\hat{a}_n^r(t)$), which is part of the real arrivals $a_n^r(t)$. We will show (in Lemma 27) that the amount of admitted real arrivals under the floating-queue algorithm is lower bounded by the amount of admitted real arrivals under the lower-bound policy. Using this lower bound, the performance of the floating-queue algorithm can be analyzed.

4.4.1 Lower-bound policy

In this section, queue $n \in \mathcal{N}$ is fixed and the lower-bound policy is derived for this queue. Recall that $\gamma^V$ is the Lagrange multiplier of problem (4.9), and $\mu^{(\max)}$ is the upper bound on $a_n(t)$ and $b_n(t)$.
Define
$$\mathcal{B}_n \triangleq \left[ V\gamma_n - B/2 + \delta^{(\max)},\; V\gamma_n + B/2 - \delta^{(\max)} \right].$$
Let $\hat{a}_n^r(t)$ denote the number of admitted real packets under the lower-bound policy at time $t$. Given any sample path $\chi_n(t)$, having real arrivals $a_n^r(t)$ and total backlog $Q_n(t)$, the lower-bound policy counts real packets according to
$$\hat{a}_n^r(t) = \begin{cases} a_n^r(t), & Q_n(t) \in \mathcal{B}_n \\ 0, & Q_n(t) \notin \mathcal{B}_n. \end{cases} \quad (4.29)$$
Let $\hat{d}_n(t)$ denote the number of dropped packets under the lower-bound policy at time $t$. It satisfies
$$\hat{d}_n(t) = a_n^r(t) - \hat{a}_n^r(t). \quad (4.30)$$
Note that $\hat{a}_n^r(t)$ and $\hat{d}_n(t)$ are artificial quantities and are not real and fake packets in a real system. These values can be determined from $a_n^r(t)$ and $Q_n(t)$ of the sample path $\chi_n(t)$.

4.4.2 Sample path analysis

The goal of this section is to show (Lemma 27) that, for any queue $n \in \mathcal{N}$ and any positive integer $T$, the admitted real arrivals under the floating-queue algorithm with buffer size $B$ are lower bounded as
$$\sum_{t=t_0}^{t_0+T-1} a_n^{r\prime}(t) \geq \sum_{t=t_0}^{t_0+T-1} \hat{a}_n^r(t) - B.$$

Figure 4.3: Time interval $\mathcal{T}(T)$ is partitioned into $\mathcal{T}_H(T)$ and $\mathcal{T}_L(T)$.

Recall that queue $n$ is fixed and analyzed; however, the analysis results hold for every queue $n \in \mathcal{N}$. Define $\mathcal{T}(T) \triangleq \{t_0, \ldots, t_0 + T - 1\}$ as the time interval of consideration from $t_0$ to $t_0 + T - 1$. It can be partitioned into disjoint sets $\mathcal{T}_H(T)$ and $\mathcal{T}_L(T)$, which are illustrated in Figure 4.3, where
$$\mathcal{T}_H(T) \triangleq \left\{ t \in \mathcal{T}(T) : Q_n^f(t) \geq V\gamma_n - B/2 \text{ and } Q_n^f(t+1) \geq V\gamma_n - B/2 \right\}, \qquad \mathcal{T}_L(T) \triangleq \mathcal{T}(T) \setminus \mathcal{T}_H(T).$$
Set $\mathcal{T}_H(T)$ has the property that $Q_n^f(t) \geq V\gamma_n - B/2$ for every $t \in \mathcal{T}_H(T)$, which can be used to prove the following lemma.

Lemma 23 When $B \geq 2\delta^{(\max)}$, given any sample path $\{\chi_n(t)\}_{t=t_0}^\infty$ with positive integer $T$, the following inequality holds:
$$\sum_{t \in \mathcal{T}_H(T)} a_n^{r\prime}(t) \geq \sum_{t \in \mathcal{T}_H(T)} \hat{a}_n^r(t).$$
Proof: Fix $t \in \mathcal{T}_H(T)$. Two cases are examined.
1) When $Q_n(t) \in \mathcal{B}_n$, we have $a_n^{r\prime}(t) = a_n^r(t)$, because real queue $n$ has enough buffer space:
$$Q_n^r(t) = Q_n(t) - Q_n^f(t) \leq V\gamma_n + B/2 - \delta^{(\max)} - (V\gamma_n - B/2) = B - \delta^{(\max)},$$
so at least $\delta^{(\max)} \geq a_n^r(t)$ units of space remain. The inequality holds because $Q_n(t) \in \mathcal{B}_n$ and $t \in \mathcal{T}_H(T)$. For the lower-bound policy, we have $\hat{a}_n^r(t) = a_n^r(t)$, because $Q_n(t) \in \mathcal{B}_n$. So $a_n^{r\prime}(t) = \hat{a}_n^r(t)$.

2) When $Q_n(t) \notin \mathcal{B}_n$, we have $a_n^{r\prime}(t) \geq \hat{a}_n^r(t)$, because $\hat{a}_n^r(t) = 0$ from (4.29) and $a_n^{r\prime}(t)$ is non-negative.

These two cases imply the lemma.

Set $\mathcal{T}_L(T)$ can be partitioned into disjoint intervals. We first consider a special case with $Q_n^r(t_0) = 0$ (or equivalently $Q_n(t_0) = Q_n^f(t_0)$). Then each interval starts at time $t_L^-$ and ends at time $t_L^+$, where $t_L^-$ and $t_L^+$ satisfy the following. At time $t_L^-$, either i) $t_L^- = t_0$ and $Q_n^f(t_0) < V\gamma_n - B/2$, or ii) $Q_n^f(t_L^-) \geq V\gamma_n - B/2$ and $Q_n^f(t_L^- + 1) < V\gamma_n - B/2$. At time $t_L^+$, either i) $t_L^+ = t_0 + T - 1$, or ii) $Q_n^f(t_L^+) < V\gamma_n - B/2$ and $Q_n^f(t_L^+ + 1) \geq V\gamma_n - B/2$. This is illustrated in Figure 4.4.

Lemma 24 When $B \geq 2\delta^{(\max)}$, given a sample path $\{\chi_n(t)\}_{t=t_0}^\infty$ with $Q_n^r(t_0) = 0$ and positive integer $T$, the following holds for any interval between $t_L^-$ and $t_L^+$:
$$\sum_{t=t_L^-}^{t_L^+} a_n^{r\prime}(t) \geq \sum_{t=t_L^-}^{t_L^+} \hat{a}_n^r(t).$$

Figure 4.4: Set $\mathcal{T}_L(T)$ is partitioned into sub-intervals, each starting from $t_L^-$ and ending at $t_L^+$.

Proof: The proof is in the appendix of this chapter.

Lemma 25 When $B \geq 2\delta^{(\max)}$, given a sample path $\{\chi_n(t)\}_{t=t_0}^\infty$ with $Q_n^r(t_0) = 0$ and positive integer $T$, the following holds:
$$\sum_{t \in \mathcal{T}_L(T)} a_n^{r\prime}(t) \geq \sum_{t \in \mathcal{T}_L(T)} \hat{a}_n^r(t).$$
Proof: This is a direct consequence of Lemma 24 when all the intervals between $t_L^-$ and $t_L^+$ are combined.

The following lemma considers the interval $\mathcal{T}(T)$.

Lemma 26 When $B \geq 2\delta^{(\max)}$, given a sample path $\{\chi_n(t)\}_{t=t_0}^\infty$ with $Q_n^r(t_0) = 0$ and positive integer $T$, it holds that
$$\sum_{t=t_0}^{t_0+T-1} a_n^{r\prime}(t) \geq \sum_{t=t_0}^{t_0+T-1} \hat{a}_n^r(t).$$
Proof: Disjoint time intervals $\mathcal{T}_H(T)$ and $\mathcal{T}_L(T)$ partition $\mathcal{T}(T)$.
Then Lemma 23 and Lemma 25 imply the lemma.

The above lemma is not general, since it requires $Q_n^r(t_0) = 0$. Its general version is now provided.

Lemma 27 When $B \geq 2\delta^{(\max)}$, given any sample path $\{\chi_n(t)\}_{t=t_0}^\infty$ and positive integer $T$, it holds for any $Q_n^r(t_0) \in \{0, 1, \ldots, B\}$ that
$$\sum_{t=t_0}^{t_0+T-1} a_n^{r\prime}(t) \geq \sum_{t=t_0}^{t_0+T-1} \hat{a}_n^r(t) - B.$$
Proof: An artificial sample path $\tilde{\chi}_n(t) = ( \tilde{a}_n^r(t), \tilde{a}_n^f(t), \tilde{b}_n^r(t), \tilde{b}_n^f(t), \tilde{Q}_n^r(t), \tilde{Q}_n^f(t) )$ satisfying all floating-queue mechanisms is considered. This sample path is constructed from an augmented system in Section 4.1 with a network state set $\tilde{\mathcal{S}} = \mathcal{S} \cup \{\tilde{s}\}$, where $\mathcal{X}_{\tilde{s}}$ contains an action $\tilde{x}$ such that $\mu_{0n}(\tilde{x}, \tilde{s}) = 1$ and $\mu_{ij}(\tilde{x}, \tilde{s}) = 0$ for every $i \in \mathcal{N}$ and $j \in \mathcal{N} \cup \{0\}$. Define $t_1 = t_0 - Q_n^r(t_0)$. The sample path $\{\tilde{\chi}_n(t)\}_{t=t_1}^\infty$ is constructed as follows:

- $\tilde{\chi}_n(t) = \chi_n(t)$ for all $t \in \{t_0, t_0+1, \ldots\}$,
- $\tilde{Q}_n^r(t_1) = 0$ and $\tilde{Q}_n^f(t_1) = Q_n^f(t_0)$,
- $(x(t), S(t)) = (\tilde{x}, \tilde{s})$ for all $t \in \{t_1, t_1+1, \ldots, t_0-1\}$.

During $t \in \{t_1, \ldots, t_0-1\}$, every received packet $a_n(t) = \sum_{i=0}^N \mu_{in}(\tilde{x}, \tilde{s}) = 1$ is real and is put in the real queue, so $\tilde{a}_n^r(t) = 1$, since the real queue is empty at time $t_1$ and the duration length is at most $Q_n^r(t_0) \leq B$. It holds that $\tilde{a}_n^f(t) = \tilde{b}_n^r(t) = \tilde{b}_n^f(t) = 0$, since there is no transmission: $\mu_{ij}(\tilde{x}, \tilde{s}) = 0$ for every $i \in \mathcal{N}$ and $j \in \mathcal{N} \cup \{0\}$. The fake queue stays the same, as there are no drops, no fake arrivals, and no transmissions. It is easy to see that the sample path conforms to all floating-queue mechanisms in Sections 4.3.1 and 4.3.2.

Let $\tilde{a}_n^{r\prime}(t)$ and $\hat{\tilde{a}}_n^r(t)$ respectively denote the admitted real arrivals of the new sample path under the floating-queue algorithm and the lower-bound policy.
Since $\sum_{t=t_0}^{t_0+T-1} \tilde{a}_n^{r\prime}(t) = \sum_{t=t_0}^{t_0+T-1} a_n^{r\prime}(t)$, it follows that
$$\sum_{t=t_1}^{t_0-1} \tilde{a}_n^{r\prime}(t) + \sum_{t=t_0}^{t_0+T-1} a_n^{r\prime}(t) = \sum_{t=t_1}^{t_0-1} \tilde{a}_n^{r\prime}(t) + \sum_{t=t_0}^{t_0+T-1} \tilde{a}_n^{r\prime}(t) \geq \sum_{t=t_1}^{t_0-1} \hat{\tilde{a}}_n^r(t) + \sum_{t=t_0}^{t_0+T-1} \hat{\tilde{a}}_n^r(t) = \sum_{t=t_1}^{t_0-1} \hat{\tilde{a}}_n^r(t) + \sum_{t=t_0}^{t_0+T-1} \hat{a}_n^r(t).$$
The inequality is an application of Lemma 26, as the artificial sample path starts with an empty real queue $\tilde{Q}_n^r(t_1) = 0$. The last equality holds because $\{Q_n(t)\}_{t=t_0}^\infty$ is identical in both the original and new sample paths. Then the facts that $\sum_{t=t_1}^{t_0-1} \hat{\tilde{a}}_n^r(t) \geq 0$ and $\sum_{t=t_1}^{t_0-1} \tilde{a}_n^{r\prime}(t) = \sum_{t=t_1}^{t_0-1} 1 = Q_n^r(t_0) \leq B$ yield
$$\sum_{t=t_0}^{t_0+T-1} a_n^{r\prime}(t) \geq \sum_{t=t_0}^{t_0+T-1} \hat{a}_n^r(t) - B.$$

4.4.3 Performance of the floating-queue algorithm

4.4.3.1 Average drop rate

The average drop rate at each queue is analyzed using the steady-state and sample-path results. Recall that constants $D, K, c$ are defined in Theorem 10.

Lemma 28 Suppose $B > 2(\delta^{(\max)} + D)$. In the steady state, the average drop rate at each real queue $n \in \mathcal{N}$ under the floating-queue algorithm is bounded by
$$\lim_{T \to \infty} \frac{1}{T} \sum_{t=t_0}^{t_0+T-1} \mathbb{E}[d_n(t)] \leq \delta^{(\max)} c\, e^{-[B/2 - \delta^{(\max)} - D]/K}.$$
Proof: We consider queue $n \in \mathcal{N}$. Let $\mathbb{I}\{X\}$ be the indicator function of statement $X$, such that $\mathbb{I}\{X\} = 1$ if statement $X$ is true and $\mathbb{I}\{X\} = 0$ otherwise. Equation (4.29) can be written as $\hat{a}_n^r(t) = a_n^r(t)\, \mathbb{I}\{Q_n(t) \in \mathcal{B}_n\} = a_n^r(t) - a_n^r(t)\, \mathbb{I}\{Q_n(t) \notin \mathcal{B}_n\}$.
Then we have
$$\sum_{t=t_0}^{t_0+T-1} \mathbb{E}[\hat{a}_n^r(t)] = \sum_{t=t_0}^{t_0+T-1} \mathbb{E}\left[ a_n^r(t) - a_n^r(t)\, \mathbb{I}\{Q_n(t) \notin \mathcal{B}_n\} \right].$$
Taking an expectation of the result of Lemma 27 yields
$$\sum_{t=t_0}^{t_0+T-1} \mathbb{E}[a_n^{r\prime}(t)] \geq \sum_{t=t_0}^{t_0+T-1} \mathbb{E}[\hat{a}_n^r(t)] - B.$$
Combining the above two relations gives
$$\sum_{t=t_0}^{t_0+T-1} \mathbb{E}[a_n^{r\prime}(t)] \geq \sum_{t=t_0}^{t_0+T-1} \mathbb{E}[a_n^r(t)] - \sum_{t=t_0}^{t_0+T-1} \mathbb{E}\left[ a_n^r(t)\, \mathbb{I}\{Q_n(t) \notin \mathcal{B}_n\} \right] - B.$$
Rearranging terms and using the fact that $a_n^r(t) \leq \delta^{(\max)}$ for all $t$ together with definition (4.21) yields
$$\sum_{t=t_0}^{t_0+T-1} \mathbb{E}[d_n(t)] \leq \sum_{t=t_0}^{t_0+T-1} \delta^{(\max)} \Pr\{Q_n(t) \notin \mathcal{B}_n\} + B.$$
Dividing by $T$ and taking the limit as $T$ approaches infinity yields
$$\lim_{T \to \infty} \frac{1}{T} \sum_{t=t_0}^{t_0+T-1} \mathbb{E}[d_n(t)] \leq \delta^{(\max)} \lim_{T \to \infty} \frac{1}{T} \sum_{t=t_0}^{t_0+T-1} \Pr\{Q_n(t) \notin \mathcal{B}_n\}. \quad (4.31)$$
In steady state, Theorem 10, applied with deviation $B/2 - \delta^{(\max)} - D$, yields
$$\lim_{T \to \infty} \frac{1}{T} \sum_{t=t_0}^{t_0+T-1} \Pr\{Q_n(t) \notin \mathcal{B}_n\} \leq \limsup_{T \to \infty} \frac{1}{T} \sum_{t=t_0}^{t_0+T-1} \Pr\left\{ \exists n,\ \left| Q_n(t) - V\gamma_n \right| > B/2 - \delta^{(\max)} \right\} = \mathcal{P}(D, B/2 - \delta^{(\max)} - D) \leq c\, e^{-[B/2 - \delta^{(\max)} - D]/K}.$$
Applying the above bound to (4.31) proves the lemma.

4.4.3.2 Delay

At each queue, the average delay experienced by real packets is derived by invoking Little's law [BG92]. Define
$$\bar{a}_n^r \triangleq \lim_{T \to \infty} \frac{1}{T} \sum_{t=t_0}^{t_0+T-1} \mathbb{E}[a_n^r(t)], \qquad \bar{a}_n \triangleq \lim_{T \to \infty} \frac{1}{T} \sum_{t=t_0}^{t_0+T-1} \mathbb{E}[a_n(t)].$$

Lemma 29 Suppose $B > 2(\delta^{(\max)} + D)$. In the steady state, the average delay at real queue $n \in \mathcal{N}$ under the floating-queue algorithm is bounded by
$$\text{Per-hop delay} \leq \frac{B}{\bar{a}_n^r - \delta^{(\max)} c\, e^{-[B/2 - \delta^{(\max)} - D]/K}}.$$
Proof: Since the buffer size of queue $n \in \mathcal{N}$ is $B$, Little's law implies
$$\text{Per-hop delay} \leq \frac{B}{\lim_{T \to \infty} \frac{1}{T} \sum_{t=t_0}^{t_0+T-1} \mathbb{E}[a_n^{r\prime}(t)]} = \frac{B}{\lim_{T \to \infty} \frac{1}{T} \sum_{t=t_0}^{t_0+T-1} \mathbb{E}[a_n^r(t) - d_n(t)]} \leq \frac{B}{\bar{a}_n^r - \delta^{(\max)} c\, e^{-[B/2 - \delta^{(\max)} - D]/K}}.$$
The implication of Lemma 29 is that, when $B$ is large enough that the exponentially decaying term $\delta^{(\max)} c\, e^{-[B/2 - \delta^{(\max)} - D]/K}$ and the number of drops at other queues are negligible, $\bar{a}_n^r$ is approximately $\bar{a}_n$, and the average delay is $O(B)$.

4.4.3.3 Objective cost

The average objective cost is considered in two cases.
Let $f^{(FQ)}(t)$ denote the cost under the floating-queue algorithm at time $t$, and let $f^{(FQ)} \triangleq \lim_{T \to \infty} \frac{1}{T} \sum_{t=t_0}^{t_0+T-1} \mathbb{E}[f^{(FQ)}(t)]$ denote the expected time-average cost under the floating-queue algorithm.

Drop-independent cost: In this case, packet drops do not affect the objective cost. Such a cost can be the energy expenditure spent to transmit both real and fake packets. Due to this independence, the average cost follows immediately from the result of the drift-plus-penalty algorithm as in (4.4).

Theorem 11 Suppose each real queue has buffer size $B > 2(\delta^{(\max)} + D)$. When $V > 0$ and packet drops do not incur any penalty cost, the floating-queue algorithm achieves:
$$f^{(FQ)} = f^{(dpp)} \leq f^{(opt)} + O(1/V)$$
$$\text{Per-hop delay} \leq O(B/(1 - e^{-B})) = O(B)$$
$$\text{Average drop rate} \leq O(e^{-B}).$$
Note that the transient time of the drift-plus-penalty algorithm is $O(V)$, so parameter $V$ cannot be set to infinity.

Drop-dependent cost: In this case, packet drops affect the objective cost. Such a cost can be the amount of admitted packets. Let $\beta < \infty$ be the maximum penalty cost per unit of packet drop. Then we have the following result.

Theorem 12 Suppose $B > 2(\delta^{(\max)} + D)$. When $V > 0$ and $\beta$ is the maximum penalty cost per unit of packet drop, the floating-queue algorithm achieves:
$$f^{(FQ)} \leq f^{(opt)} + O(1/V) + O(e^{-B})$$
$$\text{Per-hop delay} \leq O(B/(1 - e^{-B})) = O(B)$$
$$\text{Average drop rate} \leq O(e^{-B}).$$
Proof: Recall that $f^{(dpp)}(t)$ is the cost incurred at time $t$ under the drift-plus-penalty algorithm. At each time $t$, we have
$$f^{(FQ)}(t) \leq f^{(dpp)}(t) + \beta \sum_{n=1}^N d_n(t).$$
Summing from $t = t_0$ to $t = t_0 + T - 1$, dividing by $T$, and taking an expectation gives
$$\frac{1}{T} \sum_{t=t_0}^{t_0+T-1} \mathbb{E}[f^{(FQ)}(t)] \leq \frac{1}{T} \sum_{t=t_0}^{t_0+T-1} \mathbb{E}[f^{(dpp)}(t)] + \frac{\beta}{T} \sum_{t=t_0}^{t_0+T-1} \sum_{n=1}^N \mathbb{E}[d_n(t)].$$
Taking a limit as $T$ approaches infinity gives
$$\lim_{T \to \infty} \frac{1}{T} \sum_{t=t_0}^{t_0+T-1} \mathbb{E}[f^{(FQ)}(t)] \leq f^{(dpp)} + \beta N \delta^{(\max)} c\, e^{-[B/2 - \delta^{(\max)} - D]/K} \leq f^{(opt)} + O(1/V) + O(e^{-B}).$$

4.5 Simulation

A line network with 4 queues, shown in Figure 4.5, is simulated in two scenarios. The common network configuration is as follows.
In each time slot, an exogenous packet arrives with probability 0.92. Transmission $\mu_{ij}(t)$ is orthogonal and depends on a channel state that is "good" with probability 0.9 and "bad" with probability 0.1, for $(i,j) \in \{(1,2), (2,3), (3,4), (4,0)\}$.

Figure 4.5: Line network

Figure 4.6: Average delay and average drop rate of the power minimization problem with $V = 200$ (per-queue and end-to-end curves, plotted against buffer size $B$)

4.5.1 Power minimization

In this scenario, all exogenous arrivals are admitted. When the channel state is "good", one packet is transmitted using 1 unit of power; otherwise, 2 units of power are used. The goal is to stabilize this network while minimizing the power usage. Note that the optimal average minimum power is $1 \times 0.9 + 2 \times 0.02 = 0.94$ per hop, and the average total power is $0.94 \times 4 = 3.76$. Simulation results of this scenario are shown in Figure 4.6. The time-average power expenditure is 3.761 for all buffer sizes $B$. In Figure 4.6, the average delay increases linearly with the buffer size, and the average drop rate decays exponentially with the buffer size. This result confirms the bounds in Theorem 11.

Figure 4.7: Average throughput, average delay, and average drop rate of the throughput maximization problem with $V = 200$

4.5.2 Throughput maximization

In this scenario, the network decides whether to admit the random exogenous arrival in each time slot. The goal is to maximize the time-average end-to-end throughput of real packets.
Packet drops reduce the value of this objective function. Transmission $\mu_{ij}(t) = 1$ is possible if the channel state is "good"; otherwise, transmission is not allowed. Note that the maximum admission rate is 0.9 because of the limitation of the average transmission rate. This scenario also represents the worst case, because every link operates at its capacity. Figure 4.7 shows the simulation results of this scenario, which comply with the bounds in Theorem 12.

4.5.3 Dynamic state distributions

The adaptiveness of the floating-queue algorithm is illustrated when the distribution of the network state changes. The previous throughput maximization simulation is reconsidered, where the buffer of each queue can hold 18 packets. The channel state $S_{23}(t)$ of $\mu_{23}(t)$ is time-varying and evolves as follows:
$$\Pr\{S_{23}(t) \text{ is "good"}\} = \begin{cases} 0.9, & t \in [0, 2 \times 10^5) \\ 1.0, & t \in [2 \times 10^5, 4 \times 10^5) \\ 0.8, & t \in [4 \times 10^5, 6 \times 10^5) \\ 0.6, & t \in [6 \times 10^5, 8 \times 10^5) \\ 0.9, & t \in [8 \times 10^5, 10^6]. \end{cases}$$
Figure 4.8 shows the average throughput, received at the right-most queue in the line network, and the average rate of total end-to-end packet drops. The average drop rate (0.084) is higher than usual when the algorithm starts. During $[0, 2 \times 10^5)$, the average drop rate is 0.009, which is consistent with the average drop rate in Figure 4.7. When channel $S_{23}(t)$ is always "good" during $[2 \times 10^5, 4 \times 10^5)$, the average drop rate decreases, as the link is not a bottleneck while the other three links are. When the distribution changes at time $t = 4 \times 10^5$, the link becomes the only bottleneck and its associated Lagrange multiplier increases. This leads to a sudden burst of drops (0.148) that fills the fake queue $Q_3^f(t)$. Once the queue is filled, there are no further drops while the link is the only bottleneck. It is easy to see that the floating-queue algorithm adapts to distribution changes without any intervention.
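The per-queue bookkeeping that these simulations rely on can be sketched as follows. This is a minimal sketch under stated assumptions, not the exact simulator used in this chapter: packets are unit-size, the function and variable names are illustrative, and the admission/service rules paraphrase (4.16)-(4.26): service drains real packets first, real arrivals are admitted up to the buffer limit $B$, and drops are re-counted as fake arrivals so that $Q^r + Q^f$ mimics the unbounded queue of Algorithm 3.

```python
def floating_queue_step(q_r, q_f, a_r, a_f, b, B):
    """One slot of floating-queue bookkeeping at a single queue (unit packets).

    q_r, q_f: real/fake backlogs; a_r, a_f: real/fake arrivals;
    b: offered service; B: real-buffer size. Returns (q_r, q_f, drops).
    """
    b_r = min(q_r, b)            # service drains real packets first
    b_f = b - b_r                # leftover service is charged to the fake queue
    admit = min(a_r, B - q_r)    # real arrivals fit in the remaining buffer space
    drops = a_r - admit          # overflow is dropped ...
    a_fp = a_f + drops           # ... but re-counted as a fake arrival, so the
                                 # sum q_r + q_f keeps tracking the total backlog
    q_r = q_r - b_r + admit      # real queue never exceeds B
    q_f = max(q_f - b_f, 0) + a_fp
    return q_r, q_f, drops
```

For example, with $B = 4$, two bursts of three real arrivals fill the real buffer and divert the overflow into the fake queue, and a later burst of service empties both queues.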
Figure 4.8: Throughput maximization with dynamic state distributions and buffer size $B = 18$. Results are averaged with a window size of 500. (Top panel: throughput after drops versus $t$; bottom panel: average end-to-end packet drops, with spikes of 0.084 at start-up and 0.148 at the distribution change.)

4.6 Chapter summary

We propose the general floating-queue algorithm that allows the stochastic network optimization framework to operate with finite buffers. When the buffer size at each queue is $B$, we prove that the proposed algorithm achieves an $O(e^{-B})$ optimality gap and $O(B)$ per-hop delay, and incurs an $O(e^{-B})$ drop rate. The optimality gap and drop rate decay exponentially with respect to the buffer size. We confirm the theoretical results with simulations.

Appendix

To prove Lemma 24, an interval between $t_L^-$ and $t_L^+$ is partitioned into decreasing intervals and non-decreasing intervals. A sample path $\chi_n(t)$ is a decreasing interval if $Q_n^f(t) > Q_n^f(t+1)$ for any $t \in \{t_L^-, \ldots, t_L^+\}$. A non-decreasing interval is considered contiguously as an interval of the form $\{\chi_n(t)\}_{t=t_b}^{t_e}$ that starts from $t_b$ and ends at $t_e$. The time $t_b$ has two possibilities: i) $t_b = t_0$, or ii) $Q_n^f(t_b) \leq Q_n^f(t_b+1)$ and $\chi_n(t_b - 1)$ is a decreasing interval. Similarly, the time $t_e$ has two possibilities: i) $t_e = t_L^+$, or ii) $Q_n^f(t_e) \leq Q_n^f(t_e+1)$ and $\chi_n(t_e+1)$ is a decreasing interval. Every non-decreasing interval satisfies $Q_n^f(t) \leq Q_n^f(t+1)$ for every $t \in \{t_b, \ldots, t_e\}$, as illustrated in Figure 4.9.

Figure 4.9: The time interval between $t_L^-$ and $t_L^+$ is decomposed into decreasing and non-decreasing sub-intervals.

We first prove that the admitted real arrivals under the floating-queue algorithm are at least the admitted real arrivals under the lower-bound policy when a sample path is decreasing.

Lemma 30 When $B \geq 2\delta^{(\max)}$, it holds for any decreasing sample path $\chi_n(t)$ with $Q_n^f(t) > Q_n^f(t+1)$ that $a_n^{r\prime}(t) \geq \hat{a}_n^r(t)$.
Proof: Since $Q_n^f(t) > Q_n^f(t+1)$, Lemma 22 implies that $d_n(t) = 0$. From (4.21), we have $a_n^{r\prime}(t) = a_n^r(t) \geq \hat{a}_n^r(t)$.

We then show that a similar result holds for every non-decreasing interval. Useful properties of a non-decreasing interval are proven first. From (4.26), define $\tilde{b}_n^f(t) = \min[Q_n^f(t), b_n^f(t)]$ as the number of fake packets served at fake queue $n$ at time $t$. Then the fake queue update (4.26) can be written as
$$Q_n^f(t+1) = Q_n^f(t) - \tilde{b}_n^f(t) + a_n^{f\prime}(t). \quad (4.32)$$

Lemma 31 Suppose $B \geq 2\delta^{(\max)}$ and $Q_n^r(t_0) = 0$. The following properties hold for any non-decreasing interval $\{\chi_n(t)\}_{t=t_b}^{t_e}$:
i) $Q_n^r(t_b) = a_n^r(t_b - 1)\, \mathbb{I}\{t_b \neq t_0\}$
ii) $\tilde{b}_n^f(t) - a_n^f(t) \leq 0$ for all $t \in \{t_b, \ldots, t_e\}$
iii) $Q_n(t_b - 1) < V\gamma_n - B/2 + \delta^{(\max)}$ when $t_b \neq t_0$.

Proof: i) When $t_b = t_0$, we have $Q_n^r(t_b) = 0$. Now suppose $t_b \neq t_0$. Since $Q_n^f(t_b - 1) > Q_n^f(t_b)$, the fake queue update (4.32) implies that $Q_n^f(t_b - 1) > Q_n^f(t_b - 1) - \tilde{b}_n^f(t_b - 1) + a_n^{f\prime}(t_b - 1)$ and $\tilde{b}_n^f(t_b - 1) > a_n^{f\prime}(t_b - 1) \geq 0$. Since $b_n^f(t_b - 1) \geq \tilde{b}_n^f(t_b - 1)$, it follows that $b_n^f(t_b - 1) > 0$. From (4.16) and (4.17), it follows that $b_n(t_b - 1) > b_n^r(t_b - 1)$ and $b_n^r(t_b - 1) = Q_n^r(t_b - 1)$, so $Q_n^r(t_b) = a_n^{r\prime}(t_b - 1)$ from (4.25). From Lemma 22, the event $Q_n^f(t_b) < Q_n^f(t_b - 1)$ implies that $d_n(t_b - 1) = 0$. Therefore, $a_n^r(t_b - 1) = a_n^{r\prime}(t_b - 1)$ from (4.21). Thus, it follows that $Q_n^r(t_b) = a_n^r(t_b - 1)$, which proves the first property. The intuition behind this result is that a real queue must be emptied before a fake queue decreases, so the real queue only contains recent real arrivals.

ii) Two cases are considered for every $t \in \{t_b, \ldots, t_e\}$. When $\tilde{b}_n^f(t) = 0$, it holds that $\tilde{b}_n^f(t) - a_n^f(t) = -a_n^f(t) \leq 0$, since $a_n^f(t) \geq 0$.
When $\tilde{b}_n^f(t) > 0$, since $Q_n^f(t) \leq Q_n^f(t+1)$ always holds for every $t \in \{t_b, \ldots, t_e\}$, it follows from (4.32) that $Q_n^f(t) \leq Q_n^f(t) - \tilde{b}_n^f(t) + a_n^{f\prime}(t)$, and we have
$$\tilde{b}_n^f(t) - a_n^{f\prime}(t) \leq 0. \quad (4.33)$$
Since $b_n^f(t) \geq \tilde{b}_n^f(t) > 0$, it follows from (4.16) and (4.17) that $Q_n^r(t) = b_n^r(t) \leq \delta^{(\max)}$. Then there is sufficient space for all real arrivals at time $t$, i.e., $B - Q_n^r(t) \geq 2\delta^{(\max)} - \delta^{(\max)} = \delta^{(\max)} \geq a_n^r(t)$, so no packet is dropped and $d_n(t) = 0$. It follows from (4.23) and (4.33) that $a_n^{f\prime}(t) = a_n^f(t)$ and $\tilde{b}_n^f(t) - a_n^f(t) \leq 0$, which proves the second property.

iii) The last part is proven by contradiction. Suppose $Q_n(t_b - 1) \geq V\gamma_n - B/2 + \delta^{(\max)}$. It follows from (4.25) and (4.32) that
$$Q_n(t_b) = Q_n(t_b - 1) + a_n^{r\prime}(t_b - 1) + a_n^{f\prime}(t_b - 1) - b_n^r(t_b - 1) - \tilde{b}_n^f(t_b - 1) = Q_n(t_b - 1) + a_n^r(t_b - 1) + a_n^f(t_b - 1) - b_n^r(t_b - 1) - \tilde{b}_n^f(t_b - 1) \geq V\gamma_n - B/2 + \delta^{(\max)} + a_n^r(t_b - 1) - \delta^{(\max)} = V\gamma_n - B/2 + a_n^r(t_b - 1),$$
where the second equality uses (4.24), and the inequality uses the assumption on $Q_n(t_b - 1)$ and the facts that $a_n^f(t_b - 1) \geq 0$ and $b_n^r(t_b - 1) + \tilde{b}_n^f(t_b - 1) \leq b_n(t_b - 1) \leq \delta^{(\max)}$. From the first part (i), we have $Q_n(t_b) = Q_n^r(t_b) + Q_n^f(t_b) = a_n^r(t_b - 1) + Q_n^f(t_b)$. Therefore, the inequality above becomes $a_n^r(t_b - 1) + Q_n^f(t_b) \geq V\gamma_n - B/2 + a_n^r(t_b - 1)$, so $Q_n^f(t_b) \geq V\gamma_n - B/2$. Since time $t_b$ begins a non-decreasing interval, $Q_n^f(t_b + 1) \geq Q_n^f(t_b) \geq V\gamma_n - B/2$, so $t_b \in \mathcal{T}_H(T)$, which contradicts $t_b \in \mathcal{T}_L(T)$.

The above lemma is utilized to prove that the admitted real arrivals under the floating-queue algorithm are lower bounded by the admitted real arrivals under the lower-bound policy during any non-decreasing interval.
Lemma 32 When $B \geq 2\delta^{(\max)}$ and $Q_n^r(t_0) = 0$, the following relationships hold for any non-decreasing interval $\{\chi_n(t)\}_{t=t_b}^{t_e}$:
i) When $t_b = t_0$,
$$\sum_{t=t_b}^{t_e} a_n^{r\prime}(t) \geq \sum_{t=t_b}^{t_e} \hat{a}_n^r(t). \quad (4.34)$$
ii) When $t_b \neq t_0$,
$$\sum_{t=t_b-1}^{t_e} a_n^{r\prime}(t) \geq \sum_{t=t_b-1}^{t_e} \hat{a}_n^r(t). \quad (4.35)$$
Proof: We first consider packet drops under the lower-bound policy. A simple case occurs when $Q_n(t) < V\gamma_n - B/2 + \delta^{(\max)}$ for every $t \in \{t_b, \ldots, t_e\}$. The lower-bound policy drops all real arrivals, as in (4.29), and
$$\sum_{t=t_b}^{t_e} a_n^{r\prime}(t) \geq 0 = \sum_{t=t_b}^{t_e} \hat{a}_n^r(t).$$
When $t_b = t_0$, the above inequality proves (4.34). When $t_b \neq t_0$, Lemma 30 and the fact that $\chi_n(t_b - 1)$ is a decreasing interval lead to $a_n^{r\prime}(t_b - 1) \geq \hat{a}_n^r(t_b - 1)$. Adding this to the above bound proves (4.35).

Now we consider the case with $Q_n(t) \geq V\gamma_n - B/2 + \delta^{(\max)}$ for some $t \in \{t_b, \ldots, t_e\}$. Let
$$t^\star = \arg\inf_{t \in \{t_b, \ldots, t_e\}} \left\{ Q_n(t) \geq V\gamma_n - B/2 + \delta^{(\max)} \right\} \quad (4.36)$$
be the first time that $Q_n(t)$ is at least $V\gamma_n - B/2 + \delta^{(\max)}$. Note that this also means $\hat{d}_n(t) = a_n^r(t)$ for every $t \in \{t_b, \ldots, t^\star - 1\}$, since the lower-bound policy drops all arrivals when $Q_n(t) < V\gamma_n - B/2 + \delta^{(\max)}$. It follows from (4.25) and (4.32) that
$$Q_n(t+1) = Q_n^r(t+1) + Q_n^f(t+1) = Q_n^r(t) - b_n^r(t) + a_n^{r\prime}(t) + Q_n^f(t) - \tilde{b}_n^f(t) + a_n^{f\prime}(t) = Q_n(t) - b_n^r(t) - \tilde{b}_n^f(t) + a_n^r(t) + a_n^f(t),$$
where the last equality follows directly from (4.24). When $t^\star > t_b$, summing the above from $t = t_b$ to $t = t^\star - 1$ gives
$$Q_n(t^\star) = Q_n(t_b) + \sum_{t=t_b}^{t^\star-1} \left[ -b_n^r(t) - \tilde{b}_n^f(t) + a_n^r(t) + a_n^f(t) \right].$$
It can be rewritten as
$$\sum_{t=t_b}^{t^\star-1} a_n^r(t) = Q_n(t^\star) - Q_n(t_b) + \sum_{t=t_b}^{t^\star-1} \left[ \tilde{b}_n^f(t) - a_n^f(t) + b_n^r(t) \right].$$
From the definition of $t^\star$, arrivals $a_n^r(t)$ are dropped under the lower-bound policy for all $t \in \{t_b, \ldots, t^\star - 1\}$, so the above becomes
$$\sum_{t=t_b}^{t^\star-1} \hat{d}_n(t) = Q_n(t^\star) - Q_n(t_b) + \sum_{t=t_b}^{t^\star-1} \left[ \tilde{b}_n^f(t) - a_n^f(t) + b_n^r(t) \right].$$
When $t^\star = t_b$, the above equation is replaced by $0 = Q_n(t^\star) - Q_n(t_b)$.
Therefore, we represent both situations, i.e., $t^\star = t_b$ and $t^\star > t_b$, by the above equation, with the understanding that $\sum_{t=t_b}^{t_b-1} x(t) = 0$ for any $x(t)$. Note that a situation with $t_0 = t_b = t^\star$ never happens.

Summing property (ii) of Lemma 31 for $t \in \{t^\star, \ldots, t_e\}$ yields $\sum_{t=t^\star}^{t_e} [\tilde{b}_n^f(t) - a_n^f(t)] \leq 0$. Adding this sum to the above equality yields
$$\sum_{t=t_b}^{t^\star-1} \hat{d}_n(t) \geq Q_n(t^\star) - Q_n(t_b) + \sum_{t=t_b}^{t_e} \left[ \tilde{b}_n^f(t) - a_n^f(t) \right] + \sum_{t=t_b}^{t^\star-1} b_n^r(t).$$
The definition of $t^\star$ implies that $Q_n(t^\star) \geq V\gamma_n - B/2 + \delta^{(\max)} > Q_n^f(t_e) + \delta^{(\max)} \geq Q_n^f(t_e + 1)$, since $Q_n^f(t_e) < V\gamma_n - B/2$. Further, the first property (i) of Lemma 31 implies that $Q_n(t_b) = Q_n^r(t_b) + Q_n^f(t_b) = a_n^r(t_b - 1)\, \mathbb{I}\{t_b \neq t_0\} + Q_n^f(t_b)$. Therefore, the above inequality becomes
$$\sum_{t=t_b}^{t^\star-1} \hat{d}_n(t) > Q_n^f(t_e + 1) - Q_n^f(t_b) - a_n^r(t_b - 1)\, \mathbb{I}\{t_b \neq t_0\} + \sum_{t=t_b}^{t_e} \left[ \tilde{b}_n^f(t) - a_n^f(t) \right] + \sum_{t=t_b}^{t^\star-1} b_n^r(t). \quad (4.37)$$
Now we consider packet drops under the floating-queue algorithm. Summing (4.32) from $t = t_b$ to $t = t_e$ gives
$$Q_n^f(t_e + 1) = Q_n^f(t_b) + \sum_{t=t_b}^{t_e} \left[ -\tilde{b}_n^f(t) + a_n^{f\prime}(t) \right].$$
Applying definition (4.23) to the above yields
$$\sum_{t=t_b}^{t_e} d_n(t) = Q_n^f(t_e + 1) - Q_n^f(t_b) + \sum_{t=t_b}^{t_e} \left[ \tilde{b}_n^f(t) - a_n^f(t) \right]. \quad (4.38)$$
Applying (4.38) to (4.37) leads to
$$\sum_{t=t_b}^{t^\star-1} \hat{d}_n(t) > \sum_{t=t_b}^{t_e} d_n(t) - a_n^r(t_b - 1)\, \mathbb{I}\{t_b \neq t_0\} + \sum_{t=t_b}^{t^\star-1} b_n^r(t).$$
Since $\hat{d}_n(t)$ and $b_n^r(t)$ are non-negative, it follows that
$$\sum_{t=t_b}^{t_e} \hat{d}_n(t) > \sum_{t=t_b}^{t_e} d_n(t) - a_n^r(t_b - 1)\, \mathbb{I}\{t_b \neq t_0\}.$$
Adding $\sum_{t=t_b}^{t_e} a_n^r(t)$ to both sides of the above inequality and rearranging terms gives
$$\sum_{t=t_b}^{t_e} \left[ a_n^r(t) - d_n(t) \right] > \sum_{t=t_b}^{t_e} \left[ a_n^r(t) - \hat{d}_n(t) \right] - a_n^r(t_b - 1)\, \mathbb{I}\{t_b \neq t_0\}.$$
The fact that $\hat{a}_n^r(t) = a_n^r(t) - \hat{d}_n(t)$ and (4.21) imply that
$$\sum_{t=t_b}^{t_e} a_n^{r\prime}(t) > \sum_{t=t_b}^{t_e} \hat{a}_n^r(t) - a_n^r(t_b - 1)\, \mathbb{I}\{t_b \neq t_0\}.$$
When $t_b = t_0$, we have $\mathbb{I}\{t_b \neq t_0\} = 0$ and the above proves (4.34).
When $t_b \neq t_0$, the third property (iii) of Lemma 31 implies that $a_n^r(t_b - 1) = \hat{d}_n(t_b - 1)$, since $Q_n(t_b - 1) < V\gamma_n - B/2 + \delta^{(\max)}$ and the lower-bound policy therefore drops all real arrivals at time $t_b - 1$, which is how this term, $a_n^r(t_b - 1)$, was introduced in (4.37). Adding $a_n^r(t_b - 1)$ to both sides of the above gives
$$\sum_{t=t_b}^{t_e} a_n^{r\prime}(t) + a_n^r(t_b - 1) > \sum_{t=t_b}^{t_e} \hat{a}_n^r(t) + a_n^r(t_b - 1) - \hat{d}_n(t_b - 1).$$
Since $Q_n^f(t_b - 1) > Q_n^f(t_b)$, Lemma 22 implies $d_n(t_b - 1) = 0$ and $a_n^r(t_b - 1) = a_n^{r\prime}(t_b - 1)$. Combining this with the fact that $\hat{a}_n^r(t_b - 1) = a_n^r(t_b - 1) - \hat{d}_n(t_b - 1)$ proves (4.35).

Finally, Lemma 24 is proven.

Proof (of Lemma 24): This lemma is a direct consequence of Lemma 30 and Lemma 32, because any interval between $t_L^-$ and $t_L^+$ can be partitioned into decreasing intervals and non-decreasing intervals.

Chapter 5 Traffic Load Balancing for Intra-Datacenter Networks

In this chapter, a theoretical network stability problem is used to design a new practical load-balancing algorithm for datacenter networks. The new algorithm works gracefully with TCP flows and can be implemented on SDN switches. The results in this chapter are based in part on [SN16b].

Datacenter networks serve as infrastructure for search engines, social networks, cloud computing, etc. Due to potentially high traffic loads, load balancing becomes an important solution to improve network utilization and alleviate hot spots [GHJ+09, AFLV08, SOA+15, RZB+15]. A widely used technique is equal-cost multipath (ECMP), where traffic flows are split equally according to the number of available equal-cost next-hops. However, ECMP does not take actual traffic into account and is susceptible to asymmetric topology [KGR+15, AED+14, ZTZ+14]. Further, the deployment of ECMP is limited due to its equal-cost constraint [GLL+09].

Traffic load balancing can be implemented using software-defined networking (SDN).
An SDN switch, a network device supporting layer-2 and layer-3 operations in the OSI architecture, consists of a data plane and a control plane [ONF12].¹ The data plane forwards packets according to given rules and operates at a fast timescale, e.g., 1 ns. The control plane sets those rules and operates at a much slower timescale, e.g., 1 ms. Several traffic load-balancing algorithms can be implemented through the control plane. The challenge is to design a load-balancing algorithm that is implementable on SDN switches.

From theoretical network optimization, an algorithm is throughput optimal if it stably supports any feasible traffic load, so that the average backlog is bounded [TE92]. Specifically, a throughput-optimal algorithm utilizes the entire network capacity and can distribute traffic to any portion of the network to maintain network stability. MaxWeight [TE92] is a well-known throughput-optimal algorithm and has been studied for packet radio [TE92], switching [MMAW99], and inter-datacenter networking [JWA15]. It has been generalized to optimize power allocation [NMR05], throughput [ES06], etc. Practical aspects of MaxWeight, such as finite buffer capacity and fairness with TCP connections, have been studied in [SN15a, LMS12, SM16b]. However, MaxWeight is not suitable for in-network load balancing because it prohibits the sharing of link capacity at the data plane's timescale and thus causes high queue occupancy. This queue size has a finite average, but the length of the longer timescale makes that average unacceptably large.

The MaxWeight algorithm is illustrated by the example in Figure 5.1. Two traffic commodities share three links passing through switches 1, 2, and 3. Time is slotted. The slot size equals the length of the decision update interval (the control plane's timescale).

¹An SDN switch should not be confused with a crossbar switch, which may reside in the data plane of an SDN switch.
Figure 5.1: MaxWeight example. Let $w_{ij}^d$ be the weight of commodity $d$ over the link from switch $i$ to switch $j$. All weights in this figure are $(w_{12}^1, w_{12}^2, w_{23}^1, w_{23}^2) = (0, 1, 3, 1)$.

Figure 5.2: Timeline of queue occupancy under MaxWeight for $t = 0, \ldots, 14$: a small box represents a packet in a queue. Switch $i$ is represented by the number $i \in \{1, 2, 3\}$ under the long line. The short line under the number indicates that the commodity at the numbered switch is served in that particular time slot. The occupancy pattern repeats after $t = 14$, which is similar to the pattern at $t = 11$.

The capacity of each link is 3 packets per slot. Every switch has a dedicated queue for each commodity. In every time slot, MaxWeight calculates, for each link and commodity, a weight equal to the differential backlog between a queue and its next-hop queue. For that slot, the entire capacity of the link is allocated to the commodity with the maximum non-negative weight, while a commodity with negative weight is ignored. For example, commodity 2 is served on the link between switches 1 and 2 in Figure 5.1, and commodity 1 is served on the link between switches 2 and 3. Let the arrival rates to switch 1 of commodities 1 and 2 be respectively 1 and 2 packets per slot. The timeline of queue evolution is shown in Figure 5.2. MaxWeight is effective and always transmits three packets per slot after $t = 11$. This effectiveness requires a sufficient amount of queue backlog. For example, the commodity-2 backlog at switch 1 is always at least 5 for times $t \geq 11$. This occupancy might be acceptable. However, if the control plane is reconfigured at a slow timescale relative to the link capacity, the
However, if the control plane is recongured at a slow timescale relative to the link capacity, the 129 Figure 5.3: Timeline of queue occupancy under the ideal algorithm queue occupancy can be very high. For example, with a 1ms update interval, 10Gbps link speed, and 1kB packet size, each link can serve 1250 packets per slot (rather than just 3). This multiplies queue backlog in the timeline of Figure 5.2 by a factor 1250=3, so the minimum queue backlog of commodity 2 at switch 1 is 5(1250=3) 2083 fort 11. Another undesirable property of MaxWeight is that queue occupancy scales linearly with the number hops, as shown in [MSKG10, BSS09]. In fact, Figure 5.2 is inspired by an example in [MSKG10] for a dierent issue. In practice, the MaxWeight mechanism with a long update interval leads to i) large buer memory, ii) packet drops, iii) high latency, and iv) burstiness (no capacity sharing during an update interval). Issues (i){(iii) can be alleviated partially by the techniques in [Nee10,SN15c,HN11,BSS09]. However, issue (iv) resides in the decision making mech- anism of MaxWeight and still persists under those techniques. The situation is worse when issues (ii) and (iv) interact with TCP congestion control, causing slow ow rate and under utilization. To put it into theoretical perspective, even though MaxWeight solves a network stability problem with O(1) average queue size, the constant factor is too large for a practical system with a long update interval. Note that an ideal algorithm for the example in Figure 5.1 always serves 1 and 2 packets of commodities 1 and 2 by sharing the link capacity as shown in Figure 5.3. In this chapter, a new throughput-optimal algorithm is developed. The algorithm shares link capacity among commodities during an update interval, resulting in low queue 130 occupancy and low latency. The key challenge is to design a model and an algorithm that are analyzable, provably optimal, and practically implementable at the same time. 
The algorithm imitates the weighted fair queueing (WFQ) [DKS89, PG93] available in practical switches to provide fairness and low latency among TCP flows in practice. A general intra-datacenter network may have an exponential number of paths, and our algorithm comes with an optimality proof considering all possible paths while using per-commodity queueing, which grows only linearly with the number of commodities. This is also a key distinction from the path-based algorithm in [SG16].

Section 5.2 develops the throughput-optimal algorithm. Inspired by this algorithm, Section 5.3 presents an enhanced algorithm that includes heuristics to cope with practical aspects, including queue-information dissemination, queue approximation, and packet-reordering issues in TCP. This heuristic algorithm uses local queue information and locally measured traffic to hash TCP flows to next-hop switches and to set the weights of the weighted fair queueing. The hash-based mechanism is chosen to reduce packet reordering, which is not possible for DeTail [ZDM+12]. Simulation results in Section 5.4 show that both proposed algorithms outperform the MaxWeight and ECMP algorithms in ideal simulation and in more realistic simulation with OMNeT++ [Opeb].

Related Works

Existing traffic load-balancing methods for datacenter networks distribute traffic according to network capacity and measured traffic. Weighted-cost multipath [ZTZ+14]
However, the packet-by-packet dispersion in DeTail needs TCP with out-of-order resilience. The path-based approach in CONGA limits scalability. HULA and LetItFlow improve the scalability issue. Nevertheless, these algorithms require special hardware and do not come with analytical proofs of optimality. It is not clear if they are throughput optimal.

5.1 System model and design

The fast timescale of the data plane operates over slotted time $t \in \mathbb{Z}^+$, where $\mathbb{Z}^+ = \{0, 1, 2, \ldots\}$. The control plane configures the data plane every $T$ slots, where $T$ is a positive integer. Thus, reconfigurations occur at times in the set $\mathcal{T} = \{0, T, 2T, \ldots\}$.

5.1.1 Topology and routing

An intra-datacenter network is the interconnection of switches and destinations (such as servers), as shown in Figure 5.4. Traffic designated for a particular destination $d$ is called commodity-$d$ traffic. Let $\mathcal{N}$ be the set of all switches and $\mathcal{D}$ be the set of all destinations (commodities). A link between switches $i$ and $j$ is bi-directional with capacity $c_{ij}$ from $i$ to $j$, and capacity $c_{ji}$ in the reverse direction. Define $c_{ij} = 0$ if link $(i,j)$ does not exist or if $i = j$.

Figure 5.4: An example network with $\mathcal{N} = \{1, 2, \ldots, 14\}$ and $\mathcal{D} = \{1, 2, \ldots, 8\}$

Figure 5.5: Example of sets of switches at switch 9. Note that $\mathcal{P}^8_9$ must not contain 10 to avoid loops, which imposes $9 \notin \mathcal{H}^8_{10}$.

Each switch must decide where to send its packets next. Define $\mathcal{H}^d_i \subseteq \mathcal{N}$ as the set of next-hop switches available to commodity-$d$ packets at switch $i$, for all $i \in \mathcal{N}$ and $d \in \mathcal{D}$. In practice, these sets can be obtained from manual configuration or from other routing mechanisms. Define the set of all next-hop switches from switch $i$ as $\mathcal{H}_i = \bigcup_{d\in\mathcal{D}} \mathcal{H}^d_i$, and the set of previous-hop switches as $\mathcal{P}^d_i = \{ j \in \mathcal{N} : i \in \mathcal{H}^d_j \}$ for $i \in \mathcal{N}, d \in \mathcal{D}$. These sets are illustrated in Figure 5.5. Define $\mathcal{D}_{ij} = \{ d \in \mathcal{D} : j \in \mathcal{H}^d_i \}$ as the set of all commodities utilizing the link from switch $i$ to switch $j$, for $i, j \in \mathcal{N}$.
Note that this model allows arbitrary path lengths and can be applied to existing topologies in [GHJ+09, AFLV08, SOA+15, RZB+15, GLL+09].

5.1.2 Traffic

Switch $i$ receives $a^d_i(t)$ commodity-$d$ packets from external sources at time $t$ (for $i \in \mathcal{N}, d \in \mathcal{D}$). The external source represents a group of servers or a link connecting to the outside of the network. For each $i \in \mathcal{N}$, $d \in \mathcal{D}$, the arrival process $\{a^d_i(t)\}_{t=0}^{\infty}$ is independent and identically distributed (i.i.d.) across time slots. The i.i.d. assumption is useful for a simple and insightful analysis. The resulting algorithm developed under this assumption inspires a heuristic algorithm in Section 5.3 that does not require i.i.d. arrivals and works gracefully with TCP traffic.

Recall that $c_{ij}$ is the capacity of the link between switch $i$ and switch $j$, for $i, j \in \mathcal{N}$. Let $b^d_i$ be the capacity of the link between switch $i \in \mathcal{N}$ and destination $d \in \mathcal{D}$. Set $b^d_i = 0$ if switch $i$ does not have a direct link to destination $d$. Assume arrivals and link capacities are always bounded by a constant $\delta > 0$, so $0 \le c_{ij} \le \delta$, $0 \le a^d_i(t) \le \delta$, $0 \le b^d_i \le \delta$ for all $i, j \in \mathcal{N}, d \in \mathcal{D}, t \in \mathbb{Z}^+$.

5.1.3 Decision variables

Decision variables are defined for every link connecting switch $i$ to its next-hop switch $j$, for $i \in \mathcal{N}, j \in \mathcal{H}_i$. Recall that $\mathcal{D}_{ij}$ is the set of commodities using the link. At configuration time $t \in \mathcal{T}$, the control plane in switch $i$ chooses a control-plane decision variable $x^d_{ij}(t,T)$ for $d \in \mathcal{D}_{ij}$, which represents a constant transmission rate allocated to commodity $d$ (in units of packets) for the entire $T$-slot interval. Define $x^d_{ij}(t,T) = 0$ for $d \in \mathcal{D}\setminus\mathcal{D}_{ij}$. The control-plane decisions for link $(i,j)$ are chosen to satisfy the link capacity constraint:
$$\sum_{d\in\mathcal{D}_{ij}} x^d_{ij}(t,T) \le T c_{ij}.$$
Once $x^d_{ij}(t,T)$ is determined, no more than $x^d_{ij}(t,T)$ commodity-$d$ packets can be transmitted by the data plane during the interval $\{t, \ldots, t+T-1\}$. For example, the data plane can impose a token bucket mechanism.
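One way the data plane can enforce the per-interval cap $x^d_{ij}(t,T)$ is a token bucket that is refilled only at reconfiguration times; the following is a minimal sketch (class and method names are ours, for illustration only):

```python
class IntervalTokenBucket:
    """Caps the number of commodity-d packets sent on link (i, j)
    during one T-slot interval at the control-plane allocation x."""

    def __init__(self):
        self.tokens = 0

    def reconfigure(self, x_alloc):
        # Called at each t in {0, T, 2T, ...} with the new allocation x(t, T).
        self.tokens = x_alloc

    def try_send(self, num_packets):
        # Returns how many packets may actually be transmitted now.
        sent = min(num_packets, self.tokens)
        self.tokens -= sent
        return sent

bucket = IntervalTokenBucket()
bucket.reconfigure(5)      # control plane allocates 5 packets for this interval
print(bucket.try_send(3))  # 3
print(bucket.try_send(4))  # 2 -- only 2 tokens remain
print(bucket.try_send(1))  # 0 -- allocation exhausted until next reconfiguration
```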
Let $x^d_{ij}(\tau)$ be the data-plane decision variable that represents the transmission rate assigned by the data plane to commodity $d$ on link $(i,j)$ for slot $\tau$. These are chosen to satisfy
$$x^d_{ij}(t,T) = \sum_{\tau=t}^{t+T-1} x^d_{ij}(\tau) \quad \text{for all } i \in \mathcal{N}, j \in \mathcal{H}_i, t \in \mathcal{T}, \tag{5.1}$$
$$\sum_{d\in\mathcal{D}_{ij}} x^d_{ij}(\tau) \le c_{ij} \quad \text{for all } i \in \mathcal{N}, j \in \mathcal{H}_i, \tau \in \mathbb{Z}^+. \tag{5.2}$$

5.1.4 Queues

Packets are queued at each switch according to their commodity. Let $Q^d_i(t)$ be the number of commodity-$d$ packets queued at switch $i$ on slot $t$. The value $Q^d_i(t)$ is also called the commodity-$d$ backlog and satisfies:
$$Q^d_i(t+1) \le \Big[ Q^d_i(t) - \sum_{j\in\mathcal{H}^d_i} x^d_{ij}(t) - b^d_i \Big]^+ + \sum_{j\in\mathcal{P}^d_i} x^d_{ji}(t) + a^d_i(t) \quad \text{for all } i \in \mathcal{N}, d \in \mathcal{D}, \tag{5.3}$$
where $[x]^+ = \max[0,x]$. Note that $\sum_{j\in\mathcal{H}^d_i} x^d_{ij}(t)$ denotes the output transmission rate to next-hop switches, and $\sum_{j\in\mathcal{P}^d_i} x^d_{ji}(t)$ denotes the receiving transmission rate from previous-hop switches. Expression (5.3) is an inequality rather than an equality because the actual amount of new endogenous arrivals on slot $t$ may be less than $\sum_{j\in\mathcal{P}^d_i} x^d_{ji}(t)$ if previous switches $j$ do not have enough commodity-$d$ backlog to fill the assigned transmission rate $x^d_{ji}(t)$. It can be shown that the backlog at time $t \in \mathcal{T}$ satisfies, for every $i \in \mathcal{N}, d \in \mathcal{D}$:
$$Q^d_i(t+T) \le \Big[ Q^d_i(t) - \sum_{j\in\mathcal{H}^d_i} x^d_{ij}(t,T) - T b^d_i \Big]^+ + \sum_{j\in\mathcal{P}^d_i} x^d_{ji}(t,T) + \sum_{\tau=t}^{t+T-1} a^d_i(\tau). \tag{5.4}$$
Note that, while a common queue for each commodity is not available in practical switches, it can be heuristically implemented by the queues in an SDN switch, as in Section 5.3.

5.1.5 Stability and assumption

Definition 1 (Queue Stability [Nee10]) A queue with backlog $\{Z(t) \ge 0 : t \in \mathbb{Z}^+\}$ is strongly stable if
$$\limsup_{t\to\infty} \frac{1}{t} \sum_{\tau=0}^{t-1} \mathbb{E}[Z(\tau)] < \infty.$$

Definition 2 (Network Stability [Nee10]) A network is strongly stable when every queue in the network is strongly stable.
The arrival and departure rates are assumed to satisfy a standard Slater condition:

Assumption 9 (Slater Condition) There exists an $\epsilon > 0$ and a randomized policy $\{x^d_{ij}(t)\}_{i\in\mathcal{N},d\in\mathcal{D},j\in\mathcal{H}^d_i}$ with
$$\mathbb{E}\Big[ \sum_{j\in\mathcal{P}^d_i} x^d_{ji}(t) + a^d_i(t) - \sum_{j\in\mathcal{H}^d_i} x^d_{ij}(t) - b^d_i \Big] < -\epsilon \quad \text{for all } i \in \mathcal{N}, d \in \mathcal{D}, t \in \mathbb{Z}^+,$$
and the randomized policy satisfies constraint (5.2). Note that Assumption 9 is stated in terms of the data-plane decision variables $x^d_{ij}(t)$ and the constraint (5.2). While the control-plane decisions are used in the algorithm and the additional constraint (5.1) is satisfied by the algorithm, those are not used in the Slater condition.

5.2 Throughput-optimal algorithm

5.2.1 The algorithm

The novel throughput-optimal algorithm runs distributively at every switch. Let $K$ be a positive real number, and let $x^d_{ij}(-T,T) = 0$ for all $i, j \in \mathcal{N}, d \in \mathcal{D}$. At every reconfiguration time $t \in \mathcal{T}$, switch $i$ executes Algorithm 5 for each link connecting to a next-hop switch $j$, for $i \in \mathcal{N}, j \in \mathcal{H}_i$. Recall that $x^d_{ij}(t,T) = 0$ for $d \in \mathcal{D}\setminus\mathcal{D}_{ij}$. The function $\mathrm{round}[z]$ rounds the real-valued $z$ to its closest integer.

// At time $t \in \mathcal{T}$, link from switch $i$ to switch $j$
$y^d_{ij}(t) \leftarrow Q^d_i(t) - Q^d_j(t) + x^d_{ij}(t-T,T)$ for $d \in \mathcal{D}_{ij}$
$k_{ij}(t) \leftarrow \max\big[1, \min\big[K, \tfrac{1}{T c_{ij}} \sum_{d\in\mathcal{D}_{ij}} [y^d_{ij}(t)]^+\big]\big]$
if $\sum_{d\in\mathcal{D}_{ij}} \mathrm{round}\big[ [y^d_{ij}(t)]^+ / k_{ij}(t) \big] \le T c_{ij}$ then
  $x^d_{ij}(t,T) \leftarrow \mathrm{round}\big[ [y^d_{ij}(t)]^+ / k_{ij}(t) \big]$ for $d \in \mathcal{D}_{ij}$
  return $\{x^d_{ij}(t,T)\}_{d\in\mathcal{D}_{ij}}$
else
  return result of Algorithm 6
end if
Algorithm 5: Throughput-optimal rate allocation

// At time $t \in \mathcal{T}$, link from switch $i$ to switch $j$
$y^d_{ij}(t) \leftarrow Q^d_i(t) - Q^d_j(t) + x^d_{ij}(t-T,T)$ for $d \in \mathcal{D}_{ij}$
$v^d_{ij} \leftarrow 0$ for $d \in \mathcal{D}_{ij}$
for $n = 1$ to $T c_{ij}$ do
  $d_n \leftarrow \arg\max_{e\in\mathcal{D}_{ij}} \big[ y^e_{ij}(t) - v^e_{ij} k_{ij}(t) \big]$
  $v^{d_n}_{ij} \leftarrow v^{d_n}_{ij} + 1$
end for
$x^d_{ij}(t,T) \leftarrow v^d_{ij}$ for $d \in \mathcal{D}_{ij}$
return $\{x^d_{ij}(t,T)\}_{d\in\mathcal{D}_{ij}}$
Algorithm 6: Packet-filling algorithm (unaccelerated version)

5.2.2 Intuitions

Algorithm 5 solves problem (5.5) with a value of $k_{ij}(t)$ that depends on local queue information and previous decisions.
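The two pseudocode listings above can be transcribed almost directly into executable form. The sketch below (variable names are ours; Python's built-in `round` is used in place of the paper's nearest-integer rounding) computes the requests $y$, the scaling factor $k \in [1,K]$, and falls back to the packet-filling loop when the rounded requests exceed $T c_{ij}$:

```python
def allocate_rates(Q_i, Q_j, x_prev, T, c_ij, K):
    """Algorithm 5 sketch: per-link rate allocation for commodities d = 0..D-1.

    Q_i[d], Q_j[d] : local and next-hop backlogs
    x_prev[d]      : previous interval's allocation x(t-T, T)
    Returns a list x with sum(x) <= T*c_ij.
    """
    D = len(Q_i)
    y = [Q_i[d] - Q_j[d] + x_prev[d] for d in range(D)]   # rate requests
    total_pos = sum(max(0, yd) for yd in y)
    k = max(1.0, min(K, total_pos / (T * c_ij)))          # clip k into [1, K]
    x = [round(max(0, yd) / k) for yd in y]
    if sum(x) <= T * c_ij:
        return x                                          # WFQ-style rounding
    return packet_fill(y, k, T * c_ij)                    # Algorithm 6 fallback

def packet_fill(y, k, budget):
    """Algorithm 6 sketch: hand out `budget` rate units one at a time to the
    commodity with the largest unfulfilled level y - v*k."""
    v = [0] * len(y)
    for _ in range(int(budget)):
        d = max(range(len(y)), key=lambda e: y[e] - v[e] * k)
        v[d] += 1
    return v

# Example: requests fit within the scaled budget, so WFQ-style rounding is used.
print(allocate_rates(Q_i=[10, 4], Q_j=[2, 1], x_prev=[0, 0], T=10, c_ij=1, K=10))
```

Here the requests are $y = [8, 3]$, so $k = 1.1$ and the rounded allocation $[7, 3]$ exactly fills the budget $T c_{ij} = 10$.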
This $k_{ij}(t)$ is deliberately introduced so that a solution of problem (5.5) imitates weighted fair queueing, which provides fairness and low latency in practice [DKS89, PG93].

$$\begin{aligned}
\text{Minimize} \quad & \sum_{d\in\mathcal{D}_{ij}} \Big\{ x^d_{ij}(t,T)\big[Q^d_j(t) - Q^d_i(t)\big] + \frac{k_{ij}(t)}{2}\Big[x^d_{ij}(t,T) - \frac{x^d_{ij}(t-T,T)}{k_{ij}(t)}\Big]^2 \Big\} \\
\text{Subject to} \quad & \sum_{d\in\mathcal{D}_{ij}} x^d_{ij}(t,T) \le T c_{ij} \qquad (5.5)\\
& x^d_{ij}(t,T) \in \mathbb{Z}^+ \quad \text{for all } d \in \mathcal{D}_{ij}
\end{aligned}$$

In Algorithm 5, $y^d_{ij}(t)$ represents a request for transmission rate of commodity $d$. It also indicates how well the previous rates were allocated. Under-allocation of $x^d_{ij}(t-T,T)$ increases the queue backlog $Q^d_i(t)$, which tends to increase the request $y^d_{ij}(t)$. It is easy to see that, if the queue backlogs $Q^d_i(t)$ and $Q^d_j(t)$ are about the same, then the request is about the same as its previous value. This behavior is smoother than MaxWeight. The requests are fulfilled in two situations. i) When the total requests are roughly within $K T c_{ij}$, i.e., $\sum_{d\in\mathcal{D}_{ij}} \mathrm{round}\big[[y^d_{ij}(t)]^+ / k_{ij}(t)\big] \le T c_{ij}$, the requests are fulfilled in WFQ fashion, which can be seen by considering $k_{ij}(t) = \frac{1}{T c_{ij}} \sum_{d\in\mathcal{D}_{ij}} [y^d_{ij}(t)]^+$ and
$$x^d_{ij}(t,T) = \mathrm{round}\big[[y^d_{ij}(t)]^+ / k_{ij}(t)\big] = \mathrm{round}\Bigg[ \frac{[y^d_{ij}(t)]^+}{\sum_{e\in\mathcal{D}_{ij}} [y^e_{ij}(t)]^+} \, T c_{ij} \Bigg]. \tag{5.6}$$

Figure 5.6: The packet-filling Algorithm 6 iteratively fulfills the requests. An iteration number is indicated in a gray box. In this example, $T c_{ij} = 9$ and the first iteration (the plot on the left) allocates rate to commodity 4. The algorithm allocates rates 2, 0, 3, 4 to commodities 1, 2, 3, 4.

ii) The other is an extreme situation for the stability analysis, which occurs when a network operates near its capacity; this may not be the case in practice due to TCP congestion control. This case is solved by Algorithm 6, as illustrated by Figure 5.6. Note that Algorithm 6 can be accelerated by fulfilling multiple requests per iteration. For example, the requests in Figure 5.6 can be fulfilled in three iterations.
It is easy to see the fairness introduced by $k_{ij}(t)$. Without $k_{ij}(t)$, i.e., when $k_{ij}(t)$ is always 1, the allocation in Figure 5.6 would be 0, 0, 2, 7 for commodities 1 to 4, which may cause a fairness issue with TCP flows [SM16b].

5.2.3 Correctness of Algorithm 5

This subsection shows that Algorithm 5 returns an optimal solution of problem (5.5). To simplify notation in this section, the time indices of variables and constants in problem (5.5) are omitted. Let $z^d_{ij}$ denote $x^d_{ij}(t-T,T)$.

When commodity $d$ is allocated an integer service value $v^d_{ij}$, define its contribution to the cost function of problem (5.5) as
$$g^d_{ij}(v^d_{ij}) = v^d_{ij}\big[Q^d_j - Q^d_i\big] + \frac{k_{ij}}{2}\Big[v^d_{ij} - \frac{z^d_{ij}}{k_{ij}}\Big]^2.$$
The cost difference from getting another unit of service allocation is
$$g^d_{ij}(v^d_{ij}+1) - g^d_{ij}(v^d_{ij}) = -\Big[Q^d_i - Q^d_j + z^d_{ij} - \frac{k_{ij}}{2} - k_{ij} v^d_{ij}\Big]. \tag{5.7}$$
Since the cost function in problem (5.5) is minimized, commodity $d$ only accepts a transmission allocation if $g^d_{ij}(v^d_{ij}+1) \le g^d_{ij}(v^d_{ij})$. The cost difference in (5.7) is monotonically increasing in $v^d_{ij}$. Therefore, commodity $d$ receives a transmission allocation of at most
$$x^{d(\max)}_{ij} = \min\big\{ v \in \mathbb{Z}^+ : g^d_{ij}(v+1) - g^d_{ij}(v) > 0 \big\} \tag{5.8}$$
$$= \min\Big\{ v \in \mathbb{Z}^+ : \frac{Q^d_i - Q^d_j + z^d_{ij}}{k_{ij}} - 0.5 < v \Big\} = \mathrm{round}\big[ [Q^d_i - Q^d_j + z^d_{ij}]^+ / k_{ij} \big]. \tag{5.9}$$

Lemma 33 When $k_{ij} > 0$ and $\sum_{d\in\mathcal{D}_{ij}} x^{d(\max)}_{ij} \le T c_{ij}$, the optimal solution of problem (5.5) is $x^d_{ij} = x^{d(\max)}_{ij}$ for $d \in \mathcal{D}_{ij}$.

Proof: For any $v^d_{ij} \in \{0, 1, \ldots, x^{d(\max)}_{ij}\}, d \in \mathcal{D}_{ij}$, the hypothesis implies that $\sum_{d\in\mathcal{D}_{ij}} v^d_{ij} \le \sum_{d\in\mathcal{D}_{ij}} x^{d(\max)}_{ij} \le T c_{ij}$. So any chosen $v^d_{ij}$ in $\{0, 1, \ldots, x^{d(\max)}_{ij}\}$ leads to a feasible solution of problem (5.5). Note that the objective function of problem (5.5) is separable, $\sum_{d\in\mathcal{D}_{ij}} g^d_{ij}(v^d_{ij})$, and is minimized. The definition of $x^{d(\max)}_{ij}$ in (5.8) implies $g^d_{ij}(v) > g^d_{ij}(x^{d(\max)}_{ij})$ for any $v > x^{d(\max)}_{ij}$, so any $v > x^{d(\max)}_{ij}$ is not optimal.
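The closed form in (5.9) can be sanity-checked numerically against the defining minimum in (5.8). The short script below (our own, for illustration; it writes the per-commodity cost with $\Delta Q = Q^d_i - Q^d_j$) compares the two over a grid of parameters chosen to avoid rounding ties:

```python
def g(v, dQ, z, k):
    # Per-commodity cost g(v) = v*(Q_j - Q_i) + (k/2)*(v - z/k)^2,
    # written with dQ = Q_i - Q_j.
    return -dQ * v + (k / 2.0) * (v - z / k) ** 2

def x_max_by_definition(dQ, z, k, v_cap=100):
    # Smallest v in Z+ with g(v+1) - g(v) > 0, as in (5.8).
    for v in range(v_cap):
        if g(v + 1, dQ, z, k) - g(v, dQ, z, k) > 0:
            return v
    return v_cap

def x_max_closed_form(dQ, z, k):
    # round([Q_i - Q_j + z]^+ / k), as in (5.9).
    return round(max(0.0, dQ + z) / k)

for dQ in [-3, 0, 2, 7]:
    for z in [0, 1, 4]:
        for k in [1.0, 1.3, 1.7]:
            assert x_max_by_definition(dQ, z, k) == x_max_closed_form(dQ, z, k)
print("closed form matches definition")
```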
Thus, i) if $x^{d(\max)}_{ij} = 0$, then $x^d_{ij} = 0 = x^{d(\max)}_{ij}$ minimizes problem (5.5) with respect to commodity $d$. ii) If $x^{d(\max)}_{ij} > 0$, the definition of $x^{d(\max)}_{ij}$ in (5.8) implies $g^d_{ij}(x^{d(\max)}_{ij}) \le g^d_{ij}(v)$ for any $v \in \{0, 1, \ldots, x^{d(\max)}_{ij} - 1\}$, so $x^d_{ij} = x^{d(\max)}_{ij}$ minimizes the problem with respect to commodity $d$.

Lemma 33 implies that Algorithm 5 solves problem (5.5) when the first (if-)condition is met. The other case of Algorithm 5 can be shown by the following property. When commodity $d$ is allocated transmission rate $v^d_{ij}$, define the unfulfilled level of commodity $d$'s request as
$$l^d_{ij}(v^d_{ij}) = Q^d_i - Q^d_j + z^d_{ij} - k_{ij} v^d_{ij}. \tag{5.10}$$

Lemma 34 When $k_{ij} > 0$, for any commodities $d, e \in \mathcal{D}_{ij}$ whose allocated rates are respectively $v^d_{ij}$ and $v^e_{ij}$, the following holds:
i) If $l^d_{ij}(v^d_{ij}) = l^e_{ij}(v^e_{ij})$, then $g^d_{ij}(v^d_{ij}+1) + g^e_{ij}(v^e_{ij}) = g^d_{ij}(v^d_{ij}) + g^e_{ij}(v^e_{ij}+1)$.
ii) If $l^d_{ij}(v^d_{ij}) > l^e_{ij}(v^e_{ij})$, then $g^d_{ij}(v^d_{ij}+1) + g^e_{ij}(v^e_{ij}) < g^d_{ij}(v^d_{ij}) + g^e_{ij}(v^e_{ij}+1)$.

Proof: It holds from (5.7) and (5.10) that
$$g^d_{ij}(v^d_{ij}+1) - g^d_{ij}(v^d_{ij}) - g^e_{ij}(v^e_{ij}+1) + g^e_{ij}(v^e_{ij}) = -\Big[l^d_{ij}(v^d_{ij}) - \frac{k_{ij}}{2}\Big] + \Big[l^e_{ij}(v^e_{ij}) - \frac{k_{ij}}{2}\Big] = -l^d_{ij}(v^d_{ij}) + l^e_{ij}(v^e_{ij}). \tag{5.11}$$
In case (i), $l^d_{ij}(v^d_{ij}) = l^e_{ij}(v^e_{ij})$ implies that $g^d_{ij}(v^d_{ij}+1) - g^d_{ij}(v^d_{ij}) - g^e_{ij}(v^e_{ij}+1) + g^e_{ij}(v^e_{ij}) = 0$, which proves (i). Case (ii) can be proven similarly by substituting $-l^d_{ij}(v^d_{ij}) + l^e_{ij}(v^e_{ij}) < 0$ into equation (5.11) and rearranging terms.

Lemma 34 implies that allocating rate to the commodity with the highest unfulfilled level reduces the total objective the most. Specifically, let $v^d_{ij}$ be the current rate allocation of commodity $d$.
For $d^* = \arg\max_{d\in\mathcal{D}_{ij}} l^d_{ij}(v^d_{ij})$, Lemma 34 implies that
$$g^{d^*}_{ij}(v^{d^*}_{ij}+1) + \sum_{d\in\mathcal{D}_{ij}\setminus\{d^*\}} g^d_{ij}(v^d_{ij}) \;\le\; g^e_{ij}(v^e_{ij}+1) + \sum_{d\in\mathcal{D}_{ij}\setminus\{e\}} g^d_{ij}(v^d_{ij}) \quad \text{for all } e \in \mathcal{D}_{ij}.$$
The above property ensures that the iterative allocation in Algorithm 6 greedily optimizes problem (5.5) when the event $\sum_{d\in\mathcal{D}_{ij}} \mathrm{round}\big[[y^d_{ij}(t)]^+ / k_{ij}(t)\big] > T c_{ij}$ occurs. It can be proven by contradiction that commodity $d$ gets at most $x^{d(\max)}_{ij}$ rate for all $d \in \mathcal{D}_{ij}$. Then the algorithm always allocates the entire transmission rate, as an implication of the event. Therefore, Algorithm 6 returns an optimal solution of problem (5.5).

Theorem 13 Given $K > 0$, Algorithm 5 solves problem (5.5) with $k_{ij}(t) \in [1,K]$, where $k_{ij}(t)$ is defined in the algorithm.

Proof: The theorem is the consequence of Lemmas 33 and 34 and the fact that $k_{ij}(t) \in [1,K]$.

5.2.4 Stability analysis

Problem (5.5) with $k_{ij}(t) \in [1,K]$ is shown to define a class of throughput-optimal policies. Let $Q(t) = \big(Q^d_i(t)\big)_{i\in\mathcal{N},d\in\mathcal{D}}$ be a vector of all backlogs at time $t$. Define $\|z\|_1$ as the $l_1$-norm, e.g., $\|Q(t)\|_1 = \sum_{i\in\mathcal{N}} \sum_{d\in\mathcal{D}} Q^d_i(t)$.

Theorem 14 When Assumption 9 holds, the network is strongly stable:
$$\limsup_{U\to\infty} \frac{1}{UT} \sum_{u=0}^{U-1} \sum_{\tau=0}^{T-1} \mathbb{E}\big[\|Q(uT+\tau)\|_1\big] \le \frac{G_1}{\epsilon} + G_2,$$
where $G_1 = |\mathcal{N}|\,|\mathcal{D}|\big[T^2\delta^2(|\mathcal{N}|+1)^2 + 2KT^2\delta^2|\mathcal{N}|\big]$ and $G_2 = |\mathcal{N}|\,|\mathcal{D}|\,\delta(T-1)(|\mathcal{N}|+1)/2$.

Proof: Squaring both sides of (5.4), rearranging, and bounding terms (see [Nee10] for example) leads to
$$\frac{1}{2}\big[Q^d_i(t+T)^2 - Q^d_i(t)^2\big] \le Q^d_i(t)\Big[\sum_{\tau=t}^{t+T-1} a^d_i(\tau) - Tb^d_i\Big] + Q^d_i(t)\Big[\sum_{j\in\mathcal{P}^d_i} x^d_{ji}(t,T) - \sum_{j\in\mathcal{H}^d_i} x^d_{ij}(t,T)\Big] + C^d_i,$$
where $C^d_i = \frac{T^2\delta^2}{2}\big(|\mathcal{P}^d_i| + |\mathcal{H}^d_i| + 2\big)^2$. Define a $T$-slot quadratic Lyapunov drift of queue backlogs [Nee10] as
$$\Delta(t,T) = \frac{1}{2}\big[\|Q(t+T)\|^2 - \|Q(t)\|^2\big],$$
where $\|x\|$ is the $l_2$-norm of vector $x$, i.e., $\|Q(t)\|^2 = \sum_{i\in\mathcal{N}}\sum_{d\in\mathcal{D}} Q^d_i(t)^2$.
It holds that
$$\Delta(t,T) \le \sum_{i\in\mathcal{N}}\sum_{d\in\mathcal{D}} \Big\{ C^d_i + Q^d_i(t)\Big[\sum_{\tau=t}^{t+T-1} a^d_i(\tau) - Tb^d_i\Big] \Big\} + \sum_{i\in\mathcal{N}}\sum_{d\in\mathcal{D}} Q^d_i(t)\Big[\sum_{j\in\mathcal{P}^d_i} x^d_{ji}(t,T) - \sum_{j\in\mathcal{H}^d_i} x^d_{ij}(t,T)\Big]. \tag{5.12}$$
The second line of the above expression can be rewritten as
$$\sum_{i\in\mathcal{N}}\sum_{d\in\mathcal{D}} Q^d_i(t)\Big[\sum_{j\in\mathcal{P}^d_i} x^d_{ji}(t,T) - \sum_{j\in\mathcal{H}^d_i} x^d_{ij}(t,T)\Big] = \sum_{i\in\mathcal{N}}\sum_{j\in\mathcal{H}_i}\sum_{d\in\mathcal{D}_{ij}} x^d_{ij}(t,T)\big[Q^d_j(t) - Q^d_i(t)\big], \tag{5.13}$$
using the fact that $x^d_{ij}(t,T) = 0$ for every $d \in \mathcal{D}\setminus\mathcal{D}_{ij}$. Instead of minimizing the above expression, which leads to the MaxWeight algorithm, a state-dependent proximal term
$$\frac{k_{ij}(t)}{2}\Big[x^d_{ij}(t,T) - \frac{x^d_{ij}(t-T,T)}{k_{ij}(t)}\Big]^2 \quad \text{with} \quad k_{ij}(t) = \Bigg[\frac{1}{T c_{ij}} \sum_{d\in\mathcal{D}_{ij}} \big[Q^d_i(t) - Q^d_j(t) + x^d_{ij}(t-T,T)\big]^+\Bigg]_{[1,K]}$$
is introduced, where $[x]_{[1,K]} = \max[1, \min[K,x]]$. This proximal term is non-negative and is upper bounded by $KT^2\delta^2$, so it holds from (5.12) and (5.13) that
$$\Delta(t,T) \le \sum_{i\in\mathcal{N}}\sum_{d\in\mathcal{D}} \Big\{ C^d_i + Q^d_i(t)\Big[\sum_{\tau=t}^{t+T-1} a^d_i(\tau) - Tb^d_i\Big] \Big\} + \sum_{i\in\mathcal{N}}\sum_{j\in\mathcal{H}_i}\sum_{d\in\mathcal{D}_{ij}} \Big\{ x^d_{ij}(t,T)\big[Q^d_j(t) - Q^d_i(t)\big] + \frac{k_{ij}(t)}{2}\Big[x^d_{ij}(t,T) - \frac{x^d_{ij}(t-T,T)}{k_{ij}(t)}\Big]^2 \Big\}. \tag{5.14}$$
Minimizing the right-hand side of (5.14) with respect to $\{x^d_{ij}(t,T)\}_{d\in\mathcal{D}_{ij}}$ leads to problem (5.5).
Applying the result from Algorithm 5, which solves the minimization at reconfiguration time $t \in \mathcal{T}$, yields the following bound for any other $\{\hat{x}^d_{ij}(t,T)\}_{d\in\mathcal{D}_{ij}}$ satisfying the constraints in problem (5.5):
$$\Delta(t,T) \le \sum_{i\in\mathcal{N}}\sum_{d\in\mathcal{D}}\Big\{C^d_i + Q^d_i(t)\Big[\sum_{\tau=t}^{t+T-1} a^d_i(\tau) - Tb^d_i\Big]\Big\} + \sum_{i\in\mathcal{N}}\sum_{j\in\mathcal{H}_i}\sum_{d\in\mathcal{D}_{ij}}\Big\{\hat{x}^d_{ij}(t,T)\big[Q^d_j(t) - Q^d_i(t)\big] + \frac{k_{ij}(t)}{2}\Big[\hat{x}^d_{ij}(t,T) - \frac{x^d_{ij}(t-T,T)}{k_{ij}(t)}\Big]^2\Big\}.$$
Since the proximal term is bounded, and the policy $x^{*d}_{ij}(t,T) = \sum_{\tau=t}^{t+T-1} x^{*d}_{ij}(\tau)$, constructed from the randomized policy in Assumption 9, is one of those $\{\hat{x}^d_{ij}(t,T)\}_{d\in\mathcal{D}_{ij}}$, it follows that
$$\Delta(t,T) \le \sum_{i\in\mathcal{N}}\sum_{d\in\mathcal{D}}\Big\{C^d_i + Q^d_i(t)\Big[\sum_{\tau=t}^{t+T-1} a^d_i(\tau) - Tb^d_i\Big]\Big\} + \sum_{i\in\mathcal{N}}\sum_{j\in\mathcal{H}_i}\sum_{d\in\mathcal{D}_{ij}}\Big\{x^{*d}_{ij}(t,T)\big[Q^d_j(t) - Q^d_i(t)\big] + KT^2\delta^2\Big\}.$$
Applying identity (5.13), taking expectations, and using the independence property of the randomized policy gives
$$\mathbb{E}[\Delta(t,T)] \le \sum_{i\in\mathcal{N}}\sum_{d\in\mathcal{D}}\Big\{D^d_i + \mathbb{E}\big[Q^d_i(t)\big]\sum_{\tau=t}^{t+T-1}\mathbb{E}\Big[\sum_{j\in\mathcal{P}^d_i} x^{*d}_{ji}(\tau) + a^d_i(\tau) - \sum_{j\in\mathcal{H}^d_i} x^{*d}_{ij}(\tau) - b^d_i\Big]\Big\},$$
where $D^d_i = C^d_i + KT^2\delta^2\big(|\mathcal{P}^d_i| + |\mathcal{H}^d_i|\big)$. Assumption 9 then implies, for all $t \in \mathcal{T}$,
$$\frac{1}{2T}\mathbb{E}\big[\|Q(t+T)\|^2 - \|Q(t)\|^2\big] \le G_1 - \epsilon\sum_{i\in\mathcal{N}}\sum_{d\in\mathcal{D}}\mathbb{E}\big[Q^d_i(t)\big],$$
where $G_1$ is defined in the theorem. Queue dynamic (5.3) and the upper bound $\delta$ imply that $Q^d_i(t+\tau) \le Q^d_i(t) + \tau\delta(|\mathcal{N}|+1)$ for any $\tau \in \mathbb{Z}^+$ and $i \in \mathcal{N}, d \in \mathcal{D}$. Summing over $\tau \in \{0, 1, \ldots, T-1\}$ gives $\sum_{\tau=0}^{T-1} Q^d_i(t+\tau) \le T Q^d_i(t) + T\delta(T-1)(|\mathcal{N}|+1)/2$ and
$$Q^d_i(t) \ge \frac{1}{T}\sum_{\tau=0}^{T-1} Q^d_i(t+\tau) - \delta(T-1)(|\mathcal{N}|+1)/2 \quad \text{for all } t \in \mathbb{Z}^+.$$
Summing over $i \in \mathcal{N}, d \in \mathcal{D}$ and substituting into the previous inequality yields
$$\frac{1}{2T}\mathbb{E}\big[\|Q(t+T)\|^2 - \|Q(t)\|^2\big] \le G_1 + \epsilon G_2 - \frac{\epsilon}{T}\sum_{\tau=0}^{T-1}\mathbb{E}\big[\|Q(t+\tau)\|_1\big] \quad \text{for all } t \in \mathcal{T},$$
where $G_2$ is defined in the theorem. Telescoping the summation over $t \in \{0, T, \ldots, (U-1)T\}$ gives
$$\frac{1}{2T}\mathbb{E}\big[\|Q(UT)\|^2 - \|Q(0)\|^2\big] \le (G_1 + \epsilon G_2)U - \frac{\epsilon}{T}\sum_{u=0}^{U-1}\sum_{\tau=0}^{T-1}\mathbb{E}\big[\|Q(uT+\tau)\|_1\big].$$
Rearranging terms, dividing by $\epsilon U$, and taking the limit superior as $U \to \infty$ proves the theorem.

5.3 System realization

The previous section provides an ideal allocation of decision variables. This section develops a heuristic improvement that is easier to implement on SDN switches.
5.3.1 Approximation of common queues

An SDN switch has output queues at each of its ports [ZDM+12, SM16a, Opea]. Those queues can be assigned to individual commodities. Let $Q^d_{ij}(t)$ denote the backlog of a queue for commodity $d$ at the port of switch $i$ connecting to switch $j$ at time $t$. The queue backlog $Q^d_i(t)$ in Section 5.1.4 can be approximated by $\tilde{Q}^d_i(t) = \sum_{j\in\mathcal{H}^d_i} Q^d_{ij}(t)$, for $i \in \mathcal{N}, d \in \mathcal{D}$. It can be shown that this approximation becomes exact when the port's queues have never been emptied. Note that OpenFlow [Opea] allows $2^{32}$ unique queues per port, but the availability of those queues may depend on the switch. This work encourages next-generation switches to support a large number of queues.

5.3.2 Additional packet headers

Two fields are appended to the IP header as IP options: CommodityId and QueueInfo. The CommodityId identifies the commodity of QueueInfo, which stores the rounded value of an exponential moving average of the approximated queue backlog $\tilde{Q}^d_j(t)$. A packet from switch $j$ to switch $i$ carries the queue information of one commodity, which is circularly selected from the commodities in $\mathcal{D}_{ij}$.²

²This round-robin technique is inspired by CONGA [AED+14]. The concept can also be applied to VXLAN [MDD+14].

Once a packet with the additional headers from switch $j$ arrives at switch $i$, the contained queue information is extracted and stored in a local memory, denoted by $M^d_{ij}(t)$. This is the most recent queue information for commodity $d$ on link $(i,j)$ up to time $t$, where $d$ is the value in CommodityId. The header processing can be implemented with P4 [BDG+14], DPDK [Lin], NetFPGA [LMW+07], or a custom ASIC.

5.3.3 Weighted fair queueing

Each port of switch $i$ connecting to switch $j \in \mathcal{H}_i$ is configured with weighted fair queueing. Let $r^d_{ij}(t,T)$ denote the measured number of commodity-$d$ packets transmitted from switch $i$ to switch $j$ during the interval $[t, t+T)$.
At reconfiguration time $t \in \mathcal{T}$, the weight $w^d_{ij}(t)$ for the interval $[t, t+T)$ is
$$w^d_{ij}(t) = \max\big[1,\; \tilde{Q}^d_i(t) - M^d_{ij}(t) + r^d_{ij}(t-T,T)/\gamma\big], \quad d \in \mathcal{D}_{ij},$$
and $w^d_{ij}(t) = 0$ for $d \in \mathcal{D}\setminus\mathcal{D}_{ij}$. The parameter $\gamma > 0$ is added to scale the magnitude of actual traffic to match a finite queue capacity. From these weights, the fraction $w^d_{ij}(t) / \sum_{e\in\mathcal{D}} w^e_{ij}(t)$ of the link capacity is given to commodity $d$, which corresponds to the intuition in equation (5.6). Note that the actual traffic $r^d_{ij}(t-T,T)$ is used instead of $x^d_{ij}(t-T,T)$, because it is a better approximation under TCP traffic.

5.3.4 Traffic splitting by hashing

The sending rate of a TCP connection is reduced when out-of-order packets are received at the destination. Hashing a packet to a next-hop switch based on 4 fields (source IP address, destination IP address, source port number, and destination port number) is implemented to reduce packet reordering. Packets from the same TCP connection have the same HashField, so they are hashed to the same path. Reordering does not occur if the hash rule at each switch stays the same for the entire TCP connection. The hash rule is calculated as follows. For each commodity $d \in \mathcal{D}$, define $\{s^d_{ij}(t)\}_{j\in\mathcal{H}^d_i}$ as a solution of
$$\begin{aligned}
\text{Maximize} \quad & \min_{j\in\mathcal{H}^d_i}\big\{ Q^d_{ij}(t) - r^d_{ij}(t-T,T) + s^d_{ij}(t) \big\} \\
\text{Subject to} \quad & \sum_{j\in\mathcal{H}^d_i} s^d_{ij}(t) = \sum_{j\in\mathcal{H}^d_i} r^d_{ij}(t-T,T) \qquad (5.15)\\
& s^d_{ij}(t) \in \mathbb{Z}^+ \quad \text{for all } j \in \mathcal{H}^d_i.
\end{aligned}$$
This problem can be solved in polynomial time. Let $s^d_{ij}(t) = 0$ for all $j \in \mathcal{H}^d_i$. Iteratively, $s^d_{ij}(t)$ is increased for a group of indices in $\mathrm{Argmin}_{j\in\mathcal{H}^d_i}\big[Q^d_{ij}(t) - r^d_{ij}(t-T,T) + s^d_{ij}(t)\big]$ until the equality constraint is met.³ This process keeps increasing the term $\min_{j\in\mathcal{H}^d_i}\big\{Q^d_{ij}(t) - r^d_{ij}(t-T,T) + s^d_{ij}(t)\big\}$. The intuition of problem (5.15) is that it attempts to equalize the backlog levels at all ports at the end of the interval $[t, t+T)$, using $r^d_{ij}(t-T,T)$ as an estimate of the actual transmission $r^d_{ij}(t,T)$. This attempt tries to make the approximation in Section 5.3.1 exact.
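The weight rule and the iterative solution of problem (5.15) are both easy to express in code. The sketch below (function names and the scaling symbol `gamma` are ours) computes the WFQ weights and then water-fills the measured traffic back onto the least-backlogged ports, one unit at a time as in the procedure above:

```python
def wfq_weights(Q_tilde_i, M_ij, r_prev, gamma):
    """Weights w^d = max(1, ~Q^d_i - M^d_ij + r^d/gamma) over commodities d."""
    return [max(1.0, q - m + r / gamma)
            for q, m, r in zip(Q_tilde_i, M_ij, r_prev)]

def split_traffic(Q_ports, r_prev_ports):
    """Greedy solver sketch for problem (5.15): redistribute the measured
    traffic sum(r_prev_ports), one unit at a time, onto the port whose level
    Q - r + s is currently smallest (a discrete water-filling)."""
    total = sum(r_prev_ports)
    s = [0] * len(Q_ports)
    level = [q - r for q, r in zip(Q_ports, r_prev_ports)]
    for _ in range(total):
        j = min(range(len(s)), key=lambda p: level[p] + s[p])
        s[j] += 1
    return s

# Two next-hop ports: port 0 is backlogged, port 1 has been drained.
s = split_traffic(Q_ports=[8, 2], r_prev_ports=[3, 3])
print(s)                              # most of the 6 units go to port 1
ratios = [v / sum(s) for v in s]      # splitting ratios f^d_ij of Section 5.3.4
```

In this example all 6 units land on port 1, equalizing the projected backlogs at 5 on both ports, which is exactly the max-min objective of (5.15).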
The splitting ratio of the port connecting to switch $j$ is $f^d_{ij}(t) = s^d_{ij}(t) / \sum_{k\in\mathcal{H}^d_i} s^d_{ik}(t)$ for $j \in \mathcal{H}^d_i$.

³The process can be accelerated by increasing multiple values of the $s^d_{ij}(t)$'s per iteration.

Figure 5.7: Line network with $\mathcal{N} = \{1,2,3,4\}$, $\mathcal{D} = \{1,2\}$, and $\mathcal{H}^d_1 = \{2\}$, $\mathcal{H}^d_2 = \{3\}$, $\mathcal{H}^d_3 = \{4\}$ for $d \in \mathcal{D}$

Table 5.1: Average backlogs under Algorithm 5 and MaxWeight

              Commodity 1               Commodity 2
              Algorithm 5   MaxWeight   Algorithm 5   MaxWeight
  Switch-1    14.88         2598.19     6.25          213.45
  Switch-2    8.17          1699.99     2.35          195.12
  Switch-3    7.52          700.01      2.16          199.51
  Switch-4    7.12          7.00        2.00          2.00

5.4 Simulations

5.4.1 Ideal simulation

Algorithm 5 is simulated according to the system model in Section 5.1. A switch uses a token bucket mechanism to ensure that the transmission rate per interval satisfies constraint (5.1) after $x^d_{ij}(t,T)$ is determined. The line network in Figure 5.7 is simulated with interval length $T = 100$ and constant $K = 10$. The network is simulated for $10^5$ slots. The average backlogs shown in Table 5.1 are calculated after Algorithm 5 and MaxWeight converge. The link-capacity sharing of Algorithm 5 leads to small queue backlogs. For scaling comparison, when $T = 1000$, the average backlogs of commodity 1 at switch 1 are respectively 23.94 and 16005.63 under Algorithm 5 and MaxWeight. In this simulation, the maximum value of $k_{ij}(t)$ over all $i \in \mathcal{N}, j \in \mathcal{H}_i, t \in \{0, \ldots, 10^5\}$ is $1.732 < K$.

Figure 5.8: Intra-datacenter network with $\mathcal{N} = \{1, 2, \ldots, 14\}$, $\mathcal{D} = \{1, 2, \ldots, 9\}$. Each next-hop set $\mathcal{H}^d_i$ contains next-hop switches with the shortest distance to commodity $d$, e.g., $\mathcal{H}^8_1 = \{9,10\}$, $\mathcal{H}^8_9 = \{13,14\}$, $\mathcal{H}^8_{13} = \{11,12\}$, $\mathcal{H}^8_{11} = \{8\}$, $\mathcal{H}^9_1 = \{9,10\} = \mathcal{H}^e_1$ for $e \in \{2,3,4\}$. Arrivals are $\mathbb{E}[a^d_i(t)] = 2$ and $\mathbb{E}[a^9_i(t)] = 1$ for $d, i \in \{1, \ldots, 8\}$; otherwise 0. The departure rate is $b^d_i = 20$ if commodity $d$ connects to switch $i$; otherwise 0.

The network in Figure 5.8 is simulated with $T = 100$ and $K = 10$. After $10^5$ slots, the average backlogs per queue (over all 124 queues) are 34.23 and 555.95 under Algorithm 5
and MaxWeight. The event $k_{ij}(t) = K$ occurs 93.41% of the time because the network operates near its capacity boundary. Reducing the arrivals by 12% (24%) yields 20.68% (5.49%) of the time with $k_{ij}(t) = K$. This suggests that Algorithm 6 is rarely invoked, i.e., $k_{ij}(t) < K$, when a network does not operate near its capacity boundary. In practice, TCP flows with congestion control are different from the i.i.d. arrivals, so Algorithm 6 is not included in the heuristic algorithm for simplicity of implementation.

5.4.2 Network simulator

The heuristic in-network load-balancing algorithm in Section 5.3 is simulated with OMNeT++ [Opeb]. All simulations share the following setting. The capacity of each commodity queue at a switch port is 200 packets. For the ECMP setting, a shared queue at a switch port has a buffer capacity of $200|\mathcal{D}|$ packets, where $\mathcal{D}$ is the set of commodities in the considered network. The configuration interval is $T = 1$ms, and the scaling parameter is $\gamma = 5$. The NewReno TCP from the INET Framework [INE] is adjusted for 10Gbps and 40Gbps link speeds. Every TCP flow is randomly established during [0s, 0.5s] and starts during [1s, 1.01s]. Each flow transmits 1MB of data. Flow completion time (FCT) is measured as the performance metric, which is also used in [AED+14, ZDM+12]. FCT is the duration of time to complete a flow, i.e., the time to send 1MB of data.

A network without commodity 9 in Figure 5.8 is simulated. The speeds of level-1 links and level-2 links are respectively 10Gbps and 40Gbps. 64 flows are generated from each commodity to every other commodity, and the total number of flows in the network is 3584.

Figure 5.9: The FCTs from the network in Figure 5.8 without commodity 9
The FCTs under the heuristic algorithm and ECMP are shown in Figure 5.9; their variances are $5.8 \times 10^{-4}$ and $19.5 \times 10^{-4}$. Unsurprisingly, they are comparable, since the topology is optimized for ECMP. The FCTs under the heuristic algorithm have less variation, as the distribution of flows is more balanced. Note that the tail of the FCTs is critical for interactive services [ZDM+12].

A follow-up scenario is simulated in which the link between switches 12 and 14 in Figure 5.8 fails. The FCTs of all flows are shown in Figure 5.10. The heuristic algorithm balances the flows better than ECMP. Switches 9 and 10 hash more flows to switch 13 than to switch 14, while ECMP hashes flows equally. The improvement, calculated from the longest FCT under both algorithms, is 1.3x.

Figure 5.10: The FCTs of all flows in the network in Figure 5.8 where commodity 9 is omitted and the link between switches 12 and 14 fails

Figure 5.11: A network with one always-on priority flow in each direction and 600 normal flows (300 flows with 1MB of data in each direction) between commodities 1 and 2

The network in Figure 5.11 illustrates the adaptiveness of the heuristic algorithm when some link capacity is taken away by priority flows. The FCTs of all flows are shown in Figure 5.12. The FCTs under the heuristic algorithm are more balanced compared to the FCTs under ECMP, as switches 1 and 2 hash more flows to switch 4 instead of hashing equally as in the ECMP case. Note that if the priority flows begin shortly after 1.01s, the same result is observed. The improvement, calculated from the longest FCT under both algorithms, is 1.2x. A similar trend can be observed in scenarios with short flows (10KB of data per flow) when $T = 0.1$ms and $\gamma = 1$.
Additionally, we observe a 4.4x improvement in a highly asymmetric scenario.

Figure 5.12: The FCTs of normal flows between commodities 1 and 2 in Figure 5.11

5.5 Chapter summary

This chapter showed that practical load balancing in datacenter networks can be tackled with the concept of throughput optimality. Our first algorithm is a novel variation of the MaxWeight concept that treats control-plane and data-plane timescales (useful for software-defined networking), allows link-capacity sharing during a control-plane interval, incorporates weighted-fair-queueing aspects, and comes with a proof of throughput optimality. Next, this algorithm was modified to include heuristic improvements that allow easy operation with practical switch capabilities and work gracefully with TCP flows. Ideal and OMNeT++ simulations show promising potential against existing MaxWeight and ECMP.

Chapter 6

Quality of Information Optimization in Wireless Multi-Hop Networks

In this chapter, a new drift-based algorithm for stochastic network optimization problems is developed to reduce queue occupancy. It is a prior work to the convergence analysis in Chapters 2 and 3. The results in this chapter are based in part on [SN12, SN15c].

This chapter investigates dynamic scheduling and data-format selection in a network where multiple wireless devices, such as smartphones, report information to a receiver station. The devices together act as a pervasive pool of information about the network environment. Such scenarios have been recently considered, for example, in applications of social sensing [MLF+08] and personal environment monitoring [KLJ+10, MRS+09].
Sending all information in the highest-quality format can quickly overload network resources. Thus, it is often more important to optimize the quality of information, as defined by an end user, rather than the raw number of bits that are sent. The case for quality-aware networking is made in [WS96, JC05, BKS+09]. Network management with quality-of-information awareness for wireless sensor networks is considered in [LBBL10]. More recently, the quality metrics of accuracy and credibility are considered in [BNCG+11, LTBN+12] using simplified models that do not consider the actual dynamics of a wireless network.

We extend the quality-aware format selection problem in [LTBN+12] to a dynamic network setting. We particularly focus on distributed algorithms for routing, scheduling, and format selection that jointly optimize quality of information. Specifically, we assume that random events occur over time in the network environment, and these can be sensed by one or more of the wireless devices, perhaps at different sensing qualities. At the transport layer, each device selects one of multiple reporting formats, such as a video clip at one of several resolution options, an audio clip, or a text message. Information quality depends on the selected format. For example, higher-quality formats use messages with larger bit lengths. The resulting bits are handed to the network layer at each device and must be delivered to the receiver station over possibly time-varying channels. We first consider the case where all devices transmit directly to the destination over uplink channels. Due to heterogeneous channel conditions, the delivery rates in this case may be limited. To improve performance, we next allow devices to relay their information through other devices that have more favorable connections to the destination.
An example is a single-cell wireless network with multiple smartphones and one base station, where each smartphone has 4G capability for uplink transmission and Wi-Fi capability for device-to-device relay transmission. Such a problem can be cast as a stochastic network optimization and solved using Lyapunov optimization theory. A "standard" method is to minimize a linear term in a quadratic drift-plus-penalty expression, which leads to max-weight type solutions [Nee10, GNT06]. This can be shown to yield algorithms that converge to optimal average utility with a tradeoff in average queue size. The linearization is useful for enabling decisions to be separated at each device. However, it can lead to larger queue sizes and delays.

In this work, we propose a novel method that uses a quadratic minimization for the drift-plus-penalty expression, yet still allows separability of the decisions. This results in an algorithm that maintains distributed format-selection decisions across all devices, but reduces average delay. Similar to the standard (linearized) drift-plus-penalty methods, the transmission decisions can also be made in a distributed manner under suitable physical-layer models, such as when channels are orthogonal. Thus, the contributions of this work are threefold: (i) We formulate an important quality-of-information problem for reporting information in wireless systems. This problem is of recent interest and can be used in other contexts where data-deluge issues require selectivity in the reporting of information. (ii) We extend Lyapunov optimization theory by presenting a new algorithm that uses a quadratic minimization to reduce queue sizes while maintaining separability across decisions. This new technique is general and can be used to reduce queue sizes in other Lyapunov optimization problems. (iii) The new technique leads to per-slot load balancing among outgoing links when simultaneous transmission is allowed.
The next section formulates the problem for an uplink network without relay capabilities. Section 6.2 derives the quadratic algorithm for this network, and Section 6.3 analyzes and simulates its performance. Relay capabilities are introduced and analyzed in Sections 6.4-6.7. To reduce delays, this work restricts all paths to at most 2 hops, so that data can pass through at most one relay. This 2-hop restriction is not crucial to the analysis. Indeed, the same techniques can be used to treat multi-hop routing via the backpressure methodology [GNT06, TE92], although we omit that extension for brevity.

[Figure 6.1: A network with $N$ devices as queues $Q_1(t), \ldots, Q_N(t)$ and a receiver station.]

6.1 Single-hop system model

Consider a network with $N$ wireless devices that report information to a single receiver station. Let $\mathcal{N} = \{1, \ldots, N\}$ be the set of devices. The receiver station is not part of the set $\mathcal{N}$ and can be viewed as "device 0." A network with $N$ devices is shown in Figure 6.1. The system time is slotted with fixed size slots $t \in \{0, 1, 2, \ldots\}$. Every slot, format selection decisions are made at the transport layer of each device, and scheduling decisions are made at the network layer.

6.1.1 Format selection

A new event can occur on each slot. Events are observed with different levels of quality at each device. For example, some devices may be physically closer to the event and hence can deliver higher quality. On slot $t$, each device $n \in \mathcal{N}$ selects a format $f_n(t)$ from a set of available formats $\mathcal{F} = \{0, 1, \ldots, F\}$. Format selection affects quality and data lengths of the reported information. To model this, the event on slot $t$ is described by a vector of event characteristics $(r_n^{(f)}(t), a_n^{(f)}(t))|_{n \in \mathcal{N}, f \in \mathcal{F}}$. The value $r_n^{(f)}(t)$ is a numeric reward that is earned if device $n$ uses format $f$ to report on the event that occurs on slot $t$. The value $a_n^{(f)}(t)$ is the amount of data units required for this choice.
This data is injected as arrivals to a network layer queue and must eventually be delivered to the receiver station (see Figure 6.1). Each device $n$ observes $(r_n^{(f)}(t), a_n^{(f)}(t))|_{f \in \mathcal{F}}$ at the beginning of slot $t$ and chooses a format $f_n(t)$. Define $r_n(t)$ and $a_n(t)$ as the resulting reward and data size:

$r_n(t) \triangleq r_n^{(f_n(t))}(t), \qquad a_n(t) \triangleq a_n^{(f_n(t))}(t)$

If a device $n$ does not observe the event on slot $t$ (which might occur if it is physically too far from the event), then $(r_n^{(f)}(t), a_n^{(f)}(t)) = (0, 0)$ for all formats $f \in \mathcal{F}$. If no event occurs on slot $t$, then $(r_n^{(f)}(t), a_n^{(f)}(t)) = (0, 0)$ for all $n \in \mathcal{N}$ and $f \in \mathcal{F}$. To allow a device $n$ not to report on an event, there is a blank format $0 \in \mathcal{F}$ such that $(r_n^{(0)}(t), a_n^{(0)}(t)) = (0, 0)$ for all slots $t$ and all devices $n \in \mathcal{N}$. Rewards $r_n(t)$ are assumed to be real numbers that satisfy $0 \le r_n(t) \le r_n^{(\max)}$ for all $t$, where $r_n^{(\max)}$ is a finite maximum. Data sizes $a_n(t)$ are non-negative integers that satisfy $0 \le a_n(t) \le a_n^{(\max)}$ for all $t$, where $a_n^{(\max)}$ is a finite maximum. The vectors $(r_n^{(f)}(t), a_n^{(f)}(t))|_{n \in \mathcal{N}, f \in \mathcal{F}}$ are independent and identically distributed (i.i.d.) over slots $t$, and have a joint probability distribution over devices $n$ and formats $f$ that is arbitrary (subject to the above boundedness assumptions). This distribution is not necessarily known. A simple example is when there is no time-variation in the format selection process, so that the reward and bit length options $(r_n^{(f)}, a_n^{(f)})$ are the same for all time. This holds when each particular format always yields the same reward and has the same bit length. The model also treats cases when these values can change from slot to slot. This holds, for example, in a video streaming application where format selection options correspond to different video compression techniques. These can have variable bit outputs depending on the content of the current video frame. For simplicity, this chapter assumes the random processes are i.i.d. over slots.
This assumption is not crucial to the analysis, and the results can be extended to treat non-i.i.d. scenarios using techniques in [Nee10].

6.1.2 Uplink scheduling

At each device $n \in \mathcal{N}$, the $a_n(t)$ units of data generated by format selection are put into input queue $Q_n(t)$. Each device communicates directly to the receiver station through (direct) uplink transmission as shown in Figure 6.1. The amount that can be transmitted by uplink transmission at device $n$ is denoted by $\mu_n(t)$. The $\mu_n(t)$ values are determined by the current channel states and the current transmission decisions. Specifically, define $\mu(t) \triangleq (\mu_1(t), \ldots, \mu_N(t))$ as the transmission vector, and define $\omega(t)$ as a vector of current channel states in the network at time $t$. It is assumed that $\mu(t)$ is chosen every slot $t$ within a set $\mathcal{U}_{\omega(t)}$ that depends on the observed $\omega(t)$. The sets $\mathcal{U}_{\omega(t)}$ are assumed to restrict transmissions to non-negative and bounded rates, so that $0 \le \mu_n(t) \le \mu_n^{(\max)}$ for all $n \in \mathcal{N}$ and all $t$. Additional structure of the sets $\mathcal{U}_{\omega(t)}$ can be imposed to model the physical transmission capabilities of the network. A special case is when all uplink channels are orthogonal, and the set $\mathcal{U}_{\omega(t)}$ can be decomposed into a set product of individual options for each uplink channel:

$\mathcal{U}_{\omega(t)} \triangleq \mathcal{U}_{1,\omega_1(t)} \times \mathcal{U}_{2,\omega_2(t)} \times \cdots \times \mathcal{U}_{N,\omega_N(t)}$

where, in this case, $\omega_n(t)$ represents the component of the current channel state vector $\omega(t)$ associated with channel $n$. The dynamics of input queue $Q_n(t)$ are:

$Q_n(t+1) = \max[Q_n(t) - \mu_n(t), 0] + a_n(t), \quad (6.1)$

which assumes that newly arriving data $a_n(t)$ cannot be transmitted on slot $t$. As a minor technical detail that is useful later, the $\max[\cdot, 0]$ operator above allows $\mu_n(t)$ to be greater than $Q_n(t)$.

6.1.3 Stochastic network optimization

Here we define the problem of maximizing time-averaged quality of information subject to queue stability.
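The one-slot queue update in (6.1) is simple enough to sketch in code. The following is an illustrative sketch only (not part of the thesis implementation); the arrival and service values in the example trajectory are made-up numbers.

```python
# Sketch of the input-queue dynamics of Eq. (6.1):
#   Q(t+1) = max[Q(t) - mu(t), 0] + a(t)
# Newly arriving data a(t) cannot be served in slot t, and the max[., 0]
# allows the chosen rate mu(t) to exceed the current backlog Q(t).

def queue_update(Q, mu, a):
    """One-slot update of a single device's input queue."""
    return max(Q - mu, 0) + a

# Example trajectory for one device over three slots (hypothetical values).
Q = 0
for mu, a in [(3, 5), (4, 0), (10, 2)]:
    Q = queue_update(Q, mu, a)
print(Q)  # final backlog after three slots
```

Note that over-allocating service (the third slot offers rate 10 to a backlog of 1) is harmless here, which is exactly the technical detail the max operator is meant to capture.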
We use the following stability definition [GNT06]:

Definition 3. Queue $\{X(t) : t \in \{0, 1, 2, \ldots\}\}$ is strongly stable if

$\limsup_{t \to \infty} \frac{1}{t} \sum_{\tau=0}^{t-1} \mathbb{E}[X(\tau)] < \infty$

Intuitively, this means that a queue is strongly stable if its average backlog is finite. A network is defined to be strongly stable if all of its queues are strongly stable. Define $y_0(t)$ as the total quality of information from format selection on slot $t$:

$y_0(t) \triangleq \sum_{n \in \mathcal{N}} r_n(t)$

Define the upper bound $y_0^{(\max)} \triangleq \sum_{n \in \mathcal{N}} r_n^{(\max)}$. The time-averaged total information quality is

$\bar{y}_0 \triangleq \liminf_{t \to \infty} \frac{1}{t} \sum_{\tau=0}^{t-1} \mathbb{E}[y_0(\tau)].$

The objective is to solve:

Maximize $\bar{y}_0$  (6.2)
Subject to: Network is strongly stable
            $f_n(t) \in \mathcal{F}$ for all $t$, and all $n \in \mathcal{N}$
            $\mu(t) \in \mathcal{U}_{\omega(t)}$ for all $t$

This problem is always feasible because stability is trivially achieved if all devices always select the blank format.

6.2 Dynamic algorithm of the uplink network

This section derives a novel quadratic policy to solve problem (6.2).

6.2.1 Lyapunov optimization

Let $Q(t) = (Q_1(t), \ldots, Q_N(t))$ represent the vector of all queues in the system. Define a quadratic Lyapunov function:

$L(t) \triangleq \frac{1}{2} \sum_{n \in \mathcal{N}} Q_n(t)^2$

Define $L(t+1) - L(t)$ as the Lyapunov drift. In order to maximize $\bar{y}_0$ in (6.2), the drift-plus-penalty function $L(t+1) - L(t) - V y_0(t)$ is considered, where $V \ge 0$ is a constant that determines a tradeoff between queue size and proximity to optimality. (The minus sign in front of $V y_0(t)$ comes from the fact that the quality of information is viewed as a negative penalty.) Later, this is used to prove stability. Intuitively, when queue lengths grow large beyond certain values, the drift becomes negative and the system is stable because the negative drift tends to reduce the total queue length.
From (6.1) and the definition of $y_0(t)$, the drift-plus-penalty expression is given by:

$L(t+1) - L(t) - V y_0(t) = \frac{1}{2} \sum_{n \in \mathcal{N}} \left[ Q_n(t+1)^2 - Q_n(t)^2 - 2 V r_n(t) \right]$
$= \frac{1}{2} \sum_{n \in \mathcal{N}} \left[ (\max[Q_n(t) - \mu_n(t), 0] + a_n(t))^2 - 2 V r_n(t) \right] - \frac{1}{2} \sum_{n \in \mathcal{N}} Q_n(t)^2 \quad (6.3)$

Ideally, every slot $t$ one would like to observe the current queue values $Q(t)$ and select decision variables $r_n(t)$, $a_n(t)$, $\mu_n(t)$ to minimize the above expression over all possible decision options for that slot. Since the $Q_n(t)$ values are fixed in this decision, this amounts to minimizing the first summation term in the expression above. However, the quadratic nature of the above expression couples all decision variables. Thus, such an algorithm would not allow for format selection decisions to be distributed across devices, and would not allow format selection and transmission scheduling to be separated. A standard simplification seeks to minimize the following linearized approximation of the above expression [Nee10]:

$\sum_{n \in \mathcal{N}} \left[ Q_n(t)(a_n(t) - \mu_n(t)) - V r_n(t) \right]. \quad (6.4)$

This expression is a separable sum over individual devices. Minimization of this expression every slot results in the drift-plus-penalty algorithm [Nee10]. This allows a clean separation of format selection and transmission decisions, and allows format selection to be distributed across all devices. It is known that using this linearized approximation does not hinder asymptotic stability or time average quality. However, intuitively, one expects that something is lost by only using the linear approximation. Often, this loss translates into larger queue sizes. The next section develops a novel alternative method that preserves the quadratic nature of the minimization while maintaining a clean separation across decision variables.

6.2.2 The separable quadratic policy

Lemma 35. Suppose $a$ and $\mu$ are non-negative constants such that $a \le a^{(\max)}$ and $\mu \le \mu^{(\max)}$.
Then for any $x \ge 0$:

$(\max[x - \mu, 0] + a)^2 - x^2 \le (x - \mu)^2 + (x + a)^2 - 2 x^2 \quad (6.5)$

Proof: Note that $\max[x - \mu, 0]^2 \le (x - \mu)^2$. Thus:

$(\max[x - \mu, 0] + a)^2 - x^2 \le (x - \mu)^2 + a^2 + 2 a \max[x - \mu, 0] - x^2$
$\le (x - \mu)^2 + a^2 + 2 a x - x^2 = (x - \mu)^2 + (x + a)^2 - 2 x^2.$

Using the result of Lemma 35 in (6.3) gives:

$L(t+1) - L(t) - V y_0(t) \le \frac{1}{2} \sum_{n \in \mathcal{N}} \left[ (Q_n(t) - \mu_n(t))^2 + (Q_n(t) + a_n(t))^2 \right] - \frac{1}{2} \sum_{n \in \mathcal{N}} 2 V r_n(t) - \sum_{n \in \mathcal{N}} Q_n(t)^2 \quad (6.6)$

Our novel separable quadratic policy observes the queue values $Q(t)$ every slot $t$ and makes format selection and transmission decisions to minimize the right-hand-side of the expression (6.6). That is, $\mu(t)$ and $f_n(t)$ decisions are made to solve the following optimization problem:

Minimize $\sum_{n \in \mathcal{N}} \left[ (Q_n(t) - \mu_n(t))^2 + (Q_n(t) + a_n(t))^2 - 2 V r_n(t) \right]$  (6.7)
Subject to: $\mu(t) \in \mathcal{U}_{\omega(t)}$
            $f_n(t) \in \mathcal{F}$ and $a_n(t) \triangleq a_n^{(f_n(t))}(t)$, $r_n(t) \triangleq r_n^{(f_n(t))}(t)$ for all $n \in \mathcal{N}$

where weights $Q_n(t)$ act as given constants in the above optimization problem. The queues are then updated via (6.1) and the procedure is repeated for the next slot. Intuitively, every time slot, the bound (6.6) is minimized by the quadratic policy, so its value is smaller than that resulting from applying any other policy. This will become clear in Section 6.3.

6.2.3 Separability

The control algorithm (6.7) can be simplified by exploiting the separable structure as follows: Every slot $t$, each device $n \in \mathcal{N}$ observes input queue $Q_n(t)$ and options $(r_n^{(f)}(t), a_n^{(f)}(t))|_{f \in \mathcal{F}}$. It then chooses a format $f_n(t)$ according to the admission-control problem:

Minimize $[Q_n(t) + a_n^{(f_n(t))}(t)]^2 - 2 V r_n^{(f_n(t))}(t)$  (6.8)
Subject to: $f_n(t) \in \mathcal{F}$

This is solved easily by comparing each option $f_n(t) \in \mathcal{F}$. Intuitively, a large value of $V$ allows more candidate formats to be selected. As the algorithm evolves, queue $Q_n(t)$ will enforce the system to select an optimal format at a particular time $t$. Note that these decisions are distributed across users and are separated from the uplink transmission rate decisions.
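The per-device comparison in the admission-control problem (6.8) can be sketched in a few lines. This is an illustrative sketch only; the format table below is hypothetical (reward, data-size) pairs, not values from the thesis.

```python
# Sketch of the admission-control decision of Eq. (6.8): each device picks
# the format f minimizing (Q + a_f)^2 - 2*V*r_f by direct comparison.

def select_format(Q, formats, V):
    """formats: list of (reward r_f, data size a_f); index 0 is the blank format."""
    return min(range(len(formats)),
               key=lambda f: (Q + formats[f][1]) ** 2 - 2 * V * formats[f][0])

# Hypothetical format options: (reward, data units), blank format first.
formats = [(0, 0), (15, 10), (45, 40)]

print(select_format(0, formats, V=100))    # empty queue, large V: rich format
print(select_format(10**6, formats, V=100))  # huge backlog: blank format 0
```

The two calls illustrate the intuition stated above: a large $V$ admits high-reward formats when the queue is small, while a large backlog $Q_n(t)$ forces the blank format.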
The uplink-allocation problem to determine transmission rates $\mu(t)$ is:

Minimize $\sum_{n \in \mathcal{N}} [Q_n(t) - \mu_n(t)]^2$  (6.9)
Subject to: $\mu(t) \in \mathcal{U}_{\omega(t)}$.

This can be solved at the receiver station. Intuitively, the system minimizes a sum of the squares of the remaining queue lengths, so a longer queue is treated with higher priority. When the difference between queues is small, those queues are treated fairly equally. If all uplink channels are orthogonal, the problem can be decomposed further so that each device $n$ solves:

Minimize $[Q_n(t) - \mu_n(t)]^2$
Subject to: $\mu_n(t) \in \mathcal{U}_{n,\omega_n(t)}$

where $\mathcal{U}_{n,\omega_n(t)}$ is a feasible set of $\mu_n(t)$ options. This chooses the uplink transmission rate in $\mathcal{U}_{n,\omega_n(t)}$ that is closest to $Q_n(t)$. The algorithm is summarized in the algorithms below.

Device side
for device $n \in \mathcal{N}$ do
    Observe $Q_n(t)$ and $(r_n^{(f)}(t), a_n^{(f)}(t))|_{f \in \mathcal{F}}$
    Select format $f_n(t)$ according to (6.8)
end for
Algorithm 7: Distributed format selection

Receiver-station side
for receiver station 0 do
    Observe $Q(t)$ and $\mathcal{U}_{\omega(t)}$
    Signal devices $n \in \mathcal{N}$ to make uplink transmissions $\mu(t)$ according to (6.9)
end for
Algorithm 8: Uplink resource allocation

To compare this approach to the standard drift-plus-penalty technique, consider the following example. Suppose the transmission rate set is given by:

$\mathcal{U}_{\omega(t)} = \Big\{ \mu \ge 0 : \sum_{n \in \mathcal{N}} \frac{\mu_n}{\mu_n^{(best)}(\omega_n(t))} \le 1 \Big\}$

where $\mu_n^{(best)}(\omega_n(t))$ is the full-channel rate of user $n$ under the current channel state. This allows for a division of either time or frequency resources over one slot, so that a fraction of the channel capacity can be devoted to one or more users simultaneously. The standard drift-plus-penalty approach of minimizing (6.4) over this set results in a max-weight decision that allocates the full channel to a single user $n^*$ at the full rate $\mu_{n^*}^{(best)}(\omega_{n^*}(t))$. This is an inefficient use of resources if the queue backlog $Q_{n^*}(t)$ of device $n^*$ is less than this rate. In contrast, our separable quadratic policy never over-allocates resources: In this example it ensures that $\mu_n(t) \le Q_n(t)$ for all slots $t$.
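For orthogonal channels, the per-device form of the uplink-allocation problem (6.9) reduces to a closest-rate search. The sketch below assumes a hypothetical finite rate set; it is illustrative, not the thesis implementation.

```python
# Sketch of the per-device uplink decision under orthogonal channels:
# pick the feasible rate closest to the backlog Q_n(t), i.e. minimize
# (Q - mu)^2 over the rate set (Eq. 6.9, decomposed form).

def select_rate(Q, rates):
    """rates: finite feasible set of uplink rates for the current channel state."""
    return min(rates, key=lambda mu: (Q - mu) ** 2)

rates = [0, 5, 10]            # hypothetical feasible rates for one channel state
print(select_rate(7, rates))  # rate closest to a backlog of 7
print(select_rate(0, rates))  # empty queue: no service is allocated
```

This matches the intuition in the text: the policy tracks each backlog rather than greedily grabbing the largest available rate, which is how over-allocation is avoided.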
Also, it often enables queues to be emptied more quickly by allowing multiple devices to transmit simultaneously.

6.3 Performance and simulation of the uplink network

Compare the separable quadratic policy with any other policy. Let $(f_1(\tau), \ldots, f_N(\tau))$ and $\mu(\tau)$ be the decision variables from the quadratic policy, which is a solution of problem (6.7), and $r_n(\tau) \triangleq r_n^{(f_n(\tau))}(\tau)$, $a_n(\tau) \triangleq a_n^{(f_n(\tau))}(\tau)$. Also let $(\hat{f}_1(\tau), \ldots, \hat{f}_N(\tau))$ and $\hat{\mu}(\tau)$ be decision variables from any other policy, and $\hat{r}_n(\tau) \triangleq r_n^{(\hat{f}_n(\tau))}(\tau)$, $\hat{a}_n(\tau) \triangleq a_n^{(\hat{f}_n(\tau))}(\tau)$. Because the quadratic policy makes decisions that minimize the right-hand-side of (6.6), we have at every slot $\tau$:

$L(\tau+1) - L(\tau) - V y_0(\tau) \le \frac{1}{2} \sum_{n \in \mathcal{N}} \left[ (Q_n(\tau) - \hat{\mu}_n(\tau))^2 + (Q_n(\tau) + \hat{a}_n(\tau))^2 \right] - \frac{1}{2} \sum_{n \in \mathcal{N}} 2 V \hat{r}_n(\tau) - \sum_{n \in \mathcal{N}} Q_n(\tau)^2$
$= \sum_{n \in \mathcal{N}} \left[ Q_n(\tau)(\hat{a}_n(\tau) - \hat{\mu}_n(\tau)) - V \hat{r}_n(\tau) \right] + \frac{1}{2} \sum_{n \in \mathcal{N}} \left[ (\hat{\mu}_n(\tau))^2 + (\hat{a}_n(\tau))^2 \right].$

Therefore:

$L(\tau+1) - L(\tau) - V y_0(\tau) \le C + \sum_{n \in \mathcal{N}} \left[ Q_n(\tau)(\hat{a}_n(\tau) - \hat{\mu}_n(\tau)) - V \hat{r}_n(\tau) \right] \quad (6.10)$

where the constant $C$ is defined:

$C \triangleq \frac{1}{2} \sum_{n \in \mathcal{N}} \left[ (\mu_n^{(\max)})^2 + (a_n^{(\max)})^2 \right]$

Now define $S(t)$ as a concatenated vector of all random events observed on slot $t$:

$S(t) \triangleq [\omega(t), (r_n^{(f)}(t), a_n^{(f)}(t))|_{n \in \mathcal{N}, f \in \mathcal{F}}]$

As discussed in Section 6.1, vector $S(t)$ is i.i.d. over slots according to some (possibly unknown) probability distribution. The components of $S(t)$ on a given slot $t$ can be arbitrarily correlated. Define an S-only policy as one that makes a (possibly randomized) choice of decision variables based only on the observed $S(t)$ (and hence independently of queue backlogs). We now customize an important theorem from [Nee10].

Theorem 15. For any $\delta > 0$ there exists an S-only policy that chooses all controlled variables $(f_1^*(t), \ldots, f_N^*(t)), \mu^*(t)$ such that:

$\mathbb{E}[y_0^*(t)] \ge y_0^{(opt)} - \delta \quad (6.11)$
$\mathbb{E}[a_n^*(t) - \mu_n^*(t)] \le \delta \quad \text{for all } n \in \mathcal{N} \quad (6.12)$

where $y_0^{(opt)}$ is the optimal solution of problem (6.2). Also, $y_0^*(t) \triangleq \sum_{n \in \mathcal{N}} r_n^*(t)$ where $r_n^*(t) \triangleq r_n^{(f_n^*(t))}(t)$ and $a_n^*(t) \triangleq a_n^{(f_n^*(t))}(t)$.
We additionally assume all constraints of the network can be achieved with slackness [Nee10]. In other words, there exists a policy that, at every queue, has average transmission rate higher than average arrival rate.

Assumption 10. There are values $\epsilon > 0$ and $0 \le y_0^{(\epsilon)} \le y_0^{(\max)}$ and an S-only policy choosing all controlled variables $(f_1^*(t), \ldots, f_N^*(t)), \mu^*(t)$ that satisfies:

$\mathbb{E}[y_0^*(t)] = y_0^{(\epsilon)} \quad (6.13)$
$\mathbb{E}[a_n^*(t) - \mu_n^*(t)] \le -\epsilon \quad \text{for all } n \in \mathcal{N} \quad (6.14)$

6.3.1 Performance analysis

Since our quadratic algorithm satisfies the bound (6.10), where the right-hand-side is in terms of any alternative policy $(\hat{f}_1(t), \ldots, \hat{f}_N(t)), \hat{\mu}(t)$, it holds for any S-only policy $(f_1^*(t), \ldots, f_N^*(t)), \mu^*(t)$. Substituting an S-only policy into (6.10) and taking expectations gives:

$\mathbb{E}[L(\tau+1) - L(\tau) - V y_0(\tau)] \le C + \sum_{n \in \mathcal{N}} \mathbb{E}\left[ Q_n(\tau)(a_n^*(\tau) - \mu_n^*(\tau)) - V r_n^*(\tau) \right]$
$= C + \sum_{n \in \mathcal{N}} \left\{ \mathbb{E}[Q_n(\tau)] \, \mathbb{E}[a_n^*(\tau) - \mu_n^*(\tau)] - V \mathbb{E}[r_n^*(\tau)] \right\} \quad (6.15)$

where we have used the fact that $Q_n(\tau)$ and $(a_n^*(\tau) - \mu_n^*(\tau))$ are independent under an S-only policy.

Theorem 16. Assume queues are initially empty, so that $Q_n(0) = 0$ for all $n$, and that Assumption 10 holds. Then the time-averaged total quality of information $\bar{y}_0$ is within $O(1/V)$ of optimality under the separable quadratic policy, while the total queue backlog is $O(V)$.

This theorem is proven in the next two subsections.

6.3.1.1 Quality of information vs. V

Using the S-only policy from (6.11)-(6.12) in the right-hand-side of (6.15) gives:

$\mathbb{E}[L(\tau+1) - L(\tau) - V y_0(\tau)] \le C - V (y_0^{(opt)} - \delta) + \delta \sum_{n \in \mathcal{N}} \mathbb{E}[Q_n(\tau)].$

This inequality is valid for every $\delta > 0$. Therefore

$\mathbb{E}[L(\tau+1) - L(\tau) - V y_0(\tau)] \le C - V y_0^{(opt)}.$

Summing from $\tau = 0$ to $t-1$:

$\mathbb{E}\Big[ L(t) - L(0) - V \sum_{\tau=0}^{t-1} y_0(\tau) \Big] \le C t - V t y_0^{(opt)}.$

Using $L(t) \ge 0$, $L(0) = 0$ and dividing by $Vt$ gives:

$\frac{1}{t} \sum_{\tau=0}^{t-1} \mathbb{E}[y_0(\tau)] \ge y_0^{(opt)} - \frac{C}{V}. \quad (6.16)$

The above holds for all $t > 0$. Taking a limit as $t \to \infty$ shows that $\bar{y}_0$ is at least $y_0^{(opt)} - C/V$, where the gap $C/V$ can be made arbitrarily small by increasing the $V$ parameter.

6.3.1.2 Total queue backlog vs. V

Now consider the existence of an S-only policy that satisfies Assumption 10. Using (6.13)-(6.14) in the right-hand-side of (6.15) gives:

$\mathbb{E}[L(\tau+1) - L(\tau) - V y_0(\tau)] \le C - V y_0^{(\epsilon)} - \epsilon \sum_{n \in \mathcal{N}} \mathbb{E}[Q_n(\tau)].$

Thus:

$\mathbb{E}[L(\tau+1) - L(\tau)] \le C + V \big( y_0^{(\max)} - y_0^{(\epsilon)} \big) - \epsilon \sum_{n \in \mathcal{N}} \mathbb{E}[Q_n(\tau)].$

Summing from $\tau = 0$ to $t-1$ gives:

$\mathbb{E}[L(t) - L(0)] \le \big[ C + V (y_0^{(\max)} - y_0^{(\epsilon)}) \big] t - \epsilon \sum_{\tau=0}^{t-1} \sum_{n \in \mathcal{N}} \mathbb{E}[Q_n(\tau)].$

Using $L(t) \ge 0$, $L(0) = 0$, and rearranging terms above gives:

$\frac{1}{t} \sum_{\tau=0}^{t-1} \sum_{n \in \mathcal{N}} \mathbb{E}[Q_n(\tau)] \le \frac{C + V (y_0^{(\max)} - y_0^{(\epsilon)})}{\epsilon}. \quad (6.17)$

The above holds for all $t > 0$. Taking a limit as $t \to \infty$ shows that total time-average expected queue backlog is bounded by a constant that is $O(V)$. In particular, this bound implies that every queue is strongly stable.

[Figure 6.2: Small network with orthogonal channels]

The $V$ parameter in (6.16) and (6.17) affects the performance tradeoff $[O(1/V), O(V)]$ between quality of information and total queue backlog. These results are similar to those that can be derived under the standard max-weight algorithm [Nee10, GNT06]. However, simulation in the next section shows significant reduction of queue backlog under the quadratic policy. Note that our proofs are inspired by the techniques in [Nee10, GNT06]. In addition to the above tradeoff, it is possible to show that every queue is deterministically bounded by a constant that is $O(V)$. This is skipped for brevity, but is shown more generally for the 2-hop problem in Section 6.6.2.

6.3.2 Simulation

Simulation under the proposed quadratic policy and the standard max-weight policy is performed over a small network in Figure 6.2. The network contains two devices, $\mathcal{N} = \{1, 2\}$. An event occurs in every slot with probability 0.3. Device 1 is closer to the event, but device 2 is closer to the receiver station. Due to this, the uplink channel distribution for device 2 is better than that of device 1, as shown in Figure 6.2. We assume the uplink channels are orthogonal. The constraints are $\mu_n(t) \in \{0, \ldots, \mu_n^{(best)}(\omega_n(t))\}$ for every $n \in \mathcal{N}$.
The feasible set of formats is $\mathcal{F} = \{0, 1, \ldots, 6\}$ with constant options given by

$(r_1^{(0)}, a_1^{(0)}) = (0, 0) \qquad (r_2^{(0)}, a_2^{(0)}) = (0, 0)$
$(r_1^{(1)}, a_1^{(1)}) = (15, 10) \qquad (r_2^{(1)}, a_2^{(1)}) = (1.5, 10)$
$(r_1^{(2)}, a_1^{(2)}) = (45, 40) \qquad (r_2^{(2)}, a_2^{(2)}) = (4.5, 40)$
$(r_1^{(3)}, a_1^{(3)}) = (65, 70) \qquad (r_2^{(3)}, a_2^{(3)}) = (6.5, 70)$
$(r_1^{(4)}, a_1^{(4)}) = (75, 100) \qquad (r_2^{(4)}, a_2^{(4)}) = (7.5, 100)$
$(r_1^{(5)}, a_1^{(5)}) = (90, 200) \qquad (r_2^{(5)}, a_2^{(5)}) = (9.0, 200)$
$(r_1^{(6)}, a_1^{(6)}) = (110, 400) \qquad (r_2^{(6)}, a_2^{(6)}) = (11.0, 400)$

whenever there is an event. In particular, the rewards associated with device 1 are ten times larger than those of device 2. The separable quadratic policy minimizes (6.7) every slot, while the max-weight policy minimizes (6.4). The time-averaged quality of information for the two policies is shown in Figure 6.3a. From the plot, the values of $\bar{y}_0$ under both policies converge to optimality following the $O(1/V)$ performance bound. The averaged total reward from the quadratic policy converges faster than that from the max-weight policy. Figure 6.4 reveals queue lengths in the inputs under the quadratic and max-weight policies. At the same $V$, the quadratic policy yields smaller or equal queue lengths compared to the cases under the max-weight policy. The plot also shows the growth of queue lengths with parameter $V$, which follows the $O(V)$ bound of the queue length.

[Figure 6.3: Quality of information versus V and averaged queue lengths under the quadratic (QD) and max-weight (MW) policies]

[Figure 6.4: Averaged backlog in queues versus V under the quadratic and max-weight policies]

Figure 6.3b shows that the quadratic policy can achieve near optimality with significantly smaller total system backlog compared to the case under the max-weight policy. This shows a significant advantage, which in turn affects buffer size and packet delay.

6.4 System model with relay

The simulation scenario in the previous section only allows direct transmission to the destination, which limits device 1 from reporting high-quality information. In this section,
Data in queue K n (t) must be transmitted directly to the receiver station, while data in queue J n (t) can be transmitted to another device m, but is then placed in queue K m (t) for that device. 2 This is conceptually similar to the hop-count based queue architecture in [YSRL11]. 2 It is possible to extend the model to allow at mostH hops by replacingJn(t) withJn;2(t);:::;Jn;H(t) where J n;h (t) carries data that must be delivered to the receiver station within h hops. This extension increases complexity linearly in H. 177 format selection event Figure 6.5: An example network consists of devices with uplink and relay capabilities and a receiver station. In each slot t, let s (k) n (t) and s (j) n (t) represent the amount of data in Q n (t) that can be internally moved to K n (t) and J n (t), respectively, as illustrated in Figure 6.5. These decision variables are chosen within setsS (k) n andS (j) n , respectively, where: S (k) n , f0; 1;:::;s (k)(max) n g S (j) n , f0; 1;:::;s (j)(max) n g where s (k)(max) n , s (j)(max) n are nite maximum values. 3 The new dynamics of Q n (t) are Q n (t + 1) = max[Q n (t)s (k) n (t)s (j) n (t); 0] +a n (t): (6.18) 3 These upper bounds are necessary for the performance analysis. In practice they are not required. 178 Thes (k) n (t) ands (j) n (t) decisions are selected by an algorithm, but the actual s (k)(act) n (t) and s (j)(act) n (t) data units moved from Q n (t) can be any values that satisfy: s (k)(act) n (t) +s (j)(act) n (t) = min[Q n (t);s (k) n (t) +s (j) n (t)] (6.19) 0s (k)(act) n (t)s (k) n (t) (6.20) 0s (j)(act) n (t)s (j) n (t) (6.21) Again, wireless transmission is assumed to be channel-aware, and decision options are determined by a vector(t) of current channel states in the network, which now includes channels for both uplink and device-to-device transmission. Let n (t) be the uplink rate from device n to the destination, and let (t) be the vector of these values. 
Let $\gamma_{nm}(t)$ be the amount of data selected for relay transmission from device $n$ to device $m$, and let $\gamma(t) = (\gamma_{nm}(t))|_{n,m \in \mathcal{N}}$. Assume $\gamma_{nn}(t) = 0$ for every $t$ and $n$. Transmissions to relays are assumed to be orthogonal to the uplink transmissions. Every slot $t$, the vectors $\mu(t)$ and $\gamma(t)$ are chosen within sets $\mathcal{U}_{\omega(t)}$ and $\mathcal{A}_{\omega(t)}$, respectively (where $\mathcal{U}$ and $\mathcal{A}$ stand for uplink and ad-hoc relay, respectively). If each relay channel is orthogonal then set $\mathcal{A}_{\omega(t)}$ can be decomposed into a set product of individual options for each relay link, where each option depends on the component of $\omega(t)$ that represents its own relay channel. The dynamics of relay queue $J_n(t)$ are:

$J_n(t+1) = \max\Big[ J_n(t) - \sum_{m \in \mathcal{N}} \gamma_{nm}(t) + s_n^{(j)(act)}(t), 0 \Big]. \quad (6.22)$

As before, the actual amount of data $\gamma_{nm}^{(act)}(t)$ satisfies:

$\sum_{m \in \mathcal{N}} \gamma_{nm}^{(act)}(t) = \min\Big[ J_n(t) + s_n^{(j)(act)}(t), \sum_{m \in \mathcal{N}} \gamma_{nm}(t) \Big] \quad (6.23)$
$0 \le \gamma_{nm}^{(act)}(t) \le \gamma_{nm}(t) \quad \text{for } m \in \mathcal{N} \setminus \{n\}. \quad (6.24)$

The dynamics of uplink queue $K_n(t)$ are:

$K_n(t+1) = \max\Big[ K_n(t) - \mu_n(t) + s_n^{(k)(act)}(t), 0 \Big] + \sum_{m \in \mathcal{N}} \gamma_{mn}^{(act)}(t). \quad (6.25)$

The $J_n(t)$ and $K_n(t)$ dynamics assume the incoming data $s_n^{(j)}(t)$ and $s_n^{(k)}(t)$ can be transmitted out on the same slot $t$, since moving data between internal buffers of the same device incurs negligible delay. Notice that all data transmitted to a relay is placed in the uplink queue of that relay (which ensures all paths take at most two hops). The queueing equations (6.22) and (6.25) involve actual amounts of data, but they can be bounded using (6.20), (6.21) and (6.24) as

$J_n(t+1) \le \max\Big[ J_n(t) - \sum_{m \in \mathcal{N}} \gamma_{nm}(t) + s_n^{(j)}(t), 0 \Big] \quad (6.26)$
$K_n(t+1) \le \max\Big[ K_n(t) - \mu_n(t) + s_n^{(k)}(t), 0 \Big] + \sum_{m \in \mathcal{N}} \gamma_{mn}(t). \quad (6.27)$

The queue dynamics (6.18), (6.26), (6.27) do not require the actual variables $s_n^{(j)(act)}(t)$, $s_n^{(k)(act)}(t)$, $\gamma_{nm}^{(act)}(t)$, and are the only ones needed in the rest of this chapter. Assume the relay transmissions have bounded rates. Specifically, let $\gamma_{nm}^{(\max)}$ be finite maximum values of $\gamma_{nm}(t)$.
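The bound-form dynamics (6.18), (6.26), (6.27) for one device can be sketched directly in code. This is an illustrative sketch, not the thesis implementation, and the example slot values are hypothetical; it uses the decision variables (the bound form), not the "actual" served amounts.

```python
# Sketch of the two-hop queue updates for one device n (bound form):
#   Q(t+1) = max[Q - s_k - s_j, 0] + a                      (Eq. 6.18)
#   K(t+1) = max[K - mu + s_k, 0] + gamma_in                (Eq. 6.27)
#   J(t+1) = max[J - gamma_out + s_j, 0]                    (Eq. 6.26)
# gamma_out is the total rate offered to relays; gamma_in is the total
# data relayed in from other devices, which joins the uplink queue K.

def two_hop_update(Q, K, J, a, s_k, s_j, mu, gamma_out, gamma_in):
    Q_next = max(Q - s_k - s_j, 0) + a
    K_next = max(K - mu + s_k, 0) + gamma_in
    J_next = max(J - gamma_out + s_j, 0)
    return Q_next, K_next, J_next

# One slot with hypothetical values.
print(two_hop_update(10, 4, 2, a=3, s_k=2, s_j=1,
                     mu=5, gamma_out=2, gamma_in=1))  # (10, 2, 1)
```

Note how relayed-in data lands in $K$, never in $Q$ or $J$, which is the mechanism enforcing the at-most-two-hops property.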
Further, assume that for each $n \in \mathcal{N}$, $s_n^{(k)(\max)} \ge \mu_n^{(\max)}$ and $s_n^{(j)(\max)} \ge \sum_{m \in \mathcal{N}} \gamma_{nm}^{(\max)}$, so that the maximum amount that can be internally shifted is at least as much as the maximum amount that can be transmitted.

6.4.2 Stochastic network optimization

From Section 6.1.3, recall that $y_0(t)$ is the total quality of information from format selection on slot $t$, and its upper bound is $y_0^{(\max)}$. The time-averaged total information quality is $\bar{y}_0$. The objective is to solve:

Maximize $\bar{y}_0$
Subject to: Network is strongly stable
            $f_n(t) \in \mathcal{F}$ for all $t$ and all $n \in \mathcal{N}$
            $\mu(t) \in \mathcal{U}_{\omega(t)}$ for all $t$
            $\gamma(t) \in \mathcal{A}_{\omega(t)}$ for all $t$
            $s_n^{(k)}(t) \in \mathcal{S}_n^{(k)}$ for all $t$ and all $n \in \mathcal{N}$
            $s_n^{(j)}(t) \in \mathcal{S}_n^{(j)}$ for all $t$ and all $n \in \mathcal{N}$

As before, this problem is always feasible.

6.5 Dynamic algorithm

To apply the separable quadratic technique to this problem, we require the following lemma, which is an extension of Lemma 35. The notation $\mathbb{R}$ denotes the real numbers, and $\mathbb{R}_+$ denotes the non-negative reals.

Lemma 36. Let $y_i \in \mathbb{R}$ and $z_j \in \mathbb{R}_+$ for $i \in \{1, 2, \ldots, Y\}$ and $j \in \{1, 2, \ldots, Z\}$, where $Y$ and $Z$ are non-negative integers. Assume that $|y_i| \le y_i^{(\max)}$ and $z_j \le z_j^{(\max)}$. Then for any $x \in \mathbb{R}_+$,

$\Big[ \max\Big( x + \sum_{i=1}^{Y} y_i, 0 \Big) + \sum_{j=1}^{Z} z_j \Big]^2 - x^2 \le \sum_{i=1}^{Y} (x + y_i)^2 + \sum_{j=1}^{Z} (x + z_j)^2 - (Y + Z) x^2 + D \quad (6.28)$

where

$D \triangleq \Big[ \sum_{i=1}^{Y} y_i^{(\max)} + \sum_{j=1}^{Z} z_j^{(\max)} \Big]^2.$

Proof:

$\Big( \max\Big[ x + \sum_{i=1}^{Y} y_i, 0 \Big] + \sum_{j=1}^{Z} z_j \Big)^2 - x^2$
$\le \Big( x + \sum_{i=1}^{Y} y_i \Big)^2 + \Big( \sum_{j=1}^{Z} z_j \Big)^2 + 2 \sum_{j=1}^{Z} z_j \Big( x + \sum_{i=1}^{Y} |y_i| \Big) - x^2$
$= 2 x \sum_{i=1}^{Y} y_i + \Big( \sum_{i=1}^{Y} y_i \Big)^2 + \Big( \sum_{j=1}^{Z} z_j \Big)^2 + 2 \sum_{j=1}^{Z} z_j \Big( x + \sum_{i=1}^{Y} |y_i| \Big)$
$= \sum_{i=1}^{Y} (x + y_i)^2 + \sum_{j=1}^{Z} (x + z_j)^2 - \sum_{i=1}^{Y} y_i^2 - \sum_{j=1}^{Z} z_j^2 - (Y + Z) x^2 + \Big( \sum_{i=1}^{Y} y_i \Big)^2 + \Big( \sum_{j=1}^{Z} z_j \Big)^2 + 2 \Big( \sum_{j=1}^{Z} z_j \Big) \Big( \sum_{i=1}^{Y} |y_i| \Big)$
$\le \sum_{i=1}^{Y} (x + y_i)^2 + \sum_{j=1}^{Z} (x + z_j)^2 - (Y + Z) x^2 + \Big( \sum_{i=1}^{Y} |y_i| + \sum_{j=1}^{Z} z_j \Big)^2 - \sum_{i=1}^{Y} y_i^2 - \sum_{j=1}^{Z} z_j^2.$

The sum of the final three terms is upper bounded by the $D$ constant from Lemma 36.
6.5.1 Lyapunov optimization

Let $\Theta(t) = (Q_n(t), K_n(t), J_n(t))|_{n \in \mathcal{N}}$ represent a vector of all queues in the system. The quadratic Lyapunov function becomes:

$L(t) \triangleq \frac{1}{2} \sum_{n \in \mathcal{N}} \left[ Q_n^2(t) + K_n^2(t) + J_n^2(t) \right]$

Using queueing dynamics (6.18), (6.26), and (6.27), the drift-plus-penalty expression is bounded by (6.29) below. Then, using relation (6.28), the bound becomes (6.30).

$L(\tau+1) - L(\tau) - V y_0(\tau)$
$\le \frac{1}{2} \sum_{n \in \mathcal{N}} \Big\{ \big[ \max(Q_n(\tau) - s_n^{(k)}(\tau) - s_n^{(j)}(\tau), 0) + a_n(\tau) \big]^2 - Q_n(\tau)^2$
$\quad + \big[ \max(K_n(\tau) - \mu_n(\tau) + s_n^{(k)}(\tau), 0) + \sum_{m \in \mathcal{N}} \gamma_{mn}(\tau) \big]^2 - K_n(\tau)^2$
$\quad + \big[ \max(J_n(\tau) - \sum_{m \in \mathcal{N}} \gamma_{nm}(\tau) + s_n^{(j)}(\tau), 0) \big]^2 - J_n(\tau)^2 - 2 V r_n(\tau) \Big\} \quad (6.29)$

$\le \frac{1}{2} \sum_{n \in \mathcal{N}} \Big\{ \big[ Q_n(\tau) - s_n^{(k)}(\tau) \big]^2 + \big[ Q_n(\tau) - s_n^{(j)}(\tau) \big]^2 + \big[ Q_n(\tau) + a_n(\tau) \big]^2$
$\quad + \big[ K_n(\tau) - \mu_n(\tau) \big]^2 + \big[ K_n(\tau) + s_n^{(k)}(\tau) \big]^2 + \sum_{m \in \mathcal{N}} \big[ K_n(\tau) + \gamma_{mn}(\tau) \big]^2$
$\quad + \sum_{m \in \mathcal{N}} \big[ J_n(\tau) - \gamma_{nm}(\tau) \big]^2 + \big[ J_n(\tau) + s_n^{(j)}(\tau) \big]^2 - 2 V r_n(\tau) + D_n(\tau) \Big\} \quad (6.30)$

where

$D_n(\tau) \triangleq -3 Q_n^2(\tau) - (2 + N) K_n^2(\tau) - (1 + N) J_n^2(\tau) + \big[ s_n^{(k)(\max)} + s_n^{(j)(\max)} + a_n^{(\max)} \big]^2 + \big[ \mu_n^{(\max)} + s_n^{(k)(\max)} + \sum_{m \in \mathcal{N}} \gamma_{mn}^{(\max)} \big]^2 + \big[ \sum_{m \in \mathcal{N}} \gamma_{nm}^{(\max)} + s_n^{(j)(\max)} \big]^2 - (a_n^{(\max)})^2 - 2 (s_n^{(j)(\max)})^2.$

Thus, every time $t$, the quadratic policy observes current queue backlogs $\Theta(t)$ and random network state $S(t)$ and makes a decision according to the following minimization problem:

Minimize $\sum_{n \in \mathcal{N}} \Big\{ \big[ Q_n(t) - s_n^{(k)}(t) \big]^2 + \big[ Q_n(t) - s_n^{(j)}(t) \big]^2 + \big[ Q_n(t) + a_n(t) \big]^2 + \big[ K_n(t) - \mu_n(t) \big]^2 + \big[ K_n(t) + s_n^{(k)}(t) \big]^2 + \sum_{m \in \mathcal{N}} \big[ K_n(t) + \gamma_{mn}(t) \big]^2 + \sum_{m \in \mathcal{N}} \big[ J_n(t) - \gamma_{nm}(t) \big]^2 + \big[ J_n(t) + s_n^{(j)}(t) \big]^2 - 2 V r_n(t) \Big\}$  (6.31)
Subject to: $s_n^{(k)}(t) \in \mathcal{S}_n^{(k)}$, $s_n^{(j)}(t) \in \mathcal{S}_n^{(j)}$ for all $n \in \mathcal{N}$
            $f_n(t) \in \mathcal{F}$, $r_n(t) \triangleq r_n^{(f_n(t))}(t)$, $a_n(t) \triangleq a_n^{(f_n(t))}(t)$ for all $n \in \mathcal{N}$
            $\gamma(t) \in \mathcal{A}_{\omega(t)}$, $\mu(t) \in \mathcal{U}_{\omega(t)}$

As a result, the policy leads to a separated control algorithm specified in the next section. The performance tradeoff and deterministic bounds are proven in Section 6.6.

6.5.2 Separability

The control algorithm is derived from problem (6.31) by separately minimizing each sum of terms. At every slot $t$, each device $n \in \mathcal{N}$ observes input queue $Q_n(t)$ and options $(r_n^{(f)}(t), a_n^{(f)}(t))$ for all $f \in \mathcal{F}$.
It then chooses a format $f_n(t)$ according to the admission-control problem:

Minimize $[Q_n(t) + a_n^{(f_n(t))}(t)]^2 - 2 V r_n^{(f_n(t))}(t)$  (6.32)
Subject to: $f_n(t) \in \mathcal{F}$

This is solved easily by comparing each option $f_n(t) \in \mathcal{F}$. This problem is similar to (6.8), and the same intuition applies. Each device $n$ moves data from its input queue to its uplink queue according to the uplink routing problem:

Minimize $[Q_n(t) - s_n^{(k)}(t)]^2 + [K_n(t) + s_n^{(k)}(t)]^2$  (6.33)
Subject to: $s_n^{(k)}(t) \in \mathcal{S}_n^{(k)}$.

This can be solved in closed form by letting $I_K^+(t) \triangleq \lceil \frac{Q_n(t) - K_n(t)}{2} \rceil$, $I_K^-(t) \triangleq \lfloor \frac{Q_n(t) - K_n(t)}{2} \rfloor$, and $g_K(x, t) = [Q_n(t) - x]^2 + [K_n(t) + x]^2$. Then choose

$s_n^{(k)}(t) = \begin{cases} s_n^{(k)(\max)}, & Q_n(t) - K_n(t) \ge 2 s_n^{(k)(\max)} \\ \operatorname{argmin}_{x \in \{I_K^+(t), I_K^-(t)\}} g_K(x, t), & 0 < Q_n(t) - K_n(t) < 2 s_n^{(k)(\max)} \\ 0, & Q_n(t) - K_n(t) \le 0 \end{cases}$

Intuitively, the amount $s_n^{(k)}(t)$ is half of the difference between queues $Q_n(t)$ and $K_n(t)$, capped at $s_n^{(k)(\max)}$. Also each device $n$ moves data from its input queue to its relay queue according to the relay routing problem:

Minimize $[Q_n(t) - s_n^{(j)}(t)]^2 + [J_n(t) + s_n^{(j)}(t)]^2$  (6.34)
Subject to: $s_n^{(j)}(t) \in \mathcal{S}_n^{(j)}$

Again, let $I_J^+(t) \triangleq \lceil \frac{Q_n(t) - J_n(t)}{2} \rceil$, $I_J^-(t) \triangleq \lfloor \frac{Q_n(t) - J_n(t)}{2} \rfloor$, and $g_J(x, t) = [Q_n(t) - x]^2 + [J_n(t) + x]^2$. Then choose

$s_n^{(j)}(t) = \begin{cases} s_n^{(j)(\max)}, & Q_n(t) - J_n(t) \ge 2 s_n^{(j)(\max)} \\ \operatorname{argmin}_{x \in \{I_J^+(t), I_J^-(t)\}} g_J(x, t), & 0 < Q_n(t) - J_n(t) < 2 s_n^{(j)(\max)} \\ 0, & Q_n(t) - J_n(t) \le 0 \end{cases}$

The intuition for the decisions $s_n^{(j)}(t)$ is similar to that for $s_n^{(k)}(t)$. Note that the solutions from the quadratic policy are "smoother" as compared to the solutions from the max-weight policy, which would choose "bang-bang" decisions of either 0 or $s_n^{(k)(\max)}$ for $s_n^{(k)}(t)$ (and 0 or $s_n^{(j)(\max)}$ for $s_n^{(j)}(t)$).
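The closed-form routing rule above (move roughly half the queue gap, capped at the maximum) can be sketched as follows. This is an illustrative sketch of the "half-the-gap" structure of the solution to (6.33), assuming integer data units; the numeric values in the example are hypothetical.

```python
# Sketch of the closed-form uplink-routing decision for Eq. (6.33):
# choose integer s in [0, s_max] minimizing (Q - s)^2 + (K + s)^2,
# which is roughly half the gap Q - K, capped at s_max.
import math

def route_amount(Q, K, s_max):
    gap = Q - K
    if gap <= 0:                    # destination queue already at least as long
        return 0
    if gap >= 2 * s_max:            # gap too large: move the maximum
        return s_max
    lo, hi = math.floor(gap / 2), math.ceil(gap / 2)  # the two integer candidates
    g = lambda s: (Q - s) ** 2 + (K + s) ** 2
    return lo if g(lo) <= g(hi) else hi

print(route_amount(10, 3, s_max=100))  # interior case: about half of the gap 7
print(route_amount(3, 10, s_max=100))  # gap <= 0: move nothing
```

The same structure, with $J_n(t)$ in place of $K_n(t)$, gives the relay-routing decision for (6.34); the smooth "half-gap" behavior contrasts with the 0-or-max decisions of the max-weight policy noted above.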
The uplink allocation problem determining the uplink transmission of every node $n\in\mathcal{N}$ is
$$\text{Minimize} \quad \sum_{n\in\mathcal{N}}\big[K_n(t)-\mu_n(t)\big]^2 \qquad \text{Subject to} \quad \mu(t)\in\mathcal{U}_{S(t)}. \tag{6.35}$$
This can be solved at the receiver station and is similar to (6.9), so the same intuition follows. If all uplink channels are orthogonal, the problem can be decomposed further and solved at each device $n$ by
$$\text{Minimize} \quad \big[K_n(t)-\mu_n(t)\big]^2 \qquad \text{Subject to} \quad \mu_n(t)\in\mathcal{U}_{n,S(t)}, \tag{6.36}$$
where $\mathcal{U}_{n,S(t)}$ is the feasible set of $\mu_n(t)$. An optimal uplink transmission rate is the rate in $\mathcal{U}_{n,S(t)}$ closest to $K_n(t)$.

The relay allocation problem determining the relay transmission of every node $n\in\mathcal{N}$ is
$$\text{Minimize} \quad \sum_{n\in\mathcal{N}}\sum_{m\in\mathcal{N}}\Big\{\big[K_n(t)+\alpha_{mn}(t)\big]^2 + \big[J_n(t)-\alpha_{nm}(t)\big]^2\Big\} \qquad \text{Subject to} \quad \alpha(t)\in\mathcal{A}_{S(t)}. \tag{6.37}$$
Intuitively, the decision $\alpha_{nm}(t)$ is made to balance the difference between queues $K_m(t)$ and $J_n(t)$ while transmission resources are shared among devices. If channels are orthogonal, so the sets have a product form, then the decisions are separable across transmission links $(n,m)$ for $n\in\mathcal{N}, m\in\mathcal{N}$ as
$$\text{Minimize} \quad \big[K_m(t)+\alpha_{nm}(t)\big]^2 + \big[J_n(t)-\alpha_{nm}(t)\big]^2 \qquad \text{Subject to} \quad \alpha_{nm}(t)\in\mathcal{A}_{nm,S(t)}, \tag{6.38}$$
where $\mathcal{A}_{nm,S(t)}$ is the feasible set of $\alpha_{nm}(t)$. The closed-form solution of this problem is
$$\alpha_{nm}(t) = \begin{cases} \alpha_{nm}^{(\max)}, & J_n(t)-K_m(t) \geq 2\alpha_{nm}^{(\max)}\\ \operatorname{argmin}_{x\in\{I_A^+(t),\,I_A^-(t)\}}\, g_A(x;t), & 0 < J_n(t)-K_m(t) < 2\alpha_{nm}^{(\max)}\\ 0, & J_n(t)-K_m(t) \leq 0\end{cases}$$
where
$$I_A^+(t) \triangleq \operatorname{argmin}_{a\in\mathcal{A}_{nm,S(t)}}\Big|a-\frac{J_n(t)-K_m(t)}{2}\Big|, \quad I_A^-(t) \triangleq \operatorname{argmin}_{a\in\mathcal{A}_{nm,S(t)}\setminus\{I_A^+(t)\}}\Big|a-\frac{J_n(t)-K_m(t)}{2}\Big|, \quad g_A(x;t) = [J_n(t)-x]^2+[K_m(t)+x]^2.$$
Again, due to the structure of $\mathcal{U}_{S(t)}$ and $\mathcal{A}_{S(t)}$, the orthogonality assumption may not hold, but the subproblems are always fully separable as a result of the novel quadratic policy.

6.5.3 Algorithm

At every time slot $t$, our algorithm has two parts, the device side and the receiver-station side, which are summarized in the algorithms below.
Device side:

for device $n\in\mathcal{N}$ do
  Observe $Q_n(t)$, $K_n(t)$, and $J_n(t)$
  Observe $(r_n^{(f)}(t), a_n^{(f)}(t))\big|_{f\in\mathcal{F}}$
  Select format $f_n(t)$ according to (6.32)
  Move data from $Q_n(t)$ to $K_n(t)$ and $J_n(t)$ with $s_n^{(k)(act)}(t), s_n^{(j)(act)}(t)$ satisfying (6.19)-(6.21), with the values of $s_n^{(k)}(t), s_n^{(j)}(t)$ calculated from (6.33) and (6.34)
end for
Algorithm 9: Distributed format selection and routing

After these processes, queues $Q_n(t+1), K_n(t+1)$, and $J_n(t+1)$ are updated via (6.18), (6.22), (6.25).

Receiver-station side:

for receiver station $0$ do
  Observe $(K_n(t), J_n(t))\big|_{n\in\mathcal{N}}$
  Observe $\mathcal{U}_{S(t)}$ and $\mathcal{A}_{S(t)}$
  Signal devices $n\in\mathcal{N}$ to make uplink transmissions $\mu(t)$ according to (6.35)
  Signal devices $n\in\mathcal{N}$ to relay data $\alpha(t)$ according to (6.37)
end for
Algorithm 10: Uplink and relay resource allocation

6.6 Stability and performance bounds

Compare the quadratic policy with any other policy. Let $(f_n(\tau), s_n^{(k)}(\tau), s_n^{(j)}(\tau))\big|_{n\in\mathcal{N}}, \mu(\tau), \alpha(\tau)$ be the decision variables from the quadratic policy, with $r_n(\tau)\triangleq r_n^{(f_n(\tau))}(\tau)$ and $a_n(\tau)\triangleq a_n^{(f_n(\tau))}(\tau)$. Also let $(\hat{f}_n(\tau), \hat{s}_n^{(k)}(\tau), \hat{s}_n^{(j)}(\tau))\big|_{n\in\mathcal{N}}, \hat{\mu}(\tau), \hat{\alpha}(\tau)$ be the decision variables from any other policy, with $\hat{r}_n(\tau)\triangleq r_n^{(\hat{f}_n(\tau))}(\tau)$ and $\hat{a}_n(\tau)\triangleq a_n^{(\hat{f}_n(\tau))}(\tau)$. Because the quadratic policy minimizes the right-hand side of (6.30), the drift-plus-penalty expression under the quadratic policy satisfies:
$$L(\tau+1)-L(\tau)-Vy_0(\tau) \leq \frac{1}{2}\sum_{n\in\mathcal{N}}\bigg\{\Big[Q_n(\tau)-s_n^{(k)}(\tau)\Big]^2 + \Big[Q_n(\tau)-s_n^{(j)}(\tau)\Big]^2 + \big[Q_n(\tau)+a_n(\tau)\big]^2 + \big[K_n(\tau)-\mu_n(\tau)\big]^2 + \Big[K_n(\tau)+s_n^{(k)}(\tau)\Big]^2 + \sum_{m\in\mathcal{N}}\big[K_n(\tau)+\alpha_{mn}(\tau)\big]^2 + \sum_{m\in\mathcal{N}}\big[J_n(\tau)-\alpha_{nm}(\tau)\big]^2 + \Big[J_n(\tau)+s_n^{(j)}(\tau)\Big]^2 - 2Vr_n(\tau) + D_n(\tau)\bigg\} \tag{6.39}$$
$$\leq \frac{1}{2}\sum_{n\in\mathcal{N}}\bigg\{\Big[Q_n(\tau)-\hat{s}_n^{(k)}(\tau)\Big]^2 + \Big[Q_n(\tau)-\hat{s}_n^{(j)}(\tau)\Big]^2 + \big[Q_n(\tau)+\hat{a}_n(\tau)\big]^2 + \big[K_n(\tau)-\hat{\mu}_n(\tau)\big]^2 + \Big[K_n(\tau)+\hat{s}_n^{(k)}(\tau)\Big]^2 + \sum_{m\in\mathcal{N}}\big[K_n(\tau)+\hat{\alpha}_{mn}(\tau)\big]^2 + \sum_{m\in\mathcal{N}}\big[J_n(\tau)-\hat{\alpha}_{nm}(\tau)\big]^2 + \Big[J_n(\tau)+\hat{s}_n^{(j)}(\tau)\Big]^2 - 2V\hat{r}_n(\tau) + D_n(\tau)\bigg\} \tag{6.40}$$
where (6.39) is a restatement of (6.30).
It follows that
$$L(\tau+1)-L(\tau)-Vy_0(\tau) \leq \sum_{n\in\mathcal{N}}\bigg\{Q_n(\tau)\Big[\hat{a}_n(\tau)-\hat{s}_n^{(k)}(\tau)-\hat{s}_n^{(j)}(\tau)\Big] + K_n(\tau)\Big[\hat{s}_n^{(k)}(\tau)+\sum_{m\in\mathcal{N}}\hat{\alpha}_{mn}(\tau)-\hat{\mu}_n(\tau)\Big] + J_n(\tau)\Big[\hat{s}_n^{(j)}(\tau)-\sum_{m\in\mathcal{N}}\hat{\alpha}_{nm}(\tau)\Big] - V\hat{r}_n(\tau)\bigg\} + E \tag{6.41}$$
where $E$ is a suitable constant that does not depend on $V$. In particular, it can be shown that:
$$E \triangleq \frac{1}{2}\sum_{n\in\mathcal{N}}\bigg\{\Big[s_n^{(k)(\max)}+s_n^{(j)(\max)}+a_n^{(\max)}\Big]^2 + \Big[s_n^{(k)(\max)}+\mu_n^{(\max)}+\sum_{m\in\mathcal{N}}\alpha_{mn}^{(\max)}\Big]^2 + \Big[s_n^{(j)(\max)}+\sum_{m\in\mathcal{N}}\alpha_{nm}^{(\max)}\Big]^2 + 2\big(s_n^{(k)(\max)}\big)^2 + 2\big(s_n^{(j)(\max)}\big)^2 + 2\big(a_n^{(\max)}\big)^2 + 2\big(\mu_n^{(\max)}\big)^2 + \Big(\sum_{m\in\mathcal{N}}\alpha_{mn}^{(\max)}\Big)^2 + \Big(\sum_{m\in\mathcal{N}}\alpha_{nm}^{(\max)}\Big)^2\bigg\}.$$
The derivations (6.39)-(6.41) show that applying the quadratic policy to the drift-plus-penalty expression leads to the bound (6.41), which is valid for every other control policy. As discussed in Section 6.4, $S(t)$ is i.i.d. over slots. Define an $S$-only policy as one that makes a (possibly randomized) choice of decision variables based only on the observed $S(t)$. Then we customize an important theorem from [Nee10].

Theorem 17: For any $\delta > 0$ there exists an $S$-only policy that chooses all control variables $(f_n^*(t), s_n^{(k)*}(t), s_n^{(j)*}(t))\big|_{n\in\mathcal{N}}, \mu^*(t), \alpha^*(t)$ such that for all $n\in\mathcal{N}$:
$$\mathbb{E}[y_0^*(t)] \geq y_0^{(opt)} - \delta \tag{6.42}$$
$$\mathbb{E}\Big[a_n^*(t)-s_n^{(k)*}(t)-s_n^{(j)*}(t)\Big] \leq \delta \tag{6.43}$$
$$\mathbb{E}\Big[s_n^{(k)*}(t)+\sum_{m\in\mathcal{N}}\alpha_{mn}^*(t)-\mu_n^*(t)\Big] \leq \delta \tag{6.44}$$
$$\mathbb{E}\Big[s_n^{(j)*}(t)-\sum_{m\in\mathcal{N}}\alpha_{nm}^*(t)\Big] \leq \delta \tag{6.45}$$
where $y_0^{(opt)}$ is the optimal solution of the new problem defined in Section 6.4.2. Also, $y_0^*(t) \triangleq \sum_{n\in\mathcal{N}} r_n^*(t)$, where $r_n^*(t)\triangleq r_n^{(f_n^*(t))}(t)$ and $a_n^*(t)\triangleq a_n^{(f_n^*(t))}(t)$.
We additionally assume all constraints of the network can be achieved with slackness:

Assumption 11: There exist values $\xi > 0$ and $0 \leq y_0^{(\xi)} \leq y_0^{(\max)}$ and an $S$-only policy choosing all control variables $(f_n^*(t), s_n^{(k)*}(t), s_n^{(j)*}(t))\big|_{n\in\mathcal{N}}, \mu^*(t), \alpha^*(t)$ that satisfies, for all $n\in\mathcal{N}$:
$$\mathbb{E}[y_0^*(t)] = y_0^{(\xi)} \tag{6.46}$$
$$\mathbb{E}\Big[a_n^*(t)-s_n^{(k)*}(t)-s_n^{(j)*}(t)\Big] \leq -\xi \tag{6.47}$$
$$\mathbb{E}\Big[s_n^{(k)*}(t)+\sum_{m\in\mathcal{N}}\alpha_{mn}^*(t)-\mu_n^*(t)\Big] \leq -\xi \tag{6.48}$$
$$\mathbb{E}\Big[s_n^{(j)*}(t)-\sum_{m\in\mathcal{N}}\alpha_{nm}^*(t)\Big] \leq -\xi. \tag{6.49}$$

6.6.1 Performance analysis

Since our quadratic algorithm satisfies the bound (6.41), whose right-hand side is in terms of any alternative policy $(\hat{f}_n(t), \hat{s}_n^{(k)}(t), \hat{s}_n^{(j)}(t))\big|_{n\in\mathcal{N}}, \hat{\mu}(t), \hat{\alpha}(t)$, the bound holds for any $S$-only policy $(f_n^*(t), s_n^{(k)*}(t), s_n^{(j)*}(t))\big|_{n\in\mathcal{N}}, \mu^*(t), \alpha^*(t)$. Substituting an $S$-only policy into (6.41) and taking expectations gives:
$$\mathbb{E}\big[L(\tau+1)-L(\tau)-Vy_0(\tau)\big] \leq \sum_{n\in\mathcal{N}}\bigg\{\mathbb{E}[Q_n(\tau)]\,\mathbb{E}\Big[a_n^*(\tau)-s_n^{(k)*}(\tau)-s_n^{(j)*}(\tau)\Big] + \mathbb{E}[K_n(\tau)]\,\mathbb{E}\Big[s_n^{(k)*}(\tau)+\sum_{m\in\mathcal{N}}\alpha_{mn}^*(\tau)-\mu_n^*(\tau)\Big] + \mathbb{E}[J_n(\tau)]\,\mathbb{E}\Big[s_n^{(j)*}(\tau)-\sum_{m\in\mathcal{N}}\alpha_{nm}^*(\tau)\Big] - V\,\mathbb{E}[r_n^*(\tau)]\bigg\} + E \tag{6.50}$$
where we have used the fact that queue backlogs on slot $\tau$ are independent of the control decision variables of an $S$-only policy on that slot.

Theorem 18: If Assumption 11 holds, then the time-averaged total quality of information $\bar{y}_0$ is within $O(1/V)$ of optimality under the quadratic policy, while the total queue backlog grows as $O(V)$.

Proof: Theorem 18 is proven by substituting the $S$-only policies from Theorem 17 and Assumption 11 into the right-hand side of (6.50). The analysis is almost identical to that given for the uplink problem, and details are omitted for brevity.

6.6.2 Deterministic bounds on queue lengths

Here we show that, in addition to the average queue size bounds derived in the previous subsection, our algorithm also yields deterministic worst-case queue size bounds. Practically, these bounds can be used to determine the memory requirement of a system for a particular value of $V$.
For each device $n\in\mathcal{N}$, define $\beta_n$ as the maximum possible value of the expression
$$\frac{2Vr_n^{(f)}(t) - \big(a_n^{(f)}(t)\big)^2}{2a_n^{(f)}(t)}$$
over all slots $t$ and all formats $f\in\mathcal{F}$ for which $a_n^{(f)}(t)\neq 0$. Define:
$$Q_n^{(\max)} \triangleq \beta_n + a_n^{(\max)} \quad \text{for } n\in\mathcal{N}$$
$$K_n^{(\max)} \triangleq \max_{m\in\mathcal{N}}\Big[Q_m^{(\max)}\Big] + \sum_{m\in\mathcal{N}}\alpha_{mn}^{(\max)} + s_n^{(k)(\max)}.$$

Theorem 19: Under the separable quadratic policy, for all devices $n\in\mathcal{N}$ and all slots $t\geq 0$, we have:
$$Q_n(t) \leq Q_n^{(\max)} \tag{6.51}$$
$$J_n(t) \leq Q_n^{(\max)} \tag{6.52}$$
$$K_n(t) \leq K_n^{(\max)} \tag{6.53}$$
provided that these inequalities hold at $t=0$. The bounds (6.51)-(6.53) are proven in the next subsections.

6.6.2.1 Input queue

From the admission-control problem (6.32), if $(r_n(t), a_n(t)) = (0,0)$, then the objective value of the problem is $Q_n(t)^2$. Therefore, device $n$ only chooses $(r_n(t), a_n(t))$ such that $a_n(t)\neq 0$ when:
$$\big[Q_n(t)+a_n(t)\big]^2 - 2Vr_n(t) \leq Q_n(t)^2.$$
This is equivalent to:
$$2Q_n(t)a_n(t) + a_n(t)^2 - 2Vr_n(t) \leq 0, \quad \text{and} \quad Q_n(t) \leq \frac{2Vr_n(t) - a_n(t)^2}{2a_n(t)} \leq \beta_n.$$
This implies that $Q_n(t)$ can only increase when $Q_n(t)\leq\beta_n$, and it receives no new data otherwise. It follows that for all slots $t$:
$$0 \leq Q_n(t) \leq \beta_n + a_n^{(\max)}$$
provided that this holds for slot $t=0$. This proves (6.51).

6.6.2.2 Relay queue

Fix $t$ and assume for each device $n\in\mathcal{N}$ that $J_n(t)\leq Q_n^{(\max)}$ for this slot $t$. From the closed-form solution of the relay routing problem (6.34) and queue equation (6.22), there are three cases to consider.
i) When $Q_n(t)-J_n(t)\leq 0$, then $s_n^{(j)}(t)=0$, and
$$J_n(t+1) \leq \max\Big[J_n(t)+s_n^{(j)}(t),\,0\Big] = J_n(t) \leq Q_n^{(\max)}.$$
ii) When $Q_n(t)-J_n(t)\geq 2s_n^{(j)(\max)}$ (or $J_n(t)\leq Q_n(t)-2s_n^{(j)(\max)}$), then $s_n^{(j)}(t)=s_n^{(j)(\max)}$, and
$$J_n(t+1) \leq \max\Big[J_n(t)+s_n^{(j)}(t),\,0\Big] \leq \max\Big[Q_n(t)-s_n^{(j)(\max)},\,0\Big] \leq Q_n(t) \leq Q_n^{(\max)}.$$
iii) When $0 < Q_n(t)-J_n(t) < 2s_n^{(j)(\max)}$, then $s_n^{(j)}(t)\leq\big\lceil\frac{Q_n(t)-J_n(t)}{2}\big\rceil$, and
$$J_n(t+1) \leq \max\Big[J_n(t)+s_n^{(j)}(t),\,0\Big] \leq \max\Big[\frac{Q_n(t)+J_n(t)}{2},\,0\Big] \leq Q_n(t) \leq Q_n^{(\max)}.$$
Thus, given that $J_n(0)\leq Q_n^{(\max)}$, we have $J_n(t)\leq Q_n^{(\max)}$ for all $t\geq 0$ by mathematical induction.
6.6.2.3 Uplink queue

To provide a general upper bound for the uplink queue, we assume that all relay channels are orthogonal. This implies every device $n\in\mathcal{N}$ can transmit and receive relayed data simultaneously. Fix $t$ and assume $K_n(t)\leq K_n^{(\max)}$ for this slot $t$. Then consider $K_n(t+1)$ from (6.25).
i) When $K_n(t)\geq\max_{m\in\mathcal{N}}\big[Q_m^{(\max)}\big]$, it follows from the closed-form solutions of (6.33) and (6.38) that $s_n^{(k)}(t)=0$ and $\alpha_{mn}(t)=0$ for all $m\in\mathcal{N}$, so $K_n(t+1)\leq K_n(t)\leq K_n^{(\max)}$.
ii) When $K_n(t)<\max_{m\in\mathcal{N}}\big[Q_m^{(\max)}\big]$, then this queue may receive data $s_n^{(k)}(t)$ and $\alpha_{mn}(t)$ for some $m\in\mathcal{N}$, so
$$K_n(t+1) \leq \max\Big[K_n(t)+s_n^{(k)}(t),\,0\Big] + \sum_{m\in\mathcal{N}}\alpha_{mn}(t) \leq K_n(t)+s_n^{(k)(\max)}+\sum_{m\in\mathcal{N}}\alpha_{mn}^{(\max)} \leq K_n^{(\max)}.$$
Thus, given $K_n(0)\leq K_n^{(\max)}$, we have $K_n(t)\leq K_n^{(\max)}$ for all $t\geq 0$ by mathematical induction.

Figure 6.6: Small network with independent channels and distributions

For comparison, the deterministic upper bounds of queues $Q_n^{(mw)}(t), K_n^{(mw)}(t)$, and $J_n^{(mw)}(t)$ under the max-weight algorithm, using a technique in [Nee10], are respectively given without proof:
$$Q_n^{(mw)(\max)} \triangleq \max_{t,\,f\in\mathcal{F}:\,a_n^{(f)}(t)\neq 0}\Bigg[\frac{Vr_n^{(f)}(t)}{a_n^{(f)}(t)}\Bigg] + a_n^{(\max)}$$
$$J_n^{(mw)(\max)} \triangleq Q_n^{(mw)(\max)} + s_n^{(j)(\max)}$$
$$K_n^{(mw)(\max)} \triangleq \max\Big[Q_n^{(mw)(\max)},\,\big\{J_m^{(mw)(\max)}\big\}_{m\in\mathcal{N}}\Big] + s_n^{(k)(\max)} + \sum_{m\in\mathcal{N}}\alpha_{mn}^{(\max)}.$$
It is easy to see that the deterministic bounds from the quadratic policy are smaller than the bounds from the max-weight algorithm.

6.7 Simulation

Simulation under the proposed quadratic policy and the standard max-weight policy is performed over the small network of Figure 6.6. This is the same network as considered in the simulation of the pure uplink problem (without relaying capabilities) from Figure 6.2. To compare results with and without relaying, we use the same assumptions on event probability, formats, and uplink channel conditions as Section 6.3.2. Now, each device has the other as its neighbor. We assume all uplink and relay channels are orthogonal.
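With orthogonal relay channels, each link's relay decision reduces to the separable per-link problem (6.38) over a finite rate set. Since the per-link objective is a parabola in $\alpha_{nm}$ centered at half the difference $[J_n(t)-K_m(t)]/2$, the minimizer over a finite feasible set is simply the feasible rate with the smallest objective value. A minimal sketch (illustrative only; `relay_amount` and its argument names are hypothetical):

```python
def relay_amount(j_n, k_m, feasible_rates):
    """Per-link relay allocation: choose alpha in the feasible rate set
    minimizing (j_n - alpha)^2 + (k_m + alpha)^2.  The objective is a
    parabola centered at (j_n - k_m)/2, so the minimizer is the feasible
    rate closest to that half-difference (and 0 when j_n <= k_m, assuming
    rate 0 is feasible, as in sets of the form {0, ..., alpha_best})."""
    g = lambda a: (j_n - a) ** 2 + (k_m + a) ** 2
    return min(feasible_rates, key=g)
```

Evaluating the objective directly over the feasible set sidesteps the two-candidate construction of the closed form while returning the same minimizer.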
Figure 6.7: Quality of information versus V under the quadratic (QD) and max-weight (MW) policies

For relay transmissions, the constraints are $\alpha_{12}(t)\in\{0,\ldots,\alpha_{12}^{(best)}(S(t))\}$ and $\alpha_{21}(t)\in\{0,\ldots,\alpha_{21}^{(best)}(S(t))\}$. Then set $s_n^{(k)(\max)} = s_n^{(j)(\max)} = 30$.

The simulation is performed according to the algorithm in Section 6.5. The time-averaged quality of information under the quadratic and max-weight policies is shown in Figure 6.7. From the plot, the values of $\bar{y}_0$ under both policies converge to optimality following the $O(1/V)$ performance bound. The optimal time-averaged quality of information in this relaying system is significantly higher than that of the pure uplink system (compare Figure 6.7 and Figure 6.3a). Indeed, in this example, the time-average utility increases by more than a factor of 3 when relaying is allowed. This gain is intuitive, because the additional relay capability allows device 2 to relay device 1's information, which has higher quality.

Figure 6.8: Averaged backlog in device 1's queues (input $Q_1(t)$, uplink $K_1(t)$, and relay $J_1(t)$) versus V under the quadratic (QD) and max-weight (MW) policies

Figures 6.8a-c reveal the queue lengths in the input, uplink, and relay queues of device 1 under the quadratic and max-weight policies. At the same $V$, the quadratic policy reduces queue lengths by a significant constant compared to the max-weight policy.
The plot also shows the growth of queue lengths with parameter $V$, which follows the $O(V)$ bound on queue length.

Figure 6.9 shows that the quadratic policy can achieve near optimality with significantly smaller total system backlog compared to the max-weight policy. This is a significant advantage, which in turn affects buffer size and packet delay.

Another, larger network, shown in Figure 6.10, is simulated to observe convergence of the proposed algorithm. The probability of event occurrence is 0.3. Channel distributions are configured as in Figure 6.10. The feasible set of formats is $\mathcal{F}=\{0,1,2,3\}$ with constant options given, for all $n\in\mathcal{N}$, by $(r_n^{(0)}, a_n^{(0)})=(0,0)$, $(r_n^{(1)}, a_n^{(1)})=(10,10)$, $(r_n^{(2)}, a_n^{(2)})=(15,50)$, $(r_n^{(3)}, a_n^{(3)})=(20,100)$ whenever there is an event. For $V=800$, the time-averaged quality of information is 25.00 after $10^6$ time slots, as shown in the upper plot of

Figure 6.9: The system obtains average quality of information while having average total queue length

Figure 6.10: Larger network with independent channels with distributions shown

Figure 6.11. The lower plot in Figure 6.11 illustrates the early period of the simulation to show the convergence time.

6.8 Chapter summary

We studied information-quality maximization in a system with uplink and two-hop relaying capabilities. Using Lyapunov optimization theory, we proposed a novel quadratic policy having a separability property, which leads to a distributed mechanism of format

Figure 6.11: Convergence of time-averaged quality of information.
The interval of the moving average is 500 slots.

selection. In comparison to the standard max-weight policy, our policy leads to an algorithm that reduces queue backlog by a significant constant. Further, it was shown that device-to-device relaying can significantly increase the total quality of information compared to a network that does not allow relaying.

Chapter 7: Staggered Algorithm for Non-smooth Optimization

In this chapter, a new algorithm is developed to solve a non-smooth convex optimization problem. It is inspired by the results from Chapters 2 and 3. The results in this chapter are based in part on [SN16a].

Non-smooth convex optimization constitutes a class of problems in machine learning and operations research. Examples include optimization of the hinge loss function [RDVC+04] (used in support vector machines) and 1-norm regularization in regression problems [Tib94]. A non-smooth function is continuous and non-differentiable [Nes04]. This lack of differentiability makes it challenging to design algorithms with fast convergence.

This chapter considers a stochastic optimization problem:
$$\min_{w\in\mathcal{W}} F(w) \tag{7.1}$$
where $\mathcal{W}$ is a closed and convex set and $F:\mathcal{W}\to\mathbb{R}$ is a continuous, convex (but not necessarily strongly convex), and non-smooth function. Function $F$ may not be known. The optimization proceeds by obtaining an unbiased stochastic subgradient of $F$ from an oracle. This model with an oracle has been used previously in the literature, such as [RSS12, SZ13]. Formally, let $g(w)$ be a subgradient of $F$ at $w$. Receiving $w\in\mathcal{W}$, the oracle gives an unbiased stochastic subgradient $\hat{g}(w)$ of $F$ at $w$ satisfying $\mathbb{E}[\hat{g}(w)\,|\,w] = g(w)$. Note that $g(w)$ may not be known, but $\hat{g}(w)$ is known. An algorithm proceeds by generating a sequence of $w_t$ vectors that are given to the oracle. The next vector $w_{t+1}$ is determined as a function of the history of oracle outputs. The history and the $\{w_t\}$ sequence are used to compute an estimate $\hat{w}$ of the optimal solution.
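A standard way such an oracle arises is by sampling one data point per query: uniform sampling makes the returned subgradient unbiased for the empirical loss. A minimal sketch for the regularized hinge loss mentioned above (illustrative only; scalar $w$ for brevity, and `make_hinge_oracle` is a hypothetical name):

```python
import random

def make_hinge_oracle(data, lam=0.1):
    """Unbiased stochastic subgradient oracle for the scalar objective
    F(w) = (1/n) * sum max(0, 1 - y*x*w) + lam*|w|.  Sampling (x, y)
    uniformly makes the returned value an unbiased subgradient of F."""
    def oracle(w):
        x, y = random.choice(data)                 # uniform => unbiased
        g_hinge = -y * x if 1 - y * x * w > 0 else 0.0
        g_reg = lam if w > 0 else (-lam if w < 0 else 0.0)
        return g_hinge + g_reg
    return oracle
```

The algorithm only ever sees these noisy subgradients, never $F$ or $g$ themselves, matching the oracle model above.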
It has been noted in [SZ13] that this model can be applied to a class of learning problems in [SSSSS09]. Given $\epsilon > 0$, an estimate $\hat{w}$ is an $O(\epsilon)$-approximation if
$$\mathbb{E}[F(\hat{w})] \leq \min_{w\in\mathcal{W}} F(w) + O(\epsilon).$$
Let $T$ be the number of unbiased stochastic subgradients obtained from the oracle. The convergence rate is determined by the rate at which the estimate converges to the true answer, as a function of $T$. For example, an algorithm with $O(1/\sqrt{T})$ convergence rate provides an estimate whose deviation from optimality decays to zero like:
$$\mathbb{E}[F(\hat{w})] \leq \min_{w\in\mathcal{W}} F(w) + O(1/\sqrt{T}).$$
In this chapter, a non-smooth convex function is considered. In the case when the function has a locally polyhedral structure¹, a staggered time average algorithm, based on a stochastic subgradient algorithm with constant step size, is proposed. The algorithm calculates $O(\epsilon)$-approximation estimates with an $O(1/T)$ convergence rate. For a general convex function, the convergence rate depends on the curvature near the minimum. To our knowledge, with the locally polyhedral structure, this is the first $O(1/T)$ convergence rate for a non-smooth convex function that requires neither strong convexity nor smoothing.

¹The locally polyhedral structure is also called weak sharp minima in previous literature [BF93]. It is also a generalization of the sharp minimum in [Pol87].

The chapter is organized as follows. Section 7.1 provides notation, the staggered time average algorithm, and preliminary results. Sections 7.2 and 7.3 prove, respectively, the results under the locally polyhedral and the general convex structures. Section 7.4 shows fast convergence for deterministic problems. Experiments are performed in Section 7.5. Section 7.6 concludes the chapter.

Related Works

Prior work in [BM08, SZ13] develops algorithms with $O(1/\sqrt{T})$ convergence rate when the function is non-smooth.
A smoothing method in [OG12] improves the convergence rate to $O(1/T)$ when $F$ is a linear combination of a smooth convex function and a non-smooth convex function with special structure. Related improvements can be shown when the non-smooth function is strongly convex [RSS12, SZ13, HK14]. The suffix algorithm in [RSS12] is shown to have $O(1/T)$ convergence rate. The work in [SZ13] shows that using a decreasing step size leads to $O(\log(T)/T)$ and $O(1/T)$ convergence rates for the last-round solution and a solution calculated from polynomial-decay averaging, respectively. An algorithm achieving the optimal $O(1/T)$ convergence rate was then developed in [HK14]. Note that all previous results that achieve the $O(1/T)$ convergence rate rely on either restrictive strong convexity or a special structure of $F$. A recent algorithm in [XLY16] achieves linear convergence speed under a high-probability assumption. It has remained an open problem whether the $O(1/T)$ convergence rate can be achieved in expectation for a non-smooth and non-strongly convex function.

7.1 Preliminaries

The closed convex set $\mathcal{W}$ is a subset of $\mathbb{R}^N$, for some positive integer $N$, with Euclidean norm $\|\cdot\|$ and inner product $\langle\cdot,\cdot\rangle$. Function $F$ is assumed to be convex (possibly non-smooth) over $\mathcal{W}$ and satisfies the following assumption. Define $F^* \triangleq \inf_{w\in\mathcal{W}} F(w)$.

Assumption 12: The minimum of $F$ is achievable in $\mathcal{W}$, and the set of optimal solutions $\mathcal{W}^* = \operatorname{Arginf}_{w\in\mathcal{W}} F(w) = \{w^*\in\mathcal{W} : F(w^*) = F^*\}$ is closed.

A subgradient of $F$ at $w\in\mathcal{W}$ is denoted by $g(w)$ and satisfies, for any $w'\in\mathcal{W}$:
$$F(w') \geq F(w) + \langle g(w), w'-w\rangle. \tag{7.2}$$
An unbiased stochastic subgradient at $w$ is denoted by $\hat{g}(w)$, satisfying $\mathbb{E}[\hat{g}(w)\,|\,w] = g(w)$.

Assumption 13: There exists a constant $G<\infty$ such that
$$\|\hat{g}(w)\| \leq G \quad \forall w\in\mathcal{W}.$$
Assumption 13 is also used in previous literature.
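The algorithms below also use the Euclidean projection $\Pi_{\mathcal{W}}$ onto the closed convex set $\mathcal{W}$. For common simple sets the projection is available in closed form; a minimal sketch (illustrative only; these helper names are hypothetical):

```python
def project_box(w, lo, hi):
    """Euclidean projection onto the box [lo, hi]^N: clip coordinatewise."""
    return [min(max(x, lo), hi) for x in w]

def project_ball(w, radius):
    """Euclidean projection onto the L2 ball of the given radius:
    rescale the point if it lies outside, otherwise leave it unchanged."""
    norm = sum(x * x for x in w) ** 0.5
    if norm <= radius:
        return list(w)
    return [x * radius / norm for x in w]
```

Both maps are non-expansive, which is the only property of $\Pi_{\mathcal{W}}$ used in the analysis below.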
A stochastic subgradient algorithm with a positive constant step size $\eta > 0$ initializes $w_0\in\mathcal{W}$ and proceeds repeatedly as
$$w_{t+1} = \Pi_{\mathcal{W}}\big[w_t - \eta\,\hat{g}(w_t)\big] \quad \forall t\in\{0,1,2,\ldots\}, \tag{7.3}$$
where $\Pi_{\mathcal{W}}$ denotes the projection onto $\mathcal{W}$ and $w_t$ denotes the value of $w$ at round $t$.

7.1.1 Staggered time averages

The staggered time average algorithm is summarized in Algorithm 11. Define:
$$\bar{w}_{2^k-1}^T = \frac{1}{T}\sum_{t=0}^{T-1} w_{2^k-1+t}.$$
This average can be computed on the fly as shown in Algorithm 11.

Initialize: $w_0\in\mathcal{W}$, $\eta > 0$
for $t\in\{0,1,2,\ldots\}$ do
  // Staggered time averages
  if $t = 2^k-1$ for some $k\in\{0,1,2,\ldots\}$ then
    $\bar{w}_{2^k-1}^1 \leftarrow w_t$
  else
    $\bar{w}_{2^k-1}^{t-2^k+2} \leftarrow \frac{t-2^k+1}{t-2^k+2}\,\bar{w}_{2^k-1}^{t-2^k+1} + \frac{1}{t-2^k+2}\,w_t$
  end if
  // Stochastic subgradient
  $w_{t+1} \leftarrow \Pi_{\mathcal{W}}\big[w_t - \eta\,\hat{g}(w_t)\big]$
end for
Algorithm 11: Staggered Time Averages

Algorithm 11 implements the subgradient algorithm (7.3) with constant step size in each round. The staggered time averages reset the calculation of estimates at every $t = 2^k-1$ for $k\in\{0,1,2,\ldots\}$. Specifically, for every $k\in\{0,1,2,\ldots\}$, the algorithm generates estimates $\bar{w}_{2^k-1}^T$ for $T\in\{1,\ldots,2^k\}$. To analyze Algorithm 11, the properties of the subgradient algorithm are proven in this section. The staggered time averages are then analyzed in Sections 7.2 and 7.3. Note that Algorithm 11 is different from the suffix averaging in [RSS12], which uses a decreasing step size.

7.1.2 Basic results

We consider algorithm (7.3) with a positive constant step size $\eta$. The initial $w_0\in\mathcal{W}$ is any constant vector. For every $w_t\in\mathcal{W}$, define the closest optimal solution to $w_t$ as
$$w_t^* = \operatorname{arginf}_{w^*\in\mathcal{W}^*}\|w^*-w_t\|.$$
Under Assumption 12, this $w_t^*$ is unique because of the convexity and closedness of $\mathcal{W}^*$. The following lemma modifies a well-known manipulation.

Lemma 37: Suppose Assumptions 12 and 13 hold.
It holds for any $t\in\{0,1,2,\ldots\}$ that
$$\mathbb{E}\Big[\|w_{t+1}-w_{t+1}^*\|^2\,\Big|\,w_t\Big] \leq \|w_t-w_t^*\|^2 + \eta^2G^2 + 2\eta\big[F^*-F(w_t)\big].$$

Proof: For any $t$, by the definition of $w_{t+1}^*$ as the minimizer of $\|w^*-w_{t+1}\|^2$ over all $w^*\in\mathcal{W}^*$, we have:
$$\|w_{t+1}-w_{t+1}^*\|^2 \leq \|w_{t+1}-w_t^*\|^2 = \big\|\Pi_{\mathcal{W}}[w_t-\eta\hat{g}(w_t)]-w_t^*\big\|^2 \leq \|w_t-\eta\hat{g}(w_t)-w_t^*\|^2,$$
where the final inequality holds by the non-expansive property of the projection onto the convex set $\mathcal{W}$ [BNO03]. Expanding the right-hand side gives:
$$\|w_{t+1}-w_{t+1}^*\|^2 \leq \|w_t-w_t^*\|^2 + \eta^2\|\hat{g}(w_t)\|^2 - 2\eta\langle\hat{g}(w_t), w_t-w_t^*\rangle \leq \|w_t-w_t^*\|^2 + \eta^2G^2 - 2\eta\langle\hat{g}(w_t), w_t-w_t^*\rangle.$$
Taking a conditional expectation given $w_t$ yields
$$\mathbb{E}\Big[\|w_{t+1}-w_{t+1}^*\|^2\,\Big|\,w_t\Big] \leq \|w_t-w_t^*\|^2 + \eta^2G^2 - 2\eta\langle g(w_t), w_t-w_t^*\rangle.$$
Using the subgradient property in (7.2) and the fact that $F(w_t^*)=F^*$ proves the lemma. □

Note that Lemma 37 uses a projection technique similar to the standard analysis of the subgradient projection algorithm (as in [Zin03], [Nes04], or [BNO03]). The standard approach compares the current iterate to a fixed optimal point $w^*$. Lemma 37 compares to the closest point in the optimal set. This is a simple but important distinction that is crucial in later sections for improved convergence time results.

While Lemma 37 is stated in a form useful for the analysis of later sections, it can readily be used to establish the standard $O(1/\epsilon^2)$ result for convex functions (see also [Zin03, Nes04, BNO03]). Define the average of $T$ consecutive solutions from $t_0$ as
$$\bar{w}_{t_0}^T \triangleq \frac{1}{T}\sum_{t=t_0}^{t_0+T-1} w_t. \tag{7.4}$$
Taking an expectation of the result in Lemma 37 gives
$$\mathbb{E}[F(w_t)] - F^* \leq \frac{\eta G^2}{2} + \frac{1}{2\eta}\,\mathbb{E}\Big[\|w_t-w_t^*\|^2 - \|w_{t+1}-w_{t+1}^*\|^2\Big].$$
Summing from $t_0$ to $t_0+T-1$ and dividing by $T$ gives
$$\frac{1}{T}\sum_{t=t_0}^{t_0+T-1}\mathbb{E}[F(w_t)] - F^* \leq \frac{\eta G^2}{2} + \frac{1}{2\eta T}\,\mathbb{E}\Big[\|w_{t_0}-w_{t_0}^*\|^2 - \|w_{t_0+T}-w_{t_0+T}^*\|^2\Big].$$
Using Jensen's inequality and the convexity of $F$, definition (7.4), and the non-negativity of $\|w_{t_0+T}-w_{t_0+T}^*\|^2$ yields:
$$\mathbb{E}\big[F(\bar{w}_{t_0}^T)\big] - F^* \leq \frac{\eta G^2}{2} + \frac{1}{2\eta T}\,\mathbb{E}\Big[\|w_{t_0}-w_{t_0}^*\|^2\Big], \tag{7.5}$$
for any $t_0\in\{0,1,2,\ldots\}$ and any positive integer $T$.
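Algorithm 11 above can be sketched compactly in Python (a minimal scalar illustration, not code from the thesis; the function and argument names are hypothetical, and the test `(t+1) & t == 0` detects the reset rounds $t = 2^k-1$):

```python
def staggered_time_averages(oracle, w0, eta, T, project=lambda w: w):
    """Staggered time averages over the constant-step-size projected
    stochastic subgradient iteration (7.3): the running average restarts
    at rounds t = 2^k - 1, so later averages exclude transient iterates.
    Returns the estimate from the final (possibly incomplete) segment."""
    w, w_bar, count = w0, w0, 1
    for t in range(T):
        if (t + 1) & t == 0:           # t = 2^k - 1: reset the average
            w_bar, count = w, 1
        else:                          # on-the-fly running average of w_t
            count += 1
            w_bar += (w - w_bar) / count
        w = project(w - eta * oracle(w))
    return w_bar
```

For example, with the exact subgradient oracle of $F(w)=|w|$ the iterates walk from $w_0$ toward 0 and the post-reset averages track the later, near-optimal iterates rather than the full trajectory.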
Equation (7.5) suggests that, to achieve an $O(\epsilon)$-approximation estimate, one can choose step size $\eta = \Theta(\epsilon)$, number of rounds $T = \Theta(1/\epsilon^2)$, and define $\hat{w}$ as the average of the $w_t$ values over the first $T$ rounds. This is equivalent to an $O(1/\sqrt{T})$ convergence rate. Fortunately, this convergence rate can be improved by starting the average at an appropriate time depending on the structure of the function $F$. These structures are considered in Sections 7.2 and 7.3. We first prove several useful results used in those sections.

7.1.3 Concentration bound

These results are used to upper bound the term $\mathbb{E}\big[\|w_{t_0}-w_{t_0}^*\|^2\big]$ in (7.5). Define $K_t \triangleq \|w_t-w_t^*\|$ for all $t\in\{0,1,2,\ldots\}$.

Lemma 38: Suppose Assumption 13 holds. It holds for every $t\in\{0,1,2,\ldots\}$ that
$$|K_{t+1}-K_t| \leq 2\eta G \tag{7.6}$$
$$\mathbb{E}[K_{t+1}-K_t\,|\,w_t] \leq 2\eta G. \tag{7.7}$$

Proof: The first part is proven in two cases. i) If $K_{t+1}\geq K_t$, the definition of $w_{t+1}$ in (7.3) and the non-expansive projection imply
$$|K_{t+1}-K_t| = K_{t+1}-K_t \leq \|w_{t+1}-w_t^*\| - K_t \leq \|w_t-\eta\hat{g}(w_t)-w_t^*\| - K_t \leq \|w_t-w_t^*\| + \eta\|\hat{g}(w_t)\| - K_t \leq \eta G \leq 2\eta G.$$
ii) If $K_{t+1}<K_t$, then, since $w_t^*$ is the closest optimal point to $w_t$,
$$|K_{t+1}-K_t| = K_t - K_{t+1} \leq \|w_t-w_{t+1}^*\| - K_{t+1} \leq \|w_t-w_{t+1}\| + \|w_{t+1}-w_{t+1}^*\| - K_{t+1} \leq 2\|w_t-w_{t+1}\| \leq 2\eta\|\hat{g}(w_t)\| \leq 2\eta G,$$
where the last steps use the non-expansive projection, $\|w_{t+1}-w_t\| \leq \eta\|\hat{g}(w_t)\|$. These two cases prove the first part. The second part follows by taking a conditional expectation given $w_t$ of $K_{t+1}-K_t \leq |K_{t+1}-K_t|$. □

The concentration bound reinterprets a lemma in [Nee14].

Lemma 39: Suppose Assumption 13 holds and there exist a positive real $\delta$ and a $\theta\in\mathbb{R}$ such that for all $t\in\{0,1,2,\ldots\}$:
$$\mathbb{E}[K_{t+1}-K_t\,|\,K_t] \leq \begin{cases} 2\eta G, & \text{if } K_t < \theta\\ -\delta, & \text{if } K_t \geq \theta.\end{cases} \tag{7.8}$$
Assume $K_{t_0} = k_{t_0}$ (with probability 1) for some $k_{t_0}\in\mathbb{R}$. Then the following holds for all $t\in\{t_0, t_0+1, t_0+2, \ldots\}$:
$$\mathbb{E}\big[e^{rK_t}\big] \leq D + \big(e^{rk_{t_0}}-D\big)\rho^{t-t_0}, \tag{7.9}$$
Then following holds for all t2ft 0 ;t 0 + 1;t 0 + 2;:::g E e rKt D + e rkt 0 D tt 0 ; (7.9) 213 Locally polyhedron General convex Figure 7.1: Structures of function F where constants r;; and D are: r, 3 12 2 G 2 + 2G ; ,1 r 2 ; D, e 2Gr e r 1 : It can be shown that (7.6) and (7.8) together imply 0 < 2G, and hence it can be shown that 0<< 1. Constants and in Lemma 39 depend on a structure of function F . We then look at the rst structure. 7.2 Locally polyhedral structure In this section, functionF is assumed to have a locally polyhedral structure, which is illus- trated in Figure 7.1. This structure is generalized from [HN11]. It is assumed throughout that Assumptions 12 and 13 still hold. Note that, in machine learning, this F can be the hinge loss function with 1-norm regularization. Assumption 14 (Locally polyhedral assumption) There exists a constant L p > 0 such that for everyw t 2W the following holds F (w t )F L p kw t w t k: (7.10) 214 The subscript \p" in L p represents \polyhedral." 7.2.1 Drift and transient time Using this locally polyhedral structure, sequencefw t g 1 t=0 generated by algorithm (7.3) has the following drift property. Dene B p , max L p 2 ; G 2 L p : Lemma 40 Suppose Assumptions 12, 13, 14 hold. For any w t 2W andkw t w t k B p , the following holds E w t+1 w t+1 jw t kw t w t k L p 2 : (7.11) Proof: For anyw t 2W, if condition 2 G 2 + 2[F F (w t )]L p kw t w t k + 2 L 2 p 4 (7.12) is true, then the result in Lemma 37 implies that E h w t+1 w t+1 2 w t i kw t w t k 2 L p kw t w t k + 2 L 2 p 4 = kw t w t k L p 2 2 : 215 Applying Jensen's inequality on the left-hand-side gives E w t+1 w t+1 jw t 2 kw t w t k L p 2 2 : Whenkw t w t kB p L p =2, inequality (7.11) holds. It remains to show that condition (7.12) must hold wheneverkw t w t k B p . Starting with the left-hand-side of (7.12) we have, for everyw t 2W: 2 G 2 + 2[F F (w t )] 2 G 2 2L p kw t w t k where the inequality follows the locally polyhedral structure (7.10). 
By the denition of B p , it holds thatkw t w t k G 2 =L p and w t 2W. Substituting this into the above gives the result. This lemma implies that, when the distance betweenw t andw t is at leastB p , thenw t+1 is expected to get closer to w t+1 . This phenomenon suggests that w t will concentrate aroundW after some transient time, if it is not inside the set already. Let constants U p ;r p ; p ;D p be U p , 2(D p + 1) r 2 p ; (7.13) r p , 3L p 24G 2 + 2L p G ; (7.14) p ,1 3L 2 p 4(24G 2 + 2L p G) ; (7.15) D p , e 2Grp p e rpBp 1 p : (7.16) 216 Given anyw 0 2W, dene T p (), r p kw 0 w 0 k log(1= p ) ; (7.17) where constants r p and p are dened in (7.14) and (7.15). This T p () can be called a transient time for the locally polyhedral structure, since a useful bound holds after this time. Lemma 41 Suppose Assumptions 12, 13, 14 hold. When tT p (), the following holds E h kw t w t k 2 i 2 U p ; where constant U p is dened in (7.13). Proof: When K t =kw t w t kB p , Lemma 40 givesE [K t+1 K t jK t ]L p =2 as in (7.11). Therefore, the constants and in Lemma 39 can be set as = L p =2 and = B p . When t 0 = 0, we have k t 0 = K 0 =kw 0 w 0 k (with probability 1). From (7.9), it holds for all t2f0; 1; 2;:::g that E h e rpK t i D p + e rpk t 0 D p (t0) p D p +e rpK 0 t p ; (7.18) where constants r p ; p ;D p are dened in (7.14){(7.16) respectively. We then show that e rpK 0 t p 1 8tT p (): (7.19) 217 Inequality e rpK 0 t p 1 is equivalent to t rpK 0 log(1=p) by arithmetic and the fact that log(1= p ) > 0. From the denition of T p () in (7.17), it holds that T p () rpK 0 log(1=p) , and the results (7.19) follows. From (7.19), inequality (7.18) becomes E h e rpK t i D p + 1 8tT p (): The Cherno bounds (see, for example, [Ros96]) implies for any m> 0 that PfK t mge rpm E h e rpK t i e rpm (D p + 1) 8tT p (): UsingE K 2 t = 2 R 1 0 mPfK t mgdm and the above bound proves the lemma. 
The denition of T p () in (7.17) implies that the transient time is O(1=) under the locally polyhedral structure. Then, Assumption 41 implies that E h kw t w t k 2 i 2 U p every round t after the transient time. 7.2.2 Convergence rate We are now ready to prove the convergence rate of the staggered time averages in Algo- rithm 11 under the locally polyhedral structure. Theorem 20 Suppose Assumptions 12, 13, 14 hold. It holds for any tT p () and any positive integer T that E F (w T t ) F G 2 2 + U p 2T ; (7.20) 218 where constant U p is dened in (7.13). Proof: The theorem follows from (7.5) and Lemma 41 that E F (w T t ) F G 2 2 + 2 U p 2T 8tT p (): After the transient time, Theorem 20 implies that estimates, as the averages, converge as O( +=T ). To obtain an O()-approximation estimate, the step size must be set to (). Recall that the transient time (7.17) is O(1=) and that Algorithm 11 resets the averages at round 2 k 1 for k2f0; 1; 2;:::g. Let ^ k = arginf k2f0;1;2;:::g 2 ^ k 1T p () be the rst reset after the transient time. The exponential increasing implies that 2 ^ k 1 2T p (). Therefore, the total time to obtain the estimate is O(1=), and the convergence rate is O(1=T ). Note that the staggered time average algorithm is proposed, because kw 0 w 0 k in (7.17) can not be upper bounded ifW is unbounded. Also, even though the convergence rate depends only on the step size, performing the averages also helps obtaining more accurate estimates, as shown in Section 7.5. 7.3 General convex function In this section, function F is allowed to be a general convex function, possibly one that does not satisfy the locally polyhedral structure (Assumption 14). A general convex function is illustrated in Figure 7.1. It is assumed throughout that Assumptions 12 and 13 still hold. 219 DeneA(S) =fw t 2W :kw t w t k =Sg, and dene S max as the supremum value of S > 0 for whichA(S) is nonempty (S max is possibly innity). Assume that S max > 0. 
Convexity of the setW implies thatA(S) is nonempty for all S2 (0;S max ). For each S2 (0;S max ) dene: (S) = inf wt2W:kwtw t k=S jF (w t )F j kw t w t k : (7.21) Lemma 42 Suppose Assumption 12 holds. If F is convex andW is closed, then for all S2 (0;S max ): i) (S)> 0 ii) Wheneverw t 2W andkw t w t kS, it holds that F (w t )F (S)kw t w t k: (7.22) Proof: The rst part is proven by contradiction. Dene a nonempty set A =fw t 2W :kw t w t k =Sg: Note thatA is compact. Suppose (S) = 0. FunctionjF (w t )F j=S is continuous. The inmum of a continuous function over the compact set is achieved by a point in the set. Thus, there is a pointy2A such thatjF (y)F j=S = 0. That is F (y) =F , and y2W . Since y2A, it also satises inf y 2W kyy k = S and y = 2W , which is a contradiction. The second part is proven using the convexity of F . Let z2W be a vector such thatkzz kS wherez = arginf ^ z2W kz ^ zk. We want to show that F (z)F 220 (S)kzz k. The convexity of the setW implies that the line segment betweenz and z is insideW. The convexity ofF overW implies thatF is convex when it is restricted to this line segment. Dene y as a point on this line segment such that kyy k = S where y = arginf ^ y2W ky ^ yk. Then both y2W and y2A. The convexity of F over the line segment implies that F (z)F kzz k F (y)F kyy k (S); where the last inequality uses (7.21). The dierence between Assumptions 14 and 42 is that the bound (7.22) only holds whenkw t w t k S. The choice of S for a particular function F aects the transient time and convergence of achieving O()-approximation estimates. This eect does not occur with Assumption 14. 7.3.1 Drift and transient time Using Lemma 42, the sequencefw t g 1 t=0 generated by Algorithm (7.3) has the following property. Dene B g (;S), max (S) 2 ;S; G 2 (S) : (7.23) Lemma 43 Suppose Assumptions 12 and 13 hold. Function F is convex with (S) de- ned in (7.21) for all S2 (0;S max ). 
For any $w_t \in \mathcal{W}$ such that $\|w_t - w_t^*\| \ge B_g(\epsilon, S)$, the following holds:

$\mathbb{E}\big[\|w_{t+1} - w_{t+1}^*\| \,\big|\, w_t\big] \le \|w_t - w_t^*\| - \frac{\epsilon \gamma(S)}{2}$. (7.24)

Proof: For any $w_t \in \mathcal{W}$, if condition

$\epsilon^2 G^2 + 2\epsilon [F^* - F(w_t)] \le -\epsilon \gamma(S) \|w_t - w_t^*\| + \frac{\epsilon^2 \gamma(S)^2}{4}$ (7.25)

is true, then the result in Lemma 37 implies that

$\mathbb{E}\big[\|w_{t+1} - w_{t+1}^*\|^2 \,\big|\, w_t\big] \le \|w_t - w_t^*\|^2 - \epsilon \gamma(S) \|w_t - w_t^*\| + \frac{\epsilon^2 \gamma(S)^2}{4} = \left(\|w_t - w_t^*\| - \frac{\epsilon \gamma(S)}{2}\right)^2$.

Applying Jensen's inequality on the left-hand side gives

$\mathbb{E}\big[\|w_{t+1} - w_{t+1}^*\| \,\big|\, w_t\big]^2 \le \left(\|w_t - w_t^*\| - \frac{\epsilon \gamma(S)}{2}\right)^2$.

When $\|w_t - w_t^*\| \ge B_g(\epsilon, S) \ge \epsilon \gamma(S)/2$, inequality (7.24) holds.

It remains to show condition (7.25) must hold whenever $\|w_t - w_t^*\| \ge B_g(\epsilon, S)$. Starting with the left-hand side of (7.25), we have that

$\epsilon^2 G^2 + 2\epsilon [F^* - F(w_t)] \le \epsilon^2 G^2 - 2\epsilon \gamma(S) \|w_t - w_t^*\|$,

where the inequality follows from (7.22), as $B_g(\epsilon, S) \ge S$. By the definition of $B_g(\epsilon, S)$, it holds that $\|w_t - w_t^*\| \ge \epsilon G^2/\gamma(S)$. Substituting this into the above inequality gives the result.

This result is similar to Lemma 40 except that $B_g(\epsilon, S)$ depends on both $\epsilon$ and $S$, unlike $B_p$ in the locally polyhedral case.

Let constants $U_g(\epsilon, S)$, $r_g(S)$, $\rho_g(S)$, $D_g(\epsilon, S)$ be

$U_g(\epsilon, S) \triangleq \frac{2[D_g(\epsilon, S) + 1]}{r_g(S)^2}$; (7.26)

$r_g(S) \triangleq \frac{3\gamma(S)}{24 G^2 + 2\gamma(S) G}$; (7.27)

$\rho_g(S) \triangleq 1 - \frac{3\gamma(S)^2}{4[24 G^2 + 2\gamma(S) G]}$; (7.28)

$D_g(\epsilon, S) \triangleq \frac{\big[e^{2 G r_g(S)} - \rho_g(S)\big]\, e^{r_g(S) B_g(\epsilon, S)/\epsilon}}{1 - \rho_g(S)}$. (7.29)

Define the transient time for a general convex function as

$T_g(\epsilon, S) \triangleq \left\lceil \frac{r_g(S) \|w_0 - w_0^*\|}{\epsilon \log(1/\rho_g(S))} \right\rceil$, (7.30)

where $r_g(S)$ and $\rho_g(S)$ are defined in (7.27) and (7.28).

Lemma 44 Suppose Assumptions 12 and 13 hold. Function $F$ is convex with $\gamma(S)$ defined in (7.21) for all $S \in (0, S_{\max})$. When $t \ge T_g(\epsilon, S)$, the following holds:

$\mathbb{E}\big[\|w_t - w_t^*\|^2\big] \le \epsilon^2 U_g(\epsilon, S)$,

where constant $U_g(\epsilon, S)$ is defined in (7.26).

Proof: From Lemma 39, the constants are $\beta = \epsilon \gamma(S)/2$ and $\sigma = B_g(\epsilon, S)$, where (7.8) holds due to Lemma 43. When $t_0 = 0$, we have $k_{t_0} = K_0 = \|w_0 - w_0^*\|$ (with probability 1). From (7.9), it holds for all $t \in \{0, 1, 2, \ldots\}$ that

$\mathbb{E}\big[e^{r_g(S) K_t/\epsilon}\big] \le D_g(\epsilon, S) + \big[e^{r_g(S) k_{t_0}/\epsilon} - D_g(\epsilon, S)\big] \rho_g(S)^{(t - t_0)} \le D_g(\epsilon, S) + e^{r_g(S) K_0/\epsilon} \rho_g(S)^t$, (7.31)

where constants $r_g(S)$, $\rho_g(S)$, $D_g(\epsilon, S)$ are defined in (7.27)–(7.29) respectively.
We then show that

$e^{r_g(S) K_0/\epsilon} \rho_g(S)^t \le 1 \quad \forall t \ge T_g(\epsilon, S)$. (7.32)

Inequality $e^{r_g(S) K_0/\epsilon} \rho_g(S)^t \le 1$ is equivalent to $t \ge \frac{r_g(S) K_0}{\epsilon \log(1/\rho_g(S))}$. From the definition of $T_g(\epsilon, S)$ in (7.30), it holds that $T_g(\epsilon, S) \ge \frac{r_g(S) K_0}{\epsilon \log(1/\rho_g(S))}$, and the result (7.32) follows.

From (7.32), inequality (7.31) becomes

$\mathbb{E}\big[e^{r_g(S) K_t/\epsilon}\big] \le D_g(\epsilon, S) + 1 \quad \forall t \ge T_g(\epsilon, S)$.

The Chernoff bound implies for any $m > 0$ that

$\Pr\{K_t \ge m\} \le e^{-r_g(S) m/\epsilon} [D_g(\epsilon, S) + 1] \quad \forall t \ge T_g(\epsilon, S)$.

Using $\mathbb{E}[K_t^2] = 2 \int_0^{\infty} m \Pr\{K_t \ge m\}\, dm$ and the above bound proves the lemma.

The definition of $T_g(\epsilon, S)$ in (7.30) implies that the transient time for a general convex function depends on the step size and the curvature near the unique minimum. Then, Lemma 44 implies that $\mathbb{E}\big[\|w_t - w_t^*\|^2\big] \le \epsilon^2 U_g(\epsilon, S)$ for every round $t$ after the transient time.

7.3.2 Convergence rate

We are now ready to prove the convergence rate of the staggered time averages in Algorithm 11 under a general convex function.

Theorem 21 Suppose Assumptions 12 and 13 hold. Function $F$ is convex with $\gamma(S)$ defined in (7.21) for all $S \in (0, S_{\max})$. It holds for any $t \ge T_g(\epsilon, S)$ and any positive integer $T$ that

$\mathbb{E}\big[F(\overline{w}_t^T)\big] - F^* \le \frac{\epsilon G^2}{2} + \frac{\epsilon U_g(\epsilon, S)}{2T}$, (7.33)

where constant $U_g(\epsilon, S)$ is defined in (7.26).

Proof: The theorem follows from (7.5) and Lemma 44 that

$\mathbb{E}\big[F(\overline{w}_t^T)\big] - F^* \le \frac{\epsilon G^2}{2} + \frac{\epsilon^2 U_g(\epsilon, S)}{2\epsilon T} \quad \forall t \ge T_g(\epsilon, S)$.

The transient time (7.30) and Theorem 21 can be interpreted as a class of convergence bounds that can be optimized over any $S \in (0, S_{\max})$. Indeed, the value of $S$ near the minimum of $F$ plays a crucial role in (7.23), which affects much of the analysis in this section.

7.4 Fast convergence for deterministic problems

This section revisits problems with the locally polyhedral structure, so that Assumptions 12, 13, 14 hold. However, it considers a deterministic scenario where the oracle returns the exact subgradient, rather than an unbiased stochastic subgradient. It is shown that a variation on the basic algorithm that uses a variable step size can significantly improve the convergence rate. Specifically, fix $\epsilon > 0$.
The basic algorithm of Section 7.2 produces an $O(\epsilon)$-approximation within $O(1/\epsilon)$ rounds. The modified algorithm of this section achieves the same $O(\epsilon)$-approximation within only $O(\log(1/\epsilon))$ rounds. In particular, this is faster than the lower bound $\Omega(1/\sqrt{T})$ for a non-smooth function with Lipschitz continuity in [Nes04]. This does not contradict the Nesterov result in [Nes04], because that result shows the existence of a function with $\Omega(1/\sqrt{T})$ convergence rate, while the locally polyhedral structure does not fall into the class of such functions. Interestingly, the algorithm in this section is also faster than other algorithms with $O(1/T^2)$ convergence rates [Nes04, Tse08].

Assume the function $F(w)$ is Lipschitz continuous over $w \in \mathcal{W}$ with Lipschitz constant $H > 0$, so that:

$|F(w) - F(w')| \le H \|w - w'\| \quad \forall w, w' \in \mathcal{W}$. (7.34)

Assume there is a known positive value $Z < \infty$ such that $\|w_0 - w_0^*\| \le \epsilon Z$. Fix $\epsilon > 0$, and fix $M$ as any positive integer. The idea is to run the algorithm over successive frames. Label the frames $i \in \{1, 2, 3, \ldots, M\}$. Let $w^{[i]}$ be the initial vector in $\mathcal{W}$ at the start of frame $i$. Define $w^{[1]} = w_0$. Define constants $U_p$, $r_p$, $\rho_p$, $D_p$ as in (7.13)–(7.16). Define $\theta = \max\big(\sqrt{U_p}, Z\big)$, and define the frame size $T$ as:

$T = \left\lceil \frac{2 r_p \theta}{\log(1/\rho_p)} \right\rceil$.

The algorithm for each frame $i \in \{1, 2, 3, \ldots\}$ is:

- Define the step size for frame $i$ as $\epsilon^{[i]} = \epsilon 2^{-i}$.
- Run Algorithm (7.3) using step size $\epsilon^{[i]}$ over $T$ rounds, using initial vector $w^{[i]}$.
- Define $w^{[i+1]}$ as the $w_t$ vector computed in the last round of frame $i$.

Notice that the completion of $M$ frames requires $MT = O(M)$ rounds. The vector computed in the last round of the last frame is defined as $w^{[M+1]}$. The next theorem shows that this vector is indeed an $O(\epsilon 2^{-M})$-approximation.

Theorem 22 In the deterministic setting and when Assumptions 12, 13, 14 hold, the final vector $w^{[M+1]}$ satisfies:

$\|w^{[M+1]} - w^{[M+1]*}\| \le \theta \epsilon 2^{-M}$, (7.35)

$F(w^{[M+1]}) - F^* \le H \theta \epsilon 2^{-M}$. (7.36)

Proof: The proof is by induction on the frames $i \in \{1, 2, \ldots, M\}$.
Assume the following holds on a given $i \in \{1, 2, \ldots, M\}$:

$\|w^{[i]} - w^{[i]*}\| \le \theta \epsilon 2^{-(i-1)}$. (7.37)

This holds by assumption on the first frame $i = 1$. We now show (7.37) holds for $i + 1$. The goal is to use Lemma 41 with initial condition $w^{[i]}$. Since the step size is $\epsilon^{[i]}$ for this frame, the value $T_p(\epsilon^{[i]})$ defined in (7.17) satisfies:

$T_p(\epsilon^{[i]}) = \left\lceil \frac{r_p \|w^{[i]} - w^{[i]*}\|}{\epsilon^{[i]} \log(1/\rho_p)} \right\rceil \le \left\lceil \frac{r_p \theta \epsilon 2^{-(i-1)}}{\epsilon^{[i]} \log(1/\rho_p)} \right\rceil = T$,

where the inequality holds by the induction assumption (7.37), and the last equality holds by definition of $\epsilon^{[i]}$. Recall that $w^{[i+1]}$ is defined as the final $w_t$ value after $T$ rounds of the frame. It follows by Lemma 41 that:

$\mathbb{E}\big[\|w^{[i+1]} - w^{[i+1]*}\|^2\big] \le (\epsilon^{[i]})^2 U_p \le (\epsilon^{[i]})^2 \theta^2$.

On the other hand, this deterministic setting produces a deterministic sequence, so that all expectations can be removed:

$\|w^{[i+1]} - w^{[i+1]*}\|^2 \le (\epsilon^{[i]} \theta)^2$.

Taking a square root and using the definition of $\epsilon^{[i]}$ proves:

$\|w^{[i+1]} - w^{[i+1]*}\| \le \theta \epsilon 2^{-i}$.

This completes the induction, so that (7.37) holds for all $i \in \{1, 2, \ldots, M+1\}$. Substituting $i = M + 1$ into (7.37) proves (7.35). The inequality (7.36) follows from (7.35) and the Lipschitz property (7.34).

7.5 Experiments

In this section, Algorithm 11 ("Staggered") is compared to the polynomial-decay averaging ("Polynomial") in [SZ13]. We also propose another heuristic algorithm ("Heuristic"), which has a promising convergence rate. This heuristic algorithm replaces $\epsilon$ in Algorithm 11 with

$\epsilon_t = \max[\epsilon, c/(t+1)] \quad \forall t \in \{0, 1, 2, \ldots\}$,

where $c$ is some real-valued positive constant. This modification does not change the convergence rates in Section 7.2 and Section 7.3, because it only adds $O(1/\epsilon)$ rounds to the previous bounds.

For the purpose of comparison, Algorithm 11 uses $\epsilon = 10^{-4}$. However, higher accuracy can be achieved by a smaller step size. The heuristic algorithm sets $c = 1$. The polynomial-decay averaging algorithm uses $c = 1$ and $\eta = 3$ (defined in [SZ13]). Note that the stochastic subgradient algorithm with a constant step size ("Constant") is a by-product of Algorithm 11.
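A minimal sketch of the heuristic step-size rule just described; the function name and default arguments are illustrative assumptions, with $\epsilon = 10^{-4}$ and $c = 1$ taken from the experimental setup:

```python
def heuristic_step(t, eps=1e-4, c=1.0):
    """Step size of the 'Heuristic' variant: the classical c/(t+1)
    decay, floored at the constant step eps used by Algorithm 11."""
    return max(eps, c / (t + 1))
```

For $t < c/\epsilon - 1$ this behaves like a $1/t$ decay; afterwards it equals the constant step $\epsilon$, which is why it only adds $O(1/\epsilon)$ rounds to the earlier bounds.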
A locally polyhedral function $F = \|w\|_1$ is considered, where $w \in [-4, 4]^{100}$. When $g(w)$ is a subgradient of $F$ at $w$, a stochastic subgradient is $\hat{g}(w) = g(w) X$ where $X$ is a uniform random variable from 0 to 2, so $\mathbb{E}[\hat{g}(w) \mid w] = g(w)$. Ten experiments are performed, and the average values are sampled at rounds $\{2^k - 2\}_{k=1}^{18}$ (one round before Algorithm 11 resets the averages).

Results are shown in Figure 7.2. Both axes of Figure 7.2 are in a log scale. The plots of Algorithm 11 and the polynomial-decay algorithm cross each other, because the former has a faster convergence rate. The subgradient algorithm with constant step size stops improving due to the fixed value of the step size. However, Algorithm 11 keeps improving beyond that point. This can be explained by (7.20), where the average helps reduce the last term on the right-hand side. The plot of the heuristic algorithm shows its convergence.

Figure 7.2: Results of algorithms and a locally polyhedral function ($\log_2(t)$ versus $\log_2[F(\overline{w}_t) - F(w^*)]$ for the Constant, Staggered, Heuristic, and Polynomial algorithms).

A general convex function $F(w) = \sum_{i=1}^{100} F_i(w^{(i)})$ is considered where, for $i \in \{1, \ldots, 100\}$, $w^{(i)}$ is the $i$-th component of $w$ and

$F_i(w^{(i)}) = \begin{cases} -w^{(i)}, & \text{if } w^{(i)} < 0 \\ (w^{(i)})^2, & \text{if } w^{(i)} \ge 0. \end{cases}$

The $i$-th component of a stochastic subgradient is $\hat{g}_i(w) = g_i(w) + Y$, where $g_i(w)$ is the $i$-th component of the true subgradient of $F$ at $w$ and $Y$ is a uniform random variable between $-1$ and $1$. The simulation uses the same parameters as the locally polyhedral case.
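The locally polyhedral experiment above can be reproduced in outline as follows. The sketch follows the stated setup ($F = \|w\|_1$ on $[-4,4]^{100}$, multiplicative noise $X \sim \mathrm{Uniform}[0,2]$, samples at rounds $2^k - 2$); the seed, the variable names, and the use of a plain running average in place of the full staggered reset are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, eps, rounds = 100, 1e-4, 2**12

w = rng.uniform(-4.0, 4.0, size=d)            # start inside W = [-4, 4]^100
f_init = np.abs(w).sum()                      # F(w) = ||w||_1, so F* = 0
w_bar, count, samples = np.zeros(d), 0, []
sample_rounds = {2**k - 2 for k in range(1, 19)}

for t in range(rounds):
    g_hat = np.sign(w) * rng.uniform(0.0, 2.0)   # E[g_hat | w] = sign(w)
    w = np.clip(w - eps * g_hat, -4.0, 4.0)      # projected subgradient step
    count += 1
    w_bar += (w - w_bar) / count                 # running average
    if t in sample_rounds:
        samples.append(np.abs(w_bar).sum())      # F(average) - F*

f_final = np.abs(w).sum()
```

Averaging the sampled objective values over ten such runs and plotting them against the sampled rounds on log-log axes reproduces the qualitative comparison of Figure 7.2.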
The results in Figure 7.3 have a similar trend as in the locally polyhedral case, except that the plot of the stochastic subgradient algorithm with constant step size crosses the plot of the polynomial-decay averaging.

Figure 7.3: Results of algorithms and a general convex function ($\log_2(t)$ versus $\log_2[F(\overline{w}_t) - F(w^*)]$ for the Constant, Staggered, Heuristic, and Polynomial algorithms).

Then we consider a non-smooth convex function $F(w) = \sum_{i=1}^{100} F_i(w^{(i)})$ where

$F_i(w^{(i)}) = \begin{cases} w^{(i)} - \frac{10^{-6}}{2}, & \text{if } w^{(i)} \ge \frac{10^{-6}}{2} \\ -w^{(i)} - \frac{10^{-6}}{2}, & \text{if } w^{(i)} \le -\frac{10^{-6}}{2} \\ 0, & \text{otherwise.} \end{cases}$ (7.38)

This function has uncountably many minimizers. A stochastic subgradient is a component-wise addition of the true subgradient and the uniform random variable $Y$. Simulation results are shown in Figure 7.4. Comparing these results to the results in Figure 7.2 shows the same trend of convergence rates, even though the function $F$ in (7.38) does not satisfy the uniqueness assumption.

Figure 7.4: Results of algorithms and the function (7.38) ($\log_2(t)$ versus $\log_2[F(\overline{w}_t) - F(w^*)]$ for the Constant, Staggered, Heuristic, and Polynomial algorithms).

7.6 Chapter summary

This chapter considers stochastic non-smooth convex optimization. We propose the staggered time average algorithm and prove its performance. When a function with a unique minimum satisfies the locally polyhedral structure, the algorithm has an $O(1/T)$ convergence rate. For a general convex function with a unique minimum, we derive a class of bounds on the convergence rate of the algorithm. For a special case of deterministic problems with the locally polyhedral structure, an algorithm with $O(\log(1/\epsilon))$ convergence is proposed.

Chapter 8

Conclusion

In this thesis, we studied three practical aspects of stochastic network optimization as an approach to bringing the theory to real-world problems.
We first studied the convergence speeds and behaviors of the drift-plus-penalty algorithm that solves general time-average optimization problems. By looking at non-asymptotic behaviors of the algorithm, we discovered that the algorithm has a transient period and a steady state. Further, the speed of the algorithm depends on the inherent dual function of a system. Specifically, let $\epsilon$ be the optimality gap. When the dual function satisfies the locally-polyhedral structure, the transient period is $O(1/\epsilon)$ iterations and the convergence time in the steady state is $O(1/\epsilon)$. When the dual function satisfies the locally-quadratic structure, the transient period and the convergence time in the steady state are $O(1/\epsilon^{1.5})$. This finding can be used to trade off between transient time and optimality gap in situations where the network and traffic change regularly.

Secondly, we considered stochastic network optimization with finite-buffer queues, which is a more realistic setting. We developed the floating-queue algorithm for general stochastic network optimization that achieves near optimality while using very small buffer space. When the buffer size of every queue is $B$, the algorithm is proven theoretically to have an $O(e^{-B})$ optimality gap, $O(B)$ delay, and $O(e^{-B})$ average drop rate. From a system perspective, it operates in FIFO manner, which is compatible with production switches and routers. Overall, the floating-queue algorithm is an attempt to push the boundary of stochastic network optimization and to make the theory more practical.

Thirdly, we used a different approach to show the practicality of the theory. We developed a practical traffic load-balancing algorithm that outperforms the conventional ECMP algorithm in simulations of datacenter networks. The algorithm is inspired by a new theoretical throughput-optimal algorithm. Even though several throughput-optimal algorithms exist, they are not created equal.
We learnt that practical considerations, such as the TCP congestion-control mechanism and implementation constraints from SDN, play an important role in the design of the new throughput-optimal algorithm, which operates gracefully in practice.

All in all, this thesis developed new theoretical results and a new practical-theory approach to make the theory of stochastic network optimization pragmatic. The works on convergence speeds and the finite-buffer algorithm are a theoretical approach that considers practicality. The load-balancing work leads to the new practical-theory approach that integrates practical considerations into the development of a theoretical algorithm to further extend theoretical validity to practical usefulness.

Bibliography

[ABJ+13] E. Athanasopoulou, L. X. Bui, T. Ji, R. Srikant, and A. Stolyar. Back-pressure-based packet-by-packet adaptive routing in communication networks. IEEE/ACM Transactions on Networking, 21(1):244–257, Feb. 2013.

[AED+14] M. Alizadeh, T. Edsall, S. Dharmapurikar, R. Vaidyanathan, K. Chu, A. Fingerhut, V. T. Lam, F. Matus, R. Pan, N. Yadav, and G. Varghese. CONGA: Distributed congestion-aware load balancing for datacenters. SIGCOMM Comput. Commun. Rev., 44(4):503–514, Aug. 2014.

[AFLV08] M. Al-Fares, A. Loukissas, and A. Vahdat. A scalable, commodity data center network architecture. SIGCOMM Comput. Commun. Rev., 38(4):63–74, Aug. 2008.

[AFRR+10] M. Al-Fares, S. Radhakrishnan, B. Raghavan, N. Huang, and A. Vahdat. Hedera: Dynamic flow scheduling for data center networks. In Proceedings of the 7th USENIX Conference on Networked Systems Design and Implementation, NSDI'10, pages 19–19, Berkeley, CA, USA, 2010. USENIX Association.

[BDG+14] P. Bosshart, D. Daly, G. Gibb, M. Izzard, N. McKeown, J. Rexford, C. Schlesinger, D. Talayco, A. Vahdat, G. Varghese, and D. Walker. P4: Programming protocol-independent packet processors. SIGCOMM Comput. Commun. Rev., 44(3):87–95, July 2014.

[Ber05] D. P. Bertsekas.
Dynamic Programming and Optimal Control, volume I. Athena Scientific, Belmont, MA, USA, 3rd edition, 2005.

[BF93] J. V. Burke and M. C. Ferris. Weak sharp minima in mathematical programming. SIAM J. Control Optim., 31(5):1340–1359, Sept. 1993.

[BG92] D. Bertsekas and R. Gallager. Data Networks (2nd Ed.). Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1992.

[BKS+09] C. Bisdikian, L. M. Kaplan, M. B. Srivastava, D. J. Thornley, D. Verma, and R. I. Young. Building principles for a quality of information specification for sensor information. In 2009 12th International Conference on Information Fusion, pages 1370–1377, Jul. 2009.

[BM08] S. Boyd and A. Mutapcic. Stochastic subgradient method. Technical report, Stanford University, 2008.

[BNCG+11] A. Bar-Noy, G. Cirincione, R. Govindan, S. Krishnamurthy, T. F. La Porta, P. Mohapatra, M. J. Neely, and A. Yener. Quality-of-information aware networking for tactical military networks. In 2011 IEEE International Conference on Pervasive Computing and Communications Workshops (PERCOM Workshops), pages 2–7, Mar. 2011.

[BNO03] D. P. Bertsekas, A. Nedić, and A. Ozdaglar. Convex Analysis and Optimization. Athena Scientific, 2003.

[BNOT14] A. Beck, A. Nedić, A. Ozdaglar, and M. Teboulle. An O(1/k) gradient method for network resource allocation problems. IEEE Transactions on Control of Network Systems, 1(1):64–73, Mar. 2014.

[BSS09] L. Bui, R. Srikant, and A. Stolyar. Novel architectures and algorithms for delay reduction in back-pressure scheduling and routing. In IEEE INFOCOM 2009, pages 2936–2940, Apr. 2009.

[BV04] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, New York, NY, USA, 2004.

[CLCD07] M. Chiang, S. H. Low, A. R. Calderbank, and J. C. Doyle. Layering as optimization decomposition: A mathematical theory of network architectures. Proceedings of the IEEE, 95(1):255–312, Jan. 2007.

[DKS89] A. Demers, S. Keshav, and S. Shenker. Analysis and simulation of a fair queueing algorithm.
SIGCOMM Comput. Commun. Rev., 19(4):1–12, Aug. 1989.

[ES06] A. Eryilmaz and R. Srikant. Joint congestion control, routing, and MAC for stability and fairness in wireless networks. IEEE Journal on Selected Areas in Communications, 24(8):1514–1524, Aug. 2006.

[ES07] A. Eryilmaz and R. Srikant. Fair resource allocation in wireless networks using queue-length-based scheduling and congestion control. IEEE/ACM Transactions on Networking, 15(6):1333–1344, Dec. 2007.

[GHJ+09] A. Greenberg, J. R. Hamilton, N. Jain, S. Kandula, C. Kim, P. Lahiri, D. A. Maltz, P. Patel, and S. Sengupta. VL2: A scalable and flexible data center network. SIGCOMM Comput. Commun. Rev., 39(4):51–62, Aug. 2009.

[GLL+09] C. Guo, G. Lu, D. Li, H. Wu, X. Zhang, Y. Shi, C. Tian, Y. Zhang, and S. Lu. BCube: A high performance, server-centric network architecture for modular data centers. SIGCOMM Comput. Commun. Rev., 39(4):63–74, Aug. 2009.

[GNT06] L. Georgiadis, M. J. Neely, and L. Tassiulas. Resource allocation and cross-layer control in wireless networks. Foundations and Trends in Networking, 1(1):1–144, Apr. 2006.

[HK14] E. Hazan and S. Kale. Beyond the regret minimization barrier: Optimal algorithms for stochastic strongly-convex optimization. Journal of Machine Learning Research, 15:2489–2512, 2014.

[HLH14] L. Huang, X. Liu, and X. Hao. The power of online learning in stochastic network optimization. SIGMETRICS Perform. Eval. Rev., 42(1):153–165, Jun. 2014.

[HMNK13] L. Huang, S. Moeller, M. J. Neely, and B. Krishnamachari. LIFO-backpressure achieves near-optimal utility-delay tradeoff. IEEE/ACM Transactions on Networking, 21(3):831–844, Jun. 2013.

[HN11] L. Huang and M. J. Neely. Delay reduction via Lagrange multipliers in stochastic network optimization. IEEE Transactions on Automatic Control, 56(4):842–857, Apr. 2011.

[INE] INET framework community. INET framework for OMNeT++. https://inet.omnetpp.org/.

[JC05] M. E. Johnson and K. C. Chang.
Quality of information for data fusion in net centric publish and subscribe architectures. In 2005 7th International Conference on Information Fusion, volume 2, page 8, Jul. 2005.

[JJS13] B. Ji, C. Joo, and N. B. Shroff. Delay-based back-pressure scheduling in multihop wireless networks. IEEE/ACM Transactions on Networking, 21(5):1539–1552, Oct. 2013.

[JWA15] T. Javidi, C. H. Wang, and T. Aktaş. A novel data center network architecture with zero in-network queuing. In 2015 13th International Symposium on Modeling and Optimization in Mobile, Ad Hoc, and Wireless Networks (WiOpt), pages 229–234, May 2015.

[JWL+05] C. Jin, D. Wei, S. H. Low, J. Bunn, H. D. Choe, J. C. Doyle, H. Newman, S. Ravot, S. Singh, F. Paganini, G. Buhrmaster, L. Cottrell, O. Martin, and W. Feng. FAST TCP: from theory to experiments. IEEE Network, 19(1):4–11, Jan. 2005.

[KGR+15] N. Kang, M. Ghobadi, J. Reumann, A. Shraer, and J. Rexford. Efficient traffic splitting on commodity switches. In Proceedings of the 11th ACM Conference on Emerging Networking Experiments and Technologies, CoNEXT '15, pages 6:1–6:13, New York, NY, USA, 2015. ACM.

[KHK+16] N. Katta, M. Hira, C. Kim, A. Sivaraman, and J. Rexford. HULA: Scalable load balancing using programmable data planes. In Proceedings of the Symposium on SDN Research, SOSR '16, pages 10:1–10:12, New York, NY, USA, 2016. ACM.

[KLJ+10] S. Kang, J. Lee, H. Jang, Y. Lee, S. Park, and J. Song. A scalable and energy-efficient context monitoring framework for mobile personal sensor networks. IEEE Transactions on Mobile Computing, 9(5):686–702, May 2010.

[KR12] J. F. Kurose and K. W. Ross. Computer Networking: A Top-Down Approach (6th Edition). Pearson, 6th edition, 2012.

[LBBL10] C. H. Liu, C. Bisdikian, J. W. Branch, and K. K. Leung. QoI-aware wireless sensor network management for dynamic multi-task operations. In 2010 7th Annual IEEE Communications Society Conference on Sensor, Mesh and Ad Hoc Communications and Networks (SECON), pages 1–9, Jun.
2010.

[Lin] Linux Foundation Project. DPDK: Data Plane Development Kit. http://dpdk.org/.

[LLS07] L. Lin, X. Lin, and N. B. Shroff. Low-complexity and distributed energy minimization in multi-hop wireless networks. In IEEE INFOCOM 2007 - 26th IEEE International Conference on Computer Communications, pages 1685–1693, May 2007.

[LMS12] L. B. Le, E. Modiano, and N. B. Shroff. Optimal control of wireless networks with finite buffers. IEEE/ACM Transactions on Networking, 20(4):1316–1329, Aug. 2012.

[LMW+07] J. W. Lockwood, N. McKeown, G. Watson, G. Gibb, P. Hartke, J. Naous, R. Raghuraman, and J. Luo. NetFPGA: an open platform for gigabit-rate network switching and routing. In 2007 IEEE International Conference on Microelectronic Systems Education (MSE'07), pages 160–161, Jun. 2007.

[LPW01] S. H. Low, L. Peterson, and L. Wang. Understanding TCP Vegas: A duality model. SIGMETRICS Perform. Eval. Rev., 29(1):226–235, Jun. 2001.

[LTBN+12] B. Liu, P. Terlecky, A. Bar-Noy, R. Govindan, M. J. Neely, and D. Rawitz. Optimizing information credibility in social swarming applications. IEEE Transactions on Parallel and Distributed Systems, 23(6):1147–1158, Jun. 2012.

[LXSS13] J. Liu, C. H. Xia, N. B. Shroff, and H. D. Sherali. Distributed cross-layer optimization in wireless networks: A second-order approach. In 2013 Proceedings IEEE INFOCOM, pages 2103–2111, Apr. 2013.

[MDD+14] M. Mahalingam, D. Dutt, K. Duda, P. Agarwal, L. Kreeger, T. Sridhar, M. Bursell, and C. Wright. Virtual eXtensible Local Area Network (VXLAN): A framework for overlaying virtualized layer 2 networks over layer 3 networks. RFC 7348, RFC Editor, Aug. 2014. http://www.rfc-editor.org/rfc/rfc7348.txt.

[MLF+08] E. Miluzzo, N. D. Lane, K. Fodor, R. Peterson, H. Lu, M. Musolesi, S. B. Eisenman, X. Zheng, and A. T. Campbell. Sensing meets mobile social networks: The design, implementation and evaluation of the CenceMe application.
In Proceedings of the 6th ACM Conference on Embedded Network Sensor Systems, SenSys '08, pages 337–350, New York, NY, USA, 2008. ACM.

[MMAW99] N. McKeown, A. Mekkittikul, V. Anantharam, and J. Walrand. Achieving 100% throughput in an input-queued switch. IEEE Transactions on Communications, 47(8):1260–1267, Aug. 1999.

[MRS+09] M. Mun, S. Reddy, K. Shilton, N. Yau, J. Burke, D. Estrin, M. Hansen, E. Howard, R. West, and P. Boda. PEIR, the personal environmental impact report, as a platform for participatory sensing systems research. In Proceedings of the 7th International Conference on Mobile Systems, Applications, and Services, MobiSys '09, pages 55–68, New York, NY, USA, 2009. ACM.

[MSKG10] S. Moeller, A. Sridharan, B. Krishnamachari, and O. Gnawali. Routing without routes: The backpressure collection protocol. In Proceedings of the 9th ACM/IEEE International Conference on Information Processing in Sensor Networks, IPSN '10, pages 279–290, New York, NY, USA, 2010. ACM.

[Nee05] M. J. Neely. Distributed and secure computation of convex programs over a network of connected processors. In Proceedings of DCDIS Conference, Jul. 2005.

[Nee06a] M. J. Neely. Energy optimal control for time-varying wireless networks. IEEE Transactions on Information Theory, 52(7):2915–2934, Jul. 2006.

[Nee06b] M. J. Neely. Super-fast delay tradeoffs for utility optimal fair scheduling in wireless networks. IEEE Journal on Selected Areas in Communications, 24(8):1489–1501, Aug. 2006.

[Nee07] M. J. Neely. Optimal energy and delay tradeoffs for multiuser wireless downlinks. IEEE Transactions on Information Theory, 53(9):3095–3113, Sept. 2007.

[Nee10] M. J. Neely. Stochastic Network Optimization with Application to Communication and Queueing Systems. Morgan and Claypool Publishers, 2010.

[Nee14] M. J. Neely. A simple convergence time analysis of drift-plus-penalty for stochastic optimization and convex programs. ArXiv e-prints, Dec. 2014.

[Nee16] M. J. Neely.
Energy-aware wireless scheduling with near-optimal backlog and convergence time tradeoffs. IEEE/ACM Transactions on Networking, 24(4):2223–2236, Aug. 2016.

[Nes04] Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer Academic, London, 1st edition, 2004.

[Nes09] Y. Nesterov. Primal-dual subgradient methods for convex problems. Mathematical Programming, 120(1):221–259, Aug. 2009.

[NML08] M. J. Neely, E. Modiano, and C. P. Li. Fairness and optimal stochastic control for heterogeneous networks. IEEE/ACM Transactions on Networking, 16(2):396–409, Apr. 2008.

[NMR05] M. J. Neely, E. Modiano, and C. E. Rohrs. Dynamic power allocation and routing for time-varying wireless networks. IEEE Journal on Selected Areas in Communications, 23(1):89–103, Jan. 2005.

[NO09a] A. Nedić and A. Ozdaglar. Approximate primal solutions and rate analysis for dual subgradient methods. SIAM Journal on Optimization, 19(4):1757–1780, 2009.

[NO09b] A. Nedić and A. Ozdaglar. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1):48–61, Jan. 2009.

[OG12] H. Ouyang and A. Gray. Stochastic smoothing for nonsmooth minimizations: Accelerating SGD by exploiting structure. In Proceedings of the 29th International Conference on Machine Learning, ICML'12, pages 1523–1530, USA, 2012. Omnipress.

[ONF12] ONF. Software-defined networking: The new norm for networks. Technical report, Open Networking Foundation, Apr. 2012.

[Opea] Open Networking Foundation. OpenFlow switch specification: Version 1.5.1. https://www.opennetworking.org.

[Opeb] OpenSim. OMNeT++. https://omnetpp.org.

[PG93] A. K. Parekh and R. G. Gallager. A generalized processor sharing approach to flow control in integrated services networks: the single-node case. IEEE/ACM Transactions on Networking, 1(3):344–357, Jun. 1993.

[Pol87] B. T. Polyak. Introduction to Optimization. Optimization Software Inc, New York, 1987.

[Pos81] J. Postel.
Transmission control protocol. STD 7, RFC Editor, Sept. 1981. http://www.rfc-editor.org/rfc/rfc793.txt.

[RBPS10] J. Ryu, V. Bhargava, N. Paine, and S. Shakkottai. Back-pressure routing and rate control for ICNs. In Proceedings of the Sixteenth Annual International Conference on Mobile Computing and Networking, MobiCom '10, pages 365–376, New York, NY, USA, 2010. ACM.

[RDVC+04] L. Rosasco, E. De Vito, A. Caponnetto, M. Piana, and A. Verri. Are loss functions all the same? Neural Comput., 16(5):1063–1076, May 2004.

[RJ88] K. K. Ramakrishnan and R. Jain. A binary feedback scheme for congestion avoidance in computer networks with a connectionless network layer. SIGCOMM Comput. Commun. Rev., 18(4):303–313, Aug. 1988.

[Ros96] S. M. Ross. Stochastic Processes. Wiley Series in Probability and Statistics. Wiley, 1996.

[RSS12] A. Rakhlin, O. Shamir, and K. Sridharan. Making gradient descent optimal for strongly convex stochastic optimization. In Proceedings of the 29th International Conference on Machine Learning, ICML'12, pages 1571–1578, USA, 2012. Omnipress.

[RZB+15] A. Roy, H. Zeng, J. Bagga, G. Porter, and A. C. Snoeren. Inside the social network's (datacenter) network. SIGCOMM Comput. Commun. Rev., 45(4):123–137, Aug. 2015.

[SG16] M. Shafiee and J. Ghaderi. A simple congestion-aware algorithm for load balancing in datacenter networks. In IEEE INFOCOM 2016 - The 35th Annual IEEE International Conference on Computer Communications, pages 1–9, Apr. 2016.

[SHN14] S. Supittayapornpong, L. Huang, and M. J. Neely. Time-average optimization with nonconvex decision set and its convergence. In 53rd IEEE Conference on Decision and Control, pages 6627–6634, Dec. 2014.

[SHN17] S. Supittayapornpong, L. Huang, and M. J. Neely. Time-average optimization with nonconvex decision set and its convergence. IEEE Transactions on Automatic Control, 62(8):4202–4208, Aug. 2017.

[SM16a] H. Seferoglu and E. Modiano.
Separation of routing and scheduling in backpressure-based wireless networks. IEEE/ACM Transactions on Networking, 24(3):1787–1800, Jun. 2016.

[SM16b] H. Seferoglu and E. Modiano. TCP-aware backpressure routing and scheduling. IEEE Transactions on Mobile Computing, 15(7):1783–1796, Jul. 2016.

[SN12] S. Supittayapornpong and M. J. Neely. Quality of information maximization in two-hop wireless networks. In 2012 IEEE International Conference on Communications (ICC), pages 4919–4925, Jun. 2012.

[SN14] S. Supittayapornpong and M. J. Neely. Time-average stochastic optimization with non-convex decision set and its convergence. ArXiv e-prints, Dec. 2014.

[SN15a] S. Supittayapornpong and M. J. Neely. Achieving utility-delay-reliability tradeoff in stochastic network optimization with finite buffers. In 2015 IEEE Conference on Computer Communications (INFOCOM), pages 1427–1435, Apr. 2015.

[SN15b] S. Supittayapornpong and M. J. Neely. Achieving utility-delay-reliability tradeoff in stochastic network optimization with finite buffers. ArXiv e-prints, Jan. 2015.

[SN15c] S. Supittayapornpong and M. J. Neely. Quality of information maximization for wireless networks via a fully separable quadratic policy. IEEE/ACM Transactions on Networking, 23(2):574–586, Apr. 2015.

[SN15d] S. Supittayapornpong and M. J. Neely. Time-average stochastic optimization with non-convex decision set and its convergence. In 2015 13th International Symposium on Modeling and Optimization in Mobile, Ad Hoc, and Wireless Networks (WiOpt), pages 490–497, May 2015.

[SN16a] S. Supittayapornpong and M. J. Neely. Staggered time average algorithm for stochastic non-smooth optimization with O(1/T) convergence. ArXiv e-prints, Jul. 2016.

[SN16b] S. Supittayapornpong and M. J. Neely. Throughput-optimal load balancing for intra datacenter networks. ArXiv e-prints, Dec. 2016.

[SOA+15] A. Singh, J. Ong, A. Agarwal, G. Anderson, A. Armistead, R. Bannon, S. Boving, G. Desai, B. Felderman, P. Germano, A.
Kanagala, J. Provost, J. Simmons, E. Tanda, J. Wanderer, U. Hölzle, S. Stuart, and A. Vahdat. Jupiter rising: A decade of Clos topologies and centralized control in Google's datacenter network. SIGCOMM Comput. Commun. Rev., 45(4):183–197, Aug. 2015.

[SSSSS09] S. Shalev-Shwartz, O. Shamir, N. Srebro, and K. Sridharan. Stochastic convex optimization. In Proceedings of the Conference on Learning Theory (COLT), 2009.

[Sto06] A. L. Stolyar. Greedy primal-dual algorithm for dynamic resource allocation in complex networks. Queueing Systems, 54(3):203–220, Nov. 2006.

[SZ13] O. Shamir and T. Zhang. Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. In Proceedings of the 30th International Conference on Machine Learning, ICML'13, volume 28, pages I-71–I-79. JMLR.org, 2013.

[TE92] L. Tassiulas and A. Ephremides. Stability properties of constrained queueing systems and scheduling policies for maximum throughput in multihop radio networks. IEEE Transactions on Automatic Control, 37(12):1936–1948, Dec. 1992.

[TH00] D. Thaler and C. Hopps. Multipath issues in unicast and multicast next-hop selection. RFC 2991, RFC Editor, Nov. 2000. http://www.rfc-editor.org/rfc/rfc2991.txt.

[Tib94] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58:267–288, 1994.

[Tse08] P. Tseng. On accelerated proximal gradient methods for convex-concave optimization. SIAM Journal on Optimization, submitted, 2008.

[VPA+17] E. Vanini, R. Pan, M. Alizadeh, P. Taheri, and T. Edsall. Let it flow: Resilient asymmetric load balancing with flowlet switching. In 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI 17), pages 407–420, Boston, MA, 2017. USENIX Association.

[WO13] E. Wei and A. Ozdaglar. On the O(1/k) convergence of asynchronous distributed alternating direction method of multipliers. ArXiv e-prints, Jul. 2013.

[WS96] R.
Y. Wang and D. M. Strong. Beyond accuracy: What data quality means to data consumers. J. Manage. Inf. Syst., 12(4):5{33, Mar. 1996. [XLY16] Y. Xu, Q. Lin, and T. Yang. Accelerated stochastic subgradient methods under local error bound condition. ArXiv e-prints, Jul. 2016. [XME12] D. Xue, R. Murawski, and E. Ekici. Distributed utility-optimal scheduling with nite buers. In 2012 10th International Symposium on Modeling and Optimization in Mobile, Ad Hoc and Wireless Networks (WiOpt), pages 278{ 285, May 2012. [YSRL11] L. Ying, S. Shakkottai, A. Reddy, and S. Liu. On combining shortest-path and back-pressure routing over multihop wireless networks. IEEE/ACM Transactions on Networking, 19(3):841{854, Jun. 2011. [ZDM + 12] D. Zats, T. Das, P. Mohan, D. Borthakur, and R. Katz. Detail: Reducing the ow completion time tail in datacenter networks. SIGCOMM Comput. Commun. Rev., 42(4):139{150, Aug. 2012. [Zin03] M. Zinkevich. Online convex programming and generalized innitesimal gradient ascent. In Proceedings of the Twentieth International Conference on International Conference on Machine Learning, ICML'03, pages 928{935. AAAI Press, 2003. [ZM13] M. Zhu and S. Martinez. An approximate dual subgradient algorithm for multi-agent non-convex optimization. IEEE Transactions on Automatic Control, 58(6):1534{1539, Jun. 2013. [ZTZ + 14] J. Zhou, M. Tewari, M. Zhu, A. Kabbani, L. Poutievski, A. Singh, and A. Vahdat. Wcmp: Weighted cost multipathing for improved fairness in data centers. In Proceedings of the Ninth European Conference on Computer Systems, EuroSys '14, pages 5:1{5:14, New York, NY, USA, 2014. ACM. 243
Asset Metadata
Creator: Supittayapornpong, Sucha (author)
Core Title: On practical network optimization: convergence, finite buffers, and load balancing
School: Viterbi School of Engineering
Degree: Doctor of Philosophy
Degree Program: Electrical Engineering
Publication Date: 11/27/2017
Defense Date: 08/31/2017
Publisher: University of Southern California (original); University of Southern California. Libraries (digital)
Tag: convergence analysis, data center, finite buffer, load balancing, Lyapunov optimization, OAI-PMH Harvest, optimal-delay-reliability tradeoff, stochastic network optimization
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: Neely, Michael J. (committee chair); Govindan, Ramesh (committee member); Krishnamachari, Bhaskar (committee member)
Creator Email: sucha.cpe@gmail.com, supittay@usc.edu
Permanent Link (DOI): https://doi.org/10.25549/usctheses-c40-457923
Unique Identifier: UC11268029
Identifier: etd-Supittayap-5931.pdf (filename); usctheses-c40-457923 (legacy record id)
Legacy Identifier: etd-Supittayap-5931.pdf
Dmrecord: 457923
Document Type: Dissertation
Rights: Supittayapornpong, Sucha
Type: texts
Source: University of Southern California (contributing entity); University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA