New Lagrangian methods for constrained convex programs and their applications
(USC Thesis Other)
NEW LAGRANGIAN METHODS FOR CONSTRAINED CONVEX PROGRAMS AND THEIR APPLICATIONS

by Hao Yu

A Dissertation Presented to the FACULTY OF THE USC GRADUATE SCHOOL, UNIVERSITY OF SOUTHERN CALIFORNIA, In Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (ELECTRICAL ENGINEERING)

May 2018

Copyright 2018 Hao Yu

Dedication

To my parents

Acknowledgements

First and foremost, I would like to thank my advisor, Prof. Michael J. Neely. I came across Mike's Ph.D. thesis when I was an M.Phil. student at HKUST. While I was not working on the same topic, I was amazed by the beautiful and elegant drift-plus-penalty technique developed in Mike's Ph.D. work. I told myself that if I were to pursue a Ph.D., I would hope to finish with work of that quality. Serendipitously, I later became Mike's Ph.D. student at USC. While doing research with Mike, I was impressed by his strong ability to sort out insight from complicated-looking problems. It happened so many times that, when my research was stuck on some critical issue and seemed to have come to a dead end, a seemingly random suggestion from Mike suddenly opened up a new door for me. In my eyes, Mike is an ideal scholar and advisor. He is modest, rigorous and persistent, and always strives for perfection in each research problem. I have learned so much from him and he will continue to be my role model in my future life.

I thank Prof. Meisam Razaviyayn and Prof. Ashutosh Nayyar for sitting on both my qualifying exam and defense committees; and Prof. Lieven Vandenberghe, Prof. Bhaskar Krishnamachari and Prof. Rahul Jain for sitting on my qualifying exam committee. I appreciate their valuable comments and suggestions.

I also thank my groupmates Xiaohan Wei and Sucha Supittayapornpong. Many of my research ideas originated from casual yet stimulating discussions with them. I also want to say "thank you" to all my other friends and colleagues on the "5th" floor at EEB.
Finally, I would like to dedicate this thesis to my parents. They always support every bold decision that I make and encourage me to pursue a higher degree. Without their unconditional love and support, this thesis would not have been possible.

Table of Contents

Dedication
Acknowledgements
Abstract
1 Introduction
  1.1 Lagrangian Methods for Constrained Convex Programs
    1.1.1 Dual Subgradient Method
    1.1.2 Primal-Dual Subgradient Method
    1.1.3 Drift-Plus-Penalty Technique for Deterministic Optimization
    1.1.4 Alternating Direction Method of Multipliers (ADMM)
  1.2 Convergence Time of Existing Lagrangian Methods
  1.3 Facts From Convex Analysis
  1.4 Thesis Outline and Our Contributions
2 Convergence Time of Dual Subgradient Methods for Strongly Convex Programs
  2.1 Related Work
  2.2 Convergence Time Analysis
    2.2.1 An Upper Bound of the Drift-Plus-Penalty Expression
    2.2.2 Objective Value Violation
    2.2.3 Constraint Violation
    2.2.4 Convergence Time of Algorithm 1.1
  2.3 Geometric Convergence of Algorithm 1.1 with Sliding Running Averages
    2.3.1 Smooth Dual Function
    2.3.2 Problems with Locally Quadratic Dual Functions
    2.3.3 Problems with Locally Strongly Concave Dual Functions
    2.3.4 Discussion
  2.4 Applications
    2.4.1 Strongly Convex Programs Satisfying Non-Degenerate Constraint Qualifications
    2.4.2 Network Utility Maximization with Independent Link Capacity Constraints
  2.5 Numerical Experiment
    2.5.1 Network Utility Maximization
    2.5.2 Linear Constrained Quadratic Program
    2.5.3 Large Scale Quadratic Program
  2.6 Chapter Summary
  2.7 Supplement to this Chapter
    2.7.1 Proof of Lemma 2.6
    2.7.2 Proof of Part 2 of Lemma 2.9
    2.7.3 Proof of Lemma 2.11
    2.7.4 Proof of Part 2 of Lemma 2.12
    2.7.5 Proof of Theorem 2.10
    2.7.6 Proof of Theorem 2.11
3 New Lagrangian Methods for Constrained Convex Programs
  3.1 New Dual Type Algorithm for General Constrained Convex Programs
  3.2 Basic Properties from Virtual Queue Update Equations
    3.2.1 Properties of Virtual Queues
    3.2.2 Properties of the Drift
  3.3 Convergence Time Analysis of Algorithm 3.1
    3.3.1 An Upper Bound of the Drift-Plus-Penalty Expression
    3.3.2 Objective Value Violations
    3.3.3 Constraint Violations
    3.3.4 Convergence Time of Algorithm 3.1
    3.3.5 Convex Programs with Linear Equality Constraints
  3.4 New Primal-Dual Type Algorithm for Smooth Constrained Convex Programs
    3.4.1 An Upper Bound of the Drift-Plus-Penalty Expression
    3.4.2 Smooth Constrained Convex Programs with Linear g(x)
    3.4.3 Smooth Constrained Convex Programs with Non-Linear g(x)
  3.5 Chapter Summary
4 New Backpressure Algorithms for Joint Rate Control and Routing
  4.1 System Model and Problem Formulation
  4.2 New Backpressure Algorithms
    4.2.1 Discussion of Various Queueing Models
    4.2.2 New Backpressure Algorithms
    4.2.3 Almost Closed-Form Updates in Algorithm 4.2
  4.3 Performance Analysis of Algorithm 4.2
    4.3.1 Preliminaries
    4.3.2 Utility Optimality Gap Analysis
    4.3.3 Queue Length Analysis
    4.3.4 Performance of Algorithm 4.2
  4.4 Numerical Experiment
  4.5 Chapter Summary
  4.6 Supplement to this Chapter
    4.6.1 Multi-Path Network Utility Maximization with Predetermined Paths
    4.6.2 An Example Illustrating the Possibly Large Gap Between Model (4.7) and Model (4.8)
    4.6.3 Proof of Lemma 4.4
    4.6.4 Proof of Lemma 4.6
    4.6.5 Proof of Lemma 4.7
5 Online Convex Optimization with Stochastic Constraints
  5.1 Problem Statement and New Algorithm
    5.1.1 New Algorithm
    5.1.2 Intuitions of Algorithm 5.1
    5.1.3 Preliminary Analysis and More Intuitions of Algorithm 5.1
  5.2 Expected Performance Analysis of Algorithm 5.1
    5.2.1 A Drift Lemma for Stochastic Processes
    5.2.2 Expected Constraint Violation Analysis
    5.2.3 Expected Regret Analysis
    5.2.4 Special Case Performance Guarantees
  5.3 High Probability Performance Analysis
    5.3.1 High Probability Constraint Violation Analysis
    5.3.2 High Probability Regret Analysis
  5.4 Chapter Summary
  5.5 Supplement to this Chapter
    5.5.1 Proof of Lemma 5.5
    5.5.2 Proof of Lemma 5.7
    5.5.3 Proof of Lemma 5.8
    5.5.4 Proof of Theorem 5.3
    5.5.5 Proof of Lemma 5.9
    5.5.6 Proof of Theorem 5.4
6 Online Convex Optimization with Long Term Constraints
  6.1 Problem Statement and New Algorithm
    6.1.1 Problem Statement
    6.1.2 New Algorithm
  6.2 Regret and Constraint Violation Analysis
    6.2.1 Properties of the Virtual Queues and the Drift
    6.2.2 An Upper Bound of the Drift-Plus-Penalty Expression
    6.2.3 Regret Analysis
    6.2.4 An Upper Bound of the Virtual Queue Vector
    6.2.5 Constraint Violation Analysis
    6.2.6 Practical Implementations
  6.3 Extensions
    6.3.1 Intermediate Time Horizon T
    6.3.2 Unknown Time Horizon T
  6.4 Chapter Summary
7 Power Control for Energy Harvesting Devices with Outdated State Information
  7.1 Problem Formulation
    7.1.1 Further Examples
    7.1.2 Basic Assumption
    7.1.3 Power Control and Energy Queue Model
    7.1.4 An Upper Bound Problem
  7.2 New Algorithm
    7.2.1 New Algorithm
    7.2.2 Algorithm Intuitions
  7.3 Performance Analysis of Algorithm 7.1
    7.3.1 Drift Analysis
    7.3.2 Utility Optimality Analysis
    7.3.3 Lower Bound for Virtual Battery Queue Q(t)
    7.3.4 Energy Availability Guarantee
    7.3.5 Utility Optimality and Battery Capacity Tradeoff
    7.3.6 Extensions
  7.4 Numerical Experiment
  7.5 Chapter Summary
8 Dynamic Transmit Covariance Design in MIMO Fading Systems With Unknown Channel Distributions and Inaccurate Channel State Information
  8.1 Signal Model and Problem Formulation
    8.1.1 Signal Model
    8.1.2 Problem Formulation
  8.2 Instantaneous CSIT Case
    8.2.1 Transmit Covariance Update in Algorithm 8.1
    8.2.2 Performance of Algorithm 8.1
    8.2.3 Discussion
  8.3 Delayed CSIT Case
    8.3.1 Transmit Covariance Update in Algorithm 8.3
    8.3.2 Performance of Algorithm 8.3
    8.3.3 Extensions
  8.4 Rate Adaptation
  8.5 Simulations
    8.5.1 A Simple MIMO System with Two Channel Realizations
    8.5.2 A MIMO System with Continuous Channel Realizations
  8.6 Chapter Summary
  8.7 Supplement to this Chapter
    8.7.1 Linear Algebra and Matrix Derivatives
    8.7.2 Proof of Lemma 8.2
    8.7.3 Proof of Lemma 8.3
    8.7.4 Proof of Lemma 8.5
    8.7.5 Proof of Lemma 8.6
9 Duality Codes and the Integrality Gap Bound for Index Coding
  9.1 Weighted Bipartite Digraph
  9.2 Acyclic Subgraph Bound and its LP Relaxation
  9.3 Cyclic Codes and Linear Programming Duality
    9.3.1 Cyclic Codes
    9.3.2 Duality Between Information Theoretical Lower Bounds and Cyclic Codes
  9.4 Optimality of Cyclic Codes in Planar Bipartite Graphs
    9.4.1 Complementary Problems
    9.4.2 Packet Split Digraph
    9.4.3 Optimality of Cyclic Codes in Planar Graphs
    9.4.4 Optimality of Cyclic Codes in the Unicast-Uniprior Index Coding Problem
  9.5 Partial Clique Codes: a Duality Perspective
    9.5.1 Partial Clique Codes
    9.5.2 Duality Between Information Theoretical Lower Bounds and Partial Clique Codes
    9.5.3 Discussion
  9.6 Chapter Summary
10 Conclusions

Abstract

In this thesis, we develop new Lagrangian methods with fast convergence for constrained convex programs with complicated functional constraints. The dual subgradient method, also known as the dual ascent method, and the primal-dual subgradient method, also known as the Arrow-Hurwicz-Uzawa subgradient method, are classical Lagrangian methods to solve constrained convex programs. Both methods are known to have a slow $O(1/\epsilon^2)$ convergence time. In contrast, the new Lagrangian methods proposed in this thesis have a faster $O(1/\epsilon)$ convergence time. Recall that the alternating direction method of multipliers (ADMM), which is another representative Lagrangian method for convex programs with linear equality constraints, is also known to have $O(1/\epsilon)$ convergence. However, our methods work for general convex programs with possibly non-linear constraints.
We first revisit the classical dual subgradient method and study its convergence time for constrained strongly convex programs in Chapter 2. By using a novel drift-plus-penalty type analysis, we show that the dual subgradient method enjoys a faster $O(1/\epsilon)$ convergence time for general (possibly non-differentiable) constrained strongly convex programs. After that, we seek to develop new Lagrangian methods with the fast $O(1/\epsilon)$ convergence time for general constrained convex programs without strong convexity in Chapter 3, which is the core chapter in this thesis. Based on the new Lagrangian methods developed in Chapter 3, new techniques that exceed the state-of-the-art are developed for joint rate control and routing in data networks in Chapter 4 and for online convex optimization with stochastic and long term constraints in Chapters 5-6.

The other focus of this thesis is to illustrate the practical relevance of mathematical optimization techniques in engineering systems. In Chapter 7, we adapt our new online convex optimization technique to the power control for energy harvesting devices with outdated state information such that we can achieve utility within $O(\epsilon)$ of the optimal by using a battery with an $O(1/\epsilon)$ capacity. In Chapter 8, we extend conventional drift-plus-penalty stochastic optimization and Zinkevich's online convex optimization to develop new dynamic transmit covariance design policies for MIMO fading systems with unknown channel distributions and inaccurate channel state information. In Chapter 9, we study the index coding problem and characterize the optimality of two representative scalar and fractional linear codes, i.e., cyclic codes and maximum distance separable (MDS) codes, by studying the integrality gap between the integer linear program from an information theoretical lower bound and its linear relaxations and the Lagrangian duality between various linear relaxations and their dual problems.
Chapter 1

Introduction

A constrained convex program, also called a constrained convex optimization problem, has the form:

$$\min \quad f(x) \tag{1.1}$$
$$\text{s.t.} \quad g_k(x) \le 0, \ \forall k \in \{1, 2, \ldots, m\}, \tag{1.2}$$
$$x \in \mathcal{X}, \tag{1.3}$$

where $f(x)$ and $g_k(x)$ are convex functions and $\mathcal{X}$ is a closed convex set. In general, a convex program can have linear equality constraints given by $h_j(x) = 0, \forall j$, where the functions $h_j(x)$ are linear. Since each linear equality constraint $h_j(x) = 0$ can be equivalently represented by two convex inequality constraints $h_j(x) \le 0$ and $-h_j(x) \le 0$, linear equality constraints are not included in the problem formulation (1.1)-(1.3). In this thesis, we often denote the stacked vector of multiple functions $g_1(x), g_2(x), \ldots, g_m(x)$ as $g(x) = [g_1(x), g_2(x), \ldots, g_m(x)]^T$.

To deal with the challenge of constraints (1.2) in the above convex program, the definition of the Lagrangian is introduced as follows:

Definition 1.1 (Lagrangian and Lagrangian Dual Function). For each inequality constraint $g_k(x) \le 0$, we introduce a scalar $\lambda_k \ge 0$, called a Lagrangian multiplier, and define the Lagrangian $L$ associated with the problem (1.1)-(1.3) as

$$L(x, \lambda) = f(x) + \lambda^T g(x) = f(x) + \sum_{k=1}^{m} \lambda_k g_k(x),$$

where $\lambda = [\lambda_1, \ldots, \lambda_m]^T$ is the stacked vector of all Lagrangian multipliers. Define the Lagrangian dual function (or just dual function) associated with the problem (1.1)-(1.3) as

$$q(\lambda) = \inf_{x \in \mathcal{X}} L(x, \lambda) = \inf_{x \in \mathcal{X}} \Big\{ f(x) + \sum_{k=1}^{m} \lambda_k g_k(x) \Big\}.$$

The Lagrangian augments the objective function $f(x)$ with a weighted sum of the constraint functions $g_k(x)$. The vector $\lambda$ is called the Lagrange multiplier vector or dual variable vector associated with the problem (1.1)-(1.3). Correspondingly, the vector $x$ is called the primal variable vector. In this thesis, we assume all considered convex programs have at least one optimal solution and the following condition holds.

Condition 1.1 (Existence of Lagrange Multipliers Attaining Strong Duality).
The convex program (1.1)-(1.3) has Lagrange multipliers attaining the strong duality. That is, there exists a Lagrange multiplier vector $\lambda^* = [\lambda_1^*, \lambda_2^*, \ldots, \lambda_m^*]^T \ge 0$ such that $q(\lambda^*) = f(x^*)$, where $x^*$ is an optimal solution to the problem (1.1)-(1.3) and $q(\lambda)$ defined in Definition 1.1 is the dual function of the problem (1.1)-(1.3).

Condition 1.1 is a mild condition and is implied by various conditions called constraint qualifications [BNO03, BSS06, BV04]. For example, it is implied by the existence of a vector $s \in \mathcal{X}$ such that $g_k(s) < 0$ for all $k \in \{1, \ldots, m\}$, called the Slater condition.

1.1 Lagrangian Methods for Constrained Convex Programs

Theoretically, the constrained convex program (1.1)-(1.3) can be solved directly using a first-order method with the projection onto the set $\{x \in \mathcal{X} : g_k(x) \le 0, \forall k \in \{1, 2, \ldots, m\}\}$. However, such a projection itself can be computationally challenging, or even as difficult as the original problem, since the projection requires minimizing a quadratic function subject to the constraints (1.2) and (1.3).

Alternatively, interior point methods, which convert constrained convex programs into unconstrained problems by introducing a barrier function for each functional constraint (1.2), have been developed to solve the convex program (1.1)-(1.3) [NW06, BV04]. An interior point method typically takes a relatively small number of iterations to converge to a good solution. However, the per-iteration complexity can be huge since each iteration involves a Newton step that essentially computes Hessians and matrix inverses. In addition, interior point methods are centralized and hence are not suitable for distributed implementations in engineering systems.

Lagrangian methods, which are based on the Lagrangian defined in Definition 1.1 and the strong duality described in Condition 1.1, are effective methods for constrained, especially large-scale, convex programs in the form of (1.1)-(1.3).
A Lagrangian method iteratively updates both the primal variable $x$ and the dual variable $\lambda$, and the per-iteration complexity is typically low. In fact, Lagrangian methods often yield distributed implementations and hence are widely exploited in engineering applications such as data networks [KMT98, LL99, Low03], decentralized multi-agent systems [TTM11], model predictive control (MPC) [NN14] and so on.

1.1.1 Dual Subgradient Method

As a representative example of Lagrangian methods, the dual subgradient method, also known as the dual ascent method, to solve the convex program (1.1)-(1.3) is described as follows:

Algorithm 1.1 Dual Subgradient Method
Let $c > 0$ be a constant step size. Let $\lambda(0) \ge 0$ be an arbitrary constant vector. At each iteration $t \in \{0, 1, 2, \ldots\}$, update $x(t)$ and $\lambda(t+1)$ as follows:
• Update primal variables via $x(t) = \mathrm{argmin}_{x \in \mathcal{X}} \big\{ f(x) + \sum_{k=1}^{m} \lambda_k(t) g_k(x) \big\}$.
• Update dual variables via $\lambda_k(t+1) = \max\{ \lambda_k(t) + c\, g_k(x(t)), 0 \}, \forall k \in \{1, 2, \ldots, m\}$.
• Output the running average $\bar{x}(t+1)$ given by $\bar{x}(t+1) = \frac{1}{t+1} \sum_{\tau=0}^{t} x(\tau) = \bar{x}(t) \frac{t}{t+1} + x(t) \frac{1}{t+1}$ as the solution at iteration $t+1$.

Note that the $\mathrm{argmin}_{x \in \mathcal{X}} \{\cdot\}$ involved in the primal variable updates of Algorithm 1.1 may not be well defined even though $\mathcal{X}$ is a closed set. In general, to ensure $\mathrm{argmin}_{x \in \mathcal{X}} \{h(x)\}$ is well defined for a continuous function $h(x)$, we need to exclude the possibility that $h(x)$ becomes smaller as $\|x\| \to \infty$. One condition that ensures the existence of $\mathrm{argmin}_{x \in \mathcal{X}} \{h(x)\}$ is to assume $h(x)$ is coercive, i.e., $h(x) \to \infty$ whenever $\|x\| \to \infty$. This condition holds whenever $h(x)$ is a strongly convex function, defined in Definition 1.7. Alternatively, if the closed set $\mathcal{X}$ is bounded, then $\mathrm{argmin}_{x \in \mathcal{X}} \{h(x)\}$ exists for any continuous function $h(x)$.

Let $q(\lambda)$ be the Lagrangian dual function, defined in Definition 1.1, for the convex program (1.1)-(1.3). By Danskin's Theorem (see Proposition B.25 in [Ber99]), $g(x(t))$ is a subgradient of $q(\lambda)$ at the point $\lambda = \lambda(t)$. Recall that $\lambda \in \mathbb{R}^m_+$.
It follows that the dynamic of the dual vector $\lambda(t)$ can be interpreted as a projected subgradient method with the constant step size $c$ to maximize the dual function $q(\lambda)$. Thus, Algorithm 1.1 is called a dual subgradient method.

It is worthwhile emphasizing that with a constant step size $c$, $\lambda(t)$ does not necessarily converge to an optimal dual vector $\lambda^*$ that maximizes $q(\lambda)$, since $q(\lambda)$ is in general non-differentiable and the subgradient method with a constant step size may not converge to a maximizer of a non-differentiable concave function. Even if we assume $\lambda(t) = \lambda^*$ at a certain iteration $t$, the corresponding $x(t) = \mathrm{argmin}_{x \in \mathcal{X}} \{ f(x) + \sum_{k=1}^{m} \lambda_k(t) g_k(x) \}$ is not necessarily an optimal solution to the problem (1.1)-(1.3) when the minimizer of $f(x) + \sum_{k=1}^{m} \lambda_k(t) g_k(x)$ is not unique. In fact, $x(t)$ can even be infeasible in this case. This is because an optimal solution to the problem (1.1)-(1.3) is a nontrivial convex combination of the minimizers of $f(x) + \sum_{k=1}^{m} \lambda_k^* g_k(x)$ when $q(\lambda)$ is not differentiable at $\lambda = \lambda^*$. As a result, there is no performance guarantee of $x(t)$ in Algorithm 1.1 for general convex programs in the form of (1.1)-(1.3) and it is important to use the running average $\bar{x}(t)$ as the solution. The running average sequence $\bar{x}(t)$ is also called the ergodic sequence in [LPS99]. The idea of using the running averages $\bar{x}(t)$ as the solutions, rather than the original primal variables $x(t)$, dates back to Shor [Sho85] and is further developed in [SC96, LL97, LPS99, GPS15].

If the functions $f(x)$ and each $g_k(x)$ are separable with respect to components or blocks of $x$, then the updates of the primal variable $x(t)$ can be decomposed into several smaller independent subproblems, each of which only involves a component or block of $x(t)$. For example, if $f(x)$ and $g_k(x)$ are linear functions, then the updates $x(t)$ can be decomposed into $n$ scalar convex minimizations that often have closed-form solutions.
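As a rough illustration of the two points above (the raw iterates $x(t)$ may never settle while the running average $\bar{x}(t)$ does, and a separable problem yields coordinate-wise closed-form primal updates), here is a minimal Python sketch of Algorithm 1.1 on a toy linear program with one constraint. The problem data, the function name `dual_subgradient_lp`, the choice $\lambda(0) = 0$ and the tie-breaking rule are illustrative assumptions, not taken from the thesis.

```python
def dual_subgradient_lp(cost, a, b, step, num_iters):
    """Sketch of Algorithm 1.1 for: min cost^T x  s.t.  a^T x - b <= 0,  x in [0, 1]^n.

    Returns the last raw iterate, the running average of all iterates,
    and the final multiplier.
    """
    n = len(cost)
    lam = 0.0                      # assumed lambda(0) = 0 (any nonnegative value is allowed)
    x = [0.0] * n
    x_sum = [0.0] * n
    for _ in range(num_iters):
        # Primal update: the Lagrangian is separable, so coordinate i minimizes
        # (cost_i + lam * a_i) * x_i over [0, 1] in closed form.
        # Ties (zero coefficient) are broken toward 0, an arbitrary choice.
        x = [1.0 if cost[i] + lam * a[i] < 0 else 0.0 for i in range(n)]
        # Dual update: projected subgradient step with a constant step size.
        g = sum(a[i] * x[i] for i in range(n)) - b
        lam = max(lam + step * g, 0.0)
        x_sum = [x_sum[i] + x[i] for i in range(n)]
    x_avg = [s / num_iters for s in x_sum]
    return x, x_avg, lam


# Toy instance: min -x1 - x2  s.t.  x1 + x2 <= 1,  x in [0, 1]^2 (optimal value -1).
x_last, x_avg, lam = dual_subgradient_lp([-1.0, -1.0], [1.0, 1.0], 1.0, 0.1, 2000)
print("last iterate:", x_last)     # a vertex of the box: either infeasible or suboptimal
print("running average:", x_avg)   # close to the optimal face x1 + x2 = 1
```

In this instance the raw iterate keeps jumping between the vertices (1, 1) and (0, 0), neither of which is optimal, while the average approaches a point with $x_1 + x_2 \approx 1$.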
This decomposability has made the dual subgradient method a pervasive decomposition technique for distributed resource allocation in network utility maximization problems [LL99, PC06].

1.1.2 Primal-Dual Subgradient Method

A close relative of the dual subgradient method is the primal-dual subgradient method, also known as the Arrow-Hurwicz-Uzawa subgradient method. The convex program (1.1)-(1.3) can be solved by the primal-dual subgradient method as described in Algorithm 1.2.

Algorithm 1.2 Primal-Dual Subgradient Method
Let $c > 0$ be a constant step size. Let $x(0) \in \mathcal{X}$ be arbitrary and $\lambda(0) = 0$. At each iteration $t \in \{1, 2, \ldots\}$, update $x(t)$ and $\lambda(t)$ as follows:
• Update primal variables via $x(t) = \mathcal{P}_{\mathcal{X}} \big[ x(t-1) - c \nabla f(x(t-1)) - c \sum_{k=1}^{m} \lambda_k(t-1) \nabla g_k(x(t-1)) \big]$, where $\nabla f(x(t-1))$ is a subgradient of $f(x)$ at the point $x = x(t-1)$, $\nabla g_k(x(t-1))$ is a subgradient of $g_k(x)$ at the point $x = x(t-1)$, and $\mathcal{P}_{\mathcal{X}}[\cdot]$ is the projection onto the convex set $\mathcal{X}$.
• Update dual variables via $\lambda_k(t) = [\lambda_k(t-1) + c\, g_k(x(t-1))]_0^{\lambda_k^{\max}}, \forall k \in \{1, 2, \ldots, m\}$, where $\lambda_k^{\max} > \lambda_k^*, \forall k \in \{1, 2, \ldots, m\}$ with $\lambda_k^*$ defined in Condition 1.1 and $[\cdot]_0^{\lambda_k^{\max}}$ is the projection onto the interval $[0, \lambda_k^{\max}]$.
• Output the running average $\bar{x}(t+1)$ given by $\bar{x}(t+1) = \frac{1}{t+1} \sum_{\tau=0}^{t} x(\tau) = \bar{x}(t) \frac{t}{t+1} + x(t) \frac{1}{t+1}$ as the solution at iteration $t+1$.

Recall that if the strong duality for the problem (1.1)-(1.3) is attained by certain Lagrange multipliers, i.e., Condition 1.1 holds, then by the saddle point theorem for convex programs (Proposition 5.1.6 in [Ber99]), $(x^*, \lambda^*)$ is an optimal solution-multiplier pair if and only if it is a saddle point of the Lagrangian. Let $L(x, \lambda)$ be the Lagrangian, defined in Definition 1.1, for the convex program (1.1)-(1.3). It follows that $\nabla f(x(t-1)) + \sum_{k=1}^{m} \lambda_k(t-1) \nabla g_k(x(t-1)) \in \partial_x L(x(t-1), \lambda(t-1))$ and $g(x(t-1)) \in \partial_\lambda L(x(t-1), \lambda(t-1))$, and hence Algorithm 1.2 can be interpreted as an Arrow-Hurwicz-Uzawa algorithm for finding the saddle points of the Lagrangian $L(x, \lambda)$.
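The updates of Algorithm 1.2 can be sketched in a few lines of Python for the single-constraint case ($m = 1$). This is only an illustration under stated assumptions: the function name `primal_dual_subgradient`, the toy problem, the step size and the bound `lam_max` are hypothetical choices, and the toy problem is smooth, so gradients stand in for subgradients.

```python
def primal_dual_subgradient(grad_f, g, grad_g, project, x0, lam_max, step, num_iters):
    """Sketch of Algorithm 1.2 with a single constraint g(x) <= 0 (m = 1)."""
    x = list(x0)
    lam = 0.0                                          # lambda(0) = 0
    x_sum = [0.0] * len(x)
    for _ in range(num_iters):
        x_sum = [s + xi for s, xi in zip(x_sum, x)]    # accumulate x(t-1) for the average
        gf, gg, gval = grad_f(x), grad_g(x), g(x)      # all evaluated at x(t-1)
        # Primal update: projected (sub)gradient step on L(., lambda(t-1)).
        x_new = project([xi - step * (gfi + lam * ggi)
                         for xi, gfi, ggi in zip(x, gf, gg)])
        # Dual update: step along g(x(t-1)), projected onto [0, lam_max].
        lam = min(max(lam + step * gval, 0.0), lam_max)
        x = x_new
    x_avg = [s / num_iters for s in x_sum]
    return x_avg, lam


# Toy instance (assumed): min (x1-2)^2 + (x2-2)^2  s.t.  x1 + x2 - 2 <= 0,  x in [0, 3]^2.
# Its optimal solution is x* = (1, 1) with multiplier lambda* = 2, so lam_max = 5 suffices.
grad_f = lambda x: [2.0 * (x[0] - 2.0), 2.0 * (x[1] - 2.0)]
g = lambda x: x[0] + x[1] - 2.0
grad_g = lambda x: [1.0, 1.0]
project = lambda x: [min(max(xi, 0.0), 3.0) for xi in x]   # projection onto the box

x_avg, lam = primal_dual_subgradient(grad_f, g, grad_g, project,
                                     [0.0, 0.0], 5.0, 0.05, 5000)
print(x_avg, lam)
```

Note that `lam_max` must exceed the optimal multiplier, which is known here only because the toy problem can be solved by hand.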
Note that if $f(x)$ or $g_k(x)$ are not separable, then each primal update in Algorithm 1.1 requires solving a coupled convex minimization problem over $\mathcal{X}$, which can incur huge complexity when the number of constraints or the primal variable dimension is large. In contrast, Algorithm 1.2 iterates the primal variables using projected gradient updates, which can be performed independently for each component of $x$ as long as the set $\mathcal{X}$ is a Cartesian product. Note that if $\mathcal{X}$ is a complicated set where coordinates are coupled together, we can introduce more inequality constraints and leave $\mathcal{X}$ a simple set. This property makes Algorithm 1.2 suitable for large scale convex programs with non-separable objective or constraint functions. On the other hand, Algorithm 1.2 requires the knowledge of upper bounds for the optimal Lagrange multipliers to determine the algorithm parameters $\lambda_k^{\max}, \forall k \in \{1, 2, \ldots, m\}$ used in the dual updates. These upper bounds can be difficult to estimate for some convex programs.

1.1.3 Drift-Plus-Penalty Technique for Deterministic Optimization

The drift-plus-penalty technique was originally developed to solve more general stochastic optimization [Nee03, GNT06, Nee10] and was shown to be applicable to deterministic convex programs [Nee05, Nee14]. The drift-plus-penalty technique originates from the backpressure algorithm considered in the seminal work [TE92] by Tassiulas and Ephremides. The backpressure algorithm developed in [TE92] performs routing and scheduling by minimizing a Lyapunov drift expression such that all data queues in the stochastic data network are stabilized whenever possible. However, such a backpressure algorithm has no utility performance guarantee when network utilities exist in the considered network.
The drift-plus-penalty technique extends the method in [TE92] by introducing an additional penalty term, corresponding to the network utility, to the drift minimization such that the problem of joint network stability and utility maximization can be solved. Later, this technique is further extended to solve general stochastic optimization, not only stochastic optimization in queueing networks, by introducing virtual queues for general stochastic constraints, not only queue stability constraints.

A deterministic convex program in the form of (1.1)-(1.3) can be solved by the drift-plus-penalty technique as described in Algorithm 1.3. Note that the drift-plus-penalty technique introduces a virtual queue $Q_k(t)$ for each functional constraint (1.2). These virtual queues can be interpreted as the queue backlogs of constraint violations and indeed correspond to physical queue backlogs when each constraint (1.2) is a node flow balance constraint in data network applications. It was noted in [NMR05, HN11, SHN14] that the drift-plus-penalty technique applied to deterministic convex programs is closely related to the dual subgradient method. In fact, if we let $V = \frac{1}{c}$ and $\frac{Q_k(0)}{V} = \lambda_k(0), \forall k \in \{1, 2, \ldots, m\}$, then Algorithm 1.3 and Algorithm 1.1 yield the same sequence of $x(t)$ and $\frac{Q_k(t)}{V} = \lambda_k(t), \forall k \in \{1, 2, \ldots, m\}, \forall t \ge 0$.

Algorithm 1.3 Drift-Plus-Penalty (DPP) Algorithm
Let $V > 0$ be a constant parameter. Let $Q(0) \ge 0$ be arbitrary. At each iteration $t \in \{0, 1, 2, \ldots\}$, update $x(t)$ and $Q(t+1)$ as follows:
• Update primal variables via $x(t) = \mathrm{argmin}_{x \in \mathcal{X}} \big\{ V f(x) + \sum_{k=1}^{m} Q_k(t) g_k(x) \big\}$.
• Update virtual queues via $Q_k(t+1) = \max\{ Q_k(t) + g_k(x(t)), 0 \}, \forall k \in \{1, 2, \ldots, m\}$.
• Output the running average $\bar{x}(t+1)$ given by $\bar{x}(t+1) = \frac{1}{t+1} \sum_{\tau=0}^{t} x(\tau) = \bar{x}(t) \frac{t}{t+1} + x(t) \frac{1}{t+1}$ as the solution at iteration $t+1$.

However, we emphasize that the drift-plus-penalty technique can solve stochastic optimization that is more general than deterministic convex programs.
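The stated equivalence between Algorithm 1.3 and Algorithm 1.1 under $V = 1/c$ and $Q_k(0)/V = \lambda_k(0)$ can be checked numerically on a small example. The sketch below runs both algorithms side by side on a hypothetical one-dimensional problem; the problem, the function names and the parameter values are assumptions made only for this illustration.

```python
def clip(v, lo, hi):
    return min(max(v, lo), hi)


def dual_subgradient(step, lam0, num_iters):
    """Algorithm 1.1 on the toy problem: min x^2  s.t.  1 - x <= 0,  x in [0, 2]."""
    lam, xs, lams = lam0, [], []
    for _ in range(num_iters):
        x = clip(lam / 2.0, 0.0, 2.0)        # argmin of x^2 + lam * (1 - x) over [0, 2]
        xs.append(x)
        lams.append(lam)
        lam = max(lam + step * (1.0 - x), 0.0)
    return xs, lams


def drift_plus_penalty(V, Q0, num_iters):
    """Algorithm 1.3 on the same toy problem."""
    Q, xs, Qs = Q0, [], []
    for _ in range(num_iters):
        x = clip(Q / (2.0 * V), 0.0, 2.0)    # argmin of V * x^2 + Q * (1 - x) over [0, 2]
        xs.append(x)
        Qs.append(Q)
        Q = max(Q + (1.0 - x), 0.0)
    return xs, Qs


step = 0.1
V = 1.0 / step                               # V = 1/c
xs_dual, lams = dual_subgradient(step, lam0=0.5, num_iters=200)
xs_dpp, Qs = drift_plus_penalty(V, Q0=0.5 * V, num_iters=200)   # Q(0)/V = lambda(0)

max_x_gap = max(abs(a - b) for a, b in zip(xs_dual, xs_dpp))
max_lam_gap = max(abs(l - q / V) for l, q in zip(lams, Qs))
print(max_x_gap, max_lam_gap)                # both gaps are at the level of rounding error
```

Because the toy objective is strongly convex, the primal minimizer is unique at every iteration, so the comparison does not depend on how ties in the argmin are broken.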
In addition, the performance analysis of the drift-plus-penalty technique is based on a Lyapunov type analysis of a drift-plus-penalty expression and is fundamentally different from the conventional analysis of the dual subgradient method. In fact, the new Lagrangian methods developed in this thesis inherit heavily from the drift-plus-penalty technique even though many of them are intended to solve deterministic convex programs.

1.1.4 Alternating Direction Method of Multipliers (ADMM)

Now consider a special case of the convex program (1.1)-(1.3) with separable objective functions and linear equality constraints, given as follows:
$$\min \quad f_1(x) + f_2(y) \qquad (1.4)$$
$$\text{s.t.} \quad Ax + By = c, \qquad (1.5)$$
$$x\in\mathcal{X}\subseteq\mathbb{R}^{n_1}, \quad y\in\mathcal{Y}\subseteq\mathbb{R}^{n_2}. \qquad (1.6)$$
Note that a linear inequality constraint can be converted to an equality constraint by introducing a non-negative dummy variable. By augmenting the Lagrangian with a quadratic term of the equality constraint function, we define the augmented Lagrangian as
$$L_\rho(x, y, \lambda) = f_1(x) + f_2(y) + \lambda^T (Ax + By - c) + \frac{\rho}{2}\|Ax + By - c\|^2,$$
where $\rho$ is a constant algorithm parameter. In the ADMM applied to solve the problem (1.4)-(1.6), each iteration consists of the following steps:
• Update $x(t) = \operatorname*{argmin}_{x\in\mathcal{X}} L_\rho(x, y(t-1), \lambda(t-1))$.
• Update $y(t) = \operatorname*{argmin}_{y\in\mathcal{Y}} L_\rho(x(t), y, \lambda(t-1))$.
• Update $\lambda(t) = \lambda(t-1) + \rho\,[Ax(t) + By(t) - c]$.

At each iteration, the ADMM updates primal variables $x$ and $y$ in an alternating manner, also known as the Gauss-Seidel manner, which accounts for the name “alternating direction method”. The isolated update of variables $x$ and $y$ can reduce per-iteration complexity in comparison to jointly choosing $(x, y)$ to minimize $L_\rho(x, y, \lambda)$. The ADMM has recently attracted a lot of interest in machine learning, network scheduling, computational biology and finance. See [BPC+11] for a recent survey on the development and applications of ADMM. However, a significant limitation of the ADMM algorithm is that it can only solve problems with linear constraints.
This is mainly because the quadratic term introduced in the augmented Lagrangian can be non-convex when there are non-linear constraints.

1.2 Convergence Time of Existing Lagrangian Methods

For the convex program (1.1)-(1.3), we define an $\epsilon$-approximate solution as follows.

Definition 1.2 ($\epsilon$-approximate Solution). Let $x^*$ be an optimal solution to the problem (1.1)-(1.3). For any $\epsilon>0$, a point $x_\epsilon\in\mathcal{X}$ is said to be an $\epsilon$-approximate solution if $f(x_\epsilon)\le f(x^*)+\epsilon$ and $g_k(x_\epsilon)\le\epsilon, \forall k\in\{1,\ldots,m\}$.

Note that if $x_\epsilon$ is an $\epsilon$-approximate solution and there exists $z\in\mathcal{X}$ such that $g_k(z)\le-\delta, \forall k\in\{1,\ldots,m\}$ for some $\delta>0$, one can convert an $\epsilon$-approximate point $x_\epsilon$ to another point $x = \theta x_\epsilon + (1-\theta)z$, for $\theta = \frac{\delta}{\epsilon+\delta}$, which satisfies all constraints $g_k(x)\le 0$ and has an objective value within $O(\epsilon)$ of optimality.

The convergence time of an iterative algorithm measures the number of iterations required to obtain an $\epsilon$-approximate solution and is formally defined as follows:

Definition 1.3 (Convergence Time). Let $x(t), t\in\{1,2,\ldots\}$ be the solution sequence yielded by an iterative algorithm. The convergence time (to an $\epsilon$-approximate solution) of this algorithm is the number of iterations required to achieve an $\epsilon$-approximate solution. That is, this algorithm is said to have an $O(h(\epsilon))$ convergence time if $\{x(t), t\ge O(h(\epsilon))\}$ is a sequence of $\epsilon$-approximate solutions.

Alternatively, we have the definition of convergence rates given as follows:

Definition 1.4 (Convergence Rate). Let $\tilde{h}(t)$ be a decreasing function converging to 0 as $t\to\infty$, and let $x(t), t\in\{1,2,\ldots\}$ be the solution sequence yielded by an iterative algorithm. This algorithm is said to have an $O(\tilde{h}(t))$ convergence rate if $f(x(t))\le f(x^*)+\tilde{h}(t)$ and $g_k(x(t))\le\tilde{h}(t), \forall k\in\{1,\ldots,m\}$ for all $t\ge 1$.

The definition of convergence rate is independent of $\epsilon$ and requires the error to eventually converge to zero. In contrast, the definition of convergence time depends on $\epsilon$ and only requires that the solution error is eventually smaller than $\epsilon$.
In this sense, the definition of convergence rate is slightly stronger than the definition of convergence time, and a convergence rate result implies a convergence time result. For example, if a solution sequence $x(t)$ satisfies $f(x(t))\le f(x^*)+\frac{1}{\sqrt{t}}$ and $g_k(x(t))\le\frac{1}{\sqrt{t}}, \forall k\in\{1,\ldots,m\}$ for all $t\ge 1$, then this algorithm has an $O(\frac{1}{\sqrt{t}})$ convergence rate (since the error decays with time like $O(\frac{1}{\sqrt{t}})$) and this further implies that the convergence time of this algorithm is $O(\frac{1}{\epsilon^2})$. However, the definition of convergence time is still quite useful in analyzing the convergence performance of an iterative algorithm since some algorithms fundamentally do not have vanishing errors. In this thesis, we use the terminologies “convergence rate” and “convergence time” interchangeably when proper.

The convergence time results of existing Lagrangian methods are summarized as follows:
• Dual subgradient methods and drift-plus-penalty technique: For general convex programs in the form of (1.1)-(1.3), where the objective function $f(x)$ is convex but not necessarily strongly convex, the convergence time of Algorithm 1.3 is shown to be $O(\frac{1}{\epsilon^2})$ in [Nee05, Nee14]. A similar $O(\frac{1}{\epsilon^2})$ convergence time of Algorithm 1.1 is shown in [NO09a]. A recent work [SHN14] shows that Algorithm 1.3 has an $O(\frac{1}{\epsilon})$ convergence time if the dual function is locally polyhedral and Algorithm 1.3 with a different averaging scheme has an $O(\frac{1}{\epsilon^{1.5}})$ convergence time if the dual function is locally quadratic. For a special class of strongly convex programs in the form of (1.1)-(1.3), where $f(x)$ is second-order differentiable and strongly convex and $g_k(x), \forall k\in\{1,2,\ldots,m\}$ are second-order differentiable and have bounded Jacobians, the convergence time of Algorithm 1.1 is shown to be $O(\frac{1}{\epsilon})$ in [NN14].
• Primal-dual subgradient methods: The convergence time of Algorithm 1.2 is proven to be $O(\frac{1}{\epsilon^2})$ in [NO09b].
• ADMM: The best known convergence time of the ADMM algorithm for the convex program (1.4)-(1.6) with general convex $f_1(\cdot)$ and $f_2(\cdot)$ is recently shown to be $O(\frac{1}{\epsilon})$ [HY12, LMZ15]. An asynchronous ADMM algorithm with the same $O(\frac{1}{\epsilon})$ convergence time is studied in [WO13]. A geometric convergence rate of ADMM is possible under restrictive assumptions such as the strong convexity of the objective functions and rank conditions on the linear equality constraints [HL16, DY16].

1.3 Facts From Convex Analysis

In this section, we introduce basic facts from convex analysis that will be frequently used throughout this thesis. If not specified, we always use $\|\cdot\|$ to denote the Euclidean norm, also known as the $l_2$ norm, of a vector.

Definition 1.5 (Lipschitz Continuity). Let $\mathcal{X}\subseteq\mathbb{R}^n$ be a convex set. Function $h:\mathcal{X}\to\mathbb{R}^m$ is said to be Lipschitz continuous on $\mathcal{X}$ with modulus $L$ if there exists $L>0$ such that $\|h(y)-h(x)\|\le L\|y-x\|, \forall x,y\in\mathcal{X}$.

Definition 1.6 (Smooth Functions). Let $\mathcal{X}\subseteq\mathbb{R}^n$ and let function $h(x)$ be continuously differentiable on $\mathcal{X}$. Function $h(x)$ is said to be smooth on $\mathcal{X}$ with modulus $L$ if $\nabla h(x)$ is Lipschitz continuous on $\mathcal{X}$ with modulus $L$.

Note that the linear function $h(x) = a^T x$ is smooth with modulus 0. If a function $h(x)$ is smooth with modulus $L$, then $c\,h(x)$ is smooth with modulus $cL$ for any $c>0$.

Lemma 1.1 (Descent Lemma, Proposition A.24 in [Ber99]). Let $h$ be a continuously differentiable function. If $h$ is smooth on $\mathcal{X}$ with modulus $L$, then for any $x,y\in\mathcal{X}$ we have
$$h(y)\ge h(x) + [\nabla h(x)]^T(y-x) - \frac{L}{2}\|y-x\|^2,$$
$$h(y)\le h(x) + [\nabla h(x)]^T(y-x) + \frac{L}{2}\|y-x\|^2.$$

Definition 1.7 (Strongly Convex Functions). Let $\mathcal{X}\subseteq\mathbb{R}^n$ be a convex set. Function $h(x)$ is said to be strongly convex on $\mathcal{X}$ with modulus $\alpha$ if there exists a constant $\alpha>0$ such that $h(x)-\frac{1}{2}\alpha\|x\|^2$ is convex on $\mathcal{X}$.

Recall that function $h(x)$ is concave if and only if $-h(x)$ is convex. Similarly, function $h(x)$ is strongly concave with modulus $\alpha$ if and only if $-h(x)$ is strongly convex with modulus $\alpha$.
Or alternatively, function $h(x)$ is strongly concave with modulus $\alpha$ if there exists a constant $\alpha>0$ such that $h(x)+\frac{1}{2}\alpha\|x\|^2$ is concave. The next corollary follows directly from the definition of strongly convex functions.

Corollary 1.1. Let $\mathcal{X}\subseteq\mathbb{R}^n$ be a convex set. If function $h(x)$ is convex on $\mathcal{X}$ and $\alpha>0$, then $h(x)+\alpha\|x-x_0\|^2$ is strongly convex with modulus $2\alpha$ for any constant $x_0\in\mathbb{R}^n$.

Lemma 1.2 (Theorem D.6.1.2 in [HUL01]). Let function $h(x)$ be strongly convex on $\mathcal{X}$ with modulus $\alpha$. Let $\partial h(x)$ be the set of all subgradients of $h$ at point $x$. Then
$$h(y)\ge h(x) + d^T(y-x) + \frac{\alpha}{2}\|y-x\|^2, \quad \forall x,y\in\mathcal{X}, \forall d\in\partial h(x).$$

Lemma 1.3 (Proposition B.24 (f) in [Ber99]). Let $\mathcal{X}\subseteq\mathbb{R}^n$ be a convex set. Let function $h(x)$ be convex on $\mathcal{X}$ and $x^{\mathrm{opt}}$ be a global minimum of $h$ on $\mathcal{X}$. Let $\partial h(x)$ be the set of all subgradients of $h$ at point $x$. Then, there exists $d\in\partial h(x^{\mathrm{opt}})$ such that $d^T(x-x^{\mathrm{opt}})\ge 0, \forall x\in\mathcal{X}$.

Corollary 1.2. Let $\mathcal{X}\subseteq\mathbb{R}^n$ be a convex set. Let function $h(x)$ be strongly convex on $\mathcal{X}$ with modulus $\alpha$ and $x^{\mathrm{opt}}$ be a global minimum of $h$ on $\mathcal{X}$. Then,
$$h(x^{\mathrm{opt}})\le h(x) - \frac{\alpha}{2}\|x^{\mathrm{opt}}-x\|^2, \quad \forall x\in\mathcal{X}.$$

Proof. A special case when $h$ is differentiable and $\mathcal{X}=\mathbb{R}^n$ is Theorem 2.1.8 in [Nes04]. The proof for general $h$ and $\mathcal{X}$ is as follows. Fix $x\in\mathcal{X}$. By Lemma 1.3, there exists $d\in\partial h(x^{\mathrm{opt}})$ such that $d^T(x-x^{\mathrm{opt}})\ge 0$. By Lemma 1.2, we also have
$$h(x)\ge h(x^{\mathrm{opt}}) + d^T(x-x^{\mathrm{opt}}) + \frac{\alpha}{2}\|x-x^{\mathrm{opt}}\|^2 \overset{(a)}{\ge} h(x^{\mathrm{opt}}) + \frac{\alpha}{2}\|x-x^{\mathrm{opt}}\|^2,$$
where (a) follows from the fact that $d^T(x-x^{\mathrm{opt}})\ge 0$.

Similarly, we have the next corollary for strongly concave functions.

Corollary 1.3. Let $\mathcal{X}\subseteq\mathbb{R}^n$ be a convex set. Let function $h(x)$ be strongly concave on $\mathcal{X}$ with modulus $\alpha$ and $x^{\mathrm{opt}}$ be a global maximum of $h$ on $\mathcal{X}$. Then,
$$h(x^{\mathrm{opt}})\ge h(x) + \frac{\alpha}{2}\|x^{\mathrm{opt}}-x\|^2, \quad \forall x\in\mathcal{X}.$$
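As a small worked check of Corollary 1.2, consider an illustrative example that is an assumption of this note rather than part of the thesis: $h(x)=x^2$ on $\mathcal{X}=[1,3]$. Here the minimizer sits on the boundary of $\mathcal{X}$, where the gradient is nonzero, so the example exercises the set-constrained case covered by Lemma 1.3.

```latex
% Illustrative example (an assumption, not from the source): h(x) = x^2 on X = [1, 3].
\begin{align*}
& h(x) - \tfrac{1}{2}\cdot 2\cdot x^2 = 0 \text{ is convex, so } h \text{ is strongly convex with modulus } \alpha = 2, \\
& x^{\mathrm{opt}} = \operatorname*{argmin}_{x\in[1,3]} x^2 = 1, \qquad h(x^{\mathrm{opt}}) = 1, \\
& h(x) - \tfrac{\alpha}{2}\,(x^{\mathrm{opt}} - x)^2 = x^2 - (x-1)^2 = 2x - 1 \;\ge\; 1 = h(x^{\mathrm{opt}}) \quad \forall x\in[1,3].
\end{align*}
```

The inequality holds with equality only at $x = x^{\mathrm{opt}} = 1$, which is consistent with the quadratic growth that the corollary guarantees around a constrained minimizer.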
1.4 Thesis Outline and Our Contributions

This thesis is organized as follows:
• Chapter 2 – Convergence time of dual subgradient methods for strongly convex programs: This chapter considers general strongly convex programs (possibly non-differentiable) and shows that the classical dual subgradient method with simple running averages has $O(\frac{1}{\epsilon})$ convergence. This chapter also shows that if the strongly convex program satisfies additional assumptions, then the dual subgradient method with a new running average scheme, called the sliding running average, can achieve $O(\log(\frac{1}{\epsilon}))$ convergence.
• Chapter 3 – New Lagrangian methods for convex programs: This chapter considers constrained convex programs (possibly without strong convexity) and develops two new Lagrangian methods with fast $O(\frac{1}{\epsilon})$ convergence. The new methods improve on the conventional dual subgradient method and primal-dual subgradient method, both of which are known to have slow $O(\frac{1}{\epsilon^2})$ convergence. The new methods can deal with nonlinear convex inequality constraints that cannot be handled by the alternating direction method of multipliers (ADMM). The first new Lagrangian method is a dual type method and works for general constrained convex programs (with possibly non-differentiable objective and constraint functions). The second one is a primal-dual type method but only works for smooth constrained convex programs. Both methods have the same $O(1/\epsilon)$ convergence time, yet the second one enjoys smaller per-iteration complexity when the objective function or the constraint functions are not separable.
• Chapter 4 – New backpressure algorithms for joint rate control and routing: This chapter considers backpressure algorithms for joint rate control and routing in multi-hop data networks. To achieve an arbitrarily small utility optimality gap, all existing backpressure algorithms necessarily yield arbitrarily large queue lengths.
Inspired by the new Lagrangian dual optimization methods developed in Chapter 3, this chapter proposes new backpressure algorithms that can converge to the exact optimal utility while ensuring all queue lengths are bounded by a finite constant.
• Chapter 5 – Online convex optimization with stochastic constraints: This chapter considers online convex optimization (OCO) with stochastic constraints, which generalizes Zinkevich’s OCO over a known simple fixed set by introducing multiple stochastic functional constraints that are i.i.d. generated at each round and are disclosed to the decision maker only after the decision is made. This formulation arises naturally when decisions are restricted by stochastic environments or deterministic environments with noisy observations. To solve this problem, this chapter proposes a new algorithm that achieves $O(\sqrt{T})$ expected regret and constraint violations and $O(\sqrt{T}\log(T))$ high probability regret and constraint violations.
• Chapter 6 – Online convex optimization with long term constraints: This chapter considers online convex optimization with long term constraints, which is a special case of online convex optimization with stochastic constraints considered in Chapter 5. In online convex optimization with long term constraints, we relax the functional constraints by allowing them to be violated at each round but still requiring them to be satisfied in the long term. Inspired by the Lagrangian dual optimization technique developed in Chapter 3, this chapter develops a new online learning algorithm that can achieve $O(\sqrt{T})$ regret and $O(1)$, i.e., finite, constraint violations.
• Chapter 7 – Power control for energy harvesting devices with outdated state information: This chapter considers utility optimal power control for energy harvesting wireless devices with a finite capacity battery.
The distribution information of the underlying wireless environment and harvestable energy is unknown and only outdated system state information is known at the device controller. This chapter proposes a new online power control algorithm that achieves utility within $O(\epsilon)$ of the optimal, for any desired $\epsilon>0$, by using a battery with an $O(1/\epsilon)$ capacity.
• Chapter 8 – Dynamic transmit covariance design in MIMO fading systems with unknown channel distributions and inaccurate channel state information: This chapter considers dynamic transmit covariance design in point-to-point MIMO fading systems with unknown channel state distributions and inaccurate channel state information subject to both long term and short term power constraints. We develop different dynamic transmit covariance policies for the case of instantaneous channel state information at the transmitter (CSIT) and the case of delayed CSIT, respectively. In either case, the corresponding policy can approach optimality with an $O(\delta)$ gap, where $\delta$ is the inaccuracy measure of CSIT.
• Chapter 9 – Duality codes and the integrality gap bound for index coding: This chapter considers the index coding problem that captures the essence of the classical network coding problem. By studying the integrality gap of an integer linear program, which arises from an information theoretical bound for the index coding problem, and the Lagrangian duality of its linear programming relaxations, we analyze the performance of cyclic codes and partial clique codes for index coding. We also show these codes are optimal when the bipartite digraph representation of the index coding problem is planar.

Chapter 2
Convergence Time of Dual Subgradient Methods for Strongly Convex Programs

Consider the following strongly convex program:
$$\min \quad f(x) \qquad (2.1)$$
$$\text{s.t.} \quad g_k(x)\le 0, \ \forall k\in\{1,2,\ldots,m\} \qquad (2.2)$$
$$x\in\mathcal{X} \qquad (2.3)$$
where the set $\mathcal{X}\subseteq\mathbb{R}^n$ is closed and convex; function $f(x)$ is continuous and strongly convex on $\mathcal{X}$; functions $g_k(x), \forall k\in\{1,2,\ldots,m\}$ are Lipschitz continuous and convex on $\mathcal{X}$. Note that the functions $f(x), g_1(x), \ldots, g_m(x)$ are not necessarily differentiable. Denote the stacked vector of multiple functions $g_1(x), g_2(x), \ldots, g_m(x)$ as $g(x) = [g_1(x), g_2(x), \ldots, g_m(x)]^T$.

Throughout this chapter, the strongly convex program (2.1)-(2.3) is required to satisfy the following assumptions:

Assumption 2.1 (Basic Assumptions).
• There exists an optimal solution $x^*\in\mathcal{X}$ that solves the strongly convex program (2.1)-(2.3).
• The function $f(x)$ is strongly convex on $\mathcal{X}$ with modulus $\sigma$.
• There exists a constant $\beta$ such that $\|g(x_1)-g(x_2)\|\le\beta\|x_1-x_2\|$ for all $x_1,x_2\in\mathcal{X}$. That is, the function $g(x)$ is Lipschitz continuous on $\mathcal{X}$ with modulus $\beta$.

Note that the strong convexity of the function $f(x)$ implies that the optimum is unique.

Assumption 2.2 (Existence of Lagrange Multipliers). Condition 1.1 holds for the strongly convex program (2.1)-(2.3). That is, there exists a Lagrange multiplier vector $\lambda^* = [\lambda_1^*, \lambda_2^*, \ldots, \lambda_m^*]^T\ge 0$ such that $q(\lambda^*) = f(x^*)$, where $x^*$ is the optimal solution to the problem (2.1)-(2.3) and $q(\lambda) = \inf_{x\in\mathcal{X}}\{f(x)+\sum_{k=1}^{m}\lambda_k g_k(x)\}$ is the Lagrangian dual function of the problem (2.1)-(2.3).

The strongly convex program (2.1)-(2.3) arises often in applications such as model predictive control (MPC) [NN14], network flow control [LL99] and decentralized multi-agent control [TTM11]. Algorithm 1.1, the dual subgradient method, is a conventional method to solve (2.1)-(2.3) [BSS06]. It is an iterative algorithm that, at every iteration, removes the inequality constraints (2.2) and chooses primal variables to minimize a function over the set $\mathcal{X}$.
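To make Assumptions 2.1-2.2 and the per-iteration primal minimization concrete, the following worked example uses a hypothetical one-dimensional program that is an assumption of this note, not an example from the thesis: minimize $x^2$ subject to $1-x\le 0$ with $\mathcal{X}=\mathbb{R}$.

```latex
% Illustrative toy program (an assumption, not from the source):
%   min x^2   s.t.   g(x) = 1 - x <= 0,   X = R.
\begin{align*}
& f(x) = x^2 \text{ is strongly convex with modulus } \sigma = 2, \qquad
  |g(x_1) - g(x_2)| = |x_1 - x_2| \ \Rightarrow\ \beta = 1, \\
& x(\lambda) = \operatorname*{argmin}_{x\in\mathbb{R}} \big\{ x^2 + \lambda(1 - x) \big\} = \frac{\lambda}{2}, \qquad
  q(\lambda) = \lambda - \frac{\lambda^2}{4}, \\
& \lambda^* = \operatorname*{argmax}_{\lambda\ge 0}\, q(\lambda) = 2, \qquad
  q(\lambda^*) = 2 - 1 = 1 = f(x^*) \ \text{ with } x^* = 1.
\end{align*}
```

In this example the primal step has the closed form $x(\lambda)=\lambda/2$, the multiplier $\lambda^*=2$ satisfies $q(\lambda^*)=f(x^*)$ as required by Assumption 2.2, and $x(\lambda^*)=1$ recovers the optimal solution.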
Algorithm 1.1 can be interpreted as a subgradient/gradient method applied to the Lagrangian dual function of convex program (2.1)-(2.3) and allows for many different step size rules [Ber99]. Note that by Danskin’s theorem (Proposition B.25(a) in [BSS06]), the Lagrangian dual function of a strongly convex program is differentiable, thus Algorithm 1.1 for strongly convex program (2.1)-(2.3) is in fact a dual gradient method. The classical dual subgradient method with a constant step size, Algorithm 1.1, uses the simple running averages, also called the ergodic sequence in [LPS99], $\bar{x}(t) = \frac{1}{t}\sum_{\tau=0}^{t-1} x(\tau)$ as the solutions at each iteration. In this chapter, we also propose a new running average scheme, called the sliding running averages, as follows:
• Sliding Running Averages: Use $\tilde{x}(0) = x(0)$ and
$$\tilde{x}(t) = \begin{cases} \dfrac{1}{t/2}\displaystyle\sum_{\tau=t/2}^{t-1} x(\tau), & \text{if } t \text{ is even} \\ \tilde{x}(t-1), & \text{if } t \text{ is odd} \end{cases}$$
as the solution at each iteration $t\in\{1,2,\ldots\}$.

This chapter shows that the sliding running averages can have better convergence time when the dual function of the convex program satisfies additional assumptions.

The results in this chapter were originally developed in our papers [YN15, YN18b].

2.1 Related Work

As reviewed in Section 1.2, the convergence time of Algorithm 1.1 for general (possibly without strong convexity) convex programs is $O(\frac{1}{\epsilon^2})$. For a special class of strongly convex programs in the form of (2.1)-(2.3), where $f(x)$ is second-order differentiable and strongly convex and $g_k(x), \forall k\in\{1,2,\ldots,m\}$ are second-order differentiable and have bounded Jacobians, the convergence time of the dual subgradient algorithm is shown to be $O(\frac{1}{\epsilon})$ in [NN14]. Note that convex program (2.1)-(2.3) with second-order differentiable $f(x)$ and $g_k(x), k\in\{1,2,\ldots,m\}$ in general can be solved via interior point methods with geometric convergence.
However, to achieve fast convergence in practice, the barrier parameters must be scaled carefully and the computation complexity associated with each iteration can be high. In contrast, the dual subgradient method can yield distributive implementations with low per-iteration computation complexity when the objective and constraint functions are separable.

This chapter considers a class of strongly convex programs that is more general than those treated in [NN14] (see Footnote 1). Besides the strong convexity of $f(x)$, we only require the constraint functions $g_k(x)$ to be Lipschitz continuous. The functions $f(x)$ and $g_k(x)$ can even be non-differentiable. Thus, this chapter can deal with non-smooth optimization. For example, the $l_1$ norm $\|x\|_1$ is non-differentiable and often appears as part of the objective or constraint functions in machine learning, compressed sensing and image processing applications. This chapter shows that the convergence time of the dual subgradient method with simple running averages for general strongly convex programs is $O(\frac{1}{\epsilon})$ and the convergence time can be improved to $O(\log(\frac{1}{\epsilon}))$ by using sliding running averages when the dual function is locally quadratic.

Footnote 1: Note that bounded Jacobians imply Lipschitz continuity. Work [NN14] also considers the effect of inaccurate solutions for the primal updates. The analysis in this chapter can also deal with inaccurate updates. In this case, there will be an error term $\delta$ on the right of (2.6).

A closely related recent work is [NP16], which considers strongly convex programs with strongly convex and second-order differentiable objective functions $f(x)$ and conic constraints in the form of $Gx+h\in\mathcal{K}$, where $\mathcal{K}$ is a proper cone. The authors in [NP16] show that a hybrid algorithm using both dual subgradient and dual fast gradient methods can have convergence time $O(\frac{1}{\epsilon^{2/3}})$, and the dual subgradient method can have convergence time $O(\log(\frac{1}{\epsilon}))$ if the strongly convex program satisfies an error bound property. Results in this chapter are developed independently and consider general nonlinear convex constraint functions; they show that the dual subgradient/gradient method with a different averaging scheme has an $O(\log(\frac{1}{\epsilon}))$ convergence time when the dual function is locally quadratic. Another independent parallel work is [NPN15], which considers strongly convex programs with strongly convex and smooth objective functions $f(x)$ and general constraint functions $g(x)$ with bounded Jacobians. The authors in [NPN15] show that the dual subgradient/gradient method with simple running averages has $O(\frac{1}{\epsilon})$ convergence.

This chapter and the independent parallel works [NP16, NPN15] obtain similar convergence times of the dual subgradient/gradient method with different averaging schemes for strongly convex programs under slightly different assumptions. However, the proof technique in this chapter is fundamentally different from that used in [NP16] and [NPN15]. Works [NP16, NPN15] and other previous works, e.g., [NN14], follow the classical optimization analysis approach based on the descent lemma, while this chapter is based on the drift-plus-penalty analysis that was originally developed for stochastic optimization in dynamic queueing systems [Nee03, Nee10]. Using the drift-plus-penalty technique, we further propose a new Lagrangian dual type algorithm with $O(\frac{1}{\epsilon})$ convergence for general convex programs (possibly without strong convexity) in Chapter 3.

2.2 Convergence Time Analysis

This section analyzes the convergence time of Algorithm 1.1 for the strongly convex program (2.1)-(2.3) under Assumptions 2.1-2.2.

2.2.1 An Upper Bound of the Drift-Plus-Penalty Expression

Denote $\lambda(t) = [\lambda_1(t), \ldots, \lambda_m(t)]^T$. Define the Lyapunov function $L(t) = \frac{1}{2}\|\lambda(t)\|^2$ and the Lyapunov drift $\Delta(t) = L(t+1)-L(t)$.

Lemma 2.1. At each iteration $t$ in Algorithm 1.1, we have
$$\frac{1}{c}\Delta(t) = [\lambda(t+1)]^T g(x(t)) - \frac{1}{2c}\|\lambda(t+1)-\lambda(t)\|^2. \qquad (2.4)$$

Proof.
The update equations $\lambda_k(t+1) = \max\{\lambda_k(t)+c\,g_k(x(t)), 0\}, \forall k\in\{1,2,\ldots,m\}$ can be rewritten as
$$\lambda_k(t+1) = \lambda_k(t) + c\,\tilde{g}_k(x(t)), \quad \forall k\in\{1,2,\ldots,m\}, \qquad (2.5)$$
where
$$\tilde{g}_k(x(t)) = \begin{cases} g_k(x(t)), & \text{if } \lambda_k(t)+c\,g_k(x(t))\ge 0 \\ -\frac{1}{c}\lambda_k(t), & \text{else} \end{cases}, \quad \forall k\in\{1,2,\ldots,m\}.$$
Fix $k\in\{1,2,\ldots,m\}$. Squaring both sides of (2.5) and dividing by factor 2 yields:
$$\begin{aligned}
\frac{1}{2}[\lambda_k(t+1)]^2
&= \frac{1}{2}[\lambda_k(t)]^2 + \frac{c^2}{2}[\tilde{g}_k(x(t))]^2 + c\lambda_k(t)\tilde{g}_k(x(t)) \\
&= \frac{1}{2}[\lambda_k(t)]^2 + \frac{c^2}{2}[\tilde{g}_k(x(t))]^2 + c\lambda_k(t)g_k(x(t)) + c\lambda_k(t)[\tilde{g}_k(x(t))-g_k(x(t))] \\
&\overset{(a)}{=} \frac{1}{2}[\lambda_k(t)]^2 + \frac{c^2}{2}[\tilde{g}_k(x(t))]^2 + c\lambda_k(t)g_k(x(t)) - c^2\tilde{g}_k(x(t))[\tilde{g}_k(x(t))-g_k(x(t))] \\
&= \frac{1}{2}[\lambda_k(t)]^2 - \frac{c^2}{2}[\tilde{g}_k(x(t))]^2 + c[\lambda_k(t)+c\,\tilde{g}_k(x(t))]g_k(x(t)) \\
&\overset{(b)}{=} \frac{1}{2}[\lambda_k(t)]^2 - \frac{1}{2}[\lambda_k(t+1)-\lambda_k(t)]^2 + c\lambda_k(t+1)g_k(x(t))
\end{aligned}$$
where (a) follows from $\lambda_k(t)[\tilde{g}_k(x(t))-g_k(x(t))] = -c\,\tilde{g}_k(x(t))[\tilde{g}_k(x(t))-g_k(x(t))]$, which can be shown by considering $\tilde{g}_k(x(t)) = g_k(x(t))$ and $\tilde{g}_k(x(t))\ne g_k(x(t))$ separately; and (b) follows from the fact that $\lambda_k(t+1) = \lambda_k(t)+c\,\tilde{g}_k(x(t))$.

Summing over $k\in\{1,2,\ldots,m\}$ yields $\frac{1}{2}\|\lambda(t+1)\|^2 = \frac{1}{2}\|\lambda(t)\|^2 - \frac{1}{2}\|\lambda(t+1)-\lambda(t)\|^2 + c[\lambda(t+1)]^T g(x(t))$. Rearranging the terms and dividing both sides by factor $c$ yields the result.

Lemma 2.2. Let $x^*$, $\sigma$ and $\beta$ be constants defined in Assumption 2.1. If $c\le\frac{\sigma}{\beta^2}$ in Algorithm 1.1, then we have
$$\frac{1}{c}\Delta(t) + f(x(t))\le f(x^*), \quad \forall t\ge 0. \qquad (2.6)$$

Proof. Fix $t\ge 0$. Since $f(x)$ is strongly convex with modulus $\sigma$; $g_k(x), \forall k\in\{1,2,\ldots,m\}$ are convex; and $\lambda_k(t), \forall k\in\{1,2,\ldots,m\}$ are non-negative at each iteration $t$, the function $f(x)+\sum_{k=1}^{m}\lambda_k(t)g_k(x)$ is also strongly convex with modulus $\sigma$ at each iteration $t$. Note that $x(t) = \operatorname*{argmin}_{x\in\mathcal{X}}\big\{f(x)+\sum_{k=1}^{m}\lambda_k(t)g_k(x)\big\}$. By Corollary 1.2 with $x^{\mathrm{opt}} = x(t)$ and $x = x^*$, we have
$$f(x(t)) + \sum_{k=1}^{m}\lambda_k(t)g_k(x(t)) \le f(x^*) + \sum_{k=1}^{m}\lambda_k(t)g_k(x^*) - \frac{\sigma}{2}\|x(t)-x^*\|^2.$$
Hence, $f(x(t))\le f(x^*) + [\lambda(t)]^T[g(x^*)-g(x(t))] - \frac{\sigma}{2}\|x(t)-x^*\|^2$. Adding this inequality to
Adding this inequality to 19 equation (2.4) yields 1 c Δ(t) +f(x(t)) ≤f(x ∗ )− 1 2c kλ(t + 1)−λ(t)k 2 − σ 2 kx(t)− x ∗ k 2 + [λ(t)] T [g(x ∗ )− g(x(t))] + [λ(t + 1)] T g(x(t)). Define B(t) =− 1 2c kλ(t + 1)−λ(t)k 2 − σ 2 kx(t)− x ∗ k 2 + [λ(t)] T [g(x ∗ )− g(x(t))] + [λ(t + 1)] T g(x(t)). Next, we need to show that B(t)≤ 0. Since x ∗ is the optimal solution to the problem (2.1)-(2.3), we have g k (x ∗ ) ≤ 0,∀k ∈ {1, 2,...,m}. Note that λ k (t + 1)≥ 0,∀k∈{1, 2,...,m},∀t≥ 0. Thus, [λ(t + 1)] T g(x ∗ )≤ 0, ∀t≥ 0 (2.7) Now we have, B(t) =− 1 2c kλ(t + 1)−λ(t)k 2 − σ 2 kx(t)− x ∗ k 2 + [λ(t)] T [g(x ∗ )− g(x(t))] + [λ(t + 1)] T g(x(t)) (a) ≤− 1 2c kλ(t + 1)−λ(t)k 2 − σ 2 kx(t)− x ∗ k 2 + [λ(t)] T [g(x ∗ )− g(x(t))] + [λ(t + 1)] T g(x(t)) − [λ(t + 1)] T g(x ∗ ) =− 1 2c kλ(t + 1)−λ(t)k 2 − σ 2 kx(t)− x ∗ k 2 + [λ(t)−λ(t + 1)] T [g(x ∗ )− g(x(t))] (b) ≤− 1 2c kλ(t + 1)−λ(t)k 2 − σ 2 kx(t)− x ∗ k 2 +kλ(t)−λ(t + 1)kkg(x(t))− g(x ∗ )k (c) ≤− 1 2c kλ(t + 1)−λ(t)k 2 − σ 2 kx(t)− x ∗ k 2 +βkλ(t)−λ(t + 1)kkx(t)− x ∗ k =− 1 2c kλ(t + 1)−λ(t)k−cβkx(t)− x ∗ k 2 − 1 2 (σ−cβ 2 )kx(t)− x ∗ k 2 (d) ≤0 where (a) follows from (2.7); (b) follows from the Cauchy-Schwarz inequality; (c) follows from Assumption 2.1; and (d) follows from c≤ σ β 2 . 20 2.2.2 Objective Value Violation Theorem 2.1 (Objective Value Violation). Let x ∗ ,σ and β be constants defined in Assumption 2.1. If c≤ σ β 2 in Algorithm 1.1, then f(x(t))≤f(x ∗ ) + kλ(0)k 2 2ct ,∀t≥ 1. Proof. Fix t≥ 1. By Lemma 2.2, we have 1 c Δ(τ) +f(x(τ))≤f(x ∗ ) for all τ∈{0, 1,...,t− 1}. Summing over τ∈{0, 1,...,t− 1} we have: 1 c t−1 X τ=0 Δ(τ) + t−1 X τ=0 f(x(τ))≤tf(x ∗ ) ⇒ 1 c [L(t)−L(0)] + t−1 X τ=0 f(x(τ))≤tf(x ∗ ) ⇒ 1 t t−1 X τ=0 f(x(τ))≤f(x ∗ ) + L(0)−L(t) ct ≤f(x ∗ ) + L(0) ct Note that x(t) = 1 t P t−1 τ=0 x(τ) and by the convexity of f(x), we have f(x(t))≤ 1 t t−1 X τ=0 f(x(τ))≤f(x ∗ ) + L(0) ct =f(x ∗ ) + kλ(0)k 2 2ct Remark 2.1. Similarly, we can prove thatf(e x(2t))≤f(x ∗ )+ kλ(t)k 2 2ct sincee x(2t) = 1 t P 2t−1 τ=t x(τ). 
A later lemma (Lemma 2.4) guarantees that $\|\lambda(t)\|\le\sqrt{\|\lambda(0)\|^2+\|\lambda^*\|^2}+\|\lambda^*\|$, where $\lambda^*$ is defined in Assumption 2.2. Thus,
$$f(\tilde{x}(2t))\le f(x^*) + \frac{\big[\sqrt{\|\lambda(0)\|^2+\|\lambda^*\|^2}+\|\lambda^*\|\big]^2}{2ct}, \quad \forall t\ge 1.$$

2.2.3 Constraint Violation

The analysis of constraint violations is similar to that in [Nee14] for general convex programs. However, using the improved upper bound in Lemma 2.2, the convergence time of constraint violations in strongly convex programs is order-wise better than that in general convex programs.

Lemma 2.3. For any $t_2>t_1\ge 0$,
$$\lambda_k(t_2)\ge\lambda_k(t_1) + c\sum_{\tau=t_1}^{t_2-1}g_k(x(\tau)), \quad \forall k\in\{1,2,\ldots,m\}.$$
In particular, for any $t>0$,
$$\lambda_k(t)\ge\lambda_k(0) + c\sum_{\tau=0}^{t-1}g_k(x(\tau)), \quad \forall k\in\{1,2,\ldots,m\}.$$

Proof. Fix $k\in\{1,2,\ldots,m\}$. Note that $\lambda_k(t_1+1) = \max\{\lambda_k(t_1)+c\,g_k(x(t_1)), 0\}\ge\lambda_k(t_1)+c\,g_k(x(t_1))$. By induction, this lemma follows.

Lemma 2.4. Let $\sigma$ and $\beta$ be constants defined in Assumption 2.1 and $\lambda^*$ be the Lagrange multiplier vector defined in Assumption 2.2. If $c\le\frac{\sigma}{\beta^2}$ in Algorithm 1.1, then $\lambda(t)$ satisfies
$$\|\lambda(t)\|\le\sqrt{\|\lambda(0)\|^2+\|\lambda^*\|^2}+\|\lambda^*\|, \quad \forall t\ge 1. \qquad (2.8)$$

Proof. Fix $t\ge 1$. Let $x^*$ be the optimal solution to the problem (2.1)-(2.3). Assumption 2.2 implies that
$$f(x^*) = q(\lambda^*) \le f(x(\tau)) + \sum_{k=1}^{m}\lambda_k^* g_k(x(\tau)), \quad \forall\tau\in\{0,1,\ldots,t-1\},$$
where the inequality follows from the definition of $q(\lambda)$. Thus, we have $f(x^*)-f(x(\tau))\le\sum_{k=1}^{m}\lambda_k^* g_k(x(\tau)), \forall\tau\in\{0,1,\ldots,t-1\}$. Summing over $\tau\in\{0,1,\ldots,t-1\}$ yields
$$\begin{aligned}
t f(x^*) - \sum_{\tau=0}^{t-1}f(x(\tau))
&\le \sum_{\tau=0}^{t-1}\sum_{k=1}^{m}\lambda_k^* g_k(x(\tau)) = \sum_{k=1}^{m}\lambda_k^*\Big[\sum_{\tau=0}^{t-1}g_k(x(\tau))\Big] \\
&\overset{(a)}{\le} \frac{1}{c}\sum_{k=1}^{m}\lambda_k^*[\lambda_k(t)-\lambda_k(0)] \\
&\le \frac{1}{c}\sum_{k=1}^{m}\lambda_k^*\lambda_k(t) \\
&\overset{(b)}{\le} \frac{1}{c}\|\lambda^*\|\,\|\lambda(t)\| \qquad (2.9)
\end{aligned}$$
where (a) follows from Lemma 2.3 and (b) follows from the Cauchy-Schwarz inequality. On the
On the 22 other hand, summing (2.6) in Lemma 2.2 over τ∈{0, 1,...,t− 1} yields tf(x ∗ )− t−1 X τ=0 f(x(τ))≥ L(t)−L(0) c = kλ(t)k 2 −kλ(0)k 2 2c (2.10) Combining (2.9) and (2.10) yields kλ(t)k 2 −kλ(0)k 2 2c ≤ 1 c kλ ∗ kkλ(t)k ⇒ kλ(t)k−kλ ∗ k 2 ≤kλ(0)k 2 +kλ ∗ k 2 ⇒ kλ(t)k≤ q kλ(0)k 2 +kλ ∗ k 2 +kλ ∗ k Theorem 2.2 (Constraint Violation). Let σ and β be constants defined in Assumption 2.1 and λ ∗ be the Lagrange multiplier vector defined in Assumption 2.2. If c≤ σ β 2 in Algorithm 1.1, then g k (x(t))≤ p kλ(0)k 2 +kλ ∗ k 2 +kλ ∗ k ct ,∀k∈{1, 2,...,m},∀t≥ 1. Proof. Fix t≥ 1 and k∈{1, 2,...,m}. Recall that x(t) = 1 t P t−1 τ=0 x(τ). Thus, g k (x(t)) (a) ≤ 1 t t−1 X τ=0 g k (x(τ)) (b) ≤ λ k (t)−λ k (0) ct ≤ λ k (t) ct ≤ kλ(t)k ct (c) ≤ p kλ(0)k 2 +kλ ∗ k 2 +kλ ∗ k ct where (a) follows from the convexity ofg k (x),k∈{1, 2,...,m}; (b) follows from Lemma 2.3; and (c) follows from Lemma 2.4. 23 Remark 2.2. Similarly, we can prove that g k (e x(2t))≤ p kλ(0)k 2 +kλ ∗ k 2 +kλ ∗ k ct ,∀k∈{1, 2,...,m},∀t≥ 1. The next corollary provides a lower bound of f(x(t)) and follows directly from Assumption 2.2 and Theorem 2.2. Corollary 2.1. Let σ and β be constants defined in Assumption 2.1 and let λ ∗ be the Lagrange multiplier vector defined in Assumption 2.2. If c≤ σ β 2 in Algorithm 1, then f(x(t))≥f(x ∗ )− 1 t p kλ(0)k 2 +kλ ∗ k 2 +kλ ∗ k c m X k=1 λ ∗ k ,∀t≥ 1. Proof. Fix t≥ 1. By Assumption 2.2, we have f(x(t)) + m X k=1 λ ∗ k g k (x(t))≥q(λ ∗ ) =f(x ∗ ) + m X k=1 λ ∗ k g k (x ∗ ) =f(x ∗ ). Thus, we have f(x(t))≥f(x ∗ )− m X k=1 λ ∗ k g k (x(t)) (a) ≥f(x ∗ )− m X k=1 λ ∗ k p kλ(0)k 2 +kλ ∗ k 2 +kλ ∗ k ct =f(x ∗ )− 1 t p kλ(0)k 2 +kλ ∗ k 2 +kλ ∗ k c m X k=1 λ ∗ k , where (a) follows from the constraint violation bound, i.e., Theorem 2.2, and the fact that λ ∗ k ≥ 0,∀k∈{1, 2,...,m}. 2.2.4 Convergence Time of Algorithm 1.1 The next theorem summarizes Theorem 2.1 and Theorem 2.2. Theorem 2.3. 
Let $x^*$, $\sigma$ and $\beta$ be constants defined in Assumption 2.1 and let $\lambda^*$ be the Lagrange multiplier vector defined in Assumption 2.2. If $c\le\frac{\sigma}{\beta^2}$ in Algorithm 1.1, then for all $t\ge 1$,
$$f(\bar{x}(t))\le f(x^*) + \frac{\|\lambda(0)\|^2}{2ct},$$
$$g_k(\bar{x}(t))\le\frac{\sqrt{\|\lambda(0)\|^2+\|\lambda^*\|^2}+\|\lambda^*\|}{ct}, \quad \forall k\in\{1,2,\ldots,m\}.$$
Specifically, if $\lambda(0) = 0$, then
$$f(\bar{x}(t))\le f(x^*),$$
$$g_k(\bar{x}(t))\le\frac{2\|\lambda^*\|}{ct}, \quad \forall k\in\{1,2,\ldots,m\}.$$

In summary, if $c\le\frac{\sigma}{\beta^2}$ in Algorithm 1.1, then $\bar{x}(t)$ ensures that the error decays like $O(\frac{1}{t})$ and provides an $\epsilon$-approximate solution with convergence time $O(\frac{1}{\epsilon})$.

Remark 2.3. If $c\le\frac{\sigma}{\beta^2}$ in Algorithm 1.1, then $\tilde{x}(t)$ also ensures that the error decays like $O(\frac{1}{t})$ and provides an $\epsilon$-approximate solution with convergence time $O(\frac{1}{\epsilon})$.

2.3 Geometric Convergence of Algorithm 1.1 with Sliding Running Averages

This section shows that the convergence time of Algorithm 1.1 with sliding running averages $\tilde{x}(t)$ is $O(\log(\frac{1}{\epsilon}))$ when the dual function of the problem (2.1)-(2.3) satisfies additional assumptions.

2.3.1 Smooth Dual Function

Recall the definition of smooth functions in Definition 1.6. Define $q(\lambda) = \inf_{x\in\mathcal{X}}\{f(x)+\sum_{k=1}^{m}\lambda_k g_k(x)\} \overset{(a)}{=} \min_{x\in\mathcal{X}}\{f(x)+\sum_{k=1}^{m}\lambda_k g_k(x)\}$ as the Lagrangian dual function of the problem (2.1)-(2.3), where (a) follows because $f(x)$ is strongly convex, $f(x)$ and $g_k(x)$ are continuous and $\mathcal{X}$ is a closed set. For fixed $\lambda\in\mathbb{R}^m_+$, $f(x)+\lambda^T g(x)$ is strongly convex with respect to $x\in\mathcal{X}$ with modulus $\sigma$. Define $x(\lambda) = \operatorname*{argmin}_{x\in\mathcal{X}}\{f(x)+\lambda^T g(x)\}$. By Danskin’s theorem (Proposition B.25 in [Ber99]), $q(\lambda)$ is differentiable with gradient $\nabla_\lambda q(\lambda) = g(x(\lambda))$.

Lemma 2.5 (Smooth Dual Function). The Lagrangian dual function of the problem (2.1)-(2.3), $q(\lambda)$, is smooth on $\mathbb{R}^m_+$ with modulus $\gamma = \frac{\beta^2}{\sigma}$.

Proof. Fix $\lambda,\mu\in\mathbb{R}^m_+$. Let $x(\lambda) = \operatorname*{argmin}_{x\in\mathcal{X}}\{f(x)+\lambda^T g(x)\}$ and $x(\mu) = \operatorname*{argmin}_{x\in\mathcal{X}}\{f(x)+\mu^T g(x)\}$. Recall that for fixed $\lambda\in\mathbb{R}^m_+$, $f(x)+\lambda^T g(x)$ is strongly convex with respect to $x\in\mathcal{X}$ with modulus $\sigma$.
By Corollary 1.2, we have
\[
f(x(\lambda)) + \lambda^T g(x(\lambda)) \le f(x(\mu)) + \lambda^T g(x(\mu)) - \frac{\alpha}{2}\|x(\lambda) - x(\mu)\|^2
\]
and
\[
f(x(\mu)) + \mu^T g(x(\mu)) \le f(x(\lambda)) + \mu^T g(x(\lambda)) - \frac{\alpha}{2}\|x(\lambda) - x(\mu)\|^2.
\]
Summing the above two inequalities and simplifying gives
\[
\alpha\|x(\lambda) - x(\mu)\|^2 \le [\mu - \lambda]^T[g(x(\lambda)) - g(x(\mu))] \overset{(a)}{\le} \|\lambda - \mu\|\|g(x(\lambda)) - g(x(\mu))\| \overset{(b)}{\le} \beta\|\lambda - \mu\|\|x(\lambda) - x(\mu)\|,
\]
where (a) follows from the Cauchy-Schwarz inequality and (b) follows because $g(x)$ is Lipschitz continuous by Assumption 2.1. This implies
\[
\|x(\lambda) - x(\mu)\| \le \frac{\beta}{\alpha}\|\lambda - \mu\|. \tag{2.11}
\]
Thus, we have
\[
\|\nabla q(\lambda) - \nabla q(\mu)\| \overset{(a)}{=} \|g(x(\lambda)) - g(x(\mu))\| \overset{(b)}{\le} \beta\|x(\lambda) - x(\mu)\| \overset{(c)}{\le} \frac{\beta^2}{\alpha}\|\lambda - \mu\|,
\]
where (a) follows from $\nabla_\lambda q(\lambda) = g(x(\lambda))$; (b) follows from the Lipschitz continuity of $g(x)$; and (c) follows from (2.11). Thus, $q(\lambda)$ is smooth on $\mathbb{R}^m_+$ with modulus $\gamma = \frac{\beta^2}{\alpha}$.

Since $\nabla_\lambda q(\lambda(t)) = g(x(t))$, the dynamic of $\lambda(t)$ can be interpreted as the projected gradient method with step size $c$ to solve $\max_{\lambda \in \mathbb{R}^m_+}\{q(\lambda)\}$, where $q(\cdot)$ is a smooth function by Lemma 2.5. Thus, we have the next lemma.

Lemma 2.6. Consider the strongly convex program (2.1)-(2.3) satisfying Assumptions 2.1-2.2. If $c \le \frac{\sigma}{\beta^2}$ in Algorithm 1.1, then
\[
q(\lambda^*) - q(\lambda(t)) \le \frac{1}{2ct}\|\lambda(0) - \lambda^*\|^2, \quad \forall t \ge 1.
\]

Proof. Recall that a projected gradient descent algorithm with step size $c < \frac{1}{\gamma}$ converges to the maximum of a concave function with smooth modulus $\gamma$ with the error decaying like $O(\frac{1}{t})$. Thus, this lemma follows. The proof is essentially the same as the convergence time proof of the projected gradient method for set constrained smooth optimization in [Nes04]. See Section 2.7.1 for the detailed proof.

2.3.2 Problems with Locally Quadratic Dual Functions

In addition to Assumptions 2.1-2.2, we further require the next assumption in this subsection.

Assumption 2.3 (Locally Quadratic Dual Function). Let $\lambda^*$ be a Lagrange multiplier vector of the problem (2.1)-(2.3) defined in Assumption 2.2.
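There exist $D_q > 0$ and $L_q > 0$, where the subscript $q$ denotes locally "quadratic", such that for all $\lambda \in \{\lambda \in \mathbb{R}^m_+ : \|\lambda - \lambda^*\| \le D_q\}$, the dual function $q(\lambda) = \min_{x \in \mathcal{X}}\{f(x) + \sum_{k=1}^{m}\lambda_k g_k(x)\}$ satisfies
\[
q(\lambda^*) \ge q(\lambda) + L_q\|\lambda - \lambda^*\|^2.
\]

Lemma 2.7. Consider the strongly convex program (2.1)-(2.3) satisfying Assumptions 2.1-2.3. Let $q(\lambda)$, $\lambda^*$, $D_q$ and $L_q$ be defined in Assumption 2.3. We have the following properties:
1. If $\lambda \in \mathbb{R}^m_+$ and $q(\lambda^*) - q(\lambda) \le L_q D_q^2$, then $\|\lambda - \lambda^*\| \le D_q$.
2. The Lagrange multiplier defined in Assumption 2.2 is unique.

Proof.
1. Assume not, so that there exists $\lambda' \in \mathbb{R}^m_+$ such that $q(\lambda^*) - q(\lambda') \le L_q D_q^2$ and $\|\lambda' - \lambda^*\| > D_q$. Define $\lambda = (1 - \eta)\lambda^* + \eta\lambda'$ for some $\eta \in (0, 1)$. Note that $\|\lambda - \lambda^*\| = \|\eta(\lambda' - \lambda^*)\| = \eta\|\lambda' - \lambda^*\|$. Thus, we can choose $\eta \in (0, 1)$ such that $\|\lambda - \lambda^*\| = D_q$, i.e., $\eta = \frac{D_q}{\|\lambda' - \lambda^*\|}$. Note that $\lambda \in \mathbb{R}^m_+$ because $\lambda' \in \mathbb{R}^m_+$ and $\lambda^* \in \mathbb{R}^m_+$. Since the dual function $q(\cdot)$ is a concave function, we have $q(\lambda) \ge (1 - \eta)q(\lambda^*) + \eta q(\lambda')$. Thus,
\[
q(\lambda^*) - q(\lambda) \le q(\lambda^*) - \big((1 - \eta)q(\lambda^*) + \eta q(\lambda')\big) = \eta\big(q(\lambda^*) - q(\lambda')\big) \le \eta L_q D_q^2 < L_q D_q^2,
\]
where the last step uses $\eta < 1$. This contradicts Assumption 2.3, which requires $q(\lambda^*) - q(\lambda) \ge L_q\|\lambda - \lambda^*\|^2 = L_q D_q^2$.
2. Assume not, so that there exists $\mu^* \ne \lambda^*$ such that $\mu^* \in \mathbb{R}^m_+$ and $q(\mu^*) = q(\lambda^*)$. By part 1 of this lemma, $\|\mu^* - \lambda^*\| \le D_q$. Thus, we have
\[
q(\mu^*) \overset{(a)}{\le} q(\lambda^*) - L_q\|\mu^* - \lambda^*\|^2 \overset{(b)}{<} q(\lambda^*),
\]
where (a) follows from Assumption 2.3 and (b) follows from the assumption that $\mu^* \ne \lambda^*$. This contradicts the assumption that $q(\mu^*) = q(\lambda^*)$.

Define
\[
T_q = \frac{\|\lambda(0) - \lambda^*\|^2}{2cL_q D_q^2}, \tag{2.12}
\]
where the subscript $q$ denotes locally "quadratic".

Lemma 2.8. Consider the strongly convex program (2.1)-(2.3) satisfying Assumptions 2.1-2.3. If $c \le \frac{\sigma}{\beta^2}$ in Algorithm 1.1, then $\|\lambda(t) - \lambda^*\| \le D_q$ for all $t \ge T_q$, where $T_q$ is defined in (2.12).

Proof. By Lemma 2.6 and Lemma 2.7, if $\frac{1}{2ct}\|\lambda(0) - \lambda^*\|^2 \le L_q D_q^2$, then $\|\lambda(t) - \lambda^*\| \le D_q$. It can be checked that $t \ge \frac{\|\lambda(0) - \lambda^*\|^2}{2cL_q D_q^2}$ implies that $\frac{1}{2ct}\|\lambda(0) - \lambda^*\|^2 \le L_q D_q^2$.

Lemma 2.9.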
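Consider the strongly convex program (2.1)-(2.3) satisfying Assumptions 2.1-2.3. If $c \le \frac{\sigma}{\beta^2}$ in Algorithm 1.1, then
1. $\|\lambda(t) - \lambda^*\| \le \frac{1}{\sqrt{t}}\frac{1}{\sqrt{2cL_q}}\|\lambda(0) - \lambda^*\|$, $\forall t \ge T_q$, where $T_q$ is defined in (2.12).
2. $\|\lambda(t) - \lambda^*\| \le \Big(\frac{1}{\sqrt{1 + 2cL_q}}\Big)^{t - T_q}\|\lambda(T_q) - \lambda^*\| \le \Big(\frac{1}{\sqrt{1 + 2cL_q}}\Big)^{t} D_q (1 + 2cL_q)^{T_q/2}$, $\forall t \ge T_q$, where $T_q$ is defined in (2.12).

Proof.
1. By Lemma 2.6, $q(\lambda^*) - q(\lambda(t)) \le \frac{1}{2ct}\|\lambda(0) - \lambda^*\|^2$, $\forall t \ge 1$. By Lemma 2.8 and Assumption 2.3, $q(\lambda^*) - q(\lambda(t)) \ge L_q\|\lambda(t) - \lambda^*\|^2$, $\forall t \ge T_q$. Thus, we have $L_q\|\lambda(t) - \lambda^*\|^2 \le \frac{1}{2ct}\|\lambda(0) - \lambda^*\|^2$, $\forall t \ge T_q$, which implies that $\|\lambda(t) - \lambda^*\| \le \frac{1}{\sqrt{t}}\frac{1}{\sqrt{2cL_q}}\|\lambda(0) - \lambda^*\|$, $\forall t \ge T_q$.
2. By part 1, we know $\|\lambda(t) - \lambda^*\| \le D_q$, $\forall t \ge T_q$. The second part is essentially a local version of Theorem 12 in [NNG15], which shows that the projected gradient method for set constrained smooth convex optimization converges geometrically if the objective function satisfies a quadratic growth condition. See Section 2.7.2 for the detailed proof.

Corollary 2.2. Consider the strongly convex program (2.1)-(2.3) satisfying Assumptions 2.1-2.3. If $c \le \frac{\sigma}{\beta^2}$ in Algorithm 1.1, then
\[
\|\lambda(2t) - \lambda(t)\| \le 2\Big(\frac{1}{\sqrt{1 + 2cL_q}}\Big)^{t} D_q (1 + 2cL_q)^{T_q/2}, \quad \forall t \ge T_q,
\]
where $T_q$ is defined in (2.12).

Proof.
\[
\|\lambda(2t) - \lambda(t)\| \le \|\lambda(2t) - \lambda^*\| + \|\lambda(t) - \lambda^*\|
\overset{(a)}{\le} \Big(\frac{1}{\sqrt{1 + 2cL_q}}\Big)^{2t} D_q (1 + 2cL_q)^{T_q/2} + \Big(\frac{1}{\sqrt{1 + 2cL_q}}\Big)^{t} D_q (1 + 2cL_q)^{T_q/2}
\overset{(b)}{\le} 2\Big(\frac{1}{\sqrt{1 + 2cL_q}}\Big)^{t} D_q (1 + 2cL_q)^{T_q/2},
\]
where (a) follows from part 2 in Lemma 2.9; and (b) follows from $\frac{1}{\sqrt{1 + 2cL_q}} < 1$.

Theorem 2.4. Consider the strongly convex program (2.1)-(2.3) satisfying Assumptions 2.1-2.3. If $c \le \frac{\sigma}{\beta^2}$ in Algorithm 1.1, then
\[
f(\tilde{x}(2t)) \le f(x^*) + \frac{1}{t}\Big(\frac{1}{\sqrt{1 + 2cL_q}}\Big)^{t}\eta_q, \quad \forall t \ge T_q,
\]
where
\[
\eta_q = \frac{2D_q^2 (1 + 2cL_q)^{T_q} + 2D_q (1 + 2cL_q)^{T_q/2}\big(\sqrt{\|\lambda(0)\|^2 + \|\lambda^*\|^2} + \|\lambda^*\|\big)}{c}
\]
and $T_q$ is defined in (2.12).

Proof. Fix $t \ge T_q$. By Lemma 2.2, we have $\frac{1}{c}\Delta(\tau) + f(x(\tau)) \le f(x^*)$ for all $\tau \in \{0, 1, \ldots\}$. Summing over $\tau \in \{t, t + 1, \ldots, 2t - 1\}$ yields $\frac{1}{c}\sum_{\tau=t}^{2t-1}\Delta(\tau) + \sum_{\tau=t}^{2t-1} f(x(\tau)) \le t f(x^*)$.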
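Since $\sum_{\tau=t}^{2t-1}\Delta(\tau) = L(2t) - L(t)$, dividing by factor $t$ yields
\[
\frac{1}{t}\sum_{\tau=t}^{2t-1} f(x(\tau)) \le f(x^*) + \frac{L(t) - L(2t)}{ct}. \tag{2.13}
\]
Thus, we have
\[
\begin{aligned}
f(\tilde{x}(2t)) &\overset{(a)}{\le} \frac{1}{t}\sum_{\tau=t}^{2t-1} f(x(\tau)) \\
&\overset{(b)}{\le} f(x^*) + \frac{L(t) - L(2t)}{ct} \\
&= f(x^*) + \frac{\|\lambda(t)\|^2 - \|\lambda(2t)\|^2}{2ct} \\
&= f(x^*) + \frac{\|\lambda(t) - \lambda(2t) + \lambda(2t)\|^2 - \|\lambda(2t)\|^2}{2ct} \\
&\overset{(c)}{\le} f(x^*) + \frac{\|\lambda(t) - \lambda(2t)\|^2 + 2\|\lambda(2t)\|\|\lambda(t) - \lambda(2t)\|}{2ct} \\
&\overset{(d)}{\le} f(x^*) + \frac{\Big(2\big(\frac{1}{\sqrt{1 + 2cL_q}}\big)^{t} D_q (1 + 2cL_q)^{T_q/2}\Big)^2}{2ct} + \frac{4\big(\frac{1}{\sqrt{1 + 2cL_q}}\big)^{t} D_q (1 + 2cL_q)^{T_q/2}\|\lambda(2t)\|}{2ct} \\
&\overset{(e)}{\le} f(x^*) + \frac{1}{t}\Big(\frac{1}{\sqrt{1 + 2cL_q}}\Big)^{t}\Big(\frac{2D_q^2 (1 + 2cL_q)^{T_q}}{c} + \frac{2D_q (1 + 2cL_q)^{T_q/2}\|\lambda(2t)\|}{c}\Big) \\
&\overset{(f)}{\le} f(x^*) + \frac{1}{t}\Big(\frac{1}{\sqrt{1 + 2cL_q}}\Big)^{t}\eta_q,
\end{aligned}
\]
where (a) follows from $\tilde{x}(2t) = \frac{1}{t}\sum_{\tau=t}^{2t-1} x(\tau)$ and the convexity of $f(x)$; (b) follows from (2.13); (c) follows from the Cauchy-Schwarz inequality; (d) follows from Corollary 2.2; (e) follows from $\frac{1}{\sqrt{1 + 2cL_q}} < 1$; and (f) follows from $\|\lambda(2t)\| \le \sqrt{\|\lambda(0)\|^2 + \|\lambda^*\|^2} + \|\lambda^*\|$ and the definition of $\eta_q$.

Theorem 2.5. Consider the strongly convex program (2.1)-(2.3) satisfying Assumptions 2.1-2.3. If $c \le \frac{\sigma}{\beta^2}$ in Algorithm 1.1, then
\[
g_k(\tilde{x}(2t)) \le \frac{2D_q (1 + 2cL_q)^{T_q/2}}{ct}\Big(\frac{1}{\sqrt{1 + 2cL_q}}\Big)^{t}, \quad \forall k \in \{1, 2, \ldots, m\}, \forall t \ge T_q,
\]
where $T_q$ is defined in (2.12).

Proof. Fix $t \ge T_q$ and $k \in \{1, 2, \ldots, m\}$. Thus, we have
\[
g_k(\tilde{x}(2t)) \overset{(a)}{\le} \frac{1}{t}\sum_{\tau=t}^{2t-1} g_k(x(\tau)) \overset{(b)}{\le} \frac{1}{ct}\big(\lambda_k(2t) - \lambda_k(t)\big) \le \frac{1}{ct}\|\lambda(2t) - \lambda(t)\| \overset{(c)}{\le} \frac{2D_q (1 + 2cL_q)^{T_q/2}}{ct}\Big(\frac{1}{\sqrt{1 + 2cL_q}}\Big)^{t},
\]
where (a) follows from the convexity of $g_k(x)$; (b) follows from Lemma 2.3; and (c) follows from Corollary 2.2.

The next theorem summarizes both Theorem 2.4 and Theorem 2.5.

Theorem 2.6. Consider the strongly convex program (2.1)-(2.3) satisfying Assumptions 2.1-2.3. If $c \le \frac{\sigma}{\beta^2}$ in Algorithm 1.1, then for all $t \ge T_q$,
\[
f(\tilde{x}(2t)) \le f(x^*) + \frac{1}{t}\Big(\frac{1}{\sqrt{1 + 2cL_q}}\Big)^{t}\eta_q,
\]
\[
g_k(\tilde{x}(2t)) \le \frac{2D_q (1 + 2cL_q)^{T_q/2}}{ct}\Big(\frac{1}{\sqrt{1 + 2cL_q}}\Big)^{t}, \quad \forall k \in \{1, 2, \ldots, m\},
\]
where $\eta_q = \frac{2D_q^2 (1 + 2cL_q)^{T_q} + 2D_q (1 + 2cL_q)^{T_q/2}\big(\sqrt{\|\lambda(0)\|^2 + \|\lambda^*\|^2} + \|\lambda^*\|\big)}{c}$ and $T_q$ is defined in (2.12).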
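As a rough illustration of how the bounds in Theorem 2.6 translate into an iteration count, the sketch below evaluates $T_q$ from (2.12), the constant $\eta_q$, and the two right-hand sides, then searches for the first $t \ge T_q$ at which both fall below a target $\epsilon$. The numerical values of $c$, $L_q$, $D_q$ and the norms are made-up placeholders chosen only for illustration (they do not come from the thesis), and the function names are hypothetical.

```python
import math

def theorem_2_6_bounds(t, c, L_q, D_q, lam0_norm, lam_star_norm, dist0):
    """Evaluate T_q and the two right-hand sides of Theorem 2.6 at index t.

    dist0 stands for ||lambda(0) - lambda*||; every input is an
    illustrative placeholder, not a value taken from the thesis.
    """
    T_q = dist0 ** 2 / (2.0 * c * L_q * D_q ** 2)        # definition (2.12)
    rho = 1.0 / math.sqrt(1.0 + 2.0 * c * L_q)            # geometric factor
    lam_bound = math.sqrt(lam0_norm ** 2 + lam_star_norm ** 2) + lam_star_norm
    growth = (1.0 + 2.0 * c * L_q) ** (T_q / 2.0)         # (1 + 2 c L_q)^(T_q / 2)
    eta_q = (2.0 * D_q ** 2 * growth ** 2 + 2.0 * D_q * growth * lam_bound) / c
    obj_gap = (rho ** t) * eta_q / t                      # bound on f(x~(2t)) - f(x*)
    violation = (rho ** t) * 2.0 * D_q * growth / (c * t)  # bound on g_k(x~(2t))
    return T_q, obj_gap, violation

def first_t_below(eps, **params):
    """Smallest integer t >= T_q at which both bounds are at most eps."""
    T_q, _, _ = theorem_2_6_bounds(1, **params)
    t = max(1, math.ceil(T_q))
    while True:
        _, obj_gap, violation = theorem_2_6_bounds(t, **params)
        if max(obj_gap, violation) <= eps:
            return t
        t += 1

# Placeholder parameters, assumed only for this illustration.
params = dict(c=0.05, L_q=1.0, D_q=1.0, lam0_norm=0.0, lam_star_norm=1.0, dist0=1.0)
iters = {eps: first_t_below(eps, **params) for eps in (1e-2, 1e-4, 1e-8)}
```

Under these assumed parameters, tightening $\epsilon$ by several orders of magnitude increases the required $t$ only additively, which is the behavior that an $O(\log(\frac{1}{\epsilon}))$ convergence time describes.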
In summary, if $c \le \frac{\sigma}{\beta^2}$ in Algorithm 1.1, then $\tilde{x}(t)$ ensures that the error decays like $O\Big(\big(\frac{1}{1 + 2cL_q}\big)^{t/2}\Big)$ and provides an $\epsilon$-approximate solution with convergence time $O(\log(\frac{1}{\epsilon}))$.

2.3.3 Problems with Locally Strongly Concave Dual Functions

The following assumption is stronger than Assumption 2.3 but can be easier to verify in certain cases. For example, if the dual function of the convex program is available, Assumption 2.4 is easier to verify, e.g., by studying the Hessian of the dual function.

Assumption 2.4 (Locally Strongly Concave Dual Function). Let $\lambda^*$ be a Lagrange multiplier vector defined in Assumption 2.2. There exist $D_c > 0$ and $L_c > 0$, where the subscript $c$ denotes locally strongly "concave", such that the dual function $q(\lambda)$ is strongly concave with modulus $L_c$ over $\{\lambda \in \mathbb{R}^m_+ : \|\lambda - \lambda^*\| \le D_c\}$.

The next lemma shows that Assumption 2.4 implies Assumption 2.3.

Lemma 2.10. If the strongly convex program (2.1)-(2.3) satisfies Assumption 2.4, then it also satisfies Assumption 2.3 with $D_q = D_c$ and $L_q = \frac{L_c}{2}$.

Proof. Since $q(\cdot)$ is strongly concave and is maximized at $\lambda^*$, by Corollary 1.3, $q(\lambda^*) \ge q(\lambda) + \frac{L_c}{2}\|\lambda - \lambda^*\|^2$ for all $\lambda \in \{\lambda \in \mathbb{R}^m_+ : \|\lambda - \lambda^*\| \le D_c\}$.

Since Assumption 2.4 implies Assumption 2.3, by the results from the previous subsection, $\tilde{x}(t)$ from Algorithm 1.1 provides an $\epsilon$-approximate solution with convergence time $O(\log(\frac{1}{\epsilon}))$. In this subsection, we show that if the problem (2.1)-(2.3) satisfies Assumption 2.4, then the geometric error decay has a smaller contraction modulus.

The next lemma relates the smoothness of the dual function and Assumption 2.4.

Lemma 2.11. If function $h$ is both smooth with modulus $\gamma$ and strongly concave with modulus $L_c$ over set $\mathcal{X}$, which is not a singleton, then $L_c \le \gamma$.

Proof. This is a basic fact in convex analysis. See Section 2.7.3 for the detailed proof.
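A small numerical sanity check can make Lemma 2.10 and Lemma 2.11 concrete. The sketch below assumes a hypothetical concave quadratic $h(\lambda) = -\frac{1}{2}\lambda^T H \lambda$ with a randomly generated positive definite $H$ (not a dual function from the thesis). For such a function the strong concavity modulus is the smallest eigenvalue of $H$ and the smoothness modulus is the largest, so $L_c \le \gamma$ is immediate. It also compares the geometric factor $\sqrt{1 - cL_c}$ obtained under Assumption 2.4 with the factor $\frac{1}{\sqrt{1 + 2cL_q}}$ obtained under Assumption 2.3 when $L_q = \frac{L_c}{2}$.

```python
import numpy as np

# Hypothetical concave quadratic h(lam) = -0.5 * lam^T H lam, H positive definite.
rng = np.random.default_rng(0)
M = rng.standard_normal((4, 4))
H = M @ M.T + 0.5 * np.eye(4)

eigs = np.linalg.eigvalsh(H)   # eigenvalues of H in ascending order
L_c = eigs[0]                  # strong concavity modulus of h
gamma = eigs[-1]               # smoothness modulus (Lipschitz constant of the gradient)

c = 1.0 / gamma                # largest step size satisfying c <= 1/gamma
L_q = L_c / 2.0                # quadratic-growth constant implied by Lemma 2.10

rate_concave = np.sqrt(1.0 - c * L_c)                # factor under Assumption 2.4
rate_quadratic = 1.0 / np.sqrt(1.0 + 2.0 * c * L_q)  # factor under Assumption 2.3
```

Writing $u = cL_c = 2cL_q$, the comparison reduces to $(1 - u)(1 + u) \le 1$, so `rate_concave` never exceeds `rate_quadratic` for any $u \in [0, 1]$.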
For any problem (2.1)-(2.3) satisfying Assumptions 2.1-2.2 and 2.4, we define
\[
T_c = \frac{\|\lambda(0) - \lambda^*\|^2}{cL_c D_c^2}, \tag{2.14}
\]
where the subscript $c$ denotes locally strongly "concave".

Lemma 2.12. Consider the strongly convex program (2.1)-(2.3) satisfying Assumptions 2.1-2.2 and 2.4. Let $D_c$ and $L_c$ be defined in Assumption 2.4. If $c \le \frac{\sigma}{\beta^2}$ in Algorithm 1.1, then
1. $\|\lambda(t) - \lambda^*\| \le D_c$ for all $t \ge T_c$, where $T_c$ is defined in (2.14).
2. $\|\lambda(t) - \lambda^*\| \le \big(\sqrt{1 - cL_c}\big)^{t - T_c}\|\lambda(T_c) - \lambda^*\| \le \big(\sqrt{1 - cL_c}\big)^{t}\frac{D_c}{(\sqrt{1 - cL_c})^{T_c}}$, $\forall t \ge T_c$, where $T_c$ is defined in (2.14).

Proof.
1. By Lemma 2.10, $q(\cdot)$ is locally quadratic with $D_q = D_c$ and $L_q = \frac{L_c}{2}$. The remaining part of the proof is identical to the proof of Lemma 2.8.
2. By part 1 of this lemma, $\lambda(t) \in \{\lambda \in \mathbb{R}^m_+ : \|\lambda - \lambda^*\| \le D_c\}$, $\forall t \ge T_c$. That is, the dynamic of $\lambda(t)$, $t \ge T_c$, is the same as that in the projected gradient method with step size $c$ to solve $\max_{\lambda \in \{\lambda \in \mathbb{R}^m_+ : \|\lambda - \lambda^*\| \le D_c\}} q(\lambda)$ [footnote 2]. Thus, this part is essentially a local version of the convergence time result of the projected gradient method for set constrained smooth and strongly convex optimization [Nes04]. See Section 2.7.4 for the detailed proof.

Footnote 2: Recall that the projected gradient method with constant step size, when applied to set constrained smooth and strongly convex optimization, converges to the optimal solution at the rate $O(\kappa^t)$, where $\kappa$ is a parameter depending on the step size, the smoothness modulus and the strong convexity modulus [Nes04].

The next corollary follows from Part 2 of Lemma 2.12.

Corollary 2.3. Consider the strongly convex program (2.1)-(2.3) satisfying Assumptions 2.1-2.2 and 2.4. If $c \le \frac{\sigma}{\beta^2}$ in Algorithm 1.1, then
\[
\|\lambda(2t) - \lambda(t)\| \le \big(\sqrt{1 - cL_c}\big)^{t}\frac{2D_c}{(\sqrt{1 - cL_c})^{T_c}}, \quad \forall t \ge T_c,
\]
where $T_c$ is defined in (2.14).

Proof.
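\[
\|\lambda(2t) - \lambda(t)\| \le \|\lambda(2t) - \lambda^*\| + \|\lambda(t) - \lambda^*\|
\overset{(a)}{\le} \big(\sqrt{1 - cL_c}\big)^{2t}\frac{D_c}{(\sqrt{1 - cL_c})^{T_c}} + \big(\sqrt{1 - cL_c}\big)^{t}\frac{D_c}{(\sqrt{1 - cL_c})^{T_c}}
\overset{(b)}{\le} \big(\sqrt{1 - cL_c}\big)^{t}\frac{2D_c}{(\sqrt{1 - cL_c})^{T_c}},
\]
where (a) follows from part 2 of Lemma 2.12 and (b) follows from the fact that $\sqrt{1 - cL_c} < 1$, which is implied by $c \le \frac{1}{\gamma}$ and $L_c \le \gamma$.

Theorem 2.7. Consider the strongly convex program (2.1)-(2.3) satisfying Assumptions 2.1-2.2 and 2.4. If $c \le \frac{\sigma}{\beta^2}$ in Algorithm 1.1, then
\[
f(\tilde{x}(2t)) \le f(x^*) + \frac{1}{t}\big(\sqrt{1 - cL_c}\big)^{t}\eta_c, \quad \forall t \ge T_c,
\]
where
\[
\eta_c = \frac{2D_c^2}{c(\sqrt{1 - cL_c})^{2T_c}} + \frac{2D_c\big(\sqrt{\|\lambda(0)\|^2 + \|\lambda^*\|^2} + \|\lambda^*\|\big)}{c(\sqrt{1 - cL_c})^{T_c}}
\]
is a fixed constant and $T_c$ is defined in (2.14). (The factor $\frac{1}{c}$ in $\eta_c$ is carried over from the $2ct$ denominators in step (d) of the proof below.)

Proof. Fix $t \ge T_c$. By Lemma 2.2, we have $\frac{1}{c}\Delta(\tau) + f(x(\tau)) \le f(x^*)$ for all $\tau \in \{0, 1, \ldots\}$. Summing over $\tau \in \{t, t + 1, \ldots, 2t - 1\}$, we have
\[
\begin{aligned}
&\frac{1}{c}\sum_{\tau=t}^{2t-1}\Delta(\tau) + \sum_{\tau=t}^{2t-1} f(x(\tau)) \le t f(x^*) \\
\Rightarrow\; &\frac{1}{c}[L(2t) - L(t)] + \sum_{\tau=t}^{2t-1} f(x(\tau)) \le t f(x^*) \\
\Rightarrow\; &\frac{1}{t}\sum_{\tau=t}^{2t-1} f(x(\tau)) \le f(x^*) + \frac{L(t) - L(2t)}{ct}.
\end{aligned} \tag{2.15}
\]
Thus, we have
\[
\begin{aligned}
f(\tilde{x}(2t)) &\overset{(a)}{\le} \frac{1}{t}\sum_{\tau=t}^{2t-1} f(x(\tau)) \\
&\overset{(b)}{\le} f(x^*) + \frac{L(t) - L(2t)}{ct} \\
&= f(x^*) + \frac{\|\lambda(t)\|^2 - \|\lambda(2t)\|^2}{2ct} \\
&= f(x^*) + \frac{\|\lambda(t) - \lambda(2t) + \lambda(2t)\|^2 - \|\lambda(2t)\|^2}{2ct} \\
&= f(x^*) + \frac{\|\lambda(t) - \lambda(2t)\|^2 + 2[\lambda(2t)]^T[\lambda(t) - \lambda(2t)]}{2ct} \\
&\overset{(c)}{\le} f(x^*) + \frac{\|\lambda(t) - \lambda(2t)\|^2 + 2\|\lambda(2t)\|\|\lambda(t) - \lambda(2t)\|}{2ct} \\
&\overset{(d)}{\le} f(x^*) + \frac{\Big(\big(\sqrt{1 - cL_c}\big)^{t}\frac{2D_c}{(\sqrt{1 - cL_c})^{T_c}}\Big)^2}{2ct} + \frac{2\big(\sqrt{1 - cL_c}\big)^{t}\frac{2D_c}{(\sqrt{1 - cL_c})^{T_c}}\|\lambda(2t)\|}{2ct} \\
&\overset{(e)}{\le} f(x^*) + \frac{1}{t}\big(\sqrt{1 - cL_c}\big)^{t}\Big(\frac{2D_c^2}{c(\sqrt{1 - cL_c})^{2T_c}} + \frac{2D_c\|\lambda(2t)\|}{c(\sqrt{1 - cL_c})^{T_c}}\Big) \\
&\overset{(f)}{\le} f(x^*) + \frac{1}{t}\big(\sqrt{1 - cL_c}\big)^{t}\eta_c,
\end{aligned}
\]
where (a) follows from the fact that $\tilde{x}(2t) = \frac{1}{t}\sum_{\tau=t}^{2t-1} x(\tau)$ and the convexity of $f(x)$; (b) follows from (2.15); (c) follows from the Cauchy-Schwarz inequality; (d) is true because $\|\lambda(2t) - \lambda(t)\| \le \big(\sqrt{1 - cL_c}\big)^{t}\frac{2D_c}{(\sqrt{1 - cL_c})^{T_c}}$, $\forall t \ge T_c$, by Corollary 2.3; (e) follows from the fact that $\sqrt{1 - cL_c} < 1$; and (f) follows from the fact that $\|\lambda(2t)\| \le \sqrt{\|\lambda(0)\|^2 + \|\lambda^*\|^2} + \|\lambda^*\|$ by Lemma 2.4 and the definition of $\eta_c$.

Theorem 2.8. Consider the strongly convex program (2.1)-(2.3) satisfying Assumptions 2.1-2.2 and 2.4.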
If $c \le \frac{\sigma}{\beta^2}$ in Algorithm 1.1, then
\[
g_k(\tilde{x}(2t)) \le \frac{1}{t}\big(\sqrt{1 - cL_c}\big)^{t}\frac{2D_c}{c(\sqrt{1 - cL_c})^{T_c}}, \quad \forall k \in \{1, 2, \ldots, m\}, \forall t \ge T_c,
\]
where $T_c$ is defined in (2.14).

Proof. Fix $t \ge T_c$ and $k \in \{1, 2, \ldots, m\}$. Thus, we have
\[
g_k(\tilde{x}(2t)) \overset{(a)}{\le} \frac{1}{t}\sum_{\tau=t}^{2t-1} g_k(x(\tau)) \overset{(b)}{\le} \frac{1}{ct}[\lambda_k(2t) - \lambda_k(t)] \le \frac{1}{ct}\|\lambda(2t) - \lambda(t)\| \overset{(c)}{\le} \frac{1}{t}\big(\sqrt{1 - cL_c}\big)^{t}\frac{2D_c}{c(\sqrt{1 - cL_c})^{T_c}},
\]
where (a) follows from the convexity of $g_k(\cdot)$; (b) follows from Lemma 2.3; and (c) follows from Corollary 2.3.

The next theorem summarizes both Theorem 2.7 and Theorem 2.8.

Theorem 2.9. Consider the strongly convex program (2.1)-(2.3) satisfying Assumptions 2.1-2.2 and 2.4. If $c \le \frac{\sigma}{\beta^2}$ in Algorithm 1.1, then for all $t \ge T_c$,
\[
f(\tilde{x}(2t)) \le f(x^*) + \frac{1}{t}\big(\sqrt{1 - cL_c}\big)^{t}\eta_c,
\]
\[
g_k(\tilde{x}(2t)) \le \frac{1}{t}\big(\sqrt{1 - cL_c}\big)^{t}\frac{2D_c}{c(\sqrt{1 - cL_c})^{T_c}}, \quad \forall k \in \{1, 2, \ldots, m\},
\]
where $\eta_c = \frac{2D_c^2}{c(\sqrt{1 - cL_c})^{2T_c}} + \frac{2D_c\big(\sqrt{\|\lambda(0)\|^2 + \|\lambda^*\|^2} + \|\lambda^*\|\big)}{c(\sqrt{1 - cL_c})^{T_c}}$ is a fixed constant (with the factor $\frac{1}{c}$ inherited from the $\frac{1}{2ct}$ scaling in the proof of Theorem 2.7) and $T_c$ is defined in (2.14). In summary, if $c \le \frac{\sigma}{\beta^2}$ in Algorithm 1.1, then $\tilde{x}(t)$ ensures that the error decays like $O\big(\frac{1}{t}(1 - cL_c)^{t/2}\big)$ and provides an $\epsilon$-approximate solution with convergence time $O(\log(\frac{1}{\epsilon}))$.

Under Assumptions 2.1-2.2 and 2.4, Theorem 2.9 shows that if $c \le \frac{\sigma}{\beta^2}$, then $\tilde{x}(t)$ provides an $\epsilon$-approximate solution with convergence time $O(\log(\frac{1}{\epsilon}))$. Since $L_q = \frac{L_c}{2}$ by Lemma 2.10 and $\sqrt{1 - cL_c} \le \frac{1}{\sqrt{1 + 2cL_q}}$, the geometric contraction modulus shown in Theorem 2.9 under Assumption 2.4 is smaller than the geometric contraction modulus shown in Theorem 2.6 under Assumption 2.3.

2.3.4 Discussion

Practical Implementation

Assumptions 2.3 and 2.4 are in general difficult to verify. However, we note that to ensure $\tilde{x}(t)$ provides the better $O(\log(\frac{1}{\epsilon}))$ convergence time, we only require $c \le \frac{\sigma}{\beta^2}$, which is independent of the parameters in Assumptions 2.3 or 2.4. Namely, in practice, we can blindly apply Algorithm 1.1 to the problem (2.1)-(2.3) with no need to verify Assumption 2.3 or 2.4.
If the problem (2.1)-(2.3) happens to satisfy Assumption 2.3 or 2.4, then $\tilde{x}(t)$ enjoys the faster convergence time $O(\log(\frac{1}{\epsilon}))$. If not, then $\tilde{x}(t)$ or $\bar{x}(t)$ at least have convergence time $O(\frac{1}{\epsilon})$.

Local Assumption and Local Geometric Convergence

Since Assumption 2.3 only requires the "quadratic" property to be satisfied in a local radius $D_q$ around $\lambda^*$, the error of Algorithm 1.1 starts to decay like $O\Big(\frac{1}{t}\big(\frac{1}{\sqrt{1 + 2cL_q}}\big)^{t}\Big)$ only after $\lambda(t)$ arrives at the $D_q$ local radius after $T_q$ iterations, where $T_q$ is independent of the approximation requirement $\epsilon$ and hence is order $O(1)$. Thus, Algorithm 1.1 provides an $\epsilon$-approximate solution with convergence time $O(\log(\frac{1}{\epsilon}))$. However, it is possible that $T_q$ is relatively large if $D_q$ is small. In fact, $T_q > 0$ because Assumption 2.3 only requires the dual function to have the "quadratic" property in a local radius. Thus, the theory developed in this section can deal with a large class of problems. On the other hand, if the dual function has the "quadratic" property globally, i.e., for all $\lambda \ge 0$, then $T_q = 0$ and the error of Algorithm 1.1 decays like $O\Big(\frac{1}{t}\big(\frac{1}{\sqrt{1 + 2cL_q}}\big)^{t}\Big)$, $\forall t \ge 1$. A similar tradeoff holds with respect to Assumption 2.4.

2.4 Applications

2.4.1 Strongly Convex Programs Satisfying Non-Degenerate Constraint Qualifications

Theorem 2.10. Consider the following strongly convex program:
\[
\begin{aligned}
\min \quad & f(x) && (2.16) \\
\text{s.t.} \quad & g_k(x) \le 0, \; \forall k \in \{1, 2, \ldots, m\} && (2.17) \\
& x \in \mathbb{R}^n && (2.18)
\end{aligned}
\]
where $f(x)$ is a second-order continuously differentiable and strongly convex function; $g_k(x)$, $\forall k \in \{1, 2, \ldots, m\}$, are Lipschitz continuous, second-order continuously differentiable and convex functions. Let $x^*$ be the unique solution to this strongly convex program.
1. Let $K \subseteq \{1, 2, \ldots, m\}$ be the set of active constraints, i.e., $g_k(x^*) = 0$, $\forall k \in K$, and denote the vector composed of $g_k(x)$, $k \in K$, as $g_K$. If $g(x)$ has a bounded Jacobian and $\operatorname{rank}(\nabla_x g_K(x^*)^T) = |K|$, then Assumptions 2.1-2.3 hold for this problem.
2.
If $g(x)$ has a bounded Jacobian and $\operatorname{rank}(\nabla_x g(x^*)^T) = m$, then Assumptions 2.1-2.4 hold for this problem.

Proof. See Section 2.7.5.

Corollary 2.4. Consider the following strongly convex program with linear inequality constraints:
\[
\begin{aligned}
\min \quad & f(x) && (2.19) \\
\text{s.t.} \quad & Ax \le b && (2.20)
\end{aligned}
\]
where $f(x)$ is a second-order continuously differentiable and strongly convex function; and $A$ is an $m \times n$ matrix.
1. Let $x^*$ be the optimal solution. Assume $Ax^* \le b$ has $l$ rows that hold with equality, and let $A'$ be the $l \times n$ submatrix of $A$ corresponding to these "active" rows. If $\operatorname{rank}(A') = l$, then Assumptions 2.1-2.3 hold for this problem.
2. If $\operatorname{rank}(A) = m$, then Assumptions 2.1-2.4 hold for this problem with $D_c = \infty$.

2.4.2 Network Utility Maximization with Independent Link Capacity Constraints

Consider a network with $l$ links and $n$ flow streams. Let $\{b_1, b_2, \ldots, b_l\}$ be the capacities of each link and $\{x_1, x_2, \ldots, x_n\}$ be the rates of each flow stream. Let $\mathcal{N}(k) \subseteq \{1, 2, \ldots, n\}$, $1 \le k \le l$, be the set of flow streams that use link $k$. This problem is to maximize the utility function $\sum_{i=1}^{n} w_i \log(x_i)$ with $w_i > 0$, $\forall 1 \le i \le n$, which represents a measure of network fairness [Kel97], subject to the capacity constraint of each link. This problem is known as the network utility maximization (NUM) problem and can be formulated as follows [footnote 3]:
\[
\begin{aligned}
\min \quad & \sum_{i=1}^{n} -w_i \log(x_i) \\
\text{s.t.} \quad & \sum_{i \in \mathcal{N}(k)} x_i \le b_k, \; \forall k \in \{1, 2, \ldots, l\} \\
& x_i \ge 0, \; \forall i \in \{1, 2, \ldots, n\}
\end{aligned}
\]
Typically, many link capacity constraints in the above formulation are redundant, e.g., if $\mathcal{N}(k_1) = \mathcal{N}(k_2)$ and $b_{k_1} \le b_{k_2}$, then the capacity constraint of the $k_2$-th link is redundant. Assume that redundant link capacity constraints are eliminated and the remaining links are reindexed. The above formulation can be rewritten as follows:
\[
\begin{aligned}
\min \quad & \sum_{i=1}^{n} -w_i \log(x_i) && (2.21) \\
\text{s.t.} \quad & Ax \le b && (2.22) \\
& x \ge 0 && (2.23)
\end{aligned}
\]
where $w_i > 0$, $\forall 1 \le i \le n$; $A = [a_1, \cdots, a_n]$ is a 0-1 matrix of size $m \times n$ such that $a_{ij} = 1$ if and only if flow $x_j$ uses link $i$; and $b > 0$.
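The passage from the per-link formulation to the matrix form (2.21)-(2.23) can be sketched in a few lines. The helper names below are hypothetical, flows are indexed from 0 for convenience, and the redundancy filter only handles the one case named in the text (two links carrying exactly the same set of flows), so it is a partial illustration rather than a complete redundancy test.

```python
import numpy as np

def drop_redundant_links(link_flows, capacities):
    """For each distinct flow set N(k), keep only the smallest capacity.

    Only the identical-flow-set case from the text is handled here.
    """
    best = {}
    for flows, cap in zip(link_flows, capacities):
        key = frozenset(flows)
        if key not in best or cap < best[key]:
            best[key] = cap
    kept = sorted(best.items(), key=lambda kv: sorted(kv[0]))
    return [sorted(k) for k, _ in kept], [cap for _, cap in kept]

def routing_matrix(link_flows, n):
    """Build the 0-1 matrix A with A[i, j] = 1 iff flow j uses link i."""
    A = np.zeros((len(link_flows), n))
    for i, flows in enumerate(link_flows):
        A[i, list(flows)] = 1.0
    return A

def num_objective(x, w):
    """Objective (2.21): sum_i -w_i log(x_i)."""
    return -np.sum(w * np.log(x))

# Toy network (made-up data): links 0 and 1 carry the same flows {0, 1}.
links = [[0, 1], [0, 1], [1, 2]]
caps = [5.0, 3.0, 4.0]
kept_links, b = drop_redundant_links(links, caps)
A = routing_matrix(kept_links, n=3)
```

In this toy example the link with capacity 5 is dropped because another link with the same flow set has capacity 3, leaving a two-row matrix $A$ and $b = [3, 4]^T$.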
Note that problem (2.21)-(2.23) satisfies Assumptions 2.1 and 2.2. By the results from Section 2.2, $\bar{x}(t)$ has an $O(\frac{1}{\epsilon})$ convergence time for this problem. The next theorem provides sufficient conditions such that $\tilde{x}(t)$ can have the better convergence time $O(\log(\frac{1}{\epsilon}))$.

Footnote 3: In this paper, the NUM problem is always formulated as a minimization problem. Without loss of optimality, we define $\log(0) = -\infty$ and hence $\log(\cdot)$ is defined over $\mathbb{R}_+$. Alternatively, we can replace the non-negative rate constraints with $x_i \ge x_i^{\min}$, $\forall i \in \{1, 2, \ldots, n\}$, where $x_i^{\min}$, $\forall i \in \{1, 2, \ldots, n\}$, are sufficiently small positive numbers.

Theorem 2.11. The network utility maximization problem (2.21)-(2.23) has the following properties:
1. Let $b^{\max} = \max_{1 \le i \le m} b_i$ and $x^{\max} > 0$ such that $x_i^{\max} > b^{\max}$, $\forall i \in \{1, \ldots, n\}$. The network utility maximization problem (2.21)-(2.23) is equivalent to the following problem
\[
\begin{aligned}
\min \quad & \sum_{i=1}^{n} -w_i \log(x_i) && (2.24) \\
\text{s.t.} \quad & Ax \le b && (2.25) \\
& 0 \le x \le x^{\max} && (2.26)
\end{aligned}
\]
2. Let $x^*$ be the optimal solution. Assume $Ax^* \le b$ has $m'$ rows that hold with equality, and let $A'$ be the $m' \times n$ submatrix of $A$ corresponding to these "active" rows. If $\operatorname{rank}(A') = m'$, then Assumptions 2.1-2.3 hold for this problem.
3. If $\operatorname{rank}(A) = m$, then Assumptions 2.1-2.4 hold for this problem.

Proof. See Section 2.7.6.

Remark 2.4. Theorem 2.11 and Corollary 2.4 complement each other. If $\operatorname{rank}(A) = m$, we can apply Theorem 2.11 to problem (2.21)-(2.23). However, to apply Corollary 2.4, we require $\operatorname{rank}(B) = m + n$, where $B = \begin{bmatrix} A \\ I_n \end{bmatrix}$. This is always false since the size of $B$ is $(m + n) \times n$. Thus, Corollary 2.4 cannot be applied to problem (2.21)-(2.23) even if $\operatorname{rank}(A) = m$. On the other hand, Corollary 2.4 considers general utilities while Theorem 2.11 is restricted to the utility $\sum_{i=1}^{n} -w_i \log(x_i)$.

Now we give an example of network utility maximization such that Assumption 2.3 is not satisfied.
Consider the problem (2.21)-(2.23) with $w = [1, 1, 1, 1]^T$,
\[
A = [a_1, a_2, a_3, a_4] = \begin{bmatrix} 1 & 1 & 0 & 0 \\ 0 & 0 & 1 & 1 \\ 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \end{bmatrix}
\]
and $b = [3, 7, 2, 8]^T$. Note that $\operatorname{rank}(A) = 3 < m$; and $\mu^T A = [0, 0, 0, 0]$ and $\mu^T b = 0$ if $\mu = [1, 1, -1, -1]^T$. It can be checked that the optimal solution to this NUM problem is $[x_1^*, x_2^*, x_3^*, x_4^*]^T = [0.8553, 2.1447, 1.1447, 5.8553]^T$. Note that all capacity constraints are tight and $[\lambda_1^*, \lambda_2^*, \lambda_3^*, \lambda_4^*]^T = [0.3858, 0.0903, 0.7833, 0.0805]^T$ is the optimal dual variable that attains strong duality.

Next, we show by contradiction that $q(\lambda)$ is not locally quadratic at $\lambda = \lambda^*$. Assume that there exist $D_q > 0$ and $L_q > 0$ such that
\[
q(\lambda) \le q(\lambda^*) - L_q\|\lambda - \lambda^*\|^2 \quad \text{for any } \lambda \in \mathbb{R}^m_+ \text{ with } \|\lambda - \lambda^*\| \le D_q.
\]
Put $\lambda = \lambda^* + t\mu$ with $|t|$ sufficiently small such that $\lambda^* + t\mu \in \mathbb{R}^m_+$ and $\|\lambda^* + t\mu - \lambda^*\| < D_q$. Note that by (2.32) and (2.33), we have
\[
\mu^T \nabla_\lambda q(\lambda^*) = \sum_{i=1}^{n}\frac{\mu^T a_i}{[\lambda^*]^T a_i} + \mu^T b = 0, \tag{2.27}
\]
\[
\mu^T \nabla_\lambda^2 q(\lambda^*)\mu = 0. \tag{2.28}
\]
Thus, we have
\[
q(\lambda^* + t\mu) \overset{(a)}{=} q(\lambda^*) + t\mu^T \nabla_\lambda \tilde{q}(\lambda^*) + \tfrac{1}{2}t^2 \mu^T \nabla_\lambda^2 \tilde{q}(\lambda^*)\mu + o(t^2\|\mu\|^2) \overset{(b)}{=} q(\lambda^*) + o(t^2\|\mu\|^2),
\]
where (a) follows from the second-order Taylor expansion and (b) follows from equations (2.27) and (2.28). By the definition of $o(t^2\|\mu\|^2)$, there exists $\delta > 0$ such that $\frac{|o(t^2\|\mu\|^2)|}{\|t\mu\|^2} < L_q$, $\forall t \in (-\delta, \delta) \setminus \{0\}$, i.e., $o(t^2\|\mu\|^2) > -L_q\|t\mu\|^2$. This implies
\[
q(\lambda^* + t\mu) = q(\lambda^*) + o(t^2\|\mu\|^2) > q(\lambda^*) - L_q\|t\mu\|^2,
\]
which is a contradiction. Thus, $q(\lambda)$ is not locally quadratic at $\lambda = \lambda^*$.

In view of the above example, the sufficient condition in Part 2 of Theorem 2.11 for Assumption 2.3 is sharp.

2.5 Numerical Experiment

2.5.1 Network Utility Maximization

Consider the simple NUM problem described in Figure 2.1. Let $x_1$, $x_2$ and $x_3$ be the data rates of streams 1, 2 and 3, and let the objective be to minimize $-\log(x_1) - 2\log(x_2) - 3\log(x_3)$. It can be checked that capacity constraints other than $x_1 + x_2 + x_3 \le 10$, $x_1 + x_2 \le 8$, and $x_2 + x_3 \le 8$ are redundant.
By Theorem 2.11, the NUM problem can be formulated as follows:
\[
\begin{aligned}
\min \quad & -\log(x_1) - 2\log(x_2) - 3\log(x_3) \\
\text{s.t.} \quad & Ax \le b \\
& 0 \le x \le x^{\max}
\end{aligned}
\]
where
\[
A = \begin{bmatrix} 1 & 1 & 1 \\ 1 & 1 & 0 \\ 0 & 1 & 1 \end{bmatrix}, \quad b = \begin{bmatrix} 10 \\ 8 \\ 8 \end{bmatrix} \quad \text{and} \quad x^{\max} = \begin{bmatrix} 11 \\ 11 \\ 11 \end{bmatrix}.
\]
The optimal solution to this NUM problem is $x_1^* = 2$, $x_2^* = 3.2$, $x_3^* = 4.8$ and the optimal value is $-7.7253$. Note that the second capacity constraint $x_1 + x_2 \le 8$ is loose and the other two capacity constraints are tight.

Since the objective function is decomposable, the dual subgradient method can yield a distributed solution. This is why the dual subgradient method is widely used to solve NUM problems [LL99]. It can be checked that the objective function is strongly convex with modulus $\sigma = \frac{2}{121}$ on $\mathcal{X} = \{0 \le x \le x^{\max}\}$ and $g$ is Lipschitz continuous with modulus $\beta \le \sqrt{6}$ on $\mathcal{X}$. Figure 2.2 verifies the convergence of $\bar{x}(t)$ with $c = \frac{\sigma}{\beta^2} = 1/363$ and $\lambda_1(0) = \lambda_2(0) = \lambda_3(0) = 0$. Since $\lambda_1(0) = \lambda_2(0) = \lambda_3(0) = 0$, by Theorem 2.1, we know $f(\bar{x}(t)) \le f(x^*)$, $\forall t > 0$, which is also verified in Figure 2.2. To verify the convergence time of constraint violations, Figure 2.3 plots $g_1(\bar{x}(t))$, $g_2(\bar{x}(t))$, $g_3(\bar{x}(t))$ and $1/t$ with both x-axis and y-axis in $\log_{10}$ scales. As observed in Figure 2.3, the curves of $g_1(\bar{x}(t))$ and $g_3(\bar{x}(t))$ are parallel to the curve of $1/t$ for large $t$. Note that $g_2(\bar{x}(t)) \le 0$ is satisfied early because this constraint is loose. Figure 2.3 verifies the convergence time of $\bar{x}(t)$ in Theorem 2.3 by showing that the error decays like $O(\frac{1}{t})$ and suggests that the convergence time is actually $\Theta(\frac{1}{\epsilon})$ for this NUM problem.

Note that $\operatorname{rank}(A) = 3$. By Theorem 2.11, this NUM problem satisfies Assumptions 2.1-2.4. Apply Algorithm 1.1 with $c = \frac{\sigma}{\beta^2} = 1/363$ and $\lambda_1(0) = \lambda_2(0) = \lambda_3(0) = 0$ to this NUM problem. Figure 2.4 verifies the convergence of the objective and constraint functions for $\tilde{x}(t)$. Figure 2.5 verifies the results in Theorem 2.11 that the convergence time of $\tilde{x}(t)$ is $O(\log(\frac{1}{\epsilon}))$ by showing that the error decays like $O(\frac{1}{t}0.998^t)$.
If we compare Figure 2.5 and Figure 2.3, we can observe that $\tilde{x}(t)$ converges much faster than $\bar{x}(t)$.

Figure 2.1: A simple NUM problem with 3 flow streams. [Figure omitted: network diagram with streams 1, 2 and 3 and link capacities 8, 8, 8, 8 and 10.]

Figure 2.2: Convergence of $\bar{x}(t)$ from Algorithm 1.1 for a NUM problem. [Figure omitted: two panels plotting objective values and constraint values against iterations $t$.]

Figure 2.3: Convergence time of $\bar{x}(t)$ from Algorithm 1.1 for a NUM problem. [Figure omitted: log-log plot of constraint violations against iterations $t$, compared with $1/t$; the curves are parallel for large $t$.]

Figure 2.4: Convergence of $\tilde{x}(t)$ from Algorithm 1.1 for a NUM problem. [Figure omitted: two panels plotting objective values and constraint values against iterations $t$.]
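The experiment of Section 2.5.1 can be reproduced approximately with the following sketch. It assumes that Algorithm 1.1 takes the dual subgradient form described in this chapter: a primal minimization of the Lagrangian over the box $0 \le x \le x^{\max}$, followed by $\lambda(t+1) = \max\{\lambda(t) + c\,g(x(t)), 0\}$. For the logarithmic utility the primal step has a closed form, $x_i = \min\{w_i / (a_i^T\lambda), x_i^{\max}\}$. The function names and the iteration count `T` are illustrative choices, not taken from the thesis.

```python
import numpy as np

# Problem data from Section 2.5.1.
A = np.array([[1.0, 1.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0]])
b = np.array([10.0, 8.0, 8.0])
w = np.array([1.0, 2.0, 3.0])
x_max = np.array([11.0, 11.0, 11.0])
c = 1.0 / 363.0                       # step size c = sigma / beta^2
x_star = np.array([2.0, 3.2, 4.8])    # optimal solution reported in the text

def primal_update(lam):
    """Minimize sum_i -w_i log(x_i) + lam^T (A x - b) over 0 <= x <= x_max."""
    price = A.T @ lam                  # aggregate link price seen by each flow
    with np.errstate(divide="ignore"):
        x = w / price                  # stationary point; inf where the price is 0
    return np.minimum(x, x_max)        # clip to the upper bound of the box

def run(num_iters):
    """Run the dual subgradient iteration from lambda(0) = 0."""
    lam = np.zeros(3)
    xs = np.empty((num_iters, 3))
    for t in range(num_iters):
        xs[t] = primal_update(lam)
        lam = np.maximum(lam + c * (A @ xs[t] - b), 0.0)
    return xs, lam

T = 4000                               # illustrative horizon
xs, lam = run(T)
x_bar = xs.mean(axis=0)                # simple running average over tau = 0..T-1
x_tilde = xs[T // 2:].mean(axis=0)     # sliding running average over tau = T/2..T-1
err_bar = np.linalg.norm(x_bar - x_star)
err_tilde = np.linalg.norm(x_tilde - x_star)
```

Comparing `err_tilde` with `err_bar` gives a quick numerical counterpart to the comparison between Figure 2.5 and Figure 2.3: the sliding average discards the early transient, whereas the simple average keeps carrying it with weight $1/t$.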
Figure 2.5: Convergence time of $\tilde{x}(t)$ from Algorithm 1.1 for a NUM problem. [Figure omitted: log-log plot of constraint violations against iterations $t$, compared with $\frac{1}{t^{3/2}}$ and $\frac{1}{t}(0.9984)^t$.]

2.5.2 Linear Constrained Quadratic Program

Consider the following quadratic program (QP)
\[
\begin{aligned}
\min \quad & x^T P x + c^T x \\
\text{s.t.} \quad & Ax \le b
\end{aligned}
\]
where $P = \begin{bmatrix} 1 & 2 \\ 2 & 5 \end{bmatrix}$, $c = [1, 1]^T$, $A = \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix}$ and $b = [-2, -1]^T$. The optimal solution to this quadratic program is $x^* = [-1, -1]^T$ and the optimal value is 8. If $P$ is a diagonal matrix, the dual subgradient method can yield a distributed solution. It can be checked that the objective function is strongly convex with modulus $\sigma = 0.34$ and each row of the linear inequality constraint is Lipschitz continuous with modulus $\zeta = \sqrt{2}$. Figure 2.6 verifies the convergence of $\bar{x}(t)$ for the objective and constraint functions yielded by Algorithm 1.1 with $c = \frac{\sigma}{2\zeta^2} = 0.34/4$, $\lambda_1(0) = 0$ and $\lambda_2(0) = 0$. Figure 2.7 verifies the convergence time of $\bar{x}(t)$ proven in Theorem 2.3 by showing that the error decays like $O(\frac{1}{t})$ and suggests that the convergence time is actually $\Theta(\frac{1}{\epsilon})$ for this quadratic program.

Figure 2.6: Convergence of $\bar{x}(t)$ from Algorithm 1.1 for a quadratic program. [Figure omitted: two panels plotting objective values and constraint values against iterations $t$.]

Note that $\operatorname{rank}(A) = 2$.
By Corollary 2.4, this quadratic program satisfies Assumptions 2.1-2.4. Apply Algorithm 1.1 with $c = \frac{\sigma}{2\zeta^2} = 0.34/4$ and $\lambda_1(0) = \lambda_2(0) = 0$ to this quadratic program. Figure 2.8 verifies the convergence of the objective and constraint functions. Figure 2.9 verifies the results in Corollary 2.4 that the convergence time of $\tilde{x}(t)$ is $O(\log(\frac{1}{\epsilon}))$ by showing that the error decays like $O(\frac{1}{t}0.9935^t)$. If we compare Figure 2.9 and Figure 2.7, we can observe that $\tilde{x}(t)$ converges much faster than $\bar{x}(t)$.

Figure 2.7: Convergence time of $\bar{x}(t)$ from Algorithm 1.1 for a quadratic program. [Figure omitted: log-log plot of constraint violations against iterations $t$, compared with $1/t$; the curves are parallel for large $t$.]

Figure 2.8: Convergence of $\tilde{x}(t)$ from Algorithm 1.1 for a quadratic program. [Figure omitted: two panels plotting objective values and constraint values against iterations $t$.]

2.5.3 Large Scale Quadratic Program

Consider the quadratic program $\min_{x \in \mathbb{R}^N}\{x^T Q x + d^T x : Ax \le b\}$ where $Q, A \in \mathbb{R}^{N \times N}$ and $d, b \in \mathbb{R}^N$. $Q = U\Sigma U^H \in \mathbb{R}^{N \times N}$, where $U$ is the orthonormal basis for a random $N \times N$ zero mean and unit variance normal matrix and $\Sigma$ is the diagonal matrix with entries from uniform $[1, 3]$. $A$ is a random $N \times N$ zero mean and unit variance normal matrix. $d$ and $b$ are random vectors with entries from uniform $[0, 1]$. On a PC with a 4-core 2.7 GHz Intel i7 CPU
and 16 GB of memory, we run both Algorithm 1.1 and quadprog from Matlab, which by default uses the interior point method, over randomly generated large scale quadratic programs with $N = 400, 600, 800, 1000$ and $1200$. For each problem size $N$, the running time is the average over 100 random quadratic programs and is plotted in Figure 2.10. To solve these large scale quadratic programs, the dual subgradient method has updates $x(t) = -\frac{1}{2}Q^{-1}[A^T\lambda(t) + d]$ and $\lambda(t + 1) = \max\{\lambda(t) + c[Ax(t) - b], 0\}$ at each iteration $t$. Note that we only need to compute the inverse of the large matrix $Q$ once and can then reuse it during all iterations. In our numerical simulations, Algorithm 1.1 is terminated when the error (both objective violations and constraint violations) of $\tilde{x}(t)$ is less than 1e-5.

Figure 2.9: Convergence time of $\tilde{x}(t)$ from Algorithm 1.1 for a quadratic program. [Figure omitted.]

Figure 2.10: The average running time for large scale quadratic programs. [Figure omitted: running time in seconds against problem dimension $N$ for quadprog and the dual subgradient method.]

2.6 Chapter Summary

This chapter studies the convergence time of the dual subgradient method for strongly convex programs. This chapter shows that the convergence time of the dual subgradient method with simple running averages for general strongly convex programs is $O(\frac{1}{\epsilon})$. This chapter also considers
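a variation of the primal averages, called the sliding running averages, and shows that if the dual function is locally quadratic then the convergence time is $O(\log(\frac{1}{\epsilon}))$.

2.7 Supplement to this Chapter

2.7.1 Proof of Lemma 2.6

Note that $\lambda_k(t+1) = \max\{\lambda_k(t) + c g_k(x(t)), 0\}, \forall k \in \{1, 2, \ldots, m\}$ can be interpreted as $\lambda(t+1) = P_{\mathbb{R}^m_+}[\lambda(t) + c g(x(t))]$, where $P_{\mathbb{R}^m_+}[\cdot]$ is the projection onto $\mathbb{R}^m_+$. As observed before, the dynamic of $\lambda(t)$ can be interpreted as the projected gradient method with step size $c$ to solve $\max_{\lambda \in \mathbb{R}^m_+} \{q(\lambda)\}$. Thus, the proof given below is essentially the same as the convergence time proof of the projected gradient method for set constrained smooth optimization in [Nes04].

Fact 2.1. $\lambda(t+1) = \operatorname{argmax}_{\lambda \in \mathbb{R}^m_+} \{ q(\lambda(t)) + [g(x(t))]^T [\lambda - \lambda(t)] - \frac{1}{2c} \|\lambda - \lambda(t)\|^2 \}, \forall t \ge 0$.

Proof. Fix $t \ge 0$,
\[
\begin{aligned}
\lambda(t+1) &= P_{\mathbb{R}^m_+}[\lambda(t) + c g(x(t))] \\
&= \operatorname*{argmin}_{\lambda \in \mathbb{R}^m_+} \|\lambda - [\lambda(t) + c g(x(t))]\|^2 \\
&= \operatorname*{argmin}_{\lambda \in \mathbb{R}^m_+} \{ c^2 \|g(x(t))\|^2 - 2c [g(x(t))]^T [\lambda - \lambda(t)] + \|\lambda - \lambda(t)\|^2 \} \\
&\overset{(a)}{=} \operatorname*{argmin}_{\lambda \in \mathbb{R}^m_+} \{ -q(\lambda(t)) - [g(x(t))]^T [\lambda - \lambda(t)] + \tfrac{1}{2c} \|\lambda - \lambda(t)\|^2 \} \\
&= \operatorname*{argmax}_{\lambda \in \mathbb{R}^m_+} \{ q(\lambda(t)) + [g(x(t))]^T [\lambda - \lambda(t)] - \tfrac{1}{2c} \|\lambda - \lambda(t)\|^2 \},
\end{aligned}
\]
where (a) follows because the minimizer is unchanged when we remove the constant term $c^2 \|g(x(t))\|^2$, divide by the factor $2c$, and add the constant term $-q(\lambda(t))$ in the objective function.

Recall that $q(\lambda)$ is smooth with modulus $\gamma = \frac{\beta^2}{\sigma}$ by Lemma 2.5.

Fact 2.2. If $c \le \frac{1}{\gamma} = \frac{\sigma}{\beta^2}$, then $q(\lambda(t+1)) \ge q(\lambda(t)), \forall t \ge 0$.

Proof. Fix $t \ge 0$,
\[
\begin{aligned}
q(\lambda(t+1)) &\overset{(a)}{\ge} q(\lambda(t)) + [g(x(t))]^T [\lambda(t+1) - \lambda(t)] - \tfrac{\gamma}{2} \|\lambda(t+1) - \lambda(t)\|^2 \\
&\overset{(b)}{\ge} q(\lambda(t)) + [g(x(t))]^T [\lambda(t+1) - \lambda(t)] - \tfrac{1}{2c} \|\lambda(t+1) - \lambda(t)\|^2 \\
&\overset{(c)}{\ge} q(\lambda(t)) + [g(x(t))]^T [\lambda(t) - \lambda(t)] - \tfrac{1}{2c} \|\lambda(t) - \lambda(t)\|^2 \\
&= q(\lambda(t)),
\end{aligned}
\]
where (a) follows from Lemma 1.1 and the fact that $\nabla_\lambda q(\lambda(t)) = g(x(t))$; (b) follows from $c \le \frac{1}{\gamma}$; and (c) follows from Fact 2.1.

Fact 2.3. $[g(x(t))]^T [\lambda(t+1) - \lambda^*] \ge \frac{1}{c} [\lambda(t+1) - \lambda(t)]^T [\lambda(t+1) - \lambda^*], \forall t \ge 0$.

Proof. Fix $t \ge 0$. By the projection theorem (Proposition B.11(b) in [Ber99]), we have
\[
[\lambda(t+1) - (\lambda(t) + c g(x(t)))]^T [\lambda(t+1) - \lambda^*] \le 0.
\]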
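Thus, $[g(x(t))]^T [\lambda(t+1) - \lambda^*] \ge \frac{1}{c} [\lambda(t+1) - \lambda(t)]^T [\lambda(t+1) - \lambda^*]$.

Fact 2.4. If $c \le \frac{1}{\gamma} = \frac{\sigma}{\beta^2}$, then $q(\lambda^*) - q(\lambda(t+1)) \le \frac{1}{2c} \|\lambda(t) - \lambda^*\|^2 - \frac{1}{2c} \|\lambda(t+1) - \lambda^*\|^2, \forall t \ge 0$.

Proof. Fix $t \ge 0$,
\[
\begin{aligned}
q(\lambda(t+1)) &\overset{(a)}{\ge} q(\lambda(t)) + [g(x(t))]^T [\lambda(t+1) - \lambda(t)] - \tfrac{\gamma}{2} \|\lambda(t+1) - \lambda(t)\|^2 \\
&= q(\lambda(t)) + [g(x(t))]^T [\lambda(t+1) - \lambda^* + \lambda^* - \lambda(t)] - \tfrac{\gamma}{2} \|\lambda(t+1) - \lambda(t)\|^2 \\
&\overset{(b)}{\ge} q(\lambda(t)) + [g(x(t))]^T [\lambda^* - \lambda(t)] - \tfrac{\gamma}{2} \|\lambda(t+1) - \lambda(t)\|^2 + \tfrac{1}{c} [\lambda(t+1) - \lambda(t)]^T [\lambda(t+1) - \lambda^*] \\
&\overset{(c)}{\ge} q(\lambda(t)) + [g(x(t))]^T [\lambda^* - \lambda(t)] - \tfrac{\gamma}{2} \|\lambda(t+1) - \lambda(t)\|^2 + \tfrac{1}{2c} \|\lambda(t+1) - \lambda(t)\|^2 \\
&\qquad + \tfrac{1}{2c} \|\lambda(t+1) - \lambda^*\|^2 - \tfrac{1}{2c} \|\lambda(t) - \lambda^*\|^2 \\
&\overset{(d)}{\ge} q(\lambda(t)) + [g(x(t))]^T [\lambda^* - \lambda(t)] + \tfrac{1}{2c} \|\lambda(t+1) - \lambda^*\|^2 - \tfrac{1}{2c} \|\lambda(t) - \lambda^*\|^2 \\
&\overset{(e)}{\ge} q(\lambda^*) + \tfrac{1}{2c} \|\lambda(t+1) - \lambda^*\|^2 - \tfrac{1}{2c} \|\lambda(t) - \lambda^*\|^2,
\end{aligned}
\]
where (a) follows from Lemma 1.1 and the fact that $\nabla_\lambda q(\lambda(t)) = g(x(t))$; (b) follows from Fact 2.3; (c) follows from the identity $u^T v = \frac{1}{2} \|u\|^2 + \frac{1}{2} \|v\|^2 - \frac{1}{2} \|u - v\|^2, \forall u, v \in \mathbb{R}^m$; (d) follows from $c \le \frac{1}{\gamma}$; and (e) follows from the concavity of $q(\cdot)$. Rearranging terms yields the desired result.

Fix $c \le \frac{1}{\gamma}$ and $t > 0$. By Fact 2.4, we have $q(\lambda^*) - q(\lambda(\tau+1)) \le \frac{1}{2c} \|\lambda(\tau) - \lambda^*\|^2 - \frac{1}{2c} \|\lambda(\tau+1) - \lambda^*\|^2, \forall \tau \in \{0, 1, \ldots, t-1\}$. Summing over $\tau$ and dividing by the factor $t$ yields
\[
\frac{1}{t} \sum_{\tau=0}^{t-1} [q(\lambda^*) - q(\lambda(\tau+1))] \le \frac{1}{2ct} [\|\lambda(0) - \lambda^*\|^2 - \|\lambda(t) - \lambda^*\|^2] \le \frac{1}{2ct} \|\lambda(0) - \lambda^*\|^2.
\]
Note that $q(\lambda^*) - q(\lambda(\tau+1)), \forall \tau \in \{0, 1, \ldots, t-1\}$ is a nonincreasing sequence by Fact 2.2. Thus, we have
\[
q(\lambda^*) - q(\lambda(t)) \le \frac{1}{t} \sum_{\tau=0}^{t-1} [q(\lambda^*) - q(\lambda(\tau+1))] \le \frac{1}{2ct} \|\lambda(0) - \lambda^*\|^2.
\]

2.7.2 Proof of Part 2 of Lemma 2.9

This part is essentially a local version of Theorem 12 in [NNG15], which shows that the projected gradient method for set constrained smooth convex optimization converges geometrically if the objective function satisfies a quadratic growth condition.

In this subsection, we provide a simple proof that directly follows from Fact 2.4 and Assumption 2.3. By Fact 2.4, we have
\[
q(\lambda^*) - q(\lambda(t+1)) \le \frac{1}{2c} \|\lambda(t) - \lambda^*\|^2 - \frac{1}{2c} \|\lambda(t+1) - \lambda^*\|^2, \quad \forall t \ge 0. \tag{2.29}
\]
By part 1, we know $\|\lambda(t) - \lambda^*\| \le D_q, \forall t \ge T_q$. By Assumption 2.3, we have
\[
q(\lambda^*) - q(\lambda(t+1)) \ge L_q \|\lambda(t+1) - \lambda^*\|^2, \quad \forall t \ge T_q. \tag{2.30}
\]
Combining (2.29) and (2.30) yields
\[
\Big(L_q + \frac{1}{2c}\Big) \|\lambda(t+1) - \lambda^*\|^2 \le \frac{1}{2c} \|\lambda(t) - \lambda^*\|^2, \quad \forall t \ge T_q.
\]
This can be written as
\[
\|\lambda(t+1) - \lambda^*\| \le \sqrt{\frac{1}{1 + 2cL_q}} \, \|\lambda(t) - \lambda^*\|, \quad \forall t \ge T_q.
\]
Thus, this part follows by induction.

2.7.3 Proof of Lemma 2.11

Define $\tilde{h}(x) = -h(x)$. Then $\tilde{h}$ is smooth with modulus $\gamma$ and strongly convex with modulus $L_c$ over the set $\mathcal{X}$. By the definition of smooth functions, $h$ must be differentiable over the set $\mathcal{X}$. By Lemma 1.2, we have
\[
\tilde{h}(y) \ge \tilde{h}(x) + [\nabla \tilde{h}(x)]^T (y - x) + \frac{L_c}{2} \|y - x\|^2, \quad \forall x, y \in \mathcal{X}.
\]
By Lemma 1.1,
\[
\tilde{h}(y) \le \tilde{h}(x) + [\nabla \tilde{h}(x)]^T (y - x) + \frac{\gamma}{2} \|y - x\|^2, \quad \forall x, y \in \mathcal{X}.
\]
Since $\mathcal{X}$ is not a singleton, we can choose distinct $x, y \in \mathcal{X}$. Combining the above two inequalities yields $L_c \le \gamma$.

2.7.4 Proof of Part 2 of Lemma 2.12

By the first part of this lemma, $\lambda(t) \in \{\lambda \in \mathbb{R}^m_+ : \|\lambda - \lambda^*\| \le D_c\}, \forall t \ge T_c$. The remaining part of the proof is essentially a local version of the convergence time proof of the projected gradient method for set constrained smooth and strongly convex optimization [Nes04].

Recall that $q(\lambda)$ is smooth with modulus $\gamma = \frac{\beta^2}{\sigma}$ by Lemma 2.5. The next fact is an enhancement of Fact 2.4 using the local strong concavity of the dual function.

Fact 2.5. If $c \le \frac{1}{\gamma} = \frac{\sigma}{\beta^2}$, then $q(\lambda^*) - q(\lambda(t+1)) \le (\frac{1}{2c} - \frac{L_c}{2}) \|\lambda(t) - \lambda^*\|^2 - \frac{1}{2c} \|\lambda(t+1) - \lambda^*\|^2, \forall t \ge T_c$.

Proof.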
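Fix $t \ge T_c$,
\[
\begin{aligned}
q(\lambda(t+1)) &\overset{(a)}{\ge} q(\lambda(t)) + [g(x(t))]^T [\lambda(t+1) - \lambda(t)] - \tfrac{\gamma}{2} \|\lambda(t+1) - \lambda(t)\|^2 \\
&= q(\lambda(t)) + [g(x(t))]^T [\lambda(t+1) - \lambda^* + \lambda^* - \lambda(t)] - \tfrac{\gamma}{2} \|\lambda(t+1) - \lambda(t)\|^2 \\
&\overset{(b)}{\ge} q(\lambda(t)) + [g(x(t))]^T [\lambda^* - \lambda(t)] - \tfrac{\gamma}{2} \|\lambda(t+1) - \lambda(t)\|^2 + \tfrac{1}{c} [\lambda(t+1) - \lambda(t)]^T [\lambda(t+1) - \lambda^*] \\
&\overset{(c)}{\ge} q(\lambda(t)) + [g(x(t))]^T [\lambda^* - \lambda(t)] - \tfrac{\gamma}{2} \|\lambda(t+1) - \lambda(t)\|^2 + \tfrac{1}{2c} \|\lambda(t+1) - \lambda(t)\|^2 \\
&\qquad + \tfrac{1}{2c} \|\lambda(t+1) - \lambda^*\|^2 - \tfrac{1}{2c} \|\lambda(t) - \lambda^*\|^2 \\
&\overset{(d)}{\ge} q(\lambda(t)) + [g(x(t))]^T [\lambda^* - \lambda(t)] + \tfrac{1}{2c} \|\lambda(t+1) - \lambda^*\|^2 - \tfrac{1}{2c} \|\lambda(t) - \lambda^*\|^2 \\
&\overset{(e)}{\ge} q(\lambda^*) + \tfrac{1}{2c} \|\lambda(t+1) - \lambda^*\|^2 + \Big(\tfrac{L_c}{2} - \tfrac{1}{2c}\Big) \|\lambda(t) - \lambda^*\|^2,
\end{aligned}
\]
where (a) follows from Lemma 1.1 and the fact that $\nabla_\lambda q(\lambda(t)) = g(x(t))$; (b) follows from Fact 2.3; (c) follows from the identity $u^T v = \frac{1}{2} \|u\|^2 + \frac{1}{2} \|v\|^2 - \frac{1}{2} \|u - v\|^2, \forall u, v \in \mathbb{R}^m$; (d) follows from $c \le \frac{1}{\gamma}$; and (e) follows from the fact that $q(\cdot)$ is strongly concave over the set $\{\lambda \in \mathbb{R}^m_+ : \|\lambda - \lambda^*\| \le D_c\}$ such that $q(\lambda^*) \le q(\lambda(t)) + [g(x(t))]^T [\lambda^* - \lambda(t)] - \frac{L_c}{2} \|\lambda^* - \lambda(t)\|^2$ by Lemma 1.2 (see footnote 4). Rearranging terms yields the desired inequality.

Footnote 4: Note that the dual function $q(\lambda)$ is differentiable and has gradient $\nabla_\lambda q(\lambda(t)) = g(x(t))$ by the strong convexity of $f(x)$ and Proposition B.25 in [Ber99]. Applying Lemma 1.2 to $\tilde{q}(\lambda) = -q(\lambda)$, which has gradient $\nabla_\lambda \tilde{q}(\lambda(t)) = -g(x(t))$ and is strongly convex over the set $\{\lambda \in \mathbb{R}^m_+ : \|\lambda - \lambda^*\| \le D_c\}$, yields $q(\lambda^*) \le q(\lambda(t)) + [g(x(t))]^T [\lambda^* - \lambda(t)] - \frac{L_c}{2} \|\lambda^* - \lambda(t)\|^2$.

Note that $q(\lambda^*) - q(\lambda(t+1)) \ge 0, \forall t > 0$. Combining with Fact 2.5 yields $(\frac{1}{2c} - \frac{L_c}{2}) \|\lambda(t) - \lambda^*\|^2 - \frac{1}{2c} \|\lambda(t+1) - \lambda^*\|^2 \ge 0, \forall t \ge T_c$. Recall that $c \le \frac{1}{\gamma}$ implies that $c \le \frac{1}{L_c}$ by Lemma 2.11. Thus, we have
\[
\|\lambda(t+1) - \lambda^*\| \le \sqrt{1 - cL_c} \, \|\lambda(t) - \lambda^*\|, \quad \forall t \ge T_c.
\]
By induction, we have
\[
\|\lambda(t) - \lambda^*\| \le \big(\sqrt{1 - cL_c}\big)^{t - T_c} \|\lambda(T_c) - \lambda^*\|, \quad \forall t \ge T_c.
\]

2.7.5 Proof of Theorem 2.10

Lemma 2.13. Let $q(\lambda) : \mathbb{R}^m_+ \to \mathbb{R}$ be a concave function and let $q(\lambda)$ be maximized at $\lambda = \lambda^* \ge 0$. Suppose the following conditions are satisfied:
1. Suppose $\nabla_\lambda q(\lambda^*) = d \le 0$ and $\lambda^*_k d_k = 0, \forall k \in \{1, \ldots, m\}$. Denote $\mathcal{K} = \{k \in \{1, \ldots, m\} : d_k = 0\}$ and $l = |\mathcal{K}|$.
2. Suppose $\nabla^2_\lambda q(\lambda^*) = U \Sigma U^T$ where $\Sigma \prec 0$ is an $n \times n$ negative definite matrix and $U$ is an $m \times n$ matrix. Let $U'$ be the $l \times n$ submatrix of $U$ composed of the rows with indices in $\mathcal{K}$. Suppose that $\operatorname{rank}(U') = l$.
Then, there exist $D_q > 0$ and $L_q > 0$ such that $q(\lambda) \le q(\lambda^*) - L_q \|\lambda - \lambda^*\|^2$ for any $\lambda \in \mathbb{R}^m_+$ with $\|\lambda - \lambda^*\| \le D_q$.

Proof. Without loss of generality, assume that $\mathcal{K} = \{1, \ldots, l\}$. Denote $U = \begin{bmatrix} U' \\ U'' \end{bmatrix}$ where $U''$ is the $(m-l) \times n$ matrix composed of the $(l+1)$-th to $m$-th rows of $U$. Since $d \le 0$, let $\delta = \min_{l+1 \le k \le m} \{|d_k|\}$ such that $d_k \le -\delta, \forall k \in \{l+1, \ldots, m\}$. For each $\lambda$, we define $\mu$ via $\mu_k = \lambda_k - \lambda^*_k, \forall k \in \{1, \ldots, l\}$, $\mu_k = 0, \forall k \in \{l+1, \ldots, m\}$ and $\nu$ via $\nu_k = 0, \forall k \in \{1, \ldots, l\}$, $\nu_k = \lambda_k - \lambda^*_k, \forall k \in \{l+1, \ldots, m\}$ such that $\lambda - \lambda^* = \mu + \nu$ and $\|\lambda - \lambda^*\|^2 = \|\mu\|^2 + \|\nu\|^2$. Define the $l$-dimensional vector $\mu' = [\mu_1, \ldots, \mu_l]$. Note that $\|\mu'\| = \|\mu\|$. By the first condition, $d_k \ne 0, \forall k \in \{l+1, \ldots, m\}$ implies that $\lambda^*_k = 0, \forall k \in \{l+1, \ldots, m\}$, which together with the fact that $\lambda \ge 0$ implies that $\nu \ge 0$. If $\|\lambda - \lambda^*\|$ is sufficiently small, we have
\[
\begin{aligned}
q(\lambda) &\overset{(a)}{=} q(\lambda^*) + (\lambda - \lambda^*)^T \nabla_\lambda q(\lambda^*) + (\lambda - \lambda^*)^T \nabla^2_\lambda q(\lambda^*) (\lambda - \lambda^*) + o(\|\lambda - \lambda^*\|^2) \\
&= q(\lambda^*) + \sum_{k=1}^{l} \mu_k d_k + \sum_{k=l+1}^{m} \nu_k d_k + \mu^T U \Sigma U^T \mu + \nu^T U \Sigma U^T \nu + o(\|\lambda - \lambda^*\|^2) \\
&\overset{(b)}{\le} q(\lambda^*) - \sum_{k=l+1}^{m} \nu_k \delta + \mu'^T U' \Sigma U'^T \mu' + o(\|\lambda - \lambda^*\|^2) \\
&\overset{(c)}{\le} q(\lambda^*) - \sum_{k=l+1}^{m} \nu_k \delta - \kappa \|\mu'\|^2 + o(\|\lambda - \lambda^*\|^2) \\
&\overset{(d)}{\le} q(\lambda^*) - \kappa \|\nu\|^2 - \kappa \|\mu\|^2 + o(\|\lambda - \lambda^*\|^2) \\
&= q(\lambda^*) - \kappa \|\lambda - \lambda^*\|^2 + o(\|\lambda - \lambda^*\|^2),
\end{aligned}
\]
where (a) follows from the second-order Taylor's expansion; (b) follows from the facts that $d_k = 0, \forall k \in \{1, \ldots, l\}$; $\nu \ge 0$ and $d_k \le -\delta, \forall k \in \{l+1, \ldots, m\}$; the last $m - l$ elements of the vector $\mu$ are zeros; and $\Sigma \prec 0$; (c) is true because $\kappa > 0$ exists when $\operatorname{rank}(U') = l$ and $\Sigma \prec 0$; and (d) follows from $-\delta \le -\kappa \nu_k, \forall k \in \{l+1, \ldots, m\}$, which is true as long as $\|\nu\|$ is sufficiently small, and from $\|\mu'\| = \|\mu\|$. By the definition of $o(\|\lambda - \lambda^*\|^2)$, for any $\kappa > 0$, we have $o(\|\lambda - \lambda^*\|^2) \le \frac{\kappa}{2} \|\lambda - \lambda^*\|^2$ as long as $\|\lambda - \lambda^*\|$ is sufficiently small.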
Thus, there exists $D_q > 0$ such that
\[
q(\lambda) \le q(\lambda^*) - L_q \|\lambda - \lambda^*\|^2, \quad \forall \lambda \in \{\lambda \in \mathbb{R}^m_+ : \|\lambda - \lambda^*\| \le D_q\},
\]
where $L_q = \kappa / 2$.

Lemma 2.14. Let $q(\lambda) : \mathbb{R}^m_+ \to \mathbb{R}$ be a second-order continuously differentiable concave function maximized at $\lambda = \lambda^* \ge 0$. If $\nabla^2_\lambda q(\lambda^*) \prec 0$, then there exist $D_c > 0$ and $L_c > 0$ such that $q(\cdot)$ is strongly concave with modulus $L_c$ on the set $\{\lambda \in \mathbb{R}^m_+ : \|\lambda - \lambda^*\| \le D_c\}$.

Proof. This lemma trivially follows from the continuity of $\nabla^2_\lambda q(\lambda)$.

Proof of Part 1 of Theorem 2.10: Note that Assumption 2.1 is trivially true. Assumption 2.2 follows from the assumption (see footnote 5) that $\operatorname{rank}(\nabla_x g_{\mathcal{K}}(x^*)^T) = l$. To show that Assumption 2.3 holds, we need to apply Lemma 2.13. By the strong convexity of $f(x)$ and Proposition B.25 in [Ber99], the dual function $q(\lambda)$ is differentiable and has gradient $\nabla_\lambda q(\lambda^*) = g(x^*)$. Thus, $d = \nabla_\lambda q(\lambda^*) \le 0$. By Assumption 2.2, i.e., the strong duality, we have $\lambda^*_k d_k = 0, \forall k \in \{1, \ldots, m\}$. Thus, the first condition in Lemma 2.13 is satisfied. For $\lambda \ge 0$, define $x^*(\lambda) = \operatorname{argmin}_{x \in \mathbb{R}^n} [f(x) + \lambda^T g(x)]$ and denote $x^*(\lambda^*) = x^*$. Note that $x^*(\lambda)$ is a well-defined function because $f(x) + \lambda^T g(x)$ is strongly convex and hence is minimized at a unique point. By equation (6.9), page 598, in [Ber99], we have
\[
\nabla^2_\lambda q(\lambda^*) = -[\nabla_x g(x^*)^T] \Big[\nabla^2_x f(x^*) + \sum_{k=1}^{m} \lambda^*_k \nabla^2_x g_k(x^*)\Big]^{-1} [\nabla_x g(x^*)]. \tag{2.31}
\]
Note that $\nabla^2_x f(x^*) + \sum_{k=1}^{m} \lambda^*_k \nabla^2_x g_k(x^*) \succ 0$ because $f$ is strongly convex and $g_k, k \in \{1, \ldots, m\}$ are convex. Thus, if $\operatorname{rank}(\nabla_x g_{\mathcal{K}}(x^*)^T) = |\mathcal{K}|$, then the second condition of Lemma 2.13 is satisfied.

Footnote 5: The assumption that $\operatorname{rank}(\nabla_x g_{\mathcal{K}}(x^*)^T) = l$ is known as the non-degenerate constraint qualification or linear independence constraint qualification, which, along with Slater's constraint qualification, is one of various constraint qualifications implying strong duality [BSS06, SB94].

Proof of Part 2 of Theorem 2.10: Using the same argument, we can show that Assumptions 2.1-2.2 hold. By equation (2.31) and the assumption that $\operatorname{rank}(\nabla_x g(x^*)^T) = m$, Assumption 2.4 follows from Lemma 2.14.

2.7.6 Proof of Theorem 2.11

• Proof of Part 1: Let $x^*$ be the optimal solution to the problem (2.21)-(2.23). Since each column of $A$ has at least one non-zero entry, we have $x^*_i \le b^{\max}, \forall i \in \{1, 2, \ldots, n\}$ with $b^{\max} = \max_{1 \le i \le n} b_i$. Thus, the problem (2.24)-(2.26) is equivalent to the problem (2.21)-(2.23) since only a redundant constraint $x \le x^{\max}$ is introduced.

• Proof of Part 2:

– To show Assumption 2.1 holds: It follows from the strong convexity of $\sum_{i=1}^{n} -w_i \log(x_i)$ over the set $\mathcal{X} = \{0 \le x \le x^{\max}\}$.

– To show Assumption 2.2 holds: If $x^*_i = 0$ for some $i \in \{1, \ldots, n\}$, then the objective function is $+\infty$. Thus, $x^*_i > 0, \forall 1 \le i \le n$. Since $x^*_i \le b^{\max}, \forall 1 \le i \le n$ and $x^{\max}_i > b^{\max}, \forall 1 \le i \le n$, we have $x^*_i < x^{\max}_i, \forall 1 \le i \le n$. Thus, the constraints $x \ge 0$ and $x \le x^{\max}$ are inactive. The active inequality constraints can only be those among $Ax \le b$. Thus, if $\operatorname{rank}(A) = m$ then the linear independence constraint qualification is satisfied and the strong duality holds [BSS06, SB94]. Note that the strong duality also holds in the problem (2.21)-(2.23) because its active inequality constraints also can only be those among $Ax \le b$.

– To show Assumption 2.3 holds: Define the Lagrangian dual function of the problem (2.21)-(2.23) as
\[
\tilde{q}(\lambda) = \min_{x \ge 0} \Big\{ \sum_{i=1}^{n} -w_i \log(x_i) + \lambda^T (Ax - b) \Big\}.
\]
Note that
\[
\operatorname*{argmin}_{x \ge 0} \Big\{ \sum_{i=1}^{n} -w_i \log(x_i) + \lambda^T (Ax - b) \Big\} = \Big[ \Big[\frac{w_1}{\lambda^T a_1}\Big]_0^{\infty}, \ldots, \Big[\frac{w_n}{\lambda^T a_n}\Big]_0^{\infty} \Big]^T,
\]
where $[\cdot]_a^b$ denotes the projection onto the interval $[a, b]$. As argued above, the strong duality holds for the problem (2.21)-(2.23). Let $x^*$ be the optimal solution to the above convex program and $\lambda^* \ge 0$ be the corresponding dual variables. By strong duality, $x^* = \operatorname{argmin}_{x \ge 0} \{ \sum_{i=1}^{n} -w_i \log(x_i) + (\lambda^*)^T (Ax - b) \}$, i.e., $x^*_i = \operatorname{argmin}_{x_i \ge 0} \{ -w_i \log(x_i) + (\lambda^*)^T a_i x_i \}, \forall 1 \le i \le n$. Thus, we have $x^*_i = \big[\frac{w_i}{(\lambda^*)^T a_i}\big]_0^{\infty}, \forall 1 \le i \le n$. In the proof of part 1 of this theorem, we show that $0 < x^*_i \le b^{\max}, \forall 1 \le i \le n$.
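Thus, $0 < \frac{w_i}{(\lambda^*)^T a_i} \le b^{\max}, \forall 1 \le i \le n$ and $x^*_i = \big[\frac{w_i}{(\lambda^*)^T a_i}\big]_0^{\infty} = \frac{w_i}{(\lambda^*)^T a_i}, \forall 1 \le i \le n$.

Now consider the problem (2.24)-(2.26). $x^*$ is still the optimal solution and $\lambda^*$ is still the corresponding dual variable when the Lagrangian dual function is defined as $q(\lambda) = \min_{0 \le x \le x^{\max}} \{ \sum_{i=1}^{n} -w_i \log(x_i) + \lambda^T (Ax - b) \}$. Note that
\[
\operatorname*{argmin}_{0 \le x \le x^{\max}} \Big\{ \sum_{i=1}^{n} -w_i \log(x_i) + \lambda^T (Ax - b) \Big\} = \Big[ \Big[\frac{w_1}{\lambda^T a_1}\Big]_0^{x^{\max}_1}, \ldots, \Big[\frac{w_n}{\lambda^T a_n}\Big]_0^{x^{\max}_n} \Big]^T.
\]
By the fact that $0 < \frac{w_i}{(\lambda^*)^T a_i} < x^{\max}_i, \forall i \in \{1, 2, \ldots, n\}$ and by the continuity of the functions $\frac{w_i}{\lambda^T a_i}$, we know
\[
\Big[ \Big[\frac{w_1}{\lambda^T a_1}\Big]_0^{x^{\max}_1}, \ldots, \Big[\frac{w_n}{\lambda^T a_n}\Big]_0^{x^{\max}_n} \Big] = \Big[ \Big[\frac{w_1}{\lambda^T a_1}\Big]_0^{\infty}, \ldots, \Big[\frac{w_n}{\lambda^T a_n}\Big]_0^{\infty} \Big] = \Big[ \frac{w_1}{\lambda^T a_1}, \ldots, \frac{w_n}{\lambda^T a_n} \Big]
\]
when $\lambda$ is sufficiently near $\lambda^*$. That is, $q(\lambda) = \tilde{q}(\lambda)$ when $\lambda$ is sufficiently near $\lambda^*$.

Next, we show that the dual function $q(\lambda)$ is locally quadratic in a neighborhood of $\lambda^*$ by using Lemma 2.13. Consider $\lambda \in \mathbb{R}^m_+$ such that $\|\lambda - \lambda^*\|$ is sufficiently small, or equivalently, $\lambda$ is sufficiently close to $\lambda^*$. For such $\lambda$, we have
\[
\operatorname*{argmin}_{0 \le x \le x^{\max}} \Big\{ \sum_{i=1}^{n} -w_i \log(x_i) + \lambda^T (Ax - b) \Big\} = \Big[ \frac{w_1}{\lambda^T a_1}, \ldots, \frac{w_n}{\lambda^T a_n} \Big]^T.
\]
Thus,
\[
q(\lambda) = \sum_{i=1}^{n} \Big[ -w_i \log\Big(\frac{w_i}{\lambda^T a_i}\Big) + \frac{w_i}{\lambda^T a_i} \lambda^T a_i \Big] - \lambda^T b = \sum_{i=1}^{n} \Big[ -w_i \log\Big(\frac{w_i}{\lambda^T a_i}\Big) \Big] + \sum_{i=1}^{n} w_i - \lambda^T b
\]
for $\lambda \in \mathbb{R}^m_+$ such that $\|\lambda - \lambda^*\|$ is sufficiently small. Note that $q(\lambda)$ is infinitely differentiable at any $\lambda \in \mathbb{R}^m_+$ such that $\|\lambda - \lambda^*\|$ is sufficiently small. Taking the first-order and second-order derivatives at $\lambda = \lambda^*$ yields
\[
\nabla_\lambda q(\lambda^*) = \sum_{i=1}^{n} \frac{w_i a_i}{(\lambda^*)^T a_i} - b, \tag{2.32}
\]
\[
\nabla^2_\lambda q(\lambda^*) = -\sum_{i=1}^{n} \frac{w_i a_i a_i^T}{((\lambda^*)^T a_i)^2} = A \operatorname{diag}\Big( \Big[ -\frac{w_1}{((\lambda^*)^T a_1)^2}, \ldots, -\frac{w_n}{((\lambda^*)^T a_n)^2} \Big] \Big) A^T, \tag{2.33}
\]
where $\operatorname{diag}\big( \big[ -\frac{w_1}{((\lambda^*)^T a_1)^2}, \ldots, -\frac{w_n}{((\lambda^*)^T a_n)^2} \big] \big)$ denotes the diagonal matrix with diagonal entries $-\frac{w_1}{((\lambda^*)^T a_1)^2}, \ldots, -\frac{w_n}{((\lambda^*)^T a_n)^2}$. Note that $\frac{w_i}{((\lambda^*)^T a_i)^2} > 0, \forall 1 \le i \le n$. Thus, if $\operatorname{rank}(A') = m'$, then Assumption 2.3 holds by Lemma 2.13.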
• Proof of Part 3: Using the same arguments as in the proof of part 2, we can show that Assumptions 2.1-2.2 hold. By equation (2.33) and the fact that $\operatorname{rank}(A) = m$, Assumption 2.4 follows from Lemma 2.14.

Chapter 3

New Lagrangian Methods for Constrained Convex Programs

Fix positive integers $n$ and $m$. Consider general constrained convex programs given by:
\begin{align}
\min \quad & f(x) \tag{3.1} \\
\text{s.t.} \quad & g_k(x) \le 0, \ \forall k \in \{1, 2, \ldots, m\}, \tag{3.2} \\
& x \in \mathcal{X}, \tag{3.3}
\end{align}
where the set $\mathcal{X} \subseteq \mathbb{R}^n$ is a closed convex set; the function $f(x)$ is convex on $\mathcal{X}$; and the functions $g_k(x), \forall k \in \{1, 2, \ldots, m\}$ are convex on $\mathcal{X}$. Note that the functions $f(x), g_1(x), \ldots, g_m(x)$ are not necessarily differentiable. Denote the stacked vector of the functions $g_1(x), g_2(x), \ldots, g_m(x)$ as $g(x) = [g_1(x), g_2(x), \ldots, g_m(x)]^T$. Throughout this chapter, the convex program (3.1)-(3.3) is required to satisfy the following assumptions:

Assumption 3.1 (Lipschitzness).
• There exists a (possibly non-unique) optimal solution $x^* \in \mathcal{X}$ that solves the convex program (3.1)-(3.3).
• There exists a constant $\beta$ such that $\|g(x_1) - g(x_2)\| \le \beta \|x_1 - x_2\|$ for all $x_1, x_2 \in \mathcal{X}$. That is, the function $g(x)$ is Lipschitz continuous on $\mathcal{X}$ with modulus $\beta$.

Assumption 3.2 (Existence of Lagrange Multipliers). Condition 1.1 holds for the convex program (3.1)-(3.3). That is, there exists a Lagrange multiplier vector $\lambda^* = [\lambda^*_1, \lambda^*_2, \ldots, \lambda^*_m]^T \ge 0$ such that
\[
q(\lambda^*) = f(x^*),
\]
where $x^*$ is an optimal solution to the problem (3.1)-(3.3) and $q(\lambda) = \inf_{x \in \mathcal{X}} \{ f(x) + \sum_{k=1}^{m} \lambda_k g_k(x) \}$ is the Lagrangian dual function of the problem (3.1)-(3.3).

As reviewed in Section 1.2 in Chapter 1, existing Lagrangian methods for general (possibly non-differentiable non-strongly convex) convex programs with (possibly nonlinear) functional constraints have a slow $O(\frac{1}{\epsilon^2})$ convergence time. In this chapter, we present two new Lagrangian methods, both of which have a fast $O(\frac{1}{\epsilon})$ convergence time.
The first algorithm is originally developed in our paper [YN17e] and the second algorithm is originally developed in our paper [YN16c] and our technical report [YN17d].

The first algorithm works for general (possibly non-differentiable) constrained convex programs under Assumptions 3.1 and 3.2 and updates the primal variables by solving an unconstrained convex minimization that can be decomposed into independent smaller subproblems when $f(x)$ and $g_k(x)$ are separable. This new algorithm directly improves Algorithm 1.3, the drift-plus-penalty method for deterministic convex programs, or equivalently, Algorithm 1.1, the dual subgradient method, which only achieves a slow $O(\frac{1}{\epsilon^2})$ convergence time with a similar primal update scheme.

The second algorithm further requires that $f(x)$ and $g_k(x)$ in the convex program (3.1)-(3.3) are smooth, and updates the primal variables by following a projected gradient update that can be distributively implemented even when $f(x)$ or $g_k(x)$ are not separable. This new algorithm directly improves Algorithm 1.2, the primal-dual subgradient method, which only achieves a slow $O(\frac{1}{\epsilon^2})$ convergence time with a similar primal update scheme.

3.1 New Dual Type Algorithm for General Constrained Convex Programs

Consider the algorithm described in Algorithm 3.1. This algorithm computes both primal variables $x(t) \in \mathcal{X}$ and dual variables $Q(t) = [Q_1(t), \ldots, Q_m(t)]^T$, called virtual queue vectors in the drift-plus-penalty technique, at iterations $t \in \{0, 1, 2, \ldots\}$. One main result of this chapter is that, whenever the parameter $\alpha$ in Algorithm 3.1 is chosen to satisfy $\alpha \ge \beta^2 / 2$, the running average $\overline{x}(t) = \frac{1}{t} \sum_{\tau=0}^{t-1} x(\tau)$ closely approximates a solution to the convex program (3.1)-(3.3) and has an approximation error that decays like $O(1/t)$.

Algorithm 3.1 New Dual Type Algorithm for General Constrained Convex Programs

Let $\alpha > 0$ be a constant parameter. Choose any $x(-1) \in \mathcal{X}$. Initialize $Q_k(0) = \max\{0, -g_k(x(-1))\}, \forall k \in \{1, 2, \ldots, m\}$. At each iteration $t \in \{0, 1, 2, \ldots\}$, update $x(t)$ and $Q(t+1)$ as follows:

• Update primal variables via
\[
x(t) = \operatorname*{argmin}_{x \in \mathcal{X}} \Big\{ f(x) + [Q(t) + g(x(t-1))]^T g(x) + \alpha \|x - x(t-1)\|^2 \Big\}.
\]

• Update virtual queues via
\[
Q_k(t+1) = \max\{ -g_k(x(t)), \ Q_k(t) + g_k(x(t)) \}, \quad \forall k \in \{1, 2, \ldots, m\}. \tag{3.4}
\]

• Output the running average $\overline{x}(t+1)$ given by
\[
\overline{x}(t+1) = \frac{1}{t+1} \sum_{\tau=0}^{t} x(\tau) = \overline{x}(t) \frac{t}{t+1} + x(t) \frac{1}{t+1}
\]
as the solution at iteration $t+1$.

Algorithm 3.1 is similar to Algorithm 1.3, the DPP technique for deterministic convex programs, with the following distinctions:

1. The Lagrange multiplier ("virtual queue") update equation for $Q_k(t)$ is modified to take a max with $-g_k(x(t))$, rather than simply projecting onto the nonnegative real numbers as in the traditional update rule $Q_k(t+1) = \max\{Q_k(t) + g_k(x(t)), 0\}$ used in Algorithm 1.3.

2. The minimization step augments the $Q_k(t)$ weights with the $g_k(x(t-1))$ values obtained on the previous step. These $g_k(x(t-1))$ quantities, when multiplied by the constraint functions $g_k(x)$, yield a cross-product term in the primal update. This cross term, together with another newly introduced quadratic term in the primal update, can cancel a quadratic term in an upper bound of the Lyapunov drift such that a finer analysis of the drift-plus-penalty leads to the fast $O(\frac{1}{\epsilon})$ convergence time.

3. A quadratic term, which is similar to a term used in proximal algorithms [PB13], is introduced. This provides a strong convexity "pushback". The pushback is not sufficient to alone cancel the main drift components, but it cancels residual components introduced by the new $g_k(x(t-1))$ weight.

At the same time, Algorithm 3.1 preserves the desirable properties possessed by Algorithm 1.3. That is, if the functions $f(x)$ and $g(x)$ are separable with respect to components or blocks of $x$, then the primal updates for $x(t)$ can be decomposed into several smaller independent subproblems, each of which only involves a component or block of $x(t)$.
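As a rough illustration of the three steps of Algorithm 3.1, the following Python sketch applies them to a small linear program $\min c^T x$ subject to $Ax - b \le 0$ over a box $\mathcal{X} = [lo, hi]^n$. The problem class, the function name `solve_box_lp` and all parameter names are assumptions made for this illustration only and do not come from the thesis. For this special case the primal update is a separable quadratic and has a closed form (a clipped step); for general $f$ and $g$ a convex solver would be needed instead. The caller is assumed to supply $\alpha \ge \beta^2 / 2$, where $\beta$ can be taken as the spectral norm of $A$.

```python
from typing import List, Sequence, Tuple


def solve_box_lp(
    c: Sequence[float],
    A: Sequence[Sequence[float]],
    b: Sequence[float],
    lo: float,
    hi: float,
    alpha: float,
    x_init: Sequence[float],
    num_iters: int,
) -> Tuple[List[float], List[float]]:
    """Sketch of Algorithm 3.1 for: min c^T x  s.t.  A x - b <= 0,  lo <= x <= hi.

    Returns the running average of x(0), ..., x(num_iters - 1) and the final
    virtual queue vector. Hypothetical helper, for illustration only.
    """
    m, n = len(A), len(c)

    def g(x: Sequence[float]) -> List[float]:
        # Constraint functions g_k(x) = a_k^T x - b_k (rows a_k of A).
        return [sum(A[k][i] * x[i] for i in range(n)) - b[k] for k in range(m)]

    x_prev = list(x_init)                    # x(-1), any point in the box
    g_prev = g(x_prev)
    Q = [max(0.0, -gk) for gk in g_prev]     # Q_k(0) = max{0, -g_k(x(-1))}
    x_sum = [0.0] * n

    for _ in range(num_iters):
        # Weights Q(t) + g(x(t-1)) used in the primal update.
        w = [Q[k] + g_prev[k] for k in range(m)]
        # Linear coefficient of x in f(x) + w^T g(x) for this linear problem.
        d = [c[i] + sum(w[k] * A[k][i] for k in range(m)) for i in range(n)]
        # Closed-form minimizer of d^T x + alpha * ||x - x_prev||^2 over the box.
        x = [min(hi, max(lo, x_prev[i] - d[i] / (2.0 * alpha))) for i in range(n)]
        g_cur = g(x)
        # Virtual queue update (3.4).
        Q = [max(-g_cur[k], Q[k] + g_cur[k]) for k in range(m)]
        x_sum = [x_sum[i] + x[i] for i in range(n)]
        x_prev, g_prev = x, g_cur

    x_bar = [s / num_iters for s in x_sum]   # running average
    return x_bar, Q


if __name__ == "__main__":
    # Toy instance: maximize x1 + x2 subject to x1 + 2 x2 <= 2 and 2 x1 + x2 <= 2.
    x_bar, Q = solve_box_lp(
        c=[-1.0, -1.0],
        A=[[1.0, 2.0], [2.0, 1.0]],
        b=[2.0, 2.0],
        lo=0.0,
        hi=1.0,
        alpha=5.0,            # spectral norm of A is 3, so beta^2 / 2 = 4.5
        x_init=[0.0, 0.0],
        num_iters=2000,
    )
    print("running average:", x_bar)
    print("final virtual queues:", Q)
```

In this toy instance the optimal solution is $x^* = (2/3, 2/3)$ with optimal value $-4/3$, so the printed running average can be compared against it. Because $c^T x$ and $Ax - b$ are separable in the coordinates of $x$, the primal update splits into $n$ independent scalar problems, which is the decomposition property mentioned above.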
3.2 Basic Properties from Virtual Queue Update Equations

This section presents important facts from the virtual queue update equation (3.4).

3.2.1 Properties of Virtual Queues

Lemma 3.1. Let $Q(t), t \in \{0, 1, \ldots\}$ be the sequence of virtual queue vectors yielded by the update equation (3.4). Then,

1. At each iteration $t \in \{0, 1, 2, \ldots\}$, $Q_k(t) \ge 0$ for all $k \in \{1, 2, \ldots, m\}$.

2. At each iteration $t \in \{0, 1, 2, \ldots\}$, $Q_k(t) + g_k(x(t-1)) \ge 0$ for all $k \in \{1, 2, \ldots, m\}$.

3. At iteration $t = 0$, $\|Q(0)\|^2 \le \|g(x(-1))\|^2$. At each iteration $t \in \{1, 2, \ldots\}$, $\|Q(t)\|^2 \ge \|g(x(t-1))\|^2$.

Proof.

1. Fix $k \in \{1, 2, \ldots, m\}$. Note that $Q_k(0) \ge 0$ by the initialization $Q_k(0) = \max\{0, -g_k(x(-1))\}$. Assume $Q_k(t) \ge 0$ and consider iteration $t+1$. If $g_k(x(t)) \ge 0$, then $Q_k(t+1) = \max\{-g_k(x(t)), Q_k(t) + g_k(x(t))\} \ge Q_k(t) + g_k(x(t)) \ge 0$. If $g_k(x(t)) < 0$, then $Q_k(t+1) = \max\{-g_k(x(t)), Q_k(t) + g_k(x(t))\} \ge -g_k(x(t)) > 0$. Thus, $Q_k(t+1) \ge 0$. The result follows by induction.

2. Fix $k \in \{1, 2, \ldots, m\}$. Note that $Q_k(0) + g_k(x(-1)) \ge 0$ by the initialization rule $Q_k(0) = \max\{0, -g_k(x(-1))\} \ge -g_k(x(-1))$. For $t \ge 1$, by the virtual queue update equation, we have $Q_k(t) = \max\{-g_k(x(t-1)), Q_k(t-1) + g_k(x(t-1))\} \ge -g_k(x(t-1))$, which implies that $Q_k(t) + g_k(x(t-1)) \ge 0$.

3. Consider $t = 0$. Fix $k \in \{1, 2, \ldots, m\}$. Consider the cases $g_k(x(-1)) \ge 0$ and $g_k(x(-1)) < 0$ separately. If $g_k(x(-1)) \ge 0$, then $Q_k(0) = \max\{0, -g_k(x(-1))\} = 0$ and so $|Q_k(0)| \le |g_k(x(-1))|$. If $g_k(x(-1)) < 0$, then $Q_k(0) = \max\{0, -g_k(x(-1))\} = -g_k(x(-1)) = |g_k(x(-1))|$. Thus, in both cases, we have $|Q_k(0)| \le |g_k(x(-1))|$. Squaring both sides and summing over $k \in \{1, 2, \ldots, m\}$ yields $\|Q(0)\|^2 \le \|g(x(-1))\|^2$.

Consider $t \ge 1$. Fix $k \in \{1, 2, \ldots, m\}$. Consider the cases $g_k(x(t-1)) \ge 0$ and $g_k(x(t-1)) < 0$ separately. If $g_k(x(t-1)) \ge 0$, then
\[
\begin{aligned}
Q_k(t) &= \max\{-g_k(x(t-1)), Q_k(t-1) + g_k(x(t-1))\} \\
&\ge Q_k(t-1) + g_k(x(t-1)) \\
&\overset{(a)}{\ge} g_k(x(t-1)) = |g_k(x(t-1))|,
\end{aligned}
\]
where (a) follows from part 1.
If $g_k(x(t-1)) < 0$, then
\[
Q_k(t) = \max\{-g_k(x(t-1)), Q_k(t-1) + g_k(x(t-1))\} \ge -g_k(x(t-1)) = |g_k(x(t-1))|.
\]
Thus, in both cases, we have $|Q_k(t)| \ge |g_k(x(t-1))|$. Squaring both sides and summing over $k \in \{1, 2, \ldots, m\}$ yields $\|Q(t)\|^2 \ge \|g(x(t-1))\|^2$.

Lemma 3.2. Let $Q(t), t \in \{0, 1, \ldots\}$ be the sequence of virtual queue vectors yielded by the update equation (3.4). At each iteration $t \in \{1, 2, \ldots\}$,
\[
Q_k(t) \ge \sum_{\tau=0}^{t-1} g_k(x(\tau)), \quad \forall k \in \{1, 2, \ldots, m\}. \tag{3.5}
\]

Proof. Fix $k \in \{1, 2, \ldots, m\}$ and $t \ge 1$. For any $\tau \in \{0, \ldots, t-1\}$ the update rule of Algorithm 3.1 gives:
\[
Q_k(\tau+1) = \max\{-g_k(x(\tau)), Q_k(\tau) + g_k(x(\tau))\} \ge Q_k(\tau) + g_k(x(\tau)).
\]
Hence, $Q_k(\tau+1) - Q_k(\tau) \ge g_k(x(\tau))$. Summing over $\tau \in \{0, \ldots, t-1\}$ and using $Q_k(0) \ge 0$ gives the result.

3.2.2 Properties of the Drift

Recall that $Q(t) = [Q_1(t), \ldots, Q_m(t)]^T$ is the vector of virtual queue backlogs. Define $L(t) = \frac{1}{2} \|Q(t)\|^2$. The function $L(t)$ shall be called a Lyapunov function. Define the Lyapunov drift as
\[
\Delta(t) = L(t+1) - L(t) = \frac{1}{2} \big[ \|Q(t+1)\|^2 - \|Q(t)\|^2 \big]. \tag{3.6}
\]

Lemma 3.3. Let $Q(t), t \in \{0, 1, \ldots\}$ be the sequence of virtual queue vectors yielded by the update equation (3.4). At each iteration $t \in \{0, 1, 2, \ldots\}$, an upper bound of the Lyapunov drift is given by
\[
\Delta(t) \le [Q(t)]^T g(x(t)) + \|g(x(t))\|^2. \tag{3.7}
\]

Proof. The virtual queue update equation $Q_k(t+1) = \max\{-g_k(x(t)), Q_k(t) + g_k(x(t))\}, \forall k \in \{1, 2, \ldots, m\}$ can be rewritten as
\[
Q_k(t+1) = Q_k(t) + \tilde{g}_k(x(t)), \quad \forall k \in \{1, 2, \ldots, m\}, \tag{3.8}
\]
where
\[
\tilde{g}_k(x(t)) = \begin{cases} g_k(x(t)), & \text{if } Q_k(t) + g_k(x(t)) \ge -g_k(x(t)), \\ -Q_k(t) - g_k(x(t)), & \text{else,} \end{cases} \quad \forall k.
\]
Fix $k \in \{1, 2, \ldots, m\}$.
Squaring both sides of (3.8) and dividing by 2 yields:
\[
\begin{aligned}
\tfrac{1}{2} [Q_k(t+1)]^2 &= \tfrac{1}{2} [Q_k(t)]^2 + \tfrac{1}{2} [\tilde{g}_k(x(t))]^2 + Q_k(t) \tilde{g}_k(x(t)) \\
&= \tfrac{1}{2} [Q_k(t)]^2 + \tfrac{1}{2} [\tilde{g}_k(x(t))]^2 + Q_k(t) g_k(x(t)) + Q_k(t) [\tilde{g}_k(x(t)) - g_k(x(t))] \\
&\overset{(a)}{=} \tfrac{1}{2} [Q_k(t)]^2 + \tfrac{1}{2} [\tilde{g}_k(x(t))]^2 + Q_k(t) g_k(x(t)) - [\tilde{g}_k(x(t)) + g_k(x(t))] [\tilde{g}_k(x(t)) - g_k(x(t))] \\
&= \tfrac{1}{2} [Q_k(t)]^2 - \tfrac{1}{2} [\tilde{g}_k(x(t))]^2 + Q_k(t) g_k(x(t)) + [g_k(x(t))]^2 \\
&\le \tfrac{1}{2} [Q_k(t)]^2 + Q_k(t) g_k(x(t)) + [g_k(x(t))]^2,
\end{aligned}
\]
where (a) follows from the fact that $Q_k(t) [\tilde{g}_k(x(t)) - g_k(x(t))] = -[\tilde{g}_k(x(t)) + g_k(x(t))] \cdot [\tilde{g}_k(x(t)) - g_k(x(t))]$, which can be shown by considering the two cases $\tilde{g}_k(x(t)) = g_k(x(t))$, where both sides are zero, and $\tilde{g}_k(x(t)) \ne g_k(x(t))$, where $\tilde{g}_k(x(t)) = -Q_k(t) - g_k(x(t))$ and hence $Q_k(t) = -[\tilde{g}_k(x(t)) + g_k(x(t))]$. Summing over $k \in \{1, 2, \ldots, m\}$ yields
\[
\tfrac{1}{2} \|Q(t+1)\|^2 \le \tfrac{1}{2} \|Q(t)\|^2 + [Q(t)]^T g(x(t)) + \|g(x(t))\|^2.
\]
Rearranging the terms yields the desired result.

3.3 Convergence Time Analysis of Algorithm 3.1

This section analyzes the convergence time of Algorithm 3.1 for the convex program (3.1)-(3.3) under Assumptions 3.1-3.2.

3.3.1 An Upper Bound of the Drift-Plus-Penalty Expression

Lemma 3.4. Consider the convex program (3.1)-(3.3) under Assumptions 3.1-3.2. If $\alpha \ge \frac{1}{2} \beta^2$ in Algorithm 3.1, then for all $t \ge 0$, we have
\[
\Delta(t) + f(x(t)) \le f(x^*) + \alpha \big[ \|x^* - x(t-1)\|^2 - \|x^* - x(t)\|^2 \big] + \tfrac{1}{2} \big[ \|g(x(t))\|^2 - \|g(x(t-1))\|^2 \big],
\]
where $x^*$ is an optimal solution of the problem (3.1)-(3.3) and $\beta$ is the Lipschitz modulus of $g(x)$, both of which are defined in Assumption 3.1.

Proof. Fix $t \ge 0$. Note that Lemma 3.1 implies that $Q(t) + g(x(t-1))$ is component-wise nonnegative. Hence, the function $f(x) + [Q(t) + g(x(t-1))]^T g(x)$ is convex with respect to $x$ on $\mathcal{X}$. Since $\alpha \|x - x(t-1)\|^2$ is strongly convex with respect to $x$ with modulus $2\alpha$, it follows that $f(x) + [Q(t) + g(x(t-1))]^T g(x) + \alpha \|x - x(t-1)\|^2$ is strongly convex with respect to $x$ with modulus $2\alpha$.
Since $x(t)$ is chosen to minimize the above strongly convex function, by Corollary 1.2, we have
\[
\begin{aligned}
& f(x(t)) + [Q(t) + g(x(t-1))]^T g(x(t)) + \alpha \|x(t) - x(t-1)\|^2 \\
&\le f(x^*) + \underbrace{[Q(t) + g(x(t-1))]^T g(x^*)}_{\le 0} + \alpha \|x^* - x(t-1)\|^2 - \alpha \|x^* - x(t)\|^2 \\
&\overset{(a)}{\le} f(x^*) + \alpha \|x^* - x(t-1)\|^2 - \alpha \|x^* - x(t)\|^2,
\end{aligned} \tag{3.9}
\]
where (a) follows by using the fact that $g_k(x^*) \le 0$ for all $k \in \{1, 2, \ldots, m\}$ and $Q_k(t) + g_k(x(t-1)) \ge 0$ (i.e., part 2 in Lemma 3.1) to eliminate the term marked by an underbrace.

Note that $u_1^T u_2 = \frac{1}{2} [\|u_1\|^2 + \|u_2\|^2 - \|u_1 - u_2\|^2]$ for any $u_1, u_2 \in \mathbb{R}^m$. Thus, we have
\[
[g(x(t-1))]^T g(x(t)) = \tfrac{1}{2} \big[ \|g(x(t-1))\|^2 + \|g(x(t))\|^2 - \|g(x(t-1)) - g(x(t))\|^2 \big]. \tag{3.10}
\]
Substituting (3.10) into (3.9) and rearranging terms yields
\[
\begin{aligned}
& f(x(t)) + [Q(t)]^T g(x(t)) \\
&\le f(x^*) + \alpha \|x^* - x(t-1)\|^2 - \alpha \|x^* - x(t)\|^2 - \alpha \|x(t) - x(t-1)\|^2 \\
&\qquad + \tfrac{1}{2} \|g(x(t-1)) - g(x(t))\|^2 - \tfrac{1}{2} \|g(x(t-1))\|^2 - \tfrac{1}{2} \|g(x(t))\|^2 \\
&\overset{(a)}{\le} f(x^*) + \alpha \|x^* - x(t-1)\|^2 - \alpha \|x^* - x(t)\|^2 + \big(\tfrac{1}{2} \beta^2 - \alpha\big) \|x(t) - x(t-1)\|^2 \\
&\qquad - \tfrac{1}{2} \|g(x(t-1))\|^2 - \tfrac{1}{2} \|g(x(t))\|^2 \\
&\overset{(b)}{\le} f(x^*) + \alpha \|x^* - x(t-1)\|^2 - \alpha \|x^* - x(t)\|^2 - \tfrac{1}{2} \|g(x(t-1))\|^2 - \tfrac{1}{2} \|g(x(t))\|^2,
\end{aligned}
\]
where (a) follows from the fact that $\|g(x(t-1)) - g(x(t))\| \le \beta \|x(t) - x(t-1)\|$, which further follows from the assumption that $g(x)$ is Lipschitz continuous with modulus $\beta$; and (b) follows from the fact that $\alpha \ge \frac{1}{2} \beta^2$.

Summing (3.7) with the above inequality and cancelling common terms on both sides yields
\[
\Delta(t) + f(x(t)) \le f(x^*) + \alpha \big[ \|x^* - x(t-1)\|^2 - \|x^* - x(t)\|^2 \big] + \tfrac{1}{2} \big[ \|g(x(t))\|^2 - \|g(x(t-1))\|^2 \big].
\]

3.3.2 Objective Value Violations

Lemma 3.5. Consider the convex program (3.1)-(3.3) under Assumptions 3.1-3.2. Let $x^*$ be an optimal solution of the problem (3.1)-(3.3) and $\beta$ be the Lipschitz modulus of $g(x)$, both of which are defined in Assumption 3.1.

1. If $\alpha \ge \frac{1}{2} \beta^2$ in Algorithm 3.1, then for all $t \ge 1$, we have
\[
\sum_{\tau=0}^{t-1} f(x(\tau)) \le t f(x^*) + \alpha \|x^* - x(-1)\|^2.
\]

2. If $\alpha > \frac{1}{2} \beta^2$ in Algorithm 3.1, then for all $t \ge 1$, we have
\[
\sum_{\tau=0}^{t-1} f(x(\tau)) \le t f(x^*) + \alpha \|x^* - x(-1)\|^2 + \frac{\alpha}{2\alpha - \beta^2} \|g(x^*)\|^2 - \frac{\|Q(t)\|^2}{2}.
\]

Proof. By Lemma 3.4, we have $\Delta(\tau) + f(x(\tau)) \le f(x^*) + \alpha [\|x^* - x(\tau-1)\|^2 - \|x^* - x(\tau)\|^2] + \frac{1}{2} [\|g(x(\tau))\|^2 - \|g(x(\tau-1))\|^2]$ for all $\tau \in \{0, 1, 2, \ldots\}$. Summing over $\tau \in \{0, 1, \ldots, t-1\}$ yields
\[
\sum_{\tau=0}^{t-1} \Delta(\tau) + \sum_{\tau=0}^{t-1} f(x(\tau)) \le t f(x^*) + \alpha \sum_{\tau=0}^{t-1} \big[ \|x^* - x(\tau-1)\|^2 - \|x^* - x(\tau)\|^2 \big] + \tfrac{1}{2} \sum_{\tau=0}^{t-1} \big[ \|g(x(\tau))\|^2 - \|g(x(\tau-1))\|^2 \big].
\]
Recalling that $\Delta(\tau) = L(\tau+1) - L(\tau)$ and simplifying the telescoping summations yields
\[
L(t) - L(0) + \sum_{\tau=0}^{t-1} f(x(\tau)) \le t f(x^*) + \alpha \|x^* - x(-1)\|^2 - \alpha \|x^* - x(t-1)\|^2 + \tfrac{1}{2} \|g(x(t-1))\|^2 - \tfrac{1}{2} \|g(x(-1))\|^2.
\]
Rearranging terms and substituting $L(0) = \frac{1}{2} \|Q(0)\|^2$ and $L(t) = \frac{1}{2} \|Q(t)\|^2$ yields
\[
\begin{aligned}
\sum_{\tau=0}^{t-1} f(x(\tau)) &\le t f(x^*) + \alpha \|x^* - x(-1)\|^2 - \alpha \|x^* - x(t-1)\|^2 + \tfrac{1}{2} \|g(x(t-1))\|^2 \\
&\qquad - \tfrac{1}{2} \|g(x(-1))\|^2 + \tfrac{1}{2} \|Q(0)\|^2 - \tfrac{1}{2} \|Q(t)\|^2 \\
&\overset{(a)}{\le} t f(x^*) + \alpha \|x^* - x(-1)\|^2 - \alpha \|x^* - x(t-1)\|^2 + \tfrac{1}{2} \|g(x(t-1))\|^2 - \tfrac{1}{2} \|Q(t)\|^2,
\end{aligned} \tag{3.11}
\]
where (a) follows from the fact that $\|Q(0)\| \le \|g(x(-1))\|$, i.e., part 3 in Lemma 3.1.

Next, we present the proof of both parts:

1. This part follows from the observation that equation (3.11) can be further simplified as
\[
\begin{aligned}
\sum_{\tau=0}^{t-1} f(x(\tau)) &\overset{(a)}{\le} t f(x^*) + \alpha \|x^* - x(-1)\|^2 + \tfrac{1}{2} \|g(x(t-1))\|^2 - \tfrac{1}{2} \|Q(t)\|^2 \\
&\overset{(b)}{\le} t f(x^*) + \alpha \|x^* - x(-1)\|^2,
\end{aligned}
\]
where (a) follows by ignoring the non-positive term $-\alpha \|x^* - x(t-1)\|^2$ on the right side and (b) follows from the fact that $\|Q(t)\| \ge \|g(x(t-1))\|$, i.e., part 3 in Lemma 3.1.

2. This part follows by rewriting equation (3.11) as
\[
\begin{aligned}
\sum_{\tau=0}^{t-1} f(x(\tau)) &\le t f(x^*) + \alpha \|x^* - x(-1)\|^2 - \alpha \|x^* - x(t-1)\|^2 + \tfrac{1}{2} \|g(x(t-1)) - g(x^*) + g(x^*)\|^2 - \tfrac{1}{2} \|Q(t)\|^2 \\
&= t f(x^*) + \alpha \|x^* - x(-1)\|^2 - \alpha \|x^* - x(t-1)\|^2 + \tfrac{1}{2} \|g(x(t-1)) - g(x^*)\|^2 \\
&\qquad + [g(x^*)]^T [g(x(t-1)) - g(x^*)] + \tfrac{1}{2} \|g(x^*)\|^2 - \tfrac{1}{2} \|Q(t)\|^2 \\
&\overset{(a)}{\le} t f(x^*) + \alpha \|x^* - x(-1)\|^2 - \alpha \|x^* - x(t-1)\|^2 + \tfrac{1}{2} \|g(x(t-1)) - g(x^*)\|^2 \\
&\qquad + \|g(x^*)\| \|g(x(t-1)) - g(x^*)\| + \tfrac{1}{2} \|g(x^*)\|^2 - \tfrac{1}{2} \|Q(t)\|^2 \\
&\overset{(b)}{\le} t f(x^*) + \alpha \|x^* - x(-1)\|^2 - \alpha \|x^* - x(t-1)\|^2 + \tfrac{1}{2} \beta^2 \|x^* - x(t-1)\|^2 \\
&\qquad + \beta \|g(x^*)\| \|x^* - x(t-1)\| + \tfrac{1}{2} \|g(x^*)\|^2 - \tfrac{1}{2} \|Q(t)\|^2 \\
&= t f(x^*) + \alpha \|x^* - x(-1)\|^2 - \big(\alpha - \tfrac{1}{2} \beta^2\big) \Big[ \|x^* - x(t-1)\| - \frac{\tfrac{1}{2} \beta}{\alpha - \tfrac{1}{2} \beta^2} \|g(x^*)\| \Big]^2 \\
&\qquad + \frac{\alpha}{2\alpha - \beta^2} \|g(x^*)\|^2 - \tfrac{1}{2} \|Q(t)\|^2 \\
&\overset{(c)}{\le} t f(x^*) + \alpha \|x^* - x(-1)\|^2 + \frac{\alpha}{2\alpha - \beta^2} \|g(x^*)\|^2 - \tfrac{1}{2} \|Q(t)\|^2,
\end{aligned}
\]
where (a) follows from the Cauchy-Schwarz inequality; (b) follows from the fact that $\|g(x(t-1)) - g(x^*)\| \le \beta \|x^* - x(t-1)\|$, which further follows from the assumption that $g(x)$ is Lipschitz continuous with modulus $\beta$; and (c) follows from the fact that $\alpha > \frac{1}{2} \beta^2$.

Theorem 3.1 (Objective Value Violations of Algorithm 3.1). Consider the convex program (3.1)-(3.3) under Assumptions 3.1-3.2. If $\alpha \ge \frac{1}{2} \beta^2$ in Algorithm 3.1, for all $t \ge 1$, we have
\[
f(\overline{x}(t)) \le f(x^*) + \frac{\alpha}{t} \|x^* - x(-1)\|^2,
\]
where $x^*$ is an optimal solution to the problem (3.1)-(3.3) and $\beta$ is the Lipschitz modulus of $g(x)$, both of which are defined in Assumption 3.1.

Proof. Fix $t \ge 1$. By part 1 in Lemma 3.5, we have
\[
\sum_{\tau=0}^{t-1} f(x(\tau)) \le t f(x^*) + \alpha \|x^* - x(-1)\|^2 \ \Rightarrow \ \frac{1}{t} \sum_{\tau=0}^{t-1} f(x(\tau)) \le f(x^*) + \frac{\alpha}{t} \|x^* - x(-1)\|^2.
\]
Since $\overline{x}(t) = \frac{1}{t} \sum_{\tau=0}^{t-1} x(\tau)$ and $f(x)$ is convex, by Jensen's inequality it follows that
\[
f(\overline{x}(t)) \le \frac{1}{t} \sum_{\tau=0}^{t-1} f(x(\tau)).
\]

The above theorem shows that under Algorithm 3.1, the error gap between $f(\overline{x}(t))$ and the optimal value $f(x^*)$ is at most $O(\frac{1}{t})$. This holds for any initial guess vector $x(-1) \in \mathcal{X}$. Of course, choosing $x(-1)$ close to $x^*$ is desirable because it reduces the coefficient $\alpha \|x^* - x(-1)\|^2$.
3.3.3 Constraint Violations

The next lemma follows from Assumption 3.2 and Lemma 3.2.

Lemma 3.6. Consider the convex program (3.1)-(3.3) under Assumptions 3.1-3.2. Let $x(t), Q(t), t \in \{0, 1, \ldots\}$ be sequences generated by Algorithm 3.1. Then,
\[
\sum_{\tau=0}^{t-1} f(x(\tau)) \ge t f(x^*) - \|\lambda^*\| \|Q(t)\|, \quad \forall t \ge 1,
\]
where $x^*$ is an optimal solution of the problem (3.1)-(3.3) defined in Assumption 3.1; and $\lambda^*$ is a Lagrange multiplier vector satisfying Assumption 3.2.

Proof. Define the Lagrangian dual function $q(\lambda) = \inf_{x \in \mathcal{X}} \{ f(x) + \sum_{k=1}^{m} \lambda_k g_k(x) \}$. For all $\tau \in \{0, 1, \ldots\}$, by Assumption 3.2, we have
\[
f(x^*) = q(\lambda^*) \overset{(a)}{\le} f(x(\tau)) + \sum_{k=1}^{m} \lambda^*_k g_k(x(\tau)),
\]
where (a) follows from the definition of $q(\lambda^*)$. Thus, we have
\[
f(x(\tau)) \ge f(x^*) - \sum_{k=1}^{m} \lambda^*_k g_k(x(\tau)), \quad \forall \tau \in \{0, 1, \ldots\}.
\]
Summing over $\tau \in \{0, 1, \ldots, t-1\}$ yields
\[
\begin{aligned}
\sum_{\tau=0}^{t-1} f(x(\tau)) &\ge t f(x^*) - \sum_{\tau=0}^{t-1} \sum_{k=1}^{m} \lambda^*_k g_k(x(\tau)) \\
&= t f(x^*) - \sum_{k=1}^{m} \lambda^*_k \Big[ \sum_{\tau=0}^{t-1} g_k(x(\tau)) \Big] \\
&\overset{(a)}{\ge} t f(x^*) - \sum_{k=1}^{m} \lambda^*_k Q_k(t) \\
&\overset{(b)}{\ge} t f(x^*) - \|\lambda^*\| \|Q(t)\|,
\end{aligned}
\]
where (a) follows from Lemma 3.2 and the fact that $\lambda^*_k \ge 0, \forall k \in \{1, 2, \ldots, m\}$; and (b) follows from the Cauchy-Schwarz inequality.

Lemma 3.7. Consider the convex program (3.1)-(3.3) under Assumptions 3.1-3.2. If $\alpha > \frac{\beta^2}{2}$ in Algorithm 3.1, then for all $t \ge 1$, the virtual queue vector satisfies
\[
\|Q(t)\| \le 2 \|\lambda^*\| + \sqrt{2\alpha} \, \|x^* - x(-1)\| + \sqrt{\frac{\alpha}{\alpha - \frac{1}{2} \beta^2}} \, \|g(x^*)\|,
\]
where $x^*$ is an optimal solution of the problem (3.1)-(3.3) and $\beta$ is the Lipschitz modulus of $g(x)$, both of which are defined in Assumption 3.1; and $\lambda^*$ is a Lagrange multiplier vector satisfying Assumption 3.2.

Proof. Fix $t \ge 1$. By part 2 in Lemma 3.5, we have
\[
\sum_{\tau=0}^{t-1} f(x(\tau)) \le t f(x^*) + \alpha \|x^* - x(-1)\|^2 + \frac{\alpha}{2\alpha - \beta^2} \|g(x^*)\|^2 - \tfrac{1}{2} \|Q(t)\|^2.
\]
By Lemma 3.6, we have
\[
\sum_{\tau=0}^{t-1} f(x(\tau)) \ge t f(x^*) - \|\lambda^*\| \|Q(t)\|.
\]
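Combining the last two inequalities and cancelling the common term $t f(x^*)$ on both sides yields

$$
\begin{aligned}
& \tfrac{1}{2}\|Q(t)\|^2 - \Big[\alpha\|x^* - x(-1)\|^2 + \frac{\alpha}{2\alpha - \beta^2}\|g(x^*)\|^2\Big] \le \|\lambda^*\|\,\|Q(t)\| \\
\Rightarrow\ & \big[\|Q(t)\| - \|\lambda^*\|\big]^2 \le \|\lambda^*\|^2 + 2\alpha\|x^* - x(-1)\|^2 + \frac{\alpha}{\alpha - \frac{1}{2}\beta^2}\|g(x^*)\|^2 \\
\Rightarrow\ & \|Q(t)\| \le \|\lambda^*\| + \sqrt{\|\lambda^*\|^2 + 2\alpha\|x^* - x(-1)\|^2 + \frac{\alpha}{\alpha - \frac{1}{2}\beta^2}\|g(x^*)\|^2} \\
\overset{(a)}{\Rightarrow}\ & \|Q(t)\| \le 2\|\lambda^*\| + \sqrt{2\alpha}\,\|x^* - x(-1)\| + \sqrt{\frac{\alpha}{\alpha - \frac{1}{2}\beta^2}}\,\|g(x^*)\|,
\end{aligned}
$$

where (a) follows from the basic inequality $\sqrt{a + b + c} \le \sqrt{a} + \sqrt{b} + \sqrt{c}$ for any $a, b, c \ge 0$.

Theorem 3.2 (Constraint Violations of Algorithm 3.1). Consider the convex program (3.1)-(3.3) under Assumptions 3.1-3.2. If $\alpha > \frac{\beta^2}{2}$ in Algorithm 3.1, then for all $t \ge 1$, the constraint functions satisfy

$$
g_k(\bar{x}(t)) \le \frac{1}{t}\Big(2\|\lambda^*\| + \sqrt{2\alpha}\,\|x^* - x(-1)\| + \sqrt{\frac{\alpha}{\alpha - \frac{1}{2}\beta^2}}\,\|g(x^*)\|\Big), \quad \forall k \in \{1, 2, \ldots, m\},
$$

where $x^*$ is an optimal solution of the problem (3.1)-(3.3) and $\beta$ is the Lipschitz modulus of $g(x)$, both of which are defined in Assumption 3.1; and $\lambda^*$ is a Lagrange multiplier vector satisfying Assumption 3.2.

Proof. Fix $t \ge 1$ and $k \in \{1, 2, \ldots, m\}$. Recall that $\bar{x}(t) = \frac{1}{t}\sum_{\tau=0}^{t-1} x(\tau)$. Thus,

$$
g_k(\bar{x}(t)) \overset{(a)}{\le} \frac{1}{t}\sum_{\tau=0}^{t-1} g_k(x(\tau)) \overset{(b)}{\le} \frac{Q_k(t)}{t} \le \frac{\|Q(t)\|}{t} \overset{(c)}{\le} \frac{1}{t}\Big(2\|\lambda^*\| + \sqrt{2\alpha}\,\|x^* - x(-1)\| + \sqrt{\frac{\alpha}{\alpha - \frac{1}{2}\beta^2}}\,\|g(x^*)\|\Big),
$$

where (a) follows from the convexity of $g_k(x), k \in \{1, 2, \ldots, m\}$ and Jensen’s inequality; (b) follows from Lemma 3.2; and (c) follows from Lemma 3.7.

3.3.4 Convergence Time of Algorithm 3.1

The next theorem summarizes Theorems 3.1 and 3.2.

Theorem 3.3 (Convergence Time of Algorithm 3.1). Consider the convex program (3.1)-(3.3) under Assumptions 3.1-3.2. If $\alpha > \frac{\beta^2}{2}$ in Algorithm 3.1, then for all $t \ge 1$, we have

$$
\begin{aligned}
f(\bar{x}(t)) &\le f(x^*) + \frac{\alpha}{t}\|x^* - x(-1)\|^2, \\
g_k(\bar{x}(t)) &\le \frac{1}{t}\Big(2\|\lambda^*\| + \sqrt{2\alpha}\,\|x^* - x(-1)\| + \sqrt{\frac{\alpha}{\alpha - \frac{1}{2}\beta^2}}\,\|g(x^*)\|\Big), \quad \forall k \in \{1, 2, \ldots, m\},
\end{aligned}
$$

where $x^*$ is an optimal solution of the problem (3.1)-(3.3) and $\beta$ is the Lipschitz modulus of $g(x)$, both of which are defined in Assumption 3.1; and $\lambda^*$ is a Lagrange multiplier vector satisfying Assumption 3.2.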
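The two bounds in Theorem 3.3 can be inverted to estimate how many iterations suffice for a target accuracy. The following is a minimal sketch, not part of the thesis: the function name and argument names are hypothetical, and in practice the quantities $\|x^* - x(-1)\|$, $\|\lambda^*\|$ and $\|g(x^*)\|$ are usually unknown, so they would have to be replaced by upper bounds.

```python
import math


def iterations_for_accuracy(eps, alpha, beta, dist0, lam_norm, g_star_norm):
    """Smallest t at which both right-hand sides in Theorem 3.3 are <= eps.

    dist0       : ||x* - x(-1)||  (assumed known or upper bounded)
    lam_norm    : ||lambda*||     (assumed known or upper bounded)
    g_star_norm : ||g(x*)||       (assumed known or upper bounded)
    """
    if alpha <= 0.5 * beta ** 2:
        raise ValueError("Theorem 3.3 requires alpha > beta^2 / 2")
    # Objective bound: f(x_bar(t)) - f(x*) <= objective_const / t
    objective_const = alpha * dist0 ** 2
    # Constraint bound: g_k(x_bar(t)) <= constraint_const / t
    constraint_const = (
        2.0 * lam_norm
        + math.sqrt(2.0 * alpha) * dist0
        + math.sqrt(alpha / (alpha - 0.5 * beta ** 2)) * g_star_norm
    )
    return max(1, math.ceil(max(objective_const, constraint_const) / eps))
```

Halving `eps` doubles the returned iteration count, which is the $O(1/\epsilon)$ scaling of the convergence time.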
In summary, Algorithm 3.1 ensures that the error decays like $O(1/t)$ and provides an $\epsilon$-approximate solution with convergence time $O(1/\epsilon)$.

3.3.5 Convex Programs with Linear Equality Constraints

So far, it has been assumed that there is no linear equality constraint in the convex program (3.1)-(3.3). In fact, if the convex program (3.1)-(3.3) contains a linear equality constraint given by $h(x) = 0$, we can replace it with the two inequality constraints $h(x) \le 0$ and $-h(x) \le 0$, both of which are convex, to rewrite the original convex program in the form of (3.1)-(3.3). After that, we can apply Algorithm 3.1 to solve the reformulated convex program with an $O(1/\epsilon)$ convergence time. However, by doing this, two virtual queues, rather than one, are needed for each linear equality constraint, and more virtual queues incur more computation and storage overhead in the implementation of Algorithm 3.1.

By looking into the proof of Lemma 3.4, we realize that one reason why the virtual queue update equation (3.4) updates $Q_k(t)$ as the larger one between $-g_k(x(t-1))$ and $Q_k(t-1) + g_k(x(t-1))$ is to yield a non-negative coefficient vector $Q(t) + g(x(t-1))$ such that $[Q(t) + g(x(t-1))]^T g(x)$ involved in the primal update is convex. This is necessary since a convex function multiplied by a negative constant is in general no longer convex. However, if $g_k(x)$ is a linear function, then $c\,g_k(x)$ is convex regardless of whether the constant $c$ is positive or negative. In fact, if $g_k(x)$ is linear and the convex program (3.1)-(3.3) has a linear equality constraint given by $g_k(x) = 0$, then it suffices to update the corresponding virtual queue $Q_k(t)$ using the equation

$$
Q_k(t+1) = Q_k(t) + g_k(x(t)), \quad \forall t \in \{0, 1, 2, \ldots\}, \qquad (3.12)
$$

and initialize the virtual queue with $Q_k(0) = 0$ to ensure the same $O(1/\epsilon)$ convergence time of Algorithm 3.1 without modifying any other steps.
To simplify the analysis, we now consider an extreme case of constrained convex programs where all functional constraints are linear equality constraints given by $g(x) = 0$. Instead of replacing $g(x) = 0$ with $g(x) \le 0$ and $-g(x) \le 0$ and applying the original Algorithm 3.1, we keep all equality constraints unchanged, initialize $Q_k(0) = 0, \forall k$, and replace (3.4) with (3.12) in Algorithm 3.1. In the remainder of this section, we sketch the $O(1/t)$ convergence rate analysis of such a modification of Algorithm 3.1.

Note that $Q_k(0) = 0$ and the virtual queue update equation (3.12) guarantee that $\sum_{\tau=0}^{t-1} g_k(x(\tau)) = Q_k(t), \forall t \ge 1$, which can be easily proven by using the same argument as in the proof of Lemma 3.2. Thus, if we can show that $\|Q(t)\|$ is bounded from above by a constant for all $t$, then we can establish the $O(1/t)$ convergence rate of constraint violations. Squaring both sides of (3.12), dividing by 2 and summing over $k \in \{1, 2, \ldots, m\}$ yields

$$
\Delta(t) = [Q(t)]^T g(x(t)) + \tfrac{1}{2}\|g(x(t))\|^2, \qquad (3.13)
$$

which is a drift identity that is even tighter than the drift bound in Lemma 3.3.

Since all linear functions are Lipschitz continuous, we assume $g(x)$ is Lipschitz continuous with modulus $\beta$. Note that $Q(t)$ updated by (3.12) ensures $f(x) + [Q(t) + g(x(t))]^T g(x)$ is convex as long as $g$ is linear. By using steps similar to those in the proof of Lemma 3.4 and using (3.13) rather than (3.7) in the last step, we can obtain a simpler drift-plus-penalty bound given by

$$
\Delta(t) + f(x(t)) \le f(x^*) + \alpha\big[\|x^* - x(t-1)\|^2 - \|x^* - x(t)\|^2\big].
$$

With the above drift-plus-penalty bound, we can prove the $O(1/t)$ convergence rate of objective and constraint violations following steps similar to those in Sections 3.3.2 and 3.3.3.

The $O(1/t)$ convergence rate for constrained convex programs with both inequality constraints and linear equality constraints can be established by combining the steps in the previous sections with the steps in this section.
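The two facts used in this subsection, namely that update (3.12) with $Q_k(0) = 0$ makes $Q_k(t)$ equal to the accumulated constraint values and that the drift satisfies identity (3.13), can be checked numerically. The sketch below is illustrative only: the function names are hypothetical, and the constraint values $g_k(x(t))$ are supplied directly as numbers rather than produced by Algorithm 3.1.

```python
def run_equality_queues(g_values):
    """Apply update (3.12) with Q_k(0) = 0.

    g_values[t][k] plays the role of g_k(x(t)).
    Returns the list [Q(0), Q(1), ..., Q(T)].
    """
    m = len(g_values[0])
    Q = [0.0] * m
    history = [list(Q)]
    for g in g_values:
        # (3.12): no max{.,.} is taken, so Q_k may become negative.
        Q = [Q[k] + g[k] for k in range(m)]
        history.append(list(Q))
    return history


def drift_identity_gap(Q, g):
    """Return Delta(t) - ([Q(t)]^T g + 0.5 * ||g||^2), which (3.13) says is 0."""
    Q_next = [q + gk for q, gk in zip(Q, g)]
    delta = 0.5 * sum(q * q for q in Q_next) - 0.5 * sum(q * q for q in Q)
    rhs = sum(q * gk for q, gk in zip(Q, g)) + 0.5 * sum(gk * gk for gk in g)
    return delta - rhs
```

For any input sequence, the last entry of the returned history equals the column sums of `g_values`, which mirrors $\sum_{\tau=0}^{t-1} g_k(x(\tau)) = Q_k(t)$.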
3.4 New Primal-Dual Type Algorithm for Smooth Constrained Convex Programs

In this section, we further assume that the convex program (3.1)-(3.3) has smooth objective and constraint functions, i.e., the following assumption holds.

Assumption 3.3 (Smoothness).
• Let function $f(x)$ be smooth with modulus $L_f$, i.e., $\|\nabla f(x_1) - \nabla f(x_2)\| \le L_f \|x_1 - x_2\|$ for all $x_1, x_2 \in \mathcal{X}$. For each $k \in \{1, 2, \ldots, m\}$, let function $g_k(x)$ be smooth with modulus $L_{g_k}$, i.e., $\|\nabla g_k(x_1) - \nabla g_k(x_2)\| \le L_{g_k} \|x_1 - x_2\|$ for all $x_1, x_2 \in \mathcal{X}$. Denote $L_g = [L_{g_1}, \ldots, L_{g_m}]^T$.

Now consider another new Lagrangian method described in Algorithm 3.2. Note that Algorithm 3.2 involves a positive step size sequence $\{\gamma(t), t \ge 0\}$. We consider the following two rules for choosing $\gamma(t)$ in Algorithm 3.2.

• Constant $\gamma(t)$: Choose positive step sizes $\gamma(t)$ via

$$
\gamma(t) = \gamma < \frac{1}{\beta^2 + L_f}, \quad \forall t \ge 0. \qquad (3.14)
$$

• Non-increasing $\gamma(t)$: Choose positive step sizes $\gamma(t)$ via

$$
\gamma(t) =
\begin{cases}
\dfrac{1}{\beta^2 + L_f + [Q(0) + g(x(-1))]^T L_g}, & t = 0, \\[2ex]
\min\left\{\gamma(t-1),\ \dfrac{1}{\beta^2 + L_f + [Q(t) + g(x(t-1))]^T L_g}\right\}, & t \ge 1.
\end{cases}
\qquad (3.15)
$$

Note that part 2 of Lemma 3.1 ensures $Q(t) + g(x(t-1)) \ge 0, \forall t \ge 0$. Thus, (3.15) ensures $\gamma(t) > 0, \forall t \ge 0$.

Algorithm 3.2 New Primal-Dual Type Algorithm for Smooth Constrained Convex Programs

Let $\{\gamma(t), t \ge 0\}$ be a sequence of positive step sizes. Choose any $x(-1) \in \mathcal{X}$. Initialize $Q_k(0) = \max\{0, -g_k(x(-1))\}, \forall k \in \{1, 2, \ldots, m\}$. At each iteration $t \in \{0, 1, 2, \ldots\}$, update $x(t)$ and $Q(t+1)$ as follows:

• Update primal variables via
$$
x(t) = \mathcal{P}_{\mathcal{X}}\big[x(t-1) - \gamma(t) d(t)\big],
$$
where $\mathcal{P}_{\mathcal{X}}[\cdot]$ is the projection onto the convex set $\mathcal{X}$ and $d(t) = \nabla f(x(t-1)) + \sum_{k=1}^{m} [Q_k(t) + g_k(x(t-1))] \nabla g_k(x(t-1))$ is the gradient of the function $\phi(x) = f(x) + [Q(t) + g(x(t-1))]^T g(x)$ at the point $x = x(t-1)$.

• Update virtual queues via the equation (3.4) in Algorithm 3.1.

• Output the running average $\bar{x}(t+1)$ given by
$$
\bar{x}(t+1) = \frac{1}{t+1}\sum_{\tau=0}^{t} x(\tau) = \bar{x}(t)\frac{t}{t+1} + x(t)\frac{1}{t+1}
$$
as the solution at iteration $t+1$.

Note that Algorithm 3.2 uses the same virtual queue update equation (3.4) used in Algorithm 3.1 but modifies the update of primal variables $x(t)$ from a minimization problem to a simple projection, which is similar to the primal update in Algorithm 1.2. Recall that if $f(x)$ or $g_k(x)$ are not separable, the primal update of $x(t)$ in Algorithm 3.1 is not decomposable and requires jointly solving a set-constrained convex minimization, which can have high computational complexity, especially when the dimension $n$ is large. In contrast, the projection used in Algorithm 3.2 can be distributively implemented as long as the gradient is known and the set $\mathcal{X}$ is a Cartesian product. Thus, Algorithm 3.2 is suitable for large scale convex programs with non-separable $f(x)$ or $g_k(x)$ since its per-iteration complexity is much less than that in Algorithm 3.1.

For constrained convex programs with non-separable $f(x)$ or $g_k(x)$, the primal update of $x(t)$ in Algorithm 1.2 also has low complexity since it follows a similar projection update. However, Algorithm 1.2 has a slow $O(1/\epsilon^2)$ convergence time as reviewed in Section 1.2. Another drawback of Algorithm 1.2 is that its implementation requires knowing an upper bound of the optimal Lagrange multiplier vector $\lambda^*$ (defined in Assumption 3.2), which is typically unavailable in practice. In this section, we show that Algorithm 3.2 has the same $O(1/\epsilon)$ convergence time as Algorithm 3.1 for the smooth constrained convex programs (3.1)-(3.3) and its implementation does not require any knowledge of the optimal Lagrange multiplier vector $\lambda^*$.

3.4.1 An Upper Bound of the Drift-Plus-Penalty Expression

Since Algorithm 3.2 uses the same virtual queue update equation (3.4) used in Algorithm 3.1, Lemmas 3.1-3.3 proven in Section 3.2 and Lemma 3.6 proven in Section 3.3.3 still hold for Algorithm 3.2. The convergence time analysis of Algorithm 3.2 follows a structure similar to that of Algorithm 3.1 and is presented in the remainder of this section.
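To make the steps of Algorithm 3.2 concrete, the following is a minimal pure-Python sketch under several assumptions that are not from the thesis: the set $\mathcal{X}$ is taken to be a box so that the projection is a coordinate-wise clip, a constant step size as in (3.14) is used, and the toy problem (minimize $x_1^2 + x_2^2$ subject to $1 - x_1 - x_2 \le 0$) and all function names are hypothetical. For this toy problem $L_f = 2$ and $\beta^2 = 2$, so any $\gamma < 1/4$ satisfies (3.14).

```python
def project_box(x, lo, hi):
    """Projection onto the box [lo, hi]^n (an assumed choice of the set X)."""
    return [min(max(xi, lo), hi) for xi in x]


def run_algorithm_3_2(grad_f, g, grad_g, x_init, gamma, num_iters, lo, hi):
    """Sketch of Algorithm 3.2 with a constant step size gamma.

    g(x) returns [g_1(x), ..., g_m(x)]; grad_g(x) returns the m gradients.
    Returns the running average x_bar(num_iters) and the final queue vector.
    """
    x_prev = list(x_init)                      # x(-1)
    g_prev = g(x_prev)
    Q = [max(0.0, -gk) for gk in g_prev]       # Q_k(0) = max{0, -g_k(x(-1))}
    x_sum = [0.0] * len(x_prev)
    for _ in range(num_iters):
        # Weights Q_k(t) + g_k(x(t-1)) used in the direction d(t).
        w = [Q[k] + g_prev[k] for k in range(len(Q))]
        d = list(grad_f(x_prev))
        for k, grad_gk in enumerate(grad_g(x_prev)):
            for i in range(len(d)):
                d[i] += w[k] * grad_gk[i]
        # Primal update: projected gradient step.
        x = project_box([x_prev[i] - gamma * d[i] for i in range(len(d))], lo, hi)
        # Virtual queue update (3.4).
        g_cur = g(x)
        Q = [max(-g_cur[k], Q[k] + g_cur[k]) for k in range(len(Q))]
        x_sum = [x_sum[i] + x[i] for i in range(len(x))]
        x_prev, g_prev = x, g_cur
    x_bar = [s / num_iters for s in x_sum]     # (1/t) * sum of x(0), ..., x(t-1)
    return x_bar, Q


# Hypothetical toy problem: minimize x1^2 + x2^2 subject to 1 - x1 - x2 <= 0.
def toy_grad_f(x):
    return [2.0 * x[0], 2.0 * x[1]]


def toy_g(x):
    return [1.0 - x[0] - x[1]]


def toy_grad_g(x):
    return [[-1.0, -1.0]]


if __name__ == "__main__":
    x_bar, Q = run_algorithm_3_2(toy_grad_f, toy_g, toy_grad_g,
                                 [0.0, 0.0], 0.2, 200, -5.0, 5.0)
    print(x_bar, Q)
```

The optimal solution of the toy problem is $x^* = (0.5, 0.5)$ with Lagrange multiplier $\lambda^* = 1$, so the running average should approach $(0.5, 0.5)$ and the virtual queue should stay close to 1. No bound on $\lambda^*$ is passed to the routine, in line with the remark above that Algorithm 3.2 does not need one.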
The first key step is to derive an upper bound for the drift-plus-penalty expression under Algorithm 3.2.

Lemma 3.8. Consider the convex program (3.1)-(3.3) under Assumptions 3.1-3.3. For all $t \ge 0$ in Algorithm 3.2, we have

$$
\begin{aligned}
\Delta(t) + f(x(t)) \le{} & f(x^*) + \frac{1}{2\gamma(t)}\big[\|x^* - x(t-1)\|^2 - \|x^* - x(t)\|^2\big] + \tfrac{1}{2}\big[\|g(x(t))\|^2 - \|g(x(t-1))\|^2\big] \\
& + \tfrac{1}{2}\Big[\beta^2 + L_f + [Q(t) + g(x(t-1))]^T L_g - \frac{1}{\gamma(t)}\Big]\|x(t) - x(t-1)\|^2,
\end{aligned}
$$

where $x^*$ is an optimal solution of the problem (3.1)-(3.3) and $\beta$ is the Lipschitz modulus of $g(x)$, both of which are defined in Assumption 3.1; and $L_f$ and $L_g$ are defined in Assumption 3.3.

Proof. Fix $t \ge 0$. The projection operator can be reinterpreted as an optimization problem as follows:

$$
\begin{aligned}
& x(t) = \mathcal{P}_{\mathcal{X}}\big[x(t-1) - \gamma(t) d(t)\big] \\
\overset{(a)}{\Leftrightarrow}\ & x(t) = \operatorname*{argmin}_{x \in \mathcal{X}} \big\|x - [x(t-1) - \gamma(t) d(t)]\big\|^2 \\
\Leftrightarrow\ & x(t) = \operatorname*{argmin}_{x \in \mathcal{X}} \Big\{\|x - x(t-1)\|^2 + 2\gamma(t)[d(t)]^T[x - x(t-1)] + [\gamma(t)]^2\|d(t)\|^2\Big\} \\
\overset{(b)}{\Leftrightarrow}\ & x(t) = \operatorname*{argmin}_{x \in \mathcal{X}} \Big\{f(x(t-1)) + \sum_{k=1}^{m}[Q_k(t) + g_k(x(t-1))] g_k(x(t-1)) + d^T(t)[x - x(t-1)] + \frac{1}{2\gamma(t)}\|x - x(t-1)\|^2\Big\} \\
\overset{(c)}{\Leftrightarrow}\ & x(t) = \operatorname*{argmin}_{x \in \mathcal{X}} \Big\{\phi(x(t-1)) + [\nabla\phi(x(t-1))]^T[x - x(t-1)] + \frac{1}{2\gamma(t)}\|x - x(t-1)\|^2\Big\}, \qquad (3.16)
\end{aligned}
$$

where (a) follows from the definition of the projection onto a convex set; (b) follows from the fact that the minimizing solution does not change when we remove the constant term $[\gamma(t)]^2\|d(t)\|^2$, multiply by the positive constant $\frac{1}{2\gamma(t)}$ and add the constant term $f(x(t-1)) + [Q(t) + g(x(t-1))]^T g(x(t-1))$ in the objective function; and (c) follows by defining

$$
\phi(x) = f(x) + [Q(t) + g(x(t-1))]^T g(x). \qquad (3.17)
$$

Note that part 2 in Lemma 3.1 implies that $Q(t) + g(x(t-1))$ is component-wise nonnegative for all $k \in \{1, 2, \ldots, m\}$. Hence, function $\phi(x)$ is convex with respect to $x$ on $\mathcal{X}$. Since $\frac{1}{2\gamma(t)}\|x - x(t-1)\|^2$ is strongly convex with respect to $x$ with modulus $\frac{1}{\gamma(t)}$, it follows that

$$
\phi(x(t-1)) + [\nabla\phi(x(t-1))]^T[x - x(t-1)] + \frac{1}{2\gamma(t)}\|x - x(t-1)\|^2
$$

is strongly convex with respect to $x$ with modulus $\frac{1}{\gamma(t)}$.
Since $x(t)$ is chosen to minimize the above strongly convex function, by Corollary 1.2, we have

$$
\begin{aligned}
& \phi(x(t-1)) + [\nabla\phi(x(t-1))]^T[x(t) - x(t-1)] + \frac{1}{2\gamma(t)}\|x(t) - x(t-1)\|^2 \\
\le{} & \phi(x(t-1)) + [\nabla\phi(x(t-1))]^T[x^* - x(t-1)] + \frac{1}{2\gamma(t)}\|x^* - x(t-1)\|^2 - \frac{1}{2\gamma(t)}\|x^* - x(t)\|^2 \\
\overset{(a)}{\le}{} & \phi(x^*) + \frac{1}{2\gamma(t)}\big[\|x^* - x(t-1)\|^2 - \|x^* - x(t)\|^2\big] \\
\overset{(b)}{=}{} & f(x^*) + \underbrace{[Q(t) + g(x(t-1))]^T g(x^*)}_{\le 0} + \frac{1}{2\gamma(t)}\big[\|x^* - x(t-1)\|^2 - \|x^* - x(t)\|^2\big] \\
\overset{(c)}{\le}{} & f(x^*) + \frac{1}{2\gamma(t)}\big[\|x^* - x(t-1)\|^2 - \|x^* - x(t)\|^2\big], \qquad (3.18)
\end{aligned}
$$

where (a) follows from the fact that $\phi(x)$ is convex with respect to $x$ on $\mathcal{X}$; (b) follows from the definition of function $\phi(x)$ in (3.17); and (c) follows by using the fact that $g_k(x^*) \le 0$ and $Q_k(t) + g_k(x(t-1)) \ge 0$ (i.e., part 2 in Lemma 3.1) for all $k \in \{1, 2, \ldots, m\}$ to eliminate the term marked by an underbrace.

Recall that $f(x)$ is smooth on $\mathcal{X}$ with modulus $L_f$ by Assumption 3.3. By Lemma 1.1, we have

$$
f(x(t)) \le f(x(t-1)) + [\nabla f(x(t-1))]^T[x(t) - x(t-1)] + \frac{L_f}{2}\|x(t) - x(t-1)\|^2. \qquad (3.19)
$$

Recall that each $g_k(x)$ is smooth on $\mathcal{X}$ with modulus $L_{g_k}$ by Assumption 3.3. Thus, $[Q_k(t) + g_k(x(t-1))] g_k(x)$ is smooth with modulus $[Q_k(t) + g_k(x(t-1))] L_{g_k}$. By Lemma 1.1, we have

$$
\begin{aligned}
[Q_k(t) + g_k(x(t-1))] g_k(x(t)) \le{} & [Q_k(t) + g_k(x(t-1))] g_k(x(t-1)) \\
& + [Q_k(t) + g_k(x(t-1))][\nabla g_k(x(t-1))]^T[x(t) - x(t-1)] \\
& + \frac{[Q_k(t) + g_k(x(t-1))] L_{g_k}}{2}\|x(t) - x(t-1)\|^2. \qquad (3.20)
\end{aligned}
$$

Summing (3.20) over $k \in \{1, 2, \ldots, m\}$ yields

$$
\begin{aligned}
& [Q(t) + g(x(t-1))]^T g(x(t)) \qquad (3.21) \\
\le{} & [Q(t) + g(x(t-1))]^T g(x(t-1)) + \sum_{k=1}^{m}[Q_k(t) + g_k(x(t-1))][\nabla g_k(x(t-1))]^T[x(t) - x(t-1)] \\
& + \frac{[Q(t) + g(x(t-1))]^T L_g}{2}\|x(t) - x(t-1)\|^2. \qquad (3.22)
\end{aligned}
$$

Summing (3.19) and (3.22) yields

$$
\begin{aligned}
& f(x(t)) + [Q(t) + g(x(t-1))]^T g(x(t)) \\
\le{} & f(x(t-1)) + [Q(t) + g(x(t-1))]^T g(x(t-1)) + [\nabla f(x(t-1))]^T[x(t) - x(t-1)] \\
& + \sum_{k=1}^{m}[Q_k(t) + g_k(x(t-1))][\nabla g_k(x(t-1))]^T[x(t) - x(t-1)] + \frac{L_f + [Q(t) + g(x(t-1))]^T L_g}{2}\|x(t) - x(t-1)\|^2 \\
\overset{(a)}{=}{} & \phi(x(t-1)) + [\nabla\phi(x(t-1))]^T[x(t) - x(t-1)] + \frac{L_f + [Q(t) + g(x(t-1))]^T L_g}{2}\|x(t) - x(t-1)\|^2, \qquad (3.23)
\end{aligned}
$$

where (a) follows from the definition of function $\phi(x)$ in (3.17).

Substituting (3.18) into (3.23) yields

$$
\begin{aligned}
& f(x(t)) + [Q(t) + g(x(t-1))]^T g(x(t)) \\
\le{} & f(x^*) + \frac{1}{2\gamma(t)}\big[\|x^* - x(t-1)\|^2 - \|x^* - x(t)\|^2\big] + \tfrac{1}{2}\Big[L_f + [Q(t) + g(x(t-1))]^T L_g - \frac{1}{\gamma(t)}\Big]\|x(t) - x(t-1)\|^2. \qquad (3.24)
\end{aligned}
$$

Note that $u_1^T u_2 = \frac{1}{2}\big[\|u_1\|^2 + \|u_2\|^2 - \|u_1 - u_2\|^2\big]$ for any $u_1, u_2 \in \mathbb{R}^m$. Thus, we have

$$
[g(x(t-1))]^T g(x(t)) = \tfrac{1}{2}\big[\|g(x(t-1))\|^2 + \|g(x(t))\|^2 - \|g(x(t-1)) - g(x(t))\|^2\big]. \qquad (3.25)
$$

Substituting (3.25) into (3.24) and rearranging terms yields

$$
\begin{aligned}
& f(x(t)) + [Q(t)]^T g(x(t)) \\
\le{} & f(x^*) + \frac{1}{2\gamma(t)}\big[\|x^* - x(t-1)\|^2 - \|x^* - x(t)\|^2\big] - \tfrac{1}{2}\|g(x(t-1))\|^2 - \tfrac{1}{2}\|g(x(t))\|^2 \\
& + \tfrac{1}{2}\|g(x(t-1)) - g(x(t))\|^2 + \tfrac{1}{2}\Big[L_f + [Q(t) + g(x(t-1))]^T L_g - \frac{1}{\gamma(t)}\Big]\|x(t) - x(t-1)\|^2 \\
\overset{(a)}{\le}{} & f(x^*) + \frac{1}{2\gamma(t)}\big[\|x^* - x(t-1)\|^2 - \|x^* - x(t)\|^2\big] - \tfrac{1}{2}\|g(x(t-1))\|^2 - \tfrac{1}{2}\|g(x(t))\|^2 \\
& + \tfrac{1}{2}\Big[\beta^2 + L_f + [Q(t) + g(x(t-1))]^T L_g - \frac{1}{\gamma(t)}\Big]\|x(t) - x(t-1)\|^2,
\end{aligned}
$$

where (a) follows from the fact that $\|g(x(t-1)) - g(x(t))\| \le \beta\|x(t) - x(t-1)\|$, which further follows from the assumption that $g(x)$ is Lipschitz continuous with modulus $\beta$.

Adding (3.7) to the above inequality and cancelling the common terms on both sides yields

$$
\begin{aligned}
\Delta(t) + f(x(t)) \le{} & f(x^*) + \frac{1}{2\gamma(t)}\big[\|x^* - x(t-1)\|^2 - \|x^* - x(t)\|^2\big] + \tfrac{1}{2}\big[\|g(x(t))\|^2 - \|g(x(t-1))\|^2\big] \\
& + \tfrac{1}{2}\Big[\beta^2 + L_f + [Q(t) + g(x(t-1))]^T L_g - \frac{1}{\gamma(t)}\Big]\|x(t) - x(t-1)\|^2.
\end{aligned}
$$
3.4.2 Smooth Constrained Convex Programs with Linear g(x)

This subsection shows that if each $g_k(x)$ is a linear function, then it suffices to choose constant parameters $\gamma(t) = \gamma < \frac{1}{\beta^2 + L_f}$ in Algorithm 3.2 to solve the smooth constrained convex program (3.1)-(3.3) with an $O(1/\epsilon)$ convergence time.

The next corollary follows directly from Lemma 3.8 by noting that $L_g = 0$ when each $g_k(x)$ is a linear function.

Corollary 3.1. Consider the convex program (3.1)-(3.3) where each $g_k(x)$ is a linear function under Assumptions 3.1-3.3. If we choose $\gamma(t)$ according to (3.14) in Algorithm 3.2, i.e., $\gamma(t) = \gamma < \frac{1}{\beta^2 + L_f}$, then for all $t \ge 0$, we have

$$
\Delta(t) + f(x(t)) \le f(x^*) + \frac{1}{2\gamma}\big[\|x^* - x(t-1)\|^2 - \|x^* - x(t)\|^2\big] + \tfrac{1}{2}\big[\|g(x(t))\|^2 - \|g(x(t-1))\|^2\big],
$$

where $x^*$ is an optimal solution of the problem (3.1)-(3.3) and $\beta$ is the Lipschitz modulus of $g(x)$, both of which are defined in Assumption 3.1; and $L_f$ is the constant defined in Assumption 3.3.

Proof. Note that if each $g_k(x)$ is a linear function, then we have $L_g = 0$. Fix $t \ge 0$. By Lemma 3.8 with $\gamma(t) = \gamma < \frac{1}{\beta^2 + L_f}$ and $L_g = 0$, we have

$$
\begin{aligned}
\Delta(t) + f(x(t)) \le{} & f(x^*) + \frac{1}{2\gamma}\big[\|x^* - x(t-1)\|^2 - \|x^* - x(t)\|^2\big] + \tfrac{1}{2}\big[\|g(x(t))\|^2 - \|g(x(t-1))\|^2\big] \\
& + \tfrac{1}{2}\Big[\beta^2 + L_f - \frac{1}{\gamma}\Big]\|x(t) - x(t-1)\|^2 \\
\overset{(a)}{\le}{} & f(x^*) + \frac{1}{2\gamma}\big[\|x^* - x(t-1)\|^2 - \|x^* - x(t)\|^2\big] + \tfrac{1}{2}\big[\|g(x(t))\|^2 - \|g(x(t-1))\|^2\big],
\end{aligned}
$$

where (a) follows from $\gamma < \frac{1}{\beta^2 + L_f}$.

Theorem 3.4. Consider the convex program (3.1)-(3.3) where each $g_k(x)$ is a linear function under Assumptions 3.1-3.3. Let $x^*$ be an optimal solution. Let $\lambda^*$ be a Lagrange multiplier vector satisfying Assumption 3.2. If we choose $\gamma(t)$ according to (3.14) in Algorithm 3.2, then for all $t \ge 1$, we have

1. $f(\bar{x}(t)) \le f(x^*) + \dfrac{1}{2\gamma t}\|x^* - x(-1)\|^2$.

2. $g_k(\bar{x}(t)) \le \dfrac{1}{t}\Big(2\|\lambda^*\| + \sqrt{\dfrac{1}{\gamma}}\,\|x^* - x(-1)\| + \sqrt{\dfrac{\frac{1}{\gamma}}{\frac{1}{\gamma} - \beta^2}}\,\|g(x^*)\|\Big)$.

where $x^*$ is an optimal solution of the problem (3.1)-(3.3) defined in Assumption 3.1; $\lambda^*$ is a Lagrange multiplier vector satisfying Assumption 3.2; and $L_f$ is the constant defined in Assumption 3.3.
That is, Algorithm 3.2 ensures that the error decays like $O(1/t)$ and provides an $\epsilon$-approximate solution with convergence time $O(1/\epsilon)$.

Proof. Note that Corollary 3.1 provides a drift-plus-penalty bound similar to the one derived in Lemma 3.4 for Algorithm 3.1. The only modification is replacing $\alpha$ with $\frac{1}{2\gamma}$. Following the same proof steps as in Sections 3.3.2-3.3.4, we can prove the current theorem.

3.4.3 Smooth Constrained Convex Programs with Non-Linear g(x)

For the smooth constrained convex program (3.1)-(3.3) with possibly nonlinear $g(x)$, we further make the following assumption:

Assumption 3.4.
• There exists $C > 0$ such that $\|g(x)\| \le C$ for all $x \in \mathcal{X}$.
• There exists $R > 0$ such that $\|x - y\| \le R$ for all $x, y \in \mathcal{X}$.

This subsection proves that if the convex program (3.1)-(3.3) with possibly nonlinear $g(x)$ satisfies Assumptions 3.1-3.4, then it suffices to choose non-increasing step sizes $\gamma(t)$ according to (3.15) in Algorithm 3.2 to solve the convex program (3.1)-(3.3) with an $O(1/\epsilon)$ convergence time.

Lemma 3.9. Consider the convex program (3.1)-(3.3) under Assumptions 3.1-3.4. If we choose non-increasing $\gamma(t)$ in Algorithm 3.2 according to (3.15), then we have

1. $\sum_{\tau=0}^{t} \frac{1}{2\gamma(\tau)}\big[\|x^* - x(\tau-1)\|^2 - \|x^* - x(\tau)\|^2\big] \le \frac{1}{2\gamma(t)} R^2, \ \forall t \ge 0$;

2. $\sum_{\tau=0}^{t-1} f(x(\tau)) \le t f(x^*) + \frac{1}{2\gamma(t-1)} R^2 + \frac{1}{2}\|g(x(t-1))\|^2 - \frac{1}{2}\|Q(t)\|^2, \ \forall t \ge 1$;

3. $\|Q(t+1)\| \le 2\|\lambda^*\| + R\sqrt{\frac{1}{\gamma(t)}} + C, \ \forall t \ge 0$;

where $x^*$ is an optimal solution of the problem (3.1)-(3.3) defined in Assumption 3.1; $\lambda^*$ is a Lagrange multiplier vector satisfying Assumption 3.2; and $R$ and $C$ are constants defined in Assumption 3.4.

Proof.

1. This is obviously true when $t = 0$. Fix $t \ge 1$. Note that

$$
\begin{aligned}
& \sum_{\tau=0}^{t} \frac{1}{2\gamma(\tau)}\big[\|x^* - x(\tau-1)\|^2 - \|x^* - x(\tau)\|^2\big] \\
={} & \frac{1}{2\gamma(0)}\|x^* - x(-1)\|^2 + \sum_{\tau=0}^{t-1}\Big[\frac{1}{2\gamma(\tau+1)} - \frac{1}{2\gamma(\tau)}\Big]\|x^* - x(\tau)\|^2 - \frac{1}{2\gamma(t)}\|x^* - x(t)\|^2 \\
\overset{(a)}{\le}{} & \frac{1}{2\gamma(0)} R^2 + \sum_{\tau=0}^{t-1}\Big[\frac{1}{2\gamma(\tau+1)} - \frac{1}{2\gamma(\tau)}\Big] R^2 \\
={} & \frac{1}{2\gamma(t)} R^2,
\end{aligned}
$$

where (a) follows because $\|x^* - x(\tau)\| \le R, \forall \tau \ge 0$ by Assumption 3.4 and $\gamma(\tau+1) \le \gamma(\tau), \forall \tau \ge 0$ by (3.15).

2. Fix $t \ge 1$.
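By Lemma 3.8, for all $\tau \in \{0, 1, 2, \ldots\}$, we have

$$
\begin{aligned}
\Delta(\tau) + f(x(\tau)) \le{} & f(x^*) + \frac{1}{2\gamma(\tau)}\big[\|x^* - x(\tau-1)\|^2 - \|x^* - x(\tau)\|^2\big] + \tfrac{1}{2}\big[\|g(x(\tau))\|^2 - \|g(x(\tau-1))\|^2\big] \\
& + \tfrac{1}{2}\Big[\beta^2 + L_f + [Q(\tau) + g(x(\tau-1))]^T L_g - \frac{1}{\gamma(\tau)}\Big]\|x(\tau) - x(\tau-1)\|^2 \\
\overset{(a)}{\le}{} & f(x^*) + \frac{1}{2\gamma(\tau)}\big[\|x^* - x(\tau-1)\|^2 - \|x^* - x(\tau)\|^2\big] + \tfrac{1}{2}\big[\|g(x(\tau))\|^2 - \|g(x(\tau-1))\|^2\big],
\end{aligned}
$$

where (a) follows because each $\gamma(\tau)$ chosen according to (3.15) ensures $\beta^2 + L_f + [Q(\tau) + g(x(\tau-1))]^T L_g - \frac{1}{\gamma(\tau)} \le 0$.

Summing over $\tau \in \{0, 1, 2, \ldots, t-1\}$ and rearranging terms yields

$$
\begin{aligned}
\sum_{\tau=0}^{t-1} f(x(\tau)) \le{} & t f(x^*) + \sum_{\tau=0}^{t-1}\frac{1}{2\gamma(\tau)}\big[\|x^* - x(\tau-1)\|^2 - \|x^* - x(\tau)\|^2\big] \\
& + \tfrac{1}{2}\sum_{\tau=0}^{t-1}\big[\|g(x(\tau))\|^2 - \|g(x(\tau-1))\|^2\big] - \sum_{\tau=0}^{t-1}\Delta(\tau) \\
\overset{(a)}{\le}{} & t f(x^*) + \frac{1}{2\gamma(t-1)} R^2 + \tfrac{1}{2}\|g(x(t-1))\|^2 - \tfrac{1}{2}\|g(x(-1))\|^2 + \tfrac{1}{2}\|Q(0)\|^2 - \tfrac{1}{2}\|Q(t)\|^2 \\
\overset{(b)}{\le}{} & t f(x^*) + \frac{1}{2\gamma(t-1)} R^2 + \tfrac{1}{2}\|g(x(t-1))\|^2 - \tfrac{1}{2}\|Q(t)\|^2,
\end{aligned}
$$

where (a) follows from part 1 of this lemma and by recalling that $\Delta(\tau) = \frac{1}{2}\|Q(\tau+1)\|^2 - \frac{1}{2}\|Q(\tau)\|^2$; and (b) follows because $\|Q(0)\|^2 \le \|g(x(-1))\|^2$ by part 3 in Lemma 3.1.

3. By part 2 of this lemma, we have

$$
\begin{aligned}
\sum_{\tau=0}^{t} f(x(\tau)) &\le (t+1) f(x^*) + \frac{1}{2\gamma(t)} R^2 + \tfrac{1}{2}\|g(x(t))\|^2 - \tfrac{1}{2}\|Q(t+1)\|^2 \\
&\overset{(a)}{\le} (t+1) f(x^*) + \frac{1}{2\gamma(t)} R^2 + \tfrac{1}{2} C^2 - \tfrac{1}{2}\|Q(t+1)\|^2, \qquad (3.26)
\end{aligned}
$$

where (a) follows from $\|g(x(t))\| \le C$ by Assumption 3.4.

By Lemma 3.6, we have

$$
\sum_{\tau=0}^{t} f(x(\tau)) \ge (t+1) f(x^*) - \|\lambda^*\|\,\|Q(t+1)\|. \qquad (3.27)
$$

Combining (3.26) and (3.27), cancelling common terms and rearranging terms yields

$$
\begin{aligned}
& \tfrac{1}{2}\|Q(t+1)\|^2 - \|\lambda^*\|\,\|Q(t+1)\| - \frac{1}{2\gamma(t)} R^2 - \tfrac{1}{2} C^2 \le 0 \\
\Rightarrow\ & \big[\|Q(t+1)\| - \|\lambda^*\|\big]^2 \le \|\lambda^*\|^2 + \frac{1}{\gamma(t)} R^2 + C^2 \\
\Rightarrow\ & \|Q(t+1)\| \le \|\lambda^*\| + \sqrt{\|\lambda^*\|^2 + \frac{1}{\gamma(t)} R^2 + C^2} \\
\overset{(a)}{\Rightarrow}\ & \|Q(t+1)\| \le 2\|\lambda^*\| + \sqrt{\frac{1}{\gamma(t)}}\, R + C, \qquad (3.28)
\end{aligned}
$$

where (a) follows from the basic inequality $\sqrt{z_1 + z_2 + z_3} \le \sqrt{z_1} + \sqrt{z_2} + \sqrt{z_3}$ for any $z_1, z_2, z_3 \ge 0$.

Lemma 3.10. Consider the convex program (3.1)-(3.3) with possibly nonlinear $g_k(x)$ under Assumptions 3.1-3.4.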
If we choose non-increasing $\gamma(t)$ according to (3.15) in Algorithm 3.2, then $\gamma(t) \ge \gamma_{\min}, \forall t \ge 0$ with constant

$$
\gamma_{\min} = \frac{1}{\Big(\sqrt{\beta^2 + L_f + 2\|\lambda^*\|\|L_g\| + 2C\|L_g\|} + R\|L_g\|\Big)^2}, \qquad (3.29)
$$

where $\beta$, $\lambda^*$, $L_f$, $L_g$, $R$ and $C$ are constants defined in Assumptions 3.1-3.4.

Proof. This lemma can be proven by induction as follows. By (3.15), we have

$$
\gamma(0) = \frac{1}{\beta^2 + L_f + [Q(0) + g(x(-1))]^T L_g}
\overset{(a)}{\ge} \frac{1}{\beta^2 + L_f + \|Q(0) + g(x(-1))\|\|L_g\|}
\overset{(b)}{\ge} \frac{1}{\beta^2 + L_f + 2C\|L_g\|}
\ge \gamma_{\min},
$$

where (a) follows from the Cauchy-Schwarz inequality; and (b) follows from $\|Q(0) + g(x(-1))\| \le \|Q(0)\| + \|g(x(-1))\| \le 2\|g(x(-1))\| \le 2C$, where the second inequality follows from part 3 of Lemma 3.1 and the third inequality follows from Assumption 3.4. Thus, we have $\gamma(0) \ge \gamma_{\min}$.

Now assume $\gamma(t) \ge \gamma_{\min}$ holds for $t = t_0$ and consider $t = t_0 + 1$. By (3.15), $\gamma(t_0 + 1)$ is given by

$$
\gamma(t_0 + 1) = \min\left\{\gamma(t_0),\ \frac{1}{\beta^2 + L_f + [Q(t_0+1) + g(x(t_0))]^T L_g}\right\}.
$$

Since $\gamma(t_0) \ge \gamma_{\min}$ by the induction hypothesis, to prove $\gamma(t_0 + 1) \ge \gamma_{\min}$, it remains to prove

$$
\frac{1}{\beta^2 + L_f + [Q(t_0+1) + g(x(t_0))]^T L_g} \ge \gamma_{\min}.
$$

By part 3 of Lemma 3.9, we have

$$
\|Q(t_0+1)\| \le 2\|\lambda^*\| + R\sqrt{\frac{1}{\gamma(t_0)}} + C \overset{(a)}{\le} 2\|\lambda^*\| + R\sqrt{\frac{1}{\gamma_{\min}}} + C, \qquad (3.30)
$$

where (a) follows from the induction hypothesis. Thus, we have

$$
\begin{aligned}
\frac{1}{\beta^2 + L_f + [Q(t_0+1) + g(x(t_0))]^T L_g}
&\overset{(a)}{\ge} \frac{1}{\beta^2 + L_f + \|Q(t_0+1) + g(x(t_0))\|\|L_g\|} \\
&\overset{(b)}{\ge} \frac{1}{\beta^2 + L_f + \|Q(t_0+1)\|\|L_g\| + \|g(x(t_0))\|\|L_g\|} \\
&\overset{(c)}{\ge} \frac{1}{\beta^2 + L_f + \Big[2\|\lambda^*\| + R\sqrt{\frac{1}{\gamma_{\min}}} + C\Big]\|L_g\| + C\|L_g\|} \\
&= \frac{1}{\beta^2 + L_f + 2\|\lambda^*\|\|L_g\| + 2C\|L_g\| + R\|L_g\|\sqrt{\frac{1}{\gamma_{\min}}}} \\
&\overset{(d)}{=} \frac{1}{\beta^2 + L_f + 2\|\lambda^*\|\|L_g\| + 2C\|L_g\| + (R\|L_g\|)^2 + R\|L_g\|\sqrt{\beta^2 + L_f + 2\|\lambda^*\|\|L_g\| + 2C\|L_g\|}} \\
&\overset{(e)}{\ge} \frac{1}{\Big(\sqrt{\beta^2 + L_f + 2\|\lambda^*\|\|L_g\| + 2C\|L_g\|} + R\|L_g\|\Big)^2} \\
&= \gamma_{\min},
\end{aligned}
$$

where (a) follows from the Cauchy-Schwarz inequality; (b) follows from the triangle inequality; (c) follows from (3.30) and $\|g(x(t_0))\| \le C$ by Assumption 3.4; (d) follows by substituting $\gamma_{\min}$ from (3.29); and (e) follows from the basic inequality $z_1^2 + z_2^2 + z_1 z_2 \le (z_1 + z_2)^2$ for any $z_1, z_2 \ge 0$.
Thus, we have $\gamma(t_0 + 1) \ge \gamma_{\min}$. This lemma follows by induction.

The next theorem summarizes the $O(1/\epsilon)$ convergence time of Algorithm 3.2 for the smooth constrained convex program (3.1)-(3.3) with possibly nonlinear $g_k(x)$.

Theorem 3.5. Consider the convex program (3.1)-(3.3) with possibly nonlinear $g_k(x)$ under Assumptions 3.1-3.4. Let $x^*$ be an optimal solution and $\lambda^*$ be a Lagrange multiplier vector satisfying Assumption 3.2. If we choose non-increasing $\gamma(t)$ according to (3.15) in Algorithm 3.2, then for all $t \ge 1$, we have

1. $f(\bar{x}(t)) \le f(x^*) + \dfrac{1}{t}\dfrac{R^2}{2\gamma_{\min}}$.

2. $g_k(\bar{x}(t)) \le \dfrac{1}{t}\Big(2\|\lambda^*\| + R\sqrt{\dfrac{1}{\gamma_{\min}}} + C\Big), \ \forall k \in \{1, 2, \ldots, m\}$.

where $\gamma_{\min}$ is the constant defined in Lemma 3.10; $x^*$ is an optimal solution of the problem (3.1)-(3.3) defined in Assumption 3.1; $\lambda^*$ is a Lagrange multiplier vector satisfying Assumption 3.2; and $R$ and $C$ are constants defined in Assumption 3.4. That is, Algorithm 3.2 ensures that the error decays like $O(1/t)$ and provides an $\epsilon$-approximate solution with convergence time $O(1/\epsilon)$.

Proof.

1. Fix $t \ge 1$. By part 2 of Lemma 3.9, we have

$$
\sum_{\tau=0}^{t-1} f(x(\tau)) \le t f(x^*) + \frac{1}{2\gamma(t-1)} R^2 + \tfrac{1}{2}\|g(x(t-1))\|^2 - \tfrac{1}{2}\|Q(t)\|^2 \overset{(a)}{\le} t f(x^*) + \frac{1}{2\gamma_{\min}} R^2,
$$

where (a) follows from $\gamma(t-1) \ge \gamma_{\min}$ by Lemma 3.10 and $\|Q(t)\| \ge \|g(x(t-1))\|$ by Lemma 3.1.

Recall that $\bar{x}(t) = \frac{1}{t}\sum_{\tau=0}^{t-1} x(\tau)$. Dividing both sides by $t$ and using Jensen’s inequality for the convex function $f(x)$ yields $f(\bar{x}(t)) \le f(x^*) + \frac{1}{t}\frac{R^2}{2\gamma_{\min}}$.

2. Fix $t \ge 1$ and $k \in \{1, 2, \ldots, m\}$. Recall that $\bar{x}(t) = \frac{1}{t}\sum_{\tau=0}^{t-1} x(\tau)$. Thus,

$$
g_k(\bar{x}(t)) \overset{(a)}{\le} \frac{1}{t}\sum_{\tau=0}^{t-1} g_k(x(\tau)) \overset{(b)}{\le} \frac{Q_k(t)}{t} \le \frac{\|Q(t)\|}{t} \overset{(c)}{\le} \frac{1}{t}\Big(2\|\lambda^*\| + R\sqrt{\frac{1}{\gamma_{\min}}} + C\Big),
$$

where (a) follows from the convexity of $g_k(x), k \in \{1, 2, \ldots, m\}$ and Jensen’s inequality; (b) follows from Lemma 3.2; and (c) follows because $\|Q(t)\| \le 2\|\lambda^*\| + R\sqrt{\frac{1}{\gamma(t-1)}} + C$ by part 3 of Lemma 3.9 and $\gamma(t-1) \ge \gamma_{\min}$ by Lemma 3.10.

3.5 Chapter Summary

This chapter considers two new Lagrangian methods to solve constrained convex programs.
The first algorithm can solve general convex programs with possibly non-differentiable objective or constraint functions, and has a parallel implementation when the objective and constraint functions are separable. The second algorithm can solve convex programs with smooth objective and constraint functions. At each iteration, the second algorithm updates the primal variable $x(t)$ using a simple projected gradient update, which can be distributively implemented even if the objective or constraint functions are not separable. Both algorithms are proven to have a fast $O(1/\epsilon)$ convergence time.

Chapter 4

New Backpressure Algorithms for Joint Rate Control and Routing

In multi-hop data networks, the problem of joint rate control and routing is to accept data into the network to maximize certain utilities and to make routing decisions at each node such that all accepted data are delivered to intended destinations without overflowing any queue in intermediate nodes. The original backpressure algorithm proposed in the seminal work [TE92] by Tassiulas and Ephremides addresses this problem by assuming that incoming data are given and are inside the network stability region and develops a routing strategy to deliver all incoming data without overflowing any queue. In the context of [TE92], there is essentially no utility maximization consideration in the network. The backpressure algorithm is further extended by a drift-plus-penalty technique to deal with both utility maximization and queue stability [Nee03, GNT06, Nee10]. Alternative extensions for both utility maximization and queue stabilization are developed in [ES06, Sto05, LS04, LMS06]. The above extended backpressure algorithms have different dynamics and/or may yield different utility-delay tradeoff results. However, all of them rely on “backpressure” quantities, which are the differential backlogs between neighboring nodes.
It has been observed in [NMR05, ES06, LS04, LSXS15] that the drift-plus-penalty and other alternative algorithms can be interpreted as first order Lagrangian methods for constrained optimization. In addition, these backpressure algorithms follow certain fundamental utility-delay tradeoffs. For instance, the primal-dual type backpressure algorithm in [ES06] achieves an $O(\epsilon)$ utility optimality gap with an $O(1/\epsilon^2)$ queue length. That is, a small utility optimality gap (corresponding to a small $\epsilon$) is available only at the cost of a large queue length. The drift-plus-penalty backpressure algorithm [Nee10], which has the best utility-delay tradeoff among all existing first order Lagrangian methods for general networks, can only achieve an $O(\epsilon)$ utility optimality gap with an $O(1/\epsilon)$ queue length. Under certain restrictive assumptions over the network, a better $[O(\epsilon), O(\log(1/\epsilon))]$ tradeoff is achieved via an exponential Lyapunov function in [Nee06], and an $[O(\epsilon), O(\log^2(1/\epsilon))]$ tradeoff is achieved via a LIFO-backpressure algorithm in [HMNK13]. Fundamental lower bounds on utility-delay tradeoffs in [BG02, Nee07, ES12, Nee16, Nee06] show that, for various stochastic network settings, a large queue delay is unavoidable if a small utility optimality gap is demanded. These works consider certain hard problems with stochastic behavior. It leaves open the question of whether or not performance can be improved for networks that fall outside these hard cases. The current chapter investigates network flow problems that can be written as (deterministic) convex programs, which are not restricted to the prior lower bounds. We pursue the question of whether or not improved tradeoffs are possible. Can optimal utility be approached with constant queue sizes?

Recently, there have been many attempts at obtaining new variations of backpressure algorithms for deterministic network flow problems by applying Newton’s method to the Lagrangian dual function.
In the recent work [LSXS15], the authors develop a Newton’s method for joint rate control and routing. However, the utility-delay tradeoff in [LSXS15] is still $[O(\epsilon), O(1/\epsilon^2)]$; and the algorithm requires a centralized projection step although Newton directions can be approximated in a distributed manner. Work [WOJ13] considers a network flow control problem where the path of each flow is given (and hence there is no routing part in the problem), and proposes a decentralized Newton based algorithm for rate control. Work [ZRJ13] considers network routing without an end-to-end utility and only shows the stability of the proposed Newton based backpressure algorithm. All of the above Newton’s method based algorithms rely on distributed approximations for the inverse of Hessians, whose computations still require certain coordination for the local information updates and propagations and do not scale well with the network size. In contrast, the first order Lagrangian methods do not need global network topology information. Rather, each node only needs the queue length information of its neighbors.

In this chapter, we propose two new backpressure algorithms that are as simple as the existing algorithms in [Nee10, ES06, LS04] but have a better utility-delay tradeoff. The first new backpressure algorithm is originally developed in our paper [YN17b] and the second new backpressure algorithm is developed in our technical report [YN17c]. The first algorithm is almost a straightforward application of Algorithm 3.1, the new Lagrangian method developed in Chapter 3 for constrained convex programs, to the network utility maximization with node flow balance constraints. However, this backpressure algorithm involves a global algorithm parameter that depends on the number of sessions and the number of links in the underlying network.
The second algorithm is developed by adapting the general Lagrangian method developed in Chapter 3 for the specific network optimization problem such that only local algorithm parameters, which can be locally determined by each node, are used.

The new backpressure algorithms achieve a vanishing utility optimality gap that decays like $O(1/t)$, where $t$ is the number of iterations. They also guarantee that the queue length at each node is always bounded by a fixed constant of the same order as the optimal Lagrange multiplier of the network optimization problem. This improves on the utility-delay tradeoffs of prior work. In particular, it improves the steady-state $[O(\epsilon), O(1/\epsilon^2)]$ utility-delay tradeoff in [ES06] and the $[O(\epsilon), O(1/\epsilon)]$ utility-delay tradeoff of the drift-plus-penalty algorithm in [Nee10], both of which yield an unbounded queue length to have a vanishing utility optimality gap. Indeed, the steady-state utility-delay tradeoff of our algorithm is $[0, O(1)]$. They are the first algorithms to achieve zero utility gap and finite queue lengths for joint rate control and routing in multi-hop data networks. The convergence time to reach this limiting performance is also faster than prior work.

The new backpressure algorithms differ from existing first order backpressure algorithms in the following aspects:

1. The “backpressure” quantities in this paper are with respect to newly introduced weights. These are different from queues used in other backpressure algorithms, but can still be locally tracked and updated.

2. The rate control and routing decision rule involves a quadratic term that is similar to a term used in proximal algorithms [PB13].

Note that the benefit of introducing a quadratic term in network optimization has been observed in [LS06]. Work [LS06] developed a distributive rate control algorithm for network utility maximization (NUM) problems with given routing paths that can be reformulated as a special case of the problem treated in this paper.
The algorithm of [LS06] considers a fixed set of predetermined paths for each session and does not scale well when treating all (typically exponentially many) possible paths of a general network. The algorithm proposed in [LS06] is not of backpressure type and does not have queue length or convergence time guarantees. The source session rates produced during the execution of that algorithm can violate link capacity constraints and hence are infeasible before convergence.

4.1 System Model and Problem Formulation

Consider a slotted data network with normalized time slots $t \in \{0, 1, 2, \ldots\}$. This network is represented by a graph $\mathcal{G} = (\mathcal{N}, \mathcal{L})$, where $\mathcal{N}$ is the set of nodes and $\mathcal{L} \subseteq \mathcal{N} \times \mathcal{N}$ is the set of directed links. Let $|\mathcal{N}| = N$ and $|\mathcal{L}| = L$. This network is shared by $F$ end-to-end sessions denoted by a set $\mathcal{F}$. For each end-to-end session $f \in \mathcal{F}$, the source node $\mathrm{Src}(f)$ and destination node $\mathrm{Dst}(f)$ are given but the routes are not specified. Each session $f$ has a continuous and concave utility function $U_f(x_f)$ that represents the "satisfaction" received by accepting $x_f$ amount of data for session $f$ into the network at each slot. Unlike [ES06, LSXS15], where $U_f(\cdot)$ is assumed to be differentiable and strongly concave, this chapter considers general concave utility functions $U_f(\cdot)$, including those that are neither differentiable nor strongly concave. Formally, each utility function $U_f$ is defined over an interval $\mathrm{dom}(U_f)$, called the domain of the function. It is assumed throughout that either $\mathrm{dom}(U_f) = [0, \infty)$ or $\mathrm{dom}(U_f) = (0, \infty)$, the latter being important for proportionally fair utilities [KMT98] $U_f(x) = \log(x)$ that have singularities at $x = 0$.

Denote the capacity of link $l$ as $C_l$ and assume it is a fixed and positive constant (footnote 1). Define $\mu_l^{(f)}$ as the amount of session $f$'s data routed at link $l$, which is to be determined by our algorithm. Note that in general, the network may be configured such that some session $f$ is forbidden to use link $l$.
For each link $l$, define $\mathcal{S}_l \subseteq \mathcal{F}$ as the set of sessions that are allowed to use link $l$. The case of unrestricted routing is treated by defining $\mathcal{S}_l = \mathcal{F}$ for all links $l$. Note that if $l = (n,m)$ with $n, m \in \mathcal{N}$, then $\mu_l^{(f)}$ and $C_l$ can also be respectively written as $\mu_{(n,m)}^{(f)}$ and $C_{(n,m)}$. For each node $n \in \mathcal{N}$, denote the sets of its incoming links and outgoing links as $\mathcal{I}(n)$ and $\mathcal{O}(n)$, respectively. Note that $x_f, \forall f \in \mathcal{F}$ and $\mu_l^{(f)}, \forall l \in \mathcal{L}, \forall f \in \mathcal{F}$ are the decision variables of a joint rate control and routing algorithm. If the global network topology information is available, the optimal joint rate control and routing can be formulated as the following multi-commodity network flow problem:

\begin{align*}
\max_{x_f,\, \mu_l^{(f)}} \quad & \sum_{f \in \mathcal{F}} U_f(x_f) \tag{4.1} \\
\text{s.t.} \quad & x_f \mathbf{1}_{\{n = \mathrm{Src}(f)\}} + \sum_{l \in \mathcal{I}(n)} \mu_l^{(f)} \le \sum_{l \in \mathcal{O}(n)} \mu_l^{(f)}, \quad \forall f \in \mathcal{F}, \forall n \in \mathcal{N} \setminus \{\mathrm{Dst}(f)\}, \tag{4.2} \\
& \sum_{f \in \mathcal{F}} \mu_l^{(f)} \le C_l, \quad \forall l \in \mathcal{L}, \tag{4.3} \\
& \mu_l^{(f)} \ge 0, \quad \forall l \in \mathcal{L}, \forall f \in \mathcal{S}_l, \tag{4.4} \\
& \mu_l^{(f)} = 0, \quad \forall l \in \mathcal{L}, \forall f \in \mathcal{F} \setminus \mathcal{S}_l, \tag{4.5} \\
& x_f \in \mathrm{dom}(U_f), \quad \forall f \in \mathcal{F}, \tag{4.6}
\end{align*}

where $\mathbf{1}_{\{\cdot\}}$ is an indicator function; (4.2) represents the node flow conservation constraints relaxed by replacing equalities with inequalities, meaning that the total rate of flow $f$ into node $n$ is less than or equal to the total rate of flow $f$ out of the node (since, in principle, we can always send fake data on departure links when the inequality is loose); and (4.3) represents link capacity constraints. Note that for each flow $f$, there is no constraint (4.2) at its destination node $\mathrm{Dst}(f)$ since all incoming data are consumed by this node.

Footnote 1: As stated in [LSXS15], this is a suitable model for wireline networks and wireless networks with fixed transmission power and orthogonal channels.

The above formulation includes network utility maximization with fixed paths as special cases. In the case when each session only has one single given path, e.g., the network utility maximization problem considered in [LL99], we could modify the sets $\mathcal{S}_l$ used in constraints (4.4) and (4.5) to reflect this fact.
For example, if link $l_1$ is only used for sessions $f_1$ and $f_2$, then $\mathcal{S}_{l_1} = \{f_1, f_2\}$. Similarly, the case of [LS06], where each flow is restricted to using links from a set of predefined paths, can be treated by modifying the sets $\mathcal{S}_l$ accordingly. See Section 4.6.1 for more discussion.

The solution to the problem (4.1)-(4.6) corresponds to the optimal joint rate control and routing. However, to solve this convex program on a single computer, we need to know the global network topology, and the solution is a centralized one, which is not practical for large data networks. As observed in [NMR05, ES06, LS04, LSXS15], various versions of backpressure algorithms can be interpreted as distributed solutions to the problem (4.1)-(4.6) obtained from first-order Lagrangian methods.

Two mild assumptions are made concerning the problem (4.1)-(4.6).

Assumption 4.1 (Feasibility). The problem (4.1)-(4.6) has at least one optimal solution vector $[x_f^*; \mu_l^{(f),*}]_{f \in \mathcal{F}, l \in \mathcal{L}}$.

Assumption 4.2 (Existence of Lagrange Multipliers). Condition 1.1 holds for the convex program (4.1)-(4.6). Specifically, define the convex set $\mathcal{C} = \{[x_f; \mu_l^{(f)}]_{f \in \mathcal{F}, l \in \mathcal{L}} : \text{(4.3)-(4.6) hold}\}$. Assume there exists a Lagrange multiplier vector $\lambda^* = [\lambda_n^{(f),*}]_{f \in \mathcal{F}, n \in \mathcal{N} \setminus \{\mathrm{Dst}(f)\}} \ge 0$ such that $q(\lambda^*) = \max\{\text{(4.1)} : \text{(4.2)-(4.6)}\}$, where

\[
q(\lambda) = \sup_{[x_f; \mu_l^{(f)}] \in \mathcal{C}} \Big\{ \sum_{f \in \mathcal{F}} U_f(x_f) - \sum_{f \in \mathcal{F}} \sum_{n \in \mathcal{N} \setminus \{\mathrm{Dst}(f)\}} \lambda_n^{(f)} \Big[ x_f \mathbf{1}_{\{n = \mathrm{Src}(f)\}} + \sum_{l \in \mathcal{I}(n)} \mu_l^{(f)} - \sum_{l \in \mathcal{O}(n)} \mu_l^{(f)} \Big] \Big\}
\]

is the Lagrangian dual function of the problem (4.1)-(4.6) obtained by treating (4.3)-(4.6) as a convex set constraint.

Assumptions 4.1 and 4.2 hold in most cases of interest. For example, the Slater condition guarantees Assumption 4.2.
Since the constraints (4.2)-(4.6) are linear, Proposition 6.4.2 in [BNO03] ensures that Lagrange multipliers exist whenever constraints (4.2)-(4.6) are feasible and the utility functions $U_f$ are either defined over open sets (such as $U_f(x) = \log(x)$ with $\mathrm{dom}(U_f) = (0, \infty)$) or can be concavely extended to open sets, meaning that there is an $\epsilon > 0$ and a concave function $\tilde{U}_f : (-\epsilon, \infty) \to \mathbb{R}$ such that $\tilde{U}_f(x) = U_f(x)$ whenever $x \ge 0$ (footnote 2).

Fact 4.1 (Replacing Inequality with Equality). If Assumption 4.1 holds, the problem (4.1)-(4.6) has an optimal solution vector $[x_f^*; \mu_l^{(f),*}]_{f \in \mathcal{F}, l \in \mathcal{L}}$ such that all constraints (4.2) hold with equality.

Proof. Note that each $\mu_l^{(f)}$ appears on the left side of at most one constraint (4.2) and on the right side of at most one constraint (4.2). Let $[x_f^*; \mu_l^{(f),*}]_{f \in \mathcal{F}, l \in \mathcal{L}}$ be an optimal solution vector such that at least one inequality constraint (4.2) is loose. We can reduce the value of $\mu_l^{(f),*}$ on the right side of a loose constraint (4.2) until either that constraint holds with equality or $\mu_l^{(f),*}$ reaches 0. The objective function value does not change, and no constraints are violated. We can repeat the process until all inequality constraints (4.2) are tight.

Footnote 2: If $\mathrm{dom}(U_f) = [0, \infty)$, such a concave extension is possible if the right-derivative of $U_f$ at $x = 0$ is finite (such as for $U_f(x) = \log(1+x)$ or $U_f(x) = \min[x, 3]$). Such an extension is impossible for the example $U_f(x) = \sqrt{x}$ because the slope is infinite at $x = 0$. Nevertheless, Lagrange multipliers often exist even for these utility functions, such as when the Slater condition holds [BNO03].

4.2 New Backpressure Algorithms

4.2.1 Discussion of Various Queueing Models

At each node, an independent queue backlog is maintained for each session. At each slot $t$, let $x_f(t)$ be the source session rates, and let $\mu_l^{(f)}(t)$ be the link session rates.
Some prior work enforces the constraints (4.2) via virtual queues $Y_n^{(f)}(t)$ of the following form:

\[
Y_n^{(f)}(t+1) = \max\Big\{ Y_n^{(f)}(t) + x_f(t) \mathbf{1}_{\{n = \mathrm{Src}(f)\}} + \sum_{l \in \mathcal{I}(n)} \mu_l^{(f)}(t) - \sum_{l \in \mathcal{O}(n)} \mu_l^{(f)}(t),\ 0 \Big\}. \tag{4.7}
\]

While this virtual queue equation is a meaningful approximation, it differs from reality in that newly injected data are allowed to be transmitted immediately, or equivalently, a single packet is allowed to enter and leave many nodes within the same slot. Further, there is no clear connection between the virtual queues $Y_n^{(f)}(t)$ in (4.7) and the actual queues in the network. Indeed, it is easy to construct examples that show there can be an arbitrarily large difference between the $Y_n^{(f)}(t)$ value in (4.7) and the physical queue size in actual networks (see Section 4.6.2 for an illustrative example).

An actual queueing network has queues $Z_n^{(f)}(t)$ with the following dynamics:

\[
Z_n^{(f)}(t+1) \le \max\Big\{ Z_n^{(f)}(t) - \sum_{l \in \mathcal{O}(n)} \mu_l^{(f)}(t),\ 0 \Big\} + x_f(t) \mathbf{1}_{\{n = \mathrm{Src}(f)\}} + \sum_{l \in \mathcal{I}(n)} \mu_l^{(f)}(t). \tag{4.8}
\]

This is faithful to actual queue dynamics and does not allow data to be retransmitted over multiple hops in one slot. Note that (4.8) is an inequality because the new arrivals from other nodes may be strictly less than $\sum_{l \in \mathcal{I}(n)} \mu_l^{(f)}(t)$ when those other nodes do not have enough backlog to send. The model (4.8) allows for any decisions to be made to fill the transmission values $\mu_l^{(f)}(t)$ in the case that $Z_n^{(f)}(t) \le \sum_{l \in \mathcal{O}(n)} \mu_l^{(f)}(t)$, provided that (4.8) holds.

This chapter develops new algorithms that converge to the optimal utility defined by the problem (4.1)-(4.6), and that produce worst-case bounded queues on the actual queueing network, that is, with actual queues that evolve as given in (4.8). To begin, it is convenient to introduce the following virtual queue equation

\[
Q_n^{(f)}(t+1) = Q_n^{(f)}(t) - \sum_{l \in \mathcal{O}(n)} \mu_l^{(f)}(t) + x_f(t) \mathbf{1}_{\{n = \mathrm{Src}(f)\}} + \sum_{l \in \mathcal{I}(n)} \mu_l^{(f)}(t), \tag{4.9}
\]

where $Q_n^{(f)}(t)$ represents a virtual queue value associated with session $f$ at node $n$.
At first glance, this model (4.9) appears to be only an approximation, perhaps even a worse approximation than (4.7), because it allows the $Q_n^{(f)}(t)$ values to be negative. Indeed, we use $Q_n^{(f)}(t)$ only as virtual queues to inform the algorithm and do not treat them as actual queues. However, this chapter shows that using these virtual queues to choose the $\mu(t)$ decisions ensures not only that the desired constraints (4.2) are satisfied, but also that the resulting $\mu(t)$ decisions create bounded queues $Z_n^{(f)}(t)$ in the actual network, where the actual queues evolve according to (4.8). In short, our algorithms can be faithfully implemented with respect to actual queueing networks, and converge to exact optimality on those networks.

The next lemma shows that if an algorithm can guarantee that the virtual queues $Q_n^{(f)}(t)$ defined in (4.9) are bounded, then actual physical queues satisfying (4.8) are also bounded.

Lemma 4.1. Consider a network flow problem described by the problem (4.1)-(4.6). For all $l \in \mathcal{L}$ and $f \in \mathcal{F}$, let $\mu_l^{(f)}(t), x_f(t)$ be decisions yielded by a dynamic algorithm. Suppose $Y_n^{(f)}(t)$, $Z_n^{(f)}(t)$, $Q_n^{(f)}(t)$ evolve by (4.7)-(4.9) with initial conditions $Y_n^{(f)}(0) = Z_n^{(f)}(0) = Q_n^{(f)}(0) = 0$. If there exists a constant $B > 0$ such that $|Q_n^{(f)}(t)| \le B, \forall t$, then

1. $Z_n^{(f)}(t) \le 2B + \sum_{l \in \mathcal{O}(n)} C_l$ for all $t \in \{0, 1, 2, \ldots\}$.

2. $Y_n^{(f)}(t) \le 2B + \sum_{l \in \mathcal{O}(n)} C_l$ for all $t \in \{0, 1, 2, \ldots\}$.

Proof.

1. Fix $f \in \mathcal{F}, n \in \mathcal{N} \setminus \{\mathrm{Dst}(f)\}$. Define an auxiliary virtual queue $\hat{Q}_n^{(f)}(t)$ that is initialized by $\hat{Q}_n^{(f)}(0) = B + \sum_{l \in \mathcal{O}(n)} C_l$ and evolves according to (4.9). It follows that $\hat{Q}_n^{(f)}(t) = Q_n^{(f)}(t) + B + \sum_{l \in \mathcal{O}(n)} C_l, \forall t$. Since by assumption $Q_n^{(f)}(t) \ge -B, \forall t$, we have $\hat{Q}_n^{(f)}(t) \ge \sum_{l \in \mathcal{O}(n)} C_l \ge \sum_{l \in \mathcal{O}(n)} \mu_l^{(f)}(t), \forall t$. This implies that $\hat{Q}_n^{(f)}(t)$ also satisfies

\[
\hat{Q}_n^{(f)}(t+1) = \max\Big\{ \hat{Q}_n^{(f)}(t) - \sum_{l \in \mathcal{O}(n)} \mu_l^{(f)}(t),\ 0 \Big\} + x_f(t) \mathbf{1}_{\{n = \mathrm{Src}(f)\}} + \sum_{l \in \mathcal{I}(n)} \mu_l^{(f)}(t), \quad \forall t, \tag{4.10}
\]

which is identical to (4.8) except the inequality is replaced by an equality.
Since $Z_n^{(f)}(0) = 0 < \hat{Q}_n^{(f)}(0)$ and $\hat{Q}_n^{(f)}(t)$ satisfies (4.10), it follows by induction that $Z_n^{(f)}(t) \le \hat{Q}_n^{(f)}(t), \forall t$. Since $\hat{Q}_n^{(f)}(t) = Q_n^{(f)}(t) + B + \sum_{l \in \mathcal{O}(n)} C_l, \forall t$, and $Q_n^{(f)}(t) \le B, \forall t$, we have $\hat{Q}_n^{(f)}(t) \le 2B + \sum_{l \in \mathcal{O}(n)} C_l, \forall t$. It follows that $Z_n^{(f)}(t) \le 2B + \sum_{l \in \mathcal{O}(n)} C_l, \forall t$.

2. Fix $f \in \mathcal{F}, n \in \mathcal{N} \setminus \{\mathrm{Dst}(f)\}$. By (4.10),

\begin{align*}
\hat{Q}_n^{(f)}(t+1) &= \max\Big\{ \hat{Q}_n^{(f)}(t) - \sum_{l \in \mathcal{O}(n)} \mu_l^{(f)}(t),\ 0 \Big\} + x_f(t) \mathbf{1}_{\{n = \mathrm{Src}(f)\}} + \sum_{l \in \mathcal{I}(n)} \mu_l^{(f)}(t) \\
&= \max\Big\{ \hat{Q}_n^{(f)}(t) + x_f(t) \mathbf{1}_{\{n = \mathrm{Src}(f)\}} + \sum_{l \in \mathcal{I}(n)} \mu_l^{(f)}(t) - \sum_{l \in \mathcal{O}(n)} \mu_l^{(f)}(t),\ x_f(t) \mathbf{1}_{\{n = \mathrm{Src}(f)\}} + \sum_{l \in \mathcal{I}(n)} \mu_l^{(f)}(t) \Big\} \\
&\overset{(a)}{\ge} \max\Big\{ \hat{Q}_n^{(f)}(t) + x_f(t) \mathbf{1}_{\{n = \mathrm{Src}(f)\}} + \sum_{l \in \mathcal{I}(n)} \mu_l^{(f)}(t) - \sum_{l \in \mathcal{O}(n)} \mu_l^{(f)}(t),\ 0 \Big\},
\end{align*}

where (a) follows from the fact that $\mu_l^{(f)}(t), x_f(t), \forall f, l, t$ are non-negative. Note that the right side of the above inequality is identical to the right side of (4.7) except that $Y_n^{(f)}(t)$ is replaced by $\hat{Q}_n^{(f)}(t)$. Since $Y_n^{(f)}(0) = 0 < \hat{Q}_n^{(f)}(0)$, by induction, we have $Y_n^{(f)}(t) \le \hat{Q}_n^{(f)}(t), \forall t$. Since $\hat{Q}_n^{(f)}(t) = Q_n^{(f)}(t) + B + \sum_{l \in \mathcal{O}(n)} C_l, \forall t$ and $Q_n^{(f)}(t) \le B, \forall t$, we have $\hat{Q}_n^{(f)}(t) \le 2B + \sum_{l \in \mathcal{O}(n)} C_l, \forall t$. It follows that $Y_n^{(f)}(t) \le 2B + \sum_{l \in \mathcal{O}(n)} C_l, \forall t$.

4.2.2 New Backpressure Algorithms

In this subsection, we propose two new backpressure algorithms that yield source session rates $x_f(t)$ and link session rates $\mu_l^{(f)}(t)$ at each slot such that the physical queues for each session at each node are bounded by a constant and the time average utility satisfies

\[
\frac{1}{t} \sum_{\tau=0}^{t-1} \sum_{f \in \mathcal{F}} U_f(x_f(\tau)) \ge \sum_{f \in \mathcal{F}} U_f(x_f^*) - O\Big(\frac{1}{t}\Big), \quad \forall t,
\]

where $x_f^*$ are from the optimal solution to (4.1)-(4.6). Note that Jensen's inequality further implies that

\[
\sum_{f \in \mathcal{F}} U_f\Big( \frac{1}{t} \sum_{\tau=0}^{t-1} x_f(\tau) \Big) \ge \sum_{f \in \mathcal{F}} U_f(x_f^*) - O\Big(\frac{1}{t}\Big), \quad \forall t.
\]

The two backpressure algorithms are described in Algorithm 4.1 and Algorithm 4.2, respectively. Similar to existing backpressure algorithms, the updates in both algorithms at each node $n$ are fully distributed and only depend on the weights at the node itself and at its neighbor nodes.
Unlike existing backpressure algorithms, the weights used to update the decision variables $x_f(t)$ and $\mu_l^{(f)}(t)$ are not the virtual queues $Q_n^{(f)}(t)$ themselves; rather, they are augmented values $W_n^{(f)}(t)$ equal to the sum of the virtual queues and the amount of net injected data in the previous slot $t-1$. In addition, the updates involve an additional quadratic term, which is similar to a term used in proximal algorithms [PB13].

The only difference between Algorithm 4.1 and Algorithm 4.2 is that a single global parameter $\alpha$ is used in Algorithm 4.1 while each node $n$ in Algorithm 4.2 has its own local parameter $\alpha_n$. In fact, Algorithm 4.1 is derived from the direct application of Algorithm 3.1, developed for general constrained convex programs in Chapter 3, to the problem (4.1)-(4.6) by treating the constraints (4.3)-(4.6) as a convex set constraint and by replacing the linear inequality constraints (4.2) with linear equality constraints. Note that by Fact 4.1, to solve the problem (4.1)-(4.6), we can replace the linear inequality constraints (4.2) with linear equality constraints without loss of optimality. In Section 3.3.5, it is mentioned that the equation (3.12) can be used as the virtual queue update equation for linear equality constraints in a convex program. Note that the equation (3.12) in the context of the problem (4.1)-(4.6) is identical to (4.9). Since (4.1) and (4.2) are separable, the primal update in Algorithm 3.1 can be decomposed into independent subproblems. Thus, it is easy to observe that Algorithm 4.1 is simply a distributed implementation of Algorithm 3.1 to solve the problem (4.1)-(4.6).

The global parameter $\alpha$ in Algorithm 4.1 corresponds to the parameter $\alpha$ in Algorithm 3.1. The results developed in Chapter 3 require $\alpha > \frac{1}{2}\beta^2$, where $\beta$ is the Lipschitz modulus of the vectorized constraints (4.2).
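Define $\mathbf{x} = [x_f]_{f \in \mathcal{F}}$ as the stacked column vector of all source session rates and $\boldsymbol{\mu} = [\mu_l^{(f)}]_{f \in \mathcal{F}, l \in \mathcal{L}}$ as the stacked column vector of all link session rates. Note that $\mathbf{x}$ has length $|\mathcal{F}|$ and $\boldsymbol{\mu}$ has length $|\mathcal{L}||\mathcal{F}|$. Thus, the constraints (4.2) can be vectorized as

\[
\mathbf{g}(\mathbf{x}, \boldsymbol{\mu}) = \mathbf{A}\mathbf{x} + \mathbf{R}\boldsymbol{\mu} \le \mathbf{0}, \tag{4.11}
\]

where $\mathbf{A} = [\mathbf{A}_1; \ldots; \mathbf{A}_{|\mathcal{F}|}]$ (blocks stacked vertically) is a $|\mathcal{F}|(|\mathcal{N}|-1) \times |\mathcal{F}|$ source-node incidence matrix such that each sub-matrix $\mathbf{A}_f$ is a $\{0,1\}$ matrix of size $(|\mathcal{N}|-1) \times |\mathcal{F}|$ whose $(n,f)$-th entry is equal to 1 if and only if node $n$ is the source node of session $f$; and $\mathbf{R} = \mathrm{Diag}\{\mathbf{R}_1, \ldots, \mathbf{R}_{|\mathcal{F}|}\}$ is a block diagonal matrix with $\mathbf{R}_1, \ldots, \mathbf{R}_{|\mathcal{F}|}$ on its diagonal such that each sub-matrix $\mathbf{R}_f$ is a $\{\pm 1, 0\}$ node-arc incidence matrix of size $(|\mathcal{N}|-1) \times |\mathcal{L}|$ whose $(n,l)$-th entry is equal to 1 if and only if link $l$ flows into node $n$ and is equal to $-1$ if and only if link $l$ flows out of node $n$.

The Lipschitz modulus of the vectorized version of the constraints (4.2) is summarized in the next lemma.

Lemma 4.2. The vector function $\mathbf{g}(\mathbf{x}, \boldsymbol{\mu}) = \mathbf{A}\mathbf{x} + \mathbf{R}\boldsymbol{\mu}$ is Lipschitz continuous with modulus

\[
\beta = \sqrt{|\mathcal{F}|} + \sqrt{2|\mathcal{L}|}. \tag{4.12}
\]

Proof. Define the column vector $\mathbf{y} = [\mathbf{x}; \boldsymbol{\mu}]$ and $\mathbf{B} = [\mathbf{A}, \mathbf{R}]$. The constraints (4.2) can be further rewritten as $\mathbf{g}(\mathbf{y}) = \mathbf{B}\mathbf{y} \le \mathbf{0}$. Note that the linear function $\mathbf{g}(\mathbf{y}) = \mathbf{B}\mathbf{y}$ is Lipschitz continuous with modulus $\|\mathbf{B}\|_2$, where $\|\mathbf{B}\|_2$ is the induced matrix $l_2$ norm defined as $\|\mathbf{B}\|_2 = \sup_{\mathbf{x} \ne \mathbf{0}} \{ \|\mathbf{B}\mathbf{x}\| / \|\mathbf{x}\| \}$. Applying the matrix norm inequalities (for block matrices) $\|[\mathbf{H}_1, \mathbf{H}_2]\|_2 \le \|\mathbf{H}_1\|_2 + \|\mathbf{H}_2\|_2$ and $\|\mathrm{Diag}\{\mathbf{H}_1, \ldots, \mathbf{H}_K\}\|_2 \le \max_{1 \le k \le K} \{\|\mathbf{H}_k\|_2\}$ yields $\|\mathbf{B}\|_2 \le \|\mathbf{A}\|_2 + \|\mathbf{R}\|_2 \le \|\mathbf{A}\|_2 + \max_{f \in \mathcal{F}} \{\|\mathbf{R}_f\|_2\}$. Note that exactly $|\mathcal{F}|$ entries in the matrix $\mathbf{A}$ are 1 and all the other entries are 0; and each matrix $\mathbf{R}_f$ has at most $2|\mathcal{L}|$ non-zero entries, whose absolute values are 1. By the fact that $\|\mathbf{H}\|_2 \le \sqrt{\sum_{i=1}^{m} \sum_{j=1}^{n} |H_{ij}|^2}$ for any matrix $\mathbf{H} \in \mathbb{R}^{m \times n}$ (the induced $l_2$ norm is bounded by the Frobenius norm), and since every non-zero entry here has absolute value 1, we know $\|\mathbf{A}\|_2 \le \sqrt{|\mathcal{F}|}$ and $\|\mathbf{R}_f\|_2 \le \sqrt{2|\mathcal{L}|}, \forall f \in \mathcal{F}$. It follows that $\|\mathbf{B}\|_2 \le \sqrt{|\mathcal{F}|} + \sqrt{2|\mathcal{L}|}$.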
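As a rough numeric illustration of Lemma 4.2, the following Python sketch evaluates the bound (4.12) and the resulting threshold $\frac{1}{2}\beta^2$ on the global parameter $\alpha$. The function names and the example sizes are hypothetical and chosen only for illustration; the code merely evaluates the formulas stated above.

```python
import math


def global_lipschitz_bound(num_sessions: int, num_links: int) -> float:
    """Bound (4.12): beta = sqrt(|F|) + sqrt(2|L|)."""
    return math.sqrt(num_sessions) + math.sqrt(2 * num_links)


def global_alpha_threshold(num_sessions: int, num_links: int) -> float:
    """Threshold beta^2 / 2; the analysis asks for alpha strictly above it."""
    beta = global_lipschitz_bound(num_sessions, num_links)
    return 0.5 * beta ** 2


if __name__ == "__main__":
    # Hypothetical network with 4 sessions and 8 directed links.
    print(global_lipschitz_bound(4, 8))
    print(global_alpha_threshold(4, 8))
```

Since $\frac{1}{2}\beta^2 = \frac{1}{2}|\mathcal{F}| + |\mathcal{L}| + \sqrt{2|\mathcal{F}||\mathcal{L}|}$, the threshold grows linearly with the number of sessions and links, regardless of how sparsely any individual node is connected.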
To determine a large enough value of the parameter $\alpha$ in Algorithm 4.1 to guarantee its convergence, we need to know the number of sessions and the number of links in the network. In many applications, these two values may not be globally known at each node. In addition, the value given by $\alpha > \frac{1}{2}\beta^2 = \frac{1}{2}(\sqrt{|\mathcal{F}|} + \sqrt{2|\mathcal{L}|})^2$ can be unnecessarily large for certain network topologies. (That is, the Lipschitz modulus in Lemma 4.2 can be loose in many cases since it is derived without taking the network topology into consideration.) Recall that by the results in Chapter 3, an unnecessarily large value of $\alpha$ in Algorithm 3.1 incurs slow convergence.

To resolve the above issues of Algorithm 4.1, we further develop Algorithm 4.2 by adapting the general Lagrangian methods developed in Chapter 3 to the multi-commodity network flow problem (4.1)-(4.6). The complete analysis of Algorithm 4.1 is presented in our paper [YN17b]. In the remainder of this chapter, we analyze the performance of Algorithm 4.2 and show that each node $n$ in Algorithm 4.2 determines its own parameter $\alpha_n$ based on local link connections, and that the value of each $\alpha_n$ is significantly smaller than the global parameter $\alpha$ required in Algorithm 4.1.

Algorithm 4.1 New Backpressure Algorithm with a Global Parameter

Let $\alpha > 0$ be a constant parameter. Initialize $x_f(-1) = 0$, $\mu_l^{(f)}(-1) = 0, \forall f \in \mathcal{F}, \forall l \in \mathcal{L}$ and $Q_n^{(f)}(0) = 0, \forall n \in \mathcal{N}, \forall f \in \mathcal{F}$. At each time $t \in \{0, 1, 2, \ldots\}$, each node $n$ does the following:

• For each $f \in \mathcal{F}$, if node $n$ is not the destination node of session $f$, i.e., $n \ne \mathrm{Dst}(f)$, then define weight $W_n^{(f)}(t)$:

\[
W_n^{(f)}(t) = Q_n^{(f)}(t) + x_f(t-1) \mathbf{1}_{\{n = \mathrm{Src}(f)\}} + \sum_{l \in \mathcal{I}(n)} \mu_l^{(f)}(t-1) - \sum_{l \in \mathcal{O}(n)} \mu_l^{(f)}(t-1).
\]

If node $n$ is the destination node, i.e., $n = \mathrm{Dst}(f)$, then define $W_n^{(f)}(t) = 0$. Notify neighbor nodes (nodes $k$ that can send session $f$ to node $n$, i.e., all $k$ such that $f \in \mathcal{S}_{(k,n)}$) about this new $W_n^{(f)}(t)$ value.
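• For each $f \in \mathcal{F}$, if node $n$ is the source node of session $f$, i.e., $n = \mathrm{Src}(f)$, choose $x_f(t)$ as the solution to

\begin{align*}
\max_{x_f} \quad & U_f(x_f) - W_n^{(f)}(t) x_f - \alpha \big( x_f - x_f(t-1) \big)^2 \\
\text{s.t.} \quad & x_f \in \mathrm{dom}(U_f)
\end{align*}

• For all $(n,m) \in \mathcal{O}(n)$, choose $\{\mu_{(n,m)}^{(f)}(t), \forall f \in \mathcal{F}\}$ as the solution to the following convex program:

\begin{align*}
\max_{\mu_{(n,m)}^{(f)}} \quad & \sum_{f \in \mathcal{F}} \big( W_n^{(f)}(t) - W_m^{(f)}(t) \big) \mu_{(n,m)}^{(f)} - \alpha \sum_{f \in \mathcal{F}} \big( \mu_{(n,m)}^{(f)} - \mu_{(n,m)}^{(f)}(t-1) \big)^2 \\
\text{s.t.} \quad & \sum_{f \in \mathcal{F}} \mu_{(n,m)}^{(f)} \le C_{(n,m)} \\
& \mu_{(n,m)}^{(f)} \ge 0, \quad \forall f \in \mathcal{S}_{(n,m)} \\
& \mu_{(n,m)}^{(f)} = 0, \quad \forall f \notin \mathcal{S}_{(n,m)}
\end{align*}

• For each $f \in \mathcal{F}$, if node $n$ is not the destination of $f$, i.e., $n \ne \mathrm{Dst}(f)$, update virtual queue $Q_n^{(f)}(t+1)$ by (4.9).

Algorithm 4.2 New Backpressure Algorithm with Local Parameters

Let $\alpha_n > 0, \forall n \in \mathcal{N}$ be constant parameters. Initialize $x_f(-1) = 0$, $\mu_l^{(f)}(-1) = 0, \forall f \in \mathcal{F}, \forall l \in \mathcal{L}$ and $Q_n^{(f)}(0) = 0, \forall n \in \mathcal{N}, \forall f \in \mathcal{F}$. At each time $t \in \{0, 1, 2, \ldots\}$, each node $n$ does the following:

• For each $f \in \mathcal{F}$, if node $n$ is not the destination node of session $f$, i.e., $n \ne \mathrm{Dst}(f)$, then define weight $W_n^{(f)}(t)$:

\[
W_n^{(f)}(t) = Q_n^{(f)}(t) + x_f(t-1) \mathbf{1}_{\{n = \mathrm{Src}(f)\}} + \sum_{l \in \mathcal{I}(n)} \mu_l^{(f)}(t-1) - \sum_{l \in \mathcal{O}(n)} \mu_l^{(f)}(t-1). \tag{4.13}
\]

If node $n$ is the destination node, i.e., $n = \mathrm{Dst}(f)$, then define $W_n^{(f)}(t) = 0$. Notify neighbor nodes (nodes $k$ that can send session $f$ to node $n$, i.e., all $k$ such that $f \in \mathcal{S}_{(k,n)}$) about this new $W_n^{(f)}(t)$ value.

• For each $f \in \mathcal{F}$, if node $n$ is the source node of session $f$, i.e., $n = \mathrm{Src}(f)$, choose $x_f(t)$ as the solution to

\begin{align*}
\max_{x_f} \quad & U_f(x_f) - W_n^{(f)}(t) x_f - \alpha_n \big( x_f - x_f(t-1) \big)^2 \tag{4.14} \\
\text{s.t.} \quad & x_f \in \mathrm{dom}(U_f) \tag{4.15}
\end{align*}

• For all $(n,m) \in \mathcal{O}(n)$, choose $\{\mu_{(n,m)}^{(f)}(t), \forall f \in \mathcal{F}\}$ as the solution to the following convex program:

\begin{align*}
\max_{\mu_{(n,m)}^{(f)}} \quad & \sum_{f \in \mathcal{F}} \big( W_n^{(f)}(t) - W_m^{(f)}(t) \big) \mu_{(n,m)}^{(f)} - (\alpha_n + \alpha_m) \sum_{f \in \mathcal{F}} \big( \mu_{(n,m)}^{(f)} - \mu_{(n,m)}^{(f)}(t-1) \big)^2 \tag{4.16} \\
\text{s.t.} \quad & \sum_{f \in \mathcal{F}} \mu_{(n,m)}^{(f)} \le C_{(n,m)} \tag{4.17} \\
& \mu_{(n,m)}^{(f)} \ge 0, \quad \forall f \in \mathcal{S}_{(n,m)} \tag{4.18} \\
& \mu_{(n,m)}^{(f)} = 0, \quad \forall f \notin \mathcal{S}_{(n,m)} \tag{4.19}
\end{align*}

• For each $f \in \mathcal{F}$, if node $n$ is not the destination of $f$, i.e., $n \ne \mathrm{Dst}(f)$, update virtual queue $Q_n^{(f)}(t+1)$ by (4.9).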
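The bookkeeping shared by Algorithms 4.1 and 4.2, namely the weight computation (4.13) and the virtual queue update (4.9), can be sketched in a few lines of Python. This is only an illustrative sketch for a single (node, session) pair: the function names and the convention of passing link rates as plain lists are assumptions made here, not part of the algorithms as stated. The argument `x_src` stands for $x_f \mathbf{1}_{\{n = \mathrm{Src}(f)\}}$, so it should be 0 when the node is not the source of the session.

```python
from typing import Sequence


def net_injection(x_src: float, mu_in: Sequence[float], mu_out: Sequence[float]) -> float:
    """Net data injected at a node for one session:
    source arrivals plus inbound link rates minus outbound link rates."""
    return x_src + sum(mu_in) - sum(mu_out)


def weight(Q: float, x_src_prev: float, mu_in_prev: Sequence[float],
           mu_out_prev: Sequence[float]) -> float:
    """Weight (4.13): virtual queue plus the net injection of slot t-1.
    (At the destination node the weight is defined as 0 instead.)"""
    return Q + net_injection(x_src_prev, mu_in_prev, mu_out_prev)


def update_virtual_queue(Q: float, x_src: float, mu_in: Sequence[float],
                         mu_out: Sequence[float]) -> float:
    """Virtual queue update (4.9); the result is allowed to be negative."""
    return Q + net_injection(x_src, mu_in, mu_out)


if __name__ == "__main__":
    # Hypothetical values for a source node with one inbound and one outbound link.
    print(weight(1.0, 0.5, [0.2], [0.4]))
    print(update_virtual_queue(1.0, 0.0, [0.1], [0.6]))
```

Only quantities available at the node and on its incident links appear in these functions, which is what makes the weight and queue updates locally computable.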
4.2.3 Almost Closed-Form Updates in Algorithm 4.2

This subsection shows that the decisions $x_f(t)$ and $\mu_l^{(f)}(t)$ in Algorithm 4.2 have either closed-form solutions or "almost" closed-form solutions at each iteration $t$.

Lemma 4.3. Let $\hat{x}_f \equiv x_f(t)$ denote the solution to (4.14)-(4.15).

1. Suppose $\mathrm{dom}(U_f) = [0, \infty)$ and $U_f(x_f)$ is differentiable. Let $h(x_f) = U_f'(x_f) - 2\alpha_n x_f + 2\alpha_n x_f(t-1) - W_n^{(f)}(t)$. If $h(0) < 0$, then $\hat{x}_f = 0$; otherwise $\hat{x}_f$ is the root of the equation $h(x_f) = 0$ and can be found by a bisection search.

2. Suppose $\mathrm{dom}(U_f) = (0, \infty)$ and $U_f(x_f) = w_f \log(x_f)$ for some weight $w_f > 0$. Then:

\[
\hat{x}_f = \frac{2\alpha_n x_f(t-1) - W_n^{(f)}(t)}{4\alpha_n} + \frac{\sqrt{\big( W_n^{(f)}(t) - 2\alpha_n x_f(t-1) \big)^2 + 8\alpha_n w_f}}{4\alpha_n}
\]

Proof. Both parts follow from the first-order optimality condition of the concave problem (4.14)-(4.15), whose objective has derivative $h(x_f)$. In part 2, the condition $h(x_f) = 0$ reduces to the quadratic equation $2\alpha_n x_f^2 + \big( W_n^{(f)}(t) - 2\alpha_n x_f(t-1) \big) x_f - w_f = 0$, whose positive root is the stated expression.

The problem (4.16)-(4.19) can be rewritten as follows by eliminating $\mu_{(n,m)}^{(f)}, f \notin \mathcal{S}_{(n,m)}$, completing the squares and replacing maximization with minimization. (Note that $K = |\mathcal{S}_{(n,m)}| \le |\mathcal{F}|$.)

\begin{align*}
\min \quad & \frac{1}{2} \sum_{k=1}^{K} (z_k - a_k)^2 \tag{4.20} \\
\text{s.t.} \quad & \sum_{k=1}^{K} z_k \le b \tag{4.21} \\
& z_k \ge 0, \quad \forall k \in \{1, 2, \ldots, K\} \tag{4.22}
\end{align*}

Lemma 4.4. The solution to the problem (4.20)-(4.22) is given by $z_k^* = \max\{0, a_k - \theta^*\}, \forall k \in \{1, 2, \ldots, K\}$, where $\theta^* \ge 0$ can be found either by a bisection search (see Section 4.6.3) or by Algorithm 4.3 with complexity $O(K \log K)$.

Proof. A similar problem where (4.21) is replaced with an equality constraint is considered in [DSSSC08]. The optimal solution to this quadratic program is characterized by its KKT conditions, and a corresponding algorithm can be developed to obtain its KKT point. A complete proof is presented in Section 4.6.3.

Algorithm 4.3 Algorithm to Solve Problem (4.20)-(4.22)

1. Check if $\sum_{k=1}^{K} \max\{0, a_k\} \le b$ holds. If yes, let $\theta^* = 0$ and $z_k^* = \max\{0, a_k\}, \forall k \in \{1, 2, \ldots, K\}$ and terminate the algorithm; else, continue to the next step.

2. Sort all $a_k, k \in \{1, 2, \ldots, K\}$ in a decreasing order $\pi$ such that $a_{\pi(1)} \ge a_{\pi(2)} \ge \cdots \ge a_{\pi(K)}$. Define $S_0 = 0$.

3. For $k = 1$ to $K$

• Let $S_k = S_{k-1} + a_{\pi(k)}$. Let $\theta^* = \frac{S_k - b}{k}$.
• If $\theta^* \ge 0$, $a_{\pi(k)} - \theta^* > 0$ and $a_{\pi(k+1)} - \theta^* \le 0$, then terminate the loop; else, continue to the next iteration in the loop. (For $k = K$, where $a_{\pi(K+1)}$ is undefined, the last condition is taken to hold.)

4. Let $z_k^* = \max\{0, a_k - \theta^*\}, \forall k \in \{1, 2, \ldots, K\}$ and terminate the algorithm.

Note that step (3) in Algorithm 4.3 has complexity $O(K)$ and hence the overall complexity of Algorithm 4.3 is dominated by the sorting step (2) with complexity $O(K \log(K))$.

4.3 Performance Analysis of Algorithm 4.2

In this section, we show that the new backpressure algorithm has vanishing utility optimality gaps that decay like $O(1/t)$, where $t$ is the number of iterations, and finite queue lengths.

4.3.1 Preliminaries

Let $\mathbf{y} = [x_f; \mu_l^{(f)}]_{f \in \mathcal{F}, l \in \mathcal{L}}$ denote a column vector. For each $f \in \mathcal{F}, n \in \mathcal{N} \setminus \{\mathrm{Dst}(f)\}$, define

\[
\mathbf{y}_n^{(f)} =
\begin{cases}
[x_f; \mu_l^{(f)}]_{l \in \mathcal{I}(n) \cup \mathcal{O}(n)} & \text{if } n = \mathrm{Src}(f), \\
[\mu_l^{(f)}]_{l \in \mathcal{I}(n) \cup \mathcal{O}(n)} & \text{else},
\end{cases} \tag{4.23}
\]

which is a column vector composed of the control actions appearing in each constraint (4.2); and introduce a function with respect to $\mathbf{y}_n^{(f)}$ as

\[
g_n^{(f)}(\mathbf{y}_n^{(f)}) = x_f \mathbf{1}_{\{n = \mathrm{Src}(f)\}} + \sum_{l \in \mathcal{I}(n)} \mu_l^{(f)} - \sum_{l \in \mathcal{O}(n)} \mu_l^{(f)}. \tag{4.24}
\]

Thus, the constraints (4.2) can be rewritten as

\[
g_n^{(f)}(\mathbf{y}_n^{(f)}) \le 0, \quad \forall f \in \mathcal{F}, \forall n \in \mathcal{N} \setminus \{\mathrm{Dst}(f)\}.
\]

Note that each vector $\mathbf{y}_n^{(f)}$ is a subvector of $\mathbf{y}$ and has length $d_n + 1$, where $d_n$ is the degree of node $n$ (the total number of outgoing links and incoming links), if node $n$ is the source of session $f$; and has length $d_n$ if node $n$ is not the source of session $f$. Note that components in different vector variables $\mathbf{y}_n^{(f)}$ can overlap. The vector variables $\mathbf{y}$ and $\mathbf{y}_n^{(f)}$ are introduced only to simplify notation.

Fact 4.2. Each function $g_n^{(f)}(\cdot)$ defined in (4.24) is Lipschitz continuous with respect to vector $\mathbf{y}_n^{(f)}$ with modulus

\[
\beta_n \le \sqrt{d_n + 1},
\]

where $d_n$ is the degree of node $n$.

Proof. This fact can be easily shown by noting that each $g_n^{(f)}(\mathbf{y}_n^{(f)})$ is a linear function with respect to vector $\mathbf{y}_n^{(f)}$ and has at most $d_n + 1$ non-zero coefficients that are equal to $\pm 1$.
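Returning briefly to the per-slot updates of Section 4.2.3, the following Python sketch shows one way the closed form of Lemma 4.3 (part 2) and Algorithm 4.3 could be coded. It is an illustrative sketch, not a reference implementation: all function names are hypothetical, and the mapping from the link subproblem (4.16)-(4.19) to the data $(a_k, b)$ of (4.20)-(4.22), namely $a_k = \mu_k(t-1) + \frac{W_n - W_m}{2(\alpha_n + \alpha_m)}$ and $b = C_{(n,m)}$, is obtained here by completing the square and is stated as an assumption of this sketch. The lists passed to `link_rates` are assumed to contain only the sessions allowed on the link.

```python
import math
from typing import List, Sequence


def log_utility_rate(w_f: float, W: float, x_prev: float, alpha_n: float) -> float:
    """Closed-form source rate of Lemma 4.3 (part 2) for U_f(x) = w_f * log(x)."""
    c = 2.0 * alpha_n * x_prev - W
    return (c + math.sqrt(c * c + 8.0 * alpha_n * w_f)) / (4.0 * alpha_n)


def solve_capped_projection(a: Sequence[float], b: float) -> List[float]:
    """Algorithm 4.3: minimize 0.5 * sum (z_k - a_k)^2 s.t. sum z_k <= b, z >= 0."""
    clipped = [max(0.0, ak) for ak in a]
    if sum(clipped) <= b:                      # step 1: capacity constraint inactive
        return clipped
    ordered = sorted(a, reverse=True)          # step 2: decreasing order
    running, theta = 0.0, 0.0
    for k, ak in enumerate(ordered, start=1):  # step 3: scan for the threshold
        running += ak
        candidate = (running - b) / k
        nxt = ordered[k] if k < len(ordered) else -math.inf
        if candidate >= 0 and ak - candidate > 0 and nxt - candidate <= 0:
            theta = candidate
            break
    return [max(0.0, ak - theta) for ak in a]  # step 4


def link_rates(W_n: Sequence[float], W_m: Sequence[float], mu_prev: Sequence[float],
               alpha_n: float, alpha_m: float, capacity: float) -> List[float]:
    """Link subproblem (4.16)-(4.19) for the allowed sessions, via completing the square."""
    scale = 2.0 * (alpha_n + alpha_m)
    a = [p + (wn - wm) / scale for wn, wm, p in zip(W_n, W_m, mu_prev)]
    return solve_capped_projection(a, capacity)
```

For a link carrying a single session, the projection reduces to clipping $a_1$ to the interval $[0, C_{(n,m)}]$, which gives a quick sanity check of the general routine.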
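Note that the virtual queue update equation (4.9) can be rewritten as

\[
Q_n^{(f)}(t+1) = Q_n^{(f)}(t) + g_n^{(f)}(\mathbf{y}_n^{(f)}(t)), \tag{4.25}
\]

and the weight update equation (4.13) can be rewritten as

\[
W_n^{(f)}(t) = Q_n^{(f)}(t) + g_n^{(f)}(\mathbf{y}_n^{(f)}(t-1)). \tag{4.26}
\]

Define

\[
L(t) = \frac{1}{2} \sum_{f \in \mathcal{F}} \sum_{n \in \mathcal{N} \setminus \{\mathrm{Dst}(f)\}} [Q_n^{(f)}(t)]^2 \tag{4.27}
\]

and call it a Lyapunov function. In the remainder of this chapter, double summations are often compactly written as a single summation, e.g.,

\[
\sum_{f \in \mathcal{F}} \sum_{n \in \mathcal{N} \setminus \{\mathrm{Dst}(f)\}} (\cdot) \overset{\Delta}{=} \sum_{f \in \mathcal{F},\, n \in \mathcal{N} \setminus \{\mathrm{Dst}(f)\}} (\cdot).
\]

Define the Lyapunov drift as $\Delta(t) = L(t+1) - L(t)$. The following lemma follows directly from equation (4.25).

Lemma 4.5. At each iteration $t \in \{0, 1, \ldots\}$ in Algorithm 4.2, the Lyapunov drift is given by

\[
\Delta(t) = \sum_{f \in \mathcal{F},\, n \in \mathcal{N} \setminus \{\mathrm{Dst}(f)\}} \Big( Q_n^{(f)}(t) g_n^{(f)}(\mathbf{y}_n^{(f)}(t)) + \frac{1}{2} [g_n^{(f)}(\mathbf{y}_n^{(f)}(t))]^2 \Big). \tag{4.28}
\]

Proof. Fix $f \in \mathcal{F}$ and $n \in \mathcal{N} \setminus \{\mathrm{Dst}(f)\}$. We have

\begin{align*}
\frac{1}{2} [Q_n^{(f)}(t+1)]^2 - \frac{1}{2} [Q_n^{(f)}(t)]^2 &\overset{(a)}{=} \frac{1}{2} [Q_n^{(f)}(t) + g_n^{(f)}(\mathbf{y}_n^{(f)}(t))]^2 - \frac{1}{2} [Q_n^{(f)}(t)]^2 \\
&= Q_n^{(f)}(t) g_n^{(f)}(\mathbf{y}_n^{(f)}(t)) + \frac{1}{2} [g_n^{(f)}(\mathbf{y}_n^{(f)}(t))]^2 \tag{4.29}
\end{align*}

where (a) follows from (4.25). By the definition of $\Delta(t)$, we have

\begin{align*}
\Delta(t) &= \frac{1}{2} \sum_{f \in \mathcal{F},\, n \in \mathcal{N} \setminus \{\mathrm{Dst}(f)\}} \Big( [Q_n^{(f)}(t+1)]^2 - [Q_n^{(f)}(t)]^2 \Big) \\
&\overset{(a)}{=} \sum_{f \in \mathcal{F},\, n \in \mathcal{N} \setminus \{\mathrm{Dst}(f)\}} \Big( Q_n^{(f)}(t) g_n^{(f)}(\mathbf{y}_n^{(f)}(t)) + \frac{1}{2} [g_n^{(f)}(\mathbf{y}_n^{(f)}(t))]^2 \Big)
\end{align*}

where (a) follows from (4.29).

Define $f(\mathbf{y}) = \sum_{f \in \mathcal{F}} U_f(x_f)$. At each time $t$, consider choosing a decision vector $\mathbf{y}(t)$ that includes elements in each subvector $\mathbf{y}_n^{(f)}(t)$ to solve the following problem:

\begin{align*}
\max_{\mathbf{y}} \quad & f(\mathbf{y}) - \sum_{f \in \mathcal{F},\, n \in \mathcal{N} \setminus \{\mathrm{Dst}(f)\}} \Big( W_n^{(f)}(t) g_n^{(f)}(\mathbf{y}_n^{(f)}) + \alpha_n \|\mathbf{y}_n^{(f)} - \mathbf{y}_n^{(f)}(t-1)\|^2 \Big) - \sum_{f \in \mathcal{F},\, n = \mathrm{Dst}(f)} \alpha_n \sum_{l \in \mathcal{I}(n)} [\mu_l^{(f)} - \mu_l^{(f)}(t-1)]^2 \tag{4.30} \\
\text{s.t.} \quad & \text{(4.3)-(4.6)} \tag{4.31}
\end{align*}

The expression (4.30) is called a modified drift-plus-penalty expression. This results in the novel backpressure-type algorithm of Algorithm 4.2. Indeed, the decisions in Algorithm 4.2 were derived as the solution to the above problem (4.30)-(4.31). This is formalized in the next lemma.

Lemma 4.6.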
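At each iteration $t \in \{0, 1, \ldots\}$, the action $\mathbf{y}(t)$ jointly chosen in Algorithm 4.2 is the solution to the problem (4.30)-(4.31).

Proof. The proof involves collecting terms associated with the $x_f(t)$ and $\mu_l^{(f)}(t)$ decisions. See Section 4.6.4 for details.

Furthermore, the next lemma relates $f(\mathbf{y}^*)$ and $f(\mathbf{y}(t))$, where $\mathbf{y}(t)$ aggregates all control actions jointly chosen in Algorithm 4.2 at each iteration $t \in \{0, 1, \ldots\}$.

Lemma 4.7. Let $\mathbf{y}^* = [x_f^*; \mu_l^{(f),*}]_{f \in \mathcal{F}, l \in \mathcal{L}}$ be an optimal solution to problem (4.1)-(4.6) given in Fact 4.1, i.e., $g_n^{(f)}(\mathbf{y}_n^{(f),*}) = 0, \forall f \in \mathcal{F}, \forall n \in \mathcal{N} \setminus \{\mathrm{Dst}(f)\}$. If $\alpha_n \ge \frac{1}{2}(d_n + 1), \forall n \in \mathcal{N}$, where $d_n$ is the degree of node $n$, then the action $\mathbf{y}(t) = [x_f(t); \mu_l^{(f)}(t)]_{f \in \mathcal{F}, l \in \mathcal{L}}$ jointly chosen in Algorithm 4.2 at each iteration $t \in \{0, 1, \ldots\}$ satisfies

\[
f(\mathbf{y}(t)) \ge f(\mathbf{y}^*) + \Phi(t) - \Phi(t-1) + \Delta(t),
\]

where

\[
\Phi(t) = \sum_{f \in \mathcal{F},\, n \in \mathcal{N}} \Big( \alpha_n \mathbf{1}_{\{n \ne \mathrm{Dst}(f)\}} \|\mathbf{y}_n^{(f),*} - \mathbf{y}_n^{(f)}(t)\|^2 + \alpha_n \mathbf{1}_{\{n = \mathrm{Dst}(f)\}} \sum_{l \in \mathcal{I}(n)} [\mu_l^{(f),*} - \mu_l^{(f)}(t)]^2 \Big).
\]

Proof. See Section 4.6.5.

It remains to show that this modified backpressure algorithm leads to fundamentally improved performance.

4.3.2 Utility Optimality Gap Analysis

Define $\mathbf{Q}(t) = [Q_n^{(f)}(t)]_{f \in \mathcal{F}, n \in \mathcal{N} \setminus \{\mathrm{Dst}(f)\}}$ as the stacked column vector of all virtual queues $Q_n^{(f)}(t)$ defined in (4.9). Note that (4.27) can be rewritten as $L(t) = \frac{1}{2}\|\mathbf{Q}(t)\|^2$. Define the vectorized constraints (4.2) as $\mathbf{g}(\mathbf{y}) = [g_n^{(f)}(\mathbf{y}_n^{(f)})]_{f \in \mathcal{F}, n \in \mathcal{N} \setminus \{\mathrm{Dst}(f)\}}$.

Lemma 4.8. Let $\mathbf{y}^* = [x_f^*; \mu_l^{(f),*}]_{f \in \mathcal{F}, l \in \mathcal{L}}$ be an optimal solution to the problem (4.1)-(4.6) given in Fact 4.1, i.e., $g_n^{(f)}(\mathbf{y}_n^{(f),*}) = 0, \forall f \in \mathcal{F}, \forall n \in \mathcal{N} \setminus \{\mathrm{Dst}(f)\}$. If $\alpha_n \ge \frac{1}{2}(d_n + 1), \forall n \in \mathcal{N}$ in Algorithm 4.2, where $d_n$ is the degree of node $n$, then for all $t \ge 1$,

\[
\sum_{\tau=0}^{t-1} f(\mathbf{y}(\tau)) \ge t f(\mathbf{y}^*) - \zeta + \frac{1}{2} \|\mathbf{Q}(t)\|^2,
\]

where

\[
\zeta = \Phi(-1) = \sum_{f \in \mathcal{F},\, n \in \mathcal{N}} \Big( \alpha_n \mathbf{1}_{\{n \ne \mathrm{Dst}(f)\}} \|\mathbf{y}_n^{(f),*}\|^2 + \alpha_n \mathbf{1}_{\{n = \mathrm{Dst}(f)\}} \sum_{l \in \mathcal{I}(n)} (\mu_l^{(f),*})^2 \Big)
\]

is a constant.

Proof. By Lemma 4.7, we have $f(\mathbf{y}(\tau)) \ge f(\mathbf{y}^*) + \Phi(\tau) - \Phi(\tau-1) + \Delta(\tau), \forall \tau \in \{0, 1, \ldots, t-1\}$.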
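Summing over $\tau \in \{0, 1, \ldots, t-1\}$ yields

\begin{align*}
\sum_{\tau=0}^{t-1} f(\mathbf{y}(\tau)) &\ge t f(\mathbf{y}^*) + \sum_{\tau=0}^{t-1} [\Phi(\tau) - \Phi(\tau-1)] + \sum_{\tau=0}^{t-1} \Delta(\tau) \\
&= t f(\mathbf{y}^*) + \Phi(t-1) - \Phi(-1) + \sum_{\tau=0}^{t-1} \Delta(\tau) \\
&\overset{(a)}{\ge} t f(\mathbf{y}^*) - \Phi(-1) + \sum_{\tau=0}^{t-1} \Delta(\tau)
\end{align*}

where (a) follows from the fact that $\Phi(\cdot) \ge 0$. Recalling $\Delta(\tau) = L(\tau+1) - L(\tau)$, simplifying summations and rearranging terms yields

\begin{align*}
\sum_{\tau=0}^{t-1} f(\mathbf{y}(\tau)) &\ge t f(\mathbf{y}^*) - \Phi(-1) + L(t) - L(0) \\
&\overset{(a)}{=} t f(\mathbf{y}^*) - \Phi(-1) + \frac{1}{2} \|\mathbf{Q}(t)\|^2
\end{align*}

where (a) follows from the fact that $L(0) = 0$ and $L(t) = \frac{1}{2}\|\mathbf{Q}(t)\|^2$.

The next theorem summarizes that Algorithm 4.2 yields a vanishing utility optimality gap that approaches zero like $O(\frac{1}{t})$.

Theorem 4.1. Let $\mathbf{y}^* = [x_f^*; \mu_l^{(f),*}]_{f \in \mathcal{F}, l \in \mathcal{L}}$ be an optimal solution to the problem (4.1)-(4.6) given in Fact 4.1, i.e., $g_n^{(f)}(\mathbf{y}_n^{(f),*}) = 0, \forall f \in \mathcal{F}, \forall n \in \mathcal{N} \setminus \{\mathrm{Dst}(f)\}$. If $\alpha_n \ge \frac{1}{2}(d_n + 1), \forall n \in \mathcal{N}$ in Algorithm 4.2, where $d_n$ is the degree of node $n$, then for all $t \ge 1$, we have

\[
\frac{1}{t} \sum_{\tau=0}^{t-1} \sum_{f \in \mathcal{F}} U_f(x_f(\tau)) \ge \sum_{f \in \mathcal{F}} U_f(x_f^*) - \frac{1}{t}\zeta,
\]

where $\zeta$ is a constant defined in Lemma 4.8. Moreover, if we define $\bar{x}_f(t) = \frac{1}{t} \sum_{\tau=0}^{t-1} x_f(\tau), \forall f \in \mathcal{F}$, then

\[
\sum_{f \in \mathcal{F}} U_f(\bar{x}_f(t)) \ge \sum_{f \in \mathcal{F}} U_f(x_f^*) - \frac{1}{t}\zeta.
\]

Proof. Recall that $f(\mathbf{y}) = \sum_{f \in \mathcal{F}} U_f(x_f)$. By Lemma 4.8, we have

\[
\sum_{\tau=0}^{t-1} \sum_{f \in \mathcal{F}} U_f(x_f(\tau)) \ge t \sum_{f \in \mathcal{F}} U_f(x_f^*) - \zeta + \frac{1}{2} \|\mathbf{Q}(t)\|^2 \overset{(a)}{\ge} t \sum_{f \in \mathcal{F}} U_f(x_f^*) - \zeta,
\]

where (a) follows from the trivial fact that $\|\mathbf{Q}(t)\|^2 \ge 0$. Dividing both sides by $t$ yields the first inequality in this theorem. The second inequality follows from the concavity of $U_f(\cdot)$ and Jensen's inequality.

4.3.3 Queue Length Analysis

Lemma 4.9. Let $\mathbf{Q}(t), t \in \{0, 1, \ldots\}$ be the virtual queue vectors in Algorithm 4.2. For any $t \ge 1$,

\[
\mathbf{Q}(t) = \sum_{\tau=0}^{t-1} \mathbf{g}(\mathbf{y}(\tau)).
\]

Proof. This lemma follows directly from the fact that $\mathbf{Q}(0) = \mathbf{0}$ and the queue update equation (4.9) can be written as $\mathbf{Q}(t+1) = \mathbf{Q}(t) + \mathbf{g}(\mathbf{y}(t))$.

The next theorem shows the boundedness of all virtual queues $Q_n^{(f)}(t)$ in Algorithm 4.2.

Theorem 4.2.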
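Let $\mathbf{y}^* = [x_f^*; \mu_l^{(f),*}]_{f \in \mathcal{F}, l \in \mathcal{L}}$ be an optimal solution to the problem (4.1)-(4.6) given in Fact 4.1, i.e., $g_n^{(f)}(\mathbf{y}_n^{(f),*}) = 0, \forall f \in \mathcal{F}, \forall n \in \mathcal{N} \setminus \{\mathrm{Dst}(f)\}$, and let $\lambda^*$ be a Lagrange multiplier vector given in Assumption 4.2. If $\alpha_n \ge \frac{1}{2}(d_n + 1)^2, \forall n \in \mathcal{N}$ in Algorithm 4.2, where $d_n$ is the degree of node $n$, then for all $t \ge 1$,

\[
|Q_n^{(f)}(t)| \le 2\|\lambda^*\| + \sqrt{2\zeta}, \quad \forall f \in \mathcal{F}, \forall n \in \mathcal{N} \setminus \{\mathrm{Dst}(f)\},
\]

where $\zeta$ is a constant defined in Lemma 4.8.

Proof. Let $q(\lambda) = \sup_{\mathbf{y} \in \mathcal{C}} \{ f(\mathbf{y}) - \lambda^T \mathbf{g}(\mathbf{y}) \}$ be the Lagrangian dual function defined in Assumption 4.2. For all $\tau \in \{0, 1, \ldots\}$, by Assumption 4.2, we have

\[
f(\mathbf{y}^*) = q(\lambda^*) \overset{(a)}{\ge} f(\mathbf{y}(\tau)) - \lambda^{*,T} \mathbf{g}(\mathbf{y}(\tau))
\]

where (a) follows from the definition of $q(\lambda^*)$. Rearranging terms yields

\[
f(\mathbf{y}(\tau)) \le f(\mathbf{y}^*) + \lambda^{*,T} \mathbf{g}(\mathbf{y}(\tau)), \quad \forall \tau \in \{0, 1, \ldots\}.
\]

Fix $t > 0$. Summing over $\tau \in \{0, 1, \ldots, t-1\}$ yields

\begin{align*}
\sum_{\tau=0}^{t-1} f(\mathbf{y}(\tau)) &\le t f(\mathbf{y}^*) + \sum_{\tau=0}^{t-1} \lambda^{*,T} \mathbf{g}(\mathbf{y}(\tau)) \\
&= t f(\mathbf{y}^*) + \lambda^{*,T} \sum_{\tau=0}^{t-1} \mathbf{g}(\mathbf{y}(\tau)) \\
&\overset{(a)}{=} t f(\mathbf{y}^*) + \lambda^{*,T} \mathbf{Q}(t) \\
&\overset{(b)}{\le} t f(\mathbf{y}^*) + \|\lambda^*\| \|\mathbf{Q}(t)\|
\end{align*}

where (a) follows from Lemma 4.9 and (b) follows from the Cauchy-Schwarz inequality.

On the other hand, by Lemma 4.8, we have

\[
\sum_{\tau=0}^{t-1} f(\mathbf{y}(\tau)) \ge t f(\mathbf{y}^*) - \zeta + \frac{1}{2} \|\mathbf{Q}(t)\|^2.
\]

Combining the last two inequalities and cancelling the common terms yields

\begin{align*}
& \frac{1}{2} \|\mathbf{Q}(t)\|^2 - \zeta \le \|\lambda^*\| \|\mathbf{Q}(t)\| \\
\Rightarrow\ & \big( \|\mathbf{Q}(t)\| - \|\lambda^*\| \big)^2 \le \|\lambda^*\|^2 + 2\zeta \\
\Rightarrow\ & \|\mathbf{Q}(t)\| \le \|\lambda^*\| + \sqrt{\|\lambda^*\|^2 + 2\zeta} \\
\overset{(a)}{\Rightarrow}\ & \|\mathbf{Q}(t)\| \le 2\|\lambda^*\| + \sqrt{2\zeta}
\end{align*}

where (a) follows from the basic inequality $\sqrt{a + b} \le \sqrt{a} + \sqrt{b}$ for any $a, b \ge 0$.

Thus, for any $f \in \mathcal{F}$ and $n \in \mathcal{N} \setminus \{\mathrm{Dst}(f)\}$, we have

\[
|Q_n^{(f)}(t)| \le \|\mathbf{Q}(t)\| \le 2\|\lambda^*\| + \sqrt{2\zeta}.
\]

This theorem shows that the absolute values of all virtual queues $Q_n^{(f)}(t)$ are bounded from above by a constant $B = 2\|\lambda^*\| + \sqrt{2\zeta}$. By Lemma 4.1 and the discussion in Section 4.2.1, the actual physical queues $Z_n^{(f)}(t)$ evolving via (4.8) satisfy $Z_n^{(f)}(t) \le 2B + \sum_{l \in \mathcal{O}(n)} C_l, \forall t$. This is summarized in the next corollary.

Corollary 4.1.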
Let y ∗ = [x ∗ f ;μ (f),∗ l ] f∈F,l∈L be an optimal solution to the problem (4.1)-(4.6) given in Fact 4.1, i.e.,g (f) n (y (f),∗ n ) = 0,∀f∈F,∀n∈N\Dst(f), and λ ∗ be a Lagrange multiplier vector given in Assumption 4.2. If α n ≥ 1 2 (d n + 1) 2 ,∀n∈N in Algorithm 4.2, where d n is the degree of node n, then all actual physical queues Z (f) n (t),∀f ∈F,∀n∈N\{Dst(f)} in the network evolving via (4.8) satisfy Z (f) n (t)≤4kλ ∗ k + 2 p 2ζ + X l∈O(n) C l , ∀t. where ζ is a constant defined in Lemma 4.8. Define vector x ∗ = [x ∗ f ] f∈F and x(t) = [x f (t)] f∈F wherex ∗ f andx f (t) are defined in Theorem 4.1. Note that if eachU f (x f ) is strongly concave with respect tox f , then x ∗ is unique by strong concavity. (However, [μ (f),∗ l ] l∈L is not necessarily unique.) In this case, Corollary 4.2 shows x(t) yielded by Algorithm 4.2 converges to the unique maximizer x ∗ . Corollary 4.2. If the conditions in Theorem 4.1 hold and each U f (x f ) is strongly concave with respect to x f , then Algorithm 4.2 guarantees x(t)→ x ∗ as t→∞. Proof. Assume each U f (x f ) is strongly concave with respect to x f with modulus c f . Let c = min f∈F {c f }. Note thatC ={[x f ;μ (f) l ] f∈F,l∈L : (4.3)-(4.6) hold} is a compact set. By Assumption 4.2, we have y ∗ = argmax y∈C {h(y)−λ ∗,T g(y)}. Recall that h(y) = P f∈F U f (x f ) and g(y) are separable since they can be written as the sum of scalar functions in terms of x f and μ (f) l . Thus, x ∗ f and [μ (f),∗ l ] f∈F appear separably and maximize h(y)−λ ∗,T g(y) to obtain the left-side of (4.32) where eachx ∗ f satisfying (4.6) maximizes a strongly concave part and each vector [μ (f),∗ l ] f∈F satisfying (4.3)-(4.5) maximizes a concave part. Define y(t) = 1 t P t−1 τ=0 y(τ). Note that y(t) satisfies (4.3)-(4.6) since each y(τ) is generated by Algorithm 4.2. 
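By Corollary 1.2, for all $t\geq 1$, with $\overline{\mathbf{y}}(t)=\frac{1}{t}\sum_{\tau=0}^{t-1}\mathbf{y}(\tau)$ and $\overline{\mathbf{x}}(t)=[\overline{x}_f(t)]_{f\in\mathcal{F}}$ denoting the running averages defined above,
$$
\begin{aligned}
\sum_{f\in\mathcal{F}} U_f(x_f^*) - \boldsymbol{\lambda}^{*,T}\mathbf{g}(\mathbf{y}^*)
&\geq \sum_{f\in\mathcal{F}} U_f(\overline{x}_f(t)) - \boldsymbol{\lambda}^{*,T}\mathbf{g}(\overline{\mathbf{y}}(t)) + \sum_{f\in\mathcal{F}}\frac{c_f}{2}\big(x_f^* - \overline{x}_f(t)\big)^2 \\
&\overset{(a)}{\geq} \sum_{f\in\mathcal{F}} U_f(\overline{x}_f(t)) - \boldsymbol{\lambda}^{*,T}\mathbf{g}(\overline{\mathbf{y}}(t)) + \frac{c}{2}\|\overline{\mathbf{x}}(t) - \mathbf{x}^*\|^2 \\
&\overset{(b)}{=} \sum_{f\in\mathcal{F}} U_f(\overline{x}_f(t)) - \boldsymbol{\lambda}^{*,T}\frac{1}{t}\sum_{\tau=0}^{t-1}\mathbf{g}(\mathbf{y}(\tau)) + \frac{c}{2}\|\overline{\mathbf{x}}(t) - \mathbf{x}^*\|^2 \\
&\overset{(c)}{=} \sum_{f\in\mathcal{F}} U_f(\overline{x}_f(t)) - \frac{1}{t}\boldsymbol{\lambda}^{*,T}\mathbf{Q}(t) + \frac{c}{2}\|\overline{\mathbf{x}}(t) - \mathbf{x}^*\|^2
\end{aligned}
\tag{4.32}
$$
where (a) follows from $c=\min_{f\in\mathcal{F}}\{c_f\}$; (b) follows from the linearity of $\mathbf{g}(\cdot)$ and the definition of $\overline{\mathbf{y}}(t)$; and (c) follows from Lemma 4.9.

Recall that $\boldsymbol{\lambda}^{*,T}\mathbf{g}(\mathbf{y}^*) = 0$ by strong duality of convex programs (Assumption 4.2). Thus, (4.32) implies
$$
\begin{aligned}
\frac{c}{2}\|\overline{\mathbf{x}}(t) - \mathbf{x}^*\|^2
&\leq \sum_{f\in\mathcal{F}} U_f(x_f^*) - \sum_{f\in\mathcal{F}} U_f(\overline{x}_f(t)) + \frac{1}{t}\boldsymbol{\lambda}^{*,T}\mathbf{Q}(t) \\
&\overset{(a)}{\leq} \sum_{f\in\mathcal{F}} U_f(x_f^*) - \sum_{f\in\mathcal{F}} U_f(\overline{x}_f(t)) + \frac{1}{t}\|\boldsymbol{\lambda}^*\|\|\mathbf{Q}(t)\| \\
&\overset{(b)}{\leq} \frac{1}{t}\zeta + \frac{1}{t}\|\boldsymbol{\lambda}^*\|\big(2\|\boldsymbol{\lambda}^*\| + \sqrt{2\zeta}\big)
\end{aligned}
$$
where (a) follows from the Cauchy-Schwarz inequality; and (b) follows from Theorem 4.1, which implies $\sum_{f\in\mathcal{F}} U_f(\overline{x}_f(t))\geq\sum_{f\in\mathcal{F}} U_f(x_f^*) - \frac{1}{t}\zeta,\forall t\geq 1$, and Theorem 4.2, which implies $\|\mathbf{Q}(t)\|\leq 2\|\boldsymbol{\lambda}^*\| + \sqrt{2\zeta},\forall t\geq 1$. Taking limits $t\to\infty$ on both sides yields that $\overline{\mathbf{x}}(t)\to\mathbf{x}^*$ as $t\to\infty$.

4.3.4 Performance of Algorithm 4.2

Theorem 4.1 and Corollary 4.1 together imply that Algorithm 4.2 with $\alpha_n\geq\frac{1}{2}(d_n+1),\forall n\in\mathcal{N}$ achieves a vanishing utility optimality gap that decays like $O(\frac{1}{t})$, where $t$ is the number of iterations, and guarantees that the physical queues at each node for each session are always bounded by a constant that is independent of the utility optimality gap. This is superior to existing backpressure algorithms from [ES06, Nee10, LSXS15] that can achieve an $O(\frac{1}{V})$ utility gap only at the cost of an $O(V^2)$ or $O(V)$ queue length, where $V$ is an algorithm parameter. To obtain a vanishing utility gap, existing backpressure algorithms in [ES06, Nee10, LSXS15] necessarily yield unbounded queues.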
We also comment that the $O(V^2)$ queue bound in the primal-dual type backpressure algorithm [ES06] is actually of the order $V^2\|\boldsymbol{\lambda}^*\| + B_1$, where $\boldsymbol{\lambda}^*$ is the Lagrange multiplier vector attaining strong duality and $B_1$ is a constant determined by the problem parameters. A recent work [Nee14] also shows that the $O(V)$ queue bound in the backpressure algorithm from drift-plus-penalty is of the order $V\|\boldsymbol{\lambda}^*\| + B_2$, where $B_2$ is also a constant determined by the problem parameters. Since $\boldsymbol{\lambda}^*$ is a constant vector independent of $V$, both algorithms are claimed to have $O(V^2)$ or $O(V)$ queue bounds. By Corollary 4.1, Algorithm 4.2 guarantees that the physical queues at each node are bounded by $4\|\boldsymbol{\lambda}^*\| + B_3$, where $B_3$ is a constant for a given problem. Thus, the constant queue bound guaranteed by Algorithm 4.2 is typically smaller than the $O(V^2)$ or $O(V)$ queue bounds from [ES06] and [Nee14] even for a small $V$. (A small $V$ can yield a poor utility performance in the backpressure algorithms in [ES06, Nee10].)

Theorem 4.1 and Corollary 4.1 require $\alpha_n\geq\frac{1}{2}(d_n+1),\forall n\in\mathcal{N}$ in Algorithm 4.2. The required value of each $\alpha_n$ is significantly smaller than the $\alpha > \frac{1}{2}\big(\sqrt{|\mathcal{F}|} + \sqrt{2|\mathcal{L}|}\big)^2$ required by Algorithm 4.1 according to Lemma 4.2 and the general theory developed in Chapter 3.

4.4 Numerical Experiment

In this section, we consider a simple network with 6 nodes, 8 links and 2 sessions as described in Figure 4.1. Session 1 from node 1 to node 6 has utility function $\log(x_1)$ and session 2 from node 3 to node 4 has utility function $1.5\log(x_2)$. The log utilities are widely used as metrics of proportional fairness in the network [KMT98]. The routing path of each session is arbitrary as long as data can be delivered from the source node to the destination node. For simplicity, assume that each link has capacity 1.
The optimal source session rates for problem (4.1)-(4.6) are $x_1^* = 1.2$ and $x_2^* = 1.8$, and the link session rates, i.e., the static routing for each session, are drawn in Figure 4.2.

Figure 4.1: A simple network with 6 nodes, 8 links and 2 sessions.

Figure 4.2: The optimal routing for the network in Figure 4.1.

To compare the convergence performance of Algorithm 4.2 and the backpressure algorithm in [Nee10] (with the best utility-delay tradeoff among all existing backpressure algorithms), we run both Algorithm 4.2 with $\alpha_n = \frac{1}{2}(d_n+1),\forall n\in\mathcal{N}$ and the backpressure algorithm in [Nee10] with $V = 500$ to plot Figure 4.3. It can be observed from Figure 4.3 that Algorithm 4.2 converges to the optimal source session rates faster than the backpressure algorithm in [Nee10]. The backpressure algorithm in [Nee10] with $V = 500$ takes around 2500 iterations to converge to source rates close to (1.2, 1.8), while Algorithm 4.2 only takes around 800 iterations to converge to (1.2, 1.8) (as shown in the zoom-in subfigure at the top right corner). In fact, the backpressure algorithm in [Nee10] with $V = 500$ cannot converge to the exact optimal source session rate (1.2, 1.8) but can only converge to its neighborhood with a distance gap determined by the value of $V$. This is an effect of the fundamental $[O(\frac{1}{V}), O(V)]$ utility-delay tradeoff of the backpressure algorithm in [Nee10]. In contrast, Algorithm 4.2 can eventually converge to the exact optimal source session rate (1.2, 1.8). A zoom-in subfigure at the bottom right corner in Figure 4.3 verifies this and shows that the source rate for Session 1 in Algorithm 4.2 converges to 1.2 while the source rate in the backpressure algorithm in [Nee10] with $V = 500$ oscillates around a point slightly larger than 1.2.
Figure 4.3: Convergence performance comparison between Algorithm 4.2 and the backpressure algorithm in [Nee10]. (The plot shows source rates against iterations for Session 1: 1->6 and Session 2: 3->4 under both algorithms, with two zoom-in subfigures.)

Corollary 4.1 shows that Algorithm 4.2 guarantees each actual queue in the network is bounded by the constant $4\|\boldsymbol{\lambda}^*\| + 2\sqrt{2\zeta} + \sum_{l\in\mathcal{O}(n)} C_l$. Recall that the backpressure algorithm in [Nee10] can guarantee the actual queues in the network are bounded by a constant of order $V\|\boldsymbol{\lambda}^*\|$. Figure 4.4 plots the sum of the actual queue lengths at each node for Algorithm 4.2 and the backpressure algorithm in [Nee10] with $V = 10$, 100 and 500. (Recall that a larger $V$ in the backpressure algorithm in [Nee10] yields a smaller utility gap but a larger queue length.) It can be observed that Algorithm 4.2 has the smallest actual queue length (see the zoom-in subfigure) and that the actual queue length of the backpressure algorithm in [Nee10] scales linearly with respect to $V$.

Figure 4.4: Actual queue length comparison between Algorithm 4.2 and the backpressure algorithm in [Nee10]. (The plot shows the network sum physical queue length against iterations for the backpressure algorithm in [Nee10] with V=500, V=100 and V=10, and for Algorithm 4.2.)

4.5 Chapter Summary

This chapter develops new backpressure algorithms for joint rate control and routing in multi-hop data networks. The new backpressure algorithms can achieve vanishing utility optimality gaps and finite queue lengths. This improves the state-of-the-art $[O(\epsilon), O(1/\epsilon^2)]$ or $[O(\epsilon), O(1/\epsilon)]$ utility-delay tradeoff attained by existing backpressure algorithms [ES06, NMR05, LS04, LSXS15].
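4.6 Supplement to this Chapter

4.6.1 Multi-Path Network Utility Maximization with Predetermined Paths

Consider the multi-path network utility maximization in [LS06] where each session has multiple given paths. Let $x_f$ be the total source rate of each session $f\in\mathcal{F}$. Let $\mathcal{P}_f$ be the set of paths for session $f$. The link session rate $\mu_l^{(f)}$ becomes a vector $\boldsymbol{\mu}_l^{(f)} = [\mu_l^{(f,j)}]_{j\in\mathcal{P}_f}$. (Note that multiple paths for the same session are allowed to overlap.) Define $\mathcal{S}_l^{(f)}$ as the set of paths for session $f$ that are allowed to use link $l$. Note that the sets $\mathcal{S}_l^{(f)}$ are determined by the given paths for each session. That is, if path $j$ for session $f$ uses link $l$, then $j\in\mathcal{S}_l^{(f)}$; if no given path for session $f$ uses link $l$, then $\mathcal{S}_l^{(f)} = \emptyset$. The multi-path network utility maximization problem can be formulated as follows:
$$
\begin{aligned}
\max\quad & \sum_{f\in\mathcal{F}} U_f(x_f) \\
\text{s.t.}\quad & x_f\mathbf{1}_{\{n=\text{Src}(f)\}} + \sum_{l\in\mathcal{I}(n)}\sum_{j\in\mathcal{P}_f}\mu_l^{(f,j)} \leq \sum_{l\in\mathcal{O}(n)}\sum_{j\in\mathcal{P}_f}\mu_l^{(f,j)},\quad\forall f\in\mathcal{F},\forall n\in\mathcal{N}\setminus\{\text{Dst}(f)\}, \\
& \sum_{f\in\mathcal{F}}\sum_{j\in\mathcal{P}_f}\mu_l^{(f,j)} \leq C_l,\quad\forall l\in\mathcal{L}, \\
& \mu_l^{(f,j)} \geq 0,\quad\forall l\in\mathcal{L},\forall f\in\mathcal{F},\forall j\in\mathcal{S}_l^{(f)}, \\
& \mu_l^{(f,j)} = 0,\quad\forall l\in\mathcal{L},\forall f\in\mathcal{F},\forall j\in\mathcal{P}_f\setminus\mathcal{S}_l^{(f)}, \\
& x_f\in\text{dom}(U_f),\quad\forall f\in\mathcal{F}.
\end{aligned}
$$
The above formulation is in the form of the problem (4.1)-(4.6) except that the variable dimension is extended. In this case, Algorithm 4.2 developed to solve the problem (4.1)-(4.6) can be adapted to solve the above multi-path network utility maximization problem by replacing $\mu_l^{(f)}$ with $\sum_{j\in\mathcal{P}_f}\mu_l^{(f,j)}$ in updates (4.9) and (4.13); and replacing the subproblem (4.16)-(4.19) with
$$
\begin{aligned}
\max_{\boldsymbol{\mu}_{(n,m)}^{(f)}}\quad & \sum_{f\in\mathcal{F}}\big[W_n^{(f)}(t) - W_m^{(f)}(t)\big]\sum_{j\in\mathcal{P}_f}\mu_{(n,m)}^{(f,j)} - (\alpha_n+\alpha_m)\sum_{f\in\mathcal{F}}\sum_{j\in\mathcal{P}_f}\big[\mu_{(n,m)}^{(f,j)} - \mu_{(n,m)}^{(f,j)}(t-1)\big]^2 \\
\text{s.t.}\quad & \sum_{f\in\mathcal{F}}\sum_{j\in\mathcal{P}_f}\mu_{(n,m)}^{(f,j)} \leq C_{(n,m)}, \\
& \mu_{(n,m)}^{(f,j)} \geq 0,\quad\forall f\in\mathcal{F},\forall j\in\mathcal{S}_{(n,m)}^{(f)}, \\
& \mu_{(n,m)}^{(f,j)} = 0,\quad\forall f\in\mathcal{F},\forall j\notin\mathcal{S}_{(n,m)}^{(f)},
\end{aligned}
$$
which again has the same structure as the subproblem (4.16)-(4.19) except that the variable dimension is extended.

4.6.2 An Example Illustrating the Possibly Large Gap Between Model (4.7) and Model (4.8)

Consider a network example shown in Figure 4.5.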
The network has $3k+1$ nodes where only node 0 is a destination; and $a_i,i\in\{1,2,\ldots,k\}$ and $b_i,i\in\{1,2,\ldots,k\}$ can have exogenous arrivals. Assume all link capacities are equal to 1; and the exogenous arrivals are periodic with period $2k$, as follows:
• Time slot 1: One packet arrives at node $a_1$.
• Time slot 2: One packet arrives at node $a_2$.
• ···
• Time slot $k$: One packet arrives at node $a_k$.
• Time slot $k+1$: One packet arrives at node $b_1$.
• Time slot $k+2$: One packet arrives at node $b_2$.
• ···
• Time slot $2k$: One packet arrives at node $b_k$.

Under dynamics (4.7), each packet arrives on its own slot and traverses all links of its path to exit on the same slot it arrived. The queue backlog in each node is 0 for all time.

Under dynamics (4.8), the first packet arrives at time slot 1 at node $a_1$. This packet visits node $a_2$ at time slot 2, when the second packet also arrives at $a_2$. One of these packets is delivered to node $a_3$ at time slot 3, and another packet also arrives at node $a_3$. The nodes $\{1,\ldots,k\}$ do not have any exogenous arrivals and act only to delay the delivery of all packets from the $a_i$ nodes. It follows that the link from node $k$ to node 0 will send exactly one packet over each slot $t\in\{2k+1,2k+2,\ldots,2k+k\}$. Similarly, the link from $b_k$ to 0 sends exactly one packet to node 0 over each of these same slots. Thus, node 0 receives 2 packets on each slot $t\in\{2k+1,2k+2,\ldots,2k+k\}$, but can only output 1 packet per slot. The queue backlog in this node grows linearly and reaches $k+1$ at time $2k+k$. Thus, the backlog in node 0 can be arbitrarily large when $k$ is large.

This example demonstrates that, even when there is only one destination, the deviation between virtual queues under dynamics (4.7) and actual queues under dynamics (4.8) can be arbitrarily large, even when every node has an out-degree of 1 and an in-degree of at most 2.

4.6.3 Proof of Lemma 4.4

Note that the problem (4.20)-(4.22) satisfies the Slater condition.
So the optimal solution to the problem (4.20)-(4.22) is characterized by KKT conditions [BV04]. Introduce a Lagrange multiplier $\theta\in\mathbb{R}_+$ for the inequality constraint $\sum_{k=1}^K z_k\leq b$ and a Lagrange multiplier vector $\boldsymbol{\nu} = [\nu_1,\ldots,\nu_K]^T\in\mathbb{R}_+^K$ for the inequality constraints $z_k\geq 0,k\in\{1,2,\ldots,K\}$.

Figure 4.5: An example illustrating the possibly large gap between the queue model (4.7) and the queue model (4.8).

Let $\mathbf{z}^* = [z_1^*,\ldots,z_K^*]^T$ and $(\theta^*,\boldsymbol{\nu}^*)$ be any primal and dual pair with zero duality gap. By the KKT conditions, we have
$$
\begin{aligned}
& z_k^* - a_k + \theta^* - \nu_k^* = 0,\ \forall k\in\{1,2,\ldots,K\}; \quad \sum_{k=1}^K z_k^*\leq b; \quad \theta^*\geq 0; \quad \theta^*\Big[\sum_{k=1}^K z_k^* - b\Big] = 0; \\
& z_k^*\geq 0,\ \forall k\in\{1,2,\ldots,K\}; \quad \nu_k^*\geq 0,\ \forall k\in\{1,2,\ldots,K\}; \quad \nu_k^* z_k^* = 0,\ \forall k\in\{1,2,\ldots,K\}.
\end{aligned}
$$
Eliminating $\nu_k^*,\forall k\in\{1,2,\ldots,K\}$ in all equations yields
$$
\begin{aligned}
& \theta^*\geq a_k - z_k^*,\ k\in\{1,2,\ldots,K\}; \quad \sum_{k=1}^K z_k^*\leq b; \quad \theta^*\geq 0; \quad \theta^*\Big[\sum_{k=1}^K z_k^* - b\Big] = 0; \\
& z_k^*\geq 0,\ \forall k\in\{1,2,\ldots,K\}; \quad (z_k^* - a_k + \theta^*)z_k^* = 0,\ \forall k\in\{1,2,\ldots,K\}.
\end{aligned}
$$
For all $k\in\{1,2,\ldots,K\}$, we consider $\theta^* < a_k$ and $\theta^*\geq a_k$ separately:
1. If $\theta^* < a_k$, then $\theta^*\geq a_k - z_k^*$ holds only when $z_k^* > 0$, which by $(z_k^* - a_k + \theta^*)z_k^* = 0$ implies that $z_k^* = a_k - \theta^*$.
2. If $\theta^*\geq a_k$, then $z_k^* > 0$ is impossible, because $z_k^* > 0$ implies that $z_k^* - a_k + \theta^* > 0$, which together with $z_k^* > 0$ contradicts the slackness condition $(z_k^* - a_k + \theta^*)z_k^* = 0$. Thus, if $\theta^*\geq a_k$, we must have $z_k^* = 0$.

Summarizing both cases, we have $z_k^* = \max\{0,a_k - \theta^*\},\forall k\in\{1,2,\ldots,K\}$, where $\theta^*$ is chosen such that $\sum_{k=1}^K z_k^*\leq b$, $\theta^*\geq 0$ and $\theta^*\big[\sum_{k=1}^K z_k^* - b\big] = 0$.

To find such a $\theta^*$, we first check if $\theta^* = 0$. If $\theta^* = 0$, the slackness condition $\theta^*\big[\sum_{k=1}^K z_k^* - b\big] = 0$ is guaranteed to hold and we need to further require $\sum_{k=1}^K z_k^* = \sum_{k=1}^K\max\{0,a_k\}\leq b$. Thus $\theta^* = 0$ if and only if $\sum_{k=1}^K\max\{0,a_k\}\leq b$. Thus, Algorithm 4.3 checks if $\sum_{k=1}^K\max\{0,a_k\}\leq b$ holds at the first step and, if this is true, we conclude $\theta^* = 0$ and we are done!
Otherwise, we know $\theta^* > 0$. By the slackness condition $\theta^*\big[\sum_{k=1}^K z_k^* - b\big] = 0$, we must have $\sum_{k=1}^K z_k^* = \sum_{k=1}^K\max\{0,a_k - \theta^*\} = b$. To find $\theta^* > 0$ such that $\sum_{k=1}^K\max\{0,a_k - \theta^*\} = b$, we can apply a bisection search by noting that all $z_k^*$ are nonincreasing with respect to $\theta^*$.

Another algorithm for finding $\theta^*$ is inspired by the observation that for any $i,j\in\{1,2,\ldots,K\}$ with $a_j\geq a_i$, we have $z_j^*\geq z_i^*$. Thus, we first sort all $a_k$ in decreasing order, say $\pi$ is the permutation such that $a_{\pi(1)}\geq a_{\pi(2)}\geq\cdots\geq a_{\pi(K)}$; and then sequentially check if $k\in\{1,2,\ldots,K\}$ is the index such that $a_{\pi(k)} - \theta^*\geq 0$ and $a_{\pi(k+1)} - \theta^* < 0$. To check this, we first assume $k$ is indeed such an index and solve the equation $\sum_{j=1}^k (a_{\pi(j)} - \theta^*) = b$ to obtain $\theta^*$; then verify the assumption by checking if $\theta^*\geq 0$, $a_{\pi(k)} - \theta^*\geq 0$ and $a_{\pi(k+1)} - \theta^*\leq 0$. (Note that in Algorithm 4.3, to avoid recalculating the partial sum $\sum_{j=1}^k a_{\pi(j)}$ for each $k$, we introduce the parameter $S_k = \sum_{j=1}^k a_{\pi(j)}$ and update $S_k$ incrementally. By doing this, the complexity of each iteration in the loop is only $O(1)$.) The algorithm is described in Algorithm 4.3 and has complexity $O(K\log(K))$. The overall complexity is dominated by the step of sorting all $a_k$.
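The sorting-based procedure described above can be sketched in a few lines of Python. This is only an illustrative sketch, not a transcription of Algorithm 4.3: it assumes that problem (4.20)-(4.22) has the form $\min_{\mathbf{z}} \frac{1}{2}\|\mathbf{z}-\mathbf{a}\|^2$ subject to $\sum_k z_k\leq b$ and $z_k\geq 0$ with $b\geq 0$ (which is what the stationarity condition $z_k^* - a_k + \theta^* - \nu_k^* = 0$ suggests), and the function name `project_capped_orthant` is hypothetical.

```python
from typing import List, Tuple


def project_capped_orthant(a: List[float], b: float) -> Tuple[List[float], float]:
    """Return (z*, theta*) for min 0.5*||z - a||^2 s.t. sum(z) <= b, z >= 0.

    Assumed problem form (see lead-in); b is assumed nonnegative.
    """
    # Step 1: theta* = 0 exactly when clipping a at zero is already feasible.
    if sum(max(0.0, ak) for ak in a) <= b:
        return [max(0.0, ak) for ak in a], 0.0

    # Step 2: sort in decreasing order and scan candidate indices k.
    a_sorted = sorted(a, reverse=True)
    partial_sum = 0.0  # S_k, updated incrementally so each iteration is O(1)
    theta = 0.0
    for k, ak in enumerate(a_sorted, start=1):
        partial_sum += ak
        # Solve sum_{j<=k} (a_pi(j) - theta) = b for theta.
        candidate = (partial_sum - b) / k
        # Verify the assumption that exactly the k largest entries stay positive.
        next_ok = k == len(a_sorted) or a_sorted[k] - candidate <= 0.0
        if candidate >= 0.0 and ak - candidate >= 0.0 and next_ok:
            theta = candidate
            break
    return [max(0.0, ak - theta) for ak in a], theta
```

For instance, with $\mathbf{a} = [3, 1, -2]$ and $b = 2$ the scan stops at $k = 1$ with $\theta^* = 1$, giving $\mathbf{z}^* = [2, 0, 0]$. The sketch uses exact comparisons, so inputs that sit on a boundary up to floating-point rounding may need a small tolerance in practice.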
4.6.4 Proof of Lemma 4.6 The objective function (4.30) can be rewritten as f(y)− X f∈F, n∈N\Dst(f) W (f) n (t)g (f) n (y (f) n ) +αnky (f) n −y (f) n (t− 1)k 2 − X f∈F, n=Dst(f) αn X l∈I(n) [μ (f) l −μ (f) l (t− 1)] 2 (a) = X f∈F U f (x f )− X f∈F, n∈N\Dst(f) W (f) n (t) x f 1 {n=Src(f)} + X l∈I(n) μ (f) l − X l∈O(n) μ (f) l − X f∈F, n∈N\Dst(f) αn [x f −x f (t− 1)] 2 1 {n=Src(f)} + X l∈I(n) [μ (f) l −μ (f) l (t− 1)] 2 + X l∈O(n) [μ (f) l −μ (f) l (t− 1)] 2 − X f∈F, n=Dst(f) αn X l∈I(n) [μ (f) l −μ (f) l (t− 1)] 2 (b) = X f∈F U f (x f )−W (f) Src(f) (t)x f −α Src(f) [x f −x f (t− 1)] 2 + X (n,m)∈L X f∈F [W (f) n (t)−W (f) m (t)]μ (f) (n,m) − X (n,m)∈L (αn +αm) X f∈F [μ (f) (n,m) −μ (f) (n,m) (t− 1)] 2 (4.33) where (a) follows from the fact thatg (f) n (y (f) n ) =x f 1 {n=Src(f)} + P l∈I(n) μ (f) l − P l∈O(n) μ (f) l and ky (f) n −y (f) n (t−1)k 2 = [x f −x f (t−1)] 2 1 {n=Src(f)} + P l∈I(n) [μ (f) l −μ (f) l (t−1)] 2 + P l∈O(n) [μ (f) l − μ (f) l (t− 1)] 2 ; and (b) follows by collecting each linear term μ (f) l and each quadratic term [μ (f) l − μ (f) l (t− 1)] 2 . Note that each link session rate μ (f) l appears twice with opposite signs in the 118 summation term P f∈F,n∈N\{Dst(f)} W (f) n (t) x f 1 {n=Src(f)} + P l∈I(n) μ (f) l − P l∈O(n) μ (f) l unless linkl flows into Dst(f) and recall thatW (f) Dst(f) = 0,∀f∈F. The quadratic terms are collected in a similar way. Note that the term P f∈F,n=Dst(f) α n P l∈I(n) [μ (f) l −μ (f) l (t−1)] 2 introduced to the objective function (4.30) is necessary to guarantee each quadratic term [μ (f) (m,n) −μ (f) (m,n) (t− 1)] 2 with the same link index (n,m) but different flow indices f ∈F to have the same coefficient α n +α m in (4.33). Note that the equation (4.33) is now separable for each scalarx f and vector [μ (f) (n,m) ] f∈F . 
Thus, the problem (4.30)-(4.31) can be decomposed into independent smaller optimization problems in the form of the problem (4.14)-(4.15) with respect to each scalar x f , and in the form of the problem (4.16)-(4.19) with respect to each vector [μ (f) (n,m) ] f∈F . 4.6.5 Proof of Lemma 4.7 Note that W (f) n (t) appears as a known constant in (4.14). Since U f (x f ) is concave and W (f) n (t)x f is linear, it follows that (4.14) is strongly concave with respect to x f with modulus 2α n . Since x f (t) is chosen to solve (4.14)-(4.15), by Corollary 1.3,∀f∈F, we have U f (x f (t))−W (f) Src(f) (t)x f (t)−α n [x f (t)−x f (t− 1)] 2 | {z } (4.34)-I ≥U f (x ∗ f )−W (f) Src(f) (t)x ∗ f −α n [x ∗ f −x f (t− 1)] 2 +α n [x ∗ f −x f (t)] 2 | {z } (4.34)-II . (4.34) Similarly, we know (4.16) is strongly concave with respect to vector [μ f (n,m) ] f∈F with modulus 2(α n +α m ). By Corollary 1.3,∀(n,m)∈O(n), we have X f∈F [W (f) n (t)−W (f) m (t)]μ (f) (n,m) (t)− (αn +αm) X f∈F [μ (f) (n,m) (t)−μ (f) (n,m) (t− 1)] 2 | {z } (4.35)-I ≥ X f∈F [W (f) n (t)−W (f) m (t)]μ (f),∗ (n,m) − (αn +αm) X f∈F [μ (f),∗ (n,m) −μ (f) (n,m) (t− 1)] 2 + (αn +αm) X f∈F [μ (f),∗ (n,m) −μ (f) (n,m) (t)] 2 | {z } (4.35)-II . (4.35) Recall that each column vector y (f) n defined in (4.23) is composed by control actions that appear in each constraint (4.2); the column vector y = [x f ;μ (f) l ] f∈F,l∈L is the collection of all 119 control actions; and f(y) = P f∈F U f (x f ). Summing the term (4.34)-I over all f∈F and the term (4.35)-I over all (n,m)∈L and using an argument similar to the proof of Lemma 4.6 (Recall that y(t) is jointly chosen to minimize (4.30) by Lemma 4.6.) yields X f∈F (4.34)-I + X (n,m)∈N (4.35)-I =f(y(t))− X f∈F, n∈N\Dst(f) W (f) n (t)g (f) n (y (f) n (t)) +α n ky (f) n (t)− y (f) n (t− 1)k 2 − X f∈F, n=Dst(f) α n X l∈I(n) [μ (f) l (t)−μ (f) l (t− 1)] 2 . 
(4.36) Recall that Φ(t) = P f∈F,n∈N α n 1 {n6=Dst(f)} ky (f),∗ n −y (f) n (t)k 2 +α n 1 {n=Dst(f)} P l∈I(n) [μ (f),∗ l − μ (f) l (t)] 2 . Summing the term (4.34)-II over all f∈F and the term (4.35)-II over all (n,m)∈L yields X f∈F (4.34)-II + X (n,m)∈N (4.35)-II =f(y ∗ ) + Φ(t)− Φ(t− 1)− X f∈F, n∈N\Dst(f) W (f) n (t)g (f) n (y (f),∗ n ), (4.37) Combining (4.34)-(4.37) and rearranging terms yields f(y(t)) ≥f(y ∗ ) + Φ(t)− Φ(t− 1)− X f∈F, n∈N\Dst(f) W (f) n (t)g (f) n (y (f),∗ n ) + X f∈F, n=Dst(f) αn X l∈I(n) [μ (f) l (t)−μ (f) l (t− 1)] 2 + X f∈F, n∈N\Dst(f) W (f) n (t)g (f) n (y (f) n (t)) +αnky (f) n (t)−y (f) n (t− 1)k 2 (a) ≥f(y ∗ ) + Φ(t)− Φ(t− 1) + X f∈F, n∈N\Dst(f) W (f) n (t)g (f) n (y (f) n (t)) +αnky (f) n (t)−y (f) n (t− 1)k 2 (b) =f(y ∗ ) + Φ(t)− Φ(t− 1) + X f∈F, n∈N\Dst(f) Q (f) n (t)g (f) n (y (f) n (t)) +g (f) n (y (f) n (t− 1))g (f) n (y (f) n (t)) +αnky (f) n (t)−y (f) n (t− 1)k 2 (4.38) 120 where (a) follows because g (f) n (y (f),∗ n ) = 0,∀f∈F,∀n∈N\ Dst(f), and X f∈F,n=Dst(f) α n X l∈I(n) [μ (f) l (t)−μ (f) l (t− 1)] 2 ≥ 0; and (b) follows because W (f) n (t) =Q (f) n (t) +g (f) n (y (f) n (t− 1)). Recall that u 1 u 2 = 1 2 u 2 1 + 1 2 u 2 2 − 1 2 (u 1 −u 2 ) 2 for any u 1 ,u 2 ∈ R. Thus, for all f∈F,n∈ N\ Dst(f), we have g (f) n (y (f) n (t− 1))g (f) n (y (f) n (t)) = 1 2 [g (f) n (y (f) n (t− 1))] 2 + 1 2 [g (f) n (y (f) n (t))] 2 − 1 2 [g (f) n (y (f) n (t− 1))−g (f) n (y (f) n (t))] 2 . 
(4.39)

Substituting (4.39) into (4.38) yields
$$
\begin{aligned}
f(\mathbf{y}(t)) &\geq f(\mathbf{y}^*) + \Phi(t) - \Phi(t-1) + \sum_{f\in\mathcal{F},\,n\in\mathcal{N}\setminus \text{Dst}(f)}\Big[Q_n^{(f)}(t)g_n^{(f)}(\mathbf{y}_n^{(f)}(t)) + \frac{1}{2}\big[g_n^{(f)}(\mathbf{y}_n^{(f)}(t-1))\big]^2 + \frac{1}{2}\big[g_n^{(f)}(\mathbf{y}_n^{(f)}(t))\big]^2 \\
&\qquad - \frac{1}{2}\big[g_n^{(f)}(\mathbf{y}_n^{(f)}(t-1)) - g_n^{(f)}(\mathbf{y}_n^{(f)}(t))\big]^2 + \alpha_n\|\mathbf{y}_n^{(f)}(t) - \mathbf{y}_n^{(f)}(t-1)\|^2\Big] \\
&\overset{(a)}{\geq} f(\mathbf{y}^*) + \Phi(t) - \Phi(t-1) + \sum_{f\in\mathcal{F},\,n\in\mathcal{N}\setminus \text{Dst}(f)}\Big[Q_n^{(f)}(t)g_n^{(f)}(\mathbf{y}_n^{(f)}(t)) + \frac{1}{2}\big[g_n^{(f)}(\mathbf{y}_n^{(f)}(t-1))\big]^2 + \frac{1}{2}\big[g_n^{(f)}(\mathbf{y}_n^{(f)}(t))\big]^2 \\
&\qquad + \Big(\alpha_n - \frac{1}{2}\beta_n^2\Big)\|\mathbf{y}_n^{(f)}(t) - \mathbf{y}_n^{(f)}(t-1)\|^2\Big] \\
&\overset{(b)}{\geq} f(\mathbf{y}^*) + \Phi(t) - \Phi(t-1) + \sum_{f\in\mathcal{F},\,n\in\mathcal{N}\setminus \text{Dst}(f)}\Big[Q_n^{(f)}(t)g_n^{(f)}(\mathbf{y}_n^{(f)}(t)) + \frac{1}{2}\big[g_n^{(f)}(\mathbf{y}_n^{(f)}(t))\big]^2\Big]
\end{aligned}
\tag{4.40}
$$
where (a) follows from Fact 4.2, i.e., each $g_n^{(f)}(\cdot)$ is Lipschitz continuous with modulus $\beta_n$; and (b) follows because $\alpha_n\geq\frac{1}{2}(d_n+1)$, $\beta_n\leq\sqrt{d_n+1}$ and $\frac{1}{2}\big[g_n^{(f)}(\mathbf{y}_n^{(f)}(t-1))\big]^2\geq 0$. Substituting (4.28) into (4.40) yields
$$
f(\mathbf{y}(t)) \geq f(\mathbf{y}^*) + \Phi(t) - \Phi(t-1) + \Delta(t).
$$

Chapter 5
Online Convex Optimization with Stochastic Constraints

Online convex optimization (OCO) is a multi-round learning process with arbitrarily-varying convex loss functions where the decision maker has to choose decision $\mathbf{x}(t)\in\mathcal{X}$ before observing the corresponding convex loss function $f^t(\cdot)$. For a fixed time horizon $T$, define the regret of a learning algorithm with respect to the best fixed decision in hindsight (with full knowledge of all loss functions) as
$$
\text{regret}(T) = \sum_{t=1}^T f^t(\mathbf{x}(t)) - \min_{\mathbf{x}\in\mathcal{X}}\sum_{t=1}^T f^t(\mathbf{x}).
$$
The best fixed decision $\mathbf{x}^* = \text{argmin}_{\mathbf{x}\in\mathcal{X}}\sum_{t=1}^T f^t(\mathbf{x})$ typically cannot be implemented. That is because it would need to be determined before the start of the first round, and this would require knowledge of the future $f^t(\cdot)$ functions for all $t\in\{1,2,\ldots,T\}$. However, to avoid being embarrassed by the situation where our performance is significantly exceeded by a stubborn decision maker guessing $\mathbf{x}^*$ correctly by luck, a desired learning algorithm should have a small regret.
Specifically, we desire a learning algorithm for which $\text{regret}(T)$ grows sub-linearly with respect to $T$, i.e., the difference of average loss tends to zero as $T$ goes to infinity when comparing the dynamic learning algorithm and a lucky stubborn decision maker.

The setting of OCO is introduced in a series of works [CBLW96, KW97, Gor99, Zin03] and is formalized in [Zin03]. OCO has gained a considerable amount of research interest recently with various applications such as online regression, prediction with expert advice, online ranking, online shortest paths, and portfolio selection. See [SS11, Haz16] for more applications and background.

In [Zin03], Zinkevich shows that $O(\sqrt{T})$ regret can be achieved by using an online gradient descent (OGD) update given by
$$
\mathbf{x}(t+1) = \mathcal{P}_{\mathcal{X}}\big[\mathbf{x}(t) - \gamma\nabla f^t(\mathbf{x}(t))\big] \tag{5.1}
$$
where $\gamma$ is the step size, also known as the learning rate, $\nabla f^t(\cdot)$ is a subgradient of $f^t(\cdot)$ and $\mathcal{P}_{\mathcal{X}}[\cdot]$ is the projection onto set $\mathcal{X}$. Hazan et al. in [HAK07] show that better regret is possible under the assumption that each loss function is strongly convex but $O(\sqrt{T})$ is the best possible if no additional assumption is imposed.

Zinkevich's OGD in (5.1) requires full knowledge of the set $\mathcal{X}$ and a low-complexity projection $\mathcal{P}_{\mathcal{X}}[\cdot]$. However, in practice, the constraint set $\mathcal{X}$, which is often described by many functional inequality constraints, can be time varying and may not be fully disclosed to the decision maker. In [MTY09], Mannor et al. extend OCO by considering time-varying constraint functions $g^t(\mathbf{x})$ which can arbitrarily vary and are only disclosed to us after each $\mathbf{x}(t)$ is chosen. In this setting, Mannor et al. in [MTY09] explore the possibility of designing learning algorithms such that regret grows sub-linearly and $\limsup_{T\to\infty}\frac{1}{T}\sum_{t=1}^T g^t(\mathbf{x}(t))\leq 0$, i.e., the (cumulative) constraint violation $\sum_{t=1}^T g^t(\mathbf{x}(t))$ also grows sub-linearly. Unfortunately, Mannor et al.
in [MTY09] prove that this is impossible even when both $f^t(\cdot)$ and $g^t(\cdot)$ are simple linear functions.

Given the impossibility results shown by Mannor et al. in [MTY09], this chapter considers OCO where constraint functions $g^t(\mathbf{x})$ are not arbitrarily varying but independently and identically distributed (i.i.d.) generated from an unknown probability model (and functions $f^t(\mathbf{x})$ are still arbitrarily varying and possibly non-i.i.d.). Specifically, this chapter considers online convex optimization (OCO) with stochastic constraint $\mathcal{X} = \{\mathbf{x}\in\mathcal{X}_0 : \mathbb{E}_\omega[g_k(\mathbf{x};\omega)]\leq 0, k\in\{1,2,\ldots,m\}\}$ where $\mathcal{X}_0$ is a known fixed set; the expressions of stochastic constraints $\mathbb{E}_\omega[g_k(\mathbf{x};\omega)]$ (involving expectations with respect to $\omega$ from an unknown distribution) are unknown; and subscripts $k\in\{1,2,\ldots,m\}$ indicate the possibility of multiple functional constraints. In OCO with stochastic constraints, the decision maker receives loss function $f^t(\mathbf{x})$ and i.i.d. constraint function realizations $g_k^t(\mathbf{x})\triangleq g_k(\mathbf{x};\omega(t))$ at each round $t$. However, the expressions of $g_k^t(\cdot)$ and $f^t(\cdot)$ are disclosed to the decision maker only after decision $\mathbf{x}(t)\in\mathcal{X}_0$ is chosen. This setting arises naturally when decisions are restricted by stochastic environments or deterministic environments with noisy observations. For example, if we consider online routing (with link capacity constraints) in wireless networks [MTY09], each link capacity is not a fixed constant (as in wireline networks) but an i.i.d. random variable since wireless channels are stochastically time-varying by nature [TV05]. OCO with stochastic constraints also covers important special cases such as OCO with long term constraints [MJY12, CGP15, JHA16], stochastic constrained convex optimization [MYJ13] and deterministic constrained convex optimization [Nes04].
Let $\mathbf{x}^* = \text{argmin}_{\{\mathbf{x}\in\mathcal{X}_0:\,\mathbb{E}[g_k(\mathbf{x};\omega)]\leq 0,\forall k\in\{1,2,\ldots,m\}\}}\sum_{t=1}^T f^t(\mathbf{x})$ be the best fixed decision in hindsight (knowing all loss functions $f^t(\mathbf{x})$ and the distribution of stochastic constraint functions $g_k(\mathbf{x};\omega)$). Thus, $\mathbf{x}^*$ minimizes the $T$-round cumulative loss and satisfies all stochastic constraints in expectation, which also implies $\limsup_{T\to\infty}\frac{1}{T}\sum_{t=1}^T g_k^t(\mathbf{x}^*)\leq 0$ almost surely by the strong law of large numbers. Our goal is to develop dynamic learning algorithms that guarantee both regret $\sum_{t=1}^T f^t(\mathbf{x}(t)) - \sum_{t=1}^T f^t(\mathbf{x}^*)$ and constraint violations $\sum_{t=1}^T g_k^t(\mathbf{x}(t))$ grow sub-linearly.

Note that Zinkevich's algorithm in (5.1) is not applicable to OCO with stochastic constraints since $\mathcal{X}$ is unknown and it can happen that $\mathcal{X}(t) = \{\mathbf{x}\in\mathcal{X}_0 : g_k(\mathbf{x};\omega(t))\leq 0,\forall k\in\{1,2,\ldots,m\}\} = \emptyset$ for certain realizations $\omega(t)$, so that projections $\mathcal{P}_{\mathcal{X}}[\cdot]$ or $\mathcal{P}_{\mathcal{X}(t)}[\cdot]$ required in (5.1) are not even well-defined.

Our Contributions

This chapter solves online convex optimization with stochastic constraints. In particular, we propose a new learning algorithm that is proven to achieve $O(\sqrt{T})$ expected regret and constraint violations and $O(\sqrt{T}\log(T))$ high probability regret and constraint violations. The results in this chapter are originally developed in our paper [YNW17]. The proposed new algorithm also improves upon state-of-the-art results in the following special cases:
• OCO with long term constraints: This is a special case where each $g_k^t(\mathbf{x})\equiv g_k(\mathbf{x})$ is known and does not depend on time. Note that $\mathcal{X} = \{\mathbf{x}\in\mathcal{X}_0 : g_k(\mathbf{x})\leq 0,\forall k\in\{1,2,\ldots,m\}\}$ can be complicated while $\mathcal{X}_0$ might be a simple hypercube. To avoid high complexity involved in the projection onto $\mathcal{X}$ as in Zinkevich's algorithm, work in [MJY12, CGP15, JHA16] develops low complexity algorithms that use projections onto a simpler set $\mathcal{X}_0$ by allowing $g_k(\mathbf{x}(t)) > 0$ for certain rounds but ensuring $\limsup_{T\to\infty}\frac{1}{T}\sum_{t=1}^T g_k(\mathbf{x}(t))\leq 0$.
The best existing performance is $O(T^{\max\{\beta,1-\beta\}})$ regret and $O(T^{1-\beta/2})$ constraint violations where $\beta\in(0,1)$ is an algorithm parameter [JHA16]. This gives $O(\sqrt{T})$ regret with worse $O(T^{3/4})$ constraint violations or $O(\sqrt{T})$ constraint violations with worse $O(T)$ regret. In contrast, our algorithm, which only uses projections onto $\mathcal{X}_0$ as shown in Lemma 5.1, can achieve $O(\sqrt{T})$ regret and $O(\sqrt{T})$ constraint violations simultaneously. In Chapter 6, we focus on OCO with long term constraints and further develop a different algorithm that can only solve "OCO with long term constraints" but can achieve $O(\sqrt{T})$ regret and $O(1)$ constraint violations.
• Stochastic constrained convex optimization: This is a special case where each $f^t(\mathbf{x})$ is i.i.d. generated from an unknown distribution. This problem has many applications in operations research and machine learning such as Neyman-Pearson classification and risk-mean portfolio. The work [MYJ13] develops a (batch) offline algorithm that produces a solution with high probability performance guarantees only after sampling the problems sufficiently many times. That is, during the process of sampling, there is no performance guarantee. The work [LZ16] proposes a stochastic approximation based (batch) offline algorithm for stochastic convex optimization with one single stochastic functional inequality constraint. In contrast, our algorithm is an online algorithm with online performance guarantees and can deal with an arbitrary number of stochastic constraints.
• Deterministic constrained convex optimization: This is a special case where each $f^t(\mathbf{x})\equiv f(\mathbf{x})$ and $g_k^t(\mathbf{x})\equiv g_k(\mathbf{x})$ are known and do not depend on time. In this case, the goal is to develop a fast algorithm that converges to a good solution (with a small error) within a small number of iterations; and our algorithm with $O(\sqrt{T})$ regret and constraint violations is equivalent to an iterative numerical algorithm with an $O(1/\sqrt{T})$ convergence rate.
Our algorithm is subgradient based and does not require the smoothness or differentiability of the convex program. Indeed, our algorithm when used to solve general (possibly non-smooth) deterministic constrained convex programs is a third new Lagrangian method developed in this thesis. Recall that in Chapter 3, we have developed two other Lagrangian methods that can only solve deterministic constrained convex optimization but can achieve a faster $O(1/T)$ convergence rate. The algorithm developed in this chapter is a primal-dual type one since its primal update follows a projected gradient dynamic, has an $O(1/\sqrt{T})$ convergence rate, and does not require any knowledge of the optimal Lagrange multiplier vector. Recall that the primal-dual subgradient method Algorithm 1.2 has the same $O(1/\sqrt{T})$ convergence rate but requires an upper bound of optimal Lagrange multipliers, which is usually unknown in practice.

5.1 Problem Statement and New Algorithm

Let $\mathcal{X}_0$ be a known fixed compact convex set. Let $f^t(\mathbf{x})$ be a sequence of arbitrarily-varying convex functions. Let $g_k(\mathbf{x};\omega(t)),k\in\{1,2,\ldots,m\}$ be sequences of functions that are i.i.d. realizations of stochastic constraint functions $\tilde{g}_k(\mathbf{x})\triangleq\mathbb{E}_\omega[g_k(\mathbf{x};\omega)]$ with random variable $\omega\in\Omega$ from an unknown distribution. That is, $\omega(t)$ are i.i.d. samples of $\omega$. Assume that each $f^t(\cdot)$ is independent of all $\omega(\tau)$ with $\tau\geq t+1$ so that we are unable to predict future constraint functions based on the knowledge of the current loss function. For each $\omega\in\Omega$, we assume $g_k(\mathbf{x};\omega)$ are convex with respect to $\mathbf{x}\in\mathcal{X}_0$. At the beginning of each round $t$, neither the loss function $f^t(\mathbf{x})$ nor the constraint function realizations $g_k(\mathbf{x};\omega(t))$ are known to the decision maker. However, the decision maker still needs to make a decision $\mathbf{x}(t)\in\mathcal{X}_0$ for round $t$; and after that $f^t(\mathbf{x})$ and $g_k(\mathbf{x};\omega(t))$ are disclosed to the decision maker at the end of round $t$.
For convenience, we often suppress the dependence of each $g_k(x;\omega(t))$ on $\omega(t)$ and write $g_k^t(x) = g_k(x;\omega(t))$. Recall $\tilde g_k(x) = \mathbb{E}_\omega[g_k(x;\omega)]$, where the expectation is with respect to $\omega$. Define $\mathcal{X} = \{x\in\mathcal{X}_0 : \tilde g_k(x) = \mathbb{E}[g_k(x;\omega)]\le 0, \forall k\in\{1,2,\ldots,m\}\}$. We further define the stacked vector of multiple functions $g_1^t(x),\ldots,g_m^t(x)$ as $g^t(x) = [g_1^t(x),\ldots,g_m^t(x)]^T$ and define $\tilde g(x) = [\mathbb{E}_\omega[g_1(x;\omega)],\ldots,\mathbb{E}_\omega[g_m(x;\omega)]]^T$. We use $\|\cdot\|$ to denote the Euclidean norm of a vector. Throughout this chapter, we have the following assumptions:

Assumption 5.1 (Basic Assumptions).
• Loss functions $f^t(x)$ and constraint functions $g_k(x;\omega)$ have bounded subgradients on $\mathcal{X}_0$. That is, there exist $D_1>0$ and $D_2>0$ such that $\|\nabla f^t(x)\|\le D_1$ for all $x\in\mathcal{X}_0$ and all $t\in\{0,1,\ldots\}$, and $\|\nabla g_k(x;\omega)\|\le D_2$ for all $x\in\mathcal{X}_0$, all $\omega\in\Omega$ and all $k\in\{1,2,\ldots,m\}$.
• There exists a constant $G>0$ such that $\|g(x;\omega)\|\le G$ for all $x\in\mathcal{X}_0$ and all $\omega\in\Omega$.
• There exists a constant $R>0$ such that $\|x-y\|\le R$ for all $x,y\in\mathcal{X}_0$.

Assumption 5.2 (Interior Point Assumption). There exist $\epsilon>0$ and $\hat x\in\mathcal{X}_0$ such that $\tilde g_k(\hat x) = \mathbb{E}_\omega[g_k(\hat x;\omega)]\le -\epsilon$ for all $k\in\{1,2,\ldots,m\}$.

5.1.1 New Algorithm

Now consider the following algorithm described in Algorithm 5.1. This algorithm chooses $x(t+1)$ as the decision for round $t+1$ based on $f^t(\cdot)$ and $g^t(\cdot)$, without requiring $f^{t+1}(\cdot)$ or $g^{t+1}(\cdot)$.

Algorithm 5.1 New Algorithm for Online Convex Optimization with Stochastic Constraints

Let $V>0$ and $\alpha>0$ be constant algorithm parameters. Choose $x(1)\in\mathcal{X}_0$ arbitrarily and let $Q_k(1)=0, \forall k\in\{1,2,\ldots,m\}$. At the end of each round $t\in\{1,2,\ldots\}$, observe $f^t(\cdot)$ and $g^t(\cdot)$ and do the following:

• Choose $x(t+1)$ that solves
\[
\min_{x\in\mathcal{X}_0}\Big\{ V[\nabla f^t(x(t))]^T[x-x(t)] + \sum_{k=1}^m Q_k(t)[\nabla g_k^t(x(t))]^T[x-x(t)] + \alpha\|x-x(t)\|^2 \Big\} \tag{5.2}
\]
as the decision for the next round $t+1$, where $\nabla f^t(x(t))$ is a subgradient of $f^t(x)$ at point $x=x(t)$ and $\nabla g_k^t(x(t))$ is a subgradient of $g_k^t(x)$ at point $x=x(t)$.
• Update each virtual queue $Q_k(t+1), \forall k\in\{1,2,\ldots,m\}$ via
\[
Q_k(t+1) = \max\big\{ Q_k(t) + g_k^t(x(t)) + [\nabla g_k^t(x(t))]^T[x(t+1)-x(t)],\ 0 \big\}. \tag{5.3}
\]

The next lemma shows that the $x(t+1)$ update in (5.2) can be implemented via a simple projection onto $\mathcal{X}_0$.

Lemma 5.1. The $x(t+1)$ update in (5.2) is given by
\[
x(t+1) = \mathcal{P}_{\mathcal{X}_0}\Big[ x(t) - \frac{1}{2\alpha} d(t) \Big],
\]
where $d(t) = V\nabla f^t(x(t)) + \sum_{k=1}^m Q_k(t)\nabla g_k^t(x(t))$ and $\mathcal{P}_{\mathcal{X}_0}[\cdot]$ is the projection onto the convex set $\mathcal{X}_0$.

Proof. The projection by definition solves $\min_{x\in\mathcal{X}_0}\|x - [x(t) - \frac{1}{2\alpha}d(t)]\|^2$. Multiplying the expression to be minimized by $\alpha>0$ does not change the minimizer, and expanding the square gives $\alpha\|x-x(t)\|^2 + [d(t)]^T[x-x(t)] + \frac{1}{4\alpha}\|d(t)\|^2$, which differs from the objective in (5.2) only by a constant. Thus, the two problems are equivalent.

5.1.2 Intuitions of Algorithm 5.1

Note that if there are no stochastic constraints $g_k^t(x)$, i.e., $\mathcal{X}=\mathcal{X}_0$, then Algorithm 5.1 has $Q_k(t)\equiv 0, \forall t$ and becomes Zinkevich's algorithm with $\gamma = \frac{V}{2\alpha}$ in (5.1), since
\[
x(t+1) \overset{(a)}{=} \operatorname*{argmin}_{x\in\mathcal{X}_0} \underbrace{V[\nabla f^t(x(t))]^T[x-x(t)] + \alpha\|x-x(t)\|^2}_{\text{penalty}} \overset{(b)}{=} \mathcal{P}_{\mathcal{X}_0}\Big[ x(t) - \frac{V}{2\alpha}\nabla f^t(x(t)) \Big], \tag{5.4}
\]
where (a) follows from (5.2); and (b) follows from Lemma 5.1 by noting that $d(t) = V\nabla f^t(x(t))$. Call the term marked by an underbrace in (5.4) the penalty. Thus, Zinkevich's algorithm minimizes the penalty term and is a special case of Algorithm 5.1 used to solve OCO over $\mathcal{X}_0$.

Let $Q(t) = [Q_1(t),\ldots,Q_m(t)]^T$ be the vector of virtual queue backlogs. Let $L(t) = \frac12\|Q(t)\|^2$ be a Lyapunov function and define the Lyapunov drift
\[
\Delta(t) = L(t+1) - L(t) = \frac12\big[\|Q(t+1)\|^2 - \|Q(t)\|^2\big]. \tag{5.5}
\]
The intuition behind Algorithm 5.1 is to choose $x(t+1)$ to minimize an upper bound of the expression
\[
\underbrace{\Delta(t)}_{\text{drift}} + \underbrace{V[\nabla f^t(x(t))]^T[x-x(t)] + \alpha\|x-x(t)\|^2}_{\text{penalty}}. \tag{5.6}
\]
The intention to minimize the penalty is natural since Zinkevich's algorithm (for OCO without stochastic constraints) minimizes the penalty, while the intention to minimize the drift is motivated by observing that $g_k^t(x(t))$ is accumulated into queue $Q_k(t+1)$ introduced in (5.3), so that we intend to have small queue backlogs.
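As a rough illustration of the two updates, the following minimal Python sketch implements one round of Algorithm 5.1 through the projection form of Lemma 5.1 and the queue update (5.3). It assumes, purely for illustration, that $\mathcal{X}_0$ is a box $[lo, hi]^n$, so that the projection is a coordinate-wise clip. The names `project_box` and `algorithm_5_1_step` and the argument layout are hypothetical and do not come from the thesis.

```python
def project_box(x, lo, hi):
    """Euclidean projection onto the box [lo, hi]^n (an assumed choice of X0)."""
    return [min(max(xi, lo), hi) for xi in x]


def algorithm_5_1_step(x, Q, grad_f, g_vals, g_grads, V, alpha, lo, hi):
    """One round of Algorithm 5.1 on a box-shaped X0 (illustrative sketch).

    x       : current decision x(t), list of n floats
    Q       : virtual queues Q_k(t), list of m floats
    grad_f  : a subgradient of f^t at x(t), list of n floats
    g_vals  : constraint values g_k^t(x(t)), list of m floats
    g_grads : subgradients of g_k^t at x(t), list of m lists of n floats
    Returns (x(t+1), Q(t+1)).
    """
    n, m = len(x), len(Q)
    # d(t) = V * grad f^t(x(t)) + sum_k Q_k(t) * grad g_k^t(x(t))
    d = [V * grad_f[i] + sum(Q[k] * g_grads[k][i] for k in range(m))
         for i in range(n)]
    # Lemma 5.1: x(t+1) = P_X0[x(t) - d(t) / (2 * alpha)]
    x_next = project_box([x[i] - d[i] / (2.0 * alpha) for i in range(n)], lo, hi)
    # Update (5.3): queues use the linearization of g_k^t at x(t), evaluated at x(t+1)
    Q_next = [
        max(Q[k] + g_vals[k]
            + sum(g_grads[k][i] * (x_next[i] - x[i]) for i in range(n)), 0.0)
        for k in range(m)
    ]
    return x_next, Q_next
```

In a full run, the caller would evaluate the subgradients and constraint values of the newly disclosed $f^t$ and $g^t$ at $x(t)$ at the end of each round and pass them to this step.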
The drift $\Delta(t)$ can be complicated and is in general non-convex. The next lemma provides a simple upper bound on $\Delta(t)$ and follows directly from (5.3).

Lemma 5.2. At each round $t\in\{1,2,\ldots\}$, Algorithm 5.1 guarantees
\[
\Delta(t) \le \sum_{k=1}^m Q_k(t)\big[ g_k^t(x(t)) + [\nabla g_k^t(x(t))]^T[x(t+1)-x(t)] \big] + \frac12 (G+\sqrt{m}D_2R)^2, \tag{5.7}
\]
where $m$ is the number of constraint functions; and $D_2$, $G$ and $R$ are constants defined in Assumption 5.1.

Proof. Recall that for any $b\in\mathbb{R}$, if $a=\max\{b,0\}$ then $a^2\le b^2$. Fix $k\in\{1,2,\ldots,m\}$. The virtual queue update equation $Q_k(t+1) = \max\{Q_k(t) + g_k^t(x(t)) + [\nabla g_k^t(x(t))]^T[x(t+1)-x(t)], 0\}$ implies that
\[
\begin{aligned}
\frac12[Q_k(t+1)]^2 &\le \frac12\big[ Q_k(t) + g_k^t(x(t)) + [\nabla g_k^t(x(t))]^T[x(t+1)-x(t)] \big]^2 \\
&= \frac12[Q_k(t)]^2 + Q_k(t)\big[ g_k^t(x(t)) + [\nabla g_k^t(x(t))]^T[x(t+1)-x(t)] \big] + \frac12\big[ g_k^t(x(t)) + [\nabla g_k^t(x(t))]^T[x(t+1)-x(t)] \big]^2 \\
&\overset{(a)}{=} \frac12[Q_k(t)]^2 + Q_k(t)\big[ g_k^t(x(t)) + [\nabla g_k^t(x(t))]^T[x(t+1)-x(t)] \big] + \frac12 [h_k]^2,
\end{aligned} \tag{5.8}
\]
where (a) follows by defining $h_k = g_k^t(x(t)) + [\nabla g_k^t(x(t))]^T[x(t+1)-x(t)]$.

Define $s = [s_1,\ldots,s_m]^T$, where $s_k = [\nabla g_k^t(x(t))]^T[x(t+1)-x(t)], \forall k\in\{1,2,\ldots,m\}$; and $h = [h_1,\ldots,h_m]^T = g^t(x(t)) + s$. Then,
\[
\|h\| \overset{(a)}{\le} \|g^t(x(t))\| + \|s\| \overset{(b)}{\le} G + \sqrt{\sum_{k=1}^m D_2^2R^2} = G + \sqrt{m}D_2R, \tag{5.9}
\]
where (a) follows from the triangle inequality; and (b) follows from the definition of the Euclidean norm, the Cauchy-Schwarz inequality and Assumption 5.1.

Summing (5.8) over $k\in\{1,2,\ldots,m\}$ yields
\[
\begin{aligned}
\frac12\|Q(t+1)\|^2 &\le \frac12\|Q(t)\|^2 + \sum_{k=1}^m Q_k(t)\big[ g_k^t(x(t)) + [\nabla g_k^t(x(t))]^T[x(t+1)-x(t)] \big] + \frac12\|h\|^2 \\
&\overset{(a)}{\le} \frac12\|Q(t)\|^2 + \sum_{k=1}^m Q_k(t)\big[ g_k^t(x(t)) + [\nabla g_k^t(x(t))]^T[x(t+1)-x(t)] \big] + \frac12(G+\sqrt{m}D_2R)^2,
\end{aligned}
\]
where (a) follows from (5.9). Rearranging the terms yields the desired result.

Note that at the end of round $t$, $\sum_{k=1}^m Q_k(t)g_k^t(x(t)) + \frac12(G+\sqrt{m}D_2R)^2$ is a given constant that is not affected by decision $x(t+1)$.
The algorithm decision in (5.2) is now transparent: $x(t+1)$ is chosen to minimize the drift-plus-penalty expression (5.6), where $\Delta(t)$ is approximated by the bound in (5.7).

5.1.3 Preliminary Analysis and More Intuitions of Algorithm 5.1

The next lemma relates constraint violations and virtual queue values and follows directly from (5.3).

Lemma 5.3. For any $T\ge 1$, Algorithm 5.1 guarantees
\[
\sum_{t=1}^T g_k^t(x(t)) \le \|Q(T+1)\| + D_2\sum_{t=1}^T\|x(t+1)-x(t)\|, \quad \forall k\in\{1,2,\ldots,m\},
\]
where $D_2$ is the constant defined in Assumption 5.1.

Proof. Fix $k\in\{1,2,\ldots,m\}$ and $T\ge 1$. For any $t\in\{1,2,\ldots\}$, (5.3) in Algorithm 5.1 gives
\[
\begin{aligned}
Q_k(t+1) &= \max\{ Q_k(t) + g_k^t(x(t)) + [\nabla g_k^t(x(t))]^T[x(t+1)-x(t)],\ 0 \} \\
&\ge Q_k(t) + g_k^t(x(t)) + [\nabla g_k^t(x(t))]^T[x(t+1)-x(t)] \\
&\overset{(a)}{\ge} Q_k(t) + g_k^t(x(t)) - \|\nabla g_k^t(x(t))\|\,\|x(t+1)-x(t)\| \\
&\overset{(b)}{\ge} Q_k(t) + g_k^t(x(t)) - D_2\|x(t+1)-x(t)\|,
\end{aligned}
\]
where (a) follows from the Cauchy-Schwarz inequality and (b) follows from Assumption 5.1. Rearranging terms yields
\[
g_k^t(x(t)) \le Q_k(t+1) - Q_k(t) + D_2\|x(t+1)-x(t)\|.
\]
Summing over $t\in\{1,\ldots,T\}$ yields
\[
\begin{aligned}
\sum_{t=1}^T g_k^t(x(t)) &\le Q_k(T+1) - Q_k(1) + D_2\sum_{t=1}^T\|x(t+1)-x(t)\| \\
&\overset{(a)}{=} Q_k(T+1) + D_2\sum_{t=1}^T\|x(t+1)-x(t)\| \\
&\le \|Q(T+1)\| + D_2\sum_{t=1}^T\|x(t+1)-x(t)\|,
\end{aligned}
\]
where (a) follows from the fact that $Q_k(1)=0$.

Note that the expression involved in minimization (5.2) in Algorithm 5.1 is strongly convex with respect to $x$ with modulus $2\alpha$ and $x(t+1)$ is chosen to minimize it. Thus, the next lemma follows from Corollary 1.2.

Lemma 5.4. Let $z\in\mathcal{X}_0$ be arbitrary. For all $t\ge 1$, Algorithm 5.1 guarantees
\[
\begin{aligned}
& V[\nabla f^t(x(t))]^T[x(t+1)-x(t)] + \sum_{k=1}^m Q_k(t)[\nabla g_k^t(x(t))]^T[x(t+1)-x(t)] + \alpha\|x(t+1)-x(t)\|^2 \\
&\le V[\nabla f^t(x(t))]^T[z-x(t)] + \sum_{k=1}^m Q_k(t)[\nabla g_k^t(x(t))]^T[z-x(t)] + \alpha\|z-x(t)\|^2 - \alpha\|z-x(t+1)\|^2.
\end{aligned}
\]

The next corollary follows by taking $z=x(t)$ in Lemma 5.4.

Corollary 5.1. For all $t\ge 1$, Algorithm 5.1 guarantees
\[
\|x(t+1)-x(t)\| \le \frac{VD_1}{2\alpha} + \frac{\sqrt{m}D_2}{2\alpha}\|Q(t)\|.
\]

Proof. Fix $t\ge 1$. Note that $x(t)\in\mathcal{X}_0$.
Taking $z=x(t)$ in Lemma 5.4 yields
\[
V[\nabla f^t(x(t))]^T[x(t+1)-x(t)] + \sum_{k=1}^m Q_k(t)[\nabla g_k^t(x(t))]^T[x(t+1)-x(t)] + \alpha\|x(t+1)-x(t)\|^2 \le -\alpha\|x(t)-x(t+1)\|^2.
\]
Rearranging terms and cancelling common terms yields
\[
\begin{aligned}
2\alpha\|x(t+1)-x(t)\|^2 &\le -V[\nabla f^t(x(t))]^T[x(t+1)-x(t)] - \sum_{k=1}^m Q_k(t)[\nabla g_k^t(x(t))]^T[x(t+1)-x(t)] \\
&\overset{(a)}{\le} V\|\nabla f^t(x(t))\|\,\|x(t+1)-x(t)\| + \|Q(t)\|\sqrt{\sum_{k=1}^m \|\nabla g_k^t(x(t))\|^2\|x(t+1)-x(t)\|^2} \\
&\overset{(b)}{\le} VD_1\|x(t+1)-x(t)\| + \sqrt{m}D_2\|Q(t)\|\,\|x(t+1)-x(t)\|,
\end{aligned}
\]
where (a) follows by the Cauchy-Schwarz inequality (note that the second term on the right side applies the Cauchy-Schwarz inequality twice); and (b) follows from Assumption 5.1. If $\|x(t+1)-x(t)\|=0$ the claim is trivial; otherwise, dividing both sides by $2\alpha\|x(t+1)-x(t)\|$ gives
\[
\|x(t+1)-x(t)\| \le \frac{VD_1}{2\alpha} + \frac{\sqrt{m}D_2}{2\alpha}\|Q(t)\|.
\]

The next corollary follows directly from Lemma 5.3 and Corollary 5.1 and shows that constraint violations are ultimately bounded by the sequence $\|Q(t)\|, t\in\{1,2,\ldots,T+1\}$.

Corollary 5.2. For any $T\ge 1$, Algorithm 5.1 guarantees
\[
\sum_{t=1}^T g_k^t(x(t)) \le \|Q(T+1)\| + \frac{VTD_1D_2}{2\alpha} + \frac{\sqrt{m}D_2^2}{2\alpha}\sum_{t=1}^T\|Q(t)\|, \quad \forall k\in\{1,2,\ldots,m\},
\]
where $D_1$ and $D_2$ are constants defined in Assumption 5.1.

This corollary further justifies why Algorithm 5.1 intends to minimize the drift $\Delta(t)$. As illustrated in the next section, controlled drift can often lead to boundedness of a stochastic process. Thus, the intuition of minimizing the drift $\Delta(t)$ is to yield small $\|Q(t)\|$ bounds.

5.2 Expected Performance Analysis of Algorithm 5.1

This section shows that if we choose $V=\sqrt{T}$ and $\alpha=T$ in Algorithm 5.1, then both expected regret and expected constraint violations are $O(\sqrt{T})$.

5.2.1 A Drift Lemma for Stochastic Processes

Let $\{Z(t), t\ge 0\}$ be a discrete time stochastic process adapted$^1$ to a filtration $\{\mathcal{F}(t), t\ge 0\}$. For example, $Z(t)$ can be a random walk, a Markov chain or a martingale. Drift analysis is the method of deducing properties, e.g., recurrence, ergodicity, or boundedness, about $Z(t)$ from its drift $\mathbb{E}[Z(t+1)-Z(t)\,|\,\mathcal{F}(t)]$.
See [Doo53, Haj82] for more discussions or applications on drift analysis. This chapter proposes a new drift analysis lemma for stochastic processes as follows:

Lemma 5.5. Let $\{Z(t), t\ge 0\}$ be a discrete time stochastic process adapted to a filtration $\{\mathcal{F}(t), t\ge 0\}$ with $Z(0)=0$ and $\mathcal{F}(0)=\{\emptyset,\Omega\}$. Suppose there exist an integer $t_0>0$ and real constants $\theta>0$, $\delta_{\max}>0$ and $0<\zeta\le\delta_{\max}$ such that
\[
|Z(t+1)-Z(t)| \le \delta_{\max}, \tag{5.10}
\]
\[
\mathbb{E}[Z(t+t_0)-Z(t)\,|\,\mathcal{F}(t)] \le
\begin{cases}
t_0\delta_{\max}, & \text{if } Z(t)<\theta, \\
-t_0\zeta, & \text{if } Z(t)\ge\theta,
\end{cases} \tag{5.11}
\]
hold for all $t\in\{1,2,\ldots\}$. Then, the following holds:

1. $\mathbb{E}[Z(t)] \le \theta + t_0\delta_{\max} + t_0\frac{4\delta_{\max}^2}{\zeta}\log\big[\frac{8\delta_{\max}^2}{\zeta^2}\big], \ \forall t\in\{1,2,\ldots\}$.

2. For any constant $0<\mu<1$, we have $\Pr(Z(t)\ge z)\le\mu, \ \forall t\in\{1,2,\ldots\}$, where $z = \theta + t_0\delta_{\max} + t_0\frac{4\delta_{\max}^2}{\zeta}\log\big[\frac{8\delta_{\max}^2}{\zeta^2}\big] + t_0\frac{4\delta_{\max}^2}{\zeta}\log\big(\frac{1}{\mu}\big)$.

Proof. See Section 5.5.1.

$^1$ Random variable $Y$ is said to be adapted to $\sigma$-algebra $\mathcal{F}$ if $Y$ is $\mathcal{F}$-measurable. In this case, we often write $Y\in\mathcal{F}$. Similarly, random process $\{Z(t)\}$ is adapted to filtration $\{\mathcal{F}(t)\}$ if $Z(t)\in\mathcal{F}(t), \forall t$. See e.g. [Dur10].

The above lemma provides both expected and high probability bounds for stochastic processes based on a drift condition. It will be used to establish upper bounds of virtual queues $\|Q(t)\|$, which further leads to expected and high probability constraint performance bounds of our algorithm. For a given stochastic process $Z(t)$, it is possible to show that the drift condition (5.11) holds for multiple $t_0$ with different $\zeta$ and $\theta$. In fact, we will show in Lemma 5.7 that $\|Q(t)\|$ yielded by Algorithm 5.1 satisfies (5.11) for any integer $t_0>0$ by selecting $\zeta$ and $\theta$ according to $t_0$. One-step drift conditions, corresponding to the special case $t_0=1$ of Lemma 5.5, have been previously considered in [Haj82, Nee15]. However, Lemma 5.5 (with general $t_0>0$) allows us to choose the best $t_0$ in performance analysis such that sublinear regret and constraint violation bounds are possible.
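To make the two bounds of Lemma 5.5 concrete, the short Python helper below simply evaluates the closed-form expressions of parts 1 and 2 for given constants. It is only a numerical sketch: the function name `drift_lemma_bounds` and its argument names are hypothetical, and the helper does not check that a given process actually satisfies (5.10) and (5.11).

```python
import math


def drift_lemma_bounds(theta, t0, delta_max, zeta, mu):
    """Evaluate the bounds of Lemma 5.5 (illustrative helper).

    Returns (expected_bound, z), where expected_bound is the right side of
    part 1 and z is the threshold of part 2 with Pr(Z(t) >= z) <= mu.
    """
    if not (0.0 < zeta <= delta_max):
        raise ValueError("Lemma 5.5 requires 0 < zeta <= delta_max")
    if not (0.0 < mu < 1.0):
        raise ValueError("Lemma 5.5 requires 0 < mu < 1")
    scale = t0 * 4.0 * delta_max ** 2 / zeta          # t0 * 4 * delta_max^2 / zeta
    log_term = scale * math.log(8.0 * delta_max ** 2 / zeta ** 2)
    expected_bound = theta + t0 * delta_max + log_term
    z = expected_bound + scale * math.log(1.0 / mu)   # extra term of part 2
    return expected_bound, z
```

For example, `drift_lemma_bounds(0.0, 1, 1.0, 1.0, 0.5)` evaluates both expressions for $\theta=0$, $t_0=1$, $\delta_{\max}=\zeta=1$ and $\mu=1/2$; shrinking $\mu$ increases the threshold $z$ only logarithmically.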
5.2.2 Expected Constraint Violation Analysis

Define filtration $\{\mathcal{W}(t), t\ge 0\}$ with $\mathcal{W}(0)=\{\emptyset,\Omega\}$ and $\mathcal{W}(t)=\sigma(\omega(1),\ldots,\omega(t))$ being the $\sigma$-algebra generated by random samples $\{\omega(1),\ldots,\omega(t)\}$ up to round $t$. From the update rule in Algorithm 5.1, we observe that $x(t+1)$ is a deterministic function of $f^t(\cdot)$, $g(\cdot;\omega(t))$ and $Q(t)$, where $Q(t)$ is further a deterministic function of $Q(t-1)$, $g(\cdot;\omega(t-1))$, $x(t)$ and $x(t-1)$. By induction, it is easy to show that $\sigma(x(t))\subseteq\mathcal{W}(t-1)$ and $\sigma(Q(t))\subseteq\mathcal{W}(t-1)$ for all $t\ge 1$, where $\sigma(Y)$ denotes the $\sigma$-algebra generated by random variable $Y$. For fixed $t\ge 1$, since $Q(t)$ is fully determined by $\omega(\tau), \tau\in\{1,2,\ldots,t-1\}$ and $\omega(t)$ are i.i.d., we know $g^t(x)$ is independent of $Q(t)$. This is formally summarized in the next lemma.

Lemma 5.6. If $x^*\in\mathcal{X}_0$ satisfies $\tilde g(x^*) = \mathbb{E}_\omega[g(x^*;\omega)]\le 0$, then Algorithm 5.1 guarantees:
\[
\mathbb{E}[Q_k(t)g_k^t(x^*)] \le 0, \quad \forall k\in\{1,2,\ldots,m\}, \forall t\ge 1. \tag{5.12}
\]

Proof. Fix $k\in\{1,2,\ldots,m\}$ and $t\ge 1$. Since $g_k^t(x^*) = g_k(x^*;\omega(t))$ is independent of $Q_k(t)$, which is determined by $\{\omega(1),\ldots,\omega(t-1)\}$, it follows that $\mathbb{E}[Q_k(t)g_k^t(x^*)] = \mathbb{E}[Q_k(t)]\,\mathbb{E}[g_k^t(x^*)] \overset{(a)}{\le} 0$, where (a) follows from the fact that $\mathbb{E}[g_k^t(x^*)]\le 0$ and $Q_k(t)\ge 0$.

To establish a bound on constraint violations, by Corollary 5.2, it suffices to derive upper bounds for $\|Q(t)\|$. In this subsection, we derive upper bounds for $\|Q(t)\|$ by applying the new drift lemma (Lemma 5.5) developed at the beginning of this section. The next lemma shows that random process $Z(t)=\|Q(t)\|$ satisfies the conditions in Lemma 5.5.

Lemma 5.7. Let $t_0>0$ be an arbitrary integer. At each round $t\in\{1,2,\ldots\}$ in Algorithm 5.1, the following holds:
\[
\big|\,\|Q(t+1)\| - \|Q(t)\|\,\big| \le G+\sqrt{m}D_2R,
\]
and
\[
\mathbb{E}\big[\|Q(t+t_0)\| - \|Q(t)\| \,\big|\, \mathcal{W}(t-1)\big] \le
\begin{cases}
t_0(G+\sqrt{m}D_2R), & \text{if } \|Q(t)\|<\theta, \\
-t_0\frac{\epsilon}{2}, & \text{if } \|Q(t)\|\ge\theta,
\end{cases}
\]
where $\theta = \frac{\epsilon}{2}t_0 + (G+\sqrt{m}D_2R)t_0 + \frac{2\alpha R^2}{t_0\epsilon} + \frac{2VD_1R + (G+\sqrt{m}D_2R)^2}{\epsilon}$; $m$ is the number of constraint functions; $D_1$, $D_2$, $G$ and $R$ are constants defined in Assumption 5.1; and $\epsilon$ is the constant defined in Assumption 5.2.
(Note that $\epsilon<G$ by the definition of $G$.)

Proof. See Section 5.5.2.

Lemma 5.7 allows us to apply Lemma 5.5 to random process $Z(t)=\|Q(t)\|$ and obtain $\mathbb{E}[\|Q(t)\|] = O(\sqrt{T}), \forall t$ by taking $t_0=\lceil\sqrt{T}\rceil$, $V=\sqrt{T}$ and $\alpha=T$, where $\lceil\sqrt{T}\rceil$ represents the smallest integer no less than $\sqrt{T}$. By Corollary 5.2, this further implies the expected constraint violation bound $\mathbb{E}[\sum_{t=1}^T g_k^t(x(t))]\le O(\sqrt{T})$, as summarized in the next theorem.

Theorem 5.1 (Expected Constraint Violation Bound). If $V=\sqrt{T}$ and $\alpha=T$ in Algorithm 5.1, then for all $T\ge 1$, we have
\[
\mathbb{E}\Big[\sum_{t=1}^T g_k^t(x(t))\Big] \le O(\sqrt{T}), \quad \forall k\in\{1,2,\ldots,m\}, \tag{5.13}
\]
where the expectation is taken with respect to all $\omega(t)$.

Proof. Define random process $Z(t)$ with $Z(0)=0$ and $Z(t)=\|Q(t)\|, t\ge 1$, and filtration $\mathcal{F}(t)$ with $\mathcal{F}(0)=\{\emptyset,\Omega\}$ and $\mathcal{F}(t)=\mathcal{W}(t-1), t\ge 1$. Note that $Z(t)$ is adapted to $\mathcal{F}(t)$. By Lemma 5.7, $Z(t)$ satisfies the conditions in Lemma 5.5 with $\delta_{\max}=G+\sqrt{m}D_2R$, $\zeta=\frac{\epsilon}{2}$ and $\theta = \frac{\epsilon}{2}t_0 + (G+\sqrt{m}D_2R)t_0 + \frac{2\alpha R^2}{t_0\epsilon} + \frac{2VD_1R+(G+\sqrt{m}D_2R)^2}{\epsilon}$. Thus, by part 1 of Lemma 5.5, for all $t\in\{1,2,\ldots\}$, we have
\[
\mathbb{E}[\|Q(t)\|] \le \frac{\epsilon}{2}t_0 + 2(G+\sqrt{m}D_2R)t_0 + \frac{2\alpha R^2}{t_0\epsilon} + \frac{2VD_1R+(G+\sqrt{m}D_2R)^2}{\epsilon} + t_0\frac{8(G+\sqrt{m}D_2R)^2}{\epsilon}\log\Big[\frac{32(G+\sqrt{m}D_2R)^2}{\epsilon^2}\Big].
\]
Taking $t_0=\lceil\sqrt{T}\rceil$, $V=\sqrt{T}$ and $\alpha=T$, we have $\mathbb{E}[\|Q(t)\|]\le O(\sqrt{T})$ for all $t\in\{1,2,\ldots\}$.

Fix $T\ge 1$. By Corollary 5.2 (with $V=\sqrt{T}$ and $\alpha=T$), we have
\[
\sum_{t=1}^T g_k^t(x(t)) \le \|Q(T+1)\| + \frac{\sqrt{T}D_1D_2}{2} + \frac{\sqrt{m}D_2^2}{2T}\sum_{t=1}^T\|Q(t)\|, \quad \forall k\in\{1,2,\ldots,m\}.
\]
Taking expectations on both sides and substituting $\mathbb{E}[\|Q(t)\|]=O(\sqrt{T}), \forall t$ into it yields $\mathbb{E}[\sum_{t=1}^T g_k^t(x(t))]\le O(\sqrt{T})$.

5.2.3 Expected Regret Analysis

The next lemma refines Lemma 5.4 and is useful to analyze the regret.

Lemma 5.8. Let $z\in\mathcal{X}_0$ be arbitrary. For all $T\ge 1$, Algorithm 5.1 guarantees
\[
\sum_{t=1}^T f^t(x(t)) \le \sum_{t=1}^T f^t(z) + \underbrace{\frac{\alpha}{V}R^2 + \frac{VD_1^2}{4\alpha}T + \frac12(G+\sqrt{m}D_2R)^2\frac{T}{V}}_{\text{(I)}} + \underbrace{\frac{1}{V}\sum_{t=1}^T\Big[\sum_{k=1}^m Q_k(t)g_k^t(z)\Big]}_{\text{(II)}}, \tag{5.14}
\]
where $m$ is the number of constraint functions; and $D_1$, $D_2$, $G$ and $R$ are constants defined in Assumption 5.1.

Proof. See Section 5.5.3.
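As a brief worked evaluation of term (I) in (5.14), substituting the parameter choice $V=\sqrt{T}$ and $\alpha=T$ and using only the constants of Assumption 5.1 gives the following; this is a direct substitution rather than a new result:

```latex
\[
\frac{\alpha}{V}R^2 + \frac{VD_1^2}{4\alpha}T + \frac12(G+\sqrt{m}D_2R)^2\frac{T}{V}
= \sqrt{T}R^2 + \frac{D_1^2}{4}\sqrt{T} + \frac12(G+\sqrt{m}D_2R)^2\sqrt{T}
= \Big[R^2 + \frac{D_1^2}{4} + \frac12(G+\sqrt{m}D_2R)^2\Big]\sqrt{T}.
\]
```

The three pieces scale identically in $T$, which is why this parameter choice balances the $\alpha/V$ and $V/\alpha$ trade-off.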
Note that if we take $V=\sqrt{T}$ and $\alpha=T$, then term (I) in (5.14) is $O(\sqrt{T})$. Recall that the expectation of term (II) in (5.14) with $z=x^*$ is non-positive by Lemma 5.6. The expected regret bound of Algorithm 5.1 follows by taking expectations on both sides of (5.14) and is summarized in the next theorem.

Theorem 5.2 (Expected Regret Bound). Let $x^*\in\mathcal{X}_0$ be any fixed solution that satisfies $\tilde g(x^*)\le 0$, e.g., $x^* = \operatorname*{argmin}_{x\in\mathcal{X}}\{\sum_{t=1}^T f^t(x)\}$. If $V=\sqrt{T}$ and $\alpha=T$ in Algorithm 5.1, then for all $T\ge 1$,
\[
\mathbb{E}\Big[\sum_{t=1}^T f^t(x(t))\Big] \le \mathbb{E}\Big[\sum_{t=1}^T f^t(x^*)\Big] + O(\sqrt{T}),
\]
where the expectation is taken with respect to all $\omega(t)$.

Proof. Fix $T\ge 1$. Taking $z=x^*$ in Lemma 5.8 yields
\[
\sum_{t=1}^T f^t(x(t)) \le \sum_{t=1}^T f^t(x^*) + \frac{\alpha}{V}R^2 + \frac{VD_1^2}{4\alpha}T + \frac12(G+\sqrt{m}D_2R)^2\frac{T}{V} + \frac{1}{V}\sum_{t=1}^T\Big[\sum_{k=1}^m Q_k(t)g_k^t(x^*)\Big].
\]
Taking expectations on both sides and using (5.12) yields
\[
\sum_{t=1}^T\mathbb{E}[f^t(x(t))] \le \sum_{t=1}^T\mathbb{E}[f^t(x^*)] + R^2\frac{\alpha}{V} + \frac{D_1^2}{4}\frac{V}{\alpha}T + \frac12(G+\sqrt{m}D_2R)^2\frac{T}{V}.
\]
Taking $V=\sqrt{T}$ and $\alpha=T$ yields $\sum_{t=1}^T\mathbb{E}[f^t(x(t))] \le \sum_{t=1}^T\mathbb{E}[f^t(x^*)] + O(\sqrt{T})$.

5.2.4 Special Case Performance Guarantees

Theorems 5.1 and 5.2 provide expected performance guarantees of Algorithm 5.1 for OCO with stochastic constraints. The results further imply the performance guarantees in the following special cases:

• OCO with long term constraints: In this case, $g_k(x;\omega(t))\equiv g_k(x)$ and there is no randomness. Thus, the expectations in Theorems 5.1 and 5.2 disappear. For this problem, Algorithm 5.1 can achieve $O(\sqrt{T})$ (deterministic) regret and $O(\sqrt{T})$ (deterministic) constraint violations.

• Stochastic constrained convex optimization: Note that i.i.d. time-varying $f(x;\omega(t))$ is a special case of arbitrarily-varying $f^t(x)$ as considered in our OCO setting. Thus, Theorems 5.1 and 5.2 still hold when Algorithm 5.1 is applied to solve the stochastic constrained convex optimization $\min_x\{\mathbb{E}[f(x;\omega)] : \mathbb{E}[g_k(x;\omega)]\le 0, \forall k\in\{1,2,\ldots,m\}, x\in\mathcal{X}_0\}$ in an online fashion with i.i.d. realizations $\omega(t)\sim\omega$.
Since Algorithm 5.1 chooses each $x(t)$ without knowing $\omega(t)$, it follows that $x(t)$ is independent of $\omega(t')$ for any $t'\ge t$ by the i.i.d. property of each $\omega(t)$. Fix $T>0$. If we run Algorithm 5.1 for $T$ slots and use $\bar x(T) = \frac{1}{T}\sum_{t=1}^T x(t)$ as a fixed solution for any future slot $t'\ge T+1$, then
\[
\mathbb{E}[f(\bar x(T);\omega(t'))] \overset{(a)}{\le} \frac{1}{T}\sum_{t=1}^T\mathbb{E}[f(x(t);\omega(t'))] \overset{(b)}{=} \frac{1}{T}\sum_{t=1}^T\mathbb{E}[f(x(t);\omega(t))] \overset{(c)}{\le} \frac{1}{T}\sum_{t=1}^T\mathbb{E}[f(x^*;\omega(t))] + O\Big(\frac{1}{\sqrt{T}}\Big) \overset{(d)}{=} \mathbb{E}[f(x^*;\omega(t'))] + O\Big(\frac{1}{\sqrt{T}}\Big)
\]
and
\[
\mathbb{E}[g_k(\bar x(T);\omega(t'))] \overset{(a)}{\le} \frac{1}{T}\sum_{t=1}^T\mathbb{E}[g_k(x(t);\omega(t'))] \overset{(b)}{=} \frac{1}{T}\sum_{t=1}^T\mathbb{E}[g_k(x(t);\omega(t))] \overset{(c)}{\le} O\Big(\frac{1}{\sqrt{T}}\Big), \quad \forall k\in\{1,2,\ldots,m\},
\]
where in both inequality chains (a) follows from Jensen's inequality and the fact that $\bar x(T)$ is independent of $\omega(t')$; (b) follows because each $x(t)$ is independent of both $\omega(t)$ and $\omega(t')$, and $\omega(t)$ and $\omega(t')$ are i.i.d. realizations of $\omega$; (c) follows from Theorems 5.1 and 5.2 by dividing both sides by $T$; and (d) follows because $\mathbb{E}[f(x^*;\omega(t))] = \mathbb{E}[f(x^*;\omega(t'))]$ for all $t\in\{1,\ldots,T\}$ by the i.i.d. property of each $\omega(t)$. Thus, if we use Algorithm 5.1 as a (batch) offline algorithm to solve stochastic constrained convex optimization, it has $O(1/\sqrt{T})$ convergence and matches the rate of the algorithm developed in [LZ16], which is by design a (batch) offline algorithm and can only solve stochastic optimization with a single constraint function.

• Deterministic constrained convex optimization: Similarly to OCO with long term constraints, the expectations in Theorems 5.1 and 5.2 disappear in this case since $f^t(x)\equiv f(x)$ and $g_k(x;\omega(t))\equiv g_k(x)$. If we use $\bar x(T) = \frac{1}{T}\sum_{t=1}^T x(t)$ as the solution, then $f(\bar x(T))\le f(x^*) + O(\frac{1}{\sqrt{T}})$ and $g_k(\bar x(T))\le O(\frac{1}{\sqrt{T}})$, which follows by dividing the inequalities in Theorems 5.1 and 5.2 by $T$ on both sides and applying Jensen's inequality. Thus, Algorithm 5.1 solves deterministic constrained convex optimization with $O(\frac{1}{\sqrt{T}})$ convergence.
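The deterministic special case can be sketched in a few lines of Python. The example below runs Algorithm 5.1 with $V=\sqrt{T}$ and $\alpha=T$ on a toy one-dimensional problem and returns the averaged iterate $\bar x(T)$. The toy problem (minimize $(x-2)^2$ subject to $x-1\le 0$ over $\mathcal{X}_0=[0,3]$) and the function name `run_deterministic_example` are assumptions chosen only for illustration and do not appear in the thesis.

```python
import math


def run_deterministic_example(T):
    """Run Algorithm 5.1 on a toy 1-D problem and return the averaged iterate.

    Toy problem (an assumption for illustration only):
        minimize f(x) = (x - 2)^2  subject to  g(x) = x - 1 <= 0,  x in X0 = [0, 3].
    Its constrained minimizer is x* = 1 with f(x*) = 1.
    """
    V = math.sqrt(T)      # parameter choice used in Theorems 5.1 and 5.2
    alpha = float(T)
    x, Q = 0.0, 0.0       # x(1) in X0 and Q(1) = 0
    total = 0.0
    for _ in range(T):
        total += x                      # accumulate x(t) for the running average
        grad_f = 2.0 * (x - 2.0)        # gradient of f at x(t)
        g_val, grad_g = x - 1.0, 1.0    # g(x(t)) and its gradient
        d = V * grad_f + Q * grad_g     # d(t) from Lemma 5.1
        x_next = min(max(x - d / (2.0 * alpha), 0.0), 3.0)   # projection onto [0, 3]
        Q = max(Q + g_val + grad_g * (x_next - x), 0.0)      # queue update (5.3)
        x = x_next
    return total / T
```

With a moderately large horizon such as `run_deterministic_example(10000)`, the averaged iterate should land close to the constrained minimizer $x^*=1$, with a small objective gap and a small constraint violation in line with the $O(1/\sqrt{T})$ rate above.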
5.3 High Probability Performance Analysis

This section shows that if we choose $V=\sqrt{T}$ and $\alpha=T$ in Algorithm 5.1, then for any $0<\lambda<1$, with probability at least $1-\lambda$, regret is $O(\sqrt{T}\log(T)\log^{1.5}(\frac{1}{\lambda}))$ and constraint violations are $O(\sqrt{T}\log(T)\log(\frac{1}{\lambda}))$.

5.3.1 High Probability Constraint Violation Analysis

Similarly to the expected constraint violation analysis, we can use part 2 of the new drift lemma (Lemma 5.5) to obtain a high probability bound of $\|Q(t)\|$, which together with Corollary 5.2 leads to a high probability constraint violation bound summarized in Theorem 5.3.

Theorem 5.3 (High Probability Constraint Violation Bound). Let $0<\lambda<1$ be arbitrary. If $V=\sqrt{T}$ and $\alpha=T$ in Algorithm 5.1, then for all $T\ge 1$ and all $k\in\{1,2,\ldots,m\}$, we have
\[
\Pr\Big( \sum_{t=1}^T g_k^t(x(t)) \le O\big(\sqrt{T}\log(T)\log(\tfrac{1}{\lambda})\big) \Big) \ge 1-\lambda.
\]

Proof. See Section 5.5.4.

5.3.2 High Probability Regret Analysis

To obtain a high probability regret bound from Lemma 5.8, it remains to derive a high probability bound of term (II) in (5.14) with $z=x^*$. The main challenge is that term (II) is a supermartingale with unbounded differences (due to the possibly unbounded virtual queues $Q_k(t)$). Most concentration inequalities, e.g., the Hoeffding-Azuma inequality, used in high probability performance analysis of online algorithms are restricted to martingales/supermartingales with bounded differences. See for example [CBL06, BDH+08, MJY12]. The following lemma considers supermartingales with unbounded differences. Its proof uses the truncation method to construct an auxiliary well-behaved supermartingale. Similar proof techniques were previously used in [Vu02, TV15] to prove different concentration inequalities for supermartingales/martingales with unbounded differences. The truncation method was also previously used in [WYN15] to analyze the high probability sample path performance of the conventional drift-plus-penalty technique for opportunistic stochastic optimization.

Lemma 5.9.
Let $\{Z(t), t\ge 0\}$ be a supermartingale adapted to a filtration $\{\mathcal{F}(t), t\ge 0\}$ with $Z(0)=0$ and $\mathcal{F}(0)=\{\emptyset,\Omega\}$, i.e., $\mathbb{E}[Z(t+1)\,|\,\mathcal{F}(t)]\le Z(t), \forall t\ge 0$. Suppose there exists a constant $c>0$ such that $\{|Z(t+1)-Z(t)|>c\}\subseteq\{Y(t)>0\}, \forall t\ge 0$, where $Y(t)$ is a process with $Y(t)$ adapted to $\mathcal{F}(t)$ for all $t\ge 0$. Then, for all $z>0$, we have
\[
\Pr(Z(t)\ge z) \le e^{-z^2/(2tc^2)} + \sum_{\tau=0}^{t-1}\Pr(Y(\tau)>0), \quad \forall t\ge 1.
\]

Note that if $\Pr(Y(t)>0)=0, \forall t\ge 0$, then $\Pr(\{|Z(t+1)-Z(t)|>c\})=0, \forall t\ge 0$ and $Z(t)$ is a supermartingale with differences bounded by $c$. In this case, Lemma 5.9 reduces to the conventional Hoeffding-Azuma inequality.

The next theorem summarizes the high probability regret performance of Algorithm 5.1 and follows from Lemmas 5.5-5.9.

Theorem 5.4 (High Probability Regret Bound). Let $x^*\in\mathcal{X}_0$ be any fixed solution that satisfies $\tilde g(x^*)\le 0$, e.g., $x^* = \operatorname*{argmin}_{x\in\mathcal{X}}\{\sum_{t=1}^T f^t(x)\}$. Let $0<\lambda<1$ be arbitrary. If $V=\sqrt{T}$ and $\alpha=T$ in Algorithm 5.1, then for all $T\ge 1$, we have
\[
\Pr\Big( \sum_{t=1}^T f^t(x(t)) \le \sum_{t=1}^T f^t(x^*) + O\big(\sqrt{T}\log(T)\log^{1.5}(\tfrac{1}{\lambda})\big) \Big) \ge 1-\lambda.
\]

Proof. See Section 5.5.6.

5.4 Chapter Summary

This chapter studies OCO with stochastic constraints, where the objective function varies arbitrarily but the constraint functions are i.i.d. over time. A novel learning algorithm is developed that guarantees $O(\sqrt{T})$ expected regret and constraint violations and $O(\sqrt{T}\log(T))$ high probability regret and constraint violations.

5.5 Supplement to this Chapter

5.5.1 Proof of Lemma 5.5

In this proof, we first establish an upper bound of $\mathbb{E}[e^{rZ(t)}]$ for some constant $r>0$. Part 1 of this lemma follows by applying Jensen's inequality since $e^{rx}$ is convex with respect to $x$ when $r>0$. Part 2 of this lemma follows directly from Markov's inequality. The following fact is useful in the proof.

Fact 5.1. $e^x\le 1+x+2x^2$ for any $|x|\le 1$.

Proof. By Taylor's expansion, we know that for any $x\in\mathbb{R}$, there exists a point $\hat x$ in between $0$ and $x$ such that $e^x = 1 + x + e^{\hat x}\frac{x^2}{2}$.
(Note that the value of $\hat x$ depends on $x$: if $x>0$, then $\hat x\in(0,x)$; if $x<0$, then $\hat x\in(x,0)$; and if $x=0$, then $\hat x=x$.) Since $|x|\le 1$, we have $e^{\hat x}\le e\le 4$. Thus, $e^x\le 1+x+2x^2$ for any $|x|\le 1$.

The next lemma provides an upper bound of $\mathbb{E}[e^{rZ(t)}]$ with constant $r = \frac{\zeta}{4t_0\delta_{\max}^2} < 1$.

Lemma 5.10. Under the assumption of Lemma 5.5, we have
\[
\mathbb{E}[e^{rZ(t)}] \le \frac{e^{rt_0\delta_{\max}}}{1-\rho}e^{r\theta}, \quad \forall t\in\{0,1,\ldots\},
\]
where $r = \frac{\zeta}{4t_0\delta_{\max}^2}$ and $\rho = 1-\frac{\zeta^2}{8\delta_{\max}^2} = 1-\frac{rt_0\zeta}{2}$.

Proof. Since $0<\zeta\le\delta_{\max}$, we have $0<\rho<1<e^{r\delta_{\max}}$. Define $\eta(t) = Z(t+t_0)-Z(t)$. Note that $|\eta(t)|\le t_0\delta_{\max}, \forall t\ge 0$ and $|r\eta(t)| \le \frac{\zeta}{4t_0\delta_{\max}^2}t_0\delta_{\max} = \frac{\zeta}{4\delta_{\max}} \le 1$. Then,
\[
e^{rZ(t+t_0)} = e^{rZ(t)}e^{r\eta(t)} \tag{5.15}
\]
\[
\overset{(a)}{\le} e^{rZ(t)}\big(1 + r\eta(t) + 2r^2t_0^2\delta_{\max}^2\big) \overset{(b)}{=} e^{rZ(t)}\big(1 + r\eta(t) + \tfrac12 rt_0\zeta\big), \tag{5.16}
\]
where (a) follows from Fact 5.1 by noting that $|r\eta(t)|\le 1$ and $|\eta(t)|\le t_0\delta_{\max}$; and (b) follows by substituting $r = \frac{\zeta}{4t_0\delta_{\max}^2}$ into a single $r$ of the term $2r^2t_0^2\delta_{\max}^2$. Next, consider the cases $Z(t)\ge\theta$ and $Z(t)<\theta$ separately.

• Case $Z(t)\ge\theta$: Taking conditional expectations on both sides of (5.16) yields
\[
\mathbb{E}[e^{rZ(t+t_0)}\,|\,Z(t)] \le \mathbb{E}\big[e^{rZ(t)}(1 + r\eta(t) + \tfrac12 rt_0\zeta)\,\big|\,Z(t)\big] \overset{(a)}{\le} e^{rZ(t)}\big[1 - rt_0\zeta + \tfrac12 rt_0\zeta\big] = e^{rZ(t)}\Big[1-\frac{rt_0\zeta}{2}\Big] \overset{(b)}{=} \rho e^{rZ(t)},
\]
where (a) follows from the fact that $\mathbb{E}[Z(t+t_0)-Z(t)\,|\,\mathcal{F}(t)]\le -t_0\zeta$ when $Z(t)\ge\theta$; and (b) follows from the fact that $\rho = 1-\frac{rt_0\zeta}{2}$.

• Case $Z(t)<\theta$: Taking conditional expectations on both sides of (5.15) yields
\[
\mathbb{E}[e^{rZ(t+t_0)}\,|\,Z(t)] = \mathbb{E}[e^{rZ(t)}e^{r\eta(t)}\,|\,Z(t)] = e^{rZ(t)}\mathbb{E}[e^{r\eta(t)}\,|\,Z(t)] \overset{(a)}{\le} e^{rt_0\delta_{\max}}e^{rZ(t)},
\]
where (a) follows from the fact that $\eta(t)\le t_0\delta_{\max}$.
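Putting the two cases together yields
\[
\begin{aligned}
\mathbb{E}[e^{rZ(t+t_0)}] &\overset{(a)}{=} \Pr(Z(t)\ge\theta)\,\mathbb{E}[e^{rZ(t+t_0)}\,|\,Z(t)\ge\theta] + \Pr(Z(t)<\theta)\,\mathbb{E}[e^{rZ(t+t_0)}\,|\,Z(t)<\theta] \\
&\overset{(b)}{\le} \rho\,\mathbb{E}[e^{rZ(t)}\,|\,Z(t)\ge\theta]\Pr(Z(t)\ge\theta) + e^{rt_0\delta_{\max}}\mathbb{E}[e^{rZ(t)}\,|\,Z(t)<\theta]\Pr(Z(t)<\theta) \\
&\overset{(c)}{=} \rho\,\mathbb{E}[e^{rZ(t)}] + (e^{rt_0\delta_{\max}}-\rho)\,\mathbb{E}[e^{rZ(t)}\,|\,Z(t)<\theta]\Pr(Z(t)<\theta) \\
&\overset{(d)}{\le} \rho\,\mathbb{E}[e^{rZ(t)}] + (e^{rt_0\delta_{\max}}-\rho)e^{r\theta} \\
&\le \rho\,\mathbb{E}[e^{rZ(t)}] + e^{rt_0\delta_{\max}}e^{r\theta},
\end{aligned} \tag{5.17}
\]
where (a) follows by the definition of expectations; (b) follows from the results in the above two cases; (c) follows from the fact that $\mathbb{E}[e^{rZ(t)}] = \Pr(Z(t)\ge\theta)\,\mathbb{E}[e^{rZ(t)}\,|\,Z(t)\ge\theta] + \Pr(Z(t)<\theta)\,\mathbb{E}[e^{rZ(t)}\,|\,Z(t)<\theta]$; and (d) follows from the fact that $e^{rt_0\delta_{\max}}>\rho$.

Now, we prove $\mathbb{E}[e^{rZ(t)}]\le\frac{e^{rt_0\delta_{\max}}}{1-\rho}e^{r\theta}, \forall t\ge 0$, by induction.

We first consider the base case $t\in\{0,1,\ldots,t_0\}$. Since $Z(t)\le t\delta_{\max}, \forall t\ge 0$, it follows that $\mathbb{E}[e^{rZ(t)}] \le e^{rt\delta_{\max}} \le e^{rt_0\delta_{\max}} \le \frac{e^{rt_0\delta_{\max}}}{1-\rho}e^{r\theta}, \forall t\in\{0,1,\ldots,t_0\}$, where the last inequality follows because $\frac{e^{r\theta}}{1-\rho}\ge 1$. Now assume that $\mathbb{E}[e^{rZ(t)}]\le\frac{e^{rt_0\delta_{\max}}}{1-\rho}e^{r\theta}$ for all $t\in\{0,1,\ldots,\tau\}$ with some $\tau\ge t_0$ and consider iteration $t=\tau+1$. By (5.17), we have
\[
\mathbb{E}[e^{rZ(\tau+1)}] \le \rho\,\mathbb{E}[e^{rZ(\tau+1-t_0)}] + e^{rt_0\delta_{\max}}e^{r\theta} \overset{(a)}{\le} \rho\frac{e^{rt_0\delta_{\max}}}{1-\rho}e^{r\theta} + e^{rt_0\delta_{\max}}e^{r\theta} = \frac{e^{rt_0\delta_{\max}}}{1-\rho}e^{r\theta},
\]
where (a) follows from the induction hypothesis by noting that $0\le\tau+1-t_0\le\tau$. Thus, this lemma follows by induction.

By this lemma, for all $t\in\{0,1,\ldots\}$, we have
\[
\mathbb{E}[e^{rZ(t)}] \le \frac{e^{rt_0\delta_{\max}}}{1-\rho}e^{r\theta}. \tag{5.18}
\]

Proof of Part 1: Note that $e^{rx}$ is convex with respect to $x$ when $r>0$. By Jensen's inequality,
\[
e^{r\mathbb{E}[Z(t)]} \le \mathbb{E}[e^{rZ(t)}] \overset{(a)}{\le} \frac{e^{r(\theta+t_0\delta_{\max})}}{1-\rho}, \tag{5.19}
\]
where (a) follows from (5.18). Taking the logarithm on both sides and dividing by $r$ yields
\[
\mathbb{E}[Z(t)] \le \theta + t_0\delta_{\max} + \frac{1}{r}\log\Big[\frac{1}{1-\rho}\Big] \overset{(a)}{=} \theta + t_0\delta_{\max} + t_0\frac{4\delta_{\max}^2}{\zeta}\log\Big[\frac{8\delta_{\max}^2}{\zeta^2}\Big],
\]
where (a) follows by recalling that $r = \frac{\zeta}{4t_0\delta_{\max}^2}$ and $\rho = 1-\frac{\zeta^2}{8\delta_{\max}^2}$.

Proof of Part 2: Fix $z$.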
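Note that
\[
\Pr(Z(t)\ge z) = \Pr(e^{rZ(t)}\ge e^{rz}) \overset{(a)}{\le} \frac{\mathbb{E}[e^{rZ(t)}]}{e^{rz}} \overset{(b)}{\le} e^{r(\theta-z+t_0\delta_{\max})}\frac{1}{1-\rho} \overset{(c)}{=} e^{\frac{\zeta}{4t_0\delta_{\max}^2}(\theta-z+t_0\delta_{\max})}\,\frac{8\delta_{\max}^2}{\zeta^2}, \tag{5.20}
\]
where (a) follows from Markov's inequality; (b) follows from (5.18); and (c) follows by recalling that $r = \frac{\zeta}{4t_0\delta_{\max}^2}$ and $\rho = 1-\frac{\zeta^2}{8\delta_{\max}^2}$.

Define $\mu = e^{\frac{\zeta}{4t_0\delta_{\max}^2}(\theta-z+t_0\delta_{\max})}\,\frac{8\delta_{\max}^2}{\zeta^2}$. It follows that if $z = \theta + t_0\delta_{\max} + t_0\frac{4\delta_{\max}^2}{\zeta}\log\big[\frac{8\delta_{\max}^2}{\zeta^2}\big] + t_0\frac{4\delta_{\max}^2}{\zeta}\log\big(\frac{1}{\mu}\big)$, then we have $\Pr(Z(t)\ge z)\le\mu$ by (5.20).

5.5.2 Proof of Lemma 5.7

The next lemma will be useful in our proof.

Lemma 5.11. Let $\hat x\in\mathcal{X}_0$ be a Slater point defined in Assumption 5.2, i.e., $\tilde g_k(\hat x) = \mathbb{E}_\omega[g_k(\hat x;\omega)]\le -\epsilon, \forall k\in\{1,2,\ldots,m\}$. Then
\[
\mathbb{E}\Big[\sum_{k=1}^m Q_k(t_1)g_k^{t_1}(\hat x)\,\Big|\,\mathcal{W}(t_2)\Big] \le -\epsilon\,\mathbb{E}[\|Q(t_1)\|\,|\,\mathcal{W}(t_2)], \quad \forall t_2\le t_1-1,
\]
where $\epsilon>0$ is defined in Assumption 5.2.

Proof. To prove this lemma, we first show that $\mathbb{E}[Q_k(t_1)g_k^{t_1}(\hat x)\,|\,\mathcal{W}(t_2)] \le -\epsilon\,\mathbb{E}[Q_k(t_1)\,|\,\mathcal{W}(t_2)], \forall k\in\{1,2,\ldots,m\}, \forall t_2\le t_1-1$. Fix $k\in\{1,2,\ldots,m\}$. Note that $Q(t_1)\in\mathcal{W}(t_1-1)$ and $g_k^{t_1}(\hat x)$ is independent of $\mathcal{W}(t_1-1)$. Further, if $t_2\le t_1-1$, then $\mathcal{W}(t_2)\subseteq\mathcal{W}(t_1-1)$. Thus, we have
\[
\begin{aligned}
\mathbb{E}[Q_k(t_1)g_k^{t_1}(\hat x)\,|\,\mathcal{W}(t_2)] &\overset{(a)}{=} \mathbb{E}\big[\mathbb{E}[Q_k(t_1)g_k^{t_1}(\hat x)\,|\,\mathcal{W}(t_1-1)]\,\big|\,\mathcal{W}(t_2)\big] \\
&\overset{(b)}{=} \mathbb{E}\big[Q_k(t_1)\,\mathbb{E}[g_k^{t_1}(\hat x)]\,\big|\,\mathcal{W}(t_2)\big] \\
&\overset{(c)}{=} \mathbb{E}[g_k^{t_1}(\hat x)]\,\mathbb{E}[Q_k(t_1)\,|\,\mathcal{W}(t_2)] \\
&\overset{(d)}{\le} -\epsilon\,\mathbb{E}[Q_k(t_1)\,|\,\mathcal{W}(t_2)],
\end{aligned}
\]
where (a) follows from iterated expectations; (b) follows because $g_k^{t_1}(\hat x)$ is independent of $\mathcal{W}(t_1-1)$ and $Q_k(t_1)\in\mathcal{W}(t_1-1)$; (c) follows by extracting the constant $\mathbb{E}[g_k^{t_1}(\hat x)]$; and (d) follows from the assumption that $\hat x$ is a Slater point, the fact that $g^t(\cdot)$ are i.i.d. across $t$, and the fact that $Q_k(t)\ge 0$.

Now, summing over $k\in\{1,2,\ldots,m\}$ yields
\[
\mathbb{E}\Big[\sum_{k=1}^m Q_k(t_1)g_k^{t_1}(\hat x)\,\Big|\,\mathcal{W}(t_2)\Big] \le -\epsilon\,\mathbb{E}\Big[\sum_{k=1}^m Q_k(t_1)\,\Big|\,\mathcal{W}(t_2)\Big] \overset{(a)}{\le} -\epsilon\,\mathbb{E}[\|Q(t_1)\|\,|\,\mathcal{W}(t_2)],
\]
where (a) follows from the basic fact that $\sum_{k=1}^m a_k \ge \sqrt{\sum_{k=1}^m a_k^2}$ when $a_k\ge 0, \forall k\in\{1,2,\ldots,m\}$.

The bounded difference of $\|Q(t)\|$ follows directly from the virtual queue update equation (5.3) and is summarized in the next lemma.

Lemma 5.12.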
Let $Q(t), t\in\{0,1,\ldots\}$ be the sequence generated by Algorithm 5.1. Then,
\[
\|Q(t)\| - G - \sqrt{m}D_2R \le \|Q(t+1)\| \le \|Q(t)\| + G, \quad \forall t\ge 0.
\]

Proof.

• Proof of $\|Q(t+1)\|\le\|Q(t)\|+G$: Fix $t\ge 0$ and $k\in\{1,2,\ldots,m\}$. The virtual queue update equation implies that
\[
Q_k(t+1) = \max\{Q_k(t) + g_k^t(x(t)) + [\nabla g_k^t(x(t))]^T[x(t+1)-x(t)],\ 0\} \overset{(a)}{\le} \max\{Q_k(t) + g_k^t(x(t+1)),\ 0\},
\]
where (a) follows from the convexity of $g_k^t(\cdot)$. Note that $Q_k(t+1)\ge 0$ and recall the fact that if $0\le a\le\max\{b,0\}$, then $a^2\le b^2$ for all $a,b\in\mathbb{R}$. Then, we have $[Q_k(t+1)]^2\le[Q_k(t)+g_k^t(x(t+1))]^2$. Summing over $k\in\{1,2,\ldots,m\}$ yields $\|Q(t+1)\|^2\le\|Q(t)+g^t(x(t+1))\|^2$. Thus, $\|Q(t+1)\| \le \|Q(t)+g^t(x(t+1))\| \le \|Q(t)\| + \|g^t(x(t+1))\| \le \|Q(t)\| + G$, where the last inequality follows from Assumption 5.1.

• Proof of $\|Q(t+1)\|\ge\|Q(t)\|-G-\sqrt{m}D_2R$: Since $Q_k(t)\ge 0$, it follows that $|Q_k(t+1)-Q_k(t)| \le |g_k^t(x(t)) + [\nabla g_k^t(x(t))]^T[x(t+1)-x(t)]|$. (This can be shown by considering $g_k^t(x(t)) + [\nabla g_k^t(x(t))]^T[x(t+1)-x(t)]\ge 0$ and $g_k^t(x(t)) + [\nabla g_k^t(x(t))]^T[x(t+1)-x(t)]<0$ separately.) Thus, we have $\|Q(t+1)-Q(t)\|\le G+\sqrt{m}D_2R$, which further implies $\|Q(t+1)\|\ge\|Q(t)\|-G-\sqrt{m}D_2R$ by the triangle inequality of norms.

Now, we are ready to present the main proof of Lemma 5.7. Note that Lemma 5.12 gives $\big|\,\|Q(t+1)\|-\|Q(t)\|\,\big|\le G+\sqrt{m}D_2R$, which further implies that $\mathbb{E}[\|Q(t+t_0)\|-\|Q(t)\|\,|\,\mathcal{W}(t-1)]\le t_0(G+\sqrt{m}D_2R)$ when $\|Q(t)\|<\theta$. It remains to prove $\mathbb{E}[\|Q(t+t_0)\|-\|Q(t)\|\,|\,\mathcal{W}(t-1)]\le -\frac{\epsilon}{2}t_0$ when $\|Q(t)\|\ge\theta$. Note that $\|Q(0)\|=0<\theta$.

Fix $t\ge 1$ and consider the case $\|Q(t)\|\ge\theta$. Let $\hat x\in\mathcal{X}_0$ and $\epsilon>0$ be defined in Assumption 5.2. Note that $\mathbb{E}[g_k^t(\hat x)]\le -\epsilon, \forall k\in\{1,2,\ldots,m\}, \forall t\in\{1,2,\ldots\}$ since $\omega(t)$ are i.i.d. from the distribution of $\omega$. Since $\hat x\in\mathcal{X}_0$, by Lemma 5.4, for all $\tau\in\{t,t+1,\ldots,t+t_0-1\}$, we have
\[
\begin{aligned}
& V[\nabla f^\tau(x(\tau))]^T[x(\tau+1)-x(\tau)] + \sum_{k=1}^m Q_k(\tau)[\nabla g_k^\tau(x(\tau))]^T[x(\tau+1)-x(\tau)] + \alpha\|x(\tau+1)-x(\tau)\|^2 \\
&\le V[\nabla f^\tau(x(\tau))]^T[\hat x-x(\tau)] + \sum_{k=1}^m Q_k(\tau)[\nabla g_k^\tau(x(\tau))]^T[\hat x-x(\tau)] + \alpha\big[\|\hat x-x(\tau)\|^2 - \|\hat x-x(\tau+1)\|^2\big].
\end{aligned}
\]
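Adding $\sum_{k=1}^m Q_k(\tau)g_k^\tau(x(\tau))$ on both sides and noting that $g_k^\tau(x(\tau)) + [\nabla g_k^\tau(x(\tau))]^T[\hat x-x(\tau)]\le g_k^\tau(\hat x)$ by convexity yields
\[
\begin{aligned}
& V[\nabla f^\tau(x(\tau))]^T[x(\tau+1)-x(\tau)] + \sum_{k=1}^m Q_k(\tau)\big[g_k^\tau(x(\tau)) + [\nabla g_k^\tau(x(\tau))]^T[x(\tau+1)-x(\tau)]\big] + \alpha\|x(\tau+1)-x(\tau)\|^2 \\
&\le V[\nabla f^\tau(x(\tau))]^T[\hat x-x(\tau)] + \sum_{k=1}^m Q_k(\tau)g_k^\tau(\hat x) + \alpha\big[\|\hat x-x(\tau)\|^2 - \|\hat x-x(\tau+1)\|^2\big].
\end{aligned}
\]
Rearranging terms yields
\[
\begin{aligned}
& \sum_{k=1}^m Q_k(\tau)\big[g_k^\tau(x(\tau)) + [\nabla g_k^\tau(x(\tau))]^T[x(\tau+1)-x(\tau)]\big] \\
&\le V[\nabla f^\tau(x(\tau))]^T[\hat x-x(\tau)] - V[\nabla f^\tau(x(\tau))]^T[x(\tau+1)-x(\tau)] + \alpha\big[\|\hat x-x(\tau)\|^2 - \|\hat x-x(\tau+1)\|^2\big] \\
&\quad - \alpha\|x(\tau+1)-x(\tau)\|^2 + \sum_{k=1}^m Q_k(\tau)g_k^\tau(\hat x) \\
&\le V[\nabla f^\tau(x(\tau))]^T[\hat x-x(\tau+1)] + \alpha\big[\|\hat x-x(\tau)\|^2 - \|\hat x-x(\tau+1)\|^2\big] + \sum_{k=1}^m Q_k(\tau)g_k^\tau(\hat x) \\
&\overset{(a)}{\le} V\|\nabla f^\tau(x(\tau))\|\,\|\hat x-x(\tau+1)\| + \alpha\big[\|\hat x-x(\tau)\|^2 - \|\hat x-x(\tau+1)\|^2\big] + \sum_{k=1}^m Q_k(\tau)g_k^\tau(\hat x) \\
&\overset{(b)}{\le} VD_1R + \alpha\big[\|\hat x-x(\tau)\|^2 - \|\hat x-x(\tau+1)\|^2\big] + \sum_{k=1}^m Q_k(\tau)g_k^\tau(\hat x),
\end{aligned} \tag{5.21}
\]
where (a) follows from the Cauchy-Schwarz inequality and (b) follows from Assumption 5.1.

By Lemma 5.2, for all $\tau\in\{t,t+1,\ldots,t+t_0-1\}$, we have
\[
\begin{aligned}
\Delta(\tau) &\le \sum_{k=1}^m Q_k(\tau)\big[g_k^\tau(x(\tau)) + [\nabla g_k^\tau(x(\tau))]^T[x(\tau+1)-x(\tau)]\big] + \frac12(G+\sqrt{m}D_2R)^2 \\
&\overset{(a)}{\le} VD_1R + \frac12(G+\sqrt{m}D_2R)^2 + \alpha\big[\|\hat x-x(\tau)\|^2 - \|\hat x-x(\tau+1)\|^2\big] + \sum_{k=1}^m Q_k(\tau)g_k^\tau(\hat x),
\end{aligned}
\]
where (a) follows from (5.21).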
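Summing the above inequality over $\tau\in\{t,t+1,\ldots,t+t_0-1\}$, taking expectations conditional on $\mathcal{W}(t-1)$ on both sides and recalling that $\Delta(\tau) = \frac{1}{2}\|\mathbf{Q}(\tau+1)\|^2 - \frac{1}{2}\|\mathbf{Q}(\tau)\|^2$ yields
\[ \begin{aligned} &\mathbb{E}[\|\mathbf{Q}(t+t_0)\|^2 - \|\mathbf{Q}(t)\|^2 \mid \mathcal{W}(t-1)] \\ \le\ &2VD_1Rt_0 + t_0(G+\sqrt{m}D_2R)^2 + 2\alpha\,\mathbb{E}[\|\hat{\mathbf{x}}-\mathbf{x}(t)\|^2 - \|\hat{\mathbf{x}}-\mathbf{x}(t+t_0)\|^2 \mid \mathcal{W}(t-1)] + 2\sum_{\tau=t}^{t+t_0-1}\mathbb{E}\Big[\sum_{k=1}^m Q_k(\tau) g_k^\tau(\hat{\mathbf{x}}) \,\Big|\, \mathcal{W}(t-1)\Big] \\ \overset{(a)}{\le}\ &2VD_1Rt_0 + t_0(G+\sqrt{m}D_2R)^2 + 2\alpha R^2 - 2\epsilon\sum_{\tau=t}^{t+t_0-1}\mathbb{E}[\|\mathbf{Q}(\tau)\| \mid \mathcal{W}(t-1)] \\ \overset{(b)}{\le}\ &2VD_1Rt_0 + t_0(G+\sqrt{m}D_2R)^2 + 2\alpha R^2 - 2\epsilon\sum_{\tau=0}^{t_0-1}\mathbb{E}[\|\mathbf{Q}(t)\| - \tau(G+\sqrt{m}D_2R) \mid \mathcal{W}(t-1)] \\ =\ &2VD_1Rt_0 + t_0(G+\sqrt{m}D_2R)^2 + 2\alpha R^2 - 2\epsilon t_0\|\mathbf{Q}(t)\| + \epsilon t_0(t_0-1)(G+\sqrt{m}D_2R) \\ \le\ &2VD_1Rt_0 + t_0(G+\sqrt{m}D_2R)^2 + 2\alpha R^2 - 2\epsilon t_0\|\mathbf{Q}(t)\| + \epsilon t_0^2(G+\sqrt{m}D_2R), \end{aligned} \]
where (a) follows because $\|\hat{\mathbf{x}}-\mathbf{x}(t)\|^2 - \|\hat{\mathbf{x}}-\mathbf{x}(t+t_0)\|^2 \le R^2$ by Assumption 5.1 and $\mathbb{E}[\sum_{k=1}^m Q_k(\tau) g_k^\tau(\hat{\mathbf{x}}) \mid \mathcal{W}(t-1)] \le -\epsilon\,\mathbb{E}[\|\mathbf{Q}(\tau)\| \mid \mathcal{W}(t-1)], \forall\tau\in\{t,t+1,\ldots,t+t_0-1\}$ by Lemma 5.11; (b) follows because $\|\mathbf{Q}(t+1)\| \ge \|\mathbf{Q}(t)\| - (G+\sqrt{m}D_2R), \forall t$ by Lemma 5.12.

This inequality can be rewritten as
\[ \begin{aligned} &\mathbb{E}[\|\mathbf{Q}(t+t_0)\|^2 \mid \mathcal{W}(t-1)] \\ \le\ &\|\mathbf{Q}(t)\|^2 - 2\epsilon t_0\|\mathbf{Q}(t)\| + 2VD_1Rt_0 + 2\alpha R^2 + t_0(G+\sqrt{m}D_2R)^2 + \epsilon t_0^2(G+\sqrt{m}D_2R) \\ \overset{(a)}{\le}\ &\|\mathbf{Q}(t)\|^2 - \epsilon t_0\|\mathbf{Q}(t)\| - \epsilon t_0\Big[\frac{\epsilon}{2}t_0 + (G+\sqrt{m}D_2R)t_0 + \frac{2\alpha R^2}{t_0\epsilon} + \frac{2VD_1R + (G+\sqrt{m}D_2R)^2}{\epsilon}\Big] \\ &+ 2VD_1Rt_0 + 2\alpha R^2 + t_0(G+\sqrt{m}D_2R)^2 + \epsilon t_0^2(G+\sqrt{m}D_2R) \\ =\ &\|\mathbf{Q}(t)\|^2 - \epsilon t_0\|\mathbf{Q}(t)\| - \frac{\epsilon^2t_0^2}{2} \\ \le\ &\Big[\|\mathbf{Q}(t)\| - \frac{\epsilon}{2}t_0\Big]^2, \end{aligned} \]
where (a) follows from the hypothesis that $\|\mathbf{Q}(t)\| \ge \theta = \frac{\epsilon}{2}t_0 + (G+\sqrt{m}D_2R)t_0 + \frac{2\alpha R^2}{t_0\epsilon} + \frac{2VD_1R + (G+\sqrt{m}D_2R)^2}{\epsilon}$.

Taking square root on both sides yields
\[ \sqrt{\mathbb{E}[\|\mathbf{Q}(t+t_0)\|^2 \mid \mathcal{W}(t-1)]} \le \|\mathbf{Q}(t)\| - \frac{\epsilon}{2}t_0. \]
By the concavity of the function $\sqrt{x}$ and Jensen's inequality, we have
\[ \mathbb{E}[\|\mathbf{Q}(t+t_0)\| \mid \mathcal{W}(t-1)] \le \sqrt{\mathbb{E}[\|\mathbf{Q}(t+t_0)\|^2 \mid \mathcal{W}(t-1)]} \le \|\mathbf{Q}(t)\| - \frac{\epsilon}{2}t_0. \]

5.5.3 Proof of Lemma 5.8

Fix $t\ge 1$. By Lemma 5.4, we have
\[ \begin{aligned} &V[\nabla f^t(\mathbf{x}(t))]^T[\mathbf{x}(t+1)-\mathbf{x}(t)] + \sum_{k=1}^m Q_k(t)[\nabla g_k^t(\mathbf{x}(t))]^T[\mathbf{x}(t+1)-\mathbf{x}(t)] + \alpha\|\mathbf{x}(t+1)-\mathbf{x}(t)\|^2 \\ \le\ &V[\nabla f^t(\mathbf{x}(t))]^T[\mathbf{z}-\mathbf{x}(t)] + \sum_{k=1}^m Q_k(t)[\nabla g_k^t(\mathbf{x}(t))]^T[\mathbf{z}-\mathbf{x}(t)] + \alpha[\|\mathbf{z}-\mathbf{x}(t)\|^2 - \|\mathbf{z}-\mathbf{x}(t+1)\|^2]. \end{aligned} \]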
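Adding the constant $Vf^t(\mathbf{x}(t)) + \sum_{k=1}^m Q_k(t) g_k^t(\mathbf{x}(t))$ on both sides and noting that $f^t(\mathbf{x}(t)) + [\nabla f^t(\mathbf{x}(t))]^T[\mathbf{z}-\mathbf{x}(t)] \le f^t(\mathbf{z})$ and $g_k^t(\mathbf{x}(t)) + [\nabla g_k^t(\mathbf{x}(t))]^T[\mathbf{z}-\mathbf{x}(t)] \le g_k^t(\mathbf{z})$ by convexity yields
\[ \begin{aligned} &Vf^t(\mathbf{x}(t)) + V[\nabla f^t(\mathbf{x}(t))]^T[\mathbf{x}(t+1)-\mathbf{x}(t)] + \sum_{k=1}^m Q_k(t)\big[g_k^t(\mathbf{x}(t)) + [\nabla g_k^t(\mathbf{x}(t))]^T[\mathbf{x}(t+1)-\mathbf{x}(t)]\big] + \alpha\|\mathbf{x}(t+1)-\mathbf{x}(t)\|^2 \\ \le\ &Vf^t(\mathbf{z}) + \sum_{k=1}^m Q_k(t) g_k^t(\mathbf{z}) + \alpha[\|\mathbf{z}-\mathbf{x}(t)\|^2 - \|\mathbf{z}-\mathbf{x}(t+1)\|^2]. \end{aligned} \tag{5.22} \]
By Lemma 5.2, we have
\[ \Delta(t) \le \sum_{k=1}^m Q_k(t)\big[g_k^t(\mathbf{x}(t)) + [\nabla g_k^t(\mathbf{x}(t))]^T[\mathbf{x}(t+1)-\mathbf{x}(t)]\big] + \frac{1}{2}[G+\sqrt{m}D_2R]^2. \tag{5.23} \]
Summing (5.22) and (5.23), cancelling common terms and rearranging terms yields
\[ \begin{aligned} Vf^t(\mathbf{x}(t)) \le\ &Vf^t(\mathbf{z}) - \Delta(t) + \sum_{k=1}^m Q_k(t) g_k^t(\mathbf{z}) + \alpha[\|\mathbf{z}-\mathbf{x}(t)\|^2 - \|\mathbf{z}-\mathbf{x}(t+1)\|^2] \\ &- V[\nabla f^t(\mathbf{x}(t))]^T[\mathbf{x}(t+1)-\mathbf{x}(t)] - \alpha\|\mathbf{x}(t+1)-\mathbf{x}(t)\|^2 + \frac{1}{2}[G+\sqrt{m}D_2R]^2. \end{aligned} \tag{5.24} \]
Note that
\[ \begin{aligned} &-V[\nabla f^t(\mathbf{x}(t))]^T[\mathbf{x}(t+1)-\mathbf{x}(t)] - \alpha\|\mathbf{x}(t+1)-\mathbf{x}(t)\|^2 \\ \overset{(a)}{\le}\ &V\|\nabla f^t(\mathbf{x}(t))\|\,\|\mathbf{x}(t+1)-\mathbf{x}(t)\| - \alpha\|\mathbf{x}(t+1)-\mathbf{x}(t)\|^2 \\ \overset{(b)}{\le}\ &VD_1\|\mathbf{x}(t+1)-\mathbf{x}(t)\| - \alpha\|\mathbf{x}(t+1)-\mathbf{x}(t)\|^2 \\ =\ &-\alpha\Big[\|\mathbf{x}(t+1)-\mathbf{x}(t)\| - \frac{VD_1}{2\alpha}\Big]^2 + \frac{V^2D_1^2}{4\alpha} \\ \le\ &\frac{V^2D_1^2}{4\alpha}, \end{aligned} \tag{5.25} \]
where (a) follows from the Cauchy-Schwarz inequality; and (b) follows from Assumption 5.1.

Substituting (5.25) into (5.24) yields
\[ Vf^t(\mathbf{x}(t)) \le Vf^t(\mathbf{z}) - \Delta(t) + \sum_{k=1}^m Q_k(t) g_k^t(\mathbf{z}) + \alpha[\|\mathbf{z}-\mathbf{x}(t)\|^2 - \|\mathbf{z}-\mathbf{x}(t+1)\|^2] + \frac{V^2D_1^2}{4\alpha} + \frac{1}{2}[G+\sqrt{m}D_2R]^2. \]
Summing over $t\in\{1,2,\ldots,T\}$ yields
\[ \begin{aligned} V\sum_{t=1}^T f^t(\mathbf{x}(t)) \le\ &V\sum_{t=1}^T f^t(\mathbf{z}) - \sum_{t=1}^T\Delta(t) + \alpha\sum_{t=1}^T[\|\mathbf{z}-\mathbf{x}(t)\|^2 - \|\mathbf{z}-\mathbf{x}(t+1)\|^2] \\ &+ \frac{V^2D_1^2}{4\alpha}T + \frac{1}{2}(G+\sqrt{m}D_2R)^2T + \sum_{t=1}^T\sum_{k=1}^m Q_k(t) g_k^t(\mathbf{z}) \\ \overset{(a)}{=}\ &V\sum_{t=1}^T f^t(\mathbf{z}) + L(1) - L(T+1) + \alpha\|\mathbf{z}-\mathbf{x}(1)\|^2 - \alpha\|\mathbf{z}-\mathbf{x}(T+1)\|^2 \\ &+ \frac{V^2D_1^2}{4\alpha}T + \frac{1}{2}(G+\sqrt{m}D_2R)^2T + \sum_{t=1}^T\sum_{k=1}^m Q_k(t) g_k^t(\mathbf{z}) \\ \overset{(b)}{\le}\ &V\sum_{t=1}^T f^t(\mathbf{z}) + \alpha R^2 + \frac{V^2D_1^2}{4\alpha}T + \frac{1}{2}(G+\sqrt{m}D_2R)^2T + \sum_{t=1}^T\sum_{k=1}^m Q_k(t) g_k^t(\mathbf{z}), \end{aligned} \]
where (a) follows by recalling that $\Delta(t) = L(t+1) - L(t)$; and (b) follows because $\|\mathbf{z}-\mathbf{x}(1)\| \le R$ by Assumption 5.1, $L(1) = \frac{1}{2}\|\mathbf{Q}(1)\|^2 = 0$ and $L(T+1) = \frac{1}{2}\|\mathbf{Q}(T+1)\|^2 \ge 0$. Dividing both sides by $V$ yields the desired result.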
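As a quick numerical sanity check of the completing-the-square step in (5.25), the following Python sketch evaluates $VD_1u - \alpha u^2$ on a grid of step lengths $u = \|\mathbf{x}(t+1)-\mathbf{x}(t)\|$ and compares it with the bound $V^2D_1^2/(4\alpha)$. The function names and the sample values of $V$, $D_1$ and $\alpha$ are illustrative assumptions, not values taken from the text.

```python
def step_term(u, V, D1, alpha):
    """Value V*D1*u - alpha*u^2 appearing in (5.25), with u = ||x(t+1) - x(t)||."""
    return V * D1 * u - alpha * u * u


def step_bound(V, D1, alpha):
    """Right side of (5.25): V^2 * D1^2 / (4 * alpha)."""
    return (V * D1) ** 2 / (4.0 * alpha)


if __name__ == "__main__":
    V, D1, alpha = 3.0, 2.0, 5.0  # illustrative values only
    grid = [0.01 * i for i in range(301)]
    worst = max(step_term(u, V, D1, alpha) for u in grid)
    # The grid maximum should not exceed the bound (up to rounding).
    print(worst <= step_bound(V, D1, alpha) + 1e-12)
```

The maximizer is $u = VD_1/(2\alpha)$, which is why the bound is tight only when the step length takes exactly that value.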
5.5.4 Proof of Theorem 5.3

Define random process $Z(t) = \|\mathbf{Q}(t)\|, \forall t\in\{1,2,\ldots\}$. By Lemma 5.7, $Z(t)$ satisfies the conditions in Lemma 5.5 with $\delta_{\max} = G+\sqrt{m}D_2R$, $\zeta = \frac{\epsilon}{2}$ and $\theta = \frac{\epsilon}{2}t_0 + (G+\sqrt{m}D_2R)t_0 + \frac{2\alpha R^2}{t_0\epsilon} + \frac{2VD_1R + (G+\sqrt{m}D_2R)^2}{\epsilon}$.

Fix $T\ge 1$ and $0<\lambda<1$. Taking $\mu = \lambda/(T+1)$ in part 2 of Lemma 5.5 yields
\[ \Pr(\|\mathbf{Q}(t)\| \ge \gamma) \le \frac{\lambda}{T+1}, \quad \forall t\in\{1,2,\ldots,T+1\}, \]
where
\[ \gamma = \frac{\epsilon}{2}t_0 + 2(G+\sqrt{m}D_2R)t_0 + \frac{2\alpha R^2}{t_0\epsilon} + \frac{2VD_1R + (G+\sqrt{m}D_2R)^2}{\epsilon} + t_0\frac{8(G+\sqrt{m}D_2R)^2}{\epsilon}\log\Big(\frac{32(G+\sqrt{m}D_2R)^2}{\epsilon^2}\Big) + t_0\frac{8(G+\sqrt{m}D_2R)^2}{\epsilon}\log\Big(\frac{T+1}{\lambda}\Big). \]
By the union bound of probability, we have
\[ \Pr(\|\mathbf{Q}(t)\| \ge \gamma \text{ for some } t\in\{1,2,\ldots,T+1\}) \le \lambda. \]
This implies
\[ \Pr(\|\mathbf{Q}(t)\| \le \gamma \text{ for all } t\in\{1,2,\ldots,T+1\}) \ge 1-\lambda. \tag{5.26} \]
Taking $t_0 = \lceil\sqrt{T}\rceil$, $V = \sqrt{T}$ and $\alpha = T$ yields
\[ \gamma = O(\sqrt{T}\log(T)) + O(\sqrt{T}\log(\tfrac{1}{\lambda})) = O(\sqrt{T}\log(T)\log(\tfrac{1}{\lambda})). \tag{5.27} \]
Recall that by Corollary 5.2 (with $V = \sqrt{T}$ and $\alpha = T$), for all $k\in\{1,2,\ldots,m\}$, we have
\[ \sum_{t=1}^T g_k(\mathbf{x}(t)) \le \|\mathbf{Q}(T+1)\| + \frac{\sqrt{T}D_1D_2}{2} + \frac{\sqrt{m}D_2^2}{2T}\sum_{t=1}^T\|\mathbf{Q}(t)\|. \tag{5.28} \]
It follows from (5.26)-(5.28) that
\[ \Pr\Big(\sum_{t=1}^T g_k(\mathbf{x}(t)) \le O(\sqrt{T}\log(T)\log(\tfrac{1}{\lambda}))\Big) \ge 1-\lambda. \]

5.5.5 Proof of Lemma 5.9

Intuitively, the second term on the right side in the lemma bounds the probability that $|Z(\tau+1)-Z(\tau)| > c$ for any $\tau\in\{0,1,\ldots,t-1\}$, while the first term on the right side comes from the conventional Hoeffding-Azuma inequality. However, it is unclear whether or not $Z(t)$ is still a supermartingale conditional on the event that $|Z(\tau+1)-Z(\tau)| \le c$ for any $\tau\in\{0,1,\ldots,t-1\}$. That is why it is important to have $\{|Z(t+1)-Z(t)| > c\} \subseteq \{Y(t) > 0\}$ and $Y(t)\in\mathcal{F}(t)$, which means the boundedness of $|Z(t+1)-Z(t)|$ can be inferred from another random variable $Y(t)$ that belongs to $\mathcal{F}(t)$. The proof of Lemma 5.9 uses the truncation method to construct an auxiliary supermartingale.

Recall the definition of a stopping time given as follows:

Definition 5.1 ([Dur10]). Let $\{\emptyset,\Omega\} = \mathcal{F}(0) \subseteq \mathcal{F}(1) \subseteq \mathcal{F}(2) \cdots$ be a filtration. A discrete random variable $T$ is a stopping time (also known as an optional time) if for any integer $t<\infty$, $\{T=t\}\in\mathcal{F}(t)$, i.e.
the event that the stopping time occurs at time $t$ is contained in the information up to time $t$.

The next theorem states that a supermartingale truncated at a stopping time is still a supermartingale.

Theorem 5.5 (Theorem 5.2.6 in [Dur10]). If random variable $T$ is a stopping time and $Z(t)$ is a supermartingale, then $Z(t\wedge T)$ is also a supermartingale, where $a\wedge b \triangleq \min\{a,b\}$.

To prove this lemma, we first construct a new supermartingale by truncating the original supermartingale at a carefully chosen stopping time such that the new supermartingale has bounded differences.

Define integer random variable $T = \inf\{t\ge 0 : Y(t) > 0\}$. That is, $T$ is the first time $t$ when $Y(t) > 0$ happens. Now, we show that $T$ is a stopping time and that if we define $\tilde{Z}(t) = Z(t\wedge T)$, then $\{\tilde{Z}(t)\ne Z(t)\} \subseteq \bigcup_{\tau=0}^{t-1}\{Y(\tau) > 0\}, \forall t\ge 1$ and $\tilde{Z}(t)$ is a supermartingale with differences bounded by $c$.

1. To show $T$ is a stopping time: Note that $\{T=0\} = \{Y(0) > 0\}\in\mathcal{F}(0)$. Fix integer $t_0 > 0$; we have
\[ \{T=t_0\} = \big\{\inf\{t\ge 0 : Y(t) > 0\} = t_0\big\} = \Big(\bigcap_{\tau=0}^{t_0-1}\{Y(\tau)\le 0\}\Big) \cap \{Y(t_0) > 0\} \overset{(a)}{\in} \mathcal{F}(t_0), \]
where (a) follows because $\{Y(\tau)\le 0\}\in\mathcal{F}(\tau)\subseteq\mathcal{F}(t_0)$ for all $\tau\in\{0,1,\ldots,t_0-1\}$ and $\{Y(t_0) > 0\}\in\mathcal{F}(t_0)$. It follows that $T$ is a stopping time.

2. To show $\{\tilde{Z}(t)\ne Z(t)\} \subseteq \bigcup_{\tau=0}^{t-1}\{Y(\tau) > 0\}, \forall t\ge 1$: Fix $t = t_0\ge 1$. Note that
\[ \{\tilde{Z}(t_0)\ne Z(t_0)\} \overset{(a)}{\subseteq} \{T<t_0\} = \big\{\inf\{t\ge 0 : Y(t) > 0\} < t_0\big\} \subseteq \bigcup_{\tau=0}^{t_0-1}\{Y(\tau) > 0\}, \]
where (a) follows by noting that if $T\ge t_0$ then $\tilde{Z}(t_0) = Z(t_0\wedge T) = Z(t_0)$.

3. To show $\tilde{Z}(t)$ is a supermartingale with differences bounded by $c$: Since random variable $T$ is proven to be a stopping time, $\tilde{Z}(t) = Z(t\wedge T)$ is a supermartingale by Theorem 5.5. It remains to show $|\tilde{Z}(t+1)-\tilde{Z}(t)| \le c, \forall t\ge 0$. Fix integer $t = t_0\ge 0$.
Note that
\[ \begin{aligned} |\tilde{Z}(t_0+1)-\tilde{Z}(t_0)| &= |Z(T\wedge(t_0+1)) - Z(T\wedge t_0)| \\ &= \big|\mathbf{1}_{\{T\ge t_0+1\}}[Z(T\wedge(t_0+1)) - Z(T\wedge t_0)] + \mathbf{1}_{\{T\le t_0\}}[Z(T\wedge(t_0+1)) - Z(T\wedge t_0)]\big| \\ &= \big|\mathbf{1}_{\{T\ge t_0+1\}}[Z(t_0+1) - Z(t_0)] + \mathbf{1}_{\{T\le t_0\}}[Z(T) - Z(T)]\big| \\ &= \mathbf{1}_{\{T\ge t_0+1\}}|Z(t_0+1) - Z(t_0)|. \end{aligned} \]
Now consider $T\le t_0$ and $T\ge t_0+1$ separately.

• In the case when $T\le t_0$, it is straightforward that $|\tilde{Z}(t_0+1)-\tilde{Z}(t_0)| = \mathbf{1}_{\{T\ge t_0+1\}}|Z(t_0+1)-Z(t_0)| = 0 \le c$.

• Consider the case when $T\ge t_0+1$. By the definition of $T$, we know that $\{T\ge t_0+1\} = \big\{\inf\{t\ge 0 : Y(t) > 0\} \ge t_0+1\big\} \subseteq \bigcap_{\tau=0}^{t_0}\{Y(\tau)\le 0\} \subseteq \bigcap_{\tau=0}^{t_0}\{|Z(\tau+1)-Z(\tau)|\le c\}$, where the last inclusion follows from the fact that $\{|Z(\tau+1)-Z(\tau)| > c\} \subseteq \{Y(\tau) > 0\}$. That is, when $T\ge t_0+1$, we must have $|Z(\tau+1)-Z(\tau)|\le c$ for all $\tau\in\{0,1,\ldots,t_0\}$, which further implies that $|Z(t_0+1)-Z(t_0)|\le c$. Thus, when $T\ge t_0+1$, $|\tilde{Z}(t_0+1)-\tilde{Z}(t_0)| = \mathbf{1}_{\{T\ge t_0+1\}}|Z(t_0+1)-Z(t_0)| \le c$.

Combining the two cases proves $|\tilde{Z}(t_0+1)-\tilde{Z}(t_0)| \le c$.

Since $\tilde{Z}(t)$ is a supermartingale with differences bounded by $c$ and $\tilde{Z}(0) = Z(0) = 0$, by the conventional Hoeffding-Azuma inequality, for any $z > 0$, we have
\[ \Pr(\tilde{Z}(t)\ge z) \le e^{-z^2/(2tc^2)}. \tag{5.29} \]
Finally, we have
\[ \begin{aligned} \Pr(Z(t)\ge z) &= \Pr(\tilde{Z}(t) = Z(t), Z(t)\ge z) + \Pr(\tilde{Z}(t)\ne Z(t), Z(t)\ge z) \\ &\le \Pr(\tilde{Z}(t)\ge z) + \Pr(\tilde{Z}(t)\ne Z(t)) \\ &\overset{(a)}{\le} e^{-z^2/(2tc^2)} + \Pr\Big(\bigcup_{\tau=0}^{t-1}\{Y(\tau) > 0\}\Big) \\ &\overset{(b)}{\le} e^{-z^2/(2tc^2)} + \sum_{\tau=0}^{t-1}p(\tau), \end{aligned} \]
where (a) follows from equation (5.29) and item 2 above; and (b) follows from the union bound and the hypothesis that $\Pr(Y(\tau) > 0) \le p(\tau), \forall\tau$.

5.5.6 Proof of Theorem 5.4

Define $Z(0) = 0$ and $Z(t) = \sum_{\tau=1}^t\sum_{k=1}^m Q_k(\tau) g_k^\tau(\mathbf{x}^*)$. Recall $\mathcal{W}(0) = \{\emptyset,\Omega\}$ and $\mathcal{W}(t) = \sigma(\omega(1),\ldots,\omega(t)), \forall t\ge 1$. The next lemma shows that for any $c > 0$, $Z(t)$ satisfies Lemma 5.9 with $\mathcal{F}(t) = \mathcal{W}(t)$ and $Y(t) = \|\mathbf{Q}(t+1)\| - \frac{c}{G}$.

Lemma 5.13. Let $\mathbf{x}^*\in\mathcal{X}_0$ be any solution satisfying $\tilde{\mathbf{g}}(\mathbf{x}^*)\le\mathbf{0}$, e.g., $\mathbf{x}^* = \operatorname{argmin}_{\mathbf{x}\in\mathcal{X}}\{\sum_{t=1}^T f^t(\mathbf{x})\}$. Let $c > 0$ be an arbitrary constant.
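Under Algorithm 5.1, if we define $Z(0) = 0$ and
\[ Z(t) = \sum_{\tau=1}^t\sum_{k=1}^m Q_k(\tau) g_k^\tau(\mathbf{x}^*), \quad \forall t\ge 1, \]
then $\{Z(t), t\ge 0\}$ is a supermartingale adapted to filtration $\{\mathcal{W}(t), t\ge 0\}$ such that
\[ \{|Z(t+1)-Z(t)| > c\} \subseteq \{Y(t) > 0\}, \quad \forall t\ge 0, \]
where $Y(t) = \|\mathbf{Q}(t+1)\| - \frac{c}{G}$ is a random variable adapted to $\mathcal{W}(t)$. (Note that $G$ is a constant defined in Assumption 5.1.)

Proof. It is easy to see that $\{Z(t), t\ge 0\}$ is adapted to $\{\mathcal{W}(t), t\ge 0\}$. It remains to show $\{Z(t), t\ge 0\}$ is a supermartingale. Note that $Z(t+1) = Z(t) + \sum_{k=1}^m Q_k(t+1) g_k^{t+1}(\mathbf{x}^*)$ and
\[ \begin{aligned} \mathbb{E}[Z(t+1) \mid \mathcal{W}(t)] &= \mathbb{E}\Big[Z(t) + \sum_{k=1}^m Q_k(t+1) g_k^{t+1}(\mathbf{x}^*) \,\Big|\, \mathcal{W}(t)\Big] \\ &\overset{(a)}{=} Z(t) + \sum_{k=1}^m Q_k(t+1)\,\mathbb{E}[g_k^{t+1}(\mathbf{x}^*)] \\ &\overset{(b)}{\le} Z(t), \end{aligned} \]
where (a) follows from the fact that $Z(t)\in\mathcal{W}(t)$, $\mathbf{Q}(t+1)\in\mathcal{W}(t)$ and $\mathbf{g}^{t+1}(\mathbf{x}^*)$ is independent of $\mathcal{W}(t)$; and (b) follows from $\mathbb{E}[g_k^{t+1}(\mathbf{x}^*)] = \tilde{g}_k(\mathbf{x}^*) \le 0$, which further follows because $\omega(t)$ are i.i.d. samples. Thus, $\{Z(t), t\ge 0\}$ is a supermartingale.

We further note that
\[ |Z(t+1)-Z(t)| = \Big|\sum_{k=1}^m Q_k(t+1) g_k^{t+1}(\mathbf{x}^*)\Big| \overset{(a)}{\le} \|\mathbf{Q}(t+1)\|G, \]
where (a) follows from the Cauchy-Schwarz inequality and the assumption that $\|\mathbf{g}^t(\mathbf{x}^*)\| \le G$. This implies that if $|Z(t+1)-Z(t)| > c$, then $\|\mathbf{Q}(t+1)\| > \frac{c}{G}$. Thus, $\{|Z(t+1)-Z(t)| > c\} \subseteq \{\|\mathbf{Q}(t+1)\| > \frac{c}{G}\}$. Since $\mathbf{Q}(t+1)$ is adapted to $\mathcal{W}(t)$, it follows that $Y(t) = \|\mathbf{Q}(t+1)\| - \frac{c}{G}$ is a random variable adapted to $\mathcal{W}(t)$.

By Lemma 5.13, $Z(t)$ satisfies Lemma 5.9. Fix $T\ge 1$. Lemma 5.9 implies that
\[ \Pr\Big(\sum_{t=1}^T\sum_{k=1}^m Q_k(t) g_k^t(\mathbf{x}^*) \ge \gamma\Big) \le \underbrace{e^{-\gamma^2/(2Tc^2)}}_{\text{(I)}} + \underbrace{\sum_{t=0}^{T-1}\Pr\Big(\|\mathbf{Q}(t+1)\| > \frac{c}{G}\Big)}_{\text{(II)}}. \tag{5.30} \]
Fix $0<\lambda<1$. In the following, we shall choose $\gamma$ and $c$ such that both term (I) and term (II) in (5.30) are no larger than $\frac{\lambda}{2}$.

Recall that by Lemma 5.7, random process $\tilde{Z}(t) = \|\mathbf{Q}(t)\|$ satisfies the conditions in Lemma 5.5 with $\delta_{\max} = G+\sqrt{m}D_2R$, $\zeta = \frac{\epsilon}{2}$ and $\theta = \frac{\epsilon}{2}t_0 + (G+\sqrt{m}D_2R)t_0 + \frac{2\alpha R^2}{t_0\epsilon} + \frac{2VD_1R + (G+\sqrt{m}D_2R)^2}{\epsilon}$.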
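To guarantee term (II) is no larger than $\frac{\lambda}{2}$, it suffices to choose $c$ such that
\[ \Pr\Big(\|\mathbf{Q}(t)\| > \frac{c}{G}\Big) \le \frac{\lambda}{2T}, \quad \forall t\in\{1,2,\ldots,T\}. \]
By part 2 of Lemma 5.5 (with $\mu = \frac{\lambda}{2T}$), the above inequality holds if we choose
\[ c = \frac{\epsilon}{2}t_0G + 2t_0(G+\sqrt{m}D_2R)G + \frac{2\alpha R^2}{t_0\epsilon}G + \frac{2VD_1R + (G+\sqrt{m}D_2R)^2}{\epsilon}G + t_0\frac{8(G+\sqrt{m}D_2R)^2}{\epsilon}\log\Big(\frac{32(G+\sqrt{m}D_2R)^2}{\epsilon^2}\Big)G + t_0\frac{8(G+\sqrt{m}D_2R)^2}{\epsilon}\log\Big(\frac{2T}{\lambda}\Big)G, \]
where $t_0 > 0$ is an arbitrary integer.

Once $c$ is chosen, we further need to choose $\gamma$ such that term (I) in (5.30) is $\frac{\lambda}{2}$. It follows that if
\[ \begin{aligned} \gamma &= \sqrt{2T}\log^{0.5}\Big(\frac{2}{\lambda}\Big)c \\ &= \sqrt{2T}\log^{0.5}\Big(\frac{2}{\lambda}\Big)\Big[\frac{\epsilon}{2}t_0G + 2t_0(G+\sqrt{m}D_2R)G + \frac{2\alpha R^2}{t_0\epsilon}G + \frac{2VD_1R + (G+\sqrt{m}D_2R)^2}{\epsilon}G \\ &\qquad + t_0\frac{8(G+\sqrt{m}D_2R)^2}{\epsilon}\log\Big(\frac{32(G+\sqrt{m}D_2R)^2}{\epsilon^2}\Big)G + t_0\frac{8(G+\sqrt{m}D_2R)^2}{\epsilon}\log\Big(\frac{2T}{\lambda}\Big)G\Big], \end{aligned} \]
then term (I) is equal to $\frac{\lambda}{2}$. Thus, we have
\[ \Pr\Big(\sum_{t=1}^T\sum_{k=1}^m Q_k(t) g_k^t(\mathbf{x}^*) \ge \gamma\Big) \le \lambda, \]
which further implies
\[ \Pr\Big(\sum_{t=1}^T\sum_{k=1}^m Q_k(t) g_k^t(\mathbf{x}^*) \le \gamma\Big) \ge 1-\lambda. \tag{5.31} \]
Note that if we take $t_0 = \lceil\sqrt{T}\rceil$, $V = \sqrt{T}$ and $\alpha = T$, then $\gamma = O\big(T\log(T)\log^{0.5}(\frac{1}{\lambda})\big) + O\big(T\log^{1.5}(\frac{1}{\lambda})\big) = O\big(T\log(T)\log^{1.5}(\frac{1}{\lambda})\big)$.

By Lemma 5.8 (with $\mathbf{z} = \mathbf{x}^*$, $V = \sqrt{T}$ and $\alpha = T$), we have
\[ \sum_{t=1}^T f^t(\mathbf{x}(t)) \le \sum_{t=1}^T f^t(\mathbf{x}^*) + \sqrt{T}R^2 + \frac{D_1^2}{4}\sqrt{T} + \frac{1}{2}[G+\sqrt{m}D_2R]^2\sqrt{T} + \frac{1}{\sqrt{T}}\sum_{t=1}^T\sum_{k=1}^m Q_k(t) g_k^t(\mathbf{x}^*). \tag{5.32} \]
Substituting (5.31) into (5.32) yields
\[ \Pr\Big(\sum_{t=1}^T f^t(\mathbf{x}(t)) \le \sum_{t=1}^T f^t(\mathbf{x}^*) + O\big(\sqrt{T}\log(T)\log^{1.5}(\tfrac{1}{\lambda})\big)\Big) \ge 1-\lambda. \]

Chapter 6

Online Convex Optimization with Long Term Constraints

This chapter focuses on "online convex optimization with long term constraints", which is a special case of the online convex optimization with stochastic constraints considered in Chapter 5, in which $g_k^t(\mathbf{x}) \equiv g_k(\mathbf{x})$ are perfectly known and do not depend on time. Ideally, this problem can be solved by Zinkevich's online gradient descent given by
\[ \mathbf{x}(t+1) = \mathcal{P}_{\mathcal{X}}\big[\mathbf{x}(t) - \gamma\nabla f^t(\mathbf{x}(t))\big], \tag{6.1} \]
where $\mathcal{P}_{\mathcal{X}}[\cdot]$ represents the projection onto the convex set $\mathcal{X} = \{\mathbf{x}\in\mathcal{X}_0 : g_k(\mathbf{x})\le 0, k\in\{1,2,\ldots,m\}\}$ and $\gamma$ is the step size, also known as the learning rate.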
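As a rough illustration of update (6.1), the Python sketch below assumes the simplest setting in which $\mathcal{X}$ is a box $[lo, hi]^n$, so that the projection reduces to coordinate-wise clipping. The function names, the box bounds and the linear losses in the usage example are hypothetical choices made for illustration only.

```python
def project_box(x, lo, hi):
    """Euclidean projection onto the box [lo, hi]^n (coordinate-wise clipping)."""
    return [min(max(xi, lo), hi) for xi in x]


def online_gradient_descent(grad_fns, x0, lo, hi, gamma):
    """Run update (6.1) with X = [lo, hi]^n.

    grad_fns[t] maps x(t) to the gradient of the round-t loss, which is
    only revealed after x(t) has been played. Returns [x(0), ..., x(T)].
    """
    x = project_box(x0, lo, hi)
    history = [x]
    for grad in grad_fns:
        g = grad(x)
        # Gradient step followed by projection back onto the box.
        x = project_box([xi - gamma * gi for xi, gi in zip(x, g)], lo, hi)
        history.append(x)
    return history


if __name__ == "__main__":
    # Hypothetical linear losses f^t(x) = <theta, x> with a fixed theta.
    theta = [1.0, -1.0]
    grads = [lambda x: theta for _ in range(3)]
    print(online_gradient_descent(grads, [0.0, 0.0], -1.0, 1.0, 0.5))
```

With a general constraint set the clipping step would have to be replaced by a solver for the projection problem, which is the per-round cost this chapter aims to avoid.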
In the case when $\mathcal{X}$ is a simple set, such as when the functional constraints $g_k(\mathbf{x})$ are absent and $\mathcal{X}_0$ is a multidimensional box, the projection $\mathcal{P}_{\mathcal{X}}[\cdot]$ often has a closed form solution or enjoys low complexity. However, if $\mathcal{X}$ is complicated, e.g., the functional inequality constraints $g_k(\mathbf{x})\le 0$ are complicated, then equation (6.1) requires us to solve the following convex program:
\[ \begin{aligned} \min\quad & \|\mathbf{x} - [\mathbf{x}(t) - \gamma\nabla f^t(\mathbf{x}(t))]\|^2 && (6.2) \\ \text{s.t.}\quad & g_k(\mathbf{x})\le 0, \ \forall k\in\{1,2,\ldots,m\} && (6.3) \\ & \mathbf{x}\in\mathcal{X}_0\subseteq\mathbb{R}^n && (6.4) \end{aligned} \]
which can impose a heavy computation and/or storage burden at each round. For instance, the interior point method (or other Newton-type methods) is an iterative algorithm and takes a number of iterations to approach the solution to the above convex program. The computation and memory space complexity at each iteration is between $O(n^2)$ and $O(n^3)$, where $n$ is the dimension of $\mathbf{x}$.

To circumvent the computational challenge of the projection operator, online convex optimization with long term constraints was first considered in [MJY12]. In online convex optimization with long term constraints, complicated functional constraints $g_k(\mathbf{x})\le 0$ are relaxed to be soft long term constraints. That is, we do not require $\mathbf{x}(t)\in\mathcal{X}_0$ to satisfy $g_k(\mathbf{x}(t))\le 0$ at each round, but only require that $\sum_{t=1}^T g_k(\mathbf{x}(t))$, called the constraint violations, grows sub-linearly. [MJY12] proposes two algorithms such that one achieves $O(\sqrt{T})$ regret and $O(T^{3/4})$ constraint violations; and the other achieves $O(T^{2/3})$ for both regret and constraint violations when set $\mathcal{X}$ can be represented by linear constraints. [JHA16] recently extends the algorithm of [MJY12] to achieve $O(T^{\max\{\beta,1-\beta\}})$ regret and $O(T^{1-\beta/2})$ constraint violations where $\beta\in(0,1)$ is a user-defined tradeoff parameter. By choosing $\beta = 1/2$ or $\beta = 2/3$, the $[O(\sqrt{T}), O(T^{3/4})]$ or $[O(T^{2/3}), O(T^{2/3})]$ regret and constraint violations of [MJY12] are recovered. It is easy to observe that the best regret or constraint violations in [JHA16] are $O(\sqrt{T})$ under different $\beta$ values.
However, the algorithm of [JHA16] cannot achieve $O(\sqrt{T})$ regret and $O(\sqrt{T})$ constraint violations simultaneously.

As discussed in Chapter 5, Algorithm 5.1 can solve online convex optimization with long term constraints and achieves $O(\sqrt{T})$ regret and $O(\sqrt{T})$ constraint violations. This chapter proposes a new algorithm that can achieve $O(\sqrt{T})$ regret and finite constraint violations that do not grow with $T$; and hence yields improved performance in comparison to prior works [MJY12, JHA16] and our own Algorithm 5.1. This new algorithm is also closely related to the new Lagrangian dual methods with $O(1/t)$ convergence developed in Chapter 3 for deterministic constrained convex programs. The results in this chapter were originally developed in our technical report [YN16b].

Many engineering problems can be directly formulated as online convex optimization with long term constraints. For example, problems with energy or monetary constraints often define these in terms of long term time averages rather than instantaneous constraints. In general, we assume that instantaneous constraints are incorporated into the set $\mathcal{X}_0$; and long term constraints are represented via functional constraints $g_k(\mathbf{x})$. Two example problems are given as follows. More examples can be found in [MJY12] and [JHA16].

• In the application of online display advertising [GT11a, GMPV09], the publisher needs to iteratively allocate "impressions" to advertisers to optimize some online concave utilities for each advertiser. The utility is typically unknown when the decision is made but can be inferred later by observing user click behaviors under the given allocations. Since each advertiser usually specifies a certain budget for a period, the "impressions" should be allocated to maximize advertisers' long term utilities subject to long term budget constraints.
• In the application of network routing in a neutral or adversarial environment, the decision maker needs to iteratively make routing decisions to maximize network utilities. Furthermore, link quality can vary after each routing decision is made. The routing decisions should satisfy the long term flow conservation constraint at each intermediate node so that queues do not overflow.

6.1 Problem Statement and New Algorithm

This section introduces the problem of online convex optimization with long term constraints and presents our new algorithm.

6.1.1 Problem Statement

Let $\mathcal{X}_0$ be a compact convex set and $g_k(\mathbf{x}), k\in\{1,2,\ldots,m\}$ be continuous convex functions. Denote the stacked vector of multiple functions $g_1(\mathbf{x}),\ldots,g_m(\mathbf{x})$ as $\mathbf{g}(\mathbf{x}) = [g_1(\mathbf{x}),\ldots,g_m(\mathbf{x})]^T$. Define $\mathcal{X} = \{\mathbf{x}\in\mathcal{X}_0 : g_k(\mathbf{x})\le 0, k\in\{1,2,\ldots,m\}\}$. Let $f^t(\mathbf{x})$ be a sequence of continuous convex loss functions which are determined by nature (or by an adversary) such that $f^t(\mathbf{x})$ is unknown to the decision maker until the end of round $t$. For any sequence $\mathbf{x}(t)$ yielded by an online algorithm, define $\sum_{t=1}^T f^t(\mathbf{x}(t)) - \min_{\mathbf{x}\in\mathcal{X}}\sum_{t=1}^T f^t(\mathbf{x})$ as the regret and $\sum_{t=1}^T g_k(\mathbf{x}(t)), k\in\{1,2,\ldots,m\}$ as the constraint violations. The goal of online convex optimization with long term constraints is to choose $\mathbf{x}(t)\in\mathcal{X}_0$ for each round $t$ such that both the regret and the constraint violations grow sub-linearly with respect to $T$. Throughout this chapter, we consider online convex optimization with long term constraints satisfying the following assumptions:

Assumption 6.1 (Basic Assumptions).
• The loss functions have bounded gradients on $\mathcal{X}_0$. That is, there exists $D > 0$ such that $\|\nabla f^t(\mathbf{x})\|\le D$ for all $\mathbf{x}\in\mathcal{X}_0$ and all $t$.
• There exists a constant $\beta$ such that $\|\mathbf{g}(\mathbf{x}) - \mathbf{g}(\mathbf{y})\| \le \beta\|\mathbf{x}-\mathbf{y}\|$ for all $\mathbf{x},\mathbf{y}\in\mathcal{X}_0$, i.e., $\mathbf{g}(\mathbf{x})$ is Lipschitz continuous with modulus $\beta$.
• There exists a constant $G$ such that $\|\mathbf{g}(\mathbf{x})\|\le G$ for all $\mathbf{x}\in\mathcal{X}_0$.
• There exists a constant $R$ such that $\|\mathbf{x}-\mathbf{y}\|\le R$ for all $\mathbf{x},\mathbf{y}\in\mathcal{X}_0$.
Note that the existence of $G$ follows directly from the compactness of set $\mathcal{X}_0$ and the continuity of $\mathbf{g}(\mathbf{x})$. The existence of $R$ follows directly from the compactness of set $\mathcal{X}_0$.

Assumption 6.2 (Interior Point Assumption). There exist $\epsilon > 0$ and $\hat{\mathbf{x}}\in\mathcal{X}_0$ such that $g_k(\hat{\mathbf{x}})\le -\epsilon$ for all $k\in\{1,2,\ldots,m\}$.

6.1.2 New Algorithm

Define $\tilde{\mathbf{g}}(\mathbf{x}) = c\,\mathbf{g}(\mathbf{x})$ where $c > 0$ is an algorithm parameter. Note that each $\tilde{g}_k(\mathbf{x})$ is still a convex function and $\tilde{\mathbf{g}}(\mathbf{x})\le\mathbf{0}$ if and only if $\mathbf{g}(\mathbf{x})\le\mathbf{0}$. The next lemma trivially follows.

Lemma 6.1. If online convex optimization with long term constraints satisfies Assumptions 6.1 and 6.2, then
• $\|\tilde{\mathbf{g}}(\mathbf{x}) - \tilde{\mathbf{g}}(\mathbf{y})\| \le c\beta\|\mathbf{x}-\mathbf{y}\|$ for all $\mathbf{x},\mathbf{y}\in\mathcal{X}_0$.
• $\|\tilde{\mathbf{g}}(\mathbf{x})\| \le cG$ for all $\mathbf{x}\in\mathcal{X}_0$.
• $\tilde{g}_k(\hat{\mathbf{x}}) \le -c\epsilon$ for all $k\in\{1,2,\ldots,m\}$ where $\hat{\mathbf{x}}$ is defined in Assumption 6.2.

Now consider the algorithm described in Algorithm 6.1. This algorithm chooses $\mathbf{x}(t+1)$ as the decision for round $t+1$ based on $f^t(\cdot)$ without knowing the cost function $f^{t+1}(\cdot)$. The remainder of this chapter shows that if the parameters $c$ and $\alpha$ are chosen to satisfy $c = T^{1/4}$ and $\alpha = \frac{1}{2}(\beta^2+1)\sqrt{T}$, then Algorithm 6.1 achieves an $O(\sqrt{T})$ regret bound with finite constraint violations.

This algorithm introduces a virtual queue vector for constraint functions. The update equation of this virtual queue vector is similar to Algorithms 3.1 and 3.2 developed in Chapter 3 for deterministic convex programs (with a fixed and known objective function); the main difference is that scaled versions of the constraint functions, rather than the original constraint functions, are used in the virtual queue update equation. In fact, scaling the constraint functions with a factor $c = T^{1/4}$ is the key to achieving finite constraint violations, and Algorithm 6.1 can only achieve $O(\sqrt{T})$ constraint violations without the scaling factor $c$.
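To make the two updates of Algorithm 6.1 concrete, the following Python sketch runs them under simplifying assumptions that are not part of the general algorithm: the constraints are linear, $\mathbf{g}(\mathbf{x}) = \mathbf{A}\mathbf{x}-\mathbf{b}$, and $\mathcal{X}_0$ is a box $[lo, hi]^n$. In that case the primal step can be computed with the projected-gradient form stated in Lemma 6.2, and the projection is coordinate-wise clipping. The function name and argument layout are hypothetical.

```python
def run_algorithm_6_1(A, b, grad_fns, x0, lo, hi, c, alpha):
    """Sketch of Algorithm 6.1 for g(x) = A x - b and X0 = [lo, hi]^n.

    grad_fns[t] returns the gradient of f^t at a point. Returns the
    decisions [x(0), ..., x(T)] and the queues [Q(0), ..., Q(T)].
    """
    m, n = len(A), len(x0)

    def g_tilde(x):
        # Scaled constraints g~(x) = c * (A x - b).
        return [c * (sum(A[k][j] * x[j] for j in range(n)) - b[k]) for k in range(m)]

    x = [min(max(v, lo), hi) for v in x0]
    Q = [max(-gk, 0.0) for gk in g_tilde(x)]  # Q_k(0) = max{-g~_k(x(0)), 0}
    xs, Qs = [x], [Q]
    for grad in grad_fns:
        gt = g_tilde(x)
        w = [Q[k] + gt[k] for k in range(m)]  # weights Q_k(t) + g~_k(x(t))
        gf = grad(x)
        # d(t) = grad f^t(x(t)) + sum_k w_k * grad g~_k, where grad g~_k = c * A[k].
        d = [gf[j] + sum(w[k] * c * A[k][j] for k in range(m)) for j in range(n)]
        # Primal update as a projected step (valid because g is linear here).
        x = [min(max(x[j] - d[j] / (2.0 * alpha), lo), hi) for j in range(n)]
        gn = g_tilde(x)
        # Q_k(t+1) = max{-g~_k(x(t+1)), Q_k(t) + g~_k(x(t+1))}
        Q = [max(-gn[k], Q[k] + gn[k]) for k in range(m)]
        xs.append(x)
        Qs.append(Q)
    return xs, Qs
```

A caller following the parameter choice in the text would pass `c = T ** 0.25` and `alpha = 0.5 * (beta ** 2 + 1) * T ** 0.5`, where `beta` is a Lipschitz modulus of $\mathbf{g}$ (for linear constraints, any upper bound on the spectral norm of $\mathbf{A}$ would do).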
Algorithm 6.1 New Algorithm for Online Convex Optimization with Long Term Constraints
Let $c, \alpha > 0$ be constant parameters. Define function $\tilde{\mathbf{g}}(\mathbf{x}) = c\,\mathbf{g}(\mathbf{x})$ for all $\mathbf{x}\in\mathcal{X}_0$. Choose any $\mathbf{x}(0)\in\mathcal{X}_0$. Initialize $Q_k(0) = \max\{-\tilde{g}_k(\mathbf{x}(0)), 0\}, \forall k\in\{1,2,\ldots,m\}$. At the end of each round $t\in\{0,1,2,3,\ldots\}$, observe $f^t(\cdot)$ and do the following:
• Choose $\mathbf{x}(t+1)$ that solves
\[ \min_{\mathbf{x}\in\mathcal{X}_0}\Big\{[\nabla f^t(\mathbf{x}(t))]^T[\mathbf{x}-\mathbf{x}(t)] + [\mathbf{Q}(t) + \tilde{\mathbf{g}}(\mathbf{x}(t))]^T\tilde{\mathbf{g}}(\mathbf{x}) + \alpha\|\mathbf{x}-\mathbf{x}(t)\|^2\Big\} \]
as the decision for the next round $t+1$, where $\nabla f^t(\mathbf{x}(t))$ is the gradient of $f^t(\mathbf{x})$ at point $\mathbf{x} = \mathbf{x}(t)$.
• Update virtual queue vector $\mathbf{Q}(t+1)$ via
\[ Q_k(t+1) = \max\{-\tilde{g}_k(\mathbf{x}(t+1)),\, Q_k(t) + \tilde{g}_k(\mathbf{x}(t+1))\}, \quad \forall k\in\{1,2,\ldots,m\}. \]

The update for $\mathbf{x}(t+1)$ is different from the primal update in Algorithm 3.1 but is closely related to the primal update in Algorithm 3.2. In fact, if $\mathbf{g}(\mathbf{x})$ are linear functions, Lemma 6.2 shows that the update of $\mathbf{x}(t)$ in Algorithm 6.1 can also be implemented by a convex projection involving the gradients of the scaled constraint functions.

Because of the $\|\mathbf{x}-\mathbf{x}(t)\|^2$ term, the choice of $\mathbf{x}(t+1)$ in Algorithm 6.1 involves minimization of a strongly convex function. If the constraint functions $\mathbf{g}(\mathbf{x})$ are separable (or equivalently, $\tilde{\mathbf{g}}(\mathbf{x})$ are separable) with respect to components or blocks of $\mathbf{x}$, e.g., $\mathbf{g}(\mathbf{x}) = \mathbf{A}\mathbf{x}-\mathbf{b}$, then the primal updates for $\mathbf{x}(t+1)$ can be decomposed into several smaller independent subproblems, each of which only involves a component or block of $\mathbf{x}(t+1)$. The next lemma further shows that the update of $\mathbf{x}(t+1)$ follows a simple gradient update in the case when $\mathbf{g}(\mathbf{x})$ is linear.

Lemma 6.2. If $\mathbf{g}(\mathbf{x})$ is linear, then the update of $\mathbf{x}(t+1)$ at each round in Algorithm 6.1 is given by
\[ \mathbf{x}(t+1) = \mathcal{P}_{\mathcal{X}_0}\Big[\mathbf{x}(t) - \frac{1}{2\alpha}\mathbf{d}(t)\Big], \]
where $\mathbf{d}(t) = \nabla f^t(\mathbf{x}(t)) + \sum_{k=1}^m [Q_k(t) + \tilde{g}_k(\mathbf{x}(t))]\nabla\tilde{g}_k(\mathbf{x}(t))$.

Proof. Fix $t\in\{0,1,\ldots\}$. Note that $\mathbf{d}(t)$ is a constant vector in the update of $\mathbf{x}(t+1)$.
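The projection operator can be interpreted as an optimization problem as follows:
\[ \begin{aligned} &\mathbf{x}(t+1) = \mathcal{P}_{\mathcal{X}_0}\Big[\mathbf{x}(t) - \frac{1}{2\alpha}\mathbf{d}(t)\Big] \\ \overset{(a)}{\Leftrightarrow}\ &\mathbf{x}(t+1) = \operatorname*{argmin}_{\mathbf{x}\in\mathcal{X}_0}\Big\|\mathbf{x} - \Big[\mathbf{x}(t) - \frac{1}{2\alpha}\mathbf{d}(t)\Big]\Big\|^2 \\ \Leftrightarrow\ &\mathbf{x}(t+1) = \operatorname*{argmin}_{\mathbf{x}\in\mathcal{X}_0}\Big[\|\mathbf{x}-\mathbf{x}(t)\|^2 + \frac{1}{\alpha}\mathbf{d}^T(t)[\mathbf{x}-\mathbf{x}(t)] + \frac{1}{4\alpha^2}\|\mathbf{d}(t)\|^2\Big] \\ \overset{(b)}{\Leftrightarrow}\ &\mathbf{x}(t+1) = \operatorname*{argmin}_{\mathbf{x}\in\mathcal{X}_0}\Big[\sum_{k=1}^m [Q_k(t) + \tilde{g}_k(\mathbf{x}(t))]\tilde{g}_k(\mathbf{x}(t)) + \mathbf{d}^T(t)[\mathbf{x}-\mathbf{x}(t)] + \alpha\|\mathbf{x}-\mathbf{x}(t)\|^2\Big] \\ \overset{(c)}{\Leftrightarrow}\ &\mathbf{x}(t+1) = \operatorname*{argmin}_{\mathbf{x}\in\mathcal{X}_0}\Big[[\nabla f^t(\mathbf{x}(t))]^T[\mathbf{x}-\mathbf{x}(t)] + \sum_{k=1}^m [Q_k(t) + \tilde{g}_k(\mathbf{x}(t))]\tilde{g}_k(\mathbf{x}(t)) \\ &\qquad\qquad + \sum_{k=1}^m [Q_k(t) + \tilde{g}_k(\mathbf{x}(t))][\nabla\tilde{g}_k(\mathbf{x}(t))]^T[\mathbf{x}-\mathbf{x}(t)] + \alpha\|\mathbf{x}-\mathbf{x}(t)\|^2\Big] \\ \overset{(d)}{\Leftrightarrow}\ &\mathbf{x}(t+1) = \operatorname*{argmin}_{\mathbf{x}\in\mathcal{X}_0}\Big[[\nabla f^t(\mathbf{x}(t))]^T[\mathbf{x}-\mathbf{x}(t)] + [\mathbf{Q}(t) + \tilde{\mathbf{g}}(\mathbf{x}(t))]^T\tilde{\mathbf{g}}(\mathbf{x}) + \alpha\|\mathbf{x}-\mathbf{x}(t)\|^2\Big] \end{aligned} \]
where (a) follows from the definition of the projection onto a convex set; (b) follows from the fact that the minimizing solution does not change when we remove the constant term $\frac{1}{4\alpha^2}\|\mathbf{d}(t)\|^2$, multiply by the positive constant $\alpha$ and add the constant term $\sum_{k=1}^m [Q_k(t) + \tilde{g}_k(\mathbf{x}(t))]\tilde{g}_k(\mathbf{x}(t))$ in the objective function; (c) follows from the definition of $\mathbf{d}(t)$; and (d) follows from the identity $[\mathbf{Q}(t) + \tilde{\mathbf{g}}(\mathbf{x}(t))]^T\tilde{\mathbf{g}}(\mathbf{x}) = \sum_{k=1}^m [Q_k(t) + \tilde{g}_k(\mathbf{x}(t))]\tilde{g}_k(\mathbf{x}(t)) + \sum_{k=1}^m [Q_k(t) + \tilde{g}_k(\mathbf{x}(t))][\nabla\tilde{g}_k(\mathbf{x}(t))]^T[\mathbf{x}-\mathbf{x}(t)]$ for any $\mathbf{x}\in\mathbb{R}^n$, which further follows from the linearity of $\tilde{\mathbf{g}}(\mathbf{x})$.

6.2 Regret and Constraint Violation Analysis

This section analyzes the regret and constraint violations of Algorithm 6.1 for online convex optimization with long term constraints under Assumptions 6.1-6.2.

6.2.1 Properties of the Virtual Queues and the Drift

Lemma 6.3. In Algorithm 6.1, we have
1. At each round $t\in\{0,1,2,\ldots\}$, $Q_k(t)\ge 0$ for all $k\in\{1,2,\ldots,m\}$.
2. At each round $t\in\{0,1,2,\ldots\}$, $Q_k(t) + \tilde{g}_k(\mathbf{x}(t))\ge 0$ for all $k\in\{1,2,\ldots,m\}$.
3. At round $t = 0$, $\|\mathbf{Q}(0)\|^2 \le \|\tilde{\mathbf{g}}(\mathbf{x}(0))\|^2$. At each round $t\in\{1,2,\ldots\}$, $\|\mathbf{Q}(t)\|^2 \ge \|\tilde{\mathbf{g}}(\mathbf{x}(t))\|^2$.
4. At each round $t\in\{0,1,2,\ldots\}$, $\|\mathbf{Q}(t+1)\| \le \|\mathbf{Q}(t)\| + \|\tilde{\mathbf{g}}(\mathbf{x}(t+1))\|$.

Proof. The proof of the first 3 parts is essentially identical to the proof of Lemma 3.1.
1. Fix $k\in\{1,2,\ldots,m\}$. The proof is by induction.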
Note that $Q_k(0)\ge 0$ by initialization. Assume $Q_k(t)\ge 0$ for some $t\in\{0,1,2,\ldots\}$. We now prove $Q_k(t+1)\ge 0$. If $\tilde{g}_k(\mathbf{x}(t+1))\ge 0$, the virtual queue update equation of Algorithm 6.1 gives: $Q_k(t+1) = \max\{-\tilde{g}_k(\mathbf{x}(t+1)),\, Q_k(t) + \tilde{g}_k(\mathbf{x}(t+1))\} \ge Q_k(t) + \tilde{g}_k(\mathbf{x}(t+1)) \ge 0$. On the other hand, if $\tilde{g}_k(\mathbf{x}(t+1)) < 0$, then $Q_k(t+1) = \max\{-\tilde{g}_k(\mathbf{x}(t+1)),\, Q_k(t) + \tilde{g}_k(\mathbf{x}(t+1))\} \ge -\tilde{g}_k(\mathbf{x}(t+1)) > 0$. Thus, in both cases we have $Q_k(t+1)\ge 0$.

2. Fix $k\in\{1,2,\ldots,m\}$. Note that $Q_k(0) + \tilde{g}_k(\mathbf{x}(0))\ge 0$ by the initialization rule of $Q_k(0)$. Fix $t\in\{0,1,\ldots\}$. By the virtual queue update equation, we have $Q_k(t+1) = \max\{-\tilde{g}_k(\mathbf{x}(t+1)),\, Q_k(t) + \tilde{g}_k(\mathbf{x}(t+1))\} \ge -\tilde{g}_k(\mathbf{x}(t+1))$, which implies that $Q_k(t+1) + \tilde{g}_k(\mathbf{x}(t+1))\ge 0$.

3. Note that $\|\mathbf{Q}(0)\|^2 \le \|\tilde{\mathbf{g}}(\mathbf{x}(0))\|^2$ follows by the initialization rule of $Q_k(0)$. Fix $t\in\{0,1,\ldots\}$ and $k\in\{1,2,\ldots,m\}$. If $\tilde{g}_k(\mathbf{x}(t+1))\ge 0$, then
\[ Q_k(t+1) = \max\{-\tilde{g}_k(\mathbf{x}(t+1)),\, Q_k(t) + \tilde{g}_k(\mathbf{x}(t+1))\} \ge Q_k(t) + \tilde{g}_k(\mathbf{x}(t+1)) \overset{(a)}{\ge} \tilde{g}_k(\mathbf{x}(t+1)) = |\tilde{g}_k(\mathbf{x}(t+1))|, \]
where (a) follows from part 1. On the other hand, if $\tilde{g}_k(\mathbf{x}(t+1)) < 0$, then $Q_k(t+1) = \max\{-\tilde{g}_k(\mathbf{x}(t+1)),\, Q_k(t) + \tilde{g}_k(\mathbf{x}(t+1))\} \ge -\tilde{g}_k(\mathbf{x}(t+1)) = |\tilde{g}_k(\mathbf{x}(t+1))|$. Thus, in both cases, we have $Q_k(t+1) \ge |\tilde{g}_k(\mathbf{x}(t+1))|$. Squaring both sides and summing over $k\in\{1,2,\ldots,m\}$ yields $\|\mathbf{Q}(t+1)\|^2 \ge \|\tilde{\mathbf{g}}(\mathbf{x}(t+1))\|^2$. This holds for all $t\in\{0,1,\ldots\}$. Thus, we have $\|\mathbf{Q}(t)\|^2 \ge \|\tilde{\mathbf{g}}(\mathbf{x}(t))\|^2$ for all $t\in\{1,2,\ldots\}$.

4. Fix $t\in\{0,1,\ldots\}$. Define vector $\mathbf{h} = [h_1,\ldots,h_m]^T$ by $h_k = |\tilde{g}_k(\mathbf{x}(t+1))|, \forall k\in\{1,2,\ldots,m\}$. Note that $\|\mathbf{h}\| = \|\tilde{\mathbf{g}}(\mathbf{x}(t+1))\|$. For any $k\in\{1,2,\ldots,m\}$, by the virtual queue update equation we have
\[ Q_k(t+1) = \max\{-\tilde{g}_k(\mathbf{x}(t+1)),\, Q_k(t) + \tilde{g}_k(\mathbf{x}(t+1))\} \le |Q_k(t)| + |\tilde{g}_k(\mathbf{x}(t+1))| = Q_k(t) + h_k. \]
Squaring both sides and summing over $k\in\{1,2,\ldots,m\}$ yields $\|\mathbf{Q}(t+1)\|^2 \le \|\mathbf{Q}(t) + \mathbf{h}\|^2$, which is equivalent to $\|\mathbf{Q}(t+1)\| \le \|\mathbf{Q}(t) + \mathbf{h}\|$.
Finally, by the triangle inequality $\|\mathbf{Q}(t) + \mathbf{h}\| \le \|\mathbf{Q}(t)\| + \|\mathbf{h}\|$ and recalling that $\|\mathbf{h}\| = \|\tilde{\mathbf{g}}(\mathbf{x}(t+1))\|$, we have $\|\mathbf{Q}(t+1)\| \le \|\mathbf{Q}(t)\| + \|\tilde{\mathbf{g}}(\mathbf{x}(t+1))\|$.

Lemma 6.4. Let $\mathbf{Q}(t), t\in\{0,1,\ldots\}$ be the sequence generated by Algorithm 6.1. For any $T\ge 1$, we have
\[ \sum_{t=1}^T g_k(\mathbf{x}(t)) \le \frac{1}{c}Q_k(T), \quad \forall k\in\{1,2,\ldots,m\}. \]

Proof. Fix $k\in\{1,2,\ldots,m\}$ and $T\ge 1$. For any $t\in\{0,1,\ldots,T-1\}$, the update rule of Algorithm 6.1 gives:
\[ Q_k(t+1) = \max\{-\tilde{g}_k(\mathbf{x}(t+1)),\, Q_k(t) + \tilde{g}_k(\mathbf{x}(t+1))\} \ge Q_k(t) + \tilde{g}_k(\mathbf{x}(t+1)). \]
Hence, $\tilde{g}_k(\mathbf{x}(t+1)) \le Q_k(t+1) - Q_k(t)$. Summing over $t\in\{0,\ldots,T-1\}$ yields
\[ \sum_{t=1}^T \tilde{g}_k(\mathbf{x}(t)) = \sum_{t=0}^{T-1}\tilde{g}_k(\mathbf{x}(t+1)) \le Q_k(T) - Q_k(0) \overset{(a)}{\le} Q_k(T), \]
where (a) follows from the fact that $Q_k(0)\ge 0$, i.e., part 1 in Lemma 6.3. This lemma follows by recalling that $\tilde{g}_k(\mathbf{x}) = c\,g_k(\mathbf{x})$.

Let $\mathbf{Q}(t) = [Q_1(t),\ldots,Q_m(t)]^T$ be the vector of virtual queue backlogs. Define $L(t) = \frac{1}{2}\|\mathbf{Q}(t)\|^2$. The function $L(t)$ shall be called a Lyapunov function. Define the Lyapunov drift as
\[ \Delta(t) = L(t+1) - L(t) = \frac{1}{2}[\|\mathbf{Q}(t+1)\|^2 - \|\mathbf{Q}(t)\|^2]. \tag{6.5} \]

Lemma 6.5. At each round $t\in\{0,1,2,\ldots\}$ in Algorithm 6.1, an upper bound of the Lyapunov drift is given by
\[ \Delta(t) \le [\mathbf{Q}(t)]^T\tilde{\mathbf{g}}(\mathbf{x}(t+1)) + \|\tilde{\mathbf{g}}(\mathbf{x}(t+1))\|^2. \tag{6.6} \]

Proof. The virtual queue update equations $Q_k(t+1) = \max\{-\tilde{g}_k(\mathbf{x}(t+1)),\, Q_k(t) + \tilde{g}_k(\mathbf{x}(t+1))\}, \forall k\in\{1,2,\ldots,m\}$ can be rewritten as
\[ Q_k(t+1) = Q_k(t) + h_k(\mathbf{x}(t+1)), \quad \forall k\in\{1,2,\ldots,m\}, \tag{6.7} \]
where
\[ h_k(\mathbf{x}(t+1)) = \begin{cases} \tilde{g}_k(\mathbf{x}(t+1)), & \text{if } Q_k(t) + \tilde{g}_k(\mathbf{x}(t+1)) \ge -\tilde{g}_k(\mathbf{x}(t+1)), \\ -Q_k(t) - \tilde{g}_k(\mathbf{x}(t+1)), & \text{else,} \end{cases} \quad \forall k. \]
Fix $k\in\{1,2,\ldots,m\}$.
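Squaring both sides of (6.7) and dividing by 2 yields:
\[ \begin{aligned} \frac{1}{2}[Q_k(t+1)]^2 &= \frac{1}{2}[Q_k(t)]^2 + \frac{1}{2}[h_k(\mathbf{x}(t+1))]^2 + Q_k(t)h_k(\mathbf{x}(t+1)) \\ &= \frac{1}{2}[Q_k(t)]^2 + \frac{1}{2}[h_k(\mathbf{x}(t+1))]^2 + Q_k(t)\tilde{g}_k(\mathbf{x}(t+1)) + Q_k(t)[h_k(\mathbf{x}(t+1)) - \tilde{g}_k(\mathbf{x}(t+1))] \\ &\overset{(a)}{=} \frac{1}{2}[Q_k(t)]^2 + \frac{1}{2}[h_k(\mathbf{x}(t+1))]^2 + Q_k(t)\tilde{g}_k(\mathbf{x}(t+1)) \\ &\qquad - [h_k(\mathbf{x}(t+1)) + \tilde{g}_k(\mathbf{x}(t+1))][h_k(\mathbf{x}(t+1)) - \tilde{g}_k(\mathbf{x}(t+1))] \\ &= \frac{1}{2}[Q_k(t)]^2 - \frac{1}{2}[h_k(\mathbf{x}(t+1))]^2 + Q_k(t)\tilde{g}_k(\mathbf{x}(t+1)) + [\tilde{g}_k(\mathbf{x}(t+1))]^2 \\ &\le \frac{1}{2}[Q_k(t)]^2 + Q_k(t)\tilde{g}_k(\mathbf{x}(t+1)) + [\tilde{g}_k(\mathbf{x}(t+1))]^2, \end{aligned} \]
where (a) follows from the fact that $Q_k(t)[h_k(\mathbf{x}(t+1)) - \tilde{g}_k(\mathbf{x}(t+1))] = -[h_k(\mathbf{x}(t+1)) + \tilde{g}_k(\mathbf{x}(t+1))]\cdot[h_k(\mathbf{x}(t+1)) - \tilde{g}_k(\mathbf{x}(t+1))]$, which can be shown by considering the cases $h_k(\mathbf{x}(t+1)) = \tilde{g}_k(\mathbf{x}(t+1))$ and $h_k(\mathbf{x}(t+1)) \ne \tilde{g}_k(\mathbf{x}(t+1))$ separately. Summing over $k\in\{1,2,\ldots,m\}$ yields
\[ \frac{1}{2}\|\mathbf{Q}(t+1)\|^2 \le \frac{1}{2}\|\mathbf{Q}(t)\|^2 + [\mathbf{Q}(t)]^T\tilde{\mathbf{g}}(\mathbf{x}(t+1)) + \|\tilde{\mathbf{g}}(\mathbf{x}(t+1))\|^2. \]
Rearranging the terms yields the desired result.

6.2.2 An Upper Bound of the Drift-Plus-Penalty Expression

Lemma 6.6. Consider online convex optimization with long term constraints under Assumption 6.1. Let $\mathbf{x}^*\in\mathcal{X}_0$ be any fixed solution that satisfies $\mathbf{g}(\mathbf{x}^*)\le\mathbf{0}$, e.g., $\mathbf{x}^* = \operatorname{argmin}_{\mathbf{x}\in\mathcal{X}}\sum_{t=1}^T f^t(\mathbf{x})$. Let $c > 0$ and $\eta > 0$ be arbitrary. If $\alpha \ge \frac{1}{2}(c^2\beta^2 + \eta)$ in Algorithm 6.1, then for all $t\ge 1$, we have
\[ \Delta(t) + f^t(\mathbf{x}(t)) \le f^t(\mathbf{x}^*) + \alpha[\|\mathbf{x}^*-\mathbf{x}(t)\|^2 - \|\mathbf{x}^*-\mathbf{x}(t+1)\|^2] + \frac{1}{2}[\|\tilde{\mathbf{g}}(\mathbf{x}(t+1))\|^2 - \|\tilde{\mathbf{g}}(\mathbf{x}(t))\|^2] + \frac{1}{2\eta}D^2, \]
where $\beta$ and $D$ are constants defined in Assumption 6.1.

Proof. Fix $t\ge 1$. Note that part 2 of Lemma 6.3 implies that $\mathbf{Q}(t) + \tilde{\mathbf{g}}(\mathbf{x}(t))$ is component-wise nonnegative. Hence, $[\nabla f^t(\mathbf{x}(t))]^T[\mathbf{x}-\mathbf{x}(t)] + [\mathbf{Q}(t) + \tilde{\mathbf{g}}(\mathbf{x}(t))]^T\tilde{\mathbf{g}}(\mathbf{x})$ is a convex function with respect to $\mathbf{x}$. Since $\alpha\|\mathbf{x}-\mathbf{x}(t)\|^2$ is strongly convex with respect to $\mathbf{x}$ with modulus $2\alpha$, it follows that
\[ [\nabla f^t(\mathbf{x}(t))]^T[\mathbf{x}-\mathbf{x}(t)] + [\mathbf{Q}(t) + \tilde{\mathbf{g}}(\mathbf{x}(t))]^T\tilde{\mathbf{g}}(\mathbf{x}) + \alpha\|\mathbf{x}-\mathbf{x}(t)\|^2 \]
is strongly convex with respect to $\mathbf{x}$ with modulus $2\alpha$.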
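Since $\mathbf{x}(t+1)$ is chosen to minimize the above strongly convex function, by Corollary 1.2, we have
\[ \begin{aligned} &[\nabla f^t(\mathbf{x}(t))]^T[\mathbf{x}(t+1)-\mathbf{x}(t)] + [\mathbf{Q}(t) + \tilde{\mathbf{g}}(\mathbf{x}(t))]^T\tilde{\mathbf{g}}(\mathbf{x}(t+1)) + \alpha\|\mathbf{x}(t+1)-\mathbf{x}(t)\|^2 \\ \le\ &[\nabla f^t(\mathbf{x}(t))]^T[\mathbf{x}^*-\mathbf{x}(t)] + [\mathbf{Q}(t) + \tilde{\mathbf{g}}(\mathbf{x}(t))]^T\tilde{\mathbf{g}}(\mathbf{x}^*) + \alpha\|\mathbf{x}^*-\mathbf{x}(t)\|^2 - \alpha\|\mathbf{x}^*-\mathbf{x}(t+1)\|^2. \end{aligned} \]
Adding $f^t(\mathbf{x}(t))$ on both sides yields
\[ \begin{aligned} &f^t(\mathbf{x}(t)) + [\nabla f^t(\mathbf{x}(t))]^T[\mathbf{x}(t+1)-\mathbf{x}(t)] + [\mathbf{Q}(t) + \tilde{\mathbf{g}}(\mathbf{x}(t))]^T\tilde{\mathbf{g}}(\mathbf{x}(t+1)) + \alpha\|\mathbf{x}(t+1)-\mathbf{x}(t)\|^2 \\ \le\ &f^t(\mathbf{x}(t)) + [\nabla f^t(\mathbf{x}(t))]^T[\mathbf{x}^*-\mathbf{x}(t)] + [\mathbf{Q}(t) + \tilde{\mathbf{g}}(\mathbf{x}(t))]^T\tilde{\mathbf{g}}(\mathbf{x}^*) + \alpha\|\mathbf{x}^*-\mathbf{x}(t)\|^2 - \alpha\|\mathbf{x}^*-\mathbf{x}(t+1)\|^2 \\ \overset{(a)}{\le}\ &f^t(\mathbf{x}^*) + \underbrace{[\mathbf{Q}(t) + \tilde{\mathbf{g}}(\mathbf{x}(t))]^T\tilde{\mathbf{g}}(\mathbf{x}^*)}_{\le 0} + \alpha[\|\mathbf{x}^*-\mathbf{x}(t)\|^2 - \|\mathbf{x}^*-\mathbf{x}(t+1)\|^2] \\ \overset{(b)}{\le}\ &f^t(\mathbf{x}^*) + \alpha[\|\mathbf{x}^*-\mathbf{x}(t)\|^2 - \|\mathbf{x}^*-\mathbf{x}(t+1)\|^2], \end{aligned} \]
where (a) follows from the convexity of function $f^t(\mathbf{x})$; and (b) follows by using the fact that $\tilde{g}_k(\mathbf{x}^*)\le 0$ and $Q_k(t) + \tilde{g}_k(\mathbf{x}(t))\ge 0$ (i.e., part 2 in Lemma 6.3) for all $k\in\{1,2,\ldots,m\}$ to eliminate the term marked by an underbrace.

Rearranging terms yields
\[ \begin{aligned} &f^t(\mathbf{x}(t)) + [\mathbf{Q}(t)]^T\tilde{\mathbf{g}}(\mathbf{x}(t+1)) \\ \le\ &f^t(\mathbf{x}^*) + \alpha[\|\mathbf{x}^*-\mathbf{x}(t)\|^2 - \|\mathbf{x}^*-\mathbf{x}(t+1)\|^2] - \alpha\|\mathbf{x}(t+1)-\mathbf{x}(t)\|^2 \\ &- [\nabla f^t(\mathbf{x}(t))]^T[\mathbf{x}(t+1)-\mathbf{x}(t)] - [\tilde{\mathbf{g}}(\mathbf{x}(t))]^T\tilde{\mathbf{g}}(\mathbf{x}(t+1)). \end{aligned} \tag{6.8} \]
For any $\eta > 0$, we have
\[ \begin{aligned} -[\nabla f^t(\mathbf{x}(t))]^T[\mathbf{x}(t+1)-\mathbf{x}(t)] &\overset{(a)}{\le} \|\nabla f^t(\mathbf{x}(t))\|\,\|\mathbf{x}(t+1)-\mathbf{x}(t)\| \\ &= \Big[\frac{1}{\sqrt{\eta}}\|\nabla f^t(\mathbf{x}(t))\|\Big]\big[\sqrt{\eta}\|\mathbf{x}(t+1)-\mathbf{x}(t)\|\big] \\ &\overset{(b)}{\le} \frac{1}{2\eta}\|\nabla f^t(\mathbf{x}(t))\|^2 + \frac{1}{2}\eta\|\mathbf{x}(t+1)-\mathbf{x}(t)\|^2 \\ &\overset{(c)}{\le} \frac{1}{2\eta}D^2 + \frac{1}{2}\eta\|\mathbf{x}(t+1)-\mathbf{x}(t)\|^2, \end{aligned} \tag{6.9} \]
where (a) follows from the Cauchy-Schwarz inequality; (b) follows from the basic inequality $ab\le\frac{1}{2}(a^2+b^2), \forall a,b\in\mathbb{R}$; and (c) follows from Assumption 6.1.

Recall that $\mathbf{u}_1^T\mathbf{u}_2 = \frac{1}{2}[\|\mathbf{u}_1\|^2 + \|\mathbf{u}_2\|^2 - \|\mathbf{u}_1-\mathbf{u}_2\|^2]$ for any $\mathbf{u}_1,\mathbf{u}_2\in\mathbb{R}^m$. Thus, we have
\[ [\tilde{\mathbf{g}}(\mathbf{x}(t))]^T\tilde{\mathbf{g}}(\mathbf{x}(t+1)) = \frac{1}{2}\big[\|\tilde{\mathbf{g}}(\mathbf{x}(t))\|^2 + \|\tilde{\mathbf{g}}(\mathbf{x}(t+1))\|^2 - \|\tilde{\mathbf{g}}(\mathbf{x}(t+1)) - \tilde{\mathbf{g}}(\mathbf{x}(t))\|^2\big]. \]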
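(6.10)

Substituting (6.9) and (6.10) into (6.8) yields
\[ \begin{aligned} &f^t(\mathbf{x}(t)) + [\mathbf{Q}(t)]^T\tilde{\mathbf{g}}(\mathbf{x}(t+1)) \\ \le\ &f^t(\mathbf{x}^*) + \alpha[\|\mathbf{x}^*-\mathbf{x}(t)\|^2 - \|\mathbf{x}^*-\mathbf{x}(t+1)\|^2] + \Big(\frac{1}{2}\eta - \alpha\Big)\|\mathbf{x}(t+1)-\mathbf{x}(t)\|^2 + \frac{1}{2\eta}D^2 \\ &+ \frac{1}{2}\|\tilde{\mathbf{g}}(\mathbf{x}(t+1)) - \tilde{\mathbf{g}}(\mathbf{x}(t))\|^2 - \frac{1}{2}\|\tilde{\mathbf{g}}(\mathbf{x}(t))\|^2 - \frac{1}{2}\|\tilde{\mathbf{g}}(\mathbf{x}(t+1))\|^2 \\ \overset{(a)}{\le}\ &f^t(\mathbf{x}^*) + \alpha[\|\mathbf{x}^*-\mathbf{x}(t)\|^2 - \|\mathbf{x}^*-\mathbf{x}(t+1)\|^2] + \Big(\frac{1}{2}c^2\beta^2 + \frac{1}{2}\eta - \alpha\Big)\|\mathbf{x}(t+1)-\mathbf{x}(t)\|^2 + \frac{1}{2\eta}D^2 \\ &- \frac{1}{2}\|\tilde{\mathbf{g}}(\mathbf{x}(t))\|^2 - \frac{1}{2}\|\tilde{\mathbf{g}}(\mathbf{x}(t+1))\|^2 \\ \overset{(b)}{\le}\ &f^t(\mathbf{x}^*) + \alpha[\|\mathbf{x}^*-\mathbf{x}(t)\|^2 - \|\mathbf{x}^*-\mathbf{x}(t+1)\|^2] + \frac{1}{2\eta}D^2 - \frac{1}{2}\|\tilde{\mathbf{g}}(\mathbf{x}(t))\|^2 - \frac{1}{2}\|\tilde{\mathbf{g}}(\mathbf{x}(t+1))\|^2, \end{aligned} \tag{6.11} \]
where (a) follows because $\|\tilde{\mathbf{g}}(\mathbf{x}(t+1)) - \tilde{\mathbf{g}}(\mathbf{x}(t))\| \le c\beta\|\mathbf{x}(t+1)-\mathbf{x}(t)\|$, which further follows from Lemma 6.1; and (b) follows because $\alpha\ge\frac{1}{2}(c^2\beta^2+\eta)$.

By Lemma 6.5, we have
\[ \Delta(t) \le [\mathbf{Q}(t)]^T\tilde{\mathbf{g}}(\mathbf{x}(t+1)) + \|\tilde{\mathbf{g}}(\mathbf{x}(t+1))\|^2. \tag{6.12} \]
Summing (6.11) and (6.12) together yields
\[ \Delta(t) + f^t(\mathbf{x}(t)) \le f^t(\mathbf{x}^*) + \alpha[\|\mathbf{x}^*-\mathbf{x}(t)\|^2 - \|\mathbf{x}^*-\mathbf{x}(t+1)\|^2] + \frac{1}{2}[\|\tilde{\mathbf{g}}(\mathbf{x}(t+1))\|^2 - \|\tilde{\mathbf{g}}(\mathbf{x}(t))\|^2] + \frac{1}{2\eta}D^2. \]

6.2.3 Regret Analysis

Theorem 6.1. Consider online convex optimization with long term constraints under Assumption 6.1. Let $\mathbf{x}^*\in\mathcal{X}_0$ be any fixed solution that satisfies $\mathbf{g}(\mathbf{x}^*)\le\mathbf{0}$, e.g., $\mathbf{x}^* = \operatorname{argmin}_{\mathbf{x}\in\mathcal{X}}\{\sum_{t=1}^T f^t(\mathbf{x})\}$.
1. Let $c > 0$ and $\eta > 0$ be arbitrary. If $\alpha\ge\frac{1}{2}(c^2\beta^2+\eta)$ in Algorithm 6.1, then for all $T\ge 1$, we have
\[ \sum_{t=1}^T f^t(\mathbf{x}(t)) \le \sum_{t=1}^T f^t(\mathbf{x}^*) + \alpha R^2 + 2c^2G^2 + \frac{1}{2\eta}D^2T. \]
2. If $c = T^{1/4}$, $\eta = \sqrt{T}$ and $\alpha = \frac{1}{2}(\beta^2+1)\sqrt{T}$ in Algorithm 6.1, then for all $T\ge 1$, we have
\[ \sum_{t=1}^T f^t(\mathbf{x}(t)) \le \sum_{t=1}^T f^t(\mathbf{x}^*) + O(\sqrt{T}). \]

Proof. Fix $T\ge 1$. Since $\alpha\ge\frac{1}{2}(c^2\beta^2+\eta)$, by Lemma 6.6, for all $t\in\{1,2,\ldots,T\}$, we have
\[ \Delta(t) + f^t(\mathbf{x}(t)) \le f^t(\mathbf{x}^*) + \alpha[\|\mathbf{x}^*-\mathbf{x}(t)\|^2 - \|\mathbf{x}^*-\mathbf{x}(t+1)\|^2] + \frac{1}{2}[\|\tilde{\mathbf{g}}(\mathbf{x}(t+1))\|^2 - \|\tilde{\mathbf{g}}(\mathbf{x}(t))\|^2] + \frac{1}{2\eta}D^2. \]
Summing over $t\in\{1,2,\ldots,T\}$ yields
\[ \begin{aligned} &\sum_{t=1}^T\Delta(t) + \sum_{t=1}^T f^t(\mathbf{x}(t)) \\ \le\ &\sum_{t=1}^T f^t(\mathbf{x}^*) + \alpha\sum_{t=1}^T[\|\mathbf{x}^*-\mathbf{x}(t)\|^2 - \|\mathbf{x}^*-\mathbf{x}(t+1)\|^2] + \frac{1}{2}\sum_{t=1}^T[\|\tilde{\mathbf{g}}(\mathbf{x}(t+1))\|^2 - \|\tilde{\mathbf{g}}(\mathbf{x}(t))\|^2] + \frac{1}{2\eta}D^2T. \end{aligned} \]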
Recalling that $\Delta(t) = L(t+1)-L(t)$ and simplifying summations yields

$$\begin{aligned}
L(T+1) - L(1) + \sum_{t=1}^T f^t(x(t)) &\le \sum_{t=1}^T f^t(x^*) + \alpha\|x^*-x(1)\|^2 - \alpha\|x^*-x(T+1)\|^2\\
&\quad + \tfrac12\|\tilde g(x(T+1))\|^2 - \tfrac12\|\tilde g(x(1))\|^2 + \tfrac{1}{2\eta}D^2T\\
&\le \sum_{t=1}^T f^t(x^*) + \alpha\|x^*-x(1)\|^2 + \tfrac12\|\tilde g(x(T+1))\|^2 + \tfrac{1}{2\eta}D^2T.
\end{aligned}$$

Rearranging terms yields

$$\begin{aligned}
\sum_{t=1}^T f^t(x(t)) &\le \sum_{t=1}^T f^t(x^*) + \alpha\|x^*-x(1)\|^2 + \tfrac12\|\tilde g(x(T+1))\|^2 + L(1) - L(T+1) + \tfrac{1}{2\eta}D^2T\\
&\overset{(a)}{=} \sum_{t=1}^T f^t(x^*) + \alpha\|x^*-x(1)\|^2 + \tfrac12\|\tilde g(x(T+1))\|^2 + \tfrac12\|Q(1)\|^2 - \tfrac12\|Q(T+1)\|^2 + \tfrac{1}{2\eta}D^2T\\
&\overset{(b)}{\le} \sum_{t=1}^T f^t(x^*) + \alpha\|x^*-x(1)\|^2 + \tfrac12\|Q(1)\|^2 + \tfrac{1}{2\eta}D^2T\\
&\overset{(c)}{\le} \sum_{t=1}^T f^t(x^*) + \alpha R^2 + 2c^2G^2 + \tfrac{1}{2\eta}D^2T,
\end{aligned}$$

where (a) follows from the definition that $L(1)=\frac12\|Q(1)\|^2$ and $L(T+1)=\frac12\|Q(T+1)\|^2$; (b) follows because $\|Q(T+1)\|^2\ge\|\tilde g(x(T+1))\|^2$ by part 3 in Lemma 6.3; and (c) follows because $\|x^*-x(1)\|\le R$ by Assumption 6.1 and $\|Q(1)\|\le\|Q(0)\|+\|\tilde g(x(1))\|\le\|\tilde g(x(0))\|+\|\tilde g(x(1))\|\le 2cG$, which further follows from part 3 and part 4 in Lemma 6.3 and Lemma 6.1. Thus, the first part of this theorem follows.

Note that if we let $c=T^{1/4}$ and $\eta=\sqrt T$, then $\alpha=\frac12(\beta^2+1)\sqrt T\ge\frac12(c^2\beta^2+\eta)$. The second part of this theorem follows by substituting $c=T^{1/4}$, $\eta=\sqrt T$ and $\alpha=\frac12(\beta^2+1)\sqrt T$ into the first part of this theorem. Thus, we have

$$\sum_{t=1}^T f^t(x(t)) \le \sum_{t=1}^T f^t(x^*) + \tfrac12(\beta^2+1)R^2\sqrt T + 2G^2\sqrt T + \tfrac12D^2\sqrt T = \sum_{t=1}^T f^t(x^*) + O(\sqrt T).$$

6.2.4 An Upper Bound of the Virtual Queue Vector

It remains to establish a bound on constraint violations. Lemma 6.4 implies that this can be done by bounding $\|Q(t)\|$.

Lemma 6.7. Consider online convex optimization with long term constraints under Assumptions 6.1-6.2. At each round $t\in\{1,2,\ldots\}$ in Algorithm 6.1, if

$$\|Q(t)\| > cG + \frac{\alpha R^2 + DR + 2c^2G^2}{c\epsilon},$$

where $D$, $G$ and $R$ are constants defined in Assumption 6.1 and $\epsilon$ is the constant defined in Assumption 6.2, then $\|Q(t+1)\| < \|Q(t)\|$.

Proof. Let $\hat x\in\mathcal X_0$ and $\epsilon>0$ be defined in Assumption 6.2. Fix $t\ge0$.
Since $x(t+1)$ is chosen to minimize $[\nabla f^t(x(t))]^T[x-x(t)] + [Q(t)+\tilde g(x(t))]^T\tilde g(x) + \alpha\|x-x(t)\|^2$, we have

$$\begin{aligned}
&[\nabla f^t(x(t))]^T[x(t+1)-x(t)] + [Q(t)+\tilde g(x(t))]^T\tilde g(x(t+1)) + \alpha\|x(t+1)-x(t)\|^2\\
&\le [\nabla f^t(x(t))]^T[\hat x-x(t)] + [Q(t)+\tilde g(x(t))]^T\tilde g(\hat x) + \alpha\|\hat x-x(t)\|^2\\
&\overset{(a)}{\le} [\nabla f^t(x(t))]^T[\hat x-x(t)] - c\epsilon\sum_{k=1}^m[Q_k(t)+\tilde g_k(x(t))] + \alpha\|\hat x-x(t)\|^2\\
&\overset{(b)}{\le} [\nabla f^t(x(t))]^T[\hat x-x(t)] - c\epsilon\|Q(t)+\tilde g(x(t))\| + \alpha\|\hat x-x(t)\|^2\\
&\overset{(c)}{\le} [\nabla f^t(x(t))]^T[\hat x-x(t)] - c\epsilon\big[\|Q(t)\| - \|\tilde g(x(t))\|\big] + \alpha\|\hat x-x(t)\|^2,
\end{aligned}$$

where (a) follows because $\tilde g_k(\hat x)\le -c\epsilon,\ \forall k\in\{1,2,\ldots,m\}$ by Lemma 6.1 and $Q_k(t)+\tilde g_k(x(t))\ge0,\ \forall k\in\{1,2,\ldots,m\}$ by part 2 in Lemma 6.3; (b) follows from the basic inequality $\sum_{i=1}^m a_i\ge\sqrt{\sum_{i=1}^m a_i^2}$ for any nonnegative vector $a\ge0$; and (c) follows from the triangle inequality $\|x-y\|\ge\|x\|-\|y\|,\ \forall x,y\in\mathbb R^m$. Rearranging terms yields

$$\begin{aligned}
[Q(t)]^T\tilde g(x(t+1)) &\le -c\epsilon\|Q(t)\| + c\epsilon\|\tilde g(x(t))\| + \alpha\|\hat x-x(t)\|^2 - \alpha\|x(t+1)-x(t)\|^2\\
&\quad + [\nabla f^t(x(t))]^T[\hat x-x(t)] - [\nabla f^t(x(t))]^T[x(t+1)-x(t)] - [\tilde g(x(t))]^T\tilde g(x(t+1))\\
&\le -c\epsilon\|Q(t)\| + c\epsilon\|\tilde g(x(t))\| + \alpha\|\hat x-x(t)\|^2 + [\nabla f^t(x(t))]^T[\hat x-x(t+1)] - [\tilde g(x(t))]^T\tilde g(x(t+1))\\
&\overset{(a)}{\le} -c\epsilon\|Q(t)\| + c\epsilon\|\tilde g(x(t))\| + \alpha\|\hat x-x(t)\|^2 + \|\nabla f^t(x(t))\|\,\|\hat x-x(t+1)\| + \|\tilde g(x(t))\|\,\|\tilde g(x(t+1))\|\\
&\overset{(b)}{\le} -c\epsilon\|Q(t)\| + c^2\epsilon G + \alpha R^2 + DR + c^2G^2,
\end{aligned}\tag{6.13}$$

where (a) follows from the Cauchy-Schwarz inequality and (b) follows from Assumption 6.1 and Lemma 6.1.

By Lemma 6.5, we have

$$\begin{aligned}
\Delta(t) &\le [Q(t)]^T\tilde g(x(t+1)) + \|\tilde g(x(t+1))\|^2\\
&\overset{(a)}{\le} [Q(t)]^T\tilde g(x(t+1)) + c^2G^2\\
&\overset{(b)}{\le} -c\epsilon\|Q(t)\| + c^2\epsilon G + \alpha R^2 + DR + 2c^2G^2,
\end{aligned}$$

where (a) follows from Lemma 6.1 and (b) follows from (6.13).

Thus, if $\|Q(t)\| > cG + \frac{\alpha R^2+DR+2c^2G^2}{c\epsilon}$, then $\Delta(t)<0$. That is, $\|Q(t+1)\| < \|Q(t)\|$.

Corollary 6.1. Consider online convex optimization with long term constraints under Assumptions 6.1-6.2. At each round $t\in\{1,2,\ldots\}$ in Algorithm 6.1,

$$\|Q(t)\| \le 2cG + \frac{\alpha R^2+DR+2c^2G^2}{c\epsilon},$$

where $D$, $G$ and $R$ are constants defined in Assumption 6.1 and $\epsilon>0$ is defined in Assumption 6.2.

Proof.
Note that $\|Q(0)\|\overset{(a)}{\le}\|\tilde g(x(0))\|\overset{(b)}{\le}cG$ and $\|Q(1)\|\overset{(a)}{\le}\|Q(0)\|+\|\tilde g(x(1))\|\overset{(b)}{\le}2cG\le 2cG+\frac{\alpha R^2+DR+2c^2G^2}{c\epsilon}$, where (a) follows from Lemma 6.3 and (b) follows from Lemma 6.1.

We need to show $\|Q(t)\|\le 2cG+\frac{\alpha R^2+DR+2c^2G^2}{c\epsilon}$ for all rounds $t\ge2$. This can be proven by contradiction as follows: Assume that $\|Q(t)\|>2cG+\frac{\alpha R^2+DR+2c^2G^2}{c\epsilon}$ happens at some round $t\ge2$. Let $\tau$ be the first (smallest) round index at which this happens, i.e., $\|Q(\tau)\|>2cG+\frac{\alpha R^2+DR+2c^2G^2}{c\epsilon}$. Note that $\tau\ge2$ since we know $\|Q(1)\|\le 2cG+\frac{\alpha R^2+DR+2c^2G^2}{c\epsilon}$. The definition of $\tau$ implies that $\|Q(\tau-1)\|\le 2cG+\frac{\alpha R^2+DR+2c^2G^2}{c\epsilon}$. Now consider the value of $\|Q(\tau-1)\|$ in two cases.

• If $\|Q(\tau-1)\|>cG+\frac{\alpha R^2+DR+2c^2G^2}{c\epsilon}$, then by Lemma 6.7, we must have $\|Q(\tau)\|<\|Q(\tau-1)\|\le 2cG+\frac{\alpha R^2+DR+2c^2G^2}{c\epsilon}$. This contradicts the definition of $\tau$.

• If $\|Q(\tau-1)\|\le cG+\frac{\alpha R^2+DR+2c^2G^2}{c\epsilon}$, then by part 4 in Lemma 6.3, we must have $\|Q(\tau)\|\le\|Q(\tau-1)\|+\|\tilde g(x(\tau))\|\overset{(a)}{\le}cG+\frac{\alpha R^2+DR+2c^2G^2}{c\epsilon}+cG = 2cG+\frac{\alpha R^2+DR+2c^2G^2}{c\epsilon}$, where (a) follows from Lemma 6.1. This also contradicts the definition of $\tau$.

In both cases, we have a contradiction. Thus, $\|Q(t)\|\le 2cG+\frac{\alpha R^2+DR+2c^2G^2}{c\epsilon}$ for all rounds $t\ge2$.

6.2.5 Constraint Violation Analysis

Theorem 6.2. Consider online convex optimization with long term constraints under Assumptions 6.1-6.2. Let $D$, $\beta$, $G$, $R$ and $\epsilon$ be constants defined in Assumptions 6.1-6.2. The following are ensured by Algorithm 6.1:

1. For all $T\ge1$, we have
$$\sum_{t=1}^T g_k(x(t)) \le 2G + \frac{\alpha R^2+DR+2c^2G^2}{c^2\epsilon}.$$

2. If $c=T^{1/4}$ and $\alpha=\frac12(\beta^2+1)\sqrt T$ in Algorithm 6.1, then for all $T\ge1$, we have
$$\sum_{t=1}^T g_k(x(t)) \le 2G + \frac{\frac12(\beta^2+1)R^2+2G^2+DR}{\epsilon},\quad\forall k\in\{1,2,\ldots,m\}.$$

Proof. Fix $T\ge1$ and $k\in\{1,2,\ldots,m\}$. By Lemma 6.4, we have

$$\sum_{t=1}^T g_k(x(t)) \le \frac1c Q_k(T) \le \frac1c\|Q(T)\| \overset{(a)}{\le} 2G + \frac{\alpha R^2+DR+2c^2G^2}{c^2\epsilon},$$

where (a) follows from Corollary 6.1. Thus, the first part of this theorem follows.
The second part of this theorem follows by substituting $c=T^{1/4}$ and $\alpha=\frac12(\beta^2+1)\sqrt T$ into the last inequality:

$$\begin{aligned}
\sum_{t=1}^T g_k(x(t)) &\le 2G + \frac{\frac12(\beta^2+1)\sqrt T R^2 + DR + 2\sqrt T G^2}{\sqrt T\,\epsilon}\\
&\le 2G + \frac{\frac12(\beta^2+1)R^2 + 2G^2 + DR\,T^{-1/2}}{\epsilon}\\
&\le 2G + \frac{\frac12(\beta^2+1)R^2 + 2G^2 + DR}{\epsilon}.
\end{aligned}$$

6.2.6 Practical Implementations

The finite constraint violation bound proven in Theorem 6.2 is in terms of the constants $D$, $G$, $R$ and $\epsilon$ defined in Assumptions 6.1-6.2. However, the implementation of Algorithm 6.1 only requires the knowledge of $\beta$, which is known to us since the constraint function $g(x)$ does not change. In contrast, the algorithms developed in [MJY12] and [JHA16] have parameters that must be chosen based on the knowledge of $D$, which is usually unknown and can be difficult to estimate in an online optimization scenario.

6.3 Extensions

This section extends the analysis in the previous section by considering an intermediate and an unknown time horizon $T$.

6.3.1 Intermediate Time Horizon T

Note that parts 1 of Theorems 6.1 and 6.2 hold for any $T$. For large $T$, choosing $c=T^{1/4}$ and $\alpha=\frac12(\beta^2+1)\sqrt T$ yields the $O(\sqrt T)$ regret bound and finite constraint violations as proven in parts 2 of both theorems. For intermediate $T$, the constant factor hidden in the $O(\sqrt T)$ bound can be important and the finite constraint violation bound can be relatively large. If the parameters in Assumptions 6.1-6.2 are known, we can obtain the best regret and constraint violation bounds by choosing $c$ and $\alpha$ as the solution to the following geometric program¹:

$$\begin{aligned}
\min_{\eta,c,\alpha,z}\quad & z\\
\text{s.t.}\quad & \alpha R^2 + 2c^2G^2 + \tfrac{1}{2\eta}D^2T \le z,\\
& 2G + \frac{\alpha R^2+DR+2c^2G^2}{c^2\epsilon} \le z,\\
& \tfrac12(\beta^2c^2+\eta) \le \alpha,\\
& \eta,c,\alpha,z>0.
\end{aligned}$$

¹ By dividing the first two constraints by $z$ and dividing the third constraint by $\alpha$ on both sides, this geometric program can be written in the standard form of geometric programs. Geometric programs can be reformulated into convex programs and can be efficiently solved. See [BKVH07] for more discussion of geometric programs.
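The two quantities traded off by the geometric program above are the right-hand sides of part 1 of Theorems 6.1 and 6.2, so candidate parameters can be compared by evaluating them directly. The following is a minimal sketch rather than a solver: the function names are hypothetical, and any numerical values passed for $D$, $G$, $R$, $\beta$ and $\epsilon$ are illustrative assumptions, not constants taken from this thesis.

```python
import math


def regret_bound(T, alpha, c, eta, D, R, G):
    """Right-hand side of Theorem 6.1, part 1 (excess over the comparator's loss)."""
    return alpha * R**2 + 2 * c**2 * G**2 + D**2 * T / (2 * eta)


def violation_bound(alpha, c, D, R, G, eps):
    """Right-hand side of Theorem 6.2, part 1."""
    return 2 * G + (alpha * R**2 + D * R + 2 * c**2 * G**2) / (c**2 * eps)


def is_admissible(alpha, c, eta, beta, rel_tol=1e-12):
    """Check alpha >= (c^2 beta^2 + eta) / 2, with a small slack for rounding."""
    if c <= 0 or eta <= 0:
        return False
    return alpha >= 0.5 * (c**2 * beta**2 + eta) * (1 - rel_tol)


def large_horizon_parameters(T, beta):
    """Parameter choice from part 2 of both theorems: (c, eta, alpha)."""
    root_T = math.sqrt(T)
    c = math.sqrt(root_T)  # T^(1/4)
    return c, root_T, 0.5 * (beta**2 + 1) * root_T


if __name__ == "__main__":
    # Illustrative constants only (assumptions, not values from the thesis).
    T, beta, D, R, G, eps = 16, 1.0, 1.0, 1.0, 1.0, 1.0
    c, eta, alpha = large_horizon_parameters(T, beta)
    print(is_admissible(alpha, c, eta, beta))
    print(regret_bound(T, alpha, c, eta, D, R, G))
    print(violation_bound(alpha, c, D, R, G, eps))
```

A coarse grid over $(c,\eta)$ with $\alpha$ set to its smallest admissible value gives a quick sanity check on the output of a proper geometric programming solver.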
In certain applications, we can choose $c$ and $\alpha$ to minimize the regret bound subject to the constraint violation guarantee by solving the following geometric program:

$$\begin{aligned}
\min_{\eta,c,\alpha}\quad & \alpha R^2 + 2c^2G^2 + \tfrac{1}{2\eta}D^2T\\
\text{s.t.}\quad & 2G + \frac{\alpha R^2+DR+2c^2G^2}{c^2\epsilon} \le z_0,\\
& \tfrac12(\beta^2c^2+\eta) \le \alpha,\\
& \eta,c,\alpha>0,
\end{aligned}$$

where $z_0>0$ is a constant that specifies the maximum allowed constraint violation. Alternatively, we can consider the problem of minimizing the constraint violation subject to the regret bound guarantee.

6.3.2 Unknown Time Horizon T

To achieve $O(\sqrt T)$ regret and finite constraint violations, the parameters $c$ and $\alpha$ in Algorithm 6.1 depend on the time horizon $T$. In the case when $T$ is unknown, we can use the classical "doubling trick" to achieve $O(\sqrt T)$ regret and $O(\log_2 T)$ constraint violations.

Suppose we have an online convex optimization algorithm $\mathcal A$ whose parameters depend on the time horizon. In the case when the time horizon $T$ is unknown, the general doubling trick [CBL06, SS11] is described in Algorithm 6.2. It is known that the doubling trick can preserve the order of algorithm $\mathcal A$'s regret bound in the case when the time horizon $T$ is unknown. The next theorem summarizes that by using the "doubling trick" for Algorithm 6.1 with unknown time horizon $T$, we can achieve $O(\sqrt T)$ regret and $O(\log_2 T)$ constraint violations.

Algorithm 6.2 The Doubling Trick [CBL06, SS11]
• Let algorithm $\mathcal A$ be an algorithm whose parameters depend on the time horizon. Let $i=1$.
• Repeat until we reach the end of the time horizon
  – Run algorithm $\mathcal A$ for $2^i$ rounds by using $2^i$ as the time horizon.
  – Let $i=i+1$.

Theorem 6.3. If the time horizon $T$ is unknown, then applying Algorithm 6.1 with the "doubling trick" can yield $O(\sqrt T)$ regret and $O(\log_2 T)$ constraint violations.

Proof. Let $T$ be the unknown time horizon. Define each iteration in the doubling trick as a period. Since the $i$-th period consists of $2^i$ rounds, we have in total $\lceil\log_2 T\rceil$ periods, where $\lceil x\rceil$ denotes the smallest integer no less than $x$.
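The period structure used in this proof can be made concrete with a short schedule generator. This is only a sketch of the bookkeeping in Algorithm 6.2: the name `doubling_periods` is hypothetical, the last period is truncated when the (unknown to the learner) horizon ends, and restarting the underlying algorithm with horizon $2^i$ at the start of each period is left to the caller.

```python
def doubling_periods(total_rounds):
    """Yield (i, assumed_horizon, rounds_played) for each period of the doubling trick.

    Period i tunes the underlying algorithm for a horizon of 2**i rounds; the
    final period is cut short if the true horizon ends before it completes.
    """
    i, played = 1, 0
    while played < total_rounds:
        horizon = 2 ** i
        rounds = min(horizon, total_rounds - played)
        yield i, horizon, rounds
        played += rounds
        i += 1


if __name__ == "__main__":
    for period in doubling_periods(10):
        print(period)
```

For a true horizon of 10 rounds, the schedule consists of three periods tuned for horizons 2, 4 and 8, with the last one stopped after 4 rounds.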
1. The proof of $O(\sqrt T)$ regret is almost identical to the classical proof. By Theorem 6.1, there exists a constant $C$ such that the regret in the $i$-th period is at most $C\sqrt{2^i}$. Thus, the total regret is at most
$$\sum_{i=1}^{\lceil\log_2 T\rceil} C\sqrt{2^i} = \frac{C\sqrt2\,\big[1-\sqrt2^{\,\lceil\log_2 T\rceil}\big]}{1-\sqrt2} = \frac{\sqrt2C}{\sqrt2-1}\big[\sqrt2^{\,\lceil\log_2 T\rceil}-1\big] \le \frac{\sqrt2C}{\sqrt2-1}\sqrt2^{\,1+\log_2 T} \le \frac{2C}{\sqrt2-1}\sqrt T.$$
Thus, the regret bound is $O(\sqrt T)$ when using the "doubling trick".

2. The proof of $O(\log_2 T)$ constraint violations is simple. By Theorem 6.2, there exists a constant $C$ such that the constraint violation in the $i$-th period is at most $C$. Since we have $\lceil\log_2 T\rceil$ periods, the total constraint violation is at most $C\lceil\log_2 T\rceil$.

6.4 Chapter Summary

This chapter considers online convex optimization with long term constraints, where functional constraints are only required to be satisfied in the long term. Prior algorithms in [MJY12] can achieve $O(\sqrt T)$ regret and $O(T^{3/4})$ constraint violations for general problems and achieve $O(T^{2/3})$ bounds for both regret and constraint violations when the constraint set can be described by a finite number of linear constraints. A recent extension in [JHA16] can achieve $O(T^{\max\{\beta,1-\beta\}})$ regret and $O(T^{1-\beta/2})$ constraint violations, where $\beta\in(0,1)$ is an algorithm parameter. Algorithm 5.1 developed in Chapter 5 can achieve $O(\sqrt T)$ regret and $O(\sqrt T)$ constraint violations. This chapter proposes a new algorithm that can achieve an $O(\sqrt T)$ bound for regret and an $O(1)$ bound for constraint violations, and hence yields improved performance in comparison to prior works [MJY12, JHA16] and our own Algorithm 5.1.

Chapter 7
Power Control for Energy Harvesting Devices with Outdated State Information

Energy harvesting can enable self-sustainable and perpetual wireless devices. By harvesting energy from the environment and storing it in a battery for future use, we can significantly improve energy efficiency and device lifetime.
Harvested energy can come from solar, wind, vibrational, thermal, or even radio sources [PS05, SK11, UYE+15]. Energy harvesting has been identified as a key technology for wireless sensor networks [KHZS07], internet of things (IoT) [KMS+15], and 5G communication networks [HH15]. However, the development of harvesting algorithms is complex because the harvested energy is highly dynamic and the device environment and energy needs are also dynamic. Efficient algorithms should learn when to take energy from the battery to power device tasks that bring high utility, and when to save energy for future use.

There has been a large amount of work on developing efficient power control policies to maximize the utility of energy harvesting devices. In the highly ideal case where the future system state (both the wireless channel state and the energy harvesting state) can be perfectly predicted, optimal power control strategies that maximize the throughput of wireless systems are considered in [YU12, TY12]. In a more realistic case with only the statistics and causal knowledge of the system state, power control policies based on Markov Decision Processes (MDP) are considered in [BGD13, MSZ13]. In the case when the statistical knowledge is unavailable but the current system state is observable, work [WWW+17] develops suboptimal power control policies based on approximation algorithms.

However, there is little work on the challenging scenario where neither the distribution information nor the system state information is known. In practice, the amount of harvested energy on each slot is known to us only after it arrives and is stored in the battery. Further, the wireless environment is often unknown before the power action is chosen. For example, the wireless channel state in a communication link is measured at the receiver side and then reported back to the transmitter with a time delay.
If the fading channel varies very fast, the channel state feedback received at the transmitter can be outdated. Another example is power control for sensor nodes that detect unknown targets where the state of targets is known only after the sensing action is performed.

In this chapter, we consider utility-optimal power control in an energy harvesting wireless device with outdated state information and unknown state distribution information. This problem setup is closely related to but different from the Lyapunov opportunistic power control considered in works [GGT10, HN13, UUNS11] with instantaneous wireless channel state information. The policies developed in [GGT10, HN13, UUNS11] are allowed to adapt their power actions to the instantaneous system states on each slot, which are unavailable in our problem setup. The problem setup in this chapter is also closely related to online convex optimization where control actions are performed without knowing instantaneous system states [Zin03, CBL06, SS11]. However, existing methods for online convex learning require the control actions to be chosen from a fixed set. This does not hold in our problem since the power to be used can only be drained from the battery whose backlog is time-varying and dependent on previous actions.

In Chapter 5, we extend the conventional online convex optimization (with a fixed known action set) to a more general setup with stochastic constraints. The stochastic constraints can be used to describe the uncertainty of energy harvesting and the fact that consumed energy is no more than the harvested energy in the long term. However, the stochastic constraint does not capture the fact that the used energy at any round must be no more than what is available in the battery and the fact that no more energy can be harvested when the battery is full.
The algorithm developed in Chapter 5 for general online convex optimization with stochastic constraints only ensures that the stochastic constraint violations grow sublinearly in expectation and in high probability, which is not sufficient when used as a feasible power control algorithm for energy harvesting devices.

In this chapter, we develop a new power control algorithm for energy harvesting devices with outdated state information and show that this power control algorithm can achieve an $O(\epsilon)$ optimal utility by using a battery with capacity $O(1/\epsilon)$. The results in this chapter were originally developed in our paper [YN18a].

7.1 Problem Formulation

Consider an energy harvesting wireless device that operates in normalized time slots $t\in\{1,2,\ldots\}$. Let $\omega(t)=[e(t),s(t)]\in\Omega$ represent the system state on each slot $t$, where
• $e(t)$ is the amount of harvested energy for slot $t$ (for example, through solar, wind, radio signal, and so on).
• $s(t)$ is the wireless device state on slot $t$ (such as the vector of channel conditions over multiple subbands).
• $\Omega$ is the state space for all $\omega(t)=[e(t),s(t)]$ states.

Assume $\{\omega(t)\}_{t=1}^\infty$ evolves in an independent and identically distributed (i.i.d.) manner according to an unknown distribution. Further, the state $\omega(t)$ is unknown to the device until the end of slot $t$. The device is powered by a finite-size battery. At the beginning of each slot $t\in\{1,2,\ldots\}$, the device draws energy from the battery and allocates it as an $n$-dimensional power decision vector $p(t)=[p_1(t),\ldots,p_n(t)]^T\in\mathcal P$, where $\mathcal P$ is a compact convex set given by

$$\mathcal P = \Big\{p\in\mathbb R^n : \sum_{i=1}^n p_i\le p^{\max},\ p_i\ge0,\ \forall i\in\{1,2,\ldots,n\}\Big\}.$$

Note that $p^{\max}$ is a given positive constant (restricted by hardware) and represents the maximum total power that can be used on each slot. The device receives a corresponding utility $U(p(t);\omega(t))$. Since $p(t)$ is chosen without knowledge of $\omega(t)$, the achieved utility is unknown until the end of slot $t$.
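The order of events within a slot (choose $p(t)\in\mathcal P$ first, observe $\omega(t)$ and the utility afterwards) can be sketched as a small driver loop. This is an illustrative sketch only: `run_slots`, `in_power_set`, the `policy` callback and the `utility` callback are hypothetical names, the utility passed in is a placeholder supplied by the caller, and the battery is deliberately not modeled here.

```python
from typing import Callable, List, Sequence, Tuple

Power = List[float]
State = Tuple[float, Sequence[float]]      # omega(t) = (e(t), s(t))
History = List[Tuple[Power, State]]


def in_power_set(p: Sequence[float], p_max: float, tol: float = 1e-12) -> bool:
    """Membership test for P = {p : sum_i p_i <= p_max, p_i >= 0}."""
    return all(x >= -tol for x in p) and sum(p) <= p_max + tol


def run_slots(
    policy: Callable[[History], Sequence[float]],
    states: Sequence[State],
    utility: Callable[[Sequence[float], State], float],
    p_max: float,
) -> float:
    """Accumulate utility when each p(t) is chosen before omega(t) is revealed."""
    history: History = []
    total = 0.0
    for omega in states:
        p = list(policy(history))          # start of slot: only past slots are visible
        if not in_power_set(p, p_max):
            raise ValueError("power vector outside the set P")
        total += utility(p, omega)         # end of slot: omega(t) and utility revealed
        history.append((p, omega))
    return total
```

Because the policy only ever receives `history`, any decision rule plugged into this loop is causal by construction.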
For each $\omega\in\Omega$, the utility function $U(p;\omega)$ is assumed to be continuous and concave over $p\in\mathcal P$. An example is:

$$U(p(t);\omega(t)) = \sum_{i=1}^n\log(1+p_i(t)s_i(t))\tag{7.1}$$

where $s(t)=[s_1(t),\ldots,s_n(t)]$ is the vector of (unknown) channel conditions over $n$ orthogonal subbands available to the wireless device. In this example, $p_i(t)$ represents the amount of power invested over subband $i$ in a rateless coding transmission scenario, and $U(p(t);\omega(t))$ is the total throughput achieved on slot $t$. We focus on fast time-varying wireless channels, e.g., communication scenarios with high mobility transceivers, where $s(t)$ known at the transmitter is outdated since $s(t)$ must be measured at the receiver side and then reported back to the transmitter with a time delay.

7.1.1 Further Examples

The above formulation admits a variety of other useful application scenarios. For example, it can be used to treat power control in cognitive radio systems. Suppose an energy limited secondary user harvests energy and operates over licensed spectrum occupied by primary users. In this case, $s(t)=[s_1(t),\ldots,s_n(t)]$ represents the channel activity of primary users over each subband. Since primary users are not controlled by the secondary user, $s(t)$ is only known to the secondary user at the end of slot $t$.

Another application is a wireless sensor system. Consider an energy harvesting sensor node that collects information by detecting an unpredictable target. In this case, $s(t)$ can be the state or action of the target on slot $t$. By using $p(t)$ power for signaling and sensing, we receive utility $U(p(t);\omega(t))$, which depends on state $\omega(t)$. For example, in a monitoring system, if the monitored target performs an action $s(t)$ that we are not interested in, then the reward $U(p(t);\omega(t))$ by using $p(t)$ is small. Note that $s(t)$ is typically unknown to us at the beginning of slot $t$ and is only disclosed to us at the end of slot $t$.

7.1.2 Basic Assumption

Assumption 7.1.
• There exists a constant $e^{\max}>0$ such that $0\le e(t)\le e^{\max},\ \forall t\in\{1,2,\ldots\}$.
• Let $\nabla_pU(p;\omega)$ denote a subgradient (or gradient if $U(p;\omega)$ is differentiable) vector of $U(p;\omega)$ with respect to $p$ and let $\frac{\partial}{\partial p_i}U(p;\omega),\ \forall i\in\{1,2,\ldots,n\}$ denote each component of vector $\nabla_pU(p;\omega)$. There exist positive constants $D_1,\ldots,D_n$ such that $|\frac{\partial}{\partial p_i}U(p;\omega)|\le D_i,\ \forall i\in\{1,2,\ldots,n\}$ for all $\omega\in\Omega$ and all $p\in\mathcal P$. This further implies there exists $D>0$, e.g., $D=\sqrt{\sum_{i=1}^nD_i^2}$, such that $\|\nabla_pU(p;\omega)\|\le D$ for all $\omega\in\Omega$ and all $p\in\mathcal P$, where $\|x\|=\sqrt{\sum_{i=1}^nx_i^2}$ is the standard $l_2$ norm.

Such constants $D_1,\ldots,D_n$ exist in most cases of interest, such as for utility functions (7.1) with bounded $s_i(t)$ values.¹ The following fact follows directly from Assumption 7.1.

¹ This is always true since $s_i(t)$ are wireless signal strength attenuations.

Fact 7.1. For each $\omega\in\Omega$, $U(p;\omega)$ is $D$-Lipschitz over $p\in\mathcal P$, i.e.,

$$|U(p_1;\omega)-U(p_2;\omega)|\le D\|p_1-p_2\|,\quad\forall p_1,p_2\in\mathcal P.$$

Proof. By the basic subgradient inequality for concave functions:

$$\begin{aligned}
U(p_2;\omega) + [\nabla_pU(p_2;\omega)]^T(p_1-p_2) &\ge U(p_1;\omega),\\
U(p_1;\omega) + [\nabla_pU(p_1;\omega)]^T(p_2-p_1) &\ge U(p_2;\omega).
\end{aligned}$$

Rearranging terms and applying the Cauchy-Schwarz inequality yields

$$\begin{aligned}
U(p_1;\omega)-U(p_2;\omega) &\le \|\nabla_pU(p_2;\omega)\|\,\|p_1-p_2\|,\\
U(p_2;\omega)-U(p_1;\omega) &\le \|\nabla_pU(p_1;\omega)\|\,\|p_1-p_2\|.
\end{aligned}$$

Combining the above inequalities and recalling that all subgradients are bounded by $D$ gives $|U(p_1;\omega)-U(p_2;\omega)|\le D\|p_1-p_2\|$.

7.1.3 Power Control and Energy Queue Model

The finite size battery can be considered as backlog in an energy queue. Let $E(0)$ be the initial energy backlog in the battery and $E(t)$ be the energy stored in the battery at the end of slot $t$. The power vector $p(t)$ must satisfy the following energy availability constraint:

$$\sum_{i=1}^np_i(t)\le E(t-1),\quad\forall t\in\{1,2,\ldots\},\tag{7.2}$$

which requires the consumed power to be no more than what is available in the battery.

Let $E^{\max}$ be the maximum capacity of the battery. If the energy availability constraint (7.2) is satisfied on each slot, the energy queue backlog $E(t)$ evolves as follows:

$$E(t)=\min\Big\{E(t-1)-\sum_{i=1}^np_i(t)+e(t),\ E^{\max}\Big\},\quad\forall t.\tag{7.3}$$

7.1.4 An Upper Bound Problem

Let $\omega(t)=[e(t),s(t)]$ be the random state vector on slot $t$. Let $\mathbb E[e]=\mathbb E[e(t)]$ denote the expected amount of new energy that arrives in one slot. Define a function $h:\mathcal P\to\mathbb R$ by $h(p)=\mathbb E[U(p;\omega(t))]$. Since $U(p;\omega)$ is concave in $p$ for all $\omega$ by Assumption 7.1 and is $D$-Lipschitz over $p\in\mathcal P$ for all $\omega$ by Fact 7.1, we know $h(p)$ is concave and continuous.

The function $h$ is typically unknown because the distribution of $\omega$ is unknown. However, to establish a fundamental bound, suppose both $h$ and $\mathbb E[e]$ are known and consider choosing a fixed vector $p$ to solve the following deterministic problem:

$$\begin{aligned}
\max_p\quad & h(p) &&(7.4)\\
\text{s.t.}\quad & \sum_{i=1}^np_i-\mathbb E[e]\le0 &&(7.5)\\
& p\in\mathcal P &&(7.6)
\end{aligned}$$

where constraint (7.5) requires that the consumed energy is no more than $\mathbb E[e]$. Let $p^*$ be an optimal solution of problem (7.4)-(7.6) and $U^*$ be its corresponding utility value of (7.4). Define a causal policy as one that, on each slot $t$, selects $p(t)\in\mathcal P$ based only on information up to the start of slot $t$ (in particular, without knowledge of $\omega(t)$). Since $\omega(t)$ is i.i.d. over slots, any causal policy must have $p(t)$ and $\omega(t)$ independent for all $t$. The next lemma shows that no causal policy $p(t),t\in\{1,2,\ldots\}$ satisfying (7.2)-(7.3) can attain a better utility than $U^*$.

Lemma 7.1. Let $p(t)\in\mathcal P,t\in\{1,2,\ldots\}$ be yielded by any causal policy that consumes no more energy than it harvests in the long term, so $\limsup_{T\to\infty}\frac1T\sum_{t=1}^T\mathbb E[\sum_{i=1}^np_i(t)]\le\mathbb E[e]$. Then,

$$\limsup_{T\to\infty}\frac1T\sum_{t=1}^T\mathbb E[U(p(t);\omega(t))]\le U^*.$$

Proof. Fix a slot $t\in\{1,2,\ldots\}$. Then

$$\mathbb E[U(p(t);\omega(t))]\overset{(a)}{=}\mathbb E\big[\mathbb E[U(p(t);\omega(t))\mid p(t)]\big]\overset{(b)}{=}\mathbb E[h(p(t))]\tag{7.7}$$

where (a) holds by iterated expectations; (b) holds because $p(t)$ and $\omega(t)$ are independent (by causality).

For each $T>0$ define $\bar p(T)=[\bar p_1(T),\ldots,\bar p_n(T)]^T$ with

$$\bar p_i(T)=\frac1T\sum_{t=1}^T\mathbb E[p_i(t)],\quad\forall i\in\{1,2,\ldots,n\}.$$
We know by assumption that:

$$\limsup_{T\to\infty}\sum_{i=1}^n\bar p_i(T)\le\mathbb E[e].\tag{7.8}$$

Further, since $p(t)\in\mathcal P$ for all slots $t$, it holds that $\bar p(T)\in\mathcal P$ for all $T>0$. Also,

$$\frac1T\sum_{t=1}^T\mathbb E[U(p(t);\omega(t))]\overset{(a)}{=}\frac1T\sum_{t=1}^T\mathbb E[h(p(t))]\overset{(b)}{\le}h\Big(\mathbb E\Big[\frac1T\sum_{t=1}^Tp(t)\Big]\Big)=h(\bar p(T))$$

where (a) holds by (7.7); (b) holds by Jensen's inequality for the concave function $h$. It follows that:

$$\limsup_{T\to\infty}\frac1T\sum_{t=1}^T\mathbb E[U(p(t);\omega(t))]\le\limsup_{T\to\infty}h(\bar p(T)).$$

Define $\theta=\limsup_{T\to\infty}h(\bar p(T))$. It suffices to show that $\theta\le U^*$. Since $\bar p(T)$ is in the compact set $\mathcal P$ for all $T>0$ and $h$ is continuous, the Bolzano-Weierstrass theorem ensures there is a subsequence of times $T_k$ such that $\bar p(T_k)$ converges to a fixed vector $p_0\in\mathcal P$ and $h(\bar p(T_k))$ converges to $\theta$ as $k\to\infty$:

$$\lim_{k\to\infty}\bar p(T_k)=p_0\in\mathcal P,\qquad\lim_{k\to\infty}h(\bar p(T_k))=\theta.$$

Continuity of $h$ implies that $h(p_0)=\theta$. By (7.8) the vector $p_0=[p_{0,1},\ldots,p_{0,n}]^T$ must satisfy $\sum_{i=1}^np_{0,i}\le\mathbb E[e]$. Hence, $p_0$ is a vector that satisfies constraints (7.5)-(7.6) and achieves utility $h(p_0)=\theta$. Since $U^*$ is defined as the optimal utility value to problem (7.4)-(7.6), it holds that $\theta\le U^*$.

Note that the $U^*$ utility upper bound of Lemma 7.1 holds for any policy that consumes no more energy than it harvests in the long term. Policies that satisfy the physical battery constraints (7.2)-(7.3) certainly consume no more energy than harvested in the long term. However, Lemma 7.1 even holds for policies that violate these physical battery constraints. For example, $U^*$ is still a valid bound for a policy that is allowed to "borrow" energy from an external power source when its battery is empty and "return" energy when its battery is full.

7.2 New Algorithm

This section proposes a new learning aided dynamic power control algorithm that chooses power control actions based on system history, without requiring the current system state or its probability distribution.

7.2.1 New Algorithm

The new dynamic power control algorithm is described in Algorithm 7.1.
At the end of slot $t$, Algorithm 7.1 chooses $p(t+1)$ based on $\omega(t)$ without requiring $\omega(t+1)$. To enable these decisions, the algorithm introduces a (non-positive) virtual battery queue process $Q(t)\le0$, which shall later be shown to be related to a shifted version of the physical battery queue $E(t)$.

Algorithm 7.1 New Power Control Algorithm for Energy Harvesting Devices with Outdated State Information
Let $V>0$ be a constant algorithm parameter. Initialize virtual battery queue variable $Q(0)=0$. Choose $p(1)=[0,0,\ldots,0]^T$ as the power action at slot 1. At the end of each slot $t\in\{1,2,\ldots\}$, observe $\omega(t)=[e(t),s(t)]$ and do the following:
• Update virtual battery queue: Update $Q(t)$ via
$$Q(t)=\min\Big\{Q(t-1)+e(t)-\sum_{i=1}^np_i(t),\ 0\Big\}.\tag{7.9}$$
• Power control: Choose
$$p(t+1)=\mathrm{Proj}_{\mathcal P}\Big\{p(t)+\frac1V\nabla_pU(p(t);\omega(t))+\frac{1}{V^2}Q(t)\mathbf 1\Big\}\tag{7.10}$$
as the power action for the next slot $t+1$, where $\mathrm{Proj}_{\mathcal P}\{\cdot\}$ represents the projection onto set $\mathcal P$, $\mathbf 1$ denotes a column vector of all ones and $\nabla_pU(p(t);\omega(t))$ represents a subgradient (or gradient if $U(p;\omega(t))$ is differentiable) vector of function $U(p;\omega(t))$ at point $p=p(t)$. Note that $p(t)$, $Q(t)$ and $\nabla_pU(p(t);\omega(t))$ are given constants in (7.10).

Note that Algorithm 7.1 does not explicitly enforce the energy availability constraint (7.2). With $p(t+1)$ given by (7.10), one might expect to use

$$\hat p(t+1)=\frac{\min\{\sum_{i=1}^np_i(t+1),\ E(t)\}}{\sum_{i=1}^np_i(t+1)}\,p(t+1)\tag{7.11}$$

that scales down $p(t+1)$ to enforce the energy availability constraint (7.2). However, our analysis in Section 7.3 shows that if the battery capacity is at least as large as an $O(V)$ constant, then directly using $p(t+1)$ from (7.10) is ensured to always satisfy the energy availability constraint (7.2). Thus, there is no need to take the additional step (7.11).

7.2.2 Algorithm Intuitions

Lemma 7.2.
The power control action $p(t+1)$ chosen in (7.10) solves the following quadratic convex program:

$$\begin{aligned}
\max_p\quad & V[\nabla_pU(p(t);\omega(t))]^T[p-p(t)]+Q(t)\mathbf 1^Tp-\frac{V^2}{2}\|p-p(t)\|^2 &&(7.12)\\
\text{s.t.}\quad & p\in\mathcal P &&(7.13)
\end{aligned}$$

Proof. By the definition of projection, equation (7.10) solves

$$\min_{p\in\mathcal P}\ \Big\|p-\Big[p(t)+\frac1V\nabla_pU(p(t);\omega(t))+\frac{1}{V^2}Q(t)\mathbf 1\Big]\Big\|^2.$$

By expanding the square, eliminating constant terms and converting the minimization to the maximization of its negative objective, it is easy to show this problem is equivalent to problem (7.12)-(7.13).

The convex projection (7.10), or equivalently, the quadratic convex program (7.12)-(7.13), can be easily solved. See, e.g., Lemma 4.4 in Chapter 4 for an algorithm that solves an $n$-dimensional quadratic program over set $\mathcal P$ with complexity $O(n\log n)$. Thus, the overall complexity of Algorithm 7.1 is low.

7.3 Performance Analysis of Algorithm 7.1

This section shows Algorithm 7.1 can attain an $O(\epsilon)$ close-to-optimal utility by using a battery with capacity $O(1/\epsilon)$.

7.3.1 Drift Analysis

Define $L(t)=\frac12[Q(t)]^2$ and call it a Lyapunov function. Define the Lyapunov drift as $\Delta(t)=L(t+1)-L(t)$.

Lemma 7.3. Under Algorithm 7.1, for all $t\ge0$, the Lyapunov drift satisfies

$$\Delta(t)\le Q(t)\Big[e(t+1)-\sum_{i=1}^np_i(t+1)\Big]+\frac12B\tag{7.14}$$

with constant $B=(\max\{e^{\max},p^{\max}\})^2$, where $e^{\max}$ is the constant defined in Assumption 7.1.

Proof. Fix $t\ge0$. Recall that for any $x\in\mathbb R$, if $y=\min\{x,0\}$ then $y^2\le x^2$. It follows from (7.9) that

$$[Q(t+1)]^2\le\Big[Q(t)+e(t+1)-\sum_{i=1}^np_i(t+1)\Big]^2.$$

Expanding the square on the right side, dividing both sides by 2 and rearranging terms yields $\Delta(t)\le Q(t)[e(t+1)-\sum_{i=1}^np_i(t+1)]+\frac12[e(t+1)-\sum_{i=1}^np_i(t+1)]^2$. This lemma follows by noting that $|e(t+1)-\sum_{i=1}^np_i(t+1)|\le\max\{e^{\max},p^{\max}\}$ since $0\le\sum_{i=1}^np_i(t+1)\le p^{\max}$ and $0\le e(t+1)\le e^{\max}$.

Lemma 7.4. Let $U^*$ be the utility upper bound defined in Lemma 7.1 and $p^*$ be an optimal solution to problem (7.4)-(7.6) that attains $U^*$.
At each iteration $t\in\{1,2,\ldots\}$, Algorithm 7.1 guarantees

$$V\mathbb E[U(p(t);\omega(t))]-\mathbb E[\Delta(t)]\ge VU^*+\frac{V^2}{2}\mathbb E[\Phi(t)]-\frac{D^2+B}{2}$$

where $\Phi(t)=\|p^*-p(t+1)\|^2-\|p^*-p(t)\|^2$, $D$ is the constant defined in Assumption 7.1 and $B$ is the constant defined in Lemma 7.3.

Proof. Note that $\sum_{i=1}^np_i^*\le\mathbb E[e]$. Fix $t\in\{1,2,\ldots\}$. Note that $V[\nabla_pU(p(t);\omega(t))]^T[p-p(t)]+Q(t)\sum_{i=1}^np_i$ is a linear function with respect to $p$. It follows that

$$V[\nabla_pU(p(t);\omega(t))]^T[p-p(t)]+Q(t)\sum_{i=1}^np_i-\frac{V^2}{2}\|p-p(t)\|^2\tag{7.15}$$

is strongly concave with respect to $p\in\mathcal P$ with modulus $V^2$. Since $p(t+1)$ is chosen to maximize (7.15) over all $p\in\mathcal P$, and since $p^*\in\mathcal P$, by Corollary 1.3, we have

$$\begin{aligned}
&V[\nabla_pU(p(t);\omega(t))]^T[p(t+1)-p(t)]+Q(t)\sum_{i=1}^np_i(t+1)-\frac{V^2}{2}\|p(t+1)-p(t)\|^2\\
&\ge V[\nabla_pU(p(t);\omega(t))]^T[p^*-p(t)]+Q(t)\sum_{i=1}^np_i^*-\frac{V^2}{2}\|p^*-p(t)\|^2+\frac{V^2}{2}\|p^*-p(t+1)\|^2\\
&=V[\nabla_pU(p(t);\omega(t))]^T[p^*-p(t)]+Q(t)\sum_{i=1}^np_i^*+\frac{V^2}{2}\Phi(t).
\end{aligned}$$

Subtracting $Q(t)e(t+1)$ from both sides and rearranging terms yields

$$\begin{aligned}
&V[\nabla_pU(p(t);\omega(t))]^T[p(t+1)-p(t)]+Q(t)\Big[\sum_{i=1}^np_i(t+1)-e(t+1)\Big]\\
&\ge V[\nabla_pU(p(t);\omega(t))]^T[p^*-p(t)]+Q(t)\Big[\sum_{i=1}^np_i^*-e(t+1)\Big]+\frac{V^2}{2}\Phi(t)+\frac{V^2}{2}\|p(t+1)-p(t)\|^2.
\end{aligned}$$

Adding $VU(p(t);\omega(t))$ to both sides and noting that $U(p(t);\omega(t))+[\nabla_pU(p(t);\omega(t))]^T[p^*-p(t)]\ge U(p^*;\omega(t))$ by the concavity of $U(p;\omega(t))$ yields

$$\begin{aligned}
&VU(p(t);\omega(t))+V[\nabla_pU(p(t);\omega(t))]^T[p(t+1)-p(t)]+Q(t)\Big[\sum_{i=1}^np_i(t+1)-e(t+1)\Big]\\
&\ge VU(p^*;\omega(t))+Q(t)\Big[\sum_{i=1}^np_i^*-e(t+1)\Big]+\frac{V^2}{2}\Phi(t)+\frac{V^2}{2}\|p(t+1)-p(t)\|^2.
\end{aligned}$$
Rearranging terms yields

$$\begin{aligned}
&VU(p(t);\omega(t))+Q(t)\Big[\sum_{i=1}^np_i(t+1)-e(t+1)\Big]\\
&\ge VU(p^*;\omega(t))+Q(t)\Big[\sum_{i=1}^np_i^*-e(t+1)\Big]+\frac{V^2}{2}\Phi(t)+\frac{V^2}{2}\|p(t+1)-p(t)\|^2\\
&\quad-V[\nabla_pU(p(t);\omega(t))]^T[p(t+1)-p(t)].
\end{aligned}\tag{7.16}$$

Note that

$$\begin{aligned}
V[\nabla_pU(p(t);\omega(t))]^T[p(t+1)-p(t)]&\overset{(a)}{\le}\frac12\|\nabla_pU(p(t);\omega(t))\|^2+\frac{V^2}{2}\|p(t+1)-p(t)\|^2\\
&\overset{(b)}{\le}\frac12D^2+\frac{V^2}{2}\|p(t+1)-p(t)\|^2
\end{aligned}\tag{7.17}$$

where (a) follows by using the basic inequality $x^Ty\le\frac12\|x\|^2+\frac12\|y\|^2$ for all $x,y\in\mathbb R^n$ with $x=\nabla_pU(p(t);\omega(t))$ and $y=V[p(t+1)-p(t)]$; and (b) follows from Assumption 7.1. Substituting (7.17) into (7.16) yields

$$VU(p(t);\omega(t))+Q(t)\Big[\sum_{i=1}^np_i(t+1)-e(t+1)\Big]\ge VU(p^*;\omega(t))+Q(t)\Big[\sum_{i=1}^np_i^*-e(t+1)\Big]+\frac{V^2}{2}\Phi(t)-\frac12D^2.\tag{7.18}$$

By Lemma 7.3, we have

$$-\Delta(t)\ge Q(t)\Big[\sum_{i=1}^np_i(t+1)-e(t+1)\Big]-\frac B2.\tag{7.19}$$

Summing (7.18) and (7.19), and cancelling common terms on both sides, yields

$$VU(p(t);\omega(t))-\Delta(t)\ge VU(p^*;\omega(t))+Q(t)\Big[\sum_{i=1}^np_i^*-e(t+1)\Big]+\frac{V^2}{2}\Phi(t)-\frac{D^2+B}{2}.\tag{7.20}$$

Note that each $Q(t)$ (depending only on $e(\tau),p(\tau)$ with $\tau\in\{1,2,\ldots,t\}$) is independent of $e(t+1)$. Thus,

$$\mathbb E\Big[Q(t)\Big[\sum_{i=1}^np_i^*-e(t+1)\Big]\Big]=\mathbb E[Q(t)]\,\mathbb E\Big[\sum_{i=1}^np_i^*-e(t+1)\Big]\overset{(a)}{\ge}0\tag{7.21}$$

where (a) follows because $Q(t)\le0$ and $\sum_{i=1}^np_i^*\le\mathbb E[e]$ (recall that $e(t+1)$ is an i.i.d. sample of $e$). Taking expectations on both sides of (7.20) and using (7.21) and $\mathbb E[U(p^*;\omega(t))]=U^*$ yields the desired result.

7.3.2 Utility Optimality Analysis

The next theorem summarizes that the average expected utility attained by Algorithm 7.1 is within an $O(1/V)$ distance to $U^*$ defined in Lemma 7.1.

Theorem 7.1. Let $U^*$ be the utility bound defined in Lemma 7.1. For all $t\in\{1,2,\ldots\}$, Algorithm 7.1 guarantees

$$\frac1t\sum_{\tau=1}^t\mathbb E[U(p(\tau);\omega(\tau))]\ge U^*-\frac{V(p^{\max})^2}{2t}-\frac{B}{2Vt}-\frac{D^2+B}{2V}\tag{7.22}$$

where $D$ is the constant defined in Assumption 7.1 and $B$ is the constant defined in Lemma 7.3. This implies

$$\limsup_{t\to\infty}\frac1t\sum_{\tau=1}^t\mathbb E[U(p(\tau);\omega(\tau))]\ge U^*-\frac{D^2+B}{2V}.\tag{7.23}$$

In particular, if we take $V=1/\epsilon$ in Algorithm 7.1, then

$$\frac1t\sum_{\tau=1}^t\mathbb E[U(p(\tau);\omega(\tau))]\ge U^*-O(\epsilon),\quad\forall t\ge\Omega\Big(\frac{1}{\epsilon^2}\Big).\tag{7.24}$$

Proof. Fix $t\in\{1,2,\ldots\}$. For each $\tau\in\{1,2,\ldots,t\}$, by Lemma 7.4, we have

$$\mathbb E[VU(p(\tau);\omega(\tau))]-\mathbb E[\Delta(\tau)]\ge VU^*+\frac{V^2}{2}\mathbb E[\Phi(\tau)]-\frac{D^2+B}{2}.$$

Summing over $\tau\in\{1,2,\ldots,t\}$, dividing both sides by $Vt$ and rearranging terms yields

$$\begin{aligned}
\frac1t\sum_{\tau=1}^t\mathbb E[U(p(\tau);\omega(\tau))]&\ge U^*+\frac{V}{2t}\sum_{\tau=1}^t\mathbb E[\Phi(\tau)]+\frac{1}{Vt}\sum_{\tau=1}^t\mathbb E[\Delta(\tau)]-\frac{D^2+B}{2V}\\
&\overset{(a)}{=}U^*+\frac{V}{2t}\mathbb E\big[\|p^*-p(t+1)\|^2-\|p^*-p(1)\|^2\big]+\frac{1}{2Vt}\mathbb E\big[[Q(t+1)]^2-[Q(1)]^2\big]-\frac{D^2+B}{2V}\\
&\ge U^*-\frac{V}{2t}\mathbb E[\|p^*-p(1)\|^2]-\frac{1}{2Vt}\mathbb E[[Q(1)]^2]-\frac{D^2+B}{2V}\\
&\overset{(b)}{\ge}U^*-\frac{V(p^{\max})^2}{2t}-\frac{B}{2Vt}-\frac{D^2+B}{2V}
\end{aligned}$$

where (a) follows by recalling that $\Phi(\tau)=\|p^*-p(\tau+1)\|^2-\|p^*-p(\tau)\|^2$ and $\Delta(\tau)=\frac12[Q(\tau+1)]^2-\frac12[Q(\tau)]^2$; and (b) follows because $\|p^*-p(1)\|=\|p^*\|=\sqrt{\sum_{i=1}^n(p_i^*)^2}\le\sum_{i=1}^np_i^*\le p^{\max}$ and $|Q(1)|=|Q(0)+e(1)-\sum_{i=1}^np_i(1)|=|e(1)-\sum_{i=1}^np_i(1)|\le\max\{e^{\max},p^{\max}\}=\sqrt B$, where $B$ is defined in Lemma 7.3.

So far we have proven (7.22). Equation (7.23) follows directly by taking the lim sup on both sides of (7.22). Equation (7.24) follows by substituting $V=\frac1\epsilon$ and $t=\frac{1}{\epsilon^2}$ into (7.22).

7.3.3 Lower Bound for Virtual Battery Queue Q(t)

Note that $Q(t)\le0$ by (7.9). This subsection further shows that $Q(t)$ is bounded from below. The projection $\mathrm{Proj}_{\mathcal P}\{\cdot\}$ satisfies the following lemma:

Lemma 7.5. For any $p(t)\in\mathcal P$ and vector $b\le0$, where $\le$ between two vectors means component-wise less than or equal to, $\tilde p=\mathrm{Proj}_{\mathcal P}\{p(t)+b\}$ is given by

$$\tilde p_i=\max\{p_i(t)+b_i,\ 0\},\quad\forall i\in\{1,2,\ldots,n\}.\tag{7.25}$$

Proof. Recall that the projection $\mathrm{Proj}_{\mathcal P}\{p(t)+b\}$ by definition solves

$$\begin{aligned}
\min_p\quad & \sum_{i=1}^n\big[p_i-[p_i(t)+b_i]\big]^2 &&(7.26)\\
\text{s.t.}\quad & \sum_{i=1}^np_i\le p^{\max} &&(7.27)\\
& p_i\ge0,\ \forall i\in\{1,2,\ldots,n\} &&(7.28)
\end{aligned}$$

Let $\mathcal I\subseteq\{1,2,\ldots,n\}$ be the coordinate index set given by $\mathcal I=\{i\in\{1,2,\ldots,n\}:p_i(t)+b_i<0\}$.
For any $\mathbf{p}$ such that $\sum_{i=1}^{n} p_i \leq p^{\max}$ and $p_i\geq 0,\ \forall i\in\{1,2,\ldots,n\}$, we have

\begin{align*}
\sum_{i=1}^{n}\big[p_i - [p_i(t)+b_i]\big]^{2} &= \sum_{i\in\mathcal{I}}\big[p_i - [p_i(t)+b_i]\big]^{2} + \sum_{i\in\{1,2,\ldots,n\}\setminus\mathcal{I}}\big[p_i - [p_i(t)+b_i]\big]^{2} \\
&\geq \sum_{i\in\mathcal{I}}\big[p_i - [p_i(t)+b_i]\big]^{2} \\
&\overset{(a)}{\geq} \sum_{i\in\mathcal{I}}[p_i(t)+b_i]^{2}
\end{align*}

where (a) follows because $p_i(t)+b_i<0$ for $i\in\mathcal{I}$ and $p_i\geq 0,\ \forall i\in\{1,2,\ldots,n\}$. Thus, $\sum_{i\in\mathcal{I}}[p_i(t)+b_i]^{2}$ is a lower bound on the objective value of problem (7.26)-(7.28). Note that $\tilde{\mathbf{p}}$ given by (7.25) is feasible for problem (7.26)-(7.28) since $\tilde{p}_i\geq 0,\ \forall i\in\{1,2,\ldots,n\}$ and $\sum_{i=1}^{n}\tilde{p}_i \leq \sum_{i=1}^{n} p_i(t) \leq p^{\max}$ because $\tilde{p}_i\leq p_i(t)$ for all $i$ and $\mathbf{p}(t)\in\mathcal{P}$. We further note that

\[ \sum_{i=1}^{n}\big[\tilde{p}_i - [p_i(t)+b_i]\big]^{2} = \sum_{i\in\mathcal{I}}[p_i(t)+b_i]^{2}. \]

That is, $\tilde{\mathbf{p}}$ given by (7.25) attains the objective value lower bound of problem (7.26)-(7.28) and hence is the optimal solution to problem (7.26)-(7.28). Thus, $\tilde{\mathbf{p}} = \text{Proj}_{\mathcal{P}}\{\mathbf{p}(t)+\mathbf{b}\}$.

Corollary 7.1. If $Q(t)\leq -V(D^{\max}+p^{\max})$ with $D^{\max} = \max\{D_1,\ldots,D_n\}$, then Algorithm 7.1 guarantees

\[ p_i(t+1) \leq \max\Big\{p_i(t) - \frac{1}{V}p^{\max}, 0\Big\},\quad \forall i\in\{1,2,\ldots,n\}, \]

where $D_1,\ldots,D_n$ are constants defined in Assumption 7.1.

Proof. Let $\mathbf{b} = \frac{1}{V}\nabla_{\mathbf{p}} U(\mathbf{p}(t);\omega(t)) + \frac{1}{V^{2}}Q(t)\mathbf{1}$. Since $\frac{\partial}{\partial p_i}U(\mathbf{p}(t);\omega(t)) \leq D_i,\ \forall i\in\{1,2,\ldots,n\}$ by Assumption 7.1 and $Q(t)\leq -V(D^{\max}+p^{\max})$, we know $b_i \leq -\frac{1}{V}p^{\max},\ \forall i\in\{1,2,\ldots,n\}$. By Lemma 7.5, we have

\[ p_i(t+1) = \max\{p_i(t)+b_i, 0\} \leq \max\Big\{p_i(t) - \frac{1}{V}p^{\max}, 0\Big\},\quad \forall i\in\{1,2,\ldots,n\}. \]

By Corollary 7.1, if $Q(t)\leq -V(D^{\max}+p^{\max})$, then each component of $\mathbf{p}(t+1)$ decreases by $\frac{1}{V}p^{\max}$ until it hits 0. That is, if $Q(t)\leq -V(D^{\max}+p^{\max})$ for sufficiently many slots, Algorithm 7.1 eventually chooses $\mathbf{0}$ as the power decision. By the virtual queue update equation (7.9), $Q(t)$ decreases only when $\sum_{i=1}^{n} p_i(t) > 0$. These two observations suggest that $Q(t)$ yielded by Algorithm 7.1 should eventually be bounded from below. This is formally summarized in the next theorem.

Theorem 7.2. Let V in Algorithm 7.1 be a positive integer.
Define positive constant $Q^{l}$, where superscript $l$ denotes “lower” bound, as

\[ Q^{l} = V(D^{\max} + 2p^{\max} + e^{\max}) \tag{7.29} \]

where $e^{\max}$ is the constant defined in Assumption 7.1 and $D^{\max}$ is the constant defined in Corollary 7.1. Algorithm 7.1 guarantees

\[ Q(t) \geq -Q^{l},\quad \forall t\in\{0,1,2,\ldots\}. \]

Proof. By the virtual queue update equation (7.9), we know $Q(t)$ can increase by at most $e^{\max}$ and can decrease by at most $p^{\max}$ on each slot. Since $Q(0) = 0$, we know $Q(t)\geq -Q^{l}$ for all $t\leq V$. We need to show $Q(t)\geq -Q^{l}$ for all $t>V$. This can be proven by contradiction as follows:

Assume $Q(t) < -Q^{l}$ for some $t>V$. Let $\tau>V$ be the first (smallest) slot index when this happens. By the definition of $\tau$, we have $Q(\tau) < -Q^{l}$ and

\[ Q(\tau) < Q(\tau-1). \tag{7.30} \]

Now consider the value of $Q(\tau-V)$ in two cases (note that $\tau-V>0$).

• Case $Q(\tau-V) \geq -V(D^{\max}+p^{\max}+e^{\max})$: Since $Q(t)$ can decrease by at most $p^{\max}$ on each slot, we know $Q(\tau) \geq -V(D^{\max}+2p^{\max}+e^{\max}) = -Q^{l}$. This contradicts the definition of $\tau$.

• Case $Q(\tau-V) < -V(D^{\max}+p^{\max}+e^{\max})$: Since $Q(t)$ can increase by at most $e^{\max}$ on each slot, we know $Q(t) < -V(D^{\max}+p^{\max})$ for all $\tau-V\leq t\leq\tau-1$. By Corollary 7.1, for all $\tau-V\leq t\leq\tau-1$, we have

\[ p_i(t+1) \leq \max\Big\{p_i(t) - \frac{1}{V}p^{\max}, 0\Big\},\quad \forall i\in\{1,2,\ldots,n\}. \]

Since the above inequality holds for all $t\in\{\tau-V,\tau-V+1,\ldots,\tau-1\}$, and since at the start of this interval we trivially have $p_i(\tau-V)\leq p^{\max},\ \forall i\in\{1,2,\ldots,n\}$, at each step of this interval each component of the power vector either hits zero or decreases by $\frac{1}{V}p^{\max}$, and so after the $V$ steps of this interval we have $p_i(\tau) = 0,\ \forall i\in\{1,2,\ldots,n\}$. By (7.9), we have

\begin{align*}
Q(\tau) &= \min\Big\{Q(\tau-1) + e(\tau) - \sum_{i=1}^{n} p_i(\tau),\ 0\Big\} \\
&= \min\{Q(\tau-1) + e(\tau),\ 0\} \\
&\geq \min\{Q(\tau-1),\ 0\} \\
&= Q(\tau-1)
\end{align*}

where the final equality holds because the queue is never positive (see (7.9)). This contradicts (7.30).

Both cases lead to contradictions. Thus, $Q(t)\geq -Q^{l}$ for all $t>V$.
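The bound in Theorem 7.2 can be checked numerically on a sample path. The following is only a minimal sketch under stated assumptions, not the thesis's simulation code: the utility is assumed to be $\sum_i \log(1+p_i s_i)$ (equation (7.1) is not reproduced in this section), so that each partial derivative is at most $s_i \leq 4$ and $D^{\max}=4$ is a valid constant; the two-subband channel and harvest distributions follow Section 7.4; and the helper names `project_onto_P`, `truncated_rayleigh`, `lower_bound_constant` and `simulate_min_queue` are hypothetical.

```python
import math
import random


def project_onto_P(v, p_max):
    """Euclidean projection of v onto P = {p : p_i >= 0, sum_i p_i <= p_max}."""
    clipped = [max(x, 0.0) for x in v]
    if sum(clipped) <= p_max:
        return clipped  # sum constraint inactive: clipping is the projection
    # Sum constraint active: project onto the simplex {p >= 0, sum_i p_i = p_max}
    # using the standard sort-and-threshold rule.
    shift, running = 0.0, 0.0
    for k, u in enumerate(sorted(v, reverse=True), start=1):
        running += u
        candidate = (running - p_max) / k
        if u - candidate > 0:
            shift = candidate
    return [max(x - shift, 0.0) for x in v]


def truncated_rayleigh(rng, sigma, upper):
    """Rayleigh(sigma) sample conditioned on being <= upper (rejection sampling)."""
    while True:
        s = sigma * math.sqrt(-2.0 * math.log(1.0 - rng.random()))
        if s <= upper:
            return s


def lower_bound_constant(V, d_max, p_max, e_max):
    """Q^l from (7.29)."""
    return V * (d_max + 2.0 * p_max + e_max)


def simulate_min_queue(V, num_slots, seed=0, p_max=5.0, e_max=3.0):
    """Run the Algorithm 7.1 recursions and return the smallest Q(t) observed."""
    rng = random.Random(seed)
    p = [0.0, 0.0]  # p(1) = 0
    Q = 0.0         # Q(0) = 0
    min_Q = Q
    for _ in range(num_slots):
        e = rng.uniform(0.0, e_max)       # harvested energy e(t)
        Q = min(Q + e - sum(p), 0.0)      # virtual queue update (7.9)
        min_Q = min(min_Q, Q)
        s = [truncated_rayleigh(rng, 0.5, 4.0), truncated_rayleigh(rng, 1.0, 4.0)]
        # Gradient of the ASSUMED utility sum_i log(1 + p_i * s_i).
        grad = [s_i / (1.0 + p_i * s_i) for p_i, s_i in zip(p, s)]
        step = [p_i + g_i / V + Q / V**2 for p_i, g_i in zip(p, grad)]
        p = project_onto_P(step, p_max)   # p(t+1)
    return min_Q


q_lower = lower_bound_constant(V=5, d_max=4.0, p_max=5.0, e_max=3.0)
bound_holds = simulate_min_queue(V=5, num_slots=3000, seed=1) >= -q_lower
```

Under the same assumed constants, `lower_bound_constant(40, 4.0, 5.0, 3.0) + 5.0` evaluates to 685, which agrees with the battery capacity quoted for $V=40$ in Section 7.4.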
7.3.4 Energy Availability Guarantee

To implement the power decisions of Algorithm 7.1 for the physical battery system $E(t)$ from equations (7.2)-(7.3), we must ensure the energy availability constraint (7.2) holds on each slot. The next theorem shows that Algorithm 7.1 ensures the constraint (7.2) always holds as long as the battery capacity satisfies $E^{\max}\geq Q^{l}+p^{\max}$ and the initial energy satisfies $E(0) = E^{\max}$. It also explains that $Q(t)$ used in Algorithm 7.1 is a shifted version of the physical battery backlog $E(t)$.

Theorem 7.3. If $E(0) = E^{\max} \geq Q^{l}+p^{\max}$, where $Q^{l}$ is the constant defined in Theorem 7.2, then Algorithm 7.1 ensures the energy availability constraint (7.2) on each slot $t\in\{1,2,\ldots\}$. Moreover,

\[ E(t) = Q(t) + E^{\max},\quad \forall t\in\{0,1,2,\ldots\}. \tag{7.31} \]

Proof. Note that showing the energy availability constraint $\sum_{i=1}^{n} p_i(t)\leq E(t-1),\ \forall t\in\{1,2,\ldots\}$ is equivalent to showing

\[ \sum_{i=1}^{n} p_i(t+1) \leq E(t),\quad \forall t\in\{0,1,2,\ldots\}. \tag{7.32} \]

This theorem can be proven by induction.

Note that $E(0) = E^{\max}$ and $Q(0) = 0$. It is immediate that (7.31) holds for $t=0$. Since $E(0) = E^{\max}\geq p^{\max}$ and $\sum_{i=1}^{n} p_i(1)\leq p^{\max}$, equation (7.32) also holds for $t=0$. Assume (7.32) and (7.31) hold for $t=t_0$ and consider $t=t_0+1$. By the virtual queue dynamic (7.9), we have

\[ Q(t_0+1) = \min\Big\{Q(t_0) + e(t_0+1) - \sum_{i=1}^{n} p_i(t_0+1),\ 0\Big\} \]

Adding $E^{\max}$ on both sides yields

\begin{align*}
Q(t_0+1) + E^{\max} &= \min\Big\{Q(t_0) + e(t_0+1) - \sum_{i=1}^{n} p_i(t_0+1) + E^{\max},\ E^{\max}\Big\} \\
&\overset{(a)}{=} \min\Big\{E(t_0) + e(t_0+1) - \sum_{i=1}^{n} p_i(t_0+1),\ E^{\max}\Big\} \\
&\overset{(b)}{=} E(t_0+1)
\end{align*}

where (a) follows from the induction hypothesis $E(t_0) = Q(t_0)+E^{\max}$ and (b) follows from the energy queue dynamic (7.3). Thus, (7.31) holds for $t=t_0+1$.

Now observe

\[ E(t_0+1) = Q(t_0+1) + E^{\max} \overset{(a)}{\geq} E^{\max} - Q^{l} \geq p^{\max} \overset{(b)}{\geq} \sum_{i=1}^{n} p_i(t_0+2) \]

where (a) follows from the fact that $Q(t)\geq -Q^{l},\ \forall t\in\{0,1,2,\ldots\}$ by Theorem 7.2; (b) holds since the sum power is never more than $p^{\max}$. Thus, (7.32) holds for $t=t_0+1$.

Thus, this theorem follows by induction.
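The shift identity (7.31) is purely algebraic, so it can be illustrated without running Algorithm 7.1 at all. The sketch below assumes integer-valued placeholder sequences for the harvested energy and the total power (so the comparison is exact); the function name `track_battery_and_queue` and the numerical values are hypothetical. It checks only $E(t) = Q(t) + E^{\max}$, not the energy availability constraint, which additionally needs the lower bound of Theorem 7.2.

```python
import random


def track_battery_and_queue(harvest, spend, capacity):
    """Return [(E(t), Q(t))] for t = 0..T, starting from a full battery.

    harvest[k] plays the role of e(k+1) and spend[k] the role of
    sum_i p_i(k+1); both are arbitrary placeholder sequences here.
    """
    E, Q = capacity, 0
    trajectory = [(E, Q)]
    for e_next, p_next in zip(harvest, spend):
        Q = min(Q + e_next - p_next, 0)          # virtual queue dynamic (7.9)
        E = min(E + e_next - p_next, capacity)   # energy queue dynamic (7.3)
        trajectory.append((E, Q))
    return trajectory


rng = random.Random(0)
harvest = [rng.randint(0, 3) for _ in range(200)]
spend = [rng.randint(0, 3) for _ in range(200)]
trajectory = track_battery_and_queue(harvest, spend, capacity=50)

# (7.31): the physical battery level is the virtual queue shifted by E^max.
shift_holds = all(E == Q + 50 for E, Q in trajectory)
```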
7.3.5 Utility Optimality and Battery Capacity Tradeoff

By Theorem 7.1, Algorithm 7.1 is guaranteed to attain a utility within an $O(1/V)$ distance of the optimal utility $U^{*}$. To obtain an $O(\epsilon)$-optimal utility, we can choose $V = \lceil 1/\epsilon\rceil$, where $\lceil x\rceil$ represents the smallest integer no less than $x$. In this case, $Q^{l}$ defined in (7.29) is of order $O(V)$. By Theorem 7.3, we need the battery capacity $E^{\max}\geq Q^{l}+p^{\max} = O(V) = O(1/\epsilon)$ to satisfy the energy availability constraint. Thus, there is an $[O(\epsilon), O(1/\epsilon)]$ tradeoff between the utility optimality and the required battery capacity.

7.3.6 Extensions

Thus far, we have assumed that $\omega(t)$ is known with one slot delay, i.e., at the end of slot $t$, or equivalently, at the beginning of slot $t+1$. In fact, if $\omega(t)$ is observed with $t_0$ slot delay (at the end of slot $t+t_0-1$), we can modify Algorithm 7.1 by initializing $\mathbf{p}(\tau) = \mathbf{0},\ \tau\in\{1,2,\ldots,t_0\}$ and updating

\begin{align*}
Q(t-t_0+1) &= \min\Big\{Q(t-t_0) + e(t-t_0+1) - \sum_{i=1}^{n} p_i(t-t_0+1),\ 0\Big\}, \\
\mathbf{p}(t+1) &= \text{Proj}_{\mathcal{P}}\Big\{\mathbf{p}(t-t_0+1) + \frac{1}{V}\nabla_{\mathbf{p}} U(\mathbf{p}(t-t_0+1);\omega(t-t_0+1)) + \frac{1}{V^{2}}Q(t-t_0+1)\mathbf{1}\Big\}
\end{align*}

at the end of each slot $t\in\{t_0,t_0+1,\ldots\}$. By extending the analysis in this section (from a $t_0=1$ version to a general $t_0$ version), a similar $[O(\epsilon), O(1/\epsilon)]$ tradeoff can be established.

7.4 Numerical Experiment

In this section, we consider an energy harvesting wireless device transmitting over 2 subbands whose channel strengths are represented by $s_1(t)$ and $s_2(t)$, respectively. Our goal is to decide the power action $\mathbf{p}(t)$ to maximize the utility/throughput given by (7.1). Let $\mathcal{P} = \{\mathbf{p} : p_1+p_2\leq 5,\ p_1\geq 0,\ p_2\geq 0\}$. Let the harvested energy $e(t)$ be uniformly distributed over the interval $[0,3]$. Assume both subbands are Rayleigh fading channels where $s_1(t)$ follows the Rayleigh distribution with parameter $\sigma = 0.5$ truncated to the range $[0,4]$ and $s_2(t)$ follows the Rayleigh distribution with parameter $\sigma = 1$ truncated to the range $[0,4]$.
By assuming perfect knowledge of the distributions, we solve the deterministic problem (7.4)-(7.6) and obtain $U^{*} = 1.0391$. To verify the performance proven in Theorems 7.1 and 7.3, we run Algorithm 7.1 with $V\in\{5,10,20,40\}$ and $E(0) = E^{\max} = Q^{l}+p^{\max}$ over 1000 independent simulation runs. In all the simulation runs, the power actions yielded by Algorithm 7.1 always satisfy the energy availability constraints. We also plot the averaged utility performance in Figure 7.1, where the y-axis is the running average of expected utility. Figure 7.1 shows that the utility performance can approach $U^{*}$ by using a larger $V$ parameter.

[Figure 7.1 plot. Title: Performance of Algorithm 1 using a battery of capacity $Q^{l}+p^{\max}$. x-axis: time slot $t$ (log scale, $10^{0}$ to $10^{5}$). y-axis: $\frac{1}{t}\sum_{\tau=1}^{t}\mathbb{E}[U(\mathbf{p}[\tau];\omega[\tau])]$ (0 to 1.2). Curves: $U^{*}$, $V=5$, $V=10$, $V=20$, $V=40$.]

Figure 7.1: Utility performance (averaged over 1000 independent simulation runs) of Algorithm 7.1 with $E(0) = E^{\max} = Q^{l}+p^{\max}$ for different $V$.

In practice, it is possible that for a given $V$, the battery capacity $E^{\max} = Q^{l}+p^{\max}$ required in Theorem 7.3 is too large. If we run Algorithm 7.1 with small capacity batteries such that $\sum_{i=1}^{n} p_i(t+1)\geq E(t)$ for some slot $t$, a reasonable choice is to scale down $\mathbf{p}(t+1)$ by (7.11) and use $\hat{\mathbf{p}}(t+1)$ as the power action. Now, we run simulations by fixing $V=40$ in Algorithm 7.1 and test its performance with small capacity batteries. By Theorem 7.3, the required battery capacity to ensure energy availability is $E^{\max} = 685$. In our simulations, we choose small $E^{\max}\in\{10,20,50\}$ and $E(0) = 0$, i.e., the battery is initially empty. If $\mathbf{p}(t+1)$ from Algorithm 7.1 violates the energy availability constraint (7.2), we use $\hat{\mathbf{p}}(t+1)$ from (7.11) as the true power action, which is enforced to satisfy (7.2), and update the energy backlog by $E(t+1) = \min\{E(t) - \sum_{i=1}^{n}\hat{p}_i(t+1) + e(t+1),\ E^{\max}\}$.
Figure 7.2 plots the utility performance of Algorithm 7.1 in this practical scenario and shows that even with small capacity batteries, Algorithm 7.1 still achieves a utility close to $U^{*}$. This further demonstrates the superior performance of our algorithm.

[Figure 7.2 plot. Title: Performance of Algorithm 1 ($V=40$) using a small capacity battery. x-axis: time slot $t$ (log scale, $10^{0}$ to $10^{5}$). y-axis: $\frac{1}{t}\sum_{\tau=1}^{t}\mathbb{E}[U(\mathbf{p}[\tau];\omega[\tau])]$ (0 to 1.2). Curves: $U^{*}$, $E^{\max}=10$, $E^{\max}=20$, $E^{\max}=50$.]

Figure 7.2: Utility performance (averaged over 1000 independent simulation runs) of Algorithm 7.1 with $V=40$ for different $E^{\max}$.

7.5 Chapter Summary

This chapter develops a new learning aided power control algorithm for energy harvesting devices, without requiring the current system state or the distribution information. This new algorithm can achieve an $O(\epsilon)$ optimal utility by using a battery with capacity $O(1/\epsilon)$.

Chapter 8 Dynamic Transmit Covariance Design in MIMO Fading Systems With Unknown Channel Distributions and Inaccurate Channel State Information

During the past decade, the multiple-input multiple-output (MIMO) technique has been recognized as one of the most important techniques for increasing the capabilities of wireless communication systems. In the wireless fading channel, where the channel changes over time, the problem of transmit covariance design is to determine the transmit covariance of the transmitter to maximize the capacity subject to both long term and short term power constraints. It is often reasonable to assume that instantaneous channel state information (CSI) is available at the receiver through training. Most works on transmit covariance design in MIMO fading systems also assume that statistical information about the channel state, referred to as channel distribution information (CDI), is available at the transmitter.
Under the assumption of perfect channel state information at the receiver (CSIR) and perfect channel distribution information at the transmitter (CDIT), prior work on transmit covariance design in point-to-point MIMO fading systems can be grouped into two categories:

• Instantaneous channel state information at the transmitter: In the ideal case of perfect CSIT (footnote 1), optimal transmit covariance design for MIMO links with both long term and short term power constraints is a water-filling solution [Tel99]. Computation of water-levels involves a one-dimensional integral equation for fading channels with independent and identically distributed (i.i.d.) Rayleigh entries or a multi-dimensional integral equation for general fading channels [JP03]. The multi-dimensional integral equation involved is in general intractable and can only be approximately solved with numerical algorithms of huge complexity. MIMO fading systems with dynamic CSIT are considered in [VP07].

• No CSIT: If CSIT is unavailable, the optimal transmit covariance design is in general still open. If the channel matrix has i.i.d. Rayleigh entries, then the optimal transmit covariance is known to be the identity transmit covariance scaled to satisfy the power constraint [Tel99]. The optimal transmit covariance in MIMO fading channels with correlated Rayleigh entries is obtained in [JVG01, JB04]. The transmit covariance design in MIMO fading channels is further considered in [VLS05] under a more general channel correlation model.

Footnote 1: In this paper, CSIT is said to be “perfect” if it is both instantaneous (i.e., has no delay) and accurate.

These prior works rely on accurate CDIT and/or on restrictive channel distribution assumptions. It can be difficult to accurately estimate the CDI, especially when there are complicated correlations between entries in the channel matrix. Solutions that base decisions on CDIT can be suboptimal due to mismatches.
Work [PCL03] considers MIMO fading channels without CDIT and aims to find the transmit covariance that maximizes the worst channel capacity using a game theoretical approach rather than solving the original ergodic capacity maximization problem. In contrast, the current chapter proposes algorithms that do not require prior knowledge of the channel distribution, yet perform arbitrarily close to the optimal value of the ergodic capacity maximization that can be achieved by having CDI knowledge. The results in this chapter were originally developed in our papers [YN16a, YN17a].

In time-division duplex (TDD) systems with symmetric wireless channels, the CSI can be measured directly at the transmitter using the uplink channel. However, in frequency-division duplex (FDD) scenarios and other scenarios without channel symmetry, the CSI must be measured at the receiver, quantized, and reported back to the transmitter with a time delay [TV05]. Depending on the measurement delay in TDD systems or the overall channel acquisition delay in FDD systems, the CSIT can be instantaneous or delayed. In general, the CSIT can also be inaccurate due to measurement, quantization or feedback error. This paper first considers the instantaneous (but possibly inaccurate) CSIT case and develops an algorithm that does not require CDIT. This algorithm can achieve a utility within $O(\delta)$ of the best utility that can be achieved with CDIT and perfect CSIT, where $\delta$ is the inaccuracy measure of CSIT. This further implies that accurate instantaneous CSIT (with $\delta = 0$) is almost as good as having both CDIT and accurate instantaneous CSIT.

Next, the case of delayed (but possibly inaccurate) CSIT is considered and a fundamentally different algorithm is developed for that case. The latter algorithm again does not use CDIT, but achieves a utility within $O(\delta)$ of the best utility that can be achieved even with CDIT, where $\delta$ is the inaccuracy measure of CSIT.
This further implies that delayed but accurate CSIT (with $\delta = 0$) is almost as good as having CDIT.

Related Work and Our Contributions

In the instantaneous (and possibly inaccurate) CSIT case, the proposed dynamic transmit covariance design extends the general drift-plus-penalty algorithm for stochastic network optimization [Nee03, Nee10] to deal with inaccurate observations of system states. In this MIMO context, the current chapter shows the algorithm provides strong sample path convergence time guarantees. The dynamic of the drift-plus-penalty algorithm is similar to that of the stochastic dual subgradient algorithm, although the optimality analysis and performance bounds are different. The stochastic dual subgradient algorithm has been applied to optimization in wireless fading channels without CDI, e.g., downlink power scheduling in single antenna cellular systems [LMS06], power allocation in single antenna broadcast OFDM channels [Rib10], scheduling and resource allocation in random access channels [HR11], transmit covariance design in multi-carrier MIMO networks [LHSS09].

In the delayed (and possibly inaccurate) CSIT case, the situation is similar to the scenario of online convex optimization [Zin03] except that we are unable to observe true history reward functions due to channel error. The proposed dynamic power allocation policy can be viewed as an online algorithm with inaccurate history information. The current chapter analyzes the performance loss due to CSIT inaccuracy and provides strong sample path convergence time guarantees of this algorithm. The analysis in this MIMO context can be extended to more general online convex optimization with inaccurate history information.
Online optimization has been applied in power allocation in wireless fading channels without CDIT and with delayed and accurate CSIT, e.g., suboptimal online power allocation in single antenna single user channels [BLEM+09], suboptimal online power allocation in single antenna multiple user channels [BLEM+10]. Online transmit covariance design in MIMO systems with inaccurate CSIT is also considered in recent works [SMT15, MM16, MB16]. The online algorithms in [SMT15, MM16, MB16] follow either a matrix exponential learning scheme or an online projected gradient scheme. However, all of these works assume that the imperfect CSIT is unbiased, i.e., expected CSIT error conditional on observed previous CSIT is zero. This assumption of imperfect CSIT is suitable when modeling the CSIT measurement error or feedback error but cannot capture the CSI quantization error. In contrast, the current chapter only requires that CSIT error is bounded.

8.1 Signal Model and Problem Formulation

8.1.1 Signal Model

Consider a point-to-point MIMO block fading channel with $N_T$ transmit antennas and $N_R$ receive antennas. In a block fading channel model, the channel matrix remains constant at each block and changes from block to block in an independent and identically distributed (i.i.d.) manner. Throughout this chapter, each block is also called a slot and is assigned an index $t\in\{0,1,2,\ldots\}$. At each slot $t$, the received signal [Tel99] is described by

\[ \mathbf{y}(t) = \mathbf{H}(t)\mathbf{x}(t) + \mathbf{z}(t) \]

where $t\in\{0,1,2,\ldots\}$ is the time index, $\mathbf{z}(t)\in\mathbb{C}^{N_R}$ is the additive noise vector, $\mathbf{x}(t)\in\mathbb{C}^{N_T}$ is the transmitted signal vector, $\mathbf{H}(t)\in\mathbb{C}^{N_R\times N_T}$ is the channel matrix, and $\mathbf{y}(t)\in\mathbb{C}^{N_R}$ is the received signal vector. Assume that noise vectors $\mathbf{z}(t)$ are i.i.d. normalized circularly symmetric complex Gaussian random vectors with $\mathbb{E}[\mathbf{z}(t)\mathbf{z}^{H}(t)] = \mathbf{I}_{N_R}$, where $\mathbf{I}_{N_R}$ denotes an $N_R\times N_R$ identity matrix (footnote 2). Note that channel matrices $\mathbf{H}(t)$ are i.i.d.
across slot $t$ and have a fixed but arbitrary probability distribution, possibly one with correlations between entries of the matrix. Assume there is a constant $B>0$ such that $\|\mathbf{H}\|_F\leq B$ with probability one, where $\|\cdot\|_F$ denotes the Frobenius norm (footnote 3). Recall that the Frobenius norm of a complex $m\times n$ matrix $\mathbf{A} = (a_{ij})$ is

\[ \|\mathbf{A}\|_F = \sqrt{\sum_{i=1}^{m}\sum_{j=1}^{n}|a_{ij}|^{2}} = \sqrt{\text{tr}(\mathbf{A}^{H}\mathbf{A})} \tag{8.1} \]

where $\mathbf{A}^{H}$ is the Hermitian transpose of $\mathbf{A}$ and $\text{tr}(\cdot)$ is the trace operator.

Footnote 2: If the size of the identity matrix is clear, we often simply write $\mathbf{I}$.

Footnote 3: A bounded Frobenius norm always holds in the physical world because the channel attenuates the signal. Particular models such as Rayleigh and Rician fading violate this assumption in order to have simpler distribution functions [BA03].

Assume that the receiver can track $\mathbf{H}(t)$ exactly at each slot $t$ and hence has perfect CSIR. In practice, CSIR is obtained by sending designed training sequences, also known as pilot sequences, which are commonly known to both the transmitter and the receiver, such that the channel matrix $\mathbf{H}(t)$ can be estimated at the receiver [TV05]. CSIT is obtained in different ways in different wireless systems. In TDD systems, the transmitter exploits channel reciprocity and uses the measured uplink channel as approximated CSIT. In FDD systems, the receiver creates a quantized version of CSI, which is a function of $\mathbf{H}(t)$, and reports it back to the transmitter after a certain amount of delay. In general, there are two possibilities for CSIT availability:

• Instantaneous CSIT Case: In TDD systems or FDD systems where the measurement, quantization and feedback delays are negligible with respect to the channel coherence time, an approximate version $\widetilde{\mathbf{H}}(t)$ of the true channel $\mathbf{H}(t)$ is known at the transmitter at each time slot $t$.

• Delayed CSIT Case: In FDD systems with a large CSIT acquisition delay, the transmitter only knows $\widetilde{\mathbf{H}}(t-1)$, which is an approximate version of channel $\mathbf{H}(t-1)$, and does not know $\mathbf{H}(t)$ at each time slot $t$.
Footnote 4: In general, the dynamic transmit covariance design developed in this chapter can be extended to deal with arbitrary CSIT acquisition delay as discussed in Section 8.3.3. For simplicity of presentation, we assume the CSIT acquisition delay is always one slot in this chapter.

In both cases, we assume the CSIT inaccuracy is bounded, i.e., there exists $\delta>0$ such that $\|\widetilde{\mathbf{H}}(t)-\mathbf{H}(t)\|_F\leq\delta$ for all $t$.

8.1.2 Problem Formulation

At each slot $t$, if the channel matrix is $\mathbf{H}(t)$ and the transmit covariance is $\mathbf{Q}(t)$, then the MIMO capacity is given by [Tel99]:

\[ \log\det(\mathbf{I} + \mathbf{H}(t)\mathbf{Q}(t)\mathbf{H}^{H}(t)) \]

where $\det(\cdot)$ denotes the determinant operator of matrices. The (long term) average capacity (footnote 5) of the MIMO block fading channel [Gol05] is given by

\[ \mathbb{E}_{\mathbf{H}}\big[\log\det(\mathbf{I} + \mathbf{H}\mathbf{Q}\mathbf{H}^{H})\big] \]

where $\mathbf{Q}$ can adapt to $\mathbf{H}$ when CSIT is available and is a constant matrix when CSIT is unavailable.

Consider two types of power constraints at the transmitter: a long term average power constraint $\mathbb{E}_{\mathbf{H}}[\text{tr}(\mathbf{Q})]\leq\bar{P}$ and a short term power constraint $\text{tr}(\mathbf{Q})\leq P$ enforced at each slot. The long term constraint arises from battery or energy limitations while the short term constraint is often due to hardware or regulation limitations.

If CSIT is available, the problem is to choose $\mathbf{Q}$ as a (possibly random) function of the observed $\mathbf{H}$ to maximize the (long term) average capacity subject to both power constraints:

\begin{align*}
\max_{\mathbf{Q}(\mathbf{H})}\quad & \mathbb{E}_{\mathbf{H}}\big[\log\det(\mathbf{I} + \mathbf{H}\mathbf{Q}(\mathbf{H})\mathbf{H}^{H})\big] \tag{8.2} \\
\text{s.t.}\quad & \mathbb{E}_{\mathbf{H}}[\text{tr}(\mathbf{Q}(\mathbf{H}))]\leq\bar{P}, \tag{8.3} \\
& \mathbf{Q}(\mathbf{H})\in\mathcal{Q},\ \forall\mathbf{H}, \tag{8.4}
\end{align*}

where $\mathcal{Q}$ is a set that enforces the short term power constraint:

\[ \mathcal{Q} = \big\{\mathbf{Q}\in\mathbb{S}^{N_T}_{+} : \text{tr}(\mathbf{Q})\leq P\big\} \tag{8.5} \]

where $\mathbb{S}^{N_T}_{+}$ denotes the $N_T\times N_T$ positive semidefinite matrix space. To avoid trivialities, we assume that $P\geq\bar{P}$. In (8.2)-(8.4), we use notation $\mathbf{Q}(\mathbf{H})$ to emphasize that $\mathbf{Q}$ can depend on $\mathbf{H}$, i.e., adapt to channel realizations.
Under the long term power constraint, the optimal power allocation should be opportunistic, i.e., use more power over good channel realizations and less power over poor channel realizations. It is known that opportunistic power allocation provides a significant capacity gain in low SNR regimes and a marginal gain in high SNR regimes compared with fixed power allocation [ÖLR14].

Footnote 5: The expression $\mathbb{E}_{\mathbf{H}}[\log\det(\mathbf{I} + \mathbf{H}\mathbf{Q}\mathbf{H}^{H})]$ is also known as the ergodic capacity. In fast fading channels where the channel coherence time is smaller than the codeword length, ergodic capacity can be attained if each codeword spans across sufficiently many channel blocks. In slow fading channels where the channel coherence time is larger than the codeword length, ergodic capacity can be attained by adapting both transmit covariances and data rates to the CSIT of each channel block (see [LK06] for related discussions). In slow fading channels, the ergodic capacity is essentially the long term average capacity since it is asymptotically equal to the average capacity of each channel block (by the law of large numbers). Note that another concept, “outage capacity”, is sometimes considered for slow fading channels when there is no rate adaptation and the data rate is constant regardless of channel realizations (in this case, the data rate can be larger than the block capacity for poor channel realizations such that “outage” occurs). In this chapter, we have both transmit covariance design and rate adaptation, and hence consider “ergodic capacity”.

Without CSIT, the optimal transmit covariance design problem is different, given as follows.

\begin{align*}
\max_{\mathbf{Q}}\quad & \mathbb{E}_{\mathbf{H}}\big[\log\det(\mathbf{I} + \mathbf{H}\mathbf{Q}\mathbf{H}^{H})\big] \tag{8.6} \\
\text{s.t.}\quad & \mathbb{E}_{\mathbf{H}}[\text{tr}(\mathbf{Q})]\leq\bar{P}, \tag{8.7} \\
& \mathbf{Q}\in\mathcal{Q}, \tag{8.8}
\end{align*}

where set $\mathcal{Q}$ is defined in (8.5). Again assume $P\geq\bar{P}$. Since the instantaneous CSIT is unavailable, the transmit covariance cannot adapt to $\mathbf{H}$. By the convexity of this problem and Jensen’s inequality, a randomized $\mathbf{Q}$ is useless. It suffices to consider a constant $\mathbf{Q}$.
Since $P\geq\bar{P}$, this implies the problem is equivalent to a problem that removes the constraint (8.7) and that changes the constraint (8.8) to:

\[ \mathbf{Q}\in\widetilde{\mathcal{Q}} = \big\{\mathbf{Q}\in\mathbb{S}^{N_T}_{+} : \text{tr}(\mathbf{Q})\leq\bar{P}\big\} \]

The problems (8.2)-(8.4) and (8.6)-(8.8) are fundamentally different and have different optimal objective function values. Most existing works [JP03, JVG01, JB04, VLS05] on MIMO fading channels can be interpreted as solutions to either of the above two stochastic optimization problems under specific channel distributions. Moreover, those works require perfect channel distribution information (CDI). In this chapter, the above two stochastic optimization problems are solved via dynamic algorithms that work for arbitrary channel distributions and do not require any CDI. The algorithms are different for the two cases, and use different techniques.

8.2 Instantaneous CSIT Case

Consider the case of instantaneous but inaccurate CSIT where at each slot $t\in\{0,1,2,\ldots\}$, channel $\mathbf{H}(t)$ is unknown and only an approximate version $\widetilde{\mathbf{H}}(t)$ is known. In this case, the problem (8.2)-(8.4) can be interpreted as a stochastic optimization problem where channel $\mathbf{H}(t)$ is the instantaneous system state and transmit covariance $\mathbf{Q}(t)$ is the control action at each slot $t$. This is similar to the scenario of stochastic optimization with i.i.d. time-varying system states, where the decision maker chooses an action based on the observed instantaneous system state at each slot such that the time average expected utility is maximized and the time average expected constraints are guaranteed. The drift-plus-penalty (DPP) technique reviewed in Chapter 1 is a mature framework to solve stochastic optimization without distribution information of system states.

The present setting is different from the conventional stochastic optimization considered by the DPP technique because at each slot $t$, the true “system state” $\mathbf{H}(t)$ is unavailable and only an approximate version $\widetilde{\mathbf{H}}(t)$ is known.
Nevertheless, a modified version of the standard DPP algorithm is developed in Algorithm 8.1.

Algorithm 8.1 Dynamic Transmit Covariance Design with Instantaneous CSIT
Let $V>0$ be a constant parameter and $Z(0) = 0$. At each time $t\in\{0,1,2,\ldots\}$, observe $\widetilde{\mathbf{H}}(t)$ and $Z(t)$. Then do the following:
• Choose transmit covariance $\mathbf{Q}(t)\in\mathcal{Q}$ to solve: $\max_{\mathbf{Q}\in\mathcal{Q}}\big\{V\log\det(\mathbf{I} + \widetilde{\mathbf{H}}(t)\mathbf{Q}\widetilde{\mathbf{H}}^{H}(t)) - Z(t)\,\text{tr}(\mathbf{Q})\big\}$.
• Update $Z(t+1) = \max\{0,\ Z(t) + \text{tr}(\mathbf{Q}(t)) - \bar{P}\}$.

In Algorithm 8.1, a virtual queue $Z(t)$ with $Z(0) = 0$ and with update $Z(t+1) = \max\{0,\ Z(t) + \text{tr}(\mathbf{Q}(t)) - \bar{P}\}$ is introduced to enforce the average power constraint (8.3) and can be viewed as the “queue backlog” of long term power constraint violations since it increases at slot $t$ if the power consumption at slot $t$ is larger than $\bar{P}$ and decreases otherwise. The next lemma relates $Z(t)$ and the average power consumption.

Lemma 8.1. Under Algorithm 8.1, we have

\[ \frac{1}{t}\sum_{\tau=0}^{t-1}\text{tr}(\mathbf{Q}(\tau)) \leq \bar{P} + \frac{Z(t)}{t},\quad \forall t>0. \]

Proof. Fix $t>0$. For all slots $\tau\in\{0,1,\ldots,t-1\}$, the update for $Z(\tau)$ satisfies $Z(\tau+1) = \max\{0,\ Z(\tau) + \text{tr}(\mathbf{Q}(\tau)) - \bar{P}\} \geq Z(\tau) + \text{tr}(\mathbf{Q}(\tau)) - \bar{P}$. Rearranging terms gives: $\text{tr}(\mathbf{Q}(\tau)) \leq \bar{P} + Z(\tau+1) - Z(\tau)$. Summing over $\tau\in\{0,\ldots,t-1\}$ and dividing by factor $t$ gives:

\[ \frac{1}{t}\sum_{\tau=0}^{t-1}\text{tr}(\mathbf{Q}(\tau)) \leq \bar{P} + \frac{Z(t)-Z(0)}{t} \overset{(a)}{=} \bar{P} + \frac{Z(t)}{t} \]

where (a) follows from $Z(0) = 0$.

For each slot $t\in\{0,1,2,\ldots\}$ define the reward $R(t)$:

\[ R(t) = \log\det(\mathbf{I} + \mathbf{H}(t)\mathbf{Q}(t)\mathbf{H}^{H}(t)). \tag{8.9} \]

Define $R^{\text{opt}}$ as the optimal average utility in (8.2). The value $R^{\text{opt}}$ depends on the (unknown) distribution for $\mathbf{H}(t)$. Fix $\epsilon>0$ and define $V = \max\{\bar{P}^{2}, (P-\bar{P})^{2}\}/(2\epsilon)$. If $\widetilde{\mathbf{H}}(t) = \mathbf{H}(t),\ \forall t$, regardless of the distribution of $\mathbf{H}(t)$, the standard DPP technique [Nee10] ensures:

\begin{align*}
\frac{1}{t}\sum_{\tau=0}^{t-1}\mathbb{E}[R(\tau)] &\geq R^{\text{opt}} - \epsilon,\quad \forall t>0 \tag{8.10} \\
\lim_{t\to\infty}\frac{1}{t}\sum_{\tau=0}^{t-1}\mathbb{E}[\text{tr}(\mathbf{Q}(\tau))] &\leq \bar{P} \tag{8.11}
\end{align*}

This holds for arbitrarily small values of $\epsilon>0$, and so the algorithm comes arbitrarily close to optimality. However, the above is true only if $\widetilde{\mathbf{H}}(t) = \mathbf{H}(t),\ \forall t$.
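Lemma 8.1 is a sample path statement: its proof uses only the virtual queue update, so it holds for any sequence of per-slot powers, not just those chosen by Algorithm 8.1. A minimal sketch of that observation follows, using integer-valued placeholder powers in place of $\text{tr}(\mathbf{Q}(t))$ so that the check is exact. The function name `run_virtual_queue` and the numerical values are hypothetical and not taken from the thesis.

```python
import random


def run_virtual_queue(powers, p_bar):
    """Return [Z(0), ..., Z(T)] for Z(t+1) = max{0, Z(t) + tr(Q(t)) - p_bar}."""
    Z = [0]  # Z(0) = 0
    for power in powers:
        Z.append(max(0, Z[-1] + power - p_bar))
    return Z


rng = random.Random(0)
p_bar, p_peak = 2, 5   # stand-ins for the long term and short term power limits
powers = [rng.randint(0, p_peak) for _ in range(500)]   # stand-ins for tr(Q(t))
Z = run_virtual_queue(powers, p_bar)

# Lemma 8.1 multiplied through by t:
#   sum_{tau=0}^{t-1} tr(Q(tau)) <= t * p_bar + Z(t)   for every t > 0.
lemma_holds = all(sum(powers[:t]) <= t * p_bar + Z[t] for t in range(1, len(Z)))
```

The lemma turns any deterministic upper bound on $Z(t)$ into a bound on the running average power, which is why the later analysis concentrates on bounding the virtual queue.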
The development and analysis of Algorithm 8.1 extends the DPP technique in two aspects:

• At each slot $t$, the standard drift-plus-penalty technique requires accurate “system state” $\mathbf{H}(t)$ and cannot deal with inaccurate “system state” $\widetilde{\mathbf{H}}(t)$. In contrast, Algorithm 8.1 works with $\widetilde{\mathbf{H}}(t)$. The next subsections show that the performance of Algorithm 8.1 degrades linearly with respect to the CSIT inaccuracy measure $\delta$. If $\delta = 0$, then (8.10) is recovered.

• Inequality (8.11) only treats infinite horizon time average expected power. The next subsections show that Algorithm 8.1 can guarantee

\[ \frac{1}{t}\sum_{\tau=0}^{t-1}\text{tr}(\mathbf{Q}(\tau)) \leq \bar{P} + \frac{(B+\delta)^{2}\max\{\bar{P}^{2}, (P-\bar{P})^{2}\} + 2\epsilon(P-\bar{P})}{2\epsilon t} \]

for all $t>0$. This sample path guarantee on average power consumption is much stronger than (8.11). In fact, (8.11) is recovered by taking expectations and taking the limit $t\to\infty$.

8.2.1 Transmit Covariance Update in Algorithm 8.1

This subsection shows the $\mathbf{Q}(t)$ selection in Algorithm 8.1 has an (almost) closed-form solution. The convex program involved in the transmit covariance update of Algorithm 8.1 is in the form

\begin{align*}
\max_{\mathbf{Q}}\quad & \log\det(\mathbf{I} + \mathbf{H}\mathbf{Q}\mathbf{H}^{H}) - \frac{Z}{V}\text{tr}(\mathbf{Q}) \tag{8.12} \\
\text{s.t.}\quad & \text{tr}(\mathbf{Q})\leq P \tag{8.13} \\
& \mathbf{Q}\in\mathbb{S}^{N_T}_{+} \tag{8.14}
\end{align*}

This convex program is similar to the conventional problem of transmit covariance design with a deterministic channel $\mathbf{H}$, except that the objective (8.12) has an additional penalty term $-\frac{Z}{V}\text{tr}(\mathbf{Q})$. It is well known that, without this penalty term, the solution is to diagonalize the channel matrix and allocate power over eigen-modes according to a water-filling technique [Tel99]. The next lemma shows that the optimal solution to the problem (8.12)-(8.14) has a similar structure.

Lemma 8.2. Consider singular value decomposition (SVD) $\mathbf{H}^{H}\mathbf{H} = \mathbf{U}^{H}\boldsymbol{\Sigma}\mathbf{U}$, where $\mathbf{U}$ is a unitary matrix and $\boldsymbol{\Sigma}$ is a diagonal matrix with non-negative entries $\sigma_1,\ldots,\sigma_{N_T}$.
Then the optimal solution to (8.12)-(8.14) is given by $\mathbf{Q}^* = \mathbf{U}^H \mathbf{\Theta}^* \mathbf{U}$, where $\mathbf{\Theta}^*$ is a diagonal matrix with entries $\theta_1^*, \ldots, \theta_{N_T}^*$ given by
$$\theta_i^* = \max\left\{0, \frac{1}{\mu^* + Z/V} - \frac{1}{\sigma_i}\right\}, \quad \forall i \in \{1, \ldots, N_T\},$$
where $\mu^*$ is chosen such that $\sum_{i=1}^{N_T} \theta_i^* \le P$, $\mu^* \ge 0$ and $\mu^*\left[\sum_{i=1}^{N_T} \theta_i^* - P\right] = 0$. The exact $\mu^*$ can be determined using Algorithm 8.2 with complexity $O(N_T \log N_T)$.

Proof. The proof is a simple extension of the classical proof for the optimal transmit covariance in deterministic MIMO channels, e.g., Section 3.2 in [Tel99], to deal with the additional penalty term $-\frac{Z}{V}\mathrm{tr}(\mathbf{Q})$. See Section 8.7.2 for a complete proof.

The complexity of Algorithm 8.2 is dominated by the sorting of all $\sigma_i$ in step (2). Recall that the water-filling solution of power allocation in multiple parallel channels can also be found by an exact algorithm (see Section 6 in [PL03]), which is similar to Algorithm 8.2. The main difference is that Algorithm 8.2 has a first step to verify if $\mu^* = 0$. This is because, unlike the power allocation in multiple parallel channels, where the optimal solution always uses full power, the optimal solution to the problem (8.12)-(8.14) may not use full power for large $Z$ due to the penalty term $-\frac{Z}{V}\mathrm{tr}(\mathbf{Q})$ in objective (8.12).

Algorithm 8.2 Algorithm to Solve Problem (8.12)-(8.14)
1. Check if $\sum_{i=1}^{N_T} \max\left\{0, \frac{1}{Z/V} - \frac{1}{\sigma_i}\right\} \le P$ holds. If yes, let $\mu^* = 0$ and $\theta_i^* = \max\left\{0, \frac{1}{Z/V} - \frac{1}{\sigma_i}\right\}, \forall i \in \{1, 2, \ldots, N_T\}$ and terminate the algorithm; else, continue to the next step.
2. Sort all $\sigma_i, i \in \{1, 2, \ldots, N_T\}$ in a decreasing order $\pi$ such that $\sigma_{\pi(1)} \ge \sigma_{\pi(2)} \ge \cdots \ge \sigma_{\pi(N_T)}$. Define $S_0 = 0$.
3. For $i = 1$ to $N_T$
• Let $S_i = S_{i-1} + \frac{1}{\sigma_{\pi(i)}}$. Let $\mu^* = \frac{i}{S_i + P} - \frac{Z}{V}$.
• If $\mu^* \ge 0$, $\frac{1}{\mu^* + Z/V} - \frac{1}{\sigma_{\pi(i)}} > 0$ and $\frac{1}{\mu^* + Z/V} - \frac{1}{\sigma_{\pi(i+1)}} \le 0$, then terminate the loop; else, continue to the next iteration in the loop.
4. Let $\theta_i^* = \max\left\{0, \frac{1}{\mu^* + Z/V} - \frac{1}{\sigma_i}\right\}, \forall i \in \{1, 2, \ldots, N_T\}$ and terminate the algorithm.
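The steps of Algorithm 8.2 and the structure in Lemma 8.2 can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions, not the author's implementation: the function names are hypothetical, eigen-modes with $\sigma_i = 0$ are assigned zero power, the case $Z = 0$ skips step 1 (where $1/(Z/V)$ is unbounded), and the test on $\sigma_{\pi(i+1)}$ is treated as vacuous at $i = N_T$.

```python
import numpy as np


def penalized_waterfilling(sigma, z_over_v, p_max):
    """Return (theta, mu) with theta_i = max(0, 1/(mu + Z/V) - 1/sigma_i)."""
    sigma = np.asarray(sigma, dtype=float)
    theta = np.zeros_like(sigma)
    pos = sigma > 0  # assumption: modes with sigma_i = 0 never receive power
    if not pos.any():
        return theta, 0.0
    inv = 1.0 / sigma[pos]

    # Step 1: check whether mu = 0 already satisfies the power budget.
    if z_over_v > 0:
        cand = np.maximum(0.0, 1.0 / z_over_v - inv)
        if cand.sum() <= p_max:
            theta[pos] = cand
            return theta, 0.0

    # Step 2: decreasing sigma is the same as increasing 1/sigma.
    inv_sorted = np.sort(inv)

    # Step 3: search for the number of active modes i.
    s = 0.0
    for i, inv_i in enumerate(inv_sorted, start=1):
        s += inv_i
        level = (s + p_max) / i        # water level 1/(mu + Z/V)
        mu = 1.0 / level - z_over_v    # mu = i/(S_i + P) - Z/V
        next_inv = inv_sorted[i] if i < len(inv_sorted) else np.inf
        if mu >= 0 and level - inv_i > 0 and level - next_inv <= 0:
            break

    # Step 4: final power allocation in the original ordering.
    theta[pos] = np.maximum(0.0, level - inv)
    return theta, mu


def covariance_update(h_est, z, v, p_max):
    """Q = U^H Theta U for the estimated channel, following Lemma 8.2."""
    w, vecs = np.linalg.eigh(h_est.conj().T @ h_est)  # H^H H = vecs diag(w) vecs^H
    w = np.clip(w, 0.0, None)                         # remove tiny negative round-off
    theta, _ = penalized_waterfilling(w, z / v, p_max)
    return (vecs * theta) @ vecs.conj().T
```

With $Z = 0$ the routine reduces to classical water-filling, and once $Z/V$ reaches the largest $\sigma_i$ it returns the all-zero allocation.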
8.2.2 Performance of Algorithm 8.1

Define a Lyapunov function $L(t) = \frac{1}{2}Z^2(t)$ and its corresponding Lyapunov drift $\Delta(t) = L(t+1) - L(t)$. The expression $-\Delta(t) + VR(t)$ is called the DPP expression. The analysis of the standard drift-plus-penalty (DPP) algorithm with accurate “system states” relies on a bound of the DPP expression in terms of $R^{\text{opt}}$ [Nee10]. The performance analysis of Algorithm 8.1, which can be viewed as a DPP algorithm based on inaccurate “system states”, requires a new bound of the DPP expression in Lemma 8.3 and a new deterministic bound of the virtual queue $Z(t)$ in Lemma 8.4.

Lemma 8.3. Under Algorithm 8.1, we have
$$-\mathbb{E}[\Delta(t)] + V\mathbb{E}[R(t)] \ge VR^{\text{opt}} - \frac{1}{2}\max\{\bar{P}^2, (P-\bar{P})^2\} - 2VP\sqrt{N_T}(2B+\delta)\delta,$$
where $B, \delta, N_T, P$ and $\bar{P}$ are defined in Section 8.1.1; and $R^{\text{opt}}$ is the optimal average utility in the problem (8.2)-(8.4).

Proof. See Section 8.7.3.

Lemma 8.4. Under Algorithm 8.1, we have $Z(t) \le V(B+\delta)^2 + (P - \bar{P}), \forall t > 0$, where $B, \delta, P$ and $\bar{P}$ are defined in Section 8.1.1.

Proof. We first show that if $Z(t) \ge V(B+\delta)^2$, then Algorithm 8.1 chooses $\mathbf{Q}(t) = \mathbf{0}$. Consider $Z(t) \ge V(B+\delta)^2$. Let the SVD be $\tilde{\mathbf{H}}^H(t)\tilde{\mathbf{H}}(t) = \mathbf{U}^H\mathbf{\Sigma}\mathbf{U}$, where the diagonal matrix $\mathbf{\Sigma}$ has non-negative diagonal entries $\sigma_1, \ldots, \sigma_{N_T}$. Note that $\forall i \in \{1, 2, \ldots, N_T\}$,
$$\sigma_i \overset{(a)}{\le} \mathrm{tr}(\tilde{\mathbf{H}}^H(t)\tilde{\mathbf{H}}(t)) \overset{(b)}{=} \|\tilde{\mathbf{H}}(t)\|_F^2 \le \left(\|\mathbf{H}(t)\|_F + \|\tilde{\mathbf{H}}(t) - \mathbf{H}(t)\|_F\right)^2 \le (B+\delta)^2,$$
where (a) follows from $\mathrm{tr}(\tilde{\mathbf{H}}^H(t)\tilde{\mathbf{H}}(t)) = \sum_{i=1}^{N_T}\sigma_i$; and (b) follows from the definition of the Frobenius norm. By Lemma 8.2, $\mathbf{Q}(t) = \mathbf{U}^H\mathbf{\Theta}^*\mathbf{U}$, where $\mathbf{\Theta}^*$ is diagonal with entries $\theta_1^*, \ldots, \theta_{N_T}^*$ given by $\theta_i^* = \max\left\{0, \frac{1}{\mu^* + Z(t)/V} - \frac{1}{\sigma_i}\right\}$, where $\mu^* \ge 0$. Since $\sigma_i \le (B+\delta)^2, \forall i \in \{1, 2, \ldots, N_T\}$, it follows that if $Z(t) \ge V(B+\delta)^2$, then $\frac{1}{\mu + Z(t)/V} - \frac{1}{\sigma_i} \le 0$ for all $\mu \ge 0$ and hence $\theta_i^* = 0, \forall i \in \{1, 2, \ldots, N_T\}$. This implies that Algorithm 8.1 chooses $\mathbf{Q}(t) = \mathbf{0}$ by Lemma 8.2, which further implies that $Z(t+1) \le Z(t)$ by the update equation of $Z(t+1)$.
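On the other hand, if $Z(t) \le V(B+\delta)^2$, then $Z(t+1)$ is at most $V(B+\delta)^2 + (P - \bar{P})$ by the update equation of $Z(t+1)$ and the short term power constraint $\mathrm{tr}(\mathbf{Q}(t)) \le P$.

The next theorem summarizes the performance of Algorithm 8.1 and follows directly from Lemma 8.3 and Lemma 8.4.

Theorem 8.1. Fix $\epsilon > 0$ and choose $V = \frac{\max\{\bar{P}^2, (P-\bar{P})^2\}}{2\epsilon}$ in Algorithm 8.1. Then for all $t > 0$:
$$\frac{1}{t}\sum_{\tau=0}^{t-1}\mathbb{E}[R(\tau)] \ge R^{\text{opt}} - \epsilon - \phi(\delta),$$
$$\frac{1}{t}\sum_{\tau=0}^{t-1}\mathrm{tr}(\mathbf{Q}(\tau)) \le \bar{P} + \frac{(B+\delta)^2\max\{\bar{P}^2, (P-\bar{P})^2\} + 2\epsilon(P-\bar{P})}{2\epsilon t},$$
where $\phi(\delta) = 2P\sqrt{N_T}(2B+\delta)\delta$ satisfies $\phi(\delta) \to 0$ as $\delta \to 0$, i.e., $\phi(\delta) \in O(\delta)$; and $B, \delta, N_T, P$ and $\bar{P}$ are defined in Section 8.1.1. In particular, the average expected utility is within $\epsilon + \phi(\delta)$ of $R^{\text{opt}}$ and the sample path time average power is within $\epsilon$ of its required constraint $\bar{P}$ whenever $t \ge \Omega(\frac{1}{\epsilon^2})$.

Proof. Proof of the first inequality: Fix $t > 0$. For all slots $\tau \in \{0, 1, \ldots, t-1\}$, Lemma 8.3 guarantees that $\mathbb{E}[R(\tau)] \ge R^{\text{opt}} + \frac{1}{V}\mathbb{E}[\Delta(\tau)] - \frac{1}{2V}\max\{\bar{P}^2, (P-\bar{P})^2\} - 2P\sqrt{N_T}(2B+\delta)\delta$. Summing over $\tau \in \{0, \ldots, t-1\}$ and dividing by $t$ gives:
$$\begin{aligned}
\frac{1}{t}\sum_{\tau=0}^{t-1}\mathbb{E}[R(\tau)] &\ge R^{\text{opt}} + \frac{1}{Vt}\sum_{\tau=0}^{t-1}\mathbb{E}[\Delta(\tau)] - \frac{1}{2V}\max\{\bar{P}^2, (P-\bar{P})^2\} - 2P\sqrt{N_T}(2B+\delta)\delta \\
&\overset{(a)}{=} R^{\text{opt}} + \frac{1}{2Vt}\left(\mathbb{E}[Z^2(t)] - \mathbb{E}[Z^2(0)]\right) - \frac{1}{2V}\max\{\bar{P}^2, (P-\bar{P})^2\} - 2P\sqrt{N_T}(2B+\delta)\delta \\
&\overset{(b)}{\ge} R^{\text{opt}} - \frac{1}{2V}\max\{\bar{P}^2, (P-\bar{P})^2\} - 2P\sqrt{N_T}(2B+\delta)\delta \\
&\overset{(c)}{=} R^{\text{opt}} - \epsilon - 2P\sqrt{N_T}(2B+\delta)\delta
\end{aligned}$$
where (a) follows from the definition $\Delta(t) = \frac{1}{2}Z^2(t+1) - \frac{1}{2}Z^2(t)$ and by simplifying the telescoping sum $\sum_{\tau=0}^{t-1}\mathbb{E}[\Delta(\tau)]$; (b) follows from $Z(0) = 0$ and $Z(t) \ge 0$; and (c) follows by substituting $V = \frac{1}{2\epsilon}\max\{\bar{P}^2, (P-\bar{P})^2\}$.

Proof of the second inequality: Fix $t > 0$. By Lemma 8.1, we have
$$\frac{1}{t}\sum_{\tau=0}^{t-1}\mathrm{tr}(\mathbf{Q}(\tau)) \le \bar{P} + \frac{Z(t)}{t} \overset{(a)}{\le} \bar{P} + \frac{(B+\delta)^2\max\{\bar{P}^2, (P-\bar{P})^2\} + 2\epsilon(P-\bar{P})}{2\epsilon t}$$
where (a) follows from Lemma 8.4 and $V = \frac{1}{2\epsilon}\max\{\bar{P}^2, (P-\bar{P})^2\}$.

Theorem 8.1 provides a sample path guarantee on average power, which is much stronger than the guarantee in (8.11).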
It also shows that the convergence time to reach an $\epsilon + O(\delta)$ approximate solution is $O(\frac{1}{\epsilon^2})$.

8.2.3 Discussion

It is shown in [HMNK13] that $Z(t)$ in the DPP algorithm is “attracted” to an optimal Lagrangian dual multiplier of an unknown deterministic convex program. In fact, if we have a good guess of this Lagrangian multiplier and initialize $Z(0)$ close to it, then Algorithm 8.1 has faster convergence. In addition, the performance bounds derived in Theorem 8.1 are not the tightest possible. The proof of Lemma 8.3 involves many relaxations to derive bounds that are simple but still enable Theorem 8.1 to show two things: the effect of missing CDIT can be made arbitrarily small by choosing the algorithm parameter $V$ properly, and the performance degradation due to CSIT inaccuracy scales linearly with respect to $\delta$. In fact, tighter but more complicated bounds are possible by refining the proof of Lemma 8.3.

A heuristic approach to solve the problem (8.2)-(8.4) without channel distribution information is to sample the channel for a large number of realizations and use the empirical distribution as an approximate distribution to solve the problem (8.2)-(8.4) directly. This approach has three drawbacks:
• For a scalar channel, the empirical distribution based on $O(\frac{1}{\epsilon^2})$ realizations is an $\epsilon$ approximation to the true channel distribution with high probability by the Dvoretzky-Kiefer-Wolfowitz inequality [Ser09]. However, for an $N_R \times N_T$ MIMO channel, the multi-dimensional empirical distribution requires $O(\frac{N_T^2 N_R^2}{\epsilon^2})$ samples to achieve an $\epsilon$ approximation of the true channel distribution [Dev77]. Thus, this approach does not scale well with the number of antennas.
• Even if the empirical distribution is accurate, the complexity of solving the problem (8.2)-(8.4) based on the empirical distribution is huge if the channel is from a continuous distribution. This is known as the curse of dimensionality of empirical methods for stochastic optimization due to the large sample size.
In contrast, the complexity of Algorithm 8.1 is independent of the sample space.
• This approach is an offline method such that a large number of slots are wasted during the channel sampling process. In contrast, Algorithm 8.1 is an online method with performance guarantees for all slots.

Note that even if we assume the distribution of $\mathbf{H}(t)$ is known and $\mathbf{Q}^*(\mathbf{H})$ can be computed by solving the problem (8.2)-(8.4), the optimal policy $\mathbf{Q}^*(\mathbf{H})$ in general cannot achieve $R^{\text{opt}}$ and can violate the long term power constraints when only the approximate versions $\tilde{\mathbf{H}}(t)$ are known. For example, consider a MIMO fading system with two possible channel realizations $\mathbf{H}_1$ and $\mathbf{H}_2$ with equal probabilities. Suppose the average power constraint is $\bar{P} = 5$ and the optimal policy $\mathbf{Q}^*(\mathbf{H})$ satisfies $\mathrm{tr}(\mathbf{Q}^*(\mathbf{H}_1)) = 8$ and $\mathrm{tr}(\mathbf{Q}^*(\mathbf{H}_2)) = 2$. However, if $\tilde{\mathbf{H}}_1 \ne \mathbf{H}_1$ and $\tilde{\mathbf{H}}_2 \ne \mathbf{H}_2$, it can be hard to decide the transmit covariance based on $\tilde{\mathbf{H}}_1$ or $\tilde{\mathbf{H}}_2$ since the associations between $\tilde{\mathbf{H}}_1$ and $\mathbf{H}_1$ (or between $\tilde{\mathbf{H}}_2$ and $\mathbf{H}_2$) are unknown. In an extreme case when $\tilde{\mathbf{H}}_1 = \tilde{\mathbf{H}}_2 = \mathbf{H}_1$, if the transmitter uses $\mathbf{Q}^*(\tilde{\mathbf{H}}(t))$ at each slot $t$, the average power constraint is violated and hence the transmit covariance scheme is infeasible. In contrast, Algorithm 8.1 can attain the performance in Theorem 8.1 with inaccurate instantaneous CSIT and no CDIT.

8.3 Delayed CSIT Case

Consider the case of delayed and inaccurate CSIT. At the beginning of each slot $t \in \{0, 1, 2, \ldots\}$, channel $\mathbf{H}(t)$ is unknown and only quantized channels of previous slots $\tilde{\mathbf{H}}(\tau), \tau \in \{0, 1, \ldots, t-1\}$ are known. This is similar to the scenario of online optimization where the decision maker selects $x(t) \in \mathcal{X}$ at each slot $t$ to maximize an unknown reward function $f_t(x)$ based on the information of previous reward functions $f_\tau(x(\tau)), \tau \in \{0, 1, \ldots, t-1\}$. The goal is to minimize the average regret $\frac{1}{t}\max_{x \in \mathcal{X}}\left[\sum_{\tau=0}^{t-1} f_\tau(x)\right] - \frac{1}{t}\sum_{\tau=0}^{t-1} f_\tau(x(\tau))$.
The best possible average regret of online convex optimization with general convex reward functions is $O(\frac{1}{\sqrt{t}})$ [Zin03, HAK07].

The situation in the current chapter is different from conventional online optimization because at each slot $t$, the rewards of previous slots, i.e., $R(\tau) = \log\det(\mathbf{I} + \mathbf{H}(\tau)\mathbf{Q}(\tau)\mathbf{H}^H(\tau)), \tau \in \{0, 1, \ldots, t-1\}$, are still unknown because the reported channels $\tilde{\mathbf{H}}(\tau)$ are approximate versions. Nevertheless, an online algorithm that does not use CDIT is developed in Algorithm 8.3.

Algorithm 8.3 Dynamic Transmit Covariance Design with Delayed CSIT
Let $\gamma > 0$ be a constant parameter and $\mathbf{Q}(0) \in \mathcal{Q}$ be arbitrary. At each time $t \in \{1, 2, \ldots\}$, observe $\tilde{\mathbf{H}}(t-1)$ and do the following:
• Let $\tilde{\mathbf{D}}(t-1) = \tilde{\mathbf{H}}^H(t-1)\left(\mathbf{I}_{N_R} + \tilde{\mathbf{H}}(t-1)\mathbf{Q}(t-1)\tilde{\mathbf{H}}^H(t-1)\right)^{-1}\tilde{\mathbf{H}}(t-1)$. Choose transmit covariance
$$\mathbf{Q}(t) = \mathcal{P}_{\tilde{\mathcal{Q}}}\left[\mathbf{Q}(t-1) + \gamma\tilde{\mathbf{D}}(t-1)\right],$$
where $\mathcal{P}_{\tilde{\mathcal{Q}}}[\cdot]$ is the projection onto the convex set $\tilde{\mathcal{Q}} = \{\mathbf{Q} \in \mathbb{S}_+^{N_T} : \mathrm{tr}(\mathbf{Q}) \le \bar{P}\}$.

Define $\mathbf{Q}^* \in \tilde{\mathcal{Q}}$ as an optimal solution to the problem (8.6)-(8.8), which depends on the (unknown) distribution for $\mathbf{H}(t)$. Define $R^{\text{opt}}(t) = \log\det(\mathbf{I} + \mathbf{H}(t)\mathbf{Q}^*\mathbf{H}^H(t))$ as the utility at slot $t$ attained by $\mathbf{Q}^*$. If the channel feedback is accurate, i.e., $\tilde{\mathbf{H}}(t-1) = \mathbf{H}(t-1), \forall t \in \{1, 2, \ldots\}$, then $\tilde{\mathbf{D}}(t-1)$ is the gradient of $R(t-1)$ at point $\mathbf{Q}(t-1)$. Fix $\epsilon > 0$ and take $\gamma = \epsilon$. The results in [Zin03] ensure that, regardless of the distribution of $\mathbf{H}(t)$:
$$\frac{1}{t}\sum_{\tau=0}^{t-1}R(\tau) \ge \frac{1}{t}\sum_{\tau=0}^{t-1}R^{\text{opt}}(\tau) - \frac{2\bar{P}^2}{\epsilon t} - \frac{N_R B^4}{2}\epsilon, \quad \forall t > 0 \tag{8.15}$$
$$\mathrm{tr}(\mathbf{Q}(\tau)) \le \bar{P}, \quad \forall \tau \in \{0, 1, \ldots, t-1\} \tag{8.16}$$
The next subsections show that the performance of Algorithm 8.3 with inaccurate channels degrades linearly with respect to channel inaccuracy $\delta$. If $\delta = 0$, then (8.15) and (8.16) are recovered.

8.3.1 Transmit Covariance Update in Algorithm 8.3

This subsection shows that the $\mathbf{Q}(t)$ selection in Algorithm 8.3 has an (almost) closed-form solution. The projection operator involved in Algorithm 8.3 is by definition
$$\min \quad \frac{1}{2}\|\mathbf{Q} - \mathbf{X}\|_F^2 \tag{8.17}$$
$$\text{s.t.} \quad \mathrm{tr}(\mathbf{Q}) \le \bar{P} \tag{8.18}$$
$$\mathbf{Q} \in \mathbb{S}_+^{N_T} \tag{8.19}$$
where $\mathbf{X} = \mathbf{Q}(t-1) + \gamma\tilde{\mathbf{D}}(t-1)$ is a Hermitian matrix at each slot $t$.

Without the constraint $\mathrm{tr}(\mathbf{Q}) \le \bar{P}$, the projection of a Hermitian matrix $\mathbf{X}$ onto the positive semidefinite cone $\mathbb{S}_+^n$ is obtained by taking the eigenvalue expansion of $\mathbf{X}$ and dropping the terms associated with negative eigenvalues (see Section 8.1.1. in [BV04]). Work [BX05] considered the projection onto the intersection of the positive semidefinite cone $\mathbb{S}_+^n$ and an affine subspace given by $\{\mathbf{Q} : \mathrm{tr}(\mathbf{A}_i\mathbf{Q}) = b_i, i \in \{1, 2, \ldots, p\}, \mathrm{tr}(\mathbf{B}_j\mathbf{Q}) \le d_j, j \in \{1, 2, \ldots, m\}\}$ and developed a dual-based iterative numerical algorithm to calculate the projection. The problem (8.17)-(8.19) is a special case of the projection considered in [BX05], where the affine subspace is given by $\mathrm{tr}(\mathbf{Q}) \le \bar{P}$. Instead of solving the problem (8.17)-(8.19) using numerical algorithms, the next lemma shows that the problem (8.17)-(8.19) has an (almost) closed-form solution.

Lemma 8.5. Consider SVD $\mathbf{X} = \mathbf{U}^H\mathbf{\Sigma}\mathbf{U}$, where $\mathbf{U}$ is a unitary matrix and $\mathbf{\Sigma}$ is a diagonal matrix with entries $\sigma_1, \ldots, \sigma_{N_T}$. Then the optimal solution to the problem (8.17)-(8.19) is given by $\mathbf{Q}^* = \mathbf{U}^H\mathbf{\Theta}^*\mathbf{U}$, where $\mathbf{\Theta}^*$ is a diagonal matrix with entries $\theta_1^*, \ldots, \theta_{N_T}^*$ given by
$$\theta_i^* = \max\{0, \sigma_i - \mu^*\}, \quad \forall i \in \{1, 2, \ldots, N_T\},$$
where $\mu^*$ is chosen such that $\sum_{i=1}^{N_T}\theta_i^* \le \bar{P}$, $\mu^* \ge 0$ and $\mu^*\left[\sum_{i=1}^{N_T}\theta_i^* - \bar{P}\right] = 0$. The exact $\mu^*$ can be determined using Algorithm 8.4 with complexity $O(N_T \log N_T)$.

Proof. See Section 8.7.4.

Algorithm 8.4 Algorithm to Solve Problem (8.17)-(8.19)
1. Check if $\sum_{i=1}^{N_T}\max\{0, \sigma_i\} \le \bar{P}$ holds. If yes, let $\mu^* = 0$ and $\theta_i^* = \max\{0, \sigma_i\}, \forall i \in \{1, 2, \ldots, N_T\}$ and terminate the algorithm; else, continue to the next step.
2. Sort all $\sigma_i, i \in \{1, 2, \ldots, N_T\}$ in a decreasing order $\pi$ such that $\sigma_{\pi(1)} \ge \sigma_{\pi(2)} \ge \cdots \ge \sigma_{\pi(N_T)}$. Define $S_0 = 0$.
3. For $i = 1$ to $N_T$
• Let $S_i = S_{i-1} + \sigma_{\pi(i)}$. Let $\mu^* = \frac{S_i - \bar{P}}{i}$.
• If $\mu^* \ge 0$, $\sigma_{\pi(i)} - \mu^* > 0$ and $\sigma_{\pi(i+1)} - \mu^* \le 0$, then terminate the loop; else, continue to the next iteration in the loop.
4. Let $\theta_i^* = \max\{0, \sigma_i - \mu^*\}, \forall i \in \{1, 2, \ldots, N_T\}$ and terminate the algorithm.

8.3.2 Performance of Algorithm 8.3

Define $\mathbf{D}(t-1) = \mathbf{H}^H(t-1)\left(\mathbf{I}_{N_R} + \mathbf{H}(t-1)\mathbf{Q}(t-1)\mathbf{H}^H(t-1)\right)^{-1}\mathbf{H}(t-1)$, which is the gradient of $R(t-1)$ at point $\mathbf{Q}(t-1)$ and is unknown to the transmitter due to the unavailability of $\mathbf{H}(t-1)$. The next lemma relates $\tilde{\mathbf{D}}(t-1)$ and $\mathbf{D}(t-1)$.

Lemma 8.6. For all slots $t \in \{1, 2, \ldots\}$, we have
1. $\|\mathbf{D}(t-1)\|_F \le \sqrt{N_R}B^2$.
2. $\|\mathbf{D}(t-1) - \tilde{\mathbf{D}}(t-1)\|_F \le \psi(\delta)$, where $\psi(\delta) = \left(\sqrt{N_R}B + \sqrt{N_R}(B+\delta) + (B+\delta)^2 N_R\bar{P}(2B+\delta)\right)\delta$ satisfies $\psi(\delta) \to 0$ as $\delta \to 0$, i.e., $\psi(\delta) \in O(\delta)$.
3. $\|\tilde{\mathbf{D}}(t-1)\|_F \le \psi(\delta) + \sqrt{N_R}B^2$
where $B, \delta, N_R, N_T, P$ and $\bar{P}$ are defined in Section 8.1.1.

Proof. See Section 8.7.5.

The next theorem summarizes the performance of Algorithm 8.3.

Theorem 8.2. Fix $\epsilon > 0$ and define $\gamma = \epsilon$. Under Algorithm 8.3, we have (see footnote 6) for all $t > 0$:
$$\frac{1}{t}\sum_{\tau=0}^{t-1}R(\tau) \ge \frac{1}{t}\sum_{\tau=0}^{t-1}R^{\text{opt}}(\tau) - \frac{2\bar{P}^2}{\epsilon t} - \frac{(\psi(\delta) + \sqrt{N_R}B^2)^2}{2}\epsilon - 2\psi(\delta)\bar{P}$$
$$\mathrm{tr}(\mathbf{Q}(\tau)) \le \bar{P}, \quad \forall \tau \in \{0, 1, \ldots, t-1\}$$
where $\psi(\delta)$ is the constant defined in Lemma 8.6 and $B, \delta, N_R, P$ and $\bar{P}$ are defined in Section 8.1.1. In particular, the sample path time average utility is within $O(\epsilon) + 2\psi(\delta)\bar{P}$ of the optimal average utility for the problem (8.6)-(8.8) whenever $t \ge \frac{1}{\epsilon^2}$.

Proof. The second inequality trivially follows from the fact that $\mathbf{Q}(t) \in \tilde{\mathcal{Q}}, \forall t \in \{0, 1, \ldots\}$. It remains to prove the first inequality. This proof extends the regret analysis of conventional online convex optimization [Zin03] by considering the inexact gradient $\tilde{\mathbf{D}}(t-1)$.
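For all slots $\tau \in \{1, 2, \ldots\}$, the transmit covariance update in Algorithm 8.3 satisfies:
$$\begin{aligned}
\|\mathbf{Q}(\tau) - \mathbf{Q}^*\|_F^2 &= \left\|\mathcal{P}_{\tilde{\mathcal{Q}}}\left[\mathbf{Q}(\tau-1) + \gamma\tilde{\mathbf{D}}(\tau-1)\right] - \mathbf{Q}^*\right\|_F^2 \\
&\overset{(a)}{\le} \|\mathbf{Q}(\tau-1) + \gamma\tilde{\mathbf{D}}(\tau-1) - \mathbf{Q}^*\|_F^2 \\
&= \|\mathbf{Q}(\tau-1) - \mathbf{Q}^*\|_F^2 + 2\gamma\,\mathrm{tr}\left(\tilde{\mathbf{D}}^H(\tau-1)(\mathbf{Q}(\tau-1) - \mathbf{Q}^*)\right) + \gamma^2\|\tilde{\mathbf{D}}(\tau-1)\|_F^2 \\
&= \|\mathbf{Q}(\tau-1) - \mathbf{Q}^*\|_F^2 + 2\gamma\,\mathrm{tr}\left(\mathbf{D}^H(\tau-1)(\mathbf{Q}(\tau-1) - \mathbf{Q}^*)\right) \\
&\quad + 2\gamma\,\mathrm{tr}\left((\tilde{\mathbf{D}}(\tau-1) - \mathbf{D}(\tau-1))^H(\mathbf{Q}(\tau-1) - \mathbf{Q}^*)\right) + \gamma^2\|\tilde{\mathbf{D}}(\tau-1)\|_F^2,
\end{aligned}$$
where (a) follows from the non-expansive property of projections onto convex sets. Define $\Delta(t) = \|\mathbf{Q}(t+1) - \mathbf{Q}^*\|_F^2 - \|\mathbf{Q}(t) - \mathbf{Q}^*\|_F^2$. Rearranging terms in the last equation and dividing by $2\gamma$ implies
$$\mathrm{tr}\left(\mathbf{D}^H(\tau-1)(\mathbf{Q}(\tau-1) - \mathbf{Q}^*)\right) \ge \frac{1}{2\gamma}\Delta(\tau-1) - \frac{\gamma}{2}\|\tilde{\mathbf{D}}(\tau-1)\|_F^2 - \mathrm{tr}\left((\tilde{\mathbf{D}}(\tau-1) - \mathbf{D}(\tau-1))^H(\mathbf{Q}(\tau-1) - \mathbf{Q}^*)\right) \tag{8.20}$$

Footnote 6: In our conference version [YN16a], the first inequality of this theorem is mistakenly given by $\frac{1}{t}\sum_{\tau=0}^{t-1}R(\tau) \ge \frac{1}{t}\sum_{\tau=0}^{t-1}R^{\text{opt}}(\tau) - \frac{\bar{P}}{\epsilon t} - \frac{(\psi(\delta) + \sqrt{N_R}B^2)^2}{2}\epsilon - 2\psi(\delta)\bar{P}$.

Define $f_{\tau-1}(\mathbf{Q}) = \log\det(\mathbf{I} + \mathbf{H}(\tau-1)\mathbf{Q}\mathbf{H}^H(\tau-1))$. By Fact 8.3 in Section 8.7.1, $f_{\tau-1}(\cdot)$ is concave over $\tilde{\mathcal{Q}}$ and $\mathbf{D}(\tau-1) = \nabla_{\mathbf{Q}}f_{\tau-1}(\mathbf{Q}(\tau-1))$. Note that $\mathbf{Q}^* \in \tilde{\mathcal{Q}}$. By Fact 8.4 in Section 8.7.1, we have
$$f_{\tau-1}(\mathbf{Q}(\tau-1)) - f_{\tau-1}(\mathbf{Q}^*) \ge \mathrm{tr}\left(\mathbf{D}^H(\tau-1)(\mathbf{Q}(\tau-1) - \mathbf{Q}^*)\right) \tag{8.21}$$
Note that $f_{\tau-1}(\mathbf{Q}(\tau-1)) = R(\tau-1)$ and $f_{\tau-1}(\mathbf{Q}^*) = R^{\text{opt}}(\tau-1)$. Combining (8.20) and (8.21) yields
$$\begin{aligned}
R(\tau-1) - R^{\text{opt}}(\tau-1) &\ge \frac{1}{2\gamma}\Delta(\tau-1) - \frac{\gamma}{2}\|\tilde{\mathbf{D}}(\tau-1)\|_F^2 - \mathrm{tr}\left((\tilde{\mathbf{D}}(\tau-1) - \mathbf{D}(\tau-1))^H(\mathbf{Q}(\tau-1) - \mathbf{Q}^*)\right) \\
&\overset{(a)}{\ge} \frac{1}{2\gamma}\Delta(\tau-1) - \frac{\gamma}{2}\|\tilde{\mathbf{D}}(\tau-1)\|_F^2 - \|\tilde{\mathbf{D}}(\tau-1) - \mathbf{D}(\tau-1)\|_F\|\mathbf{Q}(\tau-1) - \mathbf{Q}^*\|_F \\
&\overset{(b)}{\ge} \frac{1}{2\gamma}\Delta(\tau-1) - \frac{\gamma}{2}(\psi(\delta) + \sqrt{N_R}B^2)^2 - 2\psi(\delta)\bar{P}
\end{aligned}$$
where (a) follows from Fact 8.1 in Section 8.7.1 and (b) follows from Lemma 8.6 and the fact that $\|\mathbf{Q}(\tau-1) - \mathbf{Q}^*\|_F \le \|\mathbf{Q}(\tau-1)\|_F + \|\mathbf{Q}^*\|_F \le \mathrm{tr}(\mathbf{Q}(\tau-1)) + \mathrm{tr}(\mathbf{Q}^*) \le 2\bar{P}$, which is implied by Fact 8.1, Fact 8.2 in Section 8.7.1 and the fact that $\mathbf{Q}(\tau-1), \mathbf{Q}^* \in \tilde{\mathcal{Q}}$. Replacing $\tau - 1$ with $\tau$ yields for all $\tau \in \{0, 1, \ldots\}$
$$R(\tau) - R^{\text{opt}}(\tau) \ge \frac{1}{2\gamma}\Delta(\tau) - \frac{\gamma}{2}(\psi(\delta) + \sqrt{N_R}B^2)^2 - 2\psi(\delta)\bar{P} \tag{8.22}$$
Fix $t > 0$.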
Summing over $\tau \in \{0, 1, \ldots, t-1\}$, dividing by $t$ and simplifying the telescoping sum $\sum_{\tau=0}^{t-1}\Delta(\tau)$ gives
$$\begin{aligned}
\frac{1}{t}\sum_{\tau=0}^{t-1}R(\tau) - \frac{1}{t}\sum_{\tau=0}^{t-1}R^{\text{opt}}(\tau) &\ge \frac{1}{2\gamma t}\left(\|\mathbf{Q}(t) - \mathbf{Q}^*\|_F^2 - \|\mathbf{Q}(0) - \mathbf{Q}^*\|_F^2\right) - \frac{\gamma}{2}(\psi(\delta) + \sqrt{N_R}B^2)^2 - 2\psi(\delta)\bar{P} \\
&\overset{(a)}{\ge} -\frac{2\bar{P}^2}{\gamma t} - \frac{\gamma}{2}(\psi(\delta) + \sqrt{N_R}B^2)^2 - 2\psi(\delta)\bar{P}
\end{aligned}$$
where (a) follows from $\|\mathbf{Q}(0) - \mathbf{Q}^*\|_F \le \|\mathbf{Q}(0)\|_F + \|\mathbf{Q}^*\|_F \le \mathrm{tr}(\mathbf{Q}(0)) + \mathrm{tr}(\mathbf{Q}^*) \le 2\bar{P}$ and $\|\mathbf{Q}(t) - \mathbf{Q}^*\|_F^2 \ge 0$.

Theorem 8.2 proves a sample path guarantee on the utility. It shows that the convergence time to reach an $O(\epsilon) + 2\psi(\delta)\bar{P}$ approximate solution is $1/\epsilon^2$. Note that if $\delta = 0$, then equations (8.15) and (8.16) are recovered by Theorem 8.2. Theorem 8.2 also isolates the effects of missing CDIT and CSIT inaccuracy. The error term $O(\epsilon)$ corresponds to the effect of missing CDIT and can be made arbitrarily small by choosing a small $\gamma$ and running the algorithm for more than $\frac{1}{\epsilon^2}$ iterations. The observation is that the effect of missing CDIT vanishes as Algorithm 8.3 runs for a sufficiently long time, and hence delayed but accurate CSIT is almost as good as CDIT. The other error term $2\psi(\delta)\bar{P}$ corresponds to the effect of CSIT inaccuracy and does not vanish. The performance degradation due to channel inaccuracy scales linearly with respect to the channel error since $\psi(\delta) \in O(\delta)$. Intuitively, this is reasonable since any algorithm based on inaccurate CSIT is actually optimizing a different MIMO system.

8.3.3 Extensions

T-Slot Delayed and Inaccurate CSIT

Thus far, we have assumed that CSIT is always delayed by one slot. In fact, if CSIT is delayed by $T$ slots, we can modify the update of transmit covariances in Algorithm 8.3 as $\mathbf{Q}(t) = \mathcal{P}_{\tilde{\mathcal{Q}}}[\mathbf{Q}(t-T) + \gamma\tilde{\mathbf{D}}(t-T)]$. A $T$-slot version of Theorem 8.2 can be similarly proven.

Algorithm 8.3 with Time Varying $\gamma$

Algorithm 8.3 can be extended to have a time varying step size $\gamma(t) = \frac{1}{\sqrt{t}}$ at slot $t$. The next lemma shows that such an algorithm can approach an $\epsilon + 2\psi(\delta)\bar{P}$ approximate solution within $O(1/\epsilon^2)$ iterations.

Lemma 8.7.
Fix $\epsilon > 0$. If we modify Algorithm 8.3 by using $\gamma(t) = \frac{1}{\sqrt{t}}$ as the step size $\gamma$ at each slot $t$, then for all $t > 0$:
$$\frac{1}{t}\sum_{\tau=0}^{t-1}R(\tau) \ge \frac{1}{t}\sum_{\tau=0}^{t-1}R^{\text{opt}}(\tau) - \frac{2\bar{P}^2}{\sqrt{t}} - \frac{1}{\sqrt{t}}(\psi(\delta) + \sqrt{N_R}B^2)^2 - 2\psi(\delta)\bar{P},$$
$$\frac{1}{t}\sum_{\tau=0}^{t-1}\mathrm{tr}(\mathbf{Q}(\tau)) \le \bar{P},$$
where $B, \delta, N_R, P$ and $\bar{P}$ are defined in Section 8.1.1.

Proof. The second inequality again follows from the fact that $\mathbf{Q}(t) \in \tilde{\mathcal{Q}}, \forall t \in \{0, 1, \ldots\}$. It remains to prove the first inequality. With $\gamma(t) = \frac{1}{\sqrt{t}}$, equation (8.22) in the proof of Theorem 8.2 becomes $R(\tau) - R^{\text{opt}}(\tau) \ge \frac{1}{2\gamma(\tau+1)}\Delta(\tau) - \frac{\gamma(\tau+1)}{2}(\psi(\delta) + \sqrt{N_R}B^2)^2 - 2\psi(\delta)\bar{P}$ for all $\tau \in \{0, 1, \ldots\}$. Fix $t > 0$. Summing over $\tau \in \{0, 1, \ldots, t-1\}$ and dividing by $t$ yields:
$$\begin{aligned}
\frac{1}{t}\sum_{\tau=0}^{t-1}R(\tau) - \frac{1}{t}\sum_{\tau=0}^{t-1}R^{\text{opt}}(\tau) &\ge \frac{1}{2t}\sum_{\tau=0}^{t-1}\sqrt{\tau+1}\,\Delta(\tau) - \frac{1}{t}\left(\sum_{\tau=0}^{t-1}\frac{1}{2\sqrt{\tau+1}}\right)(\psi(\delta) + \sqrt{N_R}B^2)^2 - 2\psi(\delta)\bar{P} \\
&\overset{(a)}{\ge} -\frac{2\bar{P}^2}{\sqrt{t}} - \frac{1}{\sqrt{t}}(\psi(\delta) + \sqrt{N_R}B^2)^2 - 2\psi(\delta)\bar{P}
\end{aligned}$$
where (a) follows because
$$\begin{aligned}
\sum_{\tau=0}^{t-1}\sqrt{\tau+1}\,\Delta(\tau) &= \sqrt{t}\|\mathbf{Q}(t) - \mathbf{Q}^*\|_F^2 - \|\mathbf{Q}(0) - \mathbf{Q}^*\|_F^2 + \sum_{\tau=0}^{t-2}(\sqrt{\tau+1} - \sqrt{\tau+2})\|\mathbf{Q}(\tau+1) - \mathbf{Q}^*\|_F^2 \\
&\ge -\|\mathbf{Q}(0) - \mathbf{Q}^*\|_F^2 + 4\bar{P}^2\sum_{\tau=0}^{t-2}(\sqrt{\tau+1} - \sqrt{\tau+2}) \ge -4\bar{P}^2\sqrt{t}
\end{aligned}$$
and $\sum_{\tau=0}^{t-1}\frac{1}{2\sqrt{\tau+1}} \le \sqrt{t}$.

An advantage of time varying step sizes is that the performance automatically improves as the algorithm runs, and there is no need to restart the algorithm with a different constant step size if better performance is demanded.

8.4 Rate Adaptation

To achieve the capacity characterized by either the problem (8.2)-(8.4) or the problem (8.6)-(8.8), we also need to consider the rate allocation associated with the transmit covariance, namely, we need to decide how much data is delivered at each slot. If accurate instantaneous CSIT is available, the transmitter can simply deliver $\log\det(\mathbf{I} + \mathbf{H}(t)\mathbf{Q}(t)\mathbf{H}^H(t))$ amount of data at slot $t$ once $\mathbf{Q}(t)$ is decided. However, in the cases when instantaneous CSIT is inaccurate or only delayed CSIT is available, the transmitter does not know the associated instantaneous channel capacity without knowing $\mathbf{H}(t)$.
These cases belong to the representative communication scenarios where channels are unknown to the transmitter and rateless codes are usually used as a solution. To send $N$ bits of source data, the rateless code keeps sending encoded information bits without knowing the instantaneous channel capacity, such that the receiver can decode all $N$ bits as long as the accumulated channel capacity over sufficiently many slots is larger than $N$. Many practical rateless codes for scalar or MIMO fading channels have been designed in [ETW12, FLEP10, LTCS16].

This section provides an information theoretical rate adaptation policy based on rateless codes that can be combined with the dynamic power allocation algorithms developed in this chapter. The rate adaptation scheme is as follows: Let $N$ be a large number. Encode $N$ bits of source data with a capacity achieving code for a channel with capacity no less than $N$ bits per slot. At slot 0, deliver the above encoded data with transmit covariance $\mathbf{Q}(0)$ given by Algorithm 8.1 or Algorithm 8.3. The receiver knows channel $\mathbf{H}(0)$, calculates the channel capacity $R(0) = \log\det(\mathbf{I} + \mathbf{H}(0)\mathbf{Q}(0)\mathbf{H}^H(0))$, and reports back the scalar $R(0)$ to the transmitter. At slot 1, the transmitter removes the first $R(0)$ bits from the $N$ bits of source data, encodes the remaining $N - R(0)$ bits with a capacity achieving code for a channel with capacity no less than $N - R(0)$ bits per slot, and delivers the encoded data with transmit covariance $\mathbf{Q}(1)$ given by Algorithm 8.1 or Algorithm 8.3. The receiver knows channel $\mathbf{H}(1)$, calculates the channel capacity $R(1) = \log\det(\mathbf{I} + \mathbf{H}(1)\mathbf{Q}(1)\mathbf{H}^H(1))$, and reports back the scalar $R(1)$ to the transmitter. Repeat the above process until slot $T - 1$ such that $\sum_{t=0}^{T-1}R(t) > N$.

For the decoding, the receiver can decode all the $N$ bits in a reverse order using the idea of successive decoding [TV05].
At slot $T - 1$, $N - \sum_{t=0}^{T-2}R(t)$ bits of source data are delivered over a channel with capacity $R(T-1)$ bits per slot. Since $N - \sum_{t=0}^{T-2}R(t) < R(T-1)$, the receiver can decode all delivered data ($N - \sum_{t=0}^{T-2}R(t)$ bits) with zero error. Note that $N - \sum_{t=0}^{T-3}R(t) = R(T-2) + N - \sum_{t=0}^{T-2}R(t)$ bits are delivered at slot $T - 2$ over a channel with capacity $R(T-2)$ bits per slot. The receiver subtracts the $N - \sum_{t=0}^{T-2}R(t)$ bits that are already decoded such that only $R(T-2)$ bits remain to be decoded. Thus, the $R(T-2)$ bits can be successfully decoded. Repeat this process until all $N$ bits are decoded.

Using the above rate adaptation and decoding strategy, $N$ bits are delivered and decoded within $T$ slots (slots $0$ through $T - 1$), during which the sum capacity is $\sum_{t=0}^{T-1}R(t)$ bits. When $N$ is large enough, the rate loss $\sum_{t=0}^{T-1}R(t) - N$ is negligible. This rate adaptation scheme does not require $\mathbf{H}(t)$ and only requires the receiver to report back the scalar $R(t-1)$ to the transmitter at each slot $t$.

8.5 Simulations

8.5.1 A Simple MIMO System with Two Channel Realizations

Consider a $2 \times 2$ MIMO system with two equally likely channel realizations:
$$\mathbf{H}_1 = \begin{bmatrix} 1.3131e^{j1.9590\pi} & 2.3880e^{j0.7104\pi} \\ 2.5567e^{j1.5259\pi} & 2.8380e^{j0.3845\pi} \end{bmatrix}, \quad \mathbf{H}_2 = \begin{bmatrix} 1.4781e^{j0.9674\pi} & 1.5291e^{j0.1396\pi} \\ 0.0601e^{j0.9849\pi} & 0.1842e^{j1.9126\pi} \end{bmatrix}.$$
This simple scenario is considered as a test case because, when there are only two possible channels with known channel probabilities, it is easy to find an optimal baseline algorithm by solving the problem (8.2)-(8.4) or the problem (8.6)-(8.8) directly. The goal is to show that the proposed algorithms (which do not have channel distribution information) come close to this baseline. The proposed algorithms can be implemented just as easily in cases when there are an infinite number of possible channel state matrices, rather than just two. However, in that case it is difficult to find an optimal baseline algorithm since the problem (8.2)-(8.4) or the problem (8.6)-(8.8) are difficult to solve.
(Footnote 7, attached to the preceding sentence: As discussed in Section 8.2.3, this is known as the curse of dimensionality of empirical methods for stochastic optimization due to the large sample size.)

The power constraints are $\bar{P} = 2$ and $P = 3$. If CSIT has error, $\mathbf{H}_1$ and $\mathbf{H}_2$ are observed as $\tilde{\mathbf{H}}_1$ and $\tilde{\mathbf{H}}_2$, respectively. Consider two CSIT error cases.

CSIT Error Case 1:
$$\tilde{\mathbf{H}}_1 = \begin{bmatrix} 1.3131e^{j2\pi} & 2.3880e^{j0.75\pi} \\ 2.5567e^{j1.5\pi} & 2.8380e^{j0.5\pi} \end{bmatrix}, \quad \tilde{\mathbf{H}}_2 = \begin{bmatrix} 1.4781e^{j1\pi} & 1.5291e^{j0.25\pi} \\ 0.0601e^{j1\pi} & 0.1842e^{j2\pi} \end{bmatrix},$$
where the magnitudes are accurate but the phases are rounded to the nearest $\pi/4$ phase.

CSIT Error Case 2:
$$\tilde{\mathbf{H}}_1 = \begin{bmatrix} 1.3e^{j2\pi} & 2.4e^{j0.5\pi} \\ 2.6e^{j1.5\pi} & 2.8e^{j0.5\pi} \end{bmatrix}, \quad \tilde{\mathbf{H}}_2 = \begin{bmatrix} 1.5e^{j1\pi} & 1.5e^{j0\pi} \\ 0 & 0.2e^{j2\pi} \end{bmatrix},$$
where the magnitudes are rounded to the first digit after the decimal point and the phases are rounded to the nearest $\pi/2$ phase.

In the instantaneous CSIT case, consider Baseline 1 where the optimal solution $\mathbf{Q}^*(\mathbf{H})$ to the problem (8.2)-(8.4) is calculated by assuming the knowledge that $\mathbf{H}_1$ and $\mathbf{H}_2$ appear with equal probabilities and $\mathbf{Q}(t) = \mathbf{Q}^*(\mathbf{H}(t))$ is used at each slot $t$. Figure 8.1 compares the performance of Algorithm 8.1 (with $V = 100$) under various CSIT accuracy conditions and Baseline 1. It can be seen that Algorithm 8.1 has a performance close to that attained by the optimal solution to the problem (8.2)-(8.4), which requires channel distribution information. (Note that a larger $V$ gives an even closer performance with a longer convergence time.) It can also be observed that the performance of Algorithm 8.1 becomes worse as the CSIT error gets larger.

[Figure 8.1: A simple MIMO system with instantaneous CSIT. Two panels versus slots $t$ (0 to 1500): the time average utility $\frac{1}{t}\sum_{\tau=0}^{t-1}\log\det(\mathbf{I} + \mathbf{H}(\tau)\mathbf{Q}(\tau)\mathbf{H}^H(\tau))$ and the time average power $\frac{1}{t}\sum_{\tau=0}^{t-1}\mathrm{tr}(\mathbf{Q}(\tau))$. Curves: Baseline 1; Algorithm 8.1 with accurate CSIT, CSIT Error Case 1, and CSIT Error Case 2.]

In the delayed CSIT case, consider Baseline 2 where the optimal solution $\mathbf{Q}^*$ to the problem (8.6)-(8.8) is calculated by assuming the knowledge that $\mathbf{H}_1$ and $\mathbf{H}_2$ appear with equal probabilities; and $\mathbf{Q}(t) = \mathbf{Q}^*$ is used at each slot $t$. Figure 8.2 compares the performance of Algorithm 8.3 (with $\gamma = 0.01$) under various CSIT accuracy conditions and Baseline 2. Note that the average power is not drawn since the average power constraint is satisfied for all $t$ in all schemes. It can be seen that Algorithm 8.3 has a performance close to that attained by the optimal solution to the problem (8.6)-(8.8), which requires channel distribution information. (Note that a smaller $\gamma$ gives an even closer performance with a longer convergence time.) It can also be observed that the performance of Algorithm 8.3 becomes worse as the CSIT error gets larger.

[Figure 8.2: A simple MIMO system with delayed CSIT. Time average utility $\frac{1}{t}\sum_{\tau=0}^{t-1}\log\det(\mathbf{I} + \mathbf{H}(\tau)\mathbf{Q}(\tau)\mathbf{H}^H(\tau))$ versus slots $t$ (0 to 1500). Curves: Baseline 2; Algorithm 8.3 with accurate CSIT, CSIT Error Case 1, and CSIT Error Case 2.]

8.5.2 A MIMO System with Continuous Channel Realizations

This section considers a $2 \times 2$ MIMO system with continuous channel realizations. Each entry in $\mathbf{H}(t)$ is equal to $uv$, where $u$ is a complex number whose real part and imaginary part are standard normal and $v$ is uniform over $[0, 0.5]$. In this case, even if the channel distribution information is perfectly known, the problem (8.2)-(8.4) and the problem (8.6)-(8.8) are infinite dimensional problems and are extremely hard to solve.
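The continuous channel model described above can be sampled with a few lines of NumPy. This is only a sketch of one reading of the description: it assumes that $u$ and $v$ are drawn independently for every matrix entry and every slot, which the text does not state explicitly, and the function name is hypothetical.

```python
import numpy as np


def sample_channel(rng, n_r=2, n_t=2):
    """Draw one N_R x N_T channel whose entries are u * v.

    u: complex, real and imaginary parts standard normal.
    v: uniform on [0, 0.5].
    Assumption: u and v are independent across entries.
    """
    u = rng.standard_normal((n_r, n_t)) + 1j * rng.standard_normal((n_r, n_t))
    v = rng.uniform(0.0, 0.5, size=(n_r, n_t))
    return u * v


rng = np.random.default_rng(0)   # fixed seed so the draw is reproducible
h_sample = sample_channel(rng)
```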
In practice, to solve the stochastic optimization, people usually approximate the continuous distribution by a discrete distribution with a reasonable number of realizations and solve the approximate optimization, which is a large scale deterministic optimization problem. (Baselines 3 and 4 considered below essentially use this idea.)

In the instantaneous CSIT case, consider Baseline 3 where we spend 100 slots to obtain an empirical channel distribution by observing 100 accurate channel realizations (see footnote 8); obtain the optimal solution $\mathbf{Q}^*(\mathbf{H}), \mathbf{H} \in \mathcal{H}$ to the problem (8.2)-(8.4) using the empirical distribution; and choose $\mathbf{Q}^*(\mathbf{H})$ where $\mathbf{H} = \mathrm{argmin}_{\mathbf{H} \in \mathcal{H}}\|\mathbf{H} - \mathbf{H}(t)\|_F$ at each slot $t$. Figure 8.3 compares the performance of Algorithm 8.1 (with $V = 100$) and Baseline 3, and shows that Algorithm 8.1 has a better performance than Baseline 3.

Footnote 8: By doing so, 100 slots are wasted without sending any data. The 100 slots are not counted in the simulation. If they are counted, Algorithm 8.1’s performance advantage over Baseline 3 is even bigger. The delayed CSIT case is similar.

[Figure 8.3: A continuous MIMO system with instantaneous CSIT. Two panels versus slots $t$ (0 to 1500): the time average utility $\frac{1}{t}\sum_{\tau=0}^{t-1}\log\det(\mathbf{I} + \mathbf{H}(\tau)\mathbf{Q}(\tau)\mathbf{H}^H(\tau))$ and the time average power $\frac{1}{t}\sum_{\tau=0}^{t-1}\mathrm{tr}(\mathbf{Q}(\tau))$. Curves: Baseline 3 and Algorithm 8.1.]

In the delayed CSIT case, consider Baseline 4 where we spend 100 slots to obtain an empirical channel distribution by observing 100 accurate channel realizations; obtain the optimal solution $\mathbf{Q}^*$ to the problem (8.6)-(8.8) using the empirical distribution; and choose $\mathbf{Q}^*$ at each slot $t$. Figure 8.4 compares the performance of Algorithm 8.3 (with $\gamma = 0.01$) and Baseline 4, and shows that Algorithm 8.3 has a better performance than Baseline 4.
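For readers who want to reproduce the delayed CSIT curves, one slot of Algorithm 8.3 with the projection of Lemma 8.5 and Algorithm 8.4 can be sketched as below. This is a hedged sketch, not the simulation code used for the figures: the function names are hypothetical, the input is symmetrized before the eigendecomposition as a numerical safeguard, and the sorted-order search uses $\sigma_{\pi(i)}$ in the running sum.

```python
import numpy as np


def project_trace_psd(x, p_avg):
    """Project Hermitian x onto {Q PSD, tr(Q) <= p_avg} (Lemma 8.5)."""
    w, vecs = np.linalg.eigh((x + x.conj().T) / 2)  # x = vecs diag(w) vecs^H
    theta = np.maximum(0.0, w)                      # step 1: try mu = 0
    if theta.sum() > p_avg:
        w_sorted = np.sort(w)[::-1]                 # step 2: decreasing order
        s = 0.0
        for i, w_i in enumerate(w_sorted, start=1):  # step 3
            s += w_i
            mu = (s - p_avg) / i
            nxt = w_sorted[i] if i < len(w_sorted) else -np.inf
            if mu >= 0 and w_i - mu > 0 and nxt - mu <= 0:
                break
        theta = np.maximum(0.0, w - mu)             # step 4
    return (vecs * theta) @ vecs.conj().T


def delayed_csit_step(q_prev, h_est_prev, gamma, p_avg):
    """Q(t) = P[Q(t-1) + gamma * D~(t-1)] with D~ built from the delayed estimate."""
    n_r = h_est_prev.shape[0]
    m = np.eye(n_r) + h_est_prev @ q_prev @ h_est_prev.conj().T
    d_est = h_est_prev.conj().T @ np.linalg.solve(m, h_est_prev)
    return project_trace_psd(q_prev + gamma * d_est, p_avg)
```

Calling `delayed_csit_step` once per slot with the previous slot's reported channel reproduces the update loop; a time varying step size only changes the `gamma` argument passed at each slot.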
8.6 Chapter Summary

This chapter considers dynamic transmit covariance design in point-to-point MIMO fading systems without CDIT. Two different dynamic policies are proposed to deal with the cases of instantaneous CSIT and delayed CSIT, respectively. In both cases, the proposed dynamic policies can achieve $O(\delta)$ sub-optimality, where $\delta$ is the inaccuracy measure of CSIT.

[Figure 8.4: A continuous MIMO system with delayed CSIT. Time average utility $\frac{1}{t}\sum_{\tau=0}^{t-1}\log\det(\mathbf{I} + \mathbf{H}(\tau)\mathbf{Q}(\tau)\mathbf{H}^H(\tau))$ versus slots $t$ (0 to 1500). Curves: Baseline 4 and Algorithm 8.3.]

8.7 Supplement to this Chapter

8.7.1 Linear Algebra and Matrix Derivatives

Fact 8.1 ([HJ85]). For any $\mathbf{A}, \mathbf{B} \in \mathbb{C}^{m \times n}$ and $\mathbf{C} \in \mathbb{C}^{n \times k}$ we have:
1. $\|\mathbf{A}\|_F = \|\mathbf{A}^H\|_F = \|\mathbf{A}^T\|_F = \|-\mathbf{A}\|_F$.
2. $\|\mathbf{A} + \mathbf{B}\|_F \le \|\mathbf{A}\|_F + \|\mathbf{B}\|_F$.
3. $\|\mathbf{A}\mathbf{C}\|_F \le \|\mathbf{A}\|_F\|\mathbf{C}\|_F$.
4. $|\mathrm{tr}(\mathbf{A}^H\mathbf{B})| \le \|\mathbf{A}\|_F\|\mathbf{B}\|_F$.

Fact 8.2 ([HJ85]). For any $\mathbf{A} \in \mathbb{S}_+^n$ we have $\|\mathbf{A}\|_F \le \mathrm{tr}(\mathbf{A})$.

Fact 8.3 ([FHM07]). The function $f : \mathbb{S}_+^n \to \mathbb{R}$ defined by $f(\mathbf{Q}) = \log\det(\mathbf{I} + \mathbf{H}\mathbf{Q}\mathbf{H}^H)$ is concave and its gradient is given by $\nabla_{\mathbf{Q}}f(\mathbf{Q}) = \mathbf{H}^H(\mathbf{I} + \mathbf{H}\mathbf{Q}\mathbf{H}^H)^{-1}\mathbf{H}, \forall \mathbf{Q} \in \mathbb{S}_+^n$.

The above fact is developed in [FHM07]. A general theory on developing derivatives for functions with complex matrix variables is available in [Hjø11].

The next fact is the complex matrix version of the first order condition for concave functions of real number variables, i.e., $f(y) \le f(x) + f'(x)(y - x), \forall x, y \in \mathrm{dom} f$ if $f$ is concave. We also provide a brief proof for this fact.

Fact 8.4. Let function $f(\mathbf{Q}) : \mathbb{S}_+^n \to \mathbb{R}$ be a concave function and have gradient $\nabla_{\mathbf{Q}}f(\mathbf{Q}) \in \mathbb{S}^n$ at point $\mathbf{Q}$. Then, $f(\hat{\mathbf{Q}}) \le f(\mathbf{Q}) + \mathrm{tr}\left([\nabla_{\mathbf{Q}}f(\mathbf{Q})]^H(\hat{\mathbf{Q}} - \mathbf{Q})\right), \forall \hat{\mathbf{Q}} \in \mathbb{S}_+^n$.

Proof. Recall that a function is concave if and only if it is concave when restricted to any line that intersects its domain (see page 67 in [BV04]). For any $\mathbf{Q}, \hat{\mathbf{Q}} \in \mathbb{S}_+^n$, define $g(t) = f(\mathbf{Q} + t(\hat{\mathbf{Q}} - \mathbf{Q}))$. Thus, $g(t)$ is concave over $[0, 1]$; $g(0) = f(\mathbf{Q})$; and $g(1) = f(\hat{\mathbf{Q}})$.
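Note that $g'(t) = \operatorname{tr}\big([\nabla_Q f(Q + t(\widehat{Q} - Q))]^H (\widehat{Q} - Q)\big)$ by the chain rule of derivatives when the inner product in the complex matrix space $\mathbb{C}^{n\times n}$ is defined as $\langle A, B\rangle = \operatorname{tr}(A^H B)$, $\forall A, B \in \mathbb{C}^{n\times n}$. By the first-order condition of the concave function $g(t)$, we have $g(1) \le g(0) + g'(0)(1 - 0)$. Note that $g'(0) = \operatorname{tr}\big([\nabla_Q f(Q)]^H (\widehat{Q} - Q)\big)$. Thus, we have $f(\widehat{Q}) \le f(Q) + \operatorname{tr}\big([\nabla_Q f(Q)]^H (\widehat{Q} - Q)\big)$.

8.7.2 Proof of Lemma 8.2

The proof method is an extension of Section 3.2 in [Tel99], which gives the structure of the optimal transmit covariance in deterministic MIMO channels. Note that
\[
\log\det(I + HQH^H) \overset{(a)}{=} \log\det(I + QH^H H) \overset{(b)}{=} \log\det(I + QU^H \Sigma U) \overset{(c)}{=} \log\det(I + \Sigma^{1/2} U Q U^H \Sigma^{1/2}),
\]
where (a) and (c) follow from the elementary identity $\det(I + AB) = \det(I + BA)$, $\forall A \in \mathbb{C}^{m\times n}$ and $B \in \mathbb{C}^{n\times m}$; and (b) follows from the fact that $H^H H = U^H \Sigma U$. Define $\widetilde{Q} = UQU^H$, which is positive semidefinite if and only if $Q$ is. Note that $\operatorname{tr}(\widetilde{Q}) = \operatorname{tr}(UQU^H) = \operatorname{tr}(Q)$ by the fact that $\operatorname{tr}(AB) = \operatorname{tr}(BA)$, $\forall A \in \mathbb{C}^{m\times n}$, $B \in \mathbb{C}^{n\times m}$. Thus, the problem (8.12)-(8.14) is equivalent to
\[
\begin{aligned}
\max_{\widetilde{Q}} \quad & \log\det(I + \Sigma^{1/2} \widetilde{Q} \Sigma^{1/2}) - \frac{Z}{V}\operatorname{tr}(\widetilde{Q}) && (8.23)\\
\text{s.t.} \quad & \operatorname{tr}(\widetilde{Q}) \le P && (8.24)\\
& \widetilde{Q} \in \mathbb{S}^{N_T}_+ && (8.25)
\end{aligned}
\]

Fact 8.5 (Hadamard's Inequality, Theorem 7.8.1 in [HJ85]). For all $A \in \mathbb{S}^n_+$, $\det(A) \le \prod_{i=1}^{n} A_{ii}$, with equality if $A$ is diagonal.

The next claim can be proven using Hadamard's inequality.

Claim 8.1. The problem (8.23)-(8.25) has a diagonal optimal solution.

Proof. Suppose the problem (8.23)-(8.25) has a non-diagonal optimal solution given by matrix $\widetilde{Q}$. Consider a diagonal matrix $\widehat{Q}$ whose entries are identical to the diagonal entries of $\widetilde{Q}$. Note that $\operatorname{tr}(\widehat{Q}) = \operatorname{tr}(\widetilde{Q})$. To show $\widehat{Q}$ is a solution no worse than $\widetilde{Q}$, it suffices to show that $\log\det(I + \Sigma^{1/2} \widehat{Q} \Sigma^{1/2}) \ge \log\det(I + \Sigma^{1/2} \widetilde{Q} \Sigma^{1/2})$. This is true because
\[
\det(I + \Sigma^{1/2} \widehat{Q} \Sigma^{1/2}) = \prod_{i=1}^{N_T} (1 + \widehat{Q}_{ii}\sigma_i) = \prod_{i=1}^{N_T} (1 + \widetilde{Q}_{ii}\sigma_i) \ge \det(I + \Sigma^{1/2} \widetilde{Q} \Sigma^{1/2}),
\]
where the last inequality follows from Hadamard's inequality.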
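Thus, $\widehat{Q}$ is a solution no worse than $\widetilde{Q}$ and hence optimal.

By Claim 8.1, we can consider $\widetilde{Q} = \Theta = \operatorname{diag}(\theta_1, \theta_2, \ldots, \theta_{N_T})$ and the problem (8.23)-(8.25) is equivalent to
\[
\begin{aligned}
\max \quad & \sum_{i=1}^{N_T} \log(1 + \theta_i\sigma_i) - \frac{Z}{V}\sum_{i=1}^{N_T}\theta_i && (8.26)\\
\text{s.t.} \quad & \sum_{i=1}^{N_T}\theta_i \le P && (8.27)\\
& \theta_i \ge 0, \ \forall i \in \{1, 2, \ldots, N_T\} && (8.28)
\end{aligned}
\]
Note that the problem (8.26)-(8.28) satisfies Slater's condition. So the optimal solution to the problem (8.26)-(8.28) is characterized by KKT conditions [BV04]. The remaining part is similar to the derivation of the water-filling solution of power allocation in parallel channels, e.g., the proof of Example 5.2 in [BV04].

Introduce Lagrange multipliers $\mu \in \mathbb{R}_+$ for the inequality constraint $\sum_{i=1}^{N_T}\theta_i \le P$ and $\nu = [\nu_1, \ldots, \nu_{N_T}]^T \in \mathbb{R}^{N_T}_+$ for the inequality constraints $\theta_i \ge 0$, $i \in \{1, 2, \ldots, N_T\}$. Let $\theta^* = [\theta_1^*, \ldots, \theta_{N_T}^*]^T$ and $(\mu^*, \nu^*)$ be any primal and dual optimal points with zero duality gap. By the KKT conditions, we have, for all $i \in \{1, 2, \ldots, N_T\}$,
\[
-\frac{\sigma_i}{1 + \theta_i^*\sigma_i} + \frac{Z}{V} + \mu^* - \nu_i^* = 0; \quad \sum_{i=1}^{N_T}\theta_i^* \le P; \quad \mu^* \ge 0; \quad \mu^*\Big[\sum_{i=1}^{N_T}\theta_i^* - P\Big] = 0; \quad \theta_i^* \ge 0; \quad \nu_i^* \ge 0; \quad \nu_i^*\theta_i^* = 0.
\]
Eliminating $\nu_i^*$, $\forall i \in \{1, 2, \ldots, N_T\}$, in all equations yields
\[
\mu^* + \frac{Z}{V} \ge \frac{\sigma_i}{1 + \theta_i^*\sigma_i}; \quad \sum_{i=1}^{N_T}\theta_i^* \le P; \quad \mu^* \ge 0; \quad \mu^*\Big[\sum_{i=1}^{N_T}\theta_i^* - P\Big] = 0; \quad \theta_i^* \ge 0; \quad \Big(\mu^* + \frac{Z}{V} - \frac{\sigma_i}{1 + \theta_i^*\sigma_i}\Big)\theta_i^* = 0,
\]
for all $i \in \{1, 2, \ldots, N_T\}$.

For all $i \in \{1, 2, \ldots, N_T\}$, we consider $\mu^* + \frac{Z}{V} < \sigma_i$ and $\mu^* + \frac{Z}{V} \ge \sigma_i$ separately:

1. If $\mu^* + \frac{Z}{V} < \sigma_i$, then $\mu^* + \frac{Z}{V} \ge \frac{\sigma_i}{1 + \theta_i^*\sigma_i}$ holds only when $\theta_i^* > 0$, which by $\big(\mu^* + \frac{Z}{V} - \frac{\sigma_i}{1 + \theta_i^*\sigma_i}\big)\theta_i^* = 0$ implies that $\mu^* + \frac{Z}{V} - \frac{\sigma_i}{1 + \theta_i^*\sigma_i} = 0$, i.e., $\theta_i^* = \frac{1}{\mu^* + Z/V} - \frac{1}{\sigma_i}$.

2. If $\mu^* + \frac{Z}{V} \ge \sigma_i$, then $\theta_i^* > 0$ is impossible, because $\theta_i^* > 0$ implies that $\mu^* + \frac{Z}{V} - \frac{\sigma_i}{1 + \theta_i^*\sigma_i} > 0$, which together with $\theta_i^* > 0$ contradicts the slackness condition $\big(\mu^* + \frac{Z}{V} - \frac{\sigma_i}{1 + \theta_i^*\sigma_i}\big)\theta_i^* = 0$. Thus, if $\mu^* + \frac{Z}{V} \ge \sigma_i$, we must have $\theta_i^* = 0$.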
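The two cases collapse to a thresholded water-filling rule, $\theta_i^* = \max\{0, \frac{1}{\mu^* + Z/V} - \frac{1}{\sigma_i}\}$, with $\mu^* \ge 0$ pinned down by complementary slackness on the power constraint. The following is a minimal sketch of that rule, assuming $\sigma_i > 0$, $V > 0$, $P > 0$ and $Z \ge 0$, and locating $\mu^*$ by a plain bisection. The function name and the tolerance are illustrative choices, and the code is not a transcription of Algorithm 8.2.

```python
def water_filling(sigma, Z, V, P, tol=1e-12):
    """Return (mu, theta) with theta_i = max(0, 1/(mu + Z/V) - 1/sigma_i).

    Assumes sigma_i > 0, V > 0, P > 0 and Z >= 0 (hypothetical helper).
    """
    def theta(mu):
        denom = mu + Z / V
        level = float("inf") if denom == 0 else 1.0 / denom
        return [max(0.0, level - 1.0 / s) for s in sigma]

    # If mu = 0 already meets the power budget, the constraint is slack.
    if sum(theta(0.0)) <= P:
        return 0.0, theta(0.0)

    # Otherwise the budget is tight for some mu > 0; the total power is
    # non-increasing in mu and equals 0 at mu = max(sigma).
    lo, hi = 0.0, max(sigma)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if sum(theta(mid)) > P:
            lo = mid
        else:
            hi = mid
    return hi, theta(hi)

# Illustrative numbers: Z/V = 0.5 and a power budget of 1.
mu_star, theta_star = water_filling([2.0, 1.0, 0.25], Z=1.0, V=2.0, P=1.0)
```

For these inputs the first two channels are active and the third is switched off, so the multiplier solves $2/(\mu^* + 0.5) - 1.5 = 1$, which gives $\mu^* = 0.3$ and $\theta^* = (0.75, 0.25, 0)$ up to the bisection tolerance.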
Summarizing both cases, we have $\theta_i^* = \max\big\{0, \frac{1}{\mu^* + Z/V} - \frac{1}{\sigma_i}\big\}$, $\forall i \in \{1, 2, \ldots, N_T\}$, where $\mu^*$ is chosen such that $\sum_{i=1}^{N_T}\theta_i^* \le P$, $\mu^* \ge 0$ and $\mu^*\big[\sum_{i=1}^{N_T}\theta_i^* - P\big] = 0$.

To find such $\mu^*$, we first check if $\mu^* = 0$. If $\mu^* = 0$ is true, the slackness condition $\mu^*\big[\sum_{i=1}^{N_T}\theta_i^* - P\big] = 0$ holds and we need to further ensure $\sum_{i=1}^{N_T}\theta_i^* = \sum_{i=1}^{N_T}\max\big\{0, \frac{1}{\mu^* + Z/V} - \frac{1}{\sigma_i}\big\} \le P$. Thus $\mu^* = 0$ if and only if $\sum_{i=1}^{N_T}\max\big\{0, \frac{1}{Z/V} - \frac{1}{\sigma_i}\big\} \le P$. Thus, Algorithm 8.2 checks if $\sum_{i=1}^{N_T}\max\big\{0, \frac{1}{Z/V} - \frac{1}{\sigma_i}\big\} \le P$ holds at the first step. If this is true, then we conclude $\mu^* = 0$ and we are done!

Otherwise, we know $\mu^* > 0$. By the slackness condition $\mu^*\big[\sum_{i=1}^{N_T}\theta_i^* - P\big] = 0$, we must have $\sum_{i=1}^{N_T}\theta_i^* = \sum_{i=1}^{N_T}\max\big\{0, \frac{1}{\mu^* + Z/V} - \frac{1}{\sigma_i}\big\} = P$. To find $\mu^* > 0$ such that $\sum_{i=1}^{N_T}\max\big\{0, \frac{1}{\mu^* + Z/V} - \frac{1}{\sigma_i}\big\} = P$, we could apply a bisection search by noting that all $\theta_i^*$ are decreasing with respect to $\mu^*$.

Another algorithm for finding $\mu^*$ is inspired by the observation that if $\sigma_j \ge \sigma_k$, $\forall j, k \in \{1, 2, \ldots, N_T\}$, then $\theta_j^* \ge \theta_k^*$. Thus, we first sort all $\sigma_i$ in decreasing order, say $\pi$ is the permutation such that $\sigma_{\pi(1)} \ge \sigma_{\pi(2)} \ge \cdots \ge \sigma_{\pi(N_T)}$; and then sequentially check if $i \in \{1, 2, \ldots, N_T\}$ is the index such that $\sigma_{\pi(i)} - \mu^* - \frac{Z}{V} \ge 0$ and $\sigma_{\pi(i+1)} - \mu^* - \frac{Z}{V} \le 0$. To check this, we first assume $i$ is indeed such an index and solve the equation $\sum_{j=1}^{i}\big(\frac{1}{\mu^* + Z/V} - \frac{1}{\sigma_{\pi(j)}}\big) = P$ to obtain $\mu^*$; then verify the assumption by checking if $\frac{1}{\mu^* + Z/V} - \frac{1}{\sigma_{\pi(i)}} \ge 0$ and $\frac{1}{\mu^* + Z/V} - \frac{1}{\sigma_{\pi(i+1)}} \le 0$. (Note that in Algorithm 8.2, to avoid recalculating the partial sum $\sum_{j=1}^{i}\frac{1}{\sigma_{\pi(j)}}$ for each $i$, we introduce the parameter $S_i = \sum_{j=1}^{i}\frac{1}{\sigma_{\pi(j)}}$ and update $S_i$ incrementally. By doing this, the complexity of each iteration in the loop is only $O(1)$.) This algorithm is described in Algorithm 8.2.

8.7.3 Proof of Lemma 8.3

Fact 8.6. For all $X \in \mathbb{S}^n_+$, we have $\|(I + X)^{-1}\|_F \le \sqrt{n}$.

Proof.
Since $X \in \mathbb{S}^n_+$, matrix $X$ has SVD $X = U^H \Sigma U$, where $U$ is unitary and $\Sigma$ is diagonal with non-negative entries $\sigma_1, \ldots, \sigma_n$. Then $Y = (I + X)^{-1} = U^H \operatorname{diag}\big(\frac{1}{1+\sigma_1}, \ldots, \frac{1}{1+\sigma_n}\big) U$ is Hermitian. Thus, $\|(I + X)^{-1}\|_F = \sqrt{\operatorname{tr}(Y^2)} = \sqrt{\sum_{i=1}^{n}\big(\frac{1}{1+\sigma_i}\big)^2} \le \sqrt{n}$.

Fact 8.7. For any $H, \widetilde{H} \in \mathbb{C}^{N_R\times N_T}$ with $\|H\|_F \le B$ and $\|H - \widetilde{H}\|_F \le \delta$, we have $\|H^H H - \widetilde{H}^H \widetilde{H}\|_F \le (2B + \delta)\delta$.

Proof.
\[
\begin{aligned}
\|H^H H - \widetilde{H}^H \widetilde{H}\|_F
&\overset{(a)}{\le} \|H^H H - H^H \widetilde{H}\|_F + \|H^H \widetilde{H} - \widetilde{H}^H \widetilde{H}\|_F\\
&\overset{(b)}{\le} \|H^H\|_F \|H - \widetilde{H}\|_F + \|H^H - \widetilde{H}^H\|_F \|\widetilde{H}\|_F\\
&\overset{(c)}{\le} \|H^H\|_F \|H - \widetilde{H}\|_F + \|H^H - \widetilde{H}^H\|_F \big(\|\widetilde{H} - H\|_F + \|H\|_F\big)\\
&\le 2B\delta + \delta^2
\end{aligned}
\]
where (a) and (c) follow from part 2 of Fact 8.1; and (b) follows from part 3 of Fact 8.1.

Fix $Z(t)$ and $V$. Define $\phi(Q, H) = V\log\det(I + HQH^H) - Z(t)\operatorname{tr}(Q)$ and $\psi(L, T) = V\log\det(I + LTL^H) - Z(t)\operatorname{tr}(L^H L)$.

Fact 8.8. Let $Q \in \mathbb{S}^{N_T}_+$ have Cholesky decomposition $Q = L^H L$. Then, $\phi(Q, H) = V\log\det(I + LTL^H) - Z(t)\operatorname{tr}(L^H L) = \psi(L, T)$ with $T = H^H H$. Moreover, if $L$ is fixed, then $\psi(L, T)$ is concave with respect to $T$ and has gradient $\nabla_T \psi(L, T) = V L^H (I + LTL^H)^{-1} L$.

Proof. Note that
\[
\begin{aligned}
V\log\det(I + HQH^H) - Z(t)\operatorname{tr}(Q)
&= V\log\det(I + HL^H LH^H) - Z(t)\operatorname{tr}(L^H L)\\
&\overset{(a)}{=} V\log\det(I + LH^H HL^H) - Z(t)\operatorname{tr}(L^H L)\\
&\overset{(b)}{=} V\log\det(I + LTL^H) - Z(t)\operatorname{tr}(L^H L) = \psi(L, T)
\end{aligned}
\]
where (a) follows from the elementary identity $\det(I + AB) = \det(I + BA)$ for any $A \in \mathbb{C}^{m\times n}$ and $B \in \mathbb{C}^{n\times m}$; and (b) follows from the definition $T = H^H H$. Note that if $L$ is fixed, then $Z(t)\operatorname{tr}(L^H L)$ is a constant. It follows from Fact 8.3 that $\psi(L, T)$ is concave with respect to $T$ and has gradient $\nabla_T \psi(L, T) = V L^H (I + LTL^H)^{-1} L$.

Let $Q^*(H)$ be an optimal solution to the problem (8.2)-(8.4). Note that $Q^*(H)$ is a mapping from channel states to transmit covariances and $R^{opt} = \mathbb{E}[\log\det(I + HQ^*(H)H^H)]$. To simplify notation, we denote $Q^*(t) = Q^*(H(t))$, i.e., the transmit covariance at slot $t$ selected according to $Q^*(H)$. The next lemma relates the performance of Algorithm 8.1 and $Q^*$ at each slot $t$.

Lemma 8.8. Let $Q(t)$ be yielded by Algorithm 8.1.
At each slot $t$, we have
\[
V\log\det(I + H(t)Q(t)H^H(t)) - Z(t)\operatorname{tr}(Q(t)) \ge V\log\det(I + H(t)Q^*(t)H^H(t)) - Z(t)\operatorname{tr}(Q^*(t)) - 2VP\sqrt{N_T}(2B + \delta)\delta.
\]

Proof. Fix $t > 0$. Let $\widetilde{H}(t) \in \mathbb{C}^{N_R\times N_T}$ be the observed (inaccurate) CSIT satisfying $\|H(t) - \widetilde{H}(t)\|_F \le \delta$. The main proof of this lemma can be decomposed into 3 steps:

• Step 1: Show that $\phi(Q(t), H(t)) \ge \phi(Q(t), \widetilde{H}(t)) - VP\sqrt{N_T}(2B + \delta)\delta$. Let $Q(t) = L^H(t)L(t)$ be a Cholesky decomposition. Define $T(t) = H^H(t)H(t)$ and $\widetilde{T}(t) = \widetilde{H}^H(t)\widetilde{H}(t)$. By Fact 8.8, we have $\psi(L(t), T(t)) = \phi(Q(t), H(t))$ and $\psi(L(t), \widetilde{T}(t)) = \phi(Q(t), \widetilde{H}(t))$; and $\psi$ is concave with respect to $T$. By Fact 8.4, we have
\[
\begin{aligned}
\psi(L(t), T(t))
&\ge \psi(L(t), \widetilde{T}(t)) - \operatorname{tr}\big([\nabla_T \psi(L(t), T(t))]^H (\widetilde{T}(t) - T(t))\big)\\
&\overset{(a)}{\ge} \psi(L(t), \widetilde{T}(t)) - \|\nabla_T \psi(L(t), T(t))\|_F \|\widetilde{T}(t) - T(t)\|_F\\
&\overset{(b)}{\ge} \psi(L(t), \widetilde{T}(t)) - V\|L^H(t)(I + L(t)T(t)L^H(t))^{-1}L(t)\|_F (2B + \delta)\delta\\
&\overset{(c)}{\ge} \psi(L(t), \widetilde{T}(t)) - VP\sqrt{N_T}(2B + \delta)\delta
\end{aligned}
\]
where (a) follows from part 4 in Fact 8.1; (b) follows from $\nabla_T \psi(L(t), T(t)) = V L^H(t)(I + L(t)T(t)L^H(t))^{-1}L(t)$ by Fact 8.8 and $\|\widetilde{T}(t) - T(t)\|_F \le \delta(2B + \delta)$, which is further implied by Fact 8.7; and (c) follows from $\|L^H(t)(I + L(t)T(t)L^H(t))^{-1}L(t)\|_F \le \|L^H(t)\|_F^2 \|(I + L(t)T(t)L^H(t))^{-1}\|_F \le P\sqrt{N_T}$, where the first inequality follows from Fact 8.1 and the second inequality follows from $\|L(t)\|_F = \sqrt{\operatorname{tr}(L^H(t)L(t))} = \sqrt{\operatorname{tr}(Q(t))} \le \sqrt{P}$ and Fact 8.6.

• Step 2: Show that $\phi(Q(t), \widetilde{H}(t)) \ge \phi(Q^*(t), \widetilde{H}(t))$. This step simply follows from the fact that Algorithm 8.1 chooses $Q(t)$ to maximize $\phi(Q, \widetilde{H}(t)) = V\log\det(I + \widetilde{H}(t)Q\widetilde{H}^H(t)) - Z(t)\operatorname{tr}(Q)$ and hence $Q(t)$ should be no worse than $Q^*(t)$.

• Step 3: Show that $\phi(Q^*(t), \widetilde{H}(t)) \ge \phi(Q^*(t), H(t)) - VP\sqrt{N_T}(2B + \delta)\delta$. This step is similar to Step 1. Let $Q^*(t) = M^H(t)M(t)$ be a Cholesky decomposition. Define $T(t) = H^H(t)H(t)$ and $\widetilde{T}(t) = \widetilde{H}^H(t)\widetilde{H}(t)$. By Fact 8.8, we have $\psi(M(t), T(t)) = \phi(Q^*(t), H(t))$ and $\psi(M(t), \widetilde{T}(t)) = \phi(Q^*(t), \widetilde{H}(t))$; and $\psi$ is concave with respect to $T$.
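By Fact 8.4, we have
\[
\begin{aligned}
\psi(M(t), \widetilde{T}(t))
&\ge \psi(M(t), T(t)) - \operatorname{tr}\big([\nabla_T \psi(M(t), \widetilde{T}(t))]^H [T(t) - \widetilde{T}(t)]\big)\\
&\overset{(a)}{\ge} \psi(M(t), T(t)) - \|\nabla_T \psi(M(t), \widetilde{T}(t))\|_F \|T(t) - \widetilde{T}(t)\|_F\\
&\overset{(b)}{\ge} \psi(M(t), T(t)) - V\|M^H(t)(I + M(t)\widetilde{T}(t)M^H(t))^{-1}M(t)\|_F (2B + \delta)\delta\\
&\overset{(c)}{\ge} \psi(M(t), T(t)) - VP\sqrt{N_T}(2B + \delta)\delta
\end{aligned}
\]
where (a) follows from part 4 in Fact 8.1; (b) follows from $\nabla_T \psi(M(t), \widetilde{T}(t)) = V M^H(t)(I + M(t)\widetilde{T}(t)M^H(t))^{-1}M(t)$ by Fact 8.8 and $\|T(t) - \widetilde{T}(t)\|_F \le \delta(2B + \delta)$, which is further implied by Fact 8.7; and (c) follows from $\|M^H(t)(I + M(t)\widetilde{T}(t)M^H(t))^{-1}M(t)\|_F \le \|M^H(t)\|_F^2 \|(I + M(t)\widetilde{T}(t)M^H(t))^{-1}\|_F \le P\sqrt{N_T}$, where the first inequality follows from Fact 8.1 and the second inequality follows from $\|M(t)\|_F = \sqrt{\operatorname{tr}(M^H(t)M(t))} = \sqrt{\operatorname{tr}(Q^*(t))} \le \sqrt{P}$ and Fact 8.6.

Combining the above steps yields $\phi(Q(t), H(t)) \ge \phi(Q^*(t), H(t)) - 2VP\sqrt{N_T}(2B + \delta)\delta$.

Lemma 8.9. At each time $t \in \{0, 1, 2, \ldots\}$, we have
\[
-\Delta(t) \ge -Z(t)\big(\operatorname{tr}(Q(t)) - \bar{P}\big) - \frac{1}{2}\max\{\bar{P}^2, (P - \bar{P})^2\}. \qquad (8.29)
\]

Proof. Fix $t \in \{0, 1, 2, \ldots\}$. Note that $Z(t + 1) = \max\{0, Z(t) + \operatorname{tr}(Q(t)) - \bar{P}\}$ implies that
\[
\begin{aligned}
Z^2(t + 1) &\le \big(Z(t) + \operatorname{tr}(Q(t)) - \bar{P}\big)^2\\
&\le Z^2(t) + 2Z(t)\big(\operatorname{tr}(Q(t)) - \bar{P}\big) + \big(\operatorname{tr}(Q(t)) - \bar{P}\big)^2\\
&\overset{(a)}{\le} Z^2(t) + 2Z(t)\big(\operatorname{tr}(Q(t)) - \bar{P}\big) + \max\{\bar{P}^2, (P - \bar{P})^2\}
\end{aligned}
\]
where (a) follows from $|\operatorname{tr}(Q(t)) - \bar{P}| \le \max\{\bar{P}, P - \bar{P}\}$, which further follows from $0 \le \operatorname{tr}(Q(t)) \le P$. Rearranging terms and dividing by 2 yields the desired result.

Now, we are ready to present the main proof of Lemma 8.3. Adding $V\log\det(I + H(t)Q(t)H^H(t))$ to both sides in (8.29) yields
\[
\begin{aligned}
&-\Delta(t) + V\log\det(I + H(t)Q(t)H^H(t))\\
&\ge V\log\det(I + H(t)Q(t)H^H(t)) - Z(t)\big(\operatorname{tr}(Q(t)) - \bar{P}\big) - \frac{1}{2}\max\{\bar{P}^2, (P - \bar{P})^2\}\\
&\overset{(a)}{\ge} V\log\det(I + H(t)Q^*(t)H^H(t)) - Z(t)\big(\operatorname{tr}(Q^*(t)) - \bar{P}\big) - \frac{1}{2}\max\{\bar{P}^2, (P - \bar{P})^2\} - 2VP\sqrt{N_T}(2B + \delta)\delta
\end{aligned}
\]
where (a) follows from Lemma 8.8.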
Taking expectations on both sides yields
\[
\begin{aligned}
&-\mathbb{E}[\Delta(t)] + V\mathbb{E}[R(t)]\\
&\ge VR^{opt} - \mathbb{E}\big[Z(t)(\operatorname{tr}(Q^*(t)) - \bar{P})\big] - \frac{1}{2}\max\{\bar{P}^2, (P - \bar{P})^2\} - 2VP\sqrt{N_T}(2B + \delta)\delta\\
&\overset{(a)}{=} VR^{opt} - \mathbb{E}\big[\mathbb{E}[Z(t)(\operatorname{tr}(Q^*(t)) - \bar{P}) \mid Z(t)]\big] - \frac{1}{2}\max\{\bar{P}^2, (P - \bar{P})^2\} - 2VP\sqrt{N_T}(2B + \delta)\delta\\
&\overset{(b)}{\ge} VR^{opt} - \frac{1}{2}\max\{\bar{P}^2, (P - \bar{P})^2\} - 2VP\sqrt{N_T}(2B + \delta)\delta
\end{aligned}
\]
where (a) follows by noting that $\mathbb{E}[Z(t)(\operatorname{tr}(Q^*(t)) - \bar{P}) \mid Z(t)]$ is the expectation conditional on $Z(t)$ and applying the law of iterated expectations; and (b) follows from $\mathbb{E}[Z(t)(\operatorname{tr}(Q^*(t)) - \bar{P}) \mid Z(t)] = Z(t)\mathbb{E}[\operatorname{tr}(Q^*(t)) - \bar{P}] \le 0$, where the identity follows because $Q^*(t)$ only depends on $H(t)$ and is independent of $Z(t)$, and the inequality follows because $Z(t) \ge 0$ and $\mathbb{E}[\operatorname{tr}(Q^*(t)) - \bar{P}] \le 0$, $\forall t$.

Rearranging terms and dividing both sides by $V$ yields
\[
-\frac{1}{V}\mathbb{E}[\Delta(t)] + \mathbb{E}[R(t)] \ge R^{opt} - \frac{\max\{\bar{P}^2, (P - \bar{P})^2\}}{2V} - 2P\sqrt{N_T}(2B + \delta)\delta.
\]

8.7.4 Proof of Lemma 8.5

A problem similar to the problem (8.17)-(8.19) (with inequality constraint (8.18) replaced by the equality constraint $\operatorname{tr}(Q) = \bar{P}$) is considered in Lemma 14 in [SPB09]. The problem in [SPB09] is different from (8.17)-(8.19) since inequality constraint (8.18) is not necessarily tight at the optimal solution to (8.17)-(8.19). However, the proof flow of the current lemma is similar to [SPB09]. We shall first reduce the problem (8.17)-(8.19) to a simpler convex program with a real vector variable by characterizing the structure of its optimal solution. After that, we can derive an (almost) closed-form solution to the simpler convex program by studying its KKT conditions. The details of the proof are as follows:

Claim 8.2. If $\widehat{\Theta}$ is an optimal solution to the following convex program:
\[
\begin{aligned}
\min \quad & \frac{1}{2}\|\Theta - \Sigma\|_F^2 && (8.30)\\
\text{s.t.} \quad & \operatorname{tr}(\Theta) \le \bar{P} && (8.31)\\
& \Theta \in \mathbb{S}^{N_T}_+ && (8.32)
\end{aligned}
\]
then $\widehat{Q} = U^H \widehat{\Theta} U$ is an optimal solution to the problem (8.17)-(8.19).

Proof. This claim can be proven by contradiction. Let $\widehat{\Theta}$ be an optimal solution to convex program (8.30)-(8.32) and define $\widehat{Q} = U^H \widehat{\Theta} U$.
Assume that there exists $\widetilde{Q} \in \mathbb{S}^{N_T}_+$ such that $\widetilde{Q} \ne \widehat{Q}$ and $\widetilde{Q}$ is a feasible solution to the problem (8.17)-(8.19) that is strictly better than $\widehat{Q}$. Consider $\widetilde{\Theta} = U\widetilde{Q}U^H$ and reach a contradiction by showing $\widetilde{\Theta}$ is strictly better than $\widehat{\Theta}$ as follows:

Note that $\operatorname{tr}(\widetilde{\Theta}) = \operatorname{tr}(U\widetilde{Q}U^H) = \operatorname{tr}(\widetilde{Q}) \le \bar{P}$, where the last inequality follows from the assumption that $\widetilde{Q}$ is a feasible solution to the problem (8.17)-(8.19). Also note that $\widetilde{\Theta} \in \mathbb{S}^{N_T}_+$ since $\widetilde{Q} \in \mathbb{S}^{N_T}_+$. Thus, $\widetilde{\Theta}$ is feasible to the problem (8.30)-(8.32). Note that
\[
\|\widetilde{\Theta} - \Sigma\|_F \overset{(a)}{=} \|U^H \widetilde{\Theta} U - U^H \Sigma U\|_F \overset{(b)}{=} \|\widetilde{Q} - X\|_F \overset{(c)}{<} \|\widehat{Q} - X\|_F \overset{(d)}{=} \|U\widehat{Q}U^H - UXU^H\|_F \overset{(e)}{=} \|\widehat{\Theta} - \Sigma\|_F,
\]
where (a) and (d) follow from the fact that the Frobenius norm is unitarily invariant (footnote 9); (b) follows from the fact that $\widetilde{\Theta} = U\widetilde{Q}U^H$ and $X = U^H \Sigma U$; (c) follows from the fact that $\widetilde{Q}$ is strictly better than $\widehat{Q}$; and (e) follows from the fact that $\widehat{Q} = U^H \widehat{\Theta} U$ and $X = U^H \Sigma U$. Thus, $\widetilde{\Theta}$ is strictly better than $\widehat{\Theta}$. A contradiction!

Footnote 9: That is, $\|AU\|_F = \|A\|_F$ for all $A \in \mathbb{C}^{n\times n}$ and all unitary matrices $U$.

Claim 8.3. The optimal solution to the problem (8.30)-(8.32) must be a diagonal matrix.

Proof. This claim can be proven by contradiction. Assume that the problem (8.30)-(8.32) has an optimal solution $\widetilde{\Theta}$ that is not diagonal. Since $\widetilde{\Theta}$ is positive semidefinite, all the diagonal entries of $\widetilde{\Theta}$ are non-negative. Define $\widehat{\Theta}$ as the diagonal matrix whose $i$-th diagonal entry is equal to the $i$-th diagonal entry of $\widetilde{\Theta}$ for all $i \in \{1, 2, \ldots, N_T\}$. Note that $\operatorname{tr}(\widehat{\Theta}) = \operatorname{tr}(\widetilde{\Theta}) \le \bar{P}$ and $\widehat{\Theta} \in \mathbb{S}^{N_T}_+$. Thus, $\widehat{\Theta}$ is feasible to the problem (8.30)-(8.32). Note that $\|\widehat{\Theta} - \Sigma\|_F < \|\widetilde{\Theta} - \Sigma\|_F$ since $\Sigma$ is diagonal. Thus, $\widehat{\Theta}$ is a solution strictly better than $\widetilde{\Theta}$. A contradiction! So the optimal solution to the problem (8.30)-(8.32) must be a diagonal matrix.

By the above two claims, it suffices to assume that the optimal solution to the problem (8.17)-(8.19) has the structure $\widehat{Q} = U^H \Theta U$, where $\Theta$ is a diagonal matrix with non-negative entries $\theta_1, \ldots, \theta_{N_T}$.
To solve the problem (8.17)-(8.19), it suffices to consider the following convex program.
\[
\begin{aligned}
\min \quad & \frac{1}{2}\sum_{i=1}^{N_T}(\theta_i - \sigma_i)^2 && (8.33)\\
\text{s.t.} \quad & \sum_{i=1}^{N_T}\theta_i \le \bar{P} && (8.34)\\
& \theta_i \ge 0, \ \forall i \in \{1, 2, \ldots, N_T\} && (8.35)
\end{aligned}
\]
Note that the problem (8.33)-(8.35) satisfies Slater's condition. So the optimal solution to the problem (8.33)-(8.35) is characterized by KKT conditions [BV04].

Introduce Lagrange multipliers $\mu \in \mathbb{R}_+$ for the inequality constraint $\sum_{i=1}^{N_T}\theta_i \le \bar{P}$ and $\nu = [\nu_1, \ldots, \nu_{N_T}]^T \in \mathbb{R}^{N_T}_+$ for the inequality constraints $\theta_i \ge 0$, $i \in \{1, 2, \ldots, N_T\}$. Let $\theta^* = [\theta_1^*, \ldots, \theta_{N_T}^*]^T$ and $(\mu^*, \nu^*)$ be any primal and dual pair with zero duality gap. By the KKT conditions, we have, for all $i \in \{1, 2, \ldots, N_T\}$,
\[
\theta_i^* - \sigma_i + \mu^* - \nu_i^* = 0; \quad \sum_{i=1}^{N_T}\theta_i^* \le \bar{P}; \quad \mu^* \ge 0; \quad \mu^*\Big[\sum_{i=1}^{N_T}\theta_i^* - \bar{P}\Big] = 0; \quad \theta_i^* \ge 0; \quad \nu_i^* \ge 0; \quad \nu_i^*\theta_i^* = 0.
\]
Eliminating $\nu_i^*$, $\forall i \in \{1, 2, \ldots, N_T\}$, in all equations yields
\[
\mu^* \ge \sigma_i - \theta_i^*; \quad \sum_{i=1}^{N_T}\theta_i^* \le \bar{P}; \quad \mu^* \ge 0; \quad \mu^*\Big[\sum_{i=1}^{N_T}\theta_i^* - \bar{P}\Big] = 0; \quad \theta_i^* \ge 0; \quad (\theta_i^* - \sigma_i + \mu^*)\theta_i^* = 0,
\]
for all $i \in \{1, 2, \ldots, N_T\}$.

For all $i \in \{1, 2, \ldots, N_T\}$, we consider $\mu^* < \sigma_i$ and $\mu^* \ge \sigma_i$ separately:

1. If $\mu^* < \sigma_i$, then $\mu^* \ge \sigma_i - \theta_i^*$ holds only when $\theta_i^* > 0$, which by $(\theta_i^* - \sigma_i + \mu^*)\theta_i^* = 0$ implies that $\theta_i^* = \sigma_i - \mu^*$.

2. If $\mu^* \ge \sigma_i$, then $\theta_i^* > 0$ is impossible, because $\theta_i^* > 0$ implies that $\theta_i^* - \sigma_i + \mu^* > 0$, which together with $\theta_i^* > 0$ contradicts the slackness condition $(\theta_i^* - \sigma_i + \mu^*)\theta_i^* = 0$. Thus, if $\mu^* \ge \sigma_i$, we must have $\theta_i^* = 0$.

Summarizing both cases, we have $\theta_i^* = \max\{0, \sigma_i - \mu^*\}$, $\forall i \in \{1, 2, \ldots, N_T\}$, where $\mu^*$ is chosen such that $\sum_{i=1}^{N_T}\theta_i^* \le \bar{P}$, $\mu^* \ge 0$ and $\mu^*\big[\sum_{i=1}^{N_T}\theta_i^* - \bar{P}\big] = 0$.

To find such $\mu^*$, we first check if $\mu^* = 0$. If $\mu^* = 0$ is true, the slackness condition $\mu^*\big[\sum_{i=1}^{N_T}\theta_i^* - \bar{P}\big] = 0$ is guaranteed to hold and we need to further require $\sum_{i=1}^{N_T}\theta_i^* = \sum_{i=1}^{N_T}\max\{0, \sigma_i\} \le \bar{P}$. Thus $\mu^* = 0$ if and only if $\sum_{i=1}^{N_T}\max\{0, \sigma_i\} \le \bar{P}$.
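The thresholding rule $\theta_i^* = \max\{0, \sigma_i - \mu^*\}$ is simple to evaluate once $\mu^*$ is known. The following is a minimal sketch, assuming $\bar{P} > 0$ and real $\sigma_i$ of either sign, that first tests $\mu^* = 0$ and otherwise scans the sorted $\sigma_i$ for the number of active entries. The function name is hypothetical and the code is an illustration of the KKT solution, not a transcription of Algorithm 8.4.

```python
def project_threshold(sigma, P_bar):
    """Solve min 0.5*sum((theta_i - sigma_i)^2) s.t. sum(theta) <= P_bar, theta >= 0.

    Returns (mu, theta) with theta_i = max(0, sigma_i - mu). Assumes P_bar > 0.
    """
    clipped = [max(0.0, s) for s in sigma]
    if sum(clipped) <= P_bar:           # mu* = 0: the budget constraint is slack
        return 0.0, clipped

    ordered = sorted(sigma, reverse=True)
    partial = 0.0                        # running sum of the i largest sigma values
    for i, s in enumerate(ordered, start=1):
        partial += s
        mu = (partial - P_bar) / i       # solves sum_{j<=i} (sigma_j - mu) = P_bar
        next_s = ordered[i] if i < len(ordered) else float("-inf")
        # Accept i if exactly the i largest entries are active at this mu.
        if mu >= 0 and s - mu >= 0 and next_s - mu <= 0:
            return mu, [max(0.0, x - mu) for x in sigma]
    raise ValueError("no valid threshold found; check that P_bar > 0")

# Illustrative numbers: one negative entry and a budget of 2.
mu_star, theta_star = project_threshold([3.0, 1.0, -0.5], P_bar=2.0)
```

Here $\sum_i \max\{0, \sigma_i\} = 4 > 2$, so the budget is tight; the scan accepts a single active entry with $\mu^* = 1$, giving $\theta^* = (2, 0, 0)$.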
Note that Algorithm 8.4 checks if $\sum_{i=1}^{N_T}\max\{0, \sigma_i\} \le \bar{P}$ holds at the first step and if this is true, then we conclude $\mu^* = 0$ and we are done!

Otherwise, we know $\mu^* > 0$. By the slackness condition $\mu^*\big[\sum_{i=1}^{N_T}\theta_i^* - \bar{P}\big] = 0$, we must have $\sum_{i=1}^{N_T}\theta_i^* = \sum_{i=1}^{N_T}\max\{0, \sigma_i - \mu^*\} = \bar{P}$. To find $\mu^* > 0$ such that $\sum_{i=1}^{N_T}\max\{0, \sigma_i - \mu^*\} = \bar{P}$, we could apply a bisection search by noting that all $\theta_i^*$ are decreasing with respect to $\mu^*$.

Another algorithm for finding $\mu^*$ is inspired by the observation that if $\sigma_j \ge \sigma_k$, $\forall j, k \in \{1, 2, \ldots, N_T\}$, then $\theta_j^* \ge \theta_k^*$. Thus, we first sort all $\sigma_i$ in decreasing order, say $\pi$ is the permutation such that $\sigma_{\pi(1)} \ge \sigma_{\pi(2)} \ge \cdots \ge \sigma_{\pi(N_T)}$; and then sequentially check if $i \in \{1, 2, \ldots, N_T\}$ is the index such that $\sigma_{\pi(i)} - \mu^* \ge 0$ and $\sigma_{\pi(i+1)} - \mu^* < 0$. To check this, we first assume $i$ is indeed such an index and solve the equation $\sum_{j=1}^{i}(\sigma_{\pi(j)} - \mu^*) = \bar{P}$ to obtain $\mu^*$; then verify the assumption by checking if $\mu^* \ge 0$, $\sigma_{\pi(i)} - \mu^* \ge 0$ and $\sigma_{\pi(i+1)} - \mu^* \le 0$. (Note that in Algorithm 8.4, to avoid recalculating the partial sum $\sum_{j=1}^{i}\sigma_{\pi(j)}$ for each $i$, we introduce the parameter $S_i = \sum_{j=1}^{i}\sigma_{\pi(j)}$ and update $S_i$ incrementally. By doing this, the complexity of each iteration in the loop is only $O(1)$.) The algorithm is described in Algorithm 8.4 and has complexity $O(N_T \log(N_T))$. The overall complexity is dominated by the step of sorting all $\sigma_i$.

8.7.5 Proof of Lemma 8.6

Proof of Part 1: The boundedness of $D(t-1)$ can be shown as follows.
\[
\|D(t-1)\|_F = \|H^H(t-1)(I_{N_R} + H(t-1)Q(t-1)H^H(t-1))^{-1}H(t-1)\|_F \overset{(a)}{\le} \|H(t-1)\|_F^2 \|(I_{N_R} + H(t-1)Q(t-1)H^H(t-1))^{-1}\|_F \overset{(b)}{\le} \sqrt{N_R}B^2,
\]
where (a) follows from Fact 8.1 and (b) follows from $\|H(t-1)\|_F \le B$ and Fact 8.6.

Proof of Part 2: To simplify the notation, this part uses $H$, $\widetilde{H}$ and $Q$ to represent $H(t-1)$, $\widetilde{H}(t-1)$ and $Q(t-1)$, respectively.
Note that
\[
\begin{aligned}
&\|D(t-1) - \widetilde{D}(t-1)\|_F\\
&= \|H^H(I_{N_R} + HQH^H)^{-1}H - \widetilde{H}^H(I_{N_R} + \widetilde{H}Q\widetilde{H}^H)^{-1}\widetilde{H}\|_F\\
&\le \|H^H(I_{N_R} + HQH^H)^{-1}H - \widetilde{H}^H(I_{N_R} + HQH^H)^{-1}H\|_F\\
&\quad + \|\widetilde{H}^H(I_{N_R} + HQH^H)^{-1}H - \widetilde{H}^H(I_{N_R} + HQH^H)^{-1}\widetilde{H}\|_F\\
&\quad + \|\widetilde{H}^H(I_{N_R} + HQH^H)^{-1}\widetilde{H} - \widetilde{H}^H(I_{N_R} + \widetilde{H}Q\widetilde{H}^H)^{-1}\widetilde{H}\|_F\\
&\le \|(I_{N_R} + HQH^H)^{-1}\|_F \|H\|_F \|H - \widetilde{H}\|_F + \|(I_{N_R} + HQH^H)^{-1}\|_F \|\widetilde{H}\|_F \|H - \widetilde{H}\|_F\\
&\quad + \|\widetilde{H}\|_F^2 \|(I_{N_R} + HQH^H)^{-1} - (I_{N_R} + \widetilde{H}Q\widetilde{H}^H)^{-1}\|_F \qquad (8.36)
\end{aligned}
\]
where both inequalities follow from Fact 8.1.

Since $\|H\|_F \le B$ and $\|\widetilde{H} - H\|_F \le \delta$, by Fact 8.1, we have $\|\widetilde{H}\|_F \le B + \delta$. By Fact 8.6, we have $\|(I_{N_R} + HQH^H)^{-1}\|_F \le \sqrt{N_R}$. The following lemma from [SPB09] will be useful to bound $\|(I_{N_R} + HQH^H)^{-1} - (I_{N_R} + \widetilde{H}Q\widetilde{H}^H)^{-1}\|_F$ from above.

Lemma 8.10 (Lemma 6 in [SPB09]). Let $F : \mathcal{D} \subseteq \mathbb{C}^{m\times n} \to \mathbb{C}^{p\times q}$ be a complex matrix-valued function defined on a convex set $\mathcal{D}$, assumed to be continuous on $\mathcal{D}$ and differentiable on the interior of $\mathcal{D}$, with Jacobian matrix (footnote 10) $D_X F(X)$. Then, for any given $X, Y \in \mathcal{D}$, there exists some $t \in (0, 1)$ such that
\[
\|F(Y) - F(X)\|_F \le \|D_X F(tY + (1-t)X)\operatorname{vec}(Y - X)\|_2 \le \|D_X F(tY + (1-t)X)\|_{2,\mathrm{mat}} \|Y - X\|_F,
\]
where $\|A\|_{2,\mathrm{mat}}$ denotes the spectral norm of $A$, i.e., the largest singular value of $A$.

Footnote 10: The Jacobian matrix is defined as the matrix $D_X F(X)$ such that $d\operatorname{vec}(F(X)) = D_X F(X)\, d\operatorname{vec}(X)$. Note that the size of $D_X F(X)$ is $pq \times mn$.

Lemma 8.10 is essentially a mean value theorem for complex matrix-valued functions. The next corollary is the complex matrix version of the elementary inequality $\big|\frac{1}{1+x} - \frac{1}{1+y}\big| \le |x - y|$, $\forall x, y \ge 0$, and follows directly from Lemma 8.10.

Corollary 8.1. Consider $F : \mathbb{S}^n_+ \to \mathbb{S}^n_+$ defined via $F(X) = (I_n + X)^{-1}$. Then, $\|F(Y) - F(X)\|_F \le n\|Y - X\|_F$, $\forall X, Y \in \mathbb{S}^n_+$.

Proof. By [HG07, SPB09], $dX^{-1} = -X^{-1}(dX)X^{-1}$. Thus, $d(I + X)^{-1} = -(I + X)^{-1}(dX)(I + X)^{-1}$. By the identity $\operatorname{vec}(ABC) = (C^T \otimes A)\operatorname{vec}(B)$, where $\otimes$ denotes the Kronecker product, we have $d\operatorname{vec}(F(X)) = -\big[((I + X)^{-1})^T \otimes (I + X)^{-1}\big]\, d\operatorname{vec}(X)$. Thus, $D_X F(X) = -((I + X)^{-1})^T \otimes (I + X)^{-1}$.
Note that for all $X \in \mathbb{S}^n_+$,
\[
\|-((I + X)^{-1})^T \otimes (I + X)^{-1}\|_{2,\mathrm{mat}} \le \|((I + X)^{-1})^T \otimes (I + X)^{-1}\|_F \overset{(a)}{=} \|((I + X)^{-1})^T\|_F \cdot \|(I + X)^{-1}\|_F = \|(I + X)^{-1}\|_F^2 \overset{(b)}{\le} n,
\]
where (a) follows from the fact that $\|A \otimes B\|_F = \|A\|_F \cdot \|B\|_F$, $\forall A \in \mathbb{C}^{m\times n}$, $B \in \mathbb{C}^{n\times l}$ (see Exercise 28, page 253 in [HJ91]); and (b) follows from Fact 8.6. Applying Lemma 8.10 yields $\|F(Y) - F(X)\|_F \le n\|Y - X\|_F$, $\forall X, Y \in \mathbb{S}^n_+$.

Applying the above corollary yields
\[
\begin{aligned}
\|(I_{N_R} + HQH^H)^{-1} - (I_{N_R} + \widetilde{H}Q\widetilde{H}^H)^{-1}\|_F
&\overset{(a)}{\le} N_R \|HQH^H - \widetilde{H}Q\widetilde{H}^H\|_F\\
&= N_R \|HQH^H - \widetilde{H}QH^H + \widetilde{H}QH^H - \widetilde{H}Q\widetilde{H}^H\|_F\\
&\overset{(b)}{\le} N_R \big(\|HQH^H - \widetilde{H}QH^H\|_F + \|\widetilde{H}QH^H - \widetilde{H}Q\widetilde{H}^H\|_F\big)\\
&\overset{(c)}{\le} N_R \big(\|Q\|_F \|H^H\|_F \|H - \widetilde{H}\|_F + \|\widetilde{H}\|_F \|Q\|_F \|H^H - \widetilde{H}^H\|_F\big)\\
&\overset{(d)}{\le} N_R \bar{P}(2B + \delta)\delta
\end{aligned}
\]
where (a) follows from Corollary 8.1; (b) and (c) follow from Fact 8.1; and (d) follows from the facts that $\|H\|_F \le B$, $\|\widetilde{H} - H\|_F \le \delta$, $\|\widetilde{H}\|_F \le B + \delta$, and $\|Q\|_F \le \operatorname{tr}(Q) \le \bar{P}$, the last of which is implied by Fact 8.2 and $Q \in \widetilde{\mathcal{Q}}$.

Plugging $\|\widetilde{H}\|_F \le B + \delta$, $\|(I_{N_R} + HQH^H)^{-1}\|_F \le \sqrt{N_R}$ and $\|(I_{N_R} + HQH^H)^{-1} - (I_{N_R} + \widetilde{H}Q\widetilde{H}^H)^{-1}\|_F \le N_R \bar{P}(2B + \delta)\delta$ into (8.36) yields
\[
\|D(t-1) - \widetilde{D}(t-1)\|_F \le \sqrt{N_R}B\delta + \sqrt{N_R}(B + \delta)\delta + (B + \delta)^2 N_R \bar{P}(2B + \delta)\delta = \big(\sqrt{N_R}B + \sqrt{N_R}(B + \delta) + (B + \delta)^2 N_R \bar{P}(2B + \delta)\big)\delta.
\]

Proof of Part 3: This part follows from $\|\widetilde{D}(t-1)\|_F \le \|\widetilde{D}(t-1) - D(t-1)\|_F + \|D(t-1)\|_F$.

Chapter 9

Duality Codes and the Integrality Gap Bound for Index Coding

Consider a noiseless wireless system with $N$ receivers, $W$ independent packets of the same size, and a single broadcast station. The broadcast station has all packets. Each receiver has a subset of the packets as side information, but desires another (disjoint) subset of the packets. The broadcast station must deliver the packets to their intended receivers. To this end, it makes a sequence of (possibly coded) transmissions that are overheard by all receivers. The goal is to find a coding scheme with the minimum number of transmissions (clearance time) such that each user is able to decode its demanded packets.
This problem was introduced by Birk and Kol in [BK98, BK06] and is known as the index coding problem. The formulation of the index coding problem is simple, elegant and captures the essence of broadcasting with side information. It also relates directly to multi-hop network coding problems. Specifically, work in [RSG10] shows that an index coding problem can be reduced to a network coding problem. A partial converse of this result is also shown in [RSG10], in that linear versions of network coding can be reduced to linear index coding (see [ERL15] for extended results in this direction). However, the index coding problem still seems to be intractable.

The first index coding problem investigated by Birk and Kol considers only the case of unicast packets and can be represented as a directed side information graph. Work by Bar-Yossef et al. in [BYBJK11] shows that the performance of the best scalar linear code is equal to the graph parameter minrank of the side information graph. However, computing the minrank of a given graph is NP-hard [Pee96]. Further, it is known that restricting to scalar linear codes is generally sub-optimal [ALS+08, LS09].

One branch of research on index coding aims to find tight performance bounds. Work in [BYBJK11] shows that if the index coding problem has an undirected side information graph (such as when it has symmetric demands) then the minrank is lower-bounded by the independence number of the graph, and upper-bounded by the clique cover number. For the unicast index coding problem, work in [BYBJK11] shows that the optimal clearance time (with respect to any scalar, vector or non-linear code) is lower-bounded by the maximum acyclic subgraph of the side information graph. Work in [NTZ13] generalizes this to the multicast/groupcast case using a directed bipartite graph.
It shows that the optimum of the general problem is lower-bounded by the maximum acyclic subgraph induced by deletions of packet vertices, user vertices and packet-to-user arcs. In [BKL10], a sequence of linear programs is proposed to bound the optimal clearance time.

Another branch of research on index coding focuses on studying the performance of specific codes and specific graph structures. Work in [ALS+08] shows that vector linear codes can have strictly better performance compared with scalar linear codes. Work in [LS09] demonstrates that non-linear codes can outperform both scalar and vector linear codes. Instead of finding the minimum clearance time, Chaudhry et al. in [CASL11] consider the problem of maximizing the total number of saved transmissions by exploiting a specific code structure together with graph theory algorithms. Ong et al. in [OH12] find the optimal index code in the single uniprior case, where each user only has a single uniprior packet as side information.

This chapter studies index coding from the perspective of optimization and duality. The results in this chapter were originally developed in our papers [YN13, YN14]. This chapter illustrates the inherent duality between the information theoretic lower bound in [BYBJK11][NTZ13] and the performance of specific codes. Section 9.1 extends the bipartite digraph representation of the problem in [NTZ13] to a weighted bipartite digraph. Section 9.2 uses this new graph structure to develop an integer linear program that finds the maximum acyclic subgraph. Section 9.3 considers the linear programming (LP) relaxation of the integer program, and shows that the dual problem of this relaxation corresponds to a simple form of vector linear codes, called vector cyclic codes. It follows that the information theoretic optimum is bounded by the integrality gap between the integer program and its LP relaxation.
Section 9.4 shows that in the special case when the bipartite digraph is planar, the integrality gap is zero. In this case, optimality is achieved by a scalar cyclic code. Section 9.5 considers a different representation of the original integer program that yields a smaller integrality gap. The dual problem of its LP relaxation leads to a more sophisticated partial clique coding strategy that time-shares between maximum distance separable (MDS) codes. The smaller integrality gap ensures that these codes are closer to the lower bound. These results provide new insight into the index coding problem and suggest that good codes can be found by exploring LP relaxations of the maximum acyclic subgraph problem.

9.1 Weighted Bipartite Digraph

There are $N$ receivers, also called users. Let $\mathcal{U} = \{u_1, \ldots, u_N\}$ be the set of users. Assume there are $W$ total packets, all of the same size, labeled $\{q_1, \ldots, q_W\}$. For each $m \in \{1, \ldots, W\}$, define $\mathcal{S}_m$ as the set of users in $\mathcal{U}$ that already have packet $q_m$ as side information, and define $\mathcal{D}_m$ as the set of users in $\mathcal{U}$ that demand packet $q_m$. Without loss of generality, assume that each packet is demanded by at least one user (else, that packet can be eliminated). Thus, the demand set $\mathcal{D}_m$ is non-empty for all $m \in \{1, \ldots, W\}$. On the other hand, the side information sets $\mathcal{S}_m$ can be empty. Indeed, the set $\mathcal{S}_m$ is empty if and only if no user has packet $q_m$ as side information. It is reasonable to assume that the set of users that demand a packet is disjoint from the set of users that already have that packet as side information, so that $\mathcal{S}_m \cap \mathcal{D}_m = \emptyset$ for all $m \in \{1, \ldots, W\}$.

This index coding problem is represented by a bipartite directed graph in [NTZ13][TDN12], where user vertices are on the left of the graph, packet vertices are on the right, and the $\mathcal{S}_m$ and $\mathcal{D}_m$ sets are represented by directed arcs. A directed graph is also called a digraph.
It is useful to extend this representation to a weighted bipartite digraph as follows: Two packets $q_k$ and $q_m$ are said to have the same type if $\mathcal{S}_k = \mathcal{S}_m$ and $\mathcal{D}_k = \mathcal{D}_m$. That is, two packets have the same type if they have the same side information and demand sets. Types arise naturally when users desire multi-packet files, since packets of the same file typically have the same type. Let $M$ be the number of packet types, and let $\mathcal{P} = \{p_1, \ldots, p_M\}$ be the set of types.

The index coding problem can be represented by a weighted bipartite digraph $\mathcal{G} = (\mathcal{U}, \mathcal{P}, \mathcal{A}, \mathcal{W}_P)$ as follows: Let $\mathcal{U}$ be the set of vertices on the left side of the graph and let $\mathcal{P}$ be the set of vertices on the right side of the graph (see Fig. 9.1). The arc set $\mathcal{A}$ has a user-to-packet arc $(u_n, p_m)$ if and only if user $u_n \in \mathcal{U}$ has all packets of type $p_m$. The arc set $\mathcal{A}$ has a packet-to-user arc $(p_m, u_n)$ if and only if user $u_n \in \mathcal{U}$ demands all packets of type $p_m$. Finally, define $\mathcal{W}_P$ as the set of integer weights associated with packet vertices in $\mathcal{P}$. The weight $w_{p_m} \in \mathcal{W}_P$ of packet vertex $p_m \in \mathcal{P}$ is equal to the number of packets of type $p_m$. Thus, the total number of packets $W$ satisfies $W = \sum_{m=1}^{M} w_{p_m}$.

A packet is said to be a unicast packet if it is demanded by only one user, and is said to be a groupcast packet if it is demanded by two or more users. An index coding problem is said to be unicast if all packets are unicast packets. The index coding problems treated in [BK98][BYBJK11] are unicast problems. The current chapter also focuses exclusively on the unicast case. However, rather than use the graph structure of [BYBJK11], for our purposes it is more efficient to use a weighted bipartite digraph (see footnote 1).

Figure 9.1 shows an example of the weighted bipartite digraph representation for a unicast index coding problem with 3 user vertices and 3 packet types. In this example, packet types $p_1, p_2, p_3$ are demanded by users $u_1$, $u_2$, $u_3$, respectively, so that $\mathcal{D}_1 = \{u_1\}$, $\mathcal{D}_2 = \{u_2\}$, $\mathcal{D}_3 = \{u_3\}$.
Furthermore, the side information sets are as follows:

• Packets of type $p_1$ are contained as side information by users in the set $\mathcal{S}_1 = \{u_2, u_3\}$.
• Packets of type $p_2$ are contained as side information by the user in the set $\mathcal{S}_2 = \{u_3\}$.
• Packets of type $p_3$ are contained as side information by the user in the set $\mathcal{S}_3 = \{u_1\}$.

[Figure 9.1: The bipartite digraph representation of a unicast index coding problem with 3 user vertices ($u_1$, $u_2$, $u_3$) and 3 packet type vertices ($p_1$, $p_2$, $p_3$) with weights $w_1 = 3$, $w_2 = 1$, $w_3 = 2$.]

Footnote 1: The unicast problem can be represented by the graph structure in [BYBJK11] by changing each user that desires more than one packet into multiple virtual users that each want a single packet. This can significantly expand the size of the graph, particularly when users want large multi-packet files. The graph structure in the current chapter does not expand the number of users; this is conceptually simpler and is useful for proving optimality in some cases (see Corollary 9.2).

9.2 Acyclic Subgraph Bound and its LP Relaxation

The following definitions from graph theory are useful. A sequence of vertices $\{s_1, s_2, \ldots, s_K\}$ of a general digraph is defined as a cycle if $(s_i, s_{i+1}) \in \mathcal{A}$ for all $i \in \{1, 2, \ldots, K-1\}$, all vertices in $\{s_1, s_2, \ldots, s_{K-1}\}$ are distinct, and $s_1 = s_K$. A digraph is acyclic if it contains no cycle. A set of vertices is called a feedback vertex set if the removal of vertices in this set leaves an acyclic digraph. In a vertex-weighted digraph, the feedback vertex set with the minimum sum weight is called the minimum feedback vertex set.

For the weighted bipartite digraph $\mathcal{G} = (\mathcal{U}, \mathcal{P}, \mathcal{A}, \mathcal{W}_P)$ (as defined in the previous section), there exists a subset $\mathcal{P}_{fd} \subseteq \mathcal{P}$ such that the removal of vertices in $\mathcal{P}_{fd}$ and all the associated packet-to-user arcs and user-to-packet arcs leaves an acyclic subgraph. In this case, $\mathcal{P}_{fd}$ is called a feedback packet vertex set.
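As a concrete illustration of these definitions, the following minimal sketch builds the arc set of the Figure 9.1 example from its side information and demand sets and tests whether a given set of packet vertices is a feedback packet vertex set. The function names and the string labels for vertices are hypothetical; the acyclicity test repeatedly strips vertices with no incoming arc, which is one standard way to detect cycles.

```python
def build_arcs(side_info, demand):
    """Arc set: user->packet arcs for side information, packet->user arcs for demands."""
    arcs = set()
    for p, users in side_info.items():
        arcs |= {(u, p) for u in users}
    for p, users in demand.items():
        arcs |= {(p, u) for u in users}
    return arcs

def is_acyclic(vertices, arcs):
    """True if the digraph induced on `vertices` has no cycle."""
    vertices = set(vertices)
    arcs = {(a, b) for (a, b) in arcs if a in vertices and b in vertices}
    while vertices:
        targets = {b for (_, b) in arcs}
        sources = vertices - targets          # vertices with no incoming arc
        if not sources:
            return False                      # every vertex has a predecessor: a cycle exists
        vertices -= sources
        arcs = {(a, b) for (a, b) in arcs if a in vertices and b in vertices}
    return True

def is_feedback_packet_set(removed, users, packets, arcs):
    """True if deleting the packet vertices in `removed` leaves an acyclic subgraph."""
    return is_acyclic(set(users) | (set(packets) - set(removed)), arcs)

# The example of Figure 9.1.
users = {"u1", "u2", "u3"}
packets = {"p1", "p2", "p3"}
side_info = {"p1": {"u2", "u3"}, "p2": {"u3"}, "p3": {"u1"}}
demand = {"p1": {"u1"}, "p2": {"u2"}, "p3": {"u3"}}
arcs = build_arcs(side_info, demand)
```

In this example, deleting $p_3$ alone (or $p_1$ alone) breaks every cycle, whereas deleting only $p_2$ leaves the cycle $p_1 \to u_1 \to p_3 \to u_3 \to p_1$ intact.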
A trivial feedback packet vertex set is $\mathcal{P}_{fd} = \mathcal{P}$, and the corresponding acyclic subgraph has no packet vertex. This trivial feedback packet vertex set has weight $W$, since the sum weight of all packet vertices is $W$. It is often possible to find a feedback packet vertex set with sum weight smaller than $W$. The feedback packet vertex set with the minimum sum weight is called the minimum feedback packet vertex set. The acyclic subgraph induced by the deletion of the minimum feedback packet vertex set is called the maximum acyclic subgraph (MAS).

Assume that each transmission from the base station sends a number of bits equal to the number of bits in each of the fixed length packets. It is trivial to satisfy all demands with $W$ transmissions, where each of the $W$ packets is successively transmitted without coding. However, coding can often be used to reduce the number of transmissions. Let $T_{\min}(\mathcal{G})$ represent the minimum number of transmissions required to deliver all packets to their intended users for an index coding problem defined by the weighted bipartite digraph $\mathcal{G}$. The value $T_{\min}(\mathcal{G})$ considers all possible coding strategies. A theorem in [NTZ13] provides an information theoretic lower bound on $T_{\min}(\mathcal{G})$.

Theorem 9.1 (Theorem 1 and Lemma 1 in [NTZ13]). Consider an index coding problem $\mathcal{G} = (\mathcal{U}, \mathcal{P}, \mathcal{A}, \mathcal{W}_P)$. Let $\mathcal{P}_{fd} \subseteq \mathcal{P}$ be a feedback packet vertex set and let $\mathcal{G}'$ be the acyclic subgraph induced by the deletion of $\mathcal{P}_{fd}$. If $\sum_{p_m \in \mathcal{G}'} w_{p_m} = W'$, then $T_{\min}(\mathcal{G}) \ge W'$.

While the above theorem holds for general (possibly groupcast) index coding problems, this chapter uses it in the unicast case. For unicast problems, Theorem 9.1 reduces to an earlier result on acyclic subgraphs in [BYBJK11] after a suitable transformation of the graph structure.

Suppose the largest cycle in digraph $\mathcal{G}$ involves $L$ packet vertices. Define the set of all cycles in $\mathcal{G}$ as $\mathcal{C} = \bigcup_{i=2}^{L} \mathcal{C}_i$, where $\mathcal{C}_i$, $i = 2, \ldots, L$, is the set of all cycles involving $i$ packet vertices.
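The cycle sets $\mathcal{C}_i$ can be illustrated on the example of Figure 9.1. The following is a minimal sketch, not part of the original development: it assumes the arcs are exactly those implied by the demand and side information sets listed for Figure 9.1, and the function names are hypothetical. It enumerates simple directed cycles by depth-first search and groups them by the number of packet vertices they contain.

```python
from collections import defaultdict

# Arcs implied by the text for Figure 9.1 (assumption: no other arcs exist).
# (u, p): user u has packet type p as side information.
# (p, u): user u demands packet type p.
ARCS = [
    ("u2", "p1"), ("u3", "p1"), ("u3", "p2"), ("u1", "p3"),  # user-to-packet
    ("p1", "u1"), ("p2", "u2"), ("p3", "u3"),                # packet-to-user
]


def simple_cycles(arcs):
    """Enumerate simple directed cycles by DFS.

    Each cycle is reported once, rooted at its smallest vertex name.
    This is exponential in general and only meant for tiny graphs.
    """
    succ = defaultdict(list)
    for a, b in arcs:
        succ[a].append(b)
    vertices = sorted({v for arc in arcs for v in arc})
    cycles = []

    def dfs(root, v, path):
        for w in succ[v]:
            if w == root:
                cycles.append(list(path))
            elif w > root and w not in path:
                dfs(root, w, path + [w])

    for root in vertices:
        dfs(root, root, [root])
    return cycles


def cycles_by_packet_count(arcs):
    """Group cycles by i, the number of packet vertices on the cycle."""
    groups = defaultdict(list)
    for cyc in simple_cycles(arcs):
        packets = frozenset(v for v in cyc if v.startswith("p"))
        groups[len(packets)].append(packets)
    return dict(groups)


groups = cycles_by_packet_count(ARCS)
for i in sorted(groups):
    print(i, [sorted(c) for c in groups[i]])
```

Under these assumptions the example has one cycle through two packet vertices and one through three, so $L = 3$ here.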
These cycles can possibly overlap, i.e., some of them can share common vertices. The tightest lower bound provided by Theorem 9.1 is referred to as the maximum acyclic subgraph (MAS) bound and can be formulated as a linear integer program (IP) as below:

Maximum Acyclic Subgraph IP (P1):

$$
\begin{aligned}
\max_{x_m} \quad & \sum_{m=1}^{M} x_m w_{p_m} \\
\text{s.t.} \quad & \sum_{m=1}^{M} x_m \mathbf{1}_{\{p_m \in C_i\}} \le i - 1, \quad \forall C_i \in \mathcal{C}_i,\ i = 2, \ldots, L \\
& x_m \in \{0, 1\}, \quad m = 1, \ldots, M
\end{aligned}
$$

where $x_m \in \{0, 1\}$, $m = 1, \ldots, M$, indicates if packet vertex $p_m$ remains in the acyclic subgraph, objective function $\sum_{m=1}^{M} x_m w_{p_m}$ is the sum weight of the acyclic subgraph, $\mathbf{1}_{\{p_m \in C_i\}}$ is the indicator function which equals one if and only if packet vertex $p_m$ participates in cycle $C_i$, and $\sum_{m=1}^{M} x_m \mathbf{1}_{\{p_m \in C_i\}} \le i - 1$ is the constraint that for each cycle $C_i \in \mathcal{C}_i$, at most $i - 1$ packet vertices remain in the acyclic subgraph. This problem finds the MAS bound formed by packet vertex deletion.

The integer constraints of the above problem can be convexified to form the following linear programming (LP) relaxation:

Maximum Acyclic Subgraph LP (P1′):

$$
\begin{aligned}
\max_{x_m} \quad & \sum_{m=1}^{M} x_m w_{p_m} \\
\text{s.t.} \quad & \sum_{m=1}^{M} x_m \mathbf{1}_{\{p_m \in C_i\}} \le i - 1, \quad \forall C_i \in \mathcal{C}_i,\ i = 2, \ldots, L \\
& 0 \le x_m \le 1, \quad m = 1, \ldots, M
\end{aligned}
$$

The only difference between problem (P1) and its relaxation (P1′) is that the constraints $x_m \in \{0, 1\}$ are changed to $0 \le x_m \le 1$. The relaxed problem (P1′) can be solved with standard linear programming techniques. The number of constraints depends on the number of cycles in the graph. However, the number of cycles in general graphs can grow exponentially with the number of vertices, and so (P1′) can be difficult to solve when the graph is large (see footnote 2).

One might not expect the relaxed problem (P1′) to have a physical meaning. Remarkably, this chapter proves that it does. Indeed, the next section shows that any solution to the relaxed problem leads to a coding strategy. The clearance time of the coding strategy is equal to the optimal objective function value of the relaxed problem.
Hence, this value is an upper bound on $T_{\min}(\mathcal{G})$. This is surprising because the original integer program (P1) provides a lower bound on $T_{\min}(\mathcal{G})$ and does not suggest any particular coding strategy.

Define $\mathrm{val}(\mathrm{P1})$ as the optimal objective function value of problem (P1), being the size of the maximum acyclic subgraph. Theorem 9.1 implies that $\mathrm{val}(\mathrm{P1}) \le T_{\min}(\mathcal{G})$. The optimal objective function value for the relaxation (P1′) can be written as $\mathrm{val}(\mathrm{P1}') = \mathrm{val}(\mathrm{P1}) + \mathrm{gap}(\mathrm{P1}', \mathrm{P1})$, where $\mathrm{gap}(\mathrm{P1}', \mathrm{P1}) = \mathrm{val}(\mathrm{P1}') - \mathrm{val}(\mathrm{P1})$ is the integrality gap between the LP relaxation (P1′) and the integer program (P1). Since the relaxation (P1′) has less restrictive constraints, the value of $\mathrm{gap}(\mathrm{P1}', \mathrm{P1})$ is always non-negative. The next section proves constructively that:

$$
\mathrm{val}(\mathrm{P1}) \le T_{\min}(\mathcal{G}) \le \mathrm{val}(\mathrm{P1}) + \mathrm{gap}(\mathrm{P1}', \mathrm{P1})
$$

Thus, the difference between the minimum clearance time and the maximum acyclic subgraph bound is bounded by the integrality gap $\mathrm{gap}(\mathrm{P1}', \mathrm{P1})$. Furthermore, Section 9.4 shows that $\mathrm{gap}(\mathrm{P1}', \mathrm{P1}) = 0$ in special cases when the digraph $\mathcal{G}$ is planar.

Footnote 2: In fact, a linear program with an exponential number of constraints can still be solved in polynomial time via the ellipsoid method as long as it has an efficient separation oracle [GLS93]. It can be shown that the maximum acyclic subgraph LP (P1′) has an efficient separation oracle and hence can be solved in polynomial time via the ellipsoid method.

9.3 Cyclic Codes and Linear Programming Duality

Inspired by the observation that the lower bound in Theorem 9.1 is closely connected with cycles in graph $\mathcal{G}$, this section considers cyclic codes that exploit cycles in $\mathcal{G}$. It is shown that the problem of finding the optimal cyclic code is the dual problem of the LP relaxation (P1′). Thus, the performance gap between the optimal cyclic code and the optimal index code is ultimately bounded by the integrality gap $\mathrm{gap}(\mathrm{P1}', \mathrm{P1})$.
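As a concrete illustration of the integer program (P1), the following sketch solves it by exhaustive search for the example of Figure 9.1. It is only a toy check, not the method of this chapter: the cycle list is an assumption derived by hand from the arcs described for Figure 9.1, and the helper name `mas_bound` is hypothetical. Exhaustive search is exponential in $M$ and is only sensible for very small instances.

```python
from itertools import product

# Packet weights from Figure 9.1.
WEIGHTS = {"p1": 3, "p2": 1, "p3": 2}
# Packet vertices on each cycle (assumption: derived by hand from the
# side information and demand sets listed for Figure 9.1).
CYCLES = [{"p1", "p3"}, {"p1", "p2", "p3"}]


def mas_bound(weights, cycles):
    """Brute-force (P1): keep a max-weight packet set that breaks every cycle.

    A cycle with i packet vertices may keep at most i - 1 of them.
    """
    packets = sorted(weights)
    best_value, best_keep = -1, None
    for bits in product((0, 1), repeat=len(packets)):
        keep = {p for p, b in zip(packets, bits) if b}
        if all(len(keep & c) <= len(c) - 1 for c in cycles):
            value = sum(weights[p] for p in keep)
            if value > best_value:
                best_value, best_keep = value, keep
    return best_value, best_keep


value, keep = mas_bound(WEIGHTS, CYCLES)
print("val(P1) =", value, "kept packet vertices:", sorted(keep))
```

For this example the search keeps $p_1$ and $p_2$ and deletes $p_3$, so the lower bound of Theorem 9.1 evaluates to 4 out of $W = 6$ packets.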
9.3.1 Cyclic Codes

Suppose there exists a cycle in $\mathcal{G}$ that involves $K$ user vertices $\{u_1, u_2, \ldots, u_K\}$ and $K$ packet vertices $\{p_1, p_2, \ldots, p_K\}$. In this cycle, user $u_1$ has $p_K$ as side information and demands $p_1$, user $u_2$ has $p_1$ as side information and demands $p_2$, user $u_3$ has $p_2$ as side information and demands $p_3$, and so on. If the weight of each packet vertex is identically one, a $K$-cycle coding action can deliver all $K$ packets by transmitting $Z_i = p_i + p_{i+1}$, $i = 1, \ldots, K-1$, with $K - 1$ transmissions, where addition is the mod-2 summation of each bit in both packets. After transmissions, user $u_i \in \{u_2, \ldots, u_K\}$ can decode packet $p_i$ by performing $p_{i-1} + Z_{i-1} = p_{i-1} + (p_{i-1} + p_i) = p_i$. At the same time, user $u_1$ can decode packet $p_1$ by performing:

$$
Z_1 + \cdots + Z_{K-1} + p_K = (p_1 + p_2) + (p_2 + p_3) + \cdots + (p_{K-1} + p_K) + p_K = p_1 .
$$

A linear index code is said to be a cyclic code if it uses a sequence of coding actions that involve only cyclic coding actions and direct broadcasts without coding.

Linear codes can be further categorized into scalar linear codes and vector linear codes according to whether the transmitted message is a linear combination of the original packets or the subpackets obtained by subdivisions. In scalar linear codes, each packet is considered as an element of a finite field and the transmitted message is a linear combination of packets over that field. In vector linear codes, each packet is assumed to be sufficiently large and can be divided into many smaller subpackets and the transmitted message is a linear combination of these subpackets instead of the original packets.

The problem of finding the optimal scalar cyclic code to clear $\mathcal{G}$ can be formulated as an IP as below:

Cyclic Code IP (P2):

$$
\begin{aligned}
\min_{y_{C_i},\, y_m} \quad & \sum_{i=2}^{L} \sum_{C_i \in \mathcal{C}_i} y_{C_i} (i - 1) + \sum_{m=1}^{M} y_m \\
\text{s.t.} \quad & y_m + \sum_{i=2}^{L} \sum_{C_i \in \mathcal{C}_i} y_{C_i} \mathbf{1}_{\{p_m \in C_i\}} \ge w_{p_m}, \quad m = 1, \ldots, M \\
& y_{C_i} \text{ non-negative integer}, \quad \forall C_i \in \mathcal{C}_i,\ i = 2, \ldots, L \\
& y_m \text{ non-negative integer}, \quad m = 1, \ldots, M
\end{aligned}
$$

where $y_{C_i}$ is the number of cyclic coding actions over each cycle $C_i \in \mathcal{C}_i$, $i = 2, \ldots, L$, $y_m$ is the number of direct broadcasts over each packet vertex $p_m$, $m = 1, \ldots, M$, objective function $\sum_{i=2}^{L} \sum_{C_i \in \mathcal{C}_i} y_{C_i}(i-1) + \sum_{m=1}^{M} y_m$ is the total number of transmissions, and $y_m + \sum_{i=2}^{L} \sum_{C_i \in \mathcal{C}_i} y_{C_i} \mathbf{1}_{\{p_m \in C_i\}} \ge w_{p_m}$ is the constraint that all the $w_{p_m}$ packets represented by packet vertex $p_m$ are cleared by either cyclic codes or direct broadcasts.

The LP relaxation of the cyclic code IP (P2) is as below:

Cyclic Code LP (P2′):

$$
\begin{aligned}
\min_{y_{C_i},\, y_m} \quad & \sum_{i=2}^{L} \sum_{C_i \in \mathcal{C}_i} y_{C_i} (i - 1) + \sum_{m=1}^{M} y_m \\
\text{s.t.} \quad & y_m + \sum_{i=2}^{L} \sum_{C_i \in \mathcal{C}_i} y_{C_i} \mathbf{1}_{\{p_m \in C_i\}} \ge w_{p_m}, \quad m = 1, \ldots, M \\
& y_{C_i} \ge 0, \quad \forall C_i \in \mathcal{C}_i,\ i = 2, \ldots, L \\
& y_m \ge 0, \quad m = 1, \ldots, M
\end{aligned}
$$

The only difference between the above problem and the cyclic code IP (P2) is that the constraints that $y_{C_i}$ and $y_m$ are non-negative integers are replaced by the relaxed constraints that $y_{C_i} \ge 0$ and $y_m \ge 0$. This gives rise to the optimal vector cyclic code. The optimal vector cyclic code can be viewed as a scheme for time-sharing of cyclic coding actions over overlapping cycles. With this interpretation, $y_{C_i}$ is proportional to the fraction of time used for cyclic coding actions over cycle $C_i$.

Since all the coefficients in the linear constraints of the cyclic code LP (P2′) are integers, an optimal solution can be found that has all variables equal to rational numbers. Let an optimal solution of cyclic code LP (P2′) be $y^*_{C_i}$, $\forall C_i \in \mathcal{C}_i$, $i = 2, \ldots, L$; $y^*_m$, $m = 1, \ldots, M$, and assume these values are all rational numbers. The optimal vector cyclic code can be constructed as follows. First, one can find an integer $\theta$ such that $\theta y^*_{C_i}$, $\forall C_i \in \mathcal{C}_i$, $i = 2, \ldots, L$; $\theta y^*_m$, $m = 1, \ldots, M$, are all integers. Next, divide each packet into $\theta$ subpackets.
After the subdivision, a single cyclic coding action over a cycle $C_i$ is no longer a linear combination of packets but a linear combination of subpackets. Further, a single (uncoded) direct broadcast from a packet vertex $p_m$ is no longer the broadcast of one packet but one subpacket. Then, the optimal vector cyclic code performs $\theta y^*_{C_i}$ cyclic coding actions over each cycle $C_i$, $\forall C_i \in \mathcal{C}_i$, $i = 2, \ldots, L$, and broadcasts $\theta y^*_m$ subpackets over each packet vertex $p_m$, $m = 1, \ldots, M$.

To apply the above vector cyclic code, the number of bits in each packet must be an integer multiple of $\theta$. This is a reasonable assumption when the packet size is large. Indeed, if the original packet size is $B$, each packet can be expanded to have size $\tilde{B} = B + r_B$, where $\tilde{B}$ is the smallest multiple of $\theta$ that is greater than or equal to $B$, and $r_B \in \{0, 1, \ldots, \theta - 1\}$. The expansion ratio is $(B + r_B)/B$, which converges to 1 as $B \to \infty$.

Define $\mathrm{gap}(\mathrm{P2}, \mathrm{P2}')$ as the integrality gap between the cyclic code IP (P2) and its LP relaxation (P2′). Since the relaxation (P2′) has less restrictive constraints, the value of $\mathrm{gap}(\mathrm{P2}, \mathrm{P2}')$ is always non-negative. Let $\mathrm{val}(\mathrm{P2})$ and $\mathrm{val}(\mathrm{P2}')$ be the optimal objective function values for problems (P2) and (P2′), respectively. Thus, $\mathrm{val}(\mathrm{P2})$ and $\mathrm{val}(\mathrm{P2}')$ are the clearance times attained by the optimal scalar cyclic code and vector cyclic code, respectively, and:

$$
\mathrm{val}(\mathrm{P2}) = \mathrm{val}(\mathrm{P2}') + \mathrm{gap}(\mathrm{P2}, \mathrm{P2}') \tag{9.1}
$$

9.3.2 Duality Between Information Theoretical Lower Bounds and Cyclic Codes

The duality between the maximum acyclic subgraph lower bound given by Theorem 9.1 and the optimal cyclic code is formally stated in the following lemma.

Lemma 9.1. The maximum acyclic subgraph LP (P1′) and the cyclic code LP (P2′) form a primal-dual linear programming pair. In particular, the vector cyclic code associated with problem (P2′) achieves a clearance time $\mathrm{val}(\mathrm{P2}')$ that satisfies:

$$
\mathrm{val}(\mathrm{P2}') = \mathrm{val}(\mathrm{P1}) + \mathrm{gap}(\mathrm{P1}', \mathrm{P1}) \tag{9.2}
$$

Proof.
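The Lagrangian function of the cyclic code LP (P2′) can be written as

$$
\begin{aligned}
L(y_{C_i}, y_m, \lambda_m, \mu_{C_i}, \mu_m)
&= \sum_{i=2}^{L} \sum_{C_i \in \mathcal{C}_i} y_{C_i} (i - 1) + \sum_{m=1}^{M} y_m
 + \sum_{m=1}^{M} \lambda_m \Big[ w_{p_m} - y_m - \sum_{i=2}^{L} \sum_{C_i \in \mathcal{C}_i} y_{C_i} \mathbf{1}_{\{p_m \in C_i\}} \Big] \\
&\quad - \sum_{i=2}^{L} \sum_{C_i \in \mathcal{C}_i} \mu_{C_i} y_{C_i} - \sum_{m=1}^{M} \mu_m y_m \\
&= \sum_{m=1}^{M} \lambda_m w_{p_m} + \sum_{m=1}^{M} y_m \big[ 1 - \lambda_m - \mu_m \big]
 + \sum_{i=2}^{L} \sum_{C_i \in \mathcal{C}_i} y_{C_i} \Big[ (i - 1) - \sum_{m=1}^{M} \lambda_m \mathbf{1}_{\{p_m \in C_i\}} - \mu_{C_i} \Big]
\end{aligned}
$$

where $\lambda_m \ge 0$, $m = 1, \ldots, M$; $\mu_{C_i} \ge 0$, $\forall C_i \in \mathcal{C}_i$, $i = 2, \ldots, L$; and $\mu_m \ge 0$, $m = 1, \ldots, M$. The dual problem of (P2′) is defined as:

$$
\max_{\lambda_m \ge 0,\ \mu_{C_i} \ge 0,\ \mu_m \ge 0} \ \min_{y_{C_i} \in \mathbb{R},\ y_m \in \mathbb{R}} \ L(y_{C_i}, y_m, \lambda_m, \mu_{C_i}, \mu_m)
$$

Note that

$$
\min_{y_{C_i} \in \mathbb{R},\ y_m \in \mathbb{R}} L(y_{C_i}, y_m, \lambda_m, \mu_{C_i}, \mu_m) =
\begin{cases}
\sum_{m=1}^{M} \lambda_m w_{p_m} & \text{if } (i - 1) - \sum_{m=1}^{M} \lambda_m \mathbf{1}_{\{p_m \in C_i\}} - \mu_{C_i} = 0,\ \forall C_i \in \mathcal{C}_i,\ i = 2, \ldots, L, \\
 & \text{and } 1 - \lambda_m - \mu_m = 0,\ m = 1, \ldots, M \\
-\infty & \text{otherwise}
\end{cases}
$$

Then, the dual problem of (P2′) can be written as

$$
\begin{aligned}
\max_{\lambda_m,\, \mu_{C_i},\, \mu_m} \quad & \sum_{m=1}^{M} \lambda_m w_{p_m} \\
\text{s.t.} \quad & (i - 1) - \sum_{m=1}^{M} \lambda_m \mathbf{1}_{\{p_m \in C_i\}} - \mu_{C_i} = 0, \quad \forall C_i \in \mathcal{C}_i,\ i = 2, \ldots, L \\
& 1 - \lambda_m - \mu_m = 0, \quad m = 1, \ldots, M \\
& \lambda_m \ge 0, \quad m = 1, \ldots, M \\
& \mu_{C_i} \ge 0, \quad \forall C_i \in \mathcal{C}_i,\ i = 2, \ldots, L \\
& \mu_m \ge 0, \quad m = 1, \ldots, M
\end{aligned}
$$

Eliminating variables $\mu_{C_i}$, $\forall C_i \in \mathcal{C}_i$, $i = 2, \ldots, L$, and $\mu_m$, $m = 1, \ldots, M$, we obtain

$$
\begin{aligned}
\max_{\lambda_m} \quad & \sum_{m=1}^{M} \lambda_m w_{p_m} \\
\text{s.t.} \quad & \sum_{m=1}^{M} \lambda_m \mathbf{1}_{\{p_m \in C_i\}} \le i - 1, \quad \forall C_i \in \mathcal{C}_i,\ i = 2, \ldots, L \\
& 0 \le \lambda_m \le 1, \quad m = 1, \ldots, M
\end{aligned}
$$

The above problem is the same as (P1′). Thus, the clearance time of the vector cyclic code associated with problem (P2′) is equal to the value of the optimal objective function in problem (P1′), which is $\mathrm{val}(\mathrm{P1}) + \mathrm{gap}(\mathrm{P1}', \mathrm{P1})$.

Thus far, we have proven the following lower and upper bound for the minimum clearance time of an index coding problem:

$$
\mathrm{val}(\mathrm{P1}) \le T_{\min}(\mathcal{G}) \le \mathrm{val}(\mathrm{P1}) + \mathrm{gap}(\mathrm{P1}', \mathrm{P1}) \tag{9.3}
$$

where the first inequality follows from Theorem 9.1 and the second inequality follows from Lemma 9.1. Hence, the performance gap between the optimal index code and the optimal vector cyclic code is ultimately bounded by the integrality gap between the maximum acyclic subgraph IP (P1) and its LP relaxation (P1′).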
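The primal-dual relationship of Lemma 9.1 can be checked numerically on the example of Figure 9.1 without an LP solver. The sketch below is illustrative only: the cycle list is an assumption derived by hand from the arcs described for Figure 9.1, the candidate solutions were chosen by inspection, and the function names are hypothetical. It verifies that one point is feasible for (P1′), another is feasible for (P2′), and that their objective values coincide; by weak duality both are then optimal.

```python
from fractions import Fraction

WEIGHTS = {"p1": 3, "p2": 1, "p3": 2}
# Packet vertices on each cycle (assumption, derived by hand for Figure 9.1).
CYCLES = {"Ca": {"p1", "p3"}, "Cb": {"p1", "p2", "p3"}}


def mas_lp_value(x):
    """Objective of (P1') at x, after checking feasibility."""
    if not all(0 <= x[p] <= 1 for p in WEIGHTS):
        raise ValueError("x must satisfy 0 <= x_m <= 1")
    for packets in CYCLES.values():
        if sum(x[p] for p in packets) > len(packets) - 1:
            raise ValueError("cycle constraint violated")
    return sum(x[p] * WEIGHTS[p] for p in WEIGHTS)


def cyclic_lp_value(y_cycle, y_direct):
    """Objective of (P2') at (y_cycle, y_direct), after checking feasibility."""
    if any(v < 0 for v in y_cycle.values()) or any(v < 0 for v in y_direct.values()):
        raise ValueError("variables must be non-negative")
    for p, w in WEIGHTS.items():
        covered = y_direct[p] + sum(y for c, y in y_cycle.items() if p in CYCLES[c])
        if covered < w:
            raise ValueError("packet vertex %s is not cleared" % p)
    coded = sum(y * (len(CYCLES[c]) - 1) for c, y in y_cycle.items())
    return coded + sum(y_direct.values())


# Candidate solutions chosen by inspection (not produced by a solver).
x = {"p1": Fraction(1), "p2": Fraction(1), "p3": Fraction(0)}
y_cycle = {"Ca": Fraction(2), "Cb": Fraction(0)}
y_direct = {"p1": Fraction(1), "p2": Fraction(1), "p3": Fraction(0)}

lower = mas_lp_value(x)                     # feasible for (P1')
upper = cyclic_lp_value(y_cycle, y_direct)  # feasible for (P2')
print(lower, upper)
```

Since the two values agree, the integrality gap is zero for this small instance, which is consistent with the graph of Figure 9.1 being planar.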
There are various techniques for bounding the integrality gaps of integer linear programs, such as the random rounding methods in [RT87, Rag88]. Rather than explore this direction, the next section provides a special case where the gap is equal to zero. This is motivated as follows. Adding the non-negative value $\mathrm{gap}(\mathrm{P2}, \mathrm{P2}')$ to the right-hand-side of (9.3) gives:

$$
\mathrm{val}(\mathrm{P1}) \le T_{\min}(\mathcal{G}) \le \mathrm{val}(\mathrm{P1}) + \mathrm{gap}(\mathrm{P1}', \mathrm{P1}) + \mathrm{gap}(\mathrm{P2}, \mathrm{P2}') = \mathrm{val}(\mathrm{P2})
$$

where the final equality uses (9.1)-(9.2). In the special case when $\mathrm{val}(\mathrm{P1}) = \mathrm{val}(\mathrm{P2})$, one has $\mathrm{gap}(\mathrm{P1}', \mathrm{P1}) = \mathrm{gap}(\mathrm{P2}, \mathrm{P2}') = 0$ and $\mathrm{val}(\mathrm{P2}) = T_{\min}(\mathcal{G})$, so that the scalar cyclic code given by the cyclic code IP (P2) is an optimal index code.

9.4 Optimality of Cyclic Codes in Planar Bipartite Graphs

In graph theory, a planar graph is a graph that can be drawn on a 2-dimensional plane so that no two arcs meet at a point other than a common vertex. The main result in this section is the following theorem:

Theorem 9.2. If the bipartite digraph $\mathcal{G}$ for a (unicast) index coding problem is planar, then $\mathrm{val}(\mathrm{P1}) = \mathrm{val}(\mathrm{P2})$, i.e., $\mathrm{gap}(\mathrm{P1}', \mathrm{P1}) = 0$ and $\mathrm{gap}(\mathrm{P2}, \mathrm{P2}') = 0$. Hence, the (scalar) cyclic code given by the cyclic code IP (P2) is an optimal index code.

The proof of Theorem 9.2 relies on the cycle-packing and feedback arc set duality in arc-weighted planar graphs, which is summarized in the following theorem.

Theorem 9.3 (Theorem 2.1 in [GT11b], originally proven in [LY78]). Let $\mathcal{G} = (\mathcal{V}, \mathcal{A}, \mathcal{W}_A)$ be an arc-weighted planar digraph where $\mathcal{V}$ is the set of vertices, $\mathcal{A}$ is the set of arcs and $\mathcal{W}_A$ is an integer arc weight assignment which assigns each arc $a \in \mathcal{A}$ a non-negative integer weight $w_a \in \mathbb{Z}_+$. Let $\mathcal{C}$ be the set of cycles in $\mathcal{G}$. Then we have

$$
\min \Big\{ \sum_{a \in \mathcal{A}} x_a w_a : \sum_{a \in \mathcal{A}} x_a \mathbf{1}_{\{a \in C\}} \ge 1,\ \forall C \in \mathcal{C};\ x_a \in \{0, 1\},\ \forall a \in \mathcal{A} \Big\}
= \max \Big\{ \sum_{C \in \mathcal{C}} y_C : \sum_{C \in \mathcal{C}} y_C \mathbf{1}_{\{a \in C\}} \le w_a,\ \forall a \in \mathcal{A};\ y_C \in \mathbb{Z}_+,\ \forall C \in \mathcal{C} \Big\}. \tag{9.4}
$$

The integer program on the left-hand-side of (9.4) is a minimum feedback arc set problem, while the integer program on the right-hand-side of (9.4) is a cycle packing problem. Both problems are associated with arc-weighted digraphs. However, our graph is vertex-weighted rather than arc-weighted. To apply this theorem, we modify the bipartite digraph $\mathcal{G}$ to produce an arc-weighted digraph $\mathcal{G}^s$, which is planar if and only if $\mathcal{G}$ is planar. We then show that the minimum feedback packet vertex set problem and the cycle packing problem in $\mathcal{G}$ can be reduced to the minimum feedback arc set problem and the cycle packing problem in $\mathcal{G}^s$, respectively. The following subsections develop the proof of Theorem 9.2 and provide some additional consequences.

9.4.1 Complementary Problems

The maximum acyclic subgraph IP (P1) finds the packet weighted maximum acyclic subgraph. This is equivalent to finding the minimum feedback packet vertex set. Indeed, this is the set of packets whose deletion induces the packet weighted maximum acyclic subgraph. Thus, an equivalent problem to the maximum acyclic subgraph IP (P1) is:

(P3)

$$
\begin{aligned}
\min_{x_m} \quad & \sum_{m=1}^{M} x_m w_{p_m} \\
\text{s.t.} \quad & \sum_{m=1}^{M} x_m \mathbf{1}_{\{p_m \in C_i\}} \ge 1, \quad \forall C_i \in \mathcal{C}_i,\ i = 2, \ldots, L \\
& x_m \in \{0, 1\}, \quad m = 1, \ldots, M
\end{aligned}
$$

where $x_m \in \{0, 1\}$, $m = 1, \ldots, M$, indicates if packet vertex $p_m$ is selected into the feedback packet vertex set, objective function $\sum_{m=1}^{M} x_m w_{p_m}$ is the sum weight of the feedback packet vertex set, $\mathbf{1}_{\{p_m \in C_i\}}$ is the indicator function which equals one if and only if packet vertex $p_m$ participates in cycle $C_i$, and $\sum_{m=1}^{M} x_m \mathbf{1}_{\{p_m \in C_i\}} \ge 1$ is the constraint that at least one packet vertex in each cycle is selected into the feedback packet vertex set. If $x^*_m$, $m = 1, \ldots, M$, is the optimal solution of (P3) and attains the optimal value $W'$, then the complementary vector $\bar{x}_m = 1 - x^*_m$, $m = 1, \ldots, M$, is the optimal solution of (P1) and attains the optimal value $W - W'$.

Now consider the integer program related to cyclic coding.
It is now useful to write the complementary problem to the cyclic code IP (P2). In [CASL11], Chaudhry et al. introduced the concept of complementary index coding problems. Instead of trying to find the minimum number of transmissions to clear the problem, the complementary index coding problem is formulated to maximize the number of saved transmissions by exploiting a specific code structure. Recall that any $K$-cycle code can deliver $K$ packets in $K - 1$ transmissions and hence one transmission is saved in each $K$-cycle code. If the weight of each packet is not identically one, then $K$-cycle coding actions can be performed $w_{\min} = \min\{w_{p_1}, \ldots, w_{p_K}\}$ times on the same cycle. By performing $K$-cycle coding actions $w_{\min}$ times and then directly broadcasting the remaining packets (uncoded), the base station can deliver $w_{\text{total}} = \sum_{k=1}^{K} w_{p_k}$ packets with $w_{\text{total}} - w_{\min}$ transmissions. Thus, $w_{\min}$ transmissions are saved.

The complementary index coding problem which aims to maximize the number of saved transmissions by exploiting scalar cycles in $\mathcal{G}$ is formulated as a linear integer program below:

(P4)

$$
\begin{aligned}
\max_{y_{C_i}} \quad & \sum_{i=2}^{L} \sum_{C_i \in \mathcal{C}_i} y_{C_i} \\
\text{s.t.} \quad & \sum_{i=2}^{L} \sum_{C_i \in \mathcal{C}_i} y_{C_i} \mathbf{1}_{\{p_m \in C_i\}} \le w_{p_m}, \quad m = 1, \ldots, M \\
& y_{C_i} \text{ non-negative integer}, \quad \forall C_i \in \mathcal{C}_i,\ i = 2, \ldots, L
\end{aligned}
$$

where $y_{C_i}$ is the number of cyclic coding actions over each cycle $C_i \in \mathcal{C}_i$, $i = 2, \ldots, L$, objective function $\sum_{i=2}^{L} \sum_{C_i \in \mathcal{C}_i} y_{C_i}$ is the total number of cyclic coding actions, i.e., the total number of saved transmissions, and $\sum_{i=2}^{L} \sum_{C_i \in \mathcal{C}_i} y_{C_i} \mathbf{1}_{\{p_m \in C_i\}} \le w_{p_m}$ is the constraint that each packet vertex $p_m$ can participate in at most $w_{p_m}$ cyclic coding actions. This is important because if packet vertex $p_m$ has already participated $w_{p_m}$ times in cyclic coding actions, then all of its packets have been delivered and new cyclic coding actions that involve this packet vertex can no longer save any transmissions.
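The $K$-cycle coding action recalled above, and the resulting count of saved transmissions on a weighted cycle, can be sketched as follows. This is a minimal illustration assuming equal-length packets represented as Python `bytes`; all function names are hypothetical. Indices are 0-based, so user `i` holds packet `i - 1` as side information and user `0` holds the last packet.

```python
from functools import reduce


def xor_bytes(a, b):
    """Bitwise mod-2 sum of two equal-length packets."""
    return bytes(x ^ y for x, y in zip(a, b))


def cycle_encode(packets):
    """K-cycle coding action: K - 1 transmissions Z_i = p_i + p_{i+1}."""
    return [xor_bytes(packets[i], packets[i + 1]) for i in range(len(packets) - 1)]


def cycle_decode(user, side_info, transmissions):
    """Recover the packet demanded by `user` (0-based position on the cycle).

    User 0 holds the last packet and XORs it with every transmission;
    user i >= 1 holds packet i - 1 and XORs it with transmission i - 1.
    """
    if user == 0:
        return reduce(xor_bytes, transmissions, side_info)
    return xor_bytes(side_info, transmissions[user - 1])


def transmissions_on_cycle(weights):
    """Coded rounds run w_min times; the rest is broadcast uncoded."""
    return sum(weights) - min(weights)


packets = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]
sent = cycle_encode(packets)
decoded = [cycle_decode(i, packets[i - 1], sent) for i in range(len(packets))]
print(len(sent), decoded == packets)
```

For instance, a cycle whose two packet vertices have weights 3 and 2 delivers 5 packets with `transmissions_on_cycle([3, 2])` transmissions, i.e., $w_{\text{total}} - w_{\min}$.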
If the optimal solution of (P4) is $y^*_{C_i}$, $\forall C_i \in \mathcal{C}_i$, $i = 2, \ldots, L$, and attains the optimal value $W'$, then an optimal solution of the cyclic code IP (P2) is $\hat{y}_{C_i} = y^*_{C_i}$, $\forall C_i \in \mathcal{C}_i$, $i = 2, \ldots, L$, together with $\hat{y}_m = w_{p_m} - \sum_{i=2}^{L} \sum_{C_i \in \mathcal{C}_i} y^*_{C_i} \mathbf{1}_{\{p_m \in C_i\}}$, $m = 1, \ldots, M$, and it attains the optimal value $W - W'$.

9.4.2 Packet Split Digraph

Definition 9.1 (Packet Split Digraph). Given a graph $\mathcal{G} = (\mathcal{U}, \mathcal{P}, \mathcal{A}, \mathcal{W}_P)$, we construct the corresponding packet split digraph $\mathcal{G}^s = (\mathcal{V}^s, \mathcal{A}^s, \mathcal{W}^s)$ as follows:
1. For each packet vertex $p_m \in \mathcal{P}$, $m = 1, \ldots, M$, we create two packet vertices $p^{in}_m$ and $p^{out}_m$. Let $\mathcal{V}^s = \mathcal{U} \cup \{p^{in}_1, p^{out}_1, p^{in}_2, p^{out}_2, \ldots, p^{in}_M, p^{out}_M\}$.
2. For each packet vertex $p_m \in \mathcal{P}$, $m = 1, \ldots, M$, we create a packet-to-packet arc $(p^{in}_m, p^{out}_m)$ in $\mathcal{A}^s$. For each arc $(u_n, p_m) \in \mathcal{A}$, we create a user-to-packet arc $(u_n, p^{in}_m)$ in $\mathcal{A}^s$. For each arc $(p_m, u_n) \in \mathcal{A}$, we create a packet-to-user arc $(p^{out}_m, u_n)$ in $\mathcal{A}^s$.
3. For each arc $(p^{in}_m, p^{out}_m)$ in $\mathcal{A}^s$, we assign a weight which is equal to $w_{p_m} \in \mathcal{W}_P$. For each arc $(u_n, p^{in}_m)$ or $(p^{out}_m, u_n)$ in $\mathcal{A}^s$, we assign an integer weight which is larger than $\sum_{m=1}^{M} w_{p_m}$.

For any bipartite digraph $\mathcal{G}$, the packet split digraph $\mathcal{G}^s$, which is an arc-weighted digraph, can always be constructed. Figure 9.2 shows the packet split digraph constructed from the bipartite digraph in Figure 9.1.

[Figure 9.2: The packet split digraph constructed from the bipartite digraph given in Figure 9.1. Arc weights shown in the figure: 3, 1, 2 on the packet-to-packet arcs and 7 on every other arc.]

In any digraph, a set of arcs is called a feedback arc set if the removal of arcs in this set leaves an acyclic digraph. If the digraph is arc-weighted, the feedback arc set with the minimum sum weight is called the minimum feedback arc set. The following facts summarize the connections between the packet split digraph and the original digraph.

Fact 9.1. There is a bijection between $\mathcal{G}$ and $\mathcal{G}^s$.
This bijection maps user vertices, user-to-packet arcs, packet vertices, and packet-to-user arcs in $\mathcal{G}$ to user vertices, user-to-packet arcs, packet-to-packet arcs, and packet-to-user arcs in $\mathcal{G}^s$, respectively. Thus, this bijection also maps cycles in $\mathcal{G}$ to cycles in $\mathcal{G}^s$.

Proof. The bijection can be easily identified according to the construction rule of the packet split digraph.

Fact 9.2. Every minimum feedback arc set of the packet split digraph $\mathcal{G}^s$ contains only packet-to-packet arcs and no packet-to-user arcs or user-to-packet arcs.

Proof. In digraph $\mathcal{G}$, each cycle contains at least one packet vertex. By Fact 9.1, each cycle in $\mathcal{G}^s$ contains at least one packet-to-packet arc. As such, the arc set composed of all packet-to-packet arcs is a feedback arc set of $\mathcal{G}^s$ and this feedback arc set contains no packet-to-user arcs or user-to-packet arcs. Note that the sum weight of this arc set is strictly less than the weight of any single packet-to-user or user-to-packet arc. Any feedback arc set with a packet-to-user arc or user-to-packet arc has a sum weight strictly larger than that of this one and hence cannot be a minimum feedback arc set.

Fact 9.3. If $\mathcal{A}^s_{fd} \subseteq \mathcal{A}^s$ is a minimum feedback arc set of the packet split digraph $\mathcal{G}^s$, then a minimum feedback packet vertex set $\mathcal{P}_{fd} \subseteq \mathcal{P}$ of $\mathcal{G}$ is immediate. In addition, the sum weight of $\mathcal{P}_{fd}$ is equal to the sum weight of $\mathcal{A}^s_{fd}$.

Proof. Let $\mathcal{A}^s_{fd}$ be a minimum feedback arc set of $\mathcal{G}^s$ and let the sum weight of $\mathcal{A}^s_{fd}$ be $W_{fd}$. By Fact 9.2, $\mathcal{A}^s_{fd}$ contains only packet-to-packet arcs. By Fact 9.1, the packet vertex set $\mathcal{P}_{fd} \subseteq \mathcal{P}$ composed of the packet vertices corresponding to arcs in $\mathcal{A}^s_{fd}$ is a feedback packet vertex set of $\mathcal{G}$ and the sum weight of $\mathcal{P}_{fd}$ is equal to $W_{fd}$. If $\mathcal{P}_{fd}$ is not a minimum feedback packet vertex set, there must exist a minimum feedback packet vertex set, say $\mathcal{P}'_{fd}$, whose sum weight $W'_{fd} < W_{fd}$. By Fact 9.1, the counterpart of $\mathcal{P}'_{fd}$ in $\mathcal{G}^s$ is a feedback arc set and the sum weight of this feedback arc set is equal to $W'_{fd}$.
Denote this feedback arc set as $\mathcal{A}^{s\prime}_{fd}$; then $\mathcal{A}^{s\prime}_{fd}$ has a sum weight strictly less than $W_{fd}$. This contradicts the fact that $\mathcal{A}^s_{fd}$ is a minimum feedback arc set of $\mathcal{G}^s$. Hence, $\mathcal{P}_{fd}$ must be a minimum feedback packet vertex set of $\mathcal{G}$.

9.4.3 Optimality of Cyclic Codes in Planar Graphs

The planarity of a digraph is not affected by arc directions, so that a digraph is planar if and only if its undirected counterpart, where all directed arcs are turned into undirected edges, is planar. The following definitions are useful in characterizing the planarity of an (undirected) graph.

Definition 9.2 (Page 21 in [Bol98]). Given an edge $e = (v_1, v_2)$ of a graph $\mathcal{G}$, subdividing the edge $e$ is the operation of replacing the edge $e = (v_1, v_2)$ by the path $(v_1, v_0, v_2)$ of length 2 (see Figure 9.3a).

Definition 9.3 (Page 24 in [Bol98]). Given an edge $e = (v_1, v_2)$ of a graph $\mathcal{G}$, contracting the edge $e$ is the operation of merging the vertices $v_1$ and $v_2$ and deleting all resulting loops and duplicate edges (see Figure 9.3b).

Definition 9.4 (Page 24 in [Bol98]). A graph $\mathcal{H}$ is a minor of a graph $\mathcal{G}$ if $\mathcal{H}$ is a subgraph of a graph obtained from $\mathcal{G}$ by a sequence of edge contractions.

Note that if a graph $\mathcal{G}$ is planar, edge subdivisions and contractions preserve the planarity. The two simplest non-planar graphs are the complete graph with 5 vertices, which is denoted as $K_5$, and the complete bipartite graph with 3 vertices on one side and 3 vertices on the other side, which is denoted as $K_{3,3}$. Both of them are drawn in Figure 9.4. The following theorem provides a necessary and sufficient condition for the planarity of an undirected graph.

Theorem 9.4 (Page 24 in [Bol98]). A graph $\mathcal{G}$ is planar if and only if $\mathcal{G}$ contains neither $K_5$ nor $K_{3,3}$ as a minor.

In the index coding problem, a packet is said to be a uniprior packet [OH12] if it is contained as side information by only one user. The following lemma is proposed to characterize the planarity of the packet split digraph $\mathcal{G}^s$.

Lemma 9.2.
Let $\mathcal{G}$ be an index coding problem where each packet vertex is either unicast or uniprior and let $\mathcal{G}^s$ be the packet split digraph of $\mathcal{G}$. Then $\mathcal{G}^s$ is planar if and only if $\mathcal{G}$ is planar.

Proof.
• $\mathcal{G}^s$ planar $\Rightarrow$ $\mathcal{G}$ planar: This part is relatively easy. Assume $\mathcal{G}^s$ is planar and is drawn in a plane. A planar drawing of $\mathcal{G}$ can be obtained by contracting all the packet-to-packet arcs of $\mathcal{G}^s$ into packet vertices. This part holds for any $\mathcal{G}$ even if some packet vertex is neither unicast nor uniprior.
• $\mathcal{G}$ planar $\Rightarrow$ $\mathcal{G}^s$ planar: Assume $\mathcal{G}$ is planar and is drawn in a plane. A planar drawing of $\mathcal{G}^s$ can be obtained by subdividing packet-to-user arcs and user-to-packet arcs in $\mathcal{G}$. A crucial property is that each packet vertex in $\mathcal{G}$ has either one outgoing arc (unicast) or one incoming arc (uniprior). For each packet vertex $p_m$ with only one outgoing arc (unicast), we can subdivide the outgoing arc into two parts, add a new vertex $p^{out}_m$ in the middle and reindex the vertex $p_m$ as $p^{in}_m$. This preserves planarity, and the newly created vertex $p^{out}_m$ indeed acts as the corresponding outgoing vertex for packet vertex $p_m$ in the desired packet split digraph $\mathcal{G}^s$ (since that packet vertex has only one outgoing arc). The remaining packet vertices $p_m$ that have not participated in these subdivisions must be uniprior and hence have just one incoming arc. We can subdivide this incoming arc into two parts, add a new vertex $p^{in}_m$ in the middle and reindex the vertex $p_m$ as $p^{out}_m$. The subdivision operations as above yield a planar drawing of $\mathcal{G}^s$.

[Figure 9.3: (a) Subdivision of edge $(v_1, v_2)$. (b) Contraction of edge $(v_1, v_2)$.]

[Figure 9.4: $K_5$ and $K_{3,3}$.]

Corollary 9.1. For any unicast index coding problem $\mathcal{G}$, $\mathcal{G}^s$ is planar if and only if $\mathcal{G}$ is planar.

Now we are ready to present the main result in this section.
Theorem 9.2 (Restated). If the bipartite digraph $\mathcal{G}$ for a (unicast) index coding problem is planar, then $\mathrm{val}(\mathrm{P1}) = \mathrm{val}(\mathrm{P2})$, i.e., $\mathrm{gap}(\mathrm{P1}', \mathrm{P1}) = 0$ and $\mathrm{gap}(\mathrm{P2}, \mathrm{P2}') = 0$. Hence, the cyclic code given by (P2) is an optimal index code.

Proof. Since $\mathcal{G}$ is a planar graph and this is a unicast index coding problem, $\mathcal{G}^s$ is also a planar graph by Corollary 9.1. Let $\mathcal{G}^s = (\mathcal{V}^s, \mathcal{A}^s, \mathcal{W}^s)$ be the packet split digraph of $\mathcal{G} = (\mathcal{U}, \mathcal{P}, \mathcal{A}, \mathcal{W}_P)$. Let $\mathcal{C}^s$ be the set of cycles in $\mathcal{G}^s$. The minimum feedback arc set problem in $\mathcal{G}^s$ can be formulated as a linear IP as follows:

(P3*)

$$
\begin{aligned}
\min_{x_a} \quad & \sum_{a \in \mathcal{A}^s} x_a w_a \\
\text{s.t.} \quad & \sum_{a \in \mathcal{A}^s} x_a \mathbf{1}_{\{a \in C\}} \ge 1, \quad \forall C \in \mathcal{C}^s \\
& x_a \in \{0, 1\}, \quad a \in \mathcal{A}^s
\end{aligned}
$$

Similarly, the cycle-packing problem in $\mathcal{G}^s$ can be formulated as another linear IP as follows:

(P4*)

$$
\begin{aligned}
\max_{y_C} \quad & \sum_{C \in \mathcal{C}^s} y_C \\
\text{s.t.} \quad & \sum_{C \in \mathcal{C}^s} y_C \mathbf{1}_{\{a \in C\}} \le w_a, \quad \forall a \in \mathcal{A}^s \\
& y_C \text{ non-negative integer}, \quad \forall C \in \mathcal{C}^s
\end{aligned}
$$

By Theorem 9.3, if $\mathcal{G}^s$ is a planar graph, then (P3*) and (P4*) have the same optimal value. In what follows, we show that the optimal value of (P3) is equal to that of (P3*) and the optimal value of (P4) is equal to that of (P4*).

• (P3) and (P3*) have the same optimal value: By Fact 9.3, the minimum feedback arc set corresponding to the solution of (P3*) can be converted to a minimum feedback packet vertex set solution of (P3) which attains the same optimal objective function value as that of (P3*). On the other hand, by Fact 9.1, the optimal solution of (P3) can be converted to a solution of (P3*) which attains the same objective value as that of (P3).

• (P4) and (P4*) have the same optimal value: By Fact 9.1, there is a bijection from $\mathcal{C}$ to $\mathcal{C}^s$. Equivalently, there is a bijection from variables in (P4) to those in (P4*). Let $\mathcal{A}^s_1$ be the set of packet-to-packet arcs and $\mathcal{A}^s_2$ be the set of packet-to-user and user-to-packet arcs, so that $\mathcal{A}^s_1 \cup \mathcal{A}^s_2 = \mathcal{A}^s$ and $\mathcal{A}^s_1 \cap \mathcal{A}^s_2 = \emptyset$. The constraints $\sum_{C \in \mathcal{C}^s} y_C \mathbf{1}_{\{a \in C\}} \le w_a$, $\forall a \in \mathcal{A}^s_1$, in (P4*) are essentially the same as the constraints $\sum_{i=2}^{L} \sum_{C_i \in \mathcal{C}_i} y_{C_i} \mathbf{1}_{\{p_m \in C_i\}} \le w_{p_m}$, $m = 1, \ldots, M$, in (P4).
The other inequality constraints $\sum_{C \in \mathcal{C}^s} y_C \mathbf{1}_{\{a \in C\}} \le w_a$ over $a \in \mathcal{A}^s_2$ can be shown to be redundant as follows. Let $y_C$, $C \in \mathcal{C}^s$, be an arbitrary non-negative integer vector which satisfies all the constraints $\sum_{C \in \mathcal{C}^s} y_C \mathbf{1}_{\{a \in C\}} \le w_a$ over $a \in \mathcal{A}^s_1$. Due to the bipartite property, each cycle in $\mathcal{G}$ contains at least one packet vertex. By Fact 9.1, each cycle in $\mathcal{G}^s$ contains at least one packet-to-packet arc. Thus, for any $C \in \mathcal{C}^s$, there exists some $a \in \mathcal{A}^s_1$ such that $\mathbf{1}_{\{a \in C\}} = 1$. Then, for any $\bar{a} \in \mathcal{A}^s_2$ we have

$$
\sum_{C \in \mathcal{C}^s} y_C \mathbf{1}_{\{\bar{a} \in C\}}
\le \sum_{C \in \mathcal{C}^s} y_C
\le \sum_{C \in \mathcal{C}^s} y_C \cdot \sum_{a \in \mathcal{A}^s_1} \mathbf{1}_{\{a \in C\}}
= \sum_{a \in \mathcal{A}^s_1} \sum_{C \in \mathcal{C}^s} y_C \mathbf{1}_{\{a \in C\}}
\le \sum_{a \in \mathcal{A}^s_1} w_a
< w_{\bar{a}}
$$

where the first inequality follows from the fact that $0 \le \mathbf{1}_{\{\bar{a} \in C\}} \le 1$; the second inequality follows from the fact that for any $C \in \mathcal{C}^s$ there exists some $a \in \mathcal{A}^s_1$ such that $\mathbf{1}_{\{a \in C\}} = 1$; the third inequality follows from the fact that all the constraints $\sum_{C \in \mathcal{C}^s} y_C \mathbf{1}_{\{a \in C\}} \le w_a$ over $a \in \mathcal{A}^s_1$ are satisfied; and the last inequality follows from the fact that the weight of any packet-to-user arc or user-to-packet arc is strictly larger than the sum weight of all packet-to-packet arcs. That is, the constraint $\sum_{C \in \mathcal{C}^s} y_C \mathbf{1}_{\{a \in C\}} \le w_a$ over any $a \in \mathcal{A}^s_2$ is automatically satisfied and hence redundant. Hence, (P4) and (P4*) are two equivalent optimization problems.

Combining the above facts, we can conclude that the optimal value of (P3) is equal to that of (P4). Denote this value as $W'$. According to Theorem 9.1, $W - W'$ is a lower bound on the clearance time of the index coding problem $\mathcal{G}$. On the other hand, $W - W'$ is the clearance time achieved by the scalar cyclic code corresponding to the solution of (P4), or equivalently the cyclic code IP (P2). Hence, we can conclude that the cyclic code given by the cyclic code IP (P2) is the optimal index code.
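The reduction used in the proof can be checked on the example of Figure 9.1. The sketch below builds the packet split digraph of Definition 9.1 and compares, by exhaustive search, the minimum feedback arc set weight in $\mathcal{G}^s$ with the minimum feedback packet vertex set weight in $\mathcal{G}$. It is a toy verification under stated assumptions: the arc list is the one implied by the text for Figure 9.1, the heavy arc weight is chosen as $W + 1$ (any integer larger than $W$ would do), and all function names are hypothetical.

```python
from itertools import combinations

WEIGHTS = {"p1": 3, "p2": 1, "p3": 2}
ARCS = [
    ("u2", "p1"), ("u3", "p1"), ("u3", "p2"), ("u1", "p3"),  # user-to-packet
    ("p1", "u1"), ("p2", "u2"), ("p3", "u3"),                # packet-to-user
]


def packet_split(arcs, weights):
    """Definition 9.1: returns a dict mapping each arc of G^s to its weight."""
    big = sum(weights.values()) + 1  # assumption: any integer > W works
    split = {(p + "_in", p + "_out"): w for p, w in weights.items()}
    for a, b in arcs:
        if b in weights:                 # user-to-packet arc
            split[(a, b + "_in")] = big
        else:                            # packet-to-user arc
            split[(a + "_out", b)] = big
    return split


def is_acyclic(arcs):
    """Kahn's algorithm: True iff every vertex can be peeled off."""
    arcs = list(arcs)
    indeg = {v: 0 for arc in arcs for v in arc}
    for _, b in arcs:
        indeg[b] += 1
    ready = [v for v, d in indeg.items() if d == 0]
    seen = 0
    while ready:
        v = ready.pop()
        seen += 1
        for a, b in arcs:
            if a == v:
                indeg[b] -= 1
                if indeg[b] == 0:
                    ready.append(b)
    return seen == len(indeg)


def min_feedback_arc_weight(split):
    arcs = list(split)
    best = sum(split.values())           # removing every arc is always feasible
    for r in range(len(arcs) + 1):
        for removed in combinations(arcs, r):
            cost = sum(split[a] for a in removed)
            if cost < best and is_acyclic(a for a in arcs if a not in removed):
                best = cost
    return best


def min_feedback_packet_weight(arcs, weights):
    best = sum(weights.values())         # removing every packet vertex is feasible
    for r in range(len(weights) + 1):
        for removed in combinations(weights, r):
            kept = [(a, b) for a, b in arcs if a not in removed and b not in removed]
            cost = sum(weights[p] for p in removed)
            if cost < best and is_acyclic(kept):
                best = cost
    return best


split = packet_split(ARCS, WEIGHTS)
arc_value = min_feedback_arc_weight(split)
vertex_value = min_feedback_packet_weight(ARCS, WEIGHTS)
print(arc_value, vertex_value, sum(WEIGHTS.values()) - vertex_value)
```

In this example both minima coincide (deleting $p_3$, or equivalently the arc $(p^{in}_3, p^{out}_3)$, breaks every cycle), and $W$ minus this value matches the bound obtained from (P1).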
9.4.4 Optimality of Cyclic Codes in the Unicast-Uniprior Index Coding Problem

In this subsection, we consider the unicast-uniprior index coding problem where each packet is demanded by one single user and can be contained as side information by at most one single user. This problem is motivated by the broadcast relay problem [NTZ13] where multiple users exchange their individual data through a broadcast relay. A strong corollary of Theorem 9.2 on the unicast-uniprior index coding problem is presented below. This corollary is also an enhancement of the conclusion in Section III.C of [NTZ13], where the cyclic code is proven to be the optimal index code in the unicast-uniprior index coding problem with at most 3 users.

Corollary 9.2. If the number of users in the unicast-uniprior index coding problem is less than or equal to 4, then cyclic codes are optimal.

Proof. Let $\mathcal{G} = (\mathcal{U}, \mathcal{P}, \mathcal{A}, \mathcal{W}_P)$ be a unicast-uniprior index coding problem where each packet vertex has one single outgoing link and one single incoming link and $|\mathcal{U}| \le 4$. Let the underlying undirected graph of $\mathcal{G}$ be $U(\mathcal{G})$. The degree of a vertex in an undirected graph is defined as the number of its adjacent edges. At most 4 vertices in $U(\mathcal{G})$ can have a degree larger than 2. That is because each packet vertex must have a degree of 2 and only a user vertex can have a degree larger than 2. By Theorem 9.4, if $U(\mathcal{G})$ is nonplanar, there must exist a subgraph of $U(\mathcal{G})$ which can be converted to either $K_5$ or $K_{3,3}$ after several contracting operations. Note that $K_5$ has 5 nodes with identical degree of 4 and $K_{3,3}$ has 6 nodes with identical degree of 3. Also note that whether a user-to-packet edge or a packet-to-user edge in $U(\mathcal{G})$ is contracted, one user vertex and one packet vertex are replaced by one new vertex whose degree is equal to the degree of the user vertex. As a result, contracting operations performed over $U(\mathcal{G})$ cannot generate new nodes with degree larger than 2.
Thus, there does not exist a subgraph of $U(\mathcal{G})$ which has $K_5$ or $K_{3,3}$ as a minor. So graph $U(\mathcal{G})$ must be planar. By Theorem 9.2, the cyclic code is optimal in $\mathcal{G}$.

9.5 Partial Clique Codes: a Duality Perspective

Section 9.3 shows the inherent duality between the maximum acyclic subgraph bound given by Theorem 9.1 and the optimal cyclic code. In fact, this is not an isolated case. In this section, a different code structure involving partial clique codes is considered. Partial clique codes are more sophisticated but often lead to performance improvements over cyclic codes. It is shown that the problem of finding the optimal partial clique code is the dual problem of another LP relaxation of the maximum acyclic subgraph IP (P1). The new relaxation is different from (P1′) and results in a smaller integrality gap.

9.5.1 Partial Clique Codes

Let $\mathcal{P}'\subseteq\mathcal{P}$ be a subset of $k$ ($1\le k\le M$) packet vertices and $\mathcal{N}_{\text{out}}(\mathcal{P}') = \bigcup_{p\in\mathcal{P}'} \mathcal{N}_{\text{out}}(p)$ be the outgoing neighborhood of $\mathcal{P}'$, i.e., the subset of users who demand packets in $\mathcal{P}'$. If each user in $\mathcal{N}_{\text{out}}(\mathcal{P}')$ has at least $d$ ($0\le d\le k-1$) packet vertices in $\mathcal{P}'$ as side information, and at least one such user has exactly $d$, then the subgraph of $\mathcal{G}$ induced by $\mathcal{P}'$ and $\mathcal{N}_{\text{out}}(\mathcal{P}')$ is called a $(k,d)$-partial clique. A $(k,d)$-partial clique where the weight of each packet vertex is identically 1 can be cleared with $k-d$ transmissions using $k-d$ independent linear combinations of the packets (such as using systematic maximum distance separable (MDS) codes in [BK98, TDN12] or random codes in [HMK+06]). For example, the digraph $\mathcal{G}$ in Figure 9.1 itself is a $(3,1)$-partial clique. If the weight of each packet vertex is identically one, then this graph can be cleared by transmitting 2 linear combinations in the form $Z_1 = \alpha_1 p_1 + \alpha_2 p_2 + \alpha_3 p_3$ and $Z_2 = \beta_1 p_1 + \beta_2 p_2 + \beta_3 p_3$, where the $\alpha_i$ and $\beta_i$ values are taken from a finite field $\mathbb{F}$.
If the finite field $\mathbb{F}$ is large enough, we are able to find 2 linear combinations that, together with any one known value of $p_1$, $p_2$ or $p_3$, are linearly independent. Thus, each user $u_i$, $i = 1, 2, 3$, can decode $p_i$ by solving a system of 2 linear equations in 2 unknowns.

The linear index code of $\mathcal{G}$ is said to be a partial clique code if it uses a sequence of coding actions that involve only partial clique coding actions. Note that the subgraph induced by a single packet vertex and the user vertex demanding it is by definition a $(1,0)$-partial clique. Let $\mathcal{T}_{k,d}$, $k = 1,\dots,M$, $d = 0,\dots,k-1$, be the set of all $(k,d)$-partial cliques in $\mathcal{G}$. The problem of finding the optimal scalar partial clique code can be formulated as an IP as below:

Partial Clique Code IP (P5):

\[
\begin{aligned}
\min_{y_{T_{k,d}}} \quad & \sum_{k=1}^{M}\sum_{d=0}^{k-1}\sum_{T_{k,d}\in\mathcal{T}_{k,d}} y_{T_{k,d}}\,(k-d) \\
\text{s.t.} \quad & \sum_{k=1}^{M}\sum_{d=0}^{k-1}\sum_{T_{k,d}\in\mathcal{T}_{k,d}} y_{T_{k,d}}\,\mathbf{1}_{\{p_m\in T_{k,d}\}} \ge w_{p_m}, \quad m = 1,\dots,M \\
& y_{T_{k,d}} \text{ non-negative integer}, \quad \forall T_{k,d}\in\mathcal{T}_{k,d},\ k=1,\dots,M,\ d=0,\dots,k-1
\end{aligned}
\]

where $y_{T_{k,d}}$ is the number of partial clique coding actions over each partial clique $T_{k,d}$, $\forall T_{k,d}\in\mathcal{T}_{k,d}$, $k = 1,\dots,M$, $d = 0,\dots,k-1$; the objective function $\sum_{k=1}^{M}\sum_{d=0}^{k-1}\sum_{T_{k,d}\in\mathcal{T}_{k,d}} y_{T_{k,d}}(k-d)$ is the total number of transmissions; and $\sum_{k=1}^{M}\sum_{d=0}^{k-1}\sum_{T_{k,d}\in\mathcal{T}_{k,d}} y_{T_{k,d}}\mathbf{1}_{\{p_m\in T_{k,d}\}} \ge w_{p_m}$ is the constraint that all the $w_{p_m}$ packets represented by packet vertex $p_m$ are cleared by partial cliques involving it.

The problem of finding the optimal vector partial clique code can be formulated as an LP as below:

Partial Clique Code LP (P5′):

\[
\begin{aligned}
\min_{y_{T_{k,d}}} \quad & \sum_{k=1}^{M}\sum_{d=0}^{k-1}\sum_{T_{k,d}\in\mathcal{T}_{k,d}} y_{T_{k,d}}\,(k-d) \\
\text{s.t.} \quad & \sum_{k=1}^{M}\sum_{d=0}^{k-1}\sum_{T_{k,d}\in\mathcal{T}_{k,d}} y_{T_{k,d}}\,\mathbf{1}_{\{p_m\in T_{k,d}\}} \ge w_{p_m}, \quad m = 1,\dots,M \\
& y_{T_{k,d}} \ge 0, \quad \forall T_{k,d}\in\mathcal{T}_{k,d},\ k=1,\dots,M,\ d=0,\dots,k-1
\end{aligned}
\]

Similar to cyclic codes, the partial clique code LP (P5′) is the LP relaxation of the partial clique code IP (P5). The structure of partial clique codes is much more sophisticated than that of cyclic codes.
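As an illustration of the $(3,1)$-partial clique decoding described at the start of this passage, the following Python sketch works over the prime field GF(7). The field size, the coefficient vectors (1, 1, 1) and (1, 2, 3), the packet values and the function names are all assumptions chosen for illustration; the text only requires a field large enough that the two combinations stay linearly independent once any one packet is known.

```python
# Minimal sketch (illustrative assumptions): clear a (3,1)-partial clique
# with 2 transmissions over the prime field GF(P).
P = 7                 # assumed field size (prime)
ALPHA = (1, 1, 1)     # assumed coefficients of Z1
BETA = (1, 2, 3)      # assumed coefficients of Z2; any two columns of
                      # [ALPHA; BETA] are linearly independent mod 7


def encode(packets):
    """Return the two transmitted linear combinations (Z1, Z2)."""
    z1 = sum(a * p for a, p in zip(ALPHA, packets)) % P
    z2 = sum(b * p for b, p in zip(BETA, packets)) % P
    return z1, z2


def decode(z1, z2, known_index, known_value):
    """Recover all three packets from (Z1, Z2) and one known packet."""
    i, j = [m for m in range(3) if m != known_index]
    # Subtract the contribution of the packet held as side information.
    r1 = (z1 - ALPHA[known_index] * known_value) % P
    r2 = (z2 - BETA[known_index] * known_value) % P
    # Solve the remaining 2x2 system by Cramer's rule in GF(P).
    det = (ALPHA[i] * BETA[j] - ALPHA[j] * BETA[i]) % P
    det_inv = pow(det, P - 2, P)          # inverse via Fermat's little theorem
    x_i = (r1 * BETA[j] - ALPHA[j] * r2) * det_inv % P
    x_j = (ALPHA[i] * r2 - BETA[i] * r1) * det_inv % P
    recovered = [None, None, None]
    recovered[known_index] = known_value
    recovered[i], recovered[j] = x_i, x_j
    return recovered


packets = [3, 5, 6]                        # hypothetical packet symbols in GF(7)
z1, z2 = encode(packets)
results = [decode(z1, z2, k, packets[k]) for k in range(3)]
```

Each user needs only its own demanded packet; the sketch returns all three values simply to make the check easy.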
Typically, partial clique codes have to be implemented over a large enough finite field while cyclic codes can always be implemented over the binary field. On the other hand, the performance of partial clique codes in general is better (no worse) than that of cyclic codes. This is summarized in the following lemma.

Lemma 9.3. In any (unicast) index coding problem, the optimal clearance time attained by scalar cyclic codes is no less than that attained by scalar partial clique codes. Similarly, the optimal clearance time attained by vector cyclic codes is no less than that attained by vector partial clique codes.

Proof. This lemma is proven for scalar codes. However, all the arguments can be carried over to vector codes after each packet is divided into subpackets. Recall that in any $K$-cycle, each user vertex has at least one packet vertex as side information. So each $K$-cycle code can be equivalently replaced by a $(K,1)$-partial clique code. This uses partial clique coding to achieve the same clearance time. Thus, the best partial clique coding strategy achieves a clearance time that is less than or equal to that of the best cyclic coding strategy.

[Figure 9.5: An example with 3 users ($u_1$, $u_2$, $u_3$) and 3 packets ($p_1$, $p_2$, $p_3$, with weights $w_1 = w_2 = w_3 = 1$) where the partial clique code is strictly better than the cyclic code.]

Figure 9.5 shows an example of the index coding problem with 3 users and 3 packets. The bipartite digraph of this problem is not planar. (In fact, this example is the only unicast index coding problem with 3 users and 3 packets for which the bipartite digraph is non-planar.) It can be verified that the optimal scalar cyclic code can clear this problem with 2 transmissions. On the other hand, the bipartite digraph itself is a $(3,2)$-partial clique and hence the scalar partial clique code can clear it with one single transmission. The scalar partial clique code simply transmits $Z = p_1 + p_2 + p_3$.
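The single transmission for the Figure 9.5 example can be sketched in a few lines of Python. The sketch assumes that user $u_i$ demands $p_i$ and holds the other two packets as side information (which is what a $(3,2)$-partial clique with three users implies), and it takes the sum over the binary field, i.e., bitwise XOR; the packet contents and helper names are hypothetical.

```python
# Minimal sketch (illustrative assumptions): one XOR transmission clears the
# 3-user, 3-packet (3,2)-partial clique.
def xor_bytes(*blocks):
    """Bitwise XOR of equal-length byte strings (addition over GF(2))."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for idx, byte in enumerate(block):
            out[idx] ^= byte
    return bytes(out)


# Hypothetical packet contents, indexed by packet number.
packets = {1: b"\x12\x34", 2: b"\xab\xcd", 3: b"\x0f\xf0"}

# The single broadcast transmission Z = p1 + p2 + p3.
Z = xor_bytes(packets[1], packets[2], packets[3])


def decode_with_side_info(z, side_info):
    """Cancel the two packets held as side information from Z."""
    return xor_bytes(z, *side_info)


# User i holds every packet except its own demanded packet p_i.
decoded = {
    user: decode_with_side_info(
        Z, [p for m, p in packets.items() if m != user]
    )
    for user in packets
}
```

Every user recovers its demanded packet from the same broadcast symbol, which is why one transmission suffices here.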
In this simple example, the scalar partial clique code is strictly better than the scalar cyclic code. However, the following theorem shows that partial clique codes have no performance advantage over cyclic codes in the unicast-uniprior index coding problem.

Theorem 9.5. In any unicast-uniprior index coding problem, the optimal clearance time attained by scalar cyclic codes is equal to that attained by scalar partial clique codes. Similarly, the optimal clearance time attained by vector cyclic codes is equal to that attained by vector partial clique codes.

Proof. This theorem is proven for scalar codes. However, all the arguments can be carried over to vector codes after each packet is divided into subpackets.
• Claim 1: The optimal clearance time attained by cyclic codes is larger than or equal to that attained by partial clique codes. This is Lemma 9.3.
• Claim 2: The optimal clearance time attained by cyclic codes is less than or equal to that attained by partial clique codes. For any partial clique $T_{k,d}$ ($d\ge 1$) utilized in the optimal partial clique code, $k$ packets are cleared with $k-d$ transmissions. By definition of partial cliques, each user vertex in this $T_{k,d}$ has at least $d$ arcs outgoing to packet vertices in it. So we are able to find a cycle in it. To find a cycle, we start at any vertex, traverse a path from vertex to vertex using any outgoing link and discover a cycle when we revisit a vertex. Denote this cycle as $C_1$ and delete all the packet vertices of $C_1$ and the associated outgoing and incoming arcs from $T_{k,d}$. Note that each packet vertex has at most one outgoing arc and at most one incoming arc in a unicast-uniprior index coding problem. Hence, no two packet vertices in $C_1$ share the same outgoing neighbor or incoming neighbor.
So after the deletion of the packet vertices and the associated outgoing and incoming arcs, the number of outgoing arcs of the user vertices involved in $C_1$ decreases by one while the number of outgoing arcs of the user vertices not involved in $C_1$ does not change. So in the remaining part of this $T_{k,d}$, each user vertex has at least $d-1$ outgoing arcs. Repeating the above process, we obtain $d$ cycles in the end, and no two cycles share the same packet vertex. So by performing a cycle code over each cycle $C_i$, $i = 1,\dots,d$, we can save $d$ transmissions in total. Hence, this $T_{k,d}$ can be cleared with $k-d$ transmissions by applying cyclic codes. As a result, cyclic codes are no worse than partial clique codes in the unicast-uniprior index coding problem.

9.5.2 Duality Between Information Theoretical Lower Bounds and Partial Clique Codes

Define an IP as below:

IP (P6):

\[
\begin{aligned}
\max_{x_m} \quad & \sum_{m=1}^{M} x_m w_{p_m} \\
\text{s.t.} \quad & \sum_{m=1}^{M} x_m \mathbf{1}_{\{p_m\in T_{k,d}\}} \le k-d, \quad \forall T_{k,d}\in\mathcal{T}_{k,d},\ k=1,\dots,M,\ d=0,\dots,k-1 \\
& x_m\in\{0,1\}, \quad m = 1,\dots,M
\end{aligned}
\]

The physical meaning of IP (P6) is to find the maximum packet weighted subgraph of $\mathcal{G}$ formed by packet vertex deletions such that at least $d$ packet vertices are deleted in each $(k,d)$-partial clique.

Lemma 9.4. The partial clique code LP (P5′) and the LP relaxation of IP (P6) are a primal-dual linear programming pair.

Proof. The Lagrangian function of the partial clique code LP (P5′) can be written as

\[
\begin{aligned}
L(y_{T_{k,d}},\lambda_m,\mu_{T_{k,d}})
&= \sum_{k=1}^{M}\sum_{d=0}^{k-1}\sum_{T_{k,d}\in\mathcal{T}_{k,d}} y_{T_{k,d}}(k-d)
 + \sum_{m=1}^{M}\lambda_m\Big(w_{p_m} - \sum_{k=1}^{M}\sum_{d=0}^{k-1}\sum_{T_{k,d}\in\mathcal{T}_{k,d}} y_{T_{k,d}}\mathbf{1}_{\{p_m\in T_{k,d}\}}\Big)
 - \sum_{k=1}^{M}\sum_{d=0}^{k-1}\sum_{T_{k,d}\in\mathcal{T}_{k,d}} \mu_{T_{k,d}}\, y_{T_{k,d}} \\
&= \sum_{k=1}^{M}\sum_{d=0}^{k-1}\sum_{T_{k,d}\in\mathcal{T}_{k,d}} y_{T_{k,d}}\Big[(k-d) - \sum_{m=1}^{M}\lambda_m \mathbf{1}_{\{p_m\in T_{k,d}\}} - \mu_{T_{k,d}}\Big]
 + \sum_{m=1}^{M}\lambda_m w_{p_m}
\end{aligned}
\]

where $\lambda_m\ge 0$, $m = 1,\dots,M$, and $\mu_{T_{k,d}}\ge 0$, $\forall T_{k,d}\in\mathcal{T}_{k,d}$, $k = 1,\dots,M$, $d = 0,\dots,k-1$.
The dual problem of the partial clique code LP (P5′) is defined as:

\[
\max_{\lambda_m\ge 0,\ \mu_{T_{k,d}}\ge 0}\ \min_{y_{T_{k,d}}\in\mathbb{R}}\ L(y_{T_{k,d}},\lambda_m,\mu_{T_{k,d}})
\]

Note that

\[
\min_{y_{T_{k,d}}\in\mathbb{R}} L(y_{T_{k,d}},\lambda_m,\mu_{T_{k,d}}) =
\begin{cases}
\sum_{m=1}^{M}\lambda_m w_{p_m} & \text{if } (k-d) - \sum_{m=1}^{M}\lambda_m \mathbf{1}_{\{p_m\in T_{k,d}\}} - \mu_{T_{k,d}} = 0,\ \forall T_{k,d}\in\mathcal{T}_{k,d},\ k=1,\dots,M,\ d=0,\dots,k-1 \\
-\infty & \text{otherwise}
\end{cases}
\]

Then, the dual problem of the partial clique code LP (P5′) can be written as

\[
\begin{aligned}
\max_{\lambda_m,\ \mu_{T_{k,d}}} \quad & \sum_{m=1}^{M}\lambda_m w_{p_m} \\
\text{s.t.} \quad & (k-d) - \sum_{m=1}^{M}\lambda_m \mathbf{1}_{\{p_m\in T_{k,d}\}} - \mu_{T_{k,d}} = 0, \quad \forall T_{k,d}\in\mathcal{T}_{k,d},\ k=1,\dots,M,\ d=0,\dots,k-1 \\
& \lambda_m \ge 0, \quad m = 1,\dots,M \\
& \mu_{T_{k,d}} \ge 0, \quad \forall T_{k,d}\in\mathcal{T}_{k,d},\ k=1,\dots,M,\ d=0,\dots,k-1
\end{aligned}
\]

Eliminating variables $\mu_{T_{k,d}}$, $\forall T_{k,d}\in\mathcal{T}_{k,d}$, $k = 1,\dots,M$, $d = 0,\dots,k-1$, we obtain

\[
\begin{aligned}
\max_{\lambda_m} \quad & \sum_{m=1}^{M}\lambda_m w_{p_m} \\
\text{s.t.} \quad & \sum_{m=1}^{M}\lambda_m \mathbf{1}_{\{p_m\in T_{k,d}\}} \le k-d, \quad \forall T_{k,d}\in\mathcal{T}_{k,d},\ k=1,\dots,M,\ d=0,\dots,k-1 \\
& \lambda_m \ge 0, \quad m = 1,\dots,M
\end{aligned}
\]

Now consider all the $M$ packet vertices, i.e., all $T_{1,0}\in\mathcal{T}_{1,0}$. The corresponding constraints $\sum_{m=1}^{M}\lambda_m \mathbf{1}_{\{p_m\in T_{1,0}\}} \le 1$, $\forall T_{1,0}\in\mathcal{T}_{1,0}$, can be simplified as $\lambda_m\le 1$, $m = 1,\dots,M$. Hence, the above linear programming problem is the LP relaxation of (P6).

IP (P6) seems quite different from the maximum acyclic subgraph IP (P1) and it seems that there exists no duality between the optimal partial clique code and the MAS lower bound. However, the following lemma shows that the maximum acyclic subgraph IP (P1) and IP (P6) are two equivalent problems.

Lemma 9.5. For any (unicast) index coding problem $\mathcal{G}$, the maximum acyclic subgraph IP (P1) and IP (P6) are two equivalent problems.

Proof. Note that the objective function in problem (P1) is the same as that in problem (P6). To prove problems (P1) and (P6) are equivalent, we show that $x_m\in\{0,1\}$, $m = 1,\dots,M$, is feasible to problem (P1) if and only if it is feasible to problem (P6).
• Feasible to (P6) ⇒ feasible to (P1): Assume $x_m\in\{0,1\}$, $m = 1,\dots,M$, is feasible to (P6). For any cycle $C_i\in\mathcal{C}_i$, $i = 2,\dots,L$, involving $i$ packet vertices in $\mathcal{G}$, let us consider the partial clique $T_{i,d}$ formed by the $i$ packet vertices and $i$ user vertices in this $i$-cycle.
By the definition of a cycle, each user vertex has at least one packet vertex among these $i$ packet vertices as side information. So $d\ge 1$. Since $x_m\in\{0,1\}$, $m = 1,\dots,M$, satisfies the inequality constraints in (P6), at least $d$ packet vertices among these $i$ packet vertices are deleted. Since at least one packet vertex of the cycle is deleted, cycle $C_i$ cannot be complete. Hence, $x_m$, $m = 1,\dots,M$, yields an acyclic subgraph of $\mathcal{G}$.
• Feasible to (P1) ⇒ feasible to (P6): Assume $x_m\in\{0,1\}$, $m = 1,\dots,M$, is feasible to (P1). For any partial clique $T_{k,d}$, if $d = 0$, then constraint $\sum_{m=1}^{M} x_m \mathbf{1}_{\{p_m\in T_{k,d}\}} \le k-d$ is trivially satisfied. Without loss of generality, assume $1\le d\le k-1$. Then, in this partial clique $T_{k,d}$, each user vertex has at least $d$ outgoing arcs. So we can find a cycle in this partial clique. (To find a cycle, we start at any vertex, traverse a path from vertex to vertex using any outgoing link and discover a cycle when we revisit a vertex.) Since $x_m\in\{0,1\}$, $m = 1,\dots,M$, is feasible to (P1), at least one packet vertex in this cycle is deleted. Assume $d_1$ packet vertices are deleted. These deleted packet vertices are also vertices in partial clique $T_{k,d}$. If $d_1\ge d$, then the constraint over $T_{k,d}$ is satisfied. If $d_1 < d$, then we continue to consider the remaining part of $T_{k,d}$ after deleting these $d_1$ packet vertices. In the remaining part, each user vertex has at least $d-d_1$ outgoing arcs. A similar argument as above shows that we are still able to find a new cycle in the remaining part and at least one packet vertex in the cycle is deleted. Assume $d_2$ packet vertices in the new cycle are deleted. If $d_1+d_2 < d$, we can repeat this process until at least $d$ packet vertices are shown to be deleted. That is to say, constraint $\sum_{m=1}^{M} x_m \mathbf{1}_{\{p_m\in T_{k,d}\}} \le k-d$ over all $T_{k,d}$ is satisfied. Hence, $x_m\in\{0,1\}$, $m = 1,\dots,M$, satisfies the constraints of (P6).

The above lemma indicates that IP (P6) is another representation of the maximum acyclic subgraph IP (P1).
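The cycle-finding walk used in the proofs of Theorem 9.5 and Lemma 9.5 (start anywhere, follow any outgoing arc, stop on the first revisited vertex) can be sketched as follows. The adjacency-dictionary representation, the vertex names and the example graph are hypothetical; the sketch assumes, as in a partial clique with $d\ge 1$, that every vertex reached by the walk has at least one outgoing arc.

```python
# Minimal sketch (illustrative assumptions): find a directed cycle by walking
# along outgoing arcs until a vertex is revisited.
def find_cycle(out_arcs, start):
    """Return the vertices of a directed cycle reached from `start`.

    `out_arcs` maps each vertex to a list of its outgoing neighbors.
    """
    position = {}      # vertex -> index at which it was first visited
    path = []
    vertex = start
    while vertex not in position:
        if not out_arcs.get(vertex):
            raise ValueError(f"vertex {vertex!r} has no outgoing arc")
        position[vertex] = len(path)
        path.append(vertex)
        vertex = out_arcs[vertex][0]   # any outgoing arc works; take the first
    # The walk returned to `vertex`; everything from its first visit onward
    # forms the cycle (an initial tail, if any, is discarded).
    return path[position[vertex]:]


# Hypothetical partial clique: packet -> user arcs are demands,
# user -> packet arcs are side information.
out_arcs = {
    "p1": ["u1"], "p2": ["u2"], "p3": ["u3"],
    "u1": ["p2"], "u2": ["p3"], "u3": ["p1"],
}
cycle = find_cycle(out_arcs, "p1")
print(cycle)  # ['p1', 'u1', 'p2', 'u2', 'p3', 'u3']
```

The walk must terminate because the vertex set is finite, so a vertex is revisited after at most as many steps as there are vertices.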
However, this new representation is non-trivial. The LP relaxations of IP (P6) and the maximum acyclic subgraph IP (P1) correspond to partial clique codes and cyclic codes, respectively. Lemma 9.3 demonstrates that codes associated with IP (P6) in general have better performance than codes associated with the maximum acyclic subgraph IP (P1).

9.5.3 Discussion

The integer linear programs (P1) and (P6) are two different representations of the same problem of finding the maximum acyclic subgraph bound. Different representations of an integer linear program can yield LP relaxations with different integrality gaps. In Section 9.3 and this section, we show that the LP relaxation of (P1) is the (dual) problem of finding the optimal vector cyclic code, and the LP relaxation of (P6) is the (dual) problem of finding the optimal vector partial clique code. The performance of partial clique codes is no worse than that of cyclic codes. Hence, the integrality gap of the LP relaxation of (P6) is no larger than that of the LP relaxation of (P1). The relations between various problems in this chapter are illustrated in Figure 9.6.

[Figure 9.6: The relations between various problems in Chapter 9. Labels in the figure: MAS bound, (P1) or (P6); vector partial clique code, (P5′); vector cyclic code, (P2′); scalar partial clique code, (P5); scalar cyclic code, (P2); gap1, gap2, gap3; the integrality gaps of the LP relaxations of (P6), (P5), (P2) and (P1); an arrow marked "increasing direction". Notes in the figure: if $\mathcal{G}$ is a planar graph, then gap1 is zero, hence both gap2 and gap3 are zero; if $\mathcal{G}$ is a unicast-uniprior index coding problem, then both gap2 and gap3 are zero.]

Note that (P2′) and (P5′) require a large packet size, while (P5) and (P5′) require encoding in a large finite field.
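To make the remark on integrality gaps concrete, consider the 3-user, 3-packet instance of Figure 9.5 with unit weights. The computation below is only a sketch under two assumptions that are not restated in this section: that user $u_i$ demands $p_i$ and holds the other two packets as side information (which is what makes the graph a $(3,2)$-partial clique), and that the cycle constraints of (P1) say that at most $|C|-1$ packet vertices of each cycle $C$ are kept.

```latex
% LP relaxation of (P1) on the assumed instance: every pair of packets lies on
% a 2-packet cycle, and all three packets lie on a 3-packet cycle.
\[
\max\ x_1+x_2+x_3 \quad \text{s.t.}\quad
x_1+x_2\le 1,\ \ x_1+x_3\le 1,\ \ x_2+x_3\le 1,\ \ x_1+x_2+x_3\le 2,\ \ 0\le x_m\le 1 .
\]
% Summing the three pairwise constraints gives 2(x_1+x_2+x_3) \le 3, and the
% point x = (1/2, 1/2, 1/2) attains it, so the LP optimum is 3/2.

% LP relaxation of (P6): the whole graph is a (3,2)-partial clique, which adds
\[
x_1+x_2+x_3 \le 3-2 = 1 ,
\]
% so the LP optimum is at most 1; x = (1,0,0) is feasible because every
% (k,d)-partial clique has k-d \ge 1, hence the LP optimum is exactly 1.

% Both integer programs have optimal value 1 (any two kept packets close a cycle).
\[
\text{gap of the relaxation of (P1)} = \tfrac{3}{2}-1 = \tfrac{1}{2},
\qquad
\text{gap of the relaxation of (P6)} = 1-1 = 0 .
\]
```

Under these assumptions the relaxation of (P6) is tight on this instance while the relaxation of (P1) is not, which is consistent with the statement that the integrality gap of (P6) is no larger than that of (P1).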
The graph parameter minrank is known to be optimal over scalar linear codes [BYBJK11], and hence lies somewhere in the shaded region between (P5) and the MAS bound.

Since there are various techniques for obtaining tight LP relaxations of an integer linear program [NW88], a potential approach to design good code structures for the index coding problem is to explore different representations of the maximum acyclic subgraph IP (P1) for which the LP relaxations have small integrality gaps. If the dual problem of such an LP relaxation can be interpreted as a code, then this is a good code for the index coding problem.

9.6 Chapter Summary

This chapter studies index coding from a perspective of optimization and duality. It illustrates the inherent duality between the information theoretic maximum acyclic subgraph (MAS) lower bound and the optimal cyclic codes and partial clique codes. The performance of both codes is bounded by the respective integrality gap of two different LP relaxations of the integer program that defines the MAS bound. In the special case when the index coding problem has a planar digraph representation, the integrality gap associated with cyclic coding is shown to be zero, so exact optimality is achieved by cyclic coding. For general (non-planar) problems, the LP relaxation associated with partial clique coding provides an integrality gap that is no worse, and often better, than the previous gap. These results provide new insight into the index coding problem and suggest that good codes can be found by exploring different relaxations of the MAS bound problem.

Chapter 10 Conclusions

In this thesis, we develop new Lagrangian methods for constrained convex programs with complicated functional constraints. Existing Lagrangian methods either have a slow $O(1/\epsilon^2)$ convergence time or can only solve problems with linear equality constraint functions.
The new methods developed in this thesis are proven to have a faster $O(1/\epsilon)$ convergence time and can be implemented in parallel in most cases of interest. The per-iteration complexity of the new methods is also as small as that of existing Lagrangian methods. The design intuition and performance analysis of our new methods differ from conventional analysis techniques for existing Lagrangian methods and are based on a drift-plus-penalty type analysis, which was originally proposed for stochastic network optimization in dynamic queueing networks.

Most existing backpressure algorithms for joint rate control and routing in data networks can be interpreted as distributed applications of certain Lagrangian methods for a multi-commodity network flow formulation. By adapting our new Lagrangian methods, we are able to develop new backpressure algorithms that have the best utility and queue length tradeoff among all known backpressure algorithms.

The new Lagrangian methods are further adapted to develop new learning algorithms for online convex optimization with constraints. The two developed learning algorithms are proven to achieve the best regret and constraint violations for online convex optimization with stochastic constraints and online convex optimization with long term constraints, respectively. Power control for energy harvesting devices with outdated state information is closely related to online convex optimization with stochastic constraints but is restricted to a more stringent energy availability constraint. For this problem, we develop a dynamic power control policy to achieve an $O(\epsilon)$ optimal utility by using a battery with capacity $O(1/\epsilon)$.

In this thesis, we also extend existing stochastic constrained convex optimization techniques and utilize Lagrangian duality theory for constrained convex optimization to study two other important problems in wireless communication and network coding.
In the first problem, we adapt the conventional drift-plus-penalty technique for stochastic optimization and Zinkevich's projected online gradient descent for online convex optimization to develop new dynamic transmit covariance design policies for MIMO fading systems with unknown channel distributions and inaccurate channel state information. In the second problem, we study the index coding problem and characterize the optimality of two representative linear codes by studying the integrality gap between the integer linear program from an information theoretical lower bound and its linear programming relaxations and the Lagrangian duality between various linear programming relaxations and their dual problems.

Bibliography

[ALS+08] N. Alon, E. Lubetzky, U. Stav, A. Weinstein, and A. Hassidim. Broadcasting with side information. In Proceedings of IEEE Annual Symposium on Foundations of Computer Science (FOCS), pages 823–832, 2008.
[BA03] H. Bai and M. Atiquzzaman. Error modeling schemes for fading channels in wireless communications: A survey. IEEE Communications Surveys & Tutorials, 5(2):2–9, 2003.
[BDH+08] P. L. Bartlett, V. Dani, T. Hayes, S. Kakade, A. Rakhlin, and A. Tewari. High-probability regret bounds for bandit online linear optimization. In Proceedings of Conference on Learning Theory (COLT), 2008.
[Ber99] D. P. Bertsekas. Nonlinear Programming. Athena Scientific, second edition, 1999.
[BG02] R. A. Berry and R. G. Gallager. Communication over fading channels with delay constraints. IEEE Transactions on Information Theory, 48(5):1135–1149, 2002.
[BGD13] P. Blasco, D. Gunduz, and M. Dohler. A learning theoretic approach to energy harvesting communication system optimization. IEEE Transactions on Wireless Communications, 12(4):1872–1882, 2013.
[BK98] Y. Birk and T. Kol. Informed-source coding-on-demand (ISCOD) over broadcast channels. In Proceedings of IEEE International Conference on Computer Communications (INFOCOM), 1998.
[BK06] Y. Birk and T.
Kol. Coding on demand by an informed source (ISCOD) for efficient broadcast of different supplemental data to caching clients. IEEE Transactions on Information Theory, 52(6):2825–2830, June 2006.
[BKL10] A. Blasiak, R. Kleinberg, and E. Lubetzky. Index coding via linear programming. arXiv:1004.1379, 2010.
[BKVH07] S. Boyd, S.-J. Kim, L. Vandenberghe, and A. Hassibi. A tutorial on geometric programming. Optimization and Engineering, 8(1):67–127, 2007.
[BLEM+09] N. Buchbinder, L. Lewin-Eytan, I. Menache, J. S. Naor, and A. Orda. Dynamic power allocation under arbitrary varying channels: An online approach. In Proceedings of IEEE International Conference on Computer Communications (INFOCOM), 2009.
[BLEM+10] N. Buchbinder, L. Lewin-Eytan, I. Menache, J. S. Naor, and A. Orda. Dynamic power allocation under arbitrary varying channels: The multi-user case. In Proceedings of IEEE International Conference on Computer Communications (INFOCOM), 2010.
[BNO03] D. P. Bertsekas, A. Nedić, and A. E. Ozdaglar. Convex Analysis and Optimization. Athena Scientific, 2003.
[Bol98] B. Bollobás. Modern Graph Theory, volume 184. Springer, 1998.
[BPC+11] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.
[BSS06] M. S. Bazaraa, H. D. Sherali, and C. M. Shetty. Nonlinear Programming: Theory and Algorithms. Wiley-Interscience, 2006.
[BV04] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[BX05] S. Boyd and L. Xiao. Least-squares covariance matrix adjustment. SIAM Journal on Matrix Analysis and Applications, 27(2):532–546, 2005.
[BYBJK11] Z. Bar-Yossef, Y. Birk, T. S. Jayram, and T. Kol. Index coding with side information. IEEE Transactions on Information Theory, 57(3):1479–1494, March 2011.
[CASL11] M. A. R. Chaudhry, Z. Asad, A. Sprintson, and M. Langberg.
On the complementary index coding problem. In Proceedings of IEEE International Symposium on Information Theory (ISIT), 2011.
[CBL06] N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
[CBLW96] N. Cesa-Bianchi, P. M. Long, and M. K. Warmuth. Worst-case quadratic loss bounds for prediction using linear functions and gradient descent. IEEE Transactions on Neural Networks, 7(3):604–619, 1996.
[CGP15] A. Cotter, M. Gupta, and J. Pfeifer. A light touch for heavily constrained SGD. In Proceedings of Conference on Learning Theory (COLT), 2015.
[Dev77] L. P. Devroye. A uniform bound for the deviation of empirical distribution functions. Journal of Multivariate Analysis, 7(4):594–597, 1977.
[Doo53] J. L. Doob. Stochastic processes. Wiley New York, 1953.
[DSSSC08] J. Duchi, S. Shalev-Shwartz, Y. Singer, and T. Chandra. Efficient projections onto the $\ell_1$-ball for learning in high dimensions. In Proceedings of International Conference on Machine Learning (ICML), 2008.
[Dur10] R. Durrett. Probability: Theory and Examples. Cambridge University Press, 2010.
[DY16] W. Deng and W. Yin. On the global and linear convergence of the generalized alternating direction method of multipliers. Journal of Scientific Computing, 66(3):889–916, 2016.
[ERL15] M. Effros, S. E. Rouayheb, and M. Langberg. An equivalence between network coding and index coding. IEEE Transactions on Information Theory, 61(5):2478–2487, 2015.
[ES06] A. Eryilmaz and R. Srikant. Joint congestion control, routing, and MAC for stability and fairness in wireless networks. IEEE Journal on Selected Areas in Communications, 24(8):1514–1524, 2006.
[ES12] A. Eryilmaz and R. Srikant. Asymptotically tight steady-state queue length bounds implied by drift conditions. Queueing Systems, 72:311–359, 2012.
[ETW12] U. Erez, M. D. Trott, and G. W. Wornell. Rateless coding for Gaussian channels. IEEE Transactions on Information Theory, 58(2):530–547, 2012.
[FHM07] A.
Feiten, S. Hanly, and R. Mathar. Derivatives of mutual information in Gaussian vector channels with applications. In Proceedings of IEEE International Symposium on Information Theory (ISIT), 2007.
[FLEP10] Y. Fan, L. Lai, E. Erkip, and H. V. Poor. Rateless coding for MIMO fading channels: performance limits and code construction. IEEE Transactions on Wireless Communications, 9(4):1288–1292, 2010.
[GGT10] M. Gatzianas, L. Georgiadis, and L. Tassiulas. Control of wireless networks with rechargeable batteries. IEEE Transactions on Wireless Communications, 9(2):581–593, 2010.
[GLS93] M. Grötschel, L. Lovász, and A. Schrijver. Geometric Algorithms and Combinatorial Optimization. Springer-Verlag, 1993.
[GMPV09] A. Ghosh, P. McAfee, K. Papineni, and S. Vassilvitskii. Bidding for representative allocations for display advertising. In Proceedings of International Workshop on Internet and Network Economics (WINE), 2009.
[GNT06] L. Georgiadis, M. J. Neely, and L. Tassiulas. Resource allocation and cross-layer control in wireless networks. Foundations and Trends in Networking, 2006.
[Gol05] A. Goldsmith. Wireless Communications. Cambridge University Press, 2005.
[Gor99] G. J. Gordon. Regret bounds for prediction problems. In Proceedings of Conference on Learning Theory (COLT), 1999.
[GPS15] E. Gustavsson, M. Patriksson, and A.-B. Strömberg. Primal convergence from dual subgradient methods for convex optimization. Mathematical Programming, 150(2):365–390, 2015.
[GT11a] A. Goldfarb and C. Tucker. Online display advertising: targeting and obtrusiveness. Marketing Science, 30(3):389–404, 2011.
[GT11b] B. Guenin and R. Thomas. Packing directed circuits exactly. Combinatorica, 31(4):397–421, 2011.
[Haj82] B. Hajek. Hitting-time and occupation-time bounds implied by drift analysis with applications. Advances in Applied Probability, 14(3):502–525, 1982.
[HAK07] E. Hazan, A. Agarwal, and S. Kale. Logarithmic regret algorithms for online convex optimization.
Machine Learning, 69:169–192, 2007.
[Haz16] E. Hazan. Introduction to online convex optimization. Foundations and Trends in Optimization, 2(3–4):157–325, 2016.
[HG07] A. Hjørungnes and D. Gesbert. Complex-valued matrix differentiation: Techniques and key results. IEEE Transactions on Signal Processing, 55(6):2740–2746, 2007.
[HH15] E. Hossain and M. Hasan. 5G cellular: key enabling technologies and research challenges. IEEE Instrumentation & Measurement Magazine, 18(3):11–21, 2015.
[HJ85] R. A. Horn and C. R. Johnson. Matrix Analysis. Cambridge University Press, 1985.
[HJ91] R. A. Horn and C. R. Johnson. Topics in Matrix Analysis. Cambridge University Press, 1991.
[Hjø11] A. Hjørungnes. Complex-valued Matrix Derivatives: with Applications in Signal Processing and Communications. Cambridge University Press, 2011.
[HL16] M. Hong and Z.-Q. Luo. On the linear convergence of the alternating direction method of multipliers. Mathematical Programming, 2016.
[HMK+06] T. Ho, M. Médard, R. Koetter, D. R. Karger, M. Effros, J. Shi, and B. Leong. A random linear network coding approach to multicast. IEEE Transactions on Information Theory, 52(10):4413–4430, October 2006.
[HMNK13] L. Huang, S. Moeller, M. J. Neely, and B. Krishnamachari. LIFO-backpressure achieves near-optimal utility-delay tradeoff. IEEE/ACM Transactions on Networking, 21(3):831–844, 2013.
[HN11] L. Huang and M. J. Neely. Delay reduction via Lagrange multipliers in stochastic network optimization. IEEE Transactions on Automatic Control, 56(4):842–857, 2011.
[HN13] L. Huang and M. J. Neely. Utility optimal scheduling in energy-harvesting networks. IEEE/ACM Transactions on Networking, 21(4):1117–1130, 2013.
[HR11] Y. Hu and A. Ribeiro. Adaptive distributed algorithms for optimal random access channels. IEEE Transactions on Wireless Communications, 10(8):2703–2715, 2011.
[HUL01] J.-B. Hiriart-Urruty and C. Lemaréchal. Fundamentals of Convex Analysis. Springer, 2001.
[HY12] B. He and X. Yuan.
On the O(1/n) convergence rate of the Douglas-Rachford alternating direction method. SIAM Journal on Numerical Analysis, 50(2):700–709, 2012.
[JB04] E. Jorswieck and H. Boche. Channel capacity and capacity-range of beamforming in MIMO wireless systems under correlated fading with covariance feedback. IEEE Transactions on Wireless Communications, 3(5):1543–1553, 2004.
[JHA16] R. Jenatton, J. Huang, and C. Archambeau. Adaptive algorithms for online convex optimization with long-term constraints. In Proceedings of International Conference on Machine Learning (ICML), 2016.
[JP03] S. K. Jayaweera and H. V. Poor. Capacity of multiple-antenna systems with both receiver and transmitter channel state information. IEEE Transactions on Information Theory, 49(10):2697–2709, 2003.
[JVG01] S. A. Jafar, S. Vishwanath, and A. Goldsmith. Channel capacity and beamforming for multiple transmit and receive antennas with covariance feedback. In Proceedings of IEEE International Conference on Communications (ICC), 2001.
[Kel97] F. P. Kelly. Charging and rate control for elastic traffic. European Transactions on Telecommunications, 8(1):33–37, 1997.
[KHZS07] A. Kansal, J. Hsu, S. Zahedi, and M. B. Srivastava. Power management in energy harvesting sensor networks. ACM Transactions on Embedded Computing Systems, 6(4), 2007.
[KMS+15] P. Kamalinejad, C. Mahapatra, Z. Sheng, S. Mirabbasi, V. C. Leung, and Y. L. Guan. Wireless energy harvesting for the internet of things. IEEE Communications Magazine, 53(6):102–108, 2015.
[KMT98] F. P. Kelly, A. K. Maulloo, and D. K. Tan. Rate control for communication networks: Shadow prices, proportional fairness and stability. Journal of the Operational Research Society, 49(3):237–252, 1998.
[KW97] J. Kivinen and M. K. Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132(1):1–63, 1997.
[LHSS09] J. Liu, Y. T. Hou, Y. Shi, and H. D. Sherali.
On performance optimization for multi-carrier MIMO Ad-Hoc networks. In Proceedings of ACM International Symposium on Mobile Ad Hoc Networking and Computing (MobiHoc), 2009.
[LK06] V. K. Lau and Y.-K. R. Kwok. Channel-Adaptive Technologies and Cross-Layer Designs for Wireless Systems with Multiple Antennas: Theory and Applications. John Wiley & Sons, 2006.
[LL97] T. Larsson and Z. Liu. A Lagrangean relaxation scheme for structured linear programs with application to multicommodity network flows. Optimization, 40(3):247–284, 1997.
[LL99] S. H. Low and D. E. Lapsley. Optimization flow control I: basic algorithm and convergence. IEEE/ACM Transactions on Networking, 7(6):861–874, 1999.
[LMS06] J.-W. Lee, R. R. Mazumdar, and N. B. Shroff. Opportunistic power scheduling for dynamic multi-server wireless systems. IEEE Transactions on Wireless Communications, 5(6):1506–1515, 2006.
[LMZ15] T.-Y. Lin, S.-Q. Ma, and S.-Z. Zhang. On the sublinear convergence rate of multi-block ADMM. Journal of the Operations Research Society of China, 3(3):251–274, 2015.
[Low03] S. H. Low. A duality model of TCP and queue management algorithms. IEEE/ACM Transactions on Networking, 11(4):525–536, 2003.
[LPS99] T. Larsson, M. Patriksson, and A.-B. Strömberg. Ergodic, primal convergence in dual subgradient schemes for convex programming. Mathematical Programming, 86(2):283–312, 1999.
[LS04] X. Lin and N. B. Shroff. Joint rate control and scheduling in multihop wireless networks. In Proceedings of IEEE Conference on Decision and Control (CDC), 2004.
[LS06] X. Lin and N. B. Shroff. Utility maximization for communication networks with multipath routing. IEEE Transactions on Automatic Control, 51(5):766–781, 2006.
[LS09] E. Lubetzky and U. Stav. Nonlinear index coding outperforming the linear optimum. IEEE Transactions on Information Theory, 55(8):3544–3551, August 2009.
[LSXS15] J. Liu, N. B. Shroff, C. H. Xia, and H. D. Sherali.
Joint congestion control and routing optimization: An efficient second-order distributed approach. IEEE/ACM Transactions on Networking, 24(3):1404–1420, 2015. [LTCS16] B. Li, D. Tse, K. Chen, and H. Shen. Capacity-achieving rateless polar codes. In Proceedings of IEEE International Symposium on Information Theory (ISIT), 2016. [LY78] C. Lucchesi and D. Younger. A minimax theorem for directed graphs. J. London Math. Soc.(2), 17(3):369 – 374, 1978. [LZ16] G. Lan and Z. Zhou. Algorithms for stochastic optimization with expectation con- straints. arXiv:1604.03887, 2016. [MB16] P. Mertikopoulos and E. V. Belmega. Learning to be green: Robust energy efficiency maximization in dynamic MIMO–OFDM systems. IEEE Journal on Selected Areas in Communications, 34(4):743–757, 2016. [MJY12] M. Mahdavi, R. Jin, and T. Yang. Trading regret for efficiency: online convex optimization with long term constraints. Journal of Machine Learning Research, 13(1):2503–2528, 2012. [MM16] P. Mertikopoulos and A. L. Moustakas. Learning in an uncertain world: MIMO covariance matrix optimization with imperfect feedback. IEEE Transactions on Signal Processing, 64(1):5–18, 2016. 278 [MSZ13] N. Michelusi, K. Stamatiou, and M. Zorzi. Transmission policies for energy har- vesting sensors with time-correlated energy supply. IEEE Transactions on Com- munications, 61(7):2988–3001, 2013. [MTY09] S. Mannor, J. N. Tsitsiklis, and J. Y. Yu. Online learning with sample path con- straints. Journal of Machine Learning Research, 10:569–590, March 2009. [MYJ13] M. Mahdavi, T. Yang, and R. Jin. Stochastic convex optimization with multiple objectives. In Advances in Neural Information Processing Systems (NIPS), 2013. [Nee03] M. J. Neely. Dynamic Power Allocation and Routing for Satellite and Wireless Networks with Time Varying Channels. PhD thesis, Massachusetts Institute of Technology, 2003. [Nee05] M. J. Neely. Distributed and secure computation of convex programs over a net- work of connected processors. 
In DCDIS International Conference on Engineering Applications and Computational Algorithms, 2005. [Nee06] M. J. Neely. Super-fast delay tradeoffs for utility optimal fair scheduling in wireless networks. IEEE Journal on Selected Areas in Communications, 24(8):1489–1501, 2006. [Nee07] M. J. Neely. Optimal energy and delay tradeoffs for multiuser wireless downlinks. IEEE Transactions on Information Theory, 53(9):3095–3113, 2007. [Nee10] M. J. Neely. Stochastic Network Optimization with Application to Communication and Queueing Systems. Morgan & Claypool Publishers, 2010. [Nee14] M. J. Neely. A simple convergence time analysis of drift-plus-penalty for stochastic optimization and convex programs. arXiv:1412.0791, 2014. [Nee15] M. J. Neely. Energy-aware wireless scheduling with near optimal backlog and con- vergence time tradeoffs. In Proceedings of IEEE International Conference on Com- puter Communications (INFOCOM), 2015. [Nee16] M. J. Neely. Energy-aware wireless scheduling with near-optimal backlog and con- vergence time tradeoffs. IEEE/ACM Transactions on Networking, 24(4):2223–2236, 2016. 279 [Nes04] Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Springer Science & Business Media, 2004. [NMR05] M. J. Neely, E. Modiano, and C. E. Rohrs. Dynamic power allocation and routing for time-varying wireless networks. IEEE Journal on Selected Areas in Communi- cations, 23(1):89–103, 2005. [NN14] I. Necoara and V. Nedelcu. Rate analysis of inexact dual first-order methods applica- tion to dual decomposition. IEEE Transactions on Automatic Control, 59(5):1232– 1243, May 2014. [NNG15] I. Necoara, Y. Nesterov, and F. Glineur. Linear convergence of first order methods for non-strongly convex optimization. arXiv preprint arXiv:1504.06298v4, 2015. [NO09a] A. Nedi´ c and A. Ozdaglar. Approximate primal solutions and rate analysis for dual subgradient methods. SIAM Journal on Optimization, 19(4):1757–1780, 2009. [NO09b] A. Nedi´ c and A. Ozdaglar. 
Subgradient methods for saddle-point problems. Journal of Optimization Theory and Applications, 142(1):205–228, 2009. [NP16] I. Necoara and A. Patrascu. Iteration complexity analyisis of dual first order meth- ods for conic convex programming. Optimization Method and Software, 31(3):645– 678, 2016. [NPN15] I. Necoara, A. Patrascu, and A. Nedi´ c. Complexity certifications of first-order inexact lagrangian methods for general convex programming: Application to real- time mpc. In Developments in Model-Based Optimization and Control, pages 3–26. Springer, 2015. [NTZ13] M. J. Neely, A. S. Tehrani, and Z. Zhang. Dynamic index coding for wireless broadcast networks. IEEE Transactions on Information Theory, 59(11):7525–7540, November 2013. [NW88] G. L. Nemhauser and L. A. Wolsey. Integer and Combinatorial Optimization, vol- ume 18. Wiley New York, 1988. [NW06] J. Nocedal and S. Wright. Numerical Optimization. Springer Science & Business Media, 2006. 280 [OH12] L. Ong and C. K. Ho. Optimal index codes for a class of multicast networks with receiver side information. In Proceedings of IEEE International Conference on Communications (ICC), 2012. [ ¨ OLR14] B. ¨ Ozbek and D. Le Ruyet. Feedback Strategies for Wireless Communication. Springer, 2014. [PB13] N. Parikh and S. Boyd. Proximal algorithms. Foundations and Trends in Optimiza- tion, 1(3):123–231, 2013. [PC06] D. P. Palomar and M. Chiang. A tutorial on decomposition methods for net- work utility maximization. IEEE Journal on Selected Areas in Communications, 24(8):1439–1451, 2006. [PCL03] D. P. Palomar, J. M. Cioffi, and M. A. Lagunas. Uniform power allocation in MIMO channels: a game-theoretic approach. IEEE Transactions on Information Theory, 49(7):1707–1727, July 2003. [Pee96] R. Peeters. Orthogonal representations over finite fields and the chromatic number of graphs. Combinatorica, 16(3):417–431, 1996. [PL03] D. P. Palomar and M. A. Lagunas. 
Joint transmit-receive space-time equalization in spatially correlated MIMO channels: A beamforming approach. IEEE Journal on Selected Areas in Communications, 21(5):730–743, 2003. [PS05] J. A. Paradiso and T. Starner. Energy scavenging for mobile and wireless electronics. IEEE Pervasive Computing, 4(1):18–27, 2005. [Rag88] P. Raghavan. Probabilistic construction of deterministic algorithms: approximating packing integer programs. Journal of Computer and System Sciences, 37(2):130– 143, 1988. [Rib10] A. Ribeiro. Ergodic stochastic optimization algorithms for wireless communication and networking. IEEE Transactions on Signal Processing, 58(12):6369–6386, 2010. [RSG10] S. E. Rouayheb, A. Sprintson, and C. Georghiades. On the index coding problem and its relation to network coding and matroid theory. IEEE Transactions on Information Theory, 56(7):3187–3195, July 2010. 281 [RT87] P. Raghavan and C. D. Tompson. Randomized rounding: a technique for provably good algorithms and algorithmic proofs. Combinatorica, 7(4):365–374, 1987. [SB94] C. P. Simon and L. Blume. Mathematics for Economists. Norton New York, 1994. [SC96] H. D. Sherali and G. Choi. Recovery of primal solutions when using subgradient optimization methods to solve lagrangian duals of linear programs. Operations Research Letters, 19(3):105–113, 1996. [Ser09] R. J. Serfling. Approximation Theorems of Mathematical Statistics. John Wiley & Sons, 2009. [SHN14] S. Supittayapornpong, L. Huang, and M. J. Neely. Time-average optimization with nonconvex decision set and its convergence. In Proceedings of IEEE Conference on Decision and Control (CDC), 2014. [Sho85] N. Z. Shor. Minimization Methods for Non-Differentiable Functions. Springer- Verlag, 1985. [SK11] S. Sudevalayam and P. Kulkarni. Energy harvesting sensor nodes: Survey and implications. IEEE Communications Surveys & Tutorials, 13(3):443–461, 2011. [SMT15] I. Stiakogiannakis, P. Mertikopoulos, and C. Touati. 
Adaptive power allocation and control in time-varying multi-carrier MIMO networks. arXiv:1503.02155, 2015. [SPB09] G. Scutari, D. P. Palomar, and S. Barbarossa. The MIMO iterative waterfilling algorithm. IEEE Transactions on Signal Processing, 57(5):1917–1935, 2009. [SS11] S. Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107–194, 2011. [Sto05] A. L. Stolyar. Maximizing queueing network utility subject to stability: Greedy primal-dual algorithm. Queueing Systems, 50(4):401–457, 2005. [TDN12] A. S. Tehrani, A. G. Dimakis, and M. J. Neely. Bipartite index coding. In Proceed- ings of IEEE International Symposium on Information Theory (ISIT), 2012. 282 [TE92] L. Tassiulas and A. Ephremides. Stability properties of constrained queueing sys- tems and scheduling policies for maximum throughput in multihop radio networks. IEEE Transactions on Automatic Control, 37(12):1936–1948, 1992. [Tel99] I. E. Telatar. Capacity of multi-antenna Gaussian channels. European Transactions on Telecommunications, 10(6):585–596, 1999. [TTM11] H. Terelius, U. Topcu, and R. M. Murray. Decentralized multi-agent optimization via dual decomposition. In IFAC World Congress, 2011. [TV05] D. Tse and P. Viswanath. Fundamentals of Wireless Communication. Cambridge University Press, 2005. [TV15] T. Tao and V. Vu. Random matrices: universality of local spectral statistics of non-hermitian matrices. The Annals of Probability, 43(2):782–874, 2015. [TY12] K. Tutuncuoglu and A. Yener. Optimum transmission policies for battery lim- ited energy harvesting nodes. IEEE Transactions on Wireless Communications, 11(3):1180–1189, 2012. [UUNS11] R. Urgaonkar, B. Urgaonkar, M. J. Neely, and A. Sivasubramaniam. Optimal power cost management using stored energy in data centers. Proceedings of ACM SIGMETRICS, 2011. [UYE + 15] S. Ulukus, A. Yener, E. Erkip, O. Simeone, M. Zorzi, P. Grover, and K. Huang. 
Energy harvesting wireless communications: A review of recent advances. IEEE Journal on Selected Areas in Communications, 33(3):360–381, 2015. [VLS05] V. V. Veeravalli, Y. Liang, and A. M. Sayeed. Correlated MIMO wireless channels: capacity, optimal signaling, and asymptotics. IEEE Transactions on Information Theory, 51(6):2058–2072, 2005. [VP07] M. Vu and A. Paulraj. On the capacity of MIMO wireless channels with dynamic CSIT. IEEE Journal on Selected Areas in Communications, 25(7):1269–1283, 2007. [Vu02] V. Vu. Concentration of non-lipschitz functions and applications. Random Struc- tures & Algorithms, 20(3):262–316, 2002. 283 [WO13] E. Wei and A. Ozdaglar. On the O(1/k) convergence of asynchronous distributed alternating direction method of multipliers. In Proceedings of IEEE Global Confer- ence on Signal and Information Processing, 2013. [WOJ13] E. Wei, A. Ozdaglar, and A. Jadbabaie. A distributed Newton method for net- work utility maximization–I: algorithm. IEEE Transactions on Automatic Control, 58(9):2162–2175, 2013. [WWW + 17] W. Wu, J. Wang, X. Wang, F. Shan, and J. Luo. Online throughput maximiza- tion for energy harvesting communication systems with battery overflow. IEEE Transactions on Mobile Computing, 16(1):185–197, 2017. [WYN15] X. Wei, H. Yu, and M. J. Neely. A sample path convergence time analysis of drift-plus-penalty for stochastic optimization. arXiv:1510.02973v2, 2015. [YN13] H. Yu and M. J. Neely. Duality codes and the integrality gap bound for index coding. In Proceedings of 51st Annual Allerton Conference on Communication, Control, and Computing (Allerton), 2013. [YN14] H. Yu and M. J. Neely. Duality codes and the integrality gap bound for index coding. IEEE Transactions on Information Theory, 60(11):7256–7268, 2014. [YN15] H. Yu and M. J. Neely. On the convergence time of the drift-plus-penalty algorithm for strongly convex programs. In Proceedings of IEEE Conference on Decision and Control (CDC), 2015. [YN16a] H. Yu and M. J. Neely. 
Dynamic power allocation in MIMO fading systems without channel distribution information. In Proceedings of IEEE International Conference on Computer Communications (INFOCOM), 2016. [YN16b] H. Yu and M. J. Neely. A low complexity algorithm with O( √ T ) regret and finite constraint violations for online convex optimization with long term constraints. arXiv:1604.02218, 2016. [YN16c] H. Yu and M. J. Neely. A primal-dual type algorithm with the O(1/t) convergence rate for large scale constrained convex programs. In Proceedings of IEEE Conference on Decision and Control (CDC), 2016. 284 [YN17a] H. Yu and M. J. Neely. Dynamic transmit covariance design in MIMO fading sys- tems with unknown channel distributions and inaccurate channel state information. IEEE Transactions on Wireless Communications, 16(6):pp.3996–4008, 2017. [YN17b] H. Yu and M. J. Neely. A new backpressure algorithm for joint rate control and rout- ing with vanishing utility optimality gaps and finite queue lengths. In Proceedings of IEEE International Conference on Computer Communications (INFOCOM), 2017. [YN17c] H. Yu and M. J. Neely. A new backpressure algorithm for joint rate con- trol and routing with vanishing utility optimality gaps and finite queue lengths. arXiv:1701.04519, 2017. [YN17d] H. Yu and M. J. Neely. A primal-dual parallel method with O(1/) convergence for constrained composite convex programs. arXiv:1708.00322, 2017. [YN17e] H. Yu and M. J. Neely. A simple parallel algorithm with an O(1/t) convergence rate for general convex programs. SIAM Journal on Optimization, 27(2):pp.759– 783, 2017. [YN18a] H. Yu and M. J. Neely. Learning aided optimization for energy harvesting devices with outdated state information. In Proceedings of IEEE International Conference on Computer Communications (INFOCOM), 2018. [YN18b] H. Yu and M. J. Neely. On the convergence time of dual subgradient methods for strongly convex programs. IEEE Transacations on Automatic Control, to appear, 2018. [YNW17] H. Yu, M. 
J. Neely, and X. Wei. Online convex optimization with stochastic con- straints. In Advances in Neural Information Processing Systems, pages 1427–1437, 2017. [YU12] J. Yang and S. Ulukus. Optimal packet scheduling in an energy harvesting commu- nication system. IEEE Transactions on Communications, 60(1):220–230, 2012. [Zin03] M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of International Conference on Machine Learning (ICML), 2003. 285 [ZRJ13] M. Zargham, A. Ribeiro, and A. Jadbabaie. Accelerated backpressure algorithm. In Proceedings of IEEE Global Communications Conference (GLOBECOM), 2013. 286
Abstract
In this thesis, we develop new Lagrangian methods with fast convergence for constrained convex programs with complicated functional constraints. The dual subgradient method, also known as the dual ascent method, and the primal-dual subgradient method, also known as the Arrow-Hurwicz-Uzawa subgradient method, are classical Lagrangian methods for solving constrained convex programs. Both methods are known to have a slow O(1/ε²) convergence time. In contrast, the new Lagrangian methods proposed in this thesis have a faster O(1/ε) convergence time. Recall that the alternating direction method of multipliers (ADMM), another representative Lagrangian method for convex programs with linear equality constraints, is also known to have O(1/ε) convergence. However, our methods work for general convex programs with possibly non-linear constraints.

We first revisit the classical dual subgradient method and study its convergence time for constrained strongly convex programs in Chapter 2. Using a novel drift-plus-penalty type analysis, we show that the dual subgradient method enjoys a faster O(1/ε) convergence time for general (possibly non-differentiable) constrained strongly convex programs. In Chapter 3, the core chapter of this thesis, we then develop new Lagrangian methods with the fast O(1/ε) convergence time for general constrained convex programs without strong convexity. Based on the new Lagrangian methods of Chapter 3, we develop new techniques that improve on the state of the art for joint rate control and routing in data networks in Chapter 4 and for online convex optimization with stochastic and long-term constraints in Chapters 5-6.

The other focus of this thesis is to illustrate the practical relevance of mathematical optimization techniques in engineering systems. In Chapter 7, we adapt our new online convex optimization technique to power control for energy harvesting devices with outdated state information, achieving utility within O(ε) of the optimum by using a battery with O(1/ε) capacity. In Chapter 8, we extend conventional drift-plus-penalty stochastic optimization and Zinkevich's online convex optimization to develop new dynamic transmit covariance design policies for MIMO fading systems with unknown channel distributions and inaccurate channel state information. In Chapter 9, we study the index coding problem and characterize the optimality of two representative scalar and fractional linear codes, i.e., cyclic codes and maximum distance separable (MDS) codes. We do so by studying (i) the integrality gap between the integer linear program arising from an information-theoretic lower bound and its linear relaxations, and (ii) the Lagrangian duality between various linear relaxations and their dual problems.
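As a rough illustration of the classical dual subgradient (dual ascent) method mentioned in the abstract, the following is a minimal sketch on a toy strongly convex program. It is not the thesis's algorithm or analysis: the problem instance (minimize ||x - c||² subject to a·x ≤ b), the function name `dual_subgradient`, the constant step size, and the iteration count are all assumptions chosen for illustration.

```python
def dual_subgradient(c, a, b, step=0.5, iters=200):
    """Classical dual subgradient method (illustrative sketch) for
        minimize ||x - c||^2  subject to  a . x <= b.
    Returns the running average of the primal iterates and the final multiplier.
    """
    lam = 0.0                      # Lagrange multiplier for the single constraint
    x_sum = [0.0] * len(c)         # accumulator for the running average
    for _ in range(iters):
        # Primal step: minimize the Lagrangian ||x - c||^2 + lam * (a . x - b).
        # Setting the gradient 2(x - c) + lam * a to zero gives a closed form.
        x = [ci - 0.5 * lam * ai for ci, ai in zip(c, a)]
        x_sum = [s + xi for s, xi in zip(x_sum, x)]
        # Dual step: move the multiplier along the constraint value
        # (a subgradient of the dual function) and project onto lam >= 0.
        g = sum(ai * xi for ai, xi in zip(a, x)) - b
        lam = max(0.0, lam + step * g)
    x_avg = [s / iters for s in x_sum]
    return x_avg, lam


# Toy instance (assumed): project the point (2, 1) onto {x : x1 + x2 <= 1}.
x_avg, lam = dual_subgradient(c=[2.0, 1.0], a=[1.0, 1.0], b=1.0)
print(x_avg, lam)
```

For this instance the exact solution is x = (1, 0) with multiplier 2. The averaged iterate approaches it with an error that shrinks roughly in proportion to one over the number of iterations, which is the kind of behavior the O(1/ε) convergence time in the abstract refers to for strongly convex programs.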
Asset Metadata
Creator
Yu, Hao (author)
Core Title
New Lagrangian methods for constrained convex programs and their applications
School
Viterbi School of Engineering
Degree
Doctor of Philosophy
Degree Program
Electrical Engineering
Publication Date
02/07/2018
Defense Date
01/08/2018
Publisher
University of Southern California (original), University of Southern California. Libraries (digital)
Tag
constrained convex programs, Lagrangian methods, OAI-PMH Harvest, optimization
Language
English
Contributor
Electronically uploaded by the author (provenance)
Advisor
Neely, Michael J. (committee chair), Nayyar, Ashutosh (committee member), Razaviyayn, Meisam (committee member)
Creator Email
eeyuhao@gmail.com,yuhao@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c40-470444
Unique identifier
UC11268337
Identifier
etd-YuHao-6010.pdf (filename), usctheses-c40-470444 (legacy record id)
Legacy Identifier
etd-YuHao-6010.pdf
Dmrecord
470444
Document Type
Dissertation
Rights
Yu, Hao
Type
texts
Source
University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA