Robustness of Gradient Methods for Data-Driven Decision Making

by

Hesameddin Mohammadi

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)

December 2022

Copyright 2023 Hesameddin Mohammadi

Acknowledgments

I am grateful to all my colleagues, friends, and family who tremendously supported me during my PhD journey. I try to give a partial accounting of some of their aid and support.

I would like to begin by saying a special thank you to my great advisor Prof. Mihailo R. Jovanović, who certainly falls into all the above three categories. Mihailo never hesitated to offer his strongest support, ranging from suggesting new research directions and ideas, helping with technical challenges, and providing invaluable feedback all along, to patiently going over our papers revising draft after draft till near perfection, to accommodating and aiding me during some of my hardest experiences in this journey. In addition, he generously dedicated funding resources at his disposal to my research throughout these years, which was one of the most important enablers of this work. Mihailo was always a great source of inspiration and I am truly thankful to have had him as my top personal and professional mentor, and academic father.

I would also like to express my sincere gratitude to my committee members Prof. Urbashi Mitra, Prof. Pierluigi Nuzzo, Prof. Mahdi Soltanolkotabi, and Prof. Meisam Razaviyayn. I benefited so much from their feedback and the discussions I had with them. Also, my collaboration with Mahdi and Meisam was extremely helpful to the development of this dissertation. Interacting with them was always intellectually inspiring and it significantly helped me broaden my perspectives.

I would also like to thank my former/current labmates Dr. Reza Kamyar, Dr. Morgan Jones, Dr. Marziye Rahimi, Prof. Armin Zare, Dr. Wei Ran, Dr. Sepideh Hassan Moghaddam, Dr. Dongsheng Ding, Dr. Anubhav Dwivedi, Evgeny Meyer, Samantha Samuelson, and Ibrahim Ozaslan. Interacting with each of them was always a pleasure and I learnt so much from all of them. The many hours I spent in the lab/office every day would be unbearable if I did not have the greatest company of them.

I am thankful to all the great friends I have kept and made: Iman Bonakdar, Javad Abazari, Sina Tohidi, Dr. Mehdi Ataei, Zalan Fabian, Bowen Song, Mohammadmahdi Sajedi, Dr. Mo Hekmat, and Dr. Mohamadreza Ahmadi. The never-ending support I have received from them made it possible for me to continue.

Last but not least, I am always indebted to my parents Elaheh and Behrouz, my sister Noura, my brother Shahab, and my dearest friend Tara. You were always there for me and remained the main source of motivation for me to finish the PhD.

Table of Contents

Acknowledgments
List of Figures
Abstract

Chapter 1: Introduction
  1.1 Main topics
    1.1.1 Noise amplification of accelerated optimization algorithms
    1.1.2 Tradeoffs between noise amplification and convergence rate
    1.1.3 Transient growth of accelerated optimization algorithms
    1.1.4 Noise amplification of primal-dual gradient flow dynamics based on proximal augmented Lagrangian
    1.1.5 Gradient methods for model-free linear quadratic regulator
    1.1.6 Optimization landscape of the linear quadratic Gaussian
  1.2 Dissertation structure
  1.3 Contributions of the dissertation

I Robustness of accelerated first-order optimization algorithms for strongly convex optimization problems

Chapter 2: Noise amplification of accelerated algorithms
  2.1 Introduction
  2.2 Preliminaries and background
  2.3 Strongly convex quadratic problems
    2.3.1 Influence of the eigenvalues of the Hessian matrix
    2.3.2 Comparison for parameters that optimize convergence rate
    2.3.3 Examples
  2.4 General strongly convex problems
    2.4.1 An approach based on contraction mappings
    2.4.2 An approach based on linear matrix inequalities
  2.5 Tuning of algorithmic parameters
    2.5.1 Tuning of parameters using the whole spectrum
    2.5.2 Fundamental lower bounds
  2.6 Application to distributed computation
    2.6.1 Explicit formulae for d-dimensional torus networks
  2.7 Concluding remarks

Chapter 3: Tradeoffs between convergence rate and noise amplification for accelerated algorithms
  3.1 Introduction
  3.2 Preliminaries and background
    3.2.1 Linear dynamics for quadratic problems
    3.2.2 Convergence rates
    3.2.3 Noise amplification
    3.2.4 Parameters that optimize convergence rate
  3.3 Summary of main results
    3.3.1 Bounded noise amplification for stabilizing parameters
    3.3.2 Tradeoff between settling time and noise amplification
  3.4 Geometric characterization
    3.4.1 Modal decomposition
    3.4.2 Conditions for linear convergence
    3.4.3 Noise amplification
  3.5 Designing order-wise Pareto-optimal algorithms with adjustable parameters
    3.5.1 Parameterized family of heavy-ball-like methods
    3.5.2 Parameterized family of Nesterov-like methods
    3.5.3 Impact of reducing the stepsize
  3.6 Continuous-time gradient flow dynamics
    3.6.1 Modal decomposition
    3.6.2 Optimal convergence rate
    3.6.3 Noise amplification
    3.6.4 Convergence and noise amplification tradeoffs
  3.7 Proofs of Theorems 1–4
    3.7.1 Proof of Theorem 1
    3.7.2 Proof of Theorem 2
    3.7.3 Proof of Theorem 3
    3.7.4 Proof of Theorem 4
  3.8 Concluding remarks

Chapter 4: Transient growth of accelerated algorithms
  4.1 Introduction
  4.2 Convex quadratic problems
    4.2.1 LTI formulation
    4.2.2 Linear convergence of accelerated algorithms
    4.2.3 Transient growth of accelerated algorithms
    4.2.4 Analytical expressions for transient response
    4.2.5 The role of initial conditions
  4.3 General strongly convex problems
  4.4 Concluding remarks

Chapter 5: Noise amplification of primal-dual gradient flow dynamics based on proximal augmented Lagrangian
  5.1 Introduction
  5.2 Proximal Augmented Lagrangian
    5.2.1 Stability properties
    5.2.2 Noise amplification
  5.3 Quadratic optimization problems
  5.4 Beyond quadratic problems
    5.4.1 An IQC-based approach
    5.4.2 State-space representation
    5.4.3 Characterizing the structural properties via IQCs
    5.4.4 General convex g
  5.5 Application to distributed optimization
  5.6 Concluding remarks

II Convergence and sample complexity of gradient methods for the data-driven control

Chapter 6: Random search for continuous-time LQR
  6.1 Introduction
  6.2 Problem formulation
  6.3 Main results
    6.3.1 Known model
    6.3.2 Unknown model
  6.4 Convex reparameterization
    6.4.1 Change of variables
    6.4.2 Smoothness and strong convexity of h(Y)
    6.4.3 Gradient methods over $S_Y$
  6.5 Control design with a known model
    6.5.1 Gradient-flow dynamics: proof of Theorem 1
    6.5.2 Geometric interpretation
    6.5.3 Gradient descent: proof of Theorem 2
  6.6 Bias and correlation in gradient estimation
    6.6.1 Bias in gradient estimation due to finite simulation time
      6.6.1.1 Local boundedness of the function f(K)
      6.6.1.2 Bounding the bias
    6.6.2 Correlation between gradient and gradient estimate
      6.6.2.1 Handling $M_1$
      6.6.2.2 Handling $M_2$
  6.7 Model-free control design
  6.8 Computational experiments
    6.8.1 Known model
    6.8.2 Unknown model
  6.9 Concluding remarks

Chapter 7: Random search for discrete-time LQR
  7.1 Introduction
  7.2 State-feedback characterization
  7.3 Random search
  7.4 Main result
  7.5 Proof sketch
    7.5.1 Controlling the bias
    7.5.2 Correlation of $\widehat{\nabla f}(K)$ and $\nabla f(K)$
      7.5.2.1 Quantifying the probability of $M_1$
      7.5.2.2 Quantifying the probability of $M_2$
  7.6 Computational experiments
  7.7 Concluding remarks

Chapter 8: Lack of gradient domination for linear quadratic Gaussian problems with incomplete state information
  8.1 Introduction
  8.2 Linear Quadratic Gaussian
    8.2.1 Separation principle
    8.2.2 Characterization based on gain matrices
  8.3 Gradient method
    8.3.1 Non-separability of gradients
      8.3.1.1 Optimal observer gain $L = L^\star$
      8.3.1.2 Optimal control gain $K = K^\star$
  8.4 Lack of gradient domination
    8.4.1 Non-uniqueness of critical points
  8.5 An example
  8.6 Concluding remarks

Bibliography

Appendices

Chapter A: Supporting proofs for Chapter 2
  A.1 Quadratic problems
    A.1.1 Proof of Theorem 1
    A.1.2 Proof of Proposition 1
    A.1.3 Proof of Theorem 3
    A.1.4 Proof of the bounds in (2.16)
  A.2 General strongly convex problems
    A.2.1 Proof of Lemma 1
    A.2.2 Proof of Lemma 2
    A.2.3 Proof of Theorem 5
    A.2.4 Proof of Theorem 6
  A.3 Fundamental lower bounds
    A.3.1 Proof of Theorem 7
    A.3.2 Proof of Theorem 8
  A.4 Consensus over d-dimensional torus networks
    A.4.1 Proof of Theorem 9
    A.4.2 Computational experiments

Chapter B: Supporting proofs for Chapter 3
  B.1 Settling time
  B.2 Convexity of modal contribution $\hat{J}$
  B.3 Proofs of Section 3.4
    B.3.1 Proof of Lemma 2
    B.3.2 Proof of Equation (3.28c)
  B.4 Proofs of Section 3.5
    B.4.1 Proof of Lemma 4
    B.4.2 Proof of Proposition 2
    B.4.3 Proof of Proposition 3
    B.4.4 Proof of Proposition 4
  B.5 Proofs of Section 3.6
    B.5.1 Proof of Lemma 5
    B.5.2 Proof of Proposition 5
    B.5.3 Proof of Theorem 7
  B.6 Lyapunov equations and the steady-state variance

Chapter C: Supporting proofs for Chapter 4
  C.1 Proofs of Section 4.2
    C.1.1 Proof of Lemma 1
    C.1.2 Proof of Lemma 2
    C.1.3 Proof of Theorem 1
    C.1.4 Proof of Theorem 2
    C.1.5 Proof of Proposition 2
  C.2 Proof of Theorem 3
  C.3 Proofs of Section 4.3
    C.3.1 Proof of Lemma 3
    C.3.2 Proof of Lemma 4
    C.3.3 Proof of Theorem 4

Chapter D: Supporting proofs for Chapter 6
  D.1 Lack of convexity of function f
  D.2 Invertibility of the linear map A
  D.3 Proof of Proposition 1
  D.4 Proofs for Section 6.5
  D.5 Proofs for Section 6.6.1.1
  D.6 Proof of Proposition 4
  D.7 Proof of Proposition 5
    D.7.1 Proof of Proposition 5
  D.8 Proof of Proposition 6
  D.9 Proofs of Section 6.6.2.1
  D.10 Proofs for Section 6.6.2.2 and probabilistic toolbox
  D.11 Bounds on optimization variables
  D.12 The norm of the inverse Lyapunov operator

Chapter E: Supporting proofs for Chapter 7
  E.1 Proof of Proposition 1
  E.2 Proof of Proposition 2

List of Figures

2.1 Ellipsoids $\{z \mid z^T Z^{-1} z \leq 1\}$ associated with the steady-state covariance matrices $Z = CPC^T$ of the performance outputs $z_t = x_t - x^\star$ (top row) and $z_t = Q^{1/2}(x_t - x^\star)$ (bottom row) for algorithms (2.2) with the parameters provided in Table 2.2 for the matrix $Q$ given in (2.17) with $m \ll L = O(1)$. The horizontal and vertical axes show the eigenvectors $[1~0]^T$ and $[0~1]^T$ associated with the eigenvalues $\hat{J}(L)$ and $\hat{J}(m)$ (top row) and $\hat{J}'(L)$ and $\hat{J}'(m)$ (bottom row) of the respective output covariance matrices $Z$.

2.2 Performance outputs $z_t = x_t$ (top row) and $z_t = Q^{1/2} x_t$ (bottom row) resulting from $10^5$ iterations of noisy first-order algorithms (2.2) with the parameters provided in Table 2.2. A strongly convex problem with $f(x) = 0.5\,x_1^2 + 0.25 \times 10^{-4}\, x_2^2$ ($\kappa = 2 \times 10^4$) is solved using algorithms with additive white noise and zero initial conditions.

2.3 $(1/t)\sum_{k=0}^{t} \|z_k\|^2$ for the performance output $z_t$ in Example 2. Top row: the thick blue (gradient descent), black (heavy-ball), and red (Nesterov's method) lines mark the variance obtained by averaging results of twenty stochastic simulations. Bottom row: comparison between results obtained by averaging outcomes of twenty stochastic simulations (thick lines) with the corresponding theoretical values $(1/t)\sum_{k=0}^{t} \mathrm{trace}(C P_k C^T)$ (dashed lines) resulting from the Lyapunov equation (2.6a).

2.4 Block diagram of system (2.21a).

3.1 Summary of the results established in Theorems 1–4 for $\sigma^2 = 1$. The top and bottom rows correspond to the iterate and gradient noise models, respectively, and they illustrate (i) $J^\star_{\max} := \min_{\alpha,\beta,\gamma} \max_{f} J$ and $J^\star_{\min} := \min_{\alpha,\beta,\gamma} \min_{f} J$ subject to a settling time $T_s$ for $f \in Q^L_m$ (black curves); and (ii) their corresponding upper (maroon curves) and lower (red curves) bounds in terms of the condition number $\kappa = L/m$, problem size $n$, and settling time $T_s$. The upper bounds on $J$ established in Theorem 1 are marked by blue curves. The dark shaded region and its union with the light shaded region respectively correspond to all possible pairs $(T_s, \max_f J)$ and $(T_s, \min_f J)$ for $f \in Q^L_m$ and any stabilizing parameters $(\alpha, \beta, \gamma)$.

3.2 The stability set $\Delta$ (the open, cyan triangle) in (3.21b) and the $\rho$-linear convergence set $\Delta_\rho$ (the closed, yellow triangle) in (3.22b), along with the corresponding vertices. For the point $(b, a)$ (black dot) associated with the matrix $M$ in (3.20a), the corresponding distances $(d, h, l)$ in (3.29) are marked by black lines.

3.3 For a fixed $\rho$-linear convergence triangle $\Delta_\rho$ (yellow), dashed blue lines mark the line segments $(b(\lambda), a(\lambda))$ with $\lambda \in [m, L]$ for gradient descent, Polyak's heavy-ball, and Nesterov's accelerated methods as particular instances of the two-step momentum algorithm (3.2) with constant parameters. The solid blue line segments correspond to the parameters for which the algorithm achieves rate $\rho$ for the largest possible condition number given by (3.28).

3.4 The triangle $\Delta_\rho$ (yellow) and the line segments $(b(\lambda), a(\lambda))$ with $\lambda \in [m, L]$ (blue) for gradient descent with reduced stepsize (3.39) and the heavy-ball-like method (3.40), which place the end point $(b(m), a(m))$ at $X_\rho$ and the end point $(b(L), a(L))$ at $(2c'\rho, \rho^2)$ on the edge $X_\rho Y_\rho$, where $c' := \kappa(1-\rho)^2/\rho - (1+\rho^2)/\rho$ ranges over the interval $[-1, 1]$.

3.5 The open positive orthant (cyan) in the $(b, a)$-plane is the stability region for the matrix $M$ in (3.20a). The intersections $Y_\rho$ and $Z_\rho$ of the stepsize normalization line $a = 1$ (black) and the boundary of the $\rho$-exponential stability cone (yellow) established in Lemma 5, along with the cone apex $X_\rho$, determine the vertices of the $\rho$-exponential stability triangle $\Delta_\rho$ given by (3.44).

3.6 For a fixed $\rho$-exponential stability triangle $\Delta_\rho$ (yellow) in (3.44), the line segments $(b(\lambda), a(\lambda))$, $\lambda \in [m, L]$, for Nesterov's accelerated ($\gamma = \beta$) and the heavy-ball ($\gamma = 0$) dynamics, as special examples of accelerated dynamics (3.41b) with constant parameters $\gamma$, $\beta$, and $\alpha = 1/L$, are marked by dashed blue lines. The blue bullets correspond to the locus of the end point $(b(L), a(L))$, and the solid blue line segments correspond to the parameters for which the rate $\rho$ is achieved for the largest possible condition number (3.45).

3.7 The line $L$ (blue, dashed) and the intersection point $G$, along with the distances $d_1$, $h_1$, $d_G$, and $h_G$ as introduced in the proof of Theorem 2.

4.1 Error in the optimization variable for Polyak's heavy-ball (black) and Nesterov's (red) algorithms with the parameters that optimize the convergence rate for a strongly convex quadratic problem with condition number $10^3$ and a unit-norm initial condition with $x_0 \neq x^\star$.

4.2 Dependence of the error in the optimization variable on the iteration number for the heavy-ball (black) and Nesterov's (red) methods, as well as the peak magnitudes (dashed lines) obtained in Proposition 2 for two different initial conditions with $\|x_1\|^2 = \|x_0\|^2 = 1$.

6.1 Trajectories $K(t)$ of (GF) (solid black) and $K_{\mathrm{ind}}(t)$ resulting from Eq. (6.19) (dashed blue), along with the level sets of the function $f(K)$.

6.2 Convergence curves for gradient descent (blue) over the set $S_K$, and gradient descent (red) over the set $S_Y$, with (a) $s = 10$ and (b) $s = 20$ masses.

6.3 (a) Bias in gradient estimation and (b) total error in gradient estimation as functions of the simulation time $\tau$. The blue and red curves correspond to two values of the smoothing parameter, $r = 10^{-4}$ and $r = 10^{-5}$, respectively. (c) Convergence curve of the random search method (RS).

7.1 The intersection of the half-space and the ball parameterized by $\mu_1$ and $\mu_2$, respectively, in Proposition 1. If an update direction $G$ lies within this region, then taking one step along $-G$ with a constant stepsize $\alpha$ yields a geometric decrease in the objective value.

7.2 (a) Bias in gradient estimation and (b) total error in gradient estimation as functions of the simulation time $\tau$. The blue and red curves correspond to two values of the smoothing parameter, $r = 10^{-4}$ and $r = 10^{-6}$, respectively. (c) Convergence curve of the random search method (RS).

7.3 An interconnected system of inverted pendula on carts.

7.4 Histograms of two algorithmic quantities associated with the events $M_1$ and $M_2$ given by (7.10). The red lines demonstrate that $M_1$ with $\mu_1 = 0.1$ and $M_2$ with $\mu_2 = 35$ occur in more than 99% of trials.

8.1 Mass-spring-damper system.

8.2 Convergence curve of gradient descent for $s = 50$.

A.1 The $\beta$-dependence of the function $v$ in (A.29) for $L = 100$ and $m = 1$.

A.2 The dependence of the network-size normalized performance measure $\bar{J}/n$ of the first-order algorithms for the $d$-dimensional torus $T^d_{n_0}$ with $n = n_0^d$ nodes on the condition number $\kappa$. The blue, red, and black curves correspond to gradient descent, Nesterov's method, and the heavy-ball method, respectively. Solid curves mark the actual values of $\bar{J}/n$ obtained using the expressions in Theorem 1 and the dashed curves mark the trends established in Theorem 9.

B.1 The green and orange subsets of the stability triangle $\Delta$ (dashed red) correspond to complex-conjugate and real eigenvalues of the matrix $M$ in (3.20a), respectively. The blue parabola $a = b^2/4$ corresponds to the matrix $M$ with repeated eigenvalues and it is tangent to the edges $X_\rho Z_\rho$ and $Y_\rho Z_\rho$ of the $\rho$-linear convergence triangle $\Delta_\rho$ (solid red).

B.2 The points $X'_\rho$ and $Y'_\rho$ as defined in (B.2), along with an arbitrary line segment $EE'$ passing through the origin in the $(b, a)$-plane.

B.3 The ratio $d_E/d_{E'}$ in (B.3) for Nesterov's method, where $E$ and $E'$ lie on the edges $Y_\rho Z_\rho$ and $X_\rho Z_\rho$ of the $\rho$-linear convergence triangle $\Delta_\rho$, and $c\rho$ determines the slope of $EE'$, which passes through the origin.

D.1 The LQR objective function $f(K(\gamma))$, where $K(\gamma) := \gamma K_1 + (1-\gamma) K_2$ is the line segment between $K_1$ and $K_2$ in (D.1) with $\epsilon = 0.1$.

Abstract

First-order optimization algorithms are increasingly used for data-driven control and many learning applications that often involve uncertain and noisy environments. In this thesis, we employ control-theoretic tools to study the stochastic performance of these algorithms in solving general (strongly) convex and some nonconvex optimization problems that arise in reinforcement learning and control theory. In particular, we first study momentum-based accelerated optimization algorithms in which the iterations utilize information from the two previous steps and are subject to additive white noise. This class of algorithms includes Polyak's heavy-ball and Nesterov's accelerated methods as special cases, and the noise accounts for uncertainty in either gradient evaluation or iteration updates. For unconstrained, smooth, strongly convex optimization problems, we examine the mean-squared error in the optimization variable to quantify noise amplification. By leveraging the theory of Lyapunov and integral quadratic constraints, we establish an upper bound on the noise amplification of Nesterov's method with standard parameters that is tight up to a constant factor. We also use strongly convex quadratic problems to identify fundamental tradeoffs between noise amplification and convergence rate for the two-step momentum algorithms. For this class of problems, we explicitly evaluate the steady-state variance of the optimization variable in terms of the eigenvalues of the Hessian of the objective function.
We also introduce a novel geometric characterization of conditions for linear convergence that clarifies the relation between the noise amplification and convergence rate as well as their dependence on the condition number and the constant algorithmic parameters. This geometric insight leads to simple alternative proofs of standard convergence results and allows us to establish analytical lower bounds on the product between the settling time and noise amplification that scale quadratically with the condition number. Our analysis also identifies a key difference between the gradient and iterate noise models: while the amplification of gradient noise can be made arbitrarily small by sufficiently decelerating the algorithm, the best achievable variance amplification for the iterate noise model increases linearly with the settling time in the decelerated regime. We also characterize the impact of condition number on worst-case transient responses of popular accelerated algorithms and examine the noise amplification of a class of primal-dual gradient flow dynamics based on the proximal augmented Lagrangian that can be used for non-smooth convex constrained optimization problems.

We next focus on model-free reinforcement learning, which attempts to find an optimal control action for an unknown dynamical system by directly searching over the parameter space of controllers. The convergence behavior and statistical properties of these approaches are often poorly understood because of the nonconvex nature of the underlying optimization problems and the lack of exact gradient computation. In this thesis, we take a step towards demystifying the performance and efficiency of such methods by focusing on the standard linear quadratic regulator (LQR) with unknown state-space parameters. For this problem, we establish exponential stability for the ordinary differential equation (ODE) that governs the gradient-flow dynamics over the set of stabilizing feedback gains and show that a similar result holds for the standard gradient descent. We also provide theoretical bounds on the convergence rate and sample complexity of the random search method with two-point gradient estimates. We prove that in the model-free setup, the required simulation time and the total number of function evaluations both scale with the logarithm of the inverse of the desired accuracy. The key enabler of our results is the PL condition that holds for the LQR problem both in continuous and discrete time. We finish the thesis by showing the absence of this condition for the linear quadratic Gaussian problem with incomplete state information.

Chapter 1
Introduction

First-order methods are well suited for solving a broad range of optimization problems that arise in statistics, signal and image processing, control, and machine learning [1]–[5]. Among these algorithms, accelerated methods enjoy the optimal rate of convergence and they are popular because of their low per-iteration complexity. There is a large body of literature dedicated to the convergence analysis of these methods under different stepsize selection rules [4]–[9]. In many applications, however, these algorithms are brought into uncertain and noisy environments and they may only be used with limited time budgets.
For example, the exact value of the gradient is often not fully available or noise may corrupt the iterates of the algorithm due to uncertain communication. This happens when the objective function is obtained via costly simulations (e.g., tuning of hyper-parameters in supervised/unsupervised learning [10]–[12] and model-free optimal control [13]–[15]), when evaluation of the objective function relies on noisy measurements (e.g., embedded and real-time applications), or when the noise is due to communication between different agents (e.g., distributed computation over networks). Another related application arises in the context of (batch) stochastic gradient, where at each iteration the gradient of the objective function is computed from a small batch of data points. Such a batch gradient is known to be a noisy unbiased estimator for the gradient of the training loss. Moreover, additive noise may be introduced deliberately in the context of nonconvex optimization to help the iterates escape saddle points and improve generalization [16], [17].

In addition to uncertainty, many emerging applications [18], [19] that arise in modern Reinforcement Learning (RL) involve optimization landscapes that lack convexity. In these applications, control-oriented models are not readily available and classical approaches from optimal control may not be directly applicable. In spite of these challenges, model-free RL approaches that rely on first-order optimization algorithms and prescribe control actions using estimated values of a cost function achieve empirical success in a variety of domains [20], [21]. Unfortunately, however, our mathematical understanding of these algorithms is still in its infancy and there are many open questions surrounding convergence and sample complexity.

Motivated by these observations, in this dissertation we first use control-theoretic tools to analyze the stochastic performance and transient response of accelerated optimization algorithms for smooth strongly convex problems and identify fundamental tradeoffs between convergence rate and noise amplification. Then, we turn our attention to the performance of first-order methods in model-free RL and focus on the infinite-horizon Linear Quadratic Regulator (LQR) problem. In spite of the lack of convexity, we establish linear convergence of gradient descent and examine the convergence and sample complexity of the random search method [22] that attempts to emulate the behavior of gradient descent via gradient approximations resulting from evaluating random estimates of the objective function.

1.1 Main topics

In this section, we discuss the main topics of the dissertation.

1.1.1 Noise amplification of accelerated optimization algorithms

There is a vast body of literature that considers the robustness of first-order accelerated optimization algorithms under different types of noisy/inexact gradient oracles [23]–[28]. For example, in a deterministic noise scenario, an upper bound on the error in iterates for accelerated proximal gradient methods was established in [29]. This study showed that both proximal gradient and its accelerated variant can maintain their convergence rates provided that the noise is bounded and that it vanishes fast enough.
Moreover, it has been shown that in the presence of random noise, with a properly diminishing stepsize, acceleration can be achieved for general convex problems. However, in this case the optimal rates are sublinear [30]. In the context of stochastic approximation, while early results suggest using a stepsize that is inversely proportional to the iteration number [24], a more robust behavior can be obtained by combining larger stepsizes with averaging [25], [31]–[33]. The utility of these averaging schemes and their modifications for solving quadratic optimization and manifold problems has been examined thoroughly in recent years [34]–[36]. Moreover, several studies have suggested that accelerated first-order algorithms are more susceptible to errors in the gradient compared to their non-accelerated counterparts [26], [27], [29], [37]–[39].

One of the basic sources of error that arises in computing the gradient can be modeled by additive white stochastic noise. This source of error is typical for problems in which the gradient is being sought through measurements of a real system [40] and it has a rich history in the analysis of stochastic dynamical systems and control theory [41]. Moreover, in many applications including distributed computing over networks [42], [43], coordination in vehicular formations [44], [45], and control of power systems [46]–[48], additive white noise is a convenient abstraction for the robustness analysis of distributed control strategies [43] and of first-order optimization algorithms [49], [50]. Motivated by this observation, we consider the scenario in which a white stochastic noise with zero mean and identity covariance is added to the iterates of standard first-order algorithms: gradient descent, Polyak's heavy-ball method, and Nesterov's accelerated algorithm. By focusing on smooth strongly convex problems, we use control-theoretic tools to provide a tight quantitative characterization of the mean-squared error of the optimization variable. Since this quantity provides a measure of how noise gets amplified by the dynamics resulting from optimization algorithms, we also refer to it as noise (or variance) amplification.

1.1.2 Tradeoffs between noise amplification and convergence rate

While convergence properties of accelerated algorithms have been carefully studied [6], [9], [51]–[56], their performance and fundamental limitations in the presence of noise have received less attention [10]–[12], [57], [58]. Prior studies indicate that inaccuracies in the computation of gradient values can adversely impact the convergence rate of accelerated methods and that gradient descent may have advantages relative to its accelerated variants in noisy environments [23]–[26], [28]. In this dissertation, we consider the class of first-order methods with constant parameters in which the iterations involve information from the two previous steps. This class includes the heavy-ball and Nesterov's accelerated algorithms as special cases, and we examine its stochastic performance in the presence of additive white noise. For strongly convex quadratic problems, we establish analytical lower bounds on the product of the settling time and the steady-state variance of the error in the optimization variable that hold for any constant stabilizing parameters and for both gradient and iterate noise models.
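To fix notation, a schematic form of this class of algorithms is the two-step momentum update with additive noise; the precise parameterization and the two noise models are introduced in the later chapters, and the sketch below corresponds to the iterate noise model:
\[
x_{t+2} \;=\; x_{t+1} \,+\, \beta\,(x_{t+1} - x_{t}) \,-\, \alpha\, \nabla f\big(x_{t+1} + \gamma\,(x_{t+1} - x_{t})\big) \,+\, \sigma\, w_{t},
\]
where $\alpha$ is the stepsize, $\beta$ and $\gamma$ are momentum parameters, and $w_{t}$ is a zero-mean white noise with identity covariance. Gradient descent ($\beta = \gamma = 0$), Polyak's heavy-ball method ($\gamma = 0$), and Nesterov's accelerated method ($\gamma = \beta$) are obtained as special cases; in the gradient noise model, the disturbance enters through the gradient evaluation instead of the iterate. Noise amplification is quantified by the steady-state variance $J := \lim_{t \to \infty} \mathbf{E}\,\|x_{t} - x^{\star}\|^{2}$, where $x^{\star}$ denotes the unique minimizer of $f$.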
Our lower bounds reveal a fundamental limitation posed by the problem condition number for this class of algorithms. Our results build upon a simple, yet powerful geometric viewpoint, which clarifies the relation between condition number, convergence rate, and algorithmic parameters for strongly convex quadratic problems. This viewpoint allows us to present alternative proofs for the optimal convergence rate of the two-step momentum algorithm [59], [60] and that of the standard gradient descent, heavy-ball method, and Nesterov's accelerated algorithm [52]. In addition, this viewpoint enables a novel geometric characterization of noise amplification in terms of stability margins and it allows us to precisely quantify tradeoffs between convergence rate and robustness to noise.

1.1.3 Transient growth of accelerated optimization algorithms

In addition to deterioration of robustness in the face of uncertainty, asymptotically stable accelerated algorithms may also exhibit undesirable transient behavior [61]. This is in contrast to gradient descent, which is a contraction for strongly convex problems with a suitable stepsize [62]. In real-time optimization and in applications with limited time budgets, the transient growth can limit the appeal of accelerated methods. In addition, first-order methods are often used as a building block in multi-stage optimization, including ADMM [63] and distributed optimization methods [64]. In these settings, at each stage we can perform only a few iterations of first-order updates on primal or dual variables, and transient growth can have a detrimental impact on the performance of the entire algorithm. This motivates an in-depth study of the behavior of accelerated first-order methods in non-asymptotic regimes.

It is widely recognized that large transients may arise from the presence of resonant modal interactions and non-normality of linear dynamical generators [65]. Even in the absence of unstable modes, these can induce large transient responses, significantly amplify exogenous disturbances, and trigger departure from nominal operating conditions. For example, in fluid dynamics, such mechanisms can initiate departure from stable laminar flows and trigger transition to turbulence [66], [67].

To quantify the transient behavior of accelerated algorithms, we examine the ratio of the largest error in the optimization variable to the initial error. For convex quadratic problems, these algorithms can be cast as a linear time-invariant (LTI) system and modal analysis of the state-transition matrix can be performed. For both accelerated algorithms, we identify non-normal modes that create large transient growth, derive analytical expressions for the state-transition matrices, and establish bounds on the transient response in terms of the convergence rate and the iteration number. We show that both the peak value of the transient response and the rise time to this value increase with the square root of the condition number of the problem. Moreover, for general strongly convex problems, we combine a Lyapunov-based approach with the theory of Integral Quadratic Constraints (IQCs) to establish a similar upper bound on the transient response of Nesterov's accelerated algorithm.
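In rough quantitative terms (the precise statements, which also account for the two-dimensional initial state of momentum methods, appear in the corresponding chapter), the object of interest is the worst-case amplification of the initial error,
\[
\max_{t}\; \max_{x_0 \neq x^\star}\; \frac{\|x_t - x^\star\|}{\|x_0 - x^\star\|},
\]
and for parameter choices that optimize the convergence rate, both this peak and the number of iterations needed to reach it grow in proportion to $\sqrt{\kappa}$, where $\kappa$ denotes the condition number of the problem.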
1.1.4 Noise amplification of primal-dual gradient flow dynamics based on proximal augmented Lagrangian

We consider a class of primal-dual gradient flow dynamics based on the proximal augmented Lagrangian [68] that can be used for solving large-scale nonsmooth constrained optimization problems in continuous time. These problems arise in many areas, e.g., signal processing [69], statistical estimation [70], and control [71]. In addition, primal-dual methods have received renewed attention due to their prevalent application in distributed optimization [72], and their convergence and stability properties have been studied extensively [73]–[79]. While gradient-based methods are not readily applicable to nonsmooth optimization, we can utilize their proximal variants to address such problems [80]. In the context of nonsmooth constrained optimization, proximal-based extensions of primal-dual methods can also be obtained using the augmented Lagrangian [68]; these preserve structural separability and remain suitable for distributed optimization.

We extend our analysis of noise amplification to the primal-dual flow subject to additive white noise. We examine the mean-squared error of the primal optimization variable as a measure of how noise gets amplified by the dynamics. For convex quadratic optimization problems, the primal-dual flow becomes a linear time-invariant system, for which the noise amplification can be characterized using Lyapunov equations. For non-quadratic problems, the flow is no longer linear; however, tools from robust control theory can be utilized to quantify upper bounds on the noise amplification. In particular, we use IQCs [81], [82] to characterize upper bounds on the noise amplification of the primal-dual flow based on the proximal augmented Lagrangian using solutions to a certain linear matrix inequality. Our results establish tight upper bounds on the noise amplification that are inversely proportional to the strong-convexity modulus of the corresponding objective function.

1.1.5 Gradient methods for model-free linear quadratic regulator

In many emerging applications, control-oriented models are not readily available and classical approaches from optimal control may not be directly applicable. This challenge has led to the emergence of Reinforcement Learning (RL) approaches that often perform well in practice. Examples include learning complex locomotion tasks via neural network dynamics [18] and playing Atari games based on images using deep RL [19]. In spite of the empirical success of RL in a variety of domains, our mathematical understanding of it is still in its infancy and there are many open questions surrounding convergence and sample complexity. In this dissertation, we take a step towards answering such questions with a focus on the infinite-horizon Linear Quadratic Regulator (LQR) for continuous-time systems.

The LQR problem is the cornerstone of control theory. The globally optimal solution can be obtained by solving the Riccati equation, and efficient numerical schemes with provable convergence guarantees have been developed [83]. However, computing the optimal solution becomes challenging for large-scale problems, when prior knowledge is not available, or in the presence of structural constraints on the controller. This motivates the use of direct search methods for controller synthesis.
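In its basic continuous-time form (stated here only schematically; the precise problem data and assumptions appear in the corresponding chapter), direct policy search treats LQR as an optimization problem over the static feedback gain $K$,
\[
\underset{K}{\text{minimize}} \;\; f(K) \,:=\, \mathbf{E}_{x_0}\!\left[ \int_0^{\infty} \big( x^{T}(t)\, Q\, x(t) \,+\, u^{T}(t)\, R\, u(t) \big)\, \mathrm{d}t \right]
\;\; \text{subject to} \;\; \dot{x} = A x + B u, \;\; u = -K x, \;\; x(0) = x_0,
\]
where $Q$ and $R$ are state and control performance weights, the expectation is taken over a random initial condition $x_0$, and $f(K)$ is finite only over the set of stabilizing feedback gains. The gradient methods and random search schemes studied in this dissertation operate directly on this nonconvex function $f(K)$.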
Unfortunately, the nonconvex nature of this formulation complicates the analysis of first- and second-order optimization algorithms. To make matters worse, structural constraints on the feedback gain matrix may result in a disjoint search landscape, limiting the utility of conventional descent-based methods [84]. Furthermore, in the model-free setting, the exact model (and hence the gradient of the objective function) is unknown, so that only zeroth-order methods can be used.

We study the sample complexity and convergence of the random search method for the infinite-horizon LQR problem. For the continuous-time LQR, we employ a standard convex reparameterization [85], [86] to establish exponential stability of the ODE that governs the gradient-flow dynamics over the set of stabilizing feedback gains, and linear convergence of the gradient descent algorithm with a suitable stepsize for the nonconvex formulation. In the model-free setting, we also examine convergence and sample complexity of the random search method [22] that attempts to emulate the behavior of gradient descent via gradient approximations resulting from objective function values. For the discrete-time LQR, global convergence guarantees were recently provided in [13] for gradient descent and the random search method with one-point gradient estimates. For the two-point gradient estimation setting, we prove linear convergence of the random search method and show that the total number of function evaluations and the simulation time required in our results scale with the logarithm of the inverse of the desired accuracy in both continuous and discrete time.

1.1.6 Optimization landscape of the linear quadratic Gaussian

Among model-free RL approaches, simple random search achieves a logarithmic complexity if one can access the so-called two-point gradient estimates [14], [87]. These results build on the fact that gradient descent itself achieves linear convergence for both discrete- [13] and continuous-time LQR problems [88] despite the lack of convexity. A key enabler for these results is the so-called gradient dominance property of the underlying optimization problem, which can be used as a surrogate for strong convexity [89]. Motivated by this observation, we study the convergence of gradient descent for the Linear Quadratic Gaussian (LQG) problem with incomplete state information.

The separation principle states that the solution to the LQG problem is given by an observer-based controller, which consists of a Kalman filter and the corresponding LQR solution. This problem is also closely related to the output-feedback problem for distributed control, which is known to be fundamentally more challenging than LQR. In particular, the output-feedback problem has been shown to involve an optimization domain with an exponential number of connected components [84], [90]. In contrast, the standard LQG problem allows for dynamic controllers and does not impose structural constraints on the controller. We reformulate the LQG problem as a joint optimization of the control and observer feedback gains whose domain, unlike that of the output-feedback problem, is connected. We derive analytical expressions for the gradient of the LQG cost function with respect to the gain matrices and demonstrate through examples that LQG does not satisfy the gradient dominance property.
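For reference, gradient dominance (the Polyak–Łojasiewicz condition invoked above) requires the existence of a constant $\mu > 0$ such that
\[
f(K) \,-\, f(K^{\star}) \;\leq\; \frac{1}{2\mu}\, \big\| \nabla f(K) \big\|_F^{2}
\]
for all admissible $K$, where $K^{\star}$ is a global minimizer. Under this inequality every stationary point is globally optimal, which is precisely the property that underpins the linear convergence results for LQR and precisely what fails here.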
In particular, we show that, in addition to the global solution, the gradient vanishes at the origin for open-loop stable systems. Our study disproves global exponential convergence of policy gradient methods for LQG.

1.2 Dissertation structure

This dissertation consists of two main parts, each of which focuses on a specific topic and includes individual chapters that study relevant subjects. Each chapter is self-contained in that it provides introduction, preliminaries and background material, problem formulation, methodology, technical results, and concluding remarks. Proofs of technical results are relegated to the corresponding appendices.

Part I

We study the stochastic performance of two-step momentum algorithms with additive white noise that accounts for uncertainty in either gradient evaluation or iteration updates. For smooth, strongly convex optimization problems, we examine the mean-squared error in the optimization variable to quantify noise amplification. By leveraging the theory of Lyapunov and integral quadratic constraints, we establish an upper bound on the noise amplification of Nesterov's method with standard parameters that is tight up to a constant factor. We also use strongly convex quadratic problems to identify fundamental tradeoffs between noise amplification and convergence rate for the two-step momentum algorithms. We use modal decomposition to introduce a novel geometric characterization of conditions for linear convergence that clarifies the relation between the noise amplification and convergence rate as well as their dependence on the condition number and the constant algorithmic parameters. This geometric insight leads to simple alternative proofs of standard convergence results and allows us to establish analytical lower bounds on the product between the settling time and noise amplification that scale quadratically with the condition number. We also characterize the impact of condition number on worst-case transient responses of popular accelerated algorithms, and we examine the noise amplification of a class of primal-dual gradient flow dynamics based on the proximal augmented Lagrangian that can be used for non-smooth convex constrained optimization problems.

Part II

In the second part, we focus on model-free reinforcement learning, which attempts to find an optimal control action for an unknown dynamical system by directly searching over the parameter space of controllers. The convergence behavior and statistical properties of these approaches are often poorly understood because of the nonconvex nature of the underlying optimization problems and the lack of exact gradient computation. In this thesis, we take a step towards demystifying the performance and efficiency of such methods by focusing on the standard infinite-horizon linear quadratic regulator problem with unknown state-space parameters. We establish exponential stability for the ordinary differential equation (ODE) that governs the gradient-flow dynamics over the set of stabilizing feedback gains and show that a similar result holds for the standard gradient descent. We also provide theoretical bounds on the convergence rate and sample complexity of the random search method with two-point gradient estimates.
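As a rough sketch of the estimator underlying these sample complexity results (the exact construction, scaling, and probabilistic assumptions are given in the corresponding chapters), a two-point gradient estimate of the LQR objective $f$ at a feedback gain $K$ takes the form
\[
\widehat{\nabla f}(K) \;=\; \frac{1}{2 r N} \sum_{i=1}^{N} \big( f(K + r\, U_i) \,-\, f(K - r\, U_i) \big)\, U_i,
\]
where $U_1, \dots, U_N$ are random perturbation directions, $r$ is a small smoothing parameter, and $N$ is the number of roll-outs per iteration. In the model-free setting, the function values themselves are not available exactly and are approximated from finite-time simulations of the system, which is why both the simulation time and the number of function evaluations enter the sample complexity bounds.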
We prove that in the model-free setup, the required simulation time and the total number of function evaluations both scale with the logarithm of the inverse of the desired accuracy. The key enabler of our results is the PL condition that holds for the LQR problem both in continuous and discrete time. We finish the thesis by showing the absence of this condition for the linear quadratic Gaussian problem with incomplete state information.

1.3 Contributions of the dissertation

In this section, we provide a summary of the main contributions of each part. The chapters presented here are a reproduction of the materials that have been (or are still under review to be) published in journals and conference proceedings. We have made only some minor changes that were necessary to meet the guidelines for this document.

Part I

Noise amplification of accelerated algorithms

We study the robustness of noisy heavy-ball and Nesterov's accelerated methods for smooth, strongly convex optimization problems. Even though the underlying dynamics of these algorithms are in general nonlinear, we establish upper bounds on noise amplification that are accurate up to constant factors. For quadratic objective functions, we provide analytical expressions that quantify the effect of all eigenvalues of the Hessian matrix on variance amplification. We use these expressions to establish lower bounds demonstrating that although acceleration techniques improve the convergence rate, they significantly amplify noise for problems with large condition numbers $\kappa$. In problems of size $n \ll \kappa$, the noise amplification increases from $O(\kappa)$ to $\Omega(\kappa^{3/2})$ when moving from standard gradient descent to accelerated algorithms. We specialize our results to the problem of distributed averaging over noisy undirected networks and also study the role of network size and topology on the robustness of accelerated algorithms [91]–[93].

Tradeoffs between convergence rate and noise amplification

We examine the amplification of stochastic disturbances for a class of two-step momentum algorithms in which the iterates are perturbed by additive white noise. This class of algorithms includes Polyak's heavy-ball and Nesterov's accelerated methods as special cases, and the noise arises from uncertainties in gradient evaluation or in computing the iterates. For both gradient and iterate noise models, we establish lower bounds on the product of the settling time and the smallest/largest steady-state variance of the error in the optimization variable among the class of strongly convex quadratic optimization problems. Our bounds scale quadratically with the condition number for all stabilizing parameters, which reveals a fundamental limitation imposed by the condition number in designing algorithms that trade off noise amplification and convergence rate. In addition, we provide a novel geometric viewpoint of stability and linear convergence. This viewpoint brings insight into the relation between noise amplification, convergence rate, and algorithmic parameters. It also allows us to (i) take an alternative approach to optimizing convergence rates for standard algorithms; (ii) identify key similarities and differences between the iterate and gradient noise models; and (iii) introduce parameterized families of algorithms for which the parameters can be continuously adjusted to trade off noise amplification and settling time.
By utilizing positive and negative momentum parameters in the accelerated and decelerated regimes, respectively, we demonstrate that a parameterized family of heavy-ball-like algorithms can achieve order-wise Pareto optimality for all settling times and both noise models. We also extend our analysis to continuous-time dynamical systems that can be discretized via an implicit-explicit Euler scheme to obtain the two-step momentum algorithm. For such gradient flow dynamics, we show that similar fundamental stochastic performance limitations hold as in discrete time [94], [95].

Transient growth of accelerated algorithms

We examine the impact of acceleration on the transient responses of popular first-order optimization algorithms. Without imposing restrictions on initial conditions, we establish bounds on the largest value of the Euclidean distance between the optimization variable and the global minimizer. For convex quadratic problems, we utilize tools from linear systems theory to fully capture transient responses and, for general strongly convex problems, we employ the theory of integral quadratic constraints to establish an upper bound on transient growth. This upper bound is proportional to the square root of the condition number, and we identify quadratic problem instances for which accelerated algorithms generate transient responses that are within a constant factor of this upper bound [96]–[98].

Noise amplification of primal-dual gradient flow dynamics based on the proximal augmented Lagrangian

We examine the noise amplification of proximal primal-dual gradient flow dynamics that can be used to solve non-smooth composite optimization problems. For quadratic problems, we employ algebraic Lyapunov equations to establish analytical expressions for the noise amplification. We also utilize the theory of IQCs to characterize tight upper bounds in terms of a solution to a linear matrix inequality. Our results show that the stochastic performance of the primal-dual dynamics is inversely proportional to the strong-convexity module of the smooth part of the objective function [99].

Part II

Random search for continuous-time LQR

We prove exponential/linear convergence of gradient flow/descent algorithms for solving the continuous-time Linear Quadratic Regulator problem based on a nonconvex formulation that directly searches for the controller. A salient feature of our analysis is that we relate the gradient-flow dynamics associated with this nonconvex formulation to that of a convex reparameterization. This allows us to deduce convergence of the nonconvex approach from its convex counterpart. We also establish a bound on the sample complexity of the random search method for solving the continuous-time LQR problem that does not require knowledge of system parameters. We prove that in the model-free setup with two-point gradient estimates, the required simulation time and the total number of function evaluations both scale with the logarithm of the inverse of the desired accuracy [14], [15], [88], [100], [101].

Random search for discrete-time LQR

We study the convergence and sample complexity of the random search method with two-point gradient estimates for the discrete-time LQR problem.
Despite nonconvexity, we establish that the random search method with a fixed number of roll-outs per iteration that is proportional to the problem size requires a simulation time and a total number of function evaluations that scale with the logarithm of the inverse of the desired accuracy [87].

Lack of gradient domination for LQG

Motivated by recent results on the global exponential convergence of policy gradient algorithms for the model-free LQR problem, which rely on the so-called gradient dominance property, we study the standard Linear Quadratic Gaussian problem as optimization over controller and observer feedback gains. We present an explicit formula for the gradient and demonstrate that, for open-loop stable systems, in addition to the unique global minimizer, the origin is also a critical point of the LQG problem, thus disproving the gradient dominance property for this class of problems [102].

Part I: Robustness of accelerated first-order optimization algorithms for strongly convex optimization problems

Chapter 2: Noise amplification of accelerated algorithms

In this chapter, we study the robustness of accelerated first-order algorithms to stochastic uncertainties in gradient evaluation. Specifically, for unconstrained, smooth, strongly convex optimization problems, we examine the mean-squared error in the optimization variable when the iterates are perturbed by additive white noise. This type of uncertainty may arise in situations where an approximation of the gradient is sought through measurements of a real system or in distributed computation over a network. Even though the underlying dynamics of first-order algorithms for this class of problems are nonlinear, we establish upper bounds on the mean-squared deviation from the optimal solution that are tight up to constant factors. Our analysis quantifies fundamental tradeoffs between noise amplification and the convergence rates obtained via any acceleration scheme similar to Nesterov's or heavy-ball methods. To gain additional analytical insight, for strongly convex quadratic problems we explicitly evaluate the steady-state variance of the optimization variable in terms of the eigenvalues of the Hessian of the objective function. We demonstrate that the entire spectrum of the Hessian, rather than just the extreme eigenvalues, influences noise amplification. We specialize this result to the problem of distributed averaging over undirected networks and examine the role of network size and topology on the robustness of noisy accelerated algorithms.

2.1 Introduction

First-order methods are well suited for solving a broad range of optimization problems that arise in statistics, signal and image processing, control, and machine learning [1]–[5]. Among these algorithms, accelerated methods enjoy the optimal rate of convergence and are popular because of their low per-iteration complexity. There is a large body of literature dedicated to the convergence analysis of these methods under different stepsize selection rules [4]–[9].
In many applications, however, the exact value of the gradient is not fully available, e.g., when the objective function is obtained via costly simulations (e.g., tuning of hyper-parameters in supervised/unsupervised learning [10]–[12] and model-free optimal control [13]–[15]), when evaluation of the objective function relies on noisy measurements (e.g., real-time and embedded applications), or when noise is introduced via communication between different agents (e.g., distributed computation over networks). Another related application arises in the context of (batch) stochastic gradient methods, where at each iteration the gradient of the objective function is computed from a small batch of data points. Such a batch gradient is a noisy unbiased estimator of the gradient of the training loss. Moreover, additive noise may be introduced deliberately in the context of nonconvex optimization to help the iterates escape saddle points and improve generalization [16], [17].

In all of the above situations, first-order algorithms only have access to noisy estimates of the gradient. This observation has motivated the robustness analysis of these algorithms under different types of noisy/inexact gradient oracles [23]–[28]. For example, in a deterministic noise scenario, an upper bound on the error in the iterates of accelerated proximal gradient methods was established in [29]. This study showed that both the proximal gradient method and its accelerated variant can maintain their convergence rates provided that the noise is bounded and vanishes fast enough. Moreover, it has been shown that in the presence of random noise, with a properly diminishing stepsize, acceleration can be achieved for general convex problems; however, in this case the optimal rates are sub-linear [30]. In the context of stochastic approximation, while early results suggest using a stepsize that is inversely proportional to the iteration number [24], a more robust behavior can be obtained by combining larger stepsizes with averaging [25], [31]–[33]. The utility of these averaging schemes and their modifications for solving quadratic optimization and manifold problems has been examined thoroughly in recent years [34]–[36]. Moreover, several studies have suggested that accelerated first-order algorithms are more susceptible to errors in the gradient than their non-accelerated counterparts [26], [27], [29], [37]–[39].

One of the basic sources of error in computing the gradient can be modeled by additive white stochastic noise. This source of error is typical for problems in which the gradient is sought through measurements of a real system [40] and it has a rich history in the analysis of stochastic dynamical systems and control theory [41]. Moreover, in many applications including distributed computing over networks [42], [43], coordination in vehicular formations [44], [45], and control of power systems [46]–[48], additive white noise is a convenient abstraction for the robustness analysis of distributed control strategies [43] and of first-order optimization algorithms [49], [50].
Motivated by this observation, in this chapter we consider the scenario in which white stochastic noise with zero mean and identity covariance is added to the iterates of standard first-order algorithms: gradient descent, Polyak's heavy-ball method, and Nesterov's accelerated algorithm. By focusing on smooth, strongly convex problems, we provide a tight quantitative characterization of the mean-squared error of the optimization variable. Since this quantity measures how noise is amplified by the dynamics resulting from optimization algorithms, we also refer to it as noise (or variance) amplification. We demonstrate that this quantitative characterization allows us to identify fundamental tradeoffs between noise amplification and the rate of convergence obtained via acceleration.

This work builds on our recent conference papers [91], [92]. In a concurrent work [103], a similar approach was taken to analyze the robustness of gradient descent and Nesterov's accelerated method. Therein, it was shown that for a given convergence rate, one can select the algorithmic parameters such that the steady-state mean-squared error in the objective value of a Nesterov-like method becomes smaller than that of gradient descent. This is not surprising because gradient descent can be viewed as a special case of Nesterov's method with a zero momentum parameter. Using this argument, similar assertions have been made about the variance amplification of the iterates. This observation has also been used to design an optimal multi-stage algorithm that does not require any information about the variance of the noise [104]. On the contrary, we demonstrate that there are fundamental differences between these two robustness measures, i.e., objective values and iterates, as the former does not capture the negative impact of acceleration in the presence of noise. By confining our attention to the error in the iterates, we show that any choice of parameters for Nesterov's or heavy-ball methods that yields an accelerated convergence rate increases variance amplification relative to gradient descent. More precisely, for a problem with condition number κ, an algorithm with an accelerated convergence rate of at least 1 − c/√κ, where c is a positive constant, increases the variance amplification in the iterates by a factor of √κ.

The robustness problem was also studied in [58], where the authors show a similar behavior of Nesterov's method and gradient descent in an asymptotic regime in which the stepsize goes to zero. In contrast, we focus on the non-asymptotic stepsize regime and establish fundamental differences between gradient descent and its accelerated variants in terms of noise amplification. More recently, the problem of finding upper bounds on the variance amplification was cast as a semidefinite program [105]. This formulation provided numerical results that are consistent with our theoretical upper bounds in terms of the condition number. In [105], structured objective functions (e.g., diagonal Hessians) that arise in distributed optimization were also studied and the problem of designing robust algorithms was formulated as a bilinear matrix inequality (which, in general, is not convex).
Contributions

The effect of imperfections on the performance and robustness of first-order algorithms has been studied in [27], [35], but the influence of acceleration on stochastic gradient perturbations has not been precisely characterized. We employ control-theoretic tools suitable for analyzing stochastic dynamical systems to quantify this influence and identify fundamental tradeoffs between acceleration and noise amplification. The main contributions of this chapter are:

1. We start our analysis by examining strongly convex quadratic optimization problems for which we can explicitly characterize the variance amplification of first-order algorithms and obtain analytical insight. In contrast to convergence rates, which depend solely on the extreme eigenvalues of the Hessian matrix, we demonstrate that the variance amplification is influenced by the entire spectrum.

2. We establish the relation between the noise amplification of accelerated algorithms and gradient descent for parameters that provide the optimal convergence rate for strongly convex quadratic problems. We also explain how the distribution of the eigenvalues of the Hessian influences these relations and provide examples to show that acceleration can significantly increase the noise amplification.

3. We address the problem of tuning the algorithmic parameters and demonstrate the existence of a fundamental tradeoff between convergence rate and noise amplification: for problems with condition number κ and bounded dimension n, we show that any choice of parameters in accelerated methods that yields a linear convergence rate of at least 1 − c/√κ, where c is a positive constant, increases noise amplification in the iterates relative to gradient descent by a factor of at least √κ.

4. We extend our analysis from quadratic objective functions to general strongly convex problems. We borrow an approach based on linear matrix inequalities from control theory to establish upper bounds on the noise amplification of both gradient descent and Nesterov's accelerated algorithm. Furthermore, for any given condition number, we demonstrate that these bounds are tight up to constant factors.

5. We apply our results to distributed averaging over large-scale undirected networks. We examine the role of network size and topology on noise amplification and further illustrate the subtle influence of the entire spectrum of the Hessian matrix on the robustness of noisy optimization algorithms. In particular, we identify a class of large-scale problems for which accelerated Nesterov's method achieves the same order-wise noise amplification (in terms of the condition number) as gradient descent.

Chapter structure

The rest of our presentation is organized as follows. In Section 2.2, we formulate the problem and provide background material. In Section 2.3, we explicitly evaluate the variance amplification (in terms of the algorithmic parameters and problem data) for strongly convex quadratic problems, derive lower and upper bounds, and provide a comparison between the accelerated methods and gradient descent. In Section 2.4, we extend our analysis to general strongly convex problems. In Section 2.5, we establish fundamental tradeoffs between the rate of convergence and noise amplification. In Section 2.6, we apply our results to the problem of distributed averaging over noisy undirected networks.
We highlight the subtle influence of the distribution of the eigenvalues of the Laplacian matrix on variance amplification and discuss the roles of network size and topology. We provide concluding remarks in Section 2.7 and technical details in the appendices.

2.2 Preliminaries and background

In this chapter, we quantify the effect of stochastic uncertainties in gradient evaluation on the performance of first-order algorithms for unconstrained optimization problems

    minimize_x  f(x)                                                               (2.1)

where f: R^n → R is strongly convex with Lipschitz continuous gradient ∇f. Specifically, we examine how gradient descent,

    x^{t+1} = x^t − α∇f(x^t) + σw^t                                                (2.2a)

Polyak's heavy-ball method,

    x^{t+2} = x^{t+1} + β(x^{t+1} − x^t) − α∇f(x^{t+1}) + σw^t                     (2.2b)

and Nesterov's accelerated method,

    x^{t+2} = x^{t+1} + β(x^{t+1} − x^t) − α∇f(x^{t+1} + β(x^{t+1} − x^t)) + σw^t   (2.2c)

amplify the additive white stochastic noise w^t with zero mean and identity covariance matrix, E[w^t] = 0, E[w^t (w^τ)^T] = I δ(t − τ). Here, t is the iteration index, x^t is the optimization variable, α is the stepsize, β is an extrapolation parameter used for acceleration, σ is the noise magnitude, δ is the Kronecker delta, and E is the expected value. When the only source of uncertainty is a noisy gradient, we set σ = α in (2.2).

Table 2.1: Conventional parameter values and the corresponding rates for f ∈ F_m^L, ∥x^t − x^⋆∥ ≤ c ρ^t ∥x^0 − x^⋆∥, where κ := L/m and c > 0 is a constant [9, Theorems 2.1.15, 2.2.1]. The heavy-ball method does not offer acceleration guarantees for all f ∈ F_m^L.

    Method      Parameters                            Linear rate
    Gradient    α = 1/L                               ρ = sqrt(1 − 2/(κ + 1))
    Nesterov    α = 1/L,  β = (√κ − 1)/(√κ + 1)       ρ = sqrt(1 − 1/√κ)

The set of functions f that are m-strongly convex and L-smooth is denoted by F_m^L; f ∈ F_m^L means that f(x) − (m/2)∥x∥² is convex and that the gradient ∇f is L-Lipschitz continuous. In particular, for a twice continuously differentiable function f with Hessian matrix ∇²f, we have

    f ∈ F_m^L  ⇔  mI ⪯ ∇²f(x) ⪯ LI,  for all x ∈ R^n.

In the absence of noise (i.e., for σ = 0) and for f ∈ F_m^L, the parameters α and β can be selected such that gradient descent and Nesterov's accelerated method converge to the global minimum x^⋆ of (2.1) at a linear rate ρ < 1, i.e., ∥x^t − x^⋆∥ ≤ c ρ^t ∥x^0 − x^⋆∥ for all t and some c > 0. Table 2.1 provides the conventional values of these parameters and the corresponding guaranteed convergence rates [9]. Nesterov's method with the parameters provided in Table 2.1 enjoys the convergence rate ρ_na = sqrt(1 − 1/√κ) ≤ 1 − 1/(2√κ), where κ := L/m is the condition number associated with F_m^L. This rate is order-wise optimal in the sense that no first-order algorithm can optimize all f ∈ F_m^L at a rate better than ρ_lb = (√κ − 1)/(√κ + 1) [9, Theorem 2.1.13]. Note that 1 − ρ_lb = O(1/√κ) and 1 − ρ_na = Ω(1/√κ). In contrast to Nesterov's method, the heavy-ball method does not offer any acceleration guarantees for all f ∈ F_m^L. However, for strongly convex quadratic f, its parameters can be selected to guarantee linear convergence at a rate that outperforms the one achieved by Nesterov's method [52]; see Table 2.2.

To provide a quantitative characterization of the robustness of algorithms (2.2) to the noise w^t, we examine the performance measure

    J := limsup_{t→∞} (1/t) Σ_{k=0}^t E ∥x^k − x^⋆∥².                              (2.3)
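To make the performance measure (2.3) concrete, the following sketch estimates J by running the noisy recursions (2.2) and time-averaging ∥x^t − x^⋆∥². It is an illustration only and not part of the dissertation; it assumes Python with NumPy, a diagonal quadratic objective (so that x^⋆ = 0 and ∇f(x) = Qx), and hypothetical helper names such as run_noisy.

import numpy as np

def run_noisy(method, Q, alpha, beta, sigma, iters=200000, burn_in=50000, seed=0):
    # Simulate the noisy recursions (2.2) for f(x) = x^T Q x / 2 and
    # estimate J in (2.3) by time-averaging ||x^t||^2 after a burn-in period.
    rng = np.random.default_rng(seed)
    n = Q.shape[0]
    x_prev = np.zeros(n)
    x = np.zeros(n)
    acc = 0.0
    for t in range(iters):
        w = rng.standard_normal(n)
        if method == "gd":                       # gradient descent (2.2a)
            x_next = x - alpha * (Q @ x) + sigma * w
        elif method == "hb":                     # heavy-ball method (2.2b)
            x_next = x + beta * (x - x_prev) - alpha * (Q @ x) + sigma * w
        else:                                    # Nesterov's method (2.2c)
            y = x + beta * (x - x_prev)
            x_next = y - alpha * (Q @ y) + sigma * w
        x_prev, x = x, x_next
        if t >= burn_in:
            acc += x @ x
    return acc / (iters - burn_in)

# Example: kappa = 100 with the heavy-ball parameters from Table 2.2
L_, m_ = 1.0, 0.01
kappa = L_ / m_
alpha = 4.0 / (np.sqrt(L_) + np.sqrt(m_)) ** 2
beta = ((np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)) ** 2
print(run_noisy("hb", np.diag([L_, m_]), alpha, beta, sigma=1.0))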
For quadratic objective functions, algorithms (2.2) are linear dynamical systems. In this case, J quantifies the steady-state variance amplification and it can be computed from the solution of an algebraic Lyapunov equation; see Section 2.3. For general strongly convex problems, there is no explicit characterization of J, but techniques from control theory can be utilized to compute an upper bound; see Section 2.4.

Notation. We write g = Ω(h) (or, equivalently, h = O(g)) to denote the existence of positive constants c_i such that, for any x > c_2, the functions g and h: R → R satisfy g(x) ≥ c_1 h(x). We write g = Θ(h), or more informally g ≈ h, if both g = Ω(h) and g = O(h).

2.3 Strongly convex quadratic problems

Consider a strongly convex quadratic objective function

    f(x) = (1/2) x^T Q x − q^T x                                                   (2.4)

where Q is a symmetric positive definite matrix and q is a vector. Let f ∈ F_m^L and let the eigenvalues λ_i of Q satisfy L = λ_1 ≥ λ_2 ≥ ... ≥ λ_n = m > 0. In the absence of noise, the constant parameter values α and β provided in Table 2.2 yield linear convergence (with optimal decay rates) to the globally optimal point x^⋆ = Q^{-1} q for all three algorithms [52]. In the presence of additive white noise w^t, we derive analytical expressions for the variance amplification J of algorithms (2.2) and demonstrate that J depends not only on the algorithmic parameters α and β but also on all eigenvalues of the Hessian matrix Q. This should be compared and contrasted with the optimal rate of linear convergence, which depends only on κ := L/m, i.e., the ratio of the largest and smallest eigenvalues of Q.

For constant α and β, algorithms (2.2) can be described by the linear time-invariant (LTI) first-order recursion

    ψ^{t+1} = A ψ^t + σ B w^t,     z^t = C ψ^t                                     (2.5)

where ψ^t is the state, z^t := x^t − x^⋆ is the performance output, and w^t is a white stochastic input. In particular, choosing ψ^t := x^t − x^⋆ for gradient descent and ψ^t := [(x^t − x^⋆)^T  (x^{t+1} − x^⋆)^T]^T for the accelerated algorithms yields state-space model (2.5) with

    A = I − αQ,   B = C = I

for gradient descent and

    A = [ 0      I             ],     A = [ 0              I                ]
        [ −βI    (1+β)I − αQ   ]          [ −β(I − αQ)     (1+β)(I − αQ)    ]

for the heavy-ball and Nesterov's methods, respectively, with B^T = [0  I] and C = [I  0].

Since w^t is zero mean, we have E(ψ^{t+1}) = A E(ψ^t). Thus, E(ψ^t) = A^t E(ψ^0) and, for any stabilizing parameters α and β, lim_{t→∞} E(ψ^t) = 0, with the same linear rate as in the absence of noise. Furthermore, it is well known that the covariance matrix P^t := E[ψ^t (ψ^t)^T] of the state vector satisfies the linear recursion

    P^{t+1} = A P^t A^T + σ² B B^T                                                 (2.6a)

and that its steady-state limit

    P := lim_{t→∞} E[ψ^t (ψ^t)^T]                                                  (2.6b)

is the unique solution to the algebraic Lyapunov equation [41]

    P = A P A^T + σ² B B^T.                                                        (2.6c)

For stable LTI systems, performance measure (2.3) simplifies to the steady-state variance of the error in the optimization variable z^t := x^t − x^⋆,

    J = lim_{t→∞} (1/t) Σ_{k=0}^t E ∥z^k∥² = lim_{t→∞} E ∥z^t∥²                     (2.6d)

and it can be computed using either of the following two equivalent expressions

    J = lim_{t→∞} (1/t) Σ_{k=0}^t trace(Z^k) = trace(Z)                            (2.6e)

where Z = C P C^T is the steady-state limit of the covariance matrix Z^t := E[z^t (z^t)^T] = C P^t C^T of the output z^t.
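Before deriving the analytical solution in the next subsection, note that for any given problem instance J can be evaluated numerically from (2.6c) and (2.6e). The sketch below is illustrative only (not the author's code); it assumes SciPy's solve_discrete_lyapunov and builds the heavy-ball state-space matrices from (2.5).

import numpy as np
from scipy.linalg import solve_discrete_lyapunov

def variance_amplification_hb(Q, alpha, beta, sigma=1.0):
    # Heavy-ball state-space model (2.5): psi^t = [x^t - x*; x^{t+1} - x*]
    n = Q.shape[0]
    I = np.eye(n)
    A = np.block([[np.zeros((n, n)), I],
                  [-beta * I, (1 + beta) * I - alpha * Q]])
    B = np.vstack([np.zeros((n, n)), I])
    C = np.hstack([I, np.zeros((n, n))])
    # Solve P = A P A^T + sigma^2 B B^T, cf. (2.6c), and return J = trace(C P C^T)
    P = solve_discrete_lyapunov(A, sigma ** 2 * (B @ B.T))
    return np.trace(C @ P @ C.T)

# Example: same 2x2 quadratic and Table 2.2 parameters as in the earlier sketch
L_, m_ = 1.0, 0.01
kappa = L_ / m_
alpha = 4.0 / (np.sqrt(L_) + np.sqrt(m_)) ** 2
beta = ((np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)) ** 2
print(variance_amplification_hb(np.diag([L_, m_]), alpha, beta))

The value returned by this Lyapunov-based computation should agree (up to Monte Carlo error) with the time-averaged estimate produced by the simulation sketch above.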
We next provide the analytical solution P to (2.6c), which depends on the parameters α and β as well as on the spectrum of the Hessian matrix Q. This allows us to explicitly characterize the variance amplification J and quantify the impact of additive white noise on the performance of first-order optimization algorithms.

2.3.1 Influence of the eigenvalues of the Hessian matrix

We use the modal decomposition of the symmetric matrix Q = V Λ V^T to bring A, B, and C in (2.5) into block-diagonal form, Â = diag(Â_i), B̂ = diag(B̂_i), Ĉ = diag(Ĉ_i), with i = 1, ..., n. Here, Λ = diag(λ_i) is the diagonal matrix of the eigenvalues and V is the orthogonal matrix of the eigenvectors of Q.

Table 2.2: Optimal parameters and the corresponding convergence rates for a strongly convex quadratic objective function f ∈ F_m^L with λ_max(∇²f) = L, λ_min(∇²f) = m, and κ := L/m [52, Proposition 1].

    Method      Optimal parameters                                     Rate of linear convergence
    Gradient    α = 2/(L + m)                                          ρ = (κ − 1)/(κ + 1)
    Nesterov    α = 4/(3L + m),  β = (√(3κ+1) − 2)/(√(3κ+1) + 2)       ρ = (√(3κ+1) − 2)/√(3κ+1)
    Heavy-ball  α = 4/(√L + √m)²,  β = (√κ − 1)²/(√κ + 1)²             ρ = (√κ − 1)/(√κ + 1)

More specifically, the unitary coordinate transformation

    x̂^t := V^T x^t,   x̂^⋆ := V^T x^⋆,   ŵ^t := V^T w^t                            (2.7)

brings the state-space model of gradient descent into a diagonal form with

    ψ̂^t_i = x̂^t_i − x̂^⋆_i,   Â_i = 1 − αλ_i,   B̂_i = Ĉ_i = 1.                    (2.8a)

Similarly, for Polyak's heavy-ball and Nesterov's accelerated methods, we can use the change of coordinates (2.7) in conjunction with a permutation of variables, ψ̂^t_i = [x̂^t_i − x̂^⋆_i   x̂^{t+1}_i − x̂^⋆_i]^T, to obtain, respectively,

    Â_i = [ 0     1               ],   B̂_i = [0  1]^T,   Ĉ_i = [1  0]             (2.8b)
          [ −β    1 + β − αλ_i    ]

    Â_i = [ 0                 1                 ],   B̂_i = [0  1]^T,   Ĉ_i = [1  0].   (2.8c)
          [ −β(1 − αλ_i)      (1+β)(1 − αλ_i)   ]

This block-diagonal structure allows us to explicitly solve Lyapunov equation (2.6c) for P and derive an analytical expression for J in terms of the eigenvalues λ_i of the Hessian matrix Q and the algorithmic parameters α and β. Namely, under coordinate transformation (2.7) and a suitable permutation of variables, equation (2.6c) can be brought into an equivalent set of equations

    P̂_i = Â_i P̂_i Â_i^T + σ² B̂_i B̂_i^T,   i = 1, ..., n                           (2.9)

where P̂_i is a scalar for gradient descent and a 2 × 2 matrix for the accelerated algorithms. In Theorem 1, we use the solution to these decoupled Lyapunov equations to express the variance amplification as

    J = Σ_{i=1}^n Ĵ(λ_i) := Σ_{i=1}^n trace(Ĉ_i P̂_i Ĉ_i^T)

where Ĵ(λ_i) determines the contribution of the eigenvalue λ_i of the matrix Q to the variance amplification. In what follows, we use subscripts gd, hb, and na (e.g., J_gd, J_hb, and J_na) to denote quantities that correspond to gradient descent (2.2a), the heavy-ball method (2.2b), and Nesterov's accelerated method (2.2c).

Theorem 1. For strongly convex quadratic problems, the variance amplification of noisy first-order algorithms (2.2) with any constant stabilizing parameters α and β is determined by J = Σ_{i=1}^n Ĵ(λ_i), where λ_i is the ith eigenvalue of Q = Q^T ≻ 0 and the modal contribution Ĵ(λ) to the variance amplification is given by

    Gradient:   Ĵ_gd(λ) = σ² / (αλ (2 − αλ))
    Polyak:     Ĵ_hb(λ) = σ² (1 + β) / (αλ (1 − β) (2(1 + β) − αλ))
    Nesterov:   Ĵ_na(λ) = σ² (1 + β(1 − αλ)) / (αλ (1 − β(1 − αλ)) (2(1 + β) − (2β + 1) αλ)).

Proof: See Appendix A.1. □

For strongly convex quadratic problems, Theorem 1 provides exact expressions for the variance amplification of the first-order algorithms.
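As a quick illustration of Theorem 1 (a minimal sketch under the same Python/NumPy assumptions as above, not from the thesis), the modal formulas can be evaluated directly and summed over the spectrum of Q; for the heavy-ball model this should match the Lyapunov-based computation sketched earlier.

import numpy as np

def J_gd(lam, alpha, sigma=1.0):
    return sigma ** 2 / (alpha * lam * (2 - alpha * lam))

def J_hb(lam, alpha, beta, sigma=1.0):
    return sigma ** 2 * (1 + beta) / (alpha * lam * (1 - beta) * (2 * (1 + beta) - alpha * lam))

def J_na(lam, alpha, beta, sigma=1.0):
    b = beta * (1 - alpha * lam)
    return sigma ** 2 * (1 + b) / (alpha * lam * (1 - b) * (2 * (1 + beta) - (2 * beta + 1) * alpha * lam))

def total_J(modal, eigs, *params):
    # Theorem 1: J = sum of the modal contributions over the eigenvalues of Q
    return sum(modal(lam, *params) for lam in eigs)

eigs = np.array([1.0, 0.5, 0.01])                 # spectrum of Q (L = 1, m = 0.01)
print(total_J(J_gd, eigs, 2.0 / (1.0 + 0.01)))    # gradient descent with alpha = 2/(L + m)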
In addition to quantifying the dependence of J on the algorithmic parameters α and β and the impact of the largest and smallest eigenvalues, the expressions in Theorem 1 capture the effect of all other eigenvalues of the Hessian matrix Q. We also observe that the variance amplification J is proportional to σ². Apart from Section 2.5, where we examine the role of the parameters α and β in the acceleration/robustness tradeoff and allow the dependence of σ on α, without loss of generality we choose σ = 1 in the rest of the chapter.

Remark 1. The performance measure J in (2.6d) quantifies the steady-state variance of the iterates of first-order algorithms. Robustness of noisy algorithms can also be evaluated using alternative performance measures, e.g., the mean value of the error in the objective function [103],

    J′ = lim_{t→∞} E[(x^t − x^⋆)^T Q (x^t − x^⋆)].                                 (2.10)

This measure of variance amplification can be characterized using our approach by defining C = Q^{1/2} for gradient descent and C = [Q^{1/2}  0] for the accelerated algorithms in state-space model (2.5). Furthermore, repeating the above procedure for the modified performance output z^t yields J′ = Σ_{i=1}^n λ_i Ĵ(λ_i), where the respective expressions for Ĵ(λ_i) are given in Theorem 1.

2.3.2 Comparison for parameters that optimize the convergence rate

We next examine the robustness of first-order algorithms applied to strongly convex quadratic problems for the parameters that optimize the linear convergence rate; see Table 2.2. For these parameters, the eigenvalues of the matrix A are inside the open unit disk, which implies exponential stability of system (2.5). We first use the expressions presented in Theorem 1 to compare the variance amplification of the heavy-ball method to that of gradient descent.

Theorem 2. Let the strongly convex quadratic objective function f in (2.4) satisfy λ_max(Q) = L, λ_min(Q) = m > 0, and let κ := L/m be the condition number. For the optimal parameters provided in Table 2.2, the ratio between the variance amplification of the heavy-ball method and that of gradient descent with equal values of σ is given by

    J_hb / J_gd = (√κ + 1)^4 / (8 √κ (κ + 1)).                                      (2.11)

Proof: For the parameters provided in Table 2.2 we have α_hb = (1 + β) α_gd, where β = (√κ − 1)²/(√κ + 1)² is the momentum parameter of the heavy-ball method. It is now straightforward to show that the modal contributions Ĵ_hb and Ĵ_gd to the variance amplification of the iterates given in Theorem 1 satisfy

    Ĵ_hb(λ) / Ĵ_gd(λ) = 1/(1 − β²) = (√κ + 1)^4 / (8 √κ (κ + 1)),   for all λ ∈ [m, L].   (2.12)

Thus, the ratio Ĵ_hb(λ)/Ĵ_gd(λ) does not depend on λ and is only a function of the condition number κ. Substitution of (2.12) into J = Σ_i Ĵ(λ_i) yields relation (2.11). □

Theorem 2 establishes a linear relation between the variance amplification J_hb of the heavy-ball algorithm and J_gd of gradient descent. We observe that the ratio J_hb/J_gd depends only on the condition number κ and that acceleration increases variance amplification: for κ ≫ 1, J_hb is larger than J_gd by a factor of √κ. We next study the ratio between the variance amplification of Nesterov's accelerated method and that of gradient descent. In contrast to the heavy-ball method, this ratio depends on the entire spectrum of the Hessian matrix Q.
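A quick numerical check of (2.11)-(2.12) is easy to set up from the modal formulas. The sketch below is illustrative only (not from the dissertation; Python/NumPy assumed, and the helper name ratio_check is hypothetical); the computed ratio should coincide with the closed form for any λ in [m, L].

import numpy as np

def ratio_check(kappa, lam_frac=0.37, L=1.0):
    # Compare Jhat_hb / Jhat_gd at an arbitrary lambda in [m, L] with the
    # closed-form ratio (sqrt(kappa)+1)^4 / (8 sqrt(kappa) (kappa+1)) from (2.11).
    m = L / kappa
    lam = m + lam_frac * (L - m)
    a_gd = 2.0 / (L + m)                                   # Table 2.2, gradient descent
    a_hb = 4.0 / (np.sqrt(L) + np.sqrt(m)) ** 2            # Table 2.2, heavy-ball
    beta = ((np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)) ** 2
    Jhat_gd = 1.0 / (a_gd * lam * (2 - a_gd * lam))
    Jhat_hb = (1 + beta) / (a_hb * lam * (1 - beta) * (2 * (1 + beta) - a_hb * lam))
    closed_form = (np.sqrt(kappa) + 1) ** 4 / (8 * np.sqrt(kappa) * (kappa + 1))
    return Jhat_hb / Jhat_gd, closed_form

for kappa in (10, 100, 1000):
    print(ratio_check(kappa))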
The following proposition, which examines the modal contributions Ĵ_na(λ) and Ĵ_gd(λ) of Nesterov's accelerated method and gradient descent, is the key technical result that allows us to establish, in Theorem 3, the largest and smallest values that the ratio J_na/J_gd can take for a given pair of extreme eigenvalues m and L of Q.

Proposition 1. Let the strongly convex quadratic function f in (2.4) satisfy λ_max(Q) = L, λ_min(Q) = m > 0, and let κ := L/m be the condition number. For the optimal parameters provided in Table 2.2, the ratio Ĵ_na(λ)/Ĵ_gd(λ) of the modal contributions to the variance amplification of Nesterov's method and gradient descent is a decreasing function of λ ∈ [m, L]. Furthermore, for σ = 1, the function Ĵ_gd(λ) satisfies

    max_{λ∈[m,L]} Ĵ_gd(λ) = Ĵ_gd(m) = Ĵ_gd(L) = (κ + 1)²/(4κ)
    min_{λ∈[m,L]} Ĵ_gd(λ) = Ĵ_gd(1/α) = 1                                          (2.13a)

and the function Ĵ_na(λ) satisfies

    Ĵ_na(L) = 9κ̄² (κ̄ + 2√κ̄ − 2) / (32 (κ̄ − 1)(κ̄ − √κ̄ + 1)(2√κ̄ − 1))
    max_{λ∈[m,L]} Ĵ_na(λ) = Ĵ_na(m) = κ̄² (κ̄ − 2√κ̄ + 2) / (32 (√κ̄ − 1)³)
    min_{λ∈[m,L]} Ĵ_na(λ) = Ĵ_na(1/α) = 1                                          (2.13b)

where κ̄ := 3κ + 1.

Proof: See Appendix A.1. □

For all three algorithms, Proposition 1 and Theorem 2 demonstrate that the modal contribution to the variance amplification of the iterates at the extreme eigenvalues m and L of the Hessian matrix depends only on the condition number κ := L/m. For gradient descent and the heavy-ball method, Ĵ achieves its largest value at m and L, i.e.,

    max_{λ∈[m,L]} Ĵ_gd(λ) = Ĵ_gd(m) = Ĵ_gd(L) = Θ(κ)
    max_{λ∈[m,L]} Ĵ_hb(λ) = Ĵ_hb(m) = Ĵ_hb(L) = Θ(κ√κ).                            (2.14a)

On the other hand, for Nesterov's method, (2.13b) implies a gap of Θ(κ) between the boundary values

    max_{λ∈[m,L]} Ĵ_na(λ) = Ĵ_na(m) = Θ(κ√κ),   Ĵ_na(L) = Θ(√κ).                   (2.14b)

Remark 2. Theorem 1 provides explicit formulas for the variance amplification of noisy algorithms (2.2) in terms of the eigenvalues λ_i of the Hessian matrix Q. Similarly, we can represent the variance amplification in terms of the eigenvalues λ̂_i of the dynamic matrices Â_i in (2.8). For gradient descent, λ̂_i = 1 − αλ_i and it is straightforward to verify that J_gd is determined by the sum of the reciprocals of the distances of these eigenvalues to the stability boundary, J_gd = Σ_{i=1}^n σ²/(1 − λ̂_i²). Similarly, for the accelerated methods we have

    J = Σ_{i=1}^n σ² (1 + λ̂_i λ̂′_i) / ((1 − λ̂_i λ̂′_i)(1 − λ̂_i)(1 − λ̂′_i)(1 + λ̂_i)(1 + λ̂′_i))

where λ̂_i and λ̂′_i are the eigenvalues of Â_i. For Nesterov's method with the parameters provided in Table 2.2, the matrix Â_n, which corresponds to λ_n = m, admits a Jordan canonical form with repeated eigenvalues λ̂_n = λ̂′_n = 1 − 2/√(3κ+1). In this case, Ĵ_na(m) = σ²(1 + λ̂_n²)/(1 − λ̂_n²)³, which should be compared and contrasted with the above expression for gradient descent. Furthermore, for both λ_1 = L and λ_n = m, the matrices Â_1 and Â_n of the heavy-ball method with the parameters provided in Table 2.2 have eigenvalues with algebraic multiplicity two and incomplete sets of eigenvectors. We next establish the range of values that the ratio J_na/J_gd can take.
Theorem 3. For the strongly convex quadratic objective function f in (2.4) with x ∈ R^n, λ_max(Q) = L, and λ_min(Q) = m > 0, the ratio between the variance amplification of Nesterov's accelerated method and that of gradient descent, for the optimal parameters provided in Table 2.2 and equal values of σ, satisfies

    (Ĵ_na(m) + (n − 1) Ĵ_na(L)) / (Ĵ_gd(m) + (n − 1) Ĵ_gd(L))
        ≤  J_na/J_gd  ≤
    (Ĵ_na(L) + (n − 1) Ĵ_na(m)) / (Ĵ_gd(L) + (n − 1) Ĵ_gd(m)).                     (2.15)

Proof: See Appendix A.1. □

Theorem 3 provides tight upper and lower bounds on the ratio between J_na and J_gd for strongly convex quadratic problems. As shown in Appendix A.1, the lower bound is achieved for a quadratic function in which the Hessian matrix Q has one eigenvalue at m and n − 1 eigenvalues at L, and the upper bound is achieved when Q has one eigenvalue at L and the remaining ones at m. Theorem 3 in conjunction with Proposition 1 demonstrates that, for a fixed problem dimension n, J_na is larger than J_gd by a factor of √κ for κ ≫ 1.

This tradeoff is further highlighted in Theorem 4, which provides tight bounds on the variance amplification of the iterates in terms of the problem dimension n and the condition number κ for all three algorithms. To simplify the presentation, we first use the explicit expressions for Ĵ_na(m) and Ĵ_na(L) in Proposition 1 to obtain the following upper and lower bounds (see Appendix A.1)

    (3κ + 1)^{3/2}/32 ≤ Ĵ_na(m) ≤ (3κ + 1)^{3/2}/8,
    9√(3κ + 1)/64 ≤ Ĵ_na(L) ≤ 9√(3κ + 1)/8.                                        (2.16)

Theorem 4. For the strongly convex quadratic objective function f in (2.4) with x ∈ R^n, λ_max(Q) = L, λ_min(Q) = m > 0, and κ := L/m, the variance amplification of the first-order optimization algorithms, with the parameters provided in Table 2.2 and σ = 1, is bounded by

    (κ − 1)²/(2κ) + n  ≤  J_gd  ≤  n(κ + 1)²/(4κ)

    ((√κ + 1)^4 / (8√κ(κ + 1))) ((κ − 1)²/(2κ) + n)  ≤  J_hb  ≤  n(κ + 1)(√κ + 1)^4 / (32κ√κ)

    (3κ + 1)^{3/2}/32 + 9√(3κ + 1)/64 + n − 2  ≤  J_na  ≤  (n − 1)(3κ + 1)^{3/2}/8 + 9√(3κ + 1)/8.

Proof: As shown in Proposition 1, the functions Ĵ(λ) for gradient descent and Nesterov's algorithm attain their largest and smallest values over the interval [m, L] at λ = m and λ = 1/α, respectively. Thus, fixing the smallest and largest eigenvalues, the variance amplification J is maximized when the other n − 2 eigenvalues are all equal to m and is minimized when they are all equal to 1/α. This, combined with the explicit expressions for Ĵ_gd(m), Ĵ_gd(L), and Ĵ_gd(1/α) in (2.13a), leads to the tight upper and lower bounds for gradient descent. For the heavy-ball method, the bounds follow from Theorem 2 and, for Nesterov's algorithm, the bounds follow from (2.16). □

For problems with a fixed dimension n and a condition number κ ≫ n, there is an Ω(√κ) difference in both the upper and lower bounds provided in Theorem 4 for the accelerated algorithms relative to gradient descent. Even though Theorem 4 considers only the values of α and β that optimize the convergence rate, in Section 2.5 we demonstrate that this gap is fundamental in that it holds for any parameters that yield an accelerated convergence rate. It is worth noting that both the lower and upper bounds are influenced by the problem dimension n and the condition number κ. For large-scale problems, there may be a subtle relation between n and κ and the established bounds may exhibit different scaling trends.
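The bounds of Theorem 4 can be sanity-checked numerically for gradient descent and the heavy-ball method. The sketch below is illustrative only (not part of the dissertation; Python/NumPy assumed, and the helper name check_theorem4 is hypothetical): it draws a random spectrum with the extreme eigenvalues pinned at m and L, computes J from Theorem 1, and verifies that the values fall within the stated intervals.

import numpy as np

def check_theorem4(n=20, kappa=200.0, L=1.0, seed=1):
    rng = np.random.default_rng(seed)
    m = L / kappa
    # Random spectrum with the extreme eigenvalues pinned at m and L
    eigs = np.concatenate(([m, L], rng.uniform(m, L, n - 2)))
    a_gd = 2.0 / (L + m)
    a_hb = 4.0 / (np.sqrt(L) + np.sqrt(m)) ** 2
    beta = ((np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)) ** 2
    J_gd = np.sum(1.0 / (a_gd * eigs * (2 - a_gd * eigs)))
    J_hb = np.sum((1 + beta) / (a_hb * eigs * (1 - beta) * (2 * (1 + beta) - a_hb * eigs)))
    # Bounds from Theorem 4 (sigma = 1)
    lo_gd = (kappa - 1) ** 2 / (2 * kappa) + n
    hi_gd = n * (kappa + 1) ** 2 / (4 * kappa)
    r = (np.sqrt(kappa) + 1) ** 4 / (8 * np.sqrt(kappa) * (kappa + 1))
    lo_hb = r * lo_gd
    hi_hb = n * (kappa + 1) * (np.sqrt(kappa) + 1) ** 4 / (32 * kappa * np.sqrt(kappa))
    return (lo_gd <= J_gd <= hi_gd, lo_hb <= J_hb <= hi_hb)

print(check_theorem4())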
In Section 2.6, we identify a class of quadratic optimization problems for which J_na scales in the same way as J_gd for κ ≫ 1 and n ≫ 1. Before we elaborate further on these issues, we provide two illustrative examples that highlight the importance of the choice of the performance metric in the robustness analysis of noisy algorithms.

It is worth noting that an O(κ) upper bound for gradient descent and an O(κ²) upper bound for Nesterov's accelerated algorithm were established in [29]. Relative to this upper bound for Nesterov's method, the upper bound provided in Theorem 4 is tighter by a factor of √κ. Theorem 4 also provides lower bounds, reveals the influence of the problem dimension n, and identifies the constants that multiply the leading terms in the condition number κ. Moreover, in Section 2.4 we demonstrate that similar upper bounds can be obtained for general strongly convex objective functions with Lipschitz continuous gradients.

2.3.3 Examples

We next provide illustrative examples to (i) demonstrate the agreement of our theoretical predictions with the results of stochastic simulations; and (ii) contrast two relevant measures of performance, namely the variance of the iterates J in (2.6d) and the mean objective error J′ in (2.10), for assessing the robustness of noisy optimization algorithms.

Example 1. Let us consider the quadratic objective function in (2.4) with

    Q = [ L  0 ],   q = [ 0 ].                                                     (2.17)
        [ 0  m ]        [ 0 ]

For all three algorithms, the performance measures J and J′ are given by

    J  = Ĵ(m) + Ĵ(L)
    J′ = m Ĵ(m) + L Ĵ(L) = L ((1/κ) Ĵ(m) + Ĵ(L)) = m (Ĵ(m) + κ Ĵ(L)).

As shown in (2.14), Ĵ(m) and Ĵ(L) depend only on the condition number κ, and the variance amplification of the iterates satisfies

    J_gd = Θ(κ),   J_hb = Θ(κ√κ),   J_na = Θ(κ√κ).                                 (2.18a)

On the other hand, J′ also depends on m and L. In particular, it is easy to verify the following relations for two scenarios that yield κ ≫ 1:

    for m ≪ 1 and L = O(1):   J′_gd = Θ(κ),    J′_hb = Θ(κ√κ),    J′_na = Θ(√κ)    (2.18b)
    for L ≫ 1 and m = O(1):   J′_gd = Θ(κ²),   J′_hb = Θ(κ²√κ),   J′_na = Θ(κ√κ).  (2.18c)

Relation (2.18a) reveals the detrimental impact of acceleration on the variance of the optimization variable. On the other hand, (2.18b) and (2.18c) show that, relative to gradient descent, the heavy-ball method increases the mean error in the objective function while Nesterov's method reduces it. Thus, if the mean value of the error in the objective function is used to assess the performance of noisy algorithms, one could conclude that Nesterov's method significantly outperforms gradient descent both in terms of convergence rate and robustness to noise. However, this performance metric fails to capture the large variance of the mode associated with the smallest eigenvalue of the matrix Q in Nesterov's algorithm. Theorem 2 and Proposition 1 show that the modal contributions to the variance amplification of the iterates for gradient descent and the heavy-ball method are balanced at m and L, i.e., Ĵ_gd(m) = Ĵ_gd(L) = Θ(κ) and Ĵ_hb(m) = Ĵ_hb(L) = Θ(κ√κ). On the other hand, for Nesterov's method there is a Θ(κ) gap between Ĵ_na(m) = Θ(κ√κ) and Ĵ_na(L) = Θ(√κ).
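The scalings in (2.18) can be reproduced directly from the modal formulas of Theorem 1. The following sketch is illustrative only (not from the thesis; Python/NumPy assumed, and the helper name example1 is hypothetical): it evaluates J and J′ for Nesterov's method on Q = diag(L, m) in the regime m ≪ 1, L = O(1).

import numpy as np

def example1(kappa, L=1.0):
    # Q = diag(L, m) with m = L/kappa; Nesterov's parameters from Table 2.2
    m = L / kappa
    alpha = 4.0 / (3 * L + m)
    beta = (np.sqrt(3 * kappa + 1) - 2) / (np.sqrt(3 * kappa + 1) + 2)

    def J_na(lam):
        b = beta * (1 - alpha * lam)
        return (1 + b) / (alpha * lam * (1 - b) * (2 * (1 + beta) - (2 * beta + 1) * alpha * lam))

    J = J_na(m) + J_na(L)                    # variance of the iterates
    Jprime = m * J_na(m) + L * J_na(L)       # mean error in the objective, cf. (2.10)
    return J, Jprime

for kappa in (1e2, 1e4, 1e6):
    J, Jp = example1(kappa)
    print(f"kappa={kappa:.0e}:  J={J:.3e} (grows like kappa^1.5),  J'={Jp:.3e} (grows like kappa^0.5)")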
While the performance measure J′ reveals a superior performance of Nesterov's algorithm at large condition numbers, it fails to capture the negative impact of acceleration on the variance of the optimization variable; see Fig. 2.1 for an illustration.

Figure 2.1: Ellipsoids {z | z^T Z^{-1} z ≤ 1} associated with the steady-state covariance matrices Z = C P C^T of the performance outputs z^t = x^t − x^⋆ (top row) and z^t = Q^{1/2}(x^t − x^⋆) (bottom row) for algorithms (2.2) with the parameters provided in Table 2.2 and the matrix Q given in (2.17) with m ≪ L = O(1). Columns correspond to gradient descent, the heavy-ball method, and Nesterov's method; the rows show the ellipsoids associated with the performance measures J (top) and J′ (bottom). The horizontal and vertical axes show the eigenvectors [1 0]^T and [0 1]^T associated with the eigenvalues Ĵ(L) and Ĵ(m) (top row) and Ĵ′(L) and Ĵ′(m) (bottom row) of the respective output covariance matrices Z.

Figure 2.2 shows the performance outputs z^t = x^t and z^t = Q^{1/2} x^t resulting from 10^5 iterations of noisy first-order algorithms with the optimal parameters provided in Table 2.2 for the strongly convex objective function f(x) = 0.5 x_1² + 0.25 × 10^{-4} x_2² (κ = 2 × 10^4). Although Nesterov's method exhibits good performance with respect to the error in the objective function (performance measure J′), the plots in the first row illustrate the detrimental impact of noise on both accelerated algorithms with respect to the variance of the iterates (performance measure J). In particular, we observe that: (i) for gradient descent and the heavy-ball method, the iterates x^t are scattered uniformly along the eigen-directions of the Hessian matrix Q and acceleration increases variance equally along all directions; and (ii) relative to gradient descent, Nesterov's method exhibits larger variance in the iterates x^t along the direction that corresponds to the smallest eigenvalue λ_min(Q).

Figure 2.2: Performance outputs z^t = x^t (top row) and z^t = Q^{1/2} x^t (bottom row) resulting from 10^5 iterations of noisy first-order algorithms (2.2) with the parameters provided in Table 2.2; panels (a), (b), and (c) correspond to gradient descent, the heavy-ball method, and Nesterov's method. The strongly convex problem with f(x) = 0.5 x_1² + 0.25 × 10^{-4} x_2² (κ = 2 × 10^4) is solved using algorithms with additive white noise and zero initial conditions.

Example 2. Figure 2.3 compares the results of twenty stochastic simulations for a strongly convex quadratic objective function (2.4) with q = 0 and a Toeplitz matrix Q ∈ R^{50×50} with first row [2  −1  0  ···  0  0]^T. This figure illustrates the dependence of the variance of the performance outputs z^t = x^t and z^t = Q^{1/2} x^t on time t for the algorithms subject to additive white noise with zero initial conditions. The plots further demonstrate that the mean error in the objective function does not capture the detrimental impact of noise on the variance of the iterates for Nesterov's algorithm. The bottom row also compares the variance obtained by averaging the outcomes of twenty stochastic simulations with the corresponding theoretical values resulting from the Lyapunov equations.

2.4 General strongly convex problems

In this section, we extend our results to the class F_m^L of m-strongly convex objective functions with L-Lipschitz continuous gradients.
Figure 2.3: Time averages (1/t) Σ_{k=0}^t ∥z^k∥², as functions of the iteration number t, for the performance outputs z^t = x^t (panel (a)) and z^t = Q^{1/2} x^t (panel (b)) in Example 2. Top row: the thick blue (gradient descent), black (heavy-ball), and red (Nesterov's method) lines mark the variance obtained by averaging the results of twenty stochastic simulations. Bottom row: comparison between the results obtained by averaging the outcomes of twenty stochastic simulations (thick lines) and the corresponding theoretical values (1/t) Σ_{k=0}^t trace(C P^k C^T) (dashed lines) resulting from the Lyapunov equation (2.6a).

While a precise characterization of noise amplification for general problems is challenging because of the nonlinear dynamics, we employ tools from robust control theory to obtain meaningful upper bounds. Our results utilize the theory of integral quadratic constraints [81], a convex control-theoretic framework that was recently used to analyze optimization algorithms [52] and to study convergence and robustness of first-order methods [53], [54], [56], [68]. We establish analytical upper bounds on the mean-squared error of the iterates (2.3) for gradient descent (2.2a) and Nesterov's accelerated method (2.2c). Since there are no known accelerated convergence guarantees for the heavy-ball method when applied to general strongly convex functions, we do not consider it in this section.

We first exploit structural properties of the gradient and employ quadratic Lyapunov functions to formulate a semidefinite programming problem (SDP) that provides upper bounds on J in (2.3). While quadratic Lyapunov functions yield tight upper bounds for gradient descent, they fail to provide any upper bound for Nesterov's method for large condition numbers (κ > 100). To overcome this challenge, we present a modified semidefinite program that uses more general Lyapunov functions obtained by augmenting the standard quadratic terms with the objective function. This type of generalized Lyapunov function was introduced in [56], [106] and used to study the convergence of optimization algorithms for non-strongly convex problems. We employ the modified SDP to derive meaningful upper bounds on J in (2.3) for Nesterov's method as well.

We note that algorithms (2.2) are invariant under translation, i.e., if we let x̃ := x − x̄ and g(x̃) := f(x̃ + x̄), then (2.2c), for example, satisfies

    x̃^{t+2} = x̃^{t+1} + β(x̃^{t+1} − x̃^t) − α∇g(x̃^{t+1} + β(x̃^{t+1} − x̃^t)) + σw^t.

Thus, in what follows, without loss of generality, we assume that x^⋆ = 0 is the unique minimizer of (2.1).

2.4.1 An approach based on contraction mappings

Before we present our approach based on linear matrix inequalities (LMIs), we provide a more intuitive approach that can be used to examine the noise amplification of gradient descent. Let φ: R^n → R^n be a contraction mapping, i.e., there exists a positive scalar η < 1 such that ∥φ(x) − φ(y)∥ ≤ η ∥x − y∥ for all x, y ∈ R^n, and let x^⋆ = 0 be the unique fixed point of φ, i.e., φ(0) = 0. For the noisy recursion x^{t+1} = φ(x^t) + σw^t, where w^t is zero-mean white noise with identity covariance and E((w^t)^T φ(x^t)) = 0, the contractiveness of φ implies

    E(∥x^{t+1}∥²) = E(∥φ(x^t) + σw^t∥²) ≤ η² E(∥x^t∥²) + nσ².

Since η < 1, this relation yields

    lim_{t→∞} E(∥x^t∥²) ≤ nσ²/(1 − η²).
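The bound above can be probed empirically for the gradient-descent map, which is specialized next. The sketch below is a minimal illustration and not part of the dissertation: it assumes Python/NumPy and a particular log-cosh test function chosen here only because it is m-strongly convex, L-smooth, and has its minimizer at the origin.

import numpy as np

def grad_f(x, m, L):
    # f(x) = (m/2)||x||^2 + (L - m) * sum(log cosh(x_i)) has Hessian
    # mI + (L - m) diag(sech^2(x_i)), so mI <= Hess f <= LI, with minimizer x* = 0.
    return m * x + (L - m) * np.tanh(x)

def empirical_limit(m=0.01, L=1.0, n=10, sigma=0.1, iters=400000, burn_in=100000, seed=0):
    rng = np.random.default_rng(seed)
    alpha = 1.0 / L
    x = np.zeros(n)
    acc = 0.0
    for t in range(iters):
        x = x - alpha * grad_f(x, m, L) + sigma * rng.standard_normal(n)
        if t >= burn_in:
            acc += x @ x
    return acc / (iters - burn_in)

m, L, n, sigma = 0.01, 1.0, 10, 0.1
eta = 1.0 - m / L                       # contraction factor of x - (1/L) grad f(x); see the discussion that follows
bound = n * sigma ** 2 / (1.0 - eta ** 2)
print(empirical_limit(), "<=", bound)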
If η := max{|1 − αm|, |1 − αL|} < 1, the map φ(x) := x − α∇f(x) is a contraction [62]. Thus, for the conventional stepsize α = 1/L we have η = 1 − 1/κ, and the bound becomes

    lim_{t→∞} E(∥x^t∥²) ≤ nσ²/(1 − η²) = nσ² κ²/(2κ − 1) = n Θ(κ).

In the next section, we show that this upper bound is indeed tight for the class of functions F_m^L. While this approach yields a tight upper bound for gradient descent, it cannot be used for Nesterov's method (because it is not a contraction).

2.4.2 An approach based on linear matrix inequalities

For any function f ∈ F_m^L, the nonlinear mapping ∆: R^n → R^n,

    ∆(y) := ∇f(y) − my

satisfies the quadratic inequality [52, Lemma 6]

    [ y − y_0;  ∆(y) − ∆(y_0) ]^T  Π  [ y − y_0;  ∆(y) − ∆(y_0) ]  ≥  0            (2.19)

for all y, y_0 ∈ R^n, where the matrix Π is given by

    Π := [ 0          (L − m)I ]
         [ (L − m)I   −2I      ].                                                  (2.20)

We can bring algorithms (2.2) with constant parameters into the time-invariant state-space form

    ψ^{t+1} = A ψ^t + σ B_w w^t + B_u u^t
    [ z^t;  y^t ] = [ C_z;  C_y ] ψ^t
    u^t = ∆(y^t)                                                                   (2.21a)

that contains a feedback interconnection of linear and nonlinear components. Figure 2.4 illustrates the block diagram of system (2.21a), where ψ^t is the state, w^t is white stochastic noise, z^t is the performance output, and u^t is the output of the nonlinear term ∆(y^t). In particular, if we let

    ψ^t := [ x^t;  x^{t+1} ],   z^t := x^t,   y^t := −βx^t + (1 + β) x^{t+1}

and define the corresponding matrices as

    A = [ 0               I               ],   B_w = [ 0 ],   B_u = [ 0    ]
        [ −β(1 − αm)I     (1+β)(1 − αm)I  ]          [ I ]          [ −αI  ]

    C_z = [ I   0 ],   C_y = [ −βI   (1+β)I ]                                      (2.21b)

then (2.21a) represents Nesterov's method (2.2c). For gradient descent (2.2a), we can alternatively use ψ^t = z^t = y^t := x^t with the corresponding matrices

    A = (1 − αm)I,   B_w = I,   B_u = −αI,   C_z = C_y = I.                        (2.21c)

Figure 2.4: Block diagram of system (2.21a): a feedback interconnection of an LTI system, driven by the inputs w^t and u^t and producing the outputs z^t and y^t, with the nonlinearity ∆ that maps y^t into u^t.

In what follows, we demonstrate how property (2.19) of the nonlinear mapping ∆ allows us to obtain upper bounds on J when system (2.21a) is driven by the white stochastic input w^t with zero mean and identity covariance. Lemma 1 uses a quadratic Lyapunov function of the form V(ψ) = ψ^T X ψ and provides upper bounds on the steady-state second-order moment of the performance output z^t in terms of solutions to a certain LMI. This approach yields a tight upper bound for gradient descent.

Lemma 1. Let the nonlinear function u = ∆(y) satisfy the quadratic inequality

    [ y;  u ]^T Π [ y;  u ]  ≥  0                                                  (2.22)

for some matrix Π, let X be a positive semidefinite matrix, and let λ be a nonnegative scalar such that system (2.21a) satisfies

    [ A^T X A − X + C_z^T C_z    A^T X B_u   ]  +  λ [ C_y   0 ]^T Π [ C_y   0 ]  ⪯  0.   (2.23)
    [ B_u^T X A                  B_u^T X B_u ]       [ 0     I ]     [ 0     I ]

Then the steady-state second-order moment J of the performance output z^t in (2.21a) is bounded by J ≤ σ² trace(B_w^T X B_w).

Proof: See Appendix A.2. □

For Nesterov's accelerated method with the parameters provided in Table 2.1, we have conducted computational experiments showing that LMI (2.23) becomes infeasible for large values of the condition number κ. Thus, Lemma 1 does not provide sensible upper bounds on J for Nesterov's algorithm. This observation is consistent with the results of [52], where it was suggested that analyzing the convergence rate requires the use of additional quadratic inequalities, apart from (2.19), to further tighten the constraints on the gradient ∇f and reduce conservativeness.
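Lemma 1 can also be checked numerically by solving the corresponding SDP. The sketch below is illustrative only and not the author's code; it assumes CVXPY with its default conic solver, sets up LMI (2.23) for the gradient-descent model (2.21c) with α = 1/L, and minimizes trace(B_w^T X B_w).

import numpy as np
import cvxpy as cp

def lemma1_bound_gd(m, L, n=2):
    # Gradient-descent model (2.21c) with the conventional stepsize alpha = 1/L
    alpha = 1.0 / L
    I = np.eye(n)
    A, Bw, Bu, Cz, Cy = (1 - alpha * m) * I, I, -alpha * I, I, I
    # Sector-type constraint (2.19)-(2.20) on Delta(y) = grad f(y) - m y
    Pi = np.block([[np.zeros((n, n)), (L - m) * I], [(L - m) * I, -2 * I]])
    G = np.block([[Cy, np.zeros((n, n))], [np.zeros((n, n)), I]])
    X = cp.Variable((n, n), PSD=True)
    lam = cp.Variable(nonneg=True)
    lmi = cp.bmat([[A.T @ X @ A - X + Cz.T @ Cz, A.T @ X @ Bu],
                   [Bu.T @ X @ A, Bu.T @ X @ Bu]]) + lam * (G.T @ Pi @ G)
    lmi = 0.5 * (lmi + lmi.T)                   # symmetrize so the PSD constraint is well posed
    prob = cp.Problem(cp.Minimize(cp.trace(Bw.T @ X @ Bw)), [lmi << 0])
    prob.solve()
    return prob.value                           # upper bound on J for sigma = 1 (cf. Lemma 1)

print(lemma1_bound_gd(m=1.0, L=10.0))           # compare with n*kappa^2/(2*kappa - 1) from Theorem 5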
In what follows, we build on the results of [56] and present an alternative LMI in Lemma 2 that is obtained using a Lyapunov function of the form V(ψ) = ψ^T X ψ + f([0  I] ψ), where X is a positive semidefinite matrix and f is the objective function in (2.1). Such Lyapunov functions have been used to study the convergence of optimization algorithms in [106]. The resulting approach allows us to establish an order-wise tight analytical upper bound on J for Nesterov's accelerated method.

Lemma 2. Let the matrix M(m, L; α, β) be defined as

    M := N_1^T [ LI   I ] N_1  +  N_2^T [ −mI   I ] N_2
               [ I    0 ]               [ I     0 ]

where

    N_1 := [ αmβI    −αm(1+β)I    −αI ],     N_2 := [ −βI     βI          0 ]
           [ −mβI    m(1+β)I      I   ]             [ −mβI    m(1+β)I     I ].

Consider the state-space model (2.21a)-(2.21b) for algorithm (2.2c) and let Π be given by (2.20). Then, for any positive semidefinite matrix X and scalars λ_1 ≥ 0 and λ_2 ≥ 0 that satisfy

    [ A^T X A − X + C_z^T C_z    A^T X B_u   ]  +  λ_1 [ C_y   0 ]^T Π [ C_y   0 ]  +  λ_2 M  ⪯  0   (2.24)
    [ B_u^T X A                  B_u^T X B_u ]         [ 0     I ]     [ 0     I ]

the steady-state second-order moment J of the performance output z^t in (2.21a) is bounded by

    J ≤ σ² (nLλ_2 + trace(B_w^T X B_w)).                                           (2.25)

Proof: See Appendix A.2. □

Remark 3. Since LMI (2.24) simplifies to (2.23) by setting λ_2 = 0, Lemma 2 represents a relaxed version of Lemma 1. This modification is the key enabler to establishing a tight upper bound on J for Nesterov's method.

The upper bounds provided in Lemmas 1 and 2 are proportional to σ². In what follows, to make a connection between these bounds and our analytical expressions for the variance amplification in the quadratic case (Section 2.3), we again set σ = 1. The best upper bound on J that can be obtained using Lemma 2 is given by the optimal objective value of the semidefinite program

    minimize_{X, λ_1, λ_2}   nLλ_2 + trace(B_w^T X B_w)                            (2.26)
    subject to               LMI (2.24),   X ⪰ 0,   λ_1 ≥ 0,   λ_2 ≥ 0.

For the system matrices in (2.21b), LMI (2.24) is of size 3n × 3n, where x^t ∈ R^n. However, if we impose the additional constraint that the matrix X has the same block structure as A,

    X = [ x_1 I   x_0 I ]
        [ x_0 I   x_2 I ]

for some scalars x_1, x_2, and x_0, then, using appropriate permutation matrices, we can simplify (2.24) into an LMI of size 3 × 3. Furthermore, imposing this constraint comes without loss of generality. In particular, the optimal objective value of problem (2.26) does not change if we require X to have this structure; see [52, Section 4.2] for a discussion of this lossless dimensionality reduction for LMI constraints with similar structure. In Theorem 5, we use Lemmas 1 and 2 to establish tight upper bounds on J_gd and J_na for all f ∈ F_m^L.

Theorem 5. For gradient descent and Nesterov's accelerated method with the parameters provided in Table 2.1 and σ = 1, the performance measures J_gd and J_na of the error x^t − x^⋆ ∈ R^n satisfy

    sup_{f∈F_m^L} J_gd = q_gd,     q_na ≤ sup_{f∈F_m^L} J_na ≤ 4.08 q_na

where

    q_gd = nκ²/(2κ − 1) = n Θ(κ),     q_na = nκ²(2κ − 2√κ + 1)/(2√κ − 1)³ = n Θ(κ^{3/2})

and κ := L/m is the condition number of the set F_m^L.

Proof: See Appendix A.2. □

The variance amplification of gradient descent and Nesterov's method for f(x) = (m/2) x^T x in F_m^L is determined by q_gd and q_na, respectively, and these two quantities can be obtained using Theorem 1. In Theorem 5, we use this strongly convex quadratic objective function to certify the accuracy of the upper bounds on sup J for all f ∈ F_m^L.
In particular, we observe that the upper bound is exact for gradient descent and that it is within a factor of 4.08 of the optimal value for Nesterov's method. For strongly convex objective functions with condition number κ, Theorem 5 proves that gradient descent outperforms Nesterov's accelerated method in terms of the largest noise amplification by a factor of √κ. This uncovers a fundamental performance limitation of Nesterov's accelerated method when the gradient evaluation is subject to additive stochastic uncertainties.

2.5 Tuning of algorithmic parameters

The parameters provided in Table 2.2 yield the optimal convergence rate for strongly convex quadratic problems. For these specific values, Theorem 4 establishes upper and lower bounds on the variance amplification that reveal the negative impact of acceleration. However, it is relevant to examine whether the parameters can be designed to provide acceleration while reducing the variance amplification. While the convergence rate depends solely on the extreme eigenvalues m = λ_min(Q) and L = λ_max(Q) of the Hessian matrix Q, variance amplification is influenced by the entire spectrum of Q, and its minimization is challenging as it requires the use of all eigenvalues. In this section, we first consider the special case of eigenvalues that are symmetrically distributed over the interval [m, L] and demonstrate that, for gradient descent and the heavy-ball method, the parameters provided in Table 2.2 yield a variance amplification that is within a constant factor of the optimal value. As we demonstrate in Section 2.6, symmetric distributions of the eigenvalues are encountered in distributed consensus over undirected torus networks. We also consider the problem of designing parameters for objective functions in which the problem size satisfies n ≪ κ and establish a tradeoff between convergence rate and variance amplification. More specifically, we show that for any accelerating pair of parameters α and β and bounded problem dimension n, the variance amplification of accelerated methods is larger than that of gradient descent by a factor of Ω(√κ).

2.5.1 Tuning of parameters using the whole spectrum

Let L = λ_1 ≥ λ_2 ≥ ··· ≥ λ_n = m > 0 be the eigenvalues of the Hessian matrix Q of the strongly convex quadratic objective function in (2.4). Algorithms (2.2) converge linearly in expected value to the optimizer x^⋆ with the rate

    ρ := max_i ρ̂(λ_i)                                                              (2.27)

where ρ̂(λ_i) is the spectral radius of the matrix Â_i given by (2.8). For any scalar c > 0 and fixed σ, let

    (α⋆_hb(c), β⋆_hb(c)) := argmin_{α,β} J_hb(α, β)   subject to   ρ_hb ≤ 1 − c/√κ      (2.28a)

for the heavy-ball method, and

    α⋆_gd(c) := argmin_α J_gd(α)   subject to   ρ_gd ≤ 1 − c/κ                          (2.28b)

for gradient descent, where the expression for the variance amplification J is provided in Theorem 1. Here, the constraints enforce a standard rate of linear convergence for gradient descent and an accelerated rate of linear convergence for the heavy-ball method, parameterized by the constant c. Obtaining a closed-form solution to (2.28) is challenging because J depends on all eigenvalues of the Hessian matrix Q. Herein, we focus on objective functions for which the spectrum of Q is symmetric, i.e., for any eigenvalue λ, the corresponding mirror image λ′ := L + m − λ with respect to (L + m)/2 is also an eigenvalue with the same algebraic multiplicity.
For this class of problems, Theorem 6 demonstrates that the parameters provided in Table 2.2 for gradient descent and the heavy-ball method yield a variance amplification that is within a constant factor of the optimal.

Theorem 6 For any scalar c > 0 and fixed σ, there exist constants c_1 ≥ 1 and c_2 > 0 such that, for any strongly convex quadratic objective function in which the spectrum of the Hessian matrix Q is symmetrically distributed over the interval [m, L] with κ := L/m > c_1, we have
J_gd(α*_gd(c)) ≥ (1/2) J_gd(α_gd),   J_hb(α*_hb(c), β*_hb(c)) ≥ c_2 J_hb(α_hb, β_hb)
where the parameters α_gd and (α_hb, β_hb) are provided in Table 2.2, and α*_gd(c) and (α*_hb(c), β*_hb(c)) solve (2.28).
Proof: See Appendix A.2.4. □

For strongly convex quadratic objective functions with a symmetric spectrum of the Hessian matrix over the interval [m, L], Theorem 6 shows that the variance amplifications of gradient descent and the heavy-ball method with the parameters provided in Table 2.2 are within constant factors of the optimal values. As we illustrate in Section 2.6, this class of problems is encountered in distributed averaging over noisy undirected networks. Combining this result with the lower bound on J_hb(α_hb, β_hb) and the upper bound on J_gd(α_gd) established in Theorem 4, we see that, regardless of the choice of parameters, there is a fundamental gap of Ω(√κ) between J_hb and J_gd as long as we require an accelerated rate of convergence.

2.5.2 Fundamental lower bounds

We next establish lower bounds on the variance amplification of accelerated methods that hold for any pair of α and β for strongly convex quadratic problems with κ ≫ 1. In particular, we show that the variance amplification of accelerated algorithms is lower bounded by Ω(κ^{3/2}) irrespective of the choice of α and β. The next theorem establishes a fundamental tradeoff between the convergence rate and variance amplification for the heavy-ball method.

Theorem 7 For strongly convex quadratic problems with any stabilizing parameters α > 0 and 0 < β < 1 and with a fixed noise magnitude σ, the heavy-ball method with the linear convergence rate ρ satisfies
J_hb/(1 − ρ) ≥ σ^2 ((κ + 1)/8)^2.   (2.29a)
Furthermore, if σ = α, i.e., when the only source of uncertainty is a noisy gradient, we have
J_hb/(1 − ρ) ≥ (κ/(8L))^2.   (2.29b)
Proof: See Appendix A.3. □

To gain additional insight, let us consider two special cases: (i) for α = 1/L and β → 0^+, we obtain the gradient descent algorithm, for which 1 − ρ = Θ(1/κ) and J = Θ(κ); (ii) for the heavy-ball method with the parameters provided in Table 2.2, we have 1 − ρ = Θ(1/√κ) and J = Θ(κ√κ). Thus, in both cases, J_hb/(1 − ρ) = Ω(κ^2). Theorem 7 shows that this lower bound is fundamental and it therefore quantifies the tradeoff between the convergence rate and the variance amplification of the heavy-ball method for any choice of parameters α and β. It is also worth noting that the lower bound for σ = α depends on the largest eigenvalue L of the Hessian matrix Q. Thus, this bound is meaningful when the value of L is uniformly upper bounded. This scenario occurs in many applications including consensus over undirected torus networks; see Section 2.6.
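The tradeoff in Theorem 7 is easy to probe numerically. The sketch below sweeps a grid of heavy-ball parameters (α, β) on a small quadratic problem and checks that J/(1 − ρ) never drops below σ^2((κ + 1)/8)^2. The spectrum, condition number, and grid resolution are illustrative choices, and the per-mode Lyapunov computation again stands in for the closed-form expression of Theorem 1.

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

m, L, sigma = 1.0, 50.0, 1.0
kappa = L / m
spectrum = [m, (m + L) / 2, L]                      # a small illustrative quadratic problem
bound = sigma**2 * ((kappa + 1) / 8) ** 2           # right-hand side of (2.29a)

def rate_and_J(alpha, beta):
    mats = [np.array([[0.0, 1.0], [-beta, 1.0 + beta - alpha * lam]]) for lam in spectrum]
    rho = max(max(abs(np.linalg.eigvals(A))) for A in mats)
    if rho >= 1.0:
        return rho, np.inf
    B = np.array([[0.0], [sigma]])
    return rho, sum(solve_discrete_lyapunov(A, B @ B.T)[0, 0] for A in mats)

worst_ratio = np.inf
for alpha in np.linspace(0.005 / L, 3.9 / L, 40):
    for beta in np.linspace(1e-3, 0.999, 40):
        rho, J = rate_and_J(alpha, beta)
        if np.isfinite(J):
            worst_ratio = min(worst_ratio, J / (1.0 - rho))
print("min of J/(1 - rho) over the grid:", worst_ratio, " lower bound (2.29a):", bound)
```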
While we are not able to show a similar lower bound for Nesterov's method, in the next theorem we establish an asymptotic lower bound on the variance amplification that holds for any pair of accelerating parameters (α, β) for both Nesterov's and heavy-ball methods.

Theorem 8 For a strongly convex quadratic objective function with condition number κ, let c > 0 be a constant such that either Nesterov's algorithm or the heavy-ball method with some (possibly problem dependent) parameters α > 0 and 0 < β < 1 converges linearly with a rate ρ ≤ 1 − c/√κ. Then, for any fixed noise magnitude σ, the variance amplification satisfies
J/σ^2 = Ω(κ^{3/2}).   (2.30a)
Furthermore, if σ = α, i.e., when the only source of uncertainty is a noisy gradient, we have
J = Ω(κ^{3/2}/L^2).   (2.30b)
Proof: For the heavy-ball method, the result follows from combining Theorem 7 with the inequality 1 − ρ ≥ c/√κ. For Nesterov's method, the proof is provided in Appendix A.3. □

For problems with n ≪ κ, we recall that the variance amplification of gradient descent with conventional values of the parameters scales as O(κ); see Theorem 5. Irrespective of the choice of parameters α and β, this result in conjunction with Theorem 8 demonstrates that acceleration cannot be achieved without increasing the variance amplification J by a factor of Ω(√κ).

2.6 Application to distributed computation

Distributed computation over networks has received significant attention in the optimization, control systems, signal processing, communications, and machine learning communities. In this problem, the goal is to optimize an objective function (e.g., for the purpose of training a model) using multiple processing units that are connected over a network. Clearly, the structure of the network (e.g., node dynamics and network topology) may impact the performance (e.g., convergence rate and noise amplification) of any optimization algorithm. As a first step toward understanding the impact of the network structure on the performance of noisy first-order optimization algorithms, in this section we examine the standard distributed consensus problem.

The consensus problem arises in applications ranging from social networks, to distributed computing networks, to cooperative control in multi-agent systems. In the simplest setup, each node updates a scalar value using the values of its neighbors such that they all agree on a single consensus value. Simple updating strategies of this kind can be obtained by applying a first-order algorithm to the convex quadratic problem
minimize_x (1/2) x^T L x   (2.31)
where L = L^T ∈ R^{n×n} is the Laplacian matrix of the graph associated with the underlying undirected network and x ∈ R^n is the vector of node values. The graph Laplacian matrix L ⪰ 0 has a nontrivial null space that consists of the minimizers of problem (2.31). In the absence of noise, for gradient descent and both of its accelerated variants, it is straightforward to verify that the projections v_t of the iterates x_t onto the null space of L remain constant (v_t = v_0 for all t) and that x_t converges linearly to v_0. In the presence of additive noise, however, v_t experiences a random walk which leads to an unbounded variance of x_t as t → ∞. Instead, as described in [42], the performance of algorithms in this case can be quantified by examining
J̄ := lim_{t→∞} E(∥x_t − v_t∥^2).
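As a concrete illustration of this performance metric, the following simulation sketch applies noisy gradient descent to consensus problem (2.31) on a ring and estimates J̄ by averaging the squared deviation from the network average. The ring size, stepsize, noise magnitude, and horizon are illustrative assumptions, and the closed-form cross-check is simply the stationary variance of each decoupled scalar mode of the recursion used here.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
Lap = 2 * np.eye(n) - np.roll(np.eye(n), 1, axis=1) - np.roll(np.eye(n), -1, axis=1)  # ring Laplacian

alpha, sigma, T = 0.25, 1.0, 100_000        # stepsize, noise magnitude, horizon (illustrative)
x = np.zeros(n)
dev2 = []
for t in range(T):
    x = x - alpha * (Lap @ x) + sigma * rng.standard_normal(n)
    if t > T // 2:                          # discard the transient
        dev = x - x.mean()                  # deviation from the network average
        dev2.append(dev @ dev)
print("empirical estimate of Jbar:", np.mean(dev2))

# Cross-check: each nonzero Laplacian mode is a scalar AR(1) recursion whose
# stationary variance is sigma^2 / (alpha*lam*(2 - alpha*lam)).
lams = np.linalg.eigvalsh(Lap)[1:]
print("modal prediction of Jbar  :", sum(sigma**2 / (alpha * lam * (2 - alpha * lam)) for lam in lams))
```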
For connected networks, the null space of L is given by N(L) = {c1 | c ∈ R} and
J̄ = lim_{t→∞} E(∥x_t − (1^T x_t / n) 1∥^2)   (2.32)
quantifies the mean-squared deviation from the network average, where 1 := [1 ··· 1]^T denotes the vector of all ones. Finally, it is straightforward to show that J̄ can also be computed using the formulae in Theorem 1 by summing over the nonzero eigenvalues of L.

In what follows, we consider a class of networks whose structure allows for explicit evaluation of the eigenvalues of the Laplacian matrix L. For d-dimensional torus networks, fundamental performance limitations of standard consensus algorithms in continuous time were established in [43], but it remains an open question whether gradient descent and its accelerated variants suffer from these limitations. We use such torus networks to show that standard gradient descent exhibits the same scaling trends as the consensus algorithms studied in [43] and that, in lower spatial dimensions, acceleration always increases variance amplification.

2.6.1 Explicit formulae for d-dimensional torus networks

We next examine the asymptotic scaling trends of the performance metric J̄ given by (2.32) for large problem dimensions n ≫ 1 and highlight the subtle influence of the distribution of the eigenvalues of L on the variance amplification for d-dimensional torus networks. Tori with nearest-neighbor interactions generalize one-dimensional rings to higher spatial dimensions. Let Z_{n_0} denote the group of integers modulo n_0. A d-dimensional torus T^d_{n_0} consists of n := n_0^d nodes denoted by v_a with a ∈ Z^d_{n_0}, and its set of edges is given by
{{v_a, v_b} | ∥a − b∥ = 1 mod n_0}
i.e., the nodes v_a and v_b are neighbors if and only if a and b differ at exactly one entry by one. For example, T^1_{n_0} denotes a ring with n = n_0 nodes and T^5_{n_0} denotes a five-dimensional torus with n = n_0^5 nodes.

The multidimensional discrete Fourier transform can be used to determine the eigenvalues of the Laplacian matrix L of a d-dimensional torus T^d_{n_0},
λ_i = Σ_{l=1}^{d} 2(1 − cos(2π i_l/n_0)),   i_l ∈ Z_{n_0}   (2.33)
where i := (i_1, ..., i_d) ∈ Z^d_{n_0}. We note that λ_0 = 0 is the only zero eigenvalue of L, with the eigenvector 1, and that all other eigenvalues are positive. Let κ := λ_max/λ_min be the ratio of the largest and smallest nonzero eigenvalues of L. A key observation is that, for n_0 ≫ 1,
κ = Θ(2/(1 − cos(2π/n_0))) = Θ(n_0^2) = Θ(n^{2/d}).   (2.34)
This is because λ_min = 2d(1 − cos(2π/n_0)) goes to zero as n_0 → ∞, and the largest eigenvalue of L, λ_max = 2d(1 − cos(2π⌊n_0/2⌋/n_0)), is equal to 4d for even n_0 and approaches 4d from below for odd n_0.

As mentioned above, the performance metric J̄ can be obtained as
J̄ = Σ_{0 ≠ i ∈ Z^d_{n_0}} Ĵ(λ_i)
where Ĵ(λ) for each algorithm is determined in Theorem 1 and λ_i are the nonzero eigenvalues of L. The next theorem characterizes the asymptotic value of the network-size normalized mean-squared deviation from the network average, J̄/n, for a fixed spatial dimension d and condition number κ ≫ 1. This result is obtained using the analytical expression (2.33) for the eigenvalues of the Laplacian matrix L.

Theorem 9 Let L ∈ R^{n×n} be the graph Laplacian of the d-dimensional undirected torus T^d_{n_0} with n = n_0^d ≫ 1 nodes.
For the convex quadratic optimization problem (2.31), the network-size normalized performance metric J̄/n of noisy first-order algorithms with the parameters provided in Table 2.2 and σ = 1 is determined by

            d = 1      d = 2           d = 3        d = 4       d = 5
Gradient    Θ(√κ)      Θ(log κ)        Θ(1)         Θ(1)        Θ(1)
Nesterov    Θ(κ)       Θ(√κ log κ)     Θ(κ^{1/4})   Θ(log κ)    Θ(1)
Polyak      Θ(κ)       Θ(√κ log κ)     Θ(√κ)        Θ(√κ)       Θ(√κ)

where κ = Θ(n^{2/d}) is the condition number of L given in (2.34).
Proof: See Appendix A.4. □

Theorem 9 demonstrates that the variance amplification of gradient descent is equivalent to that of the standard consensus algorithm studied in [43] and that, in lower spatial dimensions, acceleration always negatively impacts the performance of noisy algorithms. Our results also highlight the subtle influence of the distribution of the eigenvalues of L on the variance amplification. For rings (i.e., d = 1), the lower bounds provided in Theorem 4 capture the trends that our detailed analysis based on the distribution of the entire spectrum of L reveals. In higher spatial dimensions, however, the lower bounds obtained using only the extreme eigenvalues of L are conservative. A similar conclusion can be made about the upper bounds provided in Theorem 4. This observation demonstrates that naive bounds that result only from the use of the extreme eigenvalues can be overly conservative.

We also note that gradient descent significantly outperforms Nesterov's method in lower spatial dimensions. In particular, while J̄/n becomes network-size-independent at d = 3 for gradient descent, Nesterov's algorithm reaches critical connectivity only for d = 5. On the other hand, in any spatial dimension, there is no network-size-independent upper bound on J̄/n for the heavy-ball method. These conclusions could not have been reached without performing an in-depth analysis of the impact of all eigenvalues on the performance of noisy networks with n ≫ 1 and κ ≫ 1.

2.7 Concluding remarks

We study the robustness of noisy first-order algorithms for smooth, unconstrained, strongly convex optimization problems. Even though the underlying dynamics of these algorithms are in general nonlinear, we establish upper bounds on noise amplification that are accurate up to constant factors. For quadratic objective functions, we provide analytical expressions that quantify the effect of all eigenvalues of the Hessian matrix on variance amplification. We use these expressions to establish lower bounds demonstrating that, although acceleration techniques improve the convergence rate, they significantly amplify noise for problems with large condition numbers. In problems of bounded dimension n ≪ κ, the noise amplification increases from O(κ) to Ω(κ^{3/2}) when moving from standard gradient descent to accelerated algorithms. We specialize our results to the problem of distributed averaging over noisy undirected networks and also study the role of network size and topology on the robustness of accelerated algorithms. Future research directions include (i) extension of our analysis to multiplicative and correlated noise; and (ii) robustness analysis of broader classes of optimization algorithms.
Chapter 3

Tradeoffs between convergence rate and noise amplification for accelerated algorithms

We study momentum-based first-order optimization algorithms in which the iterations utilize information from the two previous steps and are subject to an additive white noise. This class of algorithms includes Polyak's heavy-ball and Nesterov's accelerated methods as special cases, and the noise accounts for uncertainty in either gradient evaluation or iteration updates. For strongly convex quadratic problems, we use the steady-state variance of the error in the optimization variable to quantify noise amplification and identify fundamental stochastic performance tradeoffs. Our approach utilizes the Jury stability criterion to provide a novel geometric characterization of conditions for linear convergence, and it clarifies the relation between noise amplification and convergence rate as well as their dependence on the condition number and the constant algorithmic parameters. This geometric insight leads to simple alternative proofs of standard convergence results and allows us to establish analytical lower bounds on the product between the settling time and noise amplification that scale quadratically with the condition number. Our analysis also identifies a key difference between the gradient and iterate noise models: while the amplification of gradient noise can be made arbitrarily small by sufficiently decelerating the algorithm, the best achievable variance amplification for the iterate noise model increases linearly with the settling time in the decelerating regime. Furthermore, we introduce two parameterized families of algorithms that strike a balance between noise amplification and settling time while preserving order-wise Pareto optimality for both noise models. Finally, by analyzing a class of accelerated gradient flow dynamics, whose suitable discretization yields the two-step momentum algorithm, we establish that stochastic performance tradeoffs also extend to continuous time.

3.1 Introduction

Accelerated first-order algorithms [5], [7], [8] are often used in solving large-scale optimization problems [1], [2], [4] because of their scalability, fast convergence, and low per-iteration complexity. Convergence properties of these algorithms have been carefully studied [6], [9], [51]-[56], but their performance in the presence of noise has received less attention [10]-[12], [57], [58]. Prior studies indicate that inaccuracies in the computation of gradient values can adversely impact the convergence rate of accelerated methods and that gradient descent may have advantages relative to its accelerated variants in noisy environments [23]-[26], [28]. In contrast to gradient descent, accelerated algorithms can also exhibit undesirable transient behavior [61], [98], [107]; for convex quadratic problems, the non-normal dynamic modes in accelerated algorithms induce large transient responses of the error in the optimization variable [98].

Analyzing the performance of accelerated algorithms with additive white noise that arises from uncertainty in gradient evaluation dates back to [57], where Polyak established the optimal linear convergence rate for strongly convex quadratic problems.
In addition, he used time-varying parameters to obtain convergence of the error variance at a sub-linear rate and with an improved constant factor compared to gradient descent. Acceleration in a sub-linear regime can also be achieved for smooth strongly convex problems with a properly diminishing stepsize [30], and averaging techniques can be used to prevent the accumulation of gradient noise by accelerated algorithms [108]. For standard accelerated methods with constant parameters, control-theoretic tools were utilized in [93] and [109] to study the steady-state variance of the error in the optimization variable for smooth strongly convex problems. In particular, for the parameters that optimize convergence rates for quadratic problems, tight upper and lower bounds on the noise amplification of gradient descent, the heavy-ball method, and Nesterov's accelerated algorithm were developed in [93]. These bounds are expressed in terms of the condition number κ and the problem dimension n, and they demonstrate opposite trends relative to the settling time: for a fixed problem size n, accelerated algorithms increase noise amplification by a factor of Θ(√κ) relative to gradient descent. A similar result also holds for heavy-ball and Nesterov's algorithms with parameters that provide convergence rate ρ ≤ 1 − c/√κ with c > 0 [93]. Furthermore, for all strongly convex optimization problems with a condition number κ, tight and attainable upper bounds on the noise amplification of gradient descent and Nesterov's accelerated method were provided in [93].

In this chapter, we extend the results of [93] to the class of first-order algorithms with three constant parameters in which the iterations involve information from the two previous steps. This class includes heavy-ball and Nesterov's accelerated algorithms as special cases, and we examine its stochastic performance for strongly convex quadratic problems. Our results are complementary to [103], which evaluates stochastic performance in the objective error, and to a recent work [109] that studies the steady-state variance of the error associated with the point at which the gradient is evaluated. This reference combines theory with computational experiments to demonstrate that a parameterized family of heavy-ball-like methods with reduced stepsize provides Pareto-optimal algorithms for the simultaneous optimization of convergence rate and amplification of gradient noise. In contrast to [109], we establish analytical lower bounds on the product of the settling time and the steady-state variance of the error in the optimization variable that hold for any constant stabilizing parameters and for both gradient and iterate noise models. Our lower bounds scale with the square of the condition number and thus reveal a fundamental limitation of this class of algorithms.

In addition to considering noise arising from gradient evaluation, we study the stochastic performance of algorithms when noise is directly added to the iterates (rather than the gradient). For the iterate noise model, we establish an alternative lower bound on the noise amplification which scales linearly with the settling time and is order-wise tight for settling times that are larger than that of gradient descent with the standard stepsize.
In this decelerated regime, our results identify a key difference between the two noise models: while the impact of gradient uncertainties on variance amplification can be made arbitrarily small by decelerating the two-step momentum algorithm, the best achievable variance amplification for the iterate noise model increases linearly with the settling time in the decelerated regime.

Our results build upon a simple, yet powerful geometric viewpoint which clarifies the relation between the condition number, convergence rate, and algorithmic parameters for strongly convex quadratic problems. This viewpoint allows us to present alternative proofs for (i) the optimal convergence rate of the two-step momentum algorithm, which recovers Nesterov's fundamental lower bound on the convergence rate [59] for finite dimensional problems [60]; and (ii) the optimal rates achieved by standard gradient descent, the heavy-ball method, and Nesterov's accelerated algorithm [52]. In addition, it enables a novel geometric characterization of noise amplification in terms of stability margins and allows us to precisely quantify tradeoffs between convergence rate and robustness to noise.

We also introduce two parameterized families of algorithms that are structurally similar to the heavy-ball and Nesterov's accelerated algorithms. These algorithms utilize continuous transformations from gradient descent to the corresponding accelerated algorithm (with the optimal convergence rate) via a homotopy path, and they can be used to provide additional insight into the tradeoff between convergence rate and noise amplification. We prove that these parameterized families are order-wise (in terms of the condition number) Pareto-optimal for the simultaneous minimization of settling time and noise amplification. Another family of algorithms that facilitates a similar tradeoff was proposed in [54], and it includes the fastest known algorithm for the class of smooth strongly convex problems. We also utilize negative momentum parameters to decelerate a heavy-ball-like family of algorithms relative to gradient descent with the optimal stepsize. For both noise models, our parameterized family yields order-wise optimal algorithms and it allows us to further highlight the key difference between them in the decelerated regime.

Finally, we examine the noise amplification of a class of stochastically-forced momentum-based accelerated gradient flow dynamics. Such dynamics were introduced in [110] as a continuous-time variant of Nesterov's accelerated algorithm, and a Lyapunov-based method was used to establish their stability properties and infer the convergence rate. Inspired by this work, we examine the tradeoffs between the noise amplification and convergence rate of similar gradient flow dynamics for strongly convex quadratic problems. We introduce a geometric viewpoint analogous to the discrete-time setting to characterize the optimal convergence rate and identify the corresponding algorithmic parameters. We then examine the dependence of the noise amplification on the parameters and the spectrum of the Hessian matrix and demonstrate that our findings regarding the restrictions imposed by the condition number on the product of the settling time and noise amplification extend to the continuous-time case as well.

The rest of the chapter is organized as follows.
In Section 3.2, we provide preliminaries and background material and, in Section 3.3, we summarize our key contributions. In Section 3.4, we introduce the tools and ideas that enable our analysis. In particular, we utilize the Jury stability criterion to provide a novel geometric characterization of stability and ρ-linear convergence and exploit this insight to derive simple alternative proofs of standard convergence results and quantify fundamental stochastic performance tradeoffs. In Section 3.5, we introduce two parameterized families of algorithms that allow us to constructively trade off settling time and noise amplification. In Section 3.6, we extend our results to the continuous-time setting, in Section 3.7, we provide proofs of our main results, and in Section 3.8, we conclude the chapter.

3.2 Preliminaries and background

For the unconstrained optimization problem
minimize_x f(x)   (3.1)
where f: R^n → R is a strongly convex function with a Lipschitz continuous gradient ∇f, we consider noisy momentum-based first-order algorithms that use information from the two previous steps to update the optimization variable:
x_{t+2} = x_{t+1} + β(x_{t+1} − x_t) − α∇f(x_{t+1} + γ(x_{t+1} − x_t)) + σ_w w_t.   (3.2)
Here, t is the iteration index, α is the stepsize, β and γ are momentum parameters, σ_w is the noise magnitude, and w_t is an additive white noise with zero mean and identity covariance matrix,
E(w_t) = 0,   E(w_t (w_τ)^T) = I δ(t − τ)
where δ is the Kronecker delta and E is the expected value operator. In this chapter, we consider two noise models.

1. Iterate noise (σ_w = σ): models uncertainty in computing the iterates of (3.2), where σ denotes the stepsize-independent noise magnitude.
2. Gradient noise (σ_w = ασ): models uncertainty in the gradient evaluation. In this case, the stepsize α directly impacts the magnitude of the additive noise.

Iterate noise models scenarios in which uncertainties in the optimization variables arise from roundoff, quantization, and communication errors. This model has also been used to improve generalization and robustness in machine learning [111]. On the other hand, the second noise model accounts for gradient computation error or scenarios in which the gradient is estimated from noisy measurements [40]. Noise may also be intentionally added to the gradient for privacy reasons [112].

Remark 1 An alternative noise model with σ_w = √α σ has been used to escape local minima in stochastic gradient descent [113] and to provide non-asymptotic guarantees in nonconvex learning [114], [115]. This model arises from a discretization of the continuous-time Langevin diffusion dynamics [114] and, for strongly convex quadratic problems, our framework can be used to examine acceleration/robustness tradeoffs. For algorithms that are faster than standard gradient descent, this model has order-wise identical performance bounds as the other two models and the only difference arises in the decelerated regime. We omit details for brevity.

Special cases of (3.2) include noisy gradient descent (β = γ = 0), Polyak's heavy-ball method (γ = 0), and Nesterov's accelerated algorithm (γ = β). In the absence of noise (i.e., for σ = 0), the parameters (α, β, γ) can be selected such that the iterates converge linearly to the globally optimal solution [9].
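The update (3.2) is simple to implement directly. The following minimal sketch does so; the quadratic test function, parameter values, noise magnitude, and the helper name two_step_momentum are illustrative choices rather than quantities taken from the thesis.

```python
import numpy as np

def two_step_momentum(grad, x0, alpha, beta, gamma, sigma_w, steps, rng):
    x_prev = np.array(x0, dtype=float)
    x = x_prev.copy()
    for _ in range(steps):
        y = x + gamma * (x - x_prev)        # point at which the gradient is evaluated
        x_next = (x + beta * (x - x_prev) - alpha * grad(y)
                  + sigma_w * rng.standard_normal(x.size))
        x_prev, x = x, x_next
    return x

# Example on f(x) = 0.5 x^T Q x with Q = diag(1, 10): beta = gamma = 0 gives gradient
# descent, gamma = 0 the heavy-ball method, and gamma = beta Nesterov's method.
Q = np.diag([1.0, 10.0])
rng = np.random.default_rng(0)
x_final = two_step_momentum(lambda x: Q @ x, [1.0, 1.0], alpha=0.1, beta=0.5, gamma=0.5,
                            sigma_w=0.01, steps=500, rng=rng)
print(x_final)    # hovers near x* = 0, with fluctuations set by the noise magnitude
```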
For the family of smooth strongly convex problems, the parameters that yield the fastest known linear convergence rate were provided in [55].

3.2.1 Linear dynamics for quadratic problems

Let Q_m^L denote the class of m-strongly convex L-smooth quadratic functions
f(x) = (1/2) x^T Q x − q^T x   (3.3)
with condition number κ := L/m, where q is a vector and Q = Q^T ≻ 0 is the Hessian matrix with eigenvalues
L = λ_1 ≥ λ_2 ≥ ... ≥ λ_n = m > 0.
For the quadratic objective function in (3.3), we can use a linear time-invariant (LTI) state-space model to describe the two-step momentum algorithm (3.2) with constant parameters,
ψ_{t+1} = A ψ_t + B w_t,   z_t = C ψ_t   (3.4a)
where ψ_t is the state, z_t := x_t − x* is the performance output, and w_t is the white stochastic input. In particular, choosing ψ_t := [(x_t − x*)^T (x_{t+1} − x*)^T]^T yields
A = [0, I; −βI + γαQ, (1 + β)I − (1 + γ)αQ],   B^T = [0, σ_w I],   C = [I, 0].   (3.4b)

3.2.2 Convergence rates

An algorithm is stable if, in the absence of noise (i.e., σ_w = 0), the state converges linearly with some rate ρ < 1,
∥ψ_t∥_2 ≤ c ρ^t ∥ψ_0∥_2 for all t ≥ 1   (3.5)
for all f ∈ Q_m^L and all initial conditions ψ_0, where c > 0 is a constant. For LTI system (3.4a), the spectral radius ρ(A) determines the best achievable convergence rate. In addition,
T_s := 1/(1 − ρ)   (3.6)
determines the settling time, i.e., the number of iterations required to reach a given desired accuracy; see Appendix B.1.

method       fastest parameters (α, β, γ)                  T_s          J_min/σ_w^2          J_max/σ_w^2
Gradient     (2/(L+m), 0, 0)                               (κ+1)/2      Θ(κ) + n             nΘ(κ)
Heavy-ball   (4/(√L+√m)^2, (1 − 2/(√κ+1))^2, 0)            (√κ+1)/2     Θ(κ√κ) + nΘ(√κ)      nΘ(κ√κ)
Nesterov     (4/(3L+m), 1 − 4/(√(3κ+1)+2), β)              √(3κ+1)/2    Θ(κ√κ) + n           nΘ(κ√κ)

Table 3.1: Settling times T_s := 1/(1 − ρ) [52, Proposition 1] along with the corresponding noise amplification bounds in (3.10) [93, Theorem 4] for the parameters that optimize the linear convergence rate ρ for a strongly convex quadratic function f ∈ Q_m^L with condition number κ := L/m. Here, n is the dimension of x and σ_w^2 is the variance of the white noise.

For the class Q_m^L of high-dimensional functions (i.e., for n ≳ T_s), Nesterov established the fundamental lower bound on the settling time (convergence rate) of any first-order algorithm [9],
T_s ≥ (√κ + 1)/2.   (3.7)
This lower bound is sharp and it is achieved by the heavy-ball method with the parameters provided in Table 3.1 [52].

3.2.3 Noise amplification

For LTI system (3.4a) driven by an additive white noise w_t, E(ψ_{t+1}) = A E(ψ_t). Thus, E(ψ_t) = A^t E(ψ_0) and, for any stabilizing parameters (α, β, γ), the iterates reach a statistical steady state with lim_{t→∞} E(ψ_t) = 0 and a variance that can be computed from the solution of the algebraic Lyapunov equation [41], [93]. We call the steady-state variance of the error in the optimization variable the noise (or variance) amplification,
J := lim_{t→∞} (1/t) Σ_{k=0}^{t} E(∥x_k − x*∥_2^2).   (3.8)
In addition to the algorithmic parameters (α, β, γ), the entire spectrum {λ_i | i = 1, ..., n} of the Hessian matrix Q impacts the noise amplification J of algorithm (3.2) [93].

Remark 2 An alternative performance metric that examines the steady-state variance of y_t − x* was considered in [109], where y_t := x_t + γ(x_t − x_{t−1}) is the point at which the gradient is evaluated in (3.2).
For all γ ≥ 0, we have J_x ≤ J_y ≤ (1 + 2γ)^2 J_x, where the subscripts x and y denote the noise amplification in terms of the error in x_t and y_t. Thus, these performance metrics are within a constant factor of each other for bounded values of γ ≥ 0.

3.2.4 Parameters that optimize convergence rate

For special instances of the two-step momentum algorithm (3.2) applied to strongly convex quadratic problems, namely gradient descent (gd), the heavy-ball method (hb), and Nesterov's accelerated algorithm (na), the parameters that yield the fastest convergence rates were established in [52], [57]. These parameters, along with the corresponding rates and noise amplification bounds, are provided in Table 3.1. The convergence rates are determined by the spectral radius of the corresponding A-matrices and the noise amplification bounds are computed by examining the solution to the algebraic Lyapunov equation and determining the functions f ∈ Q_m^L for which the steady-state variance is maximized/minimized [93, Proposition 1]. Since the optimal convergence rate for the heavy-ball method meets the fundamental lower bound (3.7), this choice of parameters also optimizes the convergence rate of the two-step momentum algorithm (3.2) for f ∈ Q_m^L.

For the optimal parameters provided in Table 3.1, there is a Θ(√κ) improvement in the settling times of the heavy-ball and Nesterov's accelerated algorithms relative to gradient descent,
T_s = Θ(κ) (gd),   Θ(√κ) (hb, na)   (3.9)
where a = Θ(b) means that a lies within constant factors of b as b → ∞. This improvement makes accelerated algorithms popular for problems with large condition number κ. While the convergence rate is only affected by the largest and smallest eigenvalues of Q, the entire spectrum of Q influences the noise amplification J. On the other hand, the largest and smallest values of J over the function class Q_m^L,
J_max := max_{f ∈ Q_m^L} J,   J_min := min_{f ∈ Q_m^L} J   (3.10)
depend only on the noise magnitude σ_w, the algorithmic parameters (α, β, γ), the problem dimension n, and the extreme eigenvalues m and L of Q. For the parameters that optimize convergence rates, tight upper and lower bounds on the noise amplification were developed in [93, Theorem 4]. These bounds are expressed in terms of the condition number κ and the problem dimension n, and they demonstrate opposite trends relative to the settling time. In particular, for gradient descent,
J_max = σ_w^2 nΘ(κ),   J_min = σ_w^2 (Θ(κ) + n)   (3.11a)
and for accelerated algorithms,
J_max = σ_w^2 nΘ(κ√κ),   J_min = σ_w^2 (Θ(κ√κ) + nΘ(√κ)) (hb),   σ_w^2 (Θ(κ√κ) + n) (na).   (3.11b)
We observe that, for fixed problem dimension n and noise magnitude σ_w, the accelerated algorithms increase noise amplification by a factor of Θ(√κ) relative to gradient descent for the parameters that optimize convergence rates. While a similar result also holds for heavy-ball and Nesterov's algorithms with arbitrary values of parameters α and β that provide settling time T_s ≤ c√κ with c > 0 [93, Theorem 8], in this chapter we establish fundamental tradeoffs between noise amplification and settling time for the class of two-step momentum algorithms (3.2) with arbitrary stabilizing values of constant parameters.

3.3 Summary of main results

In this section, we summarize our key contributions regarding tradeoffs between robustness and convergence of the noisy two-step momentum algorithm (3.2).
In addition, our geometric characterization of stability and ρ-linear convergence allows us to provide alternative proofs of standard convergence results and quantify fundamental performance tradeoffs. The proofs of the results presented here can be found in Section 3.7.

3.3.1 Bounded noise amplification for stabilizing parameters

For a discrete-time LTI system with convergence rate ρ, the distance of the eigenvalues to the unit circle is lower bounded by 1 − ρ. We use this stability margin to establish an upper bound on the noise amplification J of the two-step momentum method (3.2) for any stabilizing parameters (α, β, γ).

Theorem 1 Let the parameters (α, β, γ) be such that the two-step momentum algorithm in (3.2) converges linearly with the rate ρ = 1 − 1/T_s for all f ∈ Q_m^L. Then,
J ≤ σ_w^2 (1 + ρ^2)/(1 + ρ)^3 n T_s^3   (3.12a)
where n is the problem size. Furthermore, for the gradient noise model (σ_w = ασ),
J ≤ σ^2 (1 + ρ)(1 + ρ^2)/L^2 n T_s^3.   (3.12b)

For ρ < 1, both upper bounds in (3.12) scale with nT_s^3 and they are exact for the heavy-ball method with the parameters that optimize the convergence rate provided in Table 3.1. However, these bounds are not tight for all stabilizing parameters; e.g., applying (3.12a) to gradient descent with the optimal stepsize α = 2/(L + m) yields J ≤ σ_w^2 nΘ(κ^3), which is off by a factor of κ^2; cf. Table 3.1. The bound in (3.12b) is obtained by combining (3.12a) with αL ≤ (1 + ρ)^2, which follows from the conditions for ρ-linear convergence in Section 3.4.

3.3.2 Tradeoff between settling time and noise amplification

In this subsection, we establish lower bounds on the products J_max × T_s and J_min × T_s for any stabilizing constant parameters (α, β, γ) in the two-step momentum algorithm (3.2), where J_max and J_min defined in (3.10) are the largest and smallest noise amplification over the class of functions Q_m^L and T_s is the settling time.

Theorem 2 Let the parameters (α, β, γ) be such that the two-step momentum algorithm in (3.2) converges linearly with the rate ρ = 1 − 1/T_s for all f ∈ Q_m^L. Then, J_max and J_min in (3.10) satisfy
J_max × T_s ≥ σ_w^2 ((n − 1)κ^2/64 + (√κ + 1)/2)   (3.13a)
J_min × T_s ≥ σ_w^2 (κ^2/64 + (n − 1)(√κ + 1)/2).   (3.13b)
Furthermore, for the gradient noise model (σ_w = ασ), we have
J_max × T_s ≥ (σ^2/L^2) ((n − 1)κ^2/4 + max{κ^2/T_s^3, 1/4})   (3.13c)
J_min × T_s ≥ (σ^2/L^2) (κ^2/4 + (n − 1) max{κ^2/T_s^3, 1/4}).   (3.13d)

For both noise models, the condition number κ restricts the performance of the two-step momentum algorithm with constant parameters: for a fixed problem size n, all four lower bounds in (3.13) scale with κ^2. Relative to the dominant term in κ, the problem dimension n appears in a multiplicative fashion in the lower bounds on J_max and in an additive fashion in the lower bounds on J_min.

Next, by establishing upper bounds on J_max × T_s and J_min × T_s for a parameterized family of heavy-ball-like algorithms in Theorem 3, we prove that for any settling time T_s these bounds are order-wise tight (in κ) for the gradient noise model. On the other hand, for the iterate noise model, they are tight only if T_s is smaller than the best achievable settling time of gradient descent, (κ + 1)/2.
Theorem 3 For the class of strongly convex quadratic functions Q_m^L with the condition number κ = L/m, let the scalar ρ be such that the fundamental lower bound T_s = 1/(1 − ρ) ≥ (√κ + 1)/2 given by (3.7) holds. Then, the two-step momentum algorithm (3.2) with parameters
α = (1 + ρ)(1 + β/ρ)/L,   β = ρ (κ − (1 + ρ)/(1 − ρ))/(κ + (1 + ρ)/(1 − ρ)),   γ = 0   (3.14)
converges linearly with the rate ρ and, for settling times T_s ≤ (κ + 1)/2, J_max and J_min in (3.10) satisfy
J_max × T_s ≤ σ_w^2 nκ(κ + 1)/2   (3.15a)
J_min × T_s ≤ σ_w^2 κ(κ + n − 1).   (3.15b)
Furthermore, for the gradient noise model (σ_w = ασ) and any settling time that satisfies the inequality in (3.7), parameters (3.14) lead to
J_max × T_s ≤ σ^2 nκ(κ + 1)/L^2   (3.15c)
J_min × T_s ≤ 2σ^2 κ(κ + 4n − 7)/L^2.   (3.15d)

Theorem 3 provides upper bounds on J_max × T_s and J_min × T_s for a family of heavy-ball-like algorithms (γ = 0) parameterized by the settling time T_s. For both noise models, the upper bounds in (3.15) scale with κ^2, which matches the corresponding lower bounds in (3.13). For the gradient noise model, the upper and lower bounds are order-wise tight (in κ) for any settling time. However, for the iterate noise model, the lower bounds in Theorem 2 can be improved when T_s ≥ (κ + 1)/2. In Theorem 4, we establish alternative lower bounds on J_max and J_min that scale with T_s for the two-step momentum algorithm (3.2) with any stabilizing parameters. We also utilize the parameterized family (3.14) of heavy-ball-like algorithms with negative momentum parameter β to increase T_s beyond (κ + 1)/2 and obtain upper bounds on J_max and J_min that scale linearly with T_s for the iterate noise model.

Theorem 4 Let the parameters (α, β, γ) be such that the two-step momentum algorithm in (3.2) achieves the convergence rate ρ = ρ(A) = 1 − 1/T_s, where the matrix A is given by (3.4). Then, J_max and J_min in (3.10) satisfy
J_max ≥ σ_w^2 ((n − 1) T_s/(2(1 + ρ)^2) + 1)   (3.16a)
J_min ≥ σ_w^2 (T_s/(2(1 + ρ)^2) + (n − 1)).   (3.16b)
Furthermore, for the parameterized family of heavy-ball-like algorithms (3.14) with T_s ≥ (κ + 1)/2,
J_max ≤ σ_w^2 n T_s   (3.17a)
J_min ≤ 2σ_w^2 (1 + (n − 2)/κ) T_s.   (3.17b)

We note that the condition T_s ≥ (κ + 1)/2 in Theorem 4 corresponds to a non-positive momentum parameter β ≤ 0. We also observe that both the upper and lower bounds on J_max and J_min in Theorem 4 grow linearly with T_s and that, for the iterate noise model with T_s ≥ (κ + 1)/2, the lower bound is sharper than the one established in Theorem 2.

Remark 3 Since Q_m^L is a subset of the class of m-strongly convex functions with L-Lipschitz continuous gradients, the fundamental lower bounds on J_max × T_s established in Theorem 2 carry over to this broader class of problems. Thus, the restriction imposed by the condition number on the tradeoff between settling time and noise amplification goes beyond Q_m^L and holds for general strongly convex problems.

Remark 4 The upper bounds in Theorems 3 and 4 are obtained for a particular choice of constant parameters. Thus, they also provide upper bounds on the best achievable noise amplification bounds J*_max := min_{α,β,γ} J_max and J*_min := min_{α,β,γ} J_min for a settling time T_s; see Figure 3.1.
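Before turning to the summary in Figure 3.1, a small numerical sketch of the family (3.14) may be helpful: for a chosen settling time, the spectral radius of the modal matrices should match the target rate ρ and J × T_s should stay below the bound in (3.15a). The modal matrices and the iterate-noise input follow (3.18) in Section 3.4; the condition number, settling time, and sampled spectrum below are illustrative.

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

def modal_A(lam, alpha, beta, gamma=0.0):
    a = beta - gamma * alpha * lam
    b = (1 + gamma) * alpha * lam - (1 + beta)
    return np.array([[0.0, 1.0], [-a, -b]])

m, L, sigma_w = 1.0, 100.0, 1.0
kappa = L / m
Ts = 0.6 * (kappa + 1) / 2                   # any T_s in [(sqrt(kappa)+1)/2, (kappa+1)/2]
rho = 1 - 1 / Ts
r = (1 + rho) / (1 - rho)
beta = rho * (kappa - r) / (kappa + r)       # (3.14)
alpha = (1 + rho) * (1 + beta / rho) / L     # (3.14)

lams = np.linspace(m, L, 11)                 # an illustrative spectrum in [m, L]
rho_num = max(max(abs(np.linalg.eigvals(modal_A(lam, alpha, beta)))) for lam in lams)
B = np.array([[0.0], [sigma_w]])
J = sum(solve_discrete_lyapunov(modal_A(lam, alpha, beta), B @ B.T)[0, 0] for lam in lams)

print("target rate vs numerical rate:", rho, rho_num)
print("J*Ts vs upper bound (3.15a)  :", J * Ts, sigma_w**2 * len(lams) * kappa * (kappa + 1) / 2)
```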
Iterate noise (σ_w = 1):
    J ≤ ((1 + ρ^2)/(1 + ρ)^3) n T_s^3   (3.12a)
    J*_max ≤ (1/2) nκ(κ + 1) T_s^{-1} if T_s ≤ (κ + 1)/2;   n T_s if T_s ≥ (κ + 1)/2   (3.15a), (3.17a)
    J*_max ≥ max{ ((n − 1)κ^2/64 + (√κ + 1)/2) T_s^{-1},  (n − 1)T_s/8 + 1 }   (3.13a), (3.16a)
    J*_min ≤ κ(κ + n − 1) T_s^{-1} if T_s ≤ (κ + 1)/2;   2(1 + (n − 2)/κ) T_s if T_s ≥ (κ + 1)/2   (3.15b), (3.17b)
    J*_min ≥ max{ (κ^2/64 + (n − 1)(√κ + 1)/2) T_s^{-1},  T_s/8 + n − 1 }   (3.13b), (3.16b)

Gradient noise (σ_w = α):
    J ≤ ((1 + ρ)(1 + ρ^2)/L^2) n T_s^3   (3.12b)
    J*_max ≤ (1/L^2) nκ(κ + 1) T_s^{-1}   (3.15c)
    J*_max ≥ (1/L^2) ((n − 1)κ^2/4 + max{κ^2/T_s^3, 1/4}) T_s^{-1}   (3.13c)
    J*_min ≤ (2/L^2) κ(κ + 4n − 7) T_s^{-1}   (3.15d)
    J*_min ≥ (1/L^2) (κ^2/4 + (n − 1) max{κ^2/T_s^3, 1/4}) T_s^{-1}   (3.13d)

Figure 3.1: Summary of the results established in Theorems 1-4 for σ^2 = 1. The top and bottom rows correspond to the iterate and gradient noise models, respectively, and they illustrate (i) J*_max := min_{α,β,γ} max_f J and J*_min := min_{α,β,γ} min_f J subject to a settling time T_s for f ∈ Q_m^L (black curves); and (ii) their corresponding upper (maroon curves) and lower (red curves) bounds in terms of the condition number κ = L/m, problem size n, and settling time T_s. The upper bounds on J established in Theorem 1 are marked by blue curves. The dark shaded region and its union with the light shaded region respectively correspond to all possible pairs (T_s, max_f J) and (T_s, min_f J) for f ∈ Q_m^L and any stabilizing parameters (α, β, γ).

3.4 Geometric characterization

In this section, we examine the relation between the convergence rate and noise amplification of the two-step momentum algorithm (3.2) for strongly convex quadratic problems. In particular, the eigenvalue decomposition of the Hessian matrix Q allows us to bring the dynamics into n decoupled second-order systems parameterized by the eigenvalues of Q and the algorithmic parameters (α, β, γ). We utilize the Jury stability criterion to provide a novel geometric characterization of stability and ρ-linear convergence and exploit this insight to derive alternative proofs of standard convergence results and quantify fundamental performance tradeoffs.

3.4.1 Modal decomposition

We utilize the eigenvalue decomposition of the Hessian matrix Q = Q^T ≻ 0, Q = V Λ V^T, where Λ is the diagonal matrix of the eigenvalues and V is the orthogonal matrix of the corresponding eigenvectors. The change of variables x̂_t := V^T(x_t − x*) and ŵ_t := V^T w_t allows us to bring (3.4) into n decoupled second-order subsystems,
ψ̂^{t+1}_i = Â_i ψ̂^t_i + B̂_i ŵ^t_i,   ẑ^t_i = Ĉ_i ψ̂^t_i   (3.18a)
where ŵ^t_i is the ith component of the vector ŵ_t ∈ R^n,
ψ̂^t_i = [x̂^t_i  x̂^{t+1}_i]^T,   Â_i = Â(λ_i) := [0, 1; −a(λ_i), −b(λ_i)],   B̂_i = [0  σ_w]^T,   Ĉ_i = [1  0]   (3.18b)
and
a(λ) := β − γαλ,   b(λ) := (1 + γ)αλ − (1 + β).   (3.18c)

3.4.2 Conditions for linear convergence

For the class of strongly convex quadratic functions Q_m^L, the best convergence rate ρ is determined by the largest spectral radius of the matrices Â(λ) in (3.18) for λ ∈ [m, L],
ρ = max_{λ ∈ [m,L]} ρ(Â(λ)).   (3.19)
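The characterization (3.18)-(3.19) is easy to evaluate numerically. The short sketch below builds Â(λ) over a grid of λ ∈ [m, L] and confirms that the maximal spectral radius over the interval agrees with the value at the endpoints, which is the property established below (Lemma 3 and the quasi-convexity argument of [109]); the tunings and the value of L are illustrative choices.

```python
import numpy as np

def A_hat(lam, alpha, beta, gamma):
    a = beta - gamma * alpha * lam
    b = (1 + gamma) * alpha * lam - (1 + beta)
    return np.array([[0.0, 1.0], [-a, -b]])

spectral_radius = lambda M: max(abs(np.linalg.eigvals(M)))

m, L = 1.0, 36.0
grid = np.linspace(m, L, 2001)
for (alpha, beta, gamma) in [(2 / (L + m), 0.0, 0.0),    # gradient descent
                             (1.0 / L, 0.5, 0.0),        # a heavy-ball-like tuning
                             (1.0 / L, 0.3, 0.3)]:       # a Nesterov-like tuning
    rho_grid = max(spectral_radius(A_hat(lam, alpha, beta, gamma)) for lam in grid)
    rho_ends = max(spectral_radius(A_hat(lam, alpha, beta, gamma)) for lam in (m, L))
    print(f"alpha={alpha:.4f}, beta={beta}, gamma={gamma}: "
          f"max over [m,L] = {rho_grid:.6f}, max over endpoints = {rho_ends:.6f}")
```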
For the heavy-ball and Nesterov's accelerated methods, analytical expressions for ρ(Â(λ)) were developed and the algorithmic parameters that optimize the convergence rate were obtained in [52]. Unfortunately, these expressions do not provide insight into the relation between convergence rates and noise amplification. In this chapter, we ask the dual question: For a fixed convergence rate ρ, what is the largest condition number κ that can be handled by the two-step momentum algorithm (3.2) with constant parameters (α, β, γ)?

We note that the matrices Â(λ) share the same structure as
M = [0, 1; −a, −b]   (3.20a)
with real scalars a and b, and that the characteristic polynomial associated with the matrix M is given by
F(z) := det(zI − M) = z^2 + bz + a.   (3.20b)
We next utilize the Jury stability criterion [116, Chap. 4-3] to provide conditions for stability of the matrix M given by (3.20a).

Lemma 1 For the matrix M ∈ R^{2×2} given by (3.20a),
ρ(M) < 1 ⟺ (b, a) ∈ ∆   (3.21a)
where the stability set
∆ := {(b, a) | |b| − 1 < a < 1}   (3.21b)
is an open triangle in the (b, a)-plane with vertices
X = (−2, 1),   Y = (2, 1),   Z = (0, −1).   (3.21c)

Proof: The characteristic polynomial F(z) associated with the matrix M is given by equation (3.20b) and the Jury stability criterion [116, Chap. 4-3] provides necessary and sufficient conditions for stability,
|a| < 1,   F(±1) = 1 ± b + a > 0.
The condition a > −1 is ensured by the positivity of F(±1). □

For any ρ > 0, the spectral radius ρ(M) of the matrix M is smaller than ρ if and only if ρ(M/ρ) is smaller than 1. This observation in conjunction with Lemma 1 allows us to obtain necessary and sufficient conditions for stability with linear convergence rate ρ of the two-step momentum algorithm (3.2).

Lemma 2 For any positive scalar ρ < 1 and the matrix M ∈ R^{2×2} given by (3.20a), we have
ρ(M) ≤ ρ ⟺ (b, a) ∈ ∆_ρ   (3.22a)
where the ρ-linear convergence set
∆_ρ := {(b, a) | ρ(|b| − ρ) ≤ a ≤ ρ^2}   (3.22b)
is a closed triangle in the (b, a)-plane with vertices
X_ρ = (−2ρ, ρ^2),   Y_ρ = (2ρ, ρ^2),   Z_ρ = (0, −ρ^2).   (3.22c)
Proof: See Appendix B.3. □

Figure 3.2: The stability set ∆ (the open, cyan triangle) in (3.21b) and the ρ-linear convergence set ∆_ρ (the closed, yellow triangle) in (3.22b) along with the corresponding vertices. For the point (b, a) (black dot) associated with the matrix M in (3.20a), the corresponding distances (d, h, l) in (3.29) are marked by black lines.

Figure 3.2 illustrates the stability and ρ-linear convergence sets ∆ and ∆_ρ. We note that for any ρ ∈ (0, 1), we have ∆_ρ ⊂ ∆. This can be verified by observing that the vertices (X_ρ, Y_ρ, Z_ρ) of ∆_ρ all lie in ∆. For the two-step momentum algorithm (3.2), the functions a(λ) and b(λ) given by (3.18c) satisfy the affine relation
(1 + γ) a(λ) + γ b(λ) = β − γ.   (3.23)
This fact in conjunction with Lemmas 1 and 2 allows us to derive conditions for stability and the convergence rate.

Lemma 3 The two-step momentum algorithm (3.2) with constant parameters (α, β, γ) is stable for all functions f ∈ Q_m^L if and only if the following equivalent conditions hold:
1. (b(λ), a(λ)) ∈ ∆ for all λ ∈ [m, L];
2. (b(λ), a(λ)) ∈ ∆ for λ ∈ {m, L}.
Furthermore, the linear convergence rate ρ < 1 is achieved for all functions f ∈ Q_m^L if and only if the following equivalent conditions hold:
1. (b(λ), a(λ)) ∈ ∆_ρ for all λ ∈ [m, L];
2. (b(λ), a(λ)) ∈ ∆_ρ for λ ∈ {m, L}.
Here, (b(λ), a(λ)) is given by (3.18c), and the stability and ρ-linear convergence triangles ∆ and ∆_ρ are given by (3.21b) and (3.22b), respectively.

Proof: The conditions in 1) follow from combining (3.19) with Lemma 1 (for stability) and with Lemma 2 (for ρ-linear convergence). The conditions in 2) follow from the facts that ∆ and ∆_ρ are convex sets and that (b(λ), a(λ)) is a line segment in the (b, a)-plane with end points corresponding to λ = m and λ = L. □

Figure 3.3: For a fixed ρ-linear convergence triangle ∆_ρ (yellow), dashed blue lines mark the line segments (b(λ), a(λ)) with λ ∈ [m, L] for gradient descent, Polyak's heavy-ball, and Nesterov's accelerated methods as particular instances of the two-step momentum algorithm (3.2) with constant parameters. The solid blue line segments correspond to the parameters for which the algorithm achieves rate ρ for the largest possible condition number given by (3.28).

Lemma 3 exploits the affine relation (3.23) between a(λ) and b(λ) and the convexity of the sets ∆ and ∆_ρ to establish necessary and sufficient conditions for stability and ρ-linear convergence: the inclusion of the end points of the line segment (b(λ), a(λ)) associated with the extreme eigenvalues m and L of the matrix Q in the corresponding triangle. A similar approach was taken in [109, Appendix A.1], where the affine nature of the conditions resulting from the Jury stability criterion with respect to λ was used to conclude that ρ(Â(λ)) is a quasi-convex function of λ and to show that the extreme points m and L determine ρ(A). In contrast, we exploit the triangular shapes of the stability and ρ-linear convergence sets ∆ and ∆_ρ and utilize this geometric insight to identify the parameters that optimize the convergence rate and to establish tradeoffs between noise amplification and convergence rate. The following corollary is immediate.

Corollary 1 Let the two-step momentum algorithm (3.2) with constant parameters (α, β, γ) minimize a function f ∈ Q_m^L with a linear rate ρ < 1. Then, the convergence rate ρ is achieved for all functions f ∈ Q_m^L.

Proof: Lemma 3 implies that only the extreme eigenvalues m and L of Q determine ρ. Since all functions f ∈ Q_m^L share the same extreme eigenvalues, this completes the proof. □

For the two-step momentum algorithm (3.2) with constant parameters, Lemma 3 leads to a simple alternative proof of the fundamental lower bound (3.7) on the settling time established by Nesterov. Our proof utilizes the fact that, for any point (b(λ), a(λ)) ∈ ∆_ρ, the horizontal signed distance to the edge XZ of the stability triangle ∆ satisfies
d(λ) := a(λ) + b(λ) + 1 = αλ   (3.24)
where a and b are given by (3.18c); see Figure 3.2 for an illustration.

Proposition 1 Let the two-step momentum algorithm in (3.2) with constant parameters (α, β, γ) achieve the linear convergence rate ρ < 1 for all functions f ∈ Q_m^L. Then, lower bound (3.7) on the settling time holds and it is achieved by the heavy-ball method with the parameters provided in Table 3.1.
Proof: Let d(m) = αm and d(L) = αL denote the values of the function d(λ) associated with the points (b(m), a(m)) and (b(L), a(L)), where (b, a) and d are given by (3.18c) and (3.24), respectively. Lemma 3 implies that (b(L), a(L)) and (b(m), a(m)) lie in the ρ-linear convergence triangle ∆_ρ. Thus,
d_max/d_min ≥ d(L)/d(m) = κ   (3.25)
where d_max and d_min are the largest and smallest values of d among all points (b, a) ∈ ∆_ρ. From the shape of ∆_ρ, we conclude that d_max and d_min correspond to the vertices Y_ρ and X_ρ of ∆_ρ given by (3.22c); see Figure 3.2. Thus,
d_max = d_{Y_ρ} = 1 + ρ^2 + 2ρ = (1 + ρ)^2   (3.26a)
d_min = d_{X_ρ} = 1 + ρ^2 − 2ρ = (1 − ρ)^2.   (3.26b)
Combining (3.25) with (3.26) yields
κ = d(L)/d(m) ≤ d_max/d_min = (1 + ρ)^2/(1 − ρ)^2.   (3.27)
Rearranging terms in (3.27) gives lower bound (3.7). □

To provide additional insight, we next examine the implications of Lemma 3 for gradient descent, Polyak's heavy-ball, and Nesterov's accelerated algorithms. In all three cases, our dual approach recovers the optimal convergence rates provided in Table 3.1. From the affine relation (3.23), it follows that the segment (b(λ), a(λ)) with λ ∈ [m, L] is, for gradient descent (β = γ = 0), a horizontal line segment with a(λ) = 0; for the heavy-ball method (γ = 0), a horizontal line segment with a(λ) = β; and for Nesterov's accelerated method (β = γ), a line segment with a(λ) = −β b(λ)/(1 + β). These observations are illustrated in Figure 3.3 and, as we show in the proof of Lemma 3, to obtain the largest possible condition number for which the convergence rate ρ is feasible for each algorithm, one needs to find the largest ratio d(L)/d(m) = κ among all possible orientations of the line segment (b(λ), a(λ)) with λ ∈ [m, L] that lie within ∆_ρ. This leads to the following conditions:

For gradient descent, the largest ratio d(L)/d(m) corresponds to the intersections of the horizontal axis with the edges Y_ρZ_ρ and X_ρZ_ρ of the triangle ∆_ρ, which are given by (ρ, 0) and (−ρ, 0), respectively. Thus, we have
κ = d(L)/d(m) ≤ (1 + ρ)/(1 − ρ).   (3.28a)
Rearranging terms in (3.28a) yields a lower bound on the settling time for gradient descent, 1/(1 − ρ) ≥ (κ + 1)/2. This lower bound is tight as it can be achieved by choosing the parameters in Table 3.1, which place (b(λ), a(λ)) at (ρ, 0) and (−ρ, 0) for λ = L and λ = m, respectively.

For the heavy-ball method, the optimal rate is recovered by designing the parameters (α, β) such that the vertices X_ρ and Y_ρ belong to the line segment (b(λ), a(λ)),
κ = d(L)/d(m) ≤ (1 + ρ)^2/(1 − ρ)^2.   (3.28b)
By choosing d(L) = d_{Y_ρ} and d(m) = d_{X_ρ}, we recover the optimal parameters provided in Table 3.1 and achieve the fundamental lower bound (3.7) on the convergence rate.

For Nesterov's accelerated method, the largest ratio d(L)/d(m) corresponds to the line segment X_ρX'_ρ that passes through the origin, where X'_ρ = (2ρ/3, −ρ^2/3) lies on the edge Y_ρZ_ρ; see Appendix B.3. This yields
κ = d(L)/d(m) ≤ (1 + ρ)(3 − ρ)/(3(1 − ρ)^2).   (3.28c)
Rearranging terms in this inequality provides a lower bound on the settling time, 1/(1 − ρ) ≥ √(3κ + 1)/2. This lower bound is tight and it can be achieved with the parameters provided in Table 3.1, which place (b(L), a(L)) at X'_ρ and (b(m), a(m)) at X_ρ.

Figure 3.3 illustrates the optimal orientations discussed above.
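The membership test behind Lemma 3 is straightforward to check numerically. The sketch below evaluates (b(λ), a(λ)) at the endpoints λ ∈ {m, L} for the three tunings of Table 3.1 and verifies that both points lie (up to numerical tolerance) in the triangle ∆_ρ of (3.22b); the condition number is an illustrative choice.

```python
import numpy as np

def point(lam, alpha, beta, gamma):
    a = beta - gamma * alpha * lam
    b = (1 + gamma) * alpha * lam - (1 + beta)
    return b, a

def in_Delta_rho(b, a, rho, tol=1e-9):
    return rho * (abs(b) - rho) <= a + tol and a <= rho**2 + tol

m, L = 1.0, 100.0
kappa = L / m
tunings = {
    "gradient descent": (2 / (L + m), 0.0, 0.0, 1 - 2 / (kappa + 1)),
    "heavy-ball": (4 / (np.sqrt(L) + np.sqrt(m)) ** 2,
                   ((np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)) ** 2, 0.0,
                   1 - 2 / (np.sqrt(kappa) + 1)),
    "Nesterov": (4 / (3 * L + m), 1 - 4 / (np.sqrt(3 * kappa + 1) + 2), None,
                 1 - 2 / np.sqrt(3 * kappa + 1)),
}
for name, (alpha, beta, gamma, rho) in tunings.items():
    gamma = beta if gamma is None else gamma        # Nesterov uses gamma = beta
    ok = all(in_Delta_rho(*point(lam, alpha, beta, gamma), rho) for lam in (m, L))
    print(f"{name}: endpoints of the segment lie in Delta_rho for rho = {rho:.4f}? {ok}")
```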
3.4.3 Noise amplification

To quantify the noise amplification of the two-step momentum algorithm (3.2), we utilize an alternative characterization of the stability and ρ-linear convergence triangles ∆ and ∆_ρ. As illustrated in Figure 3.2, let d and l denote the horizontal signed distances of the point (b, a) to the edges XZ and YZ of the stability triangle ∆,
d(λ) := a(λ) + b(λ) + 1,   l(λ) := a(λ) − b(λ) + 1   (3.29a)
and let h denote its vertical signed distance to the edge XY,
h(λ) := 1 − a(λ).   (3.29b)
Then, the equivalences
(b, a) ∈ ∆ ⟺ h, d, l > 0   (3.30a)
(b, a) ∈ ∆_ρ ⟺ h ≥ (1 − ρ)(1 + ρ),  d ≥ (1 − ρ)(1 + ρ + b),  l ≥ (1 − ρ)(1 + ρ − b)   (3.30b)
follow from the definitions of the sets ∆ in (3.21b), ∆_ρ in (3.22b), and (h, d, l) in (3.29).

In Theorem 5, we quantify the steady-state variance of the error in the optimization variable in terms of the spectrum of the Hessian matrix and the algorithmic parameters for the noisy two-step momentum algorithm (3.2). Special cases of this result for gradient descent, heavy-ball, and Nesterov's accelerated algorithms were established in [93]. The proof of Theorem 5 follows from similar arguments and we omit it for brevity.

Theorem 5 For a strongly convex quadratic objective function f ∈ Q_m^L with the Hessian matrix Q, the steady-state variance of x_t − x* for the two-step momentum algorithm (3.2) with any stabilizing parameters (α, β, γ) is determined by
J = Σ_{i=1}^{n} σ_w^2 (d(λ_i) + l(λ_i))/(2 d(λ_i) h(λ_i) l(λ_i)) =: Σ_{i=1}^{n} Ĵ(λ_i).
Here, Ĵ(λ_i) denotes the modal contribution of the ith eigenvalue λ_i of Q to the steady-state variance, (d, h, l) are defined in (3.29), and (a, b) are given by (3.18c).

In Appendix B.6, we describe how the algebraic Lyapunov equation for the steady-state covariance matrix of the error in the optimization variable can be used to compute the noise amplification J. Theorem 5 demonstrates that J depends on the entire spectrum of the Hessian matrix Q and not only on its extreme eigenvalues m and L, which determine the convergence rate. Since for any f ∈ Q_m^L the extreme eigenvalues of Q are fixed at m and L, we have
J_max := max_{f ∈ Q_m^L} J = Ĵ(m) + Ĵ(L) + (n − 2) Ĵ_max
J_min := min_{f ∈ Q_m^L} J = Ĵ(m) + Ĵ(L) + (n − 2) Ĵ_min   (3.31a)
where
Ĵ_max := max_{λ ∈ [m,L]} Ĵ(λ),   Ĵ_min := min_{λ ∈ [m,L]} Ĵ(λ).   (3.31b)
We use these expressions to determine explicit upper and lower bounds on J_max and J_min in terms of the condition number and the settling time.

3.5 Designing order-wise Pareto-optimal algorithms with adjustable parameters

We now utilize the geometric insight developed in Section 3.4 to design algorithm parameters that trade off settling time and noise amplification. In particular, we introduce two instances of parameterized families of heavy-ball-like (γ = 0) and Nesterov-like (γ = β) algorithms that provide continuous transformations from gradient descent to the corresponding accelerated algorithm (with the optimal convergence rate) via a homotopy path parameterized by the settling time T_s. For both the iterate and gradient noise models, we establish an order-wise tight scaling Θ(κ^2) for J_max × T_s and J_min × T_s in the accelerated regime (i.e., when T_s is smaller than the settling time of gradient descent with the optimal stepsize, (κ + 1)/2).
This is a direct extension of [93, Theorem 4], which studied gradient descent and its accelerated variants for the parameters that optimize the corresponding convergence rates. We also examine performance tradeoffs for the parameterized family of heavy-ball-like algorithms with negative momentum parameter β < 0. This decelerated regime corresponds to settling times larger than (κ + 1)/2 and it captures a key difference between the two noise models: for T_s ≥ (κ + 1)/2, J_max and J_min grow linearly with the settling time T_s for the iterate noise model and they remain inversely proportional to T_s for the gradient noise model. Comparison with the lower bounds in Theorems 2 and 4 shows that the parameterized family of heavy-ball-like methods yields order-wise optimal (in κ and T_s) J_max and J_min for both noise models. The results presented here prove all upper bounds in Theorems 3 and 4.

3.5.1 Parameterized family of heavy-ball-like methods

For the two-step momentum algorithm (3.2) with γ = 0, the line segment (b(λ), a(λ)) parameterized by λ ∈ [m, L] is parallel to the b-axis in the (b, a)-plane and it satisfies a(λ) = β. As described in Section 3.4, gradient descent and heavy-ball methods with the optimal parameters provided in Table 3.1 are obtained for β = 0 and β = ρ², respectively, and the corresponding end points (b(m), a(m)) and (b(L), a(L)) lie on the edges X_ρZ_ρ and Y_ρZ_ρ of the ρ-linear convergence triangle Δ_ρ. Inspired by this observation, we propose a family of parameters for which β = cρ², for some scalar c ∈ [−1, 1], and determine the stepsize α such that the above end points lie on X_ρZ_ρ and Y_ρZ_ρ,

α = (1 + ρ)(1 + cρ)/L,  β = cρ²,  γ = 0.    (3.32)

This yields a continuous transformation between the standard heavy-ball method (c = 1) and gradient descent (c = 0) for a fixed condition number κ. In addition, the momentum parameter β in (3.32) becomes negative for c < 0; see Figure 3.3 for an illustration. In Lemma 4, we provide expressions for the scalar c as well as for Ĵ_max and Ĵ_min defined in (3.31b) in terms of the condition number κ and the convergence rate ρ. A numerical sketch of this family appears after Corollary 2 below.

Lemma 4 For the class of functions Q_m^L with the condition number κ = L/m, let the scalar ρ be such that T_s = 1/(1 − ρ) ≥ (√κ + 1)/2. Then, the two-step momentum algorithm in (3.2) with parameters in (3.32) achieves the convergence rate ρ, and the largest and smallest values Ĵ_max and Ĵ_min of Ĵ(λ) for λ ∈ [m, L] satisfy

Ĵ_max = Ĵ(m) = Ĵ(L) = σ_w² (κ + 1) / (2(1 − cρ²)(1 + ρ)(1 + cρ))
Ĵ_min = Ĵ(λ̂) = σ_w² / ((1 + cρ²)(1 − cρ²))

where λ̂ := (m + L)/2 and the scalar c is given by

c := (κ − (1 + ρ)/(1 − ρ)) / (ρ(κ + (1 + ρ)/(1 − ρ))) ∈ [−1, 1].    (3.33)

Proof: See Appendix B.4. □

The parameters in (3.32) with c given by (3.33) are equivalent to the parameters presented in Theorem 3. Lemma 4 in conjunction with (3.31) allows us to derive analytical expressions for J_max and J_min.

Corollary 2 The parameterized family of heavy-ball-like methods (3.32) satisfies

J_max = n Ĵ(m) = n Ĵ(L)
J_min = 2 Ĵ(m) + (n − 2) Ĵ(λ̂)

where Ĵ(m) and Ĵ(λ̂) are given in Lemma 4, and J_max and J_min defined in (3.10) are the largest and smallest values of J when the algorithm is applied to f ∈ Q_m^L with the condition number κ = L/m.
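To make the expressions in Lemma 4 and Corollary 2 concrete, the minimal sketch below picks a desired settling time, computes c from (3.33) and (α, β) from (3.32), and evaluates the modal contribution Ĵ(λ) using the formula in Theorem 5 with (d, h, l) from (3.29). The explicit forms a(λ) = β − γαλ and b(λ) = αλ(1 + γ) − (1 + β) used here are reconstructed from the surrounding discussion (they yield d(λ) = αλ), so treat them as an assumption.

```python
import numpy as np

m, L = 1.0, 100.0
kappa, sigma_w = L / m, 1.0

Ts = 12.5                              # between (sqrt(kappa)+1)/2 and (kappa+1)/2 (accelerated regime)
rho = 1.0 - 1.0 / Ts
K = (1.0 + rho) / (1.0 - rho)
c = (kappa - K) / (rho * (kappa + K))  # scalar c from (3.33)
alpha, beta, gamma = (1 + rho) * (1 + c * rho) / L, c * rho ** 2, 0.0  # parameters (3.32)

def Jhat(lam):
    # Modal contribution from Theorem 5, with (d, h, l) defined in (3.29)
    a = beta - gamma * alpha * lam
    b = alpha * lam * (1 + gamma) - (1 + beta)
    d, l, h = a + b + 1, a - b + 1, 1 - a
    return sigma_w ** 2 * (d + l) / (2 * d * h * l)

lams = np.linspace(m, L, 2000)
vals = np.array([Jhat(lam) for lam in lams])
# Lemma 4: the maximum is attained at both extreme eigenvalues, the minimum at (m + L)/2
print(Jhat(m), Jhat(L),
      sigma_w ** 2 * (kappa + 1) / (2 * (1 - c * rho ** 2) * (1 + rho) * (1 + c * rho)))
print(vals.min(), Jhat((m + L) / 2),
      sigma_w ** 2 / ((1 + c * rho ** 2) * (1 - c * rho ** 2)))
```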
The next proposition uses the analytical expressions in Corollary 2 to establish order-wise tight upper and lower bounds on J_max and J_min for the parameterized family of heavy-ball-like algorithms (3.32). Our upper and lower bounds are within constant factors of each other and they are expressed in terms of the problem size n, condition number κ, and settling time T_s.

Proposition 2 For the parameterized family of heavy-ball-like algorithms in (3.32), J_max and J_min in (3.10) satisfy

J_max × T_s = σ_w² p_{1c}(ρ) n κ (κ + 1)    (3.34a)
J_min × T_s = σ_w² κ (2 p_{1c}(ρ)(κ + 1) + (n − 2) p_{2c}(ρ)).    (3.34b)

Furthermore, for the gradient noise model (σ_w = ασ),

J_max × T_s = σ² p_{3c}(ρ) n κ (κ + 1)    (3.35a)
J_min × T_s = σ² κ (2 p_{3c}(ρ)(κ + 1) + (n − 2) p_{4c}(ρ))    (3.35b)

where

p_{1c}(ρ) := q_c(ρ)/(2(1 + ρ)²(1 + cρ)²),  p_{2c}(ρ) := q_c(ρ)/((1 + ρ)(1 + cρ²)(1 + cρ))
p_{3c}(ρ) := q_c(ρ)/(2L²),  p_{4c}(ρ) := q_c(ρ) q_{−c}(ρ)(1 + ρ)/L²    (3.36)

and q_c(ρ) := (1 − cρ)/(1 − cρ²). In addition, for c ∈ [0, 1], p_{1c}(ρ) ∈ [1/64, 1/2] and p_{2c}(ρ) ∈ [1/16, 1]; and, for c ∈ [−1, 1], p_{3c}(ρ) ∈ [1/(4L²), 1/L²] and p_{4c}(ρ) ∈ [1/(4L²), 4/L²].

Proof: See Appendix B.4. □

Proposition 3 For the parameterized family of heavy-ball-like methods (3.32) with c ∈ [−1, 0], J_max and J_min in (3.10) satisfy

J_max = σ_w² p_{5c}(ρ) n (1 + 1/κ) T_s    (3.37a)
J_min = σ_w² (2 p_{5c}(ρ)(1 + 1/κ) + p_{6c}(ρ)(n − 2)/κ) T_s    (3.37b)

where p_{5c}(ρ) := 1/(2(1 + |c|ρ)(1 + |c|ρ²)) ∈ [1/8, 1/2] and p_{6c}(ρ) := 2(1 + ρ) p_{5c}(ρ) q_{−c}(ρ) ∈ [1/8, 2].

Proof: See Appendix B.4. □

The upper bounds in Theorems 3 and 4 follow from Propositions 2 and 3, respectively. Since these upper bounds have the same scaling as the corresponding lower bounds in Theorems 2 and 4, which hold for all stabilizing parameters (α, β, γ), this demonstrates the tightness of the lower bounds for all settling times and for both noise models.

3.5.2 Parameterized family of Nesterov-like methods

For the two-step momentum algorithm (3.2) with γ = β, the line segment (b(λ), a(λ)) parameterized by λ ∈ [m, L] passes through the origin. As described in Section 3.4, gradient descent and Nesterov's method with the optimal parameters provided in Table 3.1 are obtained for a = 0 and a = −(ρ/2)b, respectively, and the corresponding end points (b(m), a(m)) and (b(L), a(L)) lie on the edges X_ρZ_ρ and Y_ρZ_ρ of the ρ-linear convergence triangle Δ_ρ. To provide a continuous transformation between these two standard algorithms, we introduce a parameter c ∈ [0, 1/2], let the line segment satisfy a(λ) = −cρ b(λ), and take its end points on the edges X_ρZ_ρ and Y_ρZ_ρ; see Figure 3.3 for an illustration. This can be accomplished with the following choice of parameters,

α = (1 + ρ)(1 + c − cρ)/(L(1 + c)),  γ = β = cρ²/((αL − 1)(1 + c)).    (3.38)

For the parameterized family of Nesterov-like algorithms (3.38), Proposition 4 establishes the settling time and characterizes the dependence of J_min × T_s and J_max × T_s on the condition number κ and the problem size n.

Proposition 4 For the class Q_m^L with the condition number κ = L/m, let the scalar ρ be such that T_s = 1/(1 − ρ) ∈ [√(3κ + 1)/2, (κ + 1)/2].
The two-step momentum algorithm (3.2) with parameters (3.38) achieves the convergence rate ρ and satisfies

σ_w² ((n − 1)κ(κ + 1)/32 + √(3κ + 1)/2) ≤ J_max × T_s ≤ 6 σ_w² n κ (3κ + 1)
σ_w² (κ(κ + 1)/32 + (n − 1)√(3κ + 1)/2) ≤ J_min × T_s ≤ σ_w² (6κ(3κ + 1) + (n − 1)(κ + 1)/2)

where J_max and J_min are the largest and smallest values that J can take when the algorithm is applied to f ∈ Q_m^L with the condition number κ = L/m, and the scalar c ∈ [0, 1/2] is the solution to the quadratic equation κ(1 − ρ)(1 − cρ − c²(1 + ρ)) = (1 + ρ)(1 − cρ − c²(1 − ρ)).

Proof: See Appendix B.4. □

Figure 3.4: The triangle Δ_ρ (yellow) and the line segments (b(λ), a(λ)) with λ ∈ [m, L] (blue) for gradient descent with reduced stepsize (3.39) and the heavy-ball-like method (3.40), which place the end point (b(m), a(m)) at X_ρ and the end point (b(L), a(L)) at (2c′ρ, ρ²) on the edge X_ρY_ρ, where c′ := κ(1 − ρ)²/ρ − (1 + ρ²)/ρ ranges over the interval [−1, 1].

Since the stepsize in (3.38) satisfies α ∈ [1/L, 3/L], comparing the upper bounds in Proposition 4 with the lower bounds in Theorem 2 shows that, for settling times T_s ≤ (κ + 1)/2, the parameters in (3.38) achieve order-wise optimal J_max and J_min for both the iterate (σ_w = σ) and gradient (σ_w = ασ) noise models.

3.5.3 Impact of reducing the stepsize

When the only source of uncertainty is a noisy gradient, i.e., σ_w = ασ, one can attempt to reduce the noise amplification J by decreasing the stepsize α at the expense of increasing the settling time T_s = 1/(1 − ρ) [56], [58], [103], [109]. In particular, for gradient descent, α can be reduced from its optimal value 2/(L + m) by keeping (b(m), a(m)) at (−ρ, 0) and moving the point (b(L), a(L)) from (ρ, 0) towards (−ρ, 0) along the horizontal axis; see Figure 3.4. This can be accomplished with

α = (1 + cρ)/L,  γ = β = 0    (3.39)

for some c ∈ [−1, 1] parameterizing (b(L), a(L)) = (cρ, 0). In this case, the settling time satisfies T_s = (κ + c)/(c + 1) ∈ [(κ + 1)/2, ∞) and arguments similar to those presented in the proof of Lemma 4 can be used to obtain

Ĵ_max = Ĵ(m) = σ² κ² (1 − ρ)/L²
Ĵ_min = Ĵ(L) = σ² α²/(1 − c²ρ²) for c ≤ 0,  Ĵ_min = Ĵ(1/α) = σ² α² for c ≥ 0.

For a fixed n, the stepsize in (3.39) yields a Θ(κ²) scaling for both J_max × T_s and J_min × T_s for all c ∈ [−1, 1]. Thus, gradient descent with reduced stepsize order-wise matches the lower bounds in Theorem 2. An IQC-based approach [93, Lemma 1] was utilized in [109, Theorem 13] to show that stepsize (3.39) also yields the above discussed convergence rate and worst-case noise amplification for one-point m-strongly convex L-smooth functions.

Remark 5 Any desired settling time T_s = 1/(1 − ρ) ∈ [(√κ + 1)/2, ∞) can be achieved by the heavy-ball-like method with reduced stepsize,

α = (1 − ρ)²/m,  β = ρ²,  γ = 0.    (3.40)

This choice yields J_max = σ² n κ² (1 − ρ⁴)/(L²(1 + ρ)⁴) [109, Theorem 9]; see Figure 3.4. In addition, by considering the error in y^t = x^t + γ(x^t − x^{t−1}) as the performance metric, it was stated and numerically verified in [109] that the choice of parameters (3.40) yields Pareto-optimal algorithms for simultaneously optimizing J_max and ρ. We note that the settling time T_s = Θ(κ) of gradient descent with standard stepsizes (α = 1/L or 2/(m + L)) can be achieved
In c ontr ast, the p ar ameterize d family of he avy- b al l-like metho ds (3.32) is or der-wise Par eto-optimal (cf. The or ems 2-4) while maintaining α ∈[1/L,4/L]. 3.6 Continuous-time gradient ow dynamics Noisy gradien t descen t can b e view ed as the forw ard Euler discretization of gradien t o w dynamics (gfd), ˙ x + α ∇f(x) = σw (3.41a) where ˙ x denotes the deriv ativ e of x with resp ect to time τ and w is a white noise with zero mean and iden tit y co v ariance matrix, E[w(τ )] = 0,E[w(τ 1 )w T (τ 2 )] = Iδ (τ 1 − τ 2 ). Similarly , noisy t w o-step momen tum algorithm (3.2) can b e obtained b y discretizing the accelerated gradien t o w dynamics ( agd), ¨x + θ ˙ x + α ∇f(x+γ ˙ x) = σw (3.41b) with θ :=1− β b y appro ximating x, ˙ x, and ¨x using x = x t+1 , ˙ x ≈ x t+1 − x t , ¨x ≈ x t+2 − 2x t+1 +x t . System (3.41b) with β = γ w as in tro duced in [110] as a con tin uous-time analogue of Nestero v’s accelerated algorithm and a Ly apuno v-based metho d w as emplo y ed to c haracterize its stabilit y prop erties for smo oth strongly con v ex problems. F or a time dilation s=cτ , the solution to (3.41b) satises x ′′ + ¯θx ′ + ¯α ∇f(x+¯γx ′ ) = ¯σw 83 where ˙ x=dx/dτ , x ′ =dx/ds, and ¯θ = θ/c, ¯γ = cγ, ¯α = α/c 2 , ¯σ = σ/ (c √ c). This follo ws b y com bining ˙ x=cx ′ and ¨x=c 2 x ′′ with the fact that the time dilation yields a √ c increase in the noise magnitude σ . Similar c hange of v ariables can b e applied to gradien t o w dynamics (3.41a) and to study stabilit y and noise amplication of (3.41) w e set α =1/L and σ =1 without loss of generalit y . 3.6.1 Modal-decomposition F or the quadratic problem (3.3) with Q = Q T ≻ 0, w e follo w the approac h of Section 3.4.1 and utilize the eigen v alue decomp osition of Q = VΛ V T and the c hange of v ariables, ˆ x := V T (x− x ⋆ ), ˆ w :=V T w , to bring (3.41) to, ˙ ˆ ψ i = ˆ A i ˆ ψ i + ˆ B i ˆ w i , ˆ z i = ˆ C i ˆ ψ i (3.42a) where ˆ w i is the ith comp onen t of the v ector ˆ w . F or gradien t o w dynamics (3.41a), w e let ˆ ψ i := ˆ x i , whic h leads to ˆ A i = − αλ i =: − a(λ i ), ˆ B i = 1, ˆ C i = 1. (3.42b) On the other hand, for accelerated gradien t o w dynamics (3.41b), ˆ ψ i := h ˆ x i ˙ ˆ x i i T , and ˆ A i = ˆ A(λ i ) := 0 1 − a(λ i ) − b(λ i ) ˆ B i = h 0 1 i T , ˆ C i = h 1 0 i a(λ ) := αλ, b (λ ) := θ + γαλ. (3.42c) Ev en though functions a(λ ) and b(λ ) tak e dieren t forms in con tin uous time, matrices ˆ A i , ˆ B i , and ˆ C i in (3.42c) ha v e the same structure as their discrete-time coun terparts in (3.18). 84 3.6.2 Optimal convergence rate System (3.42) is stable if and only if the matrix ˆ A i is Hurwitz (i.e., if all of its eigen v alues ha v e negativ e real parts). Moreo v er, the system is exp onen tially stable with the rate ρ , ∥ ˆ ψ i (τ )∥ 2 ≤ ce − ρτ ∥ ˆ ψ i (0)∥ 2 if and only if the real parts of all eigen v alues of ˆ A i are less than or equal to − ρ . F or gradien t o w dynamics (3.41a) with α =1/L, ˆ A i ’s are real scalars and ρ is determined b y ρ gfd := min i |αλ i | = m/L = 1/κ. (3.43) Note that ˆ A i in (3.42c) has the same structure as the matrix M in (3.20a). Lemma 5 is a con tin uous-time coun terpart for Lemmas 1 and 2 and it pro vides conditions for (exp onen tial) stabilit y of matrices ˆ A i for accelerated gradien t o w dynamics (3.41b). Lemma 5 The r e al matrix M in (3.20a) satises M is Hurwitz ⇐⇒ a, b > 0. In addition, for any ρ> 0, we have max{ℜ(eig(M))} ≤ − ρ ⇐⇒ a≥ ρ (b − ρ ) b≥ 2ρ. Pr o of: See App endix B.5. 
Conditions for stability and ρ-exponential stability in Lemma 5 respectively require the point (b, a) to belong to the open positive orthant and to the ρ-parameterized cone shown in Figure 3.5. Furthermore, the normalization of the parameter α to α = 1/L yields the extra condition a ≤ 1. For ρ < 1, combining this inequality with the exponential stability conditions in Lemma 5 further restricts the ρ-exponential stability cone to the triangle in the (b, a)-plane,

Δ_ρ := {(b, a) | b ≥ 2ρ, ρ(b − ρ) ≤ a ≤ 1}    (3.44a)

whose vertices are given by

X_ρ = (2ρ, ρ²),  Y_ρ = (2ρ, 1),  Z_ρ = (ρ + 1/ρ, 1).    (3.44b)

For ρ = 1, the triangle Δ_ρ is a single point and, for ρ > 1, adding the normalization condition a ≤ 1 makes the ρ-exponential stability conditions in Lemma 5 infeasible. Thus, in what follows, we confine our attention to ρ < 1.

Figure 3.5: The open positive orthant (cyan) in the (b, a)-plane is the stability region for the matrix M in (3.20a). The intersections Y_ρ and Z_ρ of the stepsize normalization line a = 1 (black) and the boundary of the ρ-exponential stability cone (yellow) established in Lemma 5, along with the cone apex X_ρ, determine the vertices of the ρ-exponential stability triangle Δ_ρ given by (3.44).

Figure 3.5 illustrates the stability and ρ-exponential stability cones as well as the ρ-exponential stability triangle Δ_ρ. The geometry of Δ_ρ allows us to determine the largest condition number for which (3.41b) is ρ-exponentially stable.

Proposition 5 For a strongly convex quadratic objective function f ∈ Q_m^L with the condition number κ = L/m, the optimal convergence rate and the corresponding parameters (β, γ) of accelerated gradient flow dynamics (3.41b) with α = 1/L are

ρ = 1/√κ,  β = 1 + (v − 2)/√κ,  γ = v√κ    (3.45)

where v ∈ [0, 1]. This rate is achieved by the heavy-ball method (γ = 0) with v = 0 and, for κ ≥ 4, by Nesterov's accelerated method (γ = β) with v = (√κ − 2)/(κ − 1).

Proof: See Appendix B.5. □

Figure 3.6: For a fixed ρ-exponential stability triangle Δ_ρ (yellow) in (3.44), the line segments (b(λ), a(λ)), λ ∈ [m, L], for Nesterov's accelerated (γ = β) and the heavy-ball (γ = 0) dynamics, as special examples of accelerated dynamics (3.41b) with constant parameters γ, β, and α = 1/L, are marked by dashed blue lines. The blue bullets correspond to the locus of the end point (b(L), a(L)), and the solid blue line segments correspond to the parameters for which the rate ρ is achieved for the largest possible condition number (3.45).

Proposition 5 uses the necessary and sufficient condition for ρ-exponential stability: (b(λ), a(λ)) ∈ Δ_ρ for all λ ∈ [m, L]. Figure 3.6 illustrates the orientation of this line segment in Δ_ρ for the heavy-ball and Nesterov's algorithms. For the optimal values of parameters, Proposition 5 implies that accelerated gradient flow dynamics (3.41b) reduce the settling time 1/ρ relative to gradient flow dynamics (3.41a) by a factor of √κ, i.e., ρ_agd/ρ_gfd = √κ.

3.6.3 Noise amplification

Similar to the discrete-time setting, exponentially stable LTI systems in (3.42) driven by white noise reach a statistical steady-state with lim_{t→∞} E(ψ̂_i(t)) = 0.
Furthermore, the variance

J := lim_{t→∞} (1/t) ∫₀ᵗ E(‖x(τ) − x⋆‖₂²) dτ    (3.46)

can be computed from the solution of the continuous-time algebraic Lyapunov equation [41]. The following theorem provides analytical expressions for the steady-state variance J.

Theorem 6 For a strongly convex quadratic objective function f ∈ Q_m^L with Hessian Q, the noise amplification J of (3.41) with any constant stabilizing parameters (α, β, γ) is determined by J = Σ_{i=1}^n Ĵ(λ_i). Here, Ĵ(λ_i) is the modal contribution of the ith eigenvalue λ_i of Q to the noise amplification,

Ĵ_gfd(λ) = 1/(2a(λ)),  Ĵ_agd(λ) = 1/(2a(λ)b(λ))

where the functions a and b are given by (3.42c).

We omit the proof of Theorem 6 as it uses arguments similar to those used in the proof of [93, Theorem 1]. For α = 1/L and the parameters that optimize the convergence rate in Proposition 5, we can use the explicit forms of Ĵ(λ) established in Theorem 6 to obtain

J_max = ((n − 1)κ + 1)/2 for gfd;  ((n − 1)κ√κ + √κ)/4 for agd (hb);  ((n − 1)κ√κ + 2)/4 for agd (na)
J_min = (κ + (n − 1))/2 for gfd;  (κ√κ + (n − 1)√κ)/4 for agd (hb);  (κ√κ + 2(n − 1))/4 for agd (na).    (3.47)

For all three cases, the largest noise amplification J_max occurs when the Hessian matrix Q has n − 1 eigenvalues at λ = L and one at λ = m, and the smallest noise amplification J_min occurs when Q has n − 1 eigenvalues at λ = m and one at λ = L. Despite the √κ improvement in the convergence rate achieved by the accelerated gradient flow dynamics, the corresponding J_max and J_min are larger than those of gradient flow dynamics by a factor of √κ. We next generalize this result to any stabilizing (β, γ) and establish similar trends for all f ∈ Q_m^L.

3.6.4 Convergence and noise amplification tradeoffs

The next result is the continuous-time analogue of Theorem 2 and it establishes a lower bound on the product of the noise amplification and the settling time T_s = 1/ρ of the accelerated gradient flow dynamics for any (β, γ).

Theorem 7 Let the parameters (β, γ) be such that the accelerated gradient flow dynamics in (3.41b) with α = 1/L are exponentially stable with rate ρ = 1/T_s for all f ∈ Q_m^L. Then, J_max and J_min in (3.10) satisfy

J_max × T_s ≥ (n − 1)κ²/4 + 1/(2(1 + ρ²))    (3.48a)
J_min × T_s ≥ κ²/4 + (n − 1)/(2(1 + ρ²)).    (3.48b)

Proof: See Appendix B.5. □

Theorem 7 demonstrates that the tradeoff between J_max and J_min and the settling time established in Theorem 2 for the two-step momentum algorithm extends to the continuous-time dynamics. For a fixed problem size n and the parameters that optimize the convergence rate provided in Lemma 5, we can use (3.47) to conclude that the bounds in Theorem 7 are order-wise tight for the parameters that achieve the optimal convergence rate.

3.7 Proofs of Theorems 1-4

3.7.1 Proof of Theorem 1

From Theorem 5 it follows that we can use upper bounds on Ĵ(λ) over λ ∈ [m, L] to establish an upper bound on J. Since the algorithm achieves the convergence rate ρ, combining equation (3.19) and Lemma 2 yields (b(λ), a(λ)) ∈ Δ_ρ for all λ ∈ [m, L]. As we demonstrate in Appendix B.2, the function Ĵ is convex in (b, a) over the stability triangle Δ. In addition, Δ_ρ ⊂ Δ is the convex hull of the points X_ρ, Y_ρ, Z_ρ in the (b, a)-plane. Since the maximum of a convex function over the convex hull of a finite set of points is attained at one of these points, Ĵ attains its maximum over Δ_ρ at X_ρ, Y_ρ, or Z_ρ.
Using the definitions of X_ρ, Y_ρ, and Z_ρ in (3.22c), the affine relations (3.29), and the analytical expression for Ĵ in Theorem 5, it follows that the maximum occurs at the vertices X_ρ and Y_ρ,

Ĵ_max := max_{λ ∈ [m,L]} Ĵ(λ) = σ_w² (1 + ρ²)/((1 − ρ)³(1 + ρ)³)

where we use d_{Xρ} = l_{Yρ} = (1 − ρ)², l_{Xρ} = d_{Yρ} = (1 + ρ)², and h_{Xρ} = h_{Yρ} = 1 − ρ². Combining the above identity with Theorem 5 completes the proof of (3.12a).

We use an argument similar to the proof of Proposition 1 to prove (3.12b). In particular, since (b(L), a(L)) ∈ Δ_ρ, we have

αL = d(L) ≤ d_max = (1 + ρ)²

where d given by (3.24) is the horizontal signed distance to the edge XZ of the stability triangle Δ. On the other hand, d_max is the largest value that d can take among all points (b, a) ∈ Δ_ρ and it corresponds to the vertex Y_ρ; see equation (3.26a). Combining this inequality with σ_w = ασ and (3.12a) completes the proof of Theorem 1.

3.7.2 Proof of Theorem 2

Using the expression J = Σ_i Ĵ(λ_i) established in Theorem 5, we have the decomposition

J = Ĵ(m) + Σ_{i=1}^{n−1} Ĵ(λ_i).    (3.49)

To prove the lower bounds (3.13b) and (3.13d) on J_min × T_s, we establish a lower bound on Ĵ(m) × T_s that scales quadratically with κ, and a general lower bound on Ĵ(λ) × T_s.

Case σ_w = σ: The proof of (3.13b) utilizes the following inequalities

Ĵ(m)/(1 − ρ) ≥ σ_w² κ²/(2(1 + ρ)⁵)    (3.50a)
Ĵ(λ)/(1 − ρ) ≥ σ_w² (√κ + 1)/2.    (3.50b)

We first prove (3.50a). Our approach builds on the proof of Proposition 1. In particular, d(λ) = αλ for the point (b(λ), a(λ)), where d and (b, a) are defined in (3.29) and (3.18c), respectively. Thus, d(m) = d(L)/κ. Furthermore, Lemma 3 implies (b(λ), a(λ)) ∈ Δ_ρ for λ ∈ [m, L]. Thus, the trivial inequality d(L) ≤ d_max leads to

d(m) ≤ d_max/κ = (1 + ρ)²/κ    (3.51)

where d_max = (1 + ρ)² is the largest value that d can take among all points (b, a) ∈ Δ_ρ; see equation (3.26a). We now use Theorem 5 to write

Ĵ(λ)/(1 − ρ) = σ_w² (d(λ) + l(λ))/(2 d(λ) h(λ) l(λ)(1 − ρ)) ≥ σ_w²/(2 d(λ) h(λ)(1 − ρ)).    (3.52)

Next, we lower bound the right-hand side of (3.52). Let L be the line that passes through (b(λ), a(λ)) and is parallel to the edge XZ of the stability triangle Δ, and let G be the intersection of L and the edge X_ρZ_ρ of the ρ-stability triangle Δ_ρ; see Figure 3.7 for an illustration. It is easy to verify that

h_G ≥ h(λ),  d_G = d(λ)    (3.53a)

where h_G and d_G correspond to the values of h and d associated with the point G. In addition, since G lies on the edge X_ρZ_ρ, h_G and d_G satisfy the affine relation

h_G = 1 − ρ + d_G ρ/(1 − ρ).    (3.53b)

This follows from the equation of the line X_ρZ_ρ in the (b, a)-plane and from the definitions of d and h in (3.29). Furthermore, combining (3.53a) and (3.53b) implies

σ_w²/(2 d(λ) h(λ)(1 − ρ))  ≥(a)  σ_w²/(2 d(λ) h_G (1 − ρ))  =(b)  σ_w²/(2 d(λ)((1 − ρ)² + ρ d(λ))).    (3.54a)

For λ = m, we can further write

σ_w²/(2 d(m)((1 − ρ)² + ρ d(m))) ≥ σ_w²/(2 ((1 + ρ)²/κ)((1 + ρ)²/κ + ρ(1 + ρ)²/κ)) = σ_w² κ²/(2(1 + ρ)⁵)    (3.54b)

where the inequality is obtained from (3.27) and (3.51). Combining (3.52) with (3.54a) and (3.54b) completes the proof of (3.50a).

Figure 3.7: The line L (blue, dashed) and the intersection point G, along with the distances d(λ), h(λ), d_G, and h_G introduced in the proof of Theorem 2.

Next, we prove the general lower bound in (3.50b).
As we demonstrate in Appendix B.2, the modal contribution Ĵ to the noise amplification is a convex function of (b, a) which takes its minimum Ĵ_min = σ_w² over the stability triangle Δ at the origin b = a = 0. Combining this fact with the lower bound in (3.7) on ρ completes the proof of (3.50b). Finally, we obtain the lower bound (3.13b) on J_min × T_s by combining (3.49) and (3.50).

Case σ_w = ασ: The proof of (3.13d) utilizes the following inequalities

Ĵ(λ)/(1 − ρ) ≥ σ²/(2λ²(1 + ρ))    (3.55a)
Ĵ(λ)/(1 − ρ) ≥ σ²(1 − ρ)³κ²/L².    (3.55b)

In particular, (3.13d) follows from using (3.55a) for λ = m and taking the maximum of (3.55a) and (3.55b) for the other eigenvalues to bound the expression for J established by Theorem 5. We first prove (3.55a). By combining (3.52) and (3.54a), we obtain

Ĵ(λ)/(1 − ρ) ≥ α²σ²/(2 d(λ)((1 − ρ)² + ρ d(λ))).    (3.56)

Since d(λ) ≥ d_min := (1 − ρ)², where d_min is the smallest value of d over Δ_ρ [cf. (3.26b)], we can write

α²σ²/(2 d(λ)((1 − ρ)² + ρ d(λ))) ≥ α²σ²/(2 d(λ)²(1 + ρ)) = σ²/(2λ²(1 + ρ)).    (3.57)

Combining (3.56) and (3.57) completes the proof of (3.55a). To prove (3.55b), we use d(λ) ≥ d_min := (1 − ρ)² and d(m) = αm to obtain α ≥ (1 − ρ)²κ/L. Combining this inequality with Ĵ_min = σ_w² = α²σ² yields (3.55b). Finally, we obtain the lower bound (3.13d) on J_min × T_s by combining (3.49) and (3.55).

To obtain the lower bounds (3.13a) and (3.13c) on J_max × T_s, we consider a quadratic function for which the Hessian has n − 1 eigenvalues at λ = m and one eigenvalue at λ = L. For such a function, we can use Theorem 5 to write

J_max ≥ J = (n − 1)Ĵ(m) + Ĵ(L).    (3.58)

Case σ_w = σ: To prove (3.13a), we use the inequalities in (3.50a) and (3.50b) to bound Ĵ(m)/(1 − ρ) and Ĵ(L)/(1 − ρ) in (3.58), respectively.

Case σ_w = ασ: To prove (3.13c), we use inequality (3.55a) with λ = m to lower bound Ĵ(m)/(1 − ρ), and combine (3.55a) and (3.55b) to lower bound Ĵ(L)/(1 − ρ) in (3.58). This completes the proof.

3.7.3 Proof of Theorem 3

As described in Section 3.5, the parameters in Theorem 3 are obtained by placing the end points of the horizontal line segment (b(λ), a(λ)), parameterized by λ ∈ [m, L], on the edges X_ρZ_ρ and Y_ρZ_ρ of the ρ-linear convergence triangle Δ_ρ. These parameters can be equivalently represented by (3.32), where the scalar c given in Lemma 4 satisfies c ≥ 0 if and only if T_s ≤ (κ + 1)/2. The proof of Theorem 3 follows from combining Lemma 4 and Proposition 2.

3.7.4 Proof of Theorem 4

The following proposition allows us to prove the lower bounds in Theorem 4.

Proposition 6 Let ρ = ρ(A) = 1 − 1/T_s be the convergence rate of the two-step momentum algorithm (3.2). Then, the largest and smallest modal contributions to noise amplification given by (3.31b) satisfy

Ĵ_max ≥ σ_w² T_s/(2(1 + ρ)²),  Ĵ_min ≥ σ_w².

Proof: The inequality Ĵ_min ≥ σ_w² follows from the fact that Ĵ, as a function of (b, a), takes its minimum value at the origin; see Appendix B.2. The proof for Ĵ_max utilizes the fact that, for any constant parameters (α, β, γ) and fixed condition number, the spectral radius ρ(A) corresponds to the smallest ρ-linear convergence triangle Δ_ρ that contains the line segment (b(λ), a(λ)) for λ ∈ [m, L]. Thus, at least one of the end points (b(m), a(m)) or (b(L), a(L)) lies on the boundary of the triangle Δ_{ρ(A)}.
Combining this with the fact that d(m) ≤ d(L), it follows that at least one of the following holds:

(b(m), a(m)) ∈ X_ρZ_ρ or X_ρY_ρ,  (b(L), a(L)) ∈ Y_ρZ_ρ or X_ρY_ρ.

Together with the concrete values of the vertices (3.21c) in terms of ρ, this yields

1 − ρ ≥ min{h(m), h(L), l(L)/(1 + ρ), d(m)/(1 + ρ)}.    (3.59)

Also, using Theorem 5 and noting that the maximum values that h(λ), d(λ), and l(λ) can take over Δ_ρ are given by 1 + ρ², (1 + ρ)², and (1 + ρ)², respectively, we can write

Ĵ(m) ≥ σ_w²/(2 h(m) d(m)) ≥ max{σ_w²/(2 h(m)(1 + ρ)²), σ_w²/(2 d(m)(1 + ρ²))}
Ĵ(L) ≥ σ_w²/(2 h(L) l(L)) ≥ max{σ_w²/(2 h(L)(1 + ρ)²), σ_w²/(2 l(L)(1 + ρ²))}.    (3.60)

Finally, by the convexity of Ĵ (see Appendix B.2), we have Ĵ_max ≥ max{Ĵ(m), Ĵ(L)}. Combining this with (3.59) and (3.60) completes the proof. □

The lower bounds in Theorem 4 follow from combining Proposition 6 with the expression for J in Theorem 5. To obtain the upper bounds, we note that the parameter c in Lemma 4 satisfies c ∈ [−1, 0] if and only if T_s ≥ (κ + 1)/2. Thus, we can use Proposition 3 to complete the proof.

3.8 Concluding remarks

We examined the amplification of stochastic disturbances for a class of two-step momentum algorithms in which the iterates are perturbed by an additive white noise that arises from uncertainties in gradient evaluation or in computing the iterates. For both noise models, we establish lower bounds on the product of the settling time and the smallest/largest steady-state variance of the error in the optimization variable. These bounds scale with κ² for all stabilizing parameters, which reveals a fundamental limitation imposed by the condition number κ in designing algorithms that trade off noise amplification and convergence rate. In addition, we provide a novel geometric viewpoint on stability and ρ-linear convergence. This viewpoint brings insight into the relation between noise amplification, convergence rate, and algorithmic parameters. It also allows us to (i) take an alternative approach to optimizing convergence rates for standard algorithms; (ii) identify key similarities and differences between the iterate and gradient noise models; and (iii) introduce parameterized families of algorithms for which the parameters can be continuously adjusted to trade off noise amplification and settling time. By utilizing positive and negative momentum parameters in the accelerated and decelerated regimes, respectively, we demonstrate that a parameterized family of heavy-ball-like algorithms can achieve order-wise Pareto optimality for all settling times and both noise models. We also extend our analysis to continuous-time dynamical systems that can be discretized via an implicit-explicit Euler scheme to obtain the two-step momentum algorithm. For such gradient flow dynamics, we show that similar fundamental stochastic performance limitations hold as in discrete time.

Our ongoing work focuses on extending these results to algorithms with more complex structures, including update strategies that utilize information from more than the last two iterates and time-varying algorithmic parameters [117]. It is also of interest to identify fundamental performance limitations of stochastic gradient descent algorithms in which both additive and multiplicative stochastic disturbances exist [118], [119].
96 Chapter 4 T ransient growth of accelerated algorithms First-order optimization algorithms are increasingly b eing used in applications with limited time budgets. In man y real-time and em b edded scenarios, only a few iterations can b e p erformed and traditional con v ergence metrics cannot b e used to ev aluate p erformance of the algorithms in these non-asymptotic regimes. In this c hapter, w e examine the transien t b eha vior of accelerated rst-order optimization algorithms. F or con v ex quadratic problems, w e emplo y to ols from linear systems theory to sho w that transien t gro wth arises from the presence of non-normal dynamics. W e iden tify the existence of mo des that yield an algebraic gro wth in early iterations and quan tify the transien t excursion from the optimal solution caused b y these mo des. F or strongly con v ex smo oth optimization problems, w e utilize the theory of in tegral quadratic constrain ts (IQCs) to establish an upp er b ound on the magnitude of the transien t resp onse of Nestero v’s accelerated algorithm. W e sho w that b oth the Euclidean distance b et w een the optimization v ariable and the global minimizer and the rise time to the transien t p eak are prop ortional to the square ro ot of the condition n um b er of the problem. Finally , for problems with large condition n um b ers, w e demonstrate tigh tness of the b ounds that w e deriv e up to constan t factors. 4.1 Introduction First-order optimization algorithms are widely used in a v ariet y of elds including statistics, signal/image pro cessing, con trol, and mac hine learning [1][5], [71], [120], [121]. A cceleration is often utilized as a means to ac hiev e a faster rate of con v ergence relativ e to gradien t descen t 97 while main taining lo w p er-iteration complexit y . There is a v ast literature fo cusing on the con v ergence prop erties of accelerated algorithms for dieren t stepsize rules and acceleration parameters, including [7][9], [122]. There is also a gro wing b o dy of w ork whic h in v estigates robustness of accelerated algorithms to v arious t yp es of uncertain t y [27], [53], [91][93], [123], [124]. These studies demonstrate that acceleration increases sensitivit y to uncertain t y in gradien t ev aluation. In addition to deterioration of robustness in the face of uncertain t y , asymptotically stable accelerated algorithms ma y also exhibit undesirable transien t b eha vior [61]. This is in con trast to gradien t descen t whic h is a con traction for strongly con v ex problems with suitable stepsize [62]. In real-time optimization and in applications with limited time budgets, the transien t gro wth can limit the app eal of accelerated metho ds. In addition, rst-order algorithms are often used as a building blo c k in m ulti-stage optimization including ADMM [63] and distributed optimization metho ds [64]. In these settings, at eac h stage w e can p erform only a few iterations of rst-order up dates on primal or dual v ariables and transien t gro wth can ha v e a detrimen tal impact on the p erformance of the en tire algorithm. This motiv ates an in-depth study of the b eha vior of accelerated rst-order metho ds in non- asymptotic regimes. It is widely recognized that large transien ts ma y arise from the presence of resonan t mo dal in teractions and non-normalit y of linear dynamical generators [65]. 
Even in the absence of unstable modes, these can induce large transient responses, significantly amplify exogenous disturbances, and trigger departure from nominal operating conditions. For example, in fluid dynamics, such mechanisms can initiate departure from stable laminar flows and trigger transition to turbulence [66], [67].

In this chapter, we consider the optimization problem

minimize_x f(x)    (4.1)

where f: Rⁿ → R is a convex and smooth function, and we focus on a class of accelerated first-order algorithms

x^{t+2} = x^{t+1} + β(x^{t+1} − x^t) − α∇f(x^{t+1} + γ(x^{t+1} − x^t))    (4.2)

where t is the iteration index, α is the stepsize, and β is the momentum parameter. In particular, we are interested in Nesterov's accelerated and Polyak's heavy-ball methods, which correspond to γ = β and γ = 0, respectively. While these algorithms have faster convergence rates compared to standard gradient descent (γ = β = 0), they may suffer from large transient responses; see Figure 4.1 for an illustration.

Figure 4.1: Error in the optimization variable, ‖x^t − x⋆‖₂², as a function of the iteration number t, for Polyak's heavy-ball (black) and Nesterov's (red) algorithms with the parameters that optimize the convergence rate for a strongly convex quadratic problem with the condition number 10³ and a unit norm initial condition with x⁰ ≠ x⋆.

To quantify the transient behavior, we examine the ratio of the largest error in the optimization variable to the initial error. For convex quadratic problems, the algorithm in (4.2) can be cast as a linear time-invariant (LTI) system and modal analysis of the state-transition matrix can be performed. For both accelerated algorithms, we identify non-normal modes that create large transient growth, derive analytical expressions for the state-transition matrices, and establish bounds on the transient response in terms of the convergence rate and the iteration number. We show that both the peak value of the transient response and the rise time to this value increase with the square root of the condition number of the problem. Moreover, for general strongly convex problems, we combine a Lyapunov-based approach with the theory of IQCs to establish an upper bound on the transient response of Nesterov's accelerated algorithm. As for quadratic problems, we demonstrate that this bound scales with the square root of the condition number.
F urthermore, for con v ex quadratic problems, w e pro vide tigh t upp er and lo w er b ounds on transien t resp onses in terms of the condition n um b er and iden tify the initial condition that induces the largest transien t resp onse. Similar results with extensions to the W asserstein distance ha v e b een recen tly rep orted in [125]. Previous w ork on non-asymptotic b ounds for Nestero v’s accelerated algorithm includes [126], where b ounds on the ob jectiv e error in terms of the condition n um b er w ere pro vided. Ho w ev er, in con trast to our w ork, this result in tro duces a restriction on the initial conditions. Finally , while [56] presen ts computational b ounds w e dev elop analytical b ounds on the non-asymptotic v alue of the estimated optimizer. 4.2 Convex quadratic problems In this section, w e examine transien t resp onses of accelerated algorithms for con v ex quadratic ob jectiv e functions, f(x) = 1 2 x T Qx (4.3a) whereQ=Q T ⪰ 0 is a p ositiv e semi-denite matrix. In what follo ws, w e rst bring (4.2) in to a standard L TI state-space form and then utilize appropriate co ordinate transformation to decomp ose the dynamics in to decoupled subsystems. Using this decomp osition, w e pro vide analytical expressions for the state-transition matrix and establish sharp b ounds on the transien t gro wth and the lo cation of the transien t p eak for accelerated algorithms. W e also 100 examine the inuence of initial conditions on transien t resp onses and relegate the pro ofs to App endix C.1. 4.2.1 L TI formulation The matrix Q admits an eigen v alue decomp osition, Q = VΛ V T , where Λ is the diagonal matrix of eigen v alues with L := λ 1 ≥ ··· ≥ λ r =: m > 0 λ i = 0 for i = r+1,...,n (4.3b) and V is the unitary matrix of the corresp onding eigen v ectors. W e dene the condition n um b er κ :=L/m as the ratio of the largest and smallest non-zero eigen v alues of the matrix Q. F or f in (4.3a), w e ha v e ∇f(x) = Qx, and the c hange of v ariables ˆ x t := V T x t brings dynamics (4.2) to ˆ x t+2 = (I − α Λ)ˆx t+1 + (βI − γα Λ)(ˆx t+1 − ˆ x t ). (4.4) This system can b e represen ted via n decoupled second-order subsystems of the form, ˆ ψ t+1 i = A i ˆ ψ t i , ˆ x t i = C i ˆ ψ t i (4.5a) where ˆ x t i is the ith elemen t of the v ector ˆ x t ∈R n , ˆ ψ t i := h ˆ x t i ˆ x t+1 i i T , C i := h 1 0 i , and A i = 0 1 − (β − γαλ i ) 1− αλ i +(β − γαλ i ) . (4.5b) 4.2.2 Linear convergence of accelerated algorithms The minimizers of (4.3a) are determined b y the n ull space of the matrix Q, x ⋆ ∈N(Q). The constan t parameters α and β can b e selected to pro vide stabilit y of subsystems in (4.5) for all λ i ∈ [m,L], and guaran tee con v ergence of ˆ x t i to ˆ x ⋆ i := 0 with a linear rate determined b y 101 Metho d Optimal parameters Linear rate ρ Nestero v α = 4 3L+m β = √ 3κ +1− 2 √ 3κ +1+2 1− 2 √ 3κ +1 P oly ak α = 4 ( √ L+ √ m) 2 β = ( √ κ − 1) 2 ( √ κ +1) 2 1− 2 √ κ +1 T able 4.1: P arameters that pro vide optimal con v ergence rates for a con v ex quadratic ob jectiv e function (4.3) with κ :=L/m. the sp ectral radius ρ (A i ) < 1. On the other hand, for i = r+1,...,n the eigen v alues of A i are β and 1. In this case, the solution to (4.5) is giv en b y ˆ x t i = 1 − β t 1 − β (ˆ x 1 i − ˆ x 0 0 ) + ˆ x 0 i (4.6a) and the steady-state limit of ˆ x t i , ˆ x ⋆ i := 1 1 − β (ˆ x 1 i − ˆ x 0 i ) + ˆ x 0 i (4.6b) is ac hiev ed with a linear rate β < 1. 
Thus, the iterates of (4.2) converge to the optimal solution x⋆ = Vx̂⋆ ∈ N(Q) with a linear rate ρ < 1, and Table 4.1 provides the parameters α and β that optimize the convergence rate [52, Proposition 1].

4.2.3 Transient growth of accelerated algorithms

In spite of a significant improvement in the rate of convergence, acceleration may deteriorate performance on finite time intervals and lead to large transient responses. This is in contrast to gradient descent, which is a contraction [62]. At any t, we are interested in the worst-case ratio of the two norm of the error in the optimization variable z^t := x^t − x⋆ to the two norm of the initial condition ψ⁰ − ψ⋆ = [(z⁰)ᵀ  (z¹)ᵀ]ᵀ,

J²(t) := sup_{ψ⁰ ≠ ψ⋆} ‖x^t − x⋆‖₂²/‖ψ⁰ − ψ⋆‖₂².    (4.7)

Proposition 1 For the accelerated algorithms applied to convex quadratic problems, J(t) in (4.7) is determined by

J²(t) = max{max_{i ≤ r} ‖C_i A_iᵗ‖₂²,  β^{2t}/(1 + β²)}.    (4.8)

Proof: Since V is unitary and dynamics (4.5) that govern the evolution of each x̂_i^t are decoupled, J(t) is determined by

J²(t) = max_i sup_{ψ̂_i⁰ ≠ ψ̂_i⋆} (x̂_i^t − x̂_i⋆)²/‖ψ̂_i⁰ − ψ̂_i⋆‖₂²    (4.9)

where ψ̂_i⋆ := [x̂_i⋆  x̂_i⋆]ᵀ. Furthermore, the mapping from ψ̂_i⁰ − ψ̂_i⋆ to x̂_i^t − x̂_i⋆ is given by Φ_i(t) := C_i A_iᵗ, where the state-transition matrix A_iᵗ is determined by the tth power of A_i,

x̂_i^t − x̂_i⋆ = C_i A_iᵗ (ψ̂_i⁰ − ψ̂_i⋆) =: Φ_i(t)(ψ̂_i⁰ − ψ̂_i⋆).    (4.10)

For λ_i ≠ 0, ψ̂_i⁰ − ψ̂_i⋆ = ψ̂_i⁰ is an arbitrary vector in R². Thus,

sup_{ψ̂_i⁰ ≠ ψ̂_i⋆} (x̂_i^t − x̂_i⋆)²/‖ψ̂_i⁰ − ψ̂_i⋆‖₂² = ‖C_i A_iᵗ‖₂²,  i = 1, ..., r.    (4.11)

This expression, however, does not hold when λ_i = 0 in (4.5) because ψ_i⁰ − ψ_i⋆ is restricted to a line in R². Namely, from (4.6),

x̂_i^t − x̂_i⋆ = −(βᵗ/(1 − β))(x̂_i¹ − x̂_i⁰)
ψ_i⁰ − ψ_i⋆ = [x̂_i⁰ − x̂_i⋆;  x̂_i¹ − x̂_i⋆] = −((x̂_i¹ − x̂_i⁰)/(1 − β)) [1;  β]    (4.12)

which, for any initial condition with x̂_i⁰ ≠ x̂_i¹, leads to

(x̂_i^t − x̂_i⋆)²/‖ψ_i⁰ − ψ_i⋆‖₂² = β^{2t}/(1 + β²),  i = r+1, ..., n.    (4.13)

Finally, substitution of (4.11) and (4.13) into (4.9) yields (4.8). □

4.2.4 Analytical expressions for the transient response

We next derive analytical expressions for the state-transition matrix A_iᵗ and the response matrix Φ_i(t) = C_i A_iᵗ in (4.5).

Lemma 1 Let μ₁ and μ₂ be the eigenvalues of the matrix

M = [0, 1; a, b]

and let t be a positive integer. For μ₁ ≠ μ₂,

Mᵗ = (1/(μ₂ − μ₁)) [μ₁μ₂(μ₁^{t−1} − μ₂^{t−1}),  μ₂ᵗ − μ₁ᵗ;  μ₁μ₂(μ₁ᵗ − μ₂ᵗ),  μ₂^{t+1} − μ₁^{t+1}].

Moreover, for μ := μ₁ = μ₂, the matrix Mᵗ is determined by

Mᵗ = [(1 − t)μᵗ,  tμ^{t−1};  −tμ^{t+1},  (t + 1)μᵗ].    (4.14)

Lemma 1 with M = A_i determines explicit expressions for A_iᵗ. These expressions allow us to establish a bound on the norm of the response for each decoupled subsystem (4.5). In Lemma 2, we provide a tight upper bound on ‖C_i A_iᵗ‖₂² for each t in terms of the spectral radius of the matrix A_i.

Lemma 2 The matrix M in Lemma 1 satisfies

‖[1  0]Mᵗ‖₂² ≤ (t − 1)²ρ^{2t} + t²ρ^{2t−2}    (4.15)

where ρ is the spectral radius of M. Moreover, (4.15) becomes an equality if M has repeated eigenvalues.

Remark 1 For Nesterov's accelerated algorithm with the parameters that optimize the rate of convergence (cf.
Table 4.1), the matrix A_r, which corresponds to the smallest non-zero eigenvalue of Q, λ_r = m, has an eigenvalue 1 − 2/√(3κ+1) with algebraic multiplicity two and an incomplete set of eigenvectors. Similarly, for both λ₁ = L and λ_r = m, the matrices A₁ and A_r for the heavy-ball method with the parameters provided in Table 4.1 have repeated eigenvalues which are, respectively, given by (1 − √κ)/(1 + √κ) and −(1 − √κ)/(1 + √κ).

We next use Lemma 2 with M = A_i to establish an analytical expression for J(t).

Theorem 1 For accelerated algorithms applied to convex quadratic problems, J(t) in (4.7) satisfies

J²(t) ≤ max{(t − 1)²ρ^{2t} + t²ρ^{2(t−1)},  β^{2t}/(1 + β²)}

where ρ := max_{i ≤ r} ρ(A_i). Moreover, for the parameters provided in Table 4.1,

J²(t) = (t − 1)²ρ^{2t} + t²ρ^{2(t−1)}.    (4.16)

Theorem 1 highlights the source of disparity between the long- and short-term behavior of the response. While the geometric decay of ρᵗ drives x^t to x⋆ as t → ∞, early stages are dominated by the algebraic term which induces a transient growth. We next provide tight bounds on the time t_max at which the largest transient response takes place and the corresponding peak value J(t_max). Even though we derive the explicit expressions for these two quantities, our tight upper and lower bounds are more informative and easier to interpret.

Theorem 2 For accelerated algorithms with the parameters provided in Table 4.1, let ρ ∈ [1/e, 1). Then the rise time t_max := argmax_t J(t) and the peak value J(t_max) satisfy

−1/log(ρ) ≤ t_max ≤ 1 − 1/log(ρ)
−√2 ρ/(e log(ρ)) ≤ J(t_max) ≤ −√2/(e ρ log(ρ)).

For accelerated algorithms with the parameters provided in Table 4.1, Theorem 2 can be used to determine the rise time to the peak in terms of the condition number κ. We next establish that both t_max and J(t_max) scale as √κ.

Proposition 2 For accelerated algorithms with the parameters provided in Table 4.1, the rise time t_max := argmax_t J(t) and the peak value J(t_max) satisfy

(i) Polyak's heavy-ball method with κ ≥ 4.69:

(√κ − 1)/2 ≤ t_max ≤ (√κ + 3)/2
(√κ − 1)²/(√2 e(√κ + 1)) ≤ J(t_max) ≤ (√κ + 1)²/(√2 e(√κ − 1))

(ii) Nesterov's accelerated method with κ ≥ 3.01:

(√(3κ+1) − 2)/2 ≤ t_max ≤ (√(3κ+1) + 2)/2
(√(3κ+1) − 2)²/(√2 e √(3κ+1)) ≤ J(t_max) ≤ (3κ + 1)/(√2 e(√(3κ+1) − 2)).

In Proposition 2, the lower bounds on κ are only required to ensure that the convergence rate ρ satisfies ρ ≥ 1/e, which allows us to apply Theorem 2. We also note that the upper and lower bounds on t_max and J(t_max) are tight in the sense that their ratio converges to 1 as κ → ∞.

Figure 4.2: Dependence of the error in the optimization variable ‖x^t‖₂² on the iteration number t for the heavy-ball (black) and Nesterov's (red) methods, as well as the peak magnitudes (dashed lines) obtained in Proposition 2, for two different initial conditions with ‖x¹‖₂ = ‖x⁰‖₂ = 1: (a) x¹ = x⁰; (b) x¹ = −x⁰.

4.2.5 The role of initial conditions

The accelerated algorithms need to be initialized with x⁰ and x¹ ∈ Rⁿ. This provides a degree of freedom that can be used to potentially improve their transient performance. To provide insight, let us consider the quadratic problem with Q = diag(κ, 1).
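A minimal simulation sketch of this example is given below: it runs recursion (4.2) for the heavy-ball and Nesterov parameters of Table 4.1 from the two initializations x¹ = ±x⁰ and records the peak error, which can be compared against the √κ-scaling of J(t_max) in Proposition 2. The specific condition number, initial vector, and iteration horizon are arbitrary choices made for illustration.

```python
import numpy as np

def run(Q, alpha, beta, gamma, x0, x1, iters=400):
    # Two-step momentum recursion (4.2) for f(x) = 0.5 x^T Q x, so grad f(x) = Q x
    xs = [x0, x1]
    for _ in range(iters):
        y = xs[-1] + gamma * (xs[-1] - xs[-2])
        xs.append(xs[-1] + beta * (xs[-1] - xs[-2]) - alpha * (Q @ y))
    return np.array([np.linalg.norm(x) for x in xs])  # here x_star = 0

kappa = 1e3
Q = np.diag([kappa, 1.0])
L, m = kappa, 1.0

# Table 4.1 parameters
hb = (4 / (np.sqrt(L) + np.sqrt(m)) ** 2,
      ((np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)) ** 2, 0.0)
beta_na = (np.sqrt(3 * kappa + 1) - 2) / (np.sqrt(3 * kappa + 1) + 2)
na = (4 / (3 * L + m), beta_na, beta_na)

x0 = np.array([1.0, 1.0]) / np.sqrt(2)
for name, p in [("heavy-ball", hb), ("nesterov", na)]:
    for sign, label in [(+1, "x1 = x0"), (-1, "x1 = -x0")]:
        errs = run(Q, *p, x0, sign * x0)
        print(name, label, "peak error:", errs.max(), "at t =", errs.argmax())
```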
Figure 4.2 shows the error in the optimization variable for Polyak's and Nesterov's algorithms, as well as the peak magnitudes obtained in Proposition 2, for two different types of initial conditions with x¹ = x⁰ and x¹ = −x⁰, respectively. For x¹ = −x⁰, both algorithms recover their worst-case transient responses. However, for x¹ = x⁰, Nesterov's method shows no transient growth. Our analysis shows that large transient responses arise from the existence of non-normal modes in the matrices A_i. However, such modes do not move the entries of the state-transition matrix A_iᵗ in arbitrary directions. For example, using Lemma 1, it is easy to verify that A_r in (4.5b), associated with the smallest non-zero eigenvalue λ_r = m of Q, in Nesterov's algorithm with the parameters provided by Table 4.1 has the repeated eigenvalue μ = 1 − 2/√(3κ+1) and A_rᵗ is determined by (4.14) with M = A_r. Even though each entry of A_rᵗ experiences a transient growth, its row sum is determined by

A_rᵗ [1; 1] = [1 + 2t/(√(3κ+1) − 2);  1 + 2t/√(3κ+1)] (1 − 2/√(3κ+1))ᵗ

and the entries of this vector are monotonically decaying functions of t. Furthermore, for i < r, it can be shown that the entries of A_iᵗ[1  1]ᵀ remain smaller than 1 for all i and t. In Theorem 3, we provide a bound on the transient response of Nesterov's method for balanced initial conditions with x¹ = x⁰.

Theorem 3 For convex quadratic optimization problems, Nesterov's accelerated method with a balanced initial condition x¹ = x⁰ and the parameters provided in Table 4.1 satisfies

‖x^t − x⋆‖₂ ≤ ‖x⁰ − x⋆‖₂.

Proof: See Appendix C.2. □

It is worth mentioning that the transient growth of the heavy-ball method cannot be eliminated with the use of balanced initial conditions. To see this, we note that the matrices A_rᵗ and A₁ᵗ for the heavy-ball method with the parameters provided in Table 4.1 also take the form in (4.14), with μ = (1 − √κ)/(1 + √κ) and μ = −(1 − √κ)/(1 + √κ), respectively. In contrast to A_rᵗ[1  1]ᵀ, which decays monotonically,

A₁ᵗ [1; 1] = [1 + 2t√κ/(1 − √κ);  1 + 2t√κ/(1 + √κ)] ((1 − √κ)ᵗ/(1 + √κ)ᵗ)

experiences transient growth. It was recently shown that an averaged version of the heavy-ball method experiences smaller peak deviation than the heavy-ball method [127]. We also note that adaptive restarting provides effective heuristics for reducing the oscillatory behavior of accelerated algorithms [61].

Remark 2 For accelerated algorithms with the parameters provided in Table 4.1, the initial condition that leads to the largest transient growth at any time τ is determined by

ψ̂_r⁰ = c[(1 − τ)ρ^τ  τρ^{τ−1}]ᵀ,  ψ̂_i⁰ = 0 for i ≠ r

where c ≠ 0 and ψ̂_r⁰ is the principal right singular vector of C_r A_r^τ. Thus, the largest peak J(t_max) occurs for {ψ̂_i⁰ = 0, i ≠ r} and ψ̂_r⁰ = c[(1 − t_max)ρ^{t_max}  t_max ρ^{t_max − 1}]ᵀ, where tight bounds on t_max are established in Proposition 2.

Remark 3 For λ_i = 0 in (4.5), |x̂_i^t − x̂_i⋆| decays monotonically with a linear rate β and only non-zero eigenvalues of Q contribute to the transient growth. Furthermore, for the parameters provided in Table 4.1, our analysis shows that J²(t) = max_{i ≤ r} ‖C_i A_iᵗ‖₂². In what follows, we provide bounds on the largest deviation from the optimal solution for Nesterov's algorithm for general strongly convex problems.
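As a final check on the quadratic analysis before turning to general strongly convex problems, the closed-form expression (4.16) can be evaluated directly. The short sketch below locates t_max and J(t_max) numerically for the Table 4.1 parameters and compares them with the √κ bounds of Proposition 2; the condition number and search horizon are arbitrary choices.

```python
import numpy as np

def peak_from_formula(rho, horizon=2000):
    # J^2(t) = (t - 1)^2 rho^(2t) + t^2 rho^(2(t - 1)), cf. (4.16)
    t = np.arange(1, horizon)
    J = np.sqrt((t - 1) ** 2 * rho ** (2 * t) + t ** 2 * rho ** (2 * (t - 1)))
    return t[J.argmax()], J.max()

kappa = 1e3
rho_hb = 1 - 2 / (np.sqrt(kappa) + 1)
rho_na = 1 - 2 / np.sqrt(3 * kappa + 1)

for name, rho in [("heavy-ball", rho_hb), ("nesterov", rho_na)]:
    tmax, Jmax = peak_from_formula(rho)
    print(name, "t_max =", tmax, "J(t_max) =", Jmax)

# Proposition 2 brackets for the heavy-ball method
print("bounds:", (np.sqrt(kappa) - 1) / 2, (np.sqrt(kappa) + 3) / 2,
      (np.sqrt(kappa) - 1) ** 2 / (np.sqrt(2) * np.e * (np.sqrt(kappa) + 1)),
      (np.sqrt(kappa) + 1) ** 2 / (np.sqrt(2) * np.e * (np.sqrt(kappa) - 1)))
```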
4.3 General strongly convex problems

In this section, we combine a Lyapunov-based approach with the theory of IQCs to provide bounds on the transient growth of Nesterov's accelerated algorithm for the class F_m^L of m-strongly convex and L-smooth functions. When f is not quadratic, first-order algorithms are no longer LTI systems and eigenvalue decomposition cannot be utilized to simplify the analysis. Instead, to handle nonlinearity and obtain upper bounds on J in (4.7), we augment standard quadratic Lyapunov functions with the objective error. For f ∈ F_m^L, algorithm (4.2) is invariant under translation. Thus, without loss of generality, we assume that x⋆ = 0 is the unique minimizer of (4.1) with f(0) = 0. In what follows, we present a framework based on linear matrix inequalities (LMIs) that allows us to obtain time-independent bounds on the error in the optimization variable. This framework combines certain IQCs [81] with Lyapunov functions of the form

V(ψ) = ψᵀXψ + θ f(Cψ)    (4.17)

which consist of the objective function evaluated at Cψ and a quadratic function of ψ, where X is a positive definite matrix. The IQC theory provides a convex control-theoretic approach to analyzing optimization algorithms [52] and it was recently employed to study convergence and robustness of first-order methods [53], [54], [56], [68], [93], [128]. The type of Lyapunov functions in (4.17) was introduced in [56], [106] to study convergence for convex problems. For Nesterov's accelerated algorithm, we demonstrate that this approach provides order-wise tight analytical upper bounds on J(t).

Nesterov's accelerated algorithm can be viewed as a feedback interconnection of linear and nonlinear components

ψ^{t+1} = Aψ^t + Bu^t,  y^t = C_yψ^t,  u^t = Δ(y^t)    (4.18a)

where the LTI part of the system is determined by

A = [0, I; −βI, (1 + β)I],  B = [0; −αI],  C_y = [−βI  (1 + β)I]    (4.18b)

and the nonlinear mapping Δ: Rⁿ → Rⁿ is Δ(y) := ∇f(y). Moreover, the state vector ψ^t and the input y^t to Δ are determined by

ψ^t := [x^t; x^{t+1}],  y^t := (1 + β)x^{t+1} − βx^t.    (4.18c)

For smooth and strongly convex functions f ∈ F_m^L, Δ satisfies the quadratic inequality [52, Lemma 6]

[y − y₀; Δ(y) − Δ(y₀)]ᵀ Π [y − y₀; Δ(y) − Δ(y₀)] ≥ 0    (4.19a)

for all y, y₀ ∈ Rⁿ, where the matrix Π is given by

Π := [−2mLI, (L + m)I; (L + m)I, −2I].    (4.19b)

Using u^t := Δ(y^t) and y^t := C_yψ^t and evaluating (4.19a) at y = y^t and y₀ = 0 leads to

[ψ^t; u^t]ᵀ M₁ [ψ^t; u^t] ≥ 0    (4.19c)

where

M₁ := [C_yᵀ, 0; 0, I] Π [C_y, 0; 0, I] = [−2mL C_yᵀC_y, (L + m)C_yᵀ; (L + m)C_y, −2I].    (4.19d)

In Lemma 3, we provide an upper bound on the difference between the objective function at two consecutive iterations of Nesterov's algorithm. In combination with (4.19), this result allows us to utilize Lyapunov functions of the form (4.17) to establish an upper bound on transient growth. We note that variations of this lemma have been presented in [56, Lemma 5.2] and in [93, Lemma 3].

Lemma 3 Along the solution of Nesterov's accelerated algorithm (4.18), the function f ∈ F_m^L with κ := L/m satisfies

f(x^{t+2}) − f(x^{t+1}) ≤ (1/2) [ψ^t; u^t]ᵀ M₂ [ψ^t; u^t]    (4.20a)

where the matrix M₂ is given by

M₂ := [−m C₂ᵀC₂, C₂ᵀ; C₂, −α(2 − αL)I],  C₂ := [−βI  βI].    (4.20b)
Using Lemma 3, we next demonstrate how a Lyapunov function of the form (4.17) with θ := 2θ₂ and C := [0  I], in conjunction with property (4.19) of the nonlinear mapping Δ, can be utilized to obtain an upper bound on ‖x^t‖₂².

Lemma 4 Let M₁ be given by (4.19d) and let M₂ be defined in Lemma 3. Then, for any positive semi-definite matrix X and nonnegative scalars θ₁ and θ₂ that satisfy

W := [AᵀXA − X, AᵀXB; BᵀXA, BᵀXB] + θ₁M₁ + θ₂M₂ ⪯ 0    (4.21)

the transient growth of Nesterov's accelerated algorithm (4.18) for all t ≥ 1 is upper bounded by

‖x^t‖₂² ≤ (λ_max(X)‖x⁰‖₂² + (λ_max(X) + Lθ₂)‖x¹‖₂²)/(λ_min(X) + mθ₂).    (4.22)

In Lemma 4, the Lyapunov function candidate V(ψ) := ψᵀXψ + 2θ₂ f([0  I]ψ) is used to show that the state vector ψ^t is confined within the sublevel set {ψ ∈ R^{2n} | V(ψ) ≤ V(ψ⁰)} associated with V(ψ⁰). We next establish an order-wise tight upper bound on ‖x^t‖₂ that scales linearly with √κ by finding a feasible point for LMI (4.21) in Lemma 4.

Theorem 4 For f ∈ F_m^L with the condition number κ := L/m, the iterates of Nesterov's accelerated algorithm (4.18) for any stabilizing parameters α ≤ 1/L and β < 1 satisfy

‖x^t‖₂² ≤ κ (((1 + β²)/(αβL))‖x⁰‖₂² + (1 + (1 + β²)/(αβL))‖x¹‖₂²).    (4.23a)

Furthermore, for the conventional values of parameters

α = 1/L,  β = (√κ − 1)/(√κ + 1)    (4.23b)

the largest transient error, defined in (4.7), satisfies

√2(√κ − 1)²/(e√κ) ≤ sup_{t ∈ N, f ∈ F_m^L} J(t) ≤ √(3κ + 4κ/(κ − 1)).    (4.23c)

For balanced initial conditions, i.e., x¹ = x⁰, Nesterov established the upper bound √κ + 1 on J in [9]. Theorem 4 shows that similar trends hold without restrictions on the initial conditions. Linear scaling of the upper and lower bounds with √κ illustrates a potential drawback of using Nesterov's accelerated algorithm in applications with limited time budgets. As κ → ∞, the ratio of these bounds converges to e√(3/2) ≈ 3.33, thereby demonstrating that the largest transient response for all f ∈ F_m^L is within a factor of 3.33 of the bounds established in Theorem 4.

4.4 Concluding remarks

We have examined the impact of acceleration on the transient responses of accelerated first-order optimization algorithms. Without imposing restrictions on initial conditions, we establish bounds on the largest value of the Euclidean distance between the optimization variable and the global minimizer. For convex quadratic problems, we utilize tools from linear systems theory to fully capture transient responses and, for general strongly convex problems, we employ the theory of integral quadratic constraints to establish an upper bound on transient growth. This upper bound is proportional to the square root of the condition number and we identify quadratic problem instances for which accelerated algorithms generate transient responses that are within a constant factor of this upper bound. Future directions include extending our analysis to nonsmooth optimization problems and devising algorithms that balance acceleration with the quality of transient responses.

Chapter 5
Noise amplification of primal-dual gradient flow dynamics based on proximal augmented Lagrangian

In this chapter, we examine amplification of additive stochastic disturbances to primal-dual gradient flow dynamics based on the proximal augmented Lagrangian.
These dynamics can be used to solve a class of non-smooth composite optimization problems and are convenient for distributed implementation. We utilize the theory of integral quadratic constraints to show that the upper bound on noise amplification is inversely proportional to the strong-convexity module of the smooth part of the objective function. Furthermore, to demonstrate the tightness of these upper bounds, we exploit the structure of quadratic optimization problems and derive analytical expressions in terms of the eigenvalues of the corresponding dynamical generators. We further specialize our results to a distributed optimization framework and discuss the impact of network topology on noise amplification.

5.1 Introduction

We consider a class of primal-dual gradient flow dynamics based on the proximal augmented Lagrangian [68] that can be used for solving large-scale non-smooth constrained optimization problems in continuous time. Such problems arise in many areas, e.g., signal processing [69], statistical estimation [70], and control [71]. In addition, primal-dual methods have received renewed attention due to their prevalent application in distributed optimization [72], and their convergence and stability properties have been studied extensively [73]-[79].

While gradient-based methods are not readily applicable to non-smooth optimization, their proximal counterparts can be utilized to address such problems [80]. In the context of non-smooth constrained optimization, proximal-based extensions of primal-dual methods can also be obtained using the augmented Lagrangian [68]; these preserve structural separability and remain suitable for distributed optimization. Using primal-dual algorithms in real-world distributed settings motivates the robustness analysis of such methods, since uncertainty can enter the dynamics through noisy communication channels [129]. Moreover, uncertainties can also arise in applications where the exact value of the gradient is not fully available, e.g., when the objective function is obtained via costly simulations or its computation relies on noisy measurements, as in real-time and embedded applications.

In this chapter, we consider the scenario in which the primal-dual flow dynamics are perturbed by additive white noise. We examine the mean-squared error of the primal optimization variable as a measure of how noise is amplified by the dynamics; we refer to this quantity as noise (or variance) amplification. For convex quadratic optimization problems, the primal-dual flow becomes a linear time-invariant system, for which the noise amplification can be characterized using Lyapunov equations. For non-quadratic problems, the flow is no longer linear; however, tools from robust control theory can be utilized to quantify upper bounds on the noise amplification. In particular, we use the theory of Integral Quadratic Constraints (IQCs) [81], [82] to characterize upper bounds on the noise amplification of the primal-dual flow based on the proximal augmented Lagrangian using solutions to a certain linear matrix inequality. Our results establish tight upper bounds on the noise amplification that are inversely proportional to the strong-convexity module of the corresponding objective function.
The approach taken in this chapter is similar to those in [53], [56], [93], [98], [103], [105], wherein IQCs have been used to analyze convergence and robustness of first-order optimization algorithms and their accelerated variants. The noise amplification of primal-dual methods has also been studied in [129], where the authors focus on quadratic problems and consider the average error in the objective function. In contrast, we consider the average error in the optimization variable and extend the noise amplification analysis to strongly convex and non-smooth optimization problems. For smooth strongly convex problems, an input-output analysis with a focus on the induced $L_2$ norm using passivity theory was provided in [49]. In contrast, we study the stochastic performance of primal-dual algorithms that can be utilized to solve non-smooth composite optimization problems.

The rest of the chapter is structured as follows. We describe the proximal augmented Lagrangian and the noisy primal-dual gradient flow dynamics in Section 5.2. We study the variance amplification for quadratic problems in Section 5.3 and present our IQC-based approach for general strongly convex but non-smooth optimization problems in Section 5.4. We study the noise amplification in a distributed optimization setting in Section 5.5 and provide concluding remarks in Section 5.6.

5.2 Proximal Augmented Lagrangian

We study the non-smooth composite optimization problem
\[
\operatorname*{minimize}_{x,\,z} ~~ f(x) \,+\, g(z) \qquad \text{subject to} \quad Tx \,-\, z \,=\, 0
\tag{5.1}
\]
where $f\colon \mathbb{R}^n \to \mathbb{R}$ is a convex, continuously differentiable function, $g\colon \mathbb{R}^k \to \mathbb{R}$ is a convex but possibly non-differentiable function, and $T \in \mathbb{R}^{k\times n}$ is a given matrix. The augmented Lagrangian associated with (5.1) is given by
\[
L_\mu(x,z;\nu) \;=\; f(x) \,+\, g(z) \,+\, \nu^T(Tx - z) \,+\, \tfrac{1}{2\mu}\|Tx - z\|_2^2
\]
where $\mu > 0$ is a parameter and $\nu$ is the Lagrange multiplier. The infimum of the augmented Lagrangian $L_\mu$ with respect to $z$ is given by the proximal augmented Lagrangian [68]
\[
L_\mu(x;\nu) \;:=\; \inf_z\, L_\mu(x,z;\nu) \;=\; f(x) \,+\, M_{\mu g}(Tx + \mu\nu) \,-\, \tfrac{\mu}{2}\|\nu\|_2^2
\tag{5.2}
\]
where $M_{\mu g}(\xi) := g(\mathrm{prox}_{\mu g}(\xi)) + \tfrac{1}{2\mu}\|\mathrm{prox}_{\mu g}(\xi) - \xi\|_2^2$ is the Moreau envelope of the function $g$, and $\mathrm{prox}_{\mu g}(\xi) := \operatorname*{argmin}_z\, g(z) + \tfrac{1}{2\mu}\|z - \xi\|^2$ is the corresponding proximal operator. In addition, the Moreau envelope is continuously differentiable and its gradient is determined by $\mu\nabla M_{\mu g}(\xi) = \xi - \mathrm{prox}_{\mu g}(\xi)$.

For convex problems, solving (5.1) amounts to finding the saddle points of $L_\mu(x;\nu)$. To this end, continuous differentiability of $L_\mu(x;\nu)$ was utilized in [68] to introduce the associated Arrow-Hurwicz-Uzawa gradient flow dynamics
\[
\dot{x} \;=\; -\nabla_x L_\mu(x;\nu), \qquad \dot{\nu} \;=\; \nabla_\nu L_\mu(x;\nu)
\tag{5.3}
\]
which is a continuous-time algorithm that performs gradient primal-descent and dual-ascent on the proximal augmented Lagrangian. For $L_\mu(x;\nu)$ given by (5.2), the gradient flow dynamics (5.3) take the form
\[
\begin{aligned}
\dot{x} \;&=\; -\nabla f(x) \,-\, \tfrac{1}{\mu}T^T\big(Tx + \mu\nu - \mathrm{prox}_{\mu g}(Tx + \mu\nu)\big) \\
\dot{\nu} \;&=\; Tx \,-\, \mathrm{prox}_{\mu g}(Tx + \mu\nu).
\end{aligned}
\tag{5.4}
\]

5.2.1 Stability properties

When $f$ is convex with a Lipschitz continuous gradient, and $g$ is proper, closed, and convex, the set of equilibrium points of (5.4) is characterized by the minimizers of problem (5.1) and is globally asymptotically stable [68, Theorem 2]. Furthermore, when $f$ is strongly convex and $T$ is full row rank, there is a unique equilibrium point $(x^\star,\nu^\star)$ which is globally exponentially stable, and $(x^\star, z^\star = \mathrm{prox}_{\mu g}(Tx^\star + \mu\nu^\star))$ is the unique optimal solution of problem (5.1) [75, Theorem 6].
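To make the proximal objects above concrete, the following minimal sketch evaluates the proximal operator and Moreau envelope for one hypothetical choice of the non-smooth term, $g(z) = \lambda\|z\|_1$ (chosen purely for illustration because its proximal operator is the familiar soft-thresholding map), and numerically checks the gradient identity $\mu\nabla M_{\mu g}(\xi) = \xi - \mathrm{prox}_{\mu g}(\xi)$. The values of $\lambda$ and $\mu$ are arbitrary placeholders.

```python
import numpy as np

lam, mu = 0.5, 0.2   # illustrative regularization and penalty parameters

def prox_l1(xi, mu):
    # prox_{mu*g}(xi) for g(z) = lam*||z||_1 is entrywise soft-thresholding.
    return np.sign(xi) * np.maximum(np.abs(xi) - mu * lam, 0.0)

def moreau_env(xi, mu):
    # M_{mu*g}(xi) = g(prox(xi)) + (1/(2*mu))*||prox(xi) - xi||^2, cf. (5.2).
    p = prox_l1(xi, mu)
    return lam * np.sum(np.abs(p)) + np.sum((p - xi) ** 2) / (2.0 * mu)

xi = np.array([1.3, -0.05, 0.7, -2.0])
grad_identity = (xi - prox_l1(xi, mu)) / mu          # mu * grad M = xi - prox(xi)
grad_numeric = np.array([                            # central finite differences
    (moreau_env(xi + 1e-6 * e, mu) - moreau_env(xi - 1e-6 * e, mu)) / 2e-6
    for e in np.eye(len(xi))
])
print(np.allclose(grad_identity, grad_numeric, atol=1e-5))   # True
```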
5.2.2 Noise amplification

We examine the impact of additive stochastic uncertainties on the performance of the primal-dual gradient flow dynamics. In particular, we consider the noisy version of (5.4),
\[
\begin{aligned}
\mathrm{d}x \;&=\; -\big(\nabla f(x) \,+\, T^T\nabla M_{\mu g}(Tx + \mu\nu)\big)\,\mathrm{d}t \,+\, \mathrm{d}w_1 \\
\mathrm{d}\nu \;&=\; \big(Tx \,-\, \mathrm{prox}_{\mu g}(Tx + \mu\nu)\big)\,\mathrm{d}t \,+\, \mathrm{d}w_2
\end{aligned}
\tag{5.5}
\]
where $\mathrm{d}w_i(t)$ are the increments of independent Wiener processes with covariance matrices $\mathbb{E}[w_i(t)w_i^T(t)] = s_i I t$ and $s_i > 0$ for $i\in\{1,2\}$. We quantify the noise amplification using [82]
\[
J \;=\; \limsup_{T\to\infty}\, \frac{1}{T}\int_0^T \mathbb{E}\big[\|x(t) - x^\star\|_2^2\big]\,\mathrm{d}t.
\tag{5.6}
\]
For quadratic objective functions $f(x) := \tfrac{1}{2}x^TQx$, if we let $g$ be the indicator function of the set $\{b\}$ with $b\in\mathbb{R}^k$, then (5.5) is a linear time-invariant system and $J$ quantifies the steady-state variance of the error in the optimization variable $x(t) - x^\star$,
\[
J \;=\; \lim_{t\to\infty}\, \mathbb{E}\big[\|x(t) - x^\star\|_2^2\big].
\tag{5.7}
\]
In the next section, we examine this class of problems.

5.3 Quadratic optimization problems

To provide insight into the noise amplification of the primal-dual gradient flow dynamics, we first examine the special case in which the quadratic objective function $f(x) = \tfrac{1}{2}x^TQx$ is strongly convex with $Q = Q^T \succ 0$ and $g(z) = I_{\{b\}}(z)$, where $I_S$ is the indicator function of the set $S$, i.e., $I_S(z) := 0$ for $z\in S$ and $I_S(z) := \infty$ for $z\notin S$. For this choice of $g$, optimization problem (5.1) simplifies to
\[
\operatorname*{minimize}_{x} ~~ f(x) \qquad \text{subject to} \quad Tx \,=\, b
\tag{5.8}
\]
and the nonlinear terms in (5.5) are determined by $\nabla f(x) = Qx$, $\mathrm{prox}_{\mu g}(\xi) = b$, $\nabla M_{\mu g}(\xi) = \tfrac{1}{\mu}(\xi - b)$. Hence, (5.5) simplifies to
\[
\begin{aligned}
\mathrm{d}x \;&=\; -\big((Q + \tfrac{1}{\mu}T^TT)x \,+\, T^T\nu \,-\, \tfrac{1}{\mu}T^Tb\big)\,\mathrm{d}t \,+\, \mathrm{d}w_1 \\
\mathrm{d}\nu \;&=\; (Tx - b)\,\mathrm{d}t \,+\, \mathrm{d}w_2.
\end{aligned}
\tag{5.9}
\]
In what follows, without loss of generality, we set $b = 0$. In this case, noisy dynamics (5.5) are described by an LTI system
\[
\mathrm{d}\psi \;=\; A\psi\,\mathrm{d}t \,+\, \mathrm{d}w
\tag{5.10}
\]
where $w := [\,w_1^T ~ w_2^T\,]^T$ and
\[
\psi \;:=\; \begin{bmatrix} x - x^\star \\ \nu - \nu^\star \end{bmatrix},
\qquad
A \;=\; \begin{bmatrix} -(Q + \tfrac{1}{\mu}T^TT) & -T^T \\ T & 0 \end{bmatrix}.
\]
For $Q \succ 0$ and a full-row-rank $T$, $A$ is a Hurwitz matrix and LTI system (5.10) is stable. Moreover, by linearity, the variance amplification can be computed as
\[
J \;=\; \lim_{t\to\infty}\mathbb{E}\big[\|x(t) - x^\star\|_2^2\big] \;=\; \mathrm{trace}(XC^TC) \;=\; \mathrm{trace}(X_1)
\tag{5.11}
\]
where $X := \lim_{t\to\infty}\mathbb{E}[\psi(t)\psi^T(t)] = \big[\begin{smallmatrix} X_1 & X_2 \\ X_2^T & X_3 \end{smallmatrix}\big]$ is the steady-state covariance matrix of the state $\psi(t)$, which can be obtained by solving the algebraic Lyapunov equation
\[
AX \,+\, XA^T \;=\; -\mathrm{diag}(s_1 I, s_2 I)
\tag{5.12}
\]
and $C := [\,I ~~ 0\,]$. Theorem 1 addresses the special case with $Q = mI$ and provides an analytical expression for the variance amplification of the corresponding primal-dual gradient flow dynamics. This result is obtained by computing the steady-state covariance matrix of the state $\psi$.

Theorem 1. Let $f(x) = \tfrac{m}{2}\|x\|^2$, $g(z) = I_{\{0\}}(z)$, and let $T$ be a full-row-rank matrix in (5.1). Then, the steady-state variance of the primal optimization variable in (5.5), with $\mathrm{d}w_i(t)$ the increments of independent Wiener processes with covariance $\mathbb{E}[w_i(t)w_i^T(t)] = s_i I t$, is determined by
\[
J \;=\; \frac{(n-k)\,s_1}{2m} \;+\; \sum_{i=1}^{k} \frac{s_1 + s_2}{2\big(m + (1/\mu)\,\sigma_i^2(T)\big)}
\]
where $\sigma_i(T)$ is the $i$th singular value of the matrix $T$.

Proof: Let $T = U\Sigma V^T$ be the singular value decomposition with unitary matrices $U\in\mathbb{R}^{k\times k}$ and $V\in\mathbb{R}^{n\times n}$ and $\Sigma = [\,\Sigma_0 ~~ 0_{k\times(n-k)}\,]\in\mathbb{R}^{k\times n}$, with $\Sigma_0 := \mathrm{diag}(\sigma_1,\ldots,\sigma_k)\in\mathbb{R}^{k\times k}$.
Multiplication of the Lyapunov equation (5.12) by $M = \mathrm{diag}(V,U)$ and $M^T$ from the right and left, respectively, yields
\[
\hat{A}\hat{X} \,+\, \hat{X}\hat{A}^T \;=\; -\mathrm{diag}(s_1 I, s_2 I)
\tag{5.13}
\]
where
\[
\hat{A} \;=\; \begin{bmatrix} -mI - \tfrac{1}{\mu}\Sigma^T\Sigma & -\Sigma^T \\ \Sigma & 0 \end{bmatrix},
\qquad
\hat{X} \;=\; \begin{bmatrix} \hat{X}_1 & \hat{X}_2 \\ \hat{X}_2^T & \hat{X}_3 \end{bmatrix} \;:=\; M^T X M.
\]
It is straightforward to verify that
\[
\hat{X}_1 \;=\; \begin{bmatrix} \tfrac{s_1+s_2}{2}\big(mI + \tfrac{1}{\mu}\Sigma_0^2\big)^{-1} & 0 \\ 0 & \tfrac{s_1}{2m} I \end{bmatrix},
\qquad
\hat{X}_2 \;=\; \begin{bmatrix} -\tfrac{s_2}{2}\Sigma_0^{-1} \\ 0_{(n-k)\times k} \end{bmatrix} \in \mathbb{R}^{n\times k},
\qquad
\hat{X}_3 \;=\; \mathrm{diag}(a_1,\ldots,a_k) \in \mathbb{R}^{k\times k}
\]
where
\[
a_i \;=\; \frac{s_1+s_2}{2\,(m+\sigma_i^2/\mu)} \;+\; \frac{s_2\,(m+\sigma_i^2/\mu)}{2\,\sigma_i^2}.
\]
The result follows from $J = \mathrm{trace}(X_1) = \mathrm{trace}(\hat{X}_1)$. $\square$

The following corollary is immediate from Theorem 1.

Corollary 1. Under the conditions of Theorem 1, the steady-state variance of the primal optimization variable in (5.5) is upper bounded by $J \le (ns_1 + ks_2)/(2m)$.

Corollary 1 establishes that, for $\mu > 0$ and a full-row-rank matrix $T$, the variance of the primal optimization variable in (5.5) satisfies an upper bound that is independent of $T$ and $\mu$. In addition, using the explicit expression for $J$ provided in Theorem 1, it follows that for any fixed $\mu > 0$, in the limit $\sigma_{\max}(T)\to 0$ and/or $n/k\to\infty$, the upper bound on the variance amplification $J$ in Corollary 1 becomes exact. It is also noteworthy that, as demonstrated in the proof of Theorem 1, the dual variable $\nu$ may experience an unbounded steady-state variance for $s_2 > 0$ if $\sigma_{\min}(T)\to 0$.

Even though it is challenging to derive an analytical expression for the covariance matrix $X$ for a general strongly convex quadratic objective function $f$, we next demonstrate that the upper bound in Corollary 1 remains valid.

Theorem 2. Let $f(x) = \tfrac{1}{2}x^TQx$ with $Q \succeq mI$, $g(z) = I_{\{0\}}(z)$, and let $T$ be a full-row-rank matrix in (5.1). Then, the steady-state variance of the primal optimization variable in (5.5), with $\mathrm{d}w_i(t)$ the increments of independent Wiener processes with covariance $\mathbb{E}[w_i(t)w_i^T(t)] = s_i I t$, satisfies
\[
J \;\le\; \frac{ns_1 + ks_2}{2m}.
\tag{5.14}
\]
Proof: To quantify $J$, an alternative to using the state covariance matrix is to write $J = \mathrm{trace}(P\,\mathrm{diag}(s_1I, s_2I))$, where $P$ is the observability gramian of system (5.10),
\[
A^TP \,+\, PA \;=\; -C^TC
\tag{5.15}
\]
with $C = [\,I~~0\,]$. Thus, any matrix $P' \succeq P$ satisfies $J \le \mathrm{trace}(P'\,\mathrm{diag}(s_1I,s_2I))$. To find such a $P'$, we note that $A$ satisfies
\[
A^T I \,+\, I A \;=\; -2\,\mathrm{diag}\big(Q + \tfrac{1}{\mu}T^TT,\; 0\big) \;\preceq\; -2\lambda_{\min}(Q)\, C^TC.
\]
Dividing this inequality by $2\lambda_{\min}(Q)$ and subtracting (5.15) from it yields
\[
A^T\Big(\tfrac{1}{2\lambda_{\min}(Q)}I - P\Big) \,+\, \Big(\tfrac{1}{2\lambda_{\min}(Q)}I - P\Big)A \;\preceq\; 0.
\]
Since $A$ is Hurwitz, it follows that $P \preceq \tfrac{1}{2\lambda_{\min}(Q)}I$, and hence
\[
J \;=\; \mathrm{trace}(P\,\mathrm{diag}(s_1I,s_2I)) \;\le\; \tfrac{1}{2\lambda_{\min}(Q)}\,\mathrm{trace}(\mathrm{diag}(s_1I,s_2I)) \;\le\; \frac{ns_1 + ks_2}{2m}. \qquad \square
\]
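As a numerical sanity check of Theorem 1, the sketch below (using arbitrary illustrative dimensions and parameter values) solves the Lyapunov equation (5.12) with SciPy and compares $\mathrm{trace}(X_1)$ against the closed-form expression.

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

rng = np.random.default_rng(0)
n, k, m, mu, s1, s2 = 6, 3, 0.7, 0.4, 1.0, 2.0   # illustrative values
T = rng.standard_normal((k, n))                   # full row rank with probability one

# LTI generator of (5.10) for f(x) = (m/2)||x||^2 and g the indicator of {0}.
A = np.block([[-(m*np.eye(n) + T.T @ T / mu), -T.T],
              [T, np.zeros((k, k))]])
W = np.block([[s1*np.eye(n), np.zeros((n, k))],
              [np.zeros((k, n)), s2*np.eye(k)]])

# Steady-state covariance from the algebraic Lyapunov equation A X + X A^T = -W.
X = solve_continuous_lyapunov(A, -W)
J_numeric = np.trace(X[:n, :n])

# Closed-form expression from Theorem 1.
sigma = np.linalg.svd(T, compute_uv=False)
J_formula = (n - k)*s1/(2*m) + np.sum((s1 + s2) / (2*(m + sigma**2/mu)))
print(J_numeric, J_formula)    # the two values agree up to numerical precision
```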
5.4 Beyond quadratic problems

In this section, we extend our upper bounds on the noise amplification of the primal-dual gradient flow dynamics to problems with a general strongly convex function $f$, a convex but possibly non-differentiable function $g$, and a matrix $T$ of arbitrary rank. Our approach is based on Integral Quadratic Constraints (IQCs), which provide a convex control-theoretic framework for stability and robustness analysis of systems with structured nonlinear components [81]. This framework has recently been used to analyze convergence and robustness of first-order optimization methods [53], [56], [93], [105]. In what follows, we first demonstrate how IQCs can be combined with quadratic storage functions to characterize upper bounds on the noise amplification of continuous-time dynamical systems via solutions to a certain linear matrix inequality (LMI). We then specialize this result to the primal-dual gradient flow dynamics and establish tight upper bounds on the noise amplification by finding feasible solutions to the associated LMI.

5.4.1 An IQC-based approach

As demonstrated in Section 5.4.2, the noisy primal-dual gradient flow dynamics can be viewed as a feedback interconnection of an LTI system with a static nonlinear component
\[
\mathrm{d}\psi \;=\; A\psi\,\mathrm{d}t \,+\, Bu\,\mathrm{d}t \,+\, \mathrm{d}w,
\qquad
\begin{bmatrix} z \\ y \end{bmatrix} \;=\; \begin{bmatrix} C_z \\ C_y \end{bmatrix}\psi,
\qquad
u(t) \;=\; \Delta(y(t)).
\tag{5.16}
\]
Here, $\psi(t)$ is the state, $\mathrm{d}w(t)$ is the increment of a Wiener process with covariance $\mathbb{E}[w(t)w^T(t)] = Wt$, where $W$ is a positive semidefinite matrix, $z(t)$ is the performance output, and $u(t)$ is the output of the nonlinear term $\Delta\colon\mathbb{R}^n\to\mathbb{R}^n$ that satisfies the quadratic inequalities
\[
\begin{bmatrix} y \\ \Delta(y) \end{bmatrix}^T \Pi_i \begin{bmatrix} y \\ \Delta(y) \end{bmatrix} \;\ge\; 0
\tag{5.17}
\]
for some matrices $\Pi_i$ and all $y\in\mathbb{R}^n$.

Lemma 1 utilizes property (5.17) of the nonlinear mapping $\Delta$ and provides an upper bound on the average energy [82]
\[
J \;=\; \limsup_{T\to\infty}\,\frac{1}{T}\int_0^T \mathbb{E}\big[\|z(t)\|_2^2\big]\,\mathrm{d}t.
\]

Lemma 1. Let the nonlinear function $u = \Delta(y)$ satisfy
\[
\begin{bmatrix} y \\ u \end{bmatrix}^T \Pi_i \begin{bmatrix} y \\ u \end{bmatrix} \;\ge\; 0
\tag{5.18}
\]
for some matrices $\Pi_i$, let $P$ be a positive semidefinite matrix, and let $\lambda_i$ be nonnegative scalars such that system (5.16) satisfies
\[
\begin{bmatrix} A^TP + PA + C_z^TC_z & PB \\ B^TP & 0 \end{bmatrix}
\,+\, \sum_i \lambda_i
\begin{bmatrix} C_y & 0 \\ 0 & I \end{bmatrix}^T \Pi_i \begin{bmatrix} C_y & 0 \\ 0 & I \end{bmatrix}
\;\preceq\; 0.
\tag{5.19}
\]
Then the average energy of the performance output in statistical steady state is bounded by $J \le \mathrm{trace}(PW)$.

The proof of Lemma 1 follows from arguments similar to those in [82, Theorem 7.2] and is omitted for brevity. Lemma 1 introduces a quadratic storage function, $\psi^TP\psi$, for the continuous-time primal-dual gradient flow dynamics. We note that discrete-time variants of this result have been used to quantify the noise amplification of accelerated optimization algorithms [93, Lemmas 1, 2], [103].

5.4.2 State-space representation

We next demonstrate how the noisy primal-dual gradient flow dynamics (5.5) can be brought into the standard state-space form (5.16). In particular, choosing $\psi = [\,x^T~\nu^T\,]^T$ as the state variable along with $z := x$ and
\[
y \;=\; \begin{bmatrix} y_1 \\ y_2 \end{bmatrix} \;:=\; \begin{bmatrix} x \\ Tx + \mu\nu \end{bmatrix},
\qquad
u \;=\; \begin{bmatrix} u_1 \\ u_2 \end{bmatrix} \;:=\; \begin{bmatrix} \nabla f(x) - mx \\ \mathrm{prox}_{\mu g}(Tx + \mu\nu) \end{bmatrix}
\]
brings system (5.5) into the state-space form (5.16) with
\[
A \;=\; \begin{bmatrix} -(mI + \tfrac{1}{\mu}T^TT) & -T^T \\ T & 0 \end{bmatrix},
\quad
B \;=\; \begin{bmatrix} -I & \tfrac{1}{\mu}T^T \\ 0 & -I \end{bmatrix},
\quad
C_y \;=\; \begin{bmatrix} I & 0 \\ T & \mu I \end{bmatrix}
\tag{5.20}
\]
and $C_z = [\,I~~0\,]$, where $m$ is the strong-convexity module of $f$. The input-output pair $(u,y)$ satisfies the pointwise nonlinear equation $u = \Delta(y)$ with $\Delta = \mathrm{diag}(\Delta_1, \Delta_2)$, where
\[
u_1 \;=\; \Delta_1(y_1) \;:=\; \nabla f(y_1) - m y_1,
\qquad
u_2 \;=\; \Delta_2(y_2) \;:=\; \mathrm{prox}_{\mu g}(y_2).
\]
It is worth mentioning that for the special case $g(z) = I_{\{0\}}(z)$, which we considered in our analysis of quadratic problems in Section 5.3, the nonlinear term $u_2$ vanishes and the primal-dual gradient flow dynamics simplify to
\[
\mathrm{d}x \;=\; -\big(\nabla f(x) + \tfrac{1}{\mu}T^TTx + T^T\nu\big)\,\mathrm{d}t \,+\, \mathrm{d}w_1,
\qquad
\mathrm{d}\nu \;=\; Tx\,\mathrm{d}t \,+\, \mathrm{d}w_2.
\tag{5.21}
\]

5.4.3 Characterizing the structural properties via IQCs

The input-output pairs $(y_i, u_i)$ associated with the nonlinear mappings $\Delta_i$ satisfy
\[
\begin{bmatrix} y_i - y_i' \\ u_i - u_i' \end{bmatrix}^T \pi_i \begin{bmatrix} y_i - y_i' \\ u_i - u_i' \end{bmatrix} \;\ge\; 0
\tag{5.22}
\]
where
\[
\pi_1 \;:=\; \begin{bmatrix} 0 & (L-m)I \\ (L-m)I & -2I \end{bmatrix},
\qquad
\pi_2 \;:=\; \begin{bmatrix} 0 & I \\ I & -2I \end{bmatrix}.
\]
These inequalities follow from the facts that $\Delta_1$ is the gradient of the $(L-m)$-smooth convex function $f(\cdot) - (m/2)\|\cdot\|^2$ and that $\Delta_2 = \mathrm{prox}_{\mu g}$ is firmly non-expansive.
To make the above IQCs conform to the required format in Lemma 1, we can employ a suitable permutation combined with a change of variables that uses deviations from the optimal solution to obtain the inequalities in (5.18) with
\[
\Pi_1 \;=\;
\begin{bmatrix}
0 & 0 & (L-m)I & 0 \\
0 & 0 & 0 & 0 \\
(L-m)I & 0 & -2I & 0 \\
0 & 0 & 0 & 0
\end{bmatrix},
\qquad
\Pi_2 \;=\;
\begin{bmatrix}
0 & 0 & 0 & 0 \\
0 & 0 & 0 & I \\
0 & 0 & 0 & 0 \\
0 & I & 0 & -2I
\end{bmatrix}.
\tag{5.23}
\]

5.4.4 General convex g

The main result of the chapter is presented in Theorem 3. It demonstrates that the proximal primal-dual gradient flow dynamics enjoy the same upper bound on noise amplification as the primal-dual gradient flow dynamics for smooth problems.

Theorem 3. Let the function $f$ be $m$-strongly convex and let $g$ be closed, proper, and convex. Then, the noise amplification of the noisy primal-dual gradient flow dynamics satisfies (5.14).

Proof: It is easy to verify that $P = pI$, $\lambda_1 = 1/(L-m)$, $\lambda_2 = 1/\mu$ with $p \ge 1/(2m)$ provides a feasible solution to the LMI in Lemma 1 for the system matrices in (5.20) and the matrices $\Pi_1$, $\Pi_2$ in (5.23). Applying Lemma 1 with the smallest admissible value $p = 1/(2m)$ then yields (5.14). $\square$

For general strongly convex problems, Theorem 3 establishes the same upper bound on the noise amplification as the one obtained via Lyapunov equations for quadratic problems in Theorem 2. In addition, as discussed in Section 5.3, this upper bound is tight in the sense that the noise amplification for the quadratic problem in Theorem 1 converges to it in the limit $\sigma_{\max}(T)\to 0$ and/or $n/k\to\infty$. Another advantage of the IQC framework is that it does not require the matrix $A$ to be Hurwitz. Therefore, the upper bound established in Theorem 3 holds for any matrix $T$, independent of its rank.

5.5 Application to distributed optimization

The primal-dual gradient flow dynamics provide a distributed strategy for solving
\[
\operatorname*{minimize}_{\theta} ~~ \sum_{i=1}^{n} f_i(\theta)
\tag{5.24}
\]
where the $f_i$ are convex functions [72]. Assuming without loss of generality that $\theta\in\mathbb{R}$, given a connected network with incidence matrix $E = T^T$, we can assign a different scalar variable $x_i$ to each agent and define the equivalent problem
\[
\operatorname*{minimize}_{x} ~~ \sum_{i=1}^{n} f_i(x_i) \qquad \text{subject to} \quad Tx \,=\, 0
\tag{5.25}
\]
where the constraint enforces $x := [\,x_1~\cdots~x_n\,]^T \in \mathcal{N}(T) = \{c\mathbf{1} \mid c\in\mathbb{R}\}$ with $\mathbf{1} := [\,1~\cdots~1\,]^T$. Letting $f(x) := \sum_i f_i(x_i)$, the primal-dual gradient flow for solving problem (5.25) is determined by (5.21) and, in the absence of noise, it converges to $x = \theta^\star\mathbf{1}$, where $\theta^\star$ is an optimal solution of problem (5.24). In this formulation, the primal and dual variables $x_i$ and $\nu_i$ correspond to the nodes and the edges of the network, respectively. Theorem 3 provides an upper bound on the noise amplification of the distributed primal-dual algorithm, $J \le (ns_1 + ks_2)/(2m)$, for strongly convex problems. Here, $k$ denotes the number of edges in the network and $m$ is the strong-convexity module of the function $f$. However, if $f$ lacks strong convexity, then additive white noise with a full-rank covariance matrix can result in unbounded variance of $x(t)$ as $t\to\infty$.

To see one such example, let the $f_i$ be constants, in which case the primal-dual gradient flow simplifies to a consensus-type algorithm. In this case, the average mode $a(t) := \tfrac{1}{n}(\mathbf{1}^Tx(t))\mathbf{1}$ experiences a random walk, and its variance
\[
J_a \;:=\; \lim_{t\to\infty}\,\mathbb{E}\big[\|a(t) - \theta^\star\mathbf{1}\|^2\big]
\tag{5.26a}
\]
is unbounded.
However, the mean-square deviation from the network average
\[
\bar{J} \;:=\; \lim_{t\to\infty}\,\mathbb{E}\big[\|x(t) - a(t)\|^2\big]
\tag{5.26b}
\]
remains bounded and thus becomes the relevant quantity; it can be used in lieu of $J$ to quantify stochastic performance [43]. Using the fact that $\langle x(t) - a(t), \mathbf{1}\rangle = 0$, this idea can be generalized to the distributed optimization framework by noting that the variance amplification splits into two terms, $J = J_a + \bar{J}$.

To provide insight, let us examine the special case with $f_i(\theta) = \tfrac{1}{2}m(\theta - c_i)^2$, where the agents aim to compute the average of the $c_i$. Although the underlying dynamics are linear in this case, the results of Theorem 1 are not applicable because the matrix $T$ is full row rank only when the corresponding graph is a tree. However, by eliminating the modes of the dual variable that are not stable, an argument similar to the proof of Theorem 1 can be used to establish an expression for the noise amplification in the distributed setting in terms of the nonzero eigenvalues $\lambda_i$ of the Laplacian matrix $L = T^TT$.

Proposition 1. The noisy primal-dual gradient flow dynamics (5.9) for solving distributed optimization problem (5.25) with $f_i(x_i) = \tfrac{1}{2}m(x_i - c_i)^2$ satisfy $J = J_a + \bar{J}$, where
\[
J_a \;=\; \frac{s_1}{2m},
\qquad
\bar{J} \;=\; \sum_{i=1}^{n-1} \frac{s_1 + s_2}{2\big(m + \lambda_i(L)/\mu\big)}
\]
and the $\lambda_i$ are the nonzero eigenvalues of the Laplacian matrix $L = T^TT$ of the connected undirected network.

Proof: Without loss of generality, let $c_i = 0$; using the change of variables $y := T^T\nu$, we obtain that the noisy primal-dual flow satisfies
\[
\begin{bmatrix} \mathrm{d}x \\ \mathrm{d}y \end{bmatrix}
\;=\;
\begin{bmatrix} -mI - \tfrac{1}{\mu}L & -I \\ L & 0 \end{bmatrix}
\begin{bmatrix} x \\ y \end{bmatrix}\mathrm{d}t
\,+\,
\begin{bmatrix} \mathrm{d}w_1 \\ T^T\mathrm{d}w_2 \end{bmatrix}.
\]
Noting that $L\mathbf{1} = 0$, we can write $L = V\Lambda V^T$, where $\Lambda = \mathrm{diag}(0, \hat{\Lambda})$ is the diagonal matrix of eigenvalues and the columns of the unitary matrix $V = [\,\mathbf{1}/\sqrt{n}~~U\,]$ are the corresponding eigenvectors. Using the change of variables $\hat{x} := U^Tx$, $\hat{y} := U^Ty$, $\hat{\psi}^T = [\,\hat{x}^T~\hat{y}^T\,]$, it is easy to verify that
\[
\mathrm{d}\hat{\psi} \;=\;
\begin{bmatrix} -mI - \tfrac{1}{\mu}\hat{\Lambda} & -I \\ \hat{\Lambda} & 0 \end{bmatrix}\hat{\psi}\,\mathrm{d}t
\,+\,
\begin{bmatrix} \mathrm{d}\hat{w}_1 \\ \mathrm{d}\hat{w}_2 \end{bmatrix}
\]
where $\mathrm{d}\hat{w}_1$ and $\mathrm{d}\hat{w}_2$ are the increments of independent Wiener processes with covariances $s_1 I t$ and $s_2\hat{\Lambda}t$, respectively. In addition, the average modes associated with the primal and dual variables, $a = (x^T\mathbf{1})\mathbf{1}/n$ and $b = (y^T\mathbf{1})\mathbf{1}/n$, satisfy
\[
\mathrm{d}a \;=\; -ma\,\mathrm{d}t \,+\, \mathrm{d}w_a, \qquad b \;=\; 0
\]
and the variance amplification is determined by
\[
J \;=\; J_a + \bar{J} \;=\; \lim_{t\to\infty}\,\mathbb{E}[\|\hat{x}\|^2] + \mathbb{E}[a^2] \;=\; \mathrm{trace}(X_1) + \frac{s_1}{2m}
\]
where $X = \big[\begin{smallmatrix} X_1 & X_2 \\ X_2^T & X_3 \end{smallmatrix}\big]$ is the corresponding steady-state covariance matrix,
\[
\begin{bmatrix} -mI - \tfrac{1}{\mu}\hat{\Lambda} & -I \\ \hat{\Lambda} & 0 \end{bmatrix}X
\,+\,
X\begin{bmatrix} -mI - \tfrac{1}{\mu}\hat{\Lambda} & \hat{\Lambda} \\ -I & 0 \end{bmatrix}
\;=\;
\begin{bmatrix} -s_1 I & 0 \\ 0 & -s_2\hat{\Lambda} \end{bmatrix}.
\]
The result follows from noting that $X_1$, $X_2$, and $X_3$ are all diagonal and
\[
X_1 \;=\; \frac{s_1+s_2}{2}\big(mI + \tfrac{1}{\mu}\hat{\Lambda}\big)^{-1},
\qquad
X_2 \;=\; -\frac{s_2}{2}I. \qquad \square
\]
For quadratic optimization problems, Proposition 1 demonstrates that, in addition to the strong-convexity module of the function $f$, the topology of the network also impacts the variance amplification. In the limit as $m$ goes to 0, while the variance of the average mode $J_a$ becomes unbounded, the mean-square deviation from the average mode remains bounded and is captured by the sum of reciprocals of the eigenvalues of the graph Laplacian. This dependence of variance amplification on the spectral properties of $L$ is identical to that observed in standard consensus algorithms [43], [93].
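A small numerical check of Proposition 1, using an arbitrary illustrative graph and parameter values, is sketched below: it forms the reduced system from the proof, solves the corresponding Lyapunov equation, and compares the result with the closed-form expression.

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

m, mu, s1, s2 = 0.5, 0.3, 1.0, 1.5           # illustrative parameter values
n = 5                                        # 5-node cycle graph, chosen for illustration
T = np.zeros((n, n))                         # incidence matrix (edges x nodes)
for e in range(n):
    T[e, e], T[e, (e + 1) % n] = 1.0, -1.0
Lap = T.T @ T
lam = np.sort(np.linalg.eigvalsh(Lap))[1:]   # nonzero Laplacian eigenvalues

# Closed-form expressions from Proposition 1.
J_a = s1 / (2*m)
J_bar = np.sum((s1 + s2) / (2*(m + lam/mu)))

# Verification via the reduced Lyapunov equation used in the proof.
Lam = np.diag(lam)
A_hat = np.block([[-m*np.eye(n-1) - Lam/mu, -np.eye(n-1)],
                  [Lam, np.zeros((n-1, n-1))]])
W_hat = np.block([[s1*np.eye(n-1), np.zeros((n-1, n-1))],
                  [np.zeros((n-1, n-1)), s2*Lam]])
X = solve_continuous_lyapunov(A_hat, -W_hat)
print(J_a + J_bar, J_a + np.trace(X[:n-1, :n-1]))   # the two totals agree
```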
5.6 Concluding remarks

We have examined the noise amplification of proximal primal-dual gradient flow dynamics that can be used to solve non-smooth composite optimization problems. For quadratic problems, we have employed algebraic Lyapunov equations to establish analytical expressions for the noise amplification. We have also utilized the theory of IQCs to characterize tight upper bounds in terms of a solution to an LMI. Our results show that the stochastic performance of the primal-dual dynamics is inversely proportional to the strong-convexity module of the smooth part of the objective function. Ongoing work focuses on examining the impact of network topology on the noise amplification in distributed settings and on extending our results to discrete-time versions of primal-dual algorithms.

Part II
Convergence and sample complexity of gradient methods for data-driven control

Chapter 6
Random search for continuous-time LQR

Model-free reinforcement learning attempts to find optimal control actions for an unknown dynamical system by directly searching over the parameter space of controllers. However, the statistical properties and convergence behavior of these approaches are often poorly understood because of the nonconvex nature of the underlying optimization problems and the lack of exact gradient computation. In this chapter, we take a step towards demystifying the performance and efficiency of such methods by focusing on the standard infinite-horizon linear quadratic regulator problem for continuous-time systems with unknown state-space parameters. We establish exponential stability for the ordinary differential equation (ODE) that governs the gradient-flow dynamics over the set of stabilizing feedback gains and show that a similar result holds for the gradient descent method that arises from the forward Euler discretization of the corresponding ODE. We also provide theoretical bounds on the convergence rate and sample complexity of the random search method with two-point gradient estimates. We prove that the required simulation time for achieving $\epsilon$-accuracy in the model-free setup and the total number of function evaluations both scale as $\log(1/\epsilon)$.

6.1 Introduction

In many emerging applications, control-oriented models are not readily available and classical approaches from optimal control may not be directly applicable. This challenge has led to the emergence of Reinforcement Learning (RL) approaches that often perform well in practice. Examples include learning complex locomotion tasks via neural network dynamics [18] and playing Atari games based on images using deep RL [19].

RL approaches can be broadly divided into model-based [130], [131] and model-free [20], [21]. While model-based RL uses data to obtain approximations of the underlying dynamics, its model-free counterpart prescribes control actions based on estimated values of a cost function without attempting to form a model. In spite of the empirical success of RL in a variety of domains, our mathematical understanding of it is still in its infancy and there are many open questions surrounding convergence and sample complexity. In this chapter, we take a step towards answering such questions with a focus on the infinite-horizon Linear Quadratic Regulator (LQR) problem for continuous-time systems.

The LQR problem is a cornerstone of control theory. The globally optimal solution can be obtained by solving the Riccati equation, and efficient numerical schemes with provable convergence guarantees have been developed [83].
However, computing the optimal solution becomes challenging for large-scale problems, when prior knowledge is not available, or in the presence of structural constraints on the controller. This motivates the use of direct search methods for controller synthesis. Unfortunately, the nonconvex nature of this formulation complicates the analysis of first- and second-order optimization algorithms. To make matters worse, structural constraints on the feedback gain matrix may result in a disjoint search landscape, limiting the utility of conventional descent-based methods [84]. Furthermore, in the model-free setting, the exact model (and hence the gradient of the objective function) is unknown, so only zeroth-order methods can be used.

In this chapter, we study convergence properties of gradient-based methods for the continuous-time LQR problem. In spite of the lack of convexity, we establish (a) exponential stability of the ODE that governs the gradient-flow dynamics over the set of stabilizing feedback gains; and (b) linear convergence of the gradient descent algorithm with a suitable stepsize. We employ a standard convex reparameterization of the LQR problem [85], [86] to establish the convergence properties of gradient-based methods for the nonconvex formulation. In the model-free setting, we also examine convergence and sample complexity of the random search method [22] that attempts to emulate the behavior of gradient descent via gradient approximations resulting from objective function values. For the two-point gradient estimation setting, we prove linear convergence of the random search method and show that the total number of function evaluations and the simulation time required in our results to achieve $\epsilon$-accuracy are proportional to $\log(1/\epsilon)$.

For the discrete-time LQR, global convergence guarantees were recently provided in [13] for gradient descent and the random search method with one-point gradient estimates. The authors established a bound on the sample complexity for reaching error tolerance $\epsilon$ that requires a number of function evaluations at least proportional to $(1/\epsilon^4)\log(1/\epsilon)$. If one has access to infinite-horizon cost values, the number of function evaluations for the random search method with one-point gradient estimates can be improved to $1/\epsilon^2$ [132]. In contrast, we focus on the continuous-time LQR and examine the two-point gradient estimation setting. The use of two-point gradient estimates reduces the required number of function evaluations to $1/\epsilon$ [132]. We significantly improve this result by showing that the required number of function evaluations is proportional to $\log(1/\epsilon)$. Similarly, the simulation time required in our results is proportional to $\log(1/\epsilon)$; this is in contrast to [13], which requires $\mathrm{poly}(1/\epsilon)$ simulation time, and [132], which assumes infinite simulation time. Furthermore, our convergence results hold both in terms of the error in the objective value and in the optimization variable (i.e., the feedback gain matrix), whereas [13] and [132] only prove convergence in the objective value.
We note that the literature on model-free RL is rapidly expanding, and recent extensions have been made to Markovian jump linear systems [133], $H_\infty$ robustness analysis through implicit regularization [134], learning distributed LQ problems [135], and output-feedback LQR [136].

Our presentation is structured as follows. In Section 6.2, we revisit the LQR problem and present gradient-flow dynamics, gradient descent, and the random search algorithm. In Section 6.3, we highlight the main results of the chapter. In Section 6.4, we utilize a convex reparameterization of the LQR problem and establish exponential stability of the resulting gradient-flow dynamics and gradient descent method. In Section 6.5, we extend our analysis to the nonconvex landscape of feedback gains. In Section 6.6, we quantify the accuracy of two-point gradient estimates and, in Section 6.7, we discuss convergence and sample complexity of the random search method. In Section 6.8, we provide an example to illustrate our theoretical developments and, in Section 6.9, we offer concluding remarks. Most technical details are relegated to the appendices.

Notation

We use $\mathrm{vec}(M)\in\mathbb{R}^{mn}$ to denote the vectorized form of the matrix $M\in\mathbb{R}^{m\times n}$ obtained by stacking its columns on top of each other. We use $\|M\|_F^2 = \langle M, M\rangle$ to denote the Frobenius norm, where $\langle X, Y\rangle := \mathrm{trace}(X^TY)$ is the standard matricial inner product. We denote the largest singular value of linear operators and matrices by $\|\cdot\|_2$ and the spectral induced norm of linear operators by $\|\cdot\|_S$,
\[
\|\mathcal{M}\|_2 \;:=\; \sup_{M}\,\frac{\|\mathcal{M}(M)\|_F}{\|M\|_F},
\qquad
\|\mathcal{M}\|_S \;:=\; \sup_{M}\,\frac{\|\mathcal{M}(M)\|_2}{\|M\|_2}.
\]
We denote by $\mathbb{S}^n\subset\mathbb{R}^{n\times n}$ the set of symmetric matrices. For $M\in\mathbb{S}^n$, $M\succ 0$ means $M$ is positive definite and $\lambda_{\min}(M)$ is its smallest eigenvalue. We use $S^{d-1}\subset\mathbb{R}^d$ to denote the unit sphere of dimension $d-1$. We denote the expected value by $\mathbb{E}[\cdot]$ and probability by $\mathbb{P}(\cdot)$. To compare the asymptotic behavior of $f(\epsilon)$ and $g(\epsilon)$ as $\epsilon$ goes to 0, we use $f = O(g)$ (or, equivalently, $g = \Omega(f)$) to denote $\limsup_{\epsilon\to 0} f(\epsilon)/g(\epsilon) < \infty$; $f = \tilde{O}(g)$ to denote $f = O(g\log^k g)$ for some integer $k$; and $f = o(\epsilon)$ to signify $\lim_{\epsilon\to 0} f(\epsilon)/\epsilon = 0$.

6.2 Problem formulation

The infinite-horizon LQR problem for continuous-time LTI systems is given by
\[
\operatorname*{minimize}_{x,\,u} ~~ \mathbb{E}\int_0^\infty \big(x^T(t)Qx(t) \,+\, u^T(t)Ru(t)\big)\,\mathrm{d}t
\tag{6.1a}
\]
\[
\text{subject to} \quad \dot{x} \;=\; Ax \,+\, Bu, \qquad x(0) \sim \mathcal{D}
\tag{6.1b}
\]
where $x(t)\in\mathbb{R}^n$ is the state, $u(t)\in\mathbb{R}^m$ is the control input, $A$ and $B$ are constant matrices of appropriate dimensions, $Q$ and $R$ are positive definite matrices, and the expectation is taken over a random initial condition $x(0)$ with distribution $\mathcal{D}$. For a controllable pair $(A,B)$, the solution to (6.1) is given by
\[
u(t) \;=\; -K^\star x(t) \;=\; -R^{-1}B^TP^\star x(t)
\tag{6.2a}
\]
where $P^\star$ is the unique positive definite solution to the Algebraic Riccati Equation (ARE)
\[
A^TP^\star \,+\, P^\star A \,+\, Q \,-\, P^\star BR^{-1}B^TP^\star \;=\; 0.
\tag{6.2b}
\]
When the model is known, the LQR problem and the corresponding ARE can be solved efficiently via a variety of techniques [137]-[140]. However, these methods are not directly applicable in the model-free setting, i.e., when the matrices $A$ and $B$ are unknown. Exploiting the linearity of the optimal controller, we can alternatively formulate the LQR problem as a direct search for the optimal linear feedback gain, namely
\[
\operatorname*{minimize}_{K} ~~ f(K)
\tag{6.3a}
\]
where
\[
f(K) \;:=\;
\begin{cases}
\mathrm{trace}\big((Q + K^TRK)X(K)\big), & K\in\mathcal{S}_K \\
\infty, & \text{otherwise.}
\end{cases}
\tag{6.3b}
\]
Here, the function $f(K)$ determines the LQR cost in (6.1a) associated with the linear state-feedback law $u = -Kx$,
\[
\mathcal{S}_K \;:=\; \{K\in\mathbb{R}^{m\times n} \mid A - BK \text{ is Hurwitz}\}
\tag{6.3c}
\]
is the set of stabilizing feedback gains and, for any $K\in\mathcal{S}_K$,
\[
X(K) \;:=\; \int_0^\infty \mathbb{E}\big[x(t)x^T(t)\big]\,\mathrm{d}t \;=\; \int_0^\infty \mathrm{e}^{(A-BK)t}\,\Omega\,\mathrm{e}^{(A-BK)^Tt}\,\mathrm{d}t
\tag{6.4a}
\]
is the unique solution to the Lyapunov equation
\[
(A - BK)X \,+\, X(A - BK)^T \,+\, \Omega \;=\; 0
\tag{6.4b}
\]
where $\Omega := \mathbb{E}[x(0)x^T(0)]$. To ensure $f(K) = \infty$ for $K\notin\mathcal{S}_K$, we assume $\Omega \succ 0$. This assumption also guarantees that $K\in\mathcal{S}_K$ if and only if the solution $X$ to (6.4b) is positive definite.

In problem (6.3), the matrix $K$ is the optimization variable, and $(A, B, Q\succ 0, R\succ 0, \Omega\succ 0)$ are the problem parameters. This alternative formulation of the LQR problem has been studied for both continuous-time [83] and discrete-time systems [13], [141], and it serves as a building block for several important control problems, including optimal static output-feedback design [142], optimal design of sparse feedback gain matrices [71], [143]-[147], and optimal sensor/actuator selection [121], [148]-[150].

For all stabilizing feedback gains $K\in\mathcal{S}_K$, the gradient of the objective function is determined by [142], [143]
\[
\nabla f(K) \;=\; 2\big(RK - B^TP(K)\big)X(K).
\tag{6.5}
\]
Here, $X(K)$ is given by (6.4a) and
\[
P(K) \;=\; \int_0^\infty \mathrm{e}^{(A-BK)^Tt}\,(Q + K^TRK)\,\mathrm{e}^{(A-BK)t}\,\mathrm{d}t
\tag{6.6a}
\]
is the unique positive definite solution of
\[
(A - BK)^TP \,+\, P(A - BK) \;=\; -Q \,-\, K^TRK.
\tag{6.6b}
\]
To simplify our presentation, for any $K\in\mathbb{R}^{m\times n}$, we define the closed-loop Lyapunov operator $\mathcal{A}_K\colon \mathbb{S}^n\to\mathbb{S}^n$ as
\[
\mathcal{A}_K(X) \;:=\; (A - BK)X \,+\, X(A - BK)^T.
\tag{6.7a}
\]
For $K\in\mathcal{S}_K$, both $\mathcal{A}_K$ and its adjoint
\[
\mathcal{A}_K^*(P) \;=\; (A - BK)^TP \,+\, P(A - BK)
\tag{6.7b}
\]
are invertible, and $X(K)$, $P(K)$ are determined by
\[
X(K) \;=\; -\mathcal{A}_K^{-1}(\Omega),
\qquad
P(K) \;=\; -(\mathcal{A}_K^*)^{-1}(Q + K^TRK).
\]
In this chapter, we first examine the global stability properties of the gradient-flow dynamics
\[
\dot{K} \;=\; -\nabla f(K), \qquad K(0)\in\mathcal{S}_K
\tag{GF}
\]
associated with problem (6.3) and its discretized variant
\[
K_{k+1} \;:=\; K_k \,-\, \alpha\nabla f(K_k), \qquad K_0\in\mathcal{S}_K
\tag{GD}
\]
where $\alpha > 0$ is the stepsize. Next, we build on this analysis to study the convergence of a search method based on random sampling [22], [151] for solving problem (6.3). As described in Algorithm 1, at each iteration we form an empirical approximation $\overline{\nabla f}(K)$ to the gradient of the objective function via simulation of system (6.1b) for randomly perturbed feedback gains $K\pm U_i$, $i = 1,\ldots,N$, and update $K$ via
\[
K_{k+1} \;:=\; K_k \,-\, \alpha\,\overline{\nabla f}(K_k), \qquad K_0\in\mathcal{S}_K.
\tag{RS}
\]
We note that the gradient estimation scheme in Algorithm 1 does not require knowledge of the system matrices $A$ and $B$ in (6.1b), but only access to a simulation engine.

6.3 Main results

Optimization problem (6.3) is not convex [84]; see Appendix D.1 for an example. The function $f(K)$, however, has two important properties: uniqueness of the critical points and compactness of sublevel sets [152], [153]. Based on these, the LQR objective error $f(K) - f(K^\star)$ can be used as a maximal Lyapunov function (see [154] for a definition and [155], [156] for examples) to prove asymptotic stability of gradient-flow dynamics (GF) over the set of stabilizing feedback gains $\mathcal{S}_K$. However, this approach does not provide any guarantee on the rate of convergence, and additional analysis is necessary to establish exponential stability; see Section 6.5 for details.
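When the model is known, the gradient expression (6.5) can be evaluated by solving the two Lyapunov equations (6.4b) and (6.6b). A minimal sketch of gradient descent (GD) built on this observation is given below; the plant matrices, weights, initial gain, and stepsize are illustrative placeholders rather than values used in the analysis.

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov, solve_continuous_are

def lqr_cost_and_grad(K, A, B, Q, R, Omega):
    """f(K) from (6.3b) and grad f(K) from (6.5) via the Lyapunov equations (6.4b), (6.6b)."""
    Acl = A - B @ K
    if np.max(np.linalg.eigvals(Acl).real) >= 0:
        return np.inf, None                       # K is not stabilizing
    X = solve_continuous_lyapunov(Acl, -Omega)    # (A-BK) X + X (A-BK)^T = -Omega
    P = solve_continuous_lyapunov(Acl.T, -(Q + K.T @ R @ K))
    return np.trace((Q + K.T @ R @ K) @ X), 2*(R @ K - B.T @ P) @ X

# Illustrative problem data, assumed for this example.
A = np.array([[0.0, 1.0], [1.0, 0.0]])
B = np.array([[0.0], [1.0]])
Q, R, Omega = np.eye(2), np.eye(1), np.eye(2)
K = np.array([[3.0, 3.0]])                        # an initial stabilizing gain
alpha = 1e-2                                      # a small constant stepsize

for _ in range(500):                              # gradient descent (GD)
    f_val, grad = lqr_cost_and_grad(K, A, B, Q, R, Omega)
    K = K - alpha * grad

P_star = solve_continuous_are(A, B, Q, R)
print(np.linalg.norm(K - np.linalg.solve(R, B.T @ P_star)))   # distance to K* = R^{-1} B^T P*
```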
Algorithm 1  Two-point gradient estimation

Require: Feedback gain $K\in\mathbb{R}^{m\times n}$, state and control weight matrices $Q$ and $R$, distribution $\mathcal{D}$, smoothing constant $r$, simulation time $\tau$, number of random samples $N$.
for $i = 1,\ldots,N$ do
  Define the perturbed feedback gains $K_{i,1} := K + rU_i$ and $K_{i,2} := K - rU_i$, where $\mathrm{vec}(U_i)$ is a random vector uniformly distributed on the sphere $\sqrt{mn}\,S^{mn-1}$.
  Sample an initial condition $x_i$ from distribution $\mathcal{D}$.
  For $j\in\{1,2\}$, simulate system (6.1b) up to time $\tau$ with the feedback gain $K_{i,j}$ and initial condition $x_i$ to form
  \[
  \hat{f}_{i,j} \;=\; \int_0^\tau \big(x^T(t)Qx(t) + u^T(t)Ru(t)\big)\,\mathrm{d}t.
  \]
end for
Ensure: The gradient estimate
\[
\overline{\nabla f}(K) \;=\; \frac{1}{2rN}\sum_{i=1}^{N}\big(\hat{f}_{i,1} - \hat{f}_{i,2}\big)U_i.
\]

6.3.1 Known model

We first summarize our results for the case when the model is known. In spite of the nonconvex optimization landscape, we establish exponential stability of gradient-flow dynamics (GF) for any stabilizing initial feedback gain $K(0)$. This result also provides an explicit bound on the rate of convergence to the LQR solution $K^\star$.

Theorem 1. For any initial stabilizing feedback gain $K(0)\in\mathcal{S}_K$, the solution $K(t)$ to gradient-flow dynamics (GF) satisfies
\[
\begin{aligned}
f(K(t)) - f(K^\star) \;&\le\; \mathrm{e}^{-\rho t}\big(f(K(0)) - f(K^\star)\big) \\
\|K(t) - K^\star\|_F^2 \;&\le\; b\,\mathrm{e}^{-\rho t}\,\|K(0) - K^\star\|_F^2
\end{aligned}
\]
where the convergence rate $\rho$ and constant $b$ depend on $K(0)$ and the parameters of the LQR problem (6.3).

The proof of Theorem 1, along with explicit expressions for the convergence rate $\rho$ and constant $b$, is provided in Section 6.5.1. Moreover, for a sufficiently small stepsize $\alpha$, we show that the gradient descent method (GD) also converges over $\mathcal{S}_K$ at a linear rate.

Theorem 2. For any initial stabilizing feedback gain $K_0\in\mathcal{S}_K$, the iterates of gradient descent (GD) satisfy
\[
\begin{aligned}
f(K_k) - f(K^\star) \;&\le\; \gamma^k\big(f(K_0) - f(K^\star)\big) \\
\|K_k - K^\star\|_F^2 \;&\le\; b\,\gamma^k\,\|K_0 - K^\star\|_F^2
\end{aligned}
\]
where the rate of convergence $\gamma$, stepsize $\alpha$, and constant $b$ depend on $K_0$ and the parameters of the LQR problem (6.3).

6.3.2 Unknown model

We now turn our attention to the model-free setting. We use Theorem 2 to carry out the convergence analysis of the random search method (RS) under the following assumption on the distribution of the initial condition.

Assumption 1. Let the distribution $\mathcal{D}$ of the initial conditions have i.i.d. zero-mean unit-variance entries with bounded sub-Gaussian norm, i.e., for a random vector $v\in\mathbb{R}^n$ distributed according to $\mathcal{D}$, $\mathbb{E}[v_i] = 0$ and $\|v_i\|_{\psi_2} \le \kappa$ for some constant $\kappa$ and $i = 1,\ldots,n$; see Appendix D.10 for the definition of $\|\cdot\|_{\psi_2}$.

Our main convergence result holds under Assumption 1. Specifically, for a desired accuracy level $\epsilon > 0$, in Theorem 3 we establish that the iterates of (RS) with a constant stepsize (that does not depend on $\epsilon$) reach accuracy level $\epsilon$ at a linear rate (i.e., in at most $O(\log(1/\epsilon))$ iterations) with high probability. Furthermore, the total number of function evaluations and the simulation time required to achieve an accuracy level $\epsilon$ are proportional to $\log(1/\epsilon)$. This significantly improves the existing results for discrete-time LQR [13], [132], which require $O(1/\epsilon)$ function evaluations and $\mathrm{poly}(1/\epsilon)$ simulation time.

Theorem 3 (Informal). Let the initial condition $x_0\sim\mathcal{D}$ of the LTI system in (6.1b) obey Assumption 1. Also let the simulation time $\tau$ and the number of samples $N$ used by Algorithm 1 satisfy
\[
\tau \;\ge\; \theta_1\log(1/\epsilon)
\quad\text{and}\quad
N \;\ge\; c\,\big(1 + \beta^4\kappa^4\theta_1\big)\log^6(n)\, n
\]
for some $\beta > 0$ and desired accuracy $\epsilon > 0$. Then, we can choose a smoothing parameter $r < \theta_3\sqrt{\epsilon}$ in Algorithm 1 and a constant stepsize $\alpha$ such that the random search method (RS), started from any initial stabilizing feedback gain $K_0\in\mathcal{S}_K$, achieves $f(K_k) - f(K^\star)\le\epsilon$ in at most
\[
k \;\le\; \theta_4\log\big((f(K_0) - f(K^\star))/\epsilon\big)
\]
iterations with probability not smaller than $1 - c'k(n^{-\beta} + N^{-\beta} + N\mathrm{e}^{-n/8} + \mathrm{e}^{-c'N})$. Here, the positive scalars $c$ and $c'$ are absolute constants and $\theta_1,\ldots,\theta_4 > 0$ depend on $K_0$ and the parameters of the LQR problem (6.3).

The formal version of Theorem 3, along with a discussion of the parameters $\theta_i$ and the stepsize $\alpha$, is presented in Section 6.7.
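A minimal simulation-based sketch of the two-point estimate of Algorithm 1 is shown below. The plant matrices, weights, and algorithm parameters are placeholders chosen for illustration; the finite-horizon cost is obtained by integrating the state together with the running cost, and the resulting estimate would feed the update (RS), $K \leftarrow K - \alpha\,\overline{\nabla f}(K)$.

```python
import numpy as np
from scipy.integrate import solve_ivp

def truncated_cost(K, A, B, Q, R, x0, tau):
    # Integrate the closed-loop dynamics together with the running LQR cost.
    def rhs(t, z):
        x = z[:-1]
        u = -K @ x
        return np.concatenate([A @ x + B @ u, [x @ Q @ x + u @ R @ u]])
    z0 = np.concatenate([x0, [0.0]])
    sol = solve_ivp(rhs, (0.0, tau), z0, rtol=1e-8, atol=1e-10)
    return sol.y[-1, -1]

def two_point_gradient_estimate(K, A, B, Q, R, r, tau, N, rng):
    m, n = K.shape
    grad = np.zeros_like(K)
    for _ in range(N):
        U = rng.standard_normal((m, n))
        U *= np.sqrt(m*n) / np.linalg.norm(U)          # vec(U) uniform on sqrt(mn) S^{mn-1}
        x0 = rng.standard_normal(n)                    # initial condition sampled from D
        f_plus = truncated_cost(K + r*U, A, B, Q, R, x0, tau)
        f_minus = truncated_cost(K - r*U, A, B, Q, R, x0, tau)
        grad += (f_plus - f_minus) / (2.0*r*N) * U
    return grad

# Example usage with placeholder data.
rng = np.random.default_rng(1)
A = np.array([[0.0, 1.0], [1.0, 0.0]]); B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.eye(1)
K = np.array([[3.0, 3.0]])
grad_hat = two_point_gradient_estimate(K, A, B, Q, R, r=0.1, tau=20.0, N=50, rng=rng)
```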
6.4 Convex reparameterization

The main challenge in establishing the exponential stability of (GF) arises from the nonconvexity of problem (6.3). Herein, we use a standard change of variables to reparameterize (6.3) into a convex problem, for which we can provide exponential stability guarantees for gradient-flow dynamics. We then connect the gradient flow on this convex reparameterization to its nonconvex counterpart and establish the exponential stability of (GF).

6.4.1 Change of variables

The stability of the closed-loop system with the feedback gain $K\in\mathcal{S}_K$ in problem (6.3) is equivalent to the positive definiteness of the matrix $X(K)$ given by (6.4a). This condition allows for a standard change of variables $K = YX^{-1}$, for some $Y\in\mathbb{R}^{m\times n}$, to reformulate the LQR design as a convex optimization problem [85], [86]. In particular, for any $K\in\mathcal{S}_K$ and the corresponding matrix $X$, we have
\[
f(K) \;=\; h(X,Y) \;:=\; \mathrm{trace}\big(QX + Y^TRYX^{-1}\big)
\]
where $h(X,Y)$ is a jointly convex function of $(X,Y)$ for $X\succ 0$. In the new variables, Lyapunov equation (6.4b) takes the affine form
\[
\mathcal{A}(X) \,-\, \mathcal{B}(Y) \,+\, \Omega \;=\; 0
\tag{6.8a}
\]
where $\mathcal{A}$ and $\mathcal{B}$ are the linear maps
\[
\mathcal{A}(X) \;:=\; AX \,+\, XA^T,
\qquad
\mathcal{B}(Y) \;:=\; BY \,+\, Y^TB^T.
\tag{6.8b}
\]
For an invertible map $\mathcal{A}$, we can express the matrix $X$ as an affine function of $Y$,
\[
X(Y) \;=\; \mathcal{A}^{-1}\big(\mathcal{B}(Y) - \Omega\big)
\tag{6.8c}
\]
and bring the LQR problem into the convex form
\[
\operatorname*{minimize}_{Y} ~~ h(Y)
\tag{6.9}
\]
where
\[
h(Y) \;:=\;
\begin{cases}
h(X(Y),Y), & Y\in\mathcal{S}_Y \\
\infty, & \text{otherwise}
\end{cases}
\]
and $\mathcal{S}_Y := \{Y\in\mathbb{R}^{m\times n}\mid X(Y)\succ 0\}$ is the set of matrices $Y$ that correspond to stabilizing feedback gains $K = YX^{-1}$. The set $\mathcal{S}_Y$ is open and convex because it is defined via a positive definite condition imposed on the affine map $X(Y)$ in (6.8c). This positive definite condition in $\mathcal{S}_Y$ is equivalent to the closed-loop matrix $A - BY(X(Y))^{-1}$ being Hurwitz.

Remark 1. Although our presentation assumes invertibility of $\mathcal{A}$, this assumption comes without loss of generality. As shown in Appendix D.2, all results carry over to a noninvertible $\mathcal{A}$ with the alternative change of variables $A = \hat{A} + BK_0$, $K = \hat{K} + K_0$, and $\hat{K} = \hat{Y}X^{-1}$, for some $K_0\in\mathcal{S}_K$.
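The change of variables above is easy to exercise numerically. The sketch below, reusing hypothetical problem data, maps a stabilizing $K$ to $Y = KX(K)$, recovers $X$ from the affine relation (6.8c), and checks that $K = YX^{-1}$ is recovered; the particular matrices are assumptions made for the example (chosen so that the map $\mathcal{A}$ is invertible, as required by (6.8c)).

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

# Hypothetical data; A itself need not be Hurwitz, but A(X) = AX + XA^T is invertible here.
A = np.array([[0.0, 1.0], [1.0, 1.0]])
B = np.array([[0.0], [1.0]])
Omega = np.eye(2)
K = np.array([[3.0, 3.0]])                         # a stabilizing gain

# X(K) from the Lyapunov equation (6.4b) and the new variable Y = K X.
Acl = A - B @ K
X = solve_continuous_lyapunov(Acl, -Omega)
Y = K @ X

# X(Y) from the affine relation (6.8c): solve A X + X A^T = B(Y) - Omega.
BY = B @ Y + Y.T @ B.T
X_from_Y = solve_continuous_lyapunov(A, BY - Omega)

print(np.allclose(X, X_from_Y))                    # True
print(np.allclose(K, Y @ np.linalg.inv(X)))        # True: K = Y X^{-1}
```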
6.4.2 Smoothness and strong convexity of h(Y)

Our convergence analysis of gradient methods for problem (6.9) relies on the $L$-smoothness and $\mu$-strong convexity of the function $h(Y)$ over its sublevel sets $\mathcal{S}_Y(a) := \{Y\in\mathcal{S}_Y\mid h(Y)\le a\}$. These two properties were recently established in [121], where it was shown that over any sublevel set $\mathcal{S}_Y(a)$, the second-order term $\langle\tilde{Y},\nabla^2 h(Y;\tilde{Y})\rangle$ in the Taylor series expansion of $h(Y+\tilde{Y})$ around $Y\in\mathcal{S}_Y(a)$ can be upper and lower bounded by the quadratic forms $L\|\tilde{Y}\|_F^2$ and $\mu\|\tilde{Y}\|_F^2$ for some positive scalars $L$ and $\mu$. While an explicit form for the smoothness parameter $L$ along with an existence proof for the strong convexity modulus $\mu$ were presented in [121], in Proposition 1 we establish an explicit expression for $\mu$ in terms of $a$ and the parameters of the LQR problem. This allows us to provide bounds on the convergence rate of gradient methods.

Proposition 1. Over any non-empty sublevel set $\mathcal{S}_Y(a)$, the function $h(Y)$ is $L$-smooth and $\mu$-strongly convex with
\[
L \;=\; \frac{2a\|R\|_2}{\nu}\left(1 + \frac{a\|A^{-1}B\|_2}{\sqrt{\nu\lambda_{\min}(R)}}\right)^{2}
\tag{6.10a}
\]
\[
\mu \;=\; \frac{2\lambda_{\min}(R)\lambda_{\min}(Q)}{a\,(1 + a^2\eta)^2}
\tag{6.10b}
\]
where the constants
\[
\eta \;:=\; \frac{\|B\|_2}{\lambda_{\min}(Q)\lambda_{\min}(\Omega)\sqrt{\nu\lambda_{\min}(R)}}
\tag{6.10c}
\]
\[
\nu \;:=\; \frac{\lambda_{\min}^2(\Omega)}{4}\left(\frac{\|A\|_2}{\sqrt{\lambda_{\min}(Q)}} + \frac{\|B\|_2}{\sqrt{\lambda_{\min}(R)}}\right)^{-2}
\tag{6.10d}
\]
only depend on the problem parameters.

Proof: See Appendix D.3. $\square$

6.4.3 Gradient methods over $\mathcal{S}_Y$

The LQR problem can be solved by minimizing the convex function $h(Y)$, whose gradient is given by [121, Appendix C]
\[
\nabla h(Y) \;=\; 2RY(X(Y))^{-1} \,-\, 2B^TW(Y)
\tag{6.11a}
\]
where $W(Y)$ is the solution to
\[
A^TW \,+\, WA \;=\; (X(Y))^{-1}Y^TRY(X(Y))^{-1} \,-\, Q.
\tag{6.11b}
\]
Using the strong convexity and smoothness properties of $h(Y)$ established in Proposition 1, we next show that the unique minimizer $Y^\star$ of the function $h(Y)$ is the exponentially stable equilibrium point of the gradient-flow dynamics over $\mathcal{S}_Y$,
\[
\dot{Y} \;=\; -\nabla h(Y), \qquad Y(0)\in\mathcal{S}_Y.
\tag{GFY}
\]
Proposition 2. For any $Y(0)\in\mathcal{S}_Y$, the gradient-flow dynamics (GFY) are exponentially stable, i.e.,
\[
\|Y(t) - Y^\star\|_F^2 \;\le\; (L/\mu)\,\mathrm{e}^{-2\mu t}\,\|Y(0) - Y^\star\|_F^2
\]
where $\mu$ and $L$ are the strong convexity and smoothness parameters of the function $h(Y)$ over the sublevel set $\mathcal{S}_Y(h(Y(0)))$.

Proof: The derivative of the Lyapunov function candidate $V(Y) := h(Y) - h(Y^\star)$ along the flow in (GFY) satisfies
\[
\dot{V} \;=\; \langle\nabla h(Y), \dot{Y}\rangle \;=\; -\|\nabla h(Y)\|_F^2 \;\le\; -2\mu V.
\tag{6.12}
\]
Inequality (6.12) is a consequence of the strong convexity of the function $h(Y)$ and it yields [157, Lemma 3.4]
\[
V(Y(t)) \;\le\; \mathrm{e}^{-2\mu t}\,V(Y(0)).
\tag{6.13}
\]
Thus, for any $Y(0)\in\mathcal{S}_Y$, $h(Y(t))$ converges exponentially to $h(Y^\star)$. Moreover, since $h(Y)$ is $\mu$-strongly convex and $L$-smooth, $V(Y)$ can be upper and lower bounded by quadratic functions, and the exponential stability of (GFY) over $\mathcal{S}_Y$ follows from Lyapunov theory [157, Theorem 4.10]. $\square$

In Section 6.5, we use the above result to prove exponential/linear convergence of gradient flow/descent for the nonconvex optimization problem (6.3). Before we proceed, we note that similar convergence guarantees can be established for the gradient descent method with a sufficiently small stepsize $\alpha$,
\[
Y_{k+1} \;:=\; Y_k \,-\, \alpha\nabla h(Y_k), \qquad Y_0\in\mathcal{S}_Y.
\tag{GY}
\]
Since the function $h(Y)$ is $L$-smooth over the sublevel set $\mathcal{S}_Y(h(Y_0))$, for any $\alpha\in[0,1/L]$ the iterates $Y_k$ remain within $\mathcal{S}_Y(h(Y_0))$. This property, in conjunction with the $\mu$-strong convexity of $h(Y)$, implies that $Y_k$ converges to the optimal solution $Y^\star$ at the linear rate $\gamma = 1 - \alpha\mu$.

6.5 Control design with a known model

The asymptotic stability of (GF) is a consequence of the following properties of the LQR objective function [152], [153]:

1. The function $f(K)$ is twice continuously differentiable over its open domain $\mathcal{S}_K$, and $f(K)\to\infty$ as $K\to\infty$ and/or $K\to\partial\mathcal{S}_K$.

2. The optimal solution $K^\star$ is the unique equilibrium point over $\mathcal{S}_K$, i.e., $\nabla f(K) = 0$ if and only if $K = K^\star$.
In particular, the derivative of the maximal Lyapunov function candidate $V(K) := f(K) - f(K^\star)$ along the trajectories of (GF) satisfies
\[
\dot{V} \;=\; \langle\nabla f(K), \dot{K}\rangle \;=\; -\|\nabla f(K)\|_F^2 \;\le\; 0
\]
where the inequality is strict for all $K\ne K^\star$. Thus, Lyapunov theory [154] implies that, starting from any stabilizing initial condition $K(0)$, the trajectories of (GF) remain within the sublevel set $\mathcal{S}_K(f(K(0)))$ and asymptotically converge to $K^\star$. Similar arguments were employed for the convergence analysis of the Anderson-Moore algorithm for output-feedback synthesis [152]. While [152] shows global asymptotic stability, it does not provide any information on the rate of convergence. In this section, we first demonstrate exponential stability of (GF) and prove Theorem 1. Then, we establish linear convergence of the gradient descent method (GD) and prove Theorem 2.

6.5.1 Gradient-flow dynamics: proof of Theorem 1

We start our proof of Theorem 1 by relating the convex and nonconvex formulations of the LQR objective function. Specifically, in Lemma 1, we establish a relation between the gradients $\nabla f(K)$ and $\nabla h(Y)$ over the sublevel sets of the objective function $\mathcal{S}_K(a) := \{K\in\mathcal{S}_K\mid f(K)\le a\}$.

Lemma 1. For any stabilizing feedback gain $K\in\mathcal{S}_K(a)$ and $Y := KX(K)$, we have
\[
\|\nabla f(K)\|_F \;\ge\; c\,\|\nabla h(Y)\|_F
\tag{6.14a}
\]
where $X(K)$ is given by (6.4a), the constant $c$ is determined by
\[
c \;=\; \frac{\nu\sqrt{\nu\lambda_{\min}(R)}}{2a^2\|\mathcal{A}^{-1}\|_2\|B\|_2 + a\sqrt{\nu\lambda_{\min}(R)}}
\tag{6.14b}
\]
and the scalar $\nu$ given by Eq. (6.10d) depends on the problem parameters.

Proof: See Appendix D.4. $\square$

Using Lemma 1 and the exponential stability of gradient-flow dynamics (GFY) over $\mathcal{S}_Y$, established in Proposition 2, we next show that (GF) is also exponentially stable. In particular, for any stabilizing $K\in\mathcal{S}_K(a)$, the derivative of $V(K) := f(K) - f(K^\star)$ along the gradient flow in (GF) satisfies
\[
\dot{V} \;=\; -\|\nabla f(K)\|_F^2 \;\le\; -c^2\|\nabla h(Y)\|_F^2 \;\le\; -2\mu c^2 V
\tag{6.15}
\]
where $Y = KX(K)$ and the constants $c$ and $\mu$ are provided in Lemma 1 and Proposition 1, respectively. The first inequality in (6.15) follows from (6.14a) and the second follows from $f(K) = h(Y)$ combined with $\|\nabla h(Y)\|_F^2 \ge 2\mu V$ (which in turn is a consequence of the strong convexity of $h(Y)$ established in Proposition 1). Now, since the sublevel set $\mathcal{S}_K(a)$ is invariant with respect to (GF), following [157, Lemma 3.4], inequality (6.15) guarantees that system (GF) converges exponentially in the objective value with rate $\rho = 2\mu c^2$. This concludes the proof of part (a) in Theorem 1.

In order to prove part (b), we use the following lemma, which connects the errors in the objective value and the optimization variable.

Lemma 2. For any stabilizing feedback gain $K$, the objective function $f(K)$ in problem (6.3) satisfies
\[
f(K) - f(K^\star) \;=\; \mathrm{trace}\big((K - K^\star)^TR(K - K^\star)X(K)\big)
\]
where $K^\star$ is the optimal solution and $X(K)$ is given by (6.4a).

Proof: See Appendix D.4. $\square$

From Lemma 2 and part (a) of Theorem 1, we have
\[
\|K(t) - K^\star\|_F^2
\;\le\; \frac{f(K(t)) - f(K^\star)}{\lambda_{\min}(R)\lambda_{\min}(X(K(t)))}
\;\le\; \mathrm{e}^{-\rho t}\,\frac{f(K(0)) - f(K^\star)}{\lambda_{\min}(R)\lambda_{\min}(X(K(t)))}
\;\le\; b'\,\mathrm{e}^{-\rho t}\,\|K(0) - K^\star\|_F^2
\tag{6.16}
\]
where $b' := \|R\|_2\|X(K(0))\|_2/(\lambda_{\min}(R)\lambda_{\min}(X(K(t))))$. Here, the first and third inequalities follow from basic properties of the matrix trace combined with Lemma 2 applied with $K = K(t)$ and $K = K(0)$, respectively. The second inequality follows from part (a) of Theorem 1.
Finally, to upper bound the parameter $b'$, we use Lemma 15 in Appendix D.11, which provides the lower and upper bounds $\nu/a \le \lambda_{\min}(X(K))$ and $\|X(K)\|_2 \le a/\lambda_{\min}(Q)$ on the matrix $X(K)$ for any $K\in\mathcal{S}_K(a)$, where the constant $\nu$ is given by (6.10d). Using these bounds and the invariance of $\mathcal{S}_K(a)$ with respect to (GF), we obtain
\[
b' \;\le\; b \;:=\; \frac{a^2\|R\|_2}{\nu\lambda_{\min}(R)\lambda_{\min}(Q)}
\tag{6.17}
\]
which completes the proof of part (b).

Remark 2 (Gradient domination). Expression (6.15) implies that the objective function $f(K)$ over any given sublevel set $\mathcal{S}_K(a)$ satisfies the Polyak-Lojasiewicz (PL) condition [89]
\[
\|\nabla f(K)\|_F^2 \;\ge\; 2\mu_f\big(f(K) - f(K^\star)\big)
\tag{6.18}
\]
with parameter $\mu_f := \mu c^2$, where $\mu$ and $c$ are functions of $a$ given by (6.10b) and (6.14b), respectively. This condition is also known as gradient dominance and it was recently used to show convergence of gradient descent for the discrete-time LQR problem [13].

6.5.2 Geometric interpretation

The solution $Y(t)$ to gradient-flow dynamics (GFY) over the set $\mathcal{S}_Y$ induces the trajectory
\[
K_{\mathrm{ind}}(t) \;:=\; Y(t)\big(X(Y(t))\big)^{-1}
\tag{6.19}
\]
over the set of stabilizing feedback gains $\mathcal{S}_K$, where the affine function $X(Y)$ is given by (6.8c). The induced trajectory $K_{\mathrm{ind}}(t)$ can be viewed as the solution to the differential equation
\[
\dot{K} \;=\; g(K)
\tag{6.20a}
\]
where $g\colon\mathcal{S}_K\to\mathbb{R}^{m\times n}$ is given by
\[
g(K) \;:=\; \Big(K\mathcal{A}^{-1}\big(\mathcal{B}(\nabla h(Y(K)))\big) \,-\, \nabla h(Y(K))\Big)\big(X(K)\big)^{-1}.
\tag{6.20b}
\]
Here, the matrix $X = X(K)$ is given by (6.4a) and $Y(K) = KX(K)$. System (6.20) is obtained by differentiating both sides of Eq. (6.19) with respect to time $t$ and applying the chain rule. Figure 6.1 illustrates an induced trajectory $K_{\mathrm{ind}}(t)$ and a trajectory $K(t)$ resulting from gradient-flow dynamics (GF) that starts from the same initial condition. Moreover, using the definition of $h(Y)$, we have
\[
h(Y(t)) \;=\; f(K_{\mathrm{ind}}(t)).
\tag{6.21}
\]
Thus, the exponential decay of $h(Y(t))$ established in Proposition 2 implies that $f$ decays exponentially along the vector field $g$, i.e., for $K_{\mathrm{ind}}(0)\ne K^\star$, we have
\[
\frac{f(K_{\mathrm{ind}}(t)) - f(K^\star)}{f(K_{\mathrm{ind}}(0)) - f(K^\star)}
\;=\;
\frac{h(Y(t)) - h(Y^\star)}{h(Y(0)) - h(Y^\star)}
\;\le\; \mathrm{e}^{-2\mu t}.
\]
This inequality follows from inequality (6.13), where $\mu$ denotes the strong-convexity modulus of the function $h(Y)$ over the sublevel set $\mathcal{S}_Y(h(Y(0)))$; see Proposition 1. Herein, we provide a geometric interpretation of the exponential decay of $f$ under the trajectories of (GF) that is based on the relation between the vector fields $g$ and $-\nabla f$. Differentiating both sides of Eq. (6.21) with respect to $t$ yields
\[
\|\nabla h(Y)\|^2 \;=\; \langle -\nabla f(K),\, g(K)\rangle.
\tag{6.22}
\]
Thus, for each $K\in\mathcal{S}_K$, the inner product between the vector fields $-\nabla f(K)$ and $g(K)$ is nonnegative. However, this is not sufficient to ensure exponential decay of $f$ along (GF). To address this challenge, our proof utilizes inequality (6.14a) in Lemma 1. Based on the equation in (6.22), we observe that (6.14a) can be equivalently restated as
\[
\frac{\|-\nabla f(K)\|_F}{\|\Pi_{-\nabla f(K)}(g(K))\|_F}
\;=\;
\frac{\|\nabla f(K)\|_F^2}{\langle -\nabla f(K),\, g(K)\rangle}
\;\ge\; c^2
\]
where $\Pi_b(a)$ denotes the projection of $a$ onto $b$.

[Figure 6.1: Trajectories $K(t)$ of (GF) (solid black) and $K_{\mathrm{ind}}(t)$ resulting from Eq. (6.19) (dashed blue), along with the level sets of the function $f(K)$.]

Thus, Lemma 1 ensures that the ratio between the norm of the vector field $-\nabla f(K)$ associated with gradient-flow dynamics (GF) and the norm of the projection of $g(K)$ onto $-\nabla f(K)$ is uniformly lower bounded by a positive constant.
This lower bound is the key geometric feature that allows us to deduce the exponential decay of $f$ along the vector field $-\nabla f$ from the exponential decay along the vector field $g$.

6.5.3 Gradient descent: proof of Theorem 2

Given the exponential stability of gradient-flow dynamics (GF) established in Theorem 1, the convergence analysis of gradient descent (GD) amounts to finding a suitable stepsize $\alpha$. Lemma 3 provides a Lipschitz continuity parameter for $\nabla f(K)$, which facilitates finding such a stepsize.

Lemma 3. Over any non-empty sublevel set $\mathcal{S}_K(a)$, the gradient $\nabla f(K)$ is Lipschitz continuous with parameter
\[
L_f \;:=\; \frac{2a\|R\|_2}{\lambda_{\min}(Q)}
\,+\,
\frac{8a^3\|B\|_2}{\lambda_{\min}^2(Q)\lambda_{\min}(\Omega)}
\left(\frac{\|B\|_2}{\lambda_{\min}(\Omega)} + \frac{\|R\|_2}{\sqrt{\nu\lambda_{\min}(R)}}\right)
\]
where $\nu$ given by (6.10d) depends on the problem parameters.

Proof: See Appendix D.4. $\square$

Let $K_\alpha := K - \alpha\nabla f(K)$, $\alpha\ge 0$, parameterize the half-line starting from $K\in\mathcal{S}_K(a)$ with $K\ne K^\star$ along $-\nabla f(K)$, and let us define the scalar $\beta_m := \max\beta$ such that $K_\alpha\in\mathcal{S}_K(a)$ for all $\alpha\in[0,\beta]$. The existence of $\beta_m$ follows from the compactness of $\mathcal{S}_K(a)$ [152]. We next show that $\beta_m \ge 2/L_f$. For the sake of contradiction, suppose $\beta_m < 2/L_f$. From the continuity of $f(K_\alpha)$ with respect to $\alpha$, it follows that $f(K_{\beta_m}) = a$. Moreover, since $-\nabla f(K)$ is a descent direction of the function $f(K)$, we have $\beta_m > 0$. Thus, for $\alpha\in(0,\beta_m]$,
\[
f(K_\alpha) - f(K) \;\le\; -\frac{\alpha(2 - L_f\alpha)}{2}\|\nabla f(K)\|_F^2 \;<\; 0.
\]
Here, the first inequality follows from the $L_f$-smoothness of $f(K)$ over $\mathcal{S}_K(a)$ (Descent Lemma [158, Eq. (9.17)]) and the second inequality follows from $\nabla f(K)\ne 0$ in conjunction with $\beta_m\in(0,2/L_f)$. This implies $f(K_{\beta_m}) < f(K) \le a$, which contradicts $f(K_{\beta_m}) = a$. Thus, $\beta_m \ge 2/L_f$.

We can now use induction on $k$ to show that, for any stabilizing initial condition $K_0\in\mathcal{S}_K(a)$, the iterates of (GD) with $\alpha\in[0,2/L_f]$ remain in $\mathcal{S}_K(a)$ and satisfy
\[
f(K_{k+1}) - f(K_k) \;\le\; -\frac{\alpha(2 - L_f\alpha)}{2}\|\nabla f(K_k)\|_F^2.
\tag{6.23}
\]
Inequality (6.23), in conjunction with the PL condition (6.18) evaluated at $K_k$, guarantees linear convergence of gradient descent (GD) with rate $\gamma \le 1 - \alpha\mu_f$ for all $\alpha\in(0,1/L_f]$, where $\mu_f$ is the PL parameter of the function $f(K)$. This completes the proof of part (a) of Theorem 2. Using part (a) and Lemma 2, an argument similar to the one used in the proof of Theorem 1 establishes part (b) with the constant $b$ in (6.17). We omit the details for brevity.

Remark 3. Using our results, it is straightforward to show linear convergence of $K_{k+1} = K_k - \alpha H_1^k\nabla f(K_k)H_2^k$ with $K_0\in\mathcal{S}_K$ and a small enough stepsize, where $H_1^k$ and $H_2^k$ are uniformly upper and lower bounded positive definite matrices. In particular, the Kleinman iteration [137] is recovered for $\alpha = 0.5$, $H_1^k = R^{-1}$, and $H_2^k = (X(K_k))^{-1}$. Similarly, convergence of gradient descent may be improved by choosing $H_1^k = I$ and $H_2^k = (X(K_k))^{-1}$. In this case, the corresponding update direction provides the continuous-time variant of the so-called natural gradient for discrete-time systems [159].

6.6 Bias and correlation in gradient estimation

In the model-free setting, we do not have access to the gradient $\nabla f(K)$, and the random search method (RS) relies on the gradient estimate $\overline{\nabla f}(K)$ resulting from Algorithm 1. According to [13], achieving $\|\overline{\nabla f}(K) - \nabla f(K)\|_F \le \epsilon$ may take $N = \Omega(1/\epsilon^4)$ samples using one-point gradient estimates.
Our computational experiments (not included in this chapter) also suggest that, to achieve $\|\overline{\nabla f}(K) - \nabla f(K)\|_F \le \epsilon$, N must scale as poly(1/ε) even when a two-point gradient estimate is used. To avoid this poor sample complexity, our proof takes an alternative route and gives up on the objective of controlling the gradient estimation error. By exploiting the problem structure, we show that with a linear number of samples N = Õ(n), where n is the number of states, the estimate $\overline{\nabla f}(K)$ concentrates with high probability when projected onto the direction of ∇f(K).

Our proof strategy allows us to significantly improve upon the existing literature both in terms of the required function evaluations and simulation time. Specifically, using the random search method (RS), the total number of function evaluations required in our results to achieve an accuracy level ε is proportional to log(1/ε), compared to at least (1/ε⁴)log(1/ε) in [13] and 1/ε in [132]. Similarly, the simulation time that we require to achieve an accuracy level ε is proportional to log(1/ε); this is in contrast to poly(1/ε) simulation times in [13] and infinite simulation time in [132].

Algorithm 1 produces a biased estimate $\overline{\nabla f}(K)$ of the gradient ∇f(K). Herein, we first introduce an unbiased estimate $\widehat{\nabla f}(K)$ of ∇f(K) and establish that the distance $\|\widehat{\nabla f}(K) - \overline{\nabla f}(K)\|_F$ can be readily controlled by choosing a large simulation time τ and an appropriate smoothing parameter r in Algorithm 1; we call this distance the estimation bias. Next, we show that with N = Õ(n) samples, the unbiased estimate $\widehat{\nabla f}(K)$ becomes highly correlated with ∇f(K). We exploit this fact in our convergence analysis.

6.6.1 Bias in gradient estimation due to finite simulation time

We first introduce an unbiased estimate of the gradient that is used to quantify the bias. For any τ ≥ 0 and x₀ ∈ Rⁿ, let

$$f_{x_0,\tau}(K) \,:=\, \int_0^\tau x^T(t)\,Q\,x(t) \,+\, u^T(t)\,R\,u(t)\;\mathrm{d}t$$

denote the τ-truncated version of the LQR objective function associated with system (6.1b), the initial condition x(0) = x₀, and the feedback law u = −Kx, for K ∈ R^{m×n}. Note that for any K ∈ S_K and x(0) = x₀ ∈ Rⁿ, the infinite-horizon cost

$$f_{x_0}(K) \,:=\, f_{x_0,\infty}(K) \tag{6.24a}$$

exists and satisfies f(K) = E_{x₀}[f_{x₀}(K)]. Furthermore, the gradient of f_{x₀}(K) is given by (cf. (6.5))

$$\nabla f_{x_0}(K) \,=\, 2\,\big(RK - B^T P(K)\big)\,X_{x_0}(K) \tag{6.24b}$$

where $X_{x_0}(K) = -\mathcal{A}_K^{-1}(x_0 x_0^T)$ is determined by the closed-loop Lyapunov operator in (6.7) and $P(K) = -(\mathcal{A}_K^*)^{-1}(Q + K^T R K)$. Note that the gradients ∇f(K) and ∇f_{x₀}(K) are linear in $X(K) = -\mathcal{A}_K^{-1}(\Omega)$ and X_{x₀}(K), respectively. Thus, for any zero-mean random initial condition x(0) = x₀ with covariance E[x₀x₀ᵀ] = Ω, the linearity of the closed-loop Lyapunov operator A_K implies

$$\mathbb{E}_{x_0}[X_{x_0}(K)] \,=\, X(K), \qquad \mathbb{E}_{x_0}[\nabla f_{x_0}(K)] \,=\, \nabla f(K).$$

Let us define the following three estimates of the gradient

$$\overline{\nabla f}(K) \,:=\, \frac{1}{2rN}\sum_{i=1}^N \big(f_{x_i,\tau}(K + rU_i) - f_{x_i,\tau}(K - rU_i)\big)\,U_i$$
$$\widetilde{\nabla f}(K) \,:=\, \frac{1}{2rN}\sum_{i=1}^N \big(f_{x_i}(K + rU_i) - f_{x_i}(K - rU_i)\big)\,U_i$$
$$\widehat{\nabla f}(K) \,:=\, \frac{1}{N}\sum_{i=1}^N \langle \nabla f_{x_i}(K),\, U_i\rangle\, U_i \tag{6.25}$$

where U_i ∈ R^{m×n} are i.i.d. random matrices with vec(U_i) uniformly distributed on the sphere √(mn) S^{mn−1}, and x_i ∈ Rⁿ are i.i.d. initial conditions sampled from the distribution D. Here, $\widetilde{\nabla f}(K)$ is the infinite-horizon version of the output $\overline{\nabla f}(K)$ of Algorithm 1 and $\widehat{\nabla f}(K)$ provides an unbiased estimate of ∇f(K).
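To make the definitions in (6.25) concrete, the sketch below draws vec(U_i) uniformly from the sphere of radius √(mn) and forms the unbiased estimate $\widehat{\nabla f}(K)$ from the closed-form per-sample gradients ∇f_{x_i}(K) in (6.24b). The matrices A, B, Q, R passed in and the standard-normal sampling of the initial conditions are illustrative assumptions.

```python
# Sketch: unbiased gradient estimate \hat{nabla} f(K) of (6.25) using exact per-sample gradients.
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

def sample_U(m, n, rng):
    u = rng.standard_normal(m * n)
    return (np.sqrt(m * n) * u / np.linalg.norm(u)).reshape(m, n)   # vec(U) on sqrt(mn) sphere

def unbiased_grad_estimate(A, B, Q, R, K, N, rng):
    AK = A - B @ K
    P = solve_continuous_lyapunov(AK.T, -(Q + K.T @ R @ K))
    E = 2 * (R @ K - B.T @ P)                       # fixed matrix in (6.24b)
    G = np.zeros_like(K)
    for _ in range(N):
        x0 = rng.standard_normal(A.shape[0])        # x_i ~ D (standard normal here)
        Xx0 = solve_continuous_lyapunov(AK, -np.outer(x0, x0))
        U = sample_U(*K.shape, rng)
        G += np.sum((E @ Xx0) * U) * U              # <grad f_{x_i}(K), U_i> U_i
    return G / N
```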
To see this, note that by the independence of U_i and x_i we have

$$\mathbb{E}_{x_i,U_i}\big[\mathrm{vec}(\widehat{\nabla f}(K))\big] \,=\, \mathbb{E}_{U_1}\big[\langle \nabla f(K), U_1\rangle\,\mathrm{vec}(U_1)\big] \,=\, \mathbb{E}_{U_1}\big[\mathrm{vec}(U_1)\,\mathrm{vec}(U_1)^T\big]\,\mathrm{vec}(\nabla f(K)) \,=\, \mathrm{vec}(\nabla f(K))$$

and thus $\mathbb{E}[\widehat{\nabla f}(K)] = \nabla f(K)$. Here, we have used the fact that, for the random variable vec(U₁) uniformly distributed over the sphere √(mn)S^{mn−1}, E_{U₁}[vec(U₁)vec(U₁)ᵀ] = I.

6.6.1.1 Local boundedness of the function f(K)

An important requirement for the gradient estimation scheme in Algorithm 1 is the stability of the perturbed closed-loop systems, i.e., K ± rU_i ∈ S_K; violating this condition leads to an exponential growth of the state and control signals. Moreover, this condition is necessary and sufficient for $\widetilde{\nabla f}(K)$ to be well defined. In Proposition 3, we establish a radius within which any perturbation of K ∈ S_K remains stabilizing.

Proposition 3 For any stabilizing feedback gain K ∈ S_K, we have {K̂ ∈ R^{m×n} | ‖K̂ − K‖₂ < ζ} ⊂ S_K, where

$$\zeta \,:=\, \lambda_{\min}(\Omega)\,/\,\big(2\,\|B\|_2\,\|X(K)\|_2\big)$$

and X(K) is given by (6.4a).

Proof: See Appendix D.5. □

If we choose the parameter r in Algorithm 1 to be smaller than ζ, then the sample feedback gains K ± rU_i are all stabilizing. In this chapter, we further require that the parameter r is small enough so that K ± rU_i ∈ S_K(2a) for all K ∈ S_K(a). Such an upper bound on r is provided in the next lemma.

Lemma 4 For any U ∈ R^{m×n} with ‖U‖_F ≤ √(mn) and K ∈ S_K(a), K + r(a)U ∈ S_K(2a), where r(a) := c̃/a for some positive constant c̃ that depends on the problem data.

Proof: See Appendix D.5. □

Note that for any K ∈ S_K(a) and r ≤ r(a) in Lemma 4, $\widetilde{\nabla f}(K)$ is well defined because K ± rU_i ∈ S_K(2a) for all i.

6.6.1.2 Bounding the bias

Herein, we establish an upper bound on the difference between the output $\overline{\nabla f}(K)$ generated by Algorithm 1 and the unbiased estimate $\widehat{\nabla f}(K)$ of the gradient ∇f(K). We accomplish this by bounding the difference between these two quantities and $\widetilde{\nabla f}(K)$ through the use of the triangle inequality

$$\|\widehat{\nabla f}(K) - \overline{\nabla f}(K)\|_F \,\le\, \|\widetilde{\nabla f}(K) - \overline{\nabla f}(K)\|_F \,+\, \|\widehat{\nabla f}(K) - \widetilde{\nabla f}(K)\|_F. \tag{6.26}$$

The first term on the right-hand side of (6.26) arises from the bias caused by the finite simulation time in Algorithm 1. The next proposition quantifies an upper bound on this term.

Proposition 4 For any K ∈ S_K(a), the output of Algorithm 1 with parameter r ≤ r(a) (given by Lemma 4) satisfies

$$\|\widetilde{\nabla f}(K) - \overline{\nabla f}(K)\|_F \,\le\, \frac{\sqrt{mn}\,\max_i \|x_i\|^2}{r}\,\kappa_1(2a)\,\mathrm{e}^{-\kappa_2(2a)\tau}$$

where κ₁(a) > 0 is a degree-5 polynomial and κ₂(a) > 0 is inversely proportional to a; they are given by (D.17).

Proof: See Appendix D.6. □

Although small values of r may result in a large error $\|\widetilde{\nabla f}(K) - \overline{\nabla f}(K)\|_F$, the exponential dependence of the upper bound in Proposition 4 on the simulation time τ implies that this error can be readily controlled by increasing τ. In the next proposition, we handle the second term in (6.26).

Proposition 5 For any K ∈ S_K(a) and r ≤ r(a) (given by Lemma 4), we have

$$\|\widehat{\nabla f}(K) - \widetilde{\nabla f}(K)\|_F \,\le\, \frac{(rmn)^2}{2}\,\ell(2a)\,\max_i \|x_i\|^2$$

where the function ℓ(a) > 0 is a degree-4 polynomial given by (D.21).

Proof: See Appendix D.7. □

The third derivatives of the functions f_{x_i}(K) are utilized in the proof of Proposition 5. It is also worth noting that, unlike $\overline{\nabla f}(K)$ and $\widetilde{\nabla f}(K)$, the unbiased gradient estimate $\widehat{\nabla f}(K)$ is independent of the parameter r. Thus, Proposition 5 provides a quadratic upper bound on the estimation error in terms of r.
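The stabilizing radius of Proposition 3 is easy to evaluate numerically. The sketch below (system matrices assumed given, as in the earlier illustrative snippets) computes ζ and verifies that perturbations with spectral norm below ζ keep the closed loop Hurwitz.

```python
# Sketch: stabilizing radius zeta of Proposition 3 and a randomized check of it.
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

def stabilizing_radius(A, B, Omega, K):
    AK = A - B @ K
    X = solve_continuous_lyapunov(AK, -Omega)                # X(K) from (6.4a)
    return np.min(np.linalg.eigvalsh(Omega)) / (2 * np.linalg.norm(B, 2) * np.linalg.norm(X, 2))

def is_stabilizing(A, B, K):
    return np.max(np.linalg.eigvals(A - B @ K).real) < 0

def check_radius(A, B, Omega, K, rng, trials=100):
    zeta = stabilizing_radius(A, B, Omega, K)
    for _ in range(trials):
        D = rng.standard_normal(K.shape)
        D *= 0.99 * zeta / np.linalg.norm(D, 2)              # ||D||_2 < zeta
        assert is_stabilizing(A, B, K + D)
    return zeta
```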
6.6.2 Correlation between gradient and gradient estimate

As mentioned earlier, one approach to analyzing convergence of the random search method (RS) is to control the gradient estimation error $\overline{\nabla f}(K) - \nabla f(K)$ by choosing a large number of samples N. For the one-point gradient estimation setting, this approach was taken in [13] for the discrete-time LQR (and in [15] for the continuous-time LQR), and it has led to an upper bound on the required number of samples for reaching ε-accuracy that grows at least proportionally to 1/ε⁴. Alternatively, our proof exploits the problem structure and shows that with a linear number of samples N = Õ(n), where n is the number of states, the gradient estimate $\widehat{\nabla f}(K)$ concentrates with high probability when projected onto the direction of ∇f(K). In particular, in Propositions 7 and 8 we show that the following events occur with high probability for some positive scalars μ₁, μ₂:

$$\mathcal{M}_1 \,:=\, \big\{ \langle \widehat{\nabla f}(K),\, \nabla f(K)\rangle \,\ge\, \mu_1\,\|\nabla f(K)\|_F^2 \big\} \tag{6.27a}$$
$$\mathcal{M}_2 \,:=\, \big\{ \|\widehat{\nabla f}(K)\|_F^2 \,\le\, \mu_2\,\|\nabla f(K)\|_F^2 \big\}. \tag{6.27b}$$

To justify the definitions of these events, we first show that if they both take place, then the unbiased estimate $\widehat{\nabla f}(K)$ can be used to decrease the objective error by a geometric factor.

Proposition 6 (Approximate GD) If the matrix G ∈ R^{m×n} and the feedback gain K ∈ S_K(a) are such that

$$\langle G,\, \nabla f(K)\rangle \,\ge\, \mu_1\,\|\nabla f(K)\|_F^2 \tag{6.28a}$$
$$\|G\|_F^2 \,\le\, \mu_2\,\|\nabla f(K)\|_F^2 \tag{6.28b}$$

for some positive scalars μ₁ and μ₂, then K − αG ∈ S_K(a) for all α ∈ [0, μ₁/(μ₂L_f)], and

$$f(K - \alpha G) - f(K^\star) \,\le\, \gamma\,\big(f(K) - f(K^\star)\big) \quad \text{with} \quad \gamma \,=\, 1 - \mu_f\mu_1\alpha.$$

Here, L_f and μ_f are the smoothness and PL parameters of the function f over S_K(a).

Proof: See Appendix D.8. □

Remark 4 The fastest convergence rate guaranteed by Proposition 6, γ = 1 − μ_f μ₁²/(L_f μ₂), is achieved with the stepsize α = μ₁/(μ₂L_f). This rate bound is tight in the sense that if G = c∇f(K), for some c > 0, we recover the standard convergence rate γ = 1 − μ_f/L_f of gradient descent.

We next quantify the probability of the events M₁ and M₂. In our proofs, we exploit modern non-asymptotic statistical analysis of the concentration of random variables around their average. While Appendix D.10 sets the notation and provides basic definitions of key concepts, we refer the reader to the recent book [160] for a comprehensive discussion. Herein, we use c, c′, c″, etc. to denote positive absolute constants.

6.6.2.1 Handling M₁

We first exploit the problem structure to confine the dependence of $\widehat{\nabla f}(K)$ on the random initial conditions x_i to a zero-mean random vector. In particular, for any K ∈ S_K and x₀ ∈ Rⁿ,

$$\nabla f(K) \,=\, E\,X, \qquad \nabla f_{x_0}(K) \,=\, E\,X_{x_0}$$

where E := 2(RK − BᵀP(K)) ∈ R^{m×n} is a fixed matrix, $X = -\mathcal{A}_K^{-1}(\Omega)$, and $X_{x_0} = -\mathcal{A}_K^{-1}(x_0 x_0^T)$. This allows us to represent the unbiased estimate $\widehat{\nabla f}(K)$ of the gradient as

$$\widehat{\nabla f}(K) \,=\, \frac{1}{N}\sum_{i=1}^N \langle E\,X_{x_i},\, U_i\rangle\, U_i \,=\, \widehat{\nabla}_1 \,+\, \widehat{\nabla}_2 \tag{6.29a}$$
$$\widehat{\nabla}_1 \,=\, \frac{1}{N}\sum_{i=1}^N \langle E\,(X_{x_i} - X),\, U_i\rangle\, U_i \tag{6.29b}$$
$$\widehat{\nabla}_2 \,=\, \frac{1}{N}\sum_{i=1}^N \langle \nabla f(K),\, U_i\rangle\, U_i. \tag{6.29c}$$

Note that $\widehat{\nabla}_2$ does not depend on the initial conditions x_i. Moreover, from E[X_{x_i}] = X and the independence of X_{x_i} and U_i, we have $\mathbb{E}[\widehat{\nabla}_1] = 0$ and $\mathbb{E}[\widehat{\nabla}_2] = \nabla f(K)$. In Lemma 5, we show that $\langle \widehat{\nabla}_1, \nabla f(K)\rangle$ can be made arbitrarily small with a large number of samples N. This allows us to analyze the probability of the event M₁ in (6.27).

Lemma 5 Let U₁, ..., U_N ∈ R^{m×n} be i.i.d.
random matrices with each vec(U_i) uniformly distributed on the sphere √(mn)S^{mn−1}, and let X₁, ..., X_N ∈ R^{n×n} be i.i.d. random matrices distributed according to M(xxᵀ). Here, M is a linear operator and x ∈ Rⁿ is a random vector whose entries are i.i.d., zero-mean, unit-variance, sub-Gaussian random variables with sub-Gaussian norm less than κ. For any fixed matrix E ∈ R^{m×n} and positive scalars δ and β, if

$$N \,\ge\, C\,(\beta^2\kappa^2/\delta)^2\,\big(\|\mathcal{M}^*\|_2 + \|\mathcal{M}^*\|_S\big)^2\, n\log^6 n \tag{6.30}$$

then, with probability not smaller than 1 − C′N^{−β} − 4Ne^{−n/8},

$$\Big|\, \frac{1}{N}\sum_{i=1}^N \langle E(X_i - X),\, U_i\rangle\,\langle EX,\, U_i\rangle \,\Big| \,\le\, \delta\,\|EX\|_F\,\|E\|_F$$

where X := E[X₁] = M(I).

Proof: See Appendix D.9. □

In Lemma 6, we show that $\langle \widehat{\nabla}_2, \nabla f(K)\rangle$ concentrates with high probability around its average ‖∇f(K)‖²_F.

Lemma 6 Let U₁, ..., U_N ∈ R^{m×n} be i.i.d. random matrices with each vec(U_i) uniformly distributed on the sphere √(mn)S^{mn−1}. Then, for any W ∈ R^{m×n} and t ∈ (0, 1],

$$\mathbb{P}\Big\{ \frac{1}{N}\sum_{i=1}^N \langle W,\, U_i\rangle^2 \,<\, (1 - t)\,\|W\|_F^2 \Big\} \,\le\, 2\,\mathrm{e}^{-cNt^2}.$$

Proof: See Appendix D.9. □

In Proposition 7, we use Lemmas 5 and 6 to address M₁.

Proposition 7 Under Assumption 1, for any stabilizing feedback gain K ∈ S_K and positive scalar β, if

$$N \,\ge\, C_1\,\frac{\beta^4\kappa^4}{\lambda_{\min}^2(X)}\,\big(\|(\mathcal{A}_K^*)^{-1}\|_2 + \|(\mathcal{A}_K^*)^{-1}\|_S\big)^2\, n\log^6 n$$

then the event M₁ in (6.27) with μ₁ := 1/4 satisfies P(M₁) ≥ 1 − C₂N^{−β} − 4Ne^{−n/8} − 2e^{−C₃N}.

Proof: We use Lemma 5 with δ := λ_min(X)/4 to show that

$$\big|\langle \widehat{\nabla}_1,\, \nabla f(K)\rangle\big| \,\le\, \delta\,\|EX\|_F\,\|E\|_F \,\le\, \tfrac{1}{4}\,\|EX\|_F^2 \,=\, \tfrac{1}{4}\,\|\nabla f(K)\|_F^2 \tag{6.31a}$$

holds with probability not smaller than 1 − C′N^{−β} − 4Ne^{−n/8}. Furthermore, Lemma 6 with t := 1/2 implies that

$$\langle \widehat{\nabla}_2,\, \nabla f(K)\rangle \,\ge\, \tfrac{1}{2}\,\|\nabla f(K)\|_F^2 \tag{6.31b}$$

holds with probability not smaller than 1 − 2e^{−cN}. Since $\widehat{\nabla f}(K) = \widehat{\nabla}_1 + \widehat{\nabla}_2$, we can use a union bound to combine (6.31a) and (6.31b). This together with the triangle inequality completes the proof. □

6.6.2.2 Handling M₂

In Lemma 7, we quantify a high-probability upper bound on $\|\widehat{\nabla}_1\|_F/\|\nabla f(K)\|_F$. This lemma is analogous to Lemma 5 and it allows us to analyze the probability of the event M₂ in (6.27).

Lemma 7 Let X_i and U_i with i = 1, ..., N be the random matrices defined in Lemma 5, let X := E[X₁], and let N ≥ c₀n. Then, for any E ∈ R^{m×n} and positive scalar β,

$$\frac{1}{N}\,\Big\| \sum_{i=1}^N \langle E(X_i - X),\, U_i\rangle\, U_i \Big\|_F \,\le\, c_1\,\beta\kappa^2\,\big(\|\mathcal{M}^*\|_2 + \|\mathcal{M}^*\|_S\big)\,\|E\|_F\,\sqrt{mn}\,\log n$$

with probability not smaller than 1 − c₂(n^{−β} + Ne^{−n/8}).

Proof: See Appendix D.10. □

In Lemma 8, we quantify a high-probability upper bound on $\|\widehat{\nabla}_2\|_F/\|\nabla f(K)\|_F$.

Lemma 8 Let U₁, ..., U_N ∈ R^{m×n} be i.i.d. random matrices with vec(U_i) uniformly distributed on the sphere √(mn)S^{mn−1} and let N ≥ Cn. Then, for any W ∈ R^{m×n},

$$\mathbb{P}\Big\{ \frac{1}{N}\,\Big\|\sum_{j=1}^N \langle W,\, U_j\rangle\, U_j\Big\|_F \,>\, C'\sqrt{m}\,\|W\|_F \Big\} \,\le\, 2N\mathrm{e}^{-mn/8} \,+\, 2\mathrm{e}^{-\hat c N}.$$

Proof: See Appendix D.10. □

In Proposition 8, we use Lemmas 7 and 8 to address M₂.

Proposition 8 Let Assumption 1 hold. Then, for any K ∈ S_K, scalar β > 0, and N ≥ C₄n, the event M₂ in (6.27) with

$$\mu_2 \,:=\, C_5 \left( \beta\kappa^2\,\frac{\|(\mathcal{A}_K^*)^{-1}\|_2 + \|(\mathcal{A}_K^*)^{-1}\|_S}{\lambda_{\min}(X)}\,\sqrt{mn}\,\log n \,+\, \sqrt{m} \right)^2$$

satisfies P(M₂) ≥ 1 − C₆(n^{−β} + Ne^{−n/8} + e^{−C₇N}).

Proof: We use Lemma 7 to show that, with probability at least 1 − c₂(n^{−β} + Ne^{−n/8}), $\widehat{\nabla}_1$ satisfies

$$\|\widehat{\nabla}_1\|_F \,\le\, c_1\beta\kappa^2\,\big(\|(\mathcal{A}_K^*)^{-1}\|_2 + \|(\mathcal{A}_K^*)^{-1}\|_S\big)\,\|E\|_F\,\sqrt{mn}\,\log n \,\le\, c_1\beta\kappa^2\,\frac{\|(\mathcal{A}_K^*)^{-1}\|_2 + \|(\mathcal{A}_K^*)^{-1}\|_S}{\lambda_{\min}(X)}\,\|\nabla f(K)\|_F\,\sqrt{mn}\,\log n.$$
Furthermore, we can use Lemma 8 to show that, with probability not smaller than 1 − 2Ne^{−mn/8} − 2e^{−ĉN}, $\widehat{\nabla}_2$ satisfies $\|\widehat{\nabla}_2\|_F \le C'\sqrt{m}\,\|\nabla f(K)\|_F$. Now, since $\widehat{\nabla f}(K) = \widehat{\nabla}_1 + \widehat{\nabla}_2$, we can use a union bound to combine the last two inequalities. This together with the triangle inequality completes the proof. □

6.7 Model-free control design

In this section, we prove a more formal version of Theorem 3.

Theorem 4 Consider the random search method (RS) that uses the gradient estimates of Algorithm 1 for finding the optimal solution K⋆ of LQR problem (6.3). Let the initial condition x₀ obey Assumption 1 and let the simulation time τ, the smoothing constant r, and the number of samples N satisfy

$$\tau \,\ge\, \theta'(a)\,\log\frac{1}{r\epsilon}, \qquad r \,<\, \min\{r(a),\, \theta''(a)\sqrt{\epsilon}\}, \qquad N \,\ge\, c_1\,\big(1 + \beta^4\kappa^4\,\theta(a)\log^6 n\big)\,n \tag{6.32}$$

for some β > 0 and a desired accuracy ε > 0. Then, for any initial condition K₀ ∈ S_K(a), (RS) with the constant stepsize α ≤ 1/(32μ₂(a)L_f) achieves f(K_k) − f(K⋆) ≤ ε with probability not smaller than 1 − kp − 2kNe^{−n} in at most

$$k \,\le\, \frac{\log\dfrac{f(K_0) - f(K^\star)}{\epsilon}}{\log\dfrac{1}{1 - \mu_f(a)\alpha/8}}$$

iterations. Here, p := c₂(n^{−β} + N^{−β} + Ne^{−n/8} + e^{−c₃N}), μ₂ := c₄(√m + βκ²θ(a)√(mn) log n)², c₁, ..., c₄ are positive absolute constants, μ_f and L_f are the PL and smoothness parameters of the function f over the sublevel set S_K(a), θ, θ′, θ″ are positive functions that depend only on the parameters of the LQR problem, and r(a) is given by Lemma 4.

Proof: The proof combines Propositions 4, 5, 6, 7, and 8. We first show that for any r ≤ r(a) and τ > 0,

$$\|\overline{\nabla f}(K) \,-\, \widehat{\nabla f}(K)\|_F \,\le\, \sigma \tag{6.33}$$

with probability not smaller than 1 − 2Ne^{−n}, where

$$\sigma \,:=\, c_5\,(\kappa^2 + 1)\left( \frac{n\sqrt{m}}{r}\,\kappa_1(2a)\,\mathrm{e}^{-\kappa_2(2a)\tau} \,+\, \frac{r^2 m^2 n^{5/2}}{2}\,\ell(2a) \right).$$

Here, r(a), κ_i(a), and ℓ(a) are positive functions given by Lemma 4, Eq. (D.17), and Eq. (D.21), respectively. Under Assumption 1, the vector v ∼ D satisfies [160, Eq. (3.3)]

$$\mathbb{P}\big\{\|v\| \,\le\, c_5(\kappa^2 + 1)\sqrt{n}\big\} \,\ge\, 1 \,-\, 2\mathrm{e}^{-n}.$$

Thus, for the random initial conditions x₁, ..., x_N ∼ D, we can apply the union bound (Boole's inequality) to obtain

$$\mathbb{P}\big\{ \max_i \|x_i\| \,\le\, c_5(\kappa^2 + 1)\sqrt{n} \big\} \,\ge\, 1 \,-\, 2N\mathrm{e}^{-n}. \tag{6.34}$$

Now, we combine Propositions 4 and 5 to write

$$\|\overline{\nabla f}(K) - \widehat{\nabla f}(K)\|_F \,\le\, \left( \frac{\sqrt{mn}}{r}\,\kappa_1(2a)\,\mathrm{e}^{-\kappa_2(2a)\tau} \,+\, \frac{(rmn)^2}{2}\,\ell(2a) \right) \max_i \|x_i\|^2 \,\le\, \sigma.$$

The first inequality is obtained by combining Propositions 4 and 5 through the use of the triangle inequality, and the second inequality follows from (6.34). This completes the proof of (6.33).

Let θ(a) be a uniform upper bound such that

$$\frac{\|(\mathcal{A}_K^*)^{-1}\|_2 + \|(\mathcal{A}_K^*)^{-1}\|_S}{\lambda_{\min}(X)} \,\le\, \theta(a)$$

for all K ∈ S_K(a); see Appendix D.12 for a discussion of θ(a). Since the number of samples satisfies (6.32), for any given K ∈ S_K(a) we can combine Propositions 7 and 8 with a union bound to show that

$$\langle \widehat{\nabla f}(K),\, \nabla f(K)\rangle \,\ge\, \mu_1\,\|\nabla f(K)\|_F^2 \tag{6.35a}$$
$$\|\widehat{\nabla f}(K)\|_F^2 \,\le\, \mu_2\,\|\nabla f(K)\|_F^2 \tag{6.35b}$$

holds with probability not smaller than 1 − p, where μ₁ = 1/4, and μ₂ and p are determined in the statement of the theorem.

Without loss of generality, let us assume that the initial error satisfies f(K₀) − f(K⋆) > ε. We next show that

$$\langle \overline{\nabla f}(K_0),\, \nabla f(K_0)\rangle \,\ge\, \frac{\mu_1}{2}\,\|\nabla f(K_0)\|_F^2 \tag{6.36a}$$
$$\|\overline{\nabla f}(K_0)\|_F^2 \,\le\, 4\mu_2\,\|\nabla f(K_0)\|_F^2 \tag{6.36b}$$

holds with probability not smaller than 1 − p − 2Ne^{−n}. Since the function f is gradient dominant over the sublevel set S_K(a) with parameter μ_f, combining f(K₀) − f(K⋆) > ε and (6.18) yields $\|\nabla f(K_0)\|_F \ge \sqrt{2\mu_f\epsilon}$.
Also, let the positive scalars θ′(a) and θ″(a) be such that, for any pair of τ and r satisfying τ ≥ θ′(a) log(1/(rε)) and r < min{r(a), θ″(a)√ε}, the upper bound σ in (6.33) becomes smaller than

$$\sigma \,\le\, \sqrt{2\mu_f\epsilon}\;\min\{\mu_1/2,\, \sqrt{\mu_2}\}.$$

The choice of θ′ and θ″ with the above property is straightforward using the definition of σ. Combining $\|\nabla f(K_0)\|_F \ge \sqrt{2\mu_f\epsilon}$ and $\sigma \le \sqrt{2\mu_f\epsilon}\,\min\{\mu_1/2, \sqrt{\mu_2}\}$ yields

$$\sigma \,\le\, \|\nabla f(K_0)\|_F\,\min\{\mu_1/2,\, \sqrt{\mu_2}\}. \tag{6.37}$$

Using the union bound, we have

$$\langle \overline{\nabla f}(K_0),\, \nabla f(K_0)\rangle \,=\, \langle \widehat{\nabla f}(K_0),\, \nabla f(K_0)\rangle \,+\, \langle \overline{\nabla f}(K_0) - \widehat{\nabla f}(K_0),\, \nabla f(K_0)\rangle$$
$$\overset{(a)}{\ge}\, \mu_1\,\|\nabla f(K_0)\|_F^2 \,-\, \|\overline{\nabla f}(K_0) - \widehat{\nabla f}(K_0)\|_F\,\|\nabla f(K_0)\|_F \,\overset{(b)}{\ge}\, \mu_1\,\|\nabla f(K_0)\|_F^2 \,-\, \sigma\,\|\nabla f(K_0)\|_F \,\overset{(c)}{\ge}\, \frac{\mu_1}{2}\,\|\nabla f(K_0)\|_F^2$$

with probability not smaller than 1 − p − 2Ne^{−n}. Here, (a) follows from combining (6.35a) and the Cauchy-Schwarz inequality, (b) follows from (6.33), and (c) follows from (6.37). Moreover,

$$\|\overline{\nabla f}(K_0)\|_F \,\overset{(a)}{\le}\, \|\widehat{\nabla f}(K_0)\|_F \,+\, \|\overline{\nabla f}(K_0) - \widehat{\nabla f}(K_0)\|_F \,\overset{(b)}{\le}\, \sqrt{\mu_2}\,\|\nabla f(K_0)\|_F \,+\, \sigma \,\overset{(c)}{\le}\, 2\sqrt{\mu_2}\,\|\nabla f(K_0)\|_F$$

where (a) follows from the triangle inequality, (b) from (6.33), and (c) from (6.37). This completes the proof of (6.36).

Inequality (6.36) allows us to apply Proposition 6 and obtain, with probability not smaller than 1 − p − 2Ne^{−n}, that for the stepsize α ≤ μ₁/(8μ₂L_f) we have K₁ ∈ S_K(a) and f(K₁) − f(K⋆) ≤ γ(f(K₀) − f(K⋆)), with γ = 1 − μ_f μ₁ α/2, where L_f is the smoothness parameter of the function f over S_K(a). Finally, using the union bound, we can repeat this procedure via induction to obtain that, for some

$$k \,\le\, \frac{\log\dfrac{f(K_0) - f(K^\star)}{\epsilon}}{\log\dfrac{1}{\gamma}}$$

the error satisfies f(K_k) − f(K⋆) ≤ γᵏ(f(K₀) − f(K⋆)) ≤ ε with probability not smaller than 1 − kp − 2kNe^{−n}. □

Remark 5 For the failure probability in Theorem 4 to be negligible, the problem dimension n needs to be large. Moreover, to account for the conflicting term Ne^{−n/8} in the failure probability, we can require a crude exponential bound N ≤ e^{n/16} on the sample size. We also note that although Theorem 4 only guarantees convergence in the objective value, similar to the proof of Theorem 1, we can use Lemma 2, which relates the error in the optimization variable K to the error in the objective function f(K), to obtain convergence guarantees in the optimization variable as well.

Remark 6 Theorem 4 requires the lower bound on the simulation time τ in (6.32) to ensure that, for any desired accuracy ε, the smoothing constant r satisfies r ≥ (1/ε)e^{−τ/θ′(a)}. As we demonstrate in the proof, this requirement accounts for the bias that arises from a finite value of τ. Since this form of bias can be readily controlled by increasing τ, the above lower bound on r does not contradict the upper bound r = O(√ε) required by Theorem 4. Finally, we note that letting r → 0 can cause large bias in the presence of other sources of inaccuracy in the function approximation process.

6.8 Computational experiments

We consider a mass-spring-damper system with s masses, where we set all mass, spring, and damping constants to unity.
In state-space representation (6.1b), the state x = [pᵀ vᵀ]ᵀ contains the position and velocity vectors, and the dynamic and input matrices are given by

$$A \,=\, \begin{bmatrix} 0 & I \\ -T & -T \end{bmatrix}, \qquad B \,=\, \begin{bmatrix} 0 \\ I \end{bmatrix}$$

where 0 and I are s×s zero and identity matrices, and T is a Toeplitz matrix with 2 on the main diagonal and −1 on the first super- and sub-diagonals.

6.8.1 Known model

To compare the performance of gradient descent methods (GD) and (GY) on K and Y, we solve the LQR problem with Q = I + 100e₁e₁ᵀ, R = I + 1000e₄e₄ᵀ, and Ω = I for s ∈ {10, 20} masses (i.e., n = 2s state variables), where e_i is the ith unit vector in the standard basis of Rⁿ. Figure 6.2 illustrates the convergence curves for both algorithms with a stepsize selected using a backtracking procedure that guarantees stability of the closed-loop system. Both algorithms were initialized with Y₀ = K₀ = 0. Even though Fig. 6.2 suggests that gradient descent/flow on S_K converges faster than that on S_Y, this observation does not hold in general.

Figure 6.2: Convergence curves (f(K_k) − f(K⋆))/(f(K₀) − f(K⋆)) versus iteration k for gradient descent (blue) over the set S_K and gradient descent (red) over the set S_Y, with (a) s = 10 and (b) s = 20 masses.

6.8.2 Unknown model

To illustrate our results on the accuracy of the gradient estimation in Algorithm 1 and the efficiency of our random search method, we consider the LQR problem with Q and R equal to identity for s = 10 masses (i.e., n = 20 state variables). We also let the initial conditions x_i in Algorithm 1 be standard normal and use N = n = 2s samples. Figure 6.3 (a) illustrates the dependence of $\|\widehat{\nabla f}(K) - \overline{\nabla f}(K)\|_F/\|\widehat{\nabla f}(K)\|_F$ on the simulation time τ for K = 0 and two values of the smoothing parameter, r = 10⁻⁴ (blue) and r = 10⁻⁵ (red). We observe an exponential decrease in error for small values of τ. In addition, the error does not pass a saturation level, which is determined by r. We also see that, as r decreases, this saturation level becomes smaller. These observations are in harmony with our theoretical developments; in particular, combining Propositions 4 and 5 through the use of the triangle inequality yields

$$\|\widehat{\nabla f}(K) - \overline{\nabla f}(K)\|_F \,\le\, \left( \frac{\sqrt{mn}}{r}\,\kappa_1(2a)\,\mathrm{e}^{-\kappa_2(2a)\tau} \,+\, \frac{r^2 m^2 n^2}{2}\,\ell(2a) \right) \max_i \|x_i\|^2.$$

This upper bound clearly captures the exponential dependence of the bias on the simulation time τ as well as the saturation level, which depends quadratically on the smoothing parameter r. In Fig. 6.3 (b), we show the dependence of the total relative error $\|\overline{\nabla f}(K) - \nabla f(K)\|_F/\|\nabla f(K)\|_F$ on the simulation time τ for two values of the smoothing parameter, r = 10⁻⁴ (blue) and r = 10⁻⁵ (red), resulting from the use of N = n samples. We observe that the distance between the approximate gradient and the true gradient is rather large. This is exactly why prior analyses of sample complexity and simulation time are subpar to our results. In contrast to the existing results, which rely on the estimation error shown in Fig. 6.3 (b), our analysis shows that the simulated gradient $\overline{\nabla f}(K)$ is close to the gradient estimate $\widehat{\nabla f}(K)$. While $\widehat{\nabla f}(K)$ is not close to the true gradient ∇f(K), it is highly correlated with it. This is sufficient for establishing convergence guarantees and it allows us to significantly improve upon existing results [13], [132] in terms of sample complexity and simulation time, reducing both to O(log(1/ε)).
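For reference, the mass-spring-damper model and the LQR weights used in the experiments of this section can be assembled as follows. This is only an illustrative sketch of the setup described above; the zero-based indexing of the unit vectors e₁ and e₄ is the only convention introduced here.

```python
# Sketch: mass-spring-damper matrices and LQR weights from Section 6.8 (s masses, unit constants).
import numpy as np
from scipy.linalg import toeplitz

s = 10
n, m = 2 * s, s
col = np.zeros(s); col[0], col[1] = 2.0, -1.0
T = toeplitz(col)                                   # tridiagonal symmetric Toeplitz matrix
A = np.block([[np.zeros((s, s)), np.eye(s)],
              [-T,               -T]])
B = np.vstack([np.zeros((s, s)), np.eye(s)])

Q = np.eye(n); Q[0, 0] += 100.0                     # Q = I + 100 e1 e1^T
R = np.eye(m); R[3, 3] += 1000.0                    # R = I + 1000 e4 e4^T
Omega = np.eye(n)
```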
Finally, Fig. 6.3 (c) demonstrates linear convergence of the random search method (RS) with stepsize α = 10⁻⁴, r = 10⁻⁵, and τ = 200 in Algorithm 1, as established in Theorem 4. In this experiment, we implemented Algorithm 1 using the ode45 and trapz subroutines in MATLAB to numerically integrate the state/input penalties with the corresponding weight matrices Q and R. However, our theoretical results only account for the approximation error that arises from a finite simulation horizon. Clearly, employing numerical ODE solvers and numerical integration may introduce additional errors in our gradient approximation that require further scrutiny.

Figure 6.3: (a) Bias in gradient estimation, $\|\widehat{\nabla f}(K) - \overline{\nabla f}(K)\|_F/\|\widehat{\nabla f}(K)\|_F$, and (b) total error in gradient estimation, $\|\overline{\nabla f}(K) - \nabla f(K)\|_F/\|\nabla f(K)\|_F$, as functions of the simulation time τ. The blue and red curves correspond to the two values of the smoothing parameter r = 10⁻⁴ and r = 10⁻⁵, respectively. (c) Convergence curve (f(K_k) − f(K⋆))/(f(K₀) − f(K⋆)) of the random search method (RS).

6.9 Concluding remarks

We prove exponential/linear convergence of gradient flow/descent algorithms for solving the continuous-time LQR problem based on a nonconvex formulation that directly searches for the controller. A salient feature of our analysis is that we relate the gradient-flow dynamics associated with this nonconvex formulation to that of a convex reparameterization. This allows us to deduce convergence of the nonconvex approach from its convex counterpart. We also establish a bound on the sample complexity of the random search method for solving the continuous-time LQR problem that does not require knowledge of system parameters. We have recently proved a similar result for the discrete-time LQR problem [87]. Our ongoing research directions include: (i) providing theoretical guarantees for the convergence of gradient-based methods for sparsity-promoting as well as structured control synthesis; and (ii) extension to nonlinear systems via successive linearization techniques.

Chapter 7
Random search for discrete-time LQR

Model-free reinforcement learning techniques directly search over the parameter space of controllers. Although this often amounts to solving a nonconvex optimization problem, simple local search methods exhibit competitive performance on benchmark control problems. To understand this phenomenon, we study the discrete-time Linear Quadratic Regulator (LQR) problem with unknown state-space parameters. In spite of the lack of convexity, we establish that the random search method with two-point gradient estimates and a fixed number of roll-outs achieves ε-accuracy in O(log(1/ε)) iterations. This significantly improves existing results on the model-free LQR problem, which require O(1/ε) total roll-outs.

7.1 Introduction

We study the sample complexity and convergence of the random search method for the infinite-horizon discrete-time LQR problem. The random search method is a derivative-free optimization algorithm that directly searches over the parameter space of controllers using approximations of the gradient obtained from simulation data. Despite its simplicity, this approach has been used to solve benchmark control problems with state-of-the-art sample efficiency [22], [151].
However, even for the standard LQR problem, many open theoretical questions surround the convergence properties and sample complexity of this method, mainly because of the lack of convexity.

For the discrete-time LQR problem, global convergence guarantees were recently provided for gradient descent and the random search method with one-point gradient estimates [13]. The key observation was that the LQR cost satisfies the Polyak-Łojasiewicz (PL) condition, which can ensure convergence of gradient descent at a linear rate even for nonconvex problems. This reference also established a bound on the sample complexity of random search for reaching the error tolerance ε that requires a number of function evaluations proportional to (1/ε⁴)log(1/ε). Extensions to the continuous-time LQR [15], [88], the H∞-regularized LQR [134], and Markovian jump linear systems [133] have also been made. Assuming access to the infinite-horizon cost, the number of function evaluations for the random search method with one-point estimates was improved to 1/ε² in [132]. Moreover, this work showed that the use of two-point estimates reduces the number of function evaluations to 1/ε. Apart from the PL property, these results do not exploit the structure of the LQR problem.

Our recent work [14] focused on the continuous-time LQR problem and established that the random search method with two-point gradient estimates converges to the optimal solution at a linear rate with high probability. In this chapter, we extend the results of [14] to the discrete-time case. Relative to the existing literature, our results offer a significant improvement both in terms of the required number of function evaluations and simulation time. Specifically, the total number of function evaluations to achieve ε-accuracy is proportional to log(1/ε), compared to at least (1/ε⁴)log(1/ε) in [13] and 1/ε in [132]. Similarly, the required simulation time is proportional to log(1/ε); this is in contrast to [13], which requires poly(1/ε) simulation time.

7.2 State-feedback characterization

Consider the LTI system

$$x_{t+1} \,=\, A x_t \,+\, B u_t, \qquad x_0 \,=\, \zeta \tag{7.1a}$$

where x_t ∈ Rⁿ is the state, u_t ∈ Rᵐ is the control input, A and B are constant matrices, and x₀ = ζ is a zero-mean random initial condition with distribution D. The LQR problem associated with system (7.1a) is given by

$$\underset{x,\,u}{\text{minimize}} \;\; \mathbb{E}\left[ \sum_{t=0}^{\infty} x_t^T Q\, x_t \,+\, u_t^T R\, u_t \right] \tag{7.1b}$$

where Q and R are positive definite matrices and the expectation is taken over ζ ∼ D. For a controllable pair (A, B), the solution to (7.1) takes the state-feedback form

$$u_t \,=\, -K^\star x_t \,=\, -(R + B^T P^\star B)^{-1} B^T P^\star A\, x_t$$

where P⋆ is the unique positive definite solution to the Algebraic Riccati Equation (ARE),

$$A^T P^\star A \,+\, Q \,-\, A^T P^\star B\,(R + B^T P^\star B)^{-1} B^T P^\star A \,=\, P^\star.$$

When the model parameters A and B are known, the ARE can be solved efficiently via a variety of techniques [138], [161]. However, these techniques are not directly applicable when the matrices A and B are not known.
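In the model-based setting just described, the gain K⋆ is obtained directly from the ARE. The following minimal sketch uses a standard Riccati solver for this purpose; the system matrices below are illustrative placeholders.

```python
# Sketch: model-based solution of the discrete-time LQR problem via the ARE.
import numpy as np
from scipy.linalg import solve_discrete_are

rng = np.random.default_rng(0)
n, m = 6, 2
A = 0.9 * rng.standard_normal((n, n)) / np.sqrt(n)   # placeholder dynamics
B = rng.standard_normal((n, m))
Q, R = np.eye(n), np.eye(m)

P_star = solve_discrete_are(A, B, Q, R)
K_star = np.linalg.solve(R + B.T @ P_star @ B, B.T @ P_star @ A)   # K* = (R + B'PB)^{-1} B'PA
```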
One approach to dealing with the model-free scenario is to use the linearity of the optimal controller and reformulate the LQR problem as an optimization over state-feedback gains,

$$\underset{K}{\text{minimize}} \;\; f(K) \,:=\, \mathbb{E}[f_\zeta(K)] \tag{7.2}$$

where f_ζ(K) := ⟨Q + KᵀRK, X_ζ(K)⟩ = ζᵀP(K)ζ and the matrices P(K) and X_ζ(K) are given by

$$P(K) \,:=\, \sum_{t=0}^{\infty} \big((A - BK)^T\big)^t\, (Q + K^T R K)\, (A - BK)^t, \qquad X_\zeta(K) \,:=\, \sum_{t=0}^{\infty} (A - BK)^t\, \zeta\zeta^T\, \big((A - BK)^T\big)^t. \tag{7.3}$$

Here, f_ζ(K) determines the LQR cost in (7.1b) associated with the feedback law u = −Kx and the initial condition x₀ = ζ. A necessary and sufficient condition for the boundedness of f_ζ(K) for all ζ ∈ Rⁿ is closed-loop stability,

$$K \,\in\, \mathcal{S}_K \,:=\, \{K \in \mathbb{R}^{m\times n} \,|\, \rho(A - BK) < 1\} \tag{7.4}$$

where ρ(·) is the spectral radius. For any K ∈ S_K, the matrices P(K) and X_ζ(K) are well defined and are, respectively, determined by the unique solutions to the Lyapunov equations

$$\mathcal{A}_K^*(P) \,=\, -Q - K^T R K, \qquad \mathcal{A}_K(X_\zeta) \,=\, -\zeta\zeta^T. \tag{7.5}$$

Here, A_K, A_K*: 𝕊ⁿ → 𝕊ⁿ,

$$\mathcal{A}_K(X) \,=\, (A - BK)\,X\,(A - BK)^T \,-\, X \tag{7.6a}$$
$$\mathcal{A}_K^*(P) \,=\, (A - BK)^T\,P\,(A - BK) \,-\, P \tag{7.6b}$$

determine the adjoint pair of invertible closed-loop Lyapunov operators acting on the set of symmetric matrices 𝕊ⁿ ⊂ R^{n×n}. The invertibility of A_K and A_K* for K ∈ S_K allows us to express the LQR objective function in (7.2) as

$$f(K) \,=\, \begin{cases} \langle Q + K^T R K,\, X(K)\rangle \,=\, \langle \Omega,\, P(K)\rangle, & K \in \mathcal{S}_K \\ \infty, & \text{otherwise} \end{cases}$$

where

$$X(K) \,:=\, \mathbb{E}[X_\zeta(K)] \,=\, -\mathcal{A}_K^{-1}(\Omega) \tag{7.7}$$

and Ω := E[ζζᵀ] is the covariance matrix of the initial condition. We assume Ω ≻ 0 to ensure that the random vector ζ ∼ D has energy in all directions. This condition guarantees f(K) = ∞ for all K ∉ S_K. Finally, it is well known that, for any K ∈ S_K, the cone of positive definite matrices is closed under the action of −A_K⁻¹ and −(A_K*)⁻¹. Thus, from the positive definiteness of the matrices Q + KᵀRK and Ω, it follows that P(K), X(K) ≻ 0 for all K ∈ S_K.

In (7.2), K is the optimization variable, and (A, B, Q ≻ 0, R ≻ 0, Ω ≻ 0) are the problem parameters. For any feedback gain K ∈ S_K, it can be shown that [162]

$$\nabla f_\zeta(K) \,=\, E(K)\,X_\zeta(K), \qquad \nabla f(K) \,=\, E(K)\,X(K) \tag{7.8a}$$

where

$$E(K) \,:=\, 2\,\big((R + B^T P(K) B)\,K \,-\, B^T P(K)\,A\big) \tag{7.8b}$$

is a fixed matrix that does not depend on the random initial condition ζ. Thus, the randomness of the gradient ∇f_ζ(K) arises from the random matrix X_ζ(K).

Remark 1 The LQR problem for continuous-time systems can be treated in a similar way. In this case, although the Lyapunov operator A_K has a different definition, the form of the objective function in terms of the matrices X(K) and P(K), as well as the form of the gradient in terms of X(K) and E(K), remains unchanged. While this similarity allows our results to hold for both continuous- and discrete-time systems, in this chapter we focus only on the latter and refer to [14] for a treatment of continuous-time systems.

7.3 Random search

The formulation of the LQR problem given by (7.2) has been studied for both continuous-time [83], [88] and discrete-time systems [13], [141]. In this chapter, we analyze the sample complexity and convergence properties of the random search method for solving problem (7.2) with unknown model parameters. At each iteration k ∈ N, the random search method calls Algorithm 2, which forms an empirical approximation $\overline{\nabla f}(K_k)$ to the gradient of the objective function via finite-time simulation of system (7.1a) for randomly perturbed feedback gains K_k ± rU_i, i = 1, ..., N.
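To make the objects in (7.5)–(7.8) and the simulation-based estimator concrete, the sketch below evaluates the exact cost and gradient from the discrete Lyapunov equations and compares them with a simple two-point finite-horizon estimate of the kind formed by Algorithm 2 (described in this section). The placeholder model, the standard-normal initial conditions, and the parameter choices are assumptions.

```python
# Sketch: exact f(K), grad f(K) from (7.5)-(7.8) versus a two-point finite-horizon estimate.
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

def cost_and_grad(A, B, Q, R, Omega, K):
    Acl = A - B @ K
    P = solve_discrete_lyapunov(Acl.T, Q + K.T @ R @ K)   # Acl' P Acl - P + Q + K'RK = 0
    X = solve_discrete_lyapunov(Acl, Omega)               # Acl X Acl' - X + Omega = 0
    E = 2 * ((R + B.T @ P @ B) @ K - B.T @ P @ A)         # E(K) in (7.8b)
    return np.trace(Omega @ P), E @ X

def finite_cost(A, B, Q, R, K, zeta, tau):
    x, c = zeta.copy(), 0.0
    for _ in range(tau + 1):
        u = -K @ x
        c += x @ Q @ x + u @ R @ u
        x = A @ x + B @ u
    return c

def two_point_estimate(A, B, Q, R, K, r, tau, N, rng):
    m, n = K.shape
    G = np.zeros_like(K)
    for _ in range(N):
        u = rng.standard_normal(m * n)
        U = (np.sqrt(m * n) * u / np.linalg.norm(u)).reshape(m, n)   # vec(U) on sqrt(mn) sphere
        zeta = rng.standard_normal(n)                                # zeta_i ~ D
        G += (finite_cost(A, B, Q, R, K + r * U, zeta, tau)
              - finite_cost(A, B, Q, R, K - r * U, zeta, tau)) * U
    return G / (2 * r * N)
```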
Algorithm 2 does not require knowledge of the matrices A and B, but only access to a two-point simulation engine. The two-point setting means that, for any pair of points K and K′, the simulation engine can return the random values f_{ζ,τ}(K) and f_{ζ,τ}(K′) for some random initial condition x₀ = ζ, where

$$f_{\zeta,\tau}(K) \,:=\, \sum_{t=0}^{\tau} x_t^T Q\, x_t \,+\, u_t^T R\, u_t \tag{7.9}$$

is the finite-time random function approximation associated with system (7.1a), starting from the random initial condition x₀ = ζ, with the state feedback u = −Kx applied up to time τ. This is in contrast to the one-point setting in which, at each query, the simulation engine can receive only one specified point K and return the random value f_{ζ,τ}(K).

Algorithm 2 (Gradient estimation)
Require: feedback gain K ∈ R^{m×n}, state and control weight matrices Q and R, distribution D, smoothing constant r, simulation time τ, number of random samples N.
for i = 1 to N do
  Define two perturbed feedback gains K_{i,1} := K + rU_i and K_{i,2} := K − rU_i, where vec(U_i) is a random vector uniformly distributed on the sphere √(mn)S^{mn−1}.
  Sample an initial condition ζ_i from distribution D.
  For j ∈ {1, 2}, simulate system (7.1a) up to time τ with the feedback gain K_{i,j} and initial condition ζ_i to form f_{ζ_i,τ}(K_{i,j}) as in Eq. (7.9).
end for
Ensure: the two-point gradient estimate

$$\overline{\nabla f}(K) \,:=\, \frac{1}{2rN}\sum_{i=1}^N \big(f_{\zeta_i,\tau}(K_{i,1}) - f_{\zeta_i,\tau}(K_{i,2})\big)\,U_i.$$

Starting from an initial feedback gain K₀ ∈ S_K, the random search method uses the gradient estimates obtained via Algorithm 2 to update the iterates according to

$$K_{k+1} \,:=\, K_k \,-\, \alpha\,\overline{\nabla f}(K_k), \qquad K_0 \,\in\, \mathcal{S}_K \tag{RS}$$

for some stepsize α > 0. The stabilizing assumption on the initial iterate K₀ ∈ S_K is required in our analysis, as we select the input parameters of Algorithm 2 and the stepsize so that all iterates satisfy K_k ∈ S_K. For convex problems, the gradient estimates obtained in the two-point setting are known to yield faster convergence rates than those of the one-point setting [163]. However, the two-point setting requires simulations of the system for two different feedback gain matrices under the same initial condition.

7.4 Main result

We analyze the sample complexity and convergence of the random search method (RS) in the model-free setting. Our main convergence result exploits two key properties of the LQR objective function f, namely smoothness and the Polyak-Łojasiewicz (PL) condition over its sublevel sets S_K(a) := {K ∈ S_K | f(K) ≤ a}, where a is a positive scalar. In particular, it can be shown that, restricted to any sublevel set S_K(a), the function f is L_f(a)-smooth and satisfies the PL condition with parameter μ_f(a), i.e.,

$$f(K') - f(K) \,\le\, \langle \nabla f(K),\, K' - K\rangle \,+\, \frac{L_f(a)}{2}\,\|K - K'\|_F^2$$
$$f(K) - f(K^\star) \,\le\, \frac{1}{2\mu_f(a)}\,\|\nabla f(K)\|_F^2$$

for all K and K′ such that the line segment between them belongs to S_K(a), where L_f(a) and μ_f(a) are positive rational functions of a. This result has been established for both continuous-time [88] and discrete-time [13], [141] LQR problems. We also make the following assumption on the statistical properties of the initial condition.

Assumption 1 (Initial distribution) Let the distribution D of the initial condition have i.i.d. zero-mean unit-variance entries with bounded sub-Gaussian norm.
For a random vector ζ ∈ Rⁿ distributed according to D, this implies E[ζ] = 0, E[ζζᵀ] = I, and ‖ζ_i‖_{ψ₂} ≤ κ for some constant κ and i = 1, ..., n, where ‖·‖_{ψ₂} denotes the sub-Gaussian norm [160].

We now state our main theoretical result.

Theorem 1 Consider the random search method (RS) that uses the gradient estimates of Algorithm 2 for finding the optimal solution K⋆ of problem (7.2). Let the initial condition x₀ ∼ D obey Assumption 1 and let the simulation time τ and the number of samples N in Algorithm 2 satisfy

$$\tau \,\ge\, \theta'(a)\,\log(1/\epsilon), \qquad N \,\ge\, c\,\big(1 + \beta^4\kappa^4\,\theta(a)\log^6 n\big)\,n,$$

for some β > 0 and a desired accuracy ε > 0. Then, we can choose a smoothing parameter r < θ″(a)√ε in Algorithm 2 such that, for any initial condition K₀ ∈ S_K(a), method (RS) with the constant stepsize α = 1/(ω(a)L_f(a)) achieves f(K_k) − f(K⋆) ≤ ε in at most

$$k \,\le\, \frac{-\log\big(\epsilon^{-1}\,(f(K_0) - f(K^\star))\big)}{\log\big(1 - \mu_f(a)\alpha/8\big)}$$

iterations. This holds with probability not smaller than 1 − c′k(n^{−β} + N^{−β} + Ne^{−n/8} + e^{−c′N}). Here, ω(a) := c″(√m + βκ²θ(a)√(mn) log n)², the positive scalars c, c′, and c″ are absolute constants, μ_f(a) and L_f(a) are the PL and smoothness parameters of f over the sublevel set S_K(a), and θ, θ′, and θ″ are positive polynomials that depend only on the parameters of the LQR problem.

For a desired accuracy level ε > 0, Theorem 1 shows that the random search iterates (RS) with a constant stepsize (that does not depend on ε) reach the accuracy level ε at a linear rate (i.e., in at most O(log(1/ε)) iterations) with high probability. Furthermore, the total number of function evaluations and the simulation time required to achieve an accuracy level ε are proportional to log(1/ε). As stated earlier, this significantly improves the existing results for discrete-time LQR [13], [132], which require O(1/ε) function evaluations and poly(1/ε) simulation time.

7.5 Proof sketch

In this section, we present a sketch of our proof strategy for the main result of the chapter. The smoothness of the objective function along with the PL condition is sufficient for the gradient descent method with a suitable stepsize α,

$$K_{k+1} \,:=\, K_k \,-\, \alpha\,\nabla f(K_k), \qquad K_0 \,\in\, \mathcal{S}_K \tag{GD}$$

to achieve linear convergence even for nonconvex problems [89]. These properties were recently used to show convergence of gradient descent for both discrete-time [13] and continuous-time [88] LQR problems. In the model-free setting, the gradient descent method is not directly implementable because computing the gradient ∇f(K) requires knowledge of the system parameters A and B. The random search method (RS) resolves this issue by using the gradient estimate $\overline{\nabla f}(K)$ obtained via Algorithm 2.

One approach to the convergence analysis of random search is to first use a large number of samples N in order to make the estimation error small, and then relate the iterates of (RS) to those of gradient descent. It has been shown that achieving $\|\overline{\nabla f}(K) - \nabla f(K)\|_F \le \epsilon$ takes N = O(1/ε⁴) samples [13]; see also [15, Theorem 3] for the continuous-time LQR. This upper bound unfortunately leads to a sample complexity bound that grows polynomially with 1/ε. To improve this result, we take an alternative route and give up on the objective of controlling the gradient estimation error.
In particular, by exploiting the problem structure, we show that with a fixed number of samples N = Õ(n), where n denotes the number of states, the estimate $\overline{\nabla f}(K)$ concentrates with high probability when projected onto the direction of ∇f(K).

In what follows, we first establish that, for any ε > 0, using a simulation time τ = O(log(1/ε)) and an appropriate smoothing parameter r in Algorithm 2, the estimate $\overline{\nabla f}(K)$ can be made ε-close to an unbiased estimate $\widehat{\nabla f}(K)$ of the gradient with high probability,

$$\|\overline{\nabla f}(K) \,-\, \widehat{\nabla f}(K)\|_F \,\le\, \epsilon,$$

where the definition of $\widehat{\nabla f}(K)$ is given in Eq. (7.12). We call this distance the estimation bias. We then show that, for a large number of samples N, our unbiased estimate $\widehat{\nabla f}(K)$ becomes highly correlated with the gradient. In particular, we establish that the following two events

$$\mathcal{M}_1 \,:=\, \big\{ \langle \widehat{\nabla f}(K),\, \nabla f(K)\rangle \,\ge\, \mu_1\,\|\nabla f(K)\|_F^2 \big\} \tag{7.10a}$$
$$\mathcal{M}_2 \,:=\, \big\{ \|\widehat{\nabla f}(K)\|_F^2 \,\le\, \mu_2\,\|\nabla f(K)\|_F^2 \big\} \tag{7.10b}$$

occur with high probability for some positive scalars μ₁ and μ₂. To justify the definition of these events, let us first demonstrate that the gradient estimate $\widehat{\nabla f}(K)$ can be used to decrease the objective error by a geometric factor if both M₁ and M₂ occur.

Figure 7.1: The intersection of the half-space and the ball parameterized by μ₁ and μ₂, respectively, in Proposition 1. If an update direction G lies within this region, then taking one step along −G with a constant stepsize α yields a geometric decrease in the objective value.

Proposition 1 If G ∈ R^{m×n} and K ∈ S_K(a) are such that ⟨G, ∇f(K)⟩ ≥ μ₁‖∇f(K)‖²_F and ‖G‖²_F ≤ μ₂‖∇f(K)‖²_F for some scalars μ₁, μ₂ > 0, then K − αG ∈ S_K(a) for all α ∈ [0, μ₁/(μ₂L_f(a))], and

$$f(K - \alpha G) - f(K^\star) \,\le\, \big(1 - \mu_f(a)\,\mu_1\alpha\big)\,\big(f(K) - f(K^\star)\big),$$

where L_f(a) and μ_f(a) are the smoothness and PL parameters of f over S_K(a).

Proposition 1 demonstrates that, conditioned on the events M₁ and M₂, the unbiased estimate $\widehat{\nabla f}(K)$ yields a simple descent-based algorithm with linear convergence. Figure 7.1 illustrates the region parameterized by μ₁ and μ₂ in Proposition 1. This region has a different geometry than an ε-neighborhood of the gradient: a gradient estimate G can have an error of the order of ‖∇f(K)‖_F and still belong to this region. We leverage this fact in our convergence analysis, which only requires the gradient estimate $\widehat{\nabla f}(K)$ to lie in such a region for certain parameters μ₁ and μ₂, and not necessarily within an ε-neighborhood of the gradient.

7.5.1 Controlling the bias

Herein, we define the unbiased estimate $\widehat{\nabla f}(K)$ of the gradient and establish an upper bound on its distance to the output $\overline{\nabla f}(K)$ of Algorithm 2:

$$\overline{\nabla f}(K) \,:=\, \frac{1}{2rN}\sum_{i=1}^N \big(f_{\zeta_i,\tau}(K + rU_i) - f_{\zeta_i,\tau}(K - rU_i)\big)\,U_i$$
$$\widetilde{\nabla f}(K) \,:=\, \frac{1}{2rN}\sum_{i=1}^N \big(f_{\zeta_i}(K + rU_i) - f_{\zeta_i}(K - rU_i)\big)\,U_i$$
$$\widehat{\nabla f}(K) \,:=\, \frac{1}{N}\sum_{i=1}^N \langle \nabla f_{\zeta_i}(K),\, U_i\rangle\, U_i \tag{7.12}$$

Here, U_i ∈ R^{m×n} are i.i.d. random matrices whose vectorized forms vec(U_i) are uniformly distributed on the sphere √(mn)S^{mn−1}, and ζ_i ∈ Rⁿ are i.i.d. random initial conditions sampled from distribution D. Note that $\widetilde{\nabla f}(K)$ is the infinite-horizon version of $\overline{\nabla f}(K)$ and $\widehat{\nabla f}(K)$ is an unbiased estimate of ∇f(K). The fact that $\mathbb{E}[\widehat{\nabla f}(K)] = \nabla f(K)$ follows from

$$\mathbb{E}_{\zeta_i,U_i}\big[\mathrm{vec}(\widehat{\nabla f}(K))\big] \,=\, \mathbb{E}_{U_1}\big[\langle \nabla f(K), U_1\rangle\,\mathrm{vec}(U_1)\big] \,=\, \mathbb{E}_{U_1}\big[\mathrm{vec}(U_1)\,\mathrm{vec}(U_1)^T\big]\,\mathrm{vec}(\nabla f(K)) \,=\, \mathrm{vec}(\nabla f(K)).$$
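The second-moment identity E[vec(U₁)vec(U₁)ᵀ] = I that underlies the last step of this unbiasedness argument can be checked by a quick Monte Carlo simulation; the dimensions and sample count below are illustrative only.

```python
# Sketch: Monte Carlo check that E[vec(U) vec(U)^T] = I for vec(U) uniform on the sqrt(mn) sphere.
import numpy as np

m, n, trials = 3, 4, 200000
rng = np.random.default_rng(1)
S = np.zeros((m * n, m * n))
for _ in range(trials):
    u = rng.standard_normal(m * n)
    u = np.sqrt(m * n) * u / np.linalg.norm(u)
    S += np.outer(u, u)
print(np.linalg.norm(S / trials - np.eye(m * n)))   # should be close to zero
```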
Local boundedness of the function f(K): An important requirement for the gradient estimation scheme in Algorithm 2 is the stability of the perturbed closed-loop systems, i.e., K ± rU_i ∈ S_K; violating this condition leads to an exponential growth of the state and control signals. Moreover, this condition is necessary and sufficient for $\widetilde{\nabla f}(K)$ to be well defined. It can be shown that for any sublevel set S_K(a) there exists a positive radius r such that K + rU ∈ S_K for all K ∈ S_K(a) and U ∈ R^{m×n} with ‖U‖_F ≤ √(mn). In this chapter, we further require that r is small enough so that K ± rU_i ∈ S_K(2a) for all K ∈ S_K(a). Such an upper bound on r can be obtained using the upper bound on the cost difference established in [13, Lemma 24]. A similar result has been established for the continuous-time LQR problem using the small-gain theorem and the KYP lemma [14].

Lemma 1 For any K ∈ S_K(a) and U ∈ R^{m×n} with ‖U‖_F ≤ √(mn), K + r(a)U ∈ S_K(2a), where r(a) := c̃/a for some constant c̃ > 0 that depends on the problem data.

Note that for any K ∈ S_K(a) and r ≤ r(a) in Lemma 1, $\widetilde{\nabla f}(K)$ is well defined since the feedback gains K ± rU_i are all stabilizing.

We next establish an upper bound on the difference between the output $\overline{\nabla f}(K)$ of Algorithm 2 and the unbiased estimate $\widehat{\nabla f}(K)$ of the gradient ∇f(K). We accomplish this by bounding the difference between these two quantities and $\widetilde{\nabla f}(K)$ using the triangle inequality

$$\|\widehat{\nabla f}(K) - \overline{\nabla f}(K)\|_F \,\le\, \|\widetilde{\nabla f}(K) - \overline{\nabla f}(K)\|_F \,+\, \|\widehat{\nabla f}(K) - \widetilde{\nabla f}(K)\|_F. \tag{7.13}$$

Proposition 2 provides an upper bound on each term on the right-hand side of the above inequality.

Proposition 2 For any K ∈ S_K(a) and r ≤ r(a), where r(a) is given by Lemma 1,

$$\|\widetilde{\nabla f}(K) - \overline{\nabla f}(K)\|_F \,\le\, \frac{\sqrt{mn}\,\eta}{r}\,\kappa_1(2a)\,\big(1 - \kappa_2(2a)\big)^{\tau}$$
$$\|\widehat{\nabla f}(K) - \widetilde{\nabla f}(K)\|_F \,\le\, \frac{(rmn)^2\,\eta}{2}\,\ell(2a)$$

where η := max_i ‖ζ_i‖², and ℓ(a) > 0, κ₁(a) > 0, and 1 > κ₂(a) > 0 are rational functions that depend on the problem data.

The first term on the right-hand side of (7.13) corresponds to the bias arising from the finite-time simulation. Proposition 2 shows that, although small values of r may result in a large $\|\widetilde{\nabla f}(K) - \overline{\nabla f}(K)\|_F$, because of the exponential dependence of the upper bound on the simulation time τ, this error can be controlled by increasing τ. In addition, since $\widehat{\nabla f}(K)$ is independent of the parameter r, this result provides a quadratic bound on the estimation error in terms of r. It is also worth mentioning that the third derivative of the function f_ζ(K) is utilized in obtaining the second inequality.

7.5.2 Correlation of $\widehat{\nabla f}(K)$ and ∇f(K)

We establish that, under Assumption 1 on the initial distribution, with a large enough number of samples N = Õ(n), the events M₁ and M₂ with μ₁ := 1/4 and

$$\mu_2 \,:=\, C\,m\left( \beta\kappa^2\,\frac{\|(\mathcal{A}_K^*)^{-1}\|_2 + \|(\mathcal{A}_K^*)^{-1}\|_S}{\lambda_{\min}(X(K))}\,\sqrt{n}\,\log n \,+\, 1 \right)^2 \tag{7.14}$$

occur with high probability, where κ is an upper bound on the ψ₂-norm of the entries of ζ_i, β > 0 is a parameter that determines the failure probability, C is a positive absolute constant, and, for an operator M,

$$\|\mathcal{M}\|_2 \,:=\, \sup_{M} \frac{\|\mathcal{M}(M)\|_F}{\|M\|_F}, \qquad \|\mathcal{M}\|_S \,:=\, \sup_{M} \frac{\|\mathcal{M}(M)\|_2}{\|M\|_2}.$$

We note that these parameters do not depend on the desired accuracy level ε. Moreover, since the sublevel sets of the function f(K) are compact [141], ‖(A_K*)⁻¹‖ is a continuous function of K, and X(K) ⪰ Ω, we can uniformly upper bound μ₂ over any sublevel set S_K(a). Such a bound has also been discussed and analytically quantified for the continuous-time LQR problem [14].

Our approach to accomplishing the above task exploits the problem structure, which allows for confining the dependence of $\widehat{\nabla f}(K)$ on the random initial conditions ζ_i into the zero-mean random matrices X_{ζ_i} − X, where X_{ζ_i} := X_{ζ_i}(K) and X := X(K) are given by (7.3) and (7.7), respectively. In particular, for any given feedback gain K ∈ S_K, we can use the form of the gradient (7.8) to write

$$\widehat{\nabla f}(K) \,=\, \frac{1}{N}\sum_{i=1}^N \langle E\,X_{\zeta_i},\, U_i\rangle\, U_i \,=\, \widehat{\nabla}_1 \,+\, \widehat{\nabla}_2$$

where $\widehat{\nabla}_1 := (1/N)\sum_{i=1}^N \langle E(X_{\zeta_i} - X), U_i\rangle U_i$, $\widehat{\nabla}_2 := (1/N)\sum_{i=1}^N \langle \nabla f(K), U_i\rangle U_i$, and the matrix E := E(K) is given by (7.8b). It is now easy to verify that $\mathbb{E}[\widehat{\nabla}_1] = 0$ and $\mathbb{E}[\widehat{\nabla}_2] = \nabla f(K)$. Furthermore, only the term $\widehat{\nabla}_1$ depends on the initial conditions ζ_i.
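The two quantities that define M₁ and M₂ can also be estimated empirically, in the spirit of the histograms reported later in Section 7.6. The self-contained sketch below (placeholder model, standard-normal initial conditions, illustrative sample sizes) computes $\langle \widehat{\nabla f}(K), \nabla f(K)\rangle/\|\nabla f(K)\|_F^2$ and $\|\widehat{\nabla f}(K)\|_F^2/\|\nabla f(K)\|_F^2$ over repeated trials.

```python
# Sketch: empirical check of the correlation quantities behind M1 and M2 in (7.10).
import numpy as np
from scipy.linalg import solve_discrete_lyapunov, solve_discrete_are

rng = np.random.default_rng(0)
n, m, N, trials = 10, 3, 10, 200
A = rng.standard_normal((n, n)) / (2 * np.sqrt(n))
B = rng.standard_normal((n, m))
Q, R, Omega = np.eye(n), np.eye(m), np.eye(n)
P0 = solve_discrete_are(A, B, Q, R)
K = np.linalg.solve(R + B.T @ P0 @ B, B.T @ P0 @ A)      # a stabilizing gain

Acl = A - B @ K
P = solve_discrete_lyapunov(Acl.T, Q + K.T @ R @ K)
E = 2 * ((R + B.T @ P @ B) @ K - B.T @ P @ A)
grad = E @ solve_discrete_lyapunov(Acl, Omega)           # exact gradient (7.8)
g2 = np.linalg.norm(grad, 'fro') ** 2

ratios = []
for _ in range(trials):
    G = np.zeros_like(K)
    for _ in range(N):
        zeta = rng.standard_normal(n)
        Xz = solve_discrete_lyapunov(Acl, np.outer(zeta, zeta))      # X_zeta(K) from (7.3)
        u = rng.standard_normal(m * n)
        U = (np.sqrt(m * n) * u / np.linalg.norm(u)).reshape(m, n)
        G += np.sum((E @ Xz) * U) * U
    G /= N
    ratios.append((np.sum(G * grad) / g2, np.linalg.norm(G, 'fro') ** 2 / g2))
print(min(r[0] for r in ratios), max(r[1] for r in ratios))
```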
Suc h b ound has also b een discussed and analytically quan tied for the con tin uous-time LQR problem [14]. Our approac h to accomplishing the ab o v e task exploits the problem structure, whic h allo ws for conning the dep endence of b ∇f(K) on the random initial conditions ζ i in to the zero-mean random matrices X ζ i − X , where X ζ i := X ζ i(K) and X := X(K) are giv en b y (7.3) and (7.7), resp ectiv ely . In particular, for an y giv en feedbac k gain K ∈S K , w e can use the form of gradien t (7.8) to write b ∇f(K) = 1 N N X i=1 EX ζ i,U i U i = b ∇ 1 + b ∇ 2 where b ∇ 1 := (1/N) P N i=1 E(X ζ i− X),U i U i , b ∇ 2 := (1/N) P N i=1 ⟨∇f(K),U i ⟩U i , and the matrix E := E(K) is giv en b y (7.8b). It is no w easy to v erify that E[ b ∇ 1 ] = 0 and E[ b ∇ 2 ] = ∇f(K). F urthermore, only the term b ∇ 1 dep ends on the initial conditions ζ i . 182 7.5.2.1 Quan tifying the probabilit y of M 1 W e exploit results from mo dern high-dimensional statistics on the non-asymptotic analysis of the concen tration of random quan tities around their mean [160]. Our approac h to analyzing the ev en t M 1 consists of t w o steps. First, w e establish that the zero-mean random v ariable D b ∇ 1 ,∇f(K) E highly concen trates around zero with a large enough n um b er of samples N = ˜ O(n). Our pro of tec hnique relies on the Hanson-W righ t inequalit y [164, Theorem 1.1]. Next, w e study the concen tration of the random v ariable D b ∇ 2 ,∇f(K) E around its mean ∥∇f(K)∥ 2 F . The k ey enabler here is the Bernstein inequalit y [160, Corollary 2.8.3]. This leads to the next prop osition. Prop osition 3 Under Assumption 1, for any stabilizing fe e db ack gain K∈S K and p ositive sc alar β , if N ≥ C 1 β 4 κ 4 λ 2 min (X) ∥(A ∗ K ) − 1 ∥ 2 +∥(A ∗ K ) − 1 ∥ S 2 nlog 6 n then the event M 1 in (7.10) with µ 1 :=1/4 satises P(M 1 ) ≥ 1− C 2 N − β − 4Ne − n 8 − 2e − C 3 N . 7.5.2.2 Quan tifying the probabilit y of M 2 Similarly , w e analyze the ev en t M 2 in t w o steps. W e establish upp er b ounds on the ratio ∥ b ∇ i ∥ F /∥∇f(K)∥ F , for i = {1,2}, that hold with high probabilit y , and use the triangle inequalit y ∥ b ∇ 1 ∥ F ∥∇f(K)∥ F + ∥ b ∇ i ∥ F ∥∇f(K)∥ F ≥ ∥ b ∇f(K)∥ F ∥∇f(K)∥ F . Our results are summarized in the next prop osition. Prop osition 4 Under Assumption 1, for any K ∈ S K , sc alar β > 0, and N ≥ C 4 n, the event M 2 in (7.10) with µ 2 given by (7.14) satises P(M 2 )≥ 1− C 6 (n − β +Ne − n 8 +e − C 7 N ). 183 (a) (b) ∥ b ∇f(K)− ∇f(K)∥ F ∥ b ∇f(K)∥ F τ ∥∇f(K)− ∇f(K)∥ F ∥∇f(K)∥ F τ (c) f(K k )− f(K ⋆ ) f(K 0 )− f(K ⋆ ) k Figure 7.2: (a) Bias in gradien t estimation; (b) total error in gradien t estimation as functions of the sim ulation time τ . The blue and red curv es corresp ond to t w o v alues of the smo othing parameter r =10 − 4 and r =10 − 6 , resp ectiv ely . (c) Con v ergence curv e of the random searc h metho d (RS). 7.6 Computational experiments W e consider a system with s = 10 in v erted p endula on force-con trolled carts that are connected b y springs and damp ers; see Fig. 7.3. W e set all masses, p endula lengths, spring and damping constan ts to unit y and let the state v ector x := [θ T ω T p T v T ] T con tain the angle and angular v elo cit y of p endula as w ell as p osition and v elo cit y of masses. Linearizing around the equilibrium p oin t yields the con tin uous-time system ˙ x=A c x+B c u, where A c = 0 I 0 0 20I 0 T T 0 0 0 I − 10I 0 − T − T , B c = 0 − I 0 I . 
Here, 0 and I are s×s zero and identity matrices, and T is a Toeplitz matrix with 2 on the main diagonal, −1 on the first upper and lower sub-diagonals, and zero elsewhere.

Figure 7.3: An interconnected system of inverted pendula on carts.

Figure 7.4: Histograms of the two algorithmic quantities $\langle \widehat{\nabla f}(K), \nabla f(K)\rangle/\|\nabla f(K)\|_F^2$ and $\|\widehat{\nabla f}(K)\|_F^2/\|\nabla f(K)\|_F^2$ associated with the events M₁ and M₂ given by (7.10). The red lines demonstrate that M₁ with μ₁ = 0.1 and M₂ with μ₂ = 35 occur in more than 99% of trials.

We discretize this system with sampling time t_s = 0.1, which yields Eq. (7.1a) with A = e^{A_c t_s} and B = ∫₀^{t_s} e^{A_c t} B_c dt. Since the open-loop system is unstable, we use the stabilizing feedback gain K₀ = [−50I −10I −5I −5I] as a starting point for the random search method and choose Q = blkdiag(10I, I, I, I) and R = I in the LQR cost. We also let the initial conditions ζ_i in Algorithm 2 be standard normal and use N = n = 2s samples.

Figure 7.2 (a) illustrates the dependence of the relative error $\|\widehat{\nabla f}(K) - \overline{\nabla f}(K)\|_F/\|\widehat{\nabla f}(K)\|_F$ on the simulation time τ for K = K₀ = [−50I −10I −5I −5I] and two values of the smoothing parameter, r = 10⁻⁴ (blue) and r = 10⁻⁶ (red). We see an exponential decrease in error for small values of τ and note that the error does not pass a saturation level determined by the smoothing parameter r > 0. We also observe that as r decreases, this saturation level becomes smaller. These observations are in harmony with the results established in Proposition 2. This should be compared and contrasted with Fig. 7.2 (b), which demonstrates that the relative error with respect to the true gradient does not vanish as the simulation time τ increases. In spite of this significant error, the key observation that allows us to establish the linear convergence of the random search method in Theorem 1 is that the gradient estimate is highly correlated with the true gradient. Figure 7.4 shows histograms of the two algorithmic quantities associated with the events M₁ and M₂ given by (7.10). The red lines demonstrate that M₁ with μ₁ = 0.1 and M₂ with μ₂ = 35 occur in more than 99% of trials; cf. Propositions 3 and 4. Figure 7.2 (c) illustrates the convergence curve of the random search method (RS) with stepsize α = 10⁻⁵, r = 10⁻⁵, and τ = 1000 in Algorithm 2. This figure confirms the linear convergence of (RS) established in Theorem 1.

7.7 Concluding remarks

In this chapter, we studied the convergence and sample complexity of the random search method with two-point gradient estimates for the discrete-time LQR problem. Despite nonconvexity, we established that the random search method with a fixed number of roll-outs N = Õ(n) per iteration achieves ε-accuracy in O(log(1/ε)) iterations. This significantly improves existing results on the model-free LQR, which require O(1/ε) total roll-outs. Our ongoing research directions include: (i) providing theoretical guarantees for the convergence of gradient-based methods for sparsity-promoting and structured control synthesis [71]; and (ii) extension to nonlinear systems via successive linearization techniques.
Chapter 8
Lack of gradient domination for linear quadratic Gaussian problems with incomplete state information

Policy gradient algorithms in model-free reinforcement learning have been shown to achieve global exponential convergence for the Linear Quadratic Regulator problem despite the lack of convexity. However, extending such guarantees beyond the scope of standard LQR and full-state feedback has remained open. A key enabler for the existing results on LQR is the so-called gradient dominance property of the underlying optimization problem, which can be used as a surrogate for strong convexity. In this chapter, we take a step further by studying the convergence of gradient descent for the Linear Quadratic Gaussian problem and demonstrate through examples that LQG does not satisfy the gradient dominance property. Our study shows the non-uniqueness of equilibrium points and thus disproves the global convergence of policy gradient methods for LQG.

8.1 Introduction

Modern reinforcement learning algorithms have shown great empirical performance in solving continuous control problems [18] with unknown dynamics. However, despite the recent surge in research, the convergence and sample complexity of these methods are not yet fully understood. This has recently motivated a significant body of literature on data-driven control to focus on the Linear Quadratic Regulator (LQR) problem with unknown model
This problem is also closely related to the output-feedback problem for distributed control, which is known to be fundamentally more challenging than LQR. In particular, the output-feedback problem has been shown to involve an optimization domain with an exponential number of connected components [84], [90]. In contrast, the standard LQG problem allows for dynamic controllers and does not impose structural constraints on the controller. Motivated by the convergence properties of gradient descent on LQR, we reformulate the LQG problem as a joint optimization of the control and observer feedback gains whose domain, unlike that of the output-feedback problem, is connected. We derive analytical expressions for the gradient of the LQG cost function with respect to the gain matrices and demonstrate through examples that LQG does not satisfy the gradient dominance property. In particular, we show that, in addition to the global solution, the gradient vanishes at the origin for open-loop stable systems. Our study disproves global exponential convergence of policy gradient methods for LQG. An analysis of the optimization landscape of the LQG problem with unknown system parameters has also been recently provided in [165], where the authors relate the existence of multiple equilibrium points to the non-minimality of the controller transfer function.

The rest of the chapter is structured as follows. In Section 8.2, we formulate the LQG problem and provide background information. In Section 8.3, we derive an analytical expression for the gradient. In Section 8.4, we discuss the lack of gradient domination and non-uniqueness of equilibrium points. We present numerical experiments in Section 8.5 and finally provide concluding remarks in Section 8.6.

8.2 Linear Quadratic Gaussian

Consider the stochastic LTI system

\dot{x} = Ax + Bu + w,  y = Cx + v    (8.1a)

where x(t) ∈ R^n is the state, u(t) ∈ R^m is the control input, y(t) ∈ R^p is the measured output, A, B, and C are constant matrices, and w(t) and v(t) are independent zero-mean Gaussian white noise processes with covariance functions E[w(t)w^T(τ)] = δ(t − τ)Σ_w and E[v(t)v^T(τ)] = δ(t − τ)Σ_v. Here, δ is the Dirac delta (impulse) function and we assume that Σ_w, Σ_v ≻ 0 are positive definite matrices. The Linear Quadratic Gaussian (LQG) problem associated with system (8.1a) is given by

minimize_{u(t) ∈ Y(t)}  lim_{t→∞} E[ x^T(t) Q x(t) + u^T(t) R u(t) ]    (8.1b)

where Q and R are positive definite matrices and Y(t) is the set of functions that depend only on the available information up to time t, i.e., the measured outputs y(s) with s ≤ t.

8.2.1 Separation principle

It is well known that if the pair (A, B) is controllable and (A, C) is observable, the solution to (8.1) is given by an observer-based controller of the form

\dot{\hat{x}} = A\hat{x} + Bu − L(\hat{y} − y),  \hat{y} = C\hat{x},  u = −K\hat{x}    (8.2)

where \hat{x}(t) ∈ R^n is the state estimate, and L ∈ R^{n×p} and K ∈ R^{m×n} are the observer and controller feedback gain matrices, respectively [83], [166].
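Before turning to the optimal gains, a minimal simulation sketch of the interconnection of plant (8.1a) and observer-based controller (8.2) may help make the formulation concrete. It is an illustration only, assuming an Euler-Maruyama discretization with step dt, a finite horizon standing in for the steady-state limit, and generic positive definite noise covariances; none of these choices come from the text.

```python
# A minimal sketch of estimating the LQG cost (8.1b) empirically by simulating the
# plant (8.1a) in closed loop with the observer-based controller (8.2).
import numpy as np

def lqg_cost_estimate(A, B, C, K, L, Q, R, Sigma_w, Sigma_v,
                      dt=1e-3, T=200.0, burn_in=50.0, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    n, p = A.shape[0], C.shape[0]
    x, xhat = np.zeros(n), np.zeros(n)
    sqrt_w = np.linalg.cholesky(Sigma_w)
    sqrt_v = np.linalg.cholesky(Sigma_v)
    steps, skip = int(T / dt), int(burn_in / dt)
    running, count = 0.0, 0
    for k in range(steps):
        u = -K @ xhat
        # white noise with intensities Sigma_w and Sigma_v (Euler-Maruyama scaling)
        w = sqrt_w @ rng.standard_normal(n) / np.sqrt(dt)
        v = sqrt_v @ rng.standard_normal(p) / np.sqrt(dt)
        y = C @ x + v
        # forward-Euler updates of the plant and the observer
        x = x + dt * (A @ x + B @ u + w)
        xhat = xhat + dt * (A @ xhat + B @ u - L @ (C @ xhat - y))
        if k >= skip:
            running += x @ Q @ x + u @ R @ u
            count += 1
    return running / count
```

With stabilizing gains K and L and a small enough step dt, the returned value approximates the steady-state cost in (8.1b).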
The separation principle states that the optimal gains K^⋆ and L^⋆ correspond to solutions of two decoupled problems associated with (8.1), namely the linear quadratic regulator

minimize_K  lim_{t→∞} E[ x^T(t) Q x(t) + u^T(t) R u(t) ]    (8.3)

subject to (8.1a) with the full-state feedback u = −Kx, and the Kalman filter, which seeks to

minimize_L  lim_{t→∞} E[ ∥e(t)∥² ]    (8.4a)

subject to the error dynamics

\dot{e} = (A − LC)e − Lv + w    (8.4b)

where e := x − \hat{x} is the state estimation error. The solutions to these two problems (and also to the original LQG problem) are given by

K^⋆ = R^{-1} B^T P_c^⋆,  L^{⋆T} = Σ_v^{-1} C X_o^⋆    (8.5)

where P_c^⋆ and X_o^⋆ are the unique solutions to the decoupled pair of Algebraic Riccati Equations (AREs)

A^T P_c^⋆ + P_c^⋆ A + Q − P_c^⋆ B R^{-1} B^T P_c^⋆ = 0
A X_o^⋆ + X_o^⋆ A^T + Σ_w − X_o^⋆ C^T Σ_v^{-1} C X_o^⋆ = 0.

8.2.2 Characterization based on gain matrices

In this chapter, we analyze the LQG problem as an optimization over the feedback gain matrices K and L. In particular, the closed-loop dynamics in (8.1a) and (8.2) can be jointly described by

\dot{ξ} = A_L ξ + μ    (8.6)

where ξ := [x^T  e^T]^T ∈ R^{2n} consists of the state and error signals, μ := [w^T  w^T − v^T L^T]^T is white noise, and the closed-loop matrix A_L is given by

A_L := \begin{bmatrix} A − BK & BK \\ 0 & A − LC \end{bmatrix}.    (8.7)

The closed-loop representation given by (8.6) allows us to reformulate the LQG problem as an optimization over the set S_c × S_o of stabilizing gain matrices, where

S_c := { K ∈ R^{m×n} | A − BK is Hurwitz },  S_o := { L ∈ R^{n×p} | A − LC is Hurwitz }.    (8.8)

In particular, the LQG problem in (8.1b) amounts to

minimize_{K,L}  f(K,L) := ⟨Ω, X⟩    (8.9)

where X = lim_{t→∞} E[ξ(t)ξ^T(t)] is the steady-state covariance matrix associated with the closed-loop system (8.6) and it can be determined by solving the algebraic Lyapunov equation

A_L X + X A_L^T + Σ = 0.    (8.10)

Here, the positive semi-definite matrices Ω and Σ are given by

Ω := \begin{bmatrix} Q + K^T R K & −K^T R K \\ −K^T R K & K^T R K \end{bmatrix}    (8.11a)
Σ := \begin{bmatrix} Σ_w & Σ_w \\ Σ_w & Σ_w + L Σ_v L^T \end{bmatrix}.    (8.11b)

The matrix Ω accounts for the weight matrices in the cost function (8.1b) and the matrix Σ determines the covariance function Σ δ(t − τ) of μ.

8.3 Gradient method

In this section, we introduce the gradient method for the LQG objective function over the set of stabilizing gain matrices S_c × S_o and discuss its convergence properties.

Lemma 1  For any stabilizing pair of gain matrices (K, L) ∈ S_c × S_o, the gradient of the LQG objective function f in (8.9) is given by

∇_K f(K,L) = 2(RK − B^T \hat{P}_1) \hat{X}_1 − 2 B^T \hat{P}_2 \hat{X}_2^T
∇_L f(K,L) = 2 P_3 (L Σ_v − X_3 C^T) − 2 P_2^T X_2 C^T

where the matrices

X = \begin{bmatrix} X_1 & X_2 \\ X_2^T & X_3 \end{bmatrix},  \hat{X} = \begin{bmatrix} \hat{X}_1 & \hat{X}_2 \\ \hat{X}_2^T & \hat{X}_3 \end{bmatrix},  P = \begin{bmatrix} P_1 & P_2 \\ P_2^T & P_3 \end{bmatrix},  \hat{P} = \begin{bmatrix} \hat{P}_1 & \hat{P}_2 \\ \hat{P}_2^T & \hat{P}_3 \end{bmatrix}    (8.12)

are the unique solutions to the Lyapunov equations

A_L X + X A_L^T + Σ = 0    (8.13a)
\hat{A}_L \hat{X} + \hat{X} \hat{A}_L^T + \hat{Σ} = 0    (8.13b)
A_L^T P + P A_L + Ω = 0    (8.13c)
\hat{A}_L^T \hat{P} + \hat{P} \hat{A}_L + \hat{Ω} = 0.    (8.13d)

Here, the matrices A_L, Ω, and Σ are given by (8.7) and (8.11), respectively, and

\hat{A}_L := \begin{bmatrix} A − BK & LC \\ 0 & A − LC \end{bmatrix}    (8.14a)
\hat{Ω} := \begin{bmatrix} Q + K^T R K & Q \\ Q & Q \end{bmatrix}    (8.14b)
\hat{Σ} := \begin{bmatrix} L Σ_v L^T & −L Σ_v L^T \\ −L Σ_v L^T & Σ_w + L Σ_v L^T \end{bmatrix}.    (8.14c)

Proof: To obtain ∇_L f(K,L), we use the Taylor series expansion of f(K, L + \tilde{L}) around (K, L) and collect first-order terms.
From (8.9), we have

f(K, L + \tilde{L}) − f(K, L) ≈ ⟨∇_L f(K,L), \tilde{L}⟩ = ⟨Ω, \tilde{X}⟩    (8.15a)

where \tilde{X} is the unique solution to

A_L \tilde{X} + \tilde{X} A_L^T = −\tilde{A}_L X − X \tilde{A}_L^T − \tilde{Σ} = \begin{bmatrix} 0 & X_2 C^T \tilde{L}^T \\ \tilde{L} C X_2^T & \tilde{L} C X_3 + X_3 C^T \tilde{L}^T \end{bmatrix} − \tilde{Σ} =: Φ.    (8.15b)

Here, the first equality is obtained by differentiating the Lyapunov equation (8.10), and the second follows by noting that

\tilde{A}_L = \begin{bmatrix} 0 & 0 \\ 0 & −\tilde{L} C \end{bmatrix},  \tilde{Σ} = \begin{bmatrix} 0 & 0 \\ 0 & \tilde{L} Σ_v L^T + L Σ_v \tilde{L}^T \end{bmatrix}.

Using the adjoint identity and (8.15), we obtain that ⟨∇_L f(K,L), \tilde{L}⟩ = ⟨−Φ, P⟩, where P is given by (8.13c). Rearranging terms completes the proof for ∇_L f(K,L).

In order to obtain ∇_K f(K,L), we use a slightly different representation of the objective function. In particular, if we let \hat{ξ} := [\hat{x}^T  e^T]^T, it is easy to verify that the closed-loop system satisfies \dot{\hat{ξ}} = \hat{A}_L \hat{ξ} + \hat{μ}, where the closed-loop matrix \hat{A}_L is given by (8.14a) and \hat{μ} = [v^T L^T  w^T − v^T L^T]^T. Furthermore, it is straightforward to verify that for any stabilizing gain matrices K ∈ S_c and L ∈ S_o, the LQG cost in (8.1b) is given by

f(K,L) := ⟨\hat{Ω}, \hat{X}⟩    (8.16)

where \hat{X} = lim_{t→∞} E[\hat{ξ}(t)\hat{ξ}^T(t)] is the unique solution to the algebraic Lyapunov equation (8.13b) and the matrices \hat{Ω} and \hat{Σ} are given by (8.14). Now, using this representation, the same technique as in the first part of the proof can be used to obtain ∇_K f(K,L). This completes the proof. □

Using the explicit formula for the gradient in Lemma 1, the gradient descent method over the set of stabilizing gain matrices S_c × S_o follows the update rule

K^{k+1} := K^k − α ∇_K f(K^k, L^k),  K^0 ∈ S_c
L^{k+1} := L^k − α ∇_L f(K^k, L^k),  L^0 ∈ S_o    (GD)

where α > 0 is the stepsize.

8.3.1 Non-separability of gradients

For the LQG problem, unlike the optimal solution that satisfies the separation principle, we observe from Lemma 1 that the gradient is not separable, as ∇_K f and ∇_L f depend on both L and K. To provide more insight, let us examine the value of the gradient over two special subsets of the domain S_c × S_o, namely S_c × {L^⋆}, where L^⋆ is the optimal Kalman gain, and {K^⋆} × S_o, where K^⋆ is the optimal control feedback gain in (8.5).

8.3.1.1 Optimal observer gain L = L^⋆

In this case, from (8.5) and the corresponding Riccati equation, it follows that

L Σ_v = X_o^⋆ C^T    (8.17)

where X_o^⋆ is the unique positive definite solution to the Lyapunov equation

(A − LC) X_o^⋆ + X_o^⋆ (A − LC)^T = −Σ_w − L Σ_v L^T.

Expanding (8.13a) and (8.13b), we observe that X_3 and \hat{X}_3 also satisfy the above Lyapunov equation. Thus, since A − LC is Hurwitz, it follows that

X_o^⋆ = X_3 = \hat{X}_3.    (8.18)

In addition, combining equations (8.13b), (8.17), and (8.18) yields

(A − BK) \hat{X}_2 + \hat{X}_2 (A − LC)^T = 0.    (8.19)

Now, since K ∈ S_c and L ∈ S_o, we obtain that \hat{X}_2 = 0. From this equation, in conjunction with (8.17) and (8.18), we obtain that the following terms in the gradient vanish,

B^T \hat{P}_2 \hat{X}_2^T = 0,  P_3 (L Σ_v − X_3 C^T) = 0    (8.20a)

and thus the gradient simplifies to

∇_K f(K, L^⋆) = 2(RK − B^T \hat{P}_1) \hat{X}_1
∇_L f(K, L^⋆) = −2 P_2^T X_2 C^T.

Remark 1  As we demonstrate in the proof of Lemma 1, for any stabilizing gains L and K, the matrix \hat{X}_2 is given by \hat{X}_2 = lim_{t→∞} E[e(t)\hat{x}^T(t)]. Thus, the equality \hat{X}_2 = 0 can be directly established using the orthogonality principle, which states that the optimal estimator is orthogonal to the estimation error.
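Since Lemma 1 reduces both the objective (8.9) and its gradient to solutions of standard Lyapunov equations, they are straightforward to evaluate numerically. The sketch below is an illustration of that computation; the use of SciPy's continuous-time Lyapunov solver and generic dense matrices are assumptions on my part, and the code is not taken from the dissertation.

```python
# A minimal sketch of evaluating the LQG cost (8.9) and the gradient formulas of
# Lemma 1 by solving the four Lyapunov equations (8.13).
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

def lqg_value_and_gradient(A, B, C, K, L, Q, R, Sigma_w, Sigma_v):
    n = A.shape[0]
    AL  = np.block([[A - B @ K, B @ K], [np.zeros((n, n)), A - L @ C]])   # (8.7)
    AhL = np.block([[A - B @ K, L @ C], [np.zeros((n, n)), A - L @ C]])   # (8.14a)
    KRK = K.T @ R @ K
    LSL = L @ Sigma_v @ L.T
    Omega   = np.block([[Q + KRK, -KRK], [-KRK, KRK]])                    # (8.11a)
    Sigma   = np.block([[Sigma_w, Sigma_w], [Sigma_w, Sigma_w + LSL]])    # (8.11b)
    Omega_h = np.block([[Q + KRK, Q], [Q, Q]])                            # (8.14b)
    Sigma_h = np.block([[LSL, -LSL], [-LSL, Sigma_w + LSL]])              # (8.14c)
    # solve_continuous_lyapunov(M, W) returns X satisfying M X + X M^T = W
    X  = solve_continuous_lyapunov(AL,    -Sigma)      # (8.13a)
    Xh = solve_continuous_lyapunov(AhL,   -Sigma_h)    # (8.13b)
    P  = solve_continuous_lyapunov(AL.T,  -Omega)      # (8.13c)
    Ph = solve_continuous_lyapunov(AhL.T, -Omega_h)    # (8.13d)
    X2, X3   = X[:n, n:],  X[n:, n:]
    Xh1, Xh2 = Xh[:n, :n], Xh[:n, n:]
    P2, P3   = P[:n, n:],  P[n:, n:]
    Ph1, Ph2 = Ph[:n, :n], Ph[:n, n:]
    f = np.trace(Omega @ X)                                               # (8.9)
    grad_K = 2 * (R @ K - B.T @ Ph1) @ Xh1 - 2 * B.T @ Ph2 @ Xh2.T
    grad_L = 2 * P3 @ (L @ Sigma_v - X3 @ C.T) - 2 * P2.T @ X2 @ C.T
    return f, grad_K, grad_L
```

A finite-difference check of f along random perturbations of K and L offers a quick way to validate the two gradient expressions.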
8.3.1.2 Optimal control gain K = K^⋆

Similar to the previous case, from (8.5) and the corresponding Riccati equation, it follows that RK = B^T P_c^⋆, where P_c^⋆ is the unique positive definite solution to the Lyapunov equation

(A − BK)^T P_c^⋆ + P_c^⋆ (A − BK) = −Q − K^T R K.

Combining this equation with (8.13c) and (8.13d) yields \hat{P}_1 = P_c^⋆ and P_2 = 0. Thus, we have

(RK − B^T \hat{P}_1) \hat{X}_1 = 0,  P_2^T X_2 C^T = 0    (8.20b)

which yields

∇_K f(K^⋆, L) = −2 B^T \hat{P}_2 \hat{X}_2^T
∇_L f(K^⋆, L) = 2 P_3 (L Σ_v − X_3 C^T).

We observe that ∇_K f(K^⋆, L) and ∇_L f(K, L^⋆) do not vanish in general and thus the sets S_c × {L^⋆} and {K^⋆} × S_o are not invariant with respect to gradient descent. Therefore, unlike the optimal solution, the gradient of the LQG objective function may not be decoupled.

8.4 Lack of gradient domination

Recently, it has been shown that the gradient descent method achieves linear convergence for the LQR problem with full-state feedback in both discrete [13] and continuous-time [88] settings. These results build on the key observation that the full-state feedback LQR cost in (8.3), viewed as a function of the feedback gain and denoted by g(K), satisfies the Polyak-Łojasiewicz (PL) condition over its sub-level sets, i.e.,

∥∇g(K)∥²_F ≥ μ_g (g(K) − g(K^⋆))    (8.21)

for some constant μ_g > 0. The PL condition, also known as gradient dominance, can be used as a surrogate for strong convexity to ensure convergence of gradient descent at a linear rate even for nonconvex problems. This observation raises the question of whether the LQG problem is also gradient dominant.

In addition, it has been recently shown that the set of stabilizing gains for the case of static output feedback, i.e., u = −Ky with y = Cx, consists of multiple connected components and local minima [90], which hinders the convergence of local search algorithms. However, in contrast to the static output-feedback problem, the joint optimization of the controller and observer feedback gains for LQG, as studied in this chapter, involves the connected domain S_c × S_o. We now demonstrate that, despite the connectivity of the optimization domain, this formulation still suffers from the existence of non-optimal equilibrium points and thus the lack of gradient domination.

8.4.1 Non-uniqueness of critical points

The nonconvexity of the function f suggests the possibility of having multiple critical points ∇f(K,L) = 0. In this section, we demonstrate that this is in fact the case by providing two such points for the LQG problem in its general form. This should be compared and contrasted with the full-state feedback LQR problem which, despite nonconvexity, has been shown to have a unique critical point.

Global minimizer  The most obvious critical point is the unique global minimizer of f, which is given by (8.5). To verify this, note that for the optimal gains L^⋆ and K^⋆, we have equations (8.20a) and (8.20b), respectively. Using these equations and the form of the gradient in Lemma 1, it immediately follows that ∇f(K^⋆, L^⋆) = 0.

The origin for stable systems  To find another critical point, let us assume for simplicity that the system is open-loop stable. We next show that the origin (K, L) = (0, 0) is also a critical point, i.e., ∇f(0,0) = 0. For (K, L) = (0, 0), it follows from (8.13b) that \hat{X}_1 = \hat{X}_2 = 0. In addition, from (8.13c), it follows that P_2 = P_3 = 0.
Combining these equalities with the form of the gradient in Lemma 1 ensures that ∇f(0,0) = 0. The existence of the sub-optimal critical point (K, L) = (0, 0) also implies that gradient domination may not hold for the LQG problem.

Figure 8.1: Mass-spring-damper system.

Figure 8.2: Convergence curve of gradient descent for s = 50; the vertical axis shows the normalized error (f(K^k, L^k) − f(K^⋆, L^⋆))/(f(K^0, L^0) − f(K^⋆, L^⋆)) versus the iteration count.

8.5 An example

We consider the mass-spring-damper system in Figure 8.1 with s masses to demonstrate the performance of gradient descent given by (GD) on the LQG problem over the set S_c × S_o of stabilizing gains. We set all spring and damping constants as well as all masses to unity. In the state-space representation (8.1a), the state vector x = [p^T  v^T]^T contains the positions and velocities of the masses and the measured output y = p is the position only. In this example, the dynamic, input, and output matrices are given by

A = \begin{bmatrix} 0 & I \\ −T & −T \end{bmatrix},  B = \begin{bmatrix} 0 \\ I \end{bmatrix},  C = \begin{bmatrix} I & 0 \end{bmatrix}

where 0 and I are zero and identity matrices of suitable size, and T is a Toeplitz matrix with 2 on the main diagonal, −1 on the first super- and sub-diagonals, and 0 elsewhere. We solve the LQG problem with Q = Σ_w = I and R = Σ_v = I for s = 50 masses, i.e., n = 2s state variables. The algorithm was initialized with scaled all-ones matrices K^0 = (L^0)^T = 10^{-5} 1. Figure 8.2 illustrates the convergence curve of gradient descent with a stepsize selected using a backtracking procedure, initialized with α_0 = 10^{-3}, that guarantees stability of the feedback loop and ensures descent. The optimal solution (K^⋆, L^⋆) is obtained using (8.5) and the corresponding Riccati equations.

8.6 Concluding remarks

Motivated by the recent results on the global exponential convergence of policy gradient algorithms for the model-free LQR problem, in this chapter we studied the standard LQG problem as an optimization over the controller and observer feedback gains. We presented an explicit formula for the gradient and demonstrated that, for open-loop stable systems, in addition to the unique global minimizer, the origin is also a critical point of the LQG problem, thus disproving the gradient dominance property. Numerical experiments on the convergence of gradient descent were also provided. Our ongoing work aims to identify conditions under which gradient descent can solve the LQG problem at a linear rate.

Bibliography

[1] L. Bottou and Y. Le Cun, On-line learning for very large data sets, Appl. Stoch. Models Bus. Ind., vol. 21, no. 2, pp. 137–151, 2005.
[2] M. Hong, M. Razaviyayn, Z.-Q. Luo, and J.-S. Pang, A unified algorithmic framework for block-structured optimization involving big data: With applications in machine learning and signal processing, IEEE Signal Process. Mag., vol. 33, no. 1, pp. 57–77, 2016.
[3] L. Bottou, F. Curtis, and J. Nocedal, Optimization methods for large-scale machine learning, SIAM Rev., vol. 60, no. 2, pp. 223–311, 2018.
[4] A. Beck and M. Teboulle, A fast iterative shrinkage-thresholding algorithm for linear inverse problems, SIAM J. Imaging Sci., vol. 2, no. 1, pp. 183–202, 2009.
[5] Y. Nesterov, Gradient methods for minimizing composite objective functions, Math. Program., vol. 140, no. 1, pp. 125–161, 2013.
[6] I. Sutskever, J. Martens, G. Dahl, and G. Hinton, On the importance of initialization and momentum in deep learning, in Proc. International Conference on Machine Learning, 2013, pp. 1139–1147.
[7] B. T. Polyak, Some methods of speeding up the convergence of iteration methods, USSR Comput. Math. & Math. Phys., vol. 4, no. 5, pp. 1–17, 1964.
[8] Y. Nesterov, A method for solving the convex programming problem with convergence rate O(1/k²), in Dokl. Akad. Nauk SSSR, vol. 27, 1983, pp. 543–547.
[9] Y. Nesterov, Lectures on convex optimization. Springer Optimization and Its Applications, 2018, vol. 137.
[10] D. Maclaurin, D. Duvenaud, and R. Adams, Gradient-based hyperparameter optimization through reversible learning, in Proc. International Conference on Machine Learning, 2015, pp. 2113–2122.
[11] Y. Bengio, Gradient-based optimization of hyperparameters, Neural Comput., vol. 12, no. 8, pp. 1889–1900, 2000.
[12] A. Beirami, M. Razaviyayn, S. Shahrampour, and V. Tarokh, On optimal generalizability in parametric learning, in Proc. Neural Information Processing (NIPS), 2017, pp. 3458–3468.
[13] M. Fazel, R. Ge, S. M. Kakade, and M. Mesbahi, Global convergence of policy gradient methods for the linear quadratic regulator, in Proc. International Conference on Machine Learning, 2018, pp. 1467–1476.
[14] H. Mohammadi, A. Zare, M. Soltanolkotabi, and M. R. Jovanović, Convergence and sample complexity of gradient methods for the model-free linear-quadratic regulator problem, IEEE Trans. Automat. Control, vol. 67, no. 5, pp. 2435–2450, 2022.
[15] H. Mohammadi, M. Soltanolkotabi, and M. R. Jovanović, Random search for learning the linear quadratic regulator, in Proc. the 2020 American Control Conference, 2020, pp. 4798–4803.
[16] R. Ge, F. Huang, C. Jin, and Y. Yuan, Escaping from saddle points: online stochastic gradient for tensor decomposition, in Proc. The 28th Conference on Learning Theory, vol. 40, 2015, pp. 797–842.
[17] C. Jin, R. Ge, P. Netrapalli, S. M. Kakade, and M. I. Jordan, How to escape saddle points efficiently, in Proc. International Conference on Machine Learning, vol. 70, 2017, pp. 1724–1732.
[18] A. Nagabandi, G. Kahn, R. Fearing, and S. Levine, Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning, in IEEE Int. Conf. Robot. Autom., 2018, pp. 7559–7566.
[19] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, Playing Atari with deep reinforcement learning, 2013, arXiv:1312.5602.
[20] D. Bertsekas, Approximate policy iteration: A survey and some new methods, J. Control Theory Appl., vol. 9, no. 3, pp. 310–335, 2011.
[21] Y. Abbasi-Yadkori, N. Lazic, and C. Szepesvári, Model-free linear quadratic control via reduction to expert prediction, in Proc. Mach. Learn. Res., vol. 89, 2019, pp. 3108–3117.
[22] H. Mania, A. Guy, and B. Recht, Simple random search of static linear policies is competitive for reinforcement learning, in NeurIPS, vol. 31, 2018.
[23] Z.-Q. Luo and P. Tseng, Error bounds and convergence analysis of feasible descent methods: A general approach, Ann. Oper. Res., vol. 46, no. 1, pp. 157–178, 1993.
[24] H. Robbins and S. Monro, A stochastic approximation method, Ann. Math. Statist., pp. 400–407, 1951.
[25] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro, Robust stochastic approximation approach to stochastic programming, SIAM J. Optim., vol. 19, no. 4, pp. 1574–1609, 2009.
[26] O. Devolder, Exactness, inexactness and stochasticity in first-order methods for large-scale convex optimization, Ph.D. dissertation, Louvain-la-Neuve, 2013.
[27] O. Devolder, F. Glineur, and Y. Nesterov, First-order methods of smooth convex optimization with inexact oracle, Math. Program., vol. 146, no. 1-2, pp. 37–75, 2014.
[28] P. Dvurechensky and A. Gasnikov, Stochastic intermediate gradient method for convex problems with stochastic inexact oracle, J. Optimiz. Theory App., vol. 171, no. 1, pp. 121–145, 2016.
[29] M. Schmidt, N. L. Roux, and F. R. Bach, Convergence rates of inexact proximal-gradient methods for convex optimization, in Proc. Neural Information Processing (NIPS), 2011, pp. 1458–1466.
[30] O. Devolder, Stochastic first order methods in smooth convex optimization, Catholic Univ. Louvain, Louvain-la-Neuve, Tech. Rep., 2011.
[31] F. Bach, Adaptivity of averaged stochastic gradient descent to local strong convexity for logistic regression, J. Mach. Learn. Res., vol. 15, no. 1, pp. 595–627, 2014.
[32] B. T. Polyak, New stochastic approximation type procedures, Automat. i Telemekh., no. 7, pp. 98–107, 1990.
[33] B. T. Polyak and A. B. Juditsky, Acceleration of stochastic approximation by averaging, SIAM J. Control Optim., vol. 30, no. 4, pp. 838–855, 1992.
[34] A. Dieuleveut, N. Flammarion, and F. Bach, Harder, better, faster, stronger convergence rates for least-squares regression, J. Mach. Learn. Res., vol. 18, no. 1, pp. 3520–3570, 2017.
[35] E. Moulines and F. Bach, Non-asymptotic analysis of stochastic approximation algorithms for machine learning, in Proc. Neural Information Processing (NIPS), 2011, pp. 451–459.
[36] N. Tripuraneni, N. Flammarion, F. Bach, and M. I. Jordan, Averaging stochastic gradient descent on Riemannian manifolds, in Proc. The 31st Conference on Learning Theory, 2018, pp. 650–687.
[37] M. Baes, Estimate sequence methods: Extensions and approximations, IFOR Internal Report, ETH, Zürich, Switzerland, 2009.
[38] A. d'Aspremont, Smooth optimization with approximate gradient, SIAM J. Optim., vol. 19, no. 3, pp. 1171–1183, 2008.
[39] J.-F. Aujol and C. Dossal, Stability of over-relaxations for the forward-backward algorithm, application to FISTA, SIAM J. Optim., vol. 25, no. 4, pp. 2408–2433, 2015.
[40] B. T. Polyak, Introduction to Optimization. Optimization Software, Inc., Publications Division, New York, vol. 1, 1987.
[41] H. Kwakernaak and R. Sivan, Linear optimal control systems. Wiley-Interscience, 1972.
[42] L. Xiao, S. Boyd, and S.-J. Kim, Distributed average consensus with least-mean-square deviation, J. Parallel Distrib. Comput., vol. 67, no. 1, pp. 33–46, 2007.
[43] B. Bamieh, M. R. Jovanović, P. Mitra, and S. Patterson, Coherence in large-scale networks: Dimension dependent limitations of local feedback, IEEE Trans. Automat. Control, vol. 57, no. 9, pp. 2235–2249, 2012.
[44] F. Lin, M. Fardad, and M. R. Jovanović, Optimal control of vehicular formations with nearest neighbor interactions, IEEE Trans. Automat. Control, vol. 57, no. 9, pp. 2203–2218, 2012.
[45] M. R. Jovanović and B. Bamieh, On the ill-posedness of certain vehicular platoon control problems, IEEE Trans. Automat. Control, vol. 50, no. 9, pp. 1307–1321, 2005.
[46] F. Dörfler, M. R. Jovanović, M. Chertkov, and F. Bullo, Sparsity-promoting optimal wide-area control of power networks, IEEE Trans. Power Syst., vol. 29, no. 5, pp. 2281–2291, 2014.
[47] F. Dörfler, M. R. Jovanović, M. Chertkov, and F. Bullo, Sparse and optimal wide-area damping control in power networks, in Proceedings of the 2013 American Control Conference, Washington, DC, 2013, pp. 4295–4300.
[48] X. Wu, F. Dörfler, and M. R. Jovanović, Input-output analysis and decentralized optimal control of inter-area oscillations in power systems, IEEE Trans. Power Syst., vol. 31, no. 3, pp. 2434–2444, 2016.
[49] J. W. Simpson-Porco, Input/output analysis of primal-dual gradient algorithms, in Proc. 54th Annual Allerton Conference on Communication, Control, and Computing, 2016, pp. 219–224.
[50] J. W. Simpson-Porco, B. K. Poolla, N. Monshizadeh, and F. Dörfler, Quadratic performance of primal-dual methods with application to secondary frequency control of power systems, in Proc. 55th IEEE Conf. Decision Control, 2016, pp. 1840–1845.
[51] A. Badithela and P. Seiler, Analysis of the heavy-ball algorithm using integral quadratic constraints, in Proc. the 2019 American Control Conference, IEEE, 2019, pp. 4081–4085.
[52] L. Lessard, B. Recht, and A. Packard, Analysis and design of optimization algorithms via integral quadratic constraints, SIAM J. Optim., vol. 26, no. 1, pp. 57–95, 2016.
[53] B. Hu and L. Lessard, Dissipativity theory for Nesterov's accelerated method, in Proc. the 34th International Conference on Machine Learning, ser. Proc. Mach. Learn. Res., 2017, pp. 1549–1557.
[54] S. Cyrus, B. Hu, B. Van Scoy, and L. Lessard, A robust accelerated optimization algorithm for strongly convex functions, in Proc. the 2018 American Control Conference, 2018, pp. 1376–1381.
[55] B. Van Scoy, R. A. Freeman, and K. M. Lynch, The fastest known globally convergent first-order method for minimizing strongly convex functions, IEEE Control Syst. Lett., vol. 2, no. 1, pp. 49–54, 2018.
[56] M. Fazlyab, A. Ribeiro, M. Morari, and V. M. Preciado, Analysis of optimization algorithms via integral quadratic constraints: Nonstrongly convex problems, SIAM J. Optim., vol. 28, no. 3, pp. 2654–2689, 2018.
[57] B. T. Polyak, Comparison of the convergence rates for single-step and multi-step optimization algorithms in the presence of noise, Engrg. Cybern., vol. 15, no. 1, pp. 6–10, 1977.
[58] K. Yuan, B. Ying, and A. H. Sayed, On the influence of momentum acceleration on online learning, J. Mach. Learn. Res., vol. 17, no. 1, pp. 6602–6667, 2016.
[59] Y. Nesterov, Introductory lectures on convex optimization: A basic course. Springer Science & Business Media, 2004, vol. 87.
[60] Y. Arjevani, S. Shalev-Shwartz, and O. Shamir, On lower and upper bounds in smooth and strongly convex optimization, J. Mach. Learn. Res., vol. 17, no. 1, pp. 4303–4353, 2016.
[61] B. O'Donoghue and E. Candes, Adaptive restart for accelerated gradient schemes, Found. Comput. Math., vol. 15, pp. 715–732, 2015.
[62] D. P. Bertsekas, Convex optimization algorithms. Athena Scientific, 2015.
[63] Y. Ouyang, Y. Chen, G. Lan, and E. Pasiliao, An accelerated linearized alternating direction method of multipliers, SIAM J. Imaging Sci., vol. 8, no. 1, pp. 644–681, 2015.
[64] W. Shi, Q. Ling, K. Yuan, G. Wu, and W. Yin, On the linear convergence of the ADMM in decentralized consensus optimization, IEEE Trans. Signal Process., vol. 62, no. 7, pp. 1750–1761, 2014.
[65] L. N. Trefethen and M. Embree, Spectra and pseudospectra: the behavior of nonnormal matrices and operators. Princeton University Press, 2005.
[66] M. R. Jovanović and B. Bamieh, Componentwise energy amplification in channel flows, J. Fluid Mech., vol. 534, pp. 145–183, 2005.
[67] M. R. Jovanović, From bypass transition to flow control and data-driven turbulence modeling: An input-output viewpoint, Annu. Rev. Fluid Mech., vol. 53, no. 1, pp. 311–345, 2021.
[68] N. K. Dhingra, S. Z. Khong, and M. R. Jovanović, The proximal augmented Lagrangian method for nonsmooth composite optimization, IEEE Trans. Automat. Control, vol. 64, no. 7, pp. 2861–2868, 2019.
[69] E. J. Candes, J. Romberg, and T. Tao, Stable signal recovery from incomplete and inaccurate measurements, Comm. Pure and Applied Math., vol. 59, no. 8, pp. 1207–1223, 2006.
[70] P. J. Bickel and E. Levina, Regularized estimation of large covariance matrices, The Annals of Statistics, vol. 36, no. 1, pp. 199–227, 2008.
[71] F. Lin, M. Fardad, and M. R. Jovanović, Design of optimal sparse feedback gains via the alternating direction method of multipliers, IEEE Trans. Automat. Control, vol. 58, no. 9, pp. 2426–2431, 2013.
[72] J. Wang and N. Elia, A control perspective for centralized and distributed convex optimization, in Proc. 50th IEEE Conf. Decision Control, 2011, pp. 3800–3805.
[73] G. Qu and N. Li, On the exponential stability of primal-dual gradient dynamics, IEEE Control Syst. Lett., vol. 3, no. 1, pp. 43–48, 2018.
[74] H. D. Nguyen, T. L. Vu, K. Turitsyn, and J. Slotine, Contraction and robustness of continuous time primal-dual dynamics, IEEE Control Syst. Lett., vol. 2, no. 4, pp. 755–760, 2018.
[75] D. Ding and M. R. Jovanović, Global exponential stability of primal-dual gradient flow dynamics based on the proximal augmented Lagrangian, in Proc. the 2019 American Control Conference, 2019, pp. 3414–3419.
[76] Y. Tang, G. Qu, and N. Li, Semi-global exponential stability of augmented primal-dual gradient dynamics for constrained convex optimization, Systems & Control Letters, vol. 144, p. 104754, 2020.
[77] D. Ding and M. R. Jovanović, Global exponential stability of primal-dual gradient flow dynamics based on the proximal augmented Lagrangian: A Lyapunov-based approach, in Proc. the 59th IEEE Conf. Decision Control, 2020, pp. 4836–4841.
[78] D. Jakovetić, D. Bajović, J. Xavier, and J. M. Moura, Primal-dual methods for large-scale and distributed convex optimization and data analytics, Proc. the IEEE, vol. 108, no. 11, pp. 1923–1938, 2020.
[79] P. You and E. Mallada, Saddle flow dynamics: Observable certificates and separable regularization, in Proc. the 2021 American Control Conference, 2021, pp. 4817–4823.
[80] N. Parikh and S. Boyd, Proximal algorithms, Found. Trends Optim., vol. 1, no. 3, pp. 123–231, 2013.
[81] A. Megretski and A. Rantzer, System analysis via integral quadratic constraints, IEEE Trans. Autom. Control, vol. 42, no. 6, pp. 819–830, 1997.
[82] F. Paganini and E. Feron, Linear matrix inequality methods for robust H2 analysis: A survey with comparisons, in Advances in linear matrix inequality methods in control, SIAM, 2000, pp. 129–151.
[83] B. Anderson and J. Moore, Optimal Control; Linear Quadratic Methods. New York, NY: Prentice Hall, 1990.
[84] J. Ackermann, Parameter space design of robust control systems, IEEE Trans. Automat. Control, vol. 25, no. 6, pp. 1058–1072, 1980.
[85] E. Feron, V. Balakrishnan, S. Boyd, and L. El Ghaoui, Numerical methods for H2 related problems, in Proc. the 1992 American Control Conference, 1992, pp. 2921–2922.
[86] G. E. Dullerud and F. Paganini, A course in robust control theory: a convex approach. New York: Springer-Verlag, 2000.
[87] H. Mohammadi, M. Soltanolkotabi, and M. R. Jovanović, On the linear convergence of random search for discrete-time LQR, IEEE Control Syst. Lett., vol. 5, no. 3, pp. 989–994, 2021.
[88] H. Mohammadi, A. Zare, M. Soltanolkotabi, and M. R. Jovanović, Global exponential convergence of gradient methods over the nonconvex landscape of the linear quadratic regulator, in Proc. the 58th IEEE Conf. Decision Control, 2019, pp. 7474–7479.
[89] H. Karimi, J. Nutini, and M. Schmidt, Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition, in European Conference on Machine Learning, 2016, pp. 795–811.
[90] H. Feng and J. Lavaei, On the exponential number of connected components for the feasible set of optimal decentralized control problems, in Proc. the 2019 American Control Conference, 2019, pp. 1430–1437.
[91] H. Mohammadi, M. Razaviyayn, and M. R. Jovanović, Variance amplification of accelerated first-order algorithms for strongly convex quadratic optimization problems, in Proc. the 57th IEEE Conf. Decision Control, 2018, pp. 5753–5758.
[92] H. Mohammadi, M. Razaviyayn, and M. R. Jovanović, Performance of noisy Nesterov's accelerated method for strongly convex optimization problems, in Proc. the 2019 American Control Conference, 2019, pp. 3426–3431.
[93] H. Mohammadi, M. Razaviyayn, and M. R. Jovanović, Robustness of accelerated first-order algorithms for strongly convex optimization problems, IEEE Trans. Automat. Control, vol. 66, no. 6, pp. 2480–2495, 2021.
[94] H. Mohammadi, M. Razaviyayn, and M. R. Jovanović, Noise amplification of momentum-based optimization algorithms, in Proc. the 2023 American Control Conference, submitted, San Diego, CA, 2023.
[95] H. Mohammadi, M. Razaviyayn, and M. R. Jovanović, Tradeoffs between convergence rate and noise amplification for momentum-based accelerated optimization algorithms, 2022, arXiv:2209.11920.
[96] S. Samuelson, H. Mohammadi, and M. R. Jovanović, Transient growth of accelerated first-order methods, in Proc. the 2020 American Control Conference, 2020, pp. 2858–2863.
[97] S. Samuelson, H. Mohammadi, and M. R. Jovanović, On the transient growth of Nesterov's accelerated method for strongly convex optimization problems, in Proc. the 59th IEEE Conf. Decision Control, 2020, pp. 5911–5916.
[98] H. Mohammadi, S. Samuelson, and M. R. Jovanović, Transient growth of accelerated optimization algorithms, IEEE Trans. Automat. Control, 2022, doi:10.1109/TAC.2022.3162154.
[99] H. Mohammadi and M. R. Jovanović, On the noise amplification of primal-dual gradient flow dynamics based on proximal augmented Lagrangian, in Proc. the 2022 American Control Conference, Atlanta, GA, 2022, pp. 926–931.
[100] H. Mohammadi, M. Soltanolkotabi, and M. R. Jovanović, Learning the model-free linear quadratic regulator via random search, in Proc. Machine Learning Research, 2nd Annual Conference on Learning for Dynamics and Control, vol. 120, Berkeley, CA, 2020, pp. 1–9.
[101] H. Mohammadi, M. Soltanolkotabi, and M. R. Jovanović, Model-free linear quadratic regulator, in Handbook of Reinforcement Learning and Control, K. G. Vamvoudakis, Y. Wan, F. Lewis, and D. Cansever, Eds., doi:10.1007/978-3-030-60990-0, Springer International Publishing, 2021.
[102] H. Mohammadi, M. Soltanolkotabi, and M. R. Jovanović, On the lack of gradient domination for linear quadratic Gaussian problems with incomplete state information, in Proc. the 60th IEEE Conf. Decision Control, Austin, TX, 2021, pp. 1120–1124.
[103] N. S. Aybat, A. Fallah, M. Gürbüzbalaban, and A. Ozdaglar, Robust accelerated gradient methods for smooth strongly convex functions, SIAM J. Opt., vol. 30, no. 1, pp. 717–751, 2020.
[104] N. S. Aybat, A. Fallah, M. Gürbüzbalaban, and A. Ozdaglar, A universally optimal multistage accelerated stochastic gradient method, in Proc. Neural Information Processing (NIPS), 2019.
[105] S. Michalowsky, C. Scherer, and C. Ebenbauer, Robust and structure exploiting optimization algorithms: An integral quadratic constraint approach, Int. J. Control, vol. 94, no. 11, pp. 2956–2979, 2021.
[106] B. T. Polyak and P. Shcherbakov, Lyapunov functions: An optimization theory perspective, IFAC-PapersOnLine, vol. 50, no. 1, pp. 7456–7461, 2017.
[107] B. T. Polyak and G. V. Smirnov, Transient response in matrix discrete-time linear systems, Autom. Remote Control, vol. 80, no. 9, pp. 1645–1652, 2019.
[108] M. B. Cohen, J. Diakonikolas, and L. Orecchia, On acceleration with noise-corrupted gradients, in Proc. the 35th International Conference on Machine Learning, ser. Proc. Mach. Learn. Res., vol. 80, 2018, pp. 1019–1028.
[109] B. Van Scoy and L. Lessard, The speed-robustness trade-off for first-order methods with additive gradient noise, 2021, arXiv:2109.05059.
[110] M. Muehlebach and M. Jordan, A dynamical systems perspective on Nesterov acceleration, in International Conference on Machine Learning, PMLR, 2019, pp. 4656–4662.
[111] Z. He, A. S. Rakin, and D. Fan, Parametric noise injection: Trainable randomness to improve deep neural network robustness against adversarial attack, in Proc. the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 588–597.
[112] R. Bassily, A. Smith, and A. Thakurta, Private empirical risk minimization: Efficient algorithms and tight error bounds, in 2014 IEEE 55th Annual Symposium on Foundations of Computer Science, 2014, pp. 464–473.
[113] S. B. Gelfand and S. K. Mitter, Recursive stochastic algorithms for global optimization in R^d, SIAM J. Control Optim., vol. 29, no. 5, pp. 999–1018, 1991.
[114] M. Raginsky, A. Rakhlin, and M. Telgarsky, Non-convex learning via stochastic gradient Langevin dynamics: A nonasymptotic analysis, in Proc. the Conference on Learning Theory, ser. Proc. Mach. Learn. Res., vol. 65, 2017, pp. 1674–1703.
[115] Y. Zhang, P. Liang, and M. Charikar, A hitting time analysis of stochastic gradient Langevin dynamics, in Proc. the 2017 Conference on Learning Theory, ser. Proc. Mach. Learn. Res., vol. 65, 2017, pp. 1980–2022.
[116] K. Ogata, Discrete-time control systems. New Jersey: Prentice-Hall, 1994.
[117] R. Padmanabhan and P. Seiler, Analysis of gradient descent with varying step sizes using integral quadratic constraints, 2022, arXiv:2210.00644.
[118] B. Hu, P. Seiler, and A. Rantzer, A unified analysis of stochastic optimization methods using jump system theory and quadratic constraints, in Proc. the 2017 Conference on Learning Theory, ser. Proc. Mach. Learn. Res., 2017, pp. 1157–1189.
[119] B. Hu, P. Seiler, and L. Lessard, Analysis of biased stochastic gradient descent using sequential semidefinite programs, Math. Program., vol. 187, no. 1, pp. 383–408, 2021.
[120] S. Hassan-Moghaddam and M. R. Jovanović, Topology design for stochastically-forced consensus networks, IEEE Trans. Control Netw. Syst., vol. 5, no. 3, pp. 1075–1086, 2018.
[121] A. Zare, H. Mohammadi, N. K. Dhingra, T. T. Georgiou, and M. R. Jovanović, Proximal algorithms for large-scale statistical modeling and sensor/actuator selection, IEEE Trans. Automat. Control, vol. 65, no. 8, pp. 3441–3456, 2020.
[122] I. Sutskever, J. Martens, G. Dahl, and G. Hinton, On the importance of initialization and momentum in deep learning, in Proc. International Conference on Machine Learning, 2013, pp. 1139–1147.
[123] S. Michalowsky, C. Scherer, and C. Ebenbauer, Robust and structure exploiting optimisation algorithms: An integral quadratic constraint approach, Int. J. Control, pp. 1–24, 2020.
[124] J. I. Poveda and N. Li, Robust hybrid zero-order optimization algorithms with acceleration via averaging in time, Automatica, p. 109361, 2021.
[125] B. Can, M. Gürbüzbalaban, and L. Zhu, Accelerated linear convergence of stochastic momentum methods in Wasserstein distances, in International Conference on Machine Learning, PMLR, 2019, pp. 891–901.
[126] Y. Nesterov, Introductory lectures on convex optimization: A basic course. Kluwer Academic Publishers, 2004, vol. 87.
[127] M. Danilova and G. Malinovsky, Averaged heavy-ball method, 2021, arXiv:2111.05430.
[128] S. Hassan-Moghaddam and M. R. Jovanović, Proximal gradient flow and Douglas-Rachford splitting dynamics: Global exponential stability via integral quadratic constraints, Automatica, vol. 123, 109311 (7 pages), 2021.
[129] J. W. Simpson-Porco, B. K. Poolla, N. Monshizadeh, and F. Dörfler, Input-output performance of linear-quadratic saddle-point algorithms with application to distributed resource allocation problems, IEEE Trans. Automat. Control, vol. 65, no. 5, pp. 2032–2045, 2019.
[130] S. Dean, H. Mania, N. Matni, B. Recht, and S. Tu, On the sample complexity of the linear quadratic regulator, Found. Comput. Math., pp. 1–47, 2017.
[131] M. Simchowitz, H. Mania, S. Tu, M. I. Jordan, and B. Recht, Learning without mixing: Towards a sharp analysis of linear system identification, in Proc. Mach. Learn. Res., 2018, pp. 439–473.
[132] D. Malik, A. Pananjady, K. Bhatia, K. Khamaru, P. L. Bartlett, and M. J. Wainwright, Derivative-free methods for policy optimization: Guarantees for linear-quadratic systems, J. Mach. Learn. Res., vol. 51, pp. 1–51, 2019.
[133] J. P. Jansch-Porto, B. Hu, and G. E. Dullerud, Convergence guarantees of policy optimization methods for Markovian jump linear systems, in Proc. the American Control Conference, 2020.
[134] K. Zhang, B. Hu, and T. Basar, Policy optimization for H2 linear control with H-infinity robustness guarantee: Implicit regularization and global convergence, SIAM J. Control Optim., vol. 59, no. 6, pp. 4081–4109, 2021.
[135] L. Furieri, Y. Zheng, and M. Kamgarpour, Learning the globally optimal distributed LQ regulator, in Learning for Dynamics and Control, 2020, pp. 287–297.
[136] I. Fatkhullin and B. T. Polyak, Optimizing static linear feedback: Gradient method, SIAM J. Control Optim., vol. 59, no. 5, pp. 3887–3911, 2021.
[137] D. Kleinman, On an iterative technique for Riccati equation computations, IEEE Trans. Automat. Control, vol. 13, no. 1, pp. 114–115, 1968.
[138] S. Bittanti, A. J. Laub, and J. C. Willems, The Riccati Equation. Berlin, Germany: Springer-Verlag, 2012.
[139] P. L. D. Peres and J. C. Geromel, An alternate numerical solution to the linear quadratic problem, IEEE Trans. Automat. Control, vol. 39, no. 1, pp. 198–202, 1994.
[140] V. Balakrishnan and L. Vandenberghe, Semidefinite programming duality and linear time-invariant systems, IEEE Trans. Automat. Control, vol. 48, no. 1, pp. 30–41, 2003.
[141] J. Bu, A. Mesbahi, M. Fazel, and M. Mesbahi, LQR through the lens of first order methods: Discrete-time case, 2019, arXiv:1907.08921.
[142] W. S. Levine and M. Athans, On the determination of the optimal constant output feedback gains for linear multivariable systems, IEEE Trans. Automat. Control, vol. 15, no. 1, pp. 44–48, 1970.
[143] F. Lin, M. Fardad, and M. R. Jovanović, Augmented Lagrangian approach to design of structured optimal state feedback gains, IEEE Trans. Automat. Control, vol. 56, no. 12, pp. 2923–2929, 2011.
[144] M. Fardad, F. Lin, and M. R. Jovanović, Sparsity-promoting optimal control for a class of distributed systems, in Proc. the 2011 American Control Conference, San Francisco, CA, 2011, pp. 2050–2055.
[145] M. R. Jovanović and N. K. Dhingra, Controller architectures: Tradeoffs between performance and structure, Eur. J. Control, vol. 30, pp. 76–91, 2016.
[146] F. Lin, M. Fardad, and M. R. Jovanović, Sparse feedback synthesis via the alternating direction method of multipliers, in Proceedings of the 2012 American Control Conference, Montréal, Canada, 2012, pp. 4765–4770.
[147] X. Wu and M. R. Jovanović, Sparsity-promoting optimal control of systems with symmetries, consensus and synchronization networks, Syst. Control Lett., vol. 103, pp. 1–8, 2017.
[148] B. T. Polyak, M. Khlebnikov, and P. Shcherbakov, An LMI approach to structured sparse feedback design in linear control systems, in Proc. the 2013 European Control Conference, 2013, pp. 833–838.
[149] N. K. Dhingra, M. R. Jovanović, and Z. Q. Luo, An ADMM algorithm for optimal sensor and actuator selection, in Proc. the 53rd IEEE Conf. Decision Control, 2014, pp. 4039–4044.
[150] A. Zare, T. T. Georgiou, and M. R. Jovanović, Stochastic dynamical modeling of turbulent flows, Annu. Rev. Control Robot. Auton. Syst., vol. 3, pp. 195–219, 2020.
[151] B. Recht, A tour of reinforcement learning: The view from continuous control, Annu. Rev. Control Robot. Auton. Syst., vol. 2, pp. 253–279, 2019.
[152] H. T. Toivonen, A globally convergent algorithm for the optimal constant output feedback problem, Int. J. Control, vol. 41, no. 6, pp. 1589–1599, 1985.
[153] T. Rautert and E. W. Sachs, Computational design of optimal output feedback controllers, SIAM J. Optim., vol. 7, no. 3, pp. 837–852, 1997.
[154] A. Vannelli and M. Vidyasagar, Maximal Lyapunov functions and domains of attraction for autonomous nonlinear systems, Automatica, vol. 21, no. 1, pp. 69–80, 1985.
[155] M. Jones, H. Mohammadi, and M. M. Peet, Estimating the region of attraction using polynomial optimization: A converse Lyapunov result, in Proc. 56th IEEE Conf. on Decision and Control, 2017, pp. 1796–1802.
[156] H. Mohammadi, M. Razaviyayn, and M. R. Jovanović, On the stability of gradient flow dynamics for a rank-one matrix approximation problem, in Proc. the 2018 American Control Conference, Milwaukee, WI, 2018, pp. 4533–4538.
[157] H. K. Khalil, Nonlinear Systems. New York: Prentice Hall, 1996.
[158] S. Boyd and L. Vandenberghe, Convex optimization. Cambridge University Press, 2004.
[159] S.-I. Amari, Natural gradient works efficiently in learning, Neural Comput., vol. 10, no. 2, pp. 251–276, 1998.
[160] R. Vershynin, High-dimensional probability: An introduction with applications in data science. Cambridge University Press, 2018.
[161] G. Hewer, An iterative technique for the computation of the steady state gains for the discrete optimal regulator, IEEE Trans. Automat. Control, vol. 16, no. 4, pp. 382–384, 1971.
[162] K. Mårtensson, Gradient methods for large-scale and distributed linear quadratic control, Ph.D. dissertation, Lund University, 2012.
[163] J. C. Duchi, M. I. Jordan, M. J. Wainwright, and A. Wibisono, Optimal rates for zero-order convex optimization: The power of two function evaluations, IEEE Trans. Inf. Theory, vol. 61, no. 5, pp. 2788–2806, 2015.
[164] M. Rudelson and R. Vershynin, Hanson-Wright inequality and sub-Gaussian concentration, Electron. Commun. Probab., vol. 18, 2013.
[165] Y. Zheng, Y. Tang, and N. Li, Analysis of the optimization landscape of linear quadratic Gaussian (LQG) control, 2021, arXiv:2102.04393.
[166] K. J. Åström, Introduction to stochastic control theory. Academic Press, New York, 1970.
[167] F. Topsøe, Some bounds for the logarithmic function, Inequal. Theory Appl., vol. 4, p. 137, 2006.
[168] H. T. Toivonen and P. M. Mäkilä, Newton's method for solving parametric linear quadratic control problems, Int. J. Control, vol. 46, no. 3, pp. 897–911, 1987.
[169] M. Soltanolkotabi, A. Javanmard, and J. D. Lee, Theoretical insights into the optimization landscape of over-parameterized shallow neural networks, IEEE Trans. Inf. Theory, vol. 65, no. 2, pp. 742–769, 2019.
[170] M. Ledoux and M. Talagrand, Probability in Banach Spaces: isoperimetry and processes. Springer Science & Business Media, 2013.
[171] D. Pollard, Mini empirical, 2015. [Online]. Available: http://www.stat.yale.edu/~pollard/Books/Mini/.
[172] A. W. van der Vaart and J. A. Wellner, Weak convergence and empirical processes: with applications to statistics. Springer, 1996.
[173] T. Ma and A. Wigderson, Sum-of-Squares lower bounds for sparse PCA, in Advances in Neural Information Processing Systems, 2015, pp. 1612–1620.

Appendices

Appendix A

Supporting proofs for Chapter 2

A.1 Quadratic problems

A.1.1 Proof of Theorem 1

For gradient descent, \hat{A}_i = 1 − αλ_i and \hat{B}_i = 1 are scalars and the solution to (2.9) is given by

\hat{P}_i := σ² p_i = \frac{σ²}{1 − (1 − αλ_i)²} = \frac{σ²}{αλ_i (2 − αλ_i)}.
For the accelerated methods, we note that for any \hat{A}_i and \hat{B}_i of the form

\hat{A}_i = \begin{bmatrix} 0 & 1 \\ a_i & b_i \end{bmatrix},  \hat{B}_i = \begin{bmatrix} 0 \\ 1 \end{bmatrix}

the solution \hat{P}_i to the Lyapunov equation (2.9) is given by

\hat{P}_i = σ² \begin{bmatrix} p_i & b_i p_i/(1 − a_i) \\ b_i p_i/(1 − a_i) & p_i \end{bmatrix}

where

p_i := \frac{a_i − 1}{(a_i + 1)(b_i + a_i − 1)(b_i − a_i + 1)}.    (A.1)

The parameters a_i and b_i for Nesterov's algorithm are {a_i = −β(1 − αλ_i); b_i = (1 + β)(1 − αλ_i)} and for the heavy-ball method we have {a_i = −β; b_i = 1 + β − αλ_i}. Now, since \hat{C}_i = 1 for gradient descent and \hat{C}_i = [1  0] for the accelerated algorithms, it follows that for all three algorithms we have \hat{J}(λ_i) := trace(\hat{C}_i \hat{P}_i \hat{C}_i^T) = σ² p_i. Finally, if we use the expression for p_i for gradient descent and substitute for a_i and b_i in (A.1) for the accelerated algorithms, we obtain the expressions for \hat{J} in the statement of the theorem.

A.1.2 Proof of Proposition 1

To show that \hat{J}_{na}(λ)/\hat{J}_{gd}(λ) is a decreasing function of λ ∈ [m, L], we split this ratio into the sum of two homographic functions \hat{J}_{na}(λ)/\hat{J}_{gd}(λ) = σ_1(λ) + σ_2(λ), where

σ_1(λ) := \frac{4 α_{gd} β}{α_{na}(3β + 1)(1 − β)} \, \frac{1 − \frac{α_{gd}}{2} λ}{1 + \frac{α_{na} β}{1 − β} λ},  σ_2(λ) := \frac{α_{gd}}{α_{na}(3β + 1)} \, \frac{1 − \frac{α_{gd}}{2} λ}{1 − \frac{α_{na}(2β + 1)}{2 + 2β} λ}.    (A.2)

Now, if we substitute the parameters provided in Table 2.2 into (A.2), it follows that the signs of the derivatives dσ_1/dλ and dσ_2/dλ satisfy

sign(dσ_1/dλ) = sign(−\frac{α_{na} β}{1 − β} − \frac{α_{gd}}{2}) = sign(−\frac{κ + κ\sqrt{3κ+1} + \sqrt{3κ+1} − 1}{m(3κ+1)(κ+1)}) < 0,  ∀κ > 1

sign(dσ_2/dλ) = sign(\frac{α_{na}(2β + 1)}{2 + 2β} − \frac{α_{gd}}{2}) = sign(−\frac{2(κ − \sqrt{3κ+1} + 1)}{m(3κ+1)^{3/2}(κ+1)}) < 0,  ∀κ > 1.

Furthermore, since the critical points of the functions σ_1(λ) and σ_2(λ) are not in [m, L],

λ_{crt,1} = −\frac{m(3κ+1)}{\sqrt{3κ+1} − 2} < 0 < m,  λ_{crt,2} = \frac{m(3κ+1)\sqrt{3κ+1}}{3\sqrt{3κ+1} − 2} > mκ = L

we conclude that both σ_1 and σ_2 are decreasing functions over the interval [m, L].

We next prove (2.13a) and (2.13b). It is straightforward to verify that both \hat{J}_{gd}(λ) and \hat{J}_{na}(λ) are quasi-convex functions over the interval [m, L] and that the respective minima are attained at the critical point λ = 1/α. Quasi-convexity also implies

max_{λ ∈ [m,L]} \hat{J}(λ) = max{\hat{J}(m), \hat{J}(L)}.    (A.3)

Now, letting α = 2/(L + m) in the expression for \hat{J}_{gd} gives \hat{J}_{gd}(m) = \hat{J}_{gd}(L) = (κ + 1)²/(4κ), which in conjunction with (A.3) completes the proof of (2.13a). Finally, since the ratio \hat{J}_{na}(λ)/\hat{J}_{gd}(λ) is decreasing, we have \hat{J}_{na}(L)/\hat{J}_{gd}(L) ≤ \hat{J}_{na}(m)/\hat{J}_{gd}(m). Combining this inequality with \hat{J}_{gd}(m) = \hat{J}_{gd}(L) and (A.3) completes the proof of (2.13b).

A.1.3 Proof of Theorem 3

From Proposition 1, it follows that

\frac{\hat{J}_{na}(L)}{\hat{J}_{gd}(L)} ≤ \frac{\hat{J}_{na}(λ_i)}{\hat{J}_{gd}(λ_i)} ≤ \frac{\hat{J}_{na}(m)}{\hat{J}_{gd}(m)}    (A.4a)

for all λ_i and

\sum_{i=1}^{n−1} \hat{J}_{gd}(λ_i) ≤ (n − 1) \hat{J}_{gd}(m) = (n − 1) \hat{J}_{gd}(L).    (A.4b)

For the upper bound, we have

\frac{J_{na}}{J_{gd}} = \frac{\sum_{i=1}^{n} \hat{J}_{na}(λ_i)}{\sum_{i=1}^{n} \hat{J}_{gd}(λ_i)} ≤ \frac{\hat{J}_{na}(L) + \frac{\hat{J}_{na}(m)}{\hat{J}_{gd}(m)} \sum_{i=1}^{n−1} \hat{J}_{gd}(λ_i)}{\hat{J}_{gd}(L) + \sum_{i=1}^{n−1} \hat{J}_{gd}(λ_i)} ≤ \frac{\hat{J}_{na}(L) + (n − 1)\hat{J}_{na}(m)}{\hat{J}_{gd}(L) + (n − 1)\hat{J}_{gd}(m)}

where the first inequality follows from (A.4a). The second inequality can be verified by multiplying both sides with the product of the denominators and using \hat{J}_{gd}(m) = \hat{J}_{gd}(L), \hat{J}_{na}(m) ≥ \hat{J}_{na}(L), and (A.4b).
Similarly, for the lower bound we can write

\frac{J_{na}}{J_{gd}} = \frac{\sum_{i=1}^{n} \hat{J}_{na}(λ_i)}{\sum_{i=1}^{n} \hat{J}_{gd}(λ_i)} ≥ \frac{\hat{J}_{na}(m) + \frac{\hat{J}_{na}(L)}{\hat{J}_{gd}(L)} \sum_{i=2}^{n} \hat{J}_{gd}(λ_i)}{\hat{J}_{gd}(m) + \sum_{i=2}^{n} \hat{J}_{gd}(λ_i)} ≥ \frac{\hat{J}_{na}(m) + (n − 1)\hat{J}_{na}(L)}{\hat{J}_{gd}(m) + (n − 1)\hat{J}_{gd}(L)}.

Again, the first inequality follows from (A.4a) and the second inequality can be verified by multiplying both sides with the product of the denominators and using \hat{J}_{gd}(m) = \hat{J}_{gd}(L), \hat{J}_{na}(m) ≥ \hat{J}_{na}(L), and (A.4b).

A.1.4 Proof of the bounds in (2.16)

From Proposition 1, we have

\hat{J}_{na}(m) = \frac{b^4 (b^2 − 2b + 2)}{32(b − 1)^3},  \hat{J}_{na}(L) = \frac{9 b^4 (b^2 + 2b − 2)}{32 (b^2 − 1)(2b − 1)(b^2 − b + 1)}

where b := \sqrt{3κ + 1} > 2. The upper and lower bounds on \hat{J}_{na}(m) are obtained as follows

\frac{b^3}{32} ≤ \frac{b^4 ((b − 1)^2 + 1)}{32(b − 1)^3} = \hat{J}_{na}(m) ≤ \frac{b^3 (b + c_1(b))(b^2 − 2b + 2 + c_2(b))}{32(b − 1)^3} = \frac{b^3}{8}

where the positive quantities c_1(b) := b − 2 and c_2(b) := b^2 − 2b are added to yield a simple upper bound. Similarly, for \hat{J}_{na}(L) we have

\frac{9b}{64} = \frac{(9/32) b^4 (b^2 + 2b − 2)}{((b^2 − 1) + 1)((2b − 1) + 1)(b^2 − b + 1 + c_3(b))} ≤ \hat{J}_{na}(L)

\frac{9b}{8} = \frac{(9/32) b^4 (b^2 + 2b − 2 + c_4(b))}{(b^2 − 1)(2b − 1 − c_5(b))(b^2 − b + 1 − c_6(b))} ≥ \hat{J}_{na}(L)

where the positive quantities c_3(b) := 3b − 3, c_4(b) := b^2 − 2b, c_5(b) := b − 1, and c_6(b) := (1/2)b^2 − b + 1 are introduced to obtain tractable bounds.

A.2 General strongly convex problems

A.2.1 Proof of Lemma 1

Let us define the positive semidefinite function V(ψ) := ψ^T X ψ and let η := [ψ^T  u^T]^T. Using LMI (2.23) and (2.22), we can write

∥z^t∥² = (η^t)^T \begin{bmatrix} C_z^T C_z & 0 \\ 0 & 0 \end{bmatrix} η^t
≤ −(η^t)^T \begin{bmatrix} A^T X A − X & A^T X B_u \\ B_u^T X A & B_u^T X B_u \end{bmatrix} η^t − λ (η^t)^T \begin{bmatrix} C_y & 0 \\ 0 & I \end{bmatrix}^T Π \begin{bmatrix} C_y & 0 \\ 0 & I \end{bmatrix} η^t
= (η^t)^T \left( \begin{bmatrix} X & 0 \\ 0 & 0 \end{bmatrix} − \begin{bmatrix} A^T \\ B_u^T \end{bmatrix} X \begin{bmatrix} A^T \\ B_u^T \end{bmatrix}^T \right) η^t − λ \begin{bmatrix} y^t \\ u^t \end{bmatrix}^T Π \begin{bmatrix} y^t \\ u^t \end{bmatrix}
≤ V(ψ^t) − V(ψ^{t+1}) + 2σ (ψ^t)^T A^T X B_w w^t + σ² (w^t)^T B_w^T X B_w w^t + 2σ (u^t)^T B_u^T X B_w w^t.

Since w^t is a zero-mean white input with identity covariance which is independent of u^t and x^t, if we take the average of the above inequality over t and the expectation over different realizations of w^t, we obtain

\frac{1}{\bar{T}} \sum_{t=1}^{\bar{T}} E[∥z^t∥²] ≤ \frac{1}{\bar{T}} E[V(ψ^1) − V(ψ^{\bar{T}+1})] + σ² trace(B_w^T X B_w).

Therefore, letting \bar{T} → ∞ and using X ⪰ 0 leads to J ≤ σ² trace(B_w^T X B_w), which completes the proof.

A.2.2 Proof of Lemma 2

In order to prove Lemma 2, we present a technical lemma which, along the lines of the results of [56], provides us with an upper bound on the difference between the objective values at two consecutive iterations.

Lemma 1  Let f ∈ F_m^L and κ := L/m. Then, Nesterov's accelerated method, with the notation introduced in Section 2.4, satisfies

f(x^{t+2}) − f(x^{t+1}) ≤ \frac{1}{2} \left( N_1 \begin{bmatrix} ψ^t \\ u^t \end{bmatrix} + \begin{bmatrix} σ w^t \\ 0 \end{bmatrix} \right)^T \begin{bmatrix} LI & I \\ I & 0 \end{bmatrix} \left( N_1 \begin{bmatrix} ψ^t \\ u^t \end{bmatrix} + \begin{bmatrix} σ w^t \\ 0 \end{bmatrix} \right) + \frac{1}{2} \left( N_2 \begin{bmatrix} ψ^t \\ u^t \end{bmatrix} \right)^T \begin{bmatrix} −mI & I \\ I & 0 \end{bmatrix} \left( N_2 \begin{bmatrix} ψ^t \\ u^t \end{bmatrix} \right)

where N_1 and N_2 are defined in Lemma 2.

Proof: For any f ∈ F_m^L, the Lipschitz continuity of ∇f implies

f(x^{t+2}) − f(y^t) ≤ \frac{1}{2} \begin{bmatrix} x^{t+2} − y^t \\ ∇f(y^t) \end{bmatrix}^T \begin{bmatrix} LI & I \\ I & 0 \end{bmatrix} \begin{bmatrix} x^{t+2} − y^t \\ ∇f(y^t) \end{bmatrix}    (A.5)

and the strong convexity of f yields

f(y^t) − f(x^{t+1}) ≤ \frac{1}{2} \begin{bmatrix} y^t − x^{t+1} \\ ∇f(y^t) \end{bmatrix}^T \begin{bmatrix} −mI & I \\ I & 0 \end{bmatrix} \begin{bmatrix} y^t − x^{t+1} \\ ∇f(y^t) \end{bmatrix}.    (A.6)

Moreover, the state and output equations in (2.5) lead to

\begin{bmatrix} x^{t+2} − y^t \\ ∇f(y^t) \end{bmatrix} = N_1 \begin{bmatrix} ψ^t \\ u^t \end{bmatrix} + \begin{bmatrix} σ w^t \\ 0 \end{bmatrix},  \begin{bmatrix} y^t − x^{t+1} \\ ∇f(y^t) \end{bmatrix} = N_2 \begin{bmatrix} ψ^t \\ u^t \end{bmatrix}.    (A.7)

Summing up (A.5) and (A.6) and substituting for the terms [x^{t+2} − y^t; ∇f(y^t)] and [y^t − x^{t+1}; ∇f(y^t)] from (A.7) completes the proof. □

Let us define the positive semidefinite function V(ψ) := ψ^T X ψ and let η := [ψ^T  u^T]^T.
Similar to the first part of the proof of Lemma 1, we can use LMI (2.24) and inequality (2.19) to write

∥z^t∥² ≤ V(ψ^t) − V(ψ^{t+1}) + 2σ (ψ^t)^T A^T X B_w w^t + σ² (w^t)^T B_w^T X B_w w^t + 2σ (u^t)^T B_u^T X B_w w^t − (η^t)^T M η^t.    (A.8)

From Lemma 1, it follows that

(η^t)^T M η^t ≥ 2 (f(x^{t+2}) − f(x^{t+1})) − σ² L ∥w^t∥² − 2 \begin{bmatrix} σ w^t \\ 0 \end{bmatrix}^T \begin{bmatrix} LI & I \\ I & 0 \end{bmatrix} N_1 η^t.    (A.9)

Now, combining inequalities (A.8) and (A.9) yields

∥z^t∥² ≤ V(ψ^t) − V(ψ^{t+1}) + 2σ (ψ^t)^T A^T X B_w w^t + σ² (w^t)^T B_w^T X B_w w^t + 2σ (u^t)^T B_u^T X B_w w^t − 2 λ_2 (f(x^{t+2}) − f(x^{t+1})) + λ_2 σ² L ∥w^t∥² + 2 λ_2 \begin{bmatrix} σ w^t \\ 0 \end{bmatrix}^T \begin{bmatrix} LI & I \\ I & 0 \end{bmatrix} N_1 η^t.    (A.10)

Since w^t is a zero-mean white input with identity covariance which is independent of u^t and x^t, taking the expectation of the last inequality yields

E[∥z^t∥²] ≤ E[V(ψ^t) − V(ψ^{t+1})] + σ² trace(B_w^T X B_w) + 2 λ_2 E[f(x^{t+1}) − f(x^{t+2})] + n σ² L λ_2

and taking the average over the first \bar{T} iterations results in

\frac{1}{\bar{T}} \sum_{t=1}^{\bar{T}} E[∥z^t∥²] ≤ \frac{1}{\bar{T}} E[V(ψ^1) − V(ψ^{\bar{T}+1})] + σ² trace(B_w^T X B_w) + \frac{2 λ_2}{\bar{T}} E[f(x^2) − f(x^{\bar{T}+2})] + n σ² L λ_2.

Finally, using the positive definiteness of the function V, the strong convexity of the function f, and letting \bar{T} → ∞, it follows that J ≤ σ²(n L λ_2 + trace(B_w^T X B_w)), as required.

A.2.3 Proof of Theorem 5

Using Theorem 1, it is straightforward to show that for gradient descent and Nesterov's method with the parameters provided in Table 2.1, the function f(x) := (m/2)∥x∥² leads to the largest variance amplification J among the quadratic objective functions within F_m^L. This yields the lower bounds

q_{gd} = J_{gd} ≤ J^⋆_{gd},  q_{na} = J_{na} ≤ J^⋆_{na}

with J_{gd} and J_{na} corresponding to f(x) = (m/2)∥x∥². We next show that J_{gd} ≤ q_{gd}. To obtain the best upper bound on J_{gd} using Lemma 1, we minimize trace(B_w^T X B_w) subject to LMI (2.23), X ⪰ 0, and λ ≥ 0. For gradient descent, if we use the representation in (2.21c), then the negative definiteness of the (1,1)-block of LMI (2.23) implies that

X ⪰ \frac{1}{αm (2 − αm)} I = \frac{κ²}{2κ − 1} I.    (A.11)

It is straightforward to show that the pair

X = \frac{κ²}{2κ − 1} I,  λ = \frac{1 − αm}{m(2 − αm)(L − m)}    (A.12)

is feasible, as the LMI (2.23) becomes

\begin{bmatrix} 0 & 0 \\ 0 & −\frac{1}{m²(2κ − 1)} I \end{bmatrix} ⪯ 0.

Thus, X and λ given by (A.12) provide a solution to LMI (2.23). Therefore, inequality (A.11) is tight and it provides the best achievable upper bound

J_{gd} ≤ trace(B_w^T X B_w) = \frac{n κ²}{2κ − 1}.

Finally, we show J_{na} ≤ 4.08 q_{na} by finding a sub-optimal feasible point for (2.26). Let

X := \begin{bmatrix} x_1 I & x_0 I \\ x_0 I & x_2 I \end{bmatrix}

with

x_1 := \frac{1}{s(κ)} (2κ^{3.5} − 8κ^3 + 11κ^{2.5} + 5κ^2 − 14κ^{1.5} + 8κ − 2κ^{0.5})
x_0 := −\frac{1}{s(κ)} 2κ^{1.5} (κ^{0.5} − 1)^3 (κ^{0.5} + 1)
x_2 := \frac{κ^{1.5}}{s(κ)} (2κ^2 − 3κ + 5κ^{0.5} − 2),  s(κ) := 8κ^2 − 6κ^{1.5} − 2κ + 3κ^{0.5} − 1

and let λ_1 := (κ/L)²/(2κ − 1) and λ_2 := −x_0/(L s(κ)). We first show that (λ_1, λ_2, X) is feasible for problem (2.26). It is straightforward to verify that s(κ), x_1 s(κ), x_2 s(κ), and −x_0 s(κ) (which are polynomials of degree less than 7 in \sqrt{κ}) are all positive for any κ ≥ 1. Hence, x_1 > 0, x_2 > 0, and λ_2 > 0. It is also easy to see that λ_1 > 0 and that the determinant of X satisfies

det(X) = \frac{κ^{2n}}{s^{2n}(κ)} (28κ^{3.5} − 65κ^3 + 56κ^{2.5} + 25κ^2 − 88κ^{1.5} + 70κ − 26κ^{0.5} + 4)^n > 0,  ∀κ ≥ 1

which yields X ⪰ 0. Moreover, it can be shown that the left-hand side of LMI (2.24) becomes

\begin{bmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & −λ_1 I \end{bmatrix} ⪯ 0.
Therefore, the p oin t (λ 1 ,λ 2 ,X) is feasible to problem (2.26) and J na ≤ p(κ ) := nLλ 2 +nx 2 = n s(κ ) 4κ 3.5 − 4κ 3 − 3κ 2.5 +9κ 2 − 4κ 1.5 . Comparing p with q na , it can b e v eried that, for all κ ≥ 1, 4.08q na (κ ) ≥ p(κ ), whic h completes the pro of. A.2.4 Proof of Theorem 6 Without loss of generalit y , let σ =1 and G := n X i=1 max{ ˆ J(λ i ), ˆ J(λ ′ i )} (A.13) where λ i are the eigen v alues of the Hessian of the ob jectiv e function f and λ ′ i =m+L− λ i is the mirror image of λ i with resp ect to (m+L)/2. SinceJ = P i ˆ J(λ i ), ifλ i are symmetrically 223 distributed o v er the in terv al [m,L] i.e., (λ 1 ,··· ,λ n )=(λ ′ n ,··· ,λ ′ 1 ), then for an y parameters α and β w e ha v e J ≤ G ≤ 2J. (A.14) Equation (A.14) implies that an y b ound on G simply carries o v er to J within an accuracy of constan t factors. Th us, w e fo cus on G and establish one of its useful prop erties in the next lemma that allo ws us to pro v e Theorem 6. Lemma 2 The he avy-b al l metho d with any stabilizing p ar ameter β satises 2(1+β ) L+m = argmin α ρ (α,β ) (A.15) wher e ρ is the r ate of line ar c onver genc e. F urthermor e, if the Hessian of the quadr atic obje ctive function f has a symmetric sp e ctrum over the interval [λ 1 ,λ n ]=[m,L], then 2(1+β ) L+m = argmin α G(α, β ). Pr o of: The linear con v ergence rate ρ is giv en b y ρ = max 1≤ i≤ n ˆ ρ (λ i ), where ˆ ρ (λ ) is the largest absolute v alue of the ro ots of the c haracteristic p olynomial det(zI− ˆ A)=z 2 +(αλ − 1− β )z+β asso ciated with the hea vy-ball metho d and the eigen v alue λ of the Hessian of the ob jectiv e function f ; See (2.8) for the form of ˆ A. Th us, w e ha v e ˆ ρ (λ ) = ( √ β if ∆ <0 1 2 |1+β − αλ |+ 1 2 √ ∆ otherwise where ∆ :=(1+β − αλ ) 2 − 4β. This can b e simplied to ˆ ρ = ( √ β if (1− √ β ) 2 ≤ αλ ≤ (1+ √ β ) 2 1 2 |1+β − αλ |+ 1 2 √ ∆ otherwise. It is straigh tforw ard to sho w that ˆ ρ and ˆ J with σ =1 are explicit quasi-con v ex functions of µ :=αλ whic h are symmetric with resp ect to µ =1+β . Quasi-con v exit y of ˆ ρ yields ρ = max{ˆ ρ (λ 1 ),ˆ ρ (λ n )} = max{ˆ ρ (λ 1 ),ˆ ρ (λ ′ 1 )}. Let α ♯ (β ) = 2(1+β )/(L+m). F or an y eigen v alue λ i , from the symmetry of the sp ectrum, w e ha v e α ♯ (β )λ i − (1+β ) = (1+β ) − α ♯ (β )λ ′ i meaning that α ♯ (β )λ i and α ♯ (β )λ ′ i are the mirror images with resp ect to the middle p oin t 1+β . Th us, from the quasi-con v exit y and symmetry of the functions ˆ ρ and ˆ J , it follo ws 224 that α ♯ (β ) minimizes ρ as w ell as max{ ˆ J(λ i ), ˆ J(λ ′ i )} for all i, whic h completes the pro of. □ Since gradien t descen t is obtained from the hea vy-ball metho d b y letting β = 0, from Lemma 2 it immediately follo ws that α gd =2/(L+m) giv en in T able 2.2 optimizes b oth G gd and the con v ergence rate ρ gd . This fact com bined with (A.14) yields 2J gd (α ⋆ gd (c)) ≥ G gd (α ⋆ gd (c)) ≥ G gd (α gd ) ≥ J gd (α gd ) (A.16) where α ⋆ gd (c) is giv en b y (2.28b). This completes the pro of for gradien t descen t. W e next use Lemma 2 to establish a b ound on the parameter β ⋆ hb (c) that allo ws us to pro v e the result for the hea vy-ball metho d as w ell. Lemma 3 Ther e exists a p ositive c onstant a such that β ⋆ hb (c) ≥ 1 − a √ κ (A.17) wher e β ⋆ hb (c) is given by (2.28a) . Pr o of: W e rst sho w that for an y parameters α and β , the con v ergence rate ρ of the hea vy-ball metho d giv en b y (2.27) is lo w er b ounded b y ρ ≥ √ β if β ≥ ( √ κ − 1 √ κ +1 ) 2 (1+β )(L− m)+ √ (1+β ) 2 (L− m) 2 − 4β (L+m) 2 2(L+m) otherwise. 
(A.18) The con v ergence rate satises ρ = max 1≤ i≤ n ˆ ρ (λ i ) = max λ ∈{m,L} ˆ ρ (λ ) where the function ˆ ρ (λ ) is giv en b y (see pro of of Lemma 2 for the pro of of this statemen t) ˆ ρ (λ )= ( √ β if (1− √ β ) 2 ≤ αλ ≤ (1+ √ β ) 2 1 2 |1+β − αλ |+ 1 2 √ ∆ otherwise and ∆ := (1+β − αλ ) 2 − 4β. A ccording to Lemma 2, α = 2(1+β )/(L+m) optimizes the rate ρ . This v alue of α yields ˆ ρ (m) = ˆ ρ (L) = √ β if κ ≤ (1+ √ β ) 2 (1− √ β ) 2 1 2 |1+β − α ⋆ λ |+ 1 2 √ ∆ λ =m otherwise or equiv alen tly ˆ ρ (m) = ˆ ρ (L) = √ β if β ≥ ( √ κ − 1 √ κ +1 ) 2 (1+β )(L− m)+ √ (1+β ) 2 (L− m) 2 − 4β (L+m) 2 2(L+m) otherwise (A.19) 225 whic h completes the pro of of inequalit y (A.18). No w, if β ≥ ( √ κ − 1) 2 /( √ κ +1) 2 , then (A.17) with a=2 follo ws immediately . Otherwise, from (A.18) w e obtain ρ ≥ (1+β )(L− m)+ p (1+β ) 2 (L− m) 2 − 4β (L+m) 2 2(L+m) whic h yields β ≥ v(ρ ) := ρ ( L− m L+m − ρ )/(1 − L− m L+m ρ ). (A.20) The con v ergence rate ρ satises ( √ κ − 1) 2 /( √ κ +1) 2 ≤ ρ ≤ 1− c/ √ κ , where the lo w er b ound follo ws from the optimal rate pro vided in T able 2.2 and the upp er b ound follo ws from the denition in (2.28a). Moreo v er, the deriv ativ e dv dρ =0 v anishes only at ρ =( √ κ − 1)/( √ κ +1). Th us, w e obtain a lo w er b ound on β as β ≥ v(ρ ) ≥ min{v(( √ κ − 1 √ κ +1 ) 2 ), v(1− c/ √ κ ), v( √ κ − 1 √ κ +1 )}. (A.21) A simple manipulation of (A.21) allo ws us to nd a constan t a that satises (A.17), whic h completes the pro of. □ Let (ˆ α, ˆ β ) b e the optimal solution of the optimization problem minimize α,β G(α,β ) subject to ρ ≤ 1 − c/ √ κ where G is dened in (A.13). W e next sho w that there exists a scalar c ′ >0 suc h that G(ˆ α, ˆ β ) ≥ c ′ J(α hb ,β hb ) (A.22) where α hb and β hb are pro vided in T able 2.2. Let ˆ α (β ) := 2(1 + β )/(L + m). It is straigh tforw ard to v erify that J(ˆ α (β ),β ) = 1− β 2 hb 1− β 2 J(α hb ,β hb ) (A.23) 226 whic h allo ws us to write G(ˆ α, ˆ β ) (i) = min β G(ˆ α (β ),β ) (A.24) subject to ρ ≤ 1− c/ √ κ (ii) ≥ min β J(ˆ α (β ),β ) subject to ρ ≤ 1− c/ √ κ (iii) = min β 1− β 2 hb 1− β 2 J(α hb ,β hb ) subject to ρ ≤ 1− c/ √ κ (iv) ≥ 1− β 2 hb 1− (1− a √ κ ) 2 J(α hb ,β hb ). Here, (i) determines partial minimization with resp ect to α whic h follo ws from Lemma 2; (ii) follo ws from (A.14); (iii) follo ws from (A.23), and (iv) follo ws from Lemma 3. F urthermore, it is easy to sho w the existence of a constan t scalar c ′ suc h that 1− β 2 hb 1− (1− a √ κ ) 2 ≥ c ′ . (A.25) Inequalit y (A.22) follo ws from com bining (A.25) and (A.24). Finally , w e obtain that J(α ⋆ gd ,β ⋆ gd ) ≥ 1 2 G(α ⋆ gd ,β ⋆ gd ) ≥ 1 2 G(ˆ α, ˆ β ) ≥ c ′ 2 J(α gd ,β gd ) where the rst inequalit y follo ws from (A.14), the second follo ws from the denition of (ˆ α, ˆ β ), and the last inequalit y is giv en b y (A.22). This completes the pro of for the hea vy-ball metho d in Theorem 6. A.3 F undamental lower bounds A.3.1 Proof of Theorem 7 W e rst pro v e (2.29a). Without loss of generalit y , let the noise magnitude σ =1. W e dene the trivial lo w er b ound J ≥ ˆ J ⋆ := max{ ˆ J(m), ˆ J(L)} (A.26) and sho w that ˆ J ⋆ 1− ρ ≥ ( κ +1 8 ) 2 . Let ˜ f(x 1 ,x 2 ) := 1 2 (mx 2 1 +Lx 2 2 ). The eigen v alues of the Hessian matrix ∇ 2 ˜ f are giv en b y m and L whic h are clearly symmetric o v er the in terv al 227 [m,L]. Th us, for an y giv en v alue of β , m, and L, w e can use Lemma 2 with the ob jectiv e function ˜ f to obtain ˆ α (β ) := 2(1+β ) L+m = argmin α ˆ J ⋆ (α, β )=argmin α ρ (α, β ). 
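As an informal aside (not part of the proof), the minimizing property of the stepsize 2(1+β)/(L+m) is easy to confirm numerically. The Python sketch below scans the stepsize for the heavy-ball method applied to a quadratic whose Hessian has eigenvalues m and L, and compares the grid minimizer of the convergence rate against 2(1+β)/(L+m); the values m = 1, L = 100, β = 0.5, the grid resolution, and the NumPy dependency are illustrative choices made here rather than quantities taken from the analysis.

    import numpy as np

    def hb_rate(alpha, beta, eigs):
        # Spectral radius of the heavy-ball iteration matrix, maximized over the
        # Hessian eigenvalues; for eigenvalue lam the characteristic polynomial
        # is z^2 + (alpha*lam - 1 - beta)*z + beta.
        radii = []
        for lam in eigs:
            M = np.array([[1 + beta - alpha*lam, -beta], [1.0, 0.0]])
            radii.append(np.max(np.abs(np.linalg.eigvals(M))))
        return max(radii)

    m, L, beta = 1.0, 100.0, 0.5
    alphas = np.linspace(1e-4, 2*(1 + beta)/L, 4000)
    rates = [hb_rate(a, beta, [m, L]) for a in alphas]
    print(alphas[int(np.argmin(rates))], 2*(1 + beta)/(L + m))  # the two values nearly coincide

Repeating the scan for other stabilizing values of β gives the same agreement, consistent with Lemma 2.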
F or the stepsize ˆ α (β ), the rate of con v ergence ρ is giv en b y (A.19), i.e., ρ = √ β if β ≥ ( √ κ − 1 √ κ +1 ) 2 (1+β )(L− m)+ √ (1+β ) 2 (L− m) 2 − 4β (L+m) 2 2(L+m) otherwise (A.27) and the lo w er b ound ˆ J ⋆ is giv en b y ˆ J ⋆ = ˆ J(m)= ˆ J(L)= (L+m) 2 4Lm(1− β 2 ) . (A.28) Therefore, w e obtain a lo w er b ound on ˆ J ⋆ /(1− ρ ) as ˆ J ⋆ (α,β ) 1− ρ (α,β ) ≥ ν (β ) := ˆ J ⋆ (ˆ α (β ),β ) 1− ρ (ˆ α (β ),β ) = (L+m) 2 4Lm(1− β 2 )(1− √ β ) if β ≥ ( √ κ − 1 √ κ +1 ) 2 (L+m) 3 2Lm(1− β 2 ) (1− β )L+(3+β )m− √ (1+β ) 2 (L− m) 2 − 4β (L+m) 2 otherwise (A.29) where the last equalit y follo ws from (A.27) and (A.28). It can b e sho wn that v(β ) attains its minim um at β =( √ κ − 1) 2 /( √ κ +1) 2 ; see Figure A.1 for an illustration. Therefore, v β Figure A.1: The β -dep endence of the function v in (A.29) for L=100 and m=1. ν (β ) ≥ (L+m) 2 4Lm(1− β 2 )(1− √ β ) β =( √ κ − 1 √ κ +1 ) 2 = (L+m) 2 4Lm(1+β )(1+ √ β )(1− √ β ) 2 β =( √ κ − 1 √ κ +1 ) 2 ≥ (L+m) 2 16Lm(1− √ β ) 2 β =( √ κ − 1 √ κ +1 ) 2 = (κ +1) 2 ( √ κ +1) 2 64κ ≥ κ +1 8 2 228 whic h completes the pro of of (2.29a). W e next pro v e (2.29b) for σ =α . W e analyze the t w o cases α> 1/L andα ≤ 1/L separately . If α> 1/L, inequalit y (2.29b) directly follo ws from inequalit y (2.29a) J hb 1 − ρ ≥ σ 2 κ +1 8 2 = α 2 κ +1 8 2 ≥ κ 8L 2 . Here, the rst inequalit y is giv en b y (2.29a) and the second inequalit y holds since α> 1/L. No w supp ose α ≤ 1/L. The con v ergence rate of P oly ak’s metho d is giv en b y max i ˆ ρ (λ i ), where ˆ ρ (λ ) = ( √ β if (1− √ β ) 2 ≤ αλ ≤ (1+ √ β ) 2 1 2 |1+β − αλ |+ 1 2 √ ∆ otherwise and ∆ := (1+β − αλ ) 2 − 4β (see the pro of of Lemma 2). Th us, for σ = α , w e ha v e the trivial lo w er b ound J 1− ρ ≥ ˆ J(m) 1− ˆ ρ (m) = α (1 + β ) m(1 − β )(2(1+β ) − αm )(1− ˆ ρ (m)) ≥ p(α,β ) := α 2m(1 − β )(1− ˆ ρ (m)) = α 2m(1 − β ) 1− √ β , β ∈[(1− √ αm ) 2 , 1) α m(1 − β ) 1− β +αm − √ ∆ , β ∈[0, (1− √ αm ) 2 ). Here, the rst inequalit y follo ws from com bining J = P i ˆ J(λ i ) and max i ˆ ρ (λ i ), and the second inequalit y follo ws from αm ≤ αL ≤ 1. W e next sho w that for an y xed α , the function p(α, ·) attains its minim um at β = (1− √ αm ) 2 . Before w e do so, note that this fact allo ws us to use partial minimization with resp ect to β and obtain p(α,β ) ≥ p(α, (1− √ αm ) 2 ) = 1 2m 2 (2− √ αm ) ≥ 1 4m 2 ≥ ( κ 2L ) 2 whic h completes the pro of of (2.29b). F or an y xed α , it is straigh tforw ard to v erify that p(α,β ) is increasing with resp ect to β o v er [(1− √ αm ) 2 , 1). Th us, it suces to sho w that p(α,β ) is decreasing with resp ect to β o v er [0,(1− √ αm ) 2 ). T o simplify the presen tation, let us dene the new set of parameters q := s(s+x− δ ), s := 1− β, x := αm δ := √ ∆ = p (1+β − αm ) 2 − 4β = p (s+x) 2 − 4x. 229 It is no w straigh tforw ard to v erify that p(α,β ) = α/ (mq) for β ∈ [(1− √ αm ) 2 , 1). It th us follo ws that p(α,β ) is decreasing with resp ect to β o v er [0,(1− √ αm ) 2 ) if and only if q ′ =dq/ds≤ 0 for s∈( √ x(2− √ x),1]. The deriv ativ e is giv en b y q ′ = 1 δ (2s+x)δ − 2s 2 − 3sx− x 2 +4x . Th us, w e ha v e q ′ ≤ 0 ⇐⇒ (2s+x)δ ≤ 2s 2 +3sx+x 2 − 4x. (A.30) It is easy to v erify that b oth sides of the inequalit y in (A.30), namely , (2s+x)δ and 2s 2 + 3sx+x 2 − 4x are p ositiv e for the sp ecied range of s∈( √ x(2− √ x),1]. Th us, w e can square b oth sides and obtain that q ′ ≤ 0 ⇐⇒ (2s+x) 2 δ 2 ≤ (2s 2 +3sx+x 2 − 4x) 2 (i) ⇐⇒ (2s+x) 2 (s+x) 2 − 4x ≤ (2s 2 +3sx+x 2 − 4x) 2 (ii) ⇐⇒ 8sx 2 +4x 3 ≤ 16x 2 ⇐⇒ 8s+4x ≤ 16. 
where (i) follo ws from the denition of δ and (ii) is obtained b y expanding b oth sides and rearranging the terms. Finally , the inequalit y 8s+4x ≤ 16 clearly holds since s≤ 1 and x≤ 1. This pro v es that p(α, ·) attains its minim um at β =(1− √ αm ) 2 . A.3.2 Proof of Theorem 8 F or the hea vy-ball metho d, the result follo ws from com bining Theorem 7 and the inequalit y 1− ρ > c/ √ κ . Next, w e presen t three additional lemmas that allo w us to pro v e the result for Nestero v’s metho d. The follo wing lemma pro vides a lo w er b ound on the function ˆ J(m) asso ciated with Nestero v’s metho d whic h dep ends on κ and β . Lemma 4 F or any str ongly c onvex quadr atic pr oblem with c ondition numb er κ> 2 and the smal lest eigenvalue of the Hessian m, the function ˆ J asso ciate d with Nester ov’s ac c eler ate d metho d with any stabilizing p air of p ar ameters 0<α , 0<β < 1, and σ =1 satises ˆ J(m)≥ κ 2 24(1− β )κ +32β . (A.31) Pr o of: W e rst sho w that Nestero v’s metho d with 0 < α and 0 < β < 1 is stable if and only if m< 2β +2 ακ (2β +1) . (A.32) 230 The rate of linear con v ergence is giv en b y ρ = max 1≤ i≤ n ˆ ρ (λ i ), where ˆ ρ (λ ) is the largest absolute v alue of the ro ots of the c haracteristic p olynomial det(zI− ˆ A)=z 2 − (1+β )(1− αλ )z+β (1− αλ ) asso ciated with Nestero v’s metho d and the eigen v alue λ of the Hessian of the ob jectiv e function f ; See (2.8) for the form of ˆ A. F or α> 0 and 0<β < 1, it can b e sho wn that ˆ ρ (λ )= ( p β (1− αλ ) if αλ ∈(( 1− β 1+β ) 2 ,1) 1 2 |(1+β )(1− αλ )|+ 1 2 p (1+β ) 2 (1− αλ ) 2 − 4β (1− αλ ) otherwise. (A.33) The stabilit y of the algorithm is equiv alen t to ˆ ρ (λ i )<1 for all eigen v alues λ i . F or an y p ositiv e stepsize α and parameter β ∈ (0,1), it can b e sho wn that the function ˆ ρ (λ ) is quasi-con v ex and ˆ ρ (λ ) = 1 if and only if λ ∈{0, 2β +2 α (2β +1) }. This fact along with 0 < m≤ λ i ≤ L = κm imply that ˆ ρ (λ i ) < 1 for all λ i ∈ [m,L] if and only if κm ≤ 2β +2 α (2β +1) whic h completes the pro of of (A.32). F or Nestero v’s metho d, it is straigh tforw ard to sho w that the function ˆ J(λ ) is quasi- con v ex o v er the in terv al [0, 2β +2 α (2β +1) ] and that it attains its minim um at λ = 1/α . Also, from (A.32), for κ> 2 w e obtain m ≤ 2β +2 ακ (2β +1) ≤ 1 α and th us, ˆ J(m) ≥ ˆ J( 2β +2 ακ (2β +1) ) = (2β +1)κ 2 (κ − 2β +2βκ ) 4 (β +1) (κ − 1) (2β +κ +βκ − 2β 2 κ +2β 2 ) ≥ κ 2 24(1− β )κ +32β where the last inequalit y follo ws from the fact that β ∈(0,1). □ The next lemma presen ts a lo w er b ound on an y accelerating parameter β for Nestero v’s metho d. Lemma 5 F or Nester ov’s metho d, under the c onditions of The or em 8, ther e exist p ositive c onstants c 3 and c 4 such that for any κ>c 3 , β > 1 − c 4 √ κ . (A.34) 231 Pr o of: F or an y α > 0 and β ∈ (0,1), Nestero v’s metho d con v erges with the rate ρ = max 1≤ i≤ n ˆ ρ (λ i ), where ˆ ρ (λ ) is giv en b y (A.33). W e treat the t w o cases (1− β )/(1+β ) 2 <αm and (1− β )/(1+β ) 2 ≥ αm separately . F or (1− β )/(1+β ) 2 <αm , w e ha v e (1− β ) 2 ≤ 4( 1− β 1+β ) 2 < 4αm = 4 αL κ ≤ 8 κ (A.35) where the last inequalit y follo ws from (A.32). Therefore, w e obtain β ≥ 1− √ 8/ √ κ as required. No w, supp ose (1− β )/(1+β ) 2 ≥ αm . The con v ergence rate ρ satises ρ ≥ 1 2 (1+β )(1− αm ) + 1 2 p (1+β ) 2 (1− αm ) 2 − 4β (1− αm ). Th us, ρ 2 − ρ (1+β )(1− αm )+β (1− αm )>0 whic h yields a lo w er b ound on β , β ≥ ν (ρ,αm ) := ρ (1− αm − ρ ) (1− ρ )(1− αm ) . (A.36) In what follo ws, w e establish a lo w er b ound for ν . 
F or a xed αm , the critical p oin t of ν (ρ ) is giv en b y ρ 1 := 1− √ αm , i.e., ∂ν/∂ρ = 0 for ρ = ρ 1 . F urthermore, the optimal rate from T able 2.2 and the condition on con v ergence rate in Theorem 8 for an y κ > c 1 yield upp er and lo w er b ounds ρ 3 < ρ < ρ 2 , where ρ 2 := 1− c 2 / √ κ and ρ 3 := 1− 2/ √ 3κ +1. Th us, the lo w er b ound on ν is giv en b y β ≥ ν (ρ,αm ) ≥ min{ν (ρ 1 ,αm ),ν (ρ 2 ,αm ),ν (ρ 3 ,αm )}. (A.37) F rom the stabilit y condition (A.32), w e ha v e αm < 2/κ (A.38) F urthermore, it can b e sho wn that for an y giv en ρ ∈(0,1) the function ν (ρ,αm ) is decreasing with resp ect to αm . This fact com bined with (A.37) and (A.38) yield β ≥ min{ν (ρ 1 ,αm ),ν (ρ 2 ,2/κ ),ν (ρ 3 ,2/κ )}. (A.39) 232 If w e substitute for ρ 1 . ρ 2 , and ρ 3 their v alues as functions of κ and use αm< 2/κ , then the result follo ws immediately . In particular, ν (ρ 1 ,αm ) = 1− √ αm 1+ √ αm ≥ 1− p 2/κ 1+ p 2/κ = √ κ − √ 2 √ κ + √ 2 ≥ 1− 2 √ 2 √ κ ν (ρ 2 ,2/κ ) = 1− ( 2 c 2 +c 2 ) √ κ − 4 κ − 2 ≥ 1− ( 2 c 2 +c 2 ) √ κ , ∀κ ≥ ( 1 c 2 + c 2 2 ) 2 ν (ρ 3 ,2/κ ) = 1− 5κ − 4 √ 3κ +1+1 (κ − 2) √ 3κ +1 ≥ 1− 5 √ κ , ∀κ ≥ 9 whic h completes the pro of. □ The next lemma pro vides a lo w er b ound on J na /(1− ρ ) for Nestero v’s metho d with σ =α ≤ 1/L. Lemma 6 Nester ov’s ac c eler ate d metho d with any stabilizing p air of p ar ameters 0 < α ≤ 1/L and 0<β < 1, and σ =α satises J na 1− ρ ≥ 1 8 ( κ L ) 2 . Pr o of: The con v ergence rate of Nestero v’s metho d is giv en b y max i ˆ ρ (λ i ), where ˆ ρ (λ )= ( p β (1− αλ ) if αλ ∈(( 1− β 1+β ) 2 ,1) 1 2 |(1+β )(1− αλ )|+ 1 2 √ ∆ otherwise and ∆ := (1+β ) 2 (1− αλ ) 2 − 4β (1− αλ ); see equation (A.33). Th us, w e ha v e the trivial lo w er b ound J 1− ρ ≥ ˆ J(m) 1− ˆ ρ (m) = α (1 + β (1 − αm )) m(1 − β (1 − αm ))(2(1 + β ) − (2β + 1)αm )(1− ˆ ρ (m)) ≥ p(α,β ) := α 4m(1 − β (1 − αm ))(1− ˆ ρ (m)) = α 4m(1 − β (1− αm )) 1− p β (1− αm ) , β ∈[γ, 1) α 2m(1 − β (1− αm )) 2− (1+β )(1− αm )− √ ∆ , β ∈[0, γ ) (A.40) where γ := 1− √ αm 1+ √ αm . Here, the rst inequalit y can b e obtained b y com bining J = P i ˆ J(λ i ) and max i ˆ ρ (λ i ), and the second inequalit y follo ws from the fact that 0 < αm ≤ 1 and 0≤ β < 1. W e next sho w that for an y xed α , the function p(α, ·) attains its minim um at 233 β =γ . Before w e do so, note that this fact allo ws us to do partial minimization with resp ect to β and obtain p(α,β ) ≥ p(α,γ ) = 1 4m 2 (2− √ αm ) ≥ 1 8m 2 ≥ 1 8 ( κ L ) 2 . F or an y xed α , it is straigh tforw ard to v erify that p(α,β ) is increasing with resp ect to β o v er [γ, 1). Th us, it suces to sho w that p(α,β ) is decreasing with resp ect to β o v er [0,γ ). T o simplify the presen tation, let us dene q := (1− s)(2− x− s− δ ), x := 1− αm, s := βx δ := √ ∆ = p (1+β ) 2 (1− αm ) 2 − 4β (1− αm ) = p (x+s) 2 − 4s. It is no w straigh tforw ard to v erify that p(α,β ) = α/ (2mq) for β ∈ [0,γ ). It th us follo ws that p(α,β ) is decreasing with resp ect to β o v er [0,γ ) if and only if q ′ = dq/ds ≥ 0 for s∈[0,(1− √ 1− x) 2 ). The deriv ativ e is giv en b y q ′ = 1 δ (x+2s− 3)δ +(1− s)(2− x− s)+δ 2 . Th us, w e ha v e q ′ ≥ 0 ⇐⇒ (1− s)(2− x− s)+δ 2 ≥ (3− x− 2s)δ. (A.41) It is easy to v erify that b oth sides of the inequalit y in (A.41), namely , (1− s)(2− x− s)+δ 2 and (3− x− 2s)δ are p ositiv e for the sp ecied range of s∈[0,(1− √ 1− x) 2 ). 
Th us, w e can square b oth sides and obtain that q ′ ≥ 0 ⇐⇒ (1− s)(2− x− s)+δ 2 2 ≥ (3− x− 2s) 2 δ 2 (i) ⇐⇒ (1− s)(2− x− s)+(x+s) 2 − 4s 2 ≥ (3− x− 2s) 2 (x+s) 2 − 4s (ii) ⇐⇒ 4(x− 1) 2 (2s+x+1) ≥ 0. where (i) follo ws from the denition of δ and (ii) is obtained b y expanding b oth sides and rearranging the terms. Finally , the inequalit y 4(x− 1) 2 (2s+x+1) ≥ 0 trivially holds whic h completes the pro of. □ W e are no w ready to pro v e Theorem 8 for Nestero v’s metho d. The inequalit y in (2.30a) directly follo ws from com bining (A.31) in Lemma 4 and (A.34) in Lemma 5. T o sho w inequalit y (2.30b), w e treat the t w o cases α > 1/L and α ≤ 1/L separately . If α > 1/L, then (2.30b) directly follo ws from (2.30a) J na = α 2 J na σ 2 = Ω( κ 3 2 L 2 ). 234 No w supp ose α ≤ 1/L. W e can use Lemma 6 to obtain J na ≥ (1− ρ ) k 2 8L 2 ≥ c √ κ k 2 8L 2 = Ω( κ 3 2 L 2 ). Here, the rst inequalit y follo ws from Lemma 6 and the second inequalit y follo ws from the acceleration assumption ρ ≤ 1− c/ √ κ . This completes the pro of. A.4 Consensus over d-dimensional torus networks The pro of of Theorem 9 uses the explicit expression for the eigen v alues of torus in (2.33) to compute the v ariance amplication ¯ J = P i̸=0 ˆ J(λ i ) for all three algorithms. Sev eral tec hnical results that w e use in the pro of are presen ted next. W e b orro w the follo wing lemma, whic h pro vides tigh t b ounds on the sum of recipro cals of the eigen v alues of a d-dimensional torus net w ork, from [43, App endix B]. Lemma 7 The eigenvalues λ i of the gr aph L aplacian of the d-dimensional torus T d n 0 with n 0 ≫ 1 satisfy X 0̸=i∈Z d n 0 1 λ i = Θ( B(n 0 )) wher e the function B is given by B(n 0 ) = ( 1 d− 2 (n d 0 − n 2 0 ), d̸= 2 n d 0 log n 0 , d = 2. W e next use Lemma 7 to establish an asymptotic expression for the v ariance amplication of the gradien t descen t algorithm for a d-dimensional torus. Lemma 8 F or the c onsensus pr oblem over a d-dimensional torus T d n 0 with n 0 ≫ 1, the p erformanc e metric ¯ J gd c orr esp onding to gr adient de c ent with the stepsize α = 2/(L+m) satises ¯ J gd = Θ( B(n 0 )) wher e the function B is given in L emma 7. 235 Pr o of: Using the expression for the noise amplication of gradien t descen t from Theorem 1, w e ha v e ¯ J gd = X 0̸=i∈Z d n 0 1 αλ i (2 − αλ i ) = 1 2α X 0̸=i∈Z d n 0 1 λ i + 1 2 α − λ i = 1 2α X 0̸=i∈Z d n 0 1 λ i + 1 λ max + λ min − λ i ≈ 1 α X 0̸=i∈Z d n 0 1 λ i ≈ 2d X 0̸=i∈Z d n 0 1 λ i . The rst appro ximation follo ws from the facts that the eigen v alues satisfy 0 < λ i ≤ λ max + λ min ≈ 4d and that their distribution is asymptotically symmetric with resp ect to λ =2d. The second appro ximation follo ws from α = 2 L + m = 2 λ max + λ min ≈ 1 2d . The b ounds for the sum of recipro cals of λ i pro vided in Lemma 7 can no w b e used to complete the pro of. □ The follo wing lemma establishes a relationship b et w een the v ariance amplications of Nestero v’s metho d and gradien t descen t. This relationship allo ws us to compute tigh t b ounds on J na b y splitting it in to the sum of t w o terms. The rst term dep ends linearly on J gd whic h is already computed in Lemma 8 and the second term can b e ev aluated separately using in tegral appro ximations for consensus problem on torus net w orks. This result holds in general for the scenarios in whic h the largest eigen v alue L = Θ(1) is b ounded and the smallest eigen v alue m go es to zero causing the condition n um b er κ to go to innit y . 
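Before stating that lemma, the Θ(B(n₀)) scaling established in Lemma 8 can be illustrated with a short numerical sketch (informal, and not used in any proof). The sketch assumes NumPy, uses the modal expression 1/(αλ(2−αλ)) for gradient descent with unit noise variance and α = 2/(λmax+λmin), and the torus sizes below are illustrative choices only.

    import itertools
    import numpy as np

    def torus_Jgd(d, n0):
        # Nonzero Laplacian eigenvalues of the d-dimensional torus with n0^d nodes
        # and the gradient-descent amplification sum of 1/(alpha*lam*(2 - alpha*lam)).
        lams = np.array([2*sum(1 - np.cos(2*np.pi*i/n0) for i in idx)
                         for idx in itertools.product(range(n0), repeat=d)
                         if any(idx)])
        alpha = 2.0/(lams.max() + lams.min())
        return np.sum(1.0/(alpha*lams*(2 - alpha*lams)))

    def B(d, n0):
        # The function B from Lemma 7.
        return n0**d*np.log(n0) if d == 2 else (n0**d - n0**2)/(d - 2)

    for d in (1, 2, 3):
        print(d, [torus_Jgd(d, n0)/B(d, n0) for n0 in (8, 16, 32)])

For each dimension the printed ratios stay within constant bounds as n₀ grows, in line with the asymptotic statement of Lemma 8.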
Lemma 9 F or a str ongly c onvex quadr atic pr oblem with mI ⪯ Q ⪯ LI and c ondition numb er κ := L/m ≥ κ 0 , the r atio b etwe en varianc e amplic ations of Nester ov’s algorithm and gr adient desc ent with the p ar ameters given in T able 2.2 satises the asymptotic b ounds c 1 √ κ ≤ J na − D J gd ≤ c 2 , D := 2 (3β +1)α 2 na n X i=1 1 λ 2 i + 1− β α naβ λ i 236 wher e κ 0 , c 1 , and c 2 ar e p ositive c onstants. F urthermor e, dep ending on the distribution of the eigenvalues of the L aplacian matrix, D c an take values b etwe en c 3 κ ≤ D J gd ≤ c 4 √ κ (A.42) wher e c 3 and c 4 ar e p ositive c onstants. Pr o of: W e can split ˆ J na (λ )/ ˆ J gd (λ ) in to the sum of t w o decreasing homographic functions σ 1 (λ ) + σ 2 (λ ), where σ 1 and σ 2 are dened in (A.2); see the pro of of Prop osition 1. F urthermore, for κ ≫ 1, these functions attain their extrema o v er the in terv al [m,L] at σ 1 (L) ≈ 9 8κ , σ 1 (m) ≈ 3 √ 3κ 8 , σ 2 (L) ≈ 9 √ 3 16 √ κ , σ 2 (m) ≈ 3 8 (A.43) where w e ha v e k ept the leading terms. It is straigh tforw ard to v erify that n X i=1 σ 1 (λ i ) ˆ J gd (λ i ) = 2 (3β +1)α 2 na n X i=1 1 λ 2 i + 1− β α naβ λ i = D. This equation in conjunction with (A.43), yield inequalities in (A.42). Moreo v er, w e obtain that J na − D J gd = P n i=1 σ 2 (λ i ) ˆ J gd (λ i ) P n i=1 ˆ J gd (λ i ) . This also implies that, asymptotically , J na − D J gd = O max λ ∈[m,L] σ 2 (λ ) = O(1) J na − D J gd = Ω min λ ∈[m,L] σ 2 (λ ) = Ω( 1 √ κ ) whic h completes the pro of. □ The next t w o lemmas pro vide us with asymptotic b ounds on summations of the form P i 1/(λ 2 i +µλ i ), where λ i are the eigen v alues of the graph Laplacian matrix of a torus net w ork. These b ounds allo w us to com bine Lemma 8 and Lemma 9 to ev aluate the v ariance amplication of Nestero v’s accelerated algorithm. Lemma 10 F or an inte ger q≫ 1 and any p ositive a=O(q 3 ), we have X 0̸=i∈Z d q 1 ∥i∥ 4 +a∥i∥ 2 ≈ q d− 4 Z 1 1/q r d− 1 r 4 +wr 2 dr wher e ω =a/q 2 . 237 Pr o of: The function h(x) :=∥x∥ 4 +ω∥x∥ 2 is strictly increasing o v er the p ositiv e orthan t (x≻ 0) and h((1/q)1) go es to 0 as q go es to innit y where 1∈R d is the v ector of all ones. Therefore, using the lo w er and upp er Riemann sum appro ximations, it is straigh tforw ard to sho w that Z ··· Z ∆ ≤∥ x∥≤ 1 1 h(x) dx 1 ··· dx d ≈ ∆ d X 0̸=i∈Z d q 1 P d l=1 (∆ i l ) 2 2 +ω P d l=1 (∆ i l ) 2 where ∆ = 1 /q is the incremen tal step in the Riemann appro ximation. Therefore, since ω =a∆ 2 , w e can write X 0̸=i∈Z d q 1 ∥i∥ 4 +a∥i∥ 2 ≈ ∆ 4− d Z ··· Z ∆ ≤∥ x∥≤ 1 1 h(x) dx 1 ··· dx d . Finally , w e obtain the result b y transforming the in tegral in to a d-dimensional system with p olar co ordinates, i.e., Z ··· Z ∆ ≤∥ x∥≤ 1 1 h(x) dx 1 ··· dx d ≈ Z 1 ∆ r d− 1 r 4 +ωr 2 dr. □ Lemma 11 L et λ i b e the eigenvalues of the L aplacian matrix for the d-dimensional torus T d n 0 . In the limit of lar ge n 0 , for any µ =O(n 0 ), we have X 0̸=i∈Z d n 0 1 λ 2 i +µλ i = Θ n d 0 Z 1 1 n 0 r d− 1 r 4 +ωr 2 dr ! (A.44) wher e ω =Θ( µ ). Pr o of: Let ζ := P 0̸=i∈Z d n 0 1 λ 2 i +µλ i , where λ i are the eigen v alues of the Laplacian matrix, λ i = 2 d X l=1 1− cos(i l 2π n 0 ) . Since 1− cos(·− π ) is an ev en function, for large n 0 , ζ ≈ 2 d X 0̸=i∈Z d q 1 λ 2 i +µλ i 238 where q =⌊n 0 /2⌋. It is w ell-kno wn that the function 1− cos(x) can b e b ounded b y quadratic functions as x 2 /π 2 ≤ 1− cos(x) ≤ x 2 for an y x ∈ [− π,π ]. 
No w, since for an y i ∈ Z d q , i l 2π n 0 ∈[0,π ] for all l, w e can use these quadratic b ounds to obtain ζ ≈ n 4 0 X 0̸=i∈Z d q 1 ∥i∥ 4 + cµn 2 0 ∥i∥ 2 (A.45) where c is a b ounded constan t. Finally , equation (A.44) follo ws from Lemma 10 where w e let a=cµn 2 0 and q≈ n 0 /2. □ The follo wing prop osition c haracterizes the net w ork-size-normalized asymptotic v ariance amplication of noisy consensus algorithms for d-dimensional torus net w orks. This result is used to pro v e Theorem 9. Prop osition 1 L et L∈R n× n b e the gr aph L aplacian of the d-dimensional undir e cte d torus T d n 0 with n=n d 0 ≫ 1 no des. F or c onvex quadr atic optimization pr oblem (2.31) , the network- size-normalize d asymptotic varianc e amplic ation ¯ J/n of the rst-or der algorithms on the subsp ac e 1 ⊥ is determine d by d=1 d=2 d=3 d=4 d=5 Gradien t Θ( n) Θ(log n) Θ(1) Θ(1) Θ(1) Nestero v Θ( n 2 ) Θ( √ nlog n) Θ( n 1/6 ) Θ(log n) Θ(1) P oly ak Θ( n 2 ) Θ( √ n log n) Θ( n 1/3 ) Θ( n 1/4 ) Θ( n 1/5 ). Pr o of: W e pro v e the result for the three algorithms separately . 1. F or gradien t descen t, the result follo ws from dividing the asymptotic b ounds established in Lemma 8 with the total n um b er of no des n=n d 0 . 2. F or Nestero v’s algorithm, w e use the relation established in Lemma 9 to write ¯ J na /n − c n X i 1 λ 2 i +µλ i = O ¯ J gd /n (A.46a) ¯ J na /n − c n X i 1 λ 2 i +µλ i = Ω ¯ J gd /(n √ κ ) (A.46b) where c = 2/((3β +1)α 2 na ) ≈ 9d 2 /2 and µ = (1− β )/(α na β ) = Θ(1 / √ κ ) = Θ( n − 1 0 ); see equation (2.34). W e can use Lemma 11 to compute the second term 1 n X 0̸=i∈Z d n 0 1 λ 2 i +µλ i = Θ Z 1 1 n 0 r d− 1 r 4 +ωr 2 dr ! (A.47) 239 where ω =Θ( µ )=Θ( n − 1 0 ). Ev aluating the ab o v e in tegral for dieren t v alues of d∈N and letting ω =Θ( n − 1 0 ), it is straigh tforw ard to sho w that Z 1 1 n 0 r d− 1 r 4 + ωr 2 dr = Θ( n 2 0 ) d = 1 Θ( n 0 log n 0 ) d = 2 Θ( √ n 0 ) d = 3 Θ(log n 0 ) d = 4 Θ(1) d = 5. Finally , the result follo ws from the asymptotic v alues for ¯ J gd /n (sho wn in P art 1) and substituting for the second term on the left-hand-side of equation (A.46) from the ab o v e asymptotic v alues and using n = n d 0 . W e note that w e used the follo wing in tegrals to ev aluate ¯ J na , Z 1 r 4 +ωr 2 dr = − tan − 1 ( r √ ω ) ω 3/2 − 1 rω Z r r 4 +ωr 2 dr = − log(r 2 +ω)− 2log(r) 2ω Z r 2 r 4 +ωr 2 dr = tan − 1 ( r √ ω ) √ ω Z r 3 r 4 +ωr 2 dr = 1 2 log(r 2 +ω) Z r 4 r 4 +ωr 2 dr = r− √ ωtan − 1 ( r √ ω ). 3. The result for the hea vy-ball metho d directly follo ws from the rst part of the pro of, the relationship b et w een v ariance amplications of gradien t descen t and the hea vy-ball metho d in Theorem 2, and equation (2.34). □ W e no w use Prop osition 1 to pro of Theorem 9 as follo ws. A.4.1 Proof of Theorem 9 As stated in (2.34), the condition n um b er satises κ = Θ( n 2/d ) and the result follo ws from com bining this asymptotic relation with those pro vided in Prop osition 1. A.4.2 Computational experiments T o complemen t our asymptotic theoretical results, w e compute the p erformance measure ¯ J in (2.32) for the consensus problem o v er d-dimensional torus T d n 0 with n = n d 0 no des for dieren t v alues of n 0 and d. W e use expression (2.33) for the eigen v alues of the graph 240 ¯ J/n κ κ κ κ (a) d=1 (b) d=2 (c) d=3 (d) d=4 Figure A.2: The dep endence of the net w ork-size normalized p erformance measure ¯ J/n of the rst-order algorithms for d-dimensional torusT d n 0 with n=n d 0 no des on condition n um b er κ . 
The blue, red, and blac k curv es corresp ond to the gradien t descen t, Nestero v’s metho d, and the hea vy-ball metho d, resp ectiv ely . Solid curv es mark the actual v alues of ¯ J/n obtained using the expressions in Theorem 1 and the dashed curv es mark the trends established in Theorem 9. Laplacian L to ev aluate the form ulae pro vided in Theorem 1 for eac h algorithm. Figure A.2 illustrates net w ork-size normalized v ariance amplication ¯ J/n vs. condition n um b er κ and v eries the asymptotic relations pro vided in Theorem 9. It is notew orth y that, even though our analysis is asymptotic in the c ondition numb er (i.e., it assumes that κ ≫ 1), our c omputational exp eriments exhibit similar sc aling tr ends for smal l values of κ as wel l. 241 Appendix B Supporting proofs for Chapter 3 B.1 Settling time If ρ denotes the linear con v ergence rate, T s = 1/(1− ρ ) quan ties the settling time . The inequalit y in (3.5) sho ws that cρ t ≤ ϵ pro vides a sucien t condition for reac hing the accuracy lev el ϵ with ∥ψ t ∥ 2 /∥ψ 0 ∥ 2 ≤ ϵ . T aking the logarithm of cρ t ≤ ϵ and using the rst-order T a ylor series appro ximation log(1− x)≈− x around x = 0 yields a sucien t condition on the n um b er of iterations t for an algorithm to reac h ϵ -accuracy , t ≥ log(ϵ/c )/log(1− 1/T s ) ≈ T s log(c/ϵ ). In con tin uous time, the sucien t condition for reac hing ϵ -accuracy ce − ρt ≤ ϵ yields t ≥ log(c/ϵ )/ρ, and T s =1/ρ can b e used to asses the settling time. B.2 Convexity of modal contribution ˆ J T o sho w the con v exit y of ˆ J , w e use the fact that the function g(x)= Q d i=1 x − 1 i is con v ex o v er the p ositiv e orthan t R d ++ . This can b e v eried b y noting that its Hessian satises ∇ 2 g(x) = g(x) diag(x) + xx T ≻ 0 where diag(·) is the diagonal matrix. By Theorem 5, w e ha v e ˆ J σ 2 w = d + l 2dhl = 1 2hd + 1 2hl where w e ha v e dropp ed the dep endence on λ for simplicit y . The functions 1/(2hd) and 1/(2hl) are b oth con v ex o v er the p ositiv e orthan t d,h,l > 0. Th us, ˆ J is con v ex with resp ect to (d,h,l). In addition, since d,h, and l are all ane functions of a and b, w e can use the equiv alence relation in (3.30a) to conclude that ˆ J is also con v ex in (b,a) o v er the stabilit y triangle ∆ . Finally , since b(λ ) and a(λ ) are ane in λ , it follo ws that for an y stabilizing parameters, ˆ J is also con v ex with resp ect to λ o v er the in terv al [m,L]. 242 Con v exit y of ˆ J allo ws us to use rst-order conditions to nd its minimizer. In particular, since for σ w =1 ∂ ˆ J ∂d = − 1 2hd 2 , ∂ ˆ J ∂l = − 1 2hl 2 , ∂ ˆ J ∂h = − l + d 2h 2 dl ∂d ∂a = ∂l ∂a = − ∂h ∂a = ∂d ∂b = − ∂l ∂b = 1, ∂h ∂b = 0 it is easy to v erify that ∂ ˆ J/∂a = ∂ ˆ J/∂b = 0 at a = b = 0. Th us, ˆ J tak es its minim um ˆ J min =σ 2 w o v er the stabilit y triangle ∆ at a=b=0, whic h corresp onds to d=h=l =1. B.3 Proofs of Section 3.4 B.3.1 Proof of Lemma 2 W e start b y noting that ρ (M) ≤ ρ if and only if ρ (M ′ ) ≤ 1 where M ′ := M/ρ . The c haracteristic p olynomial asso ciated with M ′ , F ρ (z) = z 2 +(b/ρ )z +a/ρ 2 , allo ws us to use similar argumen ts to those presen ted in the pro of of Lemma 1 to sho w that ρ (M ′ ) ≤ 1 ⇐⇒ (b/ρ, a/ρ 2 ) ∈ ∆ 1 (B.1) where ∆ 1 :={(b,a)||b| − 1 ≤ a ≤ 1} is the closure of the set ∆ in (3.21b). Finally , the condition on the righ t-hand side of (B.1) is equiv alen t to (b,a) ∈ ∆ ρ , where ∆ ρ is giv en b y (3.22b). 
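The equivalence established in this proof can also be probed numerically. The sketch below is illustrative only: it assumes NumPy, fixes ρ = 0.9, writes ∆ρ in the scaled form ρ|b| − ρ² ≤ a ≤ ρ² implied by (B.1), and uses a companion matrix as a stand-in for M, which suffices here because only the characteristic polynomial z² + bz + a matters.

    import numpy as np

    def spectral_radius(b, a):
        # Companion matrix whose characteristic polynomial is z^2 + b*z + a.
        return np.max(np.abs(np.linalg.eigvals(np.array([[0.0, 1.0], [-a, -b]]))))

    def in_triangle(b, a, rho):
        # (b/rho, a/rho^2) belongs to the closure of the stability triangle.
        return rho*abs(b) - rho**2 <= a <= rho**2

    rng = np.random.default_rng(0)
    rho = 0.9
    for _ in range(10000):
        b, a = rng.uniform(-2.5, 2.5), rng.uniform(-1.5, 1.5)
        assert (spectral_radius(b, a) <= rho) == in_triangle(b, a, rho)

No violations should occur, matching the statement ρ(M) ≤ ρ if and only if (b,a) ∈ ∆ρ.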
Remark 1 The eigenvalues of the matrix M in (3.20a) ar e given by (− b± √ b 2 − 4a)/2, and the sign of b 2 − 4a determines if the eigenvalues ar e r e al or c omplex. The c ondition a=b 2 /4 denes a p ar ab ola that p asses thr ough the vertic es X ρ = (− 2ρ,ρ 2 ) and Y ρ = (2ρ,ρ 2 ) of the triangle ∆ ρ and is tangent to the e dges X ρ Z ρ and Y ρ Z ρ for al l ρ < 1; se e Figur e B.1. F or the optimal values of p ar ameters pr ovide d in T able 3.1, we c an c ombine this observation and the information in Figur e 3.3 to c onclude that while al l eigenvalues of the matrix A in (3.4a) ar e r e al for gr adient desc ent, they c an b e b oth r e al and c omplex for Nester ov’s ac c eler ate d algorithm, and they c ome in c omplex-c onjugate p airs for he avy-b al l metho d. B.3.2 Proof of Equation (3.28c) A ccording to Figure 3.3, in order to nd the largest ratio d(L)/d(m) o v er the ρ -linear con v ergence set ∆ ρ for Nestero v’s accelerated metho d, w e need to c hec k the pairs of p oin ts {E,E ′ } that lie on the b oundary of the triangle ∆ ρ , whose line segmen t EE ′ passes through the origin O . If one of the end p oin ts E lies on the edge X ρ Y ρ , then dep ending on whether the other end p oin t E ′ lies on the edge X ρ Z ρ or Y ρ Z ρ , w e can con tin uously increase the ratio d E /d E ′ b y mo ving E to w ard the v ertices Y ρ or X ρ , resp ectiv ely . Th us, this case reduces to c hec king only the ratio d E /d E ′ for the line segmen ts X ρ X ′ ρ and Y ρ Y ′ ρ , where X ′ ρ = (2ρ/ 3,− ρ 2 /3), Y ′ ρ = (− 2ρ/ 3,− ρ 2 /3) (B.2) 243 • • • X Y Z • • • X ρ Y ρ Z ρ b a Figure B.1: The green and orange subsets of the stabilit y triangle ∆ (dashed-red) corresp ond to complex conjugate and real eigen v alues for the matrix M in (3.20a), resp ectiv ely . The blue parab ola a = b 2 /4 corresp onds to the matrix M with rep eated eigen v alues and it is tangen t to the edges X ρ Z ρ and Y ρ Z ρ of the ρ -linear con v ergence triangle ∆ ρ (solid red). • • • • • X ρ Y ρ Z ρ X ′ ρ Y ′ ρ • • E E ′ • b a Figure B.2: The p oin ts X ′ ρ and Y ′ ρ as dened in (B.2) along with an arbitrary line segmen t EE ′ passing through the origin in the (b,a)-plane. are the in tersections of OX ρ with Y ρ Z ρ , and OY ρ with X ρ Z ρ ; see Figure B.2. Regarding the case when neither E nor E ′ lies on the edge X ρ Y ρ , let us assume without loss of generalit y that E and E ′ lie on Y ρ Z ρ and X ρ Z ρ , resp ectiv ely . In this case, w e can parameterize the ratio using d E d E ′ = (1+c)(1/(1− ρ )− c) (1− c)(1/(1+ρ )+c) , c ∈ [− 1/2,1/2] (B.3) where cρ determines the slop e of EE ′ . The general shap e of this function is pro vided in Figure B.3. It is easy to v erify that d E /d E ′ tak es its maxim um o v er c∈ [− 1/2,1/2] at one of the b oundaries. Th us, this case also reduces to c hec king only the ratio d E /d E ′ for the line segmen ts X ρ X ′ ρ and Y ρ Y ′ ρ . W e complete the pro of b y noting that d X ′ ρ d Xρ = (1 + ρ )(3 − ρ ) 3(1 − ρ ) 2 , d Yρ d Y ′ ρ = 3(1 + ρ ) 2 (3 + ρ )(1 − ρ ) satisfy d X ′ ρ /d Xρ >d Yρ /d Y ′ ρ . 244 1 2 1 2 c = −1 1+ρ c = 1 c d E d E ′ Figure B.3: The ratio d E /d E ′ in (B.3) for Nestero v’s metho d, where E and E ′ lie on the edges Y ρ Z ρ and X ρ Z ρ of the ρ -linear con v ergence triangle ∆ ρ , and cρ determines the slop e of EE ′ whic h passes through the origin. 
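Both claims used at the end of this argument can be spot-checked numerically. The sketch below is illustrative only; NumPy and the particular value ρ = 0.8 are assumptions made here. It evaluates the ratio d_E/d_E′ from (B.3) on a fine grid over c ∈ [−1/2, 1/2] and compares d_{X′ρ}/d_{Xρ} with d_{Yρ}/d_{Y′ρ}.

    import numpy as np

    rho = 0.8
    c = np.linspace(-0.5, 0.5, 100001)
    ratio = (1 + c)*(1/(1 - rho) - c)/((1 - c)*(1/(1 + rho) + c))  # d_E/d_E' in (B.3)
    print(c[np.argmax(ratio)])                   # maximum attained at an endpoint of [-1/2, 1/2]
    lhs = (1 + rho)*(3 - rho)/(3*(1 - rho)**2)   # d_{X'_rho}/d_{X_rho}
    rhs = 3*(1 + rho)**2/((3 + rho)*(1 - rho))   # d_{Y_rho}/d_{Y'_rho}
    print(lhs > rhs)                             # True for this rho

Repeating the check over a grid of ρ ∈ (0,1) gives the same qualitative picture.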
B.4 Proofs of Section 3.5 B.4.1 Proof of Lemma 4 W e sho w that the parameters (α,β,γ ) in (3.32) place the p oin ts (b(m),a(m)) and(b(L),a(L)) on the edges X ρ Z ρ and Y ρ Z ρ of the ρ -linear con v ergence triangle ∆ ρ , resp ectiv ely . In particular, w e can use a scalar c∈[− 1,1] to parameterize the end p oin ts as (b(m),a(m)) = (− (1+c)ρ, cρ 2 ), (b(L),a(L)) = ((1+c)ρ, cρ 2 ). Using the denition of a and b in (3.18c), w e can solv e the ab o v e equations for (α,β,γ ) to v erify the desired parameters. Th us, the algorithm ac hiev es the con v ergence rate ρ . In addition the p oin ts c = 0 and c = 1 reco v er gradien t descen t and hea vy-ball metho d with the parameters that optimize the con v ergence rate; see T able 3.1. F urthermore, h, d, and l in (3.29) are giv en b y h(m) = h(L) = 1 − cρ 2 d(m) = l(L) = (1 − ρ )(1 − cρ ) l(m) = d(L) = (1 + ρ )(1 + cρ ) (B.4a) and the condition n um b er is determined b y κ = αL αm = d(L) d(m) = l(m) d(m) . (B.4b) Com bining (B.4b) with (B.4a), and rearranging terms yields the desired expression for c in terms of ρ and κ . The analytical expressions in Theorem 5 imply that for the parameters in (3.32), the function ˆ J(λ ) is symmetric o v er [m,L], i.e., ˆ J(λ ) = ˆ J(m + L− λ ) for all λ ∈ [m,L]. In addition, as w e demonstrate in App endix B.2, ˆ J(λ ) is con v ex. Th us, ˆ J(λ ) attains its 245 maxim um at λ = m and λ = L and w e can use the expression for ˆ J(λ ) in Theorem 5 to obtain the maxim um v alue, ˆ J(m) = σ 2 w (d(m)+(m)) 2h(m)d(m)l(m) = σ 2 w (κ + 1) 2h(m)l(m) (B.4c) where the second equalit y follo ws from (B.4b). Com bining (B.4a) and (B.4c) yields the expression for ˆ J(m). Also, symmetry and con v exit y imply that ˆ J(λ ) attains its minim um at the midp oin t λ = ˆ λ := (m+L)/2 = (1+β )/α . This p oin t corresp onds to (b( ˆ λ ),a( ˆ λ )) = (0,cρ 2 ) in the (b,a)-plane and it th us satises h( ˆ λ ) = 1 − cρ 2 , d( ˆ λ ) = l( ˆ λ ) = 1 + cρ 2 . (B.4d) Using (B.4d) to ev aluate the expression for ˆ J(λ ) at the p oin t λ = ˆ λ yields the desired minim um v alue. B.4.2 Proof of Proposition 2 Using the expressions established in Lemma 4, it is straigh tforw ard to v erify that ˆ J(m)× T s = σ 2 w p 1c (ρ )κ (κ + 1) ˆ J( ˆ λ )× T s = σ 2 w κp 2c (ρ ) and that, for the gradien t noise mo del (σ w =ασ ), w e ha v e ˆ J(m)× T s = σ 2 p 3c (ρ )κ (κ + 1) ˆ J( ˆ λ ) = σ 2 κp 2c (ρ ) where the functions p 1c (ρ )-p 4c (ρ ) are giv en b y (3.36). Th us, the expressions for J max andJ min follo w from Corollary 2. The b ounds on p 1c (ρ )-p 4c (ρ ) follo w from the fact that, for ρ ∈(0,1), w e ha v e q c (ρ ) = 1− cρ 1− cρ 2 ∈ [1/(1+cρ ),1] c ∈ [0,1] [1/2,2] c ∈ [− 1,0]. This completes the pro of. B.4.3 Proof of Proposition 3 Using the expressions established in Lemma 4, it is straigh tforw ard to v erify that ˆ J(m) = σ 2 w p 5c (ρ )(1 + 1/κ )T s ˆ J( ˆ λ ) = σ 2 w p 6c (ρ )T s /κ where p 5c and p 6c are giv en b y Prop osition 3. Th us, the expressions for J max and J min follo w from Corollary 2. The b ounds on p 5c and p 6c also follo w from c∈[− 1,0] and ρ ∈(0,1). 246 B.4.4 Proof of Proposition 4 W e sho w that (α,β,γ ) corresp ond to the parameterized family of Nestero v-lik e algorithms in whic h the end p oin ts of the line segmen t (b(λ ),a(λ )), λ ∈[m,L], lie on the edges X ρ Z ρ and Y ρ Z ρ of the ρ -linear con v ergence triangle ∆ ρ . In particular, w e can use a scalar c∈ [0,1/2] to parameterize the lines passing through the origin via a=− cρb . 
This yields (b(m),a(m)) = (− ρ/ (1 − c),cρ 2 /(1 − c)) (b(L),a(L)) = (ρ/ (1 + c),− cρ 2 /(1 + c)). Using the denitions of a and b in (3.18c), w e can solv e the ab o v e equations for (α,β,γ ) to v erify the desired parameters. Th us, the algorithm ac hiev es the con v ergence rate ρ and the extreme p oin ts c = 0 and c = 1/2 reco v er gradien t descen t and Nestero v’s metho d with the parameters pro vided in T able 3.1 that optimize settling times. In Lemma 1, w e establish expressions for the con v ergence rate and largest/smallest mo dal con tributions to noise amplication in terms of the condition n um b er for this family of parameters. Lemma 1 F or the class of functions Q L m with c ondition numb er κ = L/m, the extr eme values ˆ J max and ˆ J min of ˆ J(λ ) over [m,L] asso ciate d with the two-step momentum algorithm in (3.2) with p ar ameters (3.38) satisfy ˆ J max = ˆ J(m) = σ 2 w (1 − c) 2 (rκ + 1) 2(1− c− cρ 2 )(1+ρ )(1− c+cρ ) ≥ ˆ J(L) = σ 2 w (1 + c) 2 (1+c− cρ 2 ) (1− ρ 2 )(1+c− cρ )(1+c+cρ )(1+c+cρ 2 ) and ˆ J min = ˆ J(1/α )=σ 2 w , wher e the sc alar r∈[1,3] is given by r := (1+c)(1− c+cρ )/((1− c)(1+c− cρ )) and the sc alar c∈[0,1/2] is given by Pr op osition 4. Pr o of: The v alues of h, d, and l in (3.29) are giv en b y h(m) = (1 − c − cρ 2 )/(1 − c) h(L) = (1 + c + cρ 2 )/(1 + c) d(m) = (1 − ρ )(1 − c − cρ )/(1 − c) d(L) = (1 + ρ )(1 + c − cρ )/(1 + c) l(m) = (1 + ρ )(1 − c + cρ )/(1 − c) l(L) = (1 − ρ )(1 + c + cρ )/(1 + c) (B.5a) and the condition n um b er is determined b y κ = αL αm = d(L) d(m) = l(m) rd(m) (B.5b) where w e let r :=l(m)/d(L). By com bining this iden tit y with the expressions in (B.5a), and rearranging terms, w e can obtain the desired quadratic equation for c in terms of ρ andκ . T o 247 see that r∈[1,3], from Figure 3.3 w e observ e that as w e c hange the orien tation from gradien t descen t (c = 0) to Nestero v’s metho d with parameters that optimize the con v ergence rate (c = 1/2), l(m) and 1/d(L) monotonically increase. Th us, r is also increasing in c, and its smallest and largest v alues are obtained for c = 0 and c = 1/2, resp ectiv ely , whic h yields 1≤ r≤ 3(1+ρ )/(3− ρ )≤ 3. As w e demonstrate in App endix B.2, ˆ J as a function of (b,a) tak es its minim um ˆ J min =σ 2 w at the origin. In addition, for eac h c ∈ [0,1/2], the line segmen t (b(λ ),a(λ )), λ ∈ [m,L], passes through the origin at λ =1/α . Th us, the minim um of ˆ J(λ ) o ccurs at λ =1/α and is giv en b y ˆ J min =σ 2 w . W e next sho w that ˆ J(m) is the larges v alue of ˆ J(λ ) o v er [m,L]. Since ˆ J(λ ) is a con v ex function of λ (see App endix B.2), it attains its maxim um at one of the b oundary p oin ts λ = m and λ = L. T o sho w ˆ J(m) > ˆ J(L), w e rst obtain expressions for ˆ J(m) and ˆ J(L) in terms of ρ and c b y com bining (B.5a) with the analytical expression for ˆ J in Theorem 5. By prop erly rearranging terms and simplifying fractions, w e can obtain the equiv alence ˆ J(m) ≥ ˆ J(L) ⇐⇒ c 4 ρ 4 − c 4 ρ 2 − c 2 ρ 2 − c 2 +1 ≥ 0. F or ρ ∈[0,1] and c∈[0,1/2], it is easy to v erify that the inequalit y on the righ t-hand holds. T o obtain the maxim um v alue, w e use Theorem 5 to write ˆ J(m) σ 2 w = d(m)+(m) 2h(m)d(m)l(m) = rκ + 1 2h(m)l(m) (B.5c) Com bining (B.5a) with (B.5c) yields the desired v alue for ˆ J(m). □ Lemma 1 allo ws us to deriv e analytical expressions for the largest and smallest v alues that J tak es o v er f ∈Q L m . 
Corollary 1 The p ar ameterize d family of Nester ov-like metho ds (3.38) satises J max = (n − 1) ˆ J(m) + ˆ J(L) J min = ˆ J(m) + ˆ J(L) + (n − 2) ˆ J(1/α ) wher e ˆ J(m), ˆ J(L), ˆ J(1/α ) ar e given by L emma 1, and J max , J min ar e the extr eme values of J when the algorithm is applie d to f ∈Q L m with c ondition numb er κ =L/m. Pr o of: The result follo ws from com bining Lemma 1 and the expression J = P n i=1 ˆ J(λ i ) established in Theorem 5. In particular, J is maximized when Q has n− 1 eigen v alues at m and one at L, and it is minimized when, apart from the extreme eigen v alues m and L, the rest are at λ =1/α . □ W e next establish order-wise tigh t upp er and lo w er b ounds on ˆ J max /(1− ρ ) and ˆ J min /(1− ρ ) in terms of κ . 248 Lemma 2 F or the p ar ameterize d family of Nester ov-like metho ds (3.38) , the lar gest and smal lest mo dal c ontributions to varianc e amplic ation establishe d in L emma 1 satisfy σ 2 w ω 1 rκ (rκ + 1)≤ ˆ J max × T s ≤ σ 2 w ω 2 rκ (rκ + 1) σ 2 w √ 3κ +1/2≤ ˆ J min × T s ≤ σ 2 w (κ + 1)/2 wher e the sc alar ω 1 := (1 + ρ ) − 3 (1− c) 2 (1− c + cρ ) − 2 /2, ω 2 := (1 + ρ )ω 1 , and we have (1 + ρ ) − 5 ≤ ω 1 ≤ (1 + ρ ) − 3 . Pr o of: T o obtain the upp er and lo w er b ounds on ˆ J max × T s = ˆ J(m)× T s , w e com bine (B.5b) and (B.5c) to write ˆ J(m) = rκ (rκ + 1) 2h(m)(l(m)) 2 /d(m) where w e set σ w =1. This equation in conjunction with the trivial inequalities 1 ≤ (1 − ρ )h(m)/d(m) ≤ 1 + ρ allo ws us to write rκ (rκ + 1) 2(1 + ρ )l 2 (m) ≤ ˆ J(m) 1 − ρ ≤ rκ (rκ + 1) 2l 2 (m) . (B.6) Com bining (B.5a) and (B.6) yields the desired b ounds on ˆ J max /(1− ρ ). Finally , the b ounds on ˆ J min × T s =σ 2 w /(1− ρ ) can b e obtained b y noting that T s =1/(1− ρ )∈[ √ 3κ +1/2,(κ +1)/2] as sho wn in Lemma 1. □ Similar to the hea vy-ball-lik e metho ds, ˆ J max × T s = Θ( κ 2 ). Ho w ev er, the upp er and lo w er b ounds on ˆ J min × T s scale linearly with κ and √ κ , resp ectiv ely . W e next use this result to b ound J× T s and complete the pro of of Prop osition 4. In particular, w e ha v e (n− 1)ω 1 rκ (rκ + 1) + √ 3κ +1 2 ≤ J max × T s σ 2 w ≤ nω 2 rκ (rκ + 1) ω 1 rκ (rκ + 1)+(n− 1) √ 3κ +1 2 ≤ J min × T s σ 2 w ≤ ω 2 rκ (rκ + 1)+(n− 1) κ +1 2 where the scalar r ∈ [1,3], and ω 1 and ω 2 are giv en b y Lemmas 1 and 2, resp ectiv ely . T o see this, note that as sho wn in the pro of of Corollary 1, J/(1− ρ ) is maximized when Q has n− 1 eigen v alues at m and one at L, and is minimized when, apart from the extreme eigen v alues m andL, the rest are placed at λ =1/α . Emplo ying the b ounds on ˆ J max = ˆ J(m) and ˆ J min = ˆ J(1/α ) pro vided b y Lemma 2 and noting that ˆ J(L)∈[ ˆ J min , ˆ J max ] completes the pro of. 249 B.5 Proofs of Section 3.6 B.5.1 Proof of Lemma 5 Stabilit y can b e v eried using the Routh-Hurwitz criterion applied to the c haracteristic p olynomial F(s) = det(sI− M) = s 2 + bs + a. Similarly , conditions for ρ -exp onen tial stabilit y can b e obtained b y applying the Routh-Hurwitz criterion to the c haracteristic p olynomial F ρ (s) asso ciated with the matrix M +ρI , i.e., F ρ (s) = s 2 + (b − 2ρ )s + ρ 2 − ρb + a and noting that strict inequalities b ecome non-strict as w e require ℜ(eig(M +ρI )) to b e non-p ositiv e. B.5.2 Proof of Proposition 5 The ρ -exp onen tial stabilit y of (3.41b) with α =1/L is equiv alen t to the inclusion of the line segmen t (b(λ ),a(λ )), λ ∈[m,L], in the triangle ∆ ρ in (3.44), where a(λ ) and b(λ ) are giv en b y (3.42c). 
In addition, using the con v exit y of ∆ ρ , this condition further reduces to the end p oin ts (b(L),a(L)) and (b(m),a(m)) b elonging to ∆ ρ . No w since a(L)=1, a(m)=1/κ , the ab o v e condition implies a max /a min ≥ κ (B.7) wherea max anda min are the largest and smallest v alues that a can tak e among all (b,a)∈∆ ρ . It is no w easy to v erify that a max = 1 and a min = ρ 2 corresp ond to the edge Y ρ Z ρ and the v ertex X ρ of ∆ ρ , resp ectiv ely; see Figure 3.5. Th us, inequalit y (B.7) yields the upp er b ound ρ ≤ 1/ √ κ and w e can ac hiev e this rate with, (b(m),a(m)) = X ρ , (b(L),a(L)) = E v (B.8) where E v := (b v ,1) = (2ρ +v(ρ − 1/ρ ),1), v∈ [0,1], parameterizes the edge Y ρ Z ρ . Solving the equations in (B.8) for γ and β yields the optimal v alues of parameters. Finally , letting γ = 0 and γ = β yields the conditions on v for the hea vy-ball and Nestero v’s metho d, resp ectiv ely . The condition κ ≥ 4 for Nestero v’s metho d stems from the fact that, for α = 1/L, setting γ = β yields b(L) = 1. Th us, w e ha v e the necessary condition 2ρ ≤ 1 to ensure (b(L),a(L))∈∆ ρ ; see Figure 3.6. This completes the pro of. B.5.3 Proof of Theorem 7 Let G:=(b G ,a G ) b e the p oin t on the edge X ρ Z ρ of the triangle ∆ ρ in (3.44) suc h that a G = a(m), b G = a(m)/ρ + ρ. 250 Using (b(m),a(m))∈∆ ρ , it is easy to v erify that b G ≥ b(m). This allo ws us to write ˆ J(m) ρ = 1 2a(m)b(m)ρ ≥ 1 2a(m)b G ρ = 1 2a(m)(a(m)+ρ 2 ) . Com bining the ab o v e inequalit y with a(m) = 1/κ and the upp er b ound ρ ≤ 1/ √ κ from Lemma 5 yields ˆ J(m)/ρ ≥ κ 2 /4. (B.9) Noting that among the p oin ts in ∆ ρ , the mo dal con tribution ˆ J =1/(2ab) tak es its minim um v alue ˆ J min = 1/(2ρ +2/ρ ) (B.10) at the v ertex Z ρ =(1, ρ +1/ρ ), w e can write J ρ = ˆ J(m) ρ + n− 1 X i=1 ˆ J(λ i ) ρ ≥ κ 2 4 + n − 1 2(1 + ρ 2 ) where w e use (B.9) to lo w er b ound the rst term ˆ J(m)/ρ . This completes the pro of of (3.48b). T o pro v e the lo w er b ound in (3.48a), w e consider a quadratic ob jectiv e function for whic h the Hessian has n− 1 eigen v alues at λ =m and one eigen v alue at λ =L. F or suc h a function, w e can write J max ≥ J = ˆ J(m)(n − 1) + ˆ J(L). Finally , w e lo w er b ound the righ t hand-side using (B.9) and (B.10) to complete the pro of. B.6 Lyapunov equations and the steady-state v ariance F or the discrete-time L TI system in (3.4a), the co v ariance matrix P t :=E ψ t (ψ t ) T of the state v ector ψ t satises the linear recursion P t+1 = AP t A T + BB T (B.11a) and its steady-state limit P := lim t→∞ E ψ t (ψ t ) T (B.11b) is the unique solution to the algebraic Ly apuno v equation [41], P = APA T + BB T . (B.11c) 251 F or stable L TI systems, p erformance measure (3.8) can b e computed using J = lim t→∞ 1 t t X k=0 trace Z k = trace(Z) (B.11d) where Z =CPC T is the steady-state limit of the output co v ariance matrix Z t := E z t (z t ) T = CP t C T . W e can pro v e Theorem 5 b y nding the solution P to (B.11c) for the t w o-step momen tum algorithm. The ab o v e results carry o v er to the con tin uous-time case with the only dierence that the Ly apuno v equation for the steady-state co v ariance matrix of ψ (t) is giv en b y AP + PA T = − BB T . 252 Appendix C Supporting proofs for Chapter 4 C.1 Proofs of Section 4.2 W e rst presen t a tec hnical lemma that w e use in our pro ofs. Lemma 1 F or any ρ ∈[1/e,1), a(t) :=tρ t satises argmax t≥ 1 a(t) = − 1/log(ρ ), max t≥ 1 a(t) = − 1/(elog(ρ )). Pr o of: F ollo ws from the fact that da/dt=ρ t (1+tlog(ρ )) v anishes at t=− 1/log(ρ ). 
□ C.1.1 Proof of Lemma 1 F or µ 1 ̸=µ 2 , the eigen v alue decomp osition of M is determined b y M = 1 µ 2 − µ 1 1 1 µ 1 µ 2 µ 1 0 0 µ 2 µ 2 − 1 − µ 1 1 . Computing the tth p o w er of the diagonal matrix and m ultiplying throughout completes the pro of for µ 1 ̸=µ 2 . F or µ 1 =µ 2 =:µ , M admits the Jordan canonical form M = 1 0 µ 1 µ 1 0 µ 1 0 − µ 1 and the pro of follo ws from µ 1 0 µ t = µ t tµ t− 1 0 µ t . 253 C.1.2 Proof of Lemma 2 F rom Lemma 1, it follo ws 1 0 M t = " − t− 2 X i=0 µ i+1 1 µ t− 1− i 2 t− 1 X i=0 µ i 1 µ t− 1− i 2 # where µ 1 and µ 2 are the eigen v alues of M . Moreo v er, | t− 2 X i=0 µ i+1 1 µ t− 1− i 2 |≤ t− 2 X i=0 |µ i+1 1 µ t− 1− i 2 |≤ t− 2 X i=0 ρ t ≤ (t− 1)ρ t | t− 1 X i=0 µ i 1 µ t− 1− i 2 |≤ t− 1 X i=0 |µ i 1 µ t− 1− i 2 |≤ t− 1 X i=0 ρ t− 1 ≤ tρ t− 1 b y triangle inequalit y . Finally , for µ 1 = µ 2 ∈ R, w e ha v e ρ = |µ 1 | = |µ 2 | and the ab o v e inequalities b ecome equalities. C.1.3 Proof of Theorem 1 Let µ 1i and µ 2i b e the eigen v alues and let ρ i =max{|µ 1i |,|µ 2i |} b e the sp ectral radius of A i . W e can use Lemma 2 with M :=A i to obtain max i≤ r ∥C i A t i ∥ 2 2 ≤ max i≤ r (t− 1) 2 ρ 2t i + t 2 ρ 2t− 2 i ≤ (t− 1) 2 ρ 2t + t 2 ρ 2t− 2 (C.1) where ρ := max i≤ r ρ i . F or the parameters pro vided in T able 4.1, the matrices A 1 and A r , that corresp ond to the largest and smallest non-zero eigen v alues of Q, i.e., λ 1 = L and λ r =m, resp ectiv ely , ha v e the largest sp ectral radius [93, Eq. (64)], ρ = ρ 1 = ρ r ≥ ρ i , i = 2,...,r− 1 (C.2) and A r has rep eated eigen v alues. Th us, w e can write max i≤ r ∥C i A t i ∥ 2 2 ≥ ∥ 1 0 A t r ∥ 2 2 = (t− 1) 2 ρ 2t r + t 2 ρ 2t− 2 r = (t− 1) 2 ρ 2t + t 2 ρ 2t− 2 (C.3) where the rst equalit y follo ws from Lemma 2 applied to M := A r and the second equalit y follo ws from (C.2). Finally , com bining (C.1) and (C.3) with β < ρ and Prop osition 1 completes the pro of. C.1.4 Proof of Theorem 2 Let a(t) := tρ t . Theorem 1 implies J 2 (t) = ρ 2 a 2 (t− 1) + ρ − 2 a 2 (t) and, for t ≥ 1, J(t) has only one critical p oin t, whic h is a maximizer. Moreo v er, since dJ 2 (t)/dt is p ositiv e at t = − 1/log(ρ ) and negativ e at t = 1− 1/log(ρ ), w e conclude that the maximizer lies 254 b et w een − 1/log(ρ ) and 1− 1/log(ρ ). Regarding max t J(t), w e note that √ 2ρa (t− 1) ≤ J(t)≤ √ 2a(t)/ρ and the pro of follo ws from max t≥ 1 a(t)=− 1/(elog(ρ )) (cf. Lemma 1). C.1.5 Proof of Proposition 2 Since for all a≤ 1, w e ha v e [167] a ≤ − log(1− a) ≤ a/(1− a) ρ hb =1− 2/( √ κ +1) and ρ na =1− 2/( √ 3κ +1) satisfy 2/( √ κ +1) ≤ − log(ρ hb ) ≤ 2/( √ κ − 1) 2/ √ 3κ +1 ≤ − log(ρ na ) ≤ 2/( √ 3κ +1− 2). The conditions on κ ensure that ρ hb and ρ na are not smaller than 1/e and w e com bine the ab o v e b ounds with Theorem 2 to complete the pro of. C.2 Proof of Theorem 3 The condition x 0 = x 1 is equiv alen t to ˆ x 0 i = ˆ x 1 i in (4.5). Th us, for λ i = 0, equation (4.12) yields ˆ x t i = ˆ x 0 i = ˆ x ⋆ i . F or λ i ̸=0, w e ha v e ˆ ψ 0 i − ˆ ψ ⋆ i = ˆ x 0 i ˆ x 0 i T and, hence, ∥x t − x ⋆ ∥ 2 ∥x 0 − x ⋆ ∥ 2 ≤ max i≤ r |ˆ x t i − ˆ x ⋆ i | |ˆ x t 0 − ˆ x ⋆ i | = max i≤ r C i A t i 1 1 (C.4a) where the equalit y follo ws from (4.10). T o b ound the righ t-hand side, w e use Lemma 1 with M =A i to obtain C i A t i 1 1 = 1 0 A t i 1 1 = ω t (µ 1i ,µ 2i ) (C.4b) where µ 1i and µ 2i are the eigen v alues of A i and ω t (z 1 ,z 2 ) := t− 1 X i=0 z i 1 z t− 1− i 2 − t− 1 X i=1 z i 1 z t− i 2 (C.5) for an y t∈N and z 1 ,z 2 ∈C. 
F or Nestero v’s accelerated metho d, the c haracteristic p olynomial det(zI − A i ) = z 2 − (1+β )h i z +βh i yields µ 1i ,µ 2i = ((1+β )h i ± p (1+β ) 2 h 2 i − 4βh i )/2, where λ i is the ith the eigen v alue of Q and h i := 1− αλ i . F or the parameters pro vided in T able 4.1, it is easy to sho w that: F or λ i ∈[m,1/α ], w e ha v e h i ∈[0,4β/ (1+β ) 2 ] and µ 1i and µ 2i are complex conjugates of eac h other and lie on a circle of radius β/ (1+β ) cen tered at z =β/ (1+β ). F or λ i ∈(1/α,L ], µ 1i and µ 2i are real with opp osite signs and can b e sorted to satisfy |µ 2i |<|µ 1i | with− 1≤ µ 1i ≤ 0≤ µ 2i ≤ 1/3. 255 The next lemma pro vides a unit b ound on |ω t (µ 1i ,µ 2i )| for b oth of the ab o v e cases. Lemma 2 F or any z = lcos(θ )e iθ ∈ C with |θ | ≤ π/ 2 and 0 ≤ l ≤ 1, and for any r e al sc alars (z 1 ,z 2 ) such that − 1≤ z 1 ≤ 0≤ z 2 ≤ 1/3, and z 2 <− z 1 , the function ω t in (C.5) satises |ω t (z,¯z)|≤ 1 and |ω t (z 1 ,z 2 )|≤ 1 for al l t ∈ N, wher e ¯z is the c omplex c onjugate of z . Pr o of: Since ω 1 (z 1 ,z 2 ) = 1, w e assume t≥ 2. W e rst address θ = 0, i.e., z = l∈R and ω t (z,¯z)=tl t− 1 − (t− 1)l t . W e note that dω t /dl =t(t− 1)(l t− 2 − l t− 1 )=0 only if l∈{0,1}. This in com bination with l∈[0,1] yield|ω t (l,l)|≤ max{|ω t (1,1)|,|ω t (0,0)|}≤ 1. T o address θ ̸=0, w e note that b(t) :=sin(tθ )/t satises |b(t)| ≤ | sin(θ )| (C.6) whic h follo ws from |sin(tθ )| = |sin((t− 1)θ )cos(θ ) + cos((t− 1)θ )sin(θ )| ≤ | sin((t− 1)θ )| + |sin(θ )|. F or z =lcos(θ )e iθ , w e ha v e ω t (z,¯z) = (z t − ¯z t − z¯z(z t− 1 − ¯z t− 1 ))/(z− ¯z) = (lcos(θ )) t− 1 (sin(tθ )− lcos(θ )sin((t− 1)θ ))/sin(θ ). Th us,dω t /dl =0 only ifl =0,1, orl ⋆ :=b(t)/(b(t− 1)cos(θ )). Moreo v er, it is straigh tforw ard to sho w that ω t (z,¯z) = 0, l = 0 (cos(θ )) t− 1 cos((t− 1)θ ), l = 1 (l ⋆ cos(θ )) t− 1 b(t)/sin(θ ), l = l ⋆ . Com bining this with (C.6) completes the pro of for complex z . T o address the case of z 1 , z 2 ∈R, w e note that ω t (z 1 ,z 2 ) = z t 1 (1 − z 2 ) − z t 2 (1 − z 1 ) /(z 1 − z 2 ). Th us, dieren tiating with resp ect to z 1 yields dω t dz 1 = (1− z 2 ) (t− 1)z t− 1 1 − z 2 P t− 2 i=0 z t− 2− i 1 z i 2 z 1 − z 2 . Moreo v er, from |z 2 |<|z 1 |, it follo ws that (t− 1) z t− 1 1 > |z 2 | t− 2 X i=0 z t− 2− i 1 z i 2 > z 2 t− 2 X i=0 z t− 2− i 1 z i 2 . 256 Therefore, dω t /dz 1 ̸= 0 o v er our range of in terest for z 1 ,z 2 . Th us, ω t (z 1 ,z 2 ) ma y tak e its extrem um only at the b oundary z 1 ∈{0,− 1}, i.e. |ω t (z 1 ,z 2 )|≤ max{|ω t (0,z 2 )|,|ω t (1,z 2 )|}. Finally , it is easy to sho w that |ω t (0,z 2 )|= z t− 1 2 <1, and |ω t (− 1,z 2 )| = (− 1) t (z 2 − 1) + 2z t 2 /(1 + z 2 ) ≤ 1. □ W e complete the pro of of Theorem 3 b y noting that the eigen v alues of the matrices A i for Nestero v’s algorithm with parameters pro vided in T able 4.1 satisfy the conditions in Lemma 2. C.3 Proofs of Section 4.3 C.3.1 Proof of Lemma 3 F or an y f ∈F L m , the L-Lipsc hitz con tin uit y of the gradien t ∇f , f(x t+2 ) − f(y t ) ≤ (∇f(y t )) T (x t+2 − y t ) + L 2 ∥x t+2 − y t ∥ 2 2 (C.7a) and the m-strong con v exit y of f , f(y t ) − f(x t+1 ) ≤ (∇f(y t )) T (y t − x t+1 ) − m 2 ∥y t − x t+1 ∥ 2 2 (C.7b) can b e used to sho w that (4.20) for the solution of Nestero v’s accelerated algorithm (4.18). In particular, for (4.18) w e ha v e u t :=∇f(y t ) and x t+2 − y t = − αu t y t − x t+1 = β (x t+1 − x t ) = − βI βI ψ t . (C.8) Substituting (C.8) in to (C.7a) and (C.7b) and adding the resulting inequalities completes the pro of. 
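As an informal complement (not needed for the argument), the two quadratic bounds (C.7a) and (C.7b) are easy to spot-check numerically. The sketch below does so for a randomly generated quadratic instance of F_m^L; NumPy, the problem size, and the random seed are choices made here purely for illustration.

    import numpy as np

    rng = np.random.default_rng(1)
    n = 5
    A = rng.standard_normal((n, n))
    Q = A @ A.T + np.eye(n)                 # a positive definite Hessian
    m, L = np.linalg.eigvalsh(Q)[[0, -1]]   # strong convexity / smoothness constants
    f = lambda x: 0.5*x @ Q @ x
    grad = lambda x: Q @ x

    for _ in range(1000):
        y, x1, x2 = rng.standard_normal((3, n))
        ok_a = f(x2) - f(y) <= grad(y) @ (x2 - y) + 0.5*L*np.sum((x2 - y)**2) + 1e-9   # (C.7a)
        ok_b = f(y) - f(x1) <= grad(y) @ (y - x1) - 0.5*m*np.sum((y - x1)**2) + 1e-9   # (C.7b)
        assert ok_a and ok_b

Since the bounds (C.7a) and (C.7b) hold for arbitrary points, checking them at random triples also covers the particular iterates appearing in (C.8).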
C.3.2 Proof of Lemma 4 Pre- and p ost-m ultiplication of LMI (4.21) b y (η t ) T and η t :=[(ψ t ) T (u t ) T ] T yields 0≥ (η t ) T A T XA− X A T XB B T XA B T XB η t + θ 1 (η t ) T M 1 η t + θ 2 (η t ) T M 2 η t ≥ (η t ) T A T XA− X A T XB B T XA B T XB η t + θ 2 (η t ) T M 2 η t where the second inequalit y follo ws from (4.19c). This yields 0 ≤ ˆ V(ψ t ) − ˆ V(ψ t+1 ) − θ 2 (η t ) T M 2 η t (C.9) 257 where ˆ V(ψ ) :=ψ T Xψ . Also, since Lemma 3 implies − (η t ) T M 2 η t ≤ 2 f(x t+1 ) − f(x t+2 ) (C.10) com bining (C.9) and (C.10) yields ˆ V(ψ t+1 ) + 2θ 2 f(x t+2 ) ≤ ˆ V(ψ t ) + 2θ 2 f(x t+1 ). Th us, using induction, w e obtain the uniform upp er b ound ˆ V(ψ t ) + 2θ 2 f(x t+1 ) ≤ ˆ V(ψ 0 ) + 2θ 2 f(x 1 ). (C.11) This allo ws us to b ound ˆ V b y writing λ min (X)∥ψ ∥ 2 2 ≤ ˆ V(ψ ) ≤ λ max (X)∥ψ ∥ 2 2 . (C.12a) W e can also upp er and lo w er b ound f ∈F L m as m∥x∥ 2 2 ≤ 2f(x) ≤ L∥x∥ 2 2 . (C.12b) Finally , com bining (C.11) and (C.12) yields λ min (X)∥ψ t ∥ 2 2 + mθ 2 ∥x t+1 ∥ 2 2 ≤ λ max (X)∥ψ 0 ∥ 2 2 + Lθ 2 ∥x 1 ∥ 2 2 . W e complete the pro of b y noting that ∥x t+1 ∥ 2 ≤∥ ψ t ∥ 2 . C.3.3 Proof of Theorem 4 T o pro v e (4.23a), w e need to nd a feasible solution for θ 1 ,θ 2 andX in terms of the condition n um b er κ . Let us dene X := x 1 I x 0 I x 0 I x 2 I = x 2 β 2 I − βI − βI I θ 2 := θ 1 (L+m)β/ (1− β ) x 2 := ((L+m)θ 1 + θ 2 )/α = θ 2 /(αβ ). (C.13) If (C.13) holds, it is easy to v erify that X ⪰ 0 with λ min (X) = 0, λ max (X) = (1+β 2 )x 2 = θ 2 (1+β 2 )/(αβ ), andA T XA− X =0. Moreo v er, the matrix W on the left-hand-side of (4.21) is blo c k-diagonal, W :=diag(W 1 ,W 2 ), and negativ e semi-denite for all α ≤ 1/L, where W 1 = − m(2θ 1 LC T y C y +θ 2 C T 2 C 2 ) ⪯ 0 W 2 = − ((2− α (L+m))θ 1 + α (1− αL )θ 2 )I ⪯ 0. 258 Th us, the c hoice of (θ 1 ,θ 2 ,X) in (C.13) satises the conditions of Lemma 4. Using the expressions for the largest and smallest eigen v alues of the matrix X in equation (4.22) in Lemma 4, leads to the upp er b ound for ∥x t ∥ 2 2 in (4.23a). F urthermore, from (4.23a) w e ha v e ∥x t ∥ 2 2 ≤ κ 1+(1+β 2 )/(αβL ) ∥ψ 0 ∥ 2 2 and the upp er b ound in (4.23c) follo ws from the fact that, for α and β in (4.23b), 1+(1+ β 2 )/(αβL )=3+4/(κ − 1). T o obtain the lo w er b ound in (4.23c), w e emplo y our framew ork for quadratic ob jectiv e functions in Section 4.2. In particular, for the parameters α and β in (4.23b), the largest sp ectral radius ρ (A i ) corresp onds to A n , whic h is asso ciated with the smallest eigen v alue λ n =m of Q. Since A n has rep eated real eigen v alues ρ =1− 1/ √ κ , using similar argumen ts as in Theorem 1 for quadratic problems w e obtain, J(t max ) = q (t max − 1) 2 ρ 2tmax + t 2 max ρ 2(tmax− 1) ≥ √ 2(t max − 1)ρ tmax ≥ √ 2( √ κ − 1) 2 /(e √ κ ) whic h completes the pro of. 259 Appendix D Supporting proofs for Chapter 6 D.1 Lack of convexity of function f The function f is noncon v ex in general b ecause its eectiv e domain, namely , the set of stabilizing feedbac k gains S K can b e noncon v ex. In particular, for A = 0 and B =− I , the closed-lo op A-matrix is giv en b y A− BK =K . No w, let K 1 = − 1 2− 2ϵ 0 − 1 , K 2 = − 1 0 2− 2ϵ − 1 , K 3 = K 1 +K 2 2 = − 1 1− ϵ 1− ϵ − 1 (D.1) where 0 ≤ ϵ ≪ 1. It is straigh tforw ard to sho w that for ϵ > 0, the en tire line-segmen t K 1 K 2 lies in S K . Ho w ev er, if w e let ϵ → 0, while the endp oin ts K 1 and K 2 con v erge to stabilizing gains, the middle p oin t K 3 con v erges to the b oundary of S K . Th us, f(K 1 ) and f(K 2 ) are b ounded whereas f(K 3 ) → ∞. 
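The behavior described above is easy to reproduce numerically. The sketch below is illustrative only; it assumes NumPy and SciPy and mirrors the example data A = 0, B = −I, Q = R = Ω = I. It evaluates f(K) = trace((Q + K^T R K)X), with X obtained from the closed-loop Lyapunov equation, along the segment K(γ) = γK₁ + (1−γ)K₂ for ϵ = 0.1 and at the midpoint K₃ for decreasing ϵ.

    import numpy as np
    from scipy.linalg import solve_continuous_lyapunov

    def lqr_cost(K, Q, R, Omega):
        # f(K) = trace((Q + K^T R K) X), where (A - B K) X + X (A - B K)^T + Omega = 0;
        # for A = 0 and B = -I the closed-loop matrix A - B K equals K itself.
        if np.max(np.linalg.eigvals(K).real) >= 0:
            return np.inf                    # K is not stabilizing
        X = solve_continuous_lyapunov(K, -Omega)
        return np.trace((Q + K.T @ R @ K) @ X)

    I2 = np.eye(2)
    eps = 0.1
    K1 = np.array([[-1.0, 2 - 2*eps], [0.0, -1.0]])
    K2 = np.array([[-1.0, 0.0], [2 - 2*eps, -1.0]])
    for gamma in (0.0, 0.25, 0.5, 0.75, 1.0):
        print(gamma, lqr_cost(gamma*K1 + (1 - gamma)*K2, I2, I2, I2))  # cost rises toward the middle
    for e in (0.1, 0.01, 0.001):
        K3 = np.array([[-1.0, 1 - e], [1 - e, -1.0]])
        print(e, lqr_cost(K3, I2, I2, I2))   # midpoint cost grows rapidly as eps shrinks

The printed midpoint costs grow rapidly as ϵ decreases, consistent with K₃ approaching the boundary of S_K, while the endpoint costs remain bounded.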
This blow-up implies the existence of a point on the line segment $\overline{K_1 K_2}$, for some $\epsilon \ll 1$, at which the function $f$ has negative curvature. For $\epsilon = 0.1$, Fig. D.1 illustrates the value of the LQR objective function $f(K(\gamma))$ associated with the above example and the problem parameters $Q = R = \Omega = I$, where $K(\gamma) := \gamma K_1 + (1-\gamma)K_2$ parameterizes the line segment $\overline{K_1 K_2}$. We observe the negative curvature of $f$ around the middle point $K_3$. Alternatively, we can verify the negative curvature using the second-order term $\langle J, \nabla^2 f(K); J\rangle$ in the Taylor series expansion of $f(K + J)$ around $K$ given in Appendix D.7. For the above example, letting $J = (K_1 - K_2)/\|K_1 - K_2\|$ yields the negative value $\langle J, \nabla^2 f(K_3); J\rangle = -135.27$.

Figure D.1: The LQR objective function $f(K(\gamma))$, where $K(\gamma) := \gamma K_1 + (1-\gamma)K_2$ is the line segment between $K_1$ and $K_2$ in (D.1) with $\epsilon = 0.1$.

D.2 Invertibility of the linear map $\mathcal{A}$

The invertibility of the map $\mathcal{A}$ is equivalent to the matrices $A$ and $-A^T$ not having any common eigenvalues. If $\mathcal{A}$ is non-invertible, we can use $K_0 \in \mathcal{S}_K$ to introduce the change of variables $\hat K := K - K_0$ and $\hat Y := \hat K X$ and obtain $f(K) = \hat h(X, \hat Y) := \mathrm{trace}\big(Q_0 X + X^{-1}\hat Y^T R \hat Y + 2\hat Y^T R K_0\big)$ for all $K \in \mathcal{S}_K$, where $Q_0 := Q + K_0^T R K_0$. Moreover, $X$ and $\hat Y$ satisfy the affine relation $\mathcal{A}_0(X) - \mathcal{B}(\hat Y) + \Omega = 0$, where $\mathcal{A}_0(X) := (A - BK_0)X + X(A - BK_0)^T$. Since the matrix $A - BK_0$ is Hurwitz, the map $\mathcal{A}_0$ is invertible. This allows us to write $X$ as an affine function of $\hat Y$, namely $X(\hat Y) = \mathcal{A}_0^{-1}\big(\mathcal{B}(\hat Y) - \Omega\big)$. Since the function $\hat h(\hat Y) := \hat h(X(\hat Y), \hat Y)$ has a form similar to $h(Y)$ except for the linear term $2\,\mathrm{trace}(\hat Y^T R K_0)$, the smoothness and strong convexity of $h(Y)$ established in Proposition 1 carry over to the function $\hat h(\hat Y)$.

D.3 Proof of Proposition 1

The second-order term in the Taylor series expansion of $h(Y + \tilde Y)$ around $Y$ is given by [121, Lemma 2]

$\big\langle \tilde Y, \nabla^2 h(Y; \tilde Y) \big\rangle \,=\, 2\,\big\| R^{1/2}(\tilde Y - K\tilde X)X^{-1/2} \big\|_F^2$   (D.2)

where $\tilde X$ is the unique solution to $\mathcal{A}(\tilde X) = \mathcal{B}(\tilde Y)$. We show that this term is upper and lower bounded by $L\|\tilde Y\|_F^2$ and $\mu\|\tilde Y\|_F^2$, where $L$ and $\mu$ are given by (6.10a) and (6.10b), respectively. The proof of the upper bound is borrowed from [121, Lemma 1]; we include it for completeness. We repeatedly use the bounds on the variables presented in Lemma 15; see Appendix D.11.

Smoothness. For any $Y \in \mathcal{S}_Y(a)$ and $\tilde Y$ with $\|\tilde Y\|_F = 1$,

$\big\langle \tilde Y, \nabla^2 h(Y; \tilde Y) \big\rangle \,=\, 2\,\| R^{1/2}(\tilde Y - K\tilde X)X^{-1/2} \|_F^2 \,\le\, 2\,\|R\|_2 \|X^{-1}\|_2\, \|\tilde Y - K\mathcal{A}^{-1}\mathcal{B}(\tilde Y)\|_F^2 \,\le\, \dfrac{2\|R\|_2}{\lambda_{\min}(X)}\big(\|\tilde Y\|_F + \|K\|_2\|\mathcal{A}^{-1}\mathcal{B}\|_2\|\tilde Y\|_F\big)^2 \,\le\, \dfrac{2a\|R\|_2}{\nu}\left(1 + \dfrac{a\,\|\mathcal{A}^{-1}\mathcal{B}\|_2}{\sqrt{\nu\lambda_{\min}(R)}}\right)^2 \,=:\, L.$

Here, the first and second inequalities are obtained from the definition of the 2-norm in conjunction with the triangle inequality, and the third inequality follows from (D.36b) and (D.36c). This completes the proof of smoothness.

Strong convexity. Using the positive definiteness of the matrices $R$ and $X$, the second-order term (D.2) can be lower bounded by

$\big\langle \tilde Y, \nabla^2 h(Y; \tilde Y) \big\rangle \,\ge\, \dfrac{2\lambda_{\min}(R)\,\|H\|_F^2}{\|X\|_2}$   (D.3)

where $H := \tilde Y - K\tilde X$. Next, we show that

$\dfrac{\|H\|_F}{\|\tilde X\|_F} \,\ge\, \dfrac{\lambda_{\min}(\Omega)\,\lambda_{\min}(Q)}{a\,\|B\|_2}.$   (D.4)

We substitute $H + K\tilde X$ for $\tilde Y$ in $\mathcal{A}(\tilde X) = \mathcal{B}(\tilde Y)$ to obtain

$\Gamma \,=\, \mathcal{B}(H)$   (D.5)

where $\Gamma := \mathcal{A}_K(\tilde X)$. Closed-loop stability implies $\tilde X = \mathcal{A}_K^{-1}(\Gamma)$ and from Eq. (D.5) we have

$\|H\|_F \,\ge\, \|\Gamma\|_F/\|B\|_2.$   (D.6)

This allows us to use Lemma 17, presented in Appendix D.12, to write $a\,\|\Gamma\|_F \ge \lambda_{\min}(\Omega)\,\lambda_{\min}(Q)\,\|\tilde X\|_F$.
This inequality in conjunction with (D.6) yields (D.4). Next, we derive an upper bound on $\|\tilde Y\|_F$,

$\|\tilde Y\|_F \,=\, \|H + K\tilde X\|_F \,\le\, \|H\|_F + \|K\|_F\|\tilde X\|_F \,\le\, \|H\|_F\,(1 + a^2\eta)$   (D.7)

where $\eta$ is given by (6.10c) and the second inequality follows from (D.36d) and (D.4). Finally, inequalities (D.3) and (D.7) yield

$\dfrac{\big\langle \tilde Y, \nabla^2 f(Y; \tilde Y)\big\rangle}{\|\tilde Y\|_F^2} \,\ge\, \dfrac{2\lambda_{\min}(R)\,\|H\|_F^2}{\|X\|_2\,\|\tilde Y\|_F^2} \,\ge\, \dfrac{2\lambda_{\min}(R)}{\|X\|_2\,(1 + a^2\eta)^2} \,\ge\, \dfrac{2\lambda_{\min}(R)\,\lambda_{\min}(Q)}{a\,(1 + a^2\eta)^2} \,=:\, \mu$   (D.8)

where the last inequality follows from (D.36a).

D.4 Proofs for Section 6.5

Proof of Lemma 1

The gradients are given by $\nabla f(K) = EX$ and $\nabla h(Y) = E + 2B^T(P - W)$, where $E := 2(RK - B^T P)$, $P$ is determined by (6.6a), and $W$ is the solution to (6.11b). Subtracting the equation in (6.11b) from (6.6b) yields $A^T(P - W) + (P - W)A = -\tfrac{1}{2}\big(K^T E + E^T K\big)$, which in turn leads to

$\|P - W\|_F \,\le\, \|\mathcal{A}^{-1}\|_2\,\|K\|_F\,\|E\|_F \,\le\, \dfrac{a\,\|\mathcal{A}^{-1}\|_2\,\|E\|_F}{\sqrt{\nu\lambda_{\min}(R)}}$

where the second inequality follows from (D.36d) in Appendix D.11. Thus, by applying the triangle inequality to $\nabla h(Y)$, we obtain

$\dfrac{\|\nabla h(Y)\|_F}{\|E\|_F} \,\le\, 1 \,+\, \dfrac{2a\,\|\mathcal{A}^{-1}\|_2\,\|B\|_2}{\sqrt{\nu\lambda_{\min}(R)}}.$

Moreover, using the lower bound (D.36c) on $\lambda_{\min}(X)$, we have $\|\nabla f(K)\|_F = \|EX\|_F \ge (\nu/a)\,\|E\|_F$. Combining the last two inequalities completes the proof.

Proof of Lemma 2

For any pair of stabilizing feedback gains $K$ and $\hat K := K + \tilde K$, we have [152, Eq. (2.10)]

$f(\hat K) - f(K) \,=\, \mathrm{trace}\Big(\tilde K^T\big(R(K + \hat K) - 2B^T\hat P\big)X\Big)$

where $X = X(K)$ and $\hat P = P(\hat K)$ are given by (6.4a) and (6.6a), respectively. Letting $\hat K = K^\star$ in this equation and using the optimality condition $B^T\hat P = R\hat K$ completes the proof.

Proof of Lemma 3

We show that the second-order term $\langle \tilde K, \nabla^2 f(K; \tilde K)\rangle$ in the Taylor series expansion of $f(K + \tilde K)$ around $K$ is upper bounded by $L_f\|\tilde K\|_F^2$ for all $K \in \mathcal{S}_K(a)$. From [168, Eq. (2.3)], it follows that

$\big\langle \tilde K, \nabla^2 f(K; \tilde K) \big\rangle \,=\, 2\,\mathrm{trace}\big(\tilde K^T R\tilde K X - 2\tilde K^T B^T\tilde P X\big)$

where $\tilde P = (\mathcal{A}_K^*)^{-1}(C)$ and $C := \tilde K^T(B^T P - RK) + (B^T P - RK)^T\tilde K$. Here, $X = X(K)$ and $P = P(K)$ are given by (6.4a) and (6.6a), respectively. Thus, using basic properties of the matrix trace and the triangle inequality, we have

$\dfrac{\big\langle \tilde K, \nabla^2 f(K; \tilde K)\big\rangle}{\|\tilde K\|_F^2} \,\le\, 2\,\|X\|_2\left(\|R\|_2 \,+\, \dfrac{2\,\|B\|_2\,\|\tilde P\|_F}{\|\tilde K\|_F}\right).$   (D.9)

Now, we use Lemma 17 to upper bound the norm of $\tilde P$, namely $\|\tilde P\|_F \le a\,\|C\|_F/(\lambda_{\min}(\Omega)\lambda_{\min}(Q))$. Moreover, from the definition of $C$, the triangle inequality, and the submultiplicative property of the 2-norm, we have $\|C\|_F \le 2\,\|\tilde K\|_F\big(\|B\|_2\|P\|_2 + \|R\|_2\|K\|_2\big)$. Combining the last two inequalities gives

$\dfrac{\|\tilde P\|_F}{\|\tilde K\|_F} \,\le\, \dfrac{2a}{\lambda_{\min}(\Omega)\lambda_{\min}(Q)}\big(\|B\|_2\|P\|_2 + \|R\|_2\|K\|_2\big)$

which in conjunction with (D.9) leads to

$\dfrac{\big\langle \tilde K, \nabla^2 f(K; \tilde K)\big\rangle}{\|\tilde K\|_F^2} \,\le\, 2\,\|X\|_2\left(\|R\|_2 \,+\, \dfrac{4a}{\lambda_{\min}(\Omega)\lambda_{\min}(Q)}\big(\|B\|_2^2\|P\|_2 + \|B\|_2\|R\|_2\|K\|_2\big)\right).$

Finally, we use the bounds provided in Appendix D.11 to obtain

$\dfrac{\big\langle \tilde K, \nabla^2 f(K; \tilde K)\big\rangle}{\|\tilde K\|_F^2} \,\le\, \dfrac{2a\,\|R\|_2}{\lambda_{\min}(Q)} \,+\, \dfrac{8a^3}{\lambda_{\min}^2(Q)\,\lambda_{\min}(\Omega)}\left(\dfrac{\|B\|_2^2}{\lambda_{\min}(\Omega)} \,+\, \dfrac{\|B\|_2\,\|R\|_2}{\sqrt{\nu\lambda_{\min}(R)}}\right)$

which completes the proof.
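The gradient expression $\nabla f(K) = EX$ with $E = 2(RK - B^T P)$ used throughout these proofs lends itself to a quick finite-difference sanity check. The sketch below is an illustration and not part of the thesis; it assumes that (6.4a) and (6.6a) are the standard closed-loop Lyapunov equations for $X(K)$ and $P(K)$, and it takes $B = I$ only so that a stabilizing gain is trivial to write down.

```python
# Finite-difference check of grad f(K) = 2(RK - B^T P) X against f(K) = trace(P(K) Omega).
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

def lqr_matrices(K, A, B, Q, R, Omega):
    Acl = A - B @ K
    X = solve_continuous_lyapunov(Acl, -Omega)                 # Acl X + X Acl^T + Omega = 0
    P = solve_continuous_lyapunov(Acl.T, -(Q + K.T @ R @ K))   # Acl^T P + P Acl + Q + K^T R K = 0
    return X, P

def f(K, A, B, Q, R, Omega):
    _, P = lqr_matrices(K, A, B, Q, R, Omega)
    return float(np.trace(P @ Omega))

rng = np.random.default_rng(0)
n = 3
A = rng.standard_normal((n, n))
B = np.eye(n)
Q = R = Omega = np.eye(n)
K = A + 2 * np.eye(n)                                          # A - BK = -2I is Hurwitz

X, P = lqr_matrices(K, A, B, Q, R, Omega)
grad = 2 * (R @ K - B.T @ P) @ X

num = np.zeros_like(K); h = 1e-6                               # central differences
for i in range(n):
    for j in range(n):
        Kp, Km = K.copy(), K.copy()
        Kp[i, j] += h; Km[i, j] -= h
        num[i, j] = (f(Kp, A, B, Q, R, Omega) - f(Km, A, B, Q, R, Omega)) / (2 * h)
print("max abs error:", np.max(np.abs(grad - num)))
```

The agreement (up to finite-difference error) is what makes the norm bounds on $E$ in the proofs above directly usable for $\nabla f$.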
D.5 Proofs for Section 6.6.1.1

We first present two technical lemmas.

Lemma 1. Let $Z \succ 0$ and let the Hurwitz matrix $F$ satisfy

$\begin{bmatrix} \delta^2 I + F^T Z + ZF & Z \\ Z & -I \end{bmatrix} \,\prec\, 0.$   (D.10)

Then $F + \delta\Delta$ is Hurwitz for all $\Delta$ with $\|\Delta\|_2 \le 1$.

Proof: The matrix $F + \delta\Delta$ is Hurwitz if and only if the linear map from $w$ to $x$ with the state-space realization $\{\dot x = Fx + w + u,\ z = \delta x\}$ in feedback with $u = \Delta z$ is input-output stable. From the small-gain theorem [86, Theorem 8.2], this system is stable for all $\Delta$ in the unit ball if and only if the induced gain of the map $u \mapsto z$ with the state-space realization $\{\dot x = Fx + u,\ z = \delta x\}$ is smaller than one. The KYP lemma [86, Lemma 7.4] implies that this norm condition is equivalent to (D.10). □

Lemma 2. Let the matrices $F$, $X \succ 0$, and $\Omega \succ 0$ satisfy

$FX + XF^T + \Omega \,=\, 0.$   (D.11)

Then the matrix $F + \Delta$ is Hurwitz for all $\Delta$ that satisfy $\|\Delta\|_2 < \lambda_{\min}(\Omega)/(2\|X\|_2)$.

Proof: From (D.11), we obtain that $F$ is Hurwitz and $F\hat X + \hat X F^T + I \preceq 0$, where $\hat X := X/\lambda_{\min}(\Omega)$. Multiplication of this inequality from both sides by $\hat X^{-1}$ and division by 2 yields $ZF + F^T Z + 2Z^2 \preceq 0$, where $Z := (2\hat X)^{-1}$. For any positive scalar $\delta < \lambda_{\min}(Z) = \lambda_{\min}(\Omega)/(2\|X\|_2)$, the last matrix inequality implies $\delta^2 I + ZF + F^T Z + Z^2 \prec 0$. The result follows from Lemma 1 by observing that the last inequality is equivalent to (D.10) via the Schur complement. □

Proof of Proposition 3

For any feedback gain $\hat K$ such that $\|\hat K - K\|_2 < \zeta$, the closed-loop matrix $A - B\hat K$ satisfies $\|A - B\hat K - (A - BK)\|_2 \le \|K - \hat K\|_2\|B\|_2 < \zeta\|B\|_2$. This bound on the distance between the closed-loop matrices $A - BK$ and $A - B\hat K$ allows us to apply Lemma 2 with $F := A - BK$ and $X := X(K)$ to complete the proof.

We next present a technical lemma.

Lemma 3. For any $K \in \mathcal{S}_K$ and $\hat K \in \mathbb{R}^{m\times n}$ such that $\|\hat K - K\|_2 < \delta$, with

$\delta \,:=\, \dfrac{1}{4\|B\|_F}\,\min\left\{ \dfrac{\lambda_{\min}(\Omega)}{\mathrm{trace}(X(K))},\ \dfrac{\lambda_{\min}(Q)}{\mathrm{trace}(P(K))} \right\}$

the feedback gain matrix $\hat K \in \mathcal{S}_K$, and

$\|X(\hat K) - X(K)\|_F \,\le\, \epsilon_1\,\|\hat K - K\|_2$   (D.12a)
$\|P(\hat K) - P(K)\|_F \,\le\, \epsilon_2\,\|\hat K - K\|_2$   (D.12b)
$\|\nabla f(\hat K) - \nabla f(K)\|_F \,\le\, \epsilon_3\,\|\hat K - K\|_2$   (D.12c)
$|f(\hat K) - f(K)| \,\le\, \epsilon_4\,\|\hat K - K\|_2$   (D.12d)

where $X(K)$ and $P(K)$ are given by (6.4a) and (6.6a), respectively. Furthermore, the parameters $\epsilon_i$, which only depend on $K$ and problem data, are given by

$\epsilon_1 \,:=\, \|X(K)\|_2/\delta$
$\epsilon_2 \,:=\, 2\,\mathrm{trace}(P)\big(2\|P\|_2\|B\|_F + (\delta + 2\|K\|_2)\|R\|_F\big)/\lambda_{\min}(Q)$
$\epsilon_4 \,:=\, \epsilon_2\,\|\Omega\|_F$
$\epsilon_3 \,:=\, 2\big(\epsilon_1\|K\|_2 + 2\|X(K)\|_2\big)\|R\|_F \,+\, 2\epsilon_1\big(\|P(K)\|_2 + 2\epsilon_2\|X(K)\|_2\big)\|B\|_F.$

Proof: Note that $\delta \le \zeta$, where $\zeta$ is given in Proposition 3. Thus, we can use Proposition 3 to show that $\hat K \in \mathcal{S}_K$. We next prove (D.12a). For $K$ and $\hat K \in \mathcal{S}_K$, we can represent $X = X(K)$ and $\hat X = X(\hat K)$ as the positive definite solutions to

$(A - BK)X + X(A - BK)^T + \Omega \,=\, 0$   (D.13a)
$(A - B\hat K)\hat X + \hat X(A - B\hat K)^T + \Omega \,=\, 0.$   (D.13b)
Similar to the pro of of (D.12a), subtracting the Ly apuno v equation (6.6b) from that of ˆ P = P( ˆ K) yields (A− BK) T ˜ P + ˜ P(A− BK) = W where ˜ P := ˆ P− P and W :=(B ˜ K) T ˆ P+ ˆ PB ˜ K− ˜ K T R ˜ K− ˜ K T RK− K T R ˜ K. This allo ws us to use Lemma 16, presen ted in App endix D.12, with F := (A− BK) T to upp er b ound the norm of ˜ P =F(− W), where the linear map F is dened in (D.40), as follo ws ∥ ˜ P∥ F ≤ ∥F∥ 2 ∥W∥ F ≤ trace(F(Q+K T RK)) λ min (Q+K T RK) ∥W∥ F = trace(P) λ min (Q+K T RK) ∥W∥ F ≤ trace(P) λ min (Q) ∥W∥ F . Here, the second inequalit y follo ws from Lemma 16. This inequalit y in conjunction with applying the triangle inequalit y to the denition of W yield ∥ ˜ P∥ F ≤ trace(P) λ min (Q) × ∥(B ˜ K) T ˜ P + ˜ P B ˜ K∥F + ∥(B ˜ K) T P +P B ˜ K− ˜ K T R ˜ K− ˜ K T RK− K T R ˜ K∥ F ≤ ∥ ˜ P∥ F 2 + trace(P) λ min (Q) 2∥P∥ 2 ∥B∥ F + (δ +2∥K∥ 2 )∥R∥ F ∥ ˜ K∥ 2 . The second inequalit y is obtained b y b ounding the t w o terms on the left-hand side using basic prop erties of norm, where, for the rst term, ∥ ˜ K∥ 2 ≤ δ ≤ λ min (Q)/(4∥B∥ F trace(P(K))) and, for the second term, ∥ ˜ K∥ 2 ≤ δ . Rearranging the terms in ab o v e completes the pro of of (D.12b). 266 W e next pro v e (D.12c). It is straigh tforw ard to sho w that the gradien t (6.5) satises ˜ ∇ := ∇f( ˆ K)−∇ f(K) = 2R( ˜ KX +K ˜ X + ˜ K ˜ X) − 2B T ( ˜ PX +P ˜ X + ˜ P ˜ X) where P := P(K) and ˜ P := ˆ P − P . The triangle inequalit y in conjunction with ∥ ˜ X∥ F ≤ ϵ 1 ∥ ˜ K∥ 2 , ∥ ˜ P∥ F ≤ ϵ 2 ∥ ˜ K∥ 2 , and ∥ ˜ K∥ 2 < δ , yield ∥ ˜ ∇∥ F /∥ ˜ K∥ 2 ≤ 2∥R∥ F (∥X∥ 2 +ϵ 1 (∥K∥ 2 + δ ))+2∥B∥ F (ϵ 2 ∥X∥ 2 +ϵ 1 (∥P∥ 2 +ϵ 2 δ )). Rearranging terms completes the pro of of (D.12c). Finally , w e pro v e (D.12d). Using the denitions of f(K) in (6.3b) and P(K) in (6.6a), it is easy to v erify that f(K)=trace(P(K)Ω) . Application of the Cauc h y-Sc h w artz inequalit y yields|f( ˆ K)− f(K)|=|trace( ˜ PΩ) |≤∥ ˜ P∥ F ∥Ω ∥ F , whic h completes the pro of. □ Proof of Lemma 4 F or an y K∈S K (a), w e can use the b ounds pro vided in App endix D.11 to sho w that c 1 /a≤ δ and ϵ 4 ≤ c 2 a 2 , where δ and ϵ 4 are giv en in Lemma 3 and eac h c i is a p ositiv e constan t that dep ends on the problem data. No w, Lemma 3 implies f(K+r(a)U)− f(K)≤ ϵ 4 r(a)∥U∥ 2 ≤ a where r(a) :=min{c 1 ,1/c 2 }/(a √ mn). This inequalit y together with f(K)≤ a complete the pro of. D.6 Proof of Proposition 4 W e rst presen t t w o tec hnical lemmas. Lemma 4 L et the matric es F , X ≻ 0, and Ω ≻ 0 satisfy FX +XF T +Ω = 0 . Then, for any t≥ 0, ∥e Ft ∥ 2 2 ≤ (∥X∥ 2 /λ min (X))e − (λ min (Ω) /∥X∥ 2 )t . Pr o of: The function V(x) :=x T Xx is a Ly apuno v function for ˙ x=F T x b ecause ˙ V(x)= − x T Ω x≤− cV(x), where c := λ min (Ω) /∥X∥ 2 . F or an y initial condition x 0 , this inequalit y together with the comparison lemma [157, Lemma 3.4] yield V(x(t)) ≤ V(x 0 )e − ct . Noting that x T (t) = x T 0 e Ft , w e let x 0 b e the normalized left singular v ector asso ciated with the maxim um singular v alue of e Ft to obtain ∥e Ft ∥ 2 2 = ∥x(t)∥ 2 ≤ V(x(t)) λ min (X) ≤ V(x 0 ) λ min (X) e − ct whic h along with V(x 0 )≤∥ X∥ 2 complete the pro of. □ Lemma 5 establishes an exp onen tially deca ying upp er b ound on the dierence b et w een f x 0 (K) and f x 0 ,τ (K) o v er an y sublev el set S K (a) of the LQR ob jectiv e function f(K). Lemma 5 F or any K ∈ S K (a) and v ∈ R n , |f v (K)− f v,τ (K)| ≤ ∥ v∥ 2 κ 1 (a)e − κ 2 (a)τ , wher e the p ositive functions κ 1 (a) and κ 2 (a), given by (D.17) , dep end on pr oblem data. 
267 Pr o of: Since x(t) = e (A− BK)t v is the solution to (6.1b) with u = − Kx and the initial condition x(0) = v , it is easy to v erify that f v,τ (K) = trace (Q+K T RK)X v,τ (K) and f v (K)=trace (Q + K T RK)X v (K) , where X v,τ (K) := Z τ 0 e (A− BK)t vv T e (A− BK) T t dt and X v :=X v,∞ . Using the triangle inequalit y , w e ha v e ∥X v (K)− X v,τ (K)∥ F ≤ ∥ v∥ 2 Z ∞ τ ∥e (A− BK)t ∥ 2 2 dt. (D.15) Equation (6.4b) allo ws us to use Lemma 4 with F :=A− BK , X :=X(K) to upp er b ound ∥e (A− BK)t ∥ 2 , λ min (X)∥e (A− BK)t ∥ 2 2 ≤ ∥ X∥ 2 e − (λ min (Ω) /∥X∥ 2 )t . In tegrating this inequalit y o v er [τ, ∞] in conjunction with (D.15) yield ∥X v (K)− X v,τ (K) ∥ F ≤ ∥ v∥ 2 κ ′ 1 e − κ ′ 2 τ (D.16) where κ ′ 1 :=∥X(K)∥ 2 2 /(λ min (Ω) λ min (X(K))) and κ ′ 2 :=λ min (Ω) /∥X(K)∥ 2 . F urthermore, |f v (K)− f v,τ (K)| = trace (Q+K T RK)(X v − X v,τ ) ≤ (∥Q∥ F +∥R∥ 2 ∥K∥ 2 F )∥X v − X v,τ ∥ F ≤ ∥ v∥ 2 (∥Q∥ F +∥R∥ 2 ∥K∥ 2 F )κ ′ 1 e − κ ′ 2 τ where w e use the Cauc h y-Sc h w artz and triangle inequalities for the rst inequalit y and (D.16) for the second inequalit y . Com bining this result with the b ounds on the v ariables pro vided in Lemma 15 completes the pro of with κ 1 (a) := ∥Q∥ F + a 2 ∥R∥ 2 νλ min (R) a 3 νλ min (Ω) λ 2 min (Q) (D.17a) κ 2 (a) := λ min (Ω) λ min (Q)/a (D.17b) where the constan t ν is giv en b y (6.10d). □ Proof of Proposition 4 SinceK∈S K (a) andr≤ r(a), Lemma 4 implies that K± rU i ∈S K (2a). Th us, f x i (K± rU i ) is w ell dened for i=1,...,N , and e ∇f(K)− ∇f(K) = 1 2rN × X i f x i (K +rU i )− f x i ,τ (K +rU i ) U i − X i f x i (K− rU i )− f x i ,τ (K− rU i ) U i . 268 F urthermore, since K± rU i ∈S K (2a), w e can use triangle inequalit y and apply Lemma 5, 2N times, to b ound eac h term individually and obtain ∥ e ∇f(K)− ∇f(K)∥ F ≤ ( √ mn/r)max i ∥x i ∥ 2 κ 1 (2a)e − κ 2 (2a)τ where w e used ∥U i ∥ F = √ mn. This completes the pro of. D.7 Proof of Proposition 5 W e rst establish b ounds on the smo othness parameter of ∇f(K). F or J ∈R m× n , v∈R n , and f v (K) giv en b y (6.24a), let j v (K) :=⟨J,∇ 2 f v (K;J)⟩, denote the second-order term in the T a ylor series expansion of f v (K+J) around K . F ollo wing similar argumen ts as in [168, Eq. (2.3)] leads to j v (K)=2trace(J T (RJ− 2B T D)X v ), where X v and D are the solutions to A K (X v ) = − vv T (D.18a) A ∗ K (D) = J T (B T P − RK)+(B T P − RK) T J (D.18b) and P is giv en b y (6.6a). The follo wing lemma pro vides an analytical expression for the gradien t ∇j v (K). Lemma 6 F or any v∈R n and K∈S K ,∇j v (K)=4 B T W 1 X v +(RJ− B T D)W 2 +(RK− B T P)W 3 , wher e W i ar e the solutions to the line ar e quations A ∗ K (W 1 ) = J T RJ− J T B T D− DBJ (D.19a) A K (W 2 ) = BJX v +X v J T B T (D.19b) A K (W 3 ) = BJW 2 +W 2 J T B T . (D.19c) Pr o of: W e expand j v (K +ϵ ˜ K) around K and to obtain j v (K +ϵ ˜ K)− j v (K) = 2ϵ trace(J T (RJ− 2B T D) ˜ X v )− 4ϵ trace(J T B T ˜ DX v )+o(ϵ ). Here, o(ϵ ) denotes higher-order terms in ϵ , whereas ˜ X v , ˜ D , and ˜ P are obtained b y p erturbing Eqs. (D.18a), (D.18b), and (6.6b), resp ectiv ely , A K ( ˜ X v ) = B ˜ KX v +X v ˜ K T B T (D.20a) A ∗ K ( ˜ D) = ˜ K T B T D+DB ˜ K + A ∗ K ( ˜ D) = J T (B T ˜ P − R ˜ K)+(B T ˜ P − R ˜ K) T J (D.20b) A ∗ K ( ˜ P) = ˜ K T B T P +PB ˜ K− K T R ˜ K− ˜ K T RK. (D.20c) 269 Applying the adjoin t iden tit y on Eqs. 
(D.20a) and (D.20b) yields j v (K +ϵ ˜ K)− j v (K) ≈ 2ϵ trace((B ˜ KX v +X v ˜ K T B T )W 1 ) − 2ϵ trace(( ˜ K T B T D+DB ˜ K +J T (B T ˜ P − R ˜ K)+(B T ˜ P − R ˜ K) T J)W 2 ) = 4ϵ trace( ˜ K T B T W 1 X v )− 4ϵ trace( ˜ K T (B T D− RJ)W 2 )− 4ϵ trace(W 2 J T B T ˜ P) where w e ha v e neglected o(ϵ ) terms, and W 1 and W 2 are giv en b y (D.19a) and (D.19b), resp ectiv ely . Moreo v er, the adjoin t iden tit y applied to (D.20c) allo ws us to simplify the last term as, 2trace(W 2 J T B T ˜ P) = trace(( ˜ K T B T P +PB ˜ K− K T R ˜ K− ˜ K T RK)W 3 ) where W 3 is giv en b y (D.19c). Finally , this yields j(K +ϵ ˜ K)− j(K) ≈ 4ϵ trace( ˜ K T ((RK− B T P)W 3 +B T W 1 X v +(RJ− B T D)W 2 )). □ W e next establish a b ound on ∥∇j v (K)∥ F . Lemma 7 L et K,K ′ ∈ R m× n b e such that the line se gment K +t(K ′ − K) with t ∈ [0,1] b elongs to S K (a) and let J ∈R m× n and v∈R n b e xe d. Then, the function j v (K) satises |j v (K 1 )− j v (K 2 )|≤ ℓ(a)∥J∥ 2 F ∥v∥ 2 ∥K 1 − K 2 ∥ F , wher e l(a) is a p ositive function given by ℓ(a) := ca 2 + c ′ a 4 (D.21) and c, c ′ ar e p ositive sc alars that dep end only on pr oblem data. Pr o of: W e sho w that the gradien t ∇j v (K) giv en b y Lemma 6 is upp er b ounded b y ∥∇j v (K)∥ F ≤ ℓ(a)∥J∥ 2 F ∥v∥ 2 . Applying Lemma 17 on (D.18), the b ounds in Lemma 15, and the triangle inequalit y , w e ha v e ∥X v ∥ F ≤ c 1 a∥v∥ 2 and∥D∥ F ≤ c 2 a 2 ∥J∥ F , where c 1 and c 2 are p ositiv e constan ts that dep end on problem data. W e can use the same tec hnique to b ound the norms of W i in Eq. (D.19), ∥W 1 ∥ F ≤ (c 3 a+c 4 a 3 )∥J∥ 2 F ,∥W 2 ∥ F ≤ c 5 a 2 ∥v∥ 2 ∥J∥ F , ∥W 3 ∥ F ≤ c 6 a 3 ∥v∥ 2 ∥J∥ 2 F , where c 3 ,...,c 6 are p ositiv e constan ts that dep end on problem data. Com bining these b ounds with the Cauc h y-Sc h w artz and triangle inequalities applied to∇f v (K) completes the pro of. □ D.7.1 Proof of Proposition 5 Since r≤ r(a), Lemma 4 implies that K± sU ∈S K (2a) for all s≤ r . Also, the mean-v alue theorem implies that, for an y U ∈R m× n and v∈R n , f v (K± rU) = f v (K) ± r⟨∇f v (K),U⟩ + r 2 2 U,∇ 2 f v (K± s ± U;U) 270 where s ± ∈ [0,r] are constan ts that dep end on K and U . No w, if ∥U∥ F = √ mn, the ab o v e iden tit y yields 1 2r (f v (K +rU)− f v (K− rU))−⟨∇ f v (K),U⟩ = r 4 ( U,∇ 2 f v (K +s + U;U) − U,∇ 2 f v (K− s − U;U) ) ≤ r 4 (s + +s − )∥U∥ 3 F ℓ(2a)∥v∥ 2 ≤ r 2 2 mn √ mnℓ(2a)∥v∥ 2 where the rst inequalit y follo ws from Lemma 7. Com bining this inequalit y with the triangle inequalit y applied to the denition of b ∇f(K)− e ∇f(K) completes the pro of. D.8 Proof of Proposition 6 F rom inequalit y (6.28a), it follo ws that G is a descen t direction of the function f(K). Th us, w e can use the descen t lemma [158, Eq. (9.17)] to sho w that K + :=K− αG satises f(K + ) − f(K) ≤ (L f α 2 /2)∥G∥ 2 F − α ⟨∇f(K),G⟩ (D.22) for an y α for whic h the line segmen t b et w een K + and K lies inS K (a). Using (6.28), for an y α ∈[0,2µ 1 /(µ 2 L f )], w e ha v e (L f α 2 /2)∥G∥ 2 F − α ⟨∇f(K),G⟩ ≤ (α (L f µ 2 α − 2µ 1 )/2)∥∇f(K)∥ 2 F ≤ 0 (D.23) and the righ t-hand side of inequalit y (D.22) is nonp ositiv e for α ∈ [0,2µ 1 /(µ 2 L f )]. Th us, w e can use the con tin uit y of the function f(K) along with inequalities (D.22) and (D.23) to conclude that K + ∈S K (a) for all α ∈[0,2µ 1 /(µ 2 L f )], and f(K + ) − f(K) ≤ (α (L f µ 2 α − 2µ 1 )/2)∥∇f(K)∥ 2 F . Com bining this inequalit y with the PL condition (6.18), it follo ws that f(K + )− f(K) ≤ − (µ 1 α/ 2)∥∇f(K)∥ 2 F ≤− µ f µ 1 α (f(K)− f(K ⋆ )) for all α ∈[0,c 1 /(c 2 L f )]. 
Subtracting f(K ⋆ ) and rearranging terms complete the pro of. D.9 Proofs of Section 6.6.2.1 W e rst presen t t w o tec hnical results. Lemma 8 extends [164, Theorem 3.2] on the norm of Gaussian matrices presen ted in App endix D.10 to random matrices with uniform distribution on the sphere √ mnS mn− 1 . Lemma 8 L et E ∈ R m× n b e a xe d matrix and let U ∈ R m× n b e a r andom matrix with vec(U) uniformly distribute d on the spher e √ mnS mn− 1 . Then, for any s ≥ 1 and t ≥ 1, 271 we have P(B)≤ 2e − s 2 q− t 2 n +e − mn/8 , wher e B := ∥E T U∥ 2 >c ′ (s∥E∥ F +t √ n∥E∥ 2 ) , and q :=∥E∥ 2 F /∥E∥ 2 2 is the stable r ank of E . Pr o of: F or a matrix G with i.i.d. standard normal en tries, w e ha v e ∥E T U∥ 2 ∼ √ mn∥E T G∥ 2 /∥G∥ F . Let the constan t κ b e the ψ 2 -norm of the standard normal random v ariable and let us dene t w o auxiliary ev en ts, C 1 := { √ mn>2∥G∥ F } C 0 := { √ mn∥E T G∥ 2 >2cκ 2 ∥G∥ F s∥E∥ F +t √ n∥E∥ 2 }. F or c ′ :=2cκ 2 , w e ha v eP(B)=P(C 0 )≤ P(C 1 ∪A)≤ P(C 1 )+P(A), where the ev en t A is giv en b y Lemma 13. Here, the rst inequalit y follo ws from C 0 ⊂ C 1 ∪A and the second follo ws from the union b ound. No w, since ∥·∥ F is Lipsc hitz con tin uous with parameter 1, from the concen tration of Lipsc hitz functions of standard normal Gaussian v ectors [160, Theorem 5.2.2], it follo ws that P(C 1 ) ≤ e − mn/8 . This in conjunction with Lemma 13 complete the pro of. □ Lemma 9 In the setting of L emma 8, we have P ∥E T U∥ F >2 √ n∥E∥ F ≤ e − n/2 . Pr o of: W e b egin b y observing that ∥E T U∥ F = ∥vec(E T U)∥ F = ∥ I⊗ E T vec(U)∥ F , where⊗ denotes the Kronec k er pro duct. Th us, it is easy to v erify that ∥E T U∥ F is a Lipsc hitz con tin uous function of U with parameter ∥I⊗ E T ∥ 2 =∥E∥ 2 . No w, from the concen tration of Lipsc hitz functions of uniform random v ariables on the sphere √ mnS mn− 1 [160, Theorem 5.1.4], for all t>0, w e ha v eP ∥E T U∥ F > p E[∥E T U∥ 2 F ]+t ≤ e − t 2 /(2∥E∥ 2 2 ) . No w, since E[∥E T U∥ 2 F ] = E[∥(I⊗ E T )vec(U)∥ 2 F ] = E[trace((I⊗ E T )vec(U)vec(U) T (I⊗ E))] = trace((I⊗ E T )(I⊗ E)) = n∥E∥ 2 F w e can rewrite the last inequalit y for t= √ n∥E∥ F to obtain P{∥E T U∥ F >2 √ n∥E∥ F } ≤ e − n∥E∥ 2 F /(2∥E∥ 2 2 ) ≤ e − n/2 where the last inequalit y follo ws from ∥E∥ F ≥∥ E∥ 2 . □ 272 Proof of Lemma 5 W e dene the auxiliary ev en ts D i := {∥M ∗ (E T U i )∥ 2 ≤ c √ n∥M ∗ ∥ S ∥E∥ F }∩{∥M ∗ (E T U i )∥ F ≤ 2 √ n∥M ∗ ∥ 2 ∥E∥ F } for i=1,...,N . Since ∥M ∗ (E T U i )∥ 2 ≤∥M ∗ ∥ S ∥E T U i ∥ 2 and ∥M ∗ (E T U i )∥ F ≤∥M ∗ ∥ 2 ∥E T U i ∥ F w e ha v e P(D i ) ≥ P {∥E T U i ∥ 2 ≤ c √ n∥E∥ F } ∩ {∥E T U i ∥ F ≤ 2 √ n∥E∥ F } . Applying Lemmas 8 and 9 to the righ t-hand side of the ab o v e ev en ts together with the union b ound yield P(D c i ) ≤ 2e − n +e − mn/8 +e − n/2 ≤ 4e − n/8 , where D c i is the complemen t of D i . This in turn implies P(D c ) = P( N [ i=1 D c i ) ≤ N X i=1 P(D c i ) ≤ 4Ne − n 8 (D.24) whereD:=∩ i D i . W e can no w use the conditioning iden tit y to b ound the failure probabilit y , P{|a|>b} = P |a|>b D P(D)+P |a|>b D c P(D c ) ≤ P |a|>b D P(D)+P(D c ) = P{|a1 D | >b} + P(D c ) ≤ P{|a1 D |>b} + 4Ne − n/8 (D.25) where a := (1/N) X i ⟨E(X i − X),U i ⟩⟨EX,U i ⟩ b := δ ∥EX∥ F ∥E∥ F and 1 D is the indicator function of D. It is no w easy to v erify that P{|a1 D |>b} ≤ P{|Y|>b} 273 where Y := (1/N) X i Y i Y i := ⟨E(X i − X),U i ⟩⟨EX,U i ⟩1 D i . The rest of the pro of uses the ψ 1/2 -norm of Y to establish an upp er b ound on P{|Y|>b}. 
Since Y i are linear in the zero-mean random v ariables X i − X , w e ha v e E[Y i |U i ] = 0. Th us, the la w of total exp ectation yields E[Y i ] = E[E[Y i |U i ]] = 0. Therefore, Lemma 14 implies ∥Y∥ ψ 1/2 ≤ (c ′ / √ N)(logN)max i ∥Y i ∥ ψ 1/2 . (D.26) No w, using the standard prop erties of the ψ α -norm, w e ha v e ∥Y i ∥ ψ 1/2 ≤ c ′′ ∥⟨E(X i − X),U i ⟩1 D i ∥ ψ 1 ∥⟨EX,U i ⟩∥ ψ 1 ≤ c ′′′ ∥⟨E(X i − X),U i ⟩1 D i ∥ ψ 1 ∥EX∥ F (D.27) where the second inequalit y follo ws from [160, Theorem 3.4.6], ∥⟨EX,U i ⟩∥ ψ 1 ≤ ∥⟨ EX,U i ⟩∥ ψ 2 ≤ c 0 ∥EX∥ F . (D.28) W e can no w use ⟨E(X i − X),U i ⟩ = ⟨X i − X,E T U i ⟩ = ⟨M(x i x T i ),E T U i ⟩−⟨M (I),E T U i ⟩ = x T i M ∗ (E T U i )x i − trace(M ∗ (E T U i )) to b ound the righ t-hand side of (D.27). This iden tit y allo ws us to use the Hanson-W rite inequalit y (Lemma 12) to upp er b ound the conditional probabilit y P |⟨E(X i − X),U i ⟩|>t U i ≤ 2e − ˆ cmin{ t 2 κ 4 ∥M ∗ (E T U i )∥ 2 F , t κ 2 ∥M ∗ (E T U i )∥ 2 } . Th us, w e ha v e P{|⟨E(X i − X),U i ⟩1 D i |>t} = E U i 1 D i E x i 1 {|⟨E(X i − X),U i ⟩|>t} = E U i 1 D i P |⟨E(X i − X),U i ⟩|>t U i ≤ E U i " 1 D i 2e − ˆ cmin{ t 2 κ 4 ∥M ∗ (E T U i )∥ 2 F t κ 2 ∥M ∗ (E T U i )∥ 2 } # ≤ 2e − ˆ cmin{ t 2 4nκ 4 ∥M ∗ ∥ 2 2 ∥E∥ 2 F t c √ nκ 2 ∥M ∗ ∥ S ∥E∥ F } 274 where the denition of D i w as used to obtain the last inequalit y . The ab o v e tail b ound implies [169, Lemma 11] ∥⟨E(X i − X),U i ⟩1 D i ∥ ψ 1 ≤ ˜ cκ 2 √ n(∥M ∗ ∥ 2 +∥M ∗ ∥ S )∥E∥ F . (D.29) Using (6.30), it is easy to obtain the lo w er b ound on the n um b er of samples, N ≥ C ′ (β 2 κ 2 /δ ) 2 (∥M ∗ ∥ 2 + ∥M ∗ ∥ S ) 2 n log 6 N. W e can no w com bine (D.26), (D.29) and (D.27) to obtain ∥Y∥ ψ 1/2 ≤ C ′ κ 2 √ nlogN √ N (∥M ∗ ∥ 2 +∥M ∗ ∥ S )∥E∥ F ∥EX∥ F ≤ δ β 2 log 2 N ∥E∥ F ∥EX∥ F where the last inequalit y follo ws from the ab o v e lo w er b ound on N . Com bining this inequalit y and (D.35) with t :=δ ∥E∥ F ∥EX∥ F /∥Y∥ ψ 1/2 yieldsP{|Y|>δ ∥E∥ F ∥EX∥ F }≤ 1/N β , whic h completes the pro of. Proof of Lemma 6 The marginals of a uniform random v ariable ha v e b ounded sub-Gaussian norm (see the inequalit y in (D.28)). Th us, [160, Lemma 2.7.6] implies ∥⟨W,U i ⟩ 2 ∥ ψ 1 = ∥⟨W,U i ⟩∥ 2 ψ 2 ≤ ˆ c∥W∥ 2 F whic h together with the triangle inequalit y yield ∥⟨W,U i ⟩ 2 −∥ W∥ 2 F ∥ ψ 1 ≤ c ′ ∥W∥ 2 F . No w since ⟨W,U i ⟩ 2 −∥ W∥ 2 F are zero-mean and indep enden t, w e can apply the Bernstein inequalit y (Lemma 11) to obtain P ( 1 N N X i=1 ⟨W,U i ⟩ 2 −∥ W∥ 2 F >t∥W∥ 2 F ) ≤ 2e − cNmin{t 2 ,t} (D.30) whic h together with the triangle inequalit y complete the pro of. D.10 Proofs for Section 6.6.2.2 and probabilistic toolbox W e rst presen t a tec hnical lemma. Lemma 10 L et v 1 ,...,v N ∈R d b e i.i.d. r andom ve ctors uniformly distribute d on the spher e √ dS d− 1 and let a∈R d b e a xe d ve ctor. Then, for any t≥ 0, we have P ( 1 N ∥ N X j=1 ⟨a,v j ⟩v j ∥ > (c+c √ d+t √ N )∥a∥ ) ≤ 2e − t 2 +Ne − d/8 +2e − ˆ cN . 275 Pr o of: It is easy to v erify that P j ⟨a,v j ⟩v j = Vv, where V := [v 1 ··· v n ]∈R d× N is the random matrix with the j th column giv en b y v j and v :=V T a∈R N . Th us, ∥ X j ⟨a,v j ⟩v j ∥ = ∥Vv∥ ≤ ∥ V∥ 2 ∥v∥. No w, let G ∈ R d× N b e a random matrix with i.i.d. standard normal Gaussian en tries and let ˆ G∈R d× N b e a matrix obtained b y normalizing the columns of G as ˆ G j := √ dG j /∥G j ∥, where G j and ˆ G j are the j th columns of G and ˆ G, resp ectiv ely . F rom the concen tration of norm of Gaussian v ectors [160, Theorem 5.2.2], w e ha v e ∥G j ∥≥ √ d/2 with probabilit y not smaller than 1− e − d/8 . 
This in conjunction with a union b ound yield ∥ ˆ G∥ 2 ≤ 2∥G∥ 2 with probabilit y not smaller than 1− Ne − d/8 . F urthermore, from the concen tration of Gaussian matrices [160, Theorem 4.4.5], w e ha v e ∥G∥ 2 ≤ C( √ N+ √ d+t) with probabilit y not smaller than1− 2e − t 2 . By com bining this inequalit y with the ab o v e upp er b ound on ∥ ˆ G∥ 2 , and using V ∼ ˆ G in conjunction with a union b ound, w e obtain ∥V∥ 2 ≤ 2C( √ N + √ d + t) (D.31) with probabilit y not smaller than 1− 2e − t 2 − Ne − d/8 . Moreo v er, using (D.30) in the pro of of Lemma 6, giv es ∥v∥≤ C ′ √ N∥a∥ with probabilit y not smaller than 1− 2e − ˆ cN . Com bining this inequalit y with (D.31) and emplo ying a union b ound complete the pro of. □ Proof of Lemma 7 W e b egin b y noting that ∥ N X i=1 ⟨E(X i − X),U i ⟩U i ∥ F = ∥Uu∥ ≤ ∥ U∥ 2 ∥u∥ (D.32) where U ∈R mn× N is a matrix with the ith column vec(U i ) and u∈R N is a v ector with the ith en try ⟨E(X i − X),U i ⟩. Using (D.31) in the pro of of Lemma 10, for s≥ 0, w e ha v e ∥U∥ 2 ≤ c( √ N + √ mn + s) (D.33) with probabilit y not smaller than 1− 2e − s 2 − Ne − mn/8 . T o b ound the norm of u, w e use similar argumen ts as in the pro of of Lemma 5. In particular, let D i b e dened as ab o v e and let D:=∩ i D i . Then for an y b≥ 0, P{∥u∥>b} ≤ P{∥u1 D ∥>b} + 4Ne − n/8 (D.34) 276 where 1 D is the indicator function of D; cf. (D.25). Moreo v er, it is straigh tforw ard to v erify that ∥u1 D ∥ ≤ ∥ z∥, where the en tries of z ∈ R N are giv en z i = u i 1 D i . Since ∥∥z∥ 2 ∥ ψ 1/2 = ∥ P i z 2 i ∥ ψ 1/2 , w e ha v e ∥ N X i=1 z 2 i ∥ ψ 1/2 (a) ≤ ∥ N X i=1 z 2 i − E[z 2 i ]∥ ψ 1/2 + N∥E[z 2 1 ]∥ ψ 1/2 (b) ≤ ¯c 1 ∥z 2 1 ∥ ψ 1/2 √ NlogN + ¯c 2 N∥z 1 ∥ 2 ψ 1 (c) ≤ ¯c 3 N∥z 1 ∥ 2 ψ 1 (d) ≤ ¯c 4 Nκ 4 n(∥M ∗ ∥ 2 +∥M ∗ ∥ S ) 2 ∥E∥ 2 F . Here, (a) follo ws from the triangle inequalit y , (b) follo ws from com bination of Lemma 14, applied to the rst term, and E[z 2 1 ]≤ ˜ c 0 ∥z 1 ∥ 2 ψ 1 (e.g., see [160, Prop osition 2.7.1]) applied to the second term, (c) follo ws from ∥z 2 1 ∥ ψ 1/2 ≤ ˜ c 1 ∥z 1 ∥ 2 ψ 1 , and (d) follo ws from (D.29). This allo ws us to use (D.35) with ξ =∥z∥ 2 and t=r 2 to obtain P{∥z∥ > r √ nNκ 2 (∥M ∗ ∥ 2 +∥M ∗ ∥ S )∥E∥ F } ≤ ¯c 5 e − r for all r >0. Com bining this inequalit y with (D.34) yield P n ∥u∥ > r √ nNκ 2 (∥M ∗ ∥ 2 + ∥M ∗ ∥ S )∥E∥ F o ≤ ¯c 5 e − r + 4Ne − n/8 . Finally , substituting r =β logn in the last inequalit y and letting s= √ mn in (D.33) yield P ( 1 N ∥ X i ⟨E(X i − X),U i ⟩U i ∥ F >c 1 β √ mnlognκ 2 (∥M ∗ ∥ 2 +∥M ∗ ∥ S )∥E∥ F ) ≤ c 0 n − β + 2e − mn +Ne − mn/8 +4Ne − n/8 ≤ c 2 (n − β +Ne − n/8 ) where w e used inequalit y (D.32), N ≥ c 0 n, and applied the union b ound. This completes the pro of. Proof of Lemma 8 This result is obtained b y applying Lemma 10 to the v ectors vec(U i ) and setting t= √ mn. Probabilistic toolbox In this subsection, w e summarize kno wn tec hnical results whic h are useful in establishing b ounds on the correlation b et w een the gradien t estimate and the true gradien t. Herein, w e usec,c ′ , andc i to denote p ositiv e absolute constan ts. F or an y p ositiv e scalar α , theψ α -norm of a random v ariable ξ is giv en b y [170, Section 4.1], ∥ξ ∥ ψ α :=inf t {t>0|E[ψ α (|ξ |/t)]≤ 1}, 277 where ψ α (x) :=e x α − 1 (linear near the origin when 0<α< 1 in order for ψ α to b e con v ex) is an Orlicz function. 
Finiteness of the ψ α -norm implies the tail b ound P{|ξ | > t∥ξ ∥ ψ α } ≤ c α e − t α for all t ≥ 0 (D.35) where c α is an absolute constan t that dep ends on α ; e.g., see [171, Section 2.3] for a pro of. The random v ariable ξ is called sub-Gaussian if its distribution is dominated b y that of a normal random v ariable. This condition is equiv alen t to ∥ξ ∥ ψ 2 <∞. The random v ariable ξ is sub-exp onen tial if ∥ξ ∥ ψ 1 <∞. It is also w ell-kno wn that for an y random v ariables ξ and ξ ′ and an y p ositiv e scalar α , ∥ξξ ′ ∥ ψ α ≤ ˆ c α ∥ξ ∥ ψ 2α ∥ξ ′ ∥ ψ 2α and the ab o v e inequalit y b ecomes equalit y with c α =1 if α ≥ 1. Lemma 11 (Bernstein inequalit y [160, Corollary 2.8.3]) L et the ve ctors ξ 1 ,...,ξ N b e indep endent, zer o-me an, sub-exp onential r andom variables with κ ≥ ∥ ξ i ∥ ψ 1 . Then, for any sc alar t≥ 0, P{|(1/N) P i ξ i |>t}≤ 2e − cNmin{t 2 /κ 2 ,t/κ } . Lemma 12 (Hanson-W righ t inequalit y [164, Theorem 1.1]) L et A b e a xe d matrix inR N× N and let x∈R N b e a r andom ve ctor with indep endent entries that satisfy E[x i ]=0, E[x 2 i ]=1, and ∥x i ∥ ψ 2 ≤ κ . Then, for any nonne gative sc alar t, we have P{ x T Ax− E[x T Ax] >t} ≤ 2e − cmin{t 2 /(κ 4 ∥A∥ 2 F ),t/(κ 2 ∥A∥ 2 )} . Lemma 13 (Norms of random matrices [164, Theorem 3.2]) L etE b e a xe d matrix inR m× n and let G∈R m× n b e a r andom matrix with indep endent entries that satisfy E[G ij ]= 0, E[G 2 ij ]=1, and ∥G ij ∥ ψ 2 ≤ κ . Then, for any sc alars s,t≥ 1, P(A) ≤ 2e − s 2 q− t 2 n wher e q :=∥E∥ 2 F /∥E∥ 2 2 is the stable r ank of E and A := {∥E T G∥ 2 >cκ 2 s∥E∥ F +t √ n∥E∥ 2 }. The next lemma pro vides us with an upp er b ound on the ψ α -norm of sum of random v ariables that is b y T alagrand. This result is a straigh tforw ard consequence of com bining the results in [170, Theorem 6.21] and [172, Lemma 2.2.2]; see e.g. [173, Theorem 8.4] for a formal argumen t. Lemma 14 F or any sc alar α ∈(0,1], ther e exists a c onstant C α such that for any se quenc e of indep endent r andom variables ξ 1 ,...,ξ N we have ∥ X i ξ i − E X i ξ i ∥ ψ α ≤ C α (max i ∥ξ i ∥ ψ α ) √ NlogN. D.11 Bounds on optimization v ariables Building on [152], in Lemma 15 w e pro vide useful b ounds on the matrices K , X = X(K), P =P(K), and Y =KX(K). 278 Lemma 15 Over the sublevel set S K (a) of the LQR obje ctive function f(K), we have trace(X) ≤ a/λ min (Q) (D.36a) ∥Y∥ F ≤ a/ p λ min (R)λ min (Q) (D.36b) ν/a ≤ λ min (X) (D.36c) ∥K∥ F ≤ a/ p νλ min (R) (D.36d) trace(P) ≤ a/λ min (Ω) (D.36e) wher e the c onstant ν is given by (6.10d) . Pr o of: F or K∈S K (a), w e ha v e trace(QX +Y T RYX − 1 ) ≤ a (D.37) whic h along with trace(QX) ≥ λ min (Q)∥X 1/2 ∥ 2 F yield (D.36a). T o establish (D.36b), w e com bine (D.37) with trace(RYX − 1 Y T ) ≥ λ min (R)∥YX − 1/2 ∥ 2 F to obtain ∥YX − 1/2 ∥ 2 F ≤ a/λ min (R). Th us, ∥Y∥ 2 F ≤ a∥X∥ 2 /λ min (R). This inequalit y along with (D.36a) giv e (D.36b). T o sho w the inequalit y in (D.36c), let v b e the normalized eigen v ector corresp onding to the smallest eigen v alue of X . Multiplication of Eq. (6.8a) from the left and the righ t b y v T and v , resp ectiv ely , giv es v T (DX 1/2 +X 1/2 D T )v = p λ min (X)v T (D+D T )v = − v T Ω v where D :=AX 1/2 − BYX − 1/2 . Th us, λ min (X) = (v T Ω v) 2 (v T (D+D T )v) 2 ≥ λ 2 min (Ω) 4∥D∥ 2 2 (D.38) where w e applied the Cauc h y-Sc h w arz inequalit y on the denominator. 
Using the triangle inequalit y and subm ultiplicativ e prop ert y of the 2-norm, w e can upp er b ound ∥D∥ 2 , ∥D∥ 2 ≤ ∥ A∥ 2 ∥X 1/2 ∥ 2 +∥B∥ 2 ∥YX − 1/2 ∥ 2 ≤ √ a(∥A∥ 2 / p λ min (Q)+∥B∥ 2 / p λ min (R)) (D.39) where the last inequalit y follo ws from (D.36a) and the upp er b ound on ∥YX − 1/2 ∥ 2 F . Inequalit y (D.36c), with ν giv en b y (6.10d), follo ws from com bining (D.38) and (D.39). T o sho w (D.36d), w e use the upp er b ound on ∥YX − 1/2 ∥ 2 F , whic h is equiv alen t to ∥KX 1/2 ∥ 2 F ≤ a/λ min (R) to obtain ∥K∥ 2 F ≤ a/λ min (R)λ min (X) ≤ a 2 /(νλ min (R)). 279 Here, the second inequalit y follo ws from (D.36c). Finally , to pro v e (D.36e), note that the denitions of f(K) in (6.3b) andP in (6.6a) implyf(K)=trace(P Ω) . Th us, fromf(K)≤ a, w e ha v e trace(P)≤ a/λ min (Ω) , whic h completes the pro of. □ D.12 The norm of the inverse Lyapunov operator Lemma 16 pro vides an upp er b ound on the norm of the in v erse Ly apuno v op erator for stable L TI systems. Lemma 16 F or any Hurwitz matrix F ∈R n× n , the line ar map F : S n →S n F(W) := Z ∞ 0 e Ft W e F T t dt (D.40) is wel l dene d and, for any Ω ≻ 0, ∥F∥ 2 ≤ trace(F(I)) ≤ trace(F(Ω)) /λ min (Ω) . (D.41) Pr o of: Using the triangle inequalit y and the sub-m ultiplicativ e prop ert y of the F rob enius norm, w e can write ∥F(W)∥ F ≤ Z ∞ 0 ∥e F t W e F T t ∥ F dt ≤ ∥ W∥ F Z ∞ 0 ∥e F t ∥ 2 F dt = ∥W∥ F trace(F(I)). (D.42) Th us, ∥F∥ 2 = max ∥W∥ F =1 ∥F(W)∥ F ≤ trace(F(I)), whic h pro v es the rst inequalit y in (D.41). T o sho w the second inequalit y , w e use the monotonicit y of the linear map F , i.e., for an y symmetric matrices W 1 and W 2 with W 1 ⪯ W 2 , w e ha v e F(W 1 )⪯F (W 2 ). In particular, λ min (Ω) I ⪯ Ω implies λ min (Ω) F(I)⪯F (Ω) whic h yields λ min (Ω)trace( F(I))≤ trace(F(Ω)) and completes the pro of. □ W e next use Lemma 16 to establish a b ound on the norm of the in v erse of the closed-lo op Ly apuno v op erator A K o v er the sublev el sets of the LQR ob jectiv e function f(K). Lemma 17 F or any K ∈ S K (a), the close d-lo op Lyapunov op er ators A K given by (6.7) satises ∥A − 1 K ∥ 2 =∥(A ∗ K ) − 1 ∥ 2 ≤ a/λ min (Ω) λ min (Q). Pr o of: Applying Lemma 16 with F =A− BK yields ∥A − 1 K ∥ 2 = ∥(A ∗ K ) − 1 ∥ 2 ≤ trace(X)/λ min (Ω) . Com bining this inequalit y with (D.36a) completes the pro of. □ 280 Parameter θ (a) in Theorem 4 As discussed in the pro of, o v er an y sublev el set S K (a) of the function f(K), w e require the function θ in Theorem 4 to satisfy (∥(A ∗ K ) − 1 ∥ 2 +∥(A ∗ K ) − 1 ∥ S )/λ min (X) ≤ θ (a) for all K ∈ S K (a). Clearly , Lemma 17 in conjunction with Lemma 15 can b e used to obtain ∥(A ∗ K ) − 1 ∥ 2 ≤ a/(λ min (Q)λ min Ω) and λ − 1 min (X) ≤ a/ν, where ν is giv en b y (6.10d). The existence of θ (a), follo ws from the fact that there is a scalar M(n) > 0 suc h that ∥A∥ S ≤ M∥A∥ 2 for all linear op erators A: S n →S n . 281 Appendix E Supporting proofs for Chapter 7 E.1 Proof of Proposition 1 Since G has a p ositiv e inner pro duct with the gradien t of the function f(K), w e can use the descen t lemma [158, Eq. (9.17)] to sho w that K + :=K− αG satises f(K + ) − f(K) ≤ (L f (a)α 2 /2)∥G∥ 2 F − α ⟨∇f(K),G⟩ (E.1) for an y α for whic h the line segmen t b et w een K + andK lies inS K (a). Using the inequalities in (7.11), for an y α ∈[0,2µ 1 /(µ 2 L f (a))], w e ha v e (L f (a)α 2 /2)∥G∥ 2 F − α ⟨∇f(K),G⟩ ≤ (α (L f (a)µ 2 α − 2µ 1 )/2)∥∇f(K)∥ 2 F ≤ 0 (E.2) and the righ t-hand side of inequalit y (E.1) is nonp ositiv e for α ∈ [0,2µ 1 /(µ 2 L f (a))]. 
Th us, w e can use the con tin uit y of the function f(K) along with inequalities (E.1) and (E.2) to conclude that K + ∈S K (a) for all α ∈[0,2µ 1 /(µ 2 L f (a))], and f(K + )− f(K) ≤ (α (L f (a)µ 2 α − 2µ 1 )/2)∥∇f(K)∥ 2 F . Com bining this inequalit y with the PL condition, it follo ws that f(K + )− f(K) ≤ − (µ 1 α/ 2)∥∇f(K)∥ 2 F ≤ − µ f (a)µ 1 α (f(K)− f(K ⋆ )) for all α ∈[0,µ 1 /(µ 2 L f (a))]. Subtracting f(K ⋆ ) and rearranging terms complete the pro of. E.2 Proof of Proposition 2 W e rst presen t t w o tec hnical lemmas. Lemma 1 L et the matric es F , X ≻ 0, and Ω ≻ 0 satisfy FXF T − X + Ω = 0 . Then, we have ∥F t ∥ 2 2 ≤ cρ t for al l t∈N, wher e c := ∥X∥ 2 /λ min (X), ρ := 1− λ min (Ω) /∥X∥ 2 . 282 Pr o of: Using the trivial inequalities Ω ⪰ λ min (Ω) I and X ⪯∥ X∥ 2 I , w e can write FXF T = X− Ω ⪯ ρX where ρ := 1− λ min (Ω) /∥X∥ 2 . This matrix inequalit y implies that V(x) := x T Xx is a Ly apuno v function for x t+1 = F T x t b ecause V(x k+1 ) ≤ ρV (x k ). Th us, for an y initial condition x 0 , w e ha v e V(x t ) ≤ ρ t V(x 0 ). Noting that x t = (F T ) t x 0 , w e let x 0 b e the normalized left singular v ector asso ciated with the maxim um singular v alue of F t to obtain ∥F t ∥ 2 2 = ∥x t ∥ 2 ≤ V(x t ) λ min (X) ≤ ρ t V(x 0 ) λ min (X) whic h along with V(x 0 )≤∥ X∥ 2 complete the pro of. □ Lemma 2 establishes an exp onen tially deca ying upp er b ound on the dierence b et w een f ζ (K) = f ζ, ∞ (K) and f ζ,τ (K) o v er an y sublev el set S K (a) of the LQR ob jectiv e function f(K). Lemma 2 F or any K∈S K (a) and ζ ∈R n , |f ζ (K)− f ζ,τ (K)| ≤ ∥ ζ ∥ 2 κ 1 (a)(1− κ 2 (a)) τ wher e κ 1 and κ 2 <1 ar e p ositive r ational functions that dep end on the pr oblem data. Pr o of: Since x t = (A− BK) t ζ is the solution to (7.1a) with u = − Kx and the initial condition x 0 =ζ , it is easy to v erify that f ζ,τ (K)= Q+K T RK,X ζ,τ (K) , where X ζ,τ (K) := τ X t=0 (A− BK) t ζζ T ((A− BK) T ) t . Using the triangle inequalit y , w e ha v e ∥X ζ (K)− X ζ,τ (K)∥ F ≤ ∥ ζ ∥ 2 ∞ X t=τ ∥(A− BK) t ∥ 2 2 (E.3) where X ζ (K)=X ζ, ∞ (K) is giv en b y (7.3). The Ly apuno v equation in (7.5) allo ws us to use Lemma 1 with F :=A− BK , X :=X(K) to upp er b ound ∥(A− BK) t ∥ 2 , λ min (X)∥(A− BK) t ∥ 2 2 ≤ ∥ X∥ 2 (1− λ min (Ω) /∥X∥ 2 ) t . Summing this inequalit y from t=τ on w ard in conjunction with (E.3) yield ∥X ζ (K)− X ζ,τ (K) ∥ F ≤ ∥ ζ ∥ 2 κ ′ 1 (1− κ ′ 2 ) τ (E.4) 283 where κ ′ 1 :=∥X(K)∥ 2 2 /(λ min (Ω) λ min (X(K))) and κ ′ 2 :=λ min (Ω) /∥X(K)∥ 2 . F urthermore, |f ζ (K)− f ζ,τ (K)| = trace (Q+K T RK)(X ζ − X ζ,τ ) ≤ (∥Q∥ F +∥R∥ 2 ∥K∥ 2 F )∥X ζ − X ζ,τ ∥ F ≤ ∥ ζ ∥ 2 (∥Q∥ F +∥R∥ 2 ∥K∥ 2 F )κ ′ 1 (1− κ ′ 2 ) τ where w e use the Cauc h y-Sc h w artz and triangle inequalities for the rst inequalit y and (E.4) for the second inequalit y . Com bining this result with trivial upp er b ounds on the norms ∥K∥ F ,∥X(K)∥ 2 , andX(X)⪰ Ω ≻ 0 completes the pro of. See [14, Lemma 23] for deriv ation of these b ounds. □ W e are no w ready to pro v e the rst inequalit y in Prop osition 2. Since K ∈S K (a) and r ≤ r(a), Lemma 1 implies that K± rU i ∈S K (2a). Th us, f ζ i(K± rU i ) is w ell dened for i=1,...,N , and e ∇f(K)− ∇f(K) = 1 2rN × X i f ζ i(K +rU i )− f ζ i ,τ (K +rU i ) U i − f ζ i(K− rU i )− f ζ i ,τ (K− rU i ) U i . F urthermore, since K± rU i ∈S K (2a), w e can use triangle inequalit y and apply Lemma 2, 2N times, to b ound eac h term individually and obtain ∥ e ∇f(K) − ∇f(K)∥ F ≤ ( √ mn/r)max i ∥ζ i ∥ 2 κ 1 (2a)(1 − κ 2 (2a)) τ where w e used ∥U i ∥ F = √ mn. 
This completes the pro of of the rst inequalit y . The pro of of the second inequalit y follo ws similar argumen ts as in [14, Prop opsition 5], whic h exploits the third deriv ativ es of the functions f ζ . 284
Abstract
First-order optimization algorithms are increasingly used for data-driven control and many learning applications that often involve uncertain and noisy environments. In this thesis, we employ control-theoretic tools to study the stochastic performance of these algorithms in solving general (strongly) convex and some nonconvex optimization problems that arise in reinforcement learning and control theory. We first study momentum-based accelerated optimization algorithms subject to additive white noise and identify fundamental tradeoffs between noise amplification and convergence rate. We introduce a novel geometric characterization of conditions for linear convergence that clarifies the relation between the noise amplification and convergence rate as well as their dependence on the condition number and the algorithmic parameters. This geometric insight leads to simple alternative proofs of standard convergence results and allows us to establish analytical lower bounds on the product between the settling time and noise amplification that scale quadratically with the condition number. We also analyze the transient responses of popular accelerated algorithms that arise due to the presence of non-normal models. We next focus on model-free reinforcement learning which attempts to find an optimal control action for an unknown dynamical system by directly searching over the parameter space of controllers. We take a step towards demystifying the performance and efficiency of such methods by focusing on the standard linear quadratic regulator with unknown parameters. For this problem, we establish linear convergence of gradient descent and also provide theoretical bounds on the convergence rate and sample complexity of the random search method with two-point gradient estimates.