Analysis, design, and optimization of large-scale networks of dynamical systems

A DISSERTATION SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL OF THE UNIVERSITY OF SOUTHERN CALIFORNIA BY Sepideh Hassan-Moghaddam IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Doctor of Philosophy

Ming Hsieh Department of Electrical and Computer Engineering

December 2019

Copyright © 2019 by Sepideh Hassan-Moghaddam. ALL RIGHTS RESERVED

To Karen, Susan, Hojjat, and Neda

Acknowledgements

I would like to express my sincere gratitude to my Ph.D. advisor, Professor Mihailo Jovanović, for his continuous support and guidance through these years. He was always ready to help and never gave up on making me a better researcher. He was the perfect role model in research with his great vision, immense knowledge, intelligence, and patience. I am extremely fortunate to have had the opportunity to work with him and I will always be grateful for all his advice and support. His kindness and friendship have made the hard days of graduate school enjoyable for me.

I would like to thank Professors Antonio Ortega, Ketan Savla, Urbashi Mitra, and Meisam Razaviyayn for serving on my doctoral committee. I have received invaluable comments and feedback from all of them. Also, I would like to thank Professors Tryphon Georgiou, Andrew Lamperski, Yousef Saad, and Tom Luo for their extremely useful feedback on my research. I am also extremely lucky to have had the opportunity to work with Professor Meyn. His insightful comments and suggestions have made our joint work successful.

I am very grateful to my past and current lab mates, Xiaofan Wu, Neil Dhingra, Armin Zare, Yongxin Chen, Wei Ran, Hesam Mohammadi, and Dongsheng Ding for making these graduate school years more interesting. I would also like to thank the senior members Rashad Moarref, Fu Lin, and Binh Lieu, who built the foundation of my research by their hard work.

I am incredibly fortunate to have spent my PhD program in Minneapolis and Los Angeles, where I met my friends for life. I would like to thank Fatemeh Sheikholeslami, Pardees Azodanloo, Maziar Sanjabi, Roushanak Navvab, Farnaz Forootaninia, Pooyan Behnam, Mehdi Lamee, Behnaz Forootaninia, Nazila Haratipour, Mehrdad Hairani, Maral Mousavi, Meisam Razaviyayn, Saghar Sadeghpour, Pooya Rezaei, Samira Tabar, Farnoosh Zough, and Rashad Moarref. Thank you for constantly providing love and support.

I would like to sincerely thank my family. My parents have always taught me to believe in myself. They have sacrificed a lot for me to be here and I am forever indebted to them. I am grateful to my sister Neda for her constant support and infinite love. Finally and most importantly, I would like to thank my wonderful husband, Karen, for being by my side, loving me, and supporting me through all these years.

Abstract

Large-scale networks of dynamical systems are becoming increasingly important in science and engineering, with applications ranging from economics and social networks to power systems and robotics. As a consequence, network science has emerged as an interdisciplinary field that draws from diverse disciplines including graph theory, matrix theory, dynamical systems, optimization, and statistical mechanics. One of the major challenges is the development of fast and scalable methods for the analysis and design of these systems.
Such systems involve large-scale interconnections of components, have limitations on communication and processing power, and their dynamics may be unknown. In large-scale dynamical systems, centralized implementation of control and estimation policies is computationally expensive or even infeasible. Moreover, it is often the case that the underlying dynamics are unknown and only limited input-output measurements are available from the system. To overcome these challenges, we combine tools and ideas from control theory, distributed optimization, reinforcement learning, and compressive sensing to develop distributed estimation and control strategies that utilize limited information exchange between the individual subsystems and do not require knowledge of the underlying dynamics.

In Part I of this dissertation, we study the problem of optimal topology design for stochastically forced undirected consensus networks in order to optimally enhance closed-loop performance. The performance is given by the steady-state variance amplification of the network with additive stochastic disturbances. We use a sparsity-promoting optimal control framework to avoid combinatorial complexity. In particular, we introduce $\ell_1$-regularization into the optimal $\mathcal{H}_2$ formulation and cast the design problem as a nonsmooth composite optimization problem. By exploiting the structure of the problem, we develop customized proximal gradient and Newton algorithms that are well-suited for large-scale problems. We illustrate that our algorithms can solve problems with more than a million edges in the controller graph in a few minutes, on a PC. We also exploit the structure of connected resistive networks to demonstrate how additional edges can be systematically added in order to minimize the $\mathcal{H}_2$ norm of the closed-loop system. Moreover, we study the problem of performance enhancement in stochastically forced directed consensus networks by adding edges to an existing topology. We formulate the problem as a feedback control design, and represent the links as the elements of the controller graph Laplacian matrix. To deal with the structural constraints that arise from the absence of absolute measurements, we introduce a coordinate transformation to eliminate the average mode and ensure convergence of all states to the average of the initial node values. By exploiting the structure of the optimization problem, we develop a customized algorithm based on the alternating direction method of multipliers to design a sparse controller network that improves the performance of the closed-loop system.

Part II studies the problem of identifying sparse interaction topology using the sample covariance matrix of the states of the network. We assume that the statistics are generated by a stochastically forced undirected consensus network with unknown topology. We propose a method for identifying the topology using a regularized Gaussian maximum likelihood framework where the $\ell_1$ regularizer is introduced as a means for inducing sparse network topology. The proposed algorithm employs a sequential quadratic approximation in which the Newton direction is obtained using a coordinate descent method. We also develop a method based on growing a Chow-Liu tree that is well-suited for identifying the underlying structure of large-scale systems in which some of the nodes may have access to their own states.
In Part III, we study algorithms that can be used to solve nonsmooth composite optimization problems from a control-theoretic point of view. We view proximal algorithms as dynamical systems and leverage techniques from control theory to study their global convergence properties. In particular, for problems with strongly convex objective functions, we utilize the theory of integral quadratic constraints to prove global exponential stability of the differential equations that govern the evolution of proximal gradient and Douglas-Rachford splitting flows. Moreover, we establish conditions for global exponential convergence even in the absence of strong convexity. We also study a class of nonsmooth composite optimization problems in which the convex objective function is given by a sum of differentiable and nondifferentiable terms. We propose a primal-descent dual-ascent gradient flow method that exploits separability of the objective function and is well-suited for in-network optimization. We prove global asymptotic stability of the proposed algorithm and solve the problem of growing undirected consensus networks in a distributed manner to demonstrate its effectiveness.

Part IV focuses on the distributed design of structured feedback gains for large-scale systems. This is a challenging problem even in the absence of additional structural constraints on the feedback controller. We approach the structured optimal control problem via a data-driven framework that does not require knowledge of the system parameters and avoids the need to solve large-scale matrix equations. For the structured optimal $\mathcal{H}_2$ state-feedback problem, we show that the objective function and its gradient can be computed from data, and we develop customized proximal algorithms based on gradient descent and the incremental gradient method. Moreover, we exploit separability of the objective function and utilize an ADMM-based consensus algorithm to solve the regularized optimal control problem in a distributed manner over multiple processors.

Table of Contents

Acknowledgements
Abstract
List of Tables
List of Figures
1 Introduction
  1.1 Main topics of the dissertation
    1.1.1 Distributed design and control of consensus networks
    1.1.2 Network estimation and inference
    1.1.3 Control theoretic approach for analysis of optimization algorithms
    1.1.4 Distributed in-network optimization
    1.1.5 Data-driven and model-free distributed control
  1.2 Dissertation structure
  1.3 Contributions of the dissertation
I Distributed design and control of networks
2 Topology design for stochastically forced undirected consensus networks
  2.1 Introduction
  2.2 Problem formulation
    2.2.1 Design of optimal sparse topology
    2.2.2 Structured optimal control problem: debiasing step
    2.2.3 Gradient and Hessian of J(x)
  2.3 Dual problem
  2.4 Customized algorithms
    2.4.1 Proximal gradient method
    2.4.2 Proximal Newton method
  2.5 Growing connected resistive networks
    2.5.1 Proximal gradient method
    2.5.2 Proximal Newton method
  2.6 Computational experiments
    2.6.1 Performance comparison
    2.6.2 Large-scale Facebook network
    2.6.3 Random disconnected network
    2.6.4 Path and ring networks
  2.7 Concluding remarks
3 Edge addition in directed consensus networks
  3.1 Introduction
  3.2 Motivation and background
    3.2.1 Directed graphs
    3.2.2 Feedback design in consensus networks
    3.2.3 Elimination of the average mode
  3.3 Topology design for directed networks
    3.3.1 Sparsity-promoting optimal control problem
    3.3.2 Structured optimal control problem
  3.4 An ADMM-based algorithm
    3.4.1 F-minimization step
    3.4.2 L_x-minimization step
  3.5 Computational experiments
    3.5.1 Synthetic example
    3.5.2 Performance improvement
  3.6 Concluding remarks
II Network estimation and inference
4 Topology identification of undirected consensus networks via sparse inverse covariance estimation
  4.1 Introduction
  4.2 Problem formulation
    4.2.1 Undirected consensus networks
    4.2.2 Topology identification
  4.3 Customized algorithm based on sequential quadratic approximation
    4.3.1 Structured problem: debiasing step
    4.3.2 Gradient and Hessian of J(x)
    4.3.3 Algorithm
  4.4 Computational experiments
    4.4.1 Network identification
    4.4.2 Network sparsification
  4.5 Concluding remarks
5 Topology identification via growing a Chow-Liu tree network
  5.1 Introduction
  5.2 Topology identification via structured graphical-LASSO
  5.3 Growing a Chow-Liu tree network
  5.4 Computational experiments
    5.4.1 Structured graphical-LASSO algorithm
    5.4.2 Topology identification via growing a Chow-Liu tree
    5.4.3 FMRI dataset
  5.5 Concluding remarks
III Control theoretic approach for analysis of optimization algorithms
6 Proximal gradient flow and Douglas-Rachford splitting dynamics: global exponential stability via integral quadratic constraints
  6.1 Introduction
  6.2 Problem formulation and background
    6.2.1 Proximal operators and the associated envelopes
    6.2.2 Proximal augmented Lagrangian
    6.2.3 Strong convexity and Lipschitz continuity
    6.2.4 Proximal Polyak-Lojasiewicz inequality
  6.3 Exponential stability of proximal algorithms
    6.3.1 Strongly convex problems
    6.3.2 Proximal Polyak-Lojasiewicz condition
  6.4 Global exponential stability of Douglas-Rachford splitting dynamics
    6.4.1 Non-smooth composite optimization problem
    6.4.2 Douglas-Rachford splitting on the dual problem
  6.5 Concluding remarks
7 Distributed proximal augmented Lagrangian method for nonsmooth composite optimization
  7.1 Introduction
  7.2 Problem formulation
    7.2.1 Motivating application
    7.2.2 Background
  7.3 Proximal augmented Lagrangian
  7.4 Arrow-Hurwicz-Uzawa gradient flow
  7.5 Distributed implementation
    7.5.1 Primal update
    7.5.2 Dual updates
  7.6 Computational experiments
  7.7 Concluding remarks
IV Data-driven and model-free distributed control
8 Data-driven proximal algorithms for the design of structured optimal feedback gains
  8.1 Introduction
  8.2 Problem formulation
  8.3 Computation from data
    8.3.1 Computation of P and L from data
    8.3.2 Computation of PL from data
  8.4 Systems with special structure
  8.5 Proximal algorithms
    8.5.1 Proximal gradient descent
    8.5.2 Incremental proximal gradient
  8.6 An ADMM-based consensus algorithm
    8.6.1 Computational experiments
  8.7 Concluding remarks
9 Conclusions
References
Appendix A. Proximal PL condition
  A.1 Pointwise quadratic inequality
Appendix B. On the exponential convergence rate of proximal gradient flow algorithms
  B.0.1 Distributed optimization

List of Tables

2.1 Comparison of our customized proximal algorithms
2.2 Algorithmic properties of the primal-dual IP method
2.3 Comparison of algorithms (solve times in seconds/number of iterations) for the problem of growing connected resistive Erdős–Rényi networks with different numbers of nodes n, edge probability 1.05 log(n)/n, and γ = 0.8 γmax
2.4 Table 2.3 continued

List of Figures

1.1 (a) Social network. (b) Computer network. (c) Vehicular formation. (d) Power network.
2.1 (a) Solve times (in seconds); and (b) performance degradation (in percent) of proximal gradient and greedy algorithms relative to the optimal centralized controller.
2.2 (a) Sparsity level; and (b) optimal tradeoff curves resulting from the application of the proximal gradient algorithm and a heuristic strategy for the Facebook network.
2.3 Topologies of the plant (blue lines) and controller graphs (red lines) for an unweighted random network with three disconnected subgraphs.
2.4 (a) Sparsity level; (b) performance degradation; and (c) the optimal tradeoff curve between the performance degradation and the sparsity level of the optimal sparse x compared to the optimal centralized vector of the edge weights x_c. The results are obtained for an unweighted random disconnected plant network with the topology shown in Fig. 2.3.
2.5 The problems of growing unweighted path (top row) and ring (bottom row) networks. Blue lines identify edges in the plant graph, and red lines identify edges in the controller graph.
3.1 Topologies of the plant (blue lines) and controller graphs (red lines) for a randomly generated weighted network.
3.2 (a) Sparsity level; (b) performance degradation; and (c) the optimal tradeoff between the performance degradation and the sparsity level of the optimal sparse L_x compared to the optimal centralized controller L_c. The results are obtained for the weighted random plant network with the topology shown in Fig. 3.1a.
4.1 The problem of identification of sparse networks using the sample covariance matrix for a network with n = 12 nodes.
4.2 The problem of identification of sparse networks using the sample covariance matrix for a network with n = 12 nodes.
4.3 The problem of sparsification of a network with n = 50 nodes using the sample covariance matrix.
4.4 Relative error in log-scale for the sparsification problem of a network with n = 50 nodes using the sample covariance matrix.
5.1 The RC tree network with n = 10 nodes.
5.2 (a) Plant network; (b) constructed Chow-Liu tree network; and (c) the identified network from growing the Chow-Liu tree.
5.3 The identified networks for different values of γ.
7.1 Speedup ratio versus the number of cores used for growing connected resistive networks with n nodes.
8.1 A mass-spring-damper system on a line.
8.2 Structure of feedback gains resulting from the solution to (8.6) for different values of the regularization parameter γ for an MSD system with N = 10 masses. The number of non-zero elements in K is denoted by n_z.
8.3 (a) Sparsity level; (b) performance degradation; and (c) the optimal tradeoff between the performance degradation and the sparsity level of the optimal sparse x compared to the optimal centralized controller x_c. The results are obtained for a randomly generated Erdős–Rényi network with n = 20 nodes.
8.4 Speedup ratio versus the number of cores used for growing connected resistive networks with n nodes.

Chapter 1
Introduction

Traditional control theory is concerned with implementing control and estimation policies in a centralized fashion [1]. However, in large networks of dynamical systems, centralized processing is often computationally expensive or even practically infeasible. This motivates the development of distributed control strategies that require limited information exchange between the individual subsystems. Analysis and design of distributed averaging protocols have been an active research area recently; e.g., see [2-11]. Moreover, distributed computing over networks is of fundamental importance in network science [2]. The consensus problem has received significant attention because it is encountered in a number of applications ranging from animal group behavior [12,13], to social networks [14,15], to power systems [16-18], to distributed processing networks [19,20], to spreading processes [21,22], and to coordination of autonomous vehicles [23-26]; see Fig. 1.1 for some examples. An inherent challenge in all of these problems is that it is desired for all nodes to reach an agreement or to achieve synchronization by only exchanging relative information with a limited number of nodes. This restriction to relative measurements, i.e., the absence of absolute measurements, imposes structural constraints on the analysis and design.

Figure 1.1: (a) Social network. (b) Computer network. (c) Vehicular formation. (d) Power network.

Furthermore, in many real-world applications, the dynamics of the underlying system are unknown, the dynamical systems are large-scale, and only input-output measurements are available. This necessitates the development of theory and techniques that utilize distributed computing architectures to cope with large problem sizes and do not require knowledge of the underlying dynamics of the system. Moreover, in some applications, having access to even the input-output measurements is not feasible and only second-order data is available. Recovering network topology from partially available statistical signatures is a challenging problem. Due to experimental or numerical limitations, it is often the case that only noisy partial network statistics are known.
Thus, the objective is to account for all observed correlations that may be available and to develop efficient identification algorithms that are well-suited for big-data problems [27,28].

Moreover, analysis of optimization algorithms from the system-theoretic point of view has received significant recent attention [29-31]. In these references, the optimization algorithm is interpreted as a feedback interconnection in which the states converge to the optimal solution of the optimization problem. Many modern large-scale and distributed optimization problems can be cast into a form in which the objective function is a sum of a smooth term and a nonsmooth regularizer. Such problems can be solved via a proximal gradient method, which generalizes standard gradient descent to a nonsmooth setup. We leverage tools from control theory to study global convergence of proximal gradient flow algorithms in continuous time. We also provide a distributed implementation of the gradient flow dynamics based on the proximal augmented Lagrangian and prove global exponential stability for strongly convex problems.

In this dissertation, we combine tools and ideas from control theory, distributed optimization, reinforcement learning, and compressive sensing to develop distributed estimation and control strategies that require limited information exchange between the individual subsystems. The remainder of this introductory chapter is organized as follows. In Section 1.1, we overview the different topics of the dissertation and provide a brief discussion of each of them. The outline of the dissertation is provided in Section 1.2. Finally, we provide a summary of our main results and contributions in Section 1.3.

1.1 Main topics of the dissertation

In this section, we discuss the main topics of the dissertation.

1.1.1 Distributed design and control of consensus networks

Reaching agreement using relative information exchange in a decentralized fashion has received considerable attention. In large networks, centralized implementation of control policies imposes a heavy communication and computation burden on individual nodes. This motivates the development of distributed control strategies that require limited information exchange between the nodes in order to reach consensus or achieve synchronization [5,7-9]. Optimal design of the edge weights for networks with pre-specified topology has received significant attention. In [3], the design of the fastest averaging protocol for undirected networks was cast as a semidefinite program (SDP). Two customized algorithms, based on primal barrier interior-point (IP) and subgradient methods, were developed and the advantages of optimal weight selection over commonly used heuristics were demonstrated. A similar SDP characterization, for networks with state-dependent graph Laplacians, was provided in [4]. The allocation of symmetric edge weights that minimize the mean-square deviation from average for networks with additive stochastic disturbances was solved in [5]. A related problem, aimed at minimizing the total effective resistance of resistive networks, was addressed in [7]. In [8], the edge Laplacian was used to provide a graph-theoretic characterization of the $\mathcal{H}_2$ and $\mathcal{H}_\infty$ symmetric agreement protocols.

Network coherence quantifies the ability of distributed estimation and control strategies to guard against exogenous disturbances [6,10].
The coherence is determined by the sum of reciprocals of the non-zero eigenvalues of the graph Laplacian, and its scaling properties cannot be predicted by the algebraic connectivity of the network. In [10], performance limitations of spatially-localized consensus protocols on regular lattices were examined. It was shown that the fundamental limitations for large-scale networks are dictated by the network topology rather than by the optimal selection of the edge weights. Moreover, epidemic spread in networks is strongly influenced by their topology [21,22,32]. Thus, optimal topology design represents an important challenge. It is precisely this problem, for undirected consensus networks, that we address here.

More specifically, we study an optimal control problem aimed at achieving a desired tradeoff between the network performance and communication requirements in the distributed controller. Our goal is to add a certain number of edges to a given undirected network in order to optimally enhance the closed-loop performance. One of our key contributions is the formulation of topology design as an optimal control problem that admits a convex characterization and is amenable to the development of efficient optimization algorithms. In our formulation, the plant network can contain disconnected components and the optimal topology of the controller network is an integral part of the design. In general, this problem is NP-hard [33] and it amounts to an intractable combinatorial search. Several references have examined convex relaxations or greedy algorithms to design topology that optimizes algebraic connectivity [34] or network coherence [9,35-37].

In this dissertation, we tap on recent developments regarding sparse representations in conjunction with regularization penalties on the level of communication in a distributed controller. Similar approaches have been used for solving the problem of optimal sensor/actuator selection in large-scale dynamical systems [38-40]. This allows us to formulate convex optimization problems that exploit the underlying structure and are amenable to the development of efficient optimization algorithms. To avoid combinatorial complexity, we approach optimal topology design using a sparsity-promoting optimal control framework introduced in [41,42]. Performance is captured by the $\mathcal{H}_2$ norm of the closed-loop network and $\ell_1$-regularization is introduced to promote controller sparsity. While this problem is in general nonconvex [42], for undirected networks we show that it admits a convex characterization with a non-differentiable objective function and a positive definite constraint. This problem can be transformed into an SDP and, for small networks, the optimal solution can be computed using standard IP method solvers, e.g., SeDuMi [43] and SDPT3 [44].

Moreover, we study the problem of performance enhancement in stochastically forced directed consensus networks by adding edges to an existing topology. A significant amount of research has been devoted to the study of the consensus problem in networks where both the plant and the controller graphs are undirected. We consider the problem of adding edges to a weakly connected directed consensus network in order to improve performance. In particular, we are interested in designing sparse communication graphs that strike a balance between the variance amplification of the closed-loop system and the number of communication links. In general, this is a combinatorial search problem and is non-convex.
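As a concrete illustration of this performance measure, the sketch below (our example, not code from the dissertation; the ring network and the numerical tolerance are arbitrary choices) evaluates coherence as the sum of reciprocals of the non-zero eigenvalues of a graph Laplacian.

```python
import numpy as np

def coherence(L, tol=1e-9):
    """Network coherence: sum of reciprocals of the non-zero
    eigenvalues of a symmetric graph Laplacian L."""
    eigs = np.linalg.eigvalsh(L)      # real, ascending (L is symmetric)
    return np.sum(1.0 / eigs[eigs > tol])

# Example: unweighted ring with n nodes.
n = 20
A = np.zeros((n, n))
for i in range(n):
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1.0
L = np.diag(A.sum(axis=1)) - A
print(coherence(L))
```

Evaluating this for increasing n shows directly how coherence scales with network size for a given topology, which is the kind of scaling analysis referred to above.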
1.1.2 Network estimation and inference

Identifying network topology and learning graphical models from partially available statistical signatures are topics of immense interest in areas ranging from machine learning to statistics to neuroscience [45-52]. Moreover, the structured covariance completion problem has received significant attention in recent years [53-55]. In these works, the unknown elements of a partially known covariance matrix are estimated such that they are consistent with the dynamics of the underlying system. Studying the human brain as a complex network has also received significant attention recently [56-58]. Brain functional connectivity can be measured by computing correlations between time-series functional magnetic resonance imaging (FMRI) data. The functional connectivity structure between different regions can be revealed by utilizing different thresholding techniques [59,60]. In general, this is a challenging problem because it is often the case that only noisy partial network statistics are known. The goal is to develop an efficient algorithm for recovering the underlying topology of a network utilizing the limited sample data.

Recovering the underlying network topology using the sample covariance matrix of the node values under structural constraints has been studied in [27,61,62]. Moreover, a rich body of literature has been devoted to the problems of designing network topology to improve performance [33,34,36,37,63]. Several algorithms can be employed to identify the underlying network structure from limited statistical data. In [64], the authors show the inability of standard graphical-LASSO to identify network topology. It was demonstrated that this popular algorithm fails to recover the underlying topology even when abundant data is available.

Over the last decade, a rich body of literature has been devoted to the problems of designing network topology to improve performance [9,34-37,63,65,66] and identifying an unknown network topology from available data [45-51,67]. Moreover, the problem of sparsifying a network with dense interconnection structure in order to optimize a specific performance measure has been studied in [68,69]. In [70], it was demonstrated that individual starlings within a large flock interact only with a limited number of immediate neighbors. The authors have shown that networks with six or seven nearest neighbors provide an optimal trade-off between flock cohesion and the effort of individual birds. These observations were made by examining how robustness of the group depends on the network structure. The proposed measure of robustness captures the network coherence [10] and is well-suited for developing a tractable optimization framework.

We develop a convex optimization framework for identifying sparse interaction topology using the sample covariance matrix of the state of the network. Our framework utilizes an $\ell_1$-regularized Gaussian maximum likelihood estimator. We also develop a method based on growing a Chow-Liu tree that is well-suited for identifying the underlying structure of large-scale systems. We apply this technique to resting-state functional MRI (FMRI) data as well as synthetic datasets to illustrate the effectiveness of the proposed approach.
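To make the Chow-Liu idea concrete, the sketch below (our illustration, assuming jointly Gaussian node states; the function name and the synthetic data are ours, and this is the classical tree construction rather than the dissertation's extension of it) computes pairwise mutual information from sample correlations and extracts a maximum-weight spanning tree, which is the Chow-Liu tree.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

def chow_liu_tree(samples):
    """Chow-Liu tree from samples (rows: observations, columns: nodes).

    For jointly Gaussian data the pairwise mutual information is
    I(i, j) = -0.5 * log(1 - rho_ij^2), so the Chow-Liu tree is the
    maximum-weight spanning tree of the mutual-information graph.
    """
    R = np.corrcoef(samples, rowvar=False)
    MI = -0.5 * np.log(1.0 - np.clip(R, -0.999999, 0.999999) ** 2)
    np.fill_diagonal(MI, 0.0)
    # Max-weight spanning tree == minimum spanning tree of negated weights.
    T = minimum_spanning_tree(-MI).tocoo()
    return sorted((min(i, j), max(i, j)) for i, j in zip(T.row, T.col))

# Example: a 10-node chain of correlations, estimated from 1000 samples.
rng = np.random.default_rng(0)
x = rng.standard_normal((1000, 10)).cumsum(axis=1)   # correlated columns
print(chow_liu_tree(x))   # typically recovers the chain (0,1), (1,2), ..., (8,9)
```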
1.1.3 Control theoretic approach for analysis of optimization algorithms

Structured optimal control and inverse problems, which arise when trying to identify and control dynamical representations of systems evolving in real-time, typically lead to optimization of objective functions that consist of a sum of a smooth term and a nonsmooth regularizer. Such problems are of increasing importance in control [41,42,71] and it is thus necessary to develop efficient algorithms for distributed and embedded nonsmooth composite optimization [72-75]. The lack of differentiability in the objective function precludes the use of standard descent methods from smooth optimization. The proximal gradient method [76] generalizes gradient descent to the nonsmooth context and provides a powerful tool for solving problems in which the nonsmooth term is separable over the optimization variable.

Examining optimization algorithms as continuous-time dynamical systems has been an active topic since the seminal work of Arrow, Hurwicz, and Uzawa [77]. This viewpoint can provide important insight into the performance of optimization algorithms and, during the last decade, it has been advanced and extended to a broad class of problems including convergence analysis of primal-dual [73,78-82] and accelerated [83-88] first-order methods. This should be compared and contrasted with existing techniques which employ subgradient methods and discontinuous projected dynamics [89-91]. Furthermore, establishing the connection between the theory of ordinary differential equations (ODEs) and numerical optimization algorithms has been a topic of many studies, including [92,93]; for recent efforts, see [84,94].

Most algorithms can be viewed as a feedback interconnection of linear dynamical systems with nonlinearities that possess certain structural properties. This system-theoretic interpretation was exploited in [31] and further advanced in a number of recent papers [81,95-101]. The key idea is to exploit structural features of linear and nonlinear terms and utilize theory and techniques from stability analysis of nonlinear dynamical systems to study properties of optimization algorithms. This approach provides new methods for studying not only the convergence rate but also the robustness of optimization routines [102-105] and can lead to new classes of algorithms that strike a desired tradeoff between speed and robustness.

We study a class of nonsmooth composite convex optimization problems in which the objective is a sum of a differentiable function and a possibly nondifferentiable regularizer. These problems emerge in compressive sensing, machine learning, and control. For example, structured feedback design can be cast as a nonsmooth composite optimization problem [42,71,106]. Standard descent methods cannot be used in the presence of a nondifferentiable component. Proximal gradient algorithms [76,107] offer viable alternatives to generic descent methods for solving nonsmooth problems. Another effective strategy is to transform the associated augmented Lagrangian into the continuously differentiable proximal augmented Lagrangian [108], in which the former is restricted to the manifold that corresponds to the explicit minimization over the variable in the nonsmooth term.

1.1.4 Distributed in-network optimization

Distributed algorithms are critically important for solving large-scale optimization problems. The decentralized consensus problem in multi-agent networks [72,73,109,110] arises in many applications; its simplest instance is sketched below.
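In the distributed averaging iteration, each node updates its value using only differences with its neighbors. The following is a minimal sketch (our code; the ring graph and step size are arbitrary illustrative choices).

```python
import numpy as np

# Distributed averaging: x(k+1) = x(k) - eps * L @ x(k) converges to the
# network average when the graph is connected and 0 < eps < 2/lambda_max(L).
n = 6
A = np.zeros((n, n))
for i, j in [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]:   # a ring
    A[i, j] = A[j, i] = 1.0
L = np.diag(A.sum(axis=1)) - A

eps = 1.0 / np.linalg.eigvalsh(L)[-1]          # safe step size
x = np.random.default_rng(1).standard_normal(n)
avg = x.mean()
for _ in range(200):
    x = x - eps * (L @ x)                      # uses only neighbor differences
print(np.allclose(x, avg))                     # True: consensus on the average
```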
Moreover, distributed control techniques are critically important in the design of large-scale systems. In these systems, conventional control strategies that rely on centralized computation and implementation are often prohibitively expensive. For example, finding the optimal controller requires computing the solution to algebraic Riccati equations, which is often infeasible because of high computational requirements. This necessitates the development of theory and techniques that utilize distributed computing architectures to cope with large problem sizes.

We study a class of nonsmooth composite optimization problems in which the convex objective function is a sum of differentiable and nondifferentiable functions. Among other applications, these problems emerge in machine learning, compressive sensing, and control. Recently, regularization has been used as a promising tool for enhancing the utility of standard optimal control techniques. Generic descent methods cannot be used in nonsmooth composite optimization problems due to the presence of a nondifferentiable component in the objective function. Moreover, these standard methods are not well-suited for distributed implementation. An alternative approach is to separate the smooth and nonsmooth parts of the objective function and use the alternating direction method of multipliers (ADMM). In [111], we exploit separability of the objective function and utilize an ADMM-based consensus algorithm to solve the regularized optimal control problem in a distributed manner over multiple processors. Even though the optimal control problem is in general non-convex, recent results can be utilized to show convergence to a local minimum [112]. However, in an update step of the ADMM algorithm, all the processors halt to compute the weighted average (the gathering step) [113].

By introducing auxiliary variables in nondifferentiable terms, we provide an equivalent consensus-based characterization that is convenient for distributed implementation. The Moreau envelope associated with the nonsmooth part of the objective function is used to bring the optimization problem into a continuously differentiable form that serves as a basis for the development of a primal-descent dual-ascent gradient flow method. This algorithm exploits separability of the objective function and is well-suited for in-network optimization. We prove global asymptotic stability of the proposed algorithm. Moreover, we study primal-descent dual-ascent gradient flow dynamics based on the proximal augmented Lagrangian [108] that can be used as an effective algorithm for solving smooth optimization problems with a separable objective function. We provide a distributed implementation and prove global exponential stability in the presence of strong convexity.

1.1.5 Data-driven and model-free distributed control

One of the major challenges in the analysis and design of large-scale dynamical systems is the development of fast and scalable methods. Such systems involve large-scale interconnections of components, have rapidly-evolving structure and limitations on communication/processing power, and require real-time distributed control actions. These requirements make control strategies that rely on centralized information processing infeasible and motivate new classes of optimal control problems.
In these, standard performance metrics are augmented with typically nonsmooth regularizers to promote desired structural features (e.g., low communication requirements) in the optimal controller [41,42,71,114,115]. Moreover, in many applications, the dynamics of the plant are unknown and only limited input-output measurements are available. Designing optimal controllers for these systems is challenging even in the absence of structural constraints. The LQR problem in the model-free setup has been recently studied in [116,117].

Modern control applications impose additional requirements on controller design that cannot be addressed using standard optimal control tools. These requirements may arise from limited communication and computation resources or from the size of the problem. Standard optimal control techniques typically induce an all-to-all communication requirement in the controller, which is infeasible in the large-scale setting. Recently, regularization has emerged as a promising tool for enhancing the utility of standard optimal control techniques. In this approach, commonly used performance measures (e.g., $\mathcal{H}_2$ or $\mathcal{H}_\infty$) are augmented with regularization functions that are supposed to promote some desired structural features in the distributed controller, e.g., sparsity. Such an approach has received a significant level of attention in recent years [41,42,71,106,115,118,119], but computing optimal solutions in large-scale problems still remains a challenge. Distributed computing techniques have been commonly used to cope with large problem sizes; for example, stability and synthesis of cooperative distributed model predictive controllers for linear systems have been recently studied in [120].

Herein, we address complexity through a combination of linear-algebraic techniques and computational methods adapted from both machine learning and reinforcement learning. We study the problem of designing optimal structured feedback gains for large-scale systems with unknown dynamics in a distributed manner. We quantify performance using the $\mathcal{H}_2$ norm and introduce regularization functions to promote desired structural properties in the controller. The key challenge is to evaluate the objective function and its gradient without solving the large-scale Lyapunov equations. By exploiting the square-additive property of the $\mathcal{H}_2$ norm, we propose customized proximal and ADMM-based algorithms that are convenient for distributed implementation.

1.2 Dissertation structure

This dissertation consists of four parts. Each part focuses on a specific topic and includes individual chapters that study relevant subjects. In each chapter, we provide a brief introduction, background and motivation, problem formulation, algorithm and design steps, computational experiments, and a conclusion.

Part I. Traditional control theory is concerned with implementing control policies in a centralized fashion. However, due to high communication and computation costs, this is not practically feasible for large-scale networks of dynamical systems. Leveraging tools from compressive sensing and distributed optimization, we have developed a theoretical and computational framework for designing distributed controllers for large-scale systems [63,66,106,119]. In these, standard performance metrics are augmented with typically nonsmooth regularizers to promote desired structural features (e.g., low communication requirements) in the optimal controller.
These algorithms scale gracefully with problem size and are well-suited for distributed implementation over multiple processors.

Part II. Identifying network topology and learning graphical models from partially available statistical signatures are topics of immense interest in areas ranging from machine learning to statistics to neuroscience. A motivating application in neuroscience is reconstructing the pattern of causal interactions between distinct units of the brain using time-series data of neural activities. For undirected consensus networks, the topology identification problem can be posed as sparse inverse covariance estimation with additional structural constraints. We have developed an algorithm to identify simple models of large consensus networks from a known sample covariance matrix [27]. Moreover, we have developed an inference algorithm based on growing a Chow-Liu tree which is well-suited for identifying the underlying structure of large-scale dynamical systems [28].

Part III. Analysis of optimization algorithms from the system-theoretic point of view has received significant recent attention [29-31]. In these references, the optimization algorithm is interpreted as a feedback interconnection in which the states converge to the optimal solution of the optimization problem. Moreover, centralized computations over large-scale systems are often impractical or very expensive. In this part, we propose distributed computational algorithms based on the alternating direction method of multipliers and the proximal augmented Lagrangian for solving non-smooth composite optimization problems over networks, and we provide global convergence analysis using control-theoretic techniques [98,99,121].

Part IV. Distributed feedback design and complexity-constrained control are examples of problems posed within the domain of structured optimal feedback synthesis. The optimal feedback gain is typically a non-convex function of system primitives. However, in recent years, algorithms have been proposed to obtain locally optimal solutions. In applications to large-scale distributed control, the major obstacle is computational complexity. We address complexity through a combination of linear-algebraic techniques and computational methods adapted from both machine learning and reinforcement learning. Moreover, in many applications, the dynamics of the plant are unknown and only limited input-output measurements are available. Designing optimal controllers for these systems is challenging even in the absence of structural constraints. We have shown that for general classes of optimal control problems, the objective function and its gradient can be computed from data [111]. By employing incremental and stochastic gradient and ADMM-based algorithms, we have developed a data-driven framework for the structured optimal control problem that does not require knowledge of the system parameters and avoids the need to solve large-scale matrix equations [122].

1.3 Contributions of the dissertation

In this section, the structure of the dissertation is provided along with the main contributions of each part.

Part I

Topology design for stochastically forced undirected consensus networks.

We have examined the problem of optimal topology design and selection of the corresponding edge weights for undirected consensus networks. Our approach uses convex optimization to balance the performance of stochastically forced networks with the number of edges in the distributed controller.
For the $\ell_1$-regularized minimum variance optimal control problem, we have derived a Lagrange dual and exploited the structure of the optimality conditions for undirected networks to develop customized algorithms that are well-suited for large problems. These are based on the proximal gradient and the proximal Newton methods. The proximal gradient algorithm is a first-order method that updates the controller graph Laplacian via the use of the soft-thresholding operator. In the proximal Newton method, a sequential quadratic approximation of the smooth part of the objective function is employed and the Newton direction is computed using cyclic coordinate descent over the set of active variables. Examples are provided to demonstrate the utility of our algorithms. We have shown that proximal algorithms can solve problems with millions of edges in the controller graph in several minutes, on a PC. Furthermore, we have specialized our algorithm to the problem of growing connected resistive networks. In this case, the plant graph is connected and there are no joint edges between the plant and the controller graphs. We have exploited the structure of such networks and demonstrated how additional edges can be systematically added in a computationally efficient manner.

Edge addition in directed consensus networks.

We consider the $\ell_1$-regularized version of the optimal control problem for adding edges to directed consensus networks in order to reach consensus and optimally enhance performance. Although the given plant network is not necessarily balanced, in order to reach agreement, we restrict the closed-loop graph Laplacian to be balanced. The performance is measured by the $\mathcal{H}_2$ norm from the disturbance to the output of the closed-loop network. In general, this is a combinatorial search problem. We use the sparsity-promoting optimal control framework and introduce weighted $\ell_1$-regularization as a proxy for promoting sparsity of the controller. By exploiting the structure of the problem, we develop an algorithm based on ADMM.

Part II

Topology identification of undirected consensus networks via sparse inverse covariance estimation.

We have developed a method for identifying the topology of an undirected consensus network using available statistical data. In order to promote network sparsity, we introduce a convex optimization framework aimed at finding the solution to the $\ell_1$-regularized maximum likelihood problem. This problem is closely related to the problem of sparse inverse covariance estimation, which has received significant attention in the literature. In our setup, additional structure arises from the requirement that the data is generated by an undirected consensus network. By exploiting the structure of the problem, we develop an efficient algorithm based on the sequential quadratic approximation method in which the search direction is determined using coordinate descent with an active set strategy. Several examples have been provided to illustrate the utility of the method and the efficiency of the customized algorithm.

Topology identification via growing a Chow-Liu tree network.

We have studied the problem of sparse topology identification of an undirected consensus network with leaders using second-order statistical data. The goal is to identify a sparse interaction topology using the sample covariance matrix of the network state. We have introduced two algorithms based on regularized Gaussian maximum likelihood and growing a Chow-Liu tree.
In the first algorithm, we propose a structured graphical-LASSO algorithm that uses the weighted $\ell_1$ regularizer as a proxy for inducing sparse network topology. The other method is based on growing a Chow-Liu tree and is well-suited for identifying the underlying structure of large-scale networks. Several examples have been provided to demonstrate the performance of our framework.

Part III

Global exponential stability via integral quadratic constraints for proximal gradient flow and Douglas-Rachford splitting dynamics.

We study a class of nonsmooth optimization problems in which it is desired to minimize the sum of a continuously differentiable function with a Lipschitz continuous gradient and a nondifferentiable function. For strongly convex problems, we employ the theory of integral quadratic constraints to prove global exponential stability of proximal gradient flow and Douglas-Rachford splitting dynamics. We also propose a generalization of the Polyak-Lojasiewicz condition to nonsmooth problems and demonstrate global exponential convergence of the forward-backward envelope for the proximal gradient flow algorithm even in the absence of strong convexity.

Distributed in-network optimization.

We have studied a class of convex nonsmooth composite optimization problems in which the objective function is a combination of differentiable and nondifferentiable functions. By exploiting the structure of the problem, we have provided an equivalent consensus-based characterization and have developed an algorithm based on the primal-descent dual-ascent gradient flow method. This algorithm exploits the separability of the objective function and is well-suited for distributed implementation. Convexity of the smooth part of the objective function is utilized to prove global asymptotic stability of our algorithm. Finally, by exploiting the structure of the $\mathcal{H}_2$ norm, we have employed this algorithm to design a sparse controller network that improves the performance of the closed-loop system in a large-scale undirected consensus network in a distributed manner. An example is provided to demonstrate the utility of the developed approach. We are currently working on implementing this algorithm in C++ and will use it to solve structured optimal control problems for large-scale systems in a distributed manner.

Part IV

Data-driven proximal algorithms for the design of structured optimal feedback gains.

We have considered the problem of designing optimal structured feedback controllers for large-scale systems. We have shown that for general classes of optimal control problems, the objective function and its gradient can be computed from data. Transformations borrowed from the theory of reinforcement learning are adapted to obtain simulation-based algorithms for computing the structured optimal $\mathcal{H}_2$ feedback gain. Customized proximal algorithms based on gradient descent and incremental gradient are tested in computational experiments and their relative merits are discussed. Moreover, by exploiting the structure of the $\mathcal{H}_2$ norm, we have shown that this problem can be separated into N subproblems and that it can be solved efficiently using distributed optimization. We have utilized an ADMM-based consensus algorithm to design a sparse controller network that improves the performance of the closed-loop system. By splitting the problem into N separate subproblems over N different cores, we have implemented this algorithm in C++.
Our parallel implementation can be used to solve structured optimal control problems for large-scale systems, e.g., power networks [16,18]. Computational experiments are provided to demonstrate the utility of the developed approach.

Part I
Distributed design and control of networks

Chapter 2
Topology design for stochastically forced undirected consensus networks

In this chapter, we study an optimal control problem aimed at adding a certain number of edges to an undirected network, with a known graph Laplacian, in order to optimally enhance closed-loop performance. The performance is quantified by the steady-state variance amplification of the network with additive stochastic disturbances. To promote controller sparsity, we introduce $\ell_1$-regularization into the optimal $\mathcal{H}_2$ formulation and cast the design problem as a semidefinite program. We derive a Lagrange dual, provide an interpretation of the dual variables, and exploit the structure of the optimality conditions for undirected networks to develop customized proximal gradient and Newton algorithms that are well-suited for large problems. We illustrate that our algorithms can solve problems with more than a million edges in the controller graph in a few minutes, on a PC. We also exploit the structure of connected resistive networks to demonstrate how additional edges can be systematically added in order to minimize the $\mathcal{H}_2$ norm of the closed-loop system.

2.1 Introduction

Herein, we study an optimal control problem aimed at achieving a desired tradeoff between the network performance and communication requirements in the distributed controller. Our goal is to add a certain number of edges to a given undirected network in order to optimally enhance the closed-loop performance. One of our key contributions is the formulation of topology design as an optimal control problem that admits a convex characterization and is amenable to the development of efficient optimization algorithms. In our formulation, the plant network can contain disconnected components and the optimal topology of the controller network is an integral part of the design. In general, this problem is NP-hard [33] and it amounts to an intractable combinatorial search. Several references have examined convex relaxations or greedy algorithms to design topology that optimizes algebraic connectivity [34] or network coherence [9,35-37].

To enable the design of large networks, we pay particular attention to the computational aspects of the edge-addition problem. We derive a Lagrange dual of the optimal control problem, provide an interpretation of the dual variables, and develop efficient proximal algorithms. Furthermore, building on preliminary work [63], we specialize our algorithms to the problem of growing connected resistive networks described in [7,34]. In this case, the plant graph is connected and the inequality constraints amount to non-negativity of the controller edge weights. This allows us to simplify the optimality conditions and further improve the computational efficiency of our customized algorithms.

Proximal gradient algorithms [76] and their accelerated variants [107] have recently found use in distributed optimization, statistics, machine learning, and image and signal processing. They can be interpreted as a generalization of standard gradient projection to problems with non-smooth and extended-real-value objective functions. When the proximal operator is easy to evaluate, these algorithms are simple yet extremely efficient.
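For the prototypical $\ell_1$-regularized least-squares problem, the proximal operator of the $\ell_1$ norm is the elementwise soft-thresholding map, and the resulting proximal gradient iteration is ISTA. The sketch below is a generic illustration of this mechanism (our code, with arbitrary problem data; it is not the customized algorithm developed in this chapter).

```python
import numpy as np

def soft_threshold(v, kappa):
    """Proximal operator of kappa * ||.||_1 (elementwise soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - kappa, 0.0)

def ista(grad_f, x0, step, gamma, iters=500):
    """Proximal gradient (ISTA) for: minimize f(x) + gamma * ||x||_1."""
    x = x0.copy()
    for _ in range(iters):
        x = soft_threshold(x - step * grad_f(x), step * gamma)
    return x

# Example: f(x) = 0.5 * ||A x - b||^2.
rng = np.random.default_rng(0)
A, b = rng.standard_normal((30, 10)), rng.standard_normal(30)
grad_f = lambda x: A.T @ (A @ x - b)
step = 1.0 / np.linalg.norm(A, 2) ** 2   # 1/L, L: Lipschitz constant of grad_f
x_star = ista(grad_f, np.zeros(10), step, gamma=2.0)
print(x_star)                            # many entries are exactly zero
```

In the topology design problem considered here, an additional safeguard on the step size is needed to keep the closed-loop graph Laplacian well-behaved, as discussed next.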
For networks that can contain disconnected components and non-positive edge weights, we show that the proximal gradient algorithm iteratively updates the controller graph Laplacian via convenient use of the soft-thresholding operator. This extends the Iterative Shrinkage-Thresholding Algorithm (ISTA) to optimal topology design of undirected networks. In contrast to $\ell_1$-regularized least-squares, however, the step-size has to be selected to guarantee positivity of the second smallest eigenvalue of the closed-loop graph Laplacian. We combine Barzilai-Borwein (BB) step-size initialization with backtracking to achieve this goal and enhance the rate of convergence. The biggest computational challenge comes from the evaluation of the objective function and its gradient. We exploit problem structure to speed up computations and save memory. Finally, for the problem of growing connected resistive networks, the proximal algorithm simplifies to gradient projection, which additionally improves efficiency.

We also develop a customized algorithm based on the proximal Newton method. In contrast to the proximal gradient, this method sequentially employs the second-order Taylor series approximation of the smooth part of the objective function; e.g., see [123]. We use cyclic coordinate descent over the set of active variables to efficiently compute the Newton direction by consecutive minimization with respect to individual coordinates. A similar approach has recently been utilized in a number of applications, including sparse inverse covariance estimation in graphical models [124].

Both of our customized proximal algorithms significantly outperform a primal-dual interior-point (IP) method developed in [63]. It is worth noting that the latter is significantly faster than general-purpose solvers. While the customized IP algorithm of [63] with a simple diagonal preconditioner can solve problems with hundreds of thousands of edges in the controller graph in several hours, on a PC, the customized algorithms based on proximal gradient and Newton methods can solve problems with millions of edges in several minutes. Furthermore, they are considerably faster than the greedy algorithm with efficient rank-one updates developed in [37].

Our presentation is organized as follows. In Section 2.2, we formulate the problem of optimal topology design for undirected networks subject to additive stochastic disturbances. In Section 2.3, we derive a Lagrange dual of the sparsity-promoting optimal control problem, provide an interpretation of dual variables, and construct dual feasible variables from the primal ones. In Section 2.4, we develop customized algorithms based on the proximal gradient and Newton methods. In Section 2.5, we achieve additional speedup by specializing our algorithms to the problem of growing connected resistive networks. In Section 2.6, we use computational experiments to design the optimal topology of a controller graph for benchmark problems and demonstrate the efficiency of our algorithms. In Section 2.7, we provide a brief overview of the chapter.

2.2 Problem formulation

We consider undirected consensus networks with $n$ nodes,

$\dot{\psi} = -L_p\,\psi + u + d$   (2.1)

where $d$ and $u$ are the exogenous disturbance and the control input, respectively, $\psi$ is the state of the network, and $L_p$ is a symmetric $n \times n$ matrix that represents the graph Laplacian of the open-loop system, i.e., the plant. Such networks arise in applications ranging from load balancing to power systems to opinion formation to control of multi-agent systems.
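As an illustration of the dynamics (2.1), the following Matlab sketch simulates the stochastically forced open-loop network with $u = 0$ via an Euler-Maruyama discretization; the path-graph Laplacian used here is an arbitrary illustrative choice, not one of the benchmark networks studied later:

% Simulate (2.1) with u = 0 under white-in-time disturbances.
n  = 10;  dt = 1e-2;  T = 1e4;
Ap = diag(ones(n-1,1),1) + diag(ones(n-1,1),-1);  % path-graph adjacency
Lp = diag(sum(Ap,2)) - Ap;                        % plant graph Laplacian
psi = zeros(n,1);
for k = 1:T
    psi = psi - dt*Lp*psi + sqrt(dt)*randn(n,1);  % Euler-Maruyama step
end
% The network average (1/n)*sum(psi) performs a random walk.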
The goal is to improve the performance of a consensus algorithm in the presence of stochastic disturbances by adding a certain number of edges (from a given set of candidate edges). We formulate this problem as a feedback design problem with

$u = -L_x\,\psi$

where the symmetric feedback-gain matrix $L_x$ is required to have the Laplacian structure. This implies that each node in (2.1) forms its control action using a weighted sum of the differences between its own state and the states of other nodes, and that information is processed in a symmetric fashion. Since a nonzero $ij$th element of $L_x$ corresponds to an edge between nodes $i$ and $j$, the communication structure in the controller graph is determined by the sparsity pattern of the matrix $L_x$. Upon closing the loop we obtain

$\dot{\psi} = -(L_p + L_x)\,\psi + d.$   (2.2a)

For a given $L_p$, our objective is to design the topology of $L_x$ and the corresponding edge weights $x$ in order to achieve the desired tradeoff between controller sparsity and network performance. The performance is quantified by the steady-state variance amplification of the stochastically-forced network, from the white-in-time input $d$ to the performance output $\zeta$,

$\zeta := \begin{bmatrix} Q^{1/2} \\ 0 \end{bmatrix}\psi + \begin{bmatrix} 0 \\ R^{1/2} \end{bmatrix} u = \begin{bmatrix} Q^{1/2} \\ -R^{1/2} L_x \end{bmatrix}\psi$   (2.2b)

which penalizes the deviation from consensus and the control effort. Here, $Q = Q^T \succeq 0$ and $R = R^T \succ 0$ are the state and control weights in the standard quadratic performance index.

The interesting features of this problem come from structural restrictions on the Laplacian matrices $L_p$ and $L_x$. Both of them are symmetric and are restricted to having an eigenvalue at zero with the corresponding eigenvector of all ones,

$L_p \mathbf{1} = 0, \quad L_x \mathbf{1} = 0.$   (2.3)

Since each node uses relative information exchange with its neighbors to update its state, in the presence of white noise the average mode $\bar{\psi}(t) := (1/n)\mathbf{1}^T\psi(t)$ experiences a random walk and its variance increases linearly with time. To make the average mode unobservable from the performance output $\zeta$, the matrix $Q$ is also restricted to having an eigenvalue at zero associated with the vector of all ones, $Q\mathbf{1} = 0$. Furthermore, to guarantee observability of the remaining eigenvalues of $L_p$, we consider state weights that are positive definite on the orthogonal complement of the subspace spanned by the vector of all ones, $Q + (1/n)\mathbf{1}\mathbf{1}^T \succ 0$; e.g., $Q = I - (1/n)\mathbf{1}\mathbf{1}^T$ penalizes the mean-square deviation from the network average.

In what follows, we express $L_x$ as

$L_x := \sum_{l=1}^{m} x_l\, \xi_l\, \xi_l^T = E\,\mathrm{diag}\,(x)\,E^T$   (2.4)

where $E$ is the incidence matrix of the controller graph $L_x$, $m$ is the number of edges in $L_x$, and $\mathrm{diag}\,(x)$ is a diagonal matrix containing the vector of the edge weights $x \in \mathbb{R}^m$. The matrix $E$ is given and it determines the set of candidate edges in the controller network. This set can contain all possible edges in the network, or it can include only edges that are not in the plant network. Many other options are possible as long as the union of the sets of edges in the plant and controller networks yields a connected graph. We note that the size of the set of candidate edges in the controller network influences the computational complexity of our algorithms. It is desired to select a subset of edges in order to balance the closed-loop performance with the number of added edges. The vectors $\xi_l \in \mathbb{R}^n$ determine the columns of $E$ and signify the connection with weight $x_l$ between nodes $i$ and $j$: the $i$th and $j$th entries of $\xi_l$ are $1$ and $-1$, and all other entries are equal to $0$. Thus, $L_x$ given by (2.4) satisfies the structural requirements on the controller graph Laplacian in (2.3) by construction.
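The parameterization (2.4) is straightforward to form numerically. In the following Matlab sketch the candidate edge list is an arbitrary illustration; the final check verifies the structural property $L_x\mathbf{1} = 0$ from (2.3):

% Build Lx = E*diag(x)*E' from a candidate edge list.
n = 5;
edges = [1 2; 2 3; 3 4; 1 4; 2 5];   % candidate edges (i,j), one per row
m = size(edges,1);
E = zeros(n,m);
for l = 1:m
    E(edges(l,1),l) =  1;            % xi_l has entry +1 at node i
    E(edges(l,2),l) = -1;            % and entry -1 at node j
end
x  = rand(m,1);                      % edge weights
Lx = E*diag(x)*E';
assert(norm(Lx*ones(n,1)) < 1e-10)   % Laplacian structure (2.3) by construction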
To achieve consensus in the absence of disturbances, the closed-loop network has to be connected [2]. Equivalently, the second smallest eigenvalue of the closed-loop graph Laplacian, $L := L_p + L_x$, has to be positive, i.e., $L$ has to be positive definite on $\mathbf{1}^{\perp}$. This amounts to positive definiteness of the "strengthened" graph Laplacian of the closed-loop network,

$G := L_p + L_x + (1/n)\mathbf{1}\mathbf{1}^T = G_p + E\,\mathrm{diag}\,(x)\,E^T \succ 0$   (2.5a)

where

$G_p := L_p + (1/n)\mathbf{1}\mathbf{1}^T.$   (2.5b)

Structural restrictions (2.3) on the Laplacian matrices introduce an additional constraint on the matrix $G$,

$G\,\mathbf{1} = \mathbf{1}.$   (2.5c)

2.2.1 Design of optimal sparse topology

Let $d$ be a white stochastic disturbance with zero mean and unit variance,

$\mathbb{E}\,(d(t)) = 0, \quad \mathbb{E}\,\big(d(t_1)\,d^T(t_2)\big) = I\,\delta(t_1 - t_2)$

where $\mathbb{E}$ is the expectation operator. The square of the $\mathcal{H}_2$ norm of the transfer function from $d$ to $\zeta$,

$\|H\|_2^2 = \lim_{t\to\infty} \mathbb{E}\,\big(\psi^T(t)\,(Q + L_x R L_x)\,\psi(t)\big)$

quantifies the steady-state variance amplification of the closed-loop system (2.2). As noted earlier, the network average $\bar{\psi}(t)$ corresponds to the zero eigenvalue of the graph Laplacian and is not observable from the performance output $\zeta$. Thus, the $\mathcal{H}_2$ norm is equivalently given by

$\|H\|_2^2 = \lim_{t\to\infty} \mathbb{E}\,\big(\tilde{\psi}^T(t)\,(Q + L_x R L_x)\,\tilde{\psi}(t)\big) = \mathrm{trace}\,\big(P\,(Q + L_x R L_x)\big) = \langle P,\; Q + L_x R L_x\rangle$

where $\tilde{\psi}(t)$ is the vector of deviations of the states of individual nodes from $\bar{\psi}(t)$,

$\tilde{\psi}(t) := \psi(t) - \mathbf{1}\,\bar{\psi}(t) = \big(I - (1/n)\mathbf{1}\mathbf{1}^T\big)\,\psi(t)$

and $P$ is the steady-state covariance matrix of $\tilde{\psi}$,

$P := \lim_{t\to\infty} \mathbb{E}\,\big(\tilde{\psi}(t)\,\tilde{\psi}^T(t)\big).$

The above measure of the amplification of stochastic disturbances is determined by $\|H\|_2^2 = (1/2)\,J(x)$, where

$J(x) := \big\langle \big(G_p + E\,\mathrm{diag}\,(x)\,E^T\big)^{-1},\; Q + L_x R L_x \big\rangle.$   (2.6)

It can be shown that $J$ can be expressed as

$J(x) = \big\langle \big(G_p + E\,\mathrm{diag}\,(x)\,E^T\big)^{-1},\; Q_p \big\rangle + \mathrm{diag}\,(E^T R E)^T x - \langle R,\, L_p\rangle - 1$   (2.7)

with

$Q_p := Q + (1/n)\mathbf{1}\mathbf{1}^T + L_p R L_p.$

Note that the last two terms in (2.7) do not depend on the optimization variable $x$, and that the term $L_p R L_p$ in $Q_p$ has an interesting interpretation: it determines a state weight that guarantees inverse optimality (in the LQR sense) of $u = -L_p\psi$ for a system with no coupling between the nodes, $\dot{\psi} = u + d$.

We formulate the design of a controller graph that provides an optimal tradeoff between the $\mathcal{H}_2$ performance of the closed-loop network and the controller sparsity as

minimize over $x$:  $J(x) + \gamma\,\|x\|_1$
subject to:  $G_p + E\,\mathrm{diag}\,(x)\,E^T \succ 0$   (SP)

where $J(x)$ and $G_p$ are given by (2.7) and (2.5b), respectively. The $\ell_1$ norm of $x$, $\|x\|_1 := \sum_{l=1}^{m} |x_l|$, is introduced as a convex proxy for promoting sparsity. In (SP), the vector of the edge weights $x \in \mathbb{R}^m$ is the optimization variable; the problem data are the positive regularization parameter $\gamma$, the state and control weights $Q$ and $R$, the plant graph Laplacian $L_p$, and the incidence matrix of the controller graph $E$.

The sparsity-promoting optimal control problem (SP) is a constrained optimization problem with a convex non-differentiable objective function [35] and a positive definite inequality constraint. This implies convexity of (SP). Positive definiteness of the strengthened graph Laplacian $G$ guarantees stability of the closed-loop network (2.2a) on the subspace $\mathbf{1}^{\perp}$, and thereby consensus in the absence of disturbances [2]. Consensus can be achieved even if some edge weights are negative [3,5].
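For small problems, the objective in (2.7) can be evaluated directly, as in the Matlab sketch below; this forms the matrix inverse explicitly, which the customized algorithms of Section 2.4 deliberately avoid, so it serves only as a reference implementation:

% Evaluate J(x) from (2.7); reference implementation for small problems.
function J = evalJ(x, E, Lp, Q, R)
  n  = size(Lp,1);
  Gp = Lp + ones(n)/n;                 % strengthened plant Laplacian (2.5b)
  Qp = Q + ones(n)/n + Lp*R*Lp;
  G  = Gp + E*diag(x)*E';
  J  = trace(G\Qp) + diag(E'*R*E)'*x - trace(R*Lp) - 1;
end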
By expressing $x$ as a difference between two non-negative vectors, $x^+$ and $x^-$, (SP) can be written as

minimize over $x^+,\,x^-$:  $\big\langle \big(G_p + E\,\mathrm{diag}\,(x^+ - x^-)\,E^T\big)^{-1},\; Q_p \big\rangle + (\gamma\mathbf{1} + c)^T x^+ + (\gamma\mathbf{1} - c)^T x^-$
subject to:  $G_p + E\,\mathrm{diag}\,(x^+ - x^-)\,E^T \succ 0, \quad x^+ \geq 0, \quad x^- \geq 0$   (2.8)

where $c := \mathrm{diag}\,(E^T R E)$. By utilizing the Schur complement, (2.8) can be cast as an SDP and solved via standard IP algorithms for small networks.

Reweighted $\ell_1$ norm

An alternative proxy for promoting sparsity is given by the weighted $\ell_1$ norm [125], $\|w \circ x\|_1 := \sum_{l=1}^{m} w_l\,|x_l|$, where $\circ$ denotes the elementwise product. The vector of non-negative weights $w \in \mathbb{R}^m$ can be selected to provide a better approximation of the non-convex cardinality function than the $\ell_1$ norm. An effective heuristic for weight selection is given by the iterative reweighted algorithm [125], with $w_l$ inversely proportional to the magnitude of $x_l$ in the previous iteration,

$w_l^+ = 1/(|x_l| + \varepsilon).$   (2.9)

This puts larger emphasis on smaller optimization variables, where a small positive parameter $\varepsilon$ ensures that $w_l^+$ is well-defined. If the weighted $\ell_1$ norm is used in (SP), the vector of all ones $\mathbf{1}$ should be replaced by the vector $w$ in (2.8).

2.2.2 Structured optimal control problem: debiasing step

After the structure of the controller graph Laplacian $L_x$ has been designed, we fix the structure of $L_x$ and optimize the corresponding edge weights. This "polishing" or "debiasing" step is used to improve the performance relative to the solution of the regularized optimal control problem (SP); see [126, Section 6.3.2] for additional information. The structured optimal control problem is obtained by eliminating the columns of the incidence matrix $E$ that correspond to zero elements in the vector of optimal edge weights $x^\star$ resulting from (SP). This yields a new incidence matrix $\hat{E}$ and leads to

minimize over $x$:  $\big\langle \big(G_p + \hat{E}\,\mathrm{diag}\,(x)\,\hat{E}^T\big)^{-1},\; Q_p \big\rangle + \mathrm{diag}\,(\hat{E}^T R\,\hat{E})^T x$
subject to:  $G_p + \hat{E}\,\mathrm{diag}\,(x)\,\hat{E}^T \succ 0.$

Alternatively, this optimization problem is obtained by setting $\gamma = 0$ in (SP) and replacing the incidence matrix $E$ with $\hat{E}$. The solution provides the optimal vector of edge weights for the controller graph Laplacian with the desired structure.

2.2.3 Gradient and Hessian of J(x)

We next summarize the first- and second-order derivatives of the objective function $J$, given by (2.7), with respect to the vector of edge weights $x$. The second-order Taylor series approximation of $J(x)$ around $\bar{x} \in \mathbb{R}^m$ is given by

$J(\bar{x} + \tilde{x}) \approx J(\bar{x}) + \nabla J(\bar{x})^T \tilde{x} + \tfrac{1}{2}\,\tilde{x}^T\,\nabla^2 J(\bar{x})\,\tilde{x}.$

For related developments we refer the reader to [7].

Proposition 1. The gradient and the Hessian of $J$ at $\bar{x} \in \mathbb{R}^m$ are determined by

$\nabla J(\bar{x}) = -\,\mathrm{diag}\,\big(E^T\,(Y(\bar{x}) - R)\,E\big)$
$\nabla^2 J(\bar{x}) = 2\,H_1(\bar{x}) \circ H_2(\bar{x})$

where

$Y(\bar{x}) := \big(G_p + E D_{\bar{x}} E^T\big)^{-1}\, Q_p\, \big(G_p + E D_{\bar{x}} E^T\big)^{-1}$
$H_1(\bar{x}) := E^T\, Y(\bar{x})\, E, \quad H_2(\bar{x}) := E^T\,\big(G_p + E D_{\bar{x}} E^T\big)^{-1}\, E, \quad D_{\bar{x}} := \mathrm{diag}\,(\bar{x}).$

2.3 Dual problem

Herein, we study the Lagrange dual of the sparsity-promoting optimal control problem (2.8), provide an interpretation of dual variables, and construct dual feasible variables from primal feasible variables. Since minimization of the Lagrangian associated with (2.8) does not lead to an explicit expression for the dual function, we introduce an auxiliary variable $G$ and find the dual of

minimize over $G,\,x^\pm$:  $\langle G^{-1},\, Q_p\rangle + (\gamma\mathbf{1} + c)^T x^+ + (\gamma\mathbf{1} - c)^T x^-$
subject to:  $G - G_p - E\,\mathrm{diag}\,(x^+ - x^-)\,E^T = 0$
  $G \succ 0, \quad x^+ \geq 0, \quad x^- \geq 0.$   (P)

In (P), $G$ represents the "strengthened" graph Laplacian of the closed-loop network and the equality constraint comes from (2.5a).
As we show next, the Lagrange dual of the primal optimization problem (P) admits an explicit characterization.

Proposition 2. The Lagrange dual of the primal optimization problem (P) is given by

maximize over $Y$:  $2\,\mathrm{trace}\,\big((Q_p^{1/2}\, Y\, Q_p^{1/2})^{1/2}\big) - \langle Y,\, G_p\rangle$
subject to:  $\|\mathrm{diag}\,\big(E^T\,(Y - R)\,E\big)\|_\infty \leq \gamma$
  $Y \succ 0, \quad Y\,\mathbf{1} = \mathbf{1}$   (D)

where $Y = Y^T \in \mathbb{R}^{n\times n}$ is the dual variable associated with the equality constraint in (P). The duality gap is

$\eta = y_+^T x^+ + y_-^T x^- = \mathbf{1}^T\,(y_+ \circ x^+ + y_- \circ x^-)$   (2.10)

where

$y_+ = \gamma\mathbf{1} - \mathrm{diag}\,\big(E^T\,(Y - R)\,E\big) \geq 0$   (2.11a)
$y_- = \gamma\mathbf{1} + \mathrm{diag}\,\big(E^T\,(Y - R)\,E\big) \geq 0$   (2.11b)

are the Lagrange multipliers associated with the elementwise inequality constraints in (P).

Proof. The Lagrangian of (P) is given by

$\mathcal{L} = \langle G^{-1},\, Q_p\rangle + \langle Y,\, G\rangle - \langle Y,\, G_p\rangle + \big(\gamma\mathbf{1} - \mathrm{diag}\,(E^T(Y-R)E) - y_+\big)^T x^+ + \big(\gamma\mathbf{1} + \mathrm{diag}\,(E^T(Y-R)E) - y_-\big)^T x^-.$   (2.12)

Note that no Lagrange multiplier is assigned to the positive definite constraint on $G$ in $\mathcal{L}$. Instead, we determine conditions on $Y$ and $y_\pm$ that guarantee $G \succ 0$. Minimizing $\mathcal{L}$ with respect to $G$ yields

$G^{-1}\, Q_p\, G^{-1} = Y$   (2.13a)

or, equivalently,

$G = Q_p^{1/2}\,\big(Q_p^{1/2}\, Y\, Q_p^{1/2}\big)^{-1/2}\, Q_p^{1/2}.$   (2.13b)

Positive definiteness of $G$ and $Q_p$ implies $Y \succ 0$. Furthermore, since $Q_p\mathbf{1} = \mathbf{1}$, from (2.5c) and (2.13a) we have $Y\mathbf{1} = \mathbf{1}$. Similarly, minimization with respect to $x^+$ and $x^-$ leads to (2.11a) and (2.11b). Thus, non-negativity of $y_+$ and $y_-$ amounts to

$-\gamma\mathbf{1} \;\leq\; \mathrm{diag}\,\big(E^T(Y-R)E\big) \;\leq\; \gamma\mathbf{1}$

or, equivalently, $\|\mathrm{diag}\,(E^T(Y-R)E)\|_\infty \leq \gamma$. Substitution of (2.13) and (2.11) into (2.12) eliminates $y_+$ and $y_-$ from the dual problem. We can thus represent the dual function, $\inf_{G,\,x^\pm} \mathcal{L}(G, x^\pm; Y, y_\pm)$, as

$2\,\mathrm{trace}\,\big((Q_p^{1/2}\, Y\, Q_p^{1/2})^{1/2}\big) - \langle Y,\, G_p\rangle$

which allows us to bring the dual of (P) to (D).

Any dual feasible $Y$ can be used to obtain a lower bound on the optimal value of the primal problem (P). Furthermore, the difference between the objective functions of the primal (evaluated at a primal feasible $(G, x^\pm)$) and the dual (evaluated at a dual feasible $Y$) yields expression (2.10) for the duality gap $\eta$, where $y_+$ and $y_-$ are given by (2.11a) and (2.11b). The duality gap can be used to estimate the distance to optimality.

Strong duality follows from Slater's theorem [126], i.e., from convexity of the primal problem (P) and strict feasibility of its constraints. This implies that, at optimality, the duality gap between the primal problem (P) and the dual problem (D) is zero. Furthermore, if $(G^\star, x^{\pm\star})$ is an optimal point of (P), then $Y^\star = (G^\star)^{-1}\, Q_p\, (G^\star)^{-1}$ is the optimal point of (D). Similarly, if $Y^\star$ is the optimal point of (D), then

$G^\star = Q_p^{1/2}\,\big(Q_p^{1/2}\, Y^\star\, Q_p^{1/2}\big)^{-1/2}\, Q_p^{1/2}$

is the optimal point of (P). The optimal vector of the edge weights $x^\star$ is determined by the non-zero off-diagonal elements of the controller graph Laplacian $L_x^\star = G^\star - G_p$.

Interpretation of dual variables

For electrical networks, the dual variables have appealing interpretations. Let $\iota \in \mathbb{R}^n$ be a random current injected into the resistor network, satisfying

$\mathbf{1}^T \iota = 0, \quad \mathbb{E}\,(\iota) = 0, \quad \mathbb{E}\,(\iota\,\iota^T) = Q + L_p R L_p.$

The vector of voltages $\vartheta \in \mathbb{R}^m$ across the edges of the network is then given by $\vartheta = E^T G^{-1}\,\iota$. Furthermore, since

$\mathbb{E}\,(\vartheta\,\vartheta^T) = E^T G^{-1}\,\mathbb{E}\,(\iota\,\iota^T)\,G^{-1} E = E^T\, Y\, E$

the dual variable $Y$ is related to the covariance matrix of voltages across the edges. Moreover, (2.11) implies that $y_+$ and $y_-$ quantify the deviations of the variances of edge voltages from their respective upper and lower bounds.

Remark 1. For a primal feasible $x$, the matrix $Y$ resulting from (2.13a) with $G$ given by (2.5a) may not be dual feasible. Let

$\hat{Y} := Y + (1/n)\,\mathbf{1}\mathbf{1}^T$   (2.14a)

and let the control weight be $R = rI$ with $r > 0$.
If

$(-\gamma + 2r)\,\mathbf{1} \;\leq\; \mathrm{diag}\,(E^T\, Y\, E) \;\leq\; (\gamma + 2r)\,\mathbf{1}$   (2.14b)

i.e., if $\|\mathrm{diag}\,(E^T(Y - R)E)\|_\infty \leq \gamma$, then $\hat{Y}$ satisfies the inequality constraint in (D) and it is thus dual feasible.

2.4 Customized algorithms

We next exploit the structure of the sparsity-promoting optimal control problem (SP) and develop customized algorithms based on the proximal gradient and Newton methods. The proximal gradient algorithm is a first-order method that uses a simple quadratic approximation of $J$ in (SP). This yields an explicit update of the vector of edge weights via application of the soft-thresholding operator. In the proximal Newton method, a sequential quadratic approximation of the smooth part of the objective function in (SP) is used, and the search direction is efficiently computed via cyclic coordinate descent over the set of active variables.

2.4.1 Proximal gradient method

We next use the proximal gradient method to solve (SP). A simple quadratic approximation of $J(x)$ around the current iterate $x^k$,

$J(x) \approx J(x^k) + \nabla J(x^k)^T (x - x^k) + \frac{1}{2\alpha_k}\,\|x - x^k\|_2^2$

is substituted into (SP) to obtain

$x^{k+1} = \underset{x}{\mathrm{argmin}}\;\; g(x) + \frac{1}{2\alpha_k}\,\big\|x - \big(x^k - \alpha_k\,\nabla J(x^k)\big)\big\|_2^2.$

Here, $\alpha_k$ is the step-size and the update is determined by the proximal operator of the function $\alpha_k g$,

$x^{k+1} = \mathrm{prox}_{\alpha_k g}\big(x^k - \alpha_k\,\nabla J(x^k)\big).$

In particular, for $g(x) = \gamma\,\|x\|_1$, we have

$x^{k+1} = \mathcal{S}_{\gamma\alpha_k}\big(x^k - \alpha_k\,\nabla J(x^k)\big)$

where $\mathcal{S}_\kappa(y) = \mathrm{sign}\,(y)\,\max\,(|y| - \kappa,\, 0)$ is the soft-thresholding function.

The proximal gradient algorithm converges with rate $O(1/k)$ if $\alpha_k < 1/L$, where $L$ is the Lipschitz constant of $\nabla J$ [76,107]. It can be shown that $\nabla J$ is Lipschitz continuous but, since it is challenging to explicitly determine $L$, we adjust $\alpha_k$ via backtracking. To provide a better estimate of $L$, we initialize $\alpha_k$ using the Barzilai-Borwein (BB) method, which provides an effective heuristic for approximating the Hessian of the function $J$ with a scaled version of the identity [127], $(1/\alpha_k)I$. At the $k$th iteration, the initial BB step-size $\alpha_{k,0}$,

$\alpha_{k,0} := \frac{\|x^k - x^{k-1}\|_2^2}{(x^{k-1} - x^k)^T\,\big(\nabla J(x^{k-1}) - \nabla J(x^k)\big)}$   (2.15)

is adjusted via backtracking until the inequality constraint in (SP) is satisfied and

$J(x^{k+1}) \;\leq\; J(x^k) + \nabla J(x^k)^T\,(x^{k+1} - x^k) + \frac{1}{2\alpha_k}\,\|x^{k+1} - x^k\|_2^2.$

Since $J$ is continuously differentiable with Lipschitz continuous gradient, this inequality holds for any $\alpha_k < 1/L$ and the algorithm converges sub-linearly [107]. This condition guarantees that the objective function decreases at every iteration. Our numerical experiments in Section 2.6 suggest that BB step-size initialization significantly enhances the rate of convergence.

Remark 2. The biggest computational challenge comes from the evaluation of the objective function and its gradient. Since the inverse of the strengthened graph Laplacian $G$ has to be computed, these evaluations take $O(n^3)$ and $O(n m^2)$ flops, respectively, with direct computation. However, by exploiting the problem structure, $\nabla J$ can be computed more efficiently. The main cost arises in the computation of $\mathrm{diag}\,(E^T Y E)$. We instead compute it as $\mathrm{sum}\,\big(E^T \circ (Y E)^T\big)$, which takes $O(n^2 m)$ operations. Here, $\mathrm{sum}\,(A)$ is a vector that contains the sum of each row of the matrix $A$ in its entries. For networks with $m \gg n$ this leads to a significant speedup. Moreover, in contrast to direct computation, we do not need to store the $m \times m$ matrix $E^T Y E$; only formation of its columns is required, which offers memory savings.

2.4.2 Proximal Newton method

In contrast to the proximal gradient algorithm, the proximal Newton method benefits from a second-order Taylor series expansion of the smooth part of the objective function in (SP).
Herein, we employ cyclic coordinate descent over the set of active variables to efficiently compute the Newton direction. By approximating the smooth part of the objective function $J$ in (SP) with the second-order Taylor series expansion around the current iterate $\bar{x}$,

$J(\bar{x} + \tilde{x}) \approx J(\bar{x}) + \nabla J(\bar{x})^T \tilde{x} + \tfrac{1}{2}\,\tilde{x}^T\,\nabla^2 J(\bar{x})\,\tilde{x}$

the problem (SP) becomes

minimize over $\tilde{x}$:  $\nabla J(\bar{x})^T \tilde{x} + \tfrac{1}{2}\,\tilde{x}^T\,\nabla^2 J(\bar{x})\,\tilde{x} + \gamma\,\|\bar{x} + \tilde{x}\|_1$
subject to:  $G_p + E\,\mathrm{diag}\,(\bar{x} + \tilde{x})\,E^T \succ 0.$   (2.16)

Let $\tilde{x}$ denote the current iterate approximating the Newton direction. By perturbing $\tilde{x}$ in the direction of the $i$th standard basis vector $e_i$ in $\mathbb{R}^m$, the objective function in (2.16) becomes

$\nabla J(\bar{x})^T\,(\tilde{x} + \delta_i e_i) + \tfrac{1}{2}\,(\tilde{x} + \delta_i e_i)^T\,\nabla^2 J(\bar{x})\,(\tilde{x} + \delta_i e_i) + \gamma\,|\bar{x}_i + \tilde{x}_i + \delta_i|.$

Elimination of constant terms allows us to bring (2.16) into

minimize over $\delta_i$:  $\tfrac{1}{2}\,a_i\,\delta_i^2 + b_i\,\delta_i + \gamma\,|c_i + \delta_i|$   (2.17)

where the optimization variable is the scalar $\delta_i$ and $(a_i, b_i, c_i, \bar{x}_i, \tilde{x}_i)$ are the problem data, with

$a_i := e_i^T\,\nabla^2 J(\bar{x})\,e_i, \quad b_i := \big(\nabla^2 J(\bar{x})\,e_i\big)^T\,\tilde{x} + e_i^T\,\nabla J(\bar{x}), \quad c_i := \bar{x}_i + \tilde{x}_i.$

The explicit solution to (2.17) is given by

$\delta_i = -\,c_i + \mathcal{S}_{\gamma/a_i}\big(c_i - b_i/a_i\big).$

After the Newton direction $\tilde{x}$ has been computed, we determine the step-size $\alpha$ via backtracking. This guarantees positive definiteness of the strengthened graph Laplacian and a sufficient decrease of the objective function. We use a generalization of the Armijo rule [128] to find an appropriate step-size $\alpha$ such that $G_p + E\,\mathrm{diag}\,(\bar{x} + \alpha\tilde{x})\,E^T$ is a positive definite matrix and, for some $\sigma \in (0, 1)$,

$J(\bar{x} + \alpha\tilde{x}) + \gamma\,\|\bar{x} + \alpha\tilde{x}\|_1 \;\leq\; J(\bar{x}) + \gamma\,\|\bar{x}\|_1 + \sigma\,\alpha\,\big(\nabla J(\bar{x})^T\tilde{x} + \gamma\,\|\bar{x} + \tilde{x}\|_1 - \gamma\,\|\bar{x}\|_1\big).$

Remark 3. The parameter $a_i$ in (2.17) is determined by the $i$th diagonal element of the Hessian $\nabla^2 J(\bar{x})$. On the other hand, the $i$th column of $\nabla^2 J(\bar{x})$ and the $i$th element of the gradient vector $\nabla J(\bar{x})$ enter into the expression for $b_i$. All of these can be obtained directly from $\nabla^2 J(\bar{x})$ and $\nabla J(\bar{x})$, and forming them does not require any multiplication. Computation of a single vector inner product between the $i$th column of the Hessian and $\tilde{x}$ is required in $b_i$, which typically takes $O(m)$ operations. To avoid direct multiplication, in each iteration after finding $\delta_i$ we update the vector $\nabla^2 J(\bar{x})^T\tilde{x}$ using the correction term $\delta_i\,(E^T Y E_i) \circ \big((G^{-1} E_i)^T E\big)^T$ and take its $i$th element to form $b_i$. Here, $E_i$ is the $i$th column of the incidence matrix of the controller graph. This also avoids the need to store the Hessian of $J$, which is an $m \times m$ matrix, thereby leading to significant memory savings.

Remark 4. An active set strategy is an effective means of determining the directions that do not need to be updated in the coordinate descent algorithm. At each outer iteration, we classify a variable as either active or inactive based on the values of $\bar{x}_i$ and the $i$th component of the gradient vector $\nabla J(\bar{x})$. For $g(x) = \gamma\,\|x\|_1$, the $i$th search direction is inactive if $\bar{x}_i = 0$ and $|e_i^T\,\nabla J(\bar{x})| < \gamma - \epsilon$, and it is active otherwise. Here, $\epsilon > 0$ is a small number (e.g., $\epsilon = 0.0001\,\gamma$). The Newton direction is then obtained by solving the optimization problem over the set of active variables. This significantly improves algorithmic efficiency for large values of the regularization parameter $\gamma$.

Convergence analysis

In (SP), $J(x)$ is smooth for $G_p + E\,\mathrm{diag}\,(x)\,E^T \succ 0$ and the non-smooth part is given by the $\ell_1$ norm of $x$. Objective functions of the form $J(x) + g(x)$ were studied in [124], where $J$ is smooth over the positive definite cone and $g$ is a separable non-differentiable function.
Theorem 16 from [124] thus implies a super-linear (i.e., quadratic) convergence rate of the quadratic approximation method for (SP).

Table 2.1: Comparison of our customized proximal algorithms

Algorithm           | proximal gradient            | proximal Newton
Order               | 1st                          | 2nd
Search direction    | explicit update              | coordinate descent
Speed-up strategy   | BB step-size initialization  | active set strategy
Memory              | no storage of $m \times m$ matrices | no storage of $m \times m$ matrices
Most expensive part | $O(n^2 m)$                   | $O(m^2)$
Convergence rate    | linear                       | super-linear (quadratic)

Table 2.2: Algorithmic properties of the primal-dual IP method

Algorithm           | primal-dual IP method
Order               | 2nd
Search direction    | PCG
Speed-up strategy   | PCG with preconditioner
Memory              | no storage of $m \times m$ matrices
Most expensive part | $O(m^3)$
Convergence rate    | super-linear

Stopping criteria

The norms of the primal and dual residuals, $r_p$ and $r_d^\pm$, as well as the duality gap $\eta$, are used as stopping criteria. In contrast to the stopping criteria available in the literature, this choice enables a fair comparison of the algorithms. We use (2.14) to construct a dual feasible $\hat{Y}$, obtain $y_\pm$ from (2.11), compute the duality gap $\eta$ from (2.10), and determine the primal and dual residuals from

$r_p(x, x^\pm) := x - x^+ + x^-$
$r_d^+(x, y_+) := \gamma\mathbf{1} - \mathrm{diag}\,\big(E^T(\hat{Y} - R)E\big) - y_+$
$r_d^-(x, y_-) := \gamma\mathbf{1} + \mathrm{diag}\,\big(E^T(\hat{Y} - R)E\big) - y_-.$

Comparison of algorithms

Tables 2.1 and 2.2 compare and contrast the features of our customized proximal algorithms and of the algorithm based on the primal-dual IP method developed in [63].

2.5 Growing connected resistive networks

The problem of optimal topology design for stochastically-forced networks has many interesting variations. An important class is given by resistive networks, in which all edge weights are non-negative, $x \geq 0$. Here, we study the problem of growing connected resistive networks; e.g., see [34]. In this case, the plant graph is connected and there are no joint edges between the plant and the controller graphs. Our objective is to enhance the closed-loop performance by adding a small number of edges. As we show below, the inequality constraints in this case amount to non-negativity of the controller edge weights. This simplifies the optimality conditions and enables further improvement of the computational efficiency of our customized algorithms.

The restriction to connected plant graphs implies positive definiteness of the strengthened graph Laplacian of the plant,

$G_p = L_p + (1/n)\mathbf{1}\mathbf{1}^T \succ 0.$

Thus, $G_p + E\,\mathrm{diag}\,(x)\,E^T$ is always positive definite for connected resistive networks and (SP) simplifies to

minimize over $x$:  $f(x) + g(x)$   (2.18)

where $f(x) := J(x) + \gamma\,\mathbf{1}^T x$ and $g(x)$ is the indicator function of the non-negative orthant,

$g(x) := I_+(x) = \begin{cases} 0, & x \geq 0 \\ +\infty, & \text{otherwise.} \end{cases}$

As in Section 2.3, in order to determine the Lagrange dual of the optimization problem (2.18), we introduce an additional optimization variable $G$ and rewrite (2.18) as

minimize over $G,\,x$:  $\langle G^{-1},\, Q_p\rangle + \big(\gamma\mathbf{1} + \mathrm{diag}\,(E^T R E)\big)^T x$
subject to:  $G - G_p - E\,\mathrm{diag}\,(x)\,E^T = 0, \quad x \geq 0.$   (P1)

Proposition 3. The Lagrange dual of the primal optimization problem (P1) is given by

maximize over $Y$:  $2\,\mathrm{trace}\,\big((Q_p^{1/2}\, Y\, Q_p^{1/2})^{1/2}\big) - \langle Y,\, G_p\rangle$
subject to:  $\mathrm{diag}\,\big(E^T\,(Y - R)\,E\big) \leq \gamma\mathbf{1}$
  $Y \succ 0, \quad Y\,\mathbf{1} = \mathbf{1}$   (D1)

where $Y$ is the dual variable associated with the equality constraint in (P1). The duality gap is

$\eta = y^T x = \mathbf{1}^T\,(y \circ x)$   (2.19)

where

$y := \gamma\mathbf{1} - \mathrm{diag}\,\big(E^T\,(Y - R)\,E\big) \geq 0$   (2.20)

represents the dual variable associated with the non-negativity constraint on the vector of the edge weights $x$.

Remark 5.
For connected resistive networks with the control weight $R = rI$, the matrix $\hat{Y}$ given by (2.14a) is dual feasible if

$\max\,\big(\mathrm{diag}\,(E^T\,(Y - R)\,E)\big) \;\leq\; \gamma, \quad \text{i.e., if} \quad \max\,\big(\mathrm{diag}\,(E^T\, Y\, E)\big) \;\leq\; \gamma + 2r.$   (2.21)

2.5.1 Proximal gradient method

Using a simple quadratic approximation of the smooth part of the objective function $f$ around the current iterate $x^k$,

$f(x) \approx f(x^k) + \nabla f(x^k)^T\,(x - x^k) + \frac{1}{2\alpha_k}\,\|x - x^k\|_2^2$

the optimal solution of (2.18) is determined by the proximal operator of the function $g(x) = I_+(x)$,

$x^{k+1} = \big(x^k - \alpha_k\,\nabla f(x^k)\big)_+$

where $(\cdot)_+$ is the projection onto the non-negative orthant. Thus, the action of the proximal operator is given by projected gradient. As in Section 2.4.1, we initialize $\alpha_k$ using the BB heuristic, but we skip the backtracking step here and employ a non-monotone BB scheme [129,130]. The effectiveness of this strategy has been established on quadratic problems [127,129], but its convergence in general is hard to prove. In Section 2.6, we demonstrate the efficiency of this approach.

2.5.2 Proximal Newton method

We next adjust the customized algorithm based on the proximal Newton method for growing connected resistive networks. We approximate the smooth part of the objective function $f$ in (2.18) using the second-order Taylor series expansion around the current iterate $\bar{x}$,

$f(\bar{x} + \tilde{x}) \approx f(\bar{x}) + \nabla f(\bar{x})^T \tilde{x} + \tfrac{1}{2}\,\tilde{x}^T\,\nabla^2 f(\bar{x})\,\tilde{x}$

and rewrite (2.18) as

minimize over $\tilde{x}$:  $\nabla f(\bar{x})^T \tilde{x} + \tfrac{1}{2}\,\tilde{x}^T\,\nabla^2 f(\bar{x})\,\tilde{x}$
subject to:  $\bar{x} + \tilde{x} \geq 0.$   (2.22)

By perturbing $\tilde{x}$ in the direction of the $i$th standard basis vector $e_i$ in $\mathbb{R}^m$, $\tilde{x} + \delta_i e_i$, the objective function in (2.22) becomes

$\nabla f(\bar{x})^T\,(\tilde{x} + \delta_i e_i) + \tfrac{1}{2}\,(\tilde{x} + \delta_i e_i)^T\,\nabla^2 f(\bar{x})\,(\tilde{x} + \delta_i e_i).$

Elimination of constant terms allows us to bring (2.22) into

minimize over $\delta_i$:  $\tfrac{1}{2}\,a_i\,\delta_i^2 + b_i\,\delta_i$
subject to:  $\bar{x}_i + \tilde{x}_i + \delta_i \geq 0.$   (2.23)

The optimization variable is the scalar $\delta_i$, and $a_i$, $b_i$, $\bar{x}_i$, and $\tilde{x}_i$ are the problem data, with

$a_i := e_i^T\,\nabla^2 f(\bar{x})\,e_i, \quad b_i := \big(\nabla^2 f(\bar{x})\,e_i\big)^T\,\tilde{x} + e_i^T\,\nabla f(\bar{x}).$

The explicit solution to (2.23) is given by

$\delta_i = \begin{cases} -\,b_i/a_i, & \bar{x}_i + \tilde{x}_i - b_i/a_i \geq 0 \\ -\,(\bar{x}_i + \tilde{x}_i), & \text{otherwise.} \end{cases}$

After the Newton direction $\tilde{x}$ has been computed, we determine the step-size $\alpha$ via backtracking. This guarantees positivity of the updated vector of edge weights, $\bar{x} + \alpha\tilde{x} \geq 0$, and a sufficient decrease of the objective function, $f(\bar{x} + \alpha\tilde{x}) \leq f(\bar{x}) + \alpha\sigma\,\nabla f(\bar{x})^T\tilde{x}$.

Remark 6. As in Section 2.4.2, we use an active set strategy to identify the directions that do not need to be updated in the coordinate descent algorithm. For $g(x) = I_+(x)$, the $i$th search direction is inactive if $\{\bar{x}_i = 0$ and $e_i^T\,\nabla f(\bar{x}) \geq 0\}$ and it is active otherwise.

Stopping criteria

The norm of the dual residual, $r_d$, and the duality gap, $\eta$, are used as stopping criteria. The dual variable $y$ is obtained from (2.20), where $\hat{Y}$ is given by (2.14a) and satisfies (2.21). At each iteration, $\eta$ is evaluated using (2.19) and the dual residual is determined by

$r_d(x, y) := \gamma\mathbf{1} - \mathrm{diag}\,\big(E^T\,(Y(x) - R)\,E\big) - y.$

2.6 Computational experiments

We next provide examples and evaluate the performance of our customized algorithms. Algorithm proxBB denotes the proximal gradient method with BB step-size initialization, and proxN denotes the proximal Newton method in which the search direction is found via coordinate descent. Performance is compared with the PCG-based primal-dual IP method of [63] and the greedy algorithm of [37]. We have implemented all algorithms in Matlab and executed the tests on a 3.4 GHz Core(TM) i7-3770 Intel(R) machine with 16GB RAM.
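Before turning to the examples, we note that the projected-gradient update of Section 2.5.1 with BB step-size initialization admits a very compact implementation. The following Matlab sketch is a simplified illustration (gradf stands for a user-supplied routine that evaluates $\nabla f$, e.g., via Proposition 1; the non-monotone safeguards of [129,130] are omitted):

% Projected gradient with BB step-size for minimizing f(x) subject to x >= 0.
function x = projgradBB(gradf, x0, iters)
  x = x0;  g = gradf(x);  alpha = 1;
  for k = 1:iters
      xnew = max(x - alpha*g, 0);      % projection onto x >= 0
      gnew = gradf(xnew);
      s = xnew - x;  y = gnew - g;
      alpha = (s'*s)/max(s'*y, eps);   % BB step-size, cf. (2.15)
      x = xnew;  g = gnew;
  end
end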
In all examples, we set $R = I$ and choose the state weight that penalizes the mean-square deviation from the network average, $Q = I - (1/n)\mathbf{1}\mathbf{1}^T$. The absolute value of the dual residual, $r_d$, and the duality gap, $\eta$, are used as stopping criteria. We set the tolerances for $r_d$ and $\eta$ to $10^{-3}$ and $10^{-4}$, respectively. Finally, for connected plant networks,

$\gamma_{\max} := \|\mathrm{diag}\,(E^T G_p^{-1}\, Q\, G_p^{-1}\, E)\|_\infty$

identifies the value of the regularization parameter for which all edge weights in the controller graph are equal to zero.

Table 2.3: Comparison of algorithms (solve times in seconds / number of iterations) for the problem of growing connected resistive Erdős-Rényi networks with different numbers of nodes $n$, edge probability $1.05\,\log(n)/n$, and $\gamma = 0.8\,\gamma_{\max}$.

number of nodes | $n = 300$       | $n = 700$       | $n = 1000$
number of edges | $m = 43986$     | $m = 242249$    | $m = 495879$
IP (PCG)        | $16.499\,/\,8$  | $394.256\,/\,13$| $1014.282\,/\,13$
proxBB          | $1.279\,/\,11$  | $15.353\,/\,11$ | $55.944\,/\,13$
proxN           | $1.078\,/\,4$   | $11.992\,/\,4$  | $34.759\,/\,4$

Table 2.4: Table 2.3 continued

number of nodes | $n = 1300$         | $n = 1500$
number of edges | $m = 839487$       | $m = 1118541$
IP (PCG)        | $15948.164\,/\,13$ | $179352.208\,/\,14$
proxBB          | $157.305\,/\,16$   | $239.567\,/\,16$
proxN           | $82.488\,/\,4$     | $124.307\,/\,4$

Additional information about our computational experiments, along with Matlab source codes, can be found at: www.ece.umn.edu/mihailo/software/graphsp/

2.6.1 Performance comparison

In what follows, the incidence matrix of the controller graph is selected to satisfy the following requirements: (i) in the absence of the sparsity-promoting term, the closed-loop network is given by a complete graph; and (ii) there are no joint edges between the plant and the controller graphs.

We first solve the problem (P1) for growing connected resistive Erdős-Rényi networks with different numbers of nodes. The generator of the plant dynamics is given by an undirected unweighted graph with edge probability $1.05\,\log(n)/n$. Tables 2.3 and 2.4 compare our customized algorithms in terms of speed and the number of iterations. Even for small networks, the proximal methods are significantly faster than the IP method, and proxN takes a smaller number of iterations and converges more quickly than proxBB. For a larger network (with 1500 nodes and 1118541 edges in the controller graph), it takes about 50 hours for the PCG-based IP method to solve the problem. In contrast, proxN and proxBB converge in about 2 and 4 minutes, respectively.

Figure 2.1 compares our proximal gradient algorithm with the fast greedy algorithm of [37].

Figure 2.1: (a) Solve times (in seconds); and (b) performance degradation (in percent) of the proximal gradient and greedy algorithms relative to the optimal centralized controller.

We solve problem (P1) for Erdős-Rényi networks with different numbers of nodes ($n = 5$ to $500$) and $\gamma = 0.4\,\gamma_{\max}$. After proxBB identifies the edges in the controller graph, we use the greedy method to select the same number of edges. Finally, we polish the identified edge weights for both methods. Figure 2.1a shows the solve times (in seconds) versus the number of nodes. As the number of nodes increases, the proximal algorithm significantly outperforms the fast greedy method. Relative to the optimal centralized controller, both methods yield similar performance degradation of the closed-loop network; see Fig. 2.1b.

2.6.2 Large-scale Facebook network

To evaluate the effectiveness of our algorithms on large networks, we solve the problem of growing a network of friendships. In such social networks, nodes denote people and edges denote friendships: there is an edge between two nodes if two people are friends.
The network is obtained by examining the social networks of 10 users (the so-called ego nodes); all other nodes are friends of at least one of these ego nodes [131]. The resulting network is undirected and unweighted, with 4039 nodes and 88234 edges; the data is available at http://snap.stanford.edu/data/. Our objective is to improve performance by adding a small number of extra edges. We assume that people can only form friendships with friends of their friends. This restricts the number of potential edges in the controller graph to 1358067.

Figure 2.2: (a) Sparsity level; and (b) optimal tradeoff curves resulting from the application of the proximal gradient algorithm and a heuristic strategy for the Facebook network.

To avoid memory issues, we have implemented our algorithms in C++. For $\gamma = c\,\gamma_{\max}$ with $c = \{0.1,\, 0.2,\, 0.5,\, 0.8\}$ and $\gamma_{\max} = 19.525$, the proximal gradient algorithm computes the solution in about 10, 2.6, 0.87, and 0.43 hours, respectively. After designing the topology of the controller graph, we optimize the resulting edge weights via polishing. Figure 2.2a shows that the number of nonzero elements in the vector $x$ decreases as $\gamma$ increases, and Fig. 2.2b illustrates that the $\mathcal{H}_2$ performance deteriorates as the number of nonzero elements in $x$ decreases. In particular, for $\gamma = 0.8\,\gamma_{\max}$, the identified sparse controller has only 3 nonzero elements (it uses only 0.0002% of the potential edges). Relative to the optimal centralized controller, this controller degrades performance by 16.842%, i.e., $(J - J_c)/J_c = 16.842\%$. In all of our experiments, the added links with the largest edge weights connect either the ego nodes to each other or three non-ego nodes to the ego nodes. Thus, our method recognizes the significance of the ego nodes and identifies non-ego nodes that play an important role in improving performance.

We compare the performance of the identified controller to a heuristic strategy that is described next. The controller graph contains 16 potential edges between ego nodes. If the number of edges identified by our method is smaller than 16, we randomly select the desired number of edges between ego nodes. Otherwise, we connect all ego nodes and select the remaining edges in the controller graph randomly. We then use polishing to find the optimal edge weights. The performance of the resulting random controller graphs is averaged over 10 trials, and the performance loss relative to the optimal centralized controller is displayed in Fig. 2.2b. We see that our algorithm always performs better than the heuristic strategy. On the other hand, the heuristic strategy outperforms a strategy that adds edges randomly (without paying attention to ego nodes). Unlike our method, the heuristic strategy does not necessarily improve the performance as the number of added edges increases. In fact, the performance deteriorates as the number of edges in the controller graph increases from 4 to 27; see Fig. 2.2b.

2.6.3 Random disconnected network

The plant graph (blue lines) in Fig. 2.3 contains 50 randomly distributed nodes in a region of $10 \times 10$ units. Two nodes are neighbors if their Euclidean distance is not greater than 2 units. We examine the problem of adding edges to a plant graph which is not connected and solve the sparsity-promoting optimal control problem (SP) for a controller graph with $m = 1094$ potential edges. This is done for 200 logarithmically-spaced values of $\gamma \in [10^{-3},\, 2.5]$, using the path-following iterative reweighted algorithm as a proxy for inducing sparsity [125].
As indicated by (2.9), we set the weights to be inversely proportional to the magnitude of the solution $x$ to (SP) at the previous value of $\gamma$. We choose $\varepsilon = 10^{-3}$ in (2.9) and initialize the weights for $\gamma = 10^{-3}$ using the solution to (SP) with $\gamma = 0$ (i.e., the optimal centralized vector of the edge weights). Topology design is followed by the polishing step that computes the optimal edge weights; see Section 2.2.2. As illustrated in Fig. 2.3, larger values of $\gamma$ yield sparser controller graphs (red lines). In contrast to all other examples, the plant graph is not connected and the optimal solution is obtained using the algorithms of Section 2.4. Note that the greedy method [37] cannot be used here. Since the plant graph has three disconnected subgraphs, at least two edges in the controller are needed to make the closed-loop network connected.

Figure 2.3: Topologies of the plant (blue lines) and controller graphs (red lines) for an unweighted random network with three disconnected subgraphs; panels (a)-(d) correspond to $\gamma = 0.02$, $0.09$, $0.63$, and $2.5$.

Figure 2.4 shows that the number of nonzero elements in the vector of the edge weights $x$ decreases and that the closed-loop performance deteriorates as $\gamma$ increases. In particular, Fig. 2.4c illustrates the optimal tradeoff curve between the $\mathcal{H}_2$ performance loss (relative to the optimal centralized controller) and the sparsity of the vector $x$. For $\gamma = 2.5$, only four edges are added. Relative to the optimal centralized vector of the controller edge weights $x_c$, the identified sparse controller in this case uses only 0.37% of the edges and achieves a performance loss of 82.13%, i.e., $\mathrm{card}\,(x)/\mathrm{card}\,(x_c) = 0.37\%$ and $(J - J_c)/J_c = 82.13\%$. Here, $x_c$ is the solution to (SP) with $\gamma = 0$, and the pattern of non-zero elements of $x$ is obtained by solving (SP) with $\gamma = 2.5$ via the path-following iterative reweighted algorithm.

Figure 2.4: (a) Sparsity level; (b) performance degradation; and (c) the optimal tradeoff curve between the performance degradation and the sparsity level of the optimal sparse $x$ compared to the optimal centralized vector of the edge weights $x_c$. The results are obtained for the unweighted random disconnected plant network with topology shown in Fig. 2.3.

2.6.4 Path and ring networks

For path networks, our computational experiments show that, for a large enough value of the sparsity-promoting parameter $\gamma$, a single edge, which generates the longest cycle, is added; see Fig. 2.5, top row. This is in agreement with [9], where it was proved that the longest cycle is most beneficial for improving the $\mathcal{H}_2$ performance of tree networks. Similar observations are made for the spatially-invariant ring network with nearest-neighbor interactions. For large values of $\gamma$, each node establishes a link to the node that is farthest away in the network; see Fig. 2.5, bottom row. This is in agreement with recent theoretical developments [132], where perturbation analysis was used to identify optimal weak links in edge-transitive consensus networks. Thus, for these regular networks and large enough values of the regularization parameter, our approach indeed provides the globally optimal solution to the original non-convex cardinality minimization problem.

2.7 Concluding remarks

We have examined the problem of optimal topology design and selection of the corresponding edge weights for undirected consensus networks.
Figure 2.5: The problems of growing unweighted path (top row) and ring (bottom row) networks. Blue lines identify edges in the plant graph, and red lines identify edges in the controller graph. The top-row panels correspond to $\gamma = 0$, $0.09\,\gamma_{\max}$, $0.24\,\gamma_{\max}$, and $0.96\,\gamma_{\max}$; the bottom-row panels correspond to $\gamma = 0$, $0.11\,\gamma_{\max}$, $0.24\,\gamma_{\max}$, and $0.94\,\gamma_{\max}$.

Our approach uses convex optimization to balance the performance of stochastically-forced networks with the number of edges in the distributed controller. For the $\ell_1$-regularized minimum variance optimal control problem, we have derived a Lagrange dual and exploited the structure of the optimality conditions for undirected networks to develop customized algorithms that are well-suited for large problems. These are based on the proximal gradient and the proximal Newton methods. The proximal gradient algorithm is a first-order method that updates the controller graph Laplacian via the use of the soft-thresholding operator. In the proximal Newton method, a sequential quadratic approximation of the smooth part of the objective function is employed and the Newton direction is computed using cyclic coordinate descent over the set of active variables. Examples are provided to demonstrate the utility of our algorithms. We have shown that the proximal algorithms can solve problems with millions of edges in the controller graph in several minutes, on a PC. Furthermore, we have specialized our algorithms to the problem of growing connected resistive networks. In this case, the plant graph is connected and there are no joint edges between the plant and the controller graphs. We have exploited the structure of such networks and demonstrated how additional edges can be systematically added in a computationally efficient manner.

Chapter 3

Edge addition in directed consensus networks

We study the problem of performance enhancement in stochastically-forced directed consensus networks by adding edges to an existing topology. We formulate the problem as a feedback control design and represent the links as the elements of the controller graph Laplacian matrix. The topology design of the controller network can be cast as an $\ell_1$-regularized version of the $\mathcal{H}_2$ optimal control problem. The goal is to optimize the performance of the network by selecting a controller graph with low communication requirements. To deal with the structural constraints that arise from the absence of absolute measurements, we introduce a coordinate transformation to eliminate the average mode and assure convergence of all states to the average of the initial node values. By exploiting the structure of the optimization problem, we develop a customized algorithm based on the alternating direction method of multipliers to design a sparse controller network that improves the performance of the closed-loop system.

3.1 Introduction

In this chapter, we consider the problem of adding edges to a weakly connected directed consensus network in order to improve performance. In particular, we are interested in designing sparse communication graphs that strike a balance between the variance amplification of the closed-loop system and the number of communication links. In general, this is a combinatorial search problem and is non-convex. In undirected networks, convex relaxations or greedy algorithms have been introduced in order to optimize the algebraic connectivity of the network [34,65] or network coherence [35,133,134] by adding edges from a given set of edges.
In the absence of disturbances, a strongly connected balanced network converges to the average of the initial node values [2]. However, in the presence of additive stochastic disturbances, the network average experiences a random walk. Thus, the control objective is to minimize the mean-square deviation from average. To cope with structural constraints that arise from the absence of absolute information exchange, we introduce a coordinate transformation to eliminate the average mode. While it is desired to promote sparsity of the controller graph in the physical domain, the $\mathcal{H}_2$ optimal control problem is solved in the transformed set of coordinates where the average mode is eliminated.

We formulate the edge addition problem for directed networks with the objective of optimizing the closed-loop coherence. We consider the scenario in which the plant graph is unbalanced. Structural requirements that the closed-loop graph Laplacian be weakly connected and balanced make the optimal control problem challenging. We approach this problem using the sparsity-promoting optimal control framework [41,42,71]. In our formulation, performance is captured by the $\mathcal{H}_2$ norm of the closed-loop network and $\ell_1$ regularization is introduced as a proxy for inducing sparsity in the controller graph [35,133]. By exploiting the structure of the problem, we develop a customized algorithm based on the alternating direction method of multipliers (ADMM) [113].

The rest of the chapter is structured as follows. In Section 3.2, we provide the necessary background on graph theory for directed networks, define consensus problems on graphs, and introduce an appropriate change of coordinates. In Section 3.3, we formulate the optimal control problem. In Section 3.4, we develop a customized algorithm based on ADMM. In Section 3.5, we use our algorithm for sparse edge addition to directed consensus networks. Finally, in Section 3.6, we provide concluding remarks and highlight future directions.

3.2 Motivation and background

In this section, after providing the necessary background, we describe the dynamics of consensus networks, the inherent structural constraints, and the challenges in the related design problems. We then introduce a change of coordinates that enables us to overcome these restrictions.

3.2.1 Directed graphs

Herein, we provide a brief overview of the background material; for additional information, see [2].

Weighted digraph: a weighted directed graph is denoted by $\mathcal{D} = (\mathcal{V}, \mathcal{E}, w)$, where $\mathcal{V}$ and $\mathcal{E}$ are the sets of nodes and edges and $w$ is the vector of edge weights. The edges are directed and the value $w_{ij}$ is the edge weight between nodes $v_i$ and $v_j$. If the ordered pair $(v_i, v_j) \in \mathcal{E}$, then $v_i$ is the tail of the edge and $v_j$ is its head.

Strongly connected digraph: a digraph is strongly connected if there is a directed path between every pair of nodes.

Weakly connected digraph: a digraph is weakly connected if it is a connected graph when the edge directions are omitted.

Adjacency matrix: the $ij$th element is given by

$A_{ij} = \begin{cases} w_{ij}, & (v_j, v_i) \in \mathcal{E} \\ 0, & \text{otherwise.} \end{cases}$

Degree matrix: a diagonal matrix with $D_{ii} = d_{\mathrm{in}}(v_i)$, where $d_{\mathrm{in}}(v_i)$ is the weighted in-degree of node $v_i$,

$d_{\mathrm{in}}(v_i) = \sum_j w_{ij}$

and the sum is over all $j$ with $(v_j, v_i) \in \mathcal{E}$.

Laplacian matrix: the matrix $L = D - A$, so that $L\mathbf{1} = 0$ by definition.

Balanced digraph: a digraph is balanced if the weighted in-degree and the weighted out-degree of each node are equal.
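These definitions translate directly into a few lines of Matlab; the weighted adjacency matrix below is an arbitrary illustrative digraph, and the final line checks the balance condition $\mathbf{1}^T L = 0$:

% In-degree Laplacian L = D - A of a weighted digraph; A(i,j) = w_ij
% for an edge from v_j to v_i.
A = [0 0 2; 1 0 0; 0 3 0];          % arbitrary weighted digraph
L = diag(sum(A,2)) - A;             % degree matrix minus adjacency
assert(norm(L*ones(3,1)) < 1e-12)   % L*1 = 0 holds by construction
isBalanced = norm(ones(1,3)*L) < 1e-12   % balanced iff 1'*L = 0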
3.2.2 Feedback design in consensus networks

We consider the consensus dynamics

$\dot{x} = -L_p\, x + u + d$

where $d$ and $u$ are the disturbance and control inputs, $x$ is the vector of the states, and $L_p \in \mathbb{R}^{n\times n}$ is the graph Laplacian of the plant network. We assume that the plant network is weakly connected and formulate the edge addition problem as a feedback design problem with

$u = -L_x\, x$

where $L_x \in \mathbb{R}^{n\times n}$ is the weighted directed Laplacian matrix of the controller. This matrix encodes the locations, directions, and weights of the edges. Note that each nonzero element of $L_x$ can indicate either the addition of an edge or the re-tuning of an existing edge gain. The closed-loop system is given by

$\dot{x} = -(L_p + L_x)\,x + d.$   (3.1a)

Our goal is to optimally design the feedback-gain matrix $L_x$ in order to achieve the desired tradeoff between controller sparsity and network performance. The performance is quantified by the steady-state variance amplification of the stochastically-forced network, from the white-in-time input $d$ to the performance output $z$ that quantifies the deviation from consensus and the control effort,

$z = \begin{bmatrix} Q^{1/2} \\ -R^{1/2} L_x \end{bmatrix} x.$   (3.1b)

Here, the matrices $Q = Q^T \succeq 0$ and $R \succ 0$ are the state and control weights, respectively, and $\succ$ ($\succeq$) signifies positive definiteness (semi-definiteness) of a matrix.

In consensus networks, each node updates its state using relative information exchange with its neighbors. In the presence of white noise, the average mode $\bar{x} = (1/n)\mathbf{1}^T x$ experiences a random walk and its variance increases linearly with time. A key property of the Laplacian matrix of the controller is $L_x\mathbf{1} = 0$. Since our primary control objective is to achieve consensus, only differences between node values are penalized in the performance output. Therefore, the state weight matrix $Q$ has a zero eigenvalue with the corresponding eigenvector of all ones, i.e., $Q\mathbf{1} = 0$. Furthermore, we choose $Q$ to be positive definite on the orthogonal complement of the subspace $\mathrm{span}\,(\mathbf{1})$, $Q + (1/n)\mathbf{1}\mathbf{1}^T \succ 0$.

The following lemma summarizes the well-known conditions for achieving consensus in the absence of disturbances in directed networks [2].

Lemma 4. The agreement protocol over a digraph reaches the average consensus, $(1/n)\mathbf{1}\mathbf{1}^T x(0)$, for every initial condition if and only if the digraph is weakly connected and balanced.

Thus, for a weakly connected $L_p$ which is not necessarily balanced, it is required that the closed-loop graph Laplacian $L_p + L_x$ be balanced, which amounts to

$\mathbf{1}^T\,(L_p + L_x) = 0.$   (3.2)

The problem of designing a controller graph that provides a desired tradeoff between the performance index $J$ of the network and the sparsity of the controller $L_x$ can be formulated as

minimize over $L_x$:  $J(L_x) + \gamma\,\|W \circ L_x\|_1$
subject to:  $\mathbf{1}^T\,(L_x + L_p) = 0$
  $L_x\,\mathbf{1} = 0.$   (3.3)

The positive scalar $\gamma$ is the sparsity-promoting parameter that characterizes the tradeoff between network performance and sparsity of the controller, and $\circ$ denotes elementwise multiplication. The first condition guarantees asymptotic consensus to the initial network-average value in the absence of disturbances, and the second condition ensures that the rows of the controller graph Laplacian sum to zero. We guarantee that the closed-loop graph Laplacian remains weakly connected via a proper step-size selection in the algorithm. In what follows, we quantify the network performance using the $\mathcal{H}_2$ norm. The weighted $\ell_1$ norm [125] of $L_x$ is a convex approximation of the cardinality function.
In each iteration, each element of the weight matrix, $W(i,j)$, is chosen to be inversely proportional to the magnitude of $L_x(i,j)$ in the previous iteration,

$W(i,j) = 1/(|L_x(i,j)| + \varepsilon).$   (3.4)

This puts larger emphasis on smaller optimization variables, where a small positive parameter $\varepsilon$ ensures that $W(i,j)$ is well-defined.

3.2.3 Elimination of the average mode

Since both $Q$ and $L_x$ have a zero eigenvalue associated with the vector of all ones, the average mode $\bar{x}$ is not observable from the output $z$. In order to eliminate the average mode, we introduce the following change of coordinates

$\begin{bmatrix} \psi \\ \bar{x} \end{bmatrix} = \begin{bmatrix} U^T \\ \mathbf{1}^T/n \end{bmatrix} x$

where $U \in \mathbb{R}^{n\times(n-1)}$ is a full-rank matrix whose columns span the orthogonal complement of $\mathbf{1}$. Using the properties of the matrix $U$,

$U^T U = I, \quad U U^T = I - (1/n)\mathbf{1}\mathbf{1}^T, \quad U^T \mathbf{1} = 0$

we have

$\begin{bmatrix} \dot{\psi} \\ \dot{\bar{x}} \end{bmatrix} = \begin{bmatrix} -\,U^T (L_p + L_x)\, U & 0 \\ 0 & 0 \end{bmatrix} \begin{bmatrix} \psi \\ \bar{x} \end{bmatrix} + \begin{bmatrix} U^T \\ (1/n)\mathbf{1}^T \end{bmatrix} d.$

In the new coordinates, the performance output is given by

$z = \begin{bmatrix} Q^{1/2}\, U & 0 \\ -\,R^{1/2} L_x\, U & 0 \end{bmatrix} \begin{bmatrix} \psi \\ \bar{x} \end{bmatrix}.$

The network average obeys $\dot{\bar{x}} = (1/n)\mathbf{1}^T d$ and thus experiences a random walk, while the vector $\psi$ quantifies the deviation of the states from the average. Since our control objective is to minimize the deviation from average in the nodes, the average mode is not of interest. Therefore, the minimal representation of the system, containing only the state $\psi$, is given by

$\dot{\psi} = -\,U^T (L_p + L_x)\, U\,\psi + U^T d$
$z = \begin{bmatrix} Q^{1/2}\, U \\ -\,R^{1/2} L_x\, U \end{bmatrix} \psi.$   (3.5)

In order to guarantee that the closed-loop graph Laplacian $L_p + L_x$ is balanced, we introduce the following change of variables

$L_x = U F U^T - (1/n)\mathbf{1}\mathbf{1}^T L_p \;\Longleftrightarrow\; F = U^T L_x\, U.$   (3.6)

Note that the main contribution that differentiates this work from previous results [133] is that we consider a weakly-connected directed unbalanced network and require the closed-loop network to be balanced. The problem of adding edges to a directed network has not been addressed in [133]. Moreover, the framework presented there cannot be used to solve the current constrained problem due to an additional bias term. In our framework, a new change of variables is introduced to accommodate the structural constraints.

Using the properties of the matrix $U$, the constraints (3.2) and $L_x\mathbf{1} = 0$ are automatically satisfied. Equation (3.6) demonstrates how we can move back and forth between the two variables $F$ and $L_x$. Substituting $L_x$ given by (3.6) into (3.5), we can write the state-space representation of the closed-loop system as

$\dot{\psi} = -\,(U^T L_p U + F)\,\psi + U^T d$
$z = \begin{bmatrix} Q^{1/2}\, U \\ -\,R^{1/2}\,\big(U F - (1/n)\mathbf{1}\mathbf{1}^T L_p U\big) \end{bmatrix} \psi.$   (3.7)

Next, we formulate the optimal control problem and propose a framework for designing the sparse controller $L_x$.

3.3 Topology design for directed networks

In this section, we approach the problem of topology design as a regularized optimal control problem.

3.3.1 Sparsity-promoting optimal control problem

The square of the $\mathcal{H}_2$ norm of the transfer function from $d$ to $z$,

$\|H\|_2^2 = J(F) = \mathrm{trace}\,(P(F))$

quantifies the variance amplification of the closed-loop network (3.7). Here, $P(F)$ is the closed-loop observability Gramian, which is the solution of the Lyapunov equation

$(U^T L_p U + F)^T P + P\,(U^T L_p U + F) = C^T C$   (3.8)

where

$C = \begin{bmatrix} Q^{1/2}\, U \\ -\,R^{1/2}\,\big(U F - (1/n)\mathbf{1}\mathbf{1}^T L_p U\big) \end{bmatrix}.$

The control design problem (3.3) takes the form

minimize over $F,\,L_x$:  $J(F) + \gamma\,\|W \circ L_x\|_1$
subject to:  $U F U^T - L_x - (1/n)\mathbf{1}\mathbf{1}^T L_p = 0.$   (3.9)

In (3.9), the Laplacian matrix $L_x$ and the matrix $F$ are the optimization variables; the problem data are the plant graph Laplacian $L_p$, the state and control weights $Q$ and $R$, and the positive regularization parameter $\gamma$.
The matrix $W$ is the weight matrix that imposes a penalty on the magnitudes of the elements of $L_x$. With the problem formulation (3.9), we are able to minimize the $\mathcal{H}_2$ norm of the network, $J(F)$, in the transformed coordinates where the average mode is eliminated, while promoting sparsity of $L_x$ in the physical domain. For $\gamma = 0$, the solution to (3.9) is typically given by a matrix $L_x$ with all non-zero components. As the regularization parameter $\gamma$ increases, the number of non-zero elements in the controller graph decreases.

3.3.2 Structured optimal control problem

After the structure of the controller graph Laplacian $L_x$ has been designed, we fix the sparsity pattern $\mathcal{S}$ and then solve the problem

minimize over $F$:  $J(F)$
subject to:  $U F U^T - (1/n)\mathbf{1}\mathbf{1}^T L_p \in \mathcal{S}$   (3.10)

whose solution provides the optimal controller graph Laplacian with the desired structure. This optimization problem is obtained by setting $\gamma = 0$ in (3.9) and adding the sparsity pattern $\mathcal{S}$. This "polishing" or "debiasing" step is used to improve the performance relative to the solution of the sparsity-promoting optimal control problem (3.9).

3.4 An ADMM-based algorithm

We next exploit the structure of the constrained optimization problem (3.9) and develop a customized algorithm based on ADMM. The augmented Lagrangian associated with (3.9) is given by

$\mathcal{L}_\rho(F, L_x, \Lambda) = J(F) + \gamma\,\|W \circ L_x\|_1 + \big\langle \Lambda,\; U F U^T - L_x - (1/n)\mathbf{1}\mathbf{1}^T L_p \big\rangle + \frac{\rho}{2}\,\big\|U F U^T - L_x - (1/n)\mathbf{1}\mathbf{1}^T L_p\big\|_F^2$   (3.11)

where the matrix $\Lambda$ is the Lagrange multiplier, $\rho$ is a positive scalar, and $\|\cdot\|_F$ is the Frobenius norm. The ADMM algorithm consists of the following steps at each iteration:

$F^{k+1} = \underset{F}{\mathrm{argmin}}\;\; \mathcal{L}_\rho(F,\, L_x^k,\, \Lambda^k)$
$L_x^{k+1} = \underset{L_x}{\mathrm{argmin}}\;\; \mathcal{L}_\rho(F^{k+1},\, L_x,\, \Lambda^k)$
$\Lambda^{k+1} = \Lambda^k + \rho\,\big(U F^{k+1} U^T - L_x^{k+1} - (1/n)\mathbf{1}\mathbf{1}^T L_p\big).$

The algorithm terminates when $\|L_x^{k+1} - L_x^k\|_F \leq \epsilon_1$ and $\|U F^{k+1} U^T - L_x^{k+1} - (1/n)\mathbf{1}\mathbf{1}^T L_p\|_F \leq \epsilon_2$, where $\epsilon_1$ and $\epsilon_2$ are the desired tolerances. Note that the smooth part $J(F)$ and the non-smooth part $\gamma\,\|W \circ L_x\|_1$ now operate in different coordinates; therefore, descent algorithms can be utilized in the $F$-minimization step. Moreover, the $\ell_1$ norm is a separable function with respect to each element of $L_x$; thus, we can determine the solution to the $L_x$-minimization step analytically. In the Lagrange-multiplier update step, we use the step-size equal to $\rho$ in order to guarantee dual feasibility [113].

3.4.1 F-minimization step

Using the properties of the matrices $L_p$ and $U$, we bring the $F$-minimization step to the form

$F^{k+1} = \underset{F}{\mathrm{argmin}}\;\; J(F) + \frac{\rho}{2}\,\|F - S^k\|_F^2$

where

$S^k = U^T\,\big(L_x^k + (1/n)\mathbf{1}\mathbf{1}^T L_p - (1/\rho)\Lambda^k\big)\,U = U^T\,\big(L_x^k - (1/\rho)\Lambda^k\big)\,U.$

We employ the Anderson-Moore method to solve this problem [135]. This algorithm converges faster than the gradient method, and its implementation is simpler than that of Newton's method [42].

We next summarize the first- and second-order derivatives of the objective function $J$. The second-order approximation of the smooth part of the objective function $J$ around $\bar{F}$ is given by

$J(\bar{F} + \tilde{F}) \approx J(\bar{F}) + \big\langle \nabla J(\bar{F}),\; \tilde{F} \big\rangle + \tfrac{1}{2}\,\big\langle \nabla^2 J(\bar{F}; \tilde{F}),\; \tilde{F} \big\rangle.$

For related developments we refer the reader to [135].

Proposition 5. The gradient and the Hessian of $J$ at $\bar{F}$ are determined by

$\nabla J(\bar{F}) = 2\,\big(\bar{R}\,\bar{F} - (1/n)\,U^T R\,\mathbf{1}\mathbf{1}^T L_p U - P\big)\,L$
$\nabla^2 J(\bar{F}; \tilde{F}) = 2\,\big[\big(\bar{R}\,\tilde{F} - \tilde{P}\big)\,L + \big(\bar{R}\,\bar{F} - (1/n)\,U^T R\,\mathbf{1}\mathbf{1}^T L_p U - P\big)\,\tilde{L}\big]$   (3.12)

where $\bar{R} = U^T R\, U$ and the matrix $P$, given by (3.8), is the observability Gramian. The matrix $L$ is the controllability Gramian, determined by

$(U^T L_p U + \bar{F})\,L + L\,(U^T L_p U + \bar{F})^T = U^T U$   (3.13)

where $U^T U = I$ is the identity.
Setting $\nabla_F \mathcal{L}_\rho = \nabla_F J + \rho\,(F - S^k) = 0$, we obtain

    $2\,\big( \bar{R} F - (1/n)\, U^T R \mathbf{1}\mathbf{1}^T L_p U - P \big)\, L + \rho\,(F - S^k) = 0.$    (3.14)

The necessary conditions for the optimality of $\mathcal{L}_\rho(F, L_x, \Lambda)$ with respect to $F$ are given by (3.8), (3.13), and (3.14). The Anderson-Moore method solves the $F$-minimization step iteratively. In each iteration, the algorithm starts with a stabilizing $F$ and solves two Lyapunov equations and one Sylvester equation: it first solves the Lyapunov equations (3.8) and (3.13) with fixed $F$ to obtain the controllability and observability Gramians $L$ and $P$, respectively, and then solves the Sylvester equation (3.14) for $F$ with $L$ and $P$ fixed. Newton's method, which utilizes (3.12), is then used to determine a descent direction between two consecutive steps, and a line search strategy selects a step-size that guarantees convergence to a stationary point and closed-loop stability.

3.4.2 Lx-minimization step

Using the expression (3.11) for the augmented Lagrangian, the $L_x$-minimization step can be written as

    $L_x^{k+1} = \mathrm{argmin}_{L_x} \; \gamma\,\|W \circ L_x\|_1 + \dfrac{\rho}{2}\, \|L_x - T^k\|_F^2$

where $T^k = U F^{k+1} U^T - (1/n)\mathbf{1}\mathbf{1}^T L_p + (1/\rho)\Lambda^k$. The solution is given by soft-thresholding,

    $L_x^{k+1}(i,j) = \begin{cases} \big(1 - a/|T^k(i,j)|\big)\, T^k(i,j), & |T^k(i,j)| > a \\ 0, & \text{otherwise} \end{cases}$

where $a = (\gamma/\rho)\, W(i,j)$.

Convergence analysis of ADMM for convex problems can be found in [113]. For nonconvex problems, the quadratic term in the augmented Lagrangian locally convexifies the problem for sufficiently large $\rho$; a recent convergence analysis of ADMM for a family of nonconvex problems can be found in [136]. Computational results also show that ADMM works well when $\rho$ is sufficiently large [137, 138].
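A minimal MATLAB sketch of one ADMM sweep follows (hypothetical parameters; F, U, Lp, and n come from the sketches above, and the F-step, which would invoke the Anderson-Moore routine, is kept fixed for brevity); the $L_x$-step is the elementwise soft-thresholding derived in Section 3.4.2.

    rho = 10;  gam = 0.1;  W = ones(n);  Lam = zeros(n);
    Tk  = U*F*U' - (1/n)*ones(n,n)*Lp + Lam/rho;        % T^k
    a   = (gam/rho)*W;                                  % elementwise thresholds
    Lx  = sign(Tk).*max(abs(Tk) - a, 0);                % soft-thresholding
    Lam = Lam + rho*(U*F*U' - Lx - (1/n)*ones(n,n)*Lp); % multiplier update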
3.5 Computational experiments

3.5.1 Synthetic example

In this section, we employ our customized ADMM-based algorithm to add a certain number of edges to a given directed network. The plant network is a randomly generated graph with $n$ nodes in which the weight $l_i \in (0,1)$ of the $i$th edge is drawn at random. We set $R = I$ and choose the state weight that penalizes the mean-square deviation from the network average, $Q = I - (1/n)\mathbf{1}\mathbf{1}^T$. We solve the sparsity-promoting optimal control problem (3.9) for the controller graph for 100 logarithmically-spaced values of $\gamma \in [0.001, 3]$, using the path-following iterative reweighted algorithm as a proxy for inducing sparsity [125]. The weights are set inversely proportional to the magnitude of the solution $L_x$ to (3.9) at the previous value of $\gamma$. We choose $\varepsilon = 10^{-3}$ in (3.4) and initialize the weights for $\gamma = 0.01$ using the solution to (3.9) with the LQR state-feedback matrix. Topology design is followed by a polishing step that computes the optimal weights of the identified edges; see Section 3.3.2.

3.5.2 Performance improvement

The randomly generated plant graph in Fig. 3.1a is a directed graph with $n = 20$ nodes. The plant graph is weakly connected but unbalanced; thus, the nodes do not converge to the average of their initial values. In order to reach consensus and improve the performance, edges must be added to the plant network. Figure 3.1 illustrates that the controller graph becomes sparser as $\gamma$ increases. The number of edges added to the network equals the number of nonzero off-diagonal elements of the controller. Specifically, for $\gamma = 0.001$, the number of nonzero elements in the controller graph is 84, among which 74 correspond to edges added to the original network and the remaining 10 are diagonal elements. By increasing $\gamma$ to 0.0788, there are 33 nonzero elements in the controller graph, and only 25 edges are added to the plant network. It is noteworthy that the Laplacian matrix of the controller graph cannot be the zero matrix: since the plant network is an unbalanced graph, a nonzero controller Laplacian is needed to make the closed-loop graph Laplacian balanced.

Figure 3.2 shows that the closed-loop performance deteriorates and the number of nonzero elements in the controller graph Laplacian $L_x$ decreases as $\gamma$ increases. As shown in Fig. 3.2b, relative to the optimal LQR controller $L_c$, the $\mathcal{H}_2$ performance loss decreases as the number of nonzero elements in the controller graph Laplacian $L_x$ increases. In particular, for $\gamma = 3$, there are only 24 nonzero elements in the controller graph, which corresponds to only 16 added edges. The identified sparse controller in this case uses only a fraction of the edges relative to the optimal LQR controller, card$(L_x)$/card$(L_c) = 29.629\%$, and achieves a performance loss of $15.118\%$, i.e., $(J - J_c)/J_c = 15.118\%$.

Figure 3.1: Topologies of the plant (blue lines) and controller graphs (red lines) for a randomly generated weighted network: (a) plant graph; (b) $\gamma = 0.001$; (c) $\gamma = 0.0788$; (d) $\gamma = 3$.

Figure 3.2: (a) Sparsity level card$(L_x)$/card$(L_c)$; (b) performance degradation $(J - J_c)/J_c$; and (c) the optimal tradeoff between the performance degradation and the sparsity level of the optimal sparse $L_x$ compared to the optimal centralized controller $L_c$. The results are obtained for the weighted random plant network with topology shown in Fig. 3.1a.

3.6 Concluding remarks

We consider the $\ell_1$-regularized version of the optimal control problem for adding edges to directed consensus networks in order to reach consensus and optimally enhance performance. Although the given plant network is not necessarily balanced, in order to reach agreement we restrict the closed-loop graph Laplacian to be balanced. The performance is measured by the $\mathcal{H}_2$ norm from the disturbance to the output of the closed-loop network. In general, this is a combinatorial search problem; we use the sparsity-promoting optimal control framework and introduce weighted $\ell_1$ regularization as a proxy for promoting sparsity of the controller. By exploiting the structure of the problem, we develop an algorithm based on ADMM. An example is provided to demonstrate the utility of the developed algorithm.

Part II

Network estimation and inference

Chapter 4

Topology identification of undirected consensus networks via sparse inverse covariance estimation

We study the problem of identifying a sparse interaction topology using the sample covariance matrix of the state of the network. Specifically, we assume that the statistics are generated by a stochastically-forced undirected first-order consensus network with unknown topology. We propose a method for identifying the topology using a regularized Gaussian maximum likelihood framework, in which the $\ell_1$ regularizer is introduced as a means for inducing a sparse network topology. The proposed algorithm employs a sequential quadratic approximation in which the Newton direction is obtained via a coordinate descent method. We provide several examples to demonstrate the good practical performance of the method.
4.1 Introduction

In this chapter, we develop a convex optimization framework for identifying a sparse interaction topology using the sample covariance matrix of the state of the network. Our framework utilizes an $\ell_1$-regularized Gaussian maximum likelihood estimator. Because of its strong theoretical guarantees, this approach has been commonly used for recovering sparse inverse covariance matrices [139-143]. We utilize the structure of undirected networks to develop an efficient second-order method based on a sequential quadratic approximation. As in [124, 144], we compute the Newton direction using a coordinate descent method [145-147] that employs an active set strategy. The main point of departure is the formulation of a convex optimization problem that respects the particular structure of undirected consensus networks. We also use reweighted $\ell_1$-heuristics as an effective means for approximating the nonconvex cardinality function [125], thereby improving performance relative to standard $\ell_1$ regularization.

Our presentation is organized as follows. In Section 4.2, we formulate the problem of topology identification using sparse inverse covariance matrix estimation. In Section 4.3, we develop a customized second-order algorithm for solving the $\ell_1$-regularized Gaussian maximum likelihood estimation problem. In Section 4.4, we use computational experiments to illustrate features of the method. In Section 4.5, we conclude with a brief summary.

4.2 Problem formulation

In this section, we provide background material on stochastically forced undirected first-order consensus networks and formulate the problem of topology identification using a sample covariance matrix. The inverse of a given sample covariance matrix can be estimated using the Gaussian maximum likelihood estimator; for undirected consensus networks, we show that the estimated matrix is related to the graph Laplacian of the underlying network.

Our objective is to identify the underlying graph structure of a stochastically forced undirected consensus network with a known number of nodes by sampling its second-order statistics. In what follows, we relate the problem of topology identification for consensus networks to the inverse covariance matrix estimation problem.

4.2.1 Undirected consensus networks

We consider an undirected consensus network

    $\dot{\psi} = -L_x\, \psi + d$    (4.1)

where $\psi \in \mathbb{R}^n$ represents the state of the $n$ nodes, $d$ is the disturbance input, and the symmetric $n \times n$ matrix $L_x$ represents the graph Laplacian. The matrix $L_x$ is restricted to have an eigenvalue at zero corresponding to the eigenvector of all ones, $L_x \mathbf{1} = 0$. This requirement is implicitly satisfied by expressing $L_x$ in terms of the incidence matrix $E$,

    $L_x := \sum_{l=1}^m x_l\, \xi_l \xi_l^T = E\, \mathrm{diag}(x)\, E^T,$

where $\mathrm{diag}(x)$ is a diagonal matrix containing the vector of edge weights $x \in \mathbb{R}^m$. Each column $\xi_l = e_i - e_j$ of $E$, where $e_i \in \mathbb{R}^n$ is the $i$th basis vector, represents an edge connecting nodes $i$ and $j$. The $m$ columns of $E$ specify the edges that may be used to construct the consensus network; for a complete graph, there are $m = n(n-1)/2$ potential edges.

In order to achieve consensus in the absence of disturbances, the closed-loop graph Laplacian $L_x$ is required to be positive definite on the orthogonal complement of the vector of all ones, $\mathbf{1}^\perp$ [2]. This amounts to positive definiteness of the "strengthened" graph Laplacian of the closed-loop network,

    $X := L_x + (1/n)\mathbf{1}\mathbf{1}^T \succ 0.$    (4.2)

Clearly, since $L_x \mathbf{1} = 0$, we have $X\mathbf{1} = \mathbf{1}$.
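A short MATLAB sketch of this construction is given below; it builds the incidence matrix of all $m = n(n-1)/2$ candidate edges (note that n is redefined here for this chapter's setting), forms the Laplacian, and checks the strengthened Laplacian condition (4.2).

    n = 12;  m = nchoosek(n,2);
    E = zeros(n,m);  l = 0;
    for i = 1:n-1
      for j = i+1:n
        l = l + 1;  E(i,l) = 1;  E(j,l) = -1;   % column xi_l = e_i - e_j
      end
    end
    x  = rand(m,1);               % positive edge weights (hypothetical)
    Lx = E*diag(x)*E';            % graph Laplacian, Lx*1 = 0
    X  = Lx + ones(n,n)/n;        % strengthened Laplacian (4.2)
    disp(min(eig(X)) > 0)         % connected network => X is positive definite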
Consensus networks attempt to compute the network average; it is therefore of interest to study the deviations from the average,

    $\tilde{\psi}(t) := \psi(t) - \mathbf{1}\bar{\psi}(t) = \big( I - (1/n)\mathbf{1}\mathbf{1}^T \big)\,\psi(t),$

where $\bar{\psi}(t) := (1/n)\mathbf{1}^T \psi(t)$ is the network average, which corresponds to the zero eigenvalue of $L_x$. From (4.1), it follows that the dynamics of the deviation from average are

    $\dot{\tilde{\psi}} = -L_x\, \tilde{\psi} + \big( I - (1/n)\mathbf{1}\mathbf{1}^T \big)\, d.$

The steady-state covariance of $\tilde{\psi}$,

    $P := \lim_{t \to \infty} \mathbb{E}\big( \tilde{\psi}(t)\, \tilde{\psi}^T(t) \big),$

is given by the solution to the algebraic Lyapunov equation

    $L_x P + P L_x = I - (1/n)\mathbf{1}\mathbf{1}^T.$

For connected networks, the unique solution is given by

    $P = \tfrac{1}{2}\, L_x^\dagger = \tfrac{1}{2}\big( (L_x + (1/n)\mathbf{1}\mathbf{1}^T)^{-1} - (1/n)\mathbf{1}\mathbf{1}^T \big) = \tfrac{1}{2}\big( X^{-1} - (1/n)\mathbf{1}\mathbf{1}^T \big),$    (4.3)

where $(\cdot)^\dagger$ denotes the pseudo-inverse of a matrix. Thus, the inverse of the steady-state covariance matrix of the deviation from the network average is determined by the strengthened graph Laplacian $X$ of the consensus network.

4.2.2 Topology identification

A sparse precision matrix can be obtained as the solution to the regularized maximum log-likelihood problem [141]

    minimize_X  $-\log\det(X) + \mathrm{trace}(SX) + \gamma\, \|X\|_1$
    subject to  $X \succ 0$    (4.4)

where $S$ is the sample covariance matrix, $\gamma$ is a positive regularization parameter, and $\|X\|_1 := \sum_{i,j} |X_{ij}|$ is the $\ell_1$ norm of the matrix $X$. The $\ell_1$ norm is introduced as a means for inducing sparsity in the inverse covariance matrix, where a zero element implies conditional independence. This problem has received significant attention in recent years [124, 139, 141, 144, 148-151].

In this work, we establish a relation between inverse covariance matrix estimation and the problem of topology identification of an undirected consensus network. We are interested in identifying a sparse topology that yields a close approximation of a given sample covariance matrix. This is achieved by solving the following problem,

    minimize_x  $J(x) + \gamma \sum_{l=1}^m |x_l|$
    subject to  $E\,\mathrm{diag}(x)\,E^T + (1/n)\mathbf{1}\mathbf{1}^T \succ 0$    (NI)

where

    $J(x) = -\log\det\big( E\,\mathrm{diag}(x)\,E^T + (1/n)\mathbf{1}\mathbf{1}^T \big) + \mathrm{trace}\big( S\, E\,\mathrm{diag}(x)\, E^T \big).$

Relative to [124, 139, 144, 148, 149], our optimization problem has additional structure induced by the dynamics of undirected consensus networks. The network identification problem (NI) is a convex but nonsmooth problem in which the optimization variable is the vector of edge weights $x \in \mathbb{R}^m$; the problem data are the sample covariance matrix $S$ and the incidence matrix $E$, which is selected to contain all possible edges. The $\ell_1$ norm of $x$ is a convex relaxation of the cardinality function, introduced to promote sparsity, and the positive parameter $\gamma$ specifies the emphasis on sparsity versus matching the sample covariance matrix $S$. For $\gamma = 0$, the solution to (NI) is typically a vector $x$ with all nonzero elements. The positive definiteness constraint comes from (4.2); it guarantees a connected closed-loop network and thus asymptotic consensus in the absence of disturbances.

Remark 7. We also use this framework to address the network sparsification problem, in which it is of interest to find a sparse network that generates a close approximation of the covariance matrix of a given dense network. In that case, we choose $E$ to be equal to the incidence matrix of the original network.
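For reference, the smooth part $J(x)$ of (NI) can be evaluated as follows; a minimal sketch with a placeholder sample covariance (E, x, and X are taken from the previous sketch).

    S  = cov(randn(200,n));                % placeholder sample covariance (hypothetical)
    Rc = chol(X);                          % X = Rc'*Rc since X is positive definite
    Jx = -2*sum(log(diag(Rc))) + trace(S*(E*diag(x)*E'));   % J(x) in (NI)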
4.3 Customized algorithm based on sequential quadratic approximation

We next exploit the structure of the optimization problem (NI) to develop an efficient customized algorithm based on a sequential quadratic approximation of the smooth part $J$ of the objective function in (NI). This method benefits from exploiting second-order information about $J$ and from computing the Newton direction using cyclic coordinate descent [145-147] over the set of active variables. A step-size that ensures descent is found via a backtracking line search. Furthermore, by restricting the computations to active search directions, the computational cost is significantly reduced. A similar approach has recently been utilized in a number of applications, including sparse inverse covariance estimation in graphical models [124, 144, 152]. In this work, we have additional structural constraints and use reweighted heuristics in order to achieve sparsity: we use the weighted $\ell_1$ norm as an alternative proxy for promoting sparsity [125] and solve problem (NI) for different values of $\gamma$ using a path-following iterative reweighted algorithm. Topology identification is then followed by a polishing step that debiases the identified edge weights.

4.3.1 Structured problem: debiasing step

In addition to promoting sparsity of the identified edge weights, the $\ell_1$ norm penalizes the magnitude of the nonzero edge weights. In order to gauge the performance of the estimated network topology, once a sparse topology has been identified via (NI), we solve the structured "polishing" or "debiasing" problem that optimizes $J$ over the set of identified edges. To do so, we form a new incidence matrix $\hat{E}$ that contains only those edges identified as nonzero in the solution to (NI) and solve the problem

    minimize_x  $-\log\det\big( \hat{E}\,\mathrm{diag}(x)\,\hat{E}^T + (1/n)\mathbf{1}\mathbf{1}^T \big) + \mathrm{trace}\big( S\, \hat{E}\,\mathrm{diag}(x)\,\hat{E}^T \big)$
    subject to  $\hat{E}\,\mathrm{diag}(x)\,\hat{E}^T + (1/n)\mathbf{1}\mathbf{1}^T \succ 0$

whose solution provides the optimal estimated graph Laplacian with the desired structure.

4.3.2 Gradient and Hessian of J(x)

We next derive the gradient and Hessian of $J$, which can be used to form a second-order Taylor series approximation of $J$ around $x^k$,

    $J(x^k + \tilde{x}) \approx J(x^k) + \nabla J(x^k)^T \tilde{x} + \tfrac{1}{2}\, \tilde{x}^T \nabla^2 J(x^k)\, \tilde{x}.$    (4.5)

Proposition 6. The gradient and the Hessian of $J$ at $x^k$ are

    $\nabla J(x^k) = \mathrm{diag}\big( E^T (S - X^{-1}(x^k))\, E \big), \qquad \nabla^2 J(x^k) = M(x^k) \circ M(x^k)$    (4.6)

where $\circ$ denotes the elementwise (Hadamard) product and

    $X^{-1}(x^k) := \big( E D_{x^k} E^T + (1/n)\mathbf{1}\mathbf{1}^T \big)^{-1}, \qquad M(x^k) := E^T X^{-1}(x^k)\, E.$

Proof. Utilizing the second-order expansion of the log-determinant function, we have

    $J(x^k + \tilde{x}) - J(x^k) \approx \mathrm{trace}\big( E^T (S - X^{-1}(x^k)) E\, D_{\tilde{x}} \big) + \tfrac{1}{2}\, \mathrm{trace}\big( D_{\tilde{x}}\, E^T X^{-1}(x^k) E\, D_{\tilde{x}}\, E^T X^{-1}(x^k) E \big).$

The expressions in (4.6) can be established using a sequence of straightforward algebraic manipulations in conjunction with the commutativity invariance of the trace function and the following properties, which hold for a matrix $T$, a vector $\xi$, and a diagonal matrix $D_\xi$:

    $\mathrm{trace}(T D_\xi) = \xi^T \mathrm{diag}(T), \qquad \mathrm{trace}(D_\xi\, T\, D_\xi\, T) = \xi^T (T \circ T^T)\, \xi.$
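The expressions in Proposition 6 translate directly into code. The following sketch forms the gradient and Hessian of $J$ (hypothetical data; E, X, and S come from the sketches above, and the explicit inverse is acceptable at this small scale).

    Xinv  = inv(X);
    M     = E'*Xinv*E;
    gradJ = diag(E'*(S - Xinv)*E);   % gradient in (4.6)
    hessJ = M.*M;                    % Hessian in (4.6): Hadamard product M o M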
4.3.3 Algorithm

Our algorithm is based on building the second-order Taylor series expansion of the smooth part $J$ of the objective function in (NI) around the current iterate $x^k$. Approximating $J$ in (NI) with (4.5) yields

    minimize_{\tilde{x}}  $\nabla J(x^k)^T \tilde{x} + \tfrac{1}{2}\, \tilde{x}^T \nabla^2 J(x^k)\, \tilde{x} + \gamma\, \|x^k + \tilde{x}\|_1$
    subject to  $E\, \mathrm{diag}(x^k + \tilde{x})\, E^T + (1/n)\mathbf{1}\mathbf{1}^T \succ 0.$    (4.7)

We use the coordinate descent algorithm to determine the Newton direction. Let $\tilde{x}$ denote the current iterate approximating the Newton direction. By perturbing $\tilde{x}$ in the direction of the $i$th standard basis vector $e_i \in \mathbb{R}^m$, i.e., $\tilde{x} + \delta_i e_i$, the objective function in (4.7) becomes

    $\nabla J(x^k)^T (\tilde{x} + \delta_i e_i) + \tfrac{1}{2}\, (\tilde{x} + \delta_i e_i)^T \nabla^2 J(x^k)\, (\tilde{x} + \delta_i e_i) + \gamma\, |x_i^k + \tilde{x}_i + \delta_i|.$

Elimination of constant terms allows us to express (4.7) as

    minimize_{\delta_i}  $\tfrac{1}{2}\, a_i \delta_i^2 + b_i \delta_i + \gamma\, |c_i + \delta_i|$    (4.8)

where $(a_i, b_i, c_i, x_i^k, \tilde{x}_i)$ are the problem data,

    $a_i := e_i^T \nabla^2 J(x^k)\, e_i, \qquad b_i := \big( \nabla^2 J(x^k)\, e_i \big)^T \tilde{x} + e_i^T \nabla J(x^k), \qquad c_i := x_i^k + \tilde{x}_i.$

The explicit solution to (4.8) is given by

    $\delta_i = -c_i + S_{\gamma/a_i}(c_i - b_i/a_i),$

where $S_\kappa(y) = \mathrm{sign}(y)\, \max(|y| - \kappa,\, 0)$ is the soft-thresholding function.

After the Newton direction $\tilde{x}$ has been computed, we determine the step-size via backtracking. This guarantees positive definiteness of the strengthened graph Laplacian and sufficient decrease of the objective function. We use the Armijo rule to find an appropriate step-size $\alpha$ such that $E\,\mathrm{diag}(x^k + \alpha\tilde{x})\,E^T + (1/n)\mathbf{1}\mathbf{1}^T$ is a positive definite matrix and

    $J(x^k + \alpha\tilde{x}) + \gamma\,\|x^k + \alpha\tilde{x}\|_1 \;\le\; J(x^k) + \gamma\,\|x^k\|_1 + \alpha\,\sigma\, \big( \nabla J(x^k)^T \tilde{x} + \gamma\,\|x^k + \tilde{x}\|_1 - \gamma\,\|x^k\|_1 \big)$

where $\sigma \in (0,1)$ is a line search parameter. There are two computational aspects of our algorithm that make it well-suited for large-scale networks.

Active set strategy. We propose an active set prediction strategy as an efficient method for solving problem (NI) for large values of $\gamma$. It provides an effective means for determining which directions need to be updated in the coordinate descent algorithm. The classification of a variable as either active or inactive is based on the value of $x_i^k$ and the $i$th component of the gradient vector $\nabla J(x^k)$. Specifically, the $i$th search direction is inactive only if $x_i^k = 0$ and $|e_i^T \nabla J(x^k)| < \gamma - \epsilon$, where $\epsilon > 0$ is a small number (e.g., $\epsilon = 0.0001$). At each outer iteration, the Newton search direction is obtained by solving the optimization problem over the set of active variables; the size of the active set is small for large values of the regularization parameter $\gamma$.

Memory saving. Computation of $b_i$ in (4.8) requires a single vector inner product between the $i$th column of the Hessian and $\tilde{x}$, which typically takes $O(m)$ operations. To avoid direct multiplication, in each iteration, after finding $\delta_i$, we update the vector $\nabla^2 J(x^k)\, \tilde{x}$ using the correction term $\delta_i\, (E^T X^{-1}\xi_i) \circ (E^T X^{-1}\xi_i)$ and take its $i$th element to form $b_i$; here, $\xi_i$ is the $i$th column of the incidence matrix of the controller graph. This also avoids the need to store the Hessian of $J$, which is an $m \times m$ matrix, thereby leading to significant memory savings. Moreover, the $i$th column of $\nabla^2 J(x^k)$ and the $i$th element of the gradient vector $\nabla J(x^k)$ enter into the expression for $b_i$, while $a_i$ is determined by the $i$th diagonal element of the Hessian matrix $\nabla^2 J(x^k)$. All of these can be obtained directly from $\nabla^2 J(x^k)$ and $\nabla J(x^k)$, which are formed once in each outer iteration without any additional multiplication.

Our problem is closely related to the problem studied in [124]. The objective function there has the form $f(x) = J(x) + g(x)$, where $J$ is smooth over the positive definite cone and $g$ is a separable nondifferentiable function. In our formulation, $J(x)$ is smooth over the set $E\,\mathrm{diag}(x)\,E^T + (1/n)\mathbf{1}\mathbf{1}^T \succ 0$, while the nonsmooth part is the $\ell_1$ norm, which is separable. Thus, convergence can be established using similar arguments: according to [124, Theorems 1 and 2], the quadratic approximation method converges to the unique global optimum of (NI) at a superlinear rate.
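A single pass of the coordinate descent iteration for (4.8) can be sketched as follows (hypothetical data; gradJ, hessJ, x, and m come from the sketches above, and gam denotes $\gamma$). For clarity the sketch stores the full Hessian, whereas the memory-saving variant described above would form its columns on the fly; the running product $\nabla^2 J(x^k)\tilde{x}$ is maintained by rank-one corrections.

    gam = 0.5;  xt = zeros(m,1);      % Newton-direction iterate
    Hxt = hessJ*xt;                   % maintained product (zero at the start)
    for i = 1:m
      ai = hessJ(i,i);
      bi = Hxt(i) + gradJ(i);
      ci = x(i) + xt(i);
      di = -ci + sign(ci - bi/ai)*max(abs(ci - bi/ai) - gam/ai, 0);
      xt(i) = xt(i) + di;
      Hxt   = Hxt + di*hessJ(:,i);    % column correction, no re-multiplication
    end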
The optimality conditions for any $x^\star$ that satisfies $E\,\mathrm{diag}(x^\star)\,E^T + (1/n)\mathbf{1}\mathbf{1}^T \succ 0$ are given by

    $\nabla J(x^\star) + \gamma\, \partial \|x^\star\|_1 \ni 0,$

where $\partial\|x^\star\|_1$ is the subdifferential of the $\ell_1$ norm. This means that, for each $i$,

    $\nabla_i J(x^\star) \in \begin{cases} \{-\gamma\}, & x_i^\star > 0 \\ \{+\gamma\}, & x_i^\star < 0 \\ [-\gamma, \gamma], & x_i^\star = 0. \end{cases}$

The stopping criterion is to check the norm of $\nabla J(x)$ and the sign of $x$ to verify that $x$ is the optimal solution.

4.4 Computational experiments

We next illustrate the performance of our customized algorithm. We have implemented the algorithm in Matlab, and all tests were run on a 3.4 GHz Intel(R) Core(TM) i7-3770 machine with 16 GB RAM. Problem (NI) is solved for different values of $\gamma$ using the path-following iterative reweighted algorithm [125] with $\varepsilon = 10^{-5}$. The initial weights are computed using the solution to (NI) with $\gamma = 0$ (i.e., the optimal centralized vector of edge weights); we then adjust $\varepsilon = 0.001\,\|x\|_2$ at each iteration. Topology identification is followed by the polishing step described in Section 4.3.1. In the figures, black dots represent nodes, blue lines identify the original graph, and red lines denote the edges of the estimated sparse network. In all examples, we set the tolerance for the stopping criterion to $10^{-4}$.

4.4.1 Network identification

We solve the problem of identifying a sparse network from a sample covariance matrix for 500 logarithmically-spaced values of $\gamma \in [0.1, 1000]$. The sample covariance matrix $S$ is obtained by sampling the nodes of the stochastically-forced undirected unweighted network whose topology is shown in Fig. 4.1a. To generate samples, we conducted 20 simulations of system (4.1) forced with zero-mean, unit-variance, band-limited white noise $d$. The sample covariance matrix is averaged over all simulations and asymptotically converges to the steady-state covariance matrix. The incidence matrix $E$ in (NI) contains all possible edges.

Empirically, we observe that the sample covariance matrix converges to the steady-state covariance after about 5 seconds. First, we sample the states after 3 seconds, so the computed sample covariance matrix differs from the true steady-state covariance matrix. For (NI) solved with this problem data, Figs. 4.1b and 4.1c illustrate the topologies of the identified networks for the minimum and maximum values of $\gamma$. In Fig. 4.1d, the blue line shows an edge of the original network that was not recovered by the algorithm for the largest value of $\gamma$, and the red lines show two extra edges in the network estimated for the smallest value of $\gamma$ that were not present in the original graph.

Figure 4.1: The problem of identifying a sparse network from a sample covariance matrix for a network with $n = 12$ nodes: (a) original network; (b) $\gamma = 0.1$; (c) $\gamma = 1000$; (d) distinct edges.

Next, we solve problem (NI) using a sample covariance matrix generated by sampling the node states after 15 seconds; in this case, the sample covariance matrix is closer to the steady-state covariance matrix than in the previous experiment. As shown in Fig. 4.2a, the identified network is exactly the same as the original network for $\gamma = 0.1$. If $\gamma$ is further increased, a network sparser than the original one is identified; see Fig. 4.2b. For $\gamma = 0$, the relative error between the covariance matrix of the estimated network and the sample covariance matrix $S$ is given by

    $\dfrac{\big\| \big( E\,\mathrm{diag}(x_c)\,E^T + (1/n)\mathbf{1}\mathbf{1}^T \big)^{-1} - S \big\|_F}{\|S\|_F} = 0.004\%,$

where $\|\cdot\|_F$ is the Frobenius norm and $x_c$ is the solution to (NI) with $\gamma = 0$.
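The sampling procedure used in these experiments can be mimicked with a minimal Euler-Maruyama simulation of (4.1), sketched below (hypothetical step-size and horizon; Lx is taken from the earlier sketch, and the resulting S replaces the placeholder used above).

    dt = 1e-3;  T = 15;  N = round(T/dt);
    psi = zeros(n,1);  Psi = zeros(n,N);
    for k = 1:N
      psi = psi + dt*(-Lx*psi) + sqrt(dt)*randn(n,1);  % Euler-Maruyama step
      Psi(:,k) = psi;
    end
    S = cov(Psi');                    % sample covariance of the node states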
As $\gamma$ increases, the number of nonzero elements in the vector of edge weights $x$ decreases, and the state covariance matrix moves farther away from the given sample covariance matrix. In particular, in the first experiment with $\gamma = 1000$, only twelve edges are selected. Relative to the centralized network with the vector of edge weights $x_c$, the identified sparse network in this case uses only $18.182\%$ of the edges, i.e., card$(x)$/card$(x_c) = 18.182\%$, and achieves a relative error of $53.841\%$,

    $\dfrac{\|X^{-1} - S\|_F}{\|S\|_F} = 53.841\%,$

with $X = E\,\mathrm{diag}(x)\,E^T + (1/n)\mathbf{1}\mathbf{1}^T$. In the second experiment, the identified sparse network has $16.666\%$ of the potential edges and achieves a relative error of $51.067\%$.

Figure 4.2: The problem of identifying a sparse network from a sample covariance matrix for a network with $n = 12$ nodes: (a) $\gamma = 0.1$; (b) $\gamma = 1000$.

4.4.2 Network sparsification

We next use (NI) to find a sparse representation of a dense consensus network. Inspired by [153], we generate a network by randomly distributing nodes in a $10 \times 10$ box. A pair of nodes can communicate if the Euclidean distance $d(i,j)$ between them is not greater than 5 units, and the edge connecting them has weight $\exp(-d(i,j))$. The incidence matrix of the identified graph is selected to be equal to the incidence matrix of the given graph; i.e., the edges of the sparse network form a subset of the edges of the original network. Figure 4.3a shows a graph with 50 nodes. We use the reweighted $\ell_1$-regularized Gaussian maximum likelihood estimation framework for 200 logarithmically-spaced values of $\gamma \in [0.01, 100]$, followed by the polishing step. The sparse graph topologies identified for different values of $\gamma$ are shown in Figs. 4.3b, 4.3c, and 4.3d; as $\gamma$ increases, the identified graph becomes sparser. For $\gamma = 0.01$, 222 edges are selected for the sparse estimated network, which is only $34.851\%$ of the 637 potential edges. The network with these selected edges achieves a relative error of $29.965\%$,

    $\dfrac{\|X^{-1} - S\|_F}{\|S\|_F} = 29.965\%.$

For the largest value of the sparsity-promoting parameter, $\gamma = 1.138$, only 64 edges are present ($10.047\%$ of the potential edges) in the estimated graph, which yields a relative error of $207.493\%$.

Figure 4.3: The problem of sparsifying a network with $n = 50$ nodes using a sample covariance matrix: (a) original network; (b) $\gamma = 0.01$; (c) $\gamma = 0.102$; (d) $\gamma = 1.138$.

To provide a point of comparison, we compare the performance of our algorithm to a simple truncation scheme in which the edge with the smallest weight whose removal does not disconnect the network is iteratively removed until the network reaches the desired sparsity. After identifying the topology in this way, the polishing step optimizes the weights of the selected set of edges. Figure 4.4 shows the relative errors of our algorithm (red dashed lines) and of the truncation algorithm (blue solid lines) on a log scale against the number of removed edges. As the number of edges in the estimated graph decreases, the relative error of both algorithms increases. The relative error of the network topologies identified by our algorithm is much smaller than the error of those identified by the truncation algorithm; thus, our customized algorithm outperforms the truncation method. In particular, when 573 edges are removed, the relative errors of our customized algorithm and of the truncation algorithm are 2.075 and 8.977, respectively.

Figure 4.4: Relative error $\|X^{-1} - S\|_F / \|S\|_F$ (log scale) versus the number of eliminated edges for the sparsification problem of a network with $n = 50$ nodes using a sample covariance matrix.
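For completeness, a simplified one-pass variant of the truncation baseline can be sketched as follows (hypothetical data and target sparsity; E, x, and n come from the earlier sketches): edges are visited in order of increasing weight and removed whenever the network stays connected.

    xt = x;  target = 2*n;                   % desired number of remaining edges (assumed)
    [~, order] = sort(xt);                   % edges by increasing weight
    for i = order'
      xtrial = xt;  xtrial(i) = 0;
      L = E*diag(xtrial)*E';
      if sum(abs(eig(L)) < 1e-9) == 1        % still connected: single zero eigenvalue
        xt = xtrial;
      end
      if nnz(xt) <= target, break, end
    end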
4.5 Concluding Remarks

We have developed a method for identifying the topology of an undirected consensus network using available statistical data. In order to promote network sparsity, we introduce a convex optimization framework aimed at finding the solution to the $\ell_1$-regularized maximum likelihood problem. This problem is closely related to the problem of sparse inverse covariance estimation that has received significant attention in the literature; in our setup, additional structure arises from the requirement that the data be generated by an undirected consensus network. By exploiting the structure of the problem, we develop an efficient algorithm based on the sequential quadratic approximation method, in which the search direction is determined using coordinate descent with an active set strategy. Several examples have been provided to illustrate the utility of the method and the efficiency of the customized algorithm.

Chapter 5

Topology identification via growing a Chow-Liu tree network

We study the problem of sparse interaction topology identification using the sample covariance matrix of the states of the network. We postulate that the statistics are generated by a stochastically-forced undirected consensus network with unknown topology, in which some of the nodes may have access to their own states. We first propose a method for topology identification using a regularized Gaussian maximum likelihood framework, where the $\ell_1$ regularizer is introduced as a means for inducing a sparse network topology. We also develop a method based on growing a Chow-Liu tree that is well-suited for identifying the underlying structure of large-scale systems. We apply this technique to resting-state functional MRI (FMRI) data, as well as to synthetic datasets, to illustrate the effectiveness of the proposed approach.

5.1 Introduction

In this chapter, we develop a convex optimization algorithm for identifying a sparse interaction topology using the sample covariance matrix of the states of the network. First, we utilize an $\ell_1$-regularized Gaussian maximum likelihood estimator, which has been commonly used for recovering sparse inverse covariance matrices [139-141]. We show that the performance of graphical-LASSO can be improved significantly by imposing additional structure on the problem and using reweighting schemes [125]; in particular, our framework overcomes challenges faced by standard graphical-LASSO and performs well for the case study in [64]. Moreover, inspired by [64], we combine the Chow-Liu algorithm [154] with the techniques for growing networks developed in [106] to identify the underlying structure of an undirected consensus network. Constructing the Chow-Liu tree from statistical data does not require any matrix inversion and is therefore well-suited for large-scale problems. Furthermore, we have developed efficient algorithms [106] for growing connected resistive consensus networks. Herein, we demonstrate that combining these two algorithms yields an efficient method for recovering the network topology of large-scale systems.

Our presentation is organized as follows. In Section 5.2, we discuss the properties of consensus networks and formulate the problem of topology identification using sparse inverse covariance matrix estimation; we also briefly comment on the customized second-order algorithm, based on the proximal Newton method, for solving the $\ell_1$-regularized Gaussian maximum likelihood estimation problem.
In Section 5.3, we develop an algorithm for growing a Chow-Liu tree graph in order to identify a network that yields a close approximation of a given sample covariance matrix. In Section 5.4, we use computational experiments to illustrate features of our method; in particular, we employ our algorithm to identify the underlying functional network of the human brain using FMRI data. Finally, in Section 5.5, we conclude with a brief summary.

5.2 Topology identification via structured graphical-LASSO

The problem of topology identification using a sample covariance matrix for stochastically forced undirected consensus networks was studied in Chapter 4. In consensus networks, each node updates its own state using localized information exchange with its neighbors, where two nodes are neighbors if an edge connects them. Herein, we consider networks in which leader nodes are additionally equipped with absolute information about their own states. In Chapter 4, we showed that the underlying topology of an undirected consensus network can be identified using a Gaussian maximum likelihood estimator; in this chapter, we formulate the topology identification problem for undirected consensus networks with leader nodes and provide two algorithms for solving it.

Consider an undirected network with $n$ nodes governed by

    $\dot{\psi}_i = \sum_{j \in \mathcal{N}_i} x_{ij}\, (\psi_j - \psi_i) - z_i\, \psi_i + w_i,$

where each node $i$ updates its own state using relative information exchange with its neighbors in the set $\mathcal{N}_i$, and certain nodes, the so-called leaders, also have access to their own states. Here, $z_i$ is the weight that the $i$th node places on the absolute measurement, $x_{ij}$ is the edge weight, and $w_i$ is an exogenous stochastic disturbance. The $i$th node is a leader if $z_i \neq 0$ and a follower if $z_i = 0$. By concatenating all states in a vector $\psi \in \mathbb{R}^n$, the consensus dynamics can be written as

    $\dot{\psi} = -(L_x + D_z)\,\psi + w.$    (5.1)

Here, $L_x \in \mathbb{R}^{n \times n}$ is the graph Laplacian of the consensus network and $D_z \in \mathbb{R}^{n \times n}$ is a diagonal matrix with $i$th diagonal entry $z_i$. The incidence matrix $E$ of the underlying graph represents the edges in the network; its $l$th column,

    $\xi_l = e_i - e_j,$

corresponds to the $l$th edge, which connects nodes $i$ and $j$, where $e_i \in \mathbb{R}^n$ is the $i$th basis vector. Using the incidence matrix, the Laplacian matrix $L_x$ can be written as

    $L_x := \sum_{l=1}^m x_l\, \xi_l \xi_l^T = E\, \mathrm{diag}(x)\, E^T,$

where $\mathrm{diag}(x)$ is a diagonal matrix with the vector of edge weights $x \in \mathbb{R}^m$ on its diagonal.

Given that the covariance of the disturbance is a multiple of the identity matrix, the steady-state covariance of $\psi$,

    $\Sigma := \lim_{t \to \infty} \mathbb{E}(\psi\, \psi^T),$

can be computed as the solution to the algebraic Lyapunov equation

    $(L_x + D_z)\,\Sigma + \Sigma\,(L_x + D_z) = 2I.$

Thus, the steady-state covariance can be explicitly computed as

    $\Sigma = (L_x + D_z)^{-1}.$    (5.2)

The inverse of the steady-state covariance matrix of the states of the system is therefore determined by the structure of the underlying graph that connects the $n$ nodes. Hence, by using sampled second-order statistics and estimating the inverse covariance matrix, the underlying topology of an undirected consensus network with leaders can be identified. The problem of sparse covariance estimation has received significant recent attention [124, 144, 150, 151]; relative to these works, our optimization problem has additional structure coming from the dynamics of undirected consensus networks.
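Relation (5.2) is easy to verify numerically; a small sketch (hypothetical leader weights; n and Lx are reused from the Chapter 4 sketches) is given below.

    z = zeros(n,1);  z(5) = 4;            % node 5 is a leader (hypothetical choice)
    Dz = diag(z);
    Sigma = inv(Lx + Dz);                 % steady-state covariance (5.2)
    disp(norm((Lx + Dz)*Sigma + Sigma*(Lx + Dz) - 2*eye(n)))  % Lyapunov residual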
Moreover, compared to the previous chapter, we consider consensus networks with leaders and introduce a new algorithm that is convenient for solving large-scale problems.

We first generalize the algorithm based on the structured graphical-LASSO proposed in Chapter 4 to solve the problem of topology identification in undirected consensus networks with leaders. It is well known that an estimate of the inverse covariance matrix $X$ can be obtained as the solution to the regularized maximum log-likelihood problem [141]

    minimize_X  $-\log\det(X) + \mathrm{trace}(SX) + \gamma\, \|F \circ X\|_1$
    subject to  $X \succ 0$    (5.3)

where $S$ is the sample covariance matrix, $\gamma$ is a positive regularization parameter, $F$ is the weight matrix, and $\|F \circ X\|_1 := \sum_{i,j} F_{ij}\, |X_{ij}|$ is the weighted $\ell_1$ norm of the matrix $X$. By substituting the expression (5.2) for the inverse covariance matrix into (5.3) and using the incidence matrix $E$, the topology of a network that generates a close approximation of a given sample covariance matrix can be identified by solving the following problem,

    minimize_{x,z}  $J(x,z) + \gamma_1 \sum_{l=1}^m f_l\, |x_l| + \gamma_2 \sum_{k=1}^N g_k\, |z_k|$
    subject to  $E\,\mathrm{diag}(x)\,E^T + \mathrm{diag}(z) \succ 0$    (NI)

where

    $J(x,z) = -\log\det\big( E\,\mathrm{diag}(x)\,E^T + \mathrm{diag}(z) \big) + \mathrm{trace}\big( S\,( E\,\mathrm{diag}(x)\,E^T + \mathrm{diag}(z) ) \big).$

Moreover, $f \in \mathbb{R}^m$ and $g \in \mathbb{R}^N$ are vectors of nonnegative weights, and $(\gamma_1, \gamma_2)$ are positive regularization parameters. An effective heuristic for weight selection is the iterative reweighted algorithm, in which the weights $f$ and $g$ are chosen inversely proportional to the magnitudes of $x$ and $z$ at the previous iteration [125]. Problem (NI) is a convex but nonsmooth optimization problem in which the optimization variables are the vector of edge weights $x \in \mathbb{R}^m$ and the vector of leader weights $z \in \mathbb{R}^N$. Relative to our prior work in the previous chapter, the optimization problem has additional structure induced by the presence of the leader nodes.

The algorithm based on the sequential quadratic approximation of the smooth part of the objective function in Section 4.3 can be utilized for solving (NI) with minor changes; the difference is that the size of the optimization variable increases from $m$ to $m + N$. This method benefits from exploiting second-order information about the smooth part of the objective function and from computing the Newton direction using cyclic coordinate descent [145] over the set of active variables. We solve problem (NI) for different values of $(\gamma_1, \gamma_2)$ using a path-following iterative reweighted algorithm, and topology identification is then followed by a polishing step that debiases the identified edge weights. We next propose an algorithm for solving the topology identification problem for large-scale networks.

5.3 Growing a Chow-Liu tree network

In this section, we discuss an alternative algorithm for identifying the underlying network topology that is well-suited for large-scale systems. In order to find the underlying network structure from statistical data, the Chow-Liu tree algorithm [154] can be utilized; this method does not require any matrix inversion and is thereby suitable for large-scale use. However, as discussed in [64], it produces false positives when used for identifying the topology of disconnected networks and false negatives for networks with cycles. Herein, we propose a framework that combines the Chow-Liu tree and the reweighted graphical-LASSO algorithms in order to identify the structure of connected networks with cycles.

We consider the same consensus network (5.1) and assume that the sample covariance matrix $S$ is given.
In order to use the Chow-Liu algorithm, the mutual information matrix $M$ must be constructed from the sample covariance matrix. Assuming a Gaussian distribution for the noise $w$, the mutual information is given by

    $M_{ij} = \dfrac{1}{2}\, \log\left( \dfrac{S_{ii}\, S_{jj}}{S_{ii}\, S_{jj} - S_{ij}^2} \right),$

where $S_{ij}$ is the $ij$th element of the matrix $S$. We only use the $n(n-1)/2$ off-diagonal elements of this symmetric matrix to construct the Chow-Liu tree. A spanning tree of a graph with $n$ nodes has $n-1$ edges; to build the Chow-Liu tree, we sort the elements of the mutual information matrix and choose the largest $n-1$ of them that do not create cycles [154].

After finding the underlying tree network that generates a close approximation of the sample covariance matrix, our goal is to add a certain number of edges to the tree graph in order to enhance the closed-loop performance [106]. The performance is measured by the proximity of the second-order statistical data generated by the network to the given sample covariance matrix. Consider an undirected consensus tree network

    $\dot{\psi} = -L_t\, \psi + u + w,$    (5.4)

where $w$ and $u$ are the exogenous disturbance and the control input, respectively, and $L_t$ is the graph Laplacian of the tree network obtained using the Chow-Liu algorithm. The goal is to improve the performance of this system by growing the tree network. We approach this problem as a feedback design problem with

    $u = -(L_x + D_z)\,\psi,$    (5.5)

where $D_z$ is a diagonal matrix with $i$th diagonal entry $z_i$, and the symmetric feedback-gain matrix $L_x$ is required to have the Laplacian structure. Since a nonzero $ij$th element of $L_x$ corresponds to an edge between nodes $i$ and $j$, the communication structure of the controller graph is determined by the sparsity pattern of the matrix $L_x$; moreover, the $i$th node is a leader if $z_i$ is nonzero. Substituting the control input $u$ from (5.5) into (5.4) yields

    $\dot{\psi} = -(L_t + L_x + D_z)\,\psi + w.$    (5.6)

For an $L_t$ computed via the Chow-Liu algorithm, our objective is to design the topology $L_x$ and to identify the leader nodes in the network in order to achieve a desired tradeoff between controller sparsity and network performance, where the performance is quantified by the proximity of the steady-state covariance matrix of the closed-loop system to the sample covariance matrix. Next, we establish a relation between the closed-loop graph Laplacian and the inverse covariance matrix of the network. The steady-state covariance of $\psi$ is given by

    $\Sigma = (L_t + L_x + D_z)^{-1},$    (5.7)

where $L_x = E\,\mathrm{diag}(x)\,E^T$. Thus, the problem of identifying the sparse topology of a network, i.e., finding $L_x$ and $D_z$, that generates a close approximation of a given sample covariance matrix is equivalent to a sparse inverse covariance estimation problem. It can be solved via a regularized maximum log-likelihood problem similar to (5.3), with one main difference: the inverse covariance matrix $X$ is now the sum of the Laplacian matrix of the tree plant network $L_t$ and the controller terms $L_x + D_z$.
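A compact MATLAB sketch of the Chow-Liu construction described at the beginning of this section, for the Gaussian case, is given below (hypothetical data; n and S are reused from the earlier sketches): it forms the mutual information matrix and greedily selects the $n-1$ largest off-diagonal entries that do not create a cycle.

    Mi = zeros(n);
    for i = 1:n
      for j = i+1:n
        Mi(i,j) = 0.5*log(S(i,i)*S(j,j)/(S(i,i)*S(j,j) - S(i,j)^2));
      end
    end
    [~, idx] = sort(Mi(:), 'descend');        % candidate edges by mutual information
    comp = (1:n)';  edges = zeros(n-1,2);  k = 0;
    for t = 1:numel(idx)
      [i, j] = ind2sub([n n], idx(t));
      if i < j && comp(i) ~= comp(j)          % adding (i,j) creates no cycle
        k = k + 1;  edges(k,:) = [i j];
        comp(comp == comp(j)) = comp(i);      % merge the two components
        if k == n-1, break, end
      end
    end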
Thus, the problem of growing a tree network in order to match the available statistical data can be formulated as

    minimize_{x,z}  $J(x,z) + \gamma_1 \sum_{l=1}^m f_l\, |x_l| + \gamma_2 \sum_{k=1}^N g_k\, |z_k|$
    subject to  $L_t + E\,\mathrm{diag}(x)\,E^T + \mathrm{diag}(z) \succ 0$    (5.8)

where

    $J(x,z) = -\log\det\big( L_t + E\,\mathrm{diag}(x)\,E^T + \mathrm{diag}(z) \big) + \mathrm{trace}\big( S\,( E\,\mathrm{diag}(x)\,E^T + \mathrm{diag}(z) ) \big).$

In the case of resistive networks (i.e., when all edge weights are nonnegative), since the plant network is given by a tree graph, the closed-loop network is connected, and the optimization problem simplifies to

    minimize_{x,z}  $J(x,z) + \gamma_1 \sum_{l=1}^m f_l\, x_l + \gamma_2 \sum_{k=1}^N g_k\, z_k$
    subject to  $x \ge 0, \quad z \ge 0.$    (5.9)

In this scenario, the topology identification problem turns into the problem of growing a tree network, and the positive definiteness constraint simplifies to nonnegativity constraints on the vectors $x$ and $z$. Thus, several optimization algorithms can be employed to solve this problem efficiently for large-scale networks.

5.4 Computational experiments

We next illustrate the performance of our methods by employing them to identify the topology of an RC network with 10 nodes and the tree structure shown in Fig. 5.1. The voltages of the nodes form the states of the system. In this network, node 5 is grounded with $z_5 = 4$, which makes node 5 a leader, and all other edge weights are equal to one. This example is borrowed from [64].

Figure 5.1: The RC tree network with $n = 10$ nodes.

5.4.1 Structured graphical-LASSO algorithm

Assume that infinitely many samples of the nodes' voltages are available, so that the sample covariance matrix $S$ is equal to the exact solution of the Lyapunov equation; moreover, we set $\gamma_1 = \gamma_2 = \gamma$. In this case, the structured graphical-LASSO algorithm of Section 5.2 completely recovers the underlying network topology for different values of $\gamma$. This example was previously studied in [64], where it is shown that, even when the sample covariance matrix is exact, the graphical-LASSO algorithm results in 5 false positives and negatives. By adding a structural constraint to the problem and using a reweighting scheme, we showed that the same algorithm can recover the network topology with zero error.

Next, we utilize our method to solve the problem using a sample covariance matrix that is not very close to the actual covariance matrix, being constructed from only 80 samples. The algorithm is again able to recover the network topology and to identify the leader for different values of $\gamma$. It is worth noting that the performance of this method deteriorates if we replace the reweighted $\ell_1$ norm scheme with the plain $\ell_1$ norm. In particular, upon eliminating the reweighted $\ell_1$ norm, we observed the effect of grounding one of the nodes with high capacitance: although the network topology is identified for some values of $\gamma$, as $\gamma$ increases to a very large value the algorithm chooses the optimal edges in the same way as [64], ignoring the connections between nodes 1 to 5 because of their low variances. In the next example, we illustrate the effectiveness of growing a Chow-Liu tree by using it on a synthetic dataset.

5.4.2 Topology identification via growing a Chow-Liu tree

In this section, the second method is utilized to identify the underlying structure of a network with cycles. The original plant network is shown in Fig. 5.2a.

Figure 5.2: (a) Plant network; (b) constructed Chow-Liu tree network; and (c) the network identified by growing the Chow-Liu tree.
We first assume that the sample covariance matrix $S$ is equal to the exact solution of the Lyapunov equation. We form the mutual information matrix and construct the Chow-Liu tree accordingly, which is shown in Fig. 5.2b. We next grow this tree network in order to enhance the performance of the closed-loop system; in particular, we solve problem (5.8) to find the leader nodes and the Laplacian matrix of the controller graph $L_x$. In this case, our algorithm completely recovers the underlying network topology for different values of $\gamma$. Next, we employ this algorithm to identify the topology of a larger network using real data in order to evaluate its performance.

5.4.3 FMRI dataset

The FMRI technique detects the activity of a region of the brain by measuring the blood flow to that region. Since blood flow increases in an active part of the brain, the functioning regions can be identified by monitoring blood flow. The functional connectivity structure between different regions can be revealed by utilizing different thresholding techniques [59, 60]. The results indicate that different regions of the brain that are not anatomically connected act closely together and are functionally linked. Moreover, previous studies have shown that the human brain has small-world network properties [60].

In this section, we employ the second algorithm, based on growing a Chow-Liu tree, to identify the underlying functional network of the human brain. The sample covariance matrix is computed using the resting-state FMRI data of 20 healthy patients [60]. In resting-state FMRI, the patients are asked to close their eyes and try not to think; studies have shown that even in the resting state the brain is highly active and different regions of the brain communicate with each other [155]. We collect 134 samples from 140 cortical brain regions (nodes) in the right hemisphere. The sample correlation matrix for each patient is a $140 \times 140$ matrix and can be computed from the time-series data. The sample covariance matrices are not invertible, since the number of samples is smaller than the number of nodes in the network; thus, we use our proposed algorithm to estimate the inverse covariance matrix and to identify the underlying network structure of the human brain.

First, we form the mutual information matrix and construct the Chow-Liu tree shown in Fig. 5.3a. Next, we grow the obtained tree network to identify the remaining edges and improve the performance of the closed-loop system. We set $\gamma_1 = \gamma_2 = \gamma$. The identified networks for a randomly chosen patient are shown in Fig. 5.3; in particular, as the sparsity-promoting parameter $\gamma$ increases, the identified network becomes sparser. This example was previously studied in [156]. Their results show that the nodes located in the lower left corner of the graphs are highly connected to their neighboring nodes; comparing this pattern of connectivity with the false positives created by their algorithm in a synthetic RC circuit, they conclude that the high number of edges in that area consists of false positives created by the same phenomenon as in the circuit example. However, by adding a structural constraint to the problem and using a reweighting scheme, we showed that the underlying network can be recovered without high connectivity in the lower left corner. Moreover, the general shape of the identified network is consistent with the results reported in [60].

Figure 5.3: The identified networks for different values of $\gamma$: (a) Chow-Liu tree graph with 139 edges; (b) $\gamma = 0.1$; (c) $\gamma = 1$.
Furthermore, small-world properties, such as high clustering and high efficiency coefficients, can be observed in the identified networks. To conclude, using both an additional structural constraint and the reweighted $\ell_1$ norm scheme can improve the performance of the graphical-LASSO algorithm significantly. Unlike the Chow-Liu algorithm, which can only be employed to construct tree networks, our algorithm is more general and overcomes the challenges associated with the conventional algorithms proposed in [156].

5.5 Concluding remarks

We have studied the problem of sparse topology identification of an undirected consensus network with leaders using second-order statistical data. The goal is to identify a sparse interaction topology using the sample covariance matrix of the network state. We have introduced two algorithms, based on regularized Gaussian maximum likelihood and on growing a Chow-Liu tree. In the first algorithm, we propose a structured graphical-LASSO method that uses the weighted $\ell_1$ regularizer as a proxy for inducing a sparse network topology. The second method is based on growing a Chow-Liu tree and is well-suited for identifying the underlying structure of large-scale networks. Several examples have been provided to demonstrate the performance of our framework.

Part III

Control theoretic approach for analysis of optimization algorithms

Chapter 6

Proximal gradient flow and Douglas-Rachford splitting dynamics: global exponential stability via integral quadratic constraints

Many large-scale and distributed optimization problems can be brought into a composite form in which the objective function is given by the sum of a smooth term and a nonsmooth regularizer. Such problems can be solved via a proximal gradient method and its variants, thereby generalizing gradient descent to a nonsmooth setup. In this chapter, we view proximal algorithms as dynamical systems and leverage techniques from control theory to study their global properties. In particular, for problems with strongly convex objective functions, we utilize the theory of integral quadratic constraints to prove global exponential stability of the differential equations that govern the evolution of proximal gradient and Douglas-Rachford splitting flows. In our analysis, we use the fact that these algorithms can be interpreted as variable-metric gradient methods on the forward-backward and the Douglas-Rachford envelopes, and we exploit structural properties of the nonlinear terms that arise from the gradient of the smooth part of the objective function and the proximal operator associated with the nonsmooth regularizer. We also demonstrate that these envelopes can be obtained from the augmented Lagrangian associated with the original nonsmooth problem and establish conditions for global exponential convergence even in the absence of strong convexity.

6.1 Introduction

Analysis of optimization algorithms from the system-theoretic point of view has received significant recent attention [29-31]. In these references, the optimization algorithm is interpreted as a feedback interconnection whose states converge to the optimal solution of the optimization problem.

In this chapter, we utilize techniques from control theory to establish global properties of proximal gradient flow and Douglas-Rachford (DR) splitting dynamics. These algorithms provide an effective tool for solving nonsmooth convex optimization problems in which the objective function is given by a sum of a differentiable term and a nondifferentiable regularizer.
We exploit the fact that the proximal gradient flow can be interpreted as a variable-metric gradient method on the forward-backward (FB) envelope [157-159], and we show that this envelope can be obtained by restricting the proximal augmented Lagrangian to the manifold on which the dual variable is given by the negative gradient of the smooth part of the objective function. When this smooth part is strongly convex with a Lipschitz continuous gradient, we prove global exponential stability by utilizing the theory of integral quadratic constraints (IQCs) [160]. We also generalize the Polyak-Lojasiewicz (PL) condition [161] to nonsmooth problems and show global exponential convergence of the FB envelope even in the absence of strong convexity. Finally, since the DR splitting algorithm [162] can be interpreted as a variable-metric gradient method on the DR envelope [157], we utilize a similar approach to establish global stability properties of the resulting continuous-time dynamics for strongly convex problems.

This chapter is structured as follows. In Section 6.2, we formulate the nonsmooth composite optimization problem and provide background material. In Section 6.3, we establish global exponential stability of the proximal gradient flow dynamics for problems with strongly convex objective functions; moreover, by exploiting the problem structure, we demonstrate global exponential convergence of the forward-backward envelope even in the absence of strong convexity. In Section 6.4, we introduce continuous-time gradient flow dynamics based on the celebrated Douglas-Rachford splitting algorithm and utilize the theory of IQCs to prove global exponential stability for strongly convex problems. We offer concluding remarks in Section 6.5.

6.2 Problem formulation and background

We consider a composite optimization problem,

    minimize_x  $f(x) + g(Tx)$    (6.1)

where $x \in \mathbb{R}^n$ is the optimization variable, $T \in \mathbb{R}^{m \times n}$ is a given matrix, $f: \mathbb{R}^n \to \mathbb{R}$ is a continuously differentiable function with a Lipschitz continuous gradient, and $g: \mathbb{R}^m \to \mathbb{R}$ is a nondifferentiable convex function. Such optimization problems arise in a number of different applications and, depending on the structure of the functions $f$ and $g$, different first- and second-order algorithms can be employed to solve them. We are interested in studying the global convergence properties of methods based on proximal gradient flow algorithms. In what follows, we provide background material that is utilized in the rest of the chapter.

6.2.1 Proximal operators and the associated envelopes

The proximal operator of a proper, lower semicontinuous, convex function $g$ is defined as

    $\mathrm{prox}_{\mu g}(v) := \mathrm{argmin}_z \; g(z) + \dfrac{1}{2\mu}\, \|z - v\|_2^2.$

The value function of this optimization problem determines the associated Moreau envelope,

    $M_{\mu g}(v) := g(\mathrm{prox}_{\mu g}(v)) + \dfrac{1}{2\mu}\, \|\mathrm{prox}_{\mu g}(v) - v\|_2^2,$

which is a continuously differentiable function even when $g$ is not [76], with

    $\nabla M_{\mu g}(v) = \dfrac{1}{\mu}\, \big( v - \mathrm{prox}_{\mu g}(v) \big).$

Combining the last two expressions yields

    $M_{\mu g}(v) = g(\mathrm{prox}_{\mu g}(v)) + \dfrac{\mu}{2}\, \|\nabla M_{\mu g}(v)\|_2^2.$

The Moreau envelope of $g$ can be used to introduce the forward-backward (FB) envelope [157-159] of the composite function

    $F(x) := f(x) + g(x).$

The FB envelope is determined by the value function of the problem

    minimize_v  $J(x, v)$    (6.2a)

where $J$ approximates $F$ via a simple quadratic expansion of the function $f$ around $x$,

    $J(x, v) := f(x) + \langle \nabla f(x), v - x \rangle + \dfrac{1}{2\mu}\, \|v - x\|_2^2 + g(v) = g(v) + \dfrac{1}{2\mu}\, \|v - (x - \mu \nabla f(x))\|_2^2 + f(x) - \dfrac{\mu}{2}\, \|\nabla f(x)\|_2^2.$    (6.2b)

The optimal solution of (6.2) is determined by $v^\star(x) = \mathrm{prox}_{\mu g}(x - \mu \nabla f(x))$.
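As a concrete example, for $g(z) = \|z\|_1$ the proximal operator is the soft-thresholding map, and the Moreau envelope and its gradient follow directly from the definitions above (a standalone MATLAB sketch with arbitrary test data).

    mu = 0.5;  v = [-2; 0.1; 3];
    pg = sign(v).*max(abs(v) - mu, 0);           % prox_{mu g}(v) for g = ||.||_1
    Mg = norm(pg,1) + norm(pg - v)^2/(2*mu);     % Moreau envelope M_{mu g}(v)
    gM = (v - pg)/mu;                            % gradient of M_{mu g} at v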
This solution can be used to obtain the FB envelope of the function $F$,

    $F_\mu(x) := J(x, v^\star(x)) = J(x, \mathrm{prox}_{\mu g}(x - \mu \nabla f(x))) = f(x) - \mu\, \langle \nabla f(x), G_\mu(x) \rangle + \dfrac{\mu}{2}\, \|G_\mu(x)\|_2^2 + g(\mathrm{prox}_{\mu g}(x - \mu \nabla f(x)))$    (6.3)

where $G_\mu$ is the generalized gradient map,

    $G_\mu(x) := \dfrac{1}{\mu}\, \big( x - \mathrm{prox}_{\mu g}(x - \mu \nabla f(x)) \big).$    (6.4)

Alternatively, the FB envelope $F_\mu$ can also be expressed as

    $F_\mu(x) = f(x) + M_{\mu g}(x - \mu \nabla f(x)) - \dfrac{\mu}{2}\, \|\nabla f(x)\|_2^2.$    (6.5)

Moreover, when $f$ is twice continuously differentiable, $F_\mu$ is continuously differentiable and its gradient is determined by [157]

    $\nabla F_\mu(x) = \big( I - \mu \nabla^2 f(x) \big)\, G_\mu(x).$    (6.6)

The Douglas-Rachford (DR) envelope is another useful object, obtained by evaluating the FB envelope at $\mathrm{prox}_{\mu f}(x)$ [163],

    $F_\mu^{DR}(x) := F_\mu(\mathrm{prox}_{\mu f}(x)).$    (6.7)

Alternatively, the DR envelope can be expressed as

    $F_\mu^{DR}(x) = M_{\mu g}(x - 2\mu \nabla M_{\mu f}(x)) + M_{\mu f}(x) - \mu\, \|\nabla M_{\mu f}(x)\|_2^2.$    (6.8)

From the definition of the proximal operator of the continuously differentiable function $f$, we have

    $\mu \nabla f(\mathrm{prox}_{\mu f}(x)) + \mathrm{prox}_{\mu f}(x) - x = 0$    (6.9)

and, thus,

    $\nabla M_{\mu f}(x) = \nabla f(\mathrm{prox}_{\mu f}(x)).$    (6.10)

Equality (6.7) follows from substituting the expression for $\nabla M_{\mu f}(x)$ into (6.8), using equation (6.9), and leveraging the properties of the Moreau envelope,

    $F_\mu^{DR}(x) = M_{\mu g}\big( x - 2\mu \nabla f(\mathrm{prox}_{\mu f}(x)) \big) + f(\mathrm{prox}_{\mu f}(x)) + \dfrac{1}{2\mu}\, \|\mathrm{prox}_{\mu f}(x) - x\|_2^2 - \mu\, \|\nabla f(\mathrm{prox}_{\mu f}(x))\|_2^2$
    $\phantom{F_\mu^{DR}(x)} = M_{\mu g}\big( \mathrm{prox}_{\mu f}(x) - \mu \nabla f(\mathrm{prox}_{\mu f}(x)) \big) + f(\mathrm{prox}_{\mu f}(x)) - \dfrac{\mu}{2}\, \|\nabla f(\mathrm{prox}_{\mu f}(x))\|_2^2$
    $\phantom{F_\mu^{DR}(x)} = F_\mu(\mathrm{prox}_{\mu f}(x)).$

If $f$ is twice continuously differentiable with $\nabla^2 f(x) \preceq L_f I$ for all $x$, the DR envelope is continuously differentiable and its gradient is given by [163]

    $\nabla F_\mu^{DR}(x) = \dfrac{1}{\mu}\, \big( 2\nabla \mathrm{prox}_{\mu f}(x) - I \big)\, G_\mu^{DR}(x)$    (6.11)

where $\nabla \mathrm{prox}_{\mu f}(x) = \big( I + \mu \nabla^2 f(\mathrm{prox}_{\mu f}(x)) \big)^{-1}$ and

    $G_\mu^{DR}(x) := \mathrm{prox}_{\mu f}(x) - \mathrm{prox}_{\mu g}\big( 2\,\mathrm{prox}_{\mu f}(x) - x \big).$    (6.12)

6.2.2 Proximal augmented Lagrangian

By introducing an auxiliary optimization variable $z$, problem (6.1) can be rewritten as

    minimize_{x,z}  $f(x) + g(z)$
    subject to  $Tx - z = 0.$    (6.13)

The augmented Lagrangian associated with the constrained optimization problem (6.13) is given by

    $\mathcal{L}_\mu(x, z, y) := f(x) + g(z) + \langle y, Tx - z \rangle + \dfrac{1}{2\mu}\, \|Tx - z\|_2^2$

and completion of squares yields

    $\mathcal{L}_\mu = f(x) + g(z) + \dfrac{1}{2\mu}\, \|z - (Tx + \mu y)\|_2^2 - \dfrac{\mu}{2}\, \|y\|_2^2$

where $y$ is the Lagrange multiplier and $\mu$ is a positive parameter. The minimizer of $\mathcal{L}_\mu$ with respect to $z$ is

    $z^\star(x, y) = \mathrm{prox}_{\mu g}(Tx + \mu y)$

and the evaluation of $\mathcal{L}_\mu(x, z, y)$ along the manifold resulting from the explicit minimization over $z$ yields the proximal augmented Lagrangian [81],

    $\mathcal{L}_\mu(x, y) := \mathcal{L}_\mu(x, z^\star(x,y), y) = f(x) + M_{\mu g}(Tx + \mu y) - \dfrac{\mu}{2}\, \|y\|_2^2.$    (6.14)

This function is continuously differentiable with respect to both $x$ and $y$, and it can be used as a foundation for the development of different first- and second-order primal-dual methods for nonsmooth composite optimization [81, 108]. It is noteworthy that, for $T = I$, the forward-backward envelope $F_\mu(x)$ is obtained by restricting the proximal augmented Lagrangian $\mathcal{L}_\mu(x, y)$ along the manifold $y^\star(x) = -\nabla f(x)$ resulting from the KKT optimality conditions,

    $F_\mu(x) := \mathcal{L}_\mu(x, y^\star(x)) = \mathcal{L}_\mu(x, y = -\nabla f(x)) = f(x) + M_{\mu g}(x - \mu \nabla f(x)) - \dfrac{\mu}{2}\, \|\nabla f(x)\|_2^2.$

Moreover, relation (6.7) between the two envelopes allows us to interpret the DR envelope as the proximal augmented Lagrangian evaluated at $\mathrm{prox}_{\mu f}(x)$ and restricted along the manifold $y^\star(\mathrm{prox}_{\mu f}(x)) = -\nabla f(\mathrm{prox}_{\mu f}(x)) = -\nabla M_{\mu f}(x)$,

    $F_\mu^{DR}(x) := \mathcal{L}_\mu\big( \mathrm{prox}_{\mu f}(x),\, y^\star(\mathrm{prox}_{\mu f}(x)) \big) = M_{\mu f}(x) + M_{\mu g}(x - 2\mu \nabla M_{\mu f}(x)) - \mu\, \|\nabla M_{\mu f}(x)\|_2^2.$
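Expression (6.5) gives a convenient way of evaluating the FB envelope. The sketch below does so for a lasso-type problem with $f(x) = \tfrac{1}{2}\|Ax - b\|_2^2$ and $g = \|\cdot\|_1$ (hypothetical A, b, and evaluation point).

    A = randn(8,5);  b = randn(8,1);  x0 = randn(5,1);  mu = 0.1;
    gf  = A'*(A*x0 - b);                               % gradient of f at x0
    v   = x0 - mu*gf;                                  % forward step
    pg  = sign(v).*max(abs(v) - mu, 0);                % prox_{mu g}(v)
    Mg  = norm(pg,1) + norm(pg - v)^2/(2*mu);          % Moreau envelope of g
    FBE = 0.5*norm(A*x0 - b)^2 + Mg - (mu/2)*norm(gf)^2;   % F_mu(x0) via (6.5)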
(prox f (x))) = M f (x) + M g (x 2rM f (x)) krM f (x)k 2 2 : 6.2.3 Strong convexity and Lipschitz continuity The functionf is strongly convex with parameter m f if for any x and ^ x, f(^ x) f(x) +hrf(x); ^ x xi + m f 2 k^ x xk 2 2 and equivalently, krf(x)rf(^ x)k 2 m f kx ^ xk 2 : (6.15) The gradient of a continuously-dierentiable function f is Lipschitz continuous with parameter L f if for any x and ^ x, f(^ x) f(x) +hrf(x); ^ x xi + L f 2 k^ x xk 2 2 and equivalently, krf(x)rf(^ x)k 2 L f kx ^ xk 2 : (6.16) Moreover, if anm f -strongly convex functionf has anL f -Lipschitz continuous gradient, the following inequality holds for any x and ^ x, hrf(x)rf(^ x);x ^ xi m f L f m f +L f kx ^ xk 2 2 + 1 m f +L f krf(x)rf(^ x)k 2 2 : (6.17) Furthermore, the subgradient @g of a nondierentiable function g is dened as the set of points z2@g(x) that for any x and ^ x satisfy, g(^ x) g(x) + z T (^ x x): (6.18) 6.2.4 Proximal Polyak-Lojasiewicz inequality The Polyak-Lojasiewicz (PL) condition can be used to establish linear convergence of standard gradient descent 96 method in the absence of strong convexity (or even convexity) [164]. For an uncon- strained optimization problem with a non-empty solution set, minimize x f(x) where f is a twice dierentiable function with a Lipschitz continuous gradient, the PL condition is given by krf(x)k 2 2 (f(x) f ? ) where > 0 and f ? is the optimal value of the function f. We next provide the generalization of the PL condition for nonsmooth composite optimization problems (6.13) with T =I. For this class of problems, the proximal PL inequality holds for 2 (0; 1=L f ) if there exist > 0 such that kG (x)k 2 2 (F (x) F ? ): (6.19) Here,L f is the Lipschitz constant ofrf,F is the FB envelope, andG is the generalized gradient map. In Appendix A, we show that (6.19) is equivalent to the condition provided in [164]. 6.3 Exponential stability of proximal algorithms In this sec- tion, we brie y discuss the Arrow-Hurwicz-Uzawa gradient ow dynamics that can be used to solve (6.13) by computing the saddle points of the proximal augmented La- grangian [81]. We then show that the proximal gradient method in continuous time can be obtained from the proximal augmented Lagrangian method by restricting the dual variable along the manifold y =rf(x). Finally, we discuss global stability properties of proximal algorithms both in the presence and in the absence of strong convexity. Continuous dierentiability of the proximal augmented Lagrangian (6.14) can be utilized to compute its saddle points via the Arrow-Hurwicz-Uzawa gradient ow dy- namic, " _ x _ y # = " rf(x) + T T rM g (Tx + y) (rM g (Tx + y) y) # : (6.20) As shown in [81], these primal-descent dual-ascent gradient ow dynamics are globally exponentially stable for convex problems in which the matrix TT T is invertible and the 97 smooth part of the objective function f is strongly convex. For convex problems with T =I in (6.1), minimize x f(x) + g(x) (6.21) the proximal gradient method, x k+1 = prox k g (x k k rf(x k )) (6.22) with the stepsize k 2 (0; 1=L f ] can be used to solve (6.21), where L f is the Lipschitz constant ofrf. In [157], it was demonstrated that (6.22) can be interpreted as a variable-metric gradient method on FB envelope, x k+1 = x k k I k r 2 f(x) 1 rF k (x k ) = x k k G k (x k ): This interpretation can be utilized to compute the optimal solution to (6.21) using the continuous-time proximal gradient ow dynamics _ x = G (x) = (rf(x) +rM g (x rf(x))) = 1 x prox g (x rf(x)) : (6.23) Remark 8. 
Proximal gradient ow dynamics (6.23) are dierent from the subgradient ow dynamics associated with nonsmooth composite optimization problem (6.21). We note that proximal gradient algorithm (6.22) can be obtained via explicit forward Euler discretization of (6.23) with the stepsize = k . This should be compared and contrasted to a standard interpretation [76] in which (6.22) results from implicit backward Euler discretization of the subgradient ow dynamics associated with (6.21). We also note that (6.23) can be obtained by substitutingrf(x) for the dual variable y in the x- update step of primal-descent dual-ascent gradient ow dynamics (6.20) with T =I. We next examine global stability properties of proximal gradient ow dynamics (6.23), rst for strongly convex problems and then for the problems in which only the PL con- dition holds. 98 6.3.1 Strongly convex problems Herein, we utilize the theory of integral quadratic constraints to establish global asymptotic stability of proximal gradient ow dynam- ics (6.23) under the following assumption. Assumption 1. Let the dierentiable part f of the objective function in (6.21) be strongly convex with parameter m f , letrf be Lipschitz continuous with parameter L f , and let the regularization function g be proper, lower semicontinuous, and convex. Proximal gradient ow dynamics (6.23) can be expressed as a feedback interconnec- tion of an LTI system _ x = Ax + Bu = Cx (6.24a) with a nonlinear term, u(x) := prox g (x rf(x)): (6.24b) Here, A = 1 I; B = 1 I; C = I (6.24c) and the corresponding transfer function H(s), (s) =H(s)u(s), is H(s) = C (sI A) 1 B = 1 s + 1 I: (6.24d) The following lemma combines rm nonexpansiveness of prox g , strong convexity of f, and Lipschitz continuity ofrf to characterize nonlinear map (6.24b) by establishing a quadratic inequality that u(x) = prox g (xrf(x)) satises. Lemma 7. Let Assumption 1 hold. Then, for any x2R n , ^ x2R n , u := prox g (x rf(x)); and ^ u := prox g (^ xrf(^ x)), the pointwise quadratic inequality " x ^ x u ^ u # T " 2 I 0 0 I # | {z } " x ^ x u ^ u # 0 (6.25) holds, where = maxfj1 m f j;j1 L f jg: (6.26) 99 Proof. See Appendix A.1. Lemma 8. The nonlinear function u(x) := prox g (xrf(x)) is a contraction for 2 (0; 2=L f ). Proof. From (A.8), it follows that the nonlinear functionu(x) is a contraction for< 1. From (6.26), we see that < 1 if and only if, 1 < 1 L f < 1 and 1 < 1 m f < 1: Since m f L f , these conditions hold for 2 (0; 2=L f ) . We next employ the KYP lemma in the frequency domain [165] " H (j!) I # " H (j!) I # 0; 8!2 R (6.27) where is given by (6.25), ! is the temporal frequency, and H (j!) := C (j!I (A + I)) 1 B = 1 j! + 1 I to establish global exponential stability of (6.24). Note that stability of H requires < 1=. Theorem 9. Let Assumption 1 hold and let 2 (0; 2=L f ). Then, proximal gradient ow dynamics (6.24) are globally exponentially stable with the rate < (1)=, i.e., kx(t) x ? k 2 c e t kx(0) x ? k 2 ; 8t 0 where c is a positive constant and is given by (6.26). Proof. Substituting given by (6.25) into (6.27) implies that the condition (6.27) holds for all !2R if I 2 H (j!)H (j!) 0: This inequality can be equivalently written as 2 ! 2 + (1 ) 2 2 I 0 (6.28) 100 which holds for every !2R if (1 + )(1 ) > 0: Since stability ofH requires 1> 0, the rst term is positive and (6.28) is satised for all ! if 1 > 0. Finally, since < 1 for 2 (0; 2=L f ), dynamics (6.24) are globally exponentially stable with the rate < (1)=. Remark 9. 
For = 2=(L f +m f ), the second term on the right-hand-side in (A.7) disappears and the parameter is given by = L f m f L f + m f = 1 + 1 where :=L f =m f is the condition number associated with the function f. In this case, the convergence rate is upper bounded by the strong convexity module m f , i.e., <m f . Remark 10. In Appendix B, by restricting our attention to in-network optimization, we provide a distributed implementation based on the proximal augmented Lagrangian. Furthermore, by introducing an appropriate change of coordinates, we utilize the theory of IQCs to prove global exponential stability under the strong convexity assumption. 6.3.2 Proximal Polyak-Lojasiewicz condition Next, we consider the prob- lems in which the functionf is not strongly convex but the functionF :=f +g satises the proximal PL condition (6.19). Assumption 2. Let the regularization function g in (6.13) be proper, lower semicon- tinuous, and convex, let f be twice continuously dierentiable withr 2 f(x)L f I, and let the generalized gradient map satisfy the proximal PL condition, kG (x)k 2 2 (F (x) F ? ) where 2 (0; 1=L f ), > 0, and F ? is the optimal value of the FB envelope F . Remark 11. We recall that the proximal gradient algorithm can be interpreted as a variable-metric gradient method on FB envelope and that (6.23) can be equivalently written as _ x = (I r 2 f(x)) 1 rF (x): 101 Under Assumption 2, the matrix Ir 2 f(x) is invertible and the functions F and F have the same minimizers and the same optimal values [157], argmin x F (x) = argmin x F (x); F ? = F ? : This motivates study of the convergence properties of (6.23) in terms of the FB envelope. Theorem 10. Let Assumption 2 hold. Then the forward-backward envelope associated with proximal gradient ow dynamics (6.23) converge exponentially to F ? = F ? with the rate = (1L f ), i.e., F (x(t)) F ? e t (F (x(0)) F ? ); 8t 0: Proof. We introduce a Lyapunov function candidate, V (x) = F (x) F ? whereF is the FB envelope. The derivative of V along the solutions of (6.23) is given by _ V (x) = hrF (x); _ xi = rF (x); (I r 2 f(x)) 1 rF (x) = G (x); (I r 2 f(x))G (x) where the last expression follows from (6.6). Since the gradient of f is L f -Lipschitz continuous, i.e.,r 2 f(x)L f I for all x2R n , Assumption 2 implies (I r 2 f(x)) (1 L f )I and, thus, _ V (x) (1 L f )kG (x)k 2 2 (1 L f ) (F (x) F ? ) (6.29) is non-positive for 2 (0; 1=L f ). Moreover, combining the last inequality with the denition of V yields, _ V (1 L f )V 102 which implies F (x(t)) F ? e (1L f )t (F (x(0)) F ? ): Remark 12. When the proximal PL condition is satised, F (x(t))F ? converges exponentially but, in the absence of strong convexity, the exponential convergence rate cannot be established forkx(t)x ? k 2 . Thus, in the absence of strong convexity, although the objective function converges exponentially fast, the solution to (6.23) does not enjoy this convergence rate. To the best of our knowledge, the convergence rate of x(t) to the set of optimal values x ? is not known in this case. 6.4 Global exponential stability of Douglas-Rachford split- ting dynamics We next introduce a continuous-time gradient ow dynamics based on the well-known Douglas-Rachford splitting algorithm [162] and establish global exponential stability for strongly convex f. 
6.4.1 Non-smooth composite optimization problem The optimality con- dition for non-smooth composite optimization problem (6.21) is given by 0 2 rf(x) + @g(x): Multiplication by and addition/subtraction of x yields, 0 2 [I + rf] (x) + @g(x) x: Since the proximal operator associated with the function f is determined by the re- solvent operator of rf, we have x = (I + rf) 1 (x @g(x)) = prox f (x @g(x)): Introducing a new variable z :=x@g(x) allows us to bring the optimality condition into the following form x = prox f (z) or, equivalently, @g(x) = prox f (z) z: 103 Now, adding x to both sides of this equation yields [I + @g] (x) = x + prox f (z) z = 2prox f (z) z which leads to, x ? = prox g (2 prox f (z ? ) z ? ) = prox f (z ? ): (6.30a) Furthermore, the re ected proximal operators [166] of the functions f and g, R f (z) := [2 prox f I](z); R g := [2 prox g I](z) can be used to write optimality condition (6.30a) as [R g R f ](z ? ) = z ? : (6.30b) This follows from (6.30a) and [R g R f ](z) =z + 2(prox g (2 prox f (z) z) prox f (z)): Building on the optimality conditions, the DR splitting algorithm consists of the following iterative steps, x k = prox f (z k ) y k = prox g (2x k z k ) z k+1 = z k + 2(y k x k ): (6.31) Under standard convexity assumptions [167], the DR splitting algorithm converges for 2 (0; 1). Combining all the steps in (6.31) yields the rst-order recurrence, z k+1 = z k + 2 prox g (2x k z k ) x k = [(1 )I + R g R f ] (z k ) (6.32) where z k converges to the xed point of the operator R f R g and x k converges to the 104 optimal solution of (6.21). Optimality conditions (6.30) can be used to obtain the continuous-time gradient ow dynamics to compute z ? , _ z = z + [R g R f ](z) = prox g (2 prox f (z) z) prox f (z) = G DR (z) (6.33) where G DR (z) is given by (6.12). We note that the discrete-time system (6.32) results from the explicit forward Euler discretization of (6.33) with the stepsize . Remark 13. Using the denition of rF DR (x) in (6.11), the continuous-time sys- tem (6.33) can be written as _ z = 2rprox f (z) I 1 rF DR (z) where the inverse is well-dened for 2 (0; 1=L f ). Thus, the DR splitting algorithm can be interpreted as a variable-metric gradient method on the DR envelope F DR [157]. The continuous-time dynamics (6.33) can be also seen as a feedback interconnection of an LTI system _ z = Az + Bu = Cz (6.34a) with a nonlinear term, u(z) := [R g R f ](z): (6.34b) Herein, the matrices in LTI representation (6.34a) are given by A =I; B = I; C = I (6.34c) and the corresponding transfer function is H(s) = C(sI A) 1 B = 1 s + 1 I: (6.34d) 105 We rst characterize properties of nonlinearity u in (6.34b) and then, similar to the previous section, employ the KYP lemma to establish the conditions for global exponential stability of (6.34). Lemma 11. Let Assumption 1 hold and let 2 (0; 2=L f ). Then, the operator R f is -contractive, kR f (x) R f (y)k 2 kx yk 2 where is given by = maxfj1 m f j;j1 L f jg < 1: (6.35) Proof. Givenz x := prox f (x) andz y := prox f (y),x andy can be computed as follows x = z x + rf(z x ); y = z y + rf(z y ): Thus, kR f (x) R f (y)k 2 =k2(z x z y ) (xy)k 2 =k(z x z y ) (rf(z x )rf(z y ))k 2 =kz x z y k 2 +j(rf(z x )rf(z y ))k 2 2hrf(z x )rf(z y );z x z y i max (1 L f ) 2 ; (1 m f ) 2 kz x z y k 2 kx yk 2 : where the rm non-expansiveness of the proximal operator is used in the last step. Moreover, according to Lemma 8, for2 (0; 2=L f ) we have< 1, which completes the proof. Lemma 12. Let Assumption 1 hold and let 2 (0; 2=L f ). 
Then, the operator R g is rmly non-expansive. 106 Proof. kR g (x)R g (y)k 2 2 = 4kprox f (x) prox f (y)k 2 2 +kx yk 2 2 4 x y; prox f (x) prox f (y) kx yk 2 2 : Remark 14. SinceR g is rmly non-expansive andR f is-contractive, the composite operator R g R f is also -contractive. Moreover, since the operator R f and nonlin- earity u in (6.24b) have the same contraction parameters, the quadratic inequality that describes nonlinearity (6.24b) can be also utilized to characterize the composite operator R g R f . Theorem 13. Let Assumption 1 hold. Then, the DR splitting dynamics (6.34) are globally exponentially stable, i.e., there is c> 0 and 2 (0; 1) such that, kz(t) z ? k c e t kz(0) z ? k; 8t 0: Proof. Note that although the nonlinearities in systems (6.20) and (6.34) are dierent, they have the same quadratic characterizations. Moreover, the dynamics are the same except for a multiplicative coecient 1=. Thus, the KYP lemma implies the global exponential stability of (6.34) if there exists 2 (0; 1) such that, 2 H (j!) H (j!) I 0; 8!2 R; (6.36) where H (j!) = C (j!I (A + I)) 1 B = 1 j! + 1 I: The inequality (6.36) is satised if, 2 (1 ) 2 ! 2 < 0; 8!2 R which proves < 1. 107 6.4.2 Douglas-Rachford splitting on the dual problem The DR splitting algorithm cannot be used to directly solve a problem with a more general linear equality constraint, minimize x;z f(x) + g(z) subject to Tx + Sz = r (6.37) whereT2R mn ,S2R mn , andr2R m are problem data. However, it can be utilized to solve the dual problem, minimize f 1 () + g 1 () (6.38) where f 1 () = f ? (T T ) + r T g 1 () = g ? (S T ) and h ? () := sup x ( T xh(x)) is the conjugate of the function h. It is a standard fact [167, 168] that solving the dual problem (6.38) via the DR splitting algorithm is equivalent to using ADMM for the original problem (6.37). Next, we introduce a gradient ow dynamics based on the DR splitting algorithm for solving (6.38) and demonstrate global exponentially stability under the following assumption. Assumption 3. Let the dierentiable part f of the objective function in (6.37) be strongly convex with parameter m f , letrf be Lipschitz continuous with parameter L f , let the function g be proper, lower semicontinuous, and convex, and let the matrix T be full row rank. The continuous-time DR splitting algorithm _ = prox g 1 (2 prox f 1 () ) prox f 1 () (6.39) can be used to compute the optimal solution ? to (6.38). It is well-known that the conjugate functions are convex [126]. Since (6.39) is identical to (6.33), if f 1 satises the conditions in Assumption 1 global exponential stability of (6.39) follows from The- orem 13. For a full row rank T , f 1 is strongly convex andrf 1 is Lipschitz continuous 108 with parameters [166, Proposition 4], L f 1 = kT T k 2 =m f ; m f 1 = 2 =L f where is a positive parameter that always exists and satiseskT T k kk for all 2 R m . Thus, both f 1 and g 1 satisfy Assumption 3 and global exponential stability of (6.39) follows from Theorem 13. 6.5 Concluding remarks We study a class of nonsmooth optimization problems in which it is desired to minimize the sum of a continuously dierentiable func- tion with a Lipschitz continuous gradient and a nondierentiable function. For strongly convex problems, we employ the theory of integral quadratic constraints to prove global exponential stability of proximal gradient ow and Douglas-Rachford splitting dynam- ics. 
We also propose a generalization of the Polyak-Lojasiewicz condition to nonsmooth problems and demonstrate the global exponential convergence of the forward-backward envelope for the proximal gradient ow algorithm even in the absence of strong convex- ity. Chapter 7 Distributed proximal augmented Lagrangian method for nonsmooth composite optimization We study a class of nonsmooth composite optimization problems in which the con- vex objective function is given by a sum of dierentiable and nondierentiable terms. By introducing auxiliary variables in nondierentiable terms, we provide an equivalent consensus-based characterization that is convenient for distributed implementation. The Moreau envelope associated with the nonsmooth part of the objective function is used to bring the optimization problem into a continuously dierentiable form that serves as a basis for the development of a primal-descent dual-ascent gradient ow method. This algorithm exploits separability of the objective function and is well-suited for in- network optimization. We prove global asymptotic stability of the proposed algorithm and solve the problem of growing undirected consensus networks in a distributed manner to demonstrate its eectiveness. 7.1 Introduction We study a class of nonsmooth composite optimization problems in which the convex objective function is a sum of dierentiable and non- dierentiable functions. Among other applications, these problems emerge in machine 109 110 learning, compressive sensing, and control. Recently, regularization has been used as a promising tool for enhancing utility of standard optimal control techniques. In this approach, commonly used performance measures are augmented with regularization functions that are supposed to promote some desired structural features in the dis- tributed controller, e.g., sparsity. Such an approach has received signicant attention in recent years [41,42,71,106,115,118,119], but computing optimal solutions in large-scale problems still remains a challenge. Generic descent methods cannot be used in the nonsmooth composite optimization problems due to the presence of a nondierentiable component in the objective function. Moreover, these standard methods are not well-suited for distributed implementation. An alternative approach is to separate the smooth and nonsmooth parts of the objective function and use the alternating direction method of multipliers (ADMM). In [111], we exploit separability of the objective function and utilize an ADMM-based consensus algorithm to solve the regularized optimal control problem in a distributed manner over multiple processors. Even though the optimal control problem is in general non-convex, recent results can be utilized to show convergence to a local minimum [112]. However, in an update step of the ADMM algorithm, all the processors halt to compute the weighted average (the gathering step) [113]. Herein, we build on recent work [108] in which the structure of proximal oper- ators associated with nonsmooth regularizers was exploited to bring the augmented Lagrangian into a continuously dierentiable form. Such an approach is suitable for de- veloping an algorithm based on primal-descent dual-ascent gradient method. We use the Arrow-Hurwicz-Uzawa gradient ow dynamics [77] and propose an algorithm that can be implemented in a fully distributed manner over multiple processors. This increases the computational eciency and reduces the overall computation time. 
By exploiting convexity of the smooth part of the objective function, we show asymptotic convergence of our algorithm. The point of departure of our work from [108] is that we study a more general form of consensus optimization problems in which the optimization variable is a matrix 111 and develop a fully distributed algorithm. Furthermore, while most existing primal- dual techniques for nonsmooth distributed optimization employ subgradient ow meth- ods [73,79,169], our approach yields a gradient ow dynamics with a continuous right- hand side even for nonsmooth problems. The rest of the chapter is structured as follows. In Section 7.2, we formulate the nonsmooth composite optimization problem, discuss a motivating example, and provide background on proximal operators and the consensus-based ADMM algorithm. In Sec- tion 7.3, by exploiting the structure of proximal operators, we introduce the proximal augmented Lagrangian. In Section 7.4, we use the Arrow-Hurwicz-Uzawa method to develop the gradient ow dynamics that are well-suited for distributed computations and prove global asymptotic stability. In Section 7.5, we discuss distributed implemen- tation, in Section 7.6, we provide examples, and, in Section 7.7, we oer concluding remarks. 7.2 Problem formulation We consider a composite convex optimization problem, minimize x n X i = 1 f i (x) + g(x) (7.1) where x2R m is the optimization variable, the functions f i are continuously dieren- tiable, and the function g is possibly nondierentiable. This problem can be brought into a standard consensus form by introducingn local variablesx i and a global variable z, minimize x i ;z n X i = 1 f i (x i ) + g(z) subject to x i z = 0; i = 1;:::;n: (7.2) Even though this reformulation increases the number of optimization variables, it facilitates distributed computations by bringing the objective function into a separa- ble form. Clearly, the solutions to (7.1) and (7.2) coincide but, in contrast to (7.1), optimization problem (7.2) is convenient for distributed implementation. Solving (7.2) on a single processor is not necessarily more computationally ecient than solving the original problem via a centralized algorithm. However, optimization problem (7.2) can be split into n separate subproblems over n dierent processors. In such a setup, each 112 processor solves an optimization problem that involves a local objective function f i of a single local variable x i . This is advantageous for large-scale systems where centralized implementation is prohibitively complex and cannot be aorded. 7.2.1 Motivating application Problem (7.1) arises in feedback design when a performance metric, e.g., theH 2 norm, is augmented with a regularization function to promote structural features in the optimal controller. Herein, we discuss the prob- lem of growing undirected consensus networks and show that the objective function is separable; thereby, it completely ts into the framework (7.2). We consider the controlled undirected network, _ = L p + d + u = " Q 1=2 0 # + " 0 R 1=2 # u where d and u are the disturbance and control inputs, is the state, and is the performance output. The dynamic matrix L p is the Laplacian matrix of the plant network and symmetric matricesQ 0 andR 0 specify the state and control weights in the performance output. 
For memoryless control laws, u = L x whereL x is the Laplacian matrix of the controller graph, the closed-loop system is given by _ = (L p + L x ) + d = " Q 1=2 R 1=2 L x # : (7.3) In the absence of exogenous disturbances, the network converges to the average of the initial node values = (1=n) P i i (0) if and only if it is connected [2]. Let Q := I (1=n)11 T penalizes the deviation of individual node values from average. The objective is to minimize the mean square deviation from the network average by adding a few additional edges, specied by the graph Laplacian L x of a controller network. If 113 E is the incidence matrix of the controller graph, L x can be written as L x = E diag (x)E T where diag (x) is a diagonal matrix containing the optimization variablex2R m (i.e., the vector of the edge weights in the controller graph). Regularization terms may be used to promote sparsity of the controller network or to impose some additional constraints on the edge weights. The matrixL x that optimizes the closed-loop performance and has certain structural properties can be obtained by solving the regularized optimal control problem minimize x f(x) + g(x): (7.4) Here, f is the function that quanties the closed-loop performance, i.e. theH 2 norm, and g is the regularization function that is introduced to promote certain structural properties of L x . For example, when it is desired to design L x with a specied pattern of zero elements,g is an indicator function of the set that characterizes this pattern [114]. When it is desired to promote sparsity of L x , the ` 1 norm g(x) = P i jx i j can be used as a sparsity-enhancing regularizer, where is the positive parameter that characterizes emphasis on sparsity [42]. Next, we exploit the square-additive property of theH 2 norm to provide an equiv- alent representation that is convenient for large-scale and distributed optimization. As shown in [106], up to an additive constant, the square of theH 2 norm (from d to ) is determined by f(x) = trace (E diag (x)E T + L p ) (I + L p RL p ) + diag E T RE T x where the pseudo-inverse of the closed-loop graph Laplacian is given by (Ediag(x)E T +L p ) = (Ediag(x)E T + (1=n)11 T +L p ) 1 : It is easy to show that f(x) can be written as f(x) = n X i = 1 f i (x) 114 where f i (x) = T i E diag (x)E T + (1=n)11 T +L p 1 i + (1=n) diag E T RE T x: (7.5) Here, i = (I + L p RL p ) 1=2 e i is the ith column of the square root of the matrix (I + L p RL p ). Moreover, it can be shown that the gradient of f i (x) is given by rf i (x) = (1=n) diag E T RE i (x) i (x) (7.6) where is the elementwise multiplication and i (x) = E T E diag (x)E T + (1=n)11 T +L p 1 i : (7.7) In what follows, we provide essential background on the proximal operators that we utilize for the latter developments. 7.2.2 Background Proximal operators The proximal operator of the function g is given by prox g (v) := argmin z g(z) + 1 2 kz vk 2 ; and the Moreau envelope determines the corresponding value function, M g (v) := g(prox g (v)) + 1 2 kprox g (v) vk 2 2 : (7.8) Irrespective of dierentiability of g, Moreau envelope is a continuously dierentiable function and its gradient is given by [76], rM g (v) = 1 (v prox g (v)): (7.9) The above dened functions play essential role in our subsequent developments. 115 Alternating Direction Method of Multipliers (ADMM) We next demonstrate that a standard consensus algorithm based on the ADMM [113] can be used to solve the problem (7.2). 
This algorithm is well-suited for distributed implementation in which each processor solves an optimization problem. More details can be found in [111]. The augmented Lagrangian associated with (7.2) is given by L i (x i ;z;) := g(z) + n X i = 1 (f i (x i ) +h i ;x i zi + 1 2 i kx i zk 2 2 ) (7.10) where i 's are the Lagrange multipliers and i 's are positive parameters. The distributed ADMM algorithm consists of the following iterative steps, x k+1 i = argmin x i f i (x i ) + 1 2 i kx i u k i k 2 2 z k+1 = argmin z g(z) + n X i=1 1 2 i kz v k i k 2 2 k+1 i = k i + 1 i x k+1 i z k+1 where u k i := z k i k i v k i := x k+1 i + i k i : The x i -minimization step can be done via distributed computation by spreading subproblems to n dierent processors. On the other hand, the update of z amounts to the evaluation of the proximal operator of the function g, z k+1 = prox ^ g ^ n X i = 1 1 i v k i ! : where ^ := ( P i 1= i ) 1 . Thus, the update of z requires gathering each x k+1 i and the associated Lagrange multipliers k i in order to form v k i . The above presented consensus algorithm is standard (e.g., see [113]). We have pre- viously used this algorithm for distributed design of structured feedback gains in [111]. 116 Recently, convergence of this algorithm was established even for problems with non- convex objective functions f i [112]. 7.3 Proximal augmented Lagrangian The interconnection graph between the nodes in the consensus-based formulation in (7.2) is given by a star graph. Each node has access to the internal node of the star graph with the statez. This topol- ogy yields the z-update in ADMM that requires gathering the states of all subsystems. Recently, an algorithm based on the proximal augmented Lagrangian for solving (7.2) was developed in [108]. To avoid the above described computational requirement in the z-update of the ADMM algorithm, we propose a primal-dual algorithm based on the proximal augmented Lagrangian that can be implemented in a fully distributed manner. Problem (7.2) can be equivalently written as, minimize x i ;z n X i = 1 f i (x i ) + g(z) subject to x i x j = 0; (i;j)2I x k z = 0; (7.11) whereI is the set of indices between 1 and n such that any index appears in one pair of the set at least once. This set characterizes structure of the information exchange network between the agents. The interaction topology is given by a connected graph. Moreover, the indexk2f1;:::;ng can be chosen arbitrarily. In what follows, we study one particular instance of problem (7.11). Without loss of generality, we assume that k = n and that the underlying com- munication network between dierent nodes in (7.11) is given by a path graph. By introducing the optimization variable X, X := h x 1 x n i 2 R mn the column vector e n , e n = h 0 0 1 i T 2 R n 117 and the matrix T , T = 2 6 6 6 6 6 4 1 0 0 1 1 0 : : : : : : : : : : : : 0 0 1 3 7 7 7 7 7 5 2 R n(n1) we can rewrite (7.11) as, minimize X;z f(X) + g(z) subject to XT = 0; X e n z = 0: (7.12) In (7.12), the matrix T is the incidence matrix of an undirected path network that connects n nodes. We note that any connected network can be used to build the information exchange structure between the nodes. For an arbitrary connected graph with m edges, the matrix T2R nm has to satisfy the following properties, T T 1 = 0; TT T = L; where1 is the vector of all ones andL2R nn is the Laplacian matrix of the underlying graph. This is exactly the problem (6.37) that we discussed in the previous Chapter 6. In Sec. 
6.4.2, we have introduced a gradient ow dynamics based on the DR splitting algorithm that can be used to solve the dual form of the problem (7.12) with a guaran- teed global exponential convergence under certain assumptions 3. Herein, we provide an algorithm based on proximal augmented Lagrangian to solve the primal problem that is well-suited for distributed implementation and we show global asymptotic convergence. The augmented Lagrangian associated with (7.12) is given by L i (X;z;;Y ) = f(X) + g(z) +h;X e n zi +hY;XTi + 1 2 1 kX e n zk 2 2 + 1 2 2 kXTk 2 F (7.13) where 2 R m and Y 2 R m(n1) are the Lagrange multipliers and 1 and 2 are positive parameters. The proximal augmented Lagrangian is obtained by evaluating the 118 augmented Lagrangian on the manifold that results from the explicit minimization of L i with respect toz [108]. This yields a function that is once but not twice continuously dierentiable with respect to both the primal variable X and the dual variables and Y . The proximal augmented Lagrangian associated with (7.12) is given by L i (X;;Y ) = L i (X;z ? 1 (X;);;Y ) = f(X) + M 1 g (X e n + 1 ) 1 2 kk 2 2 +hY;XTi + 1 2 2 kXTk 2 F ; (7.14) where M 1 g is the Moreau envelope of the function g(z) and z ? 1 (X;) is given by z ? 1 (X;) = prox 1 g (X e n + 1 ) and by substituting z ? in the augmented Lagrangian (7.13), the proximal augmented Lagrangian can be written as (7.14). 7.4 Arrow-Hurwicz-Uzawa gradient ow The proximal augmented Lagrangian is a continuously dierentiable function because the Moreau envelope is con- tinuously dierentiable. This facilitates the use of the Arrow-Hurwicz-Uzawa algorithm which is a primal-descent dual-ascent gradient ow method. In this algorithm, the primal variable X and the dual variables and Y are updated simultaneously. The gradient ow dynamics are given by _ X = r X L (X;;Y ) _ = +r L (X;;Y ) _ Y = +r Y L (X;;Y ): By taking the derivatives, the updates can be written as _ X = (rf(X) +rM 1 g (X e n + 1 ) e T n + 1 2 XTT T + Y T T ) _ = 1 rM 1 g (X e n + 1 ) 1 _ Y = XT: (7.15) 119 It is worth to note that if f(x) is separable, i.e. f(x) = P i f i (x i ), the gradientrf(X) is an mn matrix and can be written as rf(X) = h rf 1 (x 1 ) ::: rf n (x n ) i : Asymptotic convergence Our subsequent developments are based on the following assumption Assumption 4. The functionf is continuously dierentiable and convex, and the func- tion g is proper, lower semicontinuous, and convex. We show that under Assumption 4, dynamics (7.15) are globally asymptotically stable and converge to (X ? ; ? ;Y ? ) where each of the columns of X ? is the optimal solution to (7.1). The optimal primal and dual points (X ? ; ? ;Y ? ) satisfy the following rst order optimality conditions rf(X ? ) + ? e T n + Y ? T T = 0 (7.16a) X ? e n z ? = 0 (7.16b) @g(z ? ) ? 3 0 (7.16c) X ? T = 0 (7.16d) where @g is the subgradient of g. Proposition 14. Let Assumption 4 hold. Then, gradient ow dynamics (7.15) are globally asymptotically stable, i.e., they converge globally to the optimal primal and dual points (X ? ; ? ;Y ? ) of (7.12). Proof. We introduce a change of variables ~ X := X X ? ; ~ := ? ; ~ Y := Y Y ? and a Lyapunov function V ( ~ X; ~ ; ~ Y ) = 1 2 D ~ X; ~ X E + 1 2 D ~ ; ~ E + 1 2 D ~ Y; ~ Y E 120 where ( ~ X; ~ ; ~ Y ) satisfy _ ~ X = rf(X ? )rf(X) 1 1 ~ m e T n 1 2 ~ XTT T ~ Y T T ; _ ~ = 1 ~ + ~ m; _ ~ Y = ~ XT; (7.17) with ~ m := 1 (rM 1 g (X e n + 1 )rM 1 g (X ? e n + 1 ? 
)): Based on [108, Lemma 2], we can write P ( ~ Xe n + 1 ~ ) = prox 1 g (Xe n + 1 ) prox 1 g (X ? e n + 1 ? ); (7.18) whereI is the identity matrix and P is a positive semidenite matrix such that PI. Thus, from (7.9) we have, ~ m = (I P ) ( ~ Xe n + 1 ~ ): (7.19) The derivative of the Lyapunov function candidate along the solutions of (7.17) is determined by _ V = D ~ X; _ ~ X E + D ~ ; _ ~ E + D ~ Y; _ ~ Y E = D ~ X;rf(X)rf(X ? ) E 1 D ~ ;P ~ E 1 1 D (I P ) ~ Xe n ; ~ X e n E 1 2 D ~ XTT T ; ~ X E : Since f is convex, the rst term in nonpositive. Moreover, by utilizing 0 P I, it follows that _ V 0. We next invoke LaSalle's invariance principle [170] to establish global asymptotic stability. The points ( ~ X; ~ ; ~ Y ) in the set S =f( ~ X; ~ ; ~ Y )j _ V ( ~ X; ~ ; ~ Y ) = 0g; 121 satisfy rf(X ? + ~ X) = rf(X ? ); (7.20a) ~ 2 ker(P ); (7.20b) ~ X e n 2 ker(I P ); (7.20c) ~ X T 2 spanf[11 ]g; (7.20d) where ker(A) denotes the null space of the matrix A and A2 spanf[11 ]g signies that each column of the matrix A is given by a scalar multiple of the vector of all ones, 1. From (7.19), for the points in this set, we have ~ m = 1 ~ . In order to identify the largest invariant set inS, we evaluate dynamics (7.17) under constraints (7.20) to obtain _ ~ X = ~ e T n ~ Y T T ; _ ~ = 0; _ ~ Y = 0: (7.21) Thus, the invariant set is characterized by ~ e T n + ~ YT T = 0 for constant ~ and ~ Y . To complete the proof, we need to show that the largest invariant set in S yields (X;;Y ) = (X ? ; ? ;Y ? ) + ( ~ X; ~ ; ~ Y ); that satisfy optimality conditions (7.16). Points (X;;Y ) satisfy optimality condition (7.16a) if rf(X) + ( ~ + ? ) e T n + ( ~ Y + Y ? )T T = 0: For any ( ~ X; ~ ; ~ Y ) in the invariant set, we can use (7.20a) to replacerf(X) withrf(X ? ). Furthermore, since ~ e T n + ~ YT T = 0, the resulting (X;;Y ) satisfy (7.16a). Moreover, the substitution of P ~ = 0 and (IP ) ~ Xe n = 0 to (7.18) yields Xe n X ? e n = prox 1 g (Xe n + 1 ) prox 1 g (X ? e n + 1 ? ): The optimality condition (7.16b) leads to Xe n = prox 1 g (Xe n + 1 ) = z 122 which implies that the pair (X;z) satises (7.16b). We next show that the optimality condition (7.16c) holds for any (z;) in this set. Taking sub-dierential of the proximal operator of the function g in (7.8) yields @g(z) + 1 1 (z v)3 0 where v is an arbitrary vector. Choosing v = Xe n + 1 and utilizing the fact that Xe n = z yields the third optimality condition (7.16c). Furthermore, ~ XT = 0 yields XT = 0. Thus, X satises (7.16d) and the dynamics (7.15) converges asymptotically to the optimal points (X ? ; ? ;Y ? ). 7.5 Distributed implementation In this section, we exploit the struc- ture of the problem and show that the gradient ow dynamics (7.15) is well-suited for distributed implementation. In this case, the underlying interconnection network is given by a path graph. We rst discuss how the gradient ow of the primal variable X can be implemented in a distributed manner and then show that the dual variables can be also updated in a distributed fashion. 7.5.1 Primal update The vector x k denotes the kth column of the matrix X. Each of the columns fromk = 2;:::; (n 1) can be updated in a distributed manner as follows _ x k = rf k (x k ) 1 2 (2x k x k1 x k+1 ) y k + y k1 ; (7.22a) where the vector y k is the kth columns of the matrix Y . Thus, each agent only uses its neighbors' states and the corresponding dual variables to update its own state. 
The updates for the rst and last column of X which are x 1 and x n are dierent than the other updates and can be written as follows _ x 1 = rf 1 (x 1 ) x 1 x 2 2 y 1 (7.22b) _ x n = rf n (x n )rM 1 g (x n + 1 ) 1 2 (x n x n1 ) + y n1 : (7.22c) 123 Similarly, we can see only local information exchange and access to local dual variable is required for these two updates. 7.5.2 Dual updates The dual variable is a column vector and its update can be done by using the following column update _ = 1 rM 1 g (x n + 1 ) 1 : (7.22d) Thus, we only need the state of the nth agent to update its value. The second dual variableY is anm (n 1) matrix and thekth column of it,y k , can be updated using only the states of x k and x k+1 agents for k = 1;:::; (n 1), _ y k = x k x k+1 : (7.22e) 7.6 Computational experiments In this section, we employ our al- gorithm for growing connected resistive Erd os-R enyi networks with edge probability 1:05 log(n)=n for dierent number of nodes using multiple cores. We choose the control weight matrix R =I and a state weight matrix that penalizes the mean-square devia- tion from the network average, Q = I (1=n)11 T : Moreover, the incidence matrix of the controller is such that there are no joint edges between the plant and the controller graphs. This algorithm is implemented in a distributed fashion by splitting the problem into N separate subproblems over N dierent cores. We have provided a parallel im- plementation in Matlab and have executed tests on a machine featuring an Intel Core i7-3770 with 16GB of RAM to measure the performance of the algorithm. The solve times are averaged over 10 trials and the speedup relative to a single core is displayed in Fig 7.1. It demonstrates that the algorithm is scalable. In particular, multi-core execution outperforms running just on a single core. Moreover, the speed-up is even higher for larger networks since overheads of parallel execution are less and more time is spent on actual parallel computation. 124 Speedup Number of cores Figure 7.1: Speedup ratio versus the number of the cores used for growing connected resistive networks with n nodes. 7.7 Concluding remarks We have studied a class of convex nonsmooth composite optimization problems in which the objective function is a combination of dif- ferentiable and nondierentiable functions. By exploiting the structure of the probelm, we have provided an equivalent consensus-based characterization and have developed an algorithm based on primal-descent dual-ascent gradient ow method. This algorithm exploits the separability of the objective function and is well-suited for distributed im- plementation. Convexity of the smooth part of the objective function is utilized to prove global asymptotic stability of our algorithm. Finally, by exploiting the structure of the H 2 norm, we have employed this algorithm to design a sparse controller network that improves the performance of the closed-loop system in a large-scale undirected consen- sus network in a distributed manner. An example is provided to demonstrate the utility of the developed approach. We are currently working on implementing this algorithm in C++ and will use it to solve structured optimal control problems for large-scale systems in a distributed manner. 
Part IV Data-driven and model-free distributed control 125 Chapter 8 Data-driven proximal algorithms for the design of structured optimal feedback gains Distributed feedback design and complexity constrained control are examples of prob- lems posed within the domain of structured optimal feedback synthesis. The optimal feedback gain is typically a non-convex function of system primitives. However, in recent years, algorithms have been proposed to obtain locally optimal solutions. In applica- tions to large-scale distributed control, the major obstacle is computational complexity. This chapter addresses complexity through a combination of linear-algebraic techniques and computational methods adapted from both machine learning and reinforcement learning. It is shown that for general classes of optimal control problems, the objective function and its gradient can be computed from data. Transformations borrowed from the theory of reinforcement learning are adapted to obtain simulation-based algorithms for computing the structured optimal H 2 feedback gain. Customized algorithms based on proximal gradient descent, incremental proximal gradient, an ADMM are tested in computational experiments and their relative merits are discussed. 8.1 Introduction In this chapter, by exploiting the square-additive property of theH 2 norm, we provide ecient algorithms to solve the regularizedH 2 optimal control problem. We demonstrate that the objective function and its gradients can 126 127 be evaluated without solving large-scale Lyapunov equations. In particular, our ap- proach is inspired by the framework developed in [171, 172] where an iterative method for computing the structured linear quadratic regulator is proposed. Herein, we ad- dress complexity through a combination of linear-algebraic techniques and computa- tional methods adapted from both machine learning and reinforcement learning. We approach the structured optimal control problem via a data-driven framework that does not require knowledge of the system parameters and avoids the need to solve large-scale matrical equations. For the structured optimal H 2 state-feedback problem, we show that the objective function and its gradient can be computed from data and develop customized proximal algorithms based on gradient descent and incremental gradient method. Moreover, we exploit separability of the objective function and utilize an ADMM- based consensus algorithm to solve the regularized optimal control problem in a dis- tributed manner over multiple processors. Even though the optimal control problem is in general non-convex, recent results can be utilized to show convergence to a lo- cal minimum [112]. The ADMM-based consensus algorithm that we use is standard (e.g., see [113]) but, to the best of our knowledge, it has not been previously used for distributed design of structured feedback gains. Even though the optimal control prob- lem is in general non-convex, recent results can be utilized to show convergence to a local minimum [112]. The ADMM-based consensus algorithm that we use is standard (e.g., see [113]) but, to the best of our knowledge, it has not been previously used for distributed design of structured feedback gains. Our presentation is organized as follows. In Section 8.2, we describe the regularized structured optimal control problem. 
In Section 8.3, a square-additive property of the H 2 norm is used to obtain a decomposition of the objective function and its gradient and duality arguments are utilized to transform these into quantities that can be es- timated from experiments or simulations. In Section 8.4, we illustrate how structure of symmetric systems and undirected consensus networks can be exploited to simplify computations. In Section 8.5, proximal algorithms of tractable complexity for solving the structured optimal control problem are described and computational experiments are provided. In Section 8.6, we propose an ADMM-based consensus algorithm to solve the regularized optimal control problem in a distributed manner. Finally, in Section 8.7, 128 the chapter is concluded with remarks and a summary of outstanding challenges. 8.2 Problem formulation Consider the LTI control system in state-space form _ = A + Bu = " Q 1=2 0 # + " 0 R 1=2 # u (8.1) where (t)2 R n is the state, u(t)2 R q is the control input, and (t)2 R n+q is the performance output. The matrices are all of compatible dimensions, and the standard assumptions are imposed: Q 0, R 0, (A;B) is stabilizable and (A;Q 1=2 ) is de- tectable. The control input is dened by state-feedback, u(t) = K (t) with gain matrix K2R qn . The closed-loop system is thus _ = A cl = " Q 1=2 R 1=2 K # (8.2) with A cl :=ABK. Closed-loop performance is quantied by the square of the L 2 norm of the impulse response for the closed-loop system, which is expressed using either of the following expressions: f(K) = trace (P ) = trace (Q + K T RK)L (8.3) where P 0 is the closed-loop observability Gramian, A T cl P + PA cl + Q + K T RK = 0 (8.4) 129 and L 0 is the closed-loop controllability Gramian, A cl L + LA T cl + I = 0: (8.5) Our objective is to promote certain structural properties of the feedback gain matrix K by solving the regularized optimal control problem minimize K f(K) + g(K) (8.6) where f is given by (8.3) and g is the regularization function. When it is desired to design K to belong to a setS with a specied pattern of zero elements [114], g is an indicator function associated with the underlying sparsity constraints, g(K) := 8 < : 0 K2S 1 K62S: Similarly, the ` 1 norm and the nuclear norm are commonly used convex proxies for promoting sparsity, or for designing K with low rank. In this case, g(K) = P i;j jK ij j or g(K) = kKk = P i i (K), where is a positive regularization parameter and i is the ith singular value. In general, the optimization problem (8.6) is not convex because f is a non-convex function of K. The proximal augmented Lagrangian method [108] can be used to compute a local minimum of (8.6); for details, see [173]. The minimization (8.6) remains non-convex even in the absence of the regularization functiong. Of course, in this special case it reduces to the standard LQR problem, and the globally optimal solution is given by K ? = R 1 B T P where P is the unique positive denite solution of the algebraic Riccati equation, A T P + PA + Q PBR 1 B T P = 0: The recent work [117] established conditions for convergence of gradient descent methods 130 for the standard LQR problem to the global minimum, in spite of the lack of convexity. The goal of this chapter is twofold: rst is to obtain algorithms of tractable complex- ity for truly large-scale problems. A second goal is to solve the regularized, structured optimal control problem without knowledge of the underlying model. 
The approach to the second goal is a component of our approach to the rst: a data-driven framework is proposed for computing the optimal feedback gain, with unknown matrices A and B in (8.2). This approach also avoids the need to solve large-scale Lyapunov equations. 8.3 Computation from data A square-additive property of theH 2 norm is used to obtain a decomposition of the objective function f and its gradientrf. Duality arguments are then used to transform each term into a quantity that can be estimated from experiments or simulations. The resulting algorithm obtains estimates of the optimal gain without solving a Lyapunov equation, and without knowledge of the system parameters. The gradient of f with respect to K is given by [135], rf(K) = 2 (RK B T P )L: (8.7) In what follows, we use the three decompositions L = n X k = 1 L k ; f(K) = n X k = 1 f k ; rf(K) = n X k = 1 rf k where L k 0 solves the Lyapunov equation A cl L k + L k A T cl + e k e T k = 0; (8.8) e k is the kth unit vector in the canonical basis ofR n , and f k (K) = trace (Q + K T RK)L k rf k (K) = 2 (RK B T P )L k : (8.9) 131 8.3.1 Computation of P and L from data The solution to (8.8) can be ex- pressed L k = Z 1 0 e A cl t e k e T k e A T cl t dt = Z 1 0 k (t) ( k (t)) T dt (8.10) where k (t) := e A cl t e k is the solution to _ k = A cl k ; k (0) = e k ; k = 1;:::;n: (8.11) This leads to the rst component of the data-driven architecture: for a given feedback gainK, the termKL k that appears both in f andrf can be expressed as the integral KL k = Z 1 0 u k (t) ( k (t)) T dt where k (t) and u k (t) = K k (t) are obtained from simulations or experiments of sys- tem (8.11). Similarly, the observability Gramian is given by P = Z 1 0 e A T cl t (Q + K T RK) e A cl t dt and its (i;j)th component is P i;j = Z 1 0 (' i (t)) T (Q + K T RK)' j (t) dt (8.12) where ' i (t) := e A cl t e i is the solution to _ ' i = A cl ' i ; ' i (0) = e i ; i = 1;:::;n: Thus, using 2n forward-in-time numerical simulations, the gradient of the objective functionrf(K) can be computed without solving a Lyapunov equation. Computation can be further reduced by avoiding multiplication of P and L k in (8.9). A data-driven approach to compute the product PL k directly is introduced in the next subsection. 132 8.3.2 Computation of PL from data As showed in [111,172], the productPL k that appears inrf k (K) can be expressed as X k := PL k = Z 1 0 k (t) ( k (t)) T dt (8.13) where k is obtained from the solution of the adjoint system _ k = A T cl k + (Q + K T RK) k k (1) = 0 (8.14) and k is the solution to (8.11). Thus, numerical simulations of the primal and adjoint systems (8.11) and (8.14) along with numerical evaluations of the corresponding integrals can be used to compute L k , PL k , f k , andrf k . Next, we show that the matrix X k in (8.13) can be computed without simulating the adjoint system (8.14). This is a critically important step in the model-free setup where only experimental or numerical data is available. Introduce the new variable k := (Q + K T RK) k so that (8.14) becomes _ k = A T cl k + k ; k (1) = 0 (8.15) which admits the solution, k (t) = H k (t) = Z 1 t e A T cl (t) k () d: (8.16) The linear operator H is introduced here to facilitate an adjoint transformation below. This operator and its adjoint H are dened dened on L 2 ([0;1))! R n . For any functions ;2 L 2 ([0;1)), we have by denitionh;Hi =hH ;i; see, e.g., [174]. 
An explicit representation for =H is obtained using elementary calculus: (t) = Z t 0 e A cl (t) () d: 133 The (i;j)th element of the matrix X k can be expressed X k i;j := e T i X k e j = Z 1 0 k i (t) k j (t) dt = D k j ; k i E (8.17) where k i (t) := e T i k (t) and k j (t) := e T j k (t) are theith andjth elements of the vectors k (t) and k (t), respectively, and the inner product is in L 2 ([0;1)). By substituting the expression (8.16) for k in (8.17), we obtain X k i;j = D k j ; e T i H k E = D H e i k j ; k E =: D k i;j ; k E : (8.18) Applying the expression for H then gives k i;j (t) := h H e i e T j k i (t) = Z t 0 e A cl (t) e i e T j k () d: This implies that k i;j can be obtained as the solution to the LTI system, _ k i;j = A cl k i;j + e i e T j k ; k i;j (0) = 0: (8.19) Thus, to computerf(K) for a given gainK, the only systems that need to be simulated are the forward in time systems (8.11) and (8.19). The system (8.11) is unforced, and its solution determines the input to the forced system (8.19). Remark 15. Similar adjoint techniques are used in analysis of reinforcement learning. The proof that the TD(1) algorithm solves a minimum norm problem is based on related adjoint transformations [175, 176]. And, a similar adjoint transformation is a crucial step in a Q-learning algorithm for deterministic continuous-time systems [177]. Remark 16. By writing the Lyapunov equation for the controllability Gramian as (8.8), unlike the requirement of stochastic simulations in [172], only deterministic simulations are needed in the present framework. Remark 17. Since the optimal unstructured gain K ? only depends on the closed-loop observability GramianP (which can be obtained from the solution of the algebraic Riccati 134 equation), L does not in uence K ? and its computation can be avoided. To promote structure (e.g., via proximal algorithms) we need to compute both P and L to formrf. In contrast, only computation of P is required to form natural policy gradient, which is dened as [178, 179], rh(K) = rf(K)L 1 = 2 (RK B T P ): (8.20) 8.4 Systems with special structure If the underlying system has additional structure, evaluation of both the objective function and the corresponding gradients can be further simplied. We next illustrate how structure of symmetric sys- tems and undirected consensus networks can be exploited to simplify computations. In both cases, the closed-loop L 2 norm has an explicit convex dependance on the opti- mization variable and there is no need to conduct simulations of the adjoint system and numerically evaluate the underlying integrals. Symmetric systems LetA andK in (8.2) be symmetric matrices and letB =I. For a stabilizingK, theL 2 norm of the impulse response for the closed-loop system (8.2) is determined by f(K) = 1 2 trace (K A) 1 (Q + KRK) : Convex dependence off onK can be established via a straightforward use of the Schur complement. Furthermore, the design for symmetric systems provides a useful starting point for the design of non-symmetric systems [180]. Equivalently, f(K) can be written as f(K) = 1 2 trace (K A) 1 (Q + ARA) + 1 2 trace (RK) + 1 2 trace (RA) = 1 2 n X i = 1 f i (K) 135 where f i (K) = q T i (K A) 1 q i + a T i (K A) 1 a i + r T i (K + A)r i : The vectors indexed by i in this expression are given by q i = Q 1=2 e i ; r i = R 1=2 e i ; a i = Ar i where e i is the ith canonical basis vector in R n . 
Furthermore, the gradient of f i with respect to K is given by rf i (K) = r i r T i (K A) 1 q i q T i (K A) 1 (K A) 1 a i a T i (K A) 1 Thus, there is no need to compute and store the inverse of the matrix KA in order to evaluatef i (K) andrf i (K); only actions of (KA) 1 on the vectorsq i anda i is necessary. For example, the preconditioned conjugate gradients method can be used to compute them eciently. Remark 18. The vector (K A) 1 q i represents the steady-state solution of a stable linear system _ i = (A K) i + q i and it can be computed via numerical integration. This illustrates that the symmetric na- ture of the closed-loop system (8.2) allows us to avoid the need for numerical integration of the corresponding adjoint system. Undirected consensus networks For undirected consensus networks withn nodes, the matricesA andK2R nn in (8.2) determine the graph Laplacians of the plant and controller networks, respectively. In the absence of exogenous disturbances, the network converges to the average of the initial node values = (1=n) P i i (0) if and only if it is connected [2]. Let B = I and let Q :=I (1=n)11 T penalize the deviation of individual node values from average. The objective is to minimize the mean square deviation from the network average by adding a few additional edges, specied by the graph Laplacian K of a controller network. If 136 E is the incidence matrix of the controller graph, K can be written as K(x) = E diag (x)E T where diag (x) is a diagonal matrix containing the optimization variablex2R m (i.e., the vector of the edge weights in the controller graph). Regularization terms may be used to promote sparsity of the controller network or to impose some additional constraints on the edge weights. As shown in [106], up to an additive constant, the L 2 norm of the impulse response for the closed-loop system is determined by f(x) = trace (E diag (x)E T A) (I + ARA) + diag E T RE T x where the pseudo-inverse of the closed-loop graph Laplacian is given by (E diag (x)E T A) = (E diag (x)E T + (1=n)11 T A) 1 : It is easy to show that f(x) can be written as f(x) = n X i = 1 f i (x) where f i (x) = T i E diag (x)E T + (1=n)11 T A 1 i + (1=n) diag E T RE T x: (8.21) Here, i = (I + ARA) 1=2 e i is the ith column of the square root of the matrix (I + ARA). Moreover, it can be shown that the gradient of f i (x) is given by rf i (x) = (1=n) diag E T RE i (x) i (x) (8.22) where is the elementwise multiplication and i (x) = E T E diag (x)E T + (1=n)11 T A 1 i : (8.23) 137 As for symmetric systems, there is no need to compute and store the inverse of the matrix in (8.23) and the preconditioned conjugate gradients method can be used to compute the gradient eciently. We next provide a proposition on the Lipschitz continuity of the gradient of f i for connected resistive networks which enables us to choose appropriate step-size for proximal gradient algorithms. In such networks, the plant graph is connected and all edge weights in both the plant and the controller graphs are non-negative. Proposition 15. The gradient of f i for connected resistive networks is Lipschitz con- tinuous with the Lipschitz constant L i = T i ^ A(0) 1 EE T ^ A(0) 1 i kE T ^ A(0) 1 Ek 2 (8.24) where ^ A(x) =E diag (x)E T + (1=n)11 T A. Proof. For a convex and twice dierentiable function f i , the gradientrf i is Lipschitz continuous with Lipschitz constant L i , if r 2 f i (x) L i I or, equivalently, z T r 2 f i (x)zL i kzk 2 . 
Undirected consensus networks

For undirected consensus networks with n nodes, the matrices A and K ∈ R^{n×n} in (8.2) determine the graph Laplacians of the plant and controller networks, respectively. In the absence of exogenous disturbances, the network converges to the average of the initial node values, ψ̄ = (1/n) Σ_i ψ_i(0), if and only if it is connected [2]. Let B = I and let Q := I − (1/n)11ᵀ penalize the deviation of individual node values from the average. The objective is to minimize the mean-square deviation from the network average by adding a few additional edges, specified by the graph Laplacian K of a controller network. If E is the incidence matrix of the controller graph, K can be written as

K(x) = E diag(x) Eᵀ

where diag(x) is a diagonal matrix containing the optimization variable x ∈ Rᵐ (i.e., the vector of the edge weights in the controller graph). Regularization terms may be used to promote sparsity of the controller network or to impose additional constraints on the edge weights. As shown in [106], up to an additive constant, the L² norm of the impulse response of the closed-loop system is determined by

f(x) = trace( (E diag(x) Eᵀ − A)† (I + A R A) ) + diag(Eᵀ R E)ᵀ x

where the pseudo-inverse of the closed-loop graph Laplacian is given by

(E diag(x) Eᵀ − A)† = (E diag(x) Eᵀ + (1/n)11ᵀ − A)⁻¹.

It is easy to show that f(x) can be written as f(x) = Σ_{i=1}^n f_i(x), where

f_i(x) = η_iᵀ ( E diag(x) Eᵀ + (1/n)11ᵀ − A )⁻¹ η_i + (1/n) diag(Eᵀ R E)ᵀ x.        (8.21)

Here, η_i = (I + A R A)^{1/2} e_i is the ith column of the square root of the matrix I + A R A. Moreover, it can be shown that the gradient of f_i(x) is given by

∇f_i(x) = (1/n) diag(Eᵀ R E) − μ_i(x) ∘ μ_i(x)        (8.22)

where ∘ is the elementwise multiplication and

μ_i(x) = Eᵀ ( E diag(x) Eᵀ + (1/n)11ᵀ − A )⁻¹ η_i.        (8.23)

As for symmetric systems, there is no need to compute and store the inverse of the matrix in (8.23), and the preconditioned conjugate gradients method can be used to compute the gradient efficiently.

We next provide a proposition on the Lipschitz continuity of the gradient of f_i for connected resistive networks, which enables us to choose an appropriate step-size for proximal gradient algorithms. In such networks, the plant graph is connected and all edge weights in both the plant and the controller graphs are non-negative.

Proposition 15. The gradient of f_i for connected resistive networks is Lipschitz continuous with the Lipschitz constant

L_i = η_iᵀ Â(0)⁻¹ E Eᵀ Â(0)⁻¹ η_i · ‖Eᵀ Â(0)⁻¹ E‖₂        (8.24)

where Â(x) = E diag(x) Eᵀ + (1/n)11ᵀ − A.

Proof. For a convex and twice differentiable function f_i, the gradient ∇f_i is Lipschitz continuous with Lipschitz constant L_i if ∇²f_i(x) ⪯ L_i I or, equivalently, zᵀ ∇²f_i(x) z ≤ L_i ‖z‖². It can be shown that the second-order derivative of f_i is given by

∇²f_i(x) = ( μ_i μ_iᵀ ) ∘ ( Eᵀ Â(x)⁻¹ E ).

Thus, zᵀ ∇²f_i(x) z can be written as

η_iᵀ Â(x)⁻¹ E D_z Eᵀ Â(x)⁻¹ E D_z Eᵀ Â(x)⁻¹ η_i

where D_z := diag(z). For positive definite matrices S and T with S ⪯ T, we have zᵀ S z ≤ zᵀ T z and

zᵀ S z ≤ λ_max(S) ‖z‖² ≤ √(λ_max(Sᵀ S)) ‖z‖² = ‖S‖₂ ‖z‖²

for any vector z. Therefore,

D_z Eᵀ Â(x)⁻¹ E D_z ⪯ ‖Eᵀ Â(0)⁻¹ E‖₂ ‖z‖² I.

As a result,

zᵀ ∇²f_i(x) z ≤ μ_i(0)ᵀ μ_i(0) ‖Eᵀ Â(0)⁻¹ E‖₂ ‖z‖²

where μ_i(0) = Eᵀ Â(0)⁻¹ η_i.
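A minimal Matlab sketch of the expressions (8.21)-(8.23) for a small resistive network follows; the plant graph (a path), the candidate controller edges, and the edge weights are illustrative, and a direct solve is used in place of preconditioned conjugate gradients for brevity.

% Minimal sketch: f_i(x) and its gradient (8.22) for a small consensus
% network; the plant graph, candidate edges, and weights are examples.
n  = 5;
Lp = diag([1 2 2 2 1]) - (diag(ones(n-1,1), 1) + diag(ones(n-1,1), -1));
A  = -Lp;                               % plant dynamics: path-graph Laplacian
E  = zeros(n, 2);                       % incidence matrix of two candidate edges
E([1 3], 1) = [1; -1];                  % edge (1,3)
E([2 5], 2) = [1; -1];                  % edge (2,5)
x  = [0.5; 0.2];                        % edge weights (optimization variable)
R  = eye(n);
i  = 2;
eta   = sqrtm(eye(n) + A*R*A);          % (I + ARA)^{1/2}
eta_i = eta(:, i);

Ahat = E*diag(x)*E' + ones(n)/n - A;    % \hat{A}(x) in (8.24)
v    = Ahat \ eta_i;                    % action of \hat{A}(x)^{-1} on eta_i
mu_i = E' * v;                          % mu_i(x) in (8.23)

fi     = eta_i' * v + diag(E'*R*E)' * x / n;     % (8.21)
gradfi = diag(E'*R*E) / n - mu_i .* mu_i;        % (8.22)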
8.5 Proximal algorithms

Algorithms of tractable complexity are introduced here for solving the regularized structured optimal control problem (8.6). A standard proximal gradient descent algorithm is considered as a starting point. It is argued, however, that computing the full gradient may be prohibitively expensive. Thus, we propose a more efficient algorithm based on incremental proximal gradient that is well-suited for the design of structured optimal feedback gains in large-scale systems.

8.5.1 Proximal gradient descent

The proximal gradient algorithm is a first-order method that extends standard gradient descent to a class of non-smooth composite optimization problems. In our setup, it iteratively updates the feedback gain K via application of the proximal operator associated with the function g to the standard gradient descent update. The algorithm is initialized with a stabilizing K⁰ and the iterates are determined by

K^{l+1} = prox_{α_l g}( K^l − α_l ∇f(K^l) )

where l is the iteration index and α_l is the step-size. At each iteration, the step-size is selected via backtracking to guarantee sufficient descent of the objective function and stability of the closed-loop system (8.2). To enhance practical performance, we use the Barzilai-Borwein (BB) step-size initialization [127]. For any matrix V and a positive scalar μ, the proximal operator of the function g is defined as [76]

prox_{μg}(V) := argmin_Z ( g(Z) + (1/(2μ)) ‖Z − V‖_F² )

where ‖·‖_F is the Frobenius norm. In particular, for g(K) = γ ‖K‖₁ = γ Σ_{ij} |K_{ij}|, we have

K^{l+1} = S_{γα_l}( K^l − α_l ∇f(K^l) )

where S_κ(V) is the soft-thresholding function that acts on the individual entries V_{ij} of the matrix V according to S_κ(V_{ij}) = sign(V_{ij}) max(|V_{ij}| − κ, 0). In the absence of the regularization function g in (8.6), the proximal gradient algorithm simplifies to standard gradient descent with the updates K^{l+1} = K^l − α_l ∇f(K^l). Recall that estimates of ∇f(K^l) require the matrices KL and PL. These can be computed or estimated from data using the techniques described in the previous section.

Remark 19. The step-size α_l is adjusted via backtracking to ensure stability of the closed-loop system (8.2) at each iteration. It is initialized using the BB method, which provides a heuristic for approximating the Hessian of the function f via a scaled version of the identity matrix [127], (1/α_l) I. At the lth iteration, the initial BB step-size

α_{l,0} := ‖K^l − K^{l−1}‖₂² / trace( (K^{l−1} − K^l)ᵀ (∇f(K^{l−1}) − ∇f(K^l)) )        (8.25)

is adjusted via backtracking until system (8.2) is stable and sufficient descent is achieved at each iteration, i.e.,

f(K^{l+1}) < f(K^l) + trace( (K^{l+1} − K^l)ᵀ ∇f(K^l) ) + (1/(2α_l)) ‖K^{l+1} − K^l‖₂².        (8.26)

The algorithm stops when the generalized gradient map becomes smaller than a given tolerance ε,

‖ ( K^l − prox_{α_l g}( K^l − α_l ∇f(K^l) ) ) / α_l ‖ < ε.
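The Matlab sketch below assembles these ingredients: soft-thresholding, the BB initialization (8.25), the sufficient-descent test (8.26), and the generalized gradient map stopping test. A stand-in strongly convex quadratic plays the role of f; in the actual control problem, f and ∇f would come from the simulations of Section 8.3, and backtracking would additionally verify closed-loop stability, which is omitted here since the stand-in objective is defined everywhere.

% Minimal sketch of proximal gradient with BB initialization and
% soft-thresholding; the quadratic objective is an illustrative stand-in.
soft  = @(V, kappa) sign(V) .* max(abs(V) - kappa, 0);   % S_kappa
n = 5;  gamma = 0.1;
H     = gallery('lehmer', n);              % SPD Hessian of the stand-in f
f     = @(K) 0.5 * trace(K' * H * K);      % placeholder for the H2 norm
gradf = @(K) H * K;

K = eye(n);  Kold = K + 0.1;  gold = gradf(Kold);
alpha = 1;
for l = 1:100
    g = gradf(K);
    s = K - Kold;  y = g - gold;
    alpha = max(trace(s'*s) / max(trace(s'*y), eps), 1e-8);   % BB init (8.25)
    while true                                                % backtracking (8.26)
        Knew = soft(K - alpha*g, gamma*alpha);
        if f(Knew) <= f(K) + trace((Knew - K)'*g) + norm(Knew - K, 'fro')^2/(2*alpha)
            break
        end
        alpha = alpha/2;
    end
    Kold = K;  gold = g;  K = Knew;
    if norm((K - Kold)/alpha, 'fro') < 1e-6, break, end       % gradient map test
end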
8.5.2 Incremental proximal gradient

Next, we exploit the separable structure obtained in Section 8.2 to speed up computations and save memory. This formulation allows us to work with a single function f_k and its gradient ∇f_k in the controller design, which reduces computational complexity and enables implementation at scale. We start with an initial stabilizing gain K⁰, and at each iteration we solve the following optimization problem

minimize_K  f_k(K) + g(K).        (8.27)

The explicit solution to (8.27) is given by

K^{l+1} = prox_{α_l g}( K^l − α_l ∇f_k(K^l) ).

Here, prox is the proximal operator of the function g and α_l is a step-size. This is an incremental proximal gradient algorithm because, at each iteration, the optimization variable is updated in the negative direction of a single element of the gradient, where the index k is selected at random. At each iteration, we choose the step-size to be O(1/l) and adjust it to guarantee stability of the closed-loop system (8.2).

Incremental and stochastic gradient-based algorithms have recently found widespread use in large-scale optimization and machine learning. The high variance that results from estimating the full gradient ∇f using samples of its entries can result in slow convergence. Moreover, to ensure convergence, the step-size has to decay to zero. In the first few iterations, the objective function in the incremental proximal gradient method decreases dramatically, but it starts to oscillate after that. Thus, in large-scale optimization problems where high accuracy may not be achievable, these methods are useful because of their fast initial convergence rate.

8.6 An ADMM-based consensus algorithm

We next demonstrate that a standard consensus algorithm based on the Alternating Direction Method of Multipliers (ADMM) is well-suited for distributed design of optimal structured feedback gains. As shown in Section 8.3, the regularized optimal control problem can be written as

minimize_K  Σ_{i=1}^N f_i(K) + g(K).        (8.28)

This characterization is suitable for a distributed implementation in which each processor solves an optimization problem. By introducing N local variables K_i and a global variable K₀, problem (8.28) can be brought into the standard consensus form

minimize_{K_i, K₀}  Σ_{i=1}^N f_i(K_i) + g(K₀)
subject to  K_i − K₀ = 0,  i = 1, ..., N.        (8.29)

This formulation increases the number of optimization variables, but it brings the objective function into a separable form and facilitates distributed computations. When implemented on a single machine, solving the reformulated problem is not necessarily more efficient than solving the original problem using centralized algorithms. However, this formulation allows us to work with a local objective function f_i that depends on a single local variable K_i. This is advantageous for large-scale systems, where centralized algorithms cannot be afforded (e.g., because of the computational cost associated with solving large-scale Lyapunov equations). The augmented Lagrangian associated with (8.29) is

L(K_i, K₀, Λ) = g(K₀) + Σ_{i=1}^N ( f_i(K_i) + ⟨Λ_i, K_i − K₀⟩ + (ρ_i/2) ‖K_i − K₀‖_F² )        (8.30)

where the Λ_i's are the Lagrange multipliers and the ρ_i's are positive parameters. The ADMM algorithm consists of the following iterative steps,

K_i^{k+1} = argmin_{K_i} ( f_i(K_i) + (ρ_i/2) ‖K_i − U_i^k‖_F² )
K₀^{k+1} = argmin_{K₀} ( g(K₀) + Σ_{i=1}^N (ρ_i/2) ‖K₀ − V_i^k‖_F² )
Λ_i^{k+1} = Λ_i^k + ρ_i ( K_i^{k+1} − K₀^{k+1} )

where

U_i^k := K₀^k − (1/ρ_i) Λ_i^k,   V_i^k := K_i^{k+1} + (1/ρ_i) Λ_i^k.

The K_i-minimization step can be carried out in a distributed manner by spreading the subproblems to N different processors. On the other hand, the update of K₀ amounts to the evaluation of the proximal operator of the function g,

K₀^{k+1} = prox_{g/(N ρ̄)} ( (1/N) Σ_{i=1}^N (ρ_i/ρ̄) V_i^k )

where ρ̄ := (1/N) Σ_i ρ_i. Thus, the update of K₀ requires gathering each K_i^{k+1} and the associated Lagrange multipliers Λ_i^k in order to form V_i^k.

Remark 20. The consensus algorithm presented above is standard (e.g., see [113]), but to the best of our knowledge it has not been previously used for the optimal design of structured feedback gains via distributed optimization. Recently, convergence of this algorithm was established even for problems with non-convex objective functions f_i [112]. The authors of [112] also show that an additional computational advantage can be gained by updating only a subset of the K_i's in each iteration, and they offer several alternative implementations to speed up computations.

Remark 21. The update of each K_i amounts to the computation of the proximal operator associated with f_i,

K_i^{k+1} = prox_{f_i/ρ_i}( U_i^k ).

However, while the proximal operators of commonly used regularization functions are easy to evaluate, the computation of prox_{f_i/ρ_i} is significantly more involved. In [42], the proximal operator associated with the closed-loop H₂ norm f(K) was computed using the Anderson-Moore algorithm. This approach requires computation of the closed-loop controllability and observability Gramians. In contrast, the developments of Section 8.3 facilitate computation of prox_{f_i/ρ_i} via a proximal gradient algorithm that avoids the need for solving large-scale Lyapunov equations.

Remark 22. For the problem of growing connected resistive consensus networks [106], Algorithm 3 in [112] can be used to solve the K_i-minimization step in the ADMM-based algorithm explicitly. The key point of departure compared to the standard ADMM implementation is the linearization, in the K_i-minimization step, of the function f_i with a Lipschitz continuous gradient around the current K₀ iterate. This offers a significant speed-up and enables an explicit update of K_i.
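A minimal Matlab sketch of these three ADMM steps is given below. To keep the example self-contained, the local objectives are taken to be illustrative quadratics f_i(K) = (1/2)‖K − D_i‖_F², for which the K_i-minimization has a closed form; in the control problem this step would instead be carried out by the proximal gradient method of Section 8.5.

% Minimal sketch of the ADMM consensus iteration with quadratic local
% objectives (placeholders) and g = gamma*||.||_1.
N = 4;  n = 3;  gamma = 0.05;  rho = ones(N, 1);
soft = @(V, kappa) sign(V) .* max(abs(V) - kappa, 0);
D   = arrayfun(@(i) randn(n), 1:N, 'UniformOutput', false);  % data of f_i
Ki  = repmat({zeros(n)}, N, 1);  Lam = repmat({zeros(n)}, N, 1);
K0  = zeros(n);
for k = 1:200
    V = zeros(n);
    for i = 1:N
        Ui = K0 - Lam{i}/rho(i);
        % argmin_K 0.5*||K - D_i||_F^2 + (rho_i/2)*||K - U_i||_F^2
        Ki{i} = (D{i} + rho(i)*Ui) / (1 + rho(i));
        V = V + (rho(i)/mean(rho)) * (Ki{i} + Lam{i}/rho(i)) / N;
    end
    K0 = soft(V, gamma/(N*mean(rho)));   % prox of g at the weighted average
    for i = 1:N
        Lam{i} = Lam{i} + rho(i)*(Ki{i} - K0);   % multiplier update
    end
end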
Figure 8.1: A mass-spring-damper system on a line.

8.6.1 Computational experiments

We next provide examples to illustrate the performance of the proximal algorithms. The acronym proxG represents the proximal gradient algorithm and proxIG the incremental proximal gradient algorithm. We have implemented all algorithms in Matlab. In all examples, we choose R = I and Q = I. The algorithms were tested on a mass-spring-damper (MSD) model with N masses, illustrated in Figure 8.1. Our goal is to find a local minimum of problem (8.6) with

g(K) = γ ‖K‖₁

where γ is the sparsity-promoting parameter and the ℓ₁ norm is a proxy for inducing sparsity in the feedback gain matrix. The MSD system with N masses has n = 2N states, where the first N states denote the positions and the rest are velocities. We consider the case where all masses, spring, and damping constants are equal to one, which results in the state-space model with

A = [ 0  I ; −T  −T ],   B = [ 0 ; I ]

where T is an N × N tridiagonal symmetric Toeplitz matrix with 2 on the main diagonal and −1 on the first upper and lower sub-diagonals, and I and 0 are N × N identity and zero matrices. Sparsity-promoting controllers were designed for this model with N = 10 masses and compared with the ADMM-based algorithm [42] that does not utilize the iterative re-weighting scheme. The algorithms were initialized with the (unstructured) optimal feedback gain K*. Figure 8.2 shows how the sparsity pattern of the controller changes as the value of the parameter γ is increased.

Figure 8.2: Structure of feedback gains resulting from the solution to (8.6) for different values of the regularization parameter γ for an MSD system with N = 10 masses; the four panels correspond to γ = 0 (n_z = 200), γ = 0.01 (n_z = 102), γ = 0.1 (n_z = 38), and γ = 1 (n_z = 16), where n_z denotes the number of non-zero elements in K.

It is worth noting that all algorithms give the same structure of K with very close feedback gain values. For γ = 0, the optimal feedback controller is given by a dense matrix, and as γ increases the controller becomes sparser. It is diagonal when γ = 1, and for sufficiently large γ the gain is identically zero. This is a feasible solution because the open-loop system is stable.
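For reference, a short Matlab sketch that assembles the MSD state-space data used in these experiments:

% State-space matrices of the mass-spring-damper model with N unit
% masses, springs, and dampers, as described above.
N = 10;
T = toeplitz([2 -1 zeros(1, N-2)]);        % tridiagonal symmetric Toeplitz
A = [zeros(N), eye(N); -T, -T];            % positions stacked over velocities
B = [zeros(N); eye(N)];
Q = eye(2*N);  R = eye(N);                 % weights used in the experiments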
Next, we employ the ADMM-based consensus algorithm to design optimal structured feedback gains in a distributed fashion by splitting the problem into N separate subproblems over N different cores. We have provided a parallel implementation in C++ using the pthreads library and executed tests on a machine featuring an Intel Core i7-3770 with 16GB of RAM to measure the performance of the algorithm. We use our algorithm for growing a connected resistive network with n = 20 nodes. The plant graph is given by an Erdős-Rényi network with edge probability 1.05 log(n)/n. We choose the control weight matrix R = I and a state weight matrix that penalizes the mean-square deviation from the network average, Q = I − (1/n)11ᵀ. Moreover, the incidence matrix of the controller is chosen such that there are no joint edges between the plant and the controller graphs. As discussed in Section 8.4, for consensus networks the controller can be written as a function of the vector of edge weights x. Thus, the H₂ norm of the closed-loop system is f(x) and the regularization function is given by g(x) = γ ‖w ∘ x‖₁, where w is the vector of the weights. Since the plant network is resistive and connected, all of its edge weights are nonnegative; thus, if the added edges have nonnegative edge weights, the closed-loop system is stable. We can therefore write the smooth and nonsmooth parts of the objective function as f(x) + γ wᵀ x and g(x) = I₊(x), where I₊(x) is the indicator function of the nonnegative orthant.

We solve problem (8.28) to find the controller graph for 500 logarithmically-spaced values of γ ∈ [0.001, 0.3], using the path-following iterative reweighted algorithm as a proxy for inducing sparsity [125]. We set the weights to be inversely proportional to the magnitude of the solution x at the previous value of γ, followed by a polishing step that computes the optimal weights of the identified edges; see [106].

As γ increases, the number of nonzero edges decreases and the closed-loop performance deteriorates. As shown in Fig. 8.3, relative to the optimal centralized vector of edge weights x_c, the H₂ loss decreases as the sparsity level card(x)/card(x_c) of the vector of edge weights x increases. In particular, for γ = 0.3, there is only one nonzero element in the vector of edge weights. The identified sparse controller in this case uses only 0.62% of the edges relative to the optimal centralized controller, i.e., card(x)/card(x_c) = 0.62%, and achieves a performance loss of 23.47%, i.e., (J − J_c)/J_c = 23.47%.

Figure 8.3: (a) Sparsity level card(x)/card(x_c); (b) performance degradation (J − J_c)/J_c; and (c) the optimal trade-off between the performance degradation and the sparsity level of the optimal sparse x compared to the optimal centralized controller x_c. The results are obtained for a randomly generated Erdős-Rényi network with n = 20 nodes.

Next, we employ our algorithm for growing connected resistive networks with different numbers of nodes using multiple cores. The solve times are averaged over 10 trials and the speedup relative to a single core is displayed in Fig. 8.4. This figure demonstrates that the algorithm is scalable; in particular, multi-core execution outperforms running on a single core. The speed-up is even higher for larger networks, since the overhead of parallel execution is smaller and more time is spent on the actual parallel computation.

Figure 8.4: Speedup ratio versus the number of cores used for growing connected resistive networks with n nodes.

8.7 Concluding remarks

This chapter provides a step towards a comprehensive theory for distributed and structured optimal control of LTI systems. Even within the linear setting there are many open questions, including:

- development of simulation-based methods in the presence of disturbances and uncertainty;
- construction of truly recursive algorithms, similar to those used in traditional reinforcement learning settings;
- techniques to avoid local minima, and conditions that ensure the non-existence of local minima for f;
- development of algorithms that combine proximal methods with the natural policy gradient.

Extensions to nonlinear control will probably require more modest objectives, since computing the performance of a given policy is no longer an easy task. However, both Q-learning and TD-learning are based on value function approximation. Consequently, as in the present work, these algorithms can be designed so that each value function approximation serves as a Lyapunov function for the current approximating policy. This provides bounds on performance as well as stability. Consequently, algorithms for the computation of structured stabilizing policies with good performance are not out of reach even for nonlinear control systems.

Chapter 9

Conclusions

This dissertation focuses on structure identification and optimal control of large-scale networks of dynamical systems. To overcome the challenges in this area, we combine tools and ideas from control theory, distributed optimization, reinforcement learning, and compressive sensing to develop distributed estimation and control strategies that require limited information exchange between the individual subsystems. In the first and second parts, we have studied the problems of optimal distributed design and inference of large-scale networks of dynamical systems. We have developed theoretical and computational frameworks to solve these problems. In particular, we have designed and implemented customized optimization algorithms that are well-suited for solving the above-mentioned problems in large-scale dynamical systems. In the third part, we have studied the convergence properties of these algorithms from a system-theoretic point of view: we have viewed different optimization algorithms as dynamical systems and studied their convergence properties using stability analysis techniques from control theory. Finally, in the last part, we have provided a data-driven framework for the structured optimal control problem that does not require knowledge of the system parameters and avoids the need to solve large-scale matrix equations.
Future directions

The current research on design and control of networks can be extended to make it applicable to a larger class of systems, such as unbalanced time-varying large-scale networks. We have also developed an algorithm to recover the underlying undirected network topology of a complex system using partially available statistical data. We will extend this framework to identify the structure of directed networks, which is a more challenging problem since the resulting optimization problem is non-convex. Also, we are utilizing results from optimization theory to develop a new model with more complicated dynamics than first-order consensus. Furthermore, we are extending the work on distributed computation over networks to solve non-smooth optimization problems with more than one regularizer. In this case, the dynamical system describing the primal-dual updates is more complicated and the convergence analysis is more challenging. Moreover, we will extend the data-driven framework to be truly recursive, similar to frameworks used in traditional reinforcement learning settings, and will combine proximal methods with the natural policy gradient. The framework can also be generalized by developing simulation-based methods in the presence of disturbances and uncertainty.

References

[1] G. E. Dullerud and F. Paganini. A course in robust control theory. Springer, 2000.
[2] M. Mesbahi and M. Egerstedt. Graph Theoretic Methods in Multiagent Networks. Princeton University Press, 2010.
[3] L. Xiao and S. Boyd. Fast linear iterations for distributed averaging. Syst. Control Lett., 53:65-78, 2004.
[4] Y. Kim and M. Mesbahi. On maximizing the second smallest eigenvalue of a state-dependent graph Laplacian. IEEE Trans. Automat. Control, 51(1):116-120, 2006.
[5] L. Xiao, S. Boyd, and S.-J. Kim. Distributed average consensus with least-mean-square deviation. J. Parallel Distrib. Comput., 67(1):33-46, 2007.
[6] P. Barooah and J. P. Hespanha. Estimation on graphs from relative measurements: Distributed algorithms and fundamental limits. IEEE Control Syst. Mag., 27(4):57-74, 2007.
[7] A. Ghosh, S. Boyd, and A. Saberi. Minimizing effective resistance of a graph. SIAM Rev., 50(1):37-66, 2008.
[8] D. Zelazo and M. Mesbahi. Edge agreement: Graph-theoretic performance bounds and passivity analysis. IEEE Trans. Automat. Control, 56(3):544-555, 2011.
[9] D. Zelazo, S. Schuler, and F. Allgöwer. Performance and design of cycles in consensus networks. Syst. Control Lett., 62(1):85-96, 2013.
[10] B. Bamieh, M. R. Jovanovic, P. Mitra, and S. Patterson. Coherence in large-scale networks: dimension dependent limitations of local feedback. IEEE Trans. Automat. Control, 57(9):2235-2249, September 2012. (2013 George S. Axelby Outstanding Paper Award).
[11] L. Xiao and S. Boyd. Fast linear iterations for distributed averaging. Systems & Control Letters, 53(1):65-78, 2004.
[12] C. W. Reynolds. Flocks, herds and schools: A distributed behavioral model. ACM SIGGRAPH Computer Graphics, 21(4):25-34, 1987.
[13] T. Vicsek, A. Czirók, E. Ben-Jacob, I. Cohen, and O. Shochet. Novel type of phase transition in a system of self-driven particles. Physical Review Letters, 75(6):1226, 1995.
[14] M. H. DeGroot. Reaching a consensus. J. Amer. Statist. Assoc., 69(345):118-121, 1974.
[15] B. Golub and M. Jackson. Naive learning in social networks and the wisdom of crowds. American Economic Journal: Microeconomics, 2(1):112-149, 2010.
[16] F. Dörfler, M. R. Jovanovic, M. Chertkov, and F. Bullo.
Sparsity-promoting optimal wide-area control of power networks. IEEE Trans. Power Syst., 29(5):2281-2291, September 2014.
[17] X. Wu, F. Dörfler, and M. R. Jovanovic. Decentralized optimal control of inter-area oscillations in bulk power systems. In Proceedings of the 54th IEEE Conference on Decision and Control, pages 5532-5537, Osaka, Japan, 2015.
[18] X. Wu, F. Dörfler, and M. R. Jovanovic. Input-output analysis and decentralized optimal control of inter-area oscillations in power systems. IEEE Trans. Power Syst., 31(3):2434-2444, May 2016.
[19] G. Cybenko. Dynamic load balancing for distributed memory multiprocessors. J. Parallel Distrib. Comput., 7(2):279-301, 1989.
[20] J. E. Boillat. Load balancing and Poisson equation in a graph. Concurrency: Practice and Experience, 2(4):289-313, 1990.
[21] V. Preciado, M. Zargham, C. Enyioha, A. Jadbabaie, and G. Pappas. Optimal vaccine allocation to control epidemic outbreaks in arbitrary networks. In Proceedings of the 52nd IEEE Conference on Decision and Control, pages 7486-7491. IEEE, 2013.
[22] E. Ramírez-Llanos and S. Martínez. A distributed nonlinear dynamics for virus spread control. Automatica, 2016. To appear.
[23] A. Jadbabaie, J. Lin, and A. S. Morse. Coordination of groups of mobile autonomous agents using nearest neighbor rules. IEEE Trans. Automat. Control, 48(6):988-1001, 2003.
[24] R. Olfati-Saber and R. M. Murray. Consensus problems in networks of agents with switching topology and time-delays. IEEE Trans. Automat. Control, 49(9):1520-1533, 2004.
[25] L. Moreau. Stability of multiagent systems with time-dependent communication links. IEEE Trans. Automat. Control, 50(2):169-182, 2005.
[26] R. Carli, F. Fagnani, A. Speranzon, and S. Zampieri. Communication constraints in the average consensus problem. Automatica, 44(3):671-684, 2007.
[27] S. Hassan-Moghaddam, N. K. Dhingra, and M. R. Jovanovic. Topology identification of undirected consensus networks via sparse inverse covariance estimation. In Proceedings of the 55th IEEE Conference on Decision and Control, pages 4624-4629, Las Vegas, NV, 2016.
[28] S. Hassan-Moghaddam and M. R. Jovanovic. Topology identification via growing a Chow-Liu tree network. In Proceedings of the 57th IEEE Conference on Decision and Control, pages 5421-5426, Miami, FL, 2018.
[29] M. Fazlyab, A. Ribeiro, M. Morari, and V. M. Preciado. Analysis of optimization algorithms via integral quadratic constraints: Nonstrongly convex problems. 2017. arXiv:1705.03615.
[30] B. Hu and P. Seiler. Exponential decay rate conditions for uncertain linear systems using integral quadratic constraints. IEEE Trans. Autom. Control, 61(11):3631-3637, 2016.
[31] L. Lessard, B. Recht, and A. Packard. Analysis and design of optimization algorithms via integral quadratic constraints. SIAM J. Optim., 26(1):57-95, 2016.
[32] Y. Wan, S. Roy, and A. Saberi. Designing spatially heterogeneous strategies for control of virus spread. IET Systems Biology, 2(4):184-201, 2008.
[33] M. Siami and N. Motee. Tractable approximation algorithms for the NP-hard problem of growing linear consensus networks. In Proceedings of the 2016 American Control Conference, pages 6429-6434, 2016.
[34] A. Ghosh and S. Boyd. Growing well-connected graphs. In Proceedings of the 45th IEEE Conference on Decision and Control, pages 6605-6611, 2006.
[35] F. Lin, M. Fardad, and M. R. Jovanovic. Identification of sparse communication graphs in consensus networks.
In Proceedings of the 50th Annual Allerton Conference on Communication, Control, and Computing, pages 85-89, Monticello, IL, 2012. (Invited paper).
[36] M. Fardad, F. Lin, and M. R. Jovanovic. Design of optimal sparse interconnection graphs for synchronization of oscillator networks. IEEE Trans. Automat. Control, 59(9):2457-2462, September 2014.
[37] T. H. Summers, I. Shames, J. Lygeros, and F. Dörfler. Topology design for optimal network coherence. In Proceedings of the 2015 European Control Conference, pages 575-580, 2015.
[38] B. Polyak, M. Khlebnikov, and P. Shcherbakov. An LMI approach to structured sparse feedback design in linear control systems. In Proceedings of the 2013 European Control Conference, pages 833-838, 2013.
[39] N. K. Dhingra, M. R. Jovanovic, and Z. Q. Luo. An ADMM algorithm for optimal sensor and actuator selection. In Proceedings of the 53rd IEEE Conference on Decision and Control, pages 4039-4044, Los Angeles, CA, 2014. (Invited paper).
[40] A. Zare, H. Mohammadi, N. K. Dhingra, T. T. Georgiou, and M. R. Jovanovic. Proximal algorithms for large-scale statistical modeling and optimal sensor/actuator selection. IEEE Trans. Automat. Control, 2019. doi:10.1109/TAC.2019.2948268; also arXiv:1807.01739.
[41] M. Fardad, F. Lin, and M. R. Jovanovic. Sparsity-promoting optimal control for a class of distributed systems. In Proceedings of the 2011 American Control Conference, pages 2050-2055, San Francisco, CA, 2011.
[42] F. Lin, M. Fardad, and M. R. Jovanovic. Design of optimal sparse feedback gains via the alternating direction method of multipliers. IEEE Trans. Automat. Control, 58(9):2426-2431, September 2013.
[43] J. F. Sturm. Using SeDuMi 1.02, a Matlab toolbox for optimization over symmetric cones. Optim. Methods Softw., 11(1-4):625-653, 1999.
[44] K.-C. Toh, M. J. Todd, and R. H. Tütüncü. SDPT3, a Matlab software package for semidefinite programming, version 1.3. Optim. Methods Softw., 11(1-4):545-581, 1999.
[45] P. A. Valdés-Sosa, J. M. Sánchez-Bornot, A. Lage-Castellanos, M. Vega-Hernández, J. Bosch-Bayard, L. Melie-García, and E. Canales-Rodríguez. Estimating brain functional connectivity with sparse multivariate autoregression. Phil. Trans. R. Soc. B, 360(1457):969-981, 2005.
[46] D. Yu, M. Righero, and L. Kocarev. Estimating topology of networks. Phys. Rev. Lett., 97(18):188701, 2006.
[47] M. M. Zavlanos, A. A. Julius, S. P. Boyd, and G. J. Pappas. Identification of stable genetic networks using convex programming. In Proceedings of the 2008 American Control Conference, pages 2755-2760, 2008.
[48] D. Napoletani and T. D. Sauer. Reconstructing the topology of sparsely connected dynamical networks. Physical Review E, 77(2):026103, 2008.
[49] M. Ayazoglu, M. Sznaier, and N. Ozay. Blind identification of sparse dynamic networks and applications. In Proceedings of the 50th IEEE Conference on Decision and Control, pages 2944-2950, 2011.
[50] M. Nabi-Abdolyousefi, M. Fazel, and M. Mesbahi. A graph realization approach to network identification. In Proceedings of the 51st IEEE Conference on Decision and Control, pages 4642-4647, 2012.
[51] D. Materassi and M. V. Salapaka. On the problem of reconstructing an unknown topology via locality properties of the Wiener filter. IEEE Trans. Automat. Control, 57(7):1765-1777, 2012.
[52] D. Heckerman, D. Geiger, and D. M. Chickering. Learning Bayesian networks: The combination of knowledge and statistical data.
JMLR, 20(3):197-243, 1995.
[53] A. Zare, Y. Chen, M. R. Jovanovic, and T. T. Georgiou. Low-complexity modeling of partially available second-order statistics: theory and an efficient matrix completion algorithm. IEEE Trans. Automat. Control, 62(3):1368-1383, March 2017.
[54] A. Zare, M. R. Jovanovic, and T. T. Georgiou. Colour of turbulence. J. Fluid Mech., 812:636-680, February 2017.
[55] A. Zare, T. T. Georgiou, and M. R. Jovanovic. Stochastic dynamical modeling of turbulent flows. Annu. Rev. Control Robot. Auton. Syst., 3, May 2020. In press; doi:10.1146/annurev-control-053018-023843; also arXiv:1908.09487.
[56] M. M. Rubinov and O. Sporns. Complex network measures of brain connectivity: uses and interpretations. Neuroimage, 52(3):1059-1069, 2010.
[57] B. Biswal, F. Z. Yetkin, V. M. Haughton, and J. S. Hyde. Functional connectivity in the motor cortex of resting human brain using echo-planar MRI. Magn. Reson. Med., 34(4):537-541, 1995.
[58] M. Lynall, D. S. Bassett, R. Kerwin, P. J. McKenna, M. Kitzbichler, U. Muller, and E. Bullmore. Functional connectivity and brain networks in schizophrenia. J. Neurosci., 30(28):9477-9487, 2010.
[59] D. Cordes, V. Haughton, J. D. Carew, K. Arfanakis, and K. Maravilla. Hierarchical clustering to measure connectivity in fMRI resting-state data. JMRI, 20(4):305-317, 2002.
[60] P. E. Vértes, A. F. Alexander-Bloch, N. Gogtay, J. N. Giedd, J. L. Rapoport, and E. T. Bullmore. Simple models of human brain functional networks. Proceedings of the National Academy of Sciences, 109(15):5868-5873, 2012.
[61] H. E. Egilmez, E. Pavez, and A. Ortega. Graph learning from data under Laplacian and structural constraints. IEEE J. Sel. Topics Signal Process., 2017.
[62] H. J. van Waarde, P. Tesi, and M. K. Camlibel. Topology reconstruction of dynamical networks via constrained Lyapunov equations. arXiv preprint arXiv:1706.09709, 2017.
[63] S. Hassan-Moghaddam and M. R. Jovanovic. An interior point method for growing connected resistive networks. In Proceedings of the 2015 American Control Conference, pages 1223-1228, Chicago, IL, 2015.
[64] S. Sojoudi and J. Doyle. On the efficacy of computational methods for learning graphical models. Submitted.
[65] R. Dai and M. Mesbahi. Optimal topology design for dynamic networks. In Proceedings of the 50th IEEE Conference on Decision and Control, pages 1280-1285, 2011.
[66] S. Hassan-Moghaddam and M. R. Jovanovic. Customized algorithms for growing connected resistive networks. In Proceedings of the 10th IFAC Symposium on Nonlinear Control Systems, pages 986-991, Monterey, CA, 2016.
[67] D. Materassi and G. Innocenti. Topological identification in networks of dynamical systems. IEEE Trans. Automat. Control, 55(8):1860-1871, 2010.
[68] N. K. Dhingra, F. Lin, M. Fardad, and M. R. Jovanovic. On identifying sparse representations of consensus networks. In Preprints of the 3rd IFAC Workshop on Distributed Estimation and Control in Networked Systems, pages 305-310, Santa Barbara, CA, 2012.
[69] M. Siami and N. Motee. Network sparsification with guaranteed systemic performance measures. In Preprints of the 5th IFAC Workshop on Distributed Estimation and Control in Networked Systems, volume 48, pages 246-251, 2015.
[70] G. F. Young, L. Scardovi, A. Cavagna, I. Giardina, and N. E. Leonard. Starling flock networks manage uncertainty in consensus at low cost. PLoS Comput. Biol., 9(1):e1002894, 2013.
[71] M. R. Jovanovic and N. K. Dhingra. Controller architectures: tradeoffs between performance and structure.
Eur. J. Control, 30:76-91, July 2016.
[72] A. Nedić and A. Ozdaglar. Distributed subgradient methods for multiagent optimization. IEEE Trans. Automat. Control, 54(1):48-61, 2009.
[73] J. Wang and N. Elia. A control perspective for centralized and distributed convex optimization. In Proceedings of the 50th IEEE Conference on Decision and Control and the 10th European Control Conference, pages 3800-3805, 2011.
[74] P. Latafat, L. Stella, and P. Patrinos. New primal-dual proximal algorithm for distributed optimization. In Proceedings of the 55th IEEE Conference on Decision and Control, pages 1959-1964, 2016.
[75] P. Latafat, N. Freris, and P. Patrinos. A new randomized block-coordinate primal-dual proximal algorithm for distributed optimization. IEEE Trans. Automat. Control, 2019. doi:10.1109/TAC.2019.2906924.
[76] N. Parikh and S. Boyd. Proximal algorithms. Foundations and Trends in Optimization, 1(3):123-231, 2013.
[77] K. J. Arrow, L. Hurwicz, H. Uzawa, and H. B. Chenery. Studies in linear and non-linear programming. 1958.
[78] D. Feijer and F. Paganini. Stability of primal-dual gradient dynamics and applications to network optimization. Automatica, 46(12):1974-1981, 2010.
[79] A. Cherukuri, E. Mallada, and J. Cortés. Asymptotic convergence of constrained primal-dual dynamics. Syst. Control Lett., 87:10-15, 2016.
[80] A. Cherukuri, E. Mallada, S. Low, and J. Cortés. The role of convexity in saddle-point dynamics: Lyapunov function and robustness. IEEE Trans. Automat. Control, 63(8):2449-2464, 2018.
[81] N. K. Dhingra, S. Z. Khong, and M. R. Jovanovic. The proximal augmented Lagrangian method for nonsmooth composite optimization. IEEE Trans. Automat. Control, 64(7):2861-2868, July 2019.
[82] G. Qu and N. Li. On the exponential stability of primal-dual gradient dynamics. IEEE Control Syst. Lett., 3(1):43-48, 2018.
[83] W. Su, S. Boyd, and E. Candès. A differential equation for modeling Nesterov's accelerated gradient method: Theory and insights. J. Mach. Learn. Res., 17:1-43, 2016.
[84] A. Wibisono, A. C. Wilson, and M. I. Jordan. A variational perspective on accelerated methods in optimization. Proc. Natl. Acad. Sci., 113(47):E7351-E7358, 2016.
[85] G. França, D. Robinson, and R. Vidal. ADMM and accelerated ADMM as continuous dynamical systems. 2018. arXiv:1805.06579.
[86] B. Shi, S. Du, M. I. Jordan, and W. Su. Understanding the acceleration phenomenon via high-resolution differential equations. 2018. arXiv:1810.08907.
[87] M. Muehlebach and M. I. Jordan. A dynamical systems perspective on Nesterov acceleration. 2019. arXiv:1905.07436.
[88] J. I. Poveda and N. Li. Inducing uniform asymptotic stability in time-varying accelerated optimization dynamics via hybrid regularization. 2019. arXiv:1905.12110.
[89] A. Nedić and A. Ozdaglar. Distributed subgradient methods for multi-agent optimization. IEEE Trans. Automat. Control, 54:48, 2009.
[90] D. Feijer and F. Paganini. Stability of primal-dual gradient dynamics and applications to network optimization. Automatica, 46(12):1974-1981, 2010.
[91] A. Cherukuri, E. Mallada, S. Low, and J. Cortés. The role of convexity in saddle-point dynamics: Lyapunov function and robustness. IEEE Trans. Automat. Control, 63(8):2449-2464, 2017.
[92] A. Brown and M. Bartholomew-Biggs. Some effective methods for unconstrained optimization based on the solution of systems of ordinary differential equations. J. Optimiz. Theory App., 62(2):211-224, 1989.
[93] J. Schropp and I. Singer.
A dynamical systems approach to constrained minimization. Numer. Func. Anal. Opt., 21(3-4):537-551, 2000.
[94] J. Zhang, A. Mokhtari, S. Sra, and A. Jadbabaie. Direct Runge-Kutta discretization achieves acceleration. In Advances in Neural Information Processing Systems, pages 3900-3909, 2018.
[95] B. Hu, P. Seiler, and A. Rantzer. A unified analysis of stochastic optimization methods using jump system theory and quadratic constraints. In Proceedings of the 2017 Conference on Learning Theory, pages 1157-1189, 2017.
[96] B. Hu and L. Lessard. Dissipativity theory for Nesterov's accelerated method. In Proceedings of the 34th International Conference on Machine Learning, pages 1549-1557, 2017.
[97] M. Fazlyab, A. Ribeiro, M. Morari, and V. M. Preciado. Analysis of optimization algorithms via integral quadratic constraints: Nonstrongly convex problems. SIAM J. Optim., 28(3):2654-2689, 2018.
[98] S. Hassan-Moghaddam and M. R. Jovanovic. Distributed proximal augmented Lagrangian method for nonsmooth composite optimization. In Proceedings of the 2018 American Control Conference, pages 2047-2052, Milwaukee, WI, 2018.
[99] S. Hassan-Moghaddam and M. R. Jovanovic. On the exponential convergence rate of proximal gradient flow algorithms. In Proceedings of the 57th IEEE Conference on Decision and Control, pages 4246-4251, Miami, FL, 2018. (Invited paper).
[100] D. Ding, B. Hu, N. K. Dhingra, and M. R. Jovanovic. An exponentially convergent primal-dual algorithm for nonsmooth composite minimization. In Proceedings of the 57th IEEE Conference on Decision and Control, pages 4927-4932, Miami, FL, 2018.
[101] J. Seidman, M. Fazlyab, V. Preciado, and G. Pappas. A control-theoretic approach to analysis and parameter selection of Douglas-Rachford splitting. 2019. arXiv:1903.11525.
[102] H. Mohammadi, M. Razaviyayn, and M. R. Jovanovic. Variance amplification of accelerated first-order algorithms for strongly convex quadratic optimization problems. In Proceedings of the 57th IEEE Conference on Decision and Control, pages 5753-5758, Miami, FL, 2018.
[103] H. Mohammadi, M. Razaviyayn, and M. R. Jovanovic. Performance of noisy Nesterov's accelerated method for strongly convex optimization problems. In Proceedings of the 2019 American Control Conference, pages 3426-3431, Philadelphia, PA, 2019.
[104] H. Mohammadi, M. Razaviyayn, and M. R. Jovanovic. Robustness of accelerated first-order algorithms for strongly convex optimization problems. IEEE Trans. Automat. Control, 2019. Submitted; also arXiv:1905.11011.
[105] S. Michalowsky, C. Scherer, and C. Ebenbauer. Robust and structure exploiting optimization algorithms: An integral quadratic constraint approach. 2019. arXiv:1905.00279.
[106] S. Hassan-Moghaddam and M. R. Jovanovic. Topology design for stochastically-forced consensus networks. IEEE Trans. Control Netw. Syst., 5(3):1075-1086, September 2018.
[107] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci., 2(1):183-202, 2009.
[108] N. K. Dhingra, S. Z. Khong, and M. R. Jovanovic. A second order primal-dual method for nonsmooth convex composite optimization. IEEE Trans. Automat. Control, 2017. Conditionally accepted; also arXiv:1709.01610.
[109] W. Shi, Q. Ling, G. Wu, and W. Yin. EXTRA: An exact first-order algorithm for decentralized consensus optimization. SIAM J. Optim., 25(2):944-966, 2015.
[110] B. Gharesifard and J. Cortés. Distributed continuous-time convex optimization on weight-balanced digraphs. IEEE Trans. Automat. Control, 59(3):781-786, 2014.
[111] S. Hassan-Moghaddam and M. R. Jovanovic. Distributed design of optimal structured feedback gains. In Proceedings of the 56th IEEE Conference on Decision and Control, pages 6586-6591, Melbourne, Australia, 2017.
[112] M. Hong, Z. Q. Luo, and M. Razaviyayn. Convergence analysis of alternating direction method of multipliers for a family of nonconvex problems. SIAM J. Optimiz., 26(1):337-364, 2016.
[113] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1-124, 2011.
[114] F. Lin, M. Fardad, and M. R. Jovanovic. Augmented Lagrangian approach to design of structured optimal state feedback gains. IEEE Trans. Automat. Control, 56(12):2923-2929, December 2011.
[115] N. Matni and V. Chandrasekaran. Regularization for design. IEEE Trans. Automat. Control, 61(12):3991-4006, 2016.
[116] S. Dean, H. Mania, N. Matni, B. Recht, and S. Tu. On the sample complexity of the linear quadratic regulator. 2017. arXiv:1710.01688.
[117] M. Fazel, R. Ge, S. M. Kakade, and M. Mesbahi. Global convergence of policy gradient methods for linearized control problems. 2018. arXiv:1801.05039.
[118] S. Schuler, P. Li, J. Lam, and F. Allgöwer. Design of structured dynamic output-feedback controllers for interconnected systems. International Journal of Control, 2011. doi:10.1080/00207179.2011.634029.
[119] S. Hassan-Moghaddam, X. Wu, and M. R. Jovanovic. Edge addition in directed consensus networks. In Proceedings of the 2017 American Control Conference, pages 5592-5597, Seattle, WA, 2017.
[120] C. Conte, C. N. Jones, M. Morari, and M. N. Zeilinger. Distributed synthesis and stability of cooperative distributed model predictive control for linear systems. Automatica, 69:117-125, 2016.
[121] S. Hassan-Moghaddam and M. R. Jovanovic. Proximal gradient flow and Douglas-Rachford splitting dynamics: global exponential stability via integral quadratic constraints. Automatica, 2019. Submitted; also arXiv:1908.09043.
[122] S. Hassan-Moghaddam, M. R. Jovanovic, and S. Meyn. Data-driven proximal algorithms for the design of structured optimal feedback gains. In Proceedings of the 2019 American Control Conference, pages 5846-5850, Philadelphia, PA, 2019.
[123] J. D. Lee, Y. Sun, and M. A. Saunders. Proximal Newton-type methods for minimizing composite functions. SIAM J. Optim., 24(3):1420-1443, 2014.
[124] C.-J. Hsieh, M. A. Sustik, I. S. Dhillon, and P. Ravikumar. QUIC: Quadratic approximation for sparse inverse covariance estimation. J. Mach. Learn. Res., 15:2911-2947, 2014.
[125] E. J. Candès, M. B. Wakin, and S. P. Boyd. Enhancing sparsity by reweighted ℓ1 minimization. J. Fourier Anal. Appl., 14:877-905, 2008.
[126] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[127] J. Barzilai and J. M. Borwein. Two-point step size gradient methods. IMA J. Numer. Anal., 8(1):141-148, 1988.
[128] P. Tseng and S. Yun. A coordinate gradient descent method for nonsmooth separable minimization. Math. Prog., 117(1-2):387-423, 2009.
[129] Y.-H. Dai and R. Fletcher. Projected Barzilai-Borwein methods for large-scale box-constrained quadratic programming. Numerische Mathematik, 100(1):21-47, 2005.
[130] S. J. Wright, R. D. Nowak, and M. A. T. Figueiredo. Sparse reconstruction by separable approximation. IEEE Trans. Signal Process., 57(7):2479-2493, 2009.
[131] J. J. McAuley and J. Leskovec.
Learning to discover social circles in ego networks. In Adv. Neural Inf. Process. Syst., pages 539-547, 2012.
[132] M. Fardad, X. Zhang, F. Lin, and M. R. Jovanovic. On the properties of optimal weak links in consensus networks. In Proceedings of the 53rd IEEE Conference on Decision and Control, pages 2124-2129, Los Angeles, CA, 2014. (Invited paper).
[133] X. Wu and M. R. Jovanovic. Sparsity-promoting optimal control of consensus and synchronization networks. In Proceedings of the 2014 American Control Conference, pages 2948-2953, Portland, OR, 2014.
[134] T. H. Summers, I. Shames, J. Lygeros, and F. Dörfler. Topology design for optimal network coherence. In Proceedings of the 14th European Control Conference, July 2015.
[135] T. Rautert and E. W. Sachs. Computational design of optimal output feedback controllers. SIAM J. Optim., 7(3):837-852, 1997.
[136] M. Hong, Z. Q. Luo, and M. Razaviyayn. Convergence analysis of alternating direction method of multipliers for a family of nonconvex problems. SIAM J. Optimiz., 26(1):337-364, 2016.
[137] M. Fortin and R. Glowinski. Augmented Lagrangian methods: Applications to the numerical solution of boundary-value problems. North-Holland, 1983.
[138] R. Glowinski and P. Le Tallec. Augmented Lagrangian and operator-splitting methods in nonlinear mechanics. SIAM, 1989.
[139] K. Scheinberg, S. Ma, and D. Goldfarb. Sparse inverse covariance selection via alternating linearization methods. In NIPS, pages 2101-2109, 2010.
[140] J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432-441, 2008.
[141] O. Banerjee, L. El Ghaoui, and A. d'Aspremont. Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. J. Mach. Learn. Res., 9:485-516, 2008.
[142] M. Yuan and Y. Lin. Model selection and estimation in the Gaussian graphical model. Biometrika, 94(1):19-35, 2007.
[143] J. Duchi, S. Gould, and D. Koller. Projected subgradient methods for learning sparse Gaussians. arXiv preprint arXiv:1206.3249, 2012.
[144] C.-J. Hsieh, I. S. Dhillon, P. K. Ravikumar, and M. A. Sustik. Sparse inverse covariance matrix estimation using quadratic approximation. In Adv. Neural Inf. Process. Syst., pages 2330-2338, 2011.
[145] P. Tseng. Convergence of a block coordinate descent method for nondifferentiable minimization. J. Optim. Theory Appl., 109(3):475-494, 2001.
[146] S. Shalev-Shwartz and A. Tewari. Stochastic methods for ℓ1-regularized loss minimization. J. Mach. Learn. Res., 12:1865-1892, 2011.
[147] A. Saha and A. Tewari. On the non-asymptotic convergence of cyclic coordinate descent methods. SIAM J. Optim., 23(1):576-601, 2013.
[148] B. Rolfs, B. Rajaratnam, D. Guillot, I. Wong, and A. Maleki. Iterative thresholding algorithm for sparse inverse covariance estimation. In NIPS, pages 1574-1582, 2012.
[149] F. Oztoprak, J. Nocedal, S. Rennie, and P. A. Olsen. Newton-like methods for sparse inverse covariance estimation. In NIPS, pages 755-763, 2012.
[150] O. Banerjee, L. El Ghaoui, A. d'Aspremont, and G. Natsoulis. Convex optimization techniques for fitting sparse Gaussian graphical models. In Proceedings of the 23rd International Conference on Machine Learning, pages 89-96, 2006.
[151] A. d'Aspremont, O. Banerjee, and L. El Ghaoui. First-order methods for sparse covariance selection. SIAM J. Matrix Anal.
Appl., 30(1):56-66, 2008.
[152] C.-J. Hsieh, M. A. Sustik, I. S. Dhillon, P. K. Ravikumar, and R. Poldrack. BIG & QUIC: Sparse inverse covariance estimation for a million variables. In NIPS, pages 3165-3173, 2013.
[153] N. Motee and A. Jadbabaie. Optimal control of spatially distributed systems. IEEE Trans. Automat. Control, 53(7):1616-1629, 2008.
[154] C. K. Chow and C. N. Liu. Approximating discrete probability distributions with dependence trees. IEEE Trans. Inf. Theory, 14(3):462-467, 1968.
[155] M. P. van den Heuvel, C. J. Stam, R. S. Kahn, and H. R. H. Pol. Efficiency of functional brain networks and intellectual performance. J. Neurosci., 29(23):7619-7624, 2009.
[156] S. Sojoudi and J. Doyle. Study of the brain functional network using synthetic data. In Proceedings of the 52nd Annual Allerton Conference on Communication, Control, and Computing, pages 350-357, 2014.
[157] P. Patrinos, L. Stella, and A. Bemporad. Forward-backward truncated Newton methods for convex composite optimization. 2014. arXiv:1402.6655.
[158] L. Stella, A. Themelis, and P. Patrinos. Forward-backward quasi-Newton methods for nonsmooth optimization problems. Comput. Optim. Appl., 67(3):443-487, 2017.
[159] A. Themelis, L. Stella, and P. Patrinos. Forward-backward envelope for the sum of two nonconvex functions: Further properties and nonmonotone line-search algorithms. SIAM J. Optim., 28(3):2274-2303, 2018.
[160] A. Megretski and A. Rantzer. System analysis via integral quadratic constraints. IEEE Trans. Autom. Control, 42(6):819-830, 1997.
[161] B. T. Polyak. Gradient methods for minimizing functionals. Zhurnal Vychislitel'noi Matematiki i Matematicheskoi Fiziki, 3(4):643-653, 1963.
[162] J. Douglas and H. Rachford. On the numerical solution of heat conduction problems in two and three space variables. Trans. Amer. Math. Soc., 82(2):421-439, 1956.
[163] A. Themelis and P. Patrinos. Douglas-Rachford splitting and ADMM for nonconvex optimization: tight convergence results. 2017. arXiv:1709.05747.
[164] H. Karimi, J. Nutini, and M. Schmidt. Linear convergence of gradient and proximal-gradient methods under the Polyak-Lojasiewicz condition. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 795-811, 2016.
[165] A. Rantzer. On the Kalman-Yakubovich-Popov lemma. Systems & Control Letters, 28(1):7-10, 1996.
[166] P. Giselsson and S. Boyd. Linear convergence and metric selection for Douglas-Rachford splitting and ADMM. IEEE Trans. Automat. Control, 62(2):532-544, 2017.
[167] J. Eckstein and D. P. Bertsekas. On the Douglas-Rachford splitting method and the proximal point algorithm for maximal monotone operators. Math. Program., 55(1-3):293-318, 1992.
[168] D. Gabay. Applications of the method of multipliers to variational inequalities. In Studies in Mathematics and its Applications, volume 15, pages 299-331. 1983.
[169] A. Cherukuri, B. Gharesifard, and J. Cortés. Saddle-point dynamics: conditions for asymptotic stability of saddle points. SIAM J. Control Optim., 55(1):486-511, 2017.
[170] H. K. Khalil. Nonlinear systems. Prentice-Hall, New Jersey, 1996.
[171] K. Mårtensson and A. Rantzer. Gradient methods for iterative distributed control synthesis. In Proceedings of the 48th IEEE Conference on Decision and Control, pages 549-554, 2009.
[172] K. Mårtensson and A. Rantzer. A scalable method for continuous-time distributed control synthesis. In Proceedings of the 2012 American Control Conference, pages 6308-6313, 2012.
[173] N. K. Dhingra and M. R. Jovanovic. A method of multipliers algorithm for sparsity-promoting optimal control. In Proceedings of the 2016 American Control Conference, pages 1942-1947, Boston, MA, 2016. (Invited paper).
[174] D. G. Luenberger. Optimization by vector space methods. John Wiley & Sons Inc., New York, 1969. Reprinted 1997.
[175] J. N. Tsitsiklis and B. Van Roy. An analysis of temporal-difference learning with function approximation. IEEE Trans. Automat. Control, 42(5):674-690, 1997.
[176] S. P. Meyn. Control Techniques for Complex Networks. Cambridge University Press, 2007. Pre-publication edition available online.
[177] P. G. Mehta and S. P. Meyn. Q-learning and Pontryagin's minimum principle. In Proceedings of the 48th IEEE Conference on Decision and Control, pages 3598-3605, 2009.
[178] S. Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251-276, 1998.
[179] S. M. Kakade. A natural policy gradient. In Advances in Neural Information Processing Systems, pages 1531-1538, 2002.
[180] N. K. Dhingra and M. R. Jovanovic. Convex synthesis of symmetric modifications to linear systems. In Proceedings of the 2015 American Control Conference, pages 3583-3588, Chicago, IL, 2015.
[181] D. P. Bertsekas. Convex optimization algorithms. Athena Scientific, 2015.
[182] S. Kia, J. Cortés, and S. Martínez. Distributed convex optimization via continuous-time coordination algorithms with discrete-time communication. Automatica, 55:254-264, 2015.

Appendix A

Proximal PL condition

The generalization of the PL condition to nonsmooth problems was introduced in [164] and is given by

(1/2) D_g(x, L_f) ≥ μ (F(x) − F*)        (A.1)

where μ is a positive constant, L_f is the Lipschitz constant of ∇f, and D_g(x, λ) is given by

D_g(x, λ) := −2λ min_y ( ⟨∇f(x), y − x⟩ + (λ/2) ‖y − x‖₂² + g(y) − g(x) ).        (A.2)

Herein, we show that if the proximal PL condition (A.1) holds, there is a lower bound, given by (6.19), on the norm of the generalized gradient map G_α(x). For α ∈ (0, 1/L_f),

D_g(x, 1/α) ≥ D_g(x, L_f)

and, thus, inequality (A.1) also holds for D_g(x, 1/α). Moreover, from the definition of D_g(x, λ) given by (A.2), it follows that

D_g(x, 1/α) = (2/α) ( F(x) − F_α(x) )

where F := f + g and F_α is the associated FB envelope. Substituting this expression for D_g(x, 1/α) into (A.1) yields

(1/α) ( F(x) − F_α(x) ) ≥ μ ( F(x) − F* ).        (A.3)

From (6.5), the smooth part of the objective function f can be written as

f(x) = F_α(x) − g( prox_{αg}(x − α∇f(x)) ) + α ⟨∇f(x), G_α(x)⟩ − (α/2) ‖G_α(x)‖₂²

and substituting this expression for f into both sides of (A.3) yields

(αμ − 1)(α/2) ‖G_α(x)‖₂² ≥ αμ ( F_α(x) − F* ) + (αμ − 1) g(x) − (αμ − 1) g( prox_{αg}(x − α∇f(x)) ) + (αμ − 1) α ⟨∇f(x), G_α(x)⟩.        (A.4)

Since G_α(x) − ∇f(x) ∈ ∂g( prox_{αg}(x − α∇f(x)) ), the subgradient inequality (6.18) implies

0 ≤ α ‖G_α(x)‖₂² ≤ g(x) − g( prox_{αg}(x − α∇f(x)) ) + α ⟨∇f(x), G_α(x)⟩.        (A.5)

Thus, if αμ − 1 > 0, inequality (A.4) can be written as

(αμ − 1)(α/2) ‖G_α(x)‖₂² ≥ αμ ( F_α(x) − F* ).

Moreover, if αμ − 1 < 0, let β := −(αμ − 1) ≥ 0. From (A.4) and (A.5), we have

β (α/2) ‖G_α(x)‖₂² ≥ αμ ( F_α(x) − F* ).

Thus, in both cases the inequality holds with the constant |αμ − 1|. Furthermore, it can be shown that [157]

F_α* = F*  and  argmin F_α = argmin F.

Thus, F_α* can be substituted by F* and we have

‖G_α(x)‖₂² ≥ γ ( F_α(x) − F* )

where γ := 2μ/|αμ − 1|.
A.1 Pointwise quadratic inequality

Herein, we provide a pointwise quadratic inequality that characterizes the nonlinear map u(x) = prox_{αg}(x − α∇f(x)) in (6.24b). Since the proximal operator of g is firmly nonexpansive [76], it is Lipschitz continuous with parameter 1 and we have

‖u − û‖₂² ≤ ‖(x − α∇f(x)) − (x̂ − α∇f(x̂))‖₂².        (A.6)

Expanding the right-hand side of (A.6) yields

‖u − û‖₂² ≤ ‖x − x̂‖₂² + α² ‖∇f(x) − ∇f(x̂)‖₂² − 2α ⟨x − x̂, ∇f(x) − ∇f(x̂)⟩

and utilizing inequality (6.17) for an m_f-strongly convex function f with an L_f-Lipschitz continuous gradient, the last inequality can be further simplified to

‖u − û‖₂² ≤ ( 1 − 2α m_f L_f/(L_f + m_f) ) ‖x − x̂‖₂² + ( α² − 2α/(L_f + m_f) ) ‖∇f(x) − ∇f(x̂)‖₂².        (A.7)

Now, depending on the sign of α − 2/(L_f + m_f), either (6.15) or (6.16) can be used to upper bound the second term on the right-hand side of (A.7) and obtain

‖u − û‖₂² ≤ max( (1 − α L_f)², (1 − α m_f)² ) ‖x − x̂‖₂².        (A.8)

Choosing the constant given by (6.26) completes the proof of Lemma 7. We note that this contraction property has also been exploited in [97, 181].
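The contraction estimate (A.8) is easy to verify numerically; the Matlab sketch below does so for an illustrative strongly convex quadratic f with g = ‖·‖₁ (all data are placeholders chosen for the example).

% Minimal numerical check of the contraction factor in (A.8).
n  = 20;  H = gallery('lehmer', n);        % SPD Hessian of the stand-in f
mf = min(eig(H));  Lf = max(eig(H));       % strong convexity / Lipschitz constants
alpha = 2/(mf + Lf);                       % an example step-size
soft  = @(v, kappa) sign(v) .* max(abs(v) - kappa, 0);
u     = @(x) soft(x - alpha*(H*x), alpha); % u(x) = prox_{alpha g}(x - alpha grad f(x))

x = randn(n, 1);  xh = randn(n, 1);
lhs = norm(u(x) - u(xh));
rhs = max(abs(1 - alpha*Lf), abs(1 - alpha*mf)) * norm(x - xh);
fprintf('||u - uhat|| = %.4f <= %.4f\n', lhs, rhs);   % (A.8)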
Thus, it can be decomposed as L =V V T = h U 1 n 1 i " 0 0 0 0 #" U T 1 n 1 T # 174 where the columns ofV are the eigenvectors ofL, 0 is a diagonal matrix of the nonzero eigenvalues of L, and the matrix U satises, U T U = I; U T 1 = 0; UU T = I 1 n 11 T : (B.3) By introducing a change of variables " x # = " U T x 1 n 1 T x # ; " y # = " U T ~ y 1 n 1 T ~ y # with 2R n1 and 2R n1 , (B.2) can be written as 2 6 6 6 6 6 6 4 _ _ x _ _ y 3 7 7 7 7 7 7 5 = 2 6 6 6 6 6 6 4 ( 1 0 + m f I) U T u 1 n 1 T u m f x y 0 0 3 7 7 7 7 7 7 5 (B.4) where u =rf(x)m f x; and from the properties (B.3) of the matrix U we have, x = h U 1 i " x # : (B.5) By choosing y(0) = 0, we have y 0. Thus, the y-dynamics can be eliminated from (B.4), which yields _ w = Aw + Bu = Cw u = rf() m f : (B.6a) 175 Here, w := [ T x T ] T , :=x, A = 2 6 6 6 4 ( 1 0 + m f I) 0 I 0 m f 0 0 0 0 3 7 7 7 5 B = 2 6 6 4 U T 1 n 1 T 0 3 7 7 5 ; C = h U 1 0 i (B.6b) and the corresponding transfer function is H(s) = h U 1 i 2 4 H 1 (s) 0 0 1 s +m f 3 5 " U T 1 n 1 T # (B.6c) where H 1 (s) = diag s s 2 + ( i = + m f )s + i : Furthermore, under Assumption 5 with u =rf()m f , for any and ^ 2R n , we have [31], " ^ u ^ u # T " 0 (L f m f )I (L f m f )I 2I #" ^ u ^ u # 0: We now employ the KYP lemma to establish global exponential stability of (B.6a). Theorem 16. Let Assumption 5 hold. Then proximal gradient ow dynamics (B.6a) are globally exponentially stable, i.e., there is ; 0<<m f such that, kw(t) w ? k e t kw(0) w ? k . Proof. The KYP lemma implies global exponential stability if " H (j!) I # " H (j!) I # 0; 8!2 R (B.7) 176 where = " 0 (L f m f )I (L f m f )I 2I # H (j!) = H(j! ): It is easy to show that (B.7) holds for all !2R if 2I (L f m f ) (H (j!) + H (j!)) 0: This condition yields a decoupled family of inequalities, ! 2 + (m f ) 2 + (L f m f )(m f ) > 0 (B.8) (! 2 b i ()) 2 + c i ()! 2 + d i () > 0 (B.9) which have to hold for all !2R and fori = 1;:::;n 1. Condition (B.8) clearly holds if 2 (0;m f ). On the other hand, checking (B.9) amounts to checking a decoupled family of quadratic inequalities in ! 2 where b i (), c i (), and d i () are parameters that depend on , L f , m f , i , and . At = 0, these are given by b i (0) = i =; c i (0) = ( i = + m f ) ( i = + L f ); d i (0) = 0: Positivity of b i (0) and c i (0) for each i and continuity of b i (), c i (), and d i () with respect to imply the existence of> 0 that guarantees (B.9) for each!2R and each i = 1;:::;n 1, which completes the proof.