USC Digital Library / University of Southern California Dissertations and Theses
Stochastic optimization in high dimension (USC Thesis)
STOCHASTIC OPTIMIZATION IN HIGH DIMENSION

by

Hanie Sedghi

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)

August 2015

Copyright 2015 Hanie Sedghi

To my family, for their endless love, support and encouragement.

Acknowledgements

I am truly thankful to my adviser, Professor Edmond Jonckheere, who taught me the joy of mathematical research, gave me the freedom to pursue the problems I was most interested in, and let me grow as a research scientist. I sincerely thank my co-adviser, Professor Anima Anandkumar, who introduced me to the amazing world of theoretical machine learning and helped me conquer new horizons. I am deeply grateful for her dedication, guidance, support and especially her confidence in me. She has been an invaluable mentor for me. I would like to extend my appreciation to my dissertation committee members, Professor Bhaskar Krishnamachari, who valued my research and encouraged me, and Professor Yan Liu, for her useful comments. I am very thankful to Dr. Alekh Agarwal for detailed discussion of his work and his valuable comments. I am grateful for the fruitful discussions with Professor Alex Smola; I have learned a lot from his insightful feedback.

Many thanks to the wonderful staff at the Ming Hsieh Department of Electrical Engineering, especially Ms. Diane Demetras, Mr. Tim Boston and Mr. Shane Goodo. I would like to extend my greatest thanks to the USC Graduate School for supporting my PhD studies with the USC Provost Fellowship and USC Graduate School Fellowship awards. A special feeling of gratitude and my warm blessing to the friends who supported me in every respect through this journey, especially Hamed Abrishami, Sanaz Barghi, Hadi Goudarzi, Fatemeh Kash, Salman Khaleghi, Mohammad Naghshvar, Parastoo Qarabaqi, Susan Ramirez and Roozbeh Tabrizian.
I wish to acknowledge help and advice from my dear friends and colleagues, Taha Bahadori, Behnam Bahrak, Reza Banirazi, Mohammad Reza Chitgarha, Alireza Imani, Sina Jafarpour, Anantha Karthikeyan, Sina Khaleghi, Mohammad Mehdi Korjani, Phylis LeBourgeois, Sanam Mirzazad, Mohammad Reza Rajati and Katayoun Zand.

Most importantly, I want to thank my family, my source of inspiration, courage and joy: my husband, Majid Janzamin, who spurred me on with his patience, enthusiasm and abundant love and lit my path with hope and peace past all obstacles; my parents, Mitra Soraya and Kambiz Sedghi, who raised me with their heavenly love, sacrifice, wisdom and devotion to become who I am, embraced my fears and gave me the strength to never surrender; and my wonderful sisters, Hosna and Hoda, who stood by me through all of life's challenges; without their unconditional love, passion and encouragement, I could not chase my dreams. I am deeply grateful to my gracious aunt Fatemeh, my loving grandmother Effat and my kind uncle Hossein, whose pure love, compassion and sincere prayers have accompanied me in every stage of my life. I am thankful to my brother-in-law, Mehdi, for his support and kindness through these years.

Abstract

In this thesis, we consider two main problems in learning with big data: data integrity and high dimension. We specifically consider the problem of data integrity in the smart grid, as it is of paramount importance for grid maintenance and control; moreover, data manipulation there can lead to catastrophic events. Inspired by this problem, we then expand the horizon to designing a general framework for stochastic optimization in high dimension, for any loss function and any underlying low-dimensional structure. We propose Regularized Epoch-based Admm for Stochastic Optimization in high-dimensioN (REASON). Our ADMM method is based on epoch-based annealing and consists of inexpensive steps which involve projections onto simple norm balls.
We provide explicit bounds for the sparse optimization problem and the noisy matrix decomposition problem and show that our convergence rate in both cases matches the minimax lower bound. For matrix decomposition into sparse and low rank components, we provide the first guarantees for any online method. Experiments show that for both sparse optimization and matrix decomposition problems, our algorithm outperforms the state-of-the-art methods; in particular, we reach higher accuracy with the same time complexity.

Table of Contents

Acknowledgements
Abstract
List of Tables
List of Figures
Chapter 1: Introduction
  1.1 Initial Results: Data Integrity in Smart Grid
  1.2 Main Results: Stochastic Optimization in High Dimension
Chapter 2: Preliminaries
  2.1 Notation and Terminology
  2.2 High Dimensional Statistics
  2.3 Graphical Model
    2.3.1 Kullback–Leibler Divergence
  2.4 Convex Optimization
    2.4.1 Stochastic Optimization
    2.4.2 Alternating Direction Method of Multipliers
Chapter 3: Initial Results: Data Integrity in Smart Grid
  3.1 Introduction
    3.1.1 Summary of Results
    3.1.2 Related Work
    3.1.3 Bus Phase Angles GMRF
  3.2 Structure Learning
    3.2.1 Conditional Covariance Test
    3.2.2 Decentralization
    3.2.3 Online Calculations
  3.3 Stealthy Deception Attack
  3.4 Stealthy Deception Attack Detection
    3.4.1 Detecting the Set of Attacked Nodes
    3.4.2 Reactive Power versus Voltage Amplitude
  3.5 Simulation
  3.6 Discussion and Conclusion
Chapter 4: Stochastic Optimization in High Dimension
  4.1 Introduction
  4.2 Problem Formulation
  4.3 ℓ1-Regularized Stochastic Optimization
    4.3.1 Epoch-based Online ADMM Algorithm
    4.3.2 High-dimensional Guarantees
  4.4 Extension to Doubly Regularized Stochastic Optimization
    4.4.1 Epoch-based Multi-Block ADMM Algorithm
    4.4.2 High-dimensional Guarantees
      4.4.2.1 Optimal Guarantees for Various Statistical Models
  4.5 Proof Ideas and Discussion
    4.5.1 Proof Ideas for REASON 1
    4.5.2 Proof Ideas for REASON 2
    4.5.3 Graphical Model Selection
      4.5.3.1 Sparse Optimization for Learning Gaussian Graphical Models
      4.5.3.2 Sparse + Low Rank Decomposition for Learning Latent Gaussian Graphical Models
  4.6 Experiments
    4.6.1 REASON 1
    4.6.2 REASON 2
  4.7 Implementation
    4.7.1 Implementation Details for REASON 1
    4.7.2 Implementation Details for REASON 2
  4.8 Conclusion
Appendix: Proofs
  A.1 Guarantees for REASON 1
    A.1.1 Proof Outline for Theorem 3
  A.2 Proof of Theorem 3
    A.2.1 Proofs for Convergence within a Single Epoch for Algorithm 2
    A.2.2 Proof of Proposition 2: Inequality (A.1.3a)
    A.2.3 Proof of Lemma 4
    A.2.4 Proof of Lemma 5
    A.2.5 Proof of Proposition 2: Inequality (A.1.3b)
    A.2.6 Proof of Guarantees with Fixed Epoch Length, Sparse Case
    A.2.7 Proof of Guarantees for Sparse Graphical Model Selection Problem
  A.3 Guarantees for REASON 2
    A.3.1 Proof Outline for Theorem 11
  A.4 Proof of Theorem 11
    A.4.1 Proofs for Convergence within a Single Epoch for Algorithm 3
    A.4.2 Proof of Proposition 3: Equation (A.3.2a)
    A.4.3 Proof of Lemma 12
    A.4.4 Proof of Lemma 13
    A.4.5 Proof of Proposition 3: Equation (A.3.2b)
    A.4.6 Proof of Guarantees with Fixed Epoch Length, Sparse + Low Rank Case
    A.4.7 Proof of Guarantees for Sparse + Low Rank Graphical Model Selection Problem
References

List of Tables

Table 4.1: Comparison of online sparse optimization methods under sparsity level s for the optimal parameter, d-dimensional space, and T iterations. SC = strong convexity, LSC = local strong convexity, LL = local Lipschitz, L = Lipschitz property, E = in expectation. The last row provides the minimax-optimal rate on error for any method. The results hold with high probability unless otherwise mentioned.

Table 4.2: Comparison of optimization methods for sparse + low rank matrix decomposition for a p × p matrix under sparsity level s and rank-r matrices, where T is the number of samples. SC = strong convexity, LSC = local strong convexity, LL = local Lipschitz, L = Lipschitz for loss function, IN = independent noise model, DF = diffuse low rank matrix under the optimal parameter. β(p) ∈ {Θ(√p), O(p)}, with its value depending on the model. The last row provides the minimax-optimal rate on error for any method under the independent noise model. The results hold with high probability unless otherwise mentioned. For multi-block ADMM [Wang et al., 2013b] the convergence rate is on the difference of the loss function from the optimal loss; for the rest of the works in the table, the convergence rate is on ||S(T) − S*||²_F + ||L(T) − L*||²_F.

Table 4.3: Least squares regression problem, epoch size T_i = 2000, Error = ||θ̂ − θ*||₂/||θ*||₂.

Table 4.4: REASON 2 and inexact ALM, matrix decomposition problem, p = 2000, σ² = 0.01.

List of Figures

Figure 2.1: Global Markov property: X_I ⊥ X_J | X_S.
Figure 2.2: Local Markov property: X_i ⊥ X_{V∖{i∪N(i)}} | X_{N(i)}.
Figure 3.1: Flowchart of our detection algorithm.
Figure 3.2: Power grid under a cyber attack.
Figure 3.3: Detection rate for IEEE-14 bus system.
Figure 3.4: Anomaly score for IEEE-14 bus system. Nodes 4, 5, 6 are under attack; attack size is 0.7.
Figure 3.5: Anomaly score for IEEE-14 bus system for different attack sizes. Nodes 4, 5, 6 are under attack; attack sizes are 0.5, 0.7, 1.
Figure 4.1: Graphical representation of a latent variable model.
Figure 4.2: Least squares regression, Error = ||θ̂ − θ*||₂/||θ*||₂ vs. iteration number, d_1 = 20 and d_2 = 20000.

Chapter 1

Introduction

With the new trend in technology comes the challenge of data deluge in almost every domain. We have ever-growing amounts of data: system measurements, images, video, genetics, social networks, and so on. The goal is to use these data for inference and even control. But more data does not automatically mean more information; data may be corrupted, so ensuring data integrity is an important task.

In addition, more data brings more challenges: along with it comes a huge number of unknown parameters. The classic approach in statistics fixes the number of parameters p and lets the sample size n → ∞. But modern applications in science and engineering involve large-scale problems where both n and p can be very large (possibly p ≫ n), and this calls for a high-dimensional theory where both (n, p) → ∞. In high-dimensional statistics we face exponential explosions in computational complexity, and for a large number of unknown parameters the sample complexity goes beyond the number of available samples. Therefore, for a tractable analysis with the available samples, additional assumptions are made on the problem structure, i.e., an embedded low-dimensional structure. Examples include sparse vectors, patterned matrices, low rank matrices, Markov random fields and certain manifold structures. Nevertheless, even within the class of tractable problems, high-dimensional optimization is harmed by the curse of dimensionality.
To be more specific, conventional optimization methods provide convergence bounds that are a quadratic function of the dimension; their convergence guarantees in high dimension are therefore disheartening. To make the challenge even more severe, we must consider another aspect of big data: the volume and velocity of the data we receive, which call for learning methods that are fast, cheap and do not require data storage. Batch methods are not a good fit for such applications; we need stochastic methods that can provide a good estimate of the parameter from each noisy sample they receive. We therefore no longer have the luxury of the noise concentration that batch optimization benefits from. Hence, the combination of stochastic optimization and high-dimensional statistics leads to an extremely difficult problem.

In this thesis we consider the challenges of learning with big data. To be more precise, our goal is to shed light on the problem of big data from two angles. First we consider the problem of data integrity. This is motivated by the fact that not all the data we receive is reliable, and in some cases trusting the data without any integrity check can lead to catastrophic events. One prominent case is that of data integrity in the smart grid; we elaborate on this shortly. Inspired by the complex problem of data integrity in the smart grid, we use it as a springboard to extend our horizon and consider the general case of stochastic optimization in high dimension, for any loss function satisfying some mild assumptions. Hence, our approach can be used for various applications. Our goal is to design a general method for stochastic optimization in the high-dimensional setting that is fast and cheap to implement, and to provide tight convergence guarantees that have logarithmic dependence on dimension (this is the minimax optimal rate). We also compare our method with earlier state-of-the-art methods via experiments.
1.1 Initial Results: Data Integrity in Smart Grid

Recently, Gaussian graphical models have been used as a valuable tool for modeling and analyzing various phenomena in diverse fields such as control, cyber security, biology, sociology, social networks and geology. It all starts with model selection for the random variables associated with the problem: finding the true underlying Markov graph among a group of random variables based on samples of those random variables.

Traditionally, the term grid refers to an electricity system that supports four operations: electricity generation, electricity transmission, electricity distribution, and voltage stability control. In the early days, generation was co-located with distribution in what we would now call a micro-grid, and the connections among the micro-grids were meant to transmit energy in case of such contingencies as a shift in the supply/demand balance. After deregulation, however, a large-scale generation-transmission-distribution network became the substitute for the traditional generation-distribution co-location. The new network allows consumers to purchase electricity at the cheapest price across the country, as opposed to the former arrangement in which consumers were forced to purchase electricity from local utility companies. Other considerations calling for an overhaul of the electricity system include the reduction of carbon emissions, an objective that cannot be achieved without a significant contribution from the electricity sector. This calls for a bigger share of renewable energy resources in the generation mix and a supply/demand balance that must be managed more effectively. Management and control of the grid, made increasingly complex by its response to electricity market conditions, are, next to its ability to detect contingencies, the most fundamental attributes that make it smart.
Automated large-scale management requires considerable exchange of information, so that the smart grid has become a two-commodity flow: electricity and information. By utilizing modern information technologies, the smart grid is capable of delivering power in a more efficient way and responding to wider-ranging conditions.

The massive amount of measurements and their transmission across the grid by modern information technology, however, make the grid prone to attacks. Fast and accurate detection of possibly malicious events is of paramount importance not only for preventing faults that may lead to blackouts, but also for routine monitoring and control tasks of the smart grid, including state estimation and optimal power flow. Fault localization in the nation's power grid networks is known to be challenging, due to the massive scale and inherent complexity. We use model selection to address this crucial matter for smart grid control and maintenance. We approach false data injection in the smart grid via statistical analysis of the underlying structure among the data. In order to learn the structure of the power grid, we utilize the Gaussian graphical model selection method called the Conditional Covariance Test (CMIT) [Anandkumar et al., 2012] and prove that under normal conditions this structure follows the grid topology. Next we show that our method can detect a sophisticated and strong attack, since such an attack causes the graphical model to change. Our approach is the first method that can comprehensively detect this data manipulation without the need for additional hardware.

1.2 Main Results: Stochastic Optimization in High Dimension

Stochastic optimization considers the problem of minimizing a loss function with access to noisy samples of (the gradient of) the function. The goal is to have an estimate of the optimal parameter (the minimizer) after each new sample. Compared to batch optimization, where we benefit from noise concentration, stochastic optimization is therefore a more challenging problem.
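The per-sample update idea can be sketched with plain stochastic gradient descent on a squared loss. This is a generic illustration of the setting, not the method developed in this thesis; the function names, step sizes and problem sizes are arbitrary.

```python
import numpy as np

def sgd_least_squares(samples, d, step=0.1):
    """Per-sample (stochastic) estimate of theta for the model y = <x, theta> + noise.

    `samples` yields noisy observations (x, y); one update is made per sample,
    so no data needs to be stored, matching the streaming setting described above.
    """
    theta = np.zeros(d)
    for t, (x, y) in enumerate(samples, start=1):
        grad = (x @ theta - y) * x            # noisy gradient of 0.5*(y - <x, theta>)^2
        theta -= (step / np.sqrt(t)) * grad   # decaying step size
    return theta

# Illustration: recover a 5-dimensional parameter from 2000 noisy samples.
rng = np.random.default_rng(0)
theta_star = np.array([1.0, -2.0, 0.0, 0.5, 3.0])
data = [(x, x @ theta_star + 0.1 * rng.standard_normal())
        for x in rng.standard_normal((2000, 5))]
estimate = sgd_least_squares(data, d=5)
print(np.linalg.norm(estimate - theta_star))  # close to theta_star
```

Each sample is touched exactly once, which is what makes the stochastic setting cheap but noisy compared to batch optimization.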
In addition, as discussed earlier, in high-dimensional statistics we have p ≫ n, so in general the problems are not tractable. To make the problem tractable we need to assume a low-dimensional underlying truth, such as sparse vectors, patterned matrices, low rank matrices, Markov random fields or certain manifold structures. Mathematically, this is modeled by adding a regularizer term to the optimization problem. These terms are mostly non-smooth and hence make each step of stochastic optimization very expensive, because in most cases no closed-form solution exists for each step of the optimization problem.

The alternating direction method of multipliers (ADMM) takes the form of a decomposition-coordination procedure, in which the solutions to small local subproblems are coordinated to find a solution to a large global problem. To be more precise, ADMM decomposes the optimization problem into two parts, minimizing the loss-function term and minimizing the regularization term, and then links the two solutions. ADMM can be viewed as an attempt to blend the benefits of dual decomposition and augmented Lagrangian methods for constrained optimization. It is a simple but powerful algorithm that is well suited to distributed convex optimization, and in particular to problems arising in applied statistics and machine learning. Nevertheless, stochastic ADMM techniques suffer from the curse of dimensionality: their convergence rates are proportional to the square of the dimension, which is disheartening for high-dimensional problems.

In order to design a general framework for stochastic optimization in high dimension that is fast and cheap, and that enjoys logarithmic dependence on dimension (and is hence minimax optimal), we modify stochastic ADMM so that we have the best of both worlds.
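The decomposition just described (a loss step, a regularizer step, and a dual variable linking them) can be sketched on a concrete instance, the lasso. This is generic batch ADMM, not the REASON algorithm developed in this thesis; the problem sizes and constants are made up for illustration.

```python
import numpy as np

def soft_threshold(v, t):
    """Prox of t*||.||_1: the regularizer-only subproblem has this closed form."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_admm(A, b, lam, rho=1.0, iters=200):
    """ADMM for: minimize 0.5*||A x - b||^2 + lam*||w||_1  subject to  x - w = 0.

    x-update: smooth least-squares part (closed form);
    w-update: l1 part (soft-thresholding);
    u-update: scaled dual variable linking the two solutions.
    """
    n, d = A.shape
    x = w = u = np.zeros(d)
    M = np.linalg.inv(A.T @ A + rho * np.eye(d))  # factor once, reuse every iteration
    for _ in range(iters):
        x = M @ (A.T @ b + rho * (w - u))
        w = soft_threshold(x + u, lam / rho)
        u = u + x - w
    return w

# Small demo: sparse signal; the l1 term zeroes out the irrelevant coordinates.
rng = np.random.default_rng(0)
A = rng.standard_normal((100, 20))
x_star = np.zeros(20); x_star[[0, 3]] = [1.5, -2.0]
b = A @ x_star + 0.05 * rng.standard_normal(100)
x_hat = lasso_admm(A, b, lam=2.0)
print(np.flatnonzero(x_hat))  # concentrated on the true support {0, 3}
```

Note that each subproblem is cheap on its own; the difficulty discussed in the text arises when the gradient is only available through noisy samples.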
We do this via an epoch-based approach: we perform intelligent projections onto a norm ball around the optimal value and shrink the ball after each epoch, so that we have error contraction by the end of each epoch. The norm ball is determined by the nature of the underlying optimal value, and hence by the regularizer term; for example, in the case of sparse optimization we use ℓ1-norm projections. It should be noted that we do not have knowledge of the optimal parameter, and hence use the average of the last epoch's estimates as an estimate of the optimal value. We design our algorithm parameters such that the optimal parameter remains feasible throughout the algorithm. In addition, we prove that with our choice of parameters, the squared error shrinks by half by the end of each epoch. Therefore, we halve the squared radius of the norm ball and can obtain a logarithmic dependence on the dimension. This is a general framework that can be applied to any convex optimization problem satisfying some mild assumptions, with any number of regularizers. We provide complete, detailed analysis for two well-known problems: sparse optimization and matrix decomposition into sparse and low rank parts (also known as robust PCA).

The above simple modifications to ADMM have huge implications for high-dimensional problems. For sparse optimization, our convergence rate is O(s log d / T) for s-sparse problems in d dimensions in T steps. Our bound has the best of both worlds: efficient high-dimensional scaling (as log d) and efficient convergence rate (as 1/T). This also matches the minimax lower bound for the linear model and square loss function [Raskutti et al., 2011], which implies that our guarantee is unimprovable by any (batch or online) algorithm (up to constant factors). For matrix decomposition, our convergence rate is O((s + r) β²(p) log p / T) + O(max{s + r, p}/p²) for a p × p input matrix in T steps, where the sparse part has s non-zero entries and the low rank part has rank r.
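The epoch scheme described above can be sketched in a deliberately simplified form, with the ADMM inner loop replaced by plain projected stochastic gradient steps. This illustrates only the project-and-shrink idea, not REASON itself; all names, constants and the gradient oracle are hypothetical.

```python
import numpy as np

def project_l1_ball(v, radius):
    """Euclidean projection of v onto the l1 ball {x : ||x||_1 <= radius}."""
    if np.abs(v).sum() <= radius:
        return v.copy()
    u = np.sort(np.abs(v))[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u - (css - radius) / np.arange(1, len(u) + 1) > 0)[0][-1]
    tau = (css[rho] - radius) / (rho + 1.0)
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def epoch_projected_sgd(grad_oracle, d, R0=10.0, epochs=6, epoch_len=500, step=0.05):
    """Epoch scheme: project onto an l1 ball centered at the last epoch's average,
    then shrink the ball (halving the squared radius) at each epoch boundary."""
    center, R = np.zeros(d), R0
    for _ in range(epochs):
        theta, avg = center.copy(), np.zeros(d)
        for _ in range(epoch_len):
            theta = theta - step * grad_oracle(theta)
            theta = center + project_l1_ball(theta - center, R)  # keep iterate feasible
            avg += theta / epoch_len
        center, R = avg, R / np.sqrt(2.0)
    return center

# Illustration with a sparse target and a noisy quadratic-loss gradient oracle.
rng = np.random.default_rng(0)
theta_star = np.zeros(20); theta_star[[0, 5, 9]] = [2.0, -0.5, 0.5]
oracle = lambda th: (th - theta_star) + 0.1 * rng.standard_normal(20)
est = epoch_projected_sgd(oracle, d=20)
print(np.linalg.norm(est - theta_star))  # small
```

The two ingredients from the text are visible: projection onto a simple norm ball, and annealing of the ball's radius so that the feasible region contracts around the optimum.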
For many natural noise models (e.g., independent noise, linear Bayesian networks), β²(p) = Θ(p), and the resulting convergence rate is minimax-optimal. Note that our bound is not only on the reconstruction error, but also on the error in recovering the sparse and low rank components. These are the first convergence guarantees for online matrix decomposition in high dimensions. Moreover, our convergence rate holds with high probability when noisy samples are input, in contrast to the expected convergence rates typically analyzed in the literature. See Tables 4.1 and 4.2 for a comparison of this work with related frameworks.

Our proposed algorithms provide significantly faster convergence in high dimension and better robustness to noise. For sparse optimization, our method has significantly better accuracy compared to the stochastic ADMM method and better performance than RADAR, based on multi-step dual averaging [Agarwal et al., 2012b]. For matrix decomposition, we compare our method with the state-of-the-art inexact ALM [Lin et al., 2010] method. While both methods have similar reconstruction performance, our method has significantly better accuracy in recovering the sparse and low rank components.

Chapter 2

Preliminaries

2.1 Notation and Terminology

1. A graph G is represented as G(V, E), where V is the set of vertices and E is the edge set.

2. For random variables, ⊥ denotes independence and | denotes conditioning; i.e., X_1 ⊥ X_2 | X_3 means X_1 is independent of X_2 given X_3.

3. A covariance matrix is a matrix whose element in the (i, j) position is the covariance between the i-th and j-th elements of a random vector (that is, of a vector of random variables).

4. For every matrix A, Tr(A) denotes the sum of the diagonal entries of A.

In the sequel, we use lower-case letters for vectors and upper-case letters for matrices. ||x||_1 and ||x||_2 refer to the ℓ1 and ℓ2 vector norms respectively. The term ||X||_* stands
The termkXk stands 10 for nuclear norm of X. In addition,kXk 2 ,kXk F denote spectral and Frobenius norms respectively. jjjXjjj 1 stands for induced innity norm. We use vectorized ` 1 ;` 1 norm for matrices. i.e.,kXk 1 = P i;j jX ij j,kXk 1 = max i;j jX ij j. 2.2 High Dimensional Statistics The eld of high-dimensional statistics studies data whose dimension is higher than the dimension of classical multivariate data. In many applications the dimension of the data is bigger than the sample size. High-dimensional statistical inference deals with models in which the the number of parameters p is comparable to or larger than the sample size n. Since it is usually impossible to obtain consistent procedures unlessp=n! 0, Therefore, for a tractable analysis of the problem with available samples, a line of recent work has studied models where additional as- sumptions are made on the problem structure, i.e., an embedded low dimensional structure. with various types of low-dimensional structure, including sparse vectors, patterned matrices, low rank matrices, Markov random elds and some assump- tions on manifold structure. In such settings, a general approach to estimation is to solve a regularized optimization problem, which combines a loss function mea- suring how well the model ts the data with some regularization function that encourages the assumed structure. Accordingly, there are now several lines of work within high-dimensional statistics, all of which are based on imposing some type of low-dimensional constraint on the model space and then studying the behavior of 11 dierent estimators. Examples include linear regression with sparsity constraints, estimation of structured covariance or inverse covariance matrices, graphical model selection, sparse principal component analysis, low-rank matrix estimation, matrix decomposition problems and estimation of sparse additive nonparametric models. 
The classical technique of regularization has proven fruitful in all of these contexts. Many well-known estimators are based on solving a convex optimization problem formed by the sum of a loss function and a weighted regularizer. For example, the ℓ0 norm counts the number of nonzero elements, but since ℓ0 is a nonconvex function, the ℓ1 norm is used to ensure sparsity; the ℓ1 norm is the closest function to ℓ0 that is convex. As another example, to ensure low rank structure the nuclear norm is used as a regularizer. The intuition is that the nuclear norm is the sum of the singular values, and minimizing it imposes the low rank structure.

2.3 Graphical Model

Definition 1. Global Markov property: A probability distribution is said to have the global Markov property with respect to a graph if, for any disjoint subsets of nodes I, J, S such that S separates I and J on the graph, the distribution satisfies X_I ⊥ X_J | X_S, i.e., X_I is independent of X_J conditioned on X_S. This is represented in Figure 2.1.

Figure 2.1: Global Markov property: X_I ⊥ X_J | X_S.
Figure 2.2: Local Markov property: X_i ⊥ X_{V∖{i∪N(i)}} | X_{N(i)}.

Definition 2. Pairwise Markov property: A distribution is pairwise Markov with respect to a given graph if, for any two nodes i and j such that there is no direct link between i and j, X_i is independent of X_j given the states of all the remaining nodes, i.e., X_i ⊥ X_j | X_{V∖{i,j}}.

Definition 3. Local Markov property: A set of random variables is said to have the local Markov property corresponding to a graph [Lauritzen, 1996] if any variable X_i is conditionally independent of all other variables X_{V∖{i∪N(i)}} given its neighbors X_{N(i)}, where V∖{i∪N(i)} := {j ∈ V : j ≠ i, j ∉ N(i)} and N(i) := {j ∈ V : (i, j) ∈ E}. The local Markov property is illustrated in Figure 2.2.

Definition 4.
Markov Random Field (MRF): Given an undirected graph G = (V, E), a set of random variables X = (X_v)_{v∈V} forms a Markov Random Field with respect to G if they have the global Markov property.

It should be noted that the local Markov property and the pairwise Markov property are equivalent, and they are a special case of the global Markov property. For a strictly positive probability distribution, all three properties are equivalent, and it can be shown that the probability distribution factorizes with respect to the graph [Lauritzen, 1996]. One instance of this positivity condition is the case of jointly Gaussian distributions.

Definition 5. Gaussian Markov Random Field (GMRF): A Gaussian Markov Random Field (GMRF) is a family of jointly Gaussian distributions which factor according to a given graph. Given a graph G = (V, E), with V = {1, ..., p}, consider a vector of Gaussian random variables X = [X_1, X_2, ..., X_p]^T, where each node i ∈ V is associated with a scalar Gaussian random variable X_i. A Gaussian Markov Random Field on G has a probability density function (pdf) that may be parametrized as

    f_X(x) ∝ exp[-(1/2) x^T J x + h^T x],    (2.3.1)

where J is a positive-definite symmetric matrix whose sparsity pattern corresponds to that of the graph G. More precisely,

    J(i, j) = 0  ⟺  (i, j) ∉ E.    (2.3.2)

The matrix J = Σ^{-1} is known as the potential or information matrix, the non-zero entries J(i, j) as the edge potentials, and the vector h as the vertex potential vector.

Definition 6. Graphical Model: In general, the graph G = (V, E) is called the Markov graph (graphical model) underlying the joint probability distribution f_X(x), where the node set V represents the random variables X_i and the edge set E is defined so as to satisfy the local Markov property. For a Markov Random Field, the local Markov property states that X_i ⊥ X_{V∖{i∪N(i)}} | X_{N(i)}, where X_{N(i)} represents all random variables associated with the neighbors of i in G and X_{V∖{i∪N(i)}} denotes all variables except X_i and X_{N(i)}.
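The zero-pattern correspondence in (2.3.2) can be illustrated numerically. The 3-node chain 1 - 2 - 3 below is a made-up example: J(1,3) = 0 because there is no edge (1,3), so X_1 ⊥ X_3 | X_2, even though the covariance matrix Σ = J^{-1} is dense (X_1 and X_3 are marginally correlated).

```python
import numpy as np

# Precision (information) matrix for the chain graph 1 - 2 - 3.
# J[0, 2] = 0 corresponds to the missing edge (1, 3).
J = np.array([[ 2.0, -1.0,  0.0],
              [-1.0,  2.0, -1.0],
              [ 0.0, -1.0,  2.0]])
assert np.all(np.linalg.eigvalsh(J) > 0)   # positive definite, so a valid GMRF

Sigma = np.linalg.inv(J)
print(Sigma)                 # dense: X_1 and X_3 are marginally correlated...
print(np.linalg.inv(Sigma))  # ...yet the (1,3) precision entry is (numerically) zero

# Conditional independence X_1 _|_ X_3 | X_2  <=>  J[0, 2] == 0.
assert abs(np.linalg.inv(Sigma)[0, 2]) < 1e-9
assert Sigma[0, 2] != 0.0
```

This is why structure learning for GMRFs targets the precision matrix rather than the covariance matrix: edges live in J, not in Σ.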
2.3.1 Kullback–Leibler Divergence

Definition 7. Kullback–Leibler divergence: In probability theory and information theory, the Kullback–Leibler divergence [Kullback, 1951, 1959, 1987] (also information divergence, information gain, relative entropy, or KLIC) is a non-symmetric measure of the difference between two probability distributions P and Q. Specifically, the Kullback–Leibler divergence of Q from P, denoted D_KL(P‖Q), is a measure of the information lost when Q is used to approximate P [K. P. Burnham, 2002]. KL measures the expected number of extra bits required to code samples from P when using a code based on Q rather than a code based on P. Typically P represents the "true" distribution of data, observations, or a precisely calculated theoretical distribution, while Q represents a theory, model, description, or approximation of P.

For discrete probability distributions P and Q, the KL divergence of Q from P is defined to be

    D_KL(P‖Q) = Σ_i P(i) ln( P(i) / Q(i) ).    (2.3.3)

In words, it is the expectation of the logarithmic difference between the probabilities P and Q, where the expectation is taken using the probabilities P. For distributions P and Q of a continuous random variable, the KL divergence is defined to be the integral [Bishop, 2006]

    D_KL(P‖Q) = ∫_{-∞}^{∞} p(x) ln( p(x) / q(x) ) dx,    (2.3.4)

where p and q denote the densities of P and Q.

In Bayesian statistics the KL divergence can be used as a measure of the information gain in moving from a prior distribution to a posterior distribution. If some new fact Y = y is discovered, it can be used to update the probability distribution for X from p(x|I) to a new posterior probability distribution p(x|y, I) using Bayes' theorem:

    p(x|y, I) = p(y|x, I) p(x|I) / p(y|I).    (2.3.5)

This distribution has a new entropy

    H(p(·|y, I)) = -Σ_x p(x|y, I) log p(x|y, I),    (2.3.6)

which may be less than or greater than the original entropy H(p(·|I)).
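The discrete definition (2.3.3) is easy to evaluate directly; the toy distributions below (chosen arbitrarily for illustration) also exhibit the non-symmetry noted above.

```python
import math

def kl_divergence(P, Q):
    """D_KL(P || Q) = sum_i P(i) * ln(P(i)/Q(i)) for discrete distributions.

    Terms with P(i) = 0 contribute 0 (the usual 0*log(0/q) = 0 convention).
    """
    return sum(p * math.log(p / q) for p, q in zip(P, Q) if p > 0)

P = [0.5, 0.3, 0.2]
Q = [0.4, 0.4, 0.2]

print(kl_divergence(P, Q))  # >= 0, and 0 only when P == Q
print(kl_divergence(Q, P))  # generally different: KL is non-symmetric
```

Non-negativity and the asymmetry D_KL(P‖Q) ≠ D_KL(Q‖P) are exactly the properties that make KL a divergence rather than a distance.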
However, from the standpoint of the new probability distribution one can estimate that having used the original code based on p(x|I), instead of a new code based on p(x|y,I), would have added an expected number of bits

D_KL( p(·|y,I) ‖ p(·|I) ) = Σ_x p(x|y,I) log( p(x|y,I) / p(x|I) )   (2.3.7)

to the message length. This therefore represents the amount of useful information, or information gain, about X that we can estimate has been learned by discovering Y = y.

2.4 Convex Optimization

A convex optimization problem is one of the form

minimize f_0(x)
subject to f_i(x) ≤ b_i,  i = 1,...,m,

where the functions f_0,...,f_m : Rⁿ → R are convex, i.e., satisfy f_i(αx + βy) ≤ α f_i(x) + β f_i(y) for all x, y ∈ Rⁿ and all α, β ∈ R with α + β = 1, α ≥ 0, β ≥ 0. Convex optimization enjoys the fact that a convex optimization problem has no spurious local optima, i.e., every local optimum is a global optimum. In addition, convex optimization problems can be solved efficiently.

2.4.1 Stochastic Optimization

Stochastic optimization considers the problem of minimizing a loss function with access to noisy samples of (the gradient of) the function. The goal is to have an estimate of the optimal parameter (minimizer) per new sample. Consider the optimization problem

θ* ∈ argmin_{θ∈Ω} E[f(θ; x)],   (2.4.1)

where x ∈ X is a random variable and f : Ω × X → R is a given loss function. Since only samples are available, we employ the empirical estimate f̂(θ) := (1/n) Σ_{i∈[n]} f(θ; x_i) in the optimization. For high-dimensional θ, we need to impose a regularization R(θ), and

θ̂ := argmin_θ { f̂(θ) + λ_n R(θ) }

is the batch optimal solution.

2.4.2 Alternating Direction Method of Multipliers

The Alternating Direction Method of Multipliers (ADMM) is an algorithm that is intended to blend the decomposability of dual ascent with the superior convergence properties of the method of multipliers. The algorithm solves problems of the form

minimize f(x) + g(y)
subject to Ax + By = c

with variables x ∈ Rⁿ and y ∈ Rᵐ, where A ∈ R^{p×n}, B ∈ R^{p×m}, and c ∈ Rᵖ.
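As a concrete instance of this splitting, the following NumPy sketch applies ADMM to the lasso problem, with f(x) = ½‖Dx − e‖², g(y) = λ‖y‖₁, A = I, B = −I, c = 0, and the scaled dual variable u = z/ρ; the data D, e and all parameter values are hypothetical:

```python
import numpy as np

def soft_threshold(v, k):
    return np.sign(v) * np.maximum(np.abs(v) - k, 0.0)

def admm_lasso(D, e, lam, rho=1.0, iters=200):
    """ADMM for min_x 0.5*||Dx - e||^2 + lam*||x||_1, split as
    f(x) = 0.5*||Dx - e||^2, g(y) = lam*||y||_1, constraint x - y = 0
    (A = I, B = -I, c = 0), with scaled dual variable u = z / rho."""
    n = D.shape[1]
    x, y, u = np.zeros(n), np.zeros(n), np.zeros(n)
    Q = np.linalg.inv(D.T @ D + rho * np.eye(n))   # x-update is a ridge solve
    for _ in range(iters):
        x = Q @ (D.T @ e + rho * (y - u))          # argmin_x of the augmented Lagrangian
        y = soft_threshold(x + u, lam / rho)       # argmin_y: soft-thresholding
        u = u + x - y                              # dual ascent step
    return y

# Hypothetical sparse regression instance.
rng = np.random.default_rng(0)
D = rng.standard_normal((50, 10))
x_true = np.zeros(10); x_true[:3] = [2.0, -1.5, 1.0]
e = D @ x_true + 0.01 * rng.standard_normal(50)

x_hat = admm_lasso(D, e, lam=0.5)
print(np.round(x_hat, 1))   # sparse estimate, close to x_true
```

With this split, the x-update is a ridge-regression solve and the y-update is the soft-thresholding operator, which is what makes ADMM attractive for ℓ1-regularized problems.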
We will assume that f and g are convex. The difference from the usual convex optimization problem is that the variable is split into two parts, x and y, and the cost function is separable. As in the method of multipliers, we form the augmented Lagrangian

L_ρ(x,y,z) = f(x) + g(y) + zᵀ(Ax + By − c) + (ρ/2)‖Ax + By − c‖².

The ADMM iterations are as follows:

x^{k+1} := argmin_x L_ρ(x, y^k, z^k);
y^{k+1} := argmin_y L_ρ(x^{k+1}, y, z^k);
z^{k+1} := z^k + ρ(Ax^{k+1} + By^{k+1} − c),

where ρ > 0. It is proved that ADMM converges for convex, closed, and proper functions f, g [Boyd et al., 2011]. ADMM is originally a batch method. However, with some modifications it can also be used for stochastic optimization. Since in the stochastic setting we only have access to noisy samples of the gradient, we use an inexact approximation of the augmented Lagrangian,

L̂_{ρ,k} = f(x_k) + ⟨∇f(x_k; ξ_{k+1}), x⟩ + g(y) + zᵀ(Ax + By − c) + (ρ/2)‖Ax + By − c‖² + ‖x − x_k‖² / (2η_{k+1}),

where η_{k+1} is a time-varying step size [Ouyang et al., 2013]. Note that the first term, being constant in x, does not appear in the ADMM iterates. The convergence rate for stochastic ADMM is shown in Table 4.1. It can be seen that the convergence rate is proportional to the square of the dimension, which is a disheartening rate in high dimension.

As discussed above, in stochastic ADMM we use noisy samples of the gradient. Therefore, in order to prove convergence for stochastic ADMM, we need a bound on the gradient, i.e., the cost function needs to satisfy an additional assumption: the Lipschitz property.

Definition 8. Lipschitz property: A function f : Ω → R is said to satisfy the Lipschitz condition if there is a constant M such that

|f(x) − f(x′)| ≤ M ‖x − x′‖  ∀ x, x′ ∈ Ω.   (2.4.2)

The smallest constant M satisfying (2.4.2) is called the Lipschitz constant. The Lipschitz constant can be interpreted as an upper bound on the gradient of the function f. Another property that improves the convergence rate is strong convexity.

Definition 9.
Strong Convexity: A differentiable function f : Ω → R is called strongly convex if there is a constant m > 0 such that

f(x′) ≥ f(x) + ⟨∇f(x), x′ − x⟩ + (m/2)‖x′ − x‖²  ∀ x, x′ ∈ Ω.

Intuitively, strong convexity is a measure of the curvature of the loss function, which relates the reduction in the loss function to closeness in the variable domain. In high dimension, we cannot guarantee the above properties globally. Nevertheless, we will show that the following locally defined notions suffice.

Definition 10. Local Lipschitz condition: For each R > 0, there is a constant G = G(R) such that

|f(θ₁) − f(θ₂)| ≤ G ‖θ₁ − θ₂‖₁

for all θ₁, θ₂ ∈ S such that ‖θ₁ − θ*‖₁ ≤ R and ‖θ₂ − θ*‖₁ ≤ R.

Definition 11. Local strong convexity (LSC): The function f : S → R satisfies an R-local form of strong convexity (LSC) if there is a non-negative constant γ = γ(R) such that

f(θ₁) ≥ f(θ₂) + ⟨∇f(θ₂), θ₁ − θ₂⟩ + (γ/2)‖θ₂ − θ₁‖₂²

for any θ₁, θ₂ ∈ S with ‖θ₁ − θ*‖₁ ≤ R and ‖θ₂ − θ*‖₁ ≤ R.

Chapter 3
Initial Results: Data Integrity in Smart Grid

3.1 Introduction

Among the attributes that make the grid "smart" is its ability to process a massive amount of data for monitoring, control, and maintenance purposes. In a typical Transmission System Operator (TSO), the substation Remote Terminal Units (RTUs) read the status of voltages, currents, and switching states. The RTU data is redirected in data packages to the Supervisory Control and Data Acquisition (SCADA) system via communication channels. In addition, synchronous Phasor Measurement Units (PMUs) are being massively deployed throughout the grid. PMUs provide a higher level of detail to the SCADA system (e.g., voltage angle). The signals from the PMUs are transmitted via the RTU to the SCADA. The State Estimator (SE), located at the control center, aims to find the best overall snapshot solution based on all measurements.
Recent monitoring and control schemes rely primarily on PMU measurements; for example, [Diao et al., May 2009] tries to increase voltage resilience to avoid voltage collapse by using synchronized PMU measurements and decision trees, and [Zhu and Giannakis, C. Wei, 2012, He and Zhang, June 2011] rely on PMUs for fault detection and localization. The centralization of the data at the State Estimator makes it the back door for false data injection attacks; the aforementioned methods can therefore be deluded by such attacks. Thus, it is crucial to have a mechanism for fast and accurate discovery of malicious tampering, both for preventing attacks that may lead to blackouts and for routine monitoring and control tasks of the smart grid. Cyber attacks have gained increasing attention over the past years. Unfortunately, there are realistic "stealthy" threats that cannot be detected with current security modules in the power network and may lead to cascading events, instability in the system, and blackouts in major areas of the network. For details on stealthy deception attacks, their implementation, and their serious consequences, see [Kwon et al., 2013, Kosut et al., 2010, Amin and Giacomoni, 2012, Giani et al., January 2012, Yao Liu and Reiter, May 2011].

3.1.1 Summary of Results

We have designed a decentralized false data injection attack detection mechanism that utilizes the Markov graph of the bus phase angles.

Figure 3.1: Flowchart of our detection algorithm. When new data is received, the conditional covariance test is run; if its output matches the topology data, monitoring continues, otherwise the alarm is triggered and anomaly scores are computed to find the attacked nodes.

We utilize the conditional covariance threshold test (CMIT) [Anandkumar et al., 2012] to learn the structure of the grid. We show that under normal circumstances, and because of the grid structure, the Markov graph of voltage angles can be determined by the power grid graph.
Therefore, a discrepancy between the calculated Markov graph and the learned structure triggers the alarm. This work was initiated by the authors in [Sedghi and Jonckheere, 2013]. Because of the connection between the Markov graph of the bus angle measurements and the grid topology, our method can be implemented in a decentralized manner, i.e., at each sub-network. Currently, sub-network topology is available online and the global network structure is available hourly [Zhu and Giannakis]. Not only does decentralization increase the speed and bring us closer to online detection, it also increases accuracy and stability by avoiding the communication delays and synchronization problems that arise when measurement data is sent between locations far apart [Zhu et al., 2013, Ancillotti et al., 2013]. Furthermore, we noticeably decrease the amount of exchanged data, addressing privacy concerns as much as possible. We show that our method can detect the most recently designed attack on the power grid, an attack that remains undetected by the traditional bad data detection scheme [Teixeira et al., 2011] and is capable of deceiving the State Estimator and damaging power network control, monitoring, demand response, and pricing schemes [Kosut et al., 2010]. In this scenario, the attacker is equipped with vital data and has knowledge of the bus-branch model of the grid. It should be noted that our method not only detects that the system is under attack, but also determines the particular set of nodes under attack. The flowchart is shown in Figure 3.1. In addition, we show that our method can detect the situation where the attacker manipulates reactive power data to lead the State Estimator to wrong estimates of the voltages. Such an attack can be designed to fake a voltage collapse or to trick the operator into causing a voltage collapse. This latter detection is based on the linearization of the AC power flow around the steady state.
Then, using our algorithm for bus voltages and reactive power rather than bus phase angles and active power, it readily follows that this latter attack can also be detected.

3.1.2 Related Work

Although the authors of [Giani et al., January 2012] suggest an algorithm for PMU placement such that the "stealthy" attack becomes observable, they report a successful algorithm only for the 2-node attack and propose empirical approaches for the 3-, 4-, and 5-node attacks. According to [Giani et al., January 2012], for cases where more than two nodes are under attack, the complexity of the approach is said to be "disheartening". Considering the fact that finding the number of needed PMUs is NP-hard and that [Giani et al., January 2012] gives an upper bound and uses a heuristic method for PMU placement, we mention for comparison purposes that our algorithm has no hardware requirements, that its complexity does not depend on the number of nodes under attack, and that it works for any number of attacked nodes. It is also worth mentioning that, even in the original paper presenting the attack, for a relatively small network (IEEE-30) seven measurements from five nodes are manipulated. Therefore, the 2-node attack does not appear to be the most probable one. Another line of work is dedicated to computing a "security index" for different nodes in order to find the set of nodes that are most vulnerable to false data injection attacks [Hendrickx et al., 2014]. Although these attempts are acknowledged, our method differs greatly from such perspectives: those methods neither detect an attack when it happens nor find the set of nodes that are under attack. The dependency graph approach is used in [He and Zhang, June 2011] for topology fault detection in the grid. However, since attacks on the State Estimator are not considered, such methods can be deceived by false data injection.
Furthermore, [He and Zhang, June 2011] use a constrained maximum likelihood optimization for finding the information matrix, while here an advanced structure learning method is used that captures the power grid structure better. This is because in the power grid the edges are not centered but distributed all over the network, as discussed in Section 3.2.1.

3.1.3 Bus Phase Angles GMRF

We now apply the preceding to the bus phase angles. The DC power flow model [Abur and Exposito, 2004] is often used for the analysis of power systems in normal operations. When the system is stable, the phase angle differences are small, so sin(X_i − X_j) ≈ X_i − X_j. By the DC power flow model, the system state can be described using the bus phase angles X. The active power flow on the transmission line connecting bus i to bus j is given by

P_ij = b_ij (X_i − X_j),   (3.1.1)

where X_i and X_j denote the phasor angles at buses i and j respectively, and b_ij denotes the inverse of the line inductive reactance. The power injected at bus i equals the algebraic sum of the powers flowing away from bus i:

P_i = Σ_{j≠i} P_ij = Σ_{j≠i} b_ij (X_i − X_j).   (3.1.2)

When buses i and j are not connected, b_ij = 0. Thus, it follows that the phasor angle at bus i can be represented as

X_i = Σ_{j≠i} ( b_ij / Σ_{k≠i} b_ik ) X_j + ( 1 / Σ_{j≠i} b_ij ) P_i.   (3.1.3)

Eq. (3.1.1) can also be rewritten in matrix form as

P = B X,   (3.1.4)

where P = [P_1, P_2,..., P_p]ᵀ is the vector of injected active powers, X = [X_1, X_2,..., X_p]ᵀ is the vector of bus phase angles, and

B_ij = −b_ij if i ≠ j,  B_ii = Σ_{j≠i} b_ij.   (3.1.5)

Remark: Note that, because of the linearity of the DC power flow model, the above equations are valid both for the phase angle X together with the injected power P, and for the fluctuations of the phase angle X together with the fluctuations of the injected power P around its steady-state value.
Specifically, if we let P̃ refer to the vector of active power fluctuations and X̃ represent the vector of phase angle fluctuations, we have P̃ = B X̃. In the following, the focus is on the DC power flow model. Nevertheless, our analysis remains valid if we consider fluctuations around the steady-state values. Because of load uncertainty, and under generation-load balance, the injected power can be modeled as a random variable [Zhang and Lee, 2004]. The injected power is the sum of many random factors such as load fluctuations, wind turbine and Photo Voltaic Cell (PVC) output fluctuations, etc. While the independence of the constituting random variables can be justified, their identical distribution cannot. Therefore, using the Lyapunov Central Limit Theorem (CLT) [B. De Finetti, 1975, Sec. 7.7.2], which does not require the random variables to be identically distributed, we can model the injected power with a Gaussian distribution.

Lyapunov CLT: Let {Y_i : i = 1, 2,..., n} be a sequence of independent random variables, each with finite expected value μ_i and variance σ_i². Define s_n² = Σ_{i=1}^n σ_i². If the Lyapunov condition is satisfied, then Σ_{i=1}^n (Y_i − μ_i) / s_n converges in distribution to a standard normal random variable as n goes to infinity. (The Lyapunov condition requires that there exists δ > 0 such that the random variables |Y_i − μ_i| have moments of order 2 + δ and the rate of growth of these moments is limited in the sense that lim_{n→∞} Σ_{i=1}^n E|Y_i − μ_i|^{2+δ} / s_n^{2+δ} = 0.)

Considering conventional assumptions in power systems, the Lyapunov condition is met. As argued in [Sedghi and Jonckheere, 2014], the Gaussian assumption is justified in the transmission network. The Gaussian model is also utilized in various analyses of power networks such as [Kashyap and Callaway, 2010, Schellenberg et al., 2005, Pang et al., 2012, Dopazo et al., 1975], where n is estimated to be of order 1000.
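A small simulation (with made-up, non-identically distributed uniform "loads", not actual grid data) shows the Lyapunov CLT at work: the standardized sum is close to a standard normal.

```python
import numpy as np

rng = np.random.default_rng(0)

# n independent but NOT identically distributed loads: Y_i ~ Uniform(0, a_i),
# with made-up per-load ranges a_i (a toy check, not grid data).
n = 500
a = rng.uniform(0.5, 5.0, size=n)
mu_sum = (a / 2.0).sum()                  # sum of E[Y_i]
s_n = np.sqrt(np.sum(a**2 / 12.0))        # s_n^2 = sum of Var(Y_i)

# 5,000 replicates of the standardized sum (sum Y_i - sum mu_i) / s_n.
reps = 5_000
Y = rng.uniform(0.0, a, size=(reps, n))
Z = (Y.sum(axis=1) - mu_sum) / s_n

print(round(float(Z.mean()), 1), round(float(Z.std()), 1))   # close to 0.0 and 1.0
print(abs(np.mean(np.abs(Z) < 1.96) - 0.95) < 0.02)          # True: ~95% coverage
```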
To exemplify the CLT, it is suggested in [Mur-Amada and Salln-Arasanz, April 2011] that as few as 5 wind turbines would suffice to see the CLT in action. Therefore, for each i, we model P_i in Eq. (3.1.2) with a Gaussian random variable. Hence the linear relationship in Eq. (3.1.4), together with the fixed phasor at the slack bus, implies that the phasor angles X_i are Gaussian random variables [He and Zhang, June 2011]. The next step is to find out whether the X_i's satisfy the local Markov property and, in the affirmative, to discover the neighbor sets corresponding to each node. We do this by analyzing Eq. (3.1.3). If there were only the first term, we would conclude that the set of nodes electrically connected to node i satisfies the local Markov property, but the second term makes a difference. Below, we argue that an analysis of the second term of (3.1.3) shows that this term causes some second neighbors of X_i to have a nonzero term in the matrix J. In addition, for nodes that are more than two hops apart, J_ij = 0. Therefore, as opposed to the claim in [He and Zhang, June 2011], a second-neighbor relationship does exist in the matrix J. The second-neighbor property may result in additional edges in the Markov graph between nodes that are second neighbors in the grid graph.

As stated earlier, the powers injected at different buses have Gaussian distributions. We can assume that they are independent and, without loss of generality, zero mean. Therefore, the probability distribution function for P is f_P(P) ∝ exp(−½ PᵀP). Since P = BX, we have f_X(X) ∝ exp(−½ XᵀBᵀBX). Recalling the definition of the probability distribution function for jointly Gaussian random variables in (2.3.1), we get J = BᵀB. Let d(i,j) represent the hop distance between nodes i and j in the power grid graph G. By the definition of the matrix B, this leads to some nonzero J_ij entries for d(i,j) = 2. In addition, we state the following:

Proposition 1.
Assume that the powers injected at the nodes are Gaussian and mutually independent. Then

J_ij = 0,  ∀ d(i,j) > 2.

Proof. We argue by contradiction. Assume J_ij ≠ 0 for some d(i,j) > 2. Since J_ij = Σ_k B_ik B_jk, it follows that ∃ k s.t. B_ik ≠ 0 and B_jk ≠ 0. By (3.1.5), B_ik ≠ 0 implies d(i,k) ≤ 1, and likewise d(k,j) ≤ 1. From there on, the triangle inequality implies that d(i,j) ≤ d(i,k) + d(k,j) ≤ 1 + 1 = 2, which contradicts the assumption d(i,j) > 2. ∎

It was shown in [Sedghi and Jonckheere, 2014] that for some graphs, the second-neighbor terms are smaller than the terms corresponding to the immediate electrical neighbors of X_i. More precisely, it was shown that for lattice-structured grids, this approximation falls under the generic fact of the tapering off of Fourier coefficients [Sedghi and Jonckheere, 2014]. Therefore, we can approximate each neighborhood with the immediate electrical neighbors. We could also proceed with the exact relationship; for simplicity, we opt for the first-neighbor analysis. We explain shortly why CMIT works with this approximation as well. Note that our detection method relies on the graphical model of the variables. It is based on the fact that the Markov graph of bus phase angles changes under an attack. CMIT is tuned with correct data, and we prove that in case of an attack the Markov graph of the compromised data does not follow the Markov graph of the correct data. Hence, we can tune CMIT with either the exact relationship or the approximate Markov graph. In both cases, the output in case of an attack is different from the output tuned with correct data. Therefore, CMIT works for both approximate and exact neighborhoods.

3.2 Structure Learning

In the context of graphical models, model selection means finding the exact Markov graph underlying a group of random variables based on samples of those random variables. There are two main classes of methods for learning the structure of the underlying graphical model: convex methods and non-convex methods.
The ℓ1-regularized maximum likelihood estimators are the main class of convex methods [Friedman et al., 2007, Ravikumar et al., 2011, Janzamin and Anandkumar, 2012, 2014]. In these methods, the inverse covariance matrix is penalized with a convex ℓ1-regularizer in order to encourage sparsity in the estimated Markov graph structure. The other class consists of the non-convex or greedy methods [Anandkumar et al., 2012]. In our work [Sedghi and Jonckheere, 2013, 2014, 2015], we use the latter methods.

3.2.1 Conditional Covariance Test

In order to learn the structure of the power grid, we utilize the Gaussian Graphical Model Selection method called CMIT [Anandkumar et al., 2012]. CMIT estimates the structure of the underlying graphical model given i.i.d. samples of the random variables. CMIT is shown in Algorithm 1.

Algorithm 1 CMIT(xⁿ, ξ_{n,p}, η) for structure learning using samples xⁿ [Anandkumar et al., 2012]
  Initialize Ĝⁿ_p = (V, ∅).
  For each i, j ∈ V, if min_{S ⊆ V∖{i,j}, |S| ≤ η} |Σ̂(i,j|S)| > ξ_{n,p}, then add (i,j) to the edge set of Ĝⁿ_p.
  Output: Ĝⁿ_p.

In Algorithm 1, the output is an edge set corresponding to the graph G given n i.i.d. samples xⁿ, each of which has p variables (corresponding to vertices), a threshold ξ_{n,p} (that depends on both p and n), and a constant η ∈ N, which is related to the local vertex separation property (described later). In our case, each one of the p variables represents a bus phase angle. The sufficient condition for the output of CMIT to be structurally consistent with the underlying Markov graph among the variables is that the graph satisfy the local separation property and walk-summability [Anandkumar et al., 2012]. An ensemble of graphs has the (η, γ)-local separation property if, for any (i,j) ∉ E(G), the maximum number of paths between i and j of length at most γ does not exceed η.
A Gaussian model is said to be α-walk summable if ‖R̄‖ ≤ α < 1, where R̄ = [|r_ij|] and ‖·‖ denotes the spectral or 2-norm of a matrix [Anandkumar et al., 2012]. R = [r_ij] is the matrix of partial correlation coefficients; it vanishes on the diagonal entries, and on the off-diagonal entries it is given by

r_ij := Σ(i,j|V∖{i,j}) / √( Σ(i,i|V∖{i,j}) Σ(j,j|V∖{i,j}) ) = −J(i,j) / √( J(i,i) J(j,j) ).   (3.2.1)

The partial correlation coefficient r_ij between variables X_i and X_j for i ≠ j measures their conditional covariance given all other variables [Lauritzen, 1996]. Regardless of whether the exact or the approximate neighborhood relationship holds, the Markov graph of the bus phase angles is an example of the bounded local path graphs that satisfy the local separation property. We also checked the analyzed networks for the walk-summability condition. As shown in (3.2.1) and the definition of walk-summability, this property depends only on the matrix J and thus on the topology of the grid; walk-summability does not depend on the operating point of the grid. It is shown in [Anandkumar et al., 2012] that, under walk-summability, the effect of faraway nodes on the covariance decays exponentially with the distance, and the error in approximating the covariance by local neighborhoods decays exponentially with the distance. Hence, by correct tuning of the threshold ξ_{n,p} and with enough samples, we expect the output of CMIT to follow the grid structure. The computational complexity of CMIT is O(p^{η+2}), which is efficient for small η [Anandkumar et al., 2012]; η is the parameter associated with the local separation property described above. The sample complexity associated with CMIT is n = Ω(J_min⁻² log p), where J_min is the minimum absolute edge potential in the model [Anandkumar et al., 2012].

It is worth mentioning that since we use CMIT for structure learning of phasor data, our method is robust against measurement noise. The reason is that CMIT analyzes the conditional covariances of its input data. Since the input data is Gaussian, the conditional covariance can be found from the covariance matrix of the phasor data, i.e., Σ(X,X) (see Eq. (3.2.2)). Let N be the sum of the measurement noise and systematic errors. Both systematic errors and measurement noise are independent of the measured values. Also, we know that E(X) = 0. Therefore, Σ(X+N, X+N) = Σ(X,X) + Σ(N,N). Note that in CMIT we only look at pairs (i,j) such that i ≠ j. Therefore, as long as Σ(N,N) has a diagonal form,
Since input data is Gaus- sian, the conditional covariance can be found from covariance matrix for phasor data, i.e. (X;X) (see Eq. 3.2.2). Let N be the sum of the measurement noise and systematic errors. Both systematic errors and measurement noise are inde- pendent of the measured values. Also, we know that E(X) = 0. Therefore, (X +N;X +N) = (X;X) + (N;N). Note that in CMIT we only look at pairs (i;j) such that i6= j. Therefore as long as (N;N) has a diagonal form, 36 this error does not in uence our performance. This is the case when errors at dif- ferent locations in the network are independent of each other. Measurement noise meets this criterion. Moreover, if systematic error in the network has a diagonal covariance matrix (N;N), it also does not impact our method. Even if systematic errors do not have a diagonal covariance but remain the same with time, they can be detected and compensated during an initial training phase when we are sure the system is not under the attack. CMIT distributes the edges fairly uniformly across the nodes, while the ` 1 method tends to cluster all the edges together among the \dominant" variables leading to a densely connected component and several isolated points [Anandku- mar et al., 2012] and thus a disconnected graph. Therefore, the ` 1 method has some limitations in detecting the structure of a connected graph. The power grid transmission network is a connected graph where the edges are distributed over the network. Therefore, CMIT is more suitable for detecting the structure of the power grid. 3.2.2 Decentralization We want to nd the Markov graph of our bus phasor measurements. The con- nection between electrical connectivity and correlation (Proposition 1) helps us to decentralize our method to a great extent. The power network in its normal oper- ating condition consists of dierent areas connected together via border nodes. 
A border node is any node that is also connected to a node from a different area, as depicted in [bor, 2014]. Therefore, we decompose our network into these sub-areas. Our method can be performed locally in the sub-networks. The sub-network connection graph is available online from the protection system at each sub-network and can be readily compared with the bus phase angle Markov graph. In addition, only for border nodes do we need to consider their out-of-area neighbors as well. This can be done either by solving the power flow equations for that border link or by receiving measurements from neighboring sub-networks. Therefore, we run CMIT for each sub-graph to figure out its Markov graph. Then we compare it with the online network graph information to detect false data injection attacks. This decentralization reduces complexity and increases speed. Our decentralized method is a substitute for considering all measurements throughout the power grid, which would require a huge amount of data exchange, computation, and overhead. In addition to having fewer nodes to analyze, this decentralization leads to a smaller η and greatly reduces the computational complexity, which makes our method capable of being executed in very large networks. Furthermore, since structure learning is performed locally, faraway relationships created by nonlinearities (ignored in Proposition 1 but intrinsic to power systems) are mitigated; hence our neighborhood assumptions are justified. Last but not least, utility companies are not willing to expose their information, for reasons of economic competition, and there have been several attempts to make them do so [Rajagopalan et al., 2011]. Thus it is desirable to reduce the amount of data exchanged between different areas, and our method adequately fulfills this preference. It should be noted that the measurement vector X analyzed in our work is a mixture of measurements from PMUs and State Estimator outputs corresponding to the same time. This is achieved as follows.
PMUs use a GPS-synchronized time stamp, and State Estimator measurements in SCADA are labeled with a local time stamp. Since our method is performed locally, it has two advantages. First, as discussed earlier, it avoids large delays in the communication network. Second, we can use the local time stamps from the State Estimator outputs. We do not require the high measurement rate of the PMUs for our detection scheme and only consider the PMU samples at the times when we have State Estimator samples. Since both data carry time stamps, we are able to form the measurement vector X with measurement data from the same time.

3.2.3 Online Calculations

For fast monitoring of the power grid, we need an online algorithm. As we show in this section, our algorithm can be developed as an iterative method that processes new data without the need for reprocessing earlier data. Here, we derive an iterative formulation for the sample covariance matrix. Then we use it to calculate the conditional covariance using

Σ̂(i,j|S) := Σ̂(i,j) − Σ̂(i,S) Σ̂(S,S)⁻¹ Σ̂(S,j).   (3.2.2)

As we know, in general, Σ = E[(X − μ)(X − μ)ᵀ] = E[XXᵀ] − μμᵀ. Let Σ̂⁽ⁿ⁾(X) denote the sample covariance matrix for a vector X of p elements from n samples, and let μ̂⁽ⁿ⁾ be the corresponding sample mean. In addition, let X⁽ⁱ⁾ be the i-th sample of our vector. Then we have

Σ̂⁽ⁿ⁾(X) = (1/(n−1)) Σ_{i=1}^n X⁽ⁱ⁾ X⁽ⁱ⁾ᵀ − μ̂⁽ⁿ⁾ μ̂⁽ⁿ⁾ᵀ.   (3.2.3)

Therefore,

Σ̂⁽ⁿ⁺¹⁾(X) = (1/n) [ Σ_{i=1}^n X⁽ⁱ⁾ X⁽ⁱ⁾ᵀ + X⁽ⁿ⁺¹⁾ X⁽ⁿ⁺¹⁾ᵀ ] − μ̂⁽ⁿ⁺¹⁾ μ̂⁽ⁿ⁺¹⁾ᵀ,   (3.2.4a)
μ̂⁽ⁿ⁺¹⁾ = (1/(n+1)) [ n μ̂⁽ⁿ⁾ + X⁽ⁿ⁺¹⁾ ].   (3.2.4b)

By keeping the first term in (3.2.3) and the sample mean (3.2.4b), our updating rule is (3.2.4a). Thus, we revise the sample covariance as soon as any bus phasor measurement changes and leverage it to obtain the conditional covariances needed for CMIT.
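A minimal NumPy sketch of these updates (hypothetical data; only the running sums appearing in (3.2.3)-(3.2.4) are kept between samples), together with the conditional covariance of (3.2.2):

```python
import numpy as np

def init_stats(X):
    """Running sums over the first n samples (rows of X)."""
    return X.shape[0], X.sum(axis=0), X.T @ X   # n, sum of samples, sum of outer products

def update(stats, x_new):
    """Fold one new sample into the running sums: O(p^2), no reprocessing."""
    n, s, M = stats
    return n + 1, s + x_new, M + np.outer(x_new, x_new)

def covariance(stats):
    """Sample covariance from the running sums, as in (3.2.3)."""
    n, s, M = stats
    mu = s / n
    return M / (n - 1) - np.outer(mu, mu)

def conditional_cov(Sigma, i, j, S):
    """Conditional covariance (3.2.2): Sigma(i, j | S)."""
    S = list(S)
    return Sigma[i, j] - Sigma[i, S] @ np.linalg.solve(Sigma[np.ix_(S, S)], Sigma[S, j])

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 4))          # hypothetical phasor samples

stats = init_stats(X[:499])
stats = update(stats, X[499])              # one cheap incremental step
assert np.allclose(covariance(stats), covariance(init_stats(X)))   # matches the batch formula
print(conditional_cov(covariance(stats), 0, 1, [2, 3]))
```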
It goes without saying that if the system demand and structure do not change and the system is not subject to a false data injection attack, the voltage angles at the nodes remain the same and there is no need to run any algorithm.

3.3 Stealthy Deception Attack

The most recent and most dreaded false data injection attack on the power grid was introduced in [Teixeira et al., 2011]. It assumes knowledge of the bus-branch model and is capable of deceiving the State Estimator. For a p-bus electric power network, the l = 2p − 1 dimensional state vector x is [θᵀ, Vᵀ]ᵀ, where V = [V_1,..., V_p]ᵀ is the vector of bus voltage magnitudes and θ = [θ_2,..., θ_p]ᵀ is the vector of phase angles. It is assumed that the nonlinear measurement model for the state estimation is z = h(x) + ε, where h(·) is the measurement function, z = [z_Pᵀ, z_Qᵀ]ᵀ is the measurement vector consisting of active and reactive power flow measurements, and ε is the measurement error. H(x_k) := dh(x)/dx |_{x=x_k} denotes the Jacobian matrix of the measurement model h(x) at x_k. The goal of the stealthy deception attacker is to compromise the measurements available to the State Estimator (SE) as z_a = z + a, where z_a is the corrupted measurement vector and a is the attack vector. The vector a is designed such that the SE algorithm converges and the attack a is undetected by the Bad Data Detection scheme. It is then shown that, under the DC power flow model, such an attack can only be performed locally with a ∈ Im(H), where H = H_P is the matrix connecting the vector of bus injected active powers to the vector of bus phase angles, i.e., P = H_P θ. The attack is shown in Figure 3.2.

3.4 Stealthy Deception Attack Detection

In this section, we show that our method can detect the aforementioned stealthy deception attack despite the fact that it remains undetected by the traditional Bad Data Detection scheme. The fundamental idea behind our detection scheme is that of structure learning.
Our learner, CMIT, is first tuned with correct data, which corresponds to the grid graph. Therefore, any attack that changes the structure alters the output of CMIT, and this triggers the alarm. Let us consider the attack more specifically. As we are considering the DC power flow model and all voltage magnitudes are normalized to 1 p.u., the state vector introduced in [Teixeira et al., 2011] reduces to the vector of voltage angles, X.

Figure 3.2: Power grid under a cyber attack

Since a ∈ Im(H), ∃ d such that a = Hd and

z_a = z + a = H(X + d) = H X_a,

where X_a represents the vector of angles when the system is under attack, z_a is the attacked measurement vector, and X is the correct phasor angle vector. Considering (3.1.2), we have H_ij = −b_ij for i ≠ j and H_ii = Σ_{j≠i} b_ij, where b_ij denotes the inverse of the line inductive reactance. We have

X_a = X + d = H⁻¹ P + H⁻¹ a = H⁻¹ (P + a).   (3.4.1)

As the definition of the matrix H shows, it is of rank p − 1. Therefore, the above H⁻¹ denotes the pseudo-inverse of the matrix H. Another way to address this singularity is to remove the row and the column associated with the slack bus. From (3.4.1), we get

Σ(X_a, X_a) = H⁻¹ [ Σ(P + a, P + a) ] H⁻¹ᵀ = H⁻¹ [ Σ(P, P) + Σ(a, a) ] H⁻¹ᵀ.

The above calculation assumes that the attack vector is independent of the current measurement values in the network, as demonstrated in the definition of the attack [Teixeira et al., 2011]. An attack is considered successful if it causes the operator to make a wrong decision. For that matter, the attacker would not insert just one wrong sample. In addition, if the attack vector remains constant, it does not cause any reaction; this eliminates the case of constant attack vectors. Therefore, the attacker is expected to insert non-constant vectors a over some samples. Thus

Σ(a, a) ≠ 0 and Σ(X_a, X_a) ≠ Σ(X, X).   (3.4.2)

It is not difficult to show that, if we remove the assumption of independence between the attack vector and the injected power, (3.4.2) still holds.
Considering (3.4.2) and the fact that the matrix inverse is unique, it follows that, in case of an attack, the new $\Sigma^{-1}$ will not be the same as the network information matrix under normal conditions, i.e., $\Sigma^{-1}(X_a, X_a) \neq J_{\mathrm{normal}}$, and as a result, the output of CMIT will not follow the grid structure. We use this mismatch to trigger the alarm. It should be noted that acceptable load changes do not change the Markov graph and, as a result, do not lead to false alarms. The reason is that such changes do not falsify the DC power flow model, and the Markov graph will continue to follow the defined information matrix. After the alarm is triggered, the next step is to find which nodes are under attack.

3.4.1 Detecting the Set of Attacked Nodes

We use the correlation anomaly metric [Idé and Lozano, 2009] to find the attacked nodes. This metric quantifies the contribution of each random variable to the difference between two probability densities while accounting for the sparsity of the structure. The Kullback-Leibler (KL) divergence is used as the measure of the difference. As soon as an attack is detected, we use the attacked information matrix and the information matrix corresponding to the current topology of the grid to compute the anomaly score for each node. The nodes with the highest anomaly scores are announced as the nodes under attack. We investigate the implementation details in the next section. It should be noted that the attack is performed locally and, because of the local Markov property, we are certain that no nodes from other sub-graphs contribute to the attack. We should emphasize that the considered attack assumes knowledge of the system bus-branch model. Therefore, the attacker is equipped with very critical information. Yet, we can mitigate such an "intelligent" attack.
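The exact correlation anomaly metric of [Idé and Lozano, 2009] scores each node by a KL divergence between per-node conditional distributions; as a simplified stand-in (not the metric from the paper), the idea can already be seen by scoring each node by the normalized change in its row of the precision (information) matrix:

```python
import numpy as np

def node_anomaly_scores(J_normal: np.ndarray, J_attack: np.ndarray) -> np.ndarray:
    """Per-node change scores between two precision (information) matrices.

    Simplified stand-in for the correlation anomaly metric: each node is
    scored by the norm of the change in its row of the precision matrix,
    normalized to [0, 1]. The metric in Ide and Lozano (2009) uses a
    KL divergence between per-node conditional distributions instead.
    """
    diff = np.linalg.norm(J_normal - J_attack, axis=1)
    return diff / diff.max() if diff.max() > 0 else diff

# Toy example: perturb the entries coupling nodes 1 and 2.
J0 = np.eye(4) + 0.1
J1 = J0.copy()
J1[1, 2] += 0.5
J1[2, 1] += 0.5
scores = node_anomaly_scores(J0, J1)
# Nodes 1 and 2 receive the highest (equal) scores; nodes 0 and 3 score 0.
```

As in the thesis, the nodes with the highest scores would be announced as the nodes under attack.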
3.4.2 Reactive Power versus Voltage Amplitude

As mentioned before, with similar calculations, we can consider the case where the attacker manipulates reactive power data to lead the State Estimator to wrong estimates of the voltage. Such an attack can be designed to fake a voltage collapse or trick the operator into changing the normal state of the grid. For example, if the attacker fakes a decreasing trend in the voltage magnitude in some part of the grid, the operator will send more reactive power to that part, and this could cause a voltage overload/underload. At this point, the protection system would disconnect the corresponding lines. This could lead to outages in some areas and, in a worse scenario, to overloading in other parts of the grid that might cause blackouts and cascading events.

The detection can be done by linearization of the AC power flow and by considering the fluctuations around steady state. Then, pursuing our algorithm, it readily follows that such an attack can also be detected with an approach similar to the one developed here for bus phase angles and active power. In the rest of this section, we show how this analogy can be established. The AC power flow states that the active power and the reactive power flowing from bus $i$ to bus $j$ are, respectively,

$P_{ij} = G_{ij} V_i^2 - G_{ij} V_i V_j \cos(\theta_i - \theta_j) + b_{ij} V_i V_j \sin(\theta_i - \theta_j),$
$Q_{ij} = b_{ij} V_i^2 - b_{ij} V_i V_j \cos(\theta_i - \theta_j) - G_{ij} V_i V_j \sin(\theta_i - \theta_j),$

where $V_i$ and $\theta_i$ are the voltage magnitude and phase angle, respectively, at bus $i$, and $G_{ij}$ and $b_{ij}$ are the conductance and susceptance, respectively, of line $ij$. From [Banirazi and Jonckheere, 2010], we obtain the following approximation of the AC fluctuating power flow:

$\tilde{P}_{ij} = (b_{ij} \bar{V}_i \bar{V}_j \cos\bar{\theta}_{ij})(\tilde{\theta}_i - \tilde{\theta}_j),$
$\tilde{Q}_{ij} = (2 b_{ij} \bar{V}_i - b_{ij} \bar{V}_j \cos\bar{\theta}_{ij}) \tilde{V}_i - (b_{ij} \bar{V}_i \cos\bar{\theta}_{ij}) \tilde{V}_j,$

where an overbar denotes the steady-state value, a tilde denotes the fluctuation around the steady-state value, and $\theta_{ij} = \theta_i - \theta_j$.
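As a quick numerical sanity check on this linearization (with made-up line parameters and angles, not part of the thesis experiments): with a flat 1 p.u. voltage profile, the linearized active flow reduces to $b_{ij}(\theta_i - \theta_j)$, and it agrees with the full AC expression to second order in the angle difference.

```python
import numpy as np

# Hypothetical line parameters: conductance G and susceptance b.
G, b = 1.0, 10.0
Vi = Vj = 1.0            # flat voltage profile, 1 p.u.

def p_ac(ti, tj):
    """Full AC active line flow from bus i to bus j."""
    return G * Vi**2 - G * Vi * Vj * np.cos(ti - tj) + b * Vi * Vj * np.sin(ti - tj)

ti, tj = 0.03, 0.01      # small phase angles (radians)
p_dc = b * (ti - tj)     # linearized (DC-style) flow

# The two agree up to a second-order term in the angle difference.
assert abs(p_ac(ti, tj) - p_dc) < 1e-2
```

This is the same approximation that justifies treating the phase angles (and, by the analogous relation, the voltage magnitudes) with the DC-style model used throughout the chapter.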
These fluctuating values, due to renewables and variable loads, justify the utilization of probabilistic methods in power grid problems. Now, assuming that the steady-state values of the voltages satisfy $\bar{V}_i = \bar{V}_j \simeq 1$ p.u. (per unit) and that the steady-state angles are about the same so that $\cos\bar{\theta}_{ij} = 1$, we have

$\tilde{P}_{ij} = b_{ij}(\tilde{\theta}_i - \tilde{\theta}_j), \quad (3.4.3a)$
$\tilde{Q}_{ij} = b_{ij}(\tilde{V}_i - \tilde{V}_j). \quad (3.4.3b)$

It is clear from (3.4.3a)-(3.4.3b) that we can follow the same approach we took for active power and voltage angles with reactive power and voltage magnitudes, respectively. It can be argued that, as a result of uncertainty, the aggregate reactive power at each bus can be approximated as a Gaussian random variable and, because of Eq. (3.4.3b), the voltage fluctuations around the steady-state value can be approximated by Gaussian random variables. Therefore, the same line of argument as for the phase angles can be followed to show the GMRF property for the voltage amplitudes. Comparing (3.4.3b) with (3.1.1) makes it clear that the same matrix, i.e., matrix B developed in Section 3.1.3, is playing the role of correlating the voltage amplitudes. Therefore, assuming that the statistics of the active and reactive power fluctuations are similar, the underlying graph is the same. This can readily be seen by comparing (3.4.3a) and (3.4.3b).

Figure 3.3: Detection rate (in %) for the IEEE-14 bus system versus the number of corrupted measurements.

3.5 Simulation

Training the System. We consider the IEEE-14 bus system as well as the IEEE-30 bus system. First, we feed the system with Gaussian demand and simulate the power grid. We use MATPOWER [Zimmerman et al., Feb. 2011] for solving the DC power flow equations for various demands and use the resulting angle measurements as the input to CMIT. We leverage YALMIP [Lofberg, 2004] and SDPT3 [Toh et al., 1999] to run CMIT in MATLAB.
With the right choice of parameters and threshold $\xi_{n,p}$ of CMIT, and enough measurements, the Markov graph should follow the grid structure. We use the edit distance between two graphs for tuning the threshold $\xi_{n,p}$. The edit distance between two graphs reveals the number of edges that exist in only one of the two graphs.

Detecting the Attack State. After the threshold $\xi_{n,p}$ is set, our detection algorithm works in the following manner. Each time the procedure is initiated, i.e., when any PMU angle measurement or State Estimator output changes, it updates the conditional covariances $\hat{\Sigma}(i, j \mid S)$ based on the new data, runs CMIT, and checks the edit distance between the Markov graph of the phasor data and the grid structure. A discrepancy triggers the alarm. Following an alarm, the system uses the anomaly metric to find all the buses under attack. The flowchart of our method is shown in Figure 3.1.

Next, we introduce the stealthy deception attack on the system. The attack is designed according to the description in [Teixeira et al., 2011], i.e., it is a random vector such that $a \in \operatorname{Im}(H)$. The attack is claimed to be successful only if performed locally on connected nodes. With this constraint in mind, for the IEEE-14 test case the maximum number of attacked nodes is 6, and for the IEEE-30 bus system this number is 8. For the IEEE-14 network, we consider the cases where 2 to 6 nodes are under attack. For the IEEE-30 network, we consider the cases where 2 to 8 nodes are under attack. For each case and for each network, we simulate all possible attack combinations. This is to make sure we have checked our detection scheme against all possible stealthy deception attacks. Each case is repeated 1000 times for different attack vector values. When the attacker starts tampering with the data, the corrupted samples are added to the sample bin of CMIT and are therefore used in calculating the sample covariance matrix.
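The edit distance used here can be computed directly from 0/1 adjacency matrices; a minimal sketch (illustrative Python; the thesis experiments were run in MATLAB):

```python
import numpy as np

def edit_distance(A: np.ndarray, B: np.ndarray) -> int:
    """Number of edges present in exactly one of two undirected graphs,
    given their symmetric 0/1 adjacency matrices."""
    diff = np.triu(np.abs(A - B), k=1)   # count each undirected edge once
    return int(diff.sum())

# Toy 4-node example: the graphs differ in edges (0,1) and (2,3).
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]])
B = np.array([[0, 0, 0, 0],
              [0, 0, 1, 0],
              [0, 1, 0, 0],
              [0, 0, 0, 0]])
# edit_distance(A, B) == 2; a nonzero distance against the grid graph
# is what triggers the alarm in the detection scheme.
```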
With enough corrupted samples, our algorithm can get arbitrarily close to 100% success in detecting all cases of attacks discussed above, for both the IEEE-14 and IEEE-30 bus systems. This is shown in Figure 3.3 for the IEEE-14 bus system. The detection rate is averaged over all possible attack scenarios. The reason behind the trend shown in Figure 3.3 is that, first, for a very small number of corrupted measurements, the Markov graph follows the true information matrix and then, for a higher number of compromised measurements, the Markov graph follows the random relationship that the attacker is producing. When the number of compromised samples increases, they gain more weight in the sample covariance, and the chance of a change in the Markov graph increases. It can be seen that even for a small number of corrupted measurements, our method presents good performance: the detection rate is 90% with 30 corrupted samples. The minimum number of corrupted samples needed for an almost 100% detection rate is 130 for the IEEE-14 bus system and 50 for the IEEE-30 bus system. Since IEEE-30 is sparser than the IEEE-14 bus system, our method performs more efficiently in the former case. Yet, for a 60 Hz system, the detection speed for the IEEE-14 bus system is quite remarkable as well.

Identifying Nodes under Attack. The next step is to find which nodes are under attack. As stated earlier, we use the anomaly score metric [Idé and Lozano, 2009] to detect such nodes. As an example, Figure 3.4 shows the anomaly score plot for the case where nodes 4, 5 and 6 are under attack.² It means that a random vector is added to the measurements at these nodes. This attack is repeated 1000 times for different values, building an attack size of 0.7.

Figure 3.4: Anomaly score for the IEEE-14 bus system (normalized anomaly score versus node number). Nodes 4, 5, 6 are under attack; attack size is 0.7.
The attack size refers to the expected value of the Euclidean norm of the attack vector $a$.

²The numbering system employed here is the one of the published IEEE-14 system available at https://www.ee.washington.edu/research/pstca/pf14/pg_tca14bus.htm

Figure 3.5: Anomaly score for the IEEE-14 bus system for different attack sizes (normalized anomaly score versus node number). Nodes 4, 5, 6 are under attack. Attack sizes are 0.5, 0.7, 1.

Simulation results show that as the attack size increases, the difference between the anomaly scores of the nodes under attack and the uncompromised nodes increases and, as a result, it becomes easier to pinpoint the attacked nodes. For example, Figure 3.5 compares the cases where the attack size is 1, 0.7 and 0.5 for the attack scenario where nodes 4, 5, 6 are under attack. It should be noted that in order for an attack to be successful in misleading the TSO, the attack size should not be too small. More specifically, the attacker wants to make a change in the system state that is noticeable, with the hope that this will result in a wrong reaction by the TSO. If the value of the system state under attack is close to its real value, the system is not considered under attack, as it continues its normal operation. It can be seen that, even for the smallest possible attack size that would normally not lead the operator to react, the anomaly score plot remains reliable. For example, in the considered attack scenario, the anomaly plot performs well even for an attack size of 0.3, while it seems that a potentially successful attack under normal standards needs a bigger attack size.

Setting up the Anomaly Score Threshold. Setting the threshold for the anomaly score is another important aspect of the detection algorithm. As discussed earlier, our scheme has two major parts: first, detection of the attack state, i.e., declaring whether the system is under attack.
Second, the identification of the attacked nodes in case of an attack state. In Figure 3.3, we analyzed the detection rate of the "attack state" versus the number of corrupted samples. In Figures 3.4 and 3.5, we discussed how the normalized anomaly score changes with different attack sizes. Now, we use this intuition to design the threshold for the anomaly score. In case of an "attack state," we calculate the normalized anomaly score for each node. For any node, if this benchmark is greater than the threshold, the node is considered to be under attack. In this context, we define the "Node Detection Ratio (NDRo)" as the ratio of the number of attacked nodes that are correctly labeled as attacked to the total number of attacked nodes. Correspondingly, the "False Alarm Ratio (FARo)," not to be confused with the False Alarm Rate (FAR), refers to the ratio of the number of uncompromised nodes that are mislabeled as under attack to the total number of uncompromised nodes. As in detection theory, there is a trade-off in designing this threshold value: lower threshold values result in higher NDRo and higher FARo, and vice versa. Since our goal is to detect all attacked nodes, we design the threshold such that the NDRo is approximately 100% with a very low FARo. To design the threshold, we repeat the simulation discussed for Figures 3.4 and 3.5 for five different sets of attacked nodes and the three discussed attack sizes, and repeat each attack size 100 times. As can also be seen in the above plots, with a threshold of 0.3, the normalized anomaly score is above the threshold for all attacked nodes. Next, we use this threshold on all possible sets of attacked nodes in the IEEE-14 bus system with an attack size of 0.7 and repeat it 50 times for each set. Simulation results show that this threshold guarantees nearly 100% NDRo with a very low FARo of $3.82 \times 10^{-5}$. The reason is that the anomaly score provides a precise statistical analysis of the nodes that contribute to the mismatch.
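The thresholding step and the two ratios can be sketched as follows (illustrative Python with made-up scores; NDRo and FARo are computed exactly as defined above):

```python
import numpy as np

def label_and_evaluate(scores, attacked, threshold=0.3):
    """Flag nodes whose normalized anomaly score exceeds the threshold,
    then report NDRo and FARo against the true attacked set.

    scores   : normalized anomaly scores in [0, 1], one per node
    attacked : boolean mask of the truly attacked nodes
    """
    scores = np.asarray(scores)
    attacked = np.asarray(attacked, dtype=bool)
    flagged = scores > threshold
    ndro = flagged[attacked].mean()    # detected attacked / total attacked
    faro = flagged[~attacked].mean()   # false flags / total uncompromised
    return flagged, ndro, faro

# Toy example: nodes 4, 5, 6 (0-indexed: 3, 4, 5) are under attack.
scores = np.array([0.05, 0.1, 0.02, 0.9, 1.0, 0.8, 0.1, 0.0])
attacked = np.zeros(8, dtype=bool)
attacked[[3, 4, 5]] = True
flagged, ndro, faro = label_and_evaluate(scores, attacked)
# ndro == 1.0 (all attacked nodes flagged), faro == 0.0
```

Lowering `threshold` trades a higher NDRo for a higher FARo, which is the trade-off discussed above.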
Hence, we can obtain a 100% detection rate with a very low FARo.

3.6 Discussion and Conclusion

We have proposed a decentralized false data injection attack detection scheme that is capable of detecting the most recent stealthy deception attack on the power grid. To the best of our knowledge, our remedy is the first to comprehensively detect this sophisticated attack. In addition to detecting the attack state, our algorithm is capable of pinpointing the set of attacked nodes. Although [Giani et al., January 2012] considers the same attack on the power network, considerable progress is made in our approach versus the one in [Giani et al., January 2012]. In both cases, the goal is to detect the attack. While [Giani et al., January 2012] seeks a PMU placement method, our method does not require additional hardware but rather performs statistical structure learning on the measurement data. In general, both PMU placement and structure learning are NP-hard. However, the use of common knowledge of the grid structure helps us reach a polynomial-time solution. The power network structure is a sparse graph that satisfies the local separation property and walk-summability. For details on how these properties reduce the general NP-hard problem to a tractable polynomial-time problem, see [Anandkumar et al., 2012].

As stated earlier, the computational complexity of our method is polynomial, and the decentralized property makes our scheme suitable for very large networks, with manageable complexity and run time. In addition, our method is capable of detecting attacks that manipulate reactive power measurements to cause inaccurate voltage amplitude data. Such an attack scenario can lead to, or mimic, a voltage collapse. In conclusion, we have introduced change detection for the graphical model of a power system and showed that it can be used to detect data manipulation.
Our method protects the power system against a large class of false data injection attacks, which is of paramount importance for current and future grid reliability, security, and stability.

Chapter 4
Stochastic Optimization in High Dimension

4.1 Introduction

Stochastic optimization techniques have been extensively employed for online machine learning on data which is uncertain, noisy or missing. Typically, they involve performing a large number of inexpensive iterative updates, making them scalable for large-scale learning. In contrast, traditional batch-based techniques involve far more expensive operations for each update step. Stochastic optimization has been analyzed in a number of recent works, e.g., [Shalev-Shwartz, 2011, Boyd et al., 2011, Agarwal et al., 2012b, Wang et al., 2013a, Johnson and Zhang, 2013, Shalev-Shwartz and Zhang, 2013].

The alternating direction method of multipliers (ADMM) is a popular method for online and distributed optimization on a large scale [Boyd et al., 2011], and is employed in many applications, e.g., [Wahlberg et al., 2012], [Esser et al., 2010], [Mota et al., 2012]. It can be viewed as a decomposition procedure where solutions to sub-problems are found locally and coordinated via constraints to find the global solution. Specifically, it is a form of augmented Lagrangian method which applies partial updates to the dual variables. ADMM is often applied to solve regularized problems, where the function optimization and regularization can be carried out locally and then coordinated globally via constraints. Regularized optimization problems are especially relevant in the high-dimensional regime, since regularization is a natural mechanism to overcome ill-posedness and to encourage parsimony in the optimal solution, e.g., sparsity and low rank. Due to the efficiency of ADMM in solving regularized problems, we employ it in our work.
In our work [Sedghi et al., 2014b,a], we design a modified version of the stochastic ADMM method for high-dimensional problems. We first analyze the simple setting, where the optimization problem consists of a loss function and a single regularizer, and then extend it to the multi-block setting with multiple regularizers and multiple variables. For illustrative purposes, for the first setting we consider the sparse optimization problem, and for the second setting the matrix decomposition problem. Note that our results easily extend to other settings, e.g., those in Negahban et al. [2012].

We consider a simple modification to the (inexact) stochastic ADMM method [Ouyang et al., 2013] by incorporating multiple steps or epochs, which can be viewed as a form of annealing. We establish that this simple modification has huge implications in achieving tight convergence rates as the dimensions of the problem instances scale. In each iteration of the method, we employ projections onto certain norm balls of appropriate radii, and we decrease the radii in epochs over time. The idea of annealing was first introduced by Agarwal et al. [2012b] for dual averaging; yet, that method cannot be extended to multi-variable cases. For instance, for the sparse optimization problem, we constrain the optimal solution at each step to be within an $\ell_1$-norm ball around the initial estimate, obtained at the beginning of each epoch. At the end of the epoch, an average is computed and passed on to the next epoch as its initial estimate. Note that the $\ell_1$ projection can be solved efficiently in linear time, and can also be parallelized easily [Duchi et al., 2008]. For matrix decomposition with a general loss function, the ADMM method requires multiple blocks for updating the low rank and sparse components. We apply the same principle and project the sparse and low rank estimates onto $\ell_1$ and nuclear norm balls, respectively, and these projections can be computed efficiently.
Theoretical implications: The above simple modifications to ADMM have huge implications for high-dimensional problems. For sparse optimization, our convergence rate is $O(s \log d / T)$ for $s$-sparse problems in $d$ dimensions in $T$ steps. Our bound has the best of both worlds: efficient high-dimensional scaling (as $\log d$) and an efficient convergence rate (as $1/T$). This also matches the minimax lower bound for the linear model and square loss function [Raskutti et al., 2011], which implies that our guarantee is unimprovable by any (batch or online) algorithm (up to constant factors). For matrix decomposition, our convergence rate is $O((s+r)\beta^2(p)\log p / T) + O(\max\{s+r, p\}/p^2)$ for a $p \times p$ input matrix in $T$ steps, where the sparse part has $s$ non-zero entries and the low rank part has rank $r$. For many natural noise models (e.g., independent noise, linear Bayesian networks), $\beta^2(p) = p$, and the resulting convergence rate is minimax-optimal. Note that our bound is not only on the reconstruction error, but also on the error in recovering the sparse and low rank components. These are the first convergence guarantees for online matrix decomposition in high dimensions. Moreover, our convergence rate holds with high probability when noisy samples are input, in contrast to the expected convergence rates typically analyzed in the literature. See Tables 4.1 and 4.2 for a comparison of this work with related frameworks.

Practical implications: The proposed algorithms provide significantly faster convergence in high dimension and better robustness to noise. For sparse optimization, our method has significantly better accuracy compared to the stochastic ADMM method and better performance than RADAR, based on multi-step dual averaging [Agarwal et al., 2012b]. For matrix decomposition, we compare our method with the state-of-the-art inexact ALM [Lin et al., 2010] method. While both methods have similar reconstruction performance, our method has significantly better accuracy in recovering the sparse and low rank components.
Related Work: ADMM: Existing online ADMM-based methods lack high-dimensional guarantees. They scale poorly with the data dimension (as $O(d^2)$), and also have slow convergence for general problems (as $O(1/\sqrt{T})$). Under strong convexity, the convergence rate can be improved to $O(1/T)$, but only in expectation: such analyses ignore the per-sample error and consider only the expected convergence rate (see Table 4.1). In contrast, our bounds hold with high probability. Some stochastic ADMM methods, Goldstein et al. [2012], Deng [2012] and Luo [2012], provide faster rates for stochastic ADMM than the rate noted in Table 4.1. However, they require strong conditions which are not satisfied for the optimization problems considered here; e.g., Goldstein et al. [2012] require both the loss function and the regularizer to be strongly convex. It is also worth mentioning that our method provides error contraction, i.e., we can show error shrinkage after a specific number of iterations, whereas no other ADMM-based method can guarantee this.

Related Work: Sparse Optimization: For the sparse optimization problem, $\ell_1$ regularization is employed and the underlying true parameter is assumed to be sparse. This is a well-studied problem in a number of works (for details, refer to [Agarwal et al., 2012b]). Agarwal et al. [2012b] propose an efficient online method based on annealing dual averaging, which achieves the same optimal rates as the ones derived in our work. The main difference is that our ADMM method is capable of solving the problem for multiple random variables and multiple conditions, while their method cannot incorporate these extensions.

Related Work: Matrix Decomposition: To the best of our knowledge, online guarantees for high-dimensional matrix decomposition have not been provided before. Wang et al.
[2013b] propose a multi-block ADMM method for the matrix decomposition problem, but they only provide a convergence rate analysis in expectation, and it has poor high-dimensional scaling (as $O(p^4)$ for a $p \times p$ matrix) without further modifications. Note that they only provide a convergence rate on the difference between the loss function and the optimal loss, whereas we provide convergence rates on the individual errors of the sparse and low rank components, $\|\bar{S}(T) - S^*\|_F^2$ and $\|\bar{L}(T) - L^*\|_F^2$. See Table 4.2 for a comparison of guarantees for the matrix decomposition problem.

We compare our guarantees in the online setting with the batch guarantees of Agarwal et al. [2012a]. Although other batch analyses exist for matrix decomposition, e.g., [Chandrasekaran et al., 2011, Candès et al., 2011, Hsu et al., 2011], they require stronger assumptions based on incoherence conditions for recovery, which we do not impose here. The batch analysis by Agarwal et al. [2012a] requires a fairly mild condition such as "diffusivity" of the unknown low rank matrix. Moreover, the convergence rate for the batch setting by Agarwal et al. [2012a] achieves the minimax lower bound (under the independent noise model), and is thus optimal, up to constant factors.

Note that when only the weak diffusivity condition is assumed, the matrix decomposition problem suffers from an approximation error, i.e., an error even in the noiseless setting. Both the minimax rate and the batch rates in [Agarwal et al., 2012a] have an approximation error. However, our approximation error is worse by a factor of $p$, although it is still decaying with respect to $p$.

Overview of Proof Techniques: Note that in the main text, we provide guarantees for fixed epoch length. However, if we use a variable-length epoch size, we can get a $\log d$ improvement in the convergence rate.
Our proof involves the following high-level steps to establish the convergence rate: (1) deriving the convergence rate for the modified ADMM method (with variable-length epoch size) at the end of one epoch, where the ADMM estimate is compared with the batch estimate; (2) comparing the batch estimate with the true parameter, then combining the two steps and analyzing over multiple epochs to obtain the final bound. We can show that, with the proposed parameter setting and varying epoch size, the error can be halved by the end of each epoch. For the matrix decomposition problem, additional care is needed to ensure that the errors in estimating the sparse and low rank parts can be decoupled. This is especially non-trivial in our setting since we utilize multiple variables in different blocks which are updated in each iteration. Our careful analysis enables us to establish the first results for online matrix decomposition in the high-dimensional setting which match the batch guarantees for many interesting statistical models. (3) Next, we analyze how the guarantees change for fixed epoch length.

Method | Assumptions | Convergence
ST-ADMM [Ouyang et al., 2013] | L, convexity | $O(d^2/\sqrt{T})$
ST-ADMM [Ouyang et al., 2013] | SC, E | $O(d^2 \log T / T)$
BADMM [Wang and Banerjee, 2013] | convexity, E | $O(d^2/\sqrt{T})$
RADAR [Agarwal et al., 2012b] | LSC, LL | $O(s \log d / T)$
REASON 1 (this work) | LSC, LL | $O(s \log d / T)$
Minimax bound [Raskutti et al., 2011] | eigenvalue conditions | $O(s \log d / T)$

Table 4.1: Comparison of online sparse optimization methods under sparsity level $s$ for the optimal parameter, $d$-dimensional space, and $T$ iterations. SC = strong convexity, LSC = local strong convexity, LL = local Lipschitz, L = Lipschitz property, E = in expectation. The last row provides the minimax-optimal rate on error for any method. The results hold with high probability unless otherwise mentioned.
We prove that, although the error halving stops after a certain number of iterations, the error does not increase noticeably enough to invalidate the analysis.

4.2 Problem Formulation

Consider the optimization problem

$\theta^* \in \arg\min_{\theta \in \Omega} \mathbb{E}[f(\theta, x)], \quad (4.2.1)$

where $x \in X$ is a random variable and $f: \Omega \times X \to \mathbb{R}$ is a given loss function. Since only samples are available, we employ the empirical estimate $\hat{f}(\theta) := \frac{1}{n}\sum_{i \in [n]} f(\theta, x_i)$ in the optimization. For high-dimensional $\theta$, we need to impose a regularization $\mathcal{R}(\theta)$, and

$\hat{\theta} := \arg\min_{\theta \in \Omega} \left\{\hat{f}(\theta) + \lambda_n \mathcal{R}(\theta)\right\} \quad (4.2.2)$

is the batch optimal solution.

Method | Assumptions | Convergence rate
Multi-block-ADMM [Wang et al., 2013b] | L, SC, E | $O(p^4/T)$
Batch method [Agarwal et al., 2012a] | LL, LSC, DF | $O((s \log p + rp)/T) + O(s/p^2)$
REASON 2 (this work) | LSC, LL, DF | $O((s+r)\beta^2(p)\log p / T) + O(\max\{s+r, p\}/p^2)$
Minimax bound [Agarwal et al., 2012a] | $\ell_2$, IN, DF | $O((s \log p + rp)/T) + O(s/p^2)$

Table 4.2: Comparison of optimization methods for sparse + low rank matrix decomposition of a $p \times p$ matrix under sparsity level $s$ and rank $r$, where $T$ is the number of samples. SC = strong convexity, LSC = local strong convexity, LL = local Lipschitz, L = Lipschitz loss function, IN = independent noise model, DF = diffuse low rank matrix under the optimal parameter. $\beta(p) = \Theta(\sqrt{p})$ or $O(p)$, and its value depends on the model. The last row provides the minimax-optimal rate on error for any method under the independent noise model. The results hold with high probability unless otherwise mentioned. For Multi-block-ADMM [Wang et al., 2013b] the convergence rate is on the difference of the loss function from the optimal loss; for the rest of the works in the table, the convergence rate is on $\|\bar{S}(T) - S^*\|_F^2 + \|\bar{L}(T) - L^*\|_F^2$.

For concreteness, we focus on the sparse optimization and the matrix decomposition problems. It is straightforward to generalize our results to other settings,
say [Negahban et al., 2012]. For the first case, the optimum $\theta^*$ is an $s$-sparse solution, the regularizer is the $\ell_1$ norm, and we have

$\hat{\theta} = \arg\min \left\{\hat{f}(\theta) + \lambda_n \|\theta\|_1\right\}. \quad (4.2.3)$

We also consider the matrix decomposition problem, where the underlying matrix $M^* = S^* + L^*$ is a combination of a sparse matrix $S^*$ and a low rank matrix $L^*$. Here the unknown parameters are $[S^*, L^*]$, and the regularization $\mathcal{R}(\cdot)$ is a combination of the $\ell_1$ norm and the nuclear norm $\|\cdot\|_*$ on the sparse and low rank parts, respectively. The corresponding batch estimate is given by

$\hat{M} := \arg\min \left\{f(M) + \lambda_n \|S\|_1 + \mu_n \|L\|_*\right\} \quad (4.2.4)$
$\text{s.t.} \quad M = S + L, \quad \|L\|_\infty \le \frac{\alpha}{p}.$

The $\|\cdot\|_\infty$ constraint on the low rank matrix will be discussed in detail later, and it is assumed that the true matrix $L^*$ satisfies this condition. We consider an online version of the optimization problem, where we optimize the program in (4.2.2) under each data sample instead of using the empirical estimate of $f$ for an entire batch. We consider an inexact version of the online ADMM method, where we compute the gradient $\hat{g}_i \in \nabla f(\theta, x_i)$ at each step and employ it for the optimization. In addition, we consider an epoch-based setting, where we constrain the optimal solution to be close to the initial estimate at the beginning of the epoch. This can be viewed as a form of regularization, and we constrain more (i.e., constrain the solution to be closer) as time goes by, since we expect to have a sharper estimate of the optimal solution. This limits the search space for the optimal solution and allows us to provide tight guarantees in the high-dimensional regime. We first consider the simple case of the sparse setting in (4.2.3), where the ADMM has two blocks, and then extend it to the sparse + low rank setting of (4.2.4), which involves multi-block ADMM.

4.3 $\ell_1$ Regularized Stochastic Optimization

We consider the optimization problem $\theta^* \in \arg\min_{\theta \in \Omega} \mathbb{E}[f(\theta, x)]$, where $\theta^*$ is a sparse vector. The loss function $f(\theta, x_k)$ is a function of a parameter $\theta \in \mathbb{R}^d$ and samples $x_i$.
In the stochastic setting, we do not have access to $\mathbb{E}[f(\theta, x)]$ nor to its subgradients. In each iteration, we have access to one noisy sample. In order to impose sparsity, we use regularization. Thus we solve a sequence

$\theta_k \in \arg\min_{\theta \in \Omega'} \left\{f(\theta, x_k) + \lambda \|\theta\|_1\right\}, \quad \Omega' \subseteq \Omega, \quad (4.3.1)$

where the regularization parameter $\lambda > 0$ and the constraint sets $\Omega'$ change from epoch to epoch.

Algorithm 2: Regularized Epoch-based Admm for Stochastic Optimization in high-dimensioN 1 (REASON 1)
Input: $\rho, \rho_x > 0$, epoch length $T_0$, initial prox center $\tilde{\theta}_1$, initial radius $R_1$, regularization parameters $\{\lambda_i\}_{i=1}^{k_T}$. Define $\mathrm{Shrink}_\kappa(\cdot)$, the shrinkage operator in (4.3.3).
for each epoch $i = 1, 2, \ldots, k_T$ do
  Initialize $\theta_0 = y_0 = \tilde{\theta}_i$
  for each iteration $k = 0, 1, \ldots, T_0 - 1$ do
    $\theta_{k+1} = \arg\min_{\|\theta - \tilde{\theta}_i\|_1 \le R_i} \left\{\langle \nabla f(\theta_k), \theta \rangle - \langle z_k, \theta - y_k \rangle + \frac{\rho}{2}\|\theta - y_k\|_2^2 + \frac{\rho_x}{2}\|\theta - \theta_k\|_2^2\right\}$ (4.3.2)
    $y_{k+1} = \mathrm{Shrink}_{\lambda_i/\rho}(\theta_{k+1} - z_k/\rho)$
    $z_{k+1} = z_k - \tau(\theta_{k+1} - y_{k+1})$
  end for
  Return: $\bar{\theta}(T_i) := \frac{1}{T_0}\sum_{k=0}^{T_0-1} \theta_k$ for epoch $i$ and set $\tilde{\theta}_{i+1} = \bar{\theta}(T_i)$.
  Update: $R_{i+1}^2 = R_i^2/2$.
end for

4.3.1 Epoch-based Online ADMM Algorithm

We now describe the modified inexact ADMM algorithm for the sparse optimization problem in (4.3.1), and refer to it as REASON 1; see Algorithm 2. We consider epochs of length $T_0$, and in each epoch $i$, we constrain the optimal solution to be within an $\ell_1$ ball of radius $R_i$ centered at $\tilde{\theta}_i$, which is the initial estimate of $\theta^*$ at the start of the epoch. The $\theta$-update is given by

$\theta_{k+1} = \arg\min_{\|\theta - \tilde{\theta}_i\|_1^2 \le R_i^2} \left\{\langle \nabla f(\theta_k), \theta \rangle - \langle z_k, \theta - y_k \rangle + \frac{\rho}{2}\|\theta - y_k\|_2^2 + \frac{\rho_x}{2}\|\theta - \theta_k\|_2^2\right\}.$

Note that this is an inexact update, since we employ the gradient $\nabla f(\theta_k)$ rather than optimizing directly over the loss function $f(\theta)$, which is expensive. The above program can be solved efficiently, since it reduces to a projection onto the $\ell_1$ ball, whose complexity is linear in the sparsity level of the gradient when performed serially, and $O(\log d)$ when performed in parallel using $d$ processors [Duchi et al., 2008]. For details of the $\theta$-update implementation, see Appendix 4.7.1.
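Two simple primitives drive REASON 1's inner loop: the $\ell_1$-ball projection in the $\theta$-update (applied to $\theta - \tilde{\theta}_i$, then shifted back by the epoch center) and the soft-thresholding in the $y$-update. The sketch below is illustrative Python (the thesis code ran in MATLAB) and uses the standard sorting-based projection; Duchi et al. [2008] describe a linear-time variant that avoids the full sort.

```python
import numpy as np

def project_l1_ball(v: np.ndarray, radius: float) -> np.ndarray:
    """Euclidean projection of v onto the l1 ball of the given radius
    (sorting-based variant of the algorithm in Duchi et al., 2008)."""
    if np.abs(v).sum() <= radius:
        return v.copy()
    u = np.sort(np.abs(v))[::-1]              # magnitudes, descending
    css = np.cumsum(u)
    ks = np.arange(1, len(v) + 1)
    rho = np.max(np.where(u - (css - radius) / ks > 0)[0])
    theta = (css[rho] - radius) / (rho + 1)   # threshold level
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def shrink(a: np.ndarray, kappa: float) -> np.ndarray:
    """Soft-thresholding: Shrink_kappa(a) = (a - kappa)_+ - (-a - kappa)_+."""
    return np.maximum(a - kappa, 0.0) - np.maximum(-a - kappa, 0.0)

# Projection onto the l1 ball of radius 1:
x = project_l1_ball(np.array([0.8, -0.6, 0.2]), 1.0)    # -> [0.6, -0.4, 0.0]

# Soft-thresholding zeroes entries in [-kappa, kappa], shrinks the rest:
y = shrink(np.array([-2.0, -0.3, 0.0, 0.5, 3.0]), 1.0)  # -> [-1, 0, 0, 0, 2]
```

Both operations are elementwise or near-linear in $d$, which is what makes each REASON 1 iteration cheap.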
For the regularizer, we introduce the variable y, and the y-update is

y_{k+1} = arg min_y { λ_i ||y||_1 − ⟨z_k, θ_{k+1} − y⟩ + (ρ/2)||θ_{k+1} − y||²_2 }.

This update can be simplified to the form given in REASON 1, where Shrink_κ(·) is the soft-thresholding or shrinkage function [Boyd et al., 2011]:

Shrink_κ(a) = (a − κ)_+ − (−a − κ)_+.   (4.3.3)

Thus, each step in the update is extremely simple to implement. When an epoch is complete, we carry over the average θ̄(T_i) as the next epoch center and reset the other variables.

4.3.2 High-dimensional Guarantees

We now provide convergence guarantees for the proposed method under the following assumptions.

Assumption A1: Local strong convexity (LSC). The function f: S → R satisfies an R-local form of strong convexity (LSC) if there is a non-negative constant γ = γ(R) such that

f(θ_1) ≥ f(θ_2) + ⟨∇f(θ_2), θ_1 − θ_2⟩ + (γ/2)||θ_2 − θ_1||²_2

for any θ_1, θ_2 ∈ S with ||θ_1||_1 ≤ R and ||θ_2||_1 ≤ R.

Note that the notion of strong convexity leads to faster convergence rates in general. Intuitively, strong convexity is a measure of the curvature of the loss function, which relates the reduction in the loss function to closeness in the variable domain. Assuming that the function f is twice continuously differentiable, it is strongly convex if and only if its Hessian is positive definite for all feasible θ. However, in the high-dimensional regime, where there are fewer samples than the data dimension, the Hessian matrix is often singular and we do not have global strong convexity. A solution is to impose local strong convexity, which allows us to provide guarantees for high-dimensional problems. The notion of local strong convexity has been exploited before in a number of works on high-dimensional analysis, e.g., [Negahban et al., 2012, Agarwal et al., 2012a,b].

Assumption A2: Sub-Gaussian stochastic gradients. Let e_k(θ) := ∇f(θ, x_k) − E[∇f(θ, x_k)].
For all θ such that ||θ − θ*||_1 ≤ R, there is a constant σ = σ(R) such that for all k > 0,

E[ exp( ||e_k(θ)||²_∞ / σ² ) ] ≤ exp(1).

Remark: The bound holds with σ = O(√(log d)) whenever each component of the error vector has sub-Gaussian tails [Agarwal et al., 2012b].

Assumption A3: Local Lipschitz condition. For each R > 0, there is a constant G = G(R) such that

|f(θ_1) − f(θ_2)| ≤ G ||θ_1 − θ_2||_1   (4.3.4)

for all θ_1, θ_2 ∈ S such that ||θ_j − θ*||_1 ≤ R for j = 1, 2.

We choose the algorithm parameters as below, where λ_i is the regularization parameter for the ℓ1 term, ρ and ρ_x are the penalties in the θ-update as in (4.3.2), and τ is the step size for the dual update:

λ²_i = (γ / (s √T_0)) √( R²_i log d + (G² R²_i)/T_0 + σ²_i R²_i w²_i ),   (4.3.5)
τ ∝ √(T_0 log d) / R_i,  ρ, ρ_x > 0,  ρ = τ.

Theorem 1. Under Assumptions A1–A3, with λ_i as in (4.3.5), we use the fixed epoch length T_0 = T log d / k_T, where T is the total number of iterations. Assuming this setting ensures T_0 = O(log d), for any θ* with sparsity s, we have

||θ̄_T − θ*||²_2 = O( s ( log d + (w² + log(k_T / log d)) σ² ) · k_T / (T log d) ),

with probability at least 1 − 3 exp(−w²/12), where θ̄_T is the average for the last epoch for a total of T iterations, and

k_T = log_2 ( γ² R²_1 T / ( s² (log d + 12 σ² w²) ) ).

For the proof, see Appendix A.2.6.

Improvement of the log d factor: The above theorem covers the practical case where the epoch length T_0 is fixed. We can improve the above results using varying epoch lengths (which depend on the problem parameters) such that ||θ̄_T − θ*||²_2 = O(s log d / T). See Theorem 3 in Appendix A.1.

Optimal guarantees: The above results indicate a convergence rate of O(s log d / T), which matches the minimax lower bound for sparse estimation [Raskutti et al., 2011]. This implies that our guarantees are unimprovable up to constant factors.

Comparison with Agarwal et al. [2012b]: The RADAR algorithm proposed by Agarwal et al. [2012b] also achieves a rate of O(s log d / T), which matches ours.
The difference is that our method is capable of solving problems with multiple variables and constraints, as discussed in the next section, while RADAR cannot be generalized to do so.

Remark on the Lipschitz property: In fact, our method requires a weaker condition than the local Lipschitz property. We only require the following bounds on the dual variable: ||z_{k+1} − z_k||_∞ and ||z_k||_∞. Both of these are upper bounded by G + 2(ρ_x + ρ)R_i. In addition, the ℓ1 constraint does not influence the bound on the dual variable. For details, see Section A.2.1.

Remark on the need for the ℓ1 constraint: We use an ℓ1 constraint in the θ-update step, while the usual ADMM method does not have such a constraint. The ℓ1 constraint allows us to provide efficient high-dimensional scaling (as O(log d)). Specifically, this is because one of the terms in our convergence rate consists of ⟨e_k, θ_k − θ̂_i⟩, where e_k is the error in the gradient (see Appendix A.2.2). We can use the inequality

⟨e_k, θ_k − θ̂_i⟩ ≤ ||e_k||_∞ ||θ_k − θ̂_i||_1.

From Assumption A2, we have a bound ||e_k||²_∞ = O(log d), and by imposing the ℓ1 constraint, we also have a bound on the second term; thus, we obtain an efficient convergence rate. If instead an ℓ_p penalty is imposed for some p, the error scales as ||e(θ)||²_q, where ℓ_q is the dual norm of ℓ_p. For instance, if p = 2, we have q = 2, and the error can be as high as O(d/T), since ||e(θ)||²_2 ≤ d. Note that for the ℓ1 norm, we have ℓ_∞ as the dual norm, and ||e(θ)||_∞ = O(√(log d)), which leads to the optimal convergence rate in the above theorem. Moreover, this ℓ1 constraint can be efficiently implemented, as discussed in Section 4.3.1.

4.4 Extension to Doubly Regularized Stochastic Optimization

We now consider the problem of matrix decomposition into a sparse matrix S ∈ R^{p×p} and a low rank matrix L ∈ R^{p×p}, based on the loss function f on M = S + L. The batch program is given in Equation (4.2.4), and we now design an online program based on a multi-block ADMM algorithm, where the updates for M, S, L are carried out independently.
In the stochastic setting, we consider the optimization problem M* ∈ arg min E[f(M, X)], where we want to decompose M* into a sparse matrix S* ∈ R^{p×p} and a low rank matrix L* ∈ R^{p×p}. Here, f(M, X_k) is a function of the parameter M and samples X_k. X_k can be a matrix (e.g., the independent noise model) or a vector (e.g., the Gaussian graphical model). In the stochastic setting, we do not have access to E[f(M, X)] nor to its subgradients. In each iteration we have access to one noisy sample and update our estimate based on that. We impose the desired properties with regularization. Thus, we solve a sequence

M_k := arg min { f(M, X_k) + λ||S||_1 + μ||L||_* }   (4.4.1)
s.t. M = S + L,  ||L||_∞ ≤ α/p.

4.4.1 Epoch-based Multi-Block ADMM Algorithm

We now extend the ADMM method proposed in REASON 1 to multi-block ADMM. The details are in Algorithm 3, and we refer to it as REASON 2. Recall that the matrix decomposition setting assumes that the true matrix M* = S* + L* is a combination of a sparse matrix S* and a low rank matrix L*. In REASON 2, the updates for the matrices M, S, L are done independently at each step. For the M-update, the same linearization approach as in REASON 1 is used:

M_{k+1} = arg min_M { Tr(∇f(M_k)ᵀ(M − M_k)) − Tr(Z_kᵀ(M − S_k − L_k)) + (ρ/2)||M − S_k − L_k||²_F + (ρ_x/2)||M − M_k||²_F }.

This is an unconstrained quadratic optimization with closed-form updates, as shown in REASON 2. The update rules for S and L are the result of an inexact proximal update obtained by considering them as a single block, which can then be decoupled as follows (for details, see Section 4.5.2):

S_{k+1} = arg min_{||S − S̃_i||²_1 ≤ R²_i} { λ_i ||S||_1 + (ρ/(2τ_k)) ||S − (S_k + τ_k G^k_M)||²_F },   (4.4.2)

L_{k+1} = arg min_{||L − L̃_i||²_* ≤ R̃²_i, ||L||_∞ ≤ α/p} { μ_i ||L||_* + (ρ/(2τ_k)) ||L − (L_k + τ_k G^k_M)||²_F },   (4.4.3)

where G^k_M = M_{k+1} − S_k − L_k − ρ^{-1} Z_k. As before, we consider epochs of length T_0 and project the estimates S and L around the epoch initializations S̃_i and L̃_i. We do not need to constrain the update of the matrix M. We impose an ℓ1-norm projection for the sparse estimate S.
For the low rank estimate L, we impose a nuclear norm projection around the epoch initialization L̃_i. Intuitively, the nuclear norm projection, which is an ℓ1 projection on the singular values, encourages sparsity in the spectral domain, leading to low rank estimates. In addition, we impose an ℓ∞ constraint of α/p on each entry of L, which is different from the update of S. Note that the ℓ∞ constraint is also imposed for the batch version of the problem (4.2.4) in [Agarwal et al., 2012a], and we assume that the true matrix L* satisfies this constraint. For more discussion, see Section 4.4.2.

Note that each step of the method is easily implementable. The M-update is in closed form. The S-update involves optimization with projection onto the given ℓ1 ball, which can be performed efficiently [Duchi et al., 2008], as discussed in Section 4.3.1. For implementation details, see Appendix 4.7.2. For the L-update, we introduce an additional auxiliary variable Y, and we have

L_{k+1} = arg min_{||L − L̃_i||²_* ≤ R̃²_i} { μ_i ||L||_* − Tr(U_kᵀ(L − Y_k)) + (ρ/2)||L − Y_k||²_F },

Y_{k+1} = arg min_{||Y||_∞ ≤ α/p} { (ρ/(2τ_k)) ||Y − (L_k + τ_k G^k_M)||²_F + (ρ/2)||L_{k+1} − Y||²_F − Tr(U_kᵀ(L_{k+1} − Y)) },

U_{k+1} = U_k − τ(L_{k+1} − Y_{k+1}).

The L-update can now be performed efficiently by computing an SVD and then running the projection step [Duchi et al., 2008]. Note that approximate SVD computation techniques can be employed for efficiency here, e.g., [Lerman et al., 2012]. The Y-update is a projection onto the ℓ∞-norm ball, which can be computed easily. Let Y^{(j)} stand for the j-th entry of vec(Y). Then, for each j-th entry of vec(Y), the solution is as follows:

If |(L_{k+1} + (τ_k/(τ_k + 1))(G^k_M − U_k/ρ))^{(j)}| ≤ α/p, then
  Y^{(j)} = (L_{k+1} + (τ_k/(τ_k + 1))(G^k_M − U_k/ρ))^{(j)}.
Else,
  Y^{(j)} = sign( (L_{k+1} + (τ_k/(τ_k + 1))(G^k_M − U_k/ρ))^{(j)} ) · (α/p).

As before, the epoch averages are computed and used as initializations for the next epoch.
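The entrywise rule above is exactly a Euclidean projection onto the ℓ∞ ball of radius α/p, i.e., elementwise clipping. A minimal NumPy sketch (the function name is ours; `v` stands for the matrix computed from L_{k+1}, G^k_M and U_k beforehand):

```python
import numpy as np

def project_linf(v, radius):
    # Euclidean projection onto {Y : |Y_ij| <= radius}: entries inside
    # the ball are unchanged, others are clipped to sign(v_ij) * radius.
    return np.clip(v, -radius, radius)
```

Because the projection decouples over entries, its cost is linear in the number of entries of Y, so the Y-update is never a bottleneck compared to the SVD in the L-update.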
4.4.2 High-dimensional Guarantees

We now provide guarantees that REASON 2 efficiently recovers both the sparse and the low rank estimates in high dimensions. We need the following assumptions, in addition to Assumptions A1 and A2 from the previous section.

Assumption A4: Spectral bound on the gradient error. Let E_k(M, X_k) := ∇f(M, X_k) − E[∇f(M, X_k)]. We assume ||E_k||_2 ≤ β(p)σ, where σ := ||E_k||_∞. Recall from Assumption A2 that σ = O(√(log p)) under sub-Gaussianity. Here, we require spectral bounds in addition to the ||·||_∞ bound in A2.

Assumption A5: Bound on the spikiness of the low-rank matrix. ||L*||_∞ ≤ α/p. Intuitively, the ℓ∞ constraint controls the "spikiness" of L*. If α ≈ 1, then the entries of L* are O(1/p), i.e., they are "diffuse" or "non-spiky", and no entry is too large. When the low rank matrix L* has diffuse entries, it cannot be a sparse matrix, and thus it can be separated from the sparse S* efficiently. In fact, the ℓ∞ constraint is a weaker form of the incoherence-type assumptions needed to guarantee identifiability [Chandrasekaran et al., 2011] for sparse + low rank decomposition.

Assumption A6: Local strong convexity (LSC). The function f: R^{d_1×d_2} → R satisfies an R-local form of strong convexity (LSC) if there is a non-negative constant γ = γ(R) such that

f(B_1) ≥ f(B_2) + Tr(∇f(B_2)ᵀ(B_1 − B_2)) + (γ/2)||B_2 − B_1||²_F

for any ||B_1|| ≤ R and ||B_2|| ≤ R, which is essentially the matrix version of Assumption A1. Note that we only require the LSC condition on S + L and not jointly on S and L.

We choose the algorithm parameters as below, where λ_i and μ_i are the regularization parameters for the ℓ1 and nuclear norms respectively, ρ and ρ_x correspond to the penalty terms in the M-update, and τ is the dual update step size:

λ²_i = ( √(R²_i + R̃²_i) / ((s + r)√T_0) ) √( log p + G²/T_0 + σ²β²(p) w²_i + (ρ²_x(R²_i + R̃²_i))/T_0 + α²/p² + (σ²β²(p)/T_0)(log p + w²_i) ),   (4.4.4)
μ²_i = c λ²_i,  τ ∝ √( T_0 log p / (R²_i + R̃²_i) ),  ρ, ρ_x > 0,  ρ = τ.

Theorem 2. Under Assumptions A2–A6 and the parameter settings (4.4.4), let T denote the total number of iterations and T_0 = T log p / k_T.
Assuming that the above setting guarantees T_0 = O(log p), we have

||S̄(T) − S*||²_F + ||L̄(T) − L*||²_F = O( (s + r)( log p + σ²β²(p)(w² + log(k_T / log p)) ) · k_T / (T log p) + (1 + (s + r)/p) α²/p ),   (4.4.5)

with probability at least 1 − 6 exp(−w²/12), where

k_T ≃ log( γ² R²_1 T / ( (s + r)² (log p + σ²β²(p) w²) ) ).

For the proof, see Appendix A.4.6.

Improvement of the log p factor: The above result can be improved by a log p factor by considering varying epoch lengths (which depend on the problem parameters). The resulting convergence rate is O((s + r)p log p / T + α²/p). See Theorem 11 in Appendix A.3.

Scaling of β(p): We have the bounds Ω(√p) ≤ β(p) ≤ O(p). This implies that the convergence rate is O((s + r)p log p / T + α²/p) when β(p) = Θ(√p), and O((s + r)p² log p / T + α²/p) when β(p) = Θ(p). The upper bound on β(p) arises trivially by converting the max-norm bound on ||E_k||_∞ to a bound on the spectral norm ||E_k||_2. In many interesting scenarios, the lower bound on β(p) is achieved, as outlined in Section 4.4.2.1.

Comparison with the batch result: Agarwal et al. [2012a] consider the batch version of the same problem (4.2.4) and provide a convergence rate of O((s log p + rp)/T + sα²/p²). This is also the minimax lower bound under the independent noise model. With respect to the convergence rate, we match their results with respect to the scaling of s and r, and also obtain a 1/T rate. We match the scaling with respect to p (up to a log factor) when β(p) = Θ(√p) attains the lower bound, and we discuss a few such instances below. Otherwise, we are worse by a factor of p compared to the batch version. Intuitively, this is because we require different bounds on the error terms E_k in the online and the batch settings. For the online analysis, we need to bound Σ_{k=1}^{T_i} ||E_k||_2 / T_i over each epoch, while for the batch analysis, we need to bound ||Σ_{k=1}^{T_i} E_k||_2 / T_i, which is smaller.
Intuitively, the difference between the two settings can be explained as follows: in the batch setting, since we consider an empirical estimate, we operate on the averaged error, while in the online setting we are manipulating each sample and suffer from the error due to that sample. We can employ efficient concentration bounds for the batch case [Tropp, 2012], while for the online case no such bounds exist in general. From these observations, we conjecture that our bounds in Theorem 11 are unimprovable in the online setting.

Approximation error: Note that the optimal decomposition M* = S* + L* is not identifiable in general without incoherence-style conditions [Chandrasekaran et al., 2011, Hsu et al., 2011]. In our work [Sedghi et al., 2014a,b], we provide efficient guarantees without assuming such strong incoherence constraints. This implies that there is an approximation error which is incurred even in the noiseless setting, due to model non-identifiability. Agarwal et al. [2012a] achieve an approximation error of sα²/p² for their batch algorithm. Our online algorithm has an approximation error of max{s + r, p}α²/p², which is worse, but is still decaying with p. It is not clear if this bound can be improved by any other online algorithm.

4.4.2.1 Optimal Guarantees for Various Statistical Models

We now list some statistical models under which we achieve the batch-optimal rate for sparse + low rank decomposition.

1) Independent noise model: Assume we sample i.i.d. matrices X_k = S* + L* + N_k, where the noise N_k has independent bounded sub-Gaussian entries with max_{i,j} Var(N_k(i,j)) = σ². We consider the square loss function, i.e., ||X_k − S − L||²_F. In this case, E_k = X_k − S* − L* = N_k. From [Thm. 1.1][Vu, 2005], we have w.h.p. that ||N_k||_2 = O(σ√p). We match the batch bound of [Agarwal et al., 2012a] in this setting. Moreover, Agarwal et al. [2012a] provide a minimax lower bound for this model, and we match it as well.
Thus, we achieve the optimal convergence rate for online matrix decomposition under the independent noise model.

2) Linear Bayesian network: Consider a p-dimensional vector y = Ah + n, where h ∈ R^r with r ≪ p, and n ∈ R^p. The variable h is hidden, and y is the observed variable. We assume that the vectors h and n are each zero-mean sub-Gaussian vectors with i.i.d. entries, and are independent of one another. Let σ²_h and σ²_n be the variances for the entries of h and n respectively. Without loss of generality, we assume that the columns of A are normalized, as we can always rescale A and h appropriately to obtain the same model.

Let Σ*_{y,y} be the true covariance matrix of y. From the independence assumptions, we have Σ*_{y,y} = S* + L*, where S* = σ²_n I is a diagonal matrix and L* = σ²_h AAᵀ has rank at most r.

In each step k, we obtain a sample y_k from the Bayesian network. For the square loss function f, we have the error E_k = y_k yᵀ_k − Σ*_{y,y}. Applying [Cor. 5.50][Vershynin, 2010], we have w.h.p.

||n_k nᵀ_k − σ²_n I||_2 = O(√p σ²_n),  ||h_k hᵀ_k − σ²_h I||_2 = O(√p σ²_h).   (4.4.6)

We thus have, with probability 1 − T e^{−cp},

||E_k||_2 ≤ O( √p (||A||²_2 σ²_h + σ²_n) ),  ∀k ≤ T.

When ||A||_2 is bounded, we obtain the optimal bound in Theorem 11, which matches the batch bound. If the entries of A are generically drawn (e.g., from a Gaussian distribution), we have ||A||_2 = O(1 + √(r/p)). Moreover, such generic matrices A are also "diffuse", and thus the low rank matrix L* satisfies Assumption A5 with α ≈ polylog(p). Intuitively, when A is generically drawn, there are diffuse connections from hidden to observed variables, and we have efficient guarantees under this setting.

Thus, our online method matches the batch guarantees for linear Bayesian networks when the entries of the observed vector y are conditionally independent given the latent variable h. When this assumption is violated, the above framework is no longer applicable, since the true covariance matrix Σ*_{y,y} is not composed of a sparse matrix.
To handle such models, we consider matrix decomposition of the inverse covariance or precision matrix M* := Σ*^{-1}_{y,y}, which can be expressed as a combination of sparse and low rank matrices for the class of latent Gaussian graphical models, described in Section 4.5.3. Note that the result cannot be applied directly in this case, as the loss function is not locally Lipschitz. Nevertheless, in Section 4.5.3 we show that we can take care of this problem.

4.5 Proof Ideas and Discussion

4.5.1 Proof Ideas for REASON 1

1. In general, it is not possible to establish error contraction for stochastic ADMM at the end of each step. We establish error contraction at the end of certain time epochs, and we impose different levels of regularization over different epochs. We perform an induction on the error: if the error at the end of the i-th epoch is ||θ̄(T_i) − θ*||²_2 ≤ cR²_i, we show that in the subsequent epoch it contracts as ||θ̄(T_{i+1}) − θ*||²_2 ≤ cR²_i/2 under appropriate choices of T_i, R_i and the other design parameters. This is possible when we establish feasibility of the optimal solution θ* in each epoch. Once this is established, it is straightforward to obtain the result in Theorem 3.

2. To show error contraction, we break the error ||θ̄(T_i) − θ*||_2 into two parts, viz., ||θ̄(T_i) − θ̂(T_i)||_2 and ||θ̂(T_i) − θ*||_2, where θ̂(T_i) is the optimal batch estimate over the i-th epoch. The first term, ||θ̄(T_i) − θ̂(T_i)||_2, is obtained along the lines of the analysis of stochastic ADMM, e.g., [Wang and Banerjee, 2013]. Nevertheless, our analysis differs from that of [Wang and Banerjee, 2013], as theirs is not a stochastic method, i.e., the sampling error is not considered.
The second termk ^ (T i ) k 2 is obtained by exploiting the local strong convexity properties of the loss function, on lines of [Agarwal et al., 2012b]. There are additional complications in our setting, since we have an auxiliary variable y for update of the regularization term. We relate the two variables through the dual variable, and use the fact that the dual variable is bounded. Note that this is a direct result from local Lipschitz property and it is proved in Lemma 8 in Appendix A.2.1. In fact, in order to prove the guarantees, we need bounded duality which is a weaker assumption than local Lipschitz property. We discuss this in Section 4.5.3. 3. For xed epoch length, the error shrinkage stops after some epochs but the error does not increase signicantly afterwards. Following lines of [Agarwal et al., 2012b], we prove that for this case the convergence rate is worse by a factor of logd. 4.5.2 Proof Ideas for REASON 2 We now provide a short overview of proof techniques for establishing the guarantees in Theorem 2. It builds on the proof techniques used for proving Theorem 1, but is 84 signicantly more involved since we now need to decouple the errors for sparse and low rank matrix estimation, and our ADMM method consists of multiple blocks. The main steps are as follows 1. It is convenient to dene W = [S;L] to merge the variables L and S into a single variable W , as in [Ma et al., 2012]. Let (W ) =kSk 1 + i i kLk , and A = [I;I]. The ADMM update for S and L in REASON 2, can now be rewritten as a single update for variable W . Consider the update W k+1 = arg min W f i (W ) + 2 kM k+1 AW 1 Z k k 2 F g: The above problem is not easy to solve as the S and L parts are coupled to- gether. 
Instead, we solve it inexactly through one step of a proximal gradient method as in [Ma et al., 2012] as arg min W f i (W ) + 2 k kW [W k + k A > (M k+1 AW k 1 Z k )]k 2 F g: (4.5.1) Since the two parts of W = [S;L] are separable in the quadratic part now, Equation (4.5.1) reduces to two decoupled updates on S and L as given by (4.4.2) and (4.4.3). 2. It is convenient to analyze theW update in Equation (4.5.1) to derive conver- gence rates for the online update in one time epoch. Once this is obtained, we also need error bounds for the batch procedure, and we employ the guarantees 85 from Agarwal et al. [2012a]. As in the previous setting of sparse optimization, we combine the two results to obtain an error bound for the online updates by considering multiple time epochs. It should be noted that we only require LSC condition onS+L and not jointly on S and L. This results in an additional higher order term when analyzing the epoch error and therefore does not play a role in the nal convergence bound. The LSC bound provides us with sum of sparse and low rank errors for each epoch. i.e.,k ^ S i S(T i ) + ^ L i L(T i )k 2 F . Next we need to decouple these errors. 3. An added diculty in the matrix decomposition problem is decoupling the errors for the sparse and low rank estimates. To this end, we impose norm constraints on the estimates of S and L, and carry them over from epoch to epoch. On the other hand, at the end of each epoch M is reset. These norm constraints allows us to control the error. Special care needs to be taken in many steps of the proof to carefully transform the various norm bounds, where a naive analysis would lead to worse scaling in the dimensionality p. We instead carefully project the error matrices on to on and o support ofS for the ` 1 norm term, and similarly onto the range and its complement of L for the nuclear norm term. This allows us to have a convergence rate with a s +r term, instead of p. 86 4. 
For fixed epoch length, the error shrinkage stops after some number of epochs, but the error does not increase significantly afterwards. Following the lines of [Agarwal et al., 2012b], we prove that for this case the convergence rate is worse by a factor of log p.

Thus, our careful analysis leads to tight guarantees for online matrix decomposition. For the proof outline and the detailed proof of Theorem 2, see Appendix A.3.1 and A.4, respectively.

4.5.3 Graphical Model Selection

Our framework cannot directly handle the case where the loss function is the log-likelihood objective. This is because for the log-likelihood function the Lipschitz constant can be large, and this leads to loose bounds on the error. Yet, as we discuss shortly, our analysis needs conditions weaker than the local Lipschitz property. We consider both settings, i.e., fully observed graphical models and latent Gaussian graphical models. We apply sparse optimization to the former and tackle the latter with sparse + low rank decomposition.

4.5.3.1 Sparse optimization for learning Gaussian graphical models

Consider a p-dimensional Gaussian random vector [x_1, ..., x_p]ᵀ with a sparse inverse covariance or precision matrix Θ*. Consider the ℓ1-regularized maximum likelihood estimator (batch estimate)

Θ̂ := arg min_{Θ≻0} { Tr(Σ̂ Θ) − log det Θ + λ_n ||Θ||_1 },   (4.5.2)

where Σ̂ is the empirical covariance matrix for the batch. This is a well-studied method for recovering the edge structure in a Gaussian graphical model, i.e., the sparsity pattern of Θ* [Ravikumar et al., 2011]. The loss function is strongly convex for all Θ within a ball.¹ However, the above loss function is not (locally) Lipschitz in general, since the gradient² ∇f(x; Θ) = xxᵀ − Θ^{-1} is not bounded in general. Thus, the bounds derived in Theorem 1 do not directly apply here. However, our conditions for recovery are somewhat weaker than the local Lipschitz property, and we provide guarantees for this setting under some additional constraints.
Let Γ* = Θ*^{-1} ⊗ Θ*^{-1} denote the Hessian of the log-determinant barrier at the true information matrix Θ*. Let Y_{(j,k)} := X_j X_k − E[X_j X_k], and note that Γ*_{(j,k),(l,m)} = E[Y_{(j,k)} Y_{(l,m)}] [Ravikumar et al., 2011]. A bound on |||Γ*|||_∞ limits the influence of the edges on each other, and we need this bound for guaranteed convergence. Yet, this bound contributes to a higher-order term and does not show up in the convergence rate.

Corollary 1. Under Assumptions A1, A2, when the initialization radius R_1 satisfies R_1 ≤ 0.25 ||Θ*||_F, under the negative log-likelihood loss function, REASON 1 has the following bound (for dual update step size τ = √T_0):

||θ̄_T − θ*||²_2 ≤ c_0 (s/γ²) · (k_T/(T log d)) · ( log d + σ²w² + 24 log(k_T / log d) ).

The proof does not follow directly from Theorem 1, since it does not utilize the Lipschitz property. However, the conditions for Theorem 1 to hold are weaker than the (local) Lipschitz property, and we utilize this to provide the above result. For the proof, see Appendix A.2.7. Note that in case the epoch length is not fixed and depends on the problem parameters, the bound can be improved by a log d factor.

Compared to Theorem 1, the local Lipschitz constant G⁴ is replaced by σ²|||Γ*|||². We have G = O(d), and thus we can obtain better bounds in the above result when |||Γ*||| is small and the initialization radius R_1 satisfies the above condition. Intuitively, the initialization condition (the constraint on R_1) depends on the strength of the correlations. In the weak-correlation case, we can initialize with larger error compared to the strongly correlated setting.

¹ Let Q = {Θ ∈ R^{n×n} : αI_n ⪯ Θ ⪯ βI_n}; then log det is strongly convex on Q with γ = 1/β² [d'Aspremont et al., 2008].
² The gradient computation can be expensive, since it involves computing the matrix inverse. However, efficient techniques for computing an approximate inverse can be employed, along the lines of [Hsieh et al., 2011].

[Figure 4.1: Graphical representation of a latent variable model, with hidden variables H = {h_1, h_2, h_3} connected to observed variables Y = {y_1, ..., y_4} through A, and observed-to-observed connections through B.]
4.5.3.2 Sparse + low rank decomposition for learning latent Gaussian graphical models

Consider the Bayesian network on the p-dimensional observed variables

y = Ah + By + n,  y, n ∈ R^p,  h ∈ R^r,   (4.5.3)

as in Figure 4.1, where h, y and n are drawn from a zero-mean multivariate Gaussian distribution. The vectors h and n are independent of one another, and n ~ N(0, σ²_n I). Assume that A has full column rank. Without loss of generality, we assume that A has normalized columns and that h has independent entries [Pitman and Ross, 2012]. For simplicity, let h ~ N(0, σ²_h I) (more generally, its covariance is a diagonal matrix). Note that the matrix B = 0 in the previous setting (the previous setting allows for more general sub-Gaussian distributions, while here we limit ourselves to the Gaussian distribution). For the model in (4.5.3), the precision matrix M* with respect to the marginal distribution on the observed vector y is given by

M* := Σ*^{-1}_{y,y} = M̃_{y,y} − M̃_{y,h} (M̃_{h,h})^{-1} M̃_{h,y},   (4.5.4)

where M̃ = Σ*^{-1}, and Σ* is the joint covariance matrix of the vectors y and h. It is easy to see that the second term in (4.5.4) has rank at most r. The first term in (4.5.4) is sparse under some natural constraints, viz., when the matrix B is sparse and there are a small number of colliders among the observed variables y. A triplet of variables consisting of two parents and their child in a Bayesian network is termed a collider. The presence of colliders results in additional edges when the Bayesian network on y and h is converted to an undirected graphical model, whose edges are given by the sparsity pattern of M̃_{y,y}, the first term in (4.5.4). Such a process is known as moralization [Lauritzen, 1996], and it involves introducing new edges between the parents in the directed graph (the graph of the Bayesian network), and removing the directions to obtain an undirected model.
Therefore, when the matrix B is sparse and there are a small number of colliders among the observed variables y, the resulting sub-matrix M̃_{y,y} is also sparse. We thus have the precision matrix M* in (4.5.4) as M* = S* + L*, where S* and L* are the sparse and low rank components. We can find this decomposition via regularized maximum likelihood. The batch estimate is given by Chandrasekaran et al. [2012]:

{Ŝ, L̂} := arg min { Tr(Σ̂_n M) − log det M + λ_n ||S||_1 + μ_n ||L||_* }   (4.5.5)
s.t. M = S + L.

This is a special case of (4.2.4) with the loss function f(M) = Tr(Σ̂_n M) − log det M. In this case, we have the error E_k = y_k yᵀ_k − (M*)^{-1}. Since y = (I − B)^{-1}(Ah + n), we have the following bound w.h.p.:

||E_k||_2 ≤ O( √p (||A||²_2 σ²_h + σ²_n) log(pT) / σ_min(I − B)² ),  ∀k ≤ T,

where σ_min(·) denotes the minimum singular value. The above result is obtained by alluding to (4.4.6). When ||A||_2 and σ_min(I − B) are bounded, we thus achieve the optimal scaling for our proposed online method. As discussed for the previous case, when A is generically drawn, ||A||_2 is bounded. To bound σ_min(I − B), a sufficient condition is walk-summability on the sub-graph among the observed variables y. The class of walk-summable models is efficient for inference [Malioutov et al., 2006] and structure learning [Anandkumar et al., 2012], and it contains the class of attractive models. Thus, it is perhaps not surprising that we obtain efficient guarantees for such models with our online algorithm.

We need to slightly change the algorithm REASON 2 for this scenario as follows: for the M-update in REASON 2, we add an ℓ1-norm constraint on M as ||M_k − S̃_i − L̃_i||²_1 ≤ R², and this can still be computed efficiently, since it involves projection onto the ℓ1-norm ball; see Appendix 4.7.1. We assume a good initialization M which satisfies ||M − M*||²_1 ≤ R². This ensures that M_k in subsequent steps is non-singular, and that the gradient of the loss function f in (4.5.5), which involves M^{-1}_k, can be computed.
As observed in Section 4.5.3.1 on sparse graphical model selection, the method can be made more efficient by computing approximate matrix inverses [Hsieh et al., 2013]. As observed before, the loss function f satisfies the local strong convexity property, and the guarantees in Theorem 2 are applicable.

There is another reason for using the ℓ1 bound. Note that the loss function is not generally Lipschitz in this case. However, our conditions for recovery are somewhat weaker than the local Lipschitz property, and we provide guarantees for this setting under some additional constraints. Let Γ* = M*^{-1} ⊗ M*^{-1}. As explained in Section 4.5.3.1, a bound on |||Γ*|||_∞ limits the influence of the edges on each other, and we need this bound for guaranteed convergence. Yet, this bound contributes to a higher-order term and does not show up in the convergence rate.

Corollary 2. Under Assumptions A1, A2, A4, A5, when the radius R satisfies R ≤ 0.25 ||M*||_F, under the negative log-likelihood loss function, REASON 2 has the following bound (for dual update step size τ = √T_0):

||S̄(T) − S*||²_F + ||L̄(T) − L*||²_F ≤ c_0 (s + r) · (k_T/(T log p)) · ( log p + σ²β²(p)(w² + log(k_T / log p)) ) + max{s + r, p} α²/p².

The proof does not follow directly from Theorem 2, since it does not utilize the Lipschitz property. However, the conditions for Theorem 2 to hold are weaker than the (local) Lipschitz property, and we utilize this to provide the above result. For the proof, see Appendix A.4.7. Note that in case the epoch length is not fixed and depends on the problem parameters, the bound can be improved by a log p factor.

4.6 Experiments

4.6.1 REASON 1

For the sparse optimization problem, we compare REASON 1 with RADAR and ST-ADMM under the least-squares regression setting. Samples (x_t, y_t) are generated such that x_t ∈ Unif[−B, B] and y_t = ⟨θ*, x_t⟩ + n_t. θ* is s-sparse with s = ⌈log d⌉, and n_t ~ N(0, σ²), with σ² = 0.5 in all cases. We consider d = 20, 2000, 20000 and s = 1, 3, 5 respectively. The experiments are performed on a 2.5 GHz Intel Core i5 laptop with 8 GB RAM.
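The synthetic data in this experiment can be generated as follows (a sketch: the uniform range B and the values of the nonzero entries of θ* are not reported in the text, so the choices below are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
d, s, sigma2 = 2000, 3, 0.5   # one of the (d, s) pairs used above
B = 1.0                        # assumed; the text leaves B unspecified

theta_star = np.zeros(d)
support = rng.choice(d, size=s, replace=False)
theta_star[support] = 1.0      # assumed nonzero values

def sample():
    # One data point: x_t ~ Unif[-B, B]^d, y_t = <theta*, x_t> + n_t,
    # with n_t ~ N(0, sigma2).
    x = rng.uniform(-B, B, size=d)
    y = x @ theta_star + rng.normal(0.0, np.sqrt(sigma2))
    return x, y
```

Each call to `sample()` yields one noisy observation, matching the one-sample-per-iteration access model assumed by the stochastic algorithms being compared.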
See Table 4.3 for the experiment results. It should be noted that RADAR is provided with knowledge of $\|\theta^*\|$ for epoch design and re-centering. In addition, both RADAR and REASON 1 have the same initial radius. Nevertheless, REASON 1 reaches better accuracy within the same run time, even for small time frames. In addition, we compare the relative error $\|\theta - \theta^*\|_2/\|\theta^*\|_2$ of REASON 1 and ST-ADMM in the first epoch. We observe that in higher dimension the error fluctuations for ADMM increase noticeably (see Figure 4.2). Therefore, the projections in REASON 1 play an important role in denoising and obtaining good accuracy.

[Figure 4.2: Least-squares regression, Error $= \|\theta - \theta^*\|_2/\|\theta^*\|_2$ vs. iteration number, $d_1 = 20$ and $d_2 = 20000$.]

Table 4.3: Least-squares regression problem, epoch size $T_i = 2000$, Error $= \|\theta - \theta^*\|_2/\|\theta^*\|_2$.

  Dimension / Run Time    Method       error at 0.02T   error at 0.2T   error at T
  d = 20000, T = 50 s     ST-ADMM      1.022            1.002           0.996
                          RADAR        0.116            2.10e-03        6.26e-05
                          REASON 1     1.5e-03          2.20e-04        1.07e-08
  d = 2000, T = 5 s       ST-ADMM      0.794            0.380           0.348
                          RADAR        0.103            4.80e-03        1.53e-04
                          REASON 1     0.001            2.26e-04        1.58e-08
  d = 20, T = 0.2 s       ST-ADMM      0.212            0.092           0.033
                          RADAR        0.531            4.70e-03        4.91e-04
                          REASON 1     0.100            2.02e-04        1.09e-08

Table 4.4: REASON 2 and inexact ALM, matrix decomposition problem, $p = 2000$, $\sigma^2 = 0.01$. Entries are $\|M - S - L\|_F/\|M\|_F$, $\|S - S^*\|_F/\|S^*\|_F$, and $\|L - L^*\|_F/\|L^*\|_F$ at each run time.

  Run Time       Method        M-error     S-error     L-error
  T = 50 s       REASON 2      2.20e-03    0.004       0.01
                 inexact ALM   5.11e-05    0.12        0.27
  T = 150 s      REASON 2      5.55e-05    1.50e-04    3.25e-04
                 inexact ALM   8.76e-09    0.12        0.27

Epoch size. For fixed epoch size, if the epoch size is designed such that the relative error defined above has shrunk to a stable value before we move to the next epoch, then the algorithm works as expected. If we choose a larger epoch than this value, we do not gain much in terms of accuracy at a specific iteration. On the other hand, if we use a small epoch size such that the relative error is still noticeable, this delays the error reduction and causes some local irregularities.
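The synthetic stream of Section 4.6.1 is straightforward to reproduce. The following sketch uses a hypothetical helper name and takes the sparsity level $s$ as an input (the thesis pairs $s = 1, 3, 5$ with $d = 20, 2000, 20000$):

```python
import numpy as np

def make_sparse_regression_data(d, s, T, B=1.0, sigma2=0.5, seed=0):
    """Sketch of the synthetic stream of Section 4.6.1:
    x_t ~ Unif[-B, B]^d, y_t = <theta*, x_t> + n_t with n_t ~ N(0, sigma2),
    and theta* s-sparse with a randomly chosen support."""
    rng = np.random.default_rng(seed)
    theta_star = np.zeros(d)
    support = rng.choice(d, size=s, replace=False)
    theta_star[support] = rng.standard_normal(s)
    X = rng.uniform(-B, B, size=(T, d))           # one row per sample x_t
    y = X @ theta_star + rng.normal(0.0, np.sqrt(sigma2), size=T)
    return X, y, theta_star
```

The reported metric is then the relative error $\|\theta - \theta^*\|_2/\|\theta^*\|_2$ of whichever solver consumes the stream.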
4.6.2 REASON 2

We compare REASON 2 with the state-of-the-art inexact ALM method for the matrix decomposition problem.³ In this problem $M$ is the noisy sample the algorithm receives. Since we have direct access to $M$, the M-update is eliminated. Table 4.4 shows that with equal time, inexact ALM reaches a smaller $\|M - S - L\|_F/\|M\|_F$ error, while in fact this does not provide a good decomposition. On the other hand, REASON 2 reaches useful individual errors in the same time frame. Experiments with $\sigma^2 \in [0.01, 1]$ reveal similar results. This emphasizes the importance of the projections in REASON 2. Further investigation of REASON 2 shows that performing only one of the projections (either $\ell_1$ or nuclear norm) suffices to reach this performance; addition of the second projection improves the performance only marginally. Nuclear-norm projections are much more expensive since they require an SVD. Therefore, it is more efficient to perform the $\ell_1$ projection. Similar experiments on exact ALM show worse performance than inexact ALM and are thus omitted.

4.7 Implementation

Here we discuss the updates for REASON 1 and REASON 2. Note that for any vector $v$, $v^{(j)}$ denotes its $j$-th entry.

³ALM codes are downloaded from http://perception.csl.illinois.edu/matrix-rank/home.html and the REASON 2 code is available at https://github.com/haniesedghi/REASON2.

4.7.1 Implementation details for REASON 1

Let us start with REASON 1. We have already provided closed-form solutions for $y$ and $z$. The update rule for $\theta$ can be written as

\[
\min_w \|w - v\|_2^2 \quad \text{s.t.}\quad \|w\|_1 \le R, \tag{4.7.1}
\]

where

\[
w = \theta - \tilde\theta_i, \qquad R = R_i, \qquad v = \frac{1}{\rho + \rho_x}\big[\rho(y_k - \tilde\theta_i) - \nabla f(\theta_k) + z_k + \rho_x(\theta_k - \tilde\theta_i)\big].
\]

We note that if $\|v\|_1 \le R$, the answer is $w = v$. Otherwise, the optimal solution lies on the boundary of the constraint set and we can replace the inequality constraint with $\|w\|_1 = R$. Similar to [Duchi et al., 2008], we perform Algorithm 4 to solve (4.7.1). The complexity of this algorithm is $O(d\log d)$, with $d = p^2$ when it is applied to the matrix case.
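The core of Algorithm 4, Euclidean projection onto the $\ell_1$ ball via sorting [Duchi et al., 2008], can be sketched as follows (a minimal NumPy version; the shift by the prox center $\tilde\theta_i$ is left out):

```python
import numpy as np

def project_l1_ball(v, R):
    """Euclidean projection of v onto {w : ||w||_1 <= R}, via the
    sort-and-threshold scheme of Duchi et al. [2008].
    Cost is O(d log d), dominated by the sort."""
    if np.abs(v).sum() <= R:
        return v.copy()                            # already feasible
    mu = np.sort(np.abs(v))[::-1]                  # mu_1 >= ... >= mu_d
    cssv = np.cumsum(mu)
    j = np.arange(1, v.size + 1)
    rho = np.max(j[mu - (cssv - R) / j > 0])       # largest feasible index
    nu = (cssv[rho - 1] - R) / rho                 # soft-threshold level
    return np.sign(v) * np.maximum(np.abs(v) - nu, 0.0)
```

The projected point is a soft-thresholding of $v$: entries shrink toward zero by the common level $\nu$, chosen so the result lands exactly on the $\ell_1$ sphere of radius $R$.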
4.7.2 Implementation details for REASON 2

For REASON 2, the update rules for $M$, $Z$, $Y$ and $U$ are in closed form. Consider the S-update. It can be written in the form of (4.7.1) as

\[
\min_W\; \lambda_i\|W + \widetilde S_i\|_1 + \frac{\rho_k}{2}\big\|W - (S_k + \tau_k G_{M_k} - \widetilde S_i)\big\|_F^2 \quad \text{s.t.}\quad \|W\|_1 \le R,
\]

where $W = S - \widetilde S_i$ and $R = R_i$. Therefore, similar to [Duchi et al., 2008], we generate a sequence $\{W^{(t)}\}_{t=1}^{t_s}$ via

\[
W^{(t+1)} = \Pi_1\Big(W^{(t)} - \eta_t\big[\lambda_i\,\nabla^{(t)}\|W^{(t)} + \widetilde S_i\|_1 + \rho_k\big(W^{(t)} - (S_k + \tau_k G_{M_k} - \widetilde S_i)\big)\big]\Big),
\]

where $\Pi_1$ is the projection onto the $\ell_1$-norm ball, as in Algorithm 4. In other words, at each iteration the vector $W^{(t)} - \eta_t[\lambda_i\nabla^{(t)}\|W^{(t)} + \widetilde S_i\|_1 + \rho_k(W^{(t)} - (S_k + \tau_k G_{M_k} - \widetilde S_i))]$ is the input to Algorithm 4 (instead of the vector $v$) and the output is $\mathrm{vector}(W^{(t+1)})$. The term $\nabla^{(t)}\|W^{(t)} + \widetilde S_i\|_1$ stands for a subgradient of the $\ell_1$ norm $\|W^{(t)} + \widetilde S_i\|_1$. The S-update is summarized in Algorithm 5. A step size of $\eta_t \propto 1/\sqrt{t}$ guarantees a convergence rate of $O(\sqrt{\log p/T})$ [Duchi et al., 2008].

The L-update is very similar in nature to the S-update. The only difference is that the projection is onto the nuclear-norm ball instead of the $\ell_1$-norm ball. It can be performed via an SVD followed by the $\ell_1$-norm projection applied to the vector of singular values.

The code for REASON 1 follows directly from the discussion in Section 4.7.1. For REASON 2, on the other hand, we have added heuristic modifications to improve the performance. The REASON 2 code is available at https://github.com/haniesedghi/REASON2. The first modification is that we do not update the dual variable $Z$ at every iteration on $S$ and $L$. Instead, we update the dual variable once $S$ and $L$ seem to have converged to some value, or after every $m$ iterations on $S$ and $L$. The reason is that at the start of an epoch $S$ and $L$ can be far from each other, which results in a large dual variable and hence slower convergence. The value of $m$ can be set based on the problem; for the experiments discussed here we have used $m = 4$.
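The projected-subgradient loop of Algorithm 5 can be sketched as below. This is a simplified version under assumed inputs: `target` stands for $S_k + \tau_k G_{M_k}$, matrices are flattened to vectors, and the $\ell_1$ projection repeats the Duchi et al. [2008] routine:

```python
import numpy as np

def project_l1_ball(v, R):
    """Duchi et al. [2008] projection onto the l1 ball of radius R."""
    if np.abs(v).sum() <= R:
        return v.copy()
    mu = np.sort(np.abs(v))[::-1]
    cssv = np.cumsum(mu)
    j = np.arange(1, v.size + 1)
    rho = np.max(j[mu - (cssv - R) / j > 0])
    nu = (cssv[rho - 1] - R) / rho
    return np.sign(v) * np.maximum(np.abs(v) - nu, 0.0)

def s_update(S_k, S_tilde, target, lam, rho_k, R, t_s=50):
    """Sketch of Algorithm 5: projected subgradient steps on
        lam*||W + S_tilde||_1 + (rho_k/2)*||W - (target - S_tilde)||_F^2
    over {||W||_1 <= R}, with W = S - S_tilde handled as a flat vector.
    Step size eta_t = 1/sqrt(t), as suggested by the convergence analysis."""
    W = (S_k - S_tilde).ravel()
    c = (target - S_tilde).ravel()
    s_tilde = S_tilde.ravel()
    for t in range(1, t_s + 1):
        eta = 1.0 / np.sqrt(t)
        subgrad = lam * np.sign(W + s_tilde) + rho_k * (W - c)
        W = project_l1_ball(W - eta * subgrad, R)
    return (W + s_tilde).reshape(S_tilde.shape)
```

As a sanity check, with `lam = 0` and a large radius the objective is a plain quadratic, so the loop should return `target`; with a small radius the output stays within $\ell_1$ distance $R$ of the prox center.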
Further investigation of REASON 2 shows that performing one of the projections (either $\ell_1$ or nuclear norm) suffices to reach this performance: the same precision can be reached using only one of the projections, and addition of the second projection improves the performance only marginally. Nuclear-norm projections are much more expensive since they require an SVD. Therefore, it is more efficient to perform the $\ell_1$ projection. In the code, we leave it as an option to run both projections.

4.8 Conclusion

In our work [Sedghi et al., 2014a,b], we consider a modified version of the stochastic ADMM method for high-dimensional problems. We first analyze the simple setting, where the optimization problem consists of a loss function and a single regularizer, and then extend to the multi-block setting with multiple regularizers and multiple variables. For the sparse optimization problem, we showed that we reach the minimax-optimal rate, which implies that our guarantee is unimprovable by any (batch or online) algorithm, up to constant factors. We then consider the problem of decomposing a matrix into sparse and low-rank components, and propose a modified version of the multi-block ADMM algorithm. Experiments show that for both the sparse optimization and the matrix decomposition problem, our algorithm outperforms the state-of-the-art methods; in particular, we reach higher accuracy with the same time complexity.

There are various future problems to consider. One is to provide lower bounds on the error for the matrix decomposition problem in the case of a strongly convex loss, if possible; Agarwal et al. [2012a] do not provide bounds for strongly convex functions. Another direction is to extend our method to address nonconvex programs. Loh and Wainwright [2013] and Wang et al. [2013c] show that if the problem is nonconvex but has additional properties, it can be solved by methods similar to those for convex programs.
In addition, we can extend our method to coordinate descent methods such as [Roux et al., 2012].

Algorithm 3 Regularized Epoch-based Admm for Stochastic Optimization in high-dimensioN 2 (REASON 2)

Input: $\rho, \rho_x > 0$, epoch length $T_0$, regularizers $\{\lambda_i, \mu_i\}_{i=1}^{k_T}$, initial prox centers $\widetilde S_1, \widetilde L_1$, initial radii $R_1, \widetilde R_1$.
Define $\mathrm{Shrink}_\kappa(a)$, the shrinkage operator in (4.3.3), and $G_{M_k} = M_{k+1} - S_k - L_k - \frac{1}{\rho}Z_k$.
for each epoch $i = 1, 2, \ldots, k_T$ do
    Initialize $S_0 = \widetilde S_i$, $L_0 = \widetilde L_i$, $M_0 = S_0 + L_0$
    for each iteration $k = 0, 1, \ldots, T_0 - 1$ do
        $M_{k+1} = \dfrac{-\nabla f(M_k) + Z_k + \rho(S_k + L_k) + \rho_x M_k}{\rho + \rho_x}$
        $S_{k+1} = \arg\min_{\|S - \widetilde S_i\|_1 \le R_i}\; \lambda_i\|S\|_1 + \frac{\rho_k}{2}\|S - (S_k + \tau_k G_{M_k})\|_F^2$
        $L_{k+1} = \arg\min_{\|L - \widetilde L_i\|_* \le \widetilde R_i}\; \mu_i\|L\|_* + \frac{\rho}{2}\|L - Y_k - U_k/\rho\|_F^2$
        $Y_{k+1} = \arg\min_{\|Y\|_\infty \le \alpha/p}\; \frac{\rho_k}{2}\|Y - (L_k + \tau_k G_{M_k})\|_F^2 + \frac{\rho}{2}\|L_{k+1} - Y - U_k/\rho\|_F^2$
        $Z_{k+1} = Z_k - \rho(M_{k+1} - (S_{k+1} + L_{k+1}))$
        $U_{k+1} = U_k - \rho(L_{k+1} - Y_{k+1})$
    end for
    Set $\widetilde S_{i+1} = \frac{1}{T_0}\sum_{k=0}^{T_0-1} S_k$ and $\widetilde L_{i+1} := \frac{1}{T_0}\sum_{k=0}^{T_0-1} L_k$
    if $R_i^2 > 2\big(s + r + \frac{(s+r)^2}{p^2}\big)\frac{\alpha^2}{p}$ then
        Update $R_{i+1}^2 = R_i^2/2$, $\widetilde R_{i+1}^2 = \widetilde R_i^2/2$
    else STOP
    end if
end for

Algorithm 4 Implementation of $\theta$-update

Input: a vector $v = \frac{1}{\rho+\rho_x}\big[\rho(y_k - \tilde\theta_i) - \nabla f(\theta_k) + z_k + \rho_x(\theta_k - \tilde\theta_i)\big]$ and a scalar $R = R_i > 0$
if $\|v\|_1 \le R$ then
    Output: $\theta = v + \tilde\theta_i$
else
    Sort $|v|$ into $\mu$: $\mu_1 \ge \mu_2 \ge \cdots \ge \mu_d$.
    Find $\rho^\star = \max\big\{j \in [d] : \mu_j - \frac{1}{j}\big(\sum_{i=1}^{j}\mu_i - R\big) > 0\big\}$.
    Define $\nu = \frac{1}{\rho^\star}\big(\sum_{i=1}^{\rho^\star}\mu_i - R\big)$.
    Output: $\theta$, where $\theta^{(j)} = \mathrm{sign}(v^{(j)})\max\{|v^{(j)}| - \nu,\, 0\} + \tilde\theta_i^{(j)}$
end if

Algorithm 5 Implementation of S-update

Input: $W^{(1)} = \mathrm{vector}(S_k - \widetilde S_i)$ and a scalar $R = R_i > 0$
for $t = 1$ to $t = t_s$ do
    $v = W^{(t)} - \eta_t\big[\lambda_i\nabla^{(t)}\|W^{(t)} + \mathrm{vector}(\widetilde S_i)\|_1 + \rho_k\big(W^{(t)} - \mathrm{vector}(S_k + \tau_k G_{M_k} - \widetilde S_i)\big)\big]$
    if $\|v\|_1 \le R$ then
        $W^{(t+1)} = v$
    else
        Sort $|v|$ into $\mu$: $\mu_1 \ge \mu_2 \ge \cdots \ge \mu_d$.
        Find $\rho^\star = \max\big\{j \in [d] : \mu_j - \frac{1}{j}\big(\sum_{i=1}^{j}\mu_i - R\big) > 0\big\}$.
        Define $\nu = \frac{1}{\rho^\star}\big(\sum_{i=1}^{\rho^\star}\mu_i - R\big)$.
        For $1 \le j \le d$: $W^{(t+1)}_{(j)} = \mathrm{sign}(v^{(j)})\max\{|v^{(j)}| - \nu,\, 0\}$
    end if
end for
Output: $\mathrm{matrix}(W^{(t_s)}) + \widetilde S_i$

Appendix A: Proofs

A.1 Guarantees for REASON 1

First, we provide guarantees for the theoretical case in which the epoch length depends on the epoch radius.
This provides intuition on how the algorithm is designed; the fixed-epoch algorithm is a special case of this general framework. We first state and prove guarantees for the general framework. Next, we leverage these results to prove Theorem 1. Let the design parameters be set as

\[
T_i = C\,\frac{s^2}{\gamma^2}\cdot\frac{\log d + 12\sigma_i^2\log(3/\delta_i)}{R_i^2}, \tag{A.1.1}
\]
\[
\lambda_i^2 = \frac{\gamma}{s\sqrt{T_i}}\sqrt{R_i^2\log d + \frac{G^2 R_i^2}{T_i} + \frac{\rho_x^2 R_i^4}{T_i} + \sigma_i^2 R_i^2\log(3/\delta_i)},
\qquad \rho \propto \frac{\sqrt{\log d}}{R_i}\sqrt{T_i}, \qquad \rho_x > 0, \qquad \tau = \rho.
\]

Theorem 3. Under assumptions A1–A3 and parameter settings (A.1.1), there exists a constant $c_0 > 0$ such that REASON 1 satisfies, for all $T > k_T$,

\[
\|\bar\theta_{k_T} - \theta^*\|_2^2 \le c_0\,\frac{s}{\gamma^2 T}\big(\log d + \sigma^2(w^2 + \log k_T)\big), \tag{A.1.2}
\]

with probability at least $1 - 6\exp(-w^2/12)$, where $k_T = \log_2\frac{\gamma^2 R_1^2 T}{s^2(\log d + 12\sigma^2\log(6/\delta))}$ and $c_0$ is a universal constant.

For the proof outline and the detailed proof of Theorem 3, see Appendix A.1.1 and Appendix A.2 respectively.

A.1.1 Proof outline for Theorem 3

The foundation block for this proof is Proposition 2.

Proposition 2. Suppose $f$ satisfies Assumptions A1, A2 with parameters $\gamma$ and $\sigma_i$ respectively, and assume that $\|\theta^* - \tilde\theta_i\|_1^2 \le R_i^2$. We apply the updates in REASON 1 with parameters as in (A.1.1). Then there exists a universal constant $c$ such that for any radius $R_i$,

\[
f(\bar\theta(T_i)) - f(\hat\theta_i) + \lambda_i\|\bar y(T_i)\|_1 - \lambda_i\|\hat\theta_i\|_1 \le \frac{R_i\sqrt{\log d}}{\sqrt{T_i}} + \frac{G R_i}{T_i} + \frac{\rho_x R_i^2}{T_i} + \frac{R_i\sigma_i}{\sqrt{T_i}}\sqrt{12\log(3/\delta_i)}, \tag{A.1.3a}
\]
\[
\|\bar\theta(T_i) - \theta^*\|_1^2 \le \frac{c\,\rho'}{\sqrt{C}}\,R_i^2, \tag{A.1.3b}
\]

where $\rho' = \rho_x + \rho$, and both bounds are valid with probability at least $1 - \delta_i$. Note that our proof for the epoch optimum improves on the proof of [Wang and Banerjee, 2013] with respect to $\rho_x$; for details, see Section A.2.1.

In order to prove Proposition 2, we need to prove some more lemmas. Going forward, note the following notation: $\Delta_i^* = \hat\theta_i - \theta^*$ and $\hat\Delta(T_i) = \bar\theta_i - \hat\theta_i$.

Lemma 4. At epoch $i$, assume that $\|\theta^* - \tilde\theta_i\|_1 \le R_i$. Then the error $\Delta_i^*$ satisfies the bounds

\[
\|\hat\theta_i - \theta^*\|_2 \le 4\sqrt{s}\,\frac{\lambda_i}{\gamma}, \tag{A.1.4a}
\qquad
\|\hat\theta_i - \theta^*\|_1 \le 8s\,\frac{\lambda_i}{\gamma}. \tag{A.1.4b}
\]

Lemma 5.
Under the conditions of Proposition 2 and with parameter settings (A.1.1), we have

\[
\|\hat\Delta(T_i)\|_2^2 \le \frac{c\,\rho'}{\sqrt{C}}\cdot\frac{R_i^2}{s},
\]

with probability at least $1 - \delta_i$.

A.2 Proof of Theorem 3

The first step is to ensure that $\|\theta^* - \tilde\theta_i\|_1 \le R_i$ holds at each epoch, so that Proposition 2 can be applied in a recursive manner. We prove this by induction on the epoch index. By construction, this bound holds at the first epoch. Assume that it holds for epoch $i$. Recall that $T_i$ is defined by (A.1.1), where $C$ is a constant we can choose. Substituting this $T_i$ in inequality (A.1.3b), the simplified bound (A.1.3b) yields

\[
\|\bar\theta(T_i) - \theta^*\|_1^2 \le \frac{c\,\rho'}{\sqrt{C}}\,R_i^2.
\]

Thus, by choosing $C$ sufficiently large, we can ensure that $\|\bar\theta(T_i) - \theta^*\|_1^2 \le R_i^2/2 := R_{i+1}^2$. Consequently, if $\theta^*$ is feasible at epoch $i$, it stays feasible at epoch $i+1$. Hence, by induction we are guaranteed the feasibility of $\theta^*$ throughout the run of the algorithm. As a result, Lemma 5 applies and we find that

\[
\|\hat\Delta(T_i)\|_2^2 \le c\,\frac{R_i^2}{s}. \tag{A.2.1}
\]

We have now bounded $\hat\Delta(T_i) = \bar\theta(T_i) - \hat\theta_i$, and Lemma 4 provides a bound on $\Delta_i^* = \hat\theta_i - \theta^*$, so that the error $\bar\theta(T_i) - \theta^*$ can be controlled by the triangle inequality. In particular, combining (A.1.4a) with (A.2.1) gives

\[
\|\bar\theta(T_i) - \theta^*\|_2^2 \le c\Big\{\frac{1}{s}R_i^2 + \frac{16}{s}R_i^2\Big\}, \quad\text{i.e.}\quad \|\bar\theta(T_i) - \theta^*\|_2^2 \le c\,\frac{R_1^2\,2^{-(i-1)}}{s}. \tag{A.2.2}
\]

The bound holds with probability at least $1 - 3\exp(-w_i^2/12)$. Recall that $R_i^2 = R_1^2\,2^{-(i-1)}$. Since $w_i^2 = w^2 + 24\log i$, we can apply a union bound to simplify the error probability to $1 - 6\exp(-w^2/12)$. Throughout this report we use $\delta_i = 3\exp(-w_i^2/12)$ and $\delta = 6\exp(-w^2/12)$ to simplify the equations.

To complete the proof we need to convert the error bound (A.2.2) from its dependence on the number of epochs $k_T$ to the number of iterations needed to complete $k_T$ epochs, i.e. $T(k) = \sum_{i=1}^{k} T_i$. Note that here we use $T_i$ from (A.2.8) to show that, when considering the dominant terms, the definition in (A.1.1) suffices; here one can see how the negligible terms are dropped.
T (k) = k X i=1 C s 2 2 logd + 12 2 i log(3= i ) R 2 i + s G R i + s x =C k X i=1 s 2 flogd + =sG + 2 (w 2 + 24 logk)g2 i1 2 R 2 1 + sG R 1 p 2 i1 + s x : Hence, T (k)C s 2 2 R 2 1 flogd + 2 (w 2 + 24 logk)g2 k + s R 1 G p 2 k + s x : T (k)S(k), therefore k T S 1 (T ). S(k) =C s 2 2 R 2 1 flogd + 2 (w 2 + 24 logk)g2 k + s R 1 G p 2 k + s x : Ignoring the dominated terms and using a rst order approximation for log(a +b), log(T )' logC +k T + log s 2 2 R 2 1 flogd + 2 (w 2 + 24 logk)g ; k T ' logT logC log s 2 2 R 2 1 flogd + 2 (w 2 + 24 logk)g : Therefore, 2 k T = Cs 2 2 TR 2 1 flogd + 2 (w 2 + 24 logk)g: 109 Putting this back into (A.2.2), we get that k (T i )k 2 2 c R 2 1 s Cs 2 2 TR 2 1 flogd + 2 (w 2 + 24 logk)g c s 2 T flogd + 2 (w 2 + 24 logk)g: Using the denition = 6 exp(w 2 =12), above bound holds with probability 1. Simplifying the error in terms of by replacingw 2 with 12 log(6=), gives us (A.1.2). A.2.1 Proofs for Convergence within a Single Epoch for Algorithm 2 Lemma 6. For (T i ) dened in Algorithm 2 and ^ i the optimal value for epoch i, let =c 1 p T i , x some positive constant, 0 = + x and = where c 1 = p logd R i . We have that f( (T i ))f( ^ i ) + i k y(T i )k 1 i k ^ i k 1 (A.2.3) R i p logd p T i + GR i T i + x R 2 i T i + P T i k=1 he k ; ^ i k i T i : Remark : Please note that as opposed to [Wang and Banerjee, 2013] we do not require x / p T i . We show that our parameter setting also works. Proof. First we show that our update rule for is equivalent to not linearizingf and using another Bregman divergence. This helps us in nding a better upper bound 110 on error that does not require bounding the subgradient. Note that linearization does not change the nature of analysis. The reason is that we can deneB f (; k ) = f()f( k )+hrf( k ); k i, which meansf()B f (; k ) =f( k )+hrf( k ); k i. 
Therefore, arg min k ~ i k 2 1 R 2 i fhrf( k ); k ig = arg min k ~ i k 2 1 R 2 i ff()B f (; k )g: As a result, we can write down the update rule of in REASON 1 as k+1 = arg min k ~ i k 2 1 R 2 i ff()B f (; k ) +z T k (y k ) +B (;y k ) + x B 0 x (; k )g: We also have thatB x (; k ) =B 0 x (; k ) 1 x B f (; k ), which simplies the update rule to k+1 = arg min k ~ i k 2 1 R 2 i ff() +hz k ;y k i +B (;y k ) + x B x (; k )g: (A.2.4) We notice that equation (A.2.4) is equivalent to Equation (7) [Wang and Banerjee, 2013]. Note that as opposed to [Wang and Banerjee, 2013], in our setting x can be set as a constant. Therefore, for completeness we provide proof of convergence and the convergence rate for our setting. 111 Lemma 7. Convergence of REASON 1: The optimization problem dened in REA- SON 1 converges. Proof. On lines of [Wang and Banerjee, 2013], let R(k + 1) stand for residuals of optimality condition. For convergence we need to show that lim k!1 R(k + 1) = 0. Let w k = ( k ;y k ;z k ). Dene D(w ;w k ) = 1 kz z k k 2 2 +B (y ;y k ) + x B ( ; k ): By Lemma 2 Wang and Banerjee [2013] R(t + 1)D(w ;w k )D(w ;w k+1 ): Therefore, 1 X k=1 R(t + 1)D(w ;w 0 ) = 1 kz k 2 2 +B (y ;y 0 ) + x B ( ; 0 ) lim T!1 R 2 i logd T krf( )k 2 2 + 2R 2 i + x p T logd R 3 i : Therefore, lim k!1 R(k + 1) = 0 and the algorithm converges. 112 If in addition we incorporate sampling error, then Lemma 1 [Wang and Banerjee, 2013] changes to f( k+1 )f( ^ i ) + i ky k+1 k 1 i k ^ i k 1 hz k ; k+1 y k+1 i 2 fk k+1 y k k 2 2 +k k+1 y k+1 k 2 2 g +he k ; ^ i k i + 2 fk ^ i y k k 2 2 k ^ i y k+1 k 2 2 g + x fB x ( ^ i ; k )B x ( ^ i ; k+1 ) B x ( k+1 ; k )g: The above result follows from convexity off, the update rule for (Equation (A.2.4)) and the three point property of Bregman divergence. Next, we show the bound on the dual variable. Lemma 8. The dual variable in REASON 1 is bounded. i.e., kz k k 1 G + 2 0 R i ; where 0 := x +: Proof. 
Considering the update rule for , we have the Lagrangian L =f() +hz k ;y k i +B (;y k ) + x B x (; k ) + k k+1 ~ i k 1 R i ; where is the Lagrange multiplier corresponding to the ` 1 bound. We hereby emphasize that does not play a role in size of the dual variable. i.e., considering the ` 1 constraint, three cases are possible: 113 1.k k+1 ~ i k 1 >R i . By complementary slackness, = 0. 2.k k+1 ~ i k 1 <R i . By complementary slackness, = 0. 3.k k+1 ~ i k 1 = R i . This case is equivalent to the non-constrained update and no projection will take place. Therefore, z will be the same as in the non-constrained update. Having above analysis in mind, the upper bound on the dual variable can be found as follows By optimality condition on k+1 , we have z k =rf( k+1 ) + x ( k+1 k ) +( k+1 y k ): (A.2.5) By denition of the dual variable and the fact that =, we have that z k =z k1 ( k y k ) Hence, we have thatz k1 =rf( k+1 ) + ( x +)( k+1 k ). Therefore, kz k1 k 1 G + 2 0 R i ; where 0 := x +: It is easy to see that this is true for all z k at each epoch. 114 Consequently, 1 hz k ;z k z k+1 i = 1 h0z k ;z k z k+1 i = 1 2 kz k+1 k 2 kz k k 2 kz k+1 z k k 2 : Ignoring the negative term in the upper bound and noting z 0 = 0, we get 1 T i T i X k=1 hz k ; k+1 y k+1 i 1 2T i kz T i k 2 1 2T i (G + 2 0 R i ) 2 ' R i p logd p T i + GR i T i : Note that since we consider the dominating terms in the nal bound, terms with higher powers ofT i can be ignored throughout the proof. Next, following the same approach as in Theorem 4 [Wang and Banerjee, 2013] and considering the sampling error, we get, f( (T i ))f( ^ i ) + i k y(T i )k 1 i k ^ i k 1 R i p logd p T i + GR i T i + c 1 p T i k ^ i y 0 k 2 2 + x T i B x ( ^ i ; 0 ) + 1 T i T i X k=1 he k ; ^ i k i: 115 We have 0 =y 0 = ~ i and z 0 = 0. Moreover, B x (; k ) =B 0 x (; k ) 1 x B f (; k ). 
Therefore, f( (T i ))f( ^ i ) + i k y(T i )k 1 i k ^ i k 1 R i p logd p T i + GR i T i + c 1 p T i k ^ i ~ i k 2 2 + x T i fB 0 x ( ^ i ; ~ i )B f ( ^ i ; ~ i )g+ T i X k=1 he k ; ^ i k i R i p logd p T i + GR i T i + p logd R i p T i k ^ i ~ i k 2 2 + x T i B 0 x ( ^ i ; ~ i ) + T i X k=1 he k ; ^ i k i: We note that x B 0 x ( ^ i ; ~ i ) = x 2 k ^ i ~ i k 2 2 . Considering the ` 2 terms, remember that for any vector x, if s > r > 0 then kxk s kxk r . Therefore, p logd R i k ^ i ~ i k 2 2 p logd R i k ^ i ~ i k 2 1 p logd R i R 2 i =R i p logd: A.2.2 Proof of Proposition 2: Inequality (A.1.3a) Note the shorthand e k = ^ g k rf( k ), where ^ g k stands for empirically calculated subgradient of f( k ). 116 From Lemma 6, we have that f( (T i ))f( ^ i ) + i k y(T i )k 1 i k ^ i k 1 R i p logd p T i + GR i T i + x R 2 i T i + P T i k=1 he k ; ^ i k i T i : Using Lemma 7 from [Agarwal et al., 2012b], we have that f( (T i ))f( ^ i ) + i k y(T i )k 1 i k ^ i k 1 R i p logd p T i + GR i T i + x R 2 i T i + R i i w i p T i = R i p logd p T i + GR i T i + x R 2 i T i + R i i p T i p 12 log(3= i ): with probability at least 1 i . In the last equality we use i = 3 exp(w 2 i =12). A.2.3 Proof of Lemma 4 Proof follows the same approach as Lemma 1 [Agarwal et al., 2012b]. Note that since we assume exact sparsity the termk S ck 1 is zero for our case and is thus eliminated. Needless to say, it is an straightforward generalization to consider approximate sparsity from this point. 117 A.2.4 Proof of Lemma 5 Using LSC assumption and the fact that ^ i minimizes f() +kk 1 , we have that 2 k ^ (T i )k 2 2 f( (T i ))f( ^ (T i )) + i (k y(T i )k 1 k ^ i k 1 ) R i p logd p T i + GR i T i + x R 2 i T i + R i i p T i r 12 log 3 i ; with probability at least 1 i . A.2.5 Proof of Proposition 2: Inequality (A.1.3b) Throughout the proof, let (T i ) = i and ^ (T i ) = i ^ i , we have that (T i ) ^ (T i ) = ^ i . 
Now we want to convert the error bound in (A.1.3a) from function values into ` 1 and ` 2 -norm bounds by exploiting the sparsity of . Since the error bound in (A.1.3a) holds for the minimizer ^ i , it also holds for any other feasible vector. In particluar, applying it to leads to, f( (T i ))f( ) + i k y(T i )k 1 i k k 1 R i p logd p T i + GR i T i + x R 2 i T i + R i i p T i r 12 log 3 i ; with probability at least 1 i . 118 For the next step, we nd a lower bound on the left hand side of this inequality. f( (T i ))f( ) + i k y(T i )k 1 i k k 1 f( )f( ) + i k y(T i )k 1 i k k 1 = i k y(T i )k 1 i k k 1 ; where the rst inequality results from the fact that optimizes f(). Thus, k y(T i )k 1 k k 1 + R i p logd i p T i + GR i i T i + x R 2 i i T i + R i i i p T i r 12 log 3 i : Now we need a bound onk (T i ) y(T i )k 1 , we have k (T i ) y(T i )k 1 =k 1 T i T i 1 X k=0 ( k y k )k 1 =k 1 T i T i 1 X k=0 (z k+1 z k )k 1 = 1 T i kz T i k 1 G + 2 0 R i T i = GR i T i p T i p logd + R i T i : By triangle inequality k (T i )k 1 k y(T i )k 1 k (T i ) y(T i )k 1 ; 119 Hence, after ignoring the dominated terms, k (T i )k 1 k k 1 + R i p logd i p T i + GR i i T i + x R 2 i i T i + R i i i p T i p 12 log(3= i ) + R i T i : By Lemma 6 in [Agarwal et al., 2012b], k (T i ) S ck 1 k (T i ) S k 1 + R i p logd i p T i + GR i i T i + x R 2 i i T i + R i i i p T i p 12 log(3= i ) + R i T i : with probability at least 1 3 exp(w 2 i =12). We have (T i ) ^ (T i ) = ^ i . 
Therefore, k ^ i k 1 = k S (T i ) ^ S (T i )k 1 +k S c(T i ) ^ S c(T i )k 1 fk S (T i )k 1 k ^ S (T i )k 1 gfk S c(T i )k 1 k ^ S c(T i )k 1 g: Consequently, k ^ S c(T i )k 1 k ^ S (T i )k 1 k S c(T i )k 1 k S (T i )k 1 +k ^ i k 1 : Using Equation (A.1.4b), we get k ^ S c(T i )k 1 k ^ S (T i )k 1 + 8s i + R i p logd i p T i + GR i i T i + x R 2 i i T i + R i i i p T i p 12 log(3= i ) + R i T i : 120 Hence, further use of the inequalityk ^ S (T i )k 1 p sk ^ (T i )k 2 allows us to conclude that there exists a universal constant c such that k ^ (T i )k 2 1 4sk ^ (T i )k 2 2 +c " s 2 2 i 2 + R 2 i logd 2 i T i + G 2 R 2 i 2 i T 2 i + 2 x R 4 i 2 i T 2 i + 12R 2 i 2 i log( 3 i ) T i 2 i + R 2 i T 2 i # ; (A.2.6) with probability at least 1 i . Optimizing the above bound with choice of i gives us (A.1.1). From here on all equations hold with probability at least 1 i , we have k ^ (T i )k 2 1 8s h f( (T i ))f( ^ (T i )) + i (k Y (T i )k 1 k ^ i k 1 ) i + 2cs p T i R i p logd + GR i p T i + x R 2 i p T i +R i i r 12 log( 3 i ) + R 2 i T 2 i : Thus, for some other c, we have that k ^ (T i )k 2 1 c s R i p logd p T i + GR i T i + x R 2 i T i + R i i p T i r 12 log( 3 i ) + R 2 i T 2 i : (A.2.7) 121 Combining the above inequality with error bound (A.1.4b) for ^ i and using triangle inequality leads to k (T i )k 2 1 2k ^ (T i )k 2 1 + 2k ^ i k 2 1 2k ^ (T i )k 2 1 + 64 2 s 2 2 i c 0 s R i p logd p T i + GR i T i + x R 2 i T i + R i i p T i r 12 log 3 i + R 2 i T 2 i : Finally, in order to use (T i ) as the next prox center ~ i+1 , we would also like to control the errork (T i ) ^ i+1 k 2 1 . Since i+1 i by assumption, we obtain the same form of error bound as in (A.2.7). We want to run the epoch till all these error terms drop to R 2 i+1 :=R 2 i =2. Therefore, we set the epoch length T i to ensure that. All above conditions are met if we choose the epoch length T i =C s 2 2 logd + 12 2 i log(3= i ) R 2 i + sG R i + s x ; (A.2.8) for a suitably large universal constantC. 
Note that since we consider the dominat- ing terms in the nal bound, the last two terms can be ignored. By design of T i , we have that k (T i )k 2 1 c 0 p C R 2 i ; which completes this proof. 122 A.2.6 Proof of Guarantees with Fixed Epoch Length, Sparse Case This is a special case of Theorem 3 (Appendix). The key dierence between this case and optimal epoch length setting of Theorem 3 is that in the latter we guaran- teed error halving by the end of each epoch whereas with xed epoch length that statement may not be possible after the number of epochs becomes large enough. Therefore, we need to show that in such case the error does not increase much to invalidate our analysis. Let k be the epoch number such that error halving holds true until then. Next we demonstrate that error does not increase much fork>k . Given a xed epoch length T 0 =O(logd), we dene k := sup ( i : 2 j=2+1 cR 1 s s T 0 logd + 2 i w 2 for all epochs ji ) ; (A.2.9) where w = log(6=). First we show that if we run REASON 1 with xed epoch lengthT 0 it has error halving behavior for the rst k epochs. Lemma 9. For T 0 =O(logd) and k as in (A.2.9), we have k ~ k k 1 R k and k ~ k k k 1 R k for all 1kk + 1: 123 with probability at least 1 3k exp(w 2 =12). Under the same conditions, there exists a universal constant c such that k ~ k k 2 c R k p s and k ~ k k k 2 c R k p s for all 2kk + 1: Next, we analyze the behavior of REASON 1 after the rstk epochs. Since we cannot guarantee error halving, we can also not guarantee that remains feasible at later epochs. We use Lemma 10 to control the error after the rst k epochs. Lemma 10. Suppose that Assumptions A1A3 in the main text are satised at epochs i = 1; 2;::: . 
Assume that at some epoch k, the epoch center ~ k satises the boundk ~ k k 2 c 1 R k = p s and that for all epochsjk, the epoch lengths satisfy the bounds s s logd + 2 i w 2 i T j R k 2 and logd T i c 2 : Then for all epochsjk, we have the error boundkq j k 2 2 c 2 R 2 k s with probability at least 1 3 P j i=k+1 exp(w 2 i =12). In order to check the condition on epoch length in Lemma 10, we notice that with k as in (A.2.9), we have c s s logd + 2 i w 2 T 0 R 1 2 k =21 = R k +1 2 : 124 Since we assume that constants k are decreasing ink, the inequality also holds for kk + 1, therefore Lemma 10 applies in this setting. The setting of epoch length in Theorem 1 ensures that the total number of epochs we perform is k 0 = log R 1 s s T logd + 2 w 2 ! : Now we have two possibilities. Either k 0 k or k 0 k . In the former, Lemma 9 ensures that the error boundk ~ k 0 k 2 2 cR 2 k 0 =s. In the latter case, we use Lemma 10 and get the error bound cR 2 k =s. Substituting values of k 0 , k in these bounds completes the proof. Proof of Lemma 9 and Lemma 10 follows directly from that of Lemma 5 and Lemma 3 in [Agarwal et al., 2012b]. A.2.7 Proof of Guarantees for Sparse Graphical Model selection Problem Here we prove Corollary 1. According to C.1, in order to prove guarantees, we rst need to boundkz k+1 z k k 1 andkz k k 1 . According to Equation (A.2.5) and considering the imposed ` 1 bound, this is equivalent to boundkg k+1 g k k 1 and kg k k 1 . The rest of the proof follows on lines of Theorem 1 proof. On the other hand, Lipschitz property requires a bound onkg k k 1 , which is much more stringent. 125 Assuming we are in a close proximity of , we can use Taylor approximation to locally approximate 1 by 1 as in [Ravikumar et al., 2011] 1 = 1 1 1 +R(); where = andR() is the remainder term. We have kg k+1 g k k 1 jjj jjj 1 k k+1 k k 1 ; and kg k k 1 kg k E(g k )k 1 +kE(g k )k 1 ke k k 1 +k 1 k k 1 +k k 1 k k+1 k k 1 : The termk k+1 k k 1 is bounded by 2R i by construction. 
We assumejjj jjj 1 and k k 1 are bounded. The error needs to be \small enough" for theR() to be negligible, and we now provide the conditions for this. By denition,R() = P 1 k=2 (1) k ( 1 ) k 1 . Using triangle inequality and sub-multiplicative property for Frobenious norm, kR()k F k 1 k F k 1 k 2 F 1k 1 k F : 126 Forkk F 2R i 0:5 k 1 k F , we get kR()k F k 1 k F : We assumek k F is bounded. Note thatfR i g k T i=1 is a decreasing sequence and we only need to bound R 1 . Therefore, if the variables are closely-related we need to start with a small R 1 . For weaker correlations, we can start in a bigger ball. The rest of the proof follows the lines of proof for Theorem 3, by replacing G 2 byjjj jjj 1 R i ( +k k 1 R i ). Ignoring the higher order terms gives us Corollary 1. A.3 Guarantees for REASON 2 First, we provide guarantees for the theoretical case such that epoch length depends on epoch radius. This provides intuition on how the algorithm is designed. The xed-epoch algorithm is a special case of this general framework. We rst state and 127 prove guarantees for general framework. Next, we leverage these results to prove Theorem 1. Let the design parameters be set as T i 'C (s +r + s +r ) 2 logp + 2 (p) 2 i log(6= i )+ R 2 i + (s +r + s +r ) G R i + x ; (A.3.1) 2 i = (s +r) p T i s (R 2 i + ~ R 2 i )logp + G 2 (R 2 i + ~ R 2 i ) T i + 2 (p)(R 2 i + ~ R 2 i ) 2 i log 3 i + x (R 2 i + ~ R 2 i ) T i + 2 p 2 + 2 (p) 2 T i logp + log 1 i ; 2 i =c 2 i ; / s T i logp R 2 i + ~ R 2 i ; x > 0; =: Theorem 11. Under assumptions A2A6 and parameter settings as in (A.3.1), there exists a constant c 0 > 0 such that REASON 2 satises the following for all T >k T , k S(T )S k 2 F +k L(T )L k 2 F c 0 (s +r) T logp + 2 (p) 2 w 2 + logk T + 1 + s +r 2 p 2 p : with probability at least 1 6 exp(w 2 =12) and k T ' log (s +r) 2 2 R 2 1 T logp + 2 (p) 2 w 2 : 128 For Proof outline and detailed proof of Theorem 11 see Appendix A.3.1 and A.4 respectively. 
A.3.1 Proof outline for Theorem 11 The foundation block for this proof is Proposition 3. Proposition 3. Suppose f satises Assumptions A1A6 with parameters and i respectively and assume thatkS ~ S i k 2 1 R 2 i ,kL ~ L i k 2 1 ~ R 2 i . We apply the updates in REASON 2 with parameters as in (A.3.1). Then, there exists a universal constant c such that for any radius R i ; ~ R i , ~ R i =c r R i ; 0c r 1, f( M(T i )) + i ( W (T i ))f( ^ M i ) i ( ^ W (T i )) (A.3.2a) s R 2 i + ~ R 2 i T i p logp + R 2 i + ~ R 2 i T i x + G q R 2 i + ~ R 2 i T i + (p)(R i + ~ R i ) i q 12 log 3 i p T i + (p)G(R i + ~ R i ) i q 12 log 3 i T i p logp ; k S(T i )S k 2 1 c 0 p C R 2 i +c(s +r + (s +r) 2 p 2 ) 2 p ; (A.3.2b) k L(T i )L k 2 c 0 p C 1 1 + R 2 i +c (s +r) 2 p 2 2 p : where both bounds are valid with probability at least 1 i . In order to prove Proposition 3, we need two more lemmas. To move forward, we use the following notations: (T i ) = ^ S i S + ^ L i L , (T i ) = S(T i )S + L(T i )L and ^ (T i ) = S i ^ S i + L i ^ L i . In addition 129 S (T i ) = ^ S i S , with alike notations for L (T i ). For on and o support part of (T i ), we use ((T i )) supp and ((T i )) supp c. Lemma 12. At epoch i assume thatkS ~ Sk 2 1 R 2 i ,kL ~ Lk 2 1 ~ R 2 i . Then the errors S (T i ); L (T i ) satisfy the bound k ^ S i S k 2 F +k ^ L i L k 2 F cfs 2 i 2 +r 2 i 2 g: Lemma 13. Under the conditions of Proposition 3 and with parameter settings (A.3.1), (A.3.1), we have k ^ S i S(T i )k 2 F +k ^ L i L(T i )k 2 F 2 0 @ s R 2 i + ~ R 2 i T i p logp + R 2 i + ~ R 2 i T i x + G q R 2 i + ~ R 2 i T i + (p)(R i + ~ R i ) i p 12 log(3= i ) p T i + (p)G(R i + ~ R i ) i p 12 log(3= i ) T i p logp ! + ( 2 p p + p T i ) 2 ; with probability at least 1 i . A.4 Proof of Theorem 11 The rst step is to ensure thatkS ~ S i k 2 1 R 2 i ,kL ~ L i k 2 1 ~ R 2 i holds at each epoch so that Proposition 3 can be applied in a recursive manner. 
We prove this in the same manner as Theorem 1, by induction on the epoch index. By construction, the bound holds at the first epoch. Assume that it holds for epoch \(i\). Recall that \(T_i\) is defined by (A.3.1), where \(C\ge 1\) is a constant we can choose. By substituting this \(T_i\) into inequality (A.3.2b), the simplified bound (A.3.2b) further yields
\[
\lVert\Delta_S(T_i)\rVert_1^2\le\frac{c_0}{\sqrt C}R_i^2
+c\Big(s+r+\frac{(s+r)^2}{\gamma^2 p^2}\Big)\frac{\alpha^2}{p}.
\]
Thus, by choosing \(C\) sufficiently large, we can ensure that \(\lVert\bar S(T_i)-S^*\rVert_1^2\le R_i^2/2:=R_{i+1}^2\). Consequently, if \(S^*\) is feasible at epoch \(i\), it stays feasible at epoch \(i+1\). Hence, we have guaranteed the feasibility of \(S^*\) throughout the run of the algorithm by induction. As a result, Lemmas 12 and 13 apply, and for \(\tilde R_i=c_r R_i\) we find that
\[
\lVert\Delta_S(T_i)\rVert_F^2\le\frac{1}{s+r}R_i^2+\Big(1+\frac{s+r}{\gamma^2 p}\Big)\frac{\alpha^2}{p}.
\]
The bound holds with probability at least \(1-3\exp(-w_i^2/12)\); the same is true for \(\lVert\Delta_L(T_i)\rVert_F^2\). Recall that \(R_i^2=R_1^2\,2^{-(i-1)}\). Since \(w_i^2=w^2+24\log i\), we can apply a union bound to simplify the error probability to \(1-6\exp(-w^2/12)\). Letting \(\delta=6\exp(-w^2/12)\), we write the bound in terms of \(\delta\), using \(w^2=12\log(6/\delta)\).

Next we convert the error bound from its dependence on the number of epochs \(k_T\) to the number of iterations needed to complete \(k_T\) epochs, i.e., \(T(k_T)=\sum_{i=1}^{k_T}T_i\). Using the same approach as in the proof of Theorem 3, we get
\[
k_T\simeq\log\frac{\gamma^2 R_1^2\,T}{\big(s+r+(s+r)/\gamma\big)^2\big[\log p+12\sigma^2(p)^2 w^2\big]}.
\]
As a result,
\[
\lVert\Delta_S(T_i)\rVert_F^2\le\frac{C(s+r)}{\gamma^2 T}\Big[\log p+\sigma^2(p)^2\big(w^2+\log k_T\big)\Big]+\frac{\alpha^2}{p}.
\]
For the low-rank part, we proved feasibility in the proof of Equation (A.3.2b); consequently, the same bound holds for \(\lVert\Delta_L(T_i)\rVert_F^2\).

A.4.1 Proofs for Convergence within a Single Epoch for Algorithm 3

We showed that our method is equivalent to running Bregman ADMM on \(M\) and \(W=[S,L]\). Consequently, our previous analysis for the sparse case holds for the error bound on the sum of the loss function and the regularizers within a single epoch, with
\[
\rho=c_2\sqrt{T_i},\qquad\tilde\rho=\rho,\qquad c_2=\frac{\sqrt{\log p}}{\sqrt{R_i^2+\tilde R_i^2}}.
\]
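Within each epoch, the ADMM iterations referenced above only require cheap proximal steps. For the \(\ell_1\) regularizer \(\lambda\lVert S\rVert_1\) the proximal operator is elementwise soft-thresholding; a minimal sketch of that step (illustrative, not the thesis implementation):

```python
import numpy as np

def soft_threshold(X, tau):
    """Proximal operator of tau * ||X||_1: elementwise shrinkage toward zero."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

X = np.array([[1.5, -0.2], [0.4, -2.0]])
print(soft_threshold(X, 0.5))
# entries with |x| <= 0.5 are zeroed; the remaining entries shrink by 0.5
```

The nuclear-norm prox used for the low-rank block has the same shrinkage form applied to the singular values, which is why each REASON 2 iteration stays inexpensive.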
We use the same approach as in Section A.2.1 for the bounds on the dual variable \(Z_k\). Hence,
\[
f(\bar M(T_i))+\lambda_i\lVert\bar S(T_i)\rVert_1+\mu_i\lVert\bar L(T_i)\rVert_*
-f(\hat M_i)-\lambda_i\lVert\hat S_i\rVert_1-\mu_i\lVert\hat L_i\rVert_*
\]
\[
\le\frac{c_2\lVert A\hat W(T_i)-AW_0\rVert_F^2}{\sqrt{T_i}}
+\frac{\rho_x\lVert\hat M(T_i)-M_0\rVert_F^2}{T_i}
+\frac{GR_i}{T_i}+\frac{R_i\sqrt{\log p}}{\sqrt{T_i}}
+\frac{\sum_{k=1}^{T_i}\langle E_k,\hat M_i-M_k\rangle}{T_i}
\]
\[
\le\Big(\frac{c_2}{\sqrt{T_i}}+\frac{\rho_x}{T_i}\Big)\lVert\hat S_i-\tilde S_i+\hat L_i-\tilde L_i\rVert_F^2
+\frac{GR_i}{T_i}+\frac{R_i\sqrt{\log p}}{\sqrt{T_i}}
+\frac{\sum_{k=1}^{T_i}\langle E_k,\hat M_i-M_k\rangle}{T_i}.
\]
By the constraints enforced in the algorithm, we have
\[
f(\bar M(T_i))+\lambda_i\lVert\bar S(T_i)\rVert_1+\mu_i\lVert\bar L(T_i)\rVert_*
-f(\hat M_i)-\lambda_i\lVert\hat S_i\rVert_1-\mu_i\lVert\hat L_i\rVert_*
\le\sqrt{\frac{(R_i^2+\tilde R_i^2)\log p}{T_i}}
+\frac{(R_i^2+\tilde R_i^2)\rho_x}{T_i}
+\frac{G\sqrt{R_i^2+\tilde R_i^2}}{T_i}
+\frac{\sum_{k=1}^{T_i}\langle E_k,\hat M_i-M_k\rangle}{T_i}.
\]

Lemma 14. The dual variable in REASON 2 is bounded, i.e.,
\[
\lVert Z_k\rVert_\infty\le G+2\rho_0'R_i,\qquad\text{where }\rho_0':=\rho_x+\rho.
\]
Proof. The proof follows the same lines as the proof of Lemma 8, replacing \(\theta,y\) by \(M,W\) where \(W=[S,L]\). Hence, \(\lVert Z_k\rVert_\infty\le G+2\rho_0'R_i\) with \(\rho_0':=\rho_x+\rho\).

A.4.2 Proof of Proposition 3: Equation (A.3.2a)

In this section we bound the term \(\sum_{k=1}^{T_i}\langle E_k,\hat M_i-M_k\rangle/T_i\). We have
\[
M_k-\hat M_i=S_k-\hat S_i+L_k-\hat L_i+(Z_{k+1}-Z_k)/\rho.
\]
Hence,
\[
\big[\langle E_k,\hat M_i-M_k\rangle\big]^2
\le\big[\lVert E_k\rVert_\infty\lVert S_k-\hat S_i\rVert_1
+\lVert E_k\rVert_2\lVert L_k-\hat L_i\rVert_*
+\lVert E_k\rVert_\infty\lVert(Z_{k+1}-Z_k)/\rho\rVert_1\big]^2
\]
\[
\le\big[2R_i\lVert E_k\rVert_\infty+2\tilde R_i\lVert E_k\rVert_2+(G+2\rho_0'R_i)/\rho\,\lVert E_k\rVert_\infty\big]^2
\le\lVert E_k\rVert_2^2\big[2R_i+2\tilde R_i+(G+2\rho_0'R_i)/\rho\big]^2.
\]
Consider the term \(\lVert E_k\rVert_2\). Using Assumption A4, our previous approach in the proof of Equation (A.1.3a) holds, with the addition of a \(\sigma(p)\) term. Consequently,
\[
f(\bar M(T_i))+\lambda_i\lVert\bar S(T_i)\rVert_1+\mu_i\lVert\bar L(T_i)\rVert_*
-f(\hat M_i)-\lambda_i\lVert\hat S_i\rVert_1-\mu_i\lVert\hat L_i\rVert_*
\le\sqrt{\frac{(R_i^2+\tilde R_i^2)\log p}{T_i}}
+\frac{(R_i^2+\tilde R_i^2)\rho_x}{T_i}
+\frac{G\sqrt{R_i^2+\tilde R_i^2}}{T_i}
+\frac{\sigma(p)(R_i+\tilde R_i)\beta_i\sqrt{12\log(3/\delta_i)}}{\sqrt{T_i}}
+\frac{\sigma(p)G(R_i+\tilde R_i)\beta_i\sqrt{12\log(3/\delta_i)}}{T_i\sqrt{\log p}},
\]
with probability at least \(1-\delta_i\).

A.4.3 Proof of Lemma 12

We use Lemma 1 [Negahban et al., 2012] to design \(\lambda_i\) and \(\mu_i\). This lemma requires that, for the optimization problem \(\min_\theta\{\mathcal L(\theta)+\lambda_i Q(\theta)\}\), we design the regularizer coefficient so that \(\lambda_i\ge 2Q^*(\nabla\mathcal L(\theta^*))\), where \(\mathcal L\) is the loss function, \(Q\) is the regularizer and \(Q^*\) is the dual regularizer.
For our case, \(\theta\) stands for \([S,L]\). We have
\[
\mathcal L(\theta)=\frac{1}{n}\sum_{k=1}^n f_k(\theta;x),
\]
and
\[
Q^*\big(\nabla\mathcal L(\theta^*)\big)
=Q^*\Big[\mathbb E\big(\nabla f(\theta^*)\big)+\frac{1}{n}\sum_{k=1}^n\big\{\nabla f_k(\theta^*)-\mathbb E(\nabla f(\theta^*))\big\}\Big]
=Q^*\Big(\frac{1}{n}\sum_{k=1}^n E_k\Big),
\]
where \(E_k=g_k-\mathbb E(g_k)\) is the error in gradient estimation, as defined earlier. Using Theorem 1 [Agarwal et al., 2012a] in this case, if we design
\[
\lambda_i\ge 4\Big\lVert\frac{1}{n}\sum_{k=1}^n E_k\Big\rVert_\infty+\frac{4\alpha}{p}
\qquad\text{and}\qquad
\mu_i\ge 4\Big\lVert\frac{1}{n}\sum_{k=1}^n E_k\Big\rVert_2,
\tag{A.4.1}
\]
then we have
\[
\lVert\hat S_i-S^*\rVert_F^2+\lVert\hat L_i-L^*\rVert_F^2\le\frac{c}{\gamma^2}\{s\lambda_i^2+r\mu_i^2\}.
\tag{A.4.2}
\]

Lemma 15. Assume \(X\in\mathbb R^{p\times p}\). If \(\lVert X\rVert_2\le B\) almost surely, then with probability at least \(1-\delta\) we have
\[
\Big\lVert\frac{1}{n}\sum_{k=1}^n X_k-\mathbb E(X_k)\Big\rVert_2
\le\frac{6B}{\sqrt n}\Big(\sqrt{\log p}+\sqrt{\log\frac{1}{\delta}}\Big).
\]
Note that this lemma is a matrix Hoeffding bound and provides a loose bound on the matrix, whereas using matrix Bernstein would provide tighter results through \(\mathbb E(E_kE_k^\top)\). Moreover, since the elementwise max norm satisfies \(\lVert X\rVert_\infty\le\lVert X\rVert_2\), we use the same upper bound for both norms.

By definition, \(\mathbb E(E_k)=0\). According to Assumption A4, \(\lVert E_k\rVert_2\le\sigma(p)\). Thus it suffices to design
\[
\lambda_i\ge\frac{24\sigma(p)\beta_i}{\sqrt{T_i}}\Big(\sqrt{\log p}+\sqrt{\log\frac{1}{\delta_i}}\Big)+\frac{4\alpha}{p}
\qquad\text{and}\qquad
\mu_i\ge\frac{24\sigma(p)\beta_i}{\sqrt{T_i}}\Big(\sqrt{\log p}+\sqrt{\log\frac{1}{\delta_i}}\Big).
\]
Then we can use Equation (A.4.2).

A.4.4 Proof of Lemma 13

By the LSC condition on \(X=S+L\),
\[
\frac{\gamma}{2}\lVert\hat S_i-\bar S(T_i)+\hat L_i-\bar L(T_i)\rVert_F^2
\le f(\bar X(T_i))+\lambda_i\lVert\bar S(T_i)\rVert_1+\mu_i\lVert\bar L(T_i)\rVert_*
-f(\hat X_i)-\lambda_i\lVert\hat S(T_i)\rVert_1-\mu_i\lVert\hat L(T_i)\rVert_*.
\]
We want to use the upper bound (A.3.2a) for this term. With \(\hat M_i=\hat X_i\), all the terms are the same except for \(f(\bar M(T_i))\) and \(f(\bar X(T_i))\). We have \(\bar M(T_i)=\bar X(T_i)+\bar Z_{T_i}/T_i\), which differs by a bounded and small term of order \(O(R_i/(T_i\sqrt{T_i}))\). We accept this approximation given that this is a higher-order term compared to \(O(1/\sqrt{T_i})\); hence, it will not play a role in the final bound on the convergence rate.
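The concentration behavior behind Lemma 15 can be checked empirically: averaging \(n\) iid bounded zero-mean random matrices shrinks the spectral norm roughly like \(1/\sqrt n\) relative to a single sample. A small sketch with illustrative dimensions and a fixed seed (not taken from the thesis):

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 20, 500
# iid zero-mean matrices with entries in {-1, +1}, so each ||X_k||_2 is O(sqrt(p))
samples = rng.choice([-1.0, 1.0], size=(n, p, p))
single = np.linalg.norm(samples[0], 2)              # spectral norm of one sample
averaged = np.linalg.norm(samples.mean(axis=0), 2)  # spectral norm of the mean
# concentration: the average's spectral norm is far below a single sample's
assert averaged < 0.5 * single
```

A matrix Bernstein bound would predict an even sharper constant here, which is the tightening remarked on after Lemma 15.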
Therefore,
\[
\frac{\gamma}{2}\lVert\hat S_i-\bar S(T_i)+\hat L_i-\bar L(T_i)\rVert_F^2\le H_1,
\tag{A.4.3}
\]
with probability at least \(1-\delta_i\), where for simplicity we write
\[
H_1=\sqrt{\frac{(R_i^2+\tilde R_i^2)\log p}{T_i}}
+\frac{(R_i^2+\tilde R_i^2)\rho_x}{T_i}
+\frac{G\sqrt{R_i^2+\tilde R_i^2}}{T_i}
+\frac{\sigma(p)(R_i+\tilde R_i)\beta_i\sqrt{12\log(3/\delta_i)}}{\sqrt{T_i}}
+\frac{\sigma(p)G(R_i+\tilde R_i)\beta_i\sqrt{12\log(3/\delta_i)}}{T_i\sqrt{\log p}}.
\]
We have
\[
-\gamma\,\mathrm{Tr}\big(\hat\Delta_S(T_i)^\top\hat\Delta_L(T_i)\big)
=\frac{\gamma}{2}\big\{\lVert\hat\Delta_S(T_i)\rVert_F^2+\lVert\hat\Delta_L(T_i)\rVert_F^2\big\}
-\frac{\gamma}{2}\lVert\hat\Delta_S(T_i)+\hat\Delta_L(T_i)\rVert_F^2.
\]
In addition,
\[
\big|\mathrm{Tr}\big(\hat\Delta_S(T_i)^\top\hat\Delta_L(T_i)\big)\big|
\le\lVert\hat\Delta_S(T_i)\rVert_1\,\lVert\hat\Delta_L(T_i)\rVert_\infty.
\]
We have
\[
\lVert\hat\Delta_L(T_i)\rVert_\infty\le\lVert\hat L_i\rVert_\infty+\lVert\bar L(T_i)\rVert_\infty,
\qquad
\lVert\bar L(T_i)\rVert_\infty\le\lVert\bar Y(T_i)\rVert_\infty+\lVert\bar L(T_i)-\bar Y(T_i)\rVert_\infty
\]
\[
\le\lVert\bar Y(T_i)\rVert_\infty+\Big\lVert\frac{\sum_{k=0}^{T_i-1}(L_k-Y_k)}{T_i}\Big\rVert_\infty
=\lVert\bar Y(T_i)\rVert_\infty+\Big\lVert\frac{\sum_{k=0}^{T_i-1}(U_k-U_{k+1})}{T_i}\Big\rVert_\infty
=\lVert\bar Y(T_i)\rVert_\infty+\Big\lVert\frac{U_{T_i}}{T_i}\Big\rVert_\infty
\le\frac{\alpha}{p}+\frac{\sqrt p}{T_i}.
\]
In the last step we incorporated the constraint \(\lVert Y\rVert_\infty\le\alpha/p\) and the fact that \(U_0=0\). Moreover, we used
\[
\lVert U_{k+1}\rVert_\infty=\big\lVert\nabla\lVert L_k\rVert_*\big\rVert_\infty\le\sqrt{\mathrm{rank}(L)}\le\sqrt p,
\]
where the last step follows from the analysis of Watson [1992]. Therefore,
\[
\big|\mathrm{Tr}\big(\hat\Delta_S(T_i)^\top\hat\Delta_L(T_i)\big)\big|
\le\Big(\frac{2\alpha}{p}+\frac{\sqrt p}{T_i}\Big)\lVert\hat\Delta_S(T_i)\rVert_1.
\]
Consequently,
\[
\frac{\gamma}{2}\lVert\hat\Delta_S(T_i)+\hat\Delta_L(T_i)\rVert_F^2
\ge\frac{\gamma}{2}\big\{\lVert\hat\Delta_S(T_i)\rVert_F^2+\lVert\hat\Delta_L(T_i)\rVert_F^2\big\}
-\gamma\Big(\frac{2\alpha}{p}+\frac{\sqrt p}{T_i}\Big)\lVert\hat\Delta_S(T_i)\rVert_1.
\]
Combining the above equation with (A.4.3), we get
\[
\frac{\gamma}{2}\big\{\lVert\hat\Delta_S(T_i)\rVert_F^2+\lVert\hat\Delta_L(T_i)\rVert_F^2\big\}
-\gamma\Big(\frac{2\alpha}{p}+\frac{\sqrt p}{T_i}\Big)\lVert\hat\Delta_S(T_i)\rVert_1\le H_1.
\]
Using \(\lVert S\rVert_1\le\sqrt p\,\lVert S\rVert_F\),
\[
\lVert\hat\Delta_S(T_i)\rVert_F^2+\lVert\hat\Delta_L(T_i)\rVert_F^2
\le\frac{2}{\gamma}H_1+\Big(\frac{2\alpha}{\sqrt p}+\frac{p}{T_i}\Big)^2,
\]
with probability at least \(1-\delta_i\).

A.4.5 Proof of Proposition 3: Equation (A.3.2b)

Now we want to convert the error bound in (A.3.2a) from function values into vectorized \(\ell_1\)- and Frobenius-norm bounds. Since the error bound in (A.3.2a) holds for the minimizer \(\hat M_i\), it also holds for any other feasible matrix.
In particular, applying it to \(M^*\) leads to
\[
f(\bar M(T_i))-f(M^*)+\lambda_i\lVert\bar S(T_i)\rVert_1+\mu_i\lVert\bar L(T_i)\rVert_*
-\lambda_i\lVert S^*\rVert_1-\mu_i\lVert L^*\rVert_*\le H_1,
\]
with probability at least \(1-\delta_i\). For the next step, we find a lower bound on the left-hand side of this inequality:
\[
f(\bar M(T_i))-f(M^*)+\lambda_i\lVert\bar S(T_i)\rVert_1+\mu_i\lVert\bar L(T_i)\rVert_*
-\lambda_i\lVert S^*\rVert_1-\mu_i\lVert L^*\rVert_*
\ge\lambda_i\lVert\bar S(T_i)\rVert_1+\mu_i\lVert\bar L(T_i)\rVert_*-\lambda_i\lVert S^*\rVert_1-\mu_i\lVert L^*\rVert_*,
\]
where the inequality results from the fact that \(M^*\) minimizes \(f\). From here onward, all equations hold with probability at least \(1-\delta_i\). We have
\[
\lambda_i\lVert\bar S(T_i)\rVert_1+\mu_i\lVert\bar L(T_i)\rVert_*-\lambda_i\lVert S^*\rVert_1-\mu_i\lVert L^*\rVert_*\le H_1,
\tag{A.4.4}
\]
i.e.,
\[
\lVert\bar S(T_i)\rVert_1+\frac{\mu_i}{\lambda_i}\lVert\bar L(T_i)\rVert_*
\le\lVert S^*\rVert_1+\frac{\mu_i}{\lambda_i}\lVert L^*\rVert_*+H_1/\lambda_i.
\]
Using \(\bar S(T_i)=S^*+\Delta_S\) and \(\bar L(T_i)=L^*+\Delta_L\), we split \(\Delta_S\) into its on-support and off-support parts. We also divide \(\Delta_L\) into its projections onto \(V\) and \(V^\perp\), where \(V\) is the range of \(L^*\), meaning that every \(X\in V\) has rank at most \(r\). Therefore,
\[
\lVert(\bar S(T_i))_{\mathrm{supp}}\rVert_1\ge\lVert(S^*)_{\mathrm{supp}}\rVert_1-\lVert(\Delta_S)_{\mathrm{supp}}\rVert_1,
\qquad
\lVert(\bar S(T_i))_{\mathrm{supp}^c}\rVert_1\ge\lVert(\Delta_S)_{\mathrm{supp}^c}\rVert_1-\lVert(S^*)_{\mathrm{supp}^c}\rVert_1,
\]
and
\[
\lVert(\bar L(T_i))_V\rVert_*\ge\lVert(L^*)_V\rVert_*-\lVert(\Delta_L)_V\rVert_*,
\qquad
\lVert(\bar L(T_i))_{V^\perp}\rVert_*\ge\lVert(\Delta_L)_{V^\perp}\rVert_*-\lVert(L^*)_{V^\perp}\rVert_*.
\]
Consequently,
\[
\lVert(\Delta_S)_{\mathrm{supp}^c}\rVert_1+\frac{\mu_i}{\lambda_i}\lVert(\Delta_L)_{V^\perp}\rVert_*
\le\lVert(\Delta_S)_{\mathrm{supp}}\rVert_1+\frac{\mu_i}{\lambda_i}\lVert(\Delta_L)_V\rVert_*+H_1/\lambda_i.
\tag{A.4.5}
\]
Note that \(\Delta_S(T_i)-\hat\Delta_S(T_i)=\hat S_i-S^*\). Therefore,
\[
\lVert\hat S_i-S^*\rVert_1
=\lVert(\Delta_S(T_i))_{\mathrm{supp}}-(\hat\Delta_S(T_i))_{\mathrm{supp}}\rVert_1
+\lVert(\Delta_S(T_i))_{\mathrm{supp}^c}-(\hat\Delta_S(T_i))_{\mathrm{supp}^c}\rVert_1
\]
\[
\ge\big\{\lVert(\Delta_S(T_i))_{\mathrm{supp}}\rVert_1-\lVert(\hat\Delta_S(T_i))_{\mathrm{supp}}\rVert_1\big\}
-\big\{\lVert(\Delta_S(T_i))_{\mathrm{supp}^c}\rVert_1-\lVert(\hat\Delta_S(T_i))_{\mathrm{supp}^c}\rVert_1\big\}.
\]
Hence,
\[
\lVert(\hat\Delta_S(T_i))_{\mathrm{supp}^c}\rVert_1
\le\lVert(\hat\Delta_S(T_i))_{\mathrm{supp}}\rVert_1
+\lVert(\Delta_S(T_i))_{\mathrm{supp}^c}\rVert_1
-\lVert(\Delta_S(T_i))_{\mathrm{supp}}\rVert_1
+\lVert\hat S_i-S^*\rVert_1.
\]
As Equation (A.4.1) is satisfied, we can use Lemma 1 [Negahban et al., 2012]. Combining the result with Lemma 12, we have
\[
\lVert\hat S_i-S^*\rVert_1^2\le(4s+3r)\Big(\frac{s\lambda_i^2}{\gamma^2}+\frac{r\mu_i^2}{\gamma^2}\Big).
\]
Consequently, further use of Lemma 12 and the inequality \(\lVert(\hat\Delta_S(T_i))_{\mathrm{supp}}\rVert_1\le\sqrt s\,\lVert\hat\Delta(T_i)\rVert_F\) allow us to conclude that there exists a universal constant \(c\) such that
\[
\lVert\hat\Delta_S(T_i)\rVert_1^2
\le 4s\lVert\hat\Delta_S(T_i)\rVert_F^2+(H_1/\lambda_i)^2
+c(s+r)\Big(\frac{s\lambda_i^2}{\gamma^2}+\frac{r\mu_i^2}{\gamma^2}\Big)
+c\,r^2\frac{\mu_i^2}{\lambda_i^2}
\]
\[
\le 4s\Big[\frac{2}{\gamma}H_1+\Big(\frac{2\alpha}{\sqrt p}+\frac{p}{T_i}\Big)^2\Big]
+(H_1/\lambda_i)^2
+c(s+r)\Big(\frac{s\lambda_i^2}{\gamma^2}+\frac{r\mu_i^2}{\gamma^2}\Big)
+c\,r^2\frac{\mu_i^2}{\lambda_i^2},
\]
with probability at least \(1-\delta_i\). Optimizing the above bound over the choice of \(\lambda_i\), while complying with the conditions in Lemma 15, leads to
\[
\lambda_i^2=\frac{\gamma}{s+r}H_1+\frac{\alpha^2}{p^2}
+\frac{\sigma^2(p)^2}{T_i}\Big(\log p+\log\frac{1}{\delta}\Big).
\]
Repeating the same calculations for \(\lVert\hat\Delta_L(T_i)\rVert_*\) results in \(\mu_i^2=c\,\lambda_i^2\); we have
\[
\lVert\hat\Delta_S(T_i)\rVert_1^2
\le c\Big(s+r+\frac{s+r}{\gamma}\Big)H_1
+c(s+r)\Big(1+\frac{s+r}{\gamma p^2}\Big)\frac{\alpha^2}{p}
+(s+r)\Big(\frac{p^2}{T_i^2}+\frac{p}{T_i}\Big).
\]
Therefore,
\[
\lVert\Delta_S(T_i)\rVert_1^2\le 2\lVert\hat\Delta_S(T_i)\rVert_1^2+2\lVert S^*-\hat S_i\rVert_1^2
\tag{A.4.6}
\]
\[
\le 2\lVert\hat\Delta_S(T_i)\rVert_1^2+8c(s+r)\Big(\frac{s\lambda_i^2}{\gamma^2}+\frac{r\mu_i^2}{\gamma^2}\Big)
\le c\Big(s+r+\frac{s+r}{\gamma}\Big)H_1
+c(s+r)\Big(1+\frac{s+r}{\gamma p^2}\Big)\frac{\alpha^2}{p}
+(s+r)\Big(\frac{p^2}{T_i^2}+\frac{p}{T_i}\Big).
\]
Finally, in order to use \(\bar S(T_i)\) as the next prox center \(\tilde S_{i+1}\), we would also like to control the error \(\lVert\bar S(T_i)-\hat S_{i+1}\rVert_1^2\). Without loss of generality, we can design \(\tilde R_i=c_r R_i\) for any \(0\le c_r\le 1\); the result changes only by a constant factor. Hence, we use \(\tilde R_i=R_i\). Since \(\lambda_{i+1}\le\lambda_i\) by assumption, we obtain the same form of error bound as in (A.4.6). We want to run the epoch until all these error terms drop to \(R_{i+1}^2:=R_i^2/2\). It suffices to set the epoch length \(T_i\) to ensure that the sum of all terms in (A.4.6) is not greater than \(R_i^2/2\). All of the above conditions are met if we choose the epoch length
\[
T_i\simeq C\Big(s+r+\frac{s+r}{\gamma}\Big)^2\frac{\log p+12\sigma^2(p)^2\beta_i^2\log\frac{6}{\delta}}{\gamma^2 R_i^2}
+C\Big(s+r+\frac{s+r}{\gamma}\Big)
\Bigg[\frac{\sigma(p)G\beta_i\sqrt{12\log\frac{6}{\delta}}}{\gamma R_i\sqrt{\log p}}
+\frac{G}{\gamma R_i}+\frac{\rho_x}{\gamma}\Bigg],
\]
for a suitably large universal constant \(C\). Then we have
\[
\lVert\Delta_S(T_i)\rVert_1^2\le\frac{c_0}{\sqrt C}R_i^2
+c(s+r)\Big(1+\frac{s+r}{\gamma p^2}\Big)\frac{\alpha^2}{p}.
\]
Since the second part of the upper bound does not shrink with time, we stop where the two parts are equal, namely at \(R_i^2=c(s+r)\big(1+\frac{s+r}{\gamma p^2}\big)\frac{\alpha^2}{p}\).
With a similar analysis for \(L\), we get
\[
\lVert\Delta_L(T_i)\rVert_*^2\le\frac{c_0}{\sqrt C}\Big(1+\frac{1}{\gamma}\Big)R_i^2
+c\,\frac{(s+r)^2}{\gamma^2 p^2}\,\frac{\alpha^2}{p}.
\]

A.4.6 Proof of Guarantees with Fixed Epoch Length, Sparse + Low-Rank Case

This is a special case of Theorem 11 (Appendix). Note that this fixed epoch length results in a convergence rate that is worse by a factor of \(\log p\). The key difference between this case and the optimal epoch-length setting of Theorem 11 is that in the latter we guaranteed error halving by the end of each epoch, whereas with a fixed epoch length that statement may no longer hold after the number of epochs becomes large enough. Therefore, we need to show that in such a case the error does not increase enough to invalidate our analysis. Let \(k^*\) be the epoch number up to which error halving holds true. Next we demonstrate that the error does not increase much for \(k>k^*\). The proof follows the same lines as that of Theorem 1 (in the main text), Section A.2.6, with
\[
k^*:=\sup\Bigg\{i:\ 2^{-j/2+1}\ge\frac{c}{R_1}(s+r)\sqrt{\frac{\log p+\sigma^2(p)^2\beta_j^2 w^2}{T_0}}\ \text{for all epochs }j\le i\Bigg\},
\]
and
\[
k_0=\log\Bigg(\frac{R_1}{s+r}\sqrt{\frac{T}{\log p+\sigma^2(p)^2 w^2}}\Bigg).
\]

A.4.7 Proof of Guarantees for the Sparse + Low-Rank Graphical Model Selection Problem

Here we prove Corollary 2. The proof follows by using the bounds derived in Appendix A.2.7 for the Taylor series expansion and then following the lines of the proof of Theorem 11 as in Appendix A.4. According to D.1, in order to prove the guarantees, we first need to bound \(\lVert z_{k+1}-z_k\rVert_\infty\) and \(\lVert z_k\rVert_\infty\). According to Equation (A.2.5), and considering the imposed \(\ell_1\) bound, this is equivalent to bounding \(\lVert g_{k+1}-g_k\rVert_\infty\) and \(\lVert g_k\rVert_\infty\). The rest of the proof follows the lines of the proof of Theorem 2. On the other hand, the Lipschitz property requires a bound on \(\lVert g_k\rVert_\infty\), which is much more stringent. Assuming we are in close proximity of \(M^*\), we can use a Taylor approximation to locally approximate \(M^{-1}\) by \(M^{*-1}\), as in [Ravikumar et al., 2011]:
\[
M^{-1}=M^{*-1}-M^{*-1}\Delta M^{*-1}+R(\Delta),
\]
where \(\Delta=M-M^*\) and \(R(\Delta)\) is the remainder term.
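The first-order expansion above is easy to sanity-check numerically: for a small perturbation \(\Delta\), the error of the first-order approximation of \((M^*+\Delta)^{-1}\) scales like \(\lVert\Delta\rVert^2\), matching the claim that the remainder \(R(\Delta)\) is negligible near \(M^*\). A small sketch with illustrative dimensions and seed:

```python
import numpy as np

rng = np.random.default_rng(2)
p = 6
# a well-conditioned positive definite M* and a small perturbation Delta
M_star = np.eye(p) + 0.1 * rng.standard_normal((p, p))
M_star = M_star @ M_star.T + p * np.eye(p)
inv_star = np.linalg.inv(M_star)

eps = 1e-3
Delta = eps * rng.standard_normal((p, p))

exact = np.linalg.inv(M_star + Delta)
first_order = inv_star - inv_star @ Delta @ inv_star
# the remainder R(Delta) is O(||Delta||^2), so the first-order
# approximation error shrinks quadratically in eps
err = np.linalg.norm(exact - first_order, "fro")
assert err < 10 * eps ** 2
```

Shrinking `eps` by a factor of 10 should shrink `err` by roughly a factor of 100, which is the quadratic behavior of the Neumann-series remainder.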
We have
\[
\lVert g_{k+1}-g_k\rVert_\infty\le|\!|\!|M^{*-1}|\!|\!|_\infty\,\lVert M_{k+1}-M_k\rVert_\infty,
\]
and
\[
\lVert g_k\rVert_\infty\le\lVert g_k-\mathbb E(g_k)\rVert_\infty+\lVert\mathbb E(g_k)\rVert_\infty
\le\lVert e_k\rVert_\infty+\lVert\Sigma^*-M_k^{-1}\rVert_\infty
+\lVert\Sigma^*\rVert_\infty\lVert M_{k+1}-M_k\rVert_\infty.
\]
The term \(\lVert M_{k+1}-M_k\rVert_\infty\) is bounded by \(2R\) by construction. We assume \(|\!|\!|M^{*-1}|\!|\!|_\infty\) and \(\lVert\Sigma^*\rVert_\infty\) are bounded. The error \(\Delta\) needs to be "small enough" for \(R(\Delta)\) to be negligible, and we now provide the conditions for this. By definition,
\[
R(\Delta)=\sum_{k=2}^{\infty}(-1)^k\big(M^{*-1}\Delta\big)^k M^{*-1}.
\]
Using the triangle inequality and the sub-multiplicative property of the Frobenius norm,
\[
\lVert R(\Delta)\rVert_F\le\frac{\lVert M^{*-1}\rVert_F\,\lVert M^{*-1}\Delta\rVert_F^2}{1-\lVert M^{*-1}\Delta\rVert_F}.
\]
For \(\lVert\Delta\rVert_F\le 2R\le 0.5/\lVert M^{*-1}\rVert_F\), we get
\[
\lVert R(\Delta)\rVert_F\le\lVert M^{*-1}\Delta\rVert_F.
\]
We assume \(\lVert M^{*-1}\rVert_F\) is bounded. Therefore, if the variables are closely related we need to start with a small \(R\); for weaker correlations, we can start in a bigger ball. The rest of the proof follows the lines of the proof of Theorem 11, replacing \(G^2\) by \(|\!|\!|M^{*-1}|\!|\!|_\infty R(\sigma+\lVert\Sigma^*\rVert_\infty R)\).

Bibliography

Pictorial explanation of a border node. August 2014. URL https://www.dropbox.com/s/jc4jo4t26ma2mlx/border.jpg?dl=0.

A. Abur and A. G. Exposito. Power System State Estimation: Theory and Implementation. Marcel Dekker, 2004.

A. Agarwal, S. Negahban, and M. Wainwright. Noisy matrix decomposition via convex relaxation: Optimal rates in high dimensions. The Annals of Statistics, 40(2):1171–1197, 2012a.

A. Agarwal, S. Negahban, and M. J. Wainwright. Stochastic optimization and sparse statistical recovery: Optimal algorithms for high dimensions. In NIPS, pages 1547–1555, 2012b.

S. Massoud Amin and Anthony M. Giacomoni. Smart grid: safe, secure, self-healing. IEEE Power and Energy Magazine, 10:33–40, 2012.

A. Anandkumar, V. Tan, F. Huang, and A. S. Willsky. High-dimensional Gaussian graphical model selection: walk summability and local separation criterion. Journal of Machine Learning Research, 13:2293–2337, August 2012.

Emilio Ancillotti, Raffaele Bruno, and Marco Conti. The role of communication systems in smart grids: Architectures, technical solutions and research challenges. Computer Communications, 36(17):1665–1697, 2013.

B. De Finetti.
Theory of Probability. Wiley, 1975.

R. Banirazi and E. Jonckheere. Geometry of power flow in negatively curved power grids: Toward a smart transmission system. In IEEE CDC, pages 6259–6264, 2010.

C. M. Bishop. Pattern Recognition and Machine Learning. Springer, New York, USA, 2006.

S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.

C. Wei, A. Wiesel, and R. S. Blum. Change detection in smart grids using errors in variables models. In Sensor Array and Multichannel Signal Processing Workshop (SAM), 2012 IEEE 7th, pages 16–20, June 17–20, 2012.

E. J. Candès, X. Li, Y. Ma, and J. Wright. Robust principal component analysis? Journal of the ACM (JACM), 58(3):11, 2011.

V. Chandrasekaran, S. Sanghavi, Pablo A. Parrilo, and A. S. Willsky. Rank-sparsity incoherence for matrix decomposition. SIAM Journal on Optimization, 21(2):572–596, 2011.

V. Chandrasekaran, P. A. Parrilo, and Alan S. Willsky. Latent variable graphical model selection via convex optimization. The Annals of Statistics, 40(4):1935–1967, 2012.

Alexandre d'Aspremont, Onureena Banerjee, and Laurent El Ghaoui. First-order methods for sparse covariance selection. SIAM Journal on Matrix Analysis and Applications, 30(1):56–66, 2008.

W. Deng and W. Yin. On the global and linear convergence of the generalized alternating direction method of multipliers. Technical report, DTIC Document, 2012.

R. Diao, K. Sun, V. Vittal, R. O'Keefe, M. Richardson, N. Bhatt, D. Stradford, and S. Sarawgi. Decision tree-based online voltage security assessment using PMU measurements. IEEE Trans. Power Systems, 24:832–839, May 2009.

J. F. Dopazo, O. A. Klitin, and A. M. Sasson. Stochastic load flows. Power Apparatus and Systems, IEEE Trans., 94(2):299–309, 1975.

J. Duchi, S. Shalev-Shwartz, Y. Singer, and T. Chandra. Efficient projections onto the \(\ell_1\)-ball for learning in high dimensions.
In Proceedings of the 25th International Conference on Machine Learning, pages 272–279. ACM, 2008.

E. Esser, X. Zhang, and T. Chan. A general framework for a class of first order primal-dual algorithms for convex optimization in imaging science. SIAM Journal on Imaging Sciences, 3(4):1015–1046, 2010.

J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical Lasso. Biostatistics, 2007.

A. Giani, E. Bitar, M. Garcia, M. McQueen, P. Khargonekar, and K. Poolla. Smart grid data integrity attacks. IEEE Trans. Smart Grid, 1(1), January 2012.

T. Goldstein, B. O'Donoghue, and S. Setzer. Fast alternating direction optimization methods. CAM report, pages 12–35, 2012.

M. He and J. Zhang. A dependency graph approach for fault detection and localization towards secure smart grid. IEEE Trans. Smart Grid, 2:342–351, June 2011.

Julien M. Hendrickx, Karl Henrik Johansson, Raphaël M. Jungers, Henrik Sandberg, and Kin Cheong Sou. Efficient computations of a security index for false data attacks in power networks. IEEE Transactions on Automatic Control, 59(12), 2014.

C. Hsieh, M. A. Sustik, I. Dhillon, P. Ravikumar, and R. Poldrack. BIG & QUIC: Sparse inverse covariance estimation for a million variables. In Advances in Neural Information Processing Systems, pages 3165–3173, 2013.

Cho-Jui Hsieh, Matyas A. Sustik, Inderjit S. Dhillon, and Pradeep D. Ravikumar. Sparse inverse covariance matrix estimation using quadratic approximation. In NIPS, pages 2330–2338, 2011.

Daniel Hsu, Sham M. Kakade, and Tong Zhang. Robust matrix decomposition with sparse corruptions. Information Theory, IEEE Transactions on, 57(11):7221–7234, 2011.

T. Idé, A. C. Lozano, N. Abe, and Y. Liu. Proximity-based anomaly detection using sparse structure learning. In SIAM International Conference on Data Mining, Philadelphia, 2009.

M. Janzamin and A. Anandkumar. High-dimensional covariance decomposition into sparse Markov and independence models. JMLR, 15:1549–1591, April 2014.
Majid Janzamin and Animashree Anandkumar. High-dimensional covariance decomposition into sparse Markov and independence domains. In Proceedings of the 29th International Conference on Machine Learning (ICML-12), pages 1839–1846, 2012.

R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 315–323. 2013.

K. P. Burnham and D. R. Anderson. Model Selection and Multi-Model Inference: A Practical Information-Theoretic Approach. page 51, 2002.

Ashwin Kashyap and Duncan Callaway. Estimating the probability of load curtailment in power systems with responsive distributed storage. In IEEE PMAPS, pages 18–23, 2010.

Oliver Kosut, Liyan Jia, Robert J. Thomas, and Lang Tong. Malicious data attacks on smart grid state estimation: Attack strategies and countermeasures. In IEEE SmartGridComm, 2010, pages 220–225, 2010.

S. Kullback and R. A. Leibler. On information and sufficiency. Annals of Mathematical Statistics, 22(1):79–86, 1951.

S. Kullback and R. A. Leibler. Letter to the Editor: The Kullback-Leibler distance. The American Statistician, 41(4):340–341, 1987.

S. Kullback. Information Theory and Statistics. John Wiley and Sons, NY, 1959.

Cheolhyeon Kwon, Weiyi Liu, and Inseok Hwang. Security analysis for cyber-physical systems against stealthy deception attacks. In American Control Conference (ACC), 2013, pages 3344–3349. IEEE, 2013.

S. L. Lauritzen. Graphical Models. Clarendon Press, 1996.

Gilad Lerman, Michael McCoy, Joel A. Tropp, and Teng Zhang. Robust computation of linear models, or how to find a needle in a haystack. arXiv preprint arXiv:1202.4044, 2012.

Z. Lin, M. Chen, and Y. Ma. The augmented Lagrange multiplier method for exact recovery of corrupted low-rank matrices. arXiv preprint arXiv:1009.5055, 2010.

J. Löfberg. YALMIP: A toolbox for modeling and optimization in MATLAB.
In IEEE CACSD, September 2004. Available from http://users.isy.liu.se/johanl/yalmip/.

Po-Ling Loh and Martin J. Wainwright. Regularized M-estimators with nonconvexity: Statistical and algorithmic theory for local optima. In Advances in Neural Information Processing Systems, pages 476–484, 2013.

Zhi-Quan Luo. On the linear convergence of the alternating direction method of multipliers. arXiv preprint arXiv:1208.3922, 2012.

S. Ma, L. Xue, and H. Zou. Alternating direction methods for latent variable Gaussian graphical model selection. arXiv preprint arXiv:1206.1275v2, 2012.

Dmitry M. Malioutov, Jason K. Johnson, and Alan S. Willsky. Walk-sums and belief propagation in Gaussian graphical models. The Journal of Machine Learning Research, 7:2031–2064, 2006.

J. FC Mota, J. MF Xavier, P. MQ Aguiar, and M. Puschel. Distributed ADMM for model predictive control and congestion control. In Decision and Control (CDC), 2012 IEEE 51st Annual Conference on, pages 5110–5115. IEEE, 2012.

J. Mur-Amada and J. Salln-Arasanz. From turbine to wind farms: technical requirements and spin-off products. In Gesche Krause, editor, Phase Transitions and Critical Phenomena, volume 18, pages 101–132. InTech, April 2011.

S. Negahban, P. Ravikumar, M. Wainwright, and B. Yu. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Statistical Science, 27(4):538–557, 2012.

H. Ouyang, N. He, L. Tran, and A. G. Gray. Stochastic alternating direction method of multipliers. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 80–88, 2013.

Guodong Pang, George Kesidis, and Takis Konstantopoulos. Avoiding overages by deferred aggregate demand for PEV charging on the smart grid. In Communications (ICC), 2012 IEEE International Conference on, pages 3322–3327. IEEE, 2012.

Jim Pitman and Nathan Ross. Archimedes, Gauss, and Stein. Notices AMS, 59:1416–1421, 2012.

S. R. Rajagopalan, L. Sankar, S. Mohajer, and H. V. Poor.
Smart meter privacy: A utility-privacy framework. In 2nd Annual IEEE Conference on Smart Grid Communications, Brussels, Belgium, October 2011.

G. Raskutti, M. J. Wainwright, and B. Yu. Minimax rates of estimation for high-dimensional linear regression over \(\ell_q\)-balls. IEEE Trans. Information Theory, 57(10):6976–6994, October 2011.

Pradeep Ravikumar, Martin J. Wainwright, Garvesh Raskutti, Bin Yu, et al. High-dimensional covariance estimation by minimizing \(\ell_1\)-penalized log-determinant divergence. Electronic Journal of Statistics, 5:935–980, 2011.

Nicolas Le Roux, Mark Schmidt, and Francis Bach. A stochastic gradient method with an exponential convergence rate for strongly-convex optimization with finite training sets. Technical report, 2012.

A. Schellenberg, W. Rosehart, and J. Aguado. Cumulant-based probabilistic optimal power flow with Gaussian and Gamma distributions. Power Systems, IEEE Trans., 20(2):773–781, 2005.

Hanie Sedghi and Edmond Jonckheere. Statistical structure learning of smart grid for detection of false data injection. In Power and Energy Society General Meeting (PES), 2013 IEEE, pages 1–5. IEEE, 2013.

Hanie Sedghi and Edmond Jonckheere. On conditional mutual information in Gauss-Markov structured grids. In G. Como, B. Bernhardson, and A. Rantzer, editors, Lecture Notes in Control and Information Sciences, volume 450, pages 277–297. Springer, 2014.

Hanie Sedghi and Edmond Jonckheere. Statistical structure learning to ensure data integrity in smart grid. Smart Grid, IEEE Transactions on, 6, 2015.

Hanie Sedghi, Anima Anandkumar, and Edmond Jonckheere. Guarantees for multi-step stochastic ADMM in high dimensions. arXiv preprint arXiv:1402.5131, 2014a. Submitted to Journal of Machine Learning Research (JMLR).

Hanie Sedghi, Anima Anandkumar, and Edmond Jonckheere. Multi-step stochastic ADMM in high dimensions: Applications to sparse optimization and matrix decomposition.
In Proceedings of the 28th Annual Conference on Neural Information Processing Systems (NIPS-14), pages 2771–2779, 2014b.

S. Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107–194, 2011.

S. Shalev-Shwartz and T. Zhang. Accelerated mini-batch stochastic dual coordinate ascent. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 378–385. 2013.

André Teixeira, György Dán, Henrik Sandberg, and Karl Henrik Johansson. A cyber security study of a SCADA energy management system: Stealthy deception attacks on the state estimator. In World Congress, volume 18, pages 11271–11277, 2011.

K. C. Toh, M. J. Todd, and R. H. Tutuncu. SDPT3 - a MATLAB software package for semidefinite programming. Optimization Methods and Software, 11:545–581, 1999.

J. Tropp. User-friendly tail bounds for sums of random matrices. Foundations of Computational Mathematics, 12(4):389–434, 2012.

Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027, 2010.

Van H. Vu. Spectral norm of random matrices. In Proceedings of the Thirty-Seventh Annual ACM Symposium on Theory of Computing, pages 423–430. ACM, 2005.

B. Wahlberg, S. Boyd, M. Annergren, and Y. Wang. An ADMM algorithm for a class of total variation regularized estimation problems. arXiv preprint arXiv:1203.1828, 2012.

C. Wang, X. Chen, A. Smola, and E. Xing. Variance reduction for stochastic gradient optimization. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 181–189. 2013a.

H. Wang and A. Banerjee. Bregman alternating direction method of multipliers. arXiv preprint arXiv:1306.3203, 2013.

X. Wang, M. Hong, S. Ma, and Z. Luo.
Solving multiple-block separable convex minimization problems using two-block alternating direction method of multipliers. arXiv preprint arXiv:1308.5294, 2013b.

Zhaoran Wang, Han Liu, and Tong Zhang. Optimal computational and statistical rates of convergence for sparse nonconvex learning problems. arXiv preprint arXiv:1306.4960, 2013c.

G. Watson. Characterization of the subdifferential of some matrix norms. Linear Algebra and its Applications, 170(0):33–45, 1992.

Yao Liu, Peng Ning, and Michael K. Reiter. False data injection attacks against state estimation in electric power grids. ACM Trans. Information and System Security, 14, May 2011.

Pei Zhang and Stephen T. Lee. Probabilistic load flow computation using the method of combined cumulants and Gram-Charlier expansion. Power Systems, IEEE Trans., 19(1):676–682, 2004.

H. Zhu and G. B. Giannakis. Sparse overcomplete representations for efficient identification of power line outages. IEEE Trans. on Power Systems, 2012.

Kun Zhu, Moustafa Chenine, Lars Nordström, Sture Holmström, and Göran Ericsson. An empirical study of synchrophasor communication delay in a utility TCP/IP network. International Journal of Emerging Electric Power Systems, 14(4):341–350, 2013.

R. D. Zimmerman, C. E. Murillo-Sánchez, and R. J. Thomas. MATPOWER: Steady-state operations, planning and analysis tools for power systems research and education. Power Systems, IEEE Transactions on, 26(1):12–19, Feb. 2011.
Abstract
In this thesis, we consider two main problems in learning with big data: data integrity and high dimension. We specifically consider the problem of data integrity in the smart grid, as it is of paramount importance for grid maintenance and control; in addition, data manipulation can lead to catastrophic events. Inspired by this problem, we then expand the horizon to designing a general framework for stochastic optimization in high dimension for any loss function and any underlying low-dimensional structure. We propose Regularized Epoch-based Admm for Stochastic Optimization in high-dimensioN (REASON). Our ADMM method is based on epoch-based annealing and consists of inexpensive steps which involve projections onto simple norm balls. We provide explicit bounds for the sparse optimization problem and the noisy matrix decomposition problem and show that our convergence rate in both cases matches the minimax lower bound. For matrix decomposition into sparse and low-rank components, we provide the first guarantees for any online method. Experiments show that for both sparse optimization and matrix decomposition problems, our algorithm outperforms the state-of-the-art methods. In particular, we reach higher accuracy with the same time complexity.