ESTIMATION OF GRAPH LAPLACIAN AND COVARIANCE MATRICES

by

Eduardo Hernán Pavez Carvelli

A Dissertation Presented to the FACULTY OF THE USC GRADUATE SCHOOL, UNIVERSITY OF SOUTHERN CALIFORNIA, in Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (Electrical Engineering)

December 2019

Acknowledgments

The completion of my thesis would not have been possible without Professor Antonio Ortega. I am deeply thankful for his support, guidance, and patience. Conducting research together has elevated my research skills and given me the tools and confidence to take the next steps. I would also like to thank Professor Mihailo Jovanovic for being part of my dissertation and qualification exam committees. Thanks to Prof. Ramesh Govindan for being part of my dissertation committee, and to Prof. Urbashi Mitra, Prof. Mahdi Soltanolkotabi, and Prof. Yingying Fan for serving on my qualification exam committee. Their feedback helped further refine my thesis.

I would like to express gratitude for the collaborations with Dr. Aamir Anis, Prof. Nicolo Michelusi, Dr. Debargha Mukherjee, Yongzhe Wang, and Lester Lu. Special thanks to Hilmi E. Egilmez for the seemingly endless, yet fruitful, research discussions, as well as for co-authoring various papers with me.

I am especially grateful for the mentorship I received during summer internships. Thanks to Dr. Phil Chou, at Microsoft Research, for teaching me about point clouds, compression, and the future of human communications. I am also thankful to Dr. Dong Tian, Dr. Anthony Vetro, and Dr. Robert Cohen at Mitsubishi Electric Research Laboratories for the opportunity to expand my interest in point cloud compression.

I also want to thank fellow members of the lab, including Akshay, Joanne, Jessie, Ajinkya, Ben, Miguel, Alexander, Sarath, and Pratyusha, for valuable discussions and fun times at conferences. Thanks to my undergraduate and MS thesis advisor, Prof.
Jorge Silva, who introduced me to the fun of doing research and encouraged me to pursue graduate studies.

I would like to thank my California friends, Magda, Gonzalo, Paulina, Rodrigo, and Karel, for making the transition to living in Los Angeles easier. Thank you for all the laughter, whether you were laughing with me or at me. Living thousands of kilometers away from home, I was adopted into a Filipino American family, where I became the Chilean son they never knew they wanted. I am especially thankful to Helen and Elison, Vanessa, Keith and Beckett, as well as all the aunties and uncles.

Many thanks to my parents, Luis and Magali, and to my sister Chiara, for giving me your support when I decided to go live on the other side of the world. Thank you for making me feel as if I had never left Chile. All of your WhatsApp messages, full of photos, videos, and stories, make me feel part of your lives.

Finally, I would not have survived graduate school without Alyssa, my greatest supporter. Alyssa, we made it!

Table of Contents

Acknowledgments
List of Tables
List of Figures
Abstract

Chapter 1: Introduction
  1.1 Graph learning for graph signal processing
    1.1.1 Graph learning from smooth signals
      1.1.1.1 Related work
      1.1.1.2 Contributions
    1.1.2 Graph learning with topology properties
      1.1.2.1 Related work
      1.1.2.2 Contributions
  1.2 Covariance estimation with partial observations
    1.2.1 Missing data
    1.2.2 Dimensionality reduced data
  1.3 Outline

Chapter 2: Graph theoretic background
  2.1 Notation
  2.2 Graphs and Laplacians
  2.3 Graph signal processing
  2.4 Estimation of graph Laplacian matrices
    2.4.1 Estimation of Gaussian Markov random fields
    2.4.2 Smoothness based methods
    2.4.3 Spectral methods
  2.5 Applications of Laplacian matrices

Chapter 3: Generalized graph Laplacian estimation
  3.1 Introduction
  3.2 Properties of generalized graph Laplacians
    3.2.1 Spectral properties
    3.2.2 Gaussian Markov random fields
    3.2.3 Interpretation
  3.3 Estimation of GGL matrices
    3.3.1 A dual block coordinate descent algorithm
    3.3.2 Incorporating additional information
  3.4 Experiments
  3.5 Conclusion

Chapter 4: Learning graphs with topology properties
  4.1 Introduction
  4.2 Related work
    4.2.1 Estimation of attractive Gaussian Markov random fields
    4.2.2 Characterizations of graphical model structures
    4.2.3 Estimation of structured GMRFs
  4.3 Preliminaries and problem statement
    4.3.1 Graph theoretic background
    4.3.2 Similarity matrix
    4.3.3 Analysis of Problem 1
    4.3.4 Proposed algorithm
  4.4 Analysis of Problem 2
    4.4.1 Sparsity of optimal graphs
    4.4.2 Connected components of optimal graph
    4.4.3 Regularization and sparsity
    4.4.4 Closed form solution for acyclic graphs
    4.4.5 Admissible graph families
  4.5 Analysis of Algorithm 2
  4.6 Graph topology inference
    4.6.1 Tree structured graphs
    4.6.2 Bipartite graphs
    4.6.3 k-sparse connected graphs
  4.7 Experiments
    4.7.1 Graph learning from synthetic data
      4.7.1.1 Learning tree structured graphs
      4.7.1.2 Learning bipartite graphs
      4.7.1.3 Connected k-sparse graphs
    4.7.2 Image graphs
  4.8 Conclusion
  4.9 Proofs
    4.9.1 Properties of GGL matrices
    4.9.2 Proof of Theorem 3

Chapter 5: Covariance matrix estimation with missing data
  5.1 Related work
    5.1.1 Covariance estimation with complete observations
    5.1.2 Missing data mechanisms
    5.1.3 Covariance estimation with missing data
  5.2 Problem setup
    5.2.1 Notation
    5.2.2 Missing data
    5.2.3 Missing data mechanisms and covariance estimators
  5.3 Estimation error bounds
    5.3.1 Preliminaries and assumptions
    5.3.2 Bounds under MCAR mechanism
    5.3.3 Bounds under MAR mechanism
    5.3.4 Bounds under active sequential observations (CMCAR mechanism)
  5.4 Experiments
    5.4.1 Covariance estimation under MCAR observations
      5.4.1.1 Uniform MCAR observations
      5.4.1.2 Non uniform MCAR observations
  5.5 Proofs
    5.5.1 Bounding tools
    5.5.2 Proof of Theorem 7
    5.5.3 Proof of Theorem 8
    5.5.4 Proof of Theorem 9
    5.5.5 Proof of Lemma 3
  5.6 Conclusion

Chapter 6: Covariance estimation from dimensionality reduced data
  6.1 Introduction
  6.2 Related work
    6.2.1 Dimensionality reduction approaches
    6.2.2 Active learning methods
  6.3 Problem formulation
    6.3.1 Data acquisition and estimators
      6.3.1.1 Independent and fixed sampling distribution
      6.3.1.2 Dependent time varying sampling distribution
    6.3.2 Application based resource constraints
      6.3.2.1 Data compression
      6.3.2.2 Streaming data
      6.3.2.3 Evaluation criteria and proposed solution
  6.4 Fixed sampling
    6.4.1 Finding a data adaptive sampling distribution
  6.5 Dependent time varying sampling
    6.5.1 Finding data adaptive sampling distributions
      6.5.1.1 Data adaptive algorithm for the compression setting
      6.5.1.2 Active algorithm for streaming data
    6.5.2 Complexity analysis and comparison with related work
  6.6 Experiments
    6.6.1 Covariance estimation from dimensionality reduced data
    6.6.2 Covariance estimation from streaming data
  6.7 Conclusion
  6.8 Proofs
    6.8.1 Proof of Theorem 11
    6.8.2 Optimal missing data distribution, proof of Theorem 12
    6.8.3 Proof of Theorem 13

Chapter 7: Conclusion and future work

Reference List

List of Tables

2.1 Classic transforms viewed as GFTs. DCT and DST stand for discrete cosine transform and discrete sine transform, respectively [152]. The last column indicates which graph representation matrix may be used to define the GFT.
2.2 Combinatorial Laplacian estimation algorithms that minimize average variation.
4.1 Algorithms for graph learning with topology properties.
5.1 Comparison of covariance estimators with missing data. For Cai & Zhang [20], the constants parameterize the classes of bandable and sparse covariance matrices, respectively; p_min is the smallest non-zero observation probability. Some universal numerical constants have been ignored; for detailed results see the corresponding theorems in Section 5.3.
6.1 Comparison of covariance estimators with partial observations. The Type column indicates data dependence: non adaptive (NA), adaptive (AD), or active (AC). The Acquisition column contains the number of observed variables and the time complexity. The Compressed column is the minimum amount of data required to compute the estimator. The last column indicates the time complexity of computing the estimator. Storage of the covariance matrix is not included since it is the same, O(n^2), for all estimators.

List of Figures

3.1 Texture graphs using our GLP estimation and the LsigRep method from [46]. We consider wood textures, both original and with a 60 degree rotation. We normalize each graph by its largest entry and show only off-diagonal entries.
4.1 Performance of Algorithm 2 for learning tree structured graphs.
4.2 Performance of Algorithm 2 for learning bipartite graphs.
4.3 Performance of Algorithm 2 for learning connected k-sparse graphs.
4.4 Structured graphs learned from sample covariance matrices of texture images using Algorithm 2 with GTI found using (A). (a)-(c) 512x512 texture images. (d)-(f) tree structured graphs. (g)-(i) bipartite graphs. (j)-(l) sparse connected graphs obtained by setting the sparsity parameter k = 112.
5.1 Comparison of MCAR and MAR estimators under MCAR missing observations with uniform observation probabilities. The plot shows relative estimation error in operator norm versus number of i.i.d. realizations, both in logarithmic scale. Dashed and solid lines denote the MAR and MCAR estimators respectively, while dotted lines are an estimate based on the error bounds.
5.2 Approximately sparse covariance matrix with low effective rank.
5.3 Evaluation of the MAR estimators under MCAR missing observations with non uniform observation probabilities.
6.1 Comparison of Algorithm 3 and [28] for covariance estimation from compressed observations.
6.2 Comparison of non adaptive (uniform) and active (Alg. 4) strategies for covariance estimation with streaming observations for fixed m/n.

Abstract

Research in recent years has been dominated by bigger and more complex types of data. Multimedia signals (images, video, holograms) and non traditional signals arising in sensor and social networks have become pervasive while increasing in complexity and quantity. We study two problems related to the estimation of signal models of high dimensional unstructured data.

In the first part we concentrate on graph estimation algorithms from smooth graph signals. Many datasets can be represented as samples living on the nodes of a graph, where the structure and weights of these graphs represent relations between samples. In graph signal processing, several techniques may require graphs with certain structures (trees, sparse, bipartite). We derive data driven algorithms to obtain graphs with certain topology properties. Since this problem is in general non convex, the proposed framework uses existing combinatorial optimization algorithms to identify near optimal graph topologies, and applies convexity-based graph learning algorithms to estimate the graph weights.
In the second part we study covariance matrix estimation from partially observed data. We start by looking at covariance estimation with data independent and data dependent missing observations. We obtain operator norm estimation error bounds that are new or more general than those of previous studies. Motivated by high dimensional problems, where data acquisition and communication are expensive, we extend our analysis for missing observations to scenarios where one wishes to estimate the covariance matrix from a subset of variables. We apply this theory to design data adaptive and active estimators, suitable for problems where computational resources are limited.

Chapter 1: Introduction

Research in recent years has focused on bigger and more complex types of data. Traditional signals such as images and videos have increased in dimensionality and quantity. Non traditional signals have also gained prominence due to new sensing modalities and technologies: for example, signals produced within sensor, social, communication, transportation, and biological networks and the internet; immersive media arising in virtual/augmented/mixed reality applications, such as holograms, point clouds, and light fields; and high dimensional unstructured data, such as irregularly sampled spatio-temporal remote sensing data. The diversity and irregularity of these data sets forbid the use of standard tools designed for traditional regular spatio-temporal signals, while the increased data dimensionality and size call for algorithms that use computational resources efficiently. This dissertation studies the problem of learning signal models while addressing some of these challenges.

In Chapters 2, 3 and 4 we concentrate on graph learning from smooth signals. This problem arises in machine learning and in graph signal processing, where data samples are unorganized but expected to have a geometric structure.
Graphs provide a powerful mathematical abstraction that can represent these relations, thus leading to the problem of graph learning/estimation from data. Most of our attention is devoted to learning graphs with certain types of structures, known to enable the use of powerful graph algorithms. Some of these structures include trees, bipartite graphs, and sparse graphs.

In Chapters 5 and 6 we study covariance matrix estimation from partially observed data. Covariance matrices arise in a variety of areas such as machine learning, biology, medical imaging, finance, and signal processing. Given the diverse contexts where covariance matrices are needed, understanding the performance of covariance matrix estimators under different conditions is of great importance. We start by studying covariance estimation with data independent and data dependent missing data. Motivated by high dimensional problems, where data acquisition and communication are expensive, we extend this analysis to scenarios where one wishes to estimate the covariance matrix from a subset of variables. We apply this theory to design data adaptive and active estimators, suitable for problems where computational resources are limited. A more detailed description of these problems is presented in the following sections.

1.1 Graph learning for graph signal processing

Graphs are mathematical structures consisting of nodes (vertices) and links (edges), used in various fields to represent and analyze signals. In particular, in signal processing and machine learning, graphs have been extensively used for modeling high dimensional datasets, where a graph's nodes represent objects of interest and its edges encode pairwise relations between them. Applications of such graph-based models include image processing [121, 107], supervised and unsupervised learning [55, 112, 11], dynamical systems [143], and other network oriented problems [18, 15].
Graph signal processing (GSP) [146, 138, 116] has emerged as a promising toolbox to study modern datasets. The goal of GSP is to develop tools that exploit graph structure for studying functions defined on the nodes of a graph, i.e., graph signals. Many ideas from digital signal processing have been translated into this framework, including sampling [125, 5, 27, 100], filtering [60, 51, 74, 142], interpolation [111, 26], and wavelet transforms [110, 68, 34]. See [116] for a recent review.

In practice, graphs are represented by graph matrices, whose non-zero entries determine the graph structure and weights. The graph theory literature studies in detail the Laplacian and adjacency matrices [33, 14], while other, more data driven disciplines may prefer covariance, correlation, and precision matrices [88, 134]. While in some applications graphs are given or can be easily constructed, it is common for datasets to consist of an unstructured and unordered list of samples, for which the underlying graph representing the structural relations between samples is unknown. A fundamental problem is to characterize the unknown structure (i.e., set of edges) and affinity relations (weights) between the entries of a dataset. Effective graph-based modeling depends on the quality and type of graph. Thus, it is crucial to develop graph learning methods that optimize the graph topology and the link weights. These problems are often posed as optimization problems in which the optimization variable is a graph matrix such as the Laplacian. We will consider two instances of this problem, described in the following subsections.

1.1.1 Graph learning from smooth signals

Depending on the application and discipline, several approaches have been taken to learn graphs from data. Graph signal smoothness is a popular criterion for GSP applications. A graph signal is said to be smooth if, for all pairs of nodes connected by large weights, the signal values at those nodes are similar.
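As a concrete illustration of this smoothness notion (on a made-up 4-node graph, not one from the thesis), the combinatorial Laplacian L = D - W yields the quadratic form x'Lx, which sums the squared signal differences across edges weighted by the edge weights:

```python
import numpy as np

# Hypothetical 4-node weighted graph: W[i, j] is the edge weight between
# nodes i and j (0 means no edge). The combinatorial Laplacian is L = D - W.
W = np.array([[0.0, 2.0, 0.0, 0.0],
              [2.0, 0.0, 1.5, 0.0],
              [0.0, 1.5, 0.0, 0.1],
              [0.0, 0.0, 0.1, 0.0]])
D = np.diag(W.sum(axis=1))
L = D - W

# Laplacian quadratic form: x^T L x = sum_{i<j} W[i, j] * (x[i] - x[j])^2.
# Small values mean the signal varies little across heavily weighted edges.
def variation(x, L):
    return float(x @ L @ x)

smooth = np.array([1.0, 1.1, 1.2, 3.0])    # jumps only across the weak edge
rough = np.array([1.0, -1.0, 1.0, -1.0])   # oscillates across strong edges

print(variation(smooth, L))  # ~0.359: small
print(variation(rough, L))   # ~14.4: much larger
```

A signal that changes only across the weakly weighted edge (2, 3) is "smooth" in this sense, while one that oscillates across the strong edges has large variation; smoothness-based graph learning inverts this logic, choosing W so that observed signals have small x'Lx.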
Another graph learning strategy uses statistical modeling, in particular graphical models, where the graph structure encodes conditional statistical dependencies among signal coordinates. Given a collection of signals with coordinates assigned to the nodes of an unknown graph, we seek a graph learning framework in which 1) graph signals are smooth, 2) there is an associated probabilistic model, and 3) there exist efficient estimation algorithms.

1.1.1.1 Related work

Several papers have approached the graph learning problem from the smoothness perspective. In [36] the authors propose an algorithm that minimizes the variation of the squared graph Laplacian. Later work [72, 46, 12] considered the Laplacian variation form with l2 regularization on the Laplacian entries to control edge sparsity. More recently, a regularization on the graph degrees was proposed in [77, 78], which can be used in combination with l2 regularization to better control the graph sparsity.

Inspired by Gaussian graphical models [88, 134] and algorithms such as the graphical lasso [59], others have proposed estimating graph Laplacians that are inverse covariance matrices. The first approach that followed this strategy was [86], where a Laplacian with equal self loops was estimated to study brain connectivity. An approach that constrains the off-diagonal of the inverse covariance matrix to be non-positive was introduced in [147]. This method was shown to be consistent for this sub-class of matrices. A common element of methods based on inverse covariance estimation is that the log-determinant of the precision matrix is part of the objective function, which forces the solutions to be positive definite matrices. To estimate combinatorial graph Laplacians, which are singular, several papers have instead used the pseudo-determinant [52, 69, 53]. Graph Laplacians constructed using these algorithms are especially appealing for compression applications.
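The GMRF view underlying this line of work can be sketched numerically. Below, a generalized graph Laplacian (a positive definite matrix with non-positive off-diagonals) is built from a small toy graph and used as the precision matrix of a Gaussian; the graph, the shift delta, and the sample size are all illustrative choices, not values from the thesis:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy example: Theta = L + delta * I is a generalized graph Laplacian (GGL):
# positive definite with non-positive off-diagonal entries.
W = np.array([[0.0, 1.0, 0.5],
              [1.0, 0.0, 0.8],
              [0.5, 0.8, 0.0]])
L = np.diag(W.sum(axis=1)) - W
Theta = L + 0.2 * np.eye(3)  # precision matrix of an attractive GMRF

# Verify the defining GGL properties.
assert np.all(Theta[~np.eye(3, dtype=bool)] <= 0)  # off-diagonals <= 0
assert np.all(np.linalg.eigvalsh(Theta) > 0)       # positive definite

# The associated GMRF has covariance Sigma = Theta^{-1}; the sample
# covariance of draws from it approaches Sigma, which is why inverse
# covariance estimation with GGL constraints recovers the graph.
Sigma = np.linalg.inv(Theta)
X = rng.multivariate_normal(np.zeros(3), Sigma, size=50000)
S = X.T @ X / X.shape[0]
print(np.max(np.abs(S - Sigma)))  # small for large sample sizes
```

Samples from this model tend to be smooth on the graph, which is one way to see why GGL-constrained inverse covariance estimation doubles as a smoothness-based graph learning method.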
The eigenvectors of these optimized graph Laplacians have been used as alternatives to traditional transforms such as the discrete cosine transform (DCT). Applications include image [121, 58] and video [50, 124] coding, and compression of new types of signals such as light fields [23, 132, 164] and 3D sensing data [89].

1.1.1.2 Contributions

We propose using GGL matrices for graph signal processing. We recognize that inverse covariance estimation with GGL constraints can be viewed as a smoothness based graph learning method. Our main contribution is to use this formulation to propose a new graph learning algorithm that operates only with sparse matrices.

1.1.2 Graph learning with topology properties

Many graph based algorithms have exploited graphs with special properties, leading to theories of multi-resolution analysis for bipartite graphs [110, 155], trees [144, 61], block cyclic graphs [156, 157], and graphs with multiple connected components [158]. Sparse graphs are necessary for efficient implementations of filters and transforms [145, 68], while graph connectedness is linked to the convergence of algorithms [115]. Note that all the aforementioned properties can be described in terms of the connections (or lack thereof) between nodes of the graph. Motivated by these types of requirements, we study the problem of graph learning with topology properties, which is in general non-convex. In addition to requirements 1)-3) listed above, we seek algorithms that 4) are generic, in the sense that they can be applied to various types of graph properties, and 5) have theoretical performance guarantees.

1.1.2.1 Related work

Theoretical results relating to graph topology properties are available for the graphical lasso [59] only for acyclic graphs and graphs with multiple connected components. For the former, equivalence conditions between graphical lasso and thresholding were derived for acyclic graphs and sparse graphs in [149].
In [56], a sequel to [148, 149], it is shown that under certain conditions the graphical lasso solution for acyclic graphs has a closed form expression. The Chow-Liu algorithm [32] was proposed for approximating discrete probability distributions using tree structured random variables. The work of [154] studied the Chow-Liu algorithm for Gaussian graphical models. Algorithms for estimating combinatorial Laplacians of tree structured graphs were derived in [97]. The connected components of the graphical lasso solution were characterized in [102, 167]. These properties were exploited by [70] to develop a divide and conquer algorithm that can reduce the computational complexity of graphical lasso implementations. Estimation of generalized graph Laplacians with topology properties was first considered in our original work [120], which is contained in Chapter 4.

1.1.2.2 Contributions

The graph learning problem with topology properties is shown to be non convex. Our main contribution is a two step algorithm to find an approximate solution. The first step is graph topology inference (GTI), where we choose a graph topology with the desired property using any method (a generic, non-specific routine). The second step is called graph weight estimation (GWE), and involves solving a generalized graph Laplacian estimation problem with a topology constrained by the GTI step. Our main theoretical contribution is an error bound that relates the solution of this algorithm to any global optimum of the non convex problem. This bound is used to derive GTI algorithms that depend only on the covariance matrix of the data. We illustrate our method on trees, bipartite graphs, and sparse graphs. A secondary contribution is a set of results that characterize the non-zero pattern of the solution of graphical lasso with GGL constraints.
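For tree structured graphs, the GTI step mentioned above (the Chow-Liu algorithm) can be sketched as a maximum-weight spanning tree over pairwise mutual informations. The sketch below uses toy correlated data whose true conditional-dependence graph is a path; the data, the epsilon constants, and the implementation details are illustrative assumptions, not the thesis's algorithm:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import minimum_spanning_tree

# For Gaussian data, Chow-Liu reduces to a maximum-weight spanning tree over
# pairwise mutual informations I_ij = -0.5 * log(1 - rho_ij^2), which depend
# only on the (sample) covariance matrix.
rng = np.random.default_rng(1)
X = rng.standard_normal((500, 4))
X[:, 1] += X[:, 0]   # chain of dependencies: the true tree is
X[:, 2] += X[:, 1]   # the path 0 - 1 - 2 - 3
X[:, 3] += X[:, 2]
S = np.cov(X, rowvar=False)

d = np.sqrt(np.diag(S))
rho = S / np.outer(d, d)                     # correlation matrix
mi = -0.5 * np.log(1.0 - rho**2 + 1e-12)     # pairwise mutual informations
np.fill_diagonal(mi, 0.0)

# minimum_spanning_tree finds a minimum; applying a strictly decreasing
# transform to the (positive) MI weights turns it into a maximum-MI tree.
inv_w = np.triu(1.0 / (mi + 1e-12), k=1)
tree = minimum_spanning_tree(csr_matrix(inv_w))
edges = sorted((int(i), int(j)) for i, j in zip(*tree.nonzero()))
print(edges)  # recovers the path: [(0, 1), (1, 2), (2, 3)]
```

Any strictly decreasing transform of the edge weights works here, because a spanning tree of minimum total weight depends only on the ordering of the weights.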
In particular, we show that by looking at the sample covariance matrix and the regularization parameters we can infer some absent edges of the optimal graph, and thus predict some topology properties of the solution. These results are critical in the derivation and analysis of the proposed algorithm for graph learning with topology properties. Our algorithm is inspired by [70], but in addition to allowing decomposition into multiple connected components, we can handle what we call admissible graph properties, which include acyclic, bipartite, sparse, connected, and other graph properties detailed in Chapter 4. For tree structured graphs, we show that the GTI step is equivalent to the Chow-Liu algorithm, and the GWE step has an analytic solution.

1.2 Covariance estimation with partial observations

Covariance estimation involves the design and analysis of statistical procedures for recovering the covariance matrix from data samples. The covariance matrix is an essential component of many algorithms in machine learning [172], genomics [140], fMRI [18], finance [10], and signal processing [151, 35, 129], among others. Given the diverse contexts where covariance matrices are needed, studying covariance matrix estimators under different conditions, such as noise, high dimensionality, and partial observations, is of great importance. We consider two scenarios where partial observations occur. We first address missing data problems, where some observations are lost due to external uncontrollable factors, and the goal is to design statistically efficient estimators. We then move on to covariance estimation from dimensionality reduced data, where the focus is on algorithms and estimators that use computational resources efficiently. We study several covariance matrix estimators and present a unified view that encompasses both scenarios. Some of the problems/applications that motivate this work are listed next.

"Plug-in" estimators.
A wide array of algorithms and estimators use the sample covariance matrix as input. This "plug-in" principle can be used to extend these techniques to the missing data case by using estimators from this work that account for missing data. Examples that have followed this strategy include sparse subspace clustering [168], multi-task learning [73], subspace learning [65], classification [31], image compression [23], regression [93], principal component analysis [76, 95], and inverse covariance estimation [150, 80].

Applications with data acquisition costs. Obtaining complete observations might not be possible for some applications due to resource constraints. For example, in clinical studies, measuring certain features (e.g., acquiring samples or running tests) has monetary costs [161, 90], while in sensor networks obtaining measurements incurs energy consumption [7, 44, 62]. For this reason, data may be acquired based on the cost of acquiring each sample, as well as on information from previous observations. In these examples a covariance matrix may be required for statistical analysis, but standard covariance estimators are not well suited since observations are incomplete and possibly non i.i.d.

Statistical estimation and inference with partial observations. A very high ambient dimension can be problematic for data storage, transmission, or processing. But, since most natural signals are very structured, a few partial observations are often sufficient for performing simple statistical tasks such as estimation and detection [40]. Covariance estimation from compressed measurements has been studied for structured [37, 29] and unstructured covariance matrices [28, 8, 128, 127, 3]. Additionally, active learning approaches have been shown to benefit sequential hypothesis testing [109], Gaussian graphical model selection [38, 139], and covariance estimation [123].
1.2.1 Missing data

Even though missing data is pervasive, and covariance matrices are utilized by a plethora of algorithms, studies on covariance estimation with missing data are limited in quantity and depth. The works of [96, 20] stand out. For the complete observations case, the best bounds in operator norm for the estimation error of the sample covariance matrix were obtained for sub-Gaussian [96, 19] and Gaussian [81] distributions. All these bounds are characterized by the effective rank, a parameter that is upper bounded by the matrix rank, but that can be much smaller. In [96] the author considers an unbiased covariance estimator that can handle missing observations that follow a uniform Bernoulli distribution, and characterizes its performance under very general distribution assumptions. These results are comparable to state-of-the-art bounds with complete observations through the effective rank; however, the missing data mechanism is quite elementary, and the theoretical analysis is hard to extend to more realistic missing data mechanisms (e.g., non-uniform Bernoulli). A second interesting study [20] considers a more general missing data mechanism, although the population covariance matrix is required to be structured (sparse or bandable). Other papers have studied the effects of missing data in closely related problems like regression [93], inverse covariance estimation [80], and principal component analysis [95]. For various missing data mechanisms, we propose estimators and derive error bounds in operator norm. Our principal contributions are outlined below. For different types of missing data mechanisms (data dependent and independent) we derive error bounds under sub-Gaussian assumptions that are new or more general than previous missing data studies. We show all the error bounds are compatible with each other.
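To make the Bernoulli missing-data setting concrete, the following is a minimal NumPy sketch (not the thesis code) of the standard inverse-probability-weighted correction: when each entry is observed independently with known probability $p_i$, dividing the naive zero-filled sample covariance entrywise by $p_i p_j$ (and by $p_i$ on the diagonal) yields an unbiased estimate. The function name `masked_covariance` and all numerical values are illustrative assumptions.

```python
import numpy as np

def masked_covariance(X, mask, p):
    """Inverse-probability-weighted covariance estimator for Bernoulli-masked data.

    X    : (N, n) data matrix (unobserved entries may hold any value).
    mask : (N, n) binary observation pattern (1 = observed).
    p    : length-n vector of observation probabilities.
    """
    N = X.shape[0]
    Xm = X * mask
    S = Xm.T @ Xm / N                 # naive covariance of the zero-filled data
    P = np.outer(p, p)                # E[mask_i mask_j] for i != j
    np.fill_diagonal(P, p)            # E[mask_i^2] = p_i on the diagonal
    return S / P                      # entrywise bias correction

# Synthetic check: mask ~70% observed entries, compare against ground truth.
rng = np.random.default_rng(0)
n, N = 5, 20000
A = rng.standard_normal((n, n))
Sigma = A @ A.T / n                   # ground-truth covariance
X = rng.multivariate_normal(np.zeros(n), Sigma, size=N)
p = np.full(n, 0.7)
mask = (rng.random((N, n)) < p).astype(float)
Sig_hat = masked_covariance(X, mask, p)
err = np.linalg.norm(Sig_hat - Sigma, 2) / np.linalg.norm(Sigma, 2)
```

With enough samples the relative operator-norm error is small, consistent with the effective-rank style bounds discussed above (the correction inflates the variance by roughly $1/p^2$, which is the price of missingness).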
In addition, our bounds are comparable (using the effective rank) with state-of-the-art bounds for covariance estimation with complete and incomplete data.

1.2.2 Dimensionality reduced data

In applications such as clinical studies [161, 90] and sensor networks [7, 44, 62], there are data acquisition costs. We consider a general framework to model such problems, where computational resources are limited due to acquisition, communication or storage constraints. A common strategy to handle such limitations is to sacrifice accuracy by performing statistical inference and analysis on dimensionality reduced data. Several recent papers [8, 3, 128, 127, 28] have proposed dimensionality reduction schemes and estimators that can trade computational resources (computational complexity, memory, storage or other costs) against statistical performance. We study several covariance estimators that can handle different computational resource constraints. These estimators are based on variable sub-sampling (as in the missing data case) and can be non-adaptive, adaptive or active. For this setting, we show that the performance degradation laws are comparable to those obtained in missing data settings, even though the theoretical assumptions under which both are obtained are significantly different. Various estimators have been proposed to estimate the covariance matrix from low dimensional observations obtained from sequences of random independent matrix projections [3, 127, 8, 128, 28]. The disadvantage of these and other compressed sensing approaches is the high computational complexity of generating, storing and multiplying dense random matrices. Although sparse random matrices are used in [127, 3], and matrix multiplication has reduced complexity, the corresponding estimator has poor statistical performance and lacks theoretical guarantees.
The best approach in terms of compression and statistical performance is the algorithm from [28], which uses data adaptive sampling with replacement. This estimator can be inefficient since some observations may be repeated. Additionally, this algorithm needs to observe the complete data before computing the sampling distribution, thus it is not suitable for streaming/online scenarios. Active strategies for variable observations and covariance estimation were proposed in our previous work [123]; however, no error bounds were provided. Our contributions are covariance estimators and estimation error bounds in operator norm for the following settings. We are the first to use Bernoulli sampling for covariance estimation with dimensionality reduced data. We obtain bounds that do not impose any assumptions on the data distribution. Our theory is applied to covariance estimation with data compression constraints. We are the first to consider an active data acquisition scenario suitable for streaming/online applications with limited memory. We propose an estimator, provide error bounds, and show that its performance (computational and statistical) is comparable to or better than that of a non-adaptive streaming estimator. We show all the error bounds are compatible with each other. In addition, our bounds are comparable (using the effective rank) with state-of-the-art bounds for covariance estimation with complete and incomplete data.

1.3 Outline

This thesis is organized as follows. In Chapter 2 we introduce some graph signal processing concepts and review the literature on graph estimation algorithms. We present the proposed approach for graph learning in Chapter 3. In Chapter 4 we study the problem of graph learning with topology properties. Chapters 5 and 6 contain our results on covariance estimation with missing and dimensionality reduced observations, respectively. We summarize our contributions and discuss future research directions in Chapter 7.
Chapter 2 Graph theoretic background

In this section we review some concepts from graph theory and statistics that will be used throughout this document. We first introduce graphs and their corresponding Laplacian matrices. Later we review the literature on graph learning.

2.1 Notation

We denote matrices and vectors by bold letters. For a matrix $X = (x_{ij})$ its entries are denoted by $x_{ij}$ or $X_{ij}$, and for a vector $x$ its $i$-th entry is denoted by $x_i$ or $x[i]$. Positive definite (PD) and positive semi-definite (PSD) matrices are denoted by $X \succ 0$ and $X \succeq 0$, while the inequality notation $X \geq Y$ is entry-wise scalar inequality.

2.2 Graphs and Laplacians

Formally, a graph is denoted by $G = (V, E)$, where $V$ is the set of vertices (or nodes), and $E \subseteq V \times V$ is the set of edges (or links). An edge going from node $i$ to node $j$ is denoted by $(i,j)$. We will only consider undirected graphs, where $(i,j) \in E$ if and only if $(j,i) \in E$.

Definition 1 (Graph elements).
1. The 1-hop neighborhood of node $i$ is $N_i = \{j : (i,j) \in E\}$.
2. A path from node $i$ to node $j$ is any set of edges $P_{ij} = \{(i, i_1), (i_1, i_2), \dots, (i_k, j)\} \subseteq E$.
3. A cycle is a path from a node to itself.
4. A sub-graph of $G$ is a graph $\bar{G} = (\bar{V}, \bar{E})$, where $\bar{V} \subseteq V$ and $\bar{E} \subseteq E \cap (\bar{V} \times \bar{V})$.

Graphs can be classified according to some properties, which include the following.

Definition 2 (Graph properties).
- A graph is connected if for all $(i,j) \in V \times V$ there exists a path going from $i$ to $j$.
- A graph is called acyclic if it does not have cycles. Connected acyclic graphs are called trees.
- A graph has $M$ connected components if there are $M$ connected sub-graphs that form a partition of $G$, i.e., the sub-graphs are disjoint (no common vertices), and the union of the sub-graph vertices and edges returns the original graph.

Graph nodes and links can have weights assigned to them. Throughout this dissertation we will represent graphs with generalized graph Laplacian (GGL) matrices, formally defined as follows.

Definition 3.
[14] Let $G = (V, E, V, W)$ denote an undirected weighted graph, with edge weight matrix $W = (w_{ij})$ and vertex weight matrix $V$.
- The edge weight matrix (often called adjacency matrix) $W$ is symmetric and non-negative, with $w_{ij} \neq 0$ iff $(i,j) \in E$.
- The vertex weight matrix $V = \mathrm{diag}(v) = \mathrm{diag}(v_1, \dots, v_n)$ is diagonal and can have arbitrary values.
- The degree matrix is a diagonal matrix $D = \mathrm{diag}(d)$, where $d_i = \sum_{j=1}^n w_{ij}$.
- The generalized graph Laplacian (GGL) matrix is defined as $L = V + D - W$; we will only consider positive semi-definite $L$.

GGLs include most common types of Laplacian matrices:
- If $v = 0$ then $L$ is a combinatorial Laplacian.
- If $\mathrm{diag}(L) = \mathbf{1}$ and $L D^{-1/2} \mathbf{1} = 0$, then $L$ is a normalized Laplacian.

2.3 Graph signal processing

A fundamental notion in signal processing is that of frequency, as captured by the Fourier transform. In the graph setting, the graph Fourier transform (GFT) has generally been defined based on a graph operator (e.g., Laplacian, adjacency), where the Fourier modes correspond to its eigenvectors, and the discrete frequencies correspond to its eigenvalues. The Rayleigh quotient is a quantity that can be used for this purpose, and it is given by

    R(L, x) = (x^T L x) / (x^T x),    (2.1)

for any signal $x$. The GFT basis of a GGL matrix $L$ is given by its eigenvectors. The GFT has a variational interpretation as the solution of the following sequence of optimization problems:

    u_1     = argmin_{u}                     (u^T L u) / ||u||^2
    u_2     = argmin_{u ⊥ u_1}               (u^T L u) / ||u||^2
    ...
    u_{k+1} = argmin_{u ⊥ [u_1, ..., u_k]}   (u^T L u) / ||u||^2.

    Name    | Graph     | Directed | Has self-loop | Weighted | Matrix
    DFT     | circulant | yes      | no            | yes      | Adjacency
    DCT-II  | line      | no       | no            | no       | Adjacency, Laplacian
    DST     | line      | no       | yes           | no       | Adjacency, Laplacian

Table 2.1: Classic transforms viewed as GFTs. DCT and DST stand for discrete cosine transform and discrete sine transform, respectively [152]. The last column indicates which graph representation matrix may be used to define the GFT.

The Laplacian eigenvalues can be computed using the Rayleigh quotient: $\omega_i = R(L, u_i)$.
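The GFT and its relation to the Rayleigh quotient can be illustrated numerically. The sketch below (an illustrative assumption, not thesis code) builds the combinatorial Laplacian of a 4-node line graph, whose GFT is DCT-like per Table 2.1, and checks that each eigenvector's Rayleigh quotient recovers its eigenvalue and that the first eigenvector is constant with zero variation.

```python
import numpy as np

def gft(L, x):
    """Graph Fourier transform: expand x in the eigenbasis of L.

    np.linalg.eigh returns eigenvalues in ascending order, so the columns of U
    are ordered from lowest to highest graph frequency.
    """
    w, U = np.linalg.eigh(L)
    return w, U, U.T @ x

# Combinatorial Laplacian of a 4-node path (line) graph.
W = np.diag(np.ones(3), 1) + np.diag(np.ones(3), -1)
L = np.diag(W.sum(1)) - W
x = np.array([1.0, 2.0, 2.0, 1.0])
w, U, xhat = gft(L, x)

# Rayleigh quotient of each eigenvector equals its eigenvalue: R(L, u_k) = omega_k.
rayleigh = np.array([u @ L @ u / (u @ u) for u in U.T])
```

The inverse transform is simply $x = U \hat{x}$, since $U$ is orthonormal.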
The combinatorial Laplacian is easier to interpret, since the quadratic form may be written as

    x^T L x = \sum_{i<j} w_{ij} (x_i - x_j)^2,

which can be viewed as a measure of variation across edges. The first eigenvector is $u_1 = \mathbf{1}/\sqrt{n}$ and has minimal variation, that is, $u_1^T L u_1 = 0$. The variation (i.e., frequency) measured using the Rayleigh quotient increases with the eigenvector index. This can be quantified more precisely by the nodal domain theorems for GGL matrices [14]. The GFT can be defined using other graph matrices and variation operators [26, 5, 63]. In Table 2.1 we list some classical signal processing transforms that can be regarded as GFTs with appropriate choices of graph and graph matrix.

2.4 Estimation of graph Laplacian matrices

Graph construction is often guided by the application, but more generic methods are also quite popular. We can classify them into three broad categories: probabilistic, smoothness-based and spectral. In the next section we review probabilistic and smoothness-based methods in detail, with an emphasis on the material required for Chapters 3 and 4. Different perspectives on some of these methods can be found in the overview papers [47, 101].

2.4.1 Estimation of Gaussian Markov random fields

The probabilistic case assumes there is a random vector $x = [x_1, \dots, x_n]^T \in \mathbb{R}^n$ whose probability distribution is characterized by a graph. In this model, the graph encodes conditional dependencies among variables: an edge is absent from the graph if and only if $x_i$ and $x_j$ are conditionally independent given the rest of the variables [88, 134]. This is denoted by

    (i,j) ∉ E  ⟺  E[x_i x_j | x_k, ∀k ≠ i,j] = 0  ⟺  x_i ⊥ x_j | x_k, ∀k ≠ i,j.    (2.2)

Graphs defined this way are undirected. Graphs can also be defined based on causality relations, which leads to directed graphs [66]. However, in this thesis we only cover undirected graphs.
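A small numerical illustration of (2.2) in the Gaussian case (all values chosen purely for illustration): for a 4-variable chain model, zeros in the precision matrix correspond to conditional independences, even though the corresponding covariance entries are non-zero, which is exactly why the graph is read off the precision matrix and not the covariance.

```python
import numpy as np

# Precision matrix of a 4-variable chain x1 - x2 - x3 - x4: only consecutive
# variables are conditionally dependent, e.g. (1,3) is absent from the graph.
Theta = np.array([[ 2.0, -1.0,  0.0,  0.0],
                  [-1.0,  2.5, -1.0,  0.0],
                  [ 0.0, -1.0,  2.5, -1.0],
                  [ 0.0,  0.0, -1.0,  2.0]])
Sigma = np.linalg.inv(Theta)          # covariance: dense, all pairs correlated

# Partial correlation of (i, j) given the rest: -theta_ij / sqrt(theta_ii theta_jj).
Q = np.linalg.inv(Sigma)              # recover the precision matrix
d = np.sqrt(np.diag(Q))
partial_corr = -Q / np.outer(d, d)
np.fill_diagonal(partial_corr, 1.0)
```

The non-adjacent pair (1,3) has zero partial correlation while its marginal covariance is strictly positive.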
We are particularly interested in the family of Gaussian Markov random fields (GMRFs), or Gaussian graphical models, which have densities of the form

    f(x) = exp( -tr(Θ x x^T)/2 - Φ(Θ) ).    (2.3)

This type of density belongs to the class of exponential families [166]. $\Phi$ is called the log-partition function, and it ensures that $f(x)$ is properly normalized. $\Theta$ is symmetric and is usually called the precision, concentration or inverse covariance matrix. When $\Theta$ is non-singular and $\Phi$ is the log-determinant, (2.3) corresponds to a multivariate normal distribution with zero mean and covariance matrix $\Theta^{-1}$. If $\Theta$ is singular, (2.3) does not lead to a proper Gaussian density. Nevertheless, the distribution is still called a Gaussian Markov random field (GMRF) (see [134, 88]). In GMRFs, edges in the conditional independence graph defined in (2.2) correspond exactly to the non-zero off-diagonal entries of the precision matrix $\Theta$; thus the edge weights are the entries of the inverse covariance (precision) matrix. Learning a graph under this model amounts to estimating a possibly sparse precision matrix. Given $x_1, \dots, x_N$, i.i.d. copies of $x$, where $x$ is zero mean, the goal is to find an estimator $\hat{\Theta}$ for the precision matrix $\Theta$. Next we present some types of inverse covariance estimation problems.

Covariance selection. The maximum likelihood estimator of the precision matrix is the inverse of the sample covariance matrix; however, this estimator does not exist when $N < n$ [88], since the sample covariance matrix is singular. Dempster [43] was the first to suggest solving maximum likelihood estimation of the inverse covariance matrix with conditional independence constraints; that is, if the conditional independence graph of the true precision matrix is known, we can set some entries of the precision matrix estimator to zero. This estimator may exist for some graph topologies even if the sample covariance matrix is singular.
Examples include chordal and bipartite graphs [160], but existence conditions are hard to check for other graphs.

Regularized inverse covariance estimation. Other methods have been proposed to address the non-existence of the maximum likelihood estimator. The main tool is sparsity-promoting regularization [104, 171, 133, 39, 59, 21, 170]. The $\ell_1$-regularized maximum likelihood estimator, also known as the graphical lasso [59], is one of the more popular approaches, and corresponds to

    \hat{Θ} = argmin_{Q ≻ 0}  -log det(Q) + tr(QS) + ||Q||_{1,Λ},    (2.4)

where the first two terms equal $-\log \ell(x_1, \dots, x_N; Q)$, the weighted $\ell_1$ norm is $\|A\|_{1,\Lambda} = \sum_{i,j} \Lambda_{ij} |a_{ij}|$, and the sample covariance matrix is given by $S = \frac{1}{N} \sum_{i=1}^N x_i x_i^T$. The graphical lasso solution has been shown to be statistically consistent [131], and several efficient algorithms have been proposed [71, 59].

Generalized graph Laplacian (GGL) estimation. Attractive GMRFs are a subclass of GMRFs for which the partial correlations are only allowed to be non-negative, so that the corresponding precision matrix is a GGL matrix. An earlier approach [86] proposed an inverse covariance estimation problem with a particular type of GGL constraint with all equal and constant self-loop weights, but no efficient algorithm was offered. Adding GGL constraints to the graphical lasso produces the estimator

    \hat{L} = argmin_{L ≻ 0}  -log det(L) + tr(LS) + ||L||_{1,Λ}    (2.5)
              s.t.  L_{ij} ≤ 0,  ∀ i ≠ j.

The case without regularization ($\Lambda_{ij} = 0$) was first proposed in [147], and a solution to (2.5) was shown to exist unless there is a pair of variables such that $s_{ij} = \sqrt{s_{ii} s_{jj}}$. This condition is very mild, and holds even for $N < n$, unlike covariance selection and inverse covariance estimation, which require stronger assumptions. This estimator was further extended to allow for arbitrary regularization (as in (2.5)) and additional constraints on the non-zero pattern [122, 52]. We study this optimization problem in Chapter 3.

Combinatorial graph Laplacian (CGL) estimation.
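The role of the likelihood term in (2.4) can be checked numerically: without the $\ell_1$ penalty, the objective $-\log\det Q + \mathrm{tr}(QS)$ is stationarized exactly at $Q = S^{-1}$. The sketch below (an illustrative sanity check, not a graphical lasso solver) verifies that perturbing $S^{-1}$ only increases the objective; all names and values are assumptions.

```python
import numpy as np

def neg_log_lik(Q, S):
    """Likelihood term of (2.4): -log det(Q) + tr(QS), without the l1 penalty."""
    sign, logdet = np.linalg.slogdet(Q)
    return -logdet + np.trace(Q @ S)

rng = np.random.default_rng(2)
n, N = 4, 5000
A = rng.standard_normal((n, n))
X = rng.standard_normal((N, n)) @ A.T
S = X.T @ X / N                        # well-conditioned sample covariance (N >> n)
Q_mle = np.linalg.inv(S)               # unpenalized maximizer of the likelihood

# Small symmetric perturbations of the MLE should only increase the objective.
E = rng.standard_normal((n, n)); E = (E + E.T) / 2
vals = [neg_log_lik(Q_mle + t * 1e-3 * E, S) for t in (-1.0, 0.0, 1.0)]
```

The sparsity penalty and the GGL sign constraints in (2.5) then move the solution away from $S^{-1}$, which is what makes estimation possible when $S$ is singular or noisy.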
If a combinatorial Laplacian is sought, one can solve

    \hat{L} = argmin_L  -log|L|_+ + tr(LS) + ||L||_{1,Λ}    (2.6)
              s.t.  L_{ij} ≤ 0, ∀ i ≠ j;  L\mathbf{1} = 0.

Since $L$ is by definition a singular matrix, the pseudo-determinant $|\cdot|_+$, the product of the non-zero eigenvalues, is used instead of the determinant. This particular optimization problem was first considered in [52, 69] for applications in graph signal processing and consensus. Both papers derive efficient optimization algorithms to solve (2.6), with possible additional sparsity constraints. Although the GGL and CGL estimation problems look similar, they have very different solutions. This is evident for the case of tree structured graphs, for which both problems have closed form solutions. For the CGL case, the weights are functions of the distance between variables, while in the GGL case, weights are functions of the correlation coefficient. More details can be found in [97]. This work only considers the GGL case.

2.4.2 Smoothness based methods

Many graph algorithms assume data is smooth on the graph. The corresponding graph learning algorithm finds a graph on which the data is smoothest. These problems have the form

    min_{L ∈ C}  f(L) + g(L).

The set $C$ is a (possibly non-convex) subset of the set of combinatorial Laplacian matrices, and it might include several types of topology constraints. The function $f$ is a smoothness-promoting term based on a variation operator, which ensures the data is smooth on the graph. Popular choices are the p-Laplacian variation

    V_1(x) = \sum_{i,j} w_{ij} |x_i - x_j|^p,    (2.7)

and the q-th power of the Laplacian

    V_2(x) = x^T L^q x.    (2.8)

These lead to the average variations

    f_1(L) = (1/N) \sum_{k=1}^N V_1(x^{(k)}) = tr(WE),    (2.9)
    f_2(L) = (1/N) \sum_{k=1}^N x^{(k)T} L^q x^{(k)} = tr(L^q S) = ||L^{q/2} X||_F^2,    (2.10)

where $E_{ij} = \frac{1}{N} \sum_{k=1}^N |x_i^{(k)} - x_j^{(k)}|^p$. $f_1$ and $f_2$ are equal, and coincide with the probabilistic formulation from (2.6), when $p = 2$ and $q = 1$ respectively.
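The equivalence between the two average-variation measures for $p = 2$, $q = 1$ can be verified directly. The sketch below (illustrative, with the edge sum taken over unordered pairs so the two quantities match exactly) checks that $\mathrm{tr}(LS)$ equals both the average quadratic form and the edgewise sum $\sum_{i<j} w_{ij} E_{ij}$.

```python
import numpy as np

rng = np.random.default_rng(3)
n, N = 5, 50
W = rng.random((n, n)); W = np.triu(W, 1); W = W + W.T   # random symmetric weights
L = np.diag(W.sum(1)) - W                                # combinatorial Laplacian
X = rng.standard_normal((n, N))                          # columns are signals x^(k)
S = X @ X.T / N                                          # sample covariance

f2 = np.trace(L @ S)                                     # tr(L S), i.e. q = 1
f2_direct = np.mean([x @ L @ x for x in X.T])            # average quadratic form

# Average squared differences E_ij = (1/N) sum_k (x_i - x_j)^2, i.e. p = 2.
E = np.mean([np.subtract.outer(x, x) ** 2 for x in X.T], axis=0)
f1 = np.sum(np.triu(W * E, 1))                           # edge sum over i < j
```

All three numbers agree, which is why the smoothness-based formulations in Table 2.2 and the probabilistic formulation (2.6) share the same data-fidelity term.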
The regularization function $g$ is chosen to control sparsity and avoid trivial solutions. In Table 2.2 we show a comparison of different approaches that learn graphs by minimizing a variation function but using different types of regularization and constraints.

    Reference                                   | Variation f   | Regularization g                      | Constraint C
    Hu et al. [72]                              | f_2, q = 1,2  | ||L||_F^2                             | tr(L) = n
    Daitch et al. [36]                          | f_2, q = 2    | 0                                     | Σ_i max(0, 1 - d_i)^2 ≤ αn, or d_i ≥ 1 ∀i
    Dong et al. [46]                            | f_2, q = 1    | ||L||_F^2 + de-noiser                 | tr(L) = n
    Kalofolias, Kalofolias & Perraudin [77, 78] | f_1, p = 2    | -Σ_i log(d_i) + ||W||_F^2             | topology
    Berger et al. [12]                          | f_1, p = 1    | ||W||_F^2                             | a ≤ d_i ≤ b, Σ_i d_i = 2n
    Lake & Tenenbaum [86]                       | f_2, q = 1    | -log det(L + I) + ||W||_1             | none
    Hassan-Moghaddam & Jovanovic [69],          | f_1, p = 2    | -log det(L + (1/n)11^T) + ||W||_1     | topology
    Egilmez et al. [52]                         |               |                                       |

Table 2.2: Combinatorial Laplacian estimation algorithms that minimize average variation.

Other graph signal processing works include [130]. Chepuri et al. [30] learn an unweighted graph by minimizing $f_1$ with $p = 2$. Peretic et al. [98] consider joint graph learning of the normalized Laplacian and signal de-noising under a sparse signal representation over a graph dictionary.

2.4.3 Spectral methods

The spectral framework assumes that the covariance matrix of the data and the graph matrix (Laplacian, adjacency, shift) have the same eigenvectors [141, 118, 41]. This has two origins: stationarity of graph signals, and graph diffusion. Several definitions of graph stationarity have been introduced; in many of them, a necessary condition is that the covariance matrix and the graph matrix are simultaneously diagonalizable. In addition, applying some popular choices of graph diffusion operators to white Gaussian signals produces stationary graph processes (see [53] for a list of several diffusion operators).
2.5 Applications of Laplacian matrices

To motivate the variation-based graph learning methods, we describe a few well known applications that use graphs and involve some type of graph variation minimization.

Graph based semi-supervised learning. Semi-supervised learning algorithms are a family of classification algorithms designed for problems where class labels are scarce, but there is a large amount of unlabeled data available. Denote by $z_1, \dots, z_{n_1}$ a set of labeled feature vectors with corresponding labels $y_1, \dots, y_{n_1}$, and by $z_{n_1+1}, \dots, z_{n_2}$ a set of unlabeled feature vectors. A graph for the labeled and unlabeled samples is usually constructed using

    w_{ij} = exp( -||z_i - z_j||^2 / (2σ^2) ),  ∀ i, j = 1, ..., n_2,

or another similarity function. Then the unknown labels can be estimated by solving optimization problems of the form

    \hat{y} = argmin_x  V(x, L) + μ \sum_{i=1}^{n_1} (y_i - x_i)^2,    (2.11)

where $V$ is a graph variation function, e.g., (2.7) or (2.8), and $\mu > 0$ weights the data-fidelity term. Variants of (2.11) with the quadratic Laplacian form as variation operator are analyzed in [24, Ch. 11].

Image de-noising. Consider a vectorized image $z$ with $n$ pixels, contaminated by noise $e$:

    y = z + e.

Milanfar [107] observes that many image denoising algorithms output the filtered image

    \hat{z} = D^{-1} W y,    (2.12)

with $D$ and $W$ a degree and adjacency matrix, respectively. The difference between the methods described in [107] lies in how the edge weights are constructed. Nonetheless, all of them correspond to similarities between noisy pixels of the form

    w_{ij} = W(y_i, y_j, p_i, p_j),

with $p_i$ being the Cartesian coordinates of pixel $i$. More interestingly, if we construct a bipartite graph on $2n$ vertices, with combinatorial Laplacian matrix

    \tilde{L} = [ D  -W ; -W  D ],

the solution of (2.12) can be obtained as the minimizer of

    min_x  x^T \tilde{L} x   s.t.  x_i = y_i, ∀ i = 1, ..., n.    (2.13)

Solving (2.13) produces a filtered image whose pixels are a convex combination of similar noisy pixels. This method is predictive by definition.
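The semi-supervised problem (2.11) with the quadratic variation $V(x, L) = x^T L x$ has a closed-form solution: setting the gradient to zero gives $(L + \mu M) x = \mu M y$, where $M$ is the diagonal indicator of labeled nodes. A minimal NumPy sketch (function name, graph, and weight $\mu = 10$ are illustrative assumptions):

```python
import numpy as np

def ssl_labels(L, y_obs, labeled, mu=10.0):
    """Closed-form minimizer of  x^T L x + mu * sum_{i labeled} (y_i - x_i)^2."""
    n = L.shape[0]
    M = np.zeros((n, n))
    M[labeled, labeled] = 1.0
    return np.linalg.solve(L + mu * M, mu * M @ y_obs)

# Two 3-node clusters joined by one weak edge; one labeled node per cluster.
W = np.zeros((6, 6))
for i, j, w in [(0,1,1), (1,2,1), (0,2,1), (3,4,1), (4,5,1), (3,5,1), (2,3,0.05)]:
    W[i, j] = W[j, i] = w
L = np.diag(W.sum(1)) - W
y = np.array([1.0, 0, 0, 0, 0, -1.0])        # labels only at nodes 0 and 5
x = ssl_labels(L, y, labeled=np.array([0, 5]))
```

The smoothness term propagates the two labels through their clusters, so the first three entries come out positive and the last three negative, with the weak bridge edge keeping the clusters nearly decoupled.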
Another popular approach is the one based on graph regularization [92, 117], which solves

    min_x  ||x - y||^2 + μ x^T L x,

where $\mu > 0$ controls the amount of smoothing.

Optimization in networks. There has been growing interest in solving large scale optimization problems for network/graph structured data [67, 159]. Recently, several algorithms have been proposed to solve, possibly in a distributed fashion, large scale problems of the form

    min_{x_1, ..., x_n}  \sum_i f_i(x_i) + \sum_{i,j} w_{ij} ||x_i - x_j||_p.

This problem has been called the network lasso in [67], when applied to regression problems on graphs.

Chapter 3 Generalized graph Laplacian estimation

3.1 Introduction

Graphs can be represented by different types of matrices [33, 137] (such as adjacency and graph Laplacian matrices), whose non-zero entries correspond to edges in the graph. The choice of a matrix type may depend on modeling assumptions, properties of the desired graph, applications and theoretical requirements. In this work, we will consider generalized graph Laplacian (GGL) matrices [14], which are positive definite matrices with non-positive off-diagonal entries, used to represent undirected graphs with non-negative weights and self-loops. GGL matrices arise in statistics as precision matrices of attractive Gaussian Markov random fields [147]. They have also been applied to image and video coding [121, 50, 23], analyzing electrical circuits [48] and machine learning [83, 84]. Since GGL matrices appear as precision matrices of attractive Gaussian Markov random fields [147, 87], they provide a probabilistic model for graph signals. Additionally, the concept of graph signal smoothness, introduced in Chapter 2, appears naturally during maximum likelihood estimation of attractive GMRFs. The combination of graph theoretical and statistical properties makes GGL matrices especially suitable for graph signal processing applications. This chapter tackles the problem of learning a GGL matrix from data by solving a convex optimization problem.
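The graph-regularized denoiser above also has a closed form, $\hat{x} = (I + \mu L)^{-1} y$, obtained by setting the gradient of $\|x - y\|^2 + \mu x^T L x$ to zero. A minimal sketch on a path graph (all names and parameter values are illustrative assumptions):

```python
import numpy as np

def graph_tikhonov(L, y, mu):
    """Closed-form minimizer of  ||x - y||^2 + mu * x^T L x."""
    return np.linalg.solve(np.eye(L.shape[0]) + mu * L, y)

# Path-graph combinatorial Laplacian; denoise a noisy constant signal.
n = 20
W = np.diag(np.ones(n - 1), 1) + np.diag(np.ones(n - 1), -1)
L = np.diag(W.sum(1)) - W
rng = np.random.default_rng(4)
z = np.ones(n)                          # smooth (constant) ground truth
y = z + 0.3 * rng.standard_normal(n)    # noisy observation
xhat = graph_tikhonov(L, y, mu=5.0)
```

In the GFT domain this filter scales each coefficient by $1/(1 + \mu \omega_k)$: the constant (zero-frequency) component passes unchanged while high-frequency noise is attenuated, so the denoised signal is closer to the ground truth than the observation.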
The proposed graph learning method takes as input a symmetric similarity matrix $K$, for example a covariance matrix, and solves

    min_{L ≻ 0, l_{ij} ≤ 0, i ≠ j}  -log det(L) + tr(LK).    (3.1)

We also provide a block coordinate minimization algorithm to solve (3.1). This algorithm is dual to the one proposed in [147], where the problem is considered from a statistical estimation point of view and no connection is made with GGL matrices [14] or with GSP [146, 116] applications. This chapter is organized as follows. In Section 3.2 we introduce notation on graphs, GSP and GGL matrices. In Section 3.3 we analyze the solution of (3.1) from a GSP point of view and derive our algorithm. In Section 3.4 we show experimental results with texture images. We end with conclusions and future work in Section 3.5.

3.2 Properties of generalized graph Laplacians

Following the notation from Chapter 2, we denote an undirected graph by $G = (V, E, L)$, with node set $V = \{1, 2, \dots, n\}$, edge set $E$ and a GGL matrix $L$. Recall that the edge set and the non-zero pattern of the GGL matrix obey: for all $i \neq j$, $(i,j) \notin E$ if and only if $l_{ij} = 0$. In this section we review properties of GGL matrices. We start with spectral properties, which are central to the definition of the graph Fourier transform, graph frequencies and notions of variation of graph signals. Later we review the connection of Laplacian matrices with Gaussian Markov random fields, and motivate this model for GSP applications.

3.2.1 Spectral properties

The term generalized graph Laplacian was adopted from the graph theory book [14]. These matrices are also known as M-matrices. The following property is of great importance.

Proposition 1. [57] Consider a non-singular GGL matrix $L$ with orthonormal eigendecomposition $L = U \Omega U^T$ and $\Omega = \mathrm{diag}(\omega_1, \omega_2, \dots, \omega_n)$, where the $\omega_i$ are sorted in increasing order of magnitude. The following properties hold.
1. $L^{-1} \geq 0$ (entry-wise).
2. The first eigenvector satisfies $u_1 > 0$ (entry-wise).
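Both claims of Proposition 1 are easy to observe numerically. The sketch below (illustrative values; the diagonal shift makes $L$ a non-singular, diagonally dominant GGL of a connected graph) checks that the inverse is entry-wise non-negative and that the first eigenvector has constant sign.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 6
W = rng.random((n, n)); W = np.triu(W, 1); W = W + W.T  # positive weights: connected
L = np.diag(W.sum(1) + 0.1) - W                         # non-singular GGL (M-matrix)

Linv = np.linalg.inv(L)
w, U = np.linalg.eigh(L)                                # eigenvalues ascending
u1 = U[:, 0]
u1 = u1 if u1.sum() > 0 else -u1                        # fix the arbitrary sign
```

Since the graph here is connected with strictly positive weights, Perron-Frobenius theory even gives strict positivity of both $L^{-1}$ and $u_1$.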
The first property indicates that the spectral properties of GGL matrices are closely related to those of non-negative matrices [106]. The second property is related to the variation of the first eigenvector of $L$. We can define the smoothness of a graph signal following what is typically done with the combinatorial Laplacian and say that $x$ is smoother than $y$ iff

    R(L, x) < R(L, y).

Recall that the Rayleigh quotient, defined previously in (2.1), is minimized by $u_1$, and for $\omega_i < \omega_j$, $u_i$ will be smoother (have smaller frequency) than $u_j$. Then the eigenvectors of a GGL can be used to define the GFT of a signal $x$ as $\hat{x} = U^T x$. Another approach to get some insight into the properties of the GFT is based on the discrete nodal domain theorems for GGL matrices.

Definition 4. [14] A strong (weak) nodal domain of an eigenvector $u$ is a connected sub-graph with vertex set $\tilde{V}$ so that $u_i u_j > 0$ ($u_i u_j \geq 0$) for all $i, j \in \tilde{V}$.

Theorem 1. [14] Let $L$ be the GGL of a connected graph. The eigenvector $u_k$ with eigenvalue $\omega_k$ and multiplicity $r$ has at most $k$ weak nodal domains and $k + r - 1$ strong nodal domains.

The first eigenvector of a connected graph has one nodal domain, since it has constant sign (Proposition 1). Nodal domains can be interpreted as regions or clusters of nodes where eigenvectors are smooth. For connected graphs, the $k$-th eigenvector has at most $k$ weak nodal domains; thus as $k$ increases, eigenvectors have more sign changes along paths in the graph.

3.2.2 Gaussian Markov random fields

Gaussian Markov random fields (GMRFs), or Gaussian graphical models, are multivariate Gaussian distributions parameterized in terms of the inverse covariance matrix (also known as the precision matrix).
Assume that $Q \succ 0$ is the precision matrix of a zero-mean GMRF; the Gaussian assumption implies the partial correlations are given by [134, 88]

    Corr[x[i], x[j] | x[k] : k ≠ i, j] = -q_{ij} / \sqrt{q_{ii} q_{jj}}.    (3.2)

Hence, the non-zero pattern of $Q$ encodes the conditional independence relations within the graph signal, i.e., $x[i]$ and $x[j]$ are independent conditioned on all other variables iff $q_{ij} = 0$. A second important property is the regression formula

    E[x[i] | x[j] : j ≠ i] = -(1/q_{ii}) \sum_{j ≠ i} q_{ij} x[j].

When the precision matrix is a GGL ($Q = L$), we have an attractive GMRF. This comes from the fact that all conditional correlations are positive; therefore all variables can be represented, in expectation, as a non-negative linear combination of their neighbors in the conditional dependence graph.

3.2.3 Interpretation

The information provided in this section implies that the optimization problem in (3.1) can be looked at from two perspectives.

Smoothness perspective. Assuming the data vectors $x_1, \dots, x_N$ are graph signals with sample covariance matrix

    K = (1/N) \sum_{i=1}^N x_i x_i^T,

we can write the second term of (3.1) as

    tr(KL) = (1/N) \sum_{i=1}^N x_i^T L x_i = \sum_{i=1}^n \sum_{j=1}^n l_{ij} k_{ij}.    (3.3)

The first equality indicates that minimizing tr(KL) promotes average smoothness. The second equality indicates that when $k_{ij}$ is large and positive, the minimization will enforce large negative off-diagonal values on $l_{ij}$, i.e., large graph weights. On the other hand, when $k_{ij}$ is negative, the graph weights will be pushed toward zero, therefore disconnecting nodes $i$ and $j$. The $-\log\det$ term acts as a barrier on the minimum eigenvalue of $L$, thus enforcing the positive definite constraint.

Probabilistic perspective. Let $x$ be an $n$-dimensional attractive GMRF with zero mean, and let the data vectors $x_1, \dots, x_N$ be independent identically distributed (i.i.d.) realizations. The solution of (3.1) is a maximum likelihood estimator of the precision matrix of an attractive GMRF.
3.3 Estimation of GGL matrices

In this section we analyze the solution of (3.1). The sign constraints on each of the entries can be handled by introducing a corresponding matrix of Lagrange multipliers, $\Lambda$. The Lagrangian is $-\log\det(L) + \mathrm{tr}(KL) + \mathrm{tr}(\Lambda L)$, with $\Lambda = \Lambda^T = (\lambda_{ij}) \geq 0$ for $i \neq j$ and zero diagonal elements. Strong duality is easy to verify for this problem, and since it is also convex, the solution is unique [17]. Furthermore, the Karush-Kuhn-Tucker (KKT) conditions are necessary and sufficient for optimality and correspond to

    -L^{-1} + K + Λ = 0    (3.4)
    Λ ≥ 0    (3.5)
    l_{ij} ≤ 0, ∀ i ≠ j    (3.6)
    λ_{ij} l_{ij} = 0, ∀ i ≠ j.    (3.7)

3.3.1 A dual block coordinate descent algorithm

In order to derive a block coordinate descent algorithm for solving (3.1), we analyze the KKT conditions. Consider a partition of $K$, $\Lambda$ and $L^{-1}$, as shown below for $L$:

    L = [ L_{11}  l_{12} ; l_{12}^T  l_{22} ],    (3.8)

where $L_{11}$ is an $(n-1) \times (n-1)$ sub-matrix, $l_{12}$ is a column vector of size $n-1$, and $l_{22}$ is a scalar. The inverse $L^{-1}$ has a closed form expression in terms of its block components:

    L^{-1} = [ (L_{11} - (1/l_{22}) l_{12} l_{12}^T)^{-1}   -(1/c) L_{11}^{-1} l_{12} ; -(1/c) l_{12}^T L_{11}^{-1}   1/c ],    (3.9)

with $c = l_{22} - l_{12}^T L_{11}^{-1} l_{12}$. Suppose $L$, $\Lambda$ satisfy the KKT conditions; then using the identity above, the KKT conditions for one of the rows or columns become the simplified system

    L_{11}^{-1} l_{12} / (l_{22} - l_{12}^T L_{11}^{-1} l_{12}) + k_{12} + λ_{12} = 0,    (3.10)
    l_{22} - l_{12}^T L_{11}^{-1} l_{12} = 1/k_{22}.    (3.11)

By grouping these equations with the inequality and complementary slackness conditions for $\lambda_{12}$ and $l_{12}$, we may write the system

    k_{22} L_{11}^{-1} l_{12} + k_{12} + λ_{12} = 0    (3.12)
    λ_{12} ≥ 0    (3.13)
    l_{12} ≤ 0    (3.14)
    λ_{12} ∘ l_{12} = 0,    (3.15)

where $\circ$ denotes the Hadamard product. Suppose $L_{11}$ is fixed and known; then the set of equations (3.12)-(3.15) corresponds to the KKT conditions for the optimization problem

    min_{l_{12} ≤ 0}  (1/2) l_{12}^T L_{11}^{-1} l_{12} + (1/k_{22}) l_{12}^T k_{12}.    (3.16)

The block coordinate descent algorithm proposed in [147] iterates over all rows/columns of $L$ and solves (3.16).
This approach does not require inverting $L_{11}$ at each iteration; instead it maintains the inverse $L^{-1}$ and updates both matrices at each iteration using Schur complements and the block partitioned matrix inverse formula (3.9). To avoid updating a possibly dense inverse at each iteration, we instead solve the dual of (3.16). We can obtain it by finding an equivalent set of equations for (3.12)-(3.15). After multiplying both sides of (3.12) by $L_{11}$ we get

    l_{12} + (1/k_{22}) L_{11} (k_{12} + λ_{12}) = 0.    (3.17)

Considering $\lambda_{12}$ as the optimization variable, the KKT conditions (3.13)-(3.15) and (3.17) also characterize the solution of

    min_{λ_{12} ≥ 0}  (k_{12} + λ_{12})^T L_{11} (k_{12} + λ_{12}).    (3.18)

Once $\lambda_{12}$ is found, $l_{12}$ can be updated using (3.17). The diagonal element $l_{22}$ can be updated by combining (3.11) and (3.17) to get

    l_{22} = (1/k_{22}) (1 - l_{12}^T (k_{12} + λ_{12})) = (1/k_{22}) (1 - l_{12}^T k_{12}).    (3.19)

This procedure can be repeated over all rows of $L$ until convergence. Pseudo code is presented in Algorithm 1. Notice that if we start with a sparse $L$, and at each iteration set to zero the entries of $l_{12}$ that have a positive Lagrange multiplier, there will be lower computation and memory use, as we are working with sparse matrices and vectors, unlike the algorithm from [147], in which the dense inverse is updated at each iteration. The algorithm proposed in [147] has a counterpart in the primal graphical lasso [103], while Algorithm 1 corresponds to the dual-primal graphical lasso [103].

Algorithm 1 GGL estimation
Require: L^(0) = (diag(K))^{-1}
 1: while not converged do
 2:   choose i_t ∈ {1, ..., n}
 3:   L_{11} ← L^(t-1) with its i_t-th row/column removed
 4:   k_{12} ← i_t-th column of K with its i_t-th entry removed
 5:   k_{22} ← (i_t, i_t) entry of K
 6:   λ_{12} ← argmin_{λ ≥ 0} (k_{12} + λ)^T L_{11} (k_{12} + λ)
 7:   l_{12} ← -(1/k_{22}) L_{11} (λ_{12} + k_{12})    (3.17)
 8:   l_{22} ← (1/k_{22}) (1 - l_{12}^T k_{12})    (3.19)
 9:   L^(t) ← update with L_{11}, l_{12} and l_{22}
10:   t ← t + 1

3.3.2 Incorporating additional information

Some graph structures may be more desirable than others.
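The row/column updates can be sketched directly in NumPy. The following is an illustrative implementation of the iteration (not the thesis code): the dual subproblem (3.18) is solved with a simple projected gradient loop (any non-negative QP solver would do), and each row is then updated via (3.17) and (3.19).

```python
import numpy as np

def nonneg_qp(L11, k12, iters=500):
    """Projected gradient for the dual subproblem (3.18):
    min_{lam >= 0} (k12 + lam)^T L11 (k12 + lam)."""
    lam = np.zeros_like(k12)
    step = 0.5 / np.linalg.norm(L11, 2)      # safe step: half of 1/Lipschitz const
    for _ in range(iters):
        lam = np.maximum(lam - step * 2.0 * L11 @ (k12 + lam), 0.0)
    return lam

def ggl_estimate(K, sweeps=50):
    """Block coordinate descent sketch of Algorithm 1 (dual row/column updates)."""
    n = K.shape[0]
    L = np.diag(1.0 / np.diag(K))            # initialization L^(0) = (diag K)^(-1)
    for _ in range(sweeps):
        for i in range(n):
            idx = np.delete(np.arange(n), i)
            L11 = L[np.ix_(idx, idx)]
            k12, k22 = K[idx, i], K[i, i]
            lam = nonneg_qp(L11, k12)
            l12 = -L11 @ (k12 + lam) / k22   # update (3.17)
            L[idx, i] = L[i, idx] = l12
            L[i, i] = (1.0 - l12 @ k12) / k22  # update (3.19)
    return L

# Sanity check on synthetic data: K is the exact inverse of a known GGL.
rng = np.random.default_rng(6)
n = 5
W = rng.random((n, n)); W = np.triu(W, 1); W = W + W.T
L_true = np.diag(W.sum(1) + 1.0) - W         # PD generalized graph Laplacian
K = np.linalg.inv(L_true)
L_hat = ggl_estimate(K)
rel_err = np.linalg.norm(L_hat - L_true) / np.linalg.norm(L_true)
```

Since $K^{-1}$ is itself a GGL here, the sign constraints are inactive at the optimum and the iteration should recover $L_{\mathrm{true}}$ to high accuracy; the inner solver is only a stand-in for the more careful dual updates described above.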
For example, in sensor networks, nodal measurements such as temperature form a graph signal, and one would expect nearby sensors to be more strongly connected in the graph. In image processing, the distances between pixels should influence the graph topology. To add such information, we can re-weight the trace term as follows:

tr(L(K ⊙ Z)) = Σ_{i=1}^n Σ_{j=1}^n l_ij k_ij z_ij.

A reasonable choice is the kernel z_ij = exp(−‖y_i − y_j‖²/σ²), where y_i is the coordinate vector of the i-th sensor or pixel. Algorithm 1 does not need to be modified, since only its input changes. This re-weighting method is similar to the ones used in adaptive image filters [107, 60].

3.4 Experiments

We consider textures from the USC-SIPI Brodatz¹ dataset. In particular, we use two rotated versions of the wood image, at 0 and 60 degree angles. We take 8×8 image blocks, vectorize them, and construct a covariance matrix K. We then apply Algorithm 1 with inputs K, K ⊙ W_Gaussian and K ⊙ W_8conn, where W_Gaussian is a Gaussian kernel with σ = 2 computed from distances between pixel coordinates, and W_8conn is a mask matrix with values 1 or 0 such that w_ij = 1 if and only if the i-th and j-th pixels are horizontal, vertical or diagonal neighbors, and w_ij = 0 otherwise. In Figure 3.1 we show the graphs for each image, which consist of the magnitudes of the off-diagonal elements of the estimated precision matrices. In the last column we show the method from [46]. We manually choose the regularization parameter (set to 6), which roughly controls sparsity, and since we use noiseless data we set the denoising parameter to 0. The GGL graphs allow for any type of connection, hence the larger weights are well aligned with the texture orientation. By including the Gaussian and 8-connected masks, the GGL estimation algorithm is told to only look at covariances between pixels in a neighborhood, thus encouraging solutions that connect pixels to their close neighbors.
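The two re-weighting matrices used in this experiment can be built directly from the pixel coordinates. A minimal sketch (function names ours; grid size 8×8 and σ = 2 as in the text):

```python
import numpy as np

def gaussian_kernel_weights(rows, cols, sigma=2.0):
    """z_ij = exp(-||y_i - y_j||^2 / sigma^2) for pixel coordinates y_i on a grid."""
    ys = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    d2 = ((ys[:, None, :] - ys[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma ** 2)

def eight_connected_mask(rows, cols):
    """w_ij = 1 iff pixels i and j are horizontal, vertical or diagonal neighbors."""
    ys = np.array([(r, c) for r in range(rows) for c in range(cols)])
    cheb = np.abs(ys[:, None, :] - ys[None, :, :]).max(-1)  # Chebyshev distance
    return (cheb == 1).astype(float)
```

The input to Algorithm 1 is then the entrywise product, e.g., K * gaussian_kernel_weights(8, 8).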
The effect of re-weighting can be seen clearly for the wood060 image, whose graphs lose directional information when constructed using re-weighted covariances. For the wood000 texture, the use of re-weighted covariance matrices makes the resulting graphs more regular. All GGL matrices are sparse and have around 100 edges, while a fully connected graph has 2016 edges. In the last column we show the graphs constructed using the method from [46], which also follow the texture orientation precisely. However, the algorithm of [46] has several parameters that are hard to set, and sparse solutions are hard to obtain, as we observed in the implementation.

3.5 Conclusion

In this chapter we formulated the graph learning problem as precision matrix estimation with generalized graph Laplacian constraints. We proposed a novel block coordinate descent algorithm to solve the optimization and applied it to precision matrix estimation and texture graph learning. The proposed approach accurately estimates the non-zero pattern and the entries of the GGL matrix. For texture images, our algorithm learns a meaningful and sparse graph that follows the texture orientation. The proposed algorithm, along with more recent improvements [52, 53], is a first order method and does not scale well beyond a few hundred nodes. An interesting direction to pursue is techniques that improve convergence at low complexity, e.g., second order methods and greedy coordinate selection rules.

¹ http://sipi.usc.edu/database/

(a) wood000; (b) GLP, |E| = 130; (c) GLP + kernel, |E| = 112; (d) GLP + 8-con., |E| = 113; (e) LsigRep, |E| = 1998; (f) wood060; (g) GLP, |E| = 117; (h) GLP + kernel, |E| = 161; (i) GLP + 8-con., |E| = 105; (j) LsigRep, |E| = 1998.

Figure 3.1: Texture graphs using our GLP estimation and the LsigRep method from [46]. We consider wood textures, both original and with 60 degree rotation.
We normalize each graph by its largest entry and show only off-diagonal entries.

Chapter 4

Learning graphs with topology properties

4.1 Introduction

In the context of graph signal processing [146, 136], algorithms have been proposed that are restricted to (or operate more efficiently on) graphs with specific topology properties. For example, bipartite graphs are required for perfect reconstruction two-channel filter banks [110], and M-block cyclic graphs are required for designing M-channel filter banks [156, 157]. Tree structured graphs have been used to design multiresolution transforms [61, 144] and sampling algorithms [113]. Decomposition of a graph into connected sub-graphs can also help in constructing new multiresolution analyses for graph signals [158] and in analyzing the performance of spectral clustering [165], and can be useful to scale inverse covariance estimation algorithms [70]. A bipartite graph with multiple connected components is used in [114] for co-clustering. Sparsity, which is often a desirable graph property, can be achieved by tuning ℓ1-regularization parameters in the graph learning optimization [52]. In contrast, none of the aforementioned topology properties (e.g., bipartition or a certain number of connected components) can be directly achieved by adding penalty functions to an existing graph learning problem. In this chapter, we consider a graph learning problem where the goal is to find a graph that best fits the data while having a specific topology property. For instance, if we are interested in the family of tree structured graphs, we will need to find the best tree topology and weights for the data. The problem of graph learning with topology properties naturally generalizes the work from [52, 122, 147], where the graph structure was either fixed or unknown. Specifically, given data and a desired topology property, our goal is to solve the following problem.

Problem 1.
Let L⁺_n be the set of n×n positive definite GGL matrices, let S be a similarity (e.g., covariance or kernel) matrix computed from data, and let F be a family of graphs with a certain property. The goal is to find a GGL matrix L of a weighted graph G(L) that solves:

min_L −log det(L) + tr(SL),  s.t. L ∈ L⁺_n, G(L) ∈ F.

The objective function in Problem 1 is convex [17], and its optimization can be viewed from a probabilistic perspective as inverse covariance estimation [43, 131, 52], as matrix approximation using a log-determinant Bregman divergence [45], or as a graph learning method as in Chapter 3. However, the constraint set {L ∈ L⁺_n : G(L) ∈ F} is not convex for many graph families (see Proposition 2), making Problem 1 non-convex, and thus intractable. We propose a two step procedure (Algorithm 2) to find an approximate solution for Problem 1. The first step is graph topology inference (GTI), where we can use any algorithm that returns a graph with the desired topology property, in particular algorithms that operate directly on the similarity matrix S, i.e., without attempting to find its inverse. The second step is graph weight estimation (GWE), where we solve the convex problem of estimating a GGL matrix with an edge set contained within the topology found in the first stage, using for example the algorithm from Chapter 3 or the more recent [52]. Our main contributions are a set of theoretical results that lead to performance bounds for the aforementioned two-step graph learning framework. In Section 4.4 we analyze the following weighted ℓ1-regularized GGL estimation problem.

Problem 2. Given an arbitrary non-negative regularization matrix Λ = (λ_ij) and a similarity matrix S, find L that solves

min_L −log det(L) + tr(SL) + Σ_{i≠j} λ_ij |l_ij|,  s.t. L ∈ L⁺_n.

We characterize topology properties of the solution of Problem 2 as a function of Λ and S. This set of results is crucial in the derivation of the proposed algorithm and its performance guarantees.
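For concreteness, the objective of Problem 2 is simple to evaluate numerically; a sketch (function name ours, assuming Λ has zero diagonal as in the text):

```python
import numpy as np

def problem2_objective(L, S, Lam):
    """Objective of Problem 2: -log det(L) + tr(SL) + sum_{i != j} lam_ij * |l_ij|.
    Assumes Lam is symmetric, non-negative, with zero diagonal."""
    sign, logdet = np.linalg.slogdet(L)
    assert sign > 0, "L must be positive definite"
    return -logdet + np.trace(S @ L) + np.sum(Lam * np.abs(L))
```
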
We will show that this problem, for a particular regularization matrix, can be used to approximate Problem 1. Specifically, in Section 4.5 we show that the error between the solution of Problem 1 and the output of the GWE step in Algorithm 2 can be bounded by an error function that depends on the similarity matrix and the topology obtained in the GTI step (Theorem 5 and Lemma 1). Finally, since this error bound is agnostic to how the GTI step was solved, we can leverage existing graph approximation methods, e.g., state of the art graph partition and tree approximation algorithms, to solve the GTI step. This is because the upper bound from Theorem 5 can be interpreted as a sum of terms, indexed by the edges not allowed in the optimal graph, where each term is the utility of an edge. The minimization of the error bound can then be interpreted as the problem of learning a graph whose edges have the largest utility (or, equivalently, picking the graph with the smallest cost). We illustrate how to apply this strategy to trees, bipartite graphs, and k-sparse connected graphs in Section 4.6. We also demonstrate the equivalence between the Chow-Liu algorithm [32] and our proposed method for learning tree structured graphs. Finally, we derive conditions on the graph families for which the theoretical guarantees of Algorithm 2 hold. One of these conditions is for the graph family of interest to be monotone, i.e., closed under edge deletion operations. This is a mild requirement, and families satisfying it include acyclic graphs, k-partite graphs and k-sparse graphs (graphs with at most k edges). Our algorithm can also handle the non-monotone family of graphs with J connected components, as well as intersections of a monotone family with the family of graphs with J connected components. This intersection includes other types of non-monotone graph families, such as trees (acyclic connected graphs) and k-sparse connected graphs (see Definitions 6 and 7). The rest of this chapter is organized as follows.
In Section 4.2, we review the literature on structured and unstructured graph learning methods. In Section 4.3, we introduce the required graph theoretic concepts and present our algorithm. In Section 4.4, we present our theoretical results on sparsity patterns of the solution of Problem 2. In Section 4.5, we establish the relation between Problems 1 and 2, as well as the error bounds for Algorithm 2. In Section 4.6, we derive instances of our methods for specific graph families. Finally, numerical validation and conclusions are provided in Sections 4.7 and 4.8, respectively.

4.2 Related work

Solving Problem 1 can be viewed as finding an approximate inverse of S with some type of sparsity pattern. When the similarity matrix is positive definite, Problem 1 is equivalent to minimizing the log-determinant Bregman divergence [45]. Moreover, when the similarity matrix is a sample covariance matrix, Problem 1 can be viewed as a maximum likelihood estimator of the inverse covariance matrix of an attractive Gaussian Markov random field (GMRF) [52]. In this section, we review the literature on inverse covariance estimation. In Section 4.2.1, we discuss the weighted graphical lasso estimator, introduced earlier in Chapter 2, and its relation with Problem 2. Next, in Section 4.2.2 we discuss some studies of the sparsity properties of the graphical lasso and covariance thresholding, which are related to our results of Section 4.4. We finish in Section 4.2.3 with an overview of papers, some of them listed in Table 4.1, that study inverse covariance estimators with topology properties.

4.2.1 Estimation of attractive Gaussian Markov random fields

In our recent work [52], we generalized the algorithm from [147] to allow for structural (known topology) constraints and ℓ1-regularization.
We considered different types of attractive GMRFs obtained by using several types of Laplacian matrices, namely generalized graph Laplacian (GGL), diagonally dominant GGL, and combinatorial graph Laplacian matrices. These algorithms are useful when the graph topology is available or there is a natural choice, e.g., regular grids for images or video. However, there are cases where the graph does not have a known topology and instead is required to have a certain property, such as being connected or bipartite (refer to the Introduction for examples). In these cases, new algorithms need to be developed to choose a graph structure within that family. The algorithm proposed in this chapter for graph learning with topology properties complements our previous work by providing efficient ways of choosing the graph topology constraints. Moreover, we can use the algorithm from Chapter 3 or others [52, 122, 87, 147] to solve the GWE step of our algorithm.

4.2.2 Characterizations of graphical model structures

The results from Sections 4.4 and 4.5 are based on the analysis of the properties of the solution of Problem 2. Similar characterizations are available for the graphical lasso [59, 103] and its weighted variant [70]. For example, it has been shown that the graphical lasso solution and a graph obtained by thresholding the entries of the covariance matrix have the same block diagonal structure [167, 102], i.e., the same connected components. In Theorem 3, we show an equivalent result for Problem 2 with a different covariance thresholding graph. In [70], the parameters of the weighted ℓ1 regularizer are designed to control the covariance thresholding graph, such that the solution of the weighted graphical lasso has a certain number of connected components. Our work follows a similar strategy to learn graphs with topology properties, but we focus on learning graphs represented by GGL matrices.
Our approach not only guarantees a decomposition into connected components, but also allows analyzing other graph properties. For certain types of sparse graphs obtained by solving the graphical lasso with high regularization, it was shown in [148] that covariance thresholding and the graphical lasso produce the same graph structure. However, the equivalence conditions are hard to check, except for acyclic graphs, as shown in [149]. In [56], a sequel to [148, 149], it is shown that under certain conditions the graphical lasso solution for acyclic graphs has a closed form expression. The same formula from [56] was previously derived in [42], but in the context of tree structured M-matrices¹. In this chapter, we show that covariance thresholding and Problem 2 are equivalent (in terms of graph structure inference) for acyclic graphs. We also show that by applying the formula from [42] we can obtain a much shorter proof than the one in [56], albeit valid only for GGL matrices. In [87], acyclic graphs are studied for the un-regularized GGL estimator. Our results complement those of [87] and are derived for a more general optimization problem. In Theorem 2, we obtain a characterization of the graph topology of the solution of Problem 2. This is one of the key results of this chapter and depends on special properties of GGL matrices.

¹ The set of symmetric M-matrices is equal to the set of positive semi-definite GGL matrices.

In particular, it allows us to derive proofs for acyclic graphs and graphs with a given number of connected components, similar to those included in the aforementioned papers. However, to the best of our knowledge, there are no results equivalent to Theorem 2 for the graphical lasso or other inverse covariance matrix estimation algorithms.
4.2.3 Estimation of structured GMRFs

Estimation of structured GMRFs can be described as follows: given a GMRF which is known to have a certain graph topology (e.g., being bipartite), identify the graph structure and/or graph weights from i.i.d. realizations. Identifying the structure of a GMRF is NP-hard for general graph families [2, 16]. The Chow-Liu algorithm [32] was proposed for approximating discrete probability distributions using tree structured random variables. The work of [154] proved that the Chow-Liu algorithm applied to GMRFs is consistent with an exponential rate. We show that our method, when applied to tree structured attractive GMRFs, is equivalent to the Chow-Liu algorithm (see Section 4.6.1). Other methods that learn tree structured graphs include the structural graphical lasso [169] and the work of [94], both of which use the graphical lasso to recover tree structured Ising models. A line of work on characterization and inference of graphs with certain properties is developed in [1, 2]. These works consider random graph models, including Erdos-Renyi, power-law and small-world random graphs. Their algorithms are based on thresholding conditional mutual information and conditional covariances, and were shown to be consistent for graph structure estimation. We summarize related work for learning graphs with deterministic topology properties in Table 4.1. These works and ours have three main differences. First, we focus on finding a graph with a desired property for a given dataset, while most of these works study algorithms that are statistically consistent, i.e., they converge to the true model as the number of observations increases. Second, our error bounds are deterministic, non-asymptotic, and derived using only optimization arguments. Third, our framework includes the graph properties listed in Table 4.1 as well as others to be introduced in the next section.
Family/Property        | Paper | Learning goal        | Model | Algorithm                   | Optimality/Analysis
Tree                   | [154] | topology             | GMRF  | Chow-Liu [32]               | consistency
Tree                   | [169] | topology and weights | GMRF  | structural graphical lasso  | none
Tree                   | [94]  | topology             | Ising | graphical lasso             | consistency
k-connected components | [70]  | topology             | GMRF  | weighted ℓ1 graphical lasso | error bound

Table 4.1: Algorithms for graph learning with topology properties.

4.3 Preliminaries and problem statement

4.3.1 Graph theoretic background

Recall that the GGL matrix of a graph G = (V, E, W, V) is given by L = V + D − W, where W is the weighted adjacency matrix, D is the diagonal degree matrix, and V is the diagonal matrix of self-loop weights. We will denote the set of positive definite generalized graph Laplacian matrices by L⁺_n, and the set of GGL matrices that do not allow links outside of E by

L⁺_n(E) = {L ∈ L⁺_n : l_ij = 0 if (i,j) ∉ E}.

We will also denote by L(E) the GGL matrix that keeps only the entries corresponding to an edge set E and zeroes out the rest; therefore L ∈ L⁺_n(E) if and only if L(E) = L. We now introduce some graph theoretic concepts and graph families used in this chapter.

Definition 5. A graph family (or class) is a set of graphs that have a common property. We denote a generic graph family by F. Some examples include:
- k-sparse graphs, which have at most k edges (i.e., |E| ≤ k).
- Bounded degree graphs, which have at most d neighbors per node (i.e., |N_i| ≤ d for all i ∈ V).
- Connected graphs, in which for all i, j ∈ V there is a path connecting node i to node j.
- Graphs with k connected components, which admit a partition of the vertices V = ∪_{t=1}^k V_t such that each sub-graph G_t = (V_t, E_t) is connected. The edges in E_t ⊆ E have both endpoints in V_t, and no edges connect different connected components.
- k-partite graphs, which admit a partition of the vertices V = ∪_{t=1}^k V_t such that for each t there are no edges within V_t, i.e., if (i,j) ∈ V_t × V_t then (i,j) ∉ E. The case k = 2 corresponds to the family of bipartite graphs.
- Acyclic graphs, which have no cycles. If an acyclic graph is also connected it is called a tree; otherwise it is called a forest.

Graph families can also be characterized by their properties.
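Membership in several of these families is straightforward to test programmatically; a small pure-Python sketch (helper names ours):

```python
from collections import deque

def is_connected(n, edges):
    """BFS over the undirected graph on V = {0, ..., n-1}."""
    adj = {i: [] for i in range(n)}
    for i, j in edges:
        adj[i].append(j)
        adj[j].append(i)
    seen, q = {0}, deque([0])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                q.append(v)
    return len(seen) == n

def is_acyclic(n, edges):
    """Union-find cycle check: acyclic iff no edge joins two already-connected nodes."""
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for i, j in edges:
        ri, rj = find(i), find(j)
        if ri == rj:
            return False
        parent[ri] = rj
    return True

def is_tree(n, edges):
    return is_connected(n, edges) and is_acyclic(n, edges)

def is_k_sparse(edges, k):
    return len(edges) <= k
```
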
We are particularly interested in monotone graph families.

Definition 6. A graph family (or property) is called monotone if it is closed under edge deletion operations.

Monotone families include k-partite, acyclic, k-sparse, and bounded degree graphs. The following definition covers all graph families required for the theoretical guarantees of Section 4.5.

Definition 7. A graph family F is admissible if one of the following properties holds:
- F is monotone;
- F is the family of graphs with k connected components;
- F = F_1 ∩ F_2, where F_1 is a monotone family and F_2 is the family of graphs with k connected components. Note that in this case F is no longer monotone.

4.3.2 Similarity matrix

Let X ∈ R^{n×N} be a data matrix; denote its i-th column by x_i, of dimension n×1, and its i-th row by xⁱ, of dimension 1×N. Each xⁱ is a data vector attached to one of the nodes in V = {1, …, n}. We use a similarity function φ : R^N × R^N → R that is symmetric, i.e., φ(a, b) = φ(b, a) for all a, b ∈ R^N. We define the similarity matrix S = (s_ij) between data points at different nodes by s_ij = φ(xⁱ, xʲ). Typical examples of similarity functions are φ(a, b) = (1/N)⟨a, b⟩ and φ(a, b) = ⟨a/‖a‖, b/‖b‖⟩, which produce empirical correlation and correlation coefficient matrices, and the Gaussian kernel function φ(a, b) = exp(−‖a − b‖²/σ²).

4.3.3 Analysis of Problem 1

We will denote the cost function from Problem 1 by

J(L) = −log det(L) + tr(SL).  (4.1)

Even though J is convex, the constraint set {L ∈ L⁺_n : G(L) ∈ F} is not convex in general. This is characterized by the following proposition.

Proposition 2. Let F be any graph family. The constraint set {L ∈ L⁺_n : G(L) ∈ F} is convex if and only if F is closed under edge addition operations.

Proof. Let L_1 and L_2 be arbitrary positive definite GGL matrices, with G(L_1) = (V, E_1, L_1) ∈ F and G(L_2) = (V, E_2, L_2) ∈ F. Let α ∈ (0, 1); then L = αL_1 + (1−α)L_2 is obviously a positive definite GGL matrix.
Also, notice that since GGL matrices have non-positive off-diagonal entries, the convex combination of GGL matrices has a graph given by G(L) = (V, E, L) with E = E_1 ∪ E_2, which means that there is an edge in the graph G(L) if and only if there is an edge in G(L_1) or in G(L_2). This implies that for the problem to be convex, F has to be closed under unions of edge sets, and therefore closed under edge addition operations.

When F is the family of connected graphs, the constraint set is convex, since adding an edge to a connected graph keeps it connected, and the problem can be solved using the approach from [153]. In light of the results from [153], it can be easily verified that the methods from [52, 69] produce connected graphs, even though this is not explicitly discussed in those papers. Other admissible graph properties, such as acyclic graphs, k-partite graphs, or graphs with k connected components with k > 1, are not closed under edge addition operations, thus leading to non-convex optimization problems. In addition to non-convexity, some graph families induce prohibitively large constraint sets; for instance, there are n^{n−2} different tree topologies on n vertices. Noting that the solution of the unconstrained minimization of J(L) is S^{-1}, our approach can be interpreted as finding an approximate inverse of S that belongs to the set {L ∈ L⁺_n : G(L) ∈ F}. For a fixed S, we can classify Problem 1 into three types of graph learning problems.

1. S^{-1} ∈ L⁺_n and G(S^{-1}) ∈ F: in this case we can solve Problem 1 by inverting S. This scenario is highly unlikely, as S would in general be computed from data. An interesting variation of this scenario is analyzed in the papers for learning tree structured graphs from Table 4.1.
When the data follows a GMRF and the input is an empirical covariance matrix S, a maximum weight spanning tree type of algorithm [32] is consistent, i.e., as the number of samples goes to infinity, it will infer the correct tree structure with probability 1.

2. S^{-1} ∈ L⁺_n and G(S^{-1}) ∉ F: this case leads to a graph modification problem, where given a graph and its GGL matrix, we want to find its optimal approximation in the desired graph family. The approximation problem can be interpreted as minimizing the Kullback-Leibler divergence between attractive GMRFs, or as minimizing the log-determinant divergence between GGL matrices [45].

3. S^{-1} ∉ L⁺_n and G(S^{-1}) ∉ F: this case corresponds to graph learning with a desired topology property. This scenario better reflects a real world situation, where S is computed from data and the modeling assumptions do not hold. In addition, it is also very likely that in practice the sample covariance matrix S is not invertible.

Algorithm 2 Graph learning with admissible graph families
Require: S and F.
1: Initialization: construct the set Ẽ_0 = {(i,j) : s_ij > 0}.
2: Graph topology inference (GTI): find an edge set Ẽ ⊆ Ẽ_0 such that (V, Ẽ) ∈ F.
3: Graph weight estimation (GWE): find L# by solving min_{L ∈ L⁺_n(Ẽ)} J(L).

We are more interested in finding approximate solutions for Problem 1 in the second and third scenarios. We also focus on deterministic approximation, leaving statistical analysis for future work.

4.3.4 Proposed algorithm

Since Problem 1 is non-convex, we propose Algorithm 2 for approximating its solution. The initialization step is required in order to achieve our theoretical guarantees, since any feasible solution will not include edges that have a non-positive similarity value (see Theorem 2). The graph topology inference (GTI) step is a generic routine that returns a feasible graph topology, i.e., an edge set that belongs to F. We will go into more details on how to solve this step in Section 4.6.
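To make the GTI step concrete for one family: when F is the set of trees, a natural choice is a maximum weight spanning tree over the candidate edges Ẽ_0, in the spirit of the spanning tree algorithms mentioned above. A Kruskal-style sketch (function name ours; the raw similarities serve here as stand-in edge weights, whereas Section 4.6 derives the actual edge utilities from the error bound of Theorem 5):

```python
import numpy as np

def gti_spanning_tree(S):
    """GTI step sketch for the tree family: restrict to E0 = {(i, j): s_ij > 0},
    then take a maximum weight spanning tree (Kruskal) over those candidates."""
    n = S.shape[0]
    cand = [(S[i, j], i, j) for i in range(n) for j in range(i + 1, n) if S[i, j] > 0]
    cand.sort(reverse=True)          # heaviest candidate edges first
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    tree = []
    for w, i, j in cand:
        ri, rj = find(i), find(j)
        if ri != rj:                 # adding (i, j) creates no cycle
            parent[ri] = rj
            tree.append((i, j))
    return tree
```

The returned edge set then defines the support constraint L ∈ L⁺_n(Ẽ) for the GWE step.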
The graph weight estimation (GWE) step is a convex optimization problem. In this chapter we deal with three optimization problems, namely Problems 1 and 2 and the GWE step from Algorithm 2. The proof of our error bound for Algorithm 2 (Theorem 5) relies on three results that can be stated informally as follows:

1. The GWE step from Algorithm 2 can be replaced by Problem 2 with a specific choice of regularization (Proposition 3), so that regularization with Λ can be used instead of sparsity constraints.

2. We derive conditions on the regularization matrix Λ and the similarity matrix S guaranteeing that the solution of Problem 2 has an admissible graph topology (Proposition 4), which allows us to choose Λ for our problem.

3. Under certain conditions, we can use the solution of Problem 2, a convex problem, to approximate the solution of the non-convex Problem 1 (Lemma 1 and Theorem 5).

4.4 Analysis of Problem 2

The main results of this section are Theorems 2 and 3, which characterize the relation between the regularization parameters, the similarity matrix, and the graph of the solution of Problem 2. Later, in Proposition 3, we show the equivalence between the GWE step and Problem 2. In Proposition 4 we show how the solution of Problem 2 is related to admissible graph families. We end this section with a formula for the solution of Problem 2 for acyclic graphs.

We begin by simplifying Problem 2. Throughout this section, the regularization matrix Λ = (λ_ij) can be any symmetric, non-negative matrix with diagonal entries equal to zero (λ_ii = 0). Since L is a GGL matrix, it has non-positive off-diagonal entries, hence the ℓ1 regularization term from Problem 2 can be written as −tr(ΛL). Then, Problem 2 takes the more compact form

min_{L ∈ L⁺_n} −log det(L) + tr(KL),

with K = S − Λ. We now introduce the covariance thresholding graph.

Definition 8.
Given K = S − Λ, the covariance thresholding graph is defined as G̃(K) = (V, Ẽ, A), with connectivity matrix A = (a_ij) given by

a_ij = 1 if k_ij > 0 and i ≠ j,
a_ij = 0 otherwise.

Since k_ij = s_ij − λ_ij, edges are included in the covariance thresholding graph when the similarity is larger than the regularization parameter. Problem 2 is convex, so the Karush-Kuhn-Tucker (KKT) conditions are necessary and sufficient for optimality [17]. We denote by L# = (l#_ij) the solution of Problem 2, with corresponding graph G(L#) = (V, E#, L#). The optimality conditions are:

−(L#)^{-1} + K + M = 0,  (4.2)
M = Mᵀ ≥ 0,  (4.3)
diag(M) = 0,  (4.4)
M ⊙ L# = 0,  (4.5)
l#_ij ≤ 0, ∀ i ≠ j,  (4.6)
L# ≻ 0,  (4.7)

where M = (μ_ij) is a matrix of Lagrange multipliers and ⊙ is the Hadamard (entry-wise) product.

Remark 1. The edge set of the covariance thresholding graph, Ẽ, and the graph of the optimal solution of Problem 2, G(L#), use the same notation as Algorithm 2. In later sections we will show this equivalence for a particular choice of regularization matrix Λ.

We will now derive a few properties of the optimum of Problem 2. For an arbitrary pair i ≠ j, let S = {i, j} and T = V \ S; for a matrix B, B_S denotes the sub-matrix with rows and columns indexed by S, and B_ST the sub-matrix with rows in S and columns in T. Then we have that

L#_S − L#_ST (L#_T)^{-1} L#_TS = (K_S + M_S)^{-1}
  = 1/(k_ii k_jj − (k_ij + μ_ij)²) · [ k_jj, −(k_ij + μ_ij) ; −(k_ij + μ_ij), k_ii ].

Note that the matrix L#_ST (L#_T)^{-1} L#_TS is non-negative, therefore

l#_ij ≥ −(k_ij + μ_ij) / (k_ii k_jj − (k_ij + μ_ij)²).

This implies, using complementary slackness (μ_ij = 0 when l#_ij < 0), that if l#_ij < 0, then

l#_ij ≥ −k_ij / (k_ii k_jj − k_ij²),  (4.8)

and k_ij > 0. A necessary condition for the existence of a solution of Problem 2 is that k_ii k_jj − k_ij² > 0 for all i ≠ j. This requirement was also established in [147] using a different argument.

4.4.1 Sparsity of optimal graphs

In this section we show that some of the edges of the optimal GGL matrix can be removed by screening the entries of the input matrix K.

Theorem 2 (Sparsity).
Given G̃(K) = (V, Ẽ, A) and G(L#) = (V, E#, L#) as defined before, we have that E# ⊆ Ẽ.

Proof. We have to show that if l#_ij < 0, then k_ij > 0. By complementary slackness, l#_ij < 0 implies that μ_ij = 0. Then, based on Lemma 2 (see Section 4.9.1), we can write ((L#)^{-1})_ij = k_ij as

((L#)^{-1})_ij = −l#_ij / (l#_ii l#_jj) + e_ij = k_ij.

Thus, we have k_ij > 0, since e_ij ≥ 0 and −l#_ij/(l#_ii l#_jj) > 0.

Theorem 2 provides a lot of information about the solution of Problem 2 by only looking at the covariance thresholding graph of K. Some implications of Theorem 2 are the following:
- The number of edges in the optimal graph is bounded by the number of edges in the covariance thresholding graph, i.e., |E#| ≤ |Ẽ|.
- The optimal graph structure is partially known. If k_ij ≤ 0, or equivalently if the similarity is below a threshold, the optimal graph does not include the corresponding edge. The entries of the regularization matrix act as thresholds for the similarity matrix.
- The converse is not necessarily true: if k_ij > 0, we cannot say whether the optimal weight l#_ij is zero or not. We will show that for acyclic graphs we can exactly determine the graph topology (see Theorem 4).

Although the above theorem is a simple result with a straightforward proof, it is the key technical element that connects the results of this chapter and justifies the use of Algorithm 2. To the best of our knowledge, there is no counterpart to Theorem 2 for the graphical lasso or other inverse covariance estimation algorithms.

4.4.2 Connected components of the optimal graph

In this section we show that the optimal graph and the covariance thresholding graph have the same connected components. We can decompose G̃(K) = (V, Ẽ, A) into its connected components {G̃_s(K)}_{s=1}^J. Each G̃_s(K) = (Ṽ_s, Ẽ_s, A_s) is a connected sub-graph, and {Ṽ_1, …, Ṽ_J} forms a partition of the nodes V. Also, for different components Ṽ_s, Ṽ_t with s ≠ t, if we pick i ∈ Ṽ_s and j ∈ Ṽ_t, there is no edge connecting them (i.e., a_ij = 0).
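Both the covariance thresholding graph of Definition 8 and its connected component partition can be computed by direct screening of K, without solving Problem 2; a minimal sketch (helper names ours):

```python
import numpy as np
from collections import deque

def thresholding_graph(K):
    """Adjacency of Definition 8: a_ij = 1 iff k_ij > 0 and i != j."""
    A = (K > 0).astype(int)
    np.fill_diagonal(A, 0)
    return A

def connected_components(A):
    """Vertex partition {V_1, ..., V_J} of the graph with adjacency A (BFS)."""
    n = A.shape[0]
    unseen, parts = set(range(n)), []
    while unseen:
        root = min(unseen)
        comp, q = {root}, deque([root])
        unseen.remove(root)
        while q:
            u = q.popleft()
            for v in list(unseen):
                if A[u, v]:
                    unseen.remove(v)
                    comp.add(v)
                    q.append(v)
        parts.append(sorted(comp))
    return parts
```
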
Similarly, consider the solution of Problem 2 and its corresponding graph G(L#) = (V, E#, L#). Let {G_s(L#)}_{s=1}^M denote its connected components, defined by G_s(L#) = (V_s, E#_s, L#_{V_s}), where L#_{V_s} is the |V_s| × |V_s| sub-matrix of L# that keeps only the rows and columns indexed by the set V_s. We state the following theorem on the connected components obtained by solving Problem 2.

Theorem 3. If L# is the solution of Problem 2, then the connected components of G(L#) and G̃(K) induce the same vertex partition, i.e., M = J and there is a permutation π such that Ṽ_s = V_{π(s)} for s ∈ [J].

Proof. See Section 4.9.2.

This result has the following consequence.

Corollary 1. G̃(K) is connected if and only if G(L#) is connected.

This allows us to check whether the solution is connected by only checking whether the covariance thresholding graph is connected, without solving Problem 2.

4.4.3 Regularization and sparsity

Since Theorem 2 informs us about some of the zero entries of L#, solving Problem 2 is equivalent to solving

min_{L ∈ L⁺_n(Ẽ)} −log det(L) + tr(KL),  (4.9)

where Ẽ is the edge set of the covariance thresholding graph. In particular, some instances of Problem 2 can be replaced by a non-regularized, sparsity constrained problem.

Proposition 3. Let Ẽ be an arbitrary edge set, for example the one produced by the GTI step in Algorithm 2, and let the regularization matrix be constructed as

λ_ij = 0 if (i,j) ∈ Ẽ, and λ_ij = s_ij otherwise.  (4.10)

Then L# minimizes Problem 2 and also

L# = arg min_{L ∈ L⁺_n(Ẽ)} −log det(L) + tr(SL).

In particular, the GGL estimation step of Algorithm 2 can be replaced by Problem 2.

Proof. This choice of regularization implies that k_ij = s_ij for (i,j) ∈ Ẽ, and k_ij = 0 otherwise. From Theorem 2, edges are not allowed outside Ẽ, thus we can add the constraint L ∈ L⁺_n(Ẽ) without changing the solution.
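Concretely, the construction (4.10) copies the similarity values outside Ẽ into the regularizer, so that K = S − Λ keeps s_ij on Ẽ and zeroes the remaining off-diagonal entries; a sketch (helper name ours):

```python
import numpy as np

def regularizer_from_edges(S, E):
    """Regularization matrix of (4.10): lam_ij = 0 on E (and on the diagonal),
    lam_ij = s_ij elsewhere."""
    Lam = S.copy()
    np.fill_diagonal(Lam, 0.0)
    for i, j in E:
        Lam[i, j] = Lam[j, i] = 0.0
    return Lam
```

With this choice, solving the compact form of Problem 2 with input K = S − Λ is equivalent to the support-constrained GWE step.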
This result indicates that thresholding some entries of the covariance matrix can be replaced by constraining the corresponding entries of the GGL matrix to be zero. We can also use Theorem 3 to simplify Problem 2. The rows and columns of the matrices A and L can be permuted to have the same block diagonal structure, thus they can be written as
\[
L = \begin{pmatrix} L_{V_1} & 0 & \cdots & 0 \\ 0 & L_{V_2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & L_{V_J} \end{pmatrix}
\quad \text{and} \quad
A = \begin{pmatrix} A_{V_1} & 0 & \cdots & 0 \\ 0 & A_{V_2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & A_{V_J} \end{pmatrix}.
\]
This block diagonal structure can be inferred by inspecting A, and the optimal GGL matrix will have the same block diagonal structure, so we can restrict ourselves to GGL matrices with the block structure of A. The determinant of a block diagonal matrix is the product of the determinants of the individual blocks. Since the trace is linear in L, we can decompose (4.9) into J smaller sub-problems, one for each connected component, and solve
\[
\min_{L \in \mathcal{L}^+_{n_s}(\tilde{E}_s)} -\log\det(L) + \mathrm{tr}(K_{V_s} L) \tag{4.11}
\]
for all $s \in \{1, \dots, J\}$, where $n_s = |V_s|$. Thus, solving Problem 2 is equivalent to solving J smaller problems of size $n_s$, obtained by only screening the values of K and identifying the connected components of $\tilde{G}(K)$. For larger regularization parameters, the covariance thresholding graph becomes sparser and possibly disconnected, thus affecting the sparsity and connected components of the solution of Problem 2. Since the J GGL estimation problems are independent, they can be solved in parallel and asynchronously using any of the algorithms from [147, 52, 122, 87].

4.4.4 Closed form solution for acyclic graphs

Acyclic graphs form a special graph family that admits closed form solutions. First note that combining Theorems 2 and 3 implies that the number of edges in the optimal graph satisfies
\[
\sum_{s=1}^{J} |V_s| - J \;\leq\; |E^\#| \;\leq\; |\tilde{E}|, \tag{4.12}
\]
where J is the number of connected components. These inequalities are sharp when the graph $\tilde{G}(K)$ is acyclic. We state this more precisely in the following theorem.
Theorem 4. If $\tilde{G}(K)$ is an acyclic graph, then the solution of Problem 2 has the same edge set, i.e., $E^\# = \tilde{E}$, and the optimal GGL matrix is given by
\[
l^\#_{ij} = \begin{cases}
\dfrac{1}{k_{ii}}\left(1 + \displaystyle\sum_{j \in N_i} \dfrac{k_{ij}^2}{k_{ii}k_{jj} - k_{ij}^2}\right) & i = j \\[2mm]
\dfrac{-k_{ij}}{k_{ii}k_{jj} - k_{ij}^2} & (i,j) \in E^\# \\[2mm]
0 & (i,j) \notin E^\#.
\end{cases} \tag{4.13}
\]
Proof. From Theorem 2 we have that $E^\# \subseteq \tilde{E}$. Since the covariance thresholding graph is acyclic, and from Theorem 3 both graphs have the same number of connected components, (4.12) holds with equality. Given that both edge sets have the same number of elements, and one of them is included in the other, they must be equal. The authors of [42] showed that GGL matrices (which they call M-matrices) have a closed form expression in terms of their inverse. Denote $\Sigma = L^{-1}$, and note that the KKT optimality conditions establish that $\sigma_{ij} = k_{ij}$ if $i = j$ or $(i,j) \in E^\#$. The formula in (4.13) then follows directly from [42, Corollary 2.3].

The same formula was obtained in [56] for the case when the graphical lasso solution is a tree. The main difference with our approach is that we use results for GGL matrices of tree structured graphs, which lead to a simple proof. The work of [42] is in the context of graph theory and does not make the connection with inverse covariance estimation methods.

4.4.5 Admissible graph families

The theory developed thus far allows us to prove the following proposition.

Proposition 4 (Transitive property). Let $\mathcal{F}$ be an admissible graph family. If $\tilde{G}(K) \in \mathcal{F}$, then the solution of Problem 2, $G(L^\#) = (V, E^\#, L^\#)$, belongs to $\mathcal{F}$.

Proof. The case of monotone graph properties follows directly from Theorem 2. The case of graphs with k connected components follows from Theorem 3. When the graph class is an intersection of a monotone family and the family of graphs with k connected components, the property follows again by applying Theorems 2 and 3.

4.5 Analysis of Algorithm 2

This section is devoted to proving Theorem 5, which deals with the output of Algorithm 2.
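Before turning to the analysis, note that the closed form (4.13) from the previous subsection is easy to check numerically: starting from a tree structured GGL, its inverse plays the role of K, and the formula should recover the original matrix. A minimal sketch (numpy; illustrative, not the author's implementation):

```python
import numpy as np

def tree_ggl_from_k(K, edges):
    """Closed-form GGL of Theorem 4 (eq. 4.13) for an acyclic edge set."""
    n = K.shape[0]
    nbrs = {i: set() for i in range(n)}
    for i, j in edges:
        nbrs[i].add(j)
        nbrs[j].add(i)
    L = np.zeros((n, n))
    # Off-diagonal entries: pairwise formula on the tree edges only.
    for i, j in edges:
        d = K[i, i] * K[j, j] - K[i, j] ** 2
        L[i, j] = L[j, i] = -K[i, j] / d
    # Diagonal entries: accumulate contributions from the neighbors of i.
    for i in range(n):
        acc = 1.0 + sum(K[i, j] ** 2 / (K[i, i] * K[j, j] - K[i, j] ** 2)
                        for j in nbrs[i])
        L[i, i] = acc / K[i, i]
    return L

# Path graph 0-1-2 with unit vertex weights: a positive definite GGL.
L_true = np.array([[ 2.0, -1.0,  0.0],
                   [-1.0,  3.0, -1.0],
                   [ 0.0, -1.0,  2.0]])
K = np.linalg.inv(L_true)
L_rec = tree_ggl_from_k(K, [(0, 1), (1, 2)])
```

On this example the recovered matrix matches L_true to machine precision, even though K itself is dense.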
A GGL that minimizes Problem 1 is denoted by $L^*$, and its graph by $G(L^*) = (V, E^*, L^*)$. Since Problem 1 is non-convex, $L^*$ might not be the only global minimum, thus we denote by $\mathcal{L}^*$ the set of all global minima, and define $J^* = J(L^*)$ for any $L^* \in \mathcal{L}^*$. Our first result is Lemma 1 below, which states that Problem 1 can be approximated by Problem 2.

Lemma 1. Let $L^\#$ be the solution of Problem 2, with corresponding graph $G(L^\#) = (V, E^\#, L^\#)$. Suppose that
\[
G(L^\#) \in \mathcal{F}. \tag{4.14}
\]
Then for any $L^* \in \mathcal{L}^*$,
\[
|J(L^\#) - J^*| \leq \sum_{i \neq j} \lambda_{ij} w^*_{ij},
\]
where $|l^*_{ij}| = -l^*_{ij} = w^*_{ij}$ for all $i \neq j$.

Proof. Since $L^\#$ is optimal for Problem 2 we have that
\[
J(L^\#) + \sum_{i \neq j} \lambda_{ij} |l^\#_{ij}| \;\leq\; J(L^*) + \sum_{i \neq j} \lambda_{ij} |l^*_{ij}|. \tag{4.15}
\]
Optimality of $L^*$ for Problem 1 and (4.14) imply that $J(L^*) \leq J(L^\#)$. We conclude by combining both inequalities:
\[
\sum_{i \neq j} \lambda_{ij} |l^*_{ij}| \;\geq\; J(L^\#) - J(L^*) + \sum_{i \neq j} \lambda_{ij} |l^\#_{ij}| \;\geq\; J(L^\#) - J(L^*) \;\geq\; 0.
\]

We denote the weight of a sub-graph of the covariance/similarity graph as
\[
W_S(E) = \sum_{(i,j) \in E} \frac{s_{ij}^2}{s_{ii}s_{jj} - s_{ij}^2} = \sum_{(i,j) \in E} \frac{r_{ij}^2}{1 - r_{ij}^2},
\]
where $E$ is an arbitrary edge set, and $r_{ij}^2 = s_{ij}^2/(s_{ii}s_{jj})$. The following theorem is the main contribution of this chapter.

Theorem 5. Let $\mathcal{F}$ be an admissible graph family, and assume $S \succ 0$. Let $L^\#$ be the output of Algorithm 2. Then we have that $G(L^\#) = (V, E^\#, L^\#) \in \mathcal{F}$, and
\[
|J(L^\#) - J^*| \;\leq\; 2 \min_{E^* : L^* \in \mathcal{L}^*} \sum_{(i,j) \in E^* \cap \tilde{E}^c} \frac{r_{ij}^2}{1 - r_{ij}^2} \;\leq\; 2 \sum_{(i,j) \notin \tilde{E}} \frac{r_{ij}^2}{1 - r_{ij}^2} \;=\; 2\left(W_S(E_{\mathrm{full}}) - W_S(\tilde{E})\right),
\]
where $\tilde{E}$ is the edge set found in the GTI step.

Proof. In the previous section we used the matrix $K = S - \Lambda$ to define the covariance thresholding graph. First we take the edge set $\tilde{E}$ from the GTI step, and we choose $\Lambda$ such that the covariance thresholding graph is $\tilde{G}(K) = (V, \tilde{E})$. This means we need to pick $\Lambda$ such that $\lambda_{ij} < s_{ij}$ for $(i,j) \in \tilde{E}$, and $\lambda_{ij} \geq s_{ij}$ when $(i,j) \notin \tilde{E}$. Since the GTI step ensures that $\tilde{E}$ belongs to $\mathcal{F}$, this choice of regularization combined with Proposition 4 implies that $G(L^\#) \in \mathcal{F}$.
We pick any $L^* \in \mathcal{L}^*$ and apply Lemma 1 to obtain the bound
\[
|J(L^\#) - J(L^*)| \leq \sum_{i \neq j} \lambda_{ij} w^*_{ij}.
\]
The goal now is to simplify the upper bound and make it as tight as possible. We start by decomposing the right hand side using the terms corresponding to $\tilde{E}$ and $\tilde{E}^c$:
\[
\sum_{i \neq j} \lambda_{ij} w^*_{ij} = 2 \sum_{(i,j) \in \tilde{E}} \lambda_{ij} w^*_{ij} + 2 \sum_{(i,j) \notin \tilde{E}} \lambda_{ij} w^*_{ij} = 2 \sum_{(i,j) \in E^* \cap \tilde{E}^c} w^*_{ij} s_{ij} \leq 2 \sum_{(i,j) \in E^* \cap \tilde{E}^c} \frac{s_{ij}^2}{s_{ii}s_{jj} - s_{ij}^2} \leq 2 \sum_{(i,j) \notin \tilde{E}} \frac{r_{ij}^2}{1 - r_{ij}^2},
\]
where for the second equality we have chosen
\[
\lambda_{ij} = \begin{cases} 0 & \text{if } (i,j) \in \tilde{E} \\ s_{ij} & \text{otherwise.} \end{cases}
\]
This choice produces the tightest possible upper bound because $\tilde{E}$ is a well defined covariance thresholding graph, i.e., $0 \leq \lambda_{ij} < s_{ij}$ when $(i,j) \in \tilde{E}$, and $s_{ij} \leq \lambda_{ij}$ when $(i,j) \notin \tilde{E}$. For the first inequality we use
\[
w^*_{ij} \leq \frac{s_{ij}}{s_{ii}s_{jj} - s_{ij}^2},
\]
which was established in (4.8). The last inequality uses the fact that all terms are non-negative and $E^* \cap \tilde{E}^c \subseteq \tilde{E}^c$. We conclude by applying Proposition 3 to Problem 2 with the chosen regularization, thus making the solution of Problem 2 coincide with the GWE step of Algorithm 2.

Theorem 5 indicates that Problem 2, which is convex, can be close to the non-convex Problem 1 if the upper bound can be minimized. This can be achieved in several ways, including:
- If $E^* \subseteq \tilde{E}$ and $\tilde{E}$ is in $\mathcal{F}$, then the tightest upper bound is zero. This is in general not possible, but in Section 4.6 we will discuss a case where our method is equivalent to a consistent estimator of the graph structure.
- In most cases, finding a feasible graph that contains the optimal graph is not possible, but since $\tilde{E}$ depends solely on the similarity matrix S, maximizing $W_S(E)$ becomes a practical alternative. We will discuss this method to solve the graph topology inference step in Section 4.6.

4.6 Graph topology inference

In this section we propose a general method to solve the graph topology inference step of Algorithm 2.
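Throughout this section, the quantity being maximized is the sub-graph weight $W_S(E)$ defined above. A minimal helper, assuming a dense numpy similarity matrix S (names are illustrative):

```python
import numpy as np

def subgraph_weight(S, E):
    """W_S(E) = sum over (i, j) in E of r_ij^2 / (1 - r_ij^2),
    with r_ij^2 = s_ij^2 / (s_ii * s_jj)."""
    total = 0.0
    for i, j in E:
        r2 = S[i, j] ** 2 / (S[i, i] * S[j, j])
        total += r2 / (1.0 - r2)
    return total

S = np.array([[1.0, 0.5],
              [0.5, 2.0]])
w = subgraph_weight(S, [(0, 1)])  # r^2 = 0.125, so w = 0.125/0.875 = 1/7
```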
We also show that for specific instances of graph families, the proposed graph topology inference problems can be solved using existing combinatorial optimization algorithms. Minimizing the upper bound from Theorem 5 leads to a maximum weight spanning sub-graph approximation problem in $\mathcal{F}$ given by
\[
\max_{(V,E) \in \mathcal{F}} W_S(E). \tag{A}
\]
For some graph families, solving (A) might not be computationally tractable, which affects the quality of the output of Algorithm 2. We can see this by decomposing the second factor of the upper bound in Theorem 5 as
\[
\underbrace{W_S(E_{\mathrm{full}}) - MW_S^{\mathcal{F}}}_{\text{modeling error}} \;+\; \underbrace{MW_S^{\mathcal{F}} - W_S(\tilde{E})}_{\text{approximation error}},
\]
where $MW_S^{\mathcal{F}} = \max_{(V,E) \in \mathcal{F}} W_S(E)$, and $\tilde{E}$ is the output of the GTI step, i.e., an exact or approximate solution of (A). The modeling error measures how much the covariance matrix differs from its best approximation in $\mathcal{F}$. The approximation error measures the discrepancy between the best approximation that can be obtained by optimizing the upper bound and the estimate given by $\tilde{E}$. This term depends on the existence of efficient algorithms for solving (A). For example, when $\mathcal{F}$ is the family of bipartite graphs, solving (A) is equivalent to max-cut, which is NP-hard, hence the approximation error will always be positive. When $\mathcal{F}$ is the family of tree structured graphs, the solution can be found exactly in polynomial time using Kruskal's algorithm, hence the only term is the modeling error. When (A) is an NP-hard problem, good solutions with small approximation errors can be obtained using combinatorial approximation algorithms. In the following subsections we show specific instances of the GTI step for the families of tree structured graphs, k-sparse connected graphs, and bipartite graphs.

4.6.1 Tree structured graphs

The set of tree structured graphs is the set of (n-1)-sparse connected graphs.
Theorem 5 suggests solving (A), which for trees reduces to the following problem:
\[
\max_{E} \sum_{(i,j) \in E} \frac{r_{ij}^2}{1 - r_{ij}^2} \quad \text{s.t.} \quad |E| \leq n - 1, \;\; (V,E) \text{ is connected.} \tag{4.16}
\]
This is a maximum weight spanning tree (MWST) problem, which can be solved using Kruskal's or Prim's algorithms [79] in $O(n^2 \log(n))$ time. For this graph family, the optimization problem exactly solves the combinatorial problem (the approximation error is zero). Problem (4.16) is equivalent to the Chow-Liu algorithm [32], which is formally stated in the following proposition.

Proposition 5. Suppose the vectors $\{x_i\}_{i=1}^{N}$ (columns of the data matrix) are i.i.d. realizations of an attractive GMRF distribution, and for all $i < j$ the values $r_{ij}$ are distinct. The solution of the GTI step of Algorithm 2 that uses (4.16) and the solution obtained by the Chow-Liu algorithm are equal.

Proof. The Chow-Liu algorithm [32] solves a MWST problem using the pairwise empirical mutual information as edge weights. For two Gaussian variables the mutual information is
\[
I(x_i; x_j) = -\frac{1}{2}\log(1 - r_{ij}^2) = I_{ij}.
\]
The mutual information and $r_{ij}^2/(1 - r_{ij}^2)$ are both non-negative, strictly increasing functions of $r_{ij}^2$. Then any MWST algorithm will return the same tree structured graph for the weights $r_{ij}^2/(1 - r_{ij}^2)$ and $I_{ij}$.

If the data has a tree structured distribution, the Chow-Liu algorithm is consistent [154], i.e., $\tilde{E} \to E^*$ with probability one as $N \to \infty$. From Theorem 4 we know that $E^\# = \tilde{E}$, hence $E^\# \to E^*$ with probability one [154]. Therefore, for this case, Algorithm 2 is consistent.

4.6.2 Bipartite graphs

For the bipartite graph case, we need to find a partition of the vertex set $V = S \cup S^c$, and an edge set $\tilde{E}$ for which $(i,j) \in \tilde{E}$ if and only if $i \in S$ and $j \in S^c$. Finding a bipartite approximation is equivalent to a max-cut problem between $S$ and $S^c$. A standard way of solving max-cut is to introduce a variable $y_i \in \{+1, -1\}$ for each node in the graph, where $y_i = +1$ if $i \in S$ and $y_i = -1$ otherwise.
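To make the encoding concrete, the following sketch evaluates the cut weight of a labeling and improves it with a simple single-flip local search. This heuristic is much simpler than (and should not be confused with) the Goemans-Williamson algorithm used in the chapter; all names are illustrative:

```python
import numpy as np

def cut_weight(W, y):
    """Total weight of edges crossing the bipartition encoded by y in {+1,-1}^n."""
    # (1 - y_i y_j)/2 is 1 exactly when i and j get different labels;
    # the full double sum counts each unordered pair twice, hence the 0.25.
    return 0.25 * np.sum(W * (1 - np.outer(y, y)))

def local_search_maxcut(W, seed=0):
    """1-flip local search: flip any vertex whose move increases the cut."""
    rng = np.random.default_rng(seed)
    n = W.shape[0]
    y = rng.choice([-1, 1], size=n)
    improved = True
    while improved:
        improved = False
        for i in range(n):
            # Gain from flipping i equals y_i * sum_j W_ij y_j (W_ii = 0).
            gain = y[i] * (W[i] @ y)
            if gain > 0:
                y[i] = -y[i]
                improved = True
    return y

# Toy weights r_ij^2 / (1 - r_ij^2); the best cut isolates vertex 1.
W = np.array([[0.0, 3.0, 0.1],
              [3.0, 0.0, 2.0],
              [0.1, 2.0, 0.0]])
y = local_search_maxcut(W)
E = {(i, j) for i in range(3) for j in range(i + 1, 3) if y[i] * y[j] == -1}
```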
This results in the optimization problem
\[
\max_{y_1, \dots, y_n} \sum_{i,j} \frac{r_{ij}^2}{1 - r_{ij}^2}\,(1 - y_i y_j) \quad \text{s.t.} \quad y_i \in \{+1, -1\}. \tag{4.17}
\]
Max-cut is NP-hard, so approximation algorithms are the only option. We use the Goemans-Williamson (GW) algorithm [64], which finds a 0.87856-approximation for (4.17). The complexity of the GW algorithm is dominated by a semi-definite convex optimization problem and a Cholesky decomposition of an $n \times n$ matrix (see [64] for a complexity analysis). Of course, other approximation algorithms with lower computational complexity can be used for bipartite approximation. Once the vector $y = [y_1, \dots, y_n]$ is found, we construct the edge set as
\[
\tilde{E} = \{(i,j) : y_i y_j = -1\}.
\]

4.6.3 k-sparse connected graphs

For a fixed and known k we want to solve
\[
\max_{E} \sum_{(i,j) \in E} \frac{r_{ij}^2}{1 - r_{ij}^2} \quad \text{s.t.} \quad |E| \leq k, \;\; (V,E) \text{ is connected.} \tag{4.18}
\]
When $k = n - 1$, the solution of (4.18) is the MWST. Based on this observation, for the case $k \geq n$ we propose the following approximation algorithm:
1. Find the edge set $E_0$ that solves the MWST problem (4.16).
2. Find $E_1$ containing the edges $(i,j) \notin E_0$ corresponding to the $k - n + 1$ largest values of $\frac{r_{ij}^2}{1 - r_{ij}^2}$.
3. Return $\tilde{E} = E_0 \cup E_1$.
The resulting graph is the maximum weight connected k-sparse graph that contains the MWST.

4.7 Experiments

In this section, we evaluate Algorithm 2 on synthetic and real data. To quantify the graph learning performance in terms of topology inference, we use the F-score (which is typically used for evaluating binary classification problems), defined as
\[
FS = \frac{2\,tp}{2\,tp + fp + fn},
\]
where tp, fp, fn correspond to true positives, false positives and false negatives, respectively. The F-score takes values in [0, 1], where a value of 1 indicates perfect classification. To evaluate GGL estimation, we use the relative error in Frobenius norm, given by
\[
RE = \frac{\|L_{\mathrm{ref}} - L^\#\|_F}{\|L_{\mathrm{ref}}\|_F},
\]
where $L_{\mathrm{ref}}$ is a reference GGL matrix, and $L^\#$ denotes the output of Algorithm 2.
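Both evaluation metrics are straightforward to compute from edge sets and matrices; a minimal sketch (function names are illustrative):

```python
import numpy as np

def f_score(E_true, E_est):
    """F-score of an estimated edge set against a reference edge set."""
    tp = len(E_true & E_est)   # true positives
    fp = len(E_est - E_true)   # false positives
    fn = len(E_true - E_est)   # false negatives
    return 2 * tp / (2 * tp + fp + fn)

def relative_error(L_ref, L_est):
    """Relative error in Frobenius norm."""
    return np.linalg.norm(L_ref - L_est) / np.linalg.norm(L_ref)

E_true = {(0, 1), (1, 2), (2, 3)}
E_est = {(0, 1), (1, 2), (0, 3)}
fs = f_score(E_true, E_est)   # 2*2 / (2*2 + 1 + 1) = 2/3
re = relative_error(np.eye(2), np.eye(2))
```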
4.7.1 Graph learning from synthetic data

In this section, we evaluate the performance of Algorithm 2 for learning graphs with three different topology properties. The main difficulty in the evaluation of our algorithms is the fact that the optimal solution of Problem 1 cannot be found in general. Even designing an experiment with synthetic data and computing metrics to evaluate graph learning performance is non-trivial. Therefore, we will instead consider scenarios where we have an input empirical covariance matrix that is close to an attractive GMRF with a GGL matrix in the graph family $\mathcal{F}$. For this purpose, we generate data as i.i.d. realizations of the random variable
\[
x = \delta z_1 + (1 - \delta) z_2, \tag{4.19}
\]
where $\delta \in \{0, 1\}$ is a Bernoulli random variable with probability $P(\delta = 1) = \alpha$. The independent random variables $z_1$ and $z_2$ are distributed as zero-mean multivariate Gaussians $\mathcal{N}(0, L_{\mathrm{type}}^{-1})$ and $\mathcal{N}(0, L_{\mathrm{ER}}^{-1})$, respectively. The precision matrix of $z_1$ is the GGL of a graph with an admissible graph topology $G(L_{\mathrm{type}}) = (V, E_{\mathrm{type}}, L_{\mathrm{type}})$, while the precision matrix of $z_2$ is the GGL of an Erdős–Rényi random graph (which is not admissible). A simple calculation reveals that x is zero-mean with covariance matrix equal to
\[
E(xx^\top) = \Sigma = \alpha L_{\mathrm{type}}^{-1} + (1 - \alpha) L_{\mathrm{ER}}^{-1}.
\]
The inverse covariance matrix $\Sigma^{-1}$ is not a GGL matrix, but for $\alpha \approx 1$ we expect it to be well approximated by an attractive GMRF with the same graph topology as $L_{\mathrm{type}}$. In our experiments, we choose the inverse covariance matrix of $z_2$ as the GGL of an Erdős–Rényi random graph, with edge probability $p = 0.05$ and with weights sampled from the uniform distribution U[1, 2]. Note that, for a small choice of $1 - \alpha$, $z_2$ can be interpreted as a random noise model based on Erdős–Rényi graphs. We compare our algorithms against two methods. First, we consider a reference Laplacian matrix, obtained as
\[
L_{\mathrm{ref}} = \arg\min_{L \in \mathcal{L}^+_n(E_{\mathrm{type}})} -\log\det(L) + \mathrm{tr}(S_N L), \tag{4.20}
\]
where $S_N$ is the empirical covariance matrix obtained from N i.i.d.
realizations of the random vector x. The constraint forces the topology of the reference GGL to be included in $E_{\mathrm{type}}$, possibly with different graph weights, except when $N \to \infty$ and $\alpha = 1$. This method assumes that for smaller noise, i.e., $\alpha \approx 1$, the correct graph topology should resemble $E_{\mathrm{type}}$. Second, we compare against a baseline method that first learns a graph without any topology constraint using existing methods [52], then simplifies the graph topology by applying the algorithms from Section 4.6 to the learned graph, and finally estimates the GGL with the new topology. We can summarize this procedure in the following steps:
1. Solve an unconstrained graph learning problem,
\[
L_{\mathrm{uc}} = \arg\min_{L \in \mathcal{L}^+_n} -\log\det(L) + \mathrm{tr}(S_N L), \tag{4.21}
\]
for which, in general, $G(L_{\mathrm{uc}}) \notin \mathcal{F}$.
2. Find the graph topology
\[
E_{\mathrm{base}} = \arg\max_{(V,E) \in \mathcal{F}} \sum_{(i,j) \in E} |l^{\mathrm{uc}}_{ij}|, \tag{4.22}
\]
where $L_{\mathrm{uc}} = (l^{\mathrm{uc}}_{ij})$.
3. Find the graph weights
\[
L_{\mathrm{base}} = \arg\min_{L \in \mathcal{L}^+_n(E_{\mathrm{base}})} -\log\det(L) + \mathrm{tr}(S_N L). \tag{4.23}
\]
Our theoretical analysis indicates that step 1 in the aforementioned procedure can be removed, and that step 2 can be solved using the covariance matrix. In this section, we experimentally validate our theoretical analysis by demonstrating that our proposed methods outperform the baseline method.

4.7.1.1 Learning tree structured graphs

We sample 50 uniform random trees with n = 50 vertices. Each edge in the graph is assigned a weight drawn from a uniform distribution U[1, 2], while the vertex weights are drawn from U[0, 1]. For each graph and each N (number of i.i.d. realizations), we run 50 Monte-Carlo simulations. The proposed methods are compared against the reference Laplacian obtained by solving (4.20) and the baseline method. In Fig. 4.1a, we plot the F-score with respect to the reference Laplacian as a function of the parameter N/n (i.e., the number of observed data samples per vertex). As expected, our algorithms find graph topologies similar to those of the reference Laplacian, and for smaller $\alpha$ the topologies become more dissimilar.
We observe that the proposed algorithm leads to solutions that are close to the reference Laplacian. (Footnote 2: all trees of n nodes are picked with the same probability, given by $1/n^{n-2}$; we generate them by constructing Prüfer sequences.)

[Figure 4.1: Performance of Algorithm 2 for learning tree structured graphs. (a) F-score (FS) for edges; (b) Relative Error (RE); (c) Objective function. Curves compare the proposed and baseline methods (and the reference objective) for $\alpha = 0.75$ and $\alpha = 0.95$, as a function of N/n.]

For larger numbers of samples, the relative error and objective functions converge, and the output of Algorithm 2 and the reference Laplacian become very close. Even though x is a Gaussian mixture, so that in this case our algorithm is not equivalent to the Chow-Liu algorithm, it still finds the true graph topology. For $N/n \leq 5$, the baseline method is slightly better than (A) at finding the correct topology, while for larger values of N the performance becomes indistinguishable. This is remarkable, since the baseline method estimates the graph topology based on the solution of an unconstrained GGL estimation problem, while our proposed GTI method only uses the sample covariance matrix.

4.7.1.2 Learning bipartite graphs

We construct GGL matrices $L_{\mathrm{type}}$ of bipartite graphs with n = 50 vertices as follows. First we partition the vertices into $S = \{1, \dots, 25\}$ and its complement $S^c$. Then, the edges that connect $S$ with $S^c$ are added with probability q = 0.5. Finally, the edge weights and vertex weights are drawn from the uniform distributions U[1, 2] and U[0, 1], respectively. We run Algorithm 2 and use the max-cut algorithm from [64] to solve the GTI step.
To evaluate graph learning performance, we report F-scores for the vertex partition, relative errors, and the value of the objective function, all with respect to the optimal GGL matrix found by solving (4.20). The F-scores are smaller relative to the tree case, but they improve with more data samples N/n, suggesting that the obtained graph topologies are not substantially different from the reference graph.

[Figure 4.2: Performance of Algorithm 2 for learning bipartite graphs. (a) F-score (FS) for vertices; (b) Relative Error (RE); (c) Objective function. Curves compare the proposed and baseline methods (and the reference objective) for $\alpha = 0.75$ and $\alpha = 0.95$, as a function of N/n.]

However, notice that the reference Laplacian, i.e., the solution of (4.20), has a higher cost than the output of our algorithm, for all values of N/n and $\alpha$, as seen in Figure 4.2c. For $\alpha = 0.95$, the baseline method is outperformed by all other methods when $N/n \geq 10$. A similar behavior is observed for $\alpha = 0.75$, but only when $N/n \geq 100$. As expected, the values of the objective function verify that the baseline method is slightly better than all other methods, suggesting that the reference graph topology is not optimal when N/n is smaller.

4.7.1.3 Connected k-sparse graphs

We construct connected k-sparse graphs as follows. First we generate a uniform random tree with n = 50 vertices, then we randomly connect pairs of vertices until there are k edges. The GGL matrix $L_{\mathrm{type}}$ is then constructed by assigning random edge and vertex weights drawn from the uniform distributions U[1, 2] and U[0, 1], respectively. We choose k = 150 (a sparsity of approximately 12%) for all experiments.
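The construction just described (a uniform random spanning tree, extra random edges up to k, then random edge and vertex weights) can be sketched as follows. The Prüfer-sequence tree sampling mirrors the earlier tree experiments; all names and parameter choices are illustrative:

```python
import numpy as np

def random_tree_edges(n, rng):
    """Uniform random spanning tree on n nodes, decoded from a random Pruefer sequence."""
    seq = list(rng.integers(0, n, size=n - 2))
    degree = [1] * n
    for v in seq:
        degree[v] += 1
    edges = []
    for v in seq:
        leaf = min(u for u in range(n) if degree[u] == 1)
        edges.append((min(leaf, v), max(leaf, v)))
        degree[leaf] -= 1
        degree[v] -= 1
    a, b = [x for x in range(n) if degree[x] == 1]
    edges.append((a, b))
    return edges

def k_sparse_connected_ggl(n, k, rng):
    """GGL L = D + V - W of a connected k-sparse graph with random weights."""
    edges = set(random_tree_edges(n, rng))
    while len(edges) < k:  # add random extra edges on top of the spanning tree
        i, j = sorted(rng.integers(0, n, size=2))
        if i != j:
            edges.add((i, j))
    W = np.zeros((n, n))
    for i, j in edges:
        W[i, j] = W[j, i] = rng.uniform(1, 2)   # edge weights ~ U[1, 2]
    vertex_weights = rng.uniform(0, 1, size=n)  # vertex weights ~ U[0, 1]
    return np.diag(W.sum(axis=1) + vertex_weights) - W

rng = np.random.default_rng(0)
L = k_sparse_connected_ggl(10, 15, rng)
```

Since the graph contains a spanning tree and the vertex weights are positive, the resulting L is a positive definite GGL.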
We follow the same procedure as before and compare the reference GGL against the output of Algorithm 2, with the GTI step solved via (A). We observe from Figure 4.3c that the graph topology given by the reference graph is not optimal, since Algorithm 2 always produces GGL matrices with smaller cost. When compared to the baseline method, we observe that for $\alpha = 0.75$ the performance is almost as good as (A). For $\alpha = 0.95$, the baseline method and (A) show identical performance until $N/n > 1$, when the proposed method (A) starts improving in F-score and RE.

[Figure 4.3: Performance of Algorithm 2 for learning connected k-sparse graphs. (a) F-score (FS) for edges; (b) Relative Error (RE); (c) Objective function. Curves compare the proposed and baseline methods (and the reference objective) for $\alpha = 0.75$ and $\alpha = 0.95$, as a function of N/n.]

An interesting observation is that, for very large N/n, the baseline method and the output of Algorithm 2 have different graph topologies, as indicated by the F-score. However, they all converge to the same objective function value. This is due to the non-convexity of the problem, since multiple solutions with the same cost value may exist.

4.7.2 Image graphs

In this section, we evaluate Algorithm 2 with image data. We use the Brodatz texture images from the USC-SIPI dataset. We take the subset of rotated textures straw*.tif and partition each image into non-overlapping blocks of size 8 x 8. Each block is vectorized and organized as a column of the data matrix X. Then, we use Algorithm 2 to learn (i) tree, (ii) bipartite, and (iii) connected sparse graphs. For trees and connected sparse graphs, we use the sample covariance matrix $S_N$.
For bipartite graphs, we use a thresholded sample covariance matrix $S_N \odot A$, where A is the 0-1 connectivity matrix of an 8-connected grid graph [121]. In Figure 4.4, we show the resulting graphs for three straw textures. The edge weights are colored and have been normalized to lie in [0, 1]. The resulting graphs all have strong link weights that follow the texture orientation. In particular, for trees, the strong weights connect pixels along the main texture orientation, while the weaker edge weights connect pixels in other directions. For bipartite graphs, the vertices (i.e., pixels) are colored in red and black to mark the bipartition obtained during the graph topology inference step of Algorithm 2. Neighboring pixels along the direction of the texture orientation have strong edges, and at the same time neighboring pixels have different colors, providing a reasonable and intuitive notion of every other pixel. For sparse connected graphs, even though we choose the same parameter k = 112 for all images, the number of connections in the optimal graph is always smaller. The graphs corresponding to the straw000, straw060 and straw150 images have 105, 109 and 107 edges, respectively.

4.8 Conclusion

We have proposed a framework to learn graphs with admissible topology properties. These include monotone graph families, graphs with a given number of connected components, and graph families in the intersection of both. The proposed algorithm consists of two main steps. First, a graph topology with the desired property is found by solving a maximum weight spanning sub-graph problem on the similarity matrix. Second, a generalized graph Laplacian matrix is found by solving a graph topology constrained log-determinant divergence minimization problem.
Although finding the best graph with the desired property is in general a non-convex optimization problem, we show that our proposed solution is near optimal, in the sense that it seeks to minimize an error bound with respect to the best possible solution. Our theoretical guarantees are based on the analysis of a convex relaxation of the non-convex problem and rely on properties of generalized graph Laplacian matrices. We implemented instances of our algorithm to learn tree structured graphs, bipartite graphs and sparse connected graphs, and tested them with synthetic and texture image data. Our theoretical and numerical results indicate that one can find a near optimal graph topology by screening the entries of the covariance/similarity matrix. Some future research directions are:
1. Applications of the covariance screening rules to improving the computational complexity of the algorithms from Chapter 3 and [52].
2. Since trees have closed form solutions, it would be interesting to investigate other types of graphs that admit analytic formulas or simplified optimization algorithms.
3. GGLs are easier to study but less popular than other Laplacian matrices, thus it would be useful to obtain similar algorithms to estimate structured graphs for combinatorial and normalized Laplacian matrices.

Footnote 3 (USC-SIPI textures): http://sipi.usc.edu/database/database.php?volume=textures

[Figure 4.4: Structured graphs learned from sample covariance matrices of texture images using Algorithm 2 with GTI found using (A). (a)-(c) 512 x 512 texture images (straw000, straw060, straw150). (d)-(f) Tree structured graphs. (g)-(i) Bipartite graphs. (j)-(l) Sparse connected graphs obtained by setting the sparsity parameter k = 112.]

4.9 Proofs

4.9.1 Properties of GGL matrices

The following property of GGL matrices will be used throughout the chapter.

Lemma 2.
Let $L = V + D - W$ be a positive definite GGL, and denote $P = \mathrm{diag}(L) = V + D$. Then
\[
L^{-1} = P^{-1} + P^{-1} W P^{-1} + E \tag{4.24}
\]
for some $E \geq 0$.

Proof. First decompose L as
\[
L = P - W = P^{1/2}(I - Q)P^{1/2},
\]
where $Q = P^{-1/2} W P^{-1/2}$. Since L is positive definite, $I - Q$ is also positive definite, hence $\|Q\|_2 < 1$, so we have the following set of equalities:
\[
L^{-1} = P^{-1/2}(I - Q)^{-1} P^{-1/2} = P^{-1/2}(I + Q + Q^2 + \cdots)P^{-1/2} = P^{-1} + P^{-1} W P^{-1} + E,
\]
where $E = P^{-1/2}\left(\sum_{k=2}^{\infty} Q^k\right)P^{-1/2} \geq 0$.

A consequence is the following.

Theorem 6. [126] Let $L = V + D - W$ be a GGL matrix. Then the following statements are equivalent:
1. L is non-singular and $L^{-1} \geq 0$.
2. $L \in \mathcal{L}^+_n$.

When L is an inverse covariance matrix, Theorem 6 implies that attractive GMRF models have non-negative covariance matrices. However, not all GMRFs with non-negative covariance matrices correspond to attractive GMRF models.

4.9.2 Proof of Theorem 3

Proof. We need to show that, after a permutation of rows and columns, the matrices A and $L^\#$ as defined before have the same block diagonal structure [106]. Since we need to prove a set equality, we will show inclusion in one direction and then the converse.

($\Rightarrow$) First assume that the rows and columns of A have been reordered so that the matrix is block diagonal. Each block is indexed by the vertex partition given by the connected components of $\tilde{G}(K)$. Pick two connected components $s \neq t$ and $i \in V_s$, $j \in V_t$; hence $(i,j) \notin \tilde{E}$ and $a_{ij} = 0$, therefore $k_{ij} \leq 0$. Using Theorem 2 we have $l^\#_{ij} = 0$, and since this is true for all $i \in V_s$, $j \in V_t$, then $(i,j) \notin E^\#$: there are no edges between vertices in $V_s$ and $V_t$ in the graph $G(L^\#)$. Then there exist $s' \neq t'$ such that $\hat{V}_{s'} \subseteq V_s$ and $\hat{V}_{t'} \subseteq V_t$. Since this is true for all $s, t, i, j$, we also have that there are at least J connected components in $G(L^\#)$, i.e., $M \geq J$.

($\Leftarrow$) For the converse the proof is similar: again assume the rows and columns of $L^\#$ have been reordered so that the matrix is block diagonal, with blocks given by the vertices of the connected components of $G(L^\#)$.
Then pick two connected components $s \neq t$ and $i \in \hat{V}_s$, $j \in \hat{V}_t$; hence $(i,j) \notin E^\#$ and $l^\#_{ij} = 0$. Since $L^\#$ is block diagonal, $(L^\#)^{-1}$ is also block diagonal, therefore $((L^\#)^{-1})_{ij} = 0 = k_{ij} + \mu_{ij}$, where the second equality comes from (4.2). Using (4.3), $\mu_{ij} \geq 0$, hence $k_{ij} \leq 0$, which implies $(i,j) \notin \tilde{E}$ and $a_{ij} = 0$. Using the same argument as before, we have that there exist $s'' \neq t''$ such that $V_{s''} \subseteq \hat{V}_s$ and $V_{t''} \subseteq \hat{V}_t$, and $J \geq M$.

We have shown that $J = M$. Now, using ($\Rightarrow$), fix any s; then there exists a unique $s'$ (because $J = M$) such that $\hat{V}_{s'} \subseteq V_s$. Using ($\Leftarrow$), there also exists a unique $s''$ such that $V_{s''} \subseteq \hat{V}_{s'}$. Since the $\{V_s\}$ are disjoint, the only possibility is that $s'' = s$, which implies $V_s = \hat{V}_{s'}$, and since the mapping is one-to-one, the permutation is given by $\sigma(s) = s'$.

Chapter 5

Covariance matrix estimation with missing data

Covariance matrices are central components of data analysis in various disciplines, such as machine learning [172], medical studies [140, 18], finance [10], and signal processing [151, 35, 129], among others. The term missing data usually refers to observations that become unavailable due to an uncontrollable external factor. This problem arises in various applications including longitudinal studies, sensor networks and image processing. Developing covariance estimators in the missing data setting is of great importance, given the various applications where covariance matrices are required, as well as the inevitability and frequent occurrence of missing observations. Recent papers have studied estimators that generalize the sample covariance matrix to handle missing observations [96, 20]. They have studied questions such as unbiasedness and convergence to the population covariance matrix (consistency) through finite sample error bounds. These works assume that observations are independent identically distributed (i.i.d.)
copies of a sub-Gaussian vector, while the population covariance matrix may be unstructured, or have sparse, bandable [20], or low rank [96] structure. In this chapter we study covariance matrix estimation under various missing data mechanisms without making any structural assumptions (such as low rank or sparsity) on the population covariance matrix. We focus on several existing missing data mechanisms, and introduce a new one. Each of these mechanisms has a corresponding unbiased covariance estimator that generalizes the sample covariance matrix with complete observations. For each estimator we derive estimation error bounds in operator norm. Our principal contributions are:
- We study the missing completely at random (MCAR) and missing at random (MAR) missing data mechanisms, which are data independent and data dependent, respectively (see Section 5.2), each with a corresponding well known unbiased estimator.
- We consider an instance of the MCAR mechanism in which variables are observed according to independent Bernoulli random variables with different probabilities. We are the first to provide error bounds (Theorem 7) for this setting when all probabilities are arbitrary. The case of equal observation probabilities was considered in [96]. For the MAR mechanism, we are the first to bound the estimation error in operator norm (Theorem 8).
- We propose the conditionally MCAR (CMCAR) missing data mechanism, in which the missing data distribution evolves over time based on previous observations. We propose an unbiased estimator and characterize its performance in Theorem 9.
- Under certain conditions, we show that the bounds from Theorems 8 and 9 can be simplified to resemble the bound from Theorem 7. This is possible because they all depend on quantities we call scaled effective rank. In addition, under some simplifications, our bounds match previous state of the art results with complete and missing observations, up to constant and logarithmic factors.
This chapter is organized as follows. In Section 5.1 we review the literature on covariance estimation with missing data. Section 5.2 introduces the missing data mechanisms and estimators. Error bounds are presented and discussed in Section 5.3, while numerical experiments are contained in Section 5.4. Proofs and conclusion appear in Sections 5.5 and 5.6 respectively.

5.1 Related work

5.1.1 Covariance estimation with complete observations

The complete data is denoted by the $n \times N$ matrix $X$, where the columns of $X$ are assumed to be zero mean, independent and identically distributed (i.i.d.) realizations of a vector with population covariance matrix $\Sigma$. The starting point of most covariance estimation studies is the sample covariance matrix, given by $S = \frac{1}{N} X X^\top$, which is an unbiased estimator for the population covariance matrix $\Sigma$. Some recent papers have shown that the estimation error in operator norm of the sample covariance matrix is characterized by the effective rank of $\Sigma$ [96, 19, 81]. These results imply expectation bounds of the form

$\mathbb{E}\left[\|S - \Sigma\|\right] \leq \|\Sigma\|\, O\!\left(\sqrt{\frac{r(\Sigma)}{N}} \vee \frac{r(\Sigma)}{N}\right), \qquad (5.1)$

where $a \vee b$ stands for $\max(a,b)$. The $O$ notation hides logarithmic dependencies on the dimension $n$ and number of samples $N$. The effective rank $r(\Sigma)$ (see Definition 9) is upper bounded by the actual rank, but can be much smaller, thus offering a more nuanced measure of dimensionality. The inequality (5.1) implies the sample complexity of the sample covariance matrix is $N = O(r(\Sigma)/\epsilon^2)$; that is, the number of i.i.d. samples required for the error in (5.1) to be below $\epsilon$ is proportional to the effective rank. The results from [19, 96] assume a sub-Gaussian distribution, while [81] provides dimension free bounds (without logarithmic dependence on $n$) for the Gaussian case. Our bounds for covariance estimation with missing observations are also based on the effective rank, so they can be compared with the bounds available for complete observations.
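As a concrete illustration (this sketch is not part of the thesis; all function names and numbers are hypothetical), the code below computes the sample covariance and the effective rank for a synthetic covariance with rapidly decaying spectrum, and checks that the relative error in operator norm is small when $N$ is moderate compared to $r(\Sigma)$, even though the ambient dimension is $n = 50$:

```python
import numpy as np

def effective_rank(sigma):
    """Effective rank r(Sigma) = trace / operator norm (cf. Definition 9)."""
    return np.trace(sigma) / np.linalg.norm(sigma, 2)

def sample_covariance(X):
    """Sample covariance S = (1/N) X X^T for zero-mean columns of X."""
    N = X.shape[1]
    return (X @ X.T) / N

rng = np.random.default_rng(0)
n, N = 50, 2000
eigvals = 0.5 ** np.arange(n)            # fast spectral decay
sigma = np.diag(eigvals)                 # r(sigma) is close to 2, rank is 50
# Draw Gaussian samples with covariance sigma (columns are i.i.d. vectors).
X = np.sqrt(eigvals)[:, None] * rng.standard_normal((n, N))
S = sample_covariance(X)
rel_err = np.linalg.norm(S - sigma, 2) / np.linalg.norm(sigma, 2)
```

The relative error scales like $\sqrt{r(\Sigma)/N}$, which here is far smaller than the $\sqrt{n/N}$ rate a rank-based analysis would suggest.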
5.1.2 Missing data mechanisms

When there is missing data, construction of unbiased estimators requires knowledge of the missing data mechanism, i.e., the process by which data is lost. Recent literature on covariance estimation with missing observations considers mechanisms that are either data independent or data dependent. This distinction is significant since it leads to different estimators with unique properties and guarantees. We consider two mechanisms defined in the missing data literature [91], called missing completely at random (MCAR) and missing at random (MAR), and also introduce a new one called conditionally MCAR (CMCAR).

When the missing data mechanism does not depend on the data, it is called missing completely at random. The MCAR mechanism has been used to model missing features in machine learning applications [99], or data loss in sensor networks, where transmission errors (e.g., packet loss) or sensor failures induce missing data. This mechanism can model incomplete observations in clinical studies, and more generally in longitudinal studies [85], due to patient or subject dropout. We consider a particular case, where coordinates are observed according to Bernoulli 0/1 random variables that are independent of the data and of each other, and may have different probabilities. Covariance estimators for this type of missing data appear in [96, 93]. In [96], error bounds are derived assuming variables are observed with the same probability, while in [94] the covariance matrix estimator was not analyzed directly, but as part of a linear regression study. In this work, we analyze the covariance estimator under the more general non uniform Bernoulli sampling.

The second mechanism is called missing at random (MAR), and it depends on the observed data, but not on the unobserved data. This mechanism can model situations occurring in sensor networks, where data dependent missing observations can occur if the signal being measured can affect the sensing process.
For example, temperature and humidity may damage a sensor over time. In other applications, the subset of variables that are acquired or missed may depend on the data if an observer is seeking to improve system performance. We consider a covariance estimator that appeared in [80, 20], where it was used as a component of other procedures. That estimator is valid for both MAR and MCAR mechanisms, but to the best of our knowledge we are the first to study its estimation error under any setting.

The third missing data mechanism is new, and we call it conditionally MCAR (CMCAR). For this case, coordinates are observed according to Bernoulli 0/1 random variables, but their distributions can change over time, even depending on previous observations. A particular version of this mechanism was introduced in our previous work [123], but no error bounds were provided. The CMCAR mechanism is closely related to active learning, which comprises data adaptive procedures designed to maximize system performance while complying with problem specific cost constraints. In some scenarios, data acquisition carries a cost and obtaining complete observations might not be possible. For example, in clinical studies, measuring certain features (e.g., acquiring samples or running tests) may be expensive [161, 90], or in sensor networks, obtaining measurements incurs energy consumption [7, 44, 62]. Active feature acquisition techniques have been applied to reduce the cost of predictive and classification algorithms while maintaining performance [173, 135, 105, 22]. For these active approaches to be feasible, there must be a data collector that can choose how to acquire (or ignore) data, at a cost (monetary, energy, bandwidth, memory, etc.), based on prior observations.

5.1.3 Covariance estimation with missing data

The MCAR mechanism has been considered in previous statistical studies involving the covariance matrix [93, 96].
In particular, in [96] the author considers MCAR observations with uniform Bernoulli distribution, i.e., all variables are observed with probability $p > 0$, and the estimator of (5.2). That study shows that the error bound and sample complexity depend on $r(\Sigma)/p^2$, thus effectively requiring $1/p^2$ more samples than the sample covariance matrix to achieve the same performance. That work also proposes a regularized estimator for low rank covariance matrices, and shows that by carefully choosing a regularization parameter, its estimation error attains optimal error rates. More recently, Cai and Zhang [20] studied estimation of sparse and bandable covariance matrices under a MCAR missing data mechanism. The MCAR mechanism considered is allowed to be deterministic or random. The obtained error bounds match those for complete observations, up to scale factors that reflect the effect of missing data. Additionally, the error bounds are pessimistic, since they depend only on the least observed entry of the covariance matrix. Neither study [96, 20] offers insight into the relations among the missing data mechanism, the entries of the population covariance matrix, and the covariance estimation error. This is because the missing data mechanism is too simple [96], or the error bounds are too conservative [20]. Other works have focused on estimation of sparse inverse covariance matrices [93, 80]. In particular, [93] derives an error bound for the covariance matrix that is valid for sparse Gaussian graphical models. Table 5.1 lists the most relevant covariance estimators and compares them in terms of sample complexity, that is, the minimum number of observations required to achieve error below $\epsilon$.
Table 5.1: Comparison of covariance estimators with missing data.

- Bunea & Xiao [19]: any covariance class; sample covariance; no missing data; sample complexity $r(\Sigma)\log(n)\,1/\epsilon^2$; expectation bound.
- Cai & Zhang [20]: bandable class; block thresholding; MCAR; $(1/\hat{p}_{\min})\left(C_\alpha \log(n)/\epsilon^2 \vee (C_\alpha/\epsilon^2)^{1+1/(2\alpha)}\right)$; expectation bound.
- Cai & Zhang [20]: sparse class; adaptive thresholding; MCAR; $c_{n,N}\log(n)/\epsilon^2$; expectation bound.
- Lounici [96]: any; Eq. (5.2); uniform MCAR; $(r(\Sigma)/p^2)\log^2(2n \vee N)\,1/\epsilon^2$; bound with probability $1 - 1/(2n)$.
- Theorem 7: any; Eq. (5.2); non uniform MCAR; $\tilde{r}_{\min}(\Sigma, p)\log(n)\left(\log^2(N) \vee 1/\epsilon^2\right)$; expectation bound.
- Theorem 8: any; Eq. (5.4); MAR; $(r(\Sigma)/\hat{p}_{\min})\log(n)\left(\log^2(N) \vee 1/\epsilon^2\right)$; expectation bound.

For Cai & Zhang [20], the constants $\alpha$ and $c_{n,N}$ parameterize the classes of bandable and sparse covariance matrices respectively, and $\hat{p}_{\min}$ is the smallest non zero entry of $\hat{\Pi}$. Some universal numerical constants have been ignored; for detailed results see the corresponding theorems in Section 5.3.

5.2 Problem Setup

In this section we describe in detail the partial observation models and estimators that will be considered.

5.2.1 Notation

We follow the same notation introduced in Chapter 2. In addition, the entry-wise product¹ between matrices is defined as $(A \circ B)_{ij} = a_{ij} b_{ij}$. We use $\|\cdot\|_q$ for entry-wise matrix norms, with $q = 2$ corresponding to the Frobenius norm. $\|\cdot\|$ denotes the $\ell_2$ norm when applied to vectors, and the operator norm (largest singular value) when applied to matrices. The nuclear norm (sum of singular values) is denoted by $\|\cdot\|_\star$. The set of integers $\{1, 2, \dots, d\}$ is abbreviated by $[d]$.

5.2.2 Missing data

We say that $X$, an $n \times N$ matrix, is the complete data. Its $i$-th row and $k$-th column are denoted by $x_i$ and $x^{(k)}$ respectively. We observe the sequence of vectors $y^{(1)}, y^{(2)}, \dots, y^{(N)}$, where $y^{(k)} = \delta^{(k)} \circ x^{(k)}$, and $\delta^{(k)}$ is a vector with entries in $\{0, 1\}$.
If the column vectors $x^{(k)}$ are i.i.d., zero mean, and with population covariance matrix $\Sigma$, the sample covariance matrix of the complete data,

$S = \frac{1}{N} X X^\top = \frac{1}{N} \sum_{k=1}^{N} x^{(k)} x^{(k)\top},$

is an unbiased estimator for $\Sigma$. The behavior of $S$ for finite $N$ has been studied extensively and is well understood [96, 19, 81]. In the rest of this section we will introduce alternative covariance estimators designed for different types of partial observations $\{y^{(k)}\}_{k=1}^{N}$.

5.2.3 Missing data mechanisms and covariance estimators

Consider a random vector $x = [x_1, x_2, \dots, x_n]^\top$ in $\mathbb{R}^n$ with population covariance matrix $\Sigma$. The purpose of a covariance estimator is to accurately approximate $\Sigma$ from $\{x^{(k)}\}_{k=1}^{N}$, a set of independent identically distributed (i.i.d.) copies of $x$. Here, however, we only have access to the sequence of partial observations $\{y^{(k)}\}_{k=1}^{N}$. Our interest is in estimators with the following desirable properties:

1. Unbiased: their expected value is equal to $\Sigma$.
2. Consistent: as the number of observations $N \to \infty$ the estimator converges to $\Sigma$.
3. Minimum sample complexity: minimum number of samples to achieve error below $\epsilon > 0$.

We consider three missing data mechanisms, and therefore three different estimators, described next.

Unbiased estimator under MCAR observations. The missing data is determined by a random vector; more precisely, let $\delta = (\delta_i)$ be a vector of independent Bernoulli 0/1 random variables, where $\mathbb{P}(\delta_i = 1) = p_i$. We assume that the $\delta^{(k)}$ are i.i.d. copies of $\delta$, independent of the data $x^{(k)}$ for all $k \in [N]$. If all $p_i = 1$, then $y^{(k)} = x^{(k)}$ for all $k$, and thus we have perfect observation of $x^{(k)}$. We are interested in recovering the covariance matrix of $x$ from $\{y^{(k)}\}_{k=1}^{N}$ when $0 < p_i \leq 1$ for all $i \in \{1, \dots, n\}$. The vector of probabilities is denoted by $p = [p_1, \dots, p_n]^\top$, and $P = \operatorname{diag}(p)$ is a diagonal matrix. Let $\Pi = (\pi_{ij})$ be defined as $\pi_{ii} = p_i$ and $\pi_{ij} = p_i p_j$ when $i \neq j$, or equivalently $\Pi = p p^\top + P - P^2$.

¹The entry-wise product is also known as the Hadamard or Schur product.
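To see why an entry-wise correction by $\Pi$ is needed, the following sketch (illustrative only; the covariance and probabilities are hypothetical numbers of mine) checks numerically that the second moment of the masked vector $y = \delta \circ x$ concentrates around $\Sigma \circ \Pi$ rather than $\Sigma$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, N = 4, 400000
# Compound-symmetry covariance (positive definite) and non-uniform probabilities.
sigma = np.full((n, n), 0.5)
np.fill_diagonal(sigma, 1.0)
p = np.array([0.9, 0.7, 0.5, 0.3])

# Pi = p p^T + P - P^2, i.e. Pi_ii = p_i and Pi_ij = p_i p_j for i != j.
Pi = np.outer(p, p)
np.fill_diagonal(Pi, p)

X = rng.multivariate_normal(np.zeros(n), sigma, size=N).T   # complete data
delta = (rng.random((n, N)) < p[:, None]).astype(float)     # Bernoulli masks
Y = delta * X                                               # observed data

second_moment = (Y @ Y.T) / N   # concentrates around Sigma ∘ Pi, not Sigma
```

Dividing `second_moment` entry-wise by `Pi` removes the bias, which is exactly what the estimator in (5.2) does.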
Assuming $x$ is zero mean, the covariance of the observation vector $y^{(k)}$ is $\operatorname{Cov}(y^{(k)}) = \Sigma \circ \Pi$. Based on this observation, we will consider the following unbiased estimator:

$\hat{\Sigma} = \frac{1}{N} \sum_{k=1}^{N} y^{(k)} y^{(k)\top} \circ \Pi^{-}, \qquad (5.2)$

where $\Pi^{-}$ is the Hadamard (entry-wise) inverse of $\Pi$, that is, $\Pi \circ \Pi^{-} = \mathbf{1}\mathbf{1}^\top$, which is guaranteed to exist since $p_i > 0$. The estimation error of (5.2) has not been studied before, except when the missing data distribution is uniform, that is, $p_i = p$ for all $i$ [96]. We provide a novel error bound for this estimator in Theorem 7. Note that since the matrix $\Pi^{-}$ has negative eigenvalues, the matrix $\hat{\Sigma}$ might not be positive semi-definite (conditions for $\hat{\Sigma}$ to be positive semi-definite in the uniform case are given in [76]).

Unbiased estimator under MAR observations. We define the matrix of empirical observation probabilities by

$\hat{\Pi} = \frac{1}{N} \sum_{k=1}^{N} \delta^{(k)} \delta^{(k)\top}, \qquad (5.3)$

and denote its entries by $\hat{p}_{ij} = (\hat{\Pi})_{ij}$. The quantity $\hat{p}_{ij}$ is equal to zero when the $i$-th and $j$-th variables are not observed together, hence we can only hope to have an estimate for the observed entries. To formalize this we define the set $E = \{(i,j) : \hat{p}_{ij} \neq 0\}$, and let $\Sigma_E$ be the matrix that coincides with the population covariance for the entries corresponding to observed pairs, and is zero otherwise. Let $\hat{\Pi}^{-}$ be the Hadamard inverse of $\max(\hat{\Pi}, \frac{1}{N}\mathbf{1}\mathbf{1}^\top)$. An unbiased estimator for $\Sigma_E$ is given by

$\hat{\Sigma} = \frac{1}{N} \sum_{k=1}^{N} y^{(k)} y^{(k)\top} \circ \hat{\Pi}^{-}. \qquad (5.4)$

This estimator was used as part of the banded and sparse covariance estimators from [20], and the inverse covariance estimator from [80]. However, to the best of our knowledge we are the first to establish estimation error bounds for (5.4), in Theorem 8.

Unbiased estimator under CMCAR observations. The CMCAR mechanism presented here provides a general framework that includes active sequential observations and a mechanism that evolves over time, while still being MCAR.
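A minimal sketch of both estimators (function and variable names are mine, not the thesis's): (5.2) divides the averaged outer products entry-wise by $\Pi$ built from the known probabilities, while (5.4) divides by the empirical co-observation frequencies $\hat{\Pi}$ of (5.3), clipped below at $1/N$:

```python
import numpy as np

def mcar_estimator(Y, p):
    """Unbiased MCAR estimator (5.2). Y is n x N masked data (zeros where
    missing); p holds the known observation probabilities."""
    n, N = Y.shape
    Pi = np.outer(p, p)
    np.fill_diagonal(Pi, p)            # Pi = p p^T + P - P^2
    return (Y @ Y.T) / N / Pi          # entry-wise division = Hadamard inverse

def mar_estimator(Y, delta):
    """Unbiased MAR estimator (5.4). delta is the n x N 0/1 observation mask."""
    N = Y.shape[1]
    Pi_hat = np.maximum((delta @ delta.T) / N, 1.0 / N)   # (5.3), clipped at 1/N
    return (Y @ Y.T) / N / Pi_hat

# Quick check on synthetic MCAR data.
rng = np.random.default_rng(1)
n, N = 15, 100000
A = rng.standard_normal((n, n))
sigma = A @ A.T / n + np.eye(n)        # a generic positive definite Sigma
p = rng.uniform(0.5, 1.0, size=n)
X = rng.multivariate_normal(np.zeros(n), sigma, size=N).T
delta = (rng.random((n, N)) < p[:, None]).astype(float)
Y = delta * X
op = lambda M: np.linalg.norm(M, 2)
err_mcar = op(mcar_estimator(Y, p) - sigma) / op(sigma)
err_mar = op(mar_estimator(Y, delta) - sigma) / op(sigma)
```

Consistent with the experiments in Section 5.4, the MAR estimator is competitive even under MCAR observations, because it normalizes by the realized observation counts rather than their expectations.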
The CMCAR mechanism stands between the MCAR and the MAR mechanisms, since it allows a special type of data dependency, and to the best of our knowledge it has not been considered for covariance matrix estimation. Model details are described next. The index $t = 1, 2, \dots$ denotes the iteration number, while $N_t$ is the number of vector realizations obtained at iteration $t$. The observed vectors at iteration $t$ are $\{y^{(t,1)}, \dots, y^{(t,N_t)}\}$, where $y^{(t,k)} = \delta^{(t,k)} \circ x^{(t,k)}$ and the $\{\delta^{(t,k)}\}_{k=1}^{N_t}$ are identically distributed Bernoulli with sampling probabilities $p^{(t)}$. $Y_t$ is an $n \times N_t$ matrix whose columns are the vectors $y^{(t,k)}$ for $k = 1, \dots, N_t$. We assume that for any $t \geq 1$:

1. Complete data is i.i.d. For $k \in [N_t]$ and $t \in [T]$, the complete observations $x^{(t,k)}$ are i.i.d. copies of $x$.
2. Stochastic observation parameters. The sampling probabilities are a function of previous observations, thus $p^{(t)} = f_t(Y_{t-1}, Y_{t-2}, \dots, Y_1)$.
3. Conditionally i.i.d. observations. Conditioned on the event $\{Y_\tau : \tau < t\}$, the random vectors $y^{(t,1)}, \dots, y^{(t,N_t)}$ are i.i.d.
4. Conditionally MCAR (CMCAR) mechanism. Conditioned on the event $\{Y_\tau : \tau < t\}$, the vectors $\{\delta^{(t,k)}\}_{k=1}^{N_t}$ are i.i.d., with independent coordinates, and independent of $x^{(t,k)}$.

The sampling probabilities are positive (i.e., $p^{(t)}_i > 0$ for all $i, t$), and we denote the vector of inverse probabilities by $q^{(t)}$, which obeys $q^{(t)} \circ p^{(t)} = \mathbf{1}$. We will also use the matrices $Q^{(t)} = \operatorname{diag}(q^{(t)})$ and $P^{(t)} = \operatorname{diag}(p^{(t)}) = (Q^{(t)})^{-1}$. Finally we define the matrices $\Pi_t = p^{(t)} (p^{(t)})^\top - P^{(t)}(P^{(t)} - I)$ and $\Pi_t^{-} = q^{(t)} (q^{(t)})^\top - Q^{(t)}(Q^{(t)} - I)$, which are Hadamard inverses: $\Pi_t \circ \Pi_t^{-} = \mathbf{1}\mathbf{1}^\top$. To keep notation simple, conditional expectation is abbreviated as $\mathbb{E}_{t-1}[\,\cdot\,] = \mathbb{E}[\,\cdot \mid Y_\tau : \tau < t]$. Below we present some useful quantities:

$Z_t = (Y_t Y_t^\top) \circ \Pi_t^{-}, \qquad (5.5)$

$W_T = \sum_{t=1}^{T} (Z_t - N_t \Sigma), \qquad \hat{\Sigma}_T = \frac{\sum_{t=1}^{T} Z_t}{\sum_{t=1}^{T} N_t}. \qquad (5.6)$

Useful facts about $Z_t$, $W_T$ and $\hat{\Sigma}_T$ are stated below.

1. $Z_t$ is an unbiased estimator for $N_t \Sigma$.
This follows by applying conditional expectation combined with (5.2):

$\mathbb{E}_{t-1}[Z_t] = \mathbb{E}_{t-1}\left[(Y_t Y_t^\top) \circ \Pi_t^{-}\right] = \mathbb{E}_{t-1}\left[Y_t Y_t^\top\right] \circ \Pi_t^{-} = N_t (\Sigma \circ \Pi_t) \circ \Pi_t^{-} = N_t \Sigma. \qquad (5.7)$

2. $W_T$ is a zero mean matrix martingale. The identity $\mathbb{E}_T[W_{T+1}] = W_T$ can be easily verified by applying conditional expectation:

$\mathbb{E}_T[W_{T+1}] = \mathbb{E}_T[Z_{T+1} - N_{T+1}\Sigma] + \mathbb{E}_T[W_T] = W_T.$

By iteratively applying conditional expectation we obtain $\mathbb{E}[W_T] = 0$.

3. $\hat{\Sigma}_T$ is an unbiased estimator for $\Sigma$. This result is implied by the martingale property and the identity $W_T = (\hat{\Sigma}_T - \Sigma)\left(\sum_{t=1}^{T} N_t\right)$. Performance of this estimator is detailed in Theorem 9.

5.3 Estimation error bounds

We start this section by introducing some concepts related to sub-Gaussian vectors and notions of complexity of covariance matrices. In Theorems 7, 8, and 9 we present error bounds in operator norm for each estimator. All results of this section depend on new quantities we call scaled effective ranks, which generalize the effective rank to account for partial observations. All our bounds can be viewed as non trivial applications of a moment inequality found in [25]. After some simplifications, our results are summarized in Table 5.1.

5.3.1 Preliminaries and assumptions

We first introduce a proxy for the rank of a matrix.

Definition 9. The effective rank [96, 163] of an $n \times n$ matrix $A$ is defined as $r(A) = \frac{\|A\|_\star}{\|A\|}$.

The effective rank obeys $1 \leq r(A) \leq \operatorname{rank}(A) \leq n$. When $A$ is positive semi-definite, $\|A\|_\star = \operatorname{tr}(A)$. The effective rank, as we will see in later sections, is the parameter that quantifies the sample complexity of covariance estimation. We introduce sub-Gaussian variables and vectors below.

Definition 10 ([163]). If $\mathbb{E}[\exp(z^2/K^2)] \leq 2$ holds for some $K > 0$, we say $z$ is sub-Gaussian. If $\mathbb{E}[\exp(|z|/K)] \leq 2$ holds for some $K > 0$, we say $z$ is sub-exponential.

Definition 11 ([163]). The sub-Gaussian and sub-exponential norms are defined as

$\|z\|_{\psi_\alpha} = \inf\{u > 0 : \mathbb{E}[\exp(|z|^\alpha/u^\alpha)] \leq 2\},$

for $\alpha = 2$ and $\alpha = 1$ respectively.

Definition 12.
If a random vector $z$ in $\mathbb{R}^n$ satisfies

$\|z\|_{\psi_\alpha} = \sup_{u : \|u\| = 1} \|u^\top z\|_{\psi_\alpha} < \infty,$

it is called sub-Gaussian or sub-exponential for $\alpha = 2$ or $\alpha = 1$ respectively.

We will work under Assumption 1 detailed below, which is standard in the covariance estimation literature [19, 96].

Assumption 1. The distribution of $x$ obeys:

- $x$ is a sub-Gaussian vector; therefore there is a constant $c$ so that $\mathbb{E}[|u^\top x|^r]^{1/r} \leq c\, \|u^\top x\|_{\psi_2} \sqrt{r}$ for every unit vector $u \in \mathbb{R}^n$ and $r \geq 1$.
- There is a constant $c_0$ so that for all $u \in \mathbb{R}^n$, $\|u^\top x\|_{\psi_2}^2 \leq c_0\, \mathbb{E}(u^\top x)^2$.

Finally we introduce a new quantity that generalizes the effective rank.

Definition 13. A scaled effective rank of a matrix $\Sigma$ is any quantity that takes values in $[r(\Sigma), r(\Sigma)/\gamma]$ for some constant $0 < \gamma < 1$, where the constant does not depend on $\Sigma$.

Examples that will appear in this section are

$\tilde{r}_q(\Sigma, p) = \frac{1}{\|\Sigma\|} \sum_{i=1}^{n} \frac{\sigma_{ii}}{p_i^q}, \qquad \tilde{r}_{\min}(\Sigma, p) = \frac{1}{p_{\min}} \tilde{r}_1(\Sigma, p),$

where $q \geq 1$ and $p$ is a vector satisfying $0 < p \leq \mathbf{1}$. We are interested in the cases $q = 1$ and $q = 2$. These quantities are related by the following ordering:

$\tilde{r}_1(\Sigma, p) \leq \tilde{r}_2(\Sigma, p) \leq \tilde{r}_{\min}(\Sigma, p). \qquad (5.8)$

5.3.2 Bounds under MCAR mechanism

Now we state our main result for the MCAR model.

Theorem 7. Suppose the random vector $x$ obeys Assumption 1. If all $p_i > 0$, and also $N \geq e^2$ and $n \geq 3$, then (5.2) satisfies

$\mathbb{E}\left[\|\hat{\Sigma} - \Sigma\|^2\right]^{1/2} \leq C_2 \|\Sigma\| \left[\sqrt{\frac{8e\log(n)\, \tilde{r}_{\min}(\Sigma, p)}{N}} + \frac{8e\log(n)\log(N)\, \tilde{r}_2(\Sigma, p)}{N}\right], \qquad (5.9)$

where $C_2$ is a universal constant. Some remarks are in order.

Comparison with covariance estimation with missing data. The scaled effective ranks and the effective rank are related by the following inequalities:

$r(\Sigma) \leq \tilde{r}_1(\Sigma, p) \leq \frac{r(\Sigma)}{p_{\min}}, \qquad (5.10)$

$r(\Sigma) \leq \tilde{r}_{\min}(\Sigma, p) \leq \frac{r(\Sigma)}{p_{\min}^2}, \qquad (5.11)$

$r(\Sigma) \leq \tilde{r}_2(\Sigma, p) \leq \frac{r(\Sigma)}{p_{\min}^2}. \qquad (5.12)$

The upper bounds in (5.10)-(5.12) are met when all $p_i = p$. When that happens, our bound in operator norm from Theorem 7 and [96, Proposition 3] have the same parameters and rates, thus they are equivalent up to constant and logarithmic factors.

Comparison with covariance estimation with complete observations.
When $p_i = 1$, the lower bounds in (5.10)-(5.12) are met and all scaled effective ranks match the effective rank. Several recent papers [96, 19, 81] have shown bounds in operator norm that scale as

$O\!\left(\sqrt{\frac{r(\Sigma)}{N}} + \frac{r(\Sigma)}{N}\right) = O\!\left(\sqrt{\frac{r(\Sigma)}{N}} \vee \frac{r(\Sigma)}{N}\right),$

where the $O$ notation hides constants and logarithmic dependencies on $n$ and $N$. Our bound in spectral norm matches that of [96]. The work of [19] derives optimal bounds in Frobenius and operator norm. The operator norm bounds are equivalent to ours if we ignore the $\log(N)$ factor. For the Gaussian case, it was shown in [81] that dimension free bounds can be attained (without any log terms). It is also worth mentioning that the error bounds from [96, 19] are high probability results based on variations of matrix Bernstein inequalities for the operator norm. In [96], one is used for sums of bounded random matrices, while [19] applies a version for unbounded matrices. Our results are based on a moment inequality for sums of independent, possibly unbounded matrices from [25], thus significantly simplifying the proofs.

Sample complexity. We can guarantee covariance estimation error in operator norm of magnitude

$\mathbb{E}\left[\|\hat{\Sigma} - \Sigma\|^2\right]^{1/2} \leq \epsilon \|\Sigma\|$

if the number of observations obeys

$N \geq C\, \tilde{r}_{\min}(\Sigma, p) \log(n)\left(\log^2(N) \vee 1/\epsilon^2\right),$

where $C$ is a universal constant depending only on $e$ and $C_2$. Compared with the case of complete observations, the missing data case requires extra samples by a factor of $\tilde{r}_{\min}(\Sigma, p)/r(\Sigma)$ to attain equal accuracy.

Parameter scaling. Each entry of the sample covariance matrix is a (weighted) average of a different number of observations. In particular, the $ij$ off-diagonal element of $\hat{\Sigma}$ is observed an expected $N p_i p_j$ times. The reduced data acquisition ratio is reflected in an increased error bound of the same order. Error decaying as $1/(N p^2)$ was shown to be optimal for the uniform sampling case, while a scaling on the order of $1/(N p_{\min}^2)$ was shown to be optimal for the case of banded and sparse covariance matrices [20]. Perhaps the most important consequence of Theorem 7 is that distributions with smaller effective rank, or containing low variance variables, can tolerate more missing observations (smaller $p_i$). In other words, if a variable has low energy, it does not matter if it is not completely observed.

5.3.3 Bounds under MAR mechanism

Now we present a result that holds under a more general missing data mechanism.

Theorem 8. Suppose the random vector $x$ satisfies Assumption 1, with $N \geq e^2$ and $n \geq 3$. Also assume that for all $i \in [n]$ there is a $k \in [N]$ so that $\delta^{(k)}_i = 1$. Conditioned on the missing data pattern,
Perhaps the most important consequence of Theorem 7 is that distributions with smaller eective rank or containing low variance variables, can tolerate more missing observations (smaller p i ). In other works, if a variable has low energy, it does not matter if it is not completely observed. 5.3.3 Bounds under MAR mechanism Now we present a result that holds under a more general missing data mechanism. Theorem 8. Suppose the random vector x satises Assumption 1, with Ne 2 and n 3. Also assume that for all i2 [n] there is a k2 [N] so that (k) i = 1. Conditioned on the missing data pattern, the estimator from (5.4) satises E h k b E k 2 i 1=2 kkC 2 0 B @ v u u t 8e log(n) N 1 kk max j X i:(i;j)2E ii b p ij + 8e log(n) log(N) N 1 kk v u u t X (i;j)2E ii jj b p 2 ij 1 C A; (5.13) where C 2 is a universal numerical constant. 55 The right hand side of (5.13) has some similarities with the bound from Theorem 7. First note that the constants in each term are scaled eective ranks. Indeed, we have that r() 1 kk max j X i:(i;j)2E ii b p ij r() b p min ; r() 1 kk v u u t X (i;j)2E ii jj b p 2 ij r() b p min ; whereb p min = min (i;j)2E b p ij . Second, if we assume the collectionf (k) g N k=1 are random, and distributed as in Theorem 7, the bounds from Theorem 7 and 8 become equivalent as N increases. To see this, observe that b p ij !p i p j for i6=j, andb p ii !p i , when N!1, thus 1 kk max j X i:(i;j)2E ii b p ij ' 1 kk max j X i:(i;j)2E;i6=j ii p i p j + ii p i 1 kk max j X i ii p i p j = r min (;p); (5.14) and also 1 kk v u u t X (i;j)2E ii jj b p 2 ij ' 1 kk v u u t X (i;j)2E;i6=j ii jj p 2 i p 2 j + X i 2 ii p 2 i 1 kk s X i;j ii jj p 2 i p 2 j = 1 kk X i ii p 2 i = r 2 (;p): (5.15) A third similarity is the condition that each variable is observed at least once, with its counterpart in Theorem 7, where all p i > 0. 
As discussed in Section 5.2, for $\hat{\Sigma}$ to be unbiased, we require that all $\hat{p}_{ij} > 0$, meaning that all pairs of variables must be observed together at least once. This stronger requirement for unbiasedness, along with the conditional nature of the bound, is the price to pay for the more general missing data mechanism. Regarding sample complexity, a necessary condition for the relative error to be at most $\epsilon$ is given by the inequalities

$N \geq C\, \frac{1}{\|\Sigma\|} \frac{\sum_{(i,j) \in E} (\sigma_{ii}\sigma_{jj})/\hat{p}_{ij}^2}{\max_j \sum_{i : (i,j) \in E} \sigma_{ii}/\hat{p}_{ij}}\, \log(n)\log^2(N), \quad \text{and} \quad N \geq C \left(\frac{1}{\|\Sigma\|} \max_j \sum_{i : (i,j) \in E} \frac{\sigma_{ii}}{\hat{p}_{ij}}\right) \log(n)\, \frac{1}{\epsilon^2},$

where $C$ is a constant that depends on $e$ and $C_2$. By increasing the number of samples we obtain a simpler expression,

$N \geq C\, \frac{r(\Sigma)}{\hat{p}_{\min}} \log(n)\left(\log^2(N) \vee 1/\epsilon^2\right),$

that also guarantees an error below $\epsilon$.

5.3.4 Bounds under active sequential observations (CMCAR mechanism)

The following result characterizes the estimation error of (5.6).

Theorem 9. Suppose $p^{(t+1)} = f_t(\hat{\Sigma}_t)$ for all $t \geq 1$, where the $f_t(\cdot)$ are bounded, in the sense that there is a constant $\gamma$ so that for all $i$ and for all $t$, $p^{(t)}_i \geq \gamma > 0$ almost surely. Under the same assumptions of Theorem 7, the estimator from (5.6) obeys

$\mathbb{E}[\|\hat{\Sigma}_T - \Sigma\|^2]^{1/2} \leq \tilde{C} \|\Sigma\| \sqrt{\log(n+1)} \left[\sqrt{\frac{8e\log(n)}{N}\, \varepsilon_2} + \frac{8e\log(n) L}{N}\, \varepsilon_3\right],$

where $\tilde{C} = C_2 C \sqrt{2}$ and $\varepsilon_2, \varepsilon_3$ are scaled effective ranks given by

$\varepsilon_2(\Sigma, \{p^{(t)}, N_t\}_{t=1}^T) = \mathbb{E}\left[\frac{1}{N} \sum_{t=1}^{T} \tilde{r}_{\min}(\Sigma, p^{(t)}) N_t\right], \qquad \varepsilon_3^2(\Sigma, \{p^{(t)}, N_t\}_{t=1}^T) = \mathbb{E}\left[\frac{1}{L^2} \sum_{t=1}^{T} \tilde{r}_2^2(\Sigma, p^{(t)}) \log^2(N_t)\right],$

with $N = \sum_{t=1}^{T} N_t$ and $L^2 = \sum_{t=1}^{T} \log^2(N_t)$.

Note that when $T = 1$, the bounds from Theorems 7 and 9 differ only in the $\log(n+1)$ term. The estimation error depends on the behavior of the quantities $\varepsilon_i$, which depend on the observations, the missing data mechanism, and $\{f_t\}_{t=1}^T$. In particular, when $f_t$ depends only on the index $t$, and not on the data, we have the case of a time varying MCAR mechanism.
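The pooled estimator (5.6) can be sketched as follows, with a simple adaptive rule $f_t$ of my own choosing (not from the thesis) that samples high-variance coordinates more often while keeping all probabilities bounded away from zero, as Theorem 9 requires:

```python
import numpy as np

def pi_matrix(p):
    Pi = np.outer(p, p)                  # Pi_t = p p^T + P - P^2
    np.fill_diagonal(Pi, p)
    return Pi

def cmcar_estimator(rounds):
    """Pooled estimator (5.6); rounds is a list of (Y_t, p_t) pairs."""
    Z_sum, N_sum = 0.0, 0
    for Y, p in rounds:
        Z_sum = Z_sum + (Y @ Y.T) / pi_matrix(p)   # unnormalized Z_t of (5.5)
        N_sum += Y.shape[1]
    return Z_sum / N_sum

rng = np.random.default_rng(3)
n, N_t = 10, 20000
sigma = np.diag(np.linspace(1.0, 0.1, n))
rounds, p = [], np.full(n, 0.7)
for t in range(5):
    X = rng.multivariate_normal(np.zeros(n), sigma, size=N_t).T
    mask = (rng.random((n, N_t)) < p[:, None]).astype(float)
    rounds.append((mask * X, p.copy()))
    # Adaptive rule f_t: probabilities proportional to estimated standard
    # deviations, clipped to [0.3, 1] so that p_i^(t) >= gamma = 0.3 > 0.
    v = np.clip(np.diag(cmcar_estimator(rounds)), 1e-6, None)
    p = np.clip(np.sqrt(v / v.max()), 0.3, 1.0)
rel_err = (np.linalg.norm(cmcar_estimator(rounds) - sigma, 2)
           / np.linalg.norm(sigma, 2))
```

Because each round is debiased with its own $\Pi_t$, pooling the $Z_t$ remains unbiased even though the sampling probabilities depend on past data.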
The assumptions of Theorem 9 guarantee that the scaled effective ranks are well defined and bounded: $\varepsilon_i \leq r(\Sigma)/\gamma^2$. This implies convergence in probability of $\hat{\Sigma}_T \to \Sigma$. To establish that this bound is also consistent with Theorems 7 and 8 for $T > 1$, we will analyze the limiting behavior of the expectations in the upper bound.

Proposition 6. Under the same assumptions of Theorem 9, if additionally $f_t = f$ is continuous, then for $s \geq 1$,

$p^{(t)} \to p^{\infty}, \qquad (5.16)$

$\tilde{r}_s(\Sigma, p^{(t)}) \to \tilde{r}_s(\Sigma, p^{\infty}), \qquad (5.17)$

$\tilde{r}_{\min}(\Sigma, p^{(t)}) \to \tilde{r}_{\min}(\Sigma, p^{\infty}), \qquad (5.18)$

in probability, where $p^{\infty} = f(\Sigma)$. Also we have convergence of the expectations, i.e.,

$\mathbb{E}[\tilde{r}_2^2(\Sigma, p^{(t)})] \to \tilde{r}_2^2(\Sigma, p^{\infty}), \qquad (5.19)$

$\mathbb{E}[\tilde{r}_{\min}(\Sigma, p^{(t)})] \to \tilde{r}_{\min}(\Sigma, p^{\infty}). \qquad (5.20)$

Proof. Since $f$ is bounded and a continuous function of $\hat{\Sigma}_t$, so are $\tilde{r}_s \circ f$ and $\tilde{r}_{\min} \circ f$. All limits are implied by convergence in probability of $\hat{\Sigma}_t$ to its expected value, along with [49, Th. 2.3.4].

Proposition 6, together with Theorems 7 and 9, implies that asymptotically our estimators in (5.2) and (5.6) converge at the same rates. This is a direct consequence of the following limits, which are implied by Proposition 6:

$\varepsilon_2 \to \tilde{r}_{\min}(\Sigma, p^{\infty}), \quad \text{and} \qquad (5.21)$

$\varepsilon_3 \to \tilde{r}_2(\Sigma, p^{\infty}). \qquad (5.22)$

We will finish the analysis of this estimator by establishing a lower bound on the sample complexity.

Proposition 7. Suppose that $e < N_t$, and the following conditions hold:

$\mathbb{E}\left[\sum_{t=1}^{T} \tilde{r}_{\min}(\Sigma, p^{(t)}) N_t\right] \geq \mathbb{E}\left[\sum_{t=1}^{T} 8e\log(n)\log^2(N_t)\, \tilde{r}_{\min}^2(\Sigma, p^{(t)})\right], \qquad (5.23)$

$N \geq 32e \tilde{C}^2 \log(n)\log(n+1)\, \varepsilon_2\, \frac{1}{\epsilon^2}, \qquad (5.24)$

for some $T > 1$ and some $\epsilon > 0$; then

$\mathbb{E}\left[\|\hat{\Sigma}_T - \Sigma\|^2\right]^{1/2} \leq \epsilon \|\Sigma\|.$

Proof. The first inequality implies

$\sqrt{\frac{8e\log(n)}{N}\, \varepsilon_2} \geq \frac{8e\log(n) L}{N}\, \varepsilon_3.$

Combining this with Theorem 9 leads to

$\mathbb{E}[\|\hat{\Sigma}_T - \Sigma\|^2]^{1/2} \leq 2\tilde{C} \|\Sigma\| \sqrt{\frac{8e\log(n)\log(n+1)}{N}\, \varepsilon_2},$

and then using the second assumption we get the desired bound.

If we ignore logarithmic and constant factors, the condition (5.24) is of the same order of magnitude as the sample complexity we established for the MCAR and MAR estimators.
Note that in this case there is an additional $\log(n+1)$ factor. The inequality (5.23) is a condition on the sequence $\{N_t\}$, and although it looks convoluted, it is easily satisfied if $T$ is large enough and

$N_t / \log^2(N_t) = \Omega\left(\log(n)\, \tilde{r}_{\min}(\Sigma, p^{\infty})\right). \qquad (5.25)$

An even simpler condition is picking $N_t$ to be an increasing sequence with $\lim_{t\to\infty} N_t = \infty$.

5.4 Experiments

5.4.1 Covariance estimation under MCAR observations

5.4.1.1 Uniform MCAR observations

We first compare the estimators from (5.2) and (5.4), designed to handle MCAR and MAR observations respectively. We study their estimation error under an MCAR mechanism from Section 5.2. The experimental setup is as follows. Data follows a Gaussian distribution with mean zero and population covariance matrix $\Sigma$. The eigenvalues are $\beta^i$ for $i = 1, \dots, n$, with $0 < \beta < 1$ chosen so that $r(\Sigma) = 4$. The eigenvectors are generated by applying the Gram-Schmidt method to an $n \times n$ random matrix with $\mathcal{N}(0,1)$ independent entries. The dimension is set to $n = 50$. The MCAR missing data mechanism is uniform with observation probabilities $p \in \{0.4, 0.6, 0.8\}$. We generate complete and incomplete data. The numbers of samples $N$ are chosen to be 100 equidistant values in the log scale between 15 and $50n$. For each pair $(N, p)$ we compute the estimation error and average over 100 trials. The averaged estimation errors are plotted in Figure 5.1. The black line in the figure corresponds to the estimation error of the sample covariance matrix, and serves as our baseline performance. Theorem 7 indicates that for $N$ large enough, the MCAR estimator with missing data has error $O((p\sqrt{N})^{-1})$; thus, to achieve the same error as the sample covariance matrix, it needs more samples by a factor of $p^{-2}$. We verify this experimentally by multiplying the estimation error of the sample covariance matrix by $p^{-1}$. Those curves are shown in dotted lines, and are always above the estimation error of both MCAR and MAR estimators, almost parallel, and tighter when $N/n < 1$.
Additionally, the estimator for the MAR setting (5.4) always outperforms the estimator (5.2) for the MCAR Bernoulli observations, even though the latter has access to the actual missing data distribution. This might occur because the estimator (5.4) computes averages using the actual number of observations, instead of an estimate that uses the sampling probabilities. In addition, the comparison established by the inequalities (5.14) and (5.15) suggests that Theorem 8 provides tighter bounds than Theorem 7 when the missing data mechanism is MCAR Bernoulli.

5.4.1.2 Non uniform MCAR observations

In the next experiment we show that when the missing observations have a favorable distribution, not much information is lost due to missing data. We consider the population covariance matrix shown in Figure 5.2, which is non singular, has dimension $n = 50$, and effective rank $r(\Sigma) = 2.7472$. The missing data distribution is MCAR, with observation probabilities given by $p = \min\left(\gamma\sqrt{\operatorname{diag}(\Sigma)}, \mathbf{1}\right)$. In this setting the variable $x_i$ is observed with probability proportional to its standard deviation $\sqrt{\sigma_{ii}}$. The parameter $\gamma$ is set so that the expected proportions of observed variables are $\{28\%, 40\%, 50\%\}$. The performance of (5.4) is depicted in Figure 5.3. When roughly 50% or more of the variables are observed on average, the estimator has almost the same performance as the sample covariance matrix that has access to complete observations. Since the performance of the estimator (5.2) is slightly worse, we do not include it this time.

Figure 5.1: Comparison of MCAR and MAR estimators under MCAR missing observations with uniform observation probabilities. The plot shows relative estimation error in operator norm versus number of i.i.d. realizations, both in logarithmic scale.
Dashed and solid lines denote the MAR and MCAR estimators respectively, while dotted lines are an estimate based on the error bounds.

Figure 5.2: Approximately sparse covariance matrix with low effective rank. (a) covariance matrix; (b) variance per variable index.

Figure 5.3: Evaluation of the MAR estimator under MCAR missing observations with non uniform observation probabilities.

5.5 Proofs

5.5.1 Bounding tools

Sub-exponential and sub-Gaussian properties [163]. For any $r \geq 1$, a random variable $z$ satisfies

$\mathbb{E}[|z|^r]^{1/r} \leq c\, \|z\|_{\psi_2} \sqrt{r}$

if $z$ is sub-Gaussian. If $z$ and $w$ are sub-Gaussian random variables, then $z^2$ and $zw$ are sub-exponential with norms satisfying $\|z^2\|_{\psi_1} = \|z\|_{\psi_2}^2$ and $\|zw\|_{\psi_1} \leq \|z\|_{\psi_2} \|w\|_{\psi_2}$. Let $y_1 = \delta_1 x_1$ and $y_2 = \delta_2 x_2$, where $\delta_1, \delta_2$ are Bernoulli with parameters $p_1$ and $p_2$, and $x_1, x_2$ are sub-Gaussian random variables. If $\delta_1, \delta_2$ are independent of each other and of $x_1$ and $x_2$, then:

1) $y_i$ is sub-Gaussian and $\|y_i\|_{\psi_2} \leq \|x_i\|_{\psi_2}$.
2) $y_1^2$, $y_2^2$, and $y_1 y_2$ are sub-exponential with norms satisfying $\|y_i y_j\|_{\psi_1} \leq \|x_i x_j\|_{\psi_1}$ for $i, j = 1, 2$.

Useful inequalities.

Hölder's inequality for Schatten norms. Let $A, B$ be square matrices with real entries; then

$|\operatorname{tr}(A^\top B)| \leq \|A\|_F \|B\|_F, \qquad |\operatorname{tr}(A^\top B)| \leq \|A\|_\star \|B\|.$

Jensen's inequality. Let $z$ be a real random variable and $r \geq 1$; then $\mathbb{E}[|z|]^r \leq \mathbb{E}[|z|^r]$.

Application of Jensen's inequality. Let $z_1, \dots, z_N$ be non negative random variables; then for any $r \geq 1$,

$\mathbb{E}[\max_k z_k] \leq N^{1/r} \max_k \mathbb{E}[z_k^r]^{1/r}. \qquad (5.26)$

Minkowski's inequality (triangle inequality). For any pair of random variables $z_1, z_2$ with finite $r$-th moments, for any $r \geq 1$,

$\mathbb{E}[|z_1 + z_2|^r]^{1/r} \leq \mathbb{E}[|z_1|^r]^{1/r} + \mathbb{E}[|z_2|^r]^{1/r}.$

For any $a, b \in \mathbb{R}$, $(a+b)^2 \leq 2(a^2 + b^2)$.
Hadamard products

We will need the following identity:
$$(\mathbf{M} \circ \mathbf{x}\mathbf{x}^\top)^2 = \sum_{i=1}^n x_i^2\, \mathbf{x}\mathbf{x}^\top \circ \mathbf{M}_i \mathbf{M}_i^\top, \qquad (5.27)$$
where $\mathbf{M}_i$ is the $i$-th column of the symmetric matrix $\mathbf{M}$. If $\mathbf{A}_1 \preceq \mathbf{A}_2$ and $\mathbf{B}$ is positive semi-definite, then
$$\mathbf{A}_1 \circ \mathbf{B} \preceq \mathbf{A}_2 \circ \mathbf{B}, \qquad (5.28)$$
where $\mathbf{A} \preceq \mathbf{B}$ stands for $0 \preceq \mathbf{B} - \mathbf{A}$. In particular, we have that
$$\mathbf{A}_1 \circ \mathbf{B} \preceq \|\mathbf{A}_1\|\, \mathbf{I} \circ \mathbf{B}. \qquad (5.29)$$

Some of our proofs rely on an expectation bound from [25], originally devised to certify the performance of the masked covariance estimator. By following Theorem 3.2 and the beginning of the proof of Theorem 3.1 in [25], we state their result in the form most useful for us.

Theorem 10 ([25]). Let $\{Z_k\}$ be a finite sequence of independent, symmetric, self-adjoint random matrices of size $n \times n$, and $n \geq 3$; then
$$E\Big[\Big\|\sum_k Z_k - E[Z_k]\Big\|^2\Big]^{1/2} \leq \sqrt{\nu\, \Big\|\sum_k E[Z_k^2]\Big\|} + \gamma\, E\Big[\max_k \|Z_k\|^2\Big]^{1/2}.$$
If $E[Z_k] = 0$ for all $k$, then $\nu = 2e\log(n)$ and $\gamma = 2$. Otherwise, $\nu = \gamma = 8e\log(n)$.

5.5.2 Proof of Theorem 7

Applying Theorem 10 to $\hat{\Sigma}$ produces
$$E\big[\|\hat{\Sigma} - \Sigma\|^2\big]^{1/2} \leq \sqrt{\frac{8e\log(n)\, \big\|E[(\tilde{\alpha} \circ \mathbf{y}\mathbf{y}^\top)^2]\big\|}{N}} + \frac{8e\log(n)\, E\big[\max_k \|\tilde{\alpha} \circ \mathbf{y}^{(k)}\mathbf{y}^{(k)\top}\|^2\big]^{1/2}}{N},$$
where $\tilde{\alpha}$ denotes the Hadamard inverse of the correction matrix. The bounds from [25] for $\|E[(\tilde{\alpha} \circ \mathbf{y}\mathbf{y}^\top)^2]\|$ and $E[\max_k \|\tilde{\alpha} \circ \mathbf{y}^{(k)}\mathbf{y}^{(k)\top}\|^2]$ are fairly general. They completely decouple $\mathbf{y}$ and $\mathbf{y}^{(k)}$ from $\tilde{\alpha}$, thus they can be greatly improved if we consider the particular relation they share for this problem.

Bounding the matrix variance (first term). We will use the following identity.

Lemma 3. Conditioned on $\mathbf{x}$, the matrix variance satisfies
$$E\big[(\tilde{\alpha} \circ \mathbf{y}\mathbf{y}^\top)^2 \mid \mathbf{x}\big] = \operatorname{diag}(\mathbf{x}\mathbf{x}^\top)^2\, \mathbf{P}^{-1}(\mathbf{I} - \mathbf{P}^{-1}) + \operatorname{diag}(\mathbf{x}\mathbf{x}^\top)(\mathbf{P}^{-1} - \mathbf{I}) \sum_{i=1}^n \frac{x_i^2}{p_i} + \mathbf{x}\mathbf{x}^\top \sum_{i=1}^n \frac{x_i^2}{p_i}. \qquad (5.30)$$

Proof. See Section 5.5.5.
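The identity (5.27) and the monotonicity (5.28) can be verified numerically. A minimal sketch, assuming $\mathbf{M}$ symmetric (as is the case for the Hadamard inverse of the correction matrix); all variable names are ours:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
x = rng.normal(size=n)

# A random symmetric matrix plays the role of M in (5.27).
M = rng.normal(size=(n, n))
M = (M + M.T) / 2

X = np.outer(x, x)
lhs = (M * X) @ (M * X)                      # (M o xx^T)^2, o = Hadamard
rhs = sum(x[i] ** 2 * X * np.outer(M[:, i], M[:, i]) for i in range(n))
assert np.allclose(lhs, rhs)                 # identity (5.27)

# Monotonicity (5.28): A1 <= A2 (Loewner) and B PSD imply A1 o B <= A2 o B.
B = rng.normal(size=(n, n)); B = B @ B.T     # PSD
A1 = rng.normal(size=(n, n)); A1 = A1 @ A1.T
A2 = A1 + np.eye(n)                          # A2 - A1 = I is PSD
gap = np.linalg.eigvalsh(A2 * B - A1 * B)    # here (A2-A1) o B = diag(B)
assert gap.min() >= -1e-10
```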
The rst term is a diagonal matrix, thus we have E " diag(xx > ) 2 (I P 1 )P 1 + diag(xx > )(P 1 I) n X k=1 x 2 k p k # = max i E " 1p i p 2 i x 4 i + 1p i p i x 2 i n X k=1 x 2 k p k # (5.32) max i E " 1p i p i x 2 i n X k=1 x 2 k p k # (5.33) max i E " x 2 i 1p i p i 2 # 1=2 E 2 4 n X k=1 x 2 k p k ! 2 3 5 1=2 (5.34) The equality (5.32) holds because the operator norm of a positive denite diagonal matrix is its largest entry, while the subsequent inequality (5.33) is obtained by keeping the non negative term. Finally, (5.34) follows from Cauchy-Schwartz. Applying triangular inequality and Assumption 1 to the second factor in (5.34) leads to E 2 4 n X k=1 x 2 k p k ! 2 3 5 1=2 n X k=1 E x 4 k p 2 k 1=2 4c 2 n X k=1 kx k k 2 2 p k 4c 2 =c 0 n X k=1 kk p k : (5.35) 63 A similar argument can be applied to the rst factor in (5.34) max i E " x 2 i 1p i p i 2 # 1=2 4c 2 =c 0 max i ii 1p i p i 4c 2 =c 0 kk 1p min p min ; where the last inequality follows from ii kk, and 1pi pi 1pmin pmin . Replacing both terms back into (5.34) we get E " diag(xx > ) 2 (I P 1 )P 1 + diag(xx > )(P 1 I) n X k=1 x 2 k p k # (4c 2 =c 0 ) 2 1p min p min kk n X k=1 kk p k ! : (5.36) The second term in (5.31) can be bounded similarly using Cauchy-Schwartz, the moment inequal- ity for sub-Gaussian random variables, assumption 1, and (5.35). Thus we may have E " (u > x) 2 n X k=1 x 2 k p k # E (u > x) 4 1=2 E 2 4 n X k=1 x 2 k p k ! 2 3 5 1=2 (4c 2 =c 0 ) 2 u > u n X k=1 kk p k ! : Taking supremum over the set of unit norm vectors leads to E " xx > n X k=1 x 2 k p k # (4c 2 =c 0 ) 2 kk n X k=1 kk p k ! : (5.37) Then adding both bounds into (5.31) we get the desired result kE ( yy > ) 2 k (4c 2 =c 0 ) 2 kk p min n X k=1 kk p k ! ; where C 2 = 4c 2 =c 0 . Bounding the maximum spectral norm (second term). For any r 1, applying (5.26) tok y (k) y (k) > k 2 results in E max k k y (k) y (k) > k 2 =N 1=r E k yy > k 2r 1=r : The dependency on k is removed because y (k) are i.i.d. copies of y. 
Now we only need to bound $E[\|\tilde{\alpha} \circ \mathbf{y}\mathbf{y}^\top\|^{2r}]$. The following inequality uses the fact that the operator norm is bounded by the Frobenius norm, and that $y_i^2 \leq x_i^2$:
$$\|\tilde{\alpha} \circ \mathbf{y}\mathbf{y}^\top\|^2 \leq \|\tilde{\alpha} \circ \mathbf{y}\mathbf{y}^\top\|_F^2 = \sum_{i,j} \tilde{\alpha}_{ij}^2\, y_i^2 y_j^2 \leq \sum_{i,j} \tilde{\alpha}_{ij}^2\, x_i^2 x_j^2.$$
Now we apply the triangle inequality, followed by the Cauchy-Schwarz inequality and Assumption 1:
$$E\big[\|\tilde{\alpha} \circ \mathbf{y}\mathbf{y}^\top\|^{2r}\big]^{1/r} \leq \sum_{i,j} \tilde{\alpha}_{ij}^2\, E\big[(x_i^2 x_j^2)^r\big]^{1/r} \leq \sum_{i,j} \tilde{\alpha}_{ij}^2\, E\big[x_i^{4r}\big]^{1/2r} E\big[x_j^{4r}\big]^{1/2r} \leq \Big(\frac{4rc_2}{c_0}\Big)^2 \sum_{i,j} \tilde{\alpha}_{ij}^2\, \Sigma_{ii}\Sigma_{jj}.$$
Evidently $\tilde{\alpha}_{ij} \leq \frac{1}{p_i p_j}$, with equality when $i \neq j$, hence
$$E\big[\|\tilde{\alpha} \circ \mathbf{y}\mathbf{y}^\top\|^{2r}\big]^{1/r} \leq C_2^2\, r^2 \Big(\sum_{i=1}^n \frac{\Sigma_{ii}}{p_i^2}\Big)^2.$$
The function $N^{1/r} r^2$ is minimized by $r = \log(N)/2$. Since $\log(N) \geq 2$ by assumption, the condition $r \geq 1$ is satisfied.

5.5.3 Proof of Theorem 8

All expectations in this proof are with respect to the distribution of $\{\mathbf{x}^{(k)}\}_{k=1}^N$, conditioned on $\{\boldsymbol{\delta}^{(k)}\}_{k=1}^N$. Also recall that $E[\hat{\Sigma}] = \Sigma_E$. As in the proof of Theorem 7, we apply Theorem 10 to $\hat{\Sigma}$, producing
$$E\big[\|\hat{\Sigma} - \Sigma_E\|^2\big]^{1/2} \leq \frac{1}{N}\sqrt{8e\log(n)\, \Big\|\sum_{k=1}^N E\big[(\hat{\alpha} \circ \mathbf{y}^{(k)}\mathbf{y}^{(k)\top})^2\big]\Big\|} + \frac{8e\log(n)\, E\big[\max_k \|\hat{\alpha} \circ \mathbf{y}^{(k)}\mathbf{y}^{(k)\top}\|^2\big]^{1/2}}{N}.$$

Bounding the matrix variance. We follow a slightly different strategy than in the proof of Theorem 7. Here, the $\mathbf{y}^{(k)}$ are independent, but not identically distributed, so to address that we define $\mathbf{M}^{(k)} = \hat{\alpha} \circ \boldsymbol{\delta}^{(k)}\boldsymbol{\delta}^{(k)\top}$, therefore
$$\mathbf{M}^{(k)} \circ \mathbf{x}^{(k)}\mathbf{x}^{(k)\top} = \hat{\alpha} \circ \mathbf{y}^{(k)}\mathbf{y}^{(k)\top}.$$
Identity (5.27) and linearity of the expectation operator produce
$$\sum_{k=1}^N E\big[(\mathbf{M}^{(k)} \circ \mathbf{x}^{(k)}\mathbf{x}^{(k)\top})^2\big] = \sum_{k=1}^N \sum_{i=1}^n E\big[x_i^{(k)2}\, \mathbf{x}^{(k)}\mathbf{x}^{(k)\top}\big] \circ \mathbf{M}_i^{(k)}\mathbf{M}_i^{(k)\top} = \sum_{k=1}^N \sum_{i=1}^n E\big[x_i^2\, \mathbf{x}\mathbf{x}^\top\big] \circ \mathbf{M}_i^{(k)}\mathbf{M}_i^{(k)\top} = \sum_{i=1}^n E\big[x_i^2\, \mathbf{x}\mathbf{x}^\top\big] \circ \sum_{k=1}^N \mathbf{M}_i^{(k)}\mathbf{M}_i^{(k)\top}.$$
The last two equalities use the i.i.d. assumption on $\mathbf{x}^{(k)}$ and a reorganization of terms.
Now we use (5.28) to bound the Hadamard product of positive definite matrices; thus we have
$$\sum_{k=1}^N E\big[(\mathbf{M}^{(k)} \circ \mathbf{x}^{(k)}\mathbf{x}^{(k)\top})^2\big] \preceq \sum_{i=1}^n \big\|E[x_i^2\, \mathbf{x}\mathbf{x}^\top]\big\|\, \mathbf{I} \circ \sum_{k=1}^N \mathbf{M}_i^{(k)}\mathbf{M}_i^{(k)\top}.$$
The resulting matrix on the right side is diagonal, hence its operator norm can be bounded as
$$\Big\|\sum_{k=1}^N E\big[(\mathbf{M}^{(k)} \circ \mathbf{x}^{(k)}\mathbf{x}^{(k)\top})^2\big]\Big\| \leq \Big\|\sum_{i=1}^n \big\|E[x_i^2\, \mathbf{x}\mathbf{x}^\top]\big\|\, \mathbf{I} \circ \sum_{k=1}^N \mathbf{M}_i^{(k)}\mathbf{M}_i^{(k)\top}\Big\| = \max_j \sum_{i=1}^n \big\|E[x_i^2\, \mathbf{x}\mathbf{x}^\top]\big\| \Big(\sum_{k=1}^N \mathbf{M}_i^{(k)}\mathbf{M}_i^{(k)\top}\Big)_{jj}.$$
We need to compute estimates for the terms in the sum. We start with the entries of the diagonal matrix:
$$\big(\mathbf{M}_i^{(k)}\mathbf{M}_i^{(k)\top}\big)_{jj} = \begin{cases} \hat{\alpha}_{ii}^2\, \delta_i^{(k)} & \text{if } j = i, \\ \hat{\alpha}_{ij}^2\, \delta_i^{(k)}\delta_j^{(k)} & \text{otherwise.} \end{cases}$$
Thus for $i = j$,
$$\sum_{k=1}^N \Big(\mathbf{M}_i^{(k)}\mathbf{M}_i^{(k)\top}\Big)_{ii} = \hat{\alpha}_{ii}^2 \sum_{k=1}^N \delta_i^{(k)} = N \hat{\alpha}_{ii}^2\, \hat{p}_{ii} = N \frac{1}{\hat{p}_{ii}}.$$
If $i \neq j$ and $(i,j) \in E$, a similar calculation produces
$$\sum_{k=1}^N \Big(\mathbf{M}_i^{(k)}\mathbf{M}_i^{(k)\top}\Big)_{jj} = N \frac{1}{\hat{p}_{ij}}.$$
Finally, when $i \neq j$ and $(i,j) \notin E$, then $\sum_{k=1}^N \big(\mathbf{M}_i^{(k)}\mathbf{M}_i^{(k)\top}\big)_{jj} = 0$.

To bound the other quantity, we use the variational definition of the operator norm, along with the Cauchy-Schwarz inequality and Assumption 1. This allows us to obtain the sequence of inequalities
$$\big\|E[x_i^2\, \mathbf{x}\mathbf{x}^\top]\big\| = \sup_{\|\mathbf{u}\|_2 = 1} \mathbf{u}^\top E[x_i^2\, \mathbf{x}\mathbf{x}^\top] \mathbf{u} = \sup_{\|\mathbf{u}\|_2 = 1} E\big[x_i^2 (\mathbf{x}^\top \mathbf{u})^2\big] \leq \sup_{\|\mathbf{u}\|_2 = 1} E[x_i^4]^{1/2} E\big[(\mathbf{x}^\top \mathbf{u})^4\big]^{1/2} \leq 4c_2 \|x_i\|_{\psi_2}^2 \cdot 4c_2 \sup_{\|\mathbf{u}\|_2 = 1} \|\mathbf{x}^\top \mathbf{u}\|_{\psi_2}^2 \leq \Big(\frac{4c_2}{c_0}\Big)^2 \Sigma_{ii} \sup_{\|\mathbf{u}\|_2 = 1} \mathbf{u}^\top \Sigma\, \mathbf{u} \leq C_2^2\, \Sigma_{ii}\, \|\Sigma\|.$$
The desired bound is
$$\Big\|\sum_{k=1}^N E\big[(\mathbf{M}^{(k)} \circ \mathbf{x}^{(k)}\mathbf{x}^{(k)\top})^2\big]\Big\| \leq C_2^2\, N\, \|\Sigma\| \max_j \sum_{i:(i,j) \in E} \frac{\Sigma_{ii}}{\hat{p}_{ij}}.$$

Bounding the maximum spectral norm. We follow the same strategy as in the proof of Theorem 7 and bound the operator norm by the Frobenius norm. We follow this with Minkowski's inequality, leading to the bound
$$E\big[\|\mathbf{M}^{(k)} \circ \mathbf{x}^{(k)}\mathbf{x}^{(k)\top}\|^{2r}\big]^{1/r} \leq \sum_{i,j} E\big[|x_i^{(k)} x_j^{(k)}|^{2r}\big]^{1/r} |M_{ij}^{(k)}|^2 \leq \sum_{(i,j) \in E} E\big[|x_i|^{4r}\big]^{1/2r} E\big[|x_j|^{4r}\big]^{1/2r} \frac{1}{\hat{p}_{ij}^2} \leq (4rc_2)^2 \sum_{(i,j) \in E} \frac{\|x_i\|_{\psi_2}^2 \|x_j\|_{\psi_2}^2}{\hat{p}_{ij}^2} \leq r^2 \Big(\frac{4c_2}{c_0}\Big)^2 \sum_{(i,j) \in E} \frac{\Sigma_{ii}\Sigma_{jj}}{\hat{p}_{ij}^2}.$$
As with Theorem 7, to finalize we set $r = \log(N)/2$.
5.5.4 Proof of Theorem 9

We will use an inequality due to [75], from which we state the required result in the lemma below.

Lemma 4. Let $W_T = \sum_{t=1}^T \tilde{Z}_t$ be a matrix-valued martingale satisfying $E_{t-1}[\tilde{Z}_t] = 0$ and $E[\|\tilde{Z}_t\|^2] \leq \sigma_t^2 < \infty$; then
$$E[\|W_T\|^2] \leq C \log(n+1) \sum_{t=1}^T \sigma_t^2.$$

Taking $\tilde{Z}_t = Z_t N_t$, we just need to bound $E[\|\tilde{Z}_t\|^2]$. We can do that by conditioning on previous observations and applying Theorem 7, thus leading to
$$E_{t-1}\big[\|\tilde{Z}_t\|^2\big] \leq C_2^2 \|\Sigma\|^2 N_t^2 \Bigg[\sqrt{\frac{8e\log(n)\, r_{\min}(\Sigma, \mathbf{p}^{(t)})}{N_t}} + \frac{8e\log(n)\log(N_t)\, r_2(\Sigma, \mathbf{p}^{(t)})}{N_t}\Bigg]^2 \leq 2C_2^2 \|\Sigma\|^2 \Big[8e\log(n)\, r_{\min}(\Sigma, \mathbf{p}^{(t)})\, N_t + \big(8e\log(n)\log(N_t)\big)^2 r_2^2(\Sigma, \mathbf{p}^{(t)})\Big].$$
Then $\sigma_t^2$ is equal to the expectation of the above expression. We conclude by adding terms from $t = 1$ to $t = T$, dividing by $\big(\sum_{t=1}^T N_t\big)^2$, and computing the square root.

5.5.5 Proof of Lemma 3

We compute the diagonal and off-diagonal entries separately. We start with the diagonal entries:
$$E\big[(\tilde{\alpha} \circ \mathbf{y}\mathbf{y}^\top)^2 \mid \mathbf{x}\big]_{ii} = E\Big[\sum_{k=1}^n \tilde{\alpha}_{ik}^2\, y_i^2 y_k^2 \,\Big|\, \mathbf{x}\Big] = \tilde{\alpha}_{ii}^2 E[y_i^4 \mid \mathbf{x}] + \sum_{k \neq i} \tilde{\alpha}_{ik}^2 E[y_i^2 y_k^2 \mid \mathbf{x}] = \frac{x_i^4}{p_i} + \sum_{k \neq i} \frac{x_i^2 x_k^2}{p_i p_k} = x_i^4 \Big(\frac{1}{p_i} - \frac{1}{p_i^2}\Big) + \frac{x_i^2}{p_i} \sum_{k=1}^n \frac{x_k^2}{p_k}.$$
Now let $i \neq j$; then
$$E\big[(\tilde{\alpha} \circ \mathbf{y}\mathbf{y}^\top)^2 \mid \mathbf{x}\big]_{ij} = \sum_{k=1}^n \tilde{\alpha}_{ik}\tilde{\alpha}_{kj} E[y_i y_j y_k^2 \mid \mathbf{x}] = \frac{x_i^3 x_j}{p_i} + \frac{x_j^3 x_i}{p_j} + \sum_{k \neq i,j} \frac{x_i x_j x_k^2}{p_k} = x_i x_j \sum_{k=1}^n \frac{x_k^2}{p_k}.$$

5.6 Conclusion

This chapter studied covariance estimation with various types of missing observations. We first considered observations missing completely at random (MCAR), where variables are observed according to non-uniform Bernoulli random variables, independent of each other and of the data. We proposed an unbiased covariance estimator and obtained new error bounds in operator norm that characterize its performance in terms of the scaled effective rank.
The aforementioned error bound is consistent with previous missing data studies, which only considered a uniform Bernoulli distribution, as well as with bounds obtained for the complete observations case. We also studied an estimator that can handle data missing at random (MAR), a missing data mechanism that can depend on the observations. This work obtains the first error bound for this estimator, and also demonstrates that this bound is consistent with other bounds derived for the missing and complete data cases. In practice, missing data does not follow Bernoulli distributions, thus we recommend using the estimator for MAR observations. Our theory suggests that the MAR estimator has a tighter bound. This is further confirmed by our experiments with MCAR observations, where both the MCAR and MAR estimators are compared, and the latter is always found to attain superior performance. Our last contribution is the introduction of the conditionally MCAR missing data mechanism, where the missing data pattern may depend on previous observations. We combined the bounds for the MCAR mechanism with a martingale bound and obtained estimation error bounds.

Chapter 6

Covariance estimation from dimensionality reduced data

6.1 Introduction

In this chapter we develop data adaptive and active covariance estimators, suitable for problems where computational resources are limited. We are motivated by high dimensional problems where data acquisition and communication are expensive, and covariance matrices are required for data analysis. We consider the MCAR mechanism and estimator, as well as a variation of the CMCAR mechanism and estimator from Chapter 5. In this chapter, incomplete observations are used as a tool to reduce costs associated with data acquisition, storage and computational complexity, while in Chapter 5, incomplete observations were viewed as an external uncontrollable factor.
A few recent studies have considered recovery of the sample covariance matrix from compressed measurements. These studies [3, 127, 8, 128, 28] use different types of random projections, similar to those encountered in the compressed sensing literature [162], with the goal of storing or transmitting a low dimensional dataset. The choice of a particular random projection is based on implementation considerations regarding computational complexity and memory. These works start from a desired compression ratio, i.e., a number of observations per data sample m < n, and design an acquisition/compression scheme, along with the corresponding estimator. Part of this chapter considers the same problem, but we propose a different estimator that has favorable theoretical and practical properties. The compression ratio, the estimator's statistical performance, and computational complexity are the main concerns of the aforementioned works. The most apparent difference with theoretical results found in the missing data literature, including ours from Chapter 5, is the type of bounds. In [3, 127, 8, 128, 28], the prevailing theoretical results are bounds on the error between the proposed estimator and the sample covariance matrix, while in the missing data literature, the error of interest is between the proposed estimator and the population covariance matrix. A byproduct of this approach is that in the missing data literature, the sub-Gaussian and i.i.d. assumptions are necessary. On the contrary, in the data compression literature, all results are conditioned on the full data, thus allowing for deterministic, random but non-i.i.d., or even adversarial data. In this chapter we follow the latter approach, thus we do not make assumptions on the data distribution (for example being sub-Gaussian), and focus more on the relation between the sample covariance matrix (with complete observations) and the estimators under incomplete observations.
Besides studying covariance estimation in the compression setting, we also consider a data streaming scenario with limited memory, that is, only a few variables can be observed and kept in memory, and the covariance matrix has to be computed in an online fashion. Previous works for the compression setting provide detailed analyses in terms of the goals outlined above, but we have identified a few important unaddressed issues for both the compression and data streaming scenarios.

1. Convergence of the estimator to the sample covariance matrix S as the reduced dimension converges to the ambient dimension (m → n). The estimators from [3, 127, 28] do not have this property, while the remaining ones [8, 128] do, but their error bounds do not reflect it.

2. Consistency and comparisons with estimation of the population covariance matrix with complete and incomplete observations. Consistency under various conditions is only proved in [8, 28]; however, the bounds are sub-optimal compared to state of the art covariance estimation bounds under similar conditions [19, 96, 81]. The estimators from this chapter are similar to the ones analyzed in Chapter 5, where they were shown to converge to the population covariance matrix when studied under missing data conditions.

3. When data acquisition and covariance estimation are performed in different locations, for example in distributed computation scenarios, it is important to distinguish between computational complexity and memory use at both stages. In the next section (Table 6.1) we revisit this issue in relation to our work.

Since the goal is recovery of the sample covariance matrix, we obtain estimation error bounds in operator norm for
$$E\big[\|\hat{\mathbf{S}} - \mathbf{S}\|_2 \mid \mathbf{X}\big],$$
where X is the complete data, S is the sample covariance matrix, and $\hat{\mathbf{S}}$ is an estimator from incomplete data. Given that the upper bounds are functions of the sampling distribution and the data, they can be used to optimize the sampling probabilities.
We introduce two optimization problems with cost constraints, one for dimensionality reduced (compressed) observations and another for streaming data. We also obtain low complexity closed form solutions for both optimization problems. Our theoretical analysis and numerical validation show that the corresponding data sampling and estimation algorithms have computational costs and statistical performance comparable to or better than previous approaches. We organize this chapter as follows. In Section 6.2 we review the literature on covariance estimation from compressed and active observations. In Section 6.3 we introduce the covariance estimation problem with resource constraints. Our error bounds and algorithms are presented in Sections 6.4 and 6.5. We show the advantages of our work through numerical experiments in Section 6.6. We summarize our results and conclusions in Section 6.7, while proofs are left for Section 6.8.

6.2 Related work

6.2.1 Dimensionality reduction approaches

Dimensionality reduction is the simplest method for compressing a dataset. The compressed sensing paradigm has been applied to this problem to characterize the compression ratio needed for information preservation. Recently, various works have considered reconstruction of the sample covariance matrix from dimensionality reduced data. We identify two general trends, both sharing the goal of recovering a covariance matrix from compressed data, although using different estimators and assumptions. The first approach is formulated as the recovery of the population covariance matrix from observations compressed through a fixed random projection [29, 13, 37, 9]. This framework assumes that the number of samples is infinite, or large enough so that the population covariance matrix is close to the sample covariance matrix. Additionally, the encoding consists of left and right multiplication of the sample covariance matrix by a random matrix, thus producing a square matrix of smaller dimension.
Then the population covariance matrix is recovered by solving optimization problems that leverage structure such as low rank or sparsity (by minimizing the nuclear and $\ell_1$ norms respectively). The compressed measurements by themselves do not lead to unbiased estimators, therefore computationally expensive recovery algorithms that enforce structure in the covariance matrix must be used. The second group of papers follows a different approach, and compresses the observations (instead of compressing the covariance matrix). In [3, 127, 8, 128], each data sample is projected onto an m dimensional subspace. All projections are implemented by products with independent and identically distributed random matrices. The lower dimensional samples, along with the random projectors, are used to construct estimators for the sample covariance matrix. These estimators are unbiased, in the sense that their expected value, with respect to the distribution of the random projection matrices, is equal to the sample covariance matrix. By considering different random projectors, these works study trade-offs between memory, computational complexity and estimation accuracy. The estimators used in this chapter are similar to the second group of papers. When applying them to the data compression setting, there are various differences compared to [8, 3, 127, 128]. We choose to use Bernoulli sampling, thus effectively observing m values on average, while other estimators acquire at most m dimensional linear measurements using sampling with or without replacement. This is advantageous since we can sample each variable independently, simplifying the theoretical analysis. In addition, the use of Bernoulli sampling instead of random (possibly dense) matrices reduces the amount of computational resources used while performing data acquisition and covariance estimation. An exception is [28], which we discuss in more detail in Section 6.2.
Finally, our estimators have stronger theoretical guarantees for the compression setting. This is because, unlike the aforementioned studies, our error bounds vanish as m approaches n, and have consistency bounds comparable to state of the art covariance estimation bounds with complete [96, 19, 81] and incomplete observations (see Chapter 5). We provide a detailed complexity analysis of these methods and ours (Sections 6.4 and 6.5) for covariance estimation with compressed data in Table 6.1.

6.2.2 Active learning methods

Other data acquisition approaches are active, that is, they collect measurements in a sequential data adaptive manner, while complying with a resource constraint. Active learning strategies for variable sensing were proposed and analyzed for the related problem of graphical model selection [38, 139], which involves only graph topology inference without estimating the covariance or precision matrix. More related is our previous work [123], where we studied a CMCAR mechanism (similar to the one from Chapter 5) that used a slightly different type of estimator; however, no error bounds were provided.

Estimator | Type | Acq. memory | Acq. time | Compressed | Estimation time
Sample covariance, Eq. (6.1) | - | O(nN) | O(nN) | O(nN) | O(n^2 N)
Anaraki [127] | NA | O(mN) | O(msN) + sN | O(mN) | O((sm)^2 N + n) + sN
Azizyan et al. [8] | NA | O(nN) | O(mnN) + gN | O(mN) | O(mnN + nm^2 N + n^2 N) + gN
Anaraki & Becker [128] | NA | O(nN) | O(nN + n log(n) N) | O(mN) | O(m^2 N + n log(n) N)
Chen et al. [28] | AD | O(nN) | O(nN + m log(n) N) | O(mN) | O(m^2 N + nN)
Eq. (6.2) | NA | O(mN) | O(nN) | O(mN) | O(m^2 N)
Algorithm 3 | AD | O(nN) | O(nN + nT log(nT) + T op) | O(mN + nT) | O(m^2 N + nN)
Algorithm 4 | AC | O(mN) | O(nN + nT log(n)) | O(mN) | O(m^2 N + nN)

Table 6.1: Comparison of covariance estimators with partial observations. The Type column indicates the data dependence: non adaptive (NA), adaptive (AD), or active (AC). The Acquisition columns contain the number of observed variables and the time complexity.
The Compressed column is the minimum amount of data required to compute the estimator. The last column indicates the time complexity of computing the estimator. Storage of the covariance matrix was not included since it is the same, O(n^2), for all estimators.

6.3 Problem formulation

This section formally states the covariance estimation problem with resource constraints. Let X be an $n \times N$ real matrix, with sample covariance matrix
$$\mathbf{S} = \frac{1}{N} \mathbf{X}\mathbf{X}^\top = \frac{1}{N} \sum_{k=1}^N \mathbf{x}^{(k)}\mathbf{x}^{(k)\top}, \qquad (6.1)$$
where $\mathbf{x}^{(k)}$ is the $k$-th column of X. There are several costs associated with processing X to obtain a sample covariance matrix. First there are data acquisition and storage costs, that is, one needs to measure and store nN real values (as floating point numbers with a certain precision). When the signal dimension n or the number of samples N is very large, the aforementioned costs can be significantly above the data processing capacity of a system. One strategy to deal with this issue in the high dimensional setting (large n) is to apply dimensionality reduction and then perform data analysis. If the new dimension is much smaller than n, new operational regimes are enabled where data processing cost can be reduced without sacrificing estimation performance. The first row of Table 6.1 summarizes the acquisition, memory and computational costs associated with covariance matrix estimation. Our proposed solution works with the sparse matrix Y of size $n \times N$ with columns
$$\mathbf{y}^{(k)} = \boldsymbol{\delta}^{(k)} \circ \mathbf{x}^{(k)},$$
where $\boldsymbol{\delta}^{(k)}$ is an $n$ dimensional vector with 0-1 entries sampled from Bernoulli random variables with known distribution. We will show that under different scenarios of interest, one can perform accurate covariance estimation from Y at reduced cost.

6.3.1 Data acquisition and estimators

We investigate two types of data acquisition schemes, each one with a corresponding estimator and error bound. These data sensing mechanisms are defined in terms of the distribution of $\{\boldsymbol{\delta}^{(k)}\}_{k=1}^N$.
6.3.1.1 Independent and fixed sampling distribution

Assume that the $\boldsymbol{\delta}^{(k)}$ are independent copies of $\boldsymbol{\delta} = [\delta_1, \ldots, \delta_n]^\top$, whose entries are independent Bernoulli 0-1 variables with $P(\delta_i = 1) = p_i > 0$. This is exactly the MCAR missing data mechanism from Chapter 5. Thus, following the same notation, we define the vector of probabilities $\mathbf{p} = [p_1, \ldots, p_n]^\top$, the matrix of probabilities $\mathbf{P} = \operatorname{diag}(\mathbf{p})$, and the correction matrix $\boldsymbol{\alpha} = \mathbf{p}\mathbf{p}^\top + \mathbf{P} - \mathbf{P}^2$. Letting $\tilde{\boldsymbol{\alpha}}$ be the Hadamard inverse of $\boldsymbol{\alpha}$, we use the estimator
$$\hat{\mathbf{S}} = \frac{1}{N} \sum_{k=1}^N \mathbf{y}^{(k)}\mathbf{y}^{(k)\top} \circ \tilde{\boldsymbol{\alpha}} = \frac{1}{N} \mathbf{Y}\mathbf{Y}^\top \circ \tilde{\boldsymbol{\alpha}}, \qquad (6.2)$$
from Chapter 5. In this chapter, however, we are interested in approximating the sample covariance matrix S, instead of the population covariance matrix $\Sigma$. Fortunately, (6.2) is an unbiased estimator for S, that is,
$$E[\hat{\mathbf{S}} \mid \mathbf{X}] = \mathbf{S}.$$
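The estimator (6.2) reduces to a Hadamard correction of the masked Gram matrix. A minimal sketch (the helper name is ours; the correction matrix follows the definition $\boldsymbol{\alpha} = \mathbf{p}\mathbf{p}^\top + \mathbf{P} - \mathbf{P}^2$ above):

```python
import numpy as np

def mcar_cov(Y, p):
    """Estimator (6.2): (1/N) (Y Y^T) o alpha~, where alpha~ is the
    Hadamard inverse of alpha = p p^T + P - P^2 (so alpha_ii = p_i and
    alpha_ij = p_i p_j for i != j)."""
    n, N = Y.shape
    p = np.asarray(p, dtype=float)
    alpha = np.outer(p, p)
    np.fill_diagonal(alpha, p)
    return (Y @ Y.T) / N / alpha   # elementwise division implements o alpha~

# Sanity check: with p = 1 nothing is masked and the estimator coincides
# with the sample covariance matrix S of (6.1).
rng = np.random.default_rng(1)
X = rng.normal(size=(4, 100))
S = X @ X.T / 100
assert np.allclose(mcar_cov(X, np.ones(4)), S)
```

Unbiasedness with respect to the masks holds for any valid p > 0; the check above only exercises the complete-observation limit, where the error bound of Theorem 11 also vanishes.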
The diagonal matrix that contains the probabilities of the block t is denoted by P (t) = diag(p (t) ). We dene the matrix matrices t = p (t) (p (t) ) > P (t) (P (t) I), and its Hadamard inverse by t . Conditional expectation is abbreviated as E t1 [ ] =E[ j X; f (;k) :k2 [N ]; 2 [t 1]g]: (6.5) We dene below some useful matrices followed by the proposed estimator S (t) = 1 N t X t X > t ; (6.6) ~ Z t = Y t Y > t t N t S (t) ; (6.7) W T = T X t=1 ~ Z t : (6.8) The proposed estimator is b S T = 1 N T X t=1 Y t Y > t t : (6.9) Useful facts about ~ Z t ; W t and b S T are stated below. 75 1. ~ Z t is zero mean. It follows by applying conditional expectation denition from (6.5) com- bined with (6.6), E t1 [Z t ] =E t1 Y t Y > t t N t S (t) =E t1 Y t Y > t t N t S (t) =N t S (t) t t N t S (t) =N t S (t) N t S (t) : (6.10) 2. W T is a zero mean matrix martingale. We may check E T [W T+1 ] = W T by applying conditional expectation, E T [W T+1 ] =E T [ ~ Z T+1 ] +E T [W T ] = W T : By iteratively applying conditional expectation we obtainE[W T ] = 0. 3. b S T is an unbiased estimator for S. This result is implied by the martingale property and the identity W T = ( b S T S)( P T t=1 N t ). 6.3.2 Application based resource constraints Computational resources will be quantied by the total cost of the entries of Y. We start from the most general formulation, and then proceed with particular cases. c (t;k) i is the cost associated to the i-th variable and k-th column of the t-th block. We measure the quality of the sampling pattern/distribution by the expected total cost E " n X i=1 T X t=1 Nt X k=1 c (t;k) i (t;k) i # = n X i=1 T X t=1 Nt X k=1 c (t;k) i p (t;k) i : (6.11) These costs may model several limitations of practical systems, for example 1. Data acquisition: c (t;k) i is the cost of acquiring the measurement x (t;k) i . An example may be energy consumption per sensor measurements in a sensor network or monetary price of feature acquisition in medical studies. 2. 
Data storage/communication: when data acquisition is inexpensive, or the ambient dimen- sion is high (high N and n respectively), the sheer amount of data may be to large to be stored or communicated with the available bandwidth. A simple compression strategy is to discard some features and keep the most informative ones. In this setting, more informative variables can be assigned a lower cost, while less informative ones are given a higher c (t;k) i . 3. Space/time complexity: since the space and time complexity of a data acquisition or esti- mator are functions of the number of observations, one can design data acquisition schemes that minimize the computational burden, for a desired statistical performance level. 6.3.2.1 Data compression If the data X is too large to be stored or transmitted, a form of data compression can be useful. In this scenario the goal is to produce a representation that uses less memory than the complete data (see rst row of Table 6.1), that can be stored or transmitted, and from which the sample covariance matrix can be recovered or approximated. The specications of this setting are as follows 76 There is a data storage budget B =O(mN) for some m<n. There is access to the complete dataset X. The proposed approach uses the data partitioned as in (6.3). The columns of block X t are sampled independently using the vector of probabilities p (t) . Assuming all entries of X have the same cost, that is,c (t;k) i = 1, the total expected cost corresponds to the expected number of stored variables and is given by n X i=1 T X t=1 Nt X k=1 p (t) i = T X t=1 N t n X i=1 p (t) i =mN: Where m n, and equality is attained when data is uncompressed. Data can then be repre- sented using mN real values (on average), plus an overhead cost ofO(nT ) to store the sampling probabilities and block sizes for each block. Then, we can choose the number of blocks so that O(nT ) =O(mN). 
Given data X, a parameterm<n, and a set of block sizesfN t g T t=1 , the covariance estimation problem from compressed data is given by min p (1) ;;p (T) E[k b S T Skj X] s.t. 0 p (t) 1; 8t2 [T ] (6.12) T X t=1 N t n X i=1 p (t) i mN: In this scenario, we have access to the complete dataset and therefore the sampling probabilities can be optimized jointly. The estimation error in operator norm cannot be computed exactly so we will use upper bounds to nd approximate solutions. 6.3.2.2 Streaming data When only a few columns of X are available at a time, the covariance matrix must be computed in an online fashion. The specication is as follows At time t we can sample a few entries of X t using a distribution that has expected cost m. More formally, we seek distributions that obey n X i=1 c (t;k) i p (t) i m; 8t2 [T ]; k2 [N t ]: At time t we have access to recent observationsfY : 2 [tW;t 1]g, where W is an integer representing window size. The estimation problem with streaming data in its most general form corresponds to min p (T) E[k b S T SkjfY :2 [TW;T 1]g] s.t. 0 p (T) 1; (6.13) n X i=1 c (T) i p (T) i m: Note that p (T) may depend on previously observed data only, and for each block, each sampled vector has the same expected cost. As in the data compression case, a practical implementation replaces the expected error by an upper bound. 77 6.3.2.3 Evaluation criteria and proposed solution Theorems 11 and 13 present error bounds for the performance of the proposed estimators. Then, using those error bounds we propose solutions to the covariance estimation problem with resource constraints under the data compression and data streaming settings. Regarding theoretical analysis, we adopt evaluation criteria similar to previous works for the compressed covariance estimation setting [8, 3, 127, 128, 28], where instead of estimating the population covariance matrix (as in the missing data setting), the goal is to recover the sample covariance matrix S. 
This is a more practical approach that does not require assumptions on the data distribution, which is a more suitable for real world data. In particular, observations may be deterministic, adversarial, or random but not necessarily i.i.d.. 6.4 Fixed sampling We start by characterizing the performance of (6.2) for an arbitrary but xed sampling distribu- tion. Proofs and derivations can be found in Section 6.8. Theorem 11. Sample covariance estimation: Assume all p i > 0, log(N) 2 and n 3, then conditioned on the datafx (k) g N k=1 , the estimator from (6.2), satises E h k b S Sk 2 i 1=2 (6.14) s C 2 log(n)kSk P n i=1 kx i k 2 1 1pipmin pipmin N +C 3 log(n)k log(N) k 1 P i kxik 2 1 p 2 i N ; wherekx i k 1 = max k jx (k) i j, and C 2 = 2e, C 3 = 4 p 5e 2 . For any r 1, r is a vector whose i-th entry is r (p i ), and r (p) = (p(1p) 2r + (1p)p 2r ) 1=2r is the ` 2r norm of the centered Bernoulli random variable with parameter p. We highlight some interesting properties of our bound. Comparison with complete observations. For xed N, as the number of observations ap- proach the ambient dimension m!n, we have that b S! S in probability. It is easy to see that this is equivalent to all p i ! 1, and since r (p)! 0 as p! 1, both terms in the right side of (6.14) approach zero. Sample complexity. Let > 0, if NC log(n) 1 kSkp min X i kx i k 2 1 p i k log(N) k 2 1 _ 1 2 the estimation error obeys E h k b S Sk 2 i 1=2 kSk. The constant C depends only on C 2 ;C 3 . Also, since 0 r 1, the sample complexity depends only on 1= 2 when high accuracy is desired (< 1). Comparison with Theorem 7 from Chapter 5. Assuming the data vectorsfx (k) g N k=1 are identically distributed with covariance matrix , and they satisfy Assumption 1. As N increases 78 the sample covariance matrix converges to the population covariance matrix, and thereforekSk kk, andkx i k 2 1 =O(log(N) ii ). 
These facts, in conjunction with $1 - p_i p_{\min} \le 1$ and $\sigma_r \le 1$, indicate that for large N the error bounds in Theorem 11 and Theorem 7 are compatible, up to constants and logarithmic factors.

6.4.1 Finding a data adaptive sampling distribution

The error bounds derived earlier are valid for arbitrary observation probabilities. We will now show how to find a distribution so that the sample complexity is reduced. Note that a weaker sufficient condition that ensures estimation error of order $\epsilon < 1$ is that

$$N \ge \frac{C \log(n)}{\epsilon\,\|S\|\,p_{\min}} \sum_i \frac{\|x_i\|_\infty^2}{p_i} \left( 1 \vee \frac{1}{\epsilon} \right).$$

Since T = 1, the cost constraint from (6.12) reduces to

$$\sum_{i=1}^n p_i c_i = m.$$

For $0 \le \delta \le m/(\mathbf{1}^\top c)$, we denote the set of feasible probability distributions by

$$\mathcal{M}(\delta; m, c) = \{a \in \mathbb{R}^n : \delta \mathbf{1} \le a \le \mathbf{1},\ a^\top c = m\},$$

where the costs are arranged in a vector $c = [c_1, \dots, c_n]^\top$. The condition $\delta \le m/(\mathbf{1}^\top c)$ is required to ensure that $\mathcal{M}(\delta; m, c) \neq \emptyset$. To minimize the sample complexity, we wish to solve

$$\min_{p} \frac{1}{p_{\min}} \sum_i \frac{\|x_i\|_\infty^2}{p_i} \quad \text{s.t.} \quad p \in \mathcal{M}(\delta; m, c). \tag{6.15}$$

Since this optimization problem has a non-convex objective function, we propose solving the following related convex problem

$$\min_{p} \sum_i \frac{b_i}{p_i} \quad \text{s.t.} \quad p \in \mathcal{M}(\delta; m, c). \tag{6.16}$$

The value $b_i$ can be chosen as $\|x_i\|_\infty^2$, as suggested by Theorem 11, or a related quantity such as $\frac{1}{N}\sum_{k=1}^N (x_i^{(k)})^2$. Below we provide a complete characterization of the solution of (6.16).

Theorem 12. Let $b = [b_1, \dots, b_n]^\top > 0$, cost vector $c > 0$ and budget m satisfying $c^\top \mathbf{1} \ge m$, and $0 \le \delta \le m/(\mathbf{1}^\top c)$. Assume the variables are ordered so that $b_1/c_1 \ge b_2/c_2 \ge \cdots$. Let $p^\star$ be the solution of

$$\min_a \sum_{i=1}^n \frac{b_i}{a_i^q} \quad \text{s.t.} \quad a \in \mathcal{M}(\delta; m, c), \tag{6.17}$$

for $q \ge 1$. Then $p^\star$ satisfies:

1. $p^\star_i \ge p^\star_j$ if and only if $b_i/c_i \ge b_j/c_j$.

2. For each $p^\star_i \in (\delta, 1)$,
$$p^\star_i = \frac{(m - m_1 - \delta m_\delta)\,(b_i/c_i)^{\frac{1}{q+1}}}{\sum_{d < j \le n-r} c_j (b_j/c_j)^{\frac{1}{q+1}}},$$
where $m_1 = \sum_{j=1}^d c_j$ and $m_\delta = \sum_{j=n-r+1}^n c_j$. The integers r, d are defined so that $p^\star_i = 1$ when $1 \le i \le d$, $\delta < p^\star_i < 1$ when $d < i \le n-r$, and $p^\star_i = \delta$ if $i > n-r$.

3. There is an algorithm OptSampling(b, c, m, δ, q) that returns the optimal solution in $O(n\log(n))$ time with O(n) memory.
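As an illustration of Theorem 12, the following NumPy sketch implements the closed form for the δ = 0 case in the spirit of OptSampling/Algorithm 5: uncapped entries are proportional to $(b_i/c_i)^{1/(q+1)}$, and a single pass over the ratios sorted in decreasing order finds the entries clipped at 1. Variable names are ours, not the thesis notation.

```python
import numpy as np

def opt_sampling(b, c, m, q=1.0):
    """Minimize sum_i b_i / p_i^q  s.t.  sum_i c_i p_i = m,  0 < p_i <= 1."""
    b = np.asarray(b, float)
    c = np.asarray(c, float)
    w = (b / c) ** (1.0 / (q + 1.0))   # shape of the uncapped solution
    order = np.argsort(-(b / c))       # process b_i/c_i in decreasing order
    p = np.empty_like(b)
    lam = np.sum(c * w)                # running normalizer
    mu = 0.0                           # budget already spent on capped entries
    for i in order:
        p[i] = (m - mu) * w[i] / lam
        if p[i] >= 1.0:                # clip at 1 and renormalize the rest
            p[i] = 1.0
            lam -= c[i] * w[i]
            mu += c[i]
    return p

p = opt_sampling(b=[4.0, 1.0, 1.0], c=[1.0, 1.0, 1.0], m=2.0)
# The budget constraint holds with equality, and the largest ratio b_i/c_i
# receives the largest probability, as in part 1 of Theorem 12.
assert np.isclose(p @ np.ones(3), 2.0)
```

When all $b_i/c_i$ are equal, this returns the uniform feasible distribution, matching the remark following the theorem.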
When all $b_i/c_i$ are equal, the optimal solution is the uniform distribution. The optimal sampling distribution from this section can be used to approximate the solution of (6.12) when T = 1. The resulting estimator can achieve very good performance when the optimal sampling distribution is far from uniform. It requires two passes over the data: the first computes the vector b and the optimal sampling distribution, while the second performs the data compression. Since the optimal sampling distribution depends on the complete data matrix, we cannot use this strategy in streaming scenarios, in which only a few samples can be kept in memory at a time and the vector b cannot be computed. The only strategy with a fixed sampling distribution that can be implemented in the streaming scenario uses a uniform distribution with $p_i = m/(c^\top \mathbf{1})$, or a non-uniform sampling distribution that is not adaptive.

6.5 Dependent time varying sampling

In this section we propose and analyze sampling algorithms and estimators for covariance estimation with time varying observation probabilities. First we derive an error bound for the estimator (6.9) that generalizes Theorem 11. We apply this bound to derive a data adaptive algorithm for covariance estimation from dimensionality reduced data. A second application is an active algorithm for covariance estimation from streaming data. The proposed algorithms design optimal sampling probabilities by applying Theorem 12. We start with the error bound for (6.9). Recall that the sampling probabilities and block sizes are denoted by $p^{(t)}$ and $N_t$ respectively.

Theorem 13.
If all $p_i^{(t)} \ge \delta$, $\log(N_t) \ge 2$ and $n \ge 3$, then conditioned on $\{x^{(t,k)}\}$ for all pairs t, k, the estimator from (6.9) satisfies

$$\mathbb{E}\left[\|\widehat{S}_T - S\|^2\right]^{1/2} \le \tilde{C}\sqrt{\log(n+1)}\left[\sqrt{\frac{2C_2\log(n)}{N}}\,\tilde{\varepsilon}_2 + \frac{\sqrt{2}\,C_3\log(n)}{N}\,\tilde{\varepsilon}_3\right],$$

with

$$\tilde{\varepsilon}_2^2 = \sum_{i=1}^n \mathbb{E}\left[\frac{1}{N}\sum_{t=1}^T N_t\,\|S^{(t)}\|\,\|x_i^{(t)}\|_\infty^2\,\frac{1 - p_i^{(t)} p_{\min}^{(t)}}{p_i^{(t)} p_{\min}^{(t)}}\right], \qquad \tilde{\varepsilon}_3^2 = \mathbb{E}\left[\sum_{t=1}^T \|\sigma_{\log(N_t)}^{(t)}\|_\infty^2\left(\sum_{i=1}^n \frac{\|x_i^{(t)}\|_\infty^2}{(p_i^{(t)})^2}\right)^2\right].$$

Data dependence. In this case, different from Theorem 9 in Chapter 5, the probabilistic dependency is between the observation distributions, i.e., $p^{(t)}$ depends on $p^{(\tau)}$ for $\tau < t$, while also being allowed to depend on the data in an arbitrary manner.

Compatibility with previous results. Note that since $\|x_i^{(t)}\|_\infty^2 \le \|x_i\|_\infty^2$, we have that

$$\tilde{\varepsilon}_2^2 \le \frac{1}{N}\sum_{t=1}^T \|S^{(t)}\|\, N_t\,\frac{1-\delta^2}{\delta^2}\sum_{i=1}^n \|x_i\|_\infty^2.$$

Under the assumptions of Theorem 9, and as $N \to \infty$, the expression on the right of the above inequality behaves as

$$O\left(\|\Sigma\|\,\frac{1-\delta^2}{\delta^2}\,\operatorname{tr}(\Sigma)\right),$$

which is consistent with Theorems 7 and 9 (if we ignore logarithmic terms).

Sample complexity. If the following conditions hold,

$$\sqrt{\frac{2C_2\log(n)}{N}}\,\tilde{\varepsilon}_2 \ge \frac{\sqrt{2}\,C_3\log(n)}{N}\,\tilde{\varepsilon}_3, \tag{6.18}$$

$$N \ge 8C_2\tilde{C}^2\,\|S\|^{-2}\log(n)\log(n+1)\,\tilde{\varepsilon}_2^2\,\frac{1}{\epsilon^2}, \tag{6.19}$$

then the estimation error satisfies $\mathbb{E}[\|\widehat{S}_T - S\|^2]^{1/2} \le \epsilon\|S\|$. The first condition is satisfied if $\{N_t\}$ is an increasing divergent sequence.

6.5.1 Finding data adaptive sampling distributions

Using Theorem 13 we propose designs for the sampling probabilities that minimize the sample complexity.

6.5.1.1 Data adaptive algorithm for the compression setting

The expectation in $\tilde{\varepsilon}_2$ can be dropped, since the values $p_i^{(t)}$ are deterministic and depend only on the complete data. To reduce the sample complexity we would like to minimize $\tilde{\varepsilon}_2$; however, due to non-convexity we follow the same strategy as in the previous section and assume that $p_i^{(t)} \ge \delta$ for some $\delta > 0$, thus being able to bound the $1/p_{\min}^{(t)}$ factors by $1/\delta$.
To approximate (6.12), we propose solving the following convex optimization problem

$$\min_{p^{(1)},\dots,p^{(T)}} \sum_{i=1}^n \sum_{t=1}^T \frac{\|X_t\|^2\,\|x_i^{(t)}\|_\infty^2}{p_i^{(t)}} \quad \text{s.t.} \quad \delta\mathbf{1} \le p^{(t)} \le \mathbf{1},\ \forall t \in [T], \qquad \sum_{t=1}^T N_t \sum_{i=1}^n p_i^{(t)} \le mN, \tag{6.20}$$

where $\|X_t\|^2 = \|X_t X_t^\top\| = N_t\|S^{(t)}\|$. The solution of (6.20) can be found in $O(nT\log(nT))$ time using Theorem 12. The update (6.20) is data adaptive, since it requires complete observations to compute the values $\|x_i^{(t)}\|_\infty^2$. Pseudo code for covariance estimation from dimensionality reduced data is presented in Algorithm 3. There, $\Pi_t$ denotes the matrix of pairwise observation probabilities, with entries $p_i^{(t)}p_j^{(t)}$ for $i \neq j$ and $p_i^{(t)}$ on the diagonal, while $\odot$ and $\oslash$ denote entrywise multiplication and division.

Algorithm 3 Covariance estimation with compressed data and adaptive sampling
Require: δ, budget m, sequence $\{N_t\}$.
1: Compute $\|x_i^{(t)}\|_\infty^2$ and $\|X_t\|^2$ for $i \in [n]$, $t \in [T]$
2: Solve (6.20)
3: return $p^{(t)}$ for $t \in [T]$
Require: $\{N_t\}$, $p^{(t)}$ for $t \in [T]$
1: $Z \leftarrow 0$
2: $N \leftarrow 0$
3: for t = 1 to T do
4: Sample $\{\delta^{(t,k)}\}_{k=1}^{N_t}$ with distribution $p^{(t)}$
5: Compute $y^{(t,k)} = \delta^{(t,k)} \odot x^{(t,k)}$
6: Set $N \leftarrow N + N_t$
7: Set $Z \leftarrow Z + (Y_t Y_t^\top) \oslash \Pi_t$
8: return $\widehat{S}_T \leftarrow \frac{1}{N} Z$

Algorithm 4 Covariance estimation with active sampling
Require: δ, α, c, budget m, W, sequence $\{N_t\}$, $p^{(1)} \in \mathcal{M}(\delta; m, c)$.
1: $Z \leftarrow 0$
2: $N \leftarrow 0$
3: for t = 1 to T do
4: Sample $\{\delta^{(t,k)}\}_{k=1}^{N_t}$ as i.i.d. copies of $\delta^{(t)}$
5: Observe $y^{(t,k)} = \delta^{(t,k)} \odot x^{(t,k)}$ for $k \in [N_t]$
6: Set $N \leftarrow N + N_t$
7: Set $Z \leftarrow Z + (Y_t Y_t^\top) \oslash \Pi_t$
8: Find $p^{(t+1)}$ using (6.22), (6.21) and (6.23)
9: return $\widehat{S}_T = Z/N$

6.5.1.2 Active algorithm for streaming data

For streaming data we again use the upper bound from Theorem 13; however, now the index t represents time, and the sampling distribution depends only on previously observed data. To approximate (6.13) we propose solving

$$\min_{p^{(T)}} \sum_{i=1}^n \frac{\eta_i^{(T)}}{p_i^{(T)}} \quad \text{s.t.} \quad p^{(T)} \in \mathcal{M}(\delta; m, c), \tag{6.21}$$

where $\eta_i^{(T)}$ is an estimator for $\|x_i^{(T)}\|_\infty^2$ given by

$$\eta_i^{(T)} = \max\{(y_i^{(t,k)})^2 : k \in [N_t],\ t \in [T-W, T-1]\}. \tag{6.22}$$

The solution of (6.21) can be found using Theorem 12.
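The two streaming ingredients, the windowed proxy (6.22) for $\|x_i\|_\infty^2$ and the uniform mixture (cf. (6.23) below) used to keep probabilities away from zero, can be sketched as follows (NumPy; illustrative names, not the thesis code):

```python
import numpy as np

def window_eta(window_Y):
    """Proxy (6.22): max of the squared observed entries over the window.

    window_Y is a list of (n, N_t) observation blocks Y_t from the window."""
    return np.max(np.concatenate(window_Y, axis=1) ** 2, axis=1)

def regularize(p, c, m, alpha):
    """Mix an optimized distribution with the budget-feasible uniform one.

    If c^T p = m, the mixture also satisfies the cost constraint exactly."""
    p = np.asarray(p, float)
    c = np.asarray(c, float)
    uniform = (m / c.sum()) * np.ones_like(p)
    return alpha * p + (1.0 - alpha) * uniform

# Two observation blocks; row-wise maxima of the squares give eta.
Y1 = np.array([[1.0, -3.0], [0.0, 2.0]])
Y2 = np.array([[2.0, 0.0], [-4.0, 1.0]])
eta = window_eta([Y1, Y2])
assert np.allclose(eta, [9.0, 16.0])

# Mixing preserves the expected cost: c^T p = m before and after.
c = np.ones(4)
p = np.array([1.0, 0.5, 0.3, 0.2])   # c^T p = 2.0 = m
p_reg = regularize(p, c, m=2.0, alpha=0.5)
assert np.isclose(c @ p_reg, 2.0)
```

The full active step would additionally solve (6.21) via the OptSampling routine of Theorem 12 before mixing.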
To add stability in case the observation probabilities are too small, we regularize with the uniform distribution; we thus end up using the sampling distribution

$$\alpha\, p^{(T)} + (1-\alpha)\,\frac{m}{c^\top \mathbf{1}}\,\mathbf{1}, \tag{6.23}$$

for some $\alpha \in [0, 1]$. Implementation of the estimator based on the proposed active sampling rule is given in Algorithm 4.

6.5.2 Complexity analysis and comparison with related work

Data acquisition memory. Previous papers (see Table 6.1) have implemented the sensing mechanism using matrices that have fewer rows than columns. The number of rows may be fixed or random (as in Bernoulli sampling or sampling without replacement). We make the distinction between methods that act on the complete observations and those that act on a subset. The sensing matrices used in [8, 128] are of the first kind. Since they are dense, they require access to the complete data matrix, leading to an acquisition cost of O(nN). The algorithm from [28] does not use dense matrices, but requires access to the complete data to construct its adaptive compression scheme, hence also observing O(nN) values. [127] is the only work that acquires compressed data directly by using a sparse matrix; however, their algorithm does not have theoretical performance guarantees. In the streaming data scenario, where memory is more limited, Algorithm 4 is more suitable since it requires observing O(mN) values; in the compression setting we do not impose memory constraints at data acquisition, and thus use Algorithm 3 instead.

Data acquisition time complexity. Two factors dominate the time complexity of previous approaches: matrix-vector multiplication and random matrix construction, denoted by $\tau_s$ and $\tau_g$ in Table 6.1 respectively (see [28] and its supplementary material for more details). The time complexity of our algorithms has two components: Bernoulli sampling, i.e., flipping nN biased coins, and estimating the sampling probabilities. The data acquisition profile of Algorithm 3 is comparable to [28, 128] (see Table 6.1).
The main difference in the data acquisition complexity of Algorithms 3 and 4 is the method for computing the sampling distributions. For Algorithm 3 we solve a problem on nT variables, which also involves computing T operator norms of rectangular matrices, each with a cost $\tau_{\mathrm{op}}$. The complexity of Algorithm 4 is lower, since it does not require computing operator norms, and the estimates $\eta_i^{(t)}$ are easier to compute than the quantities $\|x_i^{(t)}\|_\infty$. Nonetheless, the cost of both algorithms is dominated by Bernoulli sampling.

Compression. In all previous works, compressed data uses O(mN) space. We highlight some differences that have practical implications. In [8, 128, 127], the dimension of the compressed data vectors is exactly m. Although in [28] the dimension is also m, since the observations are performed using sampling with replacement, the number of distinct observations can be much smaller than m. When m = n, the compressed observations produced by [127] and [28] are not the complete data. All the estimators from this paper produce compressed data requiring expected O(mN) space; however, there is an overhead necessary for storing the observation probabilities, of size nT. Here T is the number of times the sampling distribution is updated, and it provides a way to control the trade-off between statistical performance and computational memory/complexity. In practice, if we choose T so that nT < mN, the total storage cost remains close to mN and below 2mN. In contrast, the algorithm from [28] requires storage of at least 2mN.

Estimation. Regarding the computational complexity of computing the estimator, [128, 28] are the state of the art. Our estimators from Algorithms 3 and 4 attain the same computational complexity (ignoring logarithmic terms). The complexity bound displayed in Table 6.1 is easy to compute.
Given the set of vectors $p^{(t)}$, and assuming the random number generators at data acquisition and estimation are synchronized, the observed variables can be recovered, and we can form the vectors $y^{(t,k)}$ all in O(nN) time. Computing the outer product of an m-sparse vector costs $m^2$, and adding these N sparse matrices takes $O(m^2N)$. Note that the complexity depends on m, which is an expected value.

6.6 Experiments

6.6.1 Covariance estimation from dimensionality reduced data

We apply the estimator from Algorithm 3. We set the algorithm parameters to $\delta = 10^{-6}$ and $m/n \in \{0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9\}$. We only evaluate the case of constant block size, that is, the values $N_t$ are fixed for all t. As a baseline method we only consider the algorithm from Chen et al. [28], since it was shown to outperform all previous methods. Note that the algorithm from [28] requires that $m \ge 2$. We evaluate both algorithms on datasets from the UCI machine learning repository (https://archive.ics.uci.edu). We consider the Daily and Sports Activities (DSA) dataset. We use the first 200 variables and all samples, therefore n = 200 and N = 9120. In addition, we use a subset of the UJIIndoorLoc dataset with n = 100 and N = 5000. Average estimation errors, computed over 100 trials, are depicted in Figure 6.1.

Figure 6.1: Comparison of Algorithm 3 and [28] for covariance estimation from compressed observations. (a) DSA; (b) UJIIndoorLoc. Relative error (log10 scale) versus m/n, for the baseline and for block sizes $N_t \in \{1, 5, 50, 500, N\}$.

The algorithm from [28] uses non-uniform data adaptive probabilities, and the observed variables are chosen using sampling with replacement. This choice of data sampling is efficient for small m, but as we increase the number of observed variables there are many repetitions, and the actual number of distinct measurements does not increase with m. This can be seen in both plots of Figure 6.1, where the estimation error depicted by the black curves fails to converge to zero as m approaches n. For the DSA data, we observed that the best error is achieved with $N_t = 1$, and, as expected, the performance degrades as we increase the value of $N_t$ up to N. Algorithm 3 is better than the baseline for all compression ratios when $N_t \le 50$. When $m/n \ge 0.2$, the proposed algorithm is better than the baseline for all values of $N_t$. The performance on the second dataset is depicted in Figure 6.1b. Here the performance does not depend on $N_t$, so we only show the case $N_t = N$. This might be because, for this dataset, approximately 93% of the values belong to $\{-100, 100\}$, so the optimal sampling is close to the uniform distribution. Algorithm 3 appears to produce a close-to-uniform distribution, and is thus at least as good as the baseline method for all values of m/n.

Figure 6.2: Comparison of non-adaptive (uniform) and active (Alg. 4) strategies for covariance estimation with streaming observations for fixed m/n = 0.2. (a) DSA; (b) UJIIndoorLoc. Relative error (log10 scale) versus number of samples, for the uniform strategy and the active strategy with W = 5 and W = N.

6.6.2 Covariance estimation from streaming data

In this scenario data arrives one sample at a time, and only m variables can be measured. Figure 6.2 depicts the error as data arrives in batches of $N_t = n/5$ samples. For both datasets we fix m/n = 0.2.
In Figure 6.2a we show results for the DSA dataset using Algorithm 4 with window sizes W = 5 and W = N, representing short and long term memory respectively. At the beginning both active approaches perform similarly, and their estimation performance improves faster than with the uniform distribution. Around 2000 samples, the performance of Algorithm 4 with W = N suddenly deteriorates, while the implementation that uses W = 5 keeps reducing the estimation error. The reason is that around sample 2000 the data distribution changes drastically, and the algorithm with shorter memory (W = 5) can adapt faster to the change. As expected, the change of distribution does not increase the estimation error when the sampling distribution is uniform. Results for the UJIIndoorLoc dataset are plotted in Figure 6.2b; we observe a similar relation between the short and long term memory implementations. The difference here is that the uniform distribution is also optimal.

6.7 Conclusion

This chapter studied covariance estimation with resource constraints. One instance of this problem arises when there are storage or communication constraints, for example in a distributed sensing scenario, where data is collected in remote locations and then sent to a fusion center for statistical analysis, where the covariance matrix is computed. We also considered the data streaming case, where a covariance matrix has to be computed in an online fashion due to latency or memory limitations. For both scenarios we proposed using Bernoulli sampling to reduce the number of variables, and thus allocate computational resources more efficiently. Our contributions are error bounds in operator norm for covariance estimators that use very general data adaptive and time varying sampling distributions. Based on these error bounds, we also propose efficient algorithms to optimize the sampling distributions for the dimensionality reduction (compression) and data streaming settings.
Our main contributions are Algorithms 3 and 4, for covariance estimation from dimensionality reduced data and from active sequential observations respectively. These routines combine data acquisition/compression, sampling distribution optimization, and covariance estimation. Our theoretical and numerical analysis shows that the proposed algorithms have similar or better statistical and computational performance than previous approaches.

6.8 Proofs

6.8.1 Proof of Theorem 11

The error can be decomposed as a sum of zero mean random matrices:

$$N(\widehat{S} - S) = \sum_{k=1}^N \left(y^{(k)}y^{(k)\top} \oslash \Pi\right) - x^{(k)}x^{(k)\top} = \sum_{k=1}^N \left(\delta^{(k)}\delta^{(k)\top} \oslash \Pi\right)\odot x^{(k)}x^{(k)\top} - x^{(k)}x^{(k)\top} = \sum_{k=1}^N \left(\delta^{(k)}\delta^{(k)\top}\oslash\Pi - \mathbf{1}\mathbf{1}^\top\right)\odot x^{(k)}x^{(k)\top},$$

where $\Pi$ has entries $\Pi_{ij} = p_ip_j$ for $i \neq j$ and $\Pi_{ii} = p_i$, and $\oslash$, $\odot$ denote entrywise division and multiplication. Denote $M^{(k)} = \delta^{(k)}\delta^{(k)\top} \oslash \Pi - \mathbf{1}\mathbf{1}^\top$. We apply Theorem 10, where the expectation is taken with respect to the missing data mechanism and conditioned on the data event $\{x^{(k)}\}_{k=1}^N$, leading to

$$\mathbb{E}\left[\|\widehat{S}-S\|^2\right]^{1/2} \le \frac{1}{N}\sqrt{2e\log(n)\left\|\sum_{k=1}^N \mathbb{E}\left[(M^{(k)}\odot x^{(k)}x^{(k)\top})^2\right]\right\|} + \frac{4e\log(n)\,\mathbb{E}\left[\max_k \|M^{(k)}\odot x^{(k)}x^{(k)\top}\|^2\right]^{1/2}}{N}.$$

Bounding the matrix variance. We first upper bound $\|\sum_{k=1}^N \mathbb{E}[(M^{(k)}\odot x^{(k)}x^{(k)\top})^2]\|$. To simplify notation we ignore the superscript (k) for now. The Hadamard product obeys the identity $M \odot xx^\top = \operatorname{diag}(x)\,M\,\operatorname{diag}(x)$, thus we can write

$$\mathbb{E}\left[(M\odot xx^\top)^2\right] = \mathbb{E}\left[\operatorname{diag}(x)\,M\operatorname{diag}(x)^2 M\,\operatorname{diag}(x)\right] = \mathbb{E}\left[M\operatorname{diag}(x)^2 M\right]\odot xx^\top.$$

Additionally, let $e_i$ be the i-th column of the identity; then

$$\mathbb{E}\left[M\operatorname{diag}(x)^2 M\right] = \sum_{i=1}^n x_i^2\,\mathbb{E}[M_i M_i^\top],$$

where $Me_i = M_i$ is the i-th column of M.
Then we define $R^{(i)} = \sum_{k=1}^N (x_i^{(k)})^2\, x^{(k)}x^{(k)\top}$, and compute the sequence of bounds

$$\sum_{k=1}^N \mathbb{E}\left[(M^{(k)}\odot x^{(k)}x^{(k)\top})^2\right] = \sum_{k=1}^N\sum_{i=1}^n (x_i^{(k)})^2\, x^{(k)}x^{(k)\top}\odot\mathbb{E}[M_iM_i^\top] = \sum_{i=1}^n R^{(i)}\odot \mathbb{E}[M_i M_i^\top] \preceq \sum_{i=1}^n \|R^{(i)}\|\, I\odot \mathbb{E}[M_i M_i^\top].$$

Note that $\sum_{i=1}^n \|R^{(i)}\|\, I \odot \mathbb{E}[M_iM_i^\top]$ is a diagonal matrix, thus its operator norm is given by

$$\left\|\sum_{i=1}^n \|R^{(i)}\|\, I \odot \mathbb{E}[M_iM_i^\top]\right\| = \max_j \sum_{i=1}^n \|R^{(i)}\|\,(\mathbb{E}[M_iM_i^\top])_{jj} = \max_j \sum_{i=1}^n \|R^{(i)}\|\,\mathbb{E}[M_{ij}^2] \le N\|S\| \max_j \sum_{i=1}^n \|x_i\|_\infty^2\, \mathbb{E}[M_{ij}^2]$$
$$= N\|S\|\max_j\left(\|x_j\|_\infty^2\,\frac{1-p_j}{p_j} + \sum_{i\ne j}\|x_i\|_\infty^2\,\frac{1-p_ip_j}{p_ip_j}\right) \le N\|S\|\max_j\sum_{i=1}^n\|x_i\|_\infty^2\,\frac{1-p_ip_j}{p_ip_j} = N\|S\|\sum_{i=1}^n\|x_i\|_\infty^2\,\frac{1-p_ip_{\min}}{p_ip_{\min}}.$$

We have defined $\|x_i\|_\infty^2 = \max_k |x_i^{(k)}|^2$. With this, it only remains to verify the bound for $\|R^{(i)}\|$; indeed,

$$\|R^{(i)}\| = \sup_{\|u\|_2=1} u^\top\left(\sum_{k=1}^N (x_i^{(k)})^2\, x^{(k)}x^{(k)\top}\right)u \le \|x_i\|_\infty^2 \sup_{\|u\|_2=1} u^\top\left(\sum_{k=1}^N x^{(k)}x^{(k)\top}\right)u = \|x_i\|_\infty^2\, N\|S\|.$$

Bounding the maximum spectral norm. Now we bound $\mathbb{E}[\max_k \|M^{(k)}\odot x^{(k)}x^{(k)\top}\|^2]^{1/2}$. First, we have that

$$(M^{(k)})_{ij} = \begin{cases} \dfrac{\delta_i^{(k)} - p_i}{p_i} & \text{if } i = j, \\[6pt] \dfrac{\delta_i^{(k)}\delta_j^{(k)} - p_ip_j}{p_ip_j} & \text{otherwise.} \end{cases}$$

The maximum can be bounded using (5.26). Indeed, for any $r \ge 1$,

$$\mathbb{E}\left[\max_k \|M^{(k)}\odot x^{(k)}x^{(k)\top}\|^2\right]^{1/2} \le N^{1/r}\max_k \mathbb{E}\left[\|M^{(k)}\odot x^{(k)}x^{(k)\top}\|^{2r}\right]^{1/2r}.$$

For a moment, we ignore the superscript (k) while bounding each $\|M^{(k)}\odot x^{(k)}x^{(k)\top}\|$. We use the following inequality for the square of a symmetric matrix:

$$\|M\odot xx^\top\|^2 \le \|M\odot xx^\top\|_F^2 = \sum_{i,j} x_i^2x_j^2 M_{ij}^2.$$

Then an application of Minkowski's inequality leads to the bound

$$\mathbb{E}\left[\|M^{(k)}\odot x^{(k)}x^{(k)\top}\|^{2r}\right]^{1/r} \le \sum_{i,j}|x_i^{(k)}x_j^{(k)}|^2\,\mathbb{E}\left[|M_{ij}^{(k)}|^{2r}\right]^{1/r} \le \sum_i \frac{|x_i^{(k)}|^4}{p_i^2}\,\mathbb{E}\left[|\delta_i^{(k)} - p_i|^{2r}\right]^{1/r} + \sum_{i,j}\frac{|x_i^{(k)}x_j^{(k)}|^2}{p_i^2p_j^2}\,\mathbb{E}\left[|\delta_i^{(k)}\delta_j^{(k)} - p_ip_j|^{2r}\right]^{1/r}.$$

Since all $\delta^{(k)}$ are i.i.d. with independent entries, we can remove the dependence on k.
We bound the second term by applying Minkowski's inequality and the identity $\delta_i\delta_j - p_ip_j = \delta_i(\delta_j - p_j) + (\delta_i - p_i)p_j$, leading to

$$\sum_{i,j}\frac{|x_i^{(k)}x_j^{(k)}|^2}{p_i^2p_j^2}\,\mathbb{E}\left[|\delta_i\delta_j - p_ip_j|^{2r}\right]^{1/r} \le \sum_{i,j}\frac{|x_i^{(k)}x_j^{(k)}|^2}{p_i^2p_j^2}\,2\,\mathbb{E}\left[|\delta_i(\delta_j-p_j)|^{2r}\right]^{1/r} + \sum_{i,j}\frac{|x_i^{(k)}x_j^{(k)}|^2}{p_i^2p_j^2}\,2\,\mathbb{E}\left[|p_j(\delta_i-p_i)|^{2r}\right]^{1/r}$$
$$\le 4\sum_{i,j}\frac{|x_i^{(k)}x_j^{(k)}|^2}{p_i^2p_j^2}\,\mathbb{E}\left[|\delta_i-p_i|^{2r}\right]^{1/r} = 4\left(\sum_i \frac{|x_i^{(k)}|^2}{p_i^2}\,\sigma_r^2(p_i)\right)\left(\sum_i\frac{|x_i^{(k)}|^2}{p_i^2}\right) \le 4\,\|\sigma_r\|_\infty^2\left(\sum_i\frac{|x_i^{(k)}|^2}{p_i^2}\right)^2.$$

In the second inequality above we used the fact that $\delta_i$ and $p_j$ are bounded by one. We define $\sigma_r(p) = (p(1-p)^{2r} + (1-p)p^{2r})^{1/2r}$ as the $\ell_{2r}$ norm of the centered Bernoulli random variable with parameter p, and $\|\sigma_r\|_\infty = \max_i \sigma_r(p_i)$. The other term can be bounded similarly, leading to

$$\sum_i \frac{|x_i^{(k)}|^4}{p_i^2}\,\mathbb{E}\left[|\delta_i^{(k)}-p_i|^{2r}\right]^{1/r} \le \|\sigma_r\|_\infty^2\left(\sum_i\frac{|x_i^{(k)}|^2}{p_i^2}\right)^2.$$

Combining both bounds and maximizing over k produces

$$\mathbb{E}\left[\|M^{(k)}\odot x^{(k)}x^{(k)\top}\|^{2r}\right]^{1/2r} \le \sqrt{5}\,\|\sigma_r\|_\infty \sum_i \frac{\|x_i\|_\infty^2}{p_i^2}.$$

We finalize by setting $r = \log(N)$, thus $N^{1/r} = e$.

Algorithm 5 Optimal sampling probabilities
Require: m, q, c, b, sorted variables satisfying $b_1/c_1 \ge b_2/c_2 \ge \cdots \ge b_n/c_n > 0$.
1: Set $d \leftarrow 0$
2: $i \leftarrow 1$
3: $\lambda \leftarrow \sum_{i=1}^n c_i(b_i/c_i)^{\frac{1}{q+1}}$
4: $\mu \leftarrow 0$
5: while $i \le n$ do
6: Compute $p_i \leftarrow (m-\mu)(b_i/c_i)^{\frac{1}{q+1}}/\lambda$
7: if $p_i < 1$ then
8: $i \leftarrow i+1$
9: else
10: $p_i \leftarrow 1$
11: $\lambda \leftarrow \lambda - c_i(b_i/c_i)^{\frac{1}{q+1}}$
12: $\mu \leftarrow \mu + c_i$
13: $d \leftarrow d+1$
14: $i \leftarrow i+1$
15: return d, p

6.8.2 Optimal missing data distribution, proof of Theorem 12

We first analyze the case $\delta = 0$.

Theorem 14. Let $q \ge 1$, let the cost vector $c > 0$ satisfy $c^\top\mathbf{1} \ge m$, and let b have entries $b_i > 0$. Assume the variables are sorted according to $b_1/c_1 \ge b_2/c_2 \ge \cdots$, with $p_1 = p_2 = \cdots = p_d = 1$ and $p_i < 1$ for $i > d$. If p is the solution of

$$\min_a \sum_{i=1}^n \frac{b_i}{a_i^q} \quad \text{s.t.} \quad a \in \mathcal{M}_0, \tag{6.24}$$

where $\mathcal{M}_0 = \mathcal{M}(0; m, c)$, then p satisfies:

1. $p_i \ge p_j$ if and only if $b_i/c_i \ge b_j/c_j$.

2. If $p_i < 1$,
$$p_i = \frac{(m - m_1)\,(b_i/c_i)^{\frac{1}{q+1}}}{\sum_{d<j\le n} c_j\,(b_j/c_j)^{\frac{1}{q+1}}},$$
where $m_1 = \sum_{j=1}^d c_j$.

3.
If the variables are sorted, Algorithm 5 returns the optimal p in O(n) time and memory. If they are not sorted, the complexity is $O(n\log(n))$.

Proof. We first state the optimality conditions, and then proceed to prove each statement of the theorem. The optimization problem is convex, hence the Karush-Kuhn-Tucker (KKT) optimality conditions completely characterize the solution [17]. They correspond to

$$-\frac{qb_i}{p_i^{q+1}} + \nu c_i + \mu_i = 0, \qquad \mu_i(p_i - 1) = 0, \qquad p_i \le 1, \qquad \mu_i \ge 0, \qquad \sum_{i=1}^n c_ip_i = m.$$

If $p_i = 1$, then $qb_i = \nu c_i + \mu_i$. If $p_i < 1$, then $\mu_i = 0$, and thus

$$p_i^{q+1} = \frac{qb_i}{\nu c_i}. \tag{6.25}$$

1. Let $i \neq j$; there are three cases to consider. If $p_i = p_j = 1$, the statement holds trivially. If $p_i < 1$ and $p_j < 1$, the statement follows from (6.25). The case $p_i < 1$, $p_j = 1$ follows from the chain of inequalities

$$\frac{qb_j}{c_j} = \nu + \frac{\mu_j}{c_j} \ge \nu = \frac{qb_i}{c_i}\,\frac{1}{p_i^{q+1}} \ge \frac{qb_i}{c_i}.$$

2. To find the $p_i < 1$, we compute $\nu$ from

$$m = \sum_{i=1}^n p_ic_i = \sum_{j=1}^d c_j + \left(\frac{q}{\nu}\right)^{\frac{1}{q+1}}\sum_{j=d+1}^n c_j\left(\frac{b_j}{c_j}\right)^{\frac{1}{q+1}},$$

and replace back in (6.25).

3. The algorithm is essentially a linear search to find d, the number of $p_i = 1$, which combined with facts 1) and 2) allows us to compute the solution. To prove that Algorithm 5 produces the desired solution, without loss of generality we re-label and sort the variables so that $b_1/c_1 \ge b_2/c_2 \ge \cdots$. Let j be the first index for which $p_j < 1$; that is, $p_i = 1$ for $i < j$ and $p_i < 1$ for $i \ge j$, i.e., $j = d+1$. Then it is easy to check that Algorithm 5 produces a feasible solution, that is, it also satisfies $\sum_i c_ip_i = m$. The other KKT conditions can be checked by computing $\nu$ and the $\mu_i$. Sorting the values $b_i/c_i$ can be done in $O(n\log(n))$ time; computing the sum in line 3 and running the while loop both cost O(n) time. The overall complexity is therefore dominated by the sorting step, leading to $O(n\log(n))$.

Now we consider the case $\delta > 0$ to finish the proof of Theorem 12. We will show that solving (6.17) can be reduced to calling Algorithm 5 $O(\log(n))$ times.
We omit the proofs of 1) and 2), since they are very similar to the proof of Theorem 14. Note that when $\delta = 0$, the solution is determined once the integer d has been found. When $\delta > 0$, two integers, d and r, have to be found. To this end, consider the constraint set

$$\mathcal{M}_0^r = \mathcal{M}_0 \cap \{a : a_i = \delta,\ n-r < i \le n\}.$$

When r = 0, there are no equality constraints in $\mathcal{M}_0^r$. Writing $f_q(a) = \sum_{i=1}^n b_i/a_i^q$ for the objective, the corresponding auxiliary problem is

$$\min_a f_q(a) \quad \text{s.t.} \quad a \in \mathcal{M}_0^r, \tag{6.26}$$

with solution $p_r^\dagger$. Note that $p^\star$ is also the solution of

$$\min_r \ f_q(p_r^\dagger) \quad \text{s.t.} \quad p_r^\dagger \in \mathcal{M}(\delta; m, c), \tag{6.27}$$

thus solving (6.17) has been reduced to solving the combinatorial problem (6.27). By computing $p_r^\dagger$ for $0 \le r \le n$ and doing a brute force search, the optimal value of r can be found. However, this is sub-optimal in terms of complexity and memory, both of order $O(n^2)$. A more efficient solution can be implemented since the problem has the following additional properties.

1. Values of r induce an ordering of the solutions of (6.26), given by $\mathcal{M}_0^{r+1} \subset \mathcal{M}_0^r$ and $f_q(p_r^\dagger) \le f_q(p_{r+1}^\dagger)$.

2. The solution of (6.27) can be characterized as follows. Let s be the smallest integer satisfying $p^\star = p_s^\dagger$; then

$$f_q(p_s^\dagger) \le f_q(p_r^\dagger) \ \text{ if } r \ge s, \qquad p_r^\dagger \notin \mathcal{M}(\delta; m, c) \ \text{ if } r < s.$$

3. The auxiliary problem (6.26) can be solved using Algorithm 5. That is, for any r, $p_r^\dagger$ can be found by setting its last r elements equal to δ, while its first n−r elements are the solution of (6.24) with budget $m - \delta\sum_{j=n-r+1}^n c_j$ and cost vector $[c_1, \dots, c_{n-r}]^\top$.

4. s can be found using binary search, which terminates in O(log(n)) steps. Each step calls Algorithm 5 and checks the conditions established above in fact 2); termination occurs when s produces a feasible solution and s−1 produces an infeasible one.

6.8.3 Proof of Theorem 13

The proof combines Theorems 11 and 10, and is almost identical to the proof of Theorem 9 from Chapter 5. The main difference is that, since all expectations are with respect to the distribution of the missing data mechanism, we need a different martingale.
For that we use $W_T$, defined in (6.8). If we apply Theorem 10 to $W_T$, we only need to bound $\mathbb{E}[\|\tilde{Z}_t\|^2]$. We can do that by applying conditional expectation with respect to the pattern of previous observations as defined in (6.5), and applying Theorem 11, leading to

$$\mathbb{E}_{t-1}\left[\|\tilde{Z}_t\|^2\right] \le 2C_2\log(n)\,\|S^{(t)}\|\,N_t\sum_{i=1}^n\|x_i^{(t)}\|_\infty^2\,\frac{1-p_i^{(t)}p_{\min}^{(t)}}{p_i^{(t)}p_{\min}^{(t)}} + 2C_3^2\log(n)^2\,\|\sigma_{\log(N_t)}^{(t)}\|_\infty^2\left(\sum_{i=1}^n\frac{\|x_i^{(t)}\|_\infty^2}{(p_i^{(t)})^2}\right)^2.$$

We conclude by computing the total expectation, summing over t = 1 to t = T, and dividing by $N^2$.

Chapter 7

Conclusion and future work

In this thesis we studied various methods to learn models from data. In Chapter 3 we considered estimation of generalized graph Laplacian (GGL) matrices. The graph learning problem was posed as maximum likelihood estimation of the precision matrix of an attractive GMRF. This formulation is also shown to be a GGL learning problem that optimizes graph signal smoothness, thus making it suitable for graph signal processing applications. In addition, we proposed a new block coordinate descent algorithm to solve this optimization problem. We showed through numerical experiments that precision matrix estimation with GGL constraints leads to sensible graphs for image data. More recent works have applied the GFT derived from these optimal GGL matrices to compression of image and video [121, 58, 50, 124], and to other non-traditional signals [23, 132, 164, 89]. In the aforementioned applications, these data-optimized graphs were shown to have excellent performance in comparison to traditional signal processing methods and to other types of data adaptive transforms. However, the algorithm from Chapter 3, and others that solve the same problem [147, 52], only scale to a few hundred nodes, and therefore cannot be used in applications that require bigger graphs or have limited computational resources.
Future research in this area will focus on developing algorithms that can learn an optimal graph, or approximate it, more efficiently than current approaches. In Chapter 4 we considered a variation of the graph learning problem from Chapter 3, where in addition to the GGL constraints, we require that the graph topology belong to a given class of graphs. We define a large class of graph properties, called admissible, that includes trees, bipartite graphs, sparse connected graphs, and other graph properties. We showed that by incorporating a topology property constraint, the optimization problem becomes non-convex. To overcome this issue we proposed an algorithm that finds an approximate solution by leveraging existing convex solvers, like the one from Chapter 3, and combinatorial optimization algorithms. Specifically, our algorithm consists of two steps: graph topology inference and graph weight estimation. The first step finds a graph topology with the desired property by solving a maximum weight graph approximation problem that depends on the input covariance matrix. For example, to learn bipartite graphs we need to solve a maximum cut problem, which can be approximated using the algorithm from [64]; we then solve a GGL estimation problem on the bipartite topology found in the previous step. We derive an error bound that compares the global minimum (a solution of the non-convex problem) to any point in the feasible set, and show that the proposed algorithm seeks to minimize that bound. Our method, when applied to learning tree structured graphs, reduces to the well known Chow-Liu algorithm [32]. To the best of our knowledge, this is the only work that considers graph learning with general topology properties. Applicability of the proposed framework depends strongly on the graph property of interest and the convex solver used in the graph weight estimation phase.
For example, the usefulness of this method in high dimensional problems depends on the existence of approximation algorithms (for a given graph property) and of convex GGL estimation solvers that are efficient and scale well. There are several questions that we would like to see answered. Circulant [82, 54] and M-block cyclic graphs [156, 157, 6] can be used to construct graph filter banks; however, they are not admissible graph properties, thus it would be necessary to develop different graph learning methods for such families. We also expect that algorithms with better theoretical and computational characteristics may be found by focusing on specific graph properties. Finally, since optimized graphs have been shown to improve the coding performance of images and video [50, 121], and point cloud data has benefited from graph transforms defined on trees and bipartite graphs [4, 119], we would like to implement graph based compression algorithms using optimized structured graphs. Chapters 5 and 6 are devoted to covariance matrix estimation with partial observations. In both chapters we consider similar unbiased estimators and derive bounds for the estimation error measured in operator norm. Each chapter studies the estimators under different assumptions: Chapter 5 is motivated by statistical analysis, while Chapter 6 is concerned with estimators that can operate under limited computational resources. In Chapter 5 we studied the missing data case, which covers scenarios where parts of a dataset are lost due to uncontrollable factors [91]. Recent theoretical work in this area focuses on designing covariance estimators with an optimal rate of convergence (in probability) to the population covariance matrix. The main assumption is that observations are i.i.d. realizations of a sub-Gaussian vector. While some works require the population covariance matrix to be structured (e.g., low rank or sparse), we only investigated the unstructured sub-Gaussian case.
We started with a data independent missing data mechanism (denoted MCAR), where variables are observed according to Bernoulli 0-1 random variables with independent coordinates and arbitrary observation probabilities. Unbiased estimators for this type of missing observations are well known; however, error bounds were only available for the uniform missing data case [96]. We obtained the first error bound for the more general non-uniform case. When our bound is specialized to the uniform case, or to the complete observations case, it matches previous state of the art bounds up to constant or logarithmic factors [19, 81, 96]. Next, we considered data dependent missing observations (the MAR missing data mechanism) and a corresponding unbiased covariance estimator. This missing data mechanism may be non-i.i.d. and data dependent; therefore we construct an expectation bound that is conditioned on the missing data pattern. We are the first to bound the estimation error of this estimator for unstructured population covariance matrices. The constants in this bound depend on an empirical estimator of the probability of jointly observing pairs of variables; therefore, when the missing data patterns follow a Bernoulli distribution, both bounds (for MCAR and MAR observations) have very similar constants. In fact, asymptotically, the constants for the MAR case are upper bounded by the constants in the MCAR case, suggesting that the former assumption leads to a better estimator. Numerical experiments provide evidence supporting that assertion. The last contribution of Chapter 5 is the introduction of a new missing data mechanism that changes over time depending on previous observations. We call this mechanism conditionally MCAR (CMCAR), because it behaves as an MCAR mechanism when conditioned on previous observations. We also propose an unbiased estimator and obtain estimation error bounds.
The resulting theory stems from a combination of error bounds valid for the MCAR case with martingale inequalities. Although the theoretical analysis of Chapter 5 introduced new and more general results, there are many open questions regarding covariance estimation with missing observations. Concentration bounds, instead of expectation bounds, similar to those of [96, 19], could be applied to examine other algorithms subject to missing observations. Some of these include estimation of structured covariance matrices, graphical model selection, inverse covariance estimation and principal component analysis. Another interesting line of work is constructing estimators when the missing data mechanism has a dependency structure encoded by a graphical model [108].

In Chapter 6 we studied covariance estimation from an algorithmic perspective. In this setting, we are interested in minimizing the costs associated with data acquisition, storage, and the computational complexity of calculating the covariance matrix. The approach we follow is inspired by the missing data literature; in fact, we consider the MCAR missing data mechanism and introduce a CMCAR mechanism similar to the one from Chapter 5. By applying Bernoulli sampling to the data, we produce a sequence of sparse observation vectors. This reduced data requires memory proportional to the expected number of observed variables. The computational complexity of computing the covariance estimator is also reduced by operating with sparse matrices. Different from Chapter 5, we do not make assumptions on the data distribution, and allow it to be random, non-i.i.d., and even deterministic. Since the goal is to improve with respect to the case of complete observations, we compare against the sample covariance matrix (instead of the population covariance matrix). We compute bounds for the expected estimation error in operator norm, conditioned on the complete data.
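The Chapter 6 setting described above — Bernoulli sampling applied to incoming vectors, with the estimate debiased so that its conditional expectation given the complete data equals the sample covariance matrix — can be sketched as follows. This is a hypothetical illustration with our own class and variable names, not the exact algorithm of the chapter.

```python
import numpy as np

class BernoulliCovEstimator:
    """Streaming covariance estimation from Bernoulli-subsampled vectors.

    Each incoming sample x is sparsified by keeping coordinate i with
    probability p[i]; only the (sparse) observed entries need to be stored.
    Dividing the accumulated outer products by the pairwise observation
    probabilities makes the conditional expectation of the estimate, given
    the complete data, equal to the sample second-moment matrix X^T X / n.
    """

    def __init__(self, d, p, seed=0):
        self.p = np.asarray(p, dtype=float)
        self.S = np.zeros((d, d))
        self.n = 0
        self.rng = np.random.default_rng(seed)

    def update(self, x):
        mask = self.rng.random(x.shape) < self.p   # Bernoulli sampling
        y = np.where(mask, x, 0.0)                 # sparse observation vector
        self.S += np.outer(y, y)
        self.n += 1

    def estimate(self):
        P = np.outer(self.p, self.p)               # joint observation probs
        np.fill_diagonal(P, self.p)                # diagonal observed w.p. p_i
        return self.S / (self.n * P)
```

An active variant, in the spirit of the streaming experiments of Chapter 6, would additionally update p between calls to update, based on the observations seen so far.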
Using these bounds, we propose low complexity algorithms to design data adaptive sampling distributions. We evaluate the proposed estimators in two settings. The first is covariance estimation from compressed data, where the goal is to minimize the storage cost while attaining the lowest possible error with respect to the sample covariance matrix. Our algorithm was compared to a state of the art adaptive algorithm [28]; numerical experiments showed that it attains similar or better performance under various parameter settings. We also consider the problem of covariance estimation from streaming data with limited memory. In this scenario observations arrive one at a time, and the covariance matrix is computed in an online fashion. We propose an active algorithm that updates the sampling distribution based on previous observations. Our numerical experiments show that our approach outperforms adaptive and non-adaptive approaches, and adapts better to data with time varying distributions. The most immediate extension of this work is concerned with concentration inequalities instead of expectation bounds. These might be useful in the study of streaming algorithms that employ covariance matrices, for example streaming principal component analysis and matrix completion.

Reference List

[1] Animashree Anandkumar, Vincent Tan, and Alan S Willsky. High-dimensional graphical model selection: tractable graph families and necessary conditions. In Advances in Neural Information Processing Systems, pages 1863–1871, 2011.
[2] Animashree Anandkumar, Vincent YF Tan, Furong Huang, and Alan S Willsky. High-dimensional Gaussian graphical model selection: Walk summability and local separation criterion. Journal of Machine Learning Research, 13(Aug):2293–2337, 2012.
[3] Farhad P Anaraki and Shannon Hughes. Memory and computation efficient PCA via very sparse random projections. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 1341–1349, 2014.
[4] Aamir Anis, Philip A Chou, and Antonio Ortega. Compression of dynamic 3D point clouds using subdivisional meshes and graph wavelet transforms. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6360–6364. IEEE, 2016.
[5] Aamir Anis, Akshay Gadde, and Antonio Ortega. Efficient sampling set selection for bandlimited graph signals using graph spectral proxies. IEEE Transactions on Signal Processing, 64(14):3775–3789, July 2016.
[6] Aamir Anis, David BH Tay, and Antonio Ortega. Tree-structured filter banks for M-block cyclic graphs. In 2017 51st Asilomar Conference on Signals, Systems, and Computers, pages 55–59. IEEE, 2017.
[7] Muhammad Tayyab Asif, Nikola Mitrovic, Justin Dauwels, and Patrick Jaillet. Matrix and tensor based methods for missing data estimation in large traffic networks. IEEE Transactions on Intelligent Transportation Systems, 17(7):1816–1825, 2016.
[8] Martin Azizyan, Akshay Krishnamurthy, and Aarti Singh. Extreme compressive sampling for covariance estimation. IEEE Transactions on Information Theory, 64(12):7613–7635, Dec 2018.
[9] Sohail Bahmani and Justin Romberg. Near-optimal estimation of simultaneously sparse and low-rank matrices from nested linear measurements. Information and Inference: A Journal of the IMA, 5(3):331–351, 2016.
[10] Jushan Bai and Shuzhong Shi. Estimating high dimensional covariance matrices and its applications. Annals of Economics and Finance, 12(2):199–215, 2011.
[11] Mikhail Belkin, Partha Niyogi, and Vikas Sindhwani. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. The Journal of Machine Learning Research, 7:2399–2434, 2006.
[12] Peter Berger, Manfred Buchacher, Gabor Hannak, and Gerald Matz. Graph learning based on total variation minimization. In Acoustics, Speech and Signal Processing (ICASSP), 2018 IEEE International Conference on. IEEE, 2018.
[13] José M Bioucas-Dias, Deborah Cohen, and Yonina C Eldar.
Covalsa: Covariance estimation from compressive measurements using alternating minimization. In Signal Processing Conference (EUSIPCO), 2014 Proceedings of the 22nd European, pages 999–1003. IEEE, 2014.
[14] Türker Biyikoglu, Josef Leydold, and Peter F. Stadler. Laplacian eigenvectors of graphs. Lecture Notes in Mathematics, 1915, 2007.
[15] Stefano Boccaletti, Vito Latora, Yamir Moreno, Martin Chavez, and D.-U. Hwang. Complex networks: Structure and dynamics. Physics Reports, 424(4):175–308, 2006.
[16] Andrej Bogdanov, Elchanan Mossel, and Salil Vadhan. The complexity of distinguishing Markov random fields. In Approximation, Randomization and Combinatorial Optimization. Algorithms and Techniques, pages 331–342. Springer, 2008.
[17] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[18] Ed Bullmore and Olaf Sporns. Complex brain networks: graph theoretical analysis of structural and functional systems. Nature Reviews Neuroscience, 10(3):186–198, 2009.
[19] Florentina Bunea and Luo Xiao. On the sample covariance matrix estimator of reduced effective rank population matrices, with applications to FPCA. Bernoulli, 21(2):1200–1230, 2015.
[20] T Tony Cai and Anru Zhang. Minimax rate-optimal estimation of high-dimensional covariance matrices with incomplete data. Journal of Multivariate Analysis, 150:55–74, 2016.
[21] Tony Cai, Weidong Liu, and Xi Luo. A constrained l1 minimization approach to sparse precision matrix estimation. Journal of the American Statistical Association, 106(494):594–607, 2011.
[22] Shayok Chakraborty, Jiayu Zhou, Vineeth Balasubramanian, Sethuraman Panchanathan, Ian Davidson, and Jieping Ye. Active matrix completion. In Data Mining (ICDM), 2013 IEEE 13th International Conference on, pages 81–90. IEEE, 2013.
[23] Yung-Hsuan Chao. Compression of Signal on Graphs with the Application to Image and Video Coding. PhD thesis, University of Southern California, 2017.
[24] O Chapelle, B Schölkopf, and A Zien.
Semi-Supervised Learning. MIT Press, 2006.
[25] Richard Y Chen, Alex Gittens, and Joel A Tropp. The masked sample covariance estimator: an analysis using matrix concentration inequalities. Information and Inference: A Journal of the IMA, 1(1):2–20, 2012.
[26] Siheng Chen, Aliaksei Sandryhaila, José MF Moura, and Jelena Kovačević. Signal recovery on graphs: Variation minimization. IEEE Transactions on Signal Processing, 63(17):4609–4624, 2015.
[27] Siheng Chen, Rohan Varma, Aliaksei Sandryhaila, and Jelena Kovačević. Discrete signal processing on graphs: Sampling theory. IEEE Transactions on Signal Processing, 63(24):6510–6523, 2015.
[28] Xixian Chen, Michael R Lyu, and Irwin King. Toward efficient and accurate covariance matrix estimation on compressed data. In International Conference on Machine Learning, pages 767–776, 2017.
[29] Yuxin Chen, Yuejie Chi, and Andrea J Goldsmith. Exact and stable covariance estimation from quadratic sampling via convex programming. IEEE Transactions on Information Theory, 61(7):4034–4059, 2015.
[30] Sundeep Prabhakar Chepuri, Sijia Liu, Geert Leus, and Alfred O Hero. Learning sparse graphs under smoothness prior. In Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on, pages 6508–6512. IEEE, 2017.
[31] Yuejie Chi. Nearest subspace classification with missing data. In Signals, Systems and Computers, 2013 Asilomar Conference on, pages 1667–1671. IEEE, 2013.
[32] C Chow and Cong Liu. Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory, 14(3):462–467, 1968.
[33] Fan R.K. Chung. Spectral graph theory, volume 92. American Mathematical Soc., 1997.
[34] Ronald R Coifman and Mauro Maggioni. Diffusion wavelets. Applied and Computational Harmonic Analysis, 21(1):53–94, 2006.
[35] Kostadin Dabov, Alessandro Foi, Vladimir Katkovnik, and Karen Egiazarian. BM3D image denoising with shape-adaptive principal component analysis.
In SPARS'09 - Signal Processing with Adaptive Sparse Structured Representations, 2009.
[36] Samuel I Daitch, Jonathan A Kelner, and Daniel A Spielman. Fitting a graph to vector data. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 201–208. ACM, 2009.
[37] Gautam Dasarathy, Parikshit Shah, Badri Narayan Bhaskar, and Robert D Nowak. Sketching sparse matrices, covariances, and graphs via tensor products. IEEE Transactions on Information Theory, 61(3):1373–1388, 2015.
[38] Gautam Dasarathy, Aarti Singh, Maria-Florina Balcan, and Jong H Park. Active learning algorithms for graphical model selection. In Artificial Intelligence and Statistics (AISTATS), pages 1356–1364, 2016.
[39] Alexandre d'Aspremont, Onureena Banerjee, and Laurent El Ghaoui. First-order methods for sparse covariance selection. SIAM Journal on Matrix Analysis and Applications, 30(1):56–66, 2008.
[40] Mark A. Davenport, Petros T. Boufounos, Michael B. Wakin, and Richard G. Baraniuk. Signal processing with compressive measurements. IEEE Journal of Selected Topics in Signal Processing, 4(2):445–460, 2010.
[41] Yohann De Castro, Thibault Espinasse, and Paul Rochet. Reconstructing undirected graphs from eigenspaces. The Journal of Machine Learning Research, 18(1):1679–1702, 2017.
[42] Claude Dellacherie, Servet Martinez, and Jaime San Martin. Potentials of random walks on trees. Linear Algebra and its Applications, 501:123–161, 2016.
[43] Arthur P. Dempster. Covariance selection. Biometrics, pages 157–175, 1972.
[44] Amol Deshpande, Carlos Guestrin, Samuel R Madden, Joseph M Hellerstein, and Wei Hong. Model-driven data acquisition in sensor networks. In Proceedings of the Thirtieth International Conference on Very Large Data Bases - Volume 30, pages 588–599. VLDB Endowment, 2004.
[45] Inderjit S. Dhillon and Joel A. Tropp. Matrix nearness problems with Bregman divergences. SIAM Journal on Matrix Analysis and Applications, 29(4):1120–1146, 2007.
[46] X. Dong, D.
Thanou, P. Frossard, and P. Vandergheynst. Learning Laplacian matrix in smooth graph signal representations. IEEE Transactions on Signal Processing, 64(23):6160–6173, Dec 2016.
[47] Xiaowen Dong, Dorina Thanou, Michael Rabbat, and Pascal Frossard. Learning graphs from data: A signal representation perspective. IEEE Signal Processing Magazine, 36(3):44–63, 2019.
[48] Florian Dörfler and Francesco Bullo. Kron reduction of graphs with applications to electrical networks. Circuits and Systems I: Regular Papers, IEEE Transactions on, 60(1):150–163, 2013.
[49] Rick Durrett. Probability: Theory and Examples. Cambridge University Press, 2010.
[50] Hilmi E Egilmez, Yung-Hsuan Chao, Antonio Ortega, Bumshik Lee, and Sehoon Yea. GBST: Separable transforms based on line graphs for predictive video coding. In Image Processing (ICIP), 2016 IEEE International Conference on, pages 2375–2379. IEEE, 2016.
[51] Hilmi E Egilmez and Antonio Ortega. Spectral anomaly detection using graph-based filtering for wireless sensor networks. In Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on, pages 1085–1089. IEEE, 2014.
[52] Hilmi E Egilmez, Eduardo Pavez, and Antonio Ortega. Graph learning from data under Laplacian and structural constraints. IEEE Journal of Selected Topics in Signal Processing, 11(6):825–841, 2017.
[53] Hilmi E Egilmez, Eduardo Pavez, and Antonio Ortega. Graph learning from filtered signals: Graph system and diffusion kernel identification. IEEE Transactions on Signal and Information Processing over Networks, 2018.
[54] Venkatesan N Ekambaram, Giulia C Fanti, Babak Ayazifar, and Kannan Ramchandran. Circulant structures and graph signal processing. In 2013 IEEE International Conference on Image Processing, pages 834–838. IEEE, 2013.
[55] Ehsan Elhamifar and Rene Vidal. Sparse subspace clustering: Algorithm, theory, and applications. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(11):2765–2781, 2013.
[56] Salar Fattahi and Somayeh Sojoudi. Graphical lasso and thresholding: Equivalence and closed-form solutions. Journal of Machine Learning Research, 20(10):1–44, 2019.
[57] Miroslav Fiedler and Vlastimil Pták. On matrices with non-positive off-diagonal elements and positive principal minors. Czechoslovak Mathematical Journal, 12(87):382–400, 1962.
[58] Giulia Fracastoro, Dorina Thanou, and Pascal Frossard. Graph transform learning for image compression. In 2016 Picture Coding Symposium (PCS), pages 1–5. IEEE, 2016.
[59] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432–441, 2008.
[60] Akshay Gadde, Sunil K Narang, and Antonio Ortega. Bilateral filter: Graph spectral interpretation and extensions. In 2013 IEEE International Conference on Image Processing, pages 1222–1226. IEEE, 2013.
[61] Matan Gavish, Boaz Nadler, and Ronald R Coifman. Multiscale wavelets on trees, graphs and high dimensional data: Theory and applications to semi supervised learning. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 367–374, 2010.
[62] Neil Gershenfeld, Stephen Samouhos, and Bruce Nordman. Intelligent infrastructure for energy efficiency. Science, 327(5969):1086–1088, 2010.
[63] Benjamin Girault, Antonio Ortega, and Shrikanth S Narayanan. Irregularity-aware graph Fourier transforms. IEEE Transactions on Signal Processing, 66(21):5746–5761, 2018.
[64] Michel X Goemans and David P Williamson. Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. Journal of the ACM (JACM), 42(6):1115–1145, 1995.
[65] Alon Gonen, Dan Rosenbaum, Yonina C Eldar, and Shai Shalev-Shwartz. Subspace learning with partial information. The Journal of Machine Learning Research, 17(1):1821–1841, 2016.
[66] Clive WJ Granger. Some recent development in a concept of causality. Journal of Econometrics, 39(1):199–211, 1988.
[67] David Hallac, Jure Leskovec, and Stephen Boyd. Network lasso: Clustering and optimization in large graphs. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 387–396. ACM, 2015.
[68] David K. Hammond, Pierre Vandergheynst, and Rémi Gribonval. Wavelets on graphs via spectral graph theory. Applied and Computational Harmonic Analysis, 30(2):129–150, 2011.
[69] Sepideh Hassan-Moghaddam, Neil K Dhingra, and Mihailo R Jovanović. Topology identification of undirected consensus networks via sparse inverse covariance estimation. In Decision and Control (CDC), 2016 IEEE 55th Conference on, pages 4624–4629. IEEE, 2016.
[70] Cho-Jui Hsieh, Arindam Banerjee, Inderjit S Dhillon, and Pradeep K Ravikumar. A divide-and-conquer method for sparse inverse covariance estimation. In Advances in Neural Information Processing Systems, pages 2330–2338, 2012.
[71] Cho-Jui Hsieh, Mátyás A Sustik, Inderjit S Dhillon, and Pradeep Ravikumar. QUIC: Quadratic approximation for sparse inverse covariance estimation. Journal of Machine Learning Research, 15(1):2911–2947, 2014.
[72] Chenhui Hu, Lin Cheng, Jorge Sepulcre, Georges El Fakhri, Yue M Lu, and Quanzheng Li. A graph theoretical regression model for brain connectivity learning of Alzheimer's disease. In Biomedical Imaging (ISBI), 2013 IEEE 10th International Symposium on, pages 616–619. IEEE, 2013.
[73] Xin J Hunt, Saba Emrani, Ilknur Kaynar Kabul, and Jorge Silva. Multi-task learning with incomplete data for healthcare. In KDD Workshop on Machine Learning for Medicine and Healthcare, 2018.
[74] Elvin Isufi, Andreas Loukas, Andrea Simonetto, and Geert Leus. Autoregressive moving average graph filtering. IEEE Transactions on Signal Processing, 65(2):274–288, 2017.
[75] Anatoli Juditsky and Arkadii S Nemirovski. Large deviations of vector-valued martingales in 2-smooth normed spaces. arXiv preprint arXiv:0809.0813, 2008.
[76] Kamil Jurczak and Angelika Rohde.
Spectral analysis of high-dimensional sample covariance matrices with missing observations. Bernoulli, 23(4A):2466–2532, 2017.
[77] Vassilis Kalofolias. How to learn a graph from smooth signals. In International Conference on Artificial Intelligence and Statistics (AISTATS), Cadiz, Spain, 2016.
[78] Vassilis Kalofolias and Nathanaël Perraudin. Large scale graph learning from smooth signals. In International Conference on Learning Representations, 2019.
[79] Jon Kleinberg and Eva Tardos. Algorithm Design. Pearson Education India, 2006.
[80] Mladen Kolar and Eric P Xing. Consistent covariance selection from data with missing values. In Proceedings of the 29th International Conference on Machine Learning (ICML), pages 551–558, 2012.
[81] Vladimir Koltchinskii and Karim Lounici. Concentration inequalities and moment bounds for sample covariance operators. Bernoulli, 23(1):110–133, 2017.
[82] Madeleine S Kotzagiannidis and Pier Luigi Dragotti. Splines and wavelets on circulant graphs. Applied and Computational Harmonic Analysis, 2017.
[83] Sven Kurras. Variants of the Graph Laplacian with Applications in Machine Learning. PhD thesis, Universität Hamburg, 2017.
[84] Sven Kurras, Ulrike Luxburg, and Gilles Blanchard. The f-adjusted graph Laplacian: a diagonal modification with a geometric interpretation. In International Conference on Machine Learning, pages 1530–1538, 2014.
[85] Nan M Laird. Missing data in longitudinal studies. Statistics in Medicine, 7(1-2):305–315, 1988.
[86] Brenden Lake and Joshua Tenenbaum. Discovering structure by learning sparse graphs. In Proceedings of the Annual Meeting of the Cognitive Science Society, 32(32), 2010.
[87] Steffen Lauritzen, Caroline Uhler, and Piotr Zwiernik. Maximum likelihood estimation in Gaussian models under total positivity. arXiv preprint arXiv:1702.04031, 2017.
[88] Steffen L Lauritzen. Graphical Models, volume 17. Clarendon Press, 1996.
[89] Weihang Liao, Gene Cheung, Shogo Muramatsu, Hiroyasu Yasuda, and Kiyoshi Hayasaka. Graph learning & fast transform coding of 3D river data. In 2018 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), pages 1313–1317. IEEE, 2018.
[90] Chee-Peng Lim, Jenn-Hwai Leong, and Mei-Ming Kuan. A hybrid neural network system for pattern classification tasks with missing features. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(4):648–653, 2005.
[91] Roderick JA Little and Donald B Rubin. Statistical Analysis with Missing Data, volume 333. John Wiley & Sons, 2014.
[92] Xianming Liu, Deming Zhai, Debin Zhao, Guangtao Zhai, and Wen Gao. Progressive image denoising through hybrid graph Laplacian regularization: a unified framework. IEEE Transactions on Image Processing, 23(4):1491–1503, 2014.
[93] Po-Ling Loh and Martin J Wainwright. High dimensional regression with noisy and missing data: provable guarantees with non-convexity. The Annals of Statistics, 40(3):1637–1664, 2012.
[94] Po-Ling Loh and Martin J Wainwright. Structure estimation for discrete graphical models: Generalized covariance matrices and their inverses. In NIPS, pages 2096–2104, 2012.
[95] Karim Lounici. Sparse principal component analysis with missing observations. In High Dimensional Probability VI, pages 327–356. Springer, 2013.
[96] Karim Lounici. High-dimensional covariance matrix estimation with missing observations. Bernoulli, 20(3):1029–1058, 2014.
[97] Keng-Shih Lu, Eduardo Pavez, and Antonio Ortega. On learning Laplacians of tree structured graphs. In 2018 IEEE Data Science Workshop (DSW), pages 205–209. IEEE, 2018.
[98] Hermina Petric Maretic, Dorina Thanou, and Pascal Frossard. Graph learning under sparsity priors. In Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on, pages 6523–6527. IEEE, 2017.
[99] Benjamin Marlin. Missing data problems in machine learning.
PhD thesis, University of Toronto, 2008.
[100] Antonio G Marques, Santiago Segarra, Geert Leus, and Alejandro Ribeiro. Sampling of graph signals with successive local aggregations. IEEE Transactions on Signal Processing, 64(7):1832–1843, 2016.
[101] Gonzalo Mateos, Santiago Segarra, Antonio G Marques, and Alejandro Ribeiro. Connecting the dots: Identifying network structure via graph signal processing. IEEE Signal Processing Magazine, 36(3):16–43, 2019.
[102] Rahul Mazumder and Trevor Hastie. Exact covariance thresholding into connected components for large-scale graphical lasso. The Journal of Machine Learning Research, 13(1):781–794, 2012.
[103] Rahul Mazumder and Trevor Hastie. The graphical lasso: New insights and alternatives. Electronic Journal of Statistics, 6:2125, 2012.
[104] Nicolai Meinshausen and Peter Bühlmann. High-dimensional graphs and variable selection with the lasso. The Annals of Statistics, pages 1436–1462, 2006.
[105] Prem Melville, Foster Provost, Maytal Saar-Tsechansky, and Raymond Mooney. Economical active feature-value acquisition through expected utility estimation. In Proceedings of the 1st International Workshop on Utility-based Data Mining, pages 10–16. ACM, 2005.
[106] Carl D Meyer. Matrix Analysis and Applied Linear Algebra, volume 2. SIAM, 2000.
[107] P. Milanfar. A tour of modern image filtering: New insights and methods, both practical and theoretical. Signal Processing Magazine, IEEE, 30(1):106–128, January 2013.
[108] Karthika Mohan, Judea Pearl, and Jin Tian. Graphical models for inference with missing data. In Advances in Neural Information Processing Systems, pages 1277–1285, 2013.
[109] Mohammad Naghshvar and Tara Javidi. Active sequential hypothesis testing. The Annals of Statistics, 41(6):2703–2738, 2013.
[110] S.K. Narang and A Ortega. Perfect reconstruction two-channel wavelet filter banks for graph structured data. Signal Processing, IEEE Transactions on, 60(6):2786–2799, June 2012.
[111] Sunil K Narang, Akshay Gadde, and Antonio Ortega. Signal processing techniques for interpolation in graph structured data. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 5445–5449. IEEE, 2013.
[112] Andrew Y. Ng, Michael I. Jordan, and Yair Weiss. On spectral clustering: Analysis and an algorithm. Advances in Neural Information Processing Systems, 2:849–856, 2002.
[113] Ha Q Nguyen and Minh N Do. Downsampling of signals on graphs via maximum spanning trees. IEEE Transactions on Signal Processing, 63(1):182–191, 2015.
[114] Feiping Nie, Xiaoqian Wang, and Heng Huang. Learning a structured optimal bipartite graph for co-clustering. In Advances in Neural Information Processing Systems, pages 4130–4139, 2017.
[115] Reza Olfati-Saber, J Alex Fax, and Richard M Murray. Consensus and cooperation in networked multi-agent systems. Proceedings of the IEEE, 95(1):215–233, 2007.
[116] Antonio Ortega, Pascal Frossard, Jelena Kovačević, José MF Moura, and Pierre Vandergheynst. Graph signal processing: Overview, challenges, and applications. Proceedings of the IEEE, 106(5):808–828, 2018.
[117] Jiahao Pang and Gene Cheung. Graph Laplacian regularization for image denoising: analysis in the continuous domain. IEEE Transactions on Image Processing, 26(4):1770–1785, 2017.
[118] Bastien Pasdeloup, Vincent Gripon, Grégoire Mercier, Dominique Pastor, and Michael G Rabbat. Characterization and inference of graph diffusion processes from observations of stationary signals. IEEE Transactions on Signal and Information Processing over Networks, 2017.
[119] Eduardo Pavez, Philip A Chou, Ricardo L De Queiroz, and Antonio Ortega. Dynamic polygon clouds: representation and compression for VR/AR. APSIPA Transactions on Signal and Information Processing, 7, 2018.
[120] Eduardo Pavez, Hilmi E Egilmez, and Antonio Ortega. Learning graphs with monotone topology properties and multiple connected components.
IEEE Transactions on Signal Processing, 66(9):2399–2413, 2018.
[121] Eduardo Pavez, Hilmi E. Egilmez, Yongzhe Wang, and Antonio Ortega. GTT: Graph template transforms with applications to image coding. In Picture Coding Symposium (PCS), 2015, pages 199–203, May 2015.
[122] Eduardo Pavez and Antonio Ortega. Generalized Laplacian precision matrix estimation for graph signal processing. In Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on, pages 6350–6354. IEEE, 2016.
[123] Eduardo Pavez and Antonio Ortega. Active covariance estimation by random sub-sampling of variables. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4034–4038, April 2018.
[124] Eduardo Pavez, Antonio Ortega, and Debargha Mukherjee. Learning separable transforms by inverse covariance estimation. In 2017 IEEE International Conference on Image Processing (ICIP), pages 285–289. IEEE, 2017.
[125] Isaac Pesenson. Sampling in Paley-Wiener spaces on combinatorial graphs. Transactions of the American Mathematical Society, 360(10):5603–5627, 2008.
[126] George Poole and Thomas Boullion. A survey on M-matrices. SIAM Review, 16(4):419–427, 1974.
[127] Farhad Pourkamali-Anaraki. Estimation of the sample covariance matrix from compressive measurements. IET Signal Processing, 10(9):1089–1095, 2016.
[128] Farhad Pourkamali-Anaraki and Stephen Becker. Preconditioned data sparsification for big data with applications to PCA and k-means. IEEE Transactions on Information Theory, 63(5):2954–2974, 2017.
[129] Stanislav Pyatykh, Jürgen Hesser, and Lei Zheng. Image noise level estimation by principal component analysis. IEEE Transactions on Image Processing, 22(2):687–699, 2013.
[130] Michael G Rabbat. Inferring sparse graphs from smooth signals with theoretical guarantees. In Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on, pages 6533–6537. IEEE, 2017.
[131] Pradeep Ravikumar, Martin J Wainwright, Garvesh Raskutti, and Bin Yu. High-dimensional covariance estimation by minimizing l1-penalized log-determinant divergence. Electronic Journal of Statistics, 5:935–980, 2011.
[132] Mira Rizkallah, Xin Su, Thomas Maugey, and Christine Guillemot. Graph-based transforms for predictive light field compression based on super-pixels. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1718–1722. IEEE, 2018.
[133] Adam J Rothman, Peter J Bickel, Elizaveta Levina, and Ji Zhu. Sparse permutation invariant covariance estimation. Electronic Journal of Statistics, 2:494–515, 2008.
[134] Håvard Rue and Leonhard Held. Gaussian Markov Random Fields: Theory and Applications. CRC Press, 2005.
[135] Maytal Saar-Tsechansky, Prem Melville, and Foster Provost. Active feature-value acquisition. Management Science, 55(4):664–684, 2009.
[136] A Sandryhaila and J.M.F. Moura. Discrete signal processing on graphs. Signal Processing, IEEE Transactions on, 61(7):1644–1656, April 2013.
[137] A Sandryhaila and J.M.F. Moura. Discrete signal processing on graphs: Frequency analysis. Signal Processing, IEEE Transactions on, 62(12):3042–3054, June 2014.
[138] Aliaksei Sandryhaila and Jose M.F. Moura. Big data analysis with signal processing on graphs: Representation and processing of massive data sets with irregular structure. Signal Processing Magazine, IEEE, 31(5):80–90, 2014.
[139] Jonathan Scarlett and Volkan Cevher. Lower bounds on active learning for graphical model selection. In The 20th International Conference on Artificial Intelligence and Statistics (AISTATS), 2017.
[140] Juliane Schäfer and Korbinian Strimmer. A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics. Statistical Applications in Genetics and Molecular Biology, 4(1), 2005.
[141] Santiago Segarra, Antonio G Marques, Gonzalo Mateos, and Alejandro Ribeiro.
Network topology inference from spectral templates. IEEE Transactions on Signal and Information Processing over Networks, 3(3):467–483, 2017.
[142] Santiago Segarra, Antonio G Marques, and Alejandro Ribeiro. Optimal graph-filter design and applications to distributed linear network operators. IEEE Transactions on Signal Processing, 65(15):4117–4131, 2017.
[143] Shahin Shahrampour and Victor M Preciado. Topology identification of directed dynamical networks via power spectral analysis. IEEE Transactions on Automatic Control, 60(8):2260–2265, 2015.
[144] Godwin Shen and Antonio Ortega. Transform-based distributed data gathering. IEEE Transactions on Signal Processing, 58(7):3802–3815, 2010.
[145] David I Shuman, Pierre Vandergheynst, Daniel Kressner, and Pascal Frossard. Distributed signal processing via Chebyshev polynomial approximation. IEEE Transactions on Signal and Information Processing over Networks, 4(4):736–751, 2018.
[146] D.I Shuman, S.K. Narang, P. Frossard, A Ortega, and P. Vandergheynst. The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains. Signal Processing Magazine, IEEE, 30(3):83–98, May 2013.
[147] Martin Slawski and Matthias Hein. Estimation of positive definite M-matrices and structure learning for attractive Gaussian Markov random fields. Linear Algebra and its Applications, 473:145–179, 2015.
[148] Somayeh Sojoudi. Equivalence of graphical lasso and thresholding for sparse graphs. Journal of Machine Learning Research, 17(115):1–21, 2016.
[149] Somayeh Sojoudi. Graphical lasso and thresholding: Conditions for equivalence. In Decision and Control (CDC), 2016 IEEE 55th Conference on, pages 7042–7048. IEEE, 2016.
[150] Nicolas Städler and Peter Bühlmann. Missing values: sparse inverse covariance estimation and an extension to sparse regression. Statistics and Computing, 22(1):219–235, 2012.
[151] Petre Stoica, Prabhu Babu, and Jian Li.
SPICE: A sparse covariance-based estimation method for array processing. IEEE Transactions on Signal Processing, 59(2):629–638, 2011.
[152] Gilbert Strang. The discrete cosine transform. SIAM Review, 41(1):135–147, 1999.
[153] Martin Sundin, Arun Venkitaraman, Magnus Jansson, and Saikat Chatterjee. A connectedness constraint for learning sparse graphs. In Signal Processing Conference (EUSIPCO), 2017 25th European, pages 151–155. IEEE, 2017.
[154] Vincent YF Tan, Animashree Anandkumar, and Alan S Willsky. Learning Gaussian tree models: Analysis of error exponents and extremal structures. IEEE Transactions on Signal Processing, 58(5):2701–2714, 2010.
[155] David BH Tay and Antonio Ortega. Bipartite graph filter banks: Polyphase analysis and generalization. IEEE Transactions on Signal Processing, 65(18):4833–4846, 2017.
[156] Oguzhan Teke and PP Vaidyanathan. Extending classical multirate signal processing theory to graphs part I: Fundamentals. IEEE Transactions on Signal Processing, 65(2):409–422, 2017.
[157] Oguzhan Teke and PP Vaidyanathan. Extending classical multirate signal processing theory to graphs part II: M-channel filter banks. IEEE Transactions on Signal Processing, 65(2):423–437, 2017.
[158] Nicolas Tremblay and Pierre Borgnat. Subgraph-based filterbanks for graph signals. IEEE Transactions on Signal Processing, 64(15):3827–3840, 2016.
[159] Jonathan Tuck, David Hallac, and Stephen Boyd. Distributed majorization-minimization for Laplacian regularized problems. IEEE/CAA Journal of Automatica Sinica, 6(1):45–52, 2019.
[160] Caroline Uhler. Geometry of maximum likelihood estimation in Gaussian graphical models. The Annals of Statistics, 40(1):238–261, 2012.
[161] Sriharsha Veeramachaneni, Francesca Demichelis, Emanuele Olivetti, and Paolo Avesani. Active sampling for knowledge discovery from biomedical data. In European Conference on Principles of Data Mining and Knowledge Discovery, pages 343–354. Springer, 2005.
[162] Roman Vershynin.
Introduction to the non-asymptotic analysis of random matrices, pages 210{268. Cambridge University Press, 2012. [163] Roman Vershynin. High Dimensional Probability: An Introduction with Applications in Data Science. Cambridge University Press, 2018. [164] Irene Viola, Hermina Petric Maretic, Pascal Frossard, and Touradj Ebrahimi. A graph learning approach for light eld image compression. In Applications of Digital Image Pro- cessing XLI, volume 10752, page 107520E. International Society for Optics and Photonics, 2018. [165] Ulrike Von Luxburg. A tutorial on spectral clustering. Statistics and computing, 17(4):395{ 416, 2007. [166] Martin J Wainwright and Michael I Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends R in Machine Learning, 1(1{2):1{305, 2008. [167] Daniela M. Witten, Jerome H. Friedman, and Noah Simon. New insights and faster computations for the graphical lasso. Journal of Computational and Graphical Statistics, 20(4):892{900, 2011. [168] Congyuan Yang, Daniel Robinson, and Rene Vidal. Sparse subspace clustering with missing entries. In International Conference on Machine Learning, pages 2463{2472, 2015. [169] Sen Yang, Qian Sun, Shuiwang Ji, Peter Wonka, Ian Davidson, and Jieping Ye. Structural graphical lasso for learning mouse brain connectivity. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1385{ 1394. ACM, 2015. [170] Ming Yuan. High dimensional inverse covariance matrix estimation via linear programming. Journal of Machine Learning Research, 11(Aug):2261{2286, 2010. [171] Ming Yuan and Yi Lin. Model selection and estimation in the gaussian graphical model. Biometrika, 94(1):19{35, 2007. 105 [172] Yi Zhang and Je G Schneider. Learning multiple tasks with a sparse matrix-normal penalty. In Advances in Neural Information Processing Systems, pages 2550{2558, 2010. [173] Zhiqiang Zheng and Balaji Padmanabhan. 
On active learning for data acquisition. In Data Mining, 2002. ICDM 2003. Proceedings. 2002 IEEE International Conference on, pages 562{569. IEEE, 2002. 106
Asset Metadata
Creator: Pavez Carvelli, Eduardo Hernán (author)
Core Title: Estimation of graph Laplacian and covariance matrices
School: Viterbi School of Engineering
Degree: Doctor of Philosophy
Degree Program: Electrical Engineering
Publication Date: 12/05/2019
Defense Date: 04/23/2019
Publisher: University of Southern California (original), University of Southern California. Libraries (digital)
Tags: covariance estimation, graph Laplacian, graph signal processing, graph theory, missing data, OAI-PMH Harvest
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: Ortega, Antonio (committee chair), Govindan, Ramesh (committee member), Jovanovic, Mihailo (committee member)
Creator Email: eduardo.pavez.carvelli@gmail.com, pavezcar@usc.edu
Permanent Link (DOI): https://doi.org/10.25549/usctheses-c89-244332
Unique Identifier: UC11673158
Identifier: etd-PavezCarve-8002.pdf (filename), usctheses-c89-244332 (legacy record id)
Legacy Identifier: etd-PavezCarve-8002.pdf
Dmrecord: 244332
Document Type: Dissertation
Rights: Pavez Carvelli, Eduardo Hernán
Type: texts
Source: University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA