GRAPH-BASED MODELS AND TRANSFORMS FOR SIGNAL/DATA PROCESSING WITH APPLICATIONS TO VIDEO CODING A Dissertation Presented to the FACULTY OF THE USC GRADUATE SCHOOL UNIVERSITY OF SOUTHERN CALIFORNIA In Partial Fulllment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (ELECTRICAL ENGINEERING) Hilmi Enes E gilmez December 2017 c Copyright by Hilmi Enes E gilmez 2017 All Rights Reserved ii To my family... Aileme... iii Acknowledgments I would like to express my sincere gratitude towards my advisor, Professor Antonio Ortega, for his constant guidance, support and understanding during the course of my studies at the University of Southern California (USC). It has been a pleasure and privilege to work with Professor Ortega who helped me grow as an independent researcher. This thesis would not be possible without his help and support. Also, I would like to thank Professor C.-C. Jay Kuo and Professor Jinchi Lv for serving on my dissertation committee, as well as Professor Salman Avestimehr and Professor Mahdi Soltanolkotabi for being in my qualifying exam committee. Their comments and suggestions have denitely helped improve this thesis. I thank all my friends in the Signal Transformation, Analysis and Compression Group, espe- cially, Sungwon Lee, Amin Rezapour, Akshay Gadde, Aamir Anis, Eduardo Pavez, Yung-Hsuan (Jessie) Chao, Jiun-Yu (Joanne) Kao and Yongzhe Wang, for making our research environment en- joyable through both technical and non-technical discussions. Special thanks goes to Eduardo Pavez, Yung-Hsuan (Jessie) Chao and Yongzhe Wang, with whom I collaborated on dierent projects. I acknowledge their inputs and contributions which shaped parts of this thesis. Moreover, I would like to express my thanks to Dr. Amir Said, Dr. Onur G. G ulery uz and Dr. Marta Karczewicz for mentoring me during my several summer internships where we had detailed discussions on video coding. I also thank to Professor A. Murat Tekalp for advising my M.Sc. studies at Koc University prior to joining USC as a Ph.D. student. Furthermore, I am thankful to my brother M. Bedi E gilmez and my friends, C. Umit Ba s, Enes Ozel, C. Ozan O guz, Ezgi Kantarc, Utku Baran, Ahsan Javed, Vanessa Landes and Daoud Burghal for making \life in LA" even more fun. Last but not least, I am deeply grateful to my family for their love, support and encouragement. Particularly, I thank to my wife _ Ilknur whose love, sacrice and patience motivated me throughout my Ph.D. studies. Also, none of this would be possible without my parents, Reyhan and Hulusi E gilmez, to whom I cannot thank enough. iv Contents Acknowledgments iv 1 Introduction 1 1.1 Notation, Graphs and Graph Laplacian Matrices . . . . . . . . . . . . . . . . . . . . 3 1.1.1 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.1.2 Graphs and Their Algebraic Representations . . . . . . . . . . . . . . . . . . 4 1.2 Applications of Graph Laplacian Matrices . . . . . . . . . . . . . . . . . . . . . . . . 5 1.3 Signal/Data Processing on Graphs: Graph-based Transforms and Filters . . . . . . . 6 1.4 Contributions of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 1.4.1 Graph-based Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 1.4.2 Graph-based Transforms for Video Coding . . . . . . . . . . . . . . . . . . . 10 2 Graph Learning from Data 12 2.1 Related Work and Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 2.1.1 Sparse Inverse Covariance Estimation . . . . 
. . . . . . . . . . . . . . . . . . 14 2.1.2 Graph Laplacian Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 2.1.3 Graph Topology Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 2.1.4 Summary of Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 2.2 Problem Formulations for Graph Learning . . . . . . . . . . . . . . . . . . . . . . . . 16 2.2.1 Proposed Formulations: Graph Laplacian Estimation . . . . . . . . . . . . . . 16 2.2.2 Related Prior Formulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 2.3 Probabilistic Interpretation of Proposed Graph Learning Problems . . . . . . . . . . 20 2.4 Proposed Graph Learning Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . 23 2.4.1 Matrix Update Formulas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 2.4.2 Generalized Laplacian Estimation . . . . . . . . . . . . . . . . . . . . . . . . 24 2.4.3 Combinatorial Laplacian Estimation . . . . . . . . . . . . . . . . . . . . . . . 28 2.4.4 Convergence and Complexity Analysis of Algorithms . . . . . . . . . . . . . . 31 2.5 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 v 2.5.1 Comparison of Graph Learning Methods . . . . . . . . . . . . . . . . . . . . . 33 2.5.2 Empirical Results for Computational Complexity . . . . . . . . . . . . . . . . 37 2.5.3 Illustrative Results on Real Data . . . . . . . . . . . . . . . . . . . . . . . . . 38 2.5.4 Graph Laplacian Estimation under Model Mismatch . . . . . . . . . . . . . . 39 2.5.5 Graph Laplacian Estimation under Connectivity Mismatch . . . . . . . . . . 39 2.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 3 Graph Learning from Filtered Signals 43 3.1 Related Work and Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 3.2 Problem Formulation: Graph System Identication . . . . . . . . . . . . . . . . . . . 47 3.2.1 Filtered Signal Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 3.2.2 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 3.2.3 Special Cases of Graph System Identication Problem and Applications of Graph-based Filters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 3.3 Proposed Solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 3.3.1 Optimal Preltering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 3.3.2 Optimal Graph Laplacian Estimation . . . . . . . . . . . . . . . . . . . . . . 55 3.3.3 Filter Parameter Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 3.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 3.4.1 Graph Learning from Diusion Signals/Data . . . . . . . . . . . . . . . . . . 57 3.4.2 Graph Learning from Variance/Frequency Shifted Signals . . . . . . . . . . . 58 3.4.3 Illustrative Results on Temperature Data . . . . . . . . . . . . . . . . . . . . 61 3.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 4 Graph Learning from Multiple Graphs 64 4.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 4.2 Problem Formulation for Multigraph Combining . . . . . . . . . . . . . . . . . . . . 65 4.2.1 Statistical Formulation and Optimality Conditions . . . . . . . . . . . . . . . 67 4.3 Proposed Solution . . . . . . . . . . . . 
. . . . . . . . . . . . . . . . . . . . . . . . . 71 4.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 4.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 5 Graph-based Transforms for Video Coding 77 5.1 Models for Video Block Signals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 5.1.1 1-D GMRF Models for Residual Signals . . . . . . . . . . . . . . . . . . . . . 82 5.1.2 2-D GMRF Model for Residual Signals . . . . . . . . . . . . . . . . . . . . . . 88 5.1.3 Interpretation of Graph Weights for Predictive Coding . . . . . . . . . . . . . 89 5.2 Graph Learning for Graph-based Transform Design . . . . . . . . . . . . . . . . . . 91 vi 5.2.1 Graph Learning: Generalized Graph Laplacian Estimation . . . . . . . . . . . 91 5.2.2 Graph-based Transform Design . . . . . . . . . . . . . . . . . . . . . . . . . . 91 5.2.3 Optimality of Graph-based Transforms . . . . . . . . . . . . . . . . . . . . . . 92 5.3 Edge-Adaptive Graph-based Transforms . . . . . . . . . . . . . . . . . . . . . . . . . 93 5.3.1 EA-GBT Construction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94 5.3.2 Theoretical Justication for EA-GBTs . . . . . . . . . . . . . . . . . . . . . . 96 5.4 Residual Block Signal Characteristics and Graph-based Models . . . . . . . . . . . . 100 5.5 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102 5.5.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102 5.5.2 Compression Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106 5.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107 6 Conclusions and Future Work 109 Bibliography 111 vii List of Tables 1.1 List of Symbols and Their Meaning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 2.1 Average relative errors for dierent graph connectivity models and number of vertices with xed k=n = 30 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 3.1 Graph-based lter (GBF) functions with parameter and corresponding inverse func- tions. Note that y denotes the pseudoinverse of , that is, y = 1= for 6= 0 and y = 0 for = 0. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 3.2 An overview of the related work on learning graphs from smooth/diused signals (NA: No assumptions, L: Localized, 1-1: one-to-one function). . . . . . . . . . . . . . . . . 46 3.3 Average Relative Errors for Variance Shifting GBF . . . . . . . . . . . . . . . . . . . 61 3.4 Average Relative Errors for Frequency Shifting GBF . . . . . . . . . . . . . . . . . . 61 4.1 Coding gain (cg) results for the line graphs in Figure 4.1 . . . . . . . . . . . . . . . . 74 4.2 Average variation (av) results for the line graphs in Figure 4.1 . . . . . . . . . . . . . 74 4.3 Coding gain (cg) results for the graphs in Figure 4.2 . . . . . . . . . . . . . . . . . . 74 4.4 Average variation (av) results for the graphs in Figure 4.2 . . . . . . . . . . . . . . . 74 5.1 DCTs/DSTs corresponding to e L with dierent vertex weights. . . . . . . . . . . . . . 88 5.2 Average absolute valued intensity dierences (m di ) between pixels connected by the edge with weight w e =w c =s edge for various sharpness parameters (s edge ). . . . . . . 96 5.3 Comparison of KLT, GBST and GBNT with MDT and RDOT schemes in terms of BD-rate (% bitrate reduction) with respect to the DCT. 
Smaller (negative) BD-rates mean better compression. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107 5.4 Comparison of KLT, GBST and GBNT for coding of dierent prediction modes in terms of BD-rate with respect to the DCT. Smaller (negative) BD-rates mean better compression. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108 viii 5.5 The coding gains achieved by using GL-GBT over KLT in terms of BD-rate and the number of training data samples used per number of vertices (k=n) to design transforms for dierent prediction modes. Negative BD-rates mean that GL-GBT outperforms KLT. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108 5.6 The contribution of EA-GBTs in terms of BD-rate with respect to DCT. . . . . . . 108 ix List of Figures 1.1 Illustration of GBTs derived from two dierent simple graphs with 9 vertices (i.e., 9 9 CGLs), where all edges are weighted as 1. The basis vectors associated with 1 , 2 , 8 and 9 are shown as graph signals, attached to vertices. Note that the notion of frequency (in terms of both GBTs and associated graph frequencies) changes depending on the underlying graph. For example, the additional edges in (f) lead to smaller signal dierences (pairwise variations) at most of the vertices that are originally disconnected in (b). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 1.2 Input-output relations of graph systems dened by h(L) = exp(L) for dierent , where L is the CGL associated with the shown graph whose all edges are weighted as 1. For given input graph signals (a) x 1 and (e) x 2 , the corresponding output signals are (b{d) y 1; = exp(L)x 1 and (f{h) y 2; = exp(L)x 2 for =f0:5; 1; 2g. Note that the output signals become more smooth as the parameter increases. . . . . . 8 2.1 The set of positive semidenite matrices (M psd ) containing the sets of diagonally dominant positive semidenite matrices (M d ), generalized (L g ), diagonally dominant (L d ) and combinatorial Laplacian (L c ) matrices. The corresponding classes of GMRFs are enumerated as (1){(5), respectively. In this work, we focus on estimating/learning the sets colored in gray. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.2 Examples of dierent graph connectivity models, (a) grid (b) Erdos-Renyi (c) modular graphs used in our experiments. All graphs have 64 vertices. . . . . . . . . . . . . . 33 2.3 Comparison of the (a) GGL and (b) CGL estimation methods: The algorithms are tested on grid (G (n) grid ) and random graphs (G (n,0.1) ER andG (n,0.1,0.3) M ). . . . . . . . . . . 35 2.4 Illustration of precision matrices (whose entries are color coded) estimated from S for k=n = 30: In this example, (a) the ground truth is a GGL associated with a grid graph having n=64 vertices. From S, the matrices are estimated by (b) inverting S, (c) GLasso() and (d) GGL(). The proposed method provides the best estimation in terms of RE and FS, and the resulting matrix is visually the most similar to the ground truth . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 x 2.5 Illustration of precision matrices (whose entries are color coded) estimated from S for k=n = 30: In this example, (a) the ground truth is a CGL associated with a grid graph havingn=36 vertices. From S, the matrices are estimated by (b) GLS( 1 ), (c) GTI and (d) CGL(). 
The proposed method provides the best estimation in terms of RE and FS, and the resulting matrix is visually the most similar to the ground truth . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 2.6 Computational speedup as compared to GLasso ( T GLasso = T ). . . . . . . . . . . . . . 38 2.7 The graphs learned from Animals dataset using methods GLasso(), GGL() and CGL(): GGL() leads to sparser graphs compared to the others. The results follow the intuition that larger positive weights should be assigned between similar animals, although the dataset is categorical (non-Gaussian). . . . . . . . . . . . . . . . . . . . 40 2.8 Illustration of precision matrices whose entries are color coded, where negative (resp. positive) entries correspond to positive (resp. negative) edge weights of a graph: (a) Ground truth precision matrices ( ) are randomly generated with (top) sparse and (bottom) dense positive entries (i.e., negative edge weights). From the corresponding true covariances (), the precision matrices are estimated by (b) GLasso( = 0) and (c) GGL( = 0) methods without ` 1 -regularization. . . . . . . . . . . . . . . . . . . . 41 2.9 Average relative errors for dierent number of data samples (k=n), where GGL(A) exploits the true connectivity information in GGL estimation, and GGL(A full ) refers to the GGL estimation without connectivity constraints. Dierent levels of connectivity mismatch are tested. For example, GGL(A 5% ) corresponds to the GGL estimation with connectivity constraints having a 5% mismatch. No ` 1 -regularization is applied in GGL estimation (i.e., = 0). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 3.1 Input-output relation of a graph system dened by a graph Laplacian (L) and a graph-based lter (h). In this work, we focus on joint identication of L and h. . . 45 3.2 Graph-based lters with dierent parameters as a function of graph frequency h (). 45 3.3 Average RE and FS results for graph estimation from signals modeled based on ex- ponential decay GBFs tested with =f0:5; 0:75g on 10 dierent grid, Erdos-Renyi and modular graphs (30 graphs in total). The proposed GSI outperforms all baseline methods in terms of both RE and FS. . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 3.4 Average RE and FS results for graph estimation from signals modeled based on-hop localized GBFs tested with =f2; 3g on 10 dierent grid, Erdos-Renyi and modular graphs (30 graphs in total). The proposed GSI outperforms all baseline methods in terms of both RE and FS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 xi 3.5 A sample illustration of graph estimation results (fork=n = 30) from signals modeled using the exponential decay GBF with =0:75 and L is derived from the grid graph in (a). The edge weights are color coded where darker colors indicate larger weights. The proposed GSI leads to the graph that is the most similar to the ground truth. . 60 3.6 A sample illustration of graph estimation results (fork=n = 30) from signals modeled using the -hop localized GBF with = 2 and L is derived from the grid graph in (a). The edge weights are color coded where darker colors indicate larger weights. The proposed GSI leads to the graph that is the most similar to the ground truth. . 60 3.7 Average air temperatures (in degree Celsius) for (a) 45th, (b) 135th, (c) 225th and (d) 315th days over 16 years (2000-2015). Black dots denote 45 states. . . . . . . . . 
61 3.8 Graphs learned from temperature data using the GSI method with exponential decay and -hop localized GBFs for xed parameters where no regularization is applied (i.e., H is set to the zero matrix). The edge weights are color coded so that darker colors represent larger weights (i.e., more similarities). The graphs associated with exponential decay GBF leads to sparser structures compared to the graphs corre- sponding to -hop localized GBFs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 4.1 Combining L = 5 line graphs with n = 16 vertices. Edge weights are color coded between 0 and 1. Each input graph have only one weak edge weight equal to 0:1 while all other edges are weighted as 0:95. . . . . . . . . . . . . . . . . . . . . . . . . 75 4.2 CombiningL = 4 mesh-like graphs withn = 5 vertices. Edge weights are color coded between 0 and 1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 5.1 Building blocks of a typical video encoder and decoder consisting of three main steps, which are (i) prediction, (ii) transformation, (iii) quantization and entropy coding. . 78 5.2 1-D GMRF models for (a) intra and (b) inter predicted signals. Black lled vertices represent the reference pixels and unlled vertices denote pixels to be predicted and then transform coded. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 5.3 2-D GMRF models for (a) intra and (b) inter predicted signals. Black lled vertices correspond to reference pixels obtained (a) from neighboring blocks and (b) from other frames via motion compensation. Unlled vertices denote the pixels to be predicted and then transform coded. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 5.4 Separable and nonseparable transforms: (Left) For an NN image/video block, separable transforms (e.g., GBSTs) are composed of two (possibly distinct) NN transforms, U row and U col , applied to rows and columns of the block. (Right) Non- separable transforms (e.g., GBNTs) apply anN 2 N 2 linear transformation using U. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82 xii 5.5 An illustration of graph construction for a given 8 8 residual block signal where w c = 1 and w e = 0:1, and the corresponding GBT. The basis patches adapt to the characteristics of the residual block. The order of basis patches is in row-major ordering. 94 5.6 Two random block samples obtained from three 2-D GMRF models having an image edge at a xed location with three dierent sharpness parameters (s edge =f10; 20; 100g). A larger s edge leads to a sharper transition across the image edge. . . . . . . . . . . . 97 5.7 Coding gain (cg) versus s edge for block sizes with N = 4; 8; 16; 32; 64. EA-GBT provides better coding gain (i.e., cg is negative) when s edge is larger than 10 across dierent block sizes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98 5.8 Coding gain (cg) versus bits per pixel (R=N) for dierent edge sharpness parameters s edge = 10; 20; 40; 100; 200. EA-GBT provides better coding gain (i.e., cg is negative) if s edge is larger than 10 for dierent block sizes. . . . . . . . . . . . . . . . . . . . . 98 5.9 Sample variances of residual signals corresponding to dierent prediction modes. Each square corresponds to a sample variance of pixel values. Darker colors represent larger sample variances. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
5.10 Edge weights (left) and vertex weights (right) learned from residual blocks obtained by intra prediction with planar mode.
5.11 Edge weights (left) and vertex weights (right) learned from residual blocks obtained by intra prediction with DC mode.
5.12 Edge weights (left) and vertex weights (right) learned from residual blocks obtained by intra prediction with horizontal mode.
5.13 Edge weights (left) and vertex weights (right) learned from residual blocks obtained by intra prediction with diagonal mode.
5.14 Edge weights (left) and vertex weights (right) learned from residual blocks obtained by inter prediction with PU mode 2N×2N.
5.15 Edge weights (left) and vertex weights (right) learned from residual blocks obtained by inter prediction with PU mode N×2N.

Abstract

Graphs are fundamental mathematical structures used in various fields to represent data, signals and processes. In signal processing, machine learning and statistics in particular, structured modeling of signals and data by means of graphs is essential for a broad range of applications. In this thesis, we develop novel techniques to build graph-based models and transforms for signal/data processing, where the models and transforms of interest are defined based on graph Laplacian matrices. For graph-based modeling, various graph Laplacian estimation problems are studied. First, we consider the estimation of three types of graph Laplacian matrices from data and develop efficient algorithms by incorporating the associated Laplacian and structural constraints. Then, we propose a graph signal processing framework to learn graph-based models from classes of filtered signals, defined based on functions of graph Laplacians. The proposed approach can also be applied to learn diffusion (heat) kernels, which are popular in various fields for modeling diffusion processes. Additionally, we study the problem of multigraph combining, that is, estimating a single optimized graph from multiple graphs, and develop an algorithm for it. Finally, we propose graph-based transforms for video coding and develop two different techniques, based on graph learning and image edge adaptation, to design orthogonal transforms that capture the statistical characteristics of video signals. Theoretical justifications and comprehensive experimental results for the proposed methods are presented.

Chapter 1

Introduction

Graphs are generic mathematical structures consisting of sets of vertices and edges, which are used for modeling pairwise relations (edges) between a number of objects (vertices). They provide a natural abstraction by representing the objects of interest as vertices and their pairwise relations as edges. In practice, this representation is often extended to weighted graphs, for which a set of scalar values (weights) are assigned to edges and potentially to vertices. Thus, weighted graphs offer general and flexible representations for modeling affinity relations between the objects of interest. Many practical problems can be represented using weighted graphs. For example, a broad class of combinatorial problems such as weighted matching, shortest-path and network-flow [1] are defined using weighted graphs.
In signal/data-oriented problems, weighted graphs provide concise (sparse) representations for robust modeling of signals/data [2]. Such graph-based models are also useful for analyzing and visualizing the relations between their samples/features. Moreover, weighted graphs naturally emerge in networked data applications, such as learning, signal processing and analysis on computer, social, sensor, energy, transportation and biological networks [3], where the signals/data are inherently related to a graph associated with the underlying network. Similarly, image and video signals can be modeled using weighted graphs whose weights capture the correlation or similarity between neighboring pixel values (such as in nearest-neighbor models) [4, 5, 6, 7, 8, 9]. Furthermore, in graph signal processing [3], weighted graphs provide useful spectral representations for signals/data, referred to as graph signals, where graph Laplacian matrices are used to define basic operations such as transformation [8, 10], filtering [9, 11] and sampling [12] for graph signals. Depending on the application, graph signals can further be considered as smooth with respect to functions of a graph Laplacian, which define graph-based (smoothing) filters for modeling processes such as diffusion. For example, in a social network, a localized signal (information) can diffuse on the network (i.e., on the vertices of the graph), where the smoothness of the signal increases as it spreads over time. In a wireless sensor network, sensor measurements (such as temperature) can be considered as smooth signals on a graph connecting communicating sensors, since sensors generally communicate with their close neighbors, where the measurements are similar (i.e., spatially smooth).

However, in practice, datasets typically consist of an unstructured list of samples, where the associated graph information (representing the structural relations between samples/features) is latent. For analysis, learning, processing and algorithmic purposes, it is often useful to build graph-based models that provide a concise/simple explanation for datasets and also reduce the dimension of the problem [13, 14] by exploiting the available prior knowledge and assumptions about the desired graph (e.g., structural information including connectivity and sparsity level) and the observed data (e.g., signal smoothness). The main focus of this thesis is on graph-based modeling, where the models of interest are defined based on graph Laplacian matrices (i.e., weighted graphs with nonnegative edge weights), so that the basic goal is to optimize a graph Laplacian matrix from data. For this purpose, various graph learning problems are formulated to estimate graph Laplacian matrices under different model assumptions on graphs and data. The graph learning problems studied in this thesis can be categorized as follows:

1. Graph learning from data via structured graph Laplacian estimation: An optimization framework is proposed to estimate different types of graph Laplacian matrices from a data statistic (e.g., covariance and symmetric kernel matrices), which is empirically obtained from the samples in a given dataset. The problems of interest are formulated as finding the closest graph Laplacian fit to the inverse covariance of observed signal/data samples in a maximum a posteriori (MAP) sense. Additional prior constraints on the graph structure (e.g., graph connectivity and sparsity level) are also incorporated into the problems.
The proposed framework constitutes a fundamental part of this work, since the optimization problems and applications considered throughout the thesis involve estimation of graph Laplacian matrices.

2. Graph learning from filtered signals: In this class of problems, the data is modeled based on a graph system, defined by a graph Laplacian and a graph-based filter (i.e., a matrix function of a graph Laplacian), where the observed set of signals is assumed to consist of outputs of a graph system with a specific type of graph-based filter. In order to learn graphs from empirical covariances of filtered signals, an optimization problem is formulated for joint identification of a graph Laplacian and a graph-based filter. The proposed problem is motivated by graph signal processing applications [3] and diffusion-based models [15], where the signals/data are intrinsically smooth with respect to an unknown graph.

3. Graph learning from multiple graphs: An optimization problem called multigraph combining is formulated to learn a single graph from a dataset consisting of multiple graph Laplacian matrices. This problem is motivated by signal processing and machine learning applications working with clusters of signals/data, where each cluster can be modeled by a different graph and an optimized ensemble model is desired to characterize the overall relations between the objects of interest. Especially when a signal/data model is uncertain (or unknown), combining multiple candidate models allows us to design a model that is robust to model uncertainties. Besides, multigraph combining can be used to summarize a dataset consisting of multiple graphs into a single graph, which is favorable for graph visualization in data analytics.

In order to solve these classes of graph learning problems, specialized algorithms are developed. The proposed methods allow us to capture the statistics of signals/data by means of graphs (i.e., graph Laplacians), which are useful in a broad range of signal/data-oriented applications, discussed in Sections 1.2 and 1.3. Among different possible applications, this thesis primarily focuses on video coding, where the goal is to design graph-based transforms (GBTs) that adapt to the characteristics of video signals in order to improve coding efficiency. Two distinct graph-based modeling techniques are developed to construct GBTs derived from graph Laplacian matrices:

1. Instances of a structured graph Laplacian estimation problem are solved to learn graphs based on empirical covariance matrices obtained from different classes of video block samples. The optimized graphs are used to derive GBTs that effectively capture the statistical characteristics of video signals.

2. Graph Laplacian matrices are constructed on a per-block basis by first detecting image edges (i.e., discontinuities) for each video block and then modifying the weights of a predetermined graph based on the detected image edges. (We use the term image edge to distinguish edges in image/video signals from edges in graphs.) Thus, the resulting graph represents a class of block signals with a specific image edge structure, from which an edge-adaptive GBT is derived.

The main contributions of the thesis on graph-based modeling and video coding are summarized in Section 1.4.

1.1 Notation, Graphs and Graph Laplacian Matrices

In this section, we present the notation and basic definitions related to graphs and graph Laplacian matrices used throughout the thesis.
1.1.1 Notation

Throughout the thesis, lowercase normal (e.g., a and θ), lowercase bold (e.g., a and θ) and uppercase bold (e.g., A and Θ) letters denote scalars, vectors and matrices, respectively. Unless otherwise stated, calligraphic capital letters (e.g., E and S) represent sets. O(·) and Ω(·) are the standard big-O and big-Omega notations used in complexity theory [1]. The rest of the notation is presented in Table 1.1.

1.1.2 Graphs and Their Algebraic Representations

In this thesis, the models of interest are defined based on undirected, weighted graphs with nonnegative edge weights, which are formally defined as follows.

Definition 1 (Weighted Graph). The graph G = (V, E, f_w, f_v) is a weighted graph with n vertices in the set V = {v_1, ..., v_n}. The edge set E = {e | f_w(e) ≠ 0, ∀e ∈ P_u} is a subset of P_u, the set of all unordered pairs of distinct vertices, where f_w((v_i, v_j)) ≥ 0 for i ≠ j is a real-valued edge weight function, and f_v(v_i) for i = 1, ..., n is a real-valued vertex (self-loop) weight function.

Definition 2 (Simple Weighted Graph). A simple weighted graph is a weighted graph with no self-loops (i.e., f_v(v_i) = 0 for i = 1, ..., n).

Weighted graphs can be represented by adjacency, degree and self-loop matrices, which are used to define graph Laplacian matrices. Moreover, we use connectivity matrices to incorporate structural constraints in our formulations. In the following, we present formal definitions for these matrices.

Definition 3 (Algebraic representations of graphs). For a given weighted graph G = (V, E, f_w, f_v) with n vertices, v_1, ..., v_n:
- The adjacency matrix of G is an n×n symmetric matrix, W, such that (W)_{ij} = (W)_{ji} = f_w((v_i, v_j)) for (v_i, v_j) ∈ P_u.
- The degree matrix of G is an n×n diagonal matrix, D, with entries (D)_{ii} = Σ_{j=1}^n (W)_{ij} and (D)_{ij} = 0 for i ≠ j.
- The self-loop matrix of G is an n×n diagonal matrix, V, with entries (V)_{ii} = f_v(v_i) for i = 1, ..., n and (V)_{ij} = 0 for i ≠ j. If G is a simple weighted graph, then V = O.
- The connectivity matrix of G is an n×n matrix, A, such that (A)_{ij} = 1 if (W)_{ij} ≠ 0 and (A)_{ij} = 0 if (W)_{ij} = 0, for i, j = 1, ..., n, where W is the adjacency matrix of G.
- The generalized graph Laplacian of a weighted graph G is defined as L = D − W + V.
- The combinatorial graph Laplacian of a simple weighted graph G is defined as L = D − W.

Definition 4 (Diagonally Dominant Matrix). An n×n matrix Θ is diagonally dominant if |(Θ)_{ii}| ≥ Σ_{j≠i} |(Θ)_{ij}| for all i, and it is strictly diagonally dominant if |(Θ)_{ii}| > Σ_{j≠i} |(Θ)_{ij}| for all i.

Based on the definitions above, any weighted graph with positive edge weights can be represented by a generalized graph Laplacian, while simple weighted graphs lead to combinatorial graph Laplacians, since they have no self-loops (i.e., V = O). Moreover, if a weighted graph has nonnegative vertex weights (i.e., V ⪰ O), its corresponding generalized Laplacian matrix is also diagonally dominant. Furthermore, graph Laplacian matrices are symmetric and positive semidefinite, so each of them has a complete set of orthogonal eigenvectors, u_1, u_2, ..., u_n, whose associated eigenvalues λ_1 ≤ λ_2 ≤ ... ≤ λ_n are nonnegative real numbers. Specifically, for combinatorial Laplacians of connected graphs, the first eigenvalue is zero (λ_1 = 0) and the associated eigenvector is u_1 = (1/√n)1.
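To make Definitions 1-4 concrete, the following minimal sketch (Python/NumPy; this is not code from the thesis, and the 4-vertex graph with its edge and self-loop weights is made up purely for illustration) builds the matrices of Definition 3 for a small weighted graph and numerically checks the properties stated above: positive semidefiniteness of both Laplacian types, and the zero eigenvalue with eigenvector (1/√n)1 for the CGL of a connected graph.

```python
import numpy as np

# A small, hypothetical 4-vertex weighted graph used only for illustration.
n = 4
W = np.zeros((n, n))                       # adjacency matrix (Definition 3)
for (i, j, w) in [(0, 1, 1.0), (1, 2, 0.5), (2, 3, 2.0), (0, 3, 0.7)]:
    W[i, j] = W[j, i] = w                  # symmetric, nonnegative edge weights

D = np.diag(W.sum(axis=1))                 # degree matrix
V = np.diag([0.3, 0.0, 0.1, 0.0])          # self-loop (vertex weight) matrix
A = (W != 0).astype(int)                   # connectivity matrix (structure only)

L_generalized = D - W + V                  # GGL: L = D - W + V
L_combinatorial = D - W                    # CGL: L = D - W (simple graph, V = O)

# Both Laplacians are symmetric positive semidefinite ...
assert np.all(np.linalg.eigvalsh(L_generalized) >= -1e-12)
assert np.all(np.linalg.eigvalsh(L_combinatorial) >= -1e-12)
# ... and the CGL of this connected graph has lambda_1 = 0 with eigenvector (1/sqrt(n)) 1.
assert np.allclose(L_combinatorial @ np.ones(n), 0)
```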
In this thesis, we consider three different types of graph Laplacian matrices, which lead to the following sets of graphs for a given vertex set V.

Generalized Graph Laplacians (GGLs):
L_g = {L | L ⪰ 0, (L)_{ij} ≤ 0 for i ≠ j}.   (1.1)

Diagonally Dominant Generalized Graph Laplacians (DDGLs):
L_d = {L | L ⪰ 0, (L)_{ij} ≤ 0 for i ≠ j, L1 ⪰ 0}.   (1.2)

Combinatorial Graph Laplacians (CGLs):
L_c = {L | L ⪰ 0, (L)_{ij} ≤ 0 for i ≠ j, L1 = 0}.   (1.3)

Moreover, the problems of interest include learning graphs with a specific choice of connectivity A, that is, finding the best weights for the edges contained in A. In order to incorporate the given connectivity information into the problems, we define the following set of graph Laplacians depending on A:

L(A) = {L ∈ L | (L)_{ij} ≤ 0 if (A)_{ij} = 1, (L)_{ij} = 0 if (A)_{ij} = 0, for i ≠ j},   (1.4)

where L denotes a set of graph Laplacians (which can be L_g, L_d or L_c). The sets described in (1.1)-(1.4) are used to specify Laplacian and connectivity constraints in our problem formulations.

1.2 Applications of Graph Laplacian Matrices

Graph Laplacian matrices have multiple applications in various fields. In spectral graph theory [16], basic properties of graphs are investigated by analyzing characteristic polynomials, eigenvalues and eigenvectors of the associated graph Laplacian matrices. In machine learning, graph Laplacians are extensively used as kernels [15, 17], especially in clustering [18, 19, 20] and graph regularization [21] tasks. Moreover, in graph signal processing [3], basic signal processing operations such as filtering [9, 11], sampling [12] and transformation [8, 10] are extended to signals defined on graphs associated with Laplacian matrices. Although the majority of studies and applications primarily focus on CGLs (and their normalized versions) [16, 22], which represent graphs with zero vertex weights (i.e., graphs with no self-loops), there are recent studies where GGLs [23] (i.e., graphs with nonzero vertex weights) are shown to be useful. In particular, GGLs are proposed for modeling image and video signals in [7, 8], and their potential machine learning applications are discussed in [24]. In [25], a Kron reduction procedure is developed based on GGLs for simplified modeling of electrical networks. Furthermore, DDGLs are utilized in [26, 27, 28] to develop efficient algorithms for graph partitioning [26], graph sparsification [27] and solving linear systems [28]. The work discussed in the rest of the thesis complements these methods and applications by proposing efficient algorithms for the estimation of graph Laplacians from data. The following section reviews some basic concepts from graph signal processing, which is a major application area for our methods, since CGLs are extensively used there.

1.3 Signal/Data Processing on Graphs: Graph-based Transforms and Filters

Traditional signal processing defines signals on regular Euclidean domains, where there is a fixed notion of frequency, defined by the Fourier transform, characterizing the smoothness of signals. Graph signal processing aims to extend basic signal processing operations to irregular non-Euclidean domains by defining signals on graphs, where the notion of frequency is graph-dependent. Specifically, the graph frequency spectrum is defined by the eigenvalues of the graph Laplacian, λ_1 ≤ λ_2 ≤ ... ≤ λ_n, which are called graph frequencies, and its eigenvectors u_1, u_2, ..., u_n are the harmonics (basis vectors) associated with the graph frequencies. (In [29], adjacency matrices are used to define graph spectra; in this thesis, we adopt the graph Laplacian-based construction in [3].) Based on the eigenvectors of a graph Laplacian matrix, the graph-based transform (GBT), also commonly referred to as the graph Fourier transform (GFT) [3], is formally defined as follows.

Definition 5 (Graph-based Transform or Graph Fourier Transform). Let L be a graph Laplacian of a graph G.
The graph-based transform is the orthogonal matrix U, satisfying U^T U = I, obtained by the eigendecomposition L = UΛU^T, where Λ is the diagonal matrix consisting of the eigenvalues λ_1, λ_2, ..., λ_n (graph frequencies). For a given graph signal x = [x_1 x_2 ... x_n]^T defined on a graph G with n vertices, where x_i is attached to v_i (the i-th vertex), the GBT of x is obtained as x̂ = U^T x, where x̂ denotes the vector of GBT coefficients. (In Chapter 5, we design GBTs for video coding, where the video blocks are considered as graph signals and the resulting GBT coefficients are encoded.)

GBTs provide useful (Fourier-like) spectral representations for graph signals. As illustrated in Figure 1.1, the variation of the GBT basis vectors gradually increases with the graph frequencies, and the basis vectors corresponding to low frequencies are relatively smooth.

[Figure 1.1 panels: (a) λ_1 = 0, x = u_1; (b) λ_2 = 0.1392, x = u_2; (c) λ_8 = 4.1149, x = u_8; (d) λ_9 = 4.3028, x = u_9; (e) λ_1 = 0, x = u_1; (f) λ_2 = 0.6972, x = u_2; (g) λ_8 = 4.3028, x = u_8; (h) λ_9 = 5, x = u_9.]
Figure 1.1: Illustration of GBTs derived from two different simple graphs with 9 vertices (i.e., 9×9 CGLs), where all edges are weighted as 1. The basis vectors associated with λ_1, λ_2, λ_8 and λ_9 are shown as graph signals attached to the vertices. Note that the notion of frequency (in terms of both GBTs and the associated graph frequencies) changes depending on the underlying graph. For example, the additional edges in (f) lead to smaller signal differences (pairwise variations) at most of the vertices that are originally disconnected in (b).

To quantify the smoothness of a graph signal x, a common variation measure used in graph signal processing is the graph Laplacian quadratic form, x^T L x, which can be written in terms of the edge and vertex weights of G as

x^T L x = Σ_{i=1}^n (V)_{ii} x_i^2 + Σ_{(i,j)∈I} (W)_{ij} (x_i − x_j)^2,   (1.5)

where (W)_{ij} = −(L)_{ij}, (V)_{ii} = Σ_{j=1}^n (L)_{ij}, and I = {(i,j) | (v_i, v_j) ∈ E} is the set of index pairs of vertices associated with the edge set E. A smaller Laplacian quadratic term (x^T L x) indicates a smoother signal (x), and for combinatorial Laplacians (simple weighted graphs) the smoothness depends only on the edge weights (W), since there are no self-loops (i.e., V = O). In the graph frequency domain, the above measure can be written in terms of the GBT coefficients, x̂_i = u_i^T x for i = 1, ..., n, and the graph frequencies λ_1, λ_2, ..., λ_n as

x^T L x = x^T U Λ U^T x = Σ_{i=1}^n (x^T u_i) λ_i (u_i^T x) = Σ_{i=1}^n λ_i x̂_i^2.   (1.6)
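As a small numerical illustration of Definition 5 and of the equivalence between (1.5) and (1.6), the sketch below (Python/NumPy; the 5-vertex path graph and the random signal are arbitrary choices, not examples from the thesis) computes the GBT by eigendecomposition and verifies that the vertex-domain and frequency-domain expressions of the Laplacian quadratic form agree.

```python
import numpy as np

# Toy path graph with 5 vertices and unit edge weights (an illustrative choice).
n = 5
W = np.diag(np.ones(n - 1), 1) + np.diag(np.ones(n - 1), -1)
L = np.diag(W.sum(axis=1)) - W             # combinatorial Laplacian, L = D - W

# GBT (graph Fourier transform): eigendecomposition L = U diag(lam) U^T,
# with graph frequencies lam in ascending order (Definition 5).
lam, U = np.linalg.eigh(L)

x = np.random.randn(n)                     # an arbitrary graph signal
x_hat = U.T @ x                            # GBT coefficients, x_hat = U^T x

# Vertex-domain and frequency-domain smoothness measures agree, cf. (1.5)-(1.6):
q_vertex = sum(W[i, j] * (x[i] - x[j])**2 for i in range(n) for j in range(i + 1, n))
q_freq = np.sum(lam * x_hat**2)
assert np.isclose(x @ L @ x, q_vertex) and np.isclose(x @ L @ x, q_freq)
```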
Moreover, the filtering operation for graph signals is defined in the graph spectral (GBT) domain. We formally define graph-based filters (GBFs) as follows. (In Chapter 3, GBFs are used in modeling classes of filtered signals.)

Definition 6 (Graph-based Filter). Let L be a graph Laplacian of a graph G. The graph-based filter is a matrix function h of graph Laplacian matrices, h(L) = U h(Λ) U^T, where U is the graph-based transform and (h(Λ))_{ii} = h((Λ)_{ii}) = h(λ_i) for all i.

By definition, a graph-based filter h serves as a scalar function of Λ (i.e., of the graph frequencies), so that we have

h(L) = U h(Λ) U^T = Σ_{i=1}^n h(λ_i) u_i u_i^T.   (1.7)

Essentially, a GBF h maps the graph frequencies λ_1, ..., λ_n to filter responses h(λ_1), ..., h(λ_n). (Filter responses corresponding to eigenvalues with multiplicity greater than one take the same value: if λ_i = λ_{i+1} = ... = λ_{i+c−1} for some i and multiplicity c > 1, then h(λ_i) = h(λ_{i+1}) = ... = h(λ_{i+c−1}).) GBFs are used to define graph systems as follows.

Definition 7 (Graph System). A graph system is defined by a graph G and a graph-based filter h such that the input-output relation of the system is y = h(L)x, where L is a graph Laplacian associated with G, and x is the input signal vector.

For a given input graph signal x, the graph system output y = h(L)x is obtained by modulating the GBT coefficients of the input signal, x̂ = U^T x, with h(Λ) as

y = h(L)x = U h(Λ) U^T x = U h(Λ) x̂ = Σ_{i=1}^n h(λ_i) x̂_i u_i,   (1.8)

where x̂_i = u_i^T x for i = 1, ..., n as in (1.6). As an example, Figure 1.2 illustrates the input-output relations of graph systems defined by h_β(L) = exp(−βL) for three different parameters.

[Figure 1.2 panels: (a) input x_1; (b)-(d) outputs y_{1,β} for β = 0.5, 1, 2; (e) input x_2; (f)-(h) outputs y_{2,β} for β = 0.5, 1, 2.]
Figure 1.2: Input-output relations of graph systems defined by h_β(L) = exp(−βL) for different β, where L is the CGL associated with the shown graph, whose edges are all weighted as 1. For the given input graph signals (a) x_1 and (e) x_2, the corresponding output signals are (b)-(d) y_{1,β} = exp(−βL)x_1 and (f)-(h) y_{2,β} = exp(−βL)x_2 for β = {0.5, 1, 2}. Note that the output signals become smoother as the parameter β increases.

The matrix function exp(−βL) is called the exponential decay GBF, which is one of the GBFs used for modeling smooth graph signals in Chapter 3, where the parameter β determines the smoothness of the output signal.
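The following sketch (Python/NumPy; the 6-vertex cycle graph and the localized input are illustrative choices only, not examples from the thesis) implements a graph system as in Definition 7 by modulating GBT coefficients with a filter response, using the exponential decay GBF exp(−βL). Consistent with (1.8) and Figure 1.2, larger β attenuates high graph frequencies more strongly, so the output's Laplacian quadratic form decreases, i.e., the output becomes smoother.

```python
import numpy as np

def gbf_apply(L, h, x):
    """Apply a graph-based filter h(L) = U h(Lambda) U^T to a graph signal x, cf. (1.7)-(1.8)."""
    lam, U = np.linalg.eigh(L)             # graph frequencies and GBT basis
    x_hat = U.T @ x                        # GBT coefficients
    return U @ (h(lam) * x_hat)            # modulate coefficients, transform back

# Toy 6-vertex cycle graph with unit edge weights (illustrative choice only).
n = 6
W = np.zeros((n, n))
for i in range(n):
    W[i, (i + 1) % n] = W[(i + 1) % n, i] = 1.0
L = np.diag(W.sum(axis=1)) - W

x = np.zeros(n)
x[0] = 1.0                                 # a localized input signal
for beta in (0.5, 1.0, 2.0):
    y = gbf_apply(L, lambda lam: np.exp(-beta * lam), x)   # exponential decay GBF
    # Larger beta suppresses high graph frequencies more, so y^T L y decreases.
    print(beta, float(y @ L @ y))
```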
1.4 Contributions of the Thesis

1.4.1 Graph-based Modeling

Graph Learning From Data: Structured Graph Laplacian Estimation

A general optimization framework is proposed in Chapter 2 for learning/estimating graphs from data. The proposed framework includes (i) the formulation of various graph learning problems, (ii) their probabilistic interpretations and (iii) associated algorithms. Specifically, graph learning problems are posed as the estimation of graph Laplacian matrices from observed data under given structural constraints (e.g., graph connectivity and sparsity level). In particular, we consider three types of graph Laplacian matrices, namely GGLs, DDGLs and CGLs. From a probabilistic perspective, the problems of interest correspond to maximum a posteriori (MAP) parameter estimation of Gaussian-Markov random field (GMRF) models whose precision (inverse covariance) matrix is a graph Laplacian. For the proposed graph learning problems, specialized algorithms are developed by incorporating the graph Laplacian and structural constraints. Our experimental results demonstrate that the proposed algorithms outperform the current state-of-the-art methods [30, 31, 32, 33, 34] in terms of accuracy and computational efficiency.

Graph Learning From Filtered Signals: Graph System Identification

In Chapter 3, a novel graph signal processing framework is introduced for building graph-based models from classes of filtered signals, defined based on functions of a graph Laplacian matrix. In this framework, graph-based modeling is formulated as a graph system identification problem where the goal is to learn a weighted graph (graph Laplacian) and a graph-based filter (a function of a graph Laplacian). An algorithm is proposed to jointly identify a graph and an associated graph-based filter (GBF) from multiple signal/data observations under the assumption that GBFs are one-to-one functions. The proposed approach can also be applied to learn diffusion (heat) kernels [15], which are popular in various fields for modeling diffusion processes. In addition, for a specific choice of graph-based filters, the proposed problem reduces to a graph Laplacian estimation problem. Our experimental results demonstrate that the proposed approach outperforms the current state-of-the-art methods [31, 32, 34]. We also apply our framework to a real climate dataset for the modeling of temperature signals.

Graph Learning From Multiple Graphs: Multigraph Combining

Chapter 4 of this thesis addresses the multigraph combining problem, which we define as designing an optimized graph from multiple graphs. Specifically, an optimization problem is formulated to find the best graph Laplacian minimizing a weighted sum of maximum likelihood (ML) criteria corresponding to the given graph Laplacians. Based on the optimality conditions of the problem, an algorithm is proposed. Our experimental results show that the proposed solution provides better modeling compared to the commonly used averaging method.

1.4.2 Graph-based Transforms for Video Coding

In many state-of-the-art compression systems, signal transformation is an integral part of the encoding and decoding process, where transforms provide compact representations for the signals of interest. Chapter 5 of this thesis proposes GBTs for video compression and develops two different techniques to design them. In the first technique, we solve specific instances of the GGL estimation problem using the graph Laplacian estimation algorithms developed in Chapter 2, and the optimized graphs are used to design separable and nonseparable GBTs. The optimality of the proposed GBTs is also theoretically analyzed based on 1-D and 2-D Gaussian-Markov random field (GMRF) models for intra and inter predicted block signals. The second technique develops edge-adaptive GBTs (EA-GBTs) in order to flexibly adapt transforms to block signals with image edges (discontinuities) and thereby improve coding efficiency. The advantages of EA-GBTs are both theoretically and empirically demonstrated. Our experimental results demonstrate that the proposed transforms can outperform the traditional Karhunen-Loeve transform (KLT).
Table 1.1: List of Symbols and Their Meaning

Symbols : Meaning
G | L : weighted graph | graph Laplacian matrix
h, h_β | β : graph-based filter | filter parameter
λ_i, λ_i(L) : i-th eigenvalue of L in ascending order
V | E | S^c : vertex set | edge set | complement of set S
P_u : set of unordered pairs of vertices
|S| : cardinality of set S
n | k : number of vertices | number of data samples
N : block size (N×N) of an image/video patch
O | I : matrix of zeros | identity matrix
0 | 1 : column vector of zeros | column vector of ones
W | A : adjacency matrix | connectivity matrix
D | V : degree matrix | self-loop matrix
H | α : regularization matrix | regularization parameter
Θ^{-1} | Θ^† : inverse of Θ | pseudo-inverse of Θ
Θ^T | θ^T : transpose of Θ | transpose of θ
det(Θ) | |Θ| : determinant of Θ | pseudo-determinant of Θ
(Θ)_{ij} : entry of Θ at i-th row and j-th column
(Θ)_{i,:} | (Θ)_{:,j} : i-th row vector of Θ | j-th column vector of Θ
(Θ)_{SS} : submatrix of Θ formed by selecting indexes in S
(θ)_i : i-th entry of θ
(θ)_S : subvector of θ formed by selecting indexes in S
⪰ (⪯) : elementwise greater (less) than or equal to operator
Θ ⪰ 0 : Θ is a positive semidefinite matrix
Θ ≻ 0 : Θ is a positive definite matrix
Tr | logdet(Θ) : trace operator | natural logarithm of det(Θ)
diag(θ) : diagonal matrix formed by elements of θ
ddiag(Θ) : diagonal matrix formed by diagonal elements of Θ
p(x) : probability density function of random vector x
x ∼ N(0, Σ) : zero-mean multivariate Gaussian with covariance Σ
x ∼ U(a, b) : uniform distribution on the interval [a, b]
‖Θ‖_1, ‖θ‖_1 : sum of absolute values of all elements (ℓ_1-norm)
‖Θ‖_{1,o} : sum of absolute values of all off-diagonal elements
‖θ‖_2^2, ‖Θ‖_F^2 : sum of squared values of all elements
‖Θ‖_{F,o}^2 : sum of squared values of all off-diagonal elements

Chapter 2

Graph Learning from Data: Structured Graph Laplacian Estimation

(Most of the work presented in this chapter is published in [35, 36, 37]. MATLAB [38] implementations of the proposed algorithms are available online [39].)

The focus of this chapter is on learning graphs (i.e., graph-based models) from data, where the basic goal is to find the nonnegative edge weights of a graph in order to characterize the affinity relationship between the entries of a signal/data vector, based on multiple observed vectors. For this purpose, we propose a general framework in which graph learning is formulated as the estimation of different types of graph Laplacian matrices from data. Specifically, for a given k×n data matrix X consisting of k observed data vectors of dimension n, the problems of interest are formulated as the minimization of objective functions of the following form:

Tr(ΘS) − logdet(Θ) + ‖Θ ⊙ H‖_1,   (2.1)

where Θ is the n×n target matrix variable and S denotes the data statistic obtained from X. Depending on the application and the underlying statistical assumptions, S may stand for the sample covariance of X or for a kernel matrix S = K(X, X) derived from the data, where K is a positive definite kernel function (e.g., polynomial and RBF kernels). R(Θ; H) = ‖Θ ⊙ H‖_1 is the sparsity-promoting weighted ℓ_1-regularization term [40], multiplying Θ and a selected regularization matrix H element-wise, and D(Θ; S) = Tr(ΘS) − logdet(Θ) is the data-fidelity term, a log-determinant Bregman divergence [41], whose minimization corresponds to maximum likelihood estimation of inverse covariance (precision) matrices for multivariate Gaussian distributions. Thus, minimizing (2.1) for arbitrary data can be interpreted as finding the parameters of a multivariate Gaussian model that best approximates the data [42, 43].
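For concreteness, a minimal sketch of how the objective (2.1) can be evaluated for a candidate Θ is given below (Python/NumPy; the helper name, the synthetic data matrix, the simple off-diagonal choice of H and the candidate Θ are all made-up inputs for illustration, and this is not the thesis's released MATLAB implementation [39]).

```python
import numpy as np

def graph_learning_objective(Theta, S, H):
    """Objective (2.1): Tr(Theta S) - logdet(Theta) + || Theta (.) H ||_1."""
    sign, logdet = np.linalg.slogdet(Theta)
    assert sign > 0, "Theta must be positive definite for logdet to be defined"
    data_fidelity = np.trace(Theta @ S) - logdet
    regularization = np.abs(Theta * H).sum()      # element-wise (Hadamard) product
    return data_fidelity + regularization

# Illustrative use with made-up inputs (not data from the thesis):
n, k, alpha = 4, 200, 0.1
X = np.random.randn(k, n) @ np.diag([1.0, 0.5, 0.5, 2.0])   # k observations of dimension n
S = np.cov(X, rowvar=False)                                  # data statistic (sample covariance)
H = alpha * (np.ones((n, n)) - np.eye(n))                    # one simple choice: penalize off-diagonals
Theta = np.linalg.inv(S + 1e-3 * np.eye(n))                  # a feasible (positive definite) candidate
print(graph_learning_objective(Theta, S, H))
```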
[Figure 2.1 diagram: nested sets of matrices, with L_d = M_d ∩ L_g; the GMRF classes labeled in the figure are general, diagonally dominant, attractive, attractive diagonally dominant, and attractive DC-intrinsic GMRFs.]
Figure 2.1: The set of positive semidefinite matrices (M_psd) containing the sets of diagonally dominant positive semidefinite matrices (M_d), generalized (L_g), diagonally dominant (L_d) and combinatorial Laplacian (L_c) matrices. The corresponding classes of GMRFs are enumerated as (1)-(5), respectively. In this work, we focus on estimating/learning the sets colored in gray.

In addition to the objective in (2.1), our formulations incorporate problem-specific Laplacian and structural constraints depending on (i) the desired type of graph Laplacian and (ii) the available information about the graph structure. In particular, we consider three types of graph Laplacian matrices, namely GGL, DDGL and CGL matrices (defined in Chapter 1), and develop novel techniques to estimate them from data (i.e., from the data statistic S). As illustrated in Figure 2.1 and further discussed in Section 2.3, the proposed graph Laplacian estimation techniques can also be viewed as methods to learn different classes of Gaussian-Markov random fields (GMRFs) [44, 45] whose precision matrices are graph Laplacians. Moreover, in our formulations, structural (connectivity) constraints are introduced to exploit available prior information about the target graph. When the graph connectivity is unknown, graph learning involves estimating both the graph structure and the graph weights, with the regularization term controlling the level of sparsity. Otherwise, if the graph connectivity is given (e.g., based on application-specific assumptions or prior knowledge), graph learning reduces to the estimation of graph weights only.

This chapter is organized as follows. In Section 2.1, we discuss related studies and our contributions. Section 2.2 formulates our proposed problems and summarizes some related formulations in the literature. Section 2.3 discusses the probabilistic interpretation of our proposed problems. In Section 2.4, we derive necessary and sufficient optimality conditions and develop novel algorithms for the proposed graph learning problems. Experimental results are presented in Section 2.5, and some concluding remarks are given in Section 2.6.

2.1 Related Work and Contributions

2.1.1 Sparse Inverse Covariance Estimation

In the literature, several approaches have been proposed for estimating graph-based models. Dempster [46] originally proposed the idea of introducing zero entries in inverse covariance matrices for simplified covariance estimation. Later, a neighborhood selection approach was proposed for graphical model estimation [47] using the Lasso algorithm [48]. Friedman et al. [30] formulated a regularization framework for sparse inverse covariance estimation and developed the Graphical Lasso algorithm to solve the regularized problem. Some algorithmic extensions of the Graphical Lasso are discussed in [42, 49], and a few computationally efficient variations are presented in [50, 51, 52].
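For readers who want a runnable reference point for this baseline, a brief sketch using scikit-learn's GraphicalLasso estimator is given below (an assumption of this example: scikit-learn is available; the synthetic data and the regularization value are arbitrary, and this is not the thesis's own implementation or its experimental setup).

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

# Synthetic data: k samples of an n-dimensional zero-mean signal (illustrative only).
rng = np.random.default_rng(0)
n, k = 10, 500
X = rng.standard_normal((k, n))

# l1-regularized maximum-likelihood estimate of the precision (inverse covariance) matrix,
# i.e., the Graphical Lasso of Friedman et al. [30].
model = GraphicalLasso(alpha=0.05).fit(X)
Theta_hat = model.precision_      # sparse precision estimate; entries may have either sign
```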
However, inverse covariance estimation methods, such as the Graphical Lasso, search for solutions in the set of positive semidefinite matrices (M_psd in Figure 2.1), which leads to a different notion of graphs by allowing both negative and positive edge weights, while we focus on learning graphs with nonnegative edge weights, associated with graph Laplacian matrices (L_g, L_d or L_c in Figure 2.1). Although graph Laplacian matrices represent a more restricted set of models (attractive GMRFs) compared to positive semidefinite matrices (modeling general GMRFs), attractive GMRFs cover an important class of random vectors whose entries can be optimally predicted by nonnegative linear combinations of the other entries. For this class of signals/data, our proposed algorithms incorporating Laplacian constraints provide more accurate graph estimation than sparse inverse covariance methods (e.g., the Graphical Lasso). Even when such model assumptions do not strictly hold, the proposed algorithms can be employed to find the best (closest) graph Laplacian fit with respect to the Bregman divergence in (2.1) for applications where graph Laplacians are useful (see Section 1.2).

2.1.2 Graph Laplacian Estimation

Several recent publications address the learning of different types of graph Laplacians from data. Closest to our work, Slawski and Hein address the problem of estimating symmetric M-matrices [53], or equivalently GGLs, and propose an efficient primal algorithm [54], while our recent work [55] proposes an alternative dual algorithm for GGL estimation. Our work addresses the same GGL estimation problem as [54, 55], based on a primal approach analogous to that of [54], but unlike both [54] and [55], we incorporate connectivity constraints in addition to sparsity-promoting regularization. For the estimation of CGLs, Lake and Tenenbaum [33] also consider minimization of the objective function in (2.1), which is unbounded for CGLs (since they are singular matrices). To avoid working with singular matrices, they propose to optimize a different target matrix obtained by adding a positive constant value to the diagonal entries of a combinatorial Laplacian, but no efficient algorithm is developed. Dong et al. [31] and Kalofolias [32] propose the minimization of two objective functions different from (2.1) in order to overcome issues related to the singularity of CGLs. Instead, by restricting our learning problem to connected graphs (which have exactly one eigenvector with eigenvalue 0), we can directly use a modified version of (2.1) as the objective function and develop an efficient algorithm that guarantees convergence to the optimal solution, with significant improvements in experimental results over prior work.

2.1.3 Graph Topology Inference

There are also a few recent studies that focus on inferring graph topology (i.e., connectivity) information from signals assumed to be diffused on a graph. In particular, Segarra et al. [34] and Pasdeloup et al. [56] focus on learning graph shift/diffusion operators (such as adjacency and Laplacian matrices) from a set of diffused graph signals, and Sardellitti et al. [57] propose an approach to estimate a graph Laplacian from bandlimited graph signals. None of these works [34, 56, 57] considers the minimization of (2.1). In fact, the techniques proposed in these papers directly use the eigenvectors of the empirical covariance matrix and only optimize the choice of eigenvalues of the Laplacian or adjacency matrices under specific criteria, for the given eigenvectors.
In contrast, our methods implicitly optimize both eigenvectors and eigenvalues by minimizing (2.1). The problem of learning diusion-based models is addressed in Chapter 3. 2.1.4 Summary of Contributions In this work, we address estimation of three dierent types of graph Laplacian with structural constraints. For CGL estimation, we propose a novel formulation for the objective function in (2.1), whose direct minimization is not possible due to the singularity of CGLs. Our formulation allows us to improve the accuracy of CGL estimation signicantly compared to the approaches in [33, 31, 32]. For GGL estimation, the prior formulations in [54, 55] are extended in order to accommodate structural constraints. To solve the proposed problems, we develop ecient block- coordinate descent (BCD) algorithms [58] exploiting the structural constraints within the problems, which can signicantly improve the accuracy and reduce the computational complexity depending on the degree of sparsity introduced by the constraints. Moreover, we theoretically show that the proposed algorithms guarantee convergence to the optimal solution. Previously, numerous BCD- type algorithms are proposed for sparse inverse covariance estimation [42, 30, 49] which iteratively solve an ` 1 -regularized quadratic program. However, our algorithms are specically developed for graph Laplacian estimation problems, where we solve a nonnegative quadratic program for block- coordinate updates. Finally, we present probabilistic interpretations of our proposed problems by showing that their solutions lead to optimal parameter estimation for special classes of GMRFs, as depicted in Figure 2.1. While recent work has noted the relation between graph Laplacians and GMRFs [6, 59], this chapter provides a more comprehensive classication of GMRFs and proposes specic methods for estimation of their parameters. CHAPTER 2. GRAPH LEARNING FROM DATA 16 2.2 Problem Formulations for Graph Learning 2.2.1 Proposed Formulations: Graph Laplacian Estimation For the purpose of graph learning, we formulate three dierent optimization problems for a given S, a connectivity matrix A and a regularization matrix H. In our problems, we minimize the function in (2.1) under the set constraint 2L(A) dened in (1.4), where the choices ofL and A determine the Laplacian and connectivity constraints, respectively. Based on the Laplacian constraints on (i.e., nonnegativity of edge weights), a regularization matrix H can be selected such thatR(; H) term in (2.1) is compactly written as k Hk 1 = Tr (H): (2.2) For example, the following standard` 1 -regularization terms with parameter can be written in the above form as, kk 1 = Tr (H) where H =(2I 11 T ); (2.3) and kk 1;o = Tr (H) where H =(I 11 T ): (2.4) Note that kk 1;o applies ` 1 -regularization to o-diagonal entries of only (see in Table 1.1). Since the trace operator is linear, we can rewrite the objective function in (2.1) as Tr (K) logdet() where K=S+H; (2.5) which is the form used in our optimization problems. Note that the nonnegativity of edge weights allows us to transform the nonsmooth function in (2.1) into the smooth function in (2.5) by rewriting the regularization term as in (2.2). In what follows, we formally introduce three dierent optimization problems with Laplacian and structural constraints. Problem 1 (GGL Problem). 
The optimization problem formulated for estimating generalized graph Laplacian (GGL) matrices is minimize Tr (K) logdet() subject to 2L g (A) (2.6) where K=S+H as in (2.5), and the set of constraintsL g (A) leads to being a GGL matrix. Problem 2 (DDGL Problem). The diagonally dominant generalized graph Laplacian (DDGL) CHAPTER 2. GRAPH LEARNING FROM DATA 17 estimation problem is formulated as minimize Tr (K) logdet() subject to 2L d (A) (2.7) where the additional 1 0 constraint inL d (A) ensures that all vertex weights are nonnegative, and therefore the optimal solution is a diagonally dominant matrix. Problem 3 (CGL Problem). The combinatorial graph Laplacian (CGL) estimation problem is formulated as minimize Tr (K) logjj subject to 2L c (A) (2.8) where the objective function involves the pseudo-determinant term (jj), since the target matrix is singular. However, the problem is hard to solve because of thejj term. To cope with this, we propose to reformulate (2.8) as the following problem a , minimize Tr ((K + J)) logdet( + J) subject to 2L c (A) (2.9) where the 1=0 constraint inL c (A) guarantees that the solution is a CGL matrix, and J = u 1 u 1 T such that u 1 = (1= p n)1 is the eigenvector corresponding to the zero eigenvalue of CGL matrices. Proposition 1. The optimization problems stated in (2.8) and (2.9) are equivalent. Proof. The problems in (2.8) and (2.9) have the same constraints. To prove their equivalence, we show that their objective functions are also the same. First, note that Tr ((K + J)) = Tr (K) + 1 n Tr (11 T ) = Tr (K) since 1=0 based on the CGL problem constraints. Next, we can write logdet( + 1=n 11 T ) = log n Y i=1 i ( + 1=n 11 T ) ! (2.10) where i () denotes the i-th eigenvalue of in ascending order ( 1 () n ()). Since the eigenvector corresponding to the rst (zero) eigenvalue (i.e., 1 () = 0) is u 1 = 1= p n 1, by the a An alternative second-order approach is proposed to solve (2.9) in [60], which is published after the initial version of this work [36]. Yet, the equivalence of (2.8) and (2.9) is not discussed in [60]. CHAPTER 2. GRAPH LEARNING FROM DATA 18 problem constraints (i.e., by denition of CGL matrices), we have that + 1 n 11 T = ( 1 () | {z } 0 +1)u 1 u 1 T + n X i=2 i ()u i u i T : (2.11) Since the determinant of a matrix is equal to the product of its eigenvalues, from (2.11) we have logdet( + 1=n 11 T ) = log 1 n Y i=2 i () ! = logjj: Therefore, the problems in (2.8) and (2.9) are equivalent. Proposition 2. Problems 1, 2 and 3 are convex optimization problems. Proof. The function logdet() dened over positive semidenite matrices ( 0) is a concave function (see [61] for a proof), and Tr() is a linear function. Thus, the overall objective function is convex. The graph Laplacian constraints form a convex set. Since we have a minimization of a convex objective function over a convex set, the problems of interest are convex. In Problems 1, 2 and 3, prior knowledge/assumptions about the graph structure are built into the choice of A, determining the structural constraints. In practice, if the graph connectivity is unknown, then A can be set to represent a fully connected graph, A=A full =11 T I, and the regularization matrix H (or the parameter for the ` 1 -regularizations in (2.3) and (2.4)) can be tuned until the desired level of sparsity is achieved. 2.2.2 Related Prior Formulations In this section, we review some of the related problems previously proposed in the literature. Sparse Inverse Covariance Estimation [30]. 
The goal is to estimate a sparse inverse covariance matrix from S by solving:
\[
\underset{\Theta \succeq 0}{\text{minimize}} \;\; \mathrm{Tr}(\Theta \mathbf{S}) - \log\det(\Theta) + \alpha \|\Theta\|_{1}. \qquad (2.12)
\]
In our work, we are interested in minimization of the same objective function under Laplacian and structural constraints.
Shifted CGL Estimation [33]. The goal is to estimate a shifted CGL matrix, which is defined by adding a scalar value to the diagonal entries of a combinatorial Laplacian matrix:
\[
\begin{aligned}
\underset{\widetilde{\Theta} \succeq 0,\; \nu \geq 0}{\text{minimize}} \;\; & \mathrm{Tr}(\Theta \mathbf{S}) - \log\det(\Theta) + \alpha \|\Theta\|_{1} \\
\text{subject to} \;\; & \Theta = \widetilde{\Theta} + \nu \mathbf{I}, \quad \widetilde{\Theta}\mathbf{1} = \mathbf{0}, \quad (\widetilde{\Theta})_{ij} \leq 0 \;\; i \neq j \qquad (2.13)
\end{aligned}
\]
where $\nu$ denotes the positive scalar (i.e., shift) variable added to the diagonal elements of $\widetilde{\Theta}$, which is constrained to be a CGL matrix, so that $\Theta$ is the target variable. By solving this problem, a CGL matrix is estimated by subtracting the shift variable, that is $\widehat{\widetilde{\Theta}} = \widehat{\Theta} - \hat{\nu}\mathbf{I}$. However, this generally leads to a different solution than our method.
Proposition 3. The objective functions of Problem 3 and the shifted CGL problem in (2.13) are different.
Proof. The optimal $\nu$ cannot be zero, since the objective function is unbounded for $\nu = 0$. By assuming that $\alpha = 0$ (without loss of generality), the objective function in (2.13) can be decomposed as follows
\[
\mathcal{J}(\widetilde{\Theta}) + \nu\,\mathrm{Tr}(\mathbf{S}) - \sum_{i=2}^{n} \log\!\left(1 + \frac{\nu}{\lambda_i(\widetilde{\Theta})}\right) - \log(\nu) \qquad (2.14)
\]
where $\mathcal{J}(\widetilde{\Theta})$ is the objective of Problem 3 with $\mathbf{H} = \mathbf{O}$.
Graph Learning from Smooth Signals [31, 32]. The goal is to estimate a CGL from n-dimensional signals that are assumed to be smooth with respect to the corresponding graph:
\[
\begin{aligned}
\underset{\Theta \succeq 0}{\text{minimize}} \;\; & \mathrm{Tr}(\Theta \mathbf{S}) + \alpha_1 \|\Theta\|_{F}^{2} \\
\text{subject to} \;\; & \Theta\mathbf{1} = \mathbf{0}, \quad \mathrm{Tr}(\Theta) = n, \quad (\Theta)_{ij} \leq 0 \;\; i \neq j \qquad (2.15)
\end{aligned}
\]
where the sum of degrees is constrained as $\mathrm{Tr}(\Theta) = n$. This is a limitation that is later relaxed in [32] by introducing the following problem with regularization parameters $\alpha_1$ and $\alpha_2$,
\[
\begin{aligned}
\underset{\Theta \succeq 0}{\text{minimize}} \;\; & \mathrm{Tr}(\Theta \mathbf{S}) + \alpha_1 \|\Theta\|_{F,o}^{2} - \alpha_2 \sum_{i=1}^{n} \log\big((\Theta)_{ii}\big) \\
\text{subject to} \;\; & \Theta\mathbf{1} = \mathbf{0}, \quad (\Theta)_{ij} \leq 0 \;\; i \neq j \qquad (2.16)
\end{aligned}
\]
where the constraints in (2.15) and (2.16) lead to a CGL solution. The following proposition relates the objective function in (2.16) with $\alpha_1 = 0$ to the objective in our proposed CGL estimation problem.
Proposition 4. The objective function in Problem 3 with $\alpha = 0$ is lower-bounded by the objective function in (2.16) for $\alpha_1 = 0$ and $\alpha_2 = 1$.
Proof. For $\alpha_1 = 0$ and $\alpha_2 = 1$, the objective function in (2.16) is written as $\mathrm{Tr}(\Theta\mathbf{S}) - \sum_{i=1}^{n}\log((\Theta)_{ii})$. By using Hadamard's inequality, $\det(\Theta) \leq \prod_{i=1}^{n} (\Theta)_{ii}$ [62], and taking the log of both sides, the following bound is obtained
\[
\mathrm{Tr}(\Theta\mathbf{S}) - \sum_{i=1}^{n}\log\big((\Theta)_{ii}\big) \;\leq\; \mathrm{Tr}(\Theta\mathbf{S}) - \log\det(\Theta) \qquad (2.17)
\]
where the right-hand side is the objective function in Problems 1, 2 and 3 with $\alpha = 0$, as desired.
Graph Topology Inference. Various approaches for graph topology (connectivity) inference from data (under diffusion-based model assumptions) have been proposed in [34, 56, 57]. As the most closely related to our work, Segarra et al. [34] introduce a sparse recovery problem to infer the graph topology information from the eigenbasis, U, associated with a graph shift/diffusion operator. Specifically for CGL estimation, the following problem is formulated:
\[
\begin{aligned}
\underset{\Theta \succeq 0,\; \Lambda}{\text{minimize}} \;\; & \|\Theta\|_{1} \\
\text{subject to} \;\; & \Theta = \mathbf{U}\Lambda\mathbf{U}^{T}, \quad \Theta\mathbf{1} = \mathbf{0}, \quad (\Theta)_{ij} \leq 0 \;\; i \neq j \qquad (2.18)
\end{aligned}
\]
where the eigenbasis U is the input to the problem, so that the goal is to find the set of eigenvalues (i.e., the diagonal matrix $\Lambda$) minimizing $\|\Theta\|_1$. Note that the problems in [34, 56, 57] require that U be given (or calculated beforehand), while our goal is to directly estimate a graph Laplacian so that both U and $\Lambda$ are jointly optimized. The estimation of Laplacians from data under diffusion-based assumptions will be discussed later in Chapter 3.
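Before moving on, the equivalence used in our CGL reformulation (Proposition 1) is straightforward to check numerically. A minimal sketch on a toy combinatorial Laplacian (the path-graph weights below are arbitrary) is:

```python
import numpy as np

# Toy combinatorial graph Laplacian (CGL) of a weighted 4-vertex path graph.
W = np.array([[0, 1, 0, 0],
              [1, 0, 2, 0],
              [0, 2, 0, 3],
              [0, 0, 3, 0]], dtype=float)
L = np.diag(W.sum(axis=1)) - W
n = L.shape[0]
J = np.ones((n, n)) / n               # J = (1/n) 1 1^T

# Log pseudo-determinant of L: sum of logs of its nonzero eigenvalues
# (a connected graph has exactly one zero eigenvalue).
eigvals = np.linalg.eigvalsh(L)
log_pdet = np.sum(np.log(eigvals[1:]))

# Proposition 1: log det(L + J) equals the log pseudo-determinant of L for a CGL.
log_det_shifted = np.linalg.slogdet(L + J)[1]
print(np.isclose(log_pdet, log_det_shifted))   # True (up to numerical precision)
```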
2.3 Probabilistic Interpretation of Proposed Graph Learning Problems The proposed graph learning problems can be viewed from a probabilistic perspective by assuming that the data has been sampled from a zero-meann-variate Gaussian distribution a x N(0; = y ), parametrized with a positive semidenite precision matrix , dening a Gaussian Markov random eld (GMRF) p(xj ) = 1 (2) n=2 j y j 1=2 exp 1 2 x T x ; (2.19) with covariance matrix = y . Based on its precision matrix ( ), a GMRF is classied as [44, 45]: a general GMRF if its precision is positive semidenite, an attractive GMRF if its precision has nonpositive o-diagonal entries, a diagonally dominant GMRF if its precision is diagonally dominant, an intrinsic GMRF if its precision is positive semidenite and singular. a The zero-mean assumption is made to simplify the notation. Our analysis can be trivially extended to a multi- variate Gaussian with nonzero mean. CHAPTER 2. GRAPH LEARNING FROM DATA 21 The entries of the precision matrix can be interpreted in terms of the following conditional dependence relations among the variables in x, E x i j(x) Snfig = 1 ( ) ii X j2Snfig ( ) ij x j (2.20) Prec x i j(x) Snfig = ( ) ii (2.21) Corr x i x j j(x) Snfi;jg = ( ) ij p ( ) ii ( ) jj i6=j; (2.22) whereS =f1;:::;ng is the index set for x = [x 1 x 2 x n ] T . The conditional expectation in (2.20) represents the minimum mean square error (MMSE) prediction of x i using all the other random variables in x. The precision of x i is dened as in (2.21), and the relation in (2.22) corresponds to the partial correlation betweenx i andx j (i.e., correlation between random variablesx i andx j given all the other variables in x). For example, if x i and x j are conditionally independent (( ) ij =0), there is no edge between corresponding vertices v i and v j . For GMRFs, whose precision matrices are graph Laplacian matrices (i.e., = L), we can show that there is a one-to-one correspondence (bijection) between dierent classes of attractive GMRFs and types of graph Laplacian matrices by their denitions, as illustrated in Figure 2.1: L is a GGL matrix (L2L g ) if and only if p(xjL) is an attractive GMRF, L is a DDGL matrix (L2L d ) if and only if p(xjL) is an attractive, diagonally dominant GMRF, L is a CGL matrix (L2L c ) if and only if p(xjL) is an attractive, DC-intrinsic GMRF. Note that, in our characterization, the GMRFs corresponding to CGL matrices are classied as DC- intrinsic GMRFs, which are specic cases of intrinsic GMRFs [44] with no probability density along the direction of the eigenvector u 1 =1= p n 1 associated with the zero eigenvalue ( 1 (L)=0). On the other hand, if L is a nonsingular GGL matrix, then x has a proper (non-degenerate) distribution. Moreover, for = L, the x T x term in (2.19) becomes the graph Laplacian quadratic form stated in (1.5), which is used to quantify smoothness of graph signals [3]. In our formulations, the Laplacian quadratic from x T Lx relates to the trace term in our objective function, which is derived based on the likelihood function for GMRFs as discussed in the following. The proposed graph learning problems can be probabilistically formulated as parameter estima- tion for attractive GMRFs from data. Assuming thatk independent, identically distributed samples, x i fori = 1;:::;k, are obtained from an attractive GMRF with unknown parameters, the likelihood of a candidate graph Laplacian L can be written as k Y i=1 p(x i jL)=(2) kn 2 jL y j k 2 k Y i=1 exp 1 2 x i T Lx i : (2.23) CHAPTER 2. 
GRAPH LEARNING FROM DATA 22 Let L(w; v) be dened by edge weight and vertex weight vectors w = [f w (e 1 );:::;f w (e m )] T and v = [f v (v 1 );:::;f v (v n )] T , wheren is the number of vertices, and m=n(n1)=2 is the number of all possible (undirected) edges. The maximization of the likelihood function in (2.23) can be equivalently formulated as minimizing the negative log-likelihood, that is b L ML = argmin L(w;v) ( k 2 logjLj+ 1 2 k X i=1 Tr (x i T Lx i ) ) = argmin L(w;v) fTr (LS) logjLjg (2.24) where S is the sample covariance matrix, and b L ML denotes the maximum likelihood estimate of L(w; v). Moreover, we can derive maximum a posteriori (MAP) estimation problems by incorpo- rating the information known about L into a prior distribution p(L) as b L MAP = argmin L(w;v) fTr (LS) logjLjlog(p(L))g: (2.25) For example, we can choose the following m-variate exponential prior for sparse estimation of w, p(w) = (2) m exp (21 T w) for w 0; (2.26) so that the MAP estimation in (2.25) can be written as follows: b L MAP = argmin L(w;v) fTr (LS) logjLjlog(p(w))g = argmin L(w;v) fTr (LS) logjLj+2kwk 1 g = argmin L(w;v) n Tr (LS) logjLj+kLk 1;o o (2.27) where the resulting minimization is equivalent to the objective of our problems with the regulariza- tion in (2.4). Proposition 5. Let the data model be x N(0; L y ) as in (2.19). Then, Problems 1, 2 and 3 are specic instances of the maximum a posteriori estimation problem in (2.25). Proof. With proper choices of the prior p(L) in (2.25), the objective function in (2.1) can be con- structed for any H, except the pseudo-determinant termjj needed for estimating CGL matrices. For this case, Proposition 1 shows that we can equivalently formulate (2.25) in the form of Problem 3. The construction in (2.25){(2.27) can be trivially extended for the weighted ` 1 -regularization. Also, the connectivity and Laplacian constraints in Problems 1, 2 and 3 can be incorporated in a Bayesian setting by choosing spike-and-slab prior and improper prior distributions [63] on v and w, so that spike priors correspond to zero edge weights, and slab priors allow nonnegative edge CHAPTER 2. GRAPH LEARNING FROM DATA 23 weights. Hence, our graph learning problems can be interpreted as MAP parameter estimation for dierent classes of attractive GMRFs. Thus, when the data model assumptions are satised, solving our problems produces the optimal parameters in MAP sense. Given S, which is obtained from the data, the solution of (2.25) corresponds to the closest Laplacian in terms of a regularized log- determinant divergence criterion [41]. In practice, in order to capture nonlinear relations between random variables, dierent types of kernels (e.g., polynomial and RBF kernels) can also be used to construct S. 2.4 Proposed Graph Learning Algorithms Problems 1, 2 and 3 can be solved using general purpose solvers such as CVX [64]. However, these solvers generally implement second-order methods that require calculation of a Hessian matrix and are therefore computationally inecient. Simpler gradient descent algorithms would also be com- putationally complex, since the full gradient calculation of the objective function involves inverting the current estimate of the Laplacian matrix at each iteration (e.g., see the derivative of (2.34) in (2.38)). In order to develop ecient methods, we propose iterative block-coordinate descent algo- rithms [58], where each iterate (block-coordinate update) is obtained by xing some of the elements in the set of target variables while updating the rest. 
Thus, the original problem is decomposed into a series of lower-dimensional subproblems that are relatively easier to solve. Particularly, at each iteration, the update variables are formed by a row/column of the target graph Laplacian matrix (), and they are updated by solving the subproblem derived based on the optimality conditions of corresponding Laplacian estimation problem, where the available structural constraints are also incorporated into the subproblem. Basically, to estimate annn graph Laplacian matrix, our algo- rithms iteratively update rows/columns of and its inverse (C), so that the cycle ofn row/column updates is repeated until convergence is achieved. Also, depending on the type of target Laplacian, our algorithms potentially apply projections to satisfy the Laplacian constraints. In what follows, we rst provide matrix update formulas used to eciently update entries of and C in our algorithms (Section 2.4.1). Then, Algorithms 1 and 2 are presented with the deriva- tions of subproblems based on the optimality conditions of corresponding graph learning problems. Specically, Algorithm 1 is proposed to solve Problems 1 and 2 (Section 2.4.2), while Algorithm 2 solves Problem 3 (Section 2.4.3). Finally, the convergence and computational complexity of proposed algorithms are analyzed (Section 2.4.4). CHAPTER 2. GRAPH LEARNING FROM DATA 24 2.4.1 Matrix Update Formulas The proposed algorithms exploit the following formulas to update the target variable and its inverse C iteratively. Row/column updates. Updating the u-th row/column of results in updating all the elements in its inverse C= 1 , which can be obtained by the matrix inversion lemma [65], (P T P) 1 = 2 6 4 u u T u u 3 7 5 1 = P T CP = 2 6 4 C u c u c T u c u 3 7 5 = 2 6 6 4 u u T u u 1 C u u u T u u C T u 1 u T u C u u 2 u 3 7 7 5 (2.28) where the permutation matrix P is used to arrange updated and xed elements in block partitions, so that the submatrix u represents the elements that remain unchanged, while vector u and scalar u (i.e., u = () uu ) are the u-th row/columns , which are being updated. Based on the block partitions in (2.28), we can calculate C, using updated u and u , for xed u as follows: C u = u u T u u 1 = 1 u 1 u u T u 1 u u T u 1 u u ; (2.29) c u =C u u u = 1 u u u T u 1 u u ; (2.30) c u = 1 u T u 1 u u ; (2.31) where 1 u can be calculated from partitions of updated C as, 1 u = C u c u c T u =c u : (2.32) Diagonal updates. After adding a scalar value to () ii , we use the Sherman-Morrison formula [66] to update C as b C = b 1 = ( + i i T ) 1 = C C i i T C 1 + i T C i ; (2.33) where i is the vector whose entries are zero, except for its i-th entry which is equal to one. 2.4.2 Generalized Laplacian Estimation Derivation of the optimality conditions. To derive necessary and sucient optimality con- ditions, we use Lagrangian duality theory [67, 61], which requires introducing a set of Lagrange multipliers (i.e., dual variables) and a Lagrangian function. For Problems 1 and 2 the Lagrangian CHAPTER 2. GRAPH LEARNING FROM DATA 25 functions have the form, logdet() + Tr (K) + Tr(M); (2.34) where M is the matrix of Lagrange multipliers associated with the constraints. 
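Before specializing the multiplier matrices, it is worth verifying numerically the block-inversion identity in (2.32) that both algorithms rely on for their row/column updates. The sketch below (using an arbitrary positive definite test matrix) confirms that the inverse of the submatrix obtained by removing the u-th row/column of the current estimate can be read off from the partitions of its inverse C, avoiding a fresh inversion at every iteration:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 6))
Theta = A @ A.T + 6 * np.eye(6)         # a positive definite test matrix
C = np.linalg.inv(Theta)

u = 3                                    # row/column being updated
mask = np.arange(6) != u
C_u = C[np.ix_(mask, mask)]              # block of C excluding row/column u
c_u = C[mask, u]                         # u-th column of C without its diagonal entry
c_uu = C[u, u]

# Identity (2.32): inverse of the (n-1)x(n-1) fixed block from partitions of C.
Theta_u_inv = C_u - np.outer(c_u, c_u) / c_uu
print(np.allclose(Theta_u_inv, np.linalg.inv(Theta[np.ix_(mask, mask)])))   # True
```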
In particular, for Problem 1, M=M 1 +M 2 , with multiplier matrices, M 1 and M 2 , whose entries are (M 1 ) ij = (M 1 ) ji = 8 > > > < > > > : (1) ij 0 if (A) ij = 1; i6=j 0 if (A) ij = 0; i6=j 0 if i =j (2.35) (M 2 ) ij = (M 2 ) ji = 8 > > > < > > > : (2) ij 2R if (A) ij = 0; i6=j 0 if (A) ij = 1; i6=j 0 if i =j (2.36) for i;j = 1;:::;n where (1) ij and (2) ij are the Lagrange multipliers associated with inequality and equality constraints in Problem 1, respectively. For Problem 2, M = M 1 +M 2 M 3 so that M 1 and M 2 are as in (2.35) and (2.36), and M 3 consists of Lagrange multipliers (denoted as (3) i ) associated with the constraint 1 0. The entries of M 3 are (M 3 ) ij = (3) i + (3) j where (3) i 0; (3) j 0 (2.37) for i;j = 1;:::;n. By taking the derivative of (2.34) with respect to and setting it to zero, we obtain the following optimality condition, 1 + K + M = O; (2.38) and the necessary and sucient optimality conditions [61] for Problem 1 are 1 + K + M = O () ij 0 if (A) ij = 1, i6=j () ij = 0 if (A) ij = 0, i6=j 0 (M 1 ) ij () ij = 0 (2.39) where M = M 1 + M 2 . For Problem 2, the multiplier matrix is M = M 1 + M 2 M 3 , and the corresponding optimality conditions include those in (2.39) as well as: 1 0 (M 3 ) ii (1) i = 0: (2.40) Subproblems for block-coordinate descent updates. In our algorithm, we solve instances of CHAPTER 2. GRAPH LEARNING FROM DATA 26 the subproblem derived based on the optimality conditions of Problem 1. So, letting C= 1 and using the conditions in (2.39), the optimality conditions for theu-th row/column of can be written as: c u + k u + m u = 0 (2.41) c u +k u = 0 (2.42) where the vectors, c u , k u and m u , and the scalars, c u , k u and m u , are obtained by partitioning C, K and M, as in (2.28) such that c u = (C) uu , k u = (K) uu and m u = (M) uu = 0. By using the relations in (2.30) and (2.31), we can rewrite (2.41) as 1 u u c u + k u + m u = 0: (2.43) Based on the relations in (2.39) and (2.42) the optimality conditions for the u-th column of (i.e., u ) include 1 u u k u + k u + m u = 0 ( u ) i 0 if (a u ) i = 1 ( u ) i = 0 if (a u ) i = 0 (2.44) where u and a u are obtained by partitioning and A as in (2.28), respectively, and the optimality conditions on m u follow from (2.35) and (2.36). Based on the above optimality conditions, in (2.39) and (2.44), the optimal update for the u-th row/column of (i.e., u ) corresponds to the solution of the following quadratic program: minimize u 1 2 k 2 u T u 1 u u +k u T u k u subject to ( u ) i 0 if (a u ) i = 1 ( u ) i = 0 if (a u ) i = 0 (2.45) The above problem can be simplied by eliminating its equality constraints determined by A (i.e., a u ), so that we formulate an equivalent version of (2.45) as the following nonnegative quadratic program [68], whose solution satises the optimality conditions in (2.44), minimize 1 2 T Q T p subject to 0 (2.46) where =( u ) S p = (k u =k u ) S Q = ( 1 u ) SS S =fij (a u ) i = 1g (2.47) CHAPTER 2. GRAPH LEARNING FROM DATA 27 so that is the vector whose elements are selected from the original variable vector u based on index setS. For example, ifS =f1; 2; 5g, then =[( u ) 1 ( u ) 2 ( u ) 5 ] T . Similarly, Q is constructed by selecting rows and columns of 1 u with index values inS, so the resulting Q is a submatrix of 1 u . It is important to note that the connectivity constraints (i.e., A or a u ) allow us to reduce the dimension of the variable u and therefore, the dimension of (2.45). 
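Since Q in (2.46) is positive definite (see Section 2.4.4), the subproblem can be handled with standard nonnegative least-squares routines. The following sketch uses SciPy's Lawson–Hanson NNLS solver as a stand-in for the more specialized solver discussed in Section 2.4.4; it is an illustration, not the thesis implementation:

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular
from scipy.optimize import nnls

def solve_nqp(Q, p):
    """Solve the nonnegative quadratic program (2.46):
         minimize 0.5 * b'Qb - b'p   subject to  b >= 0,
    with Q positive definite, by recasting it as nonnegative least squares."""
    R = cholesky(Q, lower=False)              # Q = R'R with R upper triangular
    b = solve_triangular(R.T, p, lower=True)  # solve R'b = p
    beta, _ = nnls(R, b)                      # argmin ||R*beta - b||^2  s.t. beta >= 0
    return beta

# Sanity check against the unconstrained minimizer when it happens to be nonnegative.
Q = np.array([[2.0, 0.5], [0.5, 1.0]])
p = np.array([1.0, 0.8])
print(solve_nqp(Q, p), np.linalg.solve(Q, p))   # the two solutions coincide here
```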
For Problem 2, based on the conditions in (2.39) and (2.40), we can similarly formulate a quadratic program to update u-th row/column of : minimize u 1 2 c 2 u T u 1 u u +c u T u k u subject to ( u ) i 0 if (a u ) i = 1 ( u ) i = 0 if (a u ) i = 0 T u 1 u (2.48) where u =() uu . The above problem is also a nonnegative quadratic program. To solve (2.48) for all u, we rst iteratively update each row/column of by solving the subproblem in (2.46). After completing a cycle of n row/column updates, we modify the diagonal entries of the updated , so that it satises the constraints in (2.48). The diagonal update parameters () are the optimal solutions of the following projection problem for given : minimize k b k 2 F subject to b = + diag() b 2L d (2.49) where is the vector of update parameters, andL d denotes the set of diagonally dominant gener- alized Laplacian matrices. Proposed Algorithm. Algorithm 1 is proposed to solve Problems 1 and 2 for a given connectivity matrix A, type of desired Laplacian matrix (i.e.,L g orL d ) and regularization matrix H. Basically, the proposed algorithm iteratively updates each row/column of the working estimate of Laplacian matrix ( b ) and its inverse ( b C) by solving the subproblem in (2.46). The main reason of updating C is that the derived subproblem is parametrized by 1 u , which depends on C as formulated in (2.32). In Algorithm 1, the for loop in lines 5{12 implements the cycle ofn row/column updates, where the update formulas (see lines 7, 9 and 10) are derived based on the relations in (2.29){(2.32). If we are interested in solving Problem 1 (ifL =L g ), then the algorithm skips the lines 13{19. For Problem 2 (forL=L d ) the for loop between the lines 14{18 iteratively modies the diagonal elements of b by solving the projection problem in (2.49) ensuring that the resulting b is a diagonally dominant matrix. The inverse of b (i.e., b C) is also iteratively updated, accordingly (see line 16). The overall procedure is repeated until a stopping criterion (line 20) has been satised. CHAPTER 2. GRAPH LEARNING FROM DATA 28 Algorithm 1 Generalized Graph Laplacian (GGL) Input: Sample statistic S, connectivity matrix A, regularization matrix H, target Laplacian setL and tolerance Output: and C 1: Set K = S + H 2: Initialize b C = ddiag(K) and b = b C 1 3: repeat 4: Set b pre = b 5: for u = 1 to n do 6: Partition b , b C, K and A as in (2.28) for u 7: Update b 1 u = b C u b c u b c T u =b c u 8: Solve (2.46) for with Q, p andS in (2.47) 9: Update b u and b u using the solution b from above: ( b u ) S = b ( b u ) S c = 0 b u = 1=k u + b T Q b 10: Updateb c u ,b c u and b C u : b c u = 1=( b u b T u b 1 u b u ) b c u = b 1 u b u =b c u b C u = b 1 u +b c u b c T u =b c u 11: Rearrange b and b C using P for u as in (2.28) 12: end for 13: ifL =L d (target Laplacian is a DDGL) then 14: for i = 1 to n do 15: if ( b 1) i < 0 then =( b 1) i 16: Set ( b ) ii = ( b ) ii + and update b C using (2.33) 17: end if 18: end for 19: end if 20: until criterion( b ; b pre ) 21: return = b and C = b C 2.4.3 Combinatorial Laplacian Estimation Derivation of the optimality conditions. 
Similar to our derivations in the previous subsection, in order to derive optimality conditions, we rst dene the Lagrangian function corresponding to Problem 3 as follows, logdet( + J) + Tr ((K + J)) + Tr(M); (2.50) where M = M 1 + M 2 + M 4 consists of Lagrange multipliers associated with the constraints in (2.9) such that M 1 and M 2 are as dened in (2.35) and (2.36), and the entries of M 4 are (M 4 ) ij = (4) i + (4) j where (4) i 2R; (4) j 2R (2.51) CHAPTER 2. GRAPH LEARNING FROM DATA 29 Algorithm 2 Combinatorial Graph Laplacian (CGL) Input: Sample statistic S, connectivity matrix A, regularization matrix H and tolerance Output: and C 1: Set J=(1=n)11 T e K=S+H+J 2: Initialize b C = ddiag( e K) and b = b C 1 3: repeat 4: Set b pre = b 5: for u = 1 to n do 6: Partition b , b C, e K and A as in (2.28) for u 7: Calculate b 1 u = b C u b c u b c T u =b c u 8: Solve (2.58) for with Q, p andS in (2.59) 9: Update b u and b u using the solution b from above: ( b u ) S = ((1=n)1 b ) ( b u ) S c = (1=n)1 b u = 1= e k u + ( b (1=n)1) T Q( b (1=n)1) 10: Updateb c u ,b c u and b C u : b c u = 1=( b u b T u b 1 u b u ) b c u = b 1 u b u =b c u b C u = b 1 u +b c u b c T u =b c u 11: Rearrange b and b C using P for u as in (2.28) 12: end for 13: for i = 1 to n do 14: if ( b 1) i 16= 0 then =( b 1) i + 1 15: Set ( b ) ii = ( b ) ii + and update b C using (2.33) 16: end if 17: end for 18: until criterion( b ; b pre ) 19: return = b J and C = b C J fori;j = 1;:::;n. Based on the Lagrangian stated in (2.50), the necessary and sucient optimality conditions for the problem in (2.9) are e 1 + e K + M 1 + M 2 + M 4 = O ( e ) ij 1=n if (A) ij = 1, i6=j ( e ) ij = 1=n if (A) ij = 0, i6=j e 0 e 1 = 1 (M 1 ) ij (( e ) ij 1=n) = 0 (2.52) where e = + J, e C = ( + J) 1 and e K = K + J. The matrices M 1 , M 2 and M 4 are dened as in (2.35), (2.36) and (2.51), respectively. For the u-th row/column of e , the rst optimality condition in (2.52) reduces to e c u + e k u + m u = 0 (2.53) e c u + e k u +m u = 0 (2.54) CHAPTER 2. GRAPH LEARNING FROM DATA 30 where the condition in (2.53) can also be stated using the relations in (2.30) and (2.31) as e 1 u e u e c u + e k u + m u = 0: (2.55) Subproblem for block-coordinate descent updates. Based on the optimality conditions stated in (2.52) and (2.55), we derive the following quadratic program solved for updatingu-th row/column of e , minimize e u 1 2 e c 2 u e T u e 1 u e u +e c u e T u e k u subject to ( e u ) i 1=n if (a u ) i = 1 ( e u ) i = 1=n if (a u ) i = 0 ( e u (1=n)1) T 1 = e u (1=n) (2.56) By changing variables e u = u + (1=n)1, e u = u + (1=n) and dividing the objective function with e c 2 u , we rewrite (2.56) as a quadratic program of the standard form, minimize u 1 2 T u e 1 u u + T u e k u e c u + 1 n e 1 u 1 ! subject to ( u ) i 0 if (a u ) i = 1 ( u ) i = 0 if (a u ) i = 0 T u 1 = u (2.57) which can be simplied by eliminating the equality constraints as follows, minimize 1 2 T Q T p subject to 0 (2.58) where =( u ) S p = ( e k u = e k u + (1=n) e 1 u 1) S Q = ( e 1 u ) SS S =fij (a u ) i = 1g: (2.59) In order to solve (2.57) for allu, we rst iteratively update each row/column by solving the nonneg- ative quadratic program in (2.58). After each cycle ofn row/column updates, the diagonal entries of the resulting matrix ( e ) are modied to satisfy the combinatorial Laplacian constraints T u 1= u for u = 1;:::;n in (2.57). 
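Putting the pieces together, the row/column update cycle shared by Algorithms 1 and 2 can be sketched as follows for the simpler GGL case (Problem 1) with a fully connected A; it reuses the solve_nqp sketch above together with the update formulas (2.29)–(2.32), and omits the diagonal projection step discussed next. This is an illustration under those simplifications, not the optimized thesis implementation:

```python
import numpy as np

def estimate_ggl(S, alpha=0.0, n_cycles=100):
    """Sketch of the block-coordinate descent cycle for GGL estimation (Problem 1)
    with a fully connected connectivity matrix and the l1-regularization in (2.3)."""
    n = S.shape[0]
    K = S + alpha * (2 * np.eye(n) - np.ones((n, n)))   # K = S + H with H as in (2.3)
    Theta = np.diag(1.0 / np.diag(K))                    # diagonal initialization
    C = np.diag(np.diag(K))                              # C = inv(Theta)
    for _ in range(n_cycles):                            # fixed cycle count for simplicity
        for u in range(n):
            m = np.arange(n) != u
            # Inverse of the fixed block, read off from C as in (2.32).
            Theta_u_inv = C[np.ix_(m, m)] - np.outer(C[m, u], C[m, u]) / C[u, u]
            k_u, k_uu = K[m, u], K[u, u]
            # Row/column update via the nonnegative QP (2.46)-(2.47).
            beta = solve_nqp(Theta_u_inv, k_u / k_uu)
            theta_u = -beta                              # GGL off-diagonals are nonpositive
            theta_uu = 1.0 / k_uu + beta @ Theta_u_inv @ beta
            Theta[m, u] = Theta[u, m] = theta_u
            Theta[u, u] = theta_uu
            # Update C = inv(Theta) with the block formulas (2.29)-(2.31).
            c_uu = 1.0 / (theta_uu - theta_u @ Theta_u_inv @ theta_u)
            c_u = -c_uu * (Theta_u_inv @ theta_u)
            C[np.ix_(m, m)] = Theta_u_inv + np.outer(c_u, c_u) / c_uu
            C[m, u] = C[u, m] = c_u
            C[u, u] = c_uu
    return Theta
```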
The diagonal update parameters are optimized by solving the following projection problem for given e minimize k e b k 2 F subject to b = e + diag() ( b J)2L c (2.60) CHAPTER 2. GRAPH LEARNING FROM DATA 31 where is the vector of update parameters,L c denotes the set of combinatorial Laplacian matrices and J=(1=n)11 T . Proposed Algorithm. Algorithm 2 is proposed to solve Problem 3 for a given connectivity matrix A and regularization matrix H. Although the basic structures of Algorithms 1 and 2 are similar, Algorithm 2 has three major dierences. Firstly, the algorithm does not directly estimate the target Laplacian matrix (i.e., ). Instead, it iteratively solves for the matrix e = + J whose entries are shifted by (1=n). Secondly, the subproblem solved for updating each row/column and the associated update formulas are dierent (see lines 8, 9 and 10 in Algorithm 2). Thirdly, the for loop in lines 13{17 maintains that the estimate of e leads to a CGL matrix (via the b J transformation) by solving the projection problem in (2.60). In Algorithm 2, we propose to estimate the shifted matrix e because of the singularity of combinatorial Laplacian matrices, by their construction. However, our result in Proposition 1 shows that one can solve the CGL problem in (2.8) by solving the equivalent problem in (2.9). Based on this result, we derive optimality conditions for (2.9), and develop Algorithm 2 which iteratively estimates e until the convergence criterion has been satised. Then, the optimal is recovered by subtracting J from the estimated e as in line 19 of Algorithm 2. 2.4.4 Convergence and Complexity Analysis of Algorithms Convergence. The following proposition addresses the convergence of our proposed algorithms based on the results from [58, 67, 69]. Proposition 6. Algorithms 1 and 2 guarantee convergence to the global optimal solutions of Prob- lems 1, 2 and 3. Proof. In minimization of a strictly convex and dierentiable function over a convex set, (with a proper selection of the step size) a block-coordinate descent algorithm guarantees convergence to the optimal solution [67, 58]. As stated in Proposition 2, all of our problems of interest are convex. Also, the objective functions are dierentiable (see in Section 2.4). Moreover, convergence conditions (see in [67]) require that each block-coordinate update be optimal (i.e., each iteration optimally solves the subproblem associated with selected coordinate directions). In Algorithms 1 and 2, this condition is also satised, since nonnegative quadratic programs (subproblems) are derived using optimality conditions, so each of their solutions leads to optimal block-coordinate (row/column) updates. These subproblems are also strictly convex, since the Q matrix in (2.46) and (2.58) is positive denite throughout all iterations. To prove this, we use the following Schur complement condition on partitions of as in (2.28), 0 () u 0 and u T u 1 u u > 0: (2.61) where u (and therefore, 1 u ) is xed and positive denite from the previous iteration. Noting that CHAPTER 2. GRAPH LEARNING FROM DATA 32 both algorithms initialize and C as (diagonal) positive denite matrices, u T u 1 u u = 1=c u > 0 by (2.31), then the updated is also positive denite. Since both algorithms solve for the optimal row/column updates at each iteration, based on the result in [67], this proves convergence to the optimal solution for Problem 1. 
For Problems 2 and 3, Algorithms 1 and 2 implement a variation of the general projected block-coordinate descent method, whose convergence to the optimal solution is shown in [69]. Also, it is trivial to show that additional updates on diagonal elements (see in (2.33)) maintain positive deniteness of and C. Therefore, both algorithms guarantee convergence to the optimal solution. Complexity. In Algorithms 1 and 2, each block-coordinate descent iteration has O(T p (n)+n 2 ) complexity where O(T p (n)) is the worst-case complexity of solving the subproblem with dimension n, and the n 2 term is due to updating b 1 u . After a cycle of n row/column updates, updating a diagonal entry of b and its inverse, b C, also has O(n 2 ) complexity. Depending on the sparsity of solutions (i.e., graphs) the complexity can be reduced to O(T p (s) +n 2 ) where s is the maximum number of edges connected to a vertex (i.e., number of non-zero elements in any row/column of ). Moreover, both proposed algorithms use diagonal matrices to initialize b and b C. In practice, better initializations (i.e., \warm starts") and randomized implementations [58] can be exploited to reduce the algorithm runtime. To solve the subproblems in (2.46) and (2.58), which are specically nonnegative quadratic programs, we employ an extension of Lawson-Hanson algorithm [70] with a block principal pivoting method [71]. Since nonnegative quadratic programs require varying number of iterations to converge for each row/column update, it is hard to characterize the overall complexity of our algorithms. Yet, the complexity of solving the subproblems is T p (n) = (n 3 ), which can be signicantly reduced if the solutions are sparse (T p (s)= (s 3 ) fors-sparse solutions). In Section 2.5, we present empirical complexity results for the proposed algorithms. 2.5 Experimental Results In this section, we present experimental results for comprehensive evaluation of our proposed graph learning techniques. Firstly, we compare the accuracy of our proposed algorithms and the state-of- the-art methods, where the datasets are articially generated based on attractive GMRFs. Speci- cally, our GGL and DDGL estimation methods based on Algorithm 1 (GGL andDDGL) are evaluated by benchmarking against the Graphical Lasso algorithm (GLasso) [30]. Algorithm 2 (CGL) is com- pared to the methods proposed for estimating shifted CGL matrices (SCGL) [33] and learning graphs from smooth signals (GLS) [31, 32], as well as the graph topology inference approach (GTI) proposed in [34]. Secondly, empirical computational complexity results are presented, and the advantage of exploiting connectivity constraints is demonstrated. Thirdly, the proposed methods are applied to learn similarity graphs from a real categorical dataset with binary entries, and the results are visu- ally investigated. Finally, we evaluate the eect of model mismatch and of incorporating inaccurate CHAPTER 2. GRAPH LEARNING FROM DATA 33 (a)G (64) grid : 88 grid graph (b)G (64,0.2) ER : Erdos-Renyi graph (c)G (64,0.2,0.9) M : Modular graph with 4 modules Figure 2.2: Examples of dierent graph connectivity models, (a) grid (b) Erdos-Renyi (c) modular graphs used in our experiments. All graphs have 64 vertices. connectivity constraints on graph Laplacian estimation. 2.5.1 Comparison of Graph Learning Methods Datasets. In order to demonstrate the performance of the proposed algorithms, we create several articial datasets based on dierent graph-based models. 
Then the baseline and proposed algorithms are used to recover graphs (i.e., graph-based models) from the articially generated data. To create a dataset, we rst construct a graph, then its associated Laplacian matrix, L, is used to generate independent data samples from the distribution N(0; L y ). A graph (i.e., L) is constructed in two steps. In the rst step, we determine the graph structure (i.e., connectivity) based on one of the following three options (see Figure 2.2): Grid graph,G (n) grid , consisting ofn vertices attached to their four nearest neighbors (except the vertices at boundaries). Random Erdos-Renyi graph,G (n,p) ER , withn vertices attached to other vertices with probability p. Random modular graph (also known as stochastic block model),G (n,p1,p2) M withn vertices and four modules (subgraphs) where the vertex attachment probabilities across modules and within modules are p 1 and p 2 , respectively. In the second step, the graph weights (i.e., edge and vertex weights) are randomly selected based on a uniform distribution from the interval [0:1; 3], denoted as U(0:1; 3). Note that this procedure leads to DDGLs, which are used in comparing our Algorithm 1 against GLasso. For evaluation of Algorithm 2, the edge weights are randomly selected from the same distribution U(0:1; 3), while all vertex weights are set to zero. CHAPTER 2. GRAPH LEARNING FROM DATA 34 Experimental setup. For comprehensive evaluation, we create various experimental scenarios by choosing dierent graph structures (grid, Erdos-Renyi or modular) with varying number of vertices. Particularly, the GGL, DDGL and GLasso methods are tested on graphs consisting of 64 or 256 vertices. Since SCGL and GTI approaches [33, 34] do not currently have an ecient algorithm, the CGL estimation methods are only evaluated on graphs with 36 vertices. We also test the performance of the proposed methods with and without connectivity constraints. For each scenario, Monte-Carlo simulations are performed to test baseline and proposed algorithms with varying number of data samples (k). All algorithms use the following convergence criterion with the tolerance = 10 4 , criterion( b ; b pre ) = k b b pre k F k b pre k F ; (2.62) where b and b pre denote the graph parameters from current and previous steps, respectively (see Algorithms in 1 and 2). In order to measure graph learning performance, we use two dierent metrics, RE( b ; ) = k b k F k k F (2.63) which is the relative error between the ground truth graph ( ) and estimated graph parameters ( b ), and FS( b ; ) = 2tp 2tp +fn +fp (2.64) is the F-score metric (commonly used metric to evaluate binary classication performance) calculated based on true-positive (tp), false-positive (fp) and false-negative (fn) detection of graph edges in estimated b with respect to the ground truth edges in . F-score takes values between 0 and 1, where the value 1 means perfect classication. In our experiments, since SCGL and GLasso methods employ kk 1 for regularization, we use the same regularization in our proposed methods (i.e., the matrix H is selected as in (2.3) for Problems 1, 2 and 3) to fairly compare all methods. Monte-Carlo simulations are performed for each proposed/baseline method, by successively solving the associated graph learning problem with dierent regularization parameters (i.e., , 1 and 2 ) to nd the (best) regularization minimizing RE. The corresponding graph estimate is also used to calculate FS. 
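A minimal sketch of this data-generation procedure and of the RE and FS metrics in (2.63)–(2.64) is given below; the function names are illustrative and only the Erdos-Renyi connectivity model is shown:

```python
import numpy as np

def random_ddgl(n, p, rng):
    """Random DDGL as in the setup: Erdos-Renyi connectivity with edge probability p,
    edge and vertex weights drawn from U(0.1, 3)."""
    upper = np.triu((rng.random((n, n)) < p) * rng.uniform(0.1, 3.0, size=(n, n)), k=1)
    W = upper + upper.T                                   # symmetric edge weights
    v = rng.uniform(0.1, 3.0, size=n)                     # vertex (self-loop) weights
    return np.diag(W.sum(axis=1) + v) - W

def relative_error(Theta_hat, Theta_true):
    """RE in (2.63)."""
    return np.linalg.norm(Theta_hat - Theta_true, "fro") / np.linalg.norm(Theta_true, "fro")

def f_score(Theta_hat, Theta_true, tol=1e-6):
    """FS in (2.64), computed from the detected edge supports (off-diagonal entries)."""
    est = np.abs(np.triu(Theta_hat, 1)) > tol
    ref = np.abs(np.triu(Theta_true, 1)) > tol
    tp = np.sum(est & ref); fp = np.sum(est & ~ref); fn = np.sum(~est & ref)
    return 2 * tp / (2 * tp + fn + fp)

rng = np.random.default_rng(0)
L_true = random_ddgl(64, 0.1, rng)                         # nonsingular DDGL
X = rng.multivariate_normal(np.zeros(64), np.linalg.inv(L_true), size=30 * 64)
S = X.T @ X / X.shape[0]                                   # sample covariance (k/n = 30)
```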
Particularly, for the GGL, DDGL, CGL, SCGL and GLasso methods, is selected from the following set: f0g[f0:75 r (s max p log(n)=k)jr = 1; 2; 3;:::; 14g; (2.65) wheres max =max i6=j j(S) ij j is the maximum o-diagonal entry of S in absolute value, and the scaling term p log(n)=k is used for adjusting the regularization according tok andn as suggested in [72, 73]. For both of the GLS methods [31, 32] addressing the problems in (2.15) and (2.16), 1 is selected CHAPTER 2. GRAPH LEARNING FROM DATA 35 fromf0g[f0:75 r s max jr = 1; 2; 3;:::; 14g, and for the problem in (2.16) the parameter 2 is selected by ne tuning. For GLS [31], the Tr() = n constraint in (2.15) is set as Tr() = Tr( ). Since GLS [32] and GTI [34] approaches generally result in severely biased solutions with respect to the ground truth (based on our observations from the experiments), their RE values are calculated after normalizing the estimated solution as b = (Tr( )=Tr()). 0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5 0.75 1 5 10 30 100 250 500 1000 Average RE k=n GGL(A;) DDGL(A;) GGL() DDGL() GLasso()[30] 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0.5 0.75 1 5 10 30 100 250 500 1000 Average FS k=n GGL(A;) DDGL(A;) GGL() DDGL() GLasso()[30] (a) Average performance results for learning generalized graph Laplacian matrices: The proposed methods GGL() and DDGL() outperform GLasso(). The degree of improvement achieved by GGL(A;) and DDGL(A;), incor- porating the connectivity constraints, is also demonstrated. 0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.5 0.75 1 5 10 30 100 250 500 1000 Average RE k=n CGL(A;) CGL() SCGL()[33] GLS( 1 )[31] GLS( 1; 2 )[32] GTI[34] 0.4 0.5 0.6 0.7 0.8 0.9 1 0.5 0.75 1 5 10 30 100 250 500 1000 Average FS k=n CGL(A;) CGL() SCGL()[33] GLS( 1 )[31] GLS( 1; 2 )[32] GTI[34] (b) Average performance results for learning combinatorial graph Laplacian matrices: The proposed CGL() method outperforms all baseline approaches, and the dierence increases as gets k=n larger. Incorporating the connectivity constraints (i.e., CGL(A;)) naturally improves the performance. Figure 2.3: Comparison of the (a) GGL and (b) CGL estimation methods: The algorithms are tested on grid (G (n) grid ) and random graphs (G (n,0.1) ER andG (n,0.1,0.3) M ). Results. Figure 2.3 demonstrates graph learning performances of dierent methods (in terms of average RE and FS) with respect to the number of data samples, used to calculate sample covariance S, per number of vertices (i.e., k=n). In our results, GGL(A;), DDGL(A;) and CGL(A;) refer to solving the graph learning problems with both connectivity constraints and regularization, where the constraints are determined based on the true graph connectivity. GGL(), DDGL() and CGL() refer to solving the problems with ` 1 -regularization only (i.e., without connectivity constraints). CHAPTER 2. GRAPH LEARNING FROM DATA 36 As shown in Figure 2.3, our proposed methods outperform all baseline approaches (namely GLasso [30], SCGL [33], GLS [31, 32] and GTI [34]) in terms of both RE and FS metrics. Naturally, incorporating the connectivity constraints (e.g., in GGL(A;) and CGL(A;)) signicantly improves the graph learning performance. However, for smallk=n, it may not provide a perfect FS, even if the true graph connectivity is given. This is because there may not be a sucient number of data samples to eectively recover the graph information. Specically fork=n1, both S and the estimated graph Laplacian have low-rank. 
Since low-rank graph Laplacians correspond to disconnected graphs (i.e., graphs with more than one connected components), they can cause false-negative (fn) detections which reduce FS. Furthermore, the proposed methods outperform all baseline approaches regardless of the size of the graph (n) and the connectivity model (A). As an example, for xedk=n=30, Table 2.1 compares average RE results for dierent graph connectivity models and number of vertices (n for 64 and 256). Also, Figures 2.4 and 2.5 illustrate sample GGL and CGL estimation results and compare dierent methods. Table 2.1: Average relative errors for dierent graph connectivity models and number of vertices with xed k=n = 30 Connectivity models Average RE(n = 64)j Average RE(n = 256) GLasso() GGL() GGL(A;) G (n) grid 0.079j 0.078 0.053j 0.037 0.040j 0.027 G (n,0.1) ER 0.105j 0.112 0.077j 0.082 0.053j 0.053 G (n,0.1,0.3) M 0.102j 0.124 0.075j 0.081 0.051j 0.053 For estimation of GGL matrices, Figure 2.3a demonstrates that the best set of results are obtained by solving the DDGL problem, DDGL(A;) and DDGL(). This is expected since the random data samples are generated based on DDGL matrices (as part of our experimental setup), and exploiting additional information about the type of graph Laplacian improves graph learning performance. Generally, solving the GGL problem, GGL(A;) and GGL(), also provides a good performance, where the dierence between GGL and DDGL is often negligible. Moreover, both of the proposed GGL() and DDGL() methods (i.e., Algorithm 1) perform considerably better than GLasso, espe- cially when the number of data samples (k=n) is small, since exploiting Laplacian constraints fullls the model assumptions of attractive GMRFs. For estimation of CGL matrices, Figure 2.3b shows thatCGL (i.e., Algorithm 2) provides signicantly better performance with increasing number of data samples (k=n) as compared to the SCGL, GLS and GTI approaches. Particularly, SCGL and GLS have limited accuracy even for large number of samples (e.g., k=n100). The main reason is due to their problem formulations, where SCGL [33] optimizes a shifted CGL instead of directly estimating a CGL matrix as stated in (2.13). The GLS and GTI methods solve the problems in (2.15), (2.16) and (2.18), whose objective functions are derived without taking multivariate Gaussian distributions into CHAPTER 2. GRAPH LEARNING FROM DATA 37 −2 0 2 4 6 8 10 (a) The ground truth GGL ( ) −2 0 2 4 6 8 10 (b) S 1 : (RE,FS)=(0.171,0.20) −2 0 2 4 6 8 10 (c) GLasso [30]: (RE,FS)=(0.078,0.64) −2 0 2 4 6 8 10 (d) GGL: (RE,FS)=(0.054,0.89) Figure 2.4: Illustration of precision matrices (whose entries are color coded) estimated from S for k=n = 30: In this example, (a) the ground truth is a GGL associated with a grid graph havingn=64 vertices. From S, the matrices are estimated by (b) inverting S, (c) GLasso() and (d) GGL(). The proposed method provides the best estimation in terms of RE and FS, and the resulting matrix is visually the most similar to the ground truth . −2 0 2 4 6 8 (a) The ground truth CGL ( ) −2 0 2 4 6 8 (b) GLS [31]: (RE,FS)=(0.187,0.75) −1 0 1 2 3 4 5 6 (c) GTI [34]: (RE,FS)=(0.322,0.39) −2 0 2 4 6 8 (d) CGL: (RE,FS)=(0.085,0.82) Figure 2.5: Illustration of precision matrices (whose entries are color coded) estimated from S for k=n = 30: In this example, (a) the ground truth is a CGL associated with a grid graph having n = 36 vertices. From S, the matrices are estimated by (b) GLS( 1 ), (c) GTI and (d) CGL(). 
The proposed method provides the best estimation in terms of RE and FS, and the resulting matrix is visually the most similar to the ground truth . account. Specically, the objective in (2.16) serves as a lower-bound for the maximum-likelihood criterion in (2.24) (see Proposition 4). Besides, the performance dierence against GTI is substantial [34] across dierent k=n, and GTI generally does not converge if S has low-rank (i.e., k=n < 1), so those results are not available. 2.5.2 Empirical Results for Computational Complexity Figure 2.6 compares the computational speedups achieved by our proposed algorithms over GLasso, which is implemented according to the P-GLasso algorithm presented in [49]. In our experiments, the algorithms are tested on Erdos-Renyi graphs withn = 64 vertices (G (64;p) ER ). By varying the parameter p, we evaluate the speedups at dierent graph sparsity levels (p = 1 means a fully connected graph). For eachp, 10 dierent graphs and associated datasets (withk=n=1000) are generated as discussed CHAPTER 2. GRAPH LEARNING FROM DATA 38 0 5 10 15 20 25 30 35 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 T GLasso = T p GGL(A;) GGL() DDGL(A;) DDGL() CGL(A;) CGL() Figure 2.6: Computational speedup as compared to GLasso ( T GLasso = T ). in the previous section. The speedup values are calculated using T GLasso = T , where T and T GLasso denote average execution times of the test algorithm (Algorithm 1 or 2) and GLasso algorithm, respectively. Since GLasso is approximately 1:5 times faster than both GLS methods [31, 32] on average, and the other methods [33, 34] do not have ecient implementations, we only use GLasso as the baseline algorithm in this experiment. As shown in Figure 2.6, the proposed methods provide faster convergence than GLasso regardless of p (i.e., sparsity level), and the computational gain is substantial for learning sparse graphs (e.g., p 0:3), where incorporating the connectivity constraints contributes to an additional 2 to 3-fold speedup over the methods without exploiting connectivity constraints. In the worst case, for dense graphs (e.g., p 0:8), our methods converge approximately 5 times faster than the GLasso. This is mainly because, at each iteration, GLasso solves an ` 1 -regularized quadratic program [30, 49] having a nonsmooth objective function, and it is generally harder to solve compared to the (smooth) nonnegative quadratic program in (2.46). 2.5.3 Illustrative Results on Real Data We present some illustrative results to demonstrate that the proposed methods can also be useful to represent categorical (non-Gaussian) data. In our experiments, we use the Animals dataset [74] to learn weighted graphs, where graph vertices denote animals and edge weights represent the similarity between them. The dataset consists of binary values (0 or 1) assigned to d=102 features for n = 33 animals. Each feature corresponds to a true-false question such as, \has lungs?", \is warm-blooded?" and \lives in groups?". Using this dataset, the statistic matrix S is calculated as CHAPTER 2. GRAPH LEARNING FROM DATA 39 S=(1=d) XX T +(1=3) I where X is the mean removednd data matrix. The (1=3) I term is added based on the variational Bayesian approximation result in [42] for binary data. Then, S is used as input to the GLasso, GGL and CGL methods. Here, we only compare methods minimizing the objective function in (2.1) which is shown to be a suitable metric for binary data in [42, 43]. 
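A sketch of how this statistic is formed is given below; the binary matrix is a random placeholder standing in for the actual Animals data [74], and the direction of the mean removal is an assumption:

```python
import numpy as np

# Statistic for the Animals dataset: S = (1/d) X X^T + (1/3) I, where X is the
# mean-removed n x d data matrix (n = 33 animals, d = 102 binary features).
rng = np.random.default_rng(0)
n, d = 33, 102
X_binary = rng.integers(0, 2, size=(n, d)).astype(float)   # placeholder for the real data
X = X_binary - X_binary.mean(axis=1, keepdims=True)        # remove each animal's mean (assumed)
S = (X @ X.T) / d + np.eye(n) / 3.0                        # (1/3) I term from [42]
# S is then passed to GLasso/GGL/CGL exactly as a sample covariance would be.
```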
Figure 2.7 illustrates the graphs constructed by using GLasso, GGL and CGL with different regularization parameters. The positive and negative edge weights estimated by GLasso are shown side-by-side in Figures 2.7a and 2.7d, where the magnitudes of negative weights are substantially smaller than most of the positive weights. In other words, positive partial correlations are more dominant than negative partial correlations in the precision matrix. Since the proposed GGL and CGL find graphs with nonnegative edge weights (i.e., closest graph Laplacian projections with no negative edge weights), the corresponding results are similar to the graphs with nonnegative weights in Figures 2.7a and 2.7d. The results follow the intuition that larger positive weights should be assigned between animals considered to be similar. Such pairs of animals include (gorilla, chimp), (dolphin, whale), (dog, wolf) and (robin, finch).

Figure 2.7: The graphs learned from the Animals dataset using the methods GLasso(α), GGL(α) and CGL(α), shown for α = 0 and α = 0.02: GGL(α) leads to sparser graphs compared to the others. The results follow the intuition that larger positive weights should be assigned between similar animals, although the dataset is categorical (non-Gaussian). (Panels (a)/(d) show the positive and absolute negative weights estimated by GLasso; panels (b)/(e) and (c)/(f) show the GGL and CGL results, respectively.)

2.5.4 Graph Laplacian Estimation under Model Mismatch

In the case of a model mismatch (i.e., when the data is not generated by a GMRF with a graph Laplacian as its precision matrix), the proposed methods allow us to find the closest graph Laplacian fit with respect to the original model in the log-determinant Bregman divergence sense. Figure 2.8 illustrates two examples (simulating sparse and dense mismatches) where the ground truth precision matrices have both negative and positive edge weights (Figure 2.8a). From their true covariances (i.e., the inverses of the ground truth precision matrices), GLasso recovers the original model by allowing both positive and negative edge weights (Figure 2.8b). On the other hand, the proposed GGL finds the closest graph Laplacian with respect to the ground truth (Figure 2.8c). As shown in the figure, GGL maintains all the edges with positive weights and also introduces additional connectivity due to the negative weights in the ground truth. Note that this form of projection (based on the log-determinant Bregman divergence) does not necessarily assign zeros to the corresponding negative edge weights. In general, the identification of edges with positive weights under model mismatches is nontrivial, as also pointed out in [54].

Figure 2.8: Illustration of precision matrices whose entries are color coded, where negative (resp. positive) entries correspond to positive (resp. negative) edge weights of a graph: (a) ground truth precision matrices are randomly generated with (top) sparse and (bottom) dense positive entries (i.e., negative edge weights); from the corresponding true covariances, the precision matrices are estimated by (b) GLasso(α = 0) and (c) GGL(α = 0) without ℓ1-regularization.

2.5.5 Graph Laplacian Estimation under Connectivity Mismatch

We now evaluate empirically the effect of inaccurate selection of the connectivity matrix (A) used to determine the connectivity constraints in our problems. For this purpose, in our experiments, we randomly construct multiple 64 × 64 GGL matrices based on Erdos-Renyi graphs with p = 0.1 (i.e., G_ER^(64,0.1)) and generate datasets associated with the GGLs, as discussed in Section 2.5.1. Then, the proposed GGL method is employed to estimate graphs from the generated data using different connectivity constraints, whose corresponding connectivity matrices are obtained by randomly swapping the entries of the true connectivity matrix to simulate connectivity mismatches. Specifically, a 5% mismatch is obtained by randomly exchanging 5% of the ones in A with zero entries. Figure 2.9 compares the accuracy of GGL estimation with connectivity constraints under different levels (5%, 10%, 15% and 25%) of connectivity mismatch against GGL estimation with true connectivity constraints (i.e., GGL(A)) and without any connectivity constraints (i.e., GGL(A_full)); the construction of the mismatched connectivity matrices is sketched below.
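The following sketch (a hypothetical helper mirroring the description above, with the swapped edges chosen uniformly at random) simulates such a connectivity mismatch:

```python
import numpy as np

def perturb_connectivity(A, fraction, rng):
    """Swap a given fraction of the ones in the connectivity matrix A with zero entries,
    preserving the total number of edges (e.g., fraction=0.05 gives a 5% mismatch)."""
    iu = np.triu_indices_from(A, k=1)
    vals = A[iu].copy()
    ones = np.flatnonzero(vals == 1)
    zeros = np.flatnonzero(vals == 0)
    n_swap = int(round(fraction * ones.size))
    drop = rng.choice(ones, size=n_swap, replace=False)    # true edges removed from A
    add = rng.choice(zeros, size=n_swap, replace=False)    # spurious edges added to A
    vals[drop] = 0
    vals[add] = 1
    A_new = np.zeros_like(A)
    A_new[iu] = vals
    return A_new + A_new.T
```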
The results show that using slightly inaccurate connectivity information can be useful when the number of data samples is small (e.g.,k=n<5), where the performance is similar toGGL(A) and can outperform GGL(A full ), even though there is a 25% mismatch (GGL(A 25% )). However, the performance dierence with respect to GGL(A) increases as the number of data samples increases (e.g., k=n> 10), where both GGL(A) and GGL(A full ) performs substantially better. In this case, the graph estimation without connectivity constraints (GGL(A full )) can be preferred if connectivity information is uncertain. CHAPTER 2. GRAPH LEARNING FROM DATA 42 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.5 0.75 1 5 10 30 100 250 500 1000 Average RE k=n GGL(A) GGL(A full ) GGL(A 5% ) GGL(A 10% ) GGL(A 15% ) GGL(A 25% ) Figure 2.9: Average relative errors for dierent number of data samples (k=n), where GGL(A) exploits the true connectivity information in GGL estimation, and GGL(A full ) refers to the GGL estimation without connectivity constraints. Dierent levels of connectivity mismatch are tested. For example, GGL(A 5% ) corresponds to the GGL estimation with connectivity constraints having a 5% mismatch. No ` 1 -regularization is applied in GGL estimation (i.e., = 0). 2.6 Conclusion In this chapter, we have formulated various graph learning problems as estimation of graph Lapla- cian matrices, and proposed ecient block-coordinate descent algorithms. We have also discussed probabilistic interpretations of our formulations by showing that they are equivalent to parameter estimation of dierent classes of attractive GMRFs. The experimental results have demonstrated that our proposed techniques outperform the state-of-the-art methods in terms of accuracy and com- putational eciency, and both can be signicantly improved by exploiting the additional structural (connectivity) information about the problem at hand. Chapter 3 Graph Learning from Filtered Signals: Graph System Identication In Chapter 2, we have proposed algorithms for learning three dierent types of Laplacian matrices, where models of interest are dened by graph Laplacians only. The present chapter extends Chapter 2 by (i) introducing more general graph-based models based on graph systems, dened by a graph (i.e., graph Laplacian) and a graph-based lter (i.e., a matrix function of a graph Laplacian) as illustrated in Figure 3.1, and (ii) proposing an algorithm to learn parameters of a graph system from multiple signal/data observations. For graph-based modeling, graph systems provide a general abstraction where graphs repre- sent pairwise relations between the samples, and graph-based lters (GBFs) a determine the signal smoothness. With this construction, we formulate the graph-based modeling as a graph system identication (GSI) problem with an ` 1 -regularized maximum likelihood (ML) criterion, where the goal is to jointly identify a graph Laplacian L and a GBF h from signals/data. In order to solve the GSI problem, we propose an alternating optimization algorithm that rst optimizes the graph by xing the GBF, and then designs the GBF by xing the graph. The proposed algorithm involves three main steps: Graph-based lter (GBF) identication: For a given graph Laplacian L, the purpose of this step is to design a GBF h, whose inverse function (i.e., h 1 ) is used for the preltering step discussed next. A version of the work presented in this chapter will be submitted for publication subsequently [75]. 
a A formal denition for graph-based lters is given in Section 1.3 (see Denition 6). 43 CHAPTER 3. GRAPH LEARNING FROM FILTERED SIGNALS 44 Preltering: We perform an inverse ltering operation (h 1 ) on observed signals, then the preltered signals are used in graph Laplacian estimation. We show that this step of our algorithm signicantly improves the accuracy of graph estimation. Graph Laplacian Estimation: To learn graphs from preltered signals, we employ the CGL estimation algorithm introduced in Chapter 2. Although the present chapter focuses on graphs associated with CGL matrices, the algorithm can be simply extended for dierent types of graph Laplacian matrices. The proposed algorithm is developed under the general assumption that h is a one-to-one function, hence its inverse functionh 1 exists. Based on this condition, we specically consider the parametric GBFs listed in Table 3.1, which are one-to-one functions and have useful applications discussed in Section 3.2.3. The rst three GBF types in the table dene basic scaling and shifting operations, and the exponential decay and -hop localized GBFs provide diusion-based models. These GBFs can also be used for modeling dierent classes of smooth graph signals satisfying that most of the signal energy is concentrated in the low graph frequencies a , since they yield larger lter responses (h ) in lower graph frequencies () as illustrated in Figure 3.2. In order to accommodate the GBFs in Table 3.1 into our algorithm (i.e., the GBF identication step), we propose specic methods to nd the lter parameter that fully characterizes the GBF. For a selected GBF type, our proposed algorithm guarantees optimal identication of and L in an ` 1 -regularized ML sense for frequency shifting, variance shifting and -hop localized GBFs. However, for frequency scaling and exponential decay GBFs, our algorithm cannot nd the optimal in general, but it guarantees that the estimated L is optimal up to a constant factor. In practice, the type of GBF and its parameter can also be selected based on the prior knowledge available about the problem and the application. In this case, dierent GBFs and their parameters can be tested until the estimated graph and GBF pair achieves the desired performance (e.g., in terms of mean square error, likelihood or sparsity) where the parameter serves as a regularization parameter, which can be used for tuning smoothness of the signal for example. Moreover, our algorithm can be extended to support dierent types of GBFs with more than one lter parameter by introducing specic methods to identify their parameters. As long as a specied GBF (h) has an inverse function (h 1 ), the preltering and graph Laplacian estimation steps can be directly utilized to learn graphs from signals/data. This chapter is organized as follows. Section 3.1 presents the related work. We formulate the graph system identication problem and discuss its variations in Section 3.2. In Section 3.3, we derive optimality conditions and develop methods for the graph system identication. Experimental results are presented in Section 3.4 and Section 3.5 draws some conclusions. a Graph signals satisfying this condition are analogous to low-pass signals in classical signal processing. CHAPTER 3. GRAPH LEARNING FROM FILTERED SIGNALS 45 x 0 x L graph Laplacian graph-based filter h h(L) Figure 3.1: Input-output relation of a graph system dened by a graph Laplacian (L) and a graph- based lter (h). In this work, we focus on joint identication of L and h. 
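To make the input-output relation in Figure 3.1 concrete, the following minimal sketch (Python/NumPy; illustrative, not the thesis implementation) builds h(L) from the eigendecomposition of L and generates output signals whose covariance is h(L), using the exponential decay filter of Table 3.1 as an example. The helper names and the choice beta = 0.5 are assumptions made only for this example.

import numpy as np

def apply_gbf(L, h):
    """Return h(L) = U h(Lambda) U^T for a symmetric graph Laplacian L."""
    lam, U = np.linalg.eigh(L)                 # graph frequencies and GBT
    return (U * h(lam)) @ U.T

def graph_system_output(L, h, num_samples, seed=0):
    """Filter white Gaussian inputs: x = h(L)^(1/2) x0, so that E[x x^T] = h(L)."""
    rng = np.random.default_rng(seed)
    lam, U = np.linalg.eigh(L)
    sqrt_hL = (U * np.sqrt(h(lam))) @ U.T      # square root of the output covariance
    x0 = rng.standard_normal((L.shape[0], num_samples))
    return sqrt_hL @ x0

beta = 0.5
h_exp = lambda lam: np.exp(-beta * lam)        # exponential decay GBF from Table 3.1
# X = graph_system_output(L, h_exp, 1000)     # columns of X are realizations of x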
Table 3.1: Graph-based filter (GBF) functions with parameter β and corresponding inverse functions. Note that λ† denotes the pseudoinverse of λ, that is, λ† = 1/λ for λ ≠ 0 and λ† = 0 for λ = 0.

  Filter Name          h_β(λ)                                   h_β^{-1}(s)
  Frequency scaling    1/(βλ) if λ > 0;  0 if λ = 0             1/(βs) if s > 0;  0 if s = 0
  Frequency shifting   (λ + β)†                                 1/s − β
  Variance shifting    λ† + β                                   (s − β)†
  Exponential decay    exp(−βλ)                                 −log(s)/β
  β-hop localized      1/λ^β if λ > 0;  0 if λ = 0              1/s^(1/β) if s > 0;  0 if s = 0

[Figure 3.2 consists of five plots of the filter response h_β(λ) versus the graph frequency λ: (a) frequency scaling GBF with β = 1, 2, 3; (b) frequency shifting GBF with β = 0.25, 0.50, 0.75; (c) variance shifting GBF with β = 0.25, 0.50, 0.75; (d) exponential decay GBF with β = 0.25, 0.50, 0.75; (e) β-hop localized GBF with β = 1, 2, 3.]

Figure 3.2: Graph-based filters with different β parameters as a function of graph frequency, h_β(λ).

3.1 Related Work and Contributions

Some of the related studies have already been discussed in the context of graph Laplacian estimation (in Chapter 2). In this section, we revisit them by pointing out their main differences with respect to our work in the present chapter. Table 3.2 summarizes the related studies by comparing them against our work in terms of target variables, underlying assumptions and optimization criteria.

Table 3.2: An overview of the related work on learning graphs from smooth/diffused signals (NA: no assumptions, L: localized, 1-1: one-to-one function).

  References   Target Variable(s)                        Assumptions (Graph | Filter | Signal)   Optimization Criterion
  [31, 32]     graph Laplacian                           NA | NA | NA                            regularized Laplacian quadratic form
  [34]         shift operator                            NA | NA | NA                            ℓ1-norm of shift operator
  [56]         adjacency matrix                          NA | NA | NA                            trace of adjacency
  [76]         GBF, source locations                     Given | NA | L                          least squares
  [77]         multiple heat kernels, source locations   NA | Heat kernel | L                    regularized least squares
  [78]         polynomials of adjacency                  Sparse | Polynomial | NA                least squares
  This work    graph Laplacian, GBF (Table 3.1)          NA | 1-1 | NA                           regularized maximum likelihood

As discussed in Chapter 2, the problems in (2.15) and (2.16) are solved to estimate CGL matrices from smooth signals [31, 32], yet GBFs are not considered in their formulations. In [34, 56], the authors focus on learning graph shift/diffusion operators (such as adjacency and graph Laplacian matrices) from a complete set of eigenvectors that has to be computed from observed data. Particularly, Segarra et al. [34] solve the sparse recovery problem in (2.18) to infer the topology of a graph, and Pasdeloup et al. [56] focus on the estimation of an adjacency matrix by solving a trace minimization problem. In fact, the problems in [34] and [56] are equivalent if the target matrix is constrained to be a CGL matrix, since minimizing ||L||_1 and minimizing Tr(L) over the set of CGL matrices (i.e., L ∈ L_c) lead to the same solution. Although our approach also requires a set of eigenvectors to be computed, they are not directly used to estimate a graph. Instead, the computed eigenvectors are used for the prefiltering step, and then a graph is estimated from the prefiltered signals by minimizing a regularized ML criterion. Segarra et al. [76] discuss the joint identification of a GBF and a sparse input (localized sources) from a set of observed signals for a given graph. Thanou et al.
[77] further address the estimation of a graph in a similar setting with localized sources and propose a dictionary-based method where a graph estimation problem is solved to construct a dic- tionary consisting of multiple diusion kernels, and the resulting dictionary is used for identifying the localized sources in the diusion process. Although the present chapter focuses on the GSI problem without locality assumptions on diused signals, when a single diusion kernel is used in the dictionary and no locality assumptions are imposed, the problem in [77] becomes analogous to our problem of interest and specically reduces to the CGL estimation problem in (2.15) originally proposed by Dong et al. [31]. Yet, in our algorithm, we solve a dierent problem with a regularized ML criterion (i.e., Problem 3 introduced in Chapter 2) to estimate graphs from preltered signals. The proposed algorithm can be used as an alternative method to construct a dictionary as in [77] for identifying localized sources, which is out of the scope of the present work. Moreover, Mei and Moura [78] address the estimation of polynomials of adjacency matrices by solving a regularized least-squares problem. As their counterparts, polynomials of graph Laplacians can be used to ap- proximate the GBFs in Table 3.1 as well as many other types lters such as bandlimited GBFs [57]. Since polynomial lters provide more degrees of freedom in designing GBFs, they can be considered as more general compared to our GBFs of interest. However, they are not one-to-one functions in general, so the proposed algorithm (i.e., the preltering step) cannot be applied to polynomial lters. Additionally, the algorithm proposed in [78] can only guarantee the optimality of solutions in mean-square sense under a very restrictive set of assumptions which require the polynomials of adjacency matrices to be sparse a . Our proposed algorithm provides stronger optimality guarantees in a regularized ML sense without imposing any assumptions on the sparsity of graphs, yet the GBFs of interest are more restrictive than polynomial lters. The identication of graph systems with polynomial lters will be studied as part of our future work. To the best of our knowledge, this is the rst work that (i) formulates the graph-based modeling problem as identication of graph systems (with the types of GBFs in Table 3.1) under a regularized ML criterion and (ii) proposes an algorithm based on preltering to jointly identify a graph and a GBF. The existing related studies consider optimization of dierent criteria (see Table 3.2), and none of them propose a preltering-based procedure for graph estimation. Since diusion kernels [15] are special cases of graph systems, which are equivalent to the graph systems with an exponential decay lter, our proposed method can be used to learn diusion kernels from signals/data, which are popular in many applications [21, 17, 79]. 3.2 Problem Formulation: Graph System Identication Our formulation is based on the following general assumption on a GBFh, which holds for the GBFs in Table 3.1. a In practice, powers of adjacency matrices lead to dense matrices. CHAPTER 3. GRAPH LEARNING FROM FILTERED SIGNALS 48 Assumption 1. We assume that a graph-based lterh() is a nonnegative and one-to-one function of . 
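As a small illustration of Assumption 1, the scalar filter/inverse pairs in Table 3.1 can be written directly as functions of λ. The sketch below (Python/NumPy; illustrative, not the thesis code) shows two of them, the exponential decay and variance shifting GBFs, together with a numerical check that h_β^{-1}(h_β(λ)) recovers λ. The parameter values and helper names are assumptions of the example.

import numpy as np

beta = 0.5

# Exponential decay GBF and its inverse (Table 3.1).
h_exp     = lambda lam: np.exp(-beta * lam)
h_exp_inv = lambda s: -np.log(s) / beta

# Variance shifting GBF and its inverse; pinv(x) is the scalar pseudoinverse.
pinv      = lambda x: np.where(x != 0, 1.0 / np.where(x != 0, x, 1.0), 0.0)
h_var     = lambda lam: pinv(lam) + beta
h_var_inv = lambda s: pinv(s - beta)

lam = np.array([0.0, 0.8, 2.5])                   # example graph frequencies
print(np.allclose(h_exp_inv(h_exp(lam)), lam))    # True
print(np.allclose(h_var_inv(h_var(lam)), lam))    # True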
3.2.1 Filtered Signal Model We formulate the graph system identication problem in a probabilistic setting by assuming that the observed (ltered) signals have been sampled from a zero mean n-variate Gaussian distribution x N(0; =h(L)), p(xjh(L)) = 1 (2) n=2 jh(L)j 1=2 exp 1 2 x T h(L) y x ; (3.1) with the covariance =h(L) dened based on a graph Laplacian L and a GBFh. The signal model in Chapter 2 is a special case of (3.1), which can be shown by choosing the GBF h in (3.1) as h() = y = 8 < : 1= > 0 0 = 0 that is h(L) = L y : (3.2) Then, replacing theh(L) term in (3.1) with L y leads to the GMRF model in (2.19), whose precision (inverse covariance) is =h(L) y = L. Obviously, for a general GBF, h(L) y does not always have a graph Laplacian form. From the graph signal processing perspective, the random signal x in (3.1) can be considered as the output of a graph system (see in Figure 3.1 and Denition 7) dened by a graph Laplacian matrix L and a GBF h(L) for the input x 0 N(0; 0 ). Without loss of generality, we assume that the input is a white Gaussian noise x 0 N(0; I) and choose h(L) = p h(L), so the covariance of the output x = h(L)x 0 is = E[xx T ] = p h(L)E[x 0 x T 0 ] p h(L) T = p h(L) p h(L) T =h(L) (3.3) as in (3.1), since E[x 0 x T 0 ] = I. Note that for a more general (colored) input model, x 0 N(0; 0 ), the whitening transform T= 1=2 0 can be used to write as = E[xx T ] = p h(L)TE[x 0 x T 0 ]T p h(L) T =h(L): (3.4) Moreover, for modeling smooth graph signals, it is reasonable to choose h() to be a monotonically decreasing function a satisfying h( 1 )h( 2 )h( n )> 0, where the graph frequencies (i.e., eigenvalues of L) are ordered as 0 = 1 2 n . Thus, the corresponding covariance matrix = h(L) represent graph signals whose energy is larger in lower graph frequencies. On the other a All of the GBFs in Table 3.1 are monotonically decreasing functions on the interval 2 (0;1). The exponential decay and frequency shifting GBFs further satises h( 1 )h( 2 ) at the zero frequency (i.e., 0 = 1 2 ), while for the other GBFs, we have h( 2 )h( 1 ). CHAPTER 3. GRAPH LEARNING FROM FILTERED SIGNALS 49 hand, the inverse covarianceh(L) y leads to a variation operator that is similar to a graph Laplacian a L, which is essentially a special case of h(L) y if the GBF is chosen as in (3.2). The quadratic term x T h(L) y x in the exponent of (3.1) can be compactly written as x T h(L) y x = x T Uh( ) y U T x = x T U y U T x (3.5) by using the GBT U, which jointly diagonalizes L = U U T and h(L) = Uh( )U T = U U T , where the diagonal matrices = diag([ 1 2 n ] T ) and = diag([ 2 1 2 2 2 n ] T ) are composed of the graph frequencies and GBF responses, respectively. Note that the nonnegativity condition in Assumption 1, that is, 2 i = h( i ) 0 for i = 1;:::;n, ensures that the covariance = h(L) is a positive semidenite matrix. In terms of the GBF responses (i.e., h( i ) = 2 i fori = 1; 2;:::;n), we can rewrite (3.5) as x T h(L) y x = x T Uh( ) y U T x = n X i=1 1 h( i ) b x 2 i = n X i=1 1 2 i b x 2 i : (3.6) whereb x i = u T i x and u i denotes thei-th column of U associated with the graph frequency i . Since the ordering of the inverse GBF responses, 1= 2 1 1= 2 2 1= 2 n , is consistent with the ordering of graph frequencies 1 2 n , x T h(L) y x provides a variation measure such that a smaller x T h(L) y x means a smoother signal x. As a special case, choosing h() = y reduces (3.6) to the quadratic graph Laplacian form in (1.6). 
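The variation measure in (3.6) can be evaluated directly in the GBT domain without forming h(L)†. The sketch below (Python/NumPy; illustrative rather than the thesis code) computes the sum of x̂_i^2 / h(λ_i) over the graph frequencies, where x̂ = U^T x, and handles zero filter responses via the pseudoinverse convention; the tolerance value is an assumption of the example.

import numpy as np

def filtered_variation(x, L, h, tol=1e-12):
    """Evaluate x^T h(L)^dagger x as in (3.5)-(3.6)."""
    lam, U = np.linalg.eigh(L)          # graph frequencies and GBT (L = U Lambda U^T)
    resp = h(lam)                       # filter responses h(lambda_i) = sigma_i^2
    x_hat = U.T @ x                     # GBT coefficients of the signal
    inv_resp = np.zeros_like(resp)
    mask = resp > tol
    inv_resp[mask] = 1.0 / resp[mask]   # pseudoinverse of the filter responses
    return float(np.sum(inv_resp * x_hat**2))

# With h(lambda) = lambda^dagger as in (3.2), this reduces to the graph Laplacian
# quadratic form x^T L x of (1.6).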
3.2.2 Problem Formulation Our goal is to estimate the graph system parameters L and h based on k random samples, x i for i=1;:::;k, obtained from the signal modelp(xj) in (3.1) by maximizing the likelihood of =h(L), that is k Y i=1 p(x i j)=(2) kn 2 jj k 2 k Y i=1 exp 1 2 x i T y x i : (3.7) The maximization of (3.7) can be equivalently stated as minimizing its negative log-likelihood, which leads to the following problem for estimating the graph system parameters L and h, ( b L; b h) = argmin L;h ( 1 2 k X i=1 Tr x i T h(L) y x i k 2 logjh(L) y j ) = argmin L;h n Tr h(L) y S logjh(L) y j o (3.8) a In Chapter 1, the graph Laplacian quadratic form x T Lx in (1.6) is used to quantify the variation of a signal x on the graph associated with L. CHAPTER 3. GRAPH LEARNING FROM FILTERED SIGNALS 50 where S is the sample covariance calculated using k samples, x i for i = 1; 2;:::;k. In our problem formulation, we additionally introduce a sparsity promoting weighted ` 1 -regularization,kL Hk 1 for robust graph estimation, where H is symmetric regularization matrix and denotes element-wise multiplication. Thus, the proposed graph system identication problem is formulated as follows: minimize L0; Tr(h (L) y S)logjh (L) y j+kL Hk 1 subject to L 1 = 0; (L) ij 0 i6=j (3.9) where is the parameter for a given type of lterh from Table 3.1, and H denotes the regularization matrix. The constraints ensure that L is a CGL matrix a . In this work, we particularly choose H =(2I 11 T ) so that the regularization term in (3.9) reduces to the standard ` 1 -regularization kLk 1 with parameter . 3.2.3 Special Cases of Graph System Identication Problem and Appli- cations of Graph-based Filters Graph Learning. The problem in (3.9) can be reduced to the CGL problem proposed in Chapter 2 by choosing the lter h to be the frequency scaling or -hop localized GBFs with = 1, so that we obtain h () = y and h (L) = L y as in (3.2). Since h (L) y = L, we can rewrite the objective function in (3.9) as Tr(LS)logjLj+kL Hk 1 (3.10) which is the objective function of graph learning problems discussed in Chapter 2. Graph Learning from Noisy Signals. The variance shifting lter corresponds to a noisy signal model with variance = 2 . Specically, assuming that the noisy signal is modeled as b x = x + e, where x N(0; L y ) denotes the original signal, and e N(0; 2 I) is the additive white Gaussian noise vector independent of x with variance 2 . Therefore, the noisy signal b x is distributed as b x N(0; b = L y + 2 I). The same model is obtained by variance shifting lter with = 2 so that h () = y + and h (L) = L y +I = L y + 2 I: (3.11) Graph Learning from Frequency Shifted Signals. For the shifted frequency lter with pa- rameter , we have h () = ( +) y and h (L) = (L +I) y : (3.12) a The formulation can be trivially extended for dierent types of graph Laplacians (e.g., generalized graph Lapla- cian) discussed in Chapter 2. CHAPTER 3. GRAPH LEARNING FROM FILTERED SIGNALS 51 By substituting h (L) with (L+I) y , the problem in (3.9) can be written as minimize e L0;0 Tr( e LS)logj e Lj+k e L Hk 1 subject to e L = L+I; L1 = 0; (L) ij 0 i6=j (3.13) which is the shifted combinatorial Laplacian (SCGL) estimation problem in (2.13) that is originally proposed in [33]. Diusion Kernel Learning. The diusion kernel on a graphG [15] is the matrix exponential of L, that is exp(L) = lim t!1 I L t t t2N; (3.14) where L is the graph Laplacian associated withG and the parameter is a real number called diusion bandwidth. 
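As a quick numerical illustration of (3.14) (a sketch, not part of the thesis), the matrix power (I − βL/t)^t approaches the matrix exponential exp(−βL) as t grows; the small path-graph Laplacian and parameter values below are chosen only for the example.

import numpy as np
from numpy.linalg import matrix_power
from scipy.linalg import expm

beta, t = 0.5, 10_000
L = np.array([[ 1., -1.,  0.],
              [-1.,  2., -1.],
              [ 0., -1.,  1.]])                        # CGL of a 3-vertex path graph
K_limit = matrix_power(np.eye(3) - beta * L / t, t)    # (I - beta*L/t)^t
K_exp = expm(-beta * L)                                # exponential decay GBF applied to L
print(np.max(np.abs(K_limit - K_exp)))                 # small; shrinks further as t grows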
To learn diusion kernels, dened in (3.14), our proposed problem in (3.9) can be modied by choosing h () as the exponential decay lter, so we get h (L)=Uh ()U T =Uexp()U T =exp(L) (3.15) based on Denitions 5 and 6. Diusion kernels are used to model a basic class of random diusion processes [15]. Suppose that the random process is obtained by diusion of the initial random vector x(0) whose entries are independent, zero-mean random variables, denoted as (x(0)) i for i = 1; 2;:::;n, with variance 2 , the random diusion over a graphG is formulated recursively as x(t + 1) = x(t)rLx(t) = (IrL)x(t) t2f0; 1;:::g (3.16) where t represents the time index, L is the graph Laplacian ofG and the real number r2 (0; 1) is the diusion rate of the process. On i-th vertex of the graphG, we can write the random process in (3.16) as (x(t + 1)) i = (x(t)) i +r X j2Ni (W) ij ((x(t)) j (x(t)) i ) (3.17) for all i where W is the adjacency matrix of the graphG, andN i is the set of vertex indices neighboring i-th vertex (v i ). More compactly, (3.16) can be written as x(t) = G(t)x(0) where G(t) = (IrL) t : (3.18) CHAPTER 3. GRAPH LEARNING FROM FILTERED SIGNALS 52 The covariance of x(t) is ((t)) ij = E [(G(t)x(0)) i (G(t)x(0)) j ] = E " n X k=1 (G(t)) ik (x(0)) k ! n X l=1 (G(t)) jl (x(0)) l !# ; (3.19) which simplies by using independence of x(0) as ((t)) ij = 2 n X k=1 ((G(t)) ik (G(t)) kj ) = 2 (G(t) 2 ) ij : (3.20) Therefore, the covariance matrix leads to (t) = G(t) 2 = 2 (IrL) 2t : (3.21) To adjust the time resolution of the process, we replace t witht=t andr withrt in G(t) so that G(t) = I rL 1=t t=t : (3.22) For arbitrarily small t, G(t) converges to a matrix exponential function of L, e (t) = 2 lim t!0 I rL 1=t 2t=t = 2 exp(2rtL) (3.23) which is equivalent to exponential decay ltering operation in Table 3.1 with (t) =2rt. For arbitrarily large t (i.e., when the diusion reaches steady-state), the covariance is lim t!1 e (t) = lim t!1 n X i=1 exp(2rt i )u i u i T = 2 n 11 T ; (3.24) whose all entries have the same value, since we have 1 = 0 for CGLs. Thus, the statistical properties eventually become spatially at across all vertices as intuitively expected for a diusion process. Yet, for a nonsingular L (e.g., a GGL), the process bahaves dierently so that the signal energy vanishes, since the above limit converges to the nn zero matrix as t gets larger. Graph Learning from -hop Localized Signals. To learn graphs from -hop localized signals, where is a positive integer, the GBF can be selected as h() = ( y ) so that the corresponding inverse covariance (precision) matrix in (3.9) ish(L) y =L , which is generally not a graph Laplacian matrix due to the exponent. However, it denes a-hop localized operation on a graph associated with L, and the corresponding signal model leads to -hop localized signals, in which each sample depends on samples located within the neighborhood at most -hops away. CHAPTER 3. GRAPH LEARNING FROM FILTERED SIGNALS 53 Algorithm 3 Graph System Identication Input: Sample covariance S, graph-based lter type h Output: Graph Laplacian L and lter parameter 1: Obtain U and s via eigendecomposition S=U s U T 2: Initialize parameter b : Apply the initialization method in Section 3.3.3 for variance/frequency shifting GBFs. For the other types of GBFs, apply a positive random initialization. 
3: repeat 4: Prelter the sample covariance S: S pf = (h 1 b (S)) y = U(h 1 b ( s )) y U T (3.25) 5: Estimate L from preltered data (S pf ): b L Run Algorithm 2 to solve the problem in (3.33) 6: Update lter parameter b or skip if b is optimal: b Apply lter-specic update discussed in Section 3.3.3 7: until convergence has been achieved 8: return Graph system parameters b L and b 3.3 Proposed Solution Algorithm 3 is proposed to solve the graph system identication problem in (3.9) for a given sample covariance S and a selected type of graph-based lter h . After obtaining U and s via eigende- composition of S and initialization of the parameter (see lines 1 and 2), the algorithm performs three main steps to nd the optimal graph Laplacian L and the lter parameter : 1. Preltering step applies an inverse ltering on the sample covariance S to reverse the eect of lter h in the graph system. Without the preltering step, it may be impossible to eec- tively recover L from S, since y = h(L) y is generally not a graph Laplacian. The proposed preltering allows us to eectively estimate the original eigenvalues of L (i.e., ) from the preltered covariance S pf in (3.25). 2. Graph Laplacian estimation step uses Algorithm 2 developed for estimating the CGL b L from preltered covariance S pf . 3. Filter parameter selection step nds the best matching for the graph system. Depending on the type of GBF, we propose dierent methods for parameter selection. In the rest of this section, we derive the optimality conditions and discuss preltering, graph Laplacian estimation and lter parameter selection in Sections 3.3.1, 3.3.2 and 3.3.3, respectively. The optimality conditions are derived based on the following assumption on the number of data samples used to compute the sample covariance S (i.e., k). CHAPTER 3. GRAPH LEARNING FROM FILTERED SIGNALS 54 Assumption 2. We assume that the number of data samples (k) used for estimating the sample covariance S is asymptotically large (i.e., k!1), so the sample covariance S converges to the actual covariance matrix almost surely by the strong law of large numbers. Thus, we have S = by the assumption. Note that this is a theoretical assumption used to derive the optimality conditions for the graph system identication problem. In practice, the proposed algorithm (Algorithm 3) works regardless of the number of data samples used to obtain the sample covariance S. 3.3.1 Optimal Preltering Based on Assumption 2, we have a graph-based transform matrix U satisfying L=U U T ,h(L) y = Uh( ) y U T and S==U U T . By change of variables, the objective function in (3.8) becomes J (U;h( )) = Tr(Uh( ) y U T U U T ) logjUh( ) y U T j (3.26) which is simplied using properties of Tr() andjj as J (h( )) = Tr(h( ) y ) logjh( ) y j (3.27) where the graph-based lter h and the diagonal matrix of graph frequencies are unknown. By letting i =h( i ) y =(h( ) y ) ii and 2 i =( ) ii for i=1; 2;:::;n, we can write (3.27) as J () = n X i=1 i 2 i log( i ) ; (3.28) where = [ 1 2 n ] T . In minimization of the convex function (3.28), the optimal solution satises the following necessary and sucient conditions [67] obtained by taking the derivative of (3.28) with respect to i and equating to zero, @J () @ i = 1 i 2 i = 0: (3.29) By change of variables, the optimality conditions can be stated as h( i ) = (h( )) ii = ( ) ii = 2 i 8i: (3.30) Based on Assumption 1, we can also write (3.30) as h 1 (h( i )) = i =h 1 ( 2 i ) 8i; (3.31) CHAPTER 3. GRAPH LEARNING FROM FILTERED SIGNALS 55 whereh 1 is inverse function ofh. 
By using the matrix notation, we can state (3.31) more compactly as h 1 (S) = Uh 1 ( )U T = U U T = L: (3.32) This condition shows that we can nd the optimal Laplacian L via inverse ltering (inverse pre- ltering) h 1 (S). Yet, (3.32) is satised only if Assumption 2 holds, which requires innitely many samples (i.e., k!1) to estimate the sample covariance S. Thus, using (3.32) does not lead to a Laplacian matrix in practice. In order to address this problem, Algorithm 3 rst estimates the preltered sample covariance S pf as in (3.25), then employs Algorithm 2 to nd the best graph Laplacian from S pf by minimizing the criterion in (3.10). 3.3.2 Optimal Graph Laplacian Estimation For a graph-based lter h (or h ) satisfying the optimal preltering condition in (3.31), the graph system identication problem in (3.9) can be written as the following graph learning problem, minimize L0 Tr(LS pf ) logjLj+kL Hk 1 subject to L 1 = 0; (L) ij 0 i6=j (3.33) where S pf =(h 1 (S)) y is obtained by preltering the covariance S. This problem is discussed in detail in Chapter 2 where we have developed Algorithm 2 (CGL) to solve (3.33). 3.3.3 Filter Parameter Selection Depending on the type of GBF (in Table 3.1), we propose dierent methods for choosing the lter parameter . Initialization for Variance/Frequency Shifting Filters. Assuming that the optimal preltering condition in (3.31) holds, for both variance and frequency shifting GBFs, the optimal is found by calculating 2 1 = u 1 T Su 1 where u 1 is the eigenvector associated with the zero eigenvalue of the Laplacian (( ) 11 = 1 =0). Specically, if h () is a variance shifting lter, we get h ( 1 ) = 2 1 = u 1 T Su 1 = y 1 +; (3.34) since 1 = y 1 =0, the optimal satises = u 1 T Su 1 = 2 1 : (3.35) CHAPTER 3. GRAPH LEARNING FROM FILTERED SIGNALS 56 if h () is a frequency shifting lter, we obtain h ( 1 ) = 2 1 = u 1 T Su 1 = 1=( 1 +) = 1=; (3.36) so the optimal satises = 1=(u 1 T Su 1 ) = 1= 2 1 : (3.37) Since the optimal can be directly estimated from S as in (3.35) and (3.37), Algorithm 3 uses the optimized initial (in line 2) for preltering, and then estimates L, so that the lter parameter update (line 6) is skipped for graph systems with frequency and variance shifting lters. Exponential decay and Frequency Scaling Filters. Assuming that the condition in (3.31) holds and b L is the solution of (3.33) where S pf is obtained by preltering with parameter b . If h () is a frequency scaling lter, the preltering gives h 1 b ( 2 i ) : 1 = 2 1 = 0; b i = 1 b 2 i i = 2;:::;n: (3.38) If h () is an exponential decay lter, we have h 1 b ( 2 i )= log( 2 i ) b = log(exp( i )) b = b i : (3.39) Based on (3.38) and (3.39), for any selected b , the resulting graph Laplacian satises b L = (= b )L where L and denote the original graph system parameters. So, Algorithm 3 nds the optimal combinatorial Laplacian L up to a constant scale factor = b , and the parameter b can be tuned so that the desired normalization (scaling factor) for L is achieved. -hop Localized Filter. For estimation of the optimal hop count in Algorithm 3, preltering with b gives, h 1 b ( 2 i ) : 1 = 2 1 =0; = b i = 1 2 i 1= b i = 2;:::;n: (3.40) Since this requires the graph learning step to estimate L = b , which is not a graph Laplacian in general, Algorithm 3 cannot guarantee optimal graph system identication unless = b . In order to nd the optimal , we perform a line search for given range of integers optimizing the following: b = argmin b 2N jjh b ( b L) Sjj F : (3.41) CHAPTER 3. 
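A condensed sketch of the prefiltering step (3.25) and the surrounding alternation of Algorithm 3 is given below (Python/NumPy). It is illustrative only: estimate_cgl is a hypothetical stand-in for Algorithm 2 of Chapter 2 (assumed to be available), h_inv_of_beta is the inverse filter h_β^{-1} from Table 3.1, and a fixed number of iterations replaces the convergence test of Algorithm 3.

import numpy as np

def prefilter(S, h_inv, tol=1e-10, eps=1e-12):
    """Return S_pf = (h^{-1}(S))^dagger as in (3.25), computed on the eigenvalues of S."""
    sigma2, U = np.linalg.eigh(S)                  # S = U diag(sigma_i^2) U^T
    lam = h_inv(np.maximum(sigma2, eps))           # recovered graph frequencies h^{-1}(sigma_i^2)
    lam_pinv = np.divide(1.0, lam, out=np.zeros_like(lam), where=lam > tol)
    return (U * lam_pinv) @ U.T

def graph_system_identification(S, h_inv_of_beta, beta0, num_iters=5):
    """Alternate prefiltering and CGL estimation (simplified Algorithm 3 loop)."""
    beta = beta0
    for _ in range(num_iters):
        S_pf = prefilter(S, lambda s: h_inv_of_beta(beta, s))
        L_hat = estimate_cgl(S_pf)                 # hypothetical stand-in for Algorithm 2
        # beta update: skipped for the shifting GBFs (the optimal beta is fixed at
        # initialization), or a line search over hop counts as in (3.41).
    return L_hat, beta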
GRAPH LEARNING FROM FILTERED SIGNALS 57 3.4 Results 3.4.1 Graph Learning from Diusion Signals/Data We evaluate the performance of our proposed graph system identication (GSI) method (i.e., Al- gorithm 3) by benchmarking against the current state-of-the-art approaches proposed for learning graph from smooth signals (GLS) [31, 32] as well as the graph topology inference (GTI) in [34]. The proposed GSI is also compared against the CGL estimation algorithm (CGL [36]) in Chapter 2 (i.e., using Algorithm 2 without preltering) and the inverse preltering (IPF) approach, which estimates a graph Laplacian matrix by inverting the preltered covariance, S pf in (3.25), so that b L =h 1 (S) = S y pf . For this purpose, we generate several articial datasets based on the signal model in (3.1), dened by a graph Laplacian (L) and a GBF (h ) where the dataset entries are generated by random sampling from the distribution N(0;h (L)). Then, the generated data is used in the proposed and benchmark algorithms to recover the corresponding graph Laplacian L. We repeat our experiments for dierent L and h where graphs are constructed by using three dierent graph connectivity models: Grid graph,G (n) grid , consisting n vertices attached to their four nearest neighbors (except the vertices at boundaries). Random Erdos-Renyi graph,G (n,p) ER , withn vertices attached to other vertices with probability p = 0:2. Random modular graph (also known as stochastic block model),G (n,p1,p2) M withn vertices and four modules (subgraphs) where the vertex attachment probabilities across modules and within modules are p 1 =0:1 and p 2 =0:2, respectively. Then, the edge weights of a graph are randomly selected from the uniform distribution U(0:1; 3), on the interval [0:1; 3]. For each L and h pair, we perform Monte-Carlo simulations to test average performance of proposed and benchmark methods with varying number of data samples (k) and xed number of vertices (n = 36) a . To measure the estimation performance, we use the following two metrics: RE( b L; L ) = k b L L k F kL k F (3.42) which is the relative error between the ground truth graph (L ) and estimated graph parameters ( b L), and FS( b L; L ) = 2tp 2tp +fn +fp (3.43) a Methods are evaluated on small graphs (withn = 36 vertices), since GTI [34] is implemented using CVX [64] and do not currently have an ecient and scalable implementation. The proposed and the other benchmark methods are more ecient and can support larger graphs. CHAPTER 3. GRAPH LEARNING FROM FILTERED SIGNALS 58 is the F-score metric (commonly used metric to evaluate binary classication performance) calculated based on true-positive (tp), false-positive (fp) and false-negative (fn) detection of graph edges in estimated b L with respect to the ground truth edges in L . F-score takes values between 0 and 1, where the value 1 means perfect classication. In our experiments, for the proposed GSI, the regularization parameter in (3.33) is selected from the following set: f0g[f0:75 r (s max p log(n)=k)jr = 1; 2; 3;:::; 14g; (3.44) wheres max =max i6=j j(S) ij j is the maximum o-diagonal entry of S in absolute value, and the scaling term p log(n)=k is used for adjusting the regularization according tok andn as suggested in [72, 73]. Monte-Carlo simulations are performed for each proposed/baseline method, by successively solving the associated problem with dierent regularization parameters to nd the best regularization that minimizes RE. The corresponding graph estimate is also used to calculate FS. 
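The two metrics in (3.42)-(3.43) and the regularization grid in (3.44) are straightforward to compute; a short sketch (Python/NumPy, not the thesis code) follows, where the edge-detection threshold is an assumption of the example.

import numpy as np

def relative_error(L_hat, L_true):
    """RE(L_hat, L*) in (3.42)."""
    return np.linalg.norm(L_hat - L_true, 'fro') / np.linalg.norm(L_true, 'fro')

def f_score(L_hat, L_true, tol=1e-8):
    """FS(L_hat, L*) in (3.43); an edge exists where the off-diagonal entry is negative."""
    est = np.triu(L_hat, 1) < -tol
    ref = np.triu(L_true, 1) < -tol
    tp = np.sum(est & ref)
    fp = np.sum(est & ~ref)
    fn = np.sum(~est & ref)
    return 2 * tp / (2 * tp + fn + fp)

def alpha_grid(S, n, k):
    """Regularization values in (3.44)."""
    s_max = np.max(np.abs(S - np.diag(np.diag(S))))
    return [0.0] + [0.75**r * s_max * np.sqrt(np.log(n) / k) for r in range(1, 15)]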
For all baseline methods [31, 32, 34], the required parameters are selected by ne tuning. Since CGL [36], GLS [31, 32] and GTI [34] approaches generally result in severely biased solutions with respect to the ground truth L (based on our observations from the experiments), RE values are calculated after normalizing the estimated solution L as b L = (Tr(L )=Tr(L))L. Note that, this normalization also resolves the ambiguity in identication of graph systems with exponential decay and frequency scaling lters up to a scale factor (discussed in Section 3.3.3). Figures 3.3 and 3.4 depict the performance of dierent methods applied for estimating graphs from signals modeled based on exponential decay lters (diusion kernels) and-hop localized lters. As shown in Figures 3.3 and 3.4, the proposed GSI signicantly outperforms all baseline methods, including the state-of-the-art GLS [31, 32] and GTI [34], in terms of average RE and FS metrics. The performance dierence between GSI and CGL [36] demonstrates the impact of the preltering step, which substantially improves the graph learning accuracy. Similarly, the performance gap between GSI and IPF shows the advantage of Algorithm 3 compared to the direct preltering of input covariance (S) as in (3.32), where GSI provide better graph estimation especially when number of data samples (i.e., k=n) is small. Besides, Figures 3.5 and 3.6 illustrates two examples from the experiments with grid graphs for the case of k=n = 30, where the proposed GSI constructs graphs that are the most similar to the ground truth (L ). 3.4.2 Graph Learning from Variance/Frequency Shifted Signals In this subsection, we compare the CGL estimation performance of GSI, CGL [36] and SCGL [33] methods from signals modeled based on variance and frequency shifting GBFs. As discussed in Section 3.2.3, the covariance matrices for signals modeled based on these GBFs with parameter are CHAPTER 3. GRAPH LEARNING FROM FILTERED SIGNALS 59 = L y +I for variance shifting, = (L+I) y for frequency shifting. where L denotes the associated combinatorial Laplacian. In our experiments, we construct 10 random Erdos-Renyi graphs (G (n,p) ER ) with n = 36 vertices andp = 0:2, then generate for each GBF by varying between 0 and 1. To evaluate the eect of only, we use actual covariance matrices instead of sample covariances as input to the algorithms. So, GSI, CGL and SCGL estimate a graph Laplacian L from . The average RE results are presented in Tables 3.3 and 3.4 for various . Table 3.3 shows that the proposed GSI signicantly outperforms CGL for > 0, and the average RE dierence increases as gets larger. This is because the variance shifting GBF leads to the noisy signal model with the covariance in (3.11) where represents the variance of the noise ( 2 ), and the preltering step allows GSI to perfectly estimate the parameter from by using (3.35) so that 0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 5 10 30 50 100 250 500 1000 Average RE k=n GSI IPF CGL[36] GLS[31] GLS[32] GTI[34] 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 5 10 30 50 100 250 500 1000 Average FS k=n GSI IPF CGL[36] GLS[31] GLS[32] GTI[34] Figure 3.3: Average RE and FS results for graph estimation from signals modeled based on exponen- tial decay GBFs tested with =f0:5; 0:75g on 10 dierent grid, Erdos-Renyi and modular graphs (30 graphs in total). The proposed GSI outperforms all baseline methods in terms of both RE and FS. 
[Figure 3.4 consists of two plots of average RE and average FS versus k/n for GSI, IPF, CGL [36], GLS [31], GLS [32] and GTI [34].]

Figure 3.4: Average RE and FS results for graph estimation from signals modeled based on β-hop localized GBFs tested with β ∈ {2, 3} on 10 different grid, Erdos-Renyi and modular graphs (30 graphs in total). The proposed GSI outperforms all baseline methods in terms of both RE and FS.

[Figure 3.5 shows four color-coded graphs: (a) the ground truth CGL (L*); (b) GLS [31] with (RE, FS) = (0.4019, 0.6172); (c) GTI [34] with (RE, FS) = (0.1201, 0.7229); (d) GSI with (RE, FS) = (0.0340, 0.9917).]

Figure 3.5: A sample illustration of graph estimation results (for k/n = 30) from signals modeled using the exponential decay GBF with β = 0.75, where L* is derived from the grid graph in (a). The edge weights are color coded, where darker colors indicate larger weights. The proposed GSI leads to the graph that is the most similar to the ground truth.

[Figure 3.6 shows four color-coded graphs: (a) the ground truth CGL (L*); (b) GLS [31] with (RE, FS) = (0.4045, 0.6890); (c) GTI [34] with (RE, FS) = (0.2118, 0.6030); (d) GSI with (RE, FS) = (0.0274, 0.9833).]

Figure 3.6: A sample illustration of graph estimation results (for k/n = 30) from signals modeled using the β-hop localized GBF with β = 2, where L* is derived from the grid graph in (a). The edge weights are color coded, where darker colors indicate larger weights. The proposed GSI leads to the graph that is the most similar to the ground truth.

the covariance is prefiltered as in (3.25) based on the optimal β. The prefiltering step can also be considered as a denoising operation (reversing the effect of the variance shifting GBF) on the signal covariance before the graph estimation step, while CGL works with noisy (i.e., shifted) covariances, which diminishes the CGL estimation performance. For β = 0 (i.e., Σ is noise-free), the problem in (3.9) reduces to the CGL estimation problem in [36], so both GSI and CGL lead to the same average RE. For the frequency shifting GBFs with β > 0, GSI performs slightly better than SCGL, since SCGL is implemented using the general-purpose solver CVX [64], which generally produces less accurate solutions compared to our algorithm. Moreover, our algorithm is approximately 90 times faster than SCGL on average, and it significantly outperforms SCGL for β = 0, since the SCGL method is developed for shifted covariance matrices (i.e., L + βI) where β needs to be strictly positive.

Table 3.3: Average relative errors for the variance shifting GBF.

  Method      β = 0      β = 0.1   β = 0.3   β = 0.5   β = 0.7   β = 0.9
  CGL [36]    2×10^-4    0.60      0.79      0.85      0.88      0.89
  GSI         2×10^-4 (for all β)

Table 3.4: Average relative errors for the frequency shifting GBF.

  Method       β = 0      β = 0.1     β = 0.5     β = 0.9
  SCGL [33]    0.2354     6.7×10^-4   6.6×10^-4   6.2×10^-4
  GSI          1.6×10^-4 (for all β)

[Figure 3.7 shows four temperature maps of the US: (a) 45th day (Winter), (b) 135th day (Spring), (c) 225th day (Summer), (d) 315th day (Autumn).]

Figure 3.7: Average air temperatures (in degrees Celsius) for the (a) 45th, (b) 135th, (c) 225th and (d) 315th days over 16 years (2000-2015). Black dots denote the 45 states.

3.4.3 Illustrative Results on Temperature Data

In this experiment, we apply our proposed method to a real (climate) dataset consisting of air temperature measurements [80]. We specifically use the average daily temperature measurements collected from 45 states in the US over 16 years (2000-2015), so that in total there are k = 5844 samples for each of the n = 45 states.
Figure 3.7 shows samples of average temperature signals, which are spatially smooth across dierent states. Also, the Rocky Mountains region has lower average temperature values as compared to the other regions in the western US b . The goal of this experiment is to learn graphs (with 45 vertices) associated with exponential decay and -hop localized lters from temperature data and investigate the eect of lter type and parameter on the resulting graphs, representing the similarity of temperature conditions between the 45 states. For modeling temperature signals, a diusion kernel is a good candidate, because it is a NCEP Reanalysis data provided by the NOAA/OAR/ESRL PSD, Boulder, Colorado, USA, from their website at http://www.esrl.noaa.gov/psd/ b The Rocky Mountains cross through the states of Idaho, Montana, Wyoming, Utah, Colorado and New Mexico. For example, in Figure 3.7b, the areas with temperature values between 0 and 10 degrees Celsius (colored in green) correspond to the Rocky Mountains region approximately. CHAPTER 3. GRAPH LEARNING FROM FILTERED SIGNALS 62 0 83.3921 (a) -hop localized GBF with = 1 0 4.7463 (b) -hop localized GBF with = 5 0 0.80833 (c) -hop localized GBF with = 10 0 9.5083 (d) Exponential decay with = 0:25 0 4.7722 (e) Exponential decay with = 0:5 0 0.81572 (f) Exponential decay with = 3 Figure 3.8: Graphs learned from temperature data using the GSI method with exponential decay and -hop localized GBFs for xed parameters where no regularization is applied (i.e., H is set to the zero matrix). The edge weights are color coded so that darker colors represent larger weights (i.e., more similarities). The graphs associated with exponential decay GBF leads to sparser structures compared to the graphs corresponding to -hop localized GBFs. a fundamental solution of the heat equation a , which describes the distribution of heat in a physical environment [81]. Figure 3.8 illustrates the graphs estimated using the GSI method without ` 1 -regularization (i.e., H in (3.9) is set to the zero matrix). As shown in the gure, larger edge weights are assigned between vertices (i.e., states) that are closer to each other in general, since temperature values are mostly similar between states within close proximity. However, the distance between states is obviously not the only factor eecting the similarity of temperature values. For example, in Figures 3.8b{ 3.8f, the weights are considerably small between the states in the Rocky Mountains region and their neighboring states in the east (e.g., Nebraska and Kansas) due to the large dierences in altitude. Note also that dierent choices of GBFs can lead to substantially dierent similarity graphs. Especially for the-hop localized GBF with = 1 (corresponding to the CGL method), the resulting graph is signicantly dierent than the results in Figures 3.8b and 3.8c, since the 1-hop localized lter does not lead to a diusion model. The graphs associated with exponential decay GBF leads to sparser graphs, better revealing the structure of the signal, compared to the graphs in a This is the reason that diusion kernels are also known as heat kernels in the literature. CHAPTER 3. GRAPH LEARNING FROM FILTERED SIGNALS 63 Figures 3.8a{3.8c. The structure of the graphs in Figures 3.8d{3.8f are similar for dierent because of the relation in (3.39) for the exponential decay lter. For example, increasing from 0:25 to 0:5 approximately halves edge weights, as shown in Figures 3.8d and 3.8e. 
Besides, the distribution of edge weights for -hop localized GBFs becomes more similar to the ones in Figures 3.8d{3.8f as gets larger. 3.5 Conclusion We have introduced a novel graph-based modeling framework by (i) formulating the modeling prob- lem as the graph system identication from signals/data and (ii) proposing an algorithm that jointly identies a graph and a graph-based lter. The proposed framework supports various types of graph- based lters which include diusion kernels as a special case. Our experimental results have demon- strated that the proposed method outperform the state-of-the-art approaches in terms of modeling accuracy. Chapter 4 Graph Learning from Multiple Graphs: Multigraph Combining In the previous two chapters, we have studied methods to learn graphs from data under two dierent model assumptions based on attractive GMRFs and graph systems. The present chapter focuses on a dierent graph learning problem, called multigraph combining, where the goal is to learn a single graph from multiple graphs (i.e., the observed data consist of a list of graphs). We partic- ularly consider combining simple weighted graphs (i.e., graphs with no self-loops) associated with combinatorial graph Laplacians (CGLs) a and propose a three-step formulation based on a maximum likelihood (ML) criterion. In the rst two steps, the common graph-based transform (CGBT) and common graph frequency (CGF) estimation problems are proposed to estimate a GBT ( b U) and a diagonal matrix ( b ) consisting of graph frequencies. By exploiting the optimality conditions of the problems, we propose a method that estimates the best CGBT and CGFs in an ML sense. The third step involves estimating a CGL matrix based on the optimized b U and b , where we employ Algorithm 2 described in Chapter 2 to learn a CGL matrix. This chapter is organized as follows. Section 4.1 presents the related work. Section 4.2 rst formulates the multigraph combining problem. In Section 4.3, we introduce the proposed solution. Experimental results are presented in Section 4.4, and Section 4.5 draws some conclusions. 4.1 Related Work In the literature, there is only a limited number of studies on graph combining. The authors in [83, 84] propose graph combining to improve spectral clustering and semi-supervised learning with multiple graphs, respectively. However, their approaches are based on weighted averaging of graphs, An earlier version of the work in this chapter is published in [82]. a The work in this chapter can be trivially extended to learn other types of graph Laplacians. 64 CHAPTER 4. GRAPH LEARNING FROM MULTIPLE GRAPHS 65 which have no theoretical guarantees. Additionally, our CGBT estimation problem is closely related to common principal component (CPC) estimation originally introduced in [85], where the goal is nding an orthogonal matrix that jointly diagonalizes multiple covariance matrices in an ML sense, and an iterative algorithm for this problem is developed in [86]. Our CGBT problem can be viewed as a variation of the CPC problem with graph Laplacian matrices. Another variation of the CPC problem was also studied in blind source separation [87], and the JADE algorithm [88] was introduced for joint diagonalization with a Frobenius-norm criterion. Algorithmically, our algorithm for the CGBT estimation problem is very similar to the ones in [86, 88], which are all based on Jacobi- like iterations [89]. 
Yet, our algorithm iteratively updates an orthogonal matrix (i.e., a CGBT) based on pairwise projections on graph Laplacian matrices, while [86] and [88] use covariance matrices. To the best of our knowledge, none of the prior studies address the CGF estimation problem. Note that the graph topology inference problem [34] stated in (2.18) applies the same type of decomposition on the target variable (i.e., = UU T ) to estimate graph frequencies (i.e., ) for a given GBT (i.e., U). Since the graph combining problem is not considered in [34], U is rst obtained by the eigendecomposition of a single covariance matrix, and then is estimated from U by solving an` 1 - minimization problem, while we sequentially solve CGBT and CGF estimation problems to optimize U and from multiple graphs based on an ML criterion. 4.2 Problem Formulation for Multigraph Combining Our formulation is based on the following two basic assumptions on graphs: (Number of vertices) All of the given graphs have the same vertex setV with n vertices. (Connected graphs) Each of the given graphs is a connected graph. The proposed three-step formulation is derived from the following general optimization problem for given positive semidenite matrices K 1 ;:::; K L and positive weights k 1 ;:::;k L : minimize U; L X l=1 k l (Tr (K l ) logjj) subject to = UU T U T U = I 2L c (4.1) where the orthogonal matrix U and the diagonal matrix are the target variables, andL c denotes the set of CGLs as stated in (1.3). The weightsk 1 ;:::;k L can be selected to adjust the contribution of each K 1 ;:::; K L in the problem, so that choosing a larger weight k l increases the contribution of K l in the minimization. Depending on the choices/types of K 1 ; K 2 ;:::; K L , the problem has two variants: CHAPTER 4. GRAPH LEARNING FROM MULTIPLE GRAPHS 66 (Statistical) For given L sample covariance matrices S 1 ;:::; S L corresponding to L groups (clusters) of data, the problem in (4.1) can be solved to estimate a graph Laplacian from L groups of data by setting K l = S l for l = 1;:::;L. (Deterministic) For givenL graph Laplacian matrices (e.g., CGLs) L 1 ;:::; L L , setting K l = L y l in (4.1) for l = 1;:::;L leads to the multigraph combining problem, which is the problem of interest in this chapter. Note that (4.1) is nonconvex due to the orthogonality constraint (i.e., U T U = I), which makes it hard to solve for U and directly. Thus, we present a three-step formulation for the multigraph combining problem by decomposing (4.1) into the following three subproblems: Common graph-based transform (CGBT) problem is formulated with the goal of estimating a CGBT U and multiple diagonal matrices 1 ;:::; L from given K 1 ;:::; K L as follows: minimize U;1;:::; L L X l=1 k l (Tr ( l U T K l U) logj l j) subject to U T U = I (4.2) The basic purpose of this step is to nd the best orthogonal transform U by minimizing the weighted ML criterion above in (4.2). Then, the resulting diagonal matrices, denoted as b 1 ;:::; b L , are used in the next step to optimize a CGF matrix. Common graph frequency (CGF) problem is formulated to estimate a CGF matrix from the diagonal matrices b 1 ;:::; b L obtained by solving (4.2) as: minimize L X l=1 k l Tr( b y l ) logjj (4.3) Since the estimated CGBT and CGF matrices (i.e., b U and b ) generally do not lead to a CGL matrix, we propose the following CGL problem, discussed in Chapter 2, to optimize a CGL based on b U and b : minimize Tr(S) logjj subject to 2L c (4.4) where S = b U b y b U T , andL c denotes the set of CGLs. 
This step can be viewed as a projection for nding the closest CGL matrix with respect to the S in an ML sense. As an alternative to the above three-step formulation, the problem in (4.1) can be solved without splitting the target variable into U and , so that it reduces to the CGL problem in (4.4) where S = 1=k sum P L l=1 k l K l such that k sum = P L l=1 k l is the normalization factor. In this case, CHAPTER 4. GRAPH LEARNING FROM MULTIPLE GRAPHS 67 the multigraph combining is simply carried out by rst computing a weighted summation of input matrices K 1 ; K 2 ;:::; K L dening S and then solving the CGL problem with S as the input. Yet, our formulation introduces CGBT and CGF problems by decomposing the target variable as = UU T , which has two main advantages over directly solving the CGL problem. Firstly, splitting the target variable into U and introduces more degrees of freedom for an algorithm to search for the best solutions, so that solving the CGBT and CGF problems potentially leads to better estimation of U and in an ML sense. This advantage will be empirically demonstrated later in Section 4.4. Secondly, for a given set of L input CGLs L 1 ;:::; L L , directly solving the CGL problem requires us to compute the pseudoinverse of each CGL to have K l = L y l for l = 1;:::;L used in calculating the S, which can be infeasible for large L because of the computationally intensive pseudoinverse operation. However, the proposed three-step formulation allows us to work with CGLs directly, since a CGBT (U) is invariant to pseudoinverses of input CGLs, and also corresponding graph frequencies can be simply obtained for a given U by using l = U T L l U. In order to justify our three-step formulation, the following section presents the statistical deriva- tions of the proposed problems as well as their optimality conditions used to develop our algorithm. 4.2.1 Statistical Formulation and Optimality Conditions Let x l be an n-variate Gaussian random vector x l N(0; L y l ) for l = 1; 2;:::;L. Given e k l = k l + 1 independent random vectors the random data matrix X l = [x (1) l x (2) l x ( e k l ) l ] leads to the random empirical covariance matrix S l = (1=k l )X l X T l . Then, the corresponding scatter matrix e S l = k l S l has a Wishart distribution, e S l W(L y l ;k l ), which is a common data model used in covariance estimation. Since e S l are independent for l = 1; 2;:::;L, the likelihood function of L 1 ; L 2 ;:::; L L is p(f e S l g L l=1 jfL l g L l=1 ) = L Y l=1 C l jL y l j k l 2 exp k l 2 Tr (L l S l ) (4.5) where C l is a constant that does not depend on L l . Thus, we can write negative log-likelihood objective function as follows, J NLL (L 1 ; L 2 ;:::; L L ) = L X l=1 k l (logjL l j+Tr (L l S l )); (4.6) which leads to the objective function of the proposed problem in (4.1) for = L 1 = = L L where is the target variable that combines L 1 ;:::; L L . Without loss of generality, the weightsk 1 ;:::;k L can be normalized by dividing (4.6) with the constant k sum = P L l=1 k l , so that P L l=1 k l =k sum = 1. Common Graph-based Transform (CGBT) Estimation. In order to nd the best CGBT CHAPTER 4. GRAPH LEARNING FROM MULTIPLE GRAPHS 68 matrix, we introduce the following constraint on CGL matrices, C CGBT : L l = U l U T for l = 1; 2;:::;L; (4.7) where U denotes the orthogonal CGBT matrix that we seek to obtain, which is supposed to (approx- imately) jointly diagonalize L 1 ;:::; L L , andf l g L l=1 is the corresponding set of diagonal matrices consisting of graph frequencies. 
Based on (4.7), we can rewriteJ NLL objective function in (4.6) as J NLL (U;f l g L l=1 ) = L X l=1 k l (logj l j+Tr ( l U T S l U)) (4.8) Since (l) 1 = 0 for l = 1;::;L and u 1 = (1= p n)1 by properties of CGL matrices, we can simplify (4.8) as follows, J NLL ( e U; e ) = L X l=1 k l n X j=2 log( (l) j ) + (l) j u T j S l u j ; (4.9) where u j is thej-th column of U and (l) j = ( l ) jj . Also, the variablesfu j g n j=2 andf (l) 2 ;:::; (l) n g L l=1 are compactly denoted by e U and e , respectively. Thus, the minimization of the negative log- likelihood in (4.6) under the constraintC CGBT leads to the following problem, minimize e U; e L X l=1 k l n X j=2 log( (l) j ) + (l) j u T j S l u j subject to u T i u j = 8 < : 1 i =j 0 i6=j (4.10) which is equivalent to the CGBT problem in (4.2). The optimization problem in (4.10) is nonconvex due to its orthogonality constraints. Thus, we derive necessary conditions for local optimality using the Lagrange multiplier theorem [67]. The Lagrangian function associated to (4.10) is J LAG ( e U; e ;) =J NLL ( e U; e ) + n X i=2 i (u T i u i 1) + 2 X 2i<jn ij u T i u j ; (4.11) where denotes Lagrange multipliersf i g n i=2 andf ij g 2i<jn associated with the equality con- straints in (4.10). Note that the last summation term of (4.11) is simplied by exploiting the sym- metry between multipliers ( ij = ji ). Taking partial derivatives with respect to primal variables ( e U; e ) and equating to zero, we obtain following system of equations: @J LAG ( e U; e ;) @ (l) j = L X l=1 k l 1 (l) j + u T j S l u j ! = 0; (4.12) CHAPTER 4. GRAPH LEARNING FROM MULTIPLE GRAPHS 69 @J LAG ( e U; e ;) @u j = L X l=1 2k l (l) j S l u j + 2 j u j + 2 X 2i<jn ij u i = 0: (4.13) Since k l , (l) j and u T j S l u j are all nonnegative, (4.12) can be simplied as 1= (l) j = u T j S l u j : (4.14) By multiplying (4.13) with u T j from the left, we get L X l=1 2k l (l) j u T j S l u j + 2 j = 0: (4.15) Then, replacing u T j S l u j with 1= (l) j , as stated in (4.14), leads to j = L X l=1 k l for j = 2;:::;n: (4.16) By multiplying (4.13) with (1=2)u T h from the left, we get L X l=1 k l (l) j u T h S l u j + hj = 0 j = 2;:::;n; h6=j: (4.17) Then, switching h and j indexes leads to L X l=1 k l (l) h u T j S l u h + jh = 0 h = 2;:::;n; j6=h: (4.18) Subtracting (4.18) from (4.17) results in the following equation, u T h L X l=1 k l ( (l) j (l) h )S l ! u j = 0 j;h = 2;:::;n; j6=h; (4.19) where (l) j = 1=(u T j S l u j ) and (l) h = 1=(u T h S l u h ) by (4.14). Based on (4.14) and (4.19), the necessary optimality conditions for the CGBT are u T h L X l=1 k l ( (l) j (l) h )S l ! u j = 0 for j;h = 2;:::;n; j6=h; (4.20) u T j S l u j = 1= (l) j for j = 2;:::;n; (4.21) both of which obviously hold if U jointly diagonalizes S 1 ;:::; S L , since each u T h S l u j term would be CHAPTER 4. GRAPH LEARNING FROM MULTIPLE GRAPHS 70 equal to zero forh6=j and 1= (l) j forh=j2. However, for a given U that cannot jointly diagonalize the empirical covariances, the weighted combinations of S 1 ;:::; S L in (4.20) measure the deviation from diagonality for each u j and u h pair, and it is used in our algorithm to update columns of U (discussed in Section 4.3). Note that the overall eect of S l on the deviation depends not only on the number of data samples (k l ) but also on graph frequencies ( (l) j and (l) h ) where the corresponding weight is k l ( (l) j (l) h ). The other condition in (4.21) is used in CGF estimation discussed next. Common Graph Frequency (CGF) Estimation. 
Common Graph Frequency (CGF) Estimation. To find the best CGF from the multiple graph frequency matrices $\Lambda_1, \ldots, \Lambda_L$, we introduce the following constraint on the graph Laplacians,

$$\mathcal{C}_{CGF}: \;\; L_l = \hat{U} \Lambda \hat{U}^T \quad \text{for } l = 1, 2, \ldots, L, \qquad (4.22)$$

where $\hat{U}$ is an optimal solution to the CGBT problem in (4.10), and $\Lambda = \mathrm{diag}([\lambda_1\, \lambda_2\, \cdots\, \lambda_n]^T)$ is the diagonal matrix we want to estimate. By using (4.22), we can rewrite $J_{NLL}(L_1, L_2, \ldots, L_L) = J_{NLL}(\Lambda)$ in (4.6) as

$$J_{NLL}(\Lambda) = \sum_{l=1}^{L} k_l \big( -\log|\Lambda| + \mathrm{Tr}(\Lambda\, \hat{U}^T S_l \hat{U}) \big). \qquad (4.23)$$

Since the first eigenvalue of a CGL matrix is zero, we simplify the above equation as

$$J_{NLL}(\lambda_2, \ldots, \lambda_n) = \sum_{l=1}^{L} k_l \sum_{j=2}^{n} \big( -\log(\lambda_j) + \lambda_j\, \hat{u}_j^T S_l \hat{u}_j \big), \qquad (4.24)$$

where $\hat{u}_j$ is the j-th column of $\hat{U}$ and $\lambda_j = (\Lambda)_{jj}$. By using the optimality condition $1/\lambda_j^{(l)} = \hat{u}_j^T S_l \hat{u}_j$ in (4.21), we get

$$J_{NLL}(\lambda_2, \ldots, \lambda_n) = \sum_{l=1}^{L} k_l \sum_{j=2}^{n} \big( -\log(\lambda_j) + \lambda_j / \lambda_j^{(l)} \big), \qquad (4.25)$$

whose minimization leads to the CGF problem in (4.3). By taking the derivative of (4.25) with respect to $\lambda_j$ and equating it to zero, we obtain

$$\frac{\partial J_{NLL}(\lambda_2, \ldots, \lambda_n)}{\partial \lambda_j} = \sum_{l=1}^{L} k_l \Big( -\frac{1}{\lambda_j} + \frac{1}{\lambda_j^{(l)}} \Big) = 0. \qquad (4.26)$$

Since (4.25) is a convex function, (4.26) is the necessary and sufficient optimality condition for the CGF estimation, which can be compactly written as

$$\frac{1}{\lambda_j} = \frac{\sum_{l=1}^{L} k_l / \lambda_j^{(l)}}{\sum_{l=1}^{L} k_l} = \frac{\sum_{l=1}^{L} k_l / \lambda_j^{(l)}}{k_{\mathrm{sum}}}, \qquad j = 2, 3, \ldots, n. \qquad (4.27)$$

Therefore, the optimal common graph frequency $\lambda_j$ is determined by a weighted combination of $1/\lambda_j^{(1)}, \ldots, 1/\lambda_j^{(L)}$, where the weights depend on the number of observations (e.g., $k_l$) used to calculate the sample covariances (e.g., $S_l$).

Proposition 7. If $\hat{U}, \hat{\Lambda}_1, \ldots, \hat{\Lambda}_L$ are the optimal solutions of (4.2), then the CGF estimation problem in (4.3) has the following closed-form solution: the diagonal matrix Λ with entries $(\Lambda)_{11} = 0$ and

$$(\Lambda)_{jj} = \frac{\sum_{l=1}^{L} k_l}{\sum_{l=1}^{L} k_l / \hat{\lambda}_j^{(l)}} \quad \text{for } j = 2, 3, \ldots, n, \qquad (4.28)$$

where $\hat{\lambda}_j^{(l)} = (\hat{\Lambda}_l)_{jj}$.

Proof. The proof follows from (4.23)–(4.27).

Combinatorial graph Laplacian (CGL) Estimation. Assuming that the CGBT and CGF matrices (i.e., $\hat{U}$ and $\hat{\Lambda}$) are optimal, the best CGL can simply be obtained as $\hat{L} = \hat{U} \hat{\Lambda} \hat{U}^T$. However, estimating the optimal $\hat{U}$ is generally not feasible in practice, since the associated problem is nonconvex, so an estimated $\hat{U}$ is typically a locally optimal solution. Moreover, for a given set of graph Laplacian matrices, there may be no orthogonal matrix that satisfies $\mathcal{C}_{CGBT}$ (i.e., that jointly diagonalizes $L_1, \ldots, L_L$). Thus, in general, $\hat{L} = \hat{U} \hat{\Lambda} \hat{U}^T$ is not a graph Laplacian matrix. In order to find the closest CGL to the estimated $\hat{U} \hat{\Lambda} \hat{U}^T$, we propose to solve the CGL problem in (4.4), where $S = \hat{U} \hat{\Lambda}^{\dagger} \hat{U}^T$ is the analogue of the sample covariance in the original CGL problem discussed in Chapter 2.
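As a concrete illustration of Proposition 7, a minimal numpy sketch of the closed-form CGF computation is given below. The function name is ours; it assumes a CGBT estimate U_hat is already available and, using the relation $\Lambda_l = U^T L_l U$ noted earlier, approximates each $\hat{\lambda}_j^{(l)}$ by $\hat{u}_j^T L_l \hat{u}_j$, so the input CGLs can be used directly without computing pseudoinverses.

```python
import numpy as np

def combined_graph_frequencies(U_hat, laplacians, weights):
    """Closed-form CGF estimate (4.28): (Lambda)_11 = 0 and
    (Lambda)_jj = sum_l k_l / sum_l (k_l / lambda_j^(l)) for j >= 2."""
    n = U_hat.shape[0]
    lam = np.zeros(n)                      # lambda_1 = 0 by the CGL property
    k_sum = float(sum(weights))
    for j in range(1, n):                  # j = 2,...,n in 1-based indexing
        u_j = U_hat[:, j]
        # s_j = (1/k_sum) * sum_l k_l / (u_j^T L_l u_j)
        s_j = sum(k / float(u_j @ L @ u_j) for k, L in zip(weights, laplacians)) / k_sum
        lam[j] = 1.0 / s_j
    return lam
```

The combined Laplacian would then be obtained by solving the CGL problem in (4.4) with $\hat{S} = \hat{U} \hat{\Lambda}^{\dagger} \hat{U}^T$, which is outside the scope of this sketch.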
4.3 Proposed Solution

In order to find the best CGL matrix $\hat{L}$ that combines the L graph Laplacian matrices $\{L_l\}_{l=1}^{L}$, we propose Algorithm 4. We first solve the problem stated in (4.2) to find the best CGBT $\hat{U}$ (see line 1), and then estimate the CGF matrix based on the condition stated in (4.27) (see lines 2–4). Finally, the best CGL is found by solving the optimization problem in (4.4) (see line 5). Note that Algorithm 4 calls Algorithm 5 to find the best CGBT.

Algorithm 4 Multigraph Combining Algorithm
Input: CGLs $\{L_l\}_{l=1}^{L}$, positive weights $\{k_l\}_{l=1}^{L}$, initial matrix $U_{init}$ and error tolerance ε.
Output: Combined graph Laplacian $\hat{L}$
 1: $\hat{U}$ ← CGBT($\{L_l\}_{l=1}^{L}$, $\{k_l\}_{l=1}^{L}$, $U_{init}$, ε)   (i.e., apply Algorithm 5 to estimate a CGBT)
 2: $\{s_j\}_{j=2}^{n}$ = { $(1/k_{\mathrm{sum}}) \sum_{l=1}^{L} k_l / (\hat{u}_j^T L_l \hat{u}_j)$ }$_{j=2}^{n}$
 3: $\lambda_1 = 0$,  $\{\lambda_j\}_{j=2}^{n} = \{1/s_j\}_{j=2}^{n}$
 4: $\{(\hat{\Lambda})_{jj}\}_{j=1}^{n} = \{\lambda_j\}_{j=1}^{n}$
 5: $\hat{L}$ ← solve the problem in (4.4) with $\hat{S} = \hat{U} \hat{\Lambda}^{\dagger} \hat{U}^T$ to estimate $\hat{L}$
 6: return $\hat{L}$

In Algorithm 5, we present the CGBT algorithm proposed to solve the nonconvex optimization problem in (4.2). In a nutshell, for a given initial guess $U_{init}$, weights $\{k_l\}_{l=1}^{L}$ and error tolerance ε, the CGBT algorithm iteratively solves the equation in (4.20) by pairwise updating the columns of U (see lines 6–10). Specifically, the for loop at line 5 iterates over all (h, j) pairs for h, j = 2,...,n and h ≠ j until ε-convergence has been reached (see line 26). At each iteration, an optimized 2×2 rotation matrix R is used to update $u_h$ and $u_j$ (i.e., columns of U) so that (4.20) is satisfied (see lines 8–10). To find the best rotation, another iterative procedure is used, as depicted in Algorithm 5 between lines 15–28.

Algorithm 5 Common Graph-based Transform Algorithm (CGBT)
 1: function CGBT($\{L_l\}_{l=1}^{L}$, $U_{init}$, $\{k_l\}_{l=1}^{L}$, ε)
 2:   U = $U_{init}$
 3:   do
 4:     $U_{pre}$ = U
 5:     for all (h, j) pairs such that 2 ≤ h, j ≤ n and h ≠ j do
 6:       $u_h$ = (U)$_{:,h}$,  $u_j$ = (U)$_{:,j}$
 7:       $\{Q_l\}_{l=1}^{L}$ = { [ $1/(u_h^T L_l u_h)$   $1/(u_h^T L_l u_j)$ ;  $1/(u_j^T L_l u_h)$   $1/(u_j^T L_l u_j)$ ] }$_{l=1}^{L}$
 8:       R ← Best_Rotation($\{Q_l\}_{l=1}^{L}$, $\{k_l\}_{l=1}^{L}$, ε)
 9:       V = [$u_h$  $u_j$] R
10:       (U)$_{:,h}$ = (V)$_{:,1}$,  (U)$_{:,j}$ = (V)$_{:,2}$
11:     end for
12:   while $\|U - U_{pre}\|_F$ > ε   (stopping criterion)
13:   return U
14: end function
15: function Best_Rotation($\{Q_l\}_{l=1}^{L}$, $\{k_l\}_{l=1}^{L}$, ε)
16:   R = I
17:   do
18:     $R_{pre}$ = R,  T = 0
19:     $r_1$ = (R)$_{:,1}$,  $r_2$ = (R)$_{:,2}$
20:     for l = 1 to L do
21:       $\lambda_1^{(l)} = 1/(r_1^T Q_l r_1)$,  $\lambda_2^{(l)} = 1/(r_2^T Q_l r_2)$
22:       T = T + $k_l (\lambda_1^{(l)} - \lambda_2^{(l)}) Q_l$
23:     end for
24:     θ = $\frac{1}{2}\arctan\!\big( 2 (T)_{1,2} / ((T)_{1,1} - (T)_{2,2}) \big)$
25:     R = [ cos(θ)  −sin(θ) ;  sin(θ)  cos(θ) ]
26:   while $\|R - R_{pre}\|_F$ > ε   (stopping criterion)
27:   return R
28: end function

4.4 Results

In this section, we present experimental results demonstrating the performance of our multigraph combining algorithm, which solves (4.1) with $k_1 = k_2 = \cdots = k_L = 1$. For L given graph Laplacian matrices $L_1, \ldots, L_L$, we compare the proposed Algorithm 4 against the following two benchmark methods:
1. (Averaging method) The averaging method combines graphs as
   $$L_{avg} = \frac{1}{L} \sum_{l=1}^{L} L_l. \qquad (4.29)$$
2. (CGL method) The CGL problem in (4.4) is solved to combine graphs by setting the input matrix S as
   $$S = \frac{1}{\sum_{l=1}^{L} k_l} \sum_{l=1}^{L} k_l L_l^{\dagger} = \frac{1}{L} \sum_{l=1}^{L} L_l^{\dagger}, \qquad (4.30)$$
   since $k_1 = k_2 = \cdots = k_L = 1$. The solution (i.e., the combined CGL) is denoted as $L_{cgl}$.

We use two different metrics to measure the graph combining performance. The first metric, called coding gain, is a popular metric in information theory and compression [90]; it measures how well a designed CGBT U diagonalizes $L^{\dagger}$:

$$cg(U, L) = \Bigg( \frac{\prod_{i=1}^{n} (L^{\dagger})_{ii}}{\prod_{i=2}^{n} (U^T L^{\dagger} U)_{ii}} \Bigg)^{1/n}. \qquad (4.31)$$

The second metric is the graph Laplacian quadratic form in (1.5), which measures the average variation of k signals $\{y_i\}_{i=1}^{k}$ with respect to a graph Laplacian L:

$$av(\{y_i\}_{i=1}^{k}, L) = \frac{1}{k} \sum_{i=1}^{k} y_i^T L y_i. \qquad (4.32)$$

In our experiments, for each input graph $L_1, \ldots, L_L$ we randomly pick k = 1000n samples from the distribution $x_l \sim \mathsf{N}(0, L_l^{\dagger})$ and measure the average variation (av) with respect to the combined graphs $\hat{L}$, $L_{avg}$ and $L_{cgl}$. On the other hand, the coding gain is directly calculated for each input graph $L_1, \ldots, L_L$ using the GBTs $\hat{U}$, $U_{avg}$ and $U_{cgl}$ obtained from $\hat{L}$, $L_{avg}$ and $L_{cgl}$, respectively.
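For reference, the two evaluation metrics in (4.31) and (4.32) can be computed with a short numpy sketch along the following lines (a minimal illustration; the function names are ours):

```python
import numpy as np

def coding_gain(U, L):
    """Coding gain cg(U, L) in (4.31), computed from the pseudoinverse of L."""
    L_pinv = np.linalg.pinv(L)
    num = np.prod(np.diag(L_pinv))                 # prod_{i=1..n} (L^+)_{ii}
    den = np.prod(np.diag(U.T @ L_pinv @ U)[1:])   # prod_{i=2..n} (U^T L^+ U)_{ii}
    return (num / den) ** (1.0 / L.shape[0])

def average_variation(Y, L):
    """Average Laplacian quadratic form av({y_i}, L) in (4.32); Y holds samples as columns."""
    k = Y.shape[1]
    return np.trace(Y.T @ L @ Y) / k
```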
Figures 4.1 and 4.2 illustrate input graphs with line and mesh structures, respectively, together with the corresponding combined graphs. The coding gain and average variation results are presented in Tables 4.1–4.4. According to these results, the proposed graph combining algorithm outperforms both the averaging and CGL methods by providing larger coding gain and lower average variation.

Table 4.1: Coding gain (cg) results for the line graphs in Figure 4.1
  cg      L_1      L_2      L_3      L_4      L_5      Average
  U_hat   0.8298   0.8586   0.9066   0.8319   0.8812   0.8616
  U_avg   0.8216   0.8102   0.8160   0.8125   0.8061   0.8133
  U_cgl   0.8554   0.8591   0.8566   0.8565   0.8583   0.8572

Table 4.2: Average variation (av) results for the line graphs in Figure 4.1
  av      L_1       L_2       L_3       L_4       L_5       Average
  L_hat   15.0063   14.5198   13.4222   15.0538   13.9226   14.3850
  L_avg   21.1885   21.0094   21.1003   21.1065   21.0882   21.0986
  L_cgl   15.0628   14.9598   15.0286   15.0051   15.0065   15.0126

Table 4.3: Coding gain (cg) results for the graphs in Figure 4.2
  cg      L_1      L_2      L_3      L_4      Average
  U_hat   0.9791   0.8981   0.9604   0.8799   0.9294
  U_avg   0.9050   0.8549   0.9521   0.8801   0.8981
  U_cgl   0.8673   0.8997   0.9310   0.9184   0.9041

Table 4.4: Average variation (av) results for the graphs in Figure 4.2
  av      L_1      L_2      L_3      L_4      Average
  L_hat   2.6637   3.4654   1.8542   3.8498   2.9583
  L_avg   5.6080   7.0555   3.6657   7.6316   5.9902
  L_cgl   3.9964   4.5259   2.6581   4.9227   4.0258

4.5 Conclusions

We have introduced a novel framework for graph combining by (i) introducing a new problem formulation with a maximum likelihood criterion and (ii) proposing a solution involving common graph-based transform estimation and common graph frequency estimation. The experimental results have shown that the proposed multigraph combining method leads to a better model compared to the average graph model in terms of two well-known metrics, coding gain and quadratic Laplacian cost. Applications of the proposed framework to compression, such as designing aggregate models for clusters of signals/data, and to machine learning problems, such as clustering, are possible future directions for this work.

Figure 4.1: Combining L = 5 line graphs with n = 16 vertices: (a) input graphs L_1,...,L_5; (b) combined graph L_hat; (c) combined graph L_avg; (d) combined graph L_cgl. Edge weights are color coded between 0 and 1. Each input graph has only one weak edge with weight equal to 0.1, while all other edges are weighted as 0.95.

Figure 4.2: Combining L = 4 mesh-like graphs with n = 5 vertices: (a) input graphs L_1,...,L_4; (b) combined graph L_hat; (c) combined graph L_avg; (d) combined graph L_cgl. Edge weights are color coded between 0 and 1.

Chapter 5
Graph-based Transforms for Video Coding

Predictive transform coding is a fundamental compression technique adopted in many block-based image and video compression systems, where block signals are initially predicted from a set of available (already coded) reference pixels, and the resulting residual block signals are then transformed (generally by a linear transformation) to decorrelate residual pixel values for effective compression. After the prediction and transformation steps, a typical image/video compression system applies quantization and entropy coding to convert transform coefficients into a stream of bits.
(Earlier versions of the work in this chapter are published in [8, 91, 92, 7]. A journal version of this work will be submitted for publication subsequently [93].)

Figure 5.1 illustrates a representative encoder-decoder architecture comprising three basic components, (i) prediction, (ii) transformation, and (iii) quantization and entropy coding, which are also implemented in state-of-the-art compression standards such as JPEG [94], HEVC [95] and VP9 [96]. This chapter focuses mainly on the transformation component of video coding by developing graph-based techniques to design orthogonal transforms that adapt to the statistical characteristics of block residual signals. We also present theoretical justifications for the proposed techniques.

Figure 5.1: Building blocks of a typical video encoder and decoder consisting of three main steps: (i) prediction, (ii) transformation, and (iii) quantization and entropy coding.

In predictive transform coding of video, the prediction is typically carried out by choosing among multiple intra and inter prediction modes to exploit spatial and temporal redundancies in block signals. On the other hand, for the transformation, generally a single transform such as the discrete cosine transform (DCT) is separably applied to the rows and columns of each residual block. The main problem of using fixed block transforms is the implicit assumption that all residual blocks share the same statistical properties. However, residual blocks can have very diverse statistical characteristics depending on the video content and the prediction mode, as will be demonstrated by some of the experiments in this chapter. The latest video coding standard, HEVC [95], partially addresses this problem by additionally allowing the use of the asymmetric discrete sine transform (ADST or DST-7) for small (4×4) intra predicted blocks [97]. Yet, it has been shown that better compression can be achieved by using data-driven transform designs that adapt to the statistical properties of residual blocks [98, 99, 100, 101, 102, 103, 104, 105].

The majority of prior studies on transforms for video coding focus on developing transforms for intra predicted residuals. Ye and Karczewicz [98] propose the mode-dependent transform (MDT) scheme, where a Karhunen-Loeve transform (KLT) is designed for each intra prediction mode. More recently, in [99, 100], the MDT scheme is implemented on the HEVC standard, where a single KLT is trained for each intra prediction mode offered in HEVC. Moreover, in [101, 102, 103, 104], the authors demonstrate considerable coding gains over the MDT method by using the rate-distortion optimized transformation (RDOT) scheme, which suggests designing multiple transforms for each prediction mode. From the predetermined set of transforms, the encoder selects a transform by minimizing a rate-distortion (RD) cost. Since the RDOT scheme allows encoders to flexibly select a transform on a per-block basis, it provides better adaptation to residual blocks with different statistical characteristics as compared to the MDT scheme. However, all of these methods rely on KLTs, which are constructed by the eigendecomposition of sample covariance matrices.
In this work, we propose a graph-based modeling framework to design GBTs (a GBT is formally defined in Definition 5 in Chapter 1), where the models of interest are defined based on GMRFs whose inverse covariances are graph Laplacian matrices, as discussed in Chapter 2. Specifically, in our framework, we propose two different approaches to construct graph Laplacian matrices based on data (i.e., video signals) under specific structural constraints. Our approaches can be interpreted as methods to regularize the covariance matrices used to derive KLTs by introducing Laplacian and structural constraints, which potentially leads to better (inverse) covariance estimation (i.e., better models to characterize signals/data), as discussed in Chapter 2. The following two distinct methods are proposed to develop classes of GBTs for video coding, called GL-GBTs and EA-GBTs:

- Graph learning for GBT (GL-GBT) design: The GGL estimation problem (Problem 1 in Chapter 2) is used to estimate a GGL from training data, and the estimated graph Laplacian matrix is used to derive the GBT, called the GL-GBT. The proposed method allows us to impose structural constraints on the graph Laplacian in order to design graphs with desired connectivity. Like the KLT, the GL-GBT is learned from a sample covariance, but in addition it incorporates Laplacian and structural constraints reflecting the inherent model assumptions about the video signal. The proposed graph learning approach can be used to design GBTs for the MDT and RDOT schemes.
- Edge-adaptive GBT (EA-GBT) design: To adapt transforms to block signals with image edges (we use the term image edge to distinguish edges in image/video signals from edges in graphs), we develop edge-adaptive GBTs (EA-GBTs), which are designed on a per-block basis. These lead to a block-adaptive transform (BAT) scheme, where transforms are derived from graph Laplacians whose weights are modified based on image edges detected in a residual block.

From a statistical learning perspective, the main advantage of GL-GBT over KLT is that GL-GBT requires learning fewer model parameters from training data, and thus potentially leads to a more robust transform allowing better compression for block signals outside of the training dataset. In addition, EA-GBTs provide per-block transform adaptation that is not applicable to KLTs. However, in practice, transform coding schemes such as RDOT and BAT require encoding additional side information, called the transform signaling overhead, which is used by the decoder to identify the transforms used at the encoder. Thus, it is important for any transform coding scheme to consider the rate-distortion (RD) tradeoff [106]. For example, in the RDOT scheme, the number of GL-GBTs/KLTs trained for each mode should be limited so that the designed transforms capture common block characteristics with low signaling overhead. Similarly, for EA-GBTs the encoder should make a decision for each block, i.e., whether to send the detected edge (graph) information or to use a fixed transform such as the DCT. In our experiments, we comprehensively evaluate the performance of different transforms considering the rate-distortion tradeoff.

The main contributions of this chapter can be summarized as follows:
- The graph learning techniques proposed in Chapter 2 are used to develop separable and nonseparable GBTs, called GBST and GBNT respectively, and their theoretical justifications for coding residual block signals modeled based on GMRFs are presented.
In addition to the 1-D GMRF models used to design GBSTs for intra and inter predicted signals in our previous work [8], a general 2-D GMRF model is presented for GBNTs, and the optimality of the proposed GBTs is analyzed.
- EA-GBTs are applied to both intra and inter predicted blocks, while our prior work in [92] focuses only on inter predicted blocks. In addition to the experimental results, we further derive some theoretical results and discuss the cases in which EA-GBTs are useful.
- We present graph-based interpretations of the proposed models and of the statistical characteristics of signals in terms of the entries of graph Laplacian matrices. Also, we show that different types of DCTs and DSTs, including the DCT-2 and DST-7 used in HEVC, are specific examples of GBTs.

In the literature, there are a few studies on model-based transform designs for video coding. Most related to our work, Han et al. [97] present a single-parameter 1-D GMRF and discuss the cases where the ADST (DST-7) is optimal for coding intra predicted signals. Hu et al. [107] extend this model by introducing another parameter to represent piecewise-smooth signals and use it to develop transforms for depth map coding. Although the models introduced in [97, 107] are analogous to our 1-D GMRF introduced for intra predicted signals, our model is defined using multiple parameters (i.e., the weights of a line graph), so it is more general than [97, 107]. In addition, [97] and [107] focus only on separable transforms and do not consider inter predicted signals. In [6, 108], the authors present a graph-based probabilistic framework for predictive video coding and use it to justify the optimality of the DCT. However, optimal graph/transform design is out of their scope. In our previous work [7], we present an extensive comparison of various instances of different graph learning problems for 2-D nonseparable image modeling. This chapter theoretically and empirically validates one of the conclusions in [7], which suggests the use of GBTs derived from GGL matrices. Moreover, several edge-adaptive transform approaches have been proposed. Shen et al. [109] propose edge-adaptive transforms (EATs) specifically for depth map compression. Although our proposed EA-GBTs adopt some basic concepts originally introduced in [109], our graph construction method is different (in terms of image edge detection) and provides better compression for residual signals. Hu et al. [110] extend EATs for piecewise-smooth image compression and apply them to depth map coding, while our work focuses on encoding intra and inter predicted blocks.

This chapter is organized as follows. Section 5.1 introduces the 1-D and 2-D GMRFs used for modeling the video signals and discusses graph-based interpretations. In Section 5.2, the GL-GBT design problem is formulated as a graph Laplacian estimation problem, and solutions for optimal GBNT and GBST construction are proposed. Section 5.3 presents EA-GBTs. Graph-based interpretations of residual block signal characteristics are discussed in Section 5.4 by empirically validating the theoretical observations. The experimental results are presented in Section 5.5, and Section 5.6 draws concluding remarks.
Figure 5.2: 1-D GMRF models for (a) intra and (b) inter predicted signals. Black-filled vertices represent the reference pixels, and unfilled vertices denote pixels to be predicted and then transform coded.

Figure 5.3: 2-D GMRF models for (a) intra and (b) inter predicted signals. Black-filled vertices correspond to reference pixels obtained (a) from neighboring blocks and (b) from other frames via motion compensation. Unfilled vertices denote the pixels to be predicted and then transform coded.

5.1 Models for Video Block Signals

For modeling video block signals, we use Gaussian Markov random fields (GMRFs), which provide a probabilistic interpretation for our graph-based framework, as presented in Chapter 2. In statistical modeling of image/video signals, it is generally assumed that adjacent pixel values are positively correlated [4, 5]. This assumption is intuitively reasonable for video signals, since neighboring pixel values are often similar to each other due to spatial and temporal redundancy. With this general assumption, we propose attractive GMRFs to model intra/inter predicted video block signals, where Figures 5.2 and 5.3 illustrate the 1-D and 2-D GMRFs used to design separable and nonseparable GBTs (GBSTs and GBNTs), respectively. We formally define GBST and GBNT as follows.

Figure 5.4: Separable and nonseparable transforms. (Left) For an N×N image/video block, separable transforms (e.g., GBSTs) are composed of two (possibly distinct) N×N transforms, $U_{row}$ and $U_{col}$, applied to the rows and columns of the block. (Right) Nonseparable transforms (e.g., GBNTs) apply an N²×N² linear transformation using U.

Definition 8 (Graph-based Separable Transform (GBST)). Let $U_{row}$ and $U_{col}$ be N×N GBTs associated with two line graphs with N vertices; then the GBST of X is

$$\hat{X} = U_{col}^T\, X\, U_{row}, \qquad (5.1)$$

where $U_{row}$ and $U_{col}$ are applied to each row and each column of an N×N block signal X, respectively.

Definition 9 (Graph-based Nonseparable Transform (GBNT)). Let U be an N²×N² GBT associated with a graph with N² vertices; then the GBNT of an N×N block signal X is

$$\hat{X} = \mathrm{block}\big(U^T \mathrm{vec}(X)\big), \qquad (5.2)$$

where U is applied to the vectorized signal x = vec(X), and the block operator restructures the signal back into block form.

Figure 5.4 shows separable and nonseparable transforms applied to a block signal. In the rest of this section, we first present three different 1-D GMRFs for intra and inter predicted signals, as well as an edge model. Then, we introduce a 2-D GMRF generalizing these models.

5.1.1 1-D GMRF Models for Residual Signals

In order to model the rows/columns of N×N residual block signals, we construct 1-D GMRFs based on first-order autoregressive (AR) processes. Specifically, depending on the type of prediction (i.e., intra/inter), a set of reference samples, denoted as y, is used to predict $x = [x_1\, x_2\, \cdots\, x_n]^T$, which represents the n = N samples in a row/column of a block (for the 1-D models, the number of vertices n equals the number of pixels N in a row/column of an N×N block). Assuming that the optimal causal (MMSE) predictor p is used for intra/inter prediction, the precision matrix of the resulting residual signal, r = x − p, is then derived to define the 1-D GMRF of r.

Intra predicted signal model. For modeling intra predicted signals as a 1-D GMRF, illustrated in Figure 5.2a, we formulate the following stochastic difference equations, which generalize the models in [97, 107] by allowing arbitrary correlation parameters $\rho_i$ for i = 0,...,n−1 (these are fixed in [97, 107]):

$$x_1 = \rho_0 (y + d) + e_1, \quad x_2 = \rho_1 x_1 + e_2, \quad \ldots, \quad x_{n-1} = \rho_{n-2} x_{n-2} + e_{n-1}, \quad x_n = \rho_{n-1} x_{n-1} + e_n, \qquad (5.3)$$
where the reference sample y is used to predict the n samples in $x = [x_1\, x_2\, \cdots\, x_n]^T$. The random variable $d \sim \mathsf{N}(0, \sigma_d^2)$ models the distortion due to compression in the reference sample y, and $e_i \sim \mathsf{N}(0, \sigma_e^2)$ is the noise in $x_i$ with a fixed variance $\sigma_e^2$. The spatial correlation coefficients between samples are denoted by $\rho_0, \rho_1, \ldots, \rho_{n-1}$, and we also assume that the random variables d and $e_i$ are independent for i = 1,...,n. The relations in (5.3) can be written more compactly in vector form as $Qx = y_1 + d_1 + e$, where $y_1 = [(\rho_0 y)\, 0\, \cdots\, 0]^T$, $d_1 = [(\rho_0 d)\, 0\, \cdots\, 0]^T$, $e = [e_1\, e_2\, \cdots\, e_n]^T$ and

$$Q = \begin{bmatrix} 1 & & & & \\ -\rho_1 & 1 & & & \\ & -\rho_2 & 1 & & \\ & & \ddots & \ddots & \\ & & & -\rho_{n-1} & 1 \end{bmatrix}. \qquad (5.4)$$

Since $x = Q^{-1} y_1 + Q^{-1}(d_1 + e)$, where $p = Q^{-1} y_1$ is the optimal prediction for x, the resulting residual vector is r = x − p, and its covariance matrix is

$$\Sigma_r = Q^{-1}\, E[(e + d_1)(e + d_1)^T]\, (Q^{-1})^T. \qquad (5.5)$$

Inverting the covariance matrix gives us the precision matrix,

$$\Theta_{intra} = \Sigma_r^{-1} = Q^T \big(E[ee^T + d_1 d_1^T]\big)^{-1} Q = \frac{1}{\sigma_e^2}\, Q^T\, \mathrm{diag}\!\Big(\tfrac{1}{1+\varepsilon_d},\, 1,\, \ldots,\, 1\Big)\, Q, \qquad (5.6)$$

which defines our 1-D GMRF model for the residual signal r. This precision matrix is explicitly stated in (5.9), where $\varepsilon_d = (\rho_0 \sigma_d)^2 / \sigma_e^2$.

Inter predicted signal model. In modeling inter predicted signals, we have n reference samples, $y_1, \ldots, y_n$, used to predict the n samples in $x = [x_1\, x_2\, \cdots\, x_n]^T$, as shown in Figure 5.2b. So, we derive the following difference equations by incorporating multiple reference samples:

$$x_1 = \tilde{\rho}_1 (y_1 + d_1) + e_1, \qquad x_i = \rho_{i-1} x_{i-1} + \tilde{\rho}_i (y_i + d_i) + e_i \;\; \text{for } i = 2, \ldots, n, \qquad (5.7)$$

where $d_i \sim \mathsf{N}(0, \sigma_{d_i}^2)$ denotes the distortion due to compression in the reference sample $y_i$, and $e_i \sim \mathsf{N}(0, \sigma_e^2)$ is the noise in sample $x_i$. The random variables $e_i$ and $d_i$ are also assumed to be independent. In addition to the spatial correlation coefficients $\rho_1, \ldots, \rho_{n-1}$, our model includes temporal correlation coefficients $\tilde{\rho}_1, \ldots, \tilde{\rho}_n$. The recursive relations in (5.7) can also be written in vector form, that is, $Qx = y_2 + d_2 + e$, where Q is given in (5.4), $y_2 = [\tilde{\rho}_1 y_1\, \tilde{\rho}_2 y_2\, \cdots\, \tilde{\rho}_n y_n]^T$ and $d_2 = [\tilde{\rho}_1 d_1\, \tilde{\rho}_2 d_2\, \cdots\, \tilde{\rho}_n d_n]^T$. We write $x = Q^{-1} y_2 + Q^{-1}(d_2 + e)$, where $p = Q^{-1} y_2$ is the optimal prediction for x. So, the resulting residual vector is $r = x - p = Q^{-1}(d_2 + e)$, and its covariance is

$$\Sigma_r = Q^{-1}\, E[(e + d_2)(e + d_2)^T]\, (Q^{-1})^T. \qquad (5.8)$$

For reference, the explicit forms of the precision matrices for the intra, inter and edge-based (introduced below) residual models are:

$$\Theta_{intra} = \frac{1}{\sigma_e^2}\begin{bmatrix} \frac{1}{1+\varepsilon_d}+\rho_1^2 & -\rho_1 & & & \\ -\rho_1 & 1+\rho_2^2 & -\rho_2 & & \\ & -\rho_2 & \ddots & \ddots & \\ & & \ddots & 1+\rho_{n-1}^2 & -\rho_{n-1} \\ & & & -\rho_{n-1} & 1 \end{bmatrix} \qquad (5.9)$$

$$\Theta_{inter} = \frac{1}{\sigma_e^2}\begin{bmatrix} \frac{1}{1+\varepsilon_1}+\frac{\rho_1^2}{1+\varepsilon_2} & \frac{-\rho_1}{1+\varepsilon_2} & & & \\ \frac{-\rho_1}{1+\varepsilon_2} & \frac{1}{1+\varepsilon_2}+\frac{\rho_2^2}{1+\varepsilon_3} & \frac{-\rho_2}{1+\varepsilon_3} & & \\ & \frac{-\rho_2}{1+\varepsilon_3} & \ddots & \ddots & \\ & & \ddots & \frac{1}{1+\varepsilon_{n-1}}+\frac{\rho_{n-1}^2}{1+\varepsilon_n} & \frac{-\rho_{n-1}}{1+\varepsilon_n} \\ & & & \frac{-\rho_{n-1}}{1+\varepsilon_n} & \frac{1}{1+\varepsilon_n} \end{bmatrix} \qquad (5.10)$$

$$\Theta_{edge} = \frac{1}{\sigma_e^2}\begin{bmatrix} 1+\rho_1^2 & -\rho_1 & & & & \\ -\rho_1 & \ddots & \ddots & & & \\ & \ddots & 1+\frac{\rho_i^2}{1+\varepsilon_t} & \frac{-\rho_i}{1+\varepsilon_t} & & \\ & & \frac{-\rho_i}{1+\varepsilon_t} & \frac{1}{1+\varepsilon_t}+\rho_{i+1}^2 & -\rho_{i+1} & \\ & & & -\rho_{i+1} & \ddots & \ddots \\ & & & & -\rho_{n-1} & 1 \end{bmatrix} \qquad (5.11)$$

where, in (5.11), the rows away from the image edge follow the same tridiagonal pattern as in (5.9) with $\varepsilon_d = 0$.
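As a concrete illustration of how these precision matrices arise, the following minimal numpy sketch (the helper name and the parameter choices are ours) builds $\Theta_{intra}$ via (5.6), i.e., $\Theta = (1/\sigma_e^2)\, Q^T \mathrm{diag}(1/(1+\varepsilon_d), 1, \ldots, 1)\, Q$; the inter and edge-based precision matrices in (5.10) and (5.11) follow the same pattern with a different diagonal scaling.

```python
import numpy as np

def intra_precision(rho, sigma_e=1.0, sigma_d=0.0):
    """Theta_intra in (5.6)/(5.9); rho = [rho_0, rho_1, ..., rho_{n-1}]."""
    n = len(rho)
    Q = np.eye(n)
    for i in range(1, n):
        Q[i, i - 1] = -rho[i]          # -rho_i below the diagonal, as in (5.4)
    eps_d = (rho[0] * sigma_d) ** 2 / sigma_e ** 2
    D = np.ones(n)
    D[0] = 1.0 / (1.0 + eps_d)         # first entry accounts for reference-pixel distortion
    return (Q.T @ np.diag(D) @ Q) / sigma_e ** 2
```

A GBT for this 1-D model would then follow from the eigendecomposition of the resulting GGL matrix.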
By inverting the covariance matrix $\Sigma_r$, we obtain the precision matrix

$$\Theta_{inter} = \Sigma_r^{-1} = Q^T \big(E[ee^T + d_2 d_2^T]\big)^{-1} Q = \frac{1}{\sigma_e^2}\, Q^T\, \mathrm{diag}\!\Big(\tfrac{1}{1+\varepsilon_1},\, \tfrac{1}{1+\varepsilon_2},\, \ldots,\, \tfrac{1}{1+\varepsilon_n}\Big)\, Q, \qquad (5.12)$$

where $\varepsilon_i = (\tilde{\rho}_i \sigma_{d_i})^2 / \sigma_e^2$ for i = 1,...,n. The explicit form of the precision matrix $\Theta_{inter}$ is stated in (5.10).

Edge-based residual model. In this model, we extend the intra/inter predicted signal models by introducing an image edge, which is modeled using a random variable $t \sim \mathsf{N}(0, \sigma_t^2)$ representing a sharp signal transition. For example, to extend (5.3) with an image edge at $x_i$, we have

$$x_1 = \rho_0 y + e_1, \qquad x_i = \rho_{i-1} x_{i-1} + e_i + t, \qquad x_j = \rho_{j-1} x_{j-1} + e_j \;\; \text{for } j \notin \{1, i\}, \qquad (5.13)$$

where $e_i$ is defined as in (5.3), and it is statistically independent of t. In vector form, we can write $x = Q^{-1} y_1 + Q^{-1}(e + t)$ and $p = Q^{-1} y_1$, where Q is as stated in (5.4), $y_1 = [(\rho_0 y)\, 0\, \cdots\, 0]^T$, $e = [e_1\, e_2\, \cdots\, e_n]^T$ and $t = [0\, \cdots\, 0\, t\, 0\, \cdots\, 0]^T$. Then the residual signal is $r = x - p = Q^{-1}(e + t)$. Thus, the covariance of the residual r is

$$\Sigma_r = Q^{-1}\, E[(e + t)(e + t)^T]\, (Q^{-1})^T, \qquad (5.14)$$

and the corresponding precision matrix is

$$\Theta_{edge} = \Sigma_r^{-1} = Q^T \big(E[ee^T + tt^T]\big)^{-1} Q = \frac{1}{\sigma_e^2}\, Q^T\, \mathrm{diag}\!\Big(1,\, \ldots,\, 1,\, \tfrac{1}{1+\varepsilon_t},\, 1,\, \ldots,\, 1\Big)\, Q, \qquad (5.15)$$

which is explicitly stated in (5.11), where $\varepsilon_t = \sigma_t^2 / \sigma_e^2$. Although this model is derived for intra prediction, a similar edge-based model can be trivially constructed for inter prediction based on (5.7). Also, a special case of this model with $\rho_i = 1$ for i = 1,...,n is considered in [110].

Attractive 1-D GMRFs and DCTs/DSTs as GBTs. The 1-D GMRF models discussed above are attractive if $\rho_1, \rho_2, \ldots, \rho_{n-1}$ are nonnegative. In this case, the corresponding precision matrices in (5.9)–(5.11) are GGL matrices. Moreover, the DCT and DST can be considered as special cases of GBTs associated with line graphs (i.e., 1-D attractive GMRFs). The relation between different types of DCTs and graph Laplacians was originally discussed in [111], where DCT-2 (the type of DCT used extensively in image/video compression) is shown to be equal to the GBT derived from a uniformly weighted line graph (i.e., $\rho_i = 1$ for i = 1,...,n in the 1-D GMRFs presented above) with no self-loops (i.e., all vertex weights are zero). Thus, the corresponding graph Laplacian for the n-point DCT-2 is

$$L_u = \begin{bmatrix} 1 & -1 & & & \\ -1 & 2 & -1 & & \\ & \ddots & \ddots & \ddots & \\ & & -1 & 2 & -1 \\ & & & -1 & 1 \end{bmatrix}, \qquad (5.16)$$

which is a CGL matrix. In [8, 107], it has been shown that DST-7 (i.e., the ADST used in the HEVC standard) is the GBT derived from the GGL $L = L_u + V$, where $V = \mathrm{diag}([1\, 0\, \cdots\, 0]^T)$, which includes a self-loop at vertex $v_1$ with weight $f_v(v_1) = 1$. Based on the results in [111, 112], various other types of n-point DCTs and DSTs can be characterized based on GGLs with different vertex weights at $v_1$ and $v_n$.
Table 5.1 summarizes the graphs corresponding to different types of DCTs and DSTs, which are GBTs derived from graph Laplacians of the form $\widetilde{L} = L_u + \widetilde{V}$, where $\widetilde{V} = \mathrm{diag}([f_v(v_1)\, 0\, \cdots\, 0\, f_v(v_n)]^T)$.

Table 5.1: DCTs/DSTs corresponding to $\widetilde{L}$ with different vertex weights.
  Vertex weights   f_v(v_1)=0   f_v(v_1)=1   f_v(v_1)=2
  f_v(v_n)=0       DCT-2        DST-7        DST-4
  f_v(v_n)=1       DCT-8        DST-1        DST-6
  f_v(v_n)=2       DCT-4        DST-5        DST-2

Moreover, we can justify the use of DCT-2 and DST-7 in practice by analyzing the 1-D intra and inter predicted models. For this purpose, we assume that the parameters $\rho_i$ for i = 0, 1, ..., n−1 approach 1 (i.e., $\rho_i \to 1$), since video pixels are typically highly correlated in practice. Then, based on the precision matrices in (5.9) and (5.10), we can show that the optimal GBT leads to
- DCT-2 if $\sigma_d \gg \sigma_e$ for the intra predicted model (i.e., if intra prediction performance is bad),
- DST-7 if $\sigma_d \ll \sigma_e$ for the intra predicted model (i.e., if intra prediction performance is good),
- DCT-2 if $\varepsilon_1 = \cdots = \varepsilon_n$ for the inter predicted model (i.e., if inter prediction performance is similar across pixels).

5.1.2 2-D GMRF Model for Residual Signals

By extending the 1-D models discussed above, we introduce a general 2-D GMRF model for intra and inter predicted N×N block signals, depicted in Figure 5.3. For this purpose, we derive the precision matrix of the residual signal obtained by predicting the signal $x = [x_1\, x_2\, \cdots\, x_n]^T$ with $n = N^2$ from the $n_p$ reference samples in $y = [y_1\, y_2\, \cdots\, y_{n_p}]^T$ (i.e., predicting unfilled vertices using black-filled vertices in Figure 5.3). In our model, x and y are zero-mean and jointly Gaussian with respect to the following attractive 2-D GMRF:

$$p\Big(\begin{bmatrix} x \\ y \end{bmatrix} \Big|\, \Omega\Big) = \frac{1}{(2\pi)^{n/2} |\Omega|^{-1/2}} \exp\!\Big(-\frac{1}{2} \begin{bmatrix} x \\ y \end{bmatrix}^T \Omega \begin{bmatrix} x \\ y \end{bmatrix}\Big), \qquad (5.17)$$

where the precision matrix Ω and the covariance matrix $\Sigma = \Omega^{-1}$ are partitioned as

$$\Omega = \begin{bmatrix} \Omega_x & \Omega_{xy} \\ \Omega_{yx} & \Omega_y \end{bmatrix} = \begin{bmatrix} \Sigma_x & \Sigma_{xy} \\ \Sigma_{yx} & \Sigma_y \end{bmatrix}^{-1} = \Sigma^{-1}. \qquad (5.18)$$

Irrespective of the type of prediction (intra/inter), the optimal (MMSE) prediction of x from the reference samples in y is

$$p = E[x\,|\,y] = \Sigma_{xy}\Sigma_y^{-1} y = -\Omega_x^{-1}\Omega_{xy}\, y, \qquad (5.19)$$

and the resulting residual vector is r = x − p with covariance

$$\Sigma_r = \Sigma_{x|y} = E[rr^T] = E[(x-p)(x-p)^T] = E[xx^T + pp^T - 2xp^T] = \Sigma_x - \Sigma_{xy}\Sigma_y^{-1}\Sigma_{yx}. \qquad (5.20)$$

By the matrix inversion lemma [65], the precision matrix of the residual r is shown to be equal to $\Omega_x$, that is, the submatrix in (5.18):

$$\Sigma_r^{-1} = \big(\Sigma_x - \Sigma_{xy}\Sigma_y^{-1}\Sigma_{yx}\big)^{-1} = \Omega_x. \qquad (5.21)$$

Since we also have $\Sigma_x = (\Omega_x - \Omega_{xy}\Omega_y^{-1}\Omega_{yx})^{-1}$ by [65], the desired precision matrix can also be written as

$$\Theta_{residual} = \Sigma_r^{-1} = \Omega_x = \Sigma_x^{-1} + \Omega_{xy}\Omega_y^{-1}\Omega_{yx}. \qquad (5.22)$$

Proposition 8. Let the signals $x \in \mathbb{R}^n$ and $y \in \mathbb{R}^{n_p}$ be distributed based on the attractive GMRF model in (5.17) with precision matrix Ω. If the residual signal r is obtained by minimum mean square error (MMSE) prediction of x from y (i.e., r = x − E[x|y]), then the residual signal r is distributed as an attractive GMRF whose precision matrix is a generalized graph Laplacian (i.e., $\Theta_{residual}$ in (5.22)).

Proof. The proof follows from equations (5.17)–(5.22).

Proposition 8 also applies to the 1-D models in (5.3), (5.7) and (5.13), which are special cases of (5.17) with precision matrices as stated in (5.9)–(5.11).

5.1.3 Interpretation of Graph Weights for Predictive Coding

For a graph-based interpretation of our models, we first present two different formulations of GMRFs, and then investigate the effect of graph weights based on these two forms.

GMRFs in Predictive Form.
Let $x = [x_1\, x_2\, \cdots\, x_n]^T$ be a random vector drawn from a GMRF whose precision matrix is Θ, so that the entries of the error vector e are obtained by the optimal MMSE prediction,

$$e_i = x_i - \sum_{j \in S \setminus \{i\}} g_{ij}\, x_j \quad \text{for } i = 1, \ldots, n, \qquad (5.23)$$

where $g_{ij}$ denotes the optimal MMSE prediction coefficient such that $g_{ii} = 0$ for i = 1,...,n, and S = {1,...,n} is the index set for x. Letting the variance of the random variable $e_i$ be

$$\sigma_i^2 = E\big[\, e_i^2 \,\big|\, (x)_{S \setminus \{i\}} \big], \qquad (5.24)$$

we write the following conditional distribution,

$$p(x_i \,|\, x_j \text{ for } i \neq j) = \frac{1}{\sqrt{2\pi\sigma_i^2}} \exp\!\Bigg(-\frac{1}{2\sigma_i^2}\Big(x_i - \sum_{j \in S \setminus \{i\}} g_{ij}\, x_j\Big)^{2}\Bigg), \qquad (5.25)$$

from which the entries of the precision matrix can be written in terms of $g_{ij}$ and $\sigma_i^2$ as follows:

$$(\Theta)_{ij} = \begin{cases} -\frac{1}{\sigma_i^2}\, g_{ij} & \text{if } i \neq j \\ \frac{1}{\sigma_i^2} & \text{if } i = j. \end{cases} \qquad (5.26)$$

More compactly, the precision matrix is given by

$$\Theta = \Lambda^{-1}(I - G), \qquad (5.27)$$

where Λ is the diagonal matrix whose entries are $(\Lambda)_{ii} = \sigma_i^2$ for i = 1,...,n, and G is the prediction coefficient matrix whose entries are $(G)_{ij} = g_{ij}$, so that in matrix form

$$G = I - \Lambda\Theta, \qquad (5.28)$$

which is generally not a symmetric matrix, while Θ is symmetric by definition. Since $(\Lambda)_{ii} = 1/(\Theta)_{ii}$, the diagonal entries of G are zero (i.e., $(G)_{ii} = 0$ for all i) by (5.28). Specifically for attractive GMRFs, the $(G)_{ij}$ are nonnegative for all i and j.

GMRFs in Laplacian Quadratic Form. Based on Proposition 8, the distribution of the residual signal, denoted as r, is defined by the following GMRF whose precision matrix is a GGL matrix L,

$$p(r\,|\,L) = \frac{1}{(2\pi)^{n/2}\, |L|^{-1/2}} \exp\!\Big(-\frac{1}{2}\, r^T L\, r\Big), \qquad (5.29)$$

where the quadratic term in the exponent can be decomposed in terms of graph weights as stated in (1.5),

$$r^T L\, r = \sum_{i=1}^{n} (V)_{ii}\, r_i^2 + \sum_{(i,j) \in \mathcal{I}} (W)_{ij}\, (r_i - r_j)^2, \qquad (5.30)$$

where $r = [r_1\, \cdots\, r_n]^T$, $(W)_{ij} = -(L)_{ij}$, $(V)_{ii} = \sum_{j=1}^{n} (L)_{ij}$, and $\mathcal{I} = \{(i,j) \,|\, (v_i, v_j) \in \mathcal{E}\}$ is the set of index pairs of all vertices associated with the edge set $\mathcal{E}$.

Interpretation of Graph Weights. Based on the Laplacian quadratic form given in (5.29) and (5.30), it is clear that the distribution of the residual signal r depends on the edge weights (W) and vertex weights (V):
- assigning larger (resp. smaller) edge weights (e.g., $(W)_{ij}$) increases the probability of having a smaller (resp. larger) squared difference between corresponding residual pixel values (e.g., $r_i$ and $r_j$);
- assigning larger (resp. smaller) vertex weights (e.g., $(V)_{ii}$) increases the probability of pixel values (e.g., $r_i$) with smaller (resp. larger) magnitude.

In addition, the characterization of the precision matrix (i.e., L) in terms of $g_{ij}$ and $\sigma_i^2$ in (5.26) reveals the interplay between prediction and graph weights. Specifically, assuming that the prediction coefficients $g_{ij}$ for j = 1,...,n are given, the variance of the prediction error at vertex $v_i$, $\sigma_i^2$, gets smaller (resp. larger) as either one or both of the following hold:
- the degree at vertex $v_i$, $(D)_{ii}$, gets larger (resp. smaller);
- the vertex weight $(V)_{ii}$ gets larger (resp. smaller);
which follow from $(L)_{ii} = (D)_{ii} + (V)_{ii} = 1/\sigma_i^2$ by (5.26). Since the degree matrix D is defined using W, a good prediction may lead to an increase in both edge and vertex weights. For the case of video coding, the contribution of edges and vertices depends mainly on the statistics of the block signal and the type of prediction. We experimentally investigate the effect of edge and vertex weights in Section 5.4.
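To make the decomposition in (5.30) concrete, the short numpy sketch below (an illustration with our own function name) evaluates the vertex and edge contributions separately and checks that they sum to $r^T L r$ for any GGL.

```python
import numpy as np

def laplacian_quadratic_terms(L, r):
    """Split r^T L r into vertex and edge contributions as in (5.30)."""
    n = len(r)
    v = L.sum(axis=1)                          # vertex (self-loop) weights (V)_ii
    vertex_term = float(np.sum(v * r ** 2))    # sum_i (V)_ii r_i^2
    edge_term = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            w_ij = -L[i, j]                    # edge weight (W)_ij = -(L)_ij
            edge_term += w_ij * (r[i] - r[j]) ** 2
    assert np.isclose(vertex_term + edge_term, float(r @ L @ r))
    return vertex_term, edge_term
```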
5.2 Graph Learning for Graph-based Transform Design

5.2.1 Graph Learning: Generalized Graph Laplacian Estimation

As justified in Proposition 8, the residual signal $r \in \mathbb{R}^n$ is modeled as an attractive GMRF, $r \sim \mathsf{N}(0, \Sigma = L^{\dagger})$, whose precision matrix is a GGL denoted by L. To find the best GGL from a set of residual signals $\{r_1, \ldots, r_k\}$ in a maximum likelihood sense, we solve the GGL estimation problem (i.e., Problem 1 in Chapter 2) without the $\ell_1$-regularization, which is explicitly stated as follows:

$$\begin{aligned} \underset{L \succeq 0}{\text{minimize}} \quad & \mathrm{Tr}(LS) - \log\det(L) \\ \text{subject to} \quad & (L)_{ij} \leq 0 \;\; \text{if } (A)_{ij} = 1 \\ & (L)_{ij} = 0 \;\; \text{if } (A)_{ij} = 0 \end{aligned} \qquad (5.31)$$

where $S = \frac{1}{k}\sum_{i=1}^{k} r_i r_i^T$ is the sample covariance of the residual signals, and A is the connectivity matrix representing the structure of the graph (i.e., the set of graph edges). In order to optimally solve (5.31), we employ Algorithm 1 discussed in Chapter 2.

5.2.2 Graph-based Transform Design

To design separable and nonseparable GBTs, we first solve instances of (5.31), denoted as GGL(S, A). Then, the optimized GGL matrices are used to derive GBTs.

Graph-based Separable Transforms (GBST). For the GBST design, we solve two instances of (5.31) to optimize two separate line graphs used to derive $U_{row}$ and $U_{col}$ in (5.1). Since we wish to design a separable transform, each line graph can be optimized independently (alternatively, joint optimization of the transforms associated with rows and columns has recently been proposed in [113, 114]). Thus, our basic goal is finding the best line graph pair based on sample covariance matrices $S_{row}$ and $S_{col}$ created from rows and columns of residual block signals. For N×N residual blocks, the proposed GBST construction has the following steps:
1. Create the connectivity matrix $A_{line}$ representing a line graph structure with n = N vertices as in Figure 5.2.
2. Train two N×N sample covariances, $S_{row}$ and $S_{col}$, from the N rows and N columns of the residual blocks in the dataset.
3. Solve the instances GGL($S_{row}$, $A_{line}$) and GGL($S_{col}$, $A_{line}$) of the problem in (5.31) by using Algorithm 1 in Chapter 2 to learn the generalized graph Laplacian matrices $L_{row}$ and $L_{col}$, respectively.
4. Perform eigendecomposition on $L_{row}$ and $L_{col}$ to obtain the GBTs $U_{row}$ and $U_{col}$, which define the GBST as in (5.1).

Graph-based Nonseparable Transforms (GBNT). Similarly, for N×N residual block signals, we propose the following steps to design a GBNT:
1. Create the connectivity matrix A based on a desired graph structure. For example, A can represent a grid graph with $n = N^2$ vertices as in Figure 5.3.
2. Train the $N^2 \times N^2$ sample covariance S using the residual block signals in the dataset (after vectorizing the block signals).
3. Solve the problem GGL(S, A) by using Algorithm 1 in Chapter 2 to estimate the generalized graph Laplacian matrix L.
4. Perform eigendecomposition on L to obtain the $N^2 \times N^2$ GBNT U defined in (5.2) (see the sketch after this list).
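The last two steps of both procedures amount to an eigendecomposition of the learned Laplacians followed by (5.1) or (5.2). The minimal numpy sketch below illustrates only this derivation and the application of the resulting transforms to a block; it assumes the GGLs ($L_{row}$ and $L_{col}$, or the full L) have already been estimated with Algorithm 1 from Chapter 2, and it uses row-major vectorization for the nonseparable case.

```python
import numpy as np

def gbt_from_laplacian(L):
    """Derive a GBT as the eigenvector matrix of a (generalized) graph Laplacian."""
    _, U = np.linalg.eigh(L)           # eigenvectors ordered by increasing eigenvalue
    return U

def apply_gbst(X, L_row, L_col):
    """Graph-based separable transform (5.1): X_hat = U_col^T X U_row."""
    U_row = gbt_from_laplacian(L_row)
    U_col = gbt_from_laplacian(L_col)
    return U_col.T @ X @ U_row

def apply_gbnt(X, L_full):
    """Graph-based nonseparable transform (5.2) for an N x N block (L_full is N^2 x N^2)."""
    N = X.shape[0]
    U = gbt_from_laplacian(L_full)
    return (U.T @ X.reshape(-1)).reshape(N, N)   # row-major vec(X), then back to block form
```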
5.2.3 Optimality of Graph-based Transforms

It has been shown that the KLT is optimal for transform coding of jointly Gaussian sources in terms of the mean-square error (MSE) criterion under high-bitrate assumptions [115, 116, 117]. Since the GMRF models discussed in Section 5.1 lead to jointly Gaussian distributions, the corresponding KLTs are optimal in theory. However, in practice, a KLT is obtained by eigendecomposition of the associated sample covariance, which has to be estimated from a training dataset where (i) data samples may not closely follow the model assumptions (e.g., GMRFs), and (ii) the number of data samples may not be sufficient to accurately recover the parameters. As a result, the sample covariance may lead to a poor estimation of the actual model parameters. From the statistical learning theory perspective [118, 119], the advantage of our proposed GL-GBT over the KLT is that the KLT requires learning O(n²) model parameters, while the GL-GBT only needs O(n). Therefore, our graph learning approach provides better generalization in learning the signal model by taking the variance-bias tradeoff into account. This advantage can also be justified based on the following error bounds characterized in [120, 73]. Assuming that k residual blocks are used for calculating the sample covariance S, under a general set of assumptions, the error bound for estimating Σ with S derived in [120] is

$$\|\Sigma - S\|_F = O\Big(\sqrt{\tfrac{n^2 \log(n)}{k}}\Big), \qquad (5.32)$$

while estimating the precision matrix by using the proposed graph learning approach leads to the following bound shown in [73],

$$\|\Omega - L\|_F = O\Big(\sqrt{\tfrac{n \log(n)}{k}}\Big), \qquad (5.33)$$

where L denotes the estimated GGL. Thus, in terms of the worst-case errors (based on the Frobenius norm), the proposed method provides better model estimation as compared to the estimation based on the sample covariance. Section 5.5 empirically justifies the advantage of GL-GBT over KLT.

5.3 Edge-Adaptive Graph-based Transforms

The optimality of GL-GBTs relies on the assumption that the residual signal characteristics are the same across different blocks. However, in practice, video blocks often exhibit complex image edge structures that can degrade the coding performance when the transforms are designed from average statistics without any classification based on image edges. In order to achieve better compression for video signals with image edges, we propose edge-adaptive GBTs (EA-GBTs), which are designed on a per-block basis by constructing a graph-based model whose parameters are determined based on the salient image edges in each residual block. In this work, EA-GBTs are defined by using the following two parameters:
- $\mathcal{E}_{edge}$ is a subset of edges indicating the locations (of image edges) where the difference between pixel values is large;
- $s_{edge} \geq 1$ is a parameter representing the level of difference between pixel values (i.e., the sharpness of the image edge).

Figure 5.5: An illustration of graph construction for a given 8×8 residual block signal with $w_c = 1$ and $w_e = 0.1$, and the corresponding GBT: (a) initial graph, (b) residual block signal, (c) constructed graph, (d) constructed EA-GBT basis patches. The basis patches adapt to the characteristics of the residual block. The order of basis patches is in row-major ordering.

5.3.1 EA-GBT Construction

To design EA-GBTs on a per-block basis, we construct a graph for a given N×N block as follows:
1. We create a nearest-neighbor (4-connected) graph $\mathcal{G}$ with edge weights all equal to a fixed constant $w_c$.
2. On a given residual block, we apply the Prewitt operator to calculate the gradient in the vertical and horizontal directions, and then detect image edges by thresholding the gradient values.
3. Based on the detected image edges, we determine the set of edges $\mathcal{E}_{edge}$ in graph $\mathcal{G}$ and then modify the corresponding edge weights by setting them to $w_e = w_c / s_{edge}$.
4. The resulting graph is used to construct the associated GBT as in Definition 5 (a code sketch of this construction is given below).
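To make the per-block construction above concrete, the following numpy-only sketch (our own function name; the Prewitt filtering, the gradient threshold, and the rule for deciding which graph edges cross an image edge are illustrative simplifications rather than the exact detector settings used in the experiments) builds a 4-connected grid graph for an N×N residual block, weakens the affected edge weights by the factor $s_{edge}$, and returns the resulting EA-GBT.

```python
import numpy as np

def ea_gbt(block, w_c=1.0, s_edge=10.0, thresh=30.0):
    """Edge-adaptive GBT for an N x N residual block (illustrative sketch)."""
    N = block.shape[0]
    # Prewitt gradients; block boundaries handled by edge padding.
    pad = np.pad(block.astype(float), 1, mode='edge')
    kx = np.array([[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]])
    ky = kx.T
    gx, gy = np.zeros((N, N)), np.zeros((N, N))
    for i in range(N):
        for j in range(N):
            win = pad[i:i + 3, j:j + 3]
            gx[i, j], gy[i, j] = np.sum(win * kx), np.sum(win * ky)
    grad = np.hypot(gx, gy)

    idx = lambda i, j: i * N + j                   # row-major pixel-to-vertex map
    L = np.zeros((N * N, N * N))
    for i in range(N):
        for j in range(N):
            for di, dj in ((0, 1), (1, 0)):        # right and bottom neighbors
                if i + di < N and j + dj < N:
                    # weaken the edge if an image edge is detected at either endpoint
                    crosses = grad[i, j] > thresh or grad[i + di, j + dj] > thresh
                    w = w_c / s_edge if crosses else w_c
                    a, b = idx(i, j), idx(i + di, j + dj)
                    L[a, b] -= w; L[b, a] -= w
                    L[a, a] += w; L[b, b] += w
    _, U = np.linalg.eigh(L)                       # EA-GBT basis vectors
    return U, L
```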
Figure 5.5 illustrates an example of graph construction for the 8×8 block signal (Figure 5.5b) and the resulting EA-GBT (Figure 5.5d), which captures the characteristics of the residual block. Note that the EA-GBT construction can also be viewed as a classification procedure where each residual block (such as in Figure 5.5b) is assigned to a class of signals modeled based on a graph (such as in Figure 5.5c) determined through image edge detection. The above procedure results in nonseparable EA-GBTs, yet it can be trivially modified for separable EA-GBTs by designing line graphs based on the given image edge locations.

In terms of the 1-D edge-based residual model in (5.13), the proposed EA-GBT corresponds to the case where $\rho_i$ is fixed to a positive constant $w_c$ for i = 1,...,n, the parameter $s_{edge}$ is $s_{edge} = (1 + \varepsilon_t)$, and the edge set $\mathcal{E}_{edge}$ represents a single image edge located between vertices $v_i$ and $v_{i+1}$. By letting $w_c = \sigma_e^2 = 1$, the resulting precision matrix in (5.11) becomes

$$\Theta_{edge} = \begin{bmatrix} 2 & -1 & & & & & \\ -1 & 2 & -1 & & & & \\ & \ddots & \ddots & \ddots & & & \\ & & -1 & 1+w_e & -w_e & & \\ & & & -w_e & 1+w_e & -1 & \\ & & & & \ddots & \ddots & \ddots \\ & & & & & -1 & 1 \end{bmatrix} \qquad (5.34)$$

where $w_e = w_c / s_{edge} = 1/(1+\varepsilon_t)$ is the weight of the edge connecting vertices (pixels) $v_{i-1}$ and $v_i$. In order to evaluate the effect of the parameter $s_{edge}$ on pixel values, we generate k = 50000 samples $r_1, \ldots, r_k$ based on an attractive GMRF whose precision matrix is (5.34) with $w_e = 1/s_{edge}$ (i.e., a 1-D edge-based model with an image edge between $v_{i-1}$ and $v_i$), and the generated samples are scaled to have 8-bit intensity values within the range [0, 255]. Then, the average intensity difference between the pixels attached to vertices $v_{i-1}$ and $v_i$ is estimated as

$$m_{diff} = \frac{1}{k} \sum_{j=1}^{k} \big| (r_j)_{i-1} - (r_j)_{i} \big|. \qquad (5.35)$$

Table 5.2: Average absolute intensity differences ($m_{diff}$) between pixels connected by the edge with weight $w_e = w_c / s_{edge}$ for various sharpness parameters ($s_{edge}$).
  s_edge = w_c/w_e   1      2      5      10      20      40      100     200
  m_diff             3.99   5.64   8.93   12.65   17.89   25.30   40.13   56.26

Table 5.2 shows the estimated $m_{diff}$ values for different $s_{edge}$ parameters, where $m_{diff}$ is approximately equal to 4 for $s_{edge} = 1$ and increases with $s_{edge}$. For example, $m_{diff}$ approximately triples when $s_{edge} = 10$; in other words, the degree/sharpness of the discontinuity between the pixels at vertices $v_{i-1}$ and $v_i$ becomes three times as large. Moreover, for 2-D GMRF models, we can make a similar analysis based on (5.25), (5.26) and (5.30). Assuming that the 2-D GMRF model has image edges around a pixel $x_i$, the prediction from neighboring pixels (as in (5.23)) would naturally be difficult, so the prediction coefficients ($g_{ij}$) have smaller values and the error variance ($\sigma_i^2$) is larger. Since $(\Theta)_{ij} = -g_{ij}/\sigma_i^2$ by (5.26), this leads to smaller edge weights ($w_{ij} = -(\Theta)_{ij}$) and also a smaller degree (i.e., $(\Theta)_{ii}$).
Figure 5.6 illustrates the effect of $s_{edge}$ by showing random block samples generated from 2-D GMRF models having an image edge with different $s_{edge}$ parameters, where the image edge (i.e., transition) becomes more visually apparent as $s_{edge}$ increases.

Figure 5.6: Two random block samples obtained from three 2-D GMRF models having an image edge at a fixed location with three different sharpness parameters: (a) $s_{edge} = 10$, (b) $s_{edge} = 20$, (c) $s_{edge} = 100$. A larger $s_{edge}$ leads to a sharper transition across the image edge.

Although EA-GBTs can provide efficient coding of transform coefficients, the overall coding performance may not be sufficient due to the overhead of signaling the graph information to the decoder side, especially if multiple graph weight parameters are used to model image edges (e.g., sharpness). Taking the rate-distortion tradeoff into account, we propose to use a single parameter ($s_{edge}$) to model image edges, and for efficient representation of the detected edges (i.e., $\mathcal{E}_{edge}$) we employ the state-of-the-art binary edge-map codec called the arithmetic edge encoder (AEC) [121].

5.3.2 Theoretical Justification for EA-GBTs

We present a theoretical justification for the advantage of EA-GBTs over KLTs. In our analysis, we assume that the signal follows the 1-D edge-based residual signal model in (5.13), where the location of the image edge l is uniformly distributed as

$$P(l = j) = \begin{cases} \frac{1}{N-1} & \text{for } j = 1, \ldots, N-1 \\ 0 & \text{otherwise} \end{cases} \qquad (5.36)$$

where N is the number of pixels (vertices) on the line graph. This construction leads to a Gaussian mixture distribution obtained from the N−1 edge-based signal models, that is,

$$p(x) = \sum_{j=1}^{N-1} P(l = j)\, \mathsf{N}(0, \Sigma_j), \qquad (5.37)$$

where $\Sigma_j$ denotes the covariance of the model in (5.13) with an edge between pixels $v_j$ and $v_{j+1}$ (i.e., the transition variable t is located at $v_{j+1}$). In general, x does not have a jointly Gaussian distribution, so the KLT obtained from the covariance of x (which implicitly performs a second-order approximation of the distribution) can be suboptimal in the MSE sense [122]. On the other hand, the proposed EA-GBT removes the uncertainty coming from the random variable l by detecting the location of the image edge in the pixel (vertex) domain, and then constructing a GBT based on the detected image edge. Yet, EA-GBT requires allocating additional bits to represent the image edge (side) information, while the KLT only allocates bits for coding transform coefficients. To demonstrate the rate-distortion tradeoff between KLT- and EA-GBT-based coding schemes, we set the model parameters of (5.13) so that the corresponding precision matrix has the form in (5.34) with different edge locations.

Figure 5.7: Coding gain (cg) versus $s_{edge}$ for block sizes N = 4, 8, 16, 32, 64. EA-GBT provides better coding gain (i.e., cg is negative) when $s_{edge}$ is larger than 10 across different block sizes.

Figure 5.8: Coding gain (cg) versus bits per pixel (R/N) for different edge sharpness parameters $s_{edge}$ = 10, 20, 40, 100, 200. EA-GBT provides better coding gain (i.e., cg is negative) if $s_{edge}$ is larger than 10 for different block sizes.
Based on classical rate-distortion theory results under high-bitrate assumptions, the distortion can be written as a function of the bitrate [115, 116, 117],

$$D(\bar{R}) = \frac{N}{12}\, 2^{2\bar{H}_d}\, 2^{-2\bar{R}} \qquad (5.38)$$

with

$$\bar{R} = \frac{R}{N} \quad \text{and} \quad \bar{H}_d = \frac{1}{N} \sum_{i=1}^{N} H_d\big((c)_i\big), \qquad (5.39)$$

where R denotes the total bitrate allocated to code the transform coefficients in $c = U^T x$, and $H_d((c)_i)$ is the differential entropy of transform coefficient $(c)_i$. For EA-GBT, R is allocated to code both the transform coefficients ($R^{coeff}_{EA\text{-}GBT}$) and the side information ($R_{edge}$), so we have

$$R = R^{coeff}_{EA\text{-}GBT} + R_{edge} = R^{coeff}_{EA\text{-}GBT} + \log_2(N-1), \qquad (5.40)$$

while for the KLT, the bitrate is allocated only to code transform coefficients ($R^{coeff}_{KLT}$), so that $R = R^{coeff}_{KLT}$. Figure 5.7 depicts the coding gain of EA-GBT over KLT for different sharpness parameters ($s_{edge}$) in terms of the following metric, called coding gain,

$$cg(D_{EA\text{-}GBT}, D_{KLT}) = 10 \log_{10}\!\Big(\frac{D_{EA\text{-}GBT}}{D_{KLT}}\Big), \qquad (5.41)$$

where $D_{EA\text{-}GBT}$ and $D_{KLT}$ denote the distortion levels measured in the high-bitrate regime for EA-GBT and KLT, respectively. EA-GBT provides better compression for negative cg values in Figure 5.7, which appear when the sharpness of edges $s_{edge}$ is large (e.g., $s_{edge} > 10$). Moreover, the coding loss is negligible even if the image edge is not sharp (e.g., $s_{edge} < 2$) for large block sizes (e.g., N = 64). Note that the distortion function in (5.38) is derived based on high-bitrate assumptions. To characterize the rate-distortion tradeoff at different bitrates, we employ the reverse water-filling technique [123, 115] by varying the parameter θ to obtain rate and distortion measures as follows:

$$R(D) = \sum_{i=1}^{N} \frac{1}{2} \log_2\!\Big(\frac{\lambda_i}{D_i}\Big), \qquad (5.42)$$

where $\lambda_i$ is the i-th eigenvalue of the signal covariance, and

$$D_i = \begin{cases} \theta & \text{if } \theta < \lambda_i \\ \lambda_i & \text{if } \theta \geq \lambda_i \end{cases} \qquad (5.43)$$

so that $D = \sum_{i=1}^{N} D_i$. Figure 5.8 illustrates the coding gain formulated in (5.41) achieved at different bitrates, where each curve corresponds to a different $s_{edge}$ parameter. Similar to Figure 5.7, EA-GBT leads to better compression if the sharpness of edges, $s_{edge}$, is large (e.g., $s_{edge} > 10$ for R/N > 0.6; Table 5.2 shows the effect of different sharpness parameters $s_{edge}$ on pixel values). At low bitrates (e.g., R/N = 0.5), EA-GBT can perform worse than KLT for $s_{edge}$ = 20, 40, yet EA-GBT outperforms the KLT as the bitrate increases.
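For completeness, the reverse water-filling computation in (5.42)–(5.43) can be sketched as follows (a minimal illustration with our own function name, assuming strictly positive covariance eigenvalues); sweeping the water level θ traces out the rate-distortion curves summarized in Figure 5.8.

```python
import numpy as np

def reverse_waterfilling(eigvals, theta):
    """One rate-distortion point via reverse water-filling, (5.42)-(5.43)."""
    eigvals = np.asarray(eigvals, dtype=float)       # eigenvalues of the signal covariance
    D_i = np.where(theta < eigvals, theta, eigvals)  # per-component distortion (5.43)
    R = float(np.sum(0.5 * np.log2(eigvals / D_i)))  # total rate (5.42); term is 0 when D_i = lambda_i
    return R, float(np.sum(D_i))                     # (rate, total distortion D)
```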
5.4 Residual Block Signal Characteristics and Graph-based Models

Figure 5.9: Sample variances of residual signals corresponding to different prediction modes: (a) intra (planar), (b) intra (DC), (c) intra (horizontal), (d) intra (diagonal), (e) inter (2N×2N), (f) inter (N×2N). Each square corresponds to the sample variance of a pixel location. Darker colors represent larger sample variances.

In this section, we discuss the statistical characteristics of intra and inter predicted residual blocks, and empirically justify our theoretical analysis and observations in Section 5.1. Our empirical results are based on residual signals obtained by encoding 5 different video sequences (City, Crew, Harbour, Soccer and Parkrun) using the HEVC reference software (HM-14) [95] at 4 different QP parameters (QP ∈ {22, 27, 32, 37}). Although the HEVC standard does not implement optimal MMSE prediction (which is the main assumption in Section 5.1), it includes 35 intra and 8 inter prediction modes, which provide reasonably good prediction for different classes of block signals.

Naturally, residual blocks have different statistical characteristics depending on the type of prediction and the prediction mode. Figure 5.9 illustrates the sample variances of 8×8 residual blocks for different prediction modes (for 4×4 and 16×16 residual blocks, the structure of the sample variances is quite similar to that in Figure 5.9) and leads us to the following observations:
- As expected, inter predicted blocks have smaller sample variance (energy) across pixels compared to intra predicted blocks, because inter prediction provides better prediction with a larger number of reference pixels, as shown in Figure 5.3.
- In intra prediction, sample variances are generally larger at the bottom-right part of residual blocks, since reference pixels are located at the top and left of a block, where the prediction is relatively better. This holds specifically for planar, DC and diagonal modes, which use pixels on both top and left as references for prediction.
- For some angular modes, including the intra horizontal/vertical modes, only left/top pixels are used as references. In such cases the sample variance gets larger as the distance from the reference pixels increases. Figure 5.9c illustrates the sample variances corresponding to the horizontal mode.
- In inter prediction, sample variances are larger around the boundaries and corners of the residual blocks, mainly because of occlusions leading to partial mismatches between reference and predicted blocks.
- In inter prediction, PU partitions lead to larger residual energy around the partition boundaries, as shown in Figure 5.9f corresponding to horizontal PU partitioning (N×2N).

Figures 5.10–5.15 depict the graphs (i.e., normalized edge and vertex weights) optimized for residuals obtained by 4 intra and 2 inter prediction modes, whose corresponding sample variances are shown in Figure 5.9. In the figures, grid and line graphs are estimated based on the GBNT and GBST construction procedures, which involve solving the GGL estimation problem in (5.31), as discussed in detail in Chapter 2. Inspection of Figures 5.10–5.15 leads to the following observations, which validate our theoretical analysis and justify the interpretation of the model parameters in terms of graph weights discussed in Section 5.1:
- Irrespective of the prediction mode/type, vertex (self-loop) weights tend to be larger for the pixels that are connected to reference pixels. Specifically, in intra prediction, graphs have larger vertex weights for vertices (pixels) located at the top and/or left boundaries of the block (Figures 5.10–5.13), while the vertex weights are approximately uniform across vertices in inter prediction (Figures 5.14 and 5.15).
- In intra prediction, the grid and line graphs associated with planar and DC modes are similar in structure (Figures 5.10 and 5.11), where their edge weights decrease as the distance of the edges to the reference pixels increases. Also, vertex weights are larger for the pixels located at the top and left boundaries, since planar and DC modes use reference pixels from both sides (top and left). These observations indicate that the prediction performance gradually decreases for pixels increasingly farther away from the reference pixels.
- For intra prediction with the horizontal mode (Figure 5.12), the grid graph has larger vertex weights at the left boundary of the block. This is because the prediction only uses reference pixels on the left side of the block.
For this reason, the line graph associated with rows has a large self-loop weight at the first pixel, while the other line graph has no dominant vertex weights. However, the grid and line graphs for the diagonal mode (Figure 5.13) are more similar to the ones for the planar and DC modes, since the diagonal mode also uses references from both the top and left sides.

- For inter prediction with PU mode 2N×2N (which does not perform any partitioning), the graph weights (both vertex and edge weights) are approximately uniform across the different edges and vertices (Figure 5.14). This shows that the prediction performance is similar at different locations (pixels). In contrast, the graphs for PU mode N×2N (which performs horizontal partitioning) have smaller edge weights around the PU partitioning (Figure 5.15). In the grid graph, we observe smaller weights between the partitioned vertices. Among the line graphs, only the line graph designed for columns has a small weight in the middle, as expected.

5.5 Experimental Results

5.5.1 Experimental Setup

In our experiments, we generate two residual block datasets, one for training and the other for testing. The residual blocks are collected by using the HEVC reference software (HM version 14) [95]. For the training dataset, residual blocks are obtained by encoding 5 video sequences, City, Crew, Harbour, Soccer and Parkrun, and for the test dataset, we use 5 different video sequences, BasketballDrill, BQMall, Mobcal, Shields and Cactus. The sequences are encoded using 4 different quantization parameters, QP = {22, 27, 32, 37}, and transform block sizes are restricted to 4×4, 8×8 and 16×16. In both datasets, residual blocks are classified based on the side information provided by the HEVC encoder [95]. Specifically, intra predicted blocks are classified based on the 35 intra prediction modes offered in HEVC. Similarly, inter predicted blocks are classified into 7 different classes using prediction unit (PU) partitions, such that the 2 square PU partitions are grouped as one class and the other 6 rectangular PU partitions determine the other classes. Hence, we have 35 + 7 = 42 classes in total. For each class and block size, the optimal GBST, GBNT and KLT are designed using the residual blocks in the training dataset, while EA-GBTs are constructed based on the detected image edges. The details of the transform construction are discussed in Sections 5.2 and 5.3.

Figure 5.10: Edge weights (left) and vertex weights (right) learned from residual blocks obtained by intra prediction with the planar mode: (a) graph weights of a grid graph; (b) graph weights of line graphs for rows (top) and columns (bottom).

Figure 5.11: Edge weights (left) and vertex weights (right) learned from residual blocks obtained by intra prediction with the DC mode: (a) graph weights of a grid graph; (b) graph weights of line graphs for rows (top) and columns (bottom).

Figure 5.12: Edge weights (left) and vertex weights (right) learned from residual blocks obtained by intra prediction with the horizontal mode: (a) graph weights of a grid graph; (b) graph weights of line graphs for rows (top) and columns (bottom).

Figure 5.13: Edge weights (left) and vertex weights (right) learned from residual blocks obtained by intra prediction with the diagonal mode: (a) graph weights of a grid graph; (b) graph weights of line graphs for rows (top) and columns (bottom).
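The grid and line graphs shown in Figures 5.10-5.15 are what ultimately define the transforms evaluated below: a GBT basis consists of the eigenvectors of the learned (generalized) graph Laplacian, ordered by increasing eigenvalue. The following is a rough sketch of that final step for the separable (GBST) case, assuming edge and self-loop weights have already been learned; the weight values used here are made up for illustration and are not the learned ones:

```python
import numpy as np

def line_graph_ggl(edge_w, self_loop_w):
    """Generalized graph Laplacian L = D - W + V of a line graph.

    edge_w:      weights of the n-1 edges between consecutive pixels.
    self_loop_w: n vertex (self-loop) weights, e.g. large at pixels next to reference samples.
    """
    n = len(self_loop_w)
    W = np.zeros((n, n))
    for i, w in enumerate(edge_w):           # connect pixels i and i+1
        W[i, i + 1] = W[i + 1, i] = w
    return np.diag(W.sum(axis=1)) - W + np.diag(self_loop_w)

def gbt_from_laplacian(L):
    """GBT basis: Laplacian eigenvectors sorted by increasing eigenvalue."""
    eigvals, U = np.linalg.eigh(L)
    return U[:, np.argsort(eigvals)]

# Illustrative 8-point line graphs for a horizontal-mode residual block: the row graph
# has a strong self-loop at the first pixel (next to the left reference column), while
# the column graph has uniform weights and no dominant self-loops.
U_row = gbt_from_laplacian(line_graph_ggl([1.0] * 7, [1.5] + [0.0] * 7))
U_col = gbt_from_laplacian(line_graph_ggl([1.0] * 7, [0.0] * 8))

X = np.random.randn(8, 8)       # stand-in for an 8x8 residual block
coeffs = U_col.T @ X @ U_row    # separable GBST; a GBNT would instead use the eigenvectors
                                # of a grid-graph GGL applied to the vectorized block
```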
Figure 5.14: Edge weights (left) and vertex weights (right) learned from residual blocks obtained by inter prediction with PU mode 2N×2N: (a) graph weights of a grid graph; (b) graph weights of line graphs for rows (top) and columns (bottom).

Figure 5.15: Edge weights (left) and vertex weights (right) learned from residual blocks obtained by inter prediction with PU mode N×2N: (a) graph weights of a grid graph; (b) graph weights of line graphs for rows (top) and columns (bottom).

To evaluate the performance of the transforms, we adopt the mode-dependent transform (MDT) and the rate-distortion optimized transform (RDOT) schemes. The MDT scheme assigns a single transform, trained for each mode and each block size. In the RDOT scheme, the best transform is selected from a set of transforms T by minimizing the rate-distortion cost J(\lambda_{rd}) = D + \lambda_{rd} R [106], where the multiplier \lambda_{rd} = 0.85 \cdot 2^{(QP-12)/3} [95] depends on the QP parameter. In our simulations, different transform sets are chosen for each mode (i.e., class) and block size. Specifically, the RDOT scheme selects either the DCT or the transform designed for each mode and block size pair, so that the encoder has two transform choices for each block. Note that this requires the encoder to send one extra bit to identify the RD-optimized transform at the decoder side. For EA-GBTs, the necessary graph (i.e., image edge) information is also sent by using the arithmetic edge encoder (AEC) [121]. After the transformation of a block, the resulting transform coefficients are uniformly quantized and then entropy coded using arithmetic coding [124]. The compression performance is measured in terms of the Bjontegaard-delta rate (BD-rate) [125].

5.5.2 Compression Results

The following four tables compare the BD-rate performances of different transforms:

- Table 5.3 demonstrates the overall coding gains obtained by using KLTs, GBSTs and GBNTs with the MDT and RDOT schemes for intra and inter predicted blocks. According to the results, GBNT outperforms KLT irrespective of the prediction type and coding scheme. This validates our observation that the proposed graph learning method leads to a more robust transform than the KLT. For inter prediction, GBST performs slightly better than GBNT, since inter predicted residuals tend to have a separable structure, as shown in Figures 5.14 and 5.15. Moreover, the RDOT scheme significantly outperforms MDT.

- Table 5.4 compares the RDOT coding performance of KLTs, GBSTs and GBNTs on residual blocks with different prediction modes. In the RDOT scheme, the transform sets are T_KLT = {DCT, KLT}, T_GBST = {DCT, GBST} and T_GBNT = {DCT, GBNT}, which consist of the DCT and a trained transform for each mode and block size. The results show that GBNT consistently outperforms KLT for all prediction modes. Similar to Table 5.3, for inter prediction GBST provides slightly better compression than KLT and GBNT. Also, for angular modes (e.g., the diagonal mode) in intra predicted coding, GBNT significantly outperforms GBST, as expected.

- Table 5.5 shows (i) the coding gains obtained by using GL-GBTs (i.e., GBNTs) over KLTs and (ii) the amount of training data used to design the transforms in terms of k/n, where k is the number of data samples and n is the number of vertices (i.e., the number of pixels in the model).
For all cases, k/n is larger than 1000, so the training dataset has a sufficient amount of data, as empirically shown in Section 2.5.1, where relative errors (RE) are very close to zero for k/n = 1000. Hence, the performance gain obtained by GL-GBTs over KLTs is not due to a lack of data; rather, it is mainly because GL-GBTs provide better generalization compared to KLTs. Also, the BD-rates in Table 5.5 demonstrate that the level of coding gains largely depends on the residual signal characteristics rather than on k/n. Note that having a small k/n does not indicate whether the BD-rate would be larger or smaller. For example, the gain is larger for the planar mode even though k/n is smaller for the DC mode, while the opposite is the case when comparing the diagonal and horizontal modes.

- Table 5.6 demonstrates the RDOT coding performance of EA-GBTs for different modes. As shown in the table, the contribution of EA-GBT within the transform set T_GL-GBT+EA-GBT = {DCT, GL-GBT, EA-GBT} is limited to 0.3% for intra predicted coding, while it is approximately 0.8% for inter coding. On the other hand, if the transform set is selected as T_EA-GBT = {DCT, EA-GBT}, EA-GBT provides considerable coding gains, which are approximately 0.5% for intra and 1% for inter predicted coding.

Table 5.3: Comparison of KLT, GBST and GBNT with the MDT and RDOT schemes in terms of BD-rate (% bitrate reduction) with respect to the DCT. Smaller (negative) BD-rates mean better compression.

Transform | Intra (MDT) | Intra (RDOT) | Inter (MDT) | Inter (RDOT)
KLT       | -1.81       | -6.02        | -0.09       | -3.28
GBST      | -1.16       | -4.61        | -0.25       | -3.89
GBNT      | -2.04       | -6.70        | -0.18       | -3.68

5.6 Conclusion

In this work, we discuss the class of transforms called graph-based transforms (GBTs), with their applications to video compression. In particular, separable and nonseparable GBTs are introduced and two different design strategies are proposed. Firstly, the GBT design problem is posed as a graph learning problem, where we estimate graphs from data and the resulting graphs are used to define GBTs (GL-GBTs). Secondly, we propose edge-adaptive GBTs (EA-GBTs), which can be adapted on a per-block basis using side information (image edges in a given block). We also give theoretical justifications for these two strategies and show that well-known transforms such as DCTs and DSTs are special cases of GBTs, and that graphs can be used to design generalized (DCT-like or DST-like) separable transforms. Our experiments demonstrate that GL-GBTs can provide considerable coding gains with respect to standard transform coding schemes using the DCT. In comparison with the Karhunen-Loeve transform (KLT), GL-GBTs are more robust and provide better generalization. Although the coding gains obtained by including EA-GBTs in addition to GL-GBTs in the RDOT scheme are limited, using EA-GBTs alone provides considerable coding gains over the DCT.
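For reference, the BD-rate numbers reported in Table 5.3 above and in Tables 5.4-5.6 below follow the Bjontegaard metric [125], which compares two rate-distortion curves. The sketch below is a common way of computing that metric (not necessarily the exact script used for our experiments), and the four (bitrate, PSNR) pairs per configuration are made-up numbers for illustration only:

```python
import numpy as np

def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    """Average bitrate difference (%) of the test codec relative to the anchor.

    Bjontegaard method: fit log-rate as a cubic polynomial of PSNR for each codec,
    integrate both fits over the common PSNR range, and convert the mean log-rate
    difference back to a percentage. Negative values mean the test codec saves bits.
    """
    p_anchor = np.polyfit(psnr_anchor, np.log(rate_anchor), 3)
    p_test = np.polyfit(psnr_test, np.log(rate_test), 3)
    lo = max(min(psnr_anchor), min(psnr_test))
    hi = min(max(psnr_anchor), max(psnr_test))
    int_anchor = np.polyval(np.polyint(p_anchor), hi) - np.polyval(np.polyint(p_anchor), lo)
    int_test = np.polyval(np.polyint(p_test), hi) - np.polyval(np.polyint(p_test), lo)
    avg_log_diff = (int_test - int_anchor) / (hi - lo)
    return (np.exp(avg_log_diff) - 1.0) * 100.0

# Made-up rate/PSNR points standing in for the four QP values {22, 27, 32, 37}.
rate_dct = [4800.0, 2400.0, 1200.0, 600.0]   # kbps, DCT anchor
psnr_dct = [41.0, 38.2, 35.5, 32.8]          # dB
rate_gbt = [4650.0, 2330.0, 1170.0, 588.0]   # kbps, trained GBT
psnr_gbt = [41.0, 38.2, 35.5, 32.8]
print(round(bd_rate(rate_dct, psnr_dct, rate_gbt, psnr_gbt), 2))  # about -2.6 (% bitrate saving)
```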
Table 5.4: Comparison of KLT, GBST and GBNT for coding of different prediction modes in terms of BD-rate with respect to the DCT. Smaller (negative) BD-rates mean better compression.

Transform Set | Planar | DC    | Diagonal | Horizontal | All intra modes | Square | Rectangular | All inter modes
T_KLT         | -5.79  | -4.57 | -7.68    | -6.14      | -6.02           | -3.47  | -2.93       | -3.35
T_GBST        | -5.45  | -4.12 | -3.32    | -6.45      | -4.61           | -3.95  | -3.25       | -3.89
T_GBNT        | -6.27  | -5.04 | -8.74    | -6.53      | -6.70           | -3.84  | -3.18       | -3.68

Table 5.5: The coding gains achieved by using GL-GBT over KLT in terms of BD-rate, and the number of training data samples used per number of vertices (k/n) to design transforms for different prediction modes. Negative BD-rates mean that GL-GBT outperforms KLT.

        | Planar   | DC       | Diagonal | Horizontal | All intra modes | Square   | Rectangular | All inter modes
BD-rate | -0.51    | -0.49    | -1.15    | -0.42      | -0.72           | -0.38    | -0.26       | -0.34
k/n     | 9.48x10^4 | 5.56x10^4 | 2.87x10^3 | 2.45x10^4 | 1.26x10^4      | 3.54x10^4 | 5.05x10^3  | 6.80x10^3

Table 5.6: The contribution of EA-GBTs in terms of BD-rate with respect to the DCT.

Transform Set    | Planar | DC    | Diagonal | Horizontal | All intra modes | Square | Rectangular | All inter modes
T_GL-GBT         | -6.27  | -5.04 | -8.74    | -6.53      | -6.70           | -3.95  | -3.25       | -3.89
T_EA-GBT         | -0.51  | -0.47 | -0.69    | -0.66      | -0.54           | -1.01  | -0.73       | -0.93
T_GL-GBT+EA-GBT  | -6.58  | -5.34 | -9.07    | -6.89      | -7.01           | -4.80  | -3.65       | -4.73

Chapter 6

Conclusions and Future Work

In this thesis, we propose novel techniques to build graph-based models from signals/data, where the models of interest are defined based on graph Laplacian matrices. In particular, three categories of graph learning problems are addressed for graph-based modeling.

Firstly (in Chapter 2), an optimization framework is proposed to estimate three types of graph Laplacian matrices (i.e., GGLs, DDGLs and CGLs, defined in Section 1.1) from data under structural constraints. For each type of graph Laplacian, specialized block-coordinate descent algorithms are developed by incorporating the structural constraints. From a probabilistic perspective, the proposed algorithms can be viewed as methods to estimate the parameters of attractive GMRFs, whose precision matrices are graph Laplacian matrices (as discussed in Section 2.3). Our comprehensive experimental results demonstrate that our proposed algorithms provide more accurate and computationally efficient graph learning as compared to the state-of-the-art approaches in [30, 31, 32, 33, 34]. The proposed methods can be useful in applications that involve learning a similarity graph to represent affinity relations between the entries of a dataset. Depending on the application, learning a specific type of graph Laplacian matrix (GGL, DDGL or CGL) can also be useful (as discussed in Section 1.2). The present work primarily focuses on applications of graph-based models used to design transforms for video coding, and other related applications such as spectral clustering [18, 19, 20] and graph-based regularization and denoising [21, 79, 9] are considered as part of the future work. In addition, some of the ideas proposed in [50, 51, 52] for efficient and large-scale sparse inverse covariance estimation can be implemented in our algorithms as extensions.

Secondly (in Chapter 3), graph-based modeling is formulated as a graph system identification problem, where the aim is to jointly identify a combinatorial graph Laplacian (CGL) matrix and a graph-based filter (GBF) from a set of observed signals. In this work, we specifically consider the identification of graph systems with the GBFs in Table 3.1, and propose a novel alternating optimization algorithm, which iteratively solves for a CGL and a GBF.
At each iteration, a prefiltering operation defined by the estimated GBF is applied to the observed signals, and a CGL is estimated from the prefiltered signals by solving Problem 3 in Chapter 2. The experimental results demonstrate that our proposed algorithm outperforms the existing methods for modeling smooth signals and learning diffusion-based models [31, 32, 33, 34] in terms of graph estimation accuracy. The proposed approach can be used to learn diffusion kernels from signals/data, which are special cases of graph systems widely used in many applications [21, 17, 79]. Our future work focuses on extensions of our algorithm for the joint identification of graphs and polynomial filters (i.e., estimation of polynomials of graph Laplacians), which can provide more degrees of freedom in designing filters than the GBFs in Table 3.1.

Thirdly (in Chapter 4), the multigraph combining problem is studied, and a novel algorithm is proposed to optimize a single graph from a dataset consisting of multiple graph Laplacians. Our simulation results show that the proposed algorithm provides better graph-based models than (i) the commonly used averaging method and (ii) the direct Laplacian estimation approach, in terms of both coding gain and graph Laplacian quadratic form (signal smoothness) metrics. The proposed combining method can also be used to build an aggregate/ensemble graph-based model from clusters/groups of data samples. For example, multiple graph Laplacians can be estimated from clusters of data samples by using the methods in Chapters 2 and 3, so that each graph is associated with a cluster, and then the proposed multigraph combining algorithm can be employed to learn an aggregate graph-based model. This type of approach can be useful in developing optimized transforms for coding clusters of signals/data. In the case of video coding, multigraph combining can be used to develop graphs from multiple edge-based models (discussed in Section 5.3), in order to adapt to block characteristics with different edge structures, or to combine graph-based models obtained from clusters of samples associated with different intra/inter modes (discussed in Section 5.4). In our future work, we will explore applications of multigraph combining to design transforms for video coding.

Finally (in Chapter 5), two distinct methods are developed to construct graph-based transforms (GBTs) for video compression. In one method, instances of the GGL estimation problem (i.e., Problem 1 in Chapter 2) with line and grid graph (connectivity) constraints are solved to learn the optimal graph weights from residual block samples, and the resulting line and grid graphs are used to derive separable and nonseparable GBTs (GL-GBTs), respectively. The other method constructs a graph (i.e., an edge-based residual model discussed in Section 5.3) for each residual block based on detected image edges, so that the corresponding GBTs (EA-GBTs) are adapted on a per-block basis. The proposed methods are theoretically and empirically justified. The experimental results demonstrate that the proposed GL-GBTs can provide better compression than KLTs, and that EA-GBTs can achieve considerable coding gains over the DCT. However, the overall contribution of EA-GBT over GL-GBT remains limited (in terms of BD-rates). Our future work includes extending our edge-based model to image edges with smooth transitions (e.g., ramp edges [91]) in order to improve EA-GBT designs.
Also, practical implementations of GBTs on a state-of-the-art encoder are considered as part of the future work.

Bibliography

[1] J. Kleinberg and E. Tardos, Algorithm Design. Boston, MA, USA: Addison-Wesley Longman Publishing Co., Inc., 2005.
[2] P. Bühlmann and S. van de Geer, Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer Publishing Co., Inc., 2011.
[3] D. Shuman, S. Narang, P. Frossard, A. Ortega, and P. Vandergheynst, "The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains," IEEE Signal Processing Magazine, vol. 30, no. 3, pp. 83-98, 2013.
[4] A. K. Jain, Fundamentals of Digital Image Processing. Upper Saddle River, NJ, USA: Prentice-Hall, Inc., 1989.
[5] A. M. Tekalp, Digital Video Processing, 2nd ed. Upper Saddle River, NJ, USA: Prentice Hall Press, 2015.
[6] C. Zhang and D. Florêncio, "Analyzing the optimality of predictive transform coding using graph-based models," IEEE Signal Process. Lett., vol. 20, no. 1, pp. 106-109, 2013.
[7] E. Pavez, H. E. Egilmez, Y. Wang, and A. Ortega, "GTT: Graph template transforms with applications to image coding," in Picture Coding Symposium (PCS), May 2015, pp. 199-203.
[8] H. E. Egilmez, Y. H. Chao, A. Ortega, B. Lee, and S. Yea, "GBST: Separable transforms based on line graphs for predictive video coding," in 2016 IEEE International Conference on Image Processing (ICIP), Sept 2016, pp. 2375-2379.
[9] P. Milanfar, "A tour of modern image filtering: New insights and methods, both practical and theoretical," IEEE Signal Processing Magazine, vol. 30, no. 1, pp. 106-128, Jan 2013.
[10] S. K. Narang and A. Ortega, "Perfect reconstruction two-channel wavelet filter banks for graph structured data," IEEE Transactions on Signal Processing, vol. 60, no. 6, pp. 2786-2799, June 2012.
[11] H. E. Egilmez and A. Ortega, "Spectral anomaly detection using graph-based filtering for wireless sensor networks," in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2014, pp. 1085-1089.
[12] A. Anis, A. Gadde, and A. Ortega, "Efficient sampling set selection for bandlimited graph signals using graph spectral proxies," IEEE Transactions on Signal Processing, vol. 64, no. 14, pp. 3775-3789, July 2016.
[13] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference and Prediction, 2nd ed. Springer, 2008.
[14] Y. S. Abu-Mostafa, M. Magdon-Ismail, and H.-T. Lin, Learning From Data. AMLBook, 2012.
[15] R. I. Kondor and J. D. Lafferty, "Diffusion kernels on graphs and other discrete input spaces," in Proceedings of the Nineteenth International Conference on Machine Learning, ser. ICML '02. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2002, pp. 315-322. [Online]. Available: http://dl.acm.org/citation.cfm?id=645531.655996
[16] F. R. K. Chung, Spectral Graph Theory. USA: American Mathematical Society, 1997.
[17] J. Lafferty and G. Lebanon, "Diffusion kernels on statistical manifolds," J. Mach. Learn. Res., vol. 6, pp. 129-163, Dec. 2005.
[18] A. Y. Ng, M. I. Jordan, and Y. Weiss, "On spectral clustering: Analysis and an algorithm," in Proceedings of the 14th International Conference on Neural Information Processing Systems: Natural and Synthetic, ser. NIPS'01. Cambridge, MA, USA: MIT Press, 2001, pp. 849-856.
[19] M. Soltanolkotabi, E. Elhamifar, and E. J. Candès, "Robust subspace clustering," Ann. Statist., vol. 42, no. 2, pp. 669-699, 04 2014.
[20] U.
von Luxburg, "A tutorial on spectral clustering," Statistics and Computing, vol. 17, no. 4, pp. 395-416, Dec. 2007.
[21] A. J. Smola and I. R. Kondor, "Kernels and regularization on graphs," in Proceedings of the Annual Conference on Computational Learning Theory, 2003.
[22] D. A. Spielman, "Algorithms, graph theory, and the solution of Laplacian linear equations," in Automata, Languages, and Programming - 39th International Colloquium, ICALP 2012, 2012, pp. 24-26.
[23] T. Bıyıkoğlu, J. Leydold, and P. F. Stadler, "Laplacian eigenvectors of graphs," Lecture Notes in Mathematics, vol. 1915, 2007.
[24] S. Kurras, U. von Luxburg, and G. Blanchard, "The f-adjusted graph Laplacian: A diagonal modification with a geometric interpretation," in Proceedings of the 31st International Conference on Machine Learning, 2014, pp. 1530-1538.
[25] F. Dörfler and F. Bullo, "Kron reduction of graphs with applications to electrical networks," IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 60, no. 1, pp. 150-163, Jan 2013.
[26] D. A. Spielman and S.-H. Teng, "A local clustering algorithm for massive graphs and its application to nearly linear time graph partitioning," SIAM Journal on Computing, vol. 42, no. 1, pp. 1-26, 2013.
[27] J. Batson, D. A. Spielman, N. Srivastava, and S.-H. Teng, "Spectral sparsification of graphs: Theory and algorithms," Commun. ACM, vol. 56, no. 8, pp. 87-94, Aug. 2013.
[28] D. A. Spielman and S. Teng, "Nearly linear time algorithms for preconditioning and solving symmetric, diagonally dominant linear systems," SIAM J. Matrix Analysis Applications, vol. 35, no. 3, pp. 835-885, 2014.
[29] A. Sandryhaila and J. M. F. Moura, "Discrete signal processing on graphs: Frequency analysis," IEEE Transactions on Signal Processing, vol. 62, no. 12, pp. 3042-3054, June 2014.
[30] J. Friedman, T. Hastie, and R. Tibshirani, "Sparse inverse covariance estimation with the graphical lasso," Biostatistics, vol. 9, no. 3, pp. 432-441, Jul. 2008.
[31] X. Dong, D. Thanou, P. Frossard, and P. Vandergheynst, "Learning Laplacian matrix in smooth graph signal representations," IEEE Transactions on Signal Processing, vol. 64, no. 23, pp. 6160-6173, Dec 2016.
[32] V. Kalofolias, "How to learn a graph from smooth signals," in Proceedings of the 19th International Conference on Artificial Intelligence and Statistics (AISTATS), May 2016, pp. 920-929.
[33] B. M. Lake and J. B. Tenenbaum, "Discovering structure by learning sparse graph," in Proceedings of the 33rd Annual Cognitive Science Conference, 2010, pp. 778-783.
[34] S. Segarra, A. G. Marques, G. Mateos, and A. Ribeiro, "Network topology inference from spectral templates," CoRR, vol. abs/1608.03008v1, 2016. [Online]. Available: https://arxiv.org/abs/1608.03008v1
[35] H. E. Egilmez, E. Pavez, and A. Ortega, "Graph learning with Laplacian constraints: Modeling attractive Gaussian Markov random fields," in 2016 50th Asilomar Conference on Signals, Systems and Computers, Nov 2016, pp. 1470-1474.
[36] ——, "Graph learning from data under structural and Laplacian constraints," CoRR, vol. abs/1611.05181v1, 2016. [Online]. Available: https://arxiv.org/abs/1611.05181v1
[37] ——, "Graph learning from data under Laplacian and structural constraints," IEEE Journal of Selected Topics in Signal Processing, vol. PP, no. 99, pp. 1-1, 2017.
[38] MATLAB, version 8.1.0 (R2013a). Natick, Massachusetts: The MathWorks Inc., 2013.
[39] H. E. Egilmez, E. Pavez, and A.
Ortega, "GLL: Graph Laplacian learning package, version 1.0," https://github.com/STAC-USC/Graph_Learning, 2017.
[40] E. J. Candès, M. B. Wakin, and S. P. Boyd, "Enhancing sparsity by reweighted ℓ1 minimization," Journal of Fourier Analysis and Applications, vol. 14, no. 5, pp. 877-905, 2008.
[41] I. S. Dhillon and J. A. Tropp, "Matrix nearness problems with Bregman divergences," SIAM J. Matrix Anal. Appl., vol. 29, no. 4, pp. 1120-1146, Nov. 2007.
[42] O. Banerjee, L. E. Ghaoui, and A. d'Aspremont, "Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data," Journal of Machine Learning Research, vol. 9, pp. 485-516, 2008.
[43] P.-L. Loh and M. J. Wainwright, "Structure estimation for discrete graphical models: Generalized covariance matrices and their inverses," The Annals of Statistics, vol. 41, no. 6, pp. 3022-3049, 2013.
[44] H. Rue and L. Held, Gaussian Markov Random Fields: Theory and Applications, ser. Monographs on Statistics and Applied Probability. London: Chapman & Hall, 2005, vol. 104.
[45] D. Koller and N. Friedman, Probabilistic Graphical Models: Principles and Techniques - Adaptive Computation and Machine Learning. Cambridge, MA, USA: The MIT Press, 2009.
[46] A. P. Dempster, "Covariance selection," Biometrics, vol. 28, no. 1, pp. 157-175, 1972.
[47] N. Meinshausen and P. Bühlmann, "High-dimensional graphs and variable selection with the lasso," The Annals of Statistics, vol. 34, no. 3, pp. 1436-1462, 2006.
[48] R. Tibshirani, "Regression shrinkage and selection via the lasso," Journal of the Royal Statistical Society. Series B (Methodological), vol. 58, no. 1, pp. 267-288, 1996.
[49] R. Mazumder and T. Hastie, "The graphical lasso: New insights and alternatives," Electronic Journal of Statistics, vol. 6, pp. 2125-2149, 2012.
[50] ——, "Exact covariance thresholding into connected components for large-scale graphical lasso," The Journal of Machine Learning Research, vol. 13, no. 1, pp. 781-794, 2012.
[51] C.-J. Hsieh, M. A. Sustik, I. S. Dhillon, P. K. Ravikumar, and R. Poldrack, "BIG & QUIC: Sparse inverse covariance estimation for a million variables," in Advances in Neural Information Processing Systems 26, 2013, pp. 3165-3173.
[52] M. Grechkin, M. Fazel, D. Witten, and S.-I. Lee, "Pathway graphical lasso," in AAAI Conference on Artificial Intelligence, 2015, pp. 2617-2623.
[53] G. Poole and T. Boullion, "A survey on M-matrices," SIAM Review, vol. 16, no. 4, pp. 419-427, 1974.
[54] M. Slawski and M. Hein, "Estimation of positive definite M-matrices and structure learning for attractive Gaussian Markov random fields," Linear Algebra and its Applications, vol. 473, pp. 145-179, 2015.
[55] E. Pavez and A. Ortega, "Generalized Laplacian precision matrix estimation for graph signal processing," in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), March 2016, pp. 6350-6354.
[56] B. Pasdeloup, M. Rabbat, V. Gripon, D. Pastor, and G. Mercier, "Characterization and inference of graph diffusion processes from observations of stationary signals," CoRR, vol. abs/arXiv:1605.02569v3, 2017. [Online]. Available: https://arxiv.org/abs/arXiv:1605.02569v3
[57] S. Sardellitti, S. Barbarossa, and P. D. Lorenzo, "Graph topology inference based on transform learning," in 2016 IEEE Global Conference on Signal and Information Processing (GlobalSIP), Dec 2016, pp. 356-360.
[58] S. J. Wright, "Coordinate descent algorithms," Math. Program., vol. 151, no. 1, pp. 3-34, Jun. 2015.
[59] C.
Zhang, D. Florêncio, and P. A. Chou, "Graph signal processing - a probabilistic framework," Microsoft Research Technical Report, 2015.
[60] S. Hassan-Moghaddam, N. K. Dhingra, and M. R. Jovanovic, "Topology identification of undirected consensus networks via sparse inverse covariance estimation," in 2016 IEEE 55th Conference on Decision and Control (CDC), Dec 2016, pp. 4624-4629.
[61] S. Boyd and L. Vandenberghe, Convex Optimization. New York, NY, USA: Cambridge University Press, 2004.
[62] T. M. Cover and J. A. Thomas, "Determinant inequalities via information theory," SIAM J. Matrix Anal. Appl., vol. 9, no. 3, pp. 384-392, Nov. 1988.
[63] H. Ishwaran and J. S. Rao, "Spike and slab variable selection: Frequentist and Bayesian strategies," The Annals of Statistics, vol. 33, no. 2, pp. 730-773, 2005.
[64] M. Grant and S. Boyd, "CVX: Matlab software for disciplined convex programming, version 2.1," http://cvxr.com/cvx, Mar. 2014.
[65] M. A. Woodbury, Inverting Modified Matrices, ser. Statistical Research Group Memorandum Reports. Princeton, NJ: Princeton University, 1950, no. 42.
[66] J. Sherman and W. J. Morrison, "Adjustment of an inverse matrix corresponding to a change in one element of a given matrix," The Annals of Mathematical Statistics, vol. 21, no. 1, pp. 124-127, 03 1950.
[67] D. P. Bertsekas, Nonlinear Programming. Belmont, MA: Athena Scientific, 1999.
[68] D. Chen and R. J. Plemmons, "Nonnegativity constraints in numerical analysis," in The Birth of Numerical Analysis. Singapore: World Scientific Publishing, 2010, pp. 109-140.
[69] A. Beck and L. Tetruashvili, "On the convergence of block coordinate descent type methods," SIAM Journal on Optimization, vol. 23, no. 4, pp. 2037-2060, 2013.
[70] C. L. Lawson and R. J. Hanson, Solving Least Squares Problems, ser. Series in Automatic Computation. Englewood Cliffs, NJ, USA: Prentice-Hall, 1974.
[71] L. F. Portugal, J. J. Júdice, and L. N. Vicente, "A comparison of block pivoting and interior-point algorithms for linear least squares problems with nonnegative variables," Math. Comput., vol. 63, no. 208, pp. 625-643, Oct. 1994.
[72] S. Zhou, P. Rutimann, M. Xu, and P. Bühlmann, "High-dimensional covariance estimation based on Gaussian graphical models," J. Mach. Learn. Res., vol. 12, pp. 2975-3026, Nov. 2011.
[73] P. Ravikumar, M. Wainwright, B. Yu, and G. Raskutti, "High dimensional covariance estimation by minimizing l1-penalized log-determinant divergence," Electronic Journal of Statistics (EJS), vol. 5, pp. 935-980, 2011.
[74] C. Kemp and J. B. Tenenbaum, "The discovery of structural form," Proceedings of the National Academy of Sciences, vol. 105, no. 31, pp. 10687-10692, 2008.
[75] H. E. Egilmez, E. Pavez, and A. Ortega, "Graph learning from filtered signals: Graph system and diffusion kernel identification," 2017, in preparation.
[76] S. Segarra, G. Mateos, A. G. Marques, and A. Ribeiro, "Blind identification of graph filters," IEEE Transactions on Signal Processing, vol. 65, no. 5, pp. 1146-1159, March 2017.
[77] D. Thanou, X. Dong, D. Kressner, and P. Frossard, "Learning heat diffusion graphs," CoRR, vol. abs/1611.01456v1, 2016. [Online]. Available: https://arxiv.org/abs/1611.01456v1
[78] J. Mei and J. M. F. Moura, "Signal processing on graphs: Causal modeling of unstructured data," IEEE Transactions on Signal Processing, vol. PP, no. 99, pp. 1-1, 2016.
[79] A. D. Szlam, M. Maggioni, and R. R. Coifman, "Regularization on graphs with function-adapted diffusion processes," J. Mach. Learn. Res., vol. 9, pp.
1711-1739, Jun. 2008.
[80] E. Kalnay, M. Kanamitsu, R. Kistler, W. Collins, D. Deaven, L. Gandin, M. Iredell, S. Saha, G. White, J. Woollen, Y. Zhu, A. Leetmaa, R. Reynolds, M. Chelliah, W. Ebisuzaki, W. Higgins, J. Janowiak, K. C. Mo, C. Ropelewski, J. Wang, R. Jenne, and D. Joseph, "The NCEP/NCAR 40-year reanalysis project," Bulletin of the American Meteorological Society, vol. 77, no. 3, pp. 437-471, 1996.
[81] L. C. Evans, Partial Differential Equations. Providence, R.I.: American Mathematical Society, 2010.
[82] H. E. Egilmez, A. Ortega, O. G. Guleryuz, J. Ehmann, and S. Yea, "An optimization framework for combining multiple graphs," in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), March 2016, pp. 4114-4118.
[83] W. Tang, Z. Lu, and I. S. Dhillon, "Clustering with multiple graphs," in Proc. IEEE International Conference on Data Mining (ICDM), 2009, pp. 1016-1021.
[84] A. Argyriou, M. Herbster, and M. Pontil, "Combining graph Laplacians for semi-supervised learning," in Neural Information Processing Systems (NIPS), 2005, pp. 67-74.
[85] B. N. Flury, "Common principal components in k groups," Journal of the American Statistical Association, vol. 79, no. 388, pp. 892-898, 1984.
[86] B. N. Flury and W. Gautschi, "An algorithm for simultaneous orthogonal transformation of several positive definite symmetric matrices to nearly diagonal form," SIAM J. Sci. Stat. Comput., vol. 7, no. 1, pp. 169-184, Jan. 1986.
[87] A. Belouchrani, K. Abed-Meraim, J.-F. Cardoso, and E. Moulines, "A blind source separation technique using second-order statistics," IEEE Transactions on Signal Processing, vol. 45, no. 2, pp. 434-444, Feb 1997.
[88] J.-F. Cardoso and A. Souloumiac, "Jacobi angles for simultaneous diagonalization," SIAM J. Mat. Anal. Appl., vol. 17, no. 1, pp. 161-164, Jan. 1996.
[89] G. H. Golub and H. A. van der Vorst, "Eigenvalue computation in the 20th century," Journal of Computational and Applied Mathematics, vol. 123, no. 12, pp. 35-65, 2000.
[90] K. R. Rao and P. Yip, Discrete Cosine Transform: Algorithms, Advantages, Applications. San Diego, CA, USA: Academic Press Professional, Inc., 1990.
[91] Y. H. Chao, H. E. Egilmez, A. Ortega, S. Yea, and B. Lee, "Edge adaptive graph-based transforms: Comparison of step/ramp edge models for video compression," in 2016 IEEE International Conference on Image Processing (ICIP), Sept 2016, pp. 1539-1543.
[92] H. E. Egilmez, A. Said, Y.-H. Chao, and A. Ortega, "Graph-based transforms for inter predicted video coding," in IEEE International Conference on Image Processing (ICIP), Sept 2015, pp. 3992-3996.
[93] H. E. Egilmez, Y. H. Chao, and A. Ortega, "Graph-based transforms for video coding," 2017, in preparation.
[94] W. B. Pennebaker and J. L. Mitchell, JPEG Still Image Data Compression Standard, 1st ed. Norwell, MA, USA: Kluwer Academic Publishers, 1992.
[95] G. J. Sullivan, J.-R. Ohm, W.-J. Han, and T. Wiegand, "Overview of the High Efficiency Video Coding (HEVC) standard," IEEE Trans. Circuits Syst. Video Technol., vol. 22, no. 12, pp. 1649-1668, Dec. 2012.
[96] D. Mukherjee, J. Bankoski, R. S. Bultje, A. Grange, J. Han, J. Koleszar, P. Wilkins, and Y. Xu, "The latest open-source video codec VP9 - an overview and preliminary results," in Proc. 30th Picture Coding Symp., San Jose, CA, Dec. 2013.
[97] J. Han, A. Saxena, V. Melkote, and K. Rose, "Jointly optimized spatial prediction and block transform for video and image coding," IEEE Trans. Image Process., vol. 21, no. 4, pp.
1874-1884, Apr. 2012.
[98] Y. Ye and M. Karczewicz, "Improved H.264 intra coding based on bi-directional intra prediction, directional transform, and adaptive coefficient scanning," in IEEE International Conference on Image Processing (ICIP), Oct 2008, pp. 2116-2119.
[99] S. Takamura and A. Shimizu, "On intra coding using mode dependent 2D-KLT," in Proc. 30th Picture Coding Symp., San Jose, CA, Dec. 2013, pp. 137-140.
[100] A. Arrufat, P. Philippe, and O. Deforges, "Non-separable mode dependent transforms for intra coding in HEVC," in 2014 IEEE Visual Communications and Image Processing Conference, Dec 2014, pp. 61-64.
[101] F. Zou, O. Au, C. Pang, J. Dai, X. Zhang, and L. Fang, "Rate-distortion optimized transforms based on the Lloyd-type algorithm for intra block coding," IEEE Journal of Selected Topics in Signal Processing, vol. 7, no. 6, pp. 1072-1083, Dec 2013.
[102] X. Zhao, J. Chen, M. Karczewicz, L. Zhang, X. Li, and W. J. Chien, "Enhanced multiple transform for video coding," in 2016 Data Compression Conference (DCC), March 2016, pp. 73-82.
[103] X. Zhao, J. Chen, A. Said, V. Seregin, H. E. Egilmez, and M. Karczewicz, "NSST: Non-separable secondary transforms for next generation video coding," in 2016 Picture Coding Symposium (PCS 2016), December 2016, pp. 1-5.
[104] A. Said, X. Zhao, M. Karczewicz, H. E. Egilmez, V. Seregin, and J. Chen, "Highly efficient non-separable transforms for next generation video coding," in 2016 Picture Coding Symposium (PCS 2016), December 2016, pp. 1-5.
[105] Y. H. Chao, A. Ortega, and S. Yea, "Graph-based lifting transform for intra-predicted video coding," in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), March 2016, pp. 1140-1144.
[106] A. Ortega and K. Ramchandran, "Rate-distortion methods for image and video compression," IEEE Signal Process. Mag., vol. 15, no. 6, pp. 23-50, Nov. 1998.
[107] W. Hu, G. Cheung, and A. Ortega, "Intra-prediction and generalized graph Fourier transform for image coding," IEEE Signal Processing Letters, vol. 22, no. 11, pp. 1913-1917, Nov 2015.
[108] C. Zhang, D. Florêncio, and P. Chou, "Graph signal processing - a probabilistic framework," Microsoft Research, Tech. Rep. MSR-TR-2015-31, April 2015. [Online]. Available: http://research.microsoft.com/apps/pubs/default.aspx?id=243326
[109] G. Shen, W.-S. Kim, S. Narang, A. Ortega, J. Lee, and H. Wey, "Edge-adaptive transforms for efficient depth map coding," in Picture Coding Symposium (PCS), Dec 2010, pp. 566-569.
[110] W. Hu, G. Cheung, A. Ortega, and O. C. Au, "Multiresolution graph Fourier transform for compression of piecewise smooth images," IEEE Transactions on Image Processing, vol. 24, no. 1, pp. 419-433, Jan 2015.
[111] G. Strang, "The discrete cosine transform," SIAM Rev., vol. 41, no. 1, pp. 135-147, Mar. 1999.
[112] M. Püschel and J. M. F. Moura, "Algebraic signal processing theory: 1-D space," IEEE Transactions on Signal Processing, vol. 56, no. 8, pp. 3586-3599, 2008.
[113] H. E. Egilmez, O. G. Guleryuz, J. Ehmann, and S. Yea, "Row-column transforms: Low-complexity approximation of optimal non-separable transforms," in 2016 IEEE International Conference on Image Processing (ICIP), Sept 2016, pp. 2385-2389.
[114] E. Pavez, A. Ortega, and D. Mukherjee, "Learning separable transforms by inverse covariance estimation," in 2017 IEEE International Conference on Image Processing (ICIP), Sept 2017, pp. 1-1.
[115] A. Gersho and R. M. Gray, Vector Quantization and Signal Compression.
Norwell, MA, USA: Kluwer Academic Publishers, 1991.
[116] V. K. Goyal, "Theoretical foundations of transform coding," IEEE Signal Processing Magazine, vol. 18, no. 5, pp. 9-21, Sep 2001.
[117] S. Mallat, A Wavelet Tour of Signal Processing, Third Edition: The Sparse Way, 3rd ed. Academic Press, 2008.
[118] V. N. Vapnik, "An overview of statistical learning theory," IEEE Transactions on Neural Networks, vol. 10, no. 5, pp. 988-999, Sep 1999.
[119] U. von Luxburg and B. Schölkopf, Statistical Learning Theory: Models, Concepts, and Results. Amsterdam, Netherlands: Elsevier North Holland, May 2011, vol. 10, pp. 651-706.
[120] R. Vershynin, "How close is the sample covariance matrix to the actual covariance matrix?" Journal of Theoretical Probability, vol. 25, no. 3, pp. 655-686, 2012.
[121] I. Daribo, D. Florêncio, and G. Cheung, "Arbitrarily shaped motion prediction for depth video compression using arithmetic edge coding," IEEE Transactions on Image Processing, vol. 23, no. 11, pp. 4696-4708, Nov 2014.
[122] M. Effros, H. Feng, and K. Zeger, "Suboptimality of the Karhunen-Loeve transform for transform coding," IEEE Transactions on Information Theory, vol. 50, no. 8, pp. 1605-1619, Aug 2004.
[123] T. M. Cover and J. A. Thomas, Elements of Information Theory. New York, NY, USA: Wiley-Interscience, 1991.
[124] A. Said, "Introduction to arithmetic coding: Theory and practice," HP Labs, Tech. Rep. HPL-2004-76, 2004.
[125] G. Bjontegaard, "Calculation of average PSNR differences between RD-curves," ITU-T Q.6/SG16 VCEG-M33 Contribution, vol. 48, Apr 2001.
Abstract
Graphs are fundamental mathematical structures used in various fields to represent data, signals and processes. Particularly in signal processing, machine learning and statistics, structured modeling of signals and data by means of graphs is essential for a broad number of applications. In this thesis, we develop novel techniques to build graph-based models and transforms for signal/data processing, where the models and transforms of interest are defined based on graph Laplacian matrices. For graph-based modeling, various graph Laplacian estimation problems are studied. Firstly, we consider estimation of three types of graph Laplacian matrices from data and develop efficient algorithms by incorporating associated Laplacian and structural constraints. Then, we propose a graph signal processing framework to learn graph-based models from classes of filtered signals, defined based on functions of graph Laplacians. The proposed approach can also be applied to learn diffusion (heat) kernels, which are popular in various fields for modeling diffusion processes. Additionally, we study the problem of multigraph combining, which is estimating a single optimized graph from multiple graphs, and develop an algorithm. Finally, we propose graph-based transforms for video coding and develop two different techniques, based on graph learning and image edge adaptation, to design orthogonal transforms capturing the statistical characteristics of video signals. Theoretical justifications and comprehensive experimental results for the proposed methods are presented.