SAMPLING THEORY FOR GRAPH SIGNALS WITH APPLICATIONS TO SEMI-SUPERVISED LEARNING

by Aamir Anis

DISSERTATION presented to the FACULTY OF THE USC GRADUATE SCHOOL, UNIVERSITY OF SOUTHERN CALIFORNIA, in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY in ELECTRICAL ENGINEERING. December 2017.

To my family.

Abstract

The representation, processing and analysis of large-scale data as signals defined over graphs have drawn much interest recently. Graphs allow us to embed natural interconnectivities between data points and exploit them during processing. As a result, graph signal processing has gained a strong foothold in various modern application domains such as machine learning, analysis of social, transportation, web and sensor networks, and even traditional areas such as image processing and video compression. Although powerful, this research area is still in its infancy. Recent efforts have therefore focused on translating well-developed tools of traditional signal processing for handling graph signals.

An important aspect of graph signal processing is defining a notion of frequency for graph signals. A frequency domain representation for graph signals can be defined using the eigenvectors and eigenvalues of variation operators (e.g., the graph Laplacian) that take into account the underlying graph connectivity. These operators can also be used to design graph spectral filters. The primary focus of our work is to develop a theory of sampling for graph signals that answers the following questions: (1) When can one recover a graph signal from its samples on a given subset of nodes of the graph? (2) What is the best choice of nodes to sample a given graph signal? Our formulation primarily works under the assumption of bandlimitedness in the graph Fourier domain, which amounts to smoothness of the signal over the graph. The techniques we employ to answer these questions are based on the introduction of special quantities called graph spectral proxies that allow our algorithms to operate in the vertex domain, thereby admitting efficient, localized implementations.

We also explore the sampling problem in the context of designing wavelet filterbanks on graphs. This problem is fundamentally different since one needs to choose a sampling scheme jointly over multiple channels of the filterbank. We explore constraints for designing perfect reconstruction two-channel critically-sampled filterbanks with low-degree polynomial filters, and conclude that such a design is in general not possible for arbitrary graphs. This leads us to propose an efficient technique for designing a critical sampling scheme that, given predesigned filters, aims to minimize the overall reconstruction error of the filterbank. We also explore M-channel filterbanks over M-block cyclic graphs (which are natural extensions of bipartite graphs), and propose a tree-structured design in a simpler setting where M is a power of 2.

As an application, we study the graph-based semi-supervised learning problem from a sampling theory point of view. A crucial assumption here is that class labels form a smooth graph signal over a similarity graph constructed from the feature vectors. Our analysis justifies this premise by showing that in the asymptotic limit, the bandwidth (a measure of smoothness) of any class indicator signal is closely related to the geometry of the dataset.
Using the sampling theory perspective, we also quantitatively show that the label complexity (i.e., the amount of labeling required for perfect prediction of unknown labels) matches its theoretical value, thereby adding to the appeal of graph-based techniques for semi-supervised learning.

Acknowledgments

The journey of my doctoral studies would not have been a success without the help and support of a number of people around me. First and foremost, I would like to express my sincerest gratitude towards my advisor, Prof. Antonio Ortega, for giving me the opportunity of undertaking this endeavor. His guidance and encouragement have been pivotal for my progress, especially when I have fumbled along the way. I am deeply indebted to him for inculcating in me the discipline necessary for pursuing research, through his wisdom and even through his everlasting red pen. I am very grateful to Prof. Salman Avestimehr for being a part of my committee and for stimulating my interest in theoretical problems through our collaborations. I would like to thank Prof. Shri Narayanan, Prof. Mahdi Soltanolkotabi, and Prof. Yan Liu for serving on my qualifying exam committee and providing me with valuable feedback necessary for improving my thesis. I would also like to thank Prof. Jinchi Lv for being a part of my defense committee. I am indebted to all my teachers at USC for imparting the finest quality of teaching through a wide variety of courses. I am thankful to USC and its staff for providing all the facilities conducive to a good research atmosphere. I must also thank my research collaborators Akshay Gadde, Aly El Gamal, Eduardo Pavez, Nicolo Michelusi, and Prof. Urbashi Mitra. My discussions with them have been immensely enlightening.

It would be remiss of me not to mention the crucial role of my friends Praveen, Akshay, Ruchir, Arvind and Naveen. Their company and moral support have alleviated much of the psychological stress associated with a life in research, for which I am thoroughly grateful. I would also like to thank my labmates Hilmi, Eduardo, Jessie, Joanne and Eric for fulfilling a similar role in the office. Most importantly, to my Mom, Dad and little brother Fahad, for being a never-ending source of love, inspiration, motivation, strength, and encouragement, and for their blind faith in me no matter what – I cannot thank them enough!

Contents

Abstract
Acknowledgments
List of Tables
List of Figures
1 Introduction
  1.1 Sampling theory for graph signals
    1.1.1 Existing work
    1.1.2 Our contributions
  1.2 Wavelet filterbanks on graphs
    1.2.1 Existing work
    1.2.2 Our contributions
  1.3 Sampling theory perspective of graph-based semi-supervised learning
    1.3.1 Existing work
    1.3.2 Our contributions
  1.4 Outline
2 Review of Graph Signal Processing
  2.1 Basic concepts
  2.2 Notion of frequency for graph signals
  2.3 Examples of variation operators
    2.3.1 Variation on undirected graphs
    2.3.2 Variation on directed graphs
  2.4 Summary
3 Sampling Theory for Graph Signals
  3.1 Related work
  3.2 Necessary and sufficient conditions
    3.2.1 Issue of stability and choice of sampling set
  3.3 Graph Spectral Proxies
  3.4 Cutoff frequency
  3.5 Sampling set selection
    3.5.1 Best sampling set of given size
    3.5.2 Obtaining the best sampling set
  3.6 Experiments
    3.6.1 Examples with artificial data
    3.6.2 A real data example
  3.7 Summary
    3.7.1 Future work
4 Wavelet Filterbanks on Graphs
  4.1 Related work
  4.2 Background and notation
  4.3 Two-channel filterbanks
    4.3.1 Special case: bipartite graphs
    4.3.2 Characterizing graphs that admit perfect reconstruction filterbanks
  4.4 Critical sampling for filterbanks on arbitrary graphs
    4.4.1 Approximately optimal sampling scheme
    4.4.2 Theoretical guarantees
    4.4.3 Multi-channel extension
    4.4.4 Experiments
  4.5 Filterbanks on block-cyclic graphs
    4.5.1 Perfect reconstruction conditions for M-channel filterbanks on M-block cyclic graphs
    4.5.2 Tree-structured filterbank design
    4.5.3 Preliminary experiment
  4.6 Summary and future work
5 Sampling Theory Perspective of Semi-supervised Learning
  5.1 Preliminaries
    5.1.1 Data models
    5.1.2 Graph model
    5.1.3 Estimating bandwidth
    5.1.4 Bandlimited interpolation for classification
  5.2 Related work and connections
    5.2.1 Classification setting
    5.2.2 Regression setting
  5.3 Main results and discussion
    5.3.1 Interpretation of bandwidth and bandlimited reconstruction
    5.3.2 Label complexity of SSL
  5.4 Numerical validation
  5.5 Summary
    5.5.1 Future work
  5.6 Proofs
    5.6.1 Convergence of variance terms
    5.6.2 Convergence of the bias term for the separable model
    5.6.3 Convergence of bias term for the nonseparable model
    5.6.4 Proof of Theorem 5.4
    5.6.5 Expansions of $\mathbf{1}_S^\top L^m \mathbf{1}_S$ and $\mathbb{E}\{\frac{1}{n} \mathbf{1}_S^\top L^m \mathbf{1}_S\}$
    5.6.6 Proof of Lemma 4
Reference List

List of Tables

2.1 Different choices of the variation operator L for defining GFT bases.
3.1 Comparison of complexity of different sampling set selection algorithms.
3.2 Running time of different methods (in seconds) for selecting 5% samples on graphs of different sizes. The running time for M1 increases drastically and is ignored beyond graph size 5k.
4.1 Reconstruction error results for random signals on the Minnesota road graph.
4.2 Comparison of our method against an optimal sampling scheme obtained through exhaustive search. The experiment is performed for ring graphs of size N = 10 that have randomly added cross-links with probability p. Reconstruction error is averaged over 100 signals, and also 100 sampling schemes for the random sampling case.
5.1 Related convergence results in the literature under different data models and graph construction schemes. All models assume that the distributions are smooth (at least twice-differentiable). Further, the graph Laplacian is defined as $L = \frac{1}{n}(D - W)$ in all cases. [42] also studies convergence of graph cuts for weighted k-nearest neighbor and r-neighborhood graphs, which we do not include for brevity.
5.2 Illustrative boundaries used in the separable model.

List of Figures

1.1 Commonly occurring examples of graph-structured data in modern application areas such as ranking, user ratings prediction over social networks, collaborative filtering and semi-supervised learning.
2.1 Variation in the eigenvectors of the Laplacian of a graph ordered according to the eigenvalues. Note the increasing variation over edges.
3.1 Reconstruction results for different graph and signal models. Plots for signal model F1 are not shown since the reconstruction errors are identically zero for all methods when $|S| \geq \dim PW_\omega(G) = 50$. The large reconstruction errors for $|S| < 50$ arise due to non-uniqueness of bandlimited reconstruction and hence are less meaningful.
3.2 Reconstruction performance for noisy signals (model F2) with different values of k in Erdős-Rényi graphs having different connection sparsity levels. Higher connection probability p implies lower sparsity.
3.3 Classification results for the USPS dataset using different methods and GFTs.
4.1 A generic two-channel filterbank on graphs.
4.2 An illustration of aliasing patterns for two-channel filterbanks in (a) an arbitrary graph, (b) a bipartite graph. Spectral folding in a bipartite graph results in a concise anti-aliasing constraint in the spectral domain, as seen in Section 4.3.1.
4.3 Sampling scheme obtained using Algorithm 4.2 for bipartite graphs with length-4 Graph-QMF filters: (a)-(c), and GraphBior(6,6) filters: (d)-(g). Red and blue colors indicate nodes in low-pass and high-pass channels. The sets are heuristically forced to be disjoint in (g).
4.4 Performance of our critical sampling scheme (Algorithm 4.2) on the Minnesota road network graph. (a), (c) and (e) denote the sampling scheme obtained, spectral response, and maximum aliasing component for the Graph-QMF design. (b), (d) and (f) illustrate corresponding results for GraphBior(6,6).
4.5 A 3-block cyclic graph.
4.6 A 4-block cyclic graph visualized as a directed bipartite graph by grouping even and odd blocks.
4.7 A tree-structured filterbank for block cyclic graphs consisting of two stages.
4.8 Two-stage tree-structured filterbank simplified using the noble identities for bipartite graphs.
4.9 Spectral characterization of a two-stage tree-structured filterbank with ideal filters. Gray-shaded areas indicate passbands.
4.10 Spectral responses (magnitude and phase) of polynomial filters designed using the maximally-flat approach with $K_1 = K_2 = 2$: (a) lowpass kernel $h_0(\lambda)$ (length 6), (b) highpass kernel $h_1(\lambda)$ (length 7). (c) Channel responses of a two-stage tree-structured filterbank built using the polynomial kernels $h_0(\lambda)$ and $h_1(\lambda)$.
4.11 (a) A 4-block cyclic graph considered in the filtering experiment (the edges are oriented in counter-clockwise fashion). (b) The spectrum of its adjacency in the complex unit disc. (c) An input signal on the graph used for filtering experiments with the two-stage tree-structured filterbank.
4.12 Output obtained in each channel of a two-stage tree-structured filterbank after filtering the input signal from Figure 4.11c using ideal filters (Figure 4.9c) and polynomial filters (Figure 4.10c).
5.1 Statistical models of data considered in this work: (a) the separable model, (b) the nonseparable model.
5.2 1-D example illustrating the theoretical label complexity for (a) the separable model, (b) the nonseparable model. Note that labeling all points where density is lower than the supremum density over the boundary resolves all ambiguity and results in perfect prediction.
5.3 Probability density functions used to generate data for (a) the separable model, (b) the nonseparable model.
5.4 Boundaries $\{\partial S_i\}$ considered in the separable model.
5.5 Convergence of the empirical value of bandwidths $\omega(\mathbf{1}_{S_i})$ and $\omega(\mathbf{1}_A)$ for different boundaries $\{\partial S_i\}$ and $\partial A$ on corresponding graphs. Dark shaded regions denote standard deviation over 100 experiments. Red bars indicate theoretical values.
5.6 Standard deviation of $\omega(\mathbf{1}_{S_i})$ and $\omega(\mathbf{1}_A)$ as a function of n.
5.7 Mean reconstruction errors averaged over 100 experiments for (a) $\mathbf{1}_{S_3}$ and (b) $\mathbf{1}_A$. Red-dashed lines indicate the theoretical label complexities of $\partial S_3$ and $\partial A$.
5.8 Geometrical constructions in Definition 5.1.
5.9 Worst-case scenarios for the boundary $\partial S$ when (a) $S_1$ is a ball of radius $\tau$, (b) $S_2$ is a ball of radius $\tau$.

Chapter 1
Introduction

Recent improvements in communication and electronic storage capabilities have created an explosion in the amount of data generated from a variety of sources, so much so that we have already stretched the boundaries of current data processing techniques and ventured into uncharted territory. The exponential improvement in raw computing power can only do so much to alleviate this problem – partly because current processing techniques do not scale very well with the size of datasets, but more importantly because "modern" data is quite unstructured and irregular, and hence there is an alarming shortage of techniques for understanding and processing such data.

Graph-structured data is one such form of unconventional data that cannot be easily represented by regularly or irregularly spaced samples in a metric space. In many application domains, such as social networks, web information analysis, sensor networks and machine learning, a natural representation of the dataset can be provided by a graph. Each observation or data point is attached to one of the vertices/nodes in the graph, and the edges (directed or undirected, weighted or unweighted) capture important characteristics of the dataset, e.g., similarity or proximity between the vertices, link capacity, etc. Specifically, such a mapping from the vertices of a graph to the set of scalars is known as a graph signal. In order to study various properties of these signals (such as smoothness, anomalous behavior, etc.), one needs to take into account the underlying connectivity of the data points. Common examples of graph signals include user ratings or preferences on social graphs, class labels on similarity graphs in machine learning, signal values on sensor nodes in a sensor network, collaborative filtering, etc. (Figure 1.1). Further, even traditional applications such as image and video processing and compression can benefit from a graph-based formulation [47, 30]. The set of tools that facilitate understanding and processing graph signals is broadly covered by the emerging research area known as Graph Signal Processing [66, 62].

Figure 1.1: Commonly occurring examples of graph-structured data in modern application areas such as ranking, user ratings prediction over social networks, collaborative filtering and semi-supervised learning.

Traditional signals in the Euclidean space can be analyzed and processed using well-established techniques such as sampling, filtering, frequency transforms, etc. Graph Signal Processing aims to extend these tools to signals on graphs by exploiting the underlying connectivity information. These extensions are in most cases non-trivial, since graph signals lie in irregular non-Euclidean spaces.
For example, defining analogs of even the simplest signal processing operations such as time shift, scaling, and downsampling is not immediately obvious for graph signals. An additional consideration in graph signal processing is complexity; therefore, greater emphasis is laid on the design of localized algorithms, where the output at each vertex depends only on its local neighborhood. This allows algorithms to scale well with the large graph sizes that have become increasingly common in modern big data applications.

The primary focus of this dissertation is to formulate a sampling theory for graph signals that is meant to form a Graph Signal Processing analog of the Nyquist-Shannon sampling theorem. Contrary to the more common uniformly-spaced sampling strategy for traditional signals, sampling in graphs can consist of choosing any subset of vertices over the graph. Therefore, in our formulation, we seek answers to the following questions: (1) When can one recover a graph signal from a given subset of sampled vertices on the graph? (2) How can one choose the best vertices to sample a given graph signal? Next, we consider the sampling problem in the context of designing wavelet filterbanks on graphs. Finally, as an application, we view the problem of graph-based semi-supervised learning from the perspective of sampling theory. A detailed description of the research problems studied in this dissertation is provided in the following sections.

1.1 Sampling theory for graph signals

The Nyquist-Shannon sampling theorem is a landmark result in signal processing theory that forms a bridge between continuous-time and discrete-time signals. It establishes a sufficient condition on the sampling rate required to capture all the information of continuous-time signals satisfying certain modeling assumptions (most commonly, bandlimitedness in the spectral domain). For example, the simplest form of the sampling theorem states that a signal with bandwidth f in the Fourier domain can be uniquely represented by its samples captured uniformly at a rate of at least 2f. An analog of the sampling theorem in the digital signal processing domain specifies conditions for downsampling bandlimited discrete-time signals without losing information. There also exist alternative variants of the sampling theorem under different sampling regimes such as non-uniform sampling [31], or different modeling assumptions (for instance, non-baseband signals, or signals with finite rate of innovation [73]), or a combination of these in the more recent compressed sensing literature [18].

Similarly, in the context of Graph Signal Processing, the sampling theorem gives conditions under which a graph signal can be uniquely represented by its samples on a subset of nodes. This problem is fundamentally different from traditional sampling due to a lack of "ordering" of the samples, i.e., one cannot select uniformly spaced samples. Further, signal properties such as smoothness/bandlimitedness are determined by the topology or connectivity of the graph. As we shall see in Chapter 2, similar to the Fourier domain for traditional signals, a graph Fourier domain is obtained if one is able to represent graph signals in a basis that takes into account the underlying connectivity of the graph. Such a basis can be obtained using the eigenvalues and eigenvectors of special matrices associated with the graph that capture the variation in a graph signal (examples of such matrices include the adjacency matrix [63], the Laplacian [66], and its normalized versions [4, 24]).
The representation of a graph signal in this basis is known as the Graph Fourier Transform (GFT) and is used to provide a notion of frequency for graph signals, which is crucial for developing the theory of sampling. In order to formulate a sampling theorem for graph signals, one needs to consider the following questions:

P1. When can one recover a graph signal from its samples on a given subset of nodes?
Not all graph signals can be uniquely recovered from their samples on a subset of nodes. Since sampling results in an inherent reduction of dimensionality, we need to impose certain modeling assumptions on signals for recovery. In this work, we assume that our signals of interest are smooth over the graph. In quantitative terms, this can be imposed by enforcing signals to be bandlimited or lowpass in the graph Fourier domain. Therefore, for a given subset of nodes, we would like to find the maximum bandwidth a signal can have in the graph Fourier domain (also known as the cutoff frequency) so that it can be perfectly recovered from its samples on the subset.

P2. What is the best choice of nodes to sample a given graph signal?
Under the modeling assumption of bandlimitedness, we would like to find a set of nodes of the smallest possible size that uniquely represents all signals with bandwidth lower than a given value. However, not all choices of sampling sets exhibit robustness of reconstruction in the presence of noise and model mismatch; therefore, we would also like to focus on choosing sets that promote stable reconstruction of the original signal. Note that the algorithms we consider for choosing the sets should scale well with graph size, since graphs in modern applications are quite large.

P3. How can we reconstruct a signal from its samples?
Given samples of a graph signal on a set that satisfies the criteria posed by the two questions above, one would like to design efficient algorithms for recovering the entire signal. This is equivalent to the interpolation problem in traditional signal processing; however, here we must focus on the design of localized algorithms, where the signal value on a particular node can be computed from values on neighboring nodes up to a certain connection depth.

Sampling theory for graph signals has applications in many research areas, most prominently machine learning and image processing. In the former case, class labels are treated as graph signals over the similarity graph, and predicting unknown labels can be posed as an interpolation problem. In image processing, graphs can be used to incorporate edge information in images and are useful for designing edge-adaptive wavelet transforms over images.

1.1.1 Existing work

Prior to our work, sampling theory for graph signals was first considered in [55], where a sufficient condition was given for unique recovery of signals from a given subset of nodes (called a sampling set). Using this condition, [49] computes a bound on the maximum bandwidth that a signal can have so that it can be uniquely reconstructed from its samples on a given subset of nodes. The uniqueness conditions are in fact special cases of results pertaining to sampling signals from arbitrary subspaces presented in [28]. Results along the same lines have been considered recently in [65, 22], after our first work [3].
However, in order to apply results from the aforementioned papers, one must have an explicit representation of the GFT basis corresponding to the bandlimited space of interest, thereby limiting their practical utility when the graphs are large. To our knowledge, our approach is the first to circumvent this issue. Sampling theory for graph signals has also been studied in other contexts, such as designing filterbanks on graphs [44, 52], or more recently along the lines of alternative sampling strategies such as shift-based sampling [43] and random sampling [57], and in connection with the uncertainty principle for graph signals [71, 72].

1.1.2 Our contributions

The primary contribution of this dissertation is to develop a sampling theorem for bandlimited graph signals by answering the questions posed in Section 1.1. We proceed by first stating a necessary and sufficient condition under which a bandlimited signal can be perfectly recovered from a given subset of nodes on the graph. This condition guarantees uniqueness of sampling and is used to compute, for a given sampling set, the maximum bandwidth (i.e., the cutoff frequency) that a signal can have so that it can be perfectly reconstructed from its samples (thereby answering P1). Further, this formulation avoids explicit computation of the GFT basis for the graph by defining quantities called Graph Spectral Proxies that allow a trade-off between complexity and accuracy while computing the cutoff frequency. These quantities are based on a method for estimating the bandwidth of any signal, up to different orders of accuracy, through simple localized operations in the vertex domain.

In order to obtain the best sampling set (i.e., problem P2), we propose to use the cutoff frequency, estimated via spectral proxies, as the objective function, which is maximized subject to a bandwidth constraint (or, equivalently, a cardinality constraint determined by the dimensionality of the space of signals to be sampled). This optimization problem is combinatorial in nature; therefore, we resort to a greedy algorithm that uses binary relaxation and gradient ascent. Although the spectral proxies-based objective function is designed to obtain sampling sets that guarantee uniqueness, we show that it also accounts for robustness of reconstruction with respect to sampling noise and model mismatch. This is because maximizing the objective function is shown to be closely related to minimizing a bound on the worst-case reconstruction error associated with a given sampling set. Further, our method is iterative and requires finding a minimum eigenpair as the atomic operation, thereby making it efficient and amenable to localized/distributed implementations. This is demonstrated through various experiments that compare our sampling set selection method against other existing methods in terms of robustness and complexity.

In this dissertation, we exclude the study of techniques for bandlimited reconstruction of signals from their samples (i.e., problem P3), since they have been studied independently elsewhere [50, 74, 76]. An example of a real-world application of our sampling theory is the problem of active semi-supervised learning, where the learning algorithm can specify beforehand specific data points to label given a budget. Under the assumption that class labels are smooth over the similarity graph, active learning is essentially equivalent to a sampling set selection problem.
It is shown in [29] that our method performs well in comparison to other state-of-the-art methods. The details of this work are excluded from this dissertation for brevity.

1.2 Wavelet filterbanks on graphs

Graph wavelet transforms have recently been used for a variety of applications, ranging from multiresolution analysis [47, 60] and compression [48, 53, 19] to denoising [25] and classification [27]. These transforms allow one to analyze and process signals defined over graphs while taking into consideration the underlying relationships between signal values. The designs of these transforms are generally inspired by traditional wavelet construction schemes and leverage principles from graph signal processing.

One of the recent techniques for constructing wavelet transforms on graphs is based on filterbanks. This approach is quite appealing because it makes use of spectral graph filters [66] that have low complexity and enable a trade-off between vertex-domain and frequency-domain localization. Besides designing the filters, sampling set selection is an important aspect of graph wavelet filterbank design. The requirements are fundamentally different in this case – instead of working with bandlimited signals for perfect and stable recovery, one looks for a sampling scheme over two or more channels of a filterbank that allows a multiresolution analysis, followed by recovery (or synthesis) of all signals over the graph. An important property taken into consideration while designing these filterbanks is critical sampling, which requires the total number of samples across channels to be equal to the number of nodes in the graph [45, 46]. Other desirable properties of the sampling scheme involve ensuring near-perfect reconstruction, near orthogonality, and most importantly compact support, which allows localized implementations of the filterbank. These properties are suitable for compression applications, where one would like to obtain compact representations of signals of interest efficiently at the encoder, followed by near-lossless recovery at the decoder. Unlike traditional filterbanks, which boast numerous well-studied designs with wide-ranging properties, graph filterbanks are more difficult to design, as the structure of the Fourier basis is influenced by the structure of the graph. Therefore, it is useful to understand and characterize graphs that admit filterbanks with one or more of these properties.

1.2.1 Existing work

In order to ensure perfect reconstruction and compact support, state-of-the-art wavelet filterbanks require imposing certain structural constraints on the underlying graph. For example, the recently proposed two-channel filterbanks in [45, 46] are designed specifically for bipartite graphs. The special structure leads to a natural downsampling-upsampling scheme (on one of the two partitions) in each channel, accompanied by a spectral folding phenomenon that is exploited while designing the filters. The design can be extended to arbitrary graphs through a multidimensional approach that involves hierarchically decomposing the original graph into a sequence of bipartite subgraphs. Techniques for bipartite subgraph decomposition, such as those in [52, 77], form an active area of research. There also exist filterbanks that exploit circulant graph architectures, albeit with non-polynomial synthesis filterbanks [26, 27]. Recently, yet another class of graphs suitable for the design of M-channel polynomial filterbanks has been studied [69, 70].
These are called M-block cyclic graphs and are a natural directed extension of bipartite graphs. The topology and the eigenstructure of these graphs also induce a spectral folding upon downsampling-upsampling on one of the blocks, thus providing a means to extend several concepts from classical multirate signal processing to the graph signal processing domain. Perfect reconstruction conditions for M-channel filterbanks on these graphs are stated in [70], albeit with a lack of concrete solutions.

1.2.2 Our contributions

Following the lines of [45], our work begins by spelling out design criteria for two-channel filterbanks that satisfy the critical sampling, compact support, and perfect reconstruction properties. We realize immediately that without the presence of special structure in the Fourier basis associated with the graph, it is in general difficult to satisfy critical sampling and perfect reconstruction simultaneously with low-degree polynomial filterbanks. Our analysis shows that the structural requirement on the Fourier basis is identical to that of traditional signal processing – downsampling-upsampling should induce a spectral folding in the graph Fourier domain. This makes bipartite graphs particularly amenable to the design of perfect reconstruction two-channel filterbanks.

We then shift our focus and seek to design filterbanks on arbitrary graphs (without altering their topology) that are critically sampled, have compact support, and satisfy the perfect reconstruction and orthogonality conditions as closely as possible. The lack of a special structure makes the problem of jointly designing low-degree polynomial filters and the sampling scheme infeasible in general. Therefore, as a starting step, we decouple the two by focusing only on designing a critical sampling scheme while assuming that the analysis and synthesis filters are predesigned for given frequency localization constraints. Our main contributions are the following:

• A criterion based on the reconstruction error to evaluate any sampling scheme for given predesigned filters.
• A greedy but computationally efficient algorithm to minimize the error criterion that approximates the optimal solution, along with some theoretical guarantees.

Experiments show that our algorithm for choosing the sampling scheme performs better than existing heuristics.

Finally, we turn our attention to the design of M-channel filterbanks on M-block cyclic graphs. The perfect reconstruction conditions stated in [70] are meant for graphs with balanced block sizes, where signals in each channel are sampled on the same block. We first modify the conditions to work for disjoint sampling sets across different channels. Next, we look into the design of possible polynomial solutions satisfying these conditions. This problem is more complex than traditional M-channel filterbanks since the graph Fourier domain (of graphs with a normalized adjacency) spans the complete unit disc instead of the unit circle. In our work, we consider a simpler version of the problem where M is a power of 2, and propose a tree-structured design composed of hierarchically arranged two-channel designs. Our design allows disjoint sampling sets for different channels, and also works for unbalanced block sizes. Our main contributions are:

• A scheme to represent a $2^L$-channel filterbank on $2^L$-block cyclic graphs as a hierarchical tree structure composed of 2-channel filterbanks on smaller directed bipartite graphs.
• A perfect reconstruction 2-channel filterbank solution for a 2-block cyclic graph (i.e., a directed bipartite graph).
This is a generalization of the designs in [45, 46, 68], since the spectrum of directed bipartite graphs can be complex.

1.3 Sampling theory perspective of graph-based semi-supervised learning

The abundance of unlabeled data in various machine learning applications, along with the prohibitive cost of labeling, has led to growing interest in semi-supervised learning. This paradigm deals with the task of classifying data points in the presence of very little labeling information by relying on the geometry of the dataset. Assuming that the features are well-chosen, a natural assumption in this setting is to consider the marginal density p(x) of the feature vectors to be informative about the labeling function f(x) defined on the points. This assumption is fundamental to the semi-supervised learning problem in both the classification and the regression settings, and is also known as the semi-supervised smoothness assumption [20], which states that the label function is smoother in regions of high data density. There also exist other similar variants of this assumption specialized for the classification setting, namely, the cluster assumption [78] (points in a cluster are likely to have the same class label) and the low density separation assumption [51] (decision boundaries pass through regions of low data density). Most present-day algorithms for semi-supervised learning are based on one or more of these assumptions.

In practice, graph-based methods have been found to be quite suitable for geometry-based learning tasks, primarily because they provide an easy way of exploiting information from the geometry of the dataset. These methods involve constructing a distance-based similarity graph whose vertices represent the data points and whose edge weights are computed using a decreasing function of the Euclidean distances between the points in the feature space. The class labels are treated as a graph signal, and the known labels as its samples. The semi-supervised smoothness assumption for the class labels translates into a notion of "smoothness" of the signal over the graph, in the sense that labels of vertices do not vary much over edges with high weights (i.e., edges that connect close or similar points). This smoothness assumption can be imposed through quantities such as the graph cut, or the graph Laplacian regularizer, and more recently, through bandwidth constraints in the graph spectral domain. Further, predicting the labels of the unknown points can be considered an interpolation problem over the similarity graph.

Although using the bandwidth of signals to impose smoothness is well-motivated in graph-based learning methods, it is important to understand its connection to the underlying geometry of the dataset in a theoretical sense. One way of justifying this approach is to explore its geometrical interpretation in the limit of infinitely available unlabeled data. This typically involves assuming a probabilistic generative model for the dataset and analyzing the bandwidth of class indicator functions for certain commonly-used graph construction schemes. Specifically, we seek answers to the following questions in the asymptotic setting:

1. What is the connection between the bandwidth of class indicator signals over the similarity graph and the underlying geometry of the dataset?
2. What is the interpretation of the bandlimited reconstruction approach for label prediction (illustrated in the sketch below)?
3. How many labeled examples does one require for perfect prediction?
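To fix ideas before the formal treatment, the following minimal sketch illustrates the bandlimited-reconstruction view of label prediction referenced in the second question above. It is only an illustrative least-squares instance of the idea; the toy graph, the labeled set S, and the assumed bandwidth (number of retained eigenvectors r) are arbitrary choices for this example, not the constructions analyzed in Chapter 5.

```python
import numpy as np

def bandlimited_label_prediction(W, y, S, r):
    """Predict +/-1 labels by least-squares reconstruction in the span of
    the first r eigenvectors of the combinatorial Laplacian L = D - W.
    W : (N, N) symmetric adjacency of the similarity graph
    y : (N,) label signal; only the sampled entries y[S] are used
    S : list of labeled node indices; r : assumed bandwidth (dimension)"""
    L = np.diag(W.sum(axis=1)) - W
    _, U = np.linalg.eigh(L)              # eigenvalues ascending, U orthogonal
    U_r = U[:, :r]                        # low-frequency GFT basis vectors
    c, *_ = np.linalg.lstsq(U_r[S, :], y[S], rcond=None)  # fit sampled rows
    return np.sign(U_r @ c)               # reconstruct, then threshold

# Two 4-node clusters joined by a single weak edge; one label per cluster.
W = np.zeros((8, 8))
for i, j in [(0, 1), (0, 2), (1, 2), (2, 3), (4, 5), (4, 6), (5, 6), (6, 7)]:
    W[i, j] = W[j, i] = 1.0
W[3, 4] = W[4, 3] = 0.1                   # weak inter-cluster link
y = np.array([1., 1., 1., 1., -1., -1., -1., -1.])
print(bandlimited_label_prediction(W, y, S=[0, 7], r=2))
# With bandwidth r = 2, the two known labels should propagate to their clusters.
```

The key modeling step is restricting the label signal to the span of a few low-frequency eigenvectors: when labels are smooth over the graph, a handful of samples suffices to pin down the expansion coefficients.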
The answers to these questions would help complete our theoretical understanding of graph-based semi-supervised classification approaches, specifically bandlimited interpolation over the graph, and strengthen their link with the semi-supervised smoothness assumption and its variants.

1.3.1 Existing work

In the graph-based semi-supervised learning paradigm, there have been numerous ways of quantitatively imposing smoothness constraints on label functions defined on the vertices of a similarity graph. Most graph-based semi-supervised classification algorithms incorporate one of these criteria as a penalty against the fitting error in a regularization problem, or as a constraint term while minimizing the fitting error in an optimization problem. For example, a commonly used measure of smoothness for a label function f is the graph Laplacian regularizer $f^\top L f$ (L being the graph Laplacian), and many algorithms involve minimizing this quadratic energy function while ensuring that f satisfies the known set of labels [83, 78]. There also exist higher-order variants of this measure, known as iterated graph Laplacian regularizers $f^\top L^m f$, that have been shown to make the problem more well-behaved [81]. On the other hand, a spectral theory based classification algorithm restricts f to be spanned by the first few eigenvectors of the graph Laplacian [10, 11], which are known to form a representation basis for smooth functions on the graph. In each of these examples, the criterion enforces smoothness of the labels over the graph – a lower value of the regularizer $f^\top L f$, or a smaller number of leading eigenvectors used to model f, implies that vertices that are close neighbors on the graph are more likely to have the same label.

Recent works have therefore focused on justifying these approaches by exploring their geometrical interpretation in the limit of infinitely available unlabeled data. This is typically done by assuming a probabilistic generative model for the dataset and analyzing the graph smoothness criteria in the asymptotic setting for certain commonly-used graph construction schemes. For example, it has been shown that for data points drawn from a smooth distribution with an associated smooth label function (i.e., the regression setting), the graph Laplacian regularizer converges, in the limit of infinite data points, to a density-weighted variational energy functional that penalizes large variations of the labels in high density regions [11, 15, 34, 12, 81, 82]. A similar connection ensues for semi-supervised learning problems in the classification setting (i.e., when labels are discrete in the feature space). If points drawn from a smooth distribution are separated by a smooth boundary into two classes, then the graph cut for the partition converges to a weighted volume of the boundary [51, 42]. This is consistent with the low density separation assumption – a low value of the graph cut implies that the boundary passes through regions of low data density.

1.3.2 Our contributions

The main contribution of our work is a novel theoretical justification of the sampling theoretic approach to semi-supervised learning. This approach involves treating the class label function/indicator signal as a bandlimited graph signal, and label prediction as a bandlimited reconstruction problem. Our work is the first to consider, using sampling theory, the label complexity of semi-supervised classification, that is, the minimum fraction of labeled examples required for perfect prediction of the unknown labels.
A key ingredient in this formulation is the bandwidth of signals on the graph – signals with lower bandwidth tend to be smoother on the graph, and are useful for modeling label functions over similarity graphs. Label prediction using bandlimited reconstruction then involves estimating a label function/indicator signal that minimizes the prediction error on the known set under a bandwidth constraint.

In order to provide a geometrical interpretation of the bandwidth of class indicator signals, we leverage our work on sampling theory and make use of spectral proxies. These spectral proxies are computed over similarity graphs constructed from a two-class statistical model for the feature vectors. To make our analysis as general as possible, we consider two data models: separable and nonseparable. These generative models are quite practical and mimic most datasets in the real world. The separable model assumes that data points are drawn from an underlying probability distribution in the feature space and that each class is separated from the others by a smooth boundary. On the other hand, the nonseparable model assumes a mixture distribution for the data, where the data points are drawn randomly and independently with certain probabilities from separate class conditional distributions. We also introduce a notion of "boundaries" for classes in the nonseparable model in the form of overlap regions, defined as the set of points where the probabilities of belonging and not belonging to a class are both non-zero (i.e., the region of ambiguity). This definition is quite practical and useful for characterizing the geometry of such datasets.

Using the data points, we consider a specific graph construction scheme that applies the Gaussian kernel to Euclidean distances between feature vectors for computing their similarities (our analysis generalizes easily to arbitrary kernels under simple assumptions). In order to compute the bandwidth of the indicator, we use graph spectral proxies from our work on sampling theory. A significant portion of this work focuses on analyzing the stochastic convergence of this bandwidth estimate (using a variance-bias decomposition) in the limit of infinite data points for any class indicator signal on the graph. Our analysis suggests a novel sampling theoretic interpretation of graph-based semi-supervised learning, and the main contributions can be summarized as follows:

1. Relationship between bandwidth and data geometry. For the separable model, we show that under certain rate conditions, the bandwidth estimate for any class converges to the supremum of the data density over the class boundary. Similarly, for the nonseparable model, we show that the bandwidth estimate converges to the supremum of the density over the overlap region.

2. Interpretation of bandlimited reconstruction. Using the geometrical interpretation of the bandwidth, we conclude that bandlimited reconstruction allows one to choose the complexity of the hypothesis space while predicting unknown labels (i.e., a larger bandwidth allows more complex class boundaries).

3. Quantification of label complexity. We also show that the bandwidth of class indicator signals is closely linked theoretically to the fraction of labeled points required for perfect classification, which is in turn related to the geometry of the data.
Our analysis has significant implications. First, class indicator signals have a low bandwidth if class boundaries lie in regions of low data density; that is, the semi-supervised smoothness assumption holds for graph-based methods. Second, our analysis helps quantify the impact of bandwidth and data geometry in semi-supervised learning problems, an aspect that was lacking in existing work. Specifically, our results enable us to theoretically assert that for the sampling theoretic approach to graph-based semi-supervised learning, the label complexity (minimum fraction of labeled points required) of learning classifiers matches the theoretical estimate and is indeed lower if the boundary lies in regions of low data density, as demonstrated empirically in earlier works [10, 11].

1.4 Outline

The rest of this dissertation is organized as follows:

• Chapter 2 begins by introducing the notation used throughout this dissertation, along with several useful definitions. This is followed by a brief review of graph signal processing concepts and the introduction of notions of frequency over graphs.

• In Chapter 3, we formulate our theory of sampling for bandlimited graph signals. We begin by stating necessary and sufficient conditions for unique sampling, using which we formulate methods to compute the cutoff frequency for a given sampling set and, conversely, the best sampling set of a given size. We also briefly discuss stability considerations in sampling.

• Chapter 4 considers the sampling problem in a different, but related, setting: the design of graph filterbanks. We consider a general formulation for designing two-channel filterbanks on arbitrary graphs and study the conditions required for satisfying desirable properties such as critical sampling, compact support, perfect reconstruction and orthogonality. We characterize graphs that admit solutions satisfying these properties. We then present an efficient algorithm for selecting the best sampling scheme over the channels that aims to minimize reconstruction error, given predesigned filters. Finally, we also consider the design of M-channel filterbanks on M-block cyclic graphs and propose a simple tree-structured design.

• Finally, in Chapter 5, we view graph-based semi-supervised learning from the perspective of sampling theory. This point of view allows us to explain why graph-based methods are suited to this problem by looking at the geometrical interpretation of the bandwidth of class indicator signals in the asymptotic limit. Further, using sampling theory, we also shed light on the label complexity of the semi-supervised learning problem.

Chapter 2
Review of Graph Signal Processing

In this chapter, we review relevant concepts from Graph Signal Processing that will be helpful in formulating our sampling theory. The most important concept is the notion of frequency for graph signals, which will help us characterize their level of smoothness. The key idea is to introduce analogs of operators such as variation or shift from traditional signal processing that allow one to transform a signal or measure its properties while taking into account the underlying connectivity of the graph. Such operators create a notion of smoothness for graph signals through their spectrum. Since there can be many different choices of variation operator depending on the application at hand, we review a few examples after introducing the notation used throughout this dissertation.
2.1 Basic concepts

A graph $G = (\mathcal{V}, \mathcal{E})$ is a collection of nodes indexed by the set $\mathcal{V} = \{1, \ldots, N\}$ and connected by links $\mathcal{E} = \{(i, j, w_{ij})\}$, where $(i, j, w_{ij})$ denotes a link of weight $w_{ij} \in \mathbb{R}^+$ pointing from node $i$ to node $j$. The adjacency matrix $W$ of the graph is an $N \times N$ matrix with elements $w_{ij}$. The degree $d_i$ of a node $i$ in an undirected graph is defined as $d_i = \sum_j w_{ij}$, and the degree matrix of the graph is defined as $D = \mathrm{diag}\{d_1, d_2, \ldots, d_N\}$. For directed graphs, one can define in-degrees and out-degrees separately.

A graph signal is a function $f : \mathcal{V} \to \mathbb{R}$ defined on the vertices of the graph (i.e., a scalar value assigned to each vertex, such that $f(i)$ is the value of the signal on node $i$). It can be represented as a vector $f \in \mathbb{R}^N$, where $f_i$ represents the function value on the $i$-th vertex. Sampling sets are subsets of nodes $S \subset \mathcal{V}$ over which the values of a signal are measured. For any signal $x \in \mathbb{R}^N$ and a set $S \subseteq \{1, \ldots, N\}$, we use $x_S$ to denote its sampled version, which is an $|S|$-dimensional sub-vector of $x$ consisting of the components indexed by $S$. Similarly, for any matrix $A \in \mathbb{R}^{N \times N}$, $A_{S_1 S_2}$ denotes the sub-matrix of $A$ with rows indexed by $S_1$ and columns indexed by $S_2$. For simplicity, we denote $A_{SS}$ by $A_S$. The complement of $S$ in $\mathcal{V}$ is denoted by $S^c = \mathcal{V} \setminus S$.

Figure 2.1: Variation in the eigenvectors of the Laplacian of a graph, ordered according to the eigenvalues (shown for λ = 0, 0.27, 1.32, 1.59). Note the increasing variation over edges.

2.2 Notion of frequency for graph signals

In order to formulate a sampling theorem for graph signals, we need a notion of frequency that helps us characterize the degree of smoothness of various signals with respect to the graph. This can be done by defining shift or variation operators that allow one to transform a signal or measure its properties while taking into account the underlying connectivity over the graph. We denote any operator that measures the variation of signals over the graph by $L$, an $N \times N$ matrix.¹ This operator essentially creates a notion of smoothness for graph signals through its spectrum. Specifically, assume that $L$ has eigenvalues $|\lambda_1| \leq \ldots \leq |\lambda_N|$ and corresponding eigenvectors $\{u_1, \ldots, u_N\}$. Then these eigenvectors provide a Fourier-like basis for graph signals, with the frequencies given by the corresponding eigenvalues. For each $L$, one can also define a variation functional $\mathrm{Var}(L, f)$ that measures the variation in any signal $f$ with respect to $L$. Such a definition should induce an ordering of the eigenvectors which is consistent with the ordering of the eigenvalues (see, for example, Figure 2.1). More formally, if $|\lambda_i| \leq |\lambda_j|$, then $\mathrm{Var}(L, u_i) \leq \mathrm{Var}(L, u_j)$.

The GFT $\tilde{f}$ of a signal $f$ is given by its representation in the above basis, $\tilde{f} = U^{-1} f$, where $U = [u_1 \ldots u_N]$. Note that a GFT can be defined using any variation operator; examples of possible variation operators are reviewed in the next section. If the variation operator $L$ is symmetric, then its eigenvectors are orthogonal, leading to an orthogonal GFT. In some cases, $L$ may not be diagonalizable; one can then resort to the Jordan normal form [63] and use generalized eigenvectors.

¹ Although L has been extensively used to denote the combinatorial Laplacian in graph theory, we overload this notation to make the point that any such variation operator can be defined to characterize signals of interest in the application at hand.
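To make these definitions concrete, here is a minimal numerical sketch, assuming an undirected graph and the combinatorial Laplacian $L = D - W$ of Section 2.3.1 as the variation operator; the 4-node path graph and the test signal are arbitrary illustrative choices.

```python
import numpy as np

# Toy undirected graph: a path on N = 4 nodes with unit edge weights.
N = 4
W = np.zeros((N, N))
for i in range(N - 1):
    W[i, i + 1] = W[i + 1, i] = 1.0

D = np.diag(W.sum(axis=1))      # degree matrix
L = D - W                       # combinatorial Laplacian (variation operator)

# Graph frequencies and GFT basis: eigenpairs of the symmetric operator L.
lam, U = np.linalg.eigh(L)      # eigenvalues in ascending order, U orthogonal

f = np.array([1.0, 0.8, -0.7, -1.0])   # a graph signal (one value per node)
f_tilde = U.T @ f               # forward GFT: U^{-1} = U^T for orthogonal U
f_rec = U @ f_tilde             # inverse GFT recovers the signal

print(np.round(lam, 3))         # graph frequencies: [0. 0.586 2. 3.414]
print(np.allclose(f, f_rec))    # True
```

The first eigenvector (frequency 0) is constant over this connected graph, and the quadratic form $u_i^\top L u_i = \lambda_i$ grows with the eigenvalue index, matching the ordering illustrated in Figure 2.1.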
Using this notion of frequency, filters can be designed on the graph using any of the variation operators by manipulating the spectral response to satisfy desired properties. One useful approach is to consider polynomial graph filters $H = h(L) = \sum_{i=0}^{k} h_i L^i$ of different degrees, whose response in the spectral domain is given by the polynomial $h(\lambda) = \sum_{i=0}^{k} h_i \lambda^i$. These are particularly useful because of their simplicity – a $k$-degree polynomial filter can be implemented with $O(k|\mathcal{E}|)$ complexity. Further, note that for undirected graphs, $L$ is symmetric, and hence $H$ is symmetric.

Bandlimited graph signals

A signal $f$ is said to be $\omega$-bandlimited if $\tilde{f}_i = 0$ for all $i$ with $|\lambda_i| > \omega$. In other words, the GFT of an $\omega$-bandlimited² $f$ is supported on frequencies in $[0, \omega]$. If $\{\lambda_1, \lambda_2, \ldots, \lambda_r\}$ are the eigenvalues of $L$ less than or equal to $\omega$ in magnitude, then any $\omega$-bandlimited signal can be written as a linear combination of the corresponding eigenvectors:

$f = \sum_{i=1}^{r} \tilde{f}_i u_i = U_{\mathcal{V}\mathcal{R}} \tilde{f}_{\mathcal{R}}$,  (2.1)

where $\mathcal{R} = \{1, \ldots, r\}$. The space of $\omega$-bandlimited signals is called the Paley-Wiener space and is denoted by $PW_\omega(G)$ [55]. Note that $PW_\omega(G) = \mathrm{range}(U_{\mathcal{V}\mathcal{R}})$ (i.e., the span of the columns of $U_{\mathcal{V}\mathcal{R}}$). The bandwidth of a signal $f$ is defined as the largest among the absolute values of the eigenvalues corresponding to non-zero GFT coefficients of $f$, i.e.,

$\omega(f) \triangleq \max_i \{|\lambda_i| \mid \tilde{f}_i \neq 0\}$.  (2.2)

A key ingredient in our theory is an approximation of the bandwidth of a signal using powers of the variation operator $L$, as explained in Section 3.3. Since this approximation holds for any variation operator, the proposed theory remains valid for all of the choices of GFT in Table 2.1.

² One can also define highpass and bandpass signals in the GFT domain. Sampling theory can be generalized for such signals by treating them as lowpass in the eigenbasis of a shifted variation operator; e.g., one can use $L' = |\lambda_N| I - L$ for highpass signals.

2.3 Examples of variation operators

2.3.1 Variation on undirected graphs

For undirected graphs, the most commonly used variation operator is the combinatorial Laplacian [24], given by

$L = D - W$,  (2.3)

where $D$ is the diagonal degree matrix $\mathrm{diag}\{d_1, \ldots, d_N\}$ with $d_i = \sum_j w_{ij}$. Since $w_{ij} = w_{ji}$ for undirected graphs, this matrix is symmetric. As a result, it has real non-negative eigenvalues $\lambda_i \geq 0$ and an orthogonal set of eigenvectors. The variation functional associated with this operator is known as the graph Laplacian quadratic form [66] and is given by

$\mathrm{Var}_{\mathrm{QF}}(f) = f^\top L f = \frac{1}{2} \sum_{i,j} w_{ij} (f_i - f_j)^2$.  (2.4)

One can normalize the combinatorial Laplacian to obtain the symmetric normalized Laplacian and the (asymmetric) random walk Laplacian, given as

$L_{\mathrm{sym}} = D^{-1/2} L D^{-1/2}, \qquad L_{\mathrm{rw}} = D^{-1} L$.  (2.5)

Both $L_{\mathrm{sym}}$ and $L_{\mathrm{rw}}$ have non-negative eigenvalues. However, the eigenvectors of $L_{\mathrm{rw}}$ are not orthogonal, since it is asymmetric. The eigenvectors of $L_{\mathrm{sym}}$, on the other hand, are orthogonal. The variation functional associated with $L_{\mathrm{sym}}$ has a nice interpretation, as it normalizes the signal values on the nodes by their degrees:

$\mathrm{Var}_{\mathrm{QFsym}}(f) = f^\top L_{\mathrm{sym}} f = \frac{1}{2} \sum_{i,j} w_{ij} \left( \frac{f_i}{\sqrt{d_i}} - \frac{f_j}{\sqrt{d_j}} \right)^2$.  (2.6)

2.3.2 Variation on directed graphs

Note that variation operators defined for directed graphs can also be used for undirected graphs, since each undirected edge can be thought of as two oppositely pointing directed edges.

Variation using the adjacency matrix. This approach involves posing the adjacency matrix as a shift operator over the graph (see [63] for details).
2.3.2 Variation on directed graphs

Note that variation operators defined for directed graphs can also be used for undirected graphs, since each undirected edge can be thought of as two oppositely pointing directed edges.

Table 2.1: Different choices of the variation operator $L$ for defining GFT bases.
- Combinatorial: $L = D - W$ (undirected). Variation functional: $f^\top L f = \frac{1}{2}\sum_{i,j} w_{ij}(f_i - f_j)^2$. Properties: symmetric, $\lambda_i \geq 0$, $U$ orthogonal.
- Symmetric normalized: $L = I - D^{-1/2} W D^{-1/2}$ (undirected). Variation functional: $f^\top L f = \frac{1}{2}\sum_{i,j} w_{ij}(f_i/\sqrt{d_i} - f_j/\sqrt{d_j})^2$. Properties: symmetric, $\lambda_i \in [0, 2]$, $U$ orthogonal.
- Random walk (undirected): $L = I - D^{-1} W$ (undirected). Variation functional: $\|L f\|$. Properties: asymmetric, $\lambda_i \geq 0$, $U$ non-orthogonal.
- Adjacency-based: $L = I - \frac{1}{|\mu_{\max}|} W$, where $\mu_{\max}$ is the eigenvalue of $W$ with the largest magnitude (undirected/directed). Variation functional: $\|L f\|_p$, $p = 1, 2$. Properties: asymmetric, non-orthogonal $U$ for directed graphs, $\mathrm{Re}\{\lambda_i\} \geq 0$.
- Hub-authority: $L = \gamma(I - T^\top T) + (1-\gamma)(I - T T^\top)$, with $T = D_q^{-1/2} W D_p^{-1/2}$, $D_p = \mathrm{diag}\{p_i\}$, $p_i = \sum_j w_{ji}$, $D_q = \mathrm{diag}\{q_i\}$, $q_i = \sum_j w_{ij}$ (directed). Variation functional: $f^\top L f$, see text. Properties: symmetric, $\lambda_i \geq 0$, $U$ orthogonal.
- Random walk (directed): $L = I - \frac{1}{2}\left(\Pi^{1/2} P \Pi^{-1/2} + \Pi^{-1/2} P^\top \Pi^{1/2}\right)$, with $P_{ij} = w_{ij}/\sum_j w_{ij}$, $\Pi = \mathrm{diag}\{\pi_i\}$ (directed). Variation functional: $f^\top L f = \frac{1}{2}\sum_{i,j} \pi_i P_{ij}(f_i/\sqrt{\pi_i} - f_j/\sqrt{\pi_j})^2$. Properties: symmetric, $\lambda_i \geq 0$, $U$ orthogonal.

Variation using the adjacency matrix. This approach involves posing the adjacency matrix as a shift operator over the graph (see [63] for details). For any signal $f \in \mathbb{R}^N$, the signal $W f$ is considered a shifted version of $f$ over the graph, analogous to the shift operation defined in digital signal processing. Using this analogy, [63] defines the total variation of a signal $f$ on the graph as

$$\mathrm{Var}^p_{\mathrm{TV}}(f) = \left\| f - \frac{1}{|\mu_{\max}|} W f \right\|_p, \quad (2.7)$$

where $p = 1, 2$ and $\mu_{\max}$ denotes the eigenvalue of $W$ with the largest magnitude. It can be shown that for two eigenvalues $|\mu_i| < |\mu_j|$ of $W$, the corresponding eigenvectors $v_i$ and $v_j$ satisfy $\mathrm{Var}^p_{\mathrm{TV}}(v_i) < \mathrm{Var}^p_{\mathrm{TV}}(v_j)$. In order to be consistent with our convention, one can define the variation operator as $L = I - W/|\mu_{\max}|$, which has the same eigenvectors as $W$ with eigenvalues $\lambda_i = 1 - \mu_i/|\mu_{\max}|$. This allows us to have the same ordering for the graph frequencies and the variations in the basis vectors. Note that for directed graphs, where $W$ is not symmetric, the GFT basis vectors will not be orthogonal. Further, for some adjacency matrices, there may not exist a complete set of linearly independent eigenvectors. In such cases, one can use generalized eigenvectors in the Jordan normal form of $W$, as stated before [63].

Variation using the hub-authority model. This notion of variation is based on the hub-authority model [39] for specific directed graphs such as a hyperlinked environment (e.g., the web). This model distinguishes between two types of nodes: hub nodes $\mathcal{H}$ are the subset of nodes that point to other nodes, whereas authority nodes $\mathcal{A}$ are the nodes to which other nodes point. Note that a node can be both a hub and an authority simultaneously. In a directed network, we need to define two kinds of degrees for each node $i \in \mathcal{V}$, namely the in-degree $p_i = \sum_j w_{ji}$ and the out-degree $q_i = \sum_j w_{ij}$. The co-linkage between two authorities $i, j \in \mathcal{A}$ or two hubs $i, j \in \mathcal{H}$ is defined as

$$c_{ij} = \sum_{h \in \mathcal{H}} \frac{w_{hi} w_{hj}}{q_h} \quad \text{and} \quad c_{ij} = \sum_{a \in \mathcal{A}} \frac{w_{ia} w_{ja}}{p_a}, \quad (2.8)$$

respectively, and can be thought of as a cumulative link weight between two authorities (or hubs). Based on this, one can define a variation functional for a signal $f$ on the authority nodes [80] as

$$\mathrm{Var}_\mathcal{A}(f) = \frac{1}{2} \sum_{i,j \in \mathcal{A}} c_{ij} \left( \frac{f_i}{\sqrt{p_i}} - \frac{f_j}{\sqrt{p_j}} \right)^2. \quad (2.9)$$

In order to write the above functional in matrix form, define $T = D_q^{-1/2} W D_p^{-1/2}$, where $D_p^{-1/2}$ and $D_q^{-1/2}$ are diagonal matrices with

$$(D_p^{-1/2})_{ii} = \begin{cases} 1/\sqrt{p_i} & \text{if } p_i \neq 0 \\ 0 & \text{otherwise}, \end{cases} \qquad (D_q^{-1/2})_{ii} = \begin{cases} 1/\sqrt{q_i} & \text{if } q_i \neq 0 \\ 0 & \text{otherwise}. \end{cases}$$

It is possible to show that $\mathrm{Var}_\mathcal{A}(f) = f^\top L_\mathcal{A} f$, where $L_\mathcal{A} = I - T^\top T$. A variation functional for a signal $f$ on the hub nodes can be defined in the same way as (2.9) and can be written in matrix form as $\mathrm{Var}_\mathcal{H}(f) = f^\top L_\mathcal{H} f$, where $L_\mathcal{H} = I - T T^\top$. A convex combination $\mathrm{Var}_\gamma(f) = \gamma \mathrm{Var}_\mathcal{A}(f) + (1-\gamma)\mathrm{Var}_\mathcal{H}(f)$, with $\gamma \in [0, 1]$, can be used to define a variation functional for $f$ on the whole vertex set $\mathcal{V}$. Note that the corresponding variation operator $L_\gamma = \gamma L_\mathcal{A} + (1-\gamma) L_\mathcal{H}$ is symmetric and positive semi-definite. Hence, the eigenvectors and eigenvalues of $L_\gamma$ can be used to define an orthogonal GFT similar to the undirected case, where the variation in the eigenvector increases as the corresponding eigenvalue increases.
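The construction above can be illustrated with a minimal sketch (ours; the toy graph and the value of $\gamma$ are arbitrary assumptions) that assembles $L_\gamma = \gamma L_\mathcal{A} + (1-\gamma) L_\mathcal{H}$ for a small directed graph and confirms that it is symmetric positive semi-definite, so that it yields an orthogonal GFT.

```python
import numpy as np

# A minimal sketch (ours): the hub-authority variation operator of (2.8)-(2.9).
W = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [0, 0, 0, 1],
              [1, 0, 0, 0]], dtype=float)   # W[i, j] = weight of edge i -> j
N = W.shape[0]

p = W.sum(axis=0)                 # in-degrees  p_i = sum_j w_ji
q = W.sum(axis=1)                 # out-degrees q_i = sum_j w_ij

# Pseudo-inverse square roots, with 0 on nodes of zero degree.
Dp_inv = np.diag(np.where(p > 0, 1.0 / np.sqrt(p), 0.0))
Dq_inv = np.diag(np.where(q > 0, 1.0 / np.sqrt(q), 0.0))

T = Dq_inv @ W @ Dp_inv
I = np.eye(N)
L_A = I - T.T @ T                 # authority variation operator
L_H = I - T @ T.T                 # hub variation operator

gamma = 0.5
L_gamma = gamma * L_A + (1 - gamma) * L_H

# L_gamma is symmetric PSD, hence it yields an orthogonal GFT, lambda_i >= 0.
lam, U = np.linalg.eigh(L_gamma)
print(np.allclose(L_gamma, L_gamma.T), np.all(lam >= -1e-12))  # True True
```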
Variation using the random walk model. Every directed graph has an associated random walk with a probability transition matrix $P$ given by

$$P_{ij} = \frac{w_{ij}}{\sum_j w_{ij}}. \quad (2.10)$$

By the Perron-Frobenius theorem, if $P$ is irreducible, then it has a stationary distribution $\pi$ which satisfies $\pi P = \pi$ [36]. One can then define the following variation functional for signals on directed graphs [23, 79]:

$$\mathrm{Var}_{\mathrm{rw}}(f) = \frac{1}{2} \sum_{i,j} \pi_i P_{ij} \left( \frac{f_i}{\sqrt{\pi_i}} - \frac{f_j}{\sqrt{\pi_j}} \right)^2. \quad (2.11)$$

Note that if the graph is undirected, the above expression reduces to (2.6) since, in that case, $\pi_i = d_i / \sum_j d_j$. Intuitively, $\pi_i P_{ij}$ can be thought of as the probability of transition from node $i$ to $j$ in the steady state. We expect it to be large if $i$ is similar to $j$; thus, a big difference in signal values on nodes similar to each other contributes more to the variation. A justification for the above functional, in terms of a generalization of the normalized cut to directed graphs, is given in [23, 79]. Let $\Pi = \mathrm{diag}\{\pi_1, \ldots, \pi_N\}$. Then $\mathrm{Var}_{\mathrm{rw}}(f)$ can be written as $f^\top L f$, where

$$L = I - \frac{1}{2}\left( \Pi^{1/2} P \Pi^{-1/2} + \Pi^{-1/2} P^\top \Pi^{1/2} \right). \quad (2.12)$$

It is easy to see that the above $L$ is a symmetric positive semi-definite matrix. Therefore, its eigenvectors can be used to define an orthonormal GFT, where the variation in the eigenvector increases as the corresponding eigenvalue increases.

Table 2.1 summarizes the different choices of GFT bases based on the above variation operators; our theory applies to all of these choices.
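As a concrete illustration of the last entry in Table 2.1, the following sketch (ours; it assumes an irreducible and aperiodic transition matrix, so that plain power iteration converges to the stationary distribution) builds the directed random walk operator of (2.12).

```python
import numpy as np

# A minimal sketch (ours): the directed random walk variation operator (2.12).
W = np.array([[0, 2, 1, 0],
              [0, 0, 1, 1],
              [1, 0, 0, 1],
              [1, 1, 0, 0]], dtype=float)
P = W / W.sum(axis=1, keepdims=True)     # row-stochastic transition matrix

pi = np.full(P.shape[0], 1.0 / P.shape[0])
for _ in range(1000):                    # power iteration: pi P = pi
    pi = pi @ P
pi /= pi.sum()

Pi_sqrt = np.diag(np.sqrt(pi))
Pi_isqrt = np.diag(1.0 / np.sqrt(pi))
I = np.eye(P.shape[0])
L_rw = I - 0.5 * (Pi_sqrt @ P @ Pi_isqrt + Pi_isqrt @ P.T @ Pi_sqrt)

# Symmetric PSD by construction, so it admits an orthonormal GFT basis.
lam, U = np.linalg.eigh(L_rw)
print(np.allclose(L_rw, L_rw.T), np.all(lam >= -1e-12))  # True True
```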
2.4 Summary

In this chapter, we introduced basic concepts of graph signal processing that are relevant to our work. Specifically, we introduced variation operators that allow us to obtain notions of frequency for graph signals and quantify their smoothness. We mentioned several examples of variation operators from the existing literature, both for undirected and directed graphs. A definition of the Graph Fourier Transform (GFT) is given using the eigenvalues and eigenvectors of these operators. This framework allows us to quantify smoothness of signals over the graph in terms of bandlimitedness.

Chapter 3
Sampling Theory for Graph Signals

(Work in this chapter has been published in [3, 29, 4].)

In this chapter, we focus on developing the sampling theorem for graph signals. As mentioned earlier, our signals of interest are bandlimited in the graph spectral domain, or in other words, smooth graph signals. Under this assumption, we answer the following questions: (i) What is the maximum bandwidth a signal $f$ can have so that it can be perfectly recovered from its subset of samples on $\mathcal{S} (\subset \mathcal{V})$? (ii) Given a bandwidth (or dimensionality of the bandlimited space), what is the best set of nodes to sample for unique reconstruction? Additionally, stability is an important issue while choosing sampling sets, since in practice signals are only approximately bandlimited and/or their samples are noisy. A good sampling set leads to robustness of reconstruction in the presence of noise and model mismatch.

Most approaches to the sampling problem involve explicitly computing a portion of the graph Fourier basis, followed by using these basis elements to check whether a unique and stable recovery is possible with the given samples, or to choose the best subset of nodes for sampling. This approach works well enough for small graphs, where computing and storing a portion of the graph Fourier basis is practically feasible. However, current applications demand the handling of large graphs with thousands or even millions of nodes, and computing multiple eigenvectors of the variation operators can be burdensome in terms of time and space complexity. Therefore, in our approach, we define certain quantities called graph spectral proxies, based on powers of the variation operator, that allow one to estimate the bandwidth of graph signals. These proxies can be computed through repeated application of the variation operator over signals, in a localized and distributed fashion with minimal storage cost, thus forming a key contribution of our approach. Using these proxies, we provide an approximate bound on the maximum bandwidth for unique recovery (i.e., the cutoff frequency) given the sampling set. We show that this quantity also appears in an approximate bound on the reconstruction error, and hence we maximize it using a greedy approach in order to select an approximately optimal sampling set of given size. Our algorithm is efficient and scalable, since it does not require explicit computation of the graph Fourier basis, and is shown to achieve comparable performance, with lower computational cost, in comparison to approaches such as [65, 22].

The rest of this chapter is organized as follows: Section 3.1 reviews related work in the area of sampling. In Section 3.2, we provide necessary and sufficient conditions for sampling, and consider results in the scenario of a known GFT. Section 3.3 introduces graph spectral proxies along with their properties. Sections 3.4 and 3.5 employ these quantities to address the questions posed by the sampling theorem. Finally, we conclude this chapter with experimental validation in Section 3.6, followed by a summary in Section 3.7.

3.1 Related work

Sampling theory for graph signals was first studied in [55], where a sufficient condition for unique recovery of signals is stated for a given sampling set. Using this condition, [49] gives a bound on the maximum bandwidth that a signal can have so that it can be uniquely reconstructed from its samples on a given subset of nodes. The uniqueness conditions in this section have also appeared prior to our work in [28], and subsequently in [65, 22]. However, the specific form in which these conditions have been presented requires the explicit computation of the GFT, thereby limiting their practical utility. Using spectral proxies defined later in Section 3.3, our work circumvents the explicit computation of the graph Fourier basis and states conditions that ensure uniqueness and find a good sampling set.

Previous methods for sampling set selection in graphs can be classified into two types, namely spectral-domain methods and vertex-domain methods, which are summarized below.

Spectral-domain approaches

Most of the recent work on sampling theory of graph signals assumes that a portion of the graph Fourier basis is explicitly known.
We classify these methods as spectral-domain approaches since they involve computing the spectrum of the variation operator. For example, the work of [65] requires computation and processing of the first $r$ eigenvectors of the graph Laplacian to construct a sampling set that guarantees unique (but not necessarily stable) reconstruction for a signal spanned by those eigenvectors. Similarly, a greedy algorithm for selecting stable sampling sets for a given bandlimited space is proposed in [22]. It considers a spectral-domain criterion, using minimum singular values of submatrices of the graph Fourier transform matrix, to minimize the effect of sample noise in the worst case. The work of [71, 72] creates a link between the uncertainty principle for graph signals and sampling theory to arrive at similar criteria in the presence of sample noise. It is also possible to generalize this approach using ideas from the theory of optimal experiment design [38] and define other spectral-domain optimality criteria for selecting sampling sets that minimize different measures of reconstruction error when the samples are noisy (for example, the mean squared error). Greedy algorithms can then be used to find sets that are approximately optimal with respect to these criteria.

Vertex-domain approaches

There exist alternative approaches to sampling set selection that do not consider graph spectral information and instead rely on vertex-domain characteristics. Examples include [44] and [52], which select sampling sets based on maximum graph cuts and spanning trees, respectively. However, these methods are better suited for designing downsampling operators required in bipartite graph multiresolution transforms [45, 46]. Specifically, they do not consider the issue of optimality of sampling sets in terms of the quality of bandlimited reconstruction. Further, it can be shown that the maximum graph-cut based sampling set selection criterion is closely related to a special case of our proposed approach. There exists an alternate vertex-domain sampling approach, described in [43], that involves successively shifting a signal using the adjacency matrix and aggregating the values of these signals on a given node. However, sampling using this strategy requires aggregating the sample values for a neighborhood size equal to the dimension of the bandlimited space, which can cover a large portion of the graph.

The sampling strategies described so far involve deterministic methods of approximating optimal sampling sets. Following our work, [57] proposed a randomized sampling strategy that guarantees a bound on the worst-case reconstruction error in the presence of noise by sampling nodes independently, based on a judiciously designed distribution over the nodes. However, one needs to sample many more nodes than the dimension of the bandlimited space to achieve the error bound with high probability.

3.2 Necessary and sufficient conditions

In this section, we address the issue of uniqueness and stability of bandlimited graph signal reconstruction, and discuss different optimality criteria for sampling set selection, assuming that the graph Fourier basis (i.e., the spectrum of the corresponding variation operator) is known (Footnote 1). The results in this section are useful when the graphs under consideration are small, so that computing the spectrum of their variation operators is computationally feasible.

Footnote 1: Parts of this chapter, focusing on stability of reconstruction and other optimality criteria, were done in collaboration with Akshay Gadde and appear in our joint paper [4].
They also serve as a guideline for tackling the aforementioned questions when the graphs are large and computation and storage of the graph Fourier basis is impractical.

In order to give a necessary and sufficient condition for unique identifiability of any signal $f \in PW_\omega(G)$ from its samples $f_\mathcal{S}$ on the sampling set $\mathcal{S}$, we first state the concept of a uniqueness set [55].

Definition 3.1 (Uniqueness set). A subset of nodes $\mathcal{S}$ is a uniqueness set for the space $PW_\omega(G)$ iff $x_\mathcal{S} = y_\mathcal{S}$ implies $x = y$ for all $x, y \in PW_\omega(G)$.

Unique identifiability requires that no two bandlimited signals have the same samples on the sampling set, as ensured by the following theorem [3].

Theorem 3.1 (Unique sampling). $\mathcal{S}$ is a uniqueness set for $PW_\omega(G)$ if and only if $PW_\omega(G) \cap L_2(\mathcal{S}^c) = \{0\}$.

Proof. Given $PW_\omega(G) \cap L_2(\mathcal{S}^c) = \{0\}$, assume that $\mathcal{S}$ is not a uniqueness set. Then there exist $f, g \in PW_\omega(G)$, $f \neq g$, such that $f_\mathcal{S} = g_\mathcal{S}$. Hence, we have $f - g \in L_2(\mathcal{S}^c)$, $f - g \neq 0$. Also, $f - g \in PW_\omega(G)$ due to closure. But this is a contradiction, as $PW_\omega(G) \cap L_2(\mathcal{S}^c) = \{0\}$. Therefore, $\mathcal{S}$ must be a uniqueness set. Conversely, we are given that $\mathcal{S}$ is a uniqueness set. Let $\phi$ be any signal in $PW_\omega(G) \cap L_2(\mathcal{S}^c)$. Then, for any $f \in PW_\omega(G)$, we have $g = f + \phi \in PW_\omega(G)$ and $f(\mathcal{S}) = g(\mathcal{S})$. But since $\mathcal{S}$ is a uniqueness set, one must have $f = g$, which implies $\phi = 0$. Therefore, $PW_\omega(G) \cap L_2(\mathcal{S}^c) = \{0\}$.

Let $S$ be a matrix whose columns are indicator functions for the nodes in $\mathcal{S}$. Note that $S^\top : \mathbb{R}^N \to \mathbb{R}^{|\mathcal{S}|}$ is the sampling operator with $S^\top f = f_\mathcal{S}$. Theorem 3.1 essentially states that no signal in $PW_\omega(G)$ is in the null space $\mathcal{N}(S^\top)$ of the sampling operator. Any $f \in PW_\omega(G)$ can be written as $f = U_{\mathcal{VR}} c$. Thus, for unique sampling of any signal in $PW_\omega(G)$ on $\mathcal{S}$, we need $S^\top U_{\mathcal{VR}} c = U_{\mathcal{SR}} c \neq 0$ for all $c \neq 0$. This observation leads to the following corollary (which is also stated in [21]).

Corollary 3.1. Let $\mathcal{R} = \{1, \ldots, r\}$, where $\lambda_r$ is the largest graph frequency less than $\omega$. Then $\mathcal{S}$ is a uniqueness set for $PW_\omega(G)$ if and only if $U_{\mathcal{SR}}$ has full column rank.

If $U_{\mathcal{SR}}$ has full column rank, then a unique reconstruction $\hat{f} \in PW_\omega(G)$ can be obtained by finding the unique least squares solution to $f_\mathcal{S} = U_{\mathcal{SR}} c$:

$$\hat{f} = U_{\mathcal{VR}} U_{\mathcal{SR}}^+ f_\mathcal{S}, \quad (3.1)$$

where $U_{\mathcal{SR}}^+ = (U_{\mathcal{SR}}^\top U_{\mathcal{SR}})^{-1} U_{\mathcal{SR}}^\top$ is the pseudo-inverse of $U_{\mathcal{SR}}$. The above reconstruction formula is also known as consistent reconstruction [28], since it keeps the observed samples unchanged (Footnote 2), i.e., $\hat{f}_\mathcal{S} = f_\mathcal{S}$. Moreover, it is easy to see that if the original signal $f \in PW_\omega(G)$, then $\hat{f} = f$.

Footnote 2: Existence of a sample-consistent reconstruction in $PW_\omega(G)$ requires that $PW_\omega(G) \oplus L_2(\mathcal{S}^c) = \mathbb{R}^N$ [28].

3.2.1 Issue of stability and choice of sampling set

Note that selecting a sampling set $\mathcal{S}$ for $PW_\omega(G)$ amounts to selecting a set of rows of $U_{\mathcal{VR}}$. It is always possible to find a sampling set of size $r = \dim PW_\omega(G)$ that uniquely determines signals in $PW_\omega(G)$, as proven below.

Proposition 3.1. For any $PW_\omega(G)$, there always exists a uniqueness set $\mathcal{S}$ of size $|\mathcal{S}| = r$.

Proof. Since $\{u_i\}_{i=1}^r$ are linearly independent, the matrix $U_{\mathcal{VR}}$ has full column rank equal to $r$. Further, since the row rank of a matrix equals its column rank, we can always find a linearly independent set $\mathcal{S}$ of $r$ rows such that $U_{\mathcal{SR}}$ has full rank equal to $r$, thus proving our claim.
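A minimal sketch (ours; the graph, the sampling set, and the random seed are arbitrary) of the uniqueness check in Corollary 3.1 and the consistent reconstruction (3.1):

```python
import numpy as np

# A minimal sketch (ours): uniqueness check and reconstruction per (3.1).
rng = np.random.default_rng(0)

# Combinatorial Laplacian of a random undirected graph on N nodes.
N, r = 20, 5
A = rng.random((N, N)) < 0.3
W = np.triu(A, 1).astype(float); W = W + W.T
L = np.diag(W.sum(axis=1)) - W
lam, U = np.linalg.eigh(L)

U_VR = U[:, :r]                       # first r basis vectors span PW_omega(G)
f = U_VR @ rng.standard_normal(r)     # a bandlimited signal

S = [0, 3, 7, 12, 19]                 # a candidate sampling set, |S| = r
U_SR = U_VR[S, :]

# Uniqueness (Corollary 3.1): U_SR must have full column rank.
full_rank = np.linalg.matrix_rank(U_SR) == r

# Consistent reconstruction (3.1): least squares in the GFT coefficients.
c_hat, *_ = np.linalg.lstsq(U_SR, f[S], rcond=None)
f_hat = U_VR @ c_hat
print(full_rank, np.allclose(f_hat, f))   # exact recovery for f in PW_omega
```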
In most cases, picking $r$ nodes randomly gives a full-rank $U_{\mathcal{SR}}$. However, all sampling sets of a given size are not equally good. A bad choice of $\mathcal{S}$ can give an ill-conditioned $U_{\mathcal{SR}}$, which in turn leads to an unstable reconstruction $\hat{f}$. Stability of reconstruction is important when the true signal $f$ is only approximately bandlimited (which is the case for most signals in practice) or when the samples are noisy. The reconstruction error in this case depends not only on noise and model mismatch but also on the choice of sampling set. The best sampling set achieves the smallest reconstruction error.

Effect of noise

We first consider the case when the observed samples are noisy. Let $f \in PW_\omega(G)$ be the true signal and $n \in \mathbb{R}^{|\mathcal{S}|}$ be the noise introduced during sampling. The observed samples are then given by $y_\mathcal{S} = f_\mathcal{S} + n$. Using (3.1), we get the following reconstruction:

$$\hat{f} = U_{\mathcal{VR}} U_{\mathcal{SR}}^+ f_\mathcal{S} + U_{\mathcal{VR}} U_{\mathcal{SR}}^+ n. \quad (3.2)$$

Since $f \in PW_\omega(G)$, $U_{\mathcal{VR}} U_{\mathcal{SR}}^+ f_\mathcal{S} = f$. The reconstruction error equals $e = \hat{f} - f = U_{\mathcal{VR}} U_{\mathcal{SR}}^+ n$. If we assume that the entries of $n$ are i.i.d. with zero mean and unit variance, then the covariance matrix of the reconstruction error is given by

$$E = \mathbb{E}[e e^\top] = U_{\mathcal{VR}} (U_{\mathcal{SR}}^\top U_{\mathcal{SR}})^{-1} U_{\mathcal{VR}}^\top. \quad (3.3)$$

Different costs can be defined to measure the reconstruction error as a function of the error covariance matrix. These cost functions are based on the theory of optimal design of experiments [16]. If we define the optimal sampling set $\mathcal{S}^{\mathrm{opt}}$ of size $m$ as the set which minimizes the mean squared error, then, assuming $U_{\mathcal{VR}}$ has orthonormal columns, we have

$$\mathcal{S}^{\text{A-opt}} = \arg\min_{|\mathcal{S}|=m} \mathrm{tr}[E] = \arg\min_{|\mathcal{S}|=m} \mathrm{tr}[(U_{\mathcal{SR}}^\top U_{\mathcal{SR}})^{-1}]. \quad (3.4)$$

This is analogous to the so-called A-optimal design. Similarly, minimizing the maximum eigenvalue of the error covariance matrix leads to E-optimal design. For an orthonormal $U_{\mathcal{VR}}$, the optimal sampling set with this criterion is given by

$$\mathcal{S}^{\text{E-opt}} = \arg\min_{|\mathcal{S}|=m} \lambda_{\max}(E) = \arg\max_{|\mathcal{S}|=m} \sigma_{\min}(U_{\mathcal{SR}}), \quad (3.5)$$

where $\sigma_{\min}(\cdot)$ denotes the smallest singular value of a matrix. It can be thought of as a sampling set that minimizes the worst-case reconstruction error. The above criterion is equivalent to the one proposed in [22]. Further, one can show that when $U_{\mathcal{VR}}$ does not have orthonormal columns, (3.4) and (3.5) produce sampling sets that minimize upper bounds on the mean squared and worst-case reconstruction errors, respectively. Note that both the A- and E-optimality criteria lead to combinatorial problems, but it is possible to develop greedy approximate solutions to these problems.

So far, we assumed that the true signal $f \in PW_\omega(G)$ and hence $U_{\mathcal{VR}} U_{\mathcal{SR}}^+ f_\mathcal{S} = f$. However, in most applications, the signals are only approximately bandlimited. The reconstruction error in such a case is analyzed next.

Effect of model mismatch

Let $P = U_{\mathcal{VR}} U_{\mathcal{VR}}^\top$ be the projector for $PW_\omega(G)$ and $Q = S S^\top$ be the projector for $L_2(\mathcal{S})$. Assume that the true signal is given by $f = f^* + \Delta f$, where $f^* = P f$ is the bandlimited component of the signal and $\Delta f = P^\perp f$ captures the "high-pass component" (i.e., the model mismatch). If we use (3.1) for reconstructing $f$, then a tight upper bound on the reconstruction error [28] is given by

$$\|f - \hat{f}\| \leq \frac{1}{\cos(\theta_{\max})} \|\Delta f\|, \quad (3.6)$$

where $\theta_{\max}$ is the maximum angle between the subspaces $PW_\omega(G)$ and $L_2(\mathcal{S})$, defined as

$$\cos(\theta_{\max}) = \inf_{f \in PW_\omega(G), \|f\|=1} \|Q f\|. \quad (3.7)$$

Here, $\cos(\theta_{\max}) > 0$ when the uniqueness condition in Theorem 3.1 is satisfied, and the error is bounded. Intuitively, the above equation says that for the worst-case error to be minimum, the sampling and reconstruction subspaces should be as aligned as possible.
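For the known-spectrum setting, the E-optimal criterion (3.5) can be approximated greedily; the following sketch (ours, illustrative, and deliberately brute-force) adds at each step the node maximizing $\sigma_{\min}(U_{\mathcal{SR}})$. Section 3.3 develops the proxy-based alternative that avoids computing $U$ altogether.

```python
import numpy as np

# A minimal sketch (ours): greedy approximation of the E-optimal design (3.5).
def greedy_e_optimal(U_VR, m):
    """Greedily grow S to size m, maximizing the smallest singular value."""
    N = U_VR.shape[0]
    S = []
    for _ in range(m):
        best_node, best_score = None, -np.inf
        for v in range(N):
            if v in S:
                continue
            score = np.linalg.svd(U_VR[S + [v], :], compute_uv=False)[-1]
            if score > best_score:
                best_node, best_score = v, score
        S.append(best_node)
    return S

# Example with a random orthonormal U_VR (stand-in for the first r GFT vectors).
rng = np.random.default_rng(1)
U_VR = np.linalg.qr(rng.standard_normal((30, 4)))[0]
S = greedy_e_optimal(U_VR, 6)
print(S, np.linalg.svd(U_VR[S, :], compute_uv=False)[-1])
```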
We define an optimal sampling set $\mathcal{S}^{\mathrm{opt}}$ of size $m$ for $PW_\omega(G)$ as the set which minimizes the worst-case reconstruction error. Therefore, $L_2(\mathcal{S}^{\mathrm{opt}})$ makes the smallest maximum angle with $PW_\omega(G)$. It is easy to show that $\cos(\theta_{\max}) = \sigma_{\min}(U_{\mathcal{SR}})$. Thus, to find this set we need to solve a problem similar to (3.5). As stated before, this problem is combinatorial. It is possible to give a greedy algorithm to obtain an approximate solution. A simple greedy heuristic to approximate $\mathcal{S}^{\mathrm{opt}}$ is to perform column-wise Gaussian elimination over $U_{\mathcal{VR}}$ with partial row pivoting. The indices of the pivot rows in that case form a good estimate of $\mathcal{S}^{\mathrm{opt}}$ in practice.

The methods described above require computation of many eigenvectors of the variation operator $L$. We circumvent this issue in the next section by defining graph spectral proxies based on powers of $L$. These spectral proxies do not require the eigen-decomposition of $L$ and still allow us to compute the cutoff frequency, which also acts as a measure of quality for different sampling sets. As we will show, these proxies arise naturally in the expression for the bound on the reconstruction error. Thus, a sampling set optimal with respect to these spectral proxies ensures a small reconstruction error bound.

3.3 Graph Spectral Proxies

As discussed earlier, graphs considered in most real applications are very large. Hence, computing and storing the graph Fourier basis explicitly is often practically infeasible. We now present a tool to approximately compute the bandwidth $\omega(\phi)$ of any given signal $\phi$ without computing the Fourier coefficients explicitly. These quantities shall allow us to express the condition for unique bandlimited reconstruction, in terms of the cutoff frequency, and methods for sampling set selection, via simple operations using the variation operator. The following definition holds for any choice of the variation operator $L$ in Table 2.1.

Definition 3.2 (Graph Spectral Proxies). For any signal $f \neq 0$, we define its $k$-th spectral proxy $\omega_k(f)$ with $k \in \mathbb{Z}^+$ as

$$\omega_k(f) \triangleq \left( \frac{\|L^k f\|}{\|f\|} \right)^{1/k}. \quad (3.8)$$

For an operator $L$ with real eigenvalues and eigenvectors, $\omega_k(f)$ can be shown to increase monotonically with $k$:

$$\forall f,\ k_1 < k_2 \ \Rightarrow\ \omega_{k_1}(f) \leq \omega_{k_2}(f). \quad (3.9)$$

These quantities are bounded from above; as a result, $\lim_{k \to \infty} \omega_k(f)$ exists for all $f$. Consequently, we can show that if $\omega(f)$ denotes the bandwidth of a signal $f$, then

$$\forall k > 0, \quad \omega_k(f) \leq \lim_{j \to \infty} \omega_j(f) = \omega(f). \quad (3.10)$$

Note that (3.10) also holds for an asymmetric $L$ that has complex eigenvalues and eigenvectors. The proofs of (3.9) and (3.10) are provided in Lemmas 3.1 and 3.2. These properties give us an important insight: as we increase the value of $k$, the spectral proxies tend to have a value close to the actual bandwidth of the signal, i.e., they essentially indicate the frequency localization of the signal energy.

Lemma 3.1. If $L$ has real eigenvalues and eigenvectors, then for any $k_1 < k_2$, we have $\omega_{k_1}(f) \leq \omega_{k_2}(f)$ for all $f$.

Proof. We first expand $\omega_{k_1}(f)$ as follows:

$$(\omega_{k_1}(f))^{2k_1} = \left( \frac{\|L^{k_1} f\|}{\|f\|} \right)^2 = \frac{\sum_{i,j} (\lambda_i \lambda_j)^{k_1} \tilde{f}_i \tilde{f}_j u_i^\top u_j}{\sum_{i,j} \tilde{f}_i \tilde{f}_j u_i^\top u_j} = \sum_{i,j} (\lambda_i \lambda_j)^{k_1} c_{ij}, \quad (3.11)\text{-}(3.12)$$

where $c_{ij} = \tilde{f}_i \tilde{f}_j u_i^\top u_j / \sum_{i,j} \tilde{f}_i \tilde{f}_j u_i^\top u_j$. Now, consider the function $f(x) = x^{k_2/k_1}$. Note that since $k_1 < k_2$, $f(x)$ is a convex function.
Further, since $\sum_{i,j} c_{ij} = 1$, we can use Jensen's inequality in the above equation to get

$$\left( \sum_{i,j} (\lambda_i \lambda_j)^{k_1} c_{ij} \right)^{k_2/k_1} \leq \sum_{i,j} \left( (\lambda_i \lambda_j)^{k_1} \right)^{k_2/k_1} c_{ij} \quad (3.13)$$

$$\Rightarrow \left( \sum_{i,j} (\lambda_i \lambda_j)^{k_1} c_{ij} \right)^{1/2k_1} \leq \left( \sum_{i,j} (\lambda_i \lambda_j)^{k_2} c_{ij} \right)^{1/2k_2} \Rightarrow \omega_{k_1}(f) \leq \omega_{k_2}(f). \quad (3.14)$$

If $L$ has real entries but complex eigenvalues and eigenvectors, then these occur in conjugate pairs; hence, the above summation is real. However, in that case, $\omega_k(f)$ is not guaranteed to increase monotonically, since the $c_{ij}$'s are not real and Jensen's inequality breaks down.

Lemma 3.2. Let $\omega(f)$ be the bandwidth of any signal $f$. Then, the following holds:

$$\omega(f) = \lim_{k \to \infty} \omega_k(f) = \lim_{k \to \infty} \left( \frac{\|L^k f\|}{\|f\|} \right)^{1/k}. \quad (3.15)$$

Proof. We first consider the case when $L$ has real eigenvalues and eigenvectors. Let $\omega(f) = \lambda_p$; then we have

$$\omega_k(f) = \left( \frac{\sum_{i,j=1}^{p} (\lambda_i \lambda_j)^k \tilde{f}_i \tilde{f}_j u_i^\top u_j}{\sum_{i,j=1}^{p} \tilde{f}_i \tilde{f}_j u_i^\top u_j} \right)^{1/2k} = \lambda_p \left( c_{pp} + \sum_{(i,j) \neq (p,p)} \left( \frac{\lambda_i}{\lambda_p} \frac{\lambda_j}{\lambda_p} \right)^k c_{ij} \right)^{1/2k}, \quad (3.16)\text{-}(3.17)$$

where $c_{ij} = \tilde{f}_i \tilde{f}_j u_i^\top u_j / \sum_{i,j} \tilde{f}_i \tilde{f}_j u_i^\top u_j$. Now, using logarithms, we can show that

$$\lim_{k \to \infty} \left( c_{pp} + \sum_{(i,j) \neq (p,p)} \left( \frac{\lambda_i}{\lambda_p} \frac{\lambda_j}{\lambda_p} \right)^k c_{ij} \right)^{1/2k} = 1. \quad (3.18)$$

Substituting (3.18) in (3.17), we get

$$\lim_{k \to \infty} \omega_k(f) = \lambda_p = \omega(f). \quad (3.19)$$

Now, if $L$ has complex eigenvalues and eigenvectors, then these have to occur in conjugate pairs, since $L$ has real entries. Hence, for this case, we perform a similar expansion as above and take $|\lambda_p|$ out of the expression. The limit of the remaining term is once again equal to 1.

3.4 Cutoff frequency

In order to obtain a measure of quality for a sampling set $\mathcal{S}$, we first find the cutoff frequency associated with it, which can be defined as the largest frequency $\omega$ such that $\mathcal{S}$ is a uniqueness set for $PW_\omega(G)$. It follows from Theorem 3.1 that, for $\mathcal{S}$ to be a uniqueness set of $PW_\omega(G)$, $\omega$ needs to be less than the minimum possible bandwidth that a signal in $L_2(\mathcal{S}^c)$ can have. This would ensure that no signal from $L_2(\mathcal{S}^c)$ can be a part of $PW_\omega(G)$. Thus, the cutoff frequency $\omega_c(\mathcal{S})$ for a sampling set $\mathcal{S}$ can be expressed as

$$\omega_c(\mathcal{S}) \triangleq \min_{\phi \in L_2(\mathcal{S}^c),\, \phi \neq 0} \omega(\phi). \quad (3.20)$$

In order to avoid computation of the GFT basis, we use $\omega_k(\phi)$ as a proxy for $\omega(\phi)$ (i.e., the bandwidth of $\phi$), and this leads us to define the cutoff frequency estimate of order $k$ as

$$\Omega_k(\mathcal{S}) \triangleq \min_{\phi \in L_2(\mathcal{S}^c)} \omega_k(\phi) = \min_{\phi \in L_2(\mathcal{S}^c)} \left( \frac{\|L^k \phi\|}{\|\phi\|} \right)^{1/k}. \quad (3.21)$$

Using the definitions of $\Omega_k(\mathcal{S})$ and $\omega_c(\mathcal{S})$ along with (3.9) and (3.10), we conclude that for any $k_1 < k_2$:

$$\omega_c(\mathcal{S}) \geq \lim_{k \to \infty} \Omega_k(\mathcal{S}) \geq \Omega_{k_2}(\mathcal{S}) \geq \Omega_{k_1}(\mathcal{S}). \quad (3.22)$$

Using (3.22) and (3.20), we now state the following proposition.

Proposition 3.2. For any $k$, $\mathcal{S}$ is a uniqueness set for $PW_\omega(G)$ if $\omega < \Omega_k(\mathcal{S})$.

$\Omega_k(\mathcal{S})$ can be computed from (3.21) as

$$\Omega_k(\mathcal{S}) = \left[ \min_{\psi} \frac{\psi^\top ((L^\top)^k L^k)_{\mathcal{S}^c} \psi}{\psi^\top \psi} \right]^{1/2k} = (\sigma_{1,k})^{1/2k}, \quad (3.23)$$

where $\sigma_{1,k}$ denotes the smallest eigenvalue of the reduced matrix $((L^\top)^k L^k)_{\mathcal{S}^c}$. Further, if $\psi_{1,k}$ is the corresponding eigenvector, and $\phi^*_k$ minimizes $\omega_k(\phi)$ in (3.21) (i.e., it approximates the smoothest possible signal in $L_2(\mathcal{S}^c)$), then

$$\phi^*_k(\mathcal{S}^c) = \psi_{1,k}, \quad \phi^*_k(\mathcal{S}) = 0. \quad (3.24)$$

We note from (3.22) that to get a better estimate of the true cutoff frequency, one simply needs a higher $k$. Therefore, there is a trade-off between accuracy of the estimate on the one hand, and complexity and numerical stability on the other (which arise from taking higher powers of $L$).
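A direct (dense) implementation of (3.23)-(3.24) is straightforward; the sketch below (ours; illustrative) computes $\Omega_k(\mathcal{S})$ and $\phi^*_k$ for a path graph and shows the estimates increasing with $k$, as predicted by (3.22). Section 3.5 describes how to avoid forming the reduced matrix explicitly.

```python
import numpy as np

# A minimal sketch (ours): Omega_k(S) and phi*_k per (3.23)-(3.24),
# using an explicit eigen-decomposition of the reduced matrix.
def cutoff_estimate(L, S, k):
    """Return (Omega_k(S), phi_star), dense version."""
    N = L.shape[0]
    Sc = np.setdiff1d(np.arange(N), S)
    Lk = np.linalg.matrix_power(L, k)
    B = (Lk.T @ Lk)[np.ix_(Sc, Sc)]          # reduced matrix ((L^T)^k L^k)_Sc
    sigma, psi = np.linalg.eigh(B)
    Omega_k = sigma[0] ** (1.0 / (2 * k))    # smallest eigenvalue -> cutoff
    phi_star = np.zeros(N)
    phi_star[Sc] = psi[:, 0]                 # phi*_k is zero on S (3.24)
    return Omega_k, phi_star

# Example: a path graph Laplacian; Omega_k(S) grows with k, per (3.22).
N = 8
W = np.diag(np.ones(N - 1), 1); W = W + W.T
L = np.diag(W.sum(axis=1)) - W
S = [0, 4]
print([round(cutoff_estimate(L, S, k)[0], 4) for k in (1, 2, 4, 8)])
```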
3.5 Sampling set selection

3.5.1 Best sampling set of given size

As shown in Proposition 3.2, $\Omega_k(\mathcal{S})$ is an estimate of the smallest bandwidth that a signal in $L_2(\mathcal{S}^c)$ can have, and any signal in $PW_\omega(G)$ is uniquely sampled on $\mathcal{S}$ if $\omega < \Omega_k(\mathcal{S})$. Intuitively, we would like the projection of $L_2(\mathcal{S}^c)$ along $PW_\omega(G)$ to be as small as possible. Based on this intuition, we propose the following optimality criterion for selecting the best sampling set of size $m$:

$$\mathcal{S}^{\mathrm{opt}}_k = \arg\max_{|\mathcal{S}|=m} \Omega_k(\mathcal{S}). \quad (3.25)$$

To motivate the above criterion more formally, let $P$ denote the projector for $PW_\omega(G)$. The minimum gap [40] between the two subspaces $L_2(\mathcal{S}^c)$ and $PW_\omega(G)$ is given by

$$\inf_{f \in L_2(\mathcal{S}^c), \|f\|=1} \|f - P f\| = \sqrt{\sum_{i:\, \omega < \lambda_i} |\tilde{f}^*_i|^2} \geq \sqrt{\sum_{i \in \mathcal{I}} |\tilde{f}^*_i|^2}, \quad (3.26)$$

where $\mathcal{I} = \{i : \omega < \lambda_i \leq \Omega_k(\mathcal{S})\}$ and $\tilde{f}^*_i$ denotes the $i$-th GFT coefficient of the minimizer $f^*$ on the left-hand side. The inequality on the right-hand side holds because $\Omega_k(\mathcal{S})$ is the smallest bandwidth that any signal in $L_2(\mathcal{S}^c)$ can have. Eq. (3.26) shows that maximizing $\Omega_k(\mathcal{S})$ increases the lower bound on the minimum gap between $L_2(\mathcal{S}^c)$ and $PW_\omega(G)$. The minimum gap equals $\cos(\theta_{\max})$ as defined in (3.7) [40]. Thus, maximizing $\Omega_k(\mathcal{S})$ increases the lower bound on $\cos(\theta_{\max})$, which, in turn, minimizes the upper bound on the reconstruction error $\|f - \hat{f}\|$ given in (3.6), where the original signal $f \notin PW_\omega(G)$ and $\hat{f} \in PW_\omega(G)$ is obtained by (3.1).

We now show that $\Omega_k(\mathcal{S})$ also arises in the bound on the reconstruction error when the reconstruction is obtained by variational energy minimization:

$$\hat{f}_m = \arg\min_{y \in \mathbb{R}^N} \|L^m y\| \quad \text{subject to} \quad y_\mathcal{S} = f_\mathcal{S}. \quad (3.27)$$

It was shown in [56] that if $f \in PW_\omega(G)$, then the reconstruction error $\|\hat{f}_m - f\| / \|f\|$, for a given $m$, is upper-bounded by $2(\omega/\Omega_1(\mathcal{S}))^m$. This bound is suboptimal and can be improved by replacing $\Omega_1(\mathcal{S})$ with $\Omega_k(\mathcal{S})$ (which, from (3.22), is at least as large as $\Omega_1(\mathcal{S})$) for any $k \leq m$, as shown in the following theorem.

Theorem 3.2. Let $\hat{f}_m$ be the solution to (3.27) for a signal $f \in PW_\omega(G)$. Then, for any $k \leq m$,

$$\|\hat{f}_m - f\| \leq 2 \left( \frac{\omega}{\Omega_k(\mathcal{S})} \right)^m \|f\|. \quad (3.28)$$

Proof. Note that $(\hat{f}_m - f) \in L_2(\mathcal{S}^c)$. Therefore, from (3.21),

$$\|\hat{f}_m - f\| \leq \frac{1}{(\Omega_m(\mathcal{S}))^m} \|L^m (\hat{f}_m - f)\| \leq \frac{1}{(\Omega_m(\mathcal{S}))^m} \left( \|L^m \hat{f}_m\| + \|L^m f\| \right) \quad (3.29)$$

$$\leq \frac{2}{(\Omega_m(\mathcal{S}))^m} \|L^m f\| \quad (3.30)$$

$$\leq 2 \left( \frac{\omega_m(f)}{\Omega_m(\mathcal{S})} \right)^m \|f\| \quad (3.31)$$

$$\leq 2 \left( \frac{\omega}{\Omega_k(\mathcal{S})} \right)^m \|f\|.$$

Here, (3.29) follows from the triangle inequality, (3.30) holds because $\hat{f}_m$ minimizes $\|L^m \hat{f}_m\|$ over all sample-consistent signals, (3.31) follows from the definition of $\omega_m(f)$, and the last step follows from (3.10) and (3.22).

Note that for the error bound in (3.28) to go to zero as $m \to \infty$, $\omega$ must be less than $\Omega_k(\mathcal{S})$. Thus, increasing $\Omega_k(\mathcal{S})$ allows us to reconstruct signals in a larger bandlimited space using the variational method. Moreover, for fixed $m$ and $k$, a higher value of $\Omega_k(\mathcal{S})$ leads to a lower reconstruction error bound. The optimal sampling set $\mathcal{S}^{\mathrm{opt}}_k$ in (3.25) essentially minimizes this error bound.

3.5.2 Obtaining the best sampling set

The problem posed in (3.25) is combinatorial, because we need to compute $\Omega_k(\mathcal{S})$ for every possible subset $\mathcal{S}$ of size $m$. We therefore formulate a greedy heuristic to get an estimate of the optimal sampling set. Starting with an empty sampling set $\mathcal{S}$ (for which $\Omega_k(\mathcal{S}) = 0$), we keep adding nodes (from $\mathcal{S}^c$) one by one while trying to ensure maximum increase in $\Omega_k(\mathcal{S})$ at each step. To achieve this, we first consider the following quantity:
$$\lambda^\alpha_k(\mathbf{1}_\mathcal{S}) = \min_{x} \left( \omega_k(x) + \alpha \frac{x^\top \mathrm{diag}(\mathbf{1}_\mathcal{S}) x}{x^\top x} \right), \quad (3.32)$$

where $\mathbf{1}_\mathcal{S} : \mathcal{V} \to \{0, 1\}$ denotes the indicator function for the subset $\mathcal{S}$ (i.e., $\mathbf{1}(\mathcal{S}) = 1$ and $\mathbf{1}(\mathcal{S}^c) = 0$). Note that the right-hand side of (3.32) is simply a relaxation of the constraint in (3.21). When $\alpha \gg 1$, the components $x(\mathcal{S})$ are highly penalized during minimization, hence forcing the values of $x$ on $\mathcal{S}$ to be vanishingly small. Thus, if $x^\alpha_k(\mathbf{1}_\mathcal{S})$ is the minimizer in (3.32), then $[x^\alpha_k(\mathbf{1}_\mathcal{S})](\mathcal{S}) \to 0$. Therefore, for $\alpha \gg 1$,

$$\phi^*_k \approx x^\alpha_k(\mathbf{1}_\mathcal{S}), \quad \Omega_k(\mathcal{S}) \approx \lambda^\alpha_k(\mathbf{1}_\mathcal{S}). \quad (3.33)$$

Now, to tackle the combinatorial nature of our problem, we allow a binary relaxation of the indicator $\mathbf{1}_\mathcal{S}$ in (3.32) to define the following quantities:

$$\omega^\alpha_k(x, t) = \omega_k(x) + \alpha \frac{x^\top \mathrm{diag}(t) x}{x^\top x}, \quad (3.34)$$

$$\lambda^\alpha_k(t) = \min_{x} \omega^\alpha_k(x, t), \quad (3.35)$$

where $t \in \mathbb{R}^N$. These relaxations circumvent the combinatorial nature of our problem and have been used recently to study graph partitioning based on Dirichlet eigenvalues [54, 14]. Note that making the substitution $t = \mathbf{1}_\mathcal{S}$ in (3.35) gives us (3.32) exactly. The effect of adding a node to $\mathcal{S}$ on $\Omega_k(\mathcal{S})$ at each step can now be understood by observing the gradient vector $\nabla_t \lambda^\alpha_k(t)$ at $t = \mathbf{1}_\mathcal{S}$. Note that for any $x$ and $t$,

$$\frac{d\omega^\alpha_k(x, t)}{dt(i)} = \alpha \left( \frac{x(i)}{\|x\|} \right)^2. \quad (3.36)$$

When $t = \mathbf{1}_\mathcal{S}$, we know that the minimizer of (3.35) with respect to $x$ for large $\alpha$ is $\phi^*_k$. Hence,

$$\left. \frac{d\lambda^\alpha_k(t)}{dt(i)} \right|_{t=\mathbf{1}_\mathcal{S}} = \left. \frac{d\omega^\alpha_k(\phi^*_k, t)}{dt(i)} \right|_{t=\mathbf{1}_\mathcal{S}} = \alpha \left( \frac{\phi^*_k(i)}{\|\phi^*_k\|} \right)^2. \quad (3.37)$$

The equation above gives us the desired greedy heuristic: starting with an empty $\mathcal{S}$ (i.e., $\mathbf{1}_\mathcal{S} = 0$), if at each step we include the node on which the smoothest signal $\phi^*_k \in L_2(\mathcal{S}^c)$ has maximum energy (i.e., $\mathbf{1}_\mathcal{S}(i) \leftarrow 1$, $i = \arg\max_j (\phi^*_k(j))^2$), then $\lambda^\alpha_k(t)$, and in effect the cutoff estimate $\Omega_k(\mathcal{S})$, tend to increase maximally. We summarize the method for estimating $\mathcal{S}^{\mathrm{opt}}_k$ in Algorithm 3.1.

One can show that the cutoff frequency estimate $\Omega_k(\mathcal{S})$ associated with a sampling set can only increase (or remain unchanged) when a node is added to it. This is stated more formally in the following proposition.

Proposition 3.3. Let $\mathcal{S}_1$ and $\mathcal{S}_2$ be two subsets of nodes of $G$ with $\mathcal{S}_1 \subseteq \mathcal{S}_2$. Then $\Omega_k(\mathcal{S}_1) \leq \Omega_k(\mathcal{S}_2)$.

This turns out to be a straightforward consequence of the eigenvalue interlacing property for symmetric matrices.

Theorem 3.3 (Eigenvalue interlacing [32]). Let $B$ be a symmetric $n \times n$ matrix. Let $\mathcal{R} = \{1, 2, \ldots, r\}$, for $1 \leq r \leq n - 1$, and $B_r = B_\mathcal{R}$. Let $\lambda_k(B_r)$ be the $k$-th largest eigenvalue of $B_r$. Then the following interlacing property holds:

$$\lambda_{r+1}(B_{r+1}) \leq \lambda_r(B_r) \leq \lambda_r(B_{r+1}) \leq \ldots \leq \lambda_2(B_{r+1}) \leq \lambda_1(B_r) \leq \lambda_1(B_{r+1}).$$

The above theorem implies that if $\mathcal{S}_1 \subseteq \mathcal{S}_2$, then $\mathcal{S}_2^c \subseteq \mathcal{S}_1^c$, and thus $\lambda_{\min}\left[((L^\top)^k L^k)_{\mathcal{S}_1^c}\right] \leq \lambda_{\min}\left[((L^\top)^k L^k)_{\mathcal{S}_2^c}\right]$.

Algorithm 3.1 Greedy heuristic for estimating $\mathcal{S}^{\mathrm{opt}}_k$
Require: $G = \{\mathcal{V}, \mathcal{E}\}$, $L$, sampling set size $M$, $k \in \mathbb{Z}^+$.
Ensure: $\mathcal{S} = \{\emptyset\}$.
1: while $|\mathcal{S}| < M$ do
2:   For $\mathcal{S}$, compute the smoothest signal $\phi^*_k \in L_2(\mathcal{S}^c)$ using Proposition 3.2.
3:   $v \leftarrow \arg\max_i (\phi^*_k(i))^2$.
4:   $\mathcal{S} \leftarrow \mathcal{S} \cup v$.
5: end while
6: $\mathcal{S}^{\mathrm{opt}}_k \leftarrow \mathcal{S}$.
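A compact matrix-free implementation of Algorithm 3.1 might look as follows (ours; a sketch, not the released thesis code: the thesis experiments use LOBPCG for the smallest eigenpair, while ARPACK's 'SM' mode suffices for a toy example). The reduced matrix is never formed; it is applied to a vector as a sequence of matrix-vector products, as in (3.38) below.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import LinearOperator, eigsh

# A minimal sketch (ours) of Algorithm 3.1, for symmetric L.
def greedy_sampling(L, M, k):
    N = L.shape[0]
    S = []
    for _ in range(M):
        Sc = np.setdiff1d(np.arange(N), S)

        def matvec(psi):                      # psi lives on S^c
            x = np.zeros(N); x[Sc] = psi      # zero-pad to full length
            for _ in range(k): x = L @ x      # L^k x
            for _ in range(k): x = L.T @ x    # (L^T)^k L^k x
            return x[Sc]                      # restrict back to S^c

        op = LinearOperator((len(Sc), len(Sc)), matvec=matvec)
        _, psi = eigsh(op, k=1, which='SM')   # smallest eigenpair
        phi = np.zeros(N); phi[Sc] = psi[:, 0]
        S.append(int(np.argmax(phi ** 2)))    # node with max energy of phi*_k
    return S

# Example usage on a small random graph.
rng = np.random.default_rng(0)
W = np.triu((rng.random((40, 40)) < 0.15).astype(float), 1); W = W + W.T
L = csr_matrix(np.diag(W.sum(axis=1)) - W)
print(greedy_sampling(L, M=5, k=2))
```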
Connection with Gaussian elimination

From Section 3.2, we know that the optimal sampling set can be obtained by maximizing $\sigma_{\min}(U_{\mathcal{SR}})$ with respect to $\mathcal{S}$. A heuristic to obtain good sampling sets is to perform a column-wise Gaussian elimination with pivoting on the eigenvector matrix $U$. Then, a sampling set of size $i$ is given by the indices of zeros in the $(i+1)$-th column of the echelon form. We now show that the greedy heuristic proposed in Algorithm 3.1 is closely related to this rank-revealing Gaussian elimination procedure through the following observation.

Proposition 3.4. Let $\Phi$ be the matrix whose columns are given by the smoothest signals $\phi^*_\infty$ obtained sequentially after each iteration of Algorithm 3.1 with $k = \infty$ (i.e., $\Phi = [\phi^*_\infty|_{|\mathcal{S}|=0}\ \ \phi^*_\infty|_{|\mathcal{S}|=1}\ \ \ldots]$). Further, let $T$ be the matrix obtained by performing column-wise Gaussian elimination on $U$ with partial pivoting. Then the columns of $T$ are equal to the columns of $\Phi$ within a scaling factor.

Proof. If $\mathcal{S}$ is the smallest sampling set for uniquely representing signals in $PW_\omega(G)$ and $r = \dim PW_\omega(G)$, then we have the following: (1) $|\mathcal{S}| = r$; (2) the smoothest signal $\phi^*_\infty \in L_2(\mathcal{S}^c)$ has bandwidth $\lambda_{r+1}$. Therefore, $\phi^*_\infty|_{|\mathcal{S}|=r}$ is spanned by the first $r+1$ frequency basis elements $\{u_1, \ldots, u_{r+1}\}$. Further, since $\phi^*_\infty|_{|\mathcal{S}|=r}$ has zeros on exactly $r$ locations, it can be obtained by performing Gaussian elimination on $u_{r+1}$ using $u_1, u_2, \ldots, u_r$. Hence the $(r+1)$-th column of $\Phi$ is equal (within a scaling factor) to the $(r+1)$-th column of $T$. Pivoting comes from the fact that the $(i+1)$-th sampled node is given by the index of the element with maximum magnitude in $\phi^*_\infty|_{|\mathcal{S}|=i}$, and is used as the pivot to zero out elements with the same index in subsequent columns.

The above result illustrates that Algorithm 3.1 is an iterative procedure that approximates a rank-revealing Gaussian elimination procedure on $U_{\mathcal{VR}}$. For the known-spectrum case, this is a good heuristic for maximizing $\sigma_{\min}(U_{\mathcal{SR}})$. In other words, our method directly maximizes $\sigma_{\min}(U_{\mathcal{SR}})$ without going through the intermediate step of computing $U_{\mathcal{VR}}$. As we shall see in the next subsection, this results in significant savings in both time and space complexity.

Table 3.1: Comparison of the complexity of different sampling set selection algorithms.
- Methods in [65, 22] (with GFT): eigen-pair computations $O((|\mathcal{E}||\mathcal{S}| + C|\mathcal{S}|^3) T_1)$; sampling set search $O(N|\mathcal{S}|^3)$ [65] or $O(N|\mathcal{S}|^4)$ [22]; space complexity $O(N|\mathcal{S}|)$.
- Proposed method: eigen-pair computations $O(k|\mathcal{E}||\mathcal{S}| T_2(k))$; sampling set search $O(N|\mathcal{S}|)$; space complexity $O(N)$.

Complexity and implementation issues

We note that in the algorithm, computing the first eigen-pair of $((L^\top)^k L^k)_{\mathcal{S}^c}$ is the major step in each iteration. There are many efficient iterative methods, such as those based on Rayleigh quotient minimization, for computing the smallest eigen-pair of a matrix [41]. The atomic step in all of these methods consists of matrix-vector products. Specifically, in our case, this step involves evaluating the expression $((L^\top)^k L^k)_{\mathcal{S}^c} x$. Note that we do not actually need to compute the matrix $((L^\top)^k L^k)_{\mathcal{S}^c}$ explicitly, since the expression can be implemented as a sequence of matrix-vector products:

$$((L^\top)^k L^k)_{\mathcal{S}^c}\, x = I_{\mathcal{S}^c \mathcal{V}}\, L^\top \cdots L^\top L \cdots L\, I_{\mathcal{V} \mathcal{S}^c}\, x. \quad (3.38)$$

Evaluating the expression involves $2k$ matrix-vector products and has a complexity of $O(k|\mathcal{E}|)$, where $|\mathcal{E}|$ is the number of edges in the graph. Moreover, a localized and parallel implementation of this step is possible in the case of sparse graphs. The number of iterations required for convergence of the eigen-pair computation is a function of the eigenvalue gaps [41], and hence dependent on the graph structure and edge weights.

For the methods of [65] and [22], one needs to compute a portion of the eigenvector matrix, i.e., $U_{\mathcal{VS}}$ (assuming $|\mathcal{R}| = |\mathcal{S}|$). This can be done using block-based Rayleigh quotient minimization methods [41], block-based Krylov subspace methods such as Arnoldi/Lanczos iterations, or deflation methods in conjunction with single eigen-pair solvers [59].
The complexity of these methods increases considerably as the number of requested eigen-pairs increases, making them impractical. On the other hand, our method requires computing a single eigen-pair at each iteration, making it viable for cases when a large number of samples is required. Moreover, the sample search steps in the methods of [65] and [22] require an SVD solver and a linear system solver, respectively, thus making them much more complex in comparison to our method, where we only require finding the maximum element of a vector. Our algorithm is also efficient in terms of space complexity, since at any point we just need to store $L$ and one vector. On the other hand, [65, 22] require storage of at least $|\mathcal{S}|$ eigenvectors. A summary of the complexities of all the methods is given in Table 3.1. The eigen-pair computations for [65, 22] are assumed to be performed using a block version of the Rayleigh quotient minimization method, which has a complexity of $O((|\mathcal{E}||\mathcal{S}| + C|\mathcal{S}|^3) T_1)$, where $T_1$ denotes the number of iterations for convergence and $C$ is a constant. The complexity of computing one eigen-pair in our method is $O(k|\mathcal{E}||\mathcal{S}| T_2(k))$, where $T_2(k)$ denotes the average number of iterations required for convergence of a single eigen-pair. $T_1$ and $T_2(k)$, required to achieve a desired error tolerance, are functions of the eigen-gaps of $L$ and $L^k$, respectively. In general, $T_2(k) > T_1$, since $L^k$ has smaller eigengaps near the smallest eigenvalue. Increasing the parameter $k$ further flattens the spectrum of $L^k$ near the smallest eigenvalue, leading to an increase in $T_2(k)$, since one has to solve a more ill-conditioned problem. We illustrate this in the next section through experiments that compare the running times of all the methods.

The choice of the parameter $k$ depends on the desired accuracy: a larger value of $k$ gives a better sampling set but increases the complexity proportionally, thus providing a trade-off. Through experiments, we show in the next section that the quality of the sampling set is more sensitive to the choice of $k$ for sparser graphs. This is because increasing $k$ results in the consideration of more global information while selecting samples. On the other hand, dense graphs have a lower diameter, and there is relatively little information to be gained by increasing $k$.

3.6 Experiments

We now numerically evaluate the performance of the proposed work (Footnote 3). The experiments involve comparing the reconstruction errors and running times of different sampling set selection algorithms in conjunction with consistent bandlimited reconstruction (3.1) (Footnote 4). We compare our approach with the following methods:

M1: This method [22] uses a greedy algorithm to approximate the $\mathcal{S}$ that maximizes $\sigma_{\min}(U_{\mathcal{SR}})$. Consistent bandlimited reconstruction (3.1) is then used to estimate the unknown samples.

M2: At each iteration $i$, this method [65] finds the representation of $u_i$ as $\sum_{j<i} \beta_j u_j + \sum_{u \notin \mathcal{S}} \alpha_u \mathbf{1}_u$, where $\mathbf{1}_u$ is the delta function on $u$. The node $v$ with maximum $|\alpha_v|$ is sampled. Reconstruction is done using (3.1).

Both of the above methods assume that a portion of the frequency basis is known and that the signal to be recovered is exactly bandlimited. As a baseline, we also compare all sampling set selection methods against uniform random sampling.

Footnote 3: Code available at https://github.com/aamiranis/sampling_theory

Footnote 4: Although reconstruction using (3.1) requires explicit computation of $U_{\mathcal{VR}}$, there exist efficient localized reconstruction algorithms that circumvent this [50, 74]. However, in our work, we restrict our attention to the problem of sampling set selection.

3.6.1 Examples with artificial data

We first give some simple examples on the following simulated undirected graphs:

G1: Erdös-Renyi random graph (unweighted) with 1000 nodes and connection probability 0.01.
G2: Small-world graph [75] (unweighted) with 1000 nodes. The underlying regular graph with degree 8 is rewired with probability 0.1.

G3: Barabási-Albert random network [9] with 1000 nodes. The seed network is a fully connected graph with $m_0 = 4$ vertices, and each new vertex is connected to $m = 4$ existing vertices randomly. This model, as opposed to G1 and G2, is a scale-free network, i.e., its degrees follow a power law $P(k) \sim k^{-3}$.

The performance of the sampling methods depends on the assumptions about the true signal and the sampling noise. For each of the above graphs, we consider the problem in the following scenarios:

F1: The true signal is noise-free and exactly bandlimited, with $r = \dim PW_\omega(G) = 50$. The non-zero GFT coefficients are randomly generated from $\mathcal{N}(1, 0.5^2)$.

F2: The true signal is exactly bandlimited with $r = 50$, and the non-zero GFT coefficients are generated from $\mathcal{N}(1, 0.5^2)$. The samples are noisy, with additive i.i.d. Gaussian noise such that the SNR equals 20 dB.

F3: The true signal is approximately bandlimited with an exponentially decaying spectrum. Specifically, the GFT coefficients are generated from $\mathcal{N}(1, 0.5^2)$ and then rescaled with the following filter (where $r = 50$):

$$h(\lambda) = \begin{cases} 1, & \lambda < \lambda_r \\ e^{-4(\lambda - \lambda_r)}, & \lambda \geq \lambda_r. \end{cases} \quad (3.39)$$

We generate 50 signals from each of the three signal models on each of the graphs, use the sampling sets obtained from all the methods to perform reconstruction, and plot the mean of the mean squared error (MSE) for different sizes of sampling sets. For our algorithm, we set the value of $k$ to 2, 8 and 14. The result is illustrated in Figure 3.1.
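For reproducibility, the three signal models can be sketched as follows (ours; illustrative, given precomputed eigenpairs of the variation operator; note that in the experiments the F2 noise is added to the observed samples, which is what sampling the noisy vector $y$ below produces).

```python
import numpy as np

# A minimal sketch (ours) of the signal models F1-F3.
def make_signal(lam, U, model, r=50, snr_db=20.0, rng=np.random.default_rng()):
    """Return (f, y): the true signal and its (possibly noisy) observation."""
    N = len(lam)
    if model == 'F3':                          # approximately bandlimited
        f_hat = rng.normal(1.0, 0.5, N)
        f_hat *= np.where(lam < lam[r - 1], 1.0,
                          np.exp(-4.0 * (lam - lam[r - 1])))   # filter (3.39)
    else:                                      # F1/F2: exactly bandlimited
        f_hat = np.zeros(N)
        f_hat[:r] = rng.normal(1.0, 0.5, r)    # N(1, 0.5^2) coefficients
    f = U @ f_hat
    y = f.copy()
    if model == 'F2':                          # additive noise at 20 dB SNR
        sigma = np.sqrt(np.mean(f ** 2) / 10 ** (snr_db / 10))
        y = f + rng.normal(0.0, sigma, N)      # sampling later picks y[S]
    return f, y
```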
Note that when the size of the sampling set is less than $r = 50$, the results are quite unstable. This is expected, because the uniqueness condition is not satisfied by the sampling set. Beyond $|\mathcal{S}| = r$, we make the following observations:

1. For the noise-free, bandlimited signal model F1, all methods lead to zero reconstruction error as soon as the size of the sampling set exceeds the signal cutoff $r = 50$ (error plots for this signal model are not shown). This is expected from the sampling theorem. It is interesting to note that in most cases uniform random sampling does equally well, since the signal is noise-free and perfectly bandlimited.

2. For the noisy signal model F2 and the approximately bandlimited model F3, our method has better or comparable performance in most cases. This indicates that our method is fairly robust to noise and model mismatch. Uniform random sampling performs very badly, as expected, because of its lack of stability considerations.

[Figure 3.1: Reconstruction MSE vs. sample size for uniform random sampling, M1, M2, and the proposed method with k = 2, 8, 14. Panels: (a) graph G1, signal model F2; (b) G1, F3; (c) G2, F2; (d) G2, F3; (e) G3, F2; (f) G3, F3. Plots for signal model F1 are not shown, since the reconstruction errors are identically zero for all methods when $|\mathcal{S}| \geq \dim PW_\omega(G) = 50$. The large reconstruction errors for $|\mathcal{S}| < 50$ arise due to the non-uniqueness of bandlimited reconstruction and hence are less meaningful.]

Effect of parameter k in the spectral proxy

The parameter $k$ in the definition of spectral proxies controls how closely we estimate the bandwidth of any signal $f$. Spectral proxies with higher values of $k$ give a better approximation of the bandwidth. Our sampling set selection algorithm tries to maximize the smallest bandwidth that a signal in $L_2(\mathcal{S}^c)$ can have. Using higher values of $k$ allows us to estimate this smallest bandwidth more closely, thereby leading to better sampling sets, as demonstrated in Figure 3.2. Intuitively, maximizing $\Omega_k(\mathcal{S})$ with $k = 1$ ensures that the sampled nodes are well connected to the unsampled nodes [29] and thus allows better propagation of the observed signal information. Using $k > 1$ takes into account multi-hop paths while ensuring better connectedness between $\mathcal{S}$ and $\mathcal{S}^c$. This effect is especially important in sparsely connected graphs, and the benefit of increasing $k$ becomes less noticeable when the graphs are dense, as seen in Figure 3.2. However, this improvement in performance in the case of sparse graphs comes at the cost of increased numerical complexity.

[Figure 3.2: Reconstruction performance for noisy signals (model F2) with different values of k in Erdös-Renyi graphs having different connection sparsity levels: (a) p = 0.01, (b) p = 0.05, (c) p = 0.1. Higher connection probability p implies lower sparsity.]

Running time

We also compare the running times of the sampling set selection methods for different sizes of the graph. For our experiments, we generate symmetrized Erdös-Renyi random graphs of different sizes with parameter 0.01, and measure the average running time of selecting 5% of the samples in MATLAB. For computing the eigen-pairs, we use the code for the Locally Optimal Block Preconditioned Conjugate Gradient (LOBPCG) method available online [41] (this was observed to be faster than MATLAB's built-in sparse eigensolver eigs, which is based on Lanczos iterations). The results of the experiments are shown in Table 3.2. We observe that the rate of increase of running time as the graph size increases is slower for our method compared to the other methods, thus making it more practical. Note that the increase with respect to $k$ is nonlinear, since the eigengaps are a function of $k$ and lead to a different number of iterations required for convergence of the eigenvectors.

Table 3.2: Running time of different methods (in seconds) for selecting 5% samples on graphs of different sizes. The running time for M1 increases drastically and is ignored beyond graph size 5k.
- M1: 16.76 (1k); 12,322.72 (5k); not run (10k, 20k)
- M2: 2.16 (1k); 57.46 (5k); 425.92 (10k); 3004.01 (20k)
- Proposed, k = 4: 2.00 (1k); 11.13 (5k); 84.85 (10k); 566.39 (20k)
- Proposed, k = 6: 13.08 (1k); 24.46 (5k); 170.15 (10k); 1034.21 (20k)
- Proposed, k = 8: 31.16 (1k); 53.42 (5k); 316.12 (10k); 1778.31 (20k)
We create 10 different subsets of this dataset randomly, 43 0 20 40 60 80 100 120 140 160 180 200 0 0.001 0.002 0.003 0.004 0.005 0.006 0.007 0.008 0.009 0.01 Sample size Reconstruction MSE k = 2 k = 8 k = 14 (a) p = 0.01 0 20 40 60 80 100 120 140 160 180 200 0 0.001 0.002 0.003 0.004 0.005 0.006 0.007 0.008 0.009 0.01 Sample size Reconstruction MSE k = 2 k = 8 k = 14 (b) p = 0.05 0 20 40 60 80 100 120 140 160 180 200 0 0.001 0.002 0.003 0.004 0.005 0.006 0.007 0.008 0.009 0.01 Sample size Reconstruction MSE k = 2 k = 8 k = 14 (c) p = 0.1 Figure3.2: Reconstructionperformancefornoisysignals(modelF2)withdifferentvalues ofk inErdös-Renyi graphshavingdifferent connection sparsitylevels. Higherconnection probability p implies lower sparsity. consisting of 100 images from each class. The data points can be thought of as points {x i } 1000 i=1 ⊂ R 256 with labels{y i } 1000 i=1 . For each instance, we construct a symmetrized k-nearest neighbor (k-nn) graph with k = 10, where each node corresponds to a data point. We restrict the problem to the largest strongly connected component of the graph for convenience so that a stationary distribution for the resultant random walk exists which allows us to define the random walk based GFT. The graph signal is given by the membership function f c of each class c which takes a value 1 on a node which belongs to the class and is 0 otherwise. To solve the multi-class classification task, we use the 44 Table 3.2: Running time of different methods (in seconds) for selecting 5% samples on graphs of different sizes. The running time for M1 increases drastically and is ignored beyond graph size 5k. 1k 5k 10k 20k M1 16.76 12, 322.72 - - M2 2.16 57.46 425.92 3004.01 Proposed, k = 4 2.00 11.13 84.85 566.39 Proposed, k = 6 13.08 24.46 170.15 1034.21 Proposed, k = 8 31.16 53.42 316.12 1778.31 one-vs-rest strategy which entails reconstructing the membership function of every class. The final classification for node i is then obtained by y i = arg max c {f c i }. (3.40) We first compare the performance of the proposed method against M1 and M2 using the normalized adjacency matrix based GFT with the variation operator L = I−D −1 W. The bandwidth parameter r is set to 50. The plot of classification error averaged over the 10 dataset instances vs. number of labels is presented in Figure 3.3a. It shows that the proposed method has comparable performance despite being localized. The per- formance is also affected by the choice of the variation operators (or, the GFT bases). Figure 3.3b shows that the variation operators based on the hub-authority model and random walk offer higher classification accuracy and thus, are more suited for this par- ticular application. Their superior performance can be explained by looking at the signal representation in the respective GFT domains. Figure 3.3c shows the fraction of signal energy captured in increasing number of GFT coefficients starting from low frequency. Since the hub-authority model based GFT and random walk based GFT offer more energy compaction than adjacency based GFT, the signal reconstruction quality using these bases is naturally better. 3.7 Summary We studied the problem of selecting an optimal sampling set for reconstruction of bandlimited graph signals. The starting point of our framework is the notion of the Graph Fourier Transform (GFT) which is defined via an appropriate variation operator. 
3.7 Summary

We studied the problem of selecting an optimal sampling set for the reconstruction of bandlimited graph signals. The starting point of our framework is the notion of the Graph Fourier Transform (GFT), which is defined via an appropriate variation operator. Our goal is to find good sampling sets for reconstructing signals that are bandlimited in the above frequency domain. We showed that when the samples are noisy or the true signal is only approximately bandlimited, the reconstruction error depends not only on the model mismatch but also on the choice of sampling set. We proposed a measure of quality for the sampling sets, namely the cutoff frequency, that can be computed without finding the GFT basis explicitly. A sampling set that maximizes the cutoff frequency is shown to minimize the reconstruction error. We also proposed a greedy algorithm which finds an approximately optimal set. The proposed algorithm can be efficiently implemented in a distributed and parallel fashion. Together with localized signal reconstruction methods, it gives an effective method for sampling and reconstruction of smooth graph signals on large graphs.

3.7.1 Future work

The present work opens up some new questions for future research. The problem of finding a sampling set with maximum cutoff frequency is combinatorial. The proposed greedy algorithm gives only an approximate solution to this problem. It would be useful to find efficient algorithms with theoretical guarantees on the quality of approximation. The accuracy-complexity trade-off provided by spectral proxies in computing the cutoff frequency indicates the existence of simple heuristics that can be used for fast sampling set selection at the expense of robustness of reconstruction. For example, maximizing the spectral proxy for $k = 1$ removes the need for computing even the smallest eigenpair and can be implemented entirely in the vertex domain. This heuristic in effect ensures that the unsampled nodes are strongly connected to the sampled ones. It would be interesting to understand heuristics arising from higher-order spectral proxies.

Another avenue for research is the consideration of randomized sampling algorithms (along the lines of [57]) that define a distribution over the nodes and provide probabilistic trade-offs between the sampling set size and bounds on the reconstruction error. This would come, however, at the expense of an increased size of the sampling set and increased reconstruction error. By carefully designing probability distributions for selecting the sampling sets, one can guarantee a probabilistic bound on the error as a function of its cardinality, using concentration inequalities from probability theory.

Finally, it would also be interesting if one could estimate the bandwidth of any signal using a few observed samples and $L$. This would help us decide "where to stop" while computing the sampling set sequentially, or in other words, understand the cardinality of the sampling set required for recovering a signal without knowing it a priori.
A possible approach for achieving this is to randomly observe a few samples of the signal and analyze the convergence of spectral proxies defined using these samples as a function of the cardinality. This can potentially provide a handle on the accuracy of the bandwidth estimate, which one can use to figure out when to stop sampling.

Chapter 4
Wavelet Filterbanks on Graphs

(This chapter is based partly on our work in [7, 8].)

In this chapter, we explore the sampling problem further and consider the design of wavelet filterbanks on graphs. This problem is fundamentally different from sampling bandlimited signals, since one needs to find a joint sampling scheme spanning multiple channels of the filterbank and favoring all signals on the graph, i.e., not only bandlimited signals. These filterbanks are designed taking into account several desirable properties, such as compact support, critical sampling, near orthogonality, and near perfect reconstruction. Compact support implies using graph spectral filters that are polynomials of the graph adjacency or Laplacian matrix, thus helping keep time and space complexities in check. We show in this chapter that satisfying all these properties simultaneously for graph wavelet filterbanks is possible only under very restrictive conditions, since the structure of the GFT basis is dependent on the structure of the graph.

Recent approaches to graph wavelet filterbank design impose certain structural constraints on the graphs, for example by requiring the graphs to be bipartite or circulant. These assumptions significantly reduce the number of constraints that need to be satisfied for achieving the desirable properties mentioned earlier. Extending these designs to arbitrary graphs involves approximate decomposition into multiple subgraphs that satisfy the structural constraint, at the cost of diminished multiresolution performance of the system. In this chapter, we circumvent this issue and design filterbanks directly on arbitrary graphs by decoupling the design of the filters from the choice of the sampling scheme. Given graph spectral filters that achieve desirable frequency localization, we consider the problem of choosing the best sampling scheme over multiple channels by minimizing a bound on the reconstruction error for the entire filterbank. This objective is optimized by a greedy minimization scheme that allows an efficient implementation, once again without the need of explicitly handling the GFT basis. Experimental results show that our scheme performs well in comparison to other ad hoc sampling schemes in the literature. Finally, we also explore an interesting directed extension of bipartite graphs called M-block cyclic graphs [69]. These graphs are useful in modeling periodic finite state machines (FSMs) and Markov decision processes (MDPs). The eigenstructure of these graphs makes them particularly suitable for the design of M-channel filterbanks.

The rest of this chapter is organized as follows: In Section 4.1, we review existing work on the design of graph wavelet filterbanks. Section 4.2 introduces some background and notation relevant to this chapter. Section 4.3 focuses on the general formulation behind two-channel filterbanks and introduces conditions required for attaining desirable properties. In Section 4.4, we design a critical sampling scheme that can be obtained efficiently given predesigned filters. We then turn to the design of M-channel filterbanks on M-block cyclic graphs in Section 4.5. Finally, we conclude the chapter with a summary and possible extensions of our work in Section 4.6.
Finally, we conclude the chapter with a summary and possible extensions of our work in Section 4.6.

4.1 Related work

State-of-the-art wavelet filterbanks that satisfy most of the above-mentioned properties require imposing certain structural constraints on the underlying graph. For example, the two-channel filterbanks proposed in [45, 46] are designed specifically for bipartite graphs. The special structure leads to a natural downsampling-upsampling scheme (on one of the two partitions) in each channel, accompanied by a spectral folding phenomenon that is exploited while designing the filters. In order to extend the design to arbitrary graphs, these works suggest a multidimensional framework in which the input graph is decomposed into multiple bipartite subgraphs over which filterbanks are designed and implemented independently. Various approaches have been proposed to optimize the bipartite subgraph decomposition [52, 77] for these multidimensional filterbanks. However, the limitation of this framework is that one is forced to work with simplified graphs that do not contain all the connectivity information. There are also works that suggest expanding the input graph to create a bipartite graph, leading to an oversampled filterbank [61], which may not be desirable for some applications such as compression. There also exist filterbanks that exploit circulant graph architectures [26, 27]; these works, however, consider only the analysis filterbank, while the synthesis part is assumed to be obtained via least-squares inversion.

Recently, there has also been interest in designing M-channel polynomial filterbanks. Once again, the filterbanks are designed on graphs with a special structure, called M-block cyclic graphs [69]. These graphs exhibit an M-fold spectral folding phenomenon (in the GFT basis of the adjacency) upon downsampling-upsampling on any block. This phenomenon is exploited to state perfect reconstruction conditions for M-channel filterbanks in [70]. However, these conditions are meant for graphs with balanced block sizes and a sampling scheme that downsamples-upsamples on the same block in each channel. Moreover, that work does not provide insight into possible solutions satisfying the constraints and suggests reusing existing filter designs from classical DSP. In our work, we remove the sampling restrictions and provide a possible solution for the perfect reconstruction conditions when M is a power of 2.

4.2 Background and notation

In this chapter, we work with weighted graphs $G = (\mathcal{V}, \mathcal{E})$ consisting of a set of nodes $\mathcal{V} = \{1, 2, \dots, n\}$ and edges $\mathcal{E} = \{w_{ij}\}$, $i, j \in \mathcal{V}$, with $w_{ii} = 0$. We denote the adjacency matrix by $A$ and the degree matrix by $D$, and assume that $A$ has been normalized so that $\|A\|_2 = 1$, which facilitates the design of graph spectral filters (there can be different ways of normalizing depending on the application at hand, e.g., $\frac{1}{|\lambda_{\max}|}A$, or the random walk normalization $D^{-1}A$). When the graph is undirected, we work with the symmetric normalized Laplacian $L = I - D^{-1/2} A D^{-1/2}$. $L$ is a symmetric positive semi-definite matrix with real eigenvalues $0 = \lambda_1 \le \lambda_2 \le \dots \le \lambda_n \le 2$ and a corresponding orthogonal set of eigenvectors $U = [u_1, u_2, \dots, u_n]$.

We recall that the downsampling operation on a graph signal $f$ is defined as the restriction of $f$ to a subset of nodes $\mathcal{S} \subset \mathcal{V}$ (the downsampling set); the downsampled signal is a vector of reduced length $|\mathcal{S}|$. The downsampling operator for $\mathcal{S}$ is obtained by sampling the corresponding rows of the identity matrix $I$, i.e., $S = I_{\mathcal{S},\mathcal{V}} \in \{0,1\}^{|\mathcal{S}| \times n}$.
Similarly, the upsampling operation for signals downsampled on $\mathcal{S}$ inserts zeros in place of the missing signal values at the appropriate locations, and is given by $S^T$. Further, while designing filterbanks, we shall employ polynomial graph filters $H = h(L) = \sum_{i=0}^{k} h_i L^i$ (or alternatively $H = h(A) = \sum_{i=0}^{k} h_i A^i$ when we work with $A$, in the case of directed graphs). These filters are attractive because of their efficiency: a degree-$k$ polynomial filter can be implemented with $O(k|\mathcal{E}|)$ complexity. Note that for undirected graphs $L$ is symmetric, and hence $H$ is symmetric.

Figure 4.1: A generic two-channel filterbank on graphs.

4.3 Two-channel filterbanks

We now describe the general formulation of two-channel wavelet filterbanks on arbitrary undirected graphs (it can easily be extended to directed graphs). A more detailed description can be found in [45, 46] in the context of bipartite graphs; we make certain changes to the notation for compactness.

A generic two-channel wavelet filterbank on a graph decomposes any graph signal $x \in \mathbb{R}^N$ into a lowpass (smooth) and a highpass (detail) component (Figure 4.1). It consists of an analysis filterbank with $H_0$ and $H_1$ as lowpass and highpass filters, and a synthesis filterbank with $G_0$ and $G_1$ as lowpass and highpass filters. $S_0 \in \{0,1\}^{|\mathcal{S}_0| \times N}$ and $S_1 \in \{0,1\}^{|\mathcal{S}_1| \times N}$ are the downsampling operators for the lowpass and highpass branches, respectively, while $S_0^T$ and $S_1^T$ are the corresponding upsampling operators. The outputs of the two branches after the analysis filterbank are $y_0 \in \mathbb{R}^{|\mathcal{S}_0|}$ and $y_1 \in \mathbb{R}^{|\mathcal{S}_1|}$, given by

$\begin{bmatrix} y_0 \\ y_1 \end{bmatrix} = \begin{bmatrix} S_0 H_0 \\ S_1 H_1 \end{bmatrix} x = T_a x.$  (4.1)

Similarly, the output of the synthesis filterbank (i.e., the reconstructed signal) is denoted $\hat{x} \in \mathbb{R}^N$ and is given by

$\hat{x} = \begin{bmatrix} G_0 S_0^T & G_1 S_1^T \end{bmatrix} \begin{bmatrix} y_0 \\ y_1 \end{bmatrix} = T_s \begin{bmatrix} y_0 \\ y_1 \end{bmatrix},$  (4.2)

with the complete transfer equation of the system given by

$\hat{x} = \left( G_0 S_0^T S_0 H_0 + G_1 S_1^T S_1 H_1 \right) x.$  (4.3)

We now state some desirable characteristics of the filterbank along with the conditions needed to satisfy each.

• Compact support requires that the filters $\{H_i, G_i\}_{i=0,1}$ be expressible as finite polynomials of the graph Laplacian $L$ (or $A$ for directed graphs), a notion analogous to FIR filters in classical DSP. A degree-$k$ polynomial filter collects information from a $k$-hop neighborhood of each node.

• Critical sampling requires that the total number of samples after downsampling in both branches equal the dimension of the signal, i.e., $|\mathcal{S}_0| + |\mathcal{S}_1| = N$. If the sampling scheme is constrained to disjoint sets, this can be stated in terms of the sampling operators as

$S_0^T S_0 + S_1^T S_1 = I.$  (4.4)

• Perfect reconstruction requires that the transfer function of the entire system be identity, i.e.,

$G_0 S_0^T S_0 H_0 + G_1 S_1^T S_1 H_1 = I.$  (4.5)

• Orthogonality requires the filterbanks to satisfy $T_s = T_a^T$ and $T_a^T T_a = I$, which translates to substituting $G_0 = H_0$ and $G_1 = H_1$ in (4.5).

Note that the perfect reconstruction condition (4.5) can also be interpreted using the eigendecomposition of $L$ or $A$ as $U \Lambda U^{-1}$:

$g_0(\Lambda)\, U^{-1} S_0^T S_0 U\, h_0(\Lambda) + g_1(\Lambda)\, U^{-1} S_1^T S_1 U\, h_1(\Lambda) = I.$  (4.6)

For an arbitrary $U$, it is impossible to satisfy (4.6) using low-degree polynomial filters, since the number of constraints ($= N^2$) is much larger than the available degrees of freedom.
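As an aside on implementation: every filter in this chapter is a polynomial in $L$ or $A$, so it can be applied with $k$ matrix-vector products via Horner's rule, never forming the powers $L^i$ or the GFT basis explicitly. A minimal sketch (the coefficient list $h = [h_0, \dots, h_k]$, lowest degree first, is the only assumption):

```python
import numpy as np

def apply_polynomial_filter(L, h, x):
    """Apply H = h[0] I + h[1] L + ... + h[k] L^k to a signal x using
    Horner's rule: k matrix-vector products, O(k|E|) time for sparse L."""
    y = h[-1] * x
    for c in reversed(h[:-1]):
        y = L @ y + c * x       # Horner step: y <- L y + c x
    return y
```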
Since (4.6) cannot be satisfied exactly for arbitrary $U$, one would like to design the system so that $G_0 S_0^T S_0 H_0 + G_1 S_1^T S_1 H_1$ is as close as possible to identity. Special structure in the graph results in a structured $U$, and therefore a simplification of (4.6) through the elimination of several constraints, as shown next for bipartite graphs.

4.3.1 Special case: bipartite graphs

The special structure of bipartite graphs leads to a spectral folding phenomenon that eliminates several constraints in (4.6), thereby allowing two-channel filterbank designs using low-degree polynomial filters [45, 46]. We now explain this phenomenon in the context of undirected bipartite graphs; the analysis can easily be extended to directed bipartite graphs, and also to M-block cyclic graphs, as we shall see in Section 4.5.1. We begin by noting the following properties of the symmetric normalized Laplacian $L$ of a bipartite graph:

1. The eigenvectors of $L$ exhibit a "spectral folding" phenomenon [45]: if $\left(\lambda, \begin{bmatrix} u_0 \\ u_1 \end{bmatrix}\right)$ is an eigenpair of $L$ (where $u_0$ and $u_1$ are the values on partitions $\mathcal{S}_0$ and $\mathcal{S}_1$, respectively), then $\left(2-\lambda, \begin{bmatrix} u_0 \\ -u_1 \end{bmatrix}\right)$ is also an eigenpair.

2. Using the orthogonality of the eigenvectors $\begin{bmatrix} u_0 \\ u_1 \end{bmatrix}$ and $\begin{bmatrix} u_0 \\ -u_1 \end{bmatrix}$, we have $u_0^T u_0 - u_1^T u_1 = 0$, i.e., $u_0^T u_0 = u_1^T u_1$. Further, since the eigenvectors are normalized, $u_0^T u_0 + u_1^T u_1 = 1$, which gives $u_0^T u_0 = u_1^T u_1 = 1/2$.

3. Similarly, using the orthogonality of the eigenvector $\begin{bmatrix} u_0 \\ u_1 \end{bmatrix}$ to both $\begin{bmatrix} u_0' \\ u_1' \end{bmatrix}$ and $\begin{bmatrix} u_0' \\ -u_1' \end{bmatrix}$ (a different eigenpair and its folded copy), we get $u_0^T u_0' + u_1^T u_1' = 0$ and $u_0^T u_0' - u_1^T u_1' = 0$, which gives $u_0^T u_0' = u_1^T u_1' = 0$.

We can use these three properties to simplify (4.6). For simplicity, consider balanced bipartite graphs (with equal-sized partitions) with distinct eigenvalues. The eigenvector matrix of such a graph can be written as

$U = \begin{bmatrix} U_0 & U_0^* \\ U_1 & -U_1^* \end{bmatrix},$  (4.7)

where $U_0^*$ and $U_1^*$ are obtained from $U_0$ and $U_1$ by reversing (mirroring) the column order. Therefore, if $\mathcal{S}_0, \mathcal{S}_1$ are the two sets of the bipartition, then it can be shown that

$U^T S_0^T S_0 U = \begin{bmatrix} U_0^T \\ U_0^{*T} \end{bmatrix} \begin{bmatrix} U_0 & U_0^* \end{bmatrix} = \tfrac{1}{2}(I + I^*),$  (4.8)

$U^T S_1^T S_1 U = \begin{bmatrix} U_1^T \\ -U_1^{*T} \end{bmatrix} \begin{bmatrix} U_1 & -U_1^* \end{bmatrix} = \tfrac{1}{2}(I - I^*),$  (4.9)

where $I^*$ is the anti-diagonal identity matrix. Noting that $U^{-1} = U^T$ in the undirected case and substituting (4.8) and (4.9) into the left-hand side of (4.6), we conclude that (4.6) is satisfied if the following conditions on the filter responses hold in the spectral domain for $0 \le \lambda \le 2$:

$g_0(\lambda) h_0(\lambda) + g_1(\lambda) h_1(\lambda) = 2,$  (4.10)
$g_0(\lambda) h_0(2-\lambda) - g_1(\lambda) h_1(2-\lambda) = 0.$  (4.11)

These are exactly the perfect reconstruction conditions stated in [45]. $I^*$ causes the spectral folding phenomenon and thus generates $N$ additional aliasing constraints besides the $N$ diagonal constraints, resulting in a total of $2N$ constraints that are easier to satisfy with low-degree filters. The analysis can be extended to bipartite graphs with unbalanced partitions and possibly repeated eigenvalues through simple modifications, and can be shown to produce the same perfect reconstruction conditions. Note that if we are not restricted to polynomial synthesis filters, one can invert the analysis transfer function $T_a$ by least squares, provided it is non-singular.
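The folding property in point 1 above is easy to verify numerically. The following hypothetical check (graph size and weights are arbitrary choices) builds a small weighted bipartite graph and confirms that flipping the sign of an eigenvector on one partition yields the eigenvector of the mirrored frequency $2-\lambda$:

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.random((4, 3)) + 0.1                   # positive weights across partitions
A = np.block([[np.zeros((4, 4)), B], [B.T, np.zeros((3, 3))]])
d = A.sum(axis=1)
Dh = np.diag(d ** -0.5)
L = np.eye(7) - Dh @ A @ Dh                    # symmetric normalized Laplacian

lam, U = np.linalg.eigh(L)
u = U[:, 1]                                    # an eigenpair (lam[1], [u0; u1])
u_mod = np.concatenate([u[:4], -u[4:]])        # modulated vector [u0; -u1]
print(np.allclose(L @ u_mod, (2 - lam[1]) * u_mod))   # True: spectral folding
```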
4.3.2 Characterizing graphs that admit perfect reconstruction filterbanks

We now characterize graphs that admit a critically-sampled, compactly supported, perfect reconstruction design for the two-channel filterbank depicted in Figure 4.1. For the analysis to hold for both undirected and directed graphs, we work with filters that are polynomial in $A$ with appropriate normalization ($D^{-1}A$ or $\frac{1}{\lambda_{\max}(A)}A$ for directed graphs and $D^{-1/2} A D^{-1/2}$ for undirected graphs; both normalizations bound the frequencies as $\lambda \in [-1, 1]$). Once again, we begin with the transfer function in the spectral domain:

$\tilde{T} = g_0(\Lambda)\, U^{-1} S_0^T S_0 U\, h_0(\Lambda) + g_1(\Lambda)\, U^{-1} S_1^T S_1 U\, h_1(\Lambda).$  (4.12)

Recall that for critically-sampled (on disjoint sets), perfect reconstruction polynomial filterbanks, we must satisfy $\tilde{T} = I$ with $S_0^T S_0 + S_1^T S_1 = I$ and low-degree polynomial filter responses $\{h_k(\lambda), g_k(\lambda)\}$. Similar to [45], let us define a modulation function $\beta \in \{-1, +1\}^n$ and a corresponding modulation operator $\Omega = \mathrm{diag}(\beta)$, where

$\beta(v) = \begin{cases} +1, & v \in \mathcal{S}_0, \\ -1, & v \in \mathcal{S}_1. \end{cases}$  (4.13)

The modulation operator satisfies $\Omega^2 = I$. With this definition, we have the following relations for critical sampling in the two channels:

$S_0^T S_0 = \tfrac{1}{2}(I + \Omega),$  (4.14)
$S_1^T S_1 = \tfrac{1}{2}(I - \Omega),$  (4.15)

using which (4.12) can be rewritten as

$\tilde{T} = \underbrace{\tfrac{1}{2}\big( g_0(\Lambda) h_0(\Lambda) + g_1(\Lambda) h_1(\Lambda) \big)}_{\tilde{T}_{\mathrm{gain}}} + \underbrace{\tfrac{1}{2}\big( g_0(\Lambda)\, U^{-1} \Omega U\, h_0(\Lambda) - g_1(\Lambda)\, U^{-1} \Omega U\, h_1(\Lambda) \big)}_{\tilde{T}_{\mathrm{alias}}}.$  (4.16)

The first term on the right-hand side, denoted $\tilde{T}_{\mathrm{gain}}$, is diagonal in the spectral domain and therefore determines the gain of the transfer function. The second term, denoted $\tilde{T}_{\mathrm{alias}}$, is the aliasing component, since $U^{-1}\Omega U$ is not diagonal in general ($\Omega$ is not simultaneously diagonalizable with $A$). As a result, we observe an input-dependent smearing of the spectrum that is difficult to reverse. Therefore, for perfect reconstruction, the filter responses and the sampling scheme must be chosen such that $\tilde{T}_{\mathrm{gain}} = I$ and $\tilde{T}_{\mathrm{alias}} = 0$.

In order to characterize aliasing, we expand the modulated basis vectors $\Omega U$ in the original basis $U$ by finding a $P$ such that

$\Omega U = U P.$  (4.17)

$P$ contains the coefficients for expressing $\Omega U$ in $U$, and since $P = U^{-1}\Omega U$, it also determines the aliasing pattern in the transfer function (illustrated in Figure 4.2 for an arbitrary graph and a bipartite graph).

Figure 4.2: An illustration of aliasing patterns for two-channel filterbanks in (a) an arbitrary graph, (b) a bipartite graph. Spectral folding in a bipartite graph results in a concise anti-aliasing constraint in the spectral domain, as seen in Section 4.3.1.

The minimum number of constraints is generated from the condition $\tilde{T}_{\mathrm{alias}} = 0$ when $P$ is a permutation matrix. This happens if and only if, for all GFT basis vectors $u$, their modulated versions $\Omega u$ are also elements of the basis. In other words, if $(\lambda, u)$ is an eigenpair of $A$, then $(\mu, \Omega u)$ is also an eigenpair.

Observation 4.1. The minimum number of anti-aliasing constraints is generated if there exists a one-to-one mapping between GFT vectors and their modulated versions. In other words, modulating the GFT matrix is equivalent to applying a column permutation.
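Observation 4.1 can be illustrated numerically. For a directed 4-cycle (a directed bipartite graph, chosen here purely as a hypothetical example), $P = U^{-1}\Omega U$ is a permutation up to unit-modulus phases, whereas for a generic graph it is dense:

```python
import numpy as np

A = np.roll(np.eye(4), 1, axis=1)         # directed 4-cycle adjacency
beta = np.array([1., -1., 1., -1.])       # +1 on S0 = {0,2}, -1 on S1 = {1,3}
lam, U = np.linalg.eig(A)                 # distinct eigenvalues (4th roots of 1)
P = np.linalg.inv(U) @ np.diag(beta) @ U  # aliasing pattern P = U^{-1} Omega U
print(np.round(np.abs(P), 3))             # one unit-magnitude entry per row/column
```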
To ensure that polynomial filters can be designed independently of the graph (i.e., without knowing its size or spectrum), the frequencies $\lambda$ and $\mu$ associated with $u$ and $\Omega u$ must be related in a simple fashion. Specifically, we must have $\mu = p(\lambda)$, where $p(\lambda)$ is a continuous function. Continuity of $p(\lambda)$ is required since piecewise functions cannot be expressed as polynomials, and hence cannot be implemented as polynomial filters in the vertex domain. Moreover, since $\mu = p(\lambda)$ and $\lambda = p(\mu)$, $p$ must be involutory (i.e., $p(p(\lambda)) = \lambda$) on $[-1, 1]$. Using this relationship in the aliasing term of (4.16), we conclude that the following condition must be satisfied to eliminate all aliasing:

$g_0(\lambda)\, h_0(p(\lambda)) - g_1(\lambda)\, h_1(p(\lambda)) = 0.$  (4.18)

Note that (4.18) is satisfied if $g_0(\lambda) = h_1(p(\lambda))$ and $g_1(\lambda) = h_0(p(\lambda))$. Further, in order for $g_0(\lambda)$ and $g_1(\lambda)$ to be polynomials, the only choice of $p(\lambda)$ is $p(\lambda) = c - \lambda$ (higher-order polynomials are not involutory on $[-1,1]$). Note that $\mathrm{Tr}(A) = \sum_{i=1}^{N}\lambda_i = 0$; but since the modulated eigenpairs enumerate the same spectrum, we also have $\mathrm{Tr}(A) = \sum_{i=1}^{N}(c - \lambda_i) = Nc$, which forces $c = 0$. Therefore, $c = 0$ is the only choice satisfying all our design criteria. This means that both $(\lambda, u)$ and $(-\lambda, \Omega u)$ are eigenpairs of $A$, which can be true if and only if the graph is bipartite [24]. We summarize our analysis in the following observation:

Observation 4.2. Two-channel perfect reconstruction filterbanks satisfying the following design criteria:
1. polynomial filters,
2. disjoint sampling sets,
3. independence from the graph size $N$ (i.e., constraints expressible as a function of $\lambda$),
4. a minimum number of anti-aliasing constraint equations (equal to two),
can be designed if and only if the graph is bipartite.

The criteria in the observation above reduce the problem to the design of four polynomials $h_0(\lambda), h_1(\lambda), g_0(\lambda), g_1(\lambda)$ in the spectral domain satisfying two concise constraint equations. Although this significantly simplifies the design process, we must remark that these criteria can be somewhat restrictive, as the design is viable only on one candidate class of graphs, namely bipartite graphs. We cannot rule out the possibility that filterbanks can be realized on other kinds of graphs if one or more of these criteria are relaxed. For example, an alternative sampling approach has recently been proposed in [67], where spectral folding is embedded in the sampling strategy itself through spectral domain sampling. Such a scheme allows the design of filterbanks on any graph, at the expense of increased complexity in the sampling step. In the next section, we relax the perfect reconstruction requirement and explore critically-sampled polynomial designs on arbitrary graphs.

4.4 Critical sampling for filterbanks on arbitrary graphs

4.4.1 Approximately optimal sampling scheme

For a critically sampled design, we must choose $\mathcal{S}_0$ and $\mathcal{S}_1$ such that $|\mathcal{S}_0| + |\mathcal{S}_1| = N$ and the filterbank is as close to perfect reconstruction as possible. One way to achieve this is by minimizing the deviation of the overall transfer function from identity in the Frobenius norm, i.e., $\|G_0 S_0^T S_0 H_0 + G_1 S_1^T S_1 H_1 - I\|_F^2$, which is in fact an upper bound on the squared relative error for all signals on the graph. Further, in our design we assume that filters $H_0, H_1, G_0, G_1$ have already been designed to satisfy $G_0 H_0 + G_1 H_1 = 2I$, which is the overcomplete (undecimated) case. The filters can be designed, for example, using the methods of [45, 46].
In order to minimize the reconstruction error over the choice of sampling sets $\mathcal{S}_0$ and $\mathcal{S}_1$, we first introduce a concatenated setting (of $2N$ dimensions) by defining

$H = \begin{bmatrix} H_0 \\ H_1 \end{bmatrix} \in \mathbb{R}^{2N \times N}, \quad G = \begin{bmatrix} G_0 \\ G_1 \end{bmatrix} \in \mathbb{R}^{2N \times N}, \quad y = \begin{bmatrix} y_0 \\ y_1 \end{bmatrix} \in \mathbb{R}^{|\mathcal{S}_0|+|\mathcal{S}_1|}, \quad S = \begin{bmatrix} S_0 & 0 \\ 0 & S_1 \end{bmatrix} \in \{0,1\}^{(|\mathcal{S}_0|+|\mathcal{S}_1|) \times 2N}.$  (4.19)

Note that the concatenated downsampling operator $S$ can be obtained by sampling rows of the $2N$-dimensional identity corresponding to indices in a concatenated sampling set $\mathcal{S} \subset \{1, \dots, 2N\}$ that contains the sampled nodes of both channels, so that $|\mathcal{S}| = |\mathcal{S}_0| + |\mathcal{S}_1|$. Further, $\mathcal{S}^c = \{1,\dots,2N\} \setminus \mathcal{S}$, and $\mathcal{S}_0$ and $\mathcal{S}_1$ can be recovered from $\mathcal{S}$ as $\mathcal{S}_0 = \{v \mid v \in \mathcal{S},\, 1 \le v \le N\}$ and $\mathcal{S}_1 = \{v - N \mid v \in \mathcal{S},\, N+1 \le v \le 2N\}$. With these definitions, the transfer function of the system can be written as $G^T S^T S H$, and finding a critical sampling scheme requires solving

$\min_{\mathcal{S}:\, |\mathcal{S}| = N} \left\| G^T S^T S H - I \right\|_F^2.$  (4.20)

Since we choose the filters such that $G^T H = 2I$, we can rewrite the objective as

$\phi(\mathcal{S}) = \left\| G^T S^T S H - \tfrac{1}{2} G^T H \right\|_F^2 = \left\| \tfrac{1}{2}\sum_{i \in \mathcal{S}} g_i h_i^T - \tfrac{1}{2}\sum_{j \in \mathcal{S}^c} g_j h_j^T \right\|_F^2,$  (4.21)

where $g_i$ and $h_i$ denote the $i$-th columns of $G^T$ and $H^T$, respectively. In order to minimize $\phi(\mathcal{S})$, we propose a simple greedy procedure (Algorithm 4.1) that begins with an empty $\mathcal{S}$ and adds nodes one by one, each time choosing the node that minimizes $\phi(\mathcal{S})$.

Algorithm 4.1 Basic greedy minimization
Initialize: $\mathcal{S} = \emptyset$.
1: while $|\mathcal{S}| < N$ do
2:   $\mathcal{S} \leftarrow \mathcal{S} \cup \{u\}$, where $u = \mathrm{argmin}_{v \in \mathcal{S}^c}\, \phi(\mathcal{S} \cup \{v\})$.
3: end while

However, this algorithm requires $O(N^2)$ evaluations of the objective $\phi(\mathcal{S})$, which can be quite expensive, and explicitly storing the matrices $G$ and $H$ requires $O(N^2)$ space. We now show how one can implement the algorithm with $O(N)$ graph filtering operations and $O(N)$ space. Using (4.21), the change in the objective when a node $v \in \{1,\dots,2N\}$ is added to $\mathcal{S}$ is given by

$\phi(\mathcal{S} \cup \{v\}) = \left\| \tfrac{1}{2}\sum_{i\in\mathcal{S}} g_i h_i^T - \tfrac{1}{2}\sum_{j\in\mathcal{S}^c} g_j h_j^T + g_v h_v^T \right\|_F^2 = \phi(\mathcal{S}) + p_v(\mathcal{S}) + q_v,$  (4.22)

where we defined

$p_v(\mathcal{S}) = \mathrm{Tr}\left[ h_v g_v^T \left( \sum_{i\in\mathcal{S}} g_i h_i^T - \sum_{j\in\mathcal{S}^c} g_j h_j^T \right) \right],$  (4.23)
$q_v = \|g_v\|^2 \|h_v\|^2.$  (4.24)

Thus, we have

$\mathrm{argmin}_{v\in\mathcal{S}^c}\, \phi(\mathcal{S}\cup\{v\}) = \mathrm{argmin}_{v\in\mathcal{S}^c}\, \big( p_v(\mathcal{S}) + q_v \big).$  (4.25)

To compute $p_v(\mathcal{S})$ for each $\mathcal{S}$, we first note that

$p_v(\emptyset) = \mathrm{Tr}\left[ h_v g_v^T \left( -G^T H \right) \right] = -2\langle g_v, h_v \rangle.$  (4.26)

Further, for a node $u$, $p_v(\mathcal{S}\cup\{u\})$ can be computed as

$p_v(\mathcal{S}\cup\{u\}) = \mathrm{Tr}\left[ h_v g_v^T \left( \sum_{i\in\mathcal{S}} g_i h_i^T - \sum_{j\in\mathcal{S}^c} g_j h_j^T + 2 g_u h_u^T \right) \right] = p_v(\mathcal{S}) + 2\langle g_v, g_u\rangle \langle h_v, h_u\rangle.$  (4.27)

To make the notation compact, we introduce the vectors $p(\mathcal{S}), q \in \mathbb{R}^{2N}$ whose $v$-th elements are $p_v(\mathcal{S})$ and $q_v$. Using "$\circ$" to denote the element-wise (Hadamard) product, we have

$p(\mathcal{S}\cup\{u\}) = p(\mathcal{S}) + 2\, (G g_u) \circ (H h_u).$  (4.28)

We summarize the efficient method for choosing $\mathcal{S}$ in Algorithm 4.2.

Algorithm 4.2 Efficient algorithm for critical sampling
Require: Graph $G = \{\mathcal{V},\mathcal{E}\}$, concatenated filters $H$, $G$.
Initialize: $\mathcal{S} = \emptyset$; $p, q \in \mathbb{R}^{2N}$ with $p_v = -2\langle g_v, h_v\rangle$, $q_v = \|g_v\|^2\|h_v\|^2$.
1: while $|\mathcal{S}| < N$ do
2:   $\mathcal{S} \leftarrow \mathcal{S}\cup\{u\}$, where $u = \mathrm{argmin}_{v\in\mathcal{S}^c}\,(p_v + q_v)$.
3:   $p \leftarrow p + 2\,(G g_u) \circ (H h_u)$.
4: end while

Note that the algorithm is allowed to produce sampling sets $\mathcal{S}_0$ and $\mathcal{S}_1$ that are not disjoint. To avoid this, a simple heuristic is to constrain the algorithm in each iteration to disregard "images" (created in the concatenated setting) of already selected indices (e.g., selecting $i \le N$ rules out $i + N$ from the list of candidate indices in subsequent iterations).

Complexity: The vectors $h_v$ and $g_v$ can be computed with two filtering operations each, as $H^T \delta_v$ and $G^T \delta_v$ respectively, where $\delta_v$ is the graph delta signal on node $v$.
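For reference, below is a dense-matrix sketch of Algorithm 4.2; storing $H$ and $G$ explicitly is an assumption made here for brevity, whereas in a real implementation the vectors $g_u = G^T\delta_u$, $h_u = H^T\delta_u$ and the products $G g_u$, $H h_u$ would be computed on the fly with polynomial filtering, keeping the method matrix-free:

```python
import numpy as np

def greedy_critical_sampling(H, G):
    """Sketch of Algorithm 4.2. H, G: concatenated analysis/synthesis
    matrices (2N x N) assumed to satisfy G^T H = 2I. Returns N indices
    of the concatenated sampling set; images are blocked for disjointness."""
    twoN, N = H.shape
    p = -2.0 * np.sum(G * H, axis=1)                   # p_v(empty) = -2 <g_v, h_v>
    q = np.sum(G**2, axis=1) * np.sum(H**2, axis=1)    # q_v = ||g_v||^2 ||h_v||^2
    blocked = np.zeros(twoN, dtype=bool)
    S = []
    for _ in range(N):
        u = int(np.argmin(np.where(blocked, np.inf, p + q)))
        S.append(u)
        blocked[u] = True
        blocked[(u + N) % twoN] = True                 # disregard the image of u
        p += 2.0 * (G @ G[u]) * (H @ H[u])             # rank-one update (4.28)
    return sorted(S)
```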
In terms of time complexity, computing $p(\emptyset)$ and $q$ requires $4N$ one-time graph filtering operations in total, and each greedy iteration requires 8 filtering operations. Hence, Algorithm 4.2 requires $O(N)$ graph filtering operations. The space complexity of the algorithm is $O(N)$, since it is matrix-free: $A$ is the only matrix that needs to be stored.

4.4.2 Theoretical guarantees

We now show that it is possible to obtain some theoretical insight into the performance of (a randomized variant of) our greedy algorithm when $G = H$. Note that for $\mathcal{S}_1 \subseteq \mathcal{S}_2$ and $v \notin \mathcal{S}_2$,

$p_v(\mathcal{S}_1) = \sum_{i\in\mathcal{S}_1} \langle h_v, h_i\rangle^2 - \sum_{j\in\mathcal{S}_1^c} \langle h_v, h_j\rangle^2 \;\le\; \sum_{i\in\mathcal{S}_2} \langle h_v, h_i\rangle^2 - \sum_{j\in\mathcal{S}_2^c} \langle h_v, h_j\rangle^2 = p_v(\mathcal{S}_2).$  (4.29)

Using this in (4.22), we obtain

$\phi(\mathcal{S}_1 \cup \{v\}) - \phi(\mathcal{S}_1) \le \phi(\mathcal{S}_2 \cup \{v\}) - \phi(\mathcal{S}_2),$  (4.30)

which implies that $\phi(\mathcal{S})$ is supermodular in $\mathcal{S}$. Therefore, the function $\psi(\mathcal{S}) = \phi(\emptyset) - \phi(\mathcal{S})$ is submodular, non-monotone and normalized ($\psi(\emptyset) = 0$). As a result, the set $\mathcal{S}^*$ obtained by greedy maximization of $\psi(\mathcal{S})$ (equivalently, greedy minimization of $\phi(\mathcal{S})$) with a randomized version of Algorithm 4.1, which selects one node uniformly at random from the best $N$ candidates at each iteration, is at least a 0.3-approximation of the optimal set $\mathcal{S}_{OPT}$ [17]. To be precise, we have the following guarantees for the $\mathcal{S}^*$ obtained from the randomized greedy algorithm (using $\phi(\emptyset) = \|I\|_F^2 = N$):

$\psi(\mathcal{S}_{OPT}) \ge \psi(\mathcal{S}^*) \ge 0.3\,\psi(\mathcal{S}_{OPT})$  (4.31)
$\Rightarrow\; \phi(\mathcal{S}_{OPT}) \le \phi(\mathcal{S}^*) \le 0.3\,\phi(\mathcal{S}_{OPT}) + 0.7N.$  (4.32)

Although guarantees for the deterministic version of the greedy algorithm are part of ongoing research, we observe empirically that its performance is competitive. Note that for the biorthogonal design, when $G \ne H$, $\phi(\mathcal{S})$ is no longer supermodular, hence we cannot state guarantees on the performance of the greedy algorithm; experiments show, however, that the algorithm performs well in this case too.

4.4.3 Multi-channel extension

To extend our formulation to $M$-channel filterbanks with analysis/synthesis filter pairs $\{H_k, G_k\}_{k=0,\dots,M-1}$ and sampling sets $\{\mathcal{S}_k\}_{k=0,\dots,M-1}$ ($\mathcal{S}_k \subset \{1,\dots,N\}$), one can create the concatenated filters $H, G \in \mathbb{R}^{MN\times N}$ and the concatenated sampling set $\mathcal{S} \subset \{1,\dots,MN\}$ in a manner similar to the two-channel case:

$H = \begin{bmatrix} H_0 \\ H_1 \\ \vdots \\ H_{M-1} \end{bmatrix} \in \mathbb{R}^{MN\times N}, \quad G = \begin{bmatrix} G_0 \\ G_1 \\ \vdots \\ G_{M-1} \end{bmatrix} \in \mathbb{R}^{MN\times N}, \quad y = \begin{bmatrix} y_0 \\ y_1 \\ \vdots \\ y_{M-1} \end{bmatrix} \in \mathbb{R}^{\sum_k |\mathcal{S}_k|}, \quad S = \begin{bmatrix} S_0 & 0 & \dots & 0 \\ 0 & S_1 & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & S_{M-1} \end{bmatrix} \in \{0,1\}^{(\sum_k |\mathcal{S}_k|)\times MN}.$  (4.33)

Each $\mathcal{S}_k$ can then be recovered from $\mathcal{S}$ as $\mathcal{S}_k = \{v - kN \mid v\in\mathcal{S},\, kN+1 \le v \le kN+N\}$. Further, in this case we require predesigned filters such that $G^T H = MI$, resulting in the objective

$\phi(\mathcal{S}) = \left\| \left(1 - \tfrac{1}{M}\right)\sum_{i\in\mathcal{S}} g_i h_i^T - \tfrac{1}{M}\sum_{j\in\mathcal{S}^c} g_j h_j^T \right\|_F^2,$  (4.34)

where $g_i = G^T\delta_i$ and $h_i = H^T\delta_i$. The objective can be optimized under the constraint $|\mathcal{S}| = N$ using the same technique as in the two-channel case, by computing the change with respect to incremental node additions. In this case,

$\phi(\mathcal{S}\cup\{v\}) = \left\| \left(1-\tfrac{1}{M}\right)\sum_{i\in\mathcal{S}} g_i h_i^T - \tfrac{1}{M}\sum_{j\in\mathcal{S}^c} g_j h_j^T + g_v h_v^T \right\|_F^2 = \phi(\mathcal{S}) + p_v(\mathcal{S}) + q_v,$  (4.35)

where we defined

$p_v(\mathcal{S}) = 2\,\mathrm{Tr}\left[ h_v g_v^T \left( \left(1-\tfrac{1}{M}\right)\sum_{i\in\mathcal{S}} g_i h_i^T - \tfrac{1}{M}\sum_{j\in\mathcal{S}^c} g_j h_j^T \right) \right],$  (4.36)
$q_v = \|g_v\|^2\|h_v\|^2.$  (4.37)

To compute $p_v(\mathcal{S})$ for each $\mathcal{S}$, we note that

$p_v(\emptyset) = 2\,\mathrm{Tr}\left[ h_v g_v^T \left( -\tfrac{1}{M} G^T H \right) \right] = -2\langle g_v, h_v\rangle,$  (4.38)

where we used $G^T H = MI$. Further, for a node $u$, the update $p_v(\mathcal{S}\cup\{u\})$ can be computed as

$p_v(\mathcal{S}\cup\{u\}) = 2\,\mathrm{Tr}\left[ h_v g_v^T \left( \left(1-\tfrac{1}{M}\right)\sum_{i\in\mathcal{S}} g_i h_i^T - \tfrac{1}{M}\sum_{j\in\mathcal{S}^c} g_j h_j^T + g_u h_u^T \right) \right] = p_v(\mathcal{S}) + 2\langle g_v, g_u\rangle\langle h_v, h_u\rangle.$  (4.39)
Note that the computations required to initialize $q_v$ and update $p_v(\mathcal{S})$ are identical to those of the two-channel case. Thus, one can use Algorithm 4.2 to obtain the approximately optimal sampling set with one change: the vectors $p(\mathcal{S}), q$ are now $MN$-dimensional, with $v$-th elements $p_v(\mathcal{S})$ and $q_v$. To force the algorithm to produce a disjoint sampling scheme, one can once again disregard images of selected indices (e.g., the choice of index $i \le N$ rules out indices $i + kN$, $k = 1,\dots,M-1$, in subsequent iterations). In terms of complexity, computing $p(\emptyset)$ and $q$ requires $2MN$ one-time graph filtering operations, and each greedy iteration requires $4M$ filtering operations. Therefore, similar to the two-channel case, computing an approximately optimal sampling scheme for $M$-channel filterbanks requires $O(N)$ graph filtering operations, with a space complexity of $O(N)$. Note that the theoretical guarantees discussed in Section 4.4.2 hold in this case as well.

4.4.4 Experiments

In this section, we present simple experiments to demonstrate the effectiveness of our critical sampling scheme for two-channel filterbanks (code is available at https://github.com/aamiranis/cs_fb_arbitrary). In our first experiment, we test its performance on three simple bipartite graphs (Figure 4.3), with filters obtained using the Graph-QMF design [45] (which approximates the Meyer kernel with a polynomial filter of chosen length 4) and GraphBior(6,6) [46]. We observe that the output of Algorithm 4.2 in most cases matches exactly the optimal sampling scheme for bipartite graphs, which is to downsample the filtered signal in each channel on either partition. However, due to the greedy nature of the algorithm, we occasionally observe that the obtained sampling scheme differs slightly from the optimal one, as seen in Figure 4.3f for the biorthogonal filters. Note that this sampling scheme is not disjoint; we heuristically force the algorithm to obtain disjoint sets in Figure 4.3g. The sampling pattern obtained in this case is nearly perfect, with one wrongly assigned pair.

Figure 4.3: Sampling schemes obtained using Algorithm 4.2 for bipartite graphs with length-4 Graph-QMF filters ((a)-(c)) and GraphBior(6,6) filters ((d)-(g)). Red and blue colors indicate nodes in the lowpass and highpass channels. The sets are heuristically forced to be disjoint in (g).

For our second experiment, we design a critically-sampled two-channel filterbank on the Minnesota road network graph using two configurations of analysis/synthesis filters: (i) Graph-QMF [45] with 8-degree polynomial approximations of the Meyer kernel, and (ii) GraphBior(6,6) [46]. The sampling scheme obtained for each configuration is plotted in Figures 4.4a and 4.4b. We observe that the sampling pattern for each channel colors nodes in a predominantly alternating fashion, indicating a propensity towards bipartition. The response of the filterbank after determining the sampling set is plotted in Figures 4.4c and 4.4d for unit-magnitude delta functions in the spectral domain; we observe that it is close to 1 for all frequencies. Since the transfer function is not diagonalizable in the GFT basis $U$, there is an associated aliasing effect in the filterbank. We characterize this by plotting the maximum aliasing coefficient (in magnitude) for each frequency component in Figures 4.4e and 4.4f.
We also compare the reconstruction performance (in terms of the ratio of the energies of the error signal and the original) of our method against one instance of a randomly selected sampling scheme, and against a spectral approximation of MaxCut, for 1000 random signals. The average squared relative errors along with the standard deviations are listed in Table 4.1. Observe that our method has superior performance.

Figure 4.4: Performance of our critical sampling scheme (Algorithm 4.2) on the Minnesota road network graph. (a), (c) and (e) show the sampling scheme obtained, the spectral response $|T(\lambda)|$, and the maximum aliasing component $\max_{\mu\ne\lambda}|T(\mu)|$ for the Graph-QMF design; (b), (d) and (f) illustrate the corresponding results for GraphBior(6,6).

Table 4.1: Reconstruction error results for random signals on the Minnesota road graph.

Method   | Graph-QMF (poly 8) | GraphBior(6,6)
Random   | 0.4842 ± 0.0113    | 0.4629 ± 0.0108
MaxCut   | 0.1125 ± 0.0069    | 0.0972 ± 0.0061
Proposed | 0.0779 ± 0.0049    | 0.0664 ± 0.0045

In the final experiment, we compare the sampling scheme obtained from our method against random sampling schemes and against the optimal sampling scheme that minimizes the objective, $\mathcal{S}_{opt} = \mathrm{argmin}_{\mathcal{S}}\, \phi(\mathcal{S})$. Since determining the optimal scheme requires an exhaustive search, we limit ourselves to small graphs. Specifically, we test the methods on a ring graph with $N = 10$ nodes that also contains cross-links added with probability $p$. Note that when $p = 0$ such a graph is bipartite, and increasing $p$ results in a deviation from the bipartite property. We also compare the mean error for reconstructing 100 random signals in each case. For the random sampling scheme, the mean is computed over all combinations of 100 randomly generated sets and 100 random signals. The results are shown in Table 4.2.

Table 4.2: Comparison of our method against an optimal sampling scheme obtained through exhaustive search, for ring graphs of size N = 10 with randomly added cross-links (probability p). Reconstruction error is averaged over 100 signals, and also over 100 sampling schemes in the random sampling case.

Graph   | Random        | Proposed      | Optimal
p = 0.1 | 0.410 ± 0.168 | 0.100 ± 0.098 | 0.100 ± 0.098
p = 0.2 | 0.375 ± 0.142 | 0.199 ± 0.093 | 0.190 ± 0.086
p = 0.3 | 0.375 ± 0.133 | 0.204 ± 0.091 | 0.205 ± 0.098
p = 0.4 | 0.364 ± 0.131 | 0.223 ± 0.102 | 0.183 ± 0.080

We make two significant observations:
1. Despite using a greedy approximation for efficiency, our method's performance is comparable to that of the optimal sampling scheme.
2. As we deviate from the bipartite property, the structure of the graph limits the performance of the filterbank significantly.

The results of this experiment, along with the analyses in Sections 4.3.1 and 4.3.2, suggest that there is little wiggle room for designing filterbanks on arbitrary graphs in terms of reconstruction error. Specifically, one needs to exploit special structures in the graph, or special sampling strategies (such as the one suggested in [67]), in order to achieve perfect reconstruction. In the next section, we explore a class of graphs whose special structure makes them suitable for the design of perfect reconstruction M-channel filterbanks.
4.5 Filterbanks on block-cyclic graphs

We saw in Section 4.3.1 that the special structure of bipartite graphs greatly simplifies the design of two-channel filterbanks with low-degree polynomial filters. In this section, we explore the design of filterbanks on another such class of graphs, called $M$-block cyclic graphs [69]. These graphs are directed, with $M$ components $\mathcal{S}_0, \mathcal{S}_1, \dots, \mathcal{S}_{M-1}$ connected in a cyclic fashion and no edges within each partition (for $M = 2$, this is a directed bipartite graph). The adjacency matrix of such a graph has the block-cyclic structure

$A = \begin{bmatrix} 0 & A_1 & 0 & \dots & 0 & 0 \\ 0 & 0 & A_2 & \dots & 0 & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & \dots & 0 & A_{M-1} \\ A_0 & 0 & 0 & \dots & 0 & 0 \end{bmatrix},$  (4.40)

where each $A_j$ has arbitrary but appropriate size so that $A$ is square.

Figure 4.5: A 3-block cyclic graph.

The adjacency matrix of an $M$-block cyclic graph has a special eigenstructure suitable for designing filterbanks, as described by the following theorem from [69]:

Theorem 4.1. For any $M$-block cyclic graph, if $(\lambda, v)$ is an eigenpair of its adjacency matrix $A$, then $(\omega_M \lambda, \Omega_M v),\, (\omega_M^2 \lambda, \Omega_M^2 v),\, \dots,\, (\omega_M^{M-1}\lambda, \Omega_M^{M-1} v)$ are also eigenpairs of $A$, where $\omega_M = e^{-i 2\pi/M}$, $\Omega_M = \mathrm{diag}(\beta)$, and

$\beta(v) = \omega_M^k \quad \text{for } v \in \mathcal{S}_k, \quad k = 0, 1, \dots, M-1.$  (4.41)

The special form of the eigenstructure of $M$-block cyclic graphs, i.e., the existence of $M$ modulated copies of the eigenvalues and eigenvectors, naturally facilitates the design of $M$-channel filterbanks. Specifically, modulating an eigenvector with $\Omega_M$ produces another eigenvector whose eigenvalue is a modulated version of the original eigenvalue. In other words, modulating a signal results in a structured remapping (or folding) of its Fourier coefficients in the spectral domain, akin to the well-known spectral folding phenomenon in traditional sampling theory. Using the fact that downsampling-upsampling operations can be expressed as weighted sums of modulated versions of the original signal, one can succinctly express the perfect reconstruction conditions for an $M$-channel filterbank on $M$-block cyclic graphs in the spectral domain, as illustrated next.

4.5.1 Perfect reconstruction conditions for M-channel filterbanks on M-block cyclic graphs

Let us assume for now that the adjacency $A$ of the $M$-block cyclic graph is diagonalizable and has distinct eigenvalues (with simple modifications, the analysis extends to repeated eigenvalues; the frequency interpretation of the non-diagonalizable case is more complex, since one might need to work with the Jordan form [64], and we leave this aspect for future work). Therefore, we have $A = U\Lambda U^{-1}$. The transfer function of an $M$-channel filterbank is given by

$T = \sum_{k=0}^{M-1} g_k(A)\, S_k^T S_k\, h_k(A),$  (4.42)

where $S_k \in \{0,1\}^{|\mathcal{S}_k|\times N}$ is the downsampling operator for block $\mathcal{S}_k$ and $S_k^T$ is the corresponding upsampling operator. $\{h_k(A)\}_{k=0,\dots,M-1}$ and $\{g_k(A)\}_{k=0,\dots,M-1}$ are the sets of analysis and synthesis filters, respectively; these are polynomial functions of the adjacency matrix (it is easier to work with the adjacency here, since $M$-block cyclic graphs are directed). The transfer function can be expressed in the spectral domain as $\tilde{T} = U^{-1} T U$:

$\tilde{T} = \sum_{k=0}^{M-1} g_k(\Lambda)\, U^{-1} S_k^T S_k U\, h_k(\Lambda).$  (4.43)

Note that, using properties of the discrete Fourier basis, the downsampling-upsampling operation in each channel can be expressed as

$S_k^T S_k = \frac{1}{M}\sum_{j=0}^{M-1} \omega_M^{-jk}\, \Omega_M^j.$  (4.44)

Substituting this into the transfer function, we get

$\tilde{T} = \sum_{k=0}^{M-1} g_k(\Lambda)\, U^{-1}\left( \frac{1}{M}\sum_{j=0}^{M-1} \omega_M^{-jk}\Omega_M^j \right) U\, h_k(\Lambda)$  (4.45)

$= \frac{1}{M}\sum_{k=0}^{M-1} g_k(\Lambda) h_k(\Lambda) + \frac{1}{M}\underbrace{\sum_{j=1}^{M-1}\sum_{k=0}^{M-1} \omega_M^{-jk}\, g_k(\Lambda)\, U^{-1}\Omega_M^j U\, h_k(\Lambda)}_{\text{aliasing components}}.$  (4.46)
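Before simplifying further, note that Theorem 4.1 itself is easy to verify numerically. The following hypothetical sketch (equal block sizes and random weights are arbitrary choices) builds a 4-block cyclic adjacency and checks that modulating an eigenvector by $\Omega_M$ rotates its eigenvalue by $\omega_M$:

```python
import numpy as np

rng = np.random.default_rng(1)
M, b = 4, 3                                    # M blocks of b nodes each
N = M * b
A = np.zeros((N, N))
for k in range(M):                             # block k feeds block k+1 (mod M)
    A[k*b:(k+1)*b, ((k+1) % M)*b:((k+1) % M + 1)*b] = rng.random((b, b))

omega = np.exp(-2j * np.pi / M)
beta = np.concatenate([omega**k * np.ones(b) for k in range(M)])  # diag of Omega_M
lam, V = np.linalg.eig(A)
v = V[:, 0]
# Theorem 4.1: (omega * lam, Omega_M v) should also be an eigenpair of A
print(np.allclose(A @ (beta * v), omega * lam[0] * (beta * v)))
```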
Using Theorem 4.1 in the aliasing terms, we obtain the following set of perfect reconstruction constraints on the polynomials $\{g_k(\lambda), h_k(\lambda)\}$ for $|\lambda| \le 1$ (assuming $A$ is appropriately normalized):

$\sum_{k=0}^{M-1} g_k(\lambda)\, h_k(\lambda) = M,$  (4.47)

$\sum_{k=0}^{M-1} \omega_M^{-jk}\, g_k(\lambda)\, h_k(\omega_M^{-j}\lambda) = 0, \quad \text{for all } 1 \le j \le M-1.$  (4.48)

Similar perfect reconstruction conditions in the spectral domain have been proposed in [70], where the filtered signals of all channels are downsampled on the same subset of nodes (i.e., $\mathcal{S}_0$), leading to the absence of the summation weights $\omega_M^{-jk}$ in (4.48). Sampling each channel on a different subset of nodes is more amenable to the polyphase implementation carried out in [68] for the two-channel case.

The analyses in this subsection and in [70] do not provide insight into possible solutions of the perfect reconstruction conditions. The problem can be particularly challenging in the $M$-channel case, since one needs to design $M$ pairs of filters satisfying the constraint equations (4.47) and (4.48). Further, since the spectrum of $M$-block cyclic graphs covers the entire complex unit disc, and not just the complex unit circle (as in classical DSP), the choice of sub-bands for these filterbanks also remains unclear. To gain some insight into these questions, we consider the problem in a simpler setting in the following subsection. Our solution works for $M$-block cyclic graphs when $M$ is a power of 2. Specifically, when $M = 2^L$, we propose an $L$-stage hierarchical tree-structured design for the filterbank that naturally extends the two-channel design for bipartite graphs. Our design also suggests a sub-band decomposition of the frequency domain (enclosed in the unit disc), which in turn provides a frequency interpretation for the $M > 2$ case.

4.5.2 Tree-structured filterbank design

When $M = 2^L$, we propose a hierarchical tree-structured filterbank design consisting of $L$ stages, which can be succinctly explained in the following two points:

• In each stage, we group even-numbered and odd-numbered blocks together into two partitions and treat the original graph as a directed bipartite graph over which a 2-channel filterbank can be implemented. Existing solutions meant for undirected bipartite graphs in [45, 46, 68] are extended to this directed setting via simple modifications.

• We create a new subgraph for each channel that is closely related to the original graph in the spectral domain. The 2-channel design from the previous stage is replicated on these subgraphs, leading to the hierarchical design.

This approach simplifies the design of the filterbank significantly, since the problem reduces to designing two prototype half-band filters (in the biorthogonal case) or one prototype half-band filter (in the orthogonal case), which can be reused in every stage. In contrast, for the general $M$-channel design, one would have to design $M$ or $2M$ filters for the orthogonal and biorthogonal cases respectively, with appropriate responses for each sub-band.

Filterbank architecture

By definition, it is easy to see that block cyclic graphs with $M = 2^L$ blocks are directed bipartite, where the two partitions are given by the sets of even and odd blocks, respectively.
This is easily seen, for example, for $M = 4$, where one can observe the bipartite structure by applying a block permutation to $A$ that groups even and odd blocks (also illustrated graphically in Figure 4.6):

$\begin{bmatrix} 0 & A_1 & 0 & 0 \\ 0 & 0 & A_2 & 0 \\ 0 & 0 & 0 & A_3 \\ A_0 & 0 & 0 & 0 \end{bmatrix} \longrightarrow \begin{bmatrix} 0 & 0 & A_1 & 0 \\ 0 & 0 & 0 & A_3 \\ 0 & A_2 & 0 & 0 \\ A_0 & 0 & 0 & 0 \end{bmatrix}.$

Figure 4.6: A 4-block cyclic graph visualized as a directed bipartite graph by grouping even and odd blocks.

To make our notation concise, let us denote the set of all even blocks by $\mathcal{S}_e$ and the set of all odd blocks by $\mathcal{S}_o$. We can therefore design a two-channel filterbank for these graphs, where the filtered signal in one channel is sampled on the even set $\mathcal{S}_e$, and on the odd set $\mathcal{S}_o$ in the other channel. After downsampling, we create new subgraphs for each channel with adjacencies $A_e$ and $A_o$, based on the idea proposed in [68]:

$A_e = S_e A^2 S_e^T, \quad A_o = S_o A^2 S_o^T.$  (4.49)

We now highlight some useful properties of these subgraphs that are crucial to our design.

Lemma 4.1. If $A$ is the adjacency matrix of an $M$-block cyclic graph, then $A_e$ and $A_o$ are adjacency matrices of $M/2$-block cyclic graphs.

Proof. Using the structure of $A$ in (4.40), we have

$A^2 = \begin{bmatrix} 0 & 0 & A_1 A_2 & \dots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \dots & A_{M-2}A_{M-1} \\ A_{M-1}A_0 & 0 & 0 & \dots & 0 \\ 0 & A_0 A_1 & 0 & \dots & 0 \end{bmatrix}.$  (4.50)

Using the definitions of $S_e$ and $S_o$, which sample the even and odd block indices respectively, we get

$A_e = S_e A^2 S_e^T = \begin{bmatrix} 0 & A_1 A_2 & 0 & \dots & 0 \\ 0 & 0 & A_3 A_4 & \dots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \dots & A_{M-3}A_{M-2} \\ A_{M-1}A_0 & 0 & 0 & \dots & 0 \end{bmatrix},$  (4.51)

$A_o = S_o A^2 S_o^T = \begin{bmatrix} 0 & A_2 A_3 & 0 & \dots & 0 \\ 0 & 0 & A_4 A_5 & \dots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \dots & A_{M-2}A_{M-1} \\ A_0 A_1 & 0 & 0 & \dots & 0 \end{bmatrix}.$  (4.52)

Clearly, $A_e$ and $A_o$ are adjacency matrices of $M/2$-block cyclic graphs.

Lemma 4.1 allows us to reuse the same two-channel design on the new subgraphs in the next stage. This reduces the design complexity, since we are required to design only one perfect reconstruction two-channel filterbank for a directed bipartite graph. By hierarchically implementing this two-channel design on a $2^L$-block cyclic graph, we end up with a perfect reconstruction tree-structured filterbank of depth $L$ that effectively has $2^L$ channels. Note that in every stage we keep splitting the blocks into even and odd in order to sample the two channels, until the last stage, where we are left with only one block per channel. An example of a tree-structured filterbank with two stages (i.e., up to depth $L = 2$) is shown in Figure 4.7.

Figure 4.7: A tree-structured filterbank for block cyclic graphs consisting of two stages.

Since every stage involves splitting the even blocks from the odd blocks, the labels of the blocks in the last stage with respect to the original parent graph are given by the bit-reversal permutation sequence (similar to decimation-in-frequency in the radix-2 FFT algorithm). For example, after two stages of decimation in a 4-block cyclic graph, the sampling sets of the channels each comprise one of the four blocks: $\mathcal{S}_{ee} = \mathcal{S}_0$, $\mathcal{S}_{eo} = \mathcal{S}_2$, $\mathcal{S}_{oe} = \mathcal{S}_1$, and $\mathcal{S}_{oo} = \mathcal{S}_3$.
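The channel subgraphs of (4.49) are two lines of code in practice; the following hypothetical sketch (function name and the equal-block-size assumption are ours) composes directly with the block-cyclic construction shown earlier, and the returned matrices exhibit the $M/2$-block cyclic structure of Lemma 4.1:

```python
import numpy as np

def even_odd_subgraphs(A, M, b):
    """A_e = S_e A^2 S_e^T and A_o = S_o A^2 S_o^T (eq. 4.49) for an
    M-block cyclic adjacency A with M blocks of b nodes each. By
    Lemma 4.1 both results are (M/2)-block cyclic adjacencies."""
    idx = np.arange(M * b)
    even = idx[(idx // b) % 2 == 0]    # nodes in even-numbered blocks
    odd = idx[(idx // b) % 2 == 1]     # nodes in odd-numbered blocks
    A2 = A @ A
    return A2[np.ix_(even, even)], A2[np.ix_(odd, odd)]
```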
The next lemma gives a relation between the spectra of the two new subgraphs and the parent graph:

Lemma 4.2. If $(\lambda, v)$ is an eigenpair of $A$, then $(\lambda^2, S_e v)$ is an eigenpair of $A_e$ and $(\lambda^2, S_o v)$ is an eigenpair of $A_o$.

Proof. Since $A$ is bipartite with partitions $\mathcal{S}_e$ and $\mathcal{S}_o$, the following relations hold by definition:

$S_e A S_e^T = S_o A S_o^T = 0, \quad S_e A^2 S_o^T = S_o A^2 S_e^T = 0.$  (4.53)

Let $(\lambda, v)$ be an eigenpair of $A$; then $A^2 v = \lambda^2 v$. Further, let us define the permutation matrix $P = \begin{bmatrix} S_e \\ S_o \end{bmatrix}$. Note that $P^T P = P P^T = I$, and therefore

$P A^2 (P^T P) v = \lambda^2 P v$
$\Rightarrow \begin{bmatrix} S_e A^2 S_e^T & S_e A^2 S_o^T \\ S_o A^2 S_e^T & S_o A^2 S_o^T \end{bmatrix}\begin{bmatrix} S_e v \\ S_o v \end{bmatrix} = \lambda^2 \begin{bmatrix} S_e v \\ S_o v \end{bmatrix}$
$\Rightarrow \begin{bmatrix} A_e & 0 \\ 0 & A_o \end{bmatrix}\begin{bmatrix} S_e v \\ S_o v \end{bmatrix} = \lambda^2 \begin{bmatrix} S_e v \\ S_o v \end{bmatrix} \quad \text{(using (4.53))}$
$\Rightarrow A_e (S_e v) = \lambda^2 (S_e v), \quad A_o (S_o v) = \lambda^2 (S_o v),$

which completes our proof.

Lemma 4.2 has a significant implication: the eigenvectors of $A_e$ and $A_o$ are obtained by downsampling the eigenvectors of $A$. Intuitively, this relationship is quite important for multi-resolution analysis, since it allows us to "zoom in" on a particular portion of the GFT matrix of the parent graph in order to attain finer resolution.

Although the definition of these subgraphs may seem arbitrary at first sight, there are several reasons for working with this specific choice. Lemmas 4.1 and 4.2 show that these subgraphs not only preserve the block-cyclic structure in the vertex domain, but also maintain a close association with the parent graph in the spectral domain. Another justification is that these definitions arise naturally in the polyphase analysis of two-channel filterbanks over bipartite graphs [68]. However, the most important reason, in our opinion, is that these definitions allow us to apply the noble identities stated in [69] in order to provide a spectral interpretation of the filterbank, as explained next.

Figure 4.8: Two-stage tree-structured filterbank simplified using the noble identities for bipartite graphs.

Filterbank spectral interpretation

We now analyze the spectral behavior of the proposed tree-structured design. The key idea is to use the noble identities [69] for the filters in the second stage. Specifically, for any polynomial filter $p(A)$ on a bipartite graph with $\mathcal{S}_e$ and $\mathcal{S}_o$ denoting its even and odd partitions, one has

$p(A_e)\, S_e = p(S_e A^2 S_e^T)\, S_e = S_e\, p(A^2),$  (4.54)
$p(A_o)\, S_o = p(S_o A^2 S_o^T)\, S_o = S_o\, p(A^2),$  (4.55)
$S_e^T\, p(A_e) = S_e^T\, p(S_e A^2 S_e^T) = p(A^2)\, S_e^T,$  (4.56)
$S_o^T\, p(A_o) = S_o^T\, p(S_o A^2 S_o^T) = p(A^2)\, S_o^T.$  (4.57)

Equations (4.54) and (4.55) are the first noble identities, whereas (4.56) and (4.57) are the second noble identities for bipartite graphs. Note how these identities serve as a justification for the choice of subgraphs in each channel. We reduce the depth of the filterbank by using these identities on the filters of the second stage, obtaining a single-stage filterbank with filters as indicated in Figure 4.8. This simplification extends easily to a tree-structured design of depth $L$ by applying the noble identities in a cascaded fashion, starting from the last stage and moving backwards until only one stage remains. The resultant analysis and synthesis filters of any particular channel can then be written as

$h_k(A) = h_{i_1}(A)\, h_{i_2}(A^2)\, h_{i_3}(A^4)\cdots h_{i_L}(A^{2^{L-1}}) = \prod_{l=1}^{L} h_{i_l}(A^{2^{l-1}}), \quad i_l \in \{0,1\},$  (4.58)

$g_k(A) = g_{i_1}(A)\, g_{i_2}(A^2)\, g_{i_3}(A^4)\cdots g_{i_L}(A^{2^{L-1}}) = \prod_{l=1}^{L} g_{i_l}(A^{2^{l-1}}), \quad i_l \in \{0,1\}.$  (4.59)
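The first noble identity (4.54), which underpins this depth reduction, can be checked numerically. A hypothetical sketch on an 8-node, 4-block cyclic graph with an arbitrary sample polynomial $p$ (all sizes and coefficients are our choices):

```python
import numpy as np

rng = np.random.default_rng(2)
M, b = 4, 2
N = M * b
A = np.zeros((N, N))
for k in range(M):
    A[k*b:(k+1)*b, ((k+1) % M)*b:((k+1) % M + 1)*b] = rng.random((b, b))

even = np.concatenate([np.arange(k*b, (k+1)*b) for k in (0, 2)])
Se = np.eye(N)[even]                         # downsampling to even blocks
A2 = A @ A
Ae = Se @ A2 @ Se.T                          # channel subgraph (4.49)
p = lambda X: 0.1*np.eye(len(X)) + 0.2*X + 0.5*(X @ X)   # sample polynomial
print(np.allclose(p(Ae) @ Se, Se @ p(A2)))   # True: first noble identity (4.54)
```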
It is interesting to observe the behavior of a tree-structured filterbank constructed with ideal filters. For a two-channel filterbank on a directed bipartite graph, the ideal lowpass filter $h_0(A)$ has a half-disc response (in the spectral domain of $A$) with a passband on the positive real axis; similarly, the ideal highpass filter $h_1(A)$ has a passband on the negative real axis. These ideal kernels are plotted in Figure 4.9a. The responses of $h_0(A^2), h_1(A^2)$ are given in Figure 4.9b, and the spectral responses of the four channels of a two-stage tree-structured filterbank are given in Figure 4.9c.

Figure 4.9: Spectral characterization of a two-stage tree-structured filterbank with ideal filters (gray-shaded areas indicate passbands): (a) ideal lowpass and highpass kernel responses $h_0(A)$ and $h_1(A)$; (b) second-stage filter responses $h_0(A^2)$ and $h_1(A^2)$; (c) responses of the four channels of a two-stage tree-structured filterbank, where the right-most and left-most pie slices correspond to the lowest and highest frequency sub-bands, respectively.

Polynomial filter design

The perfect reconstruction conditions for a two-channel filterbank on bipartite graphs are obtained by plugging $M = 2$ into (4.47) and (4.48). Specifically, we need to design filters $h_0(\lambda)$, $h_1(\lambda)$, $g_0(\lambda)$ and $g_1(\lambda)$ that satisfy the following conditions for complex $|\lambda| \le 1$:

$h_0(\lambda) g_0(\lambda) + h_1(\lambda) g_1(\lambda) = 2,$  (4.60)
$h_0(-\lambda) g_0(\lambda) - h_1(-\lambda) g_1(\lambda) = 0.$  (4.61)

When the graph is undirected, $\lambda$ is restricted to the real axis. Orthogonal solutions of these equations cannot be obtained with polynomial filters, so an approximate solution is proposed in [45]. However, perfect reconstruction with polynomial filters is possible for the biorthogonal design, and a solution based on the CDF maximally-flat filter design approach is presented in [46]. Since $M$-block cyclic graphs are directed, our problem requires designing polynomial filter responses on the complex unit disc. In the following, we focus on the biorthogonal design, since it allows perfect reconstruction. Similar to the approach in [46], we choose $h_1(\lambda) = g_0(-\lambda)$ and $g_1(\lambda) = h_0(-\lambda)$, so that (4.61) is automatically satisfied, leaving us with the design criterion

$h_0(\lambda) g_0(\lambda) + h_0(-\lambda) g_0(-\lambda) = 2, \quad \forall\, \{\lambda : |\lambda| \le 1\}.$  (4.62)

We then define $p(\lambda) = h_0(\lambda) g_0(\lambda)$, so that (4.62) can be rewritten as

$p(\lambda) + p(-\lambda) = 2, \quad \forall\, \{\lambda : |\lambda| \le 1\}.$  (4.63)

Since $p(\lambda)$ is the product of two lowpass kernels, it is itself a lowpass kernel. Therefore, our objective is to design a polynomial half-band kernel $p(\lambda)$ that satisfies the complementarity condition (4.63), followed by its spectral factorization to obtain $h_0(\lambda)$ and $g_0(\lambda)$. It is immediately clear from (4.63) that $p(\lambda)$ must have the form

$p(\lambda) = 1 + \sum_{k=0}^{D} c_k\, \lambda^{2k+1}.$  (4.64)

We now describe an approach for obtaining $p(\lambda)$ by generalizing the maximally-flat design presented in [46] for undirected bipartite graphs. The key idea is to force $p(\lambda)$ to have $K_1$ roots at $\lambda = -1$. However, since we are working on the complex unit disc, this does not guarantee a flat response as we move along the imaginary direction away from the real axis (i.e., towards the top and bottom of the complex unit disc). To obtain a better transition band, we also place $K_2$ roots at each of $-1+i$ and $-1-i$ (they must be equal in number for the polynomial to have real coefficients). Therefore, the design approach involves finding a polynomial $r(\lambda) = \sum_{m=0}^{R} r_m \lambda^m$ such that

$(\lambda+1)^{K_1} (\lambda+1+i)^{K_2} (\lambda+1-i)^{K_2}\, r(\lambda) = p(\lambda)$  (4.65)

$\Rightarrow\; (\lambda+1)^{K_1} (\lambda^2 + 2\lambda + 2)^{K_2} \left( \sum_{m=0}^{R} r_m \lambda^m \right) = 1 + \sum_{k=0}^{D} c_k\, \lambda^{2k+1}.$  (4.66)

Comparing the highest powers on both sides of (4.66), we have $R + K_1 + 2K_2 = 2D+1$. The left-hand side of (4.66) has $R+1$ unknowns and the right-hand side imposes $D+1$ constraints (the even-degree coefficients are fixed). Therefore, for a unique polynomial $p(\lambda)$ satisfying (4.63), we should have the same number of constraints as unknowns, which implies $R = D = K_1 + 2K_2 - 1$.
We can thus rewrite (4.66) as

$(\lambda+1)^{K_1} (\lambda^2 + 2\lambda + 2)^{K_2} \sum_{m=0}^{K_1+2K_2-1} r_m \lambda^m = 1 + \sum_{k=0}^{K_1+2K_2-1} c_k\, \lambda^{2k+1}.$  (4.67)

The $K_1 + 2K_2$ unknowns $\{r_m\}$ can be found by solving a linear system of $K_1 + 2K_2$ equations. Once $r(\lambda)$ is found, we obtain $p(\lambda)$ using (4.65). Note that we can also place roots at other locations to shape the response, modifying (4.66) accordingly to obtain the appropriate $r(\lambda)$ and the corresponding $p(\lambda)$.

Once we have $p(\lambda)$, we factor it to obtain the kernels $h_0(\lambda)$ and $g_0(\lambda)$ using the approach presented in [46]. This approach produces maximally-balanced kernels by splitting the roots of $p(\lambda)$ between $h_0(\lambda)$ and $g_0(\lambda)$ as evenly as possible, while keeping the filterbank as close to orthogonal as possible. Examples of a lowpass kernel $h_0(\lambda)$ and of $h_1(\lambda)\,(= g_0(-\lambda))$ designed with this approach are plotted in Figures 4.10a and 4.10b. The spectral responses of the four channels of a two-stage tree-structured polynomial filterbank built from these kernels are illustrated in Figure 4.10c.

Figure 4.10: Spectral responses (magnitude and phase) of polynomial filters designed using the maximally-flat approach with $K_1 = K_2 = 2$: (a) lowpass kernel $h_0(\lambda)$ (length 6), (b) highpass kernel $h_1(\lambda)$ (length 7), (c) channel responses of a two-stage tree-structured filterbank built using the polynomial kernels $h_0(\lambda)$ and $h_1(\lambda)$.
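A sketch of this design step is given below: the linear system pins down $r(\lambda)$ by forcing the even-degree coefficients of $b(\lambda)r(\lambda)$, where $b(\lambda) = (\lambda+1)^{K_1}(\lambda^2+2\lambda+2)^{K_2}$, to match the half-band form (4.64). The function name is hypothetical, the system is assumed nonsingular, and the spectral factorization into $h_0$ and $g_0$ is not shown:

```python
import numpy as np

def maximally_flat_halfband(K1, K2):
    """Half-band kernel p with K1 roots at -1 and K2 root pairs at -1 +/- i,
    satisfying p(l) + p(-l) = 2. Coefficients returned lowest degree first."""
    b = np.array([1.0])
    for _ in range(K1):
        b = np.convolve(b, [1.0, 1.0])         # factor (l + 1)
    for _ in range(K2):
        b = np.convolve(b, [2.0, 2.0, 1.0])    # factor (l^2 + 2l + 2)
    R = K1 + 2 * K2 - 1                        # degree of r
    # even-degree coefficients of b*r must equal [1, 0, ..., 0]
    M = np.zeros((R + 1, R + 1))
    for t in range(R + 1):                     # even powers 0, 2, ..., 2R
        for m in range(R + 1):
            if 0 <= 2 * t - m < len(b):
                M[t, m] = b[2 * t - m]
    rhs = np.zeros(R + 1)
    rhs[0] = 1.0
    r = np.linalg.solve(M, rhs)
    return np.convolve(b, r)                   # p = b * r

p = maximally_flat_halfband(2, 2)
z = 0.3 + 0.4j                                 # any point in the unit disc
print(np.polyval(p[::-1], z) + np.polyval(p[::-1], -z))   # ~ 2
```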
4.5.3 Preliminary experiment

In this section, we perform a simple filtering experiment on a synthetically generated 4-block cyclic graph to evaluate our tree-structured filterbank design (code is available at https://github.com/aamiranis/tree_structured_fb). The graph consists of 100 nodes and is generated as follows:

1. We first create 25 disjoint directed cycles, each consisting of 4 nodes and oriented in the same direction, with edges of weight 1. Note that these cycles collectively form a disconnected 4-block cyclic graph of 100 nodes.

2. Next, we add directed edges of weight 1 randomly with probability p = 0.2, while preserving the block-cyclic property (i.e., by only connecting adjacent blocks with edges of consistent directionality).

3. Finally, we normalize the edge weights so that the rows of the adjacency $A$ sum to 1 (random walk normalization). Such a normalization makes $A$ a stochastic matrix with a cyclic structure, as commonly used to model periodic Finite State Machines (FSMs) and Markov Decision Processes (MDPs) [37, 58].

Before performing the experiment, we ensure that the generated graph instance is connected. To evaluate performance, we consider a two-stage tree-structured filterbank and compute the filtered channel outputs for a given input signal that is zero on all blocks except one, where it has i.i.d. values chosen uniformly from [0, 1]. The graph chosen for our experiment is illustrated in Figure 4.11a, along with its eigenvalues in Figure 4.11b and the input signal in Figure 4.11c.

We first perform the filtering experiment using ideal filters (illustrated in Figure 4.9c), and then repeat it with the polynomial kernels $h_0(\lambda)$ and $h_1(\lambda)$ obtained from the maximally-flat design with $K_1 = K_2 = 2$ (these filters have lengths 6 and 7, respectively; the spectral response of each channel in this case is illustrated in Figure 4.10c).

Figure 4.11: (a) A 4-block cyclic graph considered in the filtering experiment (the edges are oriented in counter-clockwise fashion). (b) The spectrum of its adjacency in the complex unit disc. (c) An input signal on the graph used for filtering experiments with the two-stage tree-structured filterbank.

The output of each channel for both cases is illustrated in Figure 4.12. We observe that the filtered output of the first channel (filtered by $h_0(A) h_0(A^2)$) in both cases has relatively little variation across the blocks, confirming our intuition that it corresponds to the lowest frequency sub-band. Similarly, the output of the last channel (filtered by $h_1(A) h_1(A^2)$) has the highest variation, with significant sign changes across blocks, making it the highest frequency sub-band. The other two channels have moderate variation and therefore correspond to the intermediate frequency sub-bands.

Figure 4.12: Output obtained in each channel (0: $h_0(\lambda)h_0(\lambda^2)$, 1: $h_0(\lambda)h_1(\lambda^2)$, 2: $h_1(\lambda)h_0(\lambda^2)$, 3: $h_1(\lambda)h_1(\lambda^2)$) of a two-stage tree-structured filterbank after filtering the input signal from Figure 4.11c using ideal filters (Figure 4.9c) and polynomial filters (Figure 4.10c).

4.6 Summary and future work

In this chapter, we considered the sampling problem in the context of designing graph wavelet filterbanks. This is fundamentally different from the earlier problem of sampling bandlimited graph signals, since one is required to choose a sampling scheme over multiple channels of the filterbank while satisfying useful properties such as perfect reconstruction, critical sampling and polynomial filter responses.

We began by spelling out design criteria for two-channel polynomial filterbanks that are critically sampled and satisfy perfect reconstruction. We showed that it is not possible in general to attain perfect reconstruction with low-degree polynomial graph filters under any critical sampling scheme, unless the graph has a special eigenstructure: downsampling-upsampling or modulating signals over the graph should lead to a spectral folding phenomenon. Bipartite graphs, for example, are particularly suited for designing perfect reconstruction two-channel filterbanks.

We then shifted our focus to the design of critically-sampled, near-perfect reconstruction wavelet filterbanks on arbitrary graphs. This problem has two intertwined aspects, namely designing polynomial filters and designing a critical sampling scheme. Our formulation decouples the two design problems and focuses only on choosing the best possible sampling scheme.
Specifically, given a predesigned set of analysis/synthesis filters, our algorithm efficiently chooses the best sampling set for each channel in order to minimize a bound on the overall reconstruction error of the filterbank. Experiments show that the sampling scheme produced by our method outperforms existing heuristics; the reconstruction error is, however, significantly limited by the graph structure. Searching for graphs and sampling schemes amenable to perfect reconstruction filterbanks would be an interesting direction for future research. An immediate example of such graphs are Kronecker graphs, where the Kronecker product structure of the adjacency matrix leads to a regular eigenstructure akin to that of bipartite graphs. Further, the spectral domain sampling scheme presented in [67] allows the design of perfect reconstruction filterbanks on any arbitrary graph, at the cost of increased sampling complexity.

Finally, we turned our attention to the design of $M$-channel perfect reconstruction filterbanks on $M$-block cyclic graphs. In this case, we simplified the problem by assuming that $M$ is a power of 2 and proposed a tree-structured design in which the original graph is decomposed hierarchically into smaller graphs over which one can implement a two-channel filterbank. This formulation significantly reduces the design complexity, since the problem boils down to designing and reusing a single two-channel filterbank for a directed bipartite graph. We proposed a maximally-flat polynomial design for this case that extends the approach presented in [46] for undirected bipartite graphs. Our design produces sub-bands with a simple spectral interpretation, which we validated in simple filtering experiments. For future work, we would like to extend the design to cases where $M$ is not a power of 2. One possibility is to design $p$-channel perfect reconstruction filterbanks, where $p$ is a small prime, and use them to hierarchically construct a larger filterbank when $M$ can be factored as a product of small primes. This approach is similar to the mixed-radix implementations of the FFT in traditional DSP.

Chapter 5

Sampling Theory Perspective of Semi-supervised Learning

This chapter is based on our work in [5, 6].

In this chapter, we explore the semi-supervised learning problem from a graph sampling theory perspective. Graph-based methods have been shown to be quite effective for this problem because they provide an easy means of exploiting the underlying geometry of the dataset. These methods involve the construction of a distance-based similarity graph from the feature vectors, whose vertices represent the data points and whose edge weights capture the similarity between them. The key assumption is that class labels vary smoothly over the graph, i.e., there is little variation between labels corresponding to vertices connected by high-weight edges.

There are numerous ways of quantitatively imposing smoothness constraints on label functions defined over the vertices of a similarity graph. Most graph-based semi-supervised classification algorithms incorporate one of these criteria as a penalty against the fitting error in a regularization problem, or as a constraint term while minimizing the fitting error in an optimization problem.
Examples of commonly used measures of smoothness include the graph Laplacian regularizer $f^T L f$ [83, 78], the iterated graph Laplacian regularizers $f^T L^m f$ [81], etc., and many algorithms involve minimizing these functions while ensuring that the label function f on the vertices satisfies the known set of labels. On the other hand, a spectral theory based classification algorithm restricts f to be spanned by the first few eigenvectors of the graph Laplacian [10, 11], which are known to form a representation basis for smooth functions on the graph. Similarly, a more recent approach, derived from sampling theory, considers class indicator signals as smooth signals over the similarity graph. This assumption is incorporated via bandwidth in the graph Fourier domain. The classification algorithm then involves estimating a label function that minimizes prediction error on the known set under a bandwidth constraint, and can be carried out without explicit eigendecomposition of the Laplacian, as illustrated in our recent work [50, 29]. In each of these examples, the criterion enforces smoothness of the labels over the graph – a lower value of the regularizer $f^T L f$, a smaller number of leading eigenvectors, and a smaller bandwidth of f all imply that the labels vary less across similar vertices.

The focus of this chapter is to provide a formal justification for using bandwidth as a regularizer, thereby providing a sampling theory perspective of graph-based learning methods. An interpretation of this smoothness measure helps complete our theoretical understanding of graph-based semi-supervised classification approaches and strengthens their link with the semi-supervised smoothness assumption and its variants. Under a generic statistical model of the data and a graph construction scheme, we provide a geometric interpretation of the bandwidth (estimated via spectral proxies) of class indicator signals in the asymptotic limit of infinitely many data points. We show that this quantity is closely connected to the supremum of the data distribution at the class boundaries. This result helps us justify sampling theory based learning:

1. A lower bandwidth of class indicator signals indicates that the class boundary passes through regions of low data density.
2. Given enough labeled data, bandlimited reconstruction of the class indicator learns a decision boundary that respects the labels and over which the maximum density of the data points is as low as possible, similar to other graph-based methods.
3. Finally, we also show that sampling theory based learning attains the theoretical label complexity, i.e., the minimum number of labels required for perfectly predicting the unknown labels from the known ones.

In summary, our results reinforce the smoothness assumption on class indicator signals in graph-based learning methods. From previous analyses of spectral clustering, we observe that asymptotically there is a strong link between the value of a cut and the bandwidth of its associated indicator signal. Thus, the geometric properties desired of "minimal cuts" in clustering translate to those of "minimal bandwidth" indicator signals for classification in the presence of labels.

The rest of this chapter is organized as follows: In Section 5.1, we introduce the statistical models of the data, the graph construction scheme and relevant concepts from sampling theory, particularly bandlimited interpolation.
Section 5.2 reviews existing work in the literature and its relation to our work. In Section 5.3, we state and discuss our main results (without proofs), followed by their numerical validation in Section 5.4. We provide a brief discussion along with future directions in Section 5.5. Finally, the proofs of our results are laid out in Section 5.6.

5.1 Preliminaries

5.1.1 Data models

The separable model

In this model, we assume that the dataset consists of a pool of n random, d-dimensional feature vectors $\mathcal{X} = \{X_1, X_2, \ldots, X_n\}$ drawn independently from some probability density function p(x) supported on $\mathbb{R}^d$ (this is assumed for simplicity; the analysis can be extended to arbitrary manifolds $\mathcal{M} \subset \mathbb{R}^d$, but would be more technically involved). To simplify our analysis, we also assume that p(x) is bounded from above, Lipschitz continuous and twice differentiable.

We assume that a smooth hypersurface ∂S, with radius of curvature lower bounded by a constant τ, splits $\mathbb{R}^d$ into two disjoint classes S and $S^c$, with indicator functions $1_S(x): \mathbb{R}^d \to \{0,1\}$ and $1_{S^c}(x): \mathbb{R}^d \to \{0,1\}$. This is illustrated in Figure 5.1a. The n-dimensional class indicator signal for class S is given by $1_S \in \{0,1\}^n$ such that $1_S(i) = 1_S(X_i)$, i.e., the i-th entry of $1_S$ is 1 if $X_i \in S$ and 0 otherwise.

The nonseparable model

In this model, we assume that each class has its own conditional distribution supported on $\mathbb{R}^d$ (which may or may not overlap with the distributions of other classes). The dataset consists of a pool of n random and independent d-dimensional feature vectors $\mathcal{X} = \{X_1, X_2, \ldots, X_n\}$, each drawn from one of the distributions $p_i(x)$ with probability $\alpha_i$, such that $\sum_i \alpha_i = 1$. For our analysis, we consider a class denoted by an index A with selection probability $\alpha_A$, class conditional distribution $p_A(x)$ and an n-dimensional indicator vector $1_A$ whose i-th component takes value 1 if $X_i$ is drawn from class A. This model is illustrated in Figure 5.1b.

Figure 5.1: Statistical models of data considered in this work: (a) the separable model, (b) the nonseparable model.

Further, we denote by $\alpha_{A^c} = 1 - \alpha_A$ the probability that a point does not belong to A, and by $p_{A^c}(x) = \sum_{i \neq A} \alpha_i p_i(x) / \alpha_{A^c}$ the density of all such points. The marginal distribution of the data points is then given by the mixture density

$$p(x) = \alpha_A\, p_A(x) + \alpha_{A^c}\, p_{A^c}(x). \qquad (5.1)$$

Once again, to simplify our analysis, we assume that all distributions are Lipschitz continuous, bounded from above and twice differentiable in $\mathbb{R}^d$.

Next, we introduce the notion of a "boundary" for classes in the nonseparable model as follows: for class A, we define its overlap region ∂A as

$$\partial A = \{x \in \mathbb{R}^d \mid p_A(x)\, p_{A^c}(x) > 0\}. \qquad (5.2)$$

Intuitively, ∂A can be considered as the region of ambiguity, where both points belonging and not belonging to A co-exist. In other words, ∂A can be thought of as a "boundary" that separates the region where points can only belong to A from the region where points can never belong to A. Since class indicator signals on graphs change values only within the overlap region, one would expect the indicators to be smoother if there are fewer data points within this region. We shall show later that this is indeed the case, both theoretically and experimentally. Note that the definition of the boundary is not very meaningful for class conditional distributions with decaying tails, such as the Gaussian, since the boundary then encompasses the entire feature space. However, in such cases, one can approximate the boundary with appropriate thresholds in the definition, and this approximation can also be formalized for distributions with exponentially decaying tails.
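Before moving to the graph construction, here is a minimal sketch of a concrete (hypothetical) instance of the separable model: points are drawn from a standard Gaussian density and the indicator signal $1_S$ is determined by which side of the smooth circular boundary $\partial S = \{x : \|x\| = 1\}$ each point falls on (this is one of the boundaries revisited in Section 5.4).

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_separable(n, d=2):
    """Draw n points from a hypothetical smooth density p (standard
    Gaussian here) and form the indicator 1_S for S = {x : ||x|| < 1},
    whose boundary dS has radius of curvature tau = 1."""
    X = rng.normal(size=(n, d))
    ind_S = (np.sum(X**2, axis=1) < 1.0).astype(float)  # 1_S as a 0/1 signal
    return X, ind_S
```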
5.1.2 Graph model

Using the n feature vectors, we construct an undirected distance-based similarity graph where nodes represent the data points and edge weights are proportional to their similarity, given by the Gaussian kernel:

$$w_{ij} = K_{\sigma^2}(X_i, X_j) = \frac{1}{(2\pi\sigma^2)^{d/2}}\, e^{-\|X_i - X_j\|^2 / 2\sigma^2}, \qquad (5.3)$$

where σ is the variance (bandwidth) of the Gaussian kernel. Further, we assume $w_{ii} = 0$, i.e., the graph does not have self-loops. The adjacency matrix W of the graph is an n×n symmetric matrix with elements $w_{ij}$, while the degree matrix D is a diagonal matrix with elements $D_{ii} = \sum_j w_{ij}$. We define the graph Laplacian as $L = \frac{1}{n}(D - W)$. Normalization by n ensures that the norm of L is stochastically bounded as n grows. Since the graph is undirected, L is a symmetric matrix with non-negative eigenvalues $0 \leq \lambda_1 \leq \cdots \leq \lambda_n$ and an orthogonal set of corresponding eigenvectors $\{u_1, \ldots, u_n\}$. It is known that for a larger eigenvalue λ, the corresponding eigenvector u exhibits greater variation when plotted over the nodes of the graph [66].

5.1.3 Estimating bandwidth

Recall that the bandwidth ω(f) of any signal f on the graph is defined as the largest eigenvalue for which the projection of the signal on the corresponding eigenvector is non-zero, i.e.,

$$\omega(f) = \max_i \left\{\lambda_i \,\middle|\, |u_i^T f| > 0\right\}. \qquad (5.4)$$

Ideally, computing the bandwidth ω(f) of a graph signal f requires computing the eigenvectors $\{u_i\}$ and the corresponding projections $\tilde{f}_i = u_i^T f$. However, analyzing the convergence of these coefficients is technically challenging. Therefore, we resort to the graph spectral proxies introduced in Section 3.3 in order to estimate the bandwidth. Since, in this chapter, we deal with a symmetric L, we slightly modify the definition for simplicity:

$$\omega_m(f) = \left(\frac{f^T L^m f}{f^T f}\right)^{1/m}, \qquad (5.5)$$

where $\omega_m(f)$ is the m-th order spectral proxy. Recall that the bandwidth estimates satisfy the following property: for all $m_1, m_2$ such that $0 < m_1 < m_2$, we have $\omega_{m_1}(f) \leq \omega_{m_2}(f) \leq \omega(f)$. Therefore,

$$\forall f,\quad \omega(f) = \lim_{m \to \infty} \omega_m(f). \qquad (5.6)$$

Analyzing the convergence of $\omega_m(1_S)$ and $\omega_m(1_A)$ as $n \to \infty$, $\sigma \to 0$ and $m \to \infty$ constitutes the main subject for the rest of this chapter. Specifically, we relate these quantities to the underlying data distribution p(x) and the class boundaries (the hypersurface ∂S in the separable case and the overlap region ∂A in the nonseparable case).

Note that the limit in (5.6) holds in a point-wise sense. This means that analyzing the convergence of the bandwidth estimates $\omega_m(1_S)$ and $\omega_m(1_A)$ as $n \to \infty$ and then applying the limit $m \to \infty$ gives only an idea about the convergence of the actual bandwidths $\omega(1_S)$ and $\omega(1_A)$ as $n \to \infty$. Specifically, it does not imply convergence of $\omega(1_S)$ and $\omega(1_A)$ to the same values as $\omega_m(1_S)$ and $\omega_m(1_A)$, since the limits are not interchangeable unless (5.6) holds in a uniform sense. However, based on our experiments and results on label complexity, we believe that our intuition is accurate, i.e., the convergence results hold for the actual bandwidths, not only their estimates. We leave the analysis of this intricacy for future work.
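For reference, here is a minimal sketch (dense NumPy, suitable only for small n) that builds the similarity graph of (5.3), forms $L = \frac{1}{n}(D - W)$, and evaluates the spectral proxy (5.5) using m matrix-vector products:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def similarity_graph_laplacian(X, sigma):
    """W from the Gaussian kernel in (5.3) (no self-loops), L = (D - W)/n."""
    n, d = X.shape
    sq_dists = squareform(pdist(X, 'sqeuclidean'))
    W = np.exp(-sq_dists / (2 * sigma**2)) / (2 * np.pi * sigma**2)**(d / 2)
    np.fill_diagonal(W, 0.0)
    return (np.diag(W.sum(axis=1)) - W) / n

def spectral_proxy(L, f, m):
    """omega_m(f) = (f^T L^m f / f^T f)^(1/m), via m mat-vec products."""
    v = f.copy()
    for _ in range(m):
        v = L @ v            # v = L^k f after k iterations
    return (f @ v / (f @ f)) ** (1.0 / m)
```

By the monotonicity property above, increasing m in spectral_proxy tightens the estimate from below toward ω(f).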
5.1.4 Bandlimited interpolation for classification

Bandwidth plays an important role in the spectral approach to semi-supervised learning. In this approach, one finds a label assignment by minimizing the error over the known set, while ensuring that the resulting class indicator vector is bandlimited over the similarity graph, i.e.,

$$\text{minimize } \|f(\mathcal{L}) - y(\mathcal{L})\|^2 \quad \text{subject to } \omega(f) < \omega_{\mathcal{L}}, \qquad (5.7)$$

where $\mathcal{L}$ denotes the set of known labels, y denotes the true class labels, and $f(\mathcal{L})$ and $y(\mathcal{L})$ denote the values of f and y on the set $\mathcal{L}$, respectively. $\omega_{\mathcal{L}}$ restricts the hypothesis space by constraining it to a set of bandlimited signals, which is equivalent to enforcing smoothness of the labels over the graph. Therefore, it is important to understand its connection to the geometry of the dataset. A good choice for $\omega_{\mathcal{L}}$ is the cutoff frequency associated with the labeled set, which can be estimated using the results in Section 3.4.

Note that the bandwidth-based approach to semi-supervised learning differs from the Fourier eigenvector approach suggested in [10, 11], since it can be implemented without explicitly computing the eigenvectors of L. The method is based on iteratively and alternately projecting onto convex sets, and can be implemented in an efficient manner via graph filtering operations [50].

Note that if the original indicators $1_S$ (or $1_A$) are bandlimited with respect to the labeled set, i.e., $\omega(1_S) < \omega_{\mathcal{L}}$ (or $\omega(1_A) < \omega_{\mathcal{L}}$), then the estimate $f_{LS}$ in (5.7) is guaranteed to be equal to $1_S$ (or $1_A$) as a consequence of the sampling theorem. Moreover, in this case, $1_S$ and $1_A$ can also be perfectly estimated by the solution of the following "dual" problem:

$$f_{\min} = \arg\min_f\, \omega(f) \quad \text{s.t. } f(\mathcal{L}) = 1_S(\mathcal{L}) \ \big(\text{or } f(\mathcal{L}) = 1_A(\mathcal{L})\big). \qquad (5.8)$$

These facts lead to the following insight regarding bandlimited interpolation for classification:

Observation 5.1. If $\omega(1_S) < \omega_{\mathcal{L}}$ and $\omega(1_A) < \omega_{\mathcal{L}}$, then

1. $1_S$ and $1_A$ can be perfectly recovered using either (5.7) or (5.8).
2. $1_S$ and $1_A$ are guaranteed to have minimum bandwidth among all indicator vectors satisfying the label constraints.

The observations above have significant implications: given enough and appropriately chosen labeled data, bandlimited interpolation effectively recovers an indicator vector with minimum bandwidth that respects the label constraints. Note that by labeling enough data appropriately, we mean ensuring that the cut-off frequency $\omega_{\mathcal{L}}$ of the labeled set is greater than the bandwidths $\omega(1_S)$ and $\omega(1_A)$ of the indicator functions of interest. If this condition is not satisfied, both observations break down, i.e., the solutions of (5.7) and (5.8) would differ and serve only as approximations for $1_S$ and $1_A$. Moreover, the minimum bandwidth signal $f_{\min}$ satisfying the label constraints would differ from $1_S$ and $1_A$, and may not even be an indicator vector. To help ensure that the condition is satisfied, one can use the efficient algorithm in Section 3.5. We note that in practice, (5.7) can be solved via efficient iterative techniques [50].
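For intuition, a direct eigendecomposition-based sketch of the least squares problem (5.7) is given below; this explicit-basis solve is only a stand-in, since the method of [50] avoids the eigendecomposition via iterative projections and graph filtering.

```python
import numpy as np

def bandlimited_reconstruction(L, labeled_idx, y_labeled, omega_L):
    """Least-squares fit of the known labels over the span of eigenvectors
    of L with eigenvalues below omega_L (a direct-solve stand-in for (5.7))."""
    lam, U = np.linalg.eigh(L)                 # L is symmetric
    B = U[:, lam < omega_L]                    # bandlimited basis
    coeffs, *_ = np.linalg.lstsq(B[labeled_idx], y_labeled, rcond=None)
    f = B @ coeffs                             # interpolated signal on all nodes
    return (f > 0.5).astype(float), f          # thresholded indicator, raw values
```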
5.2 Related work and connections

Existing graph-based semi-supervised learning and spectral clustering methods have been justified by analyzing the convergence of graph-based smoothness measures (such as the graph cut and the Laplacian regularizer) for various graph construction schemes in two different settings – classification and regression. The classification setting assumes that labels indicate class memberships and are discrete, typically with 1/0 values. Note that both the separable and nonseparable data models considered in our work are in the classification setting. On the other hand, in the regression setting, one allows the class label signal f to be smooth and continuous with soft values, i.e., $f \in \mathbb{R}^n$, and later applies some thresholding mechanism to infer class memberships. For example, in the two class problem, one can assign +1 and −1 to the two classes and threshold f at 0. Convergence analysis of smoothness measures in this setting requires different scaling conditions than the classification setting, and leads to fundamentally different limiting quantities that require differentiability of the label functions. A summary of convergence results in the literature for both settings is presented in Table 5.1. Although these results do not focus on analyzing the bandwidth of class indicator signals, the proof techniques used in this paper are inspired by some of these works. We review them in this section and discuss their connections to our work.

Table 5.1: Related convergence results in the literature under different data models and graph construction schemes. All models assume that the distributions are smooth (at least twice-differentiable). Further, the graph Laplacian is defined as $L = \frac{1}{n}(D - W)$ in all cases. [42] also studies convergence of graph cuts for weighted k-nearest neighbor and r-neighborhood graphs, which we do not include for brevity. Limits are stated up to constant scaling factors.

- Narayanan et al. [51] — Data model: p(x) supported on a manifold $\mathcal{M} \subset \mathbb{R}^d$, separated into S and $S^c$ by a smooth hypersurface ∂S. Graph model: normalized Gaussian weights $w'_{ij} = w_{ij}/\sqrt{d_i d_j}$. Quantity: $\frac{1}{n\sigma} 1_S^T L 1_S$. Regime: $n \to \infty$, $\sigma \to 0$. Limit: $\int_{\partial S} p(s)\,ds$.

- Maier et al. [42] — Data model: p(x) supported on $\mathcal{M} \subset \mathbb{R}^d$, separated into S and $S^c$ by a hyperplane ∂S. Graph models: (i) r-neighborhood, unweighted: quantity $\frac{1}{nr^{d+1}} 1_S^T L 1_S$, regime $n \to \infty$, $r \to 0$, limit $\int_{\partial S} p^2(s)\,ds$; (ii) k-nn, unweighted, with $t = (k/n)^{1/d}$: quantity $\frac{1}{nt^{d+1}} 1_S^T L 1_S$, regime $n \to \infty$, $t \to 0$, limit $\int_{\partial S} p^{1-1/d}(s)\,ds$; (iii) fully-connected, Gaussian weights: quantity $\frac{1}{n\sigma} 1_S^T L 1_S$, regime $n \to \infty$, $\sigma \to 0$, limit $\int_{\partial S} p^2(s)\,ds$.

- Bousquet et al. [15], Hein [34] — Data model: p(x) and f(x) supported on $\mathbb{R}^d$. Graph model: fully-connected, weights $w_{ij} = \frac{1}{n\sigma^d} K\!\left(\frac{\|X_i - X_j\|^2}{\sigma^2}\right)$, where K(·) is a smooth decaying kernel. Quantity: $\frac{1}{n\sigma^2} f^T L f$. Regime: $n \to \infty$, $\sigma \to 0$. Limit: $\int \|\nabla f(x)\|^2 p^2(x)\,dx$.

- Zhou et al. [81] — Data model: uniformly distributed on a d-dimensional submanifold $\mathcal{M}$. Graph model: fully-connected, Gaussian weights. Quantity: $\frac{1}{n\sigma^m} f^T L^m f$. Regime: $n \to \infty$, $\sigma \to 0$. Limit: $\int f(x)\, \Delta^m f(x)\,dx$.

- El Alaoui et al. [2] — Data model: p(x) supported on $[0,1]^d$. Graph model: fully-connected, weights $w_{ij} = K\!\left(\frac{\|X_i - X_j\|}{\sigma}\right)$, where K(·) is a smooth decaying kernel. Quantity: $\frac{1}{n^2\sigma^{p+d}} J_p(f)$. Regime: $n \to \infty$, $\sigma \to 0$. Limit: $\int \|\nabla f(x)\|^p p^2(x)\,dx$.

- Current work — Data model: p(x) supported on $\mathbb{R}^d$, separated into S and $S^c$ by a smooth hypersurface ∂S. Graph model: fully-connected, Gaussian weights. Quantity: $\omega_m(1_S)$. Regime: $n \to \infty$, $\sigma \to 0$, $m \to \infty$. Limit: $\sup_{s \in \partial S} p(s)$. Also: data drawn from $p_A(x)$ and $p_{A^c}(x)$ supported on $\mathbb{R}^d$ with probabilities $\alpha_A$ and $\alpha_{A^c}$; quantity $\omega_m(1_A)$; same regime; limit $\sup_{x \in \partial A} p(x)$.

5.2.1 Classification setting

Prior work under this setting assumes the separable data model, where the feature space is partitioned by smooth decision boundaries into different classes. When m = 1, the bandwidth estimate $\omega_m(1_S)$ for the separable model in our work reduces (within a scaling factor) to the empirical graph cut for the partitions S and $S^c$ of the feature space, i.e.,

$$\text{Cut}(S, S^c) = \sum_{X_i \in S,\, X_j \in S^c} w_{ij} = 1_S^T L 1_S. \qquad (5.9)$$

Convergence of this quantity has been studied before in the context of spectral clustering, where one tries to minimize it across the two partitions of the nodes. It has been shown in [42] that the cut formed by a hyperplane ∂S in $\mathbb{R}^d$ converges, with appropriate scaling, under the rate conditions $\sigma \to 0$ and $n\sigma^{d+1} \to \infty$ as

$$\frac{1}{n\sigma}\, 1_S^T L 1_S \xrightarrow{p.} \frac{1}{\sqrt{2\pi}} \int_{\partial S} p^2(s)\,ds, \qquad (5.10)$$
where ds ranges over all (d−1)-dimensional volume elements tangent to the hyperplane ∂S, and "p." denotes convergence in probability. The analysis is also extended to other graph construction schemes, such as the k-nearest neighbor graph and the r-neighborhood graph, both weighted and unweighted. The condition σ → 0 in (5.10) is required to have a clear and well-defined limit on the right hand side. We borrow this convergence regime in our work, since it allows a succinct interpretation of the bandwidth of class indicator signals. Intuitively, it enforces sparsity in the similarity matrix W by shrinking the neighborhood volume as the number of data points increases. As a result, one can ensure that the graph remains sparse even as the number of points goes to infinity.

A similar result for a similarity graph constructed with normalized weights $w'_{ij} = w_{ij}/\sqrt{d_i d_j}$ was shown earlier for an arbitrary hypersurface ∂S in [51], where $d_i$ denotes the degree of node i. In this case, under the condition $n \to \infty$, and for a vanishing sequence $\{\sigma_n\}$ that satisfies $\sigma_n > 1/n^{\frac{1}{d+1}}$, one has

$$\frac{1}{n\sigma_n}\, 1_S^T L' 1_S \xrightarrow{p.} \frac{1}{\sqrt{2\pi}} \int_{\partial S} p(s)\,ds, \qquad (5.11)$$

where L' denotes the Laplacian with normalized weights. Normalization leads to a different weighting factor on the right hand side than in (5.10).

The results in [51, 42] aim to provide an interpretation for spectral clustering – up to some scaling, the empirical cut value converges to a weighted volume of the boundary. Thus, spectral clustering is a means of performing low density separation on a finite sample drawn from a distribution in feature space. Although these works provide inspiration for the proof techniques used for analyzing the separable model in this paper, they cannot be directly used in the convergence analysis of $\omega_m(1_S)$ for m > 1, which is the main focus of our paper. Additionally, our work is the first to propose and analyze the nonseparable model in the classification setting, i.e., convergence results for $\omega_m(1_A)$.

5.2.2 Regression setting

To predict the labels of unknown samples in the regression setting, one generally minimizes the graph Laplacian regularizer $f^T L f$ subject to the known label constraints [83]:

$$\min_f\, f^T L f \quad \text{such that } f(\mathcal{L}) = y(\mathcal{L}). \qquad (5.12)$$

One particular convergence result in this setting assumes that n data points are drawn i.i.d. from p(x) and are labeled by sampling a smooth function f(x) on $\mathbb{R}^d$. Here, the graph Laplacian regularizer $f^T L f$ can be shown to converge in the asymptotic limit under the conditions $\sigma \to 0$ and $n\sigma^d \to \infty$ as [15, 34]:

$$\frac{1}{n\sigma^2}\, f^T L f \xrightarrow{p.} C \int_{\mathbb{R}^d} \|\nabla f(x)\|^2\, p^2(x)\,dx, \qquad (5.13)$$

where for each n, f is the n-dimensional label vector representing the values of f(x) at the n sample points, ∇ is the gradient operator, and C is a constant factor independent of n and σ.
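As a point of comparison with bandlimited interpolation, (5.12) admits a closed-form solution: setting the gradient with respect to the unlabeled entries of f to zero yields $L_{UU} f_U = -L_{UL}\, y_{\mathcal{L}}$. A sketch of this standard direct solve:

```python
import numpy as np

def harmonic_interpolation(L, labeled_idx, y_labeled):
    """Solve min_f f^T L f subject to f = y on the labeled set, as in (5.12).
    The unlabeled values satisfy L_UU f_U = -L_UL y_L."""
    n = L.shape[0]
    U_idx = np.setdiff1d(np.arange(n), labeled_idx)    # unlabeled vertices
    L_UU = L[np.ix_(U_idx, U_idx)]
    L_UL = L[np.ix_(U_idx, labeled_idx)]
    f = np.zeros(n)
    f[labeled_idx] = y_labeled
    f[U_idx] = np.linalg.solve(L_UU, -L_UL @ y_labeled)
    return f
```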
The right hand side of (5.13) is a weighted Dirichlet energy functional that penalizes variation in the label function, weighted by the data distribution. Similar to the justification of spectral clustering, this result justifies the formulation in (5.12) for semi-supervised classification: given label constraints, the predicted label function must vary little in regions of high density. The work of [34, 33] also generalizes the result to arbitrary kernel functions used in defining graph weights, and to data distributions defined over arbitrary manifolds in $\mathbb{R}^d$. Similar convergence results have also been derived for the higher-order Laplacian regularizer $f^T L^m f$ obtained from uniformly distributed data [81]. In this case, it was shown that for data points obtained from a uniform distribution on a d-dimensional submanifold $\mathcal{M} \subset \mathbb{R}^N$ such that $\text{Vol}(\mathcal{M}) = 1$, and for 2m-differentiable functions f(x), one has, as $n \to \infty$, $\sigma \to 0$,

$$\frac{1}{n\sigma_n^m}\, f^T L^m f \xrightarrow{p.} C \int_{\mathcal{M}} f(x)\, \Delta^m f(x)\,dx, \qquad (5.14)$$

where Δ is the Laplace operator and $\sigma_n = n^{-1/(2d+4+\alpha)}$ is a vanishing sequence with α > 0. Extensions to non-uniform probability distributions p(x) over the manifold can be obtained using the weighted Laplace-Beltrami operator [12, 82]. More recently, an $\ell_p$-based Laplacian regularization has been proposed for imposing smoothness constraints in semi-supervised learning problems [2]. This is similar to higher-order regularization, but is defined as $J_p(f) = \sum_{i,j \in E} w_{ij}^p\, |f_i - f_j|^p$, where $w_{ij} = K(\|X_i - X_j\|/\sigma)$ and K(·) is a smoothly decaying kernel function. It has been shown, for a bounded density p(x) defined on $[0,1]^d$, that for every p ≥ 2, as $n \to \infty$, $\sigma \to 0$,

$$\frac{1}{n^2\sigma^{p+d}}\, J_p(f) \xrightarrow{p.} C \int_{[0,1]^d} \|\nabla f(x)\|^p\, p^2(x)\,dx. \qquad (5.15)$$

Although our work also uses higher powers of L in the expressions for $\omega_m(1_S)$ and $\omega_m(1_A)$, we cannot use the convergence results in (5.14) and (5.15), since they are only applicable to smooth functions (i.e., differentiable up to a certain order) on $\mathbb{R}^d$. Specifically, these results cannot be applied to the bandwidth of discrete 0/1 class indicator functions.

To summarize, the results in the literature mostly pertain to convergence analysis of variants of the graph cut or the graph Laplacian regularizer for different models of data and graph construction schemes, and do not provide insight into the convergence of bandwidths of discrete 0/1 class indicator signals. In contrast, we analyze bandwidth expressions involving these class indicator signals and higher powers of L, and, for the first time, extend the analysis to a nonseparable data model. As opposed to other smoothness measures considered earlier, analyzing the bandwidth allows us to interpret graph-based semi-supervised learning using the sampling theorem [4] and provides quantitative insight into label complexity based on data geometry.

5.3 Main results and discussion

5.3.1 Interpretation of bandwidth and bandlimited reconstruction

We first show that, under certain conditions, the bandwidth estimates of class indicator signals over the distance-based similarity graph described earlier converge to quantities that are functions of the underlying distribution and the class boundary for both data models. This convergence is achieved under the following asymptotic regime:

1. Increasing size of dataset: $n \to \infty$.
2. Shrinking neighborhood volume: $\sigma \to 0$.
3. Improving bandwidth estimates: $m \to \infty$.

Note that an increasing size of the dataset (Condition 1) is required for the stochastic convergence of the bandwidth estimate.
Condition 2 ensures that the limiting values are concise and have a simple interpretation in terms of the data geometry. Intuitively, Condition 2 ensures that as the number of data points increases, one looks at a smaller neighborhood around each data point; as a result, the degree of each node in the graph does not blow up. Finally, Condition 3 leads to improving values of the bandwidth estimate. The convergence results are precisely stated in the theorems below:

Theorem 5.1. If $n \to \infty$, $\sigma \to 0$ and $m \to \infty$ while satisfying the following rate conditions

1. $(n\sigma^{md})/(mC^m) \to \infty$, where $C = 2/(2\pi)^{d/2}$,
2. $m/n\sigma \to 0$,
3. $m\sigma^2 \to 0$,
4. $\sigma^{1/m} \to 1$,

then for the separable model, one has

$$\omega_m(1_S) \xrightarrow{p.} \sup_{s \in \partial S} p(s), \qquad (5.16)$$

where "p." denotes convergence in probability.

Theorem 5.2. If $n \to \infty$, $\sigma \to 0$ and $m \to \infty$ while satisfying the following rate conditions

1. $(n\sigma^{md})/(mC^m) \to \infty$, where $C = 2/(2\pi)^{d/2}$,
2. $m/n \to 0$,
3. $m\sigma^2 \to 0$,

then for the nonseparable model, one has

$$\omega_m(1_A) \xrightarrow{p.} \sup_{x \in \partial A} p(x). \qquad (5.17)$$

The dependence of the results on the rate conditions will be explained later in the proofs section. An example of parameter choices that allow all the scaling laws to hold simultaneously is illustrated in the following corollary:

Corollary 5.1. Equations (5.16) and (5.17) hold if for each value of n, we choose m and σ as follows:

$$m = [m_0 (\log n)^y], \qquad (5.18)$$
$$\sigma = \sigma_0\, n^{-x/(md)}, \qquad (5.19)$$

for constants $m_0, \sigma_0 > 0$, $1/2 < y < 1$ and $0 < x < 1$, where [·] indicates taking the nearest integer value.
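As a concrete reading of the corollary, the schedule can be implemented as below (the constants m0, σ0, x, y are hypothetical choices within the stated ranges):

```python
import numpy as np

def parameter_schedule(n, d, m0=2.0, sigma0=1.0, x=0.5, y=0.75):
    """m and sigma as functions of n per (5.18)-(5.19); the rate conditions
    of Theorems 5.1-5.2 require 1/2 < y < 1 and 0 < x < 1."""
    m = int(round(m0 * np.log(n) ** y))     # nearest-integer [.] in (5.18)
    sigma = sigma0 * n ** (-x / (m * d))    # vanishing kernel bandwidth
    return m, sigma
```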
Theorems 5.1 and 5.2 give an explicit connection between bandwidth estimates of class indicator signals and the underlying geometry of the dataset. This interpretation forms the basis for justifying the choice of bandwidth as a smoothness constraint in graph-based learning algorithms. Theorem 5.1 suggests that for the separable model, if the boundary ∂S passes through regions of low probability density, then the bandwidth of the corresponding class indicator vector $\omega(1_S)$ is low. A similar conclusion is suggested for the nonseparable model by Theorem 5.2: if the density of data points in the overlap region ∂A is low, then the bandwidth $\omega(1_A)$ is low. In other words, low density of data in the boundary regions leads to smooth indicator functions.

From our results, we also gain intuition about the smoothness constraint imposed in the bandlimited reconstruction approach (5.7) for semi-supervised learning. Enforcing smoothness on classes in terms of indicator bandwidth ensures that the learning algorithm chooses a boundary passing through regions of low data density in the separable case. Similarly, in the nonseparable case, it ensures that variations in labels occur in regions of low density. Further, the bandwidth cutoff $\omega_{\mathcal{L}}$ effectively imposes a constraint on the complexity of the hypothesis space – a larger value increases the size of the hypothesis space and results in choosing more complex boundaries.

As a special case of our analysis, we also obtain a convergence result for the graph cut in the nonseparable model, analogous to the results of [42] for the separable model. Note that the cut in this case equals the sum of weights of edges connecting points that belong to class A to points that do not belong to class A, i.e.,

$$\text{Cut}(A, A^c) = \sum_{X_i \in A,\, X_j \in A^c} w_{ij} = 1_A^T L 1_A. \qquad (5.20)$$

Figure 5.2: 1-D example illustrating the theoretical label complexity for (a) the separable model, (b) the nonseparable model. Note that labeling all points where the density is lower than the supremum of the density over the boundary resolves all ambiguity and results in perfect prediction.

With this definition, we have the following result:

Theorem 5.3. If $n \to \infty$, $\sigma \to 0$ and $n\sigma^{d+1} \to \infty$, then

$$\frac{1}{n} \text{Cut}(A, A^c) \xrightarrow{p.} \int \alpha_A\, \alpha_{A^c}\, p_A(x)\, p_{A^c}(x)\,dx. \qquad (5.21)$$

The result above indicates that if the overlap between the conditional distributions of a particular class and its complement is low, then the value of the graph cut is lower. This justifies the use of spectral clustering in the context of nonseparable models.

5.3.2 Label complexity of SSL

In the context of semi-supervised learning, we define the label complexity as the minimum fraction of labeled examples required for perfectly predicting the labels of the unlabeled data points. This quantity is essentially an indicator of how "good" the semi-supervised problem is, i.e., how much help we get from geometry while learning. A low label complexity is indicative of a favorable situation, where one is able to learn from only a few known labels by exploiting data geometry. In the following discussion, we first estimate the theoretical label complexities of the data models we consider, and then show that the expected label complexity of the sampling theoretic approach to learning exactly matches these values in the asymptotic limit.

Theoretical label complexities

A simple way to compute the label complexity for the data models we consider is to find the fraction of points belonging to a region that fully encompasses the boundary. To formalize this, let us define the following two regions in $\mathbb{R}^d$:

$$\mathcal{X}_S = \{x : p(x) \leq \sup_{s \in \partial S} p(s)\}, \qquad (5.22)$$
$$\mathcal{X}_A = \{x : p(x) \leq \sup_{x \in \partial A} p(x)\}. \qquad (5.23)$$

Note that by definition, ∂S is fully contained in $\mathcal{X}_S$ and ∂A is fully contained in $\mathcal{X}_A$ (see Figure 5.2 for an example in $\mathbb{R}^1$). To perfectly reconstruct $1_S$ and $1_A$, it is sufficient to know the labels of all points in $\mathcal{X}_S$ and $\mathcal{X}_A$ respectively, as this strategy removes all ambiguity in labeling the two classes. Based on this, we arrive at the following conclusion:

Observation 5.2. The theoretical label complexities of learning $1_S$ and $1_A$ in the asymptotic limit are $P(\mathcal{X}_S)$ and $P(\mathcal{X}_A)$ respectively, where $P(E) = \int_E p(x)\,dx$.

Label complexity of graph-based learning

Using our results, we can show that the same label complexities hold for the graph-based sampling theoretic approach to semi-supervised classification. In this context, label complexity can be seen as the fraction of samples required for perfectly reconstructing a signal on the similarity graph. It is known that the fraction of samples required for perfectly reconstructing a bandlimited signal cannot be more than the fraction of eigenvalues of the Laplacian below the signal's bandwidth [4]. Since our bandwidth convergence results relate the bandwidth of indicators for the two data models to the data geometry, we only need to asymptotically relate the number of eigenvalues of L below any constant to the data geometry. This is achieved through the following result:

Theorem 5.4. Let $N_L(t)$ be the number of eigenvalues of L below a constant t. Then, as $n \to \infty$ and $\sigma \to 0$, we have

$$\mathbb{E}\left\{\frac{1}{n} N_L(t)\right\} \longrightarrow P(\{x : p(x) \leq t\}). \qquad (5.24)$$

Proof. See Section 5.6.4.
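Theorem 5.4 lends itself to a quick numerical sanity check; a sketch is given below, assuming L has been built as in the earlier graph-model sketch and that the density p can be evaluated at the sample points (the second function is a Monte-Carlo estimate of the right hand side of (5.24)):

```python
import numpy as np

def eigenvalue_fraction(L, t):
    """Empirical N_L(t)/n: fraction of eigenvalues of L below t."""
    lam = np.linalg.eigvalsh(L)
    return np.mean(lam < t)

def density_mass_below(p, X, t):
    """Monte-Carlo estimate of P({x : p(x) <= t}) from samples X ~ p,
    using P({p(x) <= t}) = E[ 1{p(X) <= t} ]."""
    return np.mean(p(X) <= t)
```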
Substituting the bandwidth convergence results from Theorems 5.1 and 5.2 (i.e., $t = \omega_m(1_S)$ and $t = \omega_m(1_A)$), we immediately get the desired value of the expected label complexity of graph-based semi-supervised learning:

Theorem 5.5. If the conditions in Theorems 5.1 and 5.2 hold, then the expected label complexities of bandlimited reconstruction for the separable and nonseparable models are given as

$$\frac{1}{n} \mathbb{E}\{N_L(\omega_m(1_S))\} \to P(\mathcal{X}_S), \quad \text{and} \qquad (5.25)$$
$$\frac{1}{n} \mathbb{E}\{N_L(\omega_m(1_A))\} \to P(\mathcal{X}_A). \qquad (5.26)$$

The following remarks are in order:

1. Note that Theorems 5.4 and 5.5 can be strengthened by proving convergence of $\frac{1}{n} N_L(t)$ rather than of its expected value. This requires further analysis, which we leave for future work. The result in Theorem 5.5 also encourages us to conjecture that the convergence results for bandwidth estimates hold for the bandwidth itself.

2. This result further strengthens the connection between graph-based learning methods and the semi-supervised smoothness assumption, since one can conclude that the number of labeled examples required for perfect prediction depends on the geometry of the data around the boundary. A low value of the density at the boundary results in a lower label complexity.

3. One might ask what the advantage of using graph-based methods for semi-supervised learning is, if we can predict the class labels by the simple labeling strategy used to compute label complexities in Observation 5.2. Note that our definition of label complexity is an ideal one which aims for perfect reconstruction. The power of graph-based methods would be more evident under a more practical definition of label complexity, where one tries to find the number of labels required for achieving a certain error tolerance. We leave this issue for future work.

In the following two sections, we first numerically validate our results through experiments and then provide theoretical proofs.

5.4 Numerical validation

We now present simple numerical analyses1 to validate our results and demonstrate their usefulness in practice. For simulating the separable model, we first consider a data distribution based on a 2D Gaussian Mixture Model (GMM) with two Gaussians: $\mu_1 = [-1\ 0]$, $\Sigma_1 = 0.25I$ and $\mu_2 = [1\ 0]$, $\Sigma_2 = 0.16I$, with mixing proportions $\alpha_1 = 0.4$ and $\alpha_2 = 0.6$ respectively. The probability density function is illustrated in Figure 5.3. Next, we evaluate the claim of Theorem 5.1 on five boundaries, described in Table 5.2. These boundaries are depicted in Figure 5.4 and are illustrative of typical separation assumptions, such as linear or non-linear and low or high density.

1 Code available at https://github.com/aamiranis/asymptotics_graph_ssl

Table 5.2: Illustrative boundaries used in the separable model.

Boundary | Description | $\sup_{s \in \partial S} p(s)$
$\partial S_1$ | $x = 0$ | 0.0607
$\partial S_2$ | $x = -1$ | 0.2547
$\partial S_3$ | $x = y^2 - 1$ | 0.2547
$\partial S_4$ | $y = 0$ | 0.5969
$\partial S_5$ | $x^2 + y^2 = 1$ | 0.5969

For simulating the nonseparable model, we first construct the following smooth (twice-differentiable) 2D probability density function:

$$q(x,y) = \begin{cases} \frac{3}{\pi}\left[1 - (x^2 + y^2)\right]^2, & x^2 + y^2 \leq 1, \\ 0, & x^2 + y^2 > 1. \end{cases} \qquad (5.27)$$

Note that data points (X, Y) can be sampled from this distribution by setting the coordinates $X = \sqrt{1 - U^{1/3}} \cos(2\pi V)$, $Y = \sqrt{1 - U^{1/3}} \sin(2\pi V)$, where $U, V \sim \text{Uniform}(0,1)$. We then use q(x,y) to define a nonseparable 2D model with mixture density $p(x,y) = \alpha_A\, p_A(x,y) + \alpha_{A^c}\, p_{A^c}(x,y)$, where $p_A(x,y) = q(x - 0.75,\, y)$, $p_{A^c}(x,y) = q(x + 0.75,\, y)$ and $\alpha_A = \alpha_{A^c} = 0.5$. The probability density function is illustrated in Figure 5.3. The overlap region or boundary ∂A for this model is given by

$$\partial A = \left\{(x,y) : (x - 0.75)^2 + y^2 < 1 \ \text{and}\ (x + 0.75)^2 + y^2 < 1\right\}. \qquad (5.28)$$
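The construction above translates directly into an inverse-CDF sampler; a minimal sketch (the 1/3 exponent comes from inverting the radial CDF $F(r) = 1 - (1 - r^2)^3$ of the density in (5.27)):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_q(n):
    """Draw n points from q(x,y) in (5.27): the radial CDF is
    F(r) = 1 - (1 - r^2)^3, so r = sqrt(1 - U^(1/3)) for U ~ Uniform(0,1)."""
    U, V = rng.random(n), rng.random(n)
    r = np.sqrt(1.0 - U ** (1.0 / 3.0))
    return np.column_stack((r * np.cos(2 * np.pi * V),
                            r * np.sin(2 * np.pi * V)))

def sample_nonseparable_mixture(n, alpha_A=0.5):
    """Mixture p = alpha_A q(x - 0.75, y) + alpha_Ac q(x + 0.75, y)."""
    in_A = rng.random(n) < alpha_A
    X = sample_q(n)
    X[:, 0] += np.where(in_A, 0.75, -0.75)   # shift each class's copy of q
    return X, in_A.astype(float)             # points and the indicator 1_A
```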
Figure 5.3: Probability density functions used to generate data for (a) the separable model, (b) the nonseparable model.

Further, for this model, we have $\sup_{\partial A} p(x) = 0.2517$.

In our first experiment, we validate the statements of Theorems 5.1 and 5.2 by comparing the left and right hand sides of (5.16) and (5.17) for the corresponding boundaries. This is carried out in the following way: we draw n = 2500 points from each model and construct the corresponding similarity graphs using σ = 0.1. Then, for the boundaries $\partial S_i$ in the separable model and ∂A in the nonseparable model, we carry out the following steps:

1. We first construct the indicator functions $1_{S_i}$ and $1_A$ on the corresponding graphs.

2. We then compute the empirical bandwidths $\omega(1_{S_i})$ and $\omega(1_A)$ in a manner that accounts for numerical error: we first obtain the eigenvectors of the corresponding L, then set $\omega(1_{S_i})$ and $\omega(1_A)$ to be the ν for which the energy contained in the Fourier coefficients corresponding to eigenvalues $\lambda_j > \nu$ is at most 0.01%, i.e.,

$$\omega(1_{S_i}) = \min\Big\{\nu \,\Big|\, \sum_{j: \lambda_j > \nu} \big(u_j^T 1_{S_i}\big)^2 \leq 10^{-4}\Big\}, \qquad (5.29)$$
$$\omega(1_A) = \min\Big\{\nu \,\Big|\, \sum_{j: \lambda_j > \nu} \big(u_j^T 1_A\big)^2 \leq 10^{-4}\Big\}. \qquad (5.30)$$

Figure 5.4: Boundaries $\{\partial S_i\}$ considered in the separable model.

The procedure above is repeated 100 times, and the means of $\omega(1_{S_i})$ and $\omega(1_A)$ are compared with $\sup_{s \in \partial S_i} p(s)$ and $\sup_{x \in \partial A} p(x)$ respectively. The result is plotted in Figure 5.5. We observe that the empirical bandwidth is close to the theoretically predicted value and has a very low standard deviation. This supports our conjecture that stochastic convergence should hold for the bandwidth itself. To further justify this claim, we study the behavior of the standard deviations of $\omega(1_{S_i})$ and $\omega(1_A)$ as a function of n in Figure 5.6, where we observe a decreasing trend consistent with our result.

Figure 5.5: Convergence of the empirical values of the bandwidths $\omega(1_{S_i})$ and $\omega(1_A)$ for different boundaries $\{\partial S_i\}$ and ∂A on the corresponding graphs. Dark shaded regions denote standard deviation over 100 experiments. Red bars indicate theoretical values.

Figure 5.6: Standard deviation of $\omega(1_{S_i})$ and $\omega(1_A)$ as a function of n.

For our second experiment, we validate the bound on the label complexity of graph-based SSL in Theorem 5.5 by reconstructing the indicator functions corresponding to $\partial S_3$ and ∂A from a fraction of labeled examples on the corresponding graphs. This is carried out as follows: for a given budget B, we find the set of points to label using pivoted column-wise Gaussian elimination on the eigenvector matrix U of L. This method ensures that the obtained labeled set guarantees perfect recovery for signals spanned by the first B eigenvectors of L [4].
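A sketch of this selection step, using column-pivoted QR as a numerically convenient equivalent of pivoted column-wise Gaussian elimination on the eigenvector matrix (a stand-in for the exact procedure of [4]; the pivot order identifies the vertices to label):

```python
import numpy as np
from scipy.linalg import qr

def select_labeled_set(L, budget):
    """Pick `budget` vertices that guarantee recovery of signals spanned by
    the first `budget` eigenvectors of L: the first B QR pivots of U_B^T
    make the corresponding B x B submatrix of U_B invertible."""
    lam, U = np.linalg.eigh(L)
    U_B = U[:, :budget]                      # first B eigenvectors
    _, _, piv = qr(U_B.T, pivoting=True)     # pivots = most informative rows
    return np.sort(piv[:budget])             # vertex indices to label
```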
We then recover the indicator functions from these labeled sets by solving the least squares problem in (5.7), followed by thresholding. The mean reconstruction error is defined as

$$E_{\text{mean}} = \frac{\text{number of mismatches on the unlabeled set}}{\text{size of the unlabeled set}}. \qquad (5.31)$$

We repeat the experiment 100 times by generating different graphs and plot the averaged $E_{\text{mean}}$ against the fraction of labeled examples. The result is illustrated in Figure 5.7. We observe that the error goes to zero as the fraction of labeled points grows beyond the theoretically predicted label complexity, as predicted. This reinforces the intuition that the bandwidth of class indicators is closely linked with the inherent geometry of the data.

Figure 5.7: Mean reconstruction errors averaged over 100 experiments for (a) $1_{S_3}$, and (b) $1_A$. Red-dashed lines indicate the theoretical label complexities of $\partial S_3$ and ∂A.

5.5 Summary

In this chapter, we provided an interpretation of the graph sampling theoretic approach to semi-supervised learning. Our work analyzed the bandwidth of class indicator signals with respect to the Laplacian eigenvector basis and revealed its connection to the underlying geometry of the dataset. This connection is useful in justifying graph-based approaches for semi-supervised and unsupervised learning problems, and provides a geometrical interpretation of the smoothness assumptions imposed in the bandlimited reconstruction approach. Specifically, our results have shown that an estimate of the bandwidth of class indicators converges to the supremum of the probability density on the class boundaries for the separable model, and on the overlap regions for the nonseparable model. This quantifies the connection between the assumptions of smoothness (in terms of bandlimitedness) and low density separation, since boundaries passing through regions of low data density result in lower bandwidth of the class indicator signals. We numerically validated these results through various experiments.

Our analysis also sheds light on the label complexity of graph-based semi-supervised learning problems. We showed that perfect prediction from a few labeled examples using a graph-based bandlimited interpolation approach requires the same amount of labeling as one would need to completely encompass the boundary or region of ambiguity. This indicates that graph-based approaches achieve the theoretical label complexity as dictated by the underlying geometry of the problem. We believe that the main potential of graph-based methods will be apparent in situations where one can tolerate a certain amount of prediction error, in which case such approaches shall require fewer labeled examples. We plan to investigate this as part of future work.

5.5.1 Future work

There are several directions in which our results can be extended. In this work, we only considered Gaussian-weighted graphs; an immediate extension would be to consider arbitrary kernel functions for computing graph weights, or density-dependent edge connections such as k-nearest neighbors [42]. Another possibility is to consider data defined on a subset of the d-dimensional Euclidean space. It would also be useful to have uniform convergence results over all possible boundaries. Moreover, our definition of label complexity considers perfect label prediction; it would be interesting to study trade-offs between prediction error, the number of labeled examples, and the impact of labeling noise. Finally, note that our results on label complexity are theoretical at this point – we assume complete knowledge of the indicator signal in order to quantify the amount of labeling required.
However, this result can be made useful in a practical sense if one were able to estimate a priori the bandwidth of a class indicator, as this would give an indication of "when to stop" labeling. One possible approach is to estimate the bandwidth of any signal through randomly (or possibly, cleverly) chosen samples:

$$\omega(f) \approx \omega_m(f) = \left(\frac{f^T L^m f}{f^T f}\right)^{1/m} \approx \left(\frac{f^T S^T S L^m S^T S f}{f^T S^T S f}\right)^{1/m}, \qquad (5.32)$$

where S is the sampling operator. Analyzing the convergence of this bandwidth estimate, along with the prediction error trade-offs, would give us a handle on how much labeling is required for achieving a particular error tolerance.

5.6 Proofs

We now present the proofs of Theorems 5.1 and 5.2 through a sequence of lemmas. The main idea is to perform a variance-bias decomposition of the bandwidth estimate and then prove the convergence of each term independently. Specifically, for the indicator vector $1_R \in \{0,1\}^n$ of any region $R \subset \mathbb{R}^d$, one can consider the random variable

$$\left(\omega_m(1_R)\right)^m = \frac{1_R^T L^m 1_R}{1_R^T 1_R} = \frac{\frac{1}{n} 1_R^T L^m 1_R}{\frac{1}{n} 1_R^T 1_R}. \qquad (5.33)$$

We study the convergence of this quantity by considering the numerator and denominator separately (it is easy to show that the fraction converges if both the numerator and denominator converge). By the strong law of large numbers, we conclude the following for the denominator as $n \to \infty$:

$$\frac{1}{n} 1_R^T 1_R \xrightarrow{a.s.} \int_{x \in R} p(x)\,dx. \qquad (5.34)$$

For the numerator, we decompose it into two parts – a variance term, for which we show stochastic convergence using a concentration inequality, and a bias term, for which we prove deterministic convergence.

5.6.1 Convergence of variance terms

Let $V = \frac{1}{n} 1_R^T L^m 1_R$. Then we have the following concentration result:

Lemma 5.1 (Concentration). For every ε > 0, we have

$$\Pr\left(|V - \mathbb{E}\{V\}| > \epsilon\right) \leq 2\exp\left(\frac{-[n/(m+1)]\,\sigma^{md}\,\epsilon^2}{2C^m\, \mathbb{E}\{V\} + \frac{2}{3}\,\epsilon\,\big|C^m - \sigma^{md}\,\mathbb{E}\{V\}\big|}\right), \qquad (5.35)$$

where $C = 2/(2\pi)^{d/2}$.

Proof. Recalling that $w_{ij} = K_{\sigma^2}(X_i, X_j)$, we begin by explicitly expanding $V = \frac{1}{n} 1_R^T (D - W)^m 1_R$ into the following summation:

$$V = \frac{1}{n^{m+1}} \sum_{i_1, i_2, \ldots, i_{m+1}} g\left(X_{i_1}, X_{i_2}, \ldots, X_{i_{m+1}}\right). \qquad (5.36)$$

The above expansion has the form of a V-statistic. Details on how to explicitly write the summation are given in Section 5.6.5. Note that g is composed of a sum of $2^m$ terms, each a product of m kernel functions that are non-negative. Therefore,

$$g \leq 2^m \|K\|_\infty^m = \left(\frac{2}{(2\pi\sigma^2)^{d/2}}\right)^m = \frac{C^m}{\sigma^{md}}. \qquad (5.37)$$

In order to apply a concentration inequality to V, we first re-write it in the form of a U-statistic by regrouping terms in the summation to remove repeated indices, as given in [35]:

$$V = \frac{1}{n^{(m+1)}} \sum_{(n,\, m+1)} g^*\left(X_{i_1}, X_{i_2}, \ldots, X_{i_{m+1}}\right), \qquad (5.38)$$

where $\sum_{(n,\, m+1)}$ denotes summation over all ordered (m+1)-tuples of distinct indices taken from the set $\{1, \ldots, n\}$, $n^{(m+1)} = n(n-1)\cdots(n-m)$ is the number of (m+1)-permutations of n, and $g^*$ is a convex combination of specific instances of g that absorbs the repeated indices (see the supplementary material for a complete expansion):

$$g^*(x_1, x_2, \ldots, x_{m+1}) = \frac{n^{(m+1)}}{n^{m+1}}\, g(x_1, x_2, \ldots, x_{m+1}) + (\text{terms with repeated indices}). \qquad (5.39)$$

Therefore, $g^*$ has the same upper bound as that of g derived in (5.37). Moreover, using the fact that $\mathbb{E}\{V\} = \mathbb{E}\{g^*\}$, we can bound the variance of $g^*$ as

$$\text{Var}\{g^*\} \leq \|g^*\|_\infty\, \mathbb{E}\{g^*\} = \frac{C^m}{\sigma^{md}}\, \mathbb{E}\{V\}. \qquad (5.40)$$

Finally, plugging the bound and variance of $g^*$ into Bernstein's inequality for U-statistics, as stated in [35, 33], we arrive at the desired result (5.35).

Note that as $n \to \infty$ and $\sigma \to 0$ with rates satisfying $(n\sigma^{md})/(mC^m) \to \infty$, we have $\Pr(|V - \mathbb{E}\{V\}| > \epsilon) \to 0$ for all ε > 0.
The continuous mapping theorem then allows us to conclude that $V^{1/m} \xrightarrow{p.} (\mathbb{E}\{V\})^{1/m}$.

5.6.2 Convergence of the bias term for the separable model

To evaluate the convergence of the bias terms, we require the following properties of the d-dimensional Gaussian kernel:

Lemma 5.2. If p(x) is twice differentiable, then

$$\int K_{\sigma^2}(x, y)\, p(y)\,dy = p(x) + O(\sigma^2). \qquad (5.41)$$

Proof. Using the substitution y = x + t, followed by a Taylor expansion about x, we have

$$\int K_{\sigma^2}(x, y)\, p(y)\,dy = \int \frac{1}{(2\pi\sigma^2)^{d/2}}\, e^{-\|t\|^2/2\sigma^2}\, p(x + t)\,dt$$
$$= \int \frac{1}{(2\pi\sigma^2)^{d/2}}\, e^{-\|t\|^2/2\sigma^2} \left(p(x) + t^T \nabla p(x) + \frac{1}{2}\, t^T \nabla^2 p(x)\, t + \cdots\right) dt$$
$$= p(x) + 0 + \frac{\sigma^2}{2}\, \text{Tr}\left(\nabla^2 p(x)\right) + \cdots = p(x) + O(\sigma^2),$$

where the third step follows from simple component-wise integration.

Lemma 5.3. If p(x) is twice differentiable, then

$$\int K_{a\sigma^2}(x, z)\, K_{b\sigma^2}(z, y)\, p(z)\,dz = K_{(a+b)\sigma^2}(x, y) \left(p\left(\frac{bx + ay}{a + b}\right) + O(\sigma^2)\right). \qquad (5.42)$$

Proof. Note that

$$K_{a\sigma^2}(x, z)\, K_{b\sigma^2}(z, y) = \frac{1}{(2\pi a\sigma^2)^{\frac{d}{2}}}\, e^{-\frac{\|x - z\|^2}{2a\sigma^2}} \cdot \frac{1}{(2\pi b\sigma^2)^{\frac{d}{2}}}\, e^{-\frac{\|z - y\|^2}{2b\sigma^2}}$$
$$= \frac{1}{(2\pi (a+b)\sigma^2)^{\frac{d}{2}}}\, e^{-\frac{\|x - y\|^2}{2(a+b)\sigma^2}} \cdot \frac{1}{\left(2\pi \frac{ab}{a+b}\sigma^2\right)^{\frac{d}{2}}}\, e^{-\frac{\left\|z - \frac{bx + ay}{a+b}\right\|^2}{2\left(\frac{ab}{a+b}\right)\sigma^2}}$$
$$= K_{(a+b)\sigma^2}(x, y)\; K_{\frac{ab}{a+b}\sigma^2}\!\left(\frac{bx + ay}{a + b},\, z\right).$$

Therefore, we have

$$\int K_{a\sigma^2}(x, z)\, K_{b\sigma^2}(z, y)\, p(z)\,dz = K_{(a+b)\sigma^2}(x, y) \int K_{\frac{ab}{a+b}\sigma^2}\!\left(\frac{bx + ay}{a+b},\, z\right) p(z)\,dz = K_{(a+b)\sigma^2}(x, y)\left(p\left(\frac{bx + ay}{a+b}\right) + O(\sigma^2)\right),$$

where the last step follows from Lemma 5.2.

In order to prove convergence for the separable model, we need the following results:

Lemma 5.4. If p(x) is Lipschitz continuous, then for a smooth hypersurface ∂S that divides $\mathbb{R}^d$ into $S_1$ and $S_2$, and whose radius of curvature is bounded below by τ > 0,

$$\lim_{\sigma \to 0} \frac{1}{\sigma} \int_{S_1} \int_{S_2} K_{\sigma^2}(x_1, x_2)\, p^\alpha(x_1)\, p^\beta(x_2)\,dx_1\,dx_2 = \frac{1}{\sqrt{2\pi}} \int_{\partial S} p^{\alpha+\beta}(s)\,ds, \qquad (5.43)$$

where α and β are positive integers. Moreover, for positive integers a, b, and α, β, α', β' such that $\alpha + \beta = \alpha' + \beta' = \gamma$, we have:

$$\lim_{\sigma \to 0} \frac{1}{\sigma} \int_{S_1} \int_{S_1} \left(K_{a\sigma^2}(x_1, x_2)\, p^\alpha(x_1)\, p^\beta(x_2) - K_{b\sigma^2}(x_1, x_2)\, p^{\alpha'}(x_1)\, p^{\beta'}(x_2)\right) dx_1\,dx_2 = \frac{\sqrt{b} - \sqrt{a}}{\sqrt{2\pi}} \int_{\partial S} p^\gamma(s)\,ds. \qquad (5.44)$$

The proof of this lemma is given in Section 5.6.6. We now prove the deterministic convergence of $\mathbb{E}\left\{\frac{1}{n} 1_S^T L^m 1_S\right\}$ in the following lemma:

Lemma 5.5. As $n \to \infty$, $\sigma \to 0$ such that $m/n\sigma \to 0$, $m\sigma^2 \to 0$, we have

$$\frac{1}{\sigma}\, \mathbb{E}\left\{\frac{1}{n} 1_S^T L^m 1_S\right\} \to \frac{t(m)}{\sqrt{2\pi}} \int_{\partial S} p^{m+1}(s)\,ds, \qquad (5.45)$$

where $t(m) = \sum_{r=1}^{m-1} \binom{m-1}{r} (-1)^r \left(\sqrt{r+1} - \sqrt{r}\right)$.

Proof. We evaluate $\mathbb{E}\left\{\frac{1}{n} 1_S^T L^m 1_S\right\}$ term by term by expanding $L^m$ as $(D - W)^{m-1}(D - W)$. Details on the intermediate steps of this expansion are given in Section 5.6.5. Using (5.41) repeatedly, we have for the first two terms of the expansion:

$$\frac{1}{\sigma}\, \mathbb{E}\left\{\frac{1}{n} 1_S^T D \cdots D (D - W) 1_S\right\} = \frac{1}{\sigma} \int_S \int_{S^c} K_{\sigma^2}(x, y)\, p^m(x)\, p(y)\,dx\,dy + O(\sigma) + O\left(\frac{m}{n\sigma}\right). \qquad (5.46)$$

For the rest of the terms, we also require the use of (5.42). However, in this case, we encounter several terms of the form $p(\theta x + (1-\theta)y)$ for some $\theta \in [0,1]$. Since $m\sigma^2 \to 0$ and p(x) is assumed to be Lipschitz continuous, we can approximate such terms by p(x) or p(y). Therefore, for all terms in the expansion of $(D - W)^{m-1}$ containing r > 1 occurrences of W (there are $\binom{m-1}{r}$ such terms), repeated use of (5.41) and (5.42) gives:

$$\frac{1}{\sigma}\, \mathbb{E}\left\{\frac{1}{n} 1_S^T [D^{m-1-r}, W^r](D - W) 1_S\right\} = \frac{1}{\sigma}\left[\int_S \int_S K_{r\sigma^2}(x, y)\, p^\alpha(x)\, p^\beta(y)\,dx\,dy - \int_S \int_S K_{(r+1)\sigma^2}(x, y)\, p^{\alpha'}(x)\, p^{\beta'}(y)\,dx\,dy\right] + O(\sigma) + O\left(\frac{m}{n\sigma}\right), \qquad (5.47)$$

where $\alpha + \beta = \alpha' + \beta' = m + 1$ and $[D^{m-1-r}, W^r]$ denotes an expression containing r occurrences of W and m−1−r occurrences of D. Now, using Lemma 5.4, we conclude that the right hand sides of (5.46) and (5.47) converge to $\frac{1}{\sqrt{2\pi}} \int_{\partial S} p^{m+1}(s)\,ds$ and $\frac{\sqrt{r+1} - \sqrt{r}}{\sqrt{2\pi}} \int_{\partial S} p^{m+1}(s)\,ds$, respectively, as $\sigma \to 0$ and $m/n\sigma \to 0$.
Putting everything together in the expansion of $\mathbb{E}\left\{\frac{1}{n} 1_S^T L^m 1_S\right\}$, we get the desired result.

Since $\sigma^{1/m} \to 1$, we have

$$\left(\mathbb{E}\left\{\frac{1}{n} 1_S^T L^m 1_S\right\}\right)^{1/m} = \sigma^{1/m} \left(\frac{1}{\sigma}\, \mathbb{E}\left\{\frac{1}{n} 1_S^T L^m 1_S\right\}\right)^{1/m} \to \left(\frac{t(m)}{\sqrt{2\pi}} \int_{\partial S} p^{m+1}(s)\,ds\right)^{1/m}. \qquad (5.48)$$

Finally, we note that as $m \to \infty$, we have

$$\left(\frac{\frac{t(m)}{\sqrt{2\pi}} \int_{\partial S} p^{m+1}(s)\,ds}{\int_S p(x)\,dx}\right)^{1/m} \longrightarrow \sup_{s \in \partial S} p(s). \qquad (5.49)$$

Therefore, we conclude for the separable model that

$$\omega_m(1_S) \to \sup_{s \in \partial S} p(s). \qquad (5.50)$$

5.6.3 Convergence of the bias term for the nonseparable model

For the nonseparable model, we need to prove the convergence of $\mathbb{E}\left\{\frac{1}{n} 1_A^T L^m 1_A\right\}$. This is illustrated in the following lemma:

Lemma 5.6. As $n \to \infty$, $\sigma \to 0$ such that $m/n \to 0$, $m\sigma^2 \to 0$, we have

$$\mathbb{E}\left\{\frac{1}{n} 1_A^T L^m 1_A\right\} \to \int \alpha_A\, \alpha_{A^c}\, p_A(x)\, p_{A^c}(x)\, p^{m-1}(x)\,dx. \qquad (5.51)$$

Proof. Similar to the proof of Lemma 5.5, we evaluate $\mathbb{E}\left\{\frac{1}{n} 1_A^T L^m 1_A\right\}$ term by term by expanding $L^m$ as $(D - W)^{m-1}(D - W)$. Details on the intermediate steps of this expansion are given in Section 5.6.5. Using (5.41) repeatedly, we have for the first two terms of the expansion:

$$\mathbb{E}\left\{\frac{1}{n} 1_A^T D \cdots D (D - W) 1_A\right\} = \int \alpha_A\, \alpha_{A^c}\, p_A(x)\, p_{A^c}(x)\, p^{m-1}(x)\,dx + O(\sigma^2) + O\left(\frac{m}{n}\right). \qquad (5.52)$$

Further, for all terms in the expansion of $(D - W)^{m-1}$ containing r > 1 occurrences of W (there are $\binom{m-1}{r}$ such terms), repeated use of (5.41) and (5.42) gives:

$$\mathbb{E}\left\{\frac{1}{n} 1_A^T [D^{m-1-r}, W^r](D - W) 1_A\right\} = O(\sigma^2) + O\left(\frac{m}{n}\right). \qquad (5.53)$$

Therefore, as $\sigma \to 0$, $m/n \to 0$, we get the desired result.

We finally note that as $m \to \infty$, we have

$$\left(\frac{\int \alpha_A\, \alpha_{A^c}\, p_A(x)\, p_{A^c}(x)\, p^{m-1}(x)\,dx}{\int_A p(x)\,dx}\right)^{1/m} \longrightarrow \sup_{x \in \partial A} p(x). \qquad (5.54)$$

Therefore, we conclude for the nonseparable model that

$$\omega_m(1_A) \to \sup_{x \in \partial A} p(x). \qquad (5.55)$$

Note that Lemma 5.6 for the special case m = 1 yields

$$\frac{1}{n} 1_A^T L 1_A \to \int \alpha_A\, \alpha_{A^c}\, p_A(x)\, p_{A^c}(x)\,dx, \qquad (5.56)$$

which proves Theorem 5.3.

5.6.4 Proof of Theorem 5.4

We begin by recalling the definition of the empirical spectral distribution (ESD) of L: $\mu_n(x) = \frac{1}{n} \sum_{i=1}^n \delta(x - \lambda_i)$, where $\{\lambda_i\}$ are the eigenvalues of L. For each x, $\mu_n(x)$ is a function of $X_1, \ldots, X_n$, and thus a random variable. Note that the fraction of eigenvalues of L below a constant t, and its expected value, can be computed from the ESD as

$$\frac{1}{n} N_L(t) = \int_0^t \mu_n(x)\,dx, \qquad (5.57)$$
$$\mathbb{E}\left\{\frac{1}{n} N_L(t)\right\} = \int_0^t \mathbb{E}\{\mu_n(x)\}\,dx. \qquad (5.58)$$

Therefore, to understand the behavior of the expected fraction of eigenvalues of L below t, we need to analyze the convergence of the expected ESD in the asymptotic limit. The idea is to show the convergence of the moments of $\mathbb{E}\{\mu_n(x)\}$ to the moments of a limiting distribution μ(x). Then, by a standard convergence result, $\mathbb{E}\{\mu_n(I)\} \to \mu(I)$ for intervals I. More precisely, let the ⇒ symbol denote weak convergence of measures; then we use the following result, which follows from the Weierstrass approximation theorem:

Lemma 5.7. Let $\mu_n$ be a sequence of probability measures and μ be a compactly supported probability measure. If $\int x^m \mu_n(dx) \to \int x^m \mu(dx)$ for all m ≥ 1, then $\mu_n \Rightarrow \mu$.

We then use the following result on the equivalence of different notions of weak convergence of measures [13, Theorem 25.2] in order to prove our result for cumulative distribution functions:

Lemma 5.8. $\mu_n \Rightarrow \mu$ if and only if $\mu_n(A) \to \mu(A)$ for every μ-continuity set A.

We begin by writing the m-th moment of $\mathbb{E}\{\mu_n(x)\}$:

$$\int x^m\, \mathbb{E}\{\mu_n(x)\}\,dx = \frac{1}{n} \sum_{i=1}^n \mathbb{E}\{\lambda_i^m\} = \mathbb{E}\left\{\frac{1}{n} \text{Tr}(L^m)\right\}. \qquad (5.59)$$

Now, note that $L^m = (D - W)^m = D^m + \sum_{k=1}^m \binom{m}{k} [D^{m-k}, W^k]$, where $[D^{m-k}, W^k]$ denotes product terms with m−k occurrences of D and k occurrences of W. Therefore, we have for the right hand side of (5.59):
$$\mathbb{E}\left\{\frac{1}{n} \text{Tr}(L^m)\right\} = \int \left(\int K(x_{i_1}, x_{i_2})\, p(x_{i_2})\,dx_{i_2}\right) \cdots \left(\int K(x_{i_1}, x_{i_{m+1}})\, p(x_{i_{m+1}})\,dx_{i_{m+1}}\right) p(x_{i_1})\,dx_{i_1} + O\left(\frac{m}{n}\right) \left(\text{expected value of other terms}\right). \qquad (5.60)$$

Using (5.41) repeatedly in the equation above, we get:

$$\mathbb{E}\left\{\frac{1}{n} \text{Tr}(L^m)\right\} = \int p^{m+1}(x)\,dx + O\left(\frac{m}{n}\right) + O(\sigma^2). \qquad (5.61)$$

Therefore, as $n \to \infty$ and $\sigma \to 0$, we have:

$$\int x^m\, \mathbb{E}\{\mu_n(x)\}\,dx \to \int p^m(x)\, p(x)\,dx. \qquad (5.62)$$

From the right hand side of the equation above, we conclude that the m-th moment of the expected ESD of L converges to the m-th moment of the distribution of the random variable Y = p(X), where p(x) is the probability density function of X. Moreover, since $p_Y(y)$ has compact support, $\mathbb{E}\{\mu_n(x)\}$ converges weakly to the probability density function $p_Y(y)$. Hence, the following can be said about the expected fraction of eigenvalues of L:

$$\mathbb{E}\left\{\frac{1}{n} N_L(t)\right\} = \int_0^t \mathbb{E}\{\mu_n(x)\}\,dx \longrightarrow \int_0^t p_Y(y)\,dy = \int_{p(x) \leq t} p(x)\,dx. \qquad (5.63)$$

This proves our claim in Theorem 5.4. Note that, to prove the stochastic convergence of the fraction itself rather than of its expected value, we would need a condition similar to those in Theorems 5.1 and 5.2 to hold for each moment. In that case, σ would go to 0 in a prohibitively slow fashion. We believe that this is an artifact of the methods we employ for proving the result. Hence, our conjecture is that the convergence result must hold for $\frac{1}{n} N_L(t)$ itself, and we leave the analysis of this statement for future work.

5.6.5 Expansions of $1_S^T L^m 1_S$ and $\mathbb{E}\left\{\frac{1}{n} 1_S^T L^m 1_S\right\}$

To expand $1_S^T L^m 1_S$ in terms of the elements $w_{ij}$ of W, we first write the expression for each product term. Since $L^m = \frac{1}{n^m}(D - W)^m$, there are $2^m$ such terms. Let us first introduce the following notation: $[D, W]_m$ denotes a product term containing the matrices D and W, such that there are m matrices in the product. Note that $L^m$ is essentially a summation of all possible $[D, W]_m$ with appropriate signs. Now, the explicit expression for $1_S^T [D, W]_m 1_S$ can be obtained using the following procedure:

1. All product terms have a form defined by the following template:

$$1_S^T [D, W]_m 1_S = \sum_{i_1, \ldots, i_{m+1}} (1_S)_{i_1}\, w_{i_1 i_2}\, w_{* i_3} \cdots w_{* i_m}\, w_{* i_{m+1}}\, (1_S)_*, \qquad (5.64)$$

where the locations with ∗ need to be filled with appropriate indices depending on the product term. Note that each $w_{ij}$ is contributed by either a D or a W, depending on its location in the expression.

2. We fill the locations one-by-one from left to right, using the following set of rules. Let $w_{ab}$ be the term preceding $w_{*c}$; then
- If $w_{ab}$ is contributed by D, then ∗ = a.
- If $w_{ab}$ is contributed by W, then ∗ = b.

3. Let $w_{a i_{m+1}}$ denote the term preceding $(1_S)_*$. Then we have the following rule:
- If $w_{a i_{m+1}}$ is contributed by D, then ∗ = a.
- If $w_{a i_{m+1}}$ is contributed by W, then ∗ = $i_{m+1}$.

The expansion of $1_S^T L^m 1_S$ can be found by summing up the expansions of the individual product terms $1_S^T [D, W]_m 1_S$. Recalling that $w_{ij} = K(X_i, X_j)$, we conclude

$$\frac{1}{n} 1_S^T L^m 1_S = \frac{1}{n^{m+1}} \sum_{i_1, \ldots, i_{m+1}} g(X_{i_1}, X_{i_2}, \ldots, X_{i_{m+1}}). \qquad (5.65)$$

The expression for $\mathbb{E}\left\{\frac{1}{n} 1_S^T L^m 1_S\right\}$ can be evaluated in a similar fashion, except that the summations are replaced by integrals. We first evaluate the expected value of the individual product terms $\mathbb{E}\left\{\frac{1}{n} 1_S^T [D, W]_m 1_S\right\}$ by the following rules:

1. The expected value of any product term can be expressed through the following template:

$$\mathbb{E}\left\{\frac{1}{n} 1_S^T [D, W]_m 1_S\right\} = \int \cdots \int \left(1_S(x_1)\, K(x_1, x_2)\, K(x_*, x_3) \cdots K(x_*, x_{m+1})\, 1_S(x_*)\right) p(x_1)\,dx_1 \cdots p(x_{m+1})\,dx_{m+1}, \qquad (5.66)$$

where the locations with ∗ need to be filled with appropriate indices depending on the product term.
Once again, each $K(x_i, x_j)$ is contributed by either a D or a W.

2. We fill the locations one-by-one from left to right, using the following set of rules. Let $K(x_a, x_b)$ be the term preceding $K(x_*, x_c)$. Then
- If $K(x_a, x_b)$ is contributed by D, then ∗ = a.
- If $K(x_a, x_b)$ is contributed by W, then ∗ = b.

3. Further, let $K(x_a, x_{m+1})$ be the term preceding $1_S(x_*)$. Then
- If $K(x_a, x_{m+1})$ is contributed by D, then ∗ = a.
- If $K(x_a, x_{m+1})$ is contributed by W, then ∗ = m + 1.

5.6.6 Proof of Lemma 5.4

The key ingredient required for evaluating the integrals in Lemma 5.4 involves selecting a radius R (< τ) as a function of σ that satisfies the following properties as σ → 0:

1. $R \to 0$,
2. $R/\sigma \to \infty$,
3. $R^2/\sigma \to 0$,
4. $\epsilon_R/\sigma \to 0$, where $\epsilon_R := \int_{\|z\| > R} K_{\sigma^2}(0, z)\,dz$.

A particular choice of R is given by $R = \sqrt{d\sigma^2 \log(1/\sigma^2)}$. Note that R → 0 as σ → 0. Further, substituting this expression into the tail bound for the norm of a d-dimensional Gaussian vector gives us:

$$\frac{\epsilon_R}{\sigma} = \frac{1}{\sigma} \int_{\|z\| > R} K_{\sigma^2}(0, z)\,dz \leq \frac{1}{\sigma} \left(\frac{\sigma^2 d}{R^2}\right)^{-d/2} e^{-\frac{R^2}{2\sigma^2} + \frac{d}{2}} = \frac{1}{\sigma} \left(e\sigma^2 \log(1/\sigma^2)\right)^{d/2}. \qquad (5.67)$$

Therefore, if d > 1, then $\epsilon_R/\sigma \to 0$ as σ → 0. Further, it is easy to ensure R < τ in the regime of σ used in our proofs.

We now consider the proof of equation (5.43). Let

$$I := \frac{1}{\sigma} \int_{S_1} \int_{S_2} K_{\sigma^2}(x_1, x_2)\, p^\alpha(x_1)\, p^\beta(x_2)\,dx_1\,dx_2. \qquad (5.68)$$

Further, let $[S_1]_R$ indicate the tubular region of thickness R adjacent to the boundary ∂S in $S_1$, i.e., the set of points in $S_1$ at a distance ≤ R from the boundary. Then, we have

$$I = \underbrace{\frac{1}{\sigma} \int_{[S_1]_R} p^\alpha(x_1) \int_{S_2} K_{\sigma^2}(x_1, x_2)\, p^\beta(x_2)\,dx_2\,dx_1}_{I_1} + \underbrace{\frac{1}{\sigma} \int_{[S_1]_R^c} p^\alpha(x_1) \int_{S_2} K_{\sigma^2}(x_1, x_2)\, p^\beta(x_2)\,dx_2\,dx_1}_{E_1}. \qquad (5.69)$$

$E_1$ is the error associated with approximating I by $I_1$, and exhibits the following behavior:

Lemma 5.9. $\lim_{\sigma \to 0} E_1 = 0$.

Proof. Note that

$$E_1 \leq \frac{1}{\sigma} (p_{\max})^\beta \int_{[S_1]_R^c} p^\alpha(x_1) \int_{S_2} K_{\sigma^2}(x_1, x_2)\,dx_2\,dx_1 \leq \frac{1}{\sigma} (p_{\max})^\beta \int_{[S_1]_R^c} p^\alpha(x_1) \left(\int_{\|z\| > R} K_{\sigma^2}(0, z)\,dz\right) dx_1 = \frac{\epsilon_R}{\sigma} (p_{\max})^\beta \int_{[S_1]_R^c} p^\alpha(x_1)\,dx_1 \leq \frac{\epsilon_R}{\sigma} (p_{\max})^{\alpha+\beta-1}. \qquad (5.70)$$

Using $\lim_{\sigma \to 0} \epsilon_R/\sigma = 0$, we get the desired result.

In order to analyze $I_1$, we need to define certain geometrical constructions (illustrated in Figure 5.8):

Definition 5.1 (Geometrical constructions for analyzing the integrals).

1. For each $x_1 \in [S_1]_R$, we define a transformation of coordinates as

$$x_1 = s_1 + r_1 n(s_1), \qquad (5.71)$$

where $s_1$ is the foot of the perpendicular dropped from $x_1$ onto ∂S, $r_1$ is the distance between $s_1$ and $x_1$, and $n(s_1)$ is the surface normal at $s_1$ (towards the direction of $x_1$). Since the minimum radius of curvature of ∂S is τ and R < τ, this mapping is injective.

2. For each $s_1 \in \partial S$, let $H^+_{s_1}$ denote the halfspace created by the plane tangent at $s_1$, on the side of $S_2$. Similarly, let $H^-_{s_1}$ denote the halfspace on the side of $S_1$, that is, $H^-_{s_1} = \mathbb{R}^d \setminus H^+_{s_1}$.

3. Let $W^+_{s_1}(x)$ denote an infinite slab of thickness x tangent to ∂S at $s_1$ and towards the side of $S_2$. Let $W^-_{s_1}(y)$ denote a similar slab of thickness y on the side of $S_1$.

4. Finally, for any x, let B(x, R) denote the Euclidean ball of radius R centered at x.

Figure 5.8: Geometrical constructions in Definition 5.1.

We now consider $I_1$; the main idea here is to approximate the integral over $S_2$ by an integral over the halfspace $H^+_{s_1}$.
We now consider the proof of equation (5.43). Let
$$I := \frac{1}{\sigma} \int_{S_1} \int_{S_2} K_{\sigma^2}(x_1, x_2)\, p^{\alpha}(x_1)\, p^{\beta}(x_2)\, dx_1\, dx_2. \qquad (5.68)$$
Further, let $[S_1]_R$ denote a tubular region of thickness $R$ adjacent to the boundary $\partial S$ in $S_1$, i.e., the set of points in $S_1$ at a distance at most $R$ from the boundary. Then we have
$$I = \underbrace{\frac{1}{\sigma} \int_{[S_1]_R} p^{\alpha}(x_1) \int_{S_2} K_{\sigma^2}(x_1, x_2)\, p^{\beta}(x_2)\, dx_2\, dx_1}_{I_1} + \underbrace{\frac{1}{\sigma} \int_{[S_1]_R^c} p^{\alpha}(x_1) \int_{S_2} K_{\sigma^2}(x_1, x_2)\, p^{\beta}(x_2)\, dx_2\, dx_1}_{E_1}. \qquad (5.69)$$
$E_1$ is the error associated with approximating $I$ by $I_1$, and it exhibits the following behavior.

Lemma 5.9. $\lim_{\sigma \to 0} E_1 = 0$.

Proof. Note that
$$E_1 \le \frac{1}{\sigma} (p_{\max})^{\beta} \int_{[S_1]_R^c} p^{\alpha}(x_1) \int_{S_2} K_{\sigma^2}(x_1, x_2)\, dx_2\, dx_1 \le \frac{1}{\sigma} (p_{\max})^{\beta} \int_{[S_1]_R^c} p^{\alpha}(x_1) \left(\int_{\|z\| > R} K_{\sigma^2}(0, z)\, dz\right) dx_1 = \frac{\mathcal{R}}{\sigma} (p_{\max})^{\beta} \int_{[S_1]_R^c} p^{\alpha}(x_1)\, dx_1 \le \frac{\mathcal{R}}{\sigma} (p_{\max})^{\alpha + \beta}. \qquad (5.70)$$
Using $\lim_{\sigma \to 0} \mathcal{R}/\sigma = 0$, we get the desired result.

In order to analyze $I_1$, we need to define certain geometrical constructions (illustrated in Figure 5.8), as follows.

Definition 5.1 (Geometrical constructions for analyzing the integrals).
1. For each $x_1 \in [S_1]_R$, we define a transformation of coordinates as
$$x_1 = s_1 + r_1 n(s_1), \qquad (5.71)$$
where $s_1$ is the foot of the perpendicular dropped from $x_1$ onto $\partial S$, $r_1$ is the distance between $s_1$ and $x_1$, and $n(s_1)$ is the surface normal at $s_1$ (towards the direction of $x_1$). Since the minimum radius of curvature of $\partial S$ is $\tau$ and $R < \tau$, this mapping is injective.
2. For each $s_1 \in \partial S$, let $H^+_{s_1}$ denote the halfspace created by the plane tangent at $s_1$, on the side of $S_2$. Similarly, let $H^-_{s_1}$ denote the halfspace on the side of $S_1$, that is, $H^-_{s_1} = \mathbb{R}^d \setminus H^+_{s_1}$.
3. Let $W^+_{s_1}(x)$ denote an infinite slab of thickness $x$ tangent to $\partial S$ at $s_1$, towards the side of $S_2$. Let $W^-_{s_1}(y)$ denote a similar slab of thickness $y$ on the side of $S_1$.
4. Finally, for any $x$, let $B(x, R)$ denote the Euclidean ball of radius $R$ centered at $x$.

Figure 5.8: Geometrical constructions in Definition 5.1.

We now consider $I_1$; the main idea here is to approximate the integral over $S_2$ by an integral over the halfspace $H^+_{s_1}$. Hence, we have
$$I_1 = \underbrace{\frac{1}{\sigma} \int_{[S_1]_R} p^{\alpha}(x_1) \int_{H^+_{s_1}} K_{\sigma^2}(x_1, x_2)\, p^{\beta}(x_2)\, dx_2\, dx_1}_{I_2} + \underbrace{\frac{1}{\sigma} \int_{[S_1]_R} p^{\alpha}(x_1) \int_{S_2 - H^+_{s_1}} K_{\sigma^2}(x_1, x_2)\, p^{\beta}(x_2)\, dx_2\, dx_1}_{E_2}, \qquad (5.72)$$
where $E_2$ is the error associated with the approximation. Therefore, we have
$$I = I_2 + E_2 + E_1. \qquad (5.73)$$
We now show that as $\sigma \to 0$, $I_2 \to \frac{1}{\sqrt{2\pi}} \int_{\partial S} p^{\alpha+\beta}(s)\, ds$ and $E_2 \to 0$.

Lemma 5.10. $\lim_{\sigma \to 0} I_2 = \frac{1}{\sqrt{2\pi}} \int_{\partial S} p^{\alpha+\beta}(s)\, ds$.

Proof. Using the change of coordinates $x_1 = s_1 + r_1 n(s_1)$, we have
$$I_2 = \frac{1}{\sigma} \int_{\partial S} \int_0^R p^{\alpha}(s_1 + r_1 n(s_1)) \left(\int_{H^+_{s_1}} K_{\sigma^2}(s_1 + r_1 n(s_1), x_2)\, p^{\beta}(x_2)\, dx_2\right) |\det J(s_1, r_1)|\, ds_1\, dr_1, \qquad (5.74)$$
where $J(s_1, r_1)$ denotes the Jacobian of the transformation. Now, an arc $PQ$ of length $ds$ at a distance $r_1$ away from $\partial S$ gets mapped to an arc $P'Q'$ on $\partial S$ whose length lies in the interval $[ds(1 - r_1/\tau),\, ds(1 + r_1/\tau)]$. Therefore, for all points within $[S_1]_R$, we have
$$\left(1 - \frac{R}{\tau}\right)^{d-1} \le |\det J(s_1, r_1)| \le \left(1 + \frac{R}{\tau}\right)^{d-1}. \qquad (5.75)$$
Further, since $p(x)$ is Lipschitz continuous with constant $L_p$, $p^{\alpha}(x)$ is also Lipschitz continuous, with some constant $L_{p,\alpha}$. Therefore, for any $x_1 \in [S_1]_R$, we have $p^{\alpha}(x_1) = p^{\alpha}(s_1) + O(L_{p,\alpha} R)$. This leads to the following simplification for $I_2$:
$$I_2 = \left(1 + O(R^{d-1})\right) \int_{\partial S} p^{\alpha}(s_1)\, I_3(s_1)\, ds_1 + O(R^d) \int_{\partial S} I_3(s_1)\, ds_1, \qquad (5.76)$$
where we defined
$$I_3(s_1) := \frac{1}{\sigma} \int_0^R \int_{H^+_{s_1}} K_{\sigma^2}(s_1 + r_1 n(s_1), x_2)\, p^{\beta}(x_2)\, dx_2\, dr_1. \qquad (5.77)$$
Note that every $x_2 \in H^+_{s_1}$ can be written as $s_2 + r_2 n(s_2)$, where $n(s_2) = -n(s_1)$. Hence, we get
$$I_3(s_1) = \int_{\mathbb{R}^{d-1}} \frac{1}{(2\pi\sigma^2)^{\frac{d-1}{2}}}\, e^{-\frac{\|s_1 - s_2\|^2}{2\sigma^2}}\, p^{\beta}(s_2 - r_2 n(s_1))\, ds_2 \times \frac{1}{\sigma} \int_0^R \int_0^{\infty} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(r_1 + r_2)^2}{2\sigma^2}}\, dr_1\, dr_2 = \left(\int_{\mathbb{R}^{d-1}} \frac{1}{(2\pi\sigma^2)^{\frac{d-1}{2}}}\, e^{-\frac{\|s_1 - s_2\|^2}{2\sigma^2}}\, p^{\beta}(s_2)\, ds_2 + O(R)\right) \times \frac{1}{\sigma} \int_0^R \int_0^{\infty} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(r_1 + r_2)^2}{2\sigma^2}}\, dr_1\, dr_2 = \left(p^{\beta}(s_1) + O(\sigma^2) + O(R)\right) \times \frac{1}{\sigma} \int_0^R \int_0^{\infty} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(r_1 + r_2)^2}{2\sigma^2}}\, dr_1\, dr_2, \qquad (5.78)$$
where we used the Lipschitz continuity of $p^{\beta}(x)$ in the second equality and applied Lemma 5.2 to arrive at the last step. Further, using the definition of the $Q$-function and integration by parts, we note that
$$\frac{1}{\sigma} \int_0^R \int_0^{\infty} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(r_1 + r_2)^2}{2\sigma^2}}\, dr_1\, dr_2 = \int_0^{R/\sigma} \int_0^{\infty} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{(x+y)^2}{2}}\, dx\, dy = \int_0^{R/\sigma} Q(y)\, dy = \Big[y\, Q(y)\Big]_0^{R/\sigma} - \int_0^{R/\sigma} y\, Q'(y)\, dy = \frac{R}{\sigma} Q\!\left(\frac{R}{\sigma}\right) + \frac{1}{\sqrt{2\pi}}\left(1 - e^{-R^2/2\sigma^2}\right).$$
Therefore,
$$I_3(s_1) = \left(p^{\beta}(s_1) + O(\sigma^2) + O(R)\right) \times \left(\frac{R}{\sigma} Q\!\left(\frac{R}{\sigma}\right) + \frac{1}{\sqrt{2\pi}}\left(1 - e^{-R^2/2\sigma^2}\right)\right). \qquad (5.79)$$
Combining (5.76) and (5.79), and using the fact that $R/\sigma \to \infty$ as $\sigma \to 0$ (from the definition of $R$), we get
$$\lim_{\sigma \to 0} I_2 = \frac{1}{\sqrt{2\pi}} \int_{\partial S} p^{\alpha+\beta}(s)\, ds, \qquad (5.80)$$
which concludes the proof.
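Lemma 5.10 can be sanity-checked numerically in $d = 1$, where $\partial S$ reduces to a single point and the surface integral to a point evaluation. The sketch below is our own illustration: the sets $S_1 = [-1, 0]$, $S_2 = [0, 1]$, the surrogate density $p(x) = 0.5 + 0.25\cos x$ (smooth, strictly positive; normalization is immaterial for the identity), and the exponents are arbitrary choices, not from the text.

import numpy as np

# d = 1 check of Lemma 5.10: as sigma -> 0,
#   I(sigma) = (1/sigma) * int_{S1} int_{S2} K_{sigma^2}(x1, x2) p^alpha(x1) p^beta(x2) dx1 dx2
# should approach p(0)^(alpha + beta) / sqrt(2*pi), since the boundary dS = {0} here.
p = lambda x: 0.5 + 0.25 * np.cos(x)   # smooth, strictly positive surrogate density
alpha, beta = 1, 2

def I(sigma, n=2000):
    h = 1.0 / n                         # midpoint-rule grids on S1 = [-1, 0] and S2 = [0, 1]
    x1 = -1.0 + (np.arange(n) + 0.5) * h
    x2 = (np.arange(n) + 0.5) * h
    K = np.exp(-np.subtract.outer(x1, x2) ** 2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)
    return (K * np.outer(p(x1) ** alpha, p(x2) ** beta)).sum() * h * h / sigma

for sigma in [0.2, 0.1, 0.05, 0.02]:
    print(f"sigma = {sigma:5.2f}   I = {I(sigma):.4f}")
print(f"limit from Lemma 5.10: {p(0.0) ** (alpha + beta) / np.sqrt(2 * np.pi):.4f}")

Only pairs $(x_1, x_2)$ within a few $\sigma$ of the boundary point contribute appreciably, which mirrors the tubular-neighborhood decomposition used in the proof.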
We now consider the error term $E_2$ and prove the following result.

Lemma 5.11. $\lim_{\sigma \to 0} E_2 = 0$.

Proof. Let us first rewrite $E_2$ as
$$E_2 = \frac{1}{\sigma} \int_{[S_1]_R} p^{\alpha}(x_1)\, I_4(x_1)\, dx_1, \qquad (5.81)$$
where we defined
$$I_4(x_1) := \int_{S_2 - H^+_{s_1}} K_{\sigma^2}(x_1, x_2)\, p^{\beta}(x_2)\, dx_2. \qquad (5.82)$$

Figure 5.9: Worst-case scenarios for the boundary $\partial S$ when (a) $S_1$ is a ball of radius $\tau$, (b) $S_2$ is a ball of radius $\tau$.

The key idea is to lower and upper bound $I_4(x_1)$ for all $x_1$ using worst-case scenarios and to evaluate the limits of the bounds. Note that $I_4(x_1)$ is largest in magnitude when $S_1$ or $S_2$ is a sphere of radius $\tau$, as illustrated in Figures 5.9a and 5.9b. We now make certain geometrical observations. For any $x_1 = s_1 + r_1 n(s_1) \in [S_1]_R$, we observe from Figure 5.9b that
$$I_4(x_1) \le \int_{W^-_{s_1}\left(\frac{R^2 - r_1^2}{2(\tau - r_1)}\right)} K_{\sigma^2}(x_1, x_2)\, p^{\beta}(x_2)\, dx_2 + \int_{B(x_1, R)^c} K_{\sigma^2}(x_1, x_2)\, p^{\beta}(x_2)\, dx_2 \le \int_{W^-_{s_1}(R')} K_{\sigma^2}(x_1, x_2)\, p^{\beta}(x_2)\, dx_2 + p_{\max}^{\beta}\, \mathcal{R}, \qquad (5.83)$$
where $R' = \frac{R^2}{2(\tau - R)}$. Similarly, from Figure 5.9a, we observe that
$$I_4(x_1) \ge -\left[\int_{W^+_{s_1}\left(\frac{R^2 - r_1^2}{2(\tau + r_1)}\right)} K_{\sigma^2}(x_1, x_2)\, p^{\beta}(x_2)\, dx_2 + \int_{B(x_1, R)^c} K_{\sigma^2}(x_1, x_2)\, p^{\beta}(x_2)\, dx_2\right] \ge -\left[\int_{W^+_{s_1}(R')} K_{\sigma^2}(x_1, x_2)\, p^{\beta}(x_2)\, dx_2 + p_{\max}^{\beta}\, \mathcal{R}\right]. \qquad (5.84)$$
Substituting these in (5.81) and using a simplification similar to that of $I_2$ in (5.76), we get
$$E_2 \le \left(1 + O(R^{d-1})\right) \int_{\partial S} p^{\alpha}(s_1)\, I_5^-(s_1)\, ds_1 + O(R^d) \int_{\partial S} I_5^-(s_1)\, ds_1 + \frac{\mathcal{R}}{\sigma}\, p_{\max}^{\alpha+\beta}, \qquad (5.85)$$
$$E_2 \ge -\left(1 + O(R^{d-1})\right) \int_{\partial S} p^{\alpha}(s_1)\, I_5^+(s_1)\, ds_1 - O(R^d) \int_{\partial S} I_5^+(s_1)\, ds_1 - \frac{\mathcal{R}}{\sigma}\, p_{\max}^{\alpha+\beta}, \qquad (5.86)$$
where we defined
$$I_5^-(s_1) := \frac{1}{\sigma} \int_0^R \int_{W^-_{s_1}(R')} K_{\sigma^2}(s_1 + r_1 n(s_1), x_2)\, p^{\beta}(x_2)\, dx_2\, dr_1, \qquad (5.87)$$
$$I_5^+(s_1) := \frac{1}{\sigma} \int_0^R \int_{W^+_{s_1}(R')} K_{\sigma^2}(s_1 + r_1 n(s_1), x_2)\, p^{\beta}(x_2)\, dx_2\, dr_1. \qquad (5.88)$$
Similar to the evaluation of $I_3(s_1)$ in (5.78), we have
$$I_5^+(s_1) = \left(p^{\beta}(s_1) + O(\sigma^2) + O(R)\right) \times \frac{1}{\sigma} \int_0^R \int_0^{R'} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(r_1 + r_2)^2}{2\sigma^2}}\, dr_1\, dr_2, \qquad (5.89)$$
$$I_5^-(s_1) = \left(p^{\beta}(s_1) + O(\sigma^2) + O(R)\right) \times \frac{1}{\sigma} \int_0^R \int_0^{R'} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(r_1 - r_2)^2}{2\sigma^2}}\, dr_1\, dr_2. \qquad (5.90)$$
We now evaluate the two one-dimensional integrals. First,
$$\frac{1}{\sigma} \int_0^R \int_0^{R'} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(r_1 + r_2)^2}{2\sigma^2}}\, dr_1\, dr_2 = \int_0^{R/\sigma} \int_0^{R'/\sigma} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{(x+y)^2}{2}}\, dx\, dy = \int_0^{R/\sigma} \left(Q(y) - Q\!\left(y + \frac{R'}{\sigma}\right)\right) dy = \int_0^{R/\sigma} Q(y)\, dy + \int_0^{R'/\sigma} Q(y)\, dy - \int_0^{\frac{R+R'}{\sigma}} Q(y)\, dy = \frac{R}{\sigma} Q\!\left(\frac{R}{\sigma}\right) + \frac{R'}{\sigma} Q\!\left(\frac{R'}{\sigma}\right) - \frac{R + R'}{\sigma} Q\!\left(\frac{R + R'}{\sigma}\right) + \frac{1}{\sqrt{2\pi}} \left(1 - e^{-\frac{R^2}{2\sigma^2}} - e^{-\frac{R'^2}{2\sigma^2}} + e^{-\frac{(R + R')^2}{2\sigma^2}}\right).$$
Similarly,
$$\frac{1}{\sigma} \int_0^R \int_0^{R'} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(r_1 - r_2)^2}{2\sigma^2}}\, dr_1\, dr_2 = \int_0^{R/\sigma} \int_0^{R'/\sigma} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{(x-y)^2}{2}}\, dx\, dy = \int_0^{R/\sigma} \left(Q\!\left(y - \frac{R'}{\sigma}\right) - Q(y)\right) dy = \int_{-R'/\sigma}^0 Q(y)\, dy + \int_0^{\frac{R - R'}{\sigma}} Q(y)\, dy - \int_0^{R/\sigma} Q(y)\, dy = \frac{R'}{\sigma} Q\!\left(-\frac{R'}{\sigma}\right) + \frac{R - R'}{\sigma} Q\!\left(\frac{R - R'}{\sigma}\right) - \frac{R}{\sigma} Q\!\left(\frac{R}{\sigma}\right) + \frac{1}{\sqrt{2\pi}} \left(e^{-\frac{R'^2}{2\sigma^2}} - 1 - e^{-\frac{(R - R')^2}{2\sigma^2}} + e^{-\frac{R^2}{2\sigma^2}}\right).$$
Noting that, as $\sigma \to 0$, $R/\sigma \to \infty$ and $R'/\sigma = \frac{R^2}{2\sigma(\tau - R)} \to 0$ (the latter because $R^2/\sigma \to 0$ and $R \to 0$), every term on the right-hand sides above vanishes, and we conclude that $\lim_{\sigma \to 0} E_2 = 0$.
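The first of these closed forms can be checked numerically. The sketch below is our own, with arbitrary values of $R$, $R'$ and $\sigma$; it compares a midpoint-rule evaluation of the double integral against the $Q$-function expression derived above.

from math import erfc, exp, pi, sqrt
import numpy as np

# Check: (1/s) * int_0^R int_0^{Rp} (2*pi*s^2)^(-1/2) * exp(-(r1 + r2)^2 / (2 s^2)) dr1 dr2
#   equals (R/s)Q(R/s) + (Rp/s)Q(Rp/s) - ((R+Rp)/s)Q((R+Rp)/s)
#          + (1/sqrt(2 pi)) * (1 - e^{-R^2/2s^2} - e^{-Rp^2/2s^2} + e^{-(R+Rp)^2/2s^2})
Q = lambda t: 0.5 * erfc(t / sqrt(2.0))    # Gaussian tail probability

def lhs(R, Rp, s, n=2000):
    r1 = (np.arange(n) + 0.5) * (R / n)    # midpoint grid on [0, R]
    r2 = (np.arange(n) + 0.5) * (Rp / n)   # midpoint grid on [0, Rp]
    G = np.exp(-np.add.outer(r1, r2) ** 2 / (2 * s * s))
    return G.sum() * (R / n) * (Rp / n) / (s * s * sqrt(2 * pi))

def rhs(R, Rp, s):
    return ((R / s) * Q(R / s) + (Rp / s) * Q(Rp / s)
            - ((R + Rp) / s) * Q((R + Rp) / s)
            + (1 - exp(-R**2 / (2*s*s)) - exp(-Rp**2 / (2*s*s))
                 + exp(-(R + Rp)**2 / (2*s*s))) / sqrt(2 * pi))

print(lhs(0.5, 0.1, 0.2), rhs(0.5, 0.1, 0.2))  # the two values should agree closely

The second closed form, with $(r_1 - r_2)^2$ in the exponent, can be verified in the same way by replacing np.add.outer with np.subtract.outer and substituting the corresponding right-hand side.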
The proof of (5.44) proceeds in a similar fashion, by approximating the inner integral using hyperplanes. Specifically, similar to the proof of (5.43), we can show that the integral on the left-hand side can be written as $I + E$, where
$$I := \frac{1}{\sigma} \int_{[S_1]_R} \int_{H^-_{s_1}} \left(K_{a\sigma^2}(x_1, x_2)\, p^{\alpha}(x_1)\, p^{\beta}(x_2) - K_{b\sigma^2}(x_1, x_2)\, p^{\alpha'}(x_1)\, p^{\beta'}(x_2)\right) dx_1\, dx_2, \qquad (5.91)$$
and $E$ is the residual associated with the approximation, which can be shown to go to zero as $\sigma \to 0$ (we skip this proof since it is quite similar to the analysis for (5.43)). In order to evaluate $I$, we perform the change of coordinates $x_1 = s_1 + r_1 n(s_1)$ as before to obtain
$$I = \frac{1}{\sigma} \int_{\partial S} \int_0^R \left[p^{\alpha}(s_1 + r_1 n(s_1)) \left(\int_{H^-_{s_1}} K_{a\sigma^2}(s_1 + r_1 n(s_1), x_2)\, p^{\beta}(x_2)\, dx_2\right) - p^{\alpha'}(s_1 + r_1 n(s_1)) \left(\int_{H^-_{s_1}} K_{b\sigma^2}(s_1 + r_1 n(s_1), x_2)\, p^{\beta'}(x_2)\, dx_2\right)\right] |\det J(s_1, r_1)|\, ds_1\, dr_1 = \int_{\partial S} p^{\alpha}(s_1)\, I_{\beta}(s_1)\, ds_1 - \int_{\partial S} p^{\alpha'}(s_1)\, I_{\beta'}(s_1)\, ds_1 + O(R^d), \qquad (5.92)$$
where we defined
$$I_{\beta}(s_1) := \frac{1}{\sigma} \int_0^R \int_{H^-_{s_1}} K_{a\sigma^2}(s_1 + r_1 n(s_1), x_2)\, p^{\beta}(x_2)\, dx_2\, dr_1, \qquad I_{\beta'}(s_1) := \frac{1}{\sigma} \int_0^R \int_{H^-_{s_1}} K_{b\sigma^2}(s_1 + r_1 n(s_1), x_2)\, p^{\beta'}(x_2)\, dx_2\, dr_1.$$
By using a change of coordinates for $x_2$ similar to the steps in (5.78), we obtain
$$I_{\beta}(s_1) = \left(p^{\beta}(s_1) + O(\sigma^2) + O(R)\right) \times \frac{1}{\sigma} \int_0^R \int_0^{\infty} \frac{1}{\sqrt{2\pi a\sigma^2}}\, e^{-\frac{(r_1 - r_2)^2}{2a\sigma^2}}\, dr_1\, dr_2, \qquad (5.93)$$
$$I_{\beta'}(s_1) = \left(p^{\beta'}(s_1) + O(\sigma^2) + O(R)\right) \times \frac{1}{\sigma} \int_0^R \int_0^{\infty} \frac{1}{\sqrt{2\pi b\sigma^2}}\, e^{-\frac{(r_1 - r_2)^2}{2b\sigma^2}}\, dr_1\, dr_2. \qquad (5.94)$$
The one-dimensional integrals can be evaluated as follows:
$$\frac{1}{\sigma} \int_0^R \int_0^{\infty} \frac{1}{\sqrt{2\pi a\sigma^2}}\, e^{-\frac{(r_1 - r_2)^2}{2a\sigma^2}}\, dr_1\, dr_2 = \sqrt{a} \int_0^{R/\sqrt{a}\sigma} \int_0^{\infty} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{(x-y)^2}{2}}\, dx\, dy = \sqrt{a} \int_0^{R/\sqrt{a}\sigma} Q(-y)\, dy = \sqrt{a} \int_0^{R/\sqrt{a}\sigma} (1 - Q(y))\, dy = \frac{R}{\sigma} - \frac{R}{\sigma} Q\!\left(\frac{R}{\sqrt{a}\sigma}\right) - \frac{\sqrt{a}}{\sqrt{2\pi}}\left(1 - e^{-R^2/2a\sigma^2}\right),$$
and similarly,
$$\frac{1}{\sigma} \int_0^R \int_0^{\infty} \frac{1}{\sqrt{2\pi b\sigma^2}}\, e^{-\frac{(r_1 - r_2)^2}{2b\sigma^2}}\, dr_1\, dr_2 = \frac{R}{\sigma} - \frac{R}{\sigma} Q\!\left(\frac{R}{\sqrt{b}\sigma}\right) - \frac{\sqrt{b}}{\sqrt{2\pi}}\left(1 - e^{-R^2/2b\sigma^2}\right).$$
Using the fact that $\alpha + \beta = \alpha' + \beta' = \gamma$, the divergent $R/\sigma$ terms in (5.92) cancel: their leading coefficient is $\int_{\partial S} p^{\gamma}(s)\, ds$ in both terms, while the $O(\sigma^2) + O(R)$ corrections multiplied by $R/\sigma$ are $O(R\sigma) + O(R^2/\sigma) \to 0$. Taking the limit $\sigma \to 0$ after putting everything together, we conclude
$$\lim_{\sigma \to 0} I = \frac{\sqrt{b} - \sqrt{a}}{\sqrt{2\pi}} \int_{\partial S} p^{\gamma}(s)\, ds. \qquad (5.95)$$