IMPROVING MACHINE LEARNING ALGORITHMS VIA EFFICIENT DATA RELEVANCE DISCOVERY

by

Dehua Cheng

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)

May 2018

Copyright 2018 Dehua Cheng

Acknowledgments

During my over five-year study at the University of Southern California, many people have provided invaluable guidance to me and to this work.

First, I would like to thank my advisor, Prof. Yan Liu, for her great guidance and timely support over the past five years. She has helped me become a capable independent researcher, and I am greatly fortunate to have Yan as my advisor. She holds us to a high standard in terms of research quality and guides me in a way that is tailored to my own strengths and interests. I would also like to thank my other committee members, Shang-Hua Teng and Jinchi Lv, for their valuable assistance and encouragement at all stages of this thesis. Moreover, I would like to thank Mahdi Soltanolkotabi and Greg Ver Steeg for their priceless suggestions on the thesis proposal. It is a great fortune for me to learn from these great researchers.

In addition, special thanks to my collaborators: Richard Peng at the Georgia Institute of Technology, whose brilliance at examining a subject from its most fundamental aspects made the collaboration an inspiring experience; Yu Cheng, who introduced me to the grace of theoretical computer science; Jie Chen, whose mentorship during my IBM internship was essential; and Xinran He, Natali Ruchansky, Michael Tsang, Mohammad Taha Bahadori, and Qi Yu at the Melady Lab, whose help and support were crucial to this work. I would also like to thank other members and former members of the Melady Lab: Zhengping Che, Umang Gupta, He Jiang, Bo Jiang, Nitin Kamra, Yaguang Li, Guangyu Li, Hanpeng Liu, Tanachat Nilanon, Sanjay Purushotham, Sungyong Seo, Wilka Carvalho, Marjan Ghazvininejad, Michael Hankin, David Kale, Wenzhe Li, and Daniel Moyer.

Pursuing the Ph.D. would have been an arduous journey without the company of good friends. I owe many thanks to my good friends, Shi Xing, Kuan Liu, Guo Yi, Qiang Ruixin, Dingxiong Deng, Yihuan Shao, Zhiyun Lu, and many others, for their friendship and support.

Last, but definitely not least, I would like to thank my parents and my wife for their love and support, without which I would not have survived the Ph.D. process.

Contents

Acknowledgments
Contents
List of Figures
List of Tables
Abstract

1 Introduction
  1.1 Summary of thesis work
    1.1.1 Fast tensor CP decomposition
    1.1.2 Random-walk matrix-polynomial sparsification
    1.1.3 Low-rank model completability analysis
  1.2 Thesis statement
  1.3 Thesis outline
  1.4 Related publications

2 Preliminaries and Related Work
  2.1 Notations
  2.2 The Monte Carlo principle
    2.2.1 Quality of the Monte Carlo simulation
    2.2.2 Importance sampling in Monte Carlo simulation
    2.2.3 Example: randomized matrix multiplication
    2.2.4 Example: least squares regression
    2.2.5 Example: spectral approximation of graph Laplacian
  2.3 Low-rank matrix completion and finite completability
    2.3.1 Completability with deterministic observed entries
    2.3.2 Finite completability

3 Tensor Decomposition
  3.1 Notation and background
    3.1.1 Tensor matricization
    3.1.2 Special matrix products
    3.1.3 Tensor CP decomposition
    3.1.4 Alternating least squares algorithm
  3.2 Near-optimal leverage score estimation for Khatri-Rao product
    3.2.1 Leverage score sampling for $\ell_2$-regression
    3.2.2 Near-optimal leverage score estimation
  3.3 SPALS: sampling alternating least squares
    3.3.1 Sampling alternating least squares
    3.3.2 Approximation guarantees
    3.3.3 Extensions on other tensor related applications
  3.4 Related works
  3.5 Experimental results
    3.5.1 Dense synthetic tensors
    3.5.2 Sparse data tensor
  3.6 Discussion
  3.7 Proof for Theorem 3.4

4 Random-Walk Matrix-Polynomial Sparsification
  4.1 Introduction
    4.1.1 Some applications
  4.2 Background and notation
  4.3 Random-walk matrix polynomial sparsification
    4.3.1 Sparsification by random walks
    4.3.2 Multi-step sparsification of higher-degree random walks
    4.3.3 Extension to SDDM matrices

5 Matrix Completion
  5.1 Introduction
  5.2 Preliminaries and related work
  5.3 Identifying completability
    5.3.1 Connectivity for completability
    5.3.2 Empirical evidence for connectivity as a completability criterion
  5.4 Finding k-connected components
  5.5 Details on the MaxKCD algorithm
    5.5.1 The kCut algorithm
  5.6 Proof of correctness for Batch-EarlyMerging
    5.6.1 Notes on implementation
  5.7 Experiments
    5.7.1 Traditional completion task
    5.7.2 Controlled matrix completion
    5.7.3 Comparison with baselines
    5.7.4 Completability over time
    5.7.5 Further experiments
  5.8 Conclusion

6 Conclusion and Future Work
  6.1 Conclusion
  6.2 Future work

Index
Reference List

List of Figures

4.1 Pseudocode for Sampling by Effective Resistances.
4.2 Pseudocode for Sparsification of Random Walk Matrices.
5.1 Traditional workflow (top) and two possible proposed workflows where completability is applied (1) as a post-facto analysis of which estimates are reliable; (2) a priori to guide completion.
5.3 The RMSE of completable and non-completable entries on the test set as a function of $k_{KCC}$, with varying $k_{MC}$ for Amazon and Netflix, and $k_{MC} = 20$ for DBLP and LibimSeTi.
5.4 Figures (a) and (b) show a comparison between CompleteID and SmartDensity with 25% and 50% of Netflix. Figures (c) and (d) show the RMSE of completable and non-completable entries as a function of time for Netflix and Foursquare with $k = 10$.

List of Tables

3.1 Running times per iteration in seconds and errors of various alternating least squares implementations.
3.2 Relative error and running times per iteration on the Amazon review tensor with dimensions 2.44e6 x 6.64e6 x 9.26e4 and 2.02 billion nonzeros.
5.1 Evaluation of the $k$-connectivity criterion with respect to Exact and baselines. Pr as Precision, Rec as Recall, and 0.00 as < 1e-3.
5.2 Comparison of relative Frobenius error over completable entries using different matrix completion algorithms.
5.3 Matrix completion dataset statistics.
5.4 Completability over small datasets with a rank-20 synthetic matrix and $k_{KCC} = k_{MC} = 20$.

Abstract

This is the era of big data, where both challenges and opportunities lie ahead for machine learning research. Data are created nowadays at an unprecedented pace, and collecting, storing, and computing with data at this scale comes at a significant cost. As the computational power that we possess gradually plateaus, it is an ever-increasing challenge to fully utilize the wealth of big data, where better data reduction techniques and scalable algorithms are the keys to a solution. We observe that, to answer a certain query, the data are not equally important: it is possible to complete the task at hand with fewer but relevant data without compromising accuracy. Based on the model and the query, we provide efficient access to numerical scores for the data points that represent their relevance to the current task. This enables us to devote the computational resources to the important data, which improves both scalability and reliability.

We present our work with three applications. We developed efficient numerical algorithms for tensor CP decomposition and random-walk matrix-polynomial sparsification, where we provide efficient access to the statistical leverage scores for faster randomized numerical routines.
We observe that model structures, such as low-rank models, can be leveraged to provide an efficient approximation to data relevance. We also present our matrix completability analysis framework, which is based on both the data and the model, both of which are structured. We show that the underlying completability pattern can be utilized to achieve a more reliable estimation of missing entries.

Chapter 1

Introduction

This is the era of big data, where both challenges and opportunities lie ahead for machine learning research. Data are created nowadays at an unprecedented pace. For example, popular social networks serve over 1 billion daily active users, where the user-created content and interactions help to personalize advertisement distribution; ridesharing companies collect billions of trip records to improve ride scheduling. However, big data come with a substantial cost: collecting, storing, and computing with data at the current scale poses a significant challenge for companies of all sizes.

Machine learning algorithms benefit directly from more data [VV98]. The performance of many machine learning algorithms, without additional modification, can improve significantly with more data. Meanwhile, we can also employ more sophisticated machine learning models without the danger of overfitting, which is crucial for many artificial intelligence (AI) related tasks [BCV13]. However, learning also becomes a more computationally demanding task as the computational power that we possess gradually plateaus. To fully utilize the wealth of big data is an ever-increasing challenge. Balancing the computational budget against algorithm performance, smarter ways to handle the increasing amount of data have been proposed, e.g., better data reduction techniques and scalable machine learning algorithms.

Not all data points are of equal importance. For a given machine learning task, it is possible that a subset of the data plays a disproportionately important role. For example, in the renowned support vector machine [CV95], the support vectors represent a subset of data points that are used in forming the decision boundary, while non-support vectors do not influence the decision boundary directly. Understanding the relevance of individual data points provides valuable information about the task at hand. It can help us interpret the learned model by examining the most relevant data points [AP94, BM06, BT11, KRS14], as well as reduce computational cost if we can exclude the irrelevant data randomly or deterministically during or before model training [NSW14, ZZ15].

One of the key limitations for utilizing data relevance is that the relevance itself is difficult to obtain a priori. For example, the statistical leverage score [Mah11], which is widely used to guide data reduction in least squares optimization, is expensive to compute. It requires solving a large linear system, which is often equivalent to solving the original optimization problem. Approximating data relevance efficiently has been an active research area [Mah11, SS11, LMP13]. Most existing works, while sharing many common intuitions, are developed case by case for different applications, where the numerical routines involved are often simple. Similar techniques have been applied to improving machine learning algorithms with simple models, while for more complicated cases such application has been challenging. Intuitively, data relevance is challenging to obtain from complicated models, which are oftentimes what we face in practice.
However, we observe that for some widely used models, the model structure itself can offer ways to efficiently approximate the data relevance, where the ambient model complexity acts to our advantage. For example, for a low-CP-rank tensor [KB09], one can represent the tensor through a special product between factor matrices, which are smaller than the ambient size of the tensor (and the difference in size increases exponentially as the order of the tensor increases). By analyzing the factor matrices and the relevant matrix product, we can approximate the data relevance of the original tensor efficiently. We identify the following numerical models: low-rank matrices and tensors, and random-walk matrix-polynomials. By examining their model structures, we accomplish the following two objectives: 1) efficient data reduction with provable and near-optimal sample complexity, and 2) query-specific data relevance analysis. Hereafter, we motivate these two objectives and provide further details.

Efficient near-optimal Monte Carlo with high-order model structure

Designing fast algorithms for large-scale machine learning models has become one of the most important research areas in machine learning. While there is no universal solution, sampling (or Monte Carlo) has succeeded in many machine learning tasks that aggregate a large number of samples to extract information. By randomly selecting a smaller but representative subset of objects, we can achieve the same goal approximately. One common example is the calculation of the objective function $\mathcal{L}$, which in many cases is defined as a summation of the loss functions of individual data points, i.e., $\mathcal{L}(\cdot) = \sum_i f_i(\cdot)$. Stochastic gradient descent (SGD) and its variants [B+15], which have been widely accepted as the go-to optimization algorithms in large-scale machine learning applications, take advantage of this form to compute gradients more efficiently. The approximation quality of sampling methods is oftentimes guaranteed by the law of large numbers, where the quality of the approximation can be fine-tuned by the size of the sampled subset and the method for selecting the subset.

Uniform sampling has been the go-to sampling method for machine learning algorithms due to its simplicity and computational efficiency. But uniform sampling can be suboptimal in terms of the required number of samples to achieve a target accuracy. The size of the sampled representative subset has a direct influence on computational expenses in subsequent tasks. Non-uniform sampling [ADFDJ03, NSW14, ZZ15, VHF16], especially the importance sampling framework, can achieve better sample efficiency. It is especially important when the subsequent task scales superlinearly in the number of data points. For many models, state-of-the-art algorithms can yield accurate data importance scores in time nearly linear in the number of data points, which has enabled several nearly-linear time algorithms [ST14, Mah11, NSW14, PS14].

However, the number of "data points" to be sampled can be extremely large, such that simply storing the data importance scores would exhaust the computational budget, prohibiting any non-uniform operations. While the budget is commonly at least the size of the input data, such "data points" can often be generated algorithmically as intermediate results during runtime. For instance, consider a random walk with transition matrix $T$ on a graph of size $n$.
Then a single step with the transition matrix $T^p$ can be viewed as a superposition of up to $n^{p-1}$ $p$-step random walks on the original graph. Another example is the number of linear constraints when performing the alternating least squares algorithm to solve low-rank tensor decomposition with a sparse input tensor. In both cases, the objects to be sampled are of higher order, i.e., they are constructed from basic building blocks (single-step random walks, or factor matrices in tensor CP decomposition).

Most existing non-uniform Monte Carlo algorithms cannot be directly applied in this scenario, as their computational cost is at least linear in the input size. By incorporating model-side information, i.e., the generation process for the "data points", we develop Monte Carlo sampling probabilities for the intermediate "data points" with both nearly-optimal sample complexity and efficient sample-drawing procedures. We present two works that exemplify our approach: 1) fast tensor CP decomposition, and 2) random-walk matrix-polynomial sparsification. We propose efficient solutions, with computational complexity nearly linear or sublinear in the input size, to these two useful numerical routines.

Efficient low-rank model completability analysis

Efficiently excluding irrelevant data from the subsequent computation can directly reduce the overall computational cost. For example, when querying certain statistics about the male population, we can directly skip female individuals to save computation. Similar operations can be accomplished in less obvious settings. When we possess excessive data, it is anticipated that some machine learning tasks can be completed with partial data. However, it is non-trivial to identify the relevant data for a given query, which involves analyzing the specific query and the machine learning model in use.

Moreover, there are more profound implications aside from computational speed-ups. A certain number of data points is often required to complete a machine learning task. However, the ambient size of the data might not be directly relevant to the task at hand. Excluding irrelevant data is crucial to correctly evaluating whether the task can be accomplished (to the required accuracy).

This scenario is perfectly exemplified by the low-rank matrix completion problem, a tool that has been widely adopted across a variety of applications. The problem takes as input a partially-observed matrix and asks for a completion of the missing values such that the estimate is low rank. When requesting an estimate for a missing entry, the relevance of the observed entries is determined jointly by the underlying low-rank model and the other observed entries. We can easily construct worst-case examples in which there exist missing entries that are not completable even in a densely observed low-rank matrix. Moreover, when querying different missing entries, the data relevance can also differ. We present our work to address the task of completable submatrix discovery: to identify submatrices that are accurately completable. To solve the problem in a practical and theoretically supported manner, we identify a criterion for matrix completability by using edge-connectivity on the associated graph.

1.1 Summary of thesis work

1.1.1 Fast tensor CP decomposition

Tensor decomposition can be viewed as the generalization of low-rank approximation to higher-order arrays.
It is used in a wide range of applications, including signal processing [DLDM98, DSL08], computer vision [VT02b, VT02a], data mining [ACKY05], and more. On the other hand, the non-convex nature of most tensor processing tasks has motivated the development of a variety of algorithmic techniques [KB09, GHJY15, BKS15].

The large sizes of tensors in many applications also place emphasis on algorithmic efficiency. Compared to matrix computations, the tensor decomposition problem is further accentuated by the tensor dimensions. For a modest data set of $10^3$ points, a square covariance matrix only has about $10^6$ entries, while the third-order moment tensor has $10^9$. The situation is similar for data tensors; for instance, the Amazon review data [ML13] yield a 2,440,972 x 6,643,571 x 92,626 tensor after preprocessing, with 2,025,872,645 nonzero entries. Therefore, even some of the simplest tensor decomposition tasks face the challenge of scalability.

We show ways of sampling intermediate steps of alternating minimization algorithms for computing low-rank tensor CP decompositions, leading to the sparse alternating least squares (SPALS) method. Specifically, we sample the Khatri-Rao product, which arises as an intermediate object during the iterations of alternating least squares. This product captures the interactions between different tensor modes and forms the main computational bottleneck for solving many tensor related tasks. By exploiting the spectral structure of the matrix Khatri-Rao product, we provide efficient access to its statistical leverage scores. When applied to tensor CP decomposition, our method leads to the first algorithm that runs in sublinear time per iteration and approximates the output of deterministic alternating least squares algorithms. Empirical evaluations of this approach show significant speed-ups over existing randomized and deterministic routines for performing CP decomposition.

1.1.2 Random-walk matrix-polynomial sparsification

We consider a fundamental algorithmic question in spectral graph theory: compute a spectral sparsifier of a random-walk matrix-polynomial
\[
L_\alpha(G) = D - \sum_{r=1}^{d} \alpha_r D (D^{-1}A)^r,
\]
where $A$ is the adjacency matrix of a weighted, undirected graph, $D$ is the diagonal matrix of weighted degrees, and $\alpha = (\alpha_1, \ldots, \alpha_d)$ are nonnegative coefficients with $\sum_{r=1}^{d} \alpha_r = 1$. Recall that $D^{-1}A$ is the transition matrix of random walks on the graph. In its linear form (when $d = 1$), the matrix polynomial becomes $D - A$, which is a Laplacian matrix, and hence this problem becomes the standard spectral sparsification problem, which enjoys nearly-linear time solutions [ST11, SS11]. However, the sparsification of $L_\alpha(G)$ appears to be algorithmically challenging, as the matrix power $(D^{-1}A)^r$ is defined by all paths of length $r$, whose precise calculation would be prohibitively expensive (due to the cost of matrix multiplication and densification in the matrix powers).

In this work, we develop the first nearly-linear time algorithm for this sparsification problem: for any $G$ with $n$ vertices and $m$ edges, $d$ coefficients $\alpha$, and $\epsilon > 0$, our algorithm runs in time $O(d^2 \cdot m \cdot \log^2 n / \epsilon^2)$ to construct a Laplacian matrix $\tilde{L} = D - \tilde{A}$ with $O(n \log n / \epsilon^2)$ non-zeros such that
\[
\tilde{L} \approx_\epsilon L_\alpha(G) = D - \sum_{r=1}^{d} \alpha_r D (D^{-1}A)^r.
\]
In the equation, $\tilde{L} \approx_\epsilon L_\alpha(G)$ denotes that $\tilde{L}$ and $L_\alpha(G)$ are spectrally similar within a factor of $1 \pm \epsilon$, as defined in [ST11].

Matrix polynomials arise in the mathematical analysis of matrix functions as well as in numerical solutions of matrix equations (such as Newton's method).
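To make the object defined above concrete, here is a minimal sketch that constructs $L_\alpha(G)$ densely for a small toy graph and checks its basic Laplacian-like properties. The graph, the coefficients $\alpha$, and the use of dense matrices are illustrative assumptions; avoiding the explicit matrix powers is precisely the point of the sparsification algorithm summarized here.

```python
import numpy as np

def random_walk_polynomial(A, alpha):
    """Return L_alpha(G) = D - sum_r alpha_r * D (D^{-1} A)^r for a dense adjacency A."""
    assert abs(sum(alpha) - 1.0) < 1e-12, "coefficients must sum to 1"
    deg = A.sum(axis=1)
    D = np.diag(deg)
    W = A / deg[:, None]                 # transition matrix D^{-1} A
    L = D.astype(float)
    Wr = np.eye(len(A))
    for a_r in alpha:                    # r = 1, ..., d
        Wr = Wr @ W                      # (D^{-1} A)^r
        L = L - a_r * D @ Wr
    return L

# A small weighted, undirected toy graph on 5 vertices.
A = np.array([[0, 2, 1, 0, 0],
              [2, 0, 1, 1, 0],
              [1, 1, 0, 3, 1],
              [0, 1, 3, 0, 2],
              [0, 0, 1, 2, 0]], dtype=float)

L = random_walk_polynomial(A, alpha=[0.5, 0.3, 0.2])   # degree d = 3
# L_alpha(G) has zero row sums, is symmetric, and is positive semidefinite.
print("row sums ~ 0:", np.allclose(L.sum(axis=1), 0))
print("symmetric:   ", np.allclose(L, L.T))
print("min eigenvalue:", np.linalg.eigvalsh(L).min())
```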
Our work is particularly motivated by the algorithmic problems of speeding up the classic Newton's method in applications such as computing the inverse square root of the precision matrix of a Gaussian random field (in order to obtain i.i.d. random samples from the graphical model), as well as computing the $q$-th root transition (for $q \geq 1$) in a time-reversible Markov model. The key algorithmic step for both applications is the construction of a spectral sparsifier of a constant-degree random-walk matrix-polynomial introduced by Newton's method. (In numerical algorithms where random-walk matrix-polynomials arise, the degree $d$ of the polynomials is usually either a constant or bounded above by a polylogarithmic function in $n$.) Our sparsification algorithm leads to a simpler and faster algorithm for these problems than the previous one [CCL+14], which circumvents the challenging problem of sparsifying high-degree random-walk matrix polynomials at the cost of slower convergence and a more complex approximation. Our algorithm can also be used to build efficient data structures for effective resistances for multi-step time-reversible Markov models.

1.1.3 Low-rank model completability analysis

The problem of low-rank matrix completion is continually attracting attention for its applicability to many real-world problems. Still, the large size, extreme sparsity, and non-uniformity of these matrices pose challenges. In this work, we make the observation that even when the observed matrix is not suitable for accurate completion, there may be portions of the data where completion is still possible. We propose the CompleteID algorithm, which exploits the non-uniformity of the observations to analyze the completability of the input instead of blindly applying completion. Balancing statistical accuracy with computational efficiency, we relate completability to the edge-connectivity of the graph associated with the input partially-observed matrix. We develop the MaxKCD algorithm for finding maximally k-edge-connected components efficiently. Experiments across datasets from a variety of applications demonstrate not only the success of CompleteID but also the importance of completability analysis.

1.2 Thesis statement

In the thesis proposal, it is hypothesized that we can efficiently obtain the data relevance for a machine learning task by examining the model structure. Specifically, for numerical models such as low-rank matrices, low-rank tensors, and random-walk matrix-polynomials, which can be decomposed into simple building blocks, the data relevance information can be efficiently obtained by analyzing their structures. Based on the thesis work, we conclude that the statement holds true in general. More specifically, we conclude that

• The Khatri-Rao product captures the interactions between different tensor modes and forms the main computational bottleneck for solving many tensor related tasks. By exploiting the spectral structure of the matrix Khatri-Rao product, we provide efficient access to its statistical leverage scores by computing the statistical leverage scores of the individual factor matrices.

• The effective resistance captures the importance of an edge in preserving the quadratic form of a graph Laplacian. The effective resistance of a high-degree random-walk matrix-polynomial can be efficiently estimated by examining the effective resistances of the degree-1 and degree-2 random-walk matrix-monomials.
• In a low-rank matrix completion problem, the relevance of an observed entry w.r.t. a missing entry can be obtained by examining the underlying bipartite graph structure built from the observed entries. The identifiability of a missing entry can be reduced to edge-connectivity on the bipartite graph, where the lack of a subgraph with high edge-connectivity indicates the existence of non-unique solutions.

1.3 Thesis outline

In Chapter 2, we define the notations used throughout the thesis. We also introduce the Monte Carlo principle and low-rank matrix completion as background for later chapters. In Chapter 3, we present our work on fast tensor CP decomposition. In Chapter 4, we present our findings on random-walk matrix-polynomial sparsification. In Chapter 5, we discuss our research on efficient low-rank matrix completability analysis. In Chapter 6, we summarize our work and discuss future research directions.

1.4 Related publications

Parts of the thesis work have been published at major machine learning conferences:

• Cheng, Dehua, Natali Ruchansky, and Yan Liu. Matrix completability analysis via graph k-connectivity. In Artificial Intelligence and Statistics, 2018.

• Cheng, Dehua, Richard Peng, Yan Liu, and Ioakeim Perros. SPALS: Fast alternating least squares via implicit leverage scores sampling. In Advances in Neural Information Processing Systems, pp. 721-729, 2016.

• Cheng, Dehua, Yu Cheng, Yan Liu, Richard Peng, and Shang-Hua Teng. Efficient sampling for Gaussian graphical models via spectral sparsification. In Conference on Learning Theory, pp. 364-390, 2015.

Parts of the thesis work have also been posted as a technical report:

• Cheng, Dehua, Yu Cheng, Yan Liu, Richard Peng, and Shang-Hua Teng. Spectral sparsification of random-walk matrix polynomials. arXiv preprint arXiv:1502.03496, 2015.

Chapter 2

Preliminaries and Related Work

In this chapter, we first present the notations that will be used across this thesis in Section 2.1. Then we discuss the core technical background, the Monte Carlo principle, in Section 2.2. In Section 2.3, we discuss the technical background of matrix completability, especially the concept of finite completability [KTT15]. Additional application-specific background and notations will be presented in the respective chapters.

2.1 Notations

We list the notations commonly used throughout this thesis. Scalars are represented by lowercase letters, such as $x, a, \alpha$; vectors are represented by boldface lowercase letters, such as $\mathbf{x}, \mathbf{b}$; matrices are represented by boldface capital letters, such as $\mathbf{X}, \mathbf{A}$. The transposes of $\mathbf{a}$ and $\mathbf{A}$ are denoted by $\mathbf{a}^\top$ and $\mathbf{A}^\top$, respectively. The $i$-th entry of a vector is denoted by $a_i$, and element $(i,j)$ of a matrix $\mathbf{A}$ is denoted by $\mathbf{A}_{i,j}$.

For a vector $\mathbf{a}$, let $\|\mathbf{a}\|_p$ denote the $\ell_p$ vector norm. Notably, for $p = 2$, we have the 2-norm $\|\mathbf{a}\|_2 = \sqrt{\sum_i a_i^2}$. For a matrix $\mathbf{A}$, the spectral norm is denoted by $\|\mathbf{A}\|_2$, the Frobenius norm by $\|\mathbf{A}\|_F$, and the nuclear norm by $\|\mathbf{A}\|_*$.

Let $[n]$ denote the set of indices $\{1, 2, \ldots, n\}$, and let $\times$ denote the Cartesian product between two sets. We adopt the standard big-O notation $O(\cdot)$, $\Theta(\cdot)$, $o(\cdot)$, and we also adopt the notation $\tilde{O}(\cdot)$ to suppress logarithmic factors.
2.2 The Monte Carlo principle

Given a random variable $x \sim \mathcal{P}_x$ where $x \in \mathcal{X}$ and a function $f: \mathcal{X} \mapsto \mathbb{R}$, the expectation $\mathbb{E}[f(x)]$ is defined as
\[
\mathbb{E}[f(x)] = \int_{\mathcal{X}} f(x)\, p(x)\, dx. \tag{2.1}
\]
The Monte Carlo principle suggests that we can obtain an unbiased estimate of $\mathbb{E}[f(x)]$ by drawing i.i.d. samples from the distribution $\mathcal{P}_x$ and replacing the integration by a sample average. That is,
\[
I_N(f) = \frac{1}{N} \sum_{i=1}^{N} f\big(x^{(i)}\big),
\]
where the $x^{(i)}$ are drawn from $\mathcal{P}_x$ independently. Then we have that
\[
I_N(f) \xrightarrow[N \to \infty]{\text{a.s.}} \mathbb{E}[f(x)].
\]
The Monte Carlo principle has been widely used in solving computationally challenging problems. It converts an otherwise computationally intractable or expensive task into a simpler one, given that we can draw samples from $\mathcal{P}_x$. The integration can be naturally generalized to summation via the Dirac delta function $\delta_{x_0}(\cdot)$ at point $x_0$: for $M$ data points $x_i$, $i = 1, 2, \ldots, M$, let
\[
p(x) = \frac{1}{M} \sum_{i=1}^{M} \delta_{x_i}(x).
\]
When $M \gg N$, the Monte Carlo principle offers a nontrivial speedup when calculating
\[
\mathbb{E}[f(x)] = \frac{1}{M} \sum_{i=1}^{M} f(x_i).
\]

2.2.1 Quality of the Monte Carlo simulation

Given $N$, the quality of the estimate $I_N(f)$ depends on both the function $f$ and the distribution $\mathcal{P}_x$. When $N \to \infty$ and $\sigma_f^2 \triangleq \mathbb{E}(f^2(x)) - \mathbb{E}^2(f(x)) < \infty$, by the central limit theorem, the scaled error $\sqrt{N}\,(I_N(f) - \mathbb{E}[f(x)])$ converges to $\mathcal{N}(0, \sigma_f^2)$ in distribution. For finite $N$, a non-asymptotic analysis of the error $I_N(f) - \mathbb{E}[f(x)]$ is also available. Different concentration bounds can be applied if their conditions are met, from the simple Markov inequality to various forms of the Chernoff bound.

Oftentimes, the samples drawn in a Monte Carlo simulation can be used for multiple tasks. For instance, we can use the same set of samples for a set of functions $\mathcal{F} = \{f_\theta \mid \theta \in \Theta\}$, where $\Theta$ can be a finite, countably infinite, or even uncountably infinite set. Instead of treating each $f_\theta$ individually, we can achieve a more refined error analysis. In statistical learning theory [VV98], the VC dimension of the family $\{f_\theta \mid \theta \in \Theta\}$ can be more informative than the cardinality of $\mathcal{F}$ in determining the error achieved with a given number of samples. In spectral graph theory [ST11], a spectral approximation of the graph Laplacian matrix, which can be obtained by Monte Carlo simulation, guarantees approximation accuracy for all inputs of its quadratic form.

2.2.2 Importance sampling in Monte Carlo simulation

To improve the quality of the Monte Carlo simulation, one can draw more samples from the distribution $\mathcal{P}_x$. However, this can be highly inefficient, since the standard deviation of the estimate $I_N(f)$ scales inversely proportional to the square root of the sample size, i.e.,
\[
\mathbb{E}\big[(I_N(f) - \mathbb{E}[f(x)])^2\big] = \frac{\sigma_f^2}{N}.
\]
Note that the samples are drawn from the distribution $\mathcal{P}_x$ without knowledge of the function $f(\cdot)$. The importance sampling technique exploits the function $f(\cdot)$ together with the distribution $\mathcal{P}_x$, which yields a smaller equivalent $\sigma_f^2$ and improves the estimation quality.

The importance sampling technique first transforms the original integration in Equation (2.1) into
\[
\mathbb{E}[f(x)] = \int_{\mathcal{X}} f(x)\, p(x)\, dx = \int_{\mathcal{X}} f(x)\, \frac{p(x)}{q(x)}\, q(x)\, dx, \tag{2.2}
\]
where $q(x)$ is the probability density function of a distribution $\mathcal{Q}_x$ on $\mathcal{X}$. Let $f'(x) \equiv f(x)\, p(x)/q(x)$. Then Equation (2.2) is equivalent to replacing the pair $(f, \mathcal{P}_x)$ with $(f', \mathcal{Q}_x)$ in a Monte Carlo simulation, where the expectation stays the same: $\mathbb{E}_P[f(x)] = \mathbb{E}_Q[f'(x)]$.
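Before discussing the choice of $\mathcal{Q}_x$, the following minimal sketch illustrates the plain Monte Carlo estimator $I_N(f)$ and the reweighted estimator $\mathbb{E}_Q[f'(x)]$ numerically (in Python with NumPy; the target, the proposal, and the function $f$ below are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000

# Target: estimate E[f(x)] for x ~ N(0, 1) with f(x) = exp(-(x - 3)^2).
# Most of the mass of f sits in the tail of the target distribution,
# which is exactly the situation where importance sampling helps.
f = lambda x: np.exp(-(x - 3.0) ** 2)
p = lambda x: np.exp(-0.5 * x ** 2) / np.sqrt(2 * np.pi)          # target density
q = lambda x: np.exp(-0.5 * (x - 3.0) ** 2) / np.sqrt(2 * np.pi)  # proposal density

# Plain Monte Carlo: I_N(f) = (1/N) * sum_i f(x_i), with x_i ~ p.
x_p = rng.standard_normal(N)
plain_mc = f(x_p).mean()

# Importance sampling: sample from q and average f'(x) = f(x) p(x) / q(x).
x_q = rng.normal(loc=3.0, size=N)
importance = (f(x_q) * p(x_q) / q(x_q)).mean()

print(f"plain MC estimate:           {plain_mc:.6f}")
print(f"importance-sampled estimate: {importance:.6f}")
```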
This reweighting allows us to pick a distribution $\mathcal{Q}_x$ that minimizes
\[
\sigma_{f'}^2 \triangleq \mathbb{E}_Q\big(f'^2(x)\big) - \mathbb{E}_Q^2\big(f'(x)\big) = \mathbb{E}_Q\big(f'^2(x)\big) - \mathbb{E}_P^2\big(f(x)\big),
\]
which is equivalent to minimizing
\[
\mathbb{E}_Q\big(f'^2(x)\big) = \int_{\mathcal{X}} \left(f(x)\, \frac{p(x)}{q(x)}\right)^2 q(x)\, dx = \int_{\mathcal{X}} \frac{f^2(x)\, p^2(x)}{q(x)}\, dx.
\]
The optimal $\mathcal{Q}_x$ that achieves the minimal estimation variance is
\[
q^*(x) = \frac{|f(x)|\, p(x)}{\int_{\mathcal{X}} |f(x)|\, p(x)\, dx}, \tag{2.3}
\]
and the corresponding optimal variance $\sigma^{*2}$ is
\[
\sigma^{*2} = \left(\int_{\mathcal{X}} |f(x)|\, p(x)\, dx\right)^2 - \mathbb{E}_P^2\big(f(x)\big).
\]
Note that the optimal importance $q^*(x)$ is usually computationally expensive to calculate, since the partition function, i.e., the denominator in Equation (2.3), is defined similarly to the original $\mathbb{E}[f(x)]$.

However, a suboptimal importance can also provide approximation guarantees. The intuition behind the optimal importance is that it provides substantial support for the important region, where $f(x)\, p(x)$ is large in magnitude. Reducing the support proportionally leads to a proportionally weaker approximation guarantee, as discussed in the following proposition.

Proposition 2.1 (Partial Support in Importance Sampling). For a sampling distribution $q'(x)$, if $q'(x) \geq \alpha\, q^*(x)$ for some $\alpha \in (0, 1]$ and all $x \in \mathcal{X}$, then the estimation variance $\sigma'^2$ with $q'(x)$ as the importance satisfies
\[
\sigma'^2 \leq \frac{1}{\alpha} \left(\int_{\mathcal{X}} |f(x)|\, p(x)\, dx\right)^2 - \mathbb{E}_P^2\big(f(x)\big).
\]
Proof. The statement can be verified by checking that
\[
\int_{\mathcal{X}} \frac{f^2(x)\, p^2(x)}{q'(x)}\, dx \leq \int_{\mathcal{X}} \frac{f^2(x)\, p^2(x)}{\alpha\, q^*(x)}\, dx.
\]

Objectives other than the estimation variance can also be improved by applying importance sampling with a different sampling importance. In most cases, the optimal sampling importance remains unchanged, due to the limiting distribution of the error.

2.2.3 Example: randomized matrix multiplication

Matrix multiplication is a standard numerical routine that serves as one of the cornerstones of modern data analytics. For example, in a multilayer perceptron, the pre-activated hidden units are calculated by multiplying the hidden units in the previous layer by the connecting weight matrix. A matrix-matrix multiplication, which represents the composition of linear systems, is computationally expensive for large and dense matrices.

Given two matrices $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{n \times p}$, the matrix $C = AB \in \mathbb{R}^{m \times p}$ is defined as
\[
C_{i,j} = \sum_{k=1}^{n} A_{i,k} B_{k,j}, \quad \forall i, j.
\]
When calculated according to the definition, it takes $O(mnp)$ steps. This can be unnecessarily expensive for some applications, especially when mild error can be tolerated. Randomized algorithms for speeding up matrix multiplication have been extensively surveyed in [Mah11]. By applying the Monte Carlo principle, one can propose a simple randomized algorithm:

• Draw $N$ i.i.d. samples from $\{1, 2, \ldots, n\}$ uniformly at random: $i^{(t)}$ for $t = 1, 2, \ldots, N$.

• Calculate and return
\[
\tilde{C} = \frac{n}{N} \sum_{t=1}^{N} A_{:,i^{(t)}} B_{i^{(t)},:}.
\]

One can verify that $\tilde{C}$ is an unbiased estimate of $C$, i.e., $\mathbb{E}[\tilde{C}] = C$. Instead of a scalar as in the previous examples, here we estimate a matrix $C$, and the estimation quality is measured by a slightly different quantity. We choose the Frobenius norm of the error matrix, $\|C - \tilde{C}\|_F$, to measure the approximation error, and we have that [MMD08]
\[
\mathbb{E}\left\|C - \tilde{C}\right\|_F^2 = \frac{n}{N} \sum_{k=1}^{n} \|A_{:,k}\|_2^2\, \|B_{k,:}\|_2^2 - \frac{1}{N} \|C\|_F^2,
\]
which decreases with a larger sample size $N$. We can also apply the importance sampling algorithm by designing a sampling importance $p_i$ for each element in $\{1, 2, \ldots, n\}$:

• Draw $N$ i.i.d. samples from $\{1, 2, \ldots, n\}$ with probabilities $p_i$: $i^{(t)}$ for $t = 1, 2, \ldots, N$.

• Calculate and return
\[
\tilde{C}_p = \sum_{t=1}^{N} \frac{1}{N p_{i^{(t)}}} A_{:,i^{(t)}} B_{i^{(t)},:}.
\]
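A small NumPy sketch of the two estimators just described, uniform and importance-weighted sampling of column/row outer products (the matrix sizes are arbitrary, and the norm-based choice of $p_i$ anticipates the optimal importance discussed next):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, p, N = 200, 5000, 150, 500   # A is m x n, B is n x p, N samples

A = rng.standard_normal((m, n))
B = rng.standard_normal((n, p))
C = A @ B                          # exact product, for comparison

# Uniform sampling: pick N column/row indices uniformly, rescale by n/N.
idx_u = rng.integers(0, n, size=N)
C_uniform = (n / N) * A[:, idx_u] @ B[idx_u, :]

# Importance sampling: p_i proportional to ||A[:, i]|| * ||B[i, :]||.
scores = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=1)
probs = scores / scores.sum()
idx_p = rng.choice(n, size=N, p=probs)
# Each sampled outer product is rescaled by 1 / (N * p_i).
C_importance = (A[:, idx_p] / (N * probs[idx_p])) @ B[idx_p, :]

rel = lambda X: np.linalg.norm(C - X) / np.linalg.norm(C)
print(f"relative error, uniform sampling:    {rel(C_uniform):.3f}")
print(f"relative error, importance sampling: {rel(C_importance):.3f}")
```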
In [MMD08], Drineas, Kannan, and Mahoney showed that by choosing $p_i$ according to
\[
p_i^* = \frac{\|A_{:,i}\|_2\, \|B_{i,:}\|_2}{\sum_{k=1}^{n} \|A_{:,k}\|_2\, \|B_{k,:}\|_2},
\]
one can achieve the optimal approximation guarantee:
\[
\mathbb{E}\left\|C - \tilde{C}_p\right\|_F^2 = \frac{1}{N} \left(\sum_{k=1}^{n} \|A_{:,k}\|_2\, \|B_{k,:}\|_2\right)^2 - \|C\|_F^2.
\]
Moreover, the optimal $p_i^*$ can be computed in $O(n(m + p))$ time, which is less than the computational cost of the original task. Therefore, it is also practical to sample according to the optimal importance. It is shown that with $\Theta(\log(1/\delta)/\epsilon)$ samples, we can guarantee that, with probability at least $1 - \delta$,
\[
\left\|C - \tilde{C}_p\right\|_F^2 \leq \epsilon\, \|A\|_F^2\, \|B\|_F^2.
\]

2.2.4 Example: least squares regression

Consider the problem of finding a vector $x$ such that $Ax \approx b$, where $A \in \mathbb{R}^{n \times p}$, $x \in \mathbb{R}^p$, and $b \in \mathbb{R}^n$. That is,
\[
x_{\mathrm{opt}} = \operatorname*{argmin}_x \|Ax - b\|_2^2 = \operatorname*{argmin}_x \frac{1}{n} \sum_{i=1}^{n} (A_{i,:}\, x - b_i)^2.
\]
When the number of constraints $n \gg p$, it is preferable to select a subset of constraints to reduce the problem size. This is a non-trivial task. For example, if there is only one constraint for a certain subspace (i.e., it is orthogonal to the rest of the constraints), then missing this constraint loses all information about the corresponding subspace. With uniform sampling, it takes $\Theta(n \log n)$ samples to fully cover the solution space, which defeats our purpose.

It has been shown that one can sample the constraints according to the statistical leverage scores to reduce the required sample size [Mah11].

Definition 2.2 (Statistical Leverage Score). Consider an $n \times p$ matrix $A$ with $n > p$. The statistical leverage score $\tau_i$ of the $i$-th row of $A$ can be defined equivalently as:

1. Let $U$ denote the $n \times p$ matrix consisting of the top-$p$ left singular vectors of $A$. Then $\tau_i = \|U_{i,:}\|_2^2$;

2. Consider the matrix $H = A (A^\top A)^\dagger A^\top \in \mathbb{R}^{n \times n}$. Then $\tau_i = H_{i,i}$;

3. $\tau_i = A_{i,:} (A^\top A)^\dagger A_{i,:}^\top$.

Moreover, we have that $\sum_{i=1}^{n} \tau_i = \operatorname{rank}(A)$.

The statistical leverage score of a certain row captures the importance of that row in forming the linear subspace. By rewriting the hat matrix $H$ as $UU^\top$, we note that the statistical leverage score is proportional to the optimal importance in matrix multiplication, as discussed in Section 2.2.3. Its optimality in solving the least squares regression problem can be explained by the subspace-projection nature of linear regression.

Unlike the optimal importance in matrix multiplication, it is very expensive to calculate the statistical leverage scores: it is equivalent to calculating the pseudo-inverse of the design matrix $A$. In a practical algorithm, we often sample from an approximated version of the actual statistical leverage scores [Mah11]. With $\Theta(p \log p / \epsilon^2)$ constraints sampled according to the statistical leverage scores, we can obtain a $(1 + \epsilon)$-optimal solution $x'$, i.e.,
\[
\|Ax' - b\|_2^2 \leq (1 + \epsilon) \|Ax_{\mathrm{opt}} - b\|_2^2.
\]
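The following sketch computes exact leverage scores via a thin SVD and uses them to subsample an overdetermined least squares problem (the problem sizes are illustrative, and a practical algorithm would use the approximate scores mentioned above rather than a full SVD):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, N = 20000, 30, 1500          # n >> p constraints, N sampled rows

A = rng.standard_normal((n, p))
b = A @ rng.standard_normal(p) + 0.1 * rng.standard_normal(n)

# Exact leverage scores: tau_i = ||U_{i,:}||^2 for the thin SVD A = U S V^T.
U, _, _ = np.linalg.svd(A, full_matrices=False)
tau = np.sum(U ** 2, axis=1)        # sums to rank(A) = p
probs = tau / tau.sum()

# Sample N rows with probability proportional to leverage, and rescale each
# kept row by 1 / sqrt(N * p_i) so the sketched problem is unbiased.
idx = rng.choice(n, size=N, p=probs)
scale = 1.0 / np.sqrt(N * probs[idx])
x_sub, *_ = np.linalg.lstsq(A[idx] * scale[:, None], b[idx] * scale, rcond=None)
x_opt, *_ = np.linalg.lstsq(A, b, rcond=None)

res = lambda x: np.linalg.norm(A @ x - b) ** 2
print(f"relative excess residual: {res(x_sub) / res(x_opt) - 1:.2e}")
```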
2.2.5 Example: spectral approximation of graph Laplacian

In spectral graph theory [ST04, ST14, SS11], the graph Laplacian matrix $L$ plays an important role. The distribution of its eigenvalues contains valuable information on graph connectivity, graph cuts, etc. The eigenvalues can be characterized via the Rayleigh quotient of the matrix $L$:
\[
R(L, x) = \frac{x^\top L x}{x^\top x}.
\]
Therefore, we can approximate the quadratic form of the Laplacian matrix, $x^\top L x$, with a simpler Laplacian matrix $\tilde{L}$ and reduce the computational cost of subsequent tasks. Graph spectral sparsification [ST14] provides an efficient solution to this task. Given a possibly dense graph $G$ with Laplacian matrix $L$, it outputs a sparse graph $\tilde{G}$ whose Laplacian matrix $\tilde{L}$ satisfies
\[
\exp(-\epsilon)\, L \preceq \tilde{L} \preceq \exp(\epsilon)\, L.
\]
Here $\preceq$ is the Loewner semidefinite order.

The Laplacian matrix can be naturally factorized by the edge-vertex incidence matrix $B \in \mathbb{R}^{n \times m}$, where $n$ and $m$ are the numbers of vertices and edges, respectively. The $i$-th column of $B$ is $e_{s_i} - e_{t_i}$, where $(s_i, t_i)$ is the $i$-th edge and $e_j$ is the $j$-th standard basis vector. The graph Laplacian $L$ can be factorized as $L = BB^\top$. We can employ the randomized matrix multiplication algorithm to obtain a simpler Laplacian $\tilde{L}$ by sampling the columns of $B$, i.e., the edges in the original graph $G$. With $\Theta(n \log n / \epsilon^2)$ edges, we can preserve the quadratic form of the Laplacian matrix.

2.3 Low-rank matrix completion and finite completability

The low-rank matrix completion problem takes as input a partially-observed matrix and asks for a completion of the missing values such that the estimate is low rank. For a matrix $M \in \mathbb{R}^{n \times m}$ (assuming $n \geq m$) with $\operatorname{rank}(M) = r$ and the indices of the observed entries $\Omega \subseteq [n] \times [m]$, the goal of low-rank matrix completion is to recover the original matrix $M$ from the partial observation $\{M_{i,j} \mid (i,j) \in \Omega\}$.

In the seminal work by Candès and Recht [CR09], low-rank matrix completion is formulated as an optimization problem:
\[
\min_{X \in \mathbb{R}^{n \times m}} \|X\|_*, \quad \text{s.t. } X_{i,j} = M_{i,j}, \ \forall (i,j) \in \Omega,
\]
where $\|X\|_*$ is the nuclear norm (a.k.a. trace norm) of the matrix $X$, serving as a surrogate for the rank of $X$. In that paper, the authors prove that when the rank-$r$ matrix $M$ is well-spread across all entries, then with $\tilde{\Theta}(nr)$ entries observed uniformly at random, the above optimization recovers the true matrix $M$ exactly. Different versions of a quantity called coherence were proposed to measure how the information is spread over a matrix, where low coherence prevents the linear space spanned by the row (or column) vectors of $M$ from being overly aligned with the standard basis. The requirement of uniform random sampling of entries is a direct result of the incoherence assumption, which upper-bounds the coherence of the ground-truth matrix. By relaxing the incoherence assumption, we can also employ non-uniform sampling [CBSW14], where more samples are required from the highly coherent regions. In [CBSW14], the authors propose the row and column coherences, which provide a more refined view of the matrix entries: their differing importance in the task of matrix completion.

2.3.1 Completability with deterministic observed entries

From a practical perspective, the locations of the observed entries are given deterministically and the ground-truth matrix is hidden. Theoretical work that relies on a random sampling of observed entries can provide valuable guidance in understanding the problem but yields limited practical advice. The deterministic setting has also been studied [BJ14, KT12, KTT15]. Bhojanapalli and Jain [BJ14] proved that the spectral gap of the associated bipartite graph can be used as a sufficient condition for matrix completability with the nuclear norm surrogate.

Király et al. [KT12, KTT15] have studied the problem from a more fundamental perspective, where completability is no longer tied to a specific algorithm. The definition of completability is now the identifiability of the ground-truth low-rank matrix, i.e., whether there exists a unique low-rank matrix that matches all observed entries. Moreover, they only consider the locations of the observed entries, ignoring the actual values. Instead, they assume that the ground-truth low-rank matrix is randomly drawn from a continuous distribution. In doing so, worst-case scenarios such as a sparse matrix with $r$ entries are automatically excluded.
Therefore, the concept of incoherence is no longer necessary. This creates a cleaner setting, which essentially serves as a necessary condition for any more realistic setting.

Under this setting, completability is determined solely by the locations of the observed entries and the target rank $r$. For a random matrix $M$ of rank $r$ and a set of observed entries $\Omega$, let $S$ be the set of rank-$r$ matrices that agree with $M$ on $\Omega$; then we say the matrix is

• uniquely completable, if $|S| = 1$; or

• finitely completable, if $|S| < \infty$; or

• infinitely completable, if $|S| = \infty$.

It is an open problem to numerically verify unique completability. However, it has been observed that for a finitely completable low-rank matrix completion problem, a few additional observed entries usually guarantee the uniqueness of the solution [PABN16].

2.3.2 Finite completability

The theoretical foundation of finite completability for a low-rank matrix completion problem has been extensively studied [KTT15]. It is closely related to the inverse of the linear map from the low-rank matrix manifold to the matrix restricted to the observed positions. Let $\mathcal{M}$ be the manifold of matrices of size $n \times m$ and rank at most $r$. The observation procedure essentially maps a matrix from $\mathcal{M}$ to a $|\Omega|$-dimensional linear space, where $\Omega$ is the set of observed positions. To ensure that the inverse image of a given partially observed matrix is of finite size, the dimension of the image manifold should match the dimension of $\mathcal{M}$, which is $r(m + n - r)$.

We can embed $\mathcal{M}$ into an $r(n + m)$-dimensional space by factoring a matrix as $M = AB^\top$, where $A$ and $B$ are matrices of size $n \times r$ and $m \times r$, respectively. This allows us to draw a Jacobian matrix $J$ of the mapping, where the rank of $J$ reveals the dimension of the image space. To draw a Jacobian matrix, we randomly draw $A$ and $B$ from a continuous distribution and evaluate the Jacobian matrix at $(A, B)$. For example, we can draw a Jacobian matrix by the following procedure:

1. Draw $A$ and $B$, where all entries are drawn i.i.d. from the standard Gaussian distribution. Let $M = AB^\top$.

2. The Jacobian matrix $J$ is of size $|\Omega| \times (rn + rm)$. We can calculate the Jacobian w.r.t. $A$ and $B$ respectively as follows: for $(i,j) \in \Omega$,
\[
\frac{\partial M_{i,j}}{\partial A_{k,\ell}} = \begin{cases} 0, & \text{if } i \neq k, \\ B_{j,\ell}, & \text{otherwise;} \end{cases}
\qquad
\frac{\partial M_{i,j}}{\partial B_{k,\ell}} = \begin{cases} 0, & \text{if } j \neq k, \\ A_{i,\ell}, & \text{otherwise.} \end{cases}
\]

The size of the Jacobian matrix is polynomial in the input size. Therefore, there exist straightforward algorithms to determine its rank in polynomial time. If $\operatorname{rank}(J) = r(n + m - r)$, then we say the low-rank matrix completion problem is finitely completable; otherwise, it is infinitely completable. Furthermore, it is possible to evaluate the completability on a per-entry basis. For a missing entry $(i,j) \notin \Omega$, if the rank of the Jacobian matrix w.r.t. $\Omega$ is the same as that w.r.t. $\hat{\Omega} = \Omega \cup \{(i,j)\}$, the entry at $(i,j)$ is finitely completable [KTT15].
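A small sketch of the Jacobian rank test described in this section, evaluated at a randomly drawn factorization (the matrix sizes, rank, and observation pattern below are illustrative assumptions):

```python
import numpy as np

def completability_jacobian(omega, n, m, r, rng):
    """Build the |Omega| x r(n+m) Jacobian of (A, B) -> (A B^T)|_Omega
    at a random point, following the procedure in Section 2.3.2."""
    A = rng.standard_normal((n, r))
    B = rng.standard_normal((m, r))
    J = np.zeros((len(omega), r * (n + m)))
    for row, (i, j) in enumerate(omega):
        # d M_{ij} / d A_{i,l} = B_{j,l};  d M_{ij} / d B_{j,l} = A_{i,l}.
        J[row, i * r:(i + 1) * r] = B[j]
        J[row, r * n + j * r:r * n + (j + 1) * r] = A[i]
    return J

rng = np.random.default_rng(0)
n, m, r = 12, 10, 2
# An observation pattern Omega; here roughly 60% of entries, chosen at random.
omega = [(i, j) for i in range(n) for j in range(m) if rng.random() < 0.6]

J = completability_jacobian(omega, n, m, r, rng)
target = r * (n + m - r)
print("rank(J) =", np.linalg.matrix_rank(J), " target =", target)
# The instance is finitely completable iff rank(J) == r * (n + m - r).
```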
Chapter 3

Tensor Decomposition

Tensors, a.k.a. multidimensional arrays, appear frequently in many applications, including spatial-temporal data modeling [YCL15], signal processing [DLDM98, DSL08], deep learning [NPOV15], and more. Low-rank tensor decomposition [KB09] is a fundamental tool for understanding and extracting the information from tensor data, and it has been actively studied in recent years. Developing scalable and provable algorithms for most tensor processing tasks is challenging due to the non-convexity of the objective [HL13, KB09, GHJY15, BKS15]. Especially in the era of big data, scalable low-rank tensor decomposition algorithms (that run in nearly linear or even sublinear time in the input data size) have become an absolute must to command the full power of tensor analytics. For instance, the Amazon review data [ML13] yield a 2,440,972 x 6,643,571 x 92,626 tensor with 2 billion nonzero entries after preprocessing. Such data sets pose scalability challenges to even some of the simplest tensor decomposition tasks.

It is known that the low-rank structure of a tensor is not uniquely defined, due to multiple well-defined notions of tensor rank [KB09]. Here, we focus on the tensor CANDECOMP/PARAFAC (CP) decomposition [Har70, CC70], where the low-rank tensor is modeled by a summation over many rank-1 tensors. Due to its simplicity and interpretability, tensor CP decomposition, which finds the best rank-$R$ approximation to the input tensor, often by minimizing the square loss function, has been widely adopted in many applications [KB09].

The matrix Khatri-Rao product (KRP) captures the interactions between different tensor modes in the CP decomposition, and it is essential for understanding many tensor related tasks. For instance, in the alternating least squares (ALS) algorithm, which has been the workhorse for solving the tensor CP decomposition problem, a compact representation of the KRP can reduce the computational cost directly. ALS is a simple and parameter-free algorithm that optimizes the target rank-$R$ tensor by updating its factor matrices in a block coordinate descent fashion. In each iteration, the computational bottleneck is to solve a least squares regression problem where the design matrix, a KRP of factor matrices, has size $n^2 \times n$ for an $n \times n \times n$ tensor. While least squares regression is one of the most studied problems, solving it exactly requires at least $\Omega(n^2)$ operations [Mah11], which can be larger than the size of the input data for sparse tensors. For instance, the Amazon review data with $2 \times 10^9$ nonzeros lead to a computational cost on the order of $10^{12}$ per iteration. Exploiting the structure of the KRP can reduce the exact computation to be linear in the input size, which for large-scale applications is still expensive for an iterative algorithm.

An effective way of speeding up such numerical computations is through randomization [Mah11, Woo14], where in the optimal case the computational cost can be uncorrelated with the ambient size of the input data. By exploiting the spectral structure of the design matrix, which is the KRP of the factor matrices, we provide efficient access to its statistical leverage scores. This allows us to propose the SPALS algorithm, which samples rows of the KRP in a nearly-optimal manner. This near-optimality is twofold:

1. the estimates of the leverage scores are tight up to a factor linear in the target rank;

2. the operation of sampling a row can be performed efficiently.

The second requirement is far from trivial: note that even when the optimal sampling probability is given, drawing a sample may require $O(n^2)$ preprocessing. Our result on the spectral structure of the design matrix allows us to achieve both criteria simultaneously, leading to the first ALS algorithm with sublinear per-iteration cost and provable approximation guarantees.

Our contributions can be summarized as follows:

1. We show a close connection between the statistical leverage scores of the matrix Khatri-Rao product and the scores of the input matrices.
This yields efficient and accurate leverage score estimates for importance sampling;

2. Our algorithm achieves state-of-the-art computational efficiency, while provably approximating the ALS algorithm for computing CP tensor decompositions. The running time of each iteration of our algorithm is $\tilde{O}(nR^3)$, sublinear in the input size for large tensors.

3. Our theoretical results on the spectral structure of the KRP can also be applied to other tensor related applications, such as stochastic gradient descent [NSW14] and higher-order singular value decomposition (HOSVD) [DLDMV00].

We formalize the definitions in Section 3.1 and present our main results on leverage score estimation for the KRP in Section 3.2. The SPALS algorithm and its theoretical analysis are presented in Section 3.3. We discuss connections with previous works in Section 3.4. In Section 3.5, we show empirical evaluations of this algorithm and its variants on both synthetic and real-world data. We conclude and discuss our work in Section 3.6.

3.1 Notation and background

Without loss of generality, we focus our discussion on 3-mode tensors, but our results and algorithm can be easily generalized to higher-order tensors. In addition to the pre-defined notations, tensors are represented by boldface calligraphic capital letters, such as $\mathcal{T}$. The element $(i,j,k)$ of a tensor $\mathcal{T} \in \mathbb{R}^{I \times J \times K}$ is denoted by $\mathcal{T}_{ijk}$. For notational simplicity, we assume that $(i,j)$ also represents the index $i + Ij$ between 1 and $IJ$, where the values of $I$ and $J$ should be clear from the context. For a tensor $\mathcal{T} \in \mathbb{R}^{I \times J \times K}$, we denote the tensor norm by $\|\mathcal{T}\|_F$, i.e., $\|\mathcal{T}\|_F = \sqrt{\sum_{i,j,k=1}^{I,J,K} \mathcal{T}_{ijk}^2}$.

3.1.1 Tensor matricization

Here we consider only the case of mode-$n$ matricization. For $n = 1, 2, 3$, the mode-$n$ matricization of a tensor $\mathcal{T} \in \mathbb{R}^{I \times J \times K}$ is denoted by $T_{(n)}$. For instance, $T_{(3)} \in \mathbb{R}^{K \times IJ}$, where the element $(k, (i,j))$ is $\mathcal{T}_{ijk}$.

3.1.2 Special matrix products

Our manipulation of tensors as matrices revolves around several matrix products. Our main focus is the matrix Khatri-Rao product (KRP), where for a pair of matrices $A \in \mathbb{R}^{I \times R}$ and $B \in \mathbb{R}^{J \times R}$, $A \odot B \in \mathbb{R}^{(IJ) \times R}$ has element $((i,j), r)$ equal to $A_{ir} B_{jr}$. More generally, for a pair of matrices $A \in \mathbb{R}^{I \times K}$ and $B \in \mathbb{R}^{J \times L}$:

• the Kronecker product of matrices is denoted by $A \otimes B \in \mathbb{R}^{(IJ) \times (KL)}$, where the element $((i,j), (k,l))$ is $A_{ik} B_{jl}$;

• the Khatri-Rao product of matrices, $A \odot B \in \mathbb{R}^{(IJ) \times K}$ when $K = L$, has element $((i,j), k)$ equal to $A_{ik} B_{jk}$; and

• the Hadamard product, when $I = J$ and $K = L$, is the elementwise matrix product. It is denoted by $A * B \in \mathbb{R}^{I \times K}$, where the element $(i,k)$ is $A_{ik} B_{ik}$.

More details on these products can be found in [KB09]. Our proof also relies on the following identity.

Lemma 3.1.
\[
(AU) \odot (BV) = (A \otimes B)(U \odot V). \tag{3.1}
\]
Proof. Assume $A \in \mathbb{R}^{I \times N}$, $B \in \mathbb{R}^{J \times M}$, $U \in \mathbb{R}^{N \times R}$, and $V \in \mathbb{R}^{M \times R}$. Then we have $(AU) \odot (BV) \in \mathbb{R}^{IJ \times R}$, and its $((i,j), r)$-th element is
\[
[(AU) \odot (BV)]_{(i,j),r}
= (AU)_{ir} (BV)_{jr}
= \Big(\sum_{p=1}^{N} A_{ip} U_{pr}\Big)\Big(\sum_{q=1}^{M} B_{jq} V_{qr}\Big)
= \sum_{p=1}^{N} \sum_{q=1}^{M} A_{ip} B_{jq} U_{pr} V_{qr}
= \sum_{p=1}^{N} \sum_{q=1}^{M} (A \otimes B)_{(i,j),(p,q)} (U \odot V)_{(p,q),r}
= [(A \otimes B)(U \odot V)]_{(i,j),r}.
\]

3.1.3 Tensor CP decomposition

The tensor CP decomposition [Har70, CC70] expresses a tensor as the sum of a number of rank-one tensors, e.g.,
\[
\mathcal{T} = \sum_{r=1}^{R} a_r \circ b_r \circ c_r,
\]
where $\circ$ denotes the outer product, $\mathcal{T} \in \mathbb{R}^{I \times J \times K}$, and $a_r \in \mathbb{R}^I$, $b_r \in \mathbb{R}^J$, $c_r \in \mathbb{R}^K$ for $r = 1, 2, \ldots, R$. The tensor CP decomposition is compactly represented as $[\![A, B, C]\!]$, where the factor matrices $A \in \mathbb{R}^{I \times R}$, $B \in \mathbb{R}^{J \times R}$, and $C \in \mathbb{R}^{K \times R}$ have $a_r$, $b_r$, $c_r$ as their $r$-th columns, respectively, i.e., $[\![A, B, C]\!]_{ijk} = \sum_{r=1}^{R} A_{ir} B_{jr} C_{kr}$. Similar to the matrix case, each rank-1 component is usually interpreted as a hidden factor, which captures the interactions between all dimensions in the simplest way.

Given a tensor $\mathcal{T} \in \mathbb{R}^{I \times J \times K}$ along with a target rank $R$, the goal is to find a rank-$R$ tensor, specified by its factor matrices $A \in \mathbb{R}^{I \times R}$, $B \in \mathbb{R}^{J \times R}$, $C \in \mathbb{R}^{K \times R}$, that is as close to $\mathcal{T}$ as possible:
\[
\min_{A,B,C} \|\mathcal{T} - [\![A, B, C]\!]\|_F^2 = \sum_{i,j,k} \Big(\mathcal{T}_{i,j,k} - \sum_{r=1}^{R} A_{ir} B_{jr} C_{kr}\Big)^2.
\]
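A short sketch that reconstructs a 3-mode tensor from its factor matrices, matching the definition of $[\![A, B, C]\!]$ above (the dimensions and rank are arbitrary illustrative choices):

```python
import numpy as np

def cp_reconstruct(A, B, C):
    """Return the I x J x K tensor with entries sum_r A[i,r] * B[j,r] * C[k,r]."""
    return np.einsum('ir,jr,kr->ijk', A, B, C)

rng = np.random.default_rng(0)
I, J, K, R = 30, 25, 20, 5
A = rng.standard_normal((I, R))
B = rng.standard_normal((J, R))
C = rng.standard_normal((K, R))

T = cp_reconstruct(A, B, C)
# Equivalent sum of R outer products a_r o b_r o c_r:
T_check = sum(np.multiply.outer(np.multiply.outer(A[:, r], B[:, r]), C[:, r])
              for r in range(R))
print("max difference:", np.max(np.abs(T - T_check)))   # ~1e-15
```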
3.1.4 Alternating least squares algorithm

A widely used method for performing CP decomposition is the alternating least squares (ALS) algorithm. It iteratively minimizes the objective over one of the factor matrices with the others fixed. For instance, when the factors $A$ and $B$ are fixed, algebraic manipulations show that the best choice of $C$ can be obtained by solving the least squares regression
\[
\min_C \left\| X C^\top - T_{(3)}^\top \right\|_F^2, \tag{3.7}
\]
where the design matrix $X = B \odot A$ is the KRP of $A$ and $B$, and $T_{(3)}$ is the matricization of $\mathcal{T}$ [KB09].

3.2 Near-optimal leverage score estimation for Khatri-Rao product

As shown in Section 3.1.4, the matrix KRP captures the essential interactions between the factor matrices in the tensor CP decomposition. Working with it directly is a challenging task, as the KRP of two matrices is significantly larger than the input matrices. For example, for the Amazon review data, the KRP of two factor matrices contains $10^{12}$ rows, which is much larger than the data set itself with $10^9$ nonzeros.

Importance sampling is one of the most powerful tools for obtaining sample-efficient randomized data reduction with strong guarantees. However, an effective implementation requires comprehensive knowledge of the objects to be sampled: the KRP of the factor matrices. In this section, we provide an efficient and effective toolset for estimating the statistical leverage scores of the KRP of factor matrices, giving a direct way of applying importance sampling, one of the most important tools in randomized matrix algorithms, to tensor CP decomposition related applications.

In the remainder of this section, we first define and discuss the optimal importance, the statistical leverage score, in the context of $\ell_2$-regression. Then we propose and prove our near-optimal leverage score estimation.

3.2.1 Leverage score sampling for $\ell_2$-regression

Recall the definition of the statistical leverage score, Definition 2.2, in Section 2.2.4. When $p \ll n$, subsampling the rows of a design matrix $X \in \mathbb{R}^{n \times p}$ by its statistical leverage scores provides an efficient approximate solution to the least squares regression problem $\min_\beta \|X\beta - y\|_2^2$, with strong theoretical guarantees [Mah11].

This does not immediately yield an efficient algorithm for solving optimization problem (3.7), due to the difficulty of computing the statistical leverage scores. But this reduction to the matrix setting allows for speedups using a variety of tools. In particular, sketching [CW13, MM13, NN13] or iterative sampling [LMP13, CLM+15] lead to routines that run in input-sparsity time: $O(\mathrm{nnz})$ plus the cost of solving an $O(r \log n)$-sized least squares problem. However, directly applying these methods still requires at least one pass over $\mathcal{T}$ at each iteration, which would dominate the overall cost.
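To make the update in Equation (3.7) concrete, here is a small sketch of one exact ALS step for the mode-3 factor that forms the Khatri-Rao design matrix explicitly (the dimensions are illustrative; materializing the full design matrix is exactly the cost that the sampling approach of the next sections avoids):

```python
import numpy as np

def khatri_rao_design(A, B):
    """Design matrix with one row per tensor entry (i, j); the row is the
    elementwise product A[i, :] * B[j, :] (the KRP up to row ordering)."""
    I, R = A.shape
    J, _ = B.shape
    return (A[:, None, :] * B[None, :, :]).reshape(I * J, R)

def als_update_C(T, A, B):
    """Exact mode-3 ALS update: argmin_C || X C^T - T_(3)^T ||_F^2."""
    I, J, K = T.shape
    X = khatri_rao_design(A, B)               # (I*J) x R design matrix
    T3 = T.reshape(I * J, K)                  # rows ordered consistently with X
    # Normal equations: C = T_(3) X (A^T A * B^T B)^{-1}.
    G = (A.T @ A) * (B.T @ B)                 # R x R Hadamard of Gram matrices
    return np.linalg.solve(G, X.T @ T3).T

rng = np.random.default_rng(0)
I, J, K, R = 40, 30, 20, 4
A, B, C = (rng.standard_normal((d, R)) for d in (I, J, K))
T = np.einsum('ir,jr,kr->ijk', A, B, C)

C_new = als_update_C(T, A, B)
print("residual after exact update:",
      np.linalg.norm(T - np.einsum('ir,jr,kr->ijk', A, B, C_new)))
```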
3.2.2 Near-optimal leverage score estimation As discussed in the previous section, the KRP of factor matrices captures the interaction between two modes in the tensor CP decomposition, e.g., the design matrixBA in the linear regression problem. To extract a compact representation of the interaction, the statistical leverage score of BA provides an informative distribution over the rows, which can be utilized to select the important subsets of rows randomly. For a matrix with IJ rows in total, e.g., BA, in general, the calculation of statistical leverage score is prohibitively expensive. However, due to the special structure of the KRPBA, the upper bound of statistical leverage score, which is sufficient to obtain the same guarantee by using slightly more samples, can be efficiently estimated, as shown in Theorem 3.2. Theorem 3.2 (Khatri-Rao Bound). For matrixA∈R I×R and matrixB∈R J×R , where I > R and J > R, let τ A i and τ B j be the statistical leverage score of the i-th and j-th row of A and B, respectively. Then, for matrix AB ∈ R IJ×R with statistical leverage score τ AB i,j for the (iJ +j)-th row, we have τ A⊗B i,j ≤τ A i τ B j . Proof. Let the singular value decomposition of A and B be A = U a a V a> and B =U b b V b > , whereU a ∈R I×R ,U b ∈R J×R , and a , b ,V a ,V b ∈R R×R . 37 By the definition of Khatri-Rao product, we have that AB = [A :,1 ⊗B :,1 ,...,A :,R ⊗B :,R ]∈R IJ×R , where⊗ is the Kronecker product. By the form of SVD and Lemma 3.1, we have AB =[U a a (V a 1,: ) > ⊗U b b (V b 1,: ) > ,...,U a a (V a R,: ) > ⊗U b b (V b R,: ) > ] = h (U a a )⊗ (U b b ) i V a> V b > = h U a ⊗U b ih a ⊗ b i V a> V b > = h U a ⊗U b i S, where S = h a ⊗ b i V a> V b > ∈ R R 2 ×R . So the SVD of AB can be constructed using the SVD ofS =U s s V > s . So the leverage score ofAB can be computed from [U a ⊗U b ]U s : H = [U a ⊗U b ]U s U > s [U a ⊗U b ] > , (3.8) 38 and for the index k =iJ +j, we have τ AB i,j =H k,k (3.9) =e > k He k (3.10) ≤ h U a ⊗U b i > e k 2 2 (3.11) = R X p=1 R X q=1 (U a i,p ) 2 (U b j,q ) 2 (3.12) =( R X p=1 (U a i,p ) 2 )( R X q=1 (U b j,q ) 2 ) (3.13) =τ A i τ B j , (3.14) where e i is the i-th natural basis vector. The first inequality is because of H 4 [U a ⊗U b ] [U a ⊗U b ] > . For the rank-R CP decomposition, the sum of the leverage scores for all rows inBA equals to R. The sum of our upper bound relaxes it to R 2 , which means that now we need e O(R 2 ) samples instead of e O(R). This result directly generalizes to the Khatri-Rao product of k-dimensional tensors. Theorem 3.3. For matrix A (k) ∈ R I k ×R where I k > R for k = 1,...,K, let τ (k) i be the statistical leverage score of the i-th ofA (k) . Then, for the Q k I k -by-R matrix A (1) A (2) ···A (K) with statistical leverage scoreτ i 1 ,...,i K for the row corresponding to τ i 1 ,...,i K , we have τ 1:K i 1 ,...,i K ≤ K Y k=1 τ (k) i k , where τ 1:K i 1 ,...,i K denotes the statistical leverage score of the row ofA (1) A (2) ··· A (K) corresponding to the i k -th row ofA (k) for k = 1,...,K. 39 Proof. Assume that the claim holds for K− 1, that is, we have τ 2:K i 2 ,...,i K ≤ K Y k=2 τ (k) i k . LetA =A (1) and letB =A (2) A (3) ···A (K) , then by Theorem 3.2, we have τ 1:K i 1 ,...,i K =τ AB i 1 ,(i 2 ,...,i K ) ≤τ (1) i 1 τ 2:K i 2 ,...,i K ≤ K Y k=1 τ (k) i k . Therefore, it also holds for K. By Theorem 3.2, it holds for K = 2. Then by induction, it holds for any K∈Z + . The near-optimality of our estimation is three-fold, which enables the devel- opment of efficient numerical algorithms: 1. 
The estimation can be calculated in sublinear time, provided that $\max\{I,J,K\}=o(\mathrm{nnz}(\mathcal{T}))$. For instance, for the Amazon review data we have $\max\{I,J,K\}\approx 10^{6}\ll\mathrm{nnz}(\mathcal{T})\approx 10^{9}$;

2. The form of the estimation allows efficient sample drawing. In fact, the row index can be drawn efficiently by considering each mode independently;

3. The estimation is tight. More precisely, it gives an upper bound with a factor-$R$ relaxation, where $R$ is a modest constant for low-rank decompositions. Therefore, the estimation allows sample-efficient importance sampling.

Algorithm 1: Sample a row from $B\odot A$ and $T_{(3)}$.
    Draw a Bernoulli random variable $z\sim\mathrm{Bernoulli}(\beta)$.
    if $z=0$ then
        Draw $i\sim\mathrm{Multi}(\tau^A_1/R,\ldots,\tau^A_I/R)$ and $j\sim\mathrm{Multi}(\tau^B_1/R,\ldots,\tau^B_J/R)$.
    else
        Draw an entry $(i,j,k)$ from the nonzero entries of $\mathcal{T}$ with probability proportional to $\mathcal{T}_{ijk}^2$.
    end if
    Return the $(jI+i)$-th row of $B\odot A$ and $T_{(3)}$ with weight $IJ\,p_{i,j}$.

3.3 SPALS: sampling alternating least squares

The direct application of our results on KRP leverage score estimation is an efficient version of the ALS algorithm for tensor CP decomposition, whose computational bottleneck is solving the optimization problem (3.7).

Our main algorithmic result is a way to obtain a high-quality $O(r^2\log n)$ row sample of $X$ without explicitly constructing the matrix $X$. This is motivated by a recent work that implicitly generates sparsifiers for multi-step random walks [CCL+15]. In particular, we sample the rows of $X$, the KRP of $A$ and $B$, using products of quantities computed on the corresponding rows of $A$ and $B$, which provides a rank-1 approximation to the optimal importance: the statistical leverage scores. This leads to a sublinear-time sampling routine, and implies that we can approximate the progress of each ALS step in time linear in the size of the factor being updated, which can be sublinear in the number of non-zeros in $\mathcal{T}$.

In the remainder of this section, we present our algorithm SPALS and prove its approximation guarantee. We also discuss its extension to other tensor-related applications.

3.3.1 Sampling alternating least squares

The optimal solution to optimization problem (3.7) is
\[
C=T_{(3)}\,(B\odot A)\,\big[A^{\top}A * B^{\top}B\big]^{-1}.
\]
We separate the calculation into two parts: (1) $T_{(3)}(B\odot A)$, and (2) $\big[A^{\top}A * B^{\top}B\big]^{-1}$, where $*$ denotes the elementwise matrix product. The latter inverts the Gram matrix of the Khatri-Rao product, which can be computed efficiently due to its $R\times R$ size. We will mostly focus on evaluating the former expression.

We perform this matrix multiplication by drawing a few rows from both $T^{\top}_{(3)}$ and $B\odot A$ and constructing the final solution from this subset of rows. A row of $B\odot A$ can be indexed by $(i,j)$ for $i=1,\ldots,I$ and $j=1,\ldots,J$, corresponding to the $i$-th and $j$-th rows of $A$ and $B$, respectively. That is, our sampling problem can be seen as sampling the entries of an $I\times J$ matrix $P=\{p_{i,j}\}_{i,j}$. We define the sampling probability $p_{i,j}$ as
\[
p_{i,j}=(1-\beta)\,\frac{\tau^A_i\,\tau^B_j}{R^2}+\beta\,\frac{\sum_{k=1}^{K}\mathcal{T}_{ijk}^2}{\|\mathcal{T}\|_F^2}, \qquad (3.15)
\]
where $\beta\in(0,1)$. The first term is a rank-1 component of the matrix $P$. When the input tensor is sparse, the second term is sparse as well, so $P$ admits a sparse-plus-low-rank structure and can easily be sampled as a mixture of two simple distributions. The sampling algorithm is described in Algorithm 1; a minimal sketch of this sampling step is given below. Note that sampling by the leverage scores of the design matrix $B\odot A$ alone already provides a guaranteed, but worse, approximation for each step [Mah11].
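 
The following numpy sketch mirrors Algorithm 1 and Equation (3.15); it is an illustration under assumptions, not the C++/OpenMP implementation used in the experiments. Exact per-mode leverage scores are computed from thin SVDs of $A$ and $B$ (affordable since $R\ll n$), the data term is preprocessed once from the nonzeros of $\mathcal{T}$, and each sample is drawn from the mixture. The $1/\sqrt{m\,p_{i,j}}$ rescaling used here is the standard sample-and-rescale convention for randomized least squares (cf. the matrix $S$ in Section 3.7); all function and variable names are invented for the example.

    import numpy as np
    from collections import defaultdict

    def leverage_scores(M):
        # Row leverage scores of M: squared row norms of an orthonormal basis of range(M).
        U, _, _ = np.linalg.svd(M, full_matrices=False)
        return np.sum(U**2, axis=1)

    def spals_sample_rows(A, B, nz_ijkv, m, beta=0.1, rng=None):
        # Draw m row indices (i, j) of the Khatri-Rao product from the mixture (3.15) and
        # return them with 1/sqrt(m * p_ij) rescaling factors (sample-and-rescale convention).
        # nz_ijkv = (ii, jj, kk, vv): coordinates and values of the nonzeros of the tensor.
        rng = rng or np.random.default_rng()
        I, R = A.shape
        J, _ = B.shape
        ii, jj, kk, vv = nz_ijkv
        tauA, tauB = leverage_scores(A), leverage_scores(B)   # each sums to R (full column rank)
        pA, pB = tauA / tauA.sum(), tauB / tauB.sum()

        # One-time O(nnz(T)) preprocessing of the data term: sum_k T_ijk^2 for each (i, j).
        T_frob2 = float(vv @ vv)
        nz_prob = vv**2 / T_frob2
        sq = defaultdict(float)
        for i, j, v in zip(ii, jj, vv):
            sq[(int(i), int(j))] += float(v) ** 2

        rows, p_sel = np.empty(m, dtype=np.int64), np.empty(m)
        for s in range(m):
            if rng.random() >= beta:                          # leverage-score (rank-1) term
                # A real implementation would precompute cumulative tables for O(log n) draws.
                i, j = int(rng.choice(I, p=pA)), int(rng.choice(J, p=pB))
            else:                                             # data term, proportional to T_ijk^2
                t = rng.choice(len(vv), p=nz_prob)
                i, j = int(ii[t]), int(jj[t])
            # Exact mixture probability p_ij from Equation (3.15), needed for the rescaling.
            p_sel[s] = ((1 - beta) * tauA[i] * tauB[j] / R**2
                        + beta * sq.get((i, j), 0.0) / T_frob2)
            rows[s] = i * J + j                               # row (i, j), under i*J + j ordering
        return rows, 1.0 / np.sqrt(m * p_sel)

The returned row indices and scales would then be used to form the subsampled products that replace $T_{(3)}(B\odot A)$ in the update above.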
Since that the design matrix itself is formed 42 by two factor matrices, i.e., we are not directly utilize the information in the data, we design the second term for the worst case scenario. When R n and n nnz(T ), where n = max(I,J,K), we can afford to calculate τ A i and τ B j exactly in each iteration. So the distribution corresponding to the first term can be efficiently sampled with preparation cost e O(r 2 n +r 3 ) and per- sample-cost O(logn). Note that the second term requires an one-timeO(nnz(T )) preprocessing before the first iteration. 3.3.2 Approximation guarantees We define the following conditions: C1. The sampling probability p i,j satisfies p i,j ≥β 1 τ AB i,j R for some constant β 1 ; C2. The sampling probabilityp i,j satisfiesp i,j ≥β 2 P K k=1 T 2 i,j,k kTk 2 F for some constantβ 2 ; The proposed probabilitiesp i,j in Equation (3.15) satisfy both conditions with β 1 = (1−β)/R and β 2 =β. We can now prove our main approximation result. Theorem 3.4. For a tensor T ∈ R I×J×K with n = max(I,J,K) and any factor matrices on the first two dimension as A∈ R I×R and B ∈ R J×R . If a step of ALS on the third dimension gives C opt , then a step of SPALS that samples m = Θ(R 2 logn/ 2 ) rows producesC satisfying T− q A,B,C y 2 F <kT−JA,B,C opt Kk 2 F +kTk 2 F . Full proof can be found in Section 3.7. 43 Note that the approximation error of our algorithm does not accumulate over iterations. Similar to the stochastic gradient descent algorithm, the error occurred in the previous iterations can be addressed in the subsequent iterations. 3.3.3 Extensions on other tensor related applications Importance sampling SGD on CP decompostion One application is to incorporate importance sampling in the stochastic gra- dient descent algorithm for CP decomposition. The gradient follows the form ∂ ∂C kT−JA,B,CKk 2 F =T (3) (BA). By sampling the rows according to our proposed distribution, it reduces the per-step variance via importance sampling [NSW14]. Our result addresses the computational difficulty of finding the appropriate importance. Sampling ALS on higher-order singular value decomposition For solving the HOSVD [DLDMV00] on tensor, Kronecker product is involved instead of the Khatri-Rao product. And we prove similar leverage score approxima- tion results for Kronecker product. In fact, for Kronecker product, our “approxima- tion” provides the exact leverage score. Theorem 3.5. For matrix A∈ R I×M and matrix B∈ R J×N , where I > M and J > N, let τ A i and τ B j be the statistical leverage score of the i-th and j-th row of A andB, respectively. Then, for matrixA⊗B∈R IJ×MN with statistical leverage score τ A⊗B i,j for the (iJ +j)-th row, we have τ A⊗B i,j =τ A i τ B j . 44 Proof. Let the singular value decomposition of A and B be A = U a a V a> and B =U b b V b > , whereU a ∈R I×R ,U b ∈R J×R , and a , b ,V a ,V b ∈R R×R . By the definition of Khatri-Rao product from Section 2.6 of [KB09], we have that A⊗B = U a ⊗U b ( a ⊗ a ) U a ⊗U b > (3.16) which is the legit SVD of A⊗B. Therefore the leverage score of A⊗B can be computed fromU a ⊗U b , that H = [U a ⊗U b ] [U a ⊗U b ] > (3.17) and for the index k =iI +j, we have τ AB i,j =H k,k =e > k He k (3.18) = h U a ⊗U b i > e k 2 2 (3.19) = R X p=1 R X q=1 (U a i,p ) 2 (U b j,q ) 2 (3.20) =( R X p=1 (U a i,p ) 2 )( R X q=1 (U b j,q ) 2 ) (3.21) =τ A i τ B j . (3.22) 45 3.4 Related works CP decomposition is one of the simplest, most easily-interpretable tensor decomposition. 
Fitting it in an ALS fashion is still considered as the state-of-art in the recent tensor analytics literature [WTSA15]. The most widely used imple- mentation of ALS is the MATLAB Tensor Toolbox [KB09]. It directly performs the analytic solution of ALS steps. There is a line of work on speeding up this procedure in distributed/parallel/MapReduce settings [KPHF12, JPKF15, CV14, SRSK15]. Such approaches are compatible with our approach, as we directly reduce the num- ber of steps by sampling. A similar connection holds for works achieving more efficient computation of KRP steps of the ALS algorithm such as in [PTC13]. The applicability of randomized numerical linear algebra tools to tensors was studied during their development [NDT15]. Within the context of sampling-based tensordecomposition, earlyworkhasbeenpublishedin[Tso10,SPL + 09]thatfocuses thoughonTuckerdecomposition. In[PFS12], samplingisusedasameansofextract- ingsmallrepresentativesub-tensorsoutoftheinitialinput, whicharefurtherdecom- posed via the standard ALS and carefully merged to form the output. Another work based on an a-priori sampling of the input tensor can be found in [BS15]. However, recent developments in randomized numerical linear algebra often focused on over- constrained regression problems or low rank matrices. The incorporation of such tools into tensor analytics routines fairly recent [PP13, WTSA15] Most closely related to our algorithm are the routines from [WTSA15], which gave a sketch-based CP decomposition inspired by the earlier work in [PP13]. Both approaches only need to examine the factorization at each iteration, followed by a number of updates that only depends on rank. A main difference is that the 46 sketches in [WTSA15] moves the non-zeros around, while our sampling approach removes many entries instead. Their algorithm also performs a subsequent FFT step, while our routine always works on subsets of the various matricizations. As both of these steps do not decrease the density of the data, our method is much more suitableforsparsetensors. Also,ourroutinecanbeconsideredasthedatadependent randomization, which enjoys better approximation accuracy than [WTSA15] in the worst case. For direct comparison, the method in [WTSA15] and ours both require nnz(T ) preprocessingatthebeginning. Then, foreachiteration, ourmethodrequires e O(nr 3 ) operations comparing with O(r(n +Bb logb) +r 3 ) for [WTSA15]. Here B and b for [WTSA15] are parameters for the sketching and need to be tuned for various applications. Depending on the target accuracy, b can be as large as the input size: on the cube synthetic tensors with n = 10 3 that the experiments in [WTSA15] focused on, b was set to between 2 14 ≈×10 3 and 2 16 ≈ 6× 10 4 in order to converge to good relative errors. From a distance, our method can be viewed as incorporating randomization into the intermediate steps of algorithms, and can be viewed as higher dimensional analogs of weighted SGD algorithms [YCRM16]. Compared to more global uses of randomization [Woo14], these more piecemeal invocation have several advantages. For high dimensional tensors, sketching methods need to preserve all dimensions, while the intermediate problems only involve matrices, and can often be reduced to smaller dimensions. For approximating a rank R tensor in d dimensions to error , this represents the difference between poly(R,) and R d . 
Furthermore, the lower cost of each step of alternating minimization makes it much easier to increase accuracy in the last few steps, leading to algorithms that behave the same way in the limit.

The wealth of work on reducing the sizes of matrices while preserving objectives such as $\ell_p$ norms, hinge losses, and M-estimators [DDH+09, CP15, CW15b, CW15a] also suggests that this approach can be directly adapted to a much wider range of settings and objectives.

3.5 Experimental results

We implemented and evaluated our algorithms in a single-machine setting. Experiments are run on a single machine with two Intel Xeon E5-2630 v3 CPUs and 256GB of memory. All methods are implemented in C++ with OpenMP parallelization. We report averages over 5 trials.

3.5.1 Dense synthetic tensors

We start by comparing our method against the sketching-based algorithm from [WTSA15] in the single-thread setting used in their evaluation. The synthetic data we tested are third-order tensors with dimension n = 1000, as described in [WTSA15]. We first generated a rank-1000 tensor with harmonically decreasing weights on the rank-1 components. After normalization, random Gaussian noise with noise-to-signal ratio nsr = 0.1, 1, 10 was added. As in previous experimental evaluations [WTSA15], we set the target rank to r = 10.

The performance of the various routines is given in Table 3.1. We vary the sampling rate of our algorithm, i.e., SPALS(α) samples αr²log²n rows at each iteration. On these instances, a call to SPALS with rate α samples about 4.77α × 10³ rows and, as the tensor is dense, 4.77α × 10⁶ entries. The correspondence between running times and rates demonstrates the sublinear runtime of SPALS at low sampling rates. Compared with [WTSA15], our algorithm employs a data-dependent random sketch with minimal overhead, which yields significantly better precision with a similar amount of computation.

Table 3.1: Running times per iteration in seconds and errors of various alternating least squares implementations

                      nsr = 0.1        nsr = 1          nsr = 10
                      error   time     error   time     error   time
    ALS-dense         0.27    64.8     1.08    66.2     10.08   67.6
    sketch(20, 14)    0.45    6.50     1.37    4.70     11.11   4.90
    sketch(40, 16)    0.30    16.0     1.13    12.7     10.27   12.4
    ALS-sparse        0.24    501      1.09    512      10.15   498
    SPALS(0.3)        0.20    1.76     1.14    1.93     10.40   1.92
    SPALS(1)          0.18    5.79     1.10    5.64     10.21   5.94
    SPALS(3.0)        0.21    15.9     1.09    16.1     10.15   16.16

3.5.2 Sparse data tensor

Our original motivation for SPALS was to handle large sparse data tensors. We ran our algorithm on a large-scale tensor generated from Amazon review data [ML13]. Its size and the convergence of SPALS with various parameters are given in Table 3.2. We conduct these experiments in parallel with 16 threads. The Amazon data tensor has a much higher noise-to-signal ratio than our other experiments, which is common for large-scale data tensors: running deterministic ALS with rank 10 on it leads to a relative error of 98.1%. SPALS converges rapidly towards a good approximation in only a small fraction of the time required by the ALS algorithm.

Table 3.2: Relative error and running time per iteration on the Amazon review tensor with dimensions 2.44e6 × 6.64e6 × 9.26e4 and 2.02 billion non-zeros

                      error    time
    ALS-sparse        0.981    142
    SPALS(0.3)        0.987    6.97
    SPALS(1)          0.983    15.7
    SPALS(3.0)        0.982    38.9

3.6 Discussion

Our experiments show that SPALS provides a notable speedup over previous CP decomposition routines on both dense and sparse data.
There are two main sources of speedups: (1) the low target rank and moderate individual dimensions enable us to compute leverage scores efficiently; and (2) the simple representations of the sampled form also allows us to use mostly code from existing ALS routines with minimal computational overhead. Its worth noting that in the dense case, the total number of entries accessed during all 20 iterations is far fewer than the size of T . Nonetheless, the adaptive nature of the sampling scheme means all the information fromT are taken into account while generating the first and subsequent iterations. From a randomized algorithms perspective, the sub-linear time sampling steps bear strong resemblances with stochastic optimization routines [SV09]. We believe more systematically investigating such connections can lead to more direct connections between tensors and randomized numerical linear algebra, and in turn further algorithmic improvements. 50 3.7 Proof for theorem 3.4 For the least square regression problem min x kAx−bk 2 2 , where design matrixA∈R n×p withn>p (assuming full column rank), coefficients x∈R p×1 and response vectorb∈R n×1 . The optimal solution is x opt = A > A −1 A > b, The percentage explained byx is ρ (x): ρ (x) = 1− kAx−bk 2 2 kbk 2 2 The optimal result is ρ (x opt ) = b > A A > A −1 A > b b > b Lemma 3.6. GivenS∈R m×n , such that U > b−U > S > Sb 2 ≤ 3 kbk 2 , the solutionx B = A > A −1 A > S > Sb, satisfies that ρ (x B )≥ρ (x opt )−. 51 Proof. Let the SVD of matrixA is of the formA =UV > . We have that x opt =V −1 U > b and x B =V −1 U > S > Sb We have that ρ (x opt ) = 1− I−UU > b 2 2 kbk 2 2 Similarity, we have that ρ (x B ) =1− I−UU > S > S b 2 2 kbk 2 2 =1− I−UU > +UU > −UU > S > S b 2 2 kbk 2 2 ≥ρ (x opt ) − 2 I−UU > b 2 kbk 2 + UU > −UU > S > S b 2 kbk 2 UU > −UU > S > S b 2 kbk 2 Note that U > b−U > S > Sb 2 ≤ 3 kbk 2 , 52 and 2 I−UU > b 2 kbk 2 + UU > −UU > S > S b 2 kbk 2 ≤ 3, which concludes that ρ (x B )≥ρ (x opt )−. Lemma 3.7 (Matrix Multiplication Approximation [DMMS11]). Given matrixA∈ R m×n andB∈R n×p , and sampling probability p i ,i = 1, 2,...,n satisfies that p i ≥β A (i) 2 B (i) 2 kAk F kBk F , whereA (i) andB (i) is thei-th column and row ofA andB, respectively. Then with probability at least 1−δ, the random matrix multiplication with c samples, we have that kAB−CDk F ≤ 1 + q (8/δ) log 1/δ βc kAk F kBk F . Proof. See Appendix A.3 in [DMMS11]. 53 Corollary 3.8. Given matrixA∈R m×n andB∈R n×p , and sampling probability p i ,i = 1, 2,...,n satisfies that p i ≥ β 1 A (i) 2 2 kAk 2 F ,and p i ≥ β 2 B (i) 2 2 kBk 2 F . whereA (i) andB (i) is thei-th column and row ofA andB, respectively. Then with probability at least 1−δ, the random matrix multiplication with c samples, we have that kAB−CDk F ≤ 1 + q (8/δ) log 1/δ √ β 1 β 2 c kAk F kBk F . Proof. Note that p i ≥ 1 2 β 1 A (i) 2 2 kAk 2 F +β 2 B (i) 2 2 kBk 2 F ≥ q β 1 β 2 A (i) 2 B (i) 2 kAk F kBk F . Finally, we can prove our main theorem. Theorem 3.4. For a tensor T ∈ R I×J×K with n = max(I,J,K) and any factor matrices on the first two dimension as A∈ R I×R and B ∈ R J×R . If a step of ALS on the third dimension gives C opt , then a step of SPALS that samples m = Θ(R 2 logn/ 2 ) rows producesC satisfying T− q A,B,C y 2 F <kT−JA,B,C opt Kk 2 F +kTk 2 F . 54 Proof. Denote the sample-and-rescale matrix asS∈R m×IJ . By Corollary 3.8, we have that T (3) (BA)−T (3) S > S (BA) F ≤ kTk F . Together with Lemma 3.6, we can conclude. 
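To make the sampling bound of Lemma 3.7 and Corollary 3.8 concrete, the following numpy sketch (with invented sizes) forms the sample-and-rescale approximate product, the matrix CD in the lemma's notation, by drawing column/row pairs with probability proportional to the product of their norms, and reports the Frobenius error relative to $\|A\|_F\|B\|_F$.

    import numpy as np

    rng = np.random.default_rng(0)
    m, n, p, c = 200, 20000, 100, 2000    # c = number of sampled column/row pairs (invented sizes)

    A = rng.standard_normal((m, n))
    B = rng.standard_normal((n, p))

    # Sampling probabilities proportional to ||A^(i)||_2 * ||B_(i)||_2 (column of A, row of B).
    probs = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=1)
    probs /= probs.sum()

    idx = rng.choice(n, size=c, replace=True, p=probs)
    scale = 1.0 / (c * probs[idx])
    approx = (A[:, idx] * scale) @ B[idx, :]       # the sampled, rescaled product CD

    err = np.linalg.norm(A @ B - approx) / (np.linalg.norm(A) * np.linalg.norm(B))
    print(err)                                     # decays roughly like 1/sqrt(c)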
55 Chapter 4 Random-Walk Matrix-Polynomial Sparsification 4.1 Introduction Polynomials are used in many fields of mathematics and science for encod- ing equations that model various physical, biological and economical processes. In scientific computing and its underpinning numerical analysis, polynomials appear naturally in (truncated) Taylor series, the fast multipole method [GR87], and vari- ous numerical approximations. These computational methods are responsible for a large part of engineering and scientific simulations ranging from weather forecasting to earthquake modeling [SF73] to particle/galaxy simulation [GR87]. Like its scalar counterpart, matrix polynomials of the form, P d i=0 c i ·M i , arise in mathematical analysis of matrix functions and dynamical systems, as well as numerical solution of matrix equations. One class of matrices of particular impor- tance in network analysis is the adjacency matrix of a weighted, undirected graphG. We will denote these matrices usingA. If we useD to denote the diagonal matrix containing weighted degrees of vertices, D −1 A is the transition matrix of random walks on the graph. Powers of this matrix correspond to multiple steps of random walks on the graph, and are graphs themselves. However, they are usually dense, andarecost-prohibitivebothintimeandmemorytoconstructwhentheinputgraph 56 G is of large-scale. Our objective is to efficiently construct sparse approximations of this natural family of random-walk matrices. A problem that motivates our study of random walk polynomials is the inverse square-root problem: find a linear operatorC such thatC > C is close to the inverse of the Laplacian matrix. Laplacian matrices are a subclass of SDDM matrices 1 and have standard split-form representation ofL =D−A. When we apply an extension of the classical Newton’s method to this form we can reduce the problem to that of factoringD− 3 4 D· (D −1 A) 2 + 1 4 D· (D −1 A) 3 , which has smaller spectral radius, by using the matrix identity (D−A) − 1 2 = I + 1 2 D −1 A D− 3 4 D· (D −1 A) 2 + 1 4 D· (D −1 A) 3 − 1 2 . (4.1) Finding inverse square-root factorizations is a key step in sampling from Gaussian graphical models: Given a graphical models of a Gaussian random field specified by its precision matrix and potential vector h, i.e., Pr(x|,h) ∝ exp(− 1 2 x > x+h > x),efficientlygeneratei.i.drandomsamplesfromthismultivariate Gaussian distributions [LW12]. If one can compute an efficient sparse representation ofC≈ −1/2 , then one can convert i.i.d. standard Gaussian random vectorz using x =Cz + (where = −1 h) to i.i.d random vectors of a Gaussian random field that numerically approximates the one defined by (,h) [CCL + 14]. Furthermore, if the precision matrix = (λ i,j ) is symmetric diagonally dominant (SDD), i.e., for all i, λ i,i > P j6=i |λ i,j |, then one can reduce this factorization problem to the problem formulated by Equation (4.1) involving an SDDM matrix. 1 SDDM matrices are positive definite symmetric diagonal dominate matrices with non-positive off-diagonal elements. 57 Then, inordertoiterativelyapplyEquation(4.1)tobuildanefficientrepresen- tation of the inverse square-root factor ofD−A, one needs to efficiently construct the second term in Equation (4.1), D− 3 4 D· (D −1 A) 2 + 1 4 D· (D −1 A) 3 . (4.2) The quadric and cubic powers in this matrix can be very dense, making exact com- putations involving them expensive. Instead, we will directly compute an approxi- mation of this matrix that still suffices for algorithmic purposes. 
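The following numpy snippet is a dense sanity check of the reduction behind Equation (4.1), with invented sizes; it is illustrative only, since the point of this chapter is to carry out this step sparsely. It builds a small SDDM matrix $D-A$, forms the symmetric middle matrix $(I+\tfrac{1}{2}AD^{-1})(D-A)(I+\tfrac{1}{2}D^{-1}A)$, checks numerically that expanding this product yields a cubic random-walk matrix polynomial in $D^{-1}A$ of the kind that Equation (4.2) asks us to sparsify, and verifies that sandwiching the inverse square root of the middle matrix gives an exact factorization of $(D-A)^{-1}$.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 200

    # A small dense SDDM matrix M = D - A: nonnegative off-diagonals, strictly dominant diagonal.
    A = np.abs(rng.standard_normal((n, n)))
    A = (A + A.T) / 2
    np.fill_diagonal(A, 0.0)
    D = np.diag(A.sum(axis=1) + 1.0)
    Dinv = np.linalg.inv(D)
    I = np.eye(n)
    M = D - A

    # Symmetric "middle" matrix of the reduction: (I + A D^{-1}/2) (D - A) (I + D^{-1} A / 2).
    mid = (I + 0.5 * A @ Dinv) @ M @ (I + 0.5 * Dinv @ A)
    # Expanding the product gives a low-degree random-walk matrix polynomial in D^{-1}A
    # (quadratic and cubic terms), which is the kind of matrix this chapter sparsifies.
    assert np.allclose(mid, D - 0.75 * A @ Dinv @ A - 0.25 * A @ Dinv @ A @ Dinv @ A)

    # Inverse square root of the positive definite middle matrix, via an eigendecomposition.
    w, V = np.linalg.eigh(mid)
    mid_inv_sqrt = V @ np.diag(w ** -0.5) @ V.T

    # The factor C = mid^{-1/2} (I + A D^{-1}/2) satisfies C^T C = (D - A)^{-1} exactly.
    C = mid_inv_sqrt @ (I + 0.5 * A @ Dinv)
    assert np.allclose(C.T @ C, np.linalg.inv(M))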
Finding an inverse square-root of an SDDM matrix is a special case of the following basic algorithmic problem in spectral graph theory and numerical analysis [CCL + 14]: Given an n×n SDDM matrixM, a non-zero integer q, and an approx- imation parameter , compute an efficient sparse representation of an n×n linear operator f C such that M 1/q ≈ f C f C > where≈ is spectral similarity between linear operators which we will define at the start of Section 4.2. The matrix q th -root computation appears in several numerical applications and particularly in the analysis of Markov models [HL11]. In our case, note that when the graph is connected,D −1 A is the transition matrix of a reversible Markov chain [AF], and the first order approximation of the q th -root transition is I− (I−D −1 A) 1/q . Extension of Newton’s method then leads to an iterative formula similar to Equation (4.1) for finding factorization of the q th root. Thus, the key algorithmic task for obtaining a nearly linear time Newton(-like) algorithm for 58 q th -root factorizations is the efficient approximation of matrix polynomials akin to Equation (4.2). We start with a definition that captures the matrix polynomials such like Equation (4.2) that arise in the application of Newton’s or Newton-like methods to graph Laplacians. Definition 4.1 (Random-Walk Matrix-Polynomials). Let A and D be the adja- cency matrix and diagonal weighted degree matrix of a weighted, undirected graph G respectively. For a non-negative vector = (α 1 ,...,α d ) with P d r=1 α r = 1, the matrix L (G) =D− d X r=1 α r D· D −1 A r (4.3) is a d-degree random-walk matrix-polynomial of G. Random-walk matrix-polynomials naturally include the graph Laplacian G as the linear case: whend = 1, the matrix polynomial becomesL(G) =D−A, which is the Laplacian matrix of G. In fact, the following proposition can be established by a simple induction. Proposition 4.2 (Laplacian Preservation). For any weighted, undirected graph G with adjacency matrix A and diagonal matrix D, for every non-negative vector = (α 1 ,...,α d ) such that with P d r=1 α r = 1, the random-walk matrix-polynomial L (G) remains a Laplacian matrix. Consequently, applying spectral sparsification algorithms [ST11, SS11, BSS12] toL (G) gives: 59 Proposition 4.3 (Spectral Sparsifiers of Random-Walk Matrix Polynomials). For all G and as in Proposition 4.8, for any > 0, there exists a Laplacian matrix e L =D− f A with O(n logn/ 2 ) non-zeros such that for allx∈Rn (1−)·x > e Lx≤x > D− d X r=1 α r D· D −1 A r ! x≤ (1 +)·x > e Lx. (4.4) The Laplacian matrix e L satisfying Equation (4.4) is called spectrally similar with approximation parameter toL (G) [ST11]. The computation of a (nearly) linearsizespectral sparsifierofadenseLaplacianmatrixisafundamentalalgorithmic problem in spectral graph theory that has been used in solving linear systems [ST14, KMP10] and combinatorial optimization [CKM + 11]. The work of [SS11, ST11] showed that spectral sparsifier of O(n logn/ 2 ) non-zeros can be constructed in O(m log 2 n/ 2 ) time for any n×n Laplacian matrixL with m non-zeros. A recent techniqueof[PS14]cansparsifyadegree 2random-walkmatrix-polynomialinnearly linear time. We give the first nearly linear time spectral-sparsification algorithm for all random-walk matrix-polynomials. 
Our sparsification algorithm is built on the fol- lowing key mathematical observation that might be interesting on its own: One can obtain a sharp enough upper bound on the effective resistances of the high-order polynomialD−D(D −1 A) r from the combination of the linearD−A and quadratic D−AD −1 A polynomials. This allows us to design an efficient path sampling algorithm that utilizes this mathematical observation to achieve the critical sparsification. We prove the following result which generalizes the works of [ST11, SS11, PS14]. 60 Theorem 4.4 (Random-Walk Polynomials Sparsification). For any weighted, undi- rected graph G with n vertices and m non-zeros, for every non-negative vector = (α 1 ,...,α d ) with P d r=1 α r = 1, for any > 0, we can construct in time O(d 2 ·m· log 2 n/ 2 ) a spectral sparsifier with O(n logn/ 2 ) non-zeros and approxi- mation parameter for the random-walk matrix-polynomialL (G). This result also generalizes to SDDM matrices. The total work of our spar- sification algorithm depends quadratically in the degree of the polynomial, which could be expensive when d = Θ(n c ) for some constant c> 0. And we build upon it to give an algorithm with polylogarithmic dependence on the degree. Theorem 4.5 (Sparsification of Higher Degree Random-Walk Matrix Polynomials). Let nnz() denote the number of non-zeros in the vector . For any precision parameter > 0, and any weighted undirected graph G with n nodes and m edges, for every non-negative vector = (α 1 ,...,α t ) with P t r=1 α r = 1, we can • first compute another set of coefficients = (β 1 ,...,β t ) in O(nnz()) time 2 such thatL (G)≈ L (G), and nnz() =O(logt/); • then construct a spectral sparsifier for the random-walk matrix polynomial L (G) in time O(nnz()·m· log 3 n· log 5 t/ 4 ) with O(n logn/ 2 ) non-zeros and approximation parameter . While we build our construction on the earlier work [ST11, SS11, PS14] for graph sparsification, we need to overcome some significant difficulties posted by high degree matrix polynomials, which appear be algorithmically challenging: The matrix (D −1 A) r is defined by all paths of lengthr, whose precise calculation would 2 If we are given a sampling oracle for the distribution defined by α, we could compute β in time polylogarithmic in t. 61 be prohibitively expensive due to the cost of matrix multiplication and densification in the matrix powers. Moreover, the algorithm of [PS14] relies an explicit clique-like representation of edges in the quadratic power and expanders, which is much more specialized. 4.1.1 Some applications Matrix polynomials are involved in numerical methods such as Newton’s or Newton-like methods, which have been widely used for finding solutions of matrix equations. Our sparsification algorithm can be immediately applied to speed up these numerical methods that involves SDDM matrices. For example, in the appli- cation of finding the inverse square root of an SDDM matrixD−A, we can directly sparsify the cubic matrix polynomial given in Equation 4.2, and iteratively approx- imate the inverse-square root factor ofD−A using Equation 4.1. This leads to a simpler and faster algorithm than the one presented in [CCL + 14], which circumvents the challenging problem of sparsifying high-degree random-walk matrix polynomials at the cost of slower convergences and complex approximation. The simplicity of the new algorithm comes from the fact that we no longer need the Maclaurin series for conditioning the numerical iterations. 
The convergence analysis of the new algo- rithm follows the standard analysis of the Newton’s method, with careful adaptation to handle the approximation errors introduced by the spectral sparsification. The elimination of the Maclaurin series speeds up the previous algorithm by a factor of log logκ, where κ is the relative condition number ofD−A. In general, our sparsification algorithm can be used inside the Newton-like methodforapproximatingtheinverseq th -rootofSDDMmatricestoobtainasimpler 62 and faster nearly linear time algorithm than the one presented in [CCL + 14] for q∈Z + , with reduction formula as follows (I−X) −1/q = I + X 2q ! I + X 2q ! 2q (I−X) −1/q I + X 2q ! . (4.5) Our mathematical and algorithmic advances enable the sparsification of the (2q+1)- degreepolynomialsinthemiddle, inturnspeedupthepreviousalgorithmbyafactor of log(log(κ)/). By Proposition 4.8, the random-walk matrix polynomial L (G) = D− P d r=1 α r D· (D −1 A) r defines a weighted graph G whose adjacency matrix is P d r=1 α r D· (D −1 A) r , and overlays d graphs induced by the multi-step random walks. WhileD−A andD−AD −1 A offers a good enough bound on the effective resistances of edges inG for the purpose of path sampling, these estimates are rela- tively loose comparing with the standard approximation condition. Because spectral similarity implies effective-resistance similarity [ST11, SS11], our sparsification algo- rithm together with the construction of Spielman and Srivastava [SS11] provide an efficient data structure for effective resistances in G . After nearly linear prepro- cessing time, we can answer queries regarding approximate effective resistances in G in logarithmic time. 4.2 Background and notation WeassumeG = (V,E,w)isaweightedundirectedgraphwithn =|V|vertices, m =|E| edges and edge weightsw e > 0. LetA = (a i,j ) denote the adjacency matrix ofG, i.e.,a i,j =w(i,j). We letD to denote the diagonal matrix containing weighted 63 degrees of vertices. Note that D −1 A is the transition matrix of random walks on the graph andL G =D−A is the Laplacian matrix of G. It is well known that for any vectorx = (x 1 ,...,x n ), x > L G x = X (u,v)∈E (x u −x v ) 2 w uv (4.6) We use G r to denote the graph introduced by r-step random walks on G. We haveL G 2 =D−AD −1 A, andL Gr =D−D(D −1 A) r in general. In our analysis, we will make extensive use of spectral approximations based on the Loewner partial ordering of positive semidefinite matrices. Given two matrices X and Y , we use Y < X (or equivalently X 4 Y ) to denote that Y −X is positive semi-definite. Approximations using this ordering obey usual intuitions with approximations of positive scalars. We will also use a compressed, symmetric notation in situations where we have mirroring upper and lower bounds. We say X≈ Y when exp ()X<Y < exp (−)X, (4.7) We use the following standard facts about this notion of approximation. Fact 4.6. For positive semi-definite matricesX,Y ,W andZ, 1. ifY ≈ Z, thenX +Y ≈ X +Z; 2. ifX≈ Y andW≈ Z, thenX +W≈ Y +Z; 3. ifX≈ 1 Y andY ≈ 2 Z, thenX≈ 1 + 2 Z; 64 e G =GraphSampling(G ={V,E,w},τ e ,M) 1. Initialize graph e G ={V,∅}. 2. For i from 1 to M: Sample an edge e from E with p e =τ e /( P e∈E τ e ). Add e to e G with weight w(e)/(Mτ e ). 3. Return graph e G. Figure 4.1: Pseudocode for Sampling by Effective Resistances 4. if X and Y are positive definite matrices such that X≈ Y , then X −1 ≈ Y −1 ; 5. ifX≈ Y andV is a matrix, thenV > XV ≈ V > YV. The Laplacian matrix is closely related to electrical flow [SS11, CKM + 11]. 
For an edge with weight w(e), we view it as a resistor with resistance r(e) = 1/w(e). Recall that the effective resistance between two vertices u and v R(u,v) is defined as the potential difference induced between them when a unit current is injected at one and extracted at the other. Lete i denote the vector with 1 in thei-th entry and 0 everywhere else, the effective resistance R(u,v) equals to (e u −e v ) > L † (e u −e v ), whereL † is the Moore-Penrose Pseudoinverse ofL. From this expression, we can see that effective resistance obeys triangle inequality. Also note that adding edges to a graph does not increase the effective resistance between any pair of nodes. In [SS11], it was shown that oversampling the edges using upper bound on the effective resistance suffices for constructing spectral sparsifiers. The theoretical guarantees for this sampling process were strengthened in [KL13]. A pseudocode of this algorithm is given in Figure 4.1, and its guarantees can be stated as follows. 65 Theorem 4.7. Given a weighted undirected graph G, and upper bound on its effec- tive resistance Z(e)≥ R(e). For any approximation parameter > 0, there exists M =O(logn/ 2 · ( P e∈E τ e )), with τ e =w(e)Z(e), such that with probability at least 1− 1 n , e G =GraphSampling(G,w,τ e ,M) has at most M edges, and satisfies (1−)L G 4L e G 4 (1 +)L G . (4.8) The equivalence between solving linear systems in such matrices and graph Laplacians, weakly-SDD matrices, and M-matrices are well known [ST04, DS08, KOSZ13, PS14]. 4.3 Random-walk matrix polynomial sparsifica- tion Theneedto“sparsify”Newton’smethodmotivatedustosolvethefundamental problem of computing a spectral sparsifier for random-walk matrix polynomial of the form L (G) =D− t X r=1 α r D· D −1 A r , given by the adjacency matrix A and diagonal degree matrix D of a weighted, undirected graphG and a distribution vector = (α 1 ,...,α t ) satisfying P t r=1 α r = 1. Note that D −1 A is the transition matrix of the reversible Markov chain of the random walks on the graph defined byA. A polynomial ofD −1 A corresponds to the superposition of multiple steps of random walks on the graph G, which as the following proposition shows, is a graph itself with LaplacianL (G). 66 Proposition 4.8 (Laplacian Preservation). For any weighted, undirected graph G with adjacency matrix A and diagonal matrix D, for every non-negative vector = (α 1 ,...,α t ) such that with P t r=1 α r = 1, the random-walk matrix polynomial L (G) remains a Laplacian matrix. Proof. First note that L (G) = D− P t r=1 α r D· (D −1 A) r is symmetric and has non-positive off-diagonals, so to prove that L (G) is Laplacian, we only need to show each row ofL (G) sums up to zero. For an integer r and any row indexi, we look at the i-th row sum S i,r ofD(D −1 A) r . Forr = 1, the row sumS i,1 ofi-th row ofA isS i,1 = P j A i,j =D i,i . We prove by induction that S i,t =... =S i,1 =D i,i as follows, S i,r+1 = X k D(D −1 A) r i,k ·D −1 k,k · X j A k,j = X k D(D −1 A) r i,k =S i,r =D i,i Thus the i-th row sum ofL (G) is equal toD i,i − P t r=1 α r S i,r = 0, soL (G) is a Laplacian matrix. Therefore, to sparsifyL (G), one could first compute thet-step random-walk graph and then run standard spectral sparsification algorithms on them. However, these random-walk graphs are usually dense even when G is sparse: It is cost- prohibitive both in time and memory to construct when the input graph G is of large-scale. 
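To make the preceding two pieces concrete, here is a small, fully dense numpy sketch; all sizes, constants, and names are choices made for the example, and nothing here is the nearly linear time algorithm developed below. It forms the degree-3 random-walk matrix polynomial of a small graph explicitly, confirms that it is again a graph Laplacian as in Proposition 4.8, and then sparsifies it with the GraphSampling routine of Figure 4.1, using exact effective resistances computed from a pseudoinverse in place of the cheap upper bounds derived in the next subsection.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 50

    # Small connected weighted graph: random edges plus a cycle to keep every degree positive.
    A = np.triu((rng.random((n, n)) < 0.2) * rng.random((n, n)), k=1)
    A = A + A.T
    A[np.arange(n), (np.arange(n) + 1) % n] += 1.0
    A[(np.arange(n) + 1) % n, np.arange(n)] += 1.0

    D = np.diag(A.sum(axis=1))
    W = np.linalg.solve(D, A)                       # random-walk transition matrix D^{-1} A

    # Dense construction of L_alpha(G) = D - sum_r alpha_r * D (D^{-1}A)^r (Definition 4.1).
    alpha = [0.5, 0.3, 0.2]
    L = D.astype(float).copy()
    Wr = np.eye(n)
    for a_r in alpha:
        Wr = Wr @ W
        L -= a_r * D @ Wr

    # Proposition 4.8: the polynomial is again a graph Laplacian.
    assert np.allclose(L.sum(axis=1), 0) and np.allclose(L, L.T)
    assert np.max(L - np.diag(np.diag(L))) <= 1e-12  # non-positive off-diagonals

    # GraphSampling (Figure 4.1) with tau_e = w(e) * R(e); exact effective resistances
    # come from a pseudoinverse here purely for illustration.
    edges = [(u, v) for u in range(n) for v in range(u + 1, n) if L[u, v] < 0]
    w = np.array([-L[u, v] for u, v in edges])
    Lpinv = np.linalg.pinv(L)
    R = np.array([Lpinv[u, u] + Lpinv[v, v] - 2 * Lpinv[u, v] for u, v in edges])
    tau = w * R
    p = tau / tau.sum()

    eps = 0.5
    M = int(20 * n * np.log(n) / eps**2)            # sample count of order n log n / eps^2
    Lsparse = np.zeros((n, n))
    for e in rng.choice(len(edges), size=M, p=p):
        u, v = edges[e]
        we = w[e] / (M * p[e])
        Lsparse[u, u] += we; Lsparse[v, v] += we
        Lsparse[u, v] -= we; Lsparse[v, u] -= we

    # Spot-check spectral similarity: quadratic forms agree on random mean-zero test vectors.
    X = rng.standard_normal((n, 5))
    X -= X.mean(axis=0)
    ratios = np.einsum('ij,ij->j', X, Lsparse @ X) / np.einsum('ij,ij->j', X, L @ X)
    print(ratios)                                    # each entry should lie roughly in [1-eps, 1+eps]

The algorithms that follow avoid both expensive steps: the dense polynomial is never formed, and path sampling with the upper bounds of the next subsection replaces the exact resistances.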
To enable our sparse Newton’s method, we need to efficiently construct sparse approximations of this natural family of random-walk matrix polynomials. The rest of the section is organized as follows. In Section 4.3.1, we give the first nearly-linear time algorithm for sparsifying random-walk matrix polynomials. In Section 4.3.2, we build upon the first algorithm and construct a more efficient 67 sparsificationalgorithmforsparsifyinghigherdegreerandom-walks. InSection4.3.3, weextendourresultsfromgraphLaplacianstopolynomialsinvolvingX whenI−X is an SDDM matrix. 4.3.1 Sparsification by random walks Adding a pathp fromG to the sparsifier with probability that upper bounds the effective resistance in G r : a. Pick an integer k∈ [1 :r] and an edge e∈G, both uniformly at random. b. Perform a (k− 1)-step random walk from one end of e. c. Perform a (r−k)-step random walk from the other end of e. d. Keep track of its weight during the process, and finally add a fraction of this edge to our sparsifier. Figure 4.2: Pseudocode for Sparsification of Random Walk Matrices. Westartbyexaminingtherandom-walkmatrixmonomials. WesparsifyL Gr = D−D(D −1 A) r in two steps. In the first and critical step, we obtain an initial sparsifierwithO(r·m logn/ 2 )non-zerosforL Gr , wheremisthenumberofnonzeros inA. In the second step, we apply the standard spectral sparsification algorithms to further reduce the number of nonzeros to O(n logn/ 2 ). In the first step sparsification, we bound the effective resistance on G r using Lemma 4.9 below, which allows us to use the resistance of any length-r path on G to upper bound the effective resistance between its two endpoints on G r . Lemma 4.12 shows that we can sample by these estimates efficiently. 68 Lemma 4.9 (Two-Step Supports). For a graph Laplacian matrixL =D−A, with diagonal matrix D and nonnegative off-diagonal A, for all positive odd integer r, we have 1 2 L G L Gr rL G . and for all positive even integers r we have L G 2 L Gr r 2 L G 2 . Proof. LetX =D − 1 2 AD − 1 2 , for any integer r, the statements are equivalent to 1 2 (I−X)I−X 2r+1 (2r + 1) (I−X) I−X 2 I−X 2r r I−X 2 . Because X can be diagonalized by unitary matrix U as = UXU > , where = diag(λ 1 ,λ 2 ,...,λ n ) and λ i ∈ [−1, 1] for all i. Therefore, we can reduce the inequalities to their scalar cases, 1 2 (1−λ)≤ 1−λ 2r+1 ≤ (2r + 1) (1−λ), ∀λ∈ (−1, 1) and odd integer r; (1−λ 2 )≤ 1−λ 2r ≤r(1−λ 2 ), ∀λ∈ (−1, 1) and even integer r; which concludes the proof. For each length-r pathp = (u 0 ...u r ) in G, we have a corresponding edge in G r , with weight proportional to the chance of this particular path showing up in the 69 random walk. We can view G r as the union of these edges. Specifically, the weight of a pathp = (u 0 ...u r ) is w(p) = Q r i=1 A(u i−1 ,u i ) Q r−1 i=1 D(u i ,u i ) . The Laplacian ofG r is given by the sum of the weights of all paths with correspond- ing starting/ending pairs L Gr (u 0 ,u r ) =− X p=(u 0 ...ur ) w(p). WeboundtheeffectiveresistanceonG r intwodifferentways. Ifr isodd,Gisa goodapproximationofG r , sowecanobtainanupperboundusingthe resistance ofa length-r (not necessarily simple) path onG. Ifr is even,G 2 is a good approximation of G r , in this case, we get an upper bound by composing the effective resistances of 2-hop paths in different subgraphs of G 2 . We summarize this in the following Lemma. Lemma 4.10 (Upper Bounds on Effective Resistance). For a graph G with Lapla- cian matrix L = D−A, let L Gr = D−D(D −1 A) r be its r-step random-walk matrix. 
Then, the effective resistance between two vertices u 0 and u r on L Gr is upper bounded by R Gr (u 0 ,u r )≤ r X i=1 2 A(u i−1 ,u i ) , where (u 0 ...u r ) is a path in G. 70 Proof. When r is a positive odd integer, by Lemma 4.9, we have that 1 2 L G L Gr , which implies that for any edge (u,v) in G, R Gr (u,v)≤ 2R G (u,v) = 2 A(u,v) . Because effective resistance satisfies triangular inequality, this concludes the proof for odd r. When r is even, by Lemma 4.9, we have that L G 2 L Gr . The graph of D−AD −1 A can be viewed as a superposition of subgraphs anchored at different vertices, and the effective resistance of these subgraphs is studied in [PS14]. For the subgraph G 2 (u) anchored at the vertex u (the subgraph formed by length-2 paths where the middle node is u), for any two of its neighbors v 1 and v 2 , we have R Gr (v 1 ,v 2 )≤R G 2 (u) (v 1 ,v 2 ) = 1 A(v 1 ,u) + 1 A(u,v 2 ) . Because we have the above upper bound for any 2-hop path (v 1 ,u,v 2 ), by the tri- angular inequality of effective resistance, the lemma holds for even r as well. This motivates us to define effective resistances upper bounds Z(p) = r X i=1 2 A(u i−1 ,u i ) , andsparsifyG r bysamplingaccordingtoZ(p). Thesimpleformoftheupperbounds yields an analytic form of the partition function though the following mathematical identity, which is crucial in our analysis. 71 Lemma 4.11 (A Random-Walk Identity). Given a graph G with m edges, consider the r-step random-walk graph G r and the corresponding Laplacian matrix L Gr = D−D(D −1 A) r , the summation of w(p)Z(p) over all length-r paths satisfies X p w(p)·Z(p) = 2rm. Proof. We substitute the expression for w(p) and Z(p) into the summation. X p=(u 0 ...ur ) w(p)Z(p) = X p r X i=1 2 A(u i−1 ,u i ) ! Q r j=1 A(u j−1 ,u j ) Q r−1 j=1 D(u j ,u j ) ! = 2 X p r X i=1 Q i−1 j=1 A(u j−1 ,u j ) Q r−1 j=i A(u j ,u j+1 ) Q r−1 j=1 D(u j ,u j ) ! (4.9) = 2 X e∈G r X i=1 X p with (u i−1 ,u i )=e Q i−1 j=1 A(u j−1 ,u j ) Q r−1 j=i A(u j ,u j+1 ) Q r−1 j=1 D(u j ,u j ) (4.10) = 2 X e∈G r X i=1 X p with (u i−1 ,u i )=e Q i−1 j=1 A(u j−1 ,u j ) Q i−1 j=1 D(u j ,u j ) · Q r−1 j=i A(u j ,u j+1 ) Q r−1 j=i D(u j ,u j ) (4.11) = 2 X e∈G r X i=1 1 = 2mr. (4.12) From Equation (4.9) to (4.10), instead of enumerating all paths, we first fix an edge e to be the i-th edge on the path, and then extend from both ends of e. From Equation (4.11) to (4.12), we sum over indices iteratively from u 0 tou i−1 , and from u r to u i . Because D(u,u) = P v A(u,v), this summation over all possible paths anchored at (u i−1 ,u i ) equals to 1. Now we show that we can perform Step (2) in GraphSampling efficiently. Lemma 4.11 and Theorem 4.7 shows that sampling e O(rm/ 2 ) paths suffices. Recall 72 that sampling an edge from G r corresponds to sampling a path of length r in G. We take samples in the same way we cancel the terms in the previous proof, which is described in Figure 4.2. Lemma 4.12. There exists an algorithm for the Step (2) in GraphSampling, such that after preprocessing with workO(n), it can draw an edge e as in Step (2) with workO(r· logn). Proof. Thetaskistodrawasamplep = (u 0 ...u r )fromthemultivariatedistribution D Pr(u 0 ...u r ) = 1 2rm · r X i=1 2 A(u i−1 ,u i ) ! · Q r i=1 A(u i−1 ,u i ) Q r−1 j=1 D(u i ,u i ) ! . 
For any fixed k∈ [1 :r] and e∈G, we can factorize the distribution as Pr(u 0 ...u r ) = r X k=1 Pr(k)· Pr((u k−1 ,u k ) =e)· i=k−1 Y 1 Pr(u i−1 |u i )· r Y i=k+1 Pr(u i |u i−1 ) (4.13) where Pr(k) = 1 r , Pr((u k−1 ,u k ) =e) = 1 m , and Pr(u i−1 |u i ) =A(u i ,u i−1 )/D(u i ,u i ). The first two terms in Equation (4.13) correspond to Step (a) in the sampling algo- rithm stated above, and latter two terms correspond to Step (b) and (c). With linear preprocessing time, we can draw a uniform random edge in timeO(logn), and we can also simulate two random walks with total length r in timeO(r logn), so Step (2) in GraphSampling can be done withinO(r logn) time. Combining path sampling with standard spectral sparsifiers gives our algo- rithm for efficiently sparsifying low-degree random-walk matrix polynomials. Proof of Theorem 4.4. First we show on how to sparsify a degree-r monomialL G d = D−D·(D −1 A) r . WeusethesamplingalgorithmdescribedinTheorem4.7, together 73 with upper bounds on effective resistance of G r obtained from D−A and D− AD −1 A. The total number of samples requires is O(rm logn/ 2 ). We use Lemma 4.12 to draw a single edge fromG t , where we sampler-step random walks onD−A, so the total running time isO(r 2 m log 2 n/ 2 ). Now that we have a spectral sparsifier with O(rm logn/ 2 ) edges, we can sparsify one more time to reduce the number of edges to O(n logn/ 2 ) by [ST11] in time O(rm log 2 n/ 2 ). To sparsify a random-walk matrix polynomial L (G) of degree t, we spar- sify all the even/odd terms together, so the upper bound of effective resistance in Lemma 4.10 still holds. To sample an edge, we first decide the length r of the path where r ∈ {1, 2,...,t}, according to the probability distribution Pr(length = r|r is odd/even)∝ α r for r = 1, 2,...,t, and then sample from the correspondingL Gr . 4.3.2 Multi-step sparsification of higher degree random- walks The sparsification algorithm presented in Section 4.3.1 runs in time quadrat- ically in the degree of the polynomial. In this section, we build upon it to give an algorithm with polylogarithmic dependence on the degree. Our approach has resemblance to repeated squaring in that it relies on several subroutines for using a sparsifier for a lower degree random walk monomials to approximate a higher degree one. We will show that given a positive even integer 2r and f A such thatD− f A≈ D−D(D −1 A) 2r , we can efficiently compute (with strong error preserving properties, which we omit until the full details): 1. f A × such thatD− f A × ≈D− f AD −1f A≈L G 4r in Section 4.3.2, 74 2. f A + such thatD− f A + ≈D− (AD −1 ) 2f A(D −1 A) 2 ≈L G 2r+4 in Section 4.3.2, and 3. f A ∗ such thatD− f A ∗ ≈D− f AD −1 AD −1f A≈L G 4r+1 in Section 4.3.2. These primitives allow us to sparsify any monomials of the form t = 4y, and t = 4y + 2 will be approximated by the nearby t = 4y. Then the third operation allows us to sparsify t = 4y + 1, andt = 4y + 3 will be approximated by the nearby t = 4y + 1. We will put them together to obtain the full algorithm in Section 4.3.2. Constructing f A × We construct f A × by sparsifying D− f AD −1f A with the algorithm described in Section 4.3.1. The remaining is to show that spectral approximation holds under squaring. Specifically, if f A andA satisfy: D− f A≈ 0D−D D −1 A 2r , then D− f AD −1 f A≈ 0D−D D −1 A 4r . 
The proof relies on interpreting D−AD −1 A as the Schur complement of a larger matrix, and use the fact that approximations in these larger matrices also imply approximations in the Schur complements. Definition 4.13. For a symmetric matrixM in block form, M = M 11 M 12 M > 12 M 22 , 75 its Schur complement onto the second half of variables is the matrix M 11 −M 12 M −1 22 M > 12 . Specifically, we will utilize the fact thatD−AD −1 A is the Schur complement of D −A −A D . This interpretation allows us to use the following fact about approximations between Schur complements: Fact 4.14 (Lemma B.1. from [MP13]). Suppose M and g M are positive semi- definite matrices satisfying M≈ g M, then their Schur complements on the same set of vertices also satisfyM schur ≈ g M schur . WhenA corresponds to an even-step (2r-step in graphG) random walk gives A 0, itallowsustogetfromclosenessintheLaplacianstoclosenessinthepositive Laplacians. Lemma 4.15. IfA< 0 and (1−) (D−A)4D− f A4 (1 +) (D−A), then (1−) (D +A)4D + f A4 (1 +) (D +A). Proof. Rearranging the condition (1−) (D−A) 4 D− f A gives f A 4 D + (1−)A. AddingD to both sides then gives D + f A4 (1 +)D + (1−)A. 76 Combining with 04A givesD + f A4 (1 +) (D +A). The other side can be proved similarly: (1−) (D +A) (1−)D+(1 +)A =D+ f A+ D− f A −(1 +) (D−A)D+ f A. This additional approximation allows us to decompose quadratic forms involv- ing the Schur complement. Lemma 4.16. IfD−A≈ D− f A andD +A≈ D + f A, then D−AD −1 A≈ D− f AD −1 f A. Proof. Because of Lemma 4.14, it suffices to show that D −A −A D ≈ D − f A − f A D . (4.14) Consider a test vector x y , we have x > y > D −A −A D x y = 1 2 h (x +y) > (D−A) (x +y) > + (x−y) > (D +A) (x−y) > i , which leads to Equation (4.14), since we have D−A≈ D− f A and D +A≈ D + f A. 77 Putting Lemmas 4.15 and 4.16 together gives: Lemma 4.17. Let the graph LaplacianL G =D−A andL e G 2r =D− f A forr∈Z + , such that (1)D− f A≈ 0D−D (D −1 A) 2r , and (2) f A contains m nonzero entries, then for any> 0, we can construct in total workO(m log 3 n/ 4 ), a graph Laplacian L e G × = D− f A × with f A × containing at mostO(n logn/ 2 ) nonzero entries, such that D− f A × ≈ 0 + D−D D −1 A 4r . Constructing f A + We extend this squaring routine by composing its walk with shorter random walks on each side. That is, we sparsify D− AD −1 2 f A D −1 A 2 using effective resistance estimates fromD−AD −1 A andD− f A. We first show that symmetric compositions preserve spectral approximations. Lemma 4.18. IfD−A≈ D− f A, then for any symmetric c A such that 04 c A4D, we have D− c AD −1 AD −1 c A≈ D− c AD −1 f AD −1 c A. Proof. Since c AD −1 = (D −1c A) > , we can composeD−A≈ D− f A withD −1c A using Fact 4.6.e to get: c AD −1 (D−A) D −1 c A ≈ c AD −1 D− f A D −1 c A , (4.15) 78 or if expanded: c AD −1 c A− c AD −1 AD −1 c A≈ c AD −1 c A− c AD −1 f AD −1 c A. SinceD− c A< 0, we haveD− c AD −1c A< 0. Adding it to both sides of (4.15) using Fact 4.6.a. turns the c AD −1c A terms intoD terms, concluding the proof. Lemma 4.19. For graph LaplacianL G =D−A andL e G 2r =D− f A for r∈Z + , such that (1)D− f A≈ 0D−D (D −1 A) 2r , and (2) f A andA each contains at mostm nonzero entries, then for any> 0, we can construct in total workO(m log 3 n/ 4 ), a graph LaplacianL e G + =D− f A + with f A + containing at mostO(n logn/ 2 ) nonzero entries, such that D− f A + ≈ 0 + D−D D −1 A 2r+4 . Proof. By Lemma 4.18, we have D− AD −1 2 f A D −1 A 2 ≈ D− AD −1 2 D D −1 A 2r D −1 A 2 =D−D D −1 A 2r+4 . 
Next step is to support D− (AD −1 ) 2 f A (D −1 A) 2 with D− f A and D− AD −1 A. First, we have exp (−) D−AD −1 A 4 exp (−) D−D D −1 A 2r+4 4D− AD −1 2 f A D −1 A 2 We can also show that exp (−2) D− f A 4 exp (−) D−D D −1 A 2r 4 exp (−) D−D D −1 A 2r+4 4D− AD −1 2 f A D −1 A 2 . 79 Nowweprovethat,withthesupportofD−AD −1 AandD− f A,thereexistsan efficient algorithm to conduct the edge sampling. To sample (AD −1 ) 2 f A (D −1 A) 2 efficiently, we consider the length-5 pathp = (u 0 ,u 1 ,u 2 ,u 3 ,u 4 ,u 5 ), where (u 0 ,u 1 ), (u 1 ,u 2 ), (u 3 ,u 4 ), (u 4 ,u 5 ) are edges inA and (u 2 ,u 3 ) is an edge in f A. The edge corresponds top has weight w(p) as w(p) = A (u 0 ,u 1 )A (u 1 ,u 2 ) f A (u 2 ,u 3 )A (u 3 ,u 4 )A (u 4 ,u 5 ) D (u 1 ,u 1 )D (u 2 ,u 2 )D (u 3 ,u 3 )D (u 4 ,u 4 ) . And it has upper bound on effective resistance Z(p) as Z(p) = exp() A (u 0 ,u 1 ) + exp() A (u 1 ,u 2 ) + exp(2) f A (u 2 ,u 3 ) + exp() A (u 3 ,u 4 ) + exp() A (u 4 ,u 5 ) (4.16) The sampling algorithm is similar to that described in Lemma 4.12 and Fig- ure 4.2. The modifications are: 1. The middle edge of the random-walk is replaced by a random walk on graph f A. 2. The edge index in the path is no longer uniform among{1, 2,..., 5}. Instead, it is now proportional to number of edges in the corresponding random-walk weighted by terms related to the approximation factor as given in Equa- tion 4.16. Nonetheless, the total work for path sampling remains up to dependencies on . After the initial sparsification by path sampling, we can further sparsify the graph as before with spectral sparsifiers. 80 Constructing f A ∗ Both of the previous constructions lead to approximations of random walks whose step length are divisible by 4. As we will see in Section 4.3.2, this suffices for approximating even length walks of sufficient lengths. However, odd step and even step random walks can behave quite differently: the random walk on a bipartite graph is periodic, and the even step random walks lead to two disconnected compo- nents. Our final construction addresses this issue by going from an approximation of the 2r-step walk to an approximation of the (4r + 1)-step walk. Our key statement is that if D− f A≈ D−D −1 (DA) 2r , then D− f AD −1 AD −1 f A≈ 3 L G 4r+1 . We also show that such an f A ∗ can be efficiently approximated by sampling from resistance upper bounds given by D−A and D− f A. This in turn leads to an efficient sparsification routine that resembles Lemma 4.19 from Section 4.3.2 above. The core of our proof is similar to the ones from Section 4.3.2: it is by con- sidering Schur complements as given in Definition 4.13. As Schur complements are usually studied for full rank matrices, we first consider the case whereA 0. Then we generalize this conclusion to positive-semidefinite A 0 in Lemma 4.21 via infinitesimal perturbations, and we further extend it to the general case of any c A in Lemma 4.23. 81 Lemma 4.20. If D− c A≈ D− f A and 04 c A, then for any symmetric A such that 0≺A4D, we have D− c AD −1 AD −1 c A≈ D− f AD −1 AD −1 f A. Proof. By Lemma 4.15 and Lemma 4.16 we have: D − c A − c A D ≈ D − f A − f A D . The fact thatAD andA having full rank gives 04DA −1 D−D, which when placed in a matrix block gives: 0 0 0 DA −1 D−D < 0. Fact 4.6[b] allows us to add this to both sides, giving D − c A − c A DA −1 D ≈ D − f A − f A DA −1 D Applying this approximation to the Schur complement of the top left block via Lemma 4.14 then gives the result. We extend this to the case where A can have a null space, aka. 
0 4A, by adding an infinitely small perturbation. 82 Lemma 4.21. If D− c A≈ D− f A and 04 c A, then for any symmetric A such that 04A4D, we have D− c AD −1 AD −1 c A≈ D− f AD −1 AD −1 f A. Proof. Take the spectral decomposition ofD −1/2 AD −1/2 as D −1/2 AD −1/2 = r X i=1 λ i u i u > i , whereλ i andu i are the eigenvalues and the eigenvectors, respectively. When r<n (thatA∈R n×n ), complete the eigenbasis withu r+1 ,...,u n , and defineA λ as A λ =D 1/2 r X i=1 λ i u i u > i + n X i=r+1 λu i u > i D 1/2 . Then for any 0<λ< 1, we have that 0≺A λ 4D so that D− c AD −1 A λ D −1 c A≈ D− f AD −1 A λ D −1 f A. Since that for any vectory∈R n , we have that lim λ→0 + y > D− c AD −1 A λ D −1 c A y =y > D− c AD −1 AD −1 c A y, the result follows from the definition of spectral similarity. To generalize the result to any A with A 4 D, we first prove the following technical lemma. 83 Lemma 4.22. IfA≈ B andA +C≈ B +D andC,D< 0, we have that A +D≈ 3 B +C. Proof. We have that exp(2)(B +C)< exp()(exp()B +C) < exp()(A +C) <B +D < exp(−)A +D< exp(−)(A +D). Similarly, we have exp(3)(A +D)<B +C, soA +D≈ 3 B +C. Next, we prove the general result by decomposingA into positive and negative parts in its eigenspace. Lemma 4.23. If 04 c A, and we’re given f A such that D− c A≈ D− f A, then for any symmetricA such that 04D−A, we have D− c AD −1 AD −1 c A≈ 3 D− f AD −1 AD −1 f A. Proof. Take the spectral decomposition ofD −1/2 AD −1/2 D −1/2 AD −1/2 = n X i=1 λ i u i u > i , 84 where λ 1 ≥λ 2 ≥...λ p ≥ 0≥λ p+1 ≥...≥λ n . We defineA + andA − as A + =D 1/2 p X i=1 λ i u i u > i ! D 1/2 andA − =D 1/2 n X i=p+1 −λ i u i u > i D 1/2 . Then 04A + +A − 4D. By Lemma 4.21, we have D− c AD −1 (A + +A − )D −1 c A≈ D− f AD −1 (A + +A − )D −1 f A (4.17) Add c AD −1 A − D −1c A + f AD −1 A − D −1f A< 0 to both sides of Equation (4.17), we have that D− c AD −1 A + D −1 c A+ f AD −1 A − D −1 f A≈ D− f AD −1 A + D −1 f A+ c AD −1 A − D −1 c A. By Lemma 4.22, we can swap the terms to match c A and f A. And recall A = A + −A − , we have that D− c AD −1 AD −1 c A≈ 3 D− f AD −1 AD −1 f A. The conditions required by this lemma can be satisfied by setting c A to D (D −1 A) 2r , and setting f A to be the adjacency matrix of a sparsifier of D− c A. It remains to show that an f A ∗ satisfying D− f A ∗ ≈ D− f AD −1 AD −1f A can be computed efficiently. Lemma 4.24. For graph LaplacianL G =D−A andL e G 2r =D− f A for r∈Z + , such that (1)D− f A≈ 0D−D (D −1 A) 2r , and (2) f A andA each contains at mostm nonzero entries, then for any > 0, we can construct in total workO(m log 3 n/ 4 ), 85 a graph LaplacianL e G∗ =D− f A ∗ with f A ∗ containing at mostO(n logn/ 2 ) nonzero entries, such that D− f A + ≈ 0 +3 D−D D −1 A 4r+1 . Proof. The sampling algorithm and analysis is similar to that in the Lemma 4.19, with the difference being that we now sample length three paths with the middle edge inD−A and both ends inD− f A. We only show that the effective resistance of graph D− f AD −1 AD −1f A can be efficiently estimated fromD− f A andD−A. ForD− f A, we have that D− f A4 exp() D−D D −1 A 2r 4 exp() D−D D −1 A 4r+1 4 exp(4) D− f AD −1 AD −1 f A . ForD−A, we have that D−A4 2 D−D D −1 A 4r+1 4 2 exp(3) D− f AD −1 AD −1 f A . CombiningA + ,A × andA ∗ We sparsify high degree monomials by first categorizing them based on their degrees module 4. The following Lemma allows us to replace degree d = 4r + 2 monomials with degree 4r monomials, and replace d = 4r + 3 with 4r + 1, for sufficiently large d. 86 Lemma 4.25. 
LetL Gr be the Laplacian of ther-step random walk matrix as defined at the start of Section 4.3.1. When i,j are of the same parity and i≈ j, we have L G i ≈ L G j . Proof. Similar as in Lemma 4.9, we reduce the statement to that of the scalar case. We first prove for the case when both i and j are odd integers. Without the loss of generality, we assume that (1 +)j >i>j, so we need to prove that i(1−λ j )−j(1−λ i )≥ 0, and i(1−λ i )−j(1−λ j )≥ 0, when λ∈ [−1, 1]. The left hand side of both inequalities are differentiable, so the extreme values are achieved when their derivatives with respect toλ equal to zero, that is whenλ∈ {−1, 0, 1}. It is easy to verify that both inequalities hold when λ∈{−1, 0, 1}. The proof is similar for the case wheni andj are both integers, except that the derivative for left hand side of the second inequality is zero when λ∈{−j/i, 0,j/i}. Combining these components allows us to sparsify high degree monomials effi- ciently. Lemma 4.26 (Sparsification of Higher Degree Random-Walk Matrix Monomial). For any weighted undirected graph G with n nodes and m edges, for any integer t> 0, and any approximation parameter > 0, letL Gt =D−D(D −1 A) t , we can construct in time O(m· log 3 n· log 5 t/ 4 ) an -spectral sparsifier with O(n logn/ 2 ) non-zeros for the random-walk matrix polynomialL Gt . 87 Proof. We first check if t≤ 4/, in which case we can directly use Theorem 4.4 to sparsifyL Gt . So we assume t≥ 4/ for the rest of the proof. Whent is divisible by 4, we start with f A such thatD− f A≈ 0D−AD −1 A, and invoke Lemma 4.17 and Lemma 4.19 k = O(logt) times in total, to reach an approximation for D−D (D −1 A) t . Moreover, if we invoke Lemma 4.17 and Lemma 4.19 with approximation parameter /(2k), and apply [ST11] with /2 on the final output to obtain e G, the total work is bounded byO(m log 3 n log 5 t/ 4 ), and e G satisfiesL e G ≈ L Gt . When t is equal to 4r + 2 for some integer r, Since t> 4/, we can produce a sparsifier for degree 4r monomial with error /2,L e G ≈ /2 L G 4r and use it directly by Lemma 4.25. So in this situation, we know thatL G 4r ≈ /2 L Gt . Combining the two spectral approximation using Fact 4.6.c givesL e G ≈ L Gt . By now, we have proved Theorem 4.5 for all even t. When t = 4r + 1 or t = 4r + 3, we can produce a sparsifier for degree 2r monomial with error /8, and by applying Lemma 4.24 with 0 = /8, it leads to L e G ≈ /2 L G 4r+1 . Note that by Lemma 4.25, we haveL e G ≈ L G 4r+3 . By Lemma 4.26 and separating the random-walk matrix polynomial as a sum- mation over the monomials, we obtain a faster algorithm for spectral sparsification of the random-walk polynomial, in terms of the dependence on the degree of the polynomial, as stated in the following corollary. Corollary 4.27. For any weighted undirected graph G with n nodes and m edges, for every non-negative vector = (α 1 ,...,α t ) with P t r=1 α r = 1, for any > 0, we can construct in time O(nnz()·m· log 3 n· log 5 t/ 4 ) a spectral sparsifier with 88 O(n logn/ 2 ) non-zeros and approximation parameter for the random-walk matrix polynomialL (G). The dependence onnnz() can be undesirable, so we first invoke the following Lemma to expressL (G) as a summation overO(log (t)/) monomials. That is, we will first sparsify before we sparsify the random-walk matrix polynomialL (G) to bring down the computational cost. Lemma 4.28. 
For any weighted undirected graph G, for every non-negative vector = (α 1 ,...,α t ) with P t r=1 α r = 1, for any> 0, we can construct in timeO(nnz()) a non-negative vector = (β 1 ,...,β t ) with P t r=1 α r = 1 and nnz() =O(log(t)/) such thatL (G)≈ L (G). Proof. Without loss of generality, we prove this Lemma assuming has only odd terms, because we can first split the odd and even terms in, and then reduce the number of non-zeros in the coefficients separately. We keep the first 1/ terms in , and let k 0 = 1/. Define k i = k i−1 (1 +), note that k d ≥ t for d = log 1+ (t) = O(logt/). We set i = i when i≤ k 0 , set k i = P k i <j≤k i+1 j , and j = 0 everywhere else. It is easy to verify the number of non-zeros in is (1/ +d) = O(logt/). Finally, Lemma 4.25 enables us to merge two random-walk monomials when their exponents have the same parity and are close enough, so we haveL G k i ≈ L G j for all k i <j≤k i+1 , and thus we can concludeL (G)≈ L (G). Theorem 4.5 follows directly by combining Corollary 4.27 and Lemma 4.28. 89 4.3.3 Extension to SDDM matrices Our path sampling algorithm from Section 4.3 can be generalized to an SDDM matrix with splittingD−A. The idea is to split out the extra diagonal entries to reduce it to the Laplacian case. Of course, theD −1 in the middle of the monomial is changed, however, it only decreases the true effective resistance so the upper bound in Lemma 4.10 still holds without change. The main difference is that we need to put back the extra diagonal entries, which is done by multiplying an all 1 vector throughD−D(D −1 A) r . The follow Lemma can be proved similar to Lemma 4.8. Lemma4.29 (SDDMPreservation). IfM =D−A is an SDDM matrix with diago- nal matrixD and nonnegative off-diagonalA, for any nonnegative = (α 1 ,...,α d ) with P d r=1 α r = 1,M =D− P d r=1 α r D(D −1 A) r is also an SDDM matrix. Our algorithm is a direct modification of the algorithm from Section 4.3.2. To analyze it, we need a variant of Lemma 4.10 that bounds errors with respect to the matrix by which we measure effective resistances. We use the following statement from [Pen13]. Lemma 4.30 (Lemma B.0.1. from [Pen13]). Let A = P i y > e y e and B be n×n positive semi-definite matrices such that the image space of A is contained in the image space ofB, and τ be a set of estimates such that τ e ≥y > e B † y e ∀e. 90 Then for any error and any failure probability δ = n −d , there exists a constant c s such that if we construct A using the sampling process from Figure 4.1, with probability at least 1−δ = 1−n −d , f A satisfies: A− (A +B) f AA + (A +B). Theorem 4.31 (SDDM Polynomial Sparsification). LetM =D−A be an SDDM matrix with diagonal D, nonnegative off-diagonal A with m nonzero entries, for any nonnegative = (α 1 ,...,α d ) with P d r=1 α r = 1, we can define M = D− P d r=1 α r D(D −1 A) r . For any approximation parameter > 0, we can construct an SDDM matrix g M withO(n logn/ 2 ) nonzero entries, in timeO(m· log 2 n·d 2 / 2 ), such that g M≈ M . Proof. We look at each monomial separately. First, by Lemma 4.29, M r is an SDDM matrix. It can be decomposed as the sum of two matrices, a Laplacian matrix L r = D r −D(D −1 A) r , and the remaining diagonal D extra . As in the Laplacian case, a length-r paths inD−A corresponds to an edge inL r . We apply Lemma 4.30 toM r andL r = P e∈P y e y > e , where P is the set of all length-r paths inD−A, andy e is the column of the incidence matrix associated with e. 
When r is an odd integer, we have y > e (M r ) −1 y e ≤ 2y > e (D−A) −1 y e , and when r is an even integer, we have y > e (M r ) −1 y e ≤ 2y > e D−AD −1 A −1 y e . 91 Lete denote the edge corresponds to the length-r path (u 0 ,...,u r ), the weight of e is w(e) =w(u 0 ...u r ) =y > e y e = Q r i=1 A(u i−1 ,u i ) Q r−1 i=1 D(u i ,u i ) ≤ Q r i=1 A(u i−1 ,u i ) Q r−1 i=1 D g (u i ,u i ) , whereD g (u,u) = P v6=u A(u,v), so we have the same upper bound as the Laplacian case, and we can sample random walks in the exact same distribution. It implies that we can sparsifyL r using leverage scores calculated onM r as statedinLemma4.30. ThereexistsM =O(r·m·logn/ 2 )suchthatwithprobability at least 1− 1 n , the sampled graph e G =GraphSampling(G r ,τ e ,M) satisfies L r − 1 2 (L r +M r )4L e G 4L r + 1 2 (L r +M r ). Now if we addD extra and set g M =L e G +D extra , we will have (1−)M r 4 g M 4 (1 +)M r . Note thatD extra can be computed efficiently by computing diag(M r 1) via matrix- vector multiplications. 92 Chapter 5 Matrix Completion The problem of low-rank matrix completion is continually attracting atten- tion for its applicability to many real-world problems. Still, the large size, extreme sparsity, and non-uniformity of these matrices pose a challenge. In this paper, we make the observation that even when the observed matrix is not suitable for accu- rate completion there may be portions of the data where completion is still possible. We propose the CompleteID algorithm, which exploits the non-uniformity of the observation, to analyze the completability of the input instead of blindly apply- ing completion. Balancing statistical accuracy with computational efficiency, we relate completability to edge-connectivity of the graph associated with the input partially-observed matrix. We develop the MaxKCD algorithm for finding maximally k-edge-connected components efficiently. Experiments across datasets from a vari- ety of applications demonstrate not only the success of CompleteID but also the importance of completability analysis. 5.1 Introduction Low-rank matrix completion is a tool that has been widely adopted across a variety of applications. The problem takes as input a partially-observed matrix and asks for a completion of the missing values such that the estimate is low rank. The large adoption of matrix completion in practice is supported by a body 93 of elegant theoretical and practical results surrounding low-rank matrix comple- tion [CCS10, CR09, CBSW14, DR16, GLM16, HMLZ15, KMO10, WYZ12]. To pro- vide guarantees on the quality of completion, the observations are typically assumed to sufficiently cover the matrix, both in quantity and uniformity. For example, Candès and Tao [CT10] showed that when Ω(nrpolylog(n)) entries are observed uniformly at random, an n×n incoherent matrix of rank r can be accurately com- pleted with high probability. Unfortunately, it is often the case that real-world data does not satisfy the assumptions required for accurate completion; not only is the observed data highly sparse, but the entries are rarely observed uniformly at random [MJD09]. In this case, blindly applying a completion algorithm does not guarantee accurate completion, and runs the danger of poor quality estimates. However, even when the matrix as a whole is not well suited for existing com- pletion algorithms, there may very well be submatrices that can still be completed reliably. 
By examining the sparsity and non-uniformity of the observed data, valuable information on the location of the completable submatrices can be extracted. Identifying such submatrices isolates parts of the data that can be completed accurately and more efficiently due to the reduced problem size. When the matrix estimate is used to guide an action, such as recommendation, traffic routing, or experimentation, it is undoubtedly important to know which parts of the estimate can be relied upon. Moreover, knowing which parts of the data are completable also provides feedback on how information is distributed throughout the matrix with respect to the completion task. Such information can then be used to study the behavior of consumers, to guide additional data acquisition, and more.
In this work, we propose to incorporate an analysis of the completability of the input which runs in parallel to the matrix completion. The proposed framework, which we call CompleteID and depict in Figure 5.1, transforms the question of completability, i.e., does there exist a unique rank-k matrix that matches the observations?, into a graph mining task on a bipartite graph built from the observation pattern, which admits an efficient solution. Building upon previous results on matrix completability [KT13, KTT15, SC10], we relate completability to edge-connectivity of the associated graph: a submatrix is considered completable when the corresponding subgraph is k-edge-connected for target rank k. This correspondence trades a slightly relaxed theoretical guarantee for an efficient algorithm that makes application to web-scale data analysis possible. As a key module in CompleteID, we propose an efficient algorithm, called MaxKCD, to enumerate maximal k-edge-connected components; the corresponding completable submatrices can either be completed later by off-the-shelf algorithms or serve as an indication of confidence on an existing prediction. Further, the algorithm we propose does not require an additional tuning parameter and is largely agnostic to the completion algorithm used. To the best of our knowledge, the CompleteID algorithm is the first scalable algorithm that discovers completable submatrices for real-world applications.
Figure 5.1: Traditional workflow (top) and two possible proposed workflows where completability is applied (1) as a post-facto analysis of which estimates are reliable; (2) a priori to guide completion.
We advocate for a new workflow that incorporates completability analysis alongside matrix prediction in any real-world application that is based on low-rank matrix completion, such as recommender systems. It has an array of benefits. First, the cold-start problem no longer requires manual handling, as a newly arrived user with insufficient data is identified as non-completable. Second, in the case that the overall completion accuracy is low, the work proposed here is able to identify entries that were overshadowed by the large error and can be accurately predicted nonetheless. Finally, even if the error is reasonable, there may be a disparity in the accuracy of the entries, giving a false sense that accurate recommendations are being made. This situation is dangerous as it has the propensity to lead to inaccurate recommendations that deter customers.
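To make the proposed workflow concrete, the sketch below shows how completability analysis can sit alongside completion: the observation pattern Ω is mapped to its bipartite graph, the graph is decomposed into maximal k-edge-connected components, and only entries whose row and column fall in the same component are treated as reliable. This is a minimal Python illustration only: networkx's k_edge_components is used as a stand-in for the MaxKCD algorithm developed in Section 5.4, and the function names are ours rather than part of any released implementation.

import networkx as nx

def build_bipartite_graph(observed):
    # One vertex per row index and per column index of M_Omega;
    # an edge (i, j) for every observed entry (i, j) in Omega.
    G = nx.Graph()
    for i, j in observed:
        G.add_edge(("row", i), ("col", j))
    return G

def completable_components(observed, k):
    # Decompose G_Omega into maximal k-edge-connected components and
    # record, for each component, the rows and columns it spans.
    # networkx's k_edge_components stands in for MaxKCD here.
    G = build_bipartite_graph(observed)
    components = []
    for comp in nx.k_edge_components(G, k=k):
        rows = {i for kind, i in comp if kind == "row"}
        cols = {j for kind, j in comp if kind == "col"}
        if rows and cols:
            components.append((rows, cols))
    return components

def is_completable(i, j, components):
    # A missing entry (i, j) is flagged as completable if some component
    # contains both row i and column j; completion is then applied (or
    # trusted) only on the corresponding submatrices.
    return any(i in rows and j in cols for rows, cols in components)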
The contributions of the paper include: • We study the novel problem of completable submatrix discovery: identify sub- matrices that are accurately completable. To solve the problem in a practical and theoretically supported manner, we identify a criterion for matrix com- pletability by using edge-connectivity on the associated graph. • We develop the first scalable solution, CompleteID, for the completable subma- trix discovery. The CompleteID algorithm can be incorporated into real-world applications, without noticeable overhead, and empirical evaluation on both synthetic and real-world datasets demonstrate its success. • We propose the novel graph mining algorithm, MaxKCD, for maximal k-edge- connected component decomposition. The core subroutine we propose, (kCut), 96 finds a graph cut with size less than k and can be used in many tasks such as accelerating the Stoer-Wagner algorithm for global minimum cut. 1 The paper is organized in the following manner: we briefly review the related works and notations in Section 5.2. In Section 5.3, we theoretically and empirically motivate edge-connectivity as the completability criterion. Section 5.4 gives the details of MaxKCD that efficiently findsk-connected components. Finally, we demon- strate the practical utility of CompleteID in Section 5.7. Due to the page limit, we defer the proofs and some additional experiments to the appendix. 5.2 Preliminaries and related work Notation: Throughout the paper we will useM to refer to an n×m fully-known matrix of rank k. When the matrix is only partially known, we useM Ω to refer to the matrix that is equal toM on the entries specified by Ω⊆{(i,j)|1≤i≤n, 1≤ j≤m} and zero elsewhere. When a completion algorithm is applied to a partially observed matrixM Ω with target rank k, the output estimate is denoted as d M. For a partially observed matrixM Ω , the corresponding bipartite graph G Ω = (V 1 ,V 2 ,E) is constructed as follows: create a vertexi∈V 1 for every row ofM Ω and a vertex j∈V 2 for every column. Add an edge (i,j) to G Ω if and only if (i,j)∈ Ω (i.e., the entry is observed). Related work: There are a large body of elegant theoretical and practical results for low-rank matrix completion [CCS10, CR09, CBSW14, DR16, GLM16, HMLZ15, KMO10, WYZ12]. In order to provide guarantees on the quality of the completion, 1 The source code of our C++ implementation will be released. 97 two types of assumption on the input have been established. The first set states that each low-rank component must be spread evenly across the matrix, for example, the notion of incoherence [CR09] is commonly adopted. The second set of assumptions concerns the locations of observed entries, requiring the pattern to sufficiently cover the matrix both in quantity and uniformity. Mostoftheexistingworkonlow-rankmatrixcompletionfocuseson1)reducing the required number of observations under particular assumptions, or 2) improving the computational efficiency of the completion algorithm. There has been little work devoted to the case when the input has an insufficient number of observations, in which the low-rank solution might not be completable even with a known rank k. Existing work has incorporated side information [MCG + 11, PAW10, XJZ13], or taken an active approach by adding new entries [CZB + 13, RCT15]. Unfortunately, active approaches are not feasible in many applications that are restricted in mone- tary and physical capabilities to acquire new information, nor in those that rely on human good will to provide it. 
Further, while the algorithm in [RCT15] can identify completable entries, the results are dependent on the initialization and on the active capability.
A few works study the matrix completion problem in a similar setting, where the pattern of observation is treated as a deterministic input. In [KT13], Király et al. produce accuracy estimates for entries when the matrix has rank 1. The approach samples submatrices that contain a particular missing entry and uses the distribution of completions to produce an estimate of the confidence. Király et al. [KTT15] also provided a necessary and sufficient condition for a partially-observed matrix to have a finite number of completions at rank k. This condition was used as the basis for an algorithm for identifying the missing entries in the partially observed matrix which can take only a finite number of possible values (called finitely completable). In another line of work, Bhojanapalli and Jain [BJ14] proved that the spectral gap of the associated bipartite graph can be used as a sufficient condition for matrix completability with the nuclear-norm surrogate. Unfortunately, none of these works is suitable for the purpose of identifying completable submatrices, as they are computationally demanding for large real-world matrices, let alone for our submatrix-search version. Further, many of these works are too coarse, providing only yes/no feedback on the matrix as a whole.
Several works either explicitly or implicitly estimate the per-entry error margin on estimates produced by a particular matrix completion algorithm [CZB+13, FBK+, SPS13]. While differing in the precise details, each method constructs a particular model for the completion process. The variance of the estimate of each entry in this model is used to represent the confidence, and entries with very high confidence are likely to be completed accurately. In contrast to these works, the framework we propose incorporates completability in a manner that is independent of the completion algorithm used. In that sense, the algorithm we propose is more powerful in that it provides insights on the matrix completion task as a whole, rather than on the particular algorithm being applied.
5.3 Identifying completability
In this section, we present the proposed algorithm for identifying completable submatrices, which we call CompleteID. The focus of this section is on finding an appropriate criterion for matrix completability. For an algorithm to have both theoretical and practical importance, there are two factors to consider: (i) the criterion for matrix completability must be closely connected to the theoretical feasibility of recovery (identifiability), and (ii) the criterion must admit an efficient algorithm for finding completable submatrices. Existing results typically focus on developing a criterion that is strong for the first factor, while greatly sacrificing the second.
Table 5.1: Evaluation of the k-connectivity criterion with respect to Exact and the baselines. Pr denotes precision, Rec denotes recall, and 0.00 denotes < 1e-3.
           CompleteID   Density      Quasi        Triangle     AC
           Pr    Rec    Pr    Rec    Pr    Rec    Pr    Rec    Pr    Rec
CAG        0.93  1.00   0.27  1.00   0.00  1.00   1.00  0.42   0.74  0.52
Celegans   0.91  1.00   0.32  1.00   0.00  1.00   0.99  0.53   1.00  0.47
CAG364     0.98  1.00   0.34  1.00   0.00  1.00   0.99  0.66   1.00  0.73
To achieve a good trade-off between theoretical and practical factors, we propose using the concept of edge-connectivity on the bipartite graph G_Ω associated with the input M_Ω as the criterion for matrix completability.
Hence, we can trans- form the problem of identifying completable submatrices in M Ω into the problem of identifying k-connected components in G Ω . The proposed approach, called CompleteID, is composed of three steps: (1) transform the input matrixM Ω into a graph G Ω , (2) decompose G Ω into maximal k-connected components using the proposed MaxKCD algorithm, (3) map the ver- tices in each component to rows and columns indexing completable submatrices. In the remainder of this section, we motivate the k-connectivity criterion both theo- retically and empirically. Note that we assume the underlying low-rank matrix is drawn from some continuous distribution, which avoids the extreme coherent matri- cessuchas thepermutationmatrices. Similarassumption isalsoadoptedby existing works [KT13, KTT15]. 100 5.3.1 Connectivity for completability High edge-connectivity in a graph ensures that there are many independent paths between the vertices. For example, if a graph is 3-edge connected, it means that there exist three vertex-independent paths between every pair of vertices. A graph with high edge connectivity can be considered robust in the sense that edges can be removed without disconnecting the graph. In the context of matrix comple- tion, a set of independent paths in the bipartite graphG Ω can be thought of as a set of constraints on the relationship between the rows or columns ofM Ω . Intuitively, more independent paths impose more constraints and render it easier to recover the missing entries. For example, consider the case when the target rank is k = 1, i.e., the matrix M isassumedtoberank-1. Inthiscase,k-edge-connectivityreducessimplytograph connectivity; if G Ω is connected then each entry ofM Ω will be in the same row or column as at least one other entry. This overlap can be viewed as fixing the scaling parameter in the rank-1 factors, making it possible to recover the missing values in M Ω . The intuition is in line with theoretical results by Király et al. [KTT15] showing that if G Ω is connected, then there is a single rank-1 matrix d M that is consistent withM Ω on Ω. For general rankk> 1, the following proposition holds: Proposition 5.1. If the associated bipartite graph of a partially-observed matrix is notk-edge-connected, there are an infinite number of rank-k matrices that consistent with the observation. Proof. The key idea is to consider the number of entries that need to be observed in order to fix ambiguity in the recover. For example, consider an n×m matrix M of rank k which can be written as a product of two factors: A of size n×k 101 … … … … Ω1 Ω2 Ω1 Ω2 A 1 B 1 A 1 B 2 W 1 A 2 B 1 WA 2 B 2 Ω1 Ω2 Ω1 Ω2 (a) Ω 1 , Ω 2 inM Ω and G Ω … … … … Ω1 Ω2 Ω1 Ω2 Ωc (b) with k edges as Ω C and B of size m×k such thatM = AB > . Now consider a partition of M Ω into M Ω 1 of size n 1 ×m 1 andM Ω 2 of size n 2 ×m 2 where n 1 +n 2 = n, m 1 +m 2 = m, and Ω ={Ω 1 , Ω 2 }. Figure 5.2a gives a visual of M Ω and G Ω . Observe that G Ω is composed of two components that are disconnected, since there is no overlap between Ω 1 and Ω 2 . With this Ω, the factor A can be divided into two parts: A 1 ∈ R n 1 ×k and A 2 ∈ R n 2 ×k with n = n 1 +n 2 . Similarly for B with m = m 1 +m 2 . Denote by M(W ) the family of matrices parameterized by a k×k matrixW, laying in the span ofA 2 andB 2 , that is applied toA 2 andB 2 as shown below: M(W ) = A 1 A 2 W B > 1 W † B > 2 = A 1 B > 1 A 1 W † B > 2 A 2 WB > 1 A 2 B > 2 Observe thatM Ω is identical for every matrix in the family, regardless ofW. 
In other words, given M Ω there are infinitely many completions d M that match on Ω. To have any hope of recovering the true matrix M, there must be some additional known entries outside of Ω 1 and Ω 2 to fix the degrees of freedom inW, call these entries Ω c . It can be shown that as long as|Ω c |<k, which is the minimum degrees of freedom forW, thenM Ω has infinitely many completions. Entries in Ω c 102 correspondsto edgesinthe cutonG Ω asillustrated inFigure5.2b; hence, Ω c induces k-connectivity on G Ω . The intuition is to consider a partition of the graph into two components corresponding to submatrices all of whose degrees of freedom are fixed. It can be shown, that in order to fix all degrees of freedom in the matrix at least k additional entries (edges) are needed, translating to k-edge-connectivity. When entries are observed uniformly at random, the edge-connectivity require- ment also nicely aligns with the known results [CT10], which can be proved by applying Chernoff Bounds on the graph cut over all of the possible graph partitions. Proposition 5.2. For a partially observed n× n matrix, if the entries Ω is observed uniformly at random, then there exists a constant C, such that when |Ω|≥nk +C(nk) 0.5 log(n/δ), the corresponding bipartite graph is k-edge-connected with probability at least 1−δ. Aside from k-connectivity, there are other graph-theoretic concepts that may be considered as a criterion for matrix completability, such as density. Depend- ing on the particular end-goal, various density measures and computational models have been defined, with the state of the art including degree-density [Cha00], edge- surplus [TBG + 13], k-clique [Tso15], and various notions related to k-clique that emphasize cohesiveness, such ask-plex. Not only do these definitions have no known relation to completability, but most of them are also NP-hard to enumerate [AIY13]. In contrast, the k-connected components can be mined in polynomial time even in the worst case [CYQ + 13, SHB + 16]. 103 5.3.2 Empirical evidence for connectivity as a completabil- ity criterion As discussed in Section 5.2, the algorithm proposed in [KTT15] which we call Exact, is computationally expensive; however, its theoretical guarantee provides an opportunity to validate our choice ofk-connectivity as a criterion for completability. To do this, we compare the entries deemed completable by Exact to those that participate ink-connected components. Limited by the poor scalability of Exact, we selectseveralsmallpartially-observedmatricesfromtheUFsparsematrixcollection, and run Exact andCompleteID on the Ω corresponding to each dataset with a range ofvaluesfork. 2 Table5.1showshighprecisionandrecallofCompleteIDwithrespect to Exact, validating that k-connectivity is a good and efficient proxy for Exact. Wealsocomparewithseveralbaselines, includingtheActiveCompletion algo- rithm (AC) proposed in [RCT15] with the query budget set to zero, and four state-of- the-artalgorithmsfordensesubgraphdiscovery: Density[Cha00],Quasi[TBG + 13], and Triangle [Tso15]. 3 We note that while the density-based algorithms have no theoretical ties to completability, in principle, higher density increases the likelihood of completability. We find that ActiveCompletion achieves high precision but low recall; in fact, as k grows, the algorithm does not find any completable entries at all. These results held even when we allowed a small query budget, and are a result of the large reliance on the starting point and an active approach. 
Similarly, the Triangle algorithm failed to identify a vast majority of the completable entries; this behavior 2 Dataset available at http://www.cise.ufl.edu/research/ sparse/matrices/ 3 Code available at https://github.com/giannisnik/k-clique-graphs-dense-subgraphs 104 is exacerbated as the size and rank of the matrix grow. In contrast both Density andQuasi achieve a high recall but a low precision. In other words, these algorithms falsely label many entries as completable. The Graph-Triangle algorithm is not shown since it failed to identify any components. These results demonstrate that identifying completable submatrices is a non- trivial task, and thatk-CC decomposition offer a promising approach for this prob- lem by achieving both high precision and recall. Not only does the k-CCs capture the majority of completable entries, but it also does not give false confidence with false positives. 105 Algorithm 2: Maximal k-CC Decomposition (MaxKCD) Input : Graph G = (V,E) and the target cut size k Output: The vertex partition Φ 1 Initialize Φ =∅ and Γ 0 to be the whole graph G 2 while Γ i 6=∅ do 3 Initialize Γ i+1 =∅ 4 forall G 0 = (V 0 ,E 0 )∈ Γ i do 5 Find a cut C = (V 0 1 ,V 0 2 ) = kCut(G 0 ,k) 6 if C =∅ or|V 0 | = 1 then 7 Add V 0 to Φ 8 else 9 Add G 0 V 1 and G 0 V 2 to Γ i+1 10 end 11 end 12 end 13 return Φ 5.4 Finding k-connected components Havingestablishedthatedgeconnectivityisagoodcriterionforcompletability, wenowpresentanalgorithmforfindingk-connectedcomponents(k-CCs)inagraph. The problem can be stated as: Problem 5.3. Given a graph G = (V,E) and a positive integer k, find a partition of the vertices V = S m i=1 V i with minimum m such that if|V i |> 1 then the subgraph induced by V i is a maximal k-connected subgraph. The algorithm we propose, called MaxKCD, is an improvement upon recent literature [CYQ + 13, SHB + 16] which modifies the Stoer-Wagner algorithm [SW97] for finding a global min cut. MaxKCD works by iteratively finding a cut of size less than k in G until no such a cut exists and only k-CCs or singleton vertices remain; the pseudocode is shown in Algorithm 2. 106 Our main contribution in MaxKCD is the kCut subroutine which utilizes three core operations: EarlyStop, ForceContraction, and Batch-EarlyMerge. The Batch-EarlyMerge operation is inspired by the observation that the maximal adja- cency search (MAS) subroutine in the original Stoer-Wagner algorithm can be accel- erated by a batch-like approach; the acceleration allows kCut to be more efficient while still retaining accuracy for Problem 5.3. We also incorporate a vertex-merging technique, Force-Contraction, which is a more aggressive version than those used in previous work. Finally, we use the EarlyStop routine proposed in [SHB + 16] to limit the number of recursive iterations. As an interesting side benefit, the proposed kCut algorithm can be used to speed up the original Stoer-Wagner algorithm by actively updating k to be the current smallest cut, starting from infinity. Due to space constraints, we include the implementation details in Appendix 5.5, proofs of correctness in Appendix 5.6, and we will make the code available online. Computational Complexity Since the kCut algorithm can be used with binary searchtosolvetheglobalmin-cutproblem, theworst-casecomputationalcomplexity of kCut is the same as the Stoer-Wagner algorithm: O (|V||E| +|V| 2 log|V|). How- ever, unlike the Stoer-Wagner algorithm, the empirical runtime of the kCut algo- rithm is much less than its worst-case. 
If the input graph is sparse and loosely con- nected,likethosestudiedinreal-worldapplicationsofmatrixcompletion, EarlyStop is likely to terminate the algorithm in the first few calls. On the other hand, if the inputgraphistightlyconnected,Batch-EarlyMergeensuresthateachrecursiveiter- ation will be conducted efficiently, and together with ForceContraction, the graph size shrinks quickly. In our experiments, less than 10 recursive calls are observed even in graphs with million of vertices and billions of edges, with the empirical runtime scaling with|E| log|V|. Note that the overall running time of the MaxKCD 107 algorithm is a small fraction of the running time of the common matrix completion algorithms. 5.5 Details on the MaxKCD algorithm In this section, we expand on the details of the proposed MaxKCD algorithm for identifying k-edge-connected components, i.e. for Problem 5.3. Aside from connections to completability, k-connected components have been recognized as an important structure in applications ranging from behavior mining, social network analysis, e-commerce, biology, and more [SHB + 16]. For a fixed small k (e.g. k = 1, 2, 3) and a graphG = (V,E), all maximalk-CCs can be found inO (k(|V| +|E|)) time by algorithms based on depth-first-search [NW93, Tsi09]. For a general k, the global min-cut routine can be iteratively applied to remove cuts smaller than k; however, the worst case run-time for such a vanilla algorithm is O(|V|T ), where T is the run-time of the global min cut algorithm, e.g., T =O (|V| (|E| +|V| log|V|)) for Stoer-Wagner algorithm [SW97]. Heuristic improvements have been proposed, for example in [ZLY + 12]. Several groups have proposed modifying the black box min-cut algo- rithms [AIY13, CYQ + 13, SHB + 16]. While the worst-case performance is generally unchanged, the empirical run-time decreases significantly. The improvement stems from the fact that an algorithm for finding k-CCs relies largely on finding a cut of size specifically less thank, and that this subroutine can be implemented more effi- ciently than the black-box min cut algorithms. In [CYQ + 13, SHB + 16], the authors proposed an algorithm based on the Stoer-Wagner global min-cut algorithm incor- porating the knowledge ofk for a gain in efficiency. Instead, Akiba et al. in [AIY13] 108 proposed a randomized approach based on Karger’s algorithm; however, it has been shown to be inefficient compared to Stoer-Wagner-based algorithms. The algorithm we propose, which is called MaxKCD, works by iteratively finding a cut of size less than k in G until such a cut does not exist, and only k-CCs or singleton vertices remain; the pseudocode is shown in Algorithm 3. At step 5, the algorithm applies a subroutine called kCut to find a cut of size less than k. The subroutine is based on the Stoer-Wagner algorithm [SW97] and improves upon the works in [SHB + 16, CYQ + 13]. Using the returned cut, the graph is split at step 9 of MaxKCD and each part is pushed for the next iteration. 
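For readers who prefer running code to pseudocode, the decomposition loop of Algorithm 3 below can be sketched in Python as follows. This is an illustration only: networkx's stoer_wagner (a global minimum cut) stands in for the optimized kCut subroutine developed in the rest of this section, which only needs to find some cut of size less than k, and the graph is assumed to carry unit edge weights (networkx treats missing weights as 1).

import networkx as nx

def max_kcc_decomposition(G, k):
    # Worklist version of the MaxKCD loop: keep splitting along cuts of
    # size < k until every remaining piece is a singleton or k-edge-connected.
    result, work = [], [set(c) for c in nx.connected_components(G)]
    while work:
        nodes = work.pop()
        if len(nodes) == 1:
            result.append(nodes)
            continue
        H = G.subgraph(nodes)
        cut_value, (part1, part2) = nx.stoer_wagner(H)
        if cut_value >= k:
            # No cut of size less than k exists: nodes form a maximal k-CC.
            result.append(nodes)
        else:
            # Split along the small cut; each side may itself be disconnected.
            for side in (part1, part2):
                for c in nx.connected_components(G.subgraph(side)):
                    work.append(set(c))
    return result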
Algorithm 3: Maximal k-CC Decomposition (MaxKCD) Input : Graph G = (V,E) and the target cut size k Output: The vertex partition Φ 1 Initialize Φ =∅ and Γ 0 to be the whole graph G 2 while Γ i 6=∅ do 3 Initialize Γ i+1 =∅ 4 forall G 0 = (V 0 ,E 0 )∈ Γ i do 5 Find a cut C = (V 0 1 ,V 0 2 ) = kCut(G 0 ,k) 6 if C =∅ or|V 0 | = 1 then 7 Add V 0 to Φ 8 else 9 Add G 0 V 1 and G 0 V 2 to Γ i+1 10 end 11 end 12 end 13 return Φ Our main contribution in MaxKCD is the kCut subroutine, which utilizes three core operations: EarlyStop, ForceContraction, and Batch-EarlyMerge. First, we observe that the core component, maximal adjacency search (MAS), can be accelerated by a batch-like approach, Batch-EarlyMerge, which allows kCut to be more efficient while retaining accuracy for the the K-Connected Component 109 Decompositon Problem. We also incorporate a vertex-merging technique, called Force-Contraction, which results in a more aggressive merging technique than those in [SHB + 16, CYQ + 13]. We also incorporated the EarlyStop routine pro- posed in [CYQ + 13, SHB + 16]. As a side note, the kCut subroutine can be used in turn to speed up the original Stoer-Wagner algorithm by actively updating the k, starting from infinity, as the current smallest cut. In what follows we dive into the details of the kCut algorithm proposed in this work. 5.5.1 The kCut algorithm ThekCutalgorithmisbasedontheStoer-Wagneralgorithmforfindingaglobal min-cut. Shown in Algorithm 4, kCut is composed of two parts: the kMAS algorithm in step 4, and the Merge routine in step 5. The former obtains information on the (s,t)-min-cut via graph traversal, and the later utilizes the target value k to leave a strictly smaller graph for the next step. The kMAS routine is based on a core procedure in Stoer-Wagner: the MAS algorithm for finding an (s,t)-min-cut in G. Unlike other (s,t)-min-cut algorithms, the vertices s and t are not received as input. Instead, MAS visits vertices in a particular order, and outputs the size of the (s,t)-min-cut between the two vertices visited last. The order is dictated by connectivity to the already-visited set, and we refer the reader to [SW97] for more details. The key property of MAS is that at any iteration, MAS records the (s,t)-min-cut between two vertices most recently visited on the subgraph induced by all visited vertices. More formally, let L i = (u 1 ,...,u i ) be the ordered set of already visited vertices at iteration i andu i+1 be the vertex to be visited next. Then we have that w (L i ,u i+1 ) is the (s,t)-min-cut betweenu i and u i+1 on the induced subgraph G L i+1 . Note that w (L i ,u i+1 ) is the sum of the edges 110 Algorithm 4: Finding a cut with size less than k 1 C =kCut(G,k) Input : G = (V,E) and the target cut size k Output: Cut C 2 Initialize G 1 = (V 1 ,E 1 ) to G, and i = 1 3 while|V i |> 1 where G i = (V i ,E i ) do 4 P i =kMAS(G i ,v j ,k), for arbitrary v j ∈V i 5 G i+1 =Merge(G i ,P i ,k) 6 i =i + 1 7 end 8 return C =∅ weights betweenL i andu i+1 andG Ln =G. For the propose of kCut,s andt can be safely merged if the (s,t)-min-cut on G is at least k. This will not destroy any cut of size less than k since such a cut would place these two vertices on the same side. The EarlyMerge operation proposed in [CYQ + 13] monitors w (L i ,u i+1 ), which is a lower bound on the (s,t)-min-cut, and merges u i and u i+1 as soon as it reaches k. The key to modifying MAS for theK-Connected Component Decomposi- ton Problem and later improving on efficiency is based on two main observations. 
First, the existence of a target rank k implies that we do not care to differentiate between cuts of size larger than k in Algorithm 3. This provide the extra room for us to modify the MAS procedure as we do not require the (s,t)-min-cut lower bound to as accurate. As a result, a batch-like approach Batch-EarlyMerge can be utilized in the kMAS procedure to speed up the process. Second, we observe that a separation between the visitation and the merging allows for incorporation of more aggressive merging techniques, known as ForceContraction, resulting in a more efficient algorithm. These two observations are captured in three core operations, EarlyStop, Batch-EarlyMerge, and ForceContraction, which we describe below. 111 Algorithm 5: Accelerated MAS for k-CCs 1 P =kMAS(G,u 1 ,k) Input : G = (V,E), start vertex u 1 , target cut size k Output: Sets of vertices P for Batch-EarlyMerge 2 Initialize i = 1, P =∅, and L 1 ={u 1 } 3 while i<|V| do 4 if w(L i ,V\L i )<k then 5 Apply EarlyStop 6 else 7 if max v∈V\L i {w(L i ,v)}≥k then 8 Batch-EarlyMerge: 9 U i+1 ={v|w(L i ,v)≥k,v∈{V\L i }} 10 L i+|U i+1 | =L i ∪U i+1 11 P =P∪{U i+1 ∪{u i }} 12 u |L i | = any vertex from U i+1 13 else 14 u i+1 = argmax v∈V\L i {w(L i ,v)} 15 L i+1 =L i ∪{u i+1 } 16 end 17 i =|L i | 18 end 19 end 20 return P Early Stopping: The first operation, EarlyStop, utilizes the information of the target cut size k to terminate the visitation procedure of kMAS and directly return the solution for kCut. If at any point the cut between the visited set of vertices L i and the unvisited set V\L i is smaller than k, we can terminate kMAS and return the cut{L i ,V\L i } directly. Batch Early-Merging: The original MAS routine visits one vertex at a time; how- ever, we observe that this one-by-one visitation is too strict for the K-Connected ComponentDecompositonProblemsincewecareonlywhetherthecutissmaller thank. We introduce the Batch-EarlyMerge operation to work on multiple vertices 112 at a time, and prove that it retains correctness with respect to the K-Connected Component Decompositon Problem. Recall that MAS merges two vertices u i ∈ L i and u i+1 / ∈ L i at step i if w (L i ,u i+1 )≥ k. The Batch-EarlyMerge operation works by merging all v with w (L i ,v)≥k, as opposed to only theu i+1 that is scheduled to be visited next; more precisely the set{v|w (L i ,v)≥ k} is merged together with u i . The main idea of the proof of correctness (available in the extended version) is that although kMAS does not follow the order specified by MAS, no cut of size at least k will be omitted and no cut of size less than k will be affected: The resulting graph after Merge operation will be the same as that with EarlyMerge. The computational benefit of Batch-EarlyMerge is that it reduces the number of IncreaseKey operations for the Max Heap used in MAS, which is the computational bottleneck of the MAS algorithm for dense graph. Force Contraction: The ForceContraction operation was first introduced in [AIY13]. The operation merges any vertices u and v whose edge weight w (u,v) exceedsk. The operation has been omitted from previous exact algorithms because applying it to nodes outside of the visited set L i may lead to a violation of the MAS property. However, by separating the visitation and merging into kMAS and Merge, thekCut algorithm we propose allows ForceContraction to be utilized. The combination of ForceContraction with the other two operations further reduces the size of the graph, leading to a faster convergence of the recursion. 
Moreover, since the edges within a super-vertex can be ignored during vertex merging and ForceContraction creates larger super-vertex, the cost of the merging operation is also often reduced considerably. For example, the cost was reduced by 30% on the Amazon review data. 113 Algorithm 6: Vertex merging with ForceContraction 1 G 0 =Merge(G,P,k) Input : G = (V,E), set of pairs P, target cut size k Output: Modified graph G 0 2 Initialize G 0 =G 3 forall U∈P do 4 Batch-EarlyMerge: Merge all vertices in U on G 0 5 end 6 while ∃ edge with w(v 1 ,v 2 )≥k in G 0 do 7 ForceContraction: Merge v 1 and v 2 on G 0 8 end 9 return G 0 ThecorrectnessofAlgorithm4largelyfollowsthatofthe MASprocedure. Addi- tional optimization of kCut are discussed in the extended version, e.g., remove ver- tices with degree less than k. Computational Complexity: Since the kCut algorithm can be applied to solve the global min-cut problem with binary search, the worst-case computational com- plexity of kCut algorithm would be similar to that of the global min-cut algo- rithm, differed by a log|V| factor at most. In fact, the worst-case computational complexity of kCut and its analysis is the same as the Stoer-Wagner algorithm: O (|V||E| +|V| 2 log|V|). However, unlike the Stoer-Wagner algorithm, the empirical runtime of the kCut algorithm is much less than its worst-case. If the input graph is sparse and loosely connected, like those in the matrix completion setting, EarlyStop will likely terminate the algorithm in the first few calls. If the input graph is tightly connected, Batch-EarlyMergeensuresthateachkMAScanbeconductedefficiently, andtogether with ForceContraction, the graph size shrinks quickly. In our experiments, less than 10 calls to kMAS are observed even in graph with million of vertices and billions 114 of edges, where the empirical runtime scales with|E| log|V| due to usage of binary heaps. 5.6 Proof of correctness for Batch-EarlyMerging Unlike previous algorithms for Problem 5.3 which required the exact MAS pro- cedure, we propose a brand-new accelerated MAS routine, where multiple vertices can be visited in a single step. This batch approach reduces the time-consuming operation of updating the max-heap for the traditional MAS, leading to potentially significant speed-up. The key observation is that we are not restricted to the order of visitation specified by MAS since we do not need to distinguish cuts of size greater or equal to k. Let u 1 → u 2 → ··· → u n be the order of visitation by MAS as stated in Algorithm 7, andL i ={u 1 ,...,u i }. Then, at thei-th step, we can list the remaining verticesv∈V\L i byw(L i ,v)indecreasingorder: v i+1 →v i+2 →···→v n ; notethat this order is not necessarily the same as u i+1 →···→ u n except that v i+1 = u i+1 . The batch approach can be applied for the maximum possible s≥ i, if it exists, such that w(L i ,v r )≥k, ∀r, i<r≤s. That is, the set U i+1 ={v r |i<r≤s}, can be visited all at once as follows: first merge the vertices inside U i+1 to u i+1 = v i+1 , then add u i+1 to L i by merging u i+1 with u i . Take L s =L i ∪U i+1 and continue the visit. 115 Though the final order of visitation may differ from that of MAS, we will prove that no cut of size less than k will be destroyed, hence end result is the same. The computational benefit over the original MAS procedure comes from the reduced number of IncreaseKey operations in heap maintenance for all the edges inside U i+1 , which is the dominant (first) factor in theO (|E| +V log|V|) cost of MAS. 
The improvement is especially significant when MAS is implemented with binary heap, where the cost would be O (|E| log|V| +|V| log|V|). Proof of Correctness: To prove the correctness of the batch approach, we prove that P =MAS(G,s,k) and P + =kMAS(G,s,k) lead to the same graph after merging, i.e., Merge(G,P,k) = Merge(G,P + ,k). Here MAS denotes the simple MAS withEarlyMerge with vertices visited one at a time. Since theForceContraction is appliedinthelaststageofMergeanduniquelydependsontheresultsofEarlyMerge, we need only consider the results of EarlyMerge. Algorithm 7: Maximum Adjacency Search for kCut 1 P =MAS (G,u 1 ,k); Input : G = (V,E), start vertex u 1 , target cut size k Output: Sets of vertices P for Batch-EarlyMerge 2 Initialize i = 1, P =∅, and L 1 ={u 1 } 3 while i<|V| do 4 u i+1 ← argmax u∈V\L i {w(L i ,u)}; 5 w i+1 ←w(L,u i+1 ) 6 if w i+1 ≥k then 7 P =P∪{{u i+1 ,u i }} 8 end 9 L i+1 ←L i ∪{u i+1 } 10 i←i + 1 11 end 12 return P 116 Note that the elements in P =MAS(G,s,k) are pairs of vertices that will be merged together. Moreover, elements ofP may overlap if they contain the same ver- tex, which suggests a chain of vertices that will all be merged together. For example, P may contain both{u i ,u i+1 } and{u i+1 ,u i+2 } which means all three vertices will be merged together. Hence,P can be simplified by merging the overlapped elements until all elements in P are non-overlapping; we call these resulting non-overlapping elements early merging blocks. More formally: Definition 5.4 (Early Merging Block). In Algorithm 7, an early merging block is a subset of vertices R k i:j that satisfies R k i:j ={u r |i≤r≤j,w i <k,w j+1 <k, and w s ≥k,∀s,i<s≤j}, where 1≤i<j≤n, and w 1 =w n+1 = 0. Note that|R k i:j |≥ 2 by definition. If two vertices belong to the same early merging block, they will be merged together. Further, without ForceContraction, if two vertices do not belong to the same block, they will remain distinct vertices after Merge. We will prove that P + , obtained by Batch-EarlyMerge, contains the same information as P. Theorem 5.5. Given P =MAS(G,s,k) and P + =kMAS(G,s,k), then Merge(G,P,k) = Merge(G,P + ,k). Proof. As discussed before, the merging result ofP can be described by early merg- ing blocks. We will now prove that P + yields the same merging result as suggested by the early merging blocks. Letu 1 ,u 2 ,...,u n be the visitation order taken by MAS, then at step i we have L MAS i ={u 1 ,...,u i }, and similarly for kMAS we have L kMAS i . 117 Note that if P + is empty, this implies that the condition on step 7 in kMAS is never true. If this is the case then there is no early merging block, implying that P is empty as well. Similarly ifP + is not empty, but all|U i+1 | = 1, then by definition that P =P + . When there exists|U i+1 | > 1, we prove Theorem 5.5 by induction, starting both MAS and kMAS, at the first step, and matching P and P + step by step. Before the first step, P and P + for MAS and kMAS are all empty, and L kMAS 0 = L MAS 0 =∅, so the conclusion holds. In fact, P and P + (also L kMAS i and L MAS i ) are equivalent before encountering the first nontrivial (|U i+1 |> 1) early merging block. Here, we assume that P and P + are equivalent up to the p-th iteration. We break the proof into three parts: 1. LetU i+1 withi≥p be the firstU i+1 with|U i+1 |> 1, then we know thatL kMAS i is the same asL MAS i . Here we also mergeu i intoU i+1 corresponding to the step 11 in kMAS algorithm. 
Pick any vertex u t in U i+1 , which is the t-th (t > i) visited vertex in MAS. By definition in kMAS, we have that w(L MAS i ,u t )≥ k. Then for any r such that i<r≤t, we have that: w r =w(L MAS r−1 ,u r )≥w(L MAS r−1 ,u t )≥w(L MAS i ,u t )≥k, which implies that there exists R k i:j such that: {u i ,...,u t }⊂R k i:j ⇒u t ∈R k i:j Since u t is picked arbitrarily, same conclusion holds for all elements in U i+1 , and there is a unique early merging block that starts from i. Therefore, we 118 have that U i+1 ⊂R k i:j meaning that the batch found by Batch-EarlyMerge is always a subset of an early merging block. 2. When R k i:j = U i+1 , we know that P and P + are equivalent up to the j-th iteration, and that L kMAS j is the same as L MAS j . Therefore, we can replace the results of kMAS by that of MAS and continue both algorithms by repeating the argument in (1) from the j-th iteration. 3. When R k i:j 6= U i+1 , we will prove that the next element U i 0 +1 in P + satisfies U i 0 +1 ⊂R k i:j and U i 0 +1 ∩U i+1 6=∅. Let s = min{s 0 |u s 0∈R k i:j \U i+1 }, then we have that w(L kMAS i 0 ,u s )≥w(L MAS s−1 ,u s )≥k, because that L MAS s−1 ⊂L kMAS i 0 =L MAS i ∪U i+1 . Therefore, we know that u s ∈U i 0 +1 andU i 0 +1 wascreatedimmediatelyafterU i+1 , whichimpliesthatU i 0 +1 ∩U i+1 6= ∅. Similarly, for any vertex u s where s>j, we have that w(L kMAS i 0 ,u s )≤w(L MAS j ,u s )≤w(L MAS j ,u j+1 )<k. Therefore, we have that U i 0 +1 ⊂R k i:j . IfU i 0 +1 ∪U i+1 =R k i:j , we can repeat the argument in (1) from the j-th iteration as in (2). If U i 0 +1 ∪U i+1 6= R k i:j , we assign U i+1 =U i 0 +1 ∪U i+1 and repeat the argument in (3), and reach (2) in a finite number of steps since|U i 0 +1 ∪U i+1 |>|U i+1 |. 119 lmafit optspace reimann vbmc grouse k MC CPLT overall CPLT overall CPLT overall CPLT overall CPLT overall Bibsonomy 12 0.403 1.257 n/a n/a 0.590 1.377 0.760 1.061 n/a n/a Movielens 40 0.009 0.291 0.074 0.36 0.005 0.287 0.005 0.335 3.45 37.9 Table 5.2: Comparison of relative frobenius error over completable entries using different matrix completion algorithms. 5.6.1 Notes on implementation We have discussed the backbone of the MaxKCD algorithm; however, There are a few optimization tricks that can be exploited for the implementation of the MaxKCD algorithm which we provide in C++. k-Core Optimization: Since ak-CC requires each vertex to have degreek, a scan of vertex degrees can remove unnecessary computation. Before searching for the cut with size less than k, the kCut algorithm can first check for any vertex whose weighted vertex degrees is less than k, remove it, and add it directly to the final partition P. Such a scan is called a k-coreOpt routine, as it decompose the graph into k-Core components. In-place storage of Γ: In Algorithm 2, each Γ i can be stored on the original graph, since it corresponds to a list of connected components of G at iteration i. A key and simple observation is that all of the algorithms presented here rely only on the local information of which vertices have been visited and their respective neighbors. Hence, the only information that needs to be stored about each subgraph is its size and a seed vertex. 120 n m |Ω| Netflix 480K 17K 100M Amazon 6.6M 2.4M 29M Foursquare 45K 17K 1.2M DBLP 122K 36K 1.8M Music 5K 210K 16M LibimSeTi 135K 168K 17M Gisette 6K 5K 3.9M Table 5.3: Matrix completion dataset statistics. 
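The k-coreOpt scan described in Section 5.6.1 amounts to the standard k-core peeling procedure: any vertex whose (weighted) degree is below k cannot lie in a k-edge-connected component, so it can be removed immediately, possibly triggering further removals. A minimal Python sketch is given below; the adjacency-dictionary representation and the function name are ours, used only to illustrate the idea.

from collections import deque

def k_core_opt(adj, k):
    # adj: dict mapping each vertex to a dict {neighbor: edge_weight}.
    # Repeatedly remove vertices whose weighted degree drops below k;
    # removed vertices are emitted as singletons of the final partition,
    # mirroring the k-coreOpt scan performed before kCut is invoked.
    degree = {u: sum(nbrs.values()) for u, nbrs in adj.items()}
    removed = set()
    queue = deque(u for u, d in degree.items() if d < k)
    while queue:
        u = queue.popleft()
        if u in removed:
            continue
        removed.add(u)
        for v, w in adj[u].items():
            if v not in removed:
                degree[v] -= w
                if degree[v] < k:
                    queue.append(v)
    kept = {u: {v: w for v, w in nbrs.items() if v not in removed}
            for u, nbrs in adj.items() if u not in removed}
    singletons = [{u} for u in removed]
    return kept, singletons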
5.7 Experiments
In this section, we evaluate the performance of CompleteID on both real and synthetic data, study the interplay of connectivity and completability, and demonstrate the variety of insights it offers.
Setup: In all our experiments, we use a pattern of observed entries Ω from sparse real-world data (shown in Table 5.3) to capture sampling patterns that occur in practice as opposed to artificial settings. We experiment with the real matrix values and, for a more controlled setting, with synthetic ones of fixed rank. Synthetic matrices are generated following a standard Gaussian, with additive Gaussian noise having standard deviation 0.1. Given a partially observed matrix M_Ω, we estimate the missing entries by fitting a rank-k_MC matrix to the observed entries with alternating minimization (chosen for its computational efficiency) [KBV09]. We compare the RMSE of the output estimate on the held-out test set, separately over the completable and non-completable entries as selected by CompleteID.
Figure 5.3: The RMSE of completable and non-completable entries on the test set as a function of k_KCC, with varying k_MC for Amazon and Netflix, and k_MC = 20 for DBLP and LibimSeTi. Panels: (a) Amazon, (b) Netflix, (c) DBLP, (d) LibimSeTi.
5.7.1 Traditional completion task
For two large-scale collaborative filtering datasets, Netflix and Amazon [ML13], we select 99% of the observed entries of Amazon and Netflix as the training set and evaluate the error over the remaining held-out test set. We explore the behavior of CompleteID under a variety of model settings parameterized by k_KCC, the edge-connectivity for MaxKCD, and k_MC, the rank for the completion algorithm. The target rank k_MC ranges in {10, 20, 30, 40}, and k_KCC in [10, 90] for Amazon and [10, 350] for Netflix.
Figures 5.3a and 5.3b show the RMSE on the completable and non-completable entries. For any given k_KCC and k_MC, the RMSE on the completable entries is significantly lower than on non-completable ones. Moreover, for a given k_MC, as k_KCC increases, the RMSE on the completable entries continues to drop. This decay indicates that the CompleteID algorithm serves as a smooth measure for completability, meaning that we need not be bounded by k_MC = k_KCC in real-world applications. With k_KCC = 350, there are still over 55% of test entries that are completable on Netflix, while Amazon is much sparser with around 3% of entries that are completable when k_KCC = 90. In Figure 5.3b we see an interesting flipping behavior on the completable entries with different k_MC. The fact that smaller k_MC gives better accuracy when k_KCC is small can be explained by overfitting; however, as k_KCC increases, we see a faster decay in RMSE for larger k_MC, resulting in a better performance. This behavior demonstrates that the CompleteID algorithm can be used to deploy more complex models to improve local performance without the consequence of overfitting.
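The rank-k_MC fits in these experiments are obtained with alternating minimization. For reference, the following numpy sketch shows a generic regularized alternating-least-squares solver of this kind; it is an illustration of the technique rather than the exact solver configuration used above, and the regularization weight lam, the iteration count, and the function name are ours.

import numpy as np

def als_complete(shape, observed, k, lam=0.1, iters=20, seed=0):
    # observed: dict {(i, j): value} of known entries of an n x m matrix.
    # Fit M ~ U @ V.T with U (n x k) and V (m x k) by alternating
    # regularized least squares over the observed entries only.
    n, m = shape
    rng = np.random.default_rng(seed)
    U = rng.standard_normal((n, k))
    V = rng.standard_normal((m, k))
    rows = {i: [] for i in range(n)}
    cols = {j: [] for j in range(m)}
    for (i, j), v in observed.items():
        rows[i].append((j, v))
        cols[j].append((i, v))

    def solve(fixed, index_lists):
        # Solve the ridge-regularized least-squares update for each row
        # (or column) factor, holding the other factor fixed.
        out = np.zeros((len(index_lists), k))
        for idx, entries in index_lists.items():
            if not entries:
                continue
            F = np.array([fixed[t] for t, _ in entries])   # |Omega_idx| x k
            y = np.array([v for _, v in entries])
            A = F.T @ F + lam * np.eye(k)
            out[idx] = np.linalg.solve(A, F.T @ y)
        return out

    for _ in range(iters):
        U = solve(V, rows)   # update row factors with column factors fixed
        V = solve(U, cols)   # update column factors with row factors fixed
    return U @ V.T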
5.7.2 Controlled matrix completion
In this section, we mimic the collaborative filtering task on DBLP, Music, LibimSeTi, and Gisette, in a controlled setting where we generate a synthetic matrix of rank k as described in the beginning of this section. The synthetic matrix allows us to use the full set of observed entries for training, and the rest for testing. Table 5.4 shows the RMSE for a synthetic matrix of rank k = 20 and k_KCC = k_MC = 20. In addition, Figure 5.3c and Figure 5.3d show the RMSE over k_KCC ∈ [1, 40] on the DBLP and LibimSeTi datasets; the coordinate at x = k_KCC = 20 is the same value as in Table 5.4. The completability ranges from 96% to 55% for DBLP and 99% to 87% for LibimSeTi. Consistent with the previous experiment, the error over completable entries is lower than over non-completable ones across all datasets. Further, we see that the error decreases as k_KCC grows, but hits a meaningful point around k_KCC = 20, which is the matrix rank.
Table 5.4: Completability over small datasets with a rank-20 synthetic matrix and k_KCC = k_MC = 20.
             DBLP   Music   LibimSeTi   Gisette
Compl.       0.42   0.29    0.38        0.11
Non-Compl.   4.41   2.96    5.27        5.81
5.7.3 Comparison with baselines
As discussed in Section 5.3, most related completability algorithms are limited by their computational inefficiency, hence we include a comparison with the best-performing density-based baseline, SmartDensity. For each unknown entry (i, j) in the test set, we calculate the number of known entries in the i-th row and j-th column as a measure of completability; the higher the better. Then, we mark the top N entries as completable, where N is the number of completable entries discovered by CompleteID (the knowledge of N merits the name Smart).
We tested CompleteID and SmartDensity on the Netflix data. Instead of randomly splitting the entries into training and testing, we used the timestamps available on matrix entries and select the first 25% (or 50%) of the observed ratings as training, and the next 1% of the observed entries as testing. The RMSE on the completable entries for both CompleteID and SmartDensity is shown in Figures 5.4a and 5.4b. We see that the RMSE over the completable entries discovered by CompleteID is significantly lower than over those discovered by SmartDensity, which in turn is slightly better than the total RMSE. Moreover, the performance of SmartDensity degrades, implying that identifying completable submatrices is a non-trivial problem that cannot be addressed solely with a density-based approach. Finally, we also compared with the ActiveCompletion algorithm [RCT15] from Section 5.3. We found that ActiveCompletion identifies only a small subset of the completable entries identified by CompleteID. For example, on Netflix the algorithm did not find any completable regions for k > 90. This is due to its dependence on proper initialization and on the ability to be active by adding entries.
5.7.4 Completability over time
Finally, we study the accuracy over the completable and non-completable entries over time for two datasets, Netflix and Foursquare. For Netflix we use the real data values, and for Foursquare we generate a synthetic rank-10 matrix. Both datasets contain timestamps which we use to divide the data into training and testing. At each timestamp t, we use the entries that appeared before t as training and the entries that appear within [t, t+Δt] as testing, mimicking the pattern of observed entries encountered in real-world applications.
Figures 5.4d and 5.4c show the RMSE as a function of time, with the percentage of completable entries varying from 0% to approximately 80%. We observe that, over time, the test entries that fall in submatrices deemed completable by MaxKCD have significantly lower error than those outside the submatrices. These results demonstrate not only that CompleteID successfully identifies completable submatrices, but also that knowledge of these submatrices is valuable in practice.

5.7.5 Further experiments

In this section, we highlight several additional experiments.

Figure 5.4: Figures (a) and (b) show a comparison between CompleteID and SmartDensity with 25% and 50% of Netflix. Figures (c) and (d) show the RMSE of completable and non-completable entries as a function of time for Foursquare and Netflix with k = 10.

Alternative completion algorithms: We selected five algorithms with the most competitive efficiency and accuracy: LMaFit, OptSpace, LRGeom, Riemann, and VBMC (source code obtained from https://sites.google.com/site/lowrankmodeling/); the interested reader can refer to [ZYZY15] for a comparison. We found that, for all of these algorithms, the error over the entries selected by CompleteID as completable is significantly smaller than over the non-completable ones, which demonstrates the robustness of CompleteID.

Efficiency of MaxKCD: The overall running time of the MaxKCD algorithm is a small fraction of the running time of common matrix completion algorithms. Therefore, the utility of our framework is not limited by the computational efficiency of MaxKCD. Nevertheless, the empirical runtime is of interest, since MaxKCD can also be used independently for maximal k-edge-connected component decomposition of graphs and for global minimum cut algorithms. With our aggressive graph contraction schemes, including both Batch-EarlyMerge and ForceContraction, the overall running time is less than 6 minutes on both the Amazon and Netflix datasets for all k ≥ 10.

5.8 Conclusion

In this work, we argue that an analysis of the completability of a partially observed matrix should be carried out alongside the actual completion. While real-world matrices are not typically completable as a whole, we make the observation that there may still exist portions of the data that can be completed accurately. Knowledge of such completable regions enables a more principled completion process, providing feedback on the structure of the observed data, on the accuracy of the resulting estimates, and on whether issues such as overfitting and cold-start are of concern. We propose the problem of identifying completable submatrices and the CompleteID framework, which features the first scalable algorithm for the problem. A key component of our work is the exposure of edge-connectivity as a practical and theoretically supported surrogate for completability.
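To make the edge-connectivity surrogate concrete, the sketch below builds the bipartite graph associated with an observation pattern and flags a missing entry as completable when its row node and column node land in the same k-edge-connected component. It is a simplified illustration of the idea, not the CompleteID/MaxKCD implementation: it relies on NetworkX's generic k_edge_components routine, and the completability criterion shown here is a simplification assumed for illustration.

import networkx as nx

def completable_entries(observed_entries, missing_entries, k):
    # Bipartite graph: one node per row ("r", i) and per column ("c", j),
    # with an edge for every observed entry (i, j).
    G = nx.Graph()
    G.add_edges_from((("r", i), ("c", j)) for i, j in observed_entries)
    # Partition the nodes into maximal k-edge-connected components.
    label = {}
    for idx, comp in enumerate(nx.k_edge_components(G, k=k)):
        for node in comp:
            label[node] = idx
    # Flag a missing entry as completable when its row node and column node
    # fall inside the same k-edge-connected component.
    return [(i, j) for i, j in missing_entries
            if ("r", i) in label and ("c", j) in label
            and label[("r", i)] == label[("c", j)]]

The MaxKCD algorithm computes this decomposition far more efficiently via graph contraction; the NetworkX call is only a convenient stand-in for experimenting on small graphs.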
While we have shown that incorporating completability analysis into the matrix completion workflow is an impactful and promising direction, various avenues remain open for investigation. One major component is the development of a deeper theoretical understanding of matrix completability. While several works have studied the problem [MJD09, RCT15, KTT15, SC10], many open questions remain.

Chapter 6
Conclusion and Future Work

6.1 Conclusion

In this thesis, we have studied efficient access to data relevance for several widely used models, which can be used to improve the corresponding machine learning algorithms from both the computation and reliability perspectives. The models include low-rank models and multi-step random walks, where the statistical leverage scores can be efficiently estimated from their building blocks. For the completion task on low-rank models, we relate completability to the edge-connectivity of the corresponding graph, which leads to efficient discovery.

For numerical computations related to tensors, we show that the statistical leverage scores of the Khatri-Rao product can be efficiently accessed. The Khatri-Rao product captures the interactions between different tensor modes, and its statistical leverage scores enable many efficient numerical algorithms. This yields scalable algorithms for many tensor tasks, such as CP decomposition and HOSVD.

On graphs, we can compute the effective resistances of a Laplacian polynomial efficiently in nearly-linear time, which provides a refined view of the associated random walks. This serves as the core component of our sparse Newton's method, which accelerates numerical tasks with matrices.

In the low-rank matrix completion problem, we provide the relevance of an observed entry with respect to a missing entry by examining the underlying bipartite graph structure built from the observed entries. The edge-connectivity of this bipartite graph serves as a reasonable surrogate for the identifiability of a missing entry.

6.2 Future work

In this section, we discuss future research topics.

Accelerating feedforward neural networks

The feedforward neural network has been one of the workhorses of modern AI systems [CAS16]. The computational cost of serving these models represents a large portion of operating expenses, and reducing it would have a significant impact. Existing research [CWT+15] has shown that there is redundancy in neural network computation, which smarter computation could exploit to improve efficiency. One possible approach is to draw a connection between the layered structure of a feedforward neural network and multi-step random walks. With less structure in the weight matrices and the additional nonlinearity, the approximation guarantees would likely be compromised. However, by carefully analyzing the hidden units, we might be able to adaptively adjust the approximation rate, ensuring higher quality for data closer to the decision boundary. Such an approach might be limited by the flexibility of existing computational frameworks.

Model-agnostic data relevance

While the models we have studied subsume a large set of applications, analyzing data relevance is still a model-specific task. An important future direction would be a model-agnostic framework for data relevance discovery. Existing work on stochastic gradient descent with importance sampling [NSW14] can be viewed as an example, where the smoothness of each individual objective term serves as the data relevance. The utility of such methods is limited by the coarseness of using smoothness as the relevance measure, and it remains an open challenge to achieve both guaranteed approximation quality and computational efficiency.
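As a brief illustration of this idea, the following sketch applies smoothness-based importance sampling to SGD on a least-squares objective, in the spirit of [NSW14]. The step size and the per-term smoothness constants L_i = ||a_i||^2 are illustrative assumptions, not a prescription from this thesis.

import numpy as np

def importance_sampled_sgd(A, b, step=0.05, n_iters=5000, seed=0):
    # Minimize (1/n) * sum_i 0.5 * (a_i^T x - b_i)^2 by sampling row i with
    # probability proportional to its smoothness constant L_i = ||a_i||^2,
    # and reweighting the gradient by 1/(n * p_i) to keep the update unbiased.
    rng = np.random.default_rng(seed)
    n, d = A.shape
    L = np.sum(A ** 2, axis=1)
    p = L / L.sum()
    x = np.zeros(d)
    for _ in range(n_iters):
        i = rng.choice(n, p=p)
        grad_i = (A[i] @ x - b[i]) * A[i]     # gradient of the i-th term
        x -= step * grad_i / (n * p[i])       # importance-weighted update
    return x

Here the per-term smoothness L_i plays exactly the role of a data relevance score: rows with a larger potential influence on the objective are visited more often.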
The utility of similar method is limited due to the coarseness of usingthesmoothnessasdatarelevance. Itisanopenchallengetoachieveguaranteed approximation and computational efficiency. 131 Index Alternating least squares algorithm, 35 Data relevance, 2 Edge-connectivity, 9 Effective-resistance, 63 Identifiability, 100 Importance sampling, 16 Khatri-Rao product, 7 Kronecker product, 38 Least square regression, 21 Low-rank matrix completion, 5 Matrix completability, 25 Monte Carlo, 4 Newton’s method, 62 Random-walk matrix-polynomial, 7 Recommender systems, 95 Reversible Markov chain, 58 Spectral approximation, 8 Spectral graph theory, 23 Statistical leverage scores, 7 Tensor CP decomposition, 4 Uniform sampling, 4 132 Reference List [ACKY05] Evrim Acar, Seyit A Camtepe, Mukkai S Krishnamoorthy, and Bülent Yener. Modeling and multiway analysis of chatroom tensors. In Intel- ligence and Security Informatics: IEEE International Conference on Intelligence and Security Informatics, ISI 2005, Atlanta, GA, USA, May 19-20, 2005, Proceedings, volume 3495, page 256. Springer, 2005. [ADFDJ03] Christophe Andrieu, Nando De Freitas, Arnaud Doucet, and Michael I Jordan. An introduction to mcmc for machine learning. Machine learning, 50(1-2):5–43, 2003. [AF] David Aldous and Jim Fill. Reversible markov chains and random walks on graphs. [AIY13] Takuya Akiba, Yoichi Iwata, and Yuichi Yoshida. Linear-time enu- meration of maximal k-edge-connected subgraphs in large networks by random contraction. In Proceedings of the 22nd ACM international conference on Information & Knowledge Management, pages 909–918. ACM, 2013. [AP94] Agnar Aamodt and Enric Plaza. Case-based reasoning: Foundational issues, methodological variations, and system approaches. AI commu- nications, 7(1):39–59, 1994. [B + 15] Sébastien Bubeck et al. Convex optimization: Algorithms and com- plexity. Foundations and Trends R in Machine Learning, 8(3-4):231– 357, 2015. [BCV13] Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: Areviewandnewperspectives. IEEE transactions on pattern analysis and machine intelligence, 35(8):1798–1828, 2013. [BJ14] Srinadh Bhojanapalli and Prateek Jain. Universal matrix completion. In Proceedings of the 31st International Conference on International 133 Conference on Machine Learning-Volume 32, pages II–1881. JMLR. org, 2014. [BKS15] Boaz Barak, Jonathan A Kelner, and David Steurer. Dictionary learn- ing and tensor decomposition via the sum-of-squares method. In Pro- ceedings of the forty-seventh annual ACM symposium on Theory of computing, pages 143–151. ACM, 2015. [BM06] Isabelle Bichindaritz and Cindy Marling. Case-based reasoning in the health sciences: What’s next? Artificial intelligence in medicine, 36(2):127–135, 2006. [BS15] Srinadh Bhojanapalli and Sujay Sanghavi. A new sampling technique for tensors. arXiv preprint arXiv:1502.05023, 2015. [BSS12] Joshua Batson, Daniel A Spielman, and Nikhil Srivastava. Twice- ramanujan sparsifiers. SIAM Journal on Computing, 41(6):1704–1721, 2012. [BT11] Jacob Bien and Robert Tibshirani. Prototype selection for inter- pretable classification. The Annals of Applied Statistics, pages 2403– 2424, 2011. [CAS16] Paul Covington, Jay Adams, and Emre Sargin. Deep neural networks for youtube recommendations. In Proceedings of the 10th ACM Con- ference on Recommender Systems, New York, NY, USA, 2016. [CBSW14] Yudong Chen, Srinadh Bhojanapalli, Sujay Sanghavi, and Rachel Ward. Coherent matrix completion. 
[CC70] J Douglas Carroll and Jih-Jie Chang. Analysis of individual differences in multidimensional scaling via an n-way generalization of "Eckart-Young" decomposition. Psychometrika, 1970.
[CCL+14] Dehua Cheng, Yu Cheng, Yan Liu, Richard Peng, and Shang-Hua Teng. Scalable Parallel Factorizations of SDD Matrices and Efficient Sampling for Gaussian Graphical Models. CoRR, abs/1410.5392, 2014.
[CCL+15] Dehua Cheng, Yu Cheng, Yan Liu, Richard Peng, and Shang-Hua Teng. Spectral sparsification of random-walk matrix polynomials. arXiv preprint arXiv:1502.03496, 2015.
[CCS10] Jian-Feng Cai, Emmanuel J Candès, and Zuowei Shen. A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization, 20(4):1956–1982, 2010.
[Cha00] Moses Charikar. Greedy approximation algorithms for finding dense components in a graph. In International Workshop on Approximation Algorithms for Combinatorial Optimization, pages 84–95. Springer, 2000.
[CKM+11] Paul Christiano, Jonathan A Kelner, Aleksander Madry, Daniel A Spielman, and Shang-Hua Teng. Electrical flows, Laplacian systems, and faster approximation of maximum flow in undirected graphs. In Proceedings of the forty-third annual ACM symposium on Theory of computing, pages 273–282. ACM, 2011.
[CLM+15] Michael B Cohen, Yin Tat Lee, Cameron Musco, Christopher Musco, Richard Peng, and Aaron Sidford. Uniform sampling for matrix approximation. In Proceedings of the 2015 Conference on Innovations in Theoretical Computer Science, pages 181–190. ACM, 2015.
[CP15] Michael B Cohen and Richard Peng. ℓp row sampling by Lewis weights. In Proceedings of the forty-seventh annual ACM symposium on Theory of computing, pages 183–192. ACM, 2015.
[CR09] Emmanuel J Candès and Benjamin Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 2009.
[CT10] Emmanuel J Candès and Terence Tao. The power of convex relaxation: Near-optimal matrix completion. IEEE Transactions on Information Theory, 56(5):2053–2080, 2010.
[CV95] Corinna Cortes and Vladimir Vapnik. Support vector machine. Machine Learning, 20(3):273–297, 1995.
[CV14] Joon Hee Choi and S. V. N. Vishwanathan. DFacTo: distributed factorization of tensors. In Proceedings of the 27th International Conference on Neural Information Processing Systems, pages 1296–1304, 2014.
[CW13] Kenneth L Clarkson and David P Woodruff. Low rank approximation and regression in input sparsity time. In Proceedings of the forty-fifth annual ACM symposium on Theory of computing, pages 81–90. ACM, 2013.
[CW15a] Kenneth L Clarkson and David P Woodruff. Input sparsity and hardness for robust subspace approximation. In Foundations of Computer Science (FOCS), 2015 IEEE 56th Annual Symposium on, pages 310–329. IEEE, 2015.
[CW15b] Kenneth L Clarkson and David P Woodruff. Sketching for M-estimators: A unified approach to robust regression. In Proceedings of the Twenty-Sixth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 921–939. Society for Industrial and Applied Mathematics, 2015.
[CWT+15] Wenlin Chen, James Wilson, Stephen Tyree, Kilian Weinberger, and Yixin Chen. Compressing neural networks with the hashing trick. In International Conference on Machine Learning, pages 2285–2294, 2015.
[CYQ+13] Lijun Chang, Jeffrey Xu Yu, Lu Qin, Xuemin Lin, Chengfei Liu, and Weifa Liang. Efficiently computing k-edge connected components via graph decomposition. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, pages 205–216. ACM, 2013.
[CZB+13] Shayok Chakraborty, Jiayu Zhou, Vineeth Balasubramanian, Sethuraman Panchanathan, Ian Davidson, and Jieping Ye. Active matrix completion. In Data Mining (ICDM), 2013 IEEE 13th International Conference on, pages 81–90. IEEE, 2013.
[DDH+09] Anirban Dasgupta, Petros Drineas, Boulos Harb, Ravi Kumar, and Michael W Mahoney. Sampling algorithms and coresets for ℓp regression. SIAM Journal on Computing, 38(5):2060–2078, 2009.
[DLDM98] Lieven De Lathauwer and Bart De Moor. From matrix to tensor: Multilinear algebra and signal processing. In Institute of Mathematics and Its Applications Conference Series, volume 67, pages 1–16. Oxford University Press, 1998.
[DLDMV00] Lieven De Lathauwer, Bart De Moor, and Joos Vandewalle. A multilinear singular value decomposition. SIAM Journal on Matrix Analysis and Applications, 2000.
[DMMS11] Petros Drineas, Michael W Mahoney, S Muthukrishnan, and Tamás Sarlós. Faster least squares approximation. Numerische Mathematik, 2011.
[DR16] Mark A Davenport and Justin Romberg. An overview of low-rank matrix recovery from incomplete observations. IEEE Journal of Selected Topics in Signal Processing, 10(4):608–622, 2016.
[DS08] Samuel I. Daitch and Daniel A. Spielman. Faster approximate lossy generalized flow via interior point algorithms. In Proceedings of the 40th annual ACM symposium on Theory of computing, STOC '08, pages 451–460, New York, NY, USA, 2008. ACM.
[DSL08] Vin De Silva and Lek-Heng Lim. Tensor rank and the ill-posedness of the best low-rank approximation problem. SIAM Journal on Matrix Analysis and Applications, 30(3):1084–1127, 2008.
[FBK+] Farideh Fazayeli, Arindam Banerjee, Jens Kattge, Franziska Schrodt, and Peter B Reich. Uncertainty quantified matrix completion using Bayesian hierarchical matrix factorization. In 2014 13th International Conference on Machine Learning and Applications.
[GHJY15] Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping from saddle points: online stochastic gradient for tensor decomposition. In Conference on Learning Theory, pages 797–842, 2015.
[GLM16] Rong Ge, Jason D Lee, and Tengyu Ma. Matrix completion has no spurious local minimum. In Advances in Neural Information Processing Systems, pages 2973–2981, 2016.
[GR87] Leslie Greengard and Vladimir Rokhlin. A fast algorithm for particle simulations. Journal of Computational Physics, 73(2):325–348, 1987.
[Har70] Richard A Harshman. Foundations of the PARAFAC procedure: Models and conditions for an "explanatory" multi-modal factor analysis. 1970.
[HL11] Nicholas J Higham and Lijing Lin. On pth roots of stochastic matrices. Linear Algebra and its Applications, 435(3):448–463, 2011.
[HL13] Christopher J Hillar and Lek-Heng Lim. Most tensor problems are NP-hard. Journal of the ACM (JACM), 2013.
[HMLZ15] Trevor Hastie, Rahul Mazumder, Jason D Lee, and Reza Zadeh. Matrix completion and low-rank SVD via fast alternating least squares. Journal of Machine Learning Research, 16:3367–3402, 2015.
[JPKF15] Inah Jeon, Evangelos E Papalexakis, U Kang, and Christos Faloutsos. HaTen2: Billion-scale tensor decompositions. In Data Engineering (ICDE), 2015 IEEE 31st International Conference on, pages 1047–1058. IEEE, 2015.
[KB09] Tamara G Kolda and Brett W Bader. Tensor decompositions and applications. SIAM Review, 2009.
[KBV09] Yehuda Koren, Robert Bell, and Chris Volinsky. Matrix factorization techniques for recommender systems. Computer, 42(8), 2009.
[KL13] Jonathan A Kelner and Alex Levin. Spectral sparsification in the semi-streaming setting. Theory of Computing Systems, 53(2):243–262, 2013.
[KMO10] Raghunandan H Keshavan, Andrea Montanari, and Sewoong Oh. Matrix completion from a few entries. IEEE Transactions on Information Theory, 56(6):2980–2998, 2010.
[KMP10] Ioannis Koutis, Gary L Miller, and Richard Peng. Approaching optimality for solving SDD linear systems. In Proceedings of the 2010 IEEE 51st Annual Symposium on Foundations of Computer Science, pages 235–244. IEEE Computer Society, 2010.
[KOSZ13] Jonathan A. Kelner, Lorenzo Orecchia, Aaron Sidford, and Zeyuan Allen Zhu. A simple, combinatorial algorithm for solving SDD systems in nearly-linear time. In Proceedings of the Forty-fifth Annual ACM Symposium on Theory of Computing, STOC '13, 2013.
[KPHF12] U Kang, Evangelos Papalexakis, Abhay Harpale, and Christos Faloutsos. GigaTensor: scaling tensor analysis up by 100 times - algorithms and discoveries. In Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 316–324. ACM, 2012.
[KRS14] Been Kim, Cynthia Rudin, and Julie Shah. The Bayesian case model: a generative approach for case-based reasoning and prototype classification. In Proceedings of the 27th International Conference on Neural Information Processing Systems, pages 1952–1960. MIT Press, 2014.
[KT12] Franz Király and Ryota Tomioka. A combinatorial algebraic approach for the identifiability of low-rank matrix completion. In Proceedings of the 29th International Conference on Machine Learning, pages 755–762, 2012.
[KT13] Franz Kiraly and Louis Theran. Error-minimizing estimates and universal entry-wise error bounds for low-rank matrix completion. In Advances in Neural Information Processing Systems, pages 2364–2372, 2013.
[KTT15] Franz J Király, Louis Theran, and Ryota Tomioka. The algebraic combinatorial approach for low-rank matrix completion. Journal of Machine Learning Research, 16:1391–1436, 2015.
[LMP13] Mu Li, Gary L Miller, and Richard Peng. Iterative row sampling. In Foundations of Computer Science (FOCS), 2013 IEEE 54th Annual Symposium on, pages 127–136. IEEE, 2013.
[LW12] Po-Ling Loh and Martin J Wainwright. Structure estimation for discrete graphical models: Generalized covariance matrices and their inverses. In Advances in Neural Information Processing Systems, pages 2087–2095, 2012.
[Mah11] Michael W Mahoney. Randomized algorithms for matrices and data. Foundations and Trends in Machine Learning, 2011.
[MCG+11] Aditya Krishna Menon, Krishna-Prasad Chitrapura, Sachin Garg, Deepak Agarwal, and Nagaraj Kota. Response prediction using collaborative filtering with hierarchies and side-information. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 141–149. ACM, 2011.
[MJD09] Raghu Meka, Prateek Jain, and Inderjit S Dhillon. Matrix completion from power-law distributed samples. In Advances in Neural Information Processing Systems, pages 1258–1266, 2009.
[ML13] Julian McAuley and Jure Leskovec. Hidden factors and hidden topics: understanding rating dimensions with review text. In Proceedings of the 7th ACM conference on Recommender systems, pages 165–172. ACM, 2013.
[MM13] Xiangrui Meng and Michael W Mahoney. Low-distortion subspace embeddings in input-sparsity time and applications to robust linear regression. In Proceedings of the forty-fifth annual ACM symposium on Theory of computing, pages 91–100. ACM, 2013.
[MMD08] Michael W Mahoney, Mauro Maggioni, and Petros Drineas. Tensor-CUR decompositions for tensor-based data. SIAM Journal on Matrix Analysis and Applications, 30(3):957–987, 2008.
[MP13] Gary L Miller and Richard Peng. Approximate maximum flow on separable undirected graphs. In Proceedings of the twenty-fourth annual ACM-SIAM symposium on Discrete algorithms, pages 1151–1170. SIAM, 2013.
[NDT15] Nam H Nguyen, Petros Drineas, and Trac D Tran. Tensor sparsification via a bound on the spectral norm of random tensors. Information and Inference: A Journal of the IMA, 4(3):195–229, 2015.
[NN13] Jelani Nelson and Huy L Nguyên. OSNAP: Faster numerical linear algebra algorithms via sparser subspace embeddings. In Foundations of Computer Science (FOCS), 2013 IEEE 54th Annual Symposium on, pages 117–126. IEEE, 2013.
[NPOV15] Alexander Novikov, Dmitry Podoprikhin, Anton Osokin, and Dmitry Vetrov. Tensorizing neural networks. In Proceedings of the 28th International Conference on Neural Information Processing Systems, pages 442–450. MIT Press, 2015.
[NSW14] Deanna Needell, Nathan Srebro, and Rachel Ward. Stochastic gradient descent, weighted sampling, and the randomized Kaczmarz algorithm. In Proceedings of the 27th International Conference on Neural Information Processing Systems, pages 1017–1025. MIT Press, 2014.
[NW93] Hiroshi Nagamochi and Toshimasa Watanabe. Computing k-edge-connected components of a multigraph. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, 1993.
[PABN16] Daniel L Pimentel-Alarcón, Nigel Boston, and Robert D Nowak. A characterization of deterministic sampling patterns for low-rank matrix completion. IEEE Journal of Selected Topics in Signal Processing, 10(4):623–636, 2016.
[PAW10] Ian Porteous, Arthur Asuncion, and Max Welling. Bayesian matrix factorization with side information and Dirichlet process mixtures. In Twenty-Fourth AAAI Conference on Artificial Intelligence, 2010.
[Pen13] Richard Peng. Algorithm Design Using Spectral Graph Theory. PhD thesis, Carnegie Mellon University, Pittsburgh, August 2013. CMU CS Tech Report CMU-CS-13-121.
[PFS12] Evangelos E Papalexakis, Christos Faloutsos, and Nicholas D Sidiropoulos. ParCube: Sparse parallelizable tensor decompositions. In Machine Learning and Knowledge Discovery in Databases. Springer, 2012.
[PP13] Ninh Pham and Rasmus Pagh. Fast and scalable polynomial kernels via explicit feature maps. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 239–247. ACM, 2013.
[PS14] Richard Peng and Daniel A. Spielman. An efficient parallel solver for SDD linear systems. In Proceedings of the 46th Annual ACM Symposium on Theory of Computing, STOC '14, pages 333–342, New York, NY, USA, 2014. ACM.
[PTC13] Anh-Huy Phan, Petr Tichavský, and Andrzej Cichocki. Fast alternating LS algorithms for high order CANDECOMP/PARAFAC tensor factorizations. IEEE Transactions on Signal Processing, 61(19):4834–4846, 2013.
[RCT15] Natali Ruchansky, Mark Crovella, and Evimaria Terzi. Matrix completion with queries. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1025–1034. ACM, 2015.
[SC10] Amit Singer and Mihai Cucuringu. Uniqueness of low-rank matrix completion by rigidity theory. SIAM Journal on Matrix Analysis and Applications, 31(4):1621–1641, 2010.
[SF73] Gilbert Strang and George J Fix. An analysis of the finite element method, volume 212. Prentice-Hall, Englewood Cliffs, NJ, 1973.
[SHB+16] Heli Sun, Jianbin Huang, Yang Bai, Zhongmeng Zhao, Xiaolin Jia, Fang He, and Yang Li. Efficient k-edge connected component detection through an early merging and splitting strategy. Knowledge-Based Systems, 111:63–72, 2016.
[SPL+09] Jimeng Sun, Spiros Papadimitriou, Ching-Yung Lin, Nan Cao, Shixia Liu, and Weihong Qian. MultiVis: Content-based social network exploration through multi-way visual analysis. In Proceedings of the 2009 SIAM International Conference on Data Mining, pages 1064–1075. SIAM, 2009.
[SPS13] Dougal J Sutherland, Barnabás Póczos, and Jeff Schneider. Active learning and search on low-rank matrices. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 212–220. ACM, 2013.
[SRSK15] Shaden Smith, Niranjay Ravindran, Nicholas D Sidiropoulos, and George Karypis. SPLATT: Efficient and parallel sparse tensor-matrix multiplication. In Parallel and Distributed Processing Symposium (IPDPS), 2015 IEEE International, pages 61–70. IEEE, 2015.
[SS11] Daniel A. Spielman and Nikhil Srivastava. Graph sparsification by effective resistances. SIAM J. Comput., 40(6):1913–1926, 2011.
[ST04] Daniel A. Spielman and Shang-Hua Teng. Nearly-linear time algorithms for graph partitioning, graph sparsification, and solving linear systems. In Proceedings of the Thirty-sixth Annual ACM Symposium on Theory of Computing, STOC '04, pages 81–90, 2004.
[ST11] Daniel A Spielman and Shang-Hua Teng. Spectral sparsification of graphs. SIAM Journal on Computing, 40(4):981–1025, 2011.
[ST14] Daniel A Spielman and Shang-Hua Teng. Nearly linear time algorithms for preconditioning and solving symmetric, diagonally dominant linear systems. SIAM Journal on Matrix Analysis and Applications, 35(3):835–885, 2014.
[SV09] Thomas Strohmer and Roman Vershynin. A randomized Kaczmarz algorithm with exponential convergence. Journal of Fourier Analysis and Applications, 15(2):262–278, 2009.
[SW97] Mechthild Stoer and Frank Wagner. A simple min-cut algorithm. Journal of the ACM (JACM), 44(4):585–591, 1997.
[TBG+13] Charalampos Tsourakakis, Francesco Bonchi, Aristides Gionis, Francesco Gullo, and Maria Tsiarli. Denser than the densest subgraph: extracting optimal quasi-cliques with quality guarantees. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 104–112. ACM, 2013.
[Tsi09] Yung H Tsin. Yet another optimal algorithm for 3-edge-connectivity. Journal of Discrete Algorithms, 7(1):130–146, 2009.
[Tso10] Charalampos E Tsourakakis. MACH: Fast randomized tensor decompositions. In Proceedings of the 2010 SIAM International Conference on Data Mining, pages 689–700. SIAM, 2010.
[Tso15] Charalampos Tsourakakis. The k-clique densest subgraph problem. In Proceedings of the 24th international conference on World Wide Web, pages 1122–1132. International World Wide Web Conferences Steering Committee, 2015.
[VHF16] Matti Vihola, Jouni Helske, and Jordan Franks. Importance sampling type correction of Markov chain Monte Carlo and exact approximations. arXiv preprint arXiv:1609.02541, 2016.
[VT02a] M Alex O Vasilescu and Demetri Terzopoulos. Multilinear analysis of image ensembles: TensorFaces. In European Conference on Computer Vision, pages 447–460. Springer, 2002.
[VT02b] M Alex O Vasilescu and Demetri Terzopoulos. Multilinear image analysis for facial recognition. In Pattern Recognition, 2002. Proceedings. 16th International Conference on, volume 2, pages 511–514. IEEE, 2002.
[VV98] Vladimir Naumovich Vapnik and Vladimir Vapnik. Statistical learning theory, volume 1. Wiley New York, 1998.
[Woo14] David P. Woodruff. Sketching as a tool for numerical linear algebra. Foundations and Trends in Theoretical Computer Science, 2014.
[WTSA15] Yining Wang, Hsiao-Yu Tung, Alex Smola, and Anima Anandkumar. Fast and guaranteed tensor decomposition via sketching. In Proceedings of the 28th International Conference on Neural Information Processing Systems, pages 991–999. MIT Press, 2015.
[WYZ12] Zaiwen Wen, Wotao Yin, and Yin Zhang. Solving a low-rank factorization model for matrix completion by a nonlinear successive over-relaxation algorithm. Mathematical Programming Computation, pages 1–29, 2012.
[XJZ13] Miao Xu, Rong Jin, and Zhi-Hua Zhou. Speedup matrix completion with side information: Application to multi-label learning. In Advances in Neural Information Processing Systems, pages 2301–2309, 2013.
[YCL15] Rose Yu, Dehua Cheng, and Yan Liu. Accelerated online low rank tensor learning for multivariate spatiotemporal streams. In International Conference on Machine Learning, pages 238–247, 2015.
[YCRM16] Jiyan Yang, Yin-Lam Chow, Christopher Ré, and Michael W Mahoney. Weighted SGD for ℓp regression with randomized preconditioning. In Proceedings of the Twenty-Seventh Annual ACM-SIAM Symposium on Discrete Algorithms, pages 558–569. Society for Industrial and Applied Mathematics, 2016.
[ZLY+12] Rui Zhou, Chengfei Liu, Jeffrey Xu Yu, Weifa Liang, Baichen Chen, and Jianxin Li. Finding maximal k-edge-connected subgraphs from a large graph. In Proceedings of the 15th International Conference on Extending Database Technology, pages 480–491. ACM, 2012.
[ZYZY15] Xiaowei Zhou, Can Yang, Hongyu Zhao, and Weichuan Yu. Low-rank modeling and its applications in image analysis. ACM Computing Surveys (CSUR), 47(2):36, 2015.
[ZZ15] Peilin Zhao and Tong Zhang. Stochastic optimization with importance sampling for regularized loss minimization. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 1–9, 2015.
Abstract

This is the era of big data, where both challenges and opportunities lie ahead for machine learning research. Data are created at an unprecedented pace, with a significant cost in collecting, storing, and computing at the current scale. As the computational power that we possess gradually plateaus, it is an ever-increasing challenge to fully utilize the wealth of big data, where better data reduction techniques and scalable algorithms are the keys to a solution. We observe that, for answering a given query, the data are not equally important: it is often possible to complete the task at hand with fewer but relevant data without compromising accuracy. Based on the model and the query, we provide efficient access to numerical scores for the data points that represent their relevance to the current task. This enables us to devote computational resources to the important data, which improves both scalability and reliability.

We present our work through three applications. We developed efficient numerical algorithms for tensor CP decomposition and random-walk matrix-polynomial sparsification, where we provide efficient access to the statistical leverage scores for faster randomized numerical routines. We observe that model structures, such as low-rank models, can be leveraged to provide an efficient approximation to data relevance. We also present our matrix completability analysis framework based on both the data and the model, which are both structured, and we show that the underlying completability pattern can be utilized to achieve a more reliable estimation of missing entries.