Tensor Learning for Large-Scale Spatiotemporal Analysis

by Qi (Rose) Yu

A Dissertation Presented to the FACULTY OF THE GRADUATE SCHOOL, UNIVERSITY OF SOUTHERN CALIFORNIA, in Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (Computer Science). August 2017. Copyright 2017 Qi (Rose) Yu.

Dedication

I dedicate my dissertation work to my mother. She is by far the strongest woman I have ever known. Her support, encouragement, and constant love have sustained me throughout my life. To my mother.

Acknowledgements

First and foremost, I wish to thank my committee members, Prof. Yan Liu, Prof. Cyrus Shahabi, and Prof. Mahdi Soltanolkotabi, who were more than generous with their expertise and precious time. A special thank-you to Prof. Yan Liu, my committee chair, for her exquisite attention to detail and her demand for excellence. She has been a great mentor, a close friend, and my role model. Thank you to Prof. Cyrus Shahabi, my committee co-chair, for advising me since my third year and being an inspiration for me in many ways.

I would like to acknowledge many of my fellow lab mates and collaborators: Taha Bahadori, Dehua Cheng, Xinran He, Dingxiong Deng, Yaguang Li and Guangyu Li, for helping conduct my research and for many helpful discussions. I still remember the day when we were brainstorming research ideas and sketching random matrices on the whiteboard, and Taha shouted out the word: tensor! That was the beginning of this thesis work. I am also very grateful to Prof. Christopher Ré of Stanford University for hosting me in the summer and having me work with his Ph.D. students Paroma Varma and Chris De Sa. I learned a lot outside my area and made many friends at Stanford. My sincere thanks also go to Dr. Andrew Gelfand (now at Engineers Gate) and Dr. Suju Rajan (now at Criteo) at Yahoo!, and Dr. Hongfei Li at IBM, for mentoring me on their teams as an intern and exposing me to exciting research opportunities in industry. I thank Prof. Gaurav Sukhatme and Prof. Aram Galstyan at USC ISI for constructive feedback during my thesis proposal. I will forever be thankful to Prof. Gaurav Sukhatme for offering mock interviews to help my academic job search. I want to thank our department advisor Lizsl De Leon for all the administrative help. I also thank the wonderful staff in the computer science department, including Michael Archuleta and Daisy Tang, for always being so friendly and helpful.

I am very grateful to the USC Annenberg graduate fellowship program and IARPA for funding my graduate studies. I thank the USC graduate school and the USC WiSE program for funding my travel to conferences to present my research. I was lucky to be a selected participant in the Heidelberg Laureate Forum and MIT Rising Stars in EECS; I had unforgettable experiences and enormous personal growth in these programs.

Finally, I am especially grateful for all the love from my family and friends. For my family in China: their support has been unconditional all these years; they have given up many things for me to be at USC; they have cherished with me every great moment and supported me whenever I needed it. For my uncle Xiaoping Wu and my two cousins Nelson Wu and Xun Wu, who have never left my side in the United States. For my best friends Ying Yu and Xinyue Li, for their encouragement throughout the process. I give special thanks to Stephan Zheng, for always being there for me and for his faithful support during the final stages of my Ph.D.
iv Table of Contents Dedication ii Acknowledgements iii List Of Tables viii List Of Figures ix Abstract xi Chapter 1: Introduction 1 1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Thesis Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 1.3 Contributions of the Research . . . . . . . . . . . . . . . . . . . . . . . . . 5 Chapter 2: Related Work 8 2.1 Tensor Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.2 Spatiotemporal Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.3 Notations and Preliminary . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 Chapter 3: Spatiotemporal Analysis as Tensor Learning 16 3.1 Practical Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 3.1.1 Cokriging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 3.1.2 Forecasting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 3.1.3 Multimodel Ensemble . . . . . . . . . . . . . . . . . . . . . . . . . 17 3.2 Tensor Formulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 3.2.1 Cokriging Formulation . . . . . . . . . . . . . . . . . . . . . . . . . 19 3.2.2 Forecasting Formulation . . . . . . . . . . . . . . . . . . . . . . . . 20 3.2.3 Multi-model Ensemble Formulation . . . . . . . . . . . . . . . . . . 21 3.2.4 Unied Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . 22 3.2.5 Temporal Regularization . . . . . . . . . . . . . . . . . . . . . . . . 25 3.3 Spatiotemporal Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 v Chapter 4: Fast Batch Learning 28 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 4.2 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 4.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 4.3.1 Synthetic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 4.3.2 Cokriging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 4.3.3 Forecasting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 4.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 Chapter 5: Accelerated Online Learning 41 5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 5.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 5.3 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 5.3.1 Tensor Stream in Online Setting . . . . . . . . . . . . . . . . . . . 45 5.3.2 Online Low-Rank Tensor Approximation . . . . . . . . . . . . . . . 46 5.3.3 Theoretical Analysis of ALTO . . . . . . . . . . . . . . . . . . . . 49 5.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 5.4.1 Synthetic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 5.4.2 Online Forecasting . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 5.4.3 Online Multi-model Ensemble . . . . . . . . . . . . . . . . . . . . . 55 5.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 Chapter 6: Memory Ecient Learning 59 6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 6.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 6.3 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
62 6.3.1 Tensor Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 6.3.2 Tensor Projected Gradient . . . . . . . . . . . . . . . . . . . . . . 64 6.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 6.4.1 Synthetic Data set . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 6.4.2 Real Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 6.4.2.1 Multi-linear Multi-task Learning . . . . . . . . . . . . . . 76 6.4.2.2 Spatiotemporal Forecasting . . . . . . . . . . . . . . . . . 77 6.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80 Chapter 7: Probabilistic Interpretation 81 7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 7.2 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83 7.2.1 Low-Rank Tensor Regression . . . . . . . . . . . . . . . . . . . . . 83 7.2.2 Multi-linear Gaussian Process . . . . . . . . . . . . . . . . . . . . . 86 7.3 Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88 7.3.1 Multi-Level Task Relatedness . . . . . . . . . . . . . . . . . . . . . 89 7.3.2 Multi-Linear Low-Rank Constraint . . . . . . . . . . . . . . . . . . 90 7.3.3 Theoretical Properties of Multi-linear GP . . . . . . . . . . . . . . 92 7.3.4 Relation to Other Methods . . . . . . . . . . . . . . . . . . . . . . 94 7.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95 vi 7.4.1 Multi-linear Multi-task learning . . . . . . . . . . . . . . . . . . . . 95 7.4.2 Spatiotemporal Forecasting . . . . . . . . . . . . . . . . . . . . . . 97 7.4.3 Multi-output Regression . . . . . . . . . . . . . . . . . . . . . . . . 98 7.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99 Chapter 8: Summary, Discussion and Future Work 101 Appendix A Appendix for Chapter 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103 A.0.1 Example of Tensor Unfolding . . . . . . . . . . . . . . . . . . . . . 103 A.0.2 Proof of Lemma 4.2.1 . . . . . . . . . . . . . . . . . . . . . . . . . 104 A.0.3 Proof of Theorem 4.2.2 . . . . . . . . . . . . . . . . . . . . . . . . 105 A.0.4 Convex Relaxation with ADMM . . . . . . . . . . . . . . . . . . . 107 A.0.5 Derivation of the Unied Formulation . . . . . . . . . . . . . . . . 108 Appendix B Appendix for Chapter 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109 B.0.1 Matrix Update . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109 B.0.2 Reformulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109 B.0.3 Multi-model Data . . . . . . . . . . . . . . . . . . . . . . . . . . . 110 Appendix C Appendix for Chapter 7 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111 C.0.1 Eigenvalue Problem . . . . . . . . . . . . . . . . . . . . . . . . . . 111 C.0.2 Derivatives for the Optimization . . . . . . . . . . . . . . . . . . . 111 C.0.3 Proof for Proposition 7.3.1 . . . . . . . . . . . . . . . . . . . . . . 112 C.0.4 Proof of Theorem 7.3.2 . . . . . . . . . . . . . . . . . . . . . . . . 113 Reference List 116 vii List Of Tables 4.1 Cokriging RMSE of 6 methods averaged over 10 runs. In each run, 10% of the locations are assumed missing. . . . . . . . . . . . . . . . . . . . . . . 36 4.2 Forecasting RMSE for VAR process with 3 lags, trained with 90% of the time series. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
38 4.3 Running time (in seconds) for cokriging and forecasting. . . . . . . . . . . 38 5.1 Forecasting RMSE comparison for ALTO and baselines on Foursquare-S and AWS datasets with 90 % training data with respect to dierent lags of the VAR model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 5.2 Forecasting overall run time comparison for ALTO and baselines on Foursquare- S and AWS datasets with 90 % training data with respect to dierent lags of the VAR model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 6.1 Summarization of contemporary tensor regression models, algorithms, and applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 6.2 Forecasting RMSE and run time on Foursquare check-in data and USHCN daily measurement for VAR process with 3 lags, trained with 80% of the time series. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78 7.1 spatiotemporal forecasting mean square error comparison of MLGP and baselines. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98 viii List Of Figures 1.1 An example of a three-mode tensorX2R IJK . . . . . . . . . . . . . . 5 2.1 Timeline of spatial statistics and tensor analysis . . . . . . . . . . . . . . 13 2.2 Tucker decomposition and Tucker rank . . . . . . . . . . . . . . . . . . . . 15 2.3 CP decomposition and CP rank . . . . . . . . . . . . . . . . . . . . . . . . 15 4.1 Tensor estimation performance comparison on the synthetic dataset over 10 random runs. (a) Parameter estimation RMSE with training time series length, (b) Mixture Rank Complexity with training time series length, (c) running time for one single round with respect to the number of variables. 35 4.2 Map of most predictive regions analyzed by the greedy algorithm using 17 variables of the CCDS dataset. Red color means high predictiveness whereas blue denotes low predictiveness. . . . . . . . . . . . . . . . . . . . 39 5.1 (a) Average parameter estimation RMSE (b) Overall run time comparison over 10 random runs for ALTO and baselines on the synthetic dataset. . . 53 5.2 (a) Forecasting RMSE with respect to iteration number (b) Per iteration run time comparison for ALTO and baselines on Foursquare-S dataset. . . 54 5.3 Climate models and their in uential areas. Dierent color denotes dierent models. The in uence is computed by aggregating the model parameters. 56 5.4 Per variable forecasting RMSE for 18 variables (a) and overall run time (b) comparison of multi-model ensemble for ALTO and baselines using 90 % training data, with 7 dierent models over 20 years. . . . . . . . . . . . 56 5.5 Forecasting RMSE using 90 % data with the rank value for (a) Foursquare- S forecasting and (b) multi-model ensemble. . . . . . . . . . . . . . . . . . 57 ix 6.1 Performance comparison on the synthetic data set over 10 random runs. (a) model parameter estimation RMSE for dierent algorithms, (b) running time with respect to sketching size for dierent algorithms, (c) RMSE for dierent sketching method, (d) running time for dierent sketching method. 76 6.2 Multi-linear multi-task learning performance comparison on the synthetic dataset over 10 random runs. (a) average forecasting MSE and (b) running time w.r.t. training size. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77 6.3 Velocity vectors plot of spatial-temporal dependency graph obtained via TPG. Results are averaged across all ve dierent climate variables. . . . 
79
7.1 (a) Theoretical and numerically simulated learning curves for task correlation 0.25, 0.25, 0.75. (b) Learning curve for 2-mode MLGP with low-rank approximation r = 9, 4, 1. (c) Learning curve for 3-mode MLGP with low-rank approximation r = 9, 4, 1. 88
7.2 Graphical model for (a) tensor regression, (b) Gaussian process and (c) MLGP. 91
7.3 Multi-linear multi-task learning performance with sample size for MLGP and baselines on benchmark datasets with standard metrics. (a) Mean square error on the restaurant dataset. (b) Expected variance on the school dataset. 96
7.4 Contour plots of the predictive variance of MLGP with respect to 5 climate variables in the USHCN-CA data, color-coded with yellow for high variance and blue for low variance. 97
7.5 Predictive mean (solid line) and variance (shaded area) from MLGP for the foreign exchange rates of XAU, EUR and JPY. Magenta points are observations. 99

Abstract

Spatiotemporal data is ubiquitous in our daily life, ranging from climate science, via transportation, to social media. Today, data is being collected at an unprecedented scale. Yesterday's concepts and tools are insufficient to serve tomorrow's data-driven decision makers. A key challenge is that spatiotemporal data often demonstrate complex dependence structures, come at multiple resolutions, and are of high dimensionality, and current machine learning methods are not equipped to handle this. Hence, we require new methods that can handle non-i.i.d. data, reason at different granularities, and generate structured predictions. This thesis proposes a general framework of low-rank tensor learning for large-scale spatiotemporal analysis. It unifies a number of spatiotemporal analysis problems as non-convex tensor optimization tasks. It automatically learns the high-order correlation structure, such as spatial clustering, temporal periodicity, and shared structure among variables. Within the framework, we develop efficient algorithms for offline, online, and memory-efficient learning with theoretical guarantees. We also discover the connections between our framework and Gaussian processes. Finally, we conduct experiments on climate and social network applications and demonstrate significant speedup with comparable prediction accuracy.

Chapter 1: Introduction

Time and space are modes by which we think and not conditions in which we live. — Albert Einstein

1.1 Background

These words, by Albert Einstein, one of the most recognized scientists of the century, reveal the great wonder of our scientific inquiry into the nature of space and time. Here, in the physical world, there are intriguing questions to resolve and predictions to make, and understanding the complex dependencies across time and space is a crucial part. Nowadays, with access to huge amounts of spatiotemporal data, as well as the rapid development of computational resources, we are entering a new era of data-driven approaches to rethink time and space. The core of this thesis is to guide the way towards a new set of computationally efficient tools for learning from large-scale spatiotemporal data.

Large-scale spatiotemporal data are ubiquitous. For example, in climate science, measurements such as temperature and precipitation are collected at different stations and time stamps.
In social media, users' check-ins or reviews are often associated with spatiotemporal attributes. In transportation, road sensors constantly record the traffic information at various locations. Data with rich spatial and temporal information are everywhere: from classical subjects such as weather forecasting, epidemic prevention, and traffic prediction, to emerging fields such as sustainability, the Internet of Things (IoT), and health care, there is almost no domain where spatiotemporal data do not exist. With access to such massive data sources, how can we make use of them and turn big data into valuable knowledge?

Spatiotemporal analysis is the design and application of statistical models to capture the spatiotemporal correlations in the data. Efficient and effective spatiotemporal analysis would enable us to perform more accurate forecasts and interpolations, to discover novel spatiotemporal patterns, and to make better real-time decisions. Historically, spatiotemporal analysis has been intensively studied in the time series analysis and spatial statistics communities, as well as in many scientific disciplines such as geology, ecology, and climatology. Gaussian processes are among the most popular tools for spatiotemporal analysis [Gelfand and Schliep, 2016]. However, Gaussian processes often suffer from computational bottlenecks, which pose significant challenges to learning from large volumes of spatiotemporal data with statistical models.

In machine learning, modeling spatiotemporal data is often regarded as a statistically difficult task for three major reasons. (1) Non-Euclidean space: geographically distributed data, though measured in Euclidean space, usually demonstrate properties that violate Euclidean geometry or topology; for example, travel times in urban areas disobey the triangle inequality. (2) Spatiotemporal dependencies: temporal dependency usually implies an ordering in the sequence, while spatial dependency requires identifying the correct geographical structures in the data. (3) Spatiotemporal heterogeneity: heterogeneity refers to the non-stationarity of most spatiotemporal processes. Time series often exhibit trend, seasonality, and local fluctuations, and these temporal statistics can also change with location.

In this thesis, we aim to address these challenges by first casting spatiotemporal data as tensors, with dimensions corresponding to space, time, and variables. Tensors, as the generalization of vectors and matrices, provide a convenient way to encode multi-directional dependencies. Using the language of tensor algebra, we can then reformulate many classical multivariate spatiotemporal analysis problems as tensor learning tasks. In particular, we propose to use a tensor low-rank constraint to address the high dimensionality of multivariate spatiotemporal data. The low-rank constraint captures the shared structure in spatiotemporal data, such as spatial clustering, temporal periodicity, and correlations among variables. We also incorporate a graph Laplacian regularizer to model spatial proximity. We develop a series of algorithms to solve the tensor learning optimization problem in a scalable fashion. Furthermore, we showcase the practicality of the methods on real-world climate and social network applications.
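To make the tensor representation concrete, the following is a minimal sketch in Python/NumPy (our own illustration, not code from the thesis) of how multivariate station records can be stacked into a third-order tensor of shape P x T x M (locations x time x variables). The file name and the long-format column layout are hypothetical assumptions.

import numpy as np
import pandas as pd

# Hypothetical long-format records: one row per (station, time, variable, value).
records = pd.read_csv("station_records.csv")  # assumed columns: station, time, variable, value

stations = sorted(records["station"].unique())    # P locations
times = sorted(records["time"].unique())          # T time stamps
variables = sorted(records["variable"].unique())  # M variables

P, T, M = len(stations), len(times), len(variables)
X = np.full((P, T, M), np.nan)                    # spatiotemporal data tensor

s_idx = {s: i for i, s in enumerate(stations)}
t_idx = {t: i for i, t in enumerate(times)}
v_idx = {v: i for i, v in enumerate(variables)}

for row in records.itertuples(index=False):
    X[s_idx[row.station], t_idx[row.time], v_idx[row.variable]] = row.value

# X[p, t, m] now holds the measurement of variable m at location p and time t;
# entries never observed remain NaN.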
Our work builds upon modern techniques in high-dimensional variable selection (e.g., [Wainwright, 2009]) and low-rank matrix recovery (e.g., [Candès and Recht, 2009]). It builds on the observation that we can exploit the inherent sparsity structure in the data to achieve computational efficiency. Similar observations have been reported in hyperspectral imaging recovery [Gandy et al., 2011], neuro-imaging data analysis [Zhou et al., 2013], and multi-linear multi-task learning (e.g., [Romera-Paredes et al., 2013]). Second, by studying tensor models in supervised learning, we demonstrate the possibility of generalizing many theoretical and algorithmic results from matrices to tensors. Many tensor problems are computationally hard [Hillar and Lim, 2013], but that does not mean they are statistically intractable. Moreover, for many practical problems, though non-convex by nature, local solutions suffice, and finding a good local optimum brings large computational gains compared with solving a convexified problem to global optimality. Finally, we develop efficient dimensionality reduction techniques for high-order models. These allow us to perform fast tensor learning in offline, online, and memory-constrained settings.

We want to emphasize the broader implications of this thesis work. Even though we use spatiotemporal analysis as an example, the application domain of tensor learning goes far beyond it. Advancing from matrix-based computational thinking in the 1960s to block-matrix-based computation in the 1980s, we are entering a new Tensor Era (see http://www.cs.cornell.edu/cv/OtherPdf/Mat2Ten.pdf). It is a generic computational language and a new way of handling complex structures. It opens unprecedented opportunities in statistics, numerical analysis, signal processing, and machine learning, where our desire to model high-dimensional, non-linear, high-order systems is met by increasing computational capacity. The proposed tensor learning framework showcases how we can express machine learning problems in the tensor language; one can imagine applying similar strategies to other fields, developing tensor-level numerical analysis or tensor-based signal processing.

Figure 1.1: An example of a three-mode tensor X in R^{I x J x K}, with modes corresponding to space, time, and variables.

1.2 Thesis Statement

Tensor methods provide a systematic way of analyzing large-scale multivariate spatiotemporal data: the low-rank tensor learning framework can solve many spatiotemporal analysis tasks accurately and efficiently.

1.3 Contributions of the Research

This thesis aims to bridge the gap between tensor methods and spatiotemporal analysis. On one hand, it broadens tensor research from unsupervised learning with tensor decomposition to supervised learning with tensor regression. It poses new challenges and sheds novel insight on the optimal trade-off between the statistical and computational efficiency of tensor algorithms. On the other hand, it offers a powerful optimization tool for large-scale spatiotemporal problems from the spatial statistics and time series analysis communities. We elaborate the two-fold contributions as follows:

Tensor Learning

- We propose a general low-rank tensor regression framework to unify many classic spatiotemporal analysis tasks. Our framework expands tensor methods from unsupervised learning with tensor decomposition to the supervised learning setting. It has many other applications in multi-way data analysis, multi-task learning, and multi-output regression.
- We design efficient algorithms to solve the low-rank tensor regression problem in three different scenarios: batch learning [Yu et al., 2014], online learning [Yu et al., 2015], and memory-efficient learning [Yu and Liu, 2016]. We also establish the connections between this framework and its probabilistic counterpart, Gaussian processes. Our algorithms achieve significant speedups with comparable performance on real-world sustainability and social network applications.

- Tensor regression is a test bed for non-convex optimization. Our research demonstrates that for many practical problems, local solutions suffice and come with superior computational benefits. Instead of solving a convex surrogate, our algorithms directly tackle the non-convex objective. We provide theoretical analysis for our algorithms that guarantees their performance under feasible assumptions.

Spatiotemporal Analysis

- We identify several critical tasks in spatiotemporal analysis: forecasting, cokriging, and multi-model ensemble. We introduce a powerful tool from tensor methods which brings all these tasks under the same umbrella.

- Our method provides an alternative solution to the computational challenges of large-scale spatiotemporal problems. The proposed framework achieves significant speed-up with comparable accuracy on real-world climate and social network applications.

- We also connect the dots between the proposed tensor learning framework and the widely adopted Gaussian process models. We establish the connections between the two models and share insights into the dependency of our model's performance on data correlations.

This thesis work unifies the domains of tensor methods and spatiotemporal analysis, bridges the gap between theory and practice, and serves as an important step towards learning from large-scale spatiotemporal data. Note that although our discussions focus on spatiotemporal applications, we believe the ideas and algorithms we developed are valuable to a broader audience and should be generally applicable to other domains as well.

Chapter 2: Related Work

This thesis touches upon two areas: tensor learning and spatiotemporal analysis. In the following sections, we review the related literature in both domains.

2.1 Tensor Learning

A tensor is the multidimensional generalization of a vector and a matrix. Tensors were introduced around 1900 by the Italian mathematicians Gregorio Ricci-Curbastro and Tullio Levi-Civita, who published the theory of tensors in Méthodes de calcul différentiel absolu et leurs applications [Ricci and Levi-Civita, 1900], which Albert Einstein used as a resource to master tensor calculus, a critical tool in the development of the theory of general relativity. Tensors have established a critical position in physics and found wide application, from abstract mathematics (including differential geometry, differential calculus, and topology) to applied sciences such as quantum chemistry and electronic structure; see [Kolda and Bader, 2009] for a detailed review and the references therein.

In machine learning, tensor methods have received a lot of attention. We want to emphasize that tensor methods are not just matrix methods with additional subscripts.
The merit of bringing tensor analysis tools as opposed to widely adopted matrix ap- proaches is to capture the high-order correlation structures in the data, to model the non-linear mapping from input to output, and to signicantly alleviate the \curse of di- mensionality". This is because tensors themselves are high-order, non-linear (multi-linear is non-linear). In addition, there is a rich class of tensor models that allow us to perform ecient dimensionality reduction. Tensor methods provide a useful tool for exploratory analysis. It generalizes dimen- sionality reduction techniques to high dimensional data, leading to high-order Principal Component Analysis (PCA), Principal Cumulant Component Analysis [Morton and Lim, 2009] and Independent Component analysis (ICA) [Vasilescu and Terzopoulos, 2005]. Most re- cently, tensor methods have achieved success in learning latent variable models with theoretical guarantees [Anandkumar et al., 2015]. Tensor methods have also seen ap- plications in supervised machine learning. For example, tensor methods can be used for learning time-varying networks [Ho et al., 2015], analyzing the computational com- plexity of neural networks [Cohen et al., 2015] and compressing deep neural network [Novikov et al., 2015]. Our work is closely related to multi-linear multi-task learning (MLMTL) problem studied in [Romera-Paredes et al., 2013], which rst formulates the problem of utilizing tensor models to describe the multi-directional relatedness of learning tasks. [Tomioka et al., 2013, Wimalawarne et al., 2014] propose convex regularization methods to solve the MLMTL problem eciently. MLMTL is a special case of the tensor regression problem, which 9 aims to learn a parametric model that satises the tensor structural constraint, and is the focus of this thesis. It is also worthwhile to mention that while we are de- veloping this thesis, several new results were published related to tensor regression, including Bayesian tensor regression models [Guhaniyogi et al., 2015], tensor response regression [Rabusseau and Kadri, 2016] and supervised learning with tensor networks [Stoudenmire and Schwab, 2016]. 2.2 Spatiotemporal Analysis Many elds have contributed to the rise of spatiotemporal analysis in its modern form. Statistics has contributed greatly through work in spatial statistics. Dates back to 1950s, Moran's I statistics [Moran, 1950] and Getis-Ord general G statistics [Getis and Ord, 1992] provides accurate measures of spatial autocorrelation. Geographically weighted regres- sion [Brunsdon et al., 1998] accounts for the spatial heterogeneity with a local version of spatial regression that generates parameters disaggregated by the spatial units of anal- ysis. More recently, rank reduced regression model [Cressie and Johannesson, 2008] and hierarchical Gaussian process model [Datta et al., 2016] are introduced to handle large number spatial locations. However, these models only deal with multivariate spatial processes without considering the temporal aspects. Robotics (e.g.[Dahl et al., 2002]), computer vision (e.g.[Le et al., 2011]) and spatiotem- poral data mining (e.g. [Miller and Han, 2009]) have contributed to learning spatiotem- poral correlations in the data. Data mining itself is an application of machine learning and statistics applied to large-scale data. Researchers in data mining communities have 10 looked into the spatiotemporal data and the knowledge discovery in such data. 
However, the analysis is largely limited to exploratory analysis and heuristics, and lacks theoretical grounding. Data mining tools are developed for applications where the standard of knowledge is "what works" rather than "what is authoritative" [Roddick and Lees]. The question is how to analyze spatiotemporal data as part of a defensible and replicable scientific process.

In machine learning, spatiotemporal analysis has been a long-standing question. Almost 20 years ago, Thomas Dietterich et al. organized an ICML workshop on Machine Learning for Spatial and Temporal Data (http://web.engr.oregonstate.edu/~tgd/ml2001-workshop/), which pointed out several fundamental challenges of spatiotemporal data, such as i.i.d. assumptions and window-based mapping. In 1994, the Santa Fe Institute held a time series prediction competition and documented a wide range of prediction techniques [Weigend, 1994]. Over the last decade, many general structured prediction tools were developed, including graphical models [Wainwright et al., 2008], maximum margin methods [Tsochantaridis et al., 2005], and deep neural networks [LeCun et al., 2015]. Recent successes show the promise of combining classical structured prediction algorithms with deep learning (e.g., [Zheng et al., 2015, Krishnan et al., 2015, Wilson et al.]). However, deep learning models suffer from a lack of interpretability and are hard to analyze. To date, the problem of learning from high-dimensional, continuous-space-and-time, multi-resolution spatiotemporal data has not been fully addressed.

Imposing sparsity in models for high-dimensional time series has received considerable attention from the statistics community [Han and Liu, 2013, Basu et al., 2015, Yu et al., 2016]. One key difference between high-dimensional time series and traditional high-dimensional statistics is the need to guarantee stationarity of the underlying process. For example, [Han and Liu, 2013] proposes a linear-programming-based algorithm to estimate the transition matrix of a vector autoregressive (VAR) model. [Basu et al., 2015] studies the stability of the stochastic process with respect to l1 regularization. [Yu et al., 2016] proposes matrix factorization techniques for high-dimensional time series via a novel autoregressive temporal regularizer. But the above research focuses merely on the temporal aspect, without considering the spatial process. This thesis can be seen as a generalized setting of high-dimensional time series prediction across multiple locations. The model we consider is a set of correlated VAR models representing a high-order Gaussian graphical model; the parameters of the model naturally form a high-order tensor, hence tensor regression. Most recently, [Qiu et al., 2016] also considers this general setting; however, its focus has been on obtaining better statistical guarantees, and the scalability of time series and spatiotemporal models remains open.

Figure 2.1: Timeline of spatial statistics and tensor analysis.
- 1950: Moran - Moran's I statistic, a global measure of spatial autocorrelation.
- 1966: Levin, Tucker - multilinear decompositions for three-way data.
- 1970: Carroll and Chang - CANDECOMP; Harshman - PARAFAC; approximate linear rank models for three-way data.
- 1976: Kruskal - rank and uniqueness; generalizes the fundamental concept of rank to multiway data.
- 1992: Getis and Ord - Getis-Ord general G statistic, a global measure of the degree of clustering for either high or low values.
- 1997: Brunsdon et al. - geographically weighted regression; kernel density estimation for location-specific regression parameters.
- 2000: De Lathauwer, De Moor, Vandewalle - N-mode SVD and ALS dimensionality reduction algorithms for N-way data; coin the terms multilinear SVD and HOSVD.
- 2001: Kolda - the Eckart-Young SVD result cannot be extended to higher-order tensors; the optimal rank-R decomposition for a general class of tensors cannot be computed by sequentially extracting rank-1 approximations in a greedy way.
- 2002: Zhang and Golub - the optimal rank-R approximation for orthogonally decomposable tensors is equivalent to sequentially computing the best rank-1 approximation in a greedy way.
- 2005: Vasilescu and Terzopoulos - multilinear ICA.
- 2008: Cressie - fixed rank kriging, rank-reduced regression for spatiotemporal analysis; Banerjee et al. - Gaussian predictive process model.
- 2013: Gandy et al. - low-n-rank tensor completion, generalizing compressive sensing to tensor completion; Datta et al. - hierarchical nearest-neighbor Gaussian process model.
- 2014: Romera-Paredes et al. - multilinear multitask learning; Yu et al. - low-rank tensor regression.

2.3 Notations and Preliminary

Due to our intensive usage of scalars, vectors, matrices and tensors, we use the following notation (adapted from [Kolda and Bader, 2009]): throughout, we use calligraphic font for tensors, such as A, B; bold uppercase letters for matrices, such as A, B; bold lowercase letters for vectors, such as a, b; and lowercase letters for scalars, such as a, b, c. Without loss of generality, we assume that all tensors are third-order for the rest of this thesis.

Rank-R Projection. For any matrix M, let pi(M; R) be the projection of M onto its top-R spectral spaces. Given the singular value decomposition (SVD) M = U Sigma V^T, the projection is pi(M; R) = U_R Sigma_R V_R^T. The rank R may be omitted when the context is clear.

Tensor Unfolding. Each dimension of a tensor is a mode. The mode-n unfolding of a tensor A transforms the tensor into a matrix A_(n) by treating mode n as the first mode of the matrix and cyclically concatenating the other modes; for a third-order tensor, [A_(1)]_{i, j+(k-1)J} = A_{i,j,k}. It is also known as tensor matricization.

N-Mode Product. The n-mode product between a tensor A and a matrix U on mode i is written A x_i U and defined by (A x_i U)_(i) = U A_(i).

Tucker Decomposition. The Tucker decomposition factorizes a tensor A as A = S x_1 U_1 ... x_n U_n, where the U_i are orthonormal matrices and each mode-i unfolding S_(i) of the core tensor is row-wise orthogonal, for i = 1, 2, ..., n. The Tucker decomposition can be computed by SVDs of all the unfoldings of the tensor; it is also known as the higher-order singular value decomposition (HOSVD) [De Lathauwer et al., 2000a].

Figure 2.2: Tucker decomposition and Tucker rank.

Figure 2.3: CP decomposition and CP rank.

CP Decomposition. The CP decomposition of a tensor X is defined as
\[ \mathcal{X} = [\![\mathbf{A}, \mathbf{B}, \mathbf{C}]\!] = \sum_{r=1}^{R} \lambda_r \, \mathbf{a}_r \circ \mathbf{b}_r \circ \mathbf{c}_r, \qquad (2.1) \]
where R is the rank and a_r, b_r, c_r are the columns of A, B, C, the factor matrices of X.

Partially Observed Tensor. For a tensor X and an index set Omega of observed entries, X_Omega is a tensor of the same size as X with
\[ [\mathcal{X}_\Omega]_i = \begin{cases} \mathcal{X}_i, & i \in \Omega \\ 0, & \text{otherwise.} \end{cases} \qquad (2.2) \]
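To make these operations concrete, below is a minimal NumPy sketch (an illustrative implementation written for this text, not code from the thesis) of mode-n unfolding, the n-mode product, and the resulting n-ranks for a small third-order tensor; the function names are our own.

import numpy as np

def unfold(tensor, mode):
    # Mode-n unfolding: the chosen mode becomes the rows; the remaining modes
    # are concatenated into the columns.
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def refold(matrix, mode, shape):
    # Inverse of `unfold` for a tensor of the given shape.
    full_shape = [shape[mode]] + [s for i, s in enumerate(shape) if i != mode]
    return np.moveaxis(matrix.reshape(full_shape), 0, mode)

def mode_product(tensor, matrix, mode):
    # n-mode product A x_mode U, defined by (A x_mode U)_(mode) = U A_(mode).
    shape = list(tensor.shape)
    shape[mode] = matrix.shape[0]
    return refold(matrix @ unfold(tensor, mode), mode, shape)

def n_ranks(tensor):
    # n-rank: the matrix rank of each mode-n unfolding.
    return [np.linalg.matrix_rank(unfold(tensor, m)) for m in range(tensor.ndim)]

# Example: a random 4 x 5 x 3 tensor (e.g., locations x time x variables).
A = np.random.randn(4, 5, 3)
U = np.random.randn(2, 4)
B = mode_product(A, U, mode=0)   # shape (2, 5, 3)
print(n_ranks(A), B.shape)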
Chapter 3: Spatiotemporal Analysis as Tensor Learning

We outline several critical tasks in spatiotemporal analysis and formulate them in a unified tensor learning framework. Tensor representation provides a convenient way to capture inter-dependencies along multiple dimensions.

3.1 Practical Problems

3.1.1 Cokriging

In spatial statistics, kriging is the task of interpolating data at unknown locations using observations from known locations. Cokriging, or coregionalization [Zhang, 2007], is a multivariate generalization of kriging. While kriging utilizes spatial autocorrelations for interpolation, cokriging further utilizes the dependencies between variables. For example, by making use of the correlations between precipitation and temperature, we can obtain a more precise estimate of temperature at unknown locations than with univariate kriging. Oftentimes, kriging and cokriging use Gaussian random fields and design parametric forms of the cross variogram to describe the intrinsic dependency structure. But manually designed variograms impose strong assumptions on the correlations in the data. Moreover, choosing a valid variogram function requires domain knowledge and manual work. Most importantly, parameter learning for these statistical models is computationally expensive, making them infeasible for large-scale spatiotemporal applications [Bonilla et al., 2007].

3.1.2 Forecasting

Forecasting estimates the future values of a multivariate time series given historical observations. Popular methods in multivariate time series analysis include vector autoregressive (VAR) models, autoregressive integrated moving average (ARIMA) models, and cointegration models. A Bayesian alternative for spatiotemporal analysis is the Bayesian hierarchical spatiotemporal model [Cressie and Huang, 1999], which generalizes the linear state-space model (Kalman filter) to include latent states for both space and time. Depending on the model assumptions on the separability of these latent states, hierarchical spatiotemporal models consider either separable or non-separable space-time covariance functions. To address the scalability issue, [Anderson, 1951] proposes rank-reduced models in order to capture the inter-dependency among variables, and [Cressie et al., 2010] studies a similar fixed-rank model. However, these models only assume low-rankness in the spatial domain and do not model the correlations among variables, space, and time simultaneously.

3.1.3 Multi-model Ensemble

The multi-model ensemble problem arises in climate modeling. In the past decades, numerous climate models have been developed to generate large simulation data sets of future climate projections [Tebaldi and Knutti, 2007]. Sophisticated physical models share similar representations of the ocean-atmosphere and land-ice processes but have different parameter uncertainty levels. Learning the correlation between model outputs and actual observations can help quantify uncertainty in climate models and prompt the design of more accurate models. The multi-model ensemble task seeks to learn such correlations: it aims to combine climate model outputs into a more accurate description of the observations. Classic methods such as model coupling [Van den Berge et al., 2011] have been used. However, the coupling coefficients require expert knowledge, and the parameter learning of those methods is computationally expensive.

Though cokriging, forecasting, and multi-model ensemble appear to be different tasks in spatiotemporal analysis, we will show that all of these problems can be formulated within the framework of low-rank tensor learning.

3.2 Tensor Formulations

Motivated by the nature of spatiotemporal data, we incorporate two design principles in the low-rank tensor learning framework: (1) global consistency: data on the same structure are likely to be similar; (2) local consistency: data at close locations are likely to be similar. The former principle is akin to the cluster assumption in semi-supervised learning [Zhou et al., 2003].

Global Consistency. To achieve global consistency, we constrain the tensor W to be low-rank. The low-rank assumption is based on the belief that high correlations exist within variables, locations, and time, which leads to natural clustering of the data. Existing literature has explored the low-rank structure along these three dimensions separately, e.g., multi-task learning [Nie et al., 2010] for variable correlations and fixed rank kriging [Cressie and Johannesson, 2008] for spatial correlations. Low-rankness assumes that the observed data can be described with a few latent factors. It enforces commonalities along the three dimensions without requiring an explicit form for the shared structure in each dimension.

Local Consistency. For local consistency, we construct a regularizer via the spatial Laplacian matrix. The Laplacian matrix is defined as L = D - A, where A is a kernel matrix constructed from pairwise similarities and D is the diagonal matrix with D_{i,i} = sum_j A_{i,j}. Similar ideas have been explored in matrix completion [Li and Yeung, 2009]. In the cokriging literature, local consistency is enforced via the spatial covariance matrix: Bayesian models often impose a Gaussian process prior on the observations with covariance matrix K = K_v (x) K_x, where K_v is the covariance between variables and K_x is the covariance between locations. The Laplacian regularization term corresponds to the relational Gaussian process [Chu et al., 2006], where the covariance matrix is approximated by the spatial Laplacian.
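As a concrete illustration of the local-consistency term, the following NumPy sketch (our own illustrative code, with an assumed Gaussian kernel and made-up coordinates) builds the pairwise similarity matrix A from station coordinates and forms the Laplacian L = D - A used in the penalty tr(W_m^T L W_m).

import numpy as np

def spatial_laplacian(coords, length_scale=1.0):
    # coords: (P, 2) array of station coordinates (e.g., longitude/latitude).
    diff = coords[:, None, :] - coords[None, :, :]
    sq_dist = (diff ** 2).sum(-1)
    # Gaussian (RBF) kernel as the pairwise similarity matrix A.
    A = np.exp(-sq_dist / (2.0 * length_scale ** 2))
    np.fill_diagonal(A, 0.0)           # no self-similarity
    D = np.diag(A.sum(axis=1))         # degree matrix D_ii = sum_j A_ij
    return D - A                       # graph Laplacian L = D - A

# Example: the Laplacian penalty for one variable's parameter slice W_m (P x Q).
coords = np.random.rand(20, 2)         # hypothetical station coordinates
L = spatial_laplacian(coords, length_scale=0.3)
W_m = np.random.randn(20, 8)
penalty = np.trace(W_m.T @ L @ W_m)    # small when nearby locations have similar rows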
3.2.1 Cokriging Formulation

Formally, denote the complete data for P locations over T time stamps with M variables as X in R^{P x T x M}. We only observe measurements for a subset of locations Omega, a subset of {1,...,P}, together with their side information such as longitude and latitude. Given the measurements X_Omega and the side information, the goal is to estimate a tensor W in R^{P x T x M} that satisfies W_Omega = X_Omega. Here X_Omega represents the outcome of applying the index operator I_Omega to X_{:,:,m} for all variables m = 1,...,M; the index operator I_Omega is a diagonal matrix whose entries are one for the locations included in Omega and zero otherwise. In summary, we can perform cokriging and find the value of the tensor W by solving the following optimization problem:
\[ \widehat{\mathcal{W}} = \arg\min_{\mathcal{W}} \Big\{ \|\mathcal{W}_\Omega - \mathcal{X}_\Omega\|_F^2 + \mu \sum_{m=1}^{M} \mathrm{tr}\big(\mathcal{W}_{:,:,m}^\top \mathbf{L}\, \mathcal{W}_{:,:,m}\big) \Big\} \quad \text{s.t.} \quad \mathrm{rank}(\mathcal{W}) \le \rho, \qquad (3.1) \]
where the Frobenius norm of a tensor A is defined as ||A||_F = (sum_{i,j,k} A_{i,j,k}^2)^{1/2}, and mu, rho > 0 are the parameters that trade off the local and global consistency, respectively. The low-rank constraint finds the principal components of the tensor and reduces the complexity of the model, while the Laplacian regularizer clusters the data using the relational information among the locations. By learning the right trade-off between these two techniques, our method is able to benefit from both. Due to the various definitions of tensor rank, we use rank(W) as a stand-in for the rank complexity, which will be specified in a later section.

3.2.2 Forecasting Formulation

For ease of presentation, we use the classical VAR model with K lags and coefficient tensor W in R^{P x KP x M} as an example. Using the matrix representation, the VAR(K) process defines the following data generation process:
\[ \mathcal{X}_{:,t,m} = \mathcal{W}_{:,:,m}\, \overline{X}_{t,m} + \mathcal{E}_{:,t,m}, \quad \text{for } m = 1,\ldots,M \text{ and } t = K+1,\ldots,T, \]
where the lag-concatenated predictor \overline{X}_{t,m} = [X_{:,t-1,m}^T, ..., X_{:,t-K,m}^T]^T denotes the concatenation of the K-lag historical data before time t. The noise tensor E is a multivariate Gaussian random variable with zero mean.

Similar global and local consistency principles still hold in forecasting. For global consistency, we can use the low-rank constraint to capture the commonalities among the variables as well as the spatial correlations in the model parameter tensor, as in [Cressie et al., 2010]. For local consistency, we enforce the predicted values for close locations to be similar via spatial Laplacian regularization. Thus, we can formulate the forecasting task as the following optimization problem over the model coefficient tensor W:
\[ \widehat{\mathcal{W}} = \arg\min_{\mathcal{W}} \Big\{ \|\widehat{\mathcal{X}} - \mathcal{X}\|_F^2 + \mu \sum_{m=1}^{M} \mathrm{tr}\big(\widehat{\mathcal{X}}_{:,:,m}^\top \mathbf{L}\, \widehat{\mathcal{X}}_{:,:,m}\big) \Big\} \quad \text{s.t.} \quad \widehat{\mathcal{X}}_{:,t,m} = \mathcal{W}_{:,:,m} \overline{X}_{t,m}, \;\; \mathrm{rank}(\mathcal{W}) \le \rho. \qquad (3.2) \]
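To see how the VAR(K) model maps onto this regression form, the sketch below (illustrative NumPy code written for this text, not from the thesis) builds the lag-concatenated predictors for one variable and fits the corresponding slice W_m by plain least squares, ignoring the low-rank and Laplacian terms for simplicity.

import numpy as np

def build_var_design(X_m, K):
    # X_m: (P, T) observations of one variable at P locations over T time stamps.
    # Returns predictors Z of shape (K*P, T-K), whose column for target time t is
    # [X_{:,t-1}; ...; X_{:,t-K}], and targets Y of shape (P, T-K).
    P, T = X_m.shape
    Z = np.vstack([X_m[:, K - k - 1 : T - k - 1] for k in range(K)])
    Y = X_m[:, K:]
    return Z, Y

# Hypothetical data: 10 locations, 200 time stamps, 3 lags.
P, T, K = 10, 200, 3
X_m = np.random.randn(P, T)
Z, Y = build_var_design(X_m, K)

# Unregularized least-squares fit of W_m (P x KP): Y ~ W_m Z.
W_m = Y @ np.linalg.pinv(Z)
print(W_m.shape)   # (10, 30)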
3.2.3 Multi-model Ensemble Formulation

Suppose we have gathered the simulation outputs of S models for M climate variables at P locations over a time period T. At the same time, we have access to the actual observations of the same variables, locations, and times. As in the forecasting setting, we represent the observations with a three-mode tensor X in R^{P x T x M}, and we encode the model outputs with a four-mode tensor Y in R^{P x T x M x S}. The model outputs serve as "experts" for climate prediction, and incorporating the experts' advice can reduce the uncertainty of the forecasts. As opposed to the forecasting task, we focus only on the current-time-stamp correlation between model outputs and observations. We start with a simple linear model X_hat_{:,t,m} = W_{:,:,m} \overline{Y}_{t,m}, where \overline{Y}_{t,m} = [Y_{:,t,m,1}^T, ..., Y_{:,t,m,S}^T]^T denotes the concatenation of the S model outputs at time t for variable m, and W in R^{P x PS x M} characterizes the "importance" of the various models in climate prediction. We formulate the multi-model ensemble task as follows:
\[ \widehat{\mathcal{W}} = \arg\min_{\mathcal{W}} \Big\{ \|\widehat{\mathcal{X}} - \mathcal{X}\|_F^2 + \mu \sum_{m=1}^{M} \mathrm{tr}\big(\widehat{\mathcal{X}}_{:,:,m}^\top \mathbf{L}\, \widehat{\mathcal{X}}_{:,:,m}\big) \Big\} \quad \text{s.t.} \quad \widehat{\mathcal{X}}_{:,t,m} = \mathcal{W}_{:,:,m} \overline{Y}_{t,m}, \;\; \sum_{n=1}^{N} \mathrm{rank}(\mathcal{W}_{(n)}) \le R, \qquad (3.3) \]
where the Laplacian matrix L serves a similar role as in the forecasting task, accounting for the spatial proximity of the observations.

3.2.4 Unified Formulation

We now show that cokriging, forecasting, and multi-model ensemble can be formulated in the same tensor learning framework. Let us rewrite the loss functions in Eq. (3.1) and Eq. (3.2) in the form of multi-task regression and complete the quadratic form of the loss. The cokriging task can be reformulated as
\[ \widehat{\mathcal{W}} = \arg\min_{\mathcal{W}} \Big\{ \sum_{m=1}^{M} \|\mathbf{H}\, \mathcal{W}_{:,:,m} - (\mathbf{H}^\top)^{-1} \mathcal{X}_{\Omega,:,m}\|_F^2 \Big\} \quad \text{s.t.} \quad \mathrm{rank}(\mathcal{W}) \le \rho, \qquad (3.4) \]
where we define H H^T = I_Omega + mu L. (We can use the Cholesky decomposition to obtain H; in the rare case that I_Omega + mu L is not full rank, epsilon I_P is added, where epsilon is a very small positive value.) For the forecasting problem, H H^T = I_P + mu L and we have
\[ \widehat{\mathcal{W}} = \arg\min_{\mathcal{W}} \Big\{ \sum_{m=1}^{M} \sum_{t=K+1}^{T} \|\mathbf{H}\, \mathcal{W}_{:,:,m} \overline{X}_{t,m} - \mathbf{H}^{-1} \mathcal{X}_{:,t,m}\|_F^2 \Big\} \quad \text{s.t.} \quad \mathrm{rank}(\mathcal{W}) \le \rho. \qquad (3.5) \]
By a slight change of notation (cf. Appendix A.0.5), we can see that the optimal solution of both problems can be obtained from the following optimization problem with an appropriate choice of tensors Y and V:
\[ \widehat{\mathcal{W}} = \arg\min_{\mathcal{W}} \Big\{ \sum_{m=1}^{M} \|\mathcal{W}_{:,:,m} \mathcal{Y}_{:,:,m} - \mathcal{V}_{:,:,m}\|_F^2 \Big\} \quad \text{s.t.} \quad \mathrm{rank}(\mathcal{W}) \le \rho. \qquad (3.6) \]
Define H as the Cholesky factor of I_P + mu L, i.e., H H^T = I_P + mu L. Since H is full rank, the mapping W -> W~ with W~_{:,:,m} = H W_{:,:,m} for m = 1,...,M preserves the tensor rank, i.e., rank(W) = rank(W~). We can then rewrite the loss functions in Equations 3.2 and 3.3 as
\[ \sum_{m=1}^{M} \sum_{t=K+1}^{T} \|\mathbf{H}\, \mathcal{W}_{:,:,m} \overline{X}_{t,m} - \mathbf{H}^{-1} \mathcal{X}_{:,t,m}\|_F^2. \]
This suggests that we can first solve the quadratic loss sum_m ||W_{:,:,m} Y_{:,:,m} - V_{:,:,m}||_F^2 with Y_{:,:,m} = \overline{X}_{K+1:T,m} and V_{:,:,m} = H^{-1} X_{:,:,m}, obtain its solution W~, and then compute W_{:,:,m} = H^{-1} W~_{:,:,m}. With this change of variables, Equations 3.1, 3.2 and 3.3 can all be reduced to the following low-rank tensor learning task: given a predictor tensor Z in R^{Q x T x M} and a response tensor X in R^{P x T x M}, learn a model parameter tensor W in R^{P x Q x M} whose rank is upper bounded by R:
\[ \widehat{\mathcal{W}} = \arg\min_{\mathcal{W}} \Big\{ \sum_{t,m} \|\mathcal{W}_{:,:,m} \mathcal{Z}_{:,t,m} - \mathcal{X}_{:,t,m}\|_F^2 \Big\} \quad \text{s.t.} \quad \mathrm{rank}(\mathcal{W}) \le R, \qquad (3.7) \]
where a single colon is used to index an entire row or column. We note that tensor rank has different notions, such as CP rank, Tucker rank, and mode n-rank [Kolda and Bader, 2009, Gandy et al., 2011]. In this thesis, we choose the sum n-rank, which is computationally more tractable [Hillar and Lim, 2013, De Lathauwer et al., 2000a]. The n-rank of a tensor W is the rank of its mode-n unfolding W_(n). (The mode-n unfolding of a tensor is the matrix resulting from treating mode n as the first mode of the matrix and cyclically concatenating the other modes; tensor refolding is the reverse operation [Kolda and Bader, 2009].) In particular, for a tensor W with N modes, we have the following definition:
\[ \text{sum n-rank}(\mathcal{W}) = \sum_{n=1}^{N} \mathrm{rank}(\mathcal{W}_{(n)}). \qquad (3.8) \]
A common practice for solving this formulation with the sum n-rank constraint is to relax the rank constraint to a convex nuclear-norm constraint [Gandy et al., 2011, Tomioka et al., 2013]. However, those methods are computationally expensive, since they need the full singular value decomposition of large matrices. In the following chapters, we present a series of fast algorithms to tackle the problem.
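A minimal sketch (our own illustrative NumPy code, not the thesis's implementation) of the change of variables behind the unified formulation: form H from the Cholesky factorization of I + mu*L, solve the plain least-squares problem in the transformed variable, and map back with H^{-1}. The tiny problem sizes and the chain-graph Laplacian are assumptions for illustration.

import numpy as np

def solve_unified_slice(Z_m, X_m, L, mu=0.1):
    # Z_m: (Q, T) predictors, X_m: (P, T) responses for one variable m,
    # L: (P, P) spatial Laplacian.
    P = X_m.shape[0]
    H = np.linalg.cholesky(np.eye(P) + mu * L)   # H H^T = I + mu * L
    V_m = np.linalg.solve(H, X_m)                # V_m = H^{-1} X_m
    W_tilde = V_m @ np.linalg.pinv(Z_m)          # least squares: W~ Z_m ~ V_m
    return np.linalg.solve(H, W_tilde)           # map back: W_m = H^{-1} W~

# Example with hypothetical sizes and a chain-graph Laplacian.
P, Q, T = 6, 4, 50
A = np.eye(P, k=1) + np.eye(P, k=-1)             # chain adjacency
L = np.diag(A.sum(axis=1)) - A                   # graph Laplacian
Z_m = np.random.randn(Q, T)
X_m = np.random.randn(P, T)
W_m = solve_unified_slice(Z_m, X_m, L)
print(W_m.shape)   # (6, 4)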
3.2.5 Temporal Regularization

Local consistency enforces the predictions at adjacent locations to be similar using the graph Laplacian; this regularizer captures spatial dependencies. For temporal dependency, we only use a VAR time series model in our formulation. One might ask: is there a similar regularization for temporal dependencies, such as short-term smoothness? We can design the following regularizer to enforce smoothness in time:
\[ \sum_t \sum_{k=1}^{K} \mathcal{W}_{t,k} \, \|\mathcal{X}_t - \mathcal{X}_{t-k}\|^2 + \lambda \sum_t \|\mathcal{X}_t\|^2, \qquad (3.9) \]
where W denotes the coefficients for a single variable W_{:,:,m} and lambda > 0 weights the second term. It turns out that the above graph-based regularization is equivalent to vector-autoregressive-based regularization [Yu et al., 2016]; our VAR model plays a role similar to graph regularization in obtaining temporal smoothness.
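The temporal regularizer in Eq. (3.9) can be evaluated directly. Below is a small NumPy sketch (illustrative only; for simplicity it assumes lag weights that do not vary with t) computing it for a single variable's P x T slice.

import numpy as np

def temporal_smoothness(X_t, lag_weights, lam=0.1):
    # X_t: (P, T) series for one variable; lag_weights: length-K nonnegative
    # weights for the lagged differences; lam: weight of the second term.
    P, T = X_t.shape
    K = len(lag_weights)
    penalty = 0.0
    for t in range(K, T):
        for k in range(1, K + 1):
            diff = X_t[:, t] - X_t[:, t - k]
            penalty += lag_weights[k - 1] * np.sum(diff ** 2)
    penalty += lam * np.sum(X_t ** 2)
    return penalty

X_t = np.random.randn(10, 100)
print(temporal_smoothness(X_t, lag_weights=np.ones(3) / 3.0))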
6 California Department of Transportation PERMS: California trac data collected in real-time from over 39,000 individual detectors. 7 5 ://www.ncdc.noaa.gov/data-access/land-based-station-data/land-based-datasets/ global-historical-climatology-network-ghcn 6 https://lpdaac.usgs.gov/dataset_discovery/aster 7 http://pems.dot.ca.gov/ 27 Chapter 4 Fast Batch Learning 4.1 Introduction spatiotemporal data provide unique information regarding \where" and \when", which is essential to answer many important questions in scientic studies from geology, cli- matology to sociology. In the context of big data, we are confronted with a series of new challenges when analyzing spatiotemporal data because of the complex spatial and temporal dependencies involved. A plethora of excellent work has been conducted to address the challenge and achieved successes to a certain extent [Cressie et al., 2010, Isaaks and Srivastava, 2011]. Often times, geostatistical models use cross variogram and cross covariance functions to describe the intrinsic dependency structure. However, the parametric form of cross variogram and cross covariance functions impose strong assumptions on the spatial and temporal correlation, which requires domain knowledge and manual work. Furthermore, parameter learning of those statistical models is computationally expensive, making them infeasible for large-scale applications. 28 Cokriging and forecasting are two central tasks in multivariate spatiotemporal anal- ysis. Cokriging utilizes the spatial correlations to predict the value of the variables for new locations. One widely adopted method is multi-task Gaussian process (MTGP) [Bonilla et al., 2007], which assumes a Gaussian process prior over latent functions to directly induce correlations between tasks. However, for a cokriging task with M vari- ables of P locations for T time stamps, the time complexity of MTGP isO(M 3 P 3 T ) [Bonilla et al., 2007]. For forecasting, popular methods in multivariate time series analy- sis include vector autoregressive (VAR) models, autoregressive integrated moving average (ARIMA) models, and cointegration models. An alternative method for spatiotemporal analysis is Bayesian hierarchical spatiotemporal models with either separable and non- separable space-time covariance functions [Cressie and Huang, 1999]. Rank reduced mod- els have been proposed to capture the inter-dependency among variables [Anderson, 1951]. However, very few models can directly handle the correlations among variables, space and time simultaneously in a scalable way. In this paper, we aim to address this problem by presenting a unied framework for many spatiotemporal analysis tasks that are scalable for large-scale applications. Tensor representation provides a convenient way to capture inter-dependencies along multiple dimensions. Therefore, it is natural to represent the multivariate spatiotempo- ral data in tensor. Recent advances in low-rank learning have led to simple models that can capture the commonalities among each mode of the tensor [Kolda and Bader, 2009, Romera-Paredes et al., 2013]. A similar argument can be found in the literature of spatial data recovery [Gandy et al., 2011], neuro-imaging analysis [Zhou et al., 2013], and multi- task learning [Romera-Paredes et al., 2013]. Our work builds upon recent advances in 29 low-rank tensor learning [Kolda and Bader, 2009, Gandy et al., 2011, Zhou et al., 2013] and further considers the scenario where additional side information of data is available. 
For example, in geospatial applications, apart from measurements of multiple variables, geographical information is available to infer location adjacency; in social network ap- plications, friendship network structure is collected to obtain preference similarity. To utilize the side information, we can construct a Laplacian regularizer from the similarity matrices, which favors locally smooth solutions. We develop a fast greedy algorithm for learning low-rank tensors based on the greedy structure learning framework [Barron et al., 2008, Zhang, 2011, Shalev-Shwartz et al., 2011]. Greedy low-rank tensor learning is ecient, as it does not require full singular value de- composition of large matrices as opposed to other alternating direction methods [Gandy et al., 2011]. We also provide a bound on the dierence between the loss function at our greedy solution and the one at the globally optimal solution. Finally, we present experiment results on simulation datasets as well as application datasets in climate and social network analysis, which show that our algorithm is faster and achieves higher prediction accuracy than state-of-art approaches in cokriging and forecasting tasks. 4.2 Methodology To solve the non-convex problem in Eq. (3.7) and nd its optimal solution, we propose a greedy learning algorithm by successively adding rank-1 estimation of the mode-n un- folding. The main idea of the algorithm is to unfold the tensor into a matrix, seek for its rank-1 approximation and then fold back into a tensor with the same dimensionality. 30 We describe this algorithm in three steps: (i) First, we show that we can learn rank-1 matrix estimations eciently by solving a generalized eigenvalue problem, (ii) We use the rank-1 matrix estimation to greedily solve the original tensor rank constrained problem, and (iii) We propose an enhancement via orthogonal projections after each greedy step. Optimal rank-1 Matrix Learning The following lemma enables us to nd such op- timal rank-1 estimation of the matrices. Lemma 4.2.1. Consider the following rank constrained problem: b A 1 = argmin A:rank(A)=1 n kYAXk 2 F o ; (4.1) where Y 2R qn , X2R pn , and A2R qp . The optimal solution of b A 1 can be written as b A 1 =b ub v > ,kb vk 2 = 1 whereb v is the dominant eigenvector of the following generalized eigenvalue problem: (XY > YX > )v =(XX > )v (4.2) and b u can be computed as b u = 1 b v > XX > b v YX > b v: (4.3) The lemma can be found in e.g. [Baldi and Hornik, 1989] and we also provide a proof in Appendix. Eq. (4.2) is a generalized eigenvalue problem whose dominant eigenvector can be found eciently [Gu et al., 2000]. If XX > is full rank, as assumed in Theorem 31 4.2.2, the problem is simplied to a regular eigenvalue problem whose dominant eigen- vector can be eciently computed. Greedy Low n-rank Tensor Learning The optimal rank-1 matrix learning serves as a basic element in our greedy algorithm. Using Lemma 4.2.1, we can solve the problem in Eq. (3.7) in the Forward Greedy Selection framework as follows: at each iteration of the greedy algorithm, it searches for the mode that gives the largest decrease in the objective function. It does so by unfolding the tensor in that mode and nding the best rank-1 estimation of the unfolded tensor. After nding the optimal mode, it adds the rank-1 estimate in that mode to the current estimation of the tensor. Algorithm 1 shows the details of this approach, whereL(W;Y;V) = P M m=1 kW :;:;m Y :;:;m V :;:;m k 2 F . 
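Each greedy step reduces to the rank-1 subproblem of Lemma 4.2.1. The sketch below solves that subproblem with SciPy's generalized symmetric eigensolver; the choice of scipy.linalg.eigh and the toy dimensions are illustrative assumptions (any dominant-eigenvector routine would do), and XX^T is assumed full rank as in Theorem 4.2.2.

```python
import numpy as np
from scipy.linalg import eigh

def optimal_rank1(Y, X):
    """Best rank-1 A = u v^T (||v|| = 1) minimizing ||Y - A X||_F^2 (Lemma 4.2.1).

    Solves the generalized eigenproblem (X Y^T Y X^T) v = lambda (X X^T) v for the
    dominant eigenvector, then recovers u in closed form as in Eq. (4.3).
    """
    XYt = X @ Y.T                       # p x q
    lhs = XYt @ XYt.T                   # X Y^T Y X^T, p x p, symmetric PSD
    rhs = X @ X.T                       # X X^T, p x p, assumed positive definite
    _, vecs = eigh(lhs, rhs)            # eigenvalues ascending; last column is dominant
    v = vecs[:, -1]
    v = v / np.linalg.norm(v)           # enforce ||v|| = 1
    u = (Y @ X.T @ v) / (v @ rhs @ v)   # closed-form u
    return np.outer(u, v)               # q x p rank-1 estimate

# toy usage: q = 4 responses, p = 6 predictors, n = 50 samples
X = np.random.randn(6, 50)
A_true = np.outer(np.random.randn(4), np.random.randn(6))
Y = A_true @ X + 0.01 * np.random.randn(4, 50)
A1 = optimal_rank1(Y, X)
```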
Note that we can nd the optimal rank-1 solution in only one of the modes, but it is enough to guarantee the convergence of our greedy algorithm. Algorithm 1 Greedy Low-rank Tensor Learning Input: transformed dataY;V of M variables, stopping criteria Output: N mode tensorW InitializeW 0 repeat for n = 1 to N do B n argmin B: rank(B)=1 L(refold(W (n) +B);Y; ) n L(W;Y;V)L(refold(W (n) +B n );Y;V) end for n argmax n f n g if n > then W W + refold(B n ;n ) end if W argmin row(A (1) )row(W (1) ) col(A (1) )col(W (1) ) L(A;Y;V) until n < 32 Theorem 4.2.2 bounds the dierence between the loss function evaluated at each iteration of the greedy algorithm and the one at the globally optimal solution. Theorem 4.2.2. Suppose in Eq. (3.7) the matricesY > :;:;m Y :;:;m for m = 1;:::;M are positive denite. The solution of Algo. 1 at its kth iteration step satises the following inequality: L(W k ;Y; )L(W ;Y; ) (kYk 2 kW (1) k ) 2 (k + 1) ; (4.4) whereW is the global minimizer of the problem in Eq. (3.7) andkYk 2 is the largest singular value of a block diagonal matrix created by placing the matricesY(:; :;m) on its diagonal blocks. The detailed proof is given in Appendix. The key idea of the proof is that the amount of decrease in the loss function by each step in the selected mode is not smaller than the amount of decrease if we had selected the rst mode. The theorem shows that we can obtain the same rate of convergence for learning low-rank tensors as achieved in [Shalev-Shwartz et al., 2010] for learning low-rank matrices. The greedy algorithm in Algorithm 1 is also connected to mixture regularization in [Tomioka et al., 2013]: the mixture approach decomposes the solution into a set of low-rank structures while the greedy algorithm successively learns a set of rank-one components. Greedy Algorithm with Orthogonal Projections It is well-known that the forward greedy algorithm may make steps in sub-optimal directions because of noise. A common solution to alleviate the eect of noise is to make orthogonal projections after each greedy step [Barron et al., 2008, Shalev-Shwartz et al., 2011]. Thus, we enhance the forward 33 greedy algorithm by projecting the solution into the space spanned by the singular vectors of its mode-1 unfolding. The greedy algorithm with orthogonal projections performs an extra step in line 13 of Algorithm 1: It nds the top k singular vectors of the solution: [U;S;V ] svd(W (1) ;k) where k is the iteration number. Then it nds the best solution in the space spanned byU andV by solving b S min S L(USV > ;Y; ) which has a closed form solution. Finally, it reconstructs the solution: W refold(U b SV > ; 1). Note that the projection only needs to nd topk singular vectors which can be computed eciently for small values of k. 4.3 Experiments We evaluate the ecacy of our algorithms on synthetic datasets and real-world application datasets. 4.3.1 Synthetic For empirical evaluation, we compare our method with multi-task learning (MTL) al- gorithms, which also utilize the commonalities between dierent prediction tasks for better performance. 
We use the following baselines: (1) Trace norm regularized MTL (Trace), which seeks the low-rank structure only on the task dimension; (2) Multi- linear MTL [Romera-Paredes et al., 2013], which adapts the convex relaxation of low- rank tensor learning solved with Alternating Direction Methods of Multiplier (ADMM ) [Gabay and Mercier, 1976] and Tucker decomposition to describe the low-rankness in mul- tiple dimensions; (3) MTL-L 1 , MTL-L 21 [Nie et al., 2010], and MTL-L Dirty [Jalali et al., 2010], 34 which investigate joint sparsity of the tasks with L p norm regularization. For MTL-L 1 , MTL-L 21 [Nie et al., 2010] and MTL-L Dirty , we use MALSAR Version 1.1 [Zhou et al., 2011]. We construct a model coecient tensorW of size 202010 with CP rank equals to 1. Then, we generate the observationsY andV according to multivariate regression model V :;:;m =W :;:;m Y :;:;m +E :;:;m form = 1;:::;M, whereE is tensor with zero mean Gaussian noise elements. We split the synthesized data into training and testing time series and vary the length of the training time series from 10 to 200. For each training length setting, we repeat the experiments for 10 times and select the model parameters via 5- fold cross validation. We measure the prediction performance via two criteria: parameter estimation accuracy and rank complexity. For accuracy, we calculate the RMSE of the estimation versus the true model coecient tensor. For rank complexity, we calculate the mixture rank complexity [Tomioka et al., 2013] as MRC = 1 n P N n=1 rank(W (n) ). 0 50 100 150 200 250 0.4 0.5 0.6 0.7 0.8 0.9 1 1.1 1.2 # of Samples Parameter Estimation RMSE Forward Orthogonal ADMM Trace MTL−L1 MTL−L21 MTL−Dirty (a) RMSE 0 50 100 150 200 −5 0 5 10 15 20 # of Samples Mixture Rank Complexity Forward Orthogonal ADMM Trace (b) Rank 10 1 10 2 0 200 400 600 800 1000 1200 # of Variables Run Time (Sec) Forward Greedy Orthogonal Greedy ADMM (c) Scalability Figure 4.1: Tensor estimation performance comparison on the synthetic dataset over 10 random runs. (a) Parameter estimation RMSE with training time series length, (b) Mixture Rank Complexity with training time series length, (c) running time for one single round with respect to the number of variables. The results are shown in Figure 4.1(a) and 4.1(b). We omit the Tucker decomposition as the results are not comparable. We can clearly see that the proposed greedy algorithm with orthogonal projections achieves the most accurate tensor estimation. In terms of rank complexity, we make two observations: (i) Given that the tensor CP rank is 1, 35 greedy algorithm with orthogonal projections produces the estimate with the lowest rank complexity. This can be attributed to the fact that the orthogonal projections eliminate the redundant rank-1 components that fall in the same spanned space. (ii) The rank complexity of the forward greedy algorithm increases as we enlarge the sample size. We believe that when there is a limited number of observations, most of the new rank-1 elements added to the estimate are not accurate and the cross-validation steps prevent them from being added to the model. However, as the sample size grows, the rank-1 estimates become more accurate and they are preserved during the cross-validation. To showcase the scalability of our algorithm, we vary the number of variables and generate a series of tensorW2R 2020M for M from 10 to 100 and record the running time (in seconds) for three tensor learning algorithms, i.e, forward greedy, greedy with orthogonal projections and ADMM. 
We measure the run time on a machine with a 6-core, 12-thread Intel Xeon 2.67GHz processor and 12GB memory. The results are shown in Figure 4.1(c). The running time of ADMM increases rapidly with the data size while that of the greedy algorithms stays nearly flat, which confirms the speed advantage of the greedy approach.

Table 4.1: Cokriging RMSE of 6 methods averaged over 10 runs. In each run, 10% of the locations are assumed missing.

Dataset        ADMM     Forward   Orthogonal   Simple    Ordinary   MTGP
USHCN-S        0.8051   0.7594    0.7210       0.8760    0.7803     1.0007
CCDS           0.8292   0.5555    0.4532       0.7634    0.7312     1.0296
Yelp           0.7730   0.6993    0.6958       NA        NA         NA
Foursquare-S   0.1373   0.1338    0.1334       NA        NA         NA

4.3.2 Cokriging

We compare the cokriging performance of our proposed method with classical cokriging approaches, including simple kriging and ordinary cokriging with the non-bias condition [Isaaks and Srivastava, 2011], which are applied to each variable separately. We further compare with the multi-task Gaussian process (MTGP) [Bonilla et al., 2007], which also considers the correlation among variables. We also adapt ADMM for solving the nuclear-norm-relaxed version of the cokriging formulation as a baseline (see Appendix for more details). For USHCN-S and CCDS, we construct the Laplacian matrix from the pairwise Haversine distances between locations. For Foursquare-S and Yelp, we construct the graph Laplacian from the user friendship network. For each dataset, we first normalize it by removing the trend and dividing by the standard deviation. We then randomly pick 10% of the locations (or users for Foursquare-S) and eliminate the measurements of all variables over the whole time span. We then produce estimates for all variables at each time stamp. We repeat this procedure ten times and report the prediction RMSE averaged over all time stamps and the 10 random sets of missing locations. We use the MATLAB Kriging Toolbox (http://globec.whoi.edu/software/kriging/V3/english.html) for the classical cokriging algorithms and the MTGP code provided by [Bonilla et al., 2007].

Table 4.1 shows the results for the cokriging task. The greedy algorithm with orthogonal projections is the most accurate method across all four datasets. The baseline cokriging methods can only handle two-dimensional longitude and latitude information, and thus are not applicable to the Foursquare-S and Yelp datasets, whose side information is a friendship network. The superior performance of the greedy algorithm can be attributed to two of
Dierent from the cokriging task, forecasting does not nec- essarily need the correlations of locations for prediction. One might raise the question as to whether the Laplacian regularizer helps. Therefore, we report the results for our formulation without Laplacian (ORTHONL) for comparison. For eciency, we report the running time (in seconds) in Table 4.3 for both tasks of cokriging and forecasting. Compared with ADMM, which is a competitive baseline also capturing the commonalities among variables, space, and time, our greedy algorithm is much faster for most datasets. Table 4.3: Running time (in seconds) for cokriging and forecasting. Cokriging Forecasting Dataset USHCN-S CCDS YELP FSQ-S USHCN-S CCDS FSQ-S ORTHO 93.03 16.98 78.47 91.51 75.47 21.38 37.70 ADMM 791.25 320.77 2928.37 720.40 235.73 45.62 33.83 38 Figure 4.2: Map of most predictive regions analyzed by the greedy algorithm using 17 variables of the CCDS dataset. Red color means high predictiveness whereas blue denotes low predictiveness. As a qualitative study, we plot the map of most predictive regions analyzed by the greedy algorithm using CCDS dataset in Fig. 2. Based on the concept of how informative the past values of the climate measurements in a specic location are in predicting future values of other time series, we dene the aggregate strength of predictiveness of each region as w(t) = P P p=1 P M m=1 jW p;t;m j. We can see that two regions are identied as the most predictive regions: (1) The southwest region, which re ects the impact of the Pacic ocean and (2) The southeast region, which frequently experiences relative sea level rise, hurricanes, and storm surge in Gulf of Mexico. Another interesting region lies in the center of Colorado, where the Rocky mountain valleys act as a funnel for the winds from the west, providing locally divergent wind patterns. 4.4 Conclusion In this paper, we study the problem of multivariate spatiotemporal data analysis with an emphasis on two tasks: cokriging and forecasting. We formulate the problem into a general low-rank tensor learning framework which captures both the global consistency and the local consistency principle. We develop a fast and accurate greedy solver with 39 theoretical guarantees for its convergence. We validate the correctness and eciency of our proposed method on both the synthetic dataset and real-application datasets. For future work, we are interested in investigating dierent forms of shared structure and extending the framework to capture non-linear correlations in the data. 40 Chapter 5 Accelerated Online Learning 5.1 Introduction Low-rank tensor learning enjoys a broad range of applications in practical machine learn- ing problems [Kolda and Bader, 2009], ranging from signal processing, computer vision, to neuroscience. One classical example is learning a low-rank tensor for multivariate re- gression, for which a series of eective batch learning algorithms have been developed [De Lathauwer et al., 2000a, Guo et al., 2012, Zhou et al., 2013, Bahadori et al., 2014]. We notice that in many emerging applications, large-scale tensor data come in streams, such as the spatiotemporal climate observations in climate data analysis. Batch learning algorithms would suer from computational bottleneck, especially facing the challenge of short response time. Therefore, eective and fast online learning algorithms are a must for enabling real-time large-scale tensor analysis. 
Online learning of low-rank tensors aims to dynamically update a tensor while preserv- ing the low-rank structure. While online low-rank matrix learning has been intensively studied, e.g. [Brand, 2002, Meka et al., 2008, Shalit et al., 2010], online tensor learning 41 remains under-explored. The problem is extremely challenging due to the inherent com- plexity of tensor analysis [Hillar and Lim, 2013]. Local solutions [Sun et al., 2008] have achieved wide success in real applications but lack rigorous theoretical understanding. For certain rank structures, we can use the nuclear norm as a convex surrogate for the rank constraint and solve the problem with o-the-shelf online low-rank matrix algorithms [Avron et al., 2012, Ouyang et al., 2013]. Nevertheless, it is known that optimizing over convex surrogate loss may lead to sub-optimal solutions [Zhang et al., 2013]. Moreover, solving an optimization problem with nuclear norm regularization itself is computation- ally expensive. In this paper, we develop a novel framework, namely the Accelerated Low-rank Tensor Online Learning (ALTO) algorithm to address the problem. Our solution follows a two- step procedure: we rst solve an unconstrained tensor learning problem, and then adjust the solution tensor to satisfy the low-rank constraint. ALTO signicantly accelerates the online learning process by keeping track of the low-rank components of the solution ob- tained at each iteration: It performs a dimension reduction of the tensor using previous low-rank components and tensor matrix multiplications, in order to avoid the expensive operations of singular value decomposition (SVD) on unfolded matrices of the full tensor. In addition, it employs randomization techniques to select additional dimensions so as to overcome the issue of local optima in existing incremental tensor learning algorithms [Sun et al., 2008]. Theoretical analysis shows that our randomization technique can sig- nicantly reduce the noise at a cost of very minor biases. As a side outcome, we also observe an interesting property: despite being non-convex, the low-rank space usually behaves like a convex set in its neighborhood. 42 We demonstrate the eectiveness of our algorithm via two machine learning tasks in spatiotemporal stream analysis: one is the classical forecasting problem (i.e., performing n-step ahead prediction from historical observations), and the other is the multi-model ensemble problem, a fundamental task in climatology to make predictions by combin- ing the forecasting results from multiple simulation models. In these applications, the data often exhibit unique properties, such as spatial proximity, temporal periodicity and variable correlations, which can be captured naturally via the low-rank constraint. We conduct experiments on both synthetic and real-world application datasets, including a Foursquare-S check-in dataset, a daily weather dataset and a climate ensemble dataset. Experimental results show that our algorithm can achieve competitive prediction accu- racy with signicant speed-up. In addition, the low-rank tensor parameters learned by our algorithm from the climate dataset provide interesting insights into the underlying relationships between simulations models and physical processes. 5.2 Related Work A plethora of excellent work have been conducted for analysis of multivariate spatiotempo- ral data streams. 
For online forecasting task, time series models such autoregressive (AR), and autoregressive moving average (ARMA) models fail to capture the complex shared structure of the spatiotemporal data. Classic state-space models [Cressie and Wikle, 2011] often require high-level domain knowledge and manual work to specify the parametric form of the covariance functions. For multi-model ensemble task, [Wiegerinck and Selten, 2011] learns a super model whose dynamics are a convex combination of the individual model 43 components. Unfortunately, learning the parameters of those statistical models is com- putationally expensive, making them infeasible for large-scale applications. Our work has connection to the common practice of imposing low-rank constraint to capture the task relatedness [Ando and Zhang, 2005, Argyriou et al., 2008]. However, the nature of multi-variate spatiotemporal requires us to capture the correlations not only between tasks (or features), but also between space and time. A recent study in multi-linear multi-task learning [Romera-Paredes et al., 2013] describes the multi-linear commonality of the data with low-rank tensor. They consider Tucker and PARAFAC tensor decomposition in the batch setting. They use alternating minimization method for tensor learning, which converges slow in practice and easily yields local optima. Another line of work in online multi-task learning [?, Cavallanti et al., 2010, Saha et al., 2011] considers a dierent setting where data points from dierent tasks arrive one-at-a-time adversarially while in our setting, the data from multiple tasks all arrive at the same time. 5.3 Methodology We propose the Accelerated Low-rank Tensor Online Learning (ALTO) algorithm, which solves the problem via a simple two-step approach: (1) solving the unconstrained opti- mization problem given the new data, (2) updating the solution with the low-rank con- straint, i.e. projecting the solution to the space of low-rank tensors. As we will show later through theoretical analysis that while the rst step yields approximately low-rank ten- sor, this two-step approach is nearly optimal. It is also computationally ecient since the 44 unconstrained optimization problem has a closed form solution, which is preferable since in online settings, where the data stream arrives in mini-batches, we need to dynamically update the model tensor while preserving its low-rank structures. 5.3.1 Tensor Stream in Online Setting In Step 1, the quadratic loss function in Equation 3.7 is equivalent to P M m=1 kW :;:;m Z :;:;m X :;:;m k 2 F for m = 1; 2; ;M, which is an ordinary linear regression. In online setting, the predictor tensorZ and the response tensorX are updated over time. Especially for the application of our interest, i.e., the multivariate spatiotemporal stream analysis, bothZ andX grow along the temporal dimension as time T increases. We dene W m =W :;:;m and similarly for others, the unconstrained optimization problem at time T can be written as min W kWZ 1:T X 1:T k 2 F , where we omit the index m for simplicity. Suppose that at time stamp T , we receive a new batch of data of size b, we can update the parameter tensor in the kth iterationW (k) with two possible strategies: one is the exact update, and the other is the increment update. Exact update Notice that we can obtain a closed-form solution of W (k) by using all the data from time stamp 1 to T +b as follows: W (k) = X 1:T +b Z y 1:T +b : wherey denotes matrix pseudo-inverse. 
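For a single variable slice (index m omitted, as above), the exact update can be spelled out as in the naive sketch below: it simply appends the new mini-batch and recomputes the pseudo-inverse, whereas the Woodbury-based scheme described next avoids re-inverting the full covariance at every step. The shapes and helper name are illustrative assumptions.

```python
import numpy as np

def exact_update(Z_hist, X_hist, Z_new, X_new):
    """Exact online update for one variable slice: W = X_{1:T+b} Z_{1:T+b}^+.

    Z_* has shape (q, t) and X_* has shape (p, t); columns index time stamps.
    This naive version recomputes the pseudo-inverse from scratch each time.
    """
    Z = np.concatenate([Z_hist, Z_new], axis=1)
    X = np.concatenate([X_hist, X_new], axis=1)
    W = X @ np.linalg.pinv(Z)          # closed-form least-squares solution
    return W, Z, X

# toy usage: p = 3 outputs, q = 5 lagged predictors, T = 40, batch size b = 10
q, p = 5, 3
Z_hist, X_hist = np.random.randn(q, 40), np.random.randn(p, 40)
Z_new, X_new = np.random.randn(q, 10), np.random.randn(p, 10)
W, Z_hist, X_hist = exact_update(Z_hist, X_hist, Z_new, X_new)
```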
Note that the pseudo-inverse can be computed eciently via the Woodbury matrix identity [Woodbury, 1950]. At each iteration, we can compute the inverse of the complete data covariance (Z 1:T +b Z > 1:T +b ) 1 by inverting 45 a smaller matrix constructed from the new data Z T +1:T +b at a computational cost linear to the batch size b, with a small memory overhead to store the inverse of the previous covariance matrix (Z 1:T Z > 1:T ) 1 . We defer the details to Appendix. Increment update We can also incrementally update the value of W given the new data as follows: W (k) = (1)W (k1) +X T +1:T +b Z y T +1:T +b : The dierence between the two updating scheme lies in the variables we store in memory. For the exact update, we store the data statistics required to reconstruct the model. It gives an exact solution for the linear regression problem given all the historical observations. For the incremental update, we store the previous model, compute the solution for current data only, and then take a convex combination of two models. Note that dierent statistical properties of these two updating scheme may require dierent theoretical analysis tools, but the low-rank projection of the solution is invariant to the updating strategy. 5.3.2 Online Low-Rank Tensor Approximation In Step 2, we need to project the solution from Step 1 to the low-rank tensor space. In ALTO, we measure the rank with respect to the sum-n-rank of the tensor: We restrict the maximum n-rank of tensorW over all modes to be no larger than R. In order to obtain the n-rank projection, we resort to Tucker decomposition [De Lathauwer et al., 2000a], which decomposes a tensor into a core tensor and a set of projection matrices. The dimensions of the core tensor are n-ranks of the tensor itself. The projection is generally 46 time-consuming, as it usually involves SVD on unfolded matrices at each mode of a full tensor. For the online setting, this operation needs to be repeated for each iteration, which is infeasible for large-scale applications. In ALTO, we utilize the projection results from the last iteration to approximate the current projection. It eliminates the need of SVD on unfolded matrices of a full tensor. Instead, it performs dimension reduction and computes the SVD on unfolded matrices of a low-dimensional tensor. Without the loss of generality, we elaborate ALTO via a third order tensor. Given the Tucker decomposition ofW2R NNN from the previous iteration: W (k1) =S (k1) 1 U (k1) 1 2 U (k1) 2 3 U (k1) 3 : we rst augment each U (k1) i 2 R NR with K random column vectors for i = 1; 2; 3, which are drawn from a zero-mean Gaussian distribution. These random column vectors are introduced as noise perturbation. Then we apply Gram-Schmidt process to create orthonormal augmented projection matrices V (k1) i 2 R N(R+K) , which has K more columns than U (t1) i , for i = 1; 2; 3 respectively. With augmented projection matrices V (k1) i , we project the tensorW (k) to an aug- mented core tensorS 0 (k) with dimension (R +K) (R +K) (R +K). S 0 (k) =W (k1) 1 V (k1)> 1 2 V (k1)> 2 3 V (k1)> 3 : Then we compute the rank-R approximation of the augmented core by decomposingS 0 (k) : S 0 (k) S (k) 1 V 0 (k) 1 2 V 0 (k) 2 3 V 0 (k) 3 47 whereS (k) is the new core tensor with dimensionRRR and V 0 (k) i is of size (R+K)R. We update the new projection matrices as u (k) i = V (k1) i V 0 (k) i for i = 1; 2; 3. 
And the nal low-rank projection of the solution tensor of current iteration is given by W (k) =S (k) 1 u (k) 1 2 u (k) 2 3 u (k) 3 : We summarize the work ow of ALTO in Algorithm 2. The rank-R approximation of the augmented coreS 0 (k) is computed by iterating over all the modes and sequentially mapping the unfolded tensor into the rank-R subspace. We name this procedure as low-rank Tensor Sequential Mapping (TSM), which is described in Algorithm 3. ALTO is computationally ecient since the augmented core tensorS 0 (k) has dimension (R +K) (R +K) (R +K), which is much smaller thanW (k) . At each iteration, the low-rank mapping procedure TSM only involves top-R SVD on matrices of size (R + K) (R +K) 2 , in comparison to the expensive top-R SVD onNN 2 matrices in most existing low-rank tensor learning approaches. The K random column vectors are introduced so that the algorithm can jump out of the same low-rank subspace. A heuristic algorithm called Streaming Tensor Analysis (STA) is explored in [Sun et al., 2006], where the new core tensor is simply computed byS (k) =W (k) 1 (u (k1) 1 ) > 2 (u (k1) 2 ) > 3 (u (k1) 3 ) > : However, since the projection restricts the tensor to a xed subspace, STA suers from local optima because even when the projection matrices are updated after one examines the core tensor, the space is still largely invariant. Our algorithm resolves this issue via the randomization technique. 48 Note that when the augmentation factor K is so large that V i becomes full rank, ALTO turns into an iterative singular value thresholding procedure, where the solution obtained from each iteration is directly projected to the space of low-rank tensor via top- R truncated SVD. Similar idea has been examined in [Jain et al., 2010] for the low-rank matrix learning in batch settings. Algorithm 2 Accelerated Low-rank Tensor Online Learning (ALTO) [W new ; u new ] = ALTO(W; u;R;K): Input: original tensorW and projection matrices u i ;i = 1; 2; 3. rankR, augmentation factor K Output: updated tensorW new and projection matrices u new i ;i = 1; 2; 3. 1 Augment, orthogonalize and normalize u i ;i = 1; 2; 3 to V i ;i = 1; 2; 3 with R +K columns. 2 ProjectW!S 0 =W 1 V > 1 2 V > 2 3 V > 3 . 3 Find the rank-R approximation toS 0 with TSM: TSM(S 0 ;R) =S 1 V 0 1 2 V 0 2 3 V 0 3 : 4 Return u new i = V i V 0 i ;i = 1; 2; 3 and W new =S 1 u new 1 2 u new 2 3 u new 3 . Algorithm 3 low-rank Tensor Sequential Mapping W new = TSM(W;R): Input: tensorW and target rank R. Output: tensorW new . 1 UpdateW (1) p W (1) ;R , where p (M;R) maps M to its top-R singular spaces. 2 UpdateW (2) p W (2) ;R . 3 UpdateW (3) p W (3) ;R . 4 ReturnW new =W. 5.3.3 Theoretical Analysis of ALTO We provide theoretical analysis of ALTO in terms of low-rankness, approximation accu- racy as well as the behavior of the randomized projection technique. We summarize the main results and defer the detailed proofs to Appendix. LetW be the tensor before TSM procedure. The following proposition guarantees that the outputW new has low n-rank. 49 Proposition 5.3.1 (Low-Rank Guarantee). Given a tensorW2R IJK and a target rank R, then forW new = TSM(W;R), its i-rank is no greater than R for any i. The result directly follows the conclusions from HOSVD [De Lathauwer et al., 2000a]. We denote W as the rank-R target matrix and consider a matrix W in its neigh- borhood withkW W k F <. The following proposition guarantees the approximation accuracy of TSM. Proposition 5.3.2. 
Let W be an NN matrix with (1) rank(W ) =R, (2)kW k F < C w , (3)ma k (W )> ma w , W be an NN matrix such thatkW W k F , E be a random matrix with (3) zero mean, (4) ma N (E) ma e , (5)kEk F e , then we have that kp (W + E) W k 2 F kW + E W k 2 F ; when (N 2R) 8(+e) 4 C 2 w ma 4 w ma 2 e and ma w 4 ( e +): Proposition 5.3.2 shows that when the target matrix is low-rank and it has reasonable condition number, then in its neighborhood, we can conduct low-rank mapping and expect the error to be reduced. Due to the non-convexity of the low-rank space, the low-rank mapping may push the output further away from the target even if the target itself lies within the low-rank matrix space. Luckily, we prove that this is a rare event and that the space of rank-R matrix can be treated as \nearly-convex" in its neighborhood. 50 In order to see this, denote the full SVD of W as W = [u 1 ; u 2 ] diag ( 1 ; 2 ) [V 1 ; V 2 ] > , the blocks with subscript 1 correspond to the top-R space. Then we have kW W k 2 F kp (W) W k 2 F = 2 u > 2 W V 2 2 F u > 2 W V 2 2 F : (5.1) By Wedin sin theorem [Wedin, 1972], we know that u > 2 W V 2 F O( 2 ), whilek 2 k F is very likely to be at the level ofO(). That is, the reduced noise is a rst order quantity, while the newly introduced bias is of the second order. It justies that the approximation error is reduced after the low-rank sequential mapping:kp (W) W k F kW W k F . Note that error reduction in Proposition 5.3.2 holds under certain statistical assump- tions. However, the error is still contained well within a factor of 2 without any statistical assumptions:kp (W) W k F 2kW W k F . For tensors, the error is upper bounded by a factor of 8 in the worst case scenario, as described by the following proposition: Proposition 5.3.3 (Approximation Guarantee). Given a tensorW 2 R IJK where its i-rank is no greater than R for all i. If tensorW2R IJK satiseskWW k F andW new = TSM(W;R), thenkW new W k F 8. The guarantees above rely on the low-rank assumption of the solution before TSM. If the data is generated from a low-rank model, the estimator can be proved to be approxi- mately low-rank via the standard maximum likelihood estimation analysis. For many real applications, as we will show later in experiments, we do observe the low-rank structures from the data. 51 Next, we further discuss the randomized projection technique. We examine the ran- dom projection on the tensorW2R PQM at its mode-n unfolding. For instance, the mode-1 unfolding of the tensorW =S 1 u 1 2 u 2 3 u 3 can be represented as W (1) = u 1 S (1) (u 3 u 2 ) > ; where is the matrix Kronecker product. The matrix u 3 u 2 is also an orthonormal matrix. And when u i is augmented withK additional dimension to V i , the corresponding V 3 V 2 is augmented with 2K degrees of freedom. This connection essentially allows us to study the tensor problem from the matrix perspective. To understand the random projection, we start with cases whenK = maxfP;Q;Mg R and K 0 . We show that setting K in the middle provides a trade-o between the amount of induced bias and the reduced noise. The case with K = maxfP;Q;MgR projects the tensor to the whole space, i.e., all the information are kept, so that the analysis is exactly as it in Lemma 5.3.2. ForK 0, the potential bias introduced by the information loss during the projection, as analyzed in the \nearly-convexity" section, is in the order ofO( 2 ). 
From rankR +K to rank R, the projection step will introduce additional bias that is proportional to the order ofO( 2 ), but the noise reduction is likely to be in the order ofO( K R+K ), which dominates the extra bias ifK is not too small comparing withR. This also indicates that we should set K to a larger value when R increases. 52 5.4 Experiments 5.4.1 Synthetic We generate the synthetic data stream of 30000 time stamps according to the VAR(2) modelX :;t;m =W :;:;m X t;m +E :;t;m for m = 1;:::;M and t = K + 1;:::;T , where pa- rameter tensorW2R 306020 is randomly drawn from a standard normal distribution. We projectW with tensor sequential mapping of rank 2. The noise at each time is independently standard normal distributed. We set the initial batch size to 200, the mini-batch size to 100, and repeat the experiment for 10 times. Figure 5.1 compares the average parameter estimation RMSE and the run time for ALTO and baselines over 10 random runs. We measure the run time on a machine with a 6-core 12-thread Intel Xenon 2.67GHz processor and 12GB memory. 0 100 200 300 0 0.5 1 1.5 2 2.5 Iteration Number RMSE STA SADMM ISVT INV ALTO (a) Synthetic RMSE 0 200 400 600 800 ALTO ISVT SADMM Run Time (Sec) (b) Synthetic Run Time Figure 5.1: (a) Average parameter estimation RMSE (b) Overall run time comparison over 10 random runs for ALTO and baselines on the synthetic dataset. As the true tensor is low-rank, low-rank tensor learning algorithms ISVT and ALTO outperform INV at each iteration in terms of parameter estimation accuracy. SADMM outperforms INV in rst few iterations, but later converges to a sub-optimal solution, 53 20 40 60 80 100 0.1 0.12 0.14 0.16 0.18 0.2 0.22 Iteration Number Forecasting RMSE SADMM ISVT INV ALTO (a) Forecasting RMSE 5 10 15 ALTO ISVT SADMM Run Time (Sec) (b) Run Time Figure 5.2: (a) Forecasting RMSE with respect to iteration number (b) Per iteration run time comparison for ALTO and baselines on Foursquare-S dataset. since it utilizes a surrogate loss function. We also adapt Streaming Tensor Analysis (STA) [Sun et al., 2008] for our experiment. We observe that STA stays at a local optimal point and the performance barely improves after the initial iteration, which demonstrates the benet of adding random projection in ALTO. We conduct experiments on real-world applications of multivariate spatiotemporal streams, online forecasting and multi-model ensemble respectively. 5.4.2 Online Forecasting We use following ASW and Foursquare-S for online forecasting: Figure 5.2 shows the forecasting RMSE per iteration and the run time on the Foursquare- S dataset. The superior performance of SADMM, ISVT and ALTO in forecasting accu- racy over INV justify the low-rank assumptions. Compared with SADMM or ISVT, ALTO requires less computational time while achieving more accurate solutions. 54 Table 5.1: Forecasting RMSE comparison for ALTO and baselines on Foursquare-S and AWS datasets with 90 % training data with respect to dierent lags of the VAR model. Lag ALTO ISVT SADMM INV Greedy Foursquare-S 1 0.1239 0.1285 0.1240 0.1394 0.1246 2 0.1244 0.1244 0.1234 0.1357 0.1225 3 0.1241 0.1240 0.1242 0.1362 0.1223 AWS 1 0.9318 1.0055 0.9441 1.4707 0.8951 2 0.9285 0.9182 0.9447 1.0853 0.9131 3 0.9303 0.9297 0.9485 0.9840 0.9166 Table 5.2: Forecasting overall run time comparison for ALTO and baselines on Foursquare-S and AWS datasets with 90 % training data with respect to dierent lags of the VAR model. 
Data set       ALTO     ISVT     SADMM
Foursquare-S   16 (s)   65 (s)   119 (s)
AWS            20 (s)   64 (s)   126 (s)

Table 5.1 shows the forecasting RMSE. Table 5.2 shows the overall run time with 90% training data on both datasets for VAR models with different lags. We present the results of the state-of-the-art batch algorithm GREEDY as a reference. In general, the forecasting performance of the online low-rank tensor learning algorithms significantly outperforms INV and is comparable to that of the batch algorithm. ALTO obtains accurate forecasting results with much faster speed. We also vary the rank and evaluate the performance of the ALTO algorithm. Figure 5.5(a) shows that both ISVT and ALTO achieve a slight increase in accuracy as the rank decreases, but the difference is marginal.

5.4.3 Online Multi-model Ensemble

We evaluate our method on the multi-model ensemble task. For the observation series, we collect the monthly measurements from NCEP-DOE Reanalysis 2. For the model outputs, simulation data from 7 different models are taken from the World Climate Research Programme's (WCRP's) CMIP3 multi-model dataset and processed with the CDO software (CDO 2015: Climate Data Operators, available at http://www.mpimet.mpg.de/cdo). We align the variables of the observation series with the model output series. 19 variables are selected, with 252 time points from 1979 to 1999 (see Appendix for details of the dataset processing). We use the model outputs to predict the observation measurements. 90% of the time series are used for online training.

Figure 5.3: Climate models and their influential areas. Different colors denote different models. The influence is computed by aggregating the model parameters.

Figure 5.4: Per-variable forecasting RMSE for 18 variables (a) and overall run time (b) of multi-model ensemble for ALTO and baselines using 90% training data, with 7 different models over 20 years.

Figure 5.5: Forecasting RMSE using 90% of the data with respect to the rank value for (a) Foursquare-S forecasting and (b) multi-model ensemble.

Figure 5.5(b) describes the forecasting RMSE for all variables with respect to the rank value. ALTO selects rank 13 as its optimal rank while ISVT chooses rank 7. We also examine the forecasting error for each variable separately using the learned model. Figure 5.4 shows the forecasting RMSE for 18 of the 19 variables and the overall run time in seconds. ALTO is not only more accurate but also much faster than the baselines.

Multi-model ensemble accounts for the different uncertainties in the climate models. This difference is partially due to the geographical configuration of the research institutes. To see this, we aggregate the parameters of the learned tensors of all variables and color-code the models. Figure 5.3 shows the area where a particular model is most influential (i.e., corresponding to the largest value of the aggregated parameters). The Japan Center for Climate System Research (Red) has a dominating area in Asia. The Norway Bjerknes Centre for Climate Research (Yellow) is most influential in Europe. Other interesting findings reveal that the Japan Meteorological Research Institute (Blue) is more accurate in the southern hemisphere.
Russia Institute for Numerical Mathematics (Green) shows most expertise in oceans. 57 5.5 Conclusion In this paper, we propose a simple and ecient algorithm, namely ALTO, to accelerate the process of online low-rank tensor learning. We introduce randomized projection tech- nique in ALTO to overcome the local optimal issue and provide theoretical justications. We formulate two classical tasks in multivariate spatiotemporal data streams: online fore- casting, and multi-model ensemble, via the tensor learning framework. We demonstrate that our algorithm can produce accurate predictions and signicantly reduce the compu- tational costs. For future work, we are interested in examining broader applications and relaxing the assumptions of ALTO for better theoretical properties. 58 Chapter 6 Memory Ecient Learning 6.1 Introduction Massive multiway data emerge from many elds: space-time measurements on several variables in climate dynamics [Hasselmann, 1997], multichannel EEG signals in neurol- ogy [Acar et al., 2007] and natural images sequences in computer vision [Vasilescu and Terzopoulos, 2002]. Tensor provides a natural representation for multiway data. In particular, tensor decomposition has been a popular technique for exploratory data analysis [Kolda and Bader, 2009] and has been extensively studied. In contrast, tensor regression, which aims to learn a model with multi-linear parameters, is especially suitable for applications with multi-directional relatedness, but has not been fully exam- ined. For example, in a task that predicts multiple climate variables at dierent locations and time, the data can be indexed by variable location time. Tensor regression provides us with a concise way of modeling complex structures in multiway data. Many tensor regression methods have been proposed [Zhao et al., 2011, Zhou et al., 2013, Romera-Paredes et al., 2013, Wimalawarne et al., 2014, Signoretto et al., 2014], leading 59 to a broad range of successful applications ranging from neural science to climate re- search to social network analysis. These methods share the assumption that the model parameters form a high order tensor and there exists a low-dimensional factorization for the model tensor. They can be summarized into two types of approaches: (1) alternating least square (ALS) sequentially nds the factor that minimizes the loss while keeping others xed; (2) spectral regularization approximates the original non-convex problem with a convex surrogate loss, such as the nuclear norm of the unfolded tensor. A clear drawback of all the algorithms mentioned above is high computational cost. ALS displays unstable convergence properties and outputs sub-optimal solutions [Cichocki et al., 2009]. Trace-norm minimization suers from slow convergence [Gandy et al., 2011]. Moreover, those methods face the memory bottleneck when dealing with large-scale datasets. ALS, for example, requires the matricization of entire data tensor at every mode. In addition, most existing algorithms are largely constrained by the specic tensor regression model. Adapting one algorithm to a new regression model involves derivations for all the updating steps, which can be tedious and sometimes non-trivial. In this paper, we introduce subsampled Tensor Projected Gradient (TPG), a simple and fast recipe to address the challenge. It is an ecient solver for a variety of tensor regression problems. The memory requirement grows linearly with the size of the prob- lem. 
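The algorithms in this chapter repeatedly use the mode-n unfolding/folding and the mode-n product. For reference, a minimal Python/NumPy sketch of these operations for a three-mode tensor is given below, following one reading of the cyclic unfolding convention stated in the footnote of Chapter 3; the helper names and the toy Tucker example are illustrative assumptions, not part of our implementation.

```python
import numpy as np

def unfold(T, mode):
    """Mode-n unfolding: bring `mode` to the front, cyclically order the rest,
    then flatten the remaining modes into columns."""
    order = [mode] + [(mode + k) % T.ndim for k in range(1, T.ndim)]
    return np.transpose(T, order).reshape(T.shape[mode], -1)

def fold(M, mode, shape):
    """Inverse of `unfold` for a tensor of the given full `shape`."""
    order = [mode] + [(mode + k) % len(shape) for k in range(1, len(shape))]
    inv = np.argsort(order)
    return np.transpose(M.reshape([shape[i] for i in order]), inv)

def mode_product(T, U, mode):
    """Mode-n product T x_n U, i.e. the matrix U applied along dimension `mode`."""
    shape = list(T.shape)
    shape[mode] = U.shape[0]
    return fold(U @ unfold(T, mode), mode, shape)

# toy usage: W = S x1 U1 x2 U2 x3 U3, a rank-(2,2,2) Tucker tensor
S = np.random.randn(2, 2, 2)
U1, U2, U3 = (np.linalg.qr(np.random.randn(d, 2))[0] for d in (5, 6, 7))
W = mode_product(mode_product(mode_product(S, U1, 0), U2, 1), U3, 2)
assert W.shape == (5, 6, 7)
```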
Our algorithm is based upon projected gradient descent [Calamai and Mor e, 1987] and can also be seen as a tensor generalization of iterative hard thresholding algorithm [Blumensath and Davies, 2009]. At each projection step, our algorithm iteratively returns a set of leading singular vectors of the model, avoiding full singular value decomposition 60 (SVD). To handle large sample size, we employ randomized sketching for subsampling and noise reduction to further accelerate the process. We provide theoretical analysis of our algorithm, which is guaranteed to nd the correct solution under the Restricted Isometry Property (RIP) assumption. In fact, the algorithm only needs a xed number of iterations, depending solely on the logarithm of the signal to noise ratio. It is also robust to noise, with estimation error depending linearly on the size of the observation error. The proposed method is simple and easy to implement. At the same time, it enjoys fast convergence rate and superior robustness. We demonstrate the empirical performance on two example applications: multi-linear multi- task learning and multivariate spatiotemporal forecasting. Experiment results show that the proposed algorithm signicantly outperforms existing approaches in both prediction accuracy and speed. Table 6.1: Summarization of contemporary tensor regression models, algorithms, and applications MODEL ALGO APP Y = covhX;Wi +E 1 [Zhao et al., 2011] High-order Partial Least Square Neural Imaging (EEG) Y = vec(X ) T vec(W) +E [Zhou et al., 2013] Alternating Least Square Neural Imaging (MRI) Y t =X t w t + t [Romera-Paredes et al., 2013] ADMM Multi-task learning Y =X1W1nWn +E [Ho et al., 2015] Bayesian Estimation Social Networks 6.2 Related Work Several algorithms have been proposed for tensor regression. For example, [Zhou et al., 2013] proposes to use Alternating least square (ALS) algorithm. [Romera-Paredes et al., 2013] employs ALS as well as an Alternating Direction Method of Multiplier (ADMM) technique 61 to solve the nuclear norm regularized optimization problem. [Signoretto et al., 2014] pro- poses a more general version of ADMM based on Douglas-Rachford splitting method. Both ADMM-based algorithms try to solve a convex relaxation of the original optimiza- tion problem, using singular value soft-thresholding. To address the scalability issue of these methods, [Yu et al., 2014] proposes a greedy algorithm following the Orthogonal Matching Pursuit (OMP) scheme. Though signicantly faster, it requires the matriciza- tion of the data tensor, and thus would face memory bottleneck when dealing with large sample size. Our work is closely related to iterative hard thresholding in compressive sensing [Blumensath and Davies, 2009], sparsied gradient descent in sparse recovery [Garg and Khandekar, 2009] or singular value projection method in low-rank matrix com- pletion [Jain et al., 2010]. We generalize the idea of iterative hard thresholding to tensors and utilize several tensor specic properties to achieve acceleration. We also leverage ran- domized sampling technique, which concerns how to sample data to accelerate the com- mon learning algorithms. Specically, we employ count sketch [Clarkson and Woodru, 2013] as a pre-processing step to alleviate the memory bottleneck for large dataset. 6.3 Methodology We start by describing the problem of tensor regression and our proposed algorithm in details. We use three-mode tensor for ease of explanation. Our method and analysis directly apply to higher order cases. 
62 6.3.1 Tensor Regression Given a predictor tensorX and a response tensorY, tensor regression targets at the following problem: W ? = argmin W L(W;X;Y) s.t. rank(W)R (6.1) The problem aims to estimate a model tensorW2R D 1 D 2 D 3 that minimizes the empir- ical lossL, subject to the constraint that the Tucker rank ofW is at mostR. Equivalent, the model tensorW has a low-dimensional factorizationW =S 1 U 1 2 U 2 3 U 3 with coreS2 R R 1 R 2 R 3 and orthonormal projection matricesfU n 2 R DnRn g. The dimensionality ofS is at most R. The reason we favor Tucker rank over others is due to the fact that it is a high order generalization of matrix SVD, thus is computationally tractable, and carries nice properties that we later would utilize. Many existing tensor regression models are special cases of the problem in (7.1). For example, in multi-linear multi-task learning [Romera-Paredes et al., 2013], given the predictor and response for each task (X t ; Y t ), the empirical loss is dened as the sum- marization of the loss for all the tasks, i.e.,L(W;X;Y) = P T t=1 kY t X t w t k 2 F , with w t as the tth column ofW (1) . For the univariate GLM model in [Zhou et al., 2013], the model is dened asY = vec(X ) T vec(W) +E. Table 6.1 summarizes existing tensor regression models, their algorithms as well as main application domains. In this work, we use the simple linear regression modelY =hX;Wi +E to illustrate our idea, where X2R TD 1 D 3 ,Y2R TD 2 D 3 with sample size T , andE as i.i.d Gaussian noise. The 63 tensor inner producthX;Wi is dened as the matrix multiplication on each slice inde- pendently, i.e.,hX;Wi :;:;m =X :;:;m ;W :;:;m . Our methodology can be easily extended to handle more complex regression models. 6.3.2 Tensor Projected Gradient To solve problem (7.1), we propose a simple and ecient tensor regression algorithm: sub- sampled Tensor Projected Gradient (TPG). TPG is based upon the prototypical method of projected gradient descent and can also be seen as a tensor generalization of iterative hard thresholding algorithm [Blumensath and Davies, 2009]. For the projection step, we resort to tensor power iterations to iteratively search for the leading singular vectors of the model tensor. We further leverage randomized sketching [Clarkson and Woodru, 2013] to address the memory bottleneck and speed up the algorithm. As shown in Algorithm 4, subsampled Tensor Projected Gradient (TPG) combines a gradient step with a proximal point projection step [Rockafellar, 1976]. The gradient step treats (7.1) as an unconstrained optimization ofW. As long as the loss function is dierentiable in a neighborhood of current solution, standard gradient descent methods can be applied. For our case, computing the gradient under linear model is trivial: OL(W;X;Y) =hX T ;YhX;Wii. After the gradient step, the subsequent proximal point step aims to nd a projectionP R (W) :R D 1 D 2 D 3 !R D 1 D 2 D 3 satisfying: P R (W k ) = argmin W (kW k Wk 2 F ) s.t. W2C(R) =fW : rank(W)Rg (6.2) 64 The diculty of solving the above problem mainly comes from the non-convexity of the set of low-rank tensors. A common approach is to use nuclear norm as a convex surrogate to approximate the rank constraint [Gandy et al., 2011]. The resulting problem can either be solved by Semi-Denite Programming (SDP) or soft-thresholding of the singular values. However, both algorithms are computationally expensive. Soft-thresholding, for example, requires a full SVD for each unfolding of the tensor. 
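To make the cost of the relaxation concrete, the sketch below applies singular-value soft-thresholding to a single mode-n unfolding, the core operation of the nuclear-norm approach: it requires a (thin) SVD of a D_n-by-(product of the remaining dimensions) matrix, which is precisely the operation the tensor projection described below is designed to avoid. The threshold value and helper name are illustrative assumptions.

```python
import numpy as np

def soft_threshold_mode(W, mode, tau):
    """Singular-value soft-thresholding of the mode-n unfolding of W.

    The full thin SVD of the unfolded matrix is what makes the nuclear-norm
    relaxation expensive when the tensor dimensions are large.
    """
    order = [mode] + [(mode + k) % W.ndim for k in range(1, W.ndim)]
    M = np.transpose(W, order).reshape(W.shape[mode], -1)
    U, s, Vt = np.linalg.svd(M, full_matrices=False)   # thin SVD of the unfolding
    s = np.maximum(s - tau, 0.0)                       # shrink singular values
    M_new = (U * s) @ Vt
    shape = [W.shape[i] for i in order]
    return np.transpose(M_new.reshape(shape), np.argsort(order))

# toy usage
W = np.random.randn(8, 9, 10)
W_shrunk = soft_threshold_mode(W, mode=0, tau=0.5)
```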
Algorithm 4 Subsampled Tensor Projected Gradient 1: Input: predictorX , responseY, rank R 2: Output: model tensorW2R D 1 D 2 D 3 3: Compute count sketch S 4: Sketch ~ Y Y 1 S, ~ X X 1 S 5: InitializeW 0 as zero tensor 6: repeat 7: ~ W k+1 =W k OL(W k ; ~ X; ~ Y) 8: W k+1 = ITP( ~ W k+1 ) 9: until Converge Algorithm 5 Iterative Tensor Projection (ITP) 1: Input: model ~ W, predictorX , responseY, rank R 2: Output: projectionW2R D 1 D 2 D 3 3: InitializefU 0 n g with R left singular vectors ofW (n) 4: while iR do 5: repeat 6: u k+1 1 ~ W 2 u k 2 T 3 u k 3 T 7: u k+1 2 ~ W 1 u k 1 T 3 u k 3 T 8: u k+1 3 ~ W 1 u k 1 T 2 u k 2 T 9: until Converge tofu 1 ; u 2 ; u 3 g 10: UpdatefU n g withfu n g 11: W ~ W 1 U 1 U 1 T 2 U 2 U 2 T 3 U 3 U 3 T 12: ifL(W;X;Y) then 13: RETURN 14: end if 15: end while Iterative hard thresholding, on the other hand, avoids full SVD. It takes advantage of the general Eckart-Young-Mirsky theorem [Eckart and Young, 1936] for matrices, which 65 allows the Euclidean projection to be eciently computed with thin SVD. Iterative hard thresholding algorithm has been shown to be memory-ecient and robust to noise. Un- fortunately, it is well-known that Eckart-Young-Mirsky theorem no long applies to higher order tensors [Kolda and Bader, 2009]. Therefore, computing high-order singular value decomposition (HOSVD) [De Lathauwer et al., 2000a] and discarding small singular val- ues do not guarantee optimality of the projection. To address the challenge, we note that for tensor Tucker model, we have : W = S 1 U 1 2 U 2 3 U 3 . And the projection matricesfU n g happen to be the left singular vectors of the unfolded tensor, i.e., U n n V T n =W (n) . This property allows us to compute each projection matrix eciently with thin SVD. By iterating over all factors, we can obtain a local optimal solution that is guaranteed to have rank at most R. We want to emphasize that there is no known algorithm that can guarantee the convergence to the global optimal solution. However, in the Tucker model, dierent local optima are highly concentrated, thus the choice of local optima does not really matter [Ishteva et al., 2011b]. When the model parameter tensorW is very large, performing thin SVD itself can be expensive. In our problem, the dimensionality of the model is usually much larger than its rank. With this observation, we utilize another property of Tucker model U n = W 1 n1 U T n1 n+1 U T n+1 N U N . This property implies that instead of performing thin SVD on the original tensor, we can trade cheap tensor matrix multiplication to avoid expensive large matrix SVD. This leads to the Iterative Tensor Projection (ITP) procedure as described in Algorithm 5. Denotefu n g as row vectors offU n g, ITP uses power iteration to nd one leading singular vector at a time. The algorithm stops either 66 when hitting the rank upper bound R or when the loss function value decreases below a threshold . ITP is signicantly faster than traditional tensor regression algorithms especially when the model is low-rank. It guarantees that the proximal point projection step can be solved eciently. If we initialize our solution with the top R left singular vectors of tensor unfoldings, the projection iteration can start from a close neighborhood of the stationary point, thus leading to faster convergence. In tensor regression, our main focus is to minimize the empirical loss. Sequentially nding the rank-1 subspace allows us to evaluate the performance as the algorithm proceeds. 
The decrease of empirical loss would call for an early stop of the thin SVD procedure. Another acceleration trick we employ is randomized sketching. This trick is partic- ularly useful when we are encountered with ultra-high sample size or extremely sparse data. Online algorithms, such as stochastic gradient descent or stochastic ADMM are common techniques to deal with large samples and break the memory bottleneck. How- ever, from a subsampling point of view, online algorithms make i.i.d assumptions of the data and uniformly select samples. It usually fails to leverage the data sparsity. In our framework, the convergence of the TPG algorithm, as will be discussed in a later section, depends only on the logarithm of signal to noise ratio. Randomized sketching instantiates the mechanism to reduce noise by duplicating data instances and combining the outputs. This mechanism provides TPG with considerable amount of boost. Its performance hence increases linearly if the noise is decreased. We resort to count sketch [Clarkson and Woodru, 2013] as a subsampling step before feeding data into TPG. A count sketch matrix of size MN, denoted as S, is generated as follows: start with a 67 zero matrixR MN , for each column j, uniformly pick a row i2f1; 2;Mg and assign f1; 1g with equal probability to S i;j . In practice, we nd count sketch works well with TPG, even when the sample size is very small. We now analyze theoretical properties of the proposed algorithm. We prove that TPG guarantees optimality of the estimated solution, under the assumption that the predictor tensor satises Restricted Isometry Property (RIP) [Candes et al., 2006]. With carefully designed step size, the algorithm converges to the correct solution in a constant number of iterations, and the achievable estimation error depends linearly on the size of the observation error. We assume the predictor tensor satises Restricted Isometry Property in the following form: Denition 6.3.1. (Restricted Isometry Property) The isometry constant of X is the smallest number R such as the following holds for allW with Tucker rank at most R. (1 R )kWk 2 F khX;Wik 2 F (1 + R )kWk 2 F Note that even though we make the RIP analogy for tensorX , we actually impose the RIP assumption w.r.t. matrix rank instead of tensor rank. A similar assumption can be found in [Rauhut et al., 2015]. The proposed solution TPG as in Algorithm 4 is built upon projected gradient method. To prove the convergence, we rst guarantee the optimality (local) of the proximal point step, obtained by ITP in Algorithm 5. The following lemma guarantees the correctness of the solution from ITP. 68 Lemma 6.3.2. (Tensor Projection) The projection step in Algorithm 2, dened asP R : R D 1 D 2 D 3 !R D 1 D 2 D 3 computes a proximal pointP R ( ~ W k+1 ) =W k+1 , whose Tucker rank is at most R. Formally, W k+1 = argmin W kW ~ W k+1 k 2 F s:t rank(W)R the projected solutionW k+1 follows a Tucker model, written asW k+1 =S 1 U 1 2 U 2 3 U 3 , where each dimension of the coreS is upper bounded by R. Proof. 
MinimizingkW ~ W k+1 k 2 F givenS is equivalent to minimizing the following prob- lem kS 1 U 1 2 U 2 3 U 3 ~ W k+1 k 2 F = fkS 1 U 1 2 U 2 3 U 3 k 2 F 2hS 1 U 1 2 U 2 3 U 3 ; ~ W k+1 i +k ~ W k+1 k 2 F g = fkSk 2 F +k ~ W k+1 k 2 F 2hS; ~ W k+1 1 U T 1 2 U T 2 3 U T 3 ig = kSk 2 F +k ~ W k+1 k 2 F Given projection matricesfU n g, we have: S = ~ W k+1 1 U T 1 2 U T 2 3 U T 3 69 Thus, the minimizer ofkW ~ W k+1 k 2 F generates projection matrices that maximize the following objective function: fU n g = argmax fUng k ~ W k+1 1 U T 1 2 U T 2 3 U T 3 k 2 F (6.3) Each row vector u n of U n can be solved independently with power iteration. There- fore, repeating this procedure for dierent modes leads to an optimal (local) minimizer near a neighborhood of ~ W k+1 . In fact, for Tucker tensor, convergence to a saddle point or a local maximum is only observed in articially designed numerical examples [Ishteva et al., 2011a]. Next, we prove our main theorem, which states that TPG converges to the correct solution in a constant number of iterations with isometry constant 2R < 1=3. Theorem 6.3.3. (Main) For tensor regression modelY =hX;Wi +E , suppose the predictor tensorX satises RIP condition with isometry constant 2R < 1=3. LetW ? be the optimal tensor of Tucker rank at most R. Then tensor projected gradient (TPG) algorithm with step-size = 1 1+ R computes a feasible solutionW such that the estimation errorkWW ? k 2 F 1 1 2R kEk 2 F in at mostd 1 log(1=) log( kYk 2 F kEj 2 F )e iterations for a universal constance that is independent of problem parameters. Proof. : The decrease in loss L(W k+1 )L(W k ) (6.4) = 2hOL(W k );W k+1 W k i +khX;W k+1 W k ik 2 F 2hOL(W k );W k+1 W k i + (1 + 2R )kW k+1 W k k 2 F 70 Here the inequality follows from RIP condition. And isometry constant of 2R follows from the subadditivity of rank. Dene upper bound u(W) := 2hOL(W k );WW k i + (1 + 2R )kWW k k 2 F = (1 + 2R )fkW ~ W k+1 k 2 F hOL(W k );WW k ig where the second equality follows from the denition of gradient step ~ W k+1 =W k OL(W k ) From Equation (6.4) and Lemma 6.3.2, L(W k+1 )L(W k )u(W k+1 )u(W ? ) (6.5) = 2hOL(W k );W ? W k i + (1 + 2R )kW ? W k k 2 F = 2hOL(W k );W ? W k i + 2 2R kW ? W k k 2 F + (1 2R )kW ? W k k 2 F 2hOL(W k );W ? W k i + 2 2R kW ? W k k 2 F + khX;W ? W k ik 2 F (6.6) = L(W ? )L(W k ) + 2 2R kW ? W k k 2 F L(W ? )L(W k ) + 2 2R 1 2R khX;W ? W k ik 2 F (6.7) In short, L(W k+1 )L(W ? ) + 2 2R 1 2R khX;W ? W k ik 2 F (6.8) 71 The inequality (6.6) and (6.7) follows from RIP condition. Given the model assumption YhX;W ? i =E, we have khX;W ? W k ik 2 F = kYhX;W k iEk 2 F = L(W k ) 2hE;YhX;W k ii +kEk 2 F L(W k ) + 2 C L(W k ) + 1 C 2 L(W k ) = (1 + 1 C ) 2 L(W k ) (6.9) as long as the noise satises C 2 kEk 2 F L(W k ). Following Equation (6.8), L(W k+1 ) kEk 2 F + 2 2R 1 2R (1 + 1 C ) 2 L(W k ) [ 1 C 2 + 2 2R 1 2R (1 + 1 C ) 2 ]L(W k ) = L(W k ) (6.10) With the assumption that 2R < 1=3, selectC > 1+ 2R 13 2R , we have [ 1 C 2 + 2 2R 1 2R (1+ 1 C ) 2 ]< 1. The above inequality implies that the algorithm enjoys a globally geometric convergence rate and the loss decreases multiplicatively. For the simplest case, with the initial point as zero, we haveL(W 0 ) =kYk 2 F . In order to obtain a loss value that is small enough L(W K ) K L(W 0 )kEk 2 F (6.11) 72 the algorithm requires at least K 1 log(1=) log( kYk 2 F kEj 2 F ) iterations. kW K W ? k 2 F 1 1 2R khX;W K W ? ik 2 F (6.12) 1 1 2R (1 + 1 C ) 2 L(W K ) 1 1 2R kEk 2 F where inequality (6.12) follows from RIP. 
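For readability, the chain of bounds above can be collected into a single display, using the same symbols and assumptions as the proof and claiming nothing beyond what is derived there:

\mathcal{L}(\mathcal{W}^{k+1})
  \;\le\; \Big[\tfrac{1}{C^{2}}
        + \tfrac{2\delta_{2R}}{1-\delta_{2R}}\big(1+\tfrac{1}{C}\big)^{2}\Big]\,
        \mathcal{L}(\mathcal{W}^{k})
  \;=\; \kappa\,\mathcal{L}(\mathcal{W}^{k}),
\qquad
\kappa < 1 \ \text{ when } \ \delta_{2R} < \tfrac{1}{3},\;
C > \tfrac{1+\delta_{2R}}{1-3\delta_{2R}},

so that, starting from \mathcal{W}^{0}=\mathbf{0} with \mathcal{L}(\mathcal{W}^{0})=\|\mathcal{Y}\|_{F}^{2},

K \;\ge\; \frac{1}{\log(1/\kappa)}\,
          \log\!\Big(\frac{\|\mathcal{Y}\|_{F}^{2}}{\|\mathcal{E}\|_{F}^{2}}\Big)
\quad\text{iterations give}\quad
\mathcal{L}(\mathcal{W}^{K}) \le \kappa^{K}\|\mathcal{Y}\|_{F}^{2} \le \|\mathcal{E}\|_{F}^{2},
\qquad
\|\mathcal{W}^{K}-\mathcal{W}^{\star}\|_{F}^{2}
  \le \frac{1}{1-\delta_{2R}}\,\|\mathcal{E}\|_{F}^{2}.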
Theorem 6.3.3 shows that under RIP assumption, TGP converges to an approximate solution in O( 1 log(1=) log( kYk 2 F kEj 2 F )) number of iterations, which depends solely on the log- arithm of the signal to noise ratio. The achievable estimation error depends linearly on the size of the observation error. As a pre-processing step, our proposed algorithm employs l 2 -subspace embedding (count sketch) to subsample the data instances. Denition 6.3.4. (l 2 -subspace embedding) A (1)l 2 -subspace embedding for the column space of a ND matrix A is a matrix S for which for all x2R D kSAxk 2 2 = (1)kAxk 2 2 Subsampling step solves an unconstrained optimization problem which is essentially a least square problem, the approximation error can be upper-bounded as follows: Lemma 6.3.5. (Approximation Guarantee) For any 0<< 1, given a sparsel 2 -subspace embedding matrix S with K =O((D 1 D 2 D 3 ) 2 = 2 ) rows, then with probability (1), we 73 can achieve (1 +)-approximation. The sketchX 1 S can be computed in O(nnz(X )) time, andY 1 S can be computed in O(nnz(Y)) time. The result follows directly from [Clarkson and Woodru, 2013]. The randomized sketching leads to a (1 + ) approximation of the original tensor regression solution. It also serves as a noise reduction step, facilitating fast convergence of subsequent TPG procedure. 6.4 Experiments We conduct a set of experiments on one synthetic dataset and two real world applications. In this section, we present and analyze the results obtained. We compare TGP with following baseline methods: OLS: ordinary least square estimator without low-rank constraint THOSVD [De Lathauwer et al., 2000a]: a two-step heuristic that rst solves the least square and then performs truncated singular value decomposition Greedy [Yu et al., 2014]: a fast tensor regression solution that sequentially estimates rank one subspace based on Orthogonal Matching Pursuit ADMM [Gandy et al., 2011]: alternating direction method of multipliers for nuclear norm regularized optimization 74 6.4.1 Synthetic Data set We construct a model coecient tensorW of size 303020 with Tucker rank equals to 2 for all modes. Then, we generate the observationsY andX according to multivariate regression modelY :;:;m =X :;:;m W :;:;m +E :;:;m form = 1;:::;M, whereE is a noise tensor with zero mean Gaussian elements. We generate the time series with 30; 000 time stamps and repeat the procedure for 10 times. Figure 6.1(a) and 6.1(b) shows the parameter estimation RMSE and the run time error bar with respect to the sketching size. Since the true model is low-rank, simple OLS suers from poor performance. Other methods are able to converge to the correct solution. The main dierence occurs for small sketch size scenario. TPG demonstrates its impressive robustness to noise while others show high variance in estimation RMSE. ADMM obtains reasonable accuracy and is relatively robust, but is very slow. We also investigate the impact of sketching scheme on TPG. We compare count sketch (Count) with the sketch with i.i.d Gaussian entries (Gaussian) and the sparse random projection (Sparse) [Achlioptas, 2003]. Figure 6.1(c) and 6.1(d) shows the parameter estimation RMSE and the run time error bar for TPG combined with dierent random sketching algorithm. TPG with count sketch achieves the best performance, especially for small sketch size. The results justify the merit of leveraging count sketch for noise reduction, in order to accelerate the convergence of TPG algorithm. 
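The count-sketch pre-processing compared above is straightforward to implement. Below is a minimal numpy sketch following the construction in Section 6.3.2 (one random plus-or-minus-one entry per column of S), applied to the sample mode of the data before calling TPG. Sizes are illustrative, and a production implementation would apply the underlying hash functions directly rather than materializing S.

import numpy as np

def count_sketch_matrix(M, N, rng):
    # each column has exactly one nonzero: a uniformly chosen row gets +/- 1
    S = np.zeros((M, N))
    S[rng.integers(0, M, size=N), np.arange(N)] = rng.choice([-1.0, 1.0], size=N)
    return S

def sketch_samples(A, S):
    # mode-1 product A x_1 S: contracts the sample mode, returns an M x ... tensor
    return np.tensordot(S, A, axes=([1], [0]))

rng = np.random.default_rng(0)
T, M = 5_000, 512                         # sample size and sketch size (illustrative)
S = count_sketch_matrix(M, T, rng)
# X_sk = sketch_samples(X, S); Y_sk = sketch_samples(Y, S)   # then run TPG on the sketches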
6.4.2 Real Data In this section, we test the described approaches with two real world application datasets: multi-linear multi-task learning and spatiotemporal forecasting problem. 75 0 500 1000 1500 2000 Sketch Size 0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 RMSE Parameter Estimation RMSE TPG OLS THOSVD Greedy ADMM (a) RMSE 0 0.5 1 1.5 2 2.5 3 3.5 Sketch Size #10 4 0 50 100 150 200 250 300 350 400 450 500 Run Time (Sec) Run Time TPG OLS THOSVD Greedy ADMM (b) Run Time 0 2000 4000 6000 8000 10000 12000 Sketch Size 0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 RMSE Parameter Estimation RMSE TPG-Count TPG-Sparse TPG-Gaussian (c) RMSE 0 2000 4000 6000 8000 10000 12000 Sketch Size 0 50 100 150 200 250 300 350 400 Run Time (sec) Run Time TPG-Count TPG-Sparse TPG-Gaussian (d) Run Time Figure 6.1: Performance comparison on the synthetic data set over 10 random runs. (a) model parameter estimation RMSE for dierent algorithms, (b) running time with respect to sketching size for dierent algorithms, (c) RMSE for dierent sketching method, (d) running time for dierent sketching method. 6.4.2.1 Multi-linear Multi-task Learning We compare TPG with state-of-art multi-linear multi-task baseline. Our evaluation fol- lows the same experiment setting in [Romera-Paredes et al., 2013] on the restaurant & consumer dataset, provided by the authors in the paper. The data set contains consumer ratings given to dierent restaurants. The data has 138 consumers gave 3 type of scores for restaurant service. Each restaurant has 45 attributes for rating prediction. The total number of instances of all the tasks were 3483. The problem results in a model tensor of dimension 45 138 3. We split the training and testing set with dierent ratio ranging from 0.1 to 0.9 and randomly select the training data instances. When the training size is small, many tasks contained no training example. We also select 200 instances as the validation set. We compare with MLMTL-C and multi-task feature learning baselines in the original paper. MTL-L21 uses L 21 regularizer and MTL-Trace is the trace-norm regularized multi-task learning algorithm. The model parameters are selected by minimizing the mean squared error on the validation set. 76 0 500 1000 1500 2000 2500 3000 3500 Training Size 0.3 0.35 0.4 0.45 0.5 0.55 0.6 0.65 0.7 MSE TPG MLMTL-C MTL-L21 MTL-Trace (a) MSE 0 500 1000 1500 2000 2500 3000 3500 Training Size 0 5 10 15 20 25 30 Run Time (Sec) TPG MLMTL-C MTL-L21 MTL-Trace (b) Run Time Figure 6.2: Multi-linear multi-task learning performance comparison on the synthetic dataset over 10 random runs. (a) average forecasting MSE and (b) running time w.r.t. training size. Figure 6.2 demonstrates the prediction performance in terms of MSE and runtime with respect to dierent training size. Compared with MLMTL-C, TPG is around 10% 25% more accurate and at least 20 times faster. MTL-L21 or MTL-Trace runs faster than MLMTL-C but also sacrices accuracy. The dierence is more noticeable in the case of small training size. These results are not surprising. Given limited samples and highly correlated tasks in the restaurant score prediction, the model parameters demonstrate low-rank property. In fact, we found that rank 1 is the optimal setting for this data during the experiments. 6.4.2.2 Spatiotemporal Forecasting For the spatiotemporal forecasting task, we experiment with Foursquare and USHCN datasets. We split the data along the temporal dimension into 80% training set and 20% testing set. 
We choose VAR (3) model and use 5-fold cross-validation to select the rank during the training phase. For both datasets, we normalize each individual time series by removing 77 the mean and dividing by the standard deviation. Due to the memory constraint of the Greedy algorithm, evaluations are conducted on down-sampled datasets. Table 6.2: Forecasting RMSE and run time on Foursquare check-in data and USHCN daily measurement for VAR process with 3 lags, trained with 80% of the time series. TPG OLS THOSVD Greedy ADMM RMSE 0.3580 0.8277 0.4780 0.3639 0.3916 RunTime 37.06 5.85 12.37 290.12 445.41 TPG OLS THOSVD Greedy ADMM RMSE 0.3872 1.4265 0.7224 0.4389 0.5893 RunTime 144.43 23.69 46.26 410.38 6786 Table 6.2 presents the best forecasting performance (w.r.t sketching size) and the corresponding run time for each of the methods. TPG outperforms baseline methods with higher accuracy. Greedy shows similar accuracy, but TPG converges in very few iterations. For USHCN, TPG achieves much higher accuracy with signicantly shorter run time. Those results demonstrate the eciency of our proposed algorithm for spa- tiotemporal forecasting tasks. We further investigate the learned structure of TPG algorithm from USHCN data. Figure 6.3 shows the spatial-temporal dependency graph on the terrain of California. Each velocity vector represents the aggregated weight learned by TPG from one location to the other. The graph provides an interesting illustration of atmospheric circulation. For example, near Shasta-Trinity National forecast in northern California, the air ow into the forecasts. On the east side of the Rocky mountain area, there is a strong atmospheric pressure, leading to wind moving from southeast to northwest passing the bay area. Another notable atmospheric circulation happens near Salton sea at the border of Utah, caused mainly by the evaporation of the sea. 78 Figure 6.3: Velocity vectors plot of spatial-temporal dependency graph obtained via TPG. Results are averaged across all ve dierent climate variables. The implication of our approach has several interesting aspects that might shed light upon the future algorithmic design. (1) The projection step in TPG does not depend on data, thus it connects to tensor de- composition techniques such as high order orthogonal iteration (HOOI) [De Lathauwer et al., 2000b]. However, there is a subtle dierence in that the regression would call for an early stop of iterative projection as it sequentially searches for the orthogonal subspaces. (2) TPG behaves similarly as rst order methods. The convergence rate can be further improved with second order Newton method. This can be done easily by replac- ing the gradient with Hessian. This modication does not aect the theoretical prop- erties of the proposed algorithm, but would lead to signicant empirical improvement [Jain et al., 2010]. 79 6.5 Conclusion In this paper, we study tensor regression as a tool to analyze multiway data. We introduce Tensor Projected Gradient to solve the problem. Our approach is built upon projected gradient method, generalizing iterative hard thresholding technique to high order tensors. The algorithm is very simple and general, which can be easily applied to many tensor regression models. It also shares the eciency of iterative hard thresholding method. We prove that the algorithm converges within a constant number of iterations and the achievable estimation error is linear to the size of the noise. 
We evaluate our method on multi-linear multi-task learning as well as spatiotemporal forecasting applications. The results show that our method is signicantly faster and is impressively robust to noise. 80 Chapter 7 Probabilistic Interpretation 7.1 Introduction High-order correlation is ubiquitous in data from various domains. For instance, climate science often deals with multivariate spatiotemporal data, where measurements are cor- related across variable (e.g. CO 2 , temperature), location and time. Multi-task learning exploits such complex correlation by jointly learning for multiple variables, stations, and time stamps, in order to improve the overall prediction accuracy. In recently years, a new class of model based on low-rank tensor regression have received considerable attention, leading to successful applications in multi-task learning [Wimalawarne et al., 2014], dy- namic network modeling [Ho et al., 2015] and spatiotemporal analysis [Yu and Liu, 2016]. Tensor regression [Zhou et al., 2013] learns a model whose parameters form a tensor in a supervised fashion, which diers from unsupervised tensor decomposition methods [Kolda and Bader, 2009]. Compared with traditional vector or matrix based multi-task 81 learning, tensor regression has two main advantages: (1) it provides an explicit parame- terization for the multi-level relatedness among tasks; (2) the multi-linear low-rank con- straint regularizes the model to be more generalizable with fewer examples. However, one notable disadvantage of tensor regression is the absence of prediction condence intervals, which calls for a probabilistic counterpart that can eectively represent the high-order correlation in the data. Meanwhile, Gaussian processes (GPs) [Rasmussen, 2006] are well-established tech- niques for modeling correlations and structures. With versatile covariance design, GPs remain popular in spatial statistics and time series analysis. A natural question then arises, \which method is better? And how are these two model classes related? " Known examples of similar connections include the Gaussian process latent variable model [Lawrence, 2004] for PCA, the multi-task Gaussian process model [Bonilla et al., 2007] for multi-task learning and the probabilistic Tucker model for Tucker tensor decomposition [Chu and Ghahramani, 2009]. Probabilistic interpretation deepens the understanding of the regularized optimization approach, suggesting its generalization to non-Gaussian data with kernel methods. In this paper, we make a rst attempt at understanding this connection. We show that in the context of multi-task learning, tensor regression is equivalent to learning a Gaussian process with multi-linear transformation kernel: multi-linear Gaussian process (MLGP). Low-rank assumption on the parameter tensor can be interpreted as a con- strained Bayesian inference problem. We analyze the theoretical properties of MLGP by proving its oracle inequality and deriving the average case learning curve. We validate our theory with numerical simulations and provide comparative analysis between dierent 82 GP models. Finally, we showcase the model on three real-world multi-task applications: multi-linear multi-task learning, spatiotemporal forecasting, and multi-output regression. The model not only achieves comparable (if not better) predictions as tensor regression, but also quanties the uncertainty of the predictions. 
Note that our work is fundamentally dierent from existing works on Bayesian estima- tor for tensor-variate regression [Guhaniyogi et al., 2015, Xu et al., 2015, Suzuki, 2015]. For example, [Xu et al., 2015] proposes to place Gaussian priors on decomposed compo- nents and an exponentially decaying prior on the rank; [Suzuki, 2015] further analyzes the minimax optimal rate of the estimator. These works emphasize probabilistic mod- eling instead of establishing the connections. And most existing theoretical analyses are asymptotic. In contrast, our work aims to provide deeper insights into the relationship between the optimizers of tensor regression and estimators for Gaussian process models. 7.2 Methodology 7.2.1 Low-Rank Tensor Regression Tensor regression methods were developed to represent the multi-directional structure among outputs. It learns a model whose parameters form a high-order tensor. The tensor is constrained to be low-rank as a way to enforce shared latent spaces among outputs and to avoid overtting. Formally speaking, given a predictor tensorX , a response tensorY 83 and a model parameter tensorW, tensor regression aims to solve the following non-convex optimization problem: W ? = argmin W ^ L(f(X;W);Y) s.t. rank(W)R Here f represents a regression model that maps inputs to outputs via a tensor. The solution to the above problemW ? minimizes the empirical loss ^ L, and satises the low- rank constraint rank(W)R. Multi-linear multi-task learning (MLMTL) is a classic application of tensor regression 1 . It learns in parallel multiple tasks that have multi-level commonalities. Specically, Given T learning tasks, each of which has a parameter vector w t 2R T 1 and n t training data pointsfx t;i ; y t;i g nt i=1 , with feature dimension T 1 . Suppose T tasks can be clustered into T 2 groups, each of which contains T 3 tasks. We can form a parameter tensor by concatenating all parameters as a matrix W = [w 1 ; ; w T ] and folding the matrix into a tensorW = fold (1) (W)2 R T 1 T 2 T 3 . The objective of MLMTL is to learn the parameters of all tasks simultaneously: W ? = argmin W T X t=1 nt X i=1 L(hx t;i ; w t i; y t;i ) s.t. rank(W)R (7.1) In MLMTL setting, we use a tensor to encode the multi-level clustering hierarchy of tasks. For instance, we can use the rst dimension to encode features, and the rest to index the tasks clustering hierarchy. In the above two-level example T = T 2 T 3 , we 1 Other applications can be re-formulated as special cases of multi-linear multi-task learning 84 obtain a third-order tensor. In general, one can use a (m + 1)-order tensor to represent anm-level task clustering hierarchy. With this tensor formulation, each task index t will be mapped to a set of indicesft m g, corresponding to the index for a particular mode. This index mapping is a linear operation, denoted by P t 1 ;;tm . Note that the denition of tensor rank is not unique [Kolda and Bader, 2009]. One popular denition is Tucker rank, which assumes that the tensorW has a Tucker decom- positionW =S 1 U 1 2 U 2 3 U 3 , with a core tensorS2R R 1 R 2 R 3 and orthonormal projection matricesfU m g 3 m=1 . Tucker rank is the size of the core tensorS, constrained to be at most R. The low-rank constraint poses signicantly challenges to the tensor regression problem as the rank constraint is non-convex. Many ecient solvers have been developed for low-rank tensor regressions in Equation 7.1, e.g., [Rabusseau and Kadri, 2016]. 
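As a concrete illustration of this bookkeeping, the sketch below stacks the per-task weight vectors into a matrix, folds it into the parameter tensor, and maps a flat task index to its mode-wise indices. The column ordering follows the usual mode-1 unfolding convention, and the sizes are placeholders rather than any dataset used in this thesis.

import numpy as np

# stack the T = T2*T3 task weights into W_(1) (T1 x T2*T3), fold it back into a
# T1 x T2 x T3 tensor, and map a flat task index t to its (t2, t3) pair
T1, T2, T3 = 10, 4, 3                               # placeholder sizes
Wmat = np.random.randn(T1, T2 * T3)                 # columns are w_1, ..., w_T
W = Wmat.reshape(T1, T3, T2).transpose(0, 2, 1)     # fold_(1): mode-2 index varies fastest

def task_to_multi_index(t):
    # P_{t_1,...,t_m} for this two-level hierarchy: t -> (within-group, group)
    return t % T2, t // T2

t = 5
t2, t3 = task_to_multi_index(t)
assert np.allclose(Wmat[:, t], W[:, t2, t3])        # w_t is a mode-1 fiber of W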
Tensor regression stands out as a scalable method for multi-task learning with complex high-order correlations. However, one major draw- back for such formulation is that it trades uncertainty for eciency: there is no condence interval for the prediction. Hence, it is dicult for the low-rank tensor regression model to reason with uncertainty. In seek of its probabilistic counterpart, we resort to another class of structured prediction models: Gaussian processes. 85 7.2.2 Multi-linear Gaussian Process Gaussian process regression infers continuous values with a GP prior. Given input x, output y, and as the Gaussian noise. GP describes a prior distribution over function f(x) with a mean function m and a covariance function k. y =f(x) +; f(x) GP(m;k) (7.2) By denition, we haveE[f(x)] = m(x), cov(x; x 0 ) = k(x; x 0 ). The mean function is usually dened to be zero. The covariance function fully characterizes the behavior of the GP prior. For MLMTL, we can use GP to describe the underlying generative process of the data. Given a total of N = P T t=1 n t training data pointsfx t;i ; y t;i g nt i=1 from T related tasks as well as the task hierarchyT =T 2 T 3 , we assume that each data point (x t;i ; y t;i ) is drawn i.i.d from the following model: y t;i =f(x t;i ) + t ; f(x t;i ) GP(0;k) Where the noise of task t follows i.i.d Gaussian with zero mean and variance t N(0; 2 t ). To model multiple tasks, we rst concatenate the data from all tasks. Denote X as the inputs, K as the input covariance matrix and D as the noise covariance, we have 86 y = 2 6 6 6 6 6 6 6 6 6 6 4 y t;1 y t;2 y T;n T 3 7 7 7 7 7 7 7 7 7 7 5 ; X = 2 6 6 6 6 6 6 6 6 6 6 4 X 1 0 ::: 0 0 X 2 ::: 0 . . . . . . . . . . . . 0 0 ::: X T 3 7 7 7 7 7 7 7 7 7 7 5 ; D = 2 6 6 6 6 6 6 6 6 6 6 4 2 1 I n 1 0 ::: 0 0 2 2 I n 2 ::: 0 . . . . . . . . . . . . 0 0 ::: 2 T I n T 3 7 7 7 7 7 7 7 7 7 7 5 where X t = [x t;1 ; x t;2 ; ; x t;nt ] is the vectorization of the inputs for task t. This gives us a probabilistic model for the MLMTL problem in the similar form as Equation 7.2: y =f(X) + e; f(X) GP(0; K); eN (0; D) MLMTL further assumes the tasks has multi-level clustering hierarchy T =T 2 T 3 . We need to construct the covariance matrix K in a way that re ects this structure as below: K =(X)K 3 K 2 K 1 (X) > Here we use K 1 to model the interaction among features, K 2 to model the correlations among groups, and K 3 to represent the interdependence of tasks within each group. () maps the inputs to a T 1 dimensional feature space. We want to clarify that the use of () limits the model to a nite feature space. And the model itself is parametric, which is the same as the tensor regression formulation. 87 0 200 400 600 800 1000 Number of Sample 0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 2 MSE Theoretic ( ;=0.25) Theoretic ( ;=0.50) Theoretic ( ;=0.75) Simulation ( ;=0.25) Simulation ( ;=0.50) Simulation ( ;=0.75) (a) Full-rank MLGP 0 500 1000 1500 Number of Sample 0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 2 MSE Full Rank (r=16) Low Rank (r=9) Low Rank (r=4) Low Rank (r=1) (b) Low-rank 2-mode MLGP 0 500 1000 1500 Number of Sample 0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 2 MSE Full Rank (r=16) Low Rank (r=3 # 3) Low Rank (r=2 # 2) Low Rank (r=1 # 1) (c) Low-rank 3-mode MLGP Figure 7.1: (a) Theoretical and numerically simulated learning curve for task correlation = 0:25; 0:25; 0:75. (b) Learning curve for 2-mode MLGP with low-rank approximation r = 9; 4; 1. 7.1(c) Learning curve for 3-mode MLGP with low-rank approximation r = 9; 4; 1. 
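Returning to the covariance construction above, the following toy numpy sketch builds K from the Kronecker product of the three factor kernels plus a homoscedastic noise term. How each sample's features are placed inside the feature map is glossed over (a random matrix keeps the example runnable), so this illustrates only the algebra, not a full MLGP implementation; all sizes are illustrative.

import numpy as np

def mlgp_cov(Phi, K1, K2, K3, noise_var):
    # K = Phi (K3 kron K2 kron K1) Phi^T + sigma^2 I, as in Section 7.2.2
    Kw = np.kron(K3, np.kron(K2, K1))
    return Phi @ Kw @ Phi.T + noise_var * np.eye(Phi.shape[0])

rng = np.random.default_rng(0)
N, T1, T2, T3 = 50, 4, 3, 2                    # samples, feature dim, groups, tasks per group
Phi = rng.standard_normal((N, T1 * T2 * T3))   # placeholder for the structured feature map
K1 = np.eye(T1)                                # feature covariance
K2 = 0.5 * np.eye(T2) + 0.5                    # group covariance (constant off-diagonal)
K3 = np.eye(T3)                                # within-group task covariance
K = mlgp_cov(Phi, K1, K2, K3, noise_var=0.1)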
When dealing with large number of tasks and high dimensional data, learningfK m g 3 m=1 can be very expensive. To reduce the computational cost, we use the following low-rank approximation for each kernel matrix: fK m = U m U > m 2R TmTm g 3 m=1 where U m 2R TmRm is an orthogonal matrix whose dimension R m is much smaller than T m . Due to the multi-linear structure of the covariance matrix, we name this particular class of GP model as the multi-linear Gaussian process (MLGP). 7.3 Theory We investigate the connections between tensor regression and the Gaussian process by examining the common structures that the two models aim to learn: multi-level task relatedness and multi-linear low-rank constraint. We further study the theoretical prop- erties of MLGP, which also shed light on the properties of existing tensor regression frameworks. 88 7.3.1 Multi-Level Task Relatedness The weights-space view of GP allows us to re-write the latent function: f(X) =hvec(W);(X)i, whereW2R T 1 T 2 T 3 is the regression model parameters with the following prior distri- bution: vec(W) = (U 1 U 2 U 3 ) T vec(S); vec(S)N (0; 2 s I) whereS 2 R R 1 R 2 R 3 is a super-diagonal core tensor with i.i.d entries. fU m 2 R TmRm g is a set of orthogonal projection matrices. Under the GP model, the prior distribution of the latent function follows Gaussian p(fjX) =N (0; K), and the likelihood distribution is p(yjf) =N (f; D). We can obtain the marginal distribution of the outputs y by integrating out the model parameters: p(yjX) = R f p(y;f;sjX) df ds =N (0; K + D) where we omit the core tensor constant s , which acts as a regularization term. The log-likelihood of the marginal distribution for the MLGP model is: L = 1 2 logjK + Dj 1 2 y > (K + D) 1 y +const s.t. K =(X) 3 m=1 K m (X) > (7.3) We can show that the log likelihood is maximized when the solution satises: (X) 3 m=1 U m = U y ( y D) 1 2 V > x (7.4) 89 Given the singular value decomposition of the input (X) 3 m=1 U m = U x x V > x and the eigen-decomposition of the output covariance yy > = U y y U 1 y . Detailed derivation can be found in Appendix C.0.1. This suggests that the maximum likelihood estimator of MLGP performs a multi-linear transformation from input(X) to the principal subspace of the output. And the transformation represents multi-level task relatedness. Recall that for tensor regression in Equation 7.1, the model parameter tensorW also performs a similar mapping from input to output. And the Tucker decomposition ofW performs principal subspace projection through orthogonal matricesfU m g 3 m=1 . 7.3.2 Multi-Linear Low-Rank Constraint Tensor regression learns a model parameter tensor that satises the low-rank constraint: rank(W) R. For MLGP, we impose low-rank structure on the projection matrices fU m g, i.e. R m < T m . As the covariance matrix is low-rank, GP becomes degen- erate. It is known that degenerate GPs are equivalent to nite sparse linear models [Quinonero-Candela and Rasmussen, 2005], to which tensor regression models belong. Alternatively, we can interpret the connection with respect to the low-rankness using a constrained Bayesian inference approach. By minimizing the KullbackLeibler (KL) di- vergence of Bayesian posteriorN (0; K +D) from any constructed GP priorN (0; S), and assuming K is low-rank, we have the following optimization problem: min K:K0;rank(K)<R log det[(K + D)S 1 ] + tr[(K + D) 1 S] 90 ! ", $ % ", $ & " ' ( ) * + " , " - . (a) Tensor Regression ! " # " $ " % & ' ( (b) Gaussian Process ! 
" # ", % & ", % ' " ( ) * + ) , - " . " / (c) Multi-linear Gaussian Process Figure 7.2: Graphical model for (a) tensor regression, (b) Gaussian process and (c) MLGP. It turns out that the log-det of K+D is a smooth surrogate for the rank of K [Fazel et al., 2003], which simultaneously minimizes the rank ofW. Therefore, the estimator for MLGP with low-rank constraint would provide an approximate solution to the low-rank tensor regres- sion problem. To this end, we have established the connection between tensor regression and Gaus- sian process through MLGP. Figure 7.2 depicts graphical models of tensor regression, GP and MLGP. It is evident that the parameter tensor in tensor regression maps to the covariance of the MLGP model. Latent tensor components become parameters of the covariance function. We employ gradient-based optimization over the objective function in Equation 7.3 to learn the hyper-parameters. Note that this does not guarantee that projection matrices be orthonormal. However, this procedure can eciently compute reasonable approximations under good initializations. Details of the gradient steps are deferred to Appendix C.0.2. 91 7.3.3 Theoretical Properties of Multi-linear GP We rst bound the excess risk of MLGP and derive the oracle inequality. Consider a tensor of functionalsW, we can dene a space of functional tensorC N on the sample size N: C N =fW :W =S 1 U 1 2 U 2 3 U 3 ; kS (1) k ? =O N T 2 T 3 + log(T 1 T 2 T 3 ) 1=4 g The following proposition states the oracle inequality, which shows error tending to zero under a scaling assumption on the sample size N and the dimensionsfT m g of the tensor regression problem. Proposition 7.3.1. Let ^ W be the estimator that minimizes the empirical risk ^ L(f(X;W);Y) over the space of functional tensorsW2C N , then the excess risk, dened asL satises: L( ^ W) inf W2C N (L(W)) P ! 0 Proof. see Appendix C.0.3. The above result generalizes the claim in [Foygel et al., 2012], where we use the tensor mode n-rank as a generalization for matrix rank. Asymptotic results can only capture the largeN regime and will not necessarily apply for sample sizes occurring in practice. Thus we derive the learning curve for MLGP in the average case. The Bayes error of taskt is dened as the generalization error averaged over the sample distribution ^ t =E x [(w t ^ w t ) 2 ]. The learning curve is the Bayes error averaged over the data setD of size N (N) =E D [^ ]. The following theorem states the 92 explicit form of average case learning curve for the MLGP model under full-rank and low-rank scenarios: Theorem 7.3.2. Assume the eigenfunction decomposition for the data-dependent part of covariance (x)K 1 (x 0 ) > = P i i i (x) i (x 0 ) > and i;j = i () > j (), denote as the diagonal matrix with elementsf i;j i g, the average case learning curve for MLGP of single task t satises (N) t = tr P t 1 ;;t M 0 1 + T X s=1 diag( n s 2 s + s )P sm 1 when 0 is full-rank (N) t = tr P t1;;t M 0 T X s=1 diag( 2 s + s n s )P sm + 0 1 0 2 when 0 is rank-decient, where P t 1 ;;t M is the linear operator that maps index t to a set of indicesft m g, and 0 = M m=2 K m . Proof. see Appendix C.0.4. Theorem 7.3.2 indicates the performance dependency of MLGP, hence tensor regression, on the eigenvalues of the kernel as well as the task correlation matrix. When the number of examples for all tasks becomes large, the Bayes errors ^ t will be small and eventually be negligible compared to the noise variances t . 
This also re ects a commonly accepted claim for the asymptotic useless of multi-task learning: when the number of samples becomes large, the learning curves would come close to single task learning, except for the fully corrected case. 93 We further conduct numerical simulations to better understand this approximation of learning curve. Consider the case with 16 identical tasks, and set the task correlation matrix M m=2 K m to be all except for the principal diagonal elements. Assuming all the tasks are identical, we can solve the self-consistent learning curve equations via the iterative procedure. Figure 7.1(a) compares the theoretic learning curve with the numer- ically simulated learning curve for dierent levels of task relatedness. We can see that the theoretical learning curves generally lay slightly below the actual learning curves, provid- ing a tight lower bound. With a higher value of , tasks share higher interdependence, resulting in faster convergence w.r.t. Bayes error. Figure 7.1(b) shows the learning curve for 2-modes MLGP with dierent low-rank approximations. The low-rankness alleviates the noise variance error, leading to a faster convergence rate but we end up with a larger approximation gap. Figure 7.1(c) displays the learning curves for the 3-modes MLGP model, with a similar low-rank approximation. We observe that under the same rank assumption, the 3-mode MLGP imposes a stronger prior, leading to superior performance with sparse observations. 7.3.4 Relation to Other Methods It turns out that for multi-output regression, where all the tasks share the same inputs X 0 2R n 0 D , we can write X = X 0 I T , and noise becomes D = diag([ 1 ; ; T ]) I n 0 . The covariance K = ( M m=2 K m ) (X 0 )K 1 (X 0 ) > = ( M m=2 K m ) K x , where M m=2 K m encodes task similarity and K x is the kernel matrix over inputs X 0 . When the number of modes M = 2, the model reduces to the multi-task Gaussian process (MTGP) model with free-form parameters [Bonilla et al., 2007]. Here we use factorization over Kronecker 94 product operands as the low-rank approximation for task similarity matrix while MTGP uses Nystr om approximation. The Kronecker structure of covariance matrix (X)( M m=1 K m )(X) > allows us to compute the factorsfK m g M m=1 separately, which avoids inversion of the big covariance ma- trix K. This property of the Kronecker product has also been exploited in [Wilson et al., 2014] for multidimensional pattern extrapolation (GPatt). In there, inputs are assumed to be on a multidimensional grid x2X =X 1 X M , the covariance matrix has decom- position K = M m=1 K m where each factor K m is a kernel matrix over the spaceX m . From a modeling perspective, we use the Kronecker product to learn multi-directional task correlations while GPatt learns the multi-resolution structure of inputs. 7.4 Experiments We conduct experiments for a series of applications of multi-task learning. We demon- strate comparable prediction performances of MLGP , together with condence intervals. 7.4.1 Multi-linear Multi-task learning Multi-linear multi-task learning is a classic application of tensor regression. We evaluate on two benchmark prediction tasks: school exam scores and restaurant ratings. School exam scores contain 15; 362 students exam records with 21 features from 139 schools across 3 years. Restaurant ratings contain 3; 483 rating records with 45 features from 138 consumers for 3 aspects. We compare with the following baselines. 
(1) MLMTL-C [Romera-Paredes et al., 2013]: latent trace norm optimization with ADMM (2)MLMTL-S [Wimalawarne et al., 2014]: 95 0 500 1000 1500 2000 2500 3000 trainning size 0.35 0.4 0.45 0.5 0.55 0.6 0.65 0.7 MSE MLGP MLMTL-C MLMTL-S MOGP (a) Restaurant MSE 0 2000 4000 6000 8000 10000 12000 14000 trainning size 5 10 15 20 25 30 35 explaned variance MLGP MLMTL-C MLMTL-S MOGP (b) School EV Figure 7.3: Multi-linear multi-task learning performance with sample size for MLGP and baselines on bench mark datasets with standard metrics. (a) Mean square error on restaurant dataset. (b) Expected variance on school dataset. scaled latent trace norm optimization with ADMM, and (3) MOGP [Alvarez and Lawrence, 2011]: multi-output Gaussian process with DTC variational kernel. For MLGP, we use linear feature mapping as a fair comparison. For MOGP, we use 20 inducing points. We randomly selected from a range of 10% to 80% of the entire data set as the training set. We selected 10% instances as the validation set and the rest was used as the test set. The regularization parameter for each norm was selected by minimizing the mean squared error on the validation set. We repeat the experiments for 10 times and average the results. All the baselines are the implementations of the original authors. Figure 7.3(a) shows the restaurant rating prediction mean square error (MSE) for dierent methods over the number of training samples. Figure 7.3(b) demonstrates the expected variance (EV) for the task of school exam score prediction. We observe a superior performance of MLGP on restaurant data and comparable results for school data. When the size of the training data is small, MLGP shows signicant advantages for both tasks. This justies the benet of MLGP for sparse observations. 96 7.4.2 Spatiotemporal Forecasting spatiotemporal forecasting has been shown to be a special case of tensor regression, with an additional spatial Laplacian matrix [Yu and Liu, 2016]. We evaluate the spatiotem- poral forecasting performance for 4 datasets reported in the original paper. For all the datasets, each variable is normalized by removing mean and dividing by variance. A third-order vector auto-regressive (VAR-3) model is employed for multi-variate time se- ries modeling. We perform an 80=20 split along the time direction for training/testing and use validation to select the rank hyper-parameter. Table 7.1 displays the forecasting MSE comparison. We compare with the reported best algorithm Greedy [Yu et al., 2014] for this task. We also include matrix multi- task learning with trace-norm regularization (MTL-Trace) to justify the benet of the tensor-based approach. For all the 4 datasets, MLGP obtains similar prediction accuracy as Greedy. The predictive variance from MTGP directly provides empirical condence intervals, which we append to the MSE. (a) PRCP (b) TMAX (c) TMIN (d) SNOW (e) SNWD Figure 7.4: Contour plots for predictive variance of MLGP with respect to 5 climate variables in USHCN-CA data. Color-coded with yellow for high value and blue for low variance. To better understand the learned predictive distribution, we use a ne-grained USHCN dataset from California 2 and visualize the predictive variance of dierent locations on 2 http://cdiac.ornl.gov/epubs/ndp/ushcn/ushcn.html 97 Table 7.1: spatiotemporal forecasting mean square error comparison of MLGP and base- lines. 
Dataset MLGP Greedy MLMTL-C MLMTL-S MTL-Trace USHNC-US 0:8973 0:0008 0:9069 0:9528 0:9543 0:9273 CCDS 0:8498 0:0013 0:8325 0:9105 0:8394 0:8632 FSQ 0:1248 0:0006 0:1223 0:1495 0:1243 0:1245 YELP 1:0725 0:0007 NA 1:0857 1:0876 1:0736 the map. We interpolate the variance values across locations and draw the contour plots. Figure 7.4 shows the predictive variance learned by MLGP over 54 locations for 5 climate variables. We observe interesting correlations between the predictive variance and geographical attributes. For example, precipitation (PRCP) and maximum temperate (TMAX) have relatively low-variance due to the subtropical climate in California. Snow depth (SNWD) demonstrates high variance along the mountains in the Sierra Nevada. 7.4.3 Multi-output Regression Multiple output regression concerns with the case when predictor tensor is shared among all of the responses. One such application is the foreign exchange rate prediction task [Alvarez and Lawrence, 2011]. The original dataset contains 3 precious metals and 12 international currencies. To show the benet of exploiting multi-directional task inter- dependence, we select the foreign exchange rate of 6 international currencies ( EUR, GBP, CHF, JPY, HKD, KRW) and 3 precious metals (gold, silver, and platinum), which forms three groups: precious metal, European currency, and Asian currency. The dataset consists of all the data available for the 251 working days in the year of 2007. We use the VAR-3 model for time series modeling for a fair comparison. MLGP achieves 0:0563 MSE while best performance of low-rank tensor regression is 0:0657. 98 0 50 100 150 200 250 600 650 700 750 800 850 900 950 XAU (a) XAU 0 50 100 150 200 250 1.25 1.3 1.35 1.4 1.45 1.5 1.55 EUR (b) EUR 0 50 100 150 200 250 8 8.2 8.4 8.6 8.8 9 9.2 9.4 #10 -3 JPY (c) JPY Figure 7.5: Predictive mean (solid line) and variance (shaded area) from MLGP for the foreign exchange rate of XAU, EUR and JPY. Magenta points are observations. These results are slightly worse than 0:0301 of MOGP with PITC approximation. How- ever, since MLGP does not require all the responses to be of equal size, it runs much faster than MOGP, which involves a missing value imputation step to satisfy the size constraint. To further interpret the learned model, we plot out the predictive mean and variance together with observations in Figure 7.5. We notice high predictive variance whenever the time series encounters sharp changes. 7.5 Conclusion In this paper, we establish an interesting connection between two popular model classes: tensor regression and Gaussian process. We show that tensor regression is equivalent to learning special GP model: MLGP. With the low-rank structure, the constrained Bayesian estimator learns a smooth surrogate for the rank constraint in tensor regression. We prove oracle inequality and average case learning curve for MLGP. The performance of the model depends on the eigenvalues of the covariance matrix and task. correlations. Comparable (if not better) performance are observed in a series of real-world applications. 99 This relationship hints upon our choice of tools for multi-way data analysis. Tensor regression is fast and simple to implement. It is guaranteed to output orthonormal basis of the latent subspaces but does not generate condence intervals. MLGP, on the other hand, can handle sparse observations better, and is more versatile and is easier to interpret. 
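Both the forecasting and the exchange-rate experiments above fit VAR-3 models on standardized series. A minimal sketch of that pre-processing is given below, assuming the raw data is stored as a (time x location x variable) array; the lag-stacking convention and the toy sizes are my own choices for illustration, not taken from the thesis code.

import numpy as np

def standardize(X):
    # normalize each individual series: remove the mean, divide by the std
    return (X - X.mean(axis=0, keepdims=True)) / X.std(axis=0, keepdims=True)

def var_design(X, lag=3):
    # VAR(lag): response at time t against the previous `lag` observations,
    # stacked along the location axis of the predictor tensor
    T = X.shape[0]
    Y = X[lag:]
    X_lagged = np.concatenate([X[lag - l:T - l] for l in range(1, lag + 1)], axis=1)
    return X_lagged, Y

rng = np.random.default_rng(0)
raw = rng.standard_normal((500, 20, 5))        # time x location x variable (toy data)
X_lagged, Y = var_design(standardize(raw), lag=3)
split = int(0.8 * Y.shape[0])                  # 80/20 temporal train/test split
X_tr, Y_tr, X_te, Y_te = X_lagged[:split], Y[:split], X_lagged[split:], Y[split:]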
One possible future directions is the kernel tensor regression method, so it can go beyond the current linear model and share the same exibility as Gaussian processes. 100 Chapter 8 Summary, Discussion and Future Work In this thesis, we proposed a scalable low-rank tensor learning framework, with applica- tions in spatiotemporal analysis. We show that using the language of tensors, we can formulate many critical spatiotemporal analysis tasks such as cokriging, forecasting and multi-model ensemble as tensor learning problems. Besides spatiotemporal analysis, ten- sor regression has many other applications in multi-way data analysis, multiple output regression and multi-linear multi-task learning. Tensor regression exploits the inherent sparsity structure in the data to obtain com- putational eciency. Low-rank tensor regression utilizes tensor constraint to represent high-order correlation and alleviates model over-tting. We developed ecient batch learning, online learning and memory ecient learning algorithms with theoretical guar- antees. We also established the connections of the tensor learning framework with classic spatiotemporal Gaussian processes models. We demonstrate that on real-world spatiotemporal applications including climate fore- casting, geographic interpolation, and social network activity prediction, our framework 101 achieved signicant speed-up with comparable prediction accuracy. These observations are consistent in oine, online and memory constrained setting. In terms of future work, there are several extensions to the current framework: Generalize the linear time series model to handle non-linear structures in the data. This can be done either by incorporating techniques from functional regression or kernel methods. Technical challenges exist for dealing with scalability issue in similar batch, online and memory constrained learning scenarios. Consider the trade-o between computational and statistical eciency of spatiotem- poral analysis. The focus of this thesis is scalability and much of the work was dedicated to the computational aspect of the learning algorithms. Considering sta- tistical ecient requires careful analysis of the sample complexity, especially in the setting where samples are not i.i.d. Improve the tensor learning framework to be robust to random noise and even ad- versarial inputs. To enable broad applicability of spatiotemporal analysis methods, the robustness of the algorithms is an important aspect. This could mean new regu- larization techniques that can improvement model generalization, or novel learning protocol that can dierentiate dierent modes of the distribution. In summary, this is the rst tensor learning framework for large-scale multivariate spatiotemporal analysis. This thesis reveals the benets of incorporating high-order cor- relations in the data, and demonstrates the eectiveness of tensor algebra for modeling high-dimensional spatiotemporal process. 102 Appendix A Appendix for Chapter 4 A.0.1 Example of Tensor Unfolding Let us consider the simpler case of multi-task regression in dierent domains Y 2R qr and X 2 R pr represent the response and predictor variables in dierent tasks. The parameter tensorA2 R qpr models the relationship between the predictors and the response variables, as follows: Y (i;j) = p X k=1 A(i;k;j)X(k;j) +"(i;j) (A.1) Unfolding in Mode 1 A (1) = 2 6 6 6 4 A(1; 1; 1) A(1;p; 1) A(1;p;r) A(2; 1; 1) A(2;p; 1) A(2;p;r) . . . . . . . . . A(q; 1; 1) A(q;p; 1) A(q;p;r) 3 7 7 7 5 We can rewrite Eq. 
(A.1) Y (:;j) =A (1) (:; (p(j 1) + 1) :pj)X(:;j) +"(:;j) (A.2) Thus, we can write the loss function in the compact form min n YA (1) X 2 F o (A.3) whereY = [Y (:; 1);Y (:; 2);:::;Y (:;r)] andX = [X(:; 1);X(:; 2);:::;X(:;r)] Unfolding in Mode 2 A (2) = 2 6 6 6 4 A(1; 1; 1) A(q; 1; 1) A(q; 1;r) A(1; 2; 1) A(q; 2; 1) A(q; 2;r) . . . . . . . . . A(1;p; 1) A(q;p; 1) A(q;p;r) 3 7 7 7 5 103 Y (:;j) = [A (2) (:; (q(j 1) + 1) :qj)] > X(:;j) +"(:;j) (A.4) Unfolding in Mode 3 A (3) = 2 6 6 6 4 A(1; 1; 1) A(q; 1; 1) A(q;p; 1) A(1; 1; 2) A(q; 1; 2) A(q;p; 2) . . . . . . . . . A(1; 1;r) A(q; 1;r) A(q;p;r) 3 7 7 7 5 For simplicity of notation, we can permute the columns ofA (3) to write as: A 0 (3) = 2 6 6 6 4 A(1; 1; 1) A(1;p; 1) A(q;p; 1) A(1; 1; 2) A(1;p; 2) A(q;p; 2) . . . . . . . . . A(1; 1;r) A(1;p;r) A(q;p;r) 3 7 7 7 5 Now reshape each row ofA 0 (3) into qp matrices: denote the matrix resulting from the j th row as A 0 (3);j . We can rewrite Eq. (A.1) as Y (:;j) =A 0 (3);j X(:;j) +"(:;j) (A.5) A.0.2 Proof of Lemma 4.2.1 Proof. The original problem has the following form: b A = argmin A:rank(A)=1 n kYAXk 2 F o (A.6) We can rewrite the optimization problem in Eq. (A.6) as estimation of 2 R, u2R q1 ;kuk 2 = 1, and v2R p1 ;kvk 2 = 1 such that: b ;b u;b v = argmin ;u;v:kuk 2 =1;kvk 2 =1 Yuv > X 2 F (A.7) We will minimize the above objective function in three steps: First, minimization in terms of yields b =hY; uv > Xi=kuv > Xk 2 F , where we have assumed that v > X 6= 0. Hence, we have: b u;b v = argmax u;v:kuk 2 =1;kvk 2 =1 tr((uv > X) > Y ) 2 kuv > Xk 2 F (A.8) The objective function can be rewritten tr (uv > X) > Y = tr X > vu > Y = tr YX > vu > . Some algebra work on the denominator yieldskuv > Xk 2 F = tr (uv > X) > (uv > X) = tr X > vu > uv > X = tr X > vv > X = v > XX > v. This implies that the denominator is independent of u and the optimal value of u in Eq. (A.8) is proportional toYX > v. Hence, 104 we need to rst nd the optimal value of v and then compute u = (YX > v)=kYX > vk 2 . Substitution of the optimal value of u yields: b v = argmax v:kvk 2 =1 v > XY > YX > v v > XX > v (A.9) Note that the objective function is bounded and invariant ofkvk 2 , hence thekvk 2 = 1 constraint can be relaxed. Now, suppose the value of v > XX > v for optimal choice of vectors v is t. We can rewrite the optimization in Eq. (A.9) as b v = argmax v v > XY > YX > v s:t: v > XX > v =t (A.10) Using the Lagrangian multipliers method, we can show that there is a value for such that the solution b v for the dual problem is the optimal solution for Eq. (A.10). Hence, we need to solve the following optimization problem for v: b v = argmax v:kvk 2 =1 n v > XY > YX > vv > XX > v o = argmax v:kvk 2 =1 n v > X(Y > YI)X > v o (A.11) Eq. (A.11) implies that v is the dominant eigenvector ofX(Y > YI)X > . Hence, we are able to nd the optimal value of both u and v for the given value of. For simplicity of notation, let's deneP,XX > andQ,XY > YX > . Consider the equations obtained by solving the Lagrangian dual of Eq. (A.10): Qv =Pv (A.12) kv > Xk 2 2 =t; (A.13) 0: (A.14) Eq. (A.12) describes a generalized positive denite eigenvalue problem. Hence, we can select max = 1 (Q;P ) which maximizes the objective function in Eq. (A.9). The optimal value of u can be found by substitution of optimal v in Eq. (A.8) and simple algebra yields the result in Lemma 4.2.1. 
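A small numerical sketch of the closed form just derived: v is taken as the dominant generalized eigenvector of (X Y^T Y X^T, X X^T), u is proportional to Y X^T v, and the scale follows Eq. (A.7). The small ridge added to X X^T is a numerical safeguard of my own and not part of the lemma; scipy is used for the generalized eigenproblem.

import numpy as np
from scipy.linalg import eigh

def optimal_rank1(X, Y, ridge=1e-8):
    # Lemma 4.2.1: dominant generalized eigenvector of (Q, P) gives v,
    # u is proportional to Y X^T v, and alpha is the scale from Eq. (A.7)
    P = X @ X.T + ridge * np.eye(X.shape[0])
    Q = X @ Y.T @ Y @ X.T
    v = eigh(Q, P)[1][:, -1]                   # eigenvalues ascend, take the last vector
    u = Y @ X.T @ v
    u /= np.linalg.norm(u)
    alpha = (u @ Y @ X.T @ v) / (v @ X @ X.T @ v)
    return alpha * np.outer(u, v)              # rank-one q x p solution of Eq. (A.6)

rng = np.random.default_rng(0)
p, q, n = 5, 4, 50
X, Y = rng.standard_normal((p, n)), rng.standard_normal((q, n))
B = optimal_rank1(X, Y)                        # rank-one fit of Y by B X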
A.0.3 Proof of Theorem 4.2.2 Note that intuitively, since our greedy steps are optimal in the rst mode, we can see that our bound should be at least as tight as the bound of [Shalev-Shwartz et al., 2011]. Here is the formal proof of Theorem 4.2.2. Proof. Let's denote the loss function at k th step by L(Y;V;W k ) = r X j=1 kV (:;:;j) W(:; :;j)Y (:;:;j) k 2 F (A.15) 105 Lines 5{8 of Algorithm 1 imply: L(Y;V;W k )L(Y;V;W k+1 ) =L(Y;V;W k ) min m inf rank(B)=1 L(Y;V;W (m);k +B) L(Y;V;W k ) inf rank(B)=1 L(Y;V;W (1);k +B) (A.16) Let's dene B =C where 2R; rank(C) = 1; andkCk 2 = 1. We expand the right hand side of Eq. (A.16) and write: L(Y;V;W k )L(Y;V;W k+1 ) sup ;C:rank(C)=1;kCk 2 =1 2hCY;VW (1);k Yi 2 kCYk 2 F ; whereY andV are used for denoting the matrix created by repeatingY (:;:;j) andV (:;:;j) on the diagonal blocks of a block diagonal matrix, respectively. Since the algorithm nds the optimal B, we can maximize it with respect to which yields: L(Y;V;W k )L(Y;V;W k+1 ) sup C:rank(C)=1;kCk 2 =1 hCY;VW (1);k Yi 2 kCYk 2 F sup C:rank(C)=1;kCk 2 =1 1 max (Y ) 2 hCY;VW (1);k Yi 2 = sup C:rank(C)=1;kCk 2 =1 1 max (Y ) 2 hC; (VW (1);k Y )Y > i 2 = max (VW (1);k Y )Y > 2 max (V ) Dene the residual R k =L(Y;V;W k )L(Y;V;W ). Note that(V W (1);k Y )Y > is the gradient of the residual function with respect toW (1);k . Since the operator norm and the nuclear norms are dual of each other, using the properties of dual norms we can write for any two matrices A and B hA;BikAk 2 kBk (A.17) Thus, using the convexity of the residual function, we can show that R k R k+1 r W (1);k R k 2 kW (1);k W (1) k 2 max (Y ) 2 kW (1);k W (1) k 2 (A.18) R 2 k max (Y ) 2 kW (1);k W (1) k 2 (A.19) R 2 k max (Y ) 2 kW 2 (1) k 2 (A.20) 106 The sequence in Eq. (A.20) converges to zero according to the following rate [Shalev-Shwartz et al., 2010, Lemma B.2] R k ( max (Y )kW (1) k ) 2 (k + 1) The step in Eq. (A.19) is due to the fact that the parameter estimation error decreases as the algorithm progresses. This can be seen by noting that the minimum eigenvalue assumption ensures strong convexity of the loss function. A.0.4 Convex Relaxation with ADMM A convex relaxation approach replaces the constraint rank(W (n) ) with its convex hull kW (n) k . The mixture regularization in [Tomioka et al., 2013] assumes that the N-mode tensorW is a mixture of N auxiliary tensorsfZ n g, i.e.,W = P N n=1 Z n . It regularizes the nuclear norm of only the mode-n unfolding for then th tensorZ n , i.e, P N n=1 kZ n (n) k . The resulting convex relaxed optimization problem is as follows: c W = argmin W ( L(W;Y;V) + N X n kZ n (n) k s:t: N X n Z n =W ) (A.21) We adapt Alternating Direction Methods of Multiplier (ADMM) [Gabay and Mercier, 1976] for solving the above problem. Due to the coupling offZ n g in the summation, eachZ n is not directly separable from otherZ n 0 . Thus, we employ the coordinate descent algorithm to sequentially solvefZ n g. Given the augmented Lagrangian of problem as follows, the ADMM-based algorithm is elaborated in Algo. 6. F (W;fZ n g;C) =L(W;Y;V) + N X n=1 kZ n (n) k + 2 X n kW X n Z n k 2 F hC;W N X n=1 Z n i (A.22) Algorithm 6 ADMM for solving Eq. (3.6) 1: Input: transformed dataY;V of M variables, hyper-parameters ;. 2: Output: N mode tensorW 3: InitializeW;fZ n g;C to zero. 4: repeat 5: W argmin W n L(W;Y;V) + 2 kW P N n=1 Z n Ck 2 F o . 6: repeat 7: for variable n = 1 to N do 8: Z n (n) = shrink W (n) 1 C P n 0 6=n Z n 0 (n 0 ) . 9: end for 10: until solutionfZ n g converge 11: C C(W N P n=1 Z n ). 
12: until objective function converges 107 The sub-routine shrink (A) applies a soft-thresholding rule at level to the singular values of the input matrix A. The following lemma shows the convergence of ADMM- based solver for our problem. Lemma A.0.1. [Bertsekas and Tsitsiklis, 1989] For the constrained problem min x;y f(x) + g(y); s:t x2 C x ;y2 C y ;Gx = y, If eitherfC x ;C y g are bounded or G 0 G is invertible, and the optimal solution set is nonempty. A sequence of solutionsfx;yg generated by ADMM is bounded and every limit point is an optimal solution of the original problem. A.0.5 Derivation of the Unied Formulation In this section, we demonstrate how we can use Eq. (3.6) to solve Eqs. (3.4) and (3.5). In the cokriging problem, it is easy to see that withY :;:;m = H andV :;:;m =X ;m for m = 1;:::;M the problems are equivalent. In the forecasting problem, H is full rank and the mapping dened byW 7! ~ W : ~ W :;:;m = HW :;:;m for m = 1;:::;M preserves the tensor rank, i.e., rank(W) = rank( ~ W). This suggests that we can solve Eq. (3.4) as follows: rst solve Eq. (3.6) withY :;:;m = X K+1:T;m andV :;:;m =X :;:;m and obtain its solution as ~ W; then computeW :;:;m =H 1 ~ W :;:;m . 108 Appendix B Appendix for Chapter 5 B.0.1 Matrix Update Problem Given (Y 1 ;X 1 )2 (R qn 1 ;R pn 1 ), we have estimated b A 1 that minimizeskY 1 AX 1 k F . Now we receive some more data: (Y 2 ;X 2 )2 (R qn 2 ;R pn 2 ), where n 2 n 1 . Denote Y = [Y 1 ;Y 2 ] and X = [X 1 ;X 2 ]. The goal is to nd b A such that it minimizes kYAXk F . Solution: We have b A 1 = (Y 1 X > 1 )(X 1 X > 1 ) 1 . The goal is to compute the following quantity eciently: b A = (YX > )(XX > ) 1 = (Y 1 X > 1 +Y 2 X > 2 )(X 1 X > 1 +X 2 X > 2 ) 1 Given the Woodbury identity (D +CC > ) 1 =D 1 D 1 C(I +C > D 1 C) 1 C > D 1 , if we deneB =D 1 C, we have (D +CC > ) 1 =D 1 B(I +C > B) 1 B > . Thus, dening B = (X 1 X > 1 ) 1 X 2 , , E =B(I +X > 2 B) 1 B > , we can write: b A = (Y 1 X > 1 +Y 2 X > 2 ) (X 1 X > 1 ) 1 B(I +X > 2 B) 1 B > ; = b A 1 +Y 2 X > 2 (X 1 X > 1 ) 1 Y 1 X > 1 EY 2 X > 2 E: Thus, we have found the required update term and it only depends on ecient matrix multiplications: A = Y 2 B > Y 1 X > 1 Y 2 X > 2 E Thus, we can design an algorithm that whenever receives a new set of samples, updates its version of (XX > ) 1 and b A and becomes ready for the next update. B.0.2 Reformulation Dene H as the Cholesky decomposition of I P +L HH > = I P +L. Since H is full rank and the mapping dened byW7! ~ W : ~ W :;:;m = HW :;:;m form = 1;:::;M preserves 109 the tensor rank, i.e., rank(W) = rank( ~ W). We can rewrite the loss function in Equation 3.2 and 3.3 as M X m=1 T X t=K+1 kHW :;:;m X t;m (H 1 )X :;t;m k 2 F This suggests that we can solve rst solve the quadratic loss P M m=1 kW :;:;m Y :;:;m V :;:;m k 2 F withY :;:;m = X K+1:T;m andV :;:;m =X :;:;m and obtain its solution as ~ W; then compute W :;:;m = H 1 ~ W :;:;m . B.0.3 Multi-model Data Seven models are selected fromhttp://www-pcmdi.llnl.gov/ipcc/model_documentation/ ipcc_model_documentation.php 1. BCCR-BCM2.0: Norway Bjerknes Centre for Climate Research 2. CGCM3.1(T47): Canadian Centre for Climate Modelling 3. INM-CM3.0: Russia Institute for Numerical Mathematics 4. MRI-CGCM2.3.2: Japan Meteorological Research Institute 5. MIROC3.2(hires): Japan Center for Climate System Research 6. MIROC3.2(medres): Japan Center for Climate System Research 7. FGOALS-g1.0: China ASG / Institute of Atmospheric Physics China The variables are then downsampled into a 5' by 5' latitude-longitude grid. 
B.0.2 Reformulation

Define $H$ as the Cholesky factor of $I_P + L$, i.e., $HH^\top = I_P + L$. Since $H$ is full rank and the mapping defined by $\mathcal{W} \mapsto \widetilde{\mathcal{W}}$ with $\widetilde{\mathcal{W}}_{:,:,m} = H\mathcal{W}_{:,:,m}$ for $m = 1,\dots,M$ preserves the tensor rank, i.e., $\operatorname{rank}(\mathcal{W}) = \operatorname{rank}(\widetilde{\mathcal{W}})$, we can rewrite the loss function in Equations 3.2 and 3.3 as
$$\sum_{m=1}^{M}\sum_{t=K+1}^{T} \big\| H\,\mathcal{W}_{:,:,m} X_{t,m} - (H^{-1})\,X_{:,t,m} \big\|_F^2 .$$
This suggests that we can first solve the quadratic loss $\sum_{m=1}^{M}\|\mathcal{W}_{:,:,m}\mathcal{Y}_{:,:,m} - \mathcal{V}_{:,:,m}\|_F^2$ with $\mathcal{Y}_{:,:,m} = X_{K+1:T,m}$ and $\mathcal{V}_{:,:,m} = X_{:,:,m}$ and obtain its solution $\widetilde{\mathcal{W}}$; then compute $\mathcal{W}_{:,:,m} = H^{-1}\widetilde{\mathcal{W}}_{:,:,m}$.

B.0.3 Multi-model Data

Seven models are selected from http://www-pcmdi.llnl.gov/ipcc/model_documentation/ipcc_model_documentation.php:
1. BCCR-BCM2.0: Norway Bjerknes Centre for Climate Research
2. CGCM3.1(T47): Canadian Centre for Climate Modelling
3. INM-CM3.0: Russia Institute for Numerical Mathematics
4. MRI-CGCM2.3.2: Japan Meteorological Research Institute
5. MIROC3.2(hires): Japan Center for Climate System Research
6. MIROC3.2(medres): Japan Center for Climate System Research
7. FGOALS-g1.0: China LASG / Institute of Atmospheric Physics

The variables are then downsampled onto a 5° by 5° latitude-longitude grid. For flux-like variables, first-order conservative mapping is used; for pressures and temperatures, bilinear interpolation is used. The 19 variables we use are 'lhtfl.sfc', 'shtfl.sfc', 'shum.2m', 'soilw.0-10cm', 'prate.sfc', 'cprat.sfc', 'pres.sfc', 'mslp', 'dlwrf.sfc', 'ulwrf.sfc', 'ulwrf.ntat', 'dswrf.sfc', 'dswrf.ntat', 'uswrf.sfc', 'uswrf.ntat', 'air.2m', 'skt.sfc', 'vwnd.10m', and 'runof.sfc'. The meaning of the variables is available at http://esg.llnl.gov:8080/about/ipccTables.do

Appendix C
Appendix for Chapter 7

C.0.1 Eigenvalue Problem

Using the Kronecker product property $\bigotimes_{m=1}^{3} U_m U_m^\top = \big(\bigotimes_{m=1}^{3} U_m\big)\big(\bigotimes_{m=1}^{3} U_m\big)^\top$, we can rewrite the covariance matrix as
$$K = \Big(\Phi(X)\bigotimes_{m=1}^{3} U_m\Big)\Big(\Phi(X)\bigotimes_{m=1}^{3} U_m\Big)^\top .$$
Denote $\widetilde{U} = \Phi(X)\bigotimes_{m=1}^{3} U_m$ and let the singular value decomposition of $\widetilde{U}$ be $\widetilde{U} = U_x \Sigma_x V_x^\top$. Taking the derivative of the log-likelihood $L$ with respect to $\widetilde{U}$ and setting it to zero, we obtain the stationary point condition
$$\mathbf{y}\mathbf{y}^\top (K + D)^{-1}\,\widetilde{U} = \widetilde{U}.$$
Further denoting the eigen-decomposition of the output covariance by $\mathbf{y}\mathbf{y}^\top = U_y \Lambda_y U_y^{-1}$, we have $U_x = U_y$ and $\Sigma_x = (\Lambda_y - D)^{1/2}$. Given the decomposition $\widetilde{U} = U_x \Sigma_x V_x^\top$, similar to [Lawrence, 2004], we have
$$\mathbf{y}\mathbf{y}^\top (K + D)^{-1}\,\widetilde{U} = \widetilde{U}$$
$$\mathbf{y}\mathbf{y}^\top (K + D)^{-1}\,U_x \Sigma_x V_x^\top = U_x \Sigma_x V_x^\top$$
$$\mathbf{y}\mathbf{y}^\top U_x (\Sigma_x + D\Sigma_x^{-1})^{-1} V_x^\top = U_x \Sigma_x V_x^\top$$
$$\mathbf{y}\mathbf{y}^\top U_x = U_x (\Sigma_x^2 + D),$$
which is an eigenvalue problem in the transformed space.

C.0.2 Derivatives for the Optimization

Given that $\mathbf{y} \sim \mathcal{N}(0, K + D)$, where $K = \Phi(X)\big(\bigotimes_{m=1}^{M} K_m\big)\Phi(X)^\top = \widetilde{U}\widetilde{U}^\top$, the negative log-likelihood is
$$L = \tfrac{1}{2}\,\mathbf{y}^\top(\widetilde{U}\widetilde{U}^\top + D)^{-1}\mathbf{y} + \tfrac{1}{2}\log\det(\widetilde{U}\widetilde{U}^\top + D) + \mathrm{const}.$$
By the Woodbury lemma, $(\widetilde{U}\widetilde{U}^\top + D)^{-1} = D^{-1} - D^{-1}\widetilde{U}(D + \widetilde{U}^\top\widetilde{U})^{-1}\widetilde{U}^\top$, and by the matrix determinant lemma, $\det(\widetilde{U}\widetilde{U}^\top + D) = \det(I + \widetilde{U}^\top D^{-1}\widetilde{U})\det(D) \propto \det(D + \widetilde{U}^\top\widetilde{U})$. Denote $\Sigma = D + \widetilde{U}^\top\widetilde{U}$ and let $\mathbf{w} = \Sigma^{-1}\widetilde{U}^\top\mathbf{y}$. The objective function can be rewritten as
$$L = \tfrac{1}{2}D^{-1}\mathbf{y}^\top\mathbf{y} - \tfrac{1}{2}D^{-1}\mathbf{y}^\top\widetilde{U}\Sigma^{-1}\widetilde{U}^\top\mathbf{y} + \tfrac{1}{2}\log\det(\Sigma) + \mathrm{const}.$$
Taking the derivative with respect to $U_{m(i,j)}$, we have
$$\frac{\partial L}{\partial U_{m(i,j)}} = \operatorname{tr}\!\left[\Big(\frac{\partial L}{\partial \widetilde{U}}\Big)^{\!\top}\Big(\frac{\partial \widetilde{U}}{\partial U_{m(i,j)}}\Big)\right],\qquad
\frac{\partial L}{\partial \widetilde{U}} = \widetilde{U}\big(\Sigma^{-1} + \mathbf{w}D^{-1}\mathbf{w}^\top\big) - \mathbf{y}D^{-1}\mathbf{w}^\top,$$
$$\frac{\partial \widetilde{U}}{\partial U_{m(i,j)}} = \Phi(X)\Big(U_M \otimes \cdots \otimes \frac{\partial U_m}{\partial U_{m(i,j)}} \otimes \cdots \otimes U_1\Big) = \Phi(X)\big(U_M \otimes \cdots \otimes O_{m(i,j)} \otimes \cdots \otimes U_1\big).$$
Here $O_{m(i,j)} = \mathbf{e}_i\mathbf{e}_j^\top$ is a matrix of all zeros except for a one in the $(i,j)$th entry.

The predictive distribution is $p(y_\star \mid \mathbf{x}_\star, X, \mathbf{y}) \sim \mathcal{N}(\mu_\star, \sigma_\star)$ with
$$\mu_\star = k(\mathbf{x}_\star, X)\big(D^{-1} - D^{-1}\widetilde{U}(D + \widetilde{U}^\top\widetilde{U})^{-1}\widetilde{U}^\top\big)\mathbf{y},$$
$$\sigma_\star = k(\mathbf{x}_\star, \mathbf{x}_\star) - k(\mathbf{x}_\star, X)\big(D^{-1} - D^{-1}\widetilde{U}(D + \widetilde{U}^\top\widetilde{U})^{-1}\widetilde{U}^\top\big)k(X, \mathbf{x}_\star),$$
where $\widetilde{U} = \Phi(X)\big(\bigotimes_{m=1}^{M} U_m\big)$.
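Numerically, the Woodbury form above means the marginal likelihood never requires the full $N \times N$ covariance. The following is a minimal numpy sketch of that computation under the simplifying assumption of isotropic noise $D = \sigma^2 I$; the feature matrix `Phi`, the factor list, and all sizes are illustrative placeholders rather than the exact setup of Chapter 7.

```python
import numpy as np
from functools import reduce


def mlgp_nll(y, Phi, U_list, sigma2):
    """Negative log-likelihood 0.5*y^T C^{-1} y + 0.5*log|C|, with
    C = U_tilde U_tilde^T + sigma2*I and U_tilde = Phi @ (kron of factors in U_list),
    evaluated through the Woodbury and matrix-determinant lemmas."""
    U_tilde = Phi @ reduce(np.kron, U_list)            # N x R, with R << N
    N, R = U_tilde.shape
    Sigma = sigma2 * np.eye(R) + U_tilde.T @ U_tilde   # R x R capacitance matrix
    w = np.linalg.solve(Sigma, U_tilde.T @ y)          # w = Sigma^{-1} U~^T y
    quad = (y @ y - y @ (U_tilde @ w)) / sigma2        # y^T C^{-1} y
    _, logdet_Sigma = np.linalg.slogdet(Sigma)
    logdet = logdet_Sigma + (N - R) * np.log(sigma2)   # log det C
    return 0.5 * quad + 0.5 * logdet


# Toy usage with made-up sizes: three factors of rank 2 each.
rng = np.random.default_rng(0)
N, F = 50, 8 * 4 * 2
Phi = rng.normal(size=(N, F))
U_list = [rng.normal(size=(8, 2)), rng.normal(size=(4, 2)), rng.normal(size=(2, 2))]
y = rng.normal(size=N)
print(mlgp_nll(y, Phi, U_list, sigma2=0.1))
```

Every operation scales with the small Kronecker factor dimension $R$ rather than with $N$, which is what makes the multilinear structure practical to optimize.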
C.0.3 Proof for Proposition 7.3.1

Consider a 3-mode $T_1 \times T_2 \times T_3$ tensor $\mathcal{W}$ of functions with mode-1 unfolding $\mathcal{W}_{(1)} = [\mathbf{w}_1(X), \dots, \mathbf{w}_T(X)]$ and Tucker decomposition
$$\mathcal{W} = \mathcal{S} \times_1 U_1(X) \times_2 U_2 \times_3 U_3,$$
where $U_m$ is an orthogonal $T_m \times R_m$ matrix. Assume $U_1(X)$ satisfies $\mathbb{E}[U_1^\top U_1] = I$ (orthogonal design after rotation). By the Tucker property,
$$\mathcal{W}_{(1)} = U_1(X)\,\mathcal{S}_{(1)}(U_2 \otimes U_3)^\top.$$
The population risk can be written as
$$L(\mathcal{W}) = \operatorname{tr}\big\{(Y - \langle\mathcal{X},\mathcal{W}\rangle)(Y - \langle\mathcal{X},\mathcal{W}\rangle)^\top\big\}
= \operatorname{tr}\big\{\big(2I - \mathcal{S}_{(1)}(U_2\otimes U_3)^\top\big)^{\!\top}\,\mathbb{E}[\operatorname{cov}(Y, U_1(X))]\;\mathcal{S}_{(1)}(U_2\otimes U_3)^\top + \mathbb{E}(YY^\top)\big\}.$$
Denote $\mathbb{E}[\operatorname{cov}(Y, U_1(X))] = \Sigma(U_1)$ and bound the difference
$$L(\mathcal{W}) - \hat{L}(\mathcal{W}) = \operatorname{tr}\big\{\big(2I - \mathcal{S}_{(1)}(U_2\otimes U_3)^\top\big)^{\!\top}\big(\Sigma(U_1) - \hat{\Sigma}(U_1)\big)\,\mathcal{S}_{(1)}(U_2\otimes U_3)^\top\big\}$$
$$\le \big\|2I - \mathcal{S}_{(1)}(U_2\otimes U_3)^\top\big\|_\star\;\big\|\Sigma(U_1) - \hat{\Sigma}(U_1)\big\|_2\;\big\|\mathcal{S}_{(1)}(U_2\otimes U_3)^\top\big\|_\star
\;\le\; C\,\max\{2, \|\mathcal{S}_{(1)}\|_\star^2\}\;\big\|\Sigma(U_1) - \hat{\Sigma}(U_1)\big\|_2,$$
with $C$ a universal constant. The inequality holds by the Schatten-norm Hölder inequality $\|AB\|_{S_1} \le \|A\|_{S_p}\|B\|_{S_q}$ with $1/p + 1/q = 1$. We are given that
$$\sup_{U_1}\big\|\Sigma(U_1) - \hat{\Sigma}(U_1)\big\|_2 = O_P\!\left(\sqrt{\frac{T_2 T_3 + \log(T_1 T_2 T_3)}{N}}\right).$$
Denote the empirical risk $\hat{L} = \sum_{t=1}^{T}\sum_{i=1}^{n_t} L(\langle \mathbf{w}_t, \mathbf{x}_{t,i}\rangle)$ and let $\mathcal{W}^\star = \operatorname{arg\,inf}_{\mathcal{W}\in\mathcal{C}} L(\mathcal{W})$. Since $\hat{\mathcal{W}}$ minimizes the empirical risk, $\hat{L}(\hat{\mathcal{W}}) - \hat{L}(\mathcal{W}^\star) \le 0$, so the excess risk satisfies
$$L(\hat{\mathcal{W}}) - L(\mathcal{W}^\star) = \big[L(\hat{\mathcal{W}}) - \hat{L}(\hat{\mathcal{W}})\big] + \big[\hat{L}(\hat{\mathcal{W}}) - \hat{L}(\mathcal{W}^\star)\big] + \big[\hat{L}(\mathcal{W}^\star) - L(\mathcal{W}^\star)\big]$$
$$\le \big[L(\hat{\mathcal{W}}) - \hat{L}(\hat{\mathcal{W}})\big] - \big[L(\mathcal{W}^\star) - \hat{L}(\mathcal{W}^\star)\big]
\le 2\sup_{\mathcal{W}\in\mathcal{C}_N}\big\{L(\mathcal{W}) - \hat{L}(\mathcal{W})\big\}
\le O\big(\|\mathcal{S}_{(1)}\|_\star^2\,\|\Sigma(U_1) - \hat{\Sigma}(U_1)\|_2\big).$$
If we assume $\|\mathcal{S}_{(1)}\|_\star^2 = O\big(\big(\tfrac{N}{T_2 T_3 + \log(T_1 T_2 T_3)}\big)^{1/4}\big)$, then $L(\hat{\mathcal{W}}) - L(\mathcal{W}^\star) \le O(1)$, and we obtain the oracle inequality as stated.

C.0.4 Proof of Theorem 7.3.2

For Gaussian process regression, the Bayes error has the following form [Sollich and Halees, 2002]:
$$\hat{\epsilon} = \operatorname{tr}(\Lambda) - \operatorname{tr}\big[(D + \Psi\Lambda\Psi^\top)^{-1}\Psi\Lambda^2\Psi^\top\big] \tag{C.1}$$
for the low-rank case, and
$$\hat{\epsilon} = \operatorname{tr}\big[(\Lambda^{-1} + \Psi^\top D^{-1}\Psi)^{-1}\big] \tag{C.2}$$
for the full-rank case. Here $\Lambda$ and $\Psi$ are the eigenvalue and eigenfunction components of the covariance decomposition; the size of $\Lambda$ is equal to the number of kernel eigenfunctions. When the GP has a non-degenerate kernel, $\Lambda$ is full rank, and applying the Woodbury lemma to Equation C.1 yields the simplified form in Equation C.2. We can obtain a corresponding lower bound for the average-case learning curve. Here we provide the derivation for the full-rank case, but similar results apply to the low-rank case as well.

The Bayes error for the full-rank covariance model is
$$\hat{\epsilon} = \operatorname{tr}\big(\Lambda_0^{-1} + \Psi^\top D^{-1}\Psi\big)^{-1}.$$
To obtain the learning curve $\epsilon = \mathbb{E}_D[\hat{\epsilon}]$, it is useful to see how the matrix $G = (\Lambda^{-1} + \Psi^\top D^{-1}\Psi)^{-1}$ changes with sample size. $\Psi^\top\Psi$ can be interpreted as the input correlation matrix. To account for the fluctuations of the elements of $\Psi^\top\Psi$, we introduce auxiliary offset parameters $\{v_t\}$ into the definition of $G$: define the resolvent matrix
$$G^{-1} = \Lambda^{-1} + \Psi^\top D^{-1}\Psi + \sum_t v_t P_t,$$
where $P_t$ is short for $P_{t_1,\dots,t_M}$, which defines the projection of the $t$th task onto its multi-directional indexes. Evaluating the change,
$$G(n+1) - G(n) = \big[G^{-1}(n) + \sigma_t^{-2}\,\boldsymbol{\psi}_t\boldsymbol{\psi}_t^\top\big]^{-1} - G(n) = -\frac{G(n)\,\boldsymbol{\psi}_t\boldsymbol{\psi}_t^\top\,G(n)}{\sigma_t^2 + \boldsymbol{\psi}_t^\top G(n)\boldsymbol{\psi}_t},$$
where the elements $(\boldsymbol{\psi}_t)_i = \delta_{\kappa(n+1),t}\,\phi_i(x_{n+1})$ and $\kappa$ maps the global sample index to the task-specific sample index. Introducing $\bar{G} = \mathbb{E}_D[G]$ and taking the expectation over the numerator and denominator separately, we have
$$\frac{\partial \bar{G}}{\partial n_t} = -\frac{\mathbb{E}_D[G P_t G]}{\sigma_t^2 + \operatorname{tr} P_t\bar{G}}.$$
Since the generalization error is $\epsilon_t = \operatorname{tr} P_t\bar{G}$, we have $\mathbb{E}_D[G P_t G] = -\frac{\partial}{\partial v_t}\mathbb{E}_D[G] = -\frac{\partial \bar{G}}{\partial v_t}$. Multiplying by $P_s$ on both sides yields the approximation for the expected change:
$$\frac{\partial\operatorname{tr} P_s\bar{G}}{\partial n_t} = \frac{\partial \epsilon_s}{\partial n_t} = \frac{1}{\sigma_t^2 + \epsilon_t}\,\frac{\partial \epsilon_s}{\partial v_t}.$$
Solving for $\epsilon_t(N, v)$ using the method of characteristic curves and resetting $v$ to zero gives the self-consistency equations
$$\epsilon_t(N) = \operatorname{tr} P_t\Big(\Lambda_0^{-1} + \sum_s \frac{n_s}{\sigma_s^2 + \epsilon_s}P_s\Big)^{-1}.$$
For MLGP, due to the task hierarchy, a task index $t$ is projected to a set of indexes $\{t_m\}$ along the different modes of a tensor. Define the projection on the $m$th mode as $P_{t_m} = \mathbf{e}_{t_m}\mathbf{e}_{t_m}^\top$, where $\mathbf{e}_{t_m}$ is a unit vector that is all zeros except for a one in the $t_m$th entry. Assume the eigenfunction decomposition of the data-dependent part of the covariance, $\phi(x)K_1\phi(x)^\top = \sum_i \lambda_i\,\phi_i(x)\phi_i(x)^\top$. Then
$$K_{jk} = \prod_{m=2}^{M} K_{m,(\kappa_j,\kappa_k)}\,\sum_i \lambda_i\,\delta_{\kappa_j,t}\phi_i(x_j)\,\delta_{\kappa_k,t}\phi_i(x_k)^\top,\qquad
K = \Psi\Big(\bigotimes_{m=2}^{M} K_m \otimes \Lambda\Big)\Psi^\top = \Psi\Lambda_0\Psi^\top,$$
where $\kappa_j$ is the task index of the $j$th example, further projected to its mode-wise indexes. The augmented eigenfunction matrix $\Psi_{j,it} = \delta_{\kappa_j,t}\,\phi_i(x_j)$ accounts for missing data, where the column index of $\Psi$ runs over all eigenfunctions and all tasks. For task $t$, denote $k_t(x,\cdot) = k((x,t),\cdot)$; then
$$\mathbb{E}_x[k_t(x, X)\,k(X, x_t)] = \Psi\Big(\bigotimes_{m=2}^{M}(K_m P_{t_m} K_m) \otimes \Lambda^2\Big)\Psi^\top,$$
where $P_{t_m}$ projects onto the $m$th-mode index of task $t$. The Bayes error can be written as
$$\hat{\epsilon}_t = \mathbb{E}_x[k_t(x,x)] - \mathbb{E}_x\big[k_t(x,X)(K + D)^{-1}k_t(X,x)\big]. \tag{C.3}$$
For the first term,
$$\mathbb{E}_x[k_t(x,x)] = \prod_{m=2}^{M} \mathbf{e}_{t_m}^\top K_m\mathbf{e}_{t_m}\;\mathbb{E}_x[\phi(x)K_1\phi(x)^\top] = \operatorname{tr}\Big(\bigotimes_{m=2}^{M} P_{t_m}K_m \otimes \Lambda\Big).$$
For the second term,
$$\mathbb{E}_x\big[k_t(x,X)(K + D)^{-1}k_t(X,x)\big] = \operatorname{tr}\big(\mathbb{E}_x[k_t(x,X)k_t(X,x)]\,(K + D)^{-1}\big)
= \operatorname{tr}\Big[(D + \Psi\Lambda_0\Psi^\top)^{-1}\,\Psi\Big(\bigotimes_{m=2}^{M}(K_m P_{t_m}K_m)\otimes\Lambda^2\Big)\Psi^\top\Big].$$
With $\bigotimes_m P_{t_m} = P_{t_1,\dots,t_M}$, comparing Equation C.3 with Equation C.1, we have
$$\hat{\epsilon}_t = P_{t_1,\dots,t_M}\Big[\operatorname{tr}(\Lambda_0) - \operatorname{tr}\big((D + \Psi\Lambda_0\Psi^\top)^{-1}\Psi\Lambda_0^2\Psi^\top\big)\Big],\qquad \Lambda_0 = \bigotimes_{m=2}^{M} K_m \otimes \Lambda.$$
Therefore, the Bayes error of task $t$ is that of all tasks projected to each of its mode-wise task indices. Using the analogous method of characteristic curves, we can obtain a set of self-consistency equations for the learning curve of MLGP.
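As a sanity check, self-consistency equations of this form can be solved numerically by simple fixed-point iteration. The sketch below does this for the single-task specialization of the equation above, $\epsilon(n) = \operatorname{tr}(\Lambda^{-1} + \tfrac{n}{\sigma^2 + \epsilon}I)^{-1}$; the function name, the eigenvalue spectrum, and the noise level are made-up illustrative values, not quantities from the thesis experiments.

```python
import numpy as np


def gp_learning_curve(eigvals, sigma2, n_samples, n_iter=200):
    """Approximate GP learning curve from the single-task self-consistency equation
        eps = sum_i 1 / (1/lambda_i + n / (sigma2 + eps)),
    solved by fixed-point iteration for each sample size n."""
    curve = []
    for n in n_samples:
        eps = np.sum(eigvals)  # initialize at the prior variance
        for _ in range(n_iter):
            eps = np.sum(1.0 / (1.0 / eigvals + n / (sigma2 + eps)))
        curve.append(eps)
    return np.array(curve)


# Illustrative power-law spectrum and noise level.
eigvals = 1.0 / np.arange(1, 201) ** 2
print(gp_learning_curve(eigvals, sigma2=0.1, n_samples=[0, 10, 100, 1000]))
```

Starting from the prior variance, the iteration decreases monotonically to the fixed point, so the loop converges without any step-size tuning.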
Reference List

[Acar et al., 2007] Acar, E., Aykut-Bingol, C., Bingol, H., Bro, R., and Yener, B. (2007). Multiway analysis of epilepsy tensors. Bioinformatics, 23(13):i10–i18.

[Achlioptas, 2003] Achlioptas, D. (2003). Database-friendly random projections: Johnson-Lindenstrauss with binary coins. Journal of Computer and System Sciences, 66(4):671–687.

[Alvarez and Lawrence, 2011] Alvarez, M. A. and Lawrence, N. D. (2011). Computationally efficient convolved multiple output Gaussian processes. The Journal of Machine Learning Research, 12:1459–1500.

[Anandkumar et al., 2015] Anandkumar, A., Ge, R., Hsu, D., Kakade, S., and Telgarsky, M. (2015). Tensor decompositions for learning latent variable models. arXiv preprint arXiv:1210.7559, 64:70–72.

[Anderson, 1951] Anderson, T. (1951). Estimating linear restrictions on regression coefficients for multivariate normal distributions. The Annals of Mathematical Statistics, pages 327–351.

[Ando and Zhang, 2005] Ando, R. K. and Zhang, T. (2005). A framework for learning predictive structures from multiple tasks and unlabeled data. JMLR, 6:1817–1853.

[Argyriou et al., 2008] Argyriou, A., Evgeniou, T., and Pontil, M. (2008). Convex multi-task feature learning. Machine Learning.

[Avron et al., 2012] Avron, H., Kale, S., Sindhwani, V., and Kasiviswanathan, S. P. (2012). Efficient and practical stochastic subgradient descent for nuclear norm regularization. In Proceedings of the 29th International Conference on Machine Learning (ICML-12), pages 1231–1238.

[Bahadori et al., 2014] Bahadori, M. T., Yu, Q. R., and Liu, Y. (2014). Fast multivariate spatio-temporal analysis via low rank tensor learning. In NIPS, pages 3491–3499.

[Baldi and Hornik, 1989] Baldi, P. and Hornik, K. (1989). Neural networks and principal component analysis: Learning from examples without local minima. Neural Networks, 2(1):53–58.

[Barron et al., 2008] Barron, A., Cohen, A., Dahmen, W., and DeVore, R. (2008). Approximation and learning by greedy algorithms. The Annals of Statistics.

[Basu et al., 2015] Basu, S., Michailidis, G., et al. (2015). Regularized estimation in sparse high-dimensional time series models. The Annals of Statistics, 43(4):1535–1567.

[Bertsekas and Tsitsiklis, 1989] Bertsekas, D. and Tsitsiklis, J. (1989). Parallel and Distributed Computation: Numerical Methods. Prentice Hall Inc.

[Blumensath and Davies, 2009] Blumensath, T. and Davies, M. E. (2009). Iterative hard thresholding for compressed sensing. Applied and Computational Harmonic Analysis, 27(3):265–274.

[Bonilla et al., 2007] Bonilla, E., Chai, K., and Williams, C. (2007). Multi-task Gaussian process prediction. In NIPS.

[Brand, 2002] Brand, M. (2002). Incremental singular value decomposition of uncertain data with missing values. In Computer Vision – ECCV 2002, pages 707–720. Springer.
[Brunsdon et al., 1998] Brunsdon, C., Fotheringham, S., and Charlton, M. (1998). Geographically weighted regression. Journal of the Royal Statistical Society: Series D (The Statistician), 47(3):431–443.

[Calamai and Moré, 1987] Calamai, P. H. and Moré, J. J. (1987). Projected gradient methods for linearly constrained problems. Mathematical Programming, 39(1):93–116.

[Candès and Recht, 2009] Candès, E. J. and Recht, B. (2009). Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717–772.

[Candes et al., 2006] Candes, E. J., Romberg, J. K., and Tao, T. (2006). Stable signal recovery from incomplete and inaccurate measurements. Communications on Pure and Applied Mathematics, 59(8):1207–1223.

[Cavallanti et al., 2010] Cavallanti, G., Cesa-Bianchi, N., and Gentile, C. (2010). Linear algorithms for online multitask classification. The Journal of Machine Learning Research, 11:2901–2934.

[Chu and Ghahramani, 2009] Chu, W. and Ghahramani, Z. (2009). Probabilistic models for incomplete multi-dimensional arrays. In International Conference on Artificial Intelligence and Statistics, pages 89–96.

[Chu et al., 2006] Chu, W., Sindhwani, V., Ghahramani, Z., and Keerthi, S. (2006). Relational learning with Gaussian processes. In NIPS.

[Cichocki et al., 2009] Cichocki, A., Zdunek, R., Phan, A. H., and Amari, S.-i. (2009). Nonnegative Matrix and Tensor Factorizations: Applications to Exploratory Multi-way Data Analysis and Blind Source Separation. John Wiley & Sons.

[Clarkson and Woodruff, 2013] Clarkson, K. L. and Woodruff, D. P. (2013). Low rank approximation and regression in input sparsity time. In Proceedings of the Forty-fifth Annual ACM Symposium on Theory of Computing, pages 81–90. ACM.

[Cohen et al., 2015] Cohen, N., Sharir, O., and Shashua, A. (2015). On the expressive power of deep learning: A tensor analysis.

[Cressie and Huang, 1999] Cressie, N. and Huang, H. (1999). Classes of nonseparable, spatio-temporal stationary covariance functions. JASA.

[Cressie and Johannesson, 2008] Cressie, N. and Johannesson, G. (2008). Fixed rank kriging for very large spatial data sets. JRSS B (Statistical Methodology), 70(1):209–226.

[Cressie et al., 2010] Cressie, N., Shi, T., and Kang, E. (2010). Fixed rank filtering for spatio-temporal data. J. Comp. Graph. Stat.

[Cressie and Wikle, 2011] Cressie, N. and Wikle, C. K. (2011). Statistics for Spatio-temporal Data. John Wiley & Sons.

[Dahl et al., 2002] Dahl, T. S., Mataric, M. J., and Sukhatme, G. S. (2002). Adaptive spatio-temporal organization in groups of robots. In Intelligent Robots and Systems, 2002. IEEE/RSJ International Conference on, volume 1, pages 1044–1049. IEEE.

[Datta et al., 2016] Datta, A., Banerjee, S., Finley, A. O., and Gelfand, A. E. (2016). Hierarchical nearest-neighbor Gaussian process models for large geostatistical datasets. Journal of the American Statistical Association, 111(514):800–812.

[De Lathauwer et al., 2000a] De Lathauwer, L., De Moor, B., and Vandewalle, J. (2000a). A multilinear singular value decomposition. SIAM Journal on Matrix Analysis and Applications, 21(4):1253–1278.

[De Lathauwer et al., 2000b] De Lathauwer, L., De Moor, B., and Vandewalle, J. (2000b). On the best rank-1 and rank-(R1, R2, ..., RN) approximation of higher-order tensors. SIAM Journal on Matrix Analysis and Applications, 21(4):1324–1342.

[Eckart and Young, 1936] Eckart, C. and Young, G. (1936). The approximation of one matrix by another of lower rank. Psychometrika, 1(3):211–218.
[Fazel et al., 2003] Fazel, M., Hindi, H., and Boyd, S. P. (2003). Log-det heuristic for matrix rank minimization with applications to Hankel and Euclidean distance matrices. In American Control Conference, 2003. Proceedings of the 2003, volume 3, pages 2156–2162. IEEE.

[Foygel et al., 2012] Foygel, R., Horrell, M., Drton, M., and Lafferty, J. D. (2012). Nonparametric reduced rank regression. In Advances in Neural Information Processing Systems, pages 1628–1636.

[Gabay and Mercier, 1976] Gabay, D. and Mercier, B. (1976). A dual algorithm for the solution of nonlinear variational problems via finite element approximation. Computers & Mathematics with Applications, 2(1):17–40.

[Gandy et al., 2011] Gandy, S., Recht, B., and Yamada, I. (2011). Tensor completion and low-n-rank tensor recovery via convex optimization. Inverse Problems.

[Garg and Khandekar, 2009] Garg, R. and Khandekar, R. (2009). Gradient descent with sparsification: an iterative algorithm for sparse recovery with restricted isometry property. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 337–344. ACM.

[Gelfand and Schliep, 2016] Gelfand, A. E. and Schliep, E. M. (2016). Spatial statistics and Gaussian processes: a beautiful marriage. Spatial Statistics, 18:86–104.

[Getis and Ord, 1992] Getis, A. and Ord, J. K. (1992). The analysis of spatial association by use of distance statistics. Geographical Analysis, 24(3):189–206.

[Gu et al., 2000] Gu, M., Ruhe, A., Sleijpen, G., van der Vorst, H., Bai, Z., and Li, R. (2000). 5. Generalized Hermitian Eigenvalue Problems. Society for Industrial and Applied Mathematics.

[Guhaniyogi et al., 2015] Guhaniyogi, R., Qamar, S., and Dunson, D. B. (2015). Bayesian tensor regression. arXiv preprint arXiv:1509.06490.

[Guo et al., 2012] Guo, W., Kotsia, I., and Patras, I. (2012). Tensor learning for regression. Image Processing, IEEE Transactions on, 21(2):816–827.

[Han and Liu, 2013] Han, F. and Liu, H. (2013). Transition matrix estimation in high dimensional time series. In ICML (2), pages 172–180.

[Hasselmann, 1997] Hasselmann, K. (1997). Multi-pattern fingerprint method for detection and attribution of climate change. Climate Dynamics, 13(9):601–611.

[Hillar and Lim, 2013] Hillar, C. J. and Lim, L.-H. (2013). Most tensor problems are NP-hard. Journal of the ACM (JACM), 60(6):45.

[Hoff et al., 2015] Hoff, P. D. et al. (2015). Multilinear tensor regression for longitudinal relational data. The Annals of Applied Statistics, 9(3):1169–1193.

[Isaaks and Srivastava, 2011] Isaaks, E. and Srivastava, R. (2011). Applied Geostatistics. London: Oxford University.

[Ishteva et al., 2011a] Ishteva, M., Absil, P.-A., Van Huffel, S., and De Lathauwer, L. (2011a). Best low multilinear rank approximation of higher-order tensors, based on the Riemannian trust-region scheme. SIAM Journal on Matrix Analysis and Applications, 32(1):115–135.

[Ishteva et al., 2011b] Ishteva, M., Absil, P.-A., Van Huffel, S., and De Lathauwer, L. (2011b). Tucker compression and local optima. Chemometrics and Intelligent Laboratory Systems, 106(1):57–64.

[Jain et al., 2010] Jain, P., Meka, R., and Dhillon, I. S. (2010). Guaranteed rank minimization via singular value projection. In Advances in Neural Information Processing Systems, pages 937–945.

[Jalali et al., 2010] Jalali, A., Sanghavi, S., Ruan, C., and Ravikumar, P. (2010). A dirty model for multi-task learning. In NIPS.

[Kolda and Bader, 2009] Kolda, T. G. and Bader, B. W. (2009). Tensor decompositions and applications. SIAM Review, 51(3):455–500.
[Krishnan et al., 2015] Krishnan, R. G., Shalit, U., and Sontag, D. (2015). Deep Kalman filters. arXiv preprint arXiv:1511.05121.

[Lawrence, 2004] Lawrence, N. D. (2004). Gaussian process latent variable models for visualisation of high dimensional data. In NIPS.

[Le et al., 2011] Le, Q. V., Zou, W. Y., Yeung, S. Y., and Ng, A. Y. (2011). Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis. In CVPR, pages 3361–3368. IEEE.

[LeCun et al., 2015] LeCun, Y., Bengio, Y., and Hinton, G. (2015). Deep learning. Nature, 521(7553):436–444.

[Li and Yeung, 2009] Li, W.-J. and Yeung, D.-Y. (2009). Relation regularized matrix factorization. In IJCAI.

[Long et al., 2012] Long, X., Jin, L., and Joshi, J. (2012). Exploring trajectory-driven local geographic topics in foursquare. In UbiComp.

[Lozano et al., 2009] Lozano, A., Li, H., Niculescu-Mizil, A., Liu, Y., Perlich, C., Hosking, J., and Abe, N. (2009). Spatial-temporal causal modeling for climate change attribution. In KDD.

[Meka et al., 2008] Meka, R., Jain, P., Caramanis, C., and Dhillon, I. S. (2008). Rank minimization via online learning. In Proceedings of the 25th International Conference on Machine Learning, pages 656–663. ACM.

[Miller and Han, 2009] Miller, H. J. and Han, J. (2009). Geographic Data Mining and Knowledge Discovery. CRC Press.

[Moran, 1950] Moran, P. A. (1950). Notes on continuous stochastic phenomena. Biometrika, pages 17–23.

[Morton and Lim, 2009] Morton, J. and Lim, L.-H. (2009). Principal cumulant component analysis.

[Nie et al., 2010] Nie, F., Huang, H., Cai, X., and Ding, C. H. (2010). Efficient and robust feature selection via joint ℓ2,1-norms minimization. In NIPS.

[Novikov et al., 2015] Novikov, A., Podoprikhin, D., Osokin, A., and Vetrov, D. P. (2015). Tensorizing neural networks. In Advances in Neural Information Processing Systems, pages 442–450.

[Ouyang et al., 2013] Ouyang, H., He, N., Tran, L., and Gray, A. (2013). Stochastic alternating direction method of multipliers. In Proceedings of the 30th International Conference on Machine Learning, pages 80–88.

[Qiu et al., 2016] Qiu, H., Han, F., Liu, H., and Cao, B. (2016). Joint estimation of multiple graphical models from high dimensional time series. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 78(2):487–504.

[Quinonero-Candela and Rasmussen, 2005] Quinonero-Candela, J. and Rasmussen, C. E. (2005). Analysis of some methods for reduced rank Gaussian process regression. In Switching and Learning in Feedback Systems, pages 98–127. Springer.

[Rabusseau and Kadri, 2016] Rabusseau, G. and Kadri, H. (2016). Low-rank regression with tensor responses. In Advances in Neural Information Processing Systems, pages 1867–1875.

[Rasmussen, 2006] Rasmussen, C. E. (2006). Gaussian processes for machine learning.

[Rauhut et al., 2015] Rauhut, H., Schneider, R., and Stojanac, Z. (2015). Tensor completion in hierarchical tensor representations. In Compressed Sensing and its Applications, pages 419–450. Springer.

[Ricci and Levi-Civita, 1900] Ricci, M. and Levi-Civita, T. (1900). Méthodes de calcul différentiel absolu et leurs applications. Mathematische Annalen, 54(1-2):125–201.

[Rockafellar, 1976] Rockafellar, R. T. (1976). Monotone operators and the proximal point algorithm. SIAM Journal on Control and Optimization, 14(5):877–898.

[Roddick and Lees, ] Roddick, J. F. and Lees, B. G. Paradigms for spatial and spatio-temporal data mining.
[Romera-Paredes et al., 2013] Romera-Paredes, B., Aung, H., Bianchi-Berthouze, N., and Pontil, M. (2013). Multilinear multitask learning. In ICML.

[Saha et al., 2011] Saha, A., Rai, P., Venkatasubramanian, S., and Daume, H. (2011). Online learning of multiple tasks and their relationships. In International Conference on Artificial Intelligence and Statistics, pages 643–651.

[Shalev-Shwartz et al., 2011] Shalev-Shwartz, S., Gonen, A., and Shamir, O. (2011). Large-scale convex minimization with a low-rank constraint. In ICML.

[Shalev-Shwartz et al., 2010] Shalev-Shwartz, S., Srebro, N., and Zhang, T. (2010). Trading accuracy for sparsity in optimization problems with sparsity constraints. SIAM Journal on Optimization.

[Shalit et al., 2010] Shalit, U., Weinshall, D., and Chechik, G. (2010). Online learning in the manifold of low-rank matrices. In Advances in Neural Information Processing Systems, pages 2128–2136.

[Signoretto et al., 2014] Signoretto, M., Dinh, Q. T., De Lathauwer, L., and Suykens, J. A. (2014). Learning with tensors: a framework based on convex optimization and spectral regularization. Machine Learning, 94(3):303–351.

[Sollich and Halees, 2002] Sollich, P. and Halees, A. (2002). Learning curves for Gaussian process regression: Approximations and bounds. Neural Computation, 14(6):1393–1428.

[Stoudenmire and Schwab, 2016] Stoudenmire, E. and Schwab, D. J. (2016). Supervised learning with tensor networks. In Advances in Neural Information Processing Systems, pages 4799–4807.

[Sun et al., 2006] Sun, J., Tao, D., and Faloutsos, C. (2006). Beyond streams and graphs: dynamic tensor analysis. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 374–383. ACM.

[Sun et al., 2008] Sun, J., Tao, D., Papadimitriou, S., Yu, P. S., and Faloutsos, C. (2008). Incremental tensor analysis: Theory and applications. ACM Transactions on Knowledge Discovery from Data (TKDD), 2(3):11.

[Suzuki, 2015] Suzuki, T. (2015). Convergence rate of Bayesian tensor estimator and its minimax optimality. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 1273–1282.

[Tebaldi and Knutti, 2007] Tebaldi, C. and Knutti, R. (2007). The use of the multi-model ensemble in probabilistic climate projections. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 365(1857):2053–2075.

[Tomioka et al., 2013] Tomioka, R., Hayashi, K., and Kashima, H. (2013). Convex tensor decomposition via structured Schatten norm regularization. NIPS.

[Tsochantaridis et al., 2005] Tsochantaridis, I., Joachims, T., Hofmann, T., and Altun, Y. (2005). Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research, 6(Sep):1453–1484.

[Van den Berge et al., 2011] Van den Berge, L., Selten, F., Wiegerinck, W., and Duane, G. (2011). A multi-model ensemble method that combines imperfect models through learning. Earth System Dynamics, 2:161–177.

[Vasilescu and Terzopoulos, 2002] Vasilescu, M. A. O. and Terzopoulos, D. (2002). Multilinear analysis of image ensembles: TensorFaces. In Computer Vision – ECCV 2002, pages 447–460. Springer.

[Vasilescu and Terzopoulos, 2005] Vasilescu, M. A. O. and Terzopoulos, D. (2005). Multilinear independent components analysis. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, volume 1, pages 547–553. IEEE.

[Wainwright, 2009] Wainwright, M. J. (2009). Sharp thresholds for high-dimensional and noisy sparsity recovery using ℓ1-constrained quadratic programming (lasso). IEEE Transactions on Information Theory, 55(5):2183–2202.
[Wainwright et al., 2008] Wainwright, M. J., Jordan, M. I., et al. (2008). Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1–2):1–305.

[Wedin, 1972] Wedin, P.-Å. (1972). Perturbation bounds in connection with singular value decomposition. BIT Numerical Mathematics, 12(1):99–111.

[Weigend, 1994] Weigend, A. S. (1994). Time series prediction: forecasting the future and understanding the past. Santa Fe Institute Studies in the Sciences of Complexity.

[Wiegerinck and Selten, 2011] Wiegerinck, W. and Selten, F. (2011). Supermodeling: Combining imperfect models through learning.

[Wilson et al., 2014] Wilson, A., Gilboa, E., Cunningham, J. P., and Nehorai, A. (2014). Fast kernel learning for multidimensional pattern extrapolation. In Advances in Neural Information Processing Systems, pages 3626–3634.

[Wilson et al., ] Wilson, A. G., Hu, Z., Salakhutdinov, R., and Xing, E. P. Deep kernel learning.

[Wimalawarne et al., 2014] Wimalawarne, K., Sugiyama, M., and Tomioka, R. (2014). Multitask learning meets tensor factorization: task imputation via convex optimization. In Advances in Neural Information Processing Systems, pages 2825–2833.

[Woodbury, 1950] Woodbury, M. A. (1950). Inverting modified matrices. Memorandum Report, 42:106.

[Xu et al., 2015] Xu, Z., Yan, F., and Qi, Y. (2015). Bayesian nonparametric models for multiway data analysis. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 37(2):475–487.

[Yu et al., 2016] Yu, H.-F., Rao, N., and Dhillon, I. S. (2016). Temporal regularized matrix factorization for high-dimensional time series prediction. In Advances in Neural Information Processing Systems, pages 847–855.

[Yu et al., 2014] Yu, R., Bahadori, M. T., and Liu, Y. (2014). Fast multivariate spatio-temporal analysis via low rank tensor learning. NIPS.

[Yu et al., 2015] Yu, R., Cheng, D., and Liu, Y. (2015). Accelerated online low rank tensor learning for multivariate spatiotemporal streams. In Proceedings of the 32nd International Conference on Machine Learning.

[Yu and Liu, 2016] Yu, R. and Liu, Y. (2016). Learning from multiway data: Simple and efficient tensor regression. In Proceedings of the 33rd International Conference on Machine Learning (ICML-16).

[Zhang, 2007] Zhang, H. (2007). Maximum-likelihood estimation for multivariate spatial linear coregionalization models. Environmetrics, 18(2):125–139.

[Zhang et al., 2013] Zhang, H., Lin, Z., and Zhang, C. (2013). A counterexample for the validity of using nuclear norm as a convex surrogate of rank. In Machine Learning and Knowledge Discovery in Databases, pages 226–241. Springer.

[Zhang, 2011] Zhang, T. (2011). Adaptive forward-backward greedy algorithm for learning sparse representations. IEEE Transactions on Information Theory, pages 4689–4708.

[Zhao et al., 2011] Zhao, Q., Caiafa, C. F., Mandic, D. P., Zhang, L., Ball, T., Schulze-Bonhage, A., and Cichocki, A. (2011). Multilinear subspace regression: An orthogonal tensor decomposition approach. In NIPS, volume 2011, pages 1269–1277.

[Zheng et al., 2015] Zheng, S., Jayasumana, S., Romera-Paredes, B., Vineet, V., Su, Z., Du, D., Huang, C., and Torr, P. H. (2015). Conditional random fields as recurrent neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 1529–1537.
[Zhou et al., 2003] Zhou, D., Bousquet, O., Lal, T., Weston, J., and Schölkopf, B. (2003). Learning with local and global consistency. In NIPS.

[Zhou et al., 2013] Zhou, H., Li, L., and Zhu, H. (2013). Tensor regression with applications in neuroimaging data analysis. JASA.

[Zhou et al., 2011] Zhou, J., Chen, J., and Ye, J. (2011). MALSAR: Multi-tAsk Learning via StructurAl Regularization. http://www.public.asu.edu/~jye02/Software/MALSAR/.