SCALABLE MULTIVARIATE TIME SERIES ANALYSIS

by

Mohammad Taha Bahadori

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)

August 2015

Copyright 2015 Mohammad Taha Bahadori

Acknowledgments

First, I would like to thank my advisor, Prof. Yan Liu, for her great guidance throughout my Ph.D. studies and her contributions to this thesis. I am also grateful to Prof. Jinchi Lv for his insightful help. I am also thankful to many colleagues and collaborators: Yi Chang, David Kale, Rose Yu, Dehua Cheng, Xinran He, Whenzhe Li, Zhengping Che, Marjan Ghazvininejad, Michael Hankin, Daniel Moyer, and Huida Qiu, with whom I not only shared many thoughts but also had great moments. Last but not least, I would like to thank my family, who constantly encouraged me during my Ph.D. studies.

Contents

Acknowledgments
Contents
List of Figures
List of Tables
Abstract
1 Introduction
  1.1 Summary of Thesis Work
  1.2 Thesis Statement
  1.3 Thesis Outline
  1.4 Related Publications
2 Review of Multivariate Time Series Analysis
  2.1 Forecasting
  2.2 Spatio-temporal Analysis
  2.3 Classification and Clustering
  2.4 Correlation and Causality Analysis
3 Complex Correlations I: Low-rank Tensor Learning
  3.1 Tensor Formulation for Multivariate Spatio-temporal Analysis
    3.1.1 Cokriging
    3.1.2 Forecasting
    3.1.3 Unified Framework
  3.2 Fast Greedy Low-rank Tensor Learning
  3.3 Experiments
    3.3.1 Low-rank Tensor Learning on Synthetic Data
    3.3.2 Spatio-temporal Analysis on Real World Data
  3.4 Summary
4 Complex Correlations II: Functional Subspace Clustering
  4.1 Related Work
  4.2 Functional Subspace Clustering
    4.2.1 Data Model
    4.2.2 Functional Subspace Clustering
    4.2.3 Analysis
  4.3 FSC for Time Series with Warping
    4.3.1 Fast Warping Selection for Time Series
    4.3.2 Identifying the Latent Basis Functions
  4.4 Experiments
    4.4.1 Synthetic Data Experiments
    4.4.2 Real World Data
    4.4.3 Deformed basis function recovery
  4.5 Summary
5 Latent Factors
  5.1 Preliminaries and Related Work
  5.2 Methodology
    5.2.1 Stochastic Processes with Latent Factors
    5.2.2 Examples of GLARP
    5.2.3 Discussion and Generalization
    5.2.4 Inference
  5.3 Path Delays
  5.4 Experiments
    5.4.1 Datasets
    5.4.2 Evaluation Measures
    5.4.3 Results
  5.5 Small Sample Behavior
    5.5.1 Finite Sample Analysis of the Structure of Estimators
  5.6 Summary
6 Non-Gaussian Time Series
  6.1 Copula-Granger
  6.2 Sparse-GEV
    6.2.1 Model Description
    6.2.2 Inference and Learning
    6.2.3 Prediction
    6.2.4 Scalability
  6.3 Related Work and Discussions
    6.3.1 Connections to existing algorithms
  6.4 Experiment Results
    6.4.1 Datasets
    6.4.2 Performance Comparison
    6.4.3 Parameter Sensitivity Assessment
  6.5 Summary
7 Irregular Time Series
  7.1 Problem Definitions and Notation
  7.2 Related Work
  7.3 Methodology
    7.3.1 Generalized Lasso Granger (GLG)
    7.3.2 Extension of GLG Method
    7.3.3 Asymptotic Consistency of GLG
  7.4 Experiment Results
    7.4.1 Synthetic Datasets
    7.4.2 Results on the Synthetic Datasets
    7.4.3 Paleo Dataset
    7.4.4 Results on the Paleo Dataset
  7.5 Summary
8 Conclusion and Future Work
  8.1 Contributions and Limitations
  8.2 Future Work
A Notation Guide
Index
Reference List

List of Figures

2.1 Location of all stations that have ever recorded the amount of rainfall on the island of Hawai‘i [89].
2.2 Illustration of the deformation operation for functional data. Two functions are considered similar if a deformation of one of them is similar to the other one. The figure has been regenerated from [160, Fig. 4.1].
2.3 Illustration of the main principle behind Granger causality: in both examples the cause happens prior to its effect and its past values help predict future values of the effect. (a) Plot of the values of two time series. (b) Plot of two point processes in which each event is shown with a mark at its happening time.
2.4 (a) The causal graph corresponding to the set of equations in (2.10) and (b) the corresponding path diagram.
3.1 Tensor estimation performance comparison on the synthetic dataset over 10 random runs. (a) Parameter estimation RMSE with training time series length, (b) mixture rank complexity with training time series length, (c) running time for a single round with respect to the number of variables.
3.2 Map of the most predictive regions analyzed by the greedy algorithm using 17 variables of the CCDS dataset. Red denotes high predictiveness whereas blue denotes low predictiveness.
4.1 Illustration of the deformation operation for functional data. Two functions are considered similar if a deformation of one of them is similar to the other one. The figure has been regenerated from [160, Fig. 4.1].
4.2 Synthetic data experiments. (a) The bases used for constructing the synthetic data. (b, c) The clustering error rate for six algorithms as (b) the length of the time series grows and (c) the number of time series per cluster grows.
4.3 Mean and standard deviation trajectories for twelve variables in the Physionet dataset, for patients who survived (blue) and deceased (red). Note the similarity of the time series and the fact that they are almost indistinguishable by the naked eye.
4.4 Synthetic data experiment: PCA-TW is the only algorithm that successfully recovers the principal component under deformation.
5.1 Decomposition of the evolution matrix in Eq. (5.6) into low-rank and sparse matrices.
5.2 In this toy Granger graphical model, the values on the edges (τ1, τ2, τ3) show the delay associated with each edge. According to the m-separation criteria, when X4 is unobserved, a spurious edge X1 ← X3 is detected. However, the spurious edge is not detected when τ3 − τ2 + τ1 ≤ 1, where L is the maximum lag in the Granger causality test.
5.3 Three of the four possible directed graphs created by three nodes: (a) the coparent, (b) the collider, and (c) the chain structures. The fourth structure is the chain with reversed edge directions.
5.4 The diagrams for proving (a) Lemma 5.4 and (b) Lemma 5.5. The green circles are observed variables and the red path shows a d-connected path.
5.5 An example of canceling spurious causation. Time series X1, X2, and X3 are observed while X4 and X5 are unobserved.
5.6 Verification of the theoretical results. Sig, Lasso, and Copula represent the significance test, Lasso-Granger, and Copula-Granger algorithms.
5.7 Synthetic dataset results on the point process dataset. (a) Graph learning accuracy as the length of the time series increases. (b) Graph learning accuracy as the number of latent factors increases.
5.8 The graph learning accuracy when the number-of-retweets requirement n for the ground truth influence graph G_RT(n) is varied. The performance of (a) Poisson and (b) COM-Poisson auto-regressive processes confirms that they make better predictions for the stronger influence edges.
5.9 (a) The spatial-temporal dependency graph obtained via the Gumbel auto-regressive process. Note the denseness of the graph. (b) The sparse part of the spatial-temporal dependency graph obtained via GLARP-GumG. Removing the low-rank global effect leaves only two main local terrain impacts: one is the local impact of the Appalachian mountains along the east coast and the other is the local impact of the Great Lakes on the weather pattern of their surrounding lands.
5.10 The images visualize the estimated matrices in grayscale; the darker the pixel, the higher the value of the coefficient. (a) The estimation bias \hat{A}_S - A. Note that there is a significant low-rank pattern in the estimation error. (b) The low-rank matrix L is very similar to the low-rank part of the estimation error.
5.11 The fraction of variance preserved by the rank-1 estimate of L_n in the Poisson auto-regressive model. More specifically, we compute the singular values of L_P as σ1 ≥ σ2 ≥ ... ≥ σp ≥ 0 and report σ1 / Σ_{i=1}^{p} σi. We have set x ∼ N(0, I) and y ∼ Poiss(exp(Ax)).
5.12 Comparison of the gain achieved by sparse plus low-rank decomposition when the dataset has a latent factor and when it does not. Note that the gain is much larger when there exists a latent factor.
6.1 Illustration of (a) the ground truth and the temporal dependence graphs inferred by (b) Granger causality, (c) transfer entropy, (d) the copula method, and (e) Sparse-GEV.
6.2 The temporal dependence graph learned by Sparse-GEV on the extreme value time series of (a) Wind in NY and (b) Gust in NY. Thicker edges imply stronger dependency.
6.3 The temporal dependency graph learned by Sparse-GEV from the Twitter dataset on (a) meme phrases in 2009 and (b) "Occupy Wall Street" hashtags in 2011.
6.4 (a) Parameter sensitivity assessment: average AUC achieved by Sparse-GEV on the synthetic datasets as the value of λ varies. (b) The value of the log-likelihood function at each iteration of the EM algorithm. (c) The effect of τ on the value of the hidden variables in the Sparse-GEV algorithm.
7.1 Time Series #1 is the target time series in this figure. Prediction of x^(1)_{n1} should receive a higher weight than x^(1)_{n2} in the depicted scenario because it can be predicted more accurately.
7.2 Time Series #1 is the target time series in this figure while the other time series are not shown. Prediction of x^(1)_{n2} should have a higher weight in the causality inference than x^(1)_{n1} because x^(1)_{n2} is in a denser region of the time series.
7.3 The sources of repair errors in GLG when x^(1)_n is being predicted. In order to predict the data point x^(1)_n, GLG repairs the time series at L points before the time t^(1)_n. At each point in time a repair error z^(i)_{t-ℓΔt} is produced.
7.4 The black circles show the time stamps of the given irregular time series. The crosses are the time stamps used for repair of the time series. The red time stamp shows the moment at which the repair methods produce large errors and propagate the error by predicting the erroneous repaired sample. GLG skips these intervals because it predicts only the observed samples.
7.5 Study of the convergence behavior of the algorithms on the Mixture of Sinusoids with missing data points dataset (left), the jittery clock dataset (middle), and the auto-regressive time series with missing data points dataset (right).
7.6 (a) Convergence of different kernels on the Mixture of Sinusoids with missing data dataset. (b) The effect of missing data in the Mixture of Sinusoids with missing data points. (c) The effect of clock jitter (γ) on the Mixture of Sinusoids with jittery clock dataset.
7.7 Performance comparison of the algorithms on the Mixture of Sinusoids with Poisson sampling times.
7.8 Comparison of the performance of W-GLG vs. GLG on all four datasets.
7.9 Map of the locations and the monsoon systems in Asia.
7.10 Comparison of the results on the Paleo Dataset: (a) GLG in the period 850AD-1563AD. (b) GLG in the period 1250AD-1564AD. (c) GLG in the period 850AD-1250AD. (d) Slotting technique in the period 850AD-1563AD. (e) Slotting technique in the period 1250AD-1564AD. (f) Slotting technique in the period 850AD-1250AD.
8.1 Time series of the aggregate number of tweets about Tiger Woods (left) and the Pope (right). The daily trend has been removed. Note that the several shocks, marked by red boxes in the figure, occur frequently and are natural to the time series rather than singular events.

List of Tables

3.1 Cokriging RMSE of 6 methods averaged over 10 runs. In each run, 10% of the locations are assumed missing.
3.2 Forecasting RMSE for a VAR process with 3 lags, trained with 90% of the time series.
3.3 Running time (in seconds) for cokriging and forecasting.
4.1 Average AUC obtained by the algorithms on the real world datasets.
5.1 The baselines used in the evaluations.
5.2 The RMS prediction error of the algorithms on the Twitter dataset. Results have been normalized by the mean.
5.3 The RMS prediction error of the algorithms on the wind speed dataset.
5.4 Comparison of pure sparse learning and sparse plus low-rank decomposition solutions on four performance measures.
6.1 Comparison of different models on recovering the temporal dependence graph on eight synthetic datasets.
6.2 Comparison of RMSE by different methods in the prediction tasks. TE: transfer entropy; T-: Twitter dataset.
7.1 Description of the Paleo Dataset.
Abstract

Time series data have become ubiquitous in many applications such as climate science, social media, and health care. Analysis of large scale time series data collected from diverse applications has created new multi-faceted challenges and opportunities. In this thesis, we have studied the key challenges in large scale multivariate time series analysis and proposed novel and scalable solutions.

First, we tackle the challenge of modeling high-dimensional multi-modal correlations in spatio-temporal data, as accurate modeling of correlations is the key to accurate predictive analysis. We cast the problem as a low-rank tensor learning problem with side information incorporated via a graph Laplacian regularization. For scalable estimation, we provide a fast greedy low-rank tensor learning algorithm. To address the problem of modeling complex correlations in classification and clustering of time series, we propose the functional subspace clustering framework, which assumes that the time series lie on several subspaces with possible deformations. For estimation of the subspaces, we propose an efficient greedy variable selection algorithm.

Second, we observe that the performance of temporal dependency algorithms is severely degraded in the presence of unobserved confounders. To address this challenge, we propose two solutions: (i) alleviating the impact of major latent confounders using sparse plus low-rank decomposition and (ii) eliminating the impact of all latent confounders using prior information about the delays of the confounding paths.

Third, in many application domains, multivariate time series do not follow the commonly assumed multivariate Gaussian distribution. We propose two solutions to address this challenge: (i) a state space model based on the generalized extreme value distribution to model the important case of extreme value time series and (ii) a semi-parametric approach using copulas for the general setting.

Finally, often in practice time series measurements are collected at irregular intervals, which violates the assumptions of many existing algorithms. To address this challenge, we propose a fast non-parametric extension for temporal dependency analysis algorithms that improves accuracy over state of the art techniques.

All of the proposed algorithms are evaluated on multiple datasets from different applications including climate science, social networks, and health care.

Chapter 1

Introduction

Multivariate time series analysis refers to the study of multiple time series along with, possibly, a set of non-temporal side information such as the spatial location of the time series. Multivariate time series data are ubiquitous in diverse application domains such as geostatistics, social media, and health care. For example, in geostatistics, in addition to the fact that the deluge of climate data requires the design of new scalable algorithms for forecasting, kriging, and model aggregation tasks, new problems such as temporal dependency discovery have emerged as active research problems. In social media, the activity of users can be represented as time series, which can be used to discover the patterns of influence in social networks. In health care, time series in patients' electronic health records can be used for analyzing patients' health status and improving the quality of care in hospitals.

The rise of new applications has introduced new challenges and opportunities in time series analysis. The first challenge is the existence of high-dimensional spatial and temporal dependencies in the data.
We need to develop new models that are able to accurately capture the complex correlations, as this is the key to accurate predictive analysis. A plethora of excellent work has been conducted to address this challenge and has achieved success to a certain extent [52, 119]. However, many of the existing solutions impose strong assumptions on the spatial and temporal correlation, which requires domain knowledge and manual work. Furthermore, learning the parameters of those statistical models is computationally expensive, making them infeasible for large scale applications. Thus, there is an urgent need for new approaches which not only can capture the complex correlations in a variety of application domains but also are scalable to massive-scale data.

Similar challenges exist in the problem of classification and clustering of time series. Time series are naturally high-dimensional and lie along complex manifolds. These properties warrant the use of the subspace assumption, but most state-of-the-art subspace learning algorithms are limited to linear or other simple settings [224]. For example, in the classification literature, small warping deformations in time can be ignored in the calculation of the similarity of two time series [178, 189, 226]. Many of the alternative approaches rely on particular sequential generative models that make strong assumptions about the generative process of the data [2, 127, 134]. Thus, there is a need for an extension of subspace clustering that captures the complex correlations in the presence of non-linear deformations of the time series without strong assumptions about the time series generative process.

Another major challenge that we are confronted with in real-world applications is incompleteness of the data; i.e., oftentimes certain influential time series are missing in real-world datasets. For example, in social media analysis, external events influence large clusters of users, while the news propagates through the local connections in the network. In order to identify the true influence patterns among the users, we need to take into consideration the impact of external unobserved events. In climate data analysis, the local terrain characteristics play an important role in air mass propagation, while large climate systems, which are usually not observed in the datasets collected by local weather stations, influence wide areas on the ground. Thus, we need scalable approaches to take into account the impact of latent confounders in time series analysis.

Furthermore, a large number of existing algorithms rely on simplifying assumptions such as Gaussianity of the time series data. In many applications, such as climate science and social media analysis, we are mostly interested in revealing the temporal dependence and prediction of extreme events. For example, climate change is mostly characterized by increasing probabilities of extreme weather patterns such as temperature or precipitation reaching extremely high values [118]. Therefore, quantifying the temporal dependence between extreme events at different locations and making effective predictions are important for disaster prevention. Uncovering the temporal dependencies among social media bursts could reveal valuable insights into information propagation and achieve much better accuracy for burst prediction. Most of the existing work models temporal or spatial dependence with predefined covariance structures.
This is unrealistic and demands a significant contribution on automatically learning the temporal structures from the data for better analysis and modeling.

Finally, in many real-world applications, we are confronted with irregular time series, whose observations are not sampled at equally-spaced time stamps. Irregular sampling is a common challenge in practice due to natural constraints or human factors. For example, in health care, it is very difficult to obtain blood samples of human beings at regular time intervals for a long period of time. Irregularity in sampling intervals violates the basic assumptions behind many temporal dependency learning algorithms. Most of the existing approaches for temporal dependency discovery assume that the time series observations are obtained at equally-spaced time stamps and fail in the analysis of irregular time series. Thus, we need to develop novel algorithms that do not rely on the regular sampling assumption.

1.1 Summary of Thesis Work

To address the high-dimensional complex correlation modeling challenge, we build upon recent advances in low-rank tensor learning [83, 136, 244] and further consider the scenario where additional side information about the data is available. For example, in geospatial applications, apart from measurements of multiple variables, geographical information is available to infer location adjacency. In social network applications, the friendship network structure is collected to obtain preference similarity. We propose a unified low-rank tensor learning framework to efficiently capture the complex multi-modal correlations in the data. To utilize the side information, we can construct a graph Laplacian regularizer from the side information, which favors locally smooth solutions. We develop a fast greedy algorithm for learning low-rank tensors based on the greedy structure learning framework [15, 197, 242]. We also provide a bound on the difference between the loss function at our greedy solution and the one at the globally optimal solution.

To address the challenges in the classification and clustering problems, we propose a new framework called Functional Subspace Clustering (FSC). FSC assumes that functional data (such as time series) lie in deformed linear subspaces and extends the power and flexibility of subspace clustering to functional data by permitting deformations. The result is a framework that differs from most existing approaches to functional data clustering as it does not assume a structured generative model (e.g., a sequential model for time series) or a predefined set of basis functions (e.g., B-splines). FSC formulates the subspace learning problem as a sparse regression over operators. The resulting problem can be efficiently solved via greedy variable selection, given access to a fast deformation oracle. We provide theoretical guarantees for FSC and show how it can be applied to time series with warped alignments.

To address the latent factors challenge, we study generalized auto-regressive processes and show that, under certain regularity conditions, the maximum likelihood estimate of their evolution matrices can be decomposed into a sparse and a low-rank matrix, with the latter capturing the impact of unobserved processes. For counting processes, we analyze the Poisson [220] and Conway-Maxwell Poisson [246] auto-regressive processes. The latter distribution has recently attracted researchers' attention because of its flexibility in modeling under-dispersion and over-dispersion of discrete data [196, 201].
For extreme value time series, we propose a novel heavy-tailed auto-regressive time series model, by choosing the distribution of the data to be the Gumbel distribution [96]. In addition, we observe an interesting phenomenon: oftentimes, separating out a low-rank component from the maximum likelihood estimate of the sparse coefficients improves the accuracy of the estimation, even if there are no latent variables. Thus, we examine the possible causes for this phenomenon and demonstrate theoretically that in most generalized linear models, the finite sample bias is additive and approximately low-rank. We also show that coping with the effects of hidden confounders is easier in temporal-causal analysis. In particular, many directionally connected confounding paths are disconnected once the delays associated with the edges are taken into account. Thus, we often need to condition on fewer variables to block the spurious causation paths. Our contribution to this problem includes the definition of path delays and showing that path delays are essential in studying the unobserved confounders' impact in temporal-causal analysis.

In response to the non-Gaussianity challenge, we propose two solutions: Copula-Granger, which is a general-purpose semi-parametric algorithm for all non-linear time series, and the Sparse Generalized Extreme Value (Sparse-GEV) model, which is a dynamic linear model customized for extreme value time series. Copula-Granger is based on the copula technique proposed for dependency analysis of time series with non-Gaussian marginal distributions [73]. In the copula framework, first the marginal distribution of the time series is computed using a non-parametric estimator. Next, the observations are transformed to the copula domain using the cumulative distribution of the unit Gaussian distribution. Then we use Lasso-Granger to uncover the temporal dependency among the transformed time series. On the theoretical front, we establish the asymptotic convergence rate of the Copula-Granger method.

To address the irregular time series challenge, we propose the Generalized Lasso-Granger (GLG) framework for temporal-causal analysis of irregular time series. It defines a generalization of the inner product for irregular time series based on non-parametric kernel functions. As a result, the GLG optimization problem takes the form of a Lasso problem and enjoys computational scalability for large scale problems. We also investigate a weighted version of GLG (W-GLG), which aims to improve the performance of GLG by giving higher weights to important observations. As a theoretical contribution, we propose a sufficient condition for the asymptotic consistency of our GLG method. In particular, we argue that, compared with the popular locally weighted regression, GLG has the same asymptotic consistency behavior but achieves lower absolute error.

1.2 Thesis Statement

In this thesis, we hypothesize that multivariate time series algorithms can be enhanced to accurately analyze large scale data. In particular, we are able to assert this hypothesis based on the fact that we address the four key challenges facing large scale multivariate time series analysis. Specifically, we make the following claims:

1. Low-rank tensor learning provides a unified framework to efficiently perform a wide variety of multivariate time series analyses on large scale datasets.
2. Functional subspace clustering learns succinct representations for multivariate time series with deformations and enables efficient clustering and classification of time series data.

3. Sparse plus low-rank decomposition can be used to efficiently alleviate the impact of major latent confounders in generalized linear auto-regressive processes.

4. Prior information about delays in the latent confounding paths can be used to resolve the identifiability of temporal-causal discovery in time series.

5. Latent factor modeling together with the generalized extreme value distribution can be efficiently used to model multivariate extreme value time series.

6. Copula-Granger can be used to alleviate the impact of non-Gaussianity in multivariate time series analysis.

7. Temporal dependency analysis can be performed on irregularly sampled time series by extending the existing temporal-causal analysis techniques using non-parametric regression.

1.3 Thesis Outline

In Chapter 2, we overview the existing techniques for solving the key tasks in multivariate time series analysis that are related to this thesis work.

In Chapter 3, we describe a low-rank tensor learning framework for multivariate time series analysis. We also provide a fast greedy algorithm for learning low-rank tensors with theoretical guarantees on its performance.

In Chapter 4, we describe the functional subspace clustering framework to learn succinct representations for functional data with deformations. We provide theoretical guarantees and show how it can be applied to time series with warped alignments.

In Chapter 5, we study generalized linear auto-regressive processes and show that, under certain regularity conditions, the maximum likelihood estimate of their evolution matrices can be decomposed into a sparse and a low-rank matrix, with the latter capturing the impact of unobserved processes. We also provide the definition of path delays and show that path delays are essential in the analysis of the unobserved confounders' impact in time series.

In Chapter 6, we propose Copula-Granger, a general-purpose semi-parametric algorithm for all non-linear time series, and Sparse-GEV, a dynamic linear model customized for extreme value time series. On the theoretical front, we establish the asymptotic convergence of the Copula-Granger method.

In Chapter 7, we propose the Generalized Lasso-Granger (GLG) framework for temporal-causal analysis of irregular time series. It is based on a generalization of the inner product for irregular time series using non-parametric kernel regression.

In Chapter 8, we summarize our contributions and discuss their advantages and potential drawbacks. We also discuss potential future work to extend the contributions of this thesis.

1.4 Related Publications

Parts of this thesis have been published in machine learning and data mining conferences. The list includes:

Related to Chapter 3:

• Bahadori, Mohammad Taha, Rose Yu, and Yan Liu. "Fast Multivariate Spatio-temporal Analysis via Low Rank Tensor Learning." Advances in Neural Information Processing Systems. 2014.

Related to Chapter 4:

• Bahadori, Mohammad Taha, David Kale, Yingying Fan, and Yan Liu. "Functional Subspace Clustering with Application to Time Series." Proceedings of the 32nd International Conference on Machine Learning. 2015.

Related to Chapter 5:

• Bahadori, Mohammad Taha, Yan Liu, and Eric P. Xing.
"Fast structure learning in generalized stochastic processes with latent factors." Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2013.

• Bahadori, Mohammad Taha, and Yan Liu. "An examination of practical granger causality inference." SIAM Conference on Data Mining. 2013.

Related to Chapter 6:

• Liu, Yan, Taha Bahadori, and Hongfei Li. "Sparse-GEV: Sparse Latent Space Model for Multivariate Extreme Value Time Serie Modeling." Proceedings of the 29th International Conference on Machine Learning. 2012.

• Bahadori, Mohammad Taha, and Yan Liu. "An examination of practical granger causality inference." SIAM Conference on Data Mining. 2013.

Related to Chapter 7:

• Bahadori, Mohammad Taha, and Yan Liu. "Granger Causality Analysis in Irregular Time Series." SIAM Conference on Data Mining. 2012.

Chapter 2

Review of Multivariate Time Series Analysis

Nowadays, large amounts of time series are collected from various application domains such as climate science, social networks, finance, and health care, to name a few. Given the diversity of the applications, a wide variety of tasks have been defined in time series analysis. In this chapter, we review the common tasks in time series analysis that are related to this thesis work and the popular approaches used in each task.

2.1 Forecasting

One of the central tasks in time series analysis is to predict the future values of a time series using its historical observations. Formally, let x(t) ∈ R^p denote the value of p time series x_i(t) at time t ∈ {1, ..., T}. Forecasting these time series is to use the observations x(t) for t = 1, ..., T_0 to predict x(t) for t = T_0 + 1, ..., T. The basic principle in forecasting is to find a function f(·) that estimates the values of the time series at time t using K past values. Below, we list the common approaches for time series forecasting.

Auto-regressive Models. Assuming that the relationship between x(t) and its past values is linear leads to the auto-regressive model. Formally, the vector auto-regressive (VAR) process with K lags defines the value of the time series at time t as follows:

    x(t) = \sum_{\ell=1}^{K} A^{(\ell)} x(t - \ell) + \varepsilon(t),    (2.1)

where each of the evolution matrices A^{(\ell)} ∈ R^{p×p} models the linear impact of the time series values at the ℓ-th lag on the current data point. The noise process \varepsilon(t) is assumed to be multivariate Gaussian.
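As a concrete illustration of Eq. (2.1), the sketch below fits a VAR model by ordinary least squares and produces a one-step-ahead forecast. It is a minimal NumPy example, not the estimation procedure used later in the thesis; the lag order, the synthetic data, and the helper names are illustrative assumptions.

```python
import numpy as np

def fit_var(X, K):
    """Least-squares fit of a VAR(K) model.

    X: array of shape (T, p), one row per time step.
    Returns the stacked coefficient matrix of shape (p, K * p),
    i.e., [A^(1), ..., A^(K)] placed side by side.
    """
    T, p = X.shape
    # Lagged design matrix: row for time t contains x(t-1), ..., x(t-K).
    Z = np.hstack([X[K - ell:T - ell] for ell in range(1, K + 1)])
    Y = X[K:]                          # targets x(K), ..., x(T-1)
    A, *_ = np.linalg.lstsq(Z, Y, rcond=None)
    return A.T                         # shape (p, K * p)

def forecast_one_step(X, A, K):
    """One-step-ahead forecast using the last K observations."""
    z = np.concatenate([X[-ell] for ell in range(1, K + 1)])
    return A @ z

# Toy usage on synthetic data.
rng = np.random.default_rng(0)
T, p, K = 200, 3, 2
X = np.zeros((T, p))
for t in range(1, T):
    X[t] = 0.5 * X[t - 1] + rng.normal(scale=0.1, size=p)

A = fit_var(X, K)
print(forecast_one_step(X, A, K))      # prediction of the next value x(T)
```

The same design-matrix construction underlies most of the regularized (e.g., Lasso-penalized) variants discussed later; only the loss being minimized changes.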
Given the simplicity and robustness of the vector auto-regressive process, it has become the central time series forecasting model in many applications, including forecasting financial time series, and numerous extensions have been proposed [151, 162]. These extensions allow modeling non-linear dynamics by using techniques such as kernelized linear regression [101, 152, 225] and neural networks [139, 212]. For example, the Neural Auto-regressive (NAR) model defines the following model for predicting future values of a time series in terms of its past values:

    x_i(t) = f(x_i(t-1), ..., x_i(t-K)) + b_i,   for i = 1, ..., p,    (2.2)

where f(·) is a nonlinear function that is estimated by the neural network.

Generalized additive models for time series modeling [143] define the future values of the time series according to the following model:

    g(E_{H(t)}[x_i(t)]) = \sum_{j=1}^{p} \sum_{\ell=1}^{L} f_{ij}(x_j(t-\ell)) + b_i,   for i = 1, ..., p,    (2.3)

where the f_{ij} are nonlinear functions, g(·) is a strictly increasing function called the link function, and E_{H(t)} emphasizes that the expectation is taken given the history of the time series before time t. Examples of extensions via generalized additive models are given in [21, 175]. A greedy algorithm is usually used for selecting the functions f_{ij} from a pool of non-linear bases, which produces a locally optimal solution. Generalized additive models are susceptible to overfitting, and regularization techniques have been proposed to control the model complexity; see e.g. [235, 236] and the references therein.

Another important extension of VAR models includes the ARCH and GARCH models, which model heteroskedasticity in the time series [31, 74] and are particularly popular in modeling financial time series. The ARCH model defines the variance of the time series at time t to depend on the past values of the time series as follows:

    Var[x(t)] = \sigma_t^2 = a_0 + \sum_{\ell=1}^{K} a_\ell x(t-\ell)^2,    (2.4)

which is an ARCH model with K lags. The ARCH model and its extensions are commonly used for modeling volatility in financial time series.

State Space Models. It is generally agreed that including latent variables significantly helps explain the complex dependencies among observed variables using simpler relationships [29, 30, 88]. A common class of latent variable models for time series is the state space model [162, Chapter 18]. The state space model assumes that there exists a latent space denoted by a state vector z(t), which evolves according to the following equation:

    z(t) = C z(t-1) + D x(t) + F \epsilon(t),    (2.5)

where \epsilon(t) is usually a Gaussian random vector and C characterizes the evolution of the state vector. The observed time series is related to the state vector as follows:

    x(t) = A x(t-1) + B z(t) + E \varepsilon(t),    (2.6)

where \varepsilon(t) denotes the observation noise. In addition to time series forecasting, state space modeling has found applications in other fields, including object tracking and robotics [162, Chapter 18]. Numerous extensions have been proposed, such as the Extended Kalman Filter [104], which allows non-linearity in Eq. (2.5) and Eq. (2.6), and state space models with non-Gaussian noise distributions [135]. Given that these models have memory in their state vectors, extensions such as switching Kalman filters can be used to model non-stationary time series [161].

Neural Networks. While feedforward neural networks are used to generalize the auto-regressive models, recurrent neural networks generalize the state space models by adding non-linear self-loops to them [38, 58, 192]. For example, the simplest single-layer recurrent neural network can be defined via the following equations [23]:

    z(t) = f(C z(t-1) + D x(t) + b),
    x(t) = A x(t-1) + B z(t) + c,

where f(·) is a nonlinear function such as the tanh(·) function. Deep recurrent neural networks assume a multilayer structure for the state vector to allow higher representative power [93]. Various extensions of recurrent neural networks have been proposed and successfully applied to a wide range of time series applications [58, 192].
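The recurrence above can be written in a few lines of code. The sketch below implements a single forward pass of such a network with a tanh nonlinearity, following the state update and (reconstructed) output equation given in the text; the weight shapes and random initialization are illustrative assumptions, and in practice the matrices would be learned, e.g., by backpropagation through time.

```python
import numpy as np

def rnn_forward(X, A, B, C, D, b, c):
    """Run a simple recurrent network over a sequence.

    X: array of shape (T, p) holding the observed series.
    Returns one prediction per time step.
    """
    T, p = X.shape
    d = C.shape[0]
    z = np.zeros(d)                      # initial state z(0)
    x_prev = np.zeros(p)                 # previous observation
    preds = np.zeros((T, p))
    for t in range(T):
        z = np.tanh(C @ z + D @ X[t] + b)        # state update
        preds[t] = A @ x_prev + B @ z + c        # predicted observation
        x_prev = X[t]
    return preds

# Toy usage with random (untrained) weights.
rng = np.random.default_rng(0)
p, d, T = 3, 8, 50
X = rng.normal(size=(T, p))
A = 0.1 * rng.normal(size=(p, p))
B = 0.1 * rng.normal(size=(p, d))
C = 0.1 * rng.normal(size=(d, d))
D = 0.1 * rng.normal(size=(d, p))
b, c = np.zeros(d), np.zeros(p)
print(rnn_forward(X, A, B, C, D, b, c).shape)    # (50, 3)
```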
Gaussian Processes. A common category of models that are popular in time series forecasting is Gaussian process models [53, 234]. A Gaussian process can be thought of as a Bayesian kernelized linear regression used for predicting the future values of a time series from its past values [28]. Gaussian processes capture the dependence between the predictors and the target variable using a predefined kernel (covariance) function over the predictors and the target variable. Thus, they can be easily generalized to analyze spatio-temporal data [53].

The simplest form of a Gaussian process for time series forecasting can be defined by identifying a Gaussian random process x(t) through a mean function m(t) and a covariance function k(t, t') as follows [234, Chapter 2]:

    m(t) = E[x(t)],
    k(t, t') = E[(x(t) - m(t))(x(t') - m(t'))].

A proper covariance function should be positive definite, which makes the design of covariance functions challenging. Gaussian processes have been extended to latent variable models and to non-Gaussian observation distributions, and have found a wide variety of applications including spatio-temporal analysis [53].

2.2 Spatio-temporal Analysis

Spatio-temporal analysis refers to prediction tasks for time series when the spatial locations of the time series are given. Kriging and forecasting are the two central tasks in spatio-temporal analysis. Kriging is the task of interpolating the data of one variable at unknown locations by taking advantage of the observations from known locations; it utilizes the spatial correlations to predict the value of the variables at new locations [53]. In addition to geostatistics, spatio-temporal analysis has also found applications in social sciences, environmental monitoring, climate science, systems biology, and archeology [188].

Spatio-temporal Covariance Functions. The common approach in spatio-temporal analysis is to propose a spatio-temporal random process model that captures the temporal dynamics of the spatial data. Gaussian processes are commonly used for this task and require a spatio-temporal covariance function to be specified. The spatio-temporal covariance function is defined as follows:

    k(t, t', s, s') = E[(x(t, s) - m(t, s))(x(t', s') - m(t', s'))].

The design of spatio-temporal covariance functions has been studied extensively in the literature [53, 200, 203]. Here, we briefly overview the key principles of covariance function design for spatio-temporal analysis. First, similar to other Gaussian processes, the covariance function should be positive definite. Next, often for simplicity, the spatio-temporal covariance function is selected to be separable, i.e.,

    k(t, t', s, s') = k_T(t, t') × k_S(s, s').

A separable covariance function defines the spatio-temporal covariance function to be the product of a spatial and a temporal covariance function. The main implication of the separability assumption is that the temporal dynamics at each location are independent of the dynamics at other locations, which usually prevents the model from capturing important spatial and temporal interactions [53]. Another common assumption is to set the temporal dynamics to be stationary by defining k_T(t, t') = k_T(|t - t'|). This assumption implies that the second order temporal dynamics of the process do not depend on the absolute time of measurement. Similarly, we can define the spatial covariance function to be stationary, by defining k_S(s, s') = K_S(s - s'), and isotropic, K_S(s - s') = K_S(||s - s'||).
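The separability and stationarity assumptions above translate directly into code. The following sketch builds a separable spatio-temporal covariance matrix from a stationary temporal kernel and an isotropic spatial kernel; the squared-exponential forms and the length-scale values are illustrative choices rather than prescriptions from the text.

```python
import numpy as np

def k_temporal(t1, t2, length=5.0):
    """Stationary temporal kernel k_T(|t - t'|) (squared exponential)."""
    return np.exp(-0.5 * ((t1 - t2) / length) ** 2)

def k_spatial(s1, s2, length=1.0):
    """Isotropic spatial kernel K_S(||s - s'||) (squared exponential)."""
    d = np.linalg.norm(np.asarray(s1) - np.asarray(s2))
    return np.exp(-0.5 * (d / length) ** 2)

def separable_cov(times, locs):
    """Covariance over all (time, location) pairs under separability:
    k(t, t', s, s') = k_T(t, t') * k_S(s, s')."""
    pairs = [(t, s) for t in times for s in locs]
    n = len(pairs)
    K = np.zeros((n, n))
    for a, (t1, s1) in enumerate(pairs):
        for b, (t2, s2) in enumerate(pairs):
            K[a, b] = k_temporal(t1, t2) * k_spatial(s1, s2)
    return K

times = np.arange(4)
locs = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
K = separable_cov(times, locs)
print(K.shape)                                   # (12, 12)
print(np.all(np.linalg.eigvalsh(K) > -1e-9))     # positive semi-definite
```

With this time-major ordering of the grid, the resulting matrix is exactly the Kronecker product of the temporal and spatial Gram matrices, which is the structure that makes separable models computationally attractive.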
Hierarchical Bayesian Modeling. Another commonly used approach for building spatio-temporal models is the hierarchical Bayesian modeling technique [53, Chapter 7]. The goal of hierarchical modeling is to avoid the computational complexity of modeling with covariance functions [233]. Furthermore, unlike the Gaussian process approach, these models are able to include scientific knowledge and constraints on the spatio-temporal data and enable the incorporation of data from diverse sources [233].

Hierarchical models describe the data generation process in three steps by specifying the distribution of the data given the latent process, the distribution of the latent process, and priors on the parameters of these two distributions [233]. Hierarchical Bayesian modeling has found a wide range of applications in spatio-temporal analysis, and numerous techniques have been developed to accelerate it on large scale datasets [13].

In the hierarchical Bayesian modeling approach we construct a model as follows. Suppose y(s, t) denotes the spatio-temporal observations. The first step is to write it as the sum of two processes, a mean process μ(s, t) and a noise process e(s, t). The mean process may include terms that capture the impact of observed covariates associated with y(s, t). We can further split the spatio-temporal noise term into several parts. For example, [86] proposes to write e(s, t) = α(t) + w(s) + ε(s, t), where α(t) is a Gaussian process modeling the temporal dynamics, w(s) describes the spatial dynamics, and ε(s, t) is white Gaussian noise. The modeling proceeds by adding priors to the parameters of the model.

Hierarchical dynamic Bayesian modeling extends the state space models to spatio-temporal data using the hierarchical Bayesian modeling technique. Using state variables, hierarchical dynamic Bayesian models provide a versatile framework for modeling time-varying spatio-temporal data [13, Chapter 8]. A simple hierarchical dynamic Bayesian model can be built by assuming that the mean process in the previous example can be written as μ(s, t) = x^⊤ β(t) with β(t) = β(t-1) + η(t),
where η(t) is a zero-mean multivariate Gaussian random vector and x represents the static observed covariates.

Figure 2.1: Location of all stations that have ever recorded the amount of rainfall on the island of Hawai‘i [89].

Spatio-temporal Point Processes. Spatio-temporal data tend to be collected at locations that are irregularly spaced over the region under study. For example, as shown in Fig. 2.1, the weather stations are distributed irregularly on the island of Hawai‘i. Such datasets are described by spatio-temporal point processes, statistical models that capture the dynamics of time series data sampled at irregularly spaced locations [62].

A common approach for modeling spatio-temporal point processes is the spatio-temporal Poisson process. A spatio-temporal Poisson process N(dx, dt) is described by its conditional intensity function as follows:

    λ(x, t | H_t) = lim_{|dx|, |dt| → 0} E[N(dx, dt) | H_t] / (|dx| |dt|),

where H_t denotes the history of the process until time t. The probability of no event happening in the interval [t_1, t_2] in an area of size A, for a spatio-temporal Poisson process with the above intensity function, is given by

    P_A(t_1, t_2) = exp( - \int_{t_1}^{t_2} \int_A λ(x, t | H_t) \, dx \, dt ).
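As a small numerical illustration of the last formula, the sketch below evaluates P_A(t_1, t_2) for an intensity that does not depend on the history (as is the case for a Poisson process), using a simple Riemann sum over time and a rectangular region. The particular intensity function and grid resolution are assumptions made only for the example.

```python
import numpy as np

def no_event_probability(intensity, t1, t2, xlim, ylim, n=100):
    """P_A(t1, t2) = exp(-∫∫ λ(x, y, t) dx dy dt) over the rectangle
    xlim x ylim, approximated by a Riemann sum on an n^3 grid."""
    ts = np.linspace(t1, t2, n)
    xs = np.linspace(*xlim, n)
    ys = np.linspace(*ylim, n)
    dt = (t2 - t1) / n
    dx = (xlim[1] - xlim[0]) / n
    dy = (ylim[1] - ylim[0]) / n
    T, X, Y = np.meshgrid(ts, xs, ys, indexing="ij")
    integral = intensity(X, Y, T).sum() * dt * dx * dy
    return np.exp(-integral)

# Hypothetical intensity: more events near the origin, constant in time.
intensity = lambda x, y, t: 2.0 * np.exp(-(x**2 + y**2))

print(no_event_probability(intensity, t1=0.0, t2=1.0,
                           xlim=(-1.0, 1.0), ylim=(-1.0, 1.0)))
```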
Spatio-temporal Poisson processes have found a wide range of applications in epidemiology and climate science [62, Chapter 11].

2.3 Classification and Clustering

In the time series classification task, the data are given as pairs (X_i, y_i) for i = 1, ..., n, where X_i ∈ X denotes a time series and y_i ∈ Y = {1, ..., K} represents its label. The goal is to design a classifier f : X → Y that estimates the label of each time series. The clustering problem assumes that the labels are not given and that the time series are generated from several groups that should be identified.

There are three common approaches for solving these problems: feature extraction from time series, distance metric based approaches, and model based techniques [4]. Here, we briefly describe each approach.

Feature Extraction. The feature extraction approach relies on learning a representation of the time series in a Euclidean space. Given the representations in the Euclidean space, regular classification or clustering algorithms can be applied. Features can be designed manually by experts or in an automated way by learning algorithms [65]. Several approaches have been proposed for automated feature extraction from time series, which are briefly reviewed in this section.

A simple approach for feature extraction is to generate features for a time series by analyzing its local patterns [87]. Certain properties need to be satisfied by the local patterns; for example, [239] argues that the patterns should be frequent in the time series in order to be included in the extracted features. Usually a dictionary of patterns is defined and time series are represented in terms of the number of occurrences of each pattern found in them. The dictionary may contain variable length patterns, which requires approaches such as dynamic time warping for analysis. As an example of these approaches, the authors in [65] use grammar-guided feature extraction to find representations for time series and perform classification afterwards. Another example is [17], where the authors use a bag-of-features representation to extract features from time series for classification.

Another simple approach to extract features from time series is to project them onto several prefixed basis functions. The coefficients can then be used as representations of the time series [120, 125, 126, 159]. This approach alleviates the high-dimensionality challenges in time series classification because, if the basis functions are carefully selected, it results in a lower dimensional representation of the time series. The basis functions can include Fourier bases [76], wavelets [229], B-splines [82], and Bezier curves [190], to name a few. Wavelets are preferred by [142] as they preserve both the time and frequency domain information. As noted by [180], the basis functions can also be selected in a data driven way, as in the functional PCA approach [98, 99] or singular value decomposition. The authors in [180] list the algorithms that are based on either prefixed basis functions or basis functions defined in a data driven way.

Another approach for feature extraction from time series is to fit a generative model such as those described in Section 2.1 [2]. For example, one can fit a vector auto-regressive model to multivariate time series and represent each series by its VAR coefficients. Examples of this approach are hidden Markov models (HMM) [127] and state space models [2].
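The sketch below illustrates this model-based feature idea: each (univariate) series is summarized by the coefficients of an AR model fitted to it, and the resulting coefficient vectors can be fed to any standard classifier or clustering routine. The lag order and the synthetic two-class data are assumptions made for the example.

```python
import numpy as np

def ar_features(x, K=3):
    """Fit an AR(K) model to a univariate series by least squares and
    return its coefficients as a fixed-length feature vector."""
    Z = np.column_stack([x[K - ell:len(x) - ell] for ell in range(1, K + 1)])
    y = x[K:]
    coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return coef                     # shape (K,)

# Two toy classes: slowly vs. quickly decaying AR(1) dynamics.
rng = np.random.default_rng(0)
def simulate(a, T=200):
    x = np.zeros(T)
    for t in range(1, T):
        x[t] = a * x[t - 1] + rng.normal(scale=0.1)
    return x

series = [simulate(0.9) for _ in range(20)] + [simulate(0.2) for _ in range(20)]
labels = np.array([0] * 20 + [1] * 20)
features = np.vstack([ar_features(x) for x in series])
print(features.shape)               # (40, 3): ready for a classifier
```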
Deep learning provides a variety of structures for extracting features from time series data such as speech and for learning representations from them [22]. The structures include deep belief networks [109, 155, 156], autoencoders [57], convolutional neural networks (CNN) [1, 140], and recurrent neural networks (RNN) [92, 114]. Compared to HMMs, RNNs can have a larger memory capacity for storing longer time patterns [110] and do not have the strict assumptions of HMMs about the generative process of the data [92]. Convolutional neural networks are also known to be translation invariant [105], which makes them suitable for learning features from time series.

Time Series Distance Metrics. The second approach for time series clustering is to define a metric that measures the similarity or distance between two time series. Given a distance metric, we can apply techniques such as nearest neighbor classification, or construct a kernel and apply kernelized algorithms such as support vector machines [128].

The simplest distance metric that can be used for time series classification or clustering is the Euclidean distance. However, the Euclidean distance cannot be applied when the time series have different lengths or irregular sampling patterns. Moreover, it is sensitive to distortions in the time series [132], which are common in applications such as speech recognition [189]. To address these types of challenges, we need to compute a warping distance rather than the Euclidean distance.

The warping distance is a popular distance metric for measuring time series distance. It is based on the principle that two time series are similar if warped (deformed) versions of them are similar. The warping distance is the minimum distance between warped versions of two time series, as illustrated in Fig. 2.2. The time warping deformation is required to preserve the order during the alignment and to map the beginnings and ends of the two time series to each other. Dynamic Time Warping (DTW) [189, 226] is an algorithm based on dynamic programming that can efficiently compute the warping distance between two time series; a minimal implementation sketch is given at the end of this subsection. One potential issue with DTW is that it does not define a proper metric; i.e., DTW does not satisfy the triangle inequality, and the kernel function defined via DTW is not positive semi-definite. Given that dynamic time warping does not define a proper distance metric, several algorithms such as the global alignment kernel have been proposed to address this drawback [55, 56]. Metric learning for time series is also under study; for example, [84] learns a Mahalanobis distance for time series by casting the problem as a structured prediction task.

Figure 2.2: Illustration of the deformation operation for functional data. Two functions are considered similar if a deformation of one of them is similar to the other one. The figure has been regenerated from [160, Fig. 4.1].
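The sketch referenced above is a direct dynamic-programming implementation of the DTW distance between two univariate series. It follows the standard O(nm) recursion; the absolute-difference local cost is an illustrative choice.

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic Time Warping distance between 1-D sequences a and b.

    D[i, j] holds the cost of the best warping path aligning a[:i]
    with b[:j]; the recursion allows a match, an insertion, or a
    deletion at every step, so beginnings and ends stay aligned.
    """
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])          # local distance
            D[i, j] = cost + min(D[i - 1, j],        # insertion
                                 D[i, j - 1],        # deletion
                                 D[i - 1, j - 1])    # match
    return D[n, m]

# Two series that differ mainly by a time shift: the DTW distance is
# small, while the pointwise Euclidean distance is comparatively large.
t = np.linspace(0, 2 * np.pi, 60)
x = np.sin(t)
y = np.sin(t - 0.5)
print(dtw_distance(x, y))
print(np.linalg.norm(x - y))
```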
Model-based Classification  The model-based approach attempts to design a generative model for time series data that includes a class or label for each time series. By training this model on the data, it can be applied to other time series to either cluster or classify them [231]. Probabilistic graphical models and neural networks are two of the major tools for defining such models for time series classification and clustering [9]. Usually the graphical models are extensions of the hidden Markov model customized to specific applications, such as classification of biological sequences [27].

2.4 Correlation and Causality Analysis

The correlation between two time series is characterized by the cross-correlation function. The cross-correlation function is the correlation of a time series with a lagged version of another time series; in particular,
$$R_{XY}(t, \ell) = \mathbb{E}[x(t)\, y(t+\ell)].$$
The cross-correlation function captures the correlation of one time series' past with the other one's future. Cross-correlation and its Fourier transform, known as the cross-spectral density, have applications in analyzing the dependence and periodicity of time series [8, 182, 208].

While computing the correlation between two time series is straightforward, discovering the causal relationships among time series is more challenging. One of the first attempts to solve this problem is the notion of Granger causality, which interprets the coefficients of the vector auto-regressive model as the causal influence of one time series on another [90, 91, 151]. Granger causality has gained success across many domains due to its simplicity, robustness, and extendability [7, 35, 108, 152, 170]. The success has also been accompanied by criticism from several perspectives, and it has been shown that this technique is insufficient for causal analysis in the presence of latent confounders [173]. It has been well established that unobserved confounders can lead Granger causality algorithms to identify spurious causal relationships among unrelated variables [67, 91, 97, 179].

[Figure 2.3: Illustration of the main principle behind Granger causality: in both examples the cause happens prior to its effect and its past values help predict future values of the effect. (a) Plot of the values of two time series. (b) Plot of two point processes in which each event is shown with a mark at its occurrence time.]

The definition of Granger causality relies on regression analysis. If all of the variables in a multivariate linear Gaussian system are observed, the change in the conditional distribution of a variable can be measured via linear regression. The regression formulation of Granger causality states that a variable X is the cause of another variable Y if the past values of X are helpful in predicting the future values of Y. Consider the following two regressions:
$$y(t) = \sum_{\ell=1}^{K} a_\ell\, y(t-\ell) + \epsilon_1, \qquad (2.7)$$
$$y(t) = \sum_{\ell=1}^{K} a_\ell\, y(t-\ell) + \sum_{\ell=1}^{K} b_\ell\, x(t-\ell) + \epsilon_2, \qquad (2.8)$$
where K is the maximal time lag. If Eq. (2.8) is a significantly better model than Eq. (2.7), we determine that time series X Granger causes time series Y. To achieve this goal, usually the noise variances of the two models are compared with each other using a statistical significance test such as the likelihood ratio test or the Wald test (see [151, Chapter 2] for details). A typical likelihood ratio test examines a test statistic of the form $\hat{\sigma}_1 - \hat{\sigma}_2 > \alpha$ for some $\alpha > 0$ to accept or reject one of the hypotheses in Eq. (2.7) or Eq. (2.8). The quantities $\hat{\sigma}_1$ and $\hat{\sigma}_2$ are the variances of the residuals of fitting Eq. (2.7) and Eq. (2.8), respectively. The value of $\alpha$ is chosen based on the required level of power for the hypothesis test.

The linear Granger causality analysis is readily extended to multivariate time series represented by $x(t) \in \mathbb{R}^{P\times 1}$.
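Before turning to the multivariate case, here is a minimal sketch of the bivariate test in Eqs. (2.7)-(2.8). It compares the residual sums of squares of the two models with an F-test, a standard alternative to the likelihood-ratio form described above; the simulated data and all names are illustrative assumptions.

```python
import numpy as np
from scipy import stats

def granger_f_test(x, y, K=3):
    """Does x Granger-cause y?  Compare the restricted AR model of Eq. (2.7) with the
    augmented model of Eq. (2.8) using an F-test on the residual sums of squares."""
    T = len(y)
    target = y[K:]
    lag = lambda v: np.column_stack([v[K - l:T - l] for l in range(1, K + 1)])
    X_r = np.column_stack([np.ones(T - K), lag(y)])      # restricted: own lags only
    X_f = np.column_stack([X_r, lag(x)])                 # full: add lags of x
    rss = lambda X: np.sum((target - X @ np.linalg.lstsq(X, target, rcond=None)[0]) ** 2)
    rss_r, rss_f = rss(X_r), rss(X_f)
    df1, df2 = K, (T - K) - X_f.shape[1]
    F = ((rss_r - rss_f) / df1) / (rss_f / df2)
    return F, 1.0 - stats.f.cdf(F, df1, df2)             # F statistic and p-value

# x leads y by one step, so x should be found to Granger-cause y.
rng = np.random.default_rng(1)
x = rng.standard_normal(500)
y = np.zeros(500)
for t in range(1, 500):
    y[t] = 0.5 * y[t - 1] + 0.8 * x[t - 1] + 0.1 * rng.standard_normal()
print(granger_f_test(x, y, K=3))
```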
For this purpose, usually a multivariate auto-regression is performed to predict the time series x(t) as follows:
$$x(t) = \sum_{\ell=1}^{K} A^{(\ell)} x(t-\ell) + \varepsilon(t), \qquad (2.9)$$
where $A^{(\ell)}$ is the matrix of coefficients modeling the effect of the time series at lag $\ell$. The coefficients $A^{(\ell)}$ in Eq. (2.9) can be obtained by minimizing the squared loss. Performing a statistical significance test on the values of the coefficients identifies the Granger causes of the target series. That is, if any of $A^{(\ell)}_{ji}$ for $\ell = 1,\ldots,K$ is non-zero, then we say $x_i$ Granger causes $x_j$. The statistical properties of multivariate Granger causality analysis are similar to those of maximum likelihood estimation, see e.g. [37, Chapter 7].

In real-world applications, where not all influential confounders are observed in the datasets, the central challenge becomes how to utilize prior knowledge about the unmeasured confounders to take their impact into account. For non-linear systems of equations, a common approach to analyzing the effects of hidden confounders is to encode the causal prior knowledge as independence relationships among the confounders in a causal graph [173]. For example, consider the following set of equations:
$$z(t) = \alpha_1 z(t-1) + \varepsilon_1(t),$$
$$x(t) = \alpha_2 x(t-1) + \alpha_3 z(t-1) + \varepsilon_2(t), \qquad (2.10)$$
$$y(t) = \alpha_4 y(t-1) + \alpha_5 z(t-1) + \varepsilon_3(t),$$
where $\varepsilon_i(t)$, $i = 1,\ldots,3$, are independent white noises and $\alpha_i \in \mathbb{R}$, $i = 1,\ldots,5$, are a set of coefficients that guarantee stationarity of the time series. The corresponding causal graph is shown in Fig. 2.4(a). However, the causal graphs generated for larger VAR models are usually cluttered and hard to use.

In [68, 69], in analogy with causal graphs, path diagrams are proposed to represent the causal relationships in time series with simpler graphs. In path diagrams, every time series is represented by a node in the graph and there is an edge from time series X to Y if and only if X Granger causes Y. Fig. 2.4(b) shows the path diagram corresponding to the set of equations in Eq. (2.10). While path diagrams are similar to causal graphs, cycles are allowed in path diagrams, which distinguishes them from causal graphs.

[Figure 2.4: (a) The causal graph corresponding to the set of equations in (2.10) and (b) the corresponding path diagram.]

The criticism of Granger causality has been mostly centered around the philosophical debate on the relationship between Granger causality and true causality. There are two main frameworks for causality analysis: Structural Causal Modeling [173, 207] and the Potential Outcomes framework [111, 117, 186]. These frameworks are not designed specifically for causality analysis in time series and do not require the assumptions of Granger causality about the relationship between time order and causal order or about latent confounders.
The main distinction between classical Granger causality and these frameworks is that the latter aim to identify causality in a counterfactual sense and have explicit mechanisms for compensating for the effect of unobserved confounders. Many researchers (e.g., [173]) have argued that causal order defines time order, while the reverse is not always true. Thus, it is possible that in some datasets the assumptions of Granger causality are not satisfied. For example, the forward-looking behavior of human beings creates cases in which a human action caused by a predictable event happens before the event itself. In addition, instantaneous causes cannot be detected with the classical Granger causality test, though recent extensions can address this issue under certain conditions. Instantaneous causation refers to the circumstance in which the impact of a cause reaches its effect instantaneously and without any time lag. As Granger has pointed out [91], instantaneous causation creates difficulties for Granger causality because, similar to Independent Component Analysis (ICA) [47, 115, 129], the vector auto-regressive model with instantaneous impact is not identifiable when the noise processes are all Gaussian. Several attempts have been made to estimate such models [60, 158, 210]; for example, [116] provides a solution based on a non-Gaussianity assumption about the noise processes.

Chapter 3
Complex Correlations I: Low-rank Tensor Learning

Spatio-temporal data provide unique information regarding "where" and "when", which is essential to answer many important questions in scientific studies from geology and climatology to sociology. In the context of big data, we are confronted with a series of new challenges when analyzing spatio-temporal data because of the complex spatial and temporal dependencies involved.

A plethora of excellent work has been conducted to address this challenge and has achieved success to a certain extent [52, 119]. Oftentimes, geostatistical models use cross variogram and cross covariance functions to describe the intrinsic dependency structure. However, the parametric form of cross variogram and cross covariance functions imposes strong assumptions on the spatial and temporal correlation, which requires domain knowledge and manual work. Furthermore, parameter learning for those statistical models is computationally expensive, making them infeasible for large scale applications.

Cokriging and forecasting are two central tasks in multivariate spatio-temporal analysis. Cokriging utilizes the spatial correlations to predict the value of the variables for new locations.
One widely adopted method is the multitask Gaussian process (MTGP) [32], which assumes a Gaussian process prior over latent functions to directly induce correlations between tasks. However, for a cokriging task with M variables at P locations over T time stamps, the time complexity of inference in MTGP is O(M³P³T) [32]. For forecasting, popular methods in multivariate time series analysis include vector auto-regressive (VAR) models, auto-regressive integrated moving average (ARIMA) models, and cointegration models. An alternative method for spatio-temporal analysis is Bayesian hierarchical spatio-temporal models with either separable or non-separable space-time covariance functions [50]. Reduced-rank models have been proposed to capture the inter-dependency among variables [5]. However, very few models can directly handle the correlations among variables, space, and time simultaneously in a scalable way. In this chapter, we aim to address this problem by presenting a unified framework for many spatio-temporal analysis tasks that is scalable to large scale applications.

Tensor representation provides a convenient way to capture inter-dependencies along multiple dimensions. Therefore it is natural to represent multivariate spatio-temporal data as a tensor. Recent advances in low-rank learning have led to simple models that can capture the commonalities among each mode of the tensor [136, 185]. Similar arguments can be found in the literature on spatial data recovery [83], neuroimaging analysis [244], and multi-task learning [185]. Our work builds upon recent advances in low-rank tensor learning [83, 136, 244] and further considers the scenario where additional side information about the data is available. Many forms of side information are available in spatio-temporal analysis; for example, in geospatial applications, apart from measurements of multiple variables, geographical information is available to infer location adjacency. In social network applications, the friendship network structure is collected to obtain preference similarity. To utilize the side information, we can construct a Laplacian regularizer from the similarity matrices, which favors locally smooth solutions.

We develop a fast greedy algorithm for learning low-rank tensors based on the greedy structure learning framework [15, 197, 242]. Greedy low-rank tensor learning is efficient, as it does not require full singular value decomposition of large matrices, unlike alternating direction methods [83]. We also bound the difference between the loss function at our greedy solution and the one at the globally optimal solution. Finally, we present experimental results on simulation datasets as well as application datasets in climate and social network analysis, which show that our algorithm is faster and achieves higher prediction accuracy than state-of-the-art approaches in cokriging and forecasting tasks.

3.1 Tensor Formulation for Multivariate Spatio-temporal Analysis

The critical challenge in multivariate spatio-temporal analysis is finding an efficient way to incorporate the spatial and temporal correlations into modeling and to automatically capture the shared structures across variables, locations, and time. In this section, we present a unified low-rank tensor learning framework that can perform various types of spatio-temporal analysis. We will use two important applications, i.e., cokriging and forecasting, to motivate and describe the framework.
3.1.1 Cokriging

In geostatistics, cokriging is the task of interpolating the data of one variable at unknown locations by taking advantage of the observations of variables from known locations. For example, by making use of the correlations between precipitation and temperature, we can obtain a more precise estimate of temperature at unknown locations than with univariate kriging. Formally, denote the complete data for P locations over T time stamps with M variables as $\mathcal{X} \in \mathbb{R}^{P\times T\times M}$. We only observe the measurements for a subset of locations $\Omega \subset \{1,\ldots,P\}$ and their side information such as longitude and latitude. Given the measurements $\mathcal{X}_\Omega$ and the side information, the goal is to estimate a tensor $\mathcal{W} \in \mathbb{R}^{P\times T\times M}$ that satisfies $\mathcal{W}_\Omega = \mathcal{X}_\Omega$. Here $\mathcal{X}_\Omega$ represents the outcome of applying the index operator $I_\Omega$ to $\mathcal{X}_{:,:,m}$ for all variables $m = 1,\ldots,M$. The index operator $I_\Omega$ is a diagonal matrix whose entries are one for the locations included in $\Omega$ and zero otherwise.

Two key consistency principles have been identified for effective cokriging [53, Chapter 6.2]: (1) global consistency: the data on the same latent structure are likely to be similar, and (2) local consistency: the data at close locations are likely to be similar. The former principle is akin to the cluster assumption in semi-supervised learning [243]. We incorporate these principles in a concise and computationally efficient low-rank tensor learning framework.

To achieve global consistency, we constrain the tensor $\mathcal{W}$ to be low-rank. The low-rank assumption is based on the belief that high correlations exist within variables, locations, and time, which leads to natural clustering of the data. Existing literature has explored the low-rank structure among these three dimensions separately, e.g., multi-task learning [169] for variable correlation and fixed rank kriging [51] for spatial correlations. Low-rankness assumes that the observed data can be described with a few latent factors. It enforces the commonalities along the three dimensions without an explicit form for the shared structures in each dimension.

For local consistency, we construct a regularizer via the spatial Laplacian matrix. The Laplacian matrix is defined as $L = D - A$, where $A$ is a kernel matrix constructed from pairwise similarities and $D$ is the diagonal matrix with $D_{i,i} = \sum_j A_{i,j}$. Similar ideas have been explored in matrix completion [144]. In the cokriging literature, local consistency is enforced via the spatial covariance matrix. Bayesian models often impose a Gaussian process prior on the observations with covariance matrix $K = K_v \otimes K_x$, where $K_v$ is the covariance between variables and $K_x$ is that between locations. The Laplacian regularization term corresponds to the relational Gaussian process [43], where the covariance matrix is approximated by the spatial Laplacian.

In summary, we can perform cokriging and find the value of the tensor $\mathcal{W}$ by solving the following optimization problem:
$$\widehat{\mathcal{W}} = \operatorname*{argmin}_{\mathcal{W}} \left\{ \|\mathcal{W}_\Omega - \mathcal{X}_\Omega\|_F^2 + \mu \sum_{m=1}^{M} \operatorname{tr}\!\big(\mathcal{W}_{:,:,m}^\top L\, \mathcal{W}_{:,:,m}\big) \right\} \quad \text{s.t.} \quad \operatorname{rank}(\mathcal{W}) \le \rho, \qquad (3.1)$$
where the Frobenius norm of a tensor $\mathcal{A}$ is defined as $\|\mathcal{A}\|_F = \sqrt{\sum_{i,j,k} \mathcal{A}_{i,j,k}^2}$, and $\mu, \rho > 0$ are the parameters that trade off the local and global consistency, respectively. The low-rank constraint finds the principal components of the tensor and reduces the complexity of the model, while the Laplacian regularizer clusters the data using the relational information among the locations. By learning the right tradeoff between these two terms, our method is able to benefit from both.
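The following is a minimal sketch of the ingredients of Eq. (3.1): building the graph Laplacian L = D − A from a pairwise similarity kernel and evaluating the data-fit term plus the Laplacian regularizer. The RBF kernel over coordinates and all names are illustrative assumptions; in practice the similarity can also come from, e.g., a friendship network.

```python
import numpy as np

def graph_laplacian(coords, length_scale=1.0):
    """Graph Laplacian L = D - A from an RBF similarity kernel over location coordinates."""
    d2 = np.sum((coords[:, None, :] - coords[None, :, :]) ** 2, axis=-1)
    A = np.exp(-d2 / (2.0 * length_scale ** 2))
    np.fill_diagonal(A, 0.0)
    return np.diag(A.sum(axis=1)) - A

def cokriging_objective(W, X, observed, L, mu):
    """Loss of Eq. (3.1): squared error on the observed locations plus the Laplacian
    regularizer summed over the M variables.  W and X are P x T x M arrays."""
    I_omega = np.zeros(W.shape[0])
    I_omega[observed] = 1.0
    fit = np.sum((I_omega[:, None, None] * (W - X)) ** 2)
    smooth = sum(np.trace(W[:, :, m].T @ L @ W[:, :, m]) for m in range(W.shape[2]))
    return fit + mu * smooth

# Tiny example: P = 6 locations, T = 10 time stamps, M = 2 variables, half the locations observed.
rng = np.random.default_rng(0)
coords = rng.uniform(size=(6, 2))
L = graph_laplacian(coords, length_scale=0.5)
X = rng.standard_normal((6, 10, 2))
print(cokriging_objective(X.copy(), X, observed=[0, 1, 2], L=L, mu=0.1))
```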
Due to the various definitions of tensor rank, we use rank as a shorthand for rank complexity, which will be specified in a later section.

3.1.2 Forecasting

Forecasting estimates the future values of a multivariate time series given historical observations. For ease of presentation, we use the classical VAR model with K lags and coefficient tensor $\mathcal{W} \in \mathbb{R}^{P\times KP\times M}$ as an example. Using the matrix representation, the VAR(K) process defines the following data generation process:
$$\mathcal{X}_{:,t,m} = \mathcal{W}_{:,:,m}\, \widehat{\mathcal{X}}_{t,m} + \mathcal{E}_{:,t,m}, \quad \text{for } m = 1,\ldots,M \text{ and } t = K+1,\ldots,T, \qquad (3.2)$$
where $\widehat{\mathcal{X}}_{t,m} = [\mathcal{X}_{:,t-1,m}^\top, \ldots, \mathcal{X}_{:,t-K,m}^\top]^\top$ denotes the concatenation of the K-lag historical data before time t. The elements of the noise tensor $\mathcal{E}$ constitute a multivariate Gaussian random vector with zero mean.

Existing multivariate regression methods designed to capture complex correlations, such as Tucker decomposition [185], are computationally expensive. A scalable solution requires a simpler model that also efficiently accounts for the shared structures in variables, space, and time. Similar global and local consistency principles still hold in forecasting. For global consistency, we can use a low-rank constraint to capture the commonalities of the variables as well as the spatial correlations in the model parameter tensor, as in [52]. For local consistency, we enforce the predicted values at close locations to be similar via spatial Laplacian regularization. Thus, we can formulate the forecasting task as the following optimization problem over the model coefficient tensor $\mathcal{W}$:
$$\widehat{\mathcal{W}} = \operatorname*{argmin}_{\mathcal{W}} \left\{ \|\widehat{\mathcal{X}} - \mathcal{X}\|_F^2 + \mu \sum_{m=1}^{M} \operatorname{tr}\!\big(\widehat{\mathcal{X}}_{:,:,m}^\top L\, \widehat{\mathcal{X}}_{:,:,m}\big) \right\} \quad \text{s.t.} \quad \operatorname{rank}(\mathcal{W}) \le \rho, \quad \widehat{\mathcal{X}}_{:,t,m} = \mathcal{W}_{:,:,m}\widehat{\mathcal{X}}_{t,m}. \qquad (3.3)$$

Though cokriging and forecasting are two different tasks, we can easily see that both formulations follow the global and local consistency principles and can capture the inter-correlations in spatio-temporal data.

3.1.3 Unified Framework

We now show that both cokriging and forecasting can be formulated in the same tensor learning framework. Let us rewrite the loss functions in Eq. (3.1) and Eq. (3.3) in the form of multitask regression and complete the quadratic form of the loss function. The cokriging task can be reformulated as follows:
$$\widehat{\mathcal{W}} = \operatorname*{argmin}_{\mathcal{W}} \left\{ \sum_{m=1}^{M} \|H \mathcal{W}_{:,:,m} - (H^\top)^{-1} \mathcal{X}_{\Omega,m}\|_F^2 \right\} \quad \text{s.t.} \quad \operatorname{rank}(\mathcal{W}) \le \rho, \qquad (3.4)$$
where we define $HH^\top = I_\Omega + \mu L$.¹ For the forecasting problem, $HH^\top = I_P + \mu L$ and we have:
$$\widehat{\mathcal{W}} = \operatorname*{argmin}_{\mathcal{W}} \sum_{m=1}^{M} \sum_{t=K+1}^{T} \|H \mathcal{W}_{:,:,m} \widehat{\mathcal{X}}_{t,m} - H^{-1} \mathcal{X}_{:,t,m}\|_F^2 \quad \text{s.t.} \quad \operatorname{rank}(\mathcal{W}) \le \rho. \qquad (3.5)$$
By a slight change of notation, we can easily see that the optimal solutions of both problems can be obtained from the following optimization problem with an appropriate choice of tensors $\mathcal{Y}$ and $\mathcal{V}$:
$$\widehat{\mathcal{W}} = \operatorname*{argmin}_{\mathcal{W}} \left\{ \sum_{m=1}^{M} \|\mathcal{W}_{:,:,m} \mathcal{Y}_{:,:,m} - \mathcal{V}_{:,:,m}\|_F^2 \right\} \quad \text{s.t.} \quad \operatorname{rank}(\mathcal{W}) \le \rho. \qquad (3.6)$$
In particular, in the cokriging problem it is easy to see that the problems are equivalent with $\mathcal{Y}_{:,:,m} = H$ and $\mathcal{V}_{:,:,m} = \mathcal{X}_{\Omega,m}$ for $m = 1,\ldots,M$. In the forecasting problem, $H$ is full rank and the mapping defined by $\mathcal{W} \mapsto \tilde{\mathcal{W}} : \tilde{\mathcal{W}}_{:,:,m} = H\mathcal{W}_{:,:,m}$ for $m = 1,\ldots,M$ preserves the tensor rank, i.e., $\operatorname{rank}(\mathcal{W}) = \operatorname{rank}(\tilde{\mathcal{W}})$. This suggests that we can solve the forecasting problem in Eq. (3.5) as follows: first solve Eq. (3.6) with $\mathcal{Y}_{:,:,m} = \widehat{\mathcal{X}}_{K+1:T,m}$ and $\mathcal{V}_{:,:,m} = \mathcal{X}_{:,:,m}$ and obtain its solution $\tilde{\mathcal{W}}$; then compute $\mathcal{W}_{:,:,m} = H^{-1} \tilde{\mathcal{W}}_{:,:,m}$.

¹ We can use the Cholesky decomposition to obtain H. In the rare case that $I_\Omega + \mu L$ is not full rank, $\epsilon I_P$ is added, where $\epsilon$ is a very small positive value.
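A minimal sketch of the reduction to the unified problem of Eq. (3.6) for the cokriging case, assuming the Cholesky factorization mentioned in the footnote and the choice $\mathcal{Y}_{:,:,m} = H$, $\mathcal{V}_{:,:,m} = \mathcal{X}_{\Omega,m}$ stated above; the rank-deficiency fix follows the footnote. All names and the toy example are illustrative assumptions.

```python
import numpy as np

def cokriging_transform(X, observed, L, mu, eps=1e-8):
    """Form the transformed tensors (Y, V) of the unified problem in Eq. (3.6) for cokriging:
    H is a Cholesky factor with H @ H.T = I_Omega + mu * L (plus eps * I in the rare
    rank-deficient case, as in the footnote)."""
    P, T, M = X.shape
    I_omega = np.zeros((P, P))
    I_omega[observed, observed] = 1.0
    G = I_omega + mu * L
    if np.linalg.matrix_rank(G) < P:
        G = G + eps * np.eye(P)
    H = np.linalg.cholesky(G)                  # lower-triangular, H @ H.T = G
    Y = np.repeat(H[:, :, None], M, axis=2)    # Y[:, :, m] = H for every variable
    V = (I_omega @ X.reshape(P, -1)).reshape(P, T, M)   # X_Omega: zero out unobserved rows
    return Y, V

# Tiny example: chain graph over P = 5 locations, 3 of them observed.
P = 5
A = np.diag(np.ones(P - 1), 1) + np.diag(np.ones(P - 1), -1)
L = np.diag(A.sum(axis=1)) - A
X = np.random.default_rng(0).standard_normal((P, 8, 2))
Y, V = cokriging_transform(X, observed=[0, 2, 4], L=L, mu=0.5)
print(Y.shape, V.shape)   # (5, 5, 2) (5, 8, 2)
```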
After unifying the objective function, we note that tensor rank has different notions, such as CP rank and n-rank [83, 136]. In this chapter, we choose the sum n-rank, which is computationally more tractable [83, 218]. The mode-n rank of a tensor $\mathcal{W}$ is the rank of its mode-n unfolding $\mathcal{W}_{(n)}$.² In particular, for a tensor $\mathcal{W}$ with N modes, we have the following definition:
$$\text{sum n-rank}(\mathcal{W}) = \sum_{n=1}^{N} \operatorname{rank}(\mathcal{W}_{(n)}). \qquad (3.7)$$
A common practice for solving this formulation with the sum n-rank constraint is to relax the rank constraint to a convex nuclear norm constraint [83, 218]. A convex relaxation approach replaces each constraint $\operatorname{rank}(\mathcal{W}_{(n)})$ with its convex surrogate $\|\mathcal{W}_{(n)}\|_*$. The mixture regularization in [218] assumes that the N-mode tensor $\mathcal{W}$ is a mixture of N auxiliary tensors $\{\mathcal{Z}^n\}$, i.e., $\mathcal{W} = \sum_{n=1}^{N} \mathcal{Z}^n$. It regularizes the nuclear norm of only the mode-n unfolding of the n-th tensor $\mathcal{Z}^n$, i.e., $\sum_{n=1}^{N} \|\mathcal{Z}^n_{(n)}\|_*$. The resulting convex optimization problem is as follows:
$$\widehat{\mathcal{W}} = \operatorname*{argmin}_{\mathcal{W}} \left\{ L(\mathcal{W};\mathcal{Y},\mathcal{V}) + \lambda \sum_{n=1}^{N} \|\mathcal{Z}^n_{(n)}\|_* \quad \text{s.t.} \quad \sum_{n=1}^{N} \mathcal{Z}^n = \mathcal{W} \right\}. \qquad (3.8)$$
We adapt the Alternating Direction Method of Multipliers (ADMM) [81] to solve the above problem. Due to the coupling of $\{\mathcal{Z}^n\}$ in the summation, each $\mathcal{Z}^n$ is not directly separable from the other $\mathcal{Z}^{n'}$. Thus, we employ a coordinate descent algorithm to sequentially solve for $\{\mathcal{Z}^n\}$. Given the augmented Lagrangian of the problem below, the ADMM-based algorithm is elaborated in Algorithm 3.1.
$$F(\mathcal{W},\{\mathcal{Z}^n\},\mathcal{C}) = L(\mathcal{W};\mathcal{Y},\mathcal{V}) + \lambda \sum_{n=1}^{N} \|\mathcal{Z}^n_{(n)}\|_* + \frac{\beta}{2} \Big\|\mathcal{W} - \sum_{n=1}^{N} \mathcal{Z}^n\Big\|_F^2 - \Big\langle \mathcal{C},\, \mathcal{W} - \sum_{n=1}^{N} \mathcal{Z}^n \Big\rangle \qquad (3.9)$$

² The mode-n unfolding of a tensor is the matrix resulting from treating n as the first mode of the matrix and cyclically concatenating the other modes. Tensor refolding is the reverse operation [136].

Algorithm 3.1: ADMM for cokriging with mixture regularizer
1   Input: data X with Laplacian matrix L, hyper-parameters λ, β.
2   Output: N-mode tensor W.
3   Initialize W, {Z^n}, C.
4   repeat
5       for variable m = 1 to M do
6           W_{:,:,m} ← (2λL + (β + N)I)^{-1} ( (1/λ) X_{:,:,m} + C + (β/N) Σ_{n=1}^{N} Z^n )
7       end
8       repeat
9           Σ ← N W_{(n)} − (N/β) C − Σ_{n'≠n} Z^{n'}_{(n')};   Z^n_{(n)} ← shrink(Σ, N²λ/β)
10      until the solutions {Z^n} converge
11      C ← C − β (W − (1/N) Σ_{n=1}^{N} Z^n)
12  until the objective function converges

The sub-routine shrink_α(A) applies a soft-thresholding rule at level α to the singular values of the input matrix A. The following lemma shows the convergence of the ADMM-based solver for our problem.

Lemma 3.1 ([25]). Consider the constrained problem min_{x,y} f(x) + g(y) s.t. x ∈ C_x, y ∈ C_y, Gx = y. If either {C_x, C_y} are bounded or G'G is invertible, and the optimal solution set is nonempty, then the sequence of solutions {x, y} generated by ADMM is bounded and every limit point is an optimal solution of the original problem.

However, these methods are computationally expensive, since they may require a near-full singular value decomposition of large matrices. In the next section, we present a fast greedy algorithm to tackle the problem.
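Before describing the greedy solver, here is a minimal sketch of the shrink sub-routine of Algorithm 3.1 (singular value soft-thresholding, the proximal operator of the nuclear norm) together with a consistent unfold/refold pair. The ordering of the trailing modes differs slightly from the cyclic convention of [136], which does not matter as long as the two operations are mutually consistent; all names are illustrative.

```python
import numpy as np

def shrink(A, alpha):
    """Soft-threshold the singular values of A at level alpha."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U * np.maximum(s - alpha, 0.0)) @ Vt

def unfold(W, mode):
    """Mode-n unfolding: move `mode` to the front and flatten the remaining modes."""
    return np.moveaxis(W, mode, 0).reshape(W.shape[mode], -1)

def refold(M, mode, shape):
    """Inverse of unfold for a tensor of the given original shape."""
    rest = [s for i, s in enumerate(shape) if i != mode]
    return np.moveaxis(M.reshape([shape[mode]] + rest), 0, mode)

# Example: nuclear-norm shrinkage applied to the mode-2 unfolding of a random tensor.
W = np.random.default_rng(0).standard_normal((4, 5, 3))
Z2 = refold(shrink(unfold(W, 1), alpha=2.0), mode=1, shape=W.shape)
print(Z2.shape)   # (4, 5, 3): the tensor shape is preserved after shrinkage and refolding
```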
3.2 Fast Greedy Low-rank Tensor Learning

To solve the non-convex problem in Eq. (3.6) and find its optimal solution, we propose a greedy learning algorithm that successively adds rank-1 estimates of the mode-n unfoldings. The main idea of the algorithm is to unfold the tensor into a matrix, seek its best rank-1 approximation, and then fold it back into a tensor of the same dimensionality. We describe this algorithm in three steps: (i) first, we show that we can learn rank-1 matrix estimates efficiently by solving a generalized eigenvalue problem; (ii) we use the rank-1 matrix estimation to greedily solve the original tensor-rank constrained problem; and (iii) we propose an enhancement via orthogonal projections after each greedy step.

Optimal Rank-1 Matrix Learning  The following lemma enables us to find the optimal rank-1 estimate of the unfolded matrices.

Lemma 3.2. Consider the following rank constrained problem:
$$\widehat{A}_1 = \operatorname*{argmin}_{A:\operatorname{rank}(A)=1} \left\{ \|Y - AX\|_F^2 \right\}, \qquad (3.10)$$
where $Y \in \mathbb{R}^{q\times n}$, $X \in \mathbb{R}^{p\times n}$, and $A \in \mathbb{R}^{q\times p}$. The optimal solution can be written as $\widehat{A}_1 = \widehat{u}\widehat{v}^\top$, $\|\widehat{v}\|_2 = 1$, where $\widehat{v}$ is the dominant eigenvector of the following generalized eigenvalue problem:
$$(XY^\top Y X^\top)\, v = \lambda\, (XX^\top)\, v, \qquad (3.11)$$
and $\widehat{u}$ can be computed as
$$\widehat{u} = \frac{1}{\widehat{v}^\top XX^\top \widehat{v}}\, YX^\top \widehat{v}. \qquad (3.12)$$

Eq. (3.11) is a generalized eigenvalue problem whose dominant eigenvector can be found efficiently [95]. If $XX^\top$ is full rank, as assumed in Theorem 3.3, the problem simplifies to a regular eigenvalue problem whose dominant eigenvector can be efficiently computed. The lemma can be found in e.g. [10]; we also provide a proof.

Proof. The original problem has the following form:
$$\widehat{A} = \operatorname*{argmin}_{A:\operatorname{rank}(A)=1} \left\{ \|Y - AX\|_F^2 \right\}. \qquad (3.13)$$
We can rewrite the optimization problem in Eq. (3.13) as estimation of $\alpha \in \mathbb{R}$, $u \in \mathbb{R}^{q\times 1}$ with $\|u\|_2 = 1$, and $v \in \mathbb{R}^{p\times 1}$ with $\|v\|_2 = 1$ such that
$$\widehat{\alpha}, \widehat{u}, \widehat{v} = \operatorname*{argmin}_{\alpha, u, v:\|u\|_2=1, \|v\|_2=1} \left\| Y - \alpha\, u v^\top X \right\|_F^2. \qquad (3.14)$$
We minimize the above objective function in three steps. First, minimization over $\alpha$ yields $\widehat{\alpha} = \langle Y, uv^\top X\rangle / \|uv^\top X\|_F^2$, where we have assumed that $v^\top X \ne 0$. Hence, we have
$$\widehat{u}, \widehat{v} = \operatorname*{argmax}_{u,v:\|u\|_2=1,\|v\|_2=1} \frac{\operatorname{tr}\big((uv^\top X)^\top Y\big)^2}{\|uv^\top X\|_F^2}. \qquad (3.15)$$
The numerator can be rewritten as $\operatorname{tr}\{(uv^\top X)^\top Y\} = \operatorname{tr}\{X^\top v u^\top Y\} = \operatorname{tr}\{YX^\top v u^\top\}$. Some algebra on the denominator yields $\|uv^\top X\|_F^2 = \operatorname{tr}\{(uv^\top X)^\top (uv^\top X)\} = \operatorname{tr}\{X^\top v u^\top u v^\top X\} = \operatorname{tr}\{X^\top v v^\top X\} = v^\top XX^\top v$. This implies that the denominator is independent of $u$ and the optimal value of $u$ in Eq. (3.15) is proportional to $YX^\top v$. Hence, we first find the optimal value of $v$ and then compute $u = (YX^\top v)/\|YX^\top v\|_2$. Substitution of the optimal value of $u$ yields
$$\widehat{v} = \operatorname*{argmax}_{v:\|v\|_2=1} \frac{v^\top XY^\top YX^\top v}{v^\top XX^\top v}. \qquad (3.16)$$
Note that the objective function is bounded and invariant to $\|v\|_2$; hence the $\|v\|_2 = 1$ constraint can be relaxed. Now, suppose the value of $v^\top XX^\top v$ at the optimal choice of $v$ is $t$. We can rewrite the optimization in Eq. (3.16) as
$$\widehat{v} = \operatorname*{argmax}_{v} \; v^\top XY^\top YX^\top v \quad \text{s.t.} \quad v^\top XX^\top v = t. \qquad (3.17)$$
Using the method of Lagrange multipliers, we can show that there is a value of $\lambda$ for which the solution $\widehat{v}$ of the dual problem is the optimal solution of Eq. (3.17). Hence, we need to solve the following optimization problem for $v$:
$$\widehat{v} = \operatorname*{argmax}_{v:\|v\|_2=1} \left\{ v^\top XY^\top YX^\top v - \lambda\, v^\top XX^\top v \right\} = \operatorname*{argmax}_{v:\|v\|_2=1} \left\{ v^\top X (Y^\top Y - \lambda I) X^\top v \right\}. \qquad (3.18)$$
Eq. (3.18) implies that $v$ is the dominant eigenvector of $X(Y^\top Y - \lambda I)X^\top$. Hence, we are able to find the optimal values of both $u$ and $v$ for the given value of $\lambda$. For simplicity of notation, define $P \triangleq XX^\top$ and $Q \triangleq XY^\top YX^\top$. Consider the equations obtained by solving the Lagrangian dual of Eq. (3.17):
$$Qv = \lambda P v, \qquad (3.19)$$
$$\|v^\top X\|_2^2 = t, \qquad (3.20)$$
$$\lambda \ge 0. \qquad (3.21)$$
Eq. (3.19) describes a generalized positive definite eigenvalue problem. Hence, we can select $\lambda_{\max} = \lambda_1(Q, P)$, which maximizes the objective function in Eq. (3.16). The optimal value of $u$ is found by substituting the optimal $v$ into Eq. (3.15), and simple algebra yields the result in Lemma 3.2.
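A minimal sketch of Lemma 3.2, assuming SciPy's generalized symmetric eigensolver; since the product $u v^\top$ is invariant to the scaling of $v$, no explicit normalization of $v$ is needed. Names and the sanity check are illustrative.

```python
import numpy as np
from scipy.linalg import eigh

def rank1_regression(Y, X):
    """Optimal rank-1 solution of  min_{rank(A)=1} ||Y - A X||_F^2  following Lemma 3.2:
    v solves (X Y^T Y X^T) v = lambda (X X^T) v and u = Y X^T v / (v^T X X^T v)."""
    P = X @ X.T                       # p x p, assumed full rank as in Theorem 3.3
    Q = X @ Y.T @ Y @ X.T             # p x p, symmetric positive semi-definite
    _, V = eigh(Q, P)                 # generalized eigenproblem, eigenvalues ascending
    v = V[:, -1]                      # dominant generalized eigenvector
    u = (Y @ X.T @ v) / (v @ P @ v)
    return np.outer(u, v)             # A_hat = u v^T, invariant to the scaling of v

# Sanity check on a regression problem whose true coefficient matrix has rank 1.
rng = np.random.default_rng(0)
X = rng.standard_normal((6, 200))
A_true = np.outer(rng.standard_normal(4), rng.standard_normal(6))
Y = A_true @ X + 0.01 * rng.standard_normal((4, 200))
A_hat = rank1_regression(Y, X)
print(np.linalg.norm(A_hat - A_true) / np.linalg.norm(A_true))   # should be small
```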
Greedy Low n-rank Tensor Learning  The optimal rank-1 matrix learning serves as a basic element of our greedy algorithm. Using Lemma 3.2, we can solve the problem in Eq. (3.6) in a forward greedy selection framework as follows: at each iteration, the greedy algorithm searches for the mode that gives the largest decrease in the objective function. It does so by unfolding the tensor in that mode and finding the best rank-1 estimate of the unfolded tensor. After finding the optimal mode, it adds the rank-1 estimate in that mode to the current estimate of the tensor. Algorithm 3.2 shows the details of this approach, where $L(\mathcal{W};\mathcal{Y},\mathcal{V}) = \sum_{m=1}^{M} \|\mathcal{W}_{:,:,m}\mathcal{Y}_{:,:,m} - \mathcal{V}_{:,:,m}\|_F^2$. Note that we can find the optimal rank-1 solution in only one of the modes, but this is enough to guarantee the convergence of our greedy algorithm.

Algorithm 3.2: Greedy Low-rank Tensor Learning
1   Input: transformed data Y, V of M variables, stopping criterion η
2   Output: N-mode tensor W
3   Initialize W ← 0
4   repeat
5       for n = 1 to N do
6           B_n ← argmin_{B: rank(B)=1} L(refold(W_{(n)} + B); Y, V)
7           Δ_n ← L(W; Y, V) − L(refold(W_{(n)} + B_n); Y, V)
8       end
9       n* ← argmax_n {Δ_n}
10      if Δ_{n*} > η then
11          W ← W + refold(B_{n*}, n*)
12      end
13      W ← argmin_{row(A_{(1)}) ⊆ row(W_{(1)}), col(A_{(1)}) ⊆ col(W_{(1)})} L(A; Y, V)   // Optional orthogonal projection step.
14  until Δ_{n*} < η

Theorem 3.3 bounds the difference between the loss function evaluated at each iteration of the greedy algorithm and the one at the globally optimal solution.

Theorem 3.3. Suppose in Eq. (3.6) the matrices $\mathcal{Y}_{:,:,m}^\top \mathcal{Y}_{:,:,m}$ for $m = 1,\ldots,M$ are positive definite. The solution of Algorithm 3.2 at its k-th iteration step satisfies the following inequality:
$$L(\mathcal{W}_k;\mathcal{Y},\mathcal{V}) - L(\mathcal{W}^*;\mathcal{Y},\mathcal{V}) \le \frac{\big(\|\mathcal{Y}\|_2\, \|\mathcal{W}^*_{(1)}\|_*\big)^2}{k+1}, \qquad (3.22)$$
where $\mathcal{W}^*$ is the global minimizer of the problem in Eq. (3.6) and $\|\mathcal{Y}\|_2$ is the largest singular value of the block diagonal matrix created by placing the matrices $\mathcal{Y}_{:,:,m}$ on its diagonal blocks.

The key idea of the proof is that the amount of decrease in the loss function by each step in the selected mode is not smaller than the amount of decrease if we had selected the first mode. The theorem shows that we can obtain the same rate of convergence for learning low-rank tensors as achieved in [198] for learning low-rank matrices. The greedy algorithm in Algorithm 3.2 is also connected to the mixture regularization in [218]: the mixture approach decomposes the solution into a set of low-rank structures, while the greedy algorithm successively learns a set of rank-one components. The detailed proof is as follows. Note that, intuitively, since our greedy steps are optimal in the first mode, our bound should be at least as tight as the bound of [197].

Proof. Denote the loss function at the k-th step by
$$L(\mathcal{Y},\mathcal{V},\mathcal{W}_k) = \sum_{j=1}^{M} \|\mathcal{V}_{:,:,j} - \mathcal{W}_{:,:,j}\mathcal{Y}_{:,:,j}\|_F^2. \qquad (3.23)$$
Lines 5–8 of Algorithm 3.2 imply
$$L(\mathcal{Y},\mathcal{V},\mathcal{W}_k) - L(\mathcal{Y},\mathcal{V},\mathcal{W}_{k+1}) = L(\mathcal{Y},\mathcal{V},\mathcal{W}_k) - \min_m \inf_{\operatorname{rank}(B)=1} L(\mathcal{Y},\mathcal{V},\mathcal{W}_{(m),k} + B) \ge L(\mathcal{Y},\mathcal{V},\mathcal{W}_k) - \inf_{\operatorname{rank}(B)=1} L(\mathcal{Y},\mathcal{V},\mathcal{W}_{(1),k} + B). \qquad (3.24)$$
Define $B = \alpha C$ where $\alpha \in \mathbb{R}$, $\operatorname{rank}(C) = 1$, and $\|C\|_2 = 1$. Expanding the right hand side of Eq. (3.24) gives
$$L(\mathcal{Y},\mathcal{V},\mathcal{W}_k) - L(\mathcal{Y},\mathcal{V},\mathcal{W}_{k+1}) \ge \sup_{\alpha, C:\operatorname{rank}(C)=1, \|C\|_2=1} 2\alpha \langle CY, V - W_{(1),k}Y\rangle - \alpha^2 \|CY\|_F^2,$$
where $Y$ and $V$ denote the matrices created by repeating $\mathcal{Y}_{:,:,j}$ and $\mathcal{V}_{:,:,j}$ on the diagonal blocks of a block diagonal matrix, respectively.
Since the algorithm finds the optimal B, we can maximize with respect to α, which yields:
$$L(\mathcal{Y},\mathcal{V},\mathcal{W}_k) - L(\mathcal{Y},\mathcal{V},\mathcal{W}_{k+1}) \ge \sup_{C:\operatorname{rank}(C)=1,\|C\|_2=1} \frac{\langle CY, V - W_{(1),k}Y\rangle^2}{\|CY\|_F^2} \ge \sup_{C:\operatorname{rank}(C)=1,\|C\|_2=1} \frac{\langle CY, V - W_{(1),k}Y\rangle^2}{\sigma_{\max}(Y)^2} = \sup_{C:\operatorname{rank}(C)=1,\|C\|_2=1} \frac{\langle C, (V - W_{(1),k}Y)Y^\top\rangle^2}{\sigma_{\max}(Y)^2} = \frac{\sigma_{\max}\big((V - W_{(1),k}Y)Y^\top\big)^2}{\sigma_{\max}(Y)^2}.$$
Define the residual $R_k = L(\mathcal{Y},\mathcal{V},\mathcal{W}_k) - L(\mathcal{Y},\mathcal{V},\mathcal{W}^*)$. Note that $-(V - W_{(1),k}Y)Y^\top$ is the gradient of the residual function with respect to $W_{(1),k}$. Since the operator norm and the nuclear norm are dual of each other, the properties of dual norms give, for any two matrices A and B,
$$\langle A, B\rangle \le \|A\|_2 \|B\|_*. \qquad (3.25)$$
Thus, using the convexity of the residual function, we can show that
$$R_k - R_{k+1} \ge \frac{\big(\|\nabla_{W_{(1),k}} R_k\|_2\, \|W_{(1),k} - W^*_{(1)}\|_*\big)^2}{\sigma_{\max}(Y)^2\, \|W_{(1),k} - W^*_{(1)}\|_*^2} \qquad (3.26)$$
$$\ge \frac{R_k^2}{\sigma_{\max}(Y)^2\, \|W_{(1),k} - W^*_{(1)}\|_*^2} \qquad (3.27)$$
$$\ge \frac{R_k^2}{\sigma_{\max}(Y)^2\, \|W^*_{(1)}\|_*^2}. \qquad (3.28)$$
The sequence in Eq. (3.28) converges to zero at the following rate [198, Lemma B.2]:
$$R_k \le \frac{\big(\sigma_{\max}(Y)\, \|W^*_{(1)}\|_*\big)^2}{k+1}.$$
The step in Eq. (3.27) is due to the fact that the parameter estimation error decreases as the algorithm progresses. This can be seen by noting that the minimum eigenvalue assumption ensures strong convexity of the loss function.

Greedy Algorithm with Orthogonal Projections  It is well known that the forward greedy algorithm may make steps in sub-optimal directions because of noise. A common solution to alleviate the effect of noise is to make orthogonal projections after each greedy step [15, 197]. Thus, we enhance the forward greedy algorithm by projecting the solution into the space spanned by the singular vectors of its mode-1 unfolding. The greedy algorithm with orthogonal projections performs an extra step in line 13 of Algorithm 3.2: it finds the top k singular vectors of the solution, $[U,S,V] \leftarrow \operatorname{svd}(W_{(1)}, k)$, where k is the iteration number. Then it finds the best solution in the space spanned by U and V by solving $\widehat{S} \leftarrow \min_S L(USV^\top; \mathcal{Y}, \mathcal{V})$, which has a closed form solution. Finally, it reconstructs the solution $\mathcal{W} \leftarrow \operatorname{refold}(U\widehat{S}V^\top, 1)$. Note that the projection only needs to find the top k singular vectors, which can be computed efficiently for small values of k.

3.3 Experiments

We evaluate the efficacy of our algorithms on synthetic datasets and real-world application datasets.

3.3.1 Low-rank Tensor Learning on Synthetic Data

For empirical evaluation, we compare our method with multitask learning (MTL) algorithms, which also utilize the commonalities between different prediction tasks for better performance. We use the following baselines: (1) trace norm regularized MTL (Trace), which seeks a low-rank structure only on the task dimension; (2) multilinear MTL [185], which adapts the convex relaxation of low-rank tensor learning solved with the Alternating Direction Method of Multipliers (ADMM) [81] and Tucker decomposition to describe the low-rankness in multiple dimensions; and (3) MTL-L1, MTL-L21 [169], and MTL-LDirty [124], which investigate joint sparsity of the tasks with Lp norm regularization. For MTL-L1, MTL-L21 [169], and MTL-LDirty, we use MALSAR Version 1.1 [245].

We construct a model coefficient tensor $\mathcal{W}$ of size 20 × 20 × 10 with CP rank equal to 1. Then, we generate the observations $\mathcal{Y}$ and $\mathcal{V}$ according to the multivariate regression model $\mathcal{V}_{:,:,m} = \mathcal{W}_{:,:,m}\mathcal{Y}_{:,:,m} + \mathcal{E}_{:,:,m}$ for $m = 1,\ldots,M$, where $\mathcal{E}$ is a tensor with zero-mean Gaussian noise elements.
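A minimal sketch of this data-generation process (with smaller dimensions than 20 × 20 × 10 for brevity) together with a few greedy steps restricted to the first mode, i.e., lines 5-11 of Algorithm 3.2 with n = 1 only, which is the case covered by the convergence argument above. The helper names and dimensions are illustrative assumptions, not the exact experimental code.

```python
import numpy as np
from scipy.linalg import eigh, block_diag

def make_synthetic(P=10, Q=10, M=4, T=80, noise=0.1, seed=0):
    """CP-rank-1 coefficient tensor W and observations V[:,:,m] = W[:,:,m] @ Y[:,:,m] + noise."""
    rng = np.random.default_rng(seed)
    W = np.einsum('p,q,m->pqm', rng.standard_normal(P), rng.standard_normal(Q),
                  rng.standard_normal(M))
    Y = rng.standard_normal((Q, T, M))
    V = np.einsum('pqm,qtm->ptm', W, Y) + noise * rng.standard_normal((P, T, M))
    return W, Y, V

def loss(W, Y, V):
    return sum(np.linalg.norm(W[:, :, m] @ Y[:, :, m] - V[:, :, m], 'fro') ** 2
               for m in range(W.shape[2]))

def mode1_greedy_step(W, Y, V):
    """Rank-1 update of the mode-1 unfolding: apply Lemma 3.2 to the current residual."""
    P, Q, M = W.shape
    Ytil = block_diag(*[Y[:, :, m] for m in range(M)])                 # (QM) x (TM)
    Vtil = np.concatenate([V[:, :, m] for m in range(M)], axis=1)      # P x (TM)
    W1 = np.concatenate([W[:, :, m] for m in range(M)], axis=1)        # P x (QM), mode-1 unfolding
    R = Vtil - W1 @ Ytil                                               # current residual
    Pm = Ytil @ Ytil.T
    Qm = Ytil @ R.T @ R @ Ytil.T
    _, U = eigh(Qm, Pm)                                                # generalized eigenproblem
    v = U[:, -1]                                                       # dominant eigenvector
    u = (R @ Ytil.T @ v) / (v @ Pm @ v)
    B1 = np.outer(u, v)                                                # optimal rank-1 update
    return W + np.stack(np.split(B1, M, axis=1), axis=2)               # refold and add

W_true, Y, V = make_synthetic()
W = np.zeros_like(W_true)
for k in range(3):
    W = mode1_greedy_step(W, Y, V)
    print(f"step {k + 1}: loss = {loss(W, Y, V):.2f}, "
          f"RMSE = {np.sqrt(np.mean((W - W_true) ** 2)):.3f}")
```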
We split the synthesized data into training and testing time series and vary the length of the training time series from 10 to 200. For each training length setting, we repeat the experiments 10 times and select the model parameters via 5-fold cross validation. We measure the prediction performance via two criteria: parameter estimation accuracy and rank complexity. For accuracy, we calculate the RMSE of the estimate versus the true model coefficient tensor. For rank complexity, we calculate the mixture rank complexity [218] as $\mathrm{MRC} = \frac{1}{N}\sum_{n=1}^{N} \operatorname{rank}(\mathcal{W}_{(n)})$.

The results are shown in Figures 3.1(a) and 3.1(b). We omit Tucker decomposition as its results are not comparable.

[Figure 3.1: Tensor estimation performance comparison on the synthetic dataset over 10 random runs. (a) Parameter estimation RMSE versus training time series length, (b) mixture rank complexity versus training time series length, (c) running time of a single round with respect to the number of variables.]

We can clearly see that the proposed greedy algorithm with orthogonal projections achieves the most accurate tensor estimation. In terms of rank complexity, we make two observations: (i) Given that the tensor CP rank is 1, the greedy algorithm with orthogonal projections produces the estimate with the lowest rank complexity. This can be attributed to the fact that the orthogonal projections eliminate redundant rank-1 components that fall in the same spanned space. (ii) The rank complexity of the forward greedy algorithm increases as we enlarge the sample size. We believe that when there is a limited number of observations, most of the new rank-1 elements added to the estimate are not accurate and the cross-validation steps prevent them from being added to the model. However, as the sample size grows, the rank-1 estimates become more accurate and they are preserved during cross-validation.

To showcase the scalability of our algorithm, we vary the number of variables, generating a series of tensors $\mathcal{W} \in \mathbb{R}^{20\times 20\times M}$ for M from 10 to 100, and record the running time (in seconds) of three tensor learning algorithms, i.e., forward greedy, greedy with orthogonal projections, and ADMM. We measure the run time on a machine with a 6-core 12-thread Intel Xeon 2.67GHz processor and 12GB memory. The results are shown in Figure 3.1(c). The running time of ADMM increases rapidly with the data size while that of the greedy algorithms stays steady, which confirms the speed advantage of the greedy algorithm.

3.3.2 Spatio-temporal Analysis on Real World Data

We conduct cokriging and forecasting experiments on four real-world datasets:

USHCN  The U.S. Historical Climatology Network Monthly (USHCN)³ dataset consists of monthly climatological data from 108 stations spanning the years 1915 to 2000. It has three climate variables: (1) daily maximum temperature averaged over the month, (2) daily minimum temperature averaged over the month, and (3) total monthly precipitation.

CCDS  The Comprehensive Climate Dataset (CCDS)⁴ is a collection of climate records of North America from [149]. The dataset was collected and pre-processed by five federal agencies.
It contains monthly observations of 17 variables such as carbon dioxide and temperature spanning 1990 to 2001. The observations were interpolated on a 2.5 × 2.5 degree grid, with 125 observation locations.

Yelp  The Yelp dataset⁵ contains the user rating records for 22 categories of businesses on Yelp over ten years. The processed dataset includes the rating values (1-5) binned into 500 time intervals and the corresponding social graph for 137 active users. The dataset is used for the spatio-temporal recommendation task of predicting the missing user ratings across all business categories.

Foursquare  The Foursquare dataset [148] contains users' check-in records in the Pittsburgh area from Feb 24 to May 23, 2012, categorized by different venue types such as Art & Entertainment, College & University, and Food. The dataset records the number of check-ins by 121 users in each of 15 categories of venues over 1200 time intervals, as well as their friendship network.

³ http://www.ncdc.noaa.gov/oa/climate/research/ushcn
⁴ http://www-bcf.usc.edu/~liu32/data/NA-1990-2002-Monthly.csv
⁵ http://www.yelp.com/dataset_challenge

Cokriging

We compare the cokriging performance of our proposed method with the classical cokriging approaches, including simple kriging and ordinary cokriging with the non-bias condition [119], which are applied to each variable separately. We further compare with the multitask Gaussian process (MTGP) [32], which also considers the correlation among variables. We also adapt ADMM for solving the nuclear norm relaxed formulation of cokriging as a baseline. For USHCN and CCDS, we construct a Laplacian matrix by calculating the pairwise Haversine distance between locations. For Foursquare and Yelp, we construct the graph Laplacian from the user friendship network.

For each dataset, we first normalize it by removing the trend and dividing by the standard deviation. Then we randomly pick 10% of the locations (or users for Foursquare) and eliminate the measurements of all variables over the whole time span. We then produce the estimates for all variables at each timestamp. We repeat the procedure 10 times and report the average prediction RMSE over all timestamps and the 10 random sets of missing locations. We use the MATLAB Kriging Toolbox⁶ for the classical cokriging algorithms and the MTGP code provided by [32].

Table 3.1: Cokriging RMSE of 6 methods averaged over 10 runs. In each run, 10% of the locations are assumed missing.

Dataset     ADMM    Forward  Orthogonal  Simple  Ordinary  MTGP
USHCN       0.8051  0.7594   0.7210      0.8760  0.7803    1.0007
CCDS        0.8292  0.5555   0.4532      0.7634  0.7312    1.0296
Yelp        0.7730  0.6993   0.6958      NA      NA        NA
Foursquare  0.1373  0.1338   0.1334      NA      NA        NA

Table 3.1 shows the results for the cokriging task. The greedy algorithm with orthogonal projections is significantly more accurate on all datasets. The baseline cokriging methods can only handle the two-dimensional longitude and latitude information and thus are not applicable to the Foursquare and Yelp datasets with additional friendship information. The superior performance of the greedy algorithm can be attributed to two of its properties: (1) it can obtain low-rank models and achieve global consistency, and (2) it usually has lower estimation bias compared to nuclear norm relaxed methods.

Forecasting

We present the empirical evaluation on the forecasting task by comparing with multitask regression algorithms. We split the data along the temporal dimension into a 90% training set and a 10% testing set.
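A minimal sketch of this preparation for the VAR(K) forecasting formulation of Eq. (3.2): build the lagged design tensor and split it along the temporal dimension. All names and the toy data are illustrative assumptions.

```python
import numpy as np

def var_design(X, K=3, train_frac=0.9):
    """Build the lagged design of Eq. (3.2) from raw data X (P x T x M): for each t,
    Xlag[:, t, m] stacks [X[:, t-1, m]; ...; X[:, t-K, m]], then split along time."""
    P, T, M = X.shape
    Xlag = np.zeros((K * P, T - K, M))
    for m in range(M):
        for i, t in enumerate(range(K, T)):
            Xlag[:, i, m] = np.concatenate([X[:, t - l, m] for l in range(1, K + 1)])
    target = X[:, K:, :]
    n_train = int(train_frac * (T - K))
    return (Xlag[:, :n_train], target[:, :n_train]), (Xlag[:, n_train:], target[:, n_train:])

# The (train_Y, train_V) pair is what would be fed to the unified problem of Eq. (3.6).
X = np.random.default_rng(0).standard_normal((10, 200, 3))   # P=10 locations, T=200, M=3
(train_Y, train_V), (test_Y, test_V) = var_design(X, K=3)
print(train_Y.shape, train_V.shape)   # (30, 177, 3) (10, 177, 3)
```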
We choose a VAR(3) model and, during the training phase, we use 5-fold cross-validation.

⁶ http://globec.whoi.edu/software/kriging/V3/english.html

Table 3.2: Forecasting RMSE for the VAR process with 3 lags, trained with 90% of the time series.

Dataset  Tucker  ADMM    Forward  Ortho   OrthoNL  Trace   MTL-L1  MTL-L21  MTL-Dirty
USHCN    0.8975  0.9227  0.9171   0.9069  0.9175   0.9273  0.9528  0.9543   0.9735
CCDS     0.9438  0.8448  0.8810   0.8325  0.8555   0.8632  0.9105  0.9171   1.0950
FSQ      0.1492  0.1407  0.1241   0.1223  0.1234   0.1245  0.1495  0.1495   0.1504

Table 3.3: Running time (in seconds) for cokriging and forecasting.

              Cokriging                             Forecasting
Dataset   USHCN    CCDS     YELP      FSQ       USHCN    CCDS    FSQ
ORTHO     93.03    16.98    78.47     91.51     75.47    21.38   37.70
ADMM      791.25   320.77   2928.37   720.40    235.73   45.62   33.83

As shown in Table 3.2, the greedy algorithm with orthogonal projections again achieves the best prediction accuracy. Unlike the cokriging task, forecasting does not necessarily need the correlations between locations for prediction. One might therefore ask whether the Laplacian regularizer helps, so we also report the results for our formulation without the Laplacian (OrthoNL) for comparison. For efficiency, we report the running time (in seconds) in Table 3.3 for both the cokriging and forecasting tasks. Compared with ADMM, which is a competitive baseline that also captures the commonalities among variables, space, and time, our greedy algorithm is much faster for most datasets.

As a qualitative study, we plot the map of the most predictive regions identified by the greedy algorithm on the CCDS dataset in Fig. 3.2. Based on the notion of how informative the past values of the climate measurements at a specific location are for predicting the future values of the other time series, we define the aggregate strength of predictiveness of each region as $w(t) = \sum_{p=1}^{P}\sum_{m=1}^{M} |\mathcal{W}_{p,t,m}|$.

[Figure 3.2: Map of the most predictive regions analyzed by the greedy algorithm using 17 variables of the CCDS dataset. Red denotes high predictiveness whereas blue denotes low predictiveness.]

We can see that two regions are identified as the most predictive: (1) the southwest region, which reflects the impact of the Pacific Ocean, and (2) the southeast region, which frequently experiences relative sea level rise, hurricanes, and storm surge in the Gulf of Mexico. Another interesting region lies in the center of Colorado, where the Rocky Mountain valleys act as a funnel for the winds from the west, providing locally divergent wind patterns.

3.4 Summary

In this chapter, we studied the problem of multivariate spatio-temporal data analysis with an emphasis on two tasks: cokriging and forecasting. We formulated the problem as a general low-rank tensor learning framework which captures both the global consistency and the local consistency principles. We developed a fast and accurate greedy solver with theoretical guarantees for its convergence. We validated the correctness and efficiency of our proposed method on both synthetic datasets and real-application datasets.

Chapter 4
Complex Correlations II: Functional Subspace Clustering

Classical machine learning models assume that each observation is represented as a finite dimensional vector. However, many applications involve functional data, where samples are random functions (instead of standard vectors) representing continuous processes and exhibiting structure [77].
Functional data are increasingly common and important in a variety of scientific and commercial domains, such as healthcare, biology, traffic analysis, climatology, and video. As a result, many statistical methods for analyzing functional data have been proposed [98, 100, 159].

Functional data present challenges and opportunities for machine learning, especially in clustering and representation learning. The underlying process is a continuous function of infinite dimension, usually unknown and difficult to represent directly. Even when only finite samples are available, they can be difficult to work with. Time series, for example, exhibit noise, different lengths, and irregular sampling. Thus, the first step in functional data clustering is often to transform the data into a more regular representation [59, 98, 199] to which standard clustering can be applied (e.g., k-means). Alternative non-parametric approaches define a measure of similarity between samples and cluster in the similarity (or affinity) space [55, 231]. Such approaches often utilize specialized measures of similarity that provide invariance to transformations or deformations. In object recognition, images of the same object should be similar regardless of resolution, lighting, or angle. In time series data mining, dynamic time warping (DTW) is used to compare time series based on shape and permits distortions (e.g., shifting and stretching) along the temporal axis, as shown in Fig. 4.1 [226]. Such similarities can often be used to perform effective clustering [176, 231] but are not immune to the curse of dimensionality inherent in functional data [78, 85]. What is more, they can produce complex manifolds difficult to model using classic dimensionality reduction techniques (e.g., PCA) and cluster models [224].

Subspace clustering, an increasingly popular technique in machine learning, addresses many of the aforementioned challenges. Subspace clustering can capture more complex manifolds and is robust in higher dimensional settings [138, 224], both desirable properties in practical applications. In health care, for example, hospitalized patients with different underlying diseases (i.e., clusters) often exhibit shared or overlapping sets of symptoms (i.e., subspaces). However, most state-of-the-art subspace clustering algorithms with provable guarantees are limited to linear settings, making them infeasible for functional data and incompatible with deformation-based similarity measures.

In this chapter, we propose a new clustering framework for functional data called Functional Subspace Clustering (FSC). FSC extends the power and flexibility of subspace clustering to functional data by permitting the deformations that underlie many popular functional similarity measures. The result is a framework that differs from most existing approaches to functional data clustering. In particular, FSC does not assume a structured generative model (e.g., a sequential model for time series) or a predefined set of basis functions (e.g., B-splines). The FSC framework, described in Section 4.2, works as follows: first, we define a subspace model which allows the functional data to come from multiple deformed linear subspaces. Then we formulate the subspace learning problem as a sparse subspace clustering problem, similar to [72] but as an optimization over operators.
Finally, we introduce an efficient learning algorithm, based on greedy variable selection and assuming access to a fast oracle that can return the optimal deformation between two functions.

We provide theoretical guarantees for FSC with a general class of deformations (Section 4.2.3). In Section 4.3, we apply FSC to a common functional data setting: time series with warped alignments. We provide an efficient implementation of the warping oracle and show how this algorithm can also be used to retrieve the learned basis functions for each deformed subspace. These bases can be used as lower-dimensional features for either classification or clustering. Experimental results on synthetic data and two real hospital data sets, described in Section 4.4, show that FSC significantly outperforms both standard time series clustering and state-of-the-art subspace clustering. In clinical data, our framework learns physiological patterns that can be used to discriminate patients based on risk of mortality.

[Figure 4.1: Illustration of the deformation operation for functional data. Two functions are considered similar if a deformation of one of them is similar to the other one. The figure has been regenerated from [160, Fig. 4.1].]

4.1 Related Work

Functional clustering  There has been a significant amount of research on functional data clustering. This is commonly performed using a two step process, in which functions are first mapped into fixed size representations and then clustered. For example, we can fit the data to predefined basis functions, such as splines or wavelets [229]. In time series data mining, researchers often use motifs or common patterns discovered from the data [145]. There is also a growing body of literature on models for directly clustering functional data without the two-step process [120, 126]. These approaches sometimes make strong assumptions about the underlying function or ignore important structure, such as time order [98].

Non-parametric clustering methods are popular in the data mining literature, where researchers combine specialized distance metrics with simple clustering methods. Functional distance metrics allowing deformation date back several decades [189, 226]. However, it has been shown only recently in the functional data analysis literature that deformation-based metrics can be more robust to the curse of dimensionality than simple Euclidean distance [78, 85].

Subspace Clustering  Unlike much existing work on time series clustering, FSC is based on subspace clustering. Subspace clustering is a generalization of PCA that can discover lower dimensional representations for multiple principal subspaces, enabling it to model more complex manifolds [224]. Subspace clustering is a common tool for cluster analysis in high-dimensional settings [138]. Both of these properties make it well-suited for functional data. Sparse subspace clustering (SSC) does not require local smoothness, permitting disparate points to constitute subspaces [71]. It formulates subspace learning and neighbor selection as a regression, and admits a variety of efficient solutions based on Lasso [71, 72, 205], thresholding [106, 107], and greedy orthogonal matching pursuit [64, 171]. SSC has strong theoretical guarantees and is robust to outliers, which are common in functional data.

Alternative Approaches  FSC does not assume any particular sequential generative process, as in [2, 127, 134], or a predefined set of basis functions, such as B-splines, Bezier curves, or truncated Fourier functions [82, 190, 240].
FSC also admits a theoretical analysis, unlike many alternative frameworks. [240] propose a subspace clustering framework for images that uses predefined truncated Fourier basis functions to implicitly capture different kinds of image deformations. They enumerate all possible deformed bases and then apply Group Lasso to learn the affinity matrix. However, this strategy does not generalize to many functional data problems where the space of potential deformations is too large to enumerate explicitly, such as warped alignments between time series. In contrast, FSC does not require explicit enumeration or representation of the deformations. Instead, it makes use of the fast deformation oracles that have been proposed and studied for many common functional data problems (e.g., DTW for warping distance in time series). Combined with simple greedy variable selection, this makes FSC computationally more efficient than the Group Lasso formulation in [240].

4.2 Functional Subspace Clustering

In this section, we present our proposed Functional Subspace Clustering (FSC) algorithm and elucidate the challenges that functional data present to traditional subspace clustering methods. We first discuss our data model and assumptions in Section 4.2.1, then describe the FSC framework in Section 4.2.2 and provide a theoretical analysis in Section 4.2.3. We discuss how FSC can be applied to time series data with warping in Section 4.3.

4.2.1 Data Model

Let $X_1,\ldots,X_n$ denote n functions on a compact interval I such that $\int_I \mathbb{E}[X_i^2] < \infty$ for $i = 1,\ldots,n$. We observe noisy instances as follows:
$$Y_i = X_i + \varepsilon_i, \quad \text{for } i = 1,\ldots,n, \qquad (4.1)$$
where $\varepsilon_i$ for $i = 1,\ldots,n$ are i.i.d. instances of a random function with zero mean and $\int_I \mathbb{E}[\varepsilon_i^2] = \sigma^2$.

Subspace Assumption  The functions (curves) $X_i$ are selected from L manifolds $S_\ell$ for $\ell = 1,\ldots,L$. Given a set of basis functions $\Phi_\ell$, each manifold $S_\ell$ is defined as the set of all functions that are deformations (warpings) of linear combinations of the basis functions in $\Phi_\ell$:
$$S_\ell \triangleq \left\{ X \;:\; X = d\Big( \sum_{\phi_k \in \Phi_\ell} \alpha_k \phi_k \Big);\ \alpha_k \in \mathbb{R},\ d \in \mathcal{D} \right\}, \qquad (4.2)$$
where $\phi_k$ are the basis functions and the set $\mathcal{D}$ contains all possible deformation operators $d$. We denote the set of all given functions that belong to a manifold $S_\ell$ by $\mathcal{X}_\ell$ and the corresponding set of noisy observations by $\mathcal{Y}_\ell$. Our main goal is to group $X_1,\ldots,X_n$ according to their corresponding subspaces as defined in Eq. (4.2).

While the sets defined in Eq. (4.2) are not linear subspaces in general, they show similar properties under appropriate conditions. In particular, suppose the set of deformations are linear maps and form a finite group with composition as the group law. The group assumption requires that the composition of two deformations belongs to the set, $d_1 \circ d_2 \in \mathcal{D}$ for every $d_1, d_2 \in \mathcal{D}$, and that for every $d \in \mathcal{D}$ there exists an inverse operator $d^{-1}$ such that $d^{-1}(d(X)) = X$ for every $X \in S_\ell$. The permutation groups are prominent groups satisfying these conditions.

We can show that under this assumption, every function in a manifold with s basis functions can be written as a linear combination of s or more other functions in the same manifold. Specifically, for every $X_i \in S_\ell$, we can write the following generalization of the self-expressive equation:
$$X_i = \sum_{X_j \in S_\ell,\, j\ne i} \beta_j\, d_j(X_j), \qquad (4.3)$$
with some deformations $d_j \in \mathcal{D}$ and scalars $\beta_j \in \mathbb{R}$. The proof is as follows.

Proof. In order to show the result in Eq. (4.3), we break down the generative process in Eq. (4.2) into two steps.
Let us denote $\tilde{X} = \sum_{\phi_k \in \Phi} \alpha_k \phi_k$ and $X = d(\tilde{X})$, where $\Phi$ is a set of s basis functions. Since the set of $\tilde{X}$ functions creates a linear subspace, every member can be written as a linear combination of at least s other functions:
$$\tilde{X}_i = \sum_{j\ne i} \beta_j \tilde{X}_j. \qquad (4.4)$$
Given that the set of deformations is a group, the inverses of the deformation operators are also in the set, and we can rewrite Eq. (4.4) as
$$d_i^{-1}(X_i) = \sum_{j\ne i} \beta_j\, d_j^{-1}(X_j), \qquad (4.5)$$
$$X_i = d_i\Big( \sum_{j\ne i} \beta_j\, d_j^{-1}(X_j) \Big). \qquad (4.6)$$
Since the operators are assumed to be linear maps, we can rewrite Eq. (4.6) as follows:
$$X_i = \sum_{j\ne i} \beta_j\, d_i\big(d_j^{-1}(X_j)\big). \qquad (4.7)$$
The group's closure property guarantees that for all i and j there exists $\tilde{d}_j$ in the group such that $\tilde{d}_j = d_i \circ d_j^{-1}$. Thus we can rewrite Eq. (4.7) as
$$X_i = \sum_{j\ne i} \beta_j\, \tilde{d}_j(X_j).$$

Note that our algorithm will not rely on these assumptions to operate; for example, it will not need to compute the inverse of a deformation. FSC can be applied to any data for which the self-expressive property in Eq. (4.3) holds.

4.2.2 Functional Subspace Clustering

Given the result in Eq. (4.3), the cluster assignments of the functional data generated according to Eq. (4.2) can be uncovered using a novel variant of sparse subspace clustering. We solve the following sparse regression problem for all functions $Y_1,\ldots,Y_n$:
$$\widehat{B}_{i,:} = \operatorname*{argmin}_{B_{i,:},\,\{d_j\}} \Big\| Y_i - \sum_{j\ne i} B_{i,j}\, d_j(Y_j) \Big\|_2^2, \quad \text{subject to} \quad \|B_{i,:}\|_0 \le s, \qquad (4.8)$$
where $B \in \mathbb{R}^{n\times n}$. The $L_0$ sparsity pseudo-norm counts the number of non-zero elements of a vector. The goal of this regression is to find the best sparse approximation of $Y_i$ by selecting a few functions $Y_j$, deforming them by optimizing $d_j$, and scaling them by multiplying with $B_{i,j}$. After solving Eq. (4.8) for all functions, similar to subspace clustering we define the symmetric affinity matrix $A = |B| + |B|^\top$ and apply spectral clustering [168, 227] on A, as described in Algorithm 4.2. Note that we use the eigen-gap statistic (line 4 in Algorithm 4.2) to determine the dimension of the embedding [216, 227]. We can also compute the Laplacian embedding to extract a lower dimensional representation of the functions, useful for other machine learning tasks (e.g., classification) [193, Chapter 14].

Algorithm 4.1: Functional subspace clustering.
Data: Noisy functional observations {Y_i}_{i=1}^n and a termination criterion ε.
Result: Clustering assignments for Y_i, i = 1,...,n.
1   for i = 1,...,n do
2       Initialize F ← ∅, J ← {i}, R_1 ← Y_i, l ← 1
3       while max_{j∉J, d_j} |⟨R_l, d_j(Y_j)⟩| / (‖d_j(Y_j)‖_2 ‖R_l‖_2) > ε do
4           ĵ ← argmax_{j∉J, d_j} |⟨R_l, d_j(Y_j)⟩| / ‖d_j(Y_j)‖_2
5           F_l ← F_{l−1} ∪ {ĵ}
6           J ← J ∪ {ĵ}
7           B̂_{i,:} ← argmin_{B_{i,:}} ‖Y_i − Σ_{j∈F_l} B_{i,j} d̂_j(Y_j)‖²_2
8           R_{l+1} ← Y_i − Σ_{j∈F_l} B̂_{i,j} d̂_j(Y_j)
9           l ← l + 1
10  A ← |B| + |B|^⊤
11  Apply spectral clustering to A (e.g., Algorithm 4.2) to obtain cluster assignments.

Algorithm 4.2: Spectral clustering for FSC.
Data: Affinity matrix A
Result: Clustering assignments for Y_i, i = 1,...,n.
1   D ← diag(A·1)
2   L ← D^{-1/2} A D^{-1/2}
3   λ, V ← eig(L)
4   m* ← argmax_{i=1,...,n−1} (λ_i − λ_{i+1})
5   Apply k-means to the first m* columns of V.

Unlike the linear sparse subspace clustering setting, Eq. (4.8) is a large scale sparse regression which requires optimization over both B and d. The optimization over the deformation operator can be especially difficult, as it is an operator optimization.
(4.8) is summarized in Algorithm 4.1 and is based on three main steps: a relax- ation to a regular sparse linear regression problem, use of a fast oracle to find the best deformation, and then greedy variable selection. In the first step, we consider all possible deformations of each Y j as covariates in the regression. This relaxation makes the problem linear and convex but introduces a new computational chal- lenge: it dramatically increases the dimensionality of the regression. For example, given two time series Y 1 ∈ R T 1 and Y 2 ∈ R T 2 , there areO(exp(T 1 +T 2 )) possible warping-based alignments. Merely enumerating all possible warpings and updating the gradient becomes computationally expensive, and solving Eq. (4.8) becomes practically intractable. We address this computational bottleneck by assuming that we have access to a fast oracle that can identify the best deformation for any pair of functionals Y 1 and Y 2 , defined as d ? = argmax d |hY 1 ,d(Y 2 )i|/kd(Y 2 )k 2 . Now rather than solving a complex nonlinear regression or enumerating all possible deformations, we can simply query the oracle for the best deformation for each Y j . Computationally, availability of this oracle significantly simplifies the sparse regression problem and yields an efficient algorithm for solving Eq. (4.8). DTW is an example of such an oracle for measuring time series similarity, especially when it is combined with constraints and early-stopping heuristics [178]. With such a fast oracle available, we can use greedy variable selection with orthogonal projections (OMP) to solve Eq. (4.8) efficiently, as it only requires a limitednumberofcallstotheoracletosolvethesparseregressionproblem. Wecould also use the thresholding approach for sparse subspace clustering, as in [106, 107], but as the authors note and we also confirm in the experiments, the greedy approach 67 typically has better empirical performance. The full algorithm for solving Eq. (4.8) is shown in Algorithm 4.1. 4.2.3 Analysis One major advantage of sparse subspace clustering is that we can find condi- tions under which the success of the algorithm is guaranteed. We begin by defining two quantities to describe the similarity of subspaces and the density of points in each cluster, respectively. To measure subspace similarity, we define the principal angle between two deformed subspaces S ` and S ` 0 as θ `,` 0 = arccos sup V∈S ` ,U∈S ` 0,d,d 0 |hd(V ),d 0 (U))i| kd(V )k 2 kd 0 (U)k 2 , where the supremum is over functions V and Z from S ` and S ` 0, respectively, and over their respective deformations. We also define the minimum principal angle θ ` = min ` 0 6=` θ `,` 0 over all pairs of subspaces, in order to provide a uniform bound for all `. To measure cluster density, we define the covering radius r ` as r ` = max Y j ∈Y ` max V∈S ` min Y∈{Y ` \Y j } dist(V,Y ), where dist(V,Y ) = sup d,d 0 r 1− |hd(V ),d 0 (Y ))i| 2 kd(V )k 2 kd 0 (Y )k 2 . The following theorem provides a sufficient condition for success of our algorithm in the noiseless setting: Theorem 4.1. Suppose the deformation operators are linear maps and create a group with composition as the group law. For any function Y i ∈ Y ` , Algorithm 4.1 will select only the functions from the same subspace as Y i ’s neighbors if the termination criterion is > cos(θ ` )(1 + √ 2r ` ). 68 Proof. 
To prove the statement of the theorem, we need to show that by selection of the termination criterion as the theorem suggests, the Algorithm 4.1 will stop before adding any functions from other subspaces. In other words, Let us study the correctness of the theorem for neighbors of an arbitrary functionY i ∈Y ` ; heretoafter we drop the i index for simplicity of notation whenever it is not ambiguous. If R k denotes the residual at kth step, define the normalized residual as ¯ R k =R k /kR k k 2 ; we need to show that the following quality cannot be larger than : max V6∈Y ` ,d D ¯ R k , ¯ d(V ) E <. where ¯ d(Y ) =d(Y )/kd(Y )k 2 for any function Y. Furthermore, define μ ` = max ` 0 6=` sup V∈S ` ,U∈S ` 0,d,d 0 |hd(V ),d 0 (U))i| kd(V )k 2 kd 0 (U)k 2 . We note that we always have μ ` ≤ θ ` , asY ` ⊂ S ` . Also, let us define the span of d(Y ` ) as the span of the set of functions{d(Y ) Y∈Y ` }. To prove the main statement, we proceed with induction, as in [64]. Given the assumptions and the value of in the theorem, the first step holds, becauseR k =Y i . For induction, assume that at kth iteration all of the previous functions have been selected from the correct subspace. Given the result in Eq. (4.3), the residual is still in the span of thed(Y ` ). Thus, we can write ¯ R k = ¯ d 1 (U) +E whereU is the closest function inY ` to ¯ R k and E∈ S ` . The latter is due to the assumptions about the 69 deformation operators that require them to be linear map and form a group with composition operation as the group law. We can write: max Y j 6∈Y ` ,d 1 ,d 2 | ¯ R k , ¯ d 2 (Y j )i| = max Y j 6∈Y ` ,d 1 ,d 2 |h ¯ d 1 (U) +E, ¯ d 2 (Y j )i| ≤ max Y j 6∈Y ` ,d 1 ,d 2 |h ¯ d 1 (U), ¯ d 2 (Y j )i| +|hE, ¯ d 2 (Y j )i| ≤μ ` + max Y j 6∈Y ` ,d 1 ,d 2 |hE, ¯ d 2 (Y j )i| ≤μ ` + cosθ 0 kEk 2 k ¯ d 2 (Y j )k 2 , (4.9) whereθ is the minimum principal angle betweenS i and all other subspaces. We can bound thekEk 2 as follows: kEk 2 =k ¯ R k − ¯ d 1 (U)k 2 = q k ¯ R k k 2 2 +k ¯ d 1 (U)k 2 2 − 2h ¯ R k , ¯ d 1 (U)i ≤ r 2− 2 q 1−r 2 ` . (4.10) Plugging the result in Eq. (4.10) in Eq. (4.9) yields: max Y j 6∈Y 1 ,d 1 ,d 2 | ¯ R k , ¯ d 2 (Y j )i|≤μ ` + cosθ 0 r 2− 2 q 1−r 2 ` ≤μ ` + √ 2 cosθ 0 r ` , where the last step is due to the fact that q 1− √ 1−x 2 ≤ x for x∈ [0, 1]. Given the fact that cosθ is an upper bound forμ ` , we conclude that the algorithm will not make a mistake in selection of function in its k + 1st step and the induction step is correct. Thus, we obtain the statement in the theorem. 70 The theorem’s main implication is that the algorithm will not make any mis- take in identifyingY i ’s cluster neighbors, provided that the subspaces are sufficiently different and each cluster is sufficiently large. Thus, if the algorithm finds enough neighbors for each data point, then the functions in each subspace will create con- nected clusters, ensuring the success of spectral clustering. Discussion One implication of this theorem is that we need to control the flex- ibility of the deformation operator because of its influence on the performance of FSC. Excessive flexibility increases the overlap of (i.e., decreases the principal angle between) the subspaces, which can degrade performance. Furthermore, we need to restrict the number of possible deformations to be polynomial in order to ensure asymptotic consistency of our variable selection algorithm. Thus, we need to care- fully manage the flexibility of the deformation operator. 
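Whichever deformation class is used, the greedy selection loop itself is straightforward to implement. The following is a minimal Python sketch of the selection loop in Algorithm 4.1, written under the assumption that a deformation oracle is available as a black-box function; the names fsc_affinity, oracle, eps, and max_atoms are illustrative choices, and capping the number of selected atoms is a simplification added here for safety rather than part of the algorithm.

import numpy as np

def fsc_affinity(Y, oracle, eps=0.1, max_atoms=5):
    """Sketch of the greedy selection in Algorithm 4.1.

    Y      : list of n 1-D numpy arrays (noisy functional observations).
    oracle : oracle(r, y) -> the deformed copy of y that best matches r
             (treated as a black box here, e.g. a constrained warping).
    eps    : termination threshold, as in Theorem 4.1.
    """
    n = len(Y)
    B = np.zeros((n, n))
    for i in range(n):
        target = Y[i].astype(float)
        residual = target.copy()
        selected, atoms = [], []
        while len(selected) < max_atoms:
            best_j, best_score, best_atom = None, eps, None
            for j in range(n):
                if j == i or j in selected:
                    continue
                d = oracle(residual, Y[j])
                score = abs(residual @ d) / (
                    np.linalg.norm(d) * np.linalg.norm(residual) + 1e-12)
                if score > best_score:
                    best_j, best_score, best_atom = j, score, d
            if best_j is None:
                break                       # criterion of Theorem 4.1 not met
            selected.append(best_j)
            atoms.append(best_atom)
            # Orthogonal projection step: refit all selected atoms jointly.
            D = np.stack(atoms, axis=1)
            coef, *_ = np.linalg.lstsq(D, target, rcond=None)
            residual = target - D @ coef
            B[i, selected] = coef
    return np.abs(B) + np.abs(B).T          # symmetric affinity matrix A

The affinity matrix returned by such a routine is then passed to the spectral clustering step of Algorithm 4.2.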
In time series analysis, for example, it is common to use a constrained warping window with DTW [189]. 4.3 FSC for Time Series with Warping In this section, we show how FSC can be applied to time series data with warping-basedalignmentdeformations. Analignmentdeformation(orwarping func- tion) d maps the samples of one time series Y i onto those of a second time series Y j , while preserving the time order. We assume that we are given only a finite set of observations from each Y i , indexed as Y it for t∈T i ⊂I where 0 <|T i | <∞. Thus, an alignment is typically realized as a list of non-decreasing pairs of indices with constraints on neighboring pairs (e.g., each index can change by at most one from one pair to the next). Given a measure of discrepancy (or similarity) between 71 individual time points, the minimum warping distance (or maximum warping sim- ilarity) can be computed in quadratic time using dynamic programming. This is known as dynamic time warping (DTW) [226]. The set of all warping deformations is not a group as it does not satisfy the conditions in Section 4.2.2, nevertheless we can still perform approximate clustering via FSC with time warping deformations. DTW is, in principle, a fast oracle for returning the best warping alignment betweentwotimeseries, butbecauseitcomputesanun-normalizeddistance(making distancesbetweendifferentpairsoftimeseriesincomparable), itcannotbeusedwith FSC in practice. Thus, we develop an alternative algorithm for quickly computing the optimal alignment between two time series, which returns a normalized distance and can be used as a deformation oracle for FSC. We describe this algorithm in Section 4.3.1 and then show in Section 4.3.2 how its formulation can be used to recoverthelatentbasisfunctionslearnedfortimeseriesunderwarpingdeformations. 4.3.1 Fast Warping Selection for Time Series Here we develop an alternative oracle for efficiently selecting the best warping betweentwotimeseries. Webeginbyobservingthatduringgreedyvariableselection in Algorithm 4.1, the best direction is given as ˜ Y j = argmax d(Y j ) (hR i ,d(Y j )i) 2 kd(Y j )k 2 2 where we have used R i to denote the residual of Y i at some iteration of Algorithm 4.2. For simplicity, we assume that the length of the time series R i and Y j are equal to T 1 and T 2 , respectively. In order to efficiently find the optimal warping for a time series R i , we note that every warping is an assignment of each point in Y j to one point in R i . Thus, we can use a list of binary indicator vectors Z = (z 1 ,...,z T 1 ), 72 z k ∈{0, 1} T 2 to represent every deformation asd τ (Y j ) = (z > 1 Y j ,...,z > T 1 Y j ). Now we can reformulate the warping selection process as an integer program {z ? k } = argmin {z k } P T 1 k=1 (z > k Y j ) 2 P T 1 k=1 R ik z > k Y j 2 (4.11) s.t. z k,` ∈{0, 1}, T 2 X ` z k,` = 1. We impose additional linear constraints to guarantee that the warping preserves the time order. In particular, if Y jt is assigned to R it 0, Y js is assigned to R is 0, and t < s, we require t 0 ≤ s 0 . To enforce this constraint, it suffices to consider the integer number ¯ z corresponding to the binary vector to ¯ z and require ¯ z k ≤ ¯ z k+1 for k = 1,...,T 1 − 1. This constraint can be implemented as a set of linear con- straints by considering the positional binary notation. In practice, we also restrict the warping to not map data points that are apart from each other by more than Δ time stamps [189]. 
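As a rough illustration of what such a window-constrained deformation oracle can look like, the sketch below aligns Y_j onto the time axis of the residual with a banded dynamic program. It is only an approximation of the selection rule used here: it minimizes a squared alignment cost in the style of DTW rather than maximizing the normalized correlation of Eq. (4.11) via the Frank-Wolfe relaxation, and the band half-width and the averaging of multiply-matched samples are implementation choices made for illustration.

import numpy as np

def banded_alignment_oracle(r, y, band=10):
    """Map y onto the time axis of r with an order-preserving, banded alignment.
    Returns d(y), an array of len(r); the band must be at least |len(r) - len(y)|
    for a feasible path to exist."""
    T1, T2 = len(r), len(y)
    cost = np.full((T1 + 1, T2 + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, T1 + 1):
        for j in range(max(1, i - band), min(T2, i + band) + 1):
            c = (r[i - 1] - y[j - 1]) ** 2
            cost[i, j] = c + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    # Backtrack the optimal path; average the y samples matched to each r index.
    sums, counts = np.zeros(T1), np.zeros(T1)
    i, j = T1, T2
    while i > 0 and j > 0:
        sums[i - 1] += y[j - 1]
        counts[i - 1] += 1
        step = int(np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    counts[counts == 0] = 1
    return sums / counts

A function of this form can be plugged in directly as the oracle argument of the greedy selection sketch given earlier.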
Implementing this constraint reduces the number of variables in the optimization from T 1 T 2 toT 1 (2Δ + 1) which can accelerate the algorithm if the time series are long. Notice that the optimization in Eq. (4.11) can be readily used in irregular and multivariate time series, as well. Relaxing the integer constraint in Eq. (4.11), the problem becomes convex and can be efficiently solved. Given the fact that we need to solve Eq. (4.11) for all time series Y j ,j6= i, we use Frank-Wolfe’s algorithm [121] because it provides an inexpensive certificate for the duality gap in each iteration of the optimization problem. We can use this to disqualify suboptimal directions during the greedy searchinAlgorithm4.1bycheckingthecurrentsearchdirection’sdualitygapagainst the previous optimum. Given the simple form of Eq. (4.11) and the fact that it is strongly convex, we further accelerate the optimization by performing a line search 73 on a grid of step size values, followed by a few iterations of Newton’s method. In practice, the optimization converges within very few iterations. 4.3.2 Identifying the Latent Basis Functions The formulation of warping operator in Eq. (4.11) enables us to recover the deformed latent functions by solving Eq. 4.12. Without loss of generality, suppose Y 1 ,...,Y m have been clustered into a single cluster and that the underlying latent space isr dimensional. We need to solve the following optimization problem (PCA- TW) min Z,,W m X i=1 Y i − r X k=1 Z i,k k w i,k 2 2 (4.12) s.t. Z i,k ∈L,k k k 2 = 1. whereL denotes the set of constraints described in Section 4.3.1. We need to solve for the bases k , alignment variables Z i,k , and weights w i,k . To solve this problem, we alternate between optimizing over k and{w i,j ,Z i,k }, while fixing the other one. For learning the k , we initialize them using PCA and optimize the loss function together with the fused lasso regularizer [215] to obtain a smooth function. For optimization overZ i,k andw i,k , analytical solution of optimization overw i,k leads to an optimization problem similar to Eq. (4.11) for Z i,k which can be solved by the same method described in the previous section. 74 Discussion The formulation of deformation in Eq. (4.11) is not limited to time warping but encompasses many existing deformation operations discussed in the lit- erature. For example, two time series may be considered similar if only certain sub- sequences are similar. We can capture this phenomenon by relaxing the constraint P T 2 ` z k,` = 1 to P T 2 ` z k,` ≤ 1. The formulation in Eq. (4.11) is also appropriate for handling the missing data settings. Note that because of the subspace cluster- ing nature of the algorithm, smoothness of the time series will be automatically incorporated in the clustering process. 4.4 Experiments To demonstrate the effectiveness of FSC, we perform experiments using one synthetic dataset and two real world datasets related to health and wellness. All data are time series, so we use the time series with warping variant of our algorithm, FSC-TW. We compare FSC-TW’s performance to the following baselines: ED+SC WeapplyspectralclusteringtoanaffinitymatrixbasedonEuclidean distance, created as follows: first, we construct a distance matrixD and normalize it by its largest element. Then we define A = exp(−D) + exp(−D > ) and apply spectral clustering toA. DTW+SC We apply spectral clustering to a DTW-based affinity matrix, con- structed using the same procedure. 
GAK+SC We apply spectral clustering to an affinity matrix constructed using the Global Alignment Kernel (GAK) [55, 56], a variant of DTW that yields a valid positive semidefinite kernel. SSC We apply the original Sparse Subspace Clustering algorithm proposed in [71], 75 without deformations. TSC-TW We apply the Thresholded Subspace Clustering (TSC) algorithm from [107], combined with our warping deformation oracle from Eq. (4.11). The ED+SC, DTW+SC, and GAK+SC baselines enable us to evaluate the benefitprovidedbysparsesubspaceclustering, incomparisontosimpledeformation- based clustering. The SSC baseline allows us to determine whether allowing deformed subspaces improves the performance of subspace clustering. The TSC- TW provides a comparison with an alternative sparse subspace clustering with time warping and another variable selection technique. 4.4.1 Synthetic Data Experiments We begin with synthetic data experiments to investigate how FSC-TW per- forms on data generated from the assumed deformed subspace model described in Section 4.2.1 and to explore the impact of data dimensionality, subspace separation, and cluster density. First, we generate two synthetic datasets with different basis time series, shown in Fig. 4.2(a). We create three subspaces, each including two of the basis functions. We use two forms of deformation operators: (i) a random shift in time, selected uniformly from [−10, 10] and (ii) time warping with maxi- mum window of length 10. We then investigate how increasing both the length of the synthetic time series and the number of points per cluster impacts the cluster error rate of the different algorithms. The results in Fig. 4.2 confirm the utility of the deformed subspace assumption and the superior performance of FSC-TW. Comparing the baseline algorithms, we can divide them into two categories: subspace clustering based algorithms (FSC-TW, TSC, SSC) and regular spectral clustering with different distance metrics. Given the true subspace model in the 76 0 0.5 1 0 0.5 1 Gaussian Bases 0 50 100 150 0 0.5 1 −1 0 1 −1 0 1 Sinusoidal Bases 0 50 100 150 −1 0 1 Time (a) Bases 90 150 210 270 0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 Length of Time Series Clustering Error Rate FSC−Tw SSC TSC−TW GAK+SC DTW+SC ED+SC (b) Length of Time Series 20 30 40 0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 Number of Data Points Per Cluster Clustering Error FSC−TW SSC TSC−TW GAK+SC DTW+SC ED+SC (c) Number of Time Series Figure 4.2: Synthetic data experiments. (a) The bases used for constructing the synthetic data. (b,c) The clustering error rate for six algorithms as (b) the length of time series grows and (c) the number of time series per cluster grows. syntheticdata, weexpectthefirstgrouptoperformbetter. Amongthealgorithmsin the first group, SSC does not capture the deformations in the data. While TSC-TW captures deformations, subspace clustering with thresholding is empirically shown to have inferior performance compared to sparse subspace clustering [106]. Thus, we expect FSC-TW to perform superior compared to SSC and TSC-TW. In Fig. 4.2(b) we fix the number of examples in each cluster to 50 and evaluate performance as the length of the time series increases. Increasing the length of the time series increases the dimensionality of the data, which in turn increases the sparsity of each point’s neighborhood and the separation of the subspaces; i.e. the principal anglesθ ` in Section 4.2.3 increases. 
As expected, the error for all subspace clustering algorithms improve as length increases, while the performance of non- subspace methods gradually degrades. This is consistent with two previous findings: first, that DTW provides minimal advantage over ED for long time series [146]; and second, that subspace clustering is especially beneficial in higher dimensions [138]. FSC-TW outperforms the baselines at all tested lengths. In Fig. 4.2(c) we fix the length of the time series to 150 and increase the number of data points in each cluster, which also increases the density of points 77 within each cluster and potentially increasing overlap between clusters and a more complicated neighborhood structure. The overall trend is similar to that of length: FSC-TW is clearly superior for all sizes, and the subspace cluster methods improve rapidly as the clusters grow in size. Again, this is consistent with what is known about DTW (it provides less benefit in large scale time series datasets [146]) and about subspace clustering. As we increase the number of data points per cluster, the probabilitythatsubspaceclusteringfindsthecorrectclusteringincreasesbecausethe probability that two points from the same cluster are subspace neighbors increases. Once again, FSC-TW outperforms the baselines. It is very interesting that plain SSC becomes increasingly robust to deforma- tions as the time series become longer and as the data set size grows. This suggests that the subspace model assumptions are well-suited to functional data, at least under these conditions. However, it performs quite poorly for small numbers of short time series. FSC-TW is robust to length, yielding the best performance for both short and long time series. Together, these results suggest that our combi- nation of subspaces, deformations, and greedy variable selection yields a powerful clustering framework for functional data. 4.4.2 Real World Data We apply FSC-TW and our baselines to two real world data sets related to health: ICU This is a collection of multivariate clinical time series extracted from a major hospital’s electronic health records (EHRs) system, recorded by clinical staff during care in an intensive care unit (ICU). Each time series includes 24 hours of 78 Table 4.1: Average AUC obtained by the algorithms on the real world datasets. Dataset FSC-TW TSC-TW SSC GAK+SC DTW+SC ED+SC ICU 62.32±8.37 56.54±9.78 61.99±7.89 56.35±8.06 59.55±8.17 58.76±9.83 Physionet 66.27±6.08 52.41±6.61 62.51±8.56 51.30±7.99 50.56±7.99 49.73±7.25 measurements for 13 variables, including vital signs, lab tests, or clinical obser- vations, with one observation per hour. In these data, subspaces correspond to collections of physiologic signs and symptoms, while clusters represent cohorts of similar patients. We treat in-hospital mortality prediction as a binary classification task. Physionet The Physionet dataset 1 is a publicly available collection of multi- variate clinical time series, similar to ICU but with additional variables. The time series are also 48 hours long and include in-hospital mortality as a binary label. Fig. 4.3 shows that the two classes are very difficult to distinguish on the basis of their raw time series data alone. In all of the datasets, we normalize each time series to have zero mean and unit variance. Then, we apply each algorithm to learn the affinity matrix and then extract lower dimensional representations as described in Algorithm 4.2. 
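For reference, a minimal Python sketch of that embedding and clustering step (Algorithm 4.2) could look as follows; treating the eigen-gap dimension m* directly as the number of k-means clusters is an illustrative simplification, and scikit-learn's KMeans is used purely for convenience.

import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans

def spectral_embed_and_cluster(A):
    """Sketch of Algorithm 4.2 applied to a symmetric affinity matrix A."""
    d = A.sum(axis=1)
    d[d == 0] = 1.0                                  # guard against isolated rows
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = D_inv_sqrt @ A @ D_inv_sqrt                  # normalized matrix (Line 2)
    vals, vecs = eigh(L)                             # ascending eigenvalues
    vals, vecs = vals[::-1], vecs[:, ::-1]           # reorder to descending
    gaps = vals[:-1] - vals[1:]
    m_star = int(np.argmax(gaps)) + 1                # eigen-gap statistic (Line 4)
    embedding = vecs[:, :m_star]                     # lower dimensional representation
    labels = KMeans(n_clusters=max(m_star, 2), n_init=10).fit_predict(embedding)
    return embedding, labels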
We then evaluate the utility of these representations by using them as features in an RBF-SVM binary classifier. For evaluation, we create 30 randomly divided training and testing partitions. For each partition, we train the RBF-SVM on the training partition using 5-fold cross validation, then test it on the corresponding test set. Table 4.1 summarizes the AUC for each method and data set, averaged across partitions.

1 http://physionet.org/challenge/2012/

Figure 4.3: Mean and standard deviation trajectories for twelve variables in the Physionet dataset, for patients who survived (blue) and deceased (red). Note the similarity of the time series and the fact that they are almost indistinguishable by the naked eye.

Results The results in Table 4.1 reveal several interesting trends. First, FSC-TW once again yields the best classifier, and only FSC-TW and SSC yield a classifier that is statistically different from guessing at random in both datasets, validating the sparse subspace assumption. This confirms our intuition that the complex latent manifold structure of critical illness (in terms of symptoms and signs) is captured better by subspace clustering than by simpler methods. What is more, the subspace assumption alone may provide some robustness to deformations, as observed in the synthetic data results.

The more interesting trend is the interaction between subspaces and warping. Using a warping-based distance benefits subspace clustering more than spectral clustering; we can see this when we compare the improvements in FSC-TW (vs. SSC) and DTW+SC (vs. ED+SC). This suggests that different patients may have conditions with different time courses in symptoms and treatment responses. However, we observe that the performance gains when adding warping for the clinical data sets are smaller than those observed in the synthetic data. Our hypothesis is that the degree of warping (shifts and stretches) in the real world clinical data may be relatively small (i.e., an hour or two) with respect to the hourly sampling rate.

Finally, we note that this is a very challenging classification problem: the patient outcome (i.e., death or discharge) can occur anywhere from hours to weeks after admission, but we are considering only the first 48 hours of data [202]. Also, the outcome often depends upon a complex set of factors beyond initial presentation, including treatments, which are not available in these data and may occur after the first 48 hours [172]. What is more, the natural cluster structure is likely less correlated with outcome than it is with disease. Mortalities appear as outliers, rather than as a coherent cluster. In future work, we would like to apply FSC-TW to data with diagnostic labels to examine whether subspaces and clusters reflect known disease patterns.

4.4.3 Deformed basis function recovery

Next, we demonstrate PCA-TW's ability to learn and recover deformed basis functions (described in Section 4.3.2). We first demonstrate this using synthetic data, as follows: first, we select a principal vector (the dashed black line in Fig. 4.4). Then we generate 25 time series that are randomly shifted versions of this principal component.
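As a rough sketch of this generation step (the exact shift range and noise level used in the experiment are not restated here, so the values below are assumptions), such a dataset can be produced with a few lines of Python; circular shifts via np.roll are used purely for simplicity.

import numpy as np

rng = np.random.default_rng(0)

def shifted_copies(base, n_series=25, max_shift=10, noise_std=0.01):
    """Generate noisy, randomly shifted copies of a base curve."""
    T = len(base)
    data = np.zeros((n_series, T))
    for i in range(n_series):
        s = int(rng.integers(-max_shift, max_shift + 1))
        data[i] = np.roll(base, s) + noise_std * rng.standard_normal(T)
    return data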
We then apply several algorithms to identify the true principal component, including basic PCA (red dashed line), our PCA-TW algorithm (solid blue line, Section 4.3.2), and PCA with a fused Lasso regularizer (PCA-fused, purple dashed line), which is also used by our algorithm.

Figure 4.4: Synthetic data experiment: PCA-TW is the only algorithm that successfully recovers the principal component under deformation.

Figure 4.4 clearly shows that PCA-TW is the only algorithm that recovers the true shape of the basis and that its performance is not solely due to its use of the fused Lasso regularizer. One of the main advantages of the proposed latent function learning algorithm in Section 4.3.2 is that it preserves more variance than regular principal component analysis. At the same time, it is able to capture the main trend in the functional data without overfitting to the particular realization of the time series. The deformation allows us to obtain a principal component that preserves a larger amount of variance. The fraction of variance preserved in the first component by our algorithm is 30.52% and 39.36%, compared to 19.44% and 24.30% by PCA, for survivors and mortalities, respectively.

4.5 Summary

In this chapter, we proposed Functional Subspace Clustering (FSC), a non-parametric functional clustering framework that can be applied to functional data with complex subspace structures and used with general deformation operations, including time series with warped alignments. We showed that this can be formulated as a sparse subspace clustering problem and solved using an efficient greedy algorithm with theoretical guarantees. Applied to time series data, FSC outperforms both standard time series clustering and linear subspace clustering.

Chapter 5

Latent Factors

A major challenge that we are confronted with in practical applications is the incompleteness of the data, i.e., certain influential time series are missing in real-world datasets. For example, in social media analysis, external events influence large clusters of users, while news propagates through the local connections in the network. In order to identify the true influence patterns among the users, we need to take into consideration the impact of external unobserved events. In climate data analysis, the local terrain characteristics play an important role in air mass propagation, while large climate systems, which are usually not observed in the datasets collected by local weather stations, influence wide areas on the ground.

There are two main approaches to address the challenge of latent factors in time series analysis: (i) one can design algorithms that remove the impact of major unobserved time series and approximately recover the true underlying temporal dependency graph; (ii) one may be interested in eliminating the impact of any possible type of hidden variable, not just the ones with global impact. Comparing the two tasks, the trade-off is between the accuracy of causal inference and the scalability of the algorithm. In this chapter, we propose solutions to both problems.

The traditional approach to capturing the impact of unobserved variables is to include them in the graphical model and infer their impact via the EM algorithm [61]. However, this approach has two main weaknesses: (1) Oftentimes, the EM algorithm only identifies a local optimum.
(2) While several techniques have been developed to speed up the EM algorithm, usually the inference cannot scale to large datasets. Recent progress shows that in the Gaussian linear undirected graphical [39] and vector auto-regressive [123] models, the impact of latent variables appears 85 as an additive low-rank matrix in the precision matrix and evolution matrix, respec- tively. Thus, one can use scalable convex optimization algorithms to decompose the parameter matrix into a sparse local dependency matrix and another low-rank global impact matrix which captures the impact of latent variables. While the convex sparse plus low-rank decomposition in the linear vector auto- regressive models is promising, the model applies to a very limited class of time series data. For example, in social media applications, the number of mentions of key words by users is a counting process and the vector auto-regressive model obvi- ously is not appropriate. In many climatology applications the distribution of the data exhibits heavy tails [20, 44]. For example, climate change is mostly character- ized by increasing probabilities of extreme weather patterns, such as temperature or precipitation reaching extremely high values [204]. In search of more general and flexible time series models, we study generalized auto-regressive processes and show that, under certain conditions, the maximum likelihood estimate of their evolution matrices can be decomposed into a sparse and a low-rank matrix with the latter capturing the impact of unobserved processes. For counting processes, we analyze the Poisson [220] and Conway-Maxwell Poisson [246] auto-regressive processes. The latter distribution has recently attracted researchers’ attention because of its flexi- bility in modeling under-dispersion and over-dispersion of discrete data [196, 201]. For extreme value time series, we propose a novel heavy-tailed auto-regressive time series model, by choosing the distribution of the data to be the Gumbel distribution [96]. To solve the sparse plus low-rank decomposition problem, we develop a scal- able greedy algorithm with an upper bound on its convergence rate, based on the Frank-Wolfe algorithm [121, 213]. Extensive experiments on two synthetic datasets, 86 one climatology dataset and one social network dataset are used to demonstrate the superior performance of the proposed algorithms. In addition, we observe an interesting phenomenon: often times, separating out a low-rank component from the maximum likelihood estimation of the sparse coefficients improves the accuracy of the estimation, even if there are no latent variables. Thus, we examine the possible causes for this phenomenon and demonstrate theoretically that in most generalized linear models, the finite sample bias is additive and the bias term is approximately low-rank. Regarding the second challenge, the main attempt to cancel out the effects of all unobserved confounders in time series has been carried out by the authors in [68, 69, 70]. They have extended Pearl’s criteria [173, ch. 3] for determining a set of time series that by conditioning on them the spurious causation paths are blocked and the Granger causality identifies the true temporal dependency graph. 
However, they do not answer the central question of “when we limit ourselves to the world in which cause is always prior to its effects can we reduce the number of the time series that should be observed?” In response to the challenge, in Section 5.3 we show that coping with hidden confounders’effectiseasierinGrangernetworks. Inparticular, inGrangernetworks, many directionally connected paths are disconnected considering the delay associ- ated with the edges. Thus, often times we require conditioning on fewer variables to block the spurious causation paths. Our contribution in this problem includes definition of path delays and showing that the path delays are essential in analysis of the unobserved confounders’ impact. The rest of the chapter is organized as follows: after preliminary and related works in Section 5.1, we introduce the models and describe the sparse and low-rank 87 decomposition for generalized linear auto-regressive processes in Section 5.2.1, and the inference optimization algorithm in Section 5.2.4. In Section 5.3 we introduce path delays and analyze the impact of unobserved confounders using them. Section 5.4 describes the experiment results and Section 5.5 elaborates on the low-rank structure of small sample bias followed by summary of the chapter. 5.1 Preliminaries and Related Work Generalized Linear Models Generalized Linear Models [166] connect the response variables y with predictor variables x via the following linear dependence model: g(E y|x [y]) =Ax +b, (5.1) where the strictly monotone function g(·) is called the link function and A, b are linear regression coefficients. Based on generalized linear models, we can define the stochastic process model for time series x(t) for t = 1,...,T according to the following Generalized Linear Auto-regressive Processes (GLARP) model: g(E H(t) [x(t)]) = K X `=1 A (`) x(t−`) +b. (5.2) where the matricesA (`) for` = 1,...,K,K denoting the maximum lag in time, are called the Evolution Matrices and E H(t) is the expectation over the history up to time t. The generative process of the model above is as follows: at time t, compute the mean of x(t) conditioned on the observations at time t− K,...,t− 1, i.e., x(t−K),...,x(t− 1), then generate x(t) according to a distribution (usually from exponential family) with the computed mean. Examples of the generalized linear 88 auto-regressive models are vector auto-regressive models and Poisson auto-regressive processes. In this chapter, we focus in an important sub-class of generalized linear models in which the link function is the canonical link functions, i.e., the density function is given as follows: f(x(t)|A (`) ,H(t)) =h(x(t)) exp x(t) > (˜ x(t))−1 > M (˜ x(t)) , (5.3) where ˜ x(t) = P K `=1 A (`) x(t−`)+b, thefunctionM :R7→Rappliedelementwise, and h(x(t)) is the normalization factor of the distribution. The canonical link function is selected as g −1 (y) = d dy M(y). WecanreconstructthetemporaldependencygraphG(V,E)fromtheevolution matrices A (`) for ` = 1,...,K as follows: represent every time series x i by a node and add a directed edge e i→j to the set E if at least one of the entries A (`) j,i for ` = 1,...,K is non-zero. SparseplusLow-rankDecomposition Inordertoachieveaconsistentestimate of a high-dimensional matrix from a limited number of observations, we may impose a low-dimensional structure on the estimated matrix. 
One of the most popular structures is the sparse plus low-rank structure which assumes that the true value of the matrix is approximately equal to a low-rank part plus a sparse part (Fig. 5.1). Examples of the applications that exhibit this low-dimensional structure are robust PCA [36, 41, 181], robust covariance estimation [3] and multi-task regression [165, 184]. 89 = + MLE Estimate Low-Rank Sparse Figure 5.1: Decomposition of the evolution matrix in Eq. (5.6) into low-rank and sparse matrices. Learning with Latent Factors In many real world applications, obtaining all influential quantities can be computationally expensive or even not possible. The latent time series can be the quantities that are hard to measure or immeasurable events such as disease outbreak news. Thus, taking into consideration the possible existence of such latent time series makes the analysis significantly more accurate and realistic. The most common approach to capture the effect of latent variables is based on the EM algorithm [61]. While the EM framework is quite general, it can be trapped in the local optima. Thus, it is interesting to find a close to convex programming solution which does not depend on the initialization point. The focus of this chapter is on the settings where the temporal dependency graph among observed time series is sparse, while few latent factors influence a large number of observed time series. This setting, known as the local-global structure [123], is believed to exist in many datasets, such as stock price time series [40] and social media activity time series [163]. [39] show that in undirected graphs with latent factors, the precision matrix of the joint distribution of observed variables havethesparsepluslow-rankstructure. [123]showthatinthevectorauto-regressive model with unobserved variables with global impact, the evolution matrix estimated 90 via maximizing the likelihood of the observed data only, will result in the sparse plus low-rank structure as well. 5.2 Methodology In this section, after describing the generalized linear auto-regressive processes withlatentfactors, weintroduceandanalyzetwoGLARPmodelsformodelingcount data and another one for modeling extreme value data. We show that in these mod- els, the maximum likelihood estimate of the evolution matrix can be decomposed into a sparse and a low-rank matrix with the latter capturing the impact of unob- servedprocesses. Generalizingthisresult, wealsoshowthatthesparsepluslow-rank decomposition property holds for a general sub-class of GLARP with exponential family distributions and canonical link function. Then, in Section 5.2.4 we propose an algorithm to uncover the true evolution matrix and provide a guarantee on its convergence to the global optimum. 5.2.1 Stochastic Processes with Latent Factors Suppose we have observed p time series x(t)∈R p×1 and want to analyze the structure of maximum likelihood estimation when we have been informed that r time series z(t)∈R r×1 are missing. To this end, we incorporate the latent factors into the GLARP model in Eq. (5.2) as follows: g E H(t) x(t) z(t) = K X `=1 A (`) B (`) C (`) D (`) x(t−`) z(t−`) +b (5.4) 91 fort =K +1,...,T, whereK denotes the maximum number of lags,T is the length of time series, and the function g is the link function. 
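To make the setup of Eq. (5.4) concrete, the following is a small simulation sketch for the Poisson case with a logarithmic link and K = 1; the coefficient magnitudes, sparsity level, latent dynamics, and the clipping of the log-rates are illustrative assumptions chosen only to keep the simulated counts bounded.

import numpy as np

rng = np.random.default_rng(1)

def simulate_poisson_glarp(T=2000, p=20, r=2, density=0.1, scale=0.1):
    """Simulate observed counts x(t) and latent counts z(t) from a K = 1 Poisson
    GLARP of the form in Eq. (5.4): log E[x(t)] = A x(t-1) + B z(t-1) + b."""
    A = scale * (rng.random((p, p)) < density) * rng.standard_normal((p, p))  # sparse local part
    B = scale * rng.standard_normal((p, r))        # dense columns: global latent impact
    C = np.zeros((r, p))
    D = 0.5 * np.eye(r)                            # slowly mixing latent factors
    b = np.full(p, 0.5)
    x, z = np.zeros((T, p)), np.zeros((T, r))
    for t in range(1, T):
        log_rate_x = np.clip(A @ x[t - 1] + B @ z[t - 1] + b, -10, 3)
        log_rate_z = np.clip(C @ x[t - 1] + D @ z[t - 1], -10, 3)
        x[t] = rng.poisson(np.exp(log_rate_x))
        z[t] = rng.poisson(np.exp(log_rate_z))
    return x, z, A, B

Fitting only the observed block x(t) of such a simulation is precisely the setting in which the estimated evolution matrix picks up the additive low-rank component analyzed below.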
The probability density function of the observations at timet is denoted byf(x(t),(t)) where(t) denotes the set of parameters of the distribution that are functions of the evolution matrices A (`) ,B (`) ,C (`) and D (`) and the past values of time series x(t) and z(t). Since we do not have access to the latent factors, the maximum likelihood estimation of the model parameters in absence of the time series z(t) is performed as follows: { b A (`) } MLE = argmax { b A (`) } T Y t=K+1 f(x(t),(t)) , (5.5) where{ b A (`) } represents the set of estimated evolution matrices b A (`) for` = 1,...,K. Clearly, when we have only observedx(t), we will be only able to find an estimate for A in Eq. (5.4) and analyze the theoretical impact ofB,C, andD in our estimation of A. 5.2.2 Examples of GLARP In this section, we provide three examples of GLARP model, including two for count data and one for heavy-tailed continuous data. Then, we show that, in all of these models, the latent factors create an additive low-rank bias in the estimation of the evolution matrix. Count Data Recently, point processes have been successfully applied to social networks analysis [133, 163, 228]. A popular approach in analyzing the temporal dependence among multiple point processes is to count the number of events —e.g., social media 92 activity— in regularly spaced intervals and analyze the resulting count time series [133, 220]. The Poisson distribution is one of the most commonly used distributions for modeling count data. In the Poisson auto-regressive point process model [220], the distribution of time series x(t) at time t is a Poisson distribution with a rate (t)∈R p×1 defined as follows: log(t) = log(E H(t) [x(t)]) = K X `=1 A (`) x(t−`) +b, (5.6) The model parameters A (`) ,` = 1,...,K and b can be obtained by minimizing the negative log-likelihood function, which is convex and easy to minimize. Conwey-Maxwell Poisson Distribution A major limitation of Poisson regres- sion is its strong assumption that the variance of a Poisson variable is equal to its mean. In other words, the Poisson model does not allow over-dispersion and under- dispersion. The Conwey-Maxwell Poisson distribution (in short COM-Poisson) is a two-parameter extension of the Poisson distribution with a parameter for modeling the dispersion. Historically, it was introduced in [49] and recently studied compre- hensively in [201]. The COM-Poisson distribution is defined based on the following property: P[X =k− 1] P[X =k] = k μ ! ν , where ν is called the dispersion parameter, ν < 1 corresponds to over-dispersion and ν > 1 indicates under-dispersion. The main advantage of the COM-Poisson distribution over other generalizations of the Poisson distribution, such as Double Poisson [66] and Generalized Poisson [48] distributions, is its flexibility in modeling 93 a greater range of dispersion [246]. In addition, the COM-Poisson distribution is equivalent to the Poisson distribution whenν = 1, the Geometric distribution when ν = 0 and the Bernoulli distribution as ν →∞. The COM-Poisson GLARP is defined as follows: P[x i (t)|μ i (t),ν] = 1 S(μ i (t),ν) μ i (t) x i (t) x i (t)! ! ν log (t) + 1 2ν − 1 2 ≈ log E H(t) [x(t)] = K X `=1 A (`) x(t−`) +b. (5.7) where(t) is the rate parameter and S(μ i (t),ν) is the normalization term. Given a constant (invariant with time) value for the dispersion parameter ν, the negative log-likelihood function is convex and can be minimized efficiently. 
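To make the estimation step concrete for the simplest of these models, the following sketch evaluates the negative log-likelihood and its gradient for the Poisson GLARP of Eq. (5.6) with K = 1; the flat parameter packing and the function name are illustrative, and in the thesis this objective is not minimized directly but combined with the sparse plus low-rank structure of Section 5.2.4.

import numpy as np

def poisson_glarp_nll_grad(params, X, p):
    """Negative log-likelihood (up to additive constants) and gradient for the
    K = 1 Poisson GLARP, with params = [vec(A), b] and X of shape (T, p)."""
    A = params[:p * p].reshape(p, p)
    b = params[p * p:]
    X_past, X_now = X[:-1], X[1:]            # lagged design and targets, (T-1, p)
    eta = X_past @ A.T + b                   # log-rates
    rate = np.exp(eta)
    nll = np.sum(rate - X_now * eta)         # -log-likelihood up to log(x!) terms
    grad_eta = rate - X_now
    grad_A = grad_eta.T @ X_past
    grad_b = grad_eta.sum(axis=0)
    return nll, np.concatenate([grad_A.ravel(), grad_b])

The (value, gradient) pair can be handed to a generic quasi-Newton routine, e.g. scipy.optimize.minimize with jac=True, for an unregularized fit.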
Extreme value data In many applications, such as climate analysis, time series data usually exhibitaheavy-taileddistributionwhichissignificantlydifferentfromthecommonly assumed Gaussian distribution. The generalized extreme value theorem states that the extremum of a set of independently and identically distributed random variables asymptotically converges to the Generalized Extreme Value Distribution [20, 44]. It has been shown that the distribution of extrema of many common distributions such as Gaussian, log-normal, and Gamma converges to a popular and simpler sub- class of the generalized extreme value distribution that is called Gumbel distribution [45]. The Gumbel distribution is defined according to following probability density function: f(X =x|μ,σ) = 1 σ exp − x−μ σ − exp − x−μ σ , 94 where μ and σ are location and scale parameters, respectively. In order to analyze extreme value time series data, we define a Gumbel GLARP model as follows [219]: f(x i (t)|μ i (t),σ) = 1 σ exp − x i (t)−μ i (t) σ − exp − x i (t)−μ i (t) σ !! (5.8) (t) +σγ E =E H(t) [x(t)] = K X `=1 A (`) x(t−`) +b, where(t) andσ denote the location and scale parameters of the Gumbel distribu- tion and γ E ≈ 0.5771 is the Euler constant. Given a constant scale parameter σ, the negative log-likelihood function is convex and can be minimized efficiently. For all of the GLARP models described above, we have the following theorem: Theorem 5.1. Suppose a generalized linear auto-regressive process (x(t),z(t)) is defined according to Eq. (5.6), Eq. (5.7) or Eq. (5.8). Suppose the number of unobserved processesr and the number of lagsK are much smaller than the number of observed ones, i.e. r,K p. Then, asymptotically as T →∞, the maximum likelihood estimate of{A (`) } is the sum of two matrices: lim T→∞ b A (`) MLE,T =A (`) +L (`) , where L (`) a low-rank matrix with rank(L (`) )≤rK. Proof sketch The solution relies on two main ideas: 95 1) Using the law of large numbers, we can show that, asymptotically, the maximum likelihood estimation procedure is equivalent to minimization of the KL- divergence between the true model and the observed model. Thus, we can write: lim T→∞ b A (`) MLE,T = argmin n E x(t)∼f( ∗ ) [L ∗(x(t))−L (x(t))] o , = argmax n E x(t)∼f( ∗ ) [L (x(t))] o (5.9) where ∗ is a short notation for the true value of the parameters in the model, i.e., ∗ ={A ∗,(`) ,B ∗,(`) ,C ∗,(`) ,D ∗,(`) } `=1,...,K and describes the parameters in the model without latent factors, i.e., ={A (`) } `=1,...,K . 2) In the second step, we need several approximations to compute the solution ofEq. (5.9). Forpointprocesses,supposewedividethetimeintosmallintervalssuch that the probability of observing more than one event in each interval is small. We can approximate the likelihood of the observed time series for any point process in a unified form given its rate function, as shown in [220]. This allows the computation of b A MLE for all point processes in a unified way. The details of the proof are as follows: Without loss of generality, we prove the case whereK = 1 andb =0. The proof for the general case is straightforward, but algebraically more involved, extension of the simpler case. Proof for the Poisson and COM-Poisson GLARPs Consider the following true model for the time series: log E x(t) z(t) = A B C D x(t− 1) z(t− 1) 96 LetE[x(t)] =(t),E[z(t)] = 0 (t) andu(t) = [x(t) > ,z(t) > ] > denote the aggregation of the both observed and unobserved variables. 
In the maximum likelihood solution with unobserved time series z(t), we fit the data to the following observed model: log (E[x(t)]) = b Ax(t− 1) (5.10) First we show how we can derive the likelihood for any point process given its rate function, [220]. Suppose we have divided the time into small enough so that the probability of x i (t) = 1 for i = 1,...,p becomes small in each interval [94] and we have: P[x i (t) = 0]≈ 1−λ i (t), P[x i (t) = 1]≈λ i (t), P[x i (t)≥ 2]≈ 0. The probability of observing x(t) in the t th interval can be written as following: P[x(t)|x(t− 1)] = p Y i=1 (λ i (t)) x i (t) (1−λ i (t)) 1−x i (t) . Now we can approximate the negative log-likelihood function as follows using the fact that when λ i (t) is small, we can write log(1− λ i (t)) ≈ −λ i (t) and log(λ i (t)[1−λ i (t)] −1 )≈ log(λ i (t)) [220]. L Obs ≈ p X i=1 x i (t) log(λ i (t))−λ i (t). (5.11) 97 Substituting the value ofλ i (t) from the observed model in Eq. (5.10) into Eq. (5.11), we can see that b A MLE is the solution of the following problem: b A MLE = argmax b A E True [L Obs ], (5.12) = argmax b A n E True h x(t) > b Ax(t− 1)−1 > exp( b Ax(t− 1)) io . where we have written the equations in the compact vector format. Differentiation ofL with respect to b A and setting it to zero yields the following results: E True h x(t−1)x(t) > −x(t−1)exp( b Ax(t−1)) > i =0, E u(t−1) h E x(t)|u(t−1) h x(t−1)x(t) > −x(t−1)exp( b Ax(t−1)) > ii =0, E u(t−1) x(t−1) exp([AB]u(t−1))−exp( b Ax(t−1)) > =0. (5.13) where A and B are the true values in Eq. (5.2.2). Since u i ∈{0, 1} with high probability, by taking the expectation with respect to each individual u i we can see that Eq. (5.13) is satisfied if and only if the following equality holds: E u(t−1) h exp([AB]u(t− 1))− exp( b Ax(t− 1)) i =0. (5.14) Suppose A,B, and b A are bounded. Since u i ∈{0, 1}, the values inside the exponential functions are bounded. Given that the exponential function is a one to 98 one function, Eq. (5.14) is equivalent to the following equation: (which can also be obtained by Taylor expansion.) E u(t−1) h [AB]u(t− 1)− b Ax(t− 1) i =0, E u(t−1) h Bz(t− 1)− ( b A−A)x(t− 1) i =0, B 0 (t− 1)− ( b A−A)(t− 1) =0, (5.15) where Eq. (5.15) is the result of linearity of expectation operator. Since Eq. (5.15) holds for all values of and 0 , the column space of b A−A is equal to the column space ofB. Thus, rank ofL = b A−A can be at most the rank of column space ofB; i.e. rank(L)≤r. This concludes the proof. The proof also holds for Bernoulli and COM-Poisson processes, due to the fact that Eq. (5.11) holds for them too [220]. ProoffortheGumbelGLARP Considerthefollowingtrue modelfortheGum- bel time series: E x(t) z(t) = A B C D x(t− 1) z(t− 1) In the maximum likelihood solution with unobserved time seriesz(t), we fit the data to the following observed model: E[x(t)] = b Ax(t− 1) Similar to the previous theorem, our goal is to find the expression for b A MLE as in Eq. (5.12). The key to approximation of b A MLE is to assume thatE[x(t)] =0 andAx(t) 99 is small; both of these assumptions can be satisfied in the data by pre-processing. 
Proceeding with the proof, we have: b A MLE = argmin b A ( E True " p X i=1 x i (t)−μ i (t) σ + exp ( − x i (t)−μ i (t) σ )!#) , Using the fact thatE[x(t)] =0 and differentiation with respect to A yields: E True x(t− 1) exp ( − x(t)− b Ax(t− 1) σ ) > =0, (5.16) E True x(t− 1) ( 1− x(t)− b Ax(t− 1) σ ) > ≈0, (5.17) E True x(t− 1) n x(t)− b Ax(t− 1) o > ≈0, (5.18) E u(t−1) x(t− 1) n Ax(t− 1) +Bz(t− 1)− b Ax(t− 1) o > ≈0, (5.19) b A≈A +BQ zx Q −1 xx , (5.20) The step from (5.16) to (5.17) is due to the Taylor expansion of the exponential function around zero; the step from (5.17) to (5.18) is done using the fact that E[x(t)] = 0; the step from (5.18) to (5.19) is done by expectation with respect to conditional distribution of x(t) given u(t) under the true model; and the final step is done via the definition of the co-variance matrices. 5.2.3 Discussion and Generalization The key implication of Theorem 5.1 is that, asymptotically, the latent factors create an additive bias term in the maximum likelihood estimation of the evolution matrices b A (`) . Ifwemaketheassumptionthatthetrueevolutionmatricesaresparse, under certain assumptions [3, 40], we can eliminate the impact of latent factors and 100 obtain a more accurate estimation of A (`) . The procedure for sparse plus low-rank decomposition in GLARP models is described in Section 5.2.4. We can extend the result of Theorem 5.1 to a large sub-class of GLARP models as follows: Theorem 5.2. Consider generalized linear auto-regressive process (x(t),z(t)) is defined according to Eq. (5.2) with canonical link function. Suppose the num- ber of unobserved processes r and the number of lags K are much smaller than the number of observed ones, i.e. r,K p. Suppose the link function is suffi- ciently smooth and there exists a preprocessing procedure for x such that we have M 00 ( P K `=1 A (`) x(t−`) +b)≈m1 for some constant m> 0. Then, asymptotically as T→∞, the maximum likelihood estimate of{A (`) } is sum of two matrices: lim T→∞ b A (`) MLE,T =A (`) +L (`) , where L (`) a low-rank matrix with rank(L (`) )≤r.K. The main assumptionM 00 ( P K `=1 A (`) x(t−`)+b)≈m1 states that the function M 00 (·) should be isotropic in the neighborhood of the true model parameters. It can be satisfied when the function is smooth around the true model parameters and thus m≈ 0. The details of the proof are as follows: Following the proof sketch, we write: b A MLE = argmin A {E True [−L Obs (x(t))]} = argmin A n E True h x(t) > (˜ x(t))−1 > M(˜ x(t)) io 101 where in the last step we have used the definition of the distribution for GLARP with canonical link function in Eq. (5.3). For simplicity of notation, we analyze the case with K = 1 and analysis of the case with K > 1 is similar. We have b A MLE = argmin A n E True h x(t) > (Ax(t− 1))−1 > M((Ax(t− 1))) io Taking the derivative of the right hand side and setting it to zero yields: E True h (x(t)−M 0 ((Ax(t− 1)))x(t− 1) > i =0, E x(t−1) h E x(t)|x(t−1) h (x(t)−M 0 ((Ax(t− 1)))x(t− 1) > ii =0 E x(t−1),z(t−1) h E x(t)|x(t−1) h (x(t)−M 0 ((Ax(t− 1)))x(t− 1) > ii =0 E x(t−1),z(t−1) h (M 0 (A ? x(t− 1) +B ? z(t− 1))−M 0 ((Ax(t− 1)))x(t− 1) > i =0 (5.21) LetuswritetheTaylor’sexpansionofbothM 0 (A ? x(t−1)+B ? z(t−1))andM 0 (Ax(t− 1)) around A ? x(t− 1). Denote Δ =Ax(t− 1)−A ? x(t− 1) (all the operations are elementwise): M 0 (A ? x(t− 1) +B ? z(t− 1)) =M 0 (A ? x(t− 1)) +M 00 (A ? x(t− 1)) (B ? z(t− 1)) +O((B ? z(t− 1)) 2 ) (5.22) M 0 (Ax(t− 1)) =M 0 (A ? x(t− 1)) +M 00 (A ? 
x(t− 1)) Δ +O(Δ 2 ) (5.23) 102 Given the assumption aboutM 00 (A ? )≈m1, we can substitute Eq. (5.23) and (5.22) in Eq. (5.21) and simplify the result by ignoring the higher order terms as follows: E x(t−1),z(t−1) h (Ax(t− 1)−A ? x(t− 1)−B ? z(t− 1))x(t− 1) > i ≈0 Taking the expectation and some algebra yields: A≈A ? +BQ xz Q −1 xx Define L =BQ xz Q −1 xx . Since B is a p×r matrix, rank(L)≤r. This concludes the proof. 5.2.4 Inference We need to solve the following optimization algorithm to capture the effect of latent factors: min A (`) ,L (`) ,b L x(t),A (`) ,L (`) `=1:K t=1:T (5.24) Subject to: K X `=1 A (`) 0 ≤η S , K X `=1 rank L (`) ≤η L , where the L 0 norm of the matrices is equal to the number of non-zeros elements of the matrices andL denotes the likelihood of the stochastic process defined in Eq. (5.4). A common approach to solve Eq. (5.24) is to use a convex relaxation of the L 0 norm with the L 1 norm and the rank constraint with the nuclear norm L ∗ : min A (`) ,L (`) ,b ( L x(t),A (`) ,L (`) `=1:K t=1:T +λ S K X `=1 A (`) 1 +λ L K X `=1 L (`) ∗ ) (5.25) 103 Algorithm 5.1: Greedy Sparse plus Low-Rank Decomposition Input:{x(t)} t=1,...,T , η S , η L 1 Let w denote concatenation of L (`) , A (`) and b. Initialize w 1 ←0. 2 for τ← 1, 2, 3,... do 3 a (L) t ← argmin a∈A (L) D ∇L(w t ),a (L) E . 4 a (S) t ← argmin a∈A (S) D ∇L(w t ),a (S) E . 5 α t ,β t ,b t ← argmin α,β∈[0,1],b L(w t +α(η S a (S) t −w (S) t ) +β(η L a (L) t −w (L) t )). 6 w (S,L) t+1 ←w (S,L) t +α t (η S a (S) t −w (S) t ) +β t (η L a (L) t −w (L) t ). 7 end 8 return L (`) ,A (`) , for ` = 1,...,K. The optimization problem in Eq. (5.25) is convex and can be solved via Singu- larValueThresholding(SVT)ineachiterationoftheAcceleratedProximalGradient algorithm[167]asdescribedin[217]. Anothersolutionisperformingsparsepluslow- rank decomposition in the Frank-Wolfe framework [121, 213]. This algorithm has been shown to be faster and more scalable than the SVT approach [121, 122, 213] for learning low-rank matrices. Algorithm 5.1 describes the optimization algorithm. In Algorithm 5.1, for notation simplicity, we show the parameters in the sparse and low-rank matrices by w∈ R (2K+1)p 2 ×1 where the its first Kp 2 elements w (S) contain the elements of A (`) , the second Kp 2 elements w (L) contain the elements of L (`) for ` = 1,...,K and the last p elements contain b. The algorithm iteratively selects two directions for updating: (1) The direction for updating b A (`) matrices is obtained by solving an L 1 constrained linear programming (Line 3). (2) The direction for updating the low-rank matrices b L (`) is obtained via singular value decomposition of∇ L (`)L, as described in [197, 213] (Line 4). In fact, we only need to find an approximate leading singular vector which can be done inO(N s log(p)) 104 where N s is the number of non-zero elements of the gradient matrix [197]. We update b together with line search for updating other matrices (Line 5). Following the theoretical framework provided in [121], we can derive the fol- lowing convergence guarantee for Algorithm 5.1: Theorem 5.3. The solution of Algorithm 5.1 at then th iteration is bounded towards the optimal solution w ? according to the following equation: L(w n )−L(w ? )≤ C n (5.26) whereC is a constant that depends on the volume of theL 1 andL ∗ norm constraints. 
In particular,C =B S +B L +B b where, e.g.B S is defined asB S , 8L ||.|| (L)η 2 S ||A S || 2 in which L ||.|| (L) is the smoothness constant of the likelihood function as defined in [213] and||A S || 2 = sup a∈A S kak whereA S denotes the unitL 1 ball. The bound term for the low-rank partsB L andB b are defined similarly. Note that the solution always stays inside the constraints, thus the optimiza- tion algorithm does not have to deal with the non-differentiability of the Lagrangian at the constraint boundaries. Proof We provide our proof as an extension of the proof in [213]. Given a set S and a normkk, we define the Restricted Smoothness Property constant of the likelihood functionL as defined in Eq. (3) in [213]. Following the same steps, we have: 105 L(w t +α(η S a (S) t ) +β(η L a (L) t ))≤L(w t )−α(−h∇L(w t ),η S a (S) t i +h∇ S L(w t ),w t i) −β(−h∇L(w t ),η L a (L) t i +h∇ L L(w t ),w t i) + 2α 2 L S η S R 2 S + 2β 2 L L η L R 2 L (5.27) Similarly, we can define and show that: δ t ,L(w t )−L(w ? ) ≤−h∇L(w t ),w ?,(L) i +h∇ S L(w t ),w t i−h∇L(w t ),w ?,(S) i +h∇ L L(w t ),w t i (5.28) ≤−h∇L(w t ),η S a (S) t i +h∇ S L(w t ),w t i−h∇L(w t ),η L a (L) t i +h∇ L L(w t ),w t i (5.29) Plugging Eq. (5.29) into Eq. (5.27) and following the reasoning in [213], we can show that: δ t+1 ≤δ t + min α,β∈[0,1] (−(α +β)δ t + 2α 2 L S η S R 2 S + 2β 2 L L η L R 2 L ) Fort = 0, chooseα,β = 1 on the right hand side to getδ 1 ≤ 2(L S η S R 2 S +L L η L R 2 L ). Since δ t is decreasing, we can see that δ t ≤ 2(L S η S R 2 S +L L η L R 2 L ) for all t. Thus, choosing α = 4(L S η S R 2 S +L L η L R 2 L ) yields for all t > 1: δ t+1 ≤ δ t − δ 2 t B S +B L where B S , 8L S η S R 2 S andB L , 8L L η L R 2 L . Solving this yields the desired result. The impact of optimization of b can be captured similarly assuming that the searching space is bounded. 106 In the performance analysis of the greedy algorithm that selects only one sparse or low-rank direction per iteration we should observe that in Eq. (5.28) eitherh∇L(w t ),w ?,(L) i orh∇S(w t ),w ?,(S) i remains unbounded. Boundingthis term introduces the Lipschitz constant of the likelihood function. Plugging the additional term into Eq. (5.27) yields αL S η S R 2 S or αL L η L R 2 L instead of α 2 L S η S R 2 S or α 2 L S η S R 2 S . The termL S denotes the maximum Lipschitz constant of the likelihood function inside the convex hull of the sparsity norm. Since α< 1 and alwaysL S <L S andL L <L L , we observe that the bound for the single coordinate selection should be at least greater by the differences of the Lipschitz and the restricted smoothness constant of the likelihood function for one of the directions. 5.3 Path Delays In response to the second challenge of latent factors, in this section, we show that coping with hidden confounders’ effect is easier in Granger causality. In par- ticular, in Granger causality, many directionally connected paths are disconnected considering the delay associated with the edges. Thus, often times we require con- ditioning on fewer variables to block the spurious causation paths. We start with a canonical example to introduce the main concept of path delays. Via demonstration of the effect of path delays in the three basic graphical structures, we extend the “m-separation” criteria [69] to include path delays in identification of connectivity of the paths. We show that the generalized criteria are able to detect more blocked paths which yields to higher possibility of successful causal identification. 
Note that the results in this section are general and not limited to linear VAR models.

Figure 5.2: In this toy Granger graphical model, the values on the edges (τ_1, τ_2, τ_3) show the delay associated with each edge. According to the m-separation criteria, when X_4 is unobserved, a spurious edge X_1 ← X_3 is detected. However, the spurious edge is not detected when τ_3 − τ_2 + τ_1 ≤ 1, where L is the maximum lag in the Granger causality test.

Consider the following set of linear auto-regressive equations:

X_1(t) = α X_4(t−2) + ε_1(t),    X_3(t) = ε_3(t),
X_2(t) = β X_4(t−1) + γ X_3(t−1) + ε_2(t),    X_4(t) = ε_4(t),    (5.30)

where ε_i(t), i = 1,...,4, are independent noise processes. The corresponding Granger graphical model is shown in Fig. 5.2, with τ_1 = 2, τ_2 = 1, and τ_3 = 1. The direction of the edges and the values of the delays on them are causal priors, obtained from field knowledge, which are necessary for the analysis of the effects of hidden variables. For example, consider three events defined as follows: X: rain in Los Angeles, CA; Y: rain in Riverside, CA; and Z: the approach of coastal air masses. One can observe that the effect of coastal air masses cannot reach Riverside earlier than Los Angeles; consequently, the edge Z → X must have a smaller delay than Z → Y.
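A small simulation of Eq. (5.30) makes the spurious edge tangible: with X_4 dropped, regressing X_1(t) on the past of the observed series alone produces a clearly non-zero coefficient on X_3(t−2) through the collider X_2. The coefficient values, series length, and the plain least-squares fit below are illustrative choices, not settings used elsewhere in this chapter.

import numpy as np

rng = np.random.default_rng(0)
T, L = 5000, 3
alpha, beta, gamma = 0.9, 0.8, 0.8
eps = rng.normal(size=(4, T))
X = np.zeros((4, T))
for t in range(2, T):
    X[3, t] = eps[3, t]                                              # X4 (unobserved below)
    X[2, t] = eps[2, t]                                              # X3
    X[1, t] = beta * X[3, t - 1] + gamma * X[2, t - 1] + eps[1, t]   # X2
    X[0, t] = alpha * X[3, t - 2] + eps[0, t]                        # X1

# Regress X1(t) on the past L values of the observed series X1, X2, X3 only.
rows = [np.concatenate([X[i, t - L:t][::-1] for i in (0, 1, 2)]) for t in range(L, T)]
coef, *_ = np.linalg.lstsq(np.array(rows), X[0, L:], rcond=None)
print("coefficients on the lags of X3:", np.round(coef[2 * L:], 3))  # the lag-2 entry is spurious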
In analysis of the structure in Fig. 5.2, Eichler [67] shows that in the absence of X_4 a spurious edge is detected from X_3 to X_1. This spurious causation is the result of the spurious path X_1 ← X_4 → X_2 ← X_3, which is connected under the "m-separation" criteria. However, a quick inspection shows that when τ_1 ≤ 1 the spurious edge X_3 → X_1 is never inferred.

Figure 5.3: Three of the four possible directed graphs created by three nodes: (a) the co-parent, (b) the collider, and (c) the chain structures. The fourth structure is the chain with reversed edge directions.

This implies that in Granger networks, we might inspect not only the graphical connectivity, but also the delays in the connected paths. This idea is scrutinized via the three basic structures of directed graphs possible with three time series (see Fig. 5.3).

The Co-parent Structure  In the co-parent structure (Fig. 5.3), an unobserved time series (Z) causes two observed time series X and Y. The effect of the cause Z reaches X and Y with possibly different delays τ_1 and τ_2, respectively. A simple inspection shows that the identified direction of causality between X and Y depends on the relative values of τ_1 and τ_2. In particular,

Figure 5.4: The diagrams for proving (a) Lemma 5.4 and (b) Lemma 5.5. The green circles are observed variables and the red path shows a d-connected path.

Lemma 5.4. In the co-parent structure in Fig. 5.3, when Z is unobserved and generated from a white process, the following spurious edges are detected:

τ_1 < τ_2  ⇒  X → Y,
τ_1 > τ_2  ⇒  Y → X,    (5.31)
τ_1 = τ_2  ⇒  no causality.

In other words, the path from X to Y, when Z is unobserved, is blocked if τ_1 ≥ τ_2, while the path X ← Z → Y is connected under the m-connectivity criteria.

Proof. Without loss of generality, let us assume τ_1 < τ_2. We need to show that there is a spurious edge from X to Y and there is no spurious edge from Y to X. Formally, we need to show that (Y(t) ⊥⊥ X(t−τ)) | Y(t−1,...,t−L) for τ = 1,...,L. Using the SEM notation, as shown in Fig. 5.4, we can see that there is always a directional path from Y(t) via Z(t−τ_2) to X(t−τ_2+τ_1). However, all paths from Y(t−τ), τ = 1,...,L, to X(t) are blocked. Note that if Z(t) is not white, there exists at least one spurious path which goes through the history of Z(t); however, in practice the attenuation of this path is significant enough to make the Lemma approximately hold for non-white unobserved variables.
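A quick numerical check of Lemma 5.4, under illustrative parameter choices: a white latent series Z drives X with delay τ_1 and Y with delay τ_2 > τ_1; fitting each observed series on the past of both shows a spurious cross coefficient only in the direction X → Y, as the lemma predicts. The series length, noise level, and lag order below are assumptions of this sketch.

import numpy as np

rng = np.random.default_rng(1)
T, L, tau1, tau2 = 20000, 4, 1, 3              # tau1 < tau2: expect a spurious X -> Y edge only
Z = rng.normal(size=T)
X = np.roll(Z, tau1) + 0.3 * rng.normal(size=T)
Y = np.roll(Z, tau2) + 0.3 * rng.normal(size=T)

def cross_coeffs(target, source, own):
    # Least-squares fit of target(t) on the past L values of its own and the source series,
    # returning only the coefficients on the lags of the source.
    rows = [np.concatenate([own[t - L:t][::-1], source[t - L:t][::-1]]) for t in range(L, T)]
    coef = np.linalg.lstsq(np.array(rows), target[L:], rcond=None)[0]
    return coef[L:]

print("Y(t) on lags of X:", np.round(cross_coeffs(Y, X, Y), 2))   # large entry at lag tau2 - tau1
print("X(t) on lags of Y:", np.round(cross_coeffs(X, Y, X), 2))   # all entries near zero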
The Collider Structure  Before delving into the theory, we first formally define the collider structure in Granger causality. Suppose the time series Z is generated from two independent time series X and Y as follows:

Z(t) = f(X(t−1),...,X(t−L), Y(t−1),...,Y(t−L)) + ε_Z,

where the noise term ε_Z is N(0, σ). The causal relationships between X, Y and Z include X → Z and Y → Z, where Z is called the collider node. Fig. 5.3 shows an example of the collider structure where the effects of X and Y reach Z with delays τ_1 and τ_2, respectively. Next, we discuss our results on the collider structure in Lemma 5.5.

Lemma 5.5. In the inference of Granger causality, observing the collider node does not create a spurious edge between the parents of the collider node.

Proof. An argument similar to the previous proof can be used here to show that (Y(t) ⊥⊥ X(t−τ)) | Y(t−1,...,t−L) for τ = 1,...,L. Fig. 5.4 shows the scenario corresponding to τ_1 = 1 and τ_2 = 2. We can see that the observations at t−1 block all the directed paths from the past to X(t) and Y(t), which concludes the proof.

The Chain Structure  The third structure is the chain structure, shown in Fig. 5.3. It is already known that given the variable Z, no edge from X to Y will be detected, while when Z is not given, the path X → Y is connected.

To summarize the results of the observations on the fundamental structures, consider the following definition of path delay:

Definition 5.6. Consider a path P of length p−1 from X_j to X_i defined by a set of ordered nodes {X_(k)}_{k=1}^{p}, where X_(1) = X_j and X_(p) = X_i. Define the path delay as T_{j,i}(P) = Σ_{k=1}^{p−1} α_{(k),(k+1)} τ_{(k),(k+1)}, where α_{(k),(k+1)} = +1 if the edge between X_(k) and X_(k+1) is oriented as X_(k) → X_(k+1), and α_{(k),(k+1)} = −1 otherwise.

In other words, start from X_j and add the delay of an edge if it points towards X_i, and subtract it otherwise. For example, in the graph of Fig. 5.2 the path delay from X_3 to X_1 is computed as τ_3 − τ_2 + τ_1. Using the definition of path delay, we can state the following general theorem.

Theorem 5.7. Consider a Granger network G(V, E) with the set of nodes V = {X_i} for i = 1,...,n, the set of directed edges E, and the edge delays τ_{i,j} ∈ Z_+ for every edge X_i → X_j ∈ E. Suppose the unobserved time series are generated from white processes. Then, every path P from an arbitrary node X_j pointing to X_i is connected if it is both m-connected and the path delay satisfies T_{j,i}(P) > 0.

Proof. The intuition behind the theorem is rather simple if we accept the directional information transfer interpretation of Granger graphical models. Informally, the proof is a sequence of reductions of the fundamental structures to their equivalent spurious edges. The final path will be one of the three fundamental structures for which T_{j,i} > 0 is satisfied.

Note that the profound implication of Theorem 5.7 is that the time-order information usually assumed available in confounder analysis can be used more efficiently in Granger causality analysis. If the time order between hidden variables is given, we can state stricter rules for the connectivity of paths in the Granger causality framework by ruling out many paths that would be identified as connected by m-separation. This makes the unidentifiability problem less likely in Granger networks with hidden variables.
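Definition 5.6 is mechanical enough to state directly as code. The sketch below computes T_{j,i}(P) for the confounding path of Fig. 5.2; the dictionary of edge delays is an assumed input format.

def path_delay(path, delays):
    # Path delay of Definition 5.6: walk the path from X_j to X_i, adding the
    # delay of edges oriented along the walk and subtracting it otherwise.
    total = 0
    for u, v in zip(path, path[1:]):
        if (u, v) in delays:
            total += delays[(u, v)]
        elif (v, u) in delays:
            total -= delays[(v, u)]
        else:
            raise ValueError("no edge between %s and %s" % (u, v))
    return total

# The confounding path of Fig. 5.2, X1 <- X4 -> X2 <- X3, with tau1 = 2, tau2 = 1, tau3 = 1.
delays = {("X4", "X1"): 2, ("X4", "X2"): 1, ("X3", "X2"): 1}
print(path_delay(["X3", "X2", "X4", "X1"], delays))   # tau3 - tau2 + tau1 = 2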
The next example demonstrates the advantages implied by Theorem 5.7.

Figure 5.5: An example of canceling spurious causation. Time series X_1, X_2 and X_3 are observed while X_4 and X_5 are unobserved.

Example 5.8. Consider the Granger graph in Fig. 5.5. Time series X_1, X_2 and X_3 are observed while X_4 and X_5 are unobserved. The goal is to find the Granger causal effect of X_1 on X_3.

Solution. The true causal path is X_1 → X_5 → X_3; however, the path X_1 ← X_4 → X_2 ← X_3 is a potential confounding path. The m-connectivity criteria state that unless X_4 is observed, the causality from X_1 to X_3 will not be identifiable. However, utilizing the delay of the path X_1 ← X_4 → X_2 ← X_3, unidentifiability only occurs when the path delay T_{3,1} > 0, and we have a higher chance of successful causal inference.

While whiteness of the unobserved variables is satisfied in many applications, even in cases where the hidden time series are not white, our analysis shows that the unidentified spurious causations need to propagate through long paths and undergo significant attenuation, which makes Theorem 5.7 approximately hold.

Empirical Verification  We generated multiple synthetic datasets to verify the claim in Theorem 5.7; we provide one example of such experiments. Fig. 5.6 shows the graph of a synthetic dataset generated to verify the claim in Lemma 5.5. In this dataset, X, Y and Z are observed, but U and V are unobserved. Fig. 5.6 also shows the causality relationships identified by three algorithms when we set the length of the time series to 500. As we can see, none of the edges Z → Y and X → Y are detected by the algorithms.

Figure 5.6: Verification of the theoretical results. Sig, Lasso, and Copula represent the significance test, Lasso-Granger, and Copula-Granger algorithms.

5.4 Experiments

To demonstrate the effectiveness of our proposed algorithms, we conduct experiments on both synthetic datasets and application datasets.

5.4.1 Datasets

Synthetic Datasets  To study the accuracy of the algorithms in recovering the true underlying temporal dependency graph in the presence of latent factors, we created a synthetic dataset according to the Poisson auto-regressive point process model in Eq. (5.6). We fix the number of observed variables at 60 and vary the number of latent factors from r = 1 to 5. We also vary the length of the observed time series to study the asymptotic behavior of the algorithms. We set the time lag K = 1, randomly set the elements of the A matrix, and choose a sufficiently large negative value for b to stabilize the time series. The global impact of the latent factors is modeled by adding an edge from the latent factors to all other observed variables. We generate 10 random datasets of each type and report the average performance on them.
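A minimal sketch of how such a dataset can be generated, assuming a single lag and white Poisson latent factors; the sparsity level, coefficient scales, and offset below are illustrative rather than the exact values used in the experiments.

import numpy as np

rng = np.random.default_rng(2)
p, r, T, b = 60, 2, 2000, -3.0                       # observed series, latent factors, length, offset
A = 0.2 * (rng.random((p, p)) < 0.05)                # sparse evolution matrix, time lag K = 1
B = 0.1 * np.ones((p, r))                            # every latent factor points to every observed series

x = np.zeros((T, p))
z = rng.poisson(1.0, size=(T, r))                    # latent count series, assumed white here
for t in range(1, T):
    rate = np.exp(b + A @ x[t - 1] + B @ z[t - 1])   # Poisson auto-regressive intensities
    x[t] = rng.poisson(rate)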
Twitter Dataset  We used a Twitter dataset to discover the temporal dependence among its users by analyzing the tweets about the "Haiti earthquake". We divided the time period of 17 days after the Haiti earthquake on Jan. 12, 2010 into 1000 intervals and generated a multivariate time series dataset by counting the number of tweets on this topic for the top 1000 users who tweeted most about it. The resulting time series have on average 0.0225 tweets per user per time interval. For accurate modeling, we removed the users that were highly correlated with each other, resulting in 100 users.

Wind Speed  Datasets of extreme values of wind speed and gust speed are of great interest to climate scientists and wind power engineers. A collection of wind observations is provided by AWS Convergence Technologies, Inc. of Germantown, MD. It consists of observations of surface wind speed (mph) and gust speed (mph) every five minutes. We choose 153 weather stations located on a grid lying in the block of 35N–50N and 70W–90W. Following the standard practice in this domain, we generate extreme value time series observations, i.e., daily maximum values, at the different weather stations. The objective is to examine how the global weather systems impact the local influence patterns at different locations and how well we can make predictions on future precipitation.

5.4.2 Evaluation Measures

For the synthetic datasets, we report the graph learning accuracy by the Area Under the Curve (AUC) measure. The value of AUC is the probability that the algorithm assigns a higher value to a randomly chosen positive (existing) edge than to a randomly chosen negative (non-existing) edge in the graph. Since we do not have access to the true underlying influence graph in the Twitter application, we use the retweet network as the ground truth. The retweet network G_RT(n) is constructed by adding an edge from user i to user j if user j has retweeted at least n of the tweets of user i, where n is varied from 1 to 5. Clearly, the retweet network is not the actual underlying temporal dependency graph, mainly because there are possible implicit influence patterns as well. However, it is the best metric that we could obtain for graph learning accuracy evaluation in our dataset. The retweet network for the 100 selected users is sparse; for example, G_RT(1) has only 279 out of 10,000 possible edges. We do not have the true underlying influence graph for the wind speed dataset, thus we only report the prediction accuracy and the visualization of the inferred graph.

In all the experiments, we tune the penalization parameters via 5-fold cross-validation. For predictive analysis, we split all the datasets into training/testing parts with ratio 9/1 based on time and report the root mean square error (RMSE) and normalized RMSE on the test set. Specifically, we trained the models with the observations in [1,..., (9/10)T] and predicted the observations at t_0 ∈ [(9/10)T,...,T] using the K past observations at t_0−K, t_0−K+1,...,t_0−1. In other words, we evaluate the 1-step prediction performance of the algorithms. We report the average RMSE on the test samples.
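The evaluation protocol above can be summarized by the following sketch; the predict callable is a placeholder for any fitted model, and normalizing by the mean magnitude of the test observations is one plausible reading of the normalized RMSE reported later.

import numpy as np

def one_step_rmse(X, predict, K, train_frac=0.9):
    """Evaluate 1-step-ahead predictions on the last 10% of the series (split by time).

    X       : (p, T) array of observations
    predict : callable mapping the K past columns x(t-K),...,x(t-1) to a prediction of x(t)
    """
    p, T = X.shape
    t0 = max(int(train_frac * T), K)
    errors = []
    for t in range(t0, T):
        x_hat = predict(X[:, t - K:t])
        errors.append(np.mean((X[:, t] - x_hat) ** 2))
    rmse = np.sqrt(np.mean(errors))
    return rmse, rmse / np.abs(X[:, t0:]).mean()     # RMSE and one possible normalization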
Table 5.1: The baselines used in the evaluations.

Twitter Dataset
  GLARP-PoG         GLARP model with Poisson distribution and Algorithm 5.1.
  Poisson-EM        GLARP model with Poisson distribution and EM algorithm inference.
  Poisson           GLARP model with Poisson distribution without hidden variables.
  GLARP-COMG        GLARP model with COM-Poisson distribution and Algorithm 5.1.
  COM-P EM          GLARP model with COM-Poisson distribution and EM algorithm inference.
  COM-P             GLARP model with COM-Poisson distribution without hidden variables.
  Transfer Entropy  Transfer Entropy, a non-parametric dependency analysis algorithm [194].

Wind Speed Dataset
  GLARP-GumG        GLARP model with Gumbel distribution and Algorithm 5.1.
  Gumbel-EM         GLARP model with Gumbel distribution and EM algorithm inference.
  Gumbel            GLARP model with Gumbel distribution without hidden variables.
  Gaussian VAR      Gaussian VAR without hidden variables.
  Transfer Entropy  Continuous Transfer Entropy [130].

Baselines  To evaluate the performance of sparse plus low-rank decomposition, we use several state-of-the-art baselines including GLARP without latent factors, the EM algorithm, and Transfer Entropy (see Table 5.1 for details). Specifically, the EM-based solutions use the EM algorithm to learn the parameters of the GLARP model with latent factors in Eq. (5.4). The parameters in the EM algorithm are initialized to zero. Transfer Entropy [130, 194] performs pairwise temporal dependency analysis among time series by measuring the amount of uncertainty resolved in the future of a time series by knowing the past values of another time series, given its own past values.

Figure 5.7: Synthetic dataset results on the point process dataset. (a) Graph learning accuracy as the length of the time series increases. (b) Graph learning accuracy as the number of latent factors increases.

5.4.3 Results

Synthetic Datasets  The results on the synthetic datasets are shown in Fig. 5.7. In Fig. 5.7(a), we fix the number of latent factors and vary the length of the time series. As we expect, the performance of all algorithms improves as the length of the time series increases. The algorithms which can capture the impact of latent factors outperform the rest. We can attribute the superior performance of our algorithm compared with EM to the fact that the latter can get stuck in local optima. In Fig. 5.7(b), we fix the time series length at 500 and vary the number of latent factors. The performance of our algorithms drops slightly, because as we increase the number of latent factors, the rank of the low-rank matrix L increases, which makes it harder to estimate [197]. The performance of Transfer Entropy and Poisson degrades as well, since the true underlying model deviates further from their assumption that there are no latent factors.

Figure 5.8: The graph learning accuracy when the required number of retweets n for the ground-truth influence graph G_RT(n) is varied. The performance of (a) Poisson and (b) COM-Poisson auto-regressive processes confirms that they make better predictions for the stronger influence edges.

Twitter Dataset  As shown in Fig. 5.8, the performance of all the algorithms improves as we increase the number of retweets required for the ground-truth influence graph G_RT(n) (defined in Section 5.4.2). This means all the algorithms detect the strong influence edges with higher accuracy. In all of the COM-Poisson auto-regressive models, we have set the dispersion parameter ν to a fixed large number to capture the large underdispersion in the Twitter dataset. Therefore the COM-Poisson based models outperform their Poisson counterparts. As we expected, GLARP-COMG outperforms its EM counterpart. The prediction performance in Table 5.2 confirms this trend as well. The inferior performance of the EM algorithm is due to propagation of error; in other words, EM first infers the values of past latent factors (accruing some error) and then uses them to predict the observed time series. Note that because Eq. (5.7) is based on an approximation for the mean of the COM-Poisson GLARP, it may contribute to the lower prediction performance of the COM-Poisson based algorithms.
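For completeness, the graph-recovery AUC reported in these figures can be computed as follows, assuming scikit-learn is available, using the absolute values of the estimated coefficients as edge scores and excluding self-edges; both choices are conventions assumed here rather than stated in the text.

import numpy as np
from sklearn.metrics import roc_auc_score

def graph_auc(scores, truth):
    """AUC of edge scores against a binary ground-truth graph, e.g. the retweet network G_RT(n).

    scores, truth : (p, p) arrays; the diagonal (self-edges) is ignored.
    """
    mask = ~np.eye(truth.shape[0], dtype=bool)
    return roc_auc_score(truth[mask].astype(int), np.abs(scores)[mask])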
Table 5.2: The RMS prediction error of the algorithms in the Twitter dataset. Results have been normalized by the mean.

  Method            RMSE     Norm-RMSE
  GLARP-COMG        0.0059   0.3014
  COM-P EM          0.0113   0.5739
  COM-P             0.0096   0.4876
  GLARP-PoG         0.0017   0.0887
  Poisson EM        0.0062   0.3148
  Poisson           0.0017   0.0847
  Transfer Entropy  0.0030   0.1519

The performance of Transfer Entropy is far below the other algorithms: (0.5427, 0.5915, 0.5924, 0.5785, 0.5442) for n = 1,...,5. Therefore, we omit it from the plots. Its poor performance can be attributed to the extreme sparsity of the observations in the Twitter dataset and the fact that, unlike the rest of the parametric algorithms, it does not have any procedure to benefit from the sparsity of the underlying data generation model. To evaluate the prediction performance of Transfer Entropy, we used the graph estimated by Transfer Entropy in the Poisson auto-regressive process and measured its prediction performance.

In order to evaluate the computational cost of GLARP with different algorithms, we recorded the run time on the Twitter dataset on an i7 2.67 GHz laptop running Windows: 48 seconds for Poisson, 98 seconds for GLARP-PoG, and 928 seconds for the EM algorithm (with 5 iterations). As we can see, sparse plus low-rank decomposition is around 47 times faster than EM.

Wind Speed Dataset  The prediction performance of the algorithms is listed in Table 5.3. The results show that GLARP-GumG outperforms the rest of the algorithms. Two patterns are different in this dataset: first, the EM algorithm has lower performance than the simple Gumbel VAR algorithm; second, due to the short length of the time series, Transfer Entropy faces the high-dimensionality problem and cannot perform better than the Gaussian model. To evaluate the prediction performance of Transfer Entropy, we used its inferred graph in the Gaussian auto-regressive process and measured its prediction performance.

Figure 5.9: (a) The spatio-temporal dependency graph obtained via the Gumbel auto-regressive process. Note the denseness of the graph. (b) The sparse part of the spatio-temporal dependency graph obtained via GLARP-GumG. Removing the low-rank global effect leaves only two main local terrain impacts: one is the local impact of the Appalachian mountains along the east coast and the other is the local impact of the Great Lakes on the weather pattern of their surrounding lands.

Table 5.3: The RMS prediction error of the algorithms in the wind speed dataset.

  Method            RMSE     Norm-RMSE
  GLARP-GumG        0.3147   0.0349
  Gumbel EM         0.4789   0.0531
  Gumbel VAR        0.3233   0.0358
  Gaussian VAR      0.8510   0.0943
  Transfer Entropy  0.8871   0.0983

The GLARP-GumG algorithm detects only one latent variable in the wind speed dataset. The temporal dependency graphs inferred by the Gumbel auto-regressive process and by GLARP-GumG are shown in Fig. 5.9(a) and 5.9(b), respectively. Two main local influence patterns are detected by our algorithm: (i) the impact of the Appalachian mountains on the east coast and (ii) the local impact of the Great Lakes on the weather pattern of their surrounding lands.
121 5.5 Small Sample Behavior When conducting experiments on GLARP, we observe an interesting phe- nomenon: often times, separating out a low-rank component from the maximum likelihood estimation of the sparse coefficients improves the accuracy of the estima- tion, even if there are no latent variables. In this section, we examine the possible causes for this phenomenon. In order to illustrate our observation, we use a simple example and generate n = 50 samples of (y i ,x i ) from the following linear Gaussian model y =Ax +", (5.32) where " i is the noise variable with a Gaussian distribution with mean 1 50×1 and covariance Σ = 4× 1 50×50 . The variables x are generated according to " i ∼N (0 50×1 ,I 50,50 ) and the value of the non-zero elements ofA is set to 1. To show that the gain does not depend on a particular evaluation metric, we compare the performance of sparse learning and sparse plus low-rank learning via four different performance metrics, including: (i) the accuracy in estimation of the elements of A measured in root mean square (RMSE), (ii) the support recovery accuracy measured by F 1 score where the 0− 1 labels are computed by thresholding estimated param- eters, (iii) Matthew’s Correlation Coefficient (MCC) [177], and (iv) Area Under the Curve (AUC). Table 5.4 showcases the interesting phenomenon: (1) the sparse plus low-rank decomposition consistently achieves a gain compared with pure sparse learning to recoverA in all four measures. In particular, the gains inF 1 and MCC measures are close to 2%. (2) Figure 5.10(a) confirms the significant low-rank bias term in the 122 Table 5.4: Comparison of pure sparse learning and sparse plus low-rank decompo- sition solutions on four performance measures. Pure Sparse Sparse plus low-rank RMSE 2.9375± 0.0456 2.8585± 0.0462 F 1 0.2448± 0.0325 0.2515± 0.0353 MCC 0.1995± 0.0220 0.2064± 0.0241 AUC 0.8810± 0.0220 0.8844± 0.0243 (a) (b) Figure 5.10: The images visualize the estimated matrices with grayscale images; the darker the pixel color, the higher the value of the coefficient. (a) The estimation bias b A S −A. Note that there is a significant low-rank pattern in the estimation error. (b) The low-rank matrix L is very similar to the low-rank part in the estimation error. estimated matrix b A which is successfully estimated using the sparse plus low-rank decomposition shown in Figure 5.10(b). (3) Figure 5.10(b) also indicate that the low-rank bias term is dense; i.e. the majority of the elements are non-zero. Note that the gain drops as the number of observations n increases. 5.5.1 Finite Sample Analysis of the Structure of Estimators In search for the causes of the gain achieved by sparse plus low-rank decom- position, we notice that the observed gain is more significant in the datasets with finite samples. Hence, we first to identify a set of GLM models in which in the finite 123 sample bias is an additive low-rank term. The next theorem and the subsequent discussion provide insights into fully understanding the interesting phenomenon. Theorem 5.9. Consider the generalized linear model in Eq. (5.1) with canonical link function. Suppose there exists a set of unobserved predictors denoted as z∈ R r×1 where the number of unobserved predictors r is much smaller than the number of observed ones, i.e. r p. We are given n pairs of samples (x i ,y i ) that are generated according to the model with latent factors. Let the link function satisfy M 00 (A ? x i )≈ m1,m > 0 and i = 1,...,n. 
The maximum likelihood estimate of A can be written as follows: b A n ≈A +L H +L n , (5.33) where rank(L H ) ≤ r and L n represents the additive bias term due to the small sample size. If we define i ,y i −M 0 (A ? x i +B ? z i ), we have: L H =B ? 1 n n X i=1 z i x > i ! 1 n n X i=1 x i x > i ! −1 L n = 1 mn n X i=1 i ! 1 n n X i=1 x i x > i ! −1 . We directly prove the result in Theorem 5.9 which includes the proof for The- orem 5.1 as a special case. Supposen sets of samples (y i ,x i ,z i ) are drawn from the distribution described by Eq. (5.1) but only x i are observed. The fitting process with maximum likelihood can be written as follows: b A = argmax A ( 1 n n X i=1 h y > i (Ax i )−M(Ax i ) i ) , 124 Taking derivative with respect to A yields: 1 n n X i=1 h (y i −M 0 (Ax i ))x > i i =0. (5.34) Define i ,y i −M 0 (A ? x i +B ? z i ). We can rewrite Eq. (5.34) as follows: 1 n n X i=1 h M 0 (A ? x i +B ? z i )−M 0 (Ax i ))x > i i = 1 n n X i=1 i . Via Taylor expansion of the two terms in the left hand side, we have: M 0 (A ? x i +B ? z i ) =M 0 (A ? x i ) +M 00 (A ? x i ) (B ? z i ) +O((B ? z i ) 2 ), (5.35) M 0 (Ax i ) =M 0 (A ? x i ) +M 00 (A ? x i ) ((A ? −A)x i ) +O(((A ? −A)x i ) 2 ). (5.36) Substitution of Eq. (5.36) and (5.35) in Eq. (5.34) and using the assumption in the theorem that states M 00 (A ? y i )≈m1, we obtain: 1 n n X i=1 h (A−A ? )x i −B ? z i )x > i i ≈ 1 mn n X i=1 i . (5.37) Rewriting Eq. (5.37) yields: A≈A ? +B ? 1 n n X i=1 z i x > i ! 1 n x i x > i −1 + 1 mn n X i=1 i ! 1 n x i x > i −1 . (5.38) The proof is completed by defining the following quantities in Eq. (5.38): L H =B ? 1 n n X i=1 z i x > i ! 1 n n X i=1 x i x > i ! −1 L n = 1 mn n X i=1 i ! 1 n n X i=1 x i x > i ! −1 . 125 Example In the Gaussian example, M 0 is the identity function. Hence, the approximation in Eq. (5.33) is accurate and the finite sample result for the Gaussian example can be simplified as: b A n =A− 1 n n X i=1 " i x > i ! 1 n n X i=1 x i x > i ! −1 (5.39) We have the following result on the rank of the bias term in Eq. (5.39): Proposition 5.10. With high probability the finite sample bias term in Eq. (5.39) is concentrated around a rank-1 matrix. More specifically, we have P " 1 n n X i=1 " i x > i −L F > # ≤C exp(−c n), rank(L) = 1 where C =pqC where C and c are constants. Proof sketch The proof follows directly by standard random matrix theory results. It is based on the observation that the expected value of the term 1 n P n i=1 " i x > i is equal to L =E["x > ] =E["]E[x] > ; hence rank(L) = 1. The proof is completed by applying a union bound followed by Bernstein inequality for the sum of sub-Gaussian random variables. The detailed proof is as follows 126 Proof. The proof for Proposition 5.10 is as follows: first we that the expected value of the term 1 n P n i=1 " i x > i is equal to L = E["x > ] = E["]E[x] > ; hence rank(L) = 1. Thus we have, P 1 n n X i=1 " i x > i −L 2 F > 2 ≤ p X k,j=1 P 1 n n X i=1 ε i,j x i,k −L j,k 2 2 > 2 = q,p X k,j=1 P " 1 n n X i=1 ε i,j x i,k −L j,k > # ≤pq max j,k ( P " 1 n n X i=1 ε i,j x i,k −L j,k > #) (5.40) ≤pqC exp(−c n). We were able to use Lemma 5.11 (provided below) in Eq. (5.40), because productoftwoGaussianrandomvariablesisalwaysasub-Gaussianrandomvariable. Lemma 5.11. Bernstein Inequality for Sub-Gaussian Random Variables Let X 1 ,...,X n be iid copies of a sub-Gaussian random variable X; thus X satisfies the following bound P[|X|≥t]≤ exp(−ct 2 ), for all t > 0 and some C,c > 0. Let S n = 1 n P n i=1 X i . 
Then for any independent of n we have: P[|S n −E[X]|≥]≤C exp(−c n) for some constants C ,c depending on ,C,c. Furthermore, c grows linearly in as →∞. The Lemma is known with other names such as Extension of Chernoff Bound to Sub-Gaussian Random Variables. 127 1 2 3 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 log(n) (# of samples) Fraction of Preserved Variance Figure 5.11: The fraction of variance preserved by the rank-1 estimate of L n in the Poisson auto-regressive model. More specifically, we compute the singular values of the L P asσ 1 ≥σ 2 ≥...≥σ p ≥ 0 and report σ 1 P p i=1 σ i . We have set x∼N (0,I) and y∼ Poiss(exp(Ax)). Discussion The low-rank structure of the solution for the general case of GLMs can be observed by noting that i is a sum of two terms: the first term solely depends on y i . Hence, an analysis similar to Proposition 5.10 can be carried out. The standard random matrix theory [187, 223] can be employed to analyze a fraction of the variance that is preserved in the first singular value of the second term of. Putting the two terms together, while the bias term for the general case will not be concentrated around a rank-1 matrix, it will have a significant rank-1 component. Empirically we observe a similar low-rank properties for the general case in Theorem 5.1: Fig. 5.11 shows an example of such empirical evaluation for Poisson regression. It confirms that when n is small, the L matrix can be efficiently estimated by a low-rank matrix. NotethatthematrixL, 1 n P n i=1 " i x > i ismuchclosertoalow-rankmatrixthan the commonly studied Wishart noise [187, 223]. When the model is misspecified, i.e. the noise term has non-zero mean, we expect a stronger low-rank component 128 in both finite sample bias terms, which agrees with the experimental results in Fig. 5.10. Discussion Inthissectionwearguedthatafractionofthegainbysparsepluslow- rank decomposition can stem from the small sample bias of the maximum likelihood estimation. It is instructive to compare the small sample gain with the gain achieved because of remedizing existence of latent factors. To this end, we generate two datasets: one with a single latent factor and the other without any latent factors and with varied sample size. The gain in the dataset with no latent factors will most likely to be related to the small sample bias. Fig. 5.12 shows the performance of sparse and sparse plus low-rank decomposition in the Poisson regression problem. It is evident that the gain by sparse plus low-rank decomposition is larger in the dataset with a latent factor and the gain in both datasets diminish as the sample size grows. The datasets have been generated according to Poisson GLM with x ∼ N (0 p , 0.1I p ), y = Poisson(exp(Ax +1 p×1 z)), where p = 100. The p×p random coefficient matrixA is generated to be sparse. The latent factorz =1 1×p x is defined in a way that it influences all other predictor variables x. We generate 20 datasets randomly with sparse coefficient matrix A and report the average performance in Fig. 5.12 as the sample size changes. 5.6 Summary In this chapter, we studied three instances of the generalized linear auto- regressive processes (GLARPs) for analysis of time series of count and extreme value 129 100 200 300 400 0 0.2 0.4 0.6 0.8 1 1.2 1.4 Sample Size (n) Normalizer Parameter Estimation Error No Latent, SLR No Latent, Sparse One Latent, SLR One Latent, Sparse Figure 5.12: Comparison of the gain achieved by sparse plus low-rank decomposition when the dataset has a latent factor and when it does not. 
Note that the gain is much larger when there exists a latent factor. data. We showed that the latent factors appears as an additive low-rank matrix in the maximum likelihood estimation of the evolution matrices. Based on this result, we proposed a convex programming algorithm for sparse plus low-rank decompo- sition in Frank-Wolfe’s framework. We demonstrated that our algorithm achieves better prediction accuracy and graph learning accuracy than the alternative EM- based algorithms and it is fast enough for large scale applications. By analysis of the characteristics of the maximum likelihood estimate in small sample regime, we showed that in many generalized linear models, the maximum likelihood estimation has an additive and close to a low-rank bias term. We concluded that separating out a low-rank component of the estimation improves its accuracy, even if there are no latent variables. 130 Chapter 6 Non-Gaussian Time Series 131 In many applications, such as climate science, social media analysis and smart grid, we are mostly interested in revealing the temporal dependence and make pre- dictions of extreme events. For example, climate change is mostly characterized by increasing probabilities of extreme weather patterns [118], such as temperature or precipitation reaching extremely high value. Therefore quantifying the temporal dependence between the extreme events from different locations and make effective predictions are important for disaster prevention; in social media analysis, burst of topics, i.e., buzz, is reflected by extremely high frequency of related words. Uncov- ering the temporal dependencies between buzzes could reveal valuable insights into information propagation and achieve much better accuracy for buzz prediction. Identifying temporal dependencies between multiple time-series data is a topic of significant interest [6, 149, 150]. As listed in Chapter 2.4, many algorithms have been proposed to automatically recover the temporal structures, but uncovering temporal dependency for extreme values is much more challenging than classical observations since the distributions of extreme values are more complex and signif- icantly different from the commonly used Gaussian distribution. In addition, the lack of sufficient past observations on extreme events poses difficulties in model- ing and attributing such events. The statistical approach we can utilize to solve these important problems is the theory of extreme value modeling [20, 44], which provides a natural family of probability distributions for modeling the magnitude of the largest (or smallest) of a large number of events, and a canonical stochas- tic process model [44] for the occurrence of events above a very high (or below a very low) threshold. In the past decade, extreme value modeling has attracted a lot of research efforts in statistics, finance, and environmental science, particularly on modeling temporal and spatio-temporal extreme value [46, 79, 113]. However, 132 all of the work above model temporal or spatial dependence with predefined covari- ance structures (e.g. without independence considerations). Furthermore, most general discussions of dependencies in multivariate extreme value modeling have been focused on pairwise relationships. This is obviously unrealistic and demands a significant contribution on automatically learning the temporal structures from the data for better analysis and modeling. 
In this chapter we propose two solutions: Copula-Granger which is a general- purpose semi-parametric algorithm for all non-linear time series and Sparse-GEV which is customized for the extreme value time series. Copula-Granger is based on the Copula technique which has been proposed for dependency analysis of time series with non-Gaussian marginal distributions [73]. It has been used for prediction of time series [141] and learning the dependency graph among time series [147]. We utilize the Copula framework to analyze the temporal dependency among non-linear timeseries. Inthecopulaframework, firstthemarginaldistributionofthetimeseries x i are estimated as ˜ F i using a non-parametric estimator. Next the observations are transformed to the copula domain as u i (t) = Φ −1 ( ˜ F i (x i (t))), where Φ is the cumulative distribution function of the unit Gaussian distribution. Then, we use Lasso-Granger to uncover the temporal dependency among the transformed time series. On the theoretical front, we establish the asymptotic convergence rate of the Copula-Granger method. The basic idea of Sparse-GEV is to model the multivariate extreme value time series as a latent space model. The latent variables, corresponding to the location parameters (which determine the mode) of extreme value distributions for time series at certain time, are modeled by the location parameters of all time series in history, through a dynamic linear model (DLM). By imposing an L 1 -penalty 133 with respect to the regression coefficients in DLM, we could establish meaningful temporal dependencies between a small subset of time series and the concerned time series of extreme values. To estimate parameters of the model, we develop a iterative searching algorithm based on the generalized EM-algorithms and sampling with particle filtering. Our model is significant because it is among the first models to reveal the temporal dependencies between multiple extreme value time series. In addition, our experiment results demonstrate the superior performance of our model to other state-of-art methods on both learning temporal dependence and predicting future value. The rest of the chapter is organized as follows: after description of the copula model in Section 6.1, we describe the details of our proposed Sparse-GEV model in Section 6.2, then we review the existing work and discuss their connections to our model in Section 6.3. We show the experiment results in Section 6.4 and finally we summarize the chapter. 6.1 Copula-Granger In this section, we propose the Granger Non-paranormal (G-NPN) model and design the Copula-Granger inference technique to capture the non-linearity of the data while retaining the high-dimensional consistency of Lasso-Granger. Definition 6.1. Granger Non-paranormal (G-NPN) model We say a set of time series X = (X 1 ,...,X n ) has Granger-Nonparanormal distribution G− NPN(X,B,F ) if there exist functions{F j } n j=1 such thatF j (X j ) forj = 1,...,n are jointly Gaussian and can be factorized according to the VAR model with coefficients 134 B ={ i,j }. More specifically, the joint distribution for the transformed random variables Z j ,F j (X j ) can be factorized as follows p Z (z) =N (z(1,...,K)) n Y j=1 T Y t=K+1 p N (z j (t)| n X i=1 > i,j z t,Lagged i ,σ j ), where p N (z|μ,σ) is the Gaussian density function with mean μ and variance σ 2 . The lagged vector is defined as z t,Lagged i = [z i (t− 1),...,z i (t−K)]. 
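To make Definition 6.1 concrete, the following sketch draws data from a small G-NPN model: a latent Gaussian VAR pushed through three monotone transforms g_j = F_j^{-1}, so that F_j(X_j) is jointly Gaussian and factorizes according to the VAR. The coefficient matrix and the particular marginal transforms are illustrative assumptions.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
n, T = 3, 1000
B = np.array([[0.5, 0.0, 0.0],
              [0.4, 0.3, 0.0],
              [0.0, 0.4, 0.3]])        # VAR coefficients beta_{i,j} in the Gaussian space (K = 1)
Z = np.zeros((T, n))
for t in range(1, T):
    Z[t] = B @ Z[t - 1] + 0.5 * rng.normal(size=n)

# Monotone transforms g_j = F_j^{-1}: the observed series are non-Gaussian, but
# applying F_j recovers the jointly Gaussian series Z_j with the VAR structure above.
X = np.column_stack([np.exp(Z[:, 0]),          # log-normal marginal
                     Z[:, 1] ** 3,             # heavier-tailed marginal
                     norm.cdf(Z[:, 2])])       # bounded marginal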
Based on the copula technique [147], The G-NPN model aims to separate the marginal properties of the data from its dependency structure. The marginal distri- bution of the data can be efficiently estimated using the non-parametric techniques with exponential convergence rate [232, ch. 2]. The estimation of the dependency structure requires more effort because there are at leastO(n 2 ) pairwise dependency relationships. Thus, we resort to L 1 regularized techniques for efficient estimation of the dependency structure in high-dimensional settings. Learning Granger Non-paranormal models consists of three steps: (i) Find the empirical marginal distribution for each time series b F i . (ii) Map the observations into the copula space as b f i (X i (t)) = b μ i +b σ i Φ −1 b F i (X i (t)) . (iii) Find the Granger causality among b f i (X i (t)). In practice we have to use the Winsorized estimator of the distribution function to avoid the large numbers Φ −1 (0 + ) and Φ −1 (1 − ): ˜ F j = δ n if b F j (X j )<δ n b F j (X j ) if δ n ≤ b F j (X j )≤ 1−δ n (1−δ n ) if b F j (X j )> 1−δ n First we have the following proposition that connects the Granger causality results identified by the Copula-Granger method to the true Granger causality values: 135 Proposition 6.2. The independence relationships in the copula space are the same as the independence relationships among original time series. Proof. SinceX⊥⊥Y if and only ifg(X)⊥⊥h(Y ) for any arbitrary random variables X and Y and deterministic one-to-one transformation functions g(·) and h(·), the proposition is established. Next, we have the following theorem which establishes the consistency rate of the Copula-Granger method. Theorem 6.3. Consider the time series X i (t) for i = 1,...,n and t = 1,...,T generated according to G− NPN(X,B,F ). Select δ T−L = 4(T−L) 1/4 q π log(T−L) −1 and λ T−L ∝ q (T−L) log(nL). Suppose the incoherent design condition in [154] holds for both covariance matrices C , E[X i (t)X j (t 0 )] and ˜ C , E[ ˜ F i (X i (t)) ˜ F j (X j (t 0 ))] for i,j = 1,...,N and t,s = t−L,...,t− 1. The Copula-Granger estimate of the B is asymptotically consistent as T→∞ b i,j − i,j 2 =O P K T−L s s log(nL) T−L , (6.1) where b i,j are estimates of i,j using Copula-Granger, s is the number of non- zero coefficients among nL coefficients under analysis and K T−L is proportional to φmax φ 2 min (se 2 n ) where φ max and φ min (m) are maximum and m-sparse minimum eigenvalue of the matrix ˜ C and e n is a saparisty multiplier sequence as defined in [154]. The subindex P inO P denotes convergence in probability. Proof. The proof relies on the result of [147] which shows that the covariance matrix of the samples transformed by the non-parametric Winsorized distribution estimator is concentrated around the true covariance matrix. Using this concentration bound, 136 we can bound the maximum eigenvalue of the matrix ˜ C−C. Repeating the steps of [154] gives the rate above. In order to prove Theorem 6.3 we need to show that the corresponding regression problem with Winzorized mapped version of variables is consistent. Here we show the proof for the bias term; the proof for the variance term follows the same lines. Consider the following linear model: y = > x +ε, where x is a p× 1 zero mean Gaussian random vector, is the coefficient vector and ε is a zero-mean Gaussian noise. Suppose in observation of n samples x i for i = 1,...,n, we have access to noisy versions of them ˜ x i and ˜ y i . 
We know that the estimation of covariance based on ˜ x i is consistent with the following rate [147] max j,k ˜ S n jk − b S n jk =O P s logp log 2 n n 1/2 , where b S n jk = (X > X) jk and ˜ S n jk = ( ˜ X > ˜ X) jk is our estimate of covariance using the tilted samples ˜ x i . We assume that the matrix Δ = ˜ C−C is positive semi-definite. We can relax this assumption. Modify the bound in Eq. (22) of [154] as following: γ > ˜ Cγ≤λ √ skγk 2 +ϕ max (Δ) (6.2) Bounding ϕ max (Δ)≤ K 2 max ˜ S n jk − b S n jk for some constant K 2 and deriving the lower bound in Eq. (26) of [154] using the fact thatϕ min (Δ)≥ 0 yields the following equation: Kφ min kγk 2 2 ≤ λ n √ s Kφ min +ϕ max (6.3) 137 Since ϕ max (Δ) diminishes with respect to φ min ( ˜ C) according to results from [147] and having the incoherent design assumption [154] for lower bound of φ min ( ˜ C) the proof establishes by following the steps in [154]. Theorem6.3statesthattheconvergencerateforCopula-Grangeristhesameas the one for Lasso which suggests efficient Granger graph learning in high dimensions via Copula-Granger. While Copula-Granger provides a semi-parametric approach for discovering the temporal dependency structure, it can be still inefficient when the number of observations is scarce. To address this challenge, in the next sections, we demon- strate a fully parametric yet flexible model for discovering the temporal dependence among extreme value time series. 6.2 Sparse-GEV Preliminaries Before diving into the details of our model, we first briefly review the extreme value theory [44]. LetX 1 ,··· ,X m be a sequence of independent and identically distributed random variables, and let M m = max{X 1 ,··· ,X m }. If there exist sequences of constants a m > 0 and b m such that Pr M m −b m a m ≤z ! →G(z) as m→∞, (6.4) for some non-degenerate distribution function G, then G should belong to the gen- eralized extreme value (GEV) families, namely G(z) = exp ( − 1 +ξ z−μ σ −1/ξ + ) , (6.5) 138 defined on{z : 1 +ξ(z−μ)/σ > 0}, where μ (−∞ < μ <∞) is the location parameter, σ (σ > 0) is the scale, and ξ (−∞<ξ <∞) is the shape parameter ξ, which governs the tail behavior of the distribution. One popular GEV distribution is the Gumbel distribution whenξ→ 0, whose probability density function is defined as p(z|μ,σ) = 1 σ exp − z−μ σ − exp − z−μ σ . (6.6) It has been shown that the maximum value in a sample of a random variable follow- ing an exponential family distribution (such as Gaussian, log-normal and Gamma distributions) converge to the Gumbel distribution. One special property of the Gumbel distribution is that the mode is determined solely by the location parame- ter μ. 6.2.1 Model Description Given multivariate time series data, our goal is to build an effective model that can recover temporal dependence between extreme value time series (block maxima orpeaksoverthreshold)andmakeaccuratepredictionsforfutureextremeevents. To achieve a robust and interpretable model, a natural choice is to capture the temporal dependence via linear models; however, this is not directly achievable on extreme value variables since their temporal dependence is obviously nonlinear. To solve the problem, we propose latent models in which the location parameters of GEV distributions are latent variables and the temporal dependence between extreme value variables is captured via the latent variables through dynamic linear model. 
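As a quick illustration of the block-maxima view, with arbitrary simulation sizes: maxima over blocks of Gaussian draws are approximately Gumbel distributed, and the location and scale parameters can be recovered by an off-the-shelf maximum likelihood fit.

import numpy as np
from scipy.stats import gumbel_r

rng = np.random.default_rng(4)
block_max = rng.normal(size=(2000, 100)).max(axis=1)   # maxima of blocks of 100 Gaussian draws
mu_hat, sigma_hat = gumbel_r.fit(block_max)            # maximum likelihood estimates of location and scale
print("estimated location and scale:", round(mu_hat, 3), round(sigma_hat, 3))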
139 We choose the location parameters because they capture the mode of extreme value variables and can be modeled reasonably well by linear dependence. Formally, letx(1),...,x(T ) forx(t)∈R P×1 denote P number of extreme value timeseriesandeachtimeseriesx i haveT observations, i.e.,x i ={x i (1),...,x i (T )} 1 , we define the joint probability of observations x(t) and their associated location parameter(t) as: p(x(t), i (t)|,,c) = P Y i=1 T Y t=L+1 p(x i (t)|μ i (t),σ i ) p(μ i (t)| H (t),,c), (6.7) where p(x i (t)|μ i (t),σ i ) can be modeled by a GEV distribution such as the Gumbel distribution in Eq. (6.6) with σ i as the scale parameter specific to time series x i , H (t) is the history of all time series until time t, and p(μ i (t)| H (t)),,c) can be modeled by a dynamic linear model as follows, μ i (t) =c i + K X `=1 (`) i (t−`) +. (6.8) where c i is the offset specific to time series i, (`) i are the coefficients, and is a Gaussian noise with variance τ 2 . As we can see, the temporal dependence between x i and x j is now captured via the coefficients . By adding a shrinkage Laplace prior over when maximizing the likelihood function, i.e., { ˆ , ˆ , ˆ c} = argmax ( L(x(1),...,x(T );,,c) + K X `=1 λk (`) k 1 ) , (6.9) 1 In extreme value theory, two main sets of methods, the Block Maxima method and the Peaks over Thresholds method have been developed to model extreme values [44]. In the rest of the chapter, we use Block Maxima method as an example to describe our model. Notice that our methodology is applicable to the Peaks over Thresholds approach by defining a point process model and in the experiments we have used both approaches. 140 where λ is the regularization parameter, we can obtain the sparse solution of . Finally, we determine that x i temporally depends on x j if at least one of the corre- sponding value of β (`) i,j for ` = 1,...,K is non-zero. In this way, our model not only can provide better understanding of potential causes of the extreme events, but also helps to achieve more accurate prediction of the extreme events in the future. This model is later referred to as the Sparse-GEV model. 6.2.2 Inference and Learning Given the existence of hidden variables in the sparse-GEV model, directly maximizing the likelihood as in Eq. (6.9) is not feasible. Therefore we applied the generalized EM-algorithm to solve the problem. Next, we use Gumbel distribution as an example to demonstrate how we can make efficient inference and learning in the proposed model. In the EM algorithm, we optimize the following function via two steps: Q(,,c; old , old ,c old ) =− P X i=1 T X t=K+1 ln(σ i )−E {|X, old , old ,c old } x i (t)−μ i (t) σ i + exp − x i (t)−μ i (t) σ i + 1 2 μ i (t)−c i − P K `=1 (`) i (t−`) τ ! 2 . (6.10) E Step Directly calculating the expectations in Eq. (6.10) is infeasible given the form of the posterior probability, therefore we apply sampling algorithms for approximation. In order to generate samples from p n |x, old , old ,c old o , we use the particle filtering algorithm [63]. The major challenge is that in each iteration 141 of particle filtering, we need to draw samples from p(μ i (t)|x i (t),(t−`)), which cannot be calculated analytically. Instead, we use the following proposal function: N ˜ μ i (t) +γ i τ−σ i W 0 γ 2 i exp ˜ μ i (t)−x i (t) σ i +γ 2 i !! , τ 2 γ 2 i + 1 ! , whereW 0 is the Lambert W function, γ i = τ/σ i , and ˜ μ i (t) is calculated using the history, i.e., ˜ μ i (t) = c old i + P K `=1 (`),old i ˜ (t−`). 
M Step The optimization problem for updating $\boldsymbol{\beta}$ and $c_i$ is
$$\min_{\boldsymbol{\beta}, c_i}\; \mathbb{E}_{\boldsymbol{\mu}\mid\mathbf{x}}\left[\sum_{t=K+1}^{T}\left(\mu_i(t) - c_i - \sum_{\ell=1}^{K}\boldsymbol{\beta}_i^{(\ell)\top}\boldsymbol{\mu}(t-\ell)\right)^{2}\right] + \lambda\sum_{\ell=1}^{K}\left\|\boldsymbol{\beta}_i^{(\ell)}\right\|_1,$$
where the expectation is computed from the samples. As we can see, the objective has the Lasso form and can be solved efficiently by algorithms such as coordinate descent [238]. The parameter estimation for the Gumbel distribution itself is not a trivial problem. In general, MLE is the widely accepted approach for estimating the shape and scale parameters, and Newton-Raphson or quasi-Newton methods can be applied to solve the resulting optimization problem [75]. We therefore estimate $\boldsymbol{\sigma}$ by the Newton-Raphson algorithm.

6.2.3 Prediction

In order to predict the future value of extreme events, for example $x_i(T+1)$ given the extreme value time series up to time $T$, we first estimate the mean $\bar{\mu}_i(T+1)$ using the samples drawn from the posterior distribution with the learned parameters. Based on the model defined in Eq. (6.7), we can then predict $x_i(T+1)$ as
$$\hat{x}_i(T+1) = \bar{\mu}_i(T+1) + \gamma_E\, \sigma_i,$$
where $\bar{\mu}_i(T+1) = c_i + \sum_{\ell=1}^{K}\boldsymbol{\beta}_i^{(\ell)\top}\bar{\boldsymbol{\mu}}(T-\ell+1)$ and $\gamma_E$ ($\approx 0.5772$) is the Euler-Mascheroni constant.

6.2.4 Scalability

The computational complexity of Sparse-GEV depends on two factors: the number of EM iterations required for convergence and the scalability of the E-step and M-step. In the experiments, we empirically show that EM usually converges within a small number of iterations. In the M-step, there are efficient solvers for both updates, and the problems for different time series are independent and can be solved in parallel. The particle filtering in the E-step is notably efficient for sampling from time series for three main reasons: (i) it requires only one pass to generate the samples; (ii) the generated samples are independent, so no burn-in period or decoupling is required; and (iii) at each time stamp the sampling procedures at different locations are independent of each other and can be run in parallel. Our algorithm is therefore scalable and readily applicable in practice.

6.3 Related Work and Discussions

In this section, after introducing transfer entropy [194], we discuss how Lasso-Granger and transfer entropy are connected to Copula-Granger and Sparse-GEV.

Transfer Entropy Solution Transfer entropy is usually employed when the data do not follow an auto-regressive model and a nonlinear generalization of the Granger causality framework is desirable. In the transfer entropy framework [194], a time series $x_i$ is considered a cause of another time series $x_j$ if the past values of $x_i$ significantly decrease the uncertainty in the future values of $x_j$ given its own past. The amount of decrease in uncertainty can be quantified as
$$T_{x_i \to x_j} = H\big(x_j(t) \mid x_j(t-K{:}t-1)\big) - H\big(x_j(t) \mid x_j(t-K{:}t-1),\, x_i(t-K{:}t-1)\big),$$
where $H(x)$ is the Shannon entropy of the random variable $x$.
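For intuition, a naive plug-in estimate of this quantity can be obtained by binning the series and computing the conditional entropies from empirical frequencies, as the sketch below does for a lag of K = 1. It is only an illustration of the definition (the chapter later points out that robust entropy estimation requires much more care and data), and all names here are our own.

```python
import numpy as np

def joint_entropy(columns, bins=8):
    counts, _ = np.histogramdd(np.column_stack(columns), bins=bins)
    p = counts.ravel() / counts.sum()
    p = p[p > 0]
    return -(p * np.log(p)).sum()

def transfer_entropy(x_i, x_j, bins=8):
    """Naive plug-in estimate of T_{x_i -> x_j} with lag K = 1."""
    fut, past_j, past_i = x_j[1:], x_j[:-1], x_i[:-1]
    # H(x_j(t) | x_j(t-1)) = H(x_j(t), x_j(t-1)) - H(x_j(t-1))
    h_j = joint_entropy([fut, past_j], bins) - joint_entropy([past_j], bins)
    # H(x_j(t) | x_j(t-1), x_i(t-1))
    h_ji = joint_entropy([fut, past_j, past_i], bins) - joint_entropy([past_j, past_i], bins)
    return h_j - h_ji

rng = np.random.default_rng(0)
x = rng.normal(size=2000)
y = np.concatenate([[0.0], x[:-1]]) + 0.5 * rng.normal(size=2000)  # y driven by the past of x
print(transfer_entropy(x, y), transfer_entropy(y, x))
```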
Since the transfer entropy is a pairwise quantity, we can use its output as input to a graph learning algorithm, for example, IAMB [221], to uncover the temporal dependency among multiple time series. The transfer entropy approach can be used to uncover causality relationship among extreme value time series since it does not rely on any particular assumptions on the distribution of the time series. 6.3.1 Connections to existing algorithms The connections between our algorithm with existing algorithms can be estab- lished by considering Sparse-GEV, transfer entropy and the copula approach as extensions of the Granger causality framework. The copula approach leverages the 144 marginal distribution of the time series to map the observations to another space and assumes linear dependence in the new space. Sparse-GEV discovers the Granger causality relationship among the latent variables from which the observations have been generated. The transfer entropy approach generalizes the Granger causality framework by finding the Granger causality type relationships from the uncertainty of the time series. In fact, when the data are distributed according to Gaussian linear model, transfer entropy is equivalent to Granger causality [14]. For high-dimensional time series, the number of observations is much less than the parameters of the model. The Lasso-Granger algorithm benefits from the vari- able selection properties of Lasso. [153] show that the Lasso variable selection loss, and subsequently the Lasso-Granger’s loss [6], vanishes with an exponential rate. For the copula approach, [147] show that when copula-based model is the true model, the copula-based structure learning algorithm with non-parametric estima- tion of marginals converges to the true graph with a rate ofO q log(n) n 1−ξ for some ξ∈ (0, 1), which is far slower than the exponential convergence of Lasso-Granger. Theperformanceoftransferentropyheavilyreliesontheaccuracyofentropyestima- tions, which require a large number of observations, especially for high-dimensional distributions, to achieve robust estimation [19]. For example, the Nearest Neighbor Estimator converges with root-n rate, which is again far slower than the conver- gence rate of Lasso-Granger. However, Sparse-GEV inherits the variable selection advantages of Lasso-Granger while allows a more flexible marginal distribution for the observations. It is fully parametric, and together with proper L 1 penalization can avoid over-fitting while capturing non-linear dependencies. 145 6.4 Experiment Results In order to evaluate the effectiveness of our algorithm, we conduct experiments on four datasets, including one synthetic dataset, one weather dataset and two Twitter datasets. The experiment results are evaluated on both how well we uncover the temporal dependence graphs and how accurately we can predict the future value of extreme events using the learned temporal dependence. 6.4.1 Datasets Synthetic Dataset We generate eight synthetic datasets, each composed of nine time-series with different types of temporal dependence, one of which is shown in Figure 6.1(a). 
Time series of length T = 40 are generated in two steps: (i) A set of observations of the location variables ˜ is generated according to eq (6.7), with the offset c i generated from N(0.2, 0.05), the coefficients set to have stationary time series, τ 2 set to 0.1 and the time lag K set to 2; (ii) The observations ˜ x are generated from a Gumbel distribution with the corresponding location parameters ˜ and scale parameter σ i = 0.05 for all time series. Climate Dataset The study of extreme value of wind speed and gust speed is of great interest to the climate scientists and wind power engineers. A collection of wind observations is provided by AWS Convergence Technologies, Inc. of Ger- mantown, MD. It consists of the observations of surface wind speed (mph) and gust speed (mph) every five minutes. We choose 153 weather stations located on a grid laying in the 35N− 50N and 70W− 90W block. Following the traditions in this domain, we generated extreme value time series observations, i.e, daily maximum values, at different weather stations. The objective is examine how the wind speed 146 (or gust speed) at different locations affects each other and how well we can make predictions on future wind speed. Twitter Dataset In social media analysis, “buzz” refers to those topics or memes that many people are talking about at the same time with rapid growth and impact. Buzz modeling and predictions are the fundamental problems in compu- tational social science, but they are extremely challenging since the distributions of these time series observations have heavy tails and most existing models fail miserably. Given the definition of buzz, i.e., extremely high frequency of certain words within a time interval, it is natural to model them via extreme value the- ory. We collected two Twitter datasets to evaluate the effectiveness of our model: one is the most popular 20 meme phrases during a 28-day interval from Nov-Dec 2009, and the other is popular hashtags around “occupy wall street” during a 21-day intervalinOct-Nov2011. Someexamplephrasesinthefirstdatasetare“Haitiearth- quake”, “Grammy Awards”, “iPad release”, and “Scott Brown’s Senate Election”; some example hashtags in the second dataset are #OWS, #OccupyLA, #OccupySF, #OccupyDC and #OccupyBoston. For those phrases and hashtags, we count the number of mentions in tweets within a interval of one hour. We are interested in uncovering how different buzzes affect each other and how well we can make predic- tions on future buzz. Actual 1 4 5 3 2 5 7 6 9 8 (a) Granger (b) Transfer Entropy (c) Copula (d) Sparse-‐GEV (e) Figure 6.1: Illustration of (a) Ground truth and the inferred temporal dependence graphs by (b) Granger causality, (c) Transfer entropy, (d) Copula method, and (e) Sparse-GEV. 147 Pajek (a) (b) Figure 6.2: The temporal dependence graph learned by Sparse-GEV on the extreme value time series of (a) Wind in NY and (b) Gust in NY. Thicker edges imply stronger dependency. 6.4.2 Performance Comparison We compare the performance of our Sparse-GEV model with three baselines, includingGrangercausality, transferentropy, andthecopulamethod(withGaussian copula), ontwotasks: uncoveringtheunderlyingdependencyamongtimeseries, and predicting the future values of time series. The first one requires knowledge about the true dependency structure, which is only available in the synthetic dataset. 
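The synthetic generation procedure described in Section 6.4.1 can be sketched roughly as follows: latent locations evolve through a sparse linear model over the lagged location vectors, and the observations are drawn from a Gumbel distribution around them. This is a simplified rendering under the stated settings (nine series, T = 40, lag K = 2, τ² = 0.1, σᵢ = 0.05), with a randomly chosen sparse coefficient tensor rather than the exact ground-truth graphs used in the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)
P, T, K, tau, sigma = 9, 40, 2, np.sqrt(0.1), 0.05

c = rng.normal(0.2, 0.05, size=P)                  # offsets c_i
# Sparse lagged coefficients beta[l, i, j]: effect of series j at lag l+1 on series i.
beta = np.zeros((K, P, P))
mask = rng.random((P, P)) < 0.15
beta[:, mask] = rng.normal(0, 0.2, size=(K, mask.sum()))   # small values chosen to keep the dynamics stable

mu = np.zeros((T, P))
x = np.zeros((T, P))
mu[:K] = c
for t in range(K, T):
    mu[t] = c + sum(beta[l] @ mu[t - l - 1] for l in range(K)) + tau * rng.normal(size=P)
    # Gumbel observations with location mu_i(t) and scale sigma.
    x[t] = rng.gumbel(loc=mu[t], scale=sigma)

ground_truth_edges = mask   # an edge x_j -> x_i wherever some beta[l, i, j] != 0
```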
For evaluation, we use the Area Under the Curve (AUC) score, i.e., the probability that the algorithm assigns a higher value to a randomly chosen positive (existing) edge than to a randomly chosen negative (non-existing) edge in the graph. In the prediction task, we conduct experiments via the sliding window approach: given time series observations of length $T$ and a window size $S$, we train a model on the observations $\mathbf{x}(s), \dots, \mathbf{x}(T-S+s-1)$ and test it on the $(T-S+s)$-th sample, for $s = 1, \dots, S$. We set $S$ to 10 for all the datasets and use the root mean squared error (RMSE), averaged over the $S$ experiments and all nodes, as the evaluation metric. In the experiments, the regularization parameter $\lambda$ is set via cross-validation, and all observations are normalized to the interval [0, 1] prior to the experiments.

Table 6.1: Comparison of different models on recovering the temporal dependence graph on eight synthetic datasets.

  Algorithm           Avg. AUC Score
  Sparse-GEV          0.9257
  Granger             0.9046
  Transfer Entropy    0.8701
  Copula              0.8836

Temporal Dependence Discovery Table 6.1 lists the average accuracy of uncovering the underlying dependence structures by the different algorithms on the synthetic data (consisting of eight different datasets). As we can see, our Sparse-GEV model significantly outperforms the baseline methods. Figure 6.1 shows an example of the graphs learned by the different algorithms: our model recovers the ground-truth graph more accurately than the other methods.

Fig. 6.2 shows the temporal dependencies inferred by Sparse-GEV from the extreme value time series of wind speed and wind gust speed. Given the limited space, we limit our discussion to the New York region. The main observation is that the weather in the inland regions is heavily influenced by the coastline region. The wind gust graph (Fig. 6.2(b)) indicates two clusters. One is at the top of the graph, starting from Middletown to Danbury across Fishkill. The other is located at the bottom of the graph, passing through several cities around Long Island Sound, such as Stamford, Fairfield, and Brookhaven, and then reaching inland cities in New Jersey through New York City. The top cluster gives an example of an inland wind gust path, while the bottom one shows the coastal impact of Long Island Sound, an impact that extends into inland New Jersey. Comparably, in addition to the Middletown-to-Danbury inland cluster in the wind gust graph, the wind graph (Fig. 6.2(a)) shows another inland cluster centered at Bridgewater, which has strong temporal dependencies with its neighboring cities (confirmed by climatologists).

Figure 6.3: The temporal dependency graph learned by Sparse-GEV from the Twitter dataset on (a) meme phrases in 2009 and (b) "Occupy Wall Street" hashtags in 2011.

Table 6.2: Comparison of RMSE by different algorithms in the prediction tasks. TE: transfer entropy; T-: Twitter dataset.

                Synth.    Wind      Gust      T-Meme    T-OWS
  Sparse-GEV    0.2644    0.0660    0.0927    0.0503    0.1190
  Granger       0.2923    0.0695    0.0943    0.0619    0.1410
  TE            0.3135    0.0692    0.0983    0.0972    0.1302
  Copula        0.2987    0.0678    0.0934    0.1009    0.1240

Fig. 6.3 shows the temporal dependencies inferred from the extreme value time series of the Twitter data on a subset of buzz. From Fig.
6.3(a), we can see that the temporal dependence between different buzz are sparse (since they are quite different topics); however, the buzz on “Haiti Earthquake” generates a huge impact on the whole Twitter universe and significantly changes the future mentions of other popular meme phrases. An interesting observation in Fig. 6.3(b) is that the hashtag on the general theme #OWS has direct temporal dependence with the city-specific hashtags, such as #O-LA and #O-DC, while city-specific hashtags do not affect each other. 150 Prediction Performance As discussed in Section 2, Sparse-GEV can also be used for predicting future extreme events. For other baseline methods, we use the approaches discussed in Section 3.1 for predictions. Table 6.2 shows the prediction accuracy of different algorithms on all datasets. As we can see, Sparse-GEV out- performs all the other algorithms across all datasets. This can be attributed to two properties of Sparse-GEV: its flexibility in modeling complex distributions and its effectiveness in utilizing the samples. The assumptions of Lasso-Granger and cop- ula methods about the distribution of the data can be responsible for their lower performance. Transfer entropy requires many observations to perform well, which could be a potential issue in the real applications. 6.4.3 Parameter Sensitivity Assessment Likeotherlatentstatemodels,Sparse-GEVmodelhasmanyparameters,which could affect its performance significantly. In our last experiment, we assess the parameter sensitivity. Fig. 6.4(a) shows that in a large range of values for the reg- ularization parameter λ, the graph learning accuracy remains unchanged and little effort in selection of the regularization parameter leads to the optimal performance. Fig. 6.4(b) suggests that in less than 10 EM iterations, our algorithm converges to the optimal point. Fig. 6.4(c) illustrates the effect of τ on the performance of Sparse-GEV. Small values ofτ result in smoother estimation ofE[|x], while higher values lead to sensitive estimation (as a resultE[|x] closely follows the observation time series). This observation suggests that we should monitor the sample mean of the latent variables and choose a value of τ that allows smooth latent variables to capture the trend of observations. 151 10 0 10 1 0.8 0.85 0.9 0.95 λ AUC Score (a) 5 10 15 20 −4 −3 −2 −1 0 x 10 6 EM Iteration Log−likelihood Value (b) (c) Figure6.4: (a)ParameterSensitivityAssessment: AverageAUCachievedbySparse- GEV on the synthetic datasets when the value of λ varies. (b) The value of log- likelihood function at each iteration of the EM algorithm. (c) The effect ofτ on the value of hidden variables in the Sparse-GEV algorithm. 6.5 Summary In this chapter, we proposed Copula-Granger and Sparse-GEV. The former is a semi-parametric extentsion of Lasso-Granger and the latter is a sparse latent space model, to uncover the sparse temporal dependency from multivariate extreme value time series. To estimate the parameters of the Sparse-GEV, we developed an iterative searching algorithm based on the generalized EM-algorithm and sam- pling with particle filtering. Through extensive experiments, we demonstrated that Copula-Granger and Sparse-GEV outperforms the state-of-the-art algorithms such as transfer entropy. 152 Chapter 7 Irregular Time Series 153 In many real-world applications, we are confronted with Irregular Time Series, whose observations are not sampled at equally-spaced time stamps. 
The irregularity in sampling intervals violates the basic assumptions behind many models for struc- ture learning. It is a common challenge in practice due to natural constraints or human factors. For example, in biology, it is very difficult to obtain blood sam- ples of human beings at regular time intervals for a long period of time; in climate datasets, many climate parameters (for e.g., temperature and CO 2 concentration) are measured by different equipments with varying sampling rate. All the existing approaches for temporal-causal discovery assume that the time series observations are obtained at equally spaced time stamps and fail in analysis of irregular time series. Existing methods for analyzing irregular time series can be categorized into three directions: (i) the repair approach, which recovers the missing observations via smoothing or interpolation [54, 103, 137, 182, 195]; (ii) generalization of spec- tral analysis tools, such as Lomb-Scargle Periodogram (LSP) [191] or wavelets [80, 157, 211]; and (iii) the kernel methods [182]. While the first two approaches provide sufficient flexibility in analyzing irregular time series, they have been shown to magnify the smoothing effects and result in huge errors for time series with large gaps [195]. In this chapter, we propose the Generalized Lasso-Granger (GLG) framework forcausalityanalysisofirregulartimeseries. Itdefinesageneralizationofinnerprod- uct for irregular time series based on non-parametric kernel functions. As a result, the GLG optimization problem takes the form of a Lasso problem and enjoys the computational scalability for large scale problems. We also investigate a weighted version of GLG (W-GLG), which aims to improve the performance of GLG by giving 154 higher weights to important observations. For theoretical contribution, we propose a sufficient condition for the asymptotic consistency of our GLG method. In partic- ular, we argue that compared with the popular locally weighted regression (LWR), GLG has the same asymptotic consistency behavior, but achieves lower absolute error. The rest of the chapter is organized as follows: we first formally define the problem of learning temporal structures in irregular time series in Section 7.1 and review related work in Section 7.2. Then we describe the proposed GLG frame- work and provide theoretical insights. In Section 7.4, we demonstrate the superior performance of GLG through extensive experiments. 7.1 Problem Definitions and Notation Irregular time series are time series whose observations are not sampled at equally-spaced time stamps. They could appear in many applications due to various factors. In summary, there are three types of irregular time series: including Gappy Time Series, Nonuniformly Sampled Time Series and Time Series with Missing Data. GappyTimeSeriesrefertothosewithregularsamplingratebuthavingfinitely many blocks of data missing. For example, in astronomy, a telescope’s sight could be blocked by a cloud or celestial obstacles for certain period of time, which makes the recorded samples unavailable during that time [54]. Nonuniformly Sampled Time Series refer to those with observations at non-uniform time points. For example, in healthcareapplicationspatientsusuallyhavedifficultiesinrigorouslyrecordingdaily (or weekly) health conditions or pill-taking logs for a long period of time [137]. Time series with missing data appear commonly in applications such as sensor networks, 155 where some data points are missing or corruptted due to sensor failure. 
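To make these three flavors of irregularity concrete, the following toy sketch derives each from a regularly sampled signal; the representation as paired time-stamp and value arrays anticipates the formal definition given below. The generation choices (gap position, sampling process, missing rate) are arbitrary illustrations, not the settings used later in the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
t_regular = np.arange(0.0, 50.0, 1.0)
x_regular = np.sin(0.3 * t_regular) + 0.1 * rng.normal(size=t_regular.size)

# (i) Gappy: regular sampling with a contiguous block of observations lost.
keep = (t_regular < 20) | (t_regular >= 30)
gappy = (t_regular[keep], x_regular[keep])

# (ii) Nonuniformly sampled: time stamps from a Poisson process
#      (exponential inter-sample times), values taken from the underlying signal.
t_nonuniform = np.cumsum(rng.exponential(1.0, size=50))
nonuniform = (t_nonuniform, np.sin(0.3 * t_nonuniform))

# (iii) Missing data: individual samples dropped at random (e.g., sensor failures).
keep = rng.random(t_regular.size) > 0.2
missing = (t_regular[keep], x_regular[keep])
```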
In this chapter, unless otherwise stated, the term irregular time series refers to all the mentioned three groups or any combinations of them. While the regular time series need only one vector of values on uniformly spaced time stamps, irregular time series need a second vector specifying the time stamps at which the data are collected. Following the notation of [182], we represent an irregular time series with a tuple of (timestamp,value) pairs. Formally, we have the following definition: Definition 7.1. An irregular time series x of length N is denoted by x = {(t n ,x n )} N n=1 where time-stamp sequence{t n } are strictly increasing, i.e., t l <t 2 < ...<t N and x n are the value of the time series at the corresponding time stamps. The central task of this chapter is as follows: given P number of irregular time series x 1 ,...,x P , we are interested in developing efficient learning algorithms to uncover the temporal causal networks that reveals temporal dependence between these time series. 7.2 Related Work Existing methods for analyzing irregular time series can be summarized into three categories: (i) Repairing the irregularities of the time series by either filling the gaps or resampling the series at uniform intervals, (i) Generalization of spectral analysis for irregular time series, and (iii) Applications of kernel techniques to time series data. 156 Repair Methods The basic idea of repair method is to interpolate the given time series in regularly spaced time stamps. The produced time series can be used in the temporal causal analysis for regular time series. For Gappy Time Series or Time Series with Missing Data, we can use a regression algorithm to recover the time series by filling the blank time stamps, which is also known as Surrogate Data Modeling [137]. In the case of non-uniformly sampled time series, the common practice is to find the value of the time series in regularly-spaced time stamps via a regression algorithm [137, 182, 195]. In some applications, a transient time model that describes the behavior of the system (e.g., by differential equations) is available and can be used to recover the missing data [54, 103]. However, this approach strongly depends on the model accuracy, which prevents it being applicable to the datasets with complex natural processes (e.g. climate system) [182]. One major issue with all repair methods is that the interpolation error propagates throughout all steps after data processing. As a result, quantifying the effect of the interpolation on any resulting statistical inference becomes a challenging task [157]. GeneralizationoftheSpectralAnalysisTools Thebasicideaofthisapproach is to find the spectrum of the irregular time series by generalization of the Fourier transform. Lomb-Scargle Periodogram (LSP) [191] is one classical example of this approach. In LSP, we first fit the time series to a sine curve in order to obtain their spectrum, which can be used to calculate the power spectral density (e.g., by Fourier transform). Then the auto-correlation and cross-correlation functions can be found by taking the inverse Fourier transform (using iFFT) of the corresponding power spectral density and cross power spectral density functions [8, 182, 208]. The versatility of Wavelet transforms and their numerous applications in analysis of time series have motivated researchers to extend the transform to irregular time series 157 [80, 157, 211]. LSP-based algorithms are specially designed for spectral analysis of the periodic signals. 
They do not perform well for non-periodic signals [209] and are not robust to outliers in the data [182]. Furthermore, computing the entire cross-correlation function is a much more challenging task than our task of learning temporal dependence, and the resulting correlation function can be a complicated object that is difficult to interpret and to analyze theoretically.

Kernel-based Method The basic idea of kernel-based methods is to generalize the regular correlation operator via kernels, without explicitly computing the correlation function. One classical example of this approach is the slotting technique [182], which computes a weighted (kernelized) correlation of two irregularly sampled time series. For two irregular time series $\{(t^x_n, x_n)\}_{n=1}^{N_x}$ and $\{(t^y_m, y_m)\}_{m=1}^{N_y}$, the correlation is computed as
$$\hat{\rho}(\ell, \Delta t) = \frac{\sum_{n=1}^{N_x}\sum_{m=1}^{N_y} x_n\, y_m\, w_{\ell,\Delta t}(t^x_n, t^y_m)}{\sum_{n=1}^{N_x}\sum_{m=1}^{N_y} w_{\ell,\Delta t}(t^x_n, t^y_m)}, \qquad (7.1)$$
where $w$ is the kernel function, $\ell$ is the lag number, and $\Delta t$ is the average sampling interval length. For example, $w$ can be the Gaussian kernel
$$w_{\ell,\Delta t}(t_1, t_2) = \exp\left(-\frac{(t_2 - t_1 - \ell\Delta t)^2}{\sigma^2}\right), \qquad (7.2)$$
where $\sigma$ is the kernel bandwidth. The main advantage of kernel-based methods is that they are non-parametric and can easily be combined with any regression-based method. Furthermore, they do not impose specific assumptions on the data, such as the model assumption in the repair methods or the periodicity assumption in LSP-based methods.

7.3 Methodology

In this section, we describe our proposed model in detail, analyze its theoretical properties, and discuss an extension that takes into account the irregular sampling patterns of the input time series for accurate prediction. Our goal is to generalize the popular Lasso-Granger technique [6, 206, 222], which applies a lasso-type VAR model to obtain a sparse and robust estimate of the coefficient vectors for Granger causality. For notational convenience in this chapter, let us rewrite the sparse regression task for learning the coefficients in Eq. (2.9) as
$$\min_{\boldsymbol{\beta}}\; \sum_{t=K+1}^{T}\left(x_i(t) - \sum_{j=1}^{P}\boldsymbol{\beta}_{j,i}^{\top}\,\mathbf{x}^{t,\text{Lagged}}_j\right)^{2} + \lambda\|\boldsymbol{\beta}\|_1, \qquad (7.3)$$
where $\lambda$ is the penalty parameter, which determines the sparsity of the coefficients $\boldsymbol{\beta}$. In this reformulation, the lagged time series are defined as $\mathbf{x}^{t,\text{Lagged}}_i = [x_i(t-1), \dots, x_i(t-K)]$, and $\boldsymbol{\beta}_{j,i} = [\beta^{(1)}_{j,i}, \dots, \beta^{(K)}_{j,i}]$ is the set of coefficients modeling the impact of the $j$-th time series on the $i$-th one.

7.3.1 Generalized Lasso Granger (GLG)

The key idea of our model is as follows: if we treat $\boldsymbol{\beta}_{i,j}$ in Eq. (7.3) as a time series, $\boldsymbol{\beta}_{i,j}^{\top}\mathbf{x}^{t,\text{Lagged}}_j$ can be considered as its inner product with another time series $\mathbf{x}^{t,\text{Lagged}}_j$. If we are able to generalize the inner product operator to irregular time series, the temporal causal models for regular time series can easily be extended to handle irregular ones. Let us denote the generalization of the dot product between two irregular time series $x$ and $y$ by $x \odot y$, which can be interpreted as a (possibly non-linear) function that measures the unnormalized similarity between them. Depending on the target application, one can define different similarity measures, and thus different inner product definitions. For example, we can define the inner product as a function linear in the components of the first time series:
$$x \odot y = \sum_{n=1}^{N_x} \frac{\sum_{m=1}^{N_y} x_n\, y_m\, w(t^x_n, t^y_m)}{\sum_{m=1}^{N_y} w(t^x_n, t^y_m)}, \qquad (7.4)$$
where $w$ is the kernel function; for example, $w$ can be the Gaussian kernel defined in Eq. (7.2).
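A direct implementation of the kernel-weighted inner product in Eq. (7.4) is short. The sketch below computes $x \odot y$ for two irregular series given as (time-stamp, value) arrays, with a Gaussian kernel of bandwidth σ; it is a minimal rendering of the definition with our own variable names, not the optimized implementation used in the experiments.

```python
import numpy as np

def gaussian_kernel(t1, t2, sigma):
    return np.exp(-((t2 - t1) ** 2) / sigma ** 2)

def glg_inner_product(tx, x, ty, y, sigma):
    """Kernel-weighted inner product of Eq. (7.4).

    tx, x : time stamps and values of the first irregular series
    ty, y : time stamps and values of the second irregular series
    For each observation x_n, the second series is kernel-smoothed at time t^x_n.
    """
    W = gaussian_kernel(tx[:, None], ty[None, :], sigma)   # W[n, m] = w(t^x_n, t^y_m)
    y_smoothed = (W * y[None, :]).sum(axis=1) / W.sum(axis=1)
    return np.dot(x, y_smoothed)

tx = np.array([0.0, 1.1, 2.3, 3.7]); x = np.array([0.5, -0.2, 0.1, 0.4])
ty = np.array([0.2, 0.9, 1.8, 2.9, 3.5]); y = np.array([0.4, -0.1, 0.0, 0.3, 0.5])
print(glg_inner_product(tx, x, ty, y, sigma=0.5))
```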
Many other similarity measures for time series have been developed for classification and clustering tasks [237, 247] and can be used for our dot product definition.

Given the generalization of the inner product operator, we can now extend the regression in Eq. (7.3) to obtain the desired optimization problem for irregular time series. Formally, suppose $P$ irregular time series $x_1, \dots, x_P$ are given. Let $\Delta t$ denote the average length of the sampling intervals of the target time series (e.g., $x_i$), and let $\boldsymbol{\beta}'_{i,j}(t)$ be a pseudo time series, i.e.,
$$\boldsymbol{\beta}'_{i,j}(t) = \big\{(t_\ell,\, \beta^{(\ell)}_{i,j}) \,\big|\, \ell = 1, \dots, K,\; t_\ell = t - \ell\Delta t\big\},$$
which means that for different values of $t$, the $\boldsymbol{\beta}'_{i,j}(t)$ share the same observation vectors (i.e., $\{\boldsymbol{\beta}_{i,j}\}$), but their time stamp vectors vary with $t$. We can then perform the causality analysis via the generalized Lasso Granger (GLG) method, which solves the following optimization problem:
$$\min_{\boldsymbol{\beta}'_i}\; \sum_{n=\ell_0}^{N_i}\left(x^{(i)}_n - \sum_{j=1}^{P}\boldsymbol{\beta}'_{i,j}(t^{(i)}_n) \odot \mathbf{x}^{t^{(i)}_n,\text{Lagged}}_j\right)^{2} + \lambda\|\boldsymbol{\beta}'_i\|_1, \qquad (7.5)$$
where $\ell_0$ is the smallest value of $n$ that satisfies $t^{(i)}_n \ge K\Delta t$. The above optimization problem is not convex in general, and convex optimization algorithms can only find a local minimum. However, if the generalized inner product is defined so that Problem (7.5) is convex, there are efficient algorithms such as FISTA [18] for optimization problems of the form $f(\boldsymbol{\beta}) + \|\boldsymbol{\beta}\|_1$ with $f(\boldsymbol{\beta})$ convex. In this chapter, we use the linear generalization of the inner product given by Eq. (7.4), with which Problem (7.5) can be reformulated as linear prediction of $x^{(i)}_n$ using the parameters $\boldsymbol{\beta}'_{i,j}(t^{(i)}_n)$ subject to an $L_1$-norm constraint on the parameter values. Thus, the problem is a Lasso problem and can be solved efficiently by optimized Lasso solvers such as [214].

7.3.2 Extension of GLG Method

Notice that every data point $x^{(i)}_n$ has equal impact in the GLG regression. We make two observations that can help improve the performance. The first observation, depicted in Figure 7.1, is that the samples that have more nearby data points available for their prediction will be predicted more accurately; they should therefore contribute more to learning the causality graph. Thus, we define the following weight for the subproblem of predicting $x^{(i)}_n$:
$$v_1\big(x^{(i)}_n\big) = \sum_{j=1}^{P}\sum_{\ell=1}^{K}\sum_{m=1}^{N_j} w\big(t^{(i)}_n - \ell\Delta t,\; t^{(j)}_m\big).$$

Figure 7.1: Time Series #1 is the target time series in this figure. The prediction of $x^{(1)}_{n_1}$ should receive a higher weight than that of $x^{(1)}_{n_2}$ in the depicted scenario because it can be predicted more accurately.

The second observation, illustrated in Figure 7.2, is that the learning should be distributed uniformly in time. In other words, samples in a denser region of the time series should contribute less than those in sparse regions. Thus, we define the weights
$$v_2\big(x^{(i)}_n\big) = \frac{w\big(t^{(i)}_n,\, t^{(i)}_n\big)}{\sum_{m=1}^{N_i} w\big(t^{(i)}_n,\, t^{(i)}_m\big)}.$$

Figure 7.2: Time Series #1 is the target time series in this figure, while the other time series are not shown. The prediction of $x^{(1)}_{n_1}$ should have a higher weight in the causality inference than that of $x^{(1)}_{n_2}$ because $x^{(1)}_{n_2}$ lies in a denser region of the time series.

Using the introduced weights, we define the Weighted Generalized Lasso Granger (W-GLG) problem as
$$\min_{\boldsymbol{\beta}_i}\; \sum_{n=\ell_0}^{N_i} v_1\big(t^{(i)}_n\big)\, v_2\big(t^{(i)}_n\big)\left(x^{(i)}_n - \sum_{j=1}^{P}\boldsymbol{\beta}'_{i,j}(t^{(i)}_n) \odot \mathbf{x}^{t^{(i)}_n,\text{Lagged}}_j\right)^{2} + \lambda\|\boldsymbol{\beta}_i\|_1. \qquad (7.6)$$
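With the linear inner product of Eq. (7.4), the GLG problem (7.5) reduces to an ordinary Lasso: each row of the design matrix holds the kernel-smoothed values of every candidate series at the K regular lags preceding an observed target sample. The sketch below builds such a design matrix and hands it to scikit-learn's Lasso. It is a schematic reconstruction of this reduction (variable names and the use of scikit-learn are our own choices), not the thesis implementation.

```python
import numpy as np
from sklearn.linear_model import Lasso

def smooth_at(t_query, ts, xs, sigma):
    """Kernel-smoothed value of an irregular series (ts, xs) at time t_query."""
    w = np.exp(-((ts - t_query) ** 2) / sigma ** 2)
    return np.dot(w, xs) / w.sum()

def glg_design(target, others, K, dt, sigma):
    """Rows: observed target samples with t >= K*dt; columns: P*K lagged features."""
    t_i, x_i = target
    rows, ys = [], []
    for t, x in zip(t_i, x_i):
        if t < K * dt:
            continue
        feats = [smooth_at(t - l * dt, ts, xs, sigma)
                 for (ts, xs) in others for l in range(1, K + 1)]
        rows.append(feats)
        ys.append(x)
    return np.asarray(rows), np.asarray(ys)

# target and others are (time_stamps, values) pairs of irregular series.
# A nonzero block of coefficients for series j indicates a temporal dependence j -> i.
def fit_glg(target, others, K=3, dt=1.0, sigma=0.5, lam=0.05):
    A, y = glg_design(target, others, K, dt, sigma)
    model = Lasso(alpha=lam).fit(A, y)
    return model.coef_.reshape(len(others), K)
```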
Figure 7.3: The sources of repair errors in GLG when $x^{(1)}_n$ is being predicted. In order to predict the data point $x^{(1)}_n$, GLG repairs the time series at $K$ regularly spaced points before time $t^{(1)}_n$. At each of these points a repair error $z^{(i)}_{t-\ell\Delta t}$ is produced.

7.3.3 Asymptotic Consistency of GLG

We follow the procedure in [153] to study the consistency of our generalized Lasso Granger method. Suppose there are two instantiations of the set of random processes $x_i$: one with regular sampling and one with irregular sampling. Figure 7.3 shows the source of the errors when $x^{(1)}_n$ is predicted using GLG. As the figure shows, GLG interpolates the values of the time series at the regular steps $[t_n - K\Delta t, \dots, t_n - \Delta t]$ and uses them to predict $x^{(1)}_n$. We can model the error induced at time $t_n - \ell\Delta t$ in the $j$-th time series by $z^{(j)}_{t_n - \ell\Delta t}$ and define the repaired regular time series $\tilde{x}^{(i)}_t = x^{(i)}_t + z^{(i)}_t$ for $i = 1, \dots, P$ and $t = t_n - K\Delta t, \dots, t_n - \Delta t$. It then suffices to show that the graph inferred from the time series $\tilde{x}^{(i)}_t$ is the same as the graph that would be obtained if the original regular time series $x^{(i)}_t$ were available.

In order to proceed with the proof, we need to assume that the $\tilde{x}^{(i)}_t$ are zero-mean Gaussian random variables. As in the regular case, let $\tilde{X}^{(i)}_t$ be the length-$n$ vector of samples of $\tilde{x}^{(i)}_t$. We can write GLG as
$$\tilde{\boldsymbol{\beta}}^{\lambda}_{i,j} = \operatorname*{argmin}_{\boldsymbol{\beta}_{i,j}}\; n^{-1}\Big\|\tilde{X}^{(i)}_t - \sum_{j}\tilde{X}^{(j)}\boldsymbol{\beta}_{i,j}\Big\|_2^2 + \lambda\|\boldsymbol{\beta}_{i,j}\|_1. \qquad (7.7)$$
Let $\Gamma$ denote the set of all the time series, and suppose the regular time series are available. The neighborhood $\mathrm{ne}(i)$ of a time series $x^{(i)}$ is the smallest subset of $\Gamma\setminus\{x^{(i)}\}$ such that, given this subset, $x^{(i)}$ is conditionally independent of all the remaining time series. Now define the neighborhood of $x^{(i)}$ estimated from the irregular time series by $\widetilde{\mathrm{ne}}^{\lambda}(i) = \{x^{(j)} \mid \tilde{\boldsymbol{\beta}}^{\lambda}_{i,j} \ne \mathbf{0}\}$ in the solution of the above problem.

Theorem 7.2. Let Assumptions 1-6 in [153] hold for the solution of Problem (7.7).
(a) Suppose $\mathbb{E}[z^{(i)}_p \tilde{x}^{(i)}_q] = 0$ for all $i = 1, \dots, P$ and $p, q = t_n, \dots, t_n - K\Delta t$, $p \ne q$. Let the penalty parameter satisfy $\lambda \sim d\, n^{-(1-\varepsilon)/2}$ with $\kappa < \varepsilon < \xi$ and $d > 0$. Then there exists some $c > 0$ such that, for all $x^{(i)} \in \Gamma$,
$$P\big(\widetilde{\mathrm{ne}}^{\lambda}(i) \subseteq \mathrm{ne}(i)\big) = 1 - O(\exp(-c\,n^{\varepsilon})), \quad \text{for } n \to \infty,$$
$$P\big(\mathrm{ne}(i) \subseteq \widetilde{\mathrm{ne}}^{\lambda}(i)\big) = 1 - O(\exp(-c\,n^{\varepsilon})), \quad \text{for } n \to \infty.$$
(b) The above results can be violated if $\mathbb{E}[z^{(i)}_p x^{(i)}_q] \ne 0$ for all $i = 1, \dots, P$ and $p, q = t_n, \dots, t_n - K\Delta t$, $p \ne q$.

Proof. The error variables $z^{(i)}_\ell$ are zero-mean Gaussian variables, owing to the zero-mean Gaussian assumption on the distributions of $X^{(i)}$ and $\tilde{X}^{(i)}$. Following the proof in [153], construct the following correlation functions:
(b) Let, w.l.o.g, k,` > 0 for some (k,`). The necessity part of the Lemma A.1 in [153] guarantees thatG k,` () =−λ. It is clear that by taking into account the Z in Eq. (7.9), we can have G k,` ( ˜ )>−λ. (Note that we cannot show G k,` ( ˜ )>−λ happens only with diminishing probability.) Now the sufficiency condition is not satisfied and we are no longer guaranteed to have ˜ k,` = 0. Moreover, if the solution for Problem (7.9) is not unique and−λ<G k,` (˜ a)<λ then ˜ k,` = 0 and the solution of Problems (7.8) and (7.9) are indeed different. The theorem specifies a sufficient condition with which the temporal causal graph inferred by the irregular time series is asymptotically equal to the graph that could be inferred if the time series were available. The condition states that if the repairing error during GLG’s operation is orthogonal to the repaired data points, the inferred temporal causal graph will be the same as the actual one with probability approaching one as the length of the time series grows. 165 Choice of Kernel in GLG Method Consider the non-uniformly sampled time series. Suppose the non-uniformity is due to clock jitter which is modeled by iid zero-mean Gaussian variables at dif- ferent sampling times. If the kernel function w satisfies the following equation, the neighborhood learned by our algorithm is guaranteed to be the same as the regular time series case: N X `=1 [k(` +a ` ,q)−k(p,q)]w(` +a ` ,p) = 0, (7.10) for alli,` = 1,...,N. In this equationk(t,t 0 ) is the covariance function of the time series and a j is the random variable modeling the clock jitter at j th time stamp. Proof. By Theorem 7.2, it is sufficient to show that the following equation holds for all p,q = 1,...,N which means error should be orthogonal to all the observations used in Lasso-Granger. E x P N `=1 (x (i) (`+a ` ) −x (i) p )w(` +a ` ,p) P N `=1 w(` +a ` ,p) x (i) q = 0. (7.11) The expectation is with respect to x, thus: E x " N X `=1 (x (i) (`+a ` ) −x (i) p )w(` +a ` ,p) ! x (i) q # = 0. (7.12) Taking the expectation and using the definition of the covariance function k(t,t 0 ) = E x h x (i) t x (i) t 0 i yields the result. A quick inspection of Eq. (7.10) shows that the Gaussian kernel does not satisfy it. It is clear that by setting q = p and noting that the covariance function always satisfies k(` +a ` ,p)≤k(p,p), any non-negative kernel such as the Gaussian kernel cannot satisfy Eq. (7.10). However if the time series is smooth enough so 166 Time Figure 7.4: The black circles show the time stamps of the given irregular time series. The crosses are the time stamps used for repair of the time series. The red time stamp shows the moment in that the repair methods produce large errors and propagate the error by predicting the erroneous repaired sample. GLG skips these intervals because it predicts only the observed samples. thatk(t,p)≈k(p,p) fort in a neighborhood ofp, the Gaussian kernel can attenuate the effect of k(t,p)−k(p,p) for values of t outside the neighborhood and the left side of Equation (7.10) can become close to zero. In conclusion, while the Gaussian kernel is a suboptimal kernel to use for causality analysis purposes, for smooth time series it approximately satisfies Eq. (7.10) and is expected to perform well. Comparison of GLG Method with Time Decay Kernels and Locally Weighted Regression Analysis of the consistency of the repair methods such as Kernel Weighted Regression algorithm can be done similar to the analysis of GLG method by introducing the repair error variables. 
However GLG is expected to have lower absolute error because it tries to predict the actual observations x (i) n without additional repair error. In contrast, repair methods such as LWR first interpolate the time series at regular time stamps; then try to predict the repaired samples which carry repair errors with themselves. As alluded in Section 7.2, the repair methods introduce huge amount of error during reconstruction of the irregular time series in the large gaps, see Figure 7.4. In contrast, since there is no sample in the gaps, GLG does not attempt to predict the value of the time series in the gaps and avoids these types of errors. 167 7.4 Experiment Results InordertoexamineperformanceofGLG,weperformtwoseriesofexperiments, one on the synthetic datasets, and the other one on the Paleo dataset for identifying the monsoon climate patterns in Asia. 7.4.1 Synthetic Datasets Design of synthetic datasets with irregular sampling time have been discussed extensively in the literature. We follow the methodology in [182] to create four different synthetic datasets. These datasets are constructed to emulate different sampling patterns and functional behavior. Mixture of Sinusoids with Missing Data (MS-M): In order to create the MS-M dataset, we generate P− 1 time series according to a mixture of several sinusoidal functions: x i (t) = 3 X j=1 A j,i cos(2πf j,i (t) +ϕ j,i ) + i , for i = 2,...,P, (7.13) where f j,i ∼Unif(f min ,f max ), ϕ j,i ∼Unif(0, 2π) and the vector of amplitudes A j,i is distributed as Dirichlet(1 3×1 ) so that all the time series have a least amount of power. The range of the frequencies is selected in a way that only few periods of sines will be repeated in the dataset. The noise term (i) is selected to be zero-mean Gaussian noise. We create the target time series according to a linear model, x 1 (t) = P X i=2 K X `=1 α (`) i x i (t−`) +. 168 0 50 100 150 200 250 300 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Length of Time Series AUC Score GLG LWR GP Slotting LSP 0 50 100 150 200 250 300 0 0.2 0.4 0.6 0.8 1 Length of Time Series AUC Score GLG LWR GP Slotting LSP 0 50 100 150 200 250 300 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Length of Time Series AUC Score GLG LWR GP Slotting LSP Figure 7.5: Study of convergence behavior of the algorithms in the Mixture of Sinusoids with (left) Missing data points dataset, (middle) Jittery clock dataset and (right) Auto-regressive Time Series with Missing data points dataset. We set α (`) i to be sparse to model sparse graphs. Finally we drop samples from all the time series independently with probabilityp m . We create 20 instances of MS-M random datasets and report the average performance on them. Mixture of Sinusoids with Jittery Clock (MS-J): For creating this dataset, first we createthesamplingtimestampsforthetargettimeseries: t 1 = [1,...,N]+e 1 where the random vector e 1 = [e 11 ,...,e 1N ] is a zero mean Gaussian random vector with covariance matrix γI and γ is called the jitter variance. We select the parameters of the other time series and use them to calculate the value of the target time series at the given time stamps: x (1) n = P X i=2 L X `=1 α (`) i 3 X j=1 A j,i cos(2πf j,i (t n −`) +ϕ j,i ) +, Then we produce the sampling times for the other time series similarly as a Gaussian random vector with mean [1,...,N] and covariance matrix γI and use Eq. (7.13) to produce the time series. We create 50 instances of MS-J random datasets and report the average performance on them. 
169 Auto-Regressive Time Series with Missing Data (AR-M): The procedure of creating the AR-M dataset is similar to the one described for MS-M with a single difference in producing time series 2 to P according to AR processes: x (i) n = 3 X `=1 β i,` x (i) tn−` +, for i = 2,...,P where β i,` are chosen randomly while keeping the AR time series stable. We create 50 instances of AR-M random datasets and report the average performance on them. Mixture of Sinusoids with Poisson Process Sampling times (MS-P): The procedure of creating the MS-P dataset is similar to the one described for MS-J with the difference of producing the sampling times according to a Poisson Point process; i.e. inter-sampling times are distributed according toExp(1). We create 50 instances of MS-P random datasets and report the average performance on them. Baselines We compare the performance of our algorithm with four state-of-the- art algorithms. We use Locally Weighted Regression (LWR) and Gaussian Process (GP)regressiontorepairthetimeseriesandperformtheregularLasso-Granger. The SlottingTechnique[182]andtheLSPmethodarethetwoalgorithmsusedforfinding mutual correlation. Since performance of W-GLG is close to the performance of GLG, performance of W-GLG will be compared in only one figure to avoid cluttered plots. Performance Measures In order to report the accuracy of the inferred graph, we use the Area Under the Curve (AUC) score. The value of AUC is the probability that the algorithm will assign a higher value to a randomly chosen positive (existing) edge than a randomly chosen negative (non-existing) edge in the graph. 170 Parameter Tuning Unless otherwise stated, in all of the kernel-based methods (GLG, LWR, and Slotting technique) we use Gaussian kernel with bandwidth σ = Δt 2 . We select the value of λ in GLG by 5-fold cross-validation. 7.4.2 Results on the Synthetic Datasets Experiment #1: Convergence Behavior of The Algorithms In Figures 7.4.1, we increase the length of the time series in datasets MS-M and MS-J to study the convergence behavior of GLG. Both figures demonstrate excellent convergence behavior of GLG and the fact that GLG consistently outperforms other algorithms by a large margin. Note that the convergence behavior of LWR is very similar to GLG; but as alluded in the theoretical analysis, GLG has lower absolute error. The Slotting technique and the LSP method are designed for analysis of cross- correlation of time series and as expected do not perform well in our settings. The poor performance of LSP can be linked to the periodicity of signals assumption in LSP analysis method, which is violated in our dataset. Experiment #2: Comparison of Different Kernels In order to study the effect of different kernels on performance of GLG we test the convergence behavior of GLG with three different kernels. The kernels are Gaussian: w(t 1 ,t 2 ) = exp(−(t 1 − t 2 ) 2 /σ), Sinc: w(t 1 ,t 2 ) =σ sin((t 1 −t 2 )/σ)/(t 1 −t 2 ), and the Inverse distance kernel w(t 1 ,t 2 ) = (kt 1 −t 2 k 2 2 ) −1 . As shown in Figure 7.6(a), as pointed out in the analysis section the Gaussian kernel shows acceptable convergence behavior. Performance of the Inverse Distance kernel suggests an asymptotic convergence behavior but with much higher absolute error. The Sinc kernel does not properly converge. Experiment #3: The Impact of Missing Rate The effect of missing data is examined in Figure 7.6(b). 
It is clear that as the probability of missing a data 171 0 50 100 150 200 250 300 0.5 0.6 0.7 0.8 0.9 1 Length of Time Series AUC Score Gaussian Sinc Inverse Distance (a) 0 0.1 0.2 0.3 0.4 0.5 0.6 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Missing Probability AUC Score GLG LWR GP Slotting LSP (b) 0 0.2 0.4 0.6 0.8 1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Jitter AUC Score GLG LWR GP Slotting LSP (c) Figure 7.6: (a) Convergence of different kernels in the Mixture of Sinusoids with missing dataset. (b) The effect of missing data in the Mixture of Sinusoids with missing data points. (c) The effect of clock jitter (γ) the Mixture of Sinusoids with Jittery Clock dataset. decreases GLG becomes more accurate. Note the superior performance of GLG compared to others and the fact that when only 10% of the data points are missing GLG perfectly uncovers the correct causality relationship between the variables. Experiment #4: Clock Jitter Impact The effect of clock jitter (γ) is examined in Figure 7.6(c). GLG is robust with respect to the amount of clock jitter. This is because the time series are smooth and we are testing in a reasonable range of jitter. Experiment #5: Performance of the Algorithms in Extremely Irregular Datasets Due to the limited space we only report the result of one experiment on the Mixture of Sinusoids with Poisson points sampling times, see Figure 7.7. In this extremely irregularly sampled dataset, our algorithm outperforms other algorithms by a large margin. This is due to high possibility of large gaps in this dataset and the fact that GLG avoids huge errors caused by interpolation in the gaps. Experiment #6: Comparison of W-GLG and GLG Figure7.8comparesthe performance of weighted version of our algorithm (WGLG) with the non-weighted version (GLG) in all the four synthetic datasets. WGLG performs only marginally better than GLG in all the datasets except the dataset with Poisson sampling 172 GLG LWR GP Slotting LSP 0 0.2 0.4 0.6 0.8 1 AUC Score Figure 7.7: Performance comparison of the algorithms in the Mixture of Sinusoids with Poisson points sampling times. points. This is because only this extremely irregular dataset provides the situa- tion for weights to show their advantages. 7.4.3 Paleo Dataset Now we apply our method to a Climate dataset to discover the weather move- ment patterns. Climate scientist usually rely on models with enormous number of parameters that are needed to be measured. The alternative approach is the data-centric approach which attempts to find the patterns in the observations. The Paleo dataset which is studied in this chapter is the collection of density of δ 18 O, a radio-active isotope of Oxygen, in four caves across China and India. The geologists were able to find the estimate of δ 18 O in ancient ages by drilling deeper in the wall of the caves. The data are collected from Dandak [24], Dongge [230], Heshang [112], Wanxiang [241] caves, see Figure 7.9, with irregular sampling 173 MS−J MS−M AR−M MS−P 0 0.2 0.4 0.6 0.8 1 AUC Score GLG W−GLG Figure 7.8: Comparison of performance of W-GLG vas. GLG in all four datasets. pattern described in Table 7.1. The inter-sampling time varies from high resolution 0.5± 0.35 to low resolution 7.79± 9.79; however there is no large gap between the measurement times. The density ofδ 18 O in all the datasets is linked to the amount of precipitation which is affected by the Asian monsoon system during the measurement period. 
Asian monsoon system, depicted in Figure 7.9, affects a large share of world’s popu- lation by transporting moisture across the continent. The movement of monsoonal air masses can be discovered by analysis of their δ 18 O trace. Since the datasets are collected from locations in Asia, we are able to analyze the spatial variability of the Asian monsoon system. In order to analyze the spatial transportation of the moisture we normalize all the datasets by subtracting the mean and divide them by their standard devia- tion. We use GLG with the Gaussian kernel with bandwidth equal to 0.5(yr) and maximum lag of 25(yr); i.e. K = 50. In order to compare our results with the 174 Wanxiang Heshang Dandak Dongge Figure 7.9: Map of the locations and the monsoon systems in Asia. Table 7.1: Description of the Paleo Dataset. Location Measur. Period Δt Std(t n −t n−1 ) Dandak 624AD-1562AD 0.50 (yr) 0.35 (yr) Dongge 6930BC-2000AD 4.21 (yr) 1.63 (yr) Heshang 7520BC-2002AD 7.79 (yr) 9.78 (yr) Wanxiang 192AD-2003AD 2.52 (yr) 1.19 (yr) results produced by the slotting method in [182] we analyze the spatial relationship among the locations in three age intervals: (i) The entire overlapping age interval 850AD-1563AD, (ii) The Cold phase 850AD-1250AD, and (ii) The medieval warm period 1250AD-1563AD. Figure 7.10 compares the graphs produced by GLG with the ones reported by [182]. 175 Wanxiang Heshang Dongge Dandak Wanxiang Heshang Dongge Dandak Wanxiang Heshang Dongge Dandak Wanxiang Heshang Dongge Dandak Wanxiang Heshang Dongge Dandak Wanxiang Heshang Dongge Dandak (a) (b) (c) (d) (e) (f) Figure 7.10: Comparison of the results on the Paleo Dataset: (a) GLG in period 850AD-1563AD. (b) GLG in the period 1250AD-1564AD. (c) GLG in the period 850AD-1250AD. (d) Slotting technique in period 850AD-1563AD. (e) Slotting tech- nique in the period 1250AD-1564AD. (f) Slotting technique in the period 850AD- 1250AD. 7.4.4 Results on the Paleo Dataset Figure 7.10 parts (a) and (d) show the results of causality analysis with GLG and slotting technique, respectively. Our results identify two main transportation patterns. First, the edges from Dongge to other locations which can be interpreted as the effect of movement of air masses from southern China to other regions via the East Asian Monsoon System (EAMS). Second, an edge from Dandak to Dongge whichshowstheIndianMonsoonSystem(IMS)significantlyaffectsDonggeinsouth- ern China. The graph in the period 1250AD-1563AD is sparser than the graph in 176 850AD-1250AD which can be due to the fact that the former age period is a cold period, in which air masses do not have enough energy to move from India to China, while in contrast the latter age period is a warm phase and the air masses initiated in India impact southern China regions. During the warm period we can see that other branches of EAMS are also more active which result in denser graph in the warm period. The differences between our results and the results from Slotting technique can be because of the fact that in the Slotting technique an edge is positively iden- tified even if two time series have significant correlation at zero lag. However, by the definition of Granger causality, only past values of one time series should help prediction of the other one in order to be considered as a cause of it. 
The signifi- cant correlation at zero lag can be due to either fast movement of the air masses or production of theδ 18 O by an external source, such as changes in the sun’s radiation strength, that impacts all the places with the same amount. Thus, the correlation value at zero lag cannot be a reliable sign for inference about the movement of the air masses. The sparsity of the identified causality graphs can be due to several reasons. While we cannot rule out the possibility of non-linear causality relationship between the locations, as [182] have noticed, the sparsity can happen because the relation- ships are either in large millennial scales or short annual scales. In the former case the relationship cannot be captured through analysis of periods with length of sev- eral centuries. In the latter case, the resolution of the dataset is in the order of 3-4 years which does not capture annual or biennial links. 177 7.5 Summary In this chapter, we proposed a non-parametric generalization of the Granger graphical models (GLG) to uncover the temporal causal structures from irregular time series. We provided theoretical proof on the consistency of the proposed model and demonstrated its effectiveness on four simulation data and one real-application data. 178 Chapter 8 Conclusion and Future Work 179 In this thesis, we attempt to provide scalable solutions for the key challenges that multivariate time series analysis faces in the large scale datasets. For the main part of this work, we focus on analysis of simple time series models in different situations and proposing efficient improvements to overcome the challenges. In particular, we focus on four key problems: modeling high-dimensional and complex correlations, eliminating the impact of unobserved confounders, modeling non-linear dependencies, and addressing practical challenges such as irregular sampling. We demonstrate efficiency of the proposed solutions in terms of improved prediction and graph learning accuracy. 8.1 Contributions and Limitations The contributions of this thesis work are five-fold: 1. First, we tackle the challenge of modeling high-dimensional multi-modal cor- relations in the spatio-temporal data as accurate modeling of correlations is the key for accurate predictive analysis. We formulate the problem as a low- rank tensor learning problem with side information incorporated via a graph Laplacian regularization. For scalable estimation, we provide a fast greedy low-rank tensor learning algorithm. 2. Second, in order to learn representations for time series data with complex latent structures, we develop functional subspace clustering algorithm. FSC assumes that functional samples lie in deformed linear subspaces and formu- lates the subspace learning problem as a sparse regression over operators. The resulting problem can be efficiently solved via greedy variable selection, given 180 access to a fast deformation oracle. We provide theoretical guarantees for FSC and show how it can be applied to time series with warped alignments. 3. Third, we observe that the performance of temporal dependency analysis algo- rithms is severely degraded in presence of unobserved confounders. To address this challenge, we propose two solutions: (i) a solution for eliminating the impact of major latent confounders using sparse plus low-rank decomposition and (ii) the second solution for eliminating the impact of all latent confounders using the prior information about the delays on the confounding paths. 4. 
Fourth, in many applications, multivariate time series do not follow the com- monly assumed multivariate Gaussian distribution. We propose two solutions to address this challenge: a semi-parametric approach using copulas and a state space model based on generalized extreme value distribution to handle the important case of extreme value time series. 5. Finally, in practice, collection of time series is not perfect and the observations are often times sampled on irregular time stamps with large gaps in between sampling times which violates the assumptions of many existing temporal- causal analysis algorithms. To address these challenges, we propose a fast non-parametric extension for temporal dependency analysis algorithms that improves accuracy over the commonly-used time series repair methods. These contributions confirm our thesis hypothesis that multivariate time series algo- rithms can be enhanced to accurately analyze large scale data. However, there are few limitations to the proposed solutions: 181 1. In many datasets such as Twitter dataset, the true influence graph is not available. While we used the best available approximation, i.e., the retweet network, the true influence graph is required for more rigorous evaluation. 2. In Chapter 5, we describe the sparse plus low-rank decomposition technique for removing the impact of latent factors. However, finding the correspon- dence between the uncovered latent structures and the real-world phenomena is challenging in the real-world datasets. Because in many cases we do not know which real-world latent factors are indeed responsible for the patterns discovered in the time series. 3. While in this work we have striven to make our algorithms scalable, given the growing size of the time series datasets, we still need to find new ways to make our solutions further scalable. One of the key bottlenecks in this process is the size of the parameter space for multivariate time series models which grows quadratically even for simplest models. 4. In most cases, the time series datasets have been manually preprocessed by removing seasonality and general trend. It is desirable to propose solutions that perform these steps automatically. 8.2 Future Work It is exciting to explore several directions to extend the contributions of this thesis: Learning the latent structure in time series The sparse plus low-rank is suc- cessful in what it is supposed to do; it learns the globally optimal solution for the 182 sparse local dependency structure. However, learning the latent temporal-causal structure is a harder problem and in general requires more assumptions. For simple graphs such as undirected trees there are solution for finding the globally optimal latent structure [42]. Several other authors have proposed solutions based on heuris- tics in which they find the latent structure by maximizing a set of optimality criteria [102]. Continuing this line of research, we can impose structural constraints on the Laplacian matrix of the latent graph, such as limited bandwidth [174]. Such con- straints have been successfully applied in filling in the missing data [164] and can be applied to learn the latent structure of time series. Approximate temporal-causal inference Our work in Chapter 5.3 focused on guaranteeing exact removal of the impact of latent confounders and achieving unbi- ased causal inference. 
Approximate temporal-causal inference
Our work in Section 5.3 focused on guaranteeing exact removal of the impact of latent confounders and achieving unbiased causal inference. Continuing this line of work, we can derive causality bounds for temporal-causal analysis when a set of time series necessary for unbiased causal inference is unobserved, similar to the work on non-temporal causality analysis in [11]. Another interesting line of research is to analyze the procedure of sequential causal experiment design, in which past observations are used to design an experiment that results in the most accurate causal inference [16, 26, 183]. In sequential causal experiment design, the experimenter has only a few chances for experimentation, and the goal is to efficiently use the observations from previous experiments to design an experiment that improves the accuracy of the final causal inference results.

Modeling time series with shocks
In many applications, large jumps in the value of a time series over time (also called shocks) are intrinsic properties of the time series. For example, in the shock-driven credit derivatives area, modeling complex multivariate collateralized debt obligations requires simple and robust models that are able to capture sharp jumps in the time series. Lévy processes are currently the leading stochastic process model for complex multivariate shock-driven financial time series [12, 34, 131]. Other properties of Lévy processes that make them suitable for such applications are (1) closure under linear transformations, which permits the construction of complex dependency structures, and (2) the ability to model heavy-tailed and asymmetric behavior of the data. Analyzing the time series of popularity of topics in social networks requires models that capture shocks as well; see Fig. 8.1 for two examples. Similar to the argument for financial time series modeling, Lévy processes can be used to efficiently model time series of popularity in social networks; hence the design of an algorithm for analyzing the temporal causal dependencies among multivariate Lévy processes is a promising direction for extending the contributions of this thesis work.

[Figure 8.1: Time series of the aggregate number of tweets about Tiger Woods (left) and the Pope (right), with the daily trend removed; the axes show the time interval versus the normalized aggregate number of tweets. Several shocks, marked by red boxes in the figure, occur frequently and are natural to the time series rather than singular events.]
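As a small illustration of the kind of shock-driven behavior Lévy processes capture, the sketch below simulates a jump-diffusion path: Brownian motion with drift plus a compound Poisson jump component. All parameter names and values are arbitrary choices for the illustration and are not taken from the thesis or from any specific financial or social media model.

```python
import numpy as np

def simulate_jump_diffusion(T=500, dt=1.0, mu=0.0, sigma=0.1,
                            jump_rate=0.02, jump_scale=3.0, seed=0):
    # Simulate a simple jump-diffusion (a Lévy process): Gaussian increments
    # with drift plus a compound Poisson jump component whose jump sizes are
    # Laplace distributed, so occasional large shocks ride on top of small
    # fluctuations.
    rng = np.random.default_rng(seed)
    n = int(T / dt)
    diffusion = mu * dt + sigma * np.sqrt(dt) * rng.standard_normal(n)
    n_jumps = rng.poisson(jump_rate * dt, size=n)            # jumps per step
    jumps = np.array([rng.laplace(0.0, jump_scale, k).sum() for k in n_jumps])
    return np.cumsum(diffusion + jumps)                       # sample path
```

Sampled paths from this generator show occasional large jumps superimposed on small fluctuations, qualitatively similar to the shocks highlighted in Figure 8.1.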
Irregular sampling for time series
In many applications, the underlying temporal dependency graph changes over time, and after every change the graph should be re-learned. However, it is unknown whether, in the new setting, the graph can be learned with a given accuracy. One possible approach is to define the learning task as a classification task: given the temporal dependency graph G(V,E), we want the accuracy of learning the existence of the edges E to be above a threshold θ. Learning theory states that we can bound the error rate in terms of the number of observed samples and the properties of the underlying distribution [33]. Thus, given the samples observed so far, if the generalization bound is larger than a threshold, we can either increase the sampling rate or give up the learning task entirely for a period of time. Besides several challenging design trade-offs, the problem is interesting because of the challenges in continuous-time stochastic process modeling, the derivation of generalization bounds for the temporal dependency learning task, and the analysis of the impact of irregular sampling in non-stationary random processes.

Appendix A
Notation Guide

This appendix lists the notation commonly used throughout this thesis. Notation specific to each chapter is defined in the corresponding chapter.

Symbol : Definition
x : Small letters denote scalar values.
x (bold) : Bold small letters denote a vertical vector of values.
x_i : The ith element of vector x.
x(t) (bold) : Bold small letters followed by (t) denote the values of a multivariate time series at time t.
X : Capital letters denote matrices.
X (calligraphic) : Calligraphic capital letters denote tensors.
T : Length of the time series.
P : Number of time series.
K : Number of lags in the VAR model.

Index

A: ADMM, 38; Affinity matrix, 65
C: Cokriging, see Kriging; Confounding Path, 107; Conway-Maxwell Poisson, 92; Copula, 134
D: DTW, see Dynamic Time Warping; Dynamic Time Warping, 23
E: Euclidean distance, 22; Extreme value data, 93, 138
F: Forecasting, 35; Functional data, 61
G: G-NPN, see Granger Non-paranormal; Generalized Lasso Granger, 158; Generalized Extreme Value, 137; Generalized Linear Auto-regressive Processes, 87; GEV, see Generalized Extreme Value; GLARP, see Generalized Linear Auto-regressive Processes; GLG, see Generalized Lasso Granger; Granger graphical model, 107; Granger Non-paranormal, 133; Graph Laplacian, 34; Greedy Algorithms, 43, 67, 103; Gumbel, 93, 138
H: Hidden Variables, see Latent Factors
I: Irregular Time Series, 155; Repair Methods, 155
K: Kriging, 33
L: Latent Factors, 88; Low-rank Tensor Learning, 35
P: Path Delays, 106; Principal Angle, 67
R: Random Matrix, 127
S: Self-expressive property, 63; Small Sample, 121; Sparse plus Low-rank Decomposition, 88; Spectral clustering, 65; Spurious Causation, 107; Subspace, 62
T: Time Warping, 70; Transfer Entropy, 143
W: Warping Distance, 22, 70

Reference List

[1] Abdel-Hamid, O., Deng, L., and Yu, D. (2013). Exploring convolutional neural network structures and optimization techniques for speech recognition. In INTERSPEECH, pages 3366-3370.
[2] Afsari, B. and Vidal, R. (2014). Distances on spaces of high-dimensional linear stochastic processes: A survey. In Nielsen, F., editor, Geometric Theory of Information. Springer International Publishing.
[3] Agarwal, A., Negahban, S., and Wainwright, M. J. (2012). Noisy matrix decomposition via convex relaxation: Optimal rates in high dimensions. Ann. Stat.
[4] Amr, T. (2013). Survey on Time-Series Data Classification. Accessed: 2015-06-13.
[5] Anderson, T. (1951). Estimating linear restrictions on regression coefficients for multivariate normal distributions. The Annals of Mathematical Statistics, pages 327-351.
[6] Arnold, A., Liu, Y., and Abe, N. (2007). Temporal causal modeling with graphical Granger methods. In KDD.
[7] Asimakopoulos, I., Ayling, D., and Mansor Mahmood, W. (2000). Non-linear Granger causality in the currency futures returns. Economics Letters.
[8] Babu, P. and Stoica, P. (2010). Spectral analysis of nonuniformly sampled data - a review. Digital Signal Processing.
[9] Bach, F. R., Jordan, M., et al. (2004). Learning graphical models for stationary time series. Signal Processing, IEEE Transactions on, 52(8):2189-2199.
[10] Baldi, P. and Hornik, K. (1989). Neural networks and principal component analysis: Learning from examples without local minima. Neural Networks, 2(1):53-58.
[11] Balke, A. and Pearl, J. (1997). Bounds on Treatment Effects From Studies With Imperfect Compliance. Journal of the American Statistical Association. 189 [12] Ballotta, L. and Bonfiglioli, E. (2010). Multivariate Asset Models Using Levy Processes and Applications. Finance Meeting EUROFIDAI - AFFI. [13] Banerjee, S., Carlin, B. P., and Gelfand, A. E. (2014). Hierarchical modeling and analysis for spatial data. Crc Press. [14] Barnett, L. (2009). Granger Causality and Transfer Entropy Are Equivalent for Gaussian Variables. Physical Review Letters. [15] Barron, A., Cohen, A., Dahmen, W., and DeVore, R. (2008). Approximation and learning by greedy algorithms. The Annals of Statistics. [16] Bartroff, J., Lai, T. L., and Shih, M.-C. (2013). Sequential experimentation in clinical trials. Design and analysis. Springer Series in Statistics . [17] Baydogan, M. G., Runger, G., and Tuv, E. (2013). A bag-of-features frame- work to classify time series. Pattern Analysis and Machine Intelligence, IEEE Transactions on. [18] Beck, A. and Teboulle, M. (2009). A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems. SIAM Journal on Imaging Sciences. [19] Beirlant, J., Dudewicz, E. J., Györfi, L., and Meulen, E. C. (1997). Nonpara- metric Entropy Estimation: An Overview. International Journal of the Mathe- matical Statistics Sciences. [20] Beirlant, J., Goegebeur, Y., Segers, J., and Teugels, J. (2004). Statistics of Extremes: Theory and Applications. Wiley. [21] Bell, D., Kay, J., and Malley, J. (1996). A non-parametric approach to non- linear causality testing. Economics Letters, 51(1):7–18. [22] Bengio, Y., Courville, A., and Vincent, P. (2013). Representation learning: A review and new perspectives. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 35(8):1798–1828. [23] Bengio, Y., Goodfellow, I. J., and Courville, A. (2015). Deep learning. Book in preparation for MIT Press. [24] Berkelhammer, M., Sinha, A., Mudelsee, M., Cheng, H., Edwards, R. L., and Cannariato, K. (2010). Persistent multidecadal power of the Indian Summer Monsoon. Earth and Planetary Science Letters. [25] Bertsekas, D. and Tsitsiklis, J. (1989). Parallel and Distributed Computation: Numerical Methods. Prentice Hall Inc. 190 [26] Berzuini, C., Dawid, P., and Didelez, V. (2012). Assessing Dynamic Treatment Strategies, pages 85–100. John Wiley & Sons, Ltd. [27] Birney, E. (2001). Hidden markov models in biological sequence analysis. IBM Journal of Research and Development, 45(3.4):449–454. [28] Bishop, C. M. (2006). Pattern recognition and machine learning. springer. [29] Bishop, C. M. (2013). Model-based machine learning. Philosophical Transac- tions of the Royal Society A, 371(1984):20120222. [30] Blei, D. M. (2014). Build, compute, critique, repeat: Data analysis with latent variable models. Annual Review of Statistics and Its Application, 1:203–232. [31] Bollerslev, T. (1986). Generalized autoregressive conditional heteroskedasticity. Journal of econometrics, 31(3):307–327. [32] Bonilla, E., Chai, K., and Williams, C. (2007). Multi-task Gaussian Process Prediction. In NIPS. [33] Bousquet, O., Boucheron, S., and Lugosi, G. (2004). Introduction to Statistical Learning Theory. In In , O. Bousquet, U.v. Luxburg, and G. Rsch (Editors). Springer. [34] Brockwell, P. J. and Schlemm, E. (2013). Parametric estimation of the driv- ing Lévy process of multivariate CARMA processes from discrete observations. Journal of Multivariate Analysis. 
[35] Brovelli, A., Ding, M., Ledberg, A., Chen, Y., Nakamura, R., andBressler, S.L. (2004). Betaoscillationsinalarge-scalesensorimotorcorticalnetwork: directional influences revealed by Granger causality. PNAS. [36] Candès, E. J., Li, X., Ma, Y., and Wright, J. (2011). Robust principal compo- nent analysis? Journal of the ACM. [37] Casella, G. and Berger, R. L. (1990). Statistical inference, volume 70. Duxbury Press Belmont, CA. [38] Chambers, D. and Mandic, J. (2001). Recurrent neural networks for predic- tion: learning algorithms architecture and stability. John Wiley & Sons, Ltd., Chichester, 18:32. [39] Chandrasekaran, V., Parrilo, P. A., and Willsky, A. S. (2012). Latent Variable Graphical Model Selection via Convex Optimization. Ann. Statist. 191 [40] Chandrasekaran, V., Sanghavi, S., Parrilo, P. A., and Willsky, A. S. (2009). Sparse and low-rank matrix decompositions. In Allerton. [41] Chandrasekaran, V., Sanghavi, S., Parrilo, P. A., and Willsky, A. S. (2011). Rank-Sparsity Incoherence for Matrix Decomposition. SIAM Journal on Opti- mization. [42] Choi, M. J., Tan, V. Y. F., Anandkumar, A., and Willsky, A. S. (2011). Learn- ing Latent Tree Graphical Models. Journal of Machine Learning Research. [43] Chu, W., Sindhwani, V., Ghahramani, Z., and Keerthi, S. (2006). Relational learning with Gaussian processes. In NIPS. [44] Coles, S. (2001). An introduction to statistical modeling of extreme values. Springer-Verlag London Ltd. [45] Coles, S., Pericchi, L., and Sisson, S. (2003). A fully probabilistic approach to extreme rainfall modeling. Journal of Hydrology, 273(1-4):35–50. [46] Coles, S. G. and Tawn, J. A. (1996). Modelling extremes of the areal rainfall process. Journal of the Royal Statistical Society. Series B (Methodological). [47] Comon, P. (1994). Independent component analysis, a new concept? Signal Processing. [48] Consul, P. C. and Jain, G. C. (1973). A generalization of the poisson distribu- tion. Technometrics. [49] Conway, R. and Maxwell, W. (1962). A queuing model with state dependent service rates. Journal of Industrial Engineering. [50] Cressie, N. and Huang, H. (1999). Classes of nonseparable, spatio-temporal stationary covariance functions. JASA. [51] Cressie, N.andJohannesson, G.(2008). Fixedrankkrigingforverylargespatial data sets. JRSS B (Statistical Methodology), 70(1):209–226. [52] Cressie, N., Shi, T., and Kang, E. (2010). Fixed rank filtering for spatio- temporal data. J. Comp. Graph. Stat. [53] Cressie, N. and Wikle, C. (2011). Statistics for spatio-temporal data. John Wiley & Sons. [54] Cuevas-Tello, J. C., Tino, P., Raychaudhury, S., Yao, X., and Harva, M. (2009). Uncovering delayed patterns in noisy and irregularly sampled time series: an astronomy application. Pattern Recognition. 192 [55] Cuturi, M. (2011). Fast global alignment kernels. In ICML, pages 929–936. [56] Cuturi, M., Vert, J.-P., Birkenes, Ø., and Matsui, T. (2007). A kernel for time series based on global alignments. In ICASSP, volume 2, pages II–413. IEEE. [57] Dahl, G. E., Yu, D., Deng, L., and Acero, A. (2012). Context-dependent pre- trained deep neural networks for large-vocabulary speech recognition. Audio, Speech, and Language Processing, IEEE Transactions on, 20(1):30–42. [58] De Mulder, W., Bethard, S., and Moens, M.-F. (2015). A survey on the appli- cation of recurrent neural networks to statistical language modeling. Computer Speech & Language, 30(1):61–98. [59] Delaigle, A., Hall, P., and Bathia, N. (2012). Componentwise classification and clustering of functional data. 
Biometrika, 99(2):299–313. [60] Demiralp, S. and Hoover, K. D. (2003). Searching for the causal structure of a vector autoregression. Oxford Bulletin of Economics and Statistics. [61] Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the em algorithm. J. R. Stat. Soc. Series B. [62] Diggle, P. J. (2013). Statistical analysis of spatial and spatio-temporal point patterns. CRC Press. [63] Doucet, A. and Johansen, A. M. (2009). A tutorial on particle filtering and smoothing: Fifteen years later. The Oxford Handbook of Nonlinear Filtering. [64] Dyer, E. L., Sankaranarayanan, A. C., and Baraniuk, R. G. (2013). Greedy feature selection for subspace clustering. JMLR, 14:2487–2517. [65] Eads, D., Glocer, K., Perkins, S., and Theiler, J. (2005). Grammar-guided feature extraction for time series classification. In NIPS. [66] Efron, B. (1986). Double exponential families and their use in generalized linear regression. JASA. [67] Eichler, M. (2005). A graphical approach for evaluating effective connectivity in neural systems. Philosophical Transactions of the Royal Society B: Biological Sciences, 360(1457):953–967. [68] Eichler, M. (2007). Granger causality and path diagrams for multivariate time series. Journal of Econometrics. [69] Eichler, M. (2012). Graphical modelling of multivariate time series. Probability Theory and Related Fields. 193 [70] Eichler, M. and Didelez, V. (2010). On granger causality and the effect of interventions in time series. Lifetime Data Analysis. [71] Elhamifar, E. and Vidal, R. (2009). Sparse subspace clustering. In CVPR. [72] Elhamifar, E. and Vidal, R. (2013). Sparse subspace clustering: algorithm, theory, and applications. PAMI, 35(11):2765–81. [73] Embrechts, P., Mcneil, A., and Straumann, D. (2002). Correlation and depen- dence in risk management: properties and pitfalls. In Risk Management: Value at Risk and Beyond. Cambridge University Press. [74] Engle, R. F. and Bollerslev, T. (1986). Modelling the persistence of conditional variances. Econometric reviews, 5(1):1–50. [75] Evans, M., Hastings, N., and Peacock, B. (2000). Statistical Distributions. Wiley-Interscience. [76] Faloutsos, C., Ranganathan, M., and Manolopoulos, Y. (1994). Fast subse- quence matching in time-series databases, volume 23. ACM. [77] Ferraty, F. and Romain, Y. (2011). The Oxford Handbook of Functional Data Analysis. Oxford University Press. [78] Ferraty, F. and Vieu, P. (2006). Nonparametric functional data analysis: theory and practice. Springer. [79] Ferro, C. A. T. and Segers, J. (2003). Inference for clusters of extreme values. Journal of the Royal Statistical Society: Series B (Statistical Methodology). [80] Foster, G. (1996). Wavelets for period analysis of unevenly sampled time series. The Astronomical Journal. [81] Gabay, D.andMercier, B.(1976). Adualalgorithmforthesolutionofnonlinear variational problems via finite element approximation. Computers & Mathematics with Applications, 2(1):17–40. [82] Gaffney, S. J. and Smyth, P. (2004). Joint probabilistic curve clustering and alignment. In NIPS. [83] Gandy, S., Recht, B., andYamada, I.(2011). Tensorcompletionandlow-n-rank tensor recovery via convex optimization. Inverse Problems. [84] Garreau, D., Lajugie, R., Arlot, S., and Bach, F. (2014). Metric learning for temporal sequence alignment. In Advances in Neural Information Processing Systems, pages 1817–1825. 194 [85] Geenens,G.(2011). Curseofdimensionalityandrelatedissuesinnonparametric functional regression. Statistics Surveys, 5. 
[86] Gelfand, A. E., Ecker, M. D., Knight, J. R., and Sirmans, C. (2004). The dynamics of location in home price. The journal of real estate finance and eco- nomics, 29(2):149–166. [87] Geurts, P. (2001). Pattern extraction for time series classification. In Principles of Data Mining and Knowledge Discovery, pages 115–127. Springer. [88] Ghahramani, Z. (2015). Probabilistic machine learning and artificial intelli- gence. Nature, 521(7553):452–459. [89] Giambelluca, T. W., Chen, Q., Frazier, A. G., Price, J. P., Chen, Y.-L., Chu, P.-S., Eischeid, J. K., and Delparte, D. M. (2013). Online rainfall atlas of hawai’i. Bulletin of the American Meteorological Society, 94(3):313–316. [90] Granger, C. W. J. (1969). Investigating Causal Relations by Econometric Mod- els and Cross-spectral Methods. Econometrica. [91] Granger, C. W. J. (1980). Testing for causality: A personal viewpoint. Journal of Economic Dynamics and Control. [92] Graves, A., Fernández, S., Gomez, F., andSchmidhuber, J.(2006). Connection- ist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd international conference on Machine learning, pages 369–376. ACM. [93] Graves, A., Mohamed, A.-r., and Hinton, G. (2013). Speech recognition with deep recurrent neural networks. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pages 6645–6649. IEEE. [94] Grimmett, G. R. and Stirzaker, D. R. (2001). Probability and Random Pro- cesses. Oxford University Press. [95] Gu, M., Ruhe, A., Sleijpen, G., van der Vorst, H., Bai, Z., and Li, R. (2000). 5. Generalized Hermitian Eigenvalue Problems. Society for Industrial and Applied Mathematics. [96] Gumbel, E. J. (1941). The return period of flood flows. The Annals of Mathe- matical Statistics, 12(2):163–190. [97] Guo, S., Seth, A. K., Kendrick, K. M., Zhou, C., and Feng, J. (2008). Partial granger causalityâĂŤeliminating exogenous inputs and latent variables. Journal of neuroscience methods, 172(1):79–93. 195 [98] Hall, P. (2011). Principal component analysis for functional data: methodol- ogy, theory and discussion. In The Oxford handbook of functional data analysis, chapter 8. [99] Hall, P. and Hosseini-Nasab, M. (2006). On properties of functional principal components analysis. JRSS-B. [100] Hall, P. and Hosseini-Nasab, M. (2009). Theory for high-order bounds in functional principal components analysis. Math Proc Cambridge. [101] Härdle, W. and Vieu, P. (1992). Kernel regression smoothing of time series. Journal of Time Series Analysis, 13(3):209–232. [102] Harmeling, S. and Williams, C. K. I. (2011). Greedy learning of binary latent trees. IEEE transactions on pattern analysis and machine intelligence. [103] Harteveld, W.K., Mudde, R.F., andVanDenAkker, H.E.A.(2005). Estima- tionofturbulencepowerspectraforbubblyflowsfromLaserDopplerAnemometry signals. Chemical Engineering Science. [104] Haykin, S. (2004). Kalman filtering and neural networks, volume 47. John Wiley & Sons. [105] He, K., Zhang, X., Ren, S., and Sun, J. (2014). Spatial pyramid pooling in deep convolutional networks for visual recognition. In Computer Vision–ECCV 2014, pages 346–361. Springer. [106] Heckel, R. and Bölcskei, H. (2013). Robust subspace clustering via threshold- ing. arXiv preprint arXiv:1307.4891. [107] Heckel, R., Tschannen, M., and Bölcskei, H. (2014). Subspace clustering of dimensionality-reduced data. In ISIT. [108] Hiemstra, C. and Jones, J. D. (1994). 
Testing for Linear and Nonlinear Granger Causality in the Stock Price- Volume Relation. The Journal of Finance. [109] Hinton, G., Deng, L., Yu, D., Dahl, G. E., Mohamed, A.-r., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T.N., etal.(2012). Deepneuralnetworks for acoustic modeling in speech recognition: The shared views of four research groups. Signal Processing Magazine, IEEE, 29(6):82–97. [110] Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural computation, 9(8):1735–1780. [111] Holland, P.W.(1986). Statisticsandcausalinference. JASA,81(396):945–960. 196 [112] Hu, C., Henderson, G. M., Huang, J., Xie, S., Sun, Y., and Johnson, K. R. (2008). Quantification of Holocene Asian monsoon rainfall from spatially sepa- rated cave records. Earth and Planetary Science Letters. [113] Huerta, G. and SansÃş, B. (2007). Time-varying models for extreme values. Environmental and Ecological Statistics. [114] Hüsken, M. and Stagge, P. (2003). Recurrent neural networks for time series classification. Neurocomputing, 50:223–235. [115] Hyvarinen, A. and Oja, E. (2000). Independent component analysis: algo- rithms and applications. Neural Networks. [116] Hyvärinen, A., Zhang, K., Shimizu, S., Hoyer, P. O., and Dayan, P. (2010). Estimation of a Structural Vector Autoregression Model Using Non-Gaussianity. Journal of Machine Learning Research. [117] Imbens, G. W. and Rubin, D. B. (2012). Causal Inference in Statistics and Social Sciences. Cambridge University Press. [118] IPCC(2007) (2007). Climate change 2007 - The physical science basis working group I contribution to the fourth assessment report of the ipcc intergovernmental panel on climate change. Technical report. [119] Isaaks, E. and Srivastava, R. (2011). Applied geostatistics. London: Oxford University. [120] Jacques, J. and Preda, C. (2014). Functional data clustering: a survey. Advances in Data Analysis and Classification. [121] Jaggi, M. (2013). Revisiting Frank-Wolfe: Projection-free sparse convex opti- mization. In ICML. [122] Jaggi, M. and Sulovsky, M. (2010). A simple algorithm for nuclear norm regularized problems. In ICML. [123] Jalali, A. and Sanghavi, S. (2011). Learning the Dependence Graph of Time Series with Latent Factors. In ICML. [124] Jalali, A., Sanghavi, S., Ruan, C., and Ravikumar, P. (2010). A dirty model for multi-task learning. In NIPS. [125] James, G. M. and Hastie, T. J. (2001). Functional linear discriminant analysis for irregularly sampled curves. JRSS-B, 63(3):533–550. 197 [126] James, G. M. and Sugar, C. A. (2003). Clustering for sparsely sampled func- tional data. JASA. [127] Jebara, T., Song, Y., and Thadani, K. (2007). Spectral clustering and embed- ding with hidden markov models. In ECML, pages 164–175. [128] Joachims, T. (1998). Text categorization with suport vector machines: Learn- ing with many relevant features. In Proceedings of the 10th European Conference on Machine Learning, pages 137–142. Springer-Verlag. [129] Jutten, C. and Herault, J. (1991). Blind separation of sources, part i: An adaptive algorithm based on neuromimetic architecture. Signal Processing. [130] Kaiser, A. (2002). Information transfer in continuous processes. Physica D. [131] Kawai, R. (2009). A multivariate Lévy process model with linear correlation. Quantitative Finance. [132] Keogh, E. and Kasetty, S. (2003). On the need for time series data mining benchmarks: a survey and empirical demonstration. Data Mining and knowledge discovery, 7(4):349–371. [133] Kim, G., F-F, L., and Xing, E. P. (2012). 
Web Image Prediction Using Mul- tivariate Point Processes. In KDD. [134] Kim, S. and Smyth, P. (2006). Segmental hidden markov models with random effects for waveform modeling. JMLR, 7:945–969. [135] Kitagawa, G. (1987). Non-gaussian stateâĂŤspace modeling of nonstationary time series. Journal of the American statistical association, 82(400):1032–1041. [136] Kolda, T. and Bader, B. (2009). Tensor decompositions and applications. SIAM review. [137] Kreindler, D. M. and Lumsden, C. J. (2006). The effects of the irregular sample and missing data in time series analysis. Nonlinear dynamics, psychology, and life sciences. [138] Kriegel, H.-P., Kröger, P., and Zimek, A. (2012). Subspace clustering. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery. [139] Lapedes, A. and Farber, R. (1987). Nonlinear signal processing using neural networks: Prediction and system modelling. Technical report. [140] LeCun, Y. and Bengio, Y. (1995). Convolutional networks for images, speech, and time series. The handbook of brain theory and neural networks, 3361(10). 198 [141] Leong, Y. K. and Valdez, E. A. (2005). Using, Claims Prediction with Depen- dence Models, Copula. Insurance: Mathematics and Economics. [142] Li, T., Ma, S., and Ogihara, M. (2010). Wavelet methods in data mining. In Data Mining and Knowledge Discovery Handbook, pages 553–571. Springer. [143] Li, W. (1994). Time series models based on generalized linear models: some further results. Biometrics, pages 506–511. [144] Li, W.-J. and Yeung, D.-Y. (2009). Relation regularized matrix factorization. In IJCAI. [145] Lin, J., Keogh, E., Wei, L., and Lonardi, S. (2007). Experiencing sax: a novel symbolic representation of time series. Data Mining and Knowledge Discovery, 15(2):107–144. [146] Lin, J., Williamson, S., Borne, K., and DeBarr, D. (2012). Pattern recognition in time series. Advances in Machine Learning and Data Mining for Astronomy, 1:617–645. [147] Liu, H., Lafferty, J. D., and Wasserman, L. A. (2009). The nonparanormal: Semiparametric estimation of high dimensional undirected graphs. Journal of Machine Learning Research. [148] Long, X., Jin, L., and Joshi, J. (2012). Exploring trajectory-driven local geographic topics in foursquare. In UbiComp. [149] Lozano, A., Li, H., Niculescu-Mizil, A., Liu, Y., Perlich, C., Hosking, J., and Abe, N.(2009a). Spatial-temporalcausalmodelingforclimatechangeattribution. In KDD. [150] Lozano, A. C., Abe, N., Liu, Y., and Rosset, S. (2009b). Grouped graphical Granger modeling for gene expression regulatory networks discovery. Bioinfor- matics (Oxford, England). [151] Lütkepohl, H. (2007). New introduction to multiple time series analysis. Springer. [152] Marinazzo, D., Pellicoro, M., and Stramaglia, S. (2008). Kernel Granger causality and the analysis of dynamical networks. Physical review. E, Statisti- cal, nonlinear, and soft matter physics. [153] Meinshausen, N. and Bühlmann, P. (2006). High-Dimensional Graphs and Variable Selection with the Lasso. The Annals of Statistics. 199 [154] Meinshausen, N. and Yu, B. (2009). Lasso-type recovery of sparse representa- tions for high-dimensional data. The Annals of Statistics. [155] Mohamed, A.-r., Dahl, G. E., and Hinton, G. (2012). Acoustic modeling using deepbeliefnetworks. Audio, Speech, and Language Processing, IEEE Transactions on, 20(1):14–22. [156] Mohamed, A.-r., Sainath, T. N., Dahl, G., Ramabhadran, B., Hinton, G. E., Picheny, M., et al. (2011). Deep belief networks using discriminative features for phone recognition. 
In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, pages 5060–5063. IEEE. [157] Mondal, D. and Percival, D. B. (2008). Wavelet variance analysis for gappy time series. Annals of the Institute of Statistical Mathematics. [158] Moneta, A. and Spirtes, P. (2006). Graphical models for the identification of causal structures in multivariate time series models. In JCIS. [159] Müller, H.-G. (2011). Functional data analysis. In International Encyclopedia of Statistical Science, pages 554–555. Springer. [160] Müller, M. (2007). Information retrieval for music and motion, volume 2. Springer. [161] Murphy, K. P. (1998). Switching kalman filters. [162] Murphy, K. P. (2012). Machine learning: a probabilistic perspective. MIT Press. [163] Myers, S. A., Zhu, C., and Leskovec, J. (2012). Information diffusion and external influence in networks. In KDD. [164] Narang, S. K., Gadde, A., and Ortega, A. (2013). Introduction to Statistical Learning Theory. In International Conference on Acoustics, Speech, and Signal Processing (ICASSP). [165] Negahban, S. and Wainwright, M. J. (2010). Estimation of (near) low-rank matrices with noise and high-dimensional scaling. In ICML. [166] Nelder, J. A. and Wedderburn, R. W. M. (1972). Generalized linear models. J. R. Statist. Soc. A. [167] Nesterov, Y. (2007). Gradient methods for minimizing composite objective function. Core discussion papers, Universite catholique de Louvain, Center for Operations Research and Econometrics (CORE). 200 [168] Ng, A. Y., Jordan, M. I., Weiss, Y., et al. (2002). On spectral clustering: Analysis and an algorithm. NIPS. [169] Nie, F., Huang, H., Cai, X., and Ding, C. H. (2010). Efficient and robust feature selection via joint ` 2,1 -norms minimization. In NIPS. [170] Panchenko, C. D. and Valentyn (2004). Modified hiemstra-jones test for granger non-causality. Technical report, Society for Computational Economics. [171] Park, D., Caramanis, C., and Sanghavi, S. (2014). Greedy subspace clustering. In NIPS. [172] Paxton, C., Niculescu-Mizil, A., and Saria, S. (2013). Developing predictive models using electronic medical records: challenges and pitfalls. In AMIA. [173] Pearl, J. (2009). Causality: Models, Reasning and Inference. Cambridge University Press. [174] Pesenson, I.(2008). SamplinginPaley-Wienerspacesoncombinatorialgraphs. Transactions of the American Mathematical Society. [175] Peters, J., Janzing, D., and Schölkopf, B. (2013). Causal inference on time series using restricted structural equation models. In NIPS, pages 154–162. [176] Petitjean, F., Forestier, G., Webb, G. I., Nicholson, A. E., Chen, Y., and Keogh, E. (2014). Dynamic time warping averaging of time series allows faster and more accurate classification. In ICDM. [177] Powers, D. (2011). Evaluation: From Precision, Recall and F-Measure to ROC, Informedness, Markedness & Correlation. Journal of Machine Learning Technologies. [178] Rakthanmanon, T., Campana, B., Mueen, A., Batista, G., Westover, B., Zhu, Q., Zakaria, J., and Keogh, E. (2012). Searching and mining trillions of time series subsequences under dynamic time warping. In KDD. [179] Ramsey, J. D., Hanson, S. J., Hanson, C., Halchenko, Y. O., Poldrack, R. A., andGlymour, C.(2010). Sixproblemsforcausalinferencefromfmri. Neuroimage, 49(2):1545–1558. [180] Ratanamahatana, C. A., Lin, J., Gunopulos, D., Keogh, E., Vlachos, M., and Das, G.(2010). Miningtimeseriesdata. InData Mining and Knowledge Discovery Handbook, pages 1049–1077. Springer. 
201 [181] Recht, B., Fazel, M., and Parrilo, P. A. (2010). Guaranteed Minimum-Rank Solutions of Linear Matrix Equations via Nuclear Norm Minimization. SIAM Review. [182] Rehfeld, K., Marwan, N., Heitzig, J., and Kurths, J. (2011). Comparison of correlation analysis techniques for irregularly sampled time series. Nonlinear Processes in Geophysics. [183] Robins, J. (2004). Optimal structural nested models for optimal sequential decisions. In Lin, D. and Heagerty, P., editors, Proceedings of the Second Seattle Symposium in Biostatistics, Lecture Notes in Statistics. Springer New York. [184] Rohde, A. and Tsybakov, A. B. (2011). Estimation of high-dimensional low- rank matrices. Ann. Stat. [185] Romera-Paredes, B., Aung, H., Bianchi-Berthouze, N., and Pontil, M. (2013). Multilinear multitask learning. In ICML. [186] Rubin, D. B. (2005). Causal inference using potential outcomes. JASA, 100(469). [187] Rudelson, M. and Vershynin, R. (2010). Non-asymptotic theory of random matrices: extreme singular values. [188] Sahu, S. K. and Mardia, K. V. (2005). Recent trends in modeling spatio- temporal data. In Proceedings of the special meeting on Statistics and Environ- ment, pages 69–83. [189] Sakoe, H. and Chiba, S. (1978). Dynamic programming algorithm optimiza- tion for spoken word recognition. Acoustics, Speech and Signal Processing, IEEE Transactions on, 26(1):43–49. [190] Saria, S., Duchi, A., and Koller, D. (2011). Discovering deformable motifs in continuous time series data. In IJCAI. [191] Scargle, J. D. (1981). Studies in astronomical time series analysis. I - Modeling random processes in the time domain. The Astrophysical Journal Supplement Series. [192] Schmidhuber, J. (2015). Deep learning in neural networks: An overview. Neural Networks, 61:85–117. [193] Schölkopf, B. and Smola, A. J. (2002). Learning with kernels: Support vector machines, regularization, optimization, and beyond. MIT press. 202 [194] Schreiber, T. (2000). Measuring Information Transfer. Physical Review Let- ters. [195] Schulz, M. and Stattegger, K. (1997). Spectrum: spectral analysis of unevenly spaced paleoclimatic time series. Computers & Geosciences. [196] Sellers, K. F. and Shmueli, G. (2010). A flexible regression model for count data. Ann. Appl. Stat. [197] Shalev-Shwartz, S., Gonen, A., and Shamir, O. (2011). Large-scale convex minimization with a low-rank constraint. In ICML. [198] Shalev-Shwartz, S., Srebro, N., and Zhang, T. (2010). Trading Accuracy for Sparsity in Optimization Problems with Sparsity Constraints. SIAM Journal on Optimization. [199] Shang, H. L. (2013). A survey of functional principal component analysis. AStA Advances in Statistical Analysis. [200] Sherman, M. (2011). Spatial statistics and spatio-temporal data: covariance functions and directional properties. John Wiley & Sons. [201] Shmueli, G., Minka, T. P., Kadane, J. B., Borle, S., andBoatwright, P. (2005). A useful distribution for fitting discrete data: Revival of the conway-maxwell- poisson distribution. J. R. Stat. Soc. S. C. [202] Silva, I., Moody, G., Scott, D. J., Celi, L. A., and Mark, R. G. (2012). Predict- ing in-hospital mortality of icu patients: The physionet/computing in cardiology challenge 2012. Computing in cardiology. [203] Smith, R. L. (2003). Spatial-Temporal Models. Accessed: 2015-06-13. [204] Solomon, S., D. Q. M. M. Z. C. M. M. K. A. M. T. and Miller, H., editors (2007). Contribution of Working Group I to the Fourth Assessment Report of the Intergovernmental Panel on Climate Change. Cambridge University Press. 
[205] Soltanolkotabi, M., Elhamifar, E., and Candes, E. J. (2014). Robust subspace clustering. The Annals of Statistics, 42(2):669–699. [206] Song, S. and Bickel, P. J. (2011). Large vector auto regressions. arXiv preprint arXiv:1106.3915. [207] Spirtes, P., Glymour, C., and Scheines, R. (2001). Causation, Prediction, and Search, Second Edition. The MIT Press. 203 [208] Stoica, P., Babu, P., and Li, J. (2011). New Method of Sparse Parameter Estimation in Separable Models and Its Use for Spectral Analysis of Irregularly Sampled Data. IEEE Transactions on Signal Processing. [209] Stoica, P., Li, J., and He, H. (2009). Spectral Analysis of Nonuniformly Sam- pled Data: A New Approach Versus the Periodogram. IEEE Transactions on Signal Processing. [210] Swanson, N. R. and Granger, C. W. J. (1997). Impulse response functions basedonacausalapproachtoresidualorthogonalizationinvectorautoregressions. JASA. [211] Sweldens, W. (1998). The Lifting Scheme: A Construction of Second Gener- ation Wavelets. SIAM Journal on Mathematical Analysis. [212] Taskaya-Temizel, T. and Casey, M. C. (2005). A comparative study of autore- gressive neural network hybrids. Neural Networks, 18(5):781–789. [213] Tewari, A., Ravikumar, P. D., and Dhillon, I. S. (2011). Greedy algorithms for structurally constrained high dimensional problems. In NIPS. [214] Tibshirani, R., Johnstone, I., Hastie, T., and Efron, B. (2004). Least angle regression. The Annals of Statistics. [215] Tibshirani, R., Saunders, M., Rosset, S., Zhu, J., and Knight, K. (2005). Sparsity and smoothness via the fused lasso. JRSS-B. [216] Tibshirani, R., Walther, G., and Hastie, T. (2001). Estimating the number of clusters in a data set via the gap statistic. JRSS-B, 63(2):411–423. [217] Toh, K. and Yun, S. (2010). An accelerated proximal gradient algorithm for nuclear norm regularized least squares problems. Pacific J. Optim. [218] Tomioka, R., Hayashi, K., and Kashima, H. (2013). Convex Tensor Decom- position via Structured Schatten Norm Regularization. NIPS. [219] Toulemonde, G., Guillou, A., Naveau, P., Vrac, M., and Chevallier, F. (2010). Autoregressive models for maxima and their applications to CH4 and N2O. Envi- ronmetrics. [220] Truccolo, W., Eden, U. T., Fellows, M. R., Donoghue, J. P., and Brown, E. N. (2005). A point process framework for relating neural spiking activity to spiking history, neural ensemble, and extrinsic covariate effects. Journal of neurophysiology. 204 [221] Tsamardinos, I., Aliferis, C.F., andStatnikov, E.(2003). AlgorithmsforLarge Scale Markov Blanket Discovery. In FLAIRS. [222] Valdés-Sosa, P. A., Sánchez-Bornot, J. M., Lage-Castellanos, A., Vega- Hernández, M., Bosch-Bayard, J., Melie-García, L., and Canales-Rodríguez, E. (2005). Estimating brain functional connectivity with sparse multivariate autore- gression. Philosophical transactions of the Royal Society of London. Series B, Biological sciences. [223] Vershynin, R. (2011). Introduction to the non-asymptotic analysis of random matrices. [224] Vidal, R. (2011). Subspace Clustering. IEEE Signal Processing Magazine. [225] Vieu, P. (1995). Order choice in nonlinear autoregressive models. Statistics: A Journal of Theoretical and Applied Statistics, 26(4):307–328. [226] Vintsyuk, T. (1968). Speech discrimination by dynamic programming. Cyber- netics, 4(1):52–57. [227] Von Luxburg, U. (2007). A tutorial on spectral clustering. Statistics and computing, 17(4):395–416. [228] Vu, D. Q., Asuncion, A. U., Hunter, D. R., and Smyth, P. (2011). 
Continuous- Time Regression Models for Longitudinal Networks. In NIPS. [229] Wang, X., Ray, S., and Mallick, B. (2007). Bayesian curve classification using wavelets. JASA. [230] Wang, Y., Cheng, H., Edwards, R. L., He, Y., Kong, X., An, Z., Wu, J., Kelly, M. J., Dykoski, C. A., and Li, X. (2005). The Holocene Asian monsoon: links to solar changes and North Atlantic climate. Science (New York, N.Y.). [231] Warren Liao, T. (2005). Clustering of time series data—a survey. Pattern recognition, 38(11):1857–1874. [232] Wasserman, L. (2005). All of Nonparametric Statistics (Springer Texts in Statistics). Springer. [233] Wikle, C. K., Milliff, R. F., Nychka, D., and Berliner, L. M. (2001). Spa- tiotemporal hierarchical bayesian modeling tropical ocean surface winds. Journal of the American Statistical Association, 96(454):382–397. [234] Williams, C. K. and Rasmussen, C. E. (2006). Gaussian processes for machine learning. the MIT Press, 2(3):4. 205 [235] Wood, S. N. (2004). Stable and efficient multiple smoothing parameter esti- mation for generalized additive models. JASA, 99(467). [236] Wood, S. N. (2008). Fast stable direct fitting and smoothness selection for generalized additive models. JRSSB, 70(3):495–518. [237] Wu, L., Faloutsos, C., Sycara, K. P., and Payne, T. R. (2000). Falcon: Feed- back adaptive loop for content-based retrieval. VLDB. [238] Wu, T. T. and Lange, K. (2008). Coordinate descent algorithms for lasso penalized regression. The Annals of Applied Statistics. [239] Xing, Z., Pei, J., and Keogh, E. (2010). A brief survey on sequence classifica- tion. ACM SIGKDD Explorations Newsletter, 12(1):40–48. [240] Yuan, X.-T. and Li, P. (2014). Sparse additive subspace clustering. In ECCV. [241] Zhang, P., Cheng, H., Edwards, R. L., Chen, F., Wang, Y., Yang, X., Liu, J., Tan, M., Wang, X., Liu, J., An, C., Dai, Z., Zhou, J., Zhang, D., Jia, J., Jin, L., and Johnson, K. R. (2008). A test of climate, sun, and culture relationships from an 1810-year Chinese cave record. Science (New York, N.Y.). [242] Zhang,T.(2011). AdaptiveForward-BackwardGreedyAlgorithmforLearning Sparse Representations. IEEE Trans Inf Theory, pages 4689–4708. [243] Zhou, D., Bousquet, O., Lal, T., Weston, J., and Schölkopf, B. (2003). Learn- ing with local and global consistency. In NIPS. [244] Zhou, H., Li, L., and Zhu, H. (2013). Tensor regression with applications in neuroimaging data analysis. JASA. [245] Zhou, J., Chen, J., and Ye, J. (2011). MALSAR: Multi-tAsk Learning via StructurAl Regularization. http://www.public.asu.edu/~jye02/Software/ MALSAR/. [246] Zhu, F. (2012). Modeling time series of counts with com-poisson ingarch models. Mathematical and Computer Modelling. [247] Zhu, Y. and Shasha, D. (2003). Warping indexes with envelope transforms for query by humming. SIGMOD. 206