SPATIOTEMPORAL PREDICTION WITH DEEP LEARNING ON GRAPHS

by

Yaguang Li

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(DEPARTMENT OF COMPUTER SCIENCE)

August 2019

Copyright 2019 Yaguang Li

Acknowledgments

The last five years have been an unforgettable and invaluable experience for me. I could not have made this journey without the help and support of dozens of remarkable individuals, to whom I feel deeply indebted.

First and foremost, I would like to convey my deep gratitude and respect for my advisors, Prof. Cyrus Shahabi and Prof. Yan Liu, for their great guidance, encouragement and patience. Being co-supervised is a quite unique experience; Cyrus and Yan have made this a truly harmonious and rewarding journey for me. I am grateful for being exposed to different research fields and experiencing different supervision styles. They offered me the freedom to explore various research directions, while always being available to help whenever I needed it. Over the years, they have been a constant source of inspiration and support, and their advice on both research and career has been essential.

I would like to extend my sincere thanks to my dissertation committee, Prof. Antonio Ortega, Prof. Xiang Ren and Prof. Joseph Lim, for providing constructive input and suggestions for my research. I would also like to thank Prof. Shanghua Teng for his insightful discussions and valuable suggestions.

Being a member of both the InfoLab and the Melady Lab, I have been fortunate enough to share the Ph.D. experience with a large group of talented and gracious labmates. I would like to thank each and every one in the labs:

• Dr. Dingxiong Deng, Dr. Ugur Demiryurek, Dr. Han Su, Mingxuan Yue, Dr. Hien To, Dr. Ying Lu, Dr. Mohammad Asghari, Ritesh Ahuja, Jingyun Yang, Kien Nguyen, Luan Tran, Minh Nguyen, Tian Xie, Chaoyang He, Dimitris Stripelis, Abdullah Alfarrarjeh, Giorgos Constantinou and Chrysovalantis Anastasiou from the InfoLab;

• Dr. Qi (Rose) Yu, Dr. Dehua Cheng, Chuizheng Meng, Hanpeng Liu, Michael Tsang, Dr. Zhengping Che, Dr. Xinran He, Tanachat Nilanon, Dr. Sanjay Purushotham, Sungyong Seo, Nitin Kamra, Karishma Sharma, Umang Gupta, Aastha Dua, He Jiang and Donsuk Lee from the Melady Lab.

In particular, I would like to express my sincere gratitude to my collaborators Qi (Rose) Yu, Dingxiong Deng, Mingxuan Yue, Hanpeng Liu, Chuizheng Meng, Michael Tsang and Sungyong Seo for their great support in this work.

I am grateful to the USC Annenberg graduate fellowship program for supporting my research, and to the wonderful university staff, especially Daisy Tang, Lizsl De Leon, Jack Li and Tracy Charles, for always being so friendly and helpful.

During my Ph.D., I have had wonderful internships at Google, Facebook and DiDi AI Labs. I would like to express my appreciation to my mentors and collaborators there, with a special mention to Dr. Pierre-Antoine Manzagol, Dr. Philippe Beaudoin and Dr. Jing (David) Dai at Google; Dr. Lenny Grokop and Dr. Blake Shaw at Facebook; and Dr. Jieping Ye, Dr. Zheng Wang, Dr. Kun Fu and Xu Geng at DiDi AI Labs.

I thank with love my fiancée, Dr. Siyang Li, whom I am extremely lucky to have met. As a Ph.D. herself, she understands me; she has been my best friend and a great companion who has loved, supported, encouraged, helped and stood by me through the whole journey. Research can sometimes be dull and frustrating, but she makes every minute of my life joyful.
Finally, I would like to extend my deepest gratitude to my family: my father, who encourages me to keep challenging myself and to try new things; my mother, who teaches me integrity and perseverance; and my brother and sister, who provide their unconditional trust, timely encouragement, and endless patience. I consider myself the luckiest person in the world to have such a lovely and caring family.

Contents

Acknowledgments
Contents
List of Tables
List of Figures
Abstract
1 Introduction
  1.1 Motivation and Challenges
  1.2 Summary of Thesis Work
  1.3 Thesis Statement
  1.4 Thesis Outline
  1.5 Related Publications
2 Related Work
  2.1 Traffic Forecasting
    2.1.1 Traffic Forecasting without Spatial Dependency
    2.1.2 Traffic Forecasting with Spatial Dependency
  2.2 Spatiotemporal Prediction in Urban Computing
  2.3 Representation Learning on Graphs
    2.3.1 Neural Networks on Graphs
3 Modeling Non-Euclidean Spatial Correlation: Diffusion Convolutional Recurrent Neural Network for Traffic Forecasting
  3.1 Problem Statement
  3.2 Methodology
    3.2.1 Spatial Dependency Modeling
    3.2.2 Temporal Dynamics Modeling
  3.3 Experiments
    3.3.1 Dataset
    3.3.2 Experimental Settings
    3.3.3 Traffic Forecasting Performance Comparison
    3.3.4 Effect of Spatial Dependency Modeling
    3.3.5 Effect of Temporal Dependency Modeling
    3.3.6 Model Interpretation
4 Modeling Multi-modal Spatial Correlation: Spatiotemporal Multi-Graph Convolution Network for Ride-hailing Demand Forecasting
  4.1 Preliminary
  4.2 Methodology
    4.2.1 Region-level Ride-hailing Demand Forecasting
    4.2.2 Spatial Dependency Modeling
    4.2.3 Temporal Correlation Modeling
  4.3 Experiments
    4.3.1 Experimental Settings
    4.3.2 Performance Comparison
    4.3.3 Effect of Spatial Dependency Modeling
    4.3.4 Effect of Temporal Dependency Modeling
    4.3.5 Effect of Model Parameters
5 Inferring Spatial Correlation: Structure-informed Graph Auto-encoder for Relational Inference and Simulation
  5.1 Related Work
  5.2 Methodology
    5.2.1 Encoder
    5.2.2 Decoder
    5.2.3 Incorporating Structural Prior Knowledge
  5.3 Experiment
    5.3.1 Datasets
    5.3.2 Experimental Settings
    5.3.3 Simulation Performance
    5.3.4 Interaction Recovery
    5.3.5 Model Interpretation
    5.3.6 Ablation Study
6 Modeling Spatial Correlation via Representation Learning: an Application for Travel Time Estimation
  6.1 Related Work
    6.1.1 Travel Time Estimation
    6.1.2 Representation Learning
  6.2 Methodology
    6.2.1 Problem Statement
    6.2.2 Representation Learning for Road Network
    6.2.3 Spatiotemporal Representation Learning
    6.2.4 Multi-task Representation Learning
  6.3 Experiments
    6.3.1 Dataset
    6.3.2 Experimental Settings
    6.3.3 Performance Comparison
    6.3.4 Effect of Link Embedding
    6.3.5 Effect of Spatiotemporal Embedding
    6.3.6 Effect of Multi-task Learning
    6.3.7 Effect of Model Architecture
    6.3.8 Model Interpretation
7 Conclusion and Future Work
  7.1 Summary of the Research
  7.2 Future Work
Reference List

List of Tables

3.1 Notation
3.2 Performance comparison of different approaches for traffic speed forecasting. DCRNN achieves the best performance on all three metrics for all forecasting horizons, and the advantage becomes more evident as the forecasting horizon increases.
3.3 Performance comparison for DCRNN and GCRNN on the METR-LA dataset.
4.1 Dataset split.
4.2 Performance comparison of different approaches for ride-hailing demand forecasting. ST-MGCN achieves the best performance on all metrics on both datasets.
4.3 Effect of spatial correlation modeling on the Beijing dataset. Removing any component results in a statistically significant error increase.
4.4 Effect of adding the multi-graph design to existing methodologies on the Beijing dataset. Adding an extra graph to the original model results in a statistically significant error decrease.
4.5 Effect of temporal correlation modeling on the Beijing dataset.
5.1 Notation
5.2 Simulation performance w.r.t. MSE
5.3 Simulation performance w.r.t. MAE
5.4 Simulation performance w.r.t. MAPE
5.5 Simulation performance w.r.t. SMAPE
5.6 Interaction recovery performance
5.7 Effect of the node degree constraint on the Mass dataset.
6.1 Datasets used in the experiments.
6.2 Performance comparison of evaluated approaches.
6.3 Performance comparison of approaches with different link representations.
6.4 Performance comparison of approaches with different spatiotemporal representations.
6.5 Effect of multi-task learning.

List of Figures

1.1 Example applications of spatiotemporal prediction.
1.2 Spatial correlation is non-Euclidean and dominated by road network structure. (1) Traffic speeds at sensor 1 are similar to those at sensor 2, as they are located on the same highway. (2) Sensor 1 and sensor 3 are located on opposite directions of the highway. Though close to each other in Euclidean space, their road network distance is large, and their traffic speeds differ significantly.
1.3 An example of multimodal correlations among regions for spatiotemporal demand forecasting. To predict the demand in region 1, the spatially adjacent region 2, the functionally similar region 3 and the transportation-connected region 4 are considered more important, while the distant and irrelevant region 5 is less relevant.
3.1 An example diffusion filter consisting of K diffusion steps.
3.2 System architecture of the Diffusion Convolutional Recurrent Neural Network designed for spatiotemporal traffic forecasting. The historical time series are fed into an encoder whose final states are used to initialize the decoder. The decoder makes predictions based on either previous ground truth or the model output.
3.3 Sensor distribution of the METR-LA and PEMS-BAY datasets.
3.4 Learning curves for DCRNN and DCRNN without diffusion convolution. Removing diffusion convolution results in much higher validation error. Moreover, DCRNN with bidirectional random walk achieves the lowest validation error.
3.5 Effects of K and the number of units in each layer of DCRNN. K corresponds to the reception field width of the filter, and the number of units corresponds to the number of filters.
3.6 Performance comparison for different DCRNN variants. DCRNN, with the sequence-to-sequence framework and scheduled sampling, achieves the lowest MAE on the validation dataset. The advantage becomes more clear with the increase of the forecasting horizon.
3.7 Traffic time series forecasting visualization. DCRNN generates smooth predictions and is usually better at predicting the start and end of peak hours.
3.8 Visualization of learned localized filters centered at different nodes with K = 3 on the METR-LA dataset. The star denotes the center, and the colors represent the weights. We observe that weights are localized around the center and diffuse along the road network.
4.1 System architecture of the proposed spatiotemporal multi-graph convolution network (ST-MGCN). We encode different aspects of relationships among regions, including neighborhood, functional similarity and transportation connectivity, using multiple graphs. First, the proposed contextual gated recurrent neural network (CGRNN) is used to aggregate observations at different times considering the global contextual information. After that, multi-graph convolution is used to model the non-Euclidean correlations among regions.
4.2 An example of the ChebNet graph convolution centered at the black vertex. Left: the central region is marked black; the one-hop neighbors are marked yellow, while the two-hop neighbors are marked red. Middle: as the degree of the graph Laplacian increases, the reception field grows (marked green). Right: the output of this layer is a sum of graph transformations with degree values from 1 to K.
4.3 Temporal correlation modeling with the contextual gated recurrent neural network (CGRNN). It first produces region descriptions using global average pooling over the input and its graph convolution output for each observation. Then it transforms the summarized vector z into weights which are used to scale each observation. Finally, a shared RNN layer across all regions is applied to aggregate the gated input sequence of each region into a single vector.
4.4 Demand heatmaps of the two tested cities.
4.5 Data distribution. The two cities have similar demand patterns, i.e., highly skewed with values concentrated on small values.
4.6 Effect of the number of layers and the polynomial order K of the graph convolution on the Beijing dataset.
5.1 (a) Movement of a chain of connected objects under the gravity field. (b), (c) Incorporating structural prior knowledge helps find the ground truth interactions, and (d) improves simulation performance.
5.2 Model architecture of SUGAR. The encoder takes as input a sequence of observations, x, and estimates the interactions z, while the decoder takes as input the estimated interaction graph and learns the system dynamics to predict the future state. The interaction constraint component calculates the loss function based on specified structural prior knowledge.
5.3 Examples of observations in the experimental datasets
5.4 Simulation performance vs. prediction steps
5.5 Interactions learned on the Mass dataset. NRI usually infers redundant interactions, while SUGAR recovers the ground truth interactions.
5.6 Observation (first row), simulation (black) and ground truth (red) on the Mass dataset
5.7 Effect of the sparsity constraint
6.1 OD-ETA: given an origin, destination and departure time, estimate the duration of the trip using historical trip data as well as the underlying road network.
6.2 The system architecture of the proposed multi-task representation learning model (MURAT) for travel time estimation. The model first embeds the raw link information and spatiotemporal information into the learned spaces. Then the learned representations, together with other numerical features, are fed into a deep residual network which is jointly trained using supervised signals from multiple related tasks.
6.3 Distribution of distances between links calculated based on the Deepwalk embedding.
6.4 Graph-based regularization for spatiotemporal dependencies.
6.5 Data statistics. BJS-Pickup contains trips that have shorter durations but a broader spatial distribution than NYC-Trip.
6.6 Performance w.r.t. different trip features on the BJS-Pickup dataset.
6.7 Effect of the ratio of the main task. Incorporating auxiliary tasks helps reduce overfitting.
6.8 Effects of model parameters.
6.9 Visualization of learned temporal representations. (a) The learned representation for time of day has a circular shape, from 00:00 to 24:00, with smooth transitions between adjacent time intervals. (b) Weekends are clearly separated from weekdays; Tuesday, Wednesday and Thursday are quite close to each other, while Monday and Friday, with different traffic patterns, are relatively far away.
6.10 Learned travel time and distance patterns. During peak hours, both the predicted travel time and travel distance increase, as drivers are more likely to take detours to avoid traffic congestion.
7.1 Interpretable prediction.
7.2 Inductive spatiotemporal prediction, where the model is trained on Los Angeles while being tested on another city, e.g., Long Beach.

Abstract

Spatiotemporal data is ubiquitous in our daily life, ranging from climate science, transportation and social media to various dynamical systems. The data is usually collected from a set of correlated objects over time, where objects can be sensors, locations, regions, particles, users, etc. For instance, in the transportation network, road sensors constantly record the traffic data at various correlated locations; in social networks, we observe activity data of correlated users, as indicated by friendships, evolving over time; and in dynamical systems, e.g., physics and climate, we observe the movement of particles interacting with each other. Spatiotemporal prediction aims to model the evolution of a set of correlated objects. It has various applications, ranging from classical subjects such as intelligent transportation systems, climate science, social media and physics simulation to the emerging fields of sustainability, the Internet of Things (IoT) and health-care.

Spatiotemporal prediction is challenging mainly due to the complicated spatial dependencies and temporal dynamics.
In this thesis, we study the following important questions in spatiotemporal prediction, with a special focus on its application in the transportation domain: (1) how to model the complex spatial dependencies among objects, which are usually non-Euclidean and multimodal; (2) how to model the non-linear and non-stationary temporal dynamics for accurate long-term prediction; and (3) how to infer the correlations or interactions among objects when they are neither provided nor can be constructed a priori.

To model the complex spatial dependency, we represent the non-Euclidean pair-wise correlations among objects using directed graphs and then propose the novel diffusion graph convolution, which captures the spatial dependency with bidirectional random walks on the graph. To model the multimodal correlations among objects, we further propose the multi-graph convolution network. To model the non-linear and non-stationary temporal dynamics, we integrate the novel diffusion graph convolution into the recurrent neural network to jointly model the spatial and temporal dependencies. To capture the long-term temporal dependency, we propose to use the sequence-to-sequence architecture with scheduled sampling. To utilize the global contextual information in the temporal correlation modeling, we further propose the contextual gated recurrent neural network, which augments the recurrent neural network with a context-aware gating mechanism to re-weight different historical observations. To infer correlations among objects, we propose a structure-informed variational graph auto-encoder based model, which infers explicit interactions considering both the observed movements and structural prior knowledge, e.g., node degree distribution, edge type distribution, and sparsity. We represent the structural prior knowledge as differentiable constraints on the interaction graph and optimize them using gradient-based methods.

We conduct extensive experiments on multiple real-world large-scale datasets for various spatiotemporal prediction tasks, including traffic forecasting, spatiotemporal demand forecasting, travel time estimation, relational inference and simulation. The results show that the proposed models consistently achieve clear improvements over state-of-the-art methods. The proposed models and their variants have been, or are being, deployed in real-world large-scale systems for applications including road traffic speed prediction, Internet traffic forecasting, air quality forecasting, travel time estimation, and spatiotemporal demand forecasting.

Chapter 1
Introduction

1.1 Motivation and Challenges

Spatiotemporal prediction is a crucial task for a learning system that operates in a dynamic environment. It has a wide range of applications, ranging from classical subjects such as intelligent transportation systems, climate science, social media and physics simulation to the emerging fields of sustainability, the Internet of Things (IoT) and health-care. The goal of spatiotemporal prediction is to model the evolution of a set of correlated objects over time, where we aim to predict the future observations of these correlated objects given their historical observations. These correlated objects can be sensors, locations, regions, particles, users, etc.
Figure 1.1: Example applications of spatiotemporal prediction: (a) traffic forecasting, (b) skeleton-based motion analysis (e.g., sitting down, walking, running), (c) climate analysis, (d) simulation in dynamical systems.

For instance, in the transportation network, road sensors constantly record the traffic data at various correlated locations over time; in spatiotemporal demand forecasting, we observe the demand statistics of different regions over time; in social networks, we observe activity data of correlated users, as indicated by friendships, evolving over time; and in dynamical systems, e.g., physics and climate, we observe the movement of particles interacting with each other. Figure 1.1 shows several real-world examples of spatiotemporal prediction, including traffic forecasting, skeleton-based motion analysis, climate analysis and simulation in dynamical systems. These tasks are challenging mainly due to the complex spatiotemporal dependencies.

Non-Euclidean spatial dependency. First, the correlations among spatial objects are usually non-Euclidean. For example, in the traffic forecasting scenario, the traffic sensors on the road network show complex yet unique spatial correlations. Figure 1.2 illustrates an example. Sensor 1 and sensor 2 are correlated, while sensor 1 and sensor 3 are not. Though sensor 1 and sensor 3 are close in Euclidean space, they are located on opposite directions of the highway and thus demonstrate quite different behaviors. Moreover, the future traffic speed is influenced more by the downstream traffic than by the upstream one. This means that the spatial correlations among traffic sensors are non-Euclidean and directional. A similar phenomenon is observed in the case of skeleton-based motion analysis: for example, though the joints representing "the left hand" and "the right foot" are far away from each other, they are usually more correlated than closer ones.

Figure 1.2: Spatial correlation is non-Euclidean and dominated by road network structure. (1) Traffic speeds at sensor 1 are similar to those at sensor 2, as they are located on the same highway. (2) Sensor 1 and sensor 3 are located on opposite directions of the highway. Though close to each other in Euclidean space, their road network distance is large, and their traffic speeds differ significantly.

Figure 1.3: An example of multimodal correlations among regions for spatiotemporal demand forecasting. To predict the demand in region 1, the spatially adjacent region 2, the functionally similar region 3 and the transportation-connected region 4 are considered more important, while the distant and irrelevant region 5 is less relevant.

Multimodal spatial dependency. In addition, there usually exist different types of correlations among spatial objects. Figure 1.3 shows an example of region-level demand forecasting. The demand of a region is usually affected by its spatially adjacent neighbors and is at the same time correlated with distant regions that have a similar contextual environment. For region 1, in addition to the neighboring region 2, it may also correlate with a distant region 3 that shares similar functionality, i.e., they are both near schools and hospitals. Besides, region 1 may also be affected by region 4, which is directly connected to region 1 via a highway.
A similar phenomenon also exists in climate analysis, where the climate of a region is correlated not only with nearby regions but also with regions that share a similar environmental context, e.g., both being near a lake or a mountain.

Inferring spatial dependency. In the cases mentioned above, we assume the spatial correlations are either provided, e.g., imposed by the road network, or can be constructed a priori, e.g., calculated based on environmental similarity. However, in many cases we only have access to the movements of individual objects, rather than the actual underlying correlations/interactions. It is therefore desirable to be able to infer the correlations based on the observed movements, possibly together with some prior knowledge about the underlying interaction structures.

Non-linear, non-stationary temporal dynamics. Besides the spatial correlation, spatiotemporal prediction tasks usually show non-linear, non-stationary temporal dynamics. For example, in the traffic forecasting scenario, recurring incidents such as rush hours or accidents can cause non-stationarity, making it difficult to conduct multiple step ahead forecasting. Traditional time series prediction methods may perform well under normal conditions but usually behave poorly under extreme conditions. For example, the historical average depends solely on the periodic patterns of the traffic flow and thus can hardly be responsive to dynamic changes, while auto-regressive integrated moving average (ARIMA) time series models rely on the stationarity assumption, which is violated in the face of abrupt changes in traffic flow. Moreover, many spatiotemporal prediction tasks show long-term dependency and periodicity, and the prediction at a certain time is usually correlated with various historical observations, e.g., an hour ago, a day ago or even a week ago.

1.2 Summary of Thesis Work

The goal of this thesis is to provide a solution that models the non-Euclidean, multimodal spatial correlations as well as the non-linear, non-stationary temporal dynamics, and thus enables more accurate spatiotemporal prediction. Specifically, we address the above-mentioned challenges with the following approaches.

To capture the non-Euclidean spatial dependency among objects, we represent the pair-wise spatial correlations between spatial objects using a directed graph whose nodes are objects and whose edge weights denote the strength of the correlations between node pairs. We propose the diffusion convolution operation (Chapter 3), which uses the bidirectional graph random walk to capture the spatial dependency. To capture the multimodal correlations, we encode different types of correlations as multiple graphs and introduce a novel multi-graph convolution operation (Chapter 4), a convolution operating on multiple graphs with a learnable aggregation function. In Chapter 6, we further propose a multi-task representation learning framework to implicitly model the spatial correlations among objects.

To infer correlations among objects, we propose the Structure-informed Graph Auto-encoder for Relational inference and simulation (SUGAR, Chapter 5). SUGAR takes the form of a variational auto-encoder, where the discrete latent variables represent the underlying interactions among objects. Both the encoder and the decoder employ a graph network-based architecture, with node, edge, and global features.
The model can incorporate various structural priors, e.g., the node degree distribution, the sparsity, and the edge type distribution, in a differentiable manner.

To model the complicated temporal dynamics, we further integrate diffusion convolution, the sequence-to-sequence architecture and the scheduled sampling technique (Chapter 3). In particular, we replace the matrix multiplications in Gated Recurrent Units (GRU), a simple yet powerful variant of RNNs, with the diffusion convolution, leading to our proposed Diffusion Convolutional Gated Recurrent Unit (DCGRU). In multiple step ahead forecasting, we employ the sequence-to-sequence architecture, where both the encoder and the decoder are recurrent neural networks with DCGRU. Furthermore, to incorporate global contextual information when modeling the temporal correlation, we propose the contextual gated recurrent neural network (CGRNN, Chapter 4). It augments the RNN by learning a gating mechanism, calculated based on the summarized global information, to re-weight observations at different timestamps.

We conduct extensive experiments on multiple real-world large-scale datasets for various spatiotemporal prediction tasks, including traffic forecasting, spatiotemporal demand forecasting, travel time estimation, relational inference and simulation. Experimental results show that the proposed models consistently achieve clear improvements over state-of-the-art methods. Besides, the proposed models and their variants have been, or are being, deployed in real-world large-scale systems. Specifically, DCRNN and its variants have been used for traffic speed forecasting (with LA-Metro), air quality forecasting (Lin et al., 2018) and Internet traffic forecasting (Andreoletti et al., 2019). A variant of MURAT is used for origin-destination travel time estimation in the route planning phase (with Didi Chuxing), while ST-MGCN is being deployed for spatiotemporal demand forecasting (with Didi Chuxing).

1.3 Thesis Statement

By addressing all the key challenges mentioned above, we demonstrate that "jointly modeling both (1) the non-Euclidean, multimodal correlations among spatial objects with novel graph networks, and (2) the non-linear, non-stationary temporal dynamics with augmented recurrent neural networks enables more accurate spatiotemporal prediction."

Here are the meanings of the highlighted terms:

• Non-Euclidean: the correlations between objects are irregular and pair-wise.

• Multimodal: the correlations have multiple types.

• Novel graph networks: the diffusion graph convolution network (Chapter 3) to capture pair-wise correlations; the multi-graph convolution network (Chapter 4) to capture multi-modal pair-wise correlations; the variational graph auto-encoder network (Chapter 5) to infer/learn the correlations.

• Non-linear: the future observations cannot be represented as linear combinations of historical ones.

• Non-stationary: the mean of the observations keeps changing over time.

• Augmented recurrent neural network: the diffusion convolutional recurrent network (DCRNN, Chapter 3) to capture both temporal and spatial correlations; the contextual gated recurrent neural network (CGRNN, Chapter 4) to capture global contextual information when modeling temporal dynamics.

• Accurate: better regression and classification performance, measured by RMSE/MAPE and by precision/recall, respectively.

1.4 Thesis Outline

In Chapter 2, we review the major related work in spatiotemporal prediction, traffic prediction, graph convolution and representation learning.
In Chapter 3, we study the traffic prediction problem, formulate it as a spatiotemporal forecasting problem, and propose the diffusion convolutional recurrent neural network, which captures the spatiotemporal dependencies. Specifically, we use the bidirectional graph random walk to model the spatial dependency and recurrent neural networks to capture the temporal dynamics. We further integrate the encoder-decoder architecture and the scheduled sampling technique to improve the performance for long-term forecasting.

In Chapter 4, we investigate the region-level ride-hailing demand forecasting problem and identify its unique spatiotemporal correlations. We propose a novel deep learning based model, ST-MGCN, which encodes the non-Euclidean correlations among regions using multiple graphs and explicitly captures them using multi-graph convolution. We further augment the recurrent neural network with a contextual gating mechanism to incorporate global contextual information into the temporal modeling procedure.

In Chapter 5, we introduce SUGAR, a variational graph auto-encoder based model which infers interactions and learns the system dynamics, considering both observed movements and structural prior knowledge, e.g., node degree distribution, edge type distribution, and sparsity. The model represents the structural prior knowledge as differentiable constraints on the interaction graph and optimizes them using gradient-based methods.

In Chapter 6, we introduce a novel representation learning solution to model the spatiotemporal dependency and show its application to origin-destination travel time estimation. Specifically, the proposed model, MURAT, learns a representation that effectively captures the underlying road network structure as well as spatiotemporal prior knowledge.

In Chapter 7, we summarize our contributions and limitations. We also discuss potential future work to extend the contributions of this thesis.

1.5 Related Publications

Parts of this thesis have been published in machine learning and data mining conferences. The list includes:

• Related to Chapter 2: Yaguang Li, Cyrus Shahabi. A Brief Overview of Machine Learning Methods for Short-term Traffic Forecasting and Future Directions. ACM SIGSPATIAL Special, 2018

• Related to Chapter 3:

  – Yaguang Li, Rose Yu, Cyrus Shahabi, Yan Liu. Diffusion Convolutional Recurrent Neural Network: Data-Driven Traffic Forecasting. International Conference on Learning Representations (ICLR), 2018

  – Yaguang Li*, Rose Yu, Ugur Demiryurek, Cyrus Shahabi, Yan Liu (*Equal Contribution). Deep Learning: A Generic Approach for Extreme Condition Traffic Forecasting. Proceedings of the Seventeenth SIAM International Conference on Data Mining (SDM), 2017

• Related to Chapter 4: Yaguang Li*, Xu Geng*, Leye Wang, Lingyu Zhang, Qiang Yang, Jieping Ye, Yan Liu (*Equal Contribution). Spatiotemporal Multi-graph Convolution Network for Ride-hailing Demand Forecasting. The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI), 2019

• Related to Chapter 5: Yaguang Li, Chuizheng Meng, Cyrus Shahabi, Yan Liu. Structure-informed Graph Auto-encoder for Relational Inference and Simulation. ICML Workshop on Learning and Reasoning with Graph-Structured Representations, 2019

• Related to Chapter 6: Yaguang Li, Kun Fu, Zheng Wang, Cyrus Shahabi, Jieping Ye, Yan Liu.
Multi-task Representation Learning for Travel Time Estimation. ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), 2018

Chapter 2
Related Work

2.1 Traffic Forecasting

Traffic forecasting is a classic problem in transportation and operational research, where approaches are primarily based on queuing theory and simulations (Drew, 1968). Data-driven approaches to traffic forecasting have received considerable attention. These approaches fall into two categories based on whether they model the spatial correlation among different traffic time series.

2.1.1 Traffic Forecasting without Spatial Dependency

Traffic forecasting can be modeled as a time series regression problem, and thus various time series analysis approaches have been applied to it. Historical Average models the traffic flow as a seasonal process and uses the weighted average of previous seasons as the prediction. For example, suppose the season is one week; then the prediction for this Wednesday is the average of the traffic speeds from the last four Wednesdays. As the historical average method does not depend on short-term data, its performance is invariant to small increases in the forecasting horizon (a minimal code sketch of this baseline is given at the end of this subsection).

Auto-regressive integrated moving average (ARIMA) is a popular model for time series analysis and has been successfully applied to traffic forecasting (Lee & Fambro, 1999). ARIMA consists of three parts: 1) the Auto-regressive (AR) part indicates that the evolving variable of interest can be approximated using a linear combination of its own historical values; 2) the Moving average (MA) part models the residual from the AR part using a weighted combination of random noises at various previous time steps; and 3) the Integrated (I) part models the differences between adjacent values rather than the raw values. Williams & Hoel (2003) use Seasonal ARIMA to capture the periodicity in the traffic flow, while in Pan et al. (2012) ARIMA is augmented with the historical average to better model rush hour traffic behavior. Other popular time series methods for traffic forecasting include K-nearest Neighbors (KNN) (Zhang et al., 2013; Cai et al., 2016), Support Vector Regression (SVR) (Su et al., 2007), particle filters, Hidden Markov Models (Qi & Ishak, 2014), Gaussian Processes (Xie et al., 2010), etc. However, these time series models usually rely on the stationarity assumption, which is often violated by real-world traffic data.

To model the non-linear temporal dependency, neural network based approaches have also been applied to traffic forecasting. Lv et al. (2015); Huang et al. (2014) propose to use stacked denoising encoders and deep belief networks to model the temporal behavior. Ma et al. (2015); Li et al. (2017); Laptev et al. (2017) model the temporal dependency using Recurrent Neural Networks (RNN), a type of neural network with self-connections that is able to perform non-linear auto-regression. However, the majority of the above-mentioned approaches model each traffic time series separately, failing to capture the spatial dependency among them.
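As a concrete illustration of the historical average baseline just described, the following minimal sketch (our own illustration, not code from the thesis; the function name and defaults are hypothetical) predicts the next reading as the mean of the observations taken exactly one season apart:

```python
import numpy as np

def historical_average(series: np.ndarray, season: int, num_seasons: int = 4) -> float:
    """Predict the next value as the mean of readings one season apart.

    series      -- 1-D array of historical observations, oldest first
    season      -- season length in number of observations (e.g., one week of
                   5-minute readings = 7 * 24 * 12 = 2016)
    num_seasons -- how many past seasons to average over
    """
    # Same time-of-season slots in previous seasons, counted back from the
    # target slot at position len(series).
    idx = [len(series) - k * season for k in range(1, num_seasons + 1)]
    idx = [i for i in idx if i >= 0]
    if not idx:
        raise ValueError("series is shorter than one season")
    return float(np.mean([series[i] for i in idx]))
```

Because the prediction depends only on the seasonal slot and not on recent observations, its error is indeed invariant to small increases of the forecasting horizon, matching the behavior noted above.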
2.1.2 Traffic Forecasting with Spatial Dependency

To capture the spatial dependency among traffic time series, researchers have extended existing approaches to process multivariate time series. The resulting models include Vector Auto-regression (Hamilton, 1994), Vector ARIMA (Kamarianakis & Prastacos, 2003), Spatiotemporal ARIMA (Min & Wynter, 2011) and Spatiotemporal HMMs (Kwon & Murphy, 2000; Xu et al., 2015a). Deng et al. (2017) further propose to first group similar sensors and then perform multi-task learning on each group. An alternative way to model the relationship among different time series is the latent space model, which first transforms the raw traffic time series into a latent space and then learns the spatiotemporal dependency. Yu et al. (2016) propose a temporal regularized matrix factorization based approach which performs vector auto-regression in the latent space, while Deng et al. (2016) model the road network as a graph and propose to learn the attributes of vertices in latent spaces, capturing both topological and temporal properties.

However, existing machine learning models either impose strong stationarity assumptions on the data (e.g., auto-regressive models) or fail to account for highly non-linear temporal dependency (e.g., latent space models (Yu et al., 2016; Deng et al., 2016)). Deep learning models deliver new promise for the time series forecasting problem. To capture the spatial dependency of traffic, recent studies (Wu & Tan, 2016; Ma et al., 2017; Zhang et al., 2017) propose to model the transportation network as an image and use Convolutional Neural Networks (CNN) to extract spatial features. One drawback of these methods is that they ignore the topology of the underlying transportation network: as in Figure 1.2, two roads in different directions of a highway, though close in Euclidean distance, can have significantly different traffic patterns because of the network topology. Cheng et al. (2018) propose DeepTransport, which models the spatial dependency by explicitly collecting a certain number of upstream and downstream roads for each individual road and then conducting convolution on these roads respectively. Our approach differs from all of the mentioned methods in both the problem setting and the formulation of the convolution on the graph. We model the sensor network as a weighted directed graph, which is more realistic than a grid or an undirected graph. Besides, the proposed convolution is defined using bidirectional graph random walks and is further integrated with the sequence-to-sequence learning framework as well as scheduled sampling to model the long-term temporal dependency.

2.2 Spatiotemporal Prediction in Urban Computing

Spatiotemporal prediction is a fundamental problem for data-driven urban management. There is a rich body of work on this topic, including predicting bike flows (Zhang et al., 2017), taxi demand (Ke et al., 2017b; Yao et al., 2018b), arrival time (Li et al., 2018c) and precipitation (Xingjian et al., 2015; Shi et al., 2017), where the prediction is aggregated over rectangular regions and the region-wise relationship is modeled by geographical distance. More specifically, the spatial structure of urban data is formulated as a matrix whose entries represent rectangular regions. In these works, regions and their pair-wise relationships naturally form a Euclidean structure, and consequently convolutional neural networks are leveraged for effective prediction.

Non-Euclidean structured data also exists in urban computing. Station- or point-based prediction tasks, like traffic prediction (Li et al., 2018d; Yu et al., 2018; Yao et al., 2018a), point-based taxi demand prediction (Tong et al., 2017) and station-based bike flow prediction (Chai et al., 2018), are naturally non-Euclidean, as the data format is no longer a matrix and convolutional neural networks become less helpful.
Manual feature engineering and graph convolution networks are the state-of-the-art techniques for handling non-Euclidean structured data. Different from previous works, ST-MGCN encodes pair-wise relationships among regions into semantic graphs. Though ST-MGCN is designed for region-based prediction, the irregularity of the region-wise relationships makes it a prediction problem on non-Euclidean data. Yao et al. (2018b) propose DMVST-Net, which encodes the region-wise relationship as a graph for taxi demand prediction. DMVST-Net mainly uses graph embedding as an external feature for spatiotemporal prediction, and consequently fails to use the demand values from related regions. Yao et al. (2018a) further improve Yao et al. (2018b) by modeling the periodic shifting problem with an attention mechanism. However, none of these approaches explicitly models the non-Euclidean pair-wise relationships among regions. In this work, ST-MGCN uses the proposed multi-graph convolution to incorporate features from related regions, which makes it possible to predict from the demand values of regions that are related in different respects.

Recent research in neuroimage analysis for Parkinson's disease (Zhang et al., 2018c) shows the effectiveness of graph convolution networks in spatial feature extraction. It uses GCN to learn features from the most similar regions and proposes a multi-view structure to fuse different MRI acquisitions. However, temporal dependency is not considered in that work. ST-GCN is used in spatiotemporal prediction for skeleton-based action recognition (Li et al., 2018a; Yan et al., 2018). The transformation of ST-GCN is a combination of spatial dependency and local temporal recurrence. However, we argue that in these models the contextual information, or global information, is largely overlooked in the temporal dependency modeling.

2.3 Representation Learning on Graphs

Representation learning on graphs aims to find a way to represent, or encode, graph structure so that it can be easily exploited by machine learning algorithms (Hamilton et al., 2017b). Recently, a group of graph representation learning methods have been proposed which try to learn low-dimensional embeddings of the nodes in a graph. For example, Deepwalk (Perozzi et al., 2014) maps each node to a low-dimensional feature vector, preserving higher-order proximity between nodes by maximizing the probability of observing the neighborhoods in a random walk process, while node2vec (Grover & Leskovec, 2016) employs biased random walks that balance breadth-first (BFS) and depth-first (DFS) graph searches to produce more informative and customized embeddings than DeepWalk. LINE (Tang et al., 2015) learns embeddings that explicitly preserve both the first- and second-order proximities. Other well-known works include DNGR (Cao et al., 2016), SDNE (Wang et al., 2016a), Laplacian Eigenmaps (Belkin & Niyogi, 2002), GraRep (Cao et al., 2015), etc. These methods can be formulated in an encoder-decoder framework (Hamilton et al., 2017b), where the encoder maps each node to a low-dimensional vector or embedding, the decoder decodes structural information about the graph from the learned embedding, and the whole framework is trained to minimize some user-defined graph proximity measure between nodes. A particularly interesting setting is when the encoder is based on neural networks on graphs, which enables both inductive learning and supervised representation learning on graphs.
More related representation learning methods for graph-structured data can be found in Hamilton et al. (2017b); Hoff et al. (2002); Bronstein et al. (2017); Nickel et al. (2015).

2.3.1 Neural Networks on Graphs

Recently, the Convolutional Neural Network (CNN) has been generalized to arbitrary graphs based on spectral graph theory (Ortega et al., 2018). Graph convolutional neural networks (GCN) were first introduced in Bruna et al. (2014), which bridges spectral graph theory and deep neural networks. Following this work, many models have been proposed to enable localized filters (Henaff et al., 2015) and to handle the computational limitations of the convolution operation on graphs (Defferrard et al., 2016; Kipf & Welling, 2017). Defferrard et al. (2016) propose ChebNet, which improves GCN with fast localized convolution filters. ChebNet is defined over a graph G = (V, A), where V is the set of all vertices and A ∈ R^{|V|×|V|} is the adjacency matrix whose entries represent the connections between vertices. ChebNet is able to extract local features with different reception fields from translation-variant non-Euclidean structures (Hammond et al., 2011). Let L = I − D^{−1/2} A D^{−1/2} denote the graph Laplacian matrix, where D is the degree matrix. A graph convolution operation (Defferrard et al., 2016) is then defined as:

    X^{(l+1)} = \sigma\Big( \sum_{k=0}^{K-1} \alpha_k L^k X^{(l)} \Big)

where X^{(l)} denotes the features in the l-th layer, α_k is a trainable coefficient, L^k is the k-th power of the graph Laplacian matrix, and σ is the activation function. (In a graph convolution layer with P inputs and Q outputs there are PQ such convolution operations; we only show one operation for simplicity. A minimal code sketch of this operation is given at the end of this subsection.)

Kipf & Welling (2017) simplify ChebNet and achieve state-of-the-art performance on semi-supervised classification tasks. Seo et al. (2018) combine ChebNet with Recurrent Neural Networks (RNN) for structured sequence modeling. Yu et al. (2018) model the sensor network as an undirected graph and apply ChebNet and the convolutional sequence model (Gehring et al., 2017) for forecasting. One limitation of the mentioned spectral-based convolutions is that they generally require the graph to be undirected in order to calculate a meaningful spectral decomposition. Going from the spectral domain to the vertex domain, Atwood & Towsley (2016) propose the diffusion-convolutional neural network (DCNN), which defines convolution as a diffusion process across each node in a graph-structured input. Hechtlinger et al. (2017) propose GraphCNN, which generalizes convolution to graphs by convolving every node with its p nearest neighbors. Hamilton et al. (2017a) propose GraphSAGE, which extends GCN (Kipf & Welling, 2017) to the large-scale and inductive setting. Velickovic et al. (2018) propose an attention-based framework which learns to assign different weights to different nodes in a neighborhood. Sanchez-Gonzalez et al. (2018) propose GraphNet, which provides a unified framework for various neural networks on graphs, including ChebNet (Defferrard et al., 2016), DCNN (Atwood & Towsley, 2016), GCN (Kipf & Welling, 2017), GAT (Velickovic et al., 2018), GraphSAGE (Hamilton et al., 2017a), etc. More related work can be found in a recent survey paper (Battaglia, 2018).
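As a concrete reference for the operation above, the following minimal sketch (our own illustration, not code from the thesis; the function name is hypothetical) evaluates σ(Σ_k α_k L^k X) by iterating L^k X rather than forming matrix powers explicitly:

```python
import numpy as np

def cheb_graph_conv(X, A, alpha):
    """One graph convolution of the form sigma(sum_k alpha_k L^k X).

    X     -- node features, shape (N, P)
    A     -- adjacency matrix, shape (N, N)
    alpha -- trainable coefficients alpha_0 ... alpha_{K-1}
    """
    d = A.sum(axis=1).astype(float)
    d_inv_sqrt = np.zeros_like(d)
    d_inv_sqrt[d > 0] = d[d > 0] ** -0.5
    # Normalized graph Laplacian L = I - D^{-1/2} A D^{-1/2}
    L = np.eye(A.shape[0]) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]

    out = np.zeros(X.shape)
    Lk_X = X.astype(float)        # L^0 X = X
    for a_k in alpha:             # accumulate alpha_k * L^k X for k = 0..K-1
        out += a_k * Lk_X
        Lk_X = L @ Lk_X           # advance from L^k X to L^{k+1} X
    return np.maximum(out, 0.0)   # ReLU as the activation sigma
```

ChebNet proper evaluates the same polynomial in the Chebyshev basis of the rescaled Laplacian for numerical stability; the plain monomial form is kept here only to mirror the equation above.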
Chapter 3
Modeling Non-Euclidean Spatial Correlation: Diffusion Convolutional Recurrent Neural Network for Traffic Forecasting

Spatiotemporal forecasting is a crucial task for a learning system that operates in a dynamic environment. It has a wide range of applications, from autonomous vehicle operations, to energy and smart grid optimization, to logistics and supply chain management. In this chapter, we study one important task: traffic forecasting on road networks, the core component of intelligent transportation systems. The goal of traffic forecasting is to predict the future traffic speeds of a sensor network given historical traffic speeds and the underlying road network.

This task is challenging mainly due to the complex spatiotemporal dependencies and the inherent difficulty of long-term forecasting. On the one hand, traffic time series demonstrate strong temporal dynamics. Recurring incidents such as rush hours or accidents can cause non-stationarity, making it difficult to forecast long-term. On the other hand, sensors on the road network exhibit complex yet unique spatial correlations. Figure 1.2 illustrates an example. Road 1 and road 2 are correlated, while road 1 and road 3 are not. Although road 1 and road 3 are close in Euclidean space, they demonstrate very different behaviors. Moreover, the future traffic speed is influenced more by the downstream traffic than by the upstream one. This means that the spatial structure in traffic is non-Euclidean and directional.

Traffic forecasting has been studied for decades, with approaches falling into two main categories: knowledge-driven and data-driven. In transportation and operational research, knowledge-driven methods usually apply queuing theory and simulate user behaviors in traffic (Cascetta, 2013). In the time series community, data-driven methods such as the Auto-Regressive Integrated Moving Average (ARIMA) model and Kalman filtering remain popular (Liu et al., 2011; Lippi et al., 2013). However, simple time series models usually rely on the stationarity assumption, which is often violated by traffic data. Most recently, deep learning models for traffic forecasting have been developed in Lv et al. (2015); Li et al. (2017), but without considering the spatial structure. Wu & Tan (2016) and Ma et al. (2017) model the spatial correlation with Convolutional Neural Networks (CNN), but the spatial structure is in the Euclidean space (e.g., 2D images). Bruna et al. (2014); Defferrard et al. (2016) investigate graph convolution, but only for undirected graphs.

In this work, we represent the pair-wise spatial correlations between traffic sensors using a directed graph whose nodes are sensors and whose edge weights denote proximity between sensor pairs measured by road network distance. We model the dynamics of the traffic flow as a diffusion process and propose the diffusion convolution operation to capture the spatial dependency. We further propose the Diffusion Convolutional Recurrent Neural Network (DCRNN), which integrates diffusion convolution, the sequence-to-sequence architecture and the scheduled sampling technique. When evaluated on real-world traffic datasets, DCRNN consistently outperforms state-of-the-art traffic forecasting baselines by a large margin. In summary:

• We study the traffic forecasting problem and model the spatial dependency of traffic as a diffusion process on a directed graph. We propose diffusion convolution, which has an intuitive interpretation and can be computed efficiently.
• We propose the Diffusion Convolutional Recurrent Neural Network (DCRNN), a holistic approach that captures both spatial and temporal dependencies among time series using diffusion convolution and the sequence-to-sequence learning framework together with scheduled sampling. DCRNN is not limited to transportation and is readily applicable to other spatiotemporal forecasting tasks.

• We conducted extensive experiments on two large-scale real-world datasets, and the proposed approach obtains significant improvement over state-of-the-art baseline methods.

Table 3.1: Notation

Name                      Description
G                         a graph
V, v_i                    nodes of a graph, with |V| = N, and the i-th node
E                         edges of a graph
W, W_ij                   weight matrix of a graph and its entries
D, D_I, D_O               undirected degree matrix; in-degree/out-degree matrices
L, Φ, Λ                   normalized graph Laplacian, its eigenvector matrix and its eigenvalue matrix
X, X̂ ∈ R^{N×P}           a graph signal and the predicted graph signal
X^{(t)} ∈ R^{N×P}         a graph signal at time t
H ∈ R^{N×Q}               output of the diffusion convolutional layer
f_θ, θ                    convolutional filter and its parameters
f_Θ, Θ                    convolutional layer and its parameters

3.1 Problem Statement

We formalize the learning problem of spatiotemporal traffic forecasting and describe how to model the dependency structures using the diffusion convolutional recurrent neural network. Table 3.1 summarizes the main notation used in this chapter.

The goal of traffic forecasting is to predict the future traffic speed given previously observed traffic flow from N correlated sensors on the road network. We can represent the sensor network as a weighted directed graph G = (V, E, W), where V is a set of nodes with |V| = N, E is a set of edges and W ∈ R^{N×N} is a weighted adjacency matrix representing the node proximity (e.g., a function of the road network distance). Denote the traffic flow observed on G as a graph signal X ∈ R^{N×P}, where P is the number of features of each node (e.g., velocity, volume). Let X^{(t)} represent the graph signal observed at time t. The traffic forecasting problem aims to learn a function h(·) that maps T' historical graph signals to T future graph signals, given a graph G:

    [X^{(t-T'+1)}, \dots, X^{(t)}; \mathcal{G}] \xrightarrow{h(\cdot)} [X^{(t+1)}, \dots, X^{(t+T)}]

3.2 Methodology

3.2.1 Spatial Dependency Modeling

We model the spatial dependency by relating traffic flow to a diffusion process, which explicitly captures the stochastic nature of traffic dynamics. This diffusion process is characterized by a random walk on G with restart probability α ∈ [0, 1] and a state transition matrix D_O^{-1} W. Here D_O = diag(W 1) is the out-degree diagonal matrix, and 1 ∈ R^N denotes the all-one vector. After many time steps, such a Markov process converges to a stationary distribution P ∈ R^{N×N} whose i-th row P_{i,:} ∈ R^N represents the likelihood of diffusion from node v_i ∈ V, hence the proximity with respect to node v_i. The following lemma provides a closed-form solution for the stationary distribution.

Lemma 3.1 (Teng et al., 2016). The stationary distribution of the diffusion process can be represented as a weighted combination of infinite random walks on the graph and calculated in closed form:

    P = \sum_{k=0}^{\infty} \alpha (1-\alpha)^k \left( D_O^{-1} W \right)^k    (3.1)

where k is the diffusion step.
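As a quick illustration (a sketch of our own under the notation above, not code from the thesis; alpha = 0.15 is merely a common restart value), the two random-walk transition matrices and a finite truncation of Equation 3.1 can be computed directly from the weighted adjacency matrix:

```python
import numpy as np

def transition_matrices(W):
    """Forward/backward random-walk transition matrices of a directed graph.

    W -- weighted adjacency matrix, shape (N, N); W[i, j] is the weight of edge i -> j
    """
    d_out = np.maximum(W.sum(axis=1), 1e-10)          # out-degrees, D_O = diag(W 1)
    d_in = np.maximum(W.sum(axis=0), 1e-10)           # in-degrees
    return W / d_out[:, None], W.T / d_in[:, None]    # D_O^{-1} W and D_I^{-1} W^T

def truncated_stationary(W, alpha=0.15, K=10):
    """Finite K-step truncation of Equation 3.1."""
    P_fwd, _ = transition_matrices(W)
    term = np.eye(W.shape[0])                         # (D_O^{-1} W)^0
    P = np.zeros_like(term)
    for k in range(K):
        P += alpha * (1.0 - alpha) ** k * term
        term = term @ P_fwd                           # next power of the transition matrix
    return P
```

Each row of the truncated P converges to the corresponding row of the stationary distribution as K grows, and the diffusion convolution defined next reuses exactly these two transition matrices.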
In practice, we use a finite $K$-step truncation of the diffusion process and assign a trainable weight to each step. We also include the reversed-direction diffusion process, such that the bidirectional diffusion offers the model more flexibility to capture the influence from both the upstream and the downstream traffic.

Diffusion Convolution The resulting diffusion convolution operation over a graph signal $X \in \mathbb{R}^{N \times P}$ and a filter $f_\theta$ is defined as:

$$X_{:,p} \star_\mathcal{G} f_\theta = \sum_{k=0}^{K-1} \left( \theta_{k,1} \left(D_O^{-1} W\right)^k + \theta_{k,2} \left(D_I^{-1} W^\top\right)^k \right) X_{:,p} \quad \text{for } p \in \{1, \cdots, P\} \tag{3.2}$$

where $\theta \in \mathbb{R}^{K \times 2}$ are the parameters of the filter, and $D_O^{-1} W$ and $D_I^{-1} W^\top$ represent the transition matrices of the diffusion process and the reverse one, respectively. Figure 3.1 shows an example diffusion filter centered at the highlighted node: the filter is represented as a weighted combination of $K$ diffusion steps, and the dual-directional diffusion models the upstream and downstream traffic separately.

Figure 3.1: An example diffusion filter consisting of K diffusion steps.

Lemma 3.2. Equation 3.2 can be calculated efficiently using $O(K)$ recursive sparse-dense matrix multiplications, with total time complexity $O(K|\mathcal{E}|) \ll O(N^2)$.

Proof. Equation 3.2 can be decomposed into two parts with the same time complexity, i.e., one part with $D_O^{-1} W$ and the other with $D_I^{-1} W^\top$. Thus we only show the time complexity of the first part. Let $T_k(x) = (D_O^{-1} W)^k x$. The first part of Equation 3.2 can be rewritten as

$$\sum_{k=0}^{K-1} \theta_{k,1} \, T_k(X_{:,p}) \tag{3.3}$$

As $T_{k+1}(x) = D_O^{-1} W \, T_k(x)$ and $D_O^{-1} W$ is sparse, it is easy to see that Equation 3.3 can be calculated using $O(K)$ recursive sparse-dense matrix multiplications, each with time complexity $O(|\mathcal{E}|)$. Consequently, the time complexities of both Equation 3.2 and Equation 3.3 are $O(K|\mathcal{E}|)$. For dense graphs, we may use spectral sparsification (Cheng et al., 2015) to make them sparse.
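The recursion used in this proof translates directly into code. Below is an illustrative SciPy sketch of Equation 3.2 for a single filter; the variable names are ours, and the released implementation may differ in its details:

```python
import numpy as np
import scipy.sparse as sp

def diffusion_conv(X, W, theta):
    """Bidirectional diffusion convolution (Equation 3.2) for one filter.

    X:     (N, P) graph signal.
    W:     (N, N) scipy.sparse weighted adjacency matrix of a directed graph.
    theta: (K, 2) filter parameters; column 0 weights the forward walk
           D_O^{-1} W, column 1 the reverse walk D_I^{-1} W^T.
    """
    K = theta.shape[0]
    d_out = np.asarray(W.sum(axis=1)).flatten()
    d_in = np.asarray(W.sum(axis=0)).flatten()
    T_fwd = sp.diags(1.0 / np.maximum(d_out, 1e-10)) @ W    # D_O^{-1} W
    T_bwd = sp.diags(1.0 / np.maximum(d_in, 1e-10)) @ W.T   # D_I^{-1} W^T
    out = np.zeros(X.shape)
    x_f = x_b = X                      # k = 0 terms: the identity walk
    for k in range(K):
        out = out + theta[k, 0] * x_f + theta[k, 1] * x_b
        x_f = T_fwd @ x_f              # one O(|E|) sparse-dense product
        x_b = T_bwd @ x_b
    return out
```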
Diffusion Convolutional Layer With the convolution operation defined in Equation 3.2, we can build a diffusion convolutional layer that maps $P$-dimensional features to $Q$-dimensional outputs. Denote the parameter tensor as $\Theta \in \mathbb{R}^{Q \times P \times K \times 2}$, where $\Theta_{q,p,:,:} \in \mathbb{R}^{K \times 2}$ parameterizes the convolutional filter for the $p$-th input and the $q$-th output. The diffusion convolutional layer is thus:

$$H_{:,q} = a\left( \sum_{p=1}^{P} X_{:,p} \star_\mathcal{G} f_{\Theta_{q,p,:,:}} \right) \quad \text{for } q \in \{1, \cdots, Q\} \tag{3.4}$$

where $X \in \mathbb{R}^{N \times P}$ is the input, $H \in \mathbb{R}^{N \times Q}$ is the output, $\{f_{\Theta_{q,p,:,:}}\}$ are the filters, and $a$ is the activation function (e.g., ReLU, sigmoid). The diffusion convolutional layer learns representations for graph-structured data, and we can train it using stochastic gradient-based methods.

Relation with Spectral Graph Convolution Diffusion convolution is defined on both directed and undirected graphs. When applied to undirected graphs, we show that many existing graph-structured convolutional operations, including the popular spectral graph convolution, i.e., ChebNet (Defferrard et al., 2016), can be considered as special cases of diffusion convolution (up to a similarity transformation). Let $D$ denote the degree matrix and $L = D^{-\frac{1}{2}}(D - W)D^{-\frac{1}{2}}$ the normalized graph Laplacian; the following proposition demonstrates the connection.

Proposition 3.3. The spectral graph convolution, defined as $X_{:,p} \star_\mathcal{G} f_\theta = \Phi \, F(\theta) \, \Phi^\top X_{:,p}$ with eigenvalue decomposition $L = \Phi \Lambda \Phi^\top$ and $F(\theta) = \sum_{k=0}^{K-1} \theta_k \Lambda^k$, is equivalent to graph diffusion convolution up to a similarity transformation when the graph $\mathcal{G}$ is undirected.

Proof. The spectral graph convolution utilizes the concept of the normalized graph Laplacian $L = D^{-\frac{1}{2}}(D - W)D^{-\frac{1}{2}} = \Phi \Lambda \Phi^\top$. ChebNet parameterizes $f_\theta$ as a $K$-order polynomial of $\Lambda$ and calculates it using the stable Chebyshev polynomial basis:

$$X_{:,p} \star_\mathcal{G} f_\theta = \Phi \left( \sum_{k=0}^{K-1} \theta_k \Lambda^k \right) \Phi^\top X_{:,p} = \sum_{k=0}^{K-1} \theta_k L^k X_{:,p} = \sum_{k=0}^{K-1} \tilde{\theta}_k T_k(\tilde{L}) X_{:,p} \tag{3.5}$$

where $T_0(x) = 1$, $T_1(x) = x$, and $T_k(x) = 2x\,T_{k-1}(x) - T_{k-2}(x)$ define the Chebyshev polynomial basis. Let $\lambda_{\max}$ denote the largest eigenvalue of $L$; $\tilde{L} = \frac{2}{\lambda_{\max}} L - I$ represents a rescaling of the graph Laplacian that maps the eigenvalues from $[0, \lambda_{\max}]$ to $[-1, 1]$, since the Chebyshev polynomials form an orthogonal basis on $[-1, 1]$. Equation 3.5 can be considered as a polynomial of $\tilde{L}$, and we will show that the output of the ChebNet convolution is similar to the output of the diffusion convolution up to a constant scaling factor. Assume $\lambda_{\max} = 2$ and $D_I = D_O = D$ for an undirected graph; then

$$\tilde{L} = D^{-\frac{1}{2}}(D - W)D^{-\frac{1}{2}} - I = -D^{-\frac{1}{2}} W D^{-\frac{1}{2}} \sim -D^{-1} W \tag{3.6}$$

i.e., $\tilde{L}$ is similar to the negative random walk transition matrix. Thus the output of Equation 3.5 is also similar to the output of Equation 3.2 up to a constant scaling factor.

Figure 3.2: System architecture of the Diffusion Convolutional Recurrent Neural Network designed for spatiotemporal traffic forecasting. The historical time series are fed into an encoder whose final states are used to initialize the decoder. The decoder makes predictions based on either the previous ground truth or the model output.

3.2.2 Temporal Dynamics Modeling

We leverage recurrent neural networks (RNNs) to model the temporal dependency. In particular, we use the Gated Recurrent Unit (GRU) (Chung et al., 2014), a simple yet powerful variant of RNNs. We replace the matrix multiplications in the GRU with the diffusion convolution, which leads to our proposed Diffusion Convolutional Gated Recurrent Unit (DCGRU):

$$r^{(t)} = \sigma\left(\Theta_r \star_\mathcal{G} [X^{(t)}, H^{(t-1)}] + b_r\right)$$
$$u^{(t)} = \sigma\left(\Theta_u \star_\mathcal{G} [X^{(t)}, H^{(t-1)}] + b_u\right)$$
$$C^{(t)} = \tanh\left(\Theta_C \star_\mathcal{G} \left[X^{(t)}, \left(r^{(t)} \odot H^{(t-1)}\right)\right] + b_c\right)$$
$$H^{(t)} = u^{(t)} \odot H^{(t-1)} + \left(1 - u^{(t)}\right) \odot C^{(t)}$$

where $X^{(t)}$ and $H^{(t)}$ denote the input and output at time $t$, and $r^{(t)}$ and $u^{(t)}$ are the reset gate and update gate at time $t$, respectively. $\star_\mathcal{G}$ denotes the diffusion convolution defined in Equation 3.2, and $\Theta_r$, $\Theta_u$, $\Theta_C$ are the parameters of the corresponding filters. Similar to the GRU, the DCGRU can be used to build recurrent neural network layers and can be trained using backpropagation through time.
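The following NumPy pseudocode sketches one DCGRU step, assuming the three diffusion convolutional layers are supplied as callables (e.g., built from the diffusion convolution sketch above); it is illustrative rather than the released implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dcgru_step(x_t, h_prev, gconv_r, gconv_u, gconv_c, b_r, b_u, b_c):
    """One DCGRU step: a GRU whose matrix multiplications are replaced
    by diffusion convolutions.

    x_t:      (N, P) input graph signal at time t.
    h_prev:   (N, Q) previous hidden state H^(t-1).
    gconv_*:  callables implementing diffusion convolutional layers that
              map a concatenated (N, P + Q) signal to (N, Q).
    """
    xh = np.concatenate([x_t, h_prev], axis=1)     # [X^(t), H^(t-1)]
    r = sigmoid(gconv_r(xh) + b_r)                 # reset gate
    u = sigmoid(gconv_u(xh) + b_u)                 # update gate
    xc = np.concatenate([x_t, r * h_prev], axis=1)
    c = np.tanh(gconv_c(xc) + b_c)                 # candidate state
    return u * h_prev + (1.0 - u) * c              # new hidden state H^(t)
```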
In multiple-step-ahead forecasting, we employ the sequence-to-sequence architecture (Sutskever et al., 2014). Both the encoder and the decoder are recurrent neural networks with DCGRU. During training, we feed the historical time series into the encoder and use its final states to initialize the decoder. The decoder generates predictions given previous ground truth observations. At testing time, ground truth observations are replaced by predictions generated by the model itself. The discrepancy between the input distributions of training and testing can cause degraded performance. To mitigate this issue, we integrate scheduled sampling (Bengio et al., 2015) into the model, where at the $i$-th iteration we feed the model either the ground truth observation with probability $\epsilon_i$ or the model's own prediction with probability $1 - \epsilon_i$. During the training process, $\epsilon_i$ gradually decreases to 0 to allow the model to learn the testing distribution.

With both spatial and temporal modeling, we build the Diffusion Convolutional Recurrent Neural Network (DCRNN). The model architecture of DCRNN is shown in Figure 3.2. The entire network is trained by maximizing the likelihood of generating the target future time series using backpropagation through time. DCRNN is able to capture spatiotemporal dependencies among time series and can be applied to various spatiotemporal forecasting problems.

3.3 Experiments

3.3.1 Dataset

We conduct experiments on two real-world large-scale datasets:

• METR-LA This traffic dataset contains traffic information collected from loop detectors on the highways of Los Angeles County (Jagadish et al., 2014). We select 207 sensors and collect 4 months of data, ranging from Mar 1st 2012 to Jun 30th 2012, for the experiment. The total number of observed traffic data points is 6,519,002.

• PEMS-BAY This traffic dataset is collected by the California Transportation Agencies (CalTrans) Performance Measurement System (PeMS). We select 325 sensors in the Bay Area and collect 6 months of data, ranging from Jan 1st 2017 to May 31st 2017, for the experiment. The total number of observed traffic data points is 16,937,179.

The sensor distributions of both datasets are visualized in Figure 3.3 (Figure 3.3: Sensor distribution of the METR-LA and PEMS-BAY datasets). In both datasets, we aggregate traffic speed readings into 5-minute windows and apply Z-score normalization. 70% of the data is used for training, 20% for testing, and the remaining 10% for validation. To construct the sensor graph, we compute the pairwise road network distances between sensors and build the adjacency matrix using a thresholded Gaussian kernel (Shuman et al., 2013):

$$W_{ij} = \exp\left(-\frac{\mathrm{dist}(v_i, v_j)^2}{\sigma^2}\right) \text{ if } \mathrm{dist}(v_i, v_j) \le \kappa, \text{ otherwise } 0$$

where $W_{ij}$ represents the edge weight between sensor $v_i$ and sensor $v_j$, $\mathrm{dist}(v_i, v_j)$ denotes the road network distance from sensor $v_i$ to sensor $v_j$, $\sigma$ is the standard deviation of the distances, and $\kappa$ is the threshold.
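This graph construction step can be sketched as follows, assuming a precomputed matrix of pairwise road network distances; the variable names are illustrative:

```python
import numpy as np

def gaussian_kernel_adjacency(dist, kappa):
    """Adjacency matrix from road network distances using the thresholded
    Gaussian kernel (Shuman et al., 2013).

    dist:  (N, N) matrix; dist[i, j] is the road network distance from
           sensor v_i to sensor v_j (directed, hence possibly asymmetric).
    kappa: distance threshold; pairs farther apart get zero weight.
    """
    sigma = dist.std()                    # standard deviation of distances
    W = np.exp(-np.square(dist / sigma))
    W[dist > kappa] = 0.0                 # sparsify the graph
    return W
```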
3.3.2 Experimental Settings

We compare DCRNN with widely used time series regression models, including:

• HA: Historical Average, which models the traffic flow as a seasonal process and uses the weighted average of previous seasons as the prediction. The period used is 1 week, and the prediction is based on aggregated data from previous weeks. For example, the prediction for this Wednesday is the average of the traffic speeds from the last four Wednesdays. As the historical average method does not depend on short-term data, its performance is invariant to small increases in the forecasting horizon.

• ARIMA_kal: Auto-Regressive Integrated Moving Average model with Kalman filter. The orders are (3, 0, 1), and the model is implemented using the statsmodels Python package.

• VAR: Vector Auto-Regressive model (Hamilton, 1994). The number of lags is set to 3, and the model is implemented using the statsmodels Python package.

• SVR: Linear Support Vector Regression; the penalty term is C = 0.1, and the number of historical observations is 5.

The following deep neural network based approaches are also included:

• FNN: Feed-forward neural network with two hidden layers, each containing 256 units. The initial learning rate is 1e-3, decayed by a factor of 10 every 20 epochs starting at the 50th epoch. In addition, for all hidden layers, dropout with ratio 0.5 and L2 weight decay of 1e-2 are used. The model is trained with batch size 64 and MAE as the loss function. Early stopping is performed by monitoring the validation error.

• FC-LSTM: The encoder-decoder framework using LSTM with peepholes (Sutskever et al., 2014). Both the encoder and the decoder contain two recurrent layers. Each recurrent layer has 256 LSTM units; the L1 weight decay is 2e-5 and the L2 weight decay is 5e-4. The model is trained with batch size 64 and MAE as the loss function. The initial learning rate is 1e-4, decayed by a factor of 10 every 10 epochs starting from the 20th epoch. Early stopping is performed by monitoring the validation error.

• DCRNN: Diffusion Convolutional Recurrent Neural Network. Both the encoder and the decoder contain two recurrent layers. Each recurrent layer has 64 units; the initial learning rate is 1e-2, decayed by a factor of 10 every 10 epochs starting at the 20th epoch, and early stopping on the validation dataset is used. Besides, the maximum number of random walk steps, i.e., $K$, is set to 3. For scheduled sampling, the thresholded inverse sigmoid function is used as the probability decay:

$$\epsilon_i = \frac{\tau}{\tau + \exp(i/\tau)}$$

where $i$ is the number of iterations and $\tau$ is a parameter that controls the speed of convergence; $\tau$ is set to 3,000 in the experiments. The implementation is available at https://github.com/liyaguang/DCRNN.

All neural network based approaches are implemented using TensorFlow (Abadi et al., 2016) and trained using the Adam optimizer with learning rate annealing. The best hyperparameters are chosen using the Tree-structured Parzen Estimator (TPE) (Bergstra et al., 2011) on the validation dataset.

Metrics We evaluate the approaches based on three popular metrics: Root Mean Square Error (RMSE), Mean Absolute Percentage Error (MAPE), and Mean Absolute Error (MAE). Suppose $x = (x_1, \cdots, x_n)$ represents the ground truth, $\hat{x} = (\hat{x}_1, \cdots, \hat{x}_n)$ represents the predicted values, and $\Omega$ denotes the indices of observed samples. The metrics are defined as follows:

$$\mathrm{RMSE}(x, \hat{x}) = \sqrt{\frac{1}{|\Omega|} \sum_{i \in \Omega} (x_i - \hat{x}_i)^2}$$

$$\mathrm{MAPE}(x, \hat{x}) = \frac{1}{|\Omega|} \sum_{i \in \Omega} \left| \frac{x_i - \hat{x}_i}{x_i} \right|$$

$$\mathrm{MAE}(x, \hat{x}) = \frac{1}{|\Omega|} \sum_{i \in \Omega} |x_i - \hat{x}_i|$$
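Since missing values are excluded when computing these metrics (see Section 3.3.3), a masked implementation along the following lines can be used; the convention of encoding missing values as zeros is our assumption for illustration:

```python
import numpy as np

def masked_metrics(y_true, y_pred, null_val=0.0):
    """MAE, RMSE and MAPE over observed entries only; entries equal to
    null_val in the ground truth are treated as missing and excluded.
    """
    mask = y_true != null_val             # Omega: indices of observed samples
    err = y_pred[mask] - y_true[mask]
    mae = np.abs(err).mean()
    rmse = np.sqrt(np.square(err).mean())
    mape = np.abs(err / y_true[mask]).mean()
    return mae, rmse, mape
```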
Table 3.2: Performance comparison of different approaches for traffic speed forecasting. DCRNN achieves the best performance on all three metrics for all forecasting horizons, and the advantage becomes more evident as the forecasting horizon increases.

  METR-LA
  T        Metric  HA     ARIMA_Kal  VAR    SVR    FNN    FC-LSTM  DCRNN
  15 min   MAE     4.16   3.99       4.42   3.99   3.99   3.44     2.77
           RMSE    7.80   8.21       7.89   8.45   7.94   6.30     5.38
           MAPE    13.0%  9.6%       10.2%  9.3%   9.9%   9.6%     7.3%
  30 min   MAE     4.16   5.15       5.41   5.05   4.23   3.77     3.15
           RMSE    7.80   10.45      9.13   10.87  8.17   7.23     6.45
           MAPE    13.0%  12.7%      12.7%  12.1%  12.9%  10.9%    8.8%
  1 hour   MAE     4.16   6.90       6.52   6.72   4.49   4.37     3.60
           RMSE    7.80   13.23      10.11  13.76  8.69   8.69     7.59
           MAPE    13.0%  17.4%      15.8%  16.7%  14.0%  13.2%    10.5%

  PEMS-BAY
  T        Metric  HA     ARIMA_Kal  VAR    SVR    FNN    FC-LSTM  DCRNN
  15 min   MAE     2.88   1.62       1.74   1.85   2.20   2.05     1.38
           RMSE    5.59   3.30       3.16   3.59   4.42   4.19     2.95
           MAPE    6.8%   3.5%       3.6%   3.8%   5.19%  4.8%     2.9%
  30 min   MAE     2.88   2.33       2.32   2.48   2.30   2.20     1.74
           RMSE    5.59   4.76       4.25   5.18   4.63   4.55     3.97
           MAPE    6.8%   5.4%       5.0%   5.5%   5.43%  5.2%     3.9%
  1 hour   MAE     2.88   3.38       2.93   3.28   2.46   2.37     2.07
           RMSE    5.59   6.50       5.44   7.08   4.98   4.96     4.74
           MAPE    6.8%   8.3%       6.5%   8.0%   5.89%  5.7%     4.9%

3.3.3 Traffic Forecasting Performance Comparison

Table 3.2 shows the comparison of different approaches for 15-minute, 30-minute, and 1-hour-ahead forecasting on both datasets. These methods are evaluated based on three commonly used metrics in traffic forecasting: (1) Mean Absolute Error (MAE), (2) Mean Absolute Percentage Error (MAPE), and (3) Root Mean Squared Error (RMSE). Missing values are excluded when calculating these metrics; detailed formulations are provided in Section 3.3.2. We observe the following phenomena in both datasets. (1) RNN-based methods, including FC-LSTM and DCRNN, generally outperform the other baselines, which emphasizes the importance of modeling the temporal dependency. (2) DCRNN achieves the best performance on all metrics for all forecasting horizons, which suggests the effectiveness of spatiotemporal dependency modeling. (3) Deep neural network based methods, including FNN, FC-LSTM, and DCRNN, tend to perform better than the linear baselines for long-term forecasting, e.g., 1 hour ahead; this is because the temporal dependency becomes increasingly non-linear as the horizon grows. Besides, as the historical average method does not depend on short-term data, its performance is invariant to small increases in the forecasting horizon. Note that traffic forecasting on the METR-LA dataset (Los Angeles, which is known for its complicated traffic conditions) is more challenging than on the PEMS-BAY (Bay Area) dataset. Thus we use METR-LA as the default dataset for the following experiments.
3.3.4 Effect of Spatial Dependency Modeling

To further investigate the effect of spatial dependency modeling, we compare DCRNN with the following variants: (1) DCRNN-NoConv, which ignores the spatial dependency by replacing the transition matrices in the diffusion convolution (Equation 3.2) with identity matrices; this essentially means that the forecast for a sensor can only be inferred from its own historical readings. (2) DCRNN-UniConv, which uses only the forward random walk transition matrix for the diffusion convolution. Figure 3.4 shows the learning curves of these three models with roughly the same number of parameters (Figure 3.4: Learning curves for DCRNN and DCRNN without diffusion convolution; removing the diffusion convolution results in much higher validation error, while DCRNN with bidirectional random walks achieves the lowest validation error). Without diffusion convolution, DCRNN-NoConv has a much higher validation error. Moreover, DCRNN achieves the lowest validation error, which shows the effectiveness of using bidirectional random walks. The intuition is that the bidirectional random walk gives the model the ability and flexibility to capture the influence from both the upstream and the downstream traffic.

To investigate the effect of graph construction, we construct an undirected graph by setting $\hat{W}_{ij} = \hat{W}_{ji} = \max(W_{ij}, W_{ji})$, where $\hat{W}$ is the new symmetric weight matrix. We then develop a variant of DCRNN, denoted GCRNN, which uses sequence-to-sequence learning with the ChebNet graph convolution (Equation 3.5) and roughly the same number of parameters. Table 3.3 shows the comparison between DCRNN and GCRNN on the METR-LA dataset. DCRNN consistently outperforms GCRNN. The intuition is that the directed graph better captures the asymmetric correlations between traffic sensors.

Table 3.3: Performance comparison for DCRNN and GCRNN on the METR-LA dataset.

           15 min              30 min              1 hour
           MAE   RMSE  MAPE    MAE   RMSE  MAPE    MAE   RMSE  MAPE
  DCRNN    2.77  5.38  7.3%    3.15  6.45  8.8%    3.60  7.60  10.5%
  GCRNN    2.80  5.51  7.5%    3.24  6.74  9.0%    3.81  8.16  10.9%

Figure 3.5 shows the effects of different parameters (Figure 3.5: Effects of K and the number of units in each layer of DCRNN). $K$ roughly corresponds to the size of the filters' reception fields, while the number of units corresponds to the number of filters. A larger $K$ enables the model to capture broader spatial dependency at the cost of increased learning complexity. We observe that as $K$ increases, the error on the validation dataset first quickly decreases and then slightly increases. Similar behavior is observed when varying the number of units.

3.3.5 Effect of Temporal Dependency Modeling

To evaluate the effect of temporal modeling, including the sequence-to-sequence framework as well as the scheduled sampling mechanism, we further design three variants of DCRNN: (1) DCNN, in which we concatenate the historical observations into a fixed-length vector and feed it into stacked diffusion convolutional layers to predict the future time series; we train a single model for one-step-ahead prediction and feed the previous prediction back into the model as input to perform multiple-step-ahead prediction. (2) DCRNN-SEQ, which uses the encoder-decoder sequence-to-sequence learning framework to perform multiple-step-ahead forecasting. (3) DCRNN, which is similar to DCRNN-SEQ except that it adds scheduled sampling.

Figure 3.6 shows the comparison of these variants with regard to MAE for different forecasting horizons (Figure 3.6: Performance comparison for different DCRNN variants; DCRNN, with the sequence-to-sequence framework and scheduled sampling, achieves the lowest MAE on the validation dataset, and the advantage becomes clearer as the forecasting horizon increases). We observe that: (1) DCRNN-SEQ outperforms DCNN by a large margin, which confirms the importance of modeling the temporal dependency.
(2) DCRNN achieves the best result, and its superiority becomes more evident as the forecasting horizon increases. This is mainly because the model is trained to deal with its own mistakes during multiple-step-ahead prediction and thus suffers less from error propagation. We also train a model that is always fed its own output as input for multiple-step-ahead prediction; however, its performance is much worse than all three variants, which emphasizes the importance of scheduled sampling.

3.3.6 Model Interpretation

To better understand the model, we visualize the forecasting results as well as the learned filters. Figure 3.7 shows the visualization of 1-hour-ahead forecasting (Figure 3.7: Traffic time series forecasting visualization; DCRNN generates smooth predictions and is usually better at predicting the start and end of peak hours). We have the following observations:

• DCRNN generates a smooth prediction of the mean when small oscillations exist in the traffic speeds (Figure 3.7(a)). This reflects the robustness of the model.

• DCRNN is more likely than baseline methods (e.g., FC-LSTM) to accurately predict abrupt changes in the traffic speed. As shown in Figure 3.7(b), DCRNN predicts the start and the end of the peak hours. This is because DCRNN captures the spatial dependency and is able to utilize the speed changes in neighboring sensors for more accurate forecasting.

Figure 3.8 visualizes examples of learned filters centered at different nodes with $K = 3$ on the METR-LA dataset. The star denotes the center, and the colors represent the weights. We observe that (1) the weights are well localized around the center, and (2) the weights diffuse along the road network, based on road network distance.

Chapter 4
Modeling Multi-modal Spatial Correlation: Spatiotemporal Multi-Graph Convolution Network for Ride-hailing Demand Forecasting

Region-level demand forecasting is an essential task in ride-hailing services. Accurate ride-hailing demand forecasting can guide vehicle dispatching, improve vehicle utilization, reduce wait times, and mitigate traffic congestion. This task is challenging due to the complicated spatiotemporal dependencies among regions. Existing approaches mainly focus on modeling the Euclidean correlations among spatially adjacent regions, while we observe that non-Euclidean pair-wise correlations among possibly distant regions are also critical for accurate forecasting. In this chapter, we propose the spatiotemporal multi-graph convolution network (ST-MGCN), a novel deep learning model for ride-hailing demand forecasting. We first encode the non-Euclidean pair-wise correlations among regions into multiple graphs and then explicitly model these correlations using multi-graph convolution. To utilize the global contextual information in modeling the temporal correlation, we further propose the contextual gated recurrent neural network, which augments the recurrent neural network with a context-aware gating mechanism to re-weight different historical observations. We evaluate the proposed model on two real-world large-scale ride-hailing demand datasets and observe consistent improvement of more than 10% over state-of-the-art baselines.

Spatiotemporal forecasting is a crucial task in urban computing.
It has a wide range of applications, from autonomous vehicle operations, to energy and smart grid optimization, to logistics and supply chain management. In this chapter, we study one important task: region-level ride-hailing demand forecasting, one of the essential components of intelligent transportation systems. The goal of region-level ride-hailing demand forecasting is to predict the future demand of regions in a city given historical observations. Accurate ride-hailing demand forecasting can help organize the vehicle fleet, improve vehicle utilization, reduce wait times, and mitigate traffic congestion (Yao et al., 2018b). This task is challenging mainly due to the complex spatial and temporal correlations. On the one hand, complicated dependencies are observed among different regions. For example, the demand of a region is usually affected by its spatially adjacent neighbors and, at the same time, correlated with distant regions that have a similar contextual environment. On the other hand, non-linear dependencies also exist among different temporal observations. The prediction for a certain time is usually correlated with various historical observations, e.g., an hour ago, a day ago, or even a week ago.

Recent advances in deep learning enable promising results in modeling the complex spatiotemporal relationships in region-based spatiotemporal forecasting. With convolutional neural networks and recurrent neural networks, state-of-the-art results are achieved in Xingjian et al. (2015); Li et al. (2017); Shi et al. (2017); Zhang et al. (2017, 2018b); Ma et al. (2017); Yao et al. (2018b,a). Despite these promising results, we argue that two important aspects are largely overlooked in modeling the spatiotemporal correlations. First, these methods mainly focus on modeling the Euclidean correlations among different regions; however, we observe that non-Euclidean pair-wise correlations are also critical for accurate forecasting. Figure 1.3 shows an example. Region 1, in addition to its neighboring region 2, may also correlate with a distant region 3 that shares similar functionality, i.e., they are both near schools and hospitals. Besides, region 1 may also be affected by region 4, which is directly connected to region 1 via a highway. Second, in these methods, when modeling the temporal correlation with an RNN, each region is processed independently or only based on local information. However, we argue that global and contextual information is also important. For example, a global increase or decrease in ride-hailing demand usually indicates the occurrence of some event that will affect future demand.
It augments RNN by learning a gating mechanism, which is calculated based on the summarized global informa- tion, to re-weight observations in different timestamps. When evaluated on two real-world large scale ride-hailing demand datasets, ST-MGCN consistently out- performs state-of-the-art baselines by a large margin. In summary, we make the following contributions: • We identify non-Euclidean correlations among regions in ride-hailing demand forecasting and propose to encode them using multiple graphs. Then we further leverage the proposed multi-graph convolution to explicitly model these correlations. • We propose the Contextual Gated RNN (CGRNN) to incorporate the global contextual information when modeling the temporal dependencies. • We conduct extensive experiments on two large-scale real-world datasets, and the proposed approach achieves more than 10% relative error reduction over state-of-the-art baseline methods for ride-hailing demand forecasting. 46 4.1 Preliminary AttentionMechanism Attentionhasbeenapopularconceptinthedeepneural networks. As motivated by how human pay visually attention to an image or text, the attention mechanism is first introduced in (Bahdanau et al., 2015) to improve machine translation tasks, especially for long sentences. The similar idea is applied on the visual domain (Xu et al., 2015b) and more improved mechanisms for translation have been proposed (Luong et al., 2015; Vaswani et al., 2017) and used for video question answering and activity recognition (Zhao et al., 2017; Pei et al., 2017; Du et al., 2018). Recently, the attention idea has been adopted to graph domain. Veličković et al. (2018) (GAT) use attention mechanism to model the spatial/topological dependencies. Zhang et al. (2018a) (GaAN) combine graph attention with recurrent neural network for spatial temporal forecasting. Channel-wise attention (Hu et al., 2018; Chen et al., 2017) is proposed in the computer vision literature. The intuition behind channel-wise attention is to learn aweightforeachchannel,inordertofindthemostimportantframesandemphasize them by giving higher weights. LetX∈R W×H×C denotes the input, whereW and H are the dimensions of the input image, and C denotes the number of channels, then the pipeline of channel-wise attention is defined as follows: z c =F pool (X :,:,c ) = 1 WH W X i=0 H X j=0 X i,j,c forc = 1, 2,···C s =σ(W 2 δ(W 1 z)) (4.1) ˜ X :,:,c =X :,:,c ◦s c forc = 1, 2,···C F pool is a global average pooling operation, which summarizes each channel into a scalarz c where c is the channel index. Then an attention operation is applied 47 to generate adaptive channel weightss by applying non-linear transformations on the summarized vectorz, whereW 1 andW 2 is the corresponding weights, δ and σ is the ReLU and sigmoid function respectively. After that, s is applied to the input via channel-wise dot product. Finally, the input channels are scaled based learned weights. In this work, we adopt the idea of channel-wise attention, and generalize it for temporal dependency modeling among a sequence of graphs. 4.2 Methodology We formalize the learning problem of spatiotemporal ride-hailing demand fore- casting and describe how to model the spatial and temporal dependencies using the proposed spatiotemporal multi-graph convolution network (ST-MGCN). 4.2.1 Region-level Ride-hailing Demand Forecasting We divide a city into equal-size grids, and each grid is defined as a regionv∈V, where V denotes the set of all disjoint regions in the city. 
4.2 Methodology

We formalize the learning problem of spatiotemporal ride-hailing demand forecasting and describe how to model the spatial and temporal dependencies using the proposed spatiotemporal multi-graph convolution network (ST-MGCN).

4.2.1 Region-level Ride-hailing Demand Forecasting

We divide a city into equal-size grids, where each grid is defined as a region $v \in V$ and $V$ denotes the set of all disjoint regions in the city. Let $X^{(t)}$ represent the number of orders in all regions at the $t$-th interval. The region-level ride-hailing demand forecasting problem is then formulated as single-step spatiotemporal prediction given an input with a fixed temporal length, i.e., learning a function $f: \mathbb{R}^{|V| \times T} \to \mathbb{R}^{|V|}$ that maps the historical demands of all regions to the demand at the next time step:

$$[X^{(t-T+1)}, \cdots, X^{(t)}] \xrightarrow{f(\cdot)} X^{(t+1)}$$

Framework Overview The system architecture of the proposed model ST-MGCN is shown in Figure 4.1 (Figure 4.1: System architecture of the proposed spatiotemporal multi-graph convolution network (ST-MGCN); different aspects of the relationships among regions, including neighborhood, functional similarity, and transportation connectivity, are encoded using multiple graphs; the proposed contextual gated recurrent neural network (CGRNN) first aggregates observations at different times considering the global contextual information, after which multi-graph convolution models the non-Euclidean correlations among regions and generates the prediction). We represent different aspects of the correlations between regions as multiple graphs, whose vertices represent regions and whose edges encode the pair-wise relationships among regions. First, we use the proposed Contextual Gated Recurrent Neural Network (CGRNN) to aggregate observations at different times, taking the global contextual information into account. After that, multi-graph convolution is applied to capture the different types of correlations between regions. Finally, a fully connected neural network is used to transform the features into the prediction.

4.2.2 Spatial Dependency Modeling

In this section, we show how to encode different types of correlations among regions using multiple graphs and how to model these relationships using the proposed multi-graph convolution. We model three types of correlations among regions with graphs: (1) the neighborhood graph $\mathcal{G}_N = (V, A_N)$, which encodes the spatial proximity; (2) the functional similarity graph $\mathcal{G}_F = (V, A_F)$, which encodes the similarity of the surrounding Points of Interest (POIs) of regions; and (3) the transportation connectivity graph $\mathcal{G}_T = (V, A_T)$, which encodes the connectivity between distant regions. Note that our approach can easily be extended to model new types of correlations by constructing the related graphs.

Neighborhood The neighborhood of a region is defined based on spatial proximity. We construct the graph by connecting a region to its 8 adjacent regions in a 3×3 grid:

$$A_{N,ij} = \begin{cases} 1, & v_i \text{ and } v_j \text{ are adjacent} \\ 0, & \text{otherwise} \end{cases} \tag{4.2}$$

Functional similarity When making a prediction for a region, it is intuitive to refer to other regions that are similar to it in terms of functionality. Region functionality can be characterized by the surrounding POIs of each category, and the edge between two vertices (regions) is defined as the POI similarity:

$$A_{S,ij} = \mathrm{sim}(P_{v_i}, P_{v_j}) \in [0, 1] \tag{4.3}$$

where $P_{v_i}$ and $P_{v_j}$ are the POI vectors of regions $v_i$ and $v_j$, respectively, whose dimension equals the number of POI categories and whose entries record the number of POIs of a specific category in the region.
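The chapter leaves sim(·,·) unspecified beyond requiring values in [0, 1]; as one plausible instantiation, the sketch below uses cosine similarity between the non-negative POI count vectors:

```python
import numpy as np

def functional_similarity_graph(poi):
    """Functional similarity adjacency (Equation 4.3) via cosine similarity.

    poi: (|V|, C) matrix; poi[i, c] counts POIs of category c around
         region v_i. Cosine similarity of non-negative vectors lies in [0, 1].
    """
    norms = np.linalg.norm(poi, axis=1, keepdims=True)
    unit = poi / np.maximum(norms, 1e-10)
    A = unit @ unit.T
    np.fill_diagonal(A, 0.0)              # drop self loops
    return A
```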
Transportation connectivity The transportation system is also an important factor in spatiotemporal prediction. Intuitively, regions that are geographically distant but conveniently reachable from one another can be correlated. Such connectivity is induced by roads such as motorways and highways, or by public transportation such as subways. Here, we define regions that are directly connected by these roads as "connected", and the corresponding edge is defined as:

$$A_{C,ij} = \max(0, \mathrm{conn}(v_i, v_j) - A_{N,ij}) \in \{0, 1\} \tag{4.4}$$

where $\mathrm{conn}(v_i, v_j)$ is the indicator function of the connectivity between $v_i$ and $v_j$. Note that the neighborhood edges are removed from the connectivity graph to avoid redundant correlations, which also results in a sparser graph.

Multi-graph convolution for spatial dependency modeling With these graphs constructed, we propose the multi-graph convolution to model the spatial dependency, as defined in Equation 4.5:

$$X^{l+1} = \sigma\left( \mathop{\mathrm{Agg}}_{A_i \in \mathbb{A}} \left( f(A_i; \theta_i) \, X^l \, W^l \right) \right) \tag{4.5}$$

where $X^l \in \mathbb{R}^{|V| \times P_l}$ and $X^{l+1} \in \mathbb{R}^{|V| \times P_{l+1}}$ are the feature vectors of the $|V|$ regions in layers $l$ and $l+1$, respectively. $\sigma$ denotes the activation function, and $\mathrm{Agg}$ denotes the aggregation function, e.g., max, average, or an attention-based method. $\mathbb{A}$ denotes the set of graphs, $f(A_i; \theta_i) \in \mathbb{R}^{|V| \times |V|}$ represents the aggregation matrix based on graph $A_i \in \mathbb{A}$, parameterized by $\theta_i$, and $W^l \in \mathbb{R}^{P_l \times P_{l+1}}$ denotes the feature transformation matrix. For example, if $f(A_i; \theta_i)$ is a polynomial function of the Laplacian matrix $L$, this becomes ChebNet (Defferrard et al., 2016) on multiple graphs; if $f(A_i; \theta_i) = I$, i.e., the identity matrix, this falls back to the fully connected network.

In the implementation, $f(A_i; \theta_i)$ is chosen to be the $K$-order polynomial function of the graph Laplacian $L$, and Figure 4.2 shows an example of the value transformation for a centered region through the graph convolution layer (Figure 4.2: An example of the ChebNet graph convolution centered at the black vertex. Left: the centered region is marked black, its one-hop neighbors yellow, and its two-hop neighbors red. Middle: as the degree of the graph Laplacian increases, the reception field grows, marked green. Right: the output of the layer is a sum over graph transformations with degrees from 1 to K). Suppose all entries of the adjacency matrix are 0 or 1; then a non-zero entry $L^k_{ij} \ne 0$ means that $v_i$ is able to reach $v_j$ in $k$ hops. In terms of the convolution operation, $k$ defines the size of the reception field during spatial feature extraction. We illustrate with the road connectivity graph $\mathcal{G}_C = (V, A_C)$ in Figure 1.3. In the adjacency matrix $A_C$, we have $A_{C,1,4} = 1$, $A_{C,1,6} = 0$, and $A_{C,4,6} = 1$, and the corresponding entries of the 1-degree graph Laplacian are $L^1_{C,1,4} \ne 0$, $L^1_{C,1,6} = 0$, and $L^1_{C,4,6} \ne 0$. If the maximum degree of the graph Laplacian $K$ is set to 1, the transformed feature vector of region 1, i.e., $X_{l+1,1,:}$, will not contain the feature vector of region 6, $X_{l,6,:}$, since $L^1_{C,1,6} = 0$. When $K$ is increased to 2, the corresponding entry $L^2_{C,1,6}$ becomes non-zero, and consequently $X_{l+1,1,:}$ can utilize information from $X_{l,6,:}$.

The multi-graph convolution based spatial dependency modeling is not restricted to the three types of region-wise relationships mentioned above; it can easily be extended to model other region-wise relationships as well as other spatiotemporal forecasting problems.
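A minimal NumPy sketch of one multi-graph convolution layer with sum aggregation and a ChebNet-style f(A_i; θ_i), matching the experimental settings described later in this chapter, is given below; the function names are ours:

```python
import numpy as np

def chebyshev_terms(L, X, K):
    """T_0(L)X, ..., T_{K-1}(L)X via the Chebyshev recurrence."""
    terms = [X, L @ X]
    for _ in range(2, K):
        terms.append(2.0 * (L @ terms[-1]) - terms[-2])
    return terms[:K]

def multi_graph_conv(X, laplacians, thetas, W_feat):
    """One multi-graph convolution layer (Equation 4.5).

    X:          (|V|, P) region features X^l.
    laplacians: list of (|V|, |V|) rescaled graph Laplacians, one per graph.
    thetas:     list of (K,) polynomial coefficients theta_i, one per graph.
    W_feat:     (P, Q) shared feature transformation matrix W^l.
    """
    agg = np.zeros_like(X)
    for L, theta in zip(laplacians, thetas):
        for k, TX in enumerate(chebyshev_terms(L, X, len(theta))):
            agg = agg + theta[k] * TX      # f(A_i; theta_i) X^l, summed over graphs
    return np.maximum(agg @ W_feat, 0.0)   # ReLU activation
```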
This approach models spatial dependencies by extracting features through the region-wise relationships. With a small reception field, the feature extraction focuses on close regions, i.e., neighbors that can be reached within a small number of hops. Increasing the maximum degree of the graph Laplacian or stacking multiple convolution layers increases the reception field and consequently encourages the model to capture more global dependencies.

Graph embedding is an alternative technique for modeling the region-wise correlation. In DMVST-Net (Yao et al., 2018b), the authors use a graph embedding, pre-computed so that it produces a time-invariant feature vector for each region, to represent the region-wise relationship, and then add these embeddings as extra features for each region. We argue that the spatial dependency modeling approach in ST-MGCN is preferable for the following reasons. ST-MGCN encodes the region-wise relationships into graphs and aggregates demand values from related regions via graph convolution, whereas in DMVST-Net the region-wise relationship is embedded into a time-invariant region-based feature used as model input. Although DMVST-Net also captures the topological information, it is hard to aggregate demand values from related regions through the region-wise relationship, and time-invariant features have limited contribution to model training.

4.2.3 Temporal Correlation Modeling

We propose the Contextual Gated Recurrent Neural Network (CGRNN) to model the correlations between observations at different timestamps. CGRNN incorporates contextual information into the temporal modeling by augmenting the RNN with a context-aware gating mechanism, whose architecture is shown in Figure 4.3 (Figure 4.3: Temporal correlation modeling with the contextual gated recurrent neural network (CGRNN); it first produces region descriptions using global average pooling over each observation and its graph convolution output, then transforms the summarized vector z into weights used to scale each observation, and finally applies a shared-weight RNN across all regions to aggregate the gated input sequence of each region into a single vector). Suppose we have $T$ temporal observations and $X^{(t)} \in \mathbb{R}^{|V| \times P}$ denotes the $t$-th observation, where $P$ is the feature dimension ($P$ is 1 if the feature only contains the number of orders). The workflow of the contextual gating mechanism is as follows:

$$\hat{X}^{(t)} = [X^{(t)}, F_{\mathcal{G}}^{K'}(X^{(t)})] \quad \text{for } t = 1, 2, \cdots, T \tag{4.6}$$

First, the contextual gating mechanism produces region descriptions by concatenating the historical data of a region with information from related regions. The information from related regions is regarded as contextual information, and is extracted by a graph convolution operation $F_{\mathcal{G}}^{K'}$ with maximum degree $K'$ (Equation 4.6) using the corresponding graph Laplacian matrix. The contextual gating mechanism thus incorporates information from related regions by performing a graph convolution before the pooling step.

$$z^{(t)} = F_{pool}(\hat{X}^{(t)}) = \frac{1}{|V|} \sum_{i=1}^{|V|} \hat{X}^{(t)}_{i,:} \quad \text{for } t = 1, 2, \cdots, T \tag{4.7}$$

Second, we apply global average pooling $F_{pool}$ over all regions to produce a summary of each temporal observation (Equation 4.7).

$$s = \sigma(W_2 \, \delta(W_1 z)) \tag{4.8}$$

Then an attention operation (Equation 4.8) is applied to the summarized vector $z$, where $W_1$ and $W_2$ are the corresponding weights, and $\delta$ and $\sigma$ are the ReLU and sigmoid functions, respectively.

$$\tilde{X}^{(t)} = X^{(t)} \circ s^{(t)} \quad \text{for } t = 1, 2, \cdots, T \tag{4.9}$$

Finally, $s$ is applied to scale each temporal observation (Equation 4.9).

$$H_{i,:} = \mathrm{RNN}\left(\tilde{X}^{(1)}_{i,:}, \cdots, \tilde{X}^{(T)}_{i,:}; W_3\right) \quad \text{for } i = 1, \cdots, |V| \tag{4.10}$$

After the contextual gating, an RNN layer with weights $W_3$, shared across all regions, aggregates the gated input sequence of each region into a single vector $H_{i,:}$ (Equation 4.10). The intuition behind sharing the RNN among regions is to find a universal aggregation rule for all regions, which encourages model generalization and reduces model complexity.
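The contextual gating workflow can be sketched as follows. The chapter fixes the pooling step and the σ(W₂ δ(W₁ z)) form but not the exact shape of the gating weights; treating the T temporal observations as channels, so that one scalar gate is produced per observation, is our reading:

```python
import numpy as np

def contextual_gating(X_seq, F_gconv, W1, W2):
    """Contextual gating (Equations 4.6-4.9), producing the re-weighted
    sequence that the shared RNN aggregates (Equation 4.10).

    X_seq:   (T, |V|, P) temporal observations.
    F_gconv: callable implementing the graph convolution F_G^{K'}.
    W1, W2:  gating weights; with D = 2 * P, W1 is (T * D, H) and W2 is
             (H, T), so one scalar gate is produced per observation.
    """
    # Eq. 4.6: append contextual features from related regions.
    X_hat = np.stack([np.concatenate([X_t, F_gconv(X_t)], axis=1)
                      for X_t in X_seq])          # (T, |V|, 2P)
    # Eq. 4.7: global average pooling over regions, then flatten over time.
    z = X_hat.mean(axis=1).reshape(-1)            # (T * 2P,)
    # Eq. 4.8: ReLU then sigmoid to obtain the gate values s.
    s = 1.0 / (1.0 + np.exp(-(np.maximum(z @ W1, 0.0) @ W2)))   # (T,)
    # Eq. 4.9: rescale each temporal observation by its gate value.
    return X_seq * s[:, None, None]
```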
4.3 Experiments

In this section, we compare the proposed model, ST-MGCN, with other state-of-the-art baselines for region-level ride-hailing demand forecasting.

Dataset We conduct experiments on two real-world large-scale ride-hailing datasets collected in two cities: Beijing and Shanghai. Both datasets contain ride-hailing orders in the main city zone within the period from Mar 1st, 2017 to Dec 31st, 2017. For the data split, we use the data from Mar 1st 2017 to Jul 31st 2017 for training, the data from Aug 1st 2017 to Sep 30th 2017 for validation, and the data from Oct 1st 2017 to Dec 31st 2017 for testing, as summarized in Table 4.1.

Table 4.1: Dataset split.

              # Intervals   Start       End
  Training    7344          2017/3/1    2017/7/31
  Validation  2928          2017/8/1    2017/9/30
  Test        4416          2017/10/1   2017/12/31

Figure 4.4 and Figure 4.5 show the demand heatmaps and the demand histograms of the two tested cities, respectively (Figure 4.4: Demand heatmaps of the two tested cities, (a) Beijing and (b) Shanghai. Figure 4.5: Data distribution; the two cities have similar demand patterns, i.e., highly skewed with values concentrating on small values). The POI data is collected in 2017 and contains 13 primary POI categories. Each region is associated with a POI vector, whose entries count the instances of each POI category in the region. The road network data used for the transportation connectivity evaluation is provided by OpenStreetMap (Haklay & Weber, 2008).

4.3.1 Experimental Settings

Recall that the learning task is formulated as learning a function $f: \mathbb{R}^{|V| \times T} \to \mathbb{R}^{|V|}$. In the experiment, we generate the region set $V$ by partitioning the city map into grids of size 1 km × 1 km, following industrial practice. There are in total 1,296 regions in Beijing and 896 regions in Shanghai. Following the practice in Zhang et al. (2017), the input of the network consists of 5 historical observations, including the 3 latest closeness components, 1 period component, and 1 latest trend component. In building the transportation connectivity graph, we consider high-speed roads, including motorways and highways, as well as subways. Two regions are regarded as "connected" as long as there is a high-speed road directly connecting them.

In the experiment, $f(A_i; \theta_i)$ in Equation 4.5 is chosen to be the Chebyshev polynomial function (Defferrard et al., 2016) of the graph Laplacian with degree $K$ equal to 2, and Agg is chosen to be the sum aggregation function. The number of hidden layers is 3, with 64 hidden units each, and L2 regularization with a weight decay of 1e-4 is applied to each layer.
In particular, the graph convolution degree $K'$ in CGRNN is set to 1. We use ReLU as the activation in the graph convolution network. The learning rate of ST-MGCN is set to 2e-3, and early stopping on the validation dataset is used. All neural network based approaches are implemented using TensorFlow (Abadi et al., 2016) and trained using the Adam optimizer (Kingma & Ba, 2015) to minimize RMSE. Training ST-MGCN takes 10 GB of RAM and 9 GB of GPU memory; the training process takes about 1.5 hours on a single Tesla P40.

Methods for evaluation We compare the proposed model (ST-MGCN) with the following methods for ride-hailing demand forecasting:

• Historical Average (HA): models the ride-hailing demand as a seasonal process and uses the average of previous seasons as the prediction. The period used is 1 week, and the prediction is based on aggregated data from the same time in previous weeks.

• LASSO, Ridge: take historical data from different timestamps as input for linear regression with L1 and L2 regularization, respectively.

• Auto-regressive models (VAR, STAR): VAR is the multivariate extension of the auto-regressive model, which is able to model the correlations between regions. STAR (Pace et al., 1998) is an AR extension specifically for spatiotemporal modeling problems. In the experiment, the number of lags used is 5.

• Gradient boosted machine (GBM): gradient boosting decision tree based regression implemented using LightGBM (Ke et al., 2017a). The following settings are used in the experiment: the number of trees is 50, the maximum depth is 4, and the learning rate is 2e-3.

• ST-ResNet (Zhang et al., 2017): a CNN-based framework for traffic flow prediction. The model uses a CNN with residual connections to capture the trend, the periodicity, and the closeness information.

• DMVST-Net (Yao et al., 2018b): a multi-view based deep learning approach for taxi demand prediction. It consists of three different views: the temporal view, the spatial view, and the semantic view, modeled with an LSTM, a CNN, and a graph embedding, respectively.

• DCRNN, ST-GCN: both DCRNN (Li et al., 2018d) and ST-GCN (Yu et al., 2018) are graph convolution based models for traffic forecasting. Both models use the road network to build non-Euclidean region-wise relationships. DCRNN models the spatiotemporal dependency by integrating graph convolution into the gated recurrent unit, while ST-GCN models both the spatial and temporal dependencies with convolutional structures and achieves better efficiency.

4.3.2 Performance Comparison
For all approaches, we tune the model parameters using grid search based on the performance on the validation dataset, and report the performance on the testing dataset over multiple runs. We evaluate the performance based on two popular metrics: Root Mean Square Error (RMSE) and Mean Absolute Percentage Error (MAPE). Following the practice in Yao et al. (2018b), we filter out samples with demand values less than 10 when computing MAPE. Table 4.2 shows the test error comparison of different approaches for ride-hailing demand forecasting over ten runs.

Table 4.2: Performance comparison of different approaches for ride-hailing demand forecasting. ST-MGCN achieves the best performance on all metrics on both datasets.

             Beijing                    Shanghai
  Method     RMSE          MAPE (%)     RMSE          MAPE (%)
  HA         16.14         23.9         17.15         34.8
  LASSO      14.24±0.14    23.8±0.8     10.62±0.06    22.9±0.8
  Ridge      14.24±0.11    23.8±0.9     10.61±0.04    23.1±0.8
  VAR        13.32±0.17    22.4±1.6     10.54±0.18    23.7±1.4
  STAR       13.16±0.22    22.2±1.9     10.52±0.21    23.2±1.4
  GBM        13.66±0.16    23.1±1.5     10.25±0.11    23.4±1.2
  STResNet   11.77±0.95    14.8±6.0     9.87±0.94     14.9±6.0
  DMVST-Net  11.62±0.48    12.3±5.5     9.61±0.44     13.8±1.2
  ST-GCN     11.62±0.36    10.1±5.1     9.29±0.31     11.2±1.3
  ST-MGCN    10.78±0.25    8.8±3.5      8.30±0.16     9.3±0.9

We observe the following phenomena on both datasets:

• Deep learning based methods, including ST-ResNet, DMVST-Net, ST-GCN, and the proposed ST-MGCN, which are able to model non-linear spatiotemporal dependencies, generally outperform the other baselines;

• ST-MGCN achieves the best performance on all metrics on both datasets, outperforming the second-best baseline by at least 10% in terms of relative error reduction, which suggests the effectiveness of the proposed approaches to spatiotemporal correlation modeling;

• Compared with the other deep learning models, ST-MGCN also shows lower variance.

4.3.3 Effect of Spatial Dependency Modeling

To investigate the effect of spatial dependency modeling, we evaluate variants of ST-MGCN obtained by removing different components from the model: (1) the neighborhood graph, (2) the functional similarity graph, and (3) the transportation connectivity graph. The results are shown in Table 4.3. Removing any graph component causes a significant error increase, which justifies the importance of each type of relationship. These graphs encode important prior knowledge, i.e., region-wise correlations, which is leveraged for more accurate forecasting.

Table 4.3: Effect of spatial correlation modeling on the Beijing dataset. Removing any component results in a statistically significant error increase.

  Removed component   RMSE
  Neighborhood        11.47
  Functional          11.42
  Transportation      11.69
  ST-MGCN             10.78

To evaluate the effect of incorporating multiple region-wise relationships, we extend the existing single-graph-based models, DCRNN (Li et al., 2018d) and ST-GCN (Yu et al., 2018), with the multi-graph convolution framework; the resulting models are DCRNN+ and ST-GCN+. As shown in Table 4.4, both DCRNN+ and ST-GCN+ achieve improved performance, which shows the effectiveness of incorporating multiple region-wise relationships.

Table 4.4: Effect of adding the multi-graph design to existing methods on the Beijing dataset. Adding extra graphs to the original models results in a statistically significant error decrease.

  Model     RMSE
  ST-GCN    11.62
  ST-GCN+   11.20
  DCRNN     12.02
  DCRNN+    11.55
  ST-MGCN   10.78

4.3.4 Effect of Temporal Dependency Modeling

To further investigate the effect of temporal dependency modeling, we evaluate variants of ST-MGCN that use different methods for temporal modeling: (1) Average pooling, which aggregates different temporal observations using average pooling; (2) RNN, which aggregates temporal observations using a recurrent neural network; (3) CG, which uses contextual gating to re-weight different temporal observations but without an RNN; and (4) GRNN, which is CGRNN without the graph convolution (Equation 4.6). The results are shown in Table 4.5.

Table 4.5: Effect of temporal correlation modeling on the Beijing dataset.

  Temporal modeling approach   RMSE
  Average pooling              12.74
  RNN                          11.05
  CG                           11.82
  GRNN                         10.91
  CGRNN                        10.78

We observe the following phenomena:

• Average pooling, which blindly averages different observations, has the worst performance, while RNN, which performs content-dependent non-linear temporal aggregation, achieves clearly improved results.
• CGRNN, which augments the RNN with the contextual gating mechanism, achieves further improvement over RNN. Moreover, removing either the RNN (CG) or the graph convolution operation (GRNN) results in clearly worse performance, which justifies the effectiveness of each component.

4.3.5 Effect of Model Parameters

To study the effects of different hyperparameters of the proposed model, we evaluate models on the Beijing dataset while varying two of the most important hyperparameters, i.e., the degree $K$ and the number of layers of the graph convolution. Figure 4.6 shows the performance on the test set (Figure 4.6: Effect of the number of layers and the polynomial order K of the graph convolution on the Beijing dataset). We observe that as the number of layers increases, the error first decreases and then increases, while as $K$ increases, the error first decreases and then plateaus. A larger $K$ or more layers enables the model to capture more global correlations, at the cost of increased model complexity and greater proneness to overfitting.

Chapter 5
Inferring Spatial Correlation: Structure-informed Graph Auto-encoder for Relational Inference and Simulation

A variety of real-world applications require the modeling and simulation of dynamical systems, e.g., in physics, transportation, climate, and social networks. This task is challenging, as we can only observe the movement of objects without knowing the explicit underlying interactions. To address this challenge, we propose the Structure-informed Graph Auto-encoder for Relational inference and simulation (SUGAR). The model takes the form of a variational auto-encoder whose latent variables represent the underlying interactions among objects. The encoder infers the interactions among objects based on both the observed movements and structural prior knowledge, e.g., the node degree distribution, the edge type distribution, and sparsity; the decoder then predicts the future states of objects according to the inferred interactions. Specifically, SUGAR represents structural prior knowledge as differentiable constraints on the interaction graph, and these constraints can be optimized using gradient-based methods. Experimental results on both synthetic and real-world datasets show that our approach clearly outperforms other state-of-the-art methods in terms of both interaction recovery and simulation.

Modeling and simulation of dynamical systems have various applications in domains including physics, transportation, climate, and social networks. These dynamical systems can be represented as groups of interacting objects. It is challenging to model the dynamics of these systems, as usually we only have access to the movements of individual objects, rather than the underlying interactions. Recently, much work has been done on learning the dynamic model of interacting systems using implicit interaction models (Sukhbaatar et al., 2016; Guttenberg et al., 2016; Scarselli et al., 2009; van Steenkiste et al., 2018), where interactions are modeled implicitly by message passing or through the attention mechanism.
Kipf et al. (2018) propose the Neural Relational Inference (NRI) model, an approach that infers explicit interactions while simultaneously learning the dynamics purely from observational data. However, as complexity increases, it becomes challenging to recover the true interactions solely from observed data, and it is desirable to incorporate prior knowledge about the structure of the interactions. Figure 5.1 shows a motivating example (Figure 5.1: (a) Movement of a chain of connected objects under a gravity field; (b), (c) incorporating structural prior knowledge, e.g., L0 sparsity, the node degree distribution, and the edge distribution, helps find the ground-truth interactions; and (d) this improves simulation performance). We observe the movements of a set of objects that are connected by springs in a chain structure under a gravity field. Due to the global gravity and the deeply entangled movements, NRI tends to infer redundant interactions and consequently produces degenerated simulations. In this work, we address the problem of incorporating structural prior knowledge to better infer the explicit interactions and learn the system dynamics.

We propose the Structure-informed Graph Auto-encoder for Relational inference and simulation (SUGAR). SUGAR takes the form of a variational auto-encoder, where the discrete latent variables represent the underlying interactions among objects. Both the encoder and the decoder employ a graph network-based architecture with node, edge, and global features. The model can incorporate various structural priors, e.g., the node degree distribution, sparsity, and the edge type distribution, in a differentiable manner. For the example in Figure 5.1, suppose we know some properties of the underlying interaction structure, e.g., that the objects are connected in a chain and that each object usually directly interacts with a small number of objects; then we can recover the true interactions (Figure 5.1(c)) and improve simulation (Figure 5.1(d)). In particular, we provide the following key contributions:

• We propose novel approaches to integrate various structural priors for better interaction recovery and system dynamics modeling.

• We propose a new encoder and decoder design that explicitly models global variables, e.g., gravity. This also facilitates communication between nodes that are not directly connected.

• We conducted a wide range of experiments on both synthetic and real-world datasets. The results show that by exploiting structural prior knowledge, SUGAR clearly outperforms state-of-the-art methods and manages to discover ground-truth interactions and simulate future movements with high fidelity.

5.1 Related Work

Our work draws on several lines of previous research. Battaglia et al. (2016); Guttenberg et al. (2016); Chang et al. (2017); Sanchez-Gonzalez et al. (2018); Scarselli et al. (2009) investigate the problem of learning the dynamics of a physical system from simulated trajectories and from generated video data (Watters et al., 2017; van Steenkiste et al., 2018) with a graph neural network. Li et al. (2018b) infer a residual graph based on a given structure. A number of recent works based on graph networks (Monti et al., 2017; Velickovic et al., 2018; van Steenkiste et al., 2018) have the ability to focus on specific neighbors when aggregating information, via the attention mechanism.
These works either assume a known graph structure or infer interactions implicitly. We aim to infer interactions while simultaneously learning the dynamics purely from observational data. The most related work is Kipf et al. (2018), where the authors propose to learn the explicit interactions among objects using a variational graph auto-encoder. The main differences from our work are that (1) we introduce different ways to incorporate structural prior knowledge into the model, e.g., the node degree distribution and the edge type distribution, and (2) we design an improved encoder and decoder architecture that is able to model the global information.

Our work also relates to the literature on graph generation (You et al., 2018; Bojchevski et al., 2018; Simonovsky & Komodakis, 2018; Liu et al., 2018; Kipf & Welling, 2016; Wang et al., 2017). However, instead of generating a graph from scratch, this work focuses on inferring the interactions/edges among a set of given nodes. Some concurrent work also investigates the problem of inferring the graph structure. Grover et al. (2019) learn the graph structure using an iterative graph refinement strategy with low-rank approximations. Franceschi et al. (2019) propose to learn the graph structure by refining an initial KNN graph. However, none of these approaches provide a general way to incorporate prior knowledge.

Table 5.1: Notation

Name                                  Description
G                                     A graph (u, V, E)
u                                     Global variable of the graph
V                                     Nodes of the graph
E                                     Edges of the graph
$v_i$                                 The i-th node
$e_k$, $e_{ij}$                       The k-th edge; the edge from $v_i$ to $v_j$
$z_k$, $z_{ij}$                       The latent random variable representing the distribution of edge $e_k$, $e_{ij}$
$\phi_v^l, \phi_e^l, \phi_u^l$        The update functions of the node, edge and global variable of the encoder in layer l
$\tilde{\phi}_v^l, \tilde{\phi}_e^l, \tilde{\phi}_u^l$   The update functions of the node, edge and global variable of the decoder in layer l
$x^{(t)}$                             Observation at time t
$x_i^{(t)}$                           Observation of node $v_i$ at time t
$\mathcal{L}_\cdot$                   Various loss functions

5.2 Methodology

Problem Definition. Given a sequence of observations $x = (x^{(1)}, \cdots, x^{(T)}) \in \mathbb{R}^{T \times |V| \times P}$, which consists of the observations of $|V|$ objects over $T$ time steps, we want to simultaneously learn the interactions among the objects and predict their future states.

We use a customized graph network (Battaglia, 2018) to model the movement of these objects. The graph consists of three components, $G = (u, V, E)$, where $u$ is the global variable, $V = \{v_i\}_{i=1:|V|}$ is the set of nodes, and $E = \{(e_k, r_k, s_k)\}_{k=1:|E|}$ is the set of edges, where $e_k$ is the attribute of the $k$-th edge, and $s_k, r_k$ are the indices of the sender and receiver nodes respectively. We use the latent variable $z$ to represent the relations among objects, where $z_k$ represents the distribution of the interaction type of $e_k$.

Figure 5.2: Model architecture of SUGAR. The encoder takes as input a sequence of observations, x, and estimates the interactions z, while the decoder takes as input the estimated interaction graph and learns the system dynamics to predict the future state. The interaction constraint component calculates the loss function based on the specified structural prior knowledge.
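For concreteness, this $(u, V, E)$ decomposition with sender/receiver index lists can be rendered as a small data structure. The following PyTorch sketch uses our own field names purely for illustration; it is not taken from the dissertation's code.

```python
from typing import NamedTuple
import torch

class Graph(NamedTuple):
    u: torch.Tensor          # global variable, shape (d_u,)
    v: torch.Tensor          # node attributes v_i, shape (|V|, d_v)
    e: torch.Tensor          # edge attributes e_k, shape (|E|, d_e)
    senders: torch.Tensor    # s_k: sender node index of edge k, shape (|E|,)
    receivers: torch.Tensor  # r_k: receiver node index of edge k, shape (|E|,)
```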
A summary of the main notations used in this chapter is provided in Table 5.1.

We formalize SUGAR based on the variational auto-encoder (Kingma & Welling, 2014; Kipf & Welling, 2016; Kipf et al., 2018). The model consists of three components: the encoder, the decoder, and the component that incorporates structural prior knowledge. Both the encoder and the decoder are based on customized graph networks. Figure 5.2 shows the architecture of SUGAR. The encoder takes as input a sequence of observations, x, and estimates the interactions z, while the decoder takes as input the estimated interaction graph and learns the system dynamics to predict the future state. The interaction constraint component calculates the regularizations based on various structural prior knowledge.

There are some main differences between our model and NRI (Kipf et al., 2018). First, we provide effective and concrete ways to encode different prior knowledge into the model in a differentiable way. Besides, we design an improved encoder and decoder architecture that explicitly models the global features. This helps capture global interactions and facilitates communication between nodes that are not directly connected.

5.2.1 Encoder

In SUGAR, the encoder is used to infer pairwise interactions among objects based on the observations x. It employs a graph network with a fully-connected graph structure, with two rounds of updates as follows.

Initialization of node and edge features:

$$v_i = \phi_{emb}(x_i^{(1)}, x_i^{(2)}, \cdots, x_i^{(T)}) \quad (5.1)$$
$$e_k = 0 \quad (5.2)$$

Then each round consists of the following three steps: (1) edge update, which updates each edge based on its two connected nodes and the global variable;

$$e_k^{l+1} = \phi_e^l\left(e_k^l, v_{r_k}^l, v_{s_k}^l, u^l\right) \quad (5.3)$$

(2) node update, which aggregates all the information from incoming edges;

$$v_i^{l+1} = \phi_v^l\left(v_i^l, \sum_{r_k = i} e_k^l, u^l\right) \quad (5.4)$$

(3) global update, which updates the global features with the aggregated node and edge features,

$$u^{l+1} = \phi_u^l\left(\sum_k e_k^{l+1}, \sum_i v_i^{l+1}, u^l\right) \quad (5.5)$$

where $\phi_\cdot^l(\cdot)$ denote the update functions of the encoder in layer $l$, each of which is usually a multi-layer perceptron.

Interaction generation. Based on the updated edge attributes, we infer the corresponding interactions. We use a continuous approximation of the discrete distribution and the reparametrization trick to obtain gradients through it. Specifically, the concrete distribution (Maddison et al., 2017) is used:

$$z_k = \text{softmax}\left(\frac{e_k + g}{\tau}\right) \quad (5.6)$$

where $e_k$ is the edge feature in the previous graph network layer, $g$ is a vector of i.i.d. samples from the Gumbel(0, 1) distribution, and $\tau$ controls the approximation. The distribution converges to the argmax function when $\tau \to 0$.

5.2.2 Decoder

The decoder takes as input the observation $x^{(t)}$ and the inferred interactions, and outputs $\Delta x^{(t)}$ with two rounds of updates, with the following processing steps:

$$e_k^{l+1} = \sum_m z_{k,m} \tilde{\phi}_e^l\left(e_k^l, v_{r_k}^l, v_{s_k}^l, u^l\right) \quad (5.7)$$

where $z_{k,m}$ denotes the probability of $e_k$ being the $m$-th type. Note that each type of interaction has its dedicated update function to enforce the effect of the edge type. Then the decoder updates the node and the global information:

$$v_i^{l+1} = \tilde{\phi}_v^l\left(v_i^l, \sum_{r_k = i} e_k^l, u^l\right) \quad (5.8)$$
$$u^{l+1} = \tilde{\phi}_u^l\left(\sum_k e_k^{l+1}, \sum_i v_i^{l+1}, u^l\right) \quad (5.9)$$

Finally, the decoder predicts the observation at the next time stamp. Here $\tilde{\phi}_\cdot^l(\cdot)$ denote the update functions of the decoder in layer $l$:

$$\Delta x_i^{(t)} = \tilde{\phi}_x(v_i^{l+1}) \quad (5.10)$$
$$q(x_i^{(t+1)} \mid x^{(t)}, z) = \mathcal{N}\left(x_i^{(t)} + \Delta x_i^{(t)}, \sigma^2 I\right) \quad (5.11)$$
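Both the encoder and the decoder are built from rounds of the edge, node, and global updates above. The following is a minimal PyTorch sketch of one such round (Equations 5.3-5.5) together with the concrete-distribution sampling of Equation 5.6; module names, hidden sizes, and the MLP shapes are our own illustrative choices, not the implementation used in the experiments.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def mlp(in_dim, out_dim, hidden=64):
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, out_dim))

class GNBlock(nn.Module):
    """One round of edge (Eq. 5.3), node (Eq. 5.4), and global (Eq. 5.5) updates."""
    def __init__(self, dim=64):
        super().__init__()
        self.phi_e = mlp(4 * dim, dim)  # input: [e_k, v_{r_k}, v_{s_k}, u]
        self.phi_v = mlp(3 * dim, dim)  # input: [v_i, aggregated edges, u]
        self.phi_u = mlp(3 * dim, dim)  # input: [sum_k e_k, sum_i v_i, u]

    def forward(self, v, e, u, senders, receivers):
        # v: (N, d) nodes, e: (E, d) edges, u: (d,) global variable;
        # senders/receivers: (E,) long tensors of node indices.
        u_e = u.expand(e.size(0), -1)
        e_new = self.phi_e(torch.cat([e, v[receivers], v[senders], u_e], -1))
        # Eq. 5.4 sums the incoming edge features e_k^l of each node i.
        agg = torch.zeros_like(v).index_add_(0, receivers, e)
        v_new = self.phi_v(torch.cat([v, agg, u.expand(v.size(0), -1)], -1))
        u_new = self.phi_u(torch.cat([e_new.sum(0), v_new.sum(0), u], -1))
        return v_new, e_new, u_new

def sample_interactions(edge_logits, tau=0.5):
    # Concrete distribution of Eq. 5.6: softmax((e_k + g) / tau) with
    # g ~ Gumbel(0, 1), sampled via -log(-log(U)) for U ~ Uniform(0, 1).
    u01 = torch.rand_like(edge_logits).clamp_min(1e-10)
    g = -torch.log(-torch.log(u01))
    return F.softmax((edge_logits + g) / tau, dim=-1)
```

In the decoder, the same round would be repeated with one dedicated edge MLP per interaction type, mixed by the sampled type probabilities as in Equation 5.7.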
Recurrent Decoder. In many real-world cases, the Markov assumption may not hold. Thus, a recurrent version of the decoder is used, where each node has additional hidden states, and $\tilde{\phi}_v^l$ in Equation 5.8 is implemented using a Gated Recurrent Unit (GRU).

5.2.3 Incorporating Structural Prior Knowledge

For a dynamical system, we usually have prior knowledge about the properties of its interactions, which can help recover the true interactions. In this work, we are particularly interested in edge/interaction-level and node/object-level structural knowledge. Examples of interaction-level structural knowledge are the distribution of interaction types and the sparsity of interactions, while examples of object-level structural knowledge are the distribution of node degrees and the maximum/minimum number of interactions of a node.

We incorporate prior knowledge by extending the regularization term in the ELBO, i.e., $D[q(z|x) \| p(z)]$. Specifically:

• Customize the target probability distribution $p(z)$ and minimize the KL divergence following the framework of the VAE, e.g., for the node degree distribution and the edge type distribution.

• For priors that cannot easily be represented as a probability distribution, e.g., graph alignment, $L_0$ sparsity, and symmetry, we define customized distance metrics $D$.

In this section, we show several examples of encoding structural knowledge as differentiable graph constraints. For simplicity of illustration, we assume there are only two types of interactions, where the first type means "no edge". Thus, $z_k$ denotes the probability that an interaction exists between $v_{s_k}$ and $v_{r_k}$, and $\hat{z}_k$ denotes an instance sampled from that distribution.

Interaction-level structural knowledge. With the probability distribution of interactions, we can incorporate various interaction-level structural knowledge.

Interaction sparsity. One important example is the sparsity prior, which aims to minimize the number of interactions measured using the $L_0$ distance. The $L_0$ distance is not differentiable in general; however, with the probability distribution of interactions, we can minimize the expected number of interactions by penalizing the probability of an interaction existing between nodes:

$$\mathcal{L}_0(z) = \sum_k \mathbb{E}_{e' \sim z}\left[\mathbb{I}(e'_{k,0} \neq 1)\right] = \sum_k z_{k,0}$$

This idea can be further generalized to "prior graph alignment", which aims to minimize the number of interactions that differ from a specified graph.

Prior graph alignment. Given a prior graph, where $e_k^*$ represents the one-hot edge type of the $k$-th edge of the prior graph, we want to optimize the prediction performance using a graph that has the minimum expected number of edges that differ from the prior graph:

$$\mathcal{L}_G(z) = \sum_k \sum_m \mathbb{E}_{e' \sim z}\left[\mathbb{I}(e'_{k,m} \neq e^*_{k,m})\right] = \sum_k \sum_m z_{k,m} e^*_{k,m}$$

Interaction type distribution. Similar to the node degree distribution, we can enforce the interaction type distribution by first calculating the interaction type distribution from $z$ and then minimizing the KL divergence (or another differentiable distance metric between distributions).
Object-level structural knowledge. We can encode object-level structural knowledge by first summarizing object-level distributions of interactions from $z$ and then minimizing a differentiable distance metric, e.g., the KL divergence, between them and the prior distribution.

Node degree distribution. One important example of object-level structural knowledge is the distribution of node degrees. A larger node degree means more densely related objects. Suppose the node out-degree $d_O(v_i) \sim p_d(\cdot)$, with

$$d_O(v_i) = \mathbb{E}_{\hat{z} \sim z}\left(\sum_{s_{k'} = i} \hat{z}_{k'}\right) = \sum_{s_{k'} = i} z_{k'};$$

then we want the node degree distribution of the generated graph, i.e., $q_d(\cdot)$, to be close to $p_d(\cdot)$:

$$\mathcal{L}_d(z) = D_{KL}[q_d(\cdot) \| p_d(\cdot)] = \mathbb{E}_{d(v) \sim q_d}(\log q_d - \log p_d)$$
$$= -\frac{1}{|V|}\sum_i \log p_d(d_O(v_i)) + \text{const} = -\frac{1}{|V|}\sum_i \log p_d\left(\sum_{s_{k'} = i} z_{k'}\right) + \text{const}$$

In the case that $p_d(\cdot)$ is discrete, we can use a continuous approximation; e.g., the Gumbel softmax (Maddison et al., 2017) can be used for the multinomial distribution.

Many priors can be specialized to a particular object, e.g., $L_0$ sparsity, as they can be written as a sum of constraints over individual objects/interactions. Besides, we can also specify node/edge-dependent prior distributions to accomplish this.
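To make these constraints concrete, the following is a minimal PyTorch sketch of two of the regularizers above, written under our own conventions: column 0 of `z` is the "no-edge" type, so the per-edge probability of an interaction is the mass on the remaining columns, and `prior_log_pd` is an assumed callable returning the prior log-density $\log p_d(\cdot)$ of a (soft) degree. This is an illustration, not the dissertation's implementation.

```python
import torch

def expected_l0(z):
    # z: (E, M) edge-type probabilities, type 0 = "no edge".
    # Expected number of interactions: total probability mass on the
    # interaction types, summed over all candidate edges.
    return z[:, 1:].sum()

def degree_prior_loss(z, senders, num_nodes, prior_log_pd):
    # Expected (soft) out-degree of each node: sum, over the edges it
    # sends, of the probability that the edge is present.
    p_edge = z[:, 1:].sum(dim=-1)                         # (E,)
    deg = torch.zeros(num_nodes).index_add_(0, senders, p_edge)
    # Monte-Carlo form of the degree term: -1/|V| * sum_i log p_d(d_O(v_i)),
    # up to the additive constant, as in the derivation above.
    return -prior_log_pd(deg).mean()
```

For instance, `prior_log_pd` could be the log-density of a distribution concentrated on the small degrees expected for a chain structure; both terms are differentiable in `z` and can simply be added, with coefficients, to the ELBO.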
5.3 Experiment

5.3.1 Datasets

We conduct experiments on both a physical simulation dataset, Mass (Sanchez-Gonzalez et al., 2018), and a real-world dataset, Skeleton (Kipf et al., 2018).¹ Examples from these datasets are shown in Figure 5.3.

Mass: contains the observations of a chain of objects connected by strings moving in the gravity field, generated by the physical simulation system in Sanchez-Gonzalez et al. (2018). The number of objects ranges from 5 to 8. The length of the input sequence is 10, while the length of the output sequence is 20. There are 50K samples for training, 10K samples for validation, and 10K samples for testing.

Skeleton: The CMU Motion Capture Database² contains a large number of trajectories of different human activities, including walking, jogging, and dancing. Each sample in the dataset consists of the 3D trajectories of 31 nodes, each of which tracks a joint. Here we follow the data selection process in Kipf et al. (2018): we choose 23 non-overlapping walking trials from the database and split them into training (11 trials), validation (5 trials), and test (7 trials) sets. We use the original form of the motion data (which only contains the positions of each joint) in all experiments.

¹ We also conduct experiments on the Spring dataset (Kipf et al., 2018). Both NRI and SUGAR achieve near-perfect results, i.e., visually no difference from the ground truth and interaction recovery accuracy greater than 99%.
² http://mocap.cs.cmu.edu/

Figure 5.3: Examples of observations in the experimental datasets: (a) Mass, (b) Skeleton.

5.3.2 Experimental Settings

Methods for evaluation. We compare the proposed method with the following approaches:

• Static: which assumes a constant state, $x^{(t+1)} = x^{(t)}$;

• VAR: a Vector Auto-Regression model (Hamilton, 1994) which takes as input the trajectories of all the objects concatenated along the feature dimension;

• LSTM (single): a long short-term memory (LSTM) based sequence-to-sequence recurrent neural network. The weights are shared across different objects;

• LSTM (joint): an LSTM-based sequence-to-sequence recurrent neural network which takes as input the trajectories of all the objects concatenated along the feature dimension and makes the prediction as a whole;

• Graph Network (GN) (Sanchez-Gonzalez et al., 2018): a learnable forward and inference model with relational inductive bias. The fully connected graph is used as the input;

• Neural relational inference model (NRI) (Kipf et al., 2018): a variational auto-encoder based inference model with a graph network; the hidden dimension is 256. For the predictions on the Mass and Skeleton datasets, the KL divergence based sparsity prior is used;

• SUGAR-NP: the variant of SUGAR without the graph constraints derived from structural prior knowledge.

All neural network based approaches are implemented using PyTorch (Paszke et al., 2017) and are trained using the Adam optimizer (Kingma & Ba, 2015) with learning rate annealing. The best hyperparameters are chosen based on the performance on the validation dataset. Both the encoder and the decoder contain two graph network blocks with hidden dimension 64, such that the model has a similar number of parameters to NRI. The initial learning rate is 5e-4 and is exponentially reduced by a ratio of 0.2 every 50 epochs. Early stopping on the validation dataset is used. We use the multi-step prediction trick (Kipf et al., 2018), i.e., feeding in the ground truth every 10 time steps, to avoid a degenerate decoder. The sparsity constraint and the node degree distribution constraint are used on the Mass and Skeleton datasets. Besides, the recurrent decoder is used on the Skeleton dataset. Note that SUGAR and NRI share the same inputs on both Mass and Skeleton, i.e., SUGAR does not have additional inputs such as the gravity. Instead, the global variable is a zero vector in the input layer, and the global variables are designed to facilitate communication between nodes that are not directly connected (in the decoder) and to additionally capture global interactions.

Table 5.2: Simulation performance w.r.t. MSE

Dataset            Mass (×10⁻²)           Skeleton (×10⁻⁴)
Prediction steps   1      10     20       1      10     20
Static             155    653    770      1.80   62.2   215
VAR                24.1   77.3   140      173    211    240
LSTM (Single)      85.1   162    198      3.64   38.7   109
LSTM (Joint)       9.04   25.5   74.1     2.82   28.2   75.0
GN (full graph)    68.7   186    238      4.35   47.4   135
NRI                11.1   40.2   104      3.81   39.3   109
SUGAR              2.01   7.09   31.6     1.72   15.0   40.3

Table 5.3: Simulation performance w.r.t. MAE

Dataset            Mass (×10⁻²)           Skeleton (×10⁻²)
Prediction steps   1      10     20       1      10     20
Static             159    332    375      1.75   8.98   16.5
VAR                57.6   111    147      16.0   17.4   18.5
LSTM (Single)      115    160    176      2.32   6.94   11.5
LSTM (Joint)       35.2   57.2   96.0     2.06   6.05   9.80
GN (full graph)    94.5   169    195      2.54   7.60   12.7
NRI                33.4   71.6   117      2.42   7.15   11.8
SUGAR              6.86   16.8   43.3     1.56   4.22   6.72

5.3.3 Simulation Performance

Evaluation metrics. Suppose $x = x^{(1)}, \cdots, x^{(T)}$ represents the ground truth, $\hat{x} = \hat{x}^{(1)}, \cdots, \hat{x}^{(T)}$ represents the predicted values, and $T$ denotes the number of predicted steps; the metrics are defined as follows.

Table 5.4: Simulation performance w.r.t. MAPE

Dataset            Mass (%)                  Skeleton (%)
Prediction steps   1      10     20          1     10    20
Static             47.37  96.23  119.64      0.82  4.18  7.77
VAR                15.74  32.28  46.61       7.14  7.71  8.24
LSTM (Single)      32.87  45.76  54.11       1.05  3.12  5.26
LSTM (Joint)       10.63  17.46  32.35       0.92  2.68  4.37
GN (full graph)    28.24  51.14  64.12       1.16  3.42  5.79
NRI                10.28  21.25  38.38       1.11  3.26  5.39
SUGAR              2.00   4.80   14.41       0.74  2.01  3.24
Table 5.5: Simulation performance w.r.t. SMAPE

Dataset            Mass (%)                  Skeleton (%)
Prediction steps   1      10     20          1     10    20
Static             46.23  90.81  105.01      0.82  4.18  7.74
VAR                16.02  34.40  53.84       7.15  7.74  8.26
LSTM (Single)      34.16  50.69  64.17       1.05  3.12  5.24
LSTM (Joint)       10.62  17.49  33.98       0.92  2.68  4.37
GN (full graph)    27.99  50.57  66.48       1.16  3.42  5.77
NRI                10.19  21.42  40.47       1.11  3.25  5.38
SUGAR              1.96   4.61   13.66       0.74  2.01  3.22

Mean Squared Error (MSE):
$$\text{MSE}(x, \hat{x}) = \frac{1}{T}\sum_{t=1}^{T}\left(x^{(t)} - \hat{x}^{(t)}\right)^2$$

Mean Absolute Error (MAE):
$$\text{MAE}(x, \hat{x}) = \frac{1}{T}\sum_{t=1}^{T}\left|x^{(t)} - \hat{x}^{(t)}\right|$$

Mean Absolute Percentage Error (MAPE):
$$\text{MAPE}(x, \hat{x}) = \frac{1}{T}\sum_{t=1}^{T}\left|\frac{x^{(t)} - \hat{x}^{(t)}}{x^{(t)}}\right|$$

Symmetric Mean Absolute Percentage Error (SMAPE):
$$\text{SMAPE}(x, \hat{x}) = \frac{1}{T}\sum_{t=1}^{T}\frac{\left|x^{(t)} - \hat{x}^{(t)}\right|}{\left(|x^{(t)}| + |\hat{x}^{(t)}|\right)/2}$$

Tables 5.2, 5.3, 5.4 and 5.5 show the performance comparison of the different approaches on the two datasets based on the MSE, MAE, MAPE and SMAPE metrics respectively, with the best values highlighted. We observe that:

• SUGAR consistently achieves the best performance on both datasets for all prediction steps, which suggests the effectiveness of the proposed algorithm. The superiority of SUGAR becomes more significant with the increase of the number of prediction steps;

• The performance of GN with the full graph is significantly worse than both NRI and SUGAR, which suggests the importance of inferring the interactions;

• Besides, as shown in Figure 5.4, SUGAR outperforms SUGAR-NP, which justifies the importance of incorporating prior knowledge.

Figure 5.4: Simulation performance vs. prediction steps.

Table 5.6: Interaction recovery performance

Metric        Accuracy  Precision  Recall  F1-Score
Corr          63.2%     30.2%      67.9%   41.8%
Corr (LSTM)   57.7%     28.2%      76.2%   41.2%
NRI           92.7%     72.9%      99.3%   84.1%
SUGAR-NP      97.2%     88.0%      99.4%   93.4%
SUGAR         99.2%     96.6%      99.4%   98.0%

5.3.4 Interaction Recovery

Note that a range of methods can be used for relation recovery or link prediction (Lü & Zhou, 2011; Martínez et al., 2017). However, most of them are either supervised or semi-supervised, while our task requires fully unsupervised interaction recovery. In addition to NRI, we further implement the following two baselines described in Kipf et al. (2018).

• Corr: We calculate a correlation matrix $R$ of all nodes, where $R_{i,j}$ is the Pearson correlation coefficient between the flattened trajectories of the $i$-th node and the $j$-th node. With a threshold $\theta$, $(v_i, v_j)$ has an interaction if and only if $|R_{i,j}| > \theta$.

• Corr (LSTM): Similar to Corr, except that we use the output of the last LSTM layer at the last time step of each node to calculate the correlation matrix.

Figure 5.5: Interactions learned on the Mass dataset: (a) true interactions, (b) SUGAR, (c) NRI. NRI usually infers redundant interactions, while SUGAR recovers the ground truth interactions.

For each of the above methods, we choose the threshold that produces the highest F1 score and report the corresponding precision and recall as the final output. Table 5.6 shows the interaction recovery performance of the different methods. We observe that:

• SUGAR and SUGAR-NP perform clearly better than NRI and the other baselines. Besides, SUGAR achieves even better performance than SUGAR-NP, which justifies the importance of incorporating prior knowledge;

• NRI usually has a high recall but relatively low precision even with the sparsity prior. This is because NRI tends to infer redundant connections, which will be further discussed in Section 5.3.5.
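As an illustration of the Corr baseline, the following minimal NumPy sketch (our own code, with assumed array shapes) computes the thresholded Pearson correlation matrix described above:

```python
import numpy as np

def corr_baseline(trajectories, theta):
    # trajectories: (num_nodes, T, P) array of observed states per node.
    flat = trajectories.reshape(trajectories.shape[0], -1)
    R = np.corrcoef(flat)            # (N, N) Pearson correlation matrix
    pred = np.abs(R) > theta         # interaction iff |R_ij| > theta
    np.fill_diagonal(pred, False)    # ignore self-interactions
    return pred
```

The threshold theta is then swept to maximize the F1 score, as described above.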
5.3.5 Model Interpretation

Example inferred interactions. To better understand the model, we visualize the interactions learned by NRI and SUGAR in Figure 5.5. We observe that SUGAR manages to identify the true interactions, while NRI infers a couple of redundant interactions even with its sparsity prior. This is potentially due to NRI's inability to model the global interaction, which therefore requires more connections for accurate simulation; we also observe decreased precision when removing the global component from SUGAR.

Figure 5.6: Observation (first row), simulation (black) and ground truth (red) on the Mass dataset.

Example prediction. Besides, Figure 5.6 shows an example observation and the ground truth together with the predictions generated by the different methods on the Mass dataset. We observe that the predictions generated by SUGAR match the ground truth well, while the predictions of the baselines tend to deviate from the ground truth.

5.3.6 Ablation Study

To investigate the effect of incorporating prior knowledge about the structure, we conduct experiments on the Mass dataset.

Figure 5.7: Effect of the sparsity constraint.

Effect of the sparsity constraint. Figure 5.7 shows the effect of applying the $L_0$ sparsity prior, where SUGAR-SP50 means the regularization coefficient is 50. We observe that with the increase of the regularization coefficient, the precision generally increases while the recall first stays stable and then decreases. This is because a large regularization helps reduce the number of redundant interactions. However, when the regularization becomes too large, the model tends to miss ground-truth interactions.

Table 5.7: Effect of the node degree constraint on the Mass dataset.

Metric      Accuracy  Precision  Recall  F1-Score
NRI         92.7%     72.9%      99.3%   84.1%
SUGAR-NP    97.2%     88.0%      99.4%   93.4%
SUGAR-NDC   99.2%     97.2%      98.8%   98.0%

Effect of the node degree constraint. Table 5.7 shows the effect of the node degree constraint. We denote by SUGAR-NDC the SUGAR model incorporating the node degree constraint. The constraint is applied once the performance becomes stable on the validation dataset.
We find that applying the node degree constraint greatly increases the precision, while the recall becomes slightly worse, resulting in a significantly improved F1 score.

Chapter 6

Modeling Spatial Correlation via Representation Learning: an Application for Travel Time Estimation

Real-time estimation of vehicle travel time is essential for route planning (Li et al., 2015b), ride-sharing (Asghari et al., 2016) and vehicle dispatching (Yuan et al., 2013). Most existing approaches (Wang et al., 2014; Rahmani et al., 2013; Hunter et al., 2009; Tang et al., 2016) for travel time estimation assume the availability of the actual route, while in online services only the origin and the destination are given before a trip. One solution is to first infer the most likely path, such as the time-dependent fastest path (Demiryurek et al., 2011), and then to estimate its travel time. However, this approach introduces expensive computations in the route planning step, making path-based approaches less practical for fast online travel time estimation services. Besides, this path-based approach heavily depends on the route planning, and its performance suffers when the actual path deviates from the predicted one due to changing traffic conditions and different user preferences.

The recent advance in addressing this problem is to perform origin-destination (OD) travel time estimation (Wang et al., 2016b; Jindal et al., 2017), which aims to estimate the travel time without the actual route. Several major challenges exist in the OD travel time estimation problem. First, because of the absence of path information, only limited raw input features, i.e., the origin, the destination and the departure time, remain available for online prediction. Besides, these raw features are usually hard for a model to utilize; e.g., it is non-trivial to measure the similarity or road network distance between two roads based on their latitudes/longitudes. Thus, to deal with the insufficiency of the information provided by raw features, it is desirable to learn a feature representation that utilizes the rich information from the underlying road network structure and the spatiotemporal properties of the problem. To the best of our knowledge, all existing works make a straightforward use of raw features to build models, and very limited attention has been paid to finding a proper representation for transportation problems, especially for travel time estimation. Wang et al. (2016b) propose a nearest neighbor based method, which estimates the trip duration by averaging the scaled travel times of all historical trips with similar origins and destinations. However, this non-parametric approach is hard to generalize to cases in which no or a very limited number of neighbors are available. Jindal et al. (2017) propose ST-NN, a multi-layer feed-forward neural network for travel time estimation. ST-NN first takes as input the discretized latitudes and longitudes of the origin and the destination to predict the travel distance, and then combines this prediction with the time information to estimate the trip duration. One common limitation of these approaches is that the underlying road network structure and the spatiotemporal properties are largely overlooked.

In this work, we propose a novel solution for the OD travel time estimation problem. It learns better representations from limited raw features, leveraging the road network structure and spatiotemporal prior knowledge.
Specifically, we model the underlying road network as a graph of links/roads and learn a representation for each link considering the road network topology. Besides, to enforce the spatiotemporal smoothness prior on the learned representations, i.e., that timestamps/locations close to each other should have similar representations, we construct spatial and temporal graphs and utilize graph Laplacian regularization on those constructed graphs. Moreover, to learn more meaningful representations, we enforce that the learned representations are not only optimized for estimating the travel time, but also capture various path information. Specifically, we propose a multi-task learning framework which models various trip properties, e.g., the distance and the number of traveled road segments, as additional tasks during the training process. This framework not only produces more meaningful representations, but also boosts the learning performance.

In summary, the main contributions of this work are as follows:

• We propose a novel representation learning framework for the origin-destination travel time estimation problem.

• We propose approaches to leverage the underlying road network structure as well as the spatiotemporal smoothness prior to deal with the insufficiency of the input information.

• We propose a multi-task learning approach to utilize the path information in the training phase to learn more meaningful representations, which boosts the learning performance.

• We conduct extensive experiments on two real-world large-scale trip datasets. The proposed approach clearly outperforms state-of-the-art methods for OD travel time estimation.

The rest of this chapter is organized as follows. In Section 6.1, we discuss related work. We define the problem and present the proposed multi-task representation model in Section 6.2. In Section 6.3, we discuss experimental results.

6.1 Related Work

6.1.1 Travel Time Estimation

There are mainly two types of approaches for travel time estimation: path-based methods and origin-destination based methods. Path-based methods require the route information to generate a prediction, while origin-destination based methods are able to predict the travel time without route information.

Path-based travel time estimation. A straightforward approach is to first estimate the travel time on individual links and then sum up all the link travel times in the given path. The travel time of each link can be estimated using loop detector data (Petty et al., 1998; Jia et al., 2001; Tang et al., 2016; Asghari et al., 2015) or floating car data (Hunter et al., 2009; Wang et al., 2014; Ding et al., 2015). However, this method fails to consider the transition times between links, e.g., traffic lights and left/right turns. Herring et al. (2010); Li et al. (2015a) propose methods that consider the time spent at intersections, and Rahmani et al. (2013) propose to concatenate sub-paths to achieve more accurate travel time estimation. Wang et al. (2014) further improve this sub-path based method by first mining frequent patterns (Song et al., 2014) and then finding the optimal way to concatenate frequent paths, balancing the length and the support. The main drawback of the path-based methods is that they require estimating the path, which is time-consuming and error-prone.

OD travel time estimation. To mitigate this issue, a few recent works have started investigating origin-destination (OD) travel time estimation. Wang et al.
(2016b) design a nearest neighbor based method for OD travel time estimation. This approach estimates the trip duration by averaging the scaled travel times of all historical trips with a similar origin, destination and time of day. Jindal et al. (2017) propose ST-NN, a multi-layer feed-forward neural network for travel time estimation. ST-NN first predicts the travel distance given an origin and a destination and then combines this prediction with the time information to estimate the travel time. One drawback of these methods is that the underlying road network structure, as well as the spatiotemporal properties, are largely overlooked. In this chapter, we propose an approach to leverage the topological information as well as the spatiotemporal prior knowledge for OD travel time estimation. In Section 6.3, we compare the proposed approach with these approaches as baselines.

6.1.2 Representation Learning

The performance of machine learning models heavily depends on the choice of data representation (Bengio et al., 2013). Representation learning aims to learn representations from data that make it easier to extract useful information for various tasks. An overview of representation learning is available in Bengio et al. (2013). In the case of OD travel time estimation, we want to learn representations from the raw trip data, the underlying road network structure as well as the spatiotemporal properties. One benefit of explicitly dealing with representations is that we can conveniently express many general priors about the task. Specifically, in this chapter we exploit the following priors in learning interpretable representations for travel time estimation.

Spatial and temporal smoothness. Consecutive or spatially nearby observations tend to be associated with similar values. This prior was first introduced by Becker & Hinton (1992), and it can be enforced by penalizing changes in values over time or space. Mobahi et al. (2009) apply the temporal smoothness prior to video modeling, and in Bahadori et al. (2014), the graph Laplacian is used to enforce spatial smoothness. In this work, we enforce this prior on the representations of links and spatiotemporal factors using unsupervised graph embedding and graph Laplacian regularization.

Shared factors across tasks. Many tasks of interest are explained by factors that are shared with other tasks (Bengio et al., 2013). By jointly optimizing several related tasks, we can learn representations that capture the underlying factors and achieve better empirical results. In this work, we exploit this prior by jointly learning travel time estimation as well as several other related real-world tasks that potentially share common factors, e.g., predicting the travel distance and the number of road segments in the path.

6.2 Methodology

6.2.1 Problem Statement

Definition 1 (Trip). A trip $x^{(i)} = (o^{(i)}, d^{(i)}, s^{(i)}, \tau^{(i)}, I^{(i)})$ is defined as a tuple with five components, where $o^{(i)}$ denotes the origin location, $d^{(i)}$ denotes the destination, $s^{(i)}$ denotes the departure time, $\tau^{(i)}$ denotes the duration, and $I^{(i)}$ represents additional trip summary information available for historical trips, e.g., the trip distance and trip fare.

Definition 2 (OD Travel Time Estimation). Given an origin, a destination and a departure time, our goal is to estimate the duration of this trip using historical trip data as well as the underlying road network. Figure 6.1 shows an example.
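As a concrete rendering of Definition 1, a historical trip record can be represented by a small data structure such as the following Python sketch; the field names are our own and are not part of the datasets' schemas.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

@dataclass
class Trip:
    origin: Tuple[float, float]        # o: (latitude, longitude)
    destination: Tuple[float, float]   # d: (latitude, longitude)
    departure_time: float              # s: departure timestamp
    duration: float                    # tau: travel time, the target
    summary: Dict[str, float] = field(default_factory=dict)
    # I: extra summary info (e.g., distance, fare), historical trips only
```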
This problem is challenging because of (1) the complicated spatiotemporal dependency in the underlying road network and (2) the limited amount of available information when performing online prediction.

Figure 6.1: OD-ETA: Given an origin, a destination and a departure time, estimate the duration of this trip using historical trip data as well as the underlying road network.

In this section, we introduce the proposed model, i.e., MURAT, which learns the representations for the origin-destination travel time estimation problem and deals with these challenges. First, we describe the approach to incorporate the underlying road network structure into the model. Then we show how to incorporate spatiotemporal prior knowledge into the representation learning process. Finally, we describe the proposed multi-task representation learning framework.

6.2.2 Representation Learning for Road Network

Travel time is strongly affected by the underlying network structure. For example, two nearby locations on opposite directions of a highway might have significantly different travel times with regard to the same destination. Thus, it is desirable to incorporate the network structure into the learning process.

Figure 6.2: The system architecture of the proposed multi-task representation learning model (MURAT) for travel time estimation. The model first embeds the raw link information and spatiotemporal information into the learned spaces. Then the learned representations, together with other numerical features, are fed into a deep residual network which is jointly trained using supervised signals from multiple related tasks.

The underlying road network can be represented as an undirected graph of links/roads $G = (V, A)$, where $V$ is the set of vertices representing links in the road network, while $A$ denotes the connectivity among links. Given a location, we can easily match it with a link in the underlying road network (Liu et al., 2012).

One potential way of representing the link information is to feed the link identifier either as a numerical feature or as a one-hot encoded categorical feature. However, these methods fail to capture the network structure, as each link is defined and learned independently. One alternative is to use unsupervised graph embedding approaches, e.g., Laplacian Eigenmaps (Belkin & Niyogi, 2002) or DeepWalk (Perozzi et al., 2014), to learn a representation for each link. While these representations preserve similarities defined on the graph, it is hard for them to utilize task-specific supervised signals.

Graph Laplacian Regularization. Several approaches can be used to tackle this problem. One way is to use the combination of the supervised loss and the unsupervised one as the objective. Following the practice in Belkin & Niyogi (2002), we add the graph Laplacian as an additional unsupervised loss. Specifically, this loss serves as a regularizer that encourages adjacent links to have similar representations.
Let $G = (V, A)$ denote the undirected link graph, and let $L = D - A$ represent the graph Laplacian, where $D = \text{diag}(A\mathbf{1})$ is the diagonal degree matrix and $\mathbf{1} \in \mathbb{R}^N$ denotes the all-one vector. Let $E \in \mathbb{R}^{|V| \times d_l}$ denote the link embedding matrix, with $E_{i,:} \in \mathbb{R}^{d_l}$ corresponding to the representation of $v_i$. Let $\mathcal{L}$ denote the supervised loss function, e.g., the mean absolute error for travel time. Then we have the following objective function:

$$\hat{\mathcal{L}} = \mathcal{L} + \alpha \, \text{Tr}(E^\top L E) = \mathcal{L} + \alpha \sum_{i,j} A_{ij} \|E_{i,:} - E_{j,:}\|^2 \quad (6.1)$$

where $\alpha$ is used to control the strength of the regularization. One drawback of this approach is that for large graphs with millions of edges, e.g., the trip dataset in Beijing used in this work, this objective can become too costly to optimize, as in each training step we have to calculate a large matrix multiplication and possibly the corresponding gradients.

Unsupervised Pre-training. Inspired by the well-known practice of unsupervised pre-training (Erhan et al., 2010), in which researchers used a pre-trained stacked restricted Boltzmann machine to initialize a supervised classification deep neural network, we find a more efficient approach that achieves similar performance. Specifically, we first learn a representation of each node based on unsupervised graph embedding techniques, e.g., Belkin & Niyogi (2002); Perozzi et al. (2014), and then we use the learned representation as the initialization of a supervised embedding network which is fine-tuned later based on supervised signals. The question of why unsupervised pre-training can be helpful was extensively studied in Erhan et al. (2010), which tries to explain this phenomenon from the perspectives of the regularization effect and the optimization effect.

Figure 6.3: Distribution of distances between links calculated based on the DeepWalk embedding.

Figure 6.3 shows the distribution of distances calculated based on the DeepWalk embedding. We observe that the L2 distances between the embeddings of neighboring links are much smaller than those of two random links. We hypothesize that this initialization has a similar effect to the graph Laplacian regularizer.

Figure 6.4: Graph-based regularization for spatiotemporal dependencies: (a) spatial graph, (b) temporal graph.

6.2.3 Spatiotemporal Representation Learning

Besides the road network structure, another important aspect of learning the representation is our prior knowledge of the spatial and temporal domains, e.g., the spatiotemporal smoothness, the periodicity, and the recurring nature of traffic. We represent this prior knowledge by constructing the spatial and the temporal graphs in the embedding space.

Figure 6.4a shows an example of the spatial graph. Each vertex represents an equal-sized grid/region, and a vertex is connected to adjacent ones, i.e., nearby regions, by edges whose weights correspond to the pair-wise similarity. The temporal dependency is also represented as a graph. An example of a temporal graph is shown in Figure 6.4b, which aims to model the temporal smoothness as well as the weekly periodicity. Each vertex, representing a 5-minute time interval, is connected to its two adjacent neighbors and to its counterparts at the same time of the week.

With the spatial and temporal graphs, we can enforce the spatiotemporal smoothness prior and the periodicity using graph Laplacian regularization similar to Equation 6.1 in Section 6.2.2.
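For illustration, the regularizer of Equation 6.1 can be computed in its sparse, edge-list form rather than as the dense trace, which avoids the large matrix multiplication noted above. The following is a minimal PyTorch sketch under our own naming (`edge_index` and `edge_weight` encode the non-zero entries of A):

```python
import torch

def laplacian_reg(emb, edge_index, edge_weight):
    # emb: (|V|, d) embedding matrix E; edge_index: (2, nnz) node-id pairs.
    # Edge-sum form of Eq. 6.1: sum over edges of A_ij * ||E_i - E_j||^2
    # (any constant factor can be absorbed into alpha).
    src, dst = edge_index
    diff = emb[src] - emb[dst]
    return (edge_weight * diff.pow(2).sum(dim=-1)).sum()
```

The same function applies unchanged to the spatial and temporal graphs of Section 6.2.3.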
Besides, we use distributed representations for the spatial and the temporal embeddings, i.e., the embedding of a grid is represented as a function of the embeddings of its latitude and longitude, while the embedding of a temporal interval is represented as a function of the embeddings of "time in day" and "day in week". Compared to encoding each location/time interval separately, this distributed representation has the following advantages: (1) a smaller number of parameters, e.g., from $O(\#\text{grids})$ to $O(\sqrt{\#\text{grids}})$ for spatial representations, and (2) by sharing embeddings, it implicitly imposes the prior that locations at the same latitude/longitude, and intervals at the same time of the day or day of the week, should have similar representations.

In addition to leveraging the graph structure and the spatiotemporal smoothness, we enforce the model to learn more interpretable representations by capturing various other related trip properties. This leads to the proposed multi-task representation learning framework, which utilizes the path information that is available during the training period.

6.2.4 Multi-task Representation Learning

Instead of using path information as extra input signals, which are not available during testing, we use path summary information as extra supervised signals that the model needs to predict. Specifically, the path summary information serves as multiple related tasks that the model learns simultaneously. We use a hard parameter sharing framework where the different tasks share most parts of the model, except for having a dedicated output layer for each task.

Suppose $\hat{y}^{(i)}$ denotes the prediction, and $\mathcal{L}_k(\hat{y}^{(i)}, y^{(i)})$ denotes the loss function for the $k$-th task. The final learning objective is defined as a function of the individual loss functions:

$$\mathcal{L}(\hat{y}^{(i)}, y^{(i)}) = f_\mathcal{L}\left(\left[\mathcal{L}_1(\hat{y}^{(i)}, y^{(i)}), \cdots, \mathcal{L}_k(\hat{y}^{(i)}, y^{(i)})\right]\right) \quad (6.2)$$

For example, the function $f_\mathcal{L}$ can be the weighted sum, the maximum, or another suitable aggregation function. In the experiments, we use a linear combination of the losses of all tasks, with the corresponding weights as hyperparameters.

To map the learned representations to the target tasks, we use deep residual networks (He et al., 2016) with pre-activation, which empirically offer better results than deep feed-forward neural networks.

Figure 6.2 shows the architecture of the model, and Algorithm 1 gives an overview of the learning process. We first initialize the link embedding network with an unsupervised graph embedding, e.g., DeepWalk, and initialize the spatial/temporal embedding networks with Gaussian random noise. In the training process, for each sample $x^{(i)}$, we embed its link information and spatiotemporal information into the learned spaces, i.e., $E_{L,x^{(i)}}$, $E_{S,x^{(i)}}$, and $E_{T,x^{(i)}}$. Then the embedded representations, together with other numerical features, are fed as input into a deep residual network which generates the predictions for the multiple related tasks. The objective is defined as the aggregation of the losses from the multiple tasks, together with the graph Laplacian regularizers for the spatial and the temporal graphs. Finally, the embeddings as well as the network weights are jointly optimized using the Adam optimizer (Kingma & Ba, 2015).
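As an illustration of Equation 6.2 with a weighted-sum aggregator $f_\mathcal{L}$, the following minimal PyTorch sketch combines per-task MAE losses; the task names and weights are illustrative assumptions, not the exact configuration used in the experiments.

```python
import torch
import torch.nn.functional as F

def multitask_loss(preds, targets, weights):
    # preds, targets: dicts mapping a task name to a prediction/label
    # tensor; weights: dict of per-task weights (hyperparameters).
    per_task = {t: F.l1_loss(preds[t], targets[t]) for t in preds}
    return sum(weights[t] * per_task[t] for t in per_task)

# Example usage with the main task and one auxiliary task:
# loss = multitask_loss(
#     {"eta": eta_pred, "distance": dist_pred},
#     {"eta": eta_true, "distance": dist_true},
#     {"eta": 0.8, "distance": 0.2})
```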
Algorithm 1: Multi-task Representation Learning (MURAT)

Data: Link graph G_L, spatial graph G_S, temporal graph G_T
Result: Learned representations E_L, E_S, E_T
begin
    E_L <- GraphEmbed(G_L)                         // unsupervised pre-training
    E_S, E_T <- N(0, 1)
    for i <- 1 ... N do
        L^(i) <- []
        x^(i), y^(i) <- sample(data, i)
        E_{L,x^(i)} <- Embed(E_L, x^(i))           // link embedding
        E_{S,x^(i)} <- Embed(E_S, x^(i))           // spatial embedding
        E_{T,x^(i)} <- Embed(E_T, x^(i))           // temporal embedding
        y_hat^(i) <- ResNet(W, [E_{L,x^(i)}, E_{S,x^(i)}, E_{T,x^(i)}])
        // aggregate losses from multiple tasks
        for k <- 1 ... K do
            L^(i) <- [L^(i), L_k(y_hat^(i), y^(i))]
        end
        L^(i) <- f_L(L^(i)) + lambda_S * L_{G_S}(E_S) + lambda_T * L_{G_T}(E_T)
        E_L, E_S, E_T, W <- AdamOpt([E_L, E_S, E_T, W], L^(i))
    end
    return E_L, E_S, E_T
end

6.3 Experiments

6.3.1 Datasets

BJS-Pickup: This dataset contains 61.4 million pickup trips in Beijing from May 1st, 2017 to Oct 31st, 2017, collected by Didi Chuxing. Each pickup trip contains the source, destination, departure time, travel time and additional trip summary information including the travel distance, number of roads, number of traffic lights, number of turns, etc. For the data split, we use the data from May 1st, 2017 to Oct 16th, 2017 for training, the data from Oct 17th, 2017 to Oct 23rd, 2017 for validation, and the data from Oct 24th to Oct 31st for testing. For the road network, we use a commercial map of Beijing provided by Didi Chuxing. We also build a smaller dataset, BJS-Small, for the ablation study. BJS-Small contains all the 4.8 million trips in the right-central part of the downtown area in Beijing, which is more congested and more challenging for travel time estimation.

NYC-Trip: To compare with existing approaches, we also conduct experiments on a publicly available dataset processed by Wang et al. (2016b), which contains 21.9 million taxi trips collected from Nov 2013 to Dec 2013 in New York City. Each trip records the locations of the origin and destination and the departure time, along with trip summary information including the trip duration and distance. We follow the settings in Wang et al. (2016b) and use the data in November for training and the data in December for testing. For the road network, we use the map of New York City provided by OpenStreetMap (Haklay & Weber, 2008).

Table 6.1: Datasets used in the experiments.

Name            BJS-Pickup  BJS-Small  NYC-Trip
# Samples       61.4M       4.8M       21.9M
Avg. trip time  191s        335s       660s
# Links         1.1M        30K        73K

Table 6.1 shows the statistics of these datasets and the corresponding underlying road networks. Figure 6.5 plots the heatmaps of pickup/drop-off locations and the travel time distributions of these two datasets.

Figure 6.5: Data statistics: (a) spatial distribution of NYC taxi trips, (b) travel time distribution of NYC taxi trips, (c) spatial distribution of Beijing taxi pickup trips, (d) travel time distribution of Beijing taxi pickup trips. BJS-Pickup contains trips that have a smaller duration but a broader spatial distribution than NYC-Trip.

6.3.2 Experimental Settings

Methods for evaluation. We compare the proposed model (MURAT) with the following methods for OD travel time estimation.

• Linear regression (LR): which models the travel time as a linear function of the Euclidean distance and the L1 distance. This simple method serves as a baseline for comparison.
• Gradient boosted machine (GBM) (Ke et al., 2017a): gradient boosting decision tree based regression implemented using LightGBM (Ke et al., 2017a). The input features include the time in day, day in week, Euclidean distance, start location, end location, etc. We use at most 500 trees with a learning rate of 0.1. Early stopping based on the validation dataset is used.

• Spatiotemporal deep neural network (ST-NN) (Jindal et al., 2017): a deep neural network based approach which first predicts the travel distance given an origin and a destination and then combines this prediction with the time information to estimate the travel time. We implement this algorithm following the parameter settings suggested in Jindal et al. (2017). For the experiment on the BJS-Pickup dataset, we further tune the hyperparameters and add additional features to achieve the best performance.

• TEMP+R (Wang et al., 2016b): a nearest neighbor based approach which estimates the trip duration by averaging the scaled travel times of all historical trips with a similar origin and destination. The travel times of the neighbors are scaled based on the region-based temporal speed reference.

• MURAT-NR: the variant of the proposed approach which has the same input and output and a similar number of learnable parameters to MURAT, but without explicit representation learning, i.e., we directly feed the raw features as input to the model.

All the deep neural network based approaches, including ST-NN, MURAT-NR and MURAT, are implemented using PyTorch (Paszke et al., 2017). The default experimental settings for MURAT are as follows. For the link embedding, the dimension is 40, and DeepWalk (Perozzi et al., 2014) is used for unsupervised pre-training. For the spatial graph, each grid is connected to its 4 adjacent grids with equal weights, and the dimensions for both the "latitude" and the "longitude" embeddings are 20. For the temporal graph, each vertex corresponds to a 5-minute interval, and we connect each vertex to its two neighbors with equal weights. The temporal information includes "time in day" and "day in week", both of which have 20 dimensions. The deep residual network contains 5 residual network blocks (He et al., 2016), with 11 layers in total, and each hidden layer contains 1024 units. The objective is MAE for NYC-Trip and MAPE for BJS-Pickup, optimized using Adam (Kingma & Ba, 2015) with a mini-batch size of 1024. The initial learning rate is 10⁻², and it is reduced to 1/5 of its value every 2 epochs. Early stopping on the validation dataset is used. In the multi-task learning, a weighted linear function is used to aggregate the loss functions of the different tasks, and the best hyperparameters are chosen using the Tree-structured Parzen Estimator (TPE) (Bergstra et al., 2011) on the validation dataset.

Evaluation metrics. We evaluate the performance of the proposed method based on three popular metrics. Suppose $x = x^{(1)}, \cdots, x^{(N)}$ represents the ground truth, $\hat{x} = \hat{x}^{(1)}, \cdots, \hat{x}^{(N)}$ represents the predicted values, and $N$ denotes the number of samples; these metrics are defined as follows.

Mean Absolute Percentage Error (MAPE):
$$\text{MAPE}(x, \hat{x}) = \frac{1}{N}\sum_{i}^{N}\left|\frac{x^{(i)} - \hat{x}^{(i)}}{x^{(i)}}\right|$$

Mean Absolute Error (MAE):
$$\text{MAE}(x, \hat{x}) = \frac{1}{N}\sum_{i}^{N}\left|x^{(i)} - \hat{x}^{(i)}\right|$$

Mean Absolute Relative Error (MARE):
$$\text{MARE}(x, \hat{x}) = \frac{\sum_{i}^{N}\left|x^{(i)} - \hat{x}^{(i)}\right|}{\sum_{i}^{N}\left|x^{(i)}\right|}$$
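These three metrics translate directly into code; a minimal NumPy sketch (our own, assuming 1-D arrays of ground truth and predictions) is:

```python
import numpy as np

def mape(x, x_hat):
    # Mean absolute percentage error.
    return np.mean(np.abs((x - x_hat) / x))

def mae(x, x_hat):
    # Mean absolute error.
    return np.mean(np.abs(x - x_hat))

def mare(x, x_hat):
    # Mean absolute relative error: total absolute error normalized
    # by the total magnitude of the ground truth.
    return np.sum(np.abs(x - x_hat)) / np.sum(np.abs(x))
```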
Table 6.2: Performance comparison of evaluated approaches.

       Metric  LR       GBM      ST-NN    TEMP+R¹  MURAT-NR  MURAT
NYC    MAPE    42.43%   24.80%   24.25%   -        23.29%    22.32%
       MAE     213.88   178.30   149.40   145.15   145.96    139.44
       MARE    32.40%   27.02%   22.63%   22.10%   22.11%    20.96%
BJS    MAPE    46.53%   35.82%   33.52%   -        30.37%    26.81%
       MAE     96.97    86.20    83.57    -        85.83     75.60
       MARE    37.72%   33.53%   32.51%   -        33.38%    29.39%

¹ The numbers are copied from Wang et al. (2016b). If near real-time trip data, i.e., trip data from one hour ago, are available, its MAE and MAPE decrease to 142.73 and 21.73% respectively.

6.3.3 Performance Comparison

Table 6.2 shows the comparison of the evaluated approaches for travel time estimation on both datasets. We have the following observations:

• MURAT achieves the best performance with regard to all the metrics on both datasets, outperforming the state-of-the-art approaches by 4%-10% in terms of MAE. This suggests the effectiveness of the proposed multi-task representation learning model.

• The performance of MURAT-NR is much worse than that of MURAT, even with the same input/output and roughly the same number of parameters. This demonstrates the importance of the explicit representation learning.

• The benefit of representation learning is more significant on the BJS-Pickup dataset than on the NYC-Trip dataset. This may be due to the relatively better map quality used for BJS-Pickup (which results in better coverage and better link matching accuracy).

Performance w.r.t. travel time. Figures 6.6a and 6.6b show the relationship between the travel time and the different metrics for the evaluated approaches. As expected, the trips with a large duration have higher MAPE and MAE. An interesting observation is that with the increase of travel time, the MAPE first decreases dramatically and then gradually increases. This is because there is a certain amount of randomness in a trip, and the MAPE is dominated by this randomness when the travel time, i.e., the denominator, is small.

Performance w.r.t. travel distance. The relationships between the travel distance and the different metrics are shown in Figures 6.6c and 6.6d. It is interesting to see that the MAPE first decreases and then increases as the travel distance grows. This reflects the joint effects of the increasing denominator and numerator, i.e., a larger travel distance usually means more uncertainty as well as a longer travel time.

Figure 6.6: Performance w.r.t. different trip features on the BJS-Pickup dataset: (a) travel time vs. MAPE, (b) travel time vs. MAE, (c) travel distance vs. MAPE, (d) travel distance vs. MAE.

6.3.4 Effect of Link Embedding

To investigate the effect of link embedding, we conduct experiments with the following variants of the proposed model:

• Raw: the model without link embedding;

• RandEmb: which embeds each link as a vector of random Gaussian noise;

• UnsupEmb: which uses unsupervised graph embedding, i.e., DeepWalk (Perozzi et al., 2014), to generate the link representation;

• SupEmb: supervised embedding, where the embedding of each link is first initialized as a vector of random Gaussian noise;

• SupEmb+Pre: supervised embedding with unsupervised pre-training using DeepWalk.

All these models have identical inputs/outputs and are adjusted to have roughly the same number of parameters by increasing/decreasing the number of hidden units. Thus, the main difference lies in the way of representing the link information.
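As an illustration of the SupEmb+Pre variant, the following minimal PyTorch sketch (our own code; `deepwalk_vectors` is an assumed array of pre-trained DeepWalk vectors, one row per link) initializes a trainable embedding layer from the unsupervised vectors so that it can be fine-tuned by the supervised objective:

```python
import numpy as np
import torch
import torch.nn as nn

def pretrained_link_embedding(deepwalk_vectors: np.ndarray) -> nn.Embedding:
    # deepwalk_vectors: (num_links, dim) unsupervised DeepWalk embeddings.
    num_links, dim = deepwalk_vectors.shape
    emb = nn.Embedding(num_links, dim)
    emb.weight.data.copy_(
        torch.as_tensor(deepwalk_vectors, dtype=torch.float32))
    return emb  # gradients flow into emb.weight during supervised training
```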
Table 6.3: Performance comparison of approaches with different link representations.

Metric  Raw     RandEmb  UnsupEmb  SupEmb  SupEmb+Pre
MAPE    29.05%  28.76%   28.31%    28.12%  27.85%
MAE     95.47   94.22    92.95     91.18   90.80
MARE    31.04%  30.63%   30.22%    29.64%  29.53%

Table 6.3 shows the effect of link embedding. We observe that: (1) Raw, which uses the raw link id instead of an embedding, has significantly worse performance; (2) the unsupervised embedding generated by DeepWalk, which captures the graph structure, achieves significantly better results than RandEmb; (3) task-specific supervised embedding outperforms unsupervised embedding; and (4) by utilizing both the graph structure information and the supervised signal, SupEmb+Pre achieves the best result.

6.3.5 Effect of Spatiotemporal Embedding

To investigate the effect of spatiotemporal embedding, we conduct experiments with the following variants of the proposed model:

• RawST: neither spatial nor temporal embedding is used;

• SEmb: only spatial embedding is used;

• TEmb: only temporal embedding is used;

• STEmb: both spatial and temporal embeddings are used.

All these models have identical inputs/outputs and are adjusted to have roughly the same number of parameters by increasing/decreasing the number of hidden units.

Table 6.4: Performance comparison of approaches with different spatiotemporal representations.

Metric  RawST   SEmb    TEmb    STEmb
MAPE    28.15%  28.05%  27.99%  27.84%
MAE     92.76   92.44   91.79   90.80
MARE    30.16%  30.05%  29.84%  29.53%

Table 6.4 shows the effect of spatial and temporal embedding. We observe that: (1) using the raw spatiotemporal information results in clearly worse performance, and (2) the temporal embedding has a higher impact on the performance than the spatial embedding. The latter observation may be due to the fact that part of the spatial information has already been captured by the link embedding.

6.3.6 Effect of Multi-task Learning

To investigate the effect of multi-task learning, we conduct experiments with the following variants of the proposed model. The tasks include predicting the travel time, the travel distance, the number of links in the trip, the number of traffic lights and the number of turns.

• ETA: only the travel time is used as the supervised signal;

• ETA+Distance: the weighted combination of the travel time and the travel distance is used as the objective;

• All: all the supervised signals are used.

In addition, we trained 200 models with different task weights to study their effects in terms of the error and the overfitting of the main task.

Table 6.5: Effect of multi-task learning.

Metric  ETA     ETA+Distance  All
MAPE    28.43%  28.09%        27.84%
MAE     92.71   92.18         90.80
MARE    30.14%  29.97%        29.53%

Table 6.5 shows the effect of multi-task learning. We observe that: (1) the multi-task learning framework offers clearly improved performance, and (2) adding more relevant auxiliary tasks results in even better performance.

Figure 6.7: Effect of the ratio of the main task: (a) ratio vs. MAPE, (b) ratio vs. overfitting MAPE. Incorporating auxiliary tasks helps reduce overfitting.

Figure 6.7 shows the effect of the task weights. The x-axis denotes the ratio of the main task, where 1 means only using the main task and 0 means not using the main task at all.
Figure 6.7 shows the effect of the task weights. The x-axis denotes the ratio of the main task, where 1 means using only the main task and 0 means not using the main task at all. As the ratio increases, the error follows a U-shaped curve. This phenomenon can be explained from the perspective of "shared factors across tasks", as discussed in Section 6.1.2. We hypothesize that these related tasks can be explained using a shared set of factors, and the multi-task learning framework exploits this commonality among the different learning tasks. Besides, as shown in Figure 6.7b, multi-task learning also helps avoid overfitting the main task.

[Figure 6.7: Effect of the ratio of the main task — (a) ratio vs. validation MAPE; (b) ratio vs. overfitting MAPE. Incorporating auxiliary tasks helps reduce overfitting.]

6.3.7 Effect of Model Architecture

To study the effects of the different parameters of the proposed model, we trained 200 models on the BJS-Small dataset with different combinations of hyperparameters, including the number of residual blocks, the number of units in each hidden layer, the dimension of the location embedding, the dimension of the temporal embedding, the dimension of the link embedding and the weight of the graph Laplacian regularizer for the link embedding.

Figure 6.8a shows the effect of the number of layers of the residual network. Generally, as the number of layers grows, the error first decreases and then slightly increases. Note that, except when the number of residual blocks is 2, there is no significant difference among the performances of the different variants. This is probably due to the effect of identity mapping and batch normalization, which gives the model the ability to "skip" layers.

Figure 6.8b shows the effect of the number of units in the residual network. We observe a clear trend of decreasing error as the number of units grows.

Figure 6.8c shows the effect of the dimension of the spatial embedding. We observe a sharp error decrease when increasing the dimension from 1 to 5. After that, the performance becomes less sensitive to dimension changes. This is because of the small spatial cardinality, i.e., the number of grids in each column/row.

Figure 6.8d shows the impact of the dimension of the temporal embedding. No clear trend is observed, and the performance tends to be good as long as the dimension is not too small, i.e., greater than 1. This is because the cardinality of the temporal input, e.g., time in the day or day in the week, is relatively small.

Figure 6.8e shows the effect of the link embedding dimension. Similar to the spatial embedding, as the dimension increases, the error first decreases and then becomes stable.

Graph Laplacian regularization enforces a prior of local smoothness on the link embedding. As shown in Figure 6.8f, a proper weight decay, e.g., 10^-7, results in clearly improved performance.

[Figure 6.8: Effects of model parameters — validation MAPE vs. (a) number of residual blocks; (b) number of hidden units; (c) spatial embedding dimension; (d) temporal embedding dimension; (e) link embedding dimension; (f) Laplacian weight decay.]
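The regularizer itself penalizes embedding differences between adjacent links: tr(E^T L E) equals the sum over edges (i, j) of w_ij ||e_i - e_j||^2, where L is the graph Laplacian of the link graph and E is the embedding matrix. Below is a minimal sketch, assuming a sparse edge-list representation of the link graph; the edge_index and edge_weight names are illustrative, not the thesis API.

import torch

def laplacian_reg(emb, edge_index, edge_weight=None):
    # emb: (num_links, dim) embedding matrix; edge_index: (2, num_edges) pairs of
    # adjacent links; edge_weight: optional (num_edges,) weights w_ij.
    src, dst = edge_index
    sq_dist = ((emb[src] - emb[dst]) ** 2).sum(dim=1)  # ||e_i - e_j||^2 per edge
    if edge_weight is not None:
        sq_dist = sq_dist * edge_weight
    return sq_dist.sum()

# total_loss = supervised_loss + 1e-7 * laplacian_reg(link_embedding.weight, edge_index)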
6.3.8 Model Interpretation

To further understand the representations learned by the proposed model, we conduct a series of visualizations. Figure 6.9a shows the projection of the learned temporal embedding onto the two largest principal components. Remarkably, the model automatically learns a feature representation with a circular shape, from 00:00 to 24:00, with smooth transitions between adjacent time intervals. Besides, points tend to be distributed more densely during peak hours, e.g., 7am-9am and 6pm-9pm, than during non-peak hours.

Figure 6.9b shows the projection of the learned representation of the day in the week. We observe that: (1) the representations of the days of the week also form a circle, from Monday to Sunday; (2) weekends are clearly separated from weekdays; and (3) Tuesday, Wednesday and Thursday are quite close to each other, while Monday and Friday, which have different traffic patterns, are relatively far away.

[Figure 6.9: Visualization of learned temporal representations — (a) time in the day; (b) day in the week. The learned representation for time in the day has a circular shape, from 00:00 to 24:00, with smooth transitions between adjacent time intervals; weekends are clearly separated from weekdays, where Tuesday, Wednesday and Thursday are quite close to each other, while Monday and Friday, with different traffic patterns, are relatively far away.]
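This projection can be reproduced with standard PCA. Below is a minimal NumPy sketch, assuming time_embedding holds the trained lookup table for time-in-day slots (e.g., one row per time interval); the variable name is hypothetical and plotting details are omitted.

import numpy as np

def project_2d(emb):
    # emb: (num_slots, dim) learned embedding matrix.
    centered = emb - emb.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T  # coordinates on the two largest principal components

# coords = project_2d(time_embedding)
# Plotting coords[:, 0] against coords[:, 1] should trace the circular
# 00:00-24:00 pattern described above.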
To study the spatiotemporal knowledge learned by the model, we randomly pick 10K origin-destination pairs from the validation dataset and vary the departure time from 0:00 to 23:59. We then calculate the average travel time and travel distance of these trips. Figure 6.10 shows the learned patterns of travel time and travel distance. The model generates time-varying travel times, where the travel time is larger during peak hours and smaller during non-peak hours. One interesting observation is that the travel distance also changes over time, e.g., two clear increases happen around 8am and 5pm. The reason is that during peak hours, drivers are more likely to take detours to avoid traffic congestion. These results suggest that the proposed model learns a shared representation for the different tasks.

[Figure 6.10: Learned travel time and distance patterns over the time of day. During peak hours, both the predicted travel time and the predicted travel distance increase, as drivers are more likely to take detours to avoid traffic congestion.]

Chapter 7

Conclusion and Future Work

7.1 Summary of the Research

In this thesis, we investigated the following important questions in spatiotemporal prediction: (1) How to model complex spatial dependencies among objects that are usually non-Euclidean and multimodal? (2) How to model non-linear and non-stationary temporal dynamics for accurate long-term prediction? (3) How to infer the correlations or interactions among objects when they are neither provided nor constructible a priori?

The main contributions of this thesis are as follows:

• First, we proposed the diffusion convolutional recurrent neural network, which captures spatiotemporal dependencies. Specifically, we use bidirectional graph random walks to model the spatial dependency and recurrent neural networks to capture the temporal dynamics. We further integrated the encoder-decoder architecture and the scheduled sampling technique to improve performance for long-term forecasting. When evaluated on two large-scale real-world traffic datasets, our approach obtained significantly better predictions than the baselines.

• Second, we proposed a novel graph network that encodes the non-Euclidean, multimodal correlations among regions using multiple graphs and explicitly captures them using multi-graph convolution. We further augmented the recurrent neural network with a contextual gating mechanism to incorporate global contextual information into the temporal modeling procedure. When evaluated on two large-scale real-world ride-hailing demand datasets, the proposed approach achieved significantly better results than state-of-the-art baselines.

• Third, we proposed SUGAR, a variational graph auto-encoder based model that incorporates structural prior knowledge to better infer interactions and learn the system dynamics. In a range of experiments on both synthetic and real-world datasets, we found that with structural priors, SUGAR achieved clearly improved performance on both interaction recovery and simulation. Besides, it also generalized better to unseen graphs.

• Fourth, we proposed a novel representation learning based approach to capture the spatiotemporal dependencies in the origin-destination travel time estimation problem. Specifically, the model learns a representation that effectively captures the underlying road network structure as well as spatiotemporal prior knowledge. We further introduced a multi-task learning framework to utilize path information in the origin-destination travel time estimation setting.

These contributions confirm the claims in the thesis statement.

7.2 Future Work

Interpretable spatiotemporal prediction. It is desirable to make the predictions more interpretable. As illustrated in Figure 7.1, in traffic prediction we want to know which neighboring road and/or historical observation has the largest effect on the prediction for a specific location.

[Figure 7.1: Interpretable prediction — (a) which neighboring road has the largest effect? (b) which historical observation is the most important?]

Inductive spatiotemporal prediction. Existing spatiotemporal prediction approaches are mainly designed for transductive learning, i.e., training and testing share the same graph structure. However, inductive learning is desirable in many cases. For example, it would be useful to apply a trained traffic prediction model to new locations that have limited training data. As illustrated in Figure 7.2, a traffic prediction model capable of inductive learning could potentially be used in Long Beach even if it was only trained on Los Angeles. This is also related to transfer learning.

[Figure 7.2: Inductive spatiotemporal prediction, where the model is trained on Los Angeles while being tested on another city, e.g., Long Beach.]

Spatiotemporal prediction on evolving graphs. Existing spatiotemporal prediction approaches are mainly designed to operate on static graphs, i.e., the graph structure stays the same during a training instance. However, many real-world applications have evolving graph structures, e.g., the K-nearest-neighbor graph of moving objects or an evolving knowledge graph.

Incorporating general prior knowledge in graph construction. Currently, the SUGAR model mainly incorporates prior knowledge in the form of graph constraints or regularizers. It will be interesting to investigate (1) how to integrate prior knowledge when constructing the graph, and (2) how to integrate more general prior knowledge about the graph, e.g., symmetry or acyclicity, into the model.
Reference List

Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI '16), 2016.

Davide Andreoletti, Sebastian Troia, Francesco Musumeci, Silvia Giordano, Guido Alberto Maier, and Massimo Tornatore. Network traffic prediction based on diffusion convolutional recurrent neural networks. In INFOCOM Workshops, pp. 1–6, 2019.

Mohammad Asghari, Tobias Emrich, Ugur Demiryurek, and Cyrus Shahabi. Probabilistic estimation of link travel times in dynamic road networks. In Proceedings of the 23rd SIGSPATIAL International Conference on Advances in Geographic Information Systems (SIGSPATIAL), pp. 47. ACM, 2015.

Mohammad Asghari, Dingxiong Deng, Cyrus Shahabi, Ugur Demiryurek, and Yaguang Li. Price-aware real-time ride-sharing at scale: an auction-based approach. In Proceedings of the 24th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems (SIGSPATIAL), pp. 3. ACM, 2016.

James Atwood and Don Towsley. Diffusion-convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS), pp. 1993–2001, 2016.

Mohammad Taha Bahadori, Rose Yu, and Yan Liu. Fast multivariate spatio-temporal analysis via low-rank tensor learning. In Advances in Neural Information Processing Systems (NIPS), pp. 3491–3499, 2014.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations (ICLR), 2015.

Peter Battaglia, Razvan Pascanu, Matthew Lai, Danilo Jimenez Rezende, et al. Interaction networks for learning about objects, relations and physics. In Advances in Neural Information Processing Systems (NIPS), pp. 4502–4510, 2016.

Peter W. Battaglia et al. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261, 2018.

Suzanna Becker and Geoffrey E. Hinton. Self-organizing neural network that discovers surfaces in random-dot stereograms. Nature, 355(6356):161, 1992.

Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps and spectral techniques for embedding and clustering. In Advances in Neural Information Processing Systems (NIPS), pp. 585–591, 2002.

Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems (NIPS), pp. 1171–1179, 2015.

Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828, 2013.

James S. Bergstra, Rémi Bardenet, Yoshua Bengio, and Balázs Kégl. Algorithms for hyper-parameter optimization. In Advances in Neural Information Processing Systems (NIPS), pp. 2546–2554, 2011.

Aleksandar Bojchevski, Oleksandr Shchur, Daniel Zügner, and Stephan Günnemann. NetGAN: Generating graphs via random walks. In International Conference on Machine Learning (ICML), 2018.

Michael M. Bronstein, Joan Bruna, Yann LeCun, Arthur Szlam, and Pierre Vandergheynst. Geometric deep learning: going beyond Euclidean data. IEEE Signal Processing Magazine, 34(4):18–42, 2017.

Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. Spectral networks and locally connected networks on graphs. In International Conference on Learning Representations (ICLR), 2014.

Pinlong Cai, Yunpeng Wang, Guangquan Lu, Peng Chen, Chuan Ding, and Jianping Sun. A spatiotemporal correlative k-nearest neighbor model for short-term traffic multistep forecasting. Transportation Research Part C: Emerging Technologies, 62:21–34, 2016.

Shaosheng Cao, Wei Lu, and Qiongkai Xu. GraRep: Learning graph representations with global structural information. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, pp. 891–900. ACM, 2015.

Shaosheng Cao, Wei Lu, and Qiongkai Xu. Deep neural networks for learning graph representations. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.

Ennio Cascetta. Transportation systems engineering: theory and methods, volume 49. Springer Science & Business Media, 2013.

Di Chai, Leye Wang, and Qiang Yang. Bike flow prediction with multi-graph convolutional networks. ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems (ACM SIGSPATIAL), 2018.

Michael B. Chang, Tomer Ullman, Antonio Torralba, and Joshua B. Tenenbaum. A compositional object-based approach to learning physical dynamics. In International Conference on Learning Representations (ICLR), 2017.

Long Chen, Hanwang Zhang, Jun Xiao, Liqiang Nie, Jian Shao, Wei Liu, and Tat-Seng Chua. SCA-CNN: Spatial and channel-wise attention in convolutional networks for image captioning. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pp. 6298–6306. IEEE, 2017.

Dehua Cheng, Yu Cheng, Yan Liu, Richard Peng, and Shang-Hua Teng. Efficient sampling for gaussian graphical models via spectral sparsification. In Conference on Learning Theory, pp. 364–390, 2015.

Xingyi Cheng, Ruiqing Zhang, Jie Zhou, and Wei Xu. Deeptransport: Learning spatial-temporal dependency for traffic condition forecasting. In International Joint Conference on Neural Networks (IJCNN), 2018.

Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. In NIPS Deep Learning and Representation Learning Workshop, 2014.

Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems (NIPS), pp. 3837–3845, 2016.

Ugur Demiryurek, Farnoush Banaei-Kashani, Cyrus Shahabi, and Anand Ranganathan. Online computation of fastest path in time-dependent spatial networks. In International Symposium on Spatial and Temporal Databases (SSTD), pp. 92–111. Springer, 2011.

Dingxiong Deng, Cyrus Shahabi, Ugur Demiryurek, Linhong Zhu, Rose Yu, and Yan Liu. Latent space model for road networks to predict time-varying traffic. In International Conference on Knowledge Discovery and Data Mining (KDD), pp. 1525–1534, 2016.

Dingxiong Deng, Cyrus Shahabi, Ugur Demiryurek, and Linhong Zhu. Situation aware multi-task learning for traffic prediction. In Data Mining (ICDM), 2017 IEEE International Conference on, pp. 81–90. IEEE, 2017.

Zhiming Ding, Bin Yang, Ralf Hartmut Güting, and Yaguang Li. Network-matched trajectory-based moving-object database: Models and applications. IEEE Transactions on Intelligent Transportation Systems, 16(4):1918–1928, 2015.

Donald R. Drew. Traffic flow theory and control. Technical report, 1968.

Wenbin Du, Yali Wang, and Yu Qiao. Recurrent spatial-temporal attention network for action recognition in videos. IEEE Transactions on Image Processing, 27(3):1347–1360, 2018.

Dumitru Erhan, Yoshua Bengio, Aaron Courville, Pierre-Antoine Manzagol, Pascal Vincent, and Samy Bengio. Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research, 11(Feb):625–660, 2010.

Luca Franceschi, Mathias Niepert, Massimiliano Pontil, and Xiao He. Learning discrete structures for graph neural networks. In International Conference on Machine Learning (ICML), 2019.

Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. Convolutional sequence to sequence learning. In International Conference on Machine Learning (ICML), 2017.

Aditya Grover and Jure Leskovec. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pp. 855–864. ACM, 2016.

Aditya Grover, Aaron Zweig, and Stefano Ermon. Graphite: Iterative generative modeling of graphs. In International Conference on Machine Learning (ICML), 2019.

Nicholas Guttenberg, Nathaniel Virgo, Olaf Witkowski, Hidetoshi Aoki, and Ryota Kanai. Permutation-equivariant neural networks applied to dynamics prediction. arXiv preprint arXiv:1612.04530, 2016.

Mordechai Haklay and Patrick Weber. Openstreetmap: User-generated street maps. IEEE Pervasive Computing, 7(4):12–18, 2008.

James Douglas Hamilton. Time series analysis, volume 2. Princeton University Press, Princeton, 1994.

Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems (NIPS), pp. 1024–1034, 2017a.

William L. Hamilton, Rex Ying, and Jure Leskovec. Representation learning on graphs: Methods and applications. IEEE Data Engineering Bulletin, 2017b.

David K. Hammond, Pierre Vandergheynst, and Rémi Gribonval. Wavelets on graphs via spectral graph theory. Applied and Computational Harmonic Analysis, 30(2):129–150, 2011. doi: 10.1016/j.acha.2010.04.005.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In European Conference on Computer Vision (ECCV), pp. 630–645. Springer, 2016.

Yotam Hechtlinger, Purvasha Chakravarti, and Jining Qin. A generalization of convolutional neural networks to graph-structured data. arXiv preprint arXiv:1704.08165, 2017.

Mikael Henaff, Joan Bruna, and Yann LeCun. Deep convolutional networks on graph-structured data. arXiv preprint arXiv:1506.05163, 2015.

Ryan Herring, Aude Hofleitner, Saurabh Amin, T. Nasr, A. Khalek, Pieter Abbeel, and Alexandre Bayen. Using mobile phones to forecast arterial traffic through statistical learning. In 89th Transportation Research Board Annual Meeting, pp. 10–14, 2010.

Peter D. Hoff, Adrian E. Raftery, and Mark S. Handcock. Latent space approaches to social network analysis. Journal of the American Statistical Association, 97(460):1090–1098, 2002.

Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In Computer Vision and Pattern Recognition (CVPR), 2018 IEEE Conference on. IEEE, 2018.

Wenhao Huang, Guojie Song, Haikun Hong, and Kunqing Xie. Deep architecture for traffic flow prediction: deep belief networks with multitask learning. ITS, IEEE Transactions on, 15(5):2191–2201, 2014.

Timothy Hunter, Ryan Herring, Pieter Abbeel, and Alexandre Bayen. Path and travel time inference from gps probe vehicle data. NIPS Analyzing Networks and Learning with Graphs, 12(1), 2009.

H. V. Jagadish, Johannes Gehrke, Alexandros Labrinidis, Yannis Papakonstantinou, Jignesh M. Patel, Raghu Ramakrishnan, and Cyrus Shahabi. Big data and its technical challenges. Communications of the ACM, 57(7):86–94, July 2014.

Zhanfeng Jia, Chao Chen, Ben Coifman, and Pravin Varaiya. The pems algorithms for accurate, real-time estimates of g-factors and speeds from single-loop detectors. In Intelligent Transportation Systems, 2001. Proceedings. 2001 IEEE, pp. 536–541. IEEE, 2001.

Ishan Jindal, Xuewen Chen, Matthew Nokleby, Jieping Ye, et al. A unified neural network approach for estimating travel time and distance for a taxi trip. arXiv preprint arXiv:1710.04350, 2017.

Yiannis Kamarianakis and Poulicos Prastacos. Forecasting traffic flow conditions in an urban network: Comparison of multivariate and univariate approaches. Transportation Research Record: Journal of the Transportation Research Board, (1857):74–84, 2003.

Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. Lightgbm: A highly efficient gradient boosting decision tree. In Advances in Neural Information Processing Systems (NIPS), pp. 3149–3157, 2017a.

Jintao Ke, Hongyu Zheng, Hai Yang, and Xiqun Michael Chen. Short-term forecasting of passenger demand under on-demand ride services: A spatio-temporal deep learning approach. Transportation Research Part C: Emerging Technologies, 85:591–608, 2017b.

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015.

Diederik P. Kingma and Max Welling. Auto-encoding variational bayes. In International Conference on Learning Representations (ICLR), 2014.

Thomas Kipf, Ethan Fetaya, Kuan-Chieh Wang, Max Welling, and Richard Zemel. Neural relational inference for interacting systems. In International Conference on Machine Learning (ICML), 2018.

Thomas N. Kipf and Max Welling. Variational graph auto-encoders. NIPS Workshop on Bayesian Deep Learning, 2016.

Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations (ICLR), 2017.

Jaimyoung Kwon and Kevin Murphy. Modeling freeway traffic with coupled hmms. Technical report, Univ. California, Berkeley, 2000.

Nikolay Laptev, Jason Yosinski, Li Erran Li, and Slawek Smyl. Time-series extreme event forecasting with neural networks at Uber. In Int. Conf. on Machine Learning Time Series Workshop, 2017.

Sangsoo Lee and Daniel Fambro. Application of subset autoregressive integrated moving average model for short-term freeway traffic volume forecasting. Transportation Research Record: Journal of the Transportation Research Board, (1678):179–188, 1999.

Chaolong Li, Zhen Cui, Wenming Zheng, Chunyan Xu, and Jian Yang. Spatio-temporal graph convolution for skeleton based action recognition. In 2018 AAAI Conference on Artificial Intelligence (AAAI'18), 2018a.

Mu Li, Amr Ahmed, and Alexander J. Smola. Inferring movement trajectories from gps snippets. In Proceedings of the Eighth ACM International Conference on Web Search and Data Mining (WSDM), pp. 325–334. ACM, 2015a.

Ruoyu Li, Sheng Wang, Feiyun Zhu, and Junzhou Huang. Adaptive graph convolutional neural networks. In AAAI Conference on Artificial Intelligence (AAAI), 2018b.

Yaguang Li, Dingxiong Deng, Ugur Demiryurek, Cyrus Shahabi, and Siva Ravada. Towards fast and accurate solutions to vehicle routing in a large-scale and dynamic environment. In International Symposium on Spatial and Temporal Databases (SSTD), pp. 119–136. Springer, 2015b.

Yaguang* Li, Rose* Yu, Cyrus Shahabi, Ugur Demiryurek, and Yan Liu. Deep learning: A generic approach for extreme condition traffic forecasting. In SIAM International Conference on Data Mining (SDM), 2017. (* Equal contribution).

Yaguang Li, Kun Fu, Zheng Wang, Cyrus Shahabi, Jieping Ye, and Yan Liu. Multi-task representation learning for travel time estimation. In International Conference on Knowledge Discovery and Data Mining (KDD '18), 2018c.

Yaguang Li, Rose Yu, Cyrus Shahabi, and Yan Liu. Diffusion convolutional recurrent neural network: Data-driven traffic forecasting. In International Conference on Learning Representations (ICLR '18), 2018d.

Yijun Lin, Nikhit Mago, Yu Gao, Yaguang Li, Yao-Yi Chiang, Cyrus Shahabi, and José Luis Ambite. Exploiting spatiotemporal patterns for accurate air quality forecasting using deep learning. In Proceedings of the 26th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, pp. 359–368. ACM, 2018.

Marco Lippi, Marco Bertini, and Paolo Frasconi. Short-term traffic flow forecasting: An experimental comparison of time-series analysis and supervised learning. ITS, IEEE Transactions on, 14(2):871–882, 2013.

Kuien Liu, Yaguang Li, Fengcheng He, Jiajie Xu, and Zhiming Ding. Effective map-matching on the most simplified road network. In Proceedings of the 20th International Conference on Advances in Geographic Information Systems, pp. 609–612. ACM, 2012.

Qi Liu, Miltiadis Allamanis, Marc Brockschmidt, and Alexander Gaunt. Constrained graph variational autoencoders for molecule design. In Conference on Neural Information Processing Systems (NeurIPS), pp. 7806–7815, 2018.

Wei Liu, Yu Zheng, Sanjay Chawla, Jing Yuan, and Xie Xing. Discovering spatio-temporal causal interactions in traffic data streams. In International Conference on Knowledge Discovery and Data Mining (KDD), pp. 1010–1018. ACM, 2011.

Linyuan Lü and Tao Zhou. Link prediction in complex networks: A survey. Physica A: Statistical Mechanics and its Applications, 390(6):1150–1170, 2011.

Thang Luong, Hieu Pham, and Christopher D. Manning. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 1412–1421. Association for Computational Linguistics, 2015. doi: 10.18653/v1/D15-1166. URL http://www.aclweb.org/anthology/D15-1166.

Yisheng Lv, Yanjie Duan, Wenwen Kang, Zhengxi Li, and Fei-Yue Wang. Traffic flow prediction with big data: A deep learning approach. ITS, IEEE Transactions on, 16(2):865–873, 2015.

Xiaolei Ma, Zhimin Tao, Yinhai Wang, Haiyang Yu, and Yunpeng Wang. Long short-term memory neural network for traffic speed prediction using remote microwave sensor data. Transportation Research Part C: Emerging Technologies, 54:187–197, 2015.

Xiaolei Ma, Zhuang Dai, Zhengbing He, Jihui Ma, Yong Wang, and Yunpeng Wang. Learning traffic as images: a deep convolutional neural network for large-scale transportation network speed prediction. Sensors, 17(4):818, 2017.

Chris J. Maddison, Andriy Mnih, and Yee Whye Teh. The concrete distribution: A continuous relaxation of discrete random variables. In International Conference on Learning Representations (ICLR), 2017.

Víctor Martínez, Fernando Berzal, and Juan-Carlos Cubero. A survey of link prediction in complex networks. ACM Computing Surveys (CSUR), 49(4):69, 2017.

Wanli Min and Laura Wynter. Real-time road traffic prediction with spatio-temporal correlations. Transportation Research Part C: Emerging Technologies, 19(4):606–616, 2011.

Hossein Mobahi, Ronan Collobert, and Jason Weston. Deep learning from temporal coherence in video. In Proceedings of the 26th Annual International Conference on Machine Learning (ICML), pp. 737–744. ACM, 2009.

Federico Monti, Davide Boscaini, Jonathan Masci, Emanuele Rodola, Jan Svoboda, and Michael M. Bronstein. Geometric deep learning on graphs and manifolds using mixture model cnns. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5115–5124, 2017.

Maximilian Nickel, Kevin Murphy, Volker Tresp, and Evgeniy Gabrilovich. A review of relational machine learning for knowledge graphs. Proceedings of the IEEE, 104(1):11–33, 2015.

Antonio Ortega, Pascal Frossard, Jelena Kovačević, José M. F. Moura, and Pierre Vandergheynst. Graph signal processing: Overview, challenges, and applications. Proceedings of the IEEE, 106(5):808–828, 2018.

R. Kelley Pace, Ronald Barry, John M. Clapp, and Mauricio Rodriquez. Spatiotemporal autoregressive models of neighborhood effects. The Journal of Real Estate Finance and Economics, 17(1):15–33, 1998.

Bei Pan, Ugur Demiryurek, and Cyrus Shahabi. Utilizing real-world transportation data for accurate traffic prediction. In Data Mining (ICDM), 2012 IEEE 12th International Conference on, pp. 595–604. IEEE, 2012.

Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. In NIPS Workshop, 2017.

Wenjie Pei, Tadas Baltrušaitis, David M. J. Tax, and Louis-Philippe Morency. Temporal attention-gated model for robust sequence classification. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pp. 820–829. IEEE, 2017.

Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pp. 701–710. ACM, 2014.

Karl F. Petty, Peter Bickel, Michael Ostland, John Rice, Frederic Schoenberg, Jiming Jiang, and Ya'acov Ritov. Accurate estimation of travel times from single-loop detectors. Transportation Research Part A: Policy and Practice, 32(1):1–17, 1998.

Yan Qi and Sherif Ishak. A hidden markov model for short term prediction of traffic conditions on freeways. Transportation Research Part C: Emerging Technologies, 43:95–111, 2014.

Mahmood Rahmani, Erik Jenelius, and Haris N. Koutsopoulos. Route travel time estimation using low-frequency floating car data. In Intelligent Transportation Systems (ITSC), 2013 16th International IEEE Conference on, pp. 2292–2297. IEEE, 2013.

Alvaro Sanchez-Gonzalez, Nicolas Heess, Jost Tobias Springenberg, Josh Merel, Martin Riedmiller, Raia Hadsell, and Peter Battaglia. Graph networks as learnable physics engines for inference and control. In International Conference on Machine Learning (ICML), 2018.

Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61–80, 2009.

Youngjoo Seo, Michaël Defferrard, Pierre Vandergheynst, and Xavier Bresson. Structured sequence modeling with graph convolutional recurrent networks. In International Conference on Neural Information Processing (ICONIP), 2018.

Xingjian Shi, Zhihan Gao, Leonard Lausen, Hao Wang, Dit-Yan Yeung, Wai-kin Wong, and Wang-chun Woo. Deep learning for precipitation nowcasting: A benchmark and a new model. In Advances in Neural Information Processing Systems (NIPS), pp. 5617–5627, 2017.

David I. Shuman, Sunil K. Narang, Pascal Frossard, Antonio Ortega, and Pierre Vandergheynst. The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains. IEEE Signal Processing Magazine, 30(3):83–98, 2013.

Martin Simonovsky and Nikos Komodakis. Graphvae: Towards generation of small graphs using variational autoencoders. In International Joint Conference on Artificial Intelligence (IJCAI), 2018.

Renchu Song, Weiwei Sun, Baihua Zheng, and Yu Zheng. Press: A novel framework of trajectory compression in road networks. Proceedings of the VLDB Endowment (VLDB), 7(9):661–672, 2014.

Haowei Su, Ling Zhang, and Shu Yu. Short-term traffic flow prediction based on incremental support vector regression. In Natural Computation, 2007. ICNC 2007. Third International Conference on, volume 1, pp. 640–645. IEEE, 2007.

Sainbayar Sukhbaatar, Rob Fergus, et al. Learning multiagent communication with backpropagation. In Advances in Neural Information Processing Systems (NIPS), pp. 2244–2252, 2016.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems (NIPS), pp. 3104–3112, 2014.

Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. Line: Large-scale information network embedding. In Proceedings of the 24th International Conference on World Wide Web, pp. 1067–1077. International World Wide Web Conferences Steering Committee, 2015.

Jinjun Tang, Yajie Zou, John Ash, Shen Zhang, Fang Liu, and Yinhai Wang. Travel time estimation using freeway point detector data based on evolving fuzzy neural inference system. PloS one, 11(2):e0147263, 2016.

Shang-Hua Teng et al. Scalable algorithms for data and network analysis. Foundations and Trends in Theoretical Computer Science, 12(1–2):1–274, 2016.

Yongxin Tong, Yuqiang Chen, Zimu Zhou, Lei Chen, Jie Wang, Qiang Yang, Jieping Ye, and Weifeng Lv. The simpler the better: a unified approach to predicting original taxi demands based on large-scale online platforms. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pp. 1653–1662. ACM, 2017.

Sjoerd van Steenkiste, Michael Chang, Klaus Greff, and Jürgen Schmidhuber. Relational neural expectation maximization: Unsupervised discovery of objects and their interactions. In International Conference on Learning Representations (ICLR), 2018.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems (NIPS), pp. 6000–6010, 2017.

Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph attention networks. In International Conference on Learning Representations (ICLR), 2018.

Chun Wang, Shirui Pan, Guodong Long, Xingquan Zhu, and Jing Jiang. Mgae: Marginalized graph autoencoder for graph clustering. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management (CIKM), pp. 889–898. ACM, 2017.

Daixin Wang, Peng Cui, and Wenwu Zhu. Structural deep network embedding. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1225–1234. ACM, 2016a.

Hongjian Wang, Yu-Hsuan Kuo, Daniel Kifer, and Zhenhui Li. A simple baseline for travel time estimation using large-scale trip data. In Proceedings of the 24th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems (SIGSPATIAL), pp. 61. ACM, 2016b.

Yilun Wang, Yu Zheng, and Yexiang Xue. Travel time estimation of a path using sparse trajectories. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pp. 25–34. ACM, 2014.

Nicholas Watters, Daniel Zoran, Theophane Weber, Peter Battaglia, Razvan Pascanu, and Andrea Tacchetti. Visual interaction networks: Learning a physics simulator from video. In Advances in Neural Information Processing Systems (NIPS), pp. 4539–4547, 2017.

Billy M. Williams and Lester A. Hoel. Modeling and forecasting vehicular traffic flow as a seasonal arima process: Theoretical basis and empirical results. Journal of Transportation Engineering, 129(6):664–672, 2003.

Yuankai Wu and Huachun Tan. Short-term traffic flow forecasting with spatial-temporal correlation in a hybrid deep learning framework. arXiv preprint arXiv:1612.01022, 2016.

Yuanchang Xie, Kaiguang Zhao, Ying Sun, and Dawei Chen. Gaussian processes for short-term traffic volume forecasting. Transportation Research Record: Journal of the Transportation Research Board, (2165):69–78, 2010.

Xingjian Shi, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-Kin Wong, and Wang-chun Woo. Convolutional lstm network: A machine learning approach for precipitation nowcasting. In Advances in Neural Information Processing Systems (NIPS), pp. 802–810, 2015.

Jie Xu, Dingxiong Deng, Ugur Demiryurek, Cyrus Shahabi, and Mihaela Van Der Schaar. Mining the situation: Spatiotemporal traffic prediction with big data. IEEE Journal of Selected Topics in Signal Processing, 9(4):702–715, 2015a.

Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning (ICML), pp. 2048–2057, 2015b.

Sijie Yan, Yuanjun Xiong, and Dahua Lin. Spatial temporal graph convolutional networks for skeleton-based action recognition. In 2018 AAAI Conference on Artificial Intelligence (AAAI'18), 2018.

Huaxiu Yao, Xianfeng Tang, Hua Wei, Guanjie Zheng, Yanwei Yu, and Zhenhui Li. Modeling spatial-temporal dynamics for traffic prediction. arXiv preprint arXiv:1803.01254, 2018a.

Huaxiu Yao, Fei Wu, Jintao Ke, Xianfeng Tang, Yitian Jia, Siyu Lu, Pinghua Gong, Jieping Ye, and Zhenhui Li. Deep multi-view spatial-temporal network for taxi demand prediction. In 2018 AAAI Conference on Artificial Intelligence (AAAI'18), 2018b.

Jiaxuan You, Rex Ying, Xiang Ren, William Hamilton, and Jure Leskovec. Graphrnn: Generating realistic graphs with deep auto-regressive models. In International Conference on Machine Learning (ICML), pp. 5694–5703, 2018.

Bing Yu, Haoteng Yin, and Zhanxing Zhu. Spatio-temporal graph convolutional networks: A deep learning framework for traffic forecasting. In International Joint Conference on Artificial Intelligence (IJCAI), 2018.

Hsiang-Fu Yu, Nikhil Rao, and Inderjit S. Dhillon. Temporal regularized matrix factorization for high-dimensional time series prediction. In Advances in Neural Information Processing Systems (NIPS), pp. 847–855, 2016.

Nicholas Jing Yuan, Yu Zheng, Liuhang Zhang, and Xing Xie. T-finder: A recommender system for finding passengers and vacant taxis. IEEE Transactions on Knowledge and Data Engineering, 25(10):2390–2403, 2013.

Jiani Zhang, Xingjian Shi, Junyuan Xie, Hao Ma, Irwin King, and Dit-Yan Yeung. Gaan: Gated attention networks for learning on large and spatiotemporal graphs. In UAI, 2018a.

Junbo Zhang, Yu Zheng, and Dekang Qi. Deep spatio-temporal residual networks for citywide crowd flows prediction. In AAAI Conference on Artificial Intelligence (AAAI), pp. 1655–1661, 2017.

Junbo Zhang, Yu Zheng, Dekang Qi, Ruiyuan Li, Xiuwen Yi, and Tianrui Li. Predicting citywide crowd flows using deep spatio-temporal residual networks. Artificial Intelligence, 259:147–166, 2018b.

Lun Zhang, Qiuchen Liu, Wenchen Yang, Nai Wei, and Decun Dong. An improved k-nearest neighbor model for short-term traffic flow prediction. Procedia - Social and Behavioral Sciences, 96:653–662, 2013.

Xi Zhang, Lifang He, Kun Chen, Yuan Luo, Jiayu Zhou, and Fei Wang. Multi-view graph convolutional network and its applications on neuroimage analysis for parkinson's disease. In AMIA Annual Symposium Proceedings, volume 2018, pp. 1147. American Medical Informatics Association, 2018c.

Zhou Zhao, Qifan Yang, Deng Cai, Xiaofei He, and Yueting Zhuang. Video question answering via hierarchical spatio-temporal attention networks. In International Joint Conference on Artificial Intelligence (IJCAI), volume 2, 2017.