SPATIOTEMPORAL TRAFFIC FORECASTING IN ROAD NETWORKS

by

Dingxiong Deng

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)

December 2017

Copyright 2017 Dingxiong Deng

Acknowledgments

I would like to thank my advisor, Professor Cyrus Shahabi, for his great guidance and endless support throughout my Ph.D. studies. I am deeply thankful that he guided me through every aspect of doing research, from choosing the topic and approaching a problem to analyzing results and making presentations. I am always inspired by his passion and devotion to making a real-world impact. I am truly honored to have had the opportunity to learn directly from him, and I feel very fortunate to have been his student.

I would like to thank Dr. Ugur Demiryurek for his continuous guidance, help, and support. Thanks for helping me survive my first Ph.D. years and giving me valuable advice. I am very grateful to have had the opportunity to explore many research projects with him.

I would like to thank my proposal and dissertation committee members, Professor Shahram Ghandeharizadeh, Professor Craig Knoblock, Professor Yan Liu, and Professor Ketan Savla, for their valuable advice and guidance. I would like to thank Professor Mihaela van der Schaar and Professor Jie Xu for their help on our first traffic prediction paper. I thank Professor Shang-Hua Teng and Professor Ming-Deh Huang for their insightful discussions and conversations.

Being at InfoLab, I was fortunate to share the Ph.D. experience with a group of bright and talented students: Leyla Kazemi, Ali Khodaei, Houtan Shirani-Mehr, Bei Pan, Huy Pham, Afsin Akdogan, Hien To, Mohammad Asghari, Abdullah Alfarrarjeh, Ying Lu, Giorgos Constantinou, Rose Yu, Luan Tran, Yaguang Li, Minh Nguyen, Kien Nguyen, Dimitrios Stripelis, Mingxuan Yue, Ritesh Ahuja, and Chrysovalantis Anastasiou.
Thank you for making my Ph.D. journey full of excitement and happiness. Special thanks to Bei Pan for your help with my research and life during my first year, to Hien To for all the advice, rehearsals, and collaborations, and to Ying Lu and Yaguang Li for all your kind help. I wish them all personal and professional success in life.

My love and gratitude go to my sweetest wife, Linhong. I thank her for sharing the entire journey with love and patience, for always trusting, supporting, and standing by my side. I also thank my wonderful daughter, Helen, who brought me so much happiness. Lastly, I would like to thank my parents for their love and support. I would not have had these days without their endless encouragement.

Contents

Acknowledgments
List of Tables
List of Figures
Abstract

1 Introduction
1.1 Motivation and Challenges
1.2 Thesis Statement
1.3 Context-Aware Spatiotemporal Traffic Prediction
1.4 Real-Time Traffic Prediction with Latent Space Model for Road Network
1.5 Situation-Aware Multi-Task Learning for Traffic Forecasting
1.6 Thesis Overview

2 Related Work
2.1 Traffic Prediction Techniques
2.2 Traffic Prediction with Spatiotemporal Correlations
2.3 Online Traffic Prediction
2.4 Latent Space Model for Traffic Prediction
2.5 Multi-Task Learning and Mixture of Experts for Traffic Prediction

3 Context-Aware Spatiotemporal Traffic Prediction
3.1 Problem Formulation
3.1.1 Problem Setting
3.1.2 Spatiotemporal Prediction and Multi-Predictor Diversity Gain
3.1.3 Performance Metric for Our Algorithm
3.2 Context-Aware Adaptive Traffic Prediction
3.2.1 Algorithm Description
3.2.2 Learning Regret Analysis
3.3 Extensions
3.3.1 Dimension Reduction
3.3.2 Relevant Context Dimension
3.3.3 Missing and Delayed Feedback
3.4 Experiments
3.4.1 Experimental Setup
3.4.2 Prediction Accuracy
3.4.3 Convergence of Learning
3.4.4 Missing Context Information
3.4.5 Relevant Context
3.4.6 Missing and Delayed Labels
3.5 Summary

4 Latent Space Model for Road Network to Predict Time-Varying Traffic
4.1 Problem Definition
4.2 Latent Space Model for Road Networks (LSM-RN)
4.2.1 Topology in LSM-RN
4.2.2 Time in LSM-RN
4.2.3 LSM-RN Model
4.3 Learning & Prediction by LSM-RN
4.3.1 Global Learning Algorithm
4.3.2 Incremental Learning Algorithm
4.3.3 Real-Time Forecasting
4.4 Experiment
4.4.1 Dataset
4.4.2 Experimental Setting
4.4.3 Comparison with Edge Traffic Prediction
4.4.4 Effect of Missing Data
4.4.5 Comparison for Missing Value Completion
4.4.6 Scalability
4.4.7 Comparison for Real-Time Forecasting
4.4.8 Varying Parameters of Our Methods
4.5 Summary

5 Situation-Aware Multi-Task Learning for Traffic Prediction
5.1 Problem Definition and Naive MTL Formulation
5.1.1 Problem Definition
5.1.2 Naive Multi-Task Learning Formulation
5.2 SA-MTL Framework
5.2.1 Overall Framework
5.2.2 Combining and Augmenting Training Data
5.2.3 Traffic Situation Clustering
5.2.4 Multi-Task Learning per Traffic Situation
5.3 Experiment
5.3.1 Experiment Design
5.3.2 Performance Comparisons
5.3.3 Varying Parameters
5.3.4 Case Studies
5.4 Summary

6 Conclusions and Future Work
6.1 Conclusions
6.2 Future Work

Reference List

A Appendix
A.1 Proof of Lemma 2 in Section 3.2.2

B Appendix
B.1 Derivatives of L with Respect to U_t in Equation 4.7
B.2 Update Rule of A and B for Eq. 4.11 and Eq. 4.12

List of Tables

3.1 Overall prediction accuracy
3.2 Traffic prediction accuracy at 0.8 miles
3.3 Traffic prediction accuracy at 10 am
3.4 Mean square error of traffic speed prediction (mph²)
3.5 Traffic prediction accuracy with incomplete context information
4.1 Notations and explanations
4.2 Experiment parameters
4.3 Running time comparisons. For ARIMA and SVR, the training time cost is the total training time for all the edges for one-step-ahead prediction, and the prediction time is the average prediction time per edge per query
5.1 Notations and explanations
5.2 Short-term prediction performance (RMSE)
5.3 Long-term prediction performance (RMSE)
5.4 Running time comparison of different methods

List of Figures

1.1 Sensor distribution and Los Angeles road network
1.2 Example of base predictors
1.3 Example of snapshots
1.4 Overall framework of LSM-RN
3.1 System diagram
3.2 Spatiotemporal prediction and multi-predictor diversity gain
3.3 An illustration of the context space partitioning in a 2-dimensional space: the lower left subspace is further partitioned into 4 smaller subspaces because the partition condition is satisfied
3.4 An illustrative example of predictor selection with separately maintained context partitions: a request with context (10:05 am and 3.7 miles away from the reference location) arrives; Predictor 1 is the best for the time-of-day context and Predictor 2 is the best for the location context; Predictor 2 is finally selected
3.5 An illustrative example of context space partitioning with relevant context: partitioning only occurs on the location context since the partitioning condition is satisfied
3.6 Freeway segment used in the experiment
3.7 Accuracy over time (λ = 50 mph)
3.8 Relative importance of contexts
3.9 Prediction accuracy with missing and delayed labels
4.1 An example of a road network
4.2 An example of our traffic model, where G represents a road network, U denotes the attributes of vertices in the road network, n is the number of nodes, k is the number of attributes, and B denotes how one type of attribute interacts with others
4.3 Challenges of adjusting the latent attributes with feedback
4.4 A batch window framework for real-time forecasting
4.5 One-step-ahead prediction MAPE
4.6 Six-step-ahead prediction MAPE
4.7 Missing rate during training stages for SVR and ARIMA
4.8 Missing value completion MAPE
4.9 Convergence rate
4.10 Online prediction MAPE
4.11 Online prediction time
4.12 Effect of varying T
4.13 Effect of varying span
4.14 Effect of varying k and λ, where k is the number of latent attributes and λ is the graph regularization parameter
5.1 Illustration of naive multi-task learning
5.2 Traffic readings of three different highway sensors on the same day
5.3 Overview of the SA-MTL framework. SA-MTL first combines the training samples of all sensors together and clusters them into several partitions, where each partition represents one typical traffic situation and consists of a different number of training samples from all sensors. Consequently, for each partition (i.e., traffic situation), we utilize multi-task feature learning that simultaneously learns the prediction models of all sensors
5.4 Traffic situations of one sensor
5.5 Effect of using historical features on rush hour
5.6 Effect of using historical features on non-rush hour
5.7 Effect of varying k on rush hour and non-rush hour
5.8 Effect of varying ρ₁ on rush hour
5.9 Case study for sensor 168 with h = 6
5.10 Case study for sensor 661 with h = 6

Abstract

Real-time traffic prediction from high-fidelity spatiotemporal traffic sensor datasets is an important problem for intelligent transportation systems and sustainability. It is challenging for the following reasons. First, the data is incomplete: some network edges have no sensors, and those with sensors may occasionally fail to report data due to device and network failures. Second, the spatial relationship among sensors' readings is dictated by the network topology, not by a simpler space such as the Euclidean one. Third, the temporal relationship among sensors' readings is highly time-dependent; e.g., two sensors may be correlated during the morning rush hours but not the afternoon rush hours. Fourth, the sensor data is streaming, and hence any prediction model should be able to adjust as new data becomes available. Finally, the network is large, and hence any prediction model must be able to be trained and updated at scale. Previous approaches tackle different subsets of these challenges at the cost of failing in other aspects, rendering them impractical for real-time traffic prediction in large road networks.

We propose three different approaches to address the above challenges by considering the spatiotemporal correlations between sensors. In our first two studies, we present an "online" learning framework and a Latent Space Model for Road Networks (LSM-RN), respectively. Our online learning framework encodes the spatiotemporal relationship into traffic situations, and creates a hybrid predictor from several weak predictors.
Although this framework can predict future traffic by matching the current situation to the most effective prediction model trained using historical data, it does not utilize the newly arrived data to dynamically update these weak predictors. In contrast, LSM-RN learns the attributes of all sensors in a latent space to estimate how traffic patterns are formed and evolve. LSM-RN supports an incremental online algorithm that sequentially and adaptively learns the latent attributes from the temporal graph changes, thus updating the prediction model and making predictions on the fly. Because neither the online learning framework nor LSM-RN can build models for all sensors under all traffic situations, we further explore the commonalities across multiple traffic sensors that behave the same in a specific traffic situation. We show that building models based on the traffic situations shared across sensors can help improve the prediction accuracy. Toward this end, we propose a Multi-Task Learning (MTL) framework that aims to first automatically identify the traffic situations and then simultaneously build one forecasting model for similarly behaving sensors per traffic situation. We demonstrate that our proposed framework outperforms the best existing traffic prediction approaches in both short- and long-term predictions.

Chapter 1

Introduction

1.1 Motivation and Challenges

Traffic congestion causes tremendous loss in terms of both time and energy wasted. According to a recent report from the Texas Transportation Institute [2], in 2007, 439 metropolitan areas experienced 4.2 billion vehicle-hours of delay, which is equivalent to 2.8 billion gallons in wasted fuel and $87.2 billion in lost productivity, or about 0.7% of the nation's GDP. Traffic congestion is caused when the traffic demand approaches or exceeds the available capacity of the traffic system.
In the United States, the Federal Highway Administration [1, 3] has observed that the number of miles of vehicle travel increased by 76 percent from 1980 to 1999, while the total miles of highway increased by merely 1.5 percent, which hardly accommodates growth in travel. It is now generally conceded that it is impossible to build our way out of congestion, mainly because increased capacity results in increased demand. These factors motivate an information-based approach to address these problems.

Fortunately, recent advances in traffic sensing technology have enabled the acquisition of high-fidelity spatiotemporal traffic datasets. The roads of all major cities (e.g., Los Angeles, San Francisco, Beijing) have been equipped with sensors. For example, at our research center, the Integrated Media Systems Center (IMSC) at USC, for the past five years we have been collecting data from 15,000 loop detectors installed on the highways and arterial streets of Los Angeles County, covering 3,420 miles cumulatively (see the case study in [35]). The collected data include several main traffic parameters, such as occupancy, volume, and speed, at the rate of one reading per sensor per minute. Figure 1.1 shows the sensor locations and road network segments, where the green lines depict the sensors and the blue lines represent the road network segments.

Figure 1.1: Sensor distribution and Los Angeles road network.

Many efforts have been devoted to acquiring, storing, analyzing, and visualizing these traffic datasets. For example, at IMSC we have developed an end-to-end system, TransDec [21] (for transportation decision making), to collect data from various sources including traffic loop detectors, bus and rail, ramp meters, and events.
We developed a stream-processing platform [41] to clean, aggregate, and index these data into relational databases (i.e., Oracle 11g) as well as a cloud platform (i.e., Microsoft Azure), so that a set of spatial and temporal queries (e.g., the average hourly speed of a segment of highway I-10) is supported. These data arrive at 46 megabytes per minute, and over 15 terabytes have been collected so far. We are now exploiting other big data platforms, such as Spark, to further expedite this process.

With our data processing platform in place, the next step is to utilize the large amount of data for traffic prediction, which can enable drivers to avoid congested areas (e.g., through intelligent navigation systems), policy makers to decide about changes to traffic regulations (e.g., replace a carpool lane with a toll lane), urban planners to design better pathways (e.g., add an extra lane), and civil engineers to plan better for construction zones (e.g., how a short-time construction would impact traffic). To achieve this goal, one fundamental problem we need to solve in traffic prediction is to predict the near-future travel speed of each and every edge of the road network, which can be used to provide better route navigation and traffic regulation. For instance, commercial applications (e.g., Google Maps, Waze, Uber) can provide routes to drivers that correspond to the traffic situation when the driver will actually be on the roads, instead of using obsolete prior traffic information. However, even with such rich high-fidelity spatiotemporal sensor data, many challenges still exist for real-time traffic prediction on road networks:

• Missing data: While many edges of the road network are equipped with sensors, there are still many edges with no traffic sensors, termed the missing-sensor problem. On the other hand, even for the edges with sensors, every once in a while no value is reported for some time span due to various device and network failures, termed the missing-value problem.
• Complex spatial dependence: The spatial relationship among sensors' readings is dictated by the network topology, not by a less complex space such as the Euclidean one.

• Time-varying relations: The temporal relationship among sensors' readings is highly time-dependent; e.g., two sensors may be correlated during the morning rush hours but not the afternoon rush hours.

• Streaming sensor data: The sensor data is streaming, and hence any prediction model should be able to adjust as new data becomes available.

• Large-scale road network: The network is large, and hence any prediction model must be able to be trained and updated at scale.

Most existing studies [55, 87] model each edge independently and utilize the historical information of one edge to predict its future speed via time series approaches, which can potentially suffer from missing data. A few studies [42, 79, 39, 51] utilize spatiotemporal models with correlated time series, but they do not always use the network space as the spatial dimension. In addition, they are computationally expensive and thus can only be applied to small networks.

1.2 Thesis Statement

In this thesis, we study the spatiotemporal traffic prediction problem on road networks. Specifically, the objective is to predict the future travel speed of each and every edge of a road network, given the historical speed readings from the sensors on these edges. To address the aforementioned challenges, we present three different methods that take into account the spatiotemporal correlations of traffic sensors. In our first study [77], the spatial and temporal information of sensors is encoded into a traffic situation (i.e., context), and we propose an online framework that can learn from the current context in real time and predict the future traffic by matching the current context to the most effective prediction model trained using historical data.
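This context-matching idea can be sketched in a few lines. The fixed 2-D grid discretization, the reward definition (negative absolute error), and all names below are illustrative simplifications of the adaptive context-partitioning algorithm, not the actual implementation:

```python
# Sketch of context-aware predictor selection: traffic "context" (time of
# day, distance from a reference location) is discretized into cells, each
# base predictor keeps a running reward per cell, and at prediction time the
# predictor with the best estimated reward for the current cell is chosen.
from collections import defaultdict

class ContextAwareSelector:
    def __init__(self, predictors, time_bin_mins=60, loc_bin_miles=1.0):
        self.predictors = predictors          # name -> callable(history) -> speed
        self.time_bin = time_bin_mins
        self.loc_bin = loc_bin_miles
        # (cell, predictor name) -> [sum of rewards, count]
        self.stats = defaultdict(lambda: [0.0, 0])

    def _cell(self, minute_of_day, miles_from_ref):
        return (minute_of_day // self.time_bin, int(miles_from_ref / self.loc_bin))

    def predict(self, minute_of_day, miles_from_ref, history):
        cell = self._cell(minute_of_day, miles_from_ref)
        def avg_reward(name):
            s, n = self.stats[(cell, name)]
            return s / n if n else 0.0        # unexplored predictors default to 0
        best = max(self.predictors, key=avg_reward)
        return best, self.predictors[best](history)

    def feedback(self, minute_of_day, miles_from_ref, name, predicted, actual):
        cell = self._cell(minute_of_day, miles_from_ref)
        s = self.stats[(cell, name)]
        s[0] += -abs(predicted - actual)      # reward = negative absolute error
        s[1] += 1

predictors = {
    "last_value": lambda h: h[-1],
    "mean3": lambda h: sum(h[-3:]) / 3,
}
sel = ContextAwareSelector(predictors)
# Feedback from recently observed speeds teaches the selector that, in this
# (time-of-day, location) cell, "mean3" has been the more accurate predictor.
sel.feedback(605, 3.7, "last_value", predicted=60.0, actual=40.0)
sel.feedback(605, 3.7, "mean3", predicted=45.0, actual=40.0)
name, speed = sel.predict(610, 3.5, history=[42.0, 40.0, 38.0])
```

The key property carried over from the text is that the selector never inspects why traffic is slow; it only tracks which predictor has recently been rewarded in the current context cell.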
In our second study [22], we present a Latent Space Model for Road Networks (LSM-RN) to holistically combine the spatial and temporal correlations. In particular, given a series of recent road network snapshots, we learn the attributes of sensors in latent spaces that capture both topological and temporal properties. As these latent attributes are time-dependent, they can estimate how traffic patterns form and evolve. However, neither of these two methods can build a model for all sensors under all traffic situations. Toward this end, we explore the idea of building models based on the traffic situations shared across all sensors. We propose a Multi-Task Learning (MTL) framework that aims to first automatically identify the traffic situations and then simultaneously build one forecasting model for similarly behaving sensors per traffic situation. We demonstrate that our proposed MTL framework outperforms the best existing traffic prediction approaches in both short- and long-term predictions. In summary, this thesis aims to demonstrate that incorporating spatiotemporal correlations and contextual behaviors of sensors in a road network improves the accuracy of real-time traffic forecasting.

1.3 Context-Aware Spatiotemporal Traffic Prediction

In our first study, we explore a simple yet effective method to incorporate the correlations between different sensors: we represent each traffic situation as a context, into which a sensor's spatiotemporal information (i.e., location, time of day) can be encoded. The main objective is to learn from the current traffic situation in real time and predict the future traffic by matching the current situation to the most effective prediction model constructed from historical data.
For example, suppose that we trained four different base predictors at locations 1-4 using historical data, as shown in Figure 1.2; we will then learn online which predictor to use at location 5, where no predictor has been trained.

Figure 1.2: Example of base predictors.

The overall process is as follows. First, in an off-line phase, we categorize the historical data into classes of similar "situations", for each of which we train a finite number of traffic predictors (e.g., Auto-Regressive Integrated Moving Average (ARIMA) and Support Vector Regression (SVR)). Next, in an on-line phase, suppose we would like to predict the speed at location X; we find the subspace that location X belongs to and choose the most effective predictor. Thus, we need a mechanism to select the most effective predictor. The basic idea is to estimate the reward of using a predictor in different situations. The reward estimate is calculated based on how accurate each predictor has been, given the actual speed values we have observed in the recent past via real-time data.

Besides the spatiotemporal features, many other features can be used to identify a traffic context, for example: weather condition, number of lanes, area type (e.g., business district, residential), etc. Therefore, the context space is a multidimensional space with D dimensions, where D is the number of features. Since the context space can be very large, learning the most effective predictor for each individual context (i.e., a D-dimensional point in the context space) using reward estimates for that individual context can be extremely slow. Therefore, we adaptively partition the context space online, based on the dimensions and the domain of each feature, in order to efficiently estimate the reward of each predictor in different contexts.

Our approach has three important byproducts.
First, since location and time are two features of our context space, our approach is inherently spatiotemporal and takes into consideration the sensor readings that are spatially and temporally closest to the target location. Second, our approach is agnostic to the cause of congestion. For example, our reward mechanism may guide us to select a predictor that is trained for a rush-hour subspace even though the current time is not a rush hour, perhaps because an incident (e.g., an unknown object in the middle of the freeway) has resulted in a traffic condition similar to that of a rush hour at that location. Therefore, we can still predict the traffic condition successfully in the presence of that event. Finally, since the reward is continuously being updated and aggregated, we are utilizing what we learn in real time to adapt to both short-term and long-term changes.

1.4 Real-Time Traffic Prediction with Latent Space Model for Road Network

In our second study, we propose Latent Space Modeling for Road Networks (LSM-RN), which enables more accurate and scalable traffic prediction by utilizing both topological similarity and temporal correlations in a holistic way. Specifically, given a series of road network snapshots, we embed the vertices of the road network into a latent space, where two vertices that are similar in terms of both time-series traffic behavior and road network topology are close to each other in the latent space. With these latent attributes, we can accurately estimate the traffic patterns and their evolution over time. Figure 1.3 shows an example with 10 graph snapshots taken during the morning rush hour between 9:00 AM and 9:55 AM; the objective is to predict the near-future travel time (i.e., at 10:00 AM) of the road network.

Figure 1.3: Example of snapshots.

Figure 1.4 shows an overview of the LSM-RN framework.
Given a series of road network snapshots, LSM-RN processes them in three steps: (1) discover the latent attributes of vertices at each timestamp, which capture both the spatial and temporal properties; (2) understand the traffic patterns and build a predictive model of how these latent attributes change over time; and (3) exploit real-time traffic information to adjust the existing latent space models.

Figure 1.4: Overall framework of LSM-RN.

To infer the time-dependent latent attributes of our LSM-RN model, a typical method is to utilize multiplicative algorithms [44] based on Non-negative Matrix Factorization (NMF), where we jointly infer all the latent attributes via iterative updates until they become stable, termed global learning. However, global learning is not only slow but also impractical for real-time traffic prediction. This is because traffic data are of high fidelity (i.e., updates arrive every minute), and the actual ground truth of traffic speed becomes available shortly afterwards (e.g., after making a prediction for the next five minutes, the ground-truth data becomes available five minutes later). We thus propose an incremental online learning approach with which we sequentially and adaptively learn the latent attributes from the temporal traffic changes. In particular, each time our algorithm makes a prediction with the latent attributes learned from the previous snapshot, it receives feedback from the next snapshot (i.e., the ground-truth speed reading we have already obtained) and subsequently modifies the latent attributes for more accurate predictions.
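As a toy illustration of the multiplicative-update style behind global learning: the actual LSM-RN model factors each snapshot as G ≈ UBUᵀ with topology and temporal regularizers, but the classic Lee–Seung updates for the plain factorization G ≈ WH already show the iterate-until-stable pattern. Everything below (the matrix, rank, and iteration count) is made up for illustration:

```python
# Multiplicative-update NMF on a tiny non-negative "traffic" matrix:
# alternately scale H and W by ratios of non-negative terms so the
# factors stay non-negative and the reconstruction error decreases.
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(r) for r in zip(*A)]

def nmf(G, k, iters=200, eps=1e-9):
    n, m = len(G), len(G[0])
    W = [[0.5 + 0.01 * (i + j) for j in range(k)] for i in range(n)]
    H = [[0.5 + 0.01 * (i + j) for j in range(m)] for i in range(k)]
    for _ in range(iters):
        WH, Wt = matmul(W, H), transpose(W)
        num, den = matmul(Wt, G), matmul(Wt, WH)       # H <- H * (W^T G)/(W^T W H)
        H = [[H[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)] for i in range(k)]
        WH, Ht = matmul(W, H), transpose(H)
        num, den = matmul(G, Ht), matmul(WH, Ht)       # W <- W * (G H^T)/(W H H^T)
        W = [[W[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)] for i in range(n)]
    return W, H

# A rank-1 matrix is recovered almost exactly by a rank-1 factorization.
G = [[2.0, 4.0, 6.0], [1.0, 2.0, 3.0]]
W, H = nmf(G, k=1)
approx = matmul(W, H)
err = max(abs(approx[i][j] - G[i][j]) for i in range(2) for j in range(3))
```

The incremental variant described above would avoid rerunning such full sweeps: upon feedback, only the latent rows of the affected (and correlated) vertices are adjusted.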
Unlike traditional online learning, which performs only a single update (e.g., updating one vertex per prediction) per round, our goal is to make predictions for the entire road network, and thus the proposed online algorithm allows updating the latent attributes of many correlated vertices simultaneously.

By leveraging both the global and the incremental learning algorithms, our LSM-RN model can strike a balance between accuracy and efficiency for real-time forecasting. Specifically, we consider a setting with a predefined time window where, at each time window (e.g., 5 minutes), we learn our traffic model with the proposed incremental inference approach on the fly and make predictions for the next time span. Meanwhile, we batch the re-computation of our traffic model at the end of one large time window (e.g., one hour). Under this setting, our LSM-RN model provides the following two properties: (1) real-time feedback information can be seamlessly incorporated into our framework to adjust the existing latent spaces, thus allowing for more accurate predictions; and (2) our algorithms perform training and predictions on the fly with a small amount of data, rather than requiring large training datasets.

We conducted extensive experiments on a large-scale real-world traffic sensor dataset. We demonstrated that the LSM-RN framework achieves better accuracy than both existing time series methods (e.g., ARIMA and SVR) and approaches based on latent space models for social networks. Moreover, we show that our algorithm scales to large road networks. For example, it takes only 4 seconds to make a prediction for a network with 19,986 edges. Finally, we show that our batch window setting works well for streaming data, alternating the executions of our global and incremental algorithms, which strikes a compromise between prediction accuracy and efficiency.
For instance, incremental learning is one order of magnitude faster than global learning, and it requires less than one second to incorporate real-time feedback information.

1.5 Situation-Aware Multi-Task Learning for Traffic Forecasting

Although both of our previous two studies make real-time traffic predictions by utilizing the spatiotemporal correlations between sensors, neither of them can build a model for all sensors under all traffic situations. The focus of our first online learning framework is to dynamically identify the traffic situation and update the predictor selection accordingly, but it does not update the weak predictors themselves from the newly arrived data. On the other hand, LSM-RN performs training and prediction on the fly by utilizing only the most recent traffic data, but it does not consider the underlying traffic situations. In addition, since LSM-RN ignores the large amount of historical traffic data, it does not perform well for long-term traffic prediction. Therefore, in our third study, we further explore the idea of building models based on the traffic situations shared across sensors. We observe that the traffic readings within one sensor can be an amalgam of multiple traffic situations, making it difficult to build one single model that captures all of them. On the other hand, there exist many commonalities across sensors, especially since they exhibit similar patterns under the same traffic situation, for example during rush hour or on a rainy day. Moreover, the number of traffic situations is limited, which lends itself to building one model per traffic situation rather than per sensor. To summarize, our hypothesis is that building models based on the traffic situations shared across sensors can help improve prediction accuracy.
We propose a Situation-Aware Multi-Task Learning (SA-MTL) framework, where we first identify the traffic situations across all sensors and then apply the MTL framework for each identified traffic situation, i.e., each "task" corresponds to a traffic situation (see Figure 5.3). Since the number of distinct traffic situations is small, we can apply MTL for each individual traffic situation across different sensors to examine whether the traffic situations are shared among all the sensors. Specifically, to identify the traffic situation, we augment each training sample of one sensor with additional contextual features including road type (e.g., highway or arterial), location, weather condition, area classification (e.g., business district, residential), accident information, etc. Subsequently, we combine the training samples across all sensors and cluster them into several partitions, where each partition represents one typical traffic situation and consists of a different number of training samples from all sensors. Consequently, for each specific traffic situation, we use MTL to simultaneously learn the prediction models of all sensors. In particular, we employ group Lasso regularization based on the l_{2,1}-norm, which ensures that for one traffic situation a small set of features is shared among all sensor prediction tasks. We utilize the FISTA [11] method to solve the proposed optimization problem with a guaranteed convergence rate. We evaluated our proposed model with an extensive experimental study on large-scale Los Angeles traffic sensor data. We show that by taking all of the traffic situations into consideration, our proposed SA-MTL framework performs consistently better than not only Naive-MTL but also other state-of-the-art approaches, with up to 18% and 30% improvement for short- and long-term predictions, respectively, over the best approach under each traffic situation and prediction horizon.
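Inside FISTA, the group Lasso term is handled by its proximal operator, row-wise group soft-thresholding. The sketch below assumes the common convention that row j of the weight matrix W collects feature j's coefficients across all sensor tasks, so the l_{2,1} penalty either keeps or discards a feature for every sensor at once; this is an illustration of the operator, not the thesis' full solver.

```python
import numpy as np

def prox_l21(W, tau):
    """Proximal operator of tau * sum_j ||W[j, :]||_2 (group soft-thresholding).
    Each row is shrunk toward zero; rows whose norm falls below tau are zeroed
    entirely, dropping that feature for all sensor tasks simultaneously."""
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    scale = np.maximum(0.0, 1.0 - tau / np.maximum(norms, 1e-12))
    return W * scale

W = np.array([[3.0, 4.0],    # row norm 5.0: shrunk but kept
              [0.3, 0.4]])   # row norm 0.5: zeroed when tau = 1
W_new = prox_l21(W, tau=1.0)
```

A FISTA iteration would alternate a gradient step on the smooth loss with this proximal step, which is what yields the guaranteed convergence rate mentioned above.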
1.6 Thesis Overview

The structure of the thesis is organized as follows: Chapter 2 discusses the related work on traffic prediction techniques. Chapter 3 introduces our online learning framework that creates a strong predictor from several weak predictors. Chapter 4 discusses the Latent Space Model for Road Networks that supports incremental online learning. Chapter 5 further explores the commonalities across sensors under different traffic situations, and introduces our Situation-Aware Multi-Task Learning framework. Chapter 6 summarizes the thesis and presents possible directions for future work.

Chapter 2 Related Work

In this chapter, we discuss the related work in the domain of traffic prediction.

2.1 Traffic Prediction Techniques

With the increasing demand for road network traffic prediction, much research effort has been devoted to developing traffic prediction models. Various machine learning techniques [36, 40, 45, 73, 55, 62, 54, 28, 71, 84, 5, 65, 67, 59, 52, 59, 56, 76, 87, 61] have been proposed for the problem of traffic prediction. Some representative techniques include Historical Average [36, 40], ARIMA [45, 73, 55, 62], Kalman filter [54, 28, 71], Support Vector Regression [59, 52, 74, 56], Gaussian Process [76, 87], k-Nearest Neighbor (kNN) [61], Neural Network [5, 65, 67, 84] and deep learning [70, 31, 81]. A recent survey of traffic prediction techniques can be found in [66, 68]. Among those studies, the major focus is to make predictions for a single road segment of one specific road category (e.g., highway, arterial way). In this thesis, we study the problem of traffic prediction for the whole network by considering the spatiotemporal correlations.

2.2 Traffic Prediction with Spatiotemporal Correlations

The spatiotemporal relationship plays an important role in traffic prediction. The basic idea is that the traffic condition of one road segment can be related to that of some of its nearby road segments.
The majority of existing studies utilize temporal data to model each sensor (or edge) independently and make predictions using approaches such as ARIMA [55], SVR [74] and GP [87]. For instance, Pan et al. [55] learn an enhanced ARIMA model for each edge in advance, and then perform traffic prediction on top of these models. In addition, they aim at predicting in a specific traffic situation, e.g., either the typical condition or when an accident occurs. There are many studies that leverage spatial/topological similarities to predict the readings of an edge based on its neighbors in either the Euclidean space or the network space. For example, Kamarianakis and Prastacos [39] extend the vector ARIMA model to incorporate spatial correlations, where the neighboring relationship of sensors depends purely on Euclidean distance. Min and Wynter [51] improve the model of [39] by decomposing the traffic condition into different time intervals and exhibiting a distinct neighboring relationship for each time interval. A few studies [42, 79] utilize a spatiotemporal model with correlated time series based on a Hidden Markov Model; these approaches simply combine the local information of neighbors with temporal information. In addition, existing approaches are computationally expensive and require repeated offline trainings. Therefore, it is difficult to adapt these models to real-time traffic forecasting for the entire road network. Note that many existing studies [32, 72, 18] on traffic prediction are based on GPS datasets, which differ from the sensor dataset, where we have fine-grained and steady readings from road-equipped sensors. Compared with those studies, our latent space modeling approach considers both time and network topology for real-time traffic prediction from incomplete (i.e., missing sensors and values) sensor datasets.
We further propose a multi-task learning framework that identifies the traffic situations via unsupervised learning and then applies supervised learning to simultaneously train the prediction models of all sensors per traffic situation. Therefore, we can find the best prediction model for each traffic situation.

2.3 Online Traffic Prediction

In traffic prediction, the ground-truth data can be observed shortly after making the prediction, which provides a great opportunity to improve/adjust the model incrementally (i.e., online learning). For example, Jeong et al. [37] proposed an online weighted support vector regression for short-term traffic flow prediction, and Liu et al. [48] proposed an online learning algorithm for the ARIMA model. In this thesis, we propose two different ideas for online traffic prediction. In our first study [77], we consider using the newly arrived data as feedback to reward one classifier vs. the other, but not for dynamically updating the model. In LSM-RN [22], we design a topology-aware online learning algorithm, which adaptively updates our model under topology constraints. The proposed algorithms differ from traditional online learning algorithms such as [13], which perform the online update independently.

2.4 Latent Space Model for Traffic Prediction

Recently, many real data analytic problems such as community detection [82, 69], recommendation systems [19], topic modeling [58], image clustering [16], and sentiment analysis [88] have been formulated as latent space learning problems. These studies assume that, given a graph, each vertex resides in a latent space with attributes, and vertices which are close to each other are more likely to be in the same cluster (e.g., community or topic) and form a link. In particular, the objective is to infer the latent matrix by minimizing the difference (e.g., squared loss [82, 88] or KL-divergence [16]) between observed and estimated links.
However, existing methods are not designed for highly correlated (topologically and temporally) and dynamic road networks. A few studies [57] have considered the temporal relationships in social networks under the assumption that networks evolve over time. The temporal graph snapshots in [57] are treated separately, and thus newly observed data are not incorporated to improve the model. Compared with existing works, we explore the feasibility of modeling road networks with a time-varying latent space. The traffic speed of a road segment is determined by the latent attributes of its vertices and the interaction between the corresponding attributes. To tackle the sparsity of the road network, we utilize the graph topology by adding a graph Laplacian constraint to impute the missing values. In addition, the latent position of each vertex varies over time and allows for sudden movement from one timestamp to the next via a transition matrix.

2.5 Multi-Task Learning and Mixture of Experts for Traffic Prediction

Multi-task learning (MTL) [17, 25] learns multiple related tasks simultaneously by extracting and utilizing shared information, thus resulting in better generalization performance than learning each task independently. A key aspect of MTL is how to extract and exploit the commonality among different tasks, i.e., the task relatedness. In [25], Evgeniou et al. proposed regularized MTL, which constrains the models of all tasks to be close to each other. Other formulations of task relatedness include constraining multiple tasks to share a common set of features [8] or a common subspace [6, 7]. MTL has been applied in various domains such as computer vision [64], natural language processing [7], social event prediction [83], etc. Recently, MTL has been incorporated into deep learning architectures [31] to provide better traffic prediction. Different from [31], our proposed SA-MTL framework applies MTL to each identified traffic situation.
Mixture of experts (ME), proposed by Jacobs et al. [34, 38], is a learning paradigm that divides one task into a set of distinct sub-tasks (i.e., experts) and then utilizes a gating network to select the best expert. In our traffic prediction scenario, for each sensor, we learn an individual model for every traffic situation, which can be regarded as a special case of ME. Different from traditional ME, we explore the commonalities across all sensors, i.e., we assume that the traffic situations are shared among sensors, and MTL can thus be applied for each identified traffic situation to improve the prediction accuracy. This also addresses the issue that one sensor may not contain enough training samples for a specific traffic situation.

Chapter 3 Context-Aware Spatiotemporal Traffic Prediction

In this chapter, we propose an online framework that learns from the current traffic situation (or context) in real time and predicts the future traffic condition by matching the current situation to the most effective prediction model trained using historical data. Location and time are two features of our context space, and hence our approach is inherently spatiotemporal, considering the sensor readings that are spatially and temporally closest to the target location.

3.1 Problem Formulation

3.1.1 Problem Setting

Figure 3.1 illustrates the system model under consideration. We consider a set of locations L where traffic sensors are deployed. These locations can be either on highways or on arterial streets. We consider an infinite-horizon discrete-time system t = 1, 2, ... where in each slot t a traffic prediction request from one of the locations l_o ∈ L arrives to the system in sequence. Given the current traffic speed x_t at this location, the goal is to predict the traffic speed ŷ_t at some predetermined future time, e.g., in the next 15 minutes or in the next 2 hours.
Note that the notation t is only used to order the requests according to their relative arrival times. Each request can come from any location in L at any time of day, thereby posing a spatiotemporal prediction problem. Each request is associated with a set of traffic context information provided by the road sensors.

[Figure 3.1: System diagram]

The context information can include, but is not limited to:

• The location context, e.g., the longitude and latitude of the requested location l_o, the location type (highway, arterial way), and the area type (business district, residential).

• The time context, e.g., whether on a weekday or weekend, at daytime or night, in the rush hour or not, etc.

• The incident context, e.g., whether a traffic incident occurred nearby and how far away from l_o, the type of the incident, the number of affected lanes, etc.

• Other contexts such as weather (temperature, humidity, wind speed, etc.), temporary events, etc.

We use the notation θ_t ∈ Θ to denote the context information associated with the t-th request, where Θ is a D-dimensional space and D is the number of context types used. Without loss of generality, we normalize the context space Θ to be [0, 1]^D. For example, time of day can be normalized with respect to 24 hours. The system maintains a set of K base predictors f ∈ F that take as input the current speed x_t, sent by the road sensors, and output the predicted speed f(x_t) at location l_o for the predetermined future time. These base predictors are trained and constructed using historical data for K representative traffic situations before the system operates. However, their performance is unknown for the other traffic situations, which are changing over time. We aim to build a hybrid predictor that selects the most effective predictor for the real-time traffic situation by exploiting the traffic context information.
Thus, for each request, the system selects the prediction result of one of the base predictors as the final traffic prediction result, denoted by y_t. The prediction result can be consumed by third-party applications such as navigation. Eventually, the real traffic at the predetermined future time for the t-th request, denoted by ŷ_t, is revealed. We also call ŷ_t the ground-truth label for the t-th request. For now we assume that the label is revealed for each request at the end of each prediction; in reality, the label can arrive with delay or even be missing. We will consider these scenarios in Section 3.3.3. By comparing the system-predicted traffic y_t and the true traffic ŷ_t, a reward r_t is obtained according to a general reward function r_t = R(y_t, ŷ_t). For example, a simple reward function indicates the accuracy of the prediction, i.e., R(y_t, ŷ_t) = I(y_t = ŷ_t), where I(·) is the indicator function. The system obtains a reward of 1 only if the prediction is correct and 0 otherwise. Other reward functions that depend on how close the prediction is to the true label can also be adopted. As mentioned, each base predictor is a function of the current traffic x_t that outputs the future traffic prediction y_t. Since for a given x_t the true future traffic ŷ_t is a random variable, the reward of selecting a predictor f, i.e., R(f(x_t), ŷ_t), is also a random variable at each t. The effectiveness of a base predictor is measured by its expected reward, which depends on the underlying unknown joint distribution of x_t and ŷ_t. The effectiveness of a base predictor in a traffic context θ is thus its expected reward conditional on θ, determined by the underlying unknown joint distribution of x_t and ŷ_t conditional on the situation θ. Let μ_f(θ) = E{R(f(x), ŷ) | θ} be the expected reward of a predictor f in context θ.
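The indicator reward and the sample estimation of μ_f(θ) can be made concrete with a toy simulation; the 70% accuracy of the hypothetical predictor in a fixed context is an arbitrary assumption for illustration.

```python
import random

def indicator_reward(y_pred, y_true):
    """Reward R(y, y_hat) = I(y == y_hat): 1 only for a correct prediction."""
    return 1.0 if y_pred == y_true else 0.0

# Estimate mu_f(theta) for one predictor in one fixed context by averaging
# rewards over repeated label draws from a toy distribution in which the
# predictor (always outputting 1 here) is correct 70% of the time.
random.seed(0)
rewards = [indicator_reward(1, 1 if random.random() < 0.7 else 0)
           for _ in range(10000)]
mu_hat = sum(rewards) / len(rewards)
```

The sample mean mu_hat converges to the conditional expected reward, which is exactly the quantity the learning algorithm later estimates per context subspace.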
However, since the base predictors are constructed using historical data, their expected rewards are unknown a priori for real-time situations, which may vary over time. Therefore, the system will continuously revise its selection of base predictors as it learns the base predictors' expected rewards in the current context better and better.

[Figure 3.2: Spatiotemporal prediction and multi-predictor diversity gain]

3.1.2 Spatiotemporal Prediction and Multi-predictor Diversity Gain

By taking the traffic context information into consideration when making traffic predictions, we exploit the multi-predictor diversity to improve the prediction performance. To get a sense of where the multi-predictor diversity gain comes from, consider the simple example in Figure 3.2, which shows the expected rewards of various base predictors. Since traffic prediction is a spatiotemporal problem, we use both the time of day and the location of the traffic as the context information. Given a location 5 miles from the reference location, we have three predictors constructed for three representative traffic situations: morning around 6am, afternoon around 2pm and evening around 7pm. These predictors work effectively in their corresponding situations but may not work well in other time-of-day contexts due to the different traffic conditions at different times of the day. If we use the same predictor for the entire day, then the average prediction performance can be bad. Instead, if we use each predictor for traffic situations that are similar to its representative situation, then much better prediction performance can be obtained. However, the challenge is when to use which predictor, since the effectiveness of the base predictors is unknown for every traffic context.
For example, the three base predictors (constructed for locations 0 miles, 5 miles and 10 miles from the reference location, respectively, around time 12pm) have complex expected reward curves which need to be learned over time to determine which predictor is best at different locations.

3.1.3 Performance Metric for Our Algorithm

The goal of our system is to learn the optimal hybrid predictor, which selects the most effective base predictor for each traffic situation. Since we do not have complete knowledge of the performance of all base predictors for all contexts in the online environment, we will develop online learning algorithms that learn to select the best predictors for different traffic contexts over time. The benchmark when evaluating the performance of our learning algorithm is the optimal hybrid predictor constructed by an oracle that has complete information of the expected rewards of all base predictors in all situations. For a traffic context θ, the optimal base predictor selected in the oracle benchmark is

f^*(θ) := argmax_{f ∈ F} μ_f(θ), ∀θ ∈ Θ    (3.1)

Let σ be a learning algorithm and f_{σ(t)} be the predictor selected by σ at time t; then the regret of learning by time T is defined as the aggregate reward difference between our learning algorithm and the oracle solution up to T, i.e.,

Reg(T) := E[ Σ_{t=1}^{T} μ_{f^*(θ_t)}(θ_t) − Σ_{t=1}^{T} R(f_{σ(t)}(x_t), ŷ_t) ]    (3.2)

where the expectation is taken with respect to the randomness of the prediction, the true traffic realization and the predictors selected. The regret characterizes the loss incurred due to the unknown transportation system dynamics and gives the convergence rate of the total expected reward of the learning algorithm to the value of the optimal hybrid predictor in (3.1). The regret is non-decreasing in the total number of requests T, but we want it to increase as slowly as possible. Any algorithm whose regret is sublinear in T, i.e.,
Reg(T) = O(T^q) with q < 1, will converge to the optimal solution in terms of the average reward, i.e., lim_{T→∞} Reg(T)/T = 0. The regret of learning also gives a measure of the rate of learning: a smaller q results in a faster convergence to the optimal average reward, and thus learning the optimal hybrid predictor is faster when q is smaller.

3.2 Context-Aware Adaptive Traffic Prediction

A natural way to learn a base predictor's performance in a non-representative traffic context is to record and update its sample mean reward as additional data (i.e., traffic requests and the realized traffic) in the same context arrive. Using such a sample-mean-based approach to construct a hybrid predictor is the basic idea of our learning algorithm; however, significant challenges remain. On the one hand, exploiting the context information can potentially boost the prediction performance, as it provides ways to construct a strong hybrid predictor as suggested in Section 3.1.2. Without the context information, we would only learn the average performance of each predictor over all contexts, and thus a single base predictor would always be selected even though on average it does not perform well. On the other hand, building the optimal hybrid predictor can be very difficult since the context space Θ can be very large and the value space can be continuous. Thus, the sample mean reward approach would fail to work efficiently due to the small number of samples for each individual context θ. Our method to overcome this difficulty is to dynamically partition the entire context space into multiple smaller context subspaces, and to maintain and update the sample mean reward estimates for each subspace. This is based on the fact that the expected rewards of a predictor are likely to be similar for similar contexts. For instance, similar weather conditions would have similar impacts on the traffic at close locations.
Next, we will propose an online prediction algorithm that adaptively partitions the context space according to the traffic prediction request arrivals on the fly and guarantees a sublinear learning regret.

3.2.1 Algorithm Description

In this subsection, we describe the proposed online context-aware traffic prediction algorithm (CA-Traffic). First we introduce several useful concepts for describing the proposed algorithm.

• Context subspace. A context subspace C is a subspace of the entire context space Θ, i.e., C ⊆ Θ. In this thesis, we consider only context subspaces that are created by uniformly partitioning the context space on each dimension, which suffices to guarantee sublinear learning regrets. Thus, each context subspace is a D-dimensional hypercube with side length 2^{-l} for some l. We call such a hypercube a level-l subspace. For example, when the entire context space is [0, 1], namely the context dimension is D = 1, the entire context space [0, 1] is a level-0 subspace; [0, 1/2) and [1/2, 1] are two level-1 subspaces; [0, 1/4), [1/4, 1/2), [1/2, 3/4) and [3/4, 1] are four level-2 subspaces, etc.

• Context space partition. A context space partition P is a set of non-overlapping context subspaces that cover the entire context space. For example, when D = 1, {[0, 1]} and {[0, 1/2), [1/2, 3/4), [3/4, 1]} are two context space partitions. Since our algorithm adaptively partitions the context space by removing subspaces from the partition and adding new subspaces into it, the context space partition is time-varying, depending on the context arrival process of the traffic requests. Initially, the context space partition includes only the entire context space, i.e., P_0 = {Θ}.

• Active context subspace. A context subspace C is active if it is in the current context space partition P_t at time t.
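Locating the level-l hypercube that contains a context θ ∈ [0, 1]^D reduces to scaling each coordinate by 2^l; this indexing sketch is one plausible implementation, not necessarily the data structure used in the thesis.

```python
def subspace_index(theta, level):
    """Return the integer coordinates of the level-`level` hypercube
    (side length 2**-level) containing context theta in [0, 1]^D.
    Coordinates equal to 1.0 are clamped into the last cell so the
    cells jointly cover the whole space."""
    cells = 2 ** level
    return tuple(min(int(x * cells), cells - 1) for x in theta)

# level-2 subspaces in 1-D: [0, 1/4), [1/4, 1/2), [1/2, 3/4), [3/4, 1]
idx = subspace_index((0.3,), level=2)    # 0.3 falls in [1/4, 1/2)
```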
For each active context subspace C ∈ P_t, the algorithm maintains the sample mean reward estimate r̄_f^t(C) for each predictor f over the context arrivals to this subspace from time 1 to time t. For each active subspace C ∈ P_t, the algorithm also maintains a counter M_C^t that records the number of context arrivals to C from time 1 to time t.¹ The algorithm works as follows (see also a formal description in Algorithm 1). We describe the algorithm in two parts. The first part (lines 3-9) is the predictor selection and reward estimate update. When a traffic prediction request comes, the traffic speed vector x_t along with the traffic context information θ_t are sent to the system. The algorithm first checks which active subspace C ∈ P_t in the current partition P_t the context θ_t belongs to (line 3), and the level l of this subspace (line 4). Next, the algorithm activates all predictors and obtains their predictions f(x_t), ∀f ∈ F, given the input x_t (line 5). However, it selects only one of the predictions as the final prediction y_t, as follows (line 6):

y_t = f̃(x_t), where f̃ = argmax_f r̄_f^t(C)    (3.3)

In words, the selected base predictor has the highest reward estimate for the context subspace C among all predictors. This is an intuitive selection based on the sample mean rewards. When the true traffic label ŷ_t is revealed (line 7), the sample mean reward estimates for all predictors are updated (line 8) and the counter is incremented by 1 (line 9). The second part of the algorithm, namely the adaptive context space partitioning, is the key to our algorithm (lines 10-12). At the end of each slot t, the algorithm decides whether to further partition the current subspace C, depending on whether we have

¹ This method requires keeping all request history in memory, which can be a concern for some systems. Alternatively, the counters can be restarted after every partition.
Our bounds will still hold, but the performance of the algorithm will be worse in practice since it does not capitalize on the prior knowledge.

Algorithm 1 Context-Aware Traffic Prediction (CA-Traffic)
1: Initialize P_0 = {Θ}, r̄_f(Θ) = 0 ∀f ∈ F, M_Θ^0 = 0.
2: for each traffic prediction request (time slot t) do
3:   Determine C ∈ P_t such that θ_t ∈ C.
4:   Determine the level l of C.
5:   Generate the prediction results f(x_t) for all predictors f.
6:   Select the final prediction y_t = f̃(x_t) according to (3.3).
7:   The true traffic pattern ŷ_t is revealed.
8:   Update the sample mean rewards r̄_f(C), ∀f.
9:   M_C^t = M_C^t + 1.
10:  if M_C^t ≥ A·2^{pl} then
11:    C is further partitioned.
12:  end if
13: end for

[Figure 3.3: An illustration of the context space partitioning in a 2-dimensional space: the lower-left subspace is further partitioned into 4 smaller subspaces because the partition condition is satisfied.]

seen sufficiently many request arrivals in C. More specifically, if M_C^t ≥ A·2^{pl}, then C will be further partitioned (line 10), where l is the subspace level of C, and A > 0 and p > 0 are two design parameters. When partitioning is needed, C is uniformly partitioned into 2^D smaller hypercubes (each hypercube is a level-(l+1) subspace with side length half that of C). Then C is removed from the active context subspace set P_t and the new subspaces are added into P_t (line 11). In this way, P_t remains a partition whose subspaces are non-overlapping and cover the entire context space. Figure 3.3 provides an illustrative example of the context space partitioning for a 2-dimensional context space. The current context space partition P_t is shown in the left plot, and the current subspace C is the shaded bottom-left square. When the partitioning condition is satisfied, C is further split into four smaller squares. The context space partitioning process helps refine the learning in smaller subspaces.
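Algorithm 1 can be sketched end to end in a few dozen lines of Python. This is a simplified toy implementation under assumed conventions (per-predictor rewards supplied externally after the label is revealed, ties in (3.3) broken by predictor index), not the evaluated system.

```python
from collections import defaultdict

class CATraffic:
    """Toy sketch of Algorithm 1 (CA-Traffic). Contexts live in [0, 1]^D.
    Each active subspace is a (level, cell) hypercube of side 2**-level;
    it splits into 2^D children once it has seen A * 2**(p*level) arrivals."""

    def __init__(self, n_predictors, D, A=2, p=1.0):
        self.K, self.D, self.A, self.p = n_predictors, D, A, p
        self.active = {(0, (0,) * D)}        # initial partition P_0 = {Theta}
        self.sum_r = defaultdict(float)      # (cell, f) -> summed rewards
        self.n_r = defaultdict(int)          # (cell, f) -> number of rewards
        self.M = defaultdict(int)            # cell -> arrival counter M_C

    def _locate(self, theta):
        # line 3: find the active subspace C containing theta; since splits
        # only refine cells, walking down the levels hits exactly one match
        for l in range(64):
            cells = 2 ** l
            key = (l, tuple(min(int(x * cells), cells - 1) for x in theta))
            if key in self.active:
                return key
        raise RuntimeError("context not covered by the partition")

    def select(self, theta):
        # lines 3-6: pick the predictor with the highest sample mean in C (3.3)
        C = self._locate(theta)
        mean = lambda f: self.sum_r[(C, f)] / max(self.n_r[(C, f)], 1)
        return C, max(range(self.K), key=mean)

    def update(self, C, rewards):
        # lines 7-12: fold in the revealed rewards, then split C if mature
        for f, r in enumerate(rewards):
            self.sum_r[(C, f)] += r
            self.n_r[(C, f)] += 1
        self.M[C] += 1
        level, idx = C
        if self.M[C] >= self.A * 2 ** (self.p * level):
            self.active.remove(C)
            children = [()]
            for c in idx:                    # refine each coordinate in two
                children = [t + (2 * c + b,) for t in children for b in (0, 1)]
            self.active.update((level + 1, t) for t in children)

alg = CATraffic(n_predictors=2, D=1, A=2, p=1.0)
C, f = alg.select((0.3,))
alg.update(C, rewards=[1.0, 0.0])
alg.update(C, rewards=[1.0, 0.0])   # M_C reaches A*2^0 = 2, so C splits
```

After the split, a request with the same context lands in the finer level-1 cell, whose statistics start fresh, which is how the algorithm refines its learning in smaller subspaces.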
In the next subsection, we will show that by carefully choosing the design parameters A and p, we can achieve a regret upper bound that is sublinear in time, which implies that the optimal time-average prediction performance can be achieved.

3.2.2 Learning Regret Analysis

In this subsection, we analyze the regret of the proposed traffic prediction algorithm. To enable this analysis, we make a technical assumption that each base predictor achieves similar expected rewards (accuracy) for similar contexts; this is formalized in terms of a Hölder condition.

Assumption 1. For each f ∈ F, there exist L > 0 and α > 0 such that for all θ, θ′ ∈ Θ, we have

|μ_f(θ) − μ_f(θ′)| ≤ L‖θ − θ′‖^α    (3.4)

This is a natural and reasonable assumption in traffic prediction problems, since similar contexts lead to similar impacts on the prediction outcomes. Note that L is not required to be known, and an unknown α can be estimated online using the sample mean estimates of rewards for similar contexts; our proposed algorithm can be modified to include the estimation of α. To obtain sharp bounds on the prediction regret, we divide the time slots into two types depending on a deterministic control function ζ(t) = 2^{2αl} ln(t), where l is the level of the context subspace C that the time-t context belongs to: if M_C^t ≤ ζ(t), then slot t is a type-1 slot; if M_C^t > ζ(t), then slot t is a type-2 slot. The important difference between these two types of slots is that for a type-2 slot, we have a stronger confidence bound on the estimated rewards of the various predictors for the current context subspace C, because we have sufficiently many samples according to the deterministic function. This will help us derive the regret bound. However, this differentiation of slots is used only in our regret analysis; all slots are treated equally in the implementation and operation of our algorithm.
Because any time slot is either a type-1 slot or a type-2 slot, the prediction regret can therefore be divided into two parts:

Reg(T) = Reg_1(T) + Reg_2(T)    (3.5)

where Reg_1(T) and Reg_2(T) are the regret due to choosing non-optimal predictors in type-1 slots and type-2 slots, respectively. We will bound these two parts separately to get the total regret bound. To do this, we will first investigate the regret incurred for a level-l context subspace and then sum up the regret incurred in context subspaces of all levels. Without loss of generality, we assume that, for any context, the reward difference between the optimal predictor and any non-optimal predictor is bounded by 1. In Lemma 1, we bound, for any level-l subspace, the regret due to choosing non-optimal predictors in type-1 slots.

Lemma 1. For every level-l context subspace C, the regret due to choosing non-optimal predictors in type-1 slots is bounded by 2^{2αl} ln(t).

Proof. By the definition of a type-1 slot, at any time t there are no more than ζ(t) = 2^{2αl} ln(t) type-1 slots for a level-l context subspace. Hence, the regret is bounded above by 2^{2αl} ln(t).

Next, we bound, for any level-l subspace, the regret due to choosing non-optimal predictors in type-2 slots.

Lemma 2. Assume 2L(√D)^α + (2 − B) ≤ 0. Then for every level-l context subspace C, the regret due to choosing non-optimal predictors in type-2 slots is bounded by Kπ²/3 + AB·2^{(p−α)l}.

Proof. See Appendix A.1.

Now we combine the results in Lemma 1 and Lemma 2 to obtain the complete regret bound. The regret depends on the context arrival process; hence, we let W_l(T) denote the number of level-l subspaces that have been activated by time T. Before we derive Theorem 1, we provide a bound on the highest level of an active subspace by time T.

Lemma 3. Given a time T, the highest level of an active subspace is at most ⌈log_2(T/A)/p⌉ + 1.

Proof.
It is easy to see that the highest possible level of active subspace is achieved when all requests by time $T$ have the same context. This requires $A \cdot 2^{p l_{\max}} < T$. Therefore, $l_{\max} = \lceil \log_2(T/A)/p \rceil + 1$.

Theorem 1 establishes the regret bound.

Theorem 1. Assume $p = 3\alpha$ and $B = 2L(\sqrt{D})^\alpha + 2$. The regret is upper bounded by
$$\mathrm{Reg}(T) \le \sum_{l=1}^{\lceil \log_2(T/A)/(3\alpha) \rceil + 1} W_l(T)\left(2^{2\alpha l}(\ln(T) + AB) + K\frac{\pi^2}{3}\right) \quad (3.6)$$

Proof. Combining the results of Lemma 1 and Lemma 2, it is easy to see that the regret is upper bounded by
$$\mathrm{Reg}(T) \le \sum_{l=1}^{\lceil \log_2(T/A)/p \rceil + 1} W_l(T)\left(2^{2\alpha l}\ln(T) + K\frac{\pi^2}{3} + AB \cdot 2^{(p-\alpha)l}\right) \quad (3.7)$$

In order to balance the time orders of the different terms on the right-hand side, we let $p = 3\alpha$. Although choosing $p$ smaller than $3\alpha$ will not make the regret of a subspace larger, it will increase the number of subspaces activated by time $T$, causing an increase in the regret. Since we sum over all activated subspaces, it is best to choose $p$ as large as possible.

The following corollary establishes the regret bound when the context arrivals are uniformly distributed over the entire context space. For example, if the context is the location, then the requests come uniformly from the area $\mathcal{L}$. This is the worst-case scenario, because the algorithm has to learn over the entire context space.

Corollary 1. If the context arrival by time $T$ is uniformly distributed over the context space, we have
$$\mathrm{Reg}(T) \le (T/A)^{\frac{D+2\alpha}{D+3\alpha}}\, 2^{D+2\alpha}(\ln(T) + AB) + (T/A)^{\frac{D}{D+3\alpha}}\, 2^D K\frac{\pi^2}{3} \quad (3.8)$$

Proof. First we calculate the highest level of subspace when context arrivals are uniform. In the worst case, all level-$l$ subspaces stay active and are then deactivated as all level-$(l+1)$ subspaces become active, and so on. Let $l_{\max}$ be the maximum level of subspace under this scenario. Because there must be some time $T' < T$ when all subspaces are level-$l$ subspaces, we have
$$2^{Dl} A 2^{3\alpha l} < T \quad (3.9)$$
where $2^{Dl}$ is the maximum number of level-$l$ subspaces and $2^{3\alpha l}$ is the maximum number of time slots that belong to a level-$l$ subspace.
Thus, we have $l_{\max} < \frac{\log_2(T/A)}{D+3\alpha} + 1$. Substituting this into the regret bound in Theorem 1, we get
$$\mathrm{Reg}(T) \le \sum_{l=1}^{\frac{\log_2(T/A)}{D+3\alpha}+1} 2^{Dl}\left(2^{2\alpha l}\ln(T) + K\frac{\pi^2}{3} + AB \cdot 2^{l(p-\alpha)}\right) \le (T/A)^{\frac{D+2\alpha}{D+3\alpha}}\, 2^{D+2\alpha}(\ln(T) + AB) + (T/A)^{\frac{D}{D+3\alpha}}\, 2^D K\frac{\pi^2}{3} \quad (3.10)$$

We have shown that the regret upper bound is sublinear in time, implying that the average traffic prediction reward (e.g., accuracy) achieves the optimal reward as time goes to infinity. Moreover, the bound also characterizes the performance for any finite time $T$, rather than only asymptotically. Ensuring a fast convergence rate is important for the algorithm to quickly adapt to the dynamically changing environment.

3.3 Extensions

3.3.1 Dimension Reduction

In the previous section, the context space partitioning is performed on all context dimensions simultaneously. In particular, each context subspace $C$ has dimension $D$, and each time it is further partitioned, $2^D$ new subspaces are added into the context space partition $\mathcal{P}$. Thus, learning can be very slow when $D$ is large, since many traffic requests are required to learn the best predictors for all these subspaces.

One way to reduce the number of new subspaces created during the partitioning process is to maintain the context partition and subspaces, and perform the partitioning for each dimension separately. In this way, each time a partitioning is needed for one dimension, only two new subspaces are created for that dimension. Therefore, at most $2D$ more subspaces will be created for each request arrival.

The modified algorithm works as follows. For each context dimension (e.g., time of day, type and distance), we maintain a context space and partition structure similar to that in Section 3.1 (in other words, the context space dimension is 1, but we have $D$ such spaces). Denote by $\mathcal{P}_d^t$ the context space partition for dimension $d$, and by $C_d^t$ the current context subspace for dimension $d$, at time $t$.
Note now that since we consider only one dimension, $C_d^t$ is a one-dimensional subspace for each $d$. Each time a traffic prediction request $x_t$ with context $\theta_t$ arrives, we obtain the prediction results of all base predictors given $x_t$. The final prediction $y_t$ is selected according to a different rule than (3.3), as follows:
$$y_t = \tilde{f}(x_t) \quad \text{where} \quad \tilde{f} = \arg\max_f \{\max_d \bar{r}_f^t(C_d^t)\} \quad (3.11)$$

In words, the algorithm selects the predictor that has the highest reward estimate over all current subspaces among all context dimensions. Figure 3.4 shows an illustrative example of the predictor selection when we only use the time of day and the location as the contexts. In this example, the time of day context (10:05 am) falls into the leftmost quarter subspace (7 am - 11 am), and the location context (3.7 miles away from a reference location) falls into the right half subspace (2.5 - 5 miles). According to the time of day context dimension, the predictor with the highest reward estimate is Predictor 1, while according to the location context dimension, the predictor with the highest reward estimate is Predictor 2. Overall, the best estimated predictor is Predictor 2, which is selected by the algorithm.

Figure 3.4: An illustrative example of predictor selection with separately maintained context partitions: a request with context (10:05 am and 3.7 miles away from the reference location) arrives; Predictor 1 is the best for the time of day context and Predictor 2 is the best for the location context; Predictor 2 is the finally selected predictor.

After the true traffic $\hat{y}_t$ is observed, the reward estimates for all predictors in all $D$ one-dimensional context subspaces $C_d^t, \forall d$, are updated. The $D$ partitions $\mathcal{P}_d^t, \forall d$, are also updated in a similar way as before, depending on whether there have been sufficiently many traffic requests with contexts in the current subspaces. Figure 3.5 illustrates the context space partition for each individual dimension.
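Rule (3.11) can be sketched as follows (a minimal illustration; the nested-dictionary layout of the reward estimates and the numeric values are our own assumptions, not from the thesis):

```python
def select_predictor(reward_est):
    """Pick the final predictor per rule (3.11).

    reward_est[f][d] holds the sample-mean reward of base predictor f on
    the currently active subspace of context dimension d (illustrative
    data layout). The predictor whose best per-dimension estimate is
    highest wins.
    """
    return max(reward_est, key=lambda f: max(reward_est[f].values()))

# Mirroring Figure 3.4 (made-up reward values): Predictor 1 leads on the
# time-of-day dimension, Predictor 2 leads on location and wins overall.
estimates = {
    "Predictor 1": {"time of day": 0.90, "location": 0.60},
    "Predictor 2": {"time of day": 0.70, "location": 0.95},
}
```

With these values, `select_predictor(estimates)` returns "Predictor 2", matching the selection in the Figure 3.4 example.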
In this example, only the location context satisfies the partitioning condition, and hence its right half subspace is further partitioned.

Figure 3.5: An illustrative example of context space partitioning with separately maintained dimensions: partitioning only occurs on the location context, since the partitioning condition is satisfied.

3.3.2 Relevant Context Dimension

While using all context dimensions provides the most refined information and thus leads to the best performance, it is equally important to investigate which dimension, or set of dimensions, is the most informative for a specific traffic situation. The benefits of revealing the most relevant context dimension (or set of dimensions) are manifold, including reduced cost of context information retrieval and transmission, reduced algorithmic and computational complexity, and targeted active traffic control. In the extreme case, a context dimension (e.g., time of day) is not informative at all if, for all values of the context along this dimension, the best traffic predictor is the same. Hence, having this context dimension does not add benefits for the traffic prediction but only incurs additional cost.

For clarity, in the following we focus only on the single most relevant context dimension. The extension to the $k$ most relevant context dimensions ($\forall k < D$) is straightforward. Let $\mu_f(\theta_d)$ be the expected prediction reward of predictor $f \in \mathcal{F}$ when the context along the $d$-th dimension is $\theta_d$, and let $f^*(\theta_d) = \arg\max_f \mu_f(\theta_d)$ be the predictor with the highest expected reward given $\theta_d$. Then the expected reward if we only use the $d$-th dimension context information is $R_d = \mathbb{E}_{\theta_d}\{\mu_{f^*(\theta_d)}(\theta_d)\}$, where the expectation is taken over the distribution of the $d$-th dimension context. The most relevant context dimension is defined as $d^* = \arg\max_d R_d$.

Our framework can be easily extended to determine the most relevant context dimension. For each dimension, we maintain the same partition and subspace structure as in Section 3.1 (with $D = 1$). In addition, we maintain the time-average prediction reward $\bar{R}_d^t$ for each dimension $d$. The estimated most relevant dimension at time $t$ is thus $(d^*)^t = \arg\max_d \bar{R}_d^t$.

Theorem 2. The estimated most relevant dimension converges to the true most relevant dimension, i.e., $\lim_{t\to\infty} (d^*)^t = d^*$.

Proof. Since for each dimension $d$ the time-average regret tends to 0 as $t \to \infty$, the time-average reward satisfies $\bar{R}_d^t \to R_d$ as $t \to \infty$. Therefore, the most relevant dimension is revealed as $t \to \infty$.

3.3.3 Missing and Delayed Feedback

The proposed algorithm requires knowledge of the true label $\hat{y}_t$ of the predicted traffic in order to update the reward estimates of the different predictors, so that their true performance can be learned. In practice, the feedback about the true traffic label $\hat{y}_t$ can be missing or delayed due to, for example, delayed traffic reports or sensors being down temporarily. In this subsection, we make small modifications to the proposed algorithm to deal with such scenarios.

Consider first the case when the feedback is missing with probability $p_m$. The algorithm can be modified so that it updates the sample-mean reward and performs context space partitioning only for requests in which the true label is revealed. Let $\mathrm{Reg}_m(T)$ denote the regret of the modified algorithm with missing feedback; we have the following result.

Proposition 1. Suppose the feedback about the true label is missing with probability $p_m$. Then
$$\mathrm{Reg}_m(T) \le \sum_{l=1}^{\lceil \log_2(T/A)/(3\alpha) \rceil + 1} W_l(T)\left(2^{2\alpha l}\left(\frac{1}{1-p_m}\ln(T) + AB\right) + K\frac{\pi^2}{3}\right) \quad (3.12)$$

Proof. Missing labels require more type-1 slots to learn the performance of the base predictors accurately enough. In expectation, a fraction $\frac{1}{1-p_m} - 1$ more type-1 slots are required. Hence, the regret incurred in type-1 slots increases to $\frac{1}{1-p_m}$ times its previous value.
The regret incurred in type-2 slots is not affected, since the control function $\zeta(t)$ ensures that the reward estimates are accurate enough. Using the original regret bound and taking into account the increased regret incurred in type-1 slots, we obtain the new regret bound.

Consider next the case when the feedback is delayed. We assume that the true label of the request at time $t$ is observed at most $L_{\max}$ slots later. The algorithm is modified so that it keeps in its memory the last $L_{\max}$ labels, and the reward estimates are updated whenever the corresponding true label is revealed. Let $\mathrm{Reg}_d(T)$ denote the regret of the modified algorithm with delayed feedback. We then have the following result.

Proposition 2. Suppose the feedback about the true label is delayed by at most $L_{\max}$ slots. Then
$$\mathrm{Reg}_d(T) \le L_{\max} + \mathrm{Reg}(T) \quad (3.13)$$

Proof. A new sample is added to the sample-mean accuracy whenever the true label of a previous prediction arrives. The worst case is when all labels are delayed by $L_{\max}$ time slots, which is equivalent to starting the algorithm with an $L_{\max}$ delay.

The above two propositions show that missing and delayed labels reduce the learning speed. However, since the regret bounds are still sublinear in time $T$, the time-average reward converges to the optimal reward as $T \to \infty$. This shows that our algorithm is robust to errors caused by uncertain traffic conditions.

Figure 3.6: Freeway segment used in the experiment

3.4 Experiments

3.4.1 Experimental Setup

Dataset

Our experiments utilize a large real-world traffic dataset, which includes both real-time and historically archived data since 2010. The dataset consists of two parts: (i) Traffic sensor data from 9300 traffic loop detectors located on the highways and arterial streets of Los Angeles County (covering 5400 miles cumulatively). Several main traffic parameters, such as occupancy, volume and speed, are collected in this dataset at the rate of 1 reading per sensor per minute; (ii) Traffic incident data.
This dataset contains the traffic incident information in the same area as the traffic sensor dataset. On average, 400 incidents occur per day, and the dataset includes detailed information on each incident, such as its severity, location and type.

Evaluation Method

The proposed method is suitable for many spatiotemporal traffic prediction problems. In this experiment, to verify the performance bound, we consider a simple setting. Specifically, as shown in Figure 3.6, the prediction requests come from a 3.4-mile single freeway segment on interstate freeway 405 (I-405) during daytime, 8 am to 5 pm. The objective is to predict the traffic conditions for the three sensors located at 0.8 mile, 2.1 miles and 3.1 miles, respectively (locations are referred to using the distance from the reference location A). The context information that we use in the experiments includes the time of day when the prediction request is made and the location where the request comes from. These contexts capture the spatiotemporal features of the considered problem.

The base predictors are trained using historical data at locations A and B: 6 base predictors (we use Naive Bayes for the classification task and SVR for the speed prediction task) for 6 representative situations are constructed with spatiotemporal features from the set {8 am, 12 pm, 4 pm} × {0 mile, 3.4 miles}. These are representative traffic situations, since 8 am represents the morning rush hour, 12 pm represents non-rush hour, and 4 pm represents the afternoon rush hour. The features used to train our base predictors include a weekday vs. weekend indicator, whether an incident occurs nearby, and the past 15, 10 and 5 minute readings of the sensor.

For each request from location $l_o$, the system aims to predict whether the traffic will be congested at $l_o$ in the next 15 minutes using the current traffic speed data.
If the traffic speed drops below a threshold $\lambda$, then the location is labeled as congested, denoted by $\hat{y} = 1$; otherwise, the location is labeled as not congested, denoted by $\hat{y} = -1$. We will show the results for different values of $\lambda$. We use a simple binary reward function for evaluation: the system obtains a reward of 1 if the prediction is correct and 0 otherwise. Therefore, the reward represents the prediction accuracy.

Baseline Approaches

Since our scheme is related to the class of online ensemble learning techniques, we compare it against several such approaches. These baseline solutions assign weights to base predictors but use different rules to update the weights. Denote the weight of base predictor $f$ by $w_f$. The final traffic prediction depends on the weighted combination of the predictions of the base predictors:
$$y = \begin{cases} +1, & \text{if } \sum_{f\in\mathcal{F}} w_f y_f \ge 0 \\ -1, & \text{otherwise} \end{cases} \quad (3.14)$$

Three approaches are used to update the weights:

• Multiplicative Update (MU) [47][14]: If the prediction is correct for predictor $f$, i.e., $y_f = \hat{y}$, then $w_f \leftarrow \alpha w_f$, where $\alpha > 1$ is a constant; otherwise, $w_f \leftarrow w_f/\alpha$. In our experiments, $\alpha = 1.05$, since $\alpha$ is usually chosen close to 1 for convergence purposes.

• Additive Update (AU) [26]: If the prediction is correct for predictor $f$, i.e., $y_f = \hat{y}$, then $w_f \leftarrow w_f + 1$; otherwise, $w_f \leftarrow w_f$.

• Gradient Descent Update (GDU) [60]: The weight of predictor $f$ is updated as $w_f \leftarrow (1-\beta)w_f - 2\beta(w_f y_f - \hat{y})\hat{y}$, where $\beta \in (0,1)$ is a constant. In our experiments, we use a small $\beta$, namely $\beta = 0.2$, for convergence purposes.

3.4.2 Prediction Accuracy

In Table 3.1, we report the prediction accuracy of our proposed algorithm (CA-Traffic) and the baseline solutions for $\lambda = 50$ mph and $\lambda = 30$ mph. Our algorithm outperforms the baseline solutions by more than 10% in terms of prediction accuracy. Tables 3.2 and 3.3 further report the prediction accuracy in different traffic situations.
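Returning to the three baseline weight-update rules of Section 3.4.1, they can be sketched as follows (illustrative; `a` stands for the multiplicative constant $\alpha = 1.05$ and `beta` for $\beta = 0.2$ used in the experiments):

```python
def multiplicative_update(w, correct, a=1.05):
    """MU: scale the weight up by a on a correct prediction, down otherwise."""
    return w * a if correct else w / a

def additive_update(w, correct):
    """AU: add 1 to the weight on a correct prediction; leave it otherwise."""
    return w + 1 if correct else w

def gradient_descent_update(w, y_pred, y_true, beta=0.2):
    """GDU: one step of w <- (1-beta)w - 2*beta*(w*y_pred - y_true)*y_true."""
    return (1 - beta) * w - 2 * beta * (w * y_pred - y_true) * y_true
```

For instance, a correct prediction under MU moves a unit weight to 1.05, while GDU with $w = 1$, $y_f = \hat{y} = 1$ yields $0.8 \cdot 1 - 0.4 \cdot 0 = 0.8$.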
In Table 3.2, the location context is fixed at 0.8 miles from the reference location, and the accuracy for various time of day contexts (i.e., 10 am, 2 pm and 5 pm) is presented for our proposed algorithm and the benchmarks. In Table 3.3, the time of day context is fixed at 10 am, and the accuracy for various location contexts (i.e., 0.8 miles, 2.1 miles, 3.1 miles) is reported. In all traffic situations, the proposed algorithm significantly outperforms the baseline solutions, since it is able to match specific traffic situations to the best predictors.

Our proposed algorithm can predict not only traffic congestion but also the actual traffic speed. In Table 3.4, we report the mean square errors of the traffic speed prediction for the different algorithms. As we can see, our algorithm achieves much smaller mean square errors than the baseline approaches.

             CA-Traffic  MU    AU    GDU
λ = 50 mph   0.94        0.83  0.83  0.82
λ = 30 mph   0.91        0.82  0.80  0.78

Table 3.1: Overall prediction accuracy.

(λ = 50 mph)  CA-Traffic  MU    AU    GDU
10 am         0.93        0.81  0.85  0.83
2 pm          0.93        0.86  0.81  0.80
5 pm          0.95        0.86  0.87  0.88

(λ = 30 mph)  CA-Traffic  MU    AU    GDU
10 am         0.93        0.83  0.83  0.81
2 pm          0.91        0.80  0.83  0.82
5 pm          0.95        0.81  0.85  0.83

Table 3.2: Traffic prediction accuracy at 0.8 miles.

3.4.3 Convergence of Learning

Since our algorithm is an online algorithm, it is also important to investigate its convergence rate. Figures 3.7(a) and 3.7(b) illustrate the prediction accuracy of our proposed algorithm over time, where the horizontal axis is the number of requests.
(λ = 50 mph)  CA-Traffic  MU    AU    GDU
0.8 mile      0.92        0.84  0.83  0.82
2.1 mile      0.96        0.81  0.85  0.85
3.1 mile      0.93        0.85  0.83  0.81

(λ = 30 mph)  CA-Traffic  MU    AU    GDU
0.8 mile      0.92        0.81  0.83  0.82
2.1 mile      0.93        0.83  0.82  0.81
3.1 mile      0.94        0.81  0.82  0.82

Table 3.3: Traffic prediction accuracy at 10 am.

                  CA-Traffic  MU    AU    GDU
10 am, 0.8 miles  54.3        68.6  64.3  66.0
2 pm, 0.8 miles   59.8        70.2  74.3  76.9
5 pm, 0.8 miles   8.9         12.7  12.1  13.4

                  CA-Traffic  MU    AU    GDU
0.8 miles, 10 am  47.8        55.9  54.9  53.0
2.1 miles, 10 am  33.8        43.8  47.9  46.7
3.1 miles, 10 am  77.1        93.6  93.5  97.9

Table 3.4: Mean square error of traffic speed prediction (mph²).

As we can see, the proposed algorithm converges fast, requiring only a couple of hundred traffic prediction requests.

Figure 3.7: Accuracy over time (λ = 50 mph). (a) Different time of day contexts at 0.8 miles; (b) different location contexts at 10 am.

3.4.4 Missing Context Information

The context information associated with the requests may be missing occasionally due to, for example, missing reports and record mistakes. However, our modified algorithm (described in Section 3.3.1), denoted by CA-Traffic(R), can easily handle these scenarios. In this set of experiments, we show the performance of the modified algorithm for the extreme cases in which one type of context information is always missing.

(λ = 50 mph)          CA-Traffic  CA-Traffic(R)  MU    AU    GDU
time of day&distance  0.94        0.89           0.83  0.83  0.82
time of day           0.78        0.92           0.76  0.76  0.78
distance              0.72        0.88           0.79  0.77  0.78

(λ = 30 mph)          CA-Traffic  CA-Traffic(R)  MU    AU    GDU
time of day&distance  0.91        0.80           0.81  0.83  0.78
time of day           0.76        0.86           0.70  0.72  0.75
distance              0.75        0.89           0.72  0.71  0.74

Table 3.5: Traffic prediction accuracy with incomplete context information.

Figure 3.8: Relative importance of contexts.
Table 3.5 reports the accuracy of our algorithms (CA-Traffic and CA-Traffic(R)) as well as the baseline approaches. Although CA-Traffic(R) performs slightly worse than CA-Traffic when there is no missing context, it performs much better than CA-Traffic and the benchmark solutions when context can be missing, because it maintains the context partition separately for each context type and hence is robust to missing context information.

3.4.5 Relevant Context

In this set of experiments, we unravel the most relevant context, i.e., the one leading to the best prediction performance. To do so, we run the algorithm using only a single context (i.e., either time of day or location) and record the average reward. The most relevant context is the one leading to the highest average reward. Figure 3.8 shows the relative importance (e.g., Reward(time of day)/(Reward(time of day) + Reward(location))) of each context for the congestion thresholds λ = 20 mph, 30 mph and 50 mph. The figure shows that the time of day is the more relevant context for the traffic prediction problem in our experiment.

Figure 3.9: Prediction accuracy with missing and delayed labels. (a) λ = 50 mph; (b) λ = 30 mph.

3.4.6 Missing and Delayed Labels

Finally, we investigate the impact of missing and delayed labels on the prediction accuracy, as shown in Figures 3.9(a) and 3.9(b). In the missing label case, the system observes the true traffic label with probability 0.8. In the delayed label case, the true label of the traffic arrives at most five prediction requests later. In both cases, the prediction accuracy is lower than without missing or delayed labels. However, the proposed algorithm is still able to achieve very high accuracy, exceeding 90%.
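A minimal sketch of how the delayed-label modification of Section 3.3.3 can be implemented (the class and method names are ours, not from the thesis): predictions are buffered, at most L_max at a time, and a reward update fires only when the matching true label arrives.

```python
from collections import deque

class DelayedFeedbackBuffer:
    """Hold the last l_max pending predictions; update reward estimates
    only when a true label arrives (illustrative, after Section 3.3.3)."""

    def __init__(self, l_max: int):
        self.pending = deque(maxlen=l_max)  # oldest entries are evicted

    def record(self, request_id, predictor, prediction):
        self.pending.append((request_id, predictor, prediction))

    def on_label(self, request_id, true_label, update_fn):
        """Apply update_fn(predictor, was_correct) if still buffered."""
        for entry in list(self.pending):
            rid, predictor, prediction = entry
            if rid == request_id:
                self.pending.remove(entry)
                update_fn(predictor, prediction == true_label)
                return True
        return False  # label arrived more than l_max requests late
```

Labels that arrive after more than l_max subsequent requests are simply dropped, which is what makes the worst case in Proposition 2 equivalent to starting the algorithm with an L_max delay.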
3.5 Summary

In this chapter, we proposed a framework for online traffic prediction, which discovers online the contextual specialization of predictors to create a strong hybrid predictor from several weak predictors. The proposed framework matches the real-time traffic situation to the most effective predictor constructed using historical data, thereby self-adapting to the dynamically changing traffic situations. We systematically proved both short-term and long-term performance guarantees for our algorithm, which provide not only the assurance that our algorithm converges over time to the optimal hybrid predictor for each possible traffic situation, but also a bound on the speed of convergence to the optimal predictor. Our experiments on a real-world dataset verified the efficacy of the proposed scheme and showed that it significantly outperforms existing online learning approaches for traffic prediction.

Chapter 4

Latent Space Model for Road Network to Predict Time-Varying Traffic

In this chapter, we propose Latent Space Modeling for Road Networks (LSM-RN), which enables more accurate and scalable traffic prediction by utilizing both topology similarity and temporal correlations. Specifically, with LSM-RN, the vertices of a dynamic road network are embedded into a latent space, where two vertices that are similar in terms of both time-series traffic behavior and road network topology are close to each other in the latent space. Recently, latent space modeling has been successfully applied to several real-world problems, such as community detection [82, 69], link prediction [50, 89] and sentiment analysis [88]. Among them, the work on social networks [69, 50, 89] (hereafter called LSM-SN) is most related to ours, because in both scenarios data are represented as graphs and each vertex of these graphs has different attributes.
However, none of the approaches to LSM-SN are suitable for both identifying the edge and/or sensor latent attributes in road networks and exploiting them for real-time traffic prediction, for the following reasons.

First, road networks show significant topological (e.g., travel speeds between two sensors on the same road segment are similar) and temporal (e.g., travel speeds measured every 1 minute on a particular sensor are similar) correlations. These correlations can be exploited to alleviate the missing data problem, which is unique to road networks, due to the fact that some road segments may contain no sensors and any sensor may occasionally fail to report data. Second, unlike social networks, LSM-RN is fast evolving due to the time-varying traffic conditions. Social networks, on the contrary, evolve smoothly, and frequent changes are very unlikely (e.g., one user rarely changes its political preferences twice a day); in road networks, traffic conditions on a particular road segment can change rapidly in a short time (i.e., they are time-dependent) because of rush/non-rush hours and traffic incidents. Third, LSM-RN is highly dynamic, with fresh data arriving in a streaming fashion, whereas the connections (weights) between nodes in social networks are mostly static. The dynamic nature requires frequent model updates (e.g., per minute), which necessitates partial updates of the model, as opposed to the time-consuming full updates in LSM-SN. Finally, with LSM-RN, the ground truth can be observed shortly after making the prediction (by measuring the actual speed later in the future), which also provides an opportunity to improve/adjust the model incrementally (i.e., online learning).

With our proposed LSM-RN, each dimension of the embedded latent space represents a latent attribute. Thus, the attribute distribution of the vertices and how the attributes interact with each other jointly determine the underlying traffic pattern.
To enforce the topology of the road network, LSM-RN adds a graph Laplacian constraint, which not only enables global graph similarity, but also completes the missing data using a set of similar edges with non-zero readings. Subsequently, we incorporate the temporal properties into our LSM-RN model by considering time-dependent latent attributes and a global transition process. With these time-dependent latent attributes and the transition matrix, we are able to better model how traffic patterns form and evolve.

The remainder of this chapter is organized as follows. We define our problem in Section 4.1 and explain LSM-RN in Section 4.2. We present the global learning and incremental learning algorithms, and discuss how to adapt our algorithms for real-time traffic forecasting, in Section 4.3. In Section 4.4, we report the experiment results and conclude the chapter afterwards.

Notation  Explanation
N, n      road network; number of vertices of the road network
G         the adjacency matrix of a graph
U         latent space matrix
B         attribute interaction matrix
A         the transition matrix
k         the number of dimensions of latent attributes
T         the number of snapshots
span      the gap between two continuous graph snapshots
h         the prediction horizon
λ, γ      regularization parameters for the graph Laplacian and the transition process

Table 4.1: Notations and explanations

4.1 Problem Definition

We denote a road network as a directed graph $N = (V, E)$, where $V$ is the set of vertices and $E \subseteq V \times V$ is the set of edges. A vertex $v_i \in V$ models a road intersection or an end of a road. An edge $e(v_i, v_j)$, which connects two vertices, represents a directed network segment. Each edge $e(v_i, v_j)$ is associated with a travel speed $c(v_i, v_j)$ (e.g., 40 miles/hour). In addition, $N$ has a corresponding adjacency matrix representation, denoted as $G$, whose $(i,j)$-th entry represents the edge weight between the $i$-th and $j$-th vertices.
The road network snapshots are constructed from a large-scale, high-resolution traffic sensor dataset (see the detailed description of the sensor data in Section 4.4). Specifically, a sensor $s$ (i.e., a loop detector) is located at one segment of the road network $N$ and provides a reading (e.g., 40 miles/hour) per sampling interval (e.g., 1 min). We divide one day into intervals, where span is the length of each time interval. For example, when span = 5 minutes, we have 288 time intervals per day. For each time interval $t$, we aggregate (i.e., average) the readings of each sensor. Subsequently, for each edge segment of the network $N$, we average all sensor readings located at that edge to obtain its weight. Therefore, at each timestamp $t$, we have a road network snapshot $G_t$ from the traffic sensors.

Example. Figure 4.1(a) shows a simple road network with 7 vertices and 10 edges at one timestamp. Three sensors (i.e., $s_1, s_2, s_3$) are located at edges $(v_1, v_2)$, $(v_3, v_4)$ and $(v_7, v_6)$, respectively, and each sensor provides an aggregated reading during the time
Specifically, a dynamic road network, is a sequence of snapshots (G 1 ,G 2 ,··· ,G T ) with edge weights denoting time-dependent travel speed. With a dynamic road network, we formally define the problem of edge traffic predic- tion with missing data as follows: Problem 1. Given a dynamic road network (G 1 ,G 2 ,··· ,G T ) with missing data at each timestamp, we aim to achieve the following two goals: • complete the missing data (i.e., both missing value and sensor) of G i , where 1≤ i≤T; • predict the future readings ofG T +h , whereh is the prediction horizon. For example, when h = 1, we predict the traffic condition of G T +1 at the next timestamp. 45 For ease of presentation, Table 5.1 lists the notations we use throughout this chapter. Note that since each dimension of a latent space represents a latent attribute, we thus use latent attributes and latent positions interchangeably. 4.2 Latent Space Model for Road Networks (LSM-RN) In this section, we describe our LSM-RN model in the context of traffic prediction. We first introduce the basic latent space model (Section 4.2.1) by considering the graph topology, and then incorporate both temporal and transition patterns (Section 4.2.2). Finally, we describe the complete LSM-RN model to solve the traffic prediction problem with missing data (Section 4.2.3). 4.2.1 Topology in LSM-RN Our traffic model is built upon the latent space model of the observed road net- work. Basically, each vertex of road network have different attributes and each vertex has an overlapping representation of attributes. The attributes of vertices and how each attribute interacts with others jointly determine the underlying traffic patterns. Intu- itively, if two highway vertices are connected, their corresponding interaction generates a higher travel speed than that of two vertices located at arterial streets. 
In particular, given a snapshot of road network G, we aim to learn two matrices U and B, where matrixU∈R n×k + denotes the latent attributes of vertices, and matrixB∈R k×k + denotes the attribute interaction patterns. The product of UBU T represents the traffic speed between any two vertices, where we use to approximateG. Note thatB is an asymmetric matrix since the road network G is directed. Therefore, the basic traffic model which considers the graph topology can be determined by solving the following optimization problem: arg min U≥0,B≥0 J =||G−UBU T || 2 F (4.1) 46 U B U G ≈ × n × n × n × k k × k k × n 28.6 c(v , v ) U(v ) U (v ) B = × × 0.6 0.1 0.4 50 20 15 30 0.5 highway business (a) Basic model (b) Travel time of c(v 1 ,v 2 ) Figure 4.2: An example of our traffic model, where G represents a road network, U denotes the attributes of vertices in the road network, n is number of nodes, and k is number of attributes, and B denotes how one type of attributes interacts with others. Similar Non-negative Tri-factorization frameworks have been utilized in cluster- ing[24], communitydetection[82]andsentimentalanalysis[88]. Figure4.2(a)illustrates the intuition of our static traffic model. As shown in Figure 4.2 (b), suppose we know that each vertex is associated with two attributes (e.g., highway and business area), and the interaction pattern between two attributes is encoded in matrixB, we can accurately estimate the travel speed between vertexv 1 andv 2 , using their latent attributes and the matrix B. Overcome the sparsity of Road Network. 
In our road network, $G$ is very sparse (i.e., zero entries dominate $G$) for the following reasons: (1) the average degree of a road network is small [75], so the road network is far from fully connected; (2) the distribution of sensors is non-uniform, and only a small number of edges are equipped with sensors; and (3) missing values exist even for edges equipped with sensors, due to sensor failure and/or maintenance.

Therefore, we define our loss function only on edges with observed readings, that is, the set of edges with travel cost $c(v_i, v_j) > 0$. In addition, we propose an in-filling method to reduce the gap between the input road network and the estimated road network. We consider graph Laplacian dynamics, an effective smoothing approach for finding global structure similarity [43]. Specifically, we construct a graph Laplacian matrix $L$, defined as $L = D - W$, where $W$ is a graph proximity matrix constructed from the network topology, and $D$ is a diagonal matrix with $D_{ii} = \sum_j W_{ij}$. With these new constraints, our traffic model for one snapshot $G$ of the road network is expressed as follows:

$$\arg\min_{U, B} J = \|Y \odot (G - U B U^\top)\|_F^2 + \lambda \, \mathrm{Tr}(U^\top L U), \qquad (4.2)$$

where $Y$ is an indication matrix for all the non-zero entries in $G$, i.e., $Y_{ij} = 1$ if and only if $G(i, j) > 0$; $\odot$ is the Hadamard product operator, i.e., $(X \odot Z)_{ij} = X_{ij} \times Z_{ij}$; and $\lambda$ is the Laplacian regularization parameter.

4.2.2 Time in LSM-RN

Next, we incorporate the temporal information, including time-dependent modeling of the latent attributes and the temporal transition. With this model, each vertex is represented in a unified latent space, where each dimension represents either a spatial or a temporal attribute.

Temporal effect of latent attributes

The behavior of the vertices of road networks may evolve quickly.
For instance, the behavior of a vertex that resembles a highway vertex under normal traffic conditions may become similar to that of an arterial street vertex during congestion hours. Because the behavior of each vertex can change over time, we must employ time-dependent modeling of the vertex attributes for real-time traffic prediction. Therefore, we add the time-dependent effect of attributes into our traffic model. Specifically, for each $t \leq T$, we aim to learn a corresponding time-dependent latent attribute representation $U_t$. Although the latent attribute matrix $U_t$ is time-dependent, we assume that the attribute interaction matrix $B$ is an inherent property, and we thus opt to fix $B$ for all timestamps. By incorporating this temporal effect, we obtain our model based on the following optimization problem:

$$\arg\min_{U_t, B} J = \sum_{t=1}^{T} \|Y_t \odot (G_t - U_t B U_t^\top)\|_F^2 + \sum_{t=1}^{T} \lambda \, \mathrm{Tr}(U_t^\top L U_t) \qquad (4.3)$$

Transition matrix

Due to the dynamics of traffic conditions, we aim to learn not only the time-dependent latent attributes, but also a transition model that captures the evolving behavior from one snapshot to the next. The transition should capture both periodic evolving patterns (e.g., morning/afternoon rush hours) and non-recurring patterns caused by traffic incidents (e.g., accidents, road construction, or work zone closures). For example, during an accident, a vertex transitions from the normal state to the congested state at the beginning, and then becomes normal again after the accident is cleared. We thus assume a global process that captures the state transitions. Specifically, we use a matrix $A$ that approximates the change of $U$ from time $t-1$ to time $t$, i.e., $U_t = U_{t-1} A$, where $U \in \mathbb{R}_+^{n \times k}$ and $A \in \mathbb{R}_+^{k \times k}$. The transition matrix $A$ represents how likely a vertex is to transit from attribute $i$ to attribute $j$ from one timestamp to the next.
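The pieces introduced so far (the masked reconstruction term, the Laplacian smoothness term, and the transition model) can be evaluated directly on toy data; the sketch below uses small random matrices purely as an illustration of each term's shape.

```python
import numpy as np

# Toy evaluation of the three ingredients of the model (illustrative data only).
rng = np.random.default_rng(1)
n, k, lam, gamma = 5, 2, 0.1, 0.5

W = (rng.random((n, n)) < 0.4).astype(float)   # toy proximity matrix
np.fill_diagonal(W, 0.0)
L = np.diag(W.sum(axis=1)) - W                 # graph Laplacian L = D - W

G = rng.random((n, n)) * 60                    # toy snapshot of travel speeds
Y = (rng.random((n, n)) < 0.6).astype(float)   # mask: 1 on observed entries
U_prev = rng.random((n, k))
A = rng.random((k, k))
U = U_prev @ A                                 # temporal transition U_t = U_{t-1} A
B = rng.random((k, k))

recon = np.linalg.norm(Y * (G - U @ B @ U.T)) ** 2    # masked reconstruction error
smooth = lam * np.trace(U.T @ L @ U)                  # Laplacian smoothness term
trans = gamma * np.linalg.norm(U - U_prev @ A) ** 2   # transition penalty
print(trans)   # 0.0 here, because U was generated exactly as U_{t-1} A
```

Only observed entries (where `Y` is 1) contribute to the reconstruction error, which is how the sparsity of $G$ is handled.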
4.2.3 LSM-RN Model

Considering all of the above discussions, the final objective function for our LSM-RN model is defined as follows:

$$\arg\min_{U_t, B, A} J = \sum_{t=1}^{T} \|Y_t \odot (G_t - U_t B U_t^\top)\|_F^2 + \sum_{t=1}^{T} \lambda \, \mathrm{Tr}(U_t^\top L U_t) + \sum_{t=2}^{T} \gamma \|U_t - U_{t-1} A\|_F^2 \qquad (4.4)$$

where $\lambda$ and $\gamma$ are the regularization parameters. By solving Eq. 4.4, we obtain the learned matrices $U_t$, $B$ and $A$ of our LSM-RN model. Consequently, the task of both missing value and sensor completion can be accomplished as follows:

$$G_t = U_t B U_t^\top, \quad \text{for } 1 \leq t \leq T. \qquad (4.5)$$

Subsequently, the edge traffic for snapshot $G_{T+h}$ (where $h$ is the number of future time spans) can be predicted as follows:

$$G_{T+h} = (U_T A^h) B (U_T A^h)^\top \qquad (4.6)$$

4.3 Learning & Prediction by LSM-RN

In this section, we first present a typical global multiplicative algorithm to infer the LSM-RN model, and then discuss a fast incremental algorithm that scales to large road networks.

4.3.1 Global Learning Algorithm

We develop an iterative update algorithm to solve Equation 4.4, which belongs to the category of traditional multiplicative update algorithms [44]. By adopting the methods from [44], we can derive the update rules of $U_t$, $B$ and $A$.

Update rule of $U_t$

We first consider updating the variable $U_t$ while fixing all the other variables. The part of the objective function in Equation 4.4 that is related to $U_t$ can be rewritten as follows:

$$J = \sum_{t=1}^{T} \mathrm{Tr}\Big( \big(Y_t \odot (G_t - U_t B U_t^\top)\big) \big(Y_t \odot (G_t - U_t B U_t^\top)\big)^\top \Big) + \sum_{t=1}^{T} \lambda \, \mathrm{Tr}\big(U_t^\top (D - W) U_t\big) + \sum_{t=2}^{T} \mathrm{Tr}\Big( \gamma (U_t - U_{t-1} A)(U_t - U_{t-1} A)^\top \Big)$$

Because we have the non-negativity constraint on $U_t$, following standard constrained optimization theory we introduce the Lagrangian multiplier $\psi_t \in \mathbb{R}^{n \times k}$ and minimize the Lagrangian function $L$:

$$L = J + \sum_{t=1}^{T} \mathrm{Tr}(\psi_t U_t^\top) \qquad (4.7)$$

Taking the derivative of $L$ with respect to $U_t$, we obtain the following expression.
(The details are described in Appendix B.1.)

$$\frac{\partial L}{\partial U_t} = -2(Y_t \odot G_t) U_t B^\top - 2(Y_t^\top \odot G_t^\top) U_t B + 2\big(Y_t \odot U_t B U_t^\top\big)(U_t B^\top + U_t B) + 2\lambda D U_t - \lambda (W + W^\top) U_t + 2\gamma (U_t - U_{t-1} A) + 2\gamma (U_t A A^\top - U_{t+1} A^\top) + \psi_t \qquad (4.8)$$

By setting $\frac{\partial L}{\partial U_t} = 0$ and using the KKT conditions $(\psi_t)_{ij} (U_t)_{ij} = 0$, we obtain the following equations for $(U_t)_{ij}$:

$$\Big[ -(Y_t \odot G_t) U_t B^\top - (Y_t^\top \odot G_t^\top) U_t B + \big(Y_t \odot U_t B U_t^\top\big)(U_t B^\top + U_t B) + \lambda D U_t - \tfrac{1}{2} \lambda (W + W^\top) U_t + \gamma (U_t - U_{t-1} A) + \gamma (U_t A A^\top - U_{t+1} A^\top) \Big]_{ij} (U_t)_{ij} = 0 \qquad (4.9)$$

Following the updating rules proposed and proved in [44], we derive the following update rule of $U_t$:

$$(U_t) \leftarrow (U_t) \odot \left[ \frac{(Y_t \odot G_t) U_t B^\top + (Y_t^\top \odot G_t^\top) U_t B + \frac{1}{2} \lambda (W + W^\top) U_t + \gamma (U_{t-1} A + U_{t+1} A^\top)}{\big(Y_t \odot U_t B U_t^\top\big)(U_t B^\top + U_t B) + \lambda D U_t + \gamma (U_t + U_t A A^\top)} \right]^{\frac{1}{4}} \qquad (4.10)$$

Note the update terms involving the $\gamma$ parameter in the numerator and denominator: if $t = 1$, they become $U_{t+1} A^\top$ and $U_t A A^\top$, respectively, because no transition occurs at the previous step; similarly, when $t = T$, they become $U_{t-1} A$ and $U_t$.

Update rules of $B$ and $A$

The updating rules for $B$ and $A$ can be derived in a similar way (see Appendix B.2 for the detailed calculation):

$$B \leftarrow B \odot \frac{\sum_{t=1}^{T} U_t^\top (Y_t \odot G_t) U_t}{\sum_{t=1}^{T} U_t^\top \big(Y_t \odot (U_t B U_t^\top)\big) U_t} \qquad (4.11)$$

$$A \leftarrow A \odot \frac{\sum_{t=2}^{T} U_{t-1}^\top U_t}{\sum_{t=2}^{T} U_{t-1}^\top U_{t-1} A} \qquad (4.12)$$

Algorithm 2 Global-learning($G_1, G_2, \cdots, G_T$)
Input: graph matrices $G_1, G_2, \cdots, G_T$.
Output: $U_t$ ($1 \leq t \leq T$), $A$ and $B$.
1: Initialize $U_t$, $B$ and $A$
2: while not convergent do
3:    for $t = 1$ to $T$ do
4:        update $U_t$ according to Eq. 4.10
5:    end for
6:    update $B$ according to Eq. 4.11
7:    update $A$ according to Eq. 4.12
8: end while

Algorithm 2 outlines the process of updating each matrix using the aforementioned multiplicative rules to optimize Eq. 4.4. The general idea is to jointly infer and cyclically update all the latent attribute matrices $U_t$, $B$ and $A$. In particular, we first jointly learn the latent attributes for each time $t$ from all the graph snapshots (Lines 2–4).
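As a small sanity check of the multiplicative form, the sketch below runs one sweep of the $B$-rule (Eq. 4.11) for a single snapshot with $U$ held fixed, on toy random data; a small epsilon guards against division by zero. This illustrates only the shape of one update, not the full Algorithm 2.

```python
import numpy as np

# One multiplicative update of B for a single snapshot (toy data).
rng = np.random.default_rng(0)
n, k = 6, 2
U = rng.random((n, k))
B = rng.random((k, k))
G = rng.random((n, n)) * 50
Y = (rng.random((n, n)) < 0.5).astype(float)    # mask of observed entries

def loss(B):
    return np.linalg.norm(Y * (G - U @ B @ U.T)) ** 2

before = loss(B)
eps = 1e-12
num = U.T @ (Y * G) @ U                          # numerator of Eq. 4.11
den = U.T @ (Y * (U @ B @ U.T)) @ U              # denominator of Eq. 4.11
B_new = B * num / (den + eps)                    # element-wise rule keeps B >= 0
after = loss(B_new)
print(after <= before)                           # objective is non-increasing
```

The multiplicative form simultaneously preserves non-negativity and, as the text notes for the full algorithm, does not increase the objective.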
Based on the sequence of time-dependent latent attributes $\langle U_1, U_2, \cdots, U_T \rangle$, we then learn the global attribute interaction pattern $B$ and the transition matrix $A$ (Lines 6–7).

From Algorithm 2, we now explain how our LSM-RN model jointly learns the spatial and temporal properties. Specifically, when we update the latent attribute of one vertex $U_t(i)$, the spatial property is preserved by (1) considering the latent positions of its adjacent vertices ($Y_t \odot G_t$), and (2) incorporating the local graph Laplacian constraint (i.e., the matrices $W$ and $D$). Moreover, the temporal property of one vertex is captured by leveraging its latent attributes at the previous and next timestamps (i.e., $U_{t-1}(i)$ and $U_{t+1}(i)$), as well as the transition matrix.

In the following, we briefly discuss the time complexity and convergence of our global multiplicative algorithm. In each iteration, the computation is dominated by matrix multiplication operations: the multiplication between an $n \times n$ matrix and an $n \times k$ matrix, and another multiplication between an $n \times k$ matrix and a $k \times k$ matrix. Therefore, the worst-case time complexity per iteration is $O(T(nk^2 + n^2 k))$. However, since all the matrices are sparse, the complexity of multiplying two sparse $n \times k$ matrices is much smaller than $O(n^2 k)$. Following the proofs in previous works [16] [44] [88], we can show that Algorithm 2 converges to a local minimum and that the objective value is non-increasing in each iteration.

4.3.2 Incremental Learning Algorithm

The intuition behind our incremental algorithm is based on the observation that each time we make a prediction for the next five minutes, the ground-truth reading becomes available immediately after those five minutes. This motivates us to adjust the latent position of each vertex so that the prediction moves closer to the ground truth. On the other hand, it is not necessary to perform the latent position adjustment for every vertex.
This is because during a short time interval, the overall traffic condition of the whole network tends to stay steady, and the travel cost of most edges changes at a slow pace, although certain vertices can go through obvious variations. Therefore, instead of recomputing the latent positions of all the vertices from scratch at every timestamp, we perform a "lazy" update. In particular, to learn the latent space $U_t$, the incremental algorithm utilizes the latent space we have already learned at the previous snapshot (i.e., $U_{t-1}$), makes predictions for the next snapshot (i.e., $G_t$), and then conditionally adjusts the latent attributes of a subset of vertices based on the changes of the traffic condition.

Framework of Incremental Algorithm

Algorithm 3 presents the pseudo-code of the incremental learning algorithm. Initially, we learn the latent space $U_1$ with our global multiplicative algorithm (Line 1). With the learned latent matrix $U_{t-1}$, at each timestamp $t$ between 2 and $T$, our incremental update consists of the following two components: (1) identify candidate vertices based on feedback (Lines 3–8); and (2) update their latent attributes and propagate the adjustment from one vertex to its neighbors (Line 9). As outlined in Algorithm 3, given $U_{t-1}$ and $G_t$, we first make an estimation $\widehat{G}_t$ based on $U_{t-1}$ (Line 3). Subsequently, we use $G_t$ as the feedback information, select the set of vertices for which we make inaccurate predictions, and insert them into a candidate set $cand$ (Lines 4–8). Consequently, we update $U_t$ based on the learned latent matrix $U_{t-1}$, the ground-truth observation $G_t$ and the candidate set $cand$ (Line 9). After that, we learn the global transition matrix $A$ (Line 10).

Algorithm 3 Incremental-Learning($G_1, G_2, \cdots, G_T$)
Input: graph matrices $G_1, G_2, \cdots, G_T$.
Output: $U_t$ ($1 \leq t \leq T$), $A$ and $B$.
1: $(U_1, B) \leftarrow$ Global-learning($G_1$)
2: for $t = 2$ to $T$ do
3:    $\widehat{G}_t \leftarrow U_{t-1} B U_{t-1}^\top$ (prediction)
4:    $cand \leftarrow \emptyset$ (the subset of vertices to be updated)
5:    for each $i \in G$ do
6:        for each $j \in out(i)$ do
7:            if $|G_t(i, j) - \widehat{G}_t(i, j)| \geq \delta$ then
8:                $cand \leftarrow cand \cup \{i, j\}$
9:            end if
10:       end for
11:   end for
12:   $U_t \leftarrow$ Incremental-Update($U_{t-1}, G_t, cand$) (see Section 4.3.2)
13: end for
14: Iteratively learn the transition matrix $A$ using Eq. 4.12 until $A$ converges

Topology-Aware Incremental Update

Given $U_{t-1}$ and $G_t$, we now explain how to calculate $U_t$ incrementally from $U_{t-1}$ with the candidate set $cand$, so that we can accurately approximate $G_t$. The main idea is similar to an online learning process. At each round, the algorithm predicts an outcome for the required task (i.e., predicting the speed of edges). Once the algorithm makes a prediction, it receives feedback indicating the correct outcome. The online algorithm can then modify its prediction mechanism for better predictions at subsequent timestamps.

Algorithm 4 Incremental-Update($U_{t-1}, G_t, cand$)
Input: the latent matrix $U_{t-1}$, observed graph reading $G_t$, candidate set $cand$, hyper-parameters $\delta$ and $\tau$
Output: updated latent space $U_t$.
1: $U_t \leftarrow U_{t-1}$
2: while not convergent and $cand \neq \emptyset$ do
3:    sort $cand$ in the reverse topological order
4:    for $i \in cand$ do
5:        $oldu \leftarrow U_t(i)$
6:        for each $j \in out(i)$ do
7:            adjust $U_t(i)$ with Eq. 4.14
8:        end for
9:        if $\|U_t(i) - oldu\|_F^2 \leq \tau$ then
10:           $cand \leftarrow cand \setminus \{i\}$
11:       end if
12:       for each $j \in out(i)$ do
13:           $p \leftarrow U_t(i) B U_t(j)^\top$
14:           if $|p - G_t(i, j)| \geq \delta$ then
15:               $cand \leftarrow cand \cup \{j\}$
16:           end if
17:       end for
18:   end for
19: end while

In our scenario, we first use the latent attribute matrix $U_{t-1}$ to predict $G_t$ as if we did not know the observation; subsequently, we adjust the model of $U_t$ according to the true observation of $G_t$ that we already have in hand. However, in our problem, we are making predictions for the entire road network, not for a single edge.
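The candidate-selection step of Algorithm 3 (predict with the previous latent space, then flag the endpoints of badly predicted edges) can be sketched as follows; the network, matrices and threshold are toy values chosen for illustration.

```python
import numpy as np

def select_candidates(U_prev, B, G_t, observed_edges, delta):
    """Flag endpoints of edges whose prediction error is at least delta."""
    G_hat = U_prev @ B @ U_prev.T          # predict with the previous latent space
    cand = set()
    for i, j in observed_edges:
        if abs(G_t[i, j] - G_hat[i, j]) >= delta:
            cand.update((i, j))            # both endpoints may need adjustment
    return cand

# Toy 3-vertex network: edge (0, 1) is badly predicted, edge (1, 2) is not.
U_prev = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [0.5, 0.5]])
B = np.array([[40.0, 10.0],
              [10.0, 30.0]])
G_t = np.zeros((3, 3))
G_t[0, 1] = 25.0                           # observed 25 vs. predicted 10
G_t[1, 2] = 20.0                           # observed 20 vs. predicted 20
edges = [(0, 1), (1, 2)]
print(select_candidates(U_prev, B, G_t, edges, delta=3.0))   # {0, 1}
```

Only the vertices touching the mispredicted edge enter the candidate set; the rest of the network keeps its latent positions, which is the source of the incremental algorithm's speedup.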
When we predict for one edge, we only need to adjust the latent attributes of two vertices, whereas in our scenario we need to update the latent attributes of many correlated vertices. Therefore, adjusting the latent attribute of one vertex can potentially affect its neighboring vertices and influence the convergence speed of incremental learning. Hence, the adjustment order of the vertices is very important.

Algorithm 4 presents the details of updating $U_t$ incrementally from $U_{t-1}$. For each vertex $i$ in $cand$, we adjust its latent position so that we can make more accurate predictions (Line 7), and then examine how this adjustment influences the candidate set from the following two aspects: (1) if the latent attribute of $i$ does not change much, we remove it from $cand$ (Lines 8–9); (2) if the adjustment of $i$ also affects its neighbor $j$, we add vertex $j$ to $cand$ (Lines 10–13).

[Figure 4.3: Challenges of adjusting the latent attribute with feedback. (a) Adjustment method; (b) adjustment order.]

The remaining questions in our Incremental-Update algorithm are how to adjust the latent position of one vertex according to feedback, and how to decide the order of updates. In the following, we address each of them.

Adjusting the latent attribute of one vertex. To adjust the latent attribute efficiently, we propose to make the smallest change to the latent space that predicts the correct value. For example, as shown in Figure 4.3 (a), suppose we already know the new latent position of $v_1$; then a single-step movement (Option 1) is preferable to a gradual adjustment (Option 2). Note that in our problem, when we move the latent position of a vertex to a new position, the objective of this movement is to produce a correct prediction for each of its outgoing edges.
Specifically, given $U_{t-1}(i)$, we want to find $U_t(i)$ that accurately predicts the weight of each edge $e(v_i, v_j)$ adjacent to vertex $v_i$. We thus formulate our problem as follows:

$$U_t(i), \xi^* = \arg\min_{U(i) \in \mathbb{R}_+^k} \frac{1}{2} \|U(i) - U_{t-1}(i)\|_F^2 + C\xi$$
$$\text{s.t.} \quad |U(i) B U^\top(j) - G_t(i, j)| \leq \delta + \xi, \qquad (4.13)$$

where $\xi$ is a non-negative slack variable, $C > 0$ is a parameter controlling the trade-off between being conservative (do not change the model too much) and corrective (satisfy the constraint), and $\delta$ is a precision parameter. Note that we have a non-negativity constraint on the latent space $U_t(i)$. We thus adopt the approach from [13]: when the predicted value $\hat{y}_t$ (i.e., $U_t(i) B U_t^\top(j)$) is less than the correct value $y_t$ (i.e., $G_t(i, j)$), we use the traditional online passive-aggressive algorithm [20], because it guarantees the non-negativity of $U(i)$; otherwise, we update $U(i)$ by solving a quadratic optimization problem. The detailed solution is as follows:

$$U_t(i) = \max\big(U_{t-1}(i) + (k^* - \theta^*) \cdot B U_{t-1}(j)^\top, \, 0\big) \qquad (4.14)$$

$k^*$ and $\theta^*$ are computed as follows:

$$k^* = \alpha_t, \ \theta^* = 0 \quad \text{if } \hat{y}_t < y_t$$
$$k^* = 0, \ \theta^* = C \quad \text{if } \hat{y}_t > y_t \text{ and } f(C) \geq 0$$
$$k^* = 0, \ \theta^* = f^{-1}(0) \quad \text{if } \hat{y}_t > y_t \text{ and } f(C) < 0 \qquad (4.15)$$

where

$$\alpha_t = \min\left(C, \ \frac{\max(|\hat{y}_t - y_t| - \delta, 0)}{\|B U_{t-1}(j)^\top\|^2}\right)$$
$$f_t(\theta) = \max\big(U_t(i) - \theta B U_t(j)^\top, \, 0\big) \cdot B U_t(j)^\top - G_t(i, j) - \delta$$

Updating order of $cand$. As already discussed, the update order is important because it influences the convergence speed of our incremental algorithm. Take the road network of Figure 4.1 as an example: suppose our initial $cand$ contains three vertices $v_7$, $v_6$ and $v_2$, with two edges $e(v_7, v_6)$ and $e(v_6, v_2)$. If we randomly choose the update sequence $\langle v_7, v_6, v_2 \rangle$, we first adjust the latent attribute of $v_7$ so that $c(v_7, v_6)$ has a correct reading, and subsequently adjust the latent attribute of $v_6$ to correct our estimation of $c(v_6, v_2)$.
Unfortunately, the adjustment of $v_6$ can invalidate the correction we have already made to $v_7$, leading to an inaccurate estimation of $c(v_7, v_6)$ again. A desirable order is to update vertex $v_6$ before updating $v_7$. Therefore, we propose to consider the reverse topology of the road network when we update the latent position of each candidate vertex $v \in cand$. The general principle is: given an edge $e(v_i, v_j)$, the update of vertex $v_i$ should proceed after the update of $v_j$, because the position of $v_i$ depends on $v_j$. This motivates us to derive a reverse topological order in the graph $G$. Unfortunately, the road network $G$ is not a Directed Acyclic Graph (DAG) and contains cycles. To address this issue, we first generate a condensed super graph in which each Strongly Connected Component (SCC) of the graph $G$ is contracted into a super node. We then derive a reverse topological order based on this condensed graph. For the vertices inside each SCC, we generate an ordering by random algorithms or heuristics. Figure 4.3 (b) shows an example ordering for the road network of Figure 4.1, where each rectangle represents an SCC. After generating a reverse topological order based on the contracted graph and randomly ordering the vertices within each SCC, we obtain the final ordering $\langle v_2, v_6, v_7, v_1, v_5, v_4, v_3 \rangle$. Each time we update the latent attributes of $cand$, we follow this ordering of vertices.

Time complexity. For each vertex $i$, the computational complexity of adjusting its latent attributes using Eq. 4.14 is $O(k)$, where $k$ is the number of attributes. Therefore, the time complexity of computing the latent attributes per iteration is $O(kT(\Delta n + \Delta m))$, where $\Delta n$ is the number of candidate vertices in $cand$, and $\Delta m$ is the total number of edges incident to vertices in $cand$. In practice, $\Delta n \ll n$ and $\Delta m \ll m \ll n^2$. In addition, the SCCs can be generated in linear time $O(m + n)$ via Tarjan's algorithm [63].
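The SCC-based ordering just described can be sketched with the Python standard library on a toy 5-vertex graph; `sccs` below is a simple Kosaraju implementation written for this illustration (the text uses Tarjan's algorithm, which has the same linear complexity).

```python
from graphlib import TopologicalSorter

def sccs(graph):
    """Strongly connected components via Kosaraju (toy-sized, recursive)."""
    order, seen = [], set()
    def dfs(v):
        seen.add(v)
        for w in graph[v]:
            if w not in seen:
                dfs(w)
        order.append(v)                    # record finish order
    for v in graph:
        if v not in seen:
            dfs(v)
    rev = {v: [] for v in graph}           # reversed graph
    for v in graph:
        for w in graph[v]:
            rev[w].append(v)
    comp, seen = {}, set()
    for v in reversed(order):              # process in decreasing finish time
        if v not in seen:
            stack = [v]
            while stack:
                x = stack.pop()
                if x in seen:
                    continue
                seen.add(x)
                comp[x] = v                # label each SCC by its root vertex
                stack.extend(rev[x])
    return comp

# Toy road graph with a cycle: vertices 1 and 2 form one SCC.
graph = {0: [1], 1: [2], 2: [1, 3], 3: [], 4: [0]}
comp = sccs(graph)

# Condense the graph: keep only edges between distinct SCCs, as successor sets.
dag = {}
for v in graph:
    for w in graph[v]:
        if comp[v] != comp[w]:
            dag.setdefault(comp[v], set()).add(comp[w])

# Feeding successor sets to TopologicalSorter (which expects predecessor maps)
# yields exactly the REVERSE topological order: downstream SCCs come first.
order = list(TopologicalSorter(dag).static_order())
print(order)
```

In the resulting order, every SCC appears before the SCCs that point to it, so each candidate vertex is updated only after the vertices its position depends on.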
Therefore, we conclude that the computational cost per iteration is significantly reduced with Algorithm 3 compared to the global learning approach.

4.3.3 Real-time Forecasting

In this section, we discuss how to apply our learning algorithms to real-time traffic prediction, where the sensor readings arrive in a streaming fashion. In practice, if we want to make a prediction for the current traffic, we cannot afford to apply our global learning algorithm to all the previous snapshots, because it is computationally expensive. Moreover, it is not always true that more snapshots yield better prediction performance. An alternative is to treat each snapshot independently: i.e., each time we only apply our incremental learning algorithm to the most recent snapshot, and then use the learned latent attributes to predict the traffic condition. Obviously, this can yield poor prediction quality, as it totally ignores the temporal transitions.

[Figure 4.4: A batch-window framework for real-time forecasting.]

To achieve a good trade-off between these two methods, we propose to adopt a sliding-window setting for learning our LSM-RN model, in which we apply the incremental algorithm at each timestamp within a time window, and only run our global learning algorithm at the end of each time window. As shown in Figure 4.4, we apply our global learning at timestamp $T$ (i.e., the end of one time window), which learns the time-dependent latent attributes for the previous $T$ timestamps. Subsequently, for each timestamp $T + i$ in $[T, 2T]$, we apply our incremental algorithm to adjust the latent attributes and make further predictions: i.e., we use $U_{T+i}$ to predict the traffic of $G_{T+(i+1)}$. Each time we receive the true observation of $G_{T+(i+1)}$, we calculate $U_{T+(i+1)}$ via the incremental update of Algorithm 4.
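The control flow of this batch-window framework can be sketched as follows; the two learners below are trivial last-value stand-ins (not Algorithms 2 and 4) so that the loop structure can actually run.

```python
import numpy as np

def global_learning(window):               # stand-in for Algorithm 2 (batch)
    return window[-1]

def incremental_update(state, feedback):   # stand-in for Algorithm 4 (online)
    return feedback

def real_time_forecast(snapshots, T):
    state = global_learning(snapshots[:T])             # batch-learn the first window
    predictions = []
    for i in range(T, len(snapshots)):
        predictions.append(state)                      # forecast snapshot i
        state = incremental_update(state, snapshots[i])  # adjust with the true reading
        if (i + 1) % T == 0:                           # window boundary reached:
            state = global_learning(snapshots[i + 1 - T: i + 1])  # re-run batch learning
    return predictions

snapshots = [np.full((2, 2), float(t)) for t in range(6)]
preds = real_time_forecast(snapshots, T=3)
print(len(preds))   # 3 forecasts: for snapshots 3, 4 and 5
```

Inside a window, only cheap incremental updates run; the expensive batch step fires once per window, which is the trade-off the text describes.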
The latent attributes $U_{2T}$ are re-computed at timestamp $2T$ (the end of the time window), and $U_{2T}$ is then used for the next time window $[2T, 3T]$.

4.4 Experiment

4.4.1 Dataset

We used a large-scale, high-resolution (both spatial and temporal) traffic sensor (loop detector) dataset collected from Los Angeles County highways and arterial streets. This dataset includes both inventory and real-time data for 15,000 traffic sensors covering approximately 3,420 miles. The sampling rate of the data, which provides speed, volume (the number of cars passing the sensor locations) and occupancy, is one reading per sensor per minute. We have been collecting and archiving this sensor dataset continuously since 2010.

We chose sensor data between March and April 2014 for our experiments, which include more than 60 million records of readings. As for the road network, we used the Los Angeles road network obtained from the HERE Map dataset [30]. We constructed two subgraphs of the Los Angeles road network, termed SMALL and LARGE. The SMALL (resp. LARGE) network contains 5,984 (resp. 8,242) vertices and 12,538 (resp. 19,986) edges. As described in Section 5.1, the sensor data are mapped to the road network, where 1,642 (resp. 4,048) sensors are mapped to SMALL (resp. LARGE). After mapping the sensor data, we have two months of network snapshots for both SMALL and LARGE.

4.4.2 Experimental Setting

Algorithms

Our methods are termed LSM-RN-All (i.e., the global learning algorithm) and LSM-RN-Inc (i.e., the incremental learning algorithm). For edge traffic prediction, we compare with LSM-RN-Naive, where we adapted the formulations of LSM-SN ([82] and [57]) by simply combining the topology and temporal correlations. In addition, LSM-RN-Naive uses the naive incremental learning strategy of [57], which first independently learns the latent attributes of each timestamp, and then the transition matrix.
We also compare our algorithms with two representative time series prediction methods: a linear model (i.e., ARIMA [55]) and a non-linear model (i.e., SVR [56]). We train each model independently for each time series using historical data. In addition, because these methods are negatively affected by missing values during the prediction stage (i.e., some of the input readings for ARIMA and SVR could be zero), for a fair comparison we also consider ARIMA-Sp and SVR-Sp, which use the completed readings from our global learning algorithm. We also implemented the tensor method [10, 4]; however, it cannot address the sparsity of our dataset and thus produces meaningless results (most of the predicted values are close to 0).

For missing-value completion, we compare our algorithms with two methods: (1) KNN [29], which uses the average values of the nearby edges in Euclidean distance as the imputed value; and (2) LSM-RN-Naive, which independently learns the latent attributes of each snapshot, and then uses them to approximate the edge readings.

To evaluate the performance of online prediction, we consider the batch-window setting described in Section 4.3.3. Considering a time window $[0, 2T]$, we first batch-learn the latent attributes $U_T$ and the transition matrix $A$ from $[0, T]$; we then sequentially predict the traffic condition for the timestamps in $[T+1, 2T]$. Each time we make a prediction, we receive the true observations as feedback. We compare our incremental algorithm (Inc) with three baseline algorithms: Old, LSM-RN-Naive and LSM-RN-All. Specifically, to predict $G_{T+i}$, LSM-RN-Inc utilizes the feedback of $G_{T+(i-1)}$ to adjust the time-dependent latent attributes $U_{T+(i-1)}$, whereas Old does not consider the feedback and always uses the latent attributes $U_T$ and transition matrix $A$ from the previous time window.
On the other hand, LSM-RN-Naive ignores the previous snapshots and only applies the inference algorithm to the most recent snapshot $G_{T+(i-1)}$ (aka Mini-batch). Finally, LSM-RN-All applies the global learning algorithm to all historical snapshots (i.e., $G_1$ to $G_{T+(i-1)}$) and then makes a prediction (aka Full-batch).

Configurations and Measures

We selected two different time ranges that represent rush hour (i.e., 7am–8am) and non-rush hour (i.e., 2pm–3pm), respectively. For the task of missing-value completion, during each timestamp of one time range (e.g., rush hour), we randomly selected 20% of the values as unobserved and manipulated them as missing,¹ with the objective of completing those missing values. For each traffic prediction task at one particular timestamp (e.g., 7:30 am), we randomly selected 20% of the values as unknown and used them as ground-truth values.

¹ Note that missing values are plentiful in our dataset, especially for arterials. However, we needed ground truth for evaluation purposes, which is why we generated missing values artificially.

We varied the parameters $T$ and $span$, where $T$ is the number of snapshots and $span$ is the time gap between two consecutive snapshots. We also varied $k$, $\lambda$ and $\gamma$, which are parameters of our model. The default settings (shown in bold in the original table) of the experiment parameters are listed in Table 4.2.

Table 4.2: Experiment parameters
Parameter   Value range
T           2, 4, 6, 8, 10, 12
span        5, 10, 15, 20, 25, 30
k           5, 10, 15, 20, 25, 30
λ           2^-7, 2^-5, 2^-3, 2^-1, 2^1, 2^3, 2^5
γ           2^-7, 2^-5, 2^-3, 2^-1, 2^1, 2^3, 2^5

Because of space limitations, the results of varying $\gamma$ are not reported; they are similar to the results of varying $\lambda$. We use Mean Absolute Percentage Error (MAPE) and Root Mean Square Error (RMSE) to measure accuracy. In the following we only report the experimental results based on MAPE; the results based on RMSE are reported in the technical report [23].
Specifically, MAPE is defined as follows:

$$\mathrm{MAPE} = \frac{1}{N} \sum_{i=1}^{N} \frac{|y_i - \hat{y}_i|}{y_i}$$

For ARIMA and SVR, we use the March data to train a model for each edge, and use 5-fold cross-validation to choose the best parameters. All missing-value completion and edge traffic prediction tasks are conducted on the April data. We conducted our experiments with C++ on a Linux PC with an i5-2400 CPU @ 3.10 GHz and 24 GB memory.

[Figure 4.5: One-step-ahead prediction MAPE (LSM-RN-Inc, SVR, ARIMA, ARIMA-Sp, LSM-RN-Naive, LSM-RN-All, SVR-Sp). (a) Rush hour on SMALL; (b) non-rush hour on SMALL; (c) rush hour on LARGE; (d) non-rush hour on LARGE.]

4.4.3 Comparison with Edge Traffic Prediction

One-step Ahead Prediction

The experimental results on SMALL are shown in Figures 4.5 (a) and (b). Among all the methods, LSM-RN-All and LSM-RN-Inc achieve the best results, and LSM-RN-All performs slightly better than LSM-RN-Inc. This demonstrates the effectiveness of the time-dependent latent attributes and the transition matrix. We observe that without imputing missing values, the time series prediction techniques (i.e., ARIMA and SVR) perform much worse than LSM-RN-All and LSM-RN-Inc. Meanwhile, LSM-RN-Naive, which separately learns the latent attributes of each snapshot, cannot achieve good prediction results compared to LSM-RN-All and LSM-RN-Inc. This indicates that simply combining topology and time is not enough for accurate predictions. We note that even with completed readings, the accuracy of SVR-Sp and ARIMA-Sp is worse than that of LSM-RN-All and LSM-RN-Inc.
One reason is that simply combining the spatial and temporal properties does not necessarily yield better performance. Another reason is that both SVR-Sp and ARIMA-Sp also suffer from missing data during the training stage, which results in less accurate predictions. In the technical report [23], we show how the ratio of missing data influences the prediction performance. Finally, we observe that SVR is more robust than ARIMA when encountering missing values: ARIMA-Sp performs significantly better than ARIMA, while the improvement of SVR-Sp over SVR is marginal. This is because ARIMA is a linear model that mainly uses a weighted average of the previous readings for prediction, while SVR is a non-linear model that utilizes a kernel function. Figures 4.5 (c) and (d) show the experimental results on LARGE; the trend is similar to SMALL.

Multi-step Ahead Prediction

We now present the experimental results on long-term predictions, in which we predict the traffic conditions for the next 30 minutes (i.e., $h = 6$). The prediction accuracy of the different methods on SMALL is shown in Figures 4.6 (a) and (b). Although LSM-RN-All and LSM-RN-Inc still outperform the other methods, the margin between our methods and the baselines is narrower. The reason is that when we make long-term predictions, we use the predicted values from the past for future predictions. This leads to error accumulation: errors incurred in the past are propagated into future predictions. We observe similar trends on LARGE; the results are reported in Figures 4.6 (c) and (d).

4.4.4 Effect of Missing Data

In this set of experiments, we analyze the effect of missing data in the training dataset on the time series prediction techniques (i.e., ARIMA and SVR).
[Figure 4.6: Six-step-ahead prediction MAPE (LSM-RN-Inc, SVR, ARIMA, ARIMA-Sp, LSM-RN-Naive, LSM-RN-All, SVR-Sp). (a) Rush hour on SMALL; (b) non-rush hour on SMALL; (c) rush hour on LARGE; (d) non-rush hour on LARGE.]

[Figure 4.7: Missing rate during the training stage for SVR and ARIMA. (a) SMALL; (b) LARGE.]

As shown in Figures 4.7 (a) and (b), the prediction error of both approaches increases with the amount of missing data. Similar to the effect of missing values on the prediction stage shown in Figure 4.5, ARIMA is less robust than SVR because of its linear model. One interesting observation is that ARIMA performs better than SVR when the missing ratio is below 10%; this indicates that ARIMA is a good candidate for accurate traffic prediction in the presence of complete data, which also conforms with the experimental results of [55]. However, ARIMA is sensitive to missing values during both the training and prediction stages, which leads to poor performance on incomplete datasets.

[Figure 4.8: Missing-value completion MAPE (LSM-RN-Inc, LSM-RN-Naive, LSM-RN-All, KNN). (a) Rush hour on LARGE; (b) non-rush hour on LARGE.]

4.4.5 Comparison for Missing Value Completion

In this set of experiments, we evaluate the completion accuracy of the different methods.
Due to space limitations, we only report the experiment results on LARGE in Figures 4.8 (a) and (b); the effects on SMALL are similar. We observe that both LSM-RN-All and LSM-RN-Inc achieve much lower errors than the other methods. This is because LSM-RN-All and LSM-RN-Inc capture both spatial and temporal relationships, while LSM-RN-Naive and KNN only use the spatial property. LSM-RN-All performs better than LSM-RN-Inc by jointly inferring all the latent attributes. On the other hand, we note that LSM-RN-Naive and KNN have similar performance, which is inferior to our methods. This also indicates that utilizing both spatial and temporal properties yields a larger gain than utilizing the spatial property alone. As shown in Figure 4.8 (b), the completion performance during the non-rush hour is better than during the rush hour. This is because during rush hour the traffic condition is more dynamic, and the underlying pattern and transition change frequently.

4.4.6 Scalability

Table 4.3: Running time comparisons. For ARIMA and SVR, the training time is the total training time over all edges for one-step ahead prediction, and the prediction time is the average prediction time per edge per query.

                       SMALL                    LARGE
               train (s)  pred. (ms)    train (s)  pred. (ms)
LSM-RN-Naive       -         1353           -         29439
LSM-RN-All         -          869           -         14247
LSM-RN-Inc         -          407           -          4145
ARIMA            484        0.00015        987       0.00024
SVR            47420        0.00042   86093.99       0.00051

Table 4.3 shows the running time of different methods. Although ARIMA and SVR are fast in each prediction, they require a large volume of training data and have much higher training time, which can be a problem for real systems. On the contrary, our methods do not require extra training data, i.e., our methods efficiently train and predict at the same time. Among them, LSM-RN-Inc is the most efficient approach: it takes less than 500 milliseconds to learn the time-dependent latent attributes and make predictions for all the edges of the road network.
This is because our incremental learning algorithm conditionally adjusts the latent attributes of certain vertices, and utilizes the topological order that enables fast convergence. Even for the LARGE dataset, LSM-RN-Inc takes less than five seconds, which is acceptable considering that the span between two snapshots is at least five minutes in practice. This demonstrates that LSM-RN-Inc scales well to large road networks. Regarding LSM-RN-All and LSM-RN-Naive, both require much longer running times than LSM-RN-Inc. In addition, LSM-RN-All is faster than LSM-RN-Naive. This is because LSM-RN-Naive independently runs the global learning algorithm for each snapshot T times, while LSM-RN-All only applies global learning for all the snapshots once.

Figure 4.9: Convergence rate of LSM-RN-All (objective value vs. number of iterations; (a) SMALL, (b) LARGE).

Convergence analysis. Figures 4.9 (a) and (b) report the convergence rate of the iterative algorithm LSM-RN-All on both SMALL and LARGE. As shown in Figure 4.9, LSM-RN-All converges very fast: after around 20 iterations, our algorithm tends to converge in terms of the objective value in Eq. 4.4.

4.4.7 Comparison for Real-time Forecasting

In this set of experiments, we evaluate our online setting algorithms. Due to space limitations, we only report the experiment results on LARGE. As shown in Figures 4.10 (a) and (b), LSM-RN-Inc achieves comparable accuracy with LSM-RN-All (Full-batch). This is because LSM-RN-Inc effectively leverages the real-time feedback to adjust the latent attributes. We observe that LSM-RN-Inc performs much better than Old and LSM-RN-Naive (Mini-batch), which ignore either the feedback information (i.e., Old) or the previous snapshots (i.e., LSM-RN-Naive).
One observation is that Old performs better than LSM-RN-Naive (Mini-batch) for the initial timestamps, whereas Mini-batch surpasses Old at the later timestamps. This indicates that the latent attributes learned in the previous time window are more reliable for predicting the near-future traffic conditions, but may not be good for long-term predictions because of the error accumulation problem.

Figure 4.10: Online prediction MAPE ((a) Rush hour on LARGE; (b) Non-Rush hour on LARGE).

Figure 4.11: Online prediction time ((a) Rush hour on LARGE; (b) Non-Rush hour on LARGE).

Figures 4.11 (a) and (b) show the running time comparisons of the different methods. One important observation from this experiment is that LSM-RN-Inc is the most efficient approach, which is on average two times faster than LSM-RN-Naive and one order of magnitude faster than LSM-RN-All. This is because LSM-RN-Inc performs a conditional latent attribute update for vertices within a small portion of the road network, whereas LSM-RN-Naive and LSM-RN-All both recompute the latent attributes from at least one entire road network snapshot. Since in the real-time setting LSM-RN-All utilizes all the up-to-date snapshots and LSM-RN-Naive only considers the most recent single snapshot, LSM-RN-Naive is faster than LSM-RN-All. We observe that LSM-RN-Inc takes less than 1 second to incorporate the real-time feedback information, while LSM-RN-Naive and LSM-RN-All take much longer. Therefore, we conclude that LSM-RN-Inc achieves a good trade-off between prediction accuracy and efficiency, which makes it applicable for real-time traffic prediction applications.
4.4.8 Varying Parameters of Our Methods

In this section, we evaluate the performance of our methods by varying the parameters of our model. Due to space limitations, we only show the experimental results on SMALL.

Figure 4.12: Effect of varying T ((a) prediction error; (b) running time).

Effect of Varying T

Figures 4.12 (a) and (b) show the prediction performance and the running time of varying T, respectively. We observe that with more snapshots, the prediction error decreases. In particular, when we increase T from 2 to 6, the results improve significantly. However, the performance stays stable for T ≥ 6. This indicates that too few snapshots (i.e., two or fewer) are not enough to capture the traffic patterns and the evolving changes. On the other hand, more snapshots (i.e., more historical data) do not necessarily yield a better gain, considering that the running time increases when we have more snapshots. Therefore, to achieve a good trade-off between running time and prediction accuracy, we suggest using at least 6 snapshots, but no more than 12 snapshots.

Figure 4.13: Effect of varying span ((a) prediction error; (b) running time).

Effect of Varying Span

The results of varying span are shown in Figure 4.13. Clearly, as the time gap between two snapshots increases, the performance declines. This is because when the span increases, the underlying traffic may not evolve smoothly, and the transition process learned from the previous snapshot is not applicable to the future. Fortunately, our sensor dataset has high resolution, so it is better to use a smaller span to learn the latent attributes. In addition, the span does not affect the running time of either algorithm.
Figure 4.14: Effect of varying k and λ, where k is the number of latent attributes and λ is the graph regularization parameter ((a) prediction MAPE with k; (b) prediction MAPE with λ).

Effect of Varying k and λ

Figure 4.14 (a) shows the effect of varying k. We observe that: (1) we achieve better results with an increasing number of latent attributes; (2) the performance is stable when k ≥ 20. This indicates that a low-rank latent space representation can already capture the attributes of the traffic data. In addition, our results show that when the number of latent attributes is small (i.e., k ≤ 30), the running time increases with k, but it does not change much when we vary k from 5 to 30. Therefore, setting k to 20 achieves a good balance between computational cost and accuracy. Figure 4.14 (b) depicts the effect of varying λ, which is the regularization parameter for our graph Laplacian dynamics. We observe that the graph Laplacian has a larger impact on the LSM-RN-All algorithm than on LSM-RN-Inc. This is because λ controls how the global structure similarity contributes to the latent attributes, and LSM-RN-All jointly learns those time-dependent latent attributes; thus λ has a larger effect on LSM-RN-All. In contrast, LSM-RN-Inc adaptively updates the latent positions of a small number of changed vertices in a limited localized view, and thus is less sensitive to the global structure similarity than LSM-RN-All. In terms of parameter choices, λ = 2 and λ = 8 yield the best results for LSM-RN-All and LSM-RN-Inc, respectively.

4.5 Summary

In this chapter, we proposed LSM-RN, where each vertex is associated with a set of latent attributes that capture both topological and temporal properties of road networks.
We showed that latent space modeling of road networks with time-dependent weights accurately estimates the traffic patterns and their evolution over time. To efficiently infer these time-dependent latent attributes, we developed an incremental online learning algorithm that enables real-time traffic prediction for large road networks. With extensive experiments we verified the effectiveness, flexibility and scalability of our model in identifying traffic patterns and predicting future traffic conditions.

Chapter 5
Situation Aware Multi-Task Learning for Traffic Prediction

In this chapter, we further explore the commonalities across sensors by considering traffic situations. The traffic readings within one sensor can be an amalgam of multiple traffic situations, and it is thus difficult to build a single model that captures all of them. On the other hand, our key observation from working with large-scale traffic sensor data is that there exist many commonalities across sensors: they exhibit similar patterns under the same traffic situation, for example during rush hour or on a rainy day. Moreover, the number of traffic situations is limited, which lends itself to building one model per traffic situation rather than per sensor. To summarize, our hypothesis is that building models based on the shared traffic situations across sensors can help improve the prediction accuracy. To verify our hypothesis, we utilize the framework of Multi-Task Learning (MTL), whose objective is to explore the commonalities across different tasks and learn multiple related tasks by extracting and utilizing shared information.
To examine whether traffic situations are necessary to build the prediction models, we first ignore the underlying traffic situation and naively apply the traditional MTL formulation (termed Naive-MTL) to jointly learn the prediction models of all sensors, i.e., each "task" corresponds to a sensor (see Figure 5.1). The objective of Naive-MTL is thus to build prediction models for all sensors by exploring their commonalities. Unfortunately, it is impossible to find the commonalities across sensors without considering their traffic situations, and the performance of Naive-MTL becomes similar to that of training models per sensor independently. This is because even two nearby sensors under different traffic situations can behave very differently. For example, consider two nearby sensors on northbound I-110, where one sensor is more prone to traffic incidents than the other, although they behave almost the same during normal traffic conditions. On the other hand, the accident-prone sensor on northbound I-110 may behave similarly to another sensor located on northbound I-710 (a freeway parallel to I-110) when an incident happens, although the two may not be correlated in other traffic situations. Therefore, we propose a Situation-Aware Multi-Task Learning (termed SA-MTL) framework: we first identify the traffic situations across all sensors, then apply the MTL framework for each identified traffic situation, i.e., each "task" corresponds to a traffic situation (see Figure 5.3). Since the number of distinct traffic situations is small, we can apply MTL for each individual traffic situation across different sensors to examine whether the traffic situations are shared among all the sensors.
Specifically, to identify the traffic situation, we can augment each training sample of one sensor with additional contextual features including road type (e.g., highway or arterial), location, weather condition, area classification (e.g., business district, residential), accident information, etc. Subsequently, we combine the training samples across all sensors and cluster them into several partitions, where each partition represents one typical traffic situation and consists of different numbers of training samples from all sensors. Consequently, for each specific traffic situation, we use MTL to simultaneously learn the prediction models of all sensors. In particular, we employ group lasso regularization based on the l_{2,1}-norm, which ensures that for one traffic situation a small set of features is shared among all sensor prediction tasks. We utilize the FISTA [11] method to solve the proposed optimization problem with a guaranteed convergence rate. We evaluated our proposed model with an extensive experimental study on large-scale Los Angeles traffic sensor data. We show that by taking all of the traffic situations into consideration, our proposed SA-MTL framework performs consistently better than not only Naive-MTL but also other state-of-the-art approaches, with up to 18% and 30% improvement for short- and long-term predictions, respectively, over the best approach under each traffic situation and prediction horizon.

5.1 Problem Definition And Naive MTL Formulation

In this section, we first provide our problem definition; we then describe a typical way of applying the Multi-Task Learning framework to our traffic prediction problem. Finally, we show that the existing MTL formulation does not perform well for our problem. For ease of presentation, the notations used throughout this chapter are presented in Table 5.1.
Table 5.1: Notations and explanations

Notation    Explanation
t, T        one sensor t; the total number of sensors T
v_t(γ)      travel speed of sensor t at time γ
h           prediction horizon
x_t, y_t    the training input and output of sensor t
D           the feature dimension
w_t, W      the model parameter of sensor t; the model matrix of all sensors
l, Ω        the empirical loss and regularization terms
X           the combined training samples of all sensors
P_c         one traffic situation partition

5.1.1 Problem Definition

Given a set of road segments containing T traffic sensors, assume that at a given time interval γ (e.g., 5 minutes), each sensor t provides a traffic speed reading v_t(γ) (e.g., 40 miles/hour). Given a historical sensor dataset, the objective is to predict the future traffic speed of any given sensor. We formulate the speed prediction problem as follows:

Definition 1. Given the set of observed historical readings of every sensor t and the current time γ, the traffic prediction problem is to estimate the future travel speed v_t(γ + h) for each sensor, where h is the prediction horizon. For example, when h = 1, we predict the traffic speed at the next timestamp.

Definition 2. Short-term prediction and long-term prediction refer to the scenarios where h = 1 and h > 1, respectively.

In this chapter, we regard the traffic prediction problem as a regression problem. In particular, for each sensor t, suppose the current time is γ; in order to predict v_t(γ + h), we mainly extract the previous lag readings (i.e., v_t(γ−1), v_t(γ−2), ..., v_t(γ−lag)) as the training features (the details of other features can be found in Section 5.3). For each sensor t, we construct the training input x_t and output y_t, where x_t ∈ R^{n_t×D}, y_t ∈ R^{n_t} and D is the number of features, and we want to learn a function such that y_t = f_t(x_t). Many machine learning approaches have been utilized to solve the traffic prediction problem; the majority of them treat each sensor independently and train a model per sensor.
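As a concrete illustration of this setup, the following sketch builds the training pairs (x_t, y_t) for one sensor from its speed series, using the previous lag readings to predict the reading h steps ahead (the function and variable names here are our own, not from the thesis):

```python
import numpy as np

def make_lag_samples(speeds, lag, h):
    """Build (input, output) pairs for one sensor: column j of X holds
    the reading j+1 steps before the current time, and y is the reading
    h steps ahead of the current time."""
    X, y = [], []
    for g in range(lag, len(speeds) - h):
        X.append(speeds[g - lag:g][::-1])  # v(g-1), v(g-2), ..., v(g-lag)
        y.append(speeds[g + h])            # target: v(g + h)
    return np.array(X), np.array(y)
```

For h = 1 this yields one-step-ahead (short-term) samples; larger h gives the long-term setting of Definition 2.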
For simplicity, we consider linear prediction models¹: for each sensor t, our goal is to infer a linear function f_t where f_t(x_t) = x_t w_t and w_t ∈ R^{D×1}. To learn the parameter vector w_t, we solve the following optimization problem:

    arg min_{w_t} l(f_t(x_t), y_t) + Ω(w_t),    (5.1)

where l(f_t(x_t), y_t) defines the loss function and Ω(w_t) is the regularization term of each sensor; for example, we can employ the squared loss and l_2 regularization, respectively. Therefore, our objective is to learn T linear models, one per sensor. We denote by W = [w_1, ..., w_T] ∈ R^{D×T} the parameter matrix to be estimated.

¹ Our method can be easily generalized to non-linear models.

5.1.2 Naive Multi-Task Learning Formulation

In this chapter, we are interested in exploring the commonalities across different sensors and applying them to reduce the prediction error. Instead of solving each task independently, we first employ the typical multi-task learning formulation, where we learn all prediction tasks simultaneously by extracting and utilizing the common information across these sensors.

Figure 5.1: Illustration of Naive Multi-Task Learning.

As shown in Figure 5.1, one typical formulation of MTL that estimates W is as follows:

    arg min_W L(W) + Ω(W),    (5.2)

where L(W) = Σ_{t=1}^{T} (1/n_t) l(f_t(x_t), y_t) and Ω(W) encodes our prior knowledge to constrain the sharing of information between sensors.

Figure 5.2: Traffic readings of three different highway sensors on the same day (speed in miles/hour vs. time of the day, for sensors 25, 168 and 661).

Many different MTL approaches have been proposed to enforce the commonalities across tasks by defining appropriate penalty functions Ω. For example, in [25] each w_t is constrained to be close to the others, and in [8] a common set of features is selected across the tasks by employing the l_{2,1} norm of the W matrix.
However, simply applying the existing MTL framework does not perform well for our traffic prediction problem. The reason is that the evolving patterns of sensors can be a combination of different traffic situations, and the prior knowledge we have enforced in MTL no longer holds. We use the following example to illustrate the intuition:

Example 1: Figure 5.2 shows one day of traffic readings from three highway sensors. Obviously the three sensors exhibit different morning/afternoon rush hour patterns. For instance, the speed of sensor 25 does not fluctuate much throughout the whole day, while sensors 168 and 661 clearly have afternoon and morning rush hour patterns, respectively. Because the prediction model for rush hour is typically different from the model for normal traffic conditions, and each sensor has its unique mixture of normal and rush hour traffic conditions, simply enforcing model similarities (e.g., constraining the w_t to be close to each other) across all sensors does not perform well.

5.2 SA-MTL Framework

In Section 5.2.1, we first describe the general process of our proposed SA-MTL framework. Subsequently, we explain each component of SA-MTL in Sections 5.2.2, 5.2.3 and 5.2.4.

5.2.1 Overall Framework

Although each sensor has its unique traffic patterns, sensors exhibit similar behavior under the same traffic situation. For example, from Figure 5.2, our hypothesis is that sensors behave similarly under a common traffic situation, e.g., normal or rush hour conditions. This motivates us to explore the commonalities among sensors under each traffic situation.

Figure 5.3: Overview of the SA-MTL Framework.

SA-MTL first combines the training samples of all sensors, and clusters them into several partitions, where each partition represents one typical traffic situation and consists of different numbers of training samples from all sensors.
Consequently, for each partition (i.e., traffic situation), we utilize multi-task feature learning that simultaneously learns the prediction models of all sensors. Towards this end, we propose the Situation-Aware Multi-Task Learning (SA-MTL) framework, which is shown in Figure 5.3. With SA-MTL, we first identify the traffic situations across all sensors via clustering algorithms, then apply the traditional multi-task learning framework for each traffic situation. The three steps are also illustrated in Algorithm 5: (1) combine the training samples from all sensors (Lines 1-3); (2) cluster the combined training data into k partitions, where each partition constitutes training samples from different sensors (Line 4); (3) apply multi-task learning for each partition P_c, where c ∈ [1, k], and learn a parameter matrix W^c for each traffic situation (Lines 5-7). With the trained model W^c of each partition P_c, we are able to make traffic predictions for any sensor under any situation. Given a prediction request from one sensor with input features, we first determine the underlying traffic situation (i.e., cluster label c) of this request. Subsequently, we can utilize the trained model from W^c to predict its future travel speed.

5.2.2 Combining and Augmenting Training Data

Our main hypothesis is that traffic situations are shared among all sensors. Therefore, to identify them, we combine the training data x_t from each sensor t into X ∈ R^{N×D}, where N = Σ_{t=1}^{T} n_t. This also addresses the issue that one sensor may have insufficient training samples under one traffic situation. Besides utilizing the sensor dataset, we can integrate other sources of information that relate to the traffic situation, such as sensor attributes (i.e., sensor direction, category, etc.), road and weather conditions, and event information to generate better traffic situations.
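As an illustration of such augmentation, the sketch below appends hypothetical contextual features to each lag-feature row: a coarse time-of-day one-hot and a rain flag. Both contextual inputs are examples of ours, not features prescribed by the thesis:

```python
import numpy as np

def augment(lag_features, hour_of_day, is_raining):
    """Append contextual features to lag-feature rows: a one-hot over
    four coarse time-of-day bins (0-5, 6-11, 12-17, 18-23) and a rain
    flag. Both contextual inputs are hypothetical examples."""
    bins = np.eye(4)[np.asarray(hour_of_day) // 6]
    rain = np.asarray(is_raining, dtype=float)[:, None]
    return np.hstack([lag_features, bins, rain])
```

Any other contextual signal (road type, incident flags, area classification) can be appended the same way before the clustering step.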
Currently, road construction and weather information are readily available for most areas of the United States, and other factors such as incidents are becoming available in certain areas. For example, three years of Los Angeles accident data from different agencies including the California Highway Patrol (CHP), the LA Department of Transportation (LADOT), and the California Transportation Agencies (CalTrans) are publicly available. With all this extra information, we can augment the training data (i.e., add more features) before feeding it into the clustering algorithm to better identify the traffic situation.

5.2.3 Traffic Situation Clustering

With the combined matrix X, the next step is to identify the traffic situations. That is, we want to partition the training data into different groups, where each group corresponds to one traffic situation and contains training data from different sensors. Hence we consider unsupervised learning (i.e., clustering) methods. A straightforward idea is to use a traditional clustering method such as K-means clustering [9] and specify the number of clusters. In our traffic prediction problem, the objective is to discover the underlying similarities across sensors, i.e., the latent traffic patterns. In addition, since the density of different traffic situations can be skewed (e.g., the normal traffic condition dominates the training samples), K-means clustering may not perform well. Therefore, we propose to utilize Non-Negative Matrix Factorization (NMF) [44] to discover the traffic situations. NMF has been shown to have the ability to extract underlying hidden features, and has been applied to document clustering [78] and traffic prediction [22]. In [78], given a document corpus, NMF discovers the basic topics of these documents, and each document is represented as an additive combination of these basic topics.
Subsequently, the cluster label of each document is determined by finding the basis topic onto which the document has the largest projection. We use a similar idea here to discover the hidden situations and assign each training sample to the corresponding traffic situation.

Algorithm 5 SA-MTL Framework
Input: ∀t ∈ [1, T], x_t ∈ R^{n_t×D}, y_t ∈ R^{n_t}; number of clusters k
Output: ∀c ∈ [1, k], W^c
1: for t = 1 to T do
2:   X ← X ∪ x_t (see Section 5.2.2)
3: end for
4: Partition X into clusters P_c (c ≤ k) (see Section 5.2.3)
5: for c = 1 to k do
6:   Learn W^c via multi-task learning for traffic situation P_c (see Section 5.2.4)
7: end for

Formally, given the combined input matrix X ∈ R^{N×D} and the cluster number k, where k < D, the objective of NMF is to find two non-negative matrices V ∈ R_+^{N×k} and H ∈ R_+^{k×D} to approximate the original matrix X. The formulation is given as follows:

    arg min_{V,H} (1/2) ||X − VH||_F^2    (5.3)

By solving the above equation, we obtain two non-negative matrices V ∈ R_+^{N×k} and H ∈ R_+^{k×D}, where each row of H represents a basic traffic situation, and the matrix V contains the weights of the linear combination of the basic traffic situations. We can derive the cluster labels from the matrix V. Specifically, for one training row X_i of X, similar to [78], we assign the cluster label by examining V_i. More precisely, X_i is assigned to traffic situation c if c = arg max_j V_ij. Because NMF is an NP-hard problem, we use projected gradient descent [46] to solve Equation 5.3. Through these steps, we generate k partitions of X, and each partition P_c (c ∈ [1, k]) contains the training data across the T different sensors, i.e.,

    P_c = ∪_{t=1}^{T} {x_t^c, y_t^c},    (5.4)

where x_t^c and y_t^c are the corresponding training input and output of sensor t that belong to cluster c. Note that one situation group may not contain training samples from each and every sensor, because the training samples of one sensor may belong to only a subset of all traffic situations.
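The clustering step above can be sketched as follows. The thesis solves Equation 5.3 with projected gradient descent [46]; for brevity this sketch swaps in the classic multiplicative updates of Lee and Seung, and the function name is ours:

```python
import numpy as np

def nmf_cluster(X, k, iters=500, seed=0):
    """Factor a non-negative X (N x D) as X ~ V H with V (N x k) and
    H (k x D) via multiplicative updates, then assign row i to the
    traffic situation c = argmax_j V[i, j]."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    V = rng.random((N, k)) + 1e-3
    H = rng.random((k, D)) + 1e-3
    eps = 1e-10  # guards against division by zero
    for _ in range(iters):
        H *= (V.T @ X) / (V.T @ V @ H + eps)
        V *= (X @ H.T) / (V @ H @ H.T + eps)
    return V.argmax(axis=1), V, H
```

Each row of H then plays the role of a basic traffic situation, and the argmax over V implements the label assignment rule described above.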
Example 2: Figure 5.4 shows two identified traffic situations of one sensor and their properties after the clustering step, where Situations I and II can be treated as the congested scenario and the normal condition, respectively. We observe clear differences between the two traffic situations, so we can build a model for each situation instead of training one model for the mixture of the two. Specifically, Figures 5.4 (a) and (b) show the variance of the sensor's six continuous readings (i.e., training features) and the speed distributions (i.e., prediction targets) of this sensor under each traffic situation for a two-month period, respectively.

Figure 5.4: Traffic situations of one sensor ((a) variance of the training features v_{γ−1}, ..., v_{γ−6}; (b) distribution of travel speed).

Compared with the scenario in which traffic situations have not been considered (i.e., situation all), we observe that the variance of the training features tends to be larger in Situation I and smaller in Situation II; on the other hand, the sensor exhibits smaller speed values (i.e., less than 50) in Situation I and larger values (i.e., larger than 40) in Situation II. Hence, the traffic conditions are naturally divided into two subgroups. Intuitively, the speed reading varies significantly (i.e., larger variance) under the congested scenario (i.e., smaller speed values) and vice versa; we thus regard Situations I and II as the congested scenario and the normal condition, respectively.

5.2.4 Multi-Task Learning Per Traffic Situation

Within each partition c, our objective is to learn the parameter matrix W^c of the corresponding sensors, where each column W_t^c is the model parameter for sensor t. To explore the commonalities under one traffic situation, we propose to simultaneously build forecasting models for all sensors with Multi-Task Learning.
This is because, under each traffic situation, sensors exhibit similar behaviors and transition patterns. For example, during rush hour, the process of transitioning from the normal condition to the congested state, and then back to the normal condition, should be similar across different sensors. Hence the prediction models of the sensors within each situation can share some commonalities, and exploring these commonalities should help improve the prediction performance. In addition, learning different sensors together increases the sample size of each sensor, which addresses the issue that some sensors have only a limited number of training samples for one specific traffic situation. We therefore utilize multi-task feature learning [8, 49], which selects a common set of features across multiple related models. That is, within each traffic situation we restrict the regression model of each sensor to share the same subset of features. For example, the prediction models under the rush hour situation can mainly rely on features such as the time of the day, the day of the week, whether or not the time is at the boundary of rush hour, and the previous readings. This can be achieved by employing group lasso regularization, i.e., penalizing the sum of the l_2 norms of the rows of W^c. In addition, we also include an l_2 norm regularization for W^c to increase the robustness of the model. For each traffic situation partition P_c, we formulate our MTL problem as follows:

    arg min_{W^c} Σ_{t=1}^{T} ||y_t^c − x_t^c W_t^c||_F^2 + ρ_1 ||W^c||_{2,1} + ρ_2 ||W^c||_F^2,    (5.5)

where the first term is the data fitting term for all sensors (i.e., Frobenius norm), and ρ_1 and ρ_2 are the regularization parameters for the group lasso and l_2 norm penalties, respectively. In Equation 5.5, the l_{2,1} norm is given by ||W^c||_{2,1} = Σ_{d=1}^{D} sqrt(Σ_t (W_dt^c)^2), i.e., it sums the 2-norms of the parameter values in each feature dimension across all sensors.
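For concreteness, the objective of Equation 5.5 can be written down directly; a minimal numpy sketch with our own helper names:

```python
import numpy as np

def l21_norm(W):
    """Sum of the l2 norms of the rows of W (the group lasso term)."""
    return np.sqrt((W ** 2).sum(axis=1)).sum()

def mtl_objective(Xs, Ys, W, rho1, rho2):
    """Equation 5.5 for one traffic situation: Xs[t], Ys[t] are sensor
    t's samples in this partition and W[:, t] is its weight vector."""
    loss = sum(np.sum((Ys[t] - Xs[t] @ W[:, t]) ** 2) for t in range(len(Xs)))
    return loss + rho1 * l21_norm(W) + rho2 * np.sum(W ** 2)
```

Zeroing an entire row of W removes one feature from every sensor's model at once, which is exactly the shared feature selection effect described above.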
Consequently, for any feature d, this regularization achieves its minimal value when the corresponding parameters of that row are all zeros, i.e., W_dt^c = 0 for all t. This guarantees that the l_{2,1} norm chooses a W^c with the smallest number of non-zero rows, which is the same as finding a subset of shared features that are used by all sensors.

Algorithm for Solving Equation 5.5

We adapt the FISTA [11] algorithm to solve this convex optimization problem. FISTA is a proximal gradient method for composite optimization problems. It combines the idea of ISTA (Iterative Shrinkage-Thresholding Algorithm) [15] with Nesterov's accelerated gradient descent [53], and converges faster than ISTA. In particular, our objective is to minimize F(W) = f(W) + g(W), where f(W) and g(W) are defined as follows (we omit the cluster label c from x_t^c, y_t^c and W^c hereafter):

    f(W) = Σ_{t=1}^{T} ||y_t − x_t W_t||_F^2 + ρ_2 ||W||_F^2
    g(W) = ρ_1 ||W||_{2,1}    (5.6)

As an iterative algorithm, given the current search point W_i, FISTA, like ISTA, generates the next point W_{i+1} by first finding the quadratic approximation Q of F(W) at the point W_i. The quadratic approximation function Q(W, W_i) at W_i is defined as follows:

    Q(W, W_i) = f(W_i) + <W − W_i, ∇f(W_i)> + (η/2) ||W − W_i||^2 + g(W)    (5.7)

The next point W_{i+1} = arg min_W Q(W, W_i) simplifies, after ignoring the constant terms of Q, to:

    W_{i+1} = arg min_W g(W) + (η/2) ||W − (W_i − (1/η) ∇f(W_i))||^2    (5.8)

Besides the sequence {W_i}, and similar to Nesterov's algorithm, FISTA maintains another sequence {Z_i}, where {W_i} is the sequence of approximate solutions and {Z_i} is the sequence of search points. The approximate solution W_{i+1} is generated at the search point Z_i, and Z_i is an affine combination of W_{i−1} and W_i. The details of the FISTA algorithm are presented in Algorithm 6.
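A compact sketch of this solver for Equation 5.5 follows. The inner problem of Equation 5.8 has a closed-form solution for g(W) = ρ_1 ||W||_{2,1}, namely row-wise soft-thresholding; for brevity we use a fixed step 1/L from a Lipschitz estimate instead of the Armijo line search of Algorithm 6, so this is a simplified sketch rather than the thesis implementation:

```python
import numpy as np

def prox_l21(U, tau):
    """Closed-form minimizer of tau*||W||_{2,1} + 0.5*||W - U||_F^2:
    shrink each row of U toward zero by tau in l2 norm."""
    norms = np.sqrt((U ** 2).sum(axis=1, keepdims=True))
    return np.maximum(0.0, 1.0 - tau / np.maximum(norms, 1e-12)) * U

def grad_f(Xs, Ys, W, rho2):
    """Gradient of the smooth part f(W) of Eq. 5.6."""
    G = 2.0 * rho2 * W
    for t in range(len(Xs)):
        G[:, t] += 2.0 * Xs[t].T @ (Xs[t] @ W[:, t] - Ys[t])
    return G

def fista_mtl(Xs, Ys, rho1, rho2, iters=2000):
    """FISTA iterations with Nesterov momentum and a fixed step 1/L."""
    D, T = Xs[0].shape[1], len(Xs)
    L = 2.0 * max(np.linalg.norm(x.T @ x, 2) for x in Xs) + 2.0 * rho2
    W = np.zeros((D, T))
    Z, lam = W.copy(), 1.0
    for _ in range(iters):
        W_new = prox_l21(Z - grad_f(Xs, Ys, Z, rho2) / L, rho1 / L)
        lam_new = (1.0 + np.sqrt(1.0 + 4.0 * lam * lam)) / 2.0
        Z = W_new + ((lam - 1.0) / lam_new) * (W_new - W)
        W, lam = W_new, lam_new
    return W
```

The fixed step trades the adaptivity of the backtracking line search for simplicity; with backtracking, the same prox step is simply retried with larger η until the condition F(W_i) ≤ Q(W_i, Z_i) holds.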
At each iteration, we first generate the search point Z_i from W_i and W_{i-1} (Line 4), and calculate W_{i+1} from the search point Z_i (Line 7). Finally, the step size η at each step is calculated via line search according to the Armijo rule [12].

Algorithm 6 FISTA Algorithm
Input: ∀t, x_t, y_t and W_0
Output: W
1: Initialize W_1 = W_0, λ_{-1} = 0, λ_0 = 1, η = 1
2: for i = 1, 2, ... do
3:   α_i ← (λ_{i-2} − 1) / λ_{i-1}
4:   Z_i ← (1 − α_i) W_i + α_i W_{i-1}
5:   for j ← 0, 1, ... do
6:     η ← 2^j η_{i-1}
7:     W_i ← arg min_Z g(Z) + (η/2) ||Z − (Z_i − (1/η) ∇f(Z_i))||²_F
8:     if F(W_i) ≤ Q(W_i, Z_i) then
9:       η_i ← η
10:      break
11:    end if
12:  end for
13:  λ_i ← (1 + √(1 + 4λ²_{i-1})) / 2
14:  if Convergence then
15:    W ← W_i, break
16:  end if
17: end for

Time Complexity and Convergence

We first analyze the time complexity of Algorithm 6. Each iteration costs O(mD) to evaluate F(W) and O(DT) to compute the approximation point, where m is the total number of training samples, D is the number of feature dimensions, and T is the number of sensors (i.e., tasks). As shown in [11], FISTA has a convergence rate of O(1/k²), whereas gradient descent and subgradient descent have convergence rates of O(1/k) and O(1/√k), respectively, where k denotes the number of iterations.

5.3 Experiment

We performed extensive empirical evaluation of our approach to assess its effectiveness and efficiency. We first compare our methods with various categories of baseline methods. We then show the effect of varying different parameters. Finally, we provide several case studies to demonstrate the advantage of our methods.

5.3.1 Experiment Design

Dataset

We used a large-scale, high-resolution (both spatial and temporal) traffic sensor (loop detector) dataset collected from Los Angeles County highways and arterial streets. This dataset includes both inventory and real-time data for 15000 traffic sensors covering approximately 3420 miles.
The sampling rate of the data, which provides speed, volume (number of cars passing the sensor location) and occupancy, is 1 reading/sensor/min. The sensor data are aggregated into 5-minute intervals. We chose four months of data, from January 2014 to April 2014, with 880 sensors in total, for our experiments.

Algorithms

We compared our proposed algorithm SA-MTL with the following baseline approaches:

• Straightforward methods that require little training effort: Historical Average Model (HAM) and Random Walk (RW) (i.e., using the most recent reading as the predicted speed).

• Single sensor models: we independently train a model for each sensor with each of the following: Ridge Regression (Ridge) [27], Support Vector Regression (SVR) [74], Multi-level Neural Network (Neural) [80], Random Forest (Forest) [27], and the time series approach Autoregressive Integrated Moving Average (ARIMA) [55]. Note that RW is a special case of ARIMA, i.e., ARIMA(0, 1, 0).²

² We did not compare our methods with the latent space modeling (LSM) approach [22] because we removed the sensors with missing values, and in [22] LSM achieves performance similar to SVR once the missing value issue has been addressed. We did not compare with the latest deep learning approach [81] either, because the post-accident scenario has not been explicitly introduced in our experiments.

• Multi-sensor models: Multi-Task Feature Learning without considering traffic situations (Naive-MTL [8]) and Clustered Multi-Task Learning (CMTL [33, 85]), which simply groups certain sensors together. We use the implementation from [86].

Configurations and Measures

For each sensor t, we generated the training and testing samples from our historical sensor dataset as follows: suppose the current time is γ; in order to predict v(γ + h), we mainly utilized the previous lag readings, i.e., v(γ−1), v(γ−2), ..., v(γ−lag), as our features.
We tested different values of lag, and found that fixing lag = 6 achieves a good trade-off between accuracy and training complexity. In addition, we generated historical features: for each weekday time (e.g., Monday 4pm) of one sensor, we aggregated the previous readings of this sensor at that specific time over the past three months. Besides features from sensor readings, basic contextual information (e.g., time of day, day of week) and sensor attributes (e.g., sensor category, direction) were also included as training features. We utilized two months of data, where the March data were used for generating training samples and the April data for testing. The historical features were generated from three months of sensor data (i.e., from January to March). In our experiments, we did not incorporate weather and incident data as training features, and we applied the same features across all our models except for ARIMA, because ARIMA is a univariate time series approach. For ARIMA, we only used the March data for training and the April data for testing.

We varied different parameters of our SA-MTL model, namely, the number of features (i.e., whether or not to include the historical feature), the clustering algorithm and the number of clusters for our traffic situation identification, and the regularization parameter ρ_1 for multi-task learning. We used Mean Absolute Percentage Error (MAPE) and Root Mean Square Error (RMSE) to measure accuracy. In the following we only report the experiment results based on RMSE, because the results based on MAPE are similar to those of RMSE.
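The lag-based sample generation described above can be sketched as follows. The helper name and array layout are illustrative; the contextual, attribute and historical features are omitted for brevity.

```python
import numpy as np

def make_lag_samples(speeds, lag=6, horizon=1):
    """Build (X, y) pairs from one sensor's 5-minute speed series.

    X[i] holds lag consecutive readings; y[i] is the reading
    `horizon` steps after the last one in X[i].
    """
    X, y = [], []
    for i in range(len(speeds) - lag - horizon + 1):
        X.append(speeds[i:i + lag])              # previous lag readings
        y.append(speeds[i + lag - 1 + horizon])  # v(gamma + h)
    return np.array(X), np.array(y)
```

With lag = 6 and 5-minute aggregation, each feature vector spans the last 30 minutes of readings; horizon = 1 and horizon = 6 correspond to the short-term and long-term settings used later.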
We manually distinguished rush hour (i.e., 7am-9am, 4pm-7pm) and non-rush hour time ranges, and we report RMSE for both of them. We conducted both short term and long term traffic prediction, where h = 1 (i.e., 5 minutes) for short term and h = 6 (i.e., 30 minutes) for long term, respectively. The definition of RMSE is as follows:

RMSE = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2}

For each tested method, we used 5-fold cross-validation to choose the best parameters, and report the corresponding results. We conducted our experiments on a Linux PC with an i5-2400 CPU @ 3.10GHz and 24GB memory.

Table 5.2: Short term prediction performance (RMSE)

Method      Rush Hour   Non-Rush Hour
HAM         10.65       7.59
RW          5.32        4.50
Ridge       5.54        4.38
Neural      5.80        4.63
Forest      6.13        4.90
SVR         5.64        4.49
ARIMA       5.71        4.75
Naive-MTL   5.49        4.37
C-MTL       8.24        7.27
SA-MTL      4.54        3.72

Table 5.3: Long term prediction performance (RMSE)

Method      Rush Hour   Non-Rush Hour
HAM         11.34       8.46
RW          10.22       8.06
Ridge       8.91        6.86
Neural      9.11        7.20
Forest      9.58        7.38
SVR         8.86        7.03
ARIMA       8.89        7.23
Naive-MTL   8.91        6.85
C-MTL       15.82       13.28
SA-MTL      6.40        4.83

Table 5.4: Running time comparison of different methods

Method   Training Time (sec)   Testing Time (sec)
HAM      1256.37               1.21
RW       0.008                 0.86
Ridge    15.9                  0.91
Neural   413.11                12.92
Forest   7719.15               3.42
SVR      4122.86               168.46
ARIMA    161.24                4.946
MTFL     19.03                 0.206
CMTL     207.83                0.17
SA-MTL   68.25                 1.34

5.3.2 Performance Comparisons

Short-term Traffic Prediction

Table 5.2 shows the performance comparison when h = 1, i.e., predicting the traffic for the next five minutes. Among all methods, SA-MTL achieves the best performance, with over 18% and 13% improvement compared with the baseline methods for rush and non-rush hour, respectively. The prediction error in rush hour is higher than that of non-rush hour, because the traffic condition in rush hour is more complex to predict. We clearly observe that by incorporating traffic situations, SA-MTL performs significantly better than Naive-MTL.
Even for non-rush hour, the 13% decrease in error is significant given the already accurate prediction of short term traffic in non-rush hour due to its repetitive pattern. On the other hand, Naive-MTL performs similarly to the single sensor models (e.g., Ridge, SVR) that train a model for each sensor independently, and C-MTL, which simply groups sensors without considering traffic situations, yields even worse performance than Naive-MTL. These observations support our hypothesis that grouping sensors together without considering their commonalities under different traffic situations has no effect, or even a negative effect, on traffic prediction. Among the other baselines, RW achieves good performance (even better than the other single sensor models) while HAM performs worse, which shows that the most recent reading is a strong indicator for short term prediction, while the historical feature does not help much. Among the single sensor models, Ridge, Neural and SVR achieve similar results, while Forest performs the worst.

Long-term Traffic Prediction

We now present the experiment results for long-term prediction in Table 5.3, where we predict the traffic conditions for the next 30 minutes (i.e., h = 6). Long term prediction is inherently harder than short term prediction, considering that many factors can affect the traffic in the next 30 minutes. In practice, there typically exist two categories of methods for long term prediction: 1) training a model for each horizon, or 2) iteratively feeding the previous predicted values as input to the next prediction task. In our experiments, we chose the first method, where we train a model for each prediction horizon, because the iterative method can suffer from error accumulation. As shown in Table 5.3, SA-MTL achieves the best accuracy, with more than 30% improvement over the best baseline for each situation, which is even more significant than for short term prediction.
Meanwhile, SA-MTL performs much better than Naive-MTL and C-MTL, which shows the effectiveness of exploring commonalities after identifying the traffic situations. Interestingly, simple methods such as HAM and RW no longer work well for long term prediction, with much worse performance than the single sensor models. Among the single sensor models, SVR performs better than Ridge, Neural and Forest.

Running Time Comparisons

Table 5.4 shows the running time of the different methods in both the training and testing phases. Among all single-sensor models, Ridge is the most efficient because it is a linear model, whereas the non-linear models such as Neural, Forest and SVR require significantly more training time. On the other hand, as multi-sensor models, Naive-MTL and C-MTL require training time similar to Ridge. Compared with Ridge and Naive-MTL, our method SA-MTL takes slightly longer because we require extra steps to identify the traffic situations, and then apply MTL for each identified situation. However, even with these extra steps, the training of SA-MTL is still efficient and takes less than 70 seconds; thus it can easily scale to a large number of sensors. The testing time includes the prediction time for all testing samples of all sensors. We note that the testing times are mostly negligible, except for Neural and SVR.
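The direct long-term strategy used in the experiments, training one model per prediction horizon rather than iterating one-step predictions, can be sketched as follows. The closed-form ridge fit and the helper name `train_per_horizon` are illustrative stand-ins for the actual models being compared, not the dissertation's implementation.

```python
import numpy as np

def train_per_horizon(speeds, horizons=(1, 6), lag=6, alpha=0.1):
    """Direct strategy: fit one ridge model per prediction horizon h.

    speeds : 1-D array of 5-minute readings for a single sensor.
    Returns {h: weight vector}; each horizon has its own model,
    avoiding the error accumulation of iterative re-prediction.
    """
    models = {}
    for h in horizons:
        n = len(speeds) - lag - h + 1
        X = np.array([speeds[i:i + lag] for i in range(n)])
        y = np.array([speeds[i + lag - 1 + h] for i in range(n)])
        # closed-form ridge: w = (X^T X + alpha * I)^{-1} X^T y
        models[h] = np.linalg.solve(X.T @ X + alpha * np.eye(lag), X.T @ y)
    return models
```

At prediction time, the model for the requested horizon is applied directly to the latest lag readings, so a 30-minute forecast never consumes its own 5-minute predictions as input.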
5.3.3 Varying Parameters

[Figure 5.5: Effect of using historical feature on Rush hour. (a) Short term prediction; (b) Long term prediction. Bars compare RMSE with (W-HIST) and without (W/O-HIST) the historical feature for Ridge, Neural, SVR, MTFL and SA-MTL.]

[Figure 5.6: Effect of using historical feature on Non-Rush hour. (a) Short term prediction; (b) Long term prediction. Same layout as Figure 5.5.]

Effect of Using Historical Features

We evaluated the effect of using different sets of training features; here we present the results of traffic prediction with and without the historical average speed readings in Figure 5.5. As shown in Figure 5.5(a), the historical aggregated average readings do not help for short term prediction. This conforms with the results in Table 5.2: HAM achieves the worst accuracy, whereas using the previous reading (RW) yields relatively good performance. However, as shown in Figure 5.5(b), the historical feature does not help for long term prediction either. This seems to contradict the common knowledge that the historical average can be a good predictor for long term prediction. The reason is that, from our observations, the historical values reflect the trend of the ground truth mostly at the transitioning times (i.e., the boundaries of rush hour), but do not closely correlate with the readings under other traffic conditions. Moreover, the traffic prediction tasks are dominated by the non-transitioning time instances; hence the historical feature does not help much in long term prediction. Figures 5.6(a) and (b) show a similar effect during the non-rush hour time.
[Figure 5.7: Effect of varying k on Rush hour and Non-Rush hour. (a) Rush hour; (b) Non-Rush hour. RMSE of K-means and NMF as the number of partitions k varies from 2 to 12.]

[Figure 5.8: Effect of varying ρ_1 on Rush hour. (a) Short term prediction; (b) Long term prediction. RMSE as ρ_1 varies from 10^{-3} to 10^{3}.]

Effect of Varying k

The results of varying the clustering algorithm (i.e., NMF vs. K-means) and the number of clusters k (i.e., traffic situations) are shown in Figures 5.7(a) and (b). We observe that NMF and K-means achieve similar results, with NMF performing slightly better. This shows the strength of NMF in discovering the latent traffic situations. For the number of clusters k, we note that k = 4 achieves the best performance for both short and long term traffic prediction. This indicates that even though traffic conditions are chaotic and complex, we can distinguish their underlying latent grouping with a limited number of traffic situations. This is one of the main reasons that our SA-MTL framework is effective and efficient in traffic prediction tasks.

Effect of Varying the Hyper-parameter ρ_1 in Equation 5.5

Figures 5.8(a) and (b) show the effect of varying ρ_1 from 0.0001 to 1000. We observe that the best RMSE is achieved when ρ_1 = 1; as ρ_1 becomes larger, the performance deteriorates. Therefore, setting ρ_1 to smaller values (e.g., 1) can yield better accuracy.

[Figure 5.9: Case study for sensor 168 with h = 6. Speed (miles/hour) versus time of day for our prediction, the best baseline and the ground truth.]
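The traffic situation identification step that k controls can be illustrated with a tiny, self-contained K-means over per-window speed profiles. This is a didactic sketch with a deterministic farthest-point initialization, not the NMF or K-means implementations used in the experiments, and the function name is ours.

```python
import numpy as np

def kmeans_situations(profiles, k=4, n_iter=50):
    """Tiny K-means grouping time windows into k latent traffic situations.

    profiles : (n, D) array, one speed profile per time window.
    Returns (labels, centers).
    """
    profiles = np.asarray(profiles, dtype=float)
    # deterministic farthest-point initialization
    centers = [profiles[0]]
    for _ in range(1, k):
        dist = np.min([np.linalg.norm(profiles - c, axis=1) for c in centers],
                      axis=0)
        centers.append(profiles[dist.argmax()])
    centers = np.array(centers)
    # standard Lloyd iterations: assign, then recompute means
    for _ in range(n_iter):
        d = np.linalg.norm(profiles[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = profiles[labels == j].mean(axis=0)
    return labels, centers
```

Each resulting cluster would then receive its own multi-task model, as in the SA-MTL formulation of Equation 5.5; with k = 4 the experiments above found the best accuracy.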
5.3.4 Case Studies

Here we present two case studies of predicted values versus observed ground truth to demonstrate the superiority of SA-MTL under various traffic situations, such as the rush hour transition, which mimics the patterns of accidents and other abrupt situations and is the most difficult to predict.

[Figure 5.10: Case study for sensor 661 with h = 6. Speed (miles/hour) versus time of day for our prediction, the best baseline and the ground truth.]

Figure 5.9 plots the prediction values for sensor 168. Compared with the best baseline method, we notice that our prediction has less error at the transition times, especially in the ranges [12pm, 1pm] and [6pm, 7pm], whereas the best baseline cannot capture these abrupt changes. Similar results occur for sensor 661 (Figure 5.10) in the transitioning periods of the morning rush hour.

5.4 Summary

In this chapter, we presented a novel Multi-Task Learning framework for the problem of traffic prediction. Different from existing MTL formulations that train a model per sensor, our framework automatically identifies the basic traffic conditions, and simultaneously trains models per traffic situation. Through extensive experimental evaluation, we showed that our proposed model can accurately capture traffic situations and achieve significant improvement for both short and long term prediction, especially for long term prediction.

Chapter 6 Conclusions and Future Work

6.1 Conclusions

In this thesis, we studied the problem of traffic prediction from large volumes of sensor data by taking into account their spatiotemporal correlations. Towards this end, we presented three different approaches to incorporating spatiotemporal relationships between sensors. We first presented an online learning framework that encodes spatiotemporal relationships into specific traffic situations, and creates a strong hybrid predictor from several weak predictors.
Although this framework can match the real-time traffic situation to the most effective predictor, it trains these weak predictors independently and does not utilize newly arrived data to dynamically update them. We thus proposed a Latent Space Model for Road Networks (LSM-RN) that captures both the topological and temporal properties of all sensors. In particular, given a series of road network snapshots, we learn the attributes of sensors in latent spaces to estimate how traffic patterns form and evolve. We presented an incremental online algorithm which sequentially and adaptively learns the latent attributes from the temporal graph changes. LSM-RN enables real-time traffic prediction by 1) exploiting real-time sensor readings to adjust/update the existing latent spaces, and 2) training as data arrives and making predictions on-the-fly. Because LSM-RN only utilizes the most recent graph snapshots and does not distinguish the underlying traffic situations, it does not perform well in long term traffic prediction. Finally, we explored the commonalities across multiple traffic sensors that behave the same in a specific traffic situation. We showed that building models based on the traffic situations shared across sensors can help improve the prediction accuracy. We proposed a Multi-Task Learning (MTL) framework that aims to first automatically identify the traffic situations and then simultaneously build one forecasting model for similarly behaving sensors per traffic situation. We demonstrated that our proposed MTL framework outperforms the best existing traffic prediction approaches in both short and long term prediction.

In summary, we have studied the traffic prediction problem with a focus on incorporating spatiotemporal correlations between traffic sensors.
We have explored different ways of utilizing spatiotemporal relationships in the prediction techniques, and we verified their effectiveness, flexibility and scalability in identifying traffic patterns and predicting future traffic conditions with our large-scale sensor dataset.

6.2 Future Work

We briefly discuss several future directions to extend our thesis.

First, besides the techniques of ARIMA, graphical models, latent space modeling and multi-task learning, it is possible to incorporate spatiotemporal correlations into deep learning approaches. Although approaches such as the Deep Belief Network [31] and the Deep LSTM recurrent neural network [81] have recently been applied to traffic forecasting, spatiotemporal correlations have not been utilized to improve the deep learning models for traffic prediction. It would be interesting to incorporate spatiotemporal correlations into current deep learning models.

Second, we plan to integrate other data sources (e.g., GPS devices, traffic cameras) and other information (e.g., road and weather conditions, event information) into our prediction techniques for more accurate traffic prediction. To achieve this goal, we first need to resolve the inconsistencies between different data sources; we can then build a unified representation of all information. In addition, we need to extend the current framework to distributed scenarios where traffic data is gathered and fused by distributed entities, and thus coordination among different entities is required to achieve a global traffic prediction goal.

Finally, we plan to embed the current prediction techniques into real-world applications such as ride-sharing or vehicle routing systems, to enable better navigation using accurate traffic patterns.
Real applications such as estimated time of arrival (ETA) and route planning usually involve both short and long term traffic prediction, while current techniques mainly rely on the current traffic condition and ignore future conditions. It would be interesting to blend real-time traffic with the predicted traffic conditions for better ETA estimation and route planning. In addition, we can also extend our framework to other real-world problems such as social event prediction.

Reference List

[1] The highway system. FHWA, Office of Highway Policy Information.
[2] What congestion means to your town, 2007 urban area totals. Texas Transportation Institute.
[3] Annual vehicle miles of travel. Federal Highway Administration, 2003-02-14.
[4] E. Acar, D. M. Dunlavy, T. G. Kolda, and M. Morup. Scalable tensor factorizations for incomplete data. Chemometrics and Intelligent Laboratory Systems, 106(1):41–56, March 2011.
[5] R. Adhikari and R. Agrawal. A novel weighted ensemble technique for time series forecasting. In Advances in Knowledge Discovery and Data Mining, pages 38–49. Springer, 2012.
[6] A. Agarwal, S. Gerber, and H. Daume. Learning multiple tasks using manifold regularization. In Advances in Neural Information Processing Systems, pages 46–54, 2010.
[7] R. K. Ando and T. Zhang. A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research, 6(Nov):1817–1853, 2005.
[8] A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Machine Learning, 73(3):243–272, Dec. 2008.
[9] D. Arthur and S. Vassilvitskii. k-means++: The advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1027–1035, 2007.
[10] B. W. Bader, T. G. Kolda, et al. Matlab tensor toolbox version 2.6. Available online, February 2015.
[11] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems.
SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
[12] D. Bertsekas. On the Goldstein-Levitin-Polyak gradient projection method. IEEE Transactions on Automatic Control, 21(2):174–184, 1976.
[13] M. Blondel, Y. Kubo, and U. Naonori. Online passive-aggressive algorithms for non-negative matrix factorization and completion. In AISTATS, 2014.
[14] A. Blum. Empirical support for winnow and weighted-majority algorithms: Results on a calendar scheduling domain. Machine Learning, 26(1):5–23, 1997.
[15] K. Bredies and D. A. Lorenz. Linear convergence of iterative soft-thresholding. Journal of Fourier Analysis and Applications, 14(5):813–837, 2008.
[16] D. Cai, X. He, J. Han, and T. S. Huang. Graph regularized nonnegative matrix factorization for data representation. TPAMI, 33(8):1548–1560, 2011.
[17] R. Caruana. Multitask learning. In Learning to Learn, pages 95–133. Springer, 1998.
[18] H. Cheng and P.-N. Tan. Semi-supervised learning with data calibration for long-term time series forecasting. In KDD, pages 133–141. ACM, 2008.
[19] F. C. T. Chua, R. J. Oentaryo, and E.-P. Lim. Modeling temporal adoptions using dynamic matrix factorization. In ICDM, pages 91–100. IEEE, 2013.
[20] K. Crammer, O. Dekel, J. Keshet, S. Shalev-Shwartz, and Y. Singer. Online passive-aggressive algorithms. JMLR, 7:551–585, 2006.
[21] U. Demiryurek, F. Banaei-Kashani, and C. Shahabi. TransDec: A spatiotemporal query processing framework for transportation systems. In 2010 IEEE 26th International Conference on Data Engineering (ICDE 2010), pages 1197–1200. IEEE, 2010.
[22] D. Deng, C. Shahabi, U. Demiryurek, L. Zhu, R. Yu, and Y. Liu. Latent space model for road networks to predict time-varying traffic. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '16, pages 1525–1534. ACM, 2016.
[23] D. Deng, C. Shahabi, U. Demiryurek, L. Zhu, R. Yu, and Y. Liu. Latent space model for road networks to predict time-varying traffic.
CoRR, abs/1602.04301, 2016.
[24] C. Ding, T. Li, W. Peng, and H. Park. Orthogonal nonnegative matrix t-factorizations for clustering. In KDD, pages 126–135. ACM, 2006.
[25] T. Evgeniou and M. Pontil. Regularized multi-task learning. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 109–117. ACM, 2004.
[26] A. Fern and R. Givan. Online ensemble learning: An empirical study. Machine Learning, 53(1-2):71–109, 2003.
[27] F. Guo. Short-term traffic prediction under normal and abnormal conditions. PhD thesis, Imperial College London, United Kingdom, 2013.
[28] J. Guo and B. Williams. Real-time short-term traffic speed level forecasting and uncertainty quantification using layered Kalman filters. Transportation Research Record: Journal of the Transportation Research Board, (2175):28–37, 2010.
[29] J. Haworth and T. Cheng. Non-parametric regression for space-time forecasting under missing data. Computers, Environment and Urban Systems, 36(6):538–550, 2012.
[30] HERE. https://company.here.com/here/.
[31] W. Huang, G. Song, H. Hong, and K. Xie. Deep architecture for traffic flow prediction: Deep belief networks with multitask learning. IEEE Transactions on Intelligent Transportation Systems, 15(5):2191–2201, 2014.
[32] T. Idé and M. Sugiyama. Trajectory regression on road networks. In AAAI, 2011.
[33] L. Jacob, J.-P. Vert, and F. R. Bach. Clustered multi-task learning: A convex formulation. In Advances in Neural Information Processing Systems, pages 745–752, 2009.
[34] R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton. Adaptive mixtures of local experts. Neural Computation, 3(1):79–87, 1991.
[35] H. Jagadish, J. Gehrke, A. Labrinidis, Y. Papakonstantinou, J. M. Patel, R. Ramakrishnan, and C. Shahabi. Big data and its technical challenges. Communications of the ACM, 57(7):86–94, 2014.
[36] D. Jeffery, K. Russam, and D. Robertson. Electronic route guidance by Autoguide: the research background.
Traffic Engineering & Control, 28(10):525–529, 1987.
[37] Y.-S. Jeong, Y.-J. Byon, M. M. Castro-Neto, and S. M. Easa. Supervised weighting-online learning algorithm for short-term traffic flow prediction. IEEE Transactions on Intelligent Transportation Systems, 14(4):1700–1707, 2013.
[38] M. I. Jordan and R. A. Jacobs. Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6(2):181–214, 1994.
[39] Y. Kamarianakis and P. Prastacos. Forecasting traffic flow conditions in an urban network: comparison of multivariate and univariate approaches. Transportation Research Record: Journal of the Transportation Research Board, (1857):74–84, 2003.
[40] I. Kaysi, M. Ben-Akiva, and H. Koutsopoulos. Integrated approach to vehicle routing and congestion prediction for real-time driver guidance. Transportation Research Record, (1408), 1993.
[41] S. J. Kazemitabar, U. Demiryurek, M. Ali, A. Akdogan, and C. Shahabi. Geospatial stream query processing using Microsoft SQL Server StreamInsight. Proceedings of the VLDB Endowment, 3(1-2):1537–1540, 2010.
[42] J. Kwon and K. Murphy. Modeling freeway traffic with coupled HMMs. Technical report.
[43] R. Lambiotte, J.-C. Delvenne, and M. Barahona. Laplacian dynamics and multiscale modular structure in networks. arXiv:0812.1770, 2008.
[44] D. D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. In NIPS, pages 556–562, 2001.
[45] M. Levin and Y.-D. Tsao. On forecasting freeway occupancies and volumes (abridgment). Transportation Research Record, (773), 1980.
[46] C.-J. Lin. Projected gradient methods for nonnegative matrix factorization. Neural Computation, 19(10):2756–2779, 2007.
[47] N. Littlestone and M. K. Warmuth. The weighted majority algorithm. Information and Computation, 108(2):212–261, 1994.
[48] C. Liu, S. C. Hoi, P. Zhao, and J. Sun. Online ARIMA algorithms for time series prediction. 2016.
[49] J. Liu, S. Ji, and J. Ye. Multi-task feature learning via efficient l2,1-norm minimization.
In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pages 339–348. AUAI Press, 2009.
[50] A. K. Menon and C. Elkan. Link prediction via matrix factorization. In Machine Learning and Knowledge Discovery in Databases, pages 437–452. Springer, 2011.
[51] W. Min and L. Wynter. Real-time road traffic prediction with spatio-temporal correlations. Transportation Research Part C: Emerging Technologies, 19(4):606–616, 2011.
[52] K.-R. Müller, A. J. Smola, G. Rätsch, B. Schölkopf, J. Kohlmorgen, and V. Vapnik. Predicting time series with support vector machines. In Artificial Neural Networks - ICANN, pages 999–1004. Springer, 1997.
[53] Y. Nesterov. Introductory lectures on convex programming, volume I: Basic course. 1998.
[54] I. Okutani and Y. J. Stephanedes. Dynamic prediction of traffic volume through Kalman filtering theory. Transportation Research Part B: Methodological, 18(1):1–11, 1984.
[55] B. Pan, U. Demiryurek, and C. Shahabi. Utilizing real-world transportation data for accurate traffic prediction. ICDM, pages 595–604, 2012.
[56] G. Ristanoski, W. Liu, and J. Bailey. Time series forecasting using distribution enhanced linear regression. In PAKDD, pages 484–495, 2013.
[57] R. A. Rossi, B. Gallagher, J. Neville, and K. Henderson. Modeling dynamic behavior in large evolving graphs. In WSDM, pages 667–676, 2013.
[58] A. Saha and V. Sindhwani. Learning evolving and emerging topics in social media: a dynamic NMF approach with temporal regularization. In WSDM, pages 693–702. ACM, 2012.
[59] N. Sapankevych, R. Sankar, et al. Time series prediction using support vector machines: a survey. Computational Intelligence Magazine, IEEE, 4(2):24–38, 2009.
[60] S. Shalev-Shwartz, Y. Singer, N. Srebro, and A. Cotter. Pegasos: Primal estimated sub-gradient solver for SVM. Mathematical Programming, 127(1):3–30, 2011.
[61] B. Smith and M. Demetsky. Multiple-interval freeway traffic flow forecasting.
Transportation Research Record: Journal of the Transportation Research Board, (1554):136–141, 1996.
[62] B. L. Smith, B. M. Williams, and R. K. Oswald. Comparison of parametric and nonparametric models for traffic flow forecasting. Transportation Research Part C: Emerging Technologies, 10(4):303–321, 2002.
[63] R. Tarjan. Depth-first search and linear graph algorithms. SIAM Journal on Computing, 1(2):146–160, 1972.
[64] A. Torralba, K. P. Murphy, and W. T. Freeman. Sharing visual features for multiclass and multiview object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(5), 2007.
[65] J. Van Lint, S. Hoogendoorn, and H. J. van Zuylen. Accurate freeway travel time prediction with state-space neural networks under missing data. Transportation Research Part C: Emerging Technologies, 13(5):347–369, 2005.
[66] E. I. Vlahogianni, J. C. Golias, and M. G. Karlaftis. Short-term traffic forecasting: Overview of objectives and methods. Transport Reviews, 24(5):533–557, 2004.
[67] E. I. Vlahogianni, M. G. Karlaftis, and J. C. Golias. Optimized and meta-optimized neural networks for short-term traffic flow prediction: a genetic approach. Transportation Research Part C: Emerging Technologies, 13(3):211–234, 2005.
[68] E. I. Vlahogianni, M. G. Karlaftis, and J. C. Golias. Short-term traffic forecasting: Where we are and where we're going. Transportation Research Part C: Emerging Technologies, 43:3–19, 2014.
[69] F. Wang, T. Li, X. Wang, S. Zhu, and C. Ding. Community discovery using nonnegative matrix factorization. Data Mining and Knowledge Discovery, 22(3):493–521, 2011.
[70] J. Wang, Q. Gu, J. Wu, G. Liu, and Z. Xiong. Traffic speed prediction and congestion source exploration: A deep learning method. In ICDM, pages 499–508, Dec 2016.
[71] Y. Wang and M. Papageorgiou. Real-time freeway traffic state estimation based on extended Kalman filter: a general approach. Transportation Research Part B: Methodological, 39(2):141–167, 2005.
[72] Y. Wang, Y. Zheng, and Y. Xue. Travel time estimation of a path using sparse trajectories. In KDD, pages 25–34. ACM, 2014.
[73] B. Williams, P. Durvasula, and D. Brown. Urban freeway traffic flow prediction: application of seasonal autoregressive integrated moving average and exponential smoothing models. Transportation Research Record: Journal of the Transportation Research Board, (1644):132–141, 1998.
[74] C.-H. Wu, J.-M. Ho, and D.-T. Lee. Travel-time prediction with support vector regression. IEEE Transactions on Intelligent Transportation Systems, 5(4):276–281, 2004.
[75] L. Wu, X. Xiao, D. Deng, G. Cong, A. D. Zhu, and S. Zhou. Shortest path and distance queries on road networks: An experimental evaluation. VLDB, 5(5):406–417, 2012.
[76] Y. Xie, K. Zhao, Y. Sun, and D. Chen. Gaussian processes for short-term traffic volume forecasting. Transportation Research Record: Journal of the Transportation Research Board, (2165):69–78, 2010.
[77] J. Xu, D. Deng, U. Demiryurek, C. Shahabi, and M. van der Schaar. Mining the situation: Spatiotemporal traffic prediction with big data. IEEE Journal of Selected Topics in Signal Processing, 9(4):702–715, 2015.
[78] W. Xu, X. Liu, and Y. Gong. Document clustering based on non-negative matrix factorization. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 267–273. ACM, 2003.
[79] B. Yang, C. Guo, and C. S. Jensen. Travel cost inference from sparse, spatio-temporally correlated time series using Markov models. Proceedings of the VLDB Endowment, 6(9):769–780, 2013.
[80] R. Yasdi. Prediction of road traffic using a neural network approach. Neural Computing & Applications, 8(2):135–142, 1999.
[81] R. Yu, Y. Li, C. Shahabi, U. Demiryurek, and Y. Liu. Deep learning: A generic approach for extreme condition traffic forecasting. In Proceedings of the 2017 SIAM International Conference on Data Mining. SIAM, 2017.
[82] Y. Zhang and D.-Y. Yeung.
Appendix A

A.1 Proof of Lemma 2 in Section 3.2.2

Proof. We first introduce some additional notation. For each subspace $C$, let $f^*_C$ be the predictor that is optimal for the center context of that subspace. Let $\bar{\mu}_{f,C} := \sup_{\theta \in C} \mu_f(\theta)$ and $\underline{\mu}_{f,C} := \inf_{\theta \in C} \mu_f(\theta)$ for any predictor $f$. For a level-$l$ subspace $C$, we define the set of sub-optimal predictors as

$$S_{C,l,B} = \{ f : \underline{\mu}_{f^*_C,C} - \bar{\mu}_{f,C} > B 2^{-\alpha l} \} \quad (A.1)$$

where $B > 0$ is a constant (which will be determined later) and $\alpha$ is the Hölder condition parameter.
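Definition (A.1) can be made concrete with a small numerical sketch. Everything below is illustrative: the predictors, the subspace, and the parameter values are made up, and the sup/inf over $C$ are approximated on a grid.

```python
import numpy as np

# Hypothetical predictors: each maps a context theta in [0, 1] to an
# expected reward mu_f(theta). Names and reward curves are illustrative only.
predictors = {
    "f1": lambda th: 0.9 - 0.1 * th,   # best near the center
    "f2": lambda th: 0.5 + 0.1 * th,   # clearly worse everywhere
    "f3": lambda th: 0.85 - 0.1 * th,  # close to optimal
}

def suboptimal_set(predictors, C, B, alpha, l, grid=1000):
    """Return S_{C,l,B}: predictors f whose sup-reward over C lies more
    than B * 2^{-alpha*l} below the inf-reward of the predictor f*_C
    that is optimal at the center context of the subspace C."""
    lo, hi = C
    thetas = np.linspace(lo, hi, grid)
    center = (lo + hi) / 2.0
    # f*_C: optimal predictor for the center context of the subspace
    f_star = max(predictors, key=lambda f: predictors[f](center))
    mu_under_star = predictors[f_star](thetas).min()  # inf over C
    gap = B * 2.0 ** (-alpha * l)
    return {f for f in predictors
            if mu_under_star - predictors[f](thetas).max() > gap}

S = suboptimal_set(predictors, C=(0.0, 0.5), B=1.0, alpha=1.0, l=2)
print(S)  # → {'f2'}
```

Here only `f2` clears the gap $B 2^{-\alpha l} = 0.25$; `f3` is non-optimal but within the gap, so it would count as a near-optimal predictor in the terminology below.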
The non-optimal predictors that do not belong to the sub-optimal set are called the near-optimal predictors.

To bound the regret in type-2 slots, we consider the regret due to choosing sub-optimal predictors and the regret due to choosing near-optimal predictors separately.

(1) We first bound the regret due to choosing sub-optimal predictors for subspace $C$, denoted by $\mathrm{Reg}_{s,C}(T)$. Because the maximum loss due to choosing a non-optimal predictor is at most 1 by normalization, we can bound the probability of choosing sub-optimal predictors instead. Let $\lambda^t_{f,C}$ be the event that a sub-optimal predictor $f \in S_{C,l,B}$ is selected at time $t$, and let $\gamma^t_C$ be the event that the context arrival at time $t$ belongs to $C$ and slot $t$ is a type-2 slot. The regret due to choosing sub-optimal predictors in type-2 slots when the context belongs to $C$, up to time $T$, is bounded above as follows:

$$\mathrm{Reg}_{s,C}(T) \le \sum_{t=1}^{T} \sum_{f \in S_{C,l,B}} \Pr(\lambda^t_{f,C} \cap \gamma^t_C) \le \sum_{t=1}^{T} \sum_{f \in S_{C,l,B}} \Pr(\{\bar{r}^t_f(C) \ge \bar{r}^t_{f^*_C}(C)\} \cap \gamma^t_C) \quad (A.2)$$

The second inequality holds because, for $\lambda^t_{f,C}$ to occur, it is necessary that $\bar{r}^t_f(C) \ge \bar{r}^t_{f^*_C}(C)$ occurs. Furthermore, $\bar{r}^t_f(C) \ge \bar{r}^t_{f^*_C}(C)$ holds only if, for any given positive value $H_t > 0$, the following joint event occurs:

$$\{\bar{r}^t_f(C) \ge \bar{\mu}^t_f(C) + H_t\} \cup \{\bar{r}^t_{f^*_C}(C) \le \underline{\mu}^t_{f^*_C} - H_t\} \cup \left( \{\bar{r}^t_f(C) \ge \bar{r}^t_{f^*_C}(C)\} \cap \{\bar{r}^t_f(C) < \bar{\mu}^t_f(C) + H_t\} \cap \{\bar{r}^t_{f^*_C}(C) > \underline{\mu}^t_{f^*_C}(C) - H_t\} \right) \quad (A.3)$$

Therefore, we have

$$\Pr(\{\bar{r}^t_f(C) \ge \bar{r}^t_{f^*_C}(C)\} \cap \gamma^t_C) \le \Pr(\{\bar{r}^t_f(C) \ge \bar{\mu}^t_f(C) + H_t\} \cap \gamma^t_C) + \Pr(\{\bar{r}^t_{f^*_C}(C) \le \underline{\mu}^t_{f^*_C} - H_t\} \cap \gamma^t_C) + \Pr(\{\bar{r}^t_f(C) \ge \bar{r}^t_{f^*_C}(C)\} \cap \{\bar{r}^t_f(C) < \bar{\mu}^t_f(C) + H_t\} \cap \{\bar{r}^t_{f^*_C}(C) > \underline{\mu}^t_{f^*_C}(C) - H_t\} \cap \gamma^t_C) \quad (A.4)$$

There are three terms on the right-hand side of (A.4). We want to find conditions on $H_t$ under which the third term equals zero; in this way, the regret can be bounded using only the first two terms.
Let $\bar{r}^{t,\mathrm{best}}_f(C)$ denote the reward estimate of a sub-optimal predictor $f$ in the best case (over the subspace $C$), and let $\bar{r}^{t,\mathrm{worst}}_{f^*_C}(C)$ denote the reward estimate of $f^*_C$ in the worst case (over the subspace $C$). We define $\bar{r}^{t,\mathrm{best}}_f(C)$ (and, similarly, $\bar{r}^{t,\mathrm{worst}}_{f^*_C}(C)$) as follows. The reward estimate $\bar{r}^t_f(C)$ can be expressed as

$$\bar{r}^t_f(C) = \sum_{\tau \in E^t_f(C)} \left( \mu^\tau_f(\theta^\tau) + \epsilon^\tau \right) \quad (A.5)$$

where $E^t_f(C)$ is the set of slots in which the predictor $f$ is selected by time $t$ for contexts in subspace $C$, $\mu^\tau_f(\theta^\tau)$ is the expected reward of selecting $f$ for the actual context arriving in slot $\tau$, and $\epsilon^\tau$ is the noise in the reward observed in slot $\tau$. Then $\bar{r}^{t,\mathrm{best}}_f(C)$ is defined as

$$\bar{r}^{t,\mathrm{best}}_f(C) = \sum_{\tau \in E^t_f(C)} \left( \mu^\tau_f(\theta^*_f(C)) + \epsilon^\tau \right) \quad (A.6)$$

where $\theta^*_f(C) = \arg\max_{\theta \in C} \mu_f(\theta)$; $\bar{r}^{t,\mathrm{worst}}_{f^*_C}(C)$ is defined in a similar way. By applying the Hölder condition, it is then easy to see that $\bar{r}^{t,\mathrm{best}}_f(C) - \bar{r}^t_f(C) \le L(\sqrt{D}/2^l)^\alpha$ and $\bar{r}^t_{f^*_C}(C) - \bar{r}^{t,\mathrm{worst}}_{f^*_C}(C) \le L(\sqrt{D}/2^l)^\alpha$. The third term can be bounded above as follows:

$$\Pr(\{\bar{r}^t_f(C) \ge \bar{r}^t_{f^*_C}(C)\} \cap \{\bar{r}^t_f(C) < \bar{\mu}^t_f(C) + H_t\} \cap \{\bar{r}^t_{f^*_C}(C) > \underline{\mu}^t_{f^*_C}(C) - H_t\} \cap \gamma^t_C)$$
$$\le \Pr(\{\bar{r}^{t,\mathrm{best}}_f(C) \ge \bar{r}^{t,\mathrm{worst}}_{f^*_C}(C)\} \cap \{\bar{r}^{t,\mathrm{best}}_f(C) < \bar{\mu}^t_f(C) + L(\sqrt{D}/2^l)^\alpha + H_t\} \cap \{\bar{r}^{t,\mathrm{worst}}_{f^*_C}(C) > \underline{\mu}^t_{f^*_C}(C) - L(\sqrt{D}/2^l)^\alpha - H_t\} \cap \gamma^t_C) \quad (A.7)$$

where $L(\sqrt{D}/2^l)^\alpha$ is the maximum possible reward difference between the center context and a border context for a given predictor, according to the Hölder condition.
This is because: 1) for $\bar{r}^t_f(C) \ge \bar{r}^t_{f^*_C}(C)$ to occur, it is necessary that $\bar{r}^{t,\mathrm{best}}_f(C) \ge \bar{r}^{t,\mathrm{worst}}_{f^*_C}(C)$ holds; 2) for $\bar{r}^t_f(C) < \bar{\mu}^t_f(C) + H_t$ to occur, it is necessary that $\bar{r}^{t,\mathrm{best}}_f(C) < \bar{\mu}^t_f(C) + L(\sqrt{D}/2^l)^\alpha + H_t$ holds, since $\bar{r}^{t,\mathrm{best}}_f(C) - \bar{r}^t_f(C) \le L(\sqrt{D}/2^l)^\alpha$; and 3) for $\bar{r}^t_{f^*_C}(C) > \underline{\mu}^t_{f^*_C}(C) - H_t$ to occur, it is necessary that $\bar{r}^{t,\mathrm{worst}}_{f^*_C}(C) > \underline{\mu}^t_{f^*_C}(C) - L(\sqrt{D}/2^l)^\alpha - H_t$ holds, since $\bar{r}^t_{f^*_C}(C) - \bar{r}^{t,\mathrm{worst}}_{f^*_C}(C) \le L(\sqrt{D}/2^l)^\alpha$.

Our objective is to show that the right-hand side of (A.7) is zero, thereby implying that the left-hand side is also zero. To show that the right-hand side is zero, we will find a condition under which the following three events,

$$\bar{r}^{t,\mathrm{best}}_f(C) \ge \bar{r}^{t,\mathrm{worst}}_{f^*_C}(C) \quad (A.8)$$
$$\bar{r}^{t,\mathrm{best}}_f(C) < \bar{\mu}^t_f(C) + L(\sqrt{D}/2^l)^\alpha + H_t \quad (A.9)$$
$$\bar{r}^{t,\mathrm{worst}}_{f^*_C}(C) > \underline{\mu}^t_{f^*_C}(C) - L(\sqrt{D}/2^l)^\alpha - H_t \quad (A.10)$$

cannot occur at the same time. Observe the second and third events: if

$$\underline{\mu}^t_{f^*_C}(C) - L(\sqrt{D}/2^l)^\alpha - H_t \ge \bar{\mu}^t_f(C) + L(\sqrt{D}/2^l)^\alpha + H_t \quad (A.11)$$

then we must have $\bar{r}^{t,\mathrm{best}}_f(C) < \bar{r}^{t,\mathrm{worst}}_{f^*_C}(C)$, which contradicts $\bar{r}^{t,\mathrm{best}}_f(C) \ge \bar{r}^{t,\mathrm{worst}}_{f^*_C}(C)$. Thus, we next find $H_t$ such that (A.11) holds. Since $f$ is a sub-optimal predictor, we have $\underline{\mu}^t_{f^*_C}(C) - \bar{\mu}^t_f(C) > B 2^{-\alpha l}$. Therefore, (A.11) holds if

$$2L(\sqrt{D}/2^l)^\alpha + 2H_t - B 2^{-\alpha l} \le 0 \quad (A.12)$$

Let $H_t = 2^{-\alpha l}$ and $B = 2L(\sqrt{D})^\alpha + 2$; then the left-hand side of (A.12) becomes $2L(\sqrt{D})^\alpha 2^{-\alpha l} + 2 \cdot 2^{-\alpha l} - (2L(\sqrt{D})^\alpha + 2) 2^{-\alpha l} = 0$, so the inequality holds. Therefore, we have found a condition under which the left-hand side of (A.7) is zero.

Next, we bound the first two terms on the right-hand side of (A.4) by using the Chernoff-Hoeffding bound. Since, on the event $\gamma^t_C$, the number of samples is greater than $2^{2\alpha l} \ln(t)$, the first term can be bounded as

$$\Pr(\{\bar{r}^t_f(C) \ge \bar{\mu}^t_f(C) + H_t\} \cap \gamma^t_C) \le e^{-2 H_t^2 \, 2^{2\alpha l} \ln(t)} = t^{-2} \quad (A.13)$$

The last equality follows by substituting $H_t = 2^{-\alpha l}$ into the exponent.
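The substitution in (A.13) can also be checked numerically. The sketch below uses arbitrary illustrative values of $\alpha$, $l$, and $t$, and uniform rewards on $[0,1]$ as a stand-in distribution (any distribution bounded in $[0,1]$ satisfies the Chernoff-Hoeffding bound):

```python
import math
import numpy as np

alpha, l, t = 0.5, 3, 50
H = 2.0 ** (-alpha * l)                              # H_t = 2^{-alpha*l}
n = math.ceil(2.0 ** (2 * alpha * l) * math.log(t))  # sample count >= 2^{2*alpha*l} ln(t)

# Analytic Chernoff-Hoeffding bound for [0, 1]-bounded rewards:
bound = math.exp(-2 * H**2 * n)
print(bound, t ** -2.0)        # the bound is at most t^{-2}

# Empirical check: fraction of runs whose sample mean exceeds mu + H
rng = np.random.default_rng(0)
mu, trials = 0.5, 20000
means = rng.uniform(0, 1, size=(trials, n)).mean(axis=1)
print((means >= mu + H).mean() <= bound)  # → True
```

Because `n` is rounded up, `bound` comes out slightly below $t^{-2}$; with the exact exponent $2^{2\alpha l}\ln(t)$ the two quantities coincide.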
Similarly, for the second term on the right-hand side of (A.4), we have

$$\Pr(\{\bar{r}^t_{f^*_C}(C) \le \underline{\mu}^t_{f^*_C} - H_t\} \cap \gamma^t_C) \le e^{-2 H_t^2 \, 2^{2\alpha l} \ln(t)} = t^{-2} \quad (A.14)$$

Summing over time and over all sub-optimal predictors, we have

$$\mathrm{Reg}_{s,C}(T) \le \sum_{t=1}^{T} \sum_{f \in S_{C,l,B}} 2 t^{-2} \le K \frac{\pi^2}{3} \quad (A.15)$$

(2) Next, we bound the regret due to choosing near-optimal predictors in type-2 slots. By the definition of near-optimal predictors, the regret due to selecting a near-optimal predictor is at most $B 2^{-\alpha l}$. Because there can be at most $A 2^{pl}$ slots for a level-$l$ subspace $C$ according to the partitioning rule, the regret of this part is at most $A B 2^{(p-\alpha)l}$.

Combining (1) and (2), the regret due to choosing non-optimal predictors in type-2 slots is bounded by $K \frac{\pi^2}{3} + A B 2^{(p-\alpha)l}$.

Appendix B

B.1 Derivatives of L with Respect to U_t in Equation 4.7

The objective $L$ can be rewritten as follows:

$$L = J_1 + J_2 + J_3 + \sum_{t=1}^{T} \mathrm{Tr}(\psi_t U_t^T)$$

where

$$J_1 = \sum_{t=1}^{T} \mathrm{Tr}\left[ \left( Y_t \odot (G_t - U_t B U_t^T) \right) \left( Y_t \odot (G_t - U_t B U_t^T) \right)^T \right]$$
$$J_2 = \sum_{t=1}^{T} \lambda \, \mathrm{Tr}(U_t^T L U_t)$$
$$J_3 = \sum_{t=2}^{T} \gamma \, \mathrm{Tr}\left[ (U_t - U_{t-1} A)(U_t - U_{t-1} A)^T \right] \quad (B.1)$$

$J_1$ can also be rewritten as follows:

$$J_1 = \sum_{t=1}^{T} \mathrm{Tr}\left[ (Y_t \odot G_t)^T (Y_t \odot G_t) - 2 (Y_t^T \odot G_t^T)(Y_t \odot U_t B U_t^T) + (Y_t^T \odot U_t B^T U_t^T)(Y_t \odot U_t B U_t^T) \right]$$
$$= \mathrm{const} - 2 \sum_{t=1}^{T} \mathrm{Tr}\left[ (Y_t^T \odot G_t^T)(Y_t \odot U_t B U_t^T) \right] + \sum_{t=1}^{T} \mathrm{Tr}\left[ (Y_t^T \odot U_t B^T U_t^T)(Y_t \odot U_t B U_t^T) \right] \quad (B.2)$$

The second term of equation (B.2) can be transformed as follows:

$$O_1 = \sum_{t=1}^{T} \mathrm{Tr}\left[ (Y_t^T \odot G_t^T)(Y_t \odot U_t B U_t^T) \right] = \sum_{t=1}^{T} \sum_{k=1}^{n} \left[ (Y_t^T \odot G_t^T)(Y_t \odot U_t B U_t^T) \right]_{kk}$$
$$= \sum_{t=1}^{T} \sum_{k=1}^{n} \sum_{i=1}^{m} (Y_t^T \odot G_t^T)_{ki} (Y_t \odot U_t B U_t^T)_{ik} = \sum_{t=1}^{T} \sum_{i=1}^{m} \sum_{k=1}^{n} \left( Y_t \odot G_t \odot Y_t \odot U_t B U_t^T \right)_{ik}$$
$$= \sum_{t=1}^{T} \mathrm{Tr}\left[ (Y_t^T \odot G_t^T \odot Y_t^T) U_t B U_t^T \right] \quad (B.3)$$

Now $J_1$ can be written as follows:

$$J_1 = \mathrm{const} - 2 \sum_{t=1}^{T} \mathrm{Tr}\left[ (Y_t^T \odot G_t^T \odot Y_t^T) U_t B U_t^T \right] + \sum_{t=1}^{T} \mathrm{Tr}\left[ (Y_t^T \odot U_t B^T U_t^T)(Y_t \odot U_t B U_t^T) \right] \quad (B.4)$$

We now take the derivative of $L$ with respect to $U_t$:

$$\frac{\partial L}{\partial U_t} = \frac{\partial J_1}{\partial U_t} + \frac{\partial J_2}{\partial U_t} + \frac{\partial J_3}{\partial U_t}$$
$$+\; \frac{\partial \sum_{t=1}^{T} \mathrm{Tr}(\psi_t U_t^T)}{\partial U_t} \quad (B.5)$$

The derivative $\frac{\partial J_1}{\partial U_t}$ can now be calculated as follows:

$$\frac{\partial J_1}{\partial U_t} = -2(Y_t \odot G_t \odot Y_t) U_t B^T - 2(Y_t^T \odot G_t^T \odot Y_t^T) U_t B + \frac{\partial}{\partial U_t} \sum_{t=1}^{T} \mathrm{Tr}\left[ (Y_t^T \odot U_t B^T U_t^T)(Y_t \odot U_t B U_t^T) \right] \quad (B.6)$$

Let $O_2 = \sum_{t=1}^{T} \mathrm{Tr}\left[ (Y_t^T \odot U_t B^T U_t^T)(Y_t \odot U_t B U_t^T) \right]$. The derivative of $O_2$ can be written entrywise as

$$\frac{\partial O_2}{\partial (U_t)_{pq}} = \frac{\partial \sum_{k=1}^{n} \sum_{i=1}^{m} (Y_t^T \odot U_t B^T U_t^T)_{ki} (Y_t \odot U_t B U_t^T)_{ik}}{\partial (U_t)_{pq}} = \frac{\partial \sum_{i=1}^{m} \sum_{k=1}^{n} \left( Y_t \odot Y_t \odot U_t B U_t^T \odot U_t B U_t^T \right)_{ik}}{\partial (U_t)_{pq}} = \frac{\partial \sum_{i=1}^{m} \sum_{k=1}^{n} (Y_t)^2_{ik} \, (U_t B U_t^T)^2_{ik}}{\partial (U_t)_{pq}} \quad (B.7)$$

Because only the $p$-th row of $U_t B U_t^T$ depends on $(U_t)_{pq}$, we have

$$\frac{\partial O_2}{\partial (U_t)_{pq}} = \frac{\partial \sum_{k=1}^{n} (Y_t)^2_{pk} (U_t B U_t^T)^2_{pk}}{\partial (U_t)_{pq}} = 2 \sum_{k=1}^{n} (Y_t^2)_{pk} (U_t B U_t^T)_{pk} \frac{\partial (U_t B U_t^T)_{pk}}{\partial (U_t)_{pq}} = 2 \sum_{k=1}^{n} (Y_t^2)_{pk} (U_t B U_t^T)_{pk} (U_t B^T + U_t B)_{kq} \quad (B.8)$$

In matrix form, this derivative is

$$\frac{\partial O_2}{\partial U_t} = 2\left( Y_t \odot Y_t \odot U_t B U_t^T \right)(U_t B^T + U_t B) = 2\left( Y_t \odot U_t B U_t^T \right)(U_t B^T + U_t B) \quad (B.9)$$

where the second equality uses $Y_t \odot Y_t = Y_t$ for the binary indicator matrix $Y_t$. The derivative $\frac{\partial J_1}{\partial U_t}$ is then

$$\frac{\partial J_1}{\partial U_t} = -2(Y_t \odot G_t \odot Y_t) U_t B^T - 2(Y_t^T \odot G_t^T \odot Y_t^T) U_t B + 2\left( Y_t \odot U_t B U_t^T \odot Y_t \right)(U_t B^T + U_t B) \quad (B.10)$$

Similarly, we can calculate the derivatives $\frac{\partial J_2}{\partial U_t} = 2\lambda L U_t$, $\frac{\partial J_3}{\partial U_t} = 2\gamma(U_t - U_{t-1} A) + 2\gamma(U_t A A^T - U_{t+1} A^T)$, and $\frac{\partial \sum_{t=1}^{T} \mathrm{Tr}(\psi_t U_t^T)}{\partial U_t} = \psi_t$. Altogether, we have

$$\frac{\partial L}{\partial U_t} = -2(Y_t \odot G_t) U_t B^T - 2(Y_t^T \odot G_t^T) U_t B + 2\left( Y_t \odot U_t B U_t^T \right)(U_t B^T + U_t B) + 2\lambda D U_t - \lambda (W + W^T) U_t + 2\gamma(U_t - U_{t-1} A) + 2\gamma(U_t A A^T - U_{t+1} A^T) + \psi_t \quad (B.11)$$

B.2 Update Rule of A and B for Eq. 4.11 and Eq.
4.12

Similar to the derivation for $U_t$, we add Lagrangian multipliers $\phi \in \mathbb{R}^{k \times k}$ and $\omega \in \mathbb{R}^{k \times k}$ and calculate the derivatives of $L$ with respect to $B$ and $A$:

$$\frac{\partial L}{\partial B} = -2 \sum_{t=1}^{T} U_t^T (Y_t \odot G_t) U_t + 2 \sum_{t=1}^{T} U_t^T \left( Y_t \odot (U_t B U_t^T) \right) U_t + \phi$$
$$\frac{\partial L}{\partial A} = -2 \sum_{t=2}^{T} U_{t-1}^T U_t + 2 \sum_{t=2}^{T} U_{t-1}^T U_{t-1} A + \omega \quad (B.12)$$

Using the KKT conditions $\phi_{ij} B_{ij} = 0$ and $\omega_{ij} A_{ij} = 0$, we get the following equations for $B_{ij}$ and $A_{ij}$:

$$-\left[ \sum_{t=1}^{T} U_t^T (Y_t \odot G_t) U_t \right]_{ij} B_{ij} + \left[ \sum_{t=1}^{T} U_t^T \left( Y_t \odot (U_t B U_t^T) \right) U_t \right]_{ij} B_{ij} = 0 \quad (B.13)$$
$$-\left[ \sum_{t=2}^{T} U_{t-1}^T U_t \right]_{ij} A_{ij} + \left[ \sum_{t=2}^{T} U_{t-1}^T U_{t-1} A \right]_{ij} A_{ij} = 0 \quad (B.14)$$

These lead us to the following update rules:

$$B_{ij} \leftarrow B_{ij} \frac{\left[ \sum_{t=1}^{T} U_t^T (Y_t \odot G_t) U_t \right]_{ij}}{\left[ \sum_{t=1}^{T} U_t^T \left( Y_t \odot (U_t B U_t^T) \right) U_t \right]_{ij}} \quad (B.15)$$

$$A_{ij} \leftarrow A_{ij} \frac{\left[ \sum_{t=2}^{T} U_{t-1}^T U_t \right]_{ij}}{\left[ \sum_{t=2}^{T} U_{t-1}^T U_{t-1} A \right]_{ij}} \quad (B.16)$$
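The update rules (B.15) and (B.16) are NMF-style multiplicative updates and translate directly into matrix code. Below is a minimal sketch, not the dissertation's implementation: $Y_t$ is applied as an element-wise mask, the shapes are arbitrary, and a small `eps` guard (absent from the derivation) avoids division by zero.

```python
import numpy as np

def update_B(B, Us, Gs, Ys, eps=1e-12):
    """One multiplicative update of B per Eq. (B.15).
    Us: list of U_t (n x k); Gs, Ys: lists of G_t, Y_t (n x n)."""
    num = sum(U.T @ (Y * G) @ U for U, G, Y in zip(Us, Gs, Ys))
    den = sum(U.T @ (Y * (U @ B @ U.T)) @ U for U, Y in zip(Us, Ys))
    return B * num / (den + eps)

def update_A(A, Us, eps=1e-12):
    """One multiplicative update of A per Eq. (B.16); sums run over t = 2..T."""
    num = sum(Us[t - 1].T @ Us[t] for t in range(1, len(Us)))
    den = sum(Us[t - 1].T @ Us[t - 1] @ A for t in range(1, len(Us)))
    return A * num / (den + eps)

# Tiny random instance (shapes chosen only for illustration)
rng = np.random.default_rng(1)
n, k, T = 6, 2, 4
Us = [rng.random((n, k)) for _ in range(T)]
Gs = [rng.random((n, n)) for _ in range(T)]
Ys = [(rng.random((n, n)) > 0.3).astype(float) for _ in range(T)]  # observed-entry indicators
B = update_B(rng.random((k, k)), Us, Gs, Ys)
A = update_A(rng.random((k, k)), Us)
print(B.shape, A.shape)  # → (2, 2) (2, 2)
```

Since every factor in the numerators and denominators is nonnegative, the updates preserve the nonnegativity of $B$ and $A$, which is exactly what the KKT multipliers $\phi$ and $\omega$ enforce in the derivation.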
Asset Metadata
Creator: Deng, Dingxiong (author)
Core Title: Spatiotemporal traffic forecasting in road networks
School: Viterbi School of Engineering
Degree: Doctor of Philosophy
Degree Program: Computer Science
Publication Date: 09/07/2017
Defense Date: 08/16/2017
Publisher: University of Southern California (original); University of Southern California. Libraries (digital)
Tags: latent space model; loop detectors; OAI-PMH Harvest; road network; situation-aware multi-task learning; traffic forecasting
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: Shahabi, Cyrus (committee chair); Knoblock, Craig (committee member); Salva, Ketan (committee member)
Creator Email: dengdx@gmail.com; dingxiod@usc.edu
Permanent Link (DOI): https://doi.org/10.25549/usctheses-c40-426273
Unique Identifier: UC11263994
Identifier: etd-DengDingxi-5709.pdf (filename); usctheses-c40-426273 (legacy record id)
Legacy Identifier: etd-DengDingxi-5709.pdf
Dmrecord: 426273
Document Type: Dissertation
Rights: Deng, Dingxiong
Type: texts
Source: University of Southern California (contributing entity); University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA