From raw sensor data to moving object trajectories at right resolution, quality, and abstraction
(USC Thesis Other)
FROM RAW SENSOR DATA TO MOVING OBJECT TRAJECTORIES AT RIGHT RESOLUTION, QUALITY, AND ABSTRACTION

by Hyunjin Yoon

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)

December 2009

Copyright 2009 Hyunjin Yoon

Acknowledgements

First and foremost, I thank my advisor Cyrus Shahabi for his continuous support throughout the long years of the Ph.D. program. Cyrus was always there to listen and to give excellent advice. He taught me how to express my ideas, approach research problems, and write academic papers. More importantly, he always showed his confidence in me even when I doubted myself, brought out the best efforts and good ideas in me, and encouraged me to stay persistent toward any goal. Without his intellectual discipline and mentoring, I could not have finished this dissertation.

Besides my advisor, I would like to thank the rest of my dissertation committee: Aiichiro Nakano, for asking me good questions during the defense of my dissertation, and Carolee J. Winstein, who showed interest in my research and suggested potential opportunities for applying it to the field of neuro-rehabilitation. I was always inspired by Carolee's great leadership and her constant urge for innovation during the course of our collaboration. Thanks also to my Ph.D. guidance committee: Gaurav Sukhatme, who gave insightful comments and asked hard questions at the qualifying exam, and Yolanda Gil, for her sincere advice on the balance between work and family.

I am also grateful to my colleagues at InfoLab, USC. I thank Kiyoung Yang for his technical support in the many projects in which we participated together and for co-authoring a few papers at the early stage of my degree. I learned a lot from Kiyoung, particularly about good implementation and thorough experimental testing to demonstrate the feasibility of new ideas.
Special thanks go to Mehrdad Jahangiri, who offered me never-ending encouragement, moral support, and friendship. I always admired his efforts in caring for and comforting others as well as in making the workplace more friendly. Thanks also to Mehdi Sharifzadeh and Farnoush Banaei-Kashani for their help and advice throughout all the years of my degree. Most of all, I will never forget that we had lunch together all the time and how much I enjoyed it. Thanks also to Leyla Kazemi, Xiaoyan Zhang, Songhua Xing, Jeff Khoshgozaran-Haghighi, Ugur Demiryurek, Houtan Shirani-Mehr, Ali Khodaei, Bei Pan, and Ling Hu, who shared my last few years at InfoLab.

I also thank the dear friends I met at USC, Namhee Kwon, Hyokyeong Lee, Hyunju Lee, Minyoung Mun, Soojung Hyun, Bomi Song, and Yunji Lee, for their emotional support and sisterhood.

I would like to thank the people with whom I worked in many interdisciplinary projects such as BCAR, ISNSR, P20, and EXCITE: Jarugool Tretiluxana, Shu-ya Chen, Jill Stewart, and Maureen Whitford at the Motor Behavior and Neurorehabilitation Lab, who helped me learn the domain knowledge of physical therapy and motor rehabilitation and offered valuable feedback and interesting discussions. Thanks also to Shih-Ching Yeh, Younbo Jung, and Lei Li at the Integrated Media Systems Center for their technical support during the collaboration.

I am also greatly indebted to some teachers at Ewha Womans University: my former advisor Hwan-Seung Yong, for teaching me the theoretical fundamentals of databases and supervising me in the master's program. The education and research experience with him was greatly helpful in continuing my study at USC. Thanks also go to Keeho Lee and Myung Kim for making me realize the issue of women's representation, leadership, and responsibility in science and engineering, where women are underrepresented.
Last but not least, I thank my family: my parents, Youngil Yoon and Heesook Kim, for their unconditional support and constant encouragement to pursue my interests, listening to my complaints and frustrations, and believing in me all the time. Also thanks to my parents-in-law, Changwoo Nam and Dongsook Kim, for their moral and material support during the long years of graduate study abroad. I am glad that my daughter Christine has been always with me along this long journey since the very beginning. She is truly the source of happiness and motivation to me. The preparation of this dissertation coincided with the pregnancy of my son Anthony, born three weeks after my dissertation defense. His presence helped me get focused to the end and feel double accomplishment. Finally, I want to thank my husband Sungjip Nam for all the support and patience that he gave me throughout the completion of this degree. Without his support and understanding, I could not have done what I always wanted to do.

Table of Contents

Acknowledgements
List Of Tables
List Of Figures
Abstract
Chapter 1: Introduction
  1.1 Motivation
    1.1.1 Trajectory Data at Right Resolution and Quality
    1.1.2 Trajectory Data at Right Level of Abstraction
  1.2 Thesis Statement
  1.3 Previous Work
    1.3.1 Previous Work on Trajectory Segmentation
    1.3.2 Previous Work on Movement Pattern Discovery
  1.4 Proposed Approaches
    1.4.1 Robust Time-Referenced Trajectory Segmentation
    1.4.2 Discovery of Valid Convoy Patterns from Trajectories
  1.5 Contribution Summary
  1.6 Dissertation Outline
Chapter 2: Robust Time-Referenced Trajectory Segmentation
  2.1 Introduction
  2.2 Preliminaries
    2.2.1 Trajectory Model
    2.2.2 Outlier Detection
  2.3 Robust Time-Referenced Trajectory Segmentation
    2.3.1 Problem Definition
    2.3.2 Top-down Algorithm
    2.3.3 Bottom-up Algorithm
    2.3.4 Sliding Window Algorithm
  2.4 Experimental Evaluation
    2.4.1 Dataset
    2.4.2 Selection of Maximum Speed Value
    2.4.3 Spatio-Temporal Homogeneity
    2.4.4 Spatial Deviation
    2.4.5 Processing Time
  2.5 Related Work
  2.6 Chapter Summary
Chapter 3: Discovery of Valid Convoy Patterns from Trajectories
  3.1 Introduction
  3.2 Preliminaries
  3.3 Discovery of Valid Convoys
    3.3.1 VCoDA: Straightforward Solution
      3.3.1.1 Discovery of Partially Connected Convoys
      3.3.1.2 Density-Connectivity Validation
    3.3.2 EVCoDA: Efficient Solution
      3.3.2.1 Efficient Discovery of Partially Connected Convoys
      3.3.2.2 Efficient Density-Connectivity Validation
  3.4 Experimental Evaluation
    3.4.1 Dataset and Parameter Setting
    3.4.2 Effectiveness
    3.4.3 Efficiency
  3.5 Related Work
  3.6 Chapter Summary
Chapter 4: Conclusions
  4.1 Contributions
  4.2 Future Work
References

List Of Tables

2.1 Summary of three datasets used in the experiments
3.1 Four operations to update the set V of current partially connected convoy candidates
3.2 Validation process of DCVal
3.3 Summary of datasets and parameter settings for experiments
3.4 Comparison of Accuracy on three datasets
3.5 Comparison of maximum |V|

List Of Figures

1.1 Example of route segmentation over the sequence of nine sampled locations
1.2 Examples of trajectory segmentation in the presence of outliers. (a) Over-segmented trajectory (b) Segmentation robust to outliers
1.3 Accuracy problems of current convoy discovery algorithms
1.4 An overview illustration of dissertation focus and the proposed approaches
2.1 Examples of a trajectory and its route
2.2 An example of spatially and temporally homogeneous segment s_{i,i+2} (a) in the spatio-temporal space and (b) the projected view of the segment on the spatial plane
2.3 Examples of an outlier in the spatio-temporal space and peaks in the spatial plane
2.4 Detecting outliers based on two upper bounds r_1 and r_2
2.5 Top-down trajectory segmentation algorithm RTR-TopDown(S_{startidx,endidx}, ε, υ_m)
2.6 Robust time-referenced distance function RobustTDist(S_{startidx,endidx}, i_th, υ_m)
2.7 Bottom-up trajectory segmentation algorithm RTR-BottomUp(S, ε, υ_m)
2.8 Sliding-window trajectory segmentation algorithm RTR-SlidingWindow(S, ε, υ_m)
2.9 Comparison of spatio-temporal homogeneity among the three proposed trajectory segmentation algorithms on three datasets
2.10 Comparison to conventional trajectory segmentation approaches in terms of spatio-temporal homogeneity on three datasets
2.11 Comparison of perpendicular distance error on three datasets
2.12 Time-referenced distance error on bus dataset
2.13 Comparison of processing time in seconds on bus dataset
3.1 Examples of moving object trajectories and their density-based snapshot clusters at each timestamp
3.2 Discovery process of partially connected convoys
3.3 Partially connected convoy discovery algorithm PCCD(moving objects O, ε, m, k)
3.4 Update function updateVnext(V_next, v_new) for the set of next partially connected convoy candidates
3.5 Density-connectivity validation algorithm DCVal(v_pcc, ε, m, k)
3.6 Efficient algorithm for partially connected convoy discovery EPCCD(moving objects O, ε, m, k)
3.7 The minimum distance D_min and the maximum distance D_max between two MBBs
3.8 Efficient density-connectivity validation algorithm EDCVal(v_pcc = ⟨O, [t_a, t_b]⟩, V_val, ε, m, k)
3.9 Validation process of EDCVal
3.10 Comparison of discovered convoys
3.11 Comparison of processing time in seconds

Abstract

A moving object trajectory is a series of locations of a moving object sampled at discrete instances of time.
Real-world moving object trajectories acquired by location-aware sensors typically involve a large number of moving objects, massive observations, and noisy measurements. In addition, the trajectory data is available only in the form of bulky point clouds of sampled locations and timestamps. Such raw sensor data is therefore neither at the right resolution and quality nor at the right level of abstraction for efficient data exploration, advanced data analysis, and high-level decision making. In this dissertation, we address the problem of the large gap between the raw sensor data that is readily available and the desired trajectory data that is needed. Specifically, we focus on bridging the gap from two perspectives: 1) transforming the raw sensor data into trajectory data approximated at the right resolution and quality by trajectory segmentation and 2) summarizing a large set of trajectory data at the right level of abstraction by discovering a specific type of movement pattern called a valid convoy.

We first propose a preprocessing technique, called robust time-referenced trajectory segmentation, to transform the raw sensor data into the desired trajectory data approximated at the right resolution and quality. Conventional trajectory segmentation techniques focus only on the spatial features of the movement and are vulnerable to outliers. In contrast, our proposed trajectory segmentation methods take into account both the geo-spatial and the temporal structures of movement to obtain spatially and temporally homogeneous segments, and they are also robust against time-referenced spatial outliers.

In addition, we discover a specific type of movement pattern called a valid convoy to summarize the interesting mobility of moving objects in both space and time at the desired level of abstraction. Existing convoy discovery algorithms have a critical accuracy problem: they tend both to miss larger convoy patterns and to retrieve invalid ones. We therefore propose two new valid convoy discovery algorithms, a straightforward VCoDA and an efficient alternative EVCoDA, which accurately discover all valid convoys from moving object trajectories. Our extensive experiments on real-world datasets demonstrate the effectiveness and efficiency of our trajectory segmentation and convoy mining techniques.

Chapter 1
Introduction

1.1 Motivation

A moving object trajectory¹ is a series of locations of a moving object sampled at discrete instances of time and defined as a sequence of pairs, ⟨(p_1, t_1), (p_2, t_2), ..., (p_n, t_n)⟩, where p_i is a two- or three-dimensional vector representing the geo-spatial position sampled at a timestamp t_i (i = 1, ..., n). Various types of trajectory data tracking animals, hurricanes, vehicles, or human subjects have been acquired by using location-aware sensors or positioning devices in a wide range of applications, including animal study, climate monitoring, transportation, neuro-rehabilitation, and more. For example, the Starkey project², a 10-year study of wildlife animals, has collected the trajectories of elk, deer, and cattle by using an automated radio telemetry system in order to examine key questions about these animals, recreation uses, and nutrient flows on National Forests. Inrix Inc.³ has archived real-time GPS probe data from over one million commercial fleet, delivery, and taxi vehicles to provide dynamic traffic information. The Hurricane Best Track data⁴ is a comprehensive set of tracking information on tropical storms and hurricanes accumulated since 1851, which was acquired by various positioning systems including satellites. Moving object trajectory data has thus been increasingly archived, and the amount is growing exponentially with the rapid advances in telecommunication, telemetry, and positioning technologies (e.g., GPS, mobile phone networks, RFID, and satellite) and the continuous drop in the cost of sensors and IT technologies.

¹ The terms trajectory, spatio-temporal sequence, and sequence of time-referenced locations are used interchangeably to refer to the same type of data, depending on the specific aspect to be emphasized.
² http://www.fs.fed.us/pnw/starkey/index.shtml
³ http://www.inrix.com/techdustnetwork.asp
⁴ http://weather.unisys.com/hurricane/atlantic/index.html

A collection of moving object trajectories has been the output of various applications, but it is increasingly used as an input with great potential to diagnose complex systems, discover intrinsic movements, detect deviations from desired behaviors, and obtain new insights in the original context of the applications. For example, the collected animal trajectories have been exploited to find interesting migration patterns (e.g., flock, leadership, concurrent, converge) [7, 36, 4], learn a predictive model (e.g., animal classifier) [38], and cluster animals with similar movement routes [39]. The vehicle trajectories have been analyzed to provide real-time and predictive traffic flow, detect hot routes in a road network [40], and find frequent routes with similar travel time [20]. The archives of hurricane trajectories have been analyzed to forecast weather and detect unusual hurricane movements [37]. Hence, there is an emerging opportunity to provide novel applications and services that enable users to efficiently model, query, analyze, and visualize the massive archives of moving object trajectories for high-level decision making.
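The pair-sequence definition of a trajectory given at the start of this section can be made concrete with a small data structure. The following Python sketch is only an illustration: the names `Observation` and `interval_speeds` and the sample coordinates are hypothetical and do not come from the dissertation.

```python
from dataclasses import dataclass
from math import dist


@dataclass(frozen=True)
class Observation:
    """One sampled pair (p_i, t_i): a 2-D or 3-D position and its timestamp."""
    position: tuple
    timestamp: float


def interval_speeds(trajectory):
    """Average speed between each pair of consecutive observations."""
    speeds = []
    for prev, curr in zip(trajectory, trajectory[1:]):
        elapsed = curr.timestamp - prev.timestamp
        if elapsed <= 0:
            raise ValueError("timestamps must be strictly increasing")
        speeds.append(dist(prev.position, curr.position) / elapsed)
    return speeds


# Hypothetical, irregularly sampled trajectory (made-up coordinates).
trajectory = [
    Observation((0.0, 0.0), 1),
    Observation((3.0, 4.0), 2),
    Observation((6.0, 8.0), 3),
    Observation((9.0, 12.0), 13),
]
speeds = interval_speeds(trajectory)
```

Because the timestamps are stored together with the positions, derived quantities such as speed remain well defined even when the sampling interval is irregular.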
Developing such trajectory-based applications and services from the collected real-world trajectory data is becoming increasingly crucial yet challenging because of the large gap between the archived raw sensor data that is readily available and the desired trajectory data that is needed. The raw sensor data, which typically involves a large number of moving objects, massive observations, noisy or erroneous measurements, and irregular samples, is neither at the right resolution and quality nor at the right level of abstraction for efficient exploration and advanced analysis of the data. In this dissertation, we focus on bridging the gap by addressing the following two specific problems, each in its own context: 1) trajectory data at the right resolution and quality and 2) trajectory data at the right level of abstraction. In the following sections, we motivate each of the problems in more detail and present our key ideas about how it can be addressed.

1.1.1 Trajectory Data at Right Resolution and Quality

Unlike simulated data, real-world trajectory data acquired by sensors suffers from gaps (missing values), noise, and irregular samples due to the inherent imprecision of sensor devices, network failure or delay, and disturbance signals. Furthermore, the size of a moving object trajectory, i.e., the number of observations n, is often large. For example, the elk trajectories used in [39] contain about 1430 observations on average, and the size of the bus trajectories used in [12] varies in the range from 1000 to 7000. The noisy and erroneous measurements, as well as the massive number of observations inherent in raw sensor data streams, increase the computational complexity of subsequent analytical tasks and also affect the quality of the results.
Hence, preprocessing is essential in practice and is often performed prior to any of the major analytical tasks such as close pair join queries [6, 5, 61, 29], search for similar trajectories [59, 55, 15], pattern mining [12, 30, 31], and classification/clustering [39, 38, 19, 9], in order to transform the raw moving object trajectories into cleaner and more effective input to the subsequent analytical algorithms.

Among various preprocessing techniques, from simple interpolation for missing data to more sophisticated transformations, trajectory segmentation is our choice to transform the large and noisy sensor data streams into the desired trajectory data approximated at the right resolution and quality. We chose trajectory segmentation because it specifically aims at reducing the dimensionality of data, providing more concise representations, filtering out noise, and eventually improving the performance and the quality of subsequent data analysis and exploration.

1.1.2 Trajectory Data at Right Level of Abstraction

The size of real-world trajectory data is often large, and the amount grows quickly as more moving objects are involved and continuously traced by location-aware sensors over time. For example, the aforementioned Hurricane Best Track data includes the trajectories of about 1500 Atlantic hurricanes and tropical storms accumulated since 1851. The Inrix vehicle trajectory data was collected from over 650,000 cars in 2008, and the number of vehicles increased to over one million within a year. In applications and services that aim at providing high-level information and actionable knowledge for decision making and diagnosis, one could be swamped by the large set of archived trajectory data that is currently available only in the form of bulky point/polyline clouds of the sampled locations and the associated timestamps. Therefore, it is essential to abstract the large set of raw or even preprocessed trajectory data in the right form and at the right level.
A movement pattern is a local abstraction or structure extracted from the data [25]. It conveys useful knowledge about the mobility of objects in both space and time, shows how just a few moving objects behaved and traversed over time, or characterizes some persistent yet non-trivial deviation from desired behaviors. Various movement patterns have been defined and discovered from moving object trajectories in order to explore specific aspects of a large data set and to gain insight into the intrinsic movement and behaviors present in the data [16]. Examples of movement patterns include relative motion patterns [3, 36], sequential patterns [52, 56, 20, 53], periodic patterns [11, 44], group patterns [31, 57], complex spatio-temporal patterns [23, 54, 13], trajectory clusters [32, 49, 41, 28, 14, 35], and more.

Among those movement patterns, we focus on finding a group of objects that moved together for a certain duration of time in order to abstract the interesting mobility common to some object groups. Such groups of objects that are close in both space and time are useful in deriving high-level information in many real-life applications. For example, retailers find a group of customers with a similar shopping track (e.g., shopping pals) useful for developing their marketing strategies [57]. An animal group that has moved close together for a long time can be used to study the animals' migration and movement behaviors such as flock, leadership, convergence, encounter, etc. [22, 36]. A group of rehabilitation practice trials with a similar movement path can be exploited to diagnose the impairment level of the motor skills of post-stroke patients. A group of commuters or cars that have traversed together along a path can be used to schedule carpools or to derive popular or hot routes [40].
1.2 Thesis Statement

Real-world moving object trajectory data acquired by sensors typically involves a large number of moving objects, massive observations, noisy or erroneous measurements, and irregular samples. In addition, it is currently available only in the form of bulky point/polyline clouds of sampled locations and timestamps. Such raw sensor data is therefore neither at the right resolution and quality nor at the right level of abstraction for efficient data exploration, advanced data analysis, and high-level decision making. In this dissertation, the large gap between the raw sensor data that is readily available and the desired trajectory data that is needed is bridged from two perspectives. First, the long, noisy, or irregularly sampled raw sensor data is transformed into trajectory data approximated at the right resolution and quality by a preprocessing technique called trajectory segmentation. Second, the large archive of raw or preprocessed trajectory data is summarized at the right level of abstraction by mining a specific type of movement pattern called a valid convoy, which represents the grouping mobility of moving objects.

1.3 Previous Work

In the following sections, we present the previous work on trajectory segmentation and on the discovery of a specific type of movement pattern, called a convoy, from moving object trajectories, in order to show the limitations and drawbacks of the existing approaches.

1.3.1 Previous Work on Trajectory Segmentation

Trajectory segmentation is the process of partitioning a given trajectory into a smaller number of homogeneous segments by identifying a minimum subset of observations, such that the data within each segment are similar with respect to some criteria and thus can be effectively approximated by a simple model [8].
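One way to read this definition is that a segmentation is a set of retained observation indices, and its quality is how well a simple model fits the observations between consecutive retained indices. The Python sketch below assumes a straight line as the simple model and perpendicular distance as the similarity criterion. Both choices, along with the function names and the sample coordinates, are illustrative assumptions and not the method developed in this dissertation.

```python
from math import hypot


def point_to_line_distance(p, a, b):
    """Perpendicular distance from point p to the line through a and b (2-D)."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    length = hypot(dx, dy)
    if length == 0.0:
        # Degenerate line: fall back to the distance to the single point a.
        return hypot(px - ax, py - ay)
    return abs(dx * (py - ay) - dy * (px - ax)) / length


def split_at(points, retained):
    """Split a point sequence at the retained indices.

    `retained` must be sorted and include the first and the last index.
    Consecutive segments share their boundary point.
    """
    return [points[i:j + 1] for i, j in zip(retained, retained[1:])]


def segment_error(segment):
    """Largest deviation of the interior points from the end-to-end line."""
    a, b = segment[0], segment[-1]
    return max(
        (point_to_line_distance(p, a, b) for p in segment[1:-1]),
        default=0.0,
    )


# Hypothetical sequence of nine sampled locations (made-up coordinates).
points = [(0, 0), (1, 0.1), (2, -0.1), (3, 0), (4, 1),
          (5, 2), (6, 2), (7, 2.1), (8, 2)]
retained = [0, 3, 5, 8]  # four retained observations give three segments
segments = split_at(points, retained)
errors = [segment_error(s) for s in segments]
```

Here nine observations are reduced to four retained ones, and `errors` reports how far the discarded observations lie from the line that replaces them. Note that this check uses positions only; it says nothing about timing.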
A typical approach previously adopted for trajectory segmentation [12, 39] takes as input a simple sequence of the sampled locations of a trajectory (obtained by dropping the timestamp component), which we call the route of a moving object to distinguish it explicitly from a trajectory. The approach first selects a subset of the sampled locations, identified as characteristic points, where the geometric structure (e.g., spatial closeness, co-linearity, or movement direction) of the given route changes rapidly. Subsequently, the input route is partitioned at every characteristic point, and each segment is modeled by a line connecting a pair of consecutive characteristic points to represent a linear movement. Figure 1.1 illustrates an example of a route S with nine sampled positions and its desirable segmentation into three continuous and non-overlapping segments ⟨S_1, S_2, S_3⟩, each of which is modeled by a directed line connecting two consecutive characteristic points from ⟨p_1, p_4, p_6, p_9⟩. That is, both the dimension reduction and the compact representation are achieved by retaining only the characteristic points and discarding the non-characteristic points.

Figure 1.1: Example of route segmentation over the sequence of nine sampled locations

Our key observation is that the previous trajectory segmentation algorithms have completely ignored the time dimension by dropping the timestamps, so that only the geometric structures present in the spatial dimensions could be exploited for the segmentation. Such segmentations can lead to segments that are spatially homogeneous but have dissimilar temporal or spatio-temporal structures, such as a substantially varying movement speed.

Example 1. Suppose the sequence of sampled locations shown in Figure 1.1 is acquired at irregular sampling rates. For example, let the timestamps of the first four observations be 1, 2, 3, and 13, respectively. From these timestamps together with the moving distances, it can be derived that the speed of the moving object varies within the obtained segment S_1: the object moves fast from p_1 to p_2, similarly fast from p_2 to p_3, and then slowly from p_3 to p_4. Since the movement speed changes significantly at p_3, the segment S_1 should have been partitioned at p_3 as well in order to obtain segments that are genuinely homogeneous in terms of both spatial and temporal semantics.

Unfortunately, irregular time intervals between two consecutive observations are usually encountered in real-world sensor data due to the inherent imprecision of sensor devices, missing data, network failure or delay, disturbance signals, etc. For example, in the bus trajectories used in [12], the bus locations were sampled every 30 seconds. However, the GPS was switched off while the bus was stopped. Therefore, the sampling rate varies even within a single bus trajectory. One might argue that each trajectory could first be re-sampled to have a fixed time interval by interpolating the unobserved locations from the observed ones. Then, the movement could be represented by a simple sequence of locations, since the timestamps would be implied. However, this alternative would lead to over-sampling when aligned to the minimum time interval, which would affect the time complexity of the segmentation algorithm. In addition, without knowledge of the true movement, the interpolation introduces an additional source of noise into the time-referenced location values.

Another limitation of previous approaches is their vulnerability to outliers. A single extreme location value, typically described as a peak, is most likely to be selected as a characteristic point due to the substantial change of geometric structure, such as movement direction and spatial distance.
In the real world, sensor measurements always involve inherent imprecision and errors, arising from many sources such as simple measurement errors, the inherent imprecision of devices, the noise of sensor devices, the limited capability of equipment, and many other disturbing factors. Such incorrect and noisy sensor values may result in wrong characteristic points, deteriorating the quality of the segmentation result. By erroneously considering these outliers as characteristic points, trajectories will be over-segmented at every outlier. To illustrate this problem, let us consider the following example.

Example 2. Consider a route with eight sampled locations in Figure 1.2. Suppose the fourth observation p_4 is an inaccurate value sampled by the sensor and hence an outlier (its location is significantly distant from the preceding and succeeding observations). Figure 1.2(a) illustrates the segmentation into four subsequences obtained by a conventional segmentation method. As expected, p_4 is identified as a characteristic point and the sequence is cut at this point. However, a desired segmentation that is robust to noise should take time into consideration and lead to a single partition, as shown in Figure 1.2(b).

Figure 1.2: Examples of trajectory segmentation in the presence of outliers. (a) Over-segmented trajectory (b) Segmentation robust to outliers

The sequence of numerical position values provides no information other than the geometric structures of the movement, such as movement shape and direction, unless a constant sampling rate is assumed. Unfortunately, detecting outliers based only on this static information can be misleading. Suppose the moving object traced in Figure 1.2 moved very slowly from p_3 to p_5.
Then, the substantial positional change at $p_4$, depicted as a peak, should be retained as a characteristic point to accurately represent the moving object's real movement. With the temporal information available in the trajectory, outliers can be better identified as an extreme change of location value in a unit time interval, i.e., a fast peak. Therefore, trajectory segmentation should take into account both the geo-spatial and the temporal semantics, which, to the best of our knowledge, has never been exploited in the existing segmentation algorithms.

1.3.2 Previous Work on Movement Pattern Discovery

Various movement patterns, more specifically group patterns, have been defined over moving object trajectories. A (long duration) flock [22, 21] is defined by at least $m$ moving objects staying together within a circular region of radius $\varepsilon$ during at least $k$ consecutive timestamps. Although the flock pattern has been the most widely exploited in the past, the shape and the size of a flock snapshot are limited to a disk of fixed size bound $\varepsilon$; hence it cannot cover larger flocks where objects are distributed over an area larger than the given disk size. To avoid this rigid restriction, a variant of the flock called a convoy [30, 31] has recently been proposed based on the notion of density-connectivity. A convoy is defined as a group of at least $m$ moving objects that are density-connected with respect to the density constraints during at least $k$ consecutive timestamps. Unlike a flock, the shape and the size of a convoy snapshot can be arbitrary.

Jeung et al. [30, 31] first introduced the convoy pattern, as defined above, and proposed several algorithms to discover all convoys from a given moving object trajectory dataset.
Their solutions have adopted the well-known moving cluster algorithm [32] in that a density-based clustering (e.g., DBSCAN [18]) is first performed on the moving objects at each timestamp to find snapshot density-connected clusters of arbitrary shapes and then the intersection of a sequence of at least $k$ such snapshot clusters appearing during consecutive timestamps is detected as a convoy if they share at least $m$ objects in common.

Figure 1.3: Accuracy problems of current convoy discovery algorithms. (a) Missed convoy (b) Invalid convoy

Our key observation is that current convoy discovery algorithms have a critical problem of accuracy in terms of both precision and recall; they tend to miss larger convoys and retrieve invalid ones where the density-connectivity among the objects is not completely satisfied. First consider the six moving objects $o_1, o_2, \ldots, o_6$ in Figure 1.3(a). Suppose the minimum number of objects and the lifetime of convoys to be mined is set to $m = 3$ and $k = 2$. We can clearly see that the six objects start moving in two groups from $t_1$ and then travel all together from $t_3$, which results in three natural group patterns: $P_1 = \langle \{o_1, o_2, o_3\}, [t_1, t_4] \rangle$, $P_2 = \langle \{o_4, o_5, o_6\}, [t_1, t_4] \rangle$, and $P_3 = \langle \{o_1, o_2, \ldots, o_6\}, [t_3, t_4] \rangle$. However, current convoy discovery algorithms are unable to detect the larger convoy $P_3$ of all six objects. Similarly, the current approach will return only one convoy of three objects $P_4 = \langle \{o_1, o_2, o_3\}, [t_1, t_4] \rangle$, not the larger group of five $P_5 = \langle \{o_1, o_2, \ldots, o_5\}, [t_1, t_2] \rangle$, in Figure 1.3(b).
Another problem is that the returned convoy $P_4 = \langle \{o_1, o_2, o_3\}, [t_1, t_4] \rangle$ is only a partially connected group, since the three (black) objects are not density-connected at time $t_1$, i.e., they are unable to form a cluster by themselves without the other two (white) objects. Hence, $P_4$ should have had the shorter lifetime $[t_2, t_4]$ to be a valid convoy.

1.4 Proposed Approaches

An overview of the dissertation focus and of our approaches that address the aforementioned limitations of conventional solutions is illustrated in Figure 1.4. We propose robust time-referenced trajectory segmentation to transform the raw sensor data into trajectory data approximated at the right resolution and quality. The reduced and filtered trajectory data is still not at the level of abstraction needed for obtaining insights from the intrinsic movements and behaviors present in large trajectory data and for high-level decision making. Therefore, we also propose both a straightforward and an efficient solution to discover specific movement patterns, called valid convoys, from either raw or preprocessed moving object trajectories. In the following sections, we briefly introduce our approaches, which will be explained in more detail throughout the dissertation.

1.4.1 Robust Time-Referenced Trajectory Segmentation

As mentioned in Section 1.3.1, existing trajectory segmentation techniques focus only on the spatial features of the movement and can produce segments that are spatially homogeneous but have dissimilar temporal structures. Furthermore, trajectories could be over-segmented in the presence of outliers.
In order to obtain segments that are homogeneous both in space and in time, and to make the segmentation process robust to noise, we propose a family of three robust time-referenced trajectory segmentation algorithms that take into account both the spatial and the temporal structures present in the trajectory data. Intuitively, we attempt to partition a given trajectory into a small number of spatially and temporally homogeneous segments by selecting a minimum subset of observations, such that each segment accurately approximates a linear movement at a constant speed. In addition, we utilize the spatio-temporal properties of movement to guide the outlier detection so as to make the segmentation robust to outliers. The three trajectory segmentation algorithms adopt a greedy approach based on split or merge heuristics and employ the generic forms of three representative heuristic algorithms: top-down, bottom-up, and sliding window, respectively. Our empirical evaluation on three real-world trajectory datasets verifies that our techniques outperform the conventional techniques, as well as their simple temporal extensions, in terms of spatio-temporal homogeneity while maintaining comparable spatial homogeneity. In addition, the comparative results ascertain that, semantically, the temporal dimension is different from the spatial dimension and hence should be dealt with in its own context. We discuss this subject in more detail in Chapter 2.

Figure 1.4: An overview illustration of dissertation focus and the proposed approaches
1.4.2 Discovery of Valid Convoy Patterns from Trajectories

Given a set of trajectories of moving objects $O$, a distance threshold $\varepsilon$, an integer $m$, and an integer $k$, the convoy discovery problem is to mine all valid convoys from $O$, each consisting of at least $m$ moving objects that are density-connected with respect to the density constraints $m$ and $\varepsilon$ during at least $k$ consecutive timestamps. Current convoy discovery algorithms [30, 31] have a critical problem of accuracy in terms of both precision and recall, as illustrated in Section 1.3.2; they tend to miss larger convoys and retrieve invalid ones where the density-connectivity among the objects is not completely satisfied.

In order to address the accuracy problem of existing convoy discovery algorithms, we propose two algorithms for the accurate mining of valid convoys from a set of moving object trajectories: a straightforward solution VCoDA (Valid Convoy Discovery Algorithm) and an efficient alternative EVCoDA (Efficient Valid Convoy Discovery Algorithm). Both solutions consist of two phases. First, a set of all partially connected convoys is discovered from a given set of moving objects, with the guarantee that no valid convoy is falsely dismissed. Then, the density-connectivity of each obtained partially connected convoy is validated to finally obtain a complete set of valid convoys. VCoDA extends the current convoy algorithm CMC [30] in both phases. While the first phase of EVCoDA adopts the moving cluster algorithm [32], its density-connectivity validation employs a branch-and-bound framework, where minimum bounding boxes are used to efficiently approximate the validation during a given time interval so as to avoid unnecessary exact validations at each timestamp of the time interval.
Our experiments on three real-world datasets demonstrate the effectiveness and efficiency of our techniques; both VCoDA and EVCoDA improve the precision by a factor of 3 on average and the recall by up to 2 orders of magnitude as compared to the existing convoy algorithm. Also, the efficient algorithm EVCoDA takes up to a factor of 3 less time than VCoDA. We present this work in more detail in Chapter 3.

1.5 Contribution Summary

This dissertation deals with the large gap between the raw sensor data that is readily available and the trajectory data that is needed for efficient data exploration, advanced data analysis, and high-level decision making. Specifically, trajectory segmentation and convoy pattern mining are employed to transform the available raw sensor data into the desired trajectory data, approximated at the right resolution and quality and summarized at the right level of abstraction, respectively. The contributions of the dissertation are summarized as follows:

• Trajectory data at right resolution and quality
– Trajectory segmentation is chosen as a preprocessing technique to transform the raw sensor data into the desired trajectory data at right resolution and quality.
– A new distance function, called the time-referenced distance function, is proposed to measure the spatial distance between two sampled locations and to be used in our trajectory segmentation.
– Spatio-temporal information present in trajectory data is exploited to guide the outlier detection so as to filter out erroneous measurements from the segmentation process and eventually avoid the over-segmentation problem in the presence of outliers.
– Novel trajectory segmentation algorithms are proposed to partition each trajectory into consecutive and non-overlapping segments such that each segment is homogeneous in both spatial and temporal semantics, with the purpose of approximating the original input trajectory with fewer sampled locations and associated timestamps.
– The effectiveness of our segmentation is empirically demonstrated over real moving object trajectory datasets.

• Trajectory data at right level of abstraction
– Specific movement patterns are discovered from raw or preprocessed moving object trajectories to summarize the large trajectory data at a right level of abstraction.
– New group patterns, partially density-connected convoy and valid convoy, are introduced to find groups of moving objects that have traveled close together for some duration of time.
– Two algorithms, VCoDA and EVCoDA, are developed for the accurate discovery of valid convoy patterns from a set of moving object trajectories.
– The computational bottlenecks of the straightforward solution VCoDA are identified and addressed by the efficient alternative EVCoDA.
– An extensive set of experiments is conducted to demonstrate the effectiveness and efficiency of our convoy discovery algorithms.

1.6 Dissertation Outline

This dissertation is organized as follows. In Chapter 2, we propose a family of three trajectory segmentation methods to transform raw sensor data into trajectory data approximated at the desired resolution and quality. The proposed trajectory segmentation methods take into account both the geo-spatial and the temporal structures of movement for the segmentation and are also robust with respect to time-referenced spatial outliers. The effectiveness of our methods is empirically demonstrated over three real-world trajectory datasets in terms of spatial and temporal homogeneity.
In Chapter 3, we propose two new valid convoy discovery algorithms that discover all valid convoys from a set of moving object trajectories, abstracting the group mobility common to some moving objects. Our solutions consist of two phases; they first retrieve all partially connected convoys while guaranteeing no false dismissal of any valid convoy and then validate their density-connectivity to eventually obtain a complete set of valid convoys. In order to demonstrate the effectiveness and efficiency of our techniques, an extensive set of experiments is performed on three real-world trajectory datasets. This dissertation is concluded in Chapter 4 with final remarks on our contributions, followed by future directions of our research.

Chapter 2
Robust Time-Referenced Trajectory Segmentation

2.1 Introduction

Trajectory segmentation is the process of partitioning a given moving object trajectory into a small number of homogeneous segments by identifying a minimum subset of observations, such that the data within each segment are similar with respect to some criteria and thus can be effectively approximated by a simple model [8]. Real-world moving object trajectory data acquired by location-aware sensors or positioning devices typically involves a large number of moving objects, massive observations, noisy or erroneous measurements, and irregular samples. Such raw sensor data thus needs to be transformed to the right resolution and quality for efficient data exploration and advanced data analysis. To this end, trajectory segmentation has often been performed as a preprocessing step prior to any of the major analytical tasks such as pattern mining [12, 30, 31], classification/clustering [39, 38], or outlier detection [37], in order to reduce the dimensionality, provide more concise and effective representations, filter out erroneous measurements, and eventually improve the performance of subsequent analytical tasks.
In this chapter, we propose a family of three robust time-referenced trajectory segmentation algorithms that partition a given trajectory into a small number of spatially and temporally homogeneous segments, such that each segment accurately approximates a linear movement at a constant speed. In addition, we utilize the spatio-temporal properties of movement to guide the outlier detection so as to make the segmentation robust against incorrect and noisy measurements.

This chapter is organized as follows. Section 2.2 introduces our trajectory model and some basic definitions to be used in the rest of this chapter. In Section 2.3, we formalize the trajectory segmentation problem and propose three robust time-referenced trajectory segmentation algorithms. Section 2.4 presents the results of experimental evaluation and comparison to conventional trajectory segmentation approaches. Section 2.5 discusses related work. Finally, Section 2.6 concludes this chapter.

2.2 Preliminaries

In this section, we present the trajectory model that captures both the spatial and the temporal nature of movement and some basic definitions.

2.2.1 Trajectory Model

The movement of a moving object is typically traced by sampling its geographic locations at discrete instances of time by using location-aware sensors. This finite sequence of time-referenced locations capturing both the spatial and the temporal nature of the movement is called a moving object trajectory (or simply trajectory). The simple sequence of geo-spatial location values without the referenced time information is termed as a route to explicitly distinguish it from the trajectory.

Definition 1. A trajectory $S$ is a finite sequence of sampled locations during a closed time interval $[t_1, t_n]$ and defined as a sequence of pairs $S = \langle (p_1, t_1), (p_2, t_2), \ldots, (p_n, t_n) \rangle$, where $p_i \in \mathbb{R}^d$ ($d \in \{2, 3\}$) is a two- or three-dimensional location vector representing the geo-spatial position sampled at a timestamp $t_i \in \mathbb{R}^+$.
Definition 2. A route of the moving object is the projection of the trajectory $S$ onto the spatial space by dropping the temporal component, and is thus defined as the simple sequence of sampled position values $R_s = \langle p_1, p_2, \ldots, p_n \rangle$.

Figure 2.1: Examples of a trajectory and its route

Definition 3. The size of a trajectory or a route is defined as the number of samples and denoted as $|S| = |R_s| = n$.

Figure 2.1 illustrates a trajectory as a solid directed polyline in the spatio-temporal space formed by combining the spatial plane spanned by the X and Y axes (shaded in the figure) and the time dimension along the Z axis. The dashed line on the (shaded) spatial plane shows the route. We assume an irregular sampling rate, that is, the time intervals between two consecutive samples, $\Delta t_i = t_{i+1} - t_i$ ($i = 1, \ldots, n-1$), can vary even within a single trajectory. This is a reasonable assumption due to the inherent imprecision involved in the sensor data acquisition.

As shown in Figure 2.1, during a time interval $[t_i, t_{i+1}]$, the unmeasured movement of a moving object is estimated by a linear interpolation using the observed positions $p_i$ and $p_{i+1}$, assuming that the object moves straight from the location $p_i$ to $p_{i+1}$ at a constant speed. Therefore, at any instance of time $t_k$ within this time interval ($t_i \le t_k \le t_{i+1}$), the estimated position $\hat{p}_k$ placed along the line can be calculated as follows:

$$\hat{p}_k = \frac{t_{i+1} - t_k}{t_{i+1} - t_i}\, p_i + \frac{t_k - t_i}{t_{i+1} - t_i}\, p_{i+1}, \qquad (2.1)$$

so that $\hat{p}_k = p_i$ at $t_k = t_i$ and $\hat{p}_k = p_{i+1}$ at $t_k = t_{i+1}$.

Definition 4. A segment $s_{ij}$ of a trajectory $S$ of size $n$ ($1 \le i < j \le n$) is a continuous sub-sequence of $S$, starting from $(p_i, t_i)$ and ending at $(p_j, t_j)$. A segment $s_{ij}$ is spatially and temporally homogeneous during the time interval $[t_i, t_j]$ with respect to a pre-specified distance threshold $\varepsilon$ if all sampled locations $p_k$ observed at time $t_k$ ($i \le k \le j$) within this segment deviate from their estimated time-referenced positions $\hat{p}_k$ by no more than $\varepsilon$.

Note that $\hat{p}_k$ is obtained as in Equation 2.1 by a linear interpolation using the starting and the ending positions $p_i$ and $p_j$, respectively, while assuming a linear movement with a constant speed during the time interval $[t_i, t_j]$. The distance between $p_k$ and $\hat{p}_k$ within the segment $s_{ij}$, denoted as $\text{T-dist}(p_k, \hat{p}_k \mid s_{ij})$, is computed using the Euclidean distance and called the time-referenced distance in order to emphasize that the temporal structure is incorporated to measure the spatial dissimilarity. This time-referenced distance can be formulated as follows:

$$\text{T-dist}(p_k, \hat{p}_k \mid s_{ij}) = \text{dist}\!\left(p_k,\ \frac{t_j - t_k}{t_j - t_i}\, p_i + \frac{t_k - t_i}{t_j - t_i}\, p_j\right), \qquad (2.2)$$

where $i \le k \le j$ and $\text{dist}()$ is the Euclidean distance. Figure 2.2(a) illustrates a segment $s_{i,i+2}$, starting from $(p_i, t_i)$ and ending at $(p_{i+2}, t_{i+2})$. This segment is spatially and temporally homogeneous since the Euclidean distance between the actually measured location $p_{i+1}$ and the estimated location $\hat{p}_{i+1}$ at the time instance $t_{i+1}$ within the segment $s_{i,i+2}$ is less than a given distance threshold $\varepsilon$. Figure 2.2(b) illustrates the projected view of the segment onto the spatial plane. In addition, Figure 2.2(b) shows how the time-referenced distance $\text{T-dist}(p_{i+1}, \hat{p}_{i+1} \mid s_{i,i+2})$ differs from the perpendicular distance $d_\perp$, which is the distance from the position $p_{i+1}$ to the line $\overrightarrow{p_i p_{i+2}}$ and has been widely employed in the previous segmentation algorithms [12, 39, 38, 37].

2.2.2 Outlier Detection

An outlier is a single extreme value that is substantially different from the rest of the data at some measures [1].
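To make the contrast between the time-referenced distance of Equation 2.2 and the perpendicular distance $d_\perp$ concrete, the following Python sketch computes both for a 2-D sample. It is only an illustration: the function names `interpolate`, `t_dist`, and `perpendicular_dist` are hypothetical and not part of the dissertation, and the interpolation weights $p_i$ by the remaining fraction of the interval, so that the estimate equals $p_i$ at $t_i$ and $p_j$ at $t_j$.

```python
import math

def interpolate(p_i, t_i, p_j, t_j, t_k):
    """Estimated position at time t_k, assuming a straight move from
    (p_i, t_i) to (p_j, t_j) at constant speed."""
    w = (t_k - t_i) / (t_j - t_i)  # elapsed fraction of the time interval
    return tuple((1 - w) * a + w * b for a, b in zip(p_i, p_j))

def t_dist(p_k, t_k, p_i, t_i, p_j, t_j):
    """Time-referenced distance: Euclidean distance between the observed
    position p_k and its estimated position at time t_k."""
    return math.dist(p_k, interpolate(p_i, t_i, p_j, t_j, t_k))

def perpendicular_dist(p_k, p_i, p_j):
    """2-D distance from p_k to the infinite line through p_i and p_j."""
    (x0, y0), (x1, y1), (x2, y2) = p_k, p_i, p_j
    cross = (x2 - x1) * (y1 - y0) - (x1 - x0) * (y2 - y1)
    return abs(cross) / math.dist(p_i, p_j)

# A sample lying exactly on the route, but reached much later than a
# constant-speed movement from p_i to p_j would predict.
p_i, t_i = (0.0, 0.0), 0.0
p_k, t_k = (1.0, 0.0), 9.0
p_j, t_j = (10.0, 0.0), 10.0
print(perpendicular_dist(p_k, p_i, p_j))       # 0.0: spatially on the line
print(t_dist(p_k, t_k, p_i, t_i, p_j, t_j))    # about 8: far from where it should be at t_k
```

In this made-up example the perpendicular distance sees a perfectly straight route, while the time-referenced distance exposes the change of speed.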
Figure 2.2: An example of a spatially and temporally homogeneous segment $s_{i,i+2}$ (a) in the spatio-temporal space and (b) the projected view of the segment on the spatial plane

In moving object trajectories, an outlier can be identified as an extreme rate of change of location value per unit time interval and is thus visually depicted as a peak in the spatio-temporal space, as shown in Figure 2.3. Note that a peak in the spatio-temporal space is different from one in the spatial subspace. For example, Figure 2.3 shows a significant change of location at both $p_4$ and $p_7$, both of which appear as peaks, and thus as outliers, in the spatial plane. However, considering that the time interval $\Delta t_3 = t_4 - t_3$ taken to sample the location $p_4$ is about three times longer than the rest, $p_4$ is not really a surprising rate of location change per unit time, while $p_7$ could indeed be a substantial change in a short time interval. Unfortunately, the slow peak and the fast peak are not distinguishable in the geo-spatial representation unless a constant sampling rate is assumed. Detecting outliers simply based on the static geometric structure presented in the route can thus be misleading due to the missing temporal information.

In order to identify the time-referenced spatial outliers, spatio-temporal features of movement must be incorporated. This suggests adopting the maximum movement speed as an indicator of whether a peak is an error or just a surprising but correct value representing the real movement.
Figure 2.3: Examples of an outlier in the spatio-temporal space and peaks in the spatial plane

The key idea of our outlier detection method is that the maximum speed of a moving object limits how far the moving object can travel in a given time interval. If the sampled location is beyond this range, it is considered an error.

Consider the segment $s_{i-1,i+1}$ of a trajectory in Figure 2.4. Suppose the moving object's maximum speed $\nu_m$ is known in advance. If the object moves linearly at maximum speed from $p_{i-1}$, its position at time $t_i$ will be on the surface of a (smaller) half sphere of radius $r_1 = \nu_m \times (t_i - t_{i-1})$, as shown in Figure 2.4. The points on the surface of the half sphere are the furthest positions that can be reached by the time instance $t_i$. If the movement speed is less than $\nu_m$, or the object does not move, then the time-referenced location at time $t_i$ is somewhere within the area bounded by the smaller half sphere. Similarly, in order to move to the location $p_{i+1}$ by the time $t_{i+1}$ during the time interval $[t_i, t_{i+1}]$, the location at the preceding time instance $t_i$ should be at most on the surface of the (larger) half sphere of radius $r_2 = \nu_m \times (t_{i+1} - t_i)$ or somewhere inside it. Therefore, a time-referenced location $(p_i, t_i)$ beyond this range represents an unrealistic movement and is thus an inaccurate and noisy sensor measurement.

Definition 5. A location $p_i$ sampled at time $t_i$ is a time-referenced spatial outlier with respect to a pre-specified maximum speed $\nu_m$ if $\text{dist}(p_i, p_{i-1}) > \nu_m \times (t_i - t_{i-1}) \wedge \text{dist}(p_{i+1}, p_i) > \nu_m \times (t_{i+1} - t_i)$, where $\text{dist}(\cdot)$ is the Euclidean distance.
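Definition 5 can be sketched in a few lines of Python. This is a minimal illustration rather than the dissertation's implementation: the function name `is_outlier`, the `((x, y), t)` sample layout, and the example coordinates and speed bound are all assumptions made for the sketch. The two radii follow the bounds $r_1$ and $r_2$ described above.

```python
import math

def is_outlier(prev, curr, nxt, v_max):
    """Return True if `curr` is a time-referenced spatial outlier.

    Each sample is ((x, y), t). `curr` is flagged only when it is BOTH
    unreachable from `prev` and unable to reach `nxt` at speed <= v_max.
    """
    (p0, t0), (p1, t1), (p2, t2) = prev, curr, nxt
    r1 = v_max * (t1 - t0)  # furthest reachable distance from prev by time t1
    r2 = v_max * (t2 - t1)  # furthest distance from which nxt is reachable
    return math.dist(p1, p0) > r1 and math.dist(p2, p1) > r2

# Two peaks with the same shape in the spatial plane (made-up values).
v_max = 3.0
slow_peak = (((2.0, 0.0), 2.0), ((3.0, 6.0), 5.0), ((4.0, 0.0), 8.0))
fast_peak = (((4.0, 0.0), 8.0), ((5.0, 6.0), 9.0), ((6.0, 0.0), 10.0))
print(is_outlier(*slow_peak, v_max))  # False: 3 time units were available each way
print(is_outlier(*fast_peak, v_max))  # True: only 1 time unit each way
```

The two peaks are geometrically identical, so only the timestamps separate the plausible slow peak from the implausible fast one.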
Figure 2.4: Detecting outliers based on two upper bounds $r_1$ and $r_2$

2.3 Robust Time-Referenced Trajectory Segmentation

In this section, we formalize the problem considered in this chapter and present three robust time-referenced trajectory segmentation algorithms as our solution.

2.3.1 Problem Definition

Moving object trajectories are large in size and noisy, which is typical of data acquired by sensors. Therefore, it is necessary to preprocess the trajectories to reduce the dimensionality and compress them into a compact representation robust against outliers, so that they can be processed efficiently in the subsequent analysis tasks. To this end, we aim to partition such a large and noisy trajectory into a small number of spatially and temporally homogeneous segments by selecting a minimum number of observations such that each segment can accurately approximate a linear movement with a constant movement speed.

The trajectory segmentation problem considered in this chapter can be formalized as follows: given a trajectory $S$ of size $n$, $S = \langle (p_1, t_1), (p_2, t_2), \ldots, (p_n, t_n) \rangle$, determine a minimum subset of the discrete instances of time, $CT = \{t_{c_1}, t_{c_2}, \ldots, t_{c_k}\}$ ($1 \le c_1 < c_2 < \ldots < c_k \le n$), termed characteristic timestamps (CTs), such that each segment $s_{c_i c_{i+1}}$ ($1 \le i \le k-1$) that starts from $(p_{c_i}, t_{c_i})$ and ends at $(p_{c_{i+1}}, t_{c_{i+1}})$ is spatially and temporally homogeneous during the corresponding time interval $[t_{c_i}, t_{c_{i+1}}]$ with respect to a pre-specified distance threshold $\varepsilon$ (by Definition 4) and none of the locations sampled at the characteristic timestamps is a time-referenced spatial outlier with respect to a pre-specified maximum speed $\nu_m$ (by Definition 5).
Such characteristic timestamps specify where the input trajectory should be partitioned to result in the smallest number of homogeneous segments within a given bound, each approximated with a direct line representing a linear movement at a constant speed. Therefore, both the dimension reduction and the compact representation are achieved by retaining only the observations at the characteristic timestamps and discarding the unselected observations.

The optimal solution that finds the minimum subset of the timestamps satisfying the constraints on the maximum deviation and the exclusion of outliers can be obtained by using dynamic programming as in [45] or methods developed for the shortest path problem in a digraph [27]. However, this is an expensive solution, with a time complexity usually ranging between $O(n^2)$ and $O(n^3)$. Instead, we adopt a greedy approach based on split or merge heuristics, which greedily selects the local optimum at each iteration in the hope of reaching a global optimum. Although it may fail to find a globally optimal partitioning, it has been demonstrated in practice that the accuracy is high and the quality of segmentation is reasonable [33]. Our trajectory segmentation algorithms employ the generic forms of three representative heuristic algorithms, top-down, bottom-up, and sliding window, respectively, each of which is described in the following sections.

2.3.2 Top-down Algorithm

As in other top-down algorithms [17, 12], our top-down trajectory segmentation algorithm takes an unsegmented trajectory as an input and selects one characteristic timestamp $t_i$ with the largest time-referenced distance between the actually observed location $p_i$ sampled at this time $t_i$ and its estimated time-referenced location $\hat{p}_i$. If the observed location $p_i$ is an outlier with respect to its preceding and succeeding locations and a given maximum speed, its timestamp should not be considered as a characteristic timestamp.
Subsequently, it splits the sequence at $t_i$ into two subsequences. The same split process is recursively repeated until no more observation deviates from its estimated time-referenced position by more than the given distance threshold.

Require: Trajectory segment S_{startidx,endidx} = ⟨(p_startidx, t_startidx), (p_startidx+1, t_startidx+1), ..., (p_endidx, t_endidx)⟩, distance threshold ε, and maximum speed ν_m
1: n ← |S_{startidx,endidx}|;
2: CT ← ∅;
3: for i = 1 to n do
4:   splitMeasure(i) ← 0;
5: end for
6: for i = 2 to n−2 do
7:   splitMeasure(i) ← RobustTDist(S_{startidx,endidx}, i, ν_m);
8: end for
9: if max(splitMeasure) ≤ ε then // No more split
10:   CT ← {t_startidx, t_endidx};
11:   return CT;
12: end if
13: splitidx ← get the index of max(splitMeasure);
14: // Recursively split the subsequences
15: CT1 ← RTR-TopDown(S_{startidx,splitidx}, ε, ν_m);
16: CT2 ← RTR-TopDown(S_{splitidx,endidx}, ε, ν_m);
17: // Combine the selected timestamps
18: CT ← concat(CT1, CT2);
19: return CT;
Figure 2.5: Top-down trajectory segmentation algorithm RTR-TopDown(S_{startidx,endidx}, ε, ν_m)

Figure 2.5 shows the pseudocode of our top-down trajectory segmentation algorithm RTR-TopDown. We first compute the split measure of each observation in a given (sub-)trajectory using the procedure RobustTDist() in lines 6–8, based on which we choose one splitting characteristic timestamp. The robust time-referenced distance function RobustTDist(S_{startidx,endidx}, i, ν_m) in line 7 returns the spatial distance of the $i$-th observation in the segment S_{startidx,endidx} to its estimated position if this observation is not an outlier with respect to the given maximum speed $\nu_m$, and a zero value otherwise, as shown in Figure 2.6. If the maximum value of the split measures is below the given distance threshold $\varepsilon$, there will be no more splits and the segmentation is stopped by returning the CTs found so far (lines 9–12).
Otherwise, the input sequence is partitioned at the observation with the largest split measure value into two subsequences, each of which is recursively partitioned by the same RTR-TopDown procedure in lines 15 and 16.

Require: Trajectory segment S_{startidx,endidx} = ⟨(p_startidx, t_startidx), (p_startidx+1, t_startidx+1), ..., (p_endidx, t_endidx)⟩, index i_th, and maximum speed ν_m
1: cidx ← startidx + i_th − 1;
2: if dist(p_cidx, p_cidx−1) > ν_m × (t_cidx − t_cidx−1) ∧ dist(p_cidx+1, p_cidx) > ν_m × (t_cidx+1 − t_cidx) then
3:   tdist ← 0; // p_cidx is an outlier
4: else
5:   p̂_cidx ← (t_endidx − t_cidx)/(t_endidx − t_startidx) × p_startidx + (t_cidx − t_startidx)/(t_endidx − t_startidx) × p_endidx;
6:   tdist ← dist(p_cidx, p̂_cidx);
7: end if
8: return tdist;
Figure 2.6: Robust time-referenced distance function RobustTDist(S_{startidx,endidx}, i_th, ν_m)

The RobustTDist(S_{startidx,endidx}, i_th, ν_m) function shown in Figure 2.6 computes the time-referenced distance between the actually measured location observed at the $i_{th}$ timestamp in the input segment and its corresponding estimated location, i.e., $\text{T-dist}(p_{i_{th}}, \hat{p}_{i_{th}} \mid S_{startidx,endidx})$ as in Equation 2.2, assuming a linear movement at a constant speed within the input segment, if the observation $p_{i_{th}}$ at time $t_{i_{th}}$ is not an outlier (lines 5–7). If the observation is an outlier with respect to its preceding and succeeding locations and a given maximum speed $\nu_m$, a zero value is returned so that its timestamp is not selected as a characteristic timestamp in the top-down algorithm. This distance function RobustTDist() is also used repeatedly in the bottom-up and the sliding window algorithms.

The time complexity of the top-down algorithm is $O(kn^2)$ after $k$ splits, where $n$ is the size of the trajectory, and can be reduced to $O(kn \log n)$ if an efficient data structure such as a priority queue is used to maintain the split measure values in order.
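The two procedures above can be sketched together in Python. This is a hedged illustration under several assumptions that are not taken from the dissertation: samples are `((x, y), t)` tuples, indices are 0-based and absolute rather than relative to the segment, every interior sample of a segment is scored, the interpolation weights the start point by the remaining time fraction, and the split index shared by the two recursive results is emitted only once. The names `robust_t_dist` and `rtr_top_down` are hypothetical.

```python
import math

def robust_t_dist(traj, start, end, c, v_max):
    """Time-referenced distance of sample c within segment traj[start..end];
    returns 0.0 when sample c is an outlier w.r.t. its two neighbours."""
    (p_s, t_s), (p_e, t_e) = traj[start], traj[end]
    (p_prev, t_prev), (p_c, t_c), (p_next, t_next) = traj[c - 1], traj[c], traj[c + 1]
    if (math.dist(p_c, p_prev) > v_max * (t_c - t_prev)
            and math.dist(p_next, p_c) > v_max * (t_next - t_c)):
        return 0.0  # outlier: never chosen as a split point
    w = (t_c - t_s) / (t_e - t_s)  # elapsed fraction of the segment's duration
    est = tuple((1 - w) * a + w * b for a, b in zip(p_s, p_e))
    return math.dist(p_c, est)

def rtr_top_down(traj, eps, v_max, start=0, end=None):
    """Return the sorted indices of the characteristic timestamps."""
    if end is None:
        end = len(traj) - 1
    interior = range(start + 1, end)
    if not interior:  # nothing between the two endpoints
        return [start, end]
    measures = {c: robust_t_dist(traj, start, end, c, v_max) for c in interior}
    split = max(measures, key=measures.get)
    if measures[split] <= eps:  # segment is already homogeneous
        return [start, end]
    left = rtr_top_down(traj, eps, v_max, start, split)
    right = rtr_top_down(traj, eps, v_max, split, end)
    return left + right[1:]  # drop the duplicated split index

# Straight route sampled at times 1, 2, 3, 13: the speed drops after the third sample.
slowdown = [((0.0, 0.0), 1.0), ((1.0, 0.0), 2.0), ((2.0, 0.0), 3.0), ((3.0, 0.0), 13.0)]
print(rtr_top_down(slowdown, eps=0.5, v_max=5.0))  # [0, 2, 3]

# A fast peak at index 2 is ignored when the speed bound marks it as an outlier.
spike = [((0.0, 0.0), 0.0), ((1.0, 0.0), 1.0), ((2.0, 9.0), 2.0),
         ((3.0, 0.0), 3.0), ((4.0, 0.0), 4.0)]
print(rtr_top_down(spike, eps=0.5, v_max=2.0))    # [0, 4]
print(rtr_top_down(spike, eps=0.5, v_max=100.0))  # [0, 1, 2, 3, 4]
```

With a very loose speed bound the outlier filter never fires, and the spike forces a cut at every sample, which is the over-segmentation that the speed bound is meant to prevent.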
2.3.3 Bottom-up Algorithm

A typical bottom-up algorithm begins with the finest possible segments of a given trajectory and greedily merges the two adjacent segments with the minimum value of some measure into one. The same merging procedure is repeated until some stopping condition is satisfied. Our bottom-up trajectory segmentation algorithm adopts this generic form of the bottom-up approach.

Require: Trajectory S = ⟨(p_1, t_1), (p_2, t_2), ..., (p_n, t_n)⟩, distance threshold ε, and maximum speed ν_m
1: n ← |S|;
2: for i = 1 to n do
3:   CT(i) ← t_i;
4:   mgMeasure(i) ← ∞
5: end for
6: for i = 2 to n−1 do
7:   mgMeasure(i) ← RobustTDist(S_{i−1,i+1}, i, ν_m);
8: end for
9: while min(mgMeasure) ≤ ε do
10:   mgidx ← get the index of min(mgMeasure);
11:   // Remove the selected CT to merge
12:   CT(mgidx) ← NULL;
13:   mgMeasure(mgidx) ← ∞;
14:   // Update the preceding merge measure
15:   mgS1 ← concat(S_{mgidx−2,mgidx−1}, S_{mgidx+1,mgidx+1});
16:   mgMeasure(mgidx−1) ← RobustTDist(mgS1, 2, ν_m);
17:   // Update the succeeding merge measure
18:   mgS2 ← concat(S_{mgidx−1,mgidx−1}, S_{mgidx+1,mgidx+2});
19:   mgMeasure(mgidx+1) ← RobustTDist(mgS2, 2, ν_m);
20: end while
21: return all non-null CT;
Figure 2.7: Bottom-up trajectory segmentation algorithm RTR-BottomUp(S, ε, ν_m)

Figure 2.7 shows the pseudocode of our bottom-up trajectory segmentation algorithm RTR-BottomUp. We first initialize the set of characteristic timestamps (CTs) with the finest possible timestamps of a given trajectory of size $n$ in line 3. Therefore, each observation in the trajectory becomes an initial segment on its own. In line 7, we compute the merge measure of the observation at each CT by using the same distance function RobustTDist() shown in Figure 2.6. Based on the merge measures, we remove the timestamp of the observation with the minimum value from the CT list (lines 12–13) and update the merge measures of its preceding and succeeding observations in lines 15–16 and 18–19, respectively.
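The merge loop of Figure 2.7 can be sketched in Python as follows. This is a simplified illustration, not the dissertation's code: it recomputes every merge measure after each removal instead of updating only the two neighbouring measures, it uses 0-based indices and `((x, y), t)` samples, and the names `merge_measure` and `rtr_bottom_up` are hypothetical.

```python
import math

def merge_measure(a, b, c, v_max):
    """Deviation of the middle sample b from the constant-speed line a -> c;
    returns 0.0 when b is an outlier, so that outliers are merged away first."""
    (pa, ta), (pb, tb), (pc, tc) = a, b, c
    if math.dist(pb, pa) > v_max * (tb - ta) and math.dist(pc, pb) > v_max * (tc - tb):
        return 0.0
    w = (tb - ta) / (tc - ta)  # elapsed fraction of the interval [ta, tc]
    est = tuple((1 - w) * x + w * y for x, y in zip(pa, pc))
    return math.dist(pb, est)

def rtr_bottom_up(traj, eps, v_max):
    """Return the indices of the samples kept as characteristic timestamps."""
    kept = list(range(len(traj)))  # every sample starts as its own CT
    while len(kept) > 2:
        # measure each interior kept sample against its current kept neighbours
        costs = [merge_measure(traj[kept[j - 1]], traj[kept[j]], traj[kept[j + 1]], v_max)
                 for j in range(1, len(kept) - 1)]
        j_min = min(range(len(costs)), key=costs.__getitem__)
        if costs[j_min] > eps:  # no remaining sample can be dropped within eps
            break
        del kept[j_min + 1]     # merge the two segments around this sample
    return kept

slowdown = [((0.0, 0.0), 1.0), ((1.0, 0.0), 2.0), ((2.0, 0.0), 3.0), ((3.0, 0.0), 13.0)]
print(rtr_bottom_up(slowdown, eps=0.5, v_max=5.0))  # [0, 2, 3]
```

Recomputing all measures costs time linear in the number of kept samples per merge; the incremental update of lines 15–19, or a priority queue, avoids that at the price of more bookkeeping.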
This merging process is repeated as long as some remaining observation is either an outlier with respect to the given maximum speed $\upsilon_m$ or deviates from its estimated location by no more than the given distance threshold $\varepsilon$ (lines 9-20). Note that the robust time-referenced distance function used to compute the merge measure returns a zero value if the observation under consideration is an outlier, so that outliers are removed from the CT list first in the merging procedure. The time complexity of the bottom-up algorithm is $O(kn)$ after $k$ merges and can be reduced to $O(k \log n)$ if a priority queue is used to maintain the merge measure values in order.

Require: Trajectory $S = \langle (p_1, t_1), (p_2, t_2), \ldots, (p_n, t_n) \rangle$, distance threshold $\varepsilon$, and maximum speed $\upsilon_m$
1: $n \leftarrow |S|$;
2: $CT \leftarrow \{t_1\}$;
3: $startidx \leftarrow 1$, $length \leftarrow 1$;
4: while $startidx + length \le n$ do
5: $curridx \leftarrow startidx + length$;
6: for $i = 2$ to $length - 1$ do
7: $swMeasure(i) \leftarrow$ RobustTDist($S_{startidx,curridx}$, $i$, $\upsilon_m$);
8: end for
9: if $\max(swMeasure) > \varepsilon$ then
10: $CT \leftarrow$ concat($CT$, $t_{curridx-1}$);
11: $startidx \leftarrow curridx - 1$;
12: $length \leftarrow 1$;
13: else
14: $length \leftarrow length + 1$;
15: end if
16: end while
17: return $CT$;
Figure 2.8: Sliding-window trajectory segmentation algorithm RTR-SlidingWindow($S$, $\varepsilon$, $\upsilon_m$)

2.3.4 Sliding Window Algorithm

As in other sliding window algorithms [39], our sliding window algorithm fixes the first timestamp of a given trajectory as the first characteristic timestamp and attempts to place the next CT as far away as possible. Each subsequent spatio-temporal observation is added to the segment currently under consideration as long as this segment remains spatially and temporally homogeneous. When adding another observation breaks the spatial and temporal homogeneity of the current segment, we separate the segment (without the observation that breaks the homogeneity) from the trajectory. Subsequently, we proceed with the current observation and repeat the process until the end of the trajectory.
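The sliding-window procedure just described can be sketched in Python as follows. The sketch rests on a few assumptions that go beyond the text: the names `robust_tdist` and `rtr_sliding_window` are hypothetical, indices are 0-based, every interior observation of the candidate segment is checked, and the last timestamp is appended at the end so that the final segment is closed.

```python
import math

def robust_tdist(points, times, start, end, idx, max_speed):
    """Robust time-referenced distance of an interior observation of [start, end]."""
    too_fast_in = math.dist(points[idx], points[idx - 1]) > max_speed * (times[idx] - times[idx - 1])
    too_fast_out = math.dist(points[idx + 1], points[idx]) > max_speed * (times[idx + 1] - times[idx])
    if too_fast_in and too_fast_out:
        return 0.0  # outliers never break the homogeneity of a segment
    w = (times[idx] - times[start]) / (times[end] - times[start])
    estimate = tuple(a + w * (b - a) for a, b in zip(points[start], points[end]))
    return math.dist(points[idx], estimate)

def rtr_sliding_window(points, times, eps, max_speed):
    """Return characteristic timestamps found in a single left-to-right pass."""
    n = len(points)
    if n == 0:
        return []
    cts = [times[0]]          # the first timestamp anchors the first segment
    start, end = 0, 1         # candidate segment spans observations start..end
    while end < n:
        worst = max(
            (robust_tdist(points, times, start, end, i, max_speed)
             for i in range(start + 1, end)),
            default=0.0,      # a two-point segment has no interior observation
        )
        if worst > eps:
            cts.append(times[end - 1])   # close the segment before the offender
            start = end - 1
            end = start + 1
        else:
            end += 1                      # grow the window by one observation
    if cts[-1] != times[-1]:
        cts.append(times[-1])             # assumption: close the final segment
    return cts


corner = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2)]
print(rtr_sliding_window(corner, [0, 1, 2, 3, 4], eps=0.1, max_speed=10))
```

Because each decision only looks at the current window, the same loop could consume observations one at a time from a stream.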
The sliding window algorithm is preferred for those applications where data arrives in a continuous stream, owing to its online nature.

Figure 2.8 shows the pseudocode of our sliding window segmentation algorithm RTR-SlidingWindow. We first anchor the first timestamp as the first CT in line 2, which is the starting observation of the segment under consideration. Each of the subsequent timestamps is considered in order as a potential candidate for the next CT. To this end, we compute the sliding-window measure of each observation within the current segment $S_{startidx,curridx}$ using our distance function RobustTDist() in lines 6-8, while assuming a linear movement at a constant speed within this segment. If any of the measures is larger than the given distance threshold $\varepsilon$, i.e., the segment under consideration is not spatially and temporally homogeneous, we add the immediately preceding timestamp into the CT list to separate the segment from the rest and reset the variables as necessary (lines 9-13). Otherwise, we increase the length of the segment under consideration (line 14) and proceed with the next observation. The same procedure (lines 4-16) is repeated until the sequence ends.

The time complexity of the sliding window algorithm is $O(kn)$, where $k$ is the number of inner iterations required to update the sliding-window measure of each observation in the increased segment (i.e., the for-loop in lines 6-8 of Figure 2.8) and $n$ is the trajectory size.

2.4 Experimental Evaluation

In this section, we evaluate the effectiveness of our robust time-referenced trajectory segmentation algorithms, RTR-TD, RTR-BU, and RTR-SW, on three real-world moving object trajectory datasets.
The performance of our segmentation algorithms is compared to that of conventional segmentation techniques previously applied to the route of a moving object without the time dimension: the well-known Douglas-Peucker (DP) algorithm [17, 26, 12] and the one-pass approximation algorithm based on the Minimum Description Length (MDL) principle [39]. While DP uses the perpendicular distance in the segmentation criterion, MDL employs the angular distance in addition to the perpendicular distance in its MDL cost functions. One might argue that these algorithms can simply be applied to the trajectories to obtain spatially and temporally similar segments. Therefore, the Douglas-Peucker segmentation on route (DP-R), DP on trajectory (DP-T), the MDL-based segmentation on route (MDL-R), and MDL on trajectory (MDL-T) are all employed in our experiments. All the algorithms for the experiments were implemented in MATLAB and all the experiments were run on a machine with a Pentium IV 3.2 GHz CPU and 2 GB of RAM.

Table 2.1: Summary of three datasets used in the experiments
                                    Bus        Truck      BI-Rehab.
Dimensions of location              2          2          3
Dataset size (# trajectories)       108        273        660
Range of trajectory size            79~1095    29~992     65~693
# Total observations                66,096     112,203    118,664

2.4.1 Dataset

The bus dataset [12] tracks the fleet of school buses moving in an urban area during a single day in Patras, Greece. The bus locations were sampled approximately every 30 seconds. However, the GPS was switched off while the bus was stopping [12]. Therefore, the sampling rate varies even within a single bus trajectory. The truck dataset [20] traces the fleet of trucks moving in a regional area of Athens, Greece. Similar to the bus trajectories, an irregular sampling rate is observed.
The ball-intercepting (BI) dataset [51] has been acquired from the virtual reality enhanced motor rehabilitation system, developed at the University of Southern California Integrated Media Systems Center (USC-IMSC), to provide post-stroke individuals with a practice environment for the restoration of impaired arm and shoulder motor function. It traces the hand reaching movement while a post-stroke subject with mild motor impairment attempts to reach and intercept a virtual moving ball. The hand position in the three-dimensional virtual space was sampled at approximately 85 Hz with little variation within a single trajectory. Table 2.1 shows the summary information of the datasets used in the experiments.

2.4.2 Selection of Maximum Speed Value

Our segmentation algorithm requires the maximum speed $\upsilon_m$ as an input parameter in order to determine whether a spatio-temporal observation is indeed an outlier. In a majority of scenarios, the maximum speed value or its range can be provided by the domain experts or derived in the context of the target application. For example, the upper speed limit of a school bus can be derived from the driving areas (e.g., urban), local speed limits, or road conditions [48]. The hand movement speed of a post-stroke individual with hemiparesis depends heavily on the level of impairment and must not be faster than the control movement speed [51]. However, the domain knowledge is not always available and the experts are not always in agreement. Instead, we employ a heuristic based on the sorted movement speed values to determine a sensible value of $\upsilon_m$. We compute the movement speed between each pair of consecutive samples over all the trajectories in the dataset and then sort the speed values. By visual inspection, we choose an extreme value that is substantially larger than its immediately preceding one.
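The first part of this heuristic, collecting and sorting the speeds between consecutive samples, can be sketched in Python as follows. The helper names `consecutive_speeds` and `nearest_rank_percentile` are hypothetical, and the percentile helper is only an aid for reading a value off the sorted list; the choice of the actual threshold in the text is made by visual inspection, which this sketch does not automate.

```python
import math

def consecutive_speeds(trajectories):
    """Sorted speeds between consecutive samples over all trajectories.

    trajectories: iterable of (points, times) pairs, where points is a list of
    coordinate tuples and times a list of increasing timestamps.
    """
    speeds = []
    for points, times in trajectories:
        for j in range(len(points) - 1):
            dt = times[j + 1] - times[j]
            if dt > 0:  # assumption: skip duplicated timestamps
                speeds.append(math.dist(points[j], points[j + 1]) / dt)
    speeds.sort()
    return speeds

def nearest_rank_percentile(sorted_values, q):
    """Value at the q-th percentile (0 < q <= 100) using the nearest-rank rule."""
    if not sorted_values:
        raise ValueError("no speed values available")
    rank = max(1, math.ceil(q / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


toy = [([(0, 0), (1, 0), (2, 0), (12, 0)], [0, 1, 2, 3])]
speeds = consecutive_speeds(toy)
print(speeds, nearest_rank_percentile(speeds, 50))
```

Plotting the sorted list and looking for a sudden jump is then the manual step that yields the maximum speed parameter.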
In the bus and the truck datasets, we could observe such a value around 80 km/h, corresponding to the 99th percentile of all the speed values. This implies that there exist few outliers in the vehicle datasets. Similarly, we obtained 150 cm/s for the BI dataset, approximately corresponding to the 92nd percentile, which indicates that about 8% of the observations could be outliers. The same speed values are used for $\upsilon_m$ throughout the experiments.

2.4.3 Spatio-Temporal Homogeneity

In this section, we evaluate the spatio-temporal homogeneity present in the obtained segments. In order to measure the spatio-temporal homogeneity, we employ the distribution of the normalized movement speed per segment, based on which we define a new measure. Given a trajectory $S$ of size $n$, $S = \langle (p_1, t_1), (p_2, t_2), \ldots, (p_n, t_n) \rangle$, and a set of characteristic indices $C = \{c_1, c_2, \ldots, c_k\}$ ($1 \le c_1 < c_2 < \ldots < c_k \le n$), indexing the obtained characteristic timestamps or the characteristic points, our spatio-temporal homogeneity measure (STHM) is defined as follows:

$$\mathrm{STHM}(S, C) = \frac{1}{|C|} \sum_{i=1}^{|C|} \frac{1}{c_{i+1} - c_i} \sum_{j=c_i}^{c_{i+1}-1} \tilde{v}_j, \qquad (2.3)$$

where $\tilde{v}_j = \frac{v_j}{\max\{v_j \mid c_i \le j < c_{i+1}\}}$ and $v_j = \frac{dist(p_j, p_{j+1})}{t_{j+1} - t_j}$. As seen in the formula, the movement speed between two consecutive observations ($v_j$) is normalized by the maximum speed of the segment so as to lie in the range from 0 to 1, which gives $\tilde{v}_j$. The mean of the normalized speeds in each segment is then averaged over all segments of the trajectory.
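A small Python sketch may help make the measure concrete. It is an illustration under assumptions, not the evaluation code of the experiments: the function name `sthm` is hypothetical, indices are 0-based, the average is taken over the segments delimited by consecutive characteristic indices, and a segment in which the object never moves is treated as perfectly homogeneous.

```python
import math

def sthm(points, times, char_idx):
    """Spatio-temporal homogeneity of a trajectory split at the indices char_idx.

    points: list of coordinate tuples; times: increasing timestamps;
    char_idx: sorted 0-based characteristic indices (at least two of them).
    """
    segment_means = []
    for a, b in zip(char_idx, char_idx[1:]):
        # Speeds between consecutive observations inside the segment [a, b].
        speeds = [
            math.dist(points[j], points[j + 1]) / (times[j + 1] - times[j])
            for j in range(a, b)
        ]
        vmax = max(speeds)
        if vmax == 0:
            segment_means.append(1.0)  # assumption: no movement means no variation
        else:
            segment_means.append(sum(v / vmax for v in speeds) / len(speeds))
    return sum(segment_means) / len(segment_means)


# Speed doubles halfway through: splitting at index 2 gives two uniform segments.
points = [(0,), (1,), (2,), (4,), (6,)]
times = [0, 1, 2, 3, 4]
print(sthm(points, times, [0, 2, 4]), sthm(points, times, [0, 4]))
```

In the example, splitting where the speed changes yields a higher score than keeping the whole trajectory as one segment.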
This measure attains a value closer to 1 when the movement speed varies less per segment, which can be interpreted as the trajectory being partitioned into more spatio-temporally homogeneous segments.

[Figure 2.9: Comparison of spatio-temporal homogeneity among the three proposed trajectory segmentation algorithms on three datasets. Panels: (a) Bus dataset, (b) Truck dataset, (c) BI-Rehab. dataset. X axis: reduction ratio; Y axis: STHM. Curves: RTR-TopDown, RTR-BottomUp, RTR-SlidingWindow. Eps values: 0, 10, 20, 40, 60, 80, 100, 150, 200 in (a) and (b); 0, .01, .05, .1, .2, .3, .4, .5, .6, .7, .8, .9, 1 in (c).]

Figure 2.9 shows the comparative results of our three segmentation methods on three datasets in terms of the spatio-temporal homogeneity. The X axis represents the reduction ratio, i.e., the ratio of the total number of non-characteristic observations discarded upon the segmentation to the total number of observations in the dataset, and the Y axis is the spatio-temporal homogeneity measure. Throughout all plots, we specify the parameter values actually used for the experiment in the legend box. For example, in Figure 2.9(a), the same set of values $\{0, 10, 20, 40, 60, 80, 100, 150, 200\}$ is used for $\varepsilon$ in all three algorithms. Each marker, denoted as a dot, an asterisk, or a cross, along the performance curve represents the performance obtained by using the corresponding parameter value in the same order from left to right. Recall that the maximum speed parameter $\upsilon_m$ is already determined and specified in Section 2.4.2.
In all three datasets, the spatio-temporal homogeneity retained by the RTR-BU and RTR-TD segmentations is more or less the same, and is slightly yet consistently better than that of RTR-SW. In practice, the sliding window approach has been demonstrated to perform poorly compared to the top-down and bottom-up methods [33]. The same observation is made for our segmentation algorithms. However, the relative difference appears to be marginal when the reduction ratio is less than about 0.4.

Figure 2.10 presents the comparison of our RTR-BU to the conventional approaches in terms of the spatio-temporal homogeneity. RTR-BU is selected because it consistently shows the best results among our approaches. As illustrated, the four conventional methods never beat RTR-BU at any reduction ratio in any of the three datasets. This observation shows that RTR-BU identifies the characteristic observations where the spatio-temporal structure significantly changes and that the partitioned segments are more spatio-temporally homogeneous. Another interesting observation is that the naive extensions of the two conventional techniques to the trajectories, DP-T and MDL-T, only attain spatio-temporal homogeneity similar to that of DP-R and MDL-R (on the routes), respectively.
This observation ascertains that, semantically, the temporal dimension is indeed different from the spatial dimension, and hence should be incorporated in its own context as we do in our time-referenced distance function.

[Figure 2.10: Comparison to conventional trajectory segmentation approaches in terms of spatio-temporal homogeneity on three datasets. Panels: (a) Bus dataset, (b) Truck dataset, (c) BI-Rehab. dataset. X axis: reduction ratio; Y axis: STHM. Curves: RTR-BU, DP-R, DP-T, MDL-R, MDL-T. In (a) and (b), Eps: 0, 10, 20, 40, 60, 80, 100, 150, 200 for RTR-BU and 0, 5, 10, 20, 40, 60, 80, 100, 150, 200 for DP-R and DP-T; MDLc: 0, 0.5, 1.0, 3, 5, 6.5, 8, 10 for MDL-R and MDL-T. In (c), Eps: 0, .01, .05, .1, .2, .3, .4, .5, .6, .7, .8, .9, 1 for RTR-BU, DP-R, and DP-T; MDLc: 0, .01, .05, .1, .5, 1, 2, 3, 4, 5 for MDL-R and MDL-T.]

Recall that about 8% of the total observations in the BI dataset are considered as outliers by using 150 cm/s, approximately corresponding to the 92nd percentile, for the maximum speed parameter $\upsilon_m$, whereas the vehicle datasets have few outliers. The experiment results in Figure 2.9(c) and Figure 2.10(c) show that our RTR algorithms partition trajectories into more temporally homogeneous segments in the presence of outliers.
2.4.4 Spatial Deviation

As an attempt to approximate the true spatial discrepancy between a trajectory and a set of its segments, we aggregate the perpendicular distances as well as the time-referenced distances on every observation as follows:

$$\mathrm{PDError}(S, C) = \sum_{i=1}^{|C|} \sum_{j=c_i}^{c_{i+1}} pdist(p_j, \overrightarrow{p_{c_i} p_{c_{i+1}}}), \qquad (2.4)$$

$$\mathrm{TDError}(S, C) = \sum_{i=1}^{|C|} \sum_{j=c_i}^{c_{i+1}} \text{T-dist}(p_j, \hat{p}_j \mid s_{c_i c_{i+1}}). \qquad (2.5)$$

A larger value of these spatial error measures indicates a larger spatial deviation between a trajectory and its sequence of segments.

Figure 2.11 shows the spatial deviation between trajectories and their segments measured by the perpendicular distance as in Equation 2.4. Note that both PDErr and TDErr are approximations of the true spatial discrepancy between a trajectory and its segments, and the purpose of these measures is to infer the spatial homogeneity or deviation retained upon the segmentation. As expected, the DP family, which explicitly aims to segment trajectories while minimizing the perpendicular distances, shows lower PDErr than the other methods. The MDL algorithms, which also employ the perpendicular distances, lie in between the DP and RTR groups. Our RTR algorithms yield similar or slightly larger PDErr when the reduction ratio is no larger than 0.4 in the vehicle datasets. Interestingly, RTR-BU shows comparable results to the DP algorithms at all reduction ratios in the BI-Rehab dataset.

[Figure 2.11: Comparison of perpendicular distance error on three datasets. Panels: (a) Bus dataset, (b) Truck dataset, (c) BI-Rehab. dataset. X axis: reduction ratio; Y axis: PDErr. Curves: RTR-TopDown, RTR-BottomUp, RTR-SlidingWindow, DP-R, DP-T, MDL-R, MDL-T.]

[Figure 2.12: Time-referenced distance error on bus dataset. X axis: reduction ratio; Y axis: TDErr. Curves: RTR-TopDown, RTR-BottomUp, RTR-SlidingWindow, DP-R, DP-T, MDL-R, MDL-T.]

Figure 2.12 presents the result on the spatial difference measured by the time-referenced distance as in Equation 2.5 on the bus dataset. As expected, the TDErr produced by our RTR family is smaller than that of the other approaches. The same results are consistently observed in the other two datasets. In summary, the results of the two spatial discrepancy measures indicate that our RTR segmentation algorithms maintain spatial homogeneity comparable to that of the conventional techniques, which focus only on the spatial component present in the route of a moving object.

2.4.5 Processing Time

In order to demonstrate the efficiency of our trajectory segmentation algorithms, we compare the processing time of our trajectory segmentation methods to that of the conventional approaches. Figure 2.13 shows the processing time of the seven trajectory segmentation methods on the bus dataset, where the X axis represents the reduction ratio and the Y axis is the running time in seconds. The processing time includes the total CPU time taken to segment all trajectories in the set.

[Figure 2.13: Comparison of processing time in seconds on bus dataset. X axis: reduction ratio; Y axis: processing time (s). Curves: RTR-TopDown, RTR-BottomUp, RTR-SlidingWindow, DP-R, DP-T, MDL-R, MDL-T.]

As expected, the sliding window approach (RTR-SlidingWindow), which linearly scans the sequence, takes the least time (less than 10 seconds in all settings). In addition, the processing time is almost constant as the reduction ratio increases, which implies that the number of inner iterations required to update the maximum deviation in the increased sliding window remains more or less the same.
The next best processing time is achieved by our bottom-up algorithm (RTR-BottomUp), whose processing time increases linearly with the reduction ratio because more merge operations are performed. All three algorithms based on the general top-down approach (RTR-TopDown, DP-R, and DP-T) take similar processing time. Their processing time decreases as fewer observations are retained, resulting in a higher reduction ratio, which is expected because fewer split operations are required. The minimum description length (MDL) based segmentation methods (MDL-R and MDL-T) take at least a factor of 5 more time than our RTR-SlidingWindow. The relative processing time among the seven trajectory segmentation techniques is observed to be more or less the same on the other two datasets.

2.5 Related Work

We consider the problem of segmenting a sequence of time-referenced locations into contiguous homogeneous segments by identifying a minimum subset of observations such that the segmentation error does not exceed a given maximum tolerance. The general problem of approximating a sequence of data by a coarser one has been extensively studied in various applications such as pattern recognition [58], cartography [17, 42], gene/DNA analysis [8], and time series analysis [33], under different names such as piecewise linear (or polygonal) approximation of digitized curves, vectorization, line/contour simplification or generalization, and dimension reduction. The main purpose of segmentation is to provide a compact representation of data and noise filtering for further analysis [46]. Many segmentation algorithms have been proposed, ranging from online to offline, from optimal to faster heuristic, and from combinatorial to probabilistic. Our algorithms fall in the category of heuristic approaches.

Segmentations previously applied to moving object trajectories are mainly motivated by line simplification/generalization in the domain of geo-spatial data processing. Cao et al.
[12, 10] have employed the Douglas-Peucker (DP) algorithm [17] for the trajectory segmentation, which was originally proposed to simplify 2D polylines such as coastlines in maps. It starts with the unsegmented sequence of sampled locations (i.e., route), introduces one characteristic point at a time with the largest perpendicular distance to the straight line connecting the first and the last samples of the current sequence, splits the sequence into two subsequences, and repeats splitting until no point deviates from the straight line by more than a pre-specified tolerance. Lee et al. [39] have adopted the Minimum Description Length (MDL) principle to guide the segmentation and proposed a one-pass approximate algorithm. It also takes the route of a moving object as an input, fixes the first sampled location as the first characteristic point, and places the next characteristic point as far away as possible. The next characteristic point is determined as the point where the MDL cost to approximate the current sequence with a straight line becomes larger than the MDL cost to keep the original sequence as it is. Unlike our trajectory segmentation, these previous approaches have taken into account only the geo-spatial structures and relationships present in the sequence of spatial locations to simplify a moving object trajectory.

Many incorrect and noisy measurements are observed in moving object trajectories, which is typical when the data is acquired by sensors. Lin et al. [43] have proposed a symbolic representation of time series that is robust against outliers. Specifically, they have partitioned a sequence of observations into a fixed number of equal-sized segments, approximated each segment with the mean value of the segment, and then represented each approximated segment with a symbol.
Although the final symbolic representation of time series is robust to outliers in the distance computation of time series, the intermediate representation of time series in the form of segments approximated by a constant model is still vulnerable to outliers since the outliers are included in the mean value computation of each segment. On the other hand, we deal with the outliers in the segmentation process so that the detected outliers are excluded from the segment approximation. Therefore, the segmented trajectories modeled by the selected characteristic points are robust to outliers. Pfoser et al. [48] adopted the maximum movement speed to represent the uncertainty about the position of the object in between the measurements, which is exploited for the outlier detection in our trajectory segmentation algorithms.

More recently, outlier analysis has been conducted over moving object trajectory data. Knorr et al. [34] have represented each moving object trajectory with a set of key features (e.g., starting/ending locations, average/minimum/maximum velocities, etc.) and detected outlying moving objects based on the weighted distance on these key features. Lee et al. [37] have proposed a partition-and-detect framework for trajectory outlier detection in order to detect outlying sub-trajectories from a trajectory database. All of these works have focused on detecting outlying (sub-)trajectories, while we are interested in finding outlying observations that deviate from their neighborhood values in order to filter out noisy measurements for the subsequent dimension reduction process.

2.6 Chapter Summary

In this chapter, we proposed a family of three robust time-referenced trajectory segmentation algorithms that take into account both spatial and temporal structures present in moving object trajectories so as to transform raw sensor data into the trajectory data approximated at desired resolution and quality.
The proposed segmentation methods employed our time-referenced distance function to partition a given moving object trajectory into a small number of spatially and temporally homogeneous segments. In addition, we utilized the maximum movement speed to detect time-referenced spatial outliers in order to filter out erroneous measurements from the segmentation process. Our experiments on three real-world datasets demonstrated that our techniques outperform the conventional techniques, such as the Douglas-Peucker (DP) algorithm and the one-pass approximation algorithm based on the Minimum Description Length (MDL) principle as well as their simple temporal extensions, in terms of spatio-temporal homogeneity, while maintaining comparable spatial homogeneity. We intend to employ our segmentation algorithms as a preprocessing step for various real-world data mining and knowledge discovery applications, such as discovering non-trivial movement patterns from a set of moving object trajectories.

Chapter 3
Discovery of Valid Convoy Patterns from Trajectories

3.1 Introduction

The size of real-world trajectory data is often large, and it grows quickly as more moving objects are involved and continuously traced by location-aware sensors over time. A large set of archived trajectory data that is currently available only in the form of bulky point clouds of sampled locations and the associated timestamps thus needs to be summarized at the right level of abstraction to obtain insights into the mobilities and behaviors of moving objects and to support high-level decision making and diagnosis. A movement pattern is a local abstraction or structure, conveying useful knowledge about the mobilities of objects both in space and in time and showing how just a few moving objects behaved and traversed over time.
Examples include a sequence of spatial locations frequently visited by moving objects [12], a group of moving objects that converge and simultaneously pass through a certain spatial region [36], and more. Among various movement patterns defined and discovered over moving object trajectories, we focus on finding a group of objects that moved together for a certain duration of time, in order to abstract the group mobilities common to some moving objects at a desired form and level. In this chapter, we define a specific type of movement pattern called a valid convoy and propose two algorithms, named VCoDA (Valid Convoy Discovery Algorithm) and EVCoDA (Efficient Valid Convoy Discovery Algorithm), for the mining of valid convoys from a set of raw or preprocessed moving object trajectories.

This chapter is organized as follows. In Section 3.2, we introduce some basic definitions and define the problem considered in this chapter. Section 3.3 proposes our solutions for the valid convoy discovery. Section 3.4 presents the results of experimental evaluation. The related work is discussed in Section 3.5, followed by the conclusion in Section 3.6.

3.2 Preliminaries

The moving object trajectory is a finite sequence of sampled locations during a closed time interval $[t_1, t_n]$ and is defined as a sequence of pairs, $\langle (p_1, t_1), (p_2, t_2), \ldots, (p_n, t_n) \rangle$, where $p_i \in \Re^d$ ($d \in \{2, 3\}$) is a two- or three-dimensional vector (see footnote 1) representing the geo-spatial position sampled at a timestamp $t_i \in \Im^{+}$. Figure 3.1 shows the trajectories of five moving objects as solid directed polylines in the spatio-temporal space formed by the spatial plane of the X and Y axes and the time dimension of the Z axis. For simplicity, we use the notation $o(t_i)$ to denote the snapshot position $p_i$ of a moving object $o$ sampled at a timestamp $t_i$.

We now adopt the notions of density-based clustering originally proposed for the algorithm DBSCAN [18] specifically for the snapshot locations of moving objects.
Given a set $P = \{o_1(t), \ldots, o_N(t)\}$ of snapshot locations of $N$ moving objects at a timestamp $t$ and a distance threshold $\varepsilon$, the $\varepsilon$-neighborhood $N_\varepsilon(o_p(t))$ of a location $o_p(t) \in P$ is defined as $N_\varepsilon(o_p(t)) = \{o_q(t) \in P \mid D(o_p(t), o_q(t)) \le \varepsilon\}$, where $D(\cdot)$ is the Euclidean distance. A snapshot location $o_p(t)$ is directly density-reachable from a location $o_q(t)$ with respect to a given distance threshold $\varepsilon$ and an integer $m$ if $o_p(t) \in N_\varepsilon(o_q(t))$ and $|N_\varepsilon(o_q(t))| \ge m$. A location $o_p(t)$ is density-reachable from a location $o_q(t)$ with respect to a given distance threshold $\varepsilon$ and an integer $m$ if there is a chain of locations $o_1(t), \ldots, o_n(t) \in P$ such that $o_1(t) = o_q(t)$, $o_n(t) = o_p(t)$, and $o_{i+1}(t)$ is directly density-reachable from $o_i(t)$ with respect to the distance threshold $\varepsilon$ and the integer $m$ for $1 \le i < n$. A location $o_p(t) \in P$ is density-connected to a location $o_q(t) \in P$ with respect to a given distance threshold $\varepsilon$ and an integer $m$ if there is a location $o_r(t) \in P$ such that both $o_p(t)$ and $o_q(t)$ are density-reachable from $o_r(t)$ with respect to $\varepsilon$ and $m$. For example, in Figure 3.1, the snapshot location $o_1(t_1)$ of an object $o_1$ at a timestamp $t_1$ is not directly density-reachable from the snapshot location $o_2(t_1)$ of an object $o_2$, but it is density-reachable due to a chain of locations $o_2(t_1), o_5(t_1), o_4(t_1), o_1(t_1)$.

Footnote 1: For the purpose of proper visualization, we assume two-dimensional location vectors throughout this chapter.

[Figure 3.1: Examples of moving object trajectories and their density-based snapshot clusters at each timestamp. Labels: objects $o_1$ to $o_5$; timestamps $t_1$ to $t_4$ along the Time axis; spatial axes X and Y; snapshot clusters $c_{t_1}$ to $c_{t_4}$; $\varepsilon$-neighborhood of $o_2(t_1)$.]

Definition 6. Given a set of moving objects $O$, a distance threshold $\varepsilon$, and an integer $m$, a snapshot cluster $c_t$ at a timestamp $t$ is a non-empty subset of objects $O' \subseteq O$ satisfying the following conditions:
1) Connectivity: $\forall o_p, o_q \in O'$, a location $o_p(t)$ is density-connected to a location $o_q(t)$ with respect to $\varepsilon$ and $m$.
2) (Spatial) maximality: $\forall o_p, o_q \in O'$, if $o_q \in O'$ and a location $o_p(t)$ is density-reachable from a location $o_q(t)$ with respect to $\varepsilon$ and $m$, then also $o_p \in O'$.
3) Sufficient objects: $|O'| \ge m$.

A snapshot cluster is a group of density-connected objects with arbitrary shape and size yet constrained to a single timestamp. Figure 3.1 shows four snapshot clusters discovered at each timestamp with the parameter $m = 3$. Such snapshot clusters are spatially maximal such that no two snapshot clusters at a timestamp can overlap in their objects. We extend this notion of density-based spatial clusters to that of density-based spatio-temporal clusters for moving objects.

Definition 7. Given a set of moving objects $O$, a distance threshold $\varepsilon$, an integer $m$, and a lifetime integer $k$, a valid convoy is a non-empty subset of objects $O' \subseteq O$ during a time interval $[t_a, t_b]$, satisfying the following conditions:
1) Connectivity: $\forall o_p, o_q \in O'$, a snapshot location $o_p(t)$ is density-connected to a snapshot location $o_q(t)$ with respect to $\varepsilon$ and $m$, which holds for every timestamp $t$, $t_a \le t \le t_b$.
2) Spatial maximality: $\forall o_p, o_q \in O'$, if $o_q \in O'$ and $o_p(t)$ is density-reachable from $o_q(t)$ for all $t_a \le t \le t_b$, then $o_p \in O'$.
3) Temporal maximality: $\neg \forall o_p, o_q \in O'$, $o_p(t_a - 1)$ is density-connected to $o_q(t_a - 1)$ at $t_a - 1$ ($t_a > 1$), and $\neg \forall o_r, o_s \in O'$, $o_r(t_b + 1)$ is density-connected to $o_s(t_b + 1)$ at $t_b + 1$ ($t_b < n$).
4) Sufficient objects: $|O'| \ge m$.
5) Sufficient lifetime: $(t_b - t_a + 1) \ge k$.

A valid convoy is therefore a group of density-connected objects with arbitrary shape and extent during a sufficiently long consecutive time interval. Such valid convoys are spatially and temporally maximal such that no two valid convoys with the same time interval can overlap in their objects and no two valid convoys with the same set of moving objects can overlap in time.
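The snapshot clusters of Definition 6 can be computed with a DBSCAN-style expansion over the positions at a single timestamp. The following Python sketch is an illustration under assumptions: the function name `snapshot_clusters` and the dictionary input format are hypothetical, neighborhoods are found by brute force, and, as in standard DBSCAN, a border object reachable from two clusters is assigned to the one that reaches it first.

```python
import math

def snapshot_clusters(positions, eps, m):
    """Density-based clusters of the snapshot positions at one timestamp.

    positions: dict mapping an object id to its coordinate tuple.
    Returns a list of sets of object ids, each with at least m members.
    """
    ids = list(positions)

    def neighbourhood(o):
        # eps-neighborhood of o, including o itself.
        return [q for q in ids if math.dist(positions[o], positions[q]) <= eps]

    assigned = set()
    clusters = []
    for o in ids:
        if o in assigned or len(neighbourhood(o)) < m:
            continue  # already clustered, or not dense enough to seed a cluster
        cluster, frontier = {o}, [o]
        assigned.add(o)
        while frontier:
            cur = frontier.pop()
            nbrs = neighbourhood(cur)
            if len(nbrs) < m:
                continue  # border object: a member, but it does not expand the cluster
            for q in nbrs:
                if q not in assigned:  # directly density-reachable from cur
                    assigned.add(q)
                    cluster.add(q)
                    frontier.append(q)
        clusters.append(cluster)
    return clusters


# A chain o1 - o4 - o5 - o2 with unit spacing, and o3 far away.
snapshot = {"o1": (0, 0), "o2": (3, 0), "o3": (10, 0), "o4": (1, 0), "o5": (2, 0)}
print(snapshot_clusters(snapshot, eps=1.0, m=3))
```

In the example, the two ends of the chain are not dense themselves but are reached through the dense objects in the middle, so all four chained objects form one snapshot cluster and the distant object is left out.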
We use the notation $v = \langle \{o_1, \ldots, o_n\}, [t_a, t_b] \rangle$ to denote a valid convoy $v$ consisting of moving objects $o_1, \ldots, o_n$ that are thoroughly density-connected during a consecutive time interval $[t_a, t_b]$. A valid convoy $\langle \{o_1, o_2, o_3\}, [t_2, t_4] \rangle$ can be found over the five trajectories in Figure 3.1, with the parameters $m = 3$ and $k = 3$. A snapshot cluster $c_t$ at a timestamp $t$, consisting of moving objects $o_1, \ldots, o_n$, can be represented by the same notation as $c_t = \langle \{o_1, \ldots, o_n\}, [t, t] \rangle$ since it can be considered as a valid convoy of lifetime 1.

Definition 8. Given a set of moving objects $O$, a distance threshold $\varepsilon$, an integer $m$, a lifetime integer $k$, and a sequence of snapshot clusters $c_{t_a}, c_{t_a+1}, \ldots, c_{t_b}$ during a consecutive time interval $[t_a, t_b]$, a partially density-connected convoy (or simply partially connected convoy) is defined as a non-empty subset of objects $O' \subseteq O$ during a time interval $[t_a, t_b]$, satisfying the following conditions:
1) Spatial maximality: $O' = c_{t_a} \cap c_{t_a+1} \cap \ldots \cap c_{t_b}$.
2) Temporal maximality: $\nexists c_{t_a-1}, c_{t_b+1}$, $O' \subseteq c_{t_a-1}$ and $O' \subseteq c_{t_b+1}$.
3) Sufficient objects: $|O'| \ge m$.
4) Sufficient lifetime: $(t_b - t_a + 1) \ge k$.

In other words, a partially connected convoy is a group of objects that traverse a sequence of dense regions during a consecutive time interval of sufficient lifetime. Although the objects are within a dense area at each timestamp, they are not necessarily density-connected by themselves, unlike valid convoys. For example, a partially connected convoy $\langle \{o_1, o_2, o_3\}, [t_1, t_4] \rangle$ is found over the five moving object trajectories in Figure 3.1, with the parameters $m = 3$ and $k = 3$. As can be seen, the three objects $o_1$, $o_2$, and $o_3$ are not density-connected by themselves at the timestamp $t_1$ without the objects $o_4$ and $o_5$ but are lying in a dense region corresponding to the snapshot cluster $c_{t_1}$; this partially connected convoy is therefore not a valid convoy due to the unsatisfied connectivity constraint.

Definition 9.
Given two valid (or partially connected) convoys $v = \langle O, [t_a, t_b] \rangle$ and $v' = \langle O', [t_a', t_b'] \rangle$, $v$ is a sub-convoy of $v'$, denoted as $v$ = sub-convoy($v'$), if either one of the following conditions is satisfied exclusively: 1) $O = O'$ and $t_a' \le t_a \le t_b \le t_b'$, or 2) $O \subseteq O'$ and $t_a' = t_a \le t_b = t_b'$.

Given a set of moving objects $O$, a distance threshold $\varepsilon$, an integer $m$, and an integer $k$, the convoy discovery problem is to mine all valid convoys from $O$, each consisting of at least $m$ moving objects that are density-connected with respect to the density constraints $m$ and $\varepsilon$ during at least $k$ consecutive timestamps.

3.3 Discovery of Valid Convoys

In this section, we introduce our solutions for discovering valid convoys. Both VCoDA and EVCoDA consist of two phases. First, the set of all partially connected convoys is discovered from a given set of moving objects. Then, the density-connectivity of these partially connected convoys is validated to finally obtain all valid convoys. The straightforward solution VCoDA and the efficient alternative EVCoDA are presented in Sections 3.3.1 and 3.3.2, respectively.

Table 3.1: Four operations to update the set $V$ of current partially connected convoy candidates

| Operation | Conditions |
| --- | --- |
| Insert($c$) | (1) $c$ is not matched: $\nexists v \in V$, $\lvert v \cap c \rvert \ge m$, or (2) $c$ is matched but not absorbed: $\exists v \in V$, $\lvert v \cap c \rvert \ge m \wedge c \setminus (v \cap c) \ne \emptyset \wedge \nexists v' \in V \setminus \{v\}$, $v' = c$ |
| Extend($v$, $c$) | $v$ and $c$ share enough objects: $\lvert v \cap c \rvert \ge m$ |
| Delete($v$) | $v$ is a sub-convoy: $\exists v' \in V$, $v$ = sub-convoy($v'$) (by Definition 9) |
| Return($v$) | (1) $v$ is not extended: $\nexists c \in C$, $\lvert v \cap c \rvert \ge m \wedge v$.lifetime $\ge k$, or (2) $v$ is extended but not absorbed: $\exists c \in C$, $\lvert v \cap c \rvert \ge m \wedge v$.lifetime $\ge k \wedge v \setminus (v \cap c) \ne \emptyset \wedge \nexists c' \in C \setminus \{c\}$, $v = c'$ |

3.3.1 VCoDA: Straightforward Solution

3.3.1.1 Discovery of Partially Connected Convoys

Our approach to finding partially connected convoys extends the well-known moving cluster algorithm [32], as do other convoy discovery algorithms [30, 31].
First, we perform density-based clustering with DBSCAN [18] on the snapshot locations of the moving objects at each timestamp to identify the snapshot clusters defined in Definition 6. Starting with the set $C$ of such snapshot clusters at the first timestamp, we incrementally maintain a set $V$ of current partially connected convoy candidates while scanning through the timestamps. The set $V$ is updated at each timestamp by the four operations defined in Table 3.1: 1) insertion of a snapshot cluster $c \in C$ as a new partially connected convoy candidate, 2) extension of an existing partially connected convoy candidate $v \in V$ by a matching snapshot cluster $c \in C$, 3) deletion of a sub-convoy $v \in V$, and 4) return of a maximally extended partially connected convoy candidate $v \in V$ as an actual partially connected convoy.

Figure 3.2(b) shows all possible updates of $V$ over the six objects in Figure 3.2(a) during the time interval $[t_1, t_4]$, with the parameters $m=3$ and $k=2$. At time $t_1$, the cluster $c_1$ is inserted into $V$ as a new partially connected convoy candidate $v_1$ by the first condition of Insert() in Table 3.1, since $V$ is initialized as an empty set. At time $t_2$, the current partially connected convoy candidate $v_1$ is compared with the snapshot cluster $c_2$, resulting in an extended convoy candidate consisting of their intersection $o_1$, $o_2$, and $o_3$.
Although matched to the convoy candidate $v_1$, the snapshot cluster $c_2$ should also be inserted into $V$ as a new partially connected convoy candidate, since it is not fully absorbed by any of the candidates in $V$ (see the second condition of Insert() in Table 3.1). At time $t_3$, both $v_1$ and $v_2$ are extended by the cluster $c_3$. At time $t_4$, the convoy candidate $v_1$ is extended by the matching cluster $c_4^a$. The convoy candidate $v_2$ is also extended by $c_4^a$, which however leads to a sub-convoy of the extended convoy candidate $v_1$. Therefore, this extended partially connected convoy candidate is deleted from the set $V$. In addition, $v_2$ must be returned as a partially connected convoy satisfying the constraints $m=3$ and $k=2$ at the current timestamp $t_4$, since it is not absorbed by any of the snapshot clusters in $C$. Finally, the snapshot cluster $c_4^b$ is inserted into $V$ as a new partially connected convoy candidate, since it is not matched to any of the convoy candidates.

[Figure 3.2: Discovery process of partially connected convoys. (a) Snapshot clusters. (b) Updates of the current partially connected convoy candidates $V$.]

Require: Moving objects $O$, distance threshold $\varepsilon$, integer $m$, and integer $k$
1: $V \leftarrow \emptyset$; // set of current partially connected convoys
2: for each timestamp $t$ do
3:   $V_{next} \leftarrow \emptyset$; // next set of partially connected convoys
4:   $C \leftarrow$ DBSCAN($P_t(O)$, $m$, $\varepsilon$);
5:   for each cluster $c \in C$ do // initialize snapshot clusters
6:     $c$.matched $\leftarrow$ false; $c$.absorbed $\leftarrow$ false; $c$.lifetime $\leftarrow [t, t]$;
7:   for each current convoy candidate $v \in V$ do
8:     $v$.extended $\leftarrow$ false; $v$.absorbed $\leftarrow$ false;
9:     for each cluster $c \in C$ do
10:       if $|v \cap c| \ge m$ then // Extend($v$, $c$)
11:         $v$.extended $\leftarrow$ true; $c$.matched $\leftarrow$ true;
12:         $v_{ext} \leftarrow \langle v \cap c, [v.\text{lifetime}_{start}, t] \rangle$;
13:         $V_{next} \leftarrow$ updateVnext($V_{next}$, $v_{ext}$);
14:         if $v \subset c$ then $v$.absorbed $\leftarrow$ true;
15:         if $c \subset v$ then $c$.absorbed $\leftarrow$ true;
16:     if not $v$.extended or not $v$.absorbed then
17:       if $v$.lifetime $\ge k$ then $V_{pcc} \leftarrow V_{pcc} \cup \{v\}$; // Return($v$)
18:   for each cluster $c \in C$ do
19:     if not $c$.matched or not $c$.absorbed then
20:       $V_{next} \leftarrow$ updateVnext($V_{next}$, $c$); // Insert($c$)
21:   $V \leftarrow V_{next}$;
22: return $V_{pcc}$;
Figure 3.3: Partially connected convoy discovery algorithm PCCD(moving objects $O$, $\varepsilon$, $m$, $k$)

Figure 3.3 shows the pseudocode of our straightforward partially connected convoy discovery algorithm (PCCD). As input, it takes the trajectories of moving objects $O$, a distance threshold $\varepsilon$, the minimum number of objects $m$, and the minimum lifetime $k$. At each timestamp, the snapshot clusters $C$ satisfying the density constraints ($\varepsilon$ and $m$) are first obtained by DBSCAN in line 4 and compared with the current partially connected convoy candidates in $V$ (lines 7 and 9). If a current convoy candidate $v \in V$ shares at least $m$ objects with a cluster $c \in C$ (line 10), then $v$ is extended with $c$ into $v_{ext}$ as in line 12. We then call the function updateVnext() to update the set of next partially connected convoy candidates $V_{next}$ with this $v_{ext}$, to be used in the next iteration (line 13). Then, $v$ and $c$ are compared again in lines 14-15 to check whether one is absorbed by the other.
The current convoy candidates in $V$ that are neither extended nor absorbed by any of the clusters in $C$ and have a lifetime of at least $k$ are returned to the set $V_{pcc}$ of actual partially connected convoys in lines 16-17, which is the output of our PCCD algorithm. Similarly, the current snapshot clusters in $C$ that are neither matched nor absorbed by any of the current partially connected convoy candidates in $V$ are inserted into the next partially connected convoy candidates $V_{next}$ in lines 18-20, to be used in the next iteration.

Require: Set of next partially connected convoy candidates $V_{next}$, new extended convoy candidate $v_{new}$
1: if $\exists v \in V_{next}$ s.t. $v = v_{new}$ then
2:   if $v.\text{lifetime}_{start} > v_{new}.\text{lifetime}_{start}$ then // $v$ is a sub-convoy
3:     $V_{next} \leftarrow V_{next} \setminus \{v\}$; // Delete($v$)
4:     $V_{next} \leftarrow V_{next} \cup \{v_{new}\}$;
5: else
6:   $V_{next} \leftarrow V_{next} \cup \{v_{new}\}$;
7: return $V_{next}$;
Figure 3.4: Update function updateVnext($V_{next}$, $v_{new}$) for the set of next partially connected convoy candidates

Figure 3.4 shows the pseudocode of the updateVnext function, where the set $V_{next}$ of next partially connected convoy candidates is updated with a new extended convoy candidate $v_{new}$. If there already exists a convoy candidate $v$ in $V_{next}$ such that $v$ is a sub-convoy of $v_{new}$ by Definition 9 (lines 1-2), then $v$ is deleted and replaced by $v_{new}$ in order to maintain a set of maximal convoy candidates in $V_{next}$, as in lines 3-4. Otherwise, the new convoy candidate $v_{new}$ can be safely added to the set $V_{next}$ to be used in the next iteration.

Complexity Analysis. The cost of the updateVnext function is $O(|V_{next}|)$ due to the if-statement in line 1, which can be a potentially expensive operation because it is called $|V| \cdot |C|$ times in the worst case in the main PCCD algorithm, within the two loops in lines 7 and 9 of Figure 3.3.
To implement the updateVnext procedure efficiently, we employ a hash table for $V_{next}$ with the objects as the key, so as to find any sub-convoy in $V_{next}$ in constant time on average and to maintain a set of partially connected convoy candidates that are both spatially and temporally maximal. The time complexity of the PCCD algorithm is $O(T \cdot |O|^2 + T \cdot |V| \cdot |C|)$, where $T$ is the whole time span of $O$, and $|V|$ and $|C|$ are the maximum sizes of the sets $V$ and $C$, respectively, during $T$. The first term results from the DBSCAN algorithm, which is quadratic in the total number of moving objects and is called at each timestamp. The second term is due to the two loops in lines 7 and 9 of Figure 3.3, which compare each pair of current partially connected convoy candidates in $V$ and snapshot clusters in $C$ at each timestamp.

We assume that all snapshot positions $P_t$ of moving objects at each timestamp fit in memory. Therefore, the memory required by PCCD is $O(|P_t| + |V| + |C| + |V_{next}|)$ for two consecutive timestamps. Under this assumption, a hash-based main-memory version of DBSCAN [32, 11] can be employed for efficient clustering in line 4 of Figure 3.3.

3.3.1.2 Density-Connectivity Validation

In the previous section, we found all possible partially connected convoys from a given set of moving object trajectories. Now, we present the second phase of VCoDA, which takes as input each partially connected convoy and obtains a set of truly valid convoys that satisfy the density-connectivity among the group objects. We extend our PCCD algorithm to this end: scanning through each timestamp of a given partially connected convoy, if a current valid convoy candidate $v$ is extended by either a smaller or a larger snapshot cluster $c$, the density-connectivity of the resulting extended valid convoy candidate $v_{ext} = v \cap c$ is validated by re-clustering the objects in $v_{ext}$ at every timestamp of $v_{ext}$.
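The re-clustering test at a single timestamp can be sketched as follows. This is a minimal illustration under stated assumptions, not the dissertation's implementation: the function names `dbscan` and `is_density_connected`, the dictionary layout of snapshot positions, the brute-force neighborhood search, and the convention that a point counts toward its own $\varepsilon$-neighborhood are all choices made for the example.

```python
from math import hypot


def dbscan(points, eps, m):
    """Brute-force DBSCAN sketch (hypothetical helper).

    points: dict mapping object id -> (x, y) snapshot location.
    A point is a core point if its eps-neighborhood (itself included,
    an assumption of this sketch) holds at least m points.
    Returns a list of clusters, each a set of object ids.
    """
    ids = list(points)
    nbrs = {
        i: [j for j in ids
            if hypot(points[i][0] - points[j][0],
                     points[i][1] - points[j][1]) <= eps]
        for i in ids
    }
    core = {i for i in ids if len(nbrs[i]) >= m}
    labeled, clusters = set(), []
    for seed in ids:
        if seed not in core or seed in labeled:
            continue
        cluster, stack = set(), [seed]
        while stack:
            p = stack.pop()
            if p in labeled:
                continue
            labeled.add(p)
            cluster.add(p)
            if p in core:  # only core points expand the cluster
                stack.extend(nbrs[p])
        clusters.append(cluster)
    return clusters


def is_density_connected(points, objects, eps, m):
    """True if `objects` form exactly one cluster by themselves."""
    sub = {o: points[o] for o in objects}
    clusters = dbscan(sub, eps, m)
    return len(clusters) == 1 and clusters[0] == set(objects)
```

With five collinear objects spaced one unit apart in the order $o_1, o_4, o_2, o_5, o_3$ and $\varepsilon = 1$, $m = 3$, all five form one snapshot cluster, while $\{o_1, o_2, o_3\}$ alone do not, which mirrors the situation that distinguishes a partially connected convoy from a valid one.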
Figure 3.5 shows the pseudocode of our straightforward density-connectivity validation of a partially connected convoy (DCVal). As input, it takes a partially connected convoy $v_{pcc} \in V_{pcc}$, a distance threshold $\varepsilon$, the minimum number of objects $m$, and a minimum lifetime $k$. The algorithm is similar to PCCD, except that the density-connectivity is verified for each extended valid convoy candidate $v_{ext} = v \cap c$ in lines 8-26. For each matching pair ($v$, $c$) with sufficient common objects (line 6), if their moving objects are the same (i.e., $v = c$), we know that all the moving objects in $v$ also form a dense group $c$ by themselves at the consecutive timestamp. Hence, the density-connectivity of $v_{ext}$ is immediately validated without further re-clustering (lines 8-11). If a current valid convoy candidate $v$ is extended by a larger snapshot cluster $c$ (line 12), we re-cluster the moving objects in $v_{ext}$ on their snapshot positions at the current timestamp (line 13), in order to confirm that all moving objects of $v_{ext}$ are density-connected by themselves at the current timestamp. The extended valid convoy candidate $v_{ext}$ is indeed valid only if the re-clustering results in a single cluster equivalent to $v_{ext}$ (line 15), in which case it is added to the next valid convoy candidate set $V_{next}$ to be used in the next iteration (line 16). If only a subset of the moving objects in $v_{ext}$ form a cluster, $v_{ext}$ is replaced with this subset in line 18 so as to be subsequently validated.

Require: Partially connected convoy $v_{pcc}$, distance threshold $\varepsilon$, integer $m$, and integer $k$
1: $V \leftarrow \emptyset$; // set of current valid convoys
2: for each timestamp $t$ of $v_{pcc}$ do
3:   $V_{next} \leftarrow \emptyset$; // next set of valid convoys
4:   $C \leftarrow$ DBSCAN($P_t(v_{pcc})$, $m$, $\varepsilon$);
5:   initialize each cluster $c \in C$, convoy $v \in V$ as in lines 6, 8 of Figure 3.3;
6:   for each pair of $(v, c) \in V \times C$ s.t. $|v \cap c| \ge m$ do
7:     $v_{ext}$.validated $\leftarrow$ false; $v_{ext} \leftarrow \langle v \cap c, [v.\text{lifetime}_{start}, t] \rangle$;
8:     if $v = c$ then // immediate validation
9:       $v_{ext}$.validated $\leftarrow$ true; $V_{next} \leftarrow$ updateVnext($V_{next}$, $v_{ext}$);
10:       $v$.extended $\leftarrow$ true; $v$.absorbed $\leftarrow$ true;
11:       $c$.matched $\leftarrow$ true; $c$.absorbed $\leftarrow$ true;
12:     else if $v \subset c$ then // validation by re-clustering
13:       $C_t \leftarrow$ DBSCAN($P_t(v_{ext})$, $m$, $\varepsilon$);
14:       if $C_t = \emptyset$ then $v_{ext}$.validated $\leftarrow$ true; $v_{ext}.\text{lifetime}_{end} \leftarrow t-1$;
15:       else if $|C_t| = 1 \wedge v_{ext} = C_t$ then // val. by single re-clustering
16:         $v_{ext}$.validated $\leftarrow$ true; $V_{next} \leftarrow$ updateVnext($V_{next}$, $v_{ext}$);
17:         $v$.extended $\leftarrow$ true; $v$.absorbed $\leftarrow$ true; $c$.matched $\leftarrow$ true;
18:       else $v_{ext} \leftarrow v_{ext} \cap C_t$;
19:     if not $v_{ext}$.validated then // recursive validation
20:       $VC \leftarrow$ DCVal($v_{ext}$, $\varepsilon$, $m$, $k$);
21:       for each validated convoy $vc \in VC$ do
22:         $v$.extended $\leftarrow$ true; $c$.matched $\leftarrow$ true;
23:         if $vc.\text{lifetime}_{end} = t$ then
24:           $V_{next} \leftarrow$ updateVnext($V_{next}$, $vc$);
25:         else
26:           if $vc$.lifetime $\ge k$ then $V_{val} \leftarrow V_{val} \cup \{vc\}$;
27:   if not $v$.extended or not $v$.absorbed then
28:     if $v$.lifetime $\ge k$ then $V_{val} \leftarrow V_{val} \cup \{v\}$;
29:   for each cluster $c \in C$ s.t. not $c$.matched or not $c$.absorbed do
30:     $V_{next} \leftarrow$ updateVnext($V_{next}$, $c$);
31:   $V \leftarrow V_{next}$;
32: return $V_{val} \leftarrow V_{val} \cup V$;
Figure 3.5: Density-connectivity validation algorithm DCVal($v_{pcc}$, $\varepsilon$, $m$, $k$)
If the re-clustering does not result in a single matching cluster, or the current valid convoy candidate $v$ is extended by a smaller snapshot cluster, the density-connectivity of the extended valid convoy candidate $v_{ext}$ is validated by a recursive call to DCVal in line 20, which obtains a set $VC$ of valid convoy candidates from $v_{ext}$. Each valid convoy $vc$ in $VC$ is inserted into the set $V_{next}$ to be further extended if its lifetime ends at the current timestamp (lines 23-24). Otherwise, it is returned to the set $V_{val}$ of valid convoys as an output of validation, as long as its lifetime is sufficient (line 26). As in PCCD, the current valid convoy candidates in $V$ that are neither extended nor absorbed by any of the clusters in $C$ and have a lifetime of at least $k$ are returned to the set $V_{val}$ (lines 27-28). Also, the current snapshot clusters in $C$ that are neither matched nor absorbed by any of the current valid convoy candidates in $V$ are inserted into the next valid convoy candidates $V_{next}$ to be further extended (lines 29-30).

Note that DCVal returns as output a set $V_{val}$ consisting of truly valid convoys satisfying all five constraints in Definition 7, together with valid convoy candidates that may not satisfy the sufficient lifetime constraint due to the recursive usage. Hence, a post-processing step is required to remove such short valid convoy candidates from the final set $V_{val}$.

Table 3.2 shows the validation process of DCVal, assuming that all five moving objects in Figure 1.3(b) form a partially connected convoy $\langle \{o_1, o_2, \ldots, o_5\}, [t_1, t_4] \rangle$ and using the parameters $m=3$ and $k=2$. At time $t_1$, the cluster $c_1$ is set as a valid convoy candidate $v_1$. At time $t_2$, since the current valid convoy candidate $v_1$ is extended by the cluster $c_2$ with exactly the same moving objects, the extended convoy is immediately validated without re-clustering.
At time $t_3$, since the current convoy candidate $v_1$ is extended by a smaller cluster $c_3$, the extended valid convoy candidate $v_1 \cap c_3 = \langle \{o_1, o_2, o_3\}, [t_1, t_3] \rangle$ is validated by a recursive call to DCVal. As shown in Figure 1.3(b), the objects $\{o_1, o_2, o_3\}$ do not form a cluster by themselves at time $t_1$. As a result, a valid convoy candidate $\langle \{o_1, o_2, o_3\}, [t_2, t_3] \rangle$ with a reduced lifetime is returned and added to the next valid convoy candidate set to be used in the next iteration. At $t_4$, the current valid convoy candidate $v_1$ is extended by a larger cluster $c_4$. Hence, the density-connectivity of $\langle \{o_1, o_2, o_3\}, [t_2, t_4] \rangle$ is validated by re-clustering the snapshot positions of $o_1$, $o_2$, and $o_3$ only at the current timestamp $t_4$.

Table 3.2: Validation process of DCVal

| Time | Current/next valid convoys in $V$ | Validation |
| --- | --- | --- |
| $t_1$ | $v_1 \leftarrow c_1 \leftarrow \langle \{o_1, o_2, o_3, o_4, o_5\}, [t_1, t_1] \rangle$ | |
| $t_2$ | $v_1 \leftarrow v_1 \cap c_2 \leftarrow \langle \{o_1, o_2, o_3, o_4, o_5\}, [t_1, t_2] \rangle$ | immediate |
| $t_3$ | $v_1 \leftarrow$ DCVal($v_1 \cap c_3$) $\leftarrow \langle \{o_1, o_2, o_3\}, [t_2, t_3] \rangle$ | recursive |
| $t_4$ | $v_1 \leftarrow v_1 \cap c_4 \leftarrow \langle \{o_1, o_2, o_3\}, [t_2, t_4] \rangle$ | re-clustering |
| | $v_2 \leftarrow c_4 \leftarrow \langle \{o_1, o_2, o_3, o_4\}, [t_4, t_4] \rangle$ | |

Complexity Analysis. The density-connectivity validation for each extended valid convoy $v_{ext}$ (lines 8-26) costs constant time in the best case of immediate validation and $O(|\tau| \cdot |v_{ext}|^2)$ time in the worst case, where $\tau$ is the maximum time span of $v_{ext}$, due to the recursive call to DCVal that requires $|\tau|$ re-clusterings, one at every timestamp of $v_{ext}$. This validation is performed $|V| \cdot |C|$ times at each timestamp in DCVal-PCC. Therefore, the worst-case time complexity of DCVal-PCC is $O(T \cdot |v_{pcc}|^2 + T \cdot |V| \cdot |C| \cdot |\tau| \cdot |v_{ext}|^2)$, where $T$ is the whole time span of a given partially connected convoy $v_{pcc}$.

3.3.2 EVCoDA: Efficient Solution

In this section, we present the computational bottlenecks of the VCoDA algorithm. Then, we introduce the improved algorithm EVCoDA that reduces the overall computational cost of VCoDA.
3.3.2.1 Efficient Discovery of Partially Connected Convoys

Compared to the moving cluster [32] and previous convoy discovery algorithms [30, 31], the intersection operation for each pair of $v \in V$ and $c \in C$ with two loops (lines 7-10 of Figure 3.3) is a more significant bottleneck in our PCCD algorithm due to the larger set $V$. Previously, the set $V$ of current convoy (or moving cluster) candidates to be compared with the current snapshot clusters consisted of extended convoy (or moving cluster) candidates and unmatched snapshot clusters at the current timestamp. However, the set $V$ in PCCD additionally contains the matched yet unabsorbed snapshot clusters (see the Insert() and Extend() operations in Table 3.1). Kalnis et al. [32] have proposed an improved version of the moving cluster algorithm that minimizes the comparisons based on the observation that each moving cluster candidate shares objects with only a few snapshot clusters. We adopt this approach for our partially connected convoy discovery problem so as to upper-bound the comparisons required for each current partially connected convoy candidate $v \in V$ by $\lfloor |v|/m \rfloor$, as presented in the following lemma.

Lemma 1. For each current partially connected convoy candidate $v \in V$, the number of snapshot clusters $c \in C$ that share at least $m$ objects with $v$ is at most $\lfloor |v|/m \rfloor$.
Proof. To prove the lemma by contradiction, assume that a partially connected convoy candidate $v \in V$ is intersected with snapshot clusters $C'$, where $C' \subseteq C$ and $|C'| = \lfloor |v|/m \rfloor + 1$. For each pair $(v, c) \in V \times C'$, it holds that $|v \cap c| \ge m$. Since the snapshot clusters are a disjoint partitioning of the moving objects (i.e., $c_1 \cap c_2 \cap \ldots \cap c_{\lfloor |v|/m \rfloor + 1} = \emptyset$), the following holds: $|v| \ge |v \cap c_1| + |v \cap c_2| + \ldots + |v \cap c_{\lfloor |v|/m \rfloor + 1}| \ge m \cdot (\lfloor |v|/m \rfloor + 1) > |v|$. Hence, the resulting contradiction proves Lemma 1.

Require: Moving objects $O$, distance threshold $\varepsilon$, integer $m$, and integer $k$
1: $V \leftarrow \emptyset$; // set of current partially connected convoys
2: for each timestamp $t$ do
3:   $V_{next} \leftarrow \emptyset$; // next set of partially connected convoys
4:   $C \leftarrow$ DBSCAN($P_t(O)$, $m$, $\varepsilon$);
5:   for each cluster $c \in C$ do
6:     $c$.matched $\leftarrow$ false; $c$.absorbed $\leftarrow$ false;
7:     $c$.lifetime $\leftarrow [t, t]$;
8:   for each current convoy candidate $v \in V$ do
9:     $v$.extended $\leftarrow$ false, $v$.absorbed $\leftarrow$ false;
10:     $C' \leftarrow$ clusters $c \in C$ s.t. $|c \cap v| \ge m$;
11:     for each cluster $c \in C'$ do // Extend($v$, $c$)
12:       $v$.extended $\leftarrow$ true, $c$.matched $\leftarrow$ true;
13:       $v_{ext} \leftarrow \langle v \cap c, [v.\text{lifetime}_{start}, t] \rangle$;
14:       $V_{next} \leftarrow$ updateVnext($V_{next}$, $v_{ext}$);
15:       if $v \subset c$ then $v$.absorbed $\leftarrow$ true;
16:       if $c \subset v$ then $c$.absorbed $\leftarrow$ true;
17:     if not $v$.extended or not $v$.absorbed then
18:       if $v$.lifetime $\ge k$ then // Return($v$)
19:         $V_{pcc} \leftarrow V_{pcc} \cup \{v\}$;
20:   for each cluster $c \in C$ do
21:     if not $c$.matched or not $c$.absorbed then
22:       $V_{next} \leftarrow$ updateVnext($V_{next}$, $c$); // Insert($c$)
23:   $V \leftarrow V_{next}$;
24: return $V_{pcc}$;
Figure 3.6: Efficient algorithm for partially connected convoy discovery EPCCD(moving objects $O$, $\varepsilon$, $m$, $k$)

Figure 3.6 shows the pseudocode of the efficient algorithm for partially connected convoy discovery (EPCCD). As input, it takes the trajectories of moving objects $O$, a distance threshold $\varepsilon$, the minimum number of objects $m$, and the minimum lifetime $k$. At each timestamp, the snapshot clusters $C$ satisfying the density constraints ($\varepsilon$ and $m$) are first obtained in line 4 and initialized in lines 5-7.
For each current partially connected convoy candidate $v \in V$, it obtains a subset $C'$ of snapshot clusters that share at least $m$ objects with $v$ in line 10. Since the size of the set $C'$ is bounded by $\lfloor |v|/m \rfloor$, the inner loop in lines 11-16 is executed at most $\lfloor |v|/m \rfloor$ times for each $v \in V$. For each matching pair $v$ and $c$ with sufficient objects in common, $v$ is extended with $c$ into $v_{ext}$ as in line 13. We then call the function updateVnext() to update the set of next partially connected convoy candidates $V_{next}$ with this $v_{ext}$, to be used in the next iteration (line 14). Then, $v$ and $c$ are compared again in lines 15-16 to check whether one is absorbed by the other. The current convoy candidates in $V$ that are neither extended nor absorbed by any of the clusters in $C$ and have a lifetime of at least $k$ are returned to the set $V_{pcc}$ of actual partially connected convoys in lines 17-19, which is the output of our EPCCD algorithm. Similarly, the current snapshot clusters in $C$ that are neither matched nor absorbed by any of the current partially connected convoy candidates in $V$ are inserted into the next partially connected convoy candidates $V_{next}$ in lines 20-22, to be used in the next iteration.

Note that in order to efficiently search for the matching snapshot cluster subset $C'$ of any convoy candidate $v$ in line 10, we maintain an array of cluster labels, each element of which provides the mapping between an object and its cluster label, so that the corresponding cluster labels of all objects in any partially connected convoy candidate $v$ can be found in constant time.

Complexity Analysis. Compared to PCCD, EPCCD needs additional memory $O(|O|)$ to store the resulting cluster labels of all moving objects $O$ in a hash table, so that the snapshot clusters that share at least $m$ objects with a given current partially connected convoy candidate are retrieved in constant time (line 10).
The time complexity of the EPCCD algorithm is hence reduced to $O(T \cdot |O|^2 + T \cdot |V| \cdot \lfloor |v|/m \rfloor)$, where $T$ is the whole time span of the dataset $O$ and $|v|$ is the maximum size of a partially connected convoy candidate.

3.3.2.2 Efficient Density-Connectivity Validation

The DCVal algorithm in Section 3.3.1.2 scans the whole lifetime of a partially connected convoy at least once to verify the density-connectivity of moving objects by clustering their snapshot locations at each timestamp. This approach incurs many unnecessary clusterings, particularly when the density-connectivity is unsatisfied only at the beginning or towards the end of the lifetime. Hence, we propose an efficient density-connectivity validation algorithm, EDCVal, for better performance. The idea is that a minimum bounding box (MBB) is employed to approximate the (sub-)trajectory of each moving object during a given time interval, and the density-based clustering is performed on the MBBs so that the snapshot clusters at each timestamp are efficiently estimated by the clusters of MBBs during the time interval.

Given a moving object trajectory $o = \langle (p_1, t_1), (p_2, t_2), \ldots, (p_n, t_n) \rangle$, its minimum bounding box during a time interval $[t_a, t_b]$ ($t_1 \le t_a \le t_b \le t_n$) is defined as $B_{o,[t_a,t_b]} = \langle (t_a, p_{min}), (t_b, p_{max}) \rangle$, where $B_o.p_{min}$ is a two-dimensional vector $(p_{min}.x, p_{min}.y)$ representing the minimum location values sampled during the time interval. That is, $p_{min}.x$ and $p_{min}.y$ are defined as $\min\{p_a.x, p_{a+1}.x, \ldots, p_b.x\}$ and $\min\{p_a.y, p_{a+1}.y, \ldots, p_b.y\}$, respectively. The vector $B_o.p_{max}$ is similarly defined as the maximum location values during the time interval. Figure 3.7 shows examples of such MBBs of two moving objects $o_1$ and $o_2$ during a time interval $[t_1, t_4]$.
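The MBB construction can be sketched in a few lines. This is only an illustration of the definition above: the `MBB` tuple, the function name `mbb`, and the representation of a trajectory as a list of `((x, y), t)` samples are assumptions made for the example, not part of the original system.

```python
from typing import NamedTuple, Sequence, Tuple

Point = Tuple[float, float]


class MBB(NamedTuple):
    """Hypothetical container for B_{o,[t_a,t_b]}."""
    t_start: int
    t_end: int
    p_min: Point  # (min x, min y) sampled in [t_start, t_end]
    p_max: Point  # (max x, max y) sampled in [t_start, t_end]


def mbb(trajectory: Sequence[Tuple[Point, int]], t_a: int, t_b: int) -> MBB:
    """Bounding box of the samples whose timestamp lies in [t_a, t_b]."""
    pts = [p for p, t in trajectory if t_a <= t <= t_b]
    if not pts:
        raise ValueError("no samples inside the requested time interval")
    xs = [x for x, _ in pts]
    ys = [y for _, y in pts]
    return MBB(t_a, t_b, (min(xs), min(ys)), (max(xs), max(ys)))
```

For instance, a trajectory sampled at $(0,2)$, $(1,3)$, $(3,1)$, $(2,0)$ over $t_1, \ldots, t_4$ yields the box with $p_{min} = (0,0)$ and $p_{max} = (3,3)$ for the full interval, and a tighter box for any sub-interval.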
We now define the minimum distance $D_{min}(\cdot)$ between two boxes $B_{o_p,[t_a,t_b]}$ and $B_{o_q,[t_c,t_d]}$ as follows:

$$D_{min}(B_{o_p,[t_a,t_b]}, B_{o_q,[t_c,t_d]}) = \begin{cases} \sqrt{(d_{min}.x)^2 + (d_{min}.y)^2} & \text{if } \max\{t_a, t_c\} \le \min\{t_b, t_d\}, \\ \infty & \text{otherwise} \end{cases} \quad (3.1)$$

where

$$d_{min}.x = \begin{cases} 0 & \text{if } \max\{B_{o_p}.p_{min}.x, B_{o_q}.p_{min}.x\} < \min\{B_{o_p}.p_{max}.x, B_{o_q}.p_{max}.x\}, \\ \max\{B_{o_p}.p_{min}.x, B_{o_q}.p_{min}.x\} - \min\{B_{o_p}.p_{max}.x, B_{o_q}.p_{max}.x\} & \text{otherwise} \end{cases}$$

and

$$d_{min}.y = \begin{cases} 0 & \text{if } \max\{B_{o_p}.p_{min}.y, B_{o_q}.p_{min}.y\} < \min\{B_{o_p}.p_{max}.y, B_{o_q}.p_{max}.y\}, \\ \max\{B_{o_p}.p_{min}.y, B_{o_q}.p_{min}.y\} - \min\{B_{o_p}.p_{max}.y, B_{o_q}.p_{max}.y\} & \text{otherwise} \end{cases}$$

The maximum distance $D_{max}(\cdot)$ between two boxes $B_{o_p,[t_a,t_b]}$ and $B_{o_q,[t_c,t_d]}$ is defined as follows:

$$D_{max}(B_{o_p,[t_a,t_b]}, B_{o_q,[t_c,t_d]}) = \begin{cases} \sqrt{(d_{max}.x)^2 + (d_{max}.y)^2} & \text{if } \max\{t_a, t_c\} \le \min\{t_b, t_d\}, \\ \infty & \text{otherwise} \end{cases} \quad (3.2)$$

where

$$d_{max}.x = \begin{cases} \max\left\{ \begin{array}{l} \min\{B_{o_p}.p_{max}.x, B_{o_q}.p_{max}.x\} - \min\{B_{o_p}.p_{min}.x, B_{o_q}.p_{min}.x\}, \\ \max\{B_{o_p}.p_{max}.x, B_{o_q}.p_{max}.x\} - \max\{B_{o_p}.p_{min}.x, B_{o_q}.p_{min}.x\} \end{array} \right\} & \begin{array}{l} \text{if } (B_{o_p}.p_{min}.x < B_{o_q}.p_{min}.x \text{ and } B_{o_p}.p_{max}.x > B_{o_q}.p_{max}.x) \\ \text{or } (B_{o_q}.p_{min}.x < B_{o_p}.p_{min}.x \text{ and } B_{o_q}.p_{max}.x > B_{o_p}.p_{max}.x), \end{array} \\ \max\{B_{o_p}.p_{max}.x, B_{o_q}.p_{max}.x\} - \min\{B_{o_p}.p_{min}.x, B_{o_q}.p_{min}.x\} & \text{otherwise} \end{cases}$$

The $d_{max}.y$ value is defined similarly. As shown in Figure 3.7, $D_{min}$ and $D_{max}$ capture the minimum and the maximum distances, respectively, between any pair of points belonging to the two boxes if they intersect in time.
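Equations 3.1 and 3.2 translate almost directly into code. The sketch below follows the piecewise definitions term by term; the box layout `(t_start, t_end, (x_min, y_min), (x_max, y_max))` and all function names are assumptions introduced for illustration.

```python
from math import inf, sqrt


def _axis_min(lo1, hi1, lo2, hi2):
    """d_min along one axis: 0 if the extents overlap, else the gap."""
    lo, hi = max(lo1, lo2), min(hi1, hi2)
    return 0.0 if lo < hi else lo - hi


def _axis_max(lo1, hi1, lo2, hi2):
    """d_max along one axis, following the two cases of Equation 3.2."""
    contained = (lo1 < lo2 and hi1 > hi2) or (lo2 < lo1 and hi2 > hi1)
    if contained:
        return max(min(hi1, hi2) - min(lo1, lo2),
                   max(hi1, hi2) - max(lo1, lo2))
    return max(hi1, hi2) - min(lo1, lo2)


def _overlap_in_time(b1, b2):
    return max(b1[0], b2[0]) <= min(b1[1], b2[1])


def d_min(b1, b2):
    """Minimum distance between two MBBs (Equation 3.1)."""
    if not _overlap_in_time(b1, b2):
        return inf
    dx = _axis_min(b1[2][0], b1[3][0], b2[2][0], b2[3][0])
    dy = _axis_min(b1[2][1], b1[3][1], b2[2][1], b2[3][1])
    return sqrt(dx * dx + dy * dy)


def d_max(b1, b2):
    """Maximum distance between two MBBs (Equation 3.2)."""
    if not _overlap_in_time(b1, b2):
        return inf
    dx = _axis_max(b1[2][0], b1[3][0], b2[2][0], b2[3][0])
    dy = _axis_max(b1[2][1], b1[3][1], b2[2][1], b2[3][1])
    return sqrt(dx * dx + dy * dy)
```

As in the Figure 3.7 example, two boxes that are separated along $x$ but overlap along $y$ have $D_{min} = d_{min}.x$, while $D_{max}$ combines the largest extents along both axes.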
[Figure 3.7: The minimum distance $D_{min}$ and the maximum distance $D_{max}$ between two MBBs. In the depicted example, $D_{min}(B_{o_1,[t_1,t_4]}, B_{o_2,[t_1,t_4]}) = \sqrt{(d_{min}.x)^2 + 0}$ and $D_{max}(B_{o_1,[t_1,t_4]}, B_{o_2,[t_1,t_4]}) = \sqrt{(d_{max}.x)^2 + (d_{max}.y)^2}$.]

In order to perform density-based clustering on MBBs, we adapt the notions of density-based clustering defined for the snapshot locations in Section 3.2 to MBBs. Given a set $M = \{B_{o_1,[t_a,t_b]}, \ldots, B_{o_N,[t_a,t_b]}\}$ of $N$ MBBs during a time interval $[t_a, t_b]$ and a distance threshold $\varepsilon$, the minimum $\varepsilon$-neighborhood $N^{min}_{\varepsilon}(B_{o_p,[t_a,t_b]})$ of an MBB $B_{o_p,[t_a,t_b]} \in M$ is defined as $N^{min}_{\varepsilon}(B_{o_p,[t_a,t_b]}) = \{B_{o_q,[t_a,t_b]} \in M \mid D_{min}(B_{o_p,[t_a,t_b]}, B_{o_q,[t_a,t_b]}) \le \varepsilon\}$. The maximum $\varepsilon$-neighborhood $N^{max}_{\varepsilon}(B_{o_p,[t_a,t_b]})$ of the MBB is similarly defined using the maximum distance defined in Equation 3.2. With these two $\varepsilon$-neighborhood operations, the notions of directly density-reachable, density-reachable, and density-connected defined over the snapshot locations in Section 3.2 can be similarly defined over the MBBs. Finally, two spatio-temporal clusters are additionally defined over the set $M$ of MBBs as follows.

Definition 10. Given a set of moving objects $O$, a distance threshold $\varepsilon$, and an integer $m$, a lower-bound cluster $c_{L,[t_a,t_b]}$ during a time interval $[t_a, t_b]$ is a non-empty subset of objects $O' \subseteq O$ satisfying the following conditions:
1) Connectivity: $\forall o_p, o_q \in O'$, an MBB $B_{o_p,[t_a,t_b]}$ is density-connected to an MBB $B_{o_q,[t_a,t_b]}$ w.r.t. $\varepsilon$, $m$, and the maximum $\varepsilon$-neighborhood operation.
2) Maximality: $\forall o_p, o_q \in O'$, if $o_q \in O'$ and an MBB $B_{o_p,[t_a,t_b]}$ is density-reachable from an MBB $B_{o_q,[t_a,t_b]}$ w.r.t. $\varepsilon$, $m$, and the maximum $\varepsilon$-neighborhood operation, then also $o_p \in O'$.
3) Sufficient objects: $|O'| \ge m$.

Definition 11.
Given a set of moving objects $O$, a distance threshold $\varepsilon$, and an integer $m$, an upper-bound cluster $c_{U,[t_a,t_b]}$ during a time interval $[t_a, t_b]$ is a non-empty subset of objects $O' \subseteq O$ satisfying the following conditions:
1) Connectivity: $\forall o_p, o_q \in O'$, an MBB $B_{o_p,[t_a,t_b]}$ is density-connected to an MBB $B_{o_q,[t_a,t_b]}$ w.r.t. $\varepsilon$, $m$, and the minimum $\varepsilon$-neighborhood operation.
2) Maximality: $\forall o_p, o_q \in O'$, if $o_q \in O'$ and an MBB $B_{o_p,[t_a,t_b]}$ is density-reachable from an MBB $B_{o_q,[t_a,t_b]}$ w.r.t. $\varepsilon$, $m$, and the minimum $\varepsilon$-neighborhood operation, then also $o_p \in O'$.
3) Sufficient objects: $|O'| \ge m$.

Intuitively, a lower-bound cluster $c_{L,[t_a,t_b]}$ is the smallest possible maximal subset of moving objects that must form a dense group by themselves during the entire span of the time interval $[t_a, t_b]$. An upper-bound cluster $c_{U,[t_a,t_b]}$, in contrast, is the largest possible maximal subset of moving objects that could ever form a dense group by themselves at some timestamp $t$ ($t_a \le t \le t_b$).

Lemma 2. Let $O$ be a set of moving objects in a partially connected convoy $v_{pcc} = \langle O, [t_a, t_b] \rangle$ and $c_t$ be a snapshot cluster of $O$ at a timestamp $t$ ($t_a \le t \le t_b$). Then, the snapshot cluster $c_t$ is bounded by the lower- and upper-bound clusters during the time interval $[t_a, t_b]$ such that $c_{L,[t_a,t_b]} \subseteq c_t \subseteq c_{U,[t_a,t_b]}$, which holds for all timestamps $t$ during the lifetime $[t_a, t_b]$.

Proof. To first prove $c_{L,[t_a,t_b]} \subseteq c_t$ by contradiction, assume two moving objects $o_p, o_q \in O$ such that $o_p \in c_{L,[t_a,t_b]} \wedge o_p \in c_t$ and $o_q \in c_{L,[t_a,t_b]} \wedge o_q \notin c_t$. Since both $o_p$ and $o_q$ are in the lower-bound cluster $c_{L,[t_a,t_b]}$ and $D_{max}()$ is the largest possible distance between two objects during the time interval $[t_a, t_b]$, it holds that $D(o_p(t), o_q(t)) \le D_{max}(B_{o_p,[t_a,t_b]}, B_{o_q,[t_a,t_b]}) \le \varepsilon$, which contradicts the assumption that $o_p \in c_t$ but $o_q \notin c_t$.
To prove $c_t \subseteq c_{U,[t_a,t_b]}$, assume two moving objects $o_p, o_q \in O$ such that $o_p \in c_t \wedge o_p \in c_{U,[t_a,t_b]}$ and $o_q \in c_t \wedge o_q \notin c_{U,[t_a,t_b]}$. Since both $o_p$ and $o_q$ are in the snapshot cluster $c_t$ and $D_{min}()$ is the smallest possible distance between two objects during the time interval $[t_a, t_b]$, $D_{min}(B_{o_p,[t_a,t_b]}, B_{o_q,[t_a,t_b]}) \le D(o_p(t), o_q(t)) \le \varepsilon$ holds, which contradicts the assumption that $o_p \in c_{U,[t_a,t_b]}$ but $o_q \notin c_{U,[t_a,t_b]}$. Hence, Lemma 2 is proved.

By Lemma 2, if the lower-bound cluster matches the upper-bound cluster during a certain time interval, this group of moving objects is guaranteed to form a snapshot cluster by themselves at every timestamp of the time interval under consideration, i.e., $c_{L,[t_a,t_b]} = c_{U,[t_a,t_b]} = c_t$ for all timestamps $t$ in $[t_a, t_b]$. This enables us to adopt a branch-and-bound framework to efficiently validate the density-connectivity of a partially connected convoy, where the clustering on snapshot locations at every timestamp can be avoided during any time interval in which the lower-bound and upper-bound clusters are completely matched.

Figure 3.8 shows the pseudocode of the efficient density-connectivity validation algorithm EDCVal.
It takes as input a partially connected convoy $v_{pcc} = \langle O, [t_a, t_b] \rangle$, the parameters $\varepsilon$, $m$, $k$, and a set $V_{val}$ of valid convoy candidates. Note that $V_{val}$ is initially an empty set at the first call to EDCVal.

Require: Partially connected convoy $v_{pcc} = \langle O, [t_a, t_b] \rangle$, set of valid convoy candidates $V_{val}$, distance threshold $\varepsilon$, integer $m$, and integer $k$
1: $M \leftarrow$ MBBs of $O$; $(C_{L,[t_a,t_b]}, C_{U,[t_a,t_b]}) \leftarrow$ DBSCAN-MBB($M$, $m$, $\varepsilon$);
2: for each current lower-bound cluster $c \in C_{L,[t_a,t_b]}$ do
3:   if $\nexists v \in V_{val}$: $c$ = sub-convoy($v$) then $V_{val} \leftarrow V_{val} \cup \{\langle c, [t_a, t_b] \rangle\}$;
// check stop conditions to end the validation
4: if $\exists v \in V_{val}$: $c$ = sub-convoy($v$) for $\forall c \in C_{U,[t_a,t_b]}$ then return $V_{val}$;
5: if $C_{L,[t_a,t_b]} = C_{U,[t_a,t_b]}$ then return $V_{val}$;
// partition the time interval $[t_a, t_b]$ for sub-convoys
6: $v_1 \leftarrow \langle \bigcup_{c \in C_{U,[t_a,t_b]}} c, [t_a, t_a + \lfloor (t_b - t_a)/2 \rfloor] \rangle$;
7: $v_2 \leftarrow \langle \bigcup_{c \in C_{U,[t_a,t_b]}} c, [t_a + \lfloor (t_b - t_a)/2 \rfloor + 1, t_b] \rangle$;
// recursive validation of two sub-convoys
8: $V_{val1} \leftarrow$ EDCVal($v_1$, $V_{val}$, $\varepsilon$, $m$, $k$); $V_{val2} \leftarrow$ EDCVal($v_2$, $V_{val}$, $\varepsilon$, $m$, $k$);
// merge the results of sub-validations
9: $V_{mrg} \leftarrow \emptyset$; // set of merged valid convoy candidates
10: for each pair of convoys $(v_1, v_2) \in V_{val1} \times V_{val2}$ do
11:   if $|v_1 \cap v_2| \ge m \wedge (v_1, v_2)$ are consecutive in time then
12:     $v_{ext} \leftarrow \langle v_1 \cap v_2, [v_1.\text{lifetime}_{start}, v_2.\text{lifetime}_{end}] \rangle$;
13:     if $v_1 = v_2 \wedge \nexists v \in V_{val}$: $v_{ext}$ = sub-convoy($v$) then
14:       $V_{mrg} \leftarrow V_{mrg} \cup \{v_{ext}\}$;
15:       delete $v \in V_{val}$: $v$ = sub-convoy($v_{ext}$);
16:     else if $\nexists v \in V_{val}$: $v_{ext}$ = sub-convoy($v$) then
17:       $VC \leftarrow$ EDCVal($v_{ext}$, $V_{val}$, $\varepsilon$, $m$, $k$);
18:       $V_{mrg} \leftarrow V_{mrg} \cup \{vc \in VC \mid \nexists v \in V_{val}$: $vc$ = sub-convoy($v$)$\}$;
19:       delete $v \in V_{val}$: $\exists vc \in VC$, $v$ = sub-convoy($vc$);
20: $V_{val} \leftarrow V_{val} \cup \{v_1 \in V_{val1} \mid v_1$ is not merged$\}$;
21: $V_{val} \leftarrow V_{val} \cup \{v_2 \in V_{val2} \mid v_2$ is not merged$\}$;
22: return $V_{val} \leftarrow V_{val} \cup V_{mrg}$;
Figure 3.8: Efficient density-connectivity validation algorithm EDCVal($v_{pcc} = \langle O, [t_a, t_b] \rangle$, $V_{val}$, $\varepsilon$, $m$, $k$)
First, it performs a DBSCAN-like density-based clustering on the MBB set $M$ of all moving objects in $v_{pcc}$ and obtains the lower- and upper-bound clusters in line 1. Each lower-bound cluster $c$ in $C_{L,[t_a,t_b]}$ that is not a sub-convoy of any valid convoy candidate is added to the set $V_{val}$ to be used as a tighter lower-bound cluster in the subsequent validation (lines 2-3). Next, it checks the stop conditions that end the density-connectivity validation: we can safely stop if every upper-bound cluster (i.e., largest possible cluster) $c \in C_{U,[t_a,t_b]}$ is a sub-convoy of a valid convoy candidate in $V_{val}$ (line 4), or if the lower bound completely matches the upper bound, by Lemma 2 (line 5). Otherwise, it partitions the current time interval into two sub-intervals, and the moving objects to be subsequently validated are set to the union of all moving objects in the upper-bound clusters in $C_{U,[t_a,t_b]}$ in order to guarantee no false dismissal of any objects (lines 6-7). The density-connectivity validation is performed on the two sub-convoys by the recursive calls to EDCVal in line 8. As a result, the sets $V_{val1}$ and $V_{val2}$ contain all valid convoy candidates during the first and the second half of the current time interval, respectively.

Next, the results of the two sub-validations are merged to obtain a set $V_{mrg}$ of merged valid convoy candidates spanning the two partitioned time intervals in lines 10-19. For each pair $(v_1, v_2) \in V_{val1} \times V_{val2}$, if they share at least $m$ common objects and are consecutive in time (lines 10-11), the extended convoy candidate $v_{ext}$ (line 12) can be added to $V_{mrg}$ under some conditions (lines 13 and 18). Note that the merged valid convoy candidates added to the set $V_{mrg}$ could turn some current valid convoy candidates in $V_{val}$ into their sub-convoys. If this is the case, these sub-convoys must be deleted from the set $V_{val}$ as in lines 15 and 19.
Finally, the valid convoy candidates in either $V_{val1}$ or $V_{val2}$ that have not been merged are added to the set $V_{val}$ in lines 20-21, to be used in the higher-level merging process. As in DCVal, the set $V_{val}$ may contain valid convoy candidates that do not satisfy the minimum lifetime constraint because of the recursion. Such short valid convoy candidates must be removed from the final set $V_{val}$ in a post-processing step.

Figure 3.9 shows the validation process of EDCVal, which starts from the root node in Figure 3.9(a) with a partially connected convoy $\langle \{o_1, o_2, o_3\}, [t_1, t_4] \rangle$ obtained from the five objects in Figure 1.3(b) with the parameters $m=3$ and $k=2$. At each node, the lower- and upper-bound clusters are obtained and the set $V_{val}$ is updated with the lower-bound cluster so as to maintain the maximal groups of objects that have been verified so far to form clusters during the associated time intervals. If the lower and upper bounds match (see the shaded nodes in Figure 3.9(a)), the subtree is pruned from further validation. Otherwise, the lifetime of the current convoy is partitioned in half to validate the density-connectivity of the sub-convoys $v_1$ and $v_2$. Figure 3.9(b) shows the merging process occurring at each non-leaf node, where the two sets of valid convoy candidates are recursively combined from the bottom up until the root node holds the final set of valid convoy candidates.
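The bound used in Lemma 2 and the interval partitioning of EDCVal can be illustrated with a minimal Python sketch. It is not the dissertation's Matlab implementation. The function names (`mbb`, `d_min`, `d_max`, `bounds_hold`, `split_interval`) are hypothetical, trajectories are assumed to be lists of 2-D positions aligned on the same timestamps, and `d_max` is included only as an assumed counterpart of $D_{min}$ for the lower-bound side. The midpoint is computed as $t_a + \lfloor (t_b - t_a)/2 \rfloor$, an assumption chosen because it reproduces the partition $[t_1,t_2]$, $[t_3,t_4]$ described for Figure 3.9.

```python
from math import hypot


def mbb(points):
    """Minimum bounding box (xmin, ymin, xmax, ymax) of a list of (x, y) positions."""
    xs, ys = zip(*points)
    return (min(xs), min(ys), max(xs), max(ys))


def d_min(a, b):
    """Smallest possible distance between a point in box a and a point in box b."""
    dx = max(a[0] - b[2], b[0] - a[2], 0.0)
    dy = max(a[1] - b[3], b[1] - a[3], 0.0)
    return hypot(dx, dy)


def d_max(a, b):
    """Largest possible distance between a point in box a and a point in box b."""
    dx = max(a[2] - b[0], b[2] - a[0])
    dy = max(a[3] - b[1], b[3] - a[1])
    return hypot(dx, dy)


def bounds_hold(traj_p, traj_q, tol=1e-12):
    """Check D_min <= D(o_p(t), o_q(t)) <= D_max at every aligned timestamp."""
    box_p, box_q = mbb(traj_p), mbb(traj_q)
    lo, hi = d_min(box_p, box_q), d_max(box_p, box_q)
    return all(
        lo - tol <= hypot(px - qx, py - qy) <= hi + tol
        for (px, py), (qx, qy) in zip(traj_p, traj_q)
    )


def split_interval(ta, tb):
    """Split [ta, tb] into two halves (assumed midpoint: ta + floor((tb - ta) / 2))."""
    mid = ta + (tb - ta) // 2
    return (ta, mid), (mid + 1, tb)


if __name__ == "__main__":
    o_p = [(0, 0), (1, 0), (2, 0)]
    o_q = [(0, 1), (1, 1), (2, 1)]
    print(bounds_hold(o_p, o_q))
    print(split_interval(1, 4))
```

Because $D_{min}$ between two boxes never exceeds the true distance at any timestamp inside the interval, two objects that are within $\varepsilon$ at some timestamp also pass the $D_{min} \le \varepsilon$ test on their boxes, which is the inequality the proof relies on.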
[Figure 3.9: Validation process of EDCVal. (a) Partitioning process, a three-level recursion tree rooted at $v_{pcc} = \langle \{o_1, o_2, o_3\}, [t_1, t_4] \rangle$. (b) Merging process, ending at the root with $V_{val} = \{\langle \{o_1, o_2, o_3\}, [t_2, t_4] \rangle\}$.]

3.4 Experimental Evaluation

We evaluate the performance of our valid convoy discovery algorithms on three real-world moving object trajectory datasets. The performance of VCoDA and EVCoDA is compared to that of an existing convoy discovery algorithm, CMC (Coherent Moving Cluster) [30, 31].² All the algorithms were implemented in Matlab and the experiments were run on a Pentium IV machine with a 3.2 GHz CPU and 2 GB of RAM, running Windows Server 2003.

² The algorithm CuTS (Convoy Discovery using Trajectory Simplification), proposed by the same authors as CMC, improves the efficiency of CMC with the same accuracy. Hence, it is not included in the comparison.

3.4.1 Dataset and Parameter Setting

Table 3.3 summarizes the three datasets used in the experiments. The hurricane dataset³ contains the positions of Atlantic hurricanes in latitude and longitude, sampled at 6-hourly intervals. We used 608 hurricane trajectories from the years 1950 through 2006. Each trajectory was preprocessed to have timestamps relative to its starting time in order to find more groups of hurricanes with similar development in space and time.
The truck dataset⁴ consists of 276 trajectories of 50 trucks delivering concrete to several construction sites around the Athens metropolitan area in Greece. The locations in latitude and longitude were sampled approximately every 30 seconds for 33 days [20]. In our experiment, the movement of a truck traced during a single day was considered a distinct trajectory in order to find more valid convoys. The commute dataset⁵ consists of 210 trajectories of two people tracing their daily commute in the Cook and DuPage counties of Illinois, USA. The locations in latitude and longitude were sampled every second by GPS for 6 months. Each commute trajectory during a single day was assumed to come from a distinct moving object in our experiment. In all three datasets, the unsampled positions at missing timestamps were linearly interpolated from the nearby measured locations so that all trajectories are aligned on a predefined set of timestamps.

³ http://weather.unisys.com/hurricane/atlantic/
⁴ http://www.rtreeportal.org/
⁵ http://cs.uic.edu/~boxu/mp2p/gps_data.html

Table 3.3: Summary of datasets and parameter settings for experiments

| | Hurricane | Truck | Commute |
|---|---|---|---|
| Number of objects | 608 | 276 | 210 |
| Average trajectory length | 31 | 407 | 1679 |
| Total observations | 18,951 | 112,203 | 354,913 |
| Timestamp interval | 6 hrs. | 30 sec. | 1 sec. |
| Number of timestamps | 124 | 2875 | 48099 |
| Number of objects ($m$) | 3 | 3 | 3 |
| Lifetime ($k$) | 8 | 180 | 600 |
| Distance threshold ($\varepsilon$) | 1.0~1.5 | 0.0001~0.0006 | 1~5 |

Our valid convoy discovery algorithms require three input parameters, $m$, $k$, and $\varepsilon$, to discover valid convoys. We assume that the parameters $m$ and $k$, which determine the smallest and shortest convoys of interest in the dataset, are given by domain experts. In our experiments, the minimum number of objects $m$ is set to 3 for all three datasets and the minimum number of timestamps $k$ (i.e., lifetime) is set to roughly the 20th percentile of the trajectory length.
Given the minimum number of moving objects $m=3$, a reasonable range for the distance threshold $\varepsilon$ is estimated for the hurricane dataset from the density distribution of snapshot positions at each timestamp, using the sorted 3-dist graph proposed in [18]. For the truck and commute datasets, where the vehicles are physically constrained to the road network, it is determined by the width of roads observed in a visualization of the trajectories. The actual parameter values used in our experiments are summarized in Table 3.3.

3.4.2 Effectiveness

In order to compare our valid convoy discovery algorithms with CMC in terms of effectiveness, we measure precision and recall. Let $V_{val}$ be the result set of all valid convoys discovered by (E)VCoDA and $V_{cmc}$ be the set obtained by CMC. Considering $V_{val}$ as the baseline, precision is defined as $\frac{|V_{cmc} \cap V_{val}|}{|V_{cmc}|}$, indicating the probability that a convoy discovered by CMC is a valid convoy retrieved by (E)VCoDA. Precision attains the value 1 when the result set of CMC is entirely valid. Similarly, recall is defined as $\frac{|V_{val} \cap V_{cmc}|}{|V_{val}|}$, the probability that a valid convoy discovered by (E)VCoDA is also retrieved by CMC. Recall attains the value 1 when all valid convoys discovered by (E)VCoDA are retrieved by CMC.

Table 3.4 shows the accuracy results on all three datasets. Note that $V_{pcc}$ denotes the result set of partially connected convoys obtained from the first phase of our valid convoy discovery algorithms. Over all three datasets, CMC found fewer convoys than (E)VCoDA. This is expected because (E)VCoDA additionally maintains unabsorbed snapshot clusters in the intermediate set of convoy candidates to be subsequently extended, and outputs any unabsorbed convoy candidate on the fly while scanning the whole time span.
For the same reason, CMC never discovered the complete set of valid convoys found by (E)VCoDA, as shown by recall scores that are below 1 and close to 0 in most parameter settings. Also, only one third of the convoys discovered by CMC are valid on average, as indicated by the low precision of around 0.30, because CMC lacks the density-connectivity validation process. In summary, (E)VCoDA improves the precision by a factor of 3 on average and the recall by up to 2 orders of magnitude as compared to CMC.

Table 3.4: Comparison of accuracy on three datasets

Hurricane: $m=3$, $k=8$ (48 hrs.)

| $\varepsilon$ | 1.0 | 1.1 | 1.2 | 1.3 | 1.4 | 1.5 |
|---|---|---|---|---|---|---|
| $\lvert V_{pcc} \rvert$ | 4 | 11 | 28 | 74 | 128 | 217 |
| $\lvert V_{val} \rvert$ | 2 | 3 | 5 | 8 | 20 | 36 |
| $\lvert V_{cmc} \rvert$ | 3 | 5 | 7 | 17 | 18 | 15 |
| precision | 0.33 | 0.40 | 0.29 | 0.12 | 0.11 | 0.20 |
| recall | 0.50 | 0.67 | 0.40 | 0.25 | 0.10 | 0.08 |

Truck: $m=3$, $k=180$ (1 hr. 30 min.)

| $\varepsilon$ | 0.0001 | 0.0002 | 0.0003 | 0.0004 | 0.0005 | 0.0006 |
|---|---|---|---|---|---|---|
| $\lvert V_{pcc} \rvert$ | 5 | 90 | 180 | 209 | 222 | 248 |
| $\lvert V_{val} \rvert$ | 2 | 25 | 76 | **145** | 179 | 220 |
| $\lvert V_{cmc} \rvert$ | 1 | 12 | 14 | **16** | 18 | 18 |
| precision | 1.00 | 0.33 | 0.29 | 0.19 | 0.33 | 0.28 |
| recall | 0.50 | 0.16 | 0.05 | 0.02 | 0.03 | 0.02 |

Commute: $m=3$, $k=600$ (1 min. 30 sec.)

| $\varepsilon$ | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| $\lvert V_{pcc} \rvert$ | 23 | 139 | 248 | 427 | 551 | 606 |
| $\lvert V_{val} \rvert$ | 16 | 96 | 205 | 341 | 477 | 543 |
| $\lvert V_{cmc} \rvert$ | 7 | 16 | 20 | 23 | 25 | 25 |
| precision | 0.43 | 0.25 | 0.15 | 0.26 | 0.28 | 0.32 |
| recall | 0.19 | 0.04 | 0.01 | 0.02 | 0.01 | 0.01 |

Figure 3.10 compares the size and the lifetime of the 145 (E)VCoDA convoys and the 16 CMC convoys found over the truck dataset with the distance threshold $\varepsilon=0.0004$ (see the bold-faced numbers in Table 3.4). In this figure, the X-axis represents the number of moving objects in a convoy and the Y-axis is the convoy lifetime. As expected, CMC discovered only small convoys consisting of three moving objects and missed all larger convoys, which was commonly observed in the other parameter settings over all three datasets.

3.4.3 Efficiency

Figure 3.11 shows the efficiency of (E)VCoDA and CMC on the three datasets, where the elapsed time of (E)VCoDA is divided to show the running time of (E)PCCD and (E)DCVal in the lower and upper bars, respectively.
Over all three datasets, (E)VCoDA took more time than CMC, which is expected because of the newly added density-connectivity validation. The time complexity of both (E)VCoDA (particularly its first phase) and CMC is affected by the total number of timestamps $|T|$, the number of moving objects $|O|$, the number of snapshot clusters $|C|$, and the number of intermediate convoy candidates $|V|$. While the first three factors are the same in both algorithms, (E)VCoDA typically maintains a much larger set $V$ of convoy candidates in order to discover all partially connected convoys. Table 3.5 summarizes the maximum size of $V$ encountered during (E)PCCD and CMC. The larger set $V$ of (E)PCCD explains the increased elapsed time of the first phase of (E)VCoDA as compared to CMC.

[Figure 3.10: Comparison of discovered convoys. X-axis: number of moving objects in a convoy. Y-axis: convoy lifetime. Series: VCoDA/EVCoDA and CMC.]

Table 3.5: Comparison of maximum $|V|$

Hurricane: $m=3$, $k=8$ (48 hrs.)

| $\varepsilon$ | 1.0 | 1.1 | 1.2 | 1.3 | 1.4 | 1.5 |
|---|---|---|---|---|---|---|
| (E)PCCD | 125 | 129 | 149 | 168 | 180 | 1835 |
| CMC | 68 | 71 | 77 | 74 | 71 | 62 |

Truck: $m=3$, $k=180$ (1 hr. 30 min.)

| $\varepsilon$ | 0.0001 | 0.0002 | 0.0003 | 0.0004 | 0.0005 | 0.0006 |
|---|---|---|---|---|---|---|
| (E)PCCD | 47 | 64 | 70 | 72 | 80 | 87 |
| CMC | 12 | 14 | 14 | 16 | 17 | 18 |

Commute: $m=3$, $k=600$ (1 min. 30 sec.)

| $\varepsilon$ | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| (E)PCCD | 14 | 21 | 26 | 27 | 28 | 28 |
| CMC | 7 | 8 | 6 | 4 | 3 | 3 |

As compared to VCoDA, EVCoDA takes up to a factor of 3 (mostly a factor of 2) less time over all three datasets. In addition, while VCoDA takes much more time than CMC on the datasets with a larger time domain, i.e., a larger number of timestamps (see the growth of elapsed time across the three datasets in Figure 3.11), the performance of EVCoDA relative to that of CMC remains the same, taking up to a factor of 2 more time than CMC on all three datasets regardless of the size of the time domain.

[Figure 3.11: Comparison of processing time in seconds. Panels: (a) Hurricane, (b) Truck, (c) Commute. X-axis: $\varepsilon$. Y-axis: elapsed time (sec.). Series: VCoDA (PCCD+DCVal), EVCoDA (EPCCD+EDCVal), CMC.]

In the hurricane dataset, the efficiency gain was achieved mainly in the first phase of EVCoDA; EPCCD took a factor of 2 less time than PCCD of VCoDA in all parameter settings. In general, the performance improvement of EPCCD is largest when the size of a partially connected convoy candidate $v \in V$ is much smaller than the number of snapshot clusters $C$ (i.e., $|v| \ll |C|$), because the size of the snapshot cluster subset $C'$ that is actually compared with each $v$ is bounded by $|v|$ in EPCCD, whereas it is bounded by $|C|$ in PCCD of VCoDA. We observed that the size of the snapshot cluster subset $C' \subseteq C$ that was actually compared with each partially connected convoy candidate $v \in V$ was significantly reduced, s.t. $|C'| \le \lfloor \frac{|v|}{m} \rfloor \ll |C|$, in the hurricane dataset, while this reduction was marginal in the other datasets (see the marginal improvement of EPCCD over PCCD in the truck and commute datasets shown in Figure 3.11). However, the running time of the second phase EDCVal of EVCoDA decreased significantly, by a factor of 5 or more, for both the truck and commute datasets, which indicates that the density-connectivity of partially connected convoys was violated mostly at the beginning and/or towards the end of their lifetime. This result confirms that the vehicles are indeed constrained to the road network, s.t. the density-connectivity stayed satisfied along the roads for a while except at intersections, while the paths of hurricanes are rather arbitrary.
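The precision and recall measures of Section 3.4.2 can be computed with a few lines of Python. The sketch below is illustrative only. It assumes that a convoy is represented as a pair of an object set and a lifetime interval and that two convoys match only when both are identical, which the text does not spell out. The helper names `convoy` and `precision_recall` are hypothetical.

```python
def convoy(objects, start, end):
    """Hashable convoy: (set of object ids, (first timestamp, last timestamp))."""
    return (frozenset(objects), (start, end))


def precision_recall(v_cmc, v_val):
    """Precision and recall of the CMC result v_cmc against the baseline v_val."""
    common = v_cmc & v_val
    precision = len(common) / len(v_cmc) if v_cmc else 0.0
    recall = len(common) / len(v_val) if v_val else 0.0
    return precision, recall


if __name__ == "__main__":
    # Toy result sets with the same sizes as the truck row for the smallest
    # threshold in Table 3.4 (two valid convoys, one CMC convoy).
    v_val = {convoy({1, 2, 3}, 2, 4), convoy({4, 5, 6}, 1, 3)}
    v_cmc = {convoy({1, 2, 3}, 2, 4)}
    print(precision_recall(v_cmc, v_val))
```

The toy sets only mirror the set sizes reported in the table; the object identifiers and lifetimes are invented for the illustration.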
3.5 Related Work

Various group patterns, e.g., a group of moving objects that travel close to each other for some duration of time, have been defined and mined over moving object trajectories. Such group patterns, satisfying both spatial and temporal proximity constraints, are potentially useful in understanding the accumulated mobilities common in moving objects and obtaining insights from the behaviors and interactions among moving objects.

Laube et al. [36] first introduced the notions of several relative motion patterns (e.g., constance, concurrence, or group turn) within groups of moving objects and proposed simple algorithms to discover these motion patterns in data. The term flock was first used in [36] to denote a cluster of snapshot locations of moving objects showing concurrence at a single instant of time. Gudmundsson et al. [22] developed efficient approximation algorithms to compute four spatio-temporal motion patterns, namely flock, leadership, convergence, and encounter. Later, Gudmundsson et al. extended the notion of snapshot flock to a long duration flock over a certain time interval, defined as a group of moving objects that move close to each other for a long consecutive duration of time. Several exact and approximation algorithms were proposed in [21] to discover longest duration flocks, and the computation was further improved by mapping spatio-temporal data into a high dimensional space and reducing the search for longest duration flocks to a sequence of standard range searches using a spatial indexing structure [7]. Al-Naymat et al. [3] used a random projection to reduce the dimensionality of the transformed space and obtained better performance. Although flock is the most extensively studied group pattern, the shape of flocks (including the variants) at each timestamp is limited to a disk of fixed radius, because the group of objects must fit within such a disk. Kalnis et al.
[32] introduced another related group pattern, called moving cluster, which is defined as a sequence of (snapshot) clusters such that two snapshot clusters at consecutive timestamps share a sufficient number of common objects. They also proposed two exact algorithms and one approximation algorithm to find such moving clusters, where a density-based clustering (e.g., DBSCAN [18]) is first performed at each timestamp to find object groups of arbitrary shape and extent, and the snapshot clusters are incrementally extended with consecutive ones while scanning through the whole time span. However, unlike flock and valid convoy, the objects in a moving cluster do not necessarily appear at every timestamp during the moving cluster lifetime.

More recently, Jeung et al. [30, 31] introduced the notion of convoy, which is a group of at least $m$ moving objects that are density-connected w.r.t. some density constraints during at least $k$ consecutive timestamps. They proposed a straightforward algorithm (CMC) and three efficient algorithms (the CuTS family) using three different trajectory simplification techniques, all of which extend the moving cluster algorithm to find groups of arbitrary shape and size. However, in comparison with (E)VCoDA, their algorithms tend to miss larger convoys and retrieve invalid ones in which the density-connectivity among the objects is not completely satisfied.

Another related group pattern was proposed by Wang et al. [57], where a user group pattern is defined as a group of users that are within a distance threshold from one another for at least a minimum duration. This group pattern generalizes the long duration flock in that the whole duration is not necessarily consecutive in time but is composed of multiple time intervals, each exceeding a minimum lifetime. Although this group pattern relaxes the consecutive time constraint, it is spatially limited to a group of fixed size and shape, like flock.
In addition, their algorithms, which extend the Apriori [2] and FP-growth [24] algorithms, are not directly applicable to our convoy discovery problem.

3.6 Chapter Summary

Discovering valid convoys from a set of moving object trajectories is a challenging problem. In this chapter, we showed that existing convoy discovery algorithms are inaccurate at finding valid convoys. Hence, we proposed two new algorithms, VCoDA and EVCoDA, to mine a complete set of correct valid convoys from a set of moving object trajectories. (E)VCoDA first finds all partially connected convoys while guaranteeing no false dismissal of any valid convoy and then validates their density-connectivity to obtain a final set of valid convoys. A straightforward algorithm and an efficient alternative are developed for each phase of both solutions. In experiments on real datasets, (E)VCoDA outperformed the current convoy algorithms in terms of precision and recall by a factor of 3 on average and up to 2 orders of magnitude, respectively. Also, our efficient algorithm EVCoDA took up to a factor of 3 less time than VCoDA.

Chapter 4 Conclusions

4.1 Contributions

In this dissertation, we dealt with the large gap between the raw sensor data that is readily available and the trajectory data that is needed for efficient data exploration, advanced data analysis, and high-level decision making. We have focused on bridging this gap from two perspectives: 1) transforming the raw sensor data into trajectory data approximated at the right resolution and quality by trajectory segmentation, and 2) summarizing a set of raw or preprocessed trajectory data at the right level of abstraction by discovering a specific type of movement pattern called a valid convoy.
In Chapter 2, we addressed the problem that conventional trajectory segmentation methods focus only on the spatial features of the movement and can over-partition trajectories in the presence of outliers, with the goal of approximating raw sensor data at the right resolution and quality. We first introduced a time-referenced distance function to measure the spatial dissimilarity between two geo-spatial locations; it is used in our segmentation process to obtain spatially and temporally homogeneous segments. In addition, we utilized the maximum movement speed to detect time-referenced spatial outliers so as to filter erroneous measurements out of the segmentation process. With the new distance function and the outlier removal, we proposed a family of three robust time-referenced trajectory segmentation algorithms that take into account both the spatial and temporal structures present in moving object trajectories. The three trajectory segmentation algorithms adopted a greedy approach based on split or merge heuristics and employed the generic forms of three representative heuristic algorithms: the top-down, bottom-up, and sliding window algorithms, respectively. Our experiments on three real-world datasets demonstrated that our techniques outperformed the conventional techniques as well as their simple temporal extensions in terms of spatio-temporal homogeneity, while maintaining comparable spatial homogeneity.

In Chapter 3, we tackled the accuracy problem of existing convoy discovery algorithms that find convoys from moving object trajectories with the purpose of summarizing large trajectory data at the right level of abstraction. We first introduced two group patterns: valid convoy and its variant, partially density-connected convoy, with relaxed density-connectivity constraints. Then we proposed two new algorithms, VCoDA and EVCoDA, that mine a complete set of correct valid convoys from given moving object trajectories. Both solutions consist of two phases. First, all partially density-connected convoys are discovered from a given set of trajectories while guaranteeing no false dismissal of any valid convoy. Then the density-connectivity of each partially connected convoy is validated to obtain a complete set of valid convoys. The straightforward VCoDA extended the existing convoy algorithm CMC in both phases. While the first phase of EVCoDA adopted the well-known moving cluster algorithm, the second phase of EVCoDA employed a branch-and-bound framework for efficient density-connectivity validation, where minimum bounding boxes were used to approximate the validation during a given time interval so as to avoid unnecessary exact validation at each timestamp. The empirical results showed that (E)VCoDA outperformed the current convoy algorithms in terms of precision and recall by a factor of 3 on average and up to 2 orders of magnitude, respectively. Also, our efficient algorithm EVCoDA took up to a factor of 3 less time than VCoDA.

4.2 Future Work

We see several avenues for extensions and future work. The performance of the proposed valid convoy discovery algorithms (VCoDA and EVCoDA) can be further improved by employing our robust time-referenced trajectory segmentation techniques. Both VCoDA and EVCoDA are sensitive to the size of the time domain of the moving object trajectories, since they scan through the entire time domain to perform clustering at each timestamp and compare two sets of snapshot clusters at every two consecutive timestamps in the search for convoy candidates. With our trajectory segmentation techniques, each moving object trajectory can be approximated with far fewer observations (i.e., a smaller number of sampled locations together with the corresponding timestamps), and erroneous observations are filtered out in this dimension reduction process. These reduced and filtered trajectories are then used as the input to (E)VCoDA in order to make the valid convoy mining process robust to outliers and more computationally efficient, by performing the density-based clustering and the snapshot cluster comparison only at a subset of the timestamps in the time domain. However, in this integrated framework, we need to determine an optimal subset of timestamps at which the clustering and the comparison are conducted in order to maximize the performance improvement: a small subset of timestamps will involve complex distance computation between two long sub-sequences of simplified trajectories to find clusters, whereas a large subset will require as many clusterings and comparisons as (E)VCoDA performs on the original unpreprocessed trajectories.

In addition, we could relax the consecutive time constraint of the valid convoy pattern to find longer movement patterns. Mining groups of moving objects that are spatially density-connected during time intervals that are not necessarily successive is closely related to frequent sequential pattern mining. A typical approach to mining frequent sequences first finds a set of frequent sequences of length 1 and then expands the length-1 frequent sequences in the search for frequent sequences of larger length. By employing the generic form of the frequent sequential pattern mining algorithm, longer convoy patterns can be discovered either by expanding snapshot clusters of time span 1 into moving object clusters of larger time span in the time domain, or by first finding groups of close moving objects of size 2 and then expanding the size-2 object groups into close object groups of larger size. To this end, existing frequent sequential pattern mining algorithms such as the Apriori-like GSP (Generalized Sequential Pattern) [50], PrefixSpan [47], or SPADE [60] can be adopted.
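The second expansion strategy, growing size-2 groups of close objects into larger groups, can be sketched in a generic Apriori-like, level-wise form. This is only an illustration of the idea and not GSP, PrefixSpan, or SPADE. It assumes that snapshot clusters have already been computed per timestamp, and the names `support`, `frequent_groups`, and the threshold `min_support` (minimum number of timestamps, not necessarily consecutive) are hypothetical.

```python
from itertools import combinations


def support(group, snapshot_clusters):
    """Timestamps at which all objects of `group` lie in one snapshot cluster."""
    return {
        t
        for t, clusters in snapshot_clusters.items()
        if any(group <= c for c in clusters)
    }


def frequent_groups(snapshot_clusters, min_support, start_size=2):
    """Level-wise search for object groups clustered together at >= min_support timestamps."""
    objects = sorted(
        set().union(*(c for cs in snapshot_clusters.values() for c in cs))
    )
    level = [frozenset(g) for g in combinations(objects, start_size)]
    level = [g for g in level if len(support(g, snapshot_clusters)) >= min_support]
    result = {}
    while level:
        for g in level:
            result[g] = support(g, snapshot_clusters)
        size = len(level[0]) + 1
        frequent = set(level)
        # Join step: merge two frequent groups that differ in exactly one object.
        candidates = {a | b for a, b in combinations(level, 2) if len(a | b) == size}
        # Prune step: every (size - 1)-subset must be frequent, then count support.
        level = [
            c
            for c in candidates
            if all(frozenset(s) in frequent for s in combinations(c, size - 1))
            and len(support(c, snapshot_clusters)) >= min_support
        ]
    return result


if __name__ == "__main__":
    clusters = {1: [{1, 2, 3}], 2: [{1, 2}, {3, 4}], 3: [{1, 2, 3}], 5: [{1, 2, 3, 4}]}
    for group, times in frequent_groups(clusters, min_support=3).items():
        print(sorted(group), sorted(times))
```

In this toy input, objects 1, 2, and 3 are clustered together at timestamps 1, 3, and 5 only, so the group is found even though its timestamps are not consecutive.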
In general, Apriori-like algorithms need to generate and examine a huge number of intermediate candidate subsequences, which is potentially an expensive operation. To break this computational bottleneck, the PrefixSpan algorithm greatly reduces the effort of candidate subsequence generation by transforming the sequence database into a set of smaller projected databases and performing the subsequence mining on each projected database. The SPADE algorithm employs another special data structure, called a vertical format database, for efficient sequential pattern mining.

Finally, as more observations are archived, the number of valid convoy patterns can become large, depending on the spatial and temporal constraints to be satisfied, which could make it harder to locate distinct or unusual convoys of interest, to summarize or visualize the overwhelming set of convoys, or to interpret the obtained convoys. Fortunately, we observed that the obtained convoys are related through certain types of transitions; for example, a group of moving objects formed a valid convoy during some duration of time and some of its members migrated to form another valid convoy at a later time, or a convoy was later absorbed by a larger convoy. Therefore, it is necessary to provide insights about the transition relationships among related convoys, which is particularly useful for summarizing the large set of valid convoy patterns and making them more meaningful and understandable in the context of applications.

References

[1] Charu C. Aggarwal and Philip S. Yu. Outlier detection for high dimensional data. In SIGMOD ’01: Proceedings of the 2001 ACM SIGMOD international conference on Management of data, pages 37–46, New York, NY, USA, 2001. ACM.
[2] Rakesh Agrawal and Ramakrishnan Srikant. Fast algorithms for mining association rules in large databases. In VLDB’94, Proceedings of 20th International Conference on Very Large Data Bases, pages 487–499, 1994.
[3] Ghazi Al-Naymat, Sanjay Chawla, and Joachim Gudmundsson. Dimensionality reduction for long duration and complex spatio-temporal queries. In SAC ’07: Proceedings of the 2007 ACM symposium on Applied computing, pages 393–397, New York, NY, USA, 2007. ACM.
[4] Mattias Andersson, Joachim Gudmundsson, Patrick Laube, and Thomas Wolle. Reporting leaders and followers among trajectories of moving point objects. Geoinformatica, 12(4):497–528, 2008.
[5] Subramanian Arumugam and Christopher Jermaine. Closest-point-of-approach join for moving object histories. In ICDE ’06: Proceedings of the 22nd International Conference on Data Engineering, page 86, Washington, DC, USA, 2006. IEEE Computer Society.
[6] Petko Bakalov, Marios Hadjieleftheriou, and Vassilis J. Tsotras. Time relaxed spatiotemporal trajectory joins. In GIS ’05: Proceedings of the 13th annual ACM international workshop on Geographic information systems, pages 182–191, New York, NY, USA, 2005. ACM.
[7] Marc Benkert, Joachim Gudmundsson, Florian Hübner, and Thomas Wolle. Reporting flock patterns. In 14th European Symposium on Algorithms, pages 660–671. Springer Berlin / Heidelberg, 2006.
[8] Ella Bingham, Aristides Gionis, Niina Haiminen, Heli Hiisilä, Heikki Mannila, and Evimaria Terzi. Segmentation and dimensionality reduction. In SIAM Data Mining Conf., 2006.
[9] Igor V. Cadez, Scott Gaffney, and Padhraic Smyth. A general probabilistic framework for clustering individuals and objects. In Proceedings of the sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 140–149, 2000.
[10] Hu Cao, Ouri Wolfson, and Goce Trajcevski. Spatio-temporal data reduction with deterministic error bounds. The VLDB Journal, 15(3):211–228, 2006.
[11] Huiping Cao, N. Mamoulis, and D.W. Cheung. Discovery of periodic patterns in spatiotemporal sequences. Transactions on Knowledge and Data Engineering, 19(4):453–467, April 2007.
[12] Huiping Cao, Nikos Mamoulis, and David W. Cheung. Mining frequent spatio-temporal sequential patterns. In ICDM ’05, pages 82–89, Washington, DC, USA, 2005.
[13] Mete Celik, Shashi Shekhar, James P. Rogers, James A. Shine, and Jin Soung Yoo. Mixed-drove spatio-temporal co-occurence pattern mining: A summary of results. In ICDM, pages 119–128, 2006.
[14] Jidong Chen, Caifeng Lai, Xiaofeng Meng, Jianliang Xu, and Haibo Hu. Clustering moving objects in spatial networks. In DASFAA, pages 611–623, 2007.
[15] Lei Chen, M. Tamer Özsu, and Vincent Oria. Robust and fast similarity search for moving object trajectories. In SIGMOD ’05, pages 491–502, New York, NY, USA, 2005. ACM Press.
[16] Somayeh Dodge, Robert Weibel, and Anna-Katharina Lautenschütz. Towards a taxonomy of movement patterns. Information Visualization, 7(3):240–252, 2008.
[17] David H. Douglas and Thomas K. Peucker. Algorithm for the reduction of the number of points required to represent a line or its caricature. Cartographica: The Int. Journal for Geographic Information and Geovisualization, 10(2):112–122, December 1973.
[18] Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. pages 226–231. AAAI Press, 1996.
[19] Scott Gaffney and Padhraic Smyth. Trajectory clustering with mixtures of regression models. In Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 63–72. ACM Press, 1999.
[20] Fosca Giannotti, Mirco Nanni, Fabio Pinelli, and Dino Pedreschi. Trajectory pattern mining. In KDD ’07, pages 330–339, New York, NY, USA, 2007. ACM.
[21] Joachim Gudmundsson and Marc van Kreveld. Computing longest duration flocks in trajectory data. In GIS ’06: Proceedings of the 14th annual ACM international symposium on Advances in geographic information systems, pages 35–42, New York, NY, USA, 2006. ACM.
[22] Joachim Gudmundsson, Marc van Kreveld, and Bettina Speckmann. Efficient detection of motion patterns in spatio-temporal data sets. In GIS ’04: Proceedings of the 12th annual ACM international workshop on Geographic information systems, pages 250–257, New York, NY, USA, 2004. ACM.
[23] Marios Hadjieleftheriou, George Kollios, Petko Bakalov, and Vassilis J. Tsotras. Complex spatio-temporal pattern queries. In VLDB ’05: Proceedings of the 31st international conference on Very large data bases, pages 877–888. VLDB Endowment, 2005.
[24] Jiawei Han, Jian Pei, and Yiwen Yin. Mining frequent patterns without candidate generation. In SIGMOD ’00: Proceedings of the 2000 ACM SIGMOD international conference on Management of data, pages 1–12, 2000.
[25] David J. Hand, Padhraic Smyth, and Heikki Mannila. Principles of data mining. MIT Press, Cambridge, MA, USA, 2001.
[26] John Hershberger and Jack Snoeyink. Speeding up the Douglas-Peucker line-simplification algorithm. In Proc. 5th Intl. Symp. on Spatial Data Handling, pages 134–143, 1992.
[27] Hiroshi Imai and Masao Iri. Computational-geometric methods for polygonal approximations of a curve. Comput. Vision Graph. Image Process., 36(1):31–41, 1986.
[28] Christian S. Jensen, Dan Lin, and Beng Chin Ooi. Continuous clustering of moving objects. IEEE Transactions on Knowledge and Data Engineering, 19(9):1161–1174, 2007.
[29] Seung-Hyun Jeong, Norman W. Paton, Alvaro A. A. Fernandes, and Tony Griffiths. An experimental performance evaluation of spatio-temporal join strategies. Transactions in GIS, 9(2):129–156, 2005.
[30] Hoyoung Jeung, Heng Tao Shen, and Xiaofang Zhou. Convoy queries in spatio-temporal databases. Data Engineering, 2008. ICDE 2008. IEEE 24th International Conference on, pages 1457–1459, April 2008.
[31] Hoyoung Jeung, Man Lung Yiu, Xiaofang Zhou, Christian S. Jensen, and Heng Tao Shen. Discovery of convoys in trajectory databases. Proc. VLDB Endow., 1(1):1068–1080, 2008.
[32] Panos Kalnis, Nikos Mamoulis, and Spiridon Bakiras. On discovering moving clusters in spatio-temporal data. In Advances in Spatial and Temporal Databases, pages 364–381. Springer Berlin / Heidelberg, 2005.
[33] Eamonn J. Keogh, Selina Chu, David Hart, and Michael J. Pazzani. An online algorithm for segmenting time series. In ICDM ’01, pages 289–296, 2001.
[34] Edwin M. Knorr, Raymond T. Ng, and Vladimir Tucakov. Distance-based outliers: Algorithms and applications. VLDB Journal: Very Large Data Bases, 8(3–4):237–253, 2000.
[35] Hans-Peter Kriegel and Martin Pfeifle. Clustering moving objects via medoid clusterings. In SSDBM’2005: Proceedings of the 17th international conference on Scientific and statistical database management, pages 153–162, Berkeley, CA, US, 2005. Lawrence Berkeley Laboratory.
[36] Patrick Laube and Stephan Imfeld. Analyzing relative motion within groups of trackable moving point objects. In GIScience ’02: Proceedings of the Second International Conference on Geographic Information Science, pages 132–144, London, UK, 2002. Springer-Verlag.
[37] Jae-Gil Lee, Jiawei Han, and Xiaolei Li. Trajectory outlier detection: A partition-and-detect framework. In ICDE, pages 140–149, 2008.
[38] Jae-Gil Lee, Jiawei Han, Xiaolei Li, and Hector Gonzalez. TraClass: trajectory classification using hierarchical region-based and trajectory-based clustering. PVLDB, 1(1):1081–1094, 2008.
[39] Jae-Gil Lee, Jiawei Han, and Kyu-Young Whang. Trajectory clustering: a partition-and-group framework. In SIGMOD ’07, pages 593–604, 2007.
[40] Xiaolei Li, Jiawei Han, Jae-Gil Lee, and Hector Gonzalez. Traffic density-based discovery of hot routes in road networks. In SSTD, pages 441–459, 2007.
[41] Yifan Li, Jiawei Han, and Jiong Yang. Clustering moving objects. In KDD ’04: Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 617–622, New York, NY, USA, 2004. ACM.
[42] Zhilin Li. An algorithm for compressing digital contour data.
The Cartographic Journal, 25(2):143– 146, 1988. [43] Jessica Lin, Eamonn J. Keogh, Li Wei, and Stefano Lonardi. Experiencing sax: a novel symbolic representation of time series. Data Min. Knowl. Discov., 15(2):107–144, 2007. [44] Nikos Mamoulis, Huiping Cao, George Kollios, Marios Hadjieleftheriou, Yufei Tao, and David W. Cheung. Mining, indexing, and querying historical spatiotemporal data. In KDD ’04: Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 236–245, New York, NY , USA, 2004. ACM. 78 [45] G. Papakonstantinou, P. Tsanakas, and G. Manis. Parallel approaches to piecewise linear approxima- tion. Signal Process., 37(3):415–423, 1994. [46] Theodosios Pavlidis and Steven L. Horowitz. Segmentation of plane curves. IEEE Trans. on Comp., 23(8):860–870, 1974. [47] Jian Pei, Jiawei Han, Behzad Mortazavi-Asl, Helen Pinto, Qiming Chen, Umeshwar Dayal, and Mei-Chun Hsu. Prefixspan: Mining sequential patterns by prefix-projected growth. In International Conference on Data Engineering, pages 215–224, 2001. [48] Dieter Pfoser and Christian S. Jensen. Capturing the uncertainty of moving-object representations. Lecture Notes in Computer Science, 1651:111–131, 1999. [49] Myra Spiliopoulou, Irene Ntoutsi, Yannis Theodoridis, and Rene Schult. Monic: modeling and monitoring cluster transitions. In KDD ’06: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 706–711, New York, NY , USA, 2006. ACM. [50] Ramakrishnan Srikant and Rakesh Agrawal. Mining sequential patterns: Generalizations and perfor- mance improvements. In EDBT, pages 3–17, 1996. [51] Jill Campbell Stewart, Shih-Ching Yeh, Younbo Jung, Hyunjin Yoon, Maureen Whitford, Shu-Ya Chen, Lei Li, Margaret McLaughlin, Albert Rizzo, and Carolee J. Winstein. Pilot trial results from a virtual reality system designed to enhance recovery of skilled arm and hand movements after stroke. 
In IWVR ’06, New York City, USA, August 2006. [52] Ilias Tsoukatos and Dimitrios Gunopulos. Efficient mining of spatiotemporal patterns. In SSTD ’01: Proceedings of the 7th International Symposium on Advances in Spatial and Temporal Databases, pages 425–442, London, UK, 2001. Springer-Verlag. [53] Florian Verhein. Mining complex spatio-temporal sequence patterns. In SDM, pages 605–616, 2009. 79 [54] Florian Verhein and Sanjay Chawla. Mining spatio-temporal patterns in object mobility databases. Data Min. Knowl. Discov., 16(1):5–38, 2008. [55] M. Vlachos, G. Kollios, and D. Gunopulos. Discovering similar multidimensional trajectories. In ICDE, 2002. [56] Junmei Wang, Wynne Hsu, and Mong Li Lee. Mining generalized spatio-temporal patterns. In DAS- FAA ’05: database systems for advanced applications, pages 649–661, London, UK, 2005. Springer- Verlag. [57] Yida Wang, Ee-Peng Lim, and San-Yih Hwang. Efficient mining of group patterns from user move- ment data. Data Knowl. Eng., 57(3):240–282, 2006. [58] Yutaka Yanagisawa, Jun ichi Akahani, and Tetsuji Satoh. Shape-based similarity query for trajectory of mobile objects. In MDM ’03, pages 63–77, London, UK, 2003. Springer-Verlag. [59] Byoung-Kee Yi, H. V . Jagadish, and Christos Faloutsos. Efficient retrieval of similar time sequences under time warping. In ICDE ’98: Proceedings of the Fourteenth International Conference on Data Engineering, pages 201–208, Washington, DC, USA, 1998. IEEE Computer Society. [60] Mohammed Javeed Zaki. Spade: An efficient algorithm for mining frequent sequences. Machine Learning, 42(1/2):31–60, 2001. [61] Panfeng Zhou, Donghui Zhang, Betty Salzberg, Gene Cooperman, and George Kollios. Close pair queries in moving object databases. In GIS ’05: Proceedings of the 13th annual ACM international workshop on Geographic information systems, pages 2–11, New York, NY , USA, 2005. ACM. 80
Abstract
A moving object trajectory is a series of locations of a moving object sampled at discrete instants of time. Real-world moving object trajectories acquired by location-aware sensors typically involve a large number of moving objects, a massive number of observations, and noisy measurements. In addition, the trajectory data is available only in the form of bulky point clouds of sampled locations and timestamps. Such raw sensor data is therefore neither at the right resolution and quality nor at the right level of abstraction for efficient data exploration, advanced data analysis, and high-level decision making.
Asset Metadata
Creator
Yoon, Hyunjin (author)
Core Title
From raw sensor data to moving object trajectories at right resolution, quality, and abstraction
School
Viterbi School of Engineering
Degree
Doctor of Philosophy
Degree Program
Computer Science
Publication Date
11/18/2009
Defense Date
09/24/2009
Publisher
University of Southern California (original), University of Southern California. Libraries (digital)
Tag
convoy pattern mining, moving object trajectories, OAI-PMH Harvest, trajectory segmentation
Language
English
Contributor
Electronically uploaded by the author (provenance)
Advisor
Shahabi, Cyrus (committee chair), Nakano, Aiichiro (committee member), Winstein, Carolee J. (committee member)
Creator Email
hjy@usc.edu,hyunjin.yoon@gmail.com
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-m2744
Unique identifier
UC1498174
Identifier
etd-Yoon-3332 (filename),usctheses-m40 (legacy collection record id),usctheses-c127-276770 (legacy record id),usctheses-m2744 (legacy record id)
Legacy Identifier
etd-Yoon-3332.pdf
Dmrecord
276770
Document Type
Dissertation
Rights
Yoon, Hyunjin
Type
texts
Source
University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Repository Name
Libraries, University of Southern California
Repository Location
Los Angeles, California
Repository Email
cisadmin@lib.usc.edu