INFERRING MOBILITY BEHAVIORS FROM TRAJECTORY DATASETS
by
Mingxuan Yue
A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(Computer Science)
December 2021
Copyright 2021 Mingxuan Yue
Acknowledgements
Foremost, my sincere thanks go to my advisor, Prof. Cyrus Shahabi, for his invaluable advice,
continuous support, and patience during my Ph.D. study. He taught me how to exploit the resources
around me to support my research. He also guided me through most of the important steps of doing
research, such as defining a problem, validating the idea with experiments, and using the right
narrative and precise language in paper writing. I am always amazed by his talented presentation
skills, which I believe will influence me throughout my career. And I feel fortunate to learn from
him about his humorous communication, unique vision of new technologies, and critical thinking.
Being a student of Cyrus is a precious treasure in my life.
Additionally, I would like to thank Prof. Liyue Fan and Prof. Yao-Yi Chiang for mentoring me
during my first two years and the last two years in my Ph.D. program. It was a great experience
to work with them on various research topics. I also enjoyed the short collaboration with Dr.
Ugur Demiryurek on the interesting projects and interviews with the Annenberg School. Besides,
I gratefully recognize the unforgettable experience with Prof. Gabriel Kahn and the Xtown team.
It has been a great time that we incubated Xtown together out of a course. I also would like to
thank Prof. Tianshu Sun for connecting me with Alibaba and working together on the exciting
delivery problem. I would like to extend my sincere thanks to my dissertation committee: Haipeng
Luo, Tianshu Sun, Mahdi Soltanolkotabi, and Craig Knoblock, for giving valuable feedback to my
thesis.
I feel proud and grateful as a member of Infolab and IMSC at USC. I enjoyed a fantastic time
and received tremendous support from my labmates: Yaguang Li, Ritesh Ahuja, Kien Nguyen,
Luan Tran, Giorgos Constantinou, Abdullah Alfarrarjeh, Dimitrios Stripelis, Chrysovalantis Anas-
tasiou, Haowen Lin, Jiao Sun, Sepanta Zeighami, Arvin Hekmati, Hien To, Dingxiong Deng, Ying
Lu, Minh Nguyen, Tian Xie, Chaoyang He, and Sina Shaham. I want to especially thank my coau-
thors Muhao Chen, Yaguang Li, Ritesh Ahuja, Jiao Sun, Haoze Yang, for their great help in my
research. And I also appreciate Kien Nguyen, Luan Tran, and Dimitrios Stripelis’s help during the
preparation of my thesis. In addition, I would like to show my gratitude to Daisy Tang, Lizsl De
Leon, and other university staff for their patient help with my academic life.
Last but not least, my warm and heartfelt thanks go to my family for the tremendous support
and hope they have given me. I thank my wife and best friend, Zhuozhen Zhao. Her
bright smile and healing hug have been my greatest source of strength during the past five years. I
would like to extend my deepest gratitude to my parents. Their unconditional support gave me the
courage to face any challenge.
Table of Contents
Acknowledgements ii
List of Tables vii
List of Figures viii
Abstract x
Chapter 1: Introduction 1
1.1 Motivation and Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Thesis Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
Chapter 2: Related Work 7
2.1 Trajectory clustering for mobility behavior inference . . . . . . . . . . . . . . . . 7
2.1.1 Trajectory clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.1.2 Deep-learning-based clustering . . . . . . . . . . . . . . . . . . . . . . . 8
2.2 Region representation for mobility behavior inference . . . . . . . . . . . . . . . . 9
2.2.1 Region representation learning . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2.2 Graph representation learning . . . . . . . . . . . . . . . . . . . . . . . . 10
Chapter 3: Deep Trajectory Clustering for Mobility-Behavior Analysis 11
3.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.2 DETECT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.2.1 Stay point detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.2.2 Augmenting Geographical Context . . . . . . . . . . . . . . . . . . . . . 15
3.2.3 Phase I - Clustering with neural network . . . . . . . . . . . . . . . . . . . 16
3.2.4 Phase II - Joint optimization for a good clustering . . . . . . . . . . . . . . 18
3.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.3.1 Experimental Settings . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.3.2 Performance Comparison with Baselines . . . . . . . . . . . . . . . . . . 23
3.3.3 Ablation Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.3.4 Parameter study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.3.4.1 Qualitative Evaluation . . . . . . . . . . . . . . . . . . . . . . . 28
3.3.4.2 Scalability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
Chapter 4: A Variational Approach for Mobility Behavior Clustering 34
4.1 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.2 The VAMBC Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.2.1 Decomposing Hidden Variables . . . . . . . . . . . . . . . . . . . . . . . 36
4.2.2 Training Objectives and Neural Layers . . . . . . . . . . . . . . . . . . . 38
4.2.3 Network Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.2.4 Relationship to VAE and Gaussian-Mixture VAE . . . . . . . . . . . . 41
4.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.3.1 Environment and Experiment Settings . . . . . . . . . . . . . . . . . . . . 43
4.3.2 Quantitative Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.3.3 Ablation Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.3.4 The Training Progress of VAMBC . . . . . . . . . . . . . . . . . . . . 49
4.3.5 Qualitative Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
Chapter 5: Region Representation Learning for Mobility Behavior Analysis 53
5.1 Learning a Contextual and Topological Representation of Areas-of-Interest for On-
Demand Delivery Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
5.1.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
5.1.2 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
5.1.2.1 Learning contextual representation from trajectories . . . . . . . 58
5.1.2.2 Learning topological representation from graphs . . . . . . . . . 61
5.1.2.3 Jointly learning one representation by a multi-view ranking au-
toencoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
5.1.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.1.3.1 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.1.3.2 Experimental Settings . . . . . . . . . . . . . . . . . . . . . . . 66
5.1.3.3 Evaluation with ETA Prediction . . . . . . . . . . . . . . . . . . 68
5.1.3.4 Model Interpretation . . . . . . . . . . . . . . . . . . . . . . . . 68
5.2 Learning Region Embeddings from Trajectories by Capturing Their Mobility Con-
texts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.2.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5.2.1.1 Notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5.2.1.2 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . 76
5.2.2 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
5.2.2.1 Heterogeneous Graph Construction . . . . . . . . . . . . . . . . 77
5.2.2.2 Heterogeneous Graph Encoding . . . . . . . . . . . . . . . . . . 81
5.2.2.3 Optimization Objective . . . . . . . . . . . . . . . . . . . . . . 82
5.2.2.4 Dynamic Grid Partitioning . . . . . . . . . . . . . . . . . . . . 85
5.2.2.5 Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
5.2.3 Experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
5.2.3.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . 87
5.2.3.2 Baseline approaches . . . . . . . . . . . . . . . . . . . . . . . . 88
5.2.3.3 Mobility Behavior Clustering . . . . . . . . . . . . . . . . . . . 89
5.2.3.4 Next Location Prediction . . . . . . . . . . . . . . . . . . . . . 90
5.2.3.5 POI correlation . . . . . . . . . . . . . . . . . . . . . . . . . . 91
5.2.3.6 Ablation Study . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
5.2.3.7 Hyperparameter Study . . . . . . . . . . . . . . . . . . . . . . . 93
5.2.3.8 Qualitative Analysis . . . . . . . . . . . . . . . . . . . . . . . . 94
Chapter 6: Conclusions and Future Work 95
References 97
List of Tables
3.1 Stats of datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.2 Clustering performance of all approaches. . . . . . . . . . . . . . . . . . . . . . . 24
3.3 Performance with varying level of augmentation. . . . . . . . . . . . . . . . . . . 25
3.4 Mean and Standard Deviation of MAE for DETECT Phase I with varying latent
embedding dimension d . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
4.1 Clustering performance comparison . . . . . . . . . . . . . . . . . . . . . . . . . 47
5.1 ETA prediction performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5.2 Notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
5.3 Performance on Mobility Behavior Clustering . . . . . . . . . . . . . . . . . . . . 90
5.4 Performance on Next Location Prediction . . . . . . . . . . . . . . . . . . . . . . 91
5.5 Performance on Correlation with POI data . . . . . . . . . . . . . . . . . . . . . . 92
List of Figures
1.1 The mobility behavior of a trajectory has little relevance to the shape or spatio-
temporal scales of the trajectory. It is more relevant to the activities (can be inferred
from geographical context) of a few most crucial points. For example, the two
trajectories in the figure have different shapes and locations, but both follow a
“rest → work → fun” pattern . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
3.1 Accelerated Stay Point Detection: Fast-SPD . . . . . . . . . . . . . . . . . . . . . 15
3.2 Augment stay points with geographical features . . . . . . . . . . . . . . . . . . . 16
3.3 DETECT Phase I and Phase II: The augmented trajectories are fed to the recurrent
autoencoder in Phase I. Phase I learns a hidden embedding z encoding the context
dynamics of the trajectories, upon which an initial cluster assignments is generated
via k-means. Phase II jointly refines the embedding (via the encoder) and the
cluster assignments. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.4 Ablation study on compared approaches. DETECT outperforms the adapted base-
lines with feature augmentation (marked with *) on all metrics and for both datasets
(relatively little improvements are seen in DMCL, since it spans only two users
with limited mobility variation). . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.5 Effect of DETECT parameters. (a) Effect of the number of clusters k. (b) Effect of
the learning rate lr . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.6 Clustering results visualization of “DETECT” and “DETECT Phase I” using t-
SNE with perplexity 40 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.7 Projecting raw trajectories clusters onto the map. (a) Colored based on predicted
cluster labels (b) Colored based on ground-truth classes . . . . . . . . . . . . . . . 30
3.8 Internal validation of clustering quality. . . . . . . . . . . . . . . . . . . . . . . . 32
3.9 Clusters of all trajectories in GeoLife. . . . . . . . . . . . . . . . . . . . . . . . . 32
3.10 Visualization of a detected GeoLife trajectory cluster with locations of recreational
parks in red circles. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.1 VAMBC network structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.2 Compare the graphic notations of a GMVAE and VAMBC . . . . . . . . . . . . . 43
4.3 Changes of metrics by variants over their training epochs . . . . . . . . . . . . . . 48
4.4 Visualization of the training progress . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.5 Latent embedding by different approaches . . . . . . . . . . . . . . . . . . . . . . 50
4.6 Reconstruction of a random sample of inputs by different approaches . . . . . . . . 51
4.7 Decoded context sequences from cluster embeddings . . . . . . . . . . . . . . . . 52
4.8 Interpolate between two context sequences from the same cluster . . . . . . . . . . 52
5.1 Different ways of counting co-occurring AOIs . . . . . . . . . . . . . . . . . . . . 61
5.2 Learn contextual representation from trajectories. . . . . . . . . . . . . . . . . . . 62
5.3 Calculate proximity to other nodes by different-step random walks . . . . . . . . . 63
5.4 DeepMARK Network Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.5 Learning curves of ETA prediction by different representations . . . . . . . . . . . 69
5.6 Embedding visualization by t-SNE . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.7 Visualize the change of ranking loss between the topological view and the contex-
tual view . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
5.8 Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
5.9 Mobility Heterogeneous Graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
5.10 Performance of model variants on clustering mobility behaviors . . . . . . . . . . 92
5.11 Effect of parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
5.12 Regions merged by dynamic partitioning. . . . . . . . . . . . . . . . . . . . . . . 94
Abstract
Identifying people’s mobility behaviors (e.g., work commute, shopping) in rich trajectory data is
of great economic and social interest to various applications, including location/trip recommen-
dations, geo-targeting/advertisements, urban planning, anomaly detection, and epidemiology. Inferring
the mobility behaviors is challenging as it requires a robust unsupervised clustering technique and
effective mobility-related features to cluster trajectories of various spatial and temporal scales into
groups, each of which follows the same mobility behavior. Specifically, my thesis tackles the
following three challenges.
First, it is difficult to infer the mobility behavior directly from a trajectory since the raw coor-
dinates do not provide useful information about the surrounding environment of the visited loca-
tions. Existing trajectory clustering approaches usually rely on pre-defined distance measurement
and usually group trajectories with similar shapes and spatial (and temporal) scales together rather
than group the trajectories with the same mobility behavior. In this case, trajectories in different
groups may still belong to the same mobility behavior, e.g., the school commutes may occur at
different locations by different transportation modes and are assigned to different groups by these
approaches. To overcome this challenge, we propose DETECT, which extracts salient points in
the trajectories and augments them with auxiliary geographical features retrieved from the Point
of Interest data. In this way, each trajectory is transformed into a context sequence, i.e., an ordered
list of real-valued feature vectors, each describing the “context” of a visited location (e.g., sports,
shopping, or dining venues) in the trajectory. Rather than using pre-defined distance measure-
ments, DETECT is data-driven, employing a two-phase deep learning procedure that first learns
fixed-size embeddings of variable-length trajectories and then optimizes a clustering objective for a
better separation of clusters.
Second, the robustness of the clustering approaches on the context sequences could be further
improved to have a more accurate and stable inference of mobility behaviors. Existing deep-
learning-based clustering approaches (including DETECT) usually employ a two-phase procedure
and are sensitive to a lossy initialization. Therefore, we propose a variational clustering method
called VAMBC, which can simultaneously learn the fixed-size embeddings and the cluster assign-
ments in a single phase and produce robust clustering results. In addition, unlike other variational
approaches that could collapse to trivial solutions, VAMBC separates the information of individ-
ual trajectories and common patterns of clusters in the embedding space and encourages sufficient
involvement of the cluster membership in creating the embeddings to avoid producing poor clus-
tering results.
Third, effective mobility-related features are of great importance in this problem, but one cannot
assume the auxiliary geographical data is always available for generating these features. Hence,
we also investigate approaches to learn region representations from trajectories without using any
auxiliary data such as Points of Interest. We first propose DeepMARK for learning a representation
of regions to support delivery time estimation. Then, we study learning a representation for broader
applications, including the mobility behavior inference and next location prediction. In this study,
we propose a framework based on a heterogeneous graph neural network, in which we exploit rich
mobility-related attributes and relationships of regions, users, and temporal periods involved in
trajectories. Within this framework, we design a mobility-related objective via customized random
walks and learn effective region embeddings by encoding information from neighboring nodes in
the graph. We demonstrate the advantage of the learned region representation over various baseline
approaches in three downstream tasks using a real-world dataset.
Chapter 1
Introduction
1.1 Motivation and Challenges
The rapid proliferation of mobile devices has led to the collection of vast amounts of GPS trajec-
tories by location-based services and ride-sharing apps. In these trajectories, sophisticated human
mobility behaviors are encoded. We define mobility behavior (of a trajectory) as the travel activity
that describes a user’s movements regardless of the spatial and temporal distances that he covers.
For example, work-to-home commute is one such mobility behavior that varies widely in terms of
the area and the time to complete the activity.
Understanding mobility behaviors has enormous value, ranging from location/trip recommendations
and geo-targeting/advertisements to urban planning, anomaly detection, epidemiology, and
other social sciences [1–3]. For example, in targeted advertising, a gas station owner can infer that
the major mobility behavior of trajectories passing through their gas station is work commute, thus
deciding to display financial ads on weekdays. For public health during a pandemic, policymakers
could make a better decision on what types of businesses should be closed to reduce mobility more
effectively in different neighborhoods. For instance, they may decide to close restaurants/pubs in
a neighborhood where the majority of mobility behavior is for entertainment but not in a neigh-
borhood where the main mobility behavior is for work commute. The former closure impacts
entertainment, but the latter will impact work. For economic studies, by labeling work commutes,
we can estimate the number of jobs, or the number of working people, in the neighborhoods of a
Figure 1.1: The mobility behavior of a trajectory has little relevance to the shape or spatio-temporal
scales of the trajectory. It is more relevant to the activities (which can be inferred from the geographical
context) of a few most crucial points. For example, the two trajectories in the figure have different
shapes and locations, but both follow a “rest → work → fun” pattern.
city. For transportation studies, when repairing (or constructing a new) bridge or a freeway seg-
ment, policymakers could better determine the dates and times for construction by considering the
major mobility behavior passing through the area (e.g., weekend closure vs. weekdays).
Manually labeling each trajectory with its mobility behavior is prohibitively expensive and re-
quires expert skills, making unsupervised learning methods highly desirable. Hence we formulate
the problem of mobility-behavior analysis as a clustering task on trajectories. However, clustering
the trajectories into different mobility behaviors has many challenges.
First, trajectories with similar mobility behavior exhibit various spatial and/or temporal ranges
of movement. For example, a “commute to work” behavior could take from 10 minutes to up to
1 hour via different transportation modes. Even with the same transportation modality, the travel
time fluctuates at different times of the day (e.g., rush hours), requiring us to account for varying
temporal scales within the same mobility behavior. Likewise, the spatial range of movement also
varies, e.g., from a mile to more than forty, for trajectories of users living near or far away from
their workplaces. Traditional trajectory clustering techniques (e.g., [4–7]) group trajectories based
on raw spatial and temporal distances that are sensitive to variation in the spatio-temporal scale.
These methods fail to cluster mobility behaviors, and instead produce simple clusters, each with a
similar spatio-temporal range of movement.
Second, the mobility behavior of a trajectory is highly dependent on characteristics of places
and environments in which it occurs, henceforth called the geographical context. For example,
the “grocery shopping” behavior encompasses trajectories with very different travel times and
shapes depending on the user’s choice of a supermarket. However, in this instance, knowing the
geographical context, i.e., the supermarket, is more informative than the exact shape and duration
of the trajectory. It is therefore necessary to integrate geographical information at each point of the
trajectory to achieve good clustering quality. Figure 1.1 illustrates an example of the importance
of the geographical context and the effect of raw spatial and temporal features.
In addition, even if the trajectories are condensed to salient points and augmented with the
geographical context, we still need an accurate and robust clustering framework that can discover
similar transition patterns of the geographical context along the trajectory. Such transition pat-
terns are usually driven by the trajectory data and vary from one population to another. However,
traditional time series clustering approaches are usually based on pre-defined similarity and align-
ments between sub-sequences and shapes [8–10]. Recent advanced autoencoder-based clustering
approaches using Recurrent Neural Networks (RNNs) can convert sequences of real-valued feature
vectors into a fixed-length vector for clustering the dynamics in the sequences using a two-phase
training process [11]. But such two-phase approaches [11–13] could learn a poor initial feature
representation (i.e., irrelevant to the clustering objective) in the first phase, which cannot be refined
in the second phase to improve the clustering performance. Consequently, clustering accuracy can-
not be guaranteed across training variations [14].
The last important concern is the utilization of auxiliary geographical data used for augment-
ing the trajectories. Such auxiliary information plays an essential role in the learning of mobility
behaviors but is not always available or has limited quality and access in many cities or coun-
tries [15–17]. Therefore, the remaining challenge is how to learn representations of regions that
could capture the mobility-related contexts when the auxiliary information is not available. Many
existing region representation studies rely on information from some auxiliary data such as Point of
Interests (POIs) or social media data [18, 19]. These works apparently are not applicable to solve
our problem since they still require auxiliary data sources. Other studies proposed approaches that
merely require mobility (trajectory) data to learn region embeddings. For example, Yao et al. pro-
posed to conduct a decomposition of a co-occurrence matrix generated from trips with a gravity
model to optimize the region embedding [18]. However, the trajectory data carry much richer in-
formation than simply a sequence of visits. Such information includes the preference of users who
generate the trajectories, the order and duration of regions being visited, the temporal semantics
(such as day or night) of each visit, and the commonality of the users, times, and regions across
trajectories. Yet, these mobility contexts of trajectories were not fully exploited by the existing
research on region representation.
1.2 Thesis Statement
Based on the above motivations, we state that: Inferring mobility behaviors from all-scale tra-
jectories requires accurate and robust clustering techniques by exploiting explicit or implicit
geographical contexts of the trajectories.
In this thesis, we propose four deep-learning-based approaches to tackle the challenges of this
problem described in Section 1.1. Specifically, the contributions of my thesis are listed below:
• To mitigate the effect of variant scales and augment the trajectories with explicit geograph-
ical contexts, we propose an end-to-end clustering framework called DETECT. Particularly,
DETECT first transforms the trajectories by summarizing their critical parts and augmenting
them with context derived from their geographical locality (e.g., using POIs from gazetteers).
In the second part, it learns a powerful representation of trajectories in the latent space via
an autoencoder structure, thus enabling a clustering function (such as k-means) to be ap-
plied. Finally, a clustering-oriented loss is directly built on the embedded features to jointly
perform feature refinement and cluster assignment, thus improving separability between mo-
bility behaviors.
• To improve the training accuracy and robustness over DETECT and other two-phase clus-
tering approaches for mobility behavior analysis, we propose a variational model called
VAMBC. It simultaneously learns the self-supervision and cluster assignments in a single
phase to infer mobility behaviors from context transitions in trajectories. Specifically,
VAMBC explicitly decomposes the cluster latent representation and the individualized latent
representation via two reparameterization layers. Such decomposition and the finely de-
signed network enable the model to learn the self-supervision and cluster structure jointly
without collapsing to trivial solutions.
• Considering learning region representations from trajectories, we first propose a solution called
DeepMARK for the delivery time estimation problem. As the estimation of delivery time
requires the region representations to incorporate both the topological relationships and
human preferences, DeepMARK first encodes trajectories conducted by humans and ge-
ometrical graphs of regions into homogeneous views, and then trains a multi-view autoen-
coder to learn the representation of areas using a ranking-based loss.
• To develop a comprehensive region representation that captures the mobility contexts and
can be used in more general cases like mobility behavior clustering and next location pre-
diction, we propose a region embedding framework based on a heterogeneous graph neural
network. Specifically, we model the associations and attributes of regions, users, and times
concealed in the trajectories with a graph with multiple types of nodes and edges (so-called
heterogeneous graph). We use a relational graph attention network to encode regions on the
graph and define a metapath constrained random-walk objective that models the proximity
of mobility contexts.
1.3 Thesis Outline
The structure of the thesis is organized as follows: Chapter 2 introduces the related studies of
trajectory clustering and region representation learning. In Chapter 3, we study how to infer mobility
behaviors from various scales of trajectories with the utilization of explicit geographical features
from auxiliary data. And then in Chapter 4, we further improve the accuracy and robustness of the
clustering performance with a variational model. In Chapter 5, we investigate approaches to learn
region representations as implicit geographical features for supporting mobility behavior analy-
sis without auxiliary information. Finally, we summarize our contributions and discuss potential
future work to extend the contributions of this study in Chapter 6.
Chapter 2
Related Work
2.1 Trajectory clustering for mobility behavior inference
2.1.1 Trajectory clustering
Trajectory clustering is an important problem and has received significant attention in the past
decade. Most techniques utilize raw trajectories (e.g., [5]) with a variety of distance/similarity
measurements such as the classic Euclidean, Hausdorff, LCSS, DTW, Frechet, SSPD [20] or ad-
hoc measurements suited to specific applications (e.g., [7, 21]). To deal with trajectories of large
cardinality, some techniques resort to finding local patterns in sub-trajectories. Lee et al. [4]
partition trajectories based on the minimum description length (MDL) principle and then cluster the partitioned tra-
jectories. However, these methods operate over raw data points and remain sensitive to the wide
range of spatio-temporal scales in human mobility patterns [6]. Since raw GPS coordinates and
timestamps, in and of themselves do not provide semantic information regarding a trajectory, sev-
eral techniques use movement characteristics derived from the raw data for clustering tasks. In
recent work, Yao et al. [22] extract the speed, acceleration, and change of “rate of turn” (ROT) of
each point in a trajectory as the input sequence for clustering. Wang et al. [23] annotate trajec-
tories with movement labels such as “Enter”, “Leave”, “Stop”, and “Move” and use these labels
to detect events from trajectories. However, just like the approaches based on raw trajectory data,
these methods completely fail when dealing with scale-variant trajectories and require carefully ex-
tracting movement characteristics, which often do not represent mobility behaviors. Recent work
in the privacy literature [24, 25] make efforts to capture mobility semantics so as to anonymize
real trajectories or synthesize fake trajectories that are indiscernible from real data. Most related
to our work is [26], which presents methods to mine patterns in trajectories (e.g., work-to-pub).
These patterns are limited to places visited one-after-the-other by a user. Hence, they rely on using
the check-in point of interest as their data input along with a description of the POI (as opposed
to GPS data inputs in our method). In addition, they utilize distance-based clustering to discover
regions (as opposed to mobility behaviors) where these patterns occur frequently.
2.1.2 Deep-learning-based clustering
Recently, deep-learning-based clustering approaches have been widely investigated. One major group
of these studies are based on various autoencoders and clustering layers [11–13, 27]. A few ap-
proaches in this group are designed for sequential data, such as [11, 27], by using RNN layers for
sequence modeling. However, these autoencoder-based approaches usually employ a two-phase
training, whose performance relies heavily on the result of its first phase (self-
supervision), which does not always align with the clustering objective and thus is not robust to training
variations. On the other hand, recent variational approaches can jointly learn the self-supervision
and clustering structure from the beginning using various variational assumptions about the hidden
space. For example, Dupont [28] proposed to use both a continuous latent space and a discrete la-
tent space and concatenate the layers for reconstruction. Dilokthanakul et al. [29], Jiang et al. [30]
and Rui et al. [31] proposed variational autoencoder models with a Gaussian Mixture assumption
in the latent space for capturing the manifolds.
2.2 Region representation for mobility behavior inference
2.2.1 Region representation learning
In the past decade, much effort has been made to mine useful knowledge from the rapidly grow-
ing mobility data produced by mobile devices. Earlier work studied inferring the land usage of
regions using trajectory and POI data and annotating each region with a tag representing its func-
tion [18, 19, 32–35]. For example, Yuan et al. adopted topic models to annotate the regions of a
city (Beijing, in their case) into nine functional categories using POI and taxi data [19]. Pan et al.
defined several empirical features to classify the regions into eight social functions [32]. Though
the topical annotation of regions is useful for coarse-grained predictive tasks for urban planning, it
is not enough for the fine-grained or high-frequency decision making tasks such as location fore-
casting, recommendation, and trajectory understanding. Therefore, recent researchers focused on
learning a high-quality vector representation for regions to support various downstream tasks. For
instance, Zhang et al. defined types of proximity between regions based on mobility similarity,
POI attributes and social check-in attributes [33] and learned region embeddings that integrate all
types of proximity. Zhang et al. built a cross-modal representation learning framework that de-
tects hotspots from massive geo-tagged social media data, and mapped the regions and texts into
low-dimensional space [34]. However, these approaches require rich auxiliary information from
different data sources whose quality and availability vary in different places. Other studies pro-
posed approaches that merely require mobility (trajectory) data to learn region embeddings. Yao
et al. proposed the decomposition of a co-occurrence matrix with a gravity model to optimize the
region embedding [18]. Yue et al. jointly incorporated the co-occurrence information and physical
proximity to the region embeddings and leveraged the embeddings for package delivery predic-
tion [35]. However, these approaches have not fully exploited the heterogeneous associations of
regions, users, and temporal periods involved in the trajectory data.
2.2.2 Graph representation learning
Graph embeddings seek to learn a vector representation for each node in a graph. The literature has
proposed various approaches for different types of graph structures. The classical graph embedding
approaches like Deepwalk [36], LINE [37] and Node2vec [38] applied word2vec models to random
walks on homogeneous graphs and learned embeddings of nodes based on their co-occurrences in
random-walk paths. In addition, some other works utilize deep neural networks to learn more
neighboring information through an autoencoder framework [39, 40]. Furthermore, recent efforts
seek to incorporate node features to the embedding in addition to the graph structure such as in [41,
42]. With the development of Graph Neural Networks (GNNs), researchers have proposed to incorporate
neighbor attributes or messages to learn node representations. Methods in this line of research typ-
ically leverage different types of neighborhood aggregation operations, such as GCN [43], Graph
Attention [44] and random projections [45], to capture the node proximity based on neighborhood
structures. Another neighborhood aggregation approach is GraphSAGE [46], which realizes gen-
eralized neighborhood message passing operations and can learn the structure and the features of
local neighbors in an inductive way. A recent survey [47] has systematically summarized more
works in this line.
As more applications usually involve different types of nodes and edges, recent studies have
focused on learning node embeddings on heterogeneous graphs. Among those, meta-path based
methods[48–51] are most related to this work. Specifically, Metapath2vec [48] was proposed to
learn embeddings of authors/venues by restricting walks to specific metapaths on the heteroge-
neous graph. HIN2Vec [49] learns a single-layer neural regressor on metapaths and produces
embeddings using the hidden weights of the regressor. Ying et al. extended GraphSAGE to a het-
erogeneous graph of pins and boards in a media sharing network to learn embeddings for media
recommendation.
Chapter 3
Deep Trajectory Clustering for Mobility-Behavior Analysis
In this chapter, we propose the Deep Embedded TrajEctory ClusTering network (DETECT), as a
unified framework to cluster trajectories according to their mobility behaviors. DETECT operates
in three parts: in the first, it summarizes the critical parts of the trajectory and augments them with
context derived from their geographical locality (e.g., using POIs from gazetteers). The augmented
trajectories now incorporate semantics essential for identifying mobility behaviors, but are still of
variable lengths. In the second part, DETECT handles variable-length input by adapting an au-
toencoder [52], trained over a large volume of unlabeled trajectories. The autoencoder instantiates
an architecture for learning the distributions of latent random variables to model the variability ob-
served in trajectories. On the learned embedding in the low-dimensional feature space, a clustering
function (e.g., k-means) is applied. In the last part—as the most computationally intensive proce-
dure of DETECT—a clustering oriented loss is directly built on embedded features to jointly per-
form embedding refinement and cluster assignment. The joint optimization iteratively and finely
updates the embedding towards a high confidence clustering distribution by improving separability
between mobility behaviors. In summary, we make the following major contributions:
• We propose a powerful unsupervised neural framework to cluster GPS trajectories for the
problem of mobility behavior detection, while addressing scale-variance and context-absence
of raw GPS points.
• We propose a novel feature augmentation process for mobility behavior analysis that aug-
ments the GPS trajectories with effective input features; that, by providing mobility context,
can even improve the performance of baseline approaches.
• We propose a neural network architecture that takes as input the augmented trajectory fea-
tures to embed in a fixed-length latent space of behaviors, and gradually improves the em-
bedding for a better clustering.
• We conduct extensive quantitative and qualitative evaluation on real-world trajectory datasets
to show that the proposed approach outperforms the state-of-the-art approaches significantly,
with improvements ranging from at least 41% up to 252% across well-established clustering metrics.
3.1 Preliminaries
Problem. We define mobility behavior clustering as the problem of grouping trajectories in such
a way that trajectories in the same group have similar transitions of travel activity context.
Data. The input is a set of raw trajectories, where each trajectory $s = \{s^{(1)}, s^{(2)}, \ldots, s^{(T)}\}$ is a time-ordered sequence of spatio-temporal points. Each point $s^{(t)}$ consists of a pair of spatial coordinates (i.e., latitude, longitude) and its timestamp. A spatio-temporal point of a trajectory often does not coincide with a Point-Of-Interest (POI), which represents a location someone may find useful or interesting, such as a high school, a business office, or a wholesale store. We represent a POI with a spatial coordinate and a major category of service $m \in M$ (e.g., education, commerce, or shopping).
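For concreteness, the following minimal Python sketch shows one plausible in-memory representation of these inputs; the class and field names are illustrative assumptions rather than part of DETECT's implementation.

from dataclasses import dataclass
from typing import List

@dataclass
class Point:
    lat: float   # latitude of the spatio-temporal point
    lon: float   # longitude of the spatio-temporal point
    time: float  # timestamp, e.g., seconds since the epoch

# A trajectory s = {s^(1), ..., s^(T)} is a time-ordered list of points.
Trajectory = List[Point]

@dataclass
class POI:
    lat: float
    lon: float
    category: int  # index m in {0, ..., M-1} of the major service category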
3.2 DETECT
DETECT is comprised of three parts. In the first part (Section 3.2), DETECT operates over the
input raw trajectories, and summarizes the critical parts of the trajectory as stay points, incurring
little loss of information. Stay points form ideal candidates to discover the context of the travel
activity through a representation of their geographical locality as feature vectors. In the next part,
labeled DETECT Phase I (Section 3.2.3), the context-augmented trajectories are embedded into a
latent space of mobility behaviors. The low-dimensional latent space enables a simple clustering
function to then be applied. Finally, in the third part (Section 3.2.4), labeled DETECT Phase II, the
clustering assignment is refined by updating the latent embedding via a clustering oriented loss.
Feature Augmentation
3.2.1 Stay point detection
Trajectories widely vary in their spatial and temporal range of movements. For each trajectory, the
numerous GPS points recorded generally do not contribute significant information to support the
detection of mobility behavior, which tends to be correlated with the context of the stops where the
actual activity occurs. We argue the same empirically in Section 3.3.3. In this work, we explore
stay point detection [53] as a type of trajectory summarization customized to reducing the spatio-
temporal scale in trajectories. First introduced in [53] for discovering location-embedded social
structure, stay points also offer several benefits to neural learning on trajectories. Due to their
condensed format, they improve learning efficiency and enable the neural layer to learn a better
representation.
Definition 1 (Stay Points [53]). A stay point $\dot{s}^{(t)}$ of trajectory $s$ is a spatio-temporal point which is the geometric center of a longest sub-trajectory $s^{(i \to j)} \subseteq s$, $s^{(i \to j)} = \{s^{(i)}, s^{(i+1)}, \ldots, s^{(j)}\}$, such that $s^{(i \to j)}$ is a staying subtrajectory, and neither $s^{(i-1 \to j)}$ nor $s^{(i \to j+1)}$ is a staying subtrajectory.
In the original work, the algorithm to extract stay points (denoted SPD) has a computational complexity of $O(T^2)$ (where $T$ is the length of the trajectory), which does not scale to real-world datasets with lengthy trajectories. We propose Fast-SPD, an output-sensitive algorithm for stay point detection that exploits spatial indexes (e.g., R-Tree) to reduce the search space and quickly identify trajectory points close in space and time. Fast-SPD relies on efficient parsing of staying subtrajectories, defined as follows.
Definition 2 (Staying Subtrajectory). A staying subtrajectory $s^{(i \to j)} \subseteq s$, $s^{(i \to j)} = \{s^{(i)}, s^{(i+1)}, \ldots, s^{(j)}\}$, of trajectory $s$ is a contiguous sub-sequence of $s$ such that, within $s^{(i \to j)}$, the trajectory is limited to a range $r_s$ in space and its duration $s^{(j)}.time - s^{(i)}.time$ is longer than a specified threshold $r_t$.
Algorithm 1: Fast-SPD($s$, $r_s$, $r_t$)
1:  $i \leftarrow 0$;  $\dot{s} \leftarrow \{\}$
2:  while $i < len(s) - 1$ do
3:      if $dist(s^{(i)}, s^{(i+1)}) > r_s$ then
4:          $i \leftarrow i + 1$; continue
5:      end
6:      $cands \leftarrow s^{(i+1 \to T)} \cap b(s^{(i)}, r_s)$
7:      $neighbors \leftarrow CompareIdx(cands, s^{(i+1 \to T)})$
8:      if $last(neighbors).time - s^{(i)}.time > r_t$ then
9:          $neighbors \leftarrow neighbors \cup \{s^{(i)}\}$
10:         $\dot{s} \leftarrow \dot{s} \cup \{average(neighbors)\}$
11:         $i \leftarrow i + len(neighbors)$
12:     end
13:     $i \leftarrow i + 1$
14: end
15: $\dot{s} \leftarrow \{s^{(0)}\} \cup \dot{s} \cup \{s^{(T)}\}$
16: return $\dot{s}$
The pseudocode of Fast-SPD is presented in Algorithm 1. In brief, the algorithm iterates through a trajectory point by point, finds all staying subtrajectories, and then generates a stay point from each. We elaborate the procedure for stay point detection with an example in Figure 3.1a. The search begins with a pivot point $s^{(i)}$ (depicted as a red point). The algorithm first filters the candidate points (yellow points) by joining a spatial buffer (dotted circle) with the remaining points of the trajectory. Among the identified candidates, the consecutive points are refined and termed neighbors (yellow points in the red-border box). If these neighbors cover a long enough period in time, i.e., $last(neighbors).time - s^{(i)}.time > r_t$, then these points, together with the pivot point $s^{(i)}$, constitute a staying subtrajectory (the pink box). Lastly, the algorithm extracts the geometric center of the staying subtrajectory as a stay point. Fast-SPD then skips the visited points and continues scanning the rest of the trajectory.
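A minimal Python sketch of Fast-SPD is given below, reusing the Point record from the earlier sketch. It follows Algorithm 1 but replaces the R-Tree buffer join with a simple linear forward scan; the helper names (dist_m, geometric_center) and the haversine distance are our own illustrative choices, not the dissertation's implementation.

import math

def dist_m(p, q):
    # Haversine distance in meters between two records with lat/lon attributes.
    R = 6371000.0
    la1, la2 = math.radians(p.lat), math.radians(q.lat)
    dlat = la2 - la1
    dlon = math.radians(q.lon - p.lon)
    a = math.sin(dlat / 2) ** 2 + math.cos(la1) * math.cos(la2) * math.sin(dlon / 2) ** 2
    return 2 * R * math.asin(math.sqrt(a))

def geometric_center(points):
    n = len(points)
    return Point(lat=sum(p.lat for p in points) / n,
                 lon=sum(p.lon for p in points) / n,
                 time=sum(p.time for p in points) / n)

def fast_spd(s, r_s, r_t):
    # s: list of Point; r_s: spatial range in meters; r_t: duration threshold in seconds.
    stay_points = []
    i = 0
    while i < len(s) - 1:
        if dist_m(s[i], s[i + 1]) > r_s:        # the next point already left the buffer
            i += 1
            continue
        # Consecutive successors of the pivot s[i] that stay within the spatial buffer.
        neighbors = []
        j = i + 1
        while j < len(s) and dist_m(s[i], s[j]) <= r_s:
            neighbors.append(s[j])
            j += 1
        if neighbors[-1].time - s[i].time > r_t:
            # The pivot and its neighbors form a staying subtrajectory.
            stay_points.append(geometric_center([s[i]] + neighbors))
            i += len(neighbors)                  # skip the visited points
        i += 1
    # Keep the trajectory endpoints, as in line 15 of Algorithm 1.
    return [s[0]] + stay_points + [s[-1]]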
Figure 3.1: Accelerated Stay Point Detection: Fast-SPD. (a) SPD illustration: pivot, spatial buffer, candidates, neighbors, and the resulting staying subtrajectory. (b) Running time (s) of SPD and Fast-SPD versus trajectory cardinality (<1k to 5k points).
Compared to SPD, Fast-SPD reduces the time complexity of each search for a stay point from $O(c)$ to $O(\log T)$, where $c$ (usually $c > \log T$) is the length of a staying subtrajectory and $T$ is the length of the trajectory. Figure 3.1b depicts the running time of stay point detection with respect to the cardinality of the trajectory. Fast-SPD does not suffer on short trajectories (under 1k points) but is up to three times more efficient on trajectories with many GPS points.
3.2.2 Augmenting Geographical Context
Given the extracted stay points, the next step is to incorporate the context of the surrounding en-
vironment (called geographical context) at each stay point. Tobler's First Law of Geography states that “everything is related to everything else, but near things are more related than distant things” [54]. This phenomenon forms the basis for the study of Geographical Influences, which has seen wide adoption in the literature, from POI recommendation [55–57] to air quality prediction [58]. In our work, the geographical context consists of the categories (e.g., restaurants, apartments, hospitals) of the points of interest in the local vicinity of a stay point. DETECT transforms each point into a geographical context feature vector, defined as follows.
Definition 3 (Geographical Context Features). The geographical context features of a stay point $\dot{s}_i^{(t)}$ are represented as a vector $x_i^{(t)} = \{x_{i,1}^{(t)}, x_{i,2}^{(t)}, \ldots, x_{i,M}^{(t)}\}$, where a feature $x_{i,m}^{(t)}$ denotes the contribution of the $m$-th POI major category.
We briefly illustrate (in Figure 3.2) the procedure to derive the feature vector from the information in the locality of a stay point. For each point $\dot{s}^{(t)}$ (large solid points) in a stay point trajectory $\dot{s} = \{\dot{s}^{(t)}\}_{t=1}^{T}$, we execute a range search (circles) of radius $r_{poi}$ centered at the stay point. Next, within each circle, we summarize the counts of POIs in every category (small points in different colors) and normalize them by the total count within the circle. More precisely, the POIs annotated with $M$ categories within the circle are counted and normalized to a vector $x_i^{(t)} = \{x_{i,1}^{(t)}, x_{i,2}^{(t)}, \ldots, x_{i,M}^{(t)}\}$. Thus, each stay point is transformed to store its local geographical context.
Using spatial range search has the benefit of including a broader geographical context for the
augmented trajectory than using a few nearest POIs, which is sensitive to outliers.
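A short Python sketch of this feature augmentation step is shown below. It reuses the dist_m helper and POI record from the earlier sketches; the function name and the plain linear scan (rather than a spatial index) are illustrative assumptions.

def geo_context_features(stay_point, pois, r_poi, num_categories):
    # Count POIs of each major category within radius r_poi (meters) of the stay point
    # and normalize by the total count inside the circle.
    counts = [0.0] * num_categories
    for poi in pois:
        if dist_m(stay_point, poi) <= r_poi:
            counts[poi.category] += 1.0
    total = sum(counts)
    # If no POI falls inside the buffer, return an all-zero context vector.
    return [c / total for c in counts] if total > 0 else counts

# An augmented (context) trajectory is then the ordered list of such vectors,
# one per stay point returned by fast_spd:
#   x_i = [geo_context_features(sp, pois, r_poi, M) for sp in fast_spd(s, r_s, r_t)]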
In summary, Feature Augmentation is a crucial part of DETECT. It transforms raw GPS trajectories into scale-free, context-augmented trajectories $\{x_i\}_{i=1}^{n}$, with little loss in information but ample gain in the geographical context imperative to clustering mobility behaviors.
Figure 3.2: Augment stay points with geographical features (different categories of POIs).
3.2.3 Phase I - Clustering with neural network
DETECT Phase I operates in two steps. In the first, it constructs a continuous latent space of behaviors by learning to embed and generate context-augmented trajectories via a fully unsupervised objective. While the input is the set of augmented trajectories with arbitrary lengths, the output is a fixed-length latent embedding that encodes sufficient context to facilitate accurate clustering of mobility behaviors. In the final step of this phase, DETECT applies a clustering function on the dimension-reduced latent space. Formally, for each context-augmented stay point trajectory $x_i \in \{x_i\}_{i=1}^{n}$, in the first step we adopt an RNN autoencoder [59, 60] with parameters $\Theta$ to learn an embedding $z_i = f_\Theta(x_i)$ in the latent space of behaviors. Finally, we apply a clustering function to output $k$ clusters, each represented by its centroid $\mu_j \in \{\mu_j\}_{j=1}^{k}$.

Figure 3.3: DETECT Phase I and Phase II. The augmented trajectories are fed to the recurrent autoencoder in Phase I. Phase I learns a hidden embedding $z$ encoding the context dynamics of the trajectories, upon which an initial cluster assignment is generated via k-means. Phase II jointly refines the embedding (via the encoder) and the cluster assignments.
The recurrent autoencoder model consists of a recurrent encoder and a recurrent decoder (see Figure 3.3 for an illustration). Similar to the mechanics of a general autoencoder [59], the task of the LSTM encoder is to encode the input context-augmented trajectories as a latent embedding, and then, using a recurrent decoder, reconstruct the trajectories solely from the embedding. The model is trained to minimize the error of reconstruction, thus learning a representative embedding that fully captures the movement transitions and the context within a trajectory. Consider $x_i = \{x_i^{(t)}\}_{t=1}^{T_i}$ as the augmented trajectory with length $T_i$. $x_i$ is fed sequentially to the recurrent encoder comprised of several LSTM units [61]. The encoder updates the hidden state $h_{enc}^{(t)}$ and other parameters with each passing unit $t$, $h_{enc}^{(t)} = \sigma(h_{enc}^{(t-1)}, x^{(t)})$, where $\sigma$ is the activation function for the neural layer. The last hidden state $h_{enc}^{(T_i)}$ is called the latent embedding $z_i$, and is assumed to summarize the information necessary to represent the entire trajectory sequence. Next, the decoder tries to reconstruct the trajectory with $h_{enc}^{(T_i)}$ as its initial state, $h_{dec}^{(1)} = h_{enc}^{(T_i)}$. With $h_{dec}^{(1)}$, the decoder generates $\hat{x}^{(1)}$. The hidden states that follow are generated recursively as $h_{dec}^{(t)} = \sigma(h_{dec}^{(t-1)}, \hat{x}^{(t-1)})$ and $\hat{x}^{(t)} = \sigma(h_{dec}^{(t)})$. For the trajectory data, the encoder and decoder are trained together to minimize the reconstruction error:

$$\ell_r = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{T_i}\sum_{t=1}^{T_i}\left(x_i^{(t)} - \hat{x}_i^{(t)}\right)^2 \qquad (3.1)$$
The decoder can reconstruct the entire augmented trajectory from the latent embedding $z_i$, which implicitly encodes transition patterns of the geographical context. To conclude, using the neural function $z_i = f_\Theta(x_i)$, Phase I maps the augmented trajectories $x_1, x_2, \ldots, x_n$ to their corresponding fixed-length embeddings $z_1, z_2, \ldots, z_n$, before applying a clustering function (such as k-means) on the latent space of behaviors to produce what we call soft cluster assignments.
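The sketch below illustrates Phase I in PyTorch: an LSTM encoder producing the fixed-size embedding $z_i$, a decoder that reconstructs the context sequence from it, and the reconstruction loss of Equation (3.1). It is a minimal sketch under simplifying assumptions (a single LSTM layer, trajectories padded to a common length, a linear output layer), not the dissertation's exact architecture.

import torch
import torch.nn as nn

class RecurrentAutoencoder(nn.Module):
    def __init__(self, feat_dim, latent_dim):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, latent_dim, batch_first=True)
        self.decoder = nn.LSTM(feat_dim, latent_dim, batch_first=True)
        self.out = nn.Linear(latent_dim, feat_dim)

    def forward(self, x):
        # x: (batch, T, feat_dim) padded context-augmented trajectories.
        _, (h_enc, c_enc) = self.encoder(x)
        z = h_enc[-1]                                   # latent embedding z_i
        # Decode recursively, initializing the decoder with the encoder's final state.
        step = torch.zeros(x.size(0), 1, x.size(2), device=x.device)
        state = (h_enc, c_enc)
        outputs = []
        for _ in range(x.size(1)):
            dec, state = self.decoder(step, state)
            step = self.out(dec)                        # x_hat^(t), fed back at step t+1
            outputs.append(step)
        x_hat = torch.cat(outputs, dim=1)
        return z, x_hat

def reconstruction_loss(x, x_hat):
    # Equation (3.1): mean squared reconstruction error over points and trajectories.
    return ((x - x_hat) ** 2).mean()

After training with this loss, the embeddings $z$ of all trajectories are collected and k-means is run on them to obtain the initial cluster centroids used by Phase II.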
3.2.4 Phase II - Joint optimization for a good clustering
A straightforward clustering over the embedded trajectories does not produce clusters tailored for
mobility behavior analyses (as we show empirically in Section 3.3). Therefore, in Phase II, DE-
TECT jointly refines the embedding and the cluster assignment to improve the separability of
clusters (i.e., improving both the latent embedding and the clusters of behaviors). We achieve this
using an unsupervised clustering objective, following the recent advancements in deep neural clus-
tering [13, 62, 63]. The design of the clustering loss is based on the assumption that in the initial
clusters, points that are very close to the centroid are likely to be correctly predicted/clustered, i.e.,
the high confidence predictions. Learning from these, the model improves the overall clustering
iteratively, by aligning the low confidence counterparts.
We develop an objective function customized to amplify the “clustering cleanness”, i.e., mini-
mize the similarity between clusters and maximize the similarity between points in the same clus-
ter. In particular, the clustering refinement process iteratively minimizes the distance between the
current cluster distribution Q and an auxiliary target cluster distribution P, which is a distribu-
tion derived from high confidence predictions of Q. Intuitively, Q describes the probability of an
augmented trajectory belonging to the $k$ tentative mobility behaviors, whereas P is generated by
reinforcing the probability of high-confidence trajectories.
Following the work in [64], we represent the distribution of the current embedding Q as the Student t-distribution on the current cluster centers. Given the centroid of the $j$-th cluster $\mu_j$ and the latent embedding of the $i$-th trajectory $z_i$, we calculate the current distribution as

$$q_{ij} = \frac{\left(1 + \|z_i - \mu_j\|^2\right)^{-1}}{\sum_{j'}\left(1 + \|z_i - \mu_{j'}\|^2\right)^{-1}} \qquad (3.2)$$
where $q_{ij}$ is the probability of assigning $z_i$ to $\mu_j$, also read as the probability of assigning trajectory $x_i$ to the $j$-th tentative mobility behavior. In the current distribution, the low-confidence points are assumed to be assigned to poor clusters, and hence in need of refinement toward better clustering cleanness.
We derive an auxiliary distribution made up of the high-confidence assignments of the current distribution. The goal of this self-training target distribution is to (1) emphasize data points assigned with high confidence, and (2) normalize the loss contribution to prevent the sizes of clusters from negatively impacting the latent embeddings. Equation (3.3) defines our target distribution P, where $p_i$ is computed by first raising $q_i$ to the second power and then normalizing by the frequency per cluster. The second power of probabilities places more weight on the instances near the centroids. The division by $\sum_{i'} q_{i'j}$ normalizes the different cluster sizes, making the model robust to biased classes in the data.

$$p_{ij} = \frac{q_{ij}^2 / \sum_{i'} q_{i'j}}{\sum_{j'}\left(q_{ij'}^2 / \sum_{i'} q_{i'j'}\right)} \qquad (3.3)$$
To measure the distance between the distributions P and Q (as defined above), we use the K-L divergence, a widely known distribution-wise asymmetric distance measure. The clustering-oriented loss is defined as

$$\ell_c = KL(P \,\|\, Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}} \qquad (3.4)$$
Lastly, as the most computationally intensive step, we iteratively minimize the distance between the soft clustering assignment Q and the auxiliary distribution P by jointly training the recurrent encoder (in turn, the latent embedding) and the cluster assignment. We train our model using Stochastic Gradient Descent, minimizing the clustering loss defined as a K-L divergence. Updates are made in an iterative manner to the latent trajectory embedding, through training on its own high-confidence clustering assignments and refining cluster centroids, resulting in distinct mobility behavior boundaries. For completeness, we present the computation of the gradients $\frac{\partial\ell}{\partial z_i}$ and $\frac{\partial\ell}{\partial \mu_j}$ in Equation (3.5), and leave the derivation of the gradients propagated backward to the recurrent encoder (i.e., $\frac{\partial\ell}{\partial \Theta}$) as an exercise.

$$\frac{\partial\ell}{\partial z_i} = 2\sum_{j=1}^{k}\left(z_i - \mu_j\right)\left(p_{ij} - q_{ij}\right)\left(1 + \|z_i - \mu_j\|^2\right)^{-1}$$
$$\frac{\partial\ell}{\partial \mu_j} = 2\sum_{i=1}^{n}\left(z_i - \mu_j\right)\left(q_{ij} - p_{ij}\right)\left(1 + \|z_i - \mu_j\|^2\right)^{-1} \qquad (3.5)$$
The final output of DETECT is an encoder that is finely tuned to learn a data representation specialized for clustering, without ground-truth cluster membership labels.
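A minimal PyTorch sketch of the Phase II objective is given below: the soft assignment Q of Equation (3.2), the target distribution P of Equation (3.3), and the KL loss of Equation (3.4). Treating the centroids as trainable parameters initialized from k-means and letting automatic differentiation propagate the gradients of Equation (3.5) is an implementation assumption consistent with, but not copied from, the text.

import torch

def soft_assignment(z, mu):
    # Equation (3.2): Student's t-kernel similarity between embeddings and centroids.
    # z: (n, d) latent embeddings; mu: (k, d) cluster centroids.
    dist_sq = torch.cdist(z, mu) ** 2
    q = 1.0 / (1.0 + dist_sq)
    return q / q.sum(dim=1, keepdim=True)

def target_distribution(q):
    # Equation (3.3): emphasize high-confidence assignments, normalized per cluster size.
    weight = q ** 2 / q.sum(dim=0, keepdim=True)
    return weight / weight.sum(dim=1, keepdim=True)

def clustering_loss(q, p):
    # Equation (3.4): KL(P || Q).
    return (p * (p.log() - q.log())).sum()

# Illustrative Phase II step: P is recomputed periodically from a frozen copy of Q,
# then the encoder and the centroids mu are refined by gradient descent on KL(P || Q).
#   q = soft_assignment(encoder_embeddings, mu)
#   p = target_distribution(q).detach()
#   loss = clustering_loss(q, p); loss.backward(); optimizer.step()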
3.3 Experiments
In this section, we quantitatively and qualitatively evaluate DETECT for trajectory clustering with
extensive experiments on two real-world trajectory datasets.
3.3.1 Experimental Settings
Datasets. We utilize two datasets for the evaluation of our proposed approach: the GeoLife
dataset [65] and the DMCL dataset [66]. The GeoLife dataset consists of 17,621 trajectories
generated by 182 users from April 2007 to August 2012. To validate our approach against the
ground-truth, we extract a subset of the data comprised of 601 trajectories from 11 users. This is
a common methodology for mobility analysis and its applications [1, 67, 68]. Furthermore, we re-
trieve POI information from the PKU Open Research Data [69]. This dataset contains over 14,000
20
Table 3.1: Stats of datasets
Dataset stats min max mean std.
GeoLife Duration (min) 0.91 1177.96 192.55 257.62
Length (km) 0.01 11.68 2.36 2.15
DMCL Duration (min) 15.45 651.2 314.21 234.42
Length (km) 0.004 38.9 11.84 17.65
POIs in Beijing, falling into 22 major categories (i.e.jMj= 22) including “education”, “transporta-
tion”, “company” and “shopping”. The DMCL dataset consists of trajectories in Illinois, United
States. It contains 90 complete trajectories generated by two users over six months, and is used
as whole. The POI data is scraped from OpenStreetMap (OSM) [70]. It contains 30,401 POIs
in Illinois, subject to 9 major categories such as “public” and “accommodation”. Table 3.1 gives
some statistics on the duration and lengths of the trajectories in both datasets. In the last
Data Preparation and Ground-truth. The ground-truth is prepared by manually labelling the
datasets. The labelling is performed by an expert, through a meticulous process, bereft of any knowledge
of the clustering membership labels. More precisely, labels are generated in an iterative manner
by first visualizing each trajectory by georeferencing all its GPS points on the map using Mapbox
GL¹. Subsequently, the duration between any two consecutive check-ins and the type of
surrounding buildings are studied via a Google Maps plugin before making the judgement on
its mobility behavior. In the GeoLife dataset, six mobility behaviors were identified as the ground-truth
classes: “campus activities”, “hangouts”, “dining activities”, “healthcare activities”, “working
commutes”, and “studying commutes”. In the DMCL dataset, four mobility behaviors were
identified: “studying commutes”, “residential activities”, “campus activities”, and “hangouts”. The
labelling process consumed 60 hours of labor, and the generated dataset is publicly released here².
¹ https://github.com/mapbox/mapboxgl-jupyter
² https://tinyurl.com/y5a3r3oy

Compared Approaches. We evaluate the following approaches:
• KM-DBA [8, 71]: DBA stands for Dynamic Time Warping Barycenter Averaging. It utilizes
DTW as its distance measure for k-means clustering on sequential data, before performing a
sophisticated averaging step. We set the parameter k = 6 (4) for the GeoLife (DMCL) dataset,
respectively.
• DB-LCSS [72]: DB-LCSS uses DBSCAN, a density-based clustering approach, in conjunction
with LCSS as its distance measure between raw trajectories. For GeoLife (DMCL,
respectively) we set the common sequence threshold as 0.15 (1.5e-6) for LCSS, and ε =
0.03 (1e-6) and min_samples = 18 (2) as the neighborhood thresholds in DBSCAN.
• SSPD-HCA [20]: Symmetrized Segment-Path Distance is a shape-based distance metric
particularly suited to measuring similarity between location trajectories. It utilizes an ag-
glomerative hierarchical clustering procedure with the Ward’s criterion for choosing the pair
of clusters to merge at each step.
• KM-DBA*, DB-LCSS*: For a more interesting comparison, we adapt KM-DBA and DB-
LCSS methods to work on context augmented trajectories.
• RNN-AE [22, 73]: We label as RNN-AE the simple process of training a recurrent autoen-
coder on the segmented raw trajectories and then clustering the learned embedding using
k-means.
• DETECT Phase I, DETECT: We evaluate two variations of DETECT: the first one, termed
“DETECT Phase I”, only includes the first phase, while the second variant, termed “DETECT”,
includes both Phases I and II. RNN-AE is identical to DETECT Phase I in all aspects
except that it is trained on raw trajectory data.
Finally, we note that in the interest of space we omit the comparison of clustering with 1) HU
distance [74] and 2) PCA decomposition. HU distance is computed as the average Euclidean dis-
tance between points on two trajectories. The PCA distance is similar to HU but works in a lower
dimensional space via PCA shape decomposition. Both these methods are inferior to the above
baselines for trajectory clustering [72].
Evaluation metrics. We evaluate the extent to which cluster labels match externally supplied class
labels according to four well-established external metrics: 1) Rand Index (RI) measures the simple
accuracy, i.e., the percentage of correct cluster predictions. 2) Mutual Information (MI) measures
the mutual dependency between the clustering result and the ground-truth, i.e., how much
information one can infer from the other; zero mutual information indicates that the clustering labels
are independent of the ground-truth classes. 3) Purity measures how pure the clustering
results are, i.e., whether the trajectories in the same cluster belong to the same ground-truth class. 4)
Fowlkes-Mallows Index (FMI) measures the geometric mean of the pairwise precision and recall,
which is robust to noise.
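As a reference, the sketch below shows one way to compute these four metrics with scikit-learn; Purity has no built-in helper, so it is derived from the contingency matrix, and rand_score assumes a reasonably recent scikit-learn version.

```python
import numpy as np
from sklearn import metrics

def purity(y_true, y_pred):
    # Fraction of samples whose cluster's majority class matches their own class.
    c = metrics.cluster.contingency_matrix(y_true, y_pred)
    return c.max(axis=0).sum() / c.sum()

def external_scores(y_true, y_pred):
    return {
        "RI": metrics.rand_score(y_true, y_pred),
        "MI": metrics.mutual_info_score(y_true, y_pred),
        "Purity": purity(y_true, y_pred),
        "FMI": metrics.fowlkes_mallows_score(y_true, y_pred),
    }

# Toy example with ground-truth classes and predicted cluster ids.
print(external_scores([0, 0, 1, 1, 2, 2], [1, 1, 0, 0, 2, 2]))
```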
Training. We implement our approaches on a computer with an Intel Core i7-8850H CPU, 16 GB
of RAM, and an NVIDIA GeForce GTX 1080 GPU. We implement the kMeans+DTW using
tslearn [75], and the DBSCAN clustering using Sklearn with the LCSS distance³. The proposed deep
embedded neural network is built using Keras [76] with Tensorflow [77].

³ https://github.com/maikol-solis/trajectory_distance
3.3.2 Performance Comparison with Baselines
We quantitatively evaluate the clustering quality of DETECT against all baselines on the GeoLife
Dataset. The improvements of DETECT over the compared approaches all passed paired t-tests
with significance value p < 0.03. The results are depicted in Table 3.2. DETECT clearly outperforms
all compared approaches. The relative performance against the baseline approaches varies
across metrics, ranging from at least a 41% improvement (in FMI against KM-DBA) up to a
252% improvement (in RI against DB-LCSS). Even the primitive neural approach RNN-AE com-
petes in performance with the alignment based methods, demonstrating the advantage of a neural
approach in modeling the transitions within raw GPS trajectories.

Table 3.2: Clustering performance of all approaches.

Method              RI     MI     Purity  FMI
KM-DBA              0.33   0.64   0.58    0.58
DB-LCSS             0.22   0.55   0.51    0.56
RNN-AE              0.39   0.46   0.56    0.53
SSPD-HCA            0.52   0.93   0.66    0.67
KM-DBA*             0.51   0.91   0.74    0.63
DB-LCSS*            0.50   0.95   0.64    0.66
DETECT Phase I      0.65   1.06   0.84    0.73
DETECT              0.76   1.26   0.89    0.81
Impr. over KM-DBA   132%   98%    54%     41%
Impr. over DB-LCSS  252%   131%   74%     46%

Moreover, DETECT Phase I
produces significant quality improvements over highly customized distance-based metrics such as
SSPD, confirming that the autoencoder trained on context augmented trajectories can accurately
model context transitions in the latent space of behaviors.
3.3.3 Ablation Study
In order to understand the influence of feature augmentation, neural embedding, and cluster refine-
ment individually, we conduct an ablation study by isolating the effects of each procedure.
Benefit of stay point detection and geographical augmentation. On the GeoLife dataset, Table 3.3
presents the clustering quality of KM-DTW on trajectories with progressive levels of feature
augmentation: 1) raw trajectory is the simple sequence of GPS points; 2) stay points only trajec-
tory is the sequence of spatiotemporal points extracted from the raw trajectory by Fast-SPD; 3)
geographical only trajectory is the sequence of geographical vectors generated at each point in
the raw trajectory, rather than at stay points; and 4) fully augmented trajectory is the sequence of
geographical vectors generated at each stay point extracted from the raw trajectory.
It is clear that methods using fully augmented trajectories outperform methods with a
weaker degree of augmentation. The stay-points-only trajectory data are comparable in performance
to the raw trajectory input because neither incorporates the geographical context that is essential
to learning a rich embedding in the latent space of mobility behaviors. As a result, the geographical-only
trajectory data has better accuracy than both. However, when context augmentation is
combined with stay point extraction (i.e., fully augmented trajectories), the two procedures complement
each other: Fast-SPD discards the irrelevant geographical context features of GPS points
that lie in-between the critical stay points of a trajectory, thus achieving a result that is better
than the sum of its parts.

Table 3.3: Performance with varying level of augmentation.

Data Type              RI    MI    Purity  FMI
Raw trajectory         0.33  0.64  0.58    0.58
Stay point only        0.30  0.68  0.60    0.57
Geographical only      0.44  0.85  0.69    0.59
Augmented trajectory   0.52  0.93  0.75    0.63
Benefit of neural network and cluster refinement. There is no previous study on clustering
mobility behaviors that uses a neural network to model context transitions in trajectories or one
that iteratively refines behavior clusters. Therefore, we are interested in evaluating how the deep
neural architecture can benefit mobility behavior analyses. To isolate the contribution of the neural
network, we set the inputs of all approaches in this experiment as context augmented trajectories.
Accordingly, we compare adapted baselines KM-DBA* and DB-LCSS*, against neural approaches
DETECT Phase I and DETECT. Figure 3.4 presents the results on both datasets. The improvements
of DETECT are more significant in the GeoLife dataset than in the DMCL dataset. Training on a
larger dataset (GeoLife) produces a latent embedding that better captures the transitions in a tra-
jectory. More importantly, since DMCL is comprised of only two users, it contains trajectories
with limited variation in mobility behaviors (e.g., a user generally visits the same supermarket);
accordingly, the benefit of our approach is relatively small. Based on the results, it is evident that
DETECT Phase I learns an expressive latent embedding of the trajectories, while other baselines
rely on alignments between augmented trajectories and fail to capture the dynamics of the data.
The improvement offered by clustering refinement is significant, as illustrated by the difference
in clustering quality between DETECT Phase I and DETECT. This demonstrates that the customized
clustering-oriented loss offers large benefits in clustering accuracy by improving the separability between
mobility behaviors.
3.3.4 Parameter study
Below we discuss the effects of different parameter settings of DETECT on the reconstruction and
clustering performance.
Effect of the number of clusters. We illustrate the effect of varying k on the clusters in Figure 3.5a.
The experiment is conducted on the GeoLife dataset with six ground-truth classes.
Both DETECT Phase I and DETECT reach their best FMI at around six. This aligns with the
intuition that the model performs best as the number of clusters approaches the number of ground-truth
classes. In practice, however, supervision is not available for hyperparameter cross-validation,
hence alternate methods are employed to determine the clustering cardinality. The most common methods
include manually determining the number of clusters via low-dimensional visualizations or
tuning k through the Elbow method [78] using internal metrics [79].
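A minimal sketch of the Elbow heuristic on the learned embeddings is shown below; the function and parameter names are illustrative rather than part of DETECT.

```python
from sklearn.cluster import KMeans

def elbow_curve(Z, k_values=range(2, 11)):
    # Z: (n, d) latent trajectory embeddings from Phase I.
    # Returns the within-cluster sum of squares for each candidate k;
    # the "elbow" of this curve is a common unsupervised choice of k.
    inertias = []
    for k in k_values:
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(Z)
        inertias.append(km.inertia_)
    return list(k_values), inertias
```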
Effect of varying thresholds in Fast-SPD and geographical augmentation. The parameters ρ_s and
ρ_t in Fast-SPD control the magnitude of stay point extraction. Usually the algorithm is robust to
reasonable practical values, e.g., 1 km for ρ_s and 20 min for ρ_t, but too small or too large thresholds
tend to cause over-extraction or no extraction at all from the trajectories. Similar robustness
is observed with the buffer size ρ_poi of the geographical context augmentation. However, a small
value generates a context vector that is very sensitive to outliers, while a large value includes POIs that
are less likely to be important, reducing the variation between stay points and obscuring the transitions
of contexts. We set ρ_poi = 1 km, but values between 0.5 km and 1.5 km yield almost as good a
result. Overall, robustness is an important property of the feature augmentation part of DETECT.
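To illustrate the role of ρ_poi, a minimal sketch of building one geographical context vector is given below. Counting POI categories within the buffer and normalizing the counts is an assumption of this sketch, as are the projected-coordinate inputs; the exact feature construction follows the augmentation described earlier.

```python
import numpy as np

def context_vector(stay_xy, poi_xy, poi_cat, num_categories, r_poi=1.0):
    # stay_xy: (2,) stay-point coordinates in km (projected); poi_xy: (P, 2) POI coordinates;
    # poi_cat: (P,) integer POI category ids; r_poi: buffer radius in km.
    dist = np.linalg.norm(poi_xy - np.asarray(stay_xy), axis=1)
    counts = np.bincount(poi_cat[dist <= r_poi], minlength=num_categories).astype(float)
    total = counts.sum()
    return counts / total if total > 0 else counts  # normalized POI-type histogram
```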
Figure 3.4: Ablation study on compared approaches: (a) Rand Index, (b) Mutual Information, (c) Purity, (d) Fowlkes Mallows Index.
DETECT outperforms the adapted baselines with feature augmentation (marked with *) on all
metrics and for both datasets (relatively little improvement is seen in DMCL, since it spans only
two users with limited mobility variation).
Effect of the latent embedding dimension. Table 3.4 demonstrates the effect of the latent embedding
dimension d on the reconstruction error in Phase I. Given a small number of hidden dimensions,
the model is incapable of learning an expressive-enough latent embedding. Whereas, if the
number of dimensions grows too large, the model easily overfits the training data. Overall, for a
value of d between 50 and 100, the clustering performance remains good.
Table 3.4: Mean and Standard Deviation of MAE for DETECT Phase I with varying latent embedding dimension d

d             16    32    64    128
mean (×10⁻³)  5.6   4.9   4.3   4.5
std (×10⁻³)   0.61  0.2   0.06  0.23
Effect of the learning rate and training epochs. In Figure 3.5b, we compare the training curves
with different learning rates. Given a large learning rate, i.e., lr = 0.1, the model incurs a proportionally
large reconstruction error. In contrast, when the learning rate is too small, e.g., lr = 10⁻⁵, the
model takes long to converge. However, for a reasonable value of the parameter, the unsupervised
training loss converges quickly after the first few epochs. We also remark that the model does not
overfit the dataset too readily: we notice (not shown here) an increase in the standard deviation of
the MAE only when the model is trained for more than 1500 epochs.
3.3.4.1 Qualitative Evaluation
Visualization study. To further understand the learned latent embedding of DETECT, we generate
a series of visualizations. Figure 3.6 shows the two-dimensional t-SNE [64] plot over DETECT
embeddings on the GeoLife dataset. Each point is colored based on the corresponding ground-
truth class. DETECT generates well-formed clusters (Figure 3.6a), with most points in the same
class grouped into the same cluster while being well separated from others. A close inspection
also reveals that the clusters in Figure 3.6a are “cleaner” than the ones in Figure 3.6b, demonstrat-
ing the effectiveness of cluster refinement in Phase II, wherein the embedding and the clustering
assignment are jointly optimized to gain better differentiation in mobility behaviors.
Figure 3.5: Effect of DETECT parameters. (a) Effect of the number of clusters k. (b) Effect of the learning rate lr.
Figure 3.6: Clustering results visualization of “DETECT” (a) and “DETECT Phase I” (b) using t-SNE with perplexity 40.

Figure 3.7: Projecting raw trajectory clusters onto the map. (a) Trajectories colored based on predicted cluster labels. (b) Trajectories colored based on ground-truth classes.
Figure 3.7a visualizes the trajectories with the predicted classes and Figure 3.7b depicts the
ground-truth classes in the GeoLife dataset. The individual predicted classes are labeled using the
same color as their corresponding ground-truth classes. It is evident that the predicted classes generally
match the ground-truth, even though each class contains trajectories with various shapes and
lengths. This is a testament to DETECT's capabilities in clustering trajectories with widely different
spatial and temporal ranges of movement.
3.3.4.2 Scalability
We evaluate DETECT on the entire GeoLife dataset comprised of 17,621 trajectories. Since the
full dataset is unlabelled, we utilize four internal validation measures to understand the compactness,
the connectedness, and the separation of the cluster partitions. Namely, the Silhouette score
ranges from -1 to +1, where a high value indicates that an object is well matched to its
own cluster and poorly matched to neighboring clusters. The Dunn Index tries to maximise intercluster
distances whilst minimising intracluster distances; thus, large values of the Dunn Index correspond
to good clusters. The Within-Like criterion captures the intracluster variance; accordingly, a small
value indicates compact clusters. Likewise, the Between-Like criterion captures the intercluster variance,
hence a large value indicates good separation between different clusters. These metrics are
customized to support context-augmented trajectories and are formally defined in [20]. Note that
internal metrics are not suitable for comparison between clustering approaches that utilize different
distance measures (e.g. DETECT vs DB-LCSS).
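For instance, the Silhouette score over the latent embeddings can be obtained directly from scikit-learn; using the Euclidean distance in the latent space is an assumption of this sketch, and the Within-Like and Between-Like criteria follow the definitions in [20] and are not reproduced here.

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def silhouette_for_k(Z, k):
    # Z: (n, d) latent embeddings of the full (unlabelled) dataset.
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(Z)
    return silhouette_score(Z, labels)
```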
Figure 3.8 presents the results. All four internal measures indicate that DETECT Phase II in-
creases compactness within (smaller Within-Like) and separation between (larger Between-Like)
the clusters generated via Phase I. Moreover, the evolution of Silhouette score, Within-Like crite-
rion and Between-Like criteria suggest that just a few clusters (i.e. 10) can best capture common
mobility behaviors. A visual inspection of the 10 clusters (in Figure 3.9) generated using DETECT
establishes its effectiveness in generating well-behaved clusters. Moreover, a closer look at the trajectories
in the purple cluster (Figure 3.10) illustrates that the discovered mobility behavior can
be easily understood. All comprising trajectories—albeit originating from various locations (such as
schools and residences)—always involve activities in parks (red dashed circles).
Lastly, we remark upon the running time of the compared approaches over the full GeoLife
dataset. Within our experimental testbed, Phase I trained for 0.8 hours, which was further optimized
in Phase II for an additional 1.3 hours. The total computational time of 2.1 hours is still substantially
better than the 4.3 hours for KM-DBA* and 5.2 hours for DB-LCSS*, both of which require
O(n²) pairwise distance computations.
Figure 3.8: Internal validation of clustering quality. (a) Silhouette Score, (b) Dunn Index, (c) Within-Like criterion, and (d) Between-Like criterion, plotted against the number of clusters (10, 20, 35, 50) for DETECT, DETECT Phase I, DB-LCSS*, and KM-DBA*.
Figure 3.9: Clusters of all trajectories in GeoLife.
Figure 3.10: Visualization of a detected GeoLife trajectory cluster with locations of recreational
parks in red circles.
Chapter 4
A Variational Approach for Mobility Behavior Clustering
In DETECT, we first generate a “context sequence” for each trajectory from nearby geographic
entities, e.g., Points-Of-Interest (POIs), and then infer mobility behaviors by clustering context
sequences from large numbers of trajectories based on the context transitions. Here, a context
sequence is an ordered list of real-value feature vectors, each describing the “context” of a visited
location (e.g., sports, shopping, or dining venues) in a trajectory. Clustering context sequences
based on their transition patterns (similar dependencies across different dimensions and positions
in the sequences) is challenging. For example, a transition in the context sequence can be: rest (a
place surrounded by many residential POIs) → shopping (a place surrounded by many restaurants,
theaters, and malls) → dating (a place surrounded by many scenic POIs). Such transitions are usu-
ally driven by the trajectory data and vary from one population to another. However, traditional
time series clustering approaches are usually based on pre-defined similarity and alignments be-
tween sub-sequences and shapes [8–10]. Other typical sequence clustering approaches only handle
discrete variables (e.g., [80–82]). Autoencoder-based clustering approaches using Recurrent Neural
Networks (RNNs) can convert sequences of real-value feature vectors into a fixed-length vector
for clustering the dynamics in the sequences using a two-phase training process [11, 27]. The
first phase learns an initial representation by self-supervision (i.e., reconstruction), and the second
phase improves the representation and clustering performance by optimizing a customized cluster-
ing objective. However, the first phase’s self-supervision objective highly depends on the initial
parameters and could lead to a poor initial feature representation (i.e., irrelevant to the clustering
objective), which cannot be refined in the second phase to improve the clustering performance.
Consequently, clustering accuracy cannot be guaranteed across training variations [14].
In this chapter, we present a novel Variational Approach for Mobility Behavior Clustering
(VAMBC) that can robustly handle sequences of context vectors in a single training phase. VAMBC
assumes the pre-existence of clusters in the latent space and jointly learns the hidden representation
and cluster formation in an end-to-end training process. Though variational clustering approaches
have recently been well developed for image data [28–31], directly applying them to variable-length
sequential data requires an RNN decoder, which is sensitive to small changes in the latent space [83],
resulting in poor clustering accuracy and robustness. The main problem originates from having
minimal involvement of the cluster assignments when constructing the latent space from the input
sequences. Hence, the model would generate poor clustering results such as one or a few large clusters
leaving many clusters empty (called the “empty cluster” problem in the rest of the chapter) or
several similar clusters (called the “trivial solution”). To address these challenges, VAMBC explicitly
constructs two representations: one captures the unique information of a context sequence, and the
other captures the shared information within a cluster. We call the former the individualized
latent embedding and the latter the cluster latent embedding. VAMBC makes the cluster latent embedding
available to the cluster members during reconstruction and explicitly uses the embedding
in constructing the latent space so that the final latent embedding is aware of the cluster assignments.
VAMBC also encourages the cluster assignment to be flexible at early stages and to become well
separated as the model is trained adequately. Therefore, the model has sufficient involvement of
the cluster membership in creating the embeddings and can avoid producing poor clustering results.
We compare our approach with many baseline approaches, and the experimental results on
real-world data show that VAMBC achieves better clustering accuracy and robustness than all
the baselines.
4.1 Problem Definition
Formally, in the remainder of this chapter, we study the following problem: Given a set of context
sequences X = {x_i}, where x_i = {x⃗_{i,1}, x⃗_{i,2}, ..., x⃗_{i,L_i}} is a variable-length sequence (i.e., the
length L_i of sequence x_i is not fixed) of POI context vectors x⃗_{i,l} ∈ R^D (where D
is the number of POI types). Each element of the vector represents the likelihood of visiting the
corresponding POI type (e.g., residence). The goal is to cluster the set of sequences X into K (a
predefined hyper-parameter) groups, s.t., within each group the sequences have similar context
transitions.
4.2 The V AMBC Model
This section introduces our model VAMBC. Specifically, we first introduce our novel idea of decomposing
the hidden variables to improve the involvement of the cluster hidden variable y. Then we
explain the derived training objectives with well-designed layers of the model and discuss their
roles. Finally, we describe the network structure of VAMBC, followed by a discussion comparing
the mechanisms of VAMBC and other variational models for clustering.
4.2.1 Decomposing Hidden Variables
The goal of the variational clustering model is maximizing the likelihood of the training data
{x_i}_{i=1}^N while learning a parametric latent embedding z_i representing the hidden information of
each input sequence x_i and a latent variable y_i describing the cluster membership. In the following
paragraphs, we hide the subscript i for simplicity.
To increase the involvement of cluster assignments (i.e., y) in constructing the latent embedding
z, we decompose the hidden variable z into two independent parts: the cluster latent information
(z^c) and the individualized latent information (z^b). Intuitively, each input can be represented using its
cluster information (the cluster center) plus its individual (bias) information (the relative position
to the cluster center). The cluster latent information is modeled as a variable z^c that fully depends on
the discrete variable y, s.t., z^c is a learnable deterministic mapping from y, z^c = f_c(y). Since
the cluster latent information summarizes the latent characteristics of a cluster, we explicitly let
z^c be the center of the cluster in the latent space, which can be shared across the inputs x within individual
clusters. In other words, for a given cluster k, E(Z_k) is the expectation of Z_k = {z_i | x_i ∈ cluster k},
s.t., E(Z_k) = z^c_(y=k) = f_c(y_k). We model the individualized latent representation as a continuous
variable z^b that describes the bias to the cluster center, s.t., the distribution of z^b centers at 0
(E(z^b) = 0). The overall latent representation z is modeled as the summation of the cluster latent
representation and the individualized latent representation, i.e., z = z^c + z^b. In this way, the hidden
space z still preserves the Gaussian-Mixture structure but can also be decomposed into two embeddings
which can be supervised separately. Specifically, the generative process can be described in
Equation (4.1).
$$y \sim Cat(1/K), \quad z^c = f(y; W) \qquad (4.1)$$
$$z^b \sim \mathcal{N}(0, I)$$
$$z = z^c + z^b$$
$$x \sim \mathcal{N}\big(\mu_x(g(z;\theta)),\ \sigma_x^2(g(z;\theta))\,I\big)$$
In Equation (4.1), y is a discrete variable that follows a categorical prior (denoted by Cat(·)) and
K is the predefined number of clusters. f is a deterministic function (implemented by a neural
layer parameterized by W) of y that maps each cluster to a vector z^c. z^b is a continuous variable
following a Gaussian prior N(0, I) and represents the individualized embedding. g(z; θ) denotes
a neural network that decodes z to the input space, parameterized by θ. μ_x(g(z; θ)) denotes the
mean of the Gaussian likelihood distribution of x conditioned on z. We set σ_x(g(z; θ)) = 1, which
reduces the log-likelihood to the mean squared error, following the common practice of VAE. We use
q(y, z|x) to approximate the posterior of p(x, y, z). The problem can be reduced to maximizing the
log-Evidence Lower Bound (ELBO). We refer the reader to [52, 84] for a detailed explanation
of the derivation of the ELBO. Thus the objective is minimizing the negative ELBO, written as in
Equation (4.2).
$$\mathcal{L}_{ELBO} = -\,\mathbb{E}_{y,z \sim q(y,z|x)}\,\log\frac{p(x,y,z)}{q(y,z|x)} \qquad (4.2)$$
According to the proposed generative process, we can substitute p(z) = p(z^b)p(z^c), p(x, y, z) =
p(y)p(z^b)p(z^c|y)p(x|y, z), and q(y, z|x) = q(y|x)q(z^b|x)q(z^c|y) into the ELBO. By ignoring q(z^c|y),
because it is deterministic, we can break down Equation (4.2) and rewrite the (negative) ELBO as in
Equation (4.3).
$$\mathcal{L}_{ELBO} = -\,\mathbb{E}_{y,z \sim q(y,z|x)}\left(\log\frac{p(y)}{q(y|x)} + \log\frac{p(z^b)}{q(z^b|x)} + \log p(x|y,z)\right) \qquad (4.3)$$
$$= D_{KL}(q(y|x)\,\|\,p(y)) + D_{KL}(q(z^b|x)\,\|\,p(z^b)) - \mathbb{E}_{y\sim q(y|x),\, z^b\sim q(z^b|x),\, z^c=f(y;W)}\,\log p(x|y,z^b,z^c)$$
4.2.2 Training Objectives and Neural Layers
Now we expand all the Right-Hand-Side (RHS) terms in Equation (4.3) and formulate the objectives.
The first RHS term, D_KL(q(y|x) || p(y)), describes the KL distance between the posterior estimate
q(y|x) and its prior p(y). Since the prior p(y) ~ Cat(1/K) is a categorical distribution, we can
expand this term as below.
$$D_{KL}(q(y|x)\,\|\,p(y)) = \sum_{y} q(y|x)\,\log\big(q(y|x)\cdot K\big) = -\,\mathrm{Entropy}(y) + \log(K), \quad y \sim q(y|x)$$
The first RHS term turns out to be the negative entropy of y, where y ~ q(y|x) (the constant log(K)
can be omitted). Intuitively, a small negative entropy indicates high randomness in y ~ q(y|x),
and a large negative entropy value indicates less randomness in y ~ q(y|x). Minimizing the negative
entropy could prevent the prediction of the cluster probability q(y|x) from being “overly confident”,
i.e., always assigning the input to one cluster aggressively (outputting 0.99 as the probability of assignment).
Therefore, the negative entropy term can prevent the model from producing empty
clusters.
The posterior probability q(y|x) is estimated using a Softmax activation after the encoder network.
To enable sharing of cluster information, we need to sample the discrete variable y from
q(y|x), for which we employ the Gumbel-Softmax layer. Gumbel-Softmax can sample a pseudo
one-hot vector (a real-value vector that is very similar to a one-hot vector) and propagate the parameter
gradients backward to the previous Softmax layer [85]. Thus the output will be almost
discrete (e.g., (1,0,0,0)) and multiple input sequences could share the same choice of y and hence
the same cluster embedding z^c (e.g., z^c_i = z^c_j = f_c(y_k) if x_i and x_j are both assigned to cluster k). In
addition, Gumbel-Softmax is different from an Argmax operation on q(y|x), which always chooses
the cluster with the largest probability. Specifically, Gumbel-Softmax introduces randomness when
sampling y according to the probability q(y|x), so an input x assigned to cluster k_1 could “jump” to
a different cluster k_2 in the next round if the likelihood of k_2 is similar to that of k_1. This avoids x being
always assigned to the same cluster that is initially predicted.
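A minimal TensorFlow sketch of the Gumbel-Softmax sampling step is shown below; the temperature value is an illustrative choice, and the straight-through and annealing details used in practice are omitted.

```python
import tensorflow as tf

def gumbel_softmax_sample(logits, temperature=0.5):
    # logits: unnormalized log-probabilities of q(y|x), shape (batch, K).
    # Add i.i.d. Gumbel(0, 1) noise and apply a temperature-scaled softmax to
    # obtain a pseudo one-hot sample that still passes gradients to `logits`.
    u = tf.random.uniform(tf.shape(logits), minval=1e-8, maxval=1.0)
    gumbel_noise = -tf.math.log(-tf.math.log(u))
    return tf.nn.softmax((logits + gumbel_noise) / temperature, axis=-1)
```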
Figure 4.1: VAMBC network structure.
The second RHS term, D_KL(q(z^b|x) || p(z^b)), measures the Kullback–Leibler (KL) distance between
the posterior q(z^b|x) of the individualized latent embedding z^b and its prior p(z^b) = N(0, I).
Here we assume q(z^b|x) is a Gaussian distribution with a learnable mean z̄^b and a constant variance,
following the constant-variance VAE (CV-VAE), which sacrifices a little capacity for robustness and
mitigates the sensitivity in the decoder, following [86, 87]. Therefore, the KL term can be rewritten
into the following form:
$$D_{KL}(q(z^b|x)\,\|\,p(z^b)) = \|\bar{z}^b\|_2^2 + \text{constant} \qquad (4.4)$$
The last RHS term, $-\mathbb{E}_{y\sim q(y|x),\, z^b\sim q(z^b|x),\, z^c=f(y;W)}\,\log p(x|y,z^b,z^c)$, is the negative log-likelihood
of the observation x. Following the common practice of VAE, we can rewrite it as the mean squared
error (MSE) between the input data x and the reconstruction x′.

$$-\mathbb{E}_{y\sim q(y|x),\, z^b\sim q(z^b|x),\, z^c=f(y;W)}\,\log p(x|y,z^b,z^c) = \mathrm{MSE}(x, x')$$
In addition to the terms in L_ELBO, we introduce a center loss regularizer L_center = ||x − x^c||²₂.
Here x^c is a sequence of the same dimension as the input x (after padding), and is decoded from
the center embedding z^c. The center loss could prevent the model from overly relying on the individualized
embedding. Specifically, it generates a center sequence x^c from the cluster embedding
z^c and minimizes the distance between x^c and all x assigned to this cluster. Therefore, the cluster
embedding z^c is learned to be expressive enough to generate a sequence x^c similar to the sequences
in the cluster. L_center also improves the compactness and discrimination of the clusters in the
embedding space.
Finally, we can write the loss function of VAMBC as in Equation (4.5).

$$\mathcal{L} = \mathrm{MSE}(x, x') + \|z^b\|_2^2 - \mathrm{entropy}(y) + \|x - x^c\|_2^2 = \mathcal{L}_{recon} + \mathcal{L}_{KL} + \mathcal{L}_{NE} + \mathcal{L}_{center} \qquad (4.5)$$
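For concreteness, a minimal sketch of the four terms in Equation (4.5) is given below. The tensor shapes (batch, L, D), the equal weighting of the terms, and the omission of padding/masking for variable-length sequences are assumptions of this sketch, not the exact training code.

```python
import tensorflow as tf

def vambc_loss(x, x_recon, x_center, z_b_mean, q_y):
    # x, x_recon, x_center: (batch, L, D); z_b_mean: (batch, d); q_y: (batch, K).
    l_recon = tf.reduce_mean(tf.reduce_sum(tf.square(x - x_recon), axis=[1, 2]))
    # KL term for the individualized embedding under a constant-variance posterior.
    l_kl = tf.reduce_mean(tf.reduce_sum(tf.square(z_b_mean), axis=-1))
    # Negative entropy of q(y|x): discourages overly confident cluster assignments.
    l_ne = tf.reduce_mean(tf.reduce_sum(q_y * tf.math.log(q_y + 1e-9), axis=-1))
    # Center loss: the sequence decoded from the cluster embedding should match x.
    l_center = tf.reduce_mean(tf.reduce_sum(tf.square(x - x_center), axis=[1, 2]))
    return l_recon + l_kl + l_ne + l_center
```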
4.2.3 Network Design
Based on the loss function, we design our network structure by modeling the probabilities q(y|x), q(z^b|x), and p(x|y, z^b, z^c)
with neural layers. Figure 4.1 shows our network structure. On the left side, we use (stacked)
LSTM (Long Short-Term Memory, an advanced RNN) layers and Fully Connected (FC) layers
with/without softmax activation to model the posteriors q(y|x) and q(z^b|x). Then, in the middle of the
network, the discrete variable y is sampled through a Gumbel-Softmax layer and mapped to the
cluster embedding z^c. The continuous individualized embedding z^b is sampled via the Gaussian
reparameterization [52]. After obtaining z^c and z^b, the embedding z of input x is computed by
adding z^c and z^b in the addition layer. Finally, on the right side of the network,
LSTM layers (decoder) are used to decode the hidden embedding z into a reconstructed sequence
x′. The shared decoder also generates the cluster sequence x^c from z^c for computing the center
loss.
As explained in Section 4.2.2, the network is supervised by the RHS terms to balance the self-
supervision and clustering structure. The network also employs the Gumbel-Softmax and center
loss to prevent early/local convergence and preserve the involvement of cluster memberships.
4.2.4 Relationship to VAE and Gaussian-Mixture VAE
Here we briefly describe the basic concepts of the Variational AutoEncoder (VAE), its extension to
the clustering scenario, and the differences between VAMBC and the existing MoG-based VAEs.
The goal of the VAE is to learn a parametric latent embedding z and a generative model that maximize
the marginal likelihood of the training data {x_i}_{i=1}^N. Its objective in Equation (4.6) is derived by
approximating the intractable posterior p_θ(z|x) with q_φ(z|x) and maximizing the ELBO. In general,
the ELBO includes two terms, one for the reconstruction and the other for regularizing the hidden
space toward a Gaussian prior.
$$\mathcal{L}_{VAE} = \mathbb{E}_{\hat{p}(x)}\big[-\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] + D_{KL}(q_\phi(z|x)\,\|\,p(z))\big] \qquad (4.6)$$
For clustering purposes, the VAE can be extended to learn clustering priors [88] driven
by the data. The Gaussian prior in the hidden layer of the VAE, p(z) = N(0, 1), is in this case replaced
with a Mixture of Gaussians (MoG), such as in [29–31]. The generative process can be described
in Equation (4.7). Here y is a discrete hidden variable representing the cluster assignments,
and z is a continuous hidden variable that x is mapped to/conditioned on. K is the predefined
number of clusters and Cat(·) is the categorical distribution.
$$y \sim Cat(1/K), \quad z \sim \mathcal{N}(\mu_y, \sigma_y^2\, I) \qquad (4.7)$$
$$x \sim \mathcal{N}(\mu_x(z), \sigma_x^2(z)\, I)$$
The objective in Equation (4.6) is then rewritten as:

$$\mathcal{L}_{VAE\text{-}GM} = \mathbb{E}_{\hat{p}(x)}\big[-\mathbb{E}_{q_\phi(z,y|x)}[\log p_\theta(x|z)] + D_{KL}(q_\phi(z,y|x)\,\|\,p(z,y))\big] \qquad (4.8)$$
However, when we use an RNN to encode and decode the variable-length context sequences (e.g., for
learning the transition patterns of context vectors), the model above would fall into a local optimum
and produce undesired clustering results. Specifically, when applying the variational models with
sensitive decoders, such as RNNs, for sequence modeling, the model might initially learn to ignore
the hidden variable z or y and go after the low-hanging fruit, producing a decoder that is easy
to optimize [83]. Ignoring y could collapse the joint probability q_φ(z, y|x) to q_φ(z, y=1|x) by
assigning all data to one cluster y = 1 in the extreme case. This would cause the problem of empty
clusters. It is also possible to learn a trivial parameterization for p(z, y) that collapses to p(z) = N(0, 1)
by generating identical Gaussian components in the MoG, i.e., μ_{y=1} = μ_{y=2} = ... = μ_{y=K}. In that case
the model reduces to a general VAE without the clustering ability. The reason for the ignorance of,
or the little supervision over, y is that the model requires z both to include the embedding
that can reconstruct x and to decide the cluster assignment y through their conditional relationship.
Especially when the decoder is a sensitive RNN structure, there will not be enough capacity for z
to provide enough supervision on y. Instead, z will focus more on the reconstruction, which makes
the model inaccurate in clustering the context sequences.
On the contrary, VAMBC can avoid these problems by separating z into two embeddings, z^c and
z^b, emphasizing clustering and reconstruction, respectively. As shown in Figure 4.2, we replace
the conditional dependency between z and y with a joint relationship and add self-supervision
on y with a center loss based on x^c. For this new modeling of the latent variables z^c, z^b, and y, we
carefully derived the objectives and designed the networks to fulfill the assumptions and
prevent practical problems.

Figure 4.2: Comparison of the graphical notations of a general Gaussian-Mixture VAE (a) and VAMBC (b).
4.3 Experiments
In this section, we quantitatively evaluate the clustering performance of VAMBC by comparing
it with the state-of-the-art approaches in Section 4.3.2. We also analyze variants of VAMBC to
understand the role of each component in Section 4.3.3. Additional experiments on the quality of
the representation are discussed in Section 4.3.5.
4.3.1 Environment and Experiment Settings
Datasets. Following [27], we utilize the GeoLife dataset [65] and DMCL dataset [66] produced
by real human trajectories for the evaluation of our proposed approach. The POI information of
the two datasets are from OpenStreetMap (OSM) [70] and the PKU Open Research Data [69],
respectively. The experiments are evaluated based on the labeled moving behavior samples of
the GeoLife and DMCL datasets reported in [27]. For GeoLife, six labels were provided as the
ground-truth classes: “campus activities”, “hangouts”, “dining activities”, “healthcare activities”,
“working commutes”, “studying commutes”. Four clusters were labeled in DMCL dataset: “study-
ing commutes”, “residential activities”, “campus activities”, “hangouts”.
Baseline approaches. We compare our approaches with the state-of-the-art clustering approaches
from four categories:
• For classical time series clustering approaches, we include KM-DTW (KMeans with Dy-
namic Time Warping distance) [8], KM-GAK (KMeans with Global Assignment Kernel) [9],
k-Shape [10] and DB-LCSS (DBSCAN with Longest Common Sub-Sequence distance) [72].
• For discrete sequence clustering approaches, we include SGT [89] and MHMM (Mixed
Hidden Marcov Model) [90]. Since they work for discrete sequences, we transform the
context sequences into discrete sequences by mapping the real-value vectors to discrete cat-
egories via pre-clustering all context vectors;
• For the AutoEncoder based deep clustering approaches, we include DTC [11], DETECT [27],
and adapted IDEC* [13] and DCN* [12] by replacing the encoder and decoder with RNN(LSTM)
layers to work with context sequences.
• For variational deep clustering approaches, we adapted GMVAE* [29, 31], VaDE* [30] and
JointVAE* [28] in the same way mentioned above. Here we also add KL Annealing [83]
to GMVAE* to solve its KL vanishing problem. The approaches adapted from image research
are marked with “*” after their names.
Environment and parameters. We implemented our approaches on a computing node with
a 36-core Intel i9 Extreme processor, 128 GB RAM, and 2 RTX2080Ti / Titan RTX GPUs. We
implemented the KM-DBA, KM-GAK, and kShape using tslearn¹. DBSCAN clustering uses ScikitLearn²
with the LCSS distance³. We set the common sequence threshold as 0.15 for LCSS, and
ε = 0.03 and minPts = 18 as the neighborhood thresholds in DBSCAN. The proposed model
VAMBC was built using Keras [76] with Tensorflow [77]. The discrete sequence clustering approach
Mixed-HMM was implemented using the R package seqHMM⁴. The adapted baselines
were revised for context sequences based on their public code on Github⁵ ⁶ ⁷. Both VAMBC and the
adapted baselines use LSTM layers with 128 units in the encoder and decoder. The dimension
of the hidden variable z was set to 64.

¹ https://github.com/rtavenar/tslearn
² https://scikit-learn.org/stable
³ https://github.com/maikol-solis/trajectory_distance
⁴ https://cran.r-project.org/web/packages/seqHMM/index.html
⁵ https://github.com/XifengGuo/IDEC
⁶ https://github.com/sarsbug/DCN_keras
⁷ https://github.com/slim1017/VaDE
4.3.2 Quantitative Analysis
Evaluation metrics. We use three clustering metrics that are widely used in the clustering commu-
nity [11–13, 30]: Normalized Mutual Information (NMI) [91], Adjusted Rand Index (ARI) [92],
and Clustering Accuracy (ACC) [91]. These metrics have different emphasis on evaluating the
clustering quality; therefore, we believe a side-by-side comparison could indicate the overall clus-
tering performances of the models. All of the three metrics reach 1 if the clustering result is fully
consistent with ground truth. NMI and ACC have minimums of 0, and ARI has -1 as the minimum
for the worst clustering result.
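NMI and ARI are available directly in scikit-learn (normalized_mutual_info_score and adjusted_rand_score). Clustering accuracy additionally requires matching predicted clusters to classes; a common sketch using the Hungarian algorithm is shown below for reference.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true, y_pred):
    # Build the cluster-vs-class contingency table, find the best one-to-one
    # mapping with the Hungarian algorithm, and report the matched fraction.
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    d = max(y_true.max(), y_pred.max()) + 1
    cont = np.zeros((d, d), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cont[p, t] += 1
    rows, cols = linear_sum_assignment(-cont)
    return cont[rows, cols].sum() / y_true.size
```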
Evaluation results. We conducted the experiments ten times, following the practice in [13, 30], and
report the clustering performance of the best/worst run and the average metrics with standard
deviations (the numbers after ±) in Table 4.1. DB-LCSS always produces the same result, so we
did not include its standard deviation in the table. We believe the average performance is important
because in real-world use cases one would not be able to tell which run is better without knowing
the ground truth. We compare the performance of VAMBC with the baselines in Table 4.1. We
observe that VAMBC outperforms all the baselines on both the worst run and the average metrics in
both datasets. In addition, the standard deviation of the metrics produced by VAMBC is extremely
low, i.e., 1/3 to 1/8 of that of the baselines on GeoLife. This result indicates that VAMBC is
robust and produces accurate results regardless of repeated training. It is interesting that the
adapted DCN obtains the highest best-run NMI on GeoLife. However, its variance across runs
is high and its average NMI is not as good, which means it is not guaranteed to produce a good
result every time. This is because its first-phase training varies from one run to another and does
not always produce a good initial representation for clustering.
4.3.3 Ablation Study
In this section, we demonstrate how the different components of VAMBC work under the hood
through ablation experiments. Specifically, we create three ablated variants of VAMBC by removing
one component each. We compare these variants by looking at different measurements during
their training processes. In Figure 4.3, we plot the curves of these measurements versus the train-
ing epoch. Figure 4.3a shows the curves of accuracy. Figure 4.3b shows the curves of negative
entropy and Figure 4.3c shows the curves of reconstruction errors. We plot the first 300 epochs
and discard the rest of the curves, which already converge.
Removal of negative entropy. The negative entropy term in the loss penalizes the model if the
clustering prediction is overly confident. In Figure 4.3a, we can observe that the model without
negative entropy (green line) stays at a low accuracy after some fluctuations, then climbs up
but fails to converge to a high accuracy. The low-accuracy period occurs because this variant aggressively
assigns all data to two or three clusters and does not split off more clusters. This can also be
observed in Figure 4.3b: during the same period (from epoch 100 to epoch 150), the negative
entropy increases sharply. Although the accuracy increases again after this period (possibly starting
to split some clusters due to the randomness in the Gumbel-Softmax component), it cannot fully
escape from the over-confidence problem. In contrast, in Figure 4.3b, the curve of VAMBC also
increases, but at a relatively restrained pace. This means that VAMBC becomes confident gradually
Table 4.1: Clustering performance comparison

GeoLife
Method      NMI (aver)   NMI (best)  NMI (worst)  ARI (aver)   ARI (best)  ARI (worst)  ACC (aver)   ACC (best)  ACC (worst)
KM-DTW      0.610±0.021  0.645       0.579        0.635±0.019  0.656       0.617        0.742±0.031  0.763       0.655
KM-GAK      0.591±0.057  0.657       0.507        0.505±0.076  0.573       0.392        0.737±0.033  0.770       0.688
K-Shape     0.229±0.033  0.272       0.174        0.220±0.046  0.271       0.102        0.522±0.015  0.551       0.495
DB-LCSS     0.547        0.547       0.547        0.412        0.412       0.412        0.697        0.697       0.697
SGT         0.419±0.024  0.454       0.371        0.216±0.036  0.277       0.149        0.628±0.029  0.694       0.579
MHMM        0.530±0.047  0.611       0.486        0.403±0.057  0.495       0.344        0.627±0.017  0.649       0.607
IDEC*       0.605±0.035  0.673       0.572        0.465±0.097  0.664       0.404        0.67±0.08    0.819       0.596
DCN*        0.646±0.051  0.725       0.594        0.635±0.065  0.693       0.503        0.782±0.061  0.840       0.624
DTC         0.500±0.027  0.550       0.474        0.483±0.028  0.512       0.451        0.682±0.032  0.737       0.655
DETECT      0.644±0.037  0.691       0.589        0.646±0.044  0.688       0.582        0.8±0.013    0.822       0.780
GMVAE*      0.447±0.083  0.598       0.364        0.353±0.074  0.480       0.274        0.530±0.052  0.617       0.479
VaDE*       0.631±0.053  0.669       0.502        0.603±0.078  0.658       0.440        0.783±0.037  0.822       0.720
JointVAE*   0.459±0.056  0.556       0.408        0.227±0.123  0.442       0.161        0.519±0.062  0.597       0.473
VAMBC       0.697±0.015  0.699       0.692        0.7±0.019    0.719       0.682        0.825±0.01   0.842       0.810

DMCL
Method      NMI (aver)   NMI (best)  NMI (worst)  ARI (aver)   ARI (best)  ARI (worst)  ACC (aver)   ACC (best)  ACC (worst)
KM-DTW      0.366±0.023  0.415       0.355        0.211±0.008  0.229       0.208        0.582±0.009  0.600       0.578
KM-GAK      0.323±0.019  0.345       0.277        0.161±0.04   0.270       0.120        0.579±0.056  0.733       0.556
K-Shape     0.409±0.055  0.531       0.344        0.241±0.06   0.396       0.183        0.616±0.08   0.811       0.522
DB-LCSS     0.365        0.365       0.365        0.158        0.158       0.158        0.511        0.511       0.511
SGT         0.458±0.012  0.466       0.440        0.256±0.009  0.262       0.242        0.763±0.005  0.766       0.755
HMM         0.326±0.055  0.392       0.208        0.126±0.096  0.339       0.011        0.648±0.064  0.756       0.567
IDEC*       0.442±0.012  0.448       0.409        0.333±0.006  0.338       0.318        0.776±0.005  0.778       0.767
DCN*        0.447±0.02   0.479       0.413        0.343±0.014  0.375       0.328        0.781±0.011  0.800       0.767
DTC         0.427±0.081  0.487       0.222        0.304±0.101  0.368       0.083        0.733±0.089  0.800       0.522
DETECT      0.486±0.022  0.527       0.448        0.378±0.047  0.398       0.247        0.779±0.063  0.800       0.600
GMVAE*      0.319±0.063  0.476       0.251        0.127±0.056  0.256       0.082        0.566±0.026  0.622       0.544
VaDE*       0.456±0.02   0.493       0.446        0.341±0.007  0.355       0.338        0.778±0      0.778       0.778
JointVAE*   0.12±0.123   0.263       0.000        0.044±0.048  0.104       0.000        0.524±0.016  0.544       0.511
VAMBC       0.512±0.02   0.527       0.475        0.384±0.013  0.398       0.351        0.799±0.004  0.800       0.789
Figure 4.3: Changes of metrics by variants over their training epochs. (a) Accuracy vs. training epochs. (b) Negative entropy vs. training epochs. (c) Reconstruction loss vs. training epochs.
about the cluster assignment because of the restraint imposed by the negative entropy. In this way, using negative
entropy prevents the model from being dominated by one or a few clusters. We
can also observe this phenomenon in Figure 4.3c: the reconstruction loss of VAMBC drops relatively
slowly during the first 100 epochs to let the model work on the cluster embedding and
assignments. Therefore, we observe that VAMBC eventually converges to a much higher
accuracy than the ablated variants because VAMBC can escape from the sub-optima.
Removal of Gumbel-Softmax. The Gumbel-Softmax layer enables some randomness in the discrete
variable. Such randomness enables an input to “jump” to a similar cluster if the model is not
confident enough about its assignment. Without the Gumbel-Softmax layer, the model would
stay in a sub-optimal assignment and prevent the other losses from directing the model to learn a better
representation. As we can see in Figure 4.3a, the curve (red line) without Gumbel-Softmax quickly
goes up and stays at a certain accuracy until convergence.
Removal of center loss. The center loss regularizer is very important in preventing the model from
ignoring the discrete variable and the cluster latent embedding. As we can see in Figure 4.3a, the curve
(orange line) without the center loss quickly drops to a low accuracy after a spike. This indicates
that the model soon relies less on the cluster embedding to minimize the reconstruction, which
is not ideal. In Figure 4.3b, its negative entropy stays low, which also indicates that the model is
reluctant to differentiate clusters and chooses to rely on the individualized embedding only.
4.3.4 The Training Progress of VAMBC
To understand the changes of the latent embedding and the cluster embedding in VAMBC, we visualize
the learned embeddings at different epochs in Figure 4.4 using t-SNE [64]. The red points
denote the latent embeddings of the context sequences, and the black points represent the cluster
embeddings z^c of each cluster. At the initial stages (epoch = 1 and 60), the model learns giant clusters
(where many nearby black points are located) that can roughly reconstruct the data. Subsequently,
the negative entropy and the reparameterization by Gumbel-Softmax encourage the model to split off
more clusters, and finally (epoch = 400) the clusters are well separated and the cluster embeddings
are well-distributed at the centers of each cluster.
Figure 4.4: Visualization of the training progress at (a) epoch = 1, (b) epoch = 60, (c) epoch = 200, and (d) epoch = 400.
4.3.5 Qualitative Analysis
One advantage of VAMBC is that it preserves a high-quality latent representation in addition
to its accurate clustering performance. In this section, we present the quality of the latent
representation produced by different approaches. The visualization of the embedding reveals
how discriminative the hidden space is and the interpretability of the clustering structure. The
reconstruction performance exposes the information preserved by the latent representation.
Visualization of embedding. Figure 4.5 shows the visualization of the latent embeddings by different
approaches in a two-dimensional space by t-SNE [64]. For VaDE, GMVAE, and VAMBC,
we also plot the embeddings of the cluster centers as black dots. From the figure, it is clear that
Figure 4.5: Latent embeddings by different approaches: (a) DCN, (b) IDEC, (c) DETECT, (d) VaDE, (e) GMVAE, (f) VAMBC.
VAMBC has a well-discriminated hidden embedding. Such discrimination ensures a good clustering
with respect to both cleanness and compactness. DETECT, though it also produces good discrimination
of four of the six groups, fails to separate the two clusters in the bottom-left corner.
Other autoencoder-based approaches like DCN and IDEC have the same problem because
their centers largely depend on the representation learned merely from the reconstruction
in their pre-training phase. Meanwhile, variational approaches like GMVAE only produce three
groups in the embedding, which demonstrates the empty cluster problem in these approaches. On
the other hand, VAMBC produces all six well-positioned center embeddings, unlike the other baselines.
The reason is that VAMBC explicitly models the embedding center using an embedding layer
directly produced by the discrete hidden variable (cluster assignment). The center loss also forces
the center embedding to produce meaningful decoded sequences.
Reconstruction of the context sequences. Figure 4.6 depicts the reconstruction of a random
subset of the data by each of the deep learning based approaches. Figure 4.6a shows the original
context sequences, and Figures 4.6b to 4.6g show the reconstructed ones by the models. In each
figure, the horizontal axis denotes different POI categories and the vertical axis denotes the order
in the sequences. The color indicates the normalized values of the POI features. As depicted in
the figures, VAMBC always produces good-quality reconstructions compared with the other baseline
approaches. VaDE also produces a plausible reconstruction, but it does not have a good clustering
performance. VAMBC preserves both the quality of the embedding and the performance of
clustering.

Figure 4.6: Reconstruction of a random sample of inputs by different approaches: (a) original inputs, (b) DETECT, (c) DCN, (d) VaDE, (e) GMVAE, (f) JointVAE, (g) VAMBC.
Center construction. Figure 4.7 shows the decoded sequences of the cluster latent embeddings z^c
of each cluster by VAMBC. We can observe significant differences between these clusters, which
indicates the model is not trapped in the trivial solution that produces identical clusters and ignores
y.
Figure 4.7: Decoded context sequences from cluster embeddings
Figure 4.8: Interpolate between two context sequences from the same cluster
Sample interpolation. We evaluate the interpolation task, a common task in generative
models, to show the smoothness of the latent space learned by VAMBC. In Figure 4.8, we first produce
four interpolated vectors between the embeddings of two context sequences (the left-most one
and the right-most one). Subsequently, these interpolated vectors are decoded into sequences (the
four images in the middle). The visualization shows a smooth transition from the left to the right,
which indicates good smoothness in the latent space.
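A minimal sketch of the interpolation procedure is shown below, assuming Keras-style encoder and decoder models that map a padded context sequence to its overall latent vector z and back; the model and variable names are hypothetical.

```python
import numpy as np

def interpolate(encoder, decoder, x_left, x_right, num_steps=4):
    # Encode the two context sequences, linearly interpolate in the latent space,
    # and decode every interpolated vector back to a context sequence.
    z_left = encoder.predict(x_left[None])[0]
    z_right = encoder.predict(x_right[None])[0]
    alphas = np.linspace(0.0, 1.0, num_steps + 2)
    return [decoder.predict(((1 - a) * z_left + a * z_right)[None])[0] for a in alphas]
```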
Chapter 5
Region Representation Learning for Mobility Behavior Analysis
5.1 Learning a Contextual and Topological Representation of
Areas-of-Interest for On-Demand Delivery Application
A good representation of spatial units is of vital importance to all delivery-related services [93].
Various companies like Uber and DiDi utilize different spatial extents such as grids, hexagons, or
polygons to partition the space into spatial units [94]. These spatial units, represented by their co-
ordinates and other geometric features, are then used as sources and targets for delivery services.
Such spatial units fully cover an entire space (e.g., a city) and have nice topological properties.
Thus, the algorithms based on these units can accommodate any possible delivery request. How-
ever, such a topological representation can only capture spatial relationships between these units and
ignores humans' intuition and tacit knowledge on how to navigate between these regions. For example,
when couriers deliver packages on foot and/or by bike, they mainly choose paths according to
their knowledge and experience of real-world road conditions and connections such as shortcuts,
bridges, crowded streets, and crossings with long traffic lights. In such cases, a mere topological
representation often fails to capture key information and thus may not be sufficient for real delivery
tasks. Fortunately, human trajectories capture such tacit knowledge and experiences.
Towards this end, recent work [1, 55, 95] strives to add such contextual data to Point-Of-Interest
(POI) representations from check-in histories by adopting NLP models like Word2vec [96].
However, these studies mainly focus on recommending POIs to users. Hence, the representation of
POIs usually does not cover the entire space, and thus it cannot be directly applied to delivery systems
that require every point in space to be reachable. Besides, the learned representation may
also lose the topological property and conflict with Tobler's First Law of Geography [54], which
says "Everything is related to everything else, but near things are more related than distant things",
due to the discrete locations of POIs and sampling bias in the collection of check-in histories.
Therefore, the best representation should learn from both topological and contextual data to
take advantage of the best of the two worlds. To achieve this, we propose a novel Deep Multi-view
informAtion-encoding RanKing-based network (DeepMARK) to learn a representation of spatial
regions. Rather than regular-shaped regions, we consider the spatial regions to be geographically
partitioned by map segmentation, i.e., the Areas of Interests (AOI) used in this work. AOIs are
non-overlapping irregular polygons that fully partition (and hence cover) the space and each AOI
captures its individual context. For example, while hexagons or grids may split a school into two
units or may have a unit containing multiple land uses, each AOI represents a single context.
To learn both the topological and contextual features of these AOIs, our proposed framework
DeepMARK consists of three components: one to learn the topological representation, the second
one to learn the contextual representation and finally the third to unify the first two components.
Contextual representation component: In the field of NLP, contextual representations are
usually learned based on the distributional hypothesis [97] from real-world language sequences,
i.e., human utterances. Analogous to NLP, for "spatial" context, we consider location sequences,
i.e., human trajectories, as the data source from which we learn the contextual representation of
AOIs. Trajectory data is selected for its relevance and scalability in learning contextual representations
for delivery problems: 1) trajectories preserve humans' knowledge and preferences
in traveling between AOIs; 2) with the ubiquity of mobile devices and the prevalence of spatial
crowdsourcing apps, trajectories can be easily collected at scale. To learn a contextual representation
from trajectories, we model the spatial distributional hypothesis using Pointwise Mutual Information
(PMI) between AOIs calculated from trajectories. Subsequently, we learn a distributed
representation based on the PMI using an autoencoder framework.
Topological representation component: To model topological properties of irregular-shaped
AOIs, we define Euclidean graph and Adjacency graph to capture the spatial relationships of the
AOIs. Lately, to learn representations from graphs, researchers proposed various graph embedding
approaches [36, 38–40, 46]. Popular methods like Deepwalk [36] and Node2vec [38] are based on
random walks and train the network on randomly generated samples. However, in our problem,
such a process cannot be easily trained with trajectories jointly. Therefore, we propose to estimate
the node-wise mutual information in graphs and use the same autoencoder framework as used for
trajectories to align the learning of the two heterogeneous views.
Unified representation component: Finally, to combine the two heterogeneous views, pre-
vious studies employ different strategies to model the correlation between views and control the
learning across views [98–100]. However, none of these approaches could be directly applied to
our problem because most of them are designed for text and image data. To the best of our knowl-
edge, we are the first to study the joint learning of AOI representation using both trajectory and
graph data. To join the learning of trajectories and graphs, we propose a novel multi-view autoen-
coder neural network that takes the PMI matrices generated by the previous two components and
utilizes an innovative ranking-weighted loss to dynamically balance the learning between views.
We evaluated our representation with a large real-world package delivery dataset acquired from
Cainiao Network. Our representation approach is shown to reduce errors by up to 20% compared
to the adapted baseline approaches in predicting the Estimated Time of Arrival (ETA) of
real-world deliveries.
5.1.1 Preliminaries
In this section, we introduce some important concepts followed by the formal problem definition.
Definition 4 (Area Of Interest (AOI)). An AOI is a minimum geographical unit in the form of a
polygon. The raw AOIs are generated by partitioning a space with fine-grained road networks and
geometric boundaries (e.g., roads, rivers, railways) using map segmentation techniques.
By definition, the boundaries of AOIs, i.e., the irregular polygons, can have different sizes
and numbers of edges, which differentiates them from those of conventional space partitioning
techniques using regular shapes (e.g., hexagons). Moreover, our AOIs still cover the entire
space and each AOI captures a single context (e.g., a school). Later, in Section 5.1.2.2, we show
how we add latent features (learned from topological representations) to each AOI to enforce
Tobler's First Law of Geography.
Definition 5 (Trajectory). A trajectory $s$ is a sequence of spatio-temporal tuples $s = [s^{(1)}, s^{(2)}, \ldots, s^{(k)}, \ldots]$,
where each $s^{(k)}$ is a tuple consisting of the AOI $v$ that contains the GPS point and a
timestamp $t$, i.e., $s^{(k)} = (v, t)$.
We derive the modeling of contextual representation from the analogy in language models.
Most word representation models explicitly or implicitly follow the distributional hypothesis in-
troduced by linguists [97]. The hypothesis is often stated as: words which are similar in meaning
occur in similar contexts. In our problem, as sequences of AOIs (trajectories) are analogous to
sequences of words (sentences), we make the following assumption:
Assumption 1 (Contextual representation of AOIs). A contextual representation of AOIs follows
the spatial distributional hypothesis: AOIs that have similar contextual representations are usually
visited close to each other within a trip.
Given the above definitions, we define our problem of learning a contextual and topological
representation of AOIs as below.
Definition 6 (Learning a Contextual and Topological Representation of AOIs (CTRA) Problem).
Given a set of raw AOIs (i.e., without latent features) $V$, and a set of trajectories $S$, s.t. $\forall s \in S, \forall (v,t) \in s, v \in V$, the objective is to learn a mapping $V \rightarrow Z$, s.t. it generates a latent representation $z \in Z$ for each AOI $v \in V$ that follows the spatial distributional hypothesis and Tobler's First Law of Geography.
5.1.2 Methodology
We propose a Deep Multi-view informAtion-based RanKing network (DeepMARK) to solve the
CTRA problem. DeepMARK consists of three parts: learning the contextual representation, learning
the topological representation, and jointly learning both representations, which are elaborated in the
following sections.
5.1.2.1 Learning contextual representation from trajectories
Modeling spatial distributional hypothesis The learning of contextual representation of AOIs
in trajectories is analogous to the learning of word embeddings from sentences. To model the
distributional hypothesis, word embedding techniques usually describe similarities between words
using their contexts and then map words to hidden embeddings according to such similarities. For
example, the word2vec model [96] maximizes the log probability in Equation 5.1. The modeling
of the context similarity is implicitly computed by predicting the context words ($w_{t+j}$) of a target
word ($w_t$), which usually requires a sampling-based training process, e.g., negative sampling.

$$\frac{1}{T}\sum_{t=1}^{T}\sum_{-c \le j \le c,\, j \ne 0} \log p(w_{t+j} \mid w_t) \quad (5.1)$$
In this work, rather than using the sampling-based training and objective, we propose to use Pointwise Mutual Information (PMI) to describe the contextual similarity between AOIs and to learn the
representation by decomposing the similarities using neural networks. We believe such an approach
better fits the CTRA problem for the following reasons.
1. The similarity is symmetric. In word2vec models, one chooses a center word and its context word to describe the similarity. In this case, the similarity of "A to B" might differ
from that of "B to A", depending on whether A or B is chosen as the center word. However, in delivery scenarios,
we care more about whether the two places are likely to be visited from each other. So
we expect the similarity to be symmetric, i.e., similarity(A, B) = similarity(B, A), which is
guaranteed by PMI.
2. The decomposition of PMI has comparable performance and is implicitly equivalent to
SGNS. As shown in recent studies, the SGNS model implicitly factorizes the shifted PMI
matrix [101], and a good decomposition of the positive PMI (PPMI) matrix is comparable with word2vec
models in various tasks [102].
3. The training process is easy to align in a multi-view learning framework. Sampling-based training is hard to extend to multi-view problems like CTRA: even with
iterative training, one cannot align different views to the same training target (a single
AOI) and train them jointly. In contrast, the decomposition of PMI makes it easy to align the
same AOI across different views, which allows the joint training described in Section 5.1.2.3.
Formally, given AOIs $v_i$ and $v_j$, we define the contextual similarity from the trajectory data as
follows:

$$\mathrm{PMI}_{traj}(v_i, v_j) = \log\left(\frac{p(v_i, v_j)}{p(v_i)\, p(v_j)}\right) \quad (5.2)$$
Here, $p(v_i)$ and $p(v_j)$ denote the probabilities of randomly visiting $v_i$ and $v_j$, and $p(v_i, v_j)$ denotes
the probability of visiting $v_i$ and $v_j$ together. We can interpret $\frac{p(v_i, v_j)}{p(v_i)\, p(v_j)}$ as the ratio of how likely
people visit $v_i$ and $v_j$ together in the real world to how likely $v_i$ and $v_j$ would be visited together at
random. Therefore, a large ratio means the two AOIs $v_i$ and $v_j$ are co-visited not at random but for some real reason, e.g., they are easily accessible from each other according to human knowledge.
Computation of PMI in trajectories    Now, to compute the PMI between AOIs, the remaining
task is to define the computation of $p(v_i, v_j)$, $p(v_i)$, and $p(v_j)$ for AOIs in trajectories. For calculating $p(v_i, v_j)$, it is important to properly define the co-occurrence of AOIs $v_i$ and $v_j$ in the
trajectories. Different from the skip-gram model, we define that two AOIs co-occur in a close context if they fall within a fixed-length temporal window in a trajectory. To count such co-occurrences, we
apply a time sliding window $[t, t+\Delta]$ to each trajectory, where $\Delta$ is the window size. As shown in
Fig. 5.1b, each sliding window may contain various numbers of AOIs (window $T_1$ has 2 AOIs while
$T_3$ includes 3 AOIs) but the temporal length of each window is the same. In addition, we slide the
windows with an offset of $\Delta/2$ to make the best use of the trajectories while avoiding generating too
many samples, similar to [22].
We use such temporal windows for defining co-occurrence because of the nature of trajectories.
In detail, as depicted in Figure 5.1a, if we adopt the way the skip-gram model builds context windows ($C_1$ to $C_4$), for each AOI in the trajectory we have to extract a fixed number of preceding and
succeeding AOIs as its co-occurring neighbors. However, in trajectories, consecutively collected
spatio-temporal points usually have varying time differences, e.g., from 2 min to 20 min, because
of unstable signals and different mobile application settings. Consequently, if we adopt skip-gram and consider two consecutive but temporally distant AOIs as a co-occurrence, it will mislead the model
to produce similar embeddings for the two distant AOIs (e.g., in $C_3$, two distant nodes are
counted in the same context window), which is not expected.
After the sliding windows are generated, we count any two AOIs in the same window as a
co-occurring pair. The probabilities $p(v_i, v_j)$, $p(v_i)$, $p(v_j)$, and the PMI matrix between AOIs can
then be estimated by counting the co-occurring pairs as below.

$$\mathrm{PMI}_{traj}(v_i, v_j) = \log\left(\frac{p(v_i, v_j)}{p(v_i)\, p(v_j)}\right) = \log\left(\frac{\#(v_i, v_j)/|C|}{(\#(v_i)/|C|)\,(\#(v_j)/|C|)}\right) = \log\left(\frac{\#(v_i, v_j)\,|C|}{\#(v_i)\, \#(v_j)}\right), \quad \text{where } |C| = \sum_{i'}\sum_{j'} \#(v_{i'}, v_{j'})$$

In the equation above, $\#(v_i, v_j)$ denotes the count of co-occurring pairs $(v_i, v_j)$ from all windows,
and $\#(v_i)$ and $\#(v_j)$ denote the counts of pairs containing $v_i$ and $v_j$, respectively. $C$ denotes the set of all
co-occurring pairs and $|C|$ is the number of all pairs.
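To make the counting concrete, the following is a minimal sketch (not the implementation used in this work) of how the co-occurrence counts and the (positive) PMI matrix could be derived from AOI trajectories with a fixed-length temporal window; the input format, the window size, and the helper names are illustrative assumptions.

from collections import Counter
from itertools import combinations
import numpy as np

def pmi_from_trajectories(trajectories, n_aois, delta=1200):
    """Estimate the AOI-by-AOI PMI matrix from trajectories.

    trajectories: list of [(aoi_id, timestamp_in_seconds), ...], time-ordered.
    delta: temporal window size in seconds (e.g., 20 minutes); windows slide by delta/2.
    """
    pair_counts = Counter()
    for traj in trajectories:
        if not traj:
            continue
        t_start, t_end = traj[0][1], traj[-1][1]
        t = t_start
        while t <= t_end:
            window = {aoi for aoi, ts in traj if t <= ts < t + delta}
            for vi, vj in combinations(sorted(window), 2):
                pair_counts[(vi, vj)] += 1
                pair_counts[(vj, vi)] += 1   # keep the counts symmetric
            t += delta / 2                   # slide by half a window

    total = sum(pair_counts.values())
    single = Counter()
    for (vi, vj), c in pair_counts.items():
        single[vi] += c
    pmi = np.zeros((n_aois, n_aois))
    for (vi, vj), c in pair_counts.items():
        pmi[vi, vj] = np.log(c * total / (single[vi] * single[vj]))
    return np.maximum(pmi, 0.0)              # positive PMI, the common practice noted below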
Learning a distributed representation using autoencoder    After computing the PMI from
trajectories, we propose to use an autoencoder to decompose the PMI into a dense and distributed representation. Although for each AOI $v_i$ we could use its PMI similarities to all AOIs as its representation,
i.e., $[\mathrm{PMI}_{traj}(v_i, v_0), \mathrm{PMI}_{traj}(v_i, v_1), \ldots, \mathrm{PMI}_{traj}(v_i, v_n)]$, we propose to apply a low-rank decomposition of the sparse PMI matrix by an autoencoder. This is because a distributed representation [103] (i.e., one in which each
element encodes multiple things) is more expressive and allows efficient activation in downstream training [88]. In addition, an autoencoder allows non-linear encoding and thus can reconstruct the similarities more accurately.

Figure 5.1: Different ways of counting co-occurring AOIs: (a) counting co-occurrences by skip-gram ($C_1$ to $C_4$); (b) counting co-occurrences by sliding windows ($T_1$ to $T_6$).

Specifically, the autoencoder consists of an encoder $f$ and a decoder $g$. The encoder $f$ takes the PMI vector of each AOI and learns a
low-dimensional embedding. Then the decoder $g$ takes the low-dimensional embedding and reconstructs the PMI vector with minimum error. The objective of the network is to minimize the
reconstruction error $\mathcal{L}$ in Equation 5.3.

$$\mathcal{L}_{traj} = \sum_{i}^{n} \left\| \mathrm{PMI}_{traj}(i) - g(f(\mathrm{PMI}_{traj}(i))) \right\|^2 \quad (5.3)$$
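As a minimal PyTorch sketch (assuming a single hidden layer and illustrative sizes, not the exact architecture used in this work), the encoder $f$, decoder $g$, and the reconstruction objective of Equation 5.3 could be written as follows.

import torch
import torch.nn as nn

class PMIAutoencoder(nn.Module):
    def __init__(self, n_aois, dim=64):
        super().__init__()
        # encoder f: PMI vector -> low-dimensional embedding
        self.encoder = nn.Sequential(nn.Linear(n_aois, 256), nn.ReLU(),
                                     nn.Linear(256, dim))
        # decoder g: embedding -> reconstructed PMI vector
        self.decoder = nn.Sequential(nn.Linear(dim, 256), nn.ReLU(),
                                     nn.Linear(256, n_aois))

    def forward(self, pmi_rows):
        z = self.encoder(pmi_rows)
        return self.decoder(z), z

def reconstruction_loss(model, pmi_rows):
    # squared reconstruction error of Equation 5.3, averaged over a batch of PMI rows
    recon, _ = model(pmi_rows)
    return ((recon - pmi_rows) ** 2).sum(dim=1).mean()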
In summary, as depicted in Figure 5.2, DeepMARK first slides windows over the trajectories and then
counts the co-occurring pairs in these windows. After that, the PMI matrix is computed from
the counts and fed to an autoencoder for learning representations. Notice that here we
actually employ the common practice of positive PMI [101, 102, 104] rather than raw PMI, but we write
PMI for simplicity.
Figure 5.2: Learning contextual representation from trajectories via sliding windows, PMI computation, and an autoencoder.

5.1.2.2 Learning topological representation from graphs

Given that our initial AOIs already cover the entire space (see the AOI definition in Section 5.1.1), here
we would like to learn features (a latent representation) per AOI to capture the spatial relationships
among these AOIs that follow Tobler's First Law of Geography. Therefore, we use two graphs
to capture the spatial relations between AOIs: the Euclidean Graph $G_{euc}$ and the Adjacency Graph $G_{adj}$.
Intuitively, the former captures Euclidean proximity to enforce Tobler's First Law between
nearby AOIs, and the latter uses adjacency relationships to enforce the law for adjacent AOIs.
Euclidean Graph $G_{euc}$    We define $G_{euc} = \{V, E_{euc}, W_{euc}\}$. $V$ is the set of nodes, i.e., AOIs. We
define the weights $W = [w_{ij}] \in \mathbb{R}^{n \times n}$, representing the proximity between $v_i$ and $v_j$, as a function of
their Euclidean distance $dist(v_i, v_j)$. In particular, we define the proximity function as a thresholded
Gaussian kernel function [105] as in Equation 5.4. Intuitively, the closer the nodes, the larger the weight
assigned to the edge between them.

$$W_{ij} = \begin{cases} \exp\left(-\frac{dist(v_i, v_j)^2}{\sigma^2}\right) & \text{if } dist(v_i, v_j) \le K \\ 0 & \text{otherwise} \end{cases} \quad (5.4)$$
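As a small illustration of Equation 5.4, the following sketch builds the Euclidean graph weights from pairwise AOI centroid distances; the bandwidth $\sigma$ and the distance threshold (denoted $K$ in Equation 5.4) are assumed hyperparameters here.

import numpy as np

def euclidean_graph_weights(dist, sigma, threshold):
    """dist: n x n matrix of pairwise Euclidean distances between AOI centroids."""
    w = np.exp(-(dist ** 2) / (sigma ** 2))  # Gaussian kernel
    w[dist > threshold] = 0.0                # drop edges between distant AOIs
    return w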
Adjacency Graph $G_{adj}$    We model the adjacency between AOIs as a graph $G_{adj} = \{V, E_{adj}, W_{adj}\}$.
$V$ is the set of nodes, i.e., AOIs. The weights $W = [w_{ij}] \in \{0, 1\}$ represent the adjacency between
AOIs, where $w_{ij}$ is defined as below.

$$W_{ij} = \begin{cases} 1 & \text{if } v_i \text{ is adjacent to } v_j \\ 0 & \text{otherwise} \end{cases} \quad (5.5)$$
Figure 5.3: Calculating proximity to other nodes by different-step random walks (k = 1 and k = 2).
We expect the learning from the two graphs and from trajectories to have homogeneous
processes for a flexible and alignable joint learning of topological and contextual representations.
Therefore, we design PMI matrices for the graphs to be homogeneous with the trajectory view. For
any two nodes $v_i, v_j$ in a graph $G$ with weights $W$, to prepare the probabilities $p(v_i, v_j)$, $p(v_i)$,
and $p(v_j)$ in the graph, we define $p(v_i, v_j)$ as the proximity from $v_i$ to $v_j$ within $K$-step random
walks. Specifically, we first define a transition matrix $M^k$, in which $M^k_{i,j}$ represents the probability of
visiting $v_j$ in a $k$-step random walk from $v_i$ with restart ratio $\eta$, according to [106].

$$M^k = \eta I + (1 - \eta)\, M^{(k-1)} (D^{-1} W), \quad \text{where } M^0 = I$$

Here $I$ is the identity matrix and $D$ is a diagonal matrix such that each element on the diagonal is the
summation of the corresponding row of $W$, i.e., $D_{ii} = \sum_j W_{i,j}$. Accordingly, as depicted in
Figure 5.3, we can compute the proximity matrix $P^K$ as the sum of random walks within $K$ steps
starting from any node: $P^K = \sum_{k=1}^{K} M^k$. Then $p(v_i, v_j)$ is defined as $P^K_{i,j}$, and accordingly, $p(v_i)$ is
defined as $\sum_l P^K_{l,i}$.
Therefore, we compute the PMI matrix for a graph $G = \{V, E, W\}$ with maximum walking
step $K$ as below.

$$\mathrm{PMI}^K_{graph}(v_i, v_j) = \log\left(\frac{P^K_{i,j}}{\sum_l P^K_{l,i}\, \sum_l P^K_{l,j}}\right), \quad \text{where } P^K = \sum_{k=1}^{K} M^k$$
After computing the PMI matrices for $G_{euc}$ and $G_{adj}$, we can use the same autoencoder framework as for $\mathrm{PMI}_{traj}$ to learn the topological representations from the graphs. The remaining
task is to jointly learn a representation from all autoencoders.
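A sketch of how the restart random walks and the graph PMI could be computed from a weight matrix $W$ following the formulas above; the small numerical safeguards and the use of positive PMI (mirroring the trajectory view) are assumptions here.

import numpy as np

def graph_pmi(W, K=3, eta=0.1):
    """Compute PMI_graph^K from a non-negative weight matrix W."""
    n = W.shape[0]
    D_inv = np.diag(1.0 / np.maximum(W.sum(axis=1), 1e-12))
    M = np.eye(n)                      # M^0 = I
    P = np.zeros((n, n))
    for _ in range(K):
        M = eta * np.eye(n) + (1 - eta) * M @ (D_inv @ W)   # M^k with restart
        P += M                         # P^K = sum of M^1 .. M^K
    col = P.sum(axis=0)                # p(v_i) = sum_l P^K_{l,i}
    pmi = np.log(np.maximum(P, 1e-12) / np.outer(col, col))
    return np.maximum(pmi, 0.0)        # positive PMI, as for the trajectory view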
5.1.2.3 Jointly learning one representation by a multi-view ranking autoencoder
After the heterogeneous data are transformed into homogeneous views through different PMI com-
putations, we propose to use a multi-view autoencoder for jointly learning the PMI in both trajec-
tory and graphs as described in previous sections. In detail, all views (the PMI matrices) are fed
into separate encoders and share the same middle layers, which generate embedding for the AOIs.
Then separate decoders take the outputs of the shared layers and reconstruct the views and mini-
mizing all errors. The network structure is depicted in Figure 5.4. For a straightforward multi-view
autoencoder [107], the loss of our network could be written as a summation of all reconstruction
losses:
`=(1ab)L
tra j
+aL
euc
+bL
ad j
(5.6)
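A simplified PyTorch sketch of this multi-view structure, with separate encoders and decoders per view and a shared middle layer producing the AOI embedding $z$; the layer sizes are illustrative assumptions.

import torch
import torch.nn as nn

class MultiViewAE(nn.Module):
    def __init__(self, n_aois, dim=64, hidden=256):
        super().__init__()
        views = ["traj", "euc", "adj"]
        self.encoders = nn.ModuleDict(
            {v: nn.Sequential(nn.Linear(n_aois, hidden), nn.ReLU()) for v in views})
        self.shared = nn.Linear(hidden, dim)          # shared layer -> embedding z
        self.decoders = nn.ModuleDict(
            {v: nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(),
                              nn.Linear(hidden, n_aois)) for v in views})

    def forward(self, pmi):
        """pmi: dict mapping each view name to a batch of PMI rows."""
        recon, z = {}, {}
        for v, x in pmi.items():
            z[v] = self.shared(self.encoders[v](x))
            recon[v] = self.decoders[v](z[v])
        return recon, z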
Dynamic ranking-weighted loss    In Equation 5.6, $\alpha$ and $\beta$ are the weights of the two topological
views (graphs). Rather than using static weights, which require much effort to tune and do not change during training, we propose to dynamically change the weights according to the alignment between different views. Specifically, we expect the weights across views to be adjusted according to the order-sensitive discordance between the contextual and topological
views. Below we provide the definition of such discordance:
Definition 7 (Order-sensitive Discordance). For a given AOI $v_i$, an order-sensitive discordance
between view A and view B happens if the sorting of other AOIs by their similarities to $v_i$ in view
A is largely different from that in view B. In other words, AOI $v_j$ ranks high in $v_i$'s similarity list sorted
by view A, but ranks low in the sorting by view B.
With this strategy, we introduce an inductive bias from real-world observations and domain
knowledge. In detail, we observe that trajectories have different sampling densities at different
AOIs. Some AOIs and their neighbors are frequently visited in the trajectories; these AOIs have
sufficient contextual semantics and are also consistent with the geography law, so we want to
learn more from (put more weight on) the contextual view. In contrast, some AOIs are rarely or
never visited, and the learning from trajectories cannot produce a meaningful embedding for these
AOIs. From the trajectory view, these AOIs cannot correctly order their relationships to other AOIs
and would conflict with the geographical law. In the latter case, we require more effort from the learning
of graphs (higher weights on the topological views) to ensure the law of geography.
Therefore, rather than using static values for $\alpha$ and $\beta$, we propose to use dynamic weights based
on ListMLE [108], a list-wise ranking loss, computed between the graph PMI vectors and the
trajectory PMI vector. If we denote $x_{traj}$ as the reconstructed vector from the trajectory view, $y_{euc}, y_{adj}$
as the reconstructed vectors from $G_{euc}$ and $G_{adj}$, and $h_i(y_{euc})$ as the AOI index at the $i$-th largest value
of $y_{euc}$, the ListMLE-based weights can be written as in Equations 5.7 and 5.8. Intuitively, $\alpha$ and $\beta$ are large
if $y_{euc}$ and $y_{adj}$ rank the elements differently from $x_{traj}$; in other words, the largest element in
$y_{euc}$ might be the smallest in $x_{traj}$. A possible example is when both $v_1$ and $v_2$ are not visited
in the trajectories: the ranking of their similarities is based on a default value, which could conflict with
the ranking of the topological similarity learned from $y_{euc}$ and $y_{adj}$. In this case, DeepMARK puts
more weight on $G_{euc}$ and $G_{adj}$.
$$\alpha = -\frac{1}{2}\sum_{j=1}^{n} \log \frac{e^{x_{traj}(h_j(y_{euc}))}}{\sum_{k=j}^{n} e^{x_{traj}(h_k(y_{euc}))}} \quad (5.7)$$

$$\beta = -\frac{1}{2}\sum_{j=1}^{n} \log \frac{e^{x_{traj}(h_j(y_{adj}))}}{\sum_{k=j}^{n} e^{x_{traj}(h_k(y_{adj}))}} \quad (5.8)$$
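A sketch of the ListMLE-style weight in Equations 5.7 and 5.8 for a single AOI, where `x_traj` and `y` are the reconstructed PMI vectors from the trajectory view and one topological view; the sign and the 1/2 factor follow the form above.

import torch

def listmle_weight(x_traj, y):
    """ListMLE-style discordance between x_traj and one topological view y.

    x_traj, y: 1-D tensors of length n (reconstructed PMI vectors of one AOI).
    Returns a scalar that grows when the two views rank the other AOIs differently.
    """
    order = torch.argsort(y, descending=True)      # h_j(y): indices sorted by y
    x_sorted = x_traj[order]
    # logsumexp over each suffix x_sorted[j:], computed via a reversed cumulative logsumexp
    rev_cumlse = torch.logcumsumexp(x_sorted.flip(0), dim=0).flip(0)
    log_lik = (x_sorted - rev_cumlse).sum()
    return -0.5 * log_lik                          # alpha (or beta) in Eq. 5.7/5.8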
Figure 5.4: DeepMARK network structure (PMI matrices of the trajectory, $G_{euc}$, and $G_{adj}$ views pass through stacked shared layers to produce the embedding $z$, with reconstruction and ListMLE-based weighting).
5.1.3 Experiments
In this section, we evaluate the proposed model on a real-world delivery dataset and task. We
also include visualizations of interpretive results to help understand the model and the effect of
different modules.
5.1.3.1 Dataset
We conduct the experiments on the package delivery data collected by Cainiao Network, which handles
more than a hundred million packages per day. In the experiment, the trajectory data come from
a dispatching region and cover the period from July 1, 2019 to Aug 31, 2019. The trajectories are pre-processed by
removing outliers, aggregating points, and mapping them to AOIs. The original form of the trajectories is
GPS coordinates and timestamps; after pre-processing, the input of this work is sequences of
AOIs and timestamps.
5.1.3.2 Experimental Settings
Adapted Baseline Algorithms Since there is no existing work on learning a contextual and
topological AOI representation from trajectories and graphs, we adapted various approaches to our
problem and compare them with DeepMARK. Here we list these adapted baseline approaches:
• Topological-only baselines:
– GeoHash [109] is a general encoding of spatial objects. It maps coordinates to
fixed-length vectors in which a common prefix usually implies close locations.
– Deepwalk [36] and Node2vec [38] are state-of-the-art graph representation models
which learn node embedding by skip-gram model from generated random walks.
• Contextual-only baseline:
– Word2vec [96] is a word embedding approach that learns word representation from
sentences. We adapt the model to our problem by treating trajectories as sentences and
AOIs as words.
• Homogeneously integrated baselines:
– We create two straightforward baselines word2vec + deepwalk and word2vec + node2vec,
which are concatenations of word2vec embedding and graph embeddings. Thus these
approaches also have the same input information as PTE (described below) and Deep-
MARK for a fair comparison.
• Heterogeneously integrated baseline:
– PTE [99] is a heterogeneous embedding model that learns word embedding from
both sentences and graphs based on a Heterogeneous Information Network Embed-
ding (HINE) approach. We adapt this model to our problem by treating the trajectories
of AOIs as the sentences of words and replacing their graphs with our graphs.
Parameter Settings For all random walk generations from graphs in Deepwalk, Node2vec and
PTE, the walking length is set to 30, and walks per node is set to 30. Specifically, for Node2vec, p
and q are set to 4 and 1. In DeepMARK, the restart ratio $\eta$ for $G_{euc}$ and $G_{adj}$ is set to 0.1. The
sliding window size is set to 20 min and the sliding offset is 10 min.
Training, Validation and Testing    Following the principle of time-related prediction, we use the
later data for testing and the earlier data for training and validation, using an 80-20 split.
5.1.3.3 Evaluation with ETA Prediction
We evaluate our embedding framework on the prediction of the Estimated Time of Arrival (ETA)
in the last-mile package delivery task. Predicting ETA in last-mile deliveries is challenging
because the couriers usually travel by non-motor vehicles and the environments are very complex.
In this task, we use deepETA [110] as the prediction model and replace the spatial representation
of AOIs (by default Geohash in deepETA) with embeddings from the listed approaches to evaluate
their performance.
Evaluation metrics    We utilize the Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE) to evaluate the prediction performance on different embeddings. Smaller values
indicate better performance in the prediction of ETA.
Comparison Results    In Table 5.1 we compare the errors of ETA prediction using deepETA
with different representations. We can observe that DeepMARK has a significant advantage over
all other baselines, reducing errors by up to 20%. Its variant DeepMARK$_{static}$, which
uses fixed weights on the multiple views, is worse than DeepMARK but slightly better than the others.
We also plot, in Figure 5.5, the validation MAE versus the training epochs of deepETA using
different representations. We can observe that the embedding by DeepMARK enables the model
to converge to the lowest validation error. We can also observe that PTE has a similar performance
to word2vec+node2vec; both have little control over coordinating different views and thus induce
larger errors than DeepMARK.
Figure 5.5: Learning curves of ETA prediction by different representations

Table 5.1: ETA prediction performance

Model                   RMSE (min)   MAE (min)
Geohash                 55.22        40.98
word2vec                53.43        36.91
Deepwalk_euc            56.52        37.35
Deepwalk_adj            55.36        37.44
Node2vec_euc            54.76        37.12
Node2vec_adj            55.23        38.56
word2vec + deepwalk     54.15        37.41
word2vec + node2vec     53.86        36.53
PTE                     53.17        36.52
DeepMARK_static         51.78        34.87
DeepMARK                48.68        32.61

5.1.3.4 Model Interpretation

Visualization of the effect of joint learning    We utilize t-SNE [64] to visualize the embeddings
of AOIs by Word2vec on trajectories, Node2vec on $G_{euc}$, and DeepMARK on both views in Figure 5.6. The colors of the points are based on Geohash values, which means that points in similar colors
are close in the real world. We can observe that in the Word2vec result, many distant AOIs are embedded closely (light yellow points and dark blue points). The Node2vec result has a smooth color
transition from yellow to blue, which indicates nice consistency with the law of geography, but
the colors are almost evenly distributed, which means it does not reveal any human knowledge about
the AOIs. On the contrary, in the DeepMARK result, the points show some variance in colors and
shapes (holes and clusters in the figure) while the overall color transition is still smooth. This reflects that DeepMARK can learn human knowledge and meanwhile maintain the law of geography.

Figure 5.6: Embedding visualization by t-SNE: (a) Word2vec; (b) Node2vec; (c) DeepMARK.
Visualization of the changes of ranking-based weights    To understand how the ListMLE losses
direct the training process of DeepMARK, we visualize the change of $\beta$ (the ranking loss between
the trajectory view and the $G_{adj}$ view) in Figure 5.7. In each plot, the x-axis and y-axis are the geographic coordinates, i.e., longitude and latitude. Each dot in the figures represents an AOI, and
its color denotes the value of $\beta$ calculated for this AOI: dark blue indicates a large $\beta$ and light
green indicates a small $\beta$. In Figure 5.7 we show the calculated $\beta$ of all AOIs at different training
stages, i.e., epochs 1, 30, and 100. We can observe that: (a) the $\beta$ values of all AOIs show little difference
at epoch 1 because of the randomness caused by the initial parameters of the neural network; (b) the
$\beta$ of some AOIs grows larger and that of others grows smaller as the training proceeds to epoch 30; (c) a few
AOIs have relatively large $\beta$ while the majority of AOIs have low $\beta$ when the network is well-trained at epoch 100. Such change from epoch 1 to epoch 100 reflects the intuition
behind our ranking-based weight strategy. Specifically, the ranking losses do not much affect the
reconstruction of the views in the early stages; this allows the model to have a warm start
on roughly learning the representations of all views. However, when each view becomes well trained
and the topological and contextual views become inconsistent from the ordering perspective, the
ranking losses ($\alpha$ and $\beta$) start to regularize this disordering according to our inductive bias, until
the reconstructions and the ranking losses across different views are balanced.

Figure 5.7: Visualization of the change of the ranking loss between the topological view and the contextual view at epochs 1, 30, and 100.
5.2 Learning Region Embeddings from Trajectories by Capturing
Their Mobility Contexts
Mobility behavior analysis and other mobility-related tasks, like predicting the next location, usually
rely on an effective representation of regions that incorporates information about mobility characteristics
instead of merely the location measurements. To effectively represent a region, besides extracting
handcrafted features from raw trajectories, recent studies typically rely on information from
auxiliary data such as Points of Interest (POIs) or social media data [18, 19]. For example, Zhang
et al. defined proximity between regions using taxi trips, POI data, and social check-in data [33].
Zhang et al. utilized geo-tags of social media tweets to discover geographical topics and hotspots
of regions [34].
However, the aforementioned auxiliary information is not always available in many cities or
countries. The distribution and completeness of these data also vary largely between places [15].
Besides, the quality of the auxiliary data varies from one source to another and cannot always be
preserved over time [16], and may be limited by commercial or privacy concerns [17]. Thus, it
is not always feasible to employ prior approaches in the absence of well-populated
and high-quality auxiliary information. On the other hand, trajectory data actually carry much
richer information than simply a sequence of visits. Such information includes the preferences of the users
who generate the trajectories, the order and duration of regions being visited, the temporal seman-
tics (such as day or night) of each visit, and the commonality of the users, times and regions across
trajectories. We term such information as the “mobility context”, which captures the commonality
and associations among regions, users, time and their attributes in trajectories. Yet, these mobility
contexts of trajectories were not fully exploited by the existing research on region representation.
For example, questions like “which regions are usually visited by the same group of people?” and
“which regions are typically visited for longer duration than others?” were rarely asked when
modeling the trajectories. Our hypothesis is that if we can more effectively capture these mobility
contextual features, we can use them to learn more about the geographical regions these trajectories
span over. In cases where we have no auxiliary information about the regions (e.g., no POI, no so-
cial media) just by analyzing the trajectories, we can characterize the commonality or proximity of
mobility behaviors occurring in the underlying regions, which in turn can facilitate the downstream
utilization of trajectories (e.g., clustering the mobility behavior they represent, the next locations
they visit, or how the underlying regions are similar to each other).
To this end, we propose to learn region embeddings (low-dimensional vectors mapped from
regions via a function) that capture mobility contexts only using trajectory data, which aim at prop-
erly modeling associations among users, regions and times and their attributes. Note that existing
approaches that only use the trajectory data fall short of incorporating such composite mobility
contexts. For instance, Zhu et al. modeled trajectories like sentences in NLP, therefore learning location embeddings in the same way as skip-gram word embeddings and limiting the characterization
to the co-occurrence of regions [111]. Yue et al. consider both the co-occurrence of regions
and the geometric proximity but overlook temporal relations and user connections [35].
To collectively capture the diverse mobility contexts within the trajectories, we are inspired
by the recent advances in heterogeneous graph embedding approaches [47], given their successes
in other domains like Web mining [45] and social media analysis [51]. In this line of research, Dong
et al. proposed to learn embeddings of nodes based on metapaths [112] in heterogeneous networks,
and the obtained embeddings are applied to classify authors and venues on academic references.
Ying et al. designed a graph neural network to learn embeddings on the bipartite graphs of pins
and boards in a media sharing network to facilitate media recommendation. Inspired by these
discoveries, we seek to investigate an approach by heterogeneous graph embedding for capturing
the diverse mobility context in trajectory data.
Nevertheless, it is a nontrivial task to adapt the principle of heterogeneous graph embeddings
to model regions, e.g., due to the ordinal properties of trajectories and times, as well as extracting
intra-trajectory and inter-trajectory properties for characterizing users and regions. In order to
properly model the mobility context with the graph and learn effective representations, in this work,
we define a customized heterogeneous graph. On this graph, we characterize each type of node
with appropriate attributes and define multiplex edges between the nodes. We use a heterogeneous
graph neural network to encode neighbor attributes to node embeddings, and design a meta-path
constrained random-walk (or metawalk) objective to measure the proximity of mobility context. In
addition, we develop an adaptive learning process that changes the partitioning of regions to cope
with the sparsity of co-occurrences.
We evaluated our model on three tasks with real-world trajectory data and showed significant
advantages over other state-of-the-art approaches without using any auxiliary data.
Our contributions are summarized as follows.
1. We propose a framework that learns region embeddings only from trajectory data (without
other auxiliary information like POI) using a heterogeneous graph neural network.
2. We model various types of contextual associations and attributes from the trajectories and
build a heterogeneous graph to exhaustively incorporate the mobility context of regions. This
is coupled with a metawalk-based objective to model the proximity of mobility context.
3. We design an adaptive region partitioning method that changes the partition of regions along
with the training of region embedding.
4. Experiments on real-world data show that our method consistently outperforms various
strong baseline approaches on clustering mobility behaviors, next location prediction, and
POI correlation analysis.
The remainder of this section is organized as follows. After introducing preliminaries in Sec-
tion 5.2.1, we propose our method in Section 5.2.2. Section 5.2.3 describes our experiments on
real-world data and discussion of the ablation analysis.
5.2.1 Preliminaries
5.2.1.1 Notations
Definition 8 (Trajectory). A trajectory $s = \{s^{(1)}, s^{(2)}, \ldots, s^{(T)}\}$ is a time-ordered sequence of
spatio-temporal points. Each point $s^{(t)}$ consists of a pair of spatial coordinates (i.e., latitude,
longitude) and its timestamp. Each trajectory $s$ is generated by an anonymous user $u$ within a specific
time granule (e.g., a day).
Since most spatio-temporal points in the trajectory are collected during travelling between two
locations, these points have much less relevance to users’ activities than the points where the users
stay. Therefore, we need to preprocess the trajectories and extract the points where the users stay,
i.e., stay points.
Definition 9 (Stay Point). A stay point $\dot{s}^{(t)}$ extracted from a trajectory $s$ is a spatio-temporal point,
which is the geometric center of a longest sub-trajectory $s^{(i \to j)} \subseteq s$, $s^{(i \to j)} = \{s^{(i)}, s^{(i+1)}, \ldots, s^{(j)}\}$,
such that $s^{(i \to j)}$ is a staying subtrajectory and neither $s^{(i-1 \to j)}$ nor $s^{(i \to j+1)}$ is a staying subtrajectory.
Definition 10 (Staying Subtrajectory). A staying subtrajectory $s^{(i \to j)} \subseteq s$, $s^{(i \to j)} = \{s^{(i)}, s^{(i+1)}, \ldots, s^{(j)}\}$,
of trajectory $s$ is a contiguous sub-sequence of $s$, such that within $s^{(i \to j)}$ the trajectory is limited
to a range $\rho_s$ in space, and its duration $s^{(j)}.time - s^{(i)}.time$ is longer than a specific threshold $\rho_t$.
Intuitively, a stay point is the spatio-temporal point where the user stays long enough within a
small region. We utilize the algorithm proposed in [27] to extract the stay points from trajectories.
The trajectories are then transformed into sequences of stay points (named stay point trajectories).
The locations of the stay points are continuous coordinates, which should be discretized into small
regions for learning a distributional representation. Following common practice, we employ
grid-based partitioning to split the area into fine-grained regions.

Definition 11 (Grid-based Partitioning). We partition an area (like a city) with a fixed-size grid
into small regions. Each region is a rectangular cell. All stay point trajectories are then
transformed into sequences of the regions in which the stay points are located.
We consider the temporal and periodical properties of the timestamps of trajectories as im-
portant characteristics of mobility context. Hence, we discretize the cycle of time in a week into
periods to represent the temporal information.
Definition 12 (Period). A period p denotes a granular period of time in a week, e.g., Monday
8am-10am if we select 2 hours as the granularity [113].
To incorporate different associations between regions, users and periods, we define a heteroge-
neous graph for mobility data as below, which will be elaborated in Section 5.2.2.1.
Definition 13 (Heterogeneous Mobility Graph). A heterogeneous mobility graph constructed from
a set of trajectories $S$ is denoted as $\mathcal{G} = \{\mathcal{V}, \mathcal{E}\}$, in which $\mathcal{V} = \bigcup_{\phi \in \{R, U, P\}} \mathcal{V}_{\phi}$ and $\mathcal{E} = \bigcup_{\psi \in \{v, o, f, b\}} \mathcal{E}_{\psi}$.
Here $\mathcal{V}_R$, $\mathcal{V}_U$, $\mathcal{V}_P$ are typed node sets for regions, users, and periods. $\mathcal{E}_{\psi}, \psi \in \{v, o, f, b\}$ are
heterogeneous edges denoted as in Tab. 5.2.
We also define the Mobility Metapath and Mobility Metawalk in the Heterogeneous Mobility
Graph to model the proximity of mobility context, which are elaborated in Section 5.2.2.3.

Definition 14 (Mobility Metapath). A mobility metapath is a minimum sequence of types of nodes
that implies high mobility proximity between two region nodes. For instance, Region$_1$ $\xrightarrow{v}$ User $\xrightarrow{v}$ Region$_2$ is a metapath.

Definition 15 (Mobility Metawalk). A mobility metawalk is a sequence that concatenates a group
of mobility metapaths via a sampling process. For instance, Region$_1$ $\xrightarrow{v}$ User $\xrightarrow{v}$ Region$_2$ $\xrightarrow{f}$ Region$_3$ $\xrightarrow{b}$ Region$_4$ is a metawalk.
5.2.1.2 Problem Definition
Formally, we define the problem we study as below.
Definition 16 (Region Embedding from Trajectory Data). Given a set of stay point trajectories $\dot{S}$
in a specific area, the goal is to partition the area into a set of regions $R$ and learn embeddings of
the regions that capture their mobility contextual proximity in a continuous low-dimensional space.
Table 5.2: Notations

$s$: a raw trajectory
$\dot{s}$: a stay point trajectory
$v^R_i$: region node $i$
$u$: a user
$p$: a temporal period
$\mathcal{G}$: heterogeneous mobility graph
$\mathcal{E}_v$: edges between user nodes and region nodes
$\mathcal{E}_o$: edges between period nodes and region nodes
$\mathcal{E}_f$: edges from region nodes preceding other region nodes in trajectories
$\mathcal{E}_b$: edges from region nodes succeeding other region nodes in trajectories
$\mathcal{N}^{\psi}_v$: set of neighboring nodes of $v$ via edges of type $\psi$
$m$: a mobility metapath
5.2.2 Method
To capture the heterogeneous associations between regions, users, periods and their specific at-
tributes in the representation, we propose to build a heterogeneous graph and learn the region
embeddings via a graph neural network. Our framework can be summarized in Fig. 5.8. In this
section, we first introduce the construction of the heterogeneous graph (Section 5.2.2.1), followed
by the encoder techniques to encode the graph elements (Section 5.2.2.2) and the learning objec-
tive that characterizes the mobility contextual proximity (Section 5.2.2.3). Then we introduce a
dynamic grid partitioning mechanism (Section 5.2.2.4) that dynamically merges physically adja-
cent regions for better embedding learning. Finally we describe the inference process using the
learned embeddings (Section 5.2.2.5).
5.2.2.1 Heterogeneous Graph Construction
We construct an attributed heterogeneous graph from the stay point trajectories which is visualized
in Fig. 5.9. Both the edges and nodes are of varied types, and especially the edges between region
nodes are multiplex. We describe in the following the technical details of how the proposed graph
captures heterogeneous associations and attributes of different types of mobility contexts.
Figure 5.8: Framework overview (node sets $\mathcal{V}_U$, $\mathcal{V}_R$, $\mathcal{V}_P$ with attributes, metapaths, and a metawalk).
Figure 5.9: Mobility Heterogeneous Graph
Edges
We define several types of edges for different associations of the entities accommodated in
trajectories.
Region-Region Associations    We define two sets of directional edges between region nodes, i.e.,
the forward edges $\mathcal{E}_f$ and the backward edges $\mathcal{E}_b$. A forward edge denotes a co-occurrence
of regions where the target node (region) succeeds the source node (region) in the trajectory,
whereas this direction is reverted for a backward edge. Specifically, given a skip-gram size $k$, a forward edge $e_f(r_i, r_j) \in \mathcal{E}_f$ is created if $r_i$ appeared at least once preceding $r_j$ within $k$ grams in a stay
point trajectory, i.e., $\exists \dot{s} \in \dot{S}$ such that $\dot{s}_{t_1}.pos = r_i$, $\dot{s}_{t_2}.pos = r_j$, $t_1 < t_2$, and $t_2 - t_1 \le k$. The weight
$w_f(r_i, r_j)$ of edge $e_f(r_i, r_j)$ corresponds to the number of such forward co-occurrences in $\dot{S}$. Similarly, a backward edge $e_b(r_i, r_j) \in \mathcal{E}_b$ is created if $r_i$ appeared at least once succeeding $r_j$ within $k$
grams in a stay point trajectory, i.e., $\exists \dot{s} \in \dot{S}$ such that $\dot{s}_{t_1}.pos = r_j$, $\dot{s}_{t_2}.pos = r_i$, $t_1 < t_2$, and $t_2 - t_1 \le k$.
The weight $w_b(r_i, r_j)$ also corresponds to the number of such backward co-occurrences. Note that
the existence of $e_f(r_i, r_j)$ always implies the existence of $e_b(r_j, r_i)$, but not necessarily
the existence of $e_f(r_j, r_i)$. Thus, it is always true that $w_f(r_i, r_j) = w_b(r_j, r_i)$, but in most
cases $w_f(r_i, r_j) \ne w_f(r_j, r_i)$ in a trajectory dataset. For illustration, a relevant example is that a
breakfast restaurant is commonly visited before work but rarely visited after work. In this
case, it is necessary to differentiate between directions and define two types of directional edges for region-region
associations.
Region-User Associations    $\mathcal{E}_v$ is the set of edges between region nodes and user nodes. An edge
$e_v(r, u) \in \mathcal{E}_v$ exists if the user $u$ has visited region $r$ at least once. Its weight $w_v(r, u)$ denotes
the frequency with which user $u$ visits region $r$. Note that region-user associations are marked as
undirected edges since the association does not imply an order.
Region-Period Associations    $\mathcal{E}_o$ is the set of edges between region nodes and time period nodes.
An edge $e_o(r, p) \in \mathcal{E}_o$ is created if a visit to region $r$ occurs within period $p$. The edge weight
$w_o(r, p)$ represents the number of occurrences of $r$ being visited in period $p$. Like the edges for region-user
associations, those between regions and periods are also undirected.
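As an illustration of this edge construction, the following sketch derives the forward/backward region-region edges (with skip-gram size k), the region-user edges, and the region-period edges from stay point trajectories; the input format is an assumption.

from collections import Counter

def build_edges(stay_trajs, k=2):
    """stay_trajs: list of (user_id, [(region_id, period_id), ...]) in time order.

    Returns weighted edge dictionaries keyed by (source, target).
    """
    w_f, w_b, w_v, w_o = Counter(), Counter(), Counter(), Counter()
    for user, seq in stay_trajs:
        for idx, (r, p) in enumerate(seq):
            w_v[(r, user)] += 1                     # region-user edge
            w_o[(r, p)] += 1                        # region-period edge
            for j in range(idx + 1, min(idx + k + 1, len(seq))):
                r_next = seq[j][0]
                w_f[(r, r_next)] += 1               # r precedes r_next within k grams
                w_b[(r_next, r)] += 1               # mirrored backward edge
    return w_f, w_b, w_v, w_o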
Nodes

There are other important properties of all types of nodes that can be extracted from stay point
trajectories and cannot be represented as edges. We aim to encode these attributes into the embeddings so that attribute proximity can also be captured in addition to structural proximity.
Commonality can also be learned through nodes that have the same attributes.
Region Attributes We extract the following attributes for each region node.
• Geohash We employ the binary vector of the geohash of the region. Each vector consists
of six 5-digit binary numbers, and each binary number indicates the relative location in a
geometric hierarchy, e.g., the first indicates a rough area of the city, and the second
indicates a smaller area inside that area.
• Stay time We calculate the histogram of the stay times of visits to the region. Specifically,
for each region r, we group the durations of all stay points located in r into bins and use the
counts of stay points in all bins as a vector.
User Attributes We employ the following attributes of each user.
• Onehot Identification The user identification can be represented as a one-hot vector, in which the i-th digit represents the i-th user.

• Travelling Diameters For each user u, we include statistics of the magnitude of the
trajectories generated by user u. Specifically, this feature is a vector consisting of six real values. The
first three values represent the minimum, maximum and average number of stay points across all trajectories of the user. The latter three values represent the minimum,
maximum and average Euclidean diameter of all trajectories of user u.
Period Attributes We employ the following attributes of each period to describe its temporal
semantics.
• Day of week The day of week of the period. The ordinal value is transformed into one-hot
vector in which each digit represents one day of week.
• Time of day The index of the time bin of the day. The bin size is set to 2 hours, e.g., 8 a.m.-10 a.m. is a period. The ordinal value is transformed into a one-hot vector as well.
• Weekday/weekend 1 if the period is a weekday and 0 for weekend.
• Day/night 1 if the period is in daytime and 0 for night.
5.2.2.2 Heterogeneous Graph Encoding
Since the region nodes, user nodes and period nodes have attributes that are of different dimensions
and meanings, we need to project these features to equal-size latent vectors for further operations
in the neural networks. Therefore, a linear neural layer is added before feeding the attributes to the
graph layers. Specifically, we transform the node attributes via
$$h^R = W^R x^R, \quad h^U = W^U x^U, \quad h^P = W^P x^P$$

where $h^{\cdot}$ denotes the latent vector of the corresponding type of node, $x^{\cdot}$ refers to the attributes, and
$W^{\cdot}$ is the weight of the layer.
We believe that the embedding of a region should incorporate the attributes of the region itself and of the
neighboring entities. For example, a region's mobility characteristics are determined by the specific
groups of users visiting this region at specific times. We describe below how we encode such information via a heterogeneous graph encoder. Given the transformed latent vectors of all node types,
we compute the embedding of each region node by aggregating the latent vectors of all types
of neighbor nodes. Specifically, for each region node $v^R_i$, the embedding $z^R_i$ is defined as

$$z^R_i = W \cdot \mathrm{CONCAT}(h^R_i, h'_{N(v^R_i)}) + b \quad (5.9)$$
where $h'_{N(v^R_i)}$ represents the aggregated information from all four types of neighbors, defined as
below.

$$h'_{N(v^R_i)} = a_1 h^R_{N_f(v^R_i)} + a_2 h^R_{N_b(v^R_i)} + a_3 h^R_{N_v(v^R_i)} + a_4 h^R_{N_o(v^R_i)} \quad (5.10)$$
Here $h^R_{N_\psi(v^R_i)}$ is the output of a graph aggregation over the neighboring nodes connected via edges of
type $\mathcal{E}_\psi$:

$$h^R_{N_f(v^R_i)} = \mathrm{Agg}(\{f_f(h^R_j), \forall v^R_j \in \mathcal{N}^f_{v^R_i}\})$$
$$h^R_{N_b(v^R_i)} = \mathrm{Agg}(\{f_b(h^R_j), \forall v^R_j \in \mathcal{N}^b_{v^R_i}\})$$
$$h^R_{N_v(v^R_i)} = \mathrm{Agg}(\{f_v(h^U_j), \forall v^U_j \in \mathcal{N}^v_{v^R_i}\})$$
$$h^R_{N_o(v^R_i)} = \mathrm{Agg}(\{f_o(h^P_j), \forall v^P_j \in \mathcal{N}^o_{v^R_i}\})$$

$\vec{a}$ is the vector of self-attention weights learned by an attention mechanism [114]. $\mathrm{Agg}$ is an aggregation
or pooling layer such as "mean", "max", or "lstm". $f_f, f_b, f_v, f_o$ are non-linear neural layers.
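A simplified PyTorch sketch of the region encoder in Equations 5.9 and 5.10, using mean pooling as Agg and a softmax attention over the four neighbor types; the dimensions and the attention parameterization are illustrative assumptions rather than the exact architecture.

import torch
import torch.nn as nn

class RegionEncoder(nn.Module):
    def __init__(self, dim):
        super().__init__()
        # per-edge-type non-linear layers f_f, f_b, f_v, f_o
        self.f = nn.ModuleDict({t: nn.Sequential(nn.Linear(dim, dim), nn.ReLU())
                                for t in ["f", "b", "v", "o"]})
        self.att = nn.Parameter(torch.zeros(4))       # attention logits -> a_1..a_4
        self.out = nn.Linear(2 * dim, dim)            # W and b of Eq. 5.9

    def forward(self, h_region, neigh):
        """h_region: (dim,) latent vector of the region node.
        neigh: dict mapping edge type to a (num_neighbors, dim) tensor of neighbor latents."""
        pooled = [self.f[t](neigh[t]).mean(dim=0) for t in ["f", "b", "v", "o"]]
        a = torch.softmax(self.att, dim=0)
        h_neigh = sum(a[i] * pooled[i] for i in range(4))        # Eq. 5.10
        return self.out(torch.cat([h_region, h_neigh], dim=0))   # Eq. 5.9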
5.2.2.3 Optimization Objective
The learning objective is to maximize the embedding similarity of regions that have high mobility
proximity. For such a purpose, we design the objective function based on the following practical
heuristics:
1. Two regions that are visited more frequently by users in the same period of time should have
higher mobility proximity.
2. Two regions that are more often visited by the same user should have higher mobility prox-
imity.
3. Two regions that co-occur before or after the same region in more trajectories should also have
higher mobility proximity.

The aforementioned heuristics reflect that the proximity of regions can be derived from heterogeneous mobility contexts. Accordingly, we define four metapaths based on the above heuristics.
1. Region$_1$ $\xrightarrow{f}$ Region$_2$ $\xrightarrow{b}$ Region$_3$

2. Region$_1$ $\xrightarrow{b}$ Region$_2$ $\xrightarrow{f}$ Region$_3$

3. Region$_1$ $\xrightarrow{v}$ User $\xrightarrow{v}$ Region$_2$

4. Region$_1$ $\xrightarrow{o}$ Period $\xrightarrow{o}$ Region$_2$

Algorithm 2: Metawalk
Input: mobility heterogeneous graph $\mathcal{G}$, target region node $v$, a set of metapaths $M$, number of hops $K$, restart probability $q$
Output: a reachable node $v'$ via a metawalk
 1: $v' \leftarrow v$
 2: for $k \leftarrow 1$ to $K$ do
 3:   if (random() < $q$) and ($k > 1$) then
 4:     continue
 5:   else
 6:     randomly draw $m$ from $M$
 7:     for each edge type $\psi$ in the metapath $m$ do
 8:       $v' \leftarrow$ sample($\mathcal{N}^{\psi}_{v'}$)    // sample the next node based on the weights on type-$\psi$ edges
 9:     end for
10:   end if
11: end for
12: return $v'$
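For reference, the sampling procedure of Alg. 2 could be sketched in Python roughly as follows; the graph access helper `neighbors` (returning weighted neighbor lists per edge type) is an assumption, and every node is assumed to have at least one neighbor of each required type.

import random

def metawalk(graph, v, metapaths, K, q):
    """Sample a reachable region node from v by chaining metapaths (cf. Alg. 2).

    graph.neighbors(node, edge_type) is assumed to return a list of
    (neighbor, weight) pairs for the given edge type.
    """
    v_cur = v
    for k in range(1, K + 1):
        if random.random() < q and k > 1:    # restart: skip this hop
            continue
        m = random.choice(metapaths)         # e.g., ["v", "v"] for Region-User-Region
        for edge_type in m:
            cand = graph.neighbors(v_cur, edge_type)
            nodes, weights = zip(*cand)
            v_cur = random.choices(nodes, weights=weights, k=1)[0]
    return v_cur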
It is noteworthy that Region$_1$ $\xrightarrow{f}$ Region$_2$ $\xrightarrow{b}$ Region$_3$ is meaningfully different from Region$_1$ $\xrightarrow{b}$ Region$_2$ $\xrightarrow{f}$ Region$_3$. The former indicates that both Region$_1$ and Region$_3$ precede the
same Region$_2$, while the latter indicates that both ending regions succeed the same Region$_2$.
To characterize the region proximity, we generate multiple hops of random walks that are constrained by the above metapaths (i.e., metawalks) on the heterogeneous graph (Section 5.2.2.1).
This particularly requires each metapath to both start and end with region nodes. Specifically, for each node $v$, given a set of metapaths $M = \{m_1, \ldots, m_4\}$, the number of hops $K$, and the
restarting probability $q$ (similar to node2vec [38]), we generate a metawalk via the procedure in
Alg. 2.

The learning objective is then defined as the negative log-likelihood loss for pairwise co-occurrences of region nodes in the metawalk contexts. For each target region node $v^R_t$ in the training
data, we sample in its context another region node $v^R_c$ through the metawalks to form a positive
co-occurring region pair $(v^R_t, v^R_c)$, and compare it to a randomly paired negative example region $v^R_n$.
The loss is then defined as follows:

$$\mathcal{L} = \sum_{\forall (v^R_t, v^R_c, v^R_n)} -\log \sigma(z^R_t \cdot z^R_c) - \log \sigma(-z^R_t \cdot z^R_n) \quad (5.11)$$

Algorithm 3: Batched Training Process
Input: mobility heterogeneous graph $\mathcal{G}$, target region node $v$, a set of metapaths $M$, number of hops $K$, restart probability $q$
Output: embedding $z_v, \forall v \in \mathcal{V}_R$
 1: for $i \leftarrow 1$ to batch size do
 2:   sample $v_t \in \mathcal{V}_R$
 3:   $v_c \leftarrow$ MetaWalk($v_t$, $M$, $K$, $q$)
 4:   $v_n \leftarrow$ random_sample($\mathcal{V}_R$)
 5:   for $x \in \{t, c, n\}$ do
 6:     for $\psi \in \{f, b, v, o\}$ do
 7:       $h^R_{N_\psi(v_x)} \leftarrow \mathrm{Agg}(\{f_\psi(h_j), \forall v_j \in \mathcal{N}^{\psi}_{v_x}\})$
 8:     end for
 9:     $z_x \leftarrow W \cdot \mathrm{CONCAT}(h^R_x, \sum_\psi a_\psi h^R_{N_\psi(v_x)}) + b$
10:   end for
11:   $\mathcal{L} \leftarrow -\log \sigma(z^R_t \cdot z^R_c) - \log \sigma(-z^R_t \cdot z^R_n)$
12:   update parameters $\theta$ based on $\frac{\partial \mathcal{L}}{\partial \theta}$
13: end for
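The pairwise objective in Equation 5.11 (and Line 11 of Alg. 3) can be sketched for a batch of target/context/negative region embeddings as follows; the batching is an illustrative assumption.

import torch
import torch.nn.functional as F

def metawalk_loss(z_t, z_c, z_n):
    """z_t, z_c, z_n: (batch, dim) embeddings of target, metawalk context,
    and randomly sampled negative region nodes."""
    pos = torch.sum(z_t * z_c, dim=1)        # dot products for positive pairs
    neg = torch.sum(z_t * z_n, dim=1)        # dot products for negative pairs
    return (-F.logsigmoid(pos) - F.logsigmoid(-neg)).mean()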
Based on the graph encoding and metawalks, the training process of our model is described in
Alg. 3. Through this process, the model is trained to learn a mapping from attributes of heteroge-
neous neighbors to node embeddings on a static heterogeneous graph. However, the regions are
partitioned arbitrarily by the grid used to record trajectories, which may easily lead to training on sparse
occurrences of regions in mobility contexts. To address this issue, in the next subsection we improve the training
process by dynamically adjusting the nodes in the graph so as to build more connections between regions.
5.2.2.4 Dynamic Grid Partitioning
The arbitrary partitioning of regions may split some large functional regions into separate ones
since the grid partitioning generates equal-size regions. Such splitting of the same functional
region would disconnect the regions in metawalks. For example, two other regions, each connected to
one of the split regions, would gain a new connection if the split regions were merged. Merging
such split regions would cause more co-occurrences of related regions to be captured in metawalk
contexts, which could effectively cope with sparsity in training data.
Therefore, we propose to dynamically change the region partitioning in the model’s training
process. Specifically, we start with fine-grained grid partitioned regions. Then after sufficient train-
ing of the embedding model, we gradually merge regions if two regions meet both the following
two conditions:
1. They appear with a high proximity in the embedding space.
2. They are adjacent in the physical space.
The training process has two termination conditions: either it has reached the limit of
training epochs, or no more regions can be merged according to the above conditions.
Specifically, each time we merge a set of adjacent nodes (a connected component in the phys-
ical graph) and create a new node to replace them, the new node inherits the averaged attributes
of the replaced nodes. The new node also inherits all connections to other nodes that source from or
target the replaced nodes. The procedure of the merging operation is presented in Alg. 4.
Algorithm 4: MergeNodes
Input: mobility heterogeneous graph $\mathcal{G}$, region node embeddings $Z$, physical adjacency matrix $A$, similarity comparative threshold $\delta$
Output: new mobility heterogeneous graph $\mathcal{G}$
 1: $D \leftarrow$ PairWiseDis($Z$)    // filter nodes to merge
 2: $d \leftarrow$ percentile($D$, $\delta$)
 3: pairs $\leftarrow$ []
 4: for each node pair $v_i, v_j \in \mathcal{V}_R$ do
 5:   if Dis($z_i$, $z_j$) < $d$ and $A[i, j] == 1$ then
 6:     add $(v_i, v_j)$ to pairs
 7:   end if
 8: end for
 9: $S \leftarrow$ Pairs2Sets(pairs)    // convert pairs of nodes to a list of connected components
10: for each $s \in S$ do
11:   create a new node $\bar{v}$ and add it to $\mathcal{G}$
12:   $\bar{v}.attribute \leftarrow$ Average($\{v.attribute, \forall v \in s\}$)
13:   for $\psi \in \{f, b, v, o\}$ do    // update edges for the new node
14:     for $v \in s$ do
15:       add edges between $\bar{v}$ and $\mathcal{N}^{\psi}_v$
16:     end for
17:   end for
18:   update the weights on the edges
19:   remove $s$ from $S$
20: end for

5.2.2.5 Inference

After the model is trained, we can infer the embedding of each $v \in \mathcal{V}_R$ by the computation in Lines 5-9 of Alg. 3. The downstream tasks then include the learned embeddings in their feature space
to incorporate the contextual information of regions.

Specifically, the learned embeddings can be integrated in different ways depending on the task. For example, the mobility behavior clustering task fully relies on the contextual similarity
of the underlying regions to identify behaviors, so the trajectories are transformed into sequences
of embeddings that are aggregated (e.g., via an RNN autoencoder) into fixed-size vectors for clustering. The next location prediction task requires inferring the continuation of a trajectory based
on the regions in the past context; in this case, the embeddings of the regions in the seen part of the trajectory serve as semantic features that can be concatenated with the locations of each region as
inputs to the downstream predictive model. For other region-level inference or analysis, such as
POI correlation analysis, the embedding simply serves as the feature vector for each region.
5.2.3 Experiment
In this section, we compare the proposed region embeddings against other approaches on three
tasks: mobility behavior clustering (Section 5.2.3.3), next location prediction (Section 5.2.3.4) and
POI correlation (Section 5.2.3.5). Ablation and parameter studies are presented as well for a better
understanding of our model (sections 5.2.3.6 and 5.2.3.7). We also qualitatively visualize the result
of the dynamic partitioning component to understand its behavior (Section 5.2.3.8).
5.2.3.1 Experimental Setup
Dataset We utilized a real-world trajectory dataset GeoLife [65] for the evaluation of our pro-
posed approach. The GeoLife dataset consists of 17,621 trajectories generated by 182 users dated
between April 2007 and August 2012. The trajectories cover a total distance of 1,292,951 km and
a total duration of 50,176 hours.
Configuration    We implemented and trained all models on a commodity server with a 36-core Intel i9 Extreme processor, 128 GB RAM, and an RTX 2080 Ti GPU. We set the embedding dimension
to 32 and the grid resolution to 100 x 100 for all methods. The proposed approach was implemented
using PyTorch [115] and DGL [116].
Metrics
Clustering We use the following three external clustering metrics for the mobility behavior
clustering task.
• Normalized Mutual Information (NMI) measures the normalization of the Mutual Informa-
tion between the predicted classes and ground-truth classes. The score ranges from 0 for no
matching to 1 for perfect matching.
• Adjusted Rand Index (ARI) calculates the clustering accuracy by comparing inner-group
and inter-group pairs within the predicted classes and ground-truth classes. The score ranges
from -1 for no matching to 1 for perfect matching.
• Accuracy is used in [91] to directly measure the overlap ratio between clusters of the pre-
dicted classes and those of ground-truth classes.
Prediction We evaluate the next location prediction accuracy by computing the distance (error)
between the predicted location coordinates and the ground-truth coordinates.
• Root Mean Square Error (RMSE) measures the quadratic error of all predictions.
• Mean Absolute Error (MAE) measures the average absolute distance between predictions
and ground truths.
Correlation
• Pearson Correlation (Corr.) measures the linear correlation between two variables, with 0
implying no correlation and +1/-1 implying positive/negative correlation.
5.2.3.2 Baseline approaches
We compare our proposed model with eight strong baseline methods.
• Node2vec [38] is a classical homogeneous graph embedding approach. We adapt it to the
heterogeneous graph by ignoring the node types, attributes and edge types. We used the code
released by the authors (https://github.com/aditya-grover/node2vec).
• Skip-gram (trajectory) [96] is a neural language model that learns static embeddings of
words from sentences. We employ the model by treating trajectories as sentences and regions as words to generate region embeddings. We used the gensim library (https://radimrehurek.com/gensim/) to implement this
baseline.
• GraphSAGE [46] is a graph convolutional model that can be used for homogeneous graph
embedding. We adapt it to the heterogeneous graph by ignoring the node types, attributes and
edge types. We implemented this method using StellarGraph [117].
• PTE [99] is a heterogeneous embedding model that learns word embeddings from both sentences and graphs based on a Heterogeneous Information Network Embedding (HINE) approach. We adapt this model to our problem by treating the trajectories as the sentences of
words and simultaneously conducting joint learning on the heterogeneous graph in the same
way as the original setup. We used the official release by the authors (https://github.com/mnqu/PTE).
• Metapath2vec [48] is a heterogeneous model that generates node embeddings by feeding
metapath-constrained node sequences to a skip-gram model. We apply it to our hetero-
geneous graph by ignoring the node attributes that are not handled by this method. We
implemented this method using StellarGraph [117].
• DeepMARK [35] is our previous region embedding method that learns region embeddings
from package delivery trajectories for estimating the arrival time of packages.
• ZE-Mob [18] is the state-of-the-art region embedding method that employs pointwise mu-
tual information matrix decomposition to learn embeddings. We implemented this method
based on the algorithms in the paper.
• MVURE [33] is another state-of-the-art region embedding method that uses multiple homo-
geneous graphs to learn a joint embedding. We adapt the method while excluding the POI
and social media data that are not available in our problem setting. We use the code shared by the authors (https://github.com/mingyangzhang/mv-region-embedding).
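For concreteness, the Skip-gram (trajectory) baseline can be reproduced with a few lines of gensim (assuming gensim 4.x); apart from the embedding dimension of 32, the hyperparameters below are illustrative assumptions.

from gensim.models import Word2Vec

# Each trajectory is treated as a sentence whose tokens are region (grid-cell) ids.
trajectories = [["r12", "r13", "r45"], ["r45", "r46", "r12"]]

model = Word2Vec(sentences=trajectories, vector_size=32, sg=1,   # sg=1 selects skip-gram
                 window=5, min_count=1, epochs=10)
region_vector = model.wv["r12"]   # 32-dimensional embedding for region r12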
5.2.3.3 Mobility Behavior Clustering
Table 5.3: Performance on Mobility Behavior Clustering
Method NMI ARI Accuracy
node2vec 0.5818 0.4354 0.6739
Skip-gram 0.6247 0.4326 0.6855
GraphSAGE 0.6450 0.5065 0.7837
PTE 0.6646 0.4837 0.7221
Metapath2vec 0.6347 0.4510 0.7022
DeepMARK 0.6267 0.5254 0.7038
ZeMob 0.6771 0.5023 0.6722
MVURE 0.6086 0.5107 0.6955
Our approach 0.6877 0.6474 0.8303

We first evaluate the embedding in the mobility behavior clustering task proposed in [27]. The goal of this task is to cluster trajectories into different groups of mobility behaviors such as working commute and shopping. It usually requires transforming the raw trajectories into sequences of
feature vectors. In DETECT, the feature vectors are extracted from POI data. In our experiment,
we replace the feature vectors with the region embeddings learned by different approaches.
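Concretely, this substitution is a per-region lookup: each visited region id in a context sequence is replaced by its learned embedding vector before the sequence is fed to the clustering model. A hypothetical sketch (names and dimensions are illustrative):

import numpy as np

def trajectory_to_features(region_sequence, embeddings):
    # Replace each visited region id with its learned embedding vector.
    return np.stack([embeddings[r] for r in region_sequence])

embeddings = {"r12": np.random.rand(32), "r45": np.random.rand(32)}
features = trajectory_to_features(["r12", "r45", "r12"], embeddings)   # shape (3, 32)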
The clustering results are compared with ground-truth classes and we report NMI, ARI, and
Accuracy of the clustering performance in Tab. 5.3. We can observe that our proposed approach
outperforms all the other baselines on all three clustering metrics. The GraphSAGE approach, which treats the heterogeneous graph we built as an attribute-free homogeneous graph, is also competitive in terms of accuracy, as it can still leverage the graph structure that we built between regions, users
and periods. In contrast, without leveraging our proposed graph representation and any auxiliary
data, the MVURE approach falls noticeably behind several other baselines.
5.2.3.4 Next Location Prediction
Region embeddings are useful in predicting next location coordinate(s) given the current and past
locations along the trajectory. For this task, we employ a state-of-the-art model based on Multi-
Layer Perceptron (MLP) and Long-Short Term Memory (LSTM) [118]. Specifically, the input is
a sub-trajectory and their corresponding embeddings are concatenated into a sequence of vectors.
The goal is to predict the coordinates of the next visit. We used 50% data for pre-training the
embeddings and the rest for training and testing the prediction model by a 70-30 split.
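The following PyTorch sketch illustrates this architecture in the spirit of [118]; the layer sizes and training details are illustrative assumptions rather than the exact configuration used in our experiments.

import torch
import torch.nn as nn

class NextLocationPredictor(nn.Module):
    def __init__(self, embed_dim=32, hidden_dim=64):
        super().__init__()
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(hidden_dim, 64), nn.ReLU(), nn.Linear(64, 2))

    def forward(self, x):                  # x: (batch, seq_len, embed_dim)
        _, (h, _) = self.lstm(x)           # h: (1, batch, hidden_dim)
        return self.mlp(h.squeeze(0))      # predicted (lat, lon) for each sample

model = NextLocationPredictor()
pred = model(torch.randn(8, 10, 32))       # a batch of 8 sub-trajectories of length 10
loss = nn.MSELoss()(pred, torch.randn(8, 2))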
We show the prediction errors of next locations in Tab. 5.4. To better understand the errors, we
include another baseline (“W/o embedding”) that only uses previous location coordinates without
any embedding information to predict the next location.

Table 5.4: Performance on Next Location Prediction
Method RMSE (×1e3) MAE (×1e2)
W/o embedding 228.90 20.62
Node2vec 5.77 4.06
Skip-gram 9.46 5.94
GraphSAGE 6.90 4.97
PTE 6.86 4.26
Metapath2vec 6.06 3.88
DeepMARK 6.74 4.14
ZeMob 5.57 3.87
MVURE 5.54 4.44
Our approach 5.38 3.85

We can observe that all embedding-based
approaches are significantly better than the “W/o embedding” approach. Hence, the region em-
beddings are very effective for this task. In addition, we also show that our approach achieves the
lowest errors among all the methods. The region embedding baseline ZeMob is also competitive in this task because its main component (the PPMI matrix) is built on region and time co-occurrences that are helpful here. However, unlike our method, it does not extract the user and time associations that further enrich the context for location inference.
5.2.3.5 POI Correlation
We also measure the correlation between the embeddings and the distribution of POIs. Specifically, we count the number of venues within each POI category for each region. Then, for each embedding dimension, we calculate the maximum absolute Pearson Correlation between that dimension and the counts of every POI category to measure whether meaningful patterns exist in the embedding. In Tab. 5.5, we report the maximum and the average of these computed correlations over all embedding dimensions.
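A compact sketch of this computation is given below; the array names and the synthetic data are assumptions for illustration only.

import numpy as np

def poi_correlation(emb, poi_counts):
    # emb: (n_regions, emb_dim); poi_counts: (n_regions, n_categories).
    scores = []
    for d in range(emb.shape[1]):
        corrs = [abs(np.corrcoef(emb[:, d], poi_counts[:, c])[0, 1])
                 for c in range(poi_counts.shape[1])]
        scores.append(max(corrs))          # best-matching POI category for dimension d
    return float(np.mean(scores)), float(np.max(scores))   # Mean Corr., Max Corr.

emb = np.random.rand(500, 32)                        # learned region embeddings
poi_counts = np.random.poisson(3.0, size=(500, 10))  # venue counts per POI category
mean_corr, max_corr = poi_correlation(emb, poi_counts)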
Table 5.5: Performance on Correlation with POI data
Method Mean Corr. Max Corr.
Node2vec 0.1630 0.4007
Skip-gram 0.2728 0.4386
GraphSAGE 0.1272 0.2633
PTE 0.1584 0.4114
Metapath2vec 0.1063 0.1692
DeepMARK 0.1696 0.3221
ZeMob 0.1761 0.4692
MVURE 0.2084 0.5010
Our approach 0.3382 0.5380

The results indicate that our approach has a significant advantage over the other approaches in terms of mean correlation, which is consistent with its performance on the other two tasks. We attribute this to the fact that, among all approaches, ours exploits the most information from the trajectories related to human mobility, via the attributes, edges, and metawalks in our heterogeneous graph.
5.2.3.6 Ablation Study
To help understand the contribution of each incorporated model component, we create three variants of our model by removing a single component in turn: “W/o metawalk” (using metapaths as the objective), “W/o attributes” (using random features as attributes), and “W/o merging” (without the dynamic partitioning component). We report their performance on the mobility behavior clustering task in Fig. 5.10. Among these variants, “W/o attributes” produces the worst result, which indicates the importance of incorporating the node attributes into the region embeddings. With all components incorporated, the complete version of the model unsurprisingly offers the best performance.

Figure 5.10: Performance of model variants on clustering mobility behaviors
5.2.3.7 Hyperparameter Study
In this section, we discuss the effects of some important hyperparameters. Fig. 5.11 depicts the
effects of grid resolution and the choice of aggregation function on the clustering performance.
We can observe that the performance first improves as the grid resolution increases (finer grain) and then drops after a specific resolution (100 × 100). While too coarse a resolution generates oversized regions that cannot differentiate among locations inside each region, too fine a resolution leads to connections between regions that are too sparse to effectively express the co-occurrence information. It is also noteworthy in the figure that the “lstm” aggregation leads to slightly better performance than the other aggregation functions due to its greater expressive capability.
Figure 5.11: Effect of parameters: (a) effect of grid resolution; (b) effect of aggregation.
5.2.3.8 Qualitative Analysis
We perform a visualization to illustrate how regions are merged by the dynamic partitioning com-
ponent in Fig. 5.12. The black squares are the regions partitioned by the original grid and are later
merged together into an irregularly shaped region after training. We observe that there are many
universities in these merged regions, indicating similar mobility characteristics among them. By merging these highly relevant regions, our approach helps build more connections to other regions through the merged regions.
Figure 5.12: Regions merged by dynamic partitioning.
Chapter 6
Conclusions and Future Work
In this thesis, we studied the problem of inferring mobility behaviors from trajectories by tackling
various challenges via multiple deep-learning-based frameworks. First, we proposed DETECT, a
powerful neural framework that mitigates the scale variance of trajectories and leverages the aux-
iliary POI data to properly learn the mobility behavior clusters. DETECT used an efficient stay-point extraction procedure to extract critical parts of the trajectory and augmented them with con-
text to capture their geographical influence. Furthermore, with the input of context sequences, a
recurrent auto-encoder can embed trajectories in the latent space of behaviors. The optimization
of the latent embedding on a clustering-oriented loss improves clustering compactness even fur-
ther. An exhaustive experimental evaluation confirms the effectiveness of DETECT, consistently
achieving at least 40% improvement over the state-of-the-art baselines in all evaluated external
metrics. At the same time, internal validation measures and running-time efficiency results show that DETECT is also scalable to large datasets. Second, we proposed a novel deep learning frame-
work VAMBC that can accurately and robustly cluster context sequences of ordered real-valued feature vectors based on their transition patterns to infer mobility behaviors. The framework explicitly decomposes the cluster latent representation and the individualized latent representation via two reparameterization layers. Such decomposition and the carefully designed network enable the model to learn the self-supervision and cluster structure jointly without collapsing to trivial solutions. We showed that VAMBC significantly improves the robustness and accuracy of mobility behavior clustering. It would be interesting to investigate in future work how VAMBC could be
extended to other sequence data. Third, we proposed DeepMARK, an innovative deep multi-view
autoencoder framework that learns representations of AOIs from trajectory and graph data. The framework learns AOI embeddings that take the best of both contextual and topological representations, i.e., they incorporate data-driven contextual information and follow Tobler’s First Law of Geography. DeepMARK was evaluated on real-world package delivery ETA prediction and achieved better performance than various adapted baselines. Last, we proposed a framework that learns ef-
fective region representations from only trajectories without using any auxiliary data. Specifically,
we construct an attributed heterogeneous graph that incorporates various associations between re-
gions, users, and periods, as well as descriptive features of these elements. We use a heterogeneous
graph neural network with a metawalk-based objective to learn both the relational proximity within
metawalks and neighborhood information in the embeddings. We compared our approach against strong baselines on three tasks and showed its superiority in all of them, which indicates the effectiveness of the region embeddings learned by our method. For future work, we could investigate more ap-
plications that could leverage the region embeddings and how our framework can be further tuned
for various supervised tasks.
References
1. Chang, B. et al. Content-Aware Hierarchical Point-of-Interest Embedding Model for Suc-
cessive POI Recommendation. in IJCAI (2018), 3301–3307.
2. He, J. et al. Inferring a Personalized Next Point-of-Interest Recommendation Model with
Latent Behavior Patterns. in AAAI (2016).
3. Zheng, Y . Trajectory data mining: an overview. ACM TIST (2015).
4. Lee, J.-G. et al. Trajectory clustering: a partition-and-group framework in SIGMOD (2007).
5. Yoon, H. & Shahabi, C. Robust time-referenced segmentation of moving object trajectories
in ICDM (2008).
6. Yuan, G. et al. A review of moving object trajectory clustering algorithms. Artificial Intelli-
gence Review (2017).
7. Lin, B. & Su, J. One way distance: For shape based similarity search of moving object
trajectories. GeoInformatica (2008).
8. Petitjean, F. et al. A global averaging method for dynamic time warping, with applications
to clustering. Pattern Recognition (2011).
9. Cuturi, M. Fast global alignment kernels in ICML (2011), 929–936.
10. Paparrizos, J. & Gravano, L. k-shape: Efficient and accurate clustering of time series in
SIGMOD (2015).
11. Madiraju, N. S. et al. Deep temporal clustering: Fully unsupervised learning of time-domain
features. arXiv preprint arXiv:1802.01059 (2018).
12. Yang, B. et al. Towards k-means-friendly spaces: Simultaneous deep learning and clustering
in ICML (2017), 3861–3870.
13. Guo, X. et al. Improved deep embedded clustering with local structure preservation in IJCAI
(2017).
14. Mrabah, N. et al. Adversarial Deep Embedded Clustering: on a better trade-off between
Feature Randomness and Feature Drift. arXiv preprint arXiv:1909.11832 (2019).
15. Steiniger, S. et al. Can we use OpenStreetMap POIs for the Evaluation of Urban Accessi-
bility? in International Conference on GIScience Short Paper Proceedings 1 (2016).
16. Touya, G. et al. Assessing crowdsourced POI quality: Combining methods based on refer-
ence data, history, and spatial relations. ISPRS International Journal of Geo-Information 6,
80 (2017).
17. Haklay, M. How good is volunteered geographical information? A comparative study of
OpenStreetMap and Ordnance Survey datasets. Environment and planning B: Planning and
design 37, 682–703 (2010).
18. Yao, Z. et al. Representing urban functions through zone embedding with human mobility
patterns in Proceedings of the Twenty-Seventh International Joint Conference on Artificial
Intelligence (IJCAI-18) (2018).
19. Yuan, J. et al. Discovering regions of different functions in a city using human mobility and
POIs in Proceedings of the 18th ACM SIGKDD international conference on Knowledge
discovery and data mining (2012), 186–194.
20. Besse, P. C. et al. Review and perspective for distance-based clustering of vehicle trajecto-
ries. IEEE TITS (2016).
21. Ferreira, N. et al. Vector Field k-Means: Clustering Trajectories by Fitting Multiple Vector
Fields in Computer Graphics Forum (2013).
22. Yao, D. et al. Trajectory clustering via deep representation learning in IJCNN (2017).
23. Wang, X. et al. Semantic trajectory-based event detection and event pattern mining. KAIS
(2013).
24. Bindschaedler, V . & Shokri, R. Synthesizing plausible privacy-preserving location traces in
IEEE Security and Privacy (2016).
25. Gramaglia, M. et al. Preserving mobile subscriber privacy in open datasets of spatiotempo-
ral trajectories in INFOCOM (2017).
26. Choi, D.-W. et al. Efficient mining of regional movement patterns in semantic trajectories.
VLDB (2017).
27. Yue, M. et al. DETECT: Deep Trajectory Clustering for Mobility-Behavior Analysis in IEEE
Big Data (2019), 988–997.
28. Dupont, E. Learning disentangled joint continuous and discrete representations in NIPS
(2018), 710–720.
29. Dilokthanakul, N. et al. Deep unsupervised clustering with gaussian mixture variational
autoencoders. arXiv preprint arXiv:1611.02648 (2016).
30. Jiang, Z. et al. Variational deep embedding: an unsupervised and generative approach to
clustering in IJCAI (2017).
31. Shu, R. et al. A Note on Deep Variational Models for Unsupervised Clustering (2017).
32. Pan, G. et al. Land-use classification using taxi GPS traces. IEEE Transactions on Intelligent
Transportation Systems 14, 113–123 (2012).
33. Zhang, M. et al. Multi-View Joint Graph Representation Learning for Urban Region Em-
bedding in Proceedings of the 29th International Joint Conference on Artificial Intelligence.
AAAI Press (2020).
34. Zhang, C. et al. Regions, periods, activities: Uncovering urban dynamics via cross-modal
representation learning in Proceedings of the 26th International Conference on World Wide
Web (2017), 361–370.
35. Yue, M. et al. Learning a Contextual and Topological Representation of Areas-of-Interest
for On-Demand Delivery Application in Machine Learning and Knowledge Discovery in
Databases: Applied Data Science Track: European Conference, ECML PKDD 2020, Ghent,
Belgium, September 14–18, 2020, Proceedings, Part IV (2021), 52–68.
36. Perozzi, B. et al. Deepwalk: Online learning of social representations in Proceedings of
the 20th ACM SIGKDD international conference on Knowledge discovery and data mining
(2014), 701–710.
37. Tang, J. et al. Line: Large-scale information network embedding in Proceedings of the 24th
international conference on world wide web (2015), 1067–1077.
38. Grover, A. & Leskovec, J. node2vec: Scalable feature learning for networks in Proceed-
ings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data
mining (2016), 855–864.
39. Cao, S. et al. Deep neural networks for learning graph representations in Thirtieth AAAI
conference on artificial intelligence (2016).
40. Wang, D. et al. Structural deep network embedding in Proceedings of the 22nd ACM SIGKDD
international conference on Knowledge discovery and data mining (2016), 1225–1234.
41. Gao, H. & Huang, H. Deep Attributed Network Embedding. in IJCAI 18 (2018), 3364–3370.
42. Zhang, Z. et al. ANRL: Attributed Network Representation Learning via Deep Neural Net-
works. in IJCAI 18 (2018), 3155–3161.
43. Kipf, T. N. & Welling, M. Semi-supervised classification with graph convolutional net-
works. arXiv preprint arXiv:1609.02907 (2016).
44. Velickovic, P. et al. Graph Attention Networks in Internation Conference on Learning Rep-
resentations (ICLR) (2018).
45. Chen, H. et al. Fast and accurate network embeddings via very sparse random projection
in Proceedings of the 28th ACM International Conference on Information and Knowledge
Management (2019), 399–408.
46. Hamilton, W. et al. Inductive representation learning on large graphs in Advances in neural
information processing systems (2017), 1024–1034.
47. Cui, P. et al. A survey on network embedding. IEEE Transactions on Knowledge and Data
Engineering 31, 833–852 (2018).
48. Dong, Y. et al. metapath2vec: Scalable representation learning for heterogeneous networks
in Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery
and data mining (2017), 135–144.
49. Fu, T.-y. et al. Hin2vec: Explore meta-paths in heterogeneous information networks for
representation learning in Proceedings of the 2017 ACM on Conference on Information
and Knowledge Management (2017), 1797–1806.
50. Ying, R. et al. Graph convolutional neural networks for web-scale recommender systems in
Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery
& Data Mining (2018), 974–983.
51. Chen, T. & Sun, Y . Task-guided and path-augmented heterogeneous network embedding for
author identification in Proceedings of the Tenth ACM International Conference on Web
Search and Data Mining (2017), 295–304.
52. Kingma, D. P. & Welling, M. Auto-encoding variational bayes. arXiv (2013).
53. Li, Q. et al. Mining user similarity based on location history in SIGSPATIAL (2008).
54. Tobler, W. R. A computer movie simulating urban growth in the Detroit region. Economic
geography (1970).
55. Zhang, J.-D. et al. iGeoRec: A personalized and efficient geographical location recommen-
dation framework. IEEE TSC (2014).
56. Yuan, Q. et al. Time-aware point-of-interest recommendation in SIGIR (2013).
57. Yu, Y . & Chen, X. A survey of point-of-interest recommendation in location-based social
networks in AAAI (2015).
58. Lin, Y . et al. Mining public datasets for modeling intra-city PM2. 5 concentrations at a fine
spatial resolution in SIGSPATIAL (2017).
59. Baldi, P. Autoencoders, unsupervised learning, and deep architectures in ICML workshop
on unsupervised and transfer learning (2012).
60. Lipton, Z. C. et al. A critical review of recurrent neural networks for sequence learning.
arXiv (2015).
61. Hochreiter, S. & Schmidhuber, J. Long short-term memory. Neural computation (1997).
62. Xie, J. et al. Unsupervised deep embedding for clustering analysis in ICML (2016).
63. Guo, X. et al. Deep clustering with convolutional autoencoders in NIPS (2017).
64. Maaten, L. v. d. & Hinton, G. Visualizing data using t-SNE. JMLR (2008).
65. Zheng, Y. et al. Mining interesting locations and travel sequences from GPS trajectories in
WWW (2009).
66. University of Illinois at Chicago. Real Trajectory Data 2006.
67. Zhou, F. et al. Trajectory-User Linking via Variational AutoEncoder. in IJCAI (2018).
68. Cui, G. et al. Personalized travel route recommendation using collaborative filtering based
on GPS trajectories. IJDE (2018).
69. Center, S. I. Map POI (Point of Interest) data Peking University Open Research Data Plat-
form. 2017.
70. OpenStreetMap Foundation. OpenStreetMap data 2018.
71. Petitjean, F. et al. Dynamic time warping averaging of time series allows faster and more
accurate classification in ICDM (2014).
72. Morris, B. & Trivedi, M. Learning trajectory patterns by clustering: Experimental studies
and comparative evaluation in CVPR (2009).
73. Yao, D. et al. Learning deep representation for trajectory clustering. Expert Systems (2018).
74. Hu, W. et al. Semantic-based surveillance video retrieval. IEEE TIP (2007).
75. Tavenard, R. tslearn: A machine learning toolkit dedicated to time-series data. https://github.com/rtavenar/tslearn. 2017.
76. Chollet, F. et al. Keras https://github.com/fchollet/keras. 2015.
77. Abadi, M. et al. Tensorflow: Large-scale machine learning on heterogeneous distributed
systems in OSDI (2016).
78. Bholowalia, P. & Kumar, A. EBK-means: A clustering technique based on elbow method
and k-means in WSN. IJCA (2014).
79. Liu, Y . et al. Understanding of internal clustering validation measures in ICDM (2010).
80. Zou, Q. et al. Sequence clustering in bioinformatics: an empirical study. Briefings in bioin-
formatics 21, 1–10 (2020).
81. Xu, J. et al. Self-taught convolutional neural networks for short text clustering. Neural Net-
works 88, 22–31 (2017).
82. Helske, S. & Helske, J. Mixture hidden Markov models for sequence data: The seqHMM
package in R. arXiv preprint: 1704.00543 (2017).
83. Bowman, S. et al. Generating Sentences from a Continuous Space in SIGNLL (2016), 10–
21.
84. Doersch, C. Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908 (2016).
85. Jang, E. et al. Categorical reparameterization with gumbel-softmax. arXiv preprint arXiv:1611.01144
(2016).
86. Ghosh, P. et al. From variational to deterministic autoencoders. arXiv preprint arXiv:1903.12436
(2019).
87. Ballé, J. et al. End-to-end Optimized Image Compression in ICLR (2017).
88. Bengio, Y . et al. Representation learning: A review and new perspectives. TPAMI 35, 1798–
1828 (2013).
89. Ranjan, C. et al. Sequence Graph Transform (SGT): A Feature Extraction Function for
Sequence Data Mining (Extended Version). arXiv preprint arXiv:1608.03533 (2016).
90. Smyth, P. Clustering sequences with hidden Markov models in NIPS (1997), 648–654.
91. Cai, D. et al. Locally consistent concept factorization for document clustering. IEEE TKDE
23, 902–913 (2010).
92. Yeung, K. Y . & Ruzzo, W. L. Details of the adjusted rand index and clustering algorithms,
supplement to the paper an empirical study on principal component analysis for clustering
gene expression data. Bioinformatics 17, 763–774 (2001).
93. Li, Y . et al. Multi-task representation learning for travel time estimation in Proceedings of
the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining
(2018), 1695–1704.
94. Ke, J. et al. Hexagon-Based Convolutional Neural Network for Supply-Demand Forecast-
ing of Ride-Sourcing Services. IEEE Transactions on Intelligent Transportation Systems
(2018).
95. Liu, X. et al. Exploring the Context of Locations for Personalized Location Recommenda-
tions. in IJCAI (2016), 1188–1194.
96. Mikolov, T. et al. Distributed representations of words and phrases and their composition-
ality in Advances in neural information processing systems (2013), 3111–3119.
97. Sahlgren, M. The distributional hypothesis. Italian Journal of Disability Studies 20, 33–53
(2008).
98. Li, Y . et al. A survey of multi-view representation learning. IEEE transactions on knowledge
and data engineering 31, 1863–1883 (2018).
99. Tang, J. et al. Pte: Predictive text embedding through large-scale heterogeneous text net-
works in Proceedings of the 21th ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining (2015), 1165–1174.
100. Chang, S. et al. Heterogeneous network embedding via deep architectures in Proceedings
of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data
Mining (2015), 119–128.
101. Levy, O. & Goldberg, Y . Neural word embedding as implicit matrix factorization in Ad-
vances in neural information processing systems (2014), 2177–2185.
102. Levy, O. et al. Improving distributional similarity with lessons learned from word embed-
dings. Transactions of the Association for Computational Linguistics 3, 211–225 (2015).
103. Hinton, G. E. et al. Distributed representations (Carnegie-Mellon University Pittsburgh, PA,
1984).
104. Hamilton, W. L. et al. Diachronic word embeddings reveal statistical laws of semantic
change. arXiv preprint arXiv:1605.09096 (2016).
105. Shuman, D. I. et al. The emerging field of signal processing on graphs: Extending high-
dimensional data analysis to networks and other irregular domains. IEEE signal processing
magazine 30, 83–98 (2013).
106. Teng, S.-H. Scalable algorithms for data and network analysis. Foundations and Trends in
Theoretical Computer Science 12, 1–274 (2016).
107. Ngiam, J. et al. Multimodal deep learning in ICML (2011), 689–696.
108. Xia, F. et al. Listwise approach to learning to rank: theory and algorithm in Proceedings of
the 25th international conference on Machine learning (2008), 1192–1199.
109. Niemeyer, G. Geohash 2008.
110. Wu, F. & Wu, L. DeepETA: A Spatial-Temporal Sequential Neural Network Model for Esti-
mating Time of Arrival in Package Delivery System in Proceedings of the AAAI Conference
on Artificial Intelligence 33 (2019), 774–781.
111. Zhu, M. et al. Location2vec: a situation-aware representation for visual exploration of urban
locations. IEEE Transactions on Intelligent Transportation Systems 20, 3981–3990 (2019).
112. Sun, Y . et al. Pathsim: Meta path-based top-k similarity search in heterogeneous information
networks. Proceedings of the VLDB Endowment 4, 992–1003 (2011).
113. Chen, M. et al. Converting spatiotemporal data among heterogeneous granularity systems
in The 25th IEEE International Conference on Fuzzy Systems (FUZZ-IEEE) (2016), 984–
992.
114. Vaswani, A. et al. Attention is all you need. arXiv preprint arXiv:1706.03762 (2017).
115. Paszke, A. et al. in Advances in Neural Information Processing Systems 32 (eds Wallach, H.
et al.) 8024–8035 (Curran Associates, Inc., 2019).
116. Wang, M. et al. Deep Graph Library: A Graph-Centric, Highly-Performant Package for
Graph Neural Networks. arXiv preprint arXiv:1909.01315 (2019).
117. Data61, C. StellarGraph Machine Learning Library. https://github.com/stellargraph/stellargraph. 2018.
118. Wang, Y. et al. Trajectory forecasting with neural networks: An empirical evaluation and a
new hybrid model. IEEE Transactions on Intelligent Transportation Systems 21, 4400–4409
(2019).