Modeling Dynamic Behaviors in the Wild
by
Nazgol Tavabi
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(Computer Science)
December 2021
Copyright 2021 Nazgol Tavabi
Acknowledgements
I would like to thank my advisor, Professor Kristina Lerman. She has kindly guided me in my research all through my PhD, and this thesis would not have been possible without her constant help and support. I would also like to thank my PhD committee, Professors Shrikanth Narayanan, Bistra Dilkina, Xiang Ren, Cyrus Shahabi, and Emilio Ferrara, for their valuable time and insightful comments.
I would like to thank my parents, Laila Shahcheraghi and Kazem Tavabi, who are the constant source of my inspiration and motivation. I would also like to thank my aunt, Zohreh Shahcheraghi, and uncle, Ali Saberi, for their emotional support and for giving me the strength to move forward and keep going.
I would like to thank my sister, Leili Tavabi, and brothers, Iman Saberi and Behnam Shahbazi, for always being there for me and helping me with any type of problem I might have. I know that I can face anything with their support.
And finally, I would like to thank Andres Abeliuk, Homa Hosseinmardi, and Keith Burghardt for their mentorship, and my lab-mates and friends who helped me and collaborated with me in my research: Nazanin Alipourfard, Negar Mokhberian, Palash Goyal, Tozammel Hossain, Nathan Bartley, Mehrnoosh Mirtaheri, Mozhdeh Gheini, and Pegah Jandaghi.
Table of Contents
Acknowledgements
List of Tables
List of Figures
Abstract
Chapter 1: Introduction
Chapter 2: Related Work
Chapter 3: Learning Behavioral States with Non-Parametric HMMs
3.1 Non-Parametric HMM Model
3.2 Dynamics of Discussions in Social Media
3.2.1 D2Web Data Collection
3.2.2 Modeling Topics of Discussion
3.2.3 Modeling Dynamics of Activity
3.2.4 Clustering
3.2.5 Case Studies
3.2.5.1 Prescription Drugs
3.2.5.2 Proxy Servers
3.2.5.3 Marketplace Shutdowns
3.3 Behavioral Representations from Wearable Sensors
3.3.1 Data
3.3.2 Measuring Distance
3.3.2.1 Viterbi Distance
3.3.3 Learning Representations
3.3.3.1 Stationary Representation
3.3.3.2 Spectral Representation
3.3.4 Results
3.4 Identifying Atypical Life Events
3.4.1 Data
3.4.2 Causal Inference Method
3.4.3 Causal Effect of Atypical Events
3.4.4 Detecting Atypical Events
Chapter 4: Learning Behavioral Patterns with Byte Pair Encoding
4.1 Pattern Discovery in Time Series with Byte Pair Encoding
4.1.1 Method
4.1.1.1 PAA Transformation
4.1.1.2 Discretization
4.1.1.3 Identifying Patterns
4.1.1.4 Handling Consecutive Identical Symbols
4.1.1.5 Post-Processing
4.1.1.6 Extension to Multivariate Series
4.1.1.7 Classification/Regression
4.1.1.8 Hyper-parameter Tuning
4.1.2 Data
4.1.3 Results
4.1.3.1 Computation Speed
4.1.3.2 Interpretability
4.1.4 Conclusion
Chapter 5: Effect of Missing Data in Time Series Analysis
5.1 Challenges in Forecasting Malicious Events from Missing Data
5.1.1 Previous Works in Forecasting Cyber Attacks
5.1.2 Limits of Predictability
5.1.3 Data
5.1.3.1 ISI
5.1.3.2 Armstrong
5.1.4 Sampling
5.1.4.1 Random Sampling
5.1.4.2 Real-World Sampling
5.1.5 Measuring Predictability
5.1.6 Forecasting Models
5.1.6.1 Auto-regressive Models
5.1.6.2 Neural Networks
5.1.7 Results
5.1.7.1 Random Sampling
5.1.7.2 Real-World Sampling
5.1.7.3 Effects of External Signals
5.1.8 Conclusion
5.2 Effect of Missing Data on Time Series Classification
5.2.1 Types of Missing Data
5.2.2 Research Questions
5.2.3 Modeling Framework
5.2.3.1 Random Masks
5.2.3.2 Realistic Masks
5.2.3.3 Classification Methods
5.2.3.4 Imputation Methods
5.2.4 Data
5.2.5 Results
5.2.5.1 Effect of Missing Data
5.2.5.2 Comparing Realistic and Random Masks
5.2.5.3 Effect of Imputation Methods
5.2.6 Conclusion
5.2.7 Appendix
Chapter 6: Conclusions
Bibliography
List of Tables
3.1 Example topics in the 100-topic LDA model. Topics are labeled (in the first column) manually for convenience.
3.2 Volatility computed via HMM and via cross-entropy, ranked from highest volatility to lowest.
3.3 Table of ground-truth constructs collected during pre-study surveys.
3.4 Extracted features from OMsignal.
3.5 Evaluation of the model on the construct prediction task. The best-performing model's results are highlighted in bold.
3.6 Performance of atypical event detection from sensors in both datasets with randomly sampled cross-validation.
3.7 Performance of atypical event detection from sensors in both datasets with subject held-out detection.
4.1 Remaining IGTB & MGT constructs not discussed in text.
4.2 Results for numerical IGTB constructs, reported in RMSE.
4.3 Results for numerical MGT constructs, reported in RMSE.
4.4 Results for categorical MGT constructs, reported in accuracy except for the Atypical construct, which is reported in AUC ROC.
4.5 Results for a subset of numerical MGT constructs with multivariate data, reported in RMSE.
5.1 Average of Mean Absolute Error (MAE) for imputing the masked values in the dataset for 10 different masks.
5.2 Overview of the effect of imputation methods on classification results, aggregated by imputation method. The positive column shows how many times the imputation method helped classification regardless of the classification method, type of mask, and classification task. The negative column shows the number of times the imputation method significantly hurt the results. For each row, the total number of experiments equals 30. In the rest of the experiments with the given imputation method, the results were not statistically significantly different from using no imputation.
5.3 Overview of the effect of imputation methods on classification results, aggregated by classification method. The positive column shows how many times using imputation helped the results of the classification method regardless of the type of imputation method, the mask, and the classification task. The negative column shows the number of times using imputation methods significantly hurt the results. For each row, the total number of experiments equals 50. In the rest of the experiments with the given classification method, the results were not statistically significantly different from using no imputation.
List of Figures
3.1 Illustration of the BP-AR-HMM model. The left image is matrix $F$. Based on this matrix, time series 1 exhibits states A, B, and C, but there is zero probability of time series 1 going into state D. The right image shows the state sequences of the time series; 1, 2, ..., 5 represent 5 different participants, and A, ..., D are activities identified by the model.
3.2 State sequences of forums (each line represents a forum and each color represents a state) and dendrogram showing the similarity of forums based on their learned states.
3.3 (a) Activity of forums relevant to the AlphaBay and Hansa closure. (b) Smoothed weekly topic cross-entropy of forums. Cross-entropy is smoothed using a rolling average over 4 weeks. For both figures, the black line indicates July 4th, 2017, when AlphaBay was seized; the red line indicates July 20th, 2017, when Hansa was seized.
3.4 Overview of the modeling framework. Sensor data collected from participants A and B is fed into the BP-AR-HMM model, which outputs an HMM per participant, where states are shared among participants. Output from the BP-AR-HMM model is used to learn embeddings, which are later used to predict personal attributes.
3.5 Dendrogram showing the similarity of participants based on their learned states.
3.6 Bipartite graphs with a subset of constructs from Table 3.3 and a subset of states. In (a) each construct is connected to the two states whose regression coefficients are the highest (i.e., strongest positive relationship); in (b) each construct is connected to the two states with the lowest negative coefficients (i.e., strongest negative relationship).
3.7 Effect of atypical events among the datasets studied. (a) Positive affect, (b) negative affect, (c) stress, and (d) anxiety. Green squares show the aerospace dataset, red diamonds show the hospital dataset, and gray circles are the null models, in which we collect sequential data from subjects who do not experience an atypical event at day zero.
3.8 Effect of atypical events versus severity of event. (a) Positive affect, (b) negative affect, (c) stress, and (d) anxiety. Green squares are positive events, white triangles are minor negative events, red diamonds are major negative events, and gray circles are the null models. In the null models we collect sequential data from subjects who do not experience an atypical event at day zero.
3.9 Overview of the modeling framework. Sensor data collected from participants A and B (left two panels) is fed into a non-parametric HMM model that outputs state sequences (middle panel). Output from the HMM model is used to learn embeddings for each day of each participant (right panel). The daily embeddings (lighter and darker-colored circles) and the average embedding for each participant (hashed circles) are used as features. These features, and daily atypical event labels, are then fed into an SVM classifier to predict whether any given day is atypical.
4.1 Overview of the method. Data is (1) transformed with PAA, (2) discretized, and (3) transformed to multiple variations; (4) patterns are extracted from each variation; (5) features generated from different variations are combined and post-processed to output the final representation.
4.2 Example of time series discretization: the series are first transformed with PAA (on the left), then discretized based on equal-width bins (on the right). The discretized version of the first series in this plot is 'A B C D D C B B B B'. It should be mentioned that PAA actually reduces the length of the series, but here they are plotted with the same length as the original signal to show the effect of the transformation.
4.3 Example of BPE on a discretized time series: the first line shows the original discretized series. In each iteration the most common pair is identified and replaced by a new symbol.
4.4 A sample datapoint for the Atypical event construct. Top left plots show the original heart rate and step count. Top right plots show the PAA-transformed data with window size $W = 8$. In the middle left, the data is discretized into $K = 10$ bins. Middle right shows the RCS variation. Bottom left is the RCS Median variation. And finally, bottom right is the autoregressive variation. The 10 most important patterns for classifying Atypical events, based on F-values from an ANOVA test, are highlighted in red.
5.1 Sketch of the filtering framework. In the context of cyberattacks, $X$ is the number of attempted attacks per day; $Y$ is the number of successful attacks per day observed by the target. The probability of an attack being successful is $p$. The Binomial distribution $B(n, p)$ is used to model the time series $Y$.
5.2 Cumulative number of messages in each category within ISI data.
5.3 Cumulative time series of categories in ISI data used for real-world sampling. The top leftmost plot is the time series of messages from the Virus Detected category only; we add other categories to the sampled data one by one. The last plot (titled Marketing) contains messages from all malicious categories.
5.4 Number of threat messages in Armstrong data based on their risk score threshold.
5.5 Decay of predictability of (randomly) sampled ISI data. The (a) auto-correlation decreases at low sampling rates, (b) permutation entropy increases, and (c) error of model-based techniques increases.
5.6 Decay of predictability of (randomly) sampled Armstrong data. The (a) auto-correlation decreases at low sampling rates, (b) permutation entropy increases, and (c) error of model-based techniques increases.
5.7 Decay of predictability of ISI data under real-world sampling. The (a) auto-correlation decreases at low sampling rates; (b) permutation entropy increases; and (c) error of model-based techniques increases.
5.8 Decay of predictability of Armstrong data under real-world sampling for three out of the four measures. The (a) auto-correlation decreases at low sampling rates; (b) permutation entropy decreases; and (c) error of model-based techniques increases.
5.9 Using raw (unfiltered) data as an external signal to predict ISI data (on the left) and Armstrong data (on the right) with real-world sampling. Full: raw data as external signal; Prediction: predicted raw data as external signal; Without: without external signal.
5.10 Example of a time series and its mask, where missing values are represented by 1 and observed values by 0.
5.11 Structure of the CGAN model used to generate realistic masks.
5.12 Example of a time series' mask (first row), where the goal is to increase its missing ratio from the original value 0.3 to 0.5. The second row shows a mask that is incompatible with the first mask, and the third row shows an acceptable mask that is compatible with the original mask of the data.
5.13 Histogram of the missing ratio across datapoints in the TILES dataset. Each datapoint is a time series of one day of a participant.
5.14 The discriminator loss over hand-generated masks, as the model is being trained.
5.15 Effect of missing data on different classification tasks.
5.16 Effect of missing data on different classification tasks with random masks.
5.17 Comparing the effect of missing data with realistic and random masks on classification (work construct).
5.18 Effect of different imputation methods on classification of the work construct. The plots on the left show the effect of imputation on masks generated with CGAN (realistic masks). The plots on the right show the effect of imputation on random masks.
5.19 Effect of different imputation methods on classification of the atypical event construct. This example shows how using imputation methods can make the classification task more challenging.
5.20 Comparing the effect of missing data with realistic and random masks on classification (location construct).
5.21 Comparing the effect of missing data with realistic and random masks on classification (activity construct).
5.22 Comparing the effect of missing data with realistic and random masks on classification (atypical event construct).
5.23 Comparing the effect of missing data with realistic and random masks on classification (interaction construct).
5.24 Effect of different imputation methods on classification of the location construct.
5.25 Effect of different imputation methods on classification of the activity construct.
5.26 Effect of different imputation methods on classification of the interaction construct.
Abstract
The abundant real-time data collected from people in the wild creates new opportunities to better understand human behaviors. One example is temporal data collected from wearable sensors. The ability to analyze this data offers new opportunities for real-time monitoring of physical and psychological health. Physiological data collected from wearable sensors has been used to detect activities, diagnose illnesses, and analyze habits and personality traits. However, temporal physiological data presents many analytic challenges: the data is multimodal, heterogeneous, and noisy; it may contain missing values and long sequences with different lengths. Existing methods for time series analysis and classification are often not suitable for data with these characteristics, nor do they offer interpretability and explainability, a critical requirement in the health domain.
In this thesis, I address some of the challenges in learning representations from these complex temporal data. First, I propose a method based on non-parametric Hidden Markov Models to learn interpretable representations from time series. This method is applied to analyze, cluster, regress, and classify multiple datasets. Second, I propose the Pattern Discovery with Byte Pair Encoding method to better capture long-term dependencies in lengthy time series; it learns representations by extracting variable-length patterns using the Byte Pair Encoding compression technique. The proposed model is interpretable, explainable, and computationally efficient, and beats state-of-the-art approaches on a real-world dataset collected from wearable sensors. Finally, I systematically evaluate how the presence of missing data affects the performance of different state-of-the-art time series classification methods. My work shows how the performance of different methods degrades as a function of missing data and that using imputation methods generally does not make a significant difference in the results.
The proposed models and findings could help better understand and analyze dynamic behaviors within a population and offer new perspectives on monitoring and predicting human behaviors from data collected in the wild.
Chapter 1
Introduction
Time series data appears in many different applications, from health records to industrial processes to online social networks. The task of time series classification/regression is to model/analyze multiple time series together. The overall goal is to identify a time series as coming from one of possibly many sources or predefined groups.
One use case of jointly modeling multiple time series is to learn attributes from a population of individuals where each person is represented by a univariate or multivariate time series. This data could include anything from participants' biometric signals to their activity on social media, the tone of their voice, their GPS data, etc.
Modeling multiple time series introduces many challenges. For example, most models for time series analysis accept as input synchronous time series of fixed length, collected in the same time window. However, in modeling behaviors this is rarely the case, and the data often consists of asynchronous series with different lengths. Data collected in the wild is, in general, very noisy. In order to model this type of data, methods should be flexible to different variations in the data and should not require well-formed datasets. These methods should also be resistant to noise and missing values, both of which are common in these types of datasets. It is widely understood that noise and missing values in data can harm results; however, their effect on time series classification had not been systematically evaluated prior to this work.
Another challenge in these problems is producing an interpretable representation that can be meaningfully explored to understand results. More often than not, interpretability loses in the trade-off with predictive power, especially with the growth of deep learning methods. However, with the goal of understanding patterns and behaviors in the data, interpretable representations become much more important. In applications like healthcare, in addition to interpretability, the explainability and transparency of the models gain importance. For instance, if the method is meant to be used in decision making, doctors and physicians should be able to understand the process and reasoning behind the model's predictions.
In time series classification and regression tasks, another challenge is that in most problems the dataset itself is very rich, with multiple modalities and long sequences, yet very few labels are available. For example, in health-related problems collecting data is very expensive and cumbersome; hence the data collected is usually from a limited number of participants. This causes many state-of-the-art methods to fail due to overfitting with a small number of datapoints and a large, complex feature space. In these types of tasks, where there is not enough data or the data is very heterogeneous, unsupervised methods tend to be more robust to overfitting. This statement is also supported by our results. Another challenge in time series classification is the model's efficiency. HIVE-COTE [85], the current state-of-the-art method in terms of classification accuracy, is often infeasible to run on even modest amounts of data. For instance, training HIVE-COTE on a dataset with only 1,500 time series can require 8 days of CPU time [125].
The goal of this thesis is to address the challenges mentioned above by proposing unsupervised, efficient/scalable models which generate accurate and interpretable representations of noisy, heterogeneous, multimodal temporal data.
Based on this goal, the research is broken down into the following three research questions:
1. How can we learn accurate, interpretable representations from complex temporal data?
2. How can we identify dependencies in temporal data at multiple temporal scales?
3. What effect does missing data have on the quality of the learned models, and how can we mitigate it?
To answer the first question, I propose a model based on non-parametric Hidden Markov Models (HMMs). This model learns shared latent states from different multivariate time series. The number of states in this model is learned from the data: if an unusual new pattern appears, another state is added to model that segment. Additionally, since the learned states are shared between different time series, they can be used as a basis to learn representations. The proposed approach is applied to three different problems:
• Analyze the dynamics of discussions in deep and dark web forums and identify patterns of activity [132].
• Learn embeddings for physiological signals to cluster participants and detect personal demographics such as job type, age, and gender, and personality traits such as wellbeing and positive affect [134].
• Identify atypical life events using wearable sensors and examine their effect on one's wellbeing [26].
To answer the second question, I propose the Pattern Discovery with Byte Pair Encoding (PD-BPE) method [135]. This method learns representations of time series based on common patterns observed at different scales in the data. The patterns are interpretable, variable in length, and extracted using the Byte Pair Encoding compression technique. This method is able to capture long-term and short-term dependencies present in the data in a scalable and computationally efficient manner.
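To give a flavor of the core idea before the full treatment in Chapter 4, the sketch below applies the Byte Pair Encoding merge step to a toy discretized series; the symbols and merge budget are hypothetical, and this illustrates only the compression idea, not the complete PD-BPE pipeline.

from collections import Counter

def bpe_merge(symbols, n_merges=5):
    """Repeatedly replace the most frequent adjacent pair with a new symbol."""
    patterns = {}                                  # merged pair -> new symbol
    for _ in range(n_merges):
        pair_counts = Counter(zip(symbols, symbols[1:]))
        if not pair_counts:
            break
        (a, b), count = pair_counts.most_common(1)[0]
        if count < 2:                              # nothing repeats; stop
            break
        new_symbol = a + b                         # e.g. 'A' + 'B' -> 'AB'
        patterns[(a, b)] = new_symbol
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == (a, b):
                merged.append(new_symbol)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols, patterns

tokens, patterns = bpe_merge(list("ABABCABDD"))
print(tokens)     # ['AB', 'AB', 'C', 'AB', 'D', 'D']
print(patterns)   # {('A', 'B'): 'AB'}

Each discovered pattern is a variable-length subsequence whose frequency can then serve as an interpretable feature.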
For the third research question, I systematically investigate the effects of missing data on time series classification, specifically focusing on data collected from wearable sensors. In this work, I use state-of-the-art methods for time series classification to classify datasets with different levels of missing data. Afterwards, I evaluate different imputation methods for their ability to compensate for the loss of information due to missing data and improve classification performance.
It should be mentioned that each dataset has its own patterns of missingness. For example, if the missing data is mostly caused by sensor noise, the pattern will be different from when it is caused by a person not wearing the device or turning it off. Hence, in order to accurately estimate the effect of missing data, the process of increasing/decreasing the missingness should imitate the patterns of missing data in the original dataset. In this work, I use Conditional Generative Adversarial Nets [96] to generate such realistic masks and later compare them to random masks.
The proposed models and findings could help better understand and analyze dynamic behaviors within a population and offer new perspectives on monitoring and predicting human behaviors from data collected in the wild.
Chapter 2
Related Work
The previous works are categorized into two groups: the first group covers different time series classification methods, and the second group consists of previous works on dealing with missing data in time series.
Many methods have been proposed for time series classification. [8] covers a wide variety of them and categorizes time series classification methods into six groups. Based on the recent advancement of neural network and deep learning methods, we see fit to add a seventh category for methods using deep learning approaches for time series classification. These seven categories are listed below.
• Whole series: In these algorithms a similarity measure is defined between time series. The most notable method in this group is Dynamic Time Warping (DTW) [100] with a Nearest Neighbour classifier, which is still used as a baseline for many methods. Random Warping Series (RWS) [149] is a recent embedding method for time series, based on their DTW distance to thousands of randomly generated time series.
• Intervals: These algorithms extract features from intervals instead of the whole time series. For example, in a long time series an interval can be a minute or an hour, and simple features such as the mean and standard deviation can be calculated from each interval. [26] is a recent example and [112] is one of the earliest works with this approach.
• Shapelets: In these methods the focus is on finding short patterns, commonly called shapelets, that can distinguish between different classes. In these algorithms the presence, or lack thereof, of the shapelet in the series is what matters, and its location is irrelevant [63, 25]. Most shapelet algorithms enumerate identified shapelets and choose the ones that help with the classification [154, 108]. The alternative approach is to capture different shapelets in an unsupervised manner, then perform classification on the generated features [63, 25]. The latter approach can help reduce overfitting compared to the former. Unsupervised approaches can also be easily extended to regression tasks, while it may not be feasible to extend a supervised classification method like [154] to a regression problem.
• Dictionary based: Our proposed approach for the second research question fits in this group. This type can be seen as an extension of shapelet methods. Instead of forming decision boundaries based on the presence of patterns, the features generated by dictionary-based methods capture the relative frequency of observed shapelets in each series. Most of the methods in this group first discretize the series and convert them to strings of symbols, then identify the patterns. One of the most popular methods for discretizing time series is Symbolic Aggregate Approximation (SAX) [82]. In SAX, the normalized series are first transformed by Piecewise Aggregate Approximation (PAA) [75], then discretized into bins formed from equal-probability areas of the normal distribution (a code sketch of this pipeline follows this list). PAA reduces the dimension of the input time series by splitting it into segments and averaging the values in these segments. In [83], after the SAX transformation, signals are broken down into windows with a fixed size L, referred to as words, and a histogram of word counts is used as features for classification, similar to a bag-of-words approach. [122] has a similar approach, but instead of bag-of-words it uses TF-IDF (Term Frequency-Inverse Document Frequency), and instead of calculating TF-IDF for each time series, it calculates it for each class. Probably the most popular method in this group is the Bag-of-SFA-Symbols (BOSS) model [121], which, similar to DTW, is used as a baseline in many recent methods. Instead of using PAA, BOSS applies the Discrete Fourier Transform (DFT) on each window, then discretizes the series. When instances of consecutive identical words such as “aba aba” in “aba aba acc” are observed, most of the methods in this group only count the first word and ignore the remaining repeated words to avoid over-counting trivial matches.
• Combinations: This group contains methods that combine elements or features from different approaches. The Collection Of Transformation Ensembles (COTE) [9] is known as one of the most accurate methods for time series classification. COTE combines different classifiers over representations of the data in different domains, including time and shapelets. HIVE-COTE [85], an extension to COTE, proposes a hierarchical structure with probabilistic voting over multiple classifiers and has also proven to be very successful.
• Model based: In these methods, generative models are fitted to the time series, and the similarity between series is measured by comparing the models fitted to them. For example, [74] measures the distance between series by comparing their ARIMA models. Our proposed approach for the first research question fits in this group [132, 134, 26]: the time series are compared with each other based on the non-parametric Hidden Markov Models fitted to them. Hidden Markov Models (HMMs) are very well known in time series modeling; however, they have mostly been used for time series forecasting and other similar tasks, as opposed to time series classification. HMMs represent the temporal trends and dynamics of time series using states and transition probabilities. Since transition probabilities are not time-dependent, the HMM parameters of time series with different lengths can easily be compared. In prior research, HMM models were learned on each time series independently [15]. As a consequence, the learned states cannot be compared across different representations. Another shortcoming of standard HMM models is that the number of states must be fixed a priori. Recent Bayesian approaches overcome these constraints by allowing infinitely many potential states using the Beta process, with states shared among all time series [49, 48, 50]. This allows each time series to be represented in the space of shared latent states. There is another line of work which allows for infinite potential states in a Hidden Markov Model: Beal et al. [17] use the Dirichlet process [6] as a prior over the hidden states, but the model is again designed to capture each time series independently.
• Deep learning: [45] reviews successful deep learning approaches for time series data. Although, intuitively, it seems that recurrent neural networks might be better suited for time series classification [93, 69], convolutional networks have proven to be more successful at this task [156]. An important work in this group is ROCKET (RandOm Convolutional KErnel Transform) [39]. ROCKET transforms time series with a large number of random convolutional kernels, i.e., kernels with random lengths, weights, biases, etc. It has become popular for its exceptional computation speed and accurate results.
Another notable work is an unsupervised embedding approach proposed in [51]. These embeddings are generated by an encoder with causal dilated convolutional layers. If we denote the time series in the dataset as $x_1, \dots, x_N$, the model is trained such that the embeddings of sub-series of $x_i$ are closer to each other than to the embedding of a sub-series of any $x_j$ with $j \neq i$. Both of these methods generate unsupervised representations and are able to handle different variants of time series (variable length, multivariate, etc.). They are also among the best and most recent methods proposed for time series classification.
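As a concrete illustration of the discretization pipeline described under the Dictionary-based category above, here is a minimal sketch of PAA followed by SAX-style binning; it assumes z-normalized input and Gaussian breakpoints, and all parameter values are illustrative.

import numpy as np
from scipy.stats import norm

def paa(x, n_segments):
    """Reduce length by averaging consecutive segments (PAA)."""
    return np.array([seg.mean() for seg in np.array_split(x, n_segments)])

def sax(x, n_segments=8, alphabet="abcd"):
    """Discretize a series into symbols using PAA + Gaussian breakpoints."""
    z = (np.asarray(x, dtype=float) - np.mean(x)) / np.std(x)
    reduced = paa(z, n_segments)
    # Breakpoints splitting N(0, 1) into equal-probability bins.
    breakpoints = norm.ppf(np.linspace(0, 1, len(alphabet) + 1)[1:-1])
    return "".join(alphabet[i] for i in np.searchsorted(breakpoints, reduced))

signal = np.sin(np.linspace(0, 4 * np.pi, 64))
print(sax(signal))   # an 8-symbol word such as 'cdcbabcd'

A bag of such words (or their counts, as in BOSS-style methods) then becomes the feature vector for classification.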
Another approach to finding shared latent features among multiple time series is tensor decomposition, which does not completely fit in any of the categories mentioned above. Tensor-based methods have been extensively used in different fields [79], including behavioral modeling [68, 117, 67]. A popular tensor decomposition method is Parafac2 [61], which offers a multilinear higher-order decomposition that can handle missing values and time series of different lengths. Parafac2 itself is considered a traditional method; however, multiple works have been published in recent years on improving its inference, imposing constraints, etc. [73, 34].
It should be mentioned that although time series classification has gained more attention, there are also many time series regression problems, such as affect estimation from wearable sensors [153], analysis of fMRI signals [47], and many more. Many of the methods described above, such as the methods in the Combinations category, cannot be extended to a regression problem. Also, most methods in the Deep Learning category are neither interpretable nor explainable. Dictionary-based methods are known for being interpretable, but in most cases the identified patterns have the same length. There are a few methods in motif discovery for identifying patterns with variable length [137]. However, since these methods were proposed for identifying the best set of motifs, they cannot easily be compared with dictionary-based methods. Motif discovery methods are mostly evaluated on their scalability, robustness to noise, exactness, and other related characteristics, and are not necessarily targeted at classification/regression tasks.
It is generally understood that loss of information can cause loss of predictive power. Many different solutions have been proposed to deal with this phenomenon across data science tasks. For the task of time series classification there have been mainly three solutions to this problem. Some researchers have tried to deal with this issue by proposing classification models that disregard the missing values (Approach A). Another approach is based on the assumption that the missing patterns are informative and can therefore be used as extra features (Approach B). The last approach is to estimate the missing data based on the observed values (i.e., impute the missing values) before the classification task (Approach C). There are many methods proposed for each of these solutions.
Approach A: Many methods have been proposed for time series classification in the presence of missing data, such as [94, 21]. In [94, 21], the assumption is that the missing data is of type MAR (described in 5.2.1) and depends on the observed values; hence, the missing values can be analytically integrated away and not considered in the analysis. There have also been methods proposed specifically for time series data collected from wearable sensors [150], since missing patterns in these datasets are different from those in most other time series (more discussion on this in Section 5.2.2).
Approach B: A different approach is to take advantage of the missing patterns and treat them as extra features, since these patterns themselves can provide information about the target labels, i.e., informative missingness [114]. Methods such as [31, 80, 86] use this approach. Although these methods have been especially designed to capture the patterns of missing data, many other classification methods can also capture such patterns without necessarily being designed to do so. For instance, if the missing data is filled with a constant different from the other values observed in the data, methods such as [135] can very well capture them as patterns present in the data and use them in classification.
Approach C: There are many methods proposed for imputing missing data. Some are more general and can be applied to different data types as well as time series [27, 32]. Some are proposed more specifically for time series imputation [33, 30, 87] ([106] reviews more traditional imputation methods for time series, while [44] reviews recently proposed deep learning approaches). Some methods are even proposed specifically for imputing time series data collected from wearable sensors [151, 84, 46]. There are also classification methods which impute the data before classifying it [102]. Other methods use classification as part of their evaluation, to show that the results improve by using the proposed imputation method [30, 87]. However, this conclusion cannot be generalized, since it only shows that the proposed imputation method helps the specific classification method that was used; another classification method might not benefit from it.
Chapter 3
Learning Behavioral States with Non-Parametric HMMs
As mentioned in the Introduction (Chapter 1), the research in this thesis is broken down into three research questions. The first research question is:
How can we learn accurate, interpretable representations from complex temporal data?
To answer this question, I propose a model based on non-parametric Hidden Markov Models (HMMs). In this chapter, a detailed description of the model and its use in different applications is given. We originally introduced this method in [132] for clustering deep and dark web forums and analyzing their activity. Later, we completed the model and used it to learn representations for data collected from wearable sensors [134]. Finally, the same approach was used for detecting atypical events in [26].
3.1 Non-Parametric HMM Model
One of the most popular tools for studying multivariate time series is the vector autoregressive (VAR) model [98]. In a VAR model of lag $r$, each variable is a linear function of its own and the other variables' $r$ previous values. However, such models cannot describe time series with changing behaviors. To model such cases, Markov switching autoregressive models, which are a generalization of autoregressive models and Hidden Markov Models [98], are used.
Figure 3.1: Illustration of the BP-AR-HMM model. The left image is matrix $F$. Based on this matrix, time series 1 exhibits states A, B, and C, but there is zero probability of time series 1 going into state D. The right image shows the state sequences of the time series; 1, 2, ..., 5 represent 5 different participants, and A, ..., D are activities identified by the model.
In this chapter, we use a generative model proposed by [50, 48], called the Beta Process Autoregressive HMM (BP-AR-HMM), to discover states, or regimes, of Markov switching autoregressive models that are shared by different time series. In this model, the entire set of time series can be described by globally-shared states, or behaviors, where each time series is associated with a subset of them. The behaviors associated with different time series can be represented by a binary matrix $F$, where $F_{ij} = 1$ means time series $i$ is associated with behavior $j$. Given matrix $F$, each time series is modeled as a separate hidden Markov model over the states it exhibits. An example of an $F$ matrix and the corresponding state sequences are shown in Figure 3.1.
An HMM is represented by a transition matrix $T_i$, which is a square matrix with dimensions equal to the number of states time series $i$ exhibits. Entry $T_i(m,n)$ is the probability of transitioning from state $m$ to state $n$ for time series $i$; hence, the entries of each row sum to 1. Each state is modeled using a vector autoregressive process with lag $r$:
$$y_t = \sum_{l=1}^{r} A_{l,z_t}\, y_{t-l} + e_{z_t}, \qquad e_{z_t} \sim \mathcal{N}(0, \Sigma_{z_t}) \qquad (3.1)$$
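To make the generative process in Eq. 3.1 concrete, the following minimal sketch simulates a univariate switching autoregressive series with lag $r = 1$ and two states; the weights, noise scales, and transition probabilities are illustrative values, not parameters learned by the model.

import numpy as np

rng = np.random.default_rng(0)
A = {0: 0.9, 1: -0.5}          # autoregressive weight A_{1,z} per state z
sigma = {0: 0.1, 1: 0.5}       # noise scale per state
T = np.array([[0.95, 0.05],    # row m: transition probabilities from state m
              [0.10, 0.90]])

y, z = [0.0], 0
for t in range(1, 200):
    z = rng.choice(2, p=T[z])                           # Markov state switch
    y.append(A[z] * y[-1] + rng.normal(0.0, sigma[z]))  # Eq. 3.1 with r = 1

The inference problem solved by the BP-AR-HMM is the reverse of this sketch: recovering the states, weights, and transitions from the observed series.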
When a time series is in state $z_t$, its future values evolve according to the autoregressive weights $A_{1,z_t}, \dots, A_{r,z_t}$ and the noise $e_{z_t}$. Since the number of such states in the data is not known in advance, the Beta process is used [64, 136]. A Beta process allows for an infinite number of behaviors but encourages sparse representations. Consider, as an example, a model with $K$ behaviors. Each behavior (each column of matrix $F$) is modeled by a Bernoulli random variable whose parameter is obtained from a Beta distribution (Beta-Bernoulli process), i.e.

$$\pi_k \sim \mathrm{Beta}(\alpha/K,\, 1), \quad k = 1, \dots, K$$
$$F_{nk} \sim \mathrm{Bernoulli}(\pi_k), \quad n = 1, \dots, N \qquad (3.2)$$
The underlying distribution when this process is extended to an infinite number of behaviors, i.e., as $K$ tends to infinity, is the Beta process. This process is also known as the Indian Buffet Process [78, 60], which can be best understood with the following “culinary metaphor” involving a sequence of customers (time series) selecting dishes (features) from an infinitely large buffet. The $n$-th customer selects dish $k$ with probability $m_k/n$, where $m_k$ is the popularity of the dish, i.e., some features are going to be more prevalent than others. The customer then selects $\mathrm{Poisson}(\alpha/n)$ new dishes. With this approach, the number of features can grow arbitrarily with the size $n$ of the dataset: in other words, the feature space increases if the data cannot be faithfully represented with the already defined states. However, the probability of adding new states decreases according to $\mathrm{Poisson}(\alpha/n)$. Finally, the distribution generated by the Indian Buffet Process is independent of the order of the customers. For posterior computations, the reader is referred to the original work [50, 48].
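The culinary metaphor translates almost directly into code. The following small sampler draws feature assignments from the Indian Buffet Process prior, i.e., only the prior over matrix $F$, not the full BP-AR-HMM posterior inference.

import numpy as np

def sample_ibp(n_customers, alpha, seed=0):
    rng = np.random.default_rng(seed)
    dish_counts = []                       # m_k for each dish seen so far
    selections = []
    for n in range(1, n_customers + 1):
        # Existing dish k is chosen with probability m_k / n.
        chosen = [k for k, m_k in enumerate(dish_counts)
                  if rng.random() < m_k / n]
        # Then Poisson(alpha / n) brand-new dishes are added.
        n_new = rng.poisson(alpha / n)
        chosen += list(range(len(dish_counts), len(dish_counts) + n_new))
        for k in chosen:
            if k < len(dish_counts):
                dish_counts[k] += 1
            else:
                dish_counts.append(1)
        selections.append(chosen)
    return selections   # row n lists the behaviors used by time series n

print(sample_ibp(5, alpha=2.0))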
We used this model to better understand (1) the dynamics of discussions in social media, (2) the behaviors individuals share that are predictive of personality traits, and (3) the physiological states that capture the difference between a typical and an atypical day. In the following sections these problems are discussed in more detail.
3.2 Dynamics of Discussions in Social Media
D2web refers to limited-access web sites that require registration, authentication, or more complex encryption protocols to access. These web sites serve as hubs for a variety of illicit activities: trading drugs, stolen user credentials, and hacking tools, and coordinating attacks and manipulation campaigns. The growing popularity of the marketplaces within the deepweb can be attributed to the elimination of the risk of violence, since there is limited, if any, physical interaction between the buyers and the sellers. Another reason is the use of encrypted protocols to preserve anonymity, encouraging people to express themselves without the risk of getting caught by law enforcement or being censored by the moderators of a web site [129].
Despite its importance to cyber crime, the d2web has not been systematically investigated. Given the threat posed by these malicious actors, observing their activities on the deep and dark web (d2web) may provide valuable clues both for anticipating and preventing cyber attacks and for mitigating the fallout from data breaches. However, picking out useful signals in the vast, dynamic, and heterogeneous environment of the d2web can be challenging. In this chapter, we study a large corpus of messages posted to 80 d2web forums over a period of more than a year. We identify topics of discussion using Latent Dirichlet Allocation (LDA) [23] and use a non-parametric HMM to model the evolution of topics across forums. Then, we examine the dynamic patterns of discussion and identify forums with similar patterns. We show that our approach surfaces hidden similarities across different forums and can help identify anomalous events in this rich, heterogeneous data.
3.2.1 D2Web Data Collection
The d2web data we use in this study was collected using the crawling infrastructure described in [110, 103]. This infrastructure uses anonymization protocols, such as Tor and I2P, to access darkweb sites, and handles authentication to access non-indexed deepweb sites on the Internet. It includes lightweight crawlers and parsers that are focused on specific sites related to malicious hacking and/or online financial fraud. These sites represent forums and discussion boards where people mostly discuss cyber crime and fraud, although other illicit activities are discussed as well, such as the sale of drugs and other stolen goods. A handful of crawled forums are on the clearnet but are mostly white hat (i.e., involved with ethical hacking and/or professional cybersecurity). We include these forums in our analysis primarily to help identify other forums that might discuss similar topics. In all, the crawlers scraped data from over 250 d2web forums. The most common languages in which the posts were written were English (accounting for 37.8% of all posts), Russian (22.4%), and Chinese (15.4%). Other languages, such as Spanish, Arabic, and Turkish, were less frequent, each accounting for less than 7% of the posts. For this analysis we only focus on English posts, though the same structure could be used for multiple languages. Filtering out the non-English posts brings the number of forums down to 155. We pre-processed the posts using NLTK [22], SpaCy [66], and scikit-learn to remove stopwords, tokenize each post, and filter tokens by post frequency. This gives us a corpus of 1.33 million posts.
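The preprocessing step can be sketched roughly as below; the exact tokenizer and frequency cutoffs used in the study may differ (NLTK and SpaCy were also involved), so treat the settings here as assumptions.

from sklearn.feature_extraction.text import CountVectorizer

posts = ["example post about proxy servers",
         "another post discussing bitcoin wallets"]   # stand-in for the corpus
vectorizer = CountVectorizer(
    stop_words="english",   # remove stopwords
    min_df=1,               # filter tokens by post frequency (assumed cutoff)
)
doc_term = vectorizer.fit_transform(posts)   # posts x vocabulary count matrix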
For modeling the activity in forums, we focused on the period from 2016 until mid-September 2017, which has the best coverage in our data. We only looked at forums with at least one month of activity and more than 100 posts overall, which reduced our data set to 80 forums (and approximately 482 thousand posts). The level of activity in these 80 forums is highly heterogeneous, with some forums seeing hundreds of posts per week and other forums showing little activity.
3.2.2 Modeling Topics of Discussion
We applied a popular statistical technique known as Latent Dirichlet Allocation (LDA) [23] to learn the topics of the English-language posts. LDA decomposes documents into latent topics, where each topic is a distribution over words, intended to capture the semantic content of documents. In this model each document is treated as a bag-of-words. Once we learn the model, we can represent documents as distributions over a fixed number of topics. Doing so gives us low-dimensional representations of documents. We train the model on all 1.33 million documents to learn the most informative topics. We tested 50, 100, and 200 topics, and found that 100 topics resulted in the most coherent and relevant topics. Table 3.1 highlights some of the topics learned by the 100-topic LDA model by showing the most significant words associated with each topic.
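A hedged sketch of this step with gensim is shown below; `tokenized_posts` stands in for the preprocessed corpus, and the training settings are illustrative rather than the study's exact configuration.

from gensim.corpora import Dictionary
from gensim.models import LdaModel

tokenized_posts = [["proxy", "server", "sale"],
                   ["bitcoin", "wallet", "send"]]      # toy token lists
dictionary = Dictionary(tokenized_posts)
corpus = [dictionary.doc2bow(doc) for doc in tokenized_posts]
lda = LdaModel(corpus, num_topics=100, id2word=dictionary, passes=5)

# A post's topic vector: its distribution over the 100 topics.
topic_vector = lda.get_document_topics(corpus[0], minimum_probability=0.0)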
3.2.3 Modeling Dynamics of Activity
To examine the dynamics of topics on d2web forums, we represent each forum as a time series of topic vectors learned by the 100-topic LDA model. Our unit of time in this analysis is a week; to generate a forum's vector we average the topic vectors of all posts submitted to the forum over the course of a week. The time series of weekly topic vectors were used to learn HMM states. We use the generative Beta Process HMM model described in Section 3.1 to identify latent states shared by different time series. In this work, instead of modeling states as autoregressive processes, each global state is modeled using a multivariate Gaussian distribution for higher interpretability: when a time series is in state $x$, its data is sampled from a Gaussian distribution with mean vector $\mu_x$ and covariance matrix $\Sigma_x$. After training the BP-HMM model on the weekly topic distributions of forums, the model learned 28 states, i.e., 28 different topic distributions.
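A minimal sketch of constructing one forum's weekly topic time series with pandas, assuming a hypothetical DataFrame `df` of per-post topic vectors (one column per topic) indexed by post timestamp:

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
idx = pd.date_range("2016-01-04", periods=300, freq="D")
df = pd.DataFrame(rng.random((300, 100)), index=idx)   # toy topic vectors

weekly = df.resample("W").mean().dropna(how="all")
# Each row of `weekly` is one week's average topic vector; these per-forum
# sequences are the multivariate time series fed to the BP-HMM.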
Topic (Manually Labeled) | Top 10 Keywords

Vending
1. Locations: checked, live, united states, unknown, california, carolina, south, ca, nj, new
2. Money: money, people, make, pay, want, free, buy, just, like, sell
3. Pharmaceuticals: buy, online, prescription, cheap, cod, xanax, delivery, overnight, order, day
4. Banking: card, bank, credit, cards, paypal, account, business, debit, accounts, gift
5. Fake IDs: fake, real, id, original, high, english, license, quality, registered, passports
6. Purchase details: order, vendor, days, sent, orders, ordered, package, received, shipped, just
7. LSD: like, just, quote, lsd, really, good, feel, know, tabs, experience
8. Cryptocurrency: bitcoin, btc, wallet, address, send, coins, bitcoins, transaction, sent, account
9. Marijuana: like, good, got, weed, time, bit, high, great, nice, low
10. Markets: market, vendor, vendors, dream, alphabay, markets, scam, ab, hansa, escrow
11. Narcotics: good, cocaine, quality, best, vendor, product, mdma, coke, free, order

Security
12. Malware: virus, scan, antivirus, file, malware, clean, av, security, download, detected
13. Botnets: bot, attack, malware, used, domain, ddos, botnet, irc, hosting, attacks
14. Windows: windows, build, microsoft, xp, vista, beta, server, ms, longhorn, version
15. Social hacking: email, send, rat, stealer, message, keylogger, mail, facebook, crypter, download
16. Law enforcement: police, law, drug, drugs, enforcement, according, dark, illegal, said, darknet
17. Hacking tutorial: learn, know, want, good, learning, start, like, knowledge, programming, hacking
18. Carding: transfer, dumps, info, sell, cvv, good, track, balance, bank, uk
19. Web Vulnerabilities: web, sql, php, injection, exploit, code, server, site, script, page
20. OS Code: process, code, dll, memory, address, api, function, module, use, hook
21. Network Hacking: network, connect, wifi, wireless, ip, internet, router, connected, pineapple, fon
22. Security: information, data, security, software, used, access, user, users, network, application
23. Proxy: use, tor, using, vpn, internet, proxy, browser, ip, web, access
24. Mobile phones: phone, android, phones, samsung, pixel, battery, note, camera, better, google
25. Update: install, installed, download, just, update, need, use, installing, using, try

Gaming
26. Gaming Source Code: end, local, return, function, false, mod, script, item, nil, damage
27. Torrents: torrent, quote, download, upload, forget, left, feedback, like, plz, 720p
28. Gameplay: game, complete, level, win, play, kill, mode, team, player, single
29. Games: game, games, new, play, like, xbox, ps4, sony, playstation, console
30. PlayStation Vita: vita, ps, psp, game, firmware, exploit, games, sony, custom, psn
31. Emulators: game, games, vita, version, plugin, homebrew, psvita, emulator, use, play
32. Hacking Consoles: ps3, games, play, tutorial, game, cfw, console, psn, use, need

Other
33. Contact: contact, pm, need, icq, want, send, add, rue, interested, na
34. Thanks: thanks, thank, man, lot, sharing, bro, thx, share, mate, nice

Table 3.1: Example topics in the 100-topic LDA model. Topics are labeled (in the first column) manually for convenience.
3.2.4 Clustering

To define a similarity measure between two HMMs, one could measure the probability of their state sequences having been generated by the same process. Since each time series is associated with a distinct generative process, we measure two state sequences' similarity as the likelihood that $seq_i$ was generated by the process that gave rise to $seq_j$, and the likelihood that $seq_j$ was generated by the process giving rise to $seq_i$. We average the two likelihoods to symmetrize the similarity measure.
$$\forall i,j: \quad \text{Sim}(i,j) = \frac{p(seq_i \mid T_j) + p(seq_j \mid T_i)}{2} \qquad (3.3)$$
The likelihood $p(seq_i \mid T_j)$ is computed using the learned transition matrix $T_j$ and the Markov process assumption. In a transition matrix $T_i$, which is a square matrix with dimensions equal to the number of states, entry $T_i(m,n)$ gives the probability that time series $i$ transitions from state $m$ to state $n$. Matrix $T_i$ is stochastic, with the sum of entries in each row equal to 1. We call this distance measure the Likelihood Distance.
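As a concrete illustration, the following is a minimal sketch of the Likelihood Distance; the function names and the eps smoothing are our own additions, and state sequences are assumed to be integer arrays indexing the shared global states.

```python
import numpy as np

def log_likelihood(seq, T, eps=1e-12):
    # Log-likelihood of a state sequence under transition matrix T, using
    # the first-order Markov assumption; eps guards zero-probability entries
    # and the initial state probability is omitted for brevity.
    return sum(np.log(T[m, n] + eps) for m, n in zip(seq[:-1], seq[1:]))

def similarity(seq_i, T_i, seq_j, T_j):
    # Symmetrized similarity of Eq. 3.3: average of the two cross-likelihoods,
    # computed in log space and exponentiated only at the end.
    return 0.5 * (np.exp(log_likelihood(seq_i, T_j)) +
                  np.exp(log_likelihood(seq_j, T_i)))
```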
Once the similarity between two HMMs is defined, we can perform a number of operations, including clustering similar time series together. For example, we use a hierarchical agglomerative clustering method to automatically group forums (represented by their time series) with similar discussions. Figure 3.2 shows the resulting dendrogram, as well as the sequence of learned states for each forum. Each line in the figure represents a forum, and different states are represented by different colors. Transitions between states are visible where the colors alternate.
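Given the pairwise similarities, the grouping itself is standard; here is a minimal sketch with SciPy, where sim is a synthetic stand-in for the Eq. 3.3 values and the similarity-to-distance conversion is our own choice (plotting the dendrogram requires matplotlib).

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform

sim = np.random.rand(6, 6)
sim = (sim + sim.T) / 2                      # synthetic symmetric similarity matrix
dist = sim.max() - sim                       # turn similarities into distances
np.fill_diagonal(dist, 0.0)
Z = linkage(squareform(dist, checks=False), method="average")
dendrogram(Z)                                # forums grouped by topic dynamics
```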
The clustering results show that the method is able to cluster forums into meaningful groups. Next we examine a few of the main clusters:

Cluster 1 mostly contains forums discussing cyber hacking, including HackForum, GroundZero, ZeroDay, DeepDotWeb, and SafeSkyHacks. The two subgroups in this cluster differ mainly in their levels of activity.
Figure 3.2: State sequences of forums (each line represents a forum and each color represents a state) and dendrogram showing the similarity of forums based on their learned states.
Forums in the first subgroup are less active (5 posts in a week on average), which is why their average weekly topic vector is more sensitive to individual posts and their corresponding HMM changes state more frequently. In the active group (more than 50 posts in a week on average), the most common state is the yellow state, which corresponds to activation of the following topics: 16. Law enforcement, 23. Proxy, and 17. Hacking tutorial (described in Table 3.1).
Cluster 2 mostly contains dark web marketplaces such as Abraxas Market, AlphaBay, Dream Market, Hansa Market, and BlackWorld, as well as forums dedicated to their reviews. This cluster is also divided into two main subgroups. In the first subgroup the dark blue state dominates, representing high activity of discussions regarding topics 1. Locations (which is mostly concerned with the sale of proxy servers), 33. Contact, 34. Thanks, and 4. Banking. The second subgroup, where the light blue state is more prevalent, corresponds to the topics 6. Purchase details, 10. Markets, 8. Cryptocurrency, and 11. Narcotics. Based on the clustering, one can characterize forums in the first subgroup as mostly selling proxy servers and sharing information about other marketplaces, while forums in the second subgroup are more involved in selling drugs.
Cluster 3 is made up of forums related to hacking PlayStation video game consoles; the most prominent state in this cluster is the cyan state. The most active topics in this state are 32. Hacking Consoles and 25. Update.

Cluster 4 (as well as the two adjacent forums) contains forums which are predominantly focused on white hat hacking. Notable forums include Metasploit and Hak5. While 0daybank and FreeBuf are related, they are mostly in Chinese; hence their more active topics have many non-English tokens and are hard to interpret.
With the state sequences obtained from our BP-HMM model, we can track forums' discussions. State transitions indicate a significant change in discussions and could represent an event. However, as shown in Figure 3.2, some forums change states more frequently, so their transitions might carry less significance.
HMM-ranked volatility | Cross-entropy-ranked volatility
1. GroundZero | CodePaste
2. OpenSCMarketPlace | DroidJack
3. StrongholdPaste | EffectHacking
4. DemonForum | DemonForum
5. EffectHacking | Overchan
6. CrackingFire | HellboundHackers
7. Overchan | CardingF
8. NullByte | HackForum
9. Siphon | Reddit r/Hacking
... | ...
71. UnknownCheats | TheMajesticGarden
72. DevilGroup | Dumpz
73. KernelMode | KernelMode
74. VirusRadar | BlackWorld
75. Reddit r/Vitahacks | MetaSploit
76. BetaArchive | 0DayBank
77. TheMajesticGarden | Reddit r/DarkNetReviews
78. Reddit r/PS3Homebrew | Wololo
79. CSU | VirusRadar
80. BugsChromium | CSU

Table 3.2: Volatility computed via HMM and via cross-entropy, ranked from highest volatility to lowest.
In order to recognize significant transitions, we define a volatility measure: the likelihood of drastic changes in a forum's topic distribution over time. As each forum is characterized by its learned transition matrix over the global states, we compute a forum's volatility by adding the off-diagonal elements of its transition matrix, i.e., the probability of changing states. Since we are interested in finding variations in the topics discussed rather than in the activity of forums, the probabilities of the state corresponding to 0 posts (i.e., no data) were not taken into account. To validate the results, a similar volatility measure was computed with cross-entropy. To compute the cross-entropy we use the following formula, where Q is the topic distribution for a forum averaged over the entire timespan and P is the topic distribution in a unit of time.
$$H(P,Q) = \mathbb{E}_P\left[-\log_2 Q(x)\right] \qquad (3.4)$$
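A minimal sketch of both measures follows; the eps smoothing and the explicit handling of the no-posts state are our own rendering of the description above.

```python
import numpy as np

def volatility(T, no_post_state=None):
    # Sum of off-diagonal transition probabilities, skipping the state that
    # corresponds to weeks with 0 posts (i.e., no data).
    keep = [s for s in range(T.shape[0]) if s != no_post_state]
    sub = T[np.ix_(keep, keep)]
    return sub.sum() - np.trace(sub)

def cross_entropy(P, Q, eps=1e-12):
    # H(P, Q) of Eq. 3.4: P is the topic distribution in one unit of time,
    # Q is the forum's topic distribution averaged over the whole timespan.
    return -np.sum(P * np.log2(Q + eps))
```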
Table 3.2 shows a list of forums with high and low volatility computed via both methods. Using the described volatility measure with the HMM, we find that the most volatile forums with at least 10 posts per week (on average) are OpenSC Marketplace, Stronghold Paste, and Demon Forum, and that the least volatile forums are BugsChromium, CSU, and the subreddit PS3Homebrew. An estimate of a forum's volatility, or lack thereof, is also apparent from its state sequence in Figure 3.2. Stronghold Paste is an onion website similar to Pastebin; it covers different topics and hence different states. However, there are two main states it oscillates between: in one, hacking and cyber security topics (described by topic 19. Web Vulnerabilities) are more prominent, and in the other, topic 8. Cryptocurrency has high activation. These results show that state transitions in forums like Stronghold Paste have a low probability of being indicative of an event, whereas transitions in forums like BugsChromium or CSU might be of more interest.
Results of measuring volatility with cross-entropy are consistent with the results from the HMM model, in the sense that forums with low or high volatility based on the HMM measure also appear at the bottom or top of the ranking based on the cross-entropy measure. The results show that large and active forums have wide-ranging discussions on diverse topics; however, their average topic distribution is usually consistent over time. Forums focused on specific topics, like CSU, also tend to have low volatility. On the other end of the spectrum are forums with medium or low activity whose discussions span a wide range of topics, like Stronghold Paste and CodePaste.
3.2.5 Case Studies

In this section we give examples of how this framework could be used to study d2web discussions.

3.2.5.1 Prescription Drugs

In the search for anomalies using the results obtained by our HMM, we observed a rare state, exhibited only in the first 3 weeks of June 2017 and the first week of August 2017 by the forum OffensiveSecurity, in which topic 3. Pharmaceuticals has its highest probability. OffensiveSecurity is considered a low volatility forum based on both measures computed in Table 3.2, which makes this transition of greater interest.

By looking at the posts published on this forum on the aforementioned dates, we retrieved similar posts with variations in the names of the drugs being advertised. An excerpt from one of the posts is as follows: "...buy lynoral cheap buy generic femara buy modafinil online uk can i buy qsymia online buy cytotec online us buy lumigan online canada buy 25 mg lyrica buy vibramycin florida buy diflucan without buy synthroid online next day delivery buy xanax online us where to buy generic qsymia buy adderall no prescription buy cytotec in europe..."

This analysis shows a high and anomalous volume of advertisements for prescription drugs on the specified dates, which suggests some precipitating event that merits further investigation.
We also looked at one of the most illicit drug topics, 11. Narcotics. We found that besides 11. Narcotics, the other prominent topics in the state where this topic is at its highest value are 9. Marijuana and 7. LSD. A few forums which exhibit this state are The Majestic Garden, the DarkNetReviews subreddit, and the HANSAdnmDarkNetMarket subreddit.
3.2.5.2 Proxy Servers

The second case study concerns the discussion of proxy servers and the sale of services that can be used to game social media platforms, defraud ad networks, build botnets, and effectively launder illegal activity. We use the clustering from the HMM to help identify which kinds of forums have high activity related to proxy servers. Compared to many of the other topics, topic 1. Locations is concerned primarily with the sale of proxy servers. When we examined cluster 2 from the HMM clustering, we noticed a significant activation of this topic. Interestingly, the ten most active forums in cluster 2 seem to capture most of the activation of the topic over time. We examined the forum-weeks (points) with significant probability (above 0.50) for topic 1. Locations from forums in cluster 2 and recovered approximately 11,000 documents, which seem to be automated posts advertising the sale of access to proxies all over the world. These 11,000 documents come mostly from CSU and BlackWorld: both forums have subforums dedicated to the advertising of proxies and are among the top 10 active forums in the cluster. An excerpt from one of these posts is as follows: "...camarillo | ca | unknown | united states | checked at vn5socks.netlive...". When we look for documents pertaining to more specific uses of proxies, we find approximately 20 posts that directly mention "viewbot". Viewbotting is the act of using bots to artificially inflate the number of views on a social media profile (e.g., YouTube and Twitch). As it can be difficult to determine whether a viewer is a human or a bot, this can potentially trick the social media platform into thinking a profile is more popular than it actually is, resulting in more attention than it would obtain organically. An alternate use of viewbots is to watch video ads on a channel, artificially increasing ad revenue. An excerpt from one of these documents is as follows: "...i viewbotted my vid to 1k views and got 10 slaves..."

Another potential malicious use of proxy servers is carding. Carding in this case refers to the fraudulent use of other people's credit cards and personal and/or financial data to purchase goods, launder money, or generally steal an individual's money. Proxy servers are commonly used to "cash out" stolen credit card information by buying items like pre-paid gift cards through payment processors. In our corpus, there are a number of documents that mention carding. An example that suggests intent to use proxies for carding is as follows: "...proxies are often blacklisted when used for fraud so im looking for a source for fresh proxies for use for carding..."
3.2.5.3 Marketplace Shutdowns

Our final case study concerns the seizure of the AlphaBay marketplace by the FBI on July 4th, 2017 and the seizure of the Hansa marketplace on July 20th, 2017 by the Dutch NHTCU. We look at cross-entropy, forum activity, and transitions between states to analyze this case study.
Figure 3.3: (a) Activity of forums relevant to the AlphaBay and Hansa closures. (b) Smoothed weekly topic cross-entropy of forums; cross-entropy is smoothed using a rolling average over 4 weeks. In both panels, the black line indicates July 4th, 2017, when AlphaBay was seized, and the red line indicates July 20th, 2017, when Hansa was seized.
Included in our d2web data are forums related to transactions and reviews of these marketplaces, including several private subreddits. We observed that a few of the forums in our dataset have peaks in activity around the same time. Figure 3.3(a) shows the number of posts published in these forums. Forums related to AlphaBay and Hansa have peaks on the dates of the AlphaBay and Hansa closures, respectively. Interestingly, a week after the Hansa closure, Dream Market and the subreddits DreamMarket and DreamMarketDarknet had their highest values, which suggests that users of these two big marketplaces, AlphaBay and Hansa, migrated to Dream Market.

To check whether the forums responded to these events by changing their topics of discussion, we compute cross-entropy. As seen in Figure 3.3(b), two relevant forums, one about AlphaBay and one about DreamMarket, experience a change in their topic vectors after the shutdown (2017 week 27, or 07-04-2017). The cross-entropy increases around that time, suggesting growing difference from the aggregate topic distribution for their respective forums; however, changes in the volume of posts were more significant. We also observe state transitions in the Dream Market forum on all three dates (the AlphaBay and Hansa closures and the week after, when there is a peak in activity for Dream Market related forums).
Figure 3.4: Overview of the modeling framework. Sensor data collected from participants A and B is fed into the BP-AR-HMM model, which outputs an HMM per participant, where states are shared among participants. Output from the BP-AR-HMM model is used to learn embeddings, which are later used to predict personal attributes.
3.3 Behavioral Representations from Wearable Sensors

Continuous collection of multimodal physiological data from wearable sensors enables temporal characterization of individual behaviors. Understanding the relation between an individual's behavioral patterns and psychological states can help identify strategies to improve the work environment and reduce stress. This can be especially beneficial in hospital workplaces, where job-related stress often leads to burnout; correct policies could improve the quality of patient care. One challenge in analyzing physiological data is extracting the underlying behavioral states from the temporal sensor signals and interpreting them. Here, we use the non-parametric HMM described in Section 3.1 to model multivariate sensor data from multiple people and discover the dynamic behaviors that they share. We apply this method to data collected from sensors worn by a population of hospital workers and show that the learned states can cluster participants into meaningful groups and better predict their cognitive and psychological states. This method offers a way to learn interpretable, compact behavioral representations from multivariate sensor signals. Figure 3.4 shows an overview of this framework.
3.3.1 Data

The data used in this work comes from the TILES (Tracking Individual Performance with Sensors) study [101]. TILES is a study of workplace well-being that measures the physical activity and physiological states of hospital workers. The study recruited over 200 volunteers for ten weeks from among the employees of a large urban hospital. Participants were 31.1% male and 68.9% female and ranged in age from 21 to 65 years. Participants held a variety of job titles: 54.3% were registered nurses, 12% were certified nursing assistants, and the rest held some other job title, such as respiratory therapist, technician, etc. Participants wore the sensors for different numbers of days, depending on the number of workdays during the study. Furthermore, participants exhibited varying compliance rates, with a few participants forgetting to wear their sensors on some days. Hence, the length of the collected data varies across participants. For this paper, we focused on the 180 participants from whom at least 6 days of data was collected. In addition to wearing sensors, participants were also asked to complete surveys prior to the study. These pre-study surveys measured cognitive ability, personality, affect, and health states, which serve as ground truth constructs for our study. The constructs are shown in Table 3.3. Data used in this paper was collected from a suite of wearable sensors produced by OMSignal Biometric Smartwear. These OMSignal garments include sensors embedded in the fabric that measure physiological data in real time and can relay this information to the participant's smartphone. Table 3.4 shows the sensor signals that we use in this work.
3.3.2 Measuring Distance

When applied to multivariate physiological signals, the generative model described in Section 3.1 learns a hidden Markov model for each participant. We use the learned HMMs to identify individuals with similar behaviors. We propose two different methods for measuring the distance between HMMs. One distance measure is the Likelihood Distance described in Section 3.2.4. As mentioned in the data section, the length of the collected data varies across participants, and under the Likelihood Distance, longer time series would automatically have a smaller likelihood. Hence, we normalize the likelihoods by dividing them by $(1/K)^L$, $K$ being the number of states and $L$ the length of the time series.
Table 3.3: Ground truth constructs collected during pre-study surveys.

Name | Description | Instrument
ITP | Job performance | [59]
IRB | In Role Behavior | [147]
IOD-ID | Counter-productive work behavior | [20]
IOD-OD | Counter-productive work behavior | [20]
OCB | Organizational Citizenship Behavior | [139]
Shipley Abstraction | Cognitive ability | [126]
Shipley Vocabulary | Cognitive ability | [126]
NEU | Personality: Neuroticism | [56]
CON | Personality: Conscientiousness | [56]
EXT | Personality: Extraversion | [56]
AGR | Personality: Agreeableness | [56]
OPE | Personality: Openness | [56]
POS-AF | Positive affect | [145]
NEG-AF | Negative affect | [145]
STAI | Anxiety | [130]
AUDIT | Alcohol Use Disorders Identification Test | [119]
IPAQ | Physical activity | [92]
PSQI | Sleep quality | [28]
Health limit | Role limitations due to physical health problems | [144]
Emotional limit | Role limitations due to emotional problems | [144]
Well-being | Index of psychological well-being | [144]
Social Functioning | Index of social interaction ability | [144]
Pain | Index of physical pain | [144]
General Health | Index of general health | [144]
Life Satisfaction | Global life satisfaction | [40]
Perceived stress | Perceived stress indicator | [35]
PSY flexibility | Ability to adapt to situational demands | [113]
PSY inflexibility | Inability to adapt to situational demands | [113]
WAAQ | Work-related Acceptance and Action Questionnaire | [24]
Psychological Capital | PCQ, a measure of psychological capital | [89]
Challenge Stress | Challenge stress indicator (positive stress) | [111]
Hindrance Stress | Hindrance stress indicator (negative stress) | [111]
Signal | Features
Biometrics (6 features) | Heart Rate, R-R Peak Coverage, Avg. Breathing Depth, Avg. Breathing Rate, Std. Breathing Depth, Std. Breathing Rate
Movement (15 features) | Intensity, Cadence, Steps, Sitting, Supine, Avg. G Force, Std. G Force, Angle From Vertical, Low G Coverage, Avg. X-Acceleration, Std. X-Acceleration, Avg. Y-Acceleration, Std. Y-Acceleration, Avg. Z-Acceleration, Std. Z-Acceleration

Table 3.4: Extracted features from OMSignal.
3.3.2.1 Viterbi Distance:

The distance between different HMMs can also be computed with the Viterbi distance proposed in [43]. The Viterbi distance is defined as follows:
$$d_{\text{Vit}}(\lambda, \lambda') = \int_Y \frac{1}{L} \log\frac{p_{\lambda'}(Y, S_{\lambda'})}{p_{\lambda}(Y, S_{\lambda})}\, P_{\lambda}(Y)\, dY \qquad (3.5)$$
Here $Y$ is any possible time series, $L$ is the length of $Y$, and $p_{\lambda}(Y, S_{\lambda})$ is the joint probability of $Y$ and $S_{\lambda}$ (the state sequence of $Y$ given HMM $\lambda$), defined as
$$p_{\lambda}(Y, S_{\lambda}) = \max_{S}\, p_{\lambda}(Y, S) \qquad (3.6)$$
where $S$ is any possible state sequence. We use the same approximation used in [43] for Equation 3.5, which gives us:
$$d_{\text{Vit}}(\lambda, \lambda') = \sum_{i,j} a_{ij}\, \pi(i) \left(\log a'_{ij} - \log a_{ij}\right) \qquad (3.7)$$
Here $a_{ij}$ is the probability in the transition matrix of $\lambda$ and $\pi(i)$ is the probability of state $i$ in the stationary distribution of $\lambda$. (The stationary distribution is further explained in Section 3.3.3.) The Likelihood Distance computes the distance based on both the state sequences and the HMMs, where a state sequence is a sample of the HMM (generative) model. The other method, the Viterbi Distance, computes the values by comparing only the HMMs.
This makes the Viterbi distance less susceptible to noise observed in the state sequences, or in other words, less sensitive to small changes. This trade-off causes one method to perform better than the other depending on the targeted ground truth construct and its corresponding sensitivity to small variations in the data.
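A minimal sketch of Equation 3.7 follows, assuming the two HMMs share the global state space learned by the model; the eps smoothing is our addition.

```python
import numpy as np

def stationary(T):
    # Stationary distribution pi with pi = pi @ T
    # (left eigenvector of T for its largest eigenvalue).
    vals, vecs = np.linalg.eig(T.T)
    pi = np.real(vecs[:, np.argmax(np.real(vals))])
    return pi / pi.sum()

def viterbi_distance(T, T_prime, eps=1e-12):
    # Eq. 3.7: sum over i, j of a_ij * pi(i) * (log a'_ij - log a_ij).
    pi = stationary(T)
    return np.sum(T * pi[:, None] * (np.log(T_prime + eps) - np.log(T + eps)))
```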
3.3.3 Learning Representations

In this section we describe two methods for learning representations from the HMMs. The first method is interpretable and can be used for analyzing the data. The second method, however, gives better performance in predicting most of the constructs.
3.3.3.1 Stationary Representation:

Each HMM is defined by a transition matrix. The transition matrix gives the probability of transitioning from one state to another, so $z_t T_i$ is the probability distribution for $z_{t+1}$, and $\lim_{x \to \infty} z_t T_i^x$ is the probability distribution for $\lim_{t \to \infty} z_t$, which is the stationary distribution. No matter the starting state, the relative amount of time spent in each state is given by the stationary distribution, which is unique for each transition matrix and given by the left eigenvector of the transition matrix corresponding to its largest eigenvalue. We use these stationary distributions as features for classification and regression tasks with different models; this representation is denoted HMM-S.
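In code, the HMM-S feature matrix reduces to one eigenvector computation per participant; a minimal sketch (the function name is ours):

```python
import numpy as np

def hmm_s_features(transition_matrices):
    # One row per participant; each row is the stationary distribution of
    # that participant's transition matrix over the shared behavioral states.
    feats = []
    for T in transition_matrices:
        vals, vecs = np.linalg.eig(T.T)              # left eigenvectors of T
        pi = np.real(vecs[:, np.argmax(np.real(vals))])
        feats.append(pi / pi.sum())                  # normalize to a distribution
    return np.vstack(feats)
```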
3.3.3.2 Spectral Representation

A drawback of using the stationary distribution of the transition matrix to represent participants is that it does not capture the relations between behavioral states; hence, it might not be able to distinguish between participants with similar behaviors that are ordered differently. In order to capture these differences we use distances between participants. Specifically, we perform these steps:

1. Calculate the distance matrix between participants using either the likelihood distance or the Viterbi distance described in Section 3.3.2.

2. Compute the normalized Laplacian of the distance matrix.

3. Use the $K$ largest eigenvectors (i.e., eigenvectors corresponding to the largest eigenvalues) as representations of participants ($K$ is a hyperparameter).

This approach is similar to spectral clustering [141] methods. The intuition behind it is that the distance matrix can be interpreted as a weighted adjacency matrix of a network between individuals, and the eigenvectors of the graph Laplacian provide information about the components and possible cuts in the graph.
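The three steps above can be sketched as follows; this is a literal rendering of the description, assuming a symmetric participant-by-participant distance matrix D as input.

```python
import numpy as np

def spectral_representation(D, k=10):
    # Normalized Laplacian of the distance matrix, followed by its k largest
    # eigenvectors (one k-dimensional embedding row per participant).
    deg = D.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    L = np.eye(len(D)) - d_inv_sqrt[:, None] * D * d_inv_sqrt[None, :]
    vals, vecs = np.linalg.eigh(L)                   # L is symmetric
    order = np.argsort(vals)[::-1]                   # largest eigenvalues first
    return vecs[:, order[:k]]
```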
3.3.4 Results

A BP-AR-HMM with autoregressive lag 1 is used to model the temporal data collected from sensors worn by the 180 high-compliance participants in the study. For each participant, we constructed vectors representing the 21 features from the physiological and movement signals listed in Table 3.4. We used Z-scores to normalize the features from sensors. However, since some statistical features like the mean and variance are useful for predicting constructs like age, both normalized and unnormalized signals were used in the prediction tasks in Table 3.5.
The model identified 23 shared latent states describing participants' behavior. Some of the states were exhibited by only a few participants. These rare states could convey useful information that helps identify noise or anomalies in the collected data; however, their sparseness is not beneficial to the prediction and clustering tasks. Therefore, we ignore states observed in fewer than 5% of the time series. For example, one of these states exhibits a constant heart rate, which indicates a sensor malfunction.

Clustering: To validate the states learned by the model, we apply hierarchical agglomerative clustering on the distance matrix generated with the likelihood distance. The resulting dendrogram is shown
Figure 3.5: Dendrogram showing the similarity of participants based on their learned states.
in Figure 3.5. We used responses gathered from participants during the pre-study surveys and performed statistical tests to evaluate differences between the clusters. We partitioned the dendrogram into clusters with more than five members by cutting the dendrogram horizontally at different depths. Based on the p-values obtained, the most important features differentiating the branches (clusters) were job type, age, and gender, in that order. This was aligned with our expectations, since different job types require different activities, and age and gender affect physiological signals.
The first cut point (marked 1 in Figure 3.5) separates registered nurses from other job types. The main difference between the two clusters (red and blue) is the frequency of three latent states, which we call A, B, and C. Variables related to acceleration and movement are almost zero for state A, which has a higher frequency for participants in cluster 3 (the non-nurse cluster). State B is more representative of higher activity levels and is more frequent for participants in cluster 2 (the nurses). This is also aligned with our expectations, since the nursing occupation requires more activity compared to other job types in the study. State C mostly captures the flexibility of work hours for non-nurses (non-nurse participants are more likely to finish their shifts earlier and have less than 12 hours' worth of data in one shift). This clustering also separates participants based on their work shifts, i.e., whether they work day or night shifts. Work shifts are distinguished by state D. In this state, the binary supine signal, which is activated when the participant is lying down, is on. It appears that state D captures quick naps in the workplace and has a higher frequency for night shift participants. Participants who exhibit state D are shown in yellow in Figure 3.5.
Figure 3.6: Bipartite graphs with a subset of constructs from Table 3.3 and a subset of states. In (a) each construct is connected to the two states whose regression coefficients are the highest (i.e., strongest positive relationship); in (b) each construct is connected to the two states with the lowest negative coefficients (i.e., strongest negative relationship).
We use the learned representations for each participant as features to predict the ground truth constructs. The objective is two-fold: not only do we want to predict, but also to gain understanding of what the latent states represent.

Qualitative Results: A possible way to understand latent behaviors is to quantify their importance in explaining constructs. The stationary representation described in Section 3.3.3.1 has a clear interpretation, with each dimension representing the percentage of time spent in the corresponding state. We explain the behavioral states using the stationary representation with the following process:

1. Get the stationary representations of participants.

2. Run classification/regression on the representations to predict each construct.

3. Retrieve the learned coefficients. Each coefficient corresponds to one dimension of the embedding, which represents a behavioral state.

4. Select the states with the highest positive and lowest negative coefficients and interpret these states based on their relation to the targeted construct.
Figure 3.6 shows a subset of constructs and the states that best predict them. Based on this approach, we recognized state D, described earlier, as the most relevant state for differentiating between day and night shift employees. State D also has a high positive coefficient in predicting POS-AF and Well-being, whereas for hindrance stress it has a high negative coefficient. Hindrance stress is generally perceived as a type of stress that prevents progress toward personal accomplishments. Thus a plausible interpretation of this result is: quick breaks during work hours could increase positive affect and well-being and decrease hindrance stress. Similarly, as shown in Figure 3.6, state N, which has a large positive coefficient in predicting hindrance stress, has a large negative coefficient in predicting positive affect, well-being, life satisfaction, and extraversion.
Quantitative Results: For predicting constructs, we obtained stationary representations(HMM-S)
and spectral representations using both distance measures (likelihood (HMM-SL) and Viterbi (HMM-SV)
distance). Spectral representation requires a hyperparameter K, the number of eigenvectors to include in
the representation. We set the K to 10, 20,:::, 100 and use it to train the regression model. Since there
are many ground truth constructs, one model can not be chosen for all the regression tasks. We use the
ridge, kernel ridge and random forest regression on proposed representations and baselines and report
the best model. The results are reported in correlation to the target construct () and Root Mean Squared
Error (RMSE) using Leave-one-out cross validation. We compare our results against Random Warping
Series (RWS)[149] and Parafac2 [61]. RWS generates time series embeddings by measuring the similarity
between a number of randomly generated sequences and the original sequence. This method has three
hyper-parameters. Based on authors suggestion, we xed the dimension of the embedding space to 512,
and experimented with few dierent values for the other two parameters. The second baseline we use is
Parafac2 [61]. This approach views the data as a tensor (3 dimensional array) of participants-sensors-time
34
and decomposes it into hidden components. For Parafac2 the number of hidden components is a hyper-
parameter. We varied the number of hidden components from one to ten and report the best results.
Overall, the results of the predictions based on the HMM's latent states were systematically better, outperforming the baseline methods in 25 of the 27 constructs predicted. It is worth mentioning that, except for HMM-S, which is non-parametric, all four other models in Table 3.5 have hyperparameters that need to be set. We tune the hyperparameters by running 10 different settings and selecting the setting with the best results. Between our own representations (HMM-S, HMM-SL, HMM-SV), HMM-SV performs better for some constructs while HMM-SL gives better results for others. This could be because of differences between the Viterbi and likelihood distances' sensitivity to small variations in the data (discussed in Section 3.3.2). Also, HMM-S is not a good representation for prediction and is better suited for analysis of the data.
Table 3.5: Evaluation of the model on the construct prediction task. Each cell reports the correlation with the target construct (ρ) and the RMSE.

Construct | HMM-S ρ / RMSE | HMM-SL ρ / RMSE | HMM-SV ρ / RMSE | RWS ρ / RMSE | Parafac2 ρ / RMSE
ITP | -0.729 / 0.493 | 0.073 / 0.488 | 0.288 / 0.469 | -0.143 / 0.494 | 0.105 / 0.487
IRB | -0.481 / 4.166 | 0.193 / 4.066 | 0.203 / 4.055 | -0.146 / 4.18 | 0.265 / 4.014
IOD-ID | -0.728 / 5.185 | 0.11 / 5.124 | 0.136 / 5.108 | -0.534 / 5.22 | -0.709 / 5.182
IOD-OD | -0.326 / 6.917 | 0.202 / 6.715 | 0.244 / 6.634 | -0.168 / 6.891 | -0.01 / 6.864
OCB | 0.168 / 12.035 | 0.167 / 12.035 | 0.245 / 11.884 | 0.112 / 12.159 | 0.215 / 11.986
Shipley abstract | 0.178 / 3.732 | 0.085 / 3.777 | 0.179 / 3.756 | 0.148 / 3.764 | 0.314 / 3.603
Shipley vocabulary | 0.26 / 4.713 | 0.085 / 4.841 | 0.213 / 4.748 | 0.297 / 4.643 | 0.399 / 4.486
NEU | 0.066 / 0.726 | 0.159 / 0.718 | 0.174 / 0.722 | 0.048 / 0.728 | 0.116 / 0.724
CON | -0.165 / 0.62 | 0.245 / 0.591 | 0.181 / 0.6 | -0.033 / 0.613 | 0.093 / 0.612
EXT | 0.154 / 0.655 | 0.152 / 0.659 | 0.264 / 0.642 | 0.178 / 0.65 | 0.038 / 0.66
AGR | -0.428 / 0.491 | 0.122 / 0.485 | 0.191 / 0.479 | 0.079 / 0.488 | 0.099 / 0.488
OPE | 0.224 / 0.586 | 0.217 / 0.581 | 0.28 / 0.571 | 0.216 / 0.585 | -0.386 / 0.598
POS-AF | 0.37 / 6.547 | 0.254 / 6.614 | 0.231 / 6.686 | 0.139 / 6.821 | 0.112 / 6.822
NEG-AF | -0.278 / 5.293 | 0.235 / 5.139 | 0.206 / 5.195 | 0.045 / 5.286 | 0.139 / 5.238
STAI | 0.016 / 8.975 | 0.196 / 8.817 | 0.112 / 8.919 | 0.128 / 8.912 | 0.095 / 8.966
AUDIT | 0.1 / 2.159 | 0.362 / 2.017 | 0.153 / 2.142 | 0.053 / 2.169 | 0.244 / 2.113
IPAQ | -0.57 / 15352 | 0.094 / 15191 | 0.115 / 15316 | 0.033 / 15311 | 0.097 / 15246
PSQI | -0.682 / 2.366 | 0.178 / 2.318 | 0.142 / 2.33 | 0.193 / 2.322 | 0.194 / 2.311
Age | 0.461 / 8.613 | 0.091 / 9.662 | 0.084 / 9.667 | 0.243 / 9.406 | 0.363 / 9.035
Health Limit | -0.75 / 23.284 | 0.196 / 22.704 | 0.333 / 21.986 | 0.222 / 23.325 | 0.118 / 23.264
Emotional Limit | -0.704 / 22.71 | 0.211 / 22.102 | 0.164 / 22.504 | 0.042 / 22.652 | 0.091 / 22.553
Well being | 0.077 / 18.458 | 0.152 / 18.302 | 0.276 / 17.904 | 0.011 / 18.682 | 0.167 / 18.277
Social Functioning | 0.057 / 21.94 | 0.109 / 21.684 | 0.191 / 21.547 | 0.085 / 21.857 | 0.218 / 21.541
Pain | 0.167 / 18.613 | 0.134 / 18.448 | 0.239 / 18.164 | 0.023 / 18.658 | 0.102 / 18.571
General Health | 0.211 / 17.062 | 0.27 / 16.792 | 0.171 / 17.28 | 0.151 / 17.311 | 0.2 / 17.105
Life Satisfaction | -0.655 / 1.354 | 0.106 / 1.338 | 0.22 / 1.317 | -0.125 / 1.362 | 0.207 / 1.317
Perceived Stress | 0.196 / 0.511 | 0.201 / 0.51 | 0.209 / 0.511 | 0.195 / 0.513 | -0.728 / 0.524
PSY flexibility | -0.793 / 0.821 | 0.187 / 0.806 | 0.233 / 0.795 | -0.077 / 0.823 | 0.103 / 0.813
PSY inflexibility | -0.66 / 0.803 | 0.182 / 0.785 | 0.152 / 0.79 | -0.013 / 0.803 | 0.006 / 0.8
WAAQ | 0.31 / 5.65 | 0.284 / 5.705 | 0.205 / 5.833 | 0.153 / 5.878 | 0.163 / 5.866
Psychological Capital | 0.188 / 0.656 | 0.129 / 0.661 | 0.17 / 0.662 | 0.12 / 0.662 | 0.08 / 0.664
Challenge Stress | -0.639 / 0.622 | 0.171 / 0.615 | 0.078 / 0.62 | -0.097 / 0.623 | -0.789 / 0.621
Hindrance Stress | 0.132 / 0.644 | 0.005 / 0.646 | 0.206 / 0.633 | 0.035 / 0.647 | 0.143 / 0.637
3.4 Identifying Atypical Life Events

Everyday life events can dramatically affect our well-being, both positively and negatively. Stressful job-related events, for example, have been linked to professional dissatisfaction, increased anxiety, and workplace burnout. Traditional health assessment tools rely on time-consuming and often expensive surveys. Wearable sensors, on the other hand, have proven effective at unobtrusively monitoring health-related factors. In this work, we demonstrate that wearable sensors, paired with the proposed embedding-based learning models, can be used "in the wild" to capture atypical life events in two distinct populations, and that such events can affect, over both short- and long-term horizons, one or more constructs related to individual well-being. As organizations prepare their workforce for changing job demands, worker wellness has emerged as an important focus. Organizations see worker wellness as central to their mission to develop a healthy and productive workforce while also maintaining optimal job performance. These goals are especially important in high-stakes jobs, such as healthcare providers working at hospitals, where job-related stress often leads to burnout and poor performance [58, 14, 72] and is one of the most costly modifiable health issues in the workplace [53]. An additional challenge faced by workers is balancing demanding jobs with equally stressful events in their personal life. Adverse events, such as attending a funeral, the death of a pet, or the illness of a family member, may amplify worker stress, potentially negatively impacting job performance. On the other hand, positive life events, such as getting a raise, getting engaged, or spending a day off work with a loved one, may decrease stress and improve well-being. The ability to detect such atypical life events in their workforce will help organizations better balance worker tasks to reduce stress, burnout, and absenteeism and improve job performance.
The data used in this work comes from the study described in Section 3.3.1 and from a similar study of aerospace industry workers who wore sensors and reported Ecological Momentary Assessments (EMAs) over the course of a 10-week period. Workers also reported whether they had experienced an atypical event. The data allows us to use difference-in-difference analysis, a type of causal inference method [140], to measure the effect of atypical events, either positive or negative life events, on individual psychological states and well-being. We find that negative life events increase self-reported stress, anxiety, and negative affect, while decreasing positive affect, over multiple days. Positive life events, meanwhile, have little effect on stress, anxiety, and negative affect, but boost positive affect on the day of the event. Negative atypical events have a greater impact on workers' psychological states than positive events, in line with previous findings [16].
In addition to measuring the effects of atypical events, we show that it is possible to detect these events from a non-invasive wristband sensor. We discover that, although changes in individual psychological constructs are difficult to detect, atypical events are amenable to detection because they jointly affect several constructs. We propose a method that learns a representation of multimodal physiological signals from sensors by embedding them in a lower-dimensional space. The embedding provides features for classifying when atypical events occur.

There exists extensive research on how sensors can be used to detect patterns and changes in human behavior [42, 143], including psychological constructs such as stress, anxiety, and affect (c.f., the literature review of wearable sensors [11]). For example, they can detect whether workers [131] or students [116] are stressed, at the minute level (c.f., cited literature in [29]). Previous research has focused on short time intervals (up to two weeks) and very small sample sizes (on the order of tens of subjects). Additionally, previous literature has typically detected very short-term stresses (e.g., stresses that affect people that minute) rather than longer-term stresses, which affect people over multiple days. Our work differs from these previous studies through evaluation over several weeks of hundreds of subjects, allowing us to robustly uncover effects in diverse populations. Moreover, we uncover patterns associated with unusually good or bad events that can affect multiple psychological constructs over multiple days.
3.4.1 Data

In this work, data collected from Fitbit wristbands was used. Although other sensor data was collected during each study, including location data and audio and environmental features, we focus on this modality since it was common to both studies. The Fitbit wristband captures dynamic heart rate and step count. It also offers a summary report of the duration and quality of sleep for each day. For the embedding approach, we only used the signals extracted from Fitbit (heart rate and steps), but for the aggregated method we also included the static summary features. We used data only from participants who had at least two days' worth of data and one day marked as an atypical day. This brings the hospital data to 8,155 days for 150 participants and the aerospace data to 10,057 days for 207 participants. The amount of data available from each day also varies and depends on the amount of time the participant wore the wristband. Although most participants (90% in the hospital dataset and 89% in the aerospace dataset) had the wristband on for the full day (24 hours), there are instances with only five hours' worth of data in a day.
The data used for this study includes daily self-assessments of psychological states provided by subjects over the course of the study. Stress and anxiety were measured by responses to questions that read, "Overall, how would you rate your current level of stress?" and "Please select the response that shows how anxious you feel at the moment", respectively, with a range of 1–5. Positive and negative affect were measured based on 10 questions from [91] (five questions for measuring positive affect and five for measuring negative affect). In addition to these constructs, subjects were also asked if they had experienced, or anticipated experiencing, an atypical event: "Have any atypical events happened today or are expected to happen?". If subjects replied yes, they had the option to add free-form text describing the atypical event. In the hospital data, there are 8,155 days of data, of which 958 days had atypical events (11.7%). In comparison, the aerospace data has 10,057 days of data, of which 1,503 were considered atypical (14.9%).
We have access to the free-form text in the hospital data, which covers the vast majority of atypical events (87%). Surprisingly, the severity of an event could not be easily gleaned from sentiment analysis tools such as VADER [70], as these tools gave neutral sentiment to text samples that were clearly negative. For example, text like "at a funeral" is given zero sentiment by VADER. We therefore applied a protocol, using human annotators, to categorize text as significant negative events (such as deaths or injuries of loved ones), minor negative events (such as being stuck in traffic), or positive events (such as promotions). Of all categorized atypical events, 210 (24%) were positive, 626 (71%) were minor negative events, and 39 (4.5%) were major negative events.
3.4.2 Causal Inference Method

The text descriptions of many atypical events in the hospital data mention sudden and unexpected events, such as an injured family member or unusually heavy traffic. We can therefore conjecture that atypical events create an as-if random assignment of any given subject over time. This is not always true, as in the case of subjects who report being on vacation for multiple days, or who are at different stages of burying a loved one. These are, however, rare instances. To determine the effect of atypical events on subjects, we use a difference-in-difference approach to causal inference. Specifically, we look at all subjects who report an atypical event and then look at the subset who filled out their daily survey the prior day and reported stress, anxiety, negative affect, or positive affect. This is usually the majority of all events (>83%). We subtract their self-reported constructs from the day before the event. If subjects report construct values after the event (which is usually the case), we report the difference between these values and the day prior to the atypical event. We contrast these measurements with a null model, in which subjects did not report an atypical event on the same days that other subjects reported an atypical event, and find the change in their construct values from the prior day. This null model shows very little change in constructs over
Figure 3.7: Effect of atypical events among the datasets studied. (a) Positive affect, (b) negative affect, (c) stress, and (d) anxiety, each plotted as the change in the construct against days between the event. Green squares show the aerospace dataset, red diamonds show the hospital dataset, and gray circles are the null models, in which we collect sequential data from subjects who do not experience an atypical event at day zero.
consecutive days, in agreement with expectation. The difference between the construct values associated with the event and the null model is the average treatment effect (ATE).
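A minimal sketch of this difference-in-difference computation follows, assuming a daily table with hypothetical columns subject, day, atypical (a boolean), and one construct score per column; gaps between consecutive surveys are ignored for brevity.

```python
import pandas as pd

def average_treatment_effect(df, construct):
    # Change in the construct from the prior day, for event days versus
    # no-event days (the null model); the gap between the two is the ATE.
    df = df.sort_values(["subject", "day"]).copy()
    df["delta"] = df.groupby("subject")[construct].diff()
    treated = df.loc[df["atypical"], "delta"].mean()
    null = df.loc[~df["atypical"], "delta"].mean()
    return treated - null
```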
3.4.3 Causal Effect of Atypical Events

We apply the difference-in-difference approach to measure the impact of atypical events on self-reported psychological constructs. We first look at the effect of atypical events across all our datasets, as shown in Fig. 3.7. Atypical events, on average, have a relatively small effect on positive affect the day of the event (difference from null = 0.09, 0.33; p-value = 0.6, 0.009, for hospital and aerospace data, respectively). We notice a decrease in positive affect from the day of the event to the day after the event (difference = -0.54, -0.55; p-values = 0.0015, 0.017 for aerospace and hospital data, respectively). On the other hand, there is a substantial increase in negative affect, stress, and anxiety (p-values < 0.001), although the changes are smaller in the aerospace dataset. The free-text descriptions that subjects provided about atypical events they experienced (only available in the hospital data) confirm these results. Most atypical events are negative, such as a fight with a spouse, traffic, or deaths. In a minority of cases, however, subjects report positive events, such as passing a test or a promotion. For the hospital data, we categorized atypical events as positive, minor negative, or major negative events, and determined the relative effect each has
Figure 3.8: Effect of atypical events versus severity of event. (a) Positive affect, (b) negative affect, (c) stress, and (d) anxiety, each plotted as the change in the construct against days between the event. Green squares are positive events, white triangles are minor negative events, red diamonds are major negative events, and gray circles are the null models. In the null models we collect sequential data from subjects who do not experience an atypical event at day zero.
on subjects, as shown in Fig. 3.8. We find that, as expected, positive events increase positive affect (p-value = 0.009), but have no statistically significant effect on negative affect, stress, or anxiety (p-value ≥ 0.3). Minor negative events do not substantially change positive affect on the day of the event (difference from null = 0.15, p-value = 0.57), and have a small effect on positive affect the day after the event (difference from null = -0.42, p-value = 0.04). On the other hand, they significantly increase negative affect, anxiety, and stress (p-value < 0.001). Finally, major negative events decrease positive affect both on the day of the event and the day after the event (p-value = 0.005, 0.03, respectively). These results point to the strong diversity of atypical events, and support the idea that "bad is stronger than good" [16]: adverse, or negative, events have a stronger effect on people than positive events, and are reported as atypical events more often.
3.4.4 Detecting Atypical Events

We detect atypical events by embedding individuals' physiological data into a multi-dimensional space. We then train models to identify where in this space atypical events are more likely to happen. We use the framework proposed in Section 3.3 (also shown in Figure 3.9) to embed physiological signals. For this work
Figure 3.9: Overview of the modeling framework. Sensor data collected from participants A and B (left two panels) is fed into a non-parametric HMM model that outputs state sequences (middle panel). Output from the HMM model is used to learn embeddings for each day of each participant (right panel). The daily embeddings (lighter and darker-colored circles) and the average embedding for each participant (hashed circles) are used as features. These features, and daily atypical event labels, are then fed into an SVM classifier to predict whether any given day is atypical.
we use the stationary representation described in Section 3.3.3.1. Each dimension of the embedding can be described as an activity performed by some subjects, such as exercising, working, resting, etc. Each person can therefore be represented by the percentage of time they spend on those activities (dimensions), which gives us a sense of each subject's routines and characteristics. We compare the detection results of features based on the learned embeddings with models based on features from statistics of aggregated data. We create several aggregate statistics for the aggregated-data model based on both the signals and the static modalities collected from Fitbit. These statistics include the sum, mean, median, variance, kurtosis, and skewness of sensor features from the day before, the day of, and the day after each day used to detect atypical events. We include features before and after the day of the atypical event because different physiological features may precede an atypical event compared to normal events, and features may be different the day after an atypical event compared to a normal day. We then use Minimum Redundancy Maximum Relevance (MRMR) feature selection [138] on the entire dataset to select features (23 and 26 for the aerospace and hospital data, respectively). Typical features in the hospital data relate to sleep (for example, the top feature was tomorrow's minutes in bed). Typical features in the aerospace dataset tend to relate to heart rate (the top feature was the number of minutes in the "fat burn" heart rate zone in the past day).
When we model data using the HMM embedding, we only use the signal modalities (heart rate and step counts). Representations from HMMs were learned for the day of, and the day after, any datapoint. We also include the centroid of the embeddings for each person as features, to control for subject-specific differences in behavior. We did not use any additional feature selection, because the embedding naturally restricts the dimensionality of the features.
We evaluate performance on three classification tasks using sensor data: (1) detecting whether an atypical event occurred on that day; (2) detecting whether subjects experienced a good day; and (3) detecting whether subjects experienced a bad day. For (2) and (3) the classification label was "1" if subjects experienced a good or bad day, respectively, and "0" otherwise. Hence we simplify all tasks into binary detection tasks. We emphasize that the last two tasks are only available for the hospital data.
We use ten-fold cross-validation. Performance metrics are averaged across all held-out folds. Two types of experiments are presented. In the first set (Table 3.6), datapoints are split at random. In the second set (Table 3.7), subjects are split into training and testing sets to approximate a cold-start scenario, where the model is trained on one cohort of subjects and tested on another cohort. The challenge of the latter detection task is that we need to classify whether a subject has a good or bad day despite not being trained on any previous data from that subject. We use three performance metrics for evaluation: area under the receiver operating characteristic curve (ROC-AUC), F1 score, and precision. These metrics are chosen because the data is highly imbalanced.

The majority class (no atypical event) is downsampled such that the numbers of datapoints in each class are equal, which improves the models over using the raw data or upsampling the minority class. Based on ten-fold cross-validated F1, ROC-AUC, and precision, we find atypical events in the hospital dataset are best modeled with random forests, while the aerospace workforce dataset is best modeled with logistic regression. In comparison, positive events are best modeled with random forests but negative events are best modeled with extra trees. Finally, for the embedding features, we used the SVM classifier with no downsampling, which follows the original work [134].
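A minimal sketch of this evaluation loop follows; X_agg and y_event are synthetic stand-ins for the selected aggregate features and daily event labels, and the helper function name is ours.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

def downsample_majority(X, y, seed=0):
    # Randomly drop majority-class (no-event) days until classes are balanced.
    rng = np.random.default_rng(seed)
    pos, neg = np.where(y == 1)[0], np.where(y == 0)[0]
    keep = np.concatenate([pos, rng.choice(neg, size=len(pos), replace=False)])
    return X[keep], y[keep]

X_agg = np.random.rand(1000, 26)                      # synthetic feature matrix
y_event = (np.random.rand(1000) < 0.12).astype(int)   # ~12% atypical days

X_bal, y_bal = downsample_majority(X_agg, y_event)
scores = cross_validate(RandomForestClassifier(), X_bal, y_bal, cv=10,
                        scoring=["roc_auc", "f1", "precision"])
```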
We present the first set of results in Table 3.6. We find that the HMM embedding-based model outperforms the other models. The ROC-AUC for the HMM-based model is 0.60 for the aerospace workforce and 0.66 for the hospital workforce. Positive and negative events similarly have an ROC-AUC of 0.61–0.63. F1 and precision exceed random baselines by factors of two to nine. The seemingly low F1 and precision are due to the rarity of atypical events, especially positive events, which only happen on 3% of days, and negative events, which only happen on 8% of all days. A detection therefore represents a "warning sign" that a worker may have had a negative event that day. Overall, detecting atypical events shows promise.
Alternatively, a model may be trained on one set of subjects and tested on another (the cold-start scenario). These results are presented in Table 3.7. Atypical events can be detected 91–220% above baselines based on F1 score, but the results are more modest than in Table 3.6, with a reduction in ROC-AUC from 0.66 to 0.58 for hospital atypical events. These results are similar to other recent papers, which split subjects into training and testing sets and found relatively poor model performance (c.f., [127]). These results suggest that models will perform best when personalized to subjects or when transfer learning methods are developed for these data.

In conclusion, we discover that atypical events and negative events substantially increase stress, anxiety, and negative affect. Major negative events are found to reduce positive affect over multiple days, while positive events improve positive affect that day. We also demonstrate that wearable sensors can provide important clues about whether someone is experiencing a positive or negative event. We find atypical events can be predicted with an ROC-AUC of up to 0.66 with relatively little hyperparameter tuning. Further improvements in predicting atypical events are therefore possible. These results point to the importance and detectability of atypical events, which offer hope for remote sensing and automated interventions in the future.
Dataset | Construct | Model | ROC-AUC | F1 | Precision
Hospital workforce | Atypical Event | Random | 0.50 | 0.12 | 0.12
Hospital workforce | Atypical Event | Aggregated | 0.57 | 0.24 | 0.15
Hospital workforce | Atypical Event | Embedding | 0.66 | 0.37 | 0.32
Hospital workforce | Positive Event | Random | 0.50 | 0.03 | 0.03
Hospital workforce | Positive Event | Aggregated | 0.63 | 0.08 | 0.04
Hospital workforce | Positive Event | Embedding | 0.62 | 0.27 | 0.30
Hospital workforce | Negative Event | Random | 0.50 | 0.08 | 0.08
Hospital workforce | Negative Event | Aggregated | 0.57 | 0.17 | 0.10
Hospital workforce | Negative Event | Embedding | 0.61 | 0.27 | 0.24
Aerospace workforce | Atypical Event | Random | 0.50 | 0.15 | 0.15
Aerospace workforce | Atypical Event | Aggregated | 0.59 | 0.31 | 0.21
Aerospace workforce | Atypical Event | Embedding | 0.60 | 0.32 | 0.36

Table 3.6: Performance of atypical event detection from sensors in both datasets with randomly sampled cross-validation.
Dataset | Construct | Model | ROC-AUC | F1 | Precision
Hospital workforce | Atypical Event | Random | 0.50 | 0.12 | 0.12
Hospital workforce | Atypical Event | Aggregated | 0.55 | 0.22 | 0.14
Hospital workforce | Atypical Event | Embedding | 0.56 | 0.23 | 0.16
Hospital workforce | Positive Event | Random | 0.50 | 0.03 | 0.03
Hospital workforce | Positive Event | Aggregated | 0.57 | 0.065 | 0.035
Hospital workforce | Positive Event | Embedding | 0.58 | 0.08 | 0.05
Hospital workforce | Negative Event | Random | 0.50 | 0.08 | 0.08
Hospital workforce | Negative Event | Aggregated | 0.57 | 0.15 | 0.09
Hospital workforce | Negative Event | Embedding | 0.56 | 0.16 | 0.10
Aerospace workforce | Atypical Event | Random | 0.50 | 0.15 | 0.15
Aerospace workforce | Atypical Event | Aggregated | 0.58 | 0.30 | 0.20
Aerospace workforce | Atypical Event | Embedding | 0.54 | 0.25 | 0.17

Table 3.7: Performance of atypical event detection from sensors in both datasets with subject held-out detection.
Chapter 4

Learning Behavioral Patterns with Byte Pair Encoding

As mentioned in the Introduction (Section 1), the research in this thesis is broken down into three research questions. The second research question is:

How can we identify dependencies in temporal data at multiple temporal scales?

To answer this question, I propose the method Pattern Discovery with Byte Pair Encoding (PD-BPE) [135]. This method is able to generate more accurate representations than the non-parametric HMM approach discussed in the previous chapter. This method is also much more scalable and efficient. And although it is not non-parametric, it only has a few parameters that need to be tuned.

In this chapter a detailed description of PD-BPE is given, accompanied by a set of experiments that show its advantages over previous works.
4.1 Pattern Discovery in Time Series with Byte Pair Encoding

Rapid technological advances have enabled continuous collection of physiological and activity data from wearable sensors in the form of time series. Such data offers new opportunities to quantify and characterize human behavior, monitor health, and assess psychological well-being in real time [12].

However, when it comes to modeling multiple time series, there are many different variants of this problem, each with its own set of challenges. These variants include time series with different lengths, multiple modalities (multivariate time series), missing data, noise, etc. Because of these challenges, many methods put constraints on the input data and can be used only on a subset of problems. Data collected from wearable sensors usually has all of the characteristics mentioned above; hence, few methods can be applied to it.
Explainability and interpretability are very helpful and important properties of machine learning approaches. Having interpretable features is necessary in applications like healthcare, where doctors and physicians need to understand the reasoning behind a decision made by the model. However, most state-of-the-art methods lack these properties, especially methods that rely on neural networks, which are mostly black-box models.

In this work we propose an unsupervised, explainable, and interpretable method for learning representations from time series. This method is scalable, can handle different variants of time series datasets, has low computational complexity, and beats state-of-the-art methods on a real-world dataset. Our proposed approach extracts common patterns from the data by discretizing the time series and using the Byte Pair Encoding (BPE) compression technique. Afterwards, series in the dataset are represented by the observed frequencies of the identified patterns. The identified patterns are interpretable and can be of different lengths. Thus, the method can capture both long-term and short-term dependencies present in the data.
4.1.1 Method

Figure 4.1 shows an overview of our proposed method. The data is (1) transformed with PAA, (2) discretized, and (3) transformed into multiple variations; (4) patterns are extracted from each variation; and (5) features generated from the different variations are combined and post-processed to output the final representation. In the following sections each of these steps is described in detail. We first explain the method for univariate time series, then describe how it can be extended to multivariate series.
Figure 4.1: Overview of the method. The data is (1) transformed with PAA, (2) discretized, and (3) transformed to multiple variations; (4) patterns are extracted from each variation; (5) features generated from the different variations are combined and post-processed to output the final representation.
4.1.1.1 PAA transformation
Each time series is first normalized, i.e., transformed to have zero mean and a standard deviation of one. The same normalization step is applied in the original SAX paper [82], as it is shown in [76] that time series with different offsets and amplitudes cannot be correctly compared with each other.
The normalized time series are then transformed using PAA. PAA reduces the dimension of the time series by splitting them into segments of length W (a hyperparameter of the method) and averaging the values within each segment. Since PAA smooths the series, it also helps mitigate the effects of noise and outliers in the data. In the left plots of Figure 4.2, the blue lines (also shown in the right plots) are the PAA-transformed versions of the orange-colored time series.
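As a concrete illustration, the sketch below implements normalization followed by PAA. The function name and the choice to drop a trailing partial segment are assumptions of this sketch, since the text does not specify how segments that do not divide the series evenly are handled.

import numpy as np

def paa_transform(series, w):
    # Z-normalize: zero mean, unit standard deviation
    x = np.asarray(series, dtype=float)
    x = (x - x.mean()) / x.std()
    # Average non-overlapping windows of length w; a trailing partial
    # window is dropped for simplicity (an assumption of this sketch)
    n = len(x) // w
    return x[:n * w].reshape(n, w).mean(axis=1)

# Example: a 10-point series reduced to 5 PAA values with W = 2
print(paa_transform([1, 2, 3, 5, 8, 8, 5, 3, 2, 1], w=2))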
4.1.1.2 Discretization
For discretization, outliers are first detected using the Inter-Quartile Range (IQR) [142]. After setting the outliers aside, the values are binned (discretized) into equal-width bins. Both outlier detection and binning are based on the entire dataset, as opposed to each time series independently. Binning the values based on the entire dataset helps better capture the differences between series. For example, if the data comes from two participants with different levels of physical activity, the discretized data should reflect that (in this case, the less active participant will not observe the subset of symbols that correspond to high levels of activity).
Figure 4.2: Example of time series discretization: The series are first transformed with PAA (left), then discretized into equal-width bins (right). The discretized version of the first series in this plot is 'A B C D D C B B B B'. It should be mentioned that PAA actually reduces the length of the series, but here the transformed series are plotted at the same length as the original signals to show the effect of the transformation.
Examples of this discretization are shown in the right plots of Figure 4.2. The PAA-transformed blue series are discretized into 4 bins: A, B, C, and D; each bin/symbol is shown in a different color. The number of bins used for discretization, K, is another hyperparameter of the method. Symbol A is only observed in the first time series.
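A minimal sketch of this discretization step is given below. The 1.5 x IQR outlier fences and the clipping of outlying values into the extreme bins are assumptions of the sketch, as the text only states that outliers are set aside.

import numpy as np
import string

def fit_bins(pooled_values, k):
    # IQR-based outlier fences computed over the entire dataset; the
    # 1.5 * IQR multiplier is the usual convention (assumed here)
    v = np.asarray(pooled_values, dtype=float)
    q1, q3 = np.percentile(v, [25, 75])
    iqr = q3 - q1
    inliers = v[(v >= q1 - 1.5 * iqr) & (v <= q3 + 1.5 * iqr)]
    # K equal-width bin edges spanning the inlier range
    return np.linspace(inliers.min(), inliers.max(), k + 1)

def discretize(series, edges):
    # Map each PAA value to a symbol A, B, C, ...; values beyond the
    # edges (outliers) are clipped into the extreme bins (also assumed)
    idx = np.clip(np.searchsorted(edges, series, side="right") - 1,
                  0, len(edges) - 2)
    return "".join(string.ascii_uppercase[i] for i in idx)

# Edges are fit once on the pooled PAA values of all series, then reused:
# edges = fit_bins(np.concatenate(all_paa_series), k=4)
# symbols = discretize(all_paa_series[0], edges)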
4.1.1.3 Identifying Patterns
We first describe how patterns are extracted from the series (the "Find Pattern" block in Figure 4.1), then discuss the preceding step, "Handling Consecutive Identical Symbols", which addresses how the discretized series are first transformed into different variations before pattern extraction.
Figure 4.3: Example of BPE on a discretized time series: The first line shows the original discretized series. In each iteration, the most common pair is identified and replaced by a new symbol.
To identify variable-length patterns in time series, we use the Byte Pair Encoding (BPE) compression technique [52]. BPE has been around for a long time, but it gained much wider recognition after it was used in [123]. BPE is a compression technique in which the most common pair of consecutive symbols is replaced by a new symbol. In [123] it was used to address the rare-word problem in neural machine translation through subword tokenization. Traditionally, words were treated as the tokens in text. With subword tokenization, characters are initially treated as tokens, and in each iteration the two most common consecutive tokens are merged to build a new token. In this way, a compound word like "authorship" can be understood by the model without having been observed beforehand, by breaking it into the subwords "author-" and "-ship". We use the same approach to identify patterns in time series. To the best of our knowledge, this technique had not previously been used in the context of time series. An example of BPE on a discretized series is shown in Figure 4.3. In each iteration, the most common pair over all time series in the train set is identified as a new pattern and replaced by a new symbol. For example, in Figure 4.3, in the first iteration the pair B A is identified as a new pattern and replaced by the newly introduced symbol E. The time series shown in the example contains two instances of this pattern; hence, in its representation, the feature corresponding to pattern E has the value 2. This value is later normalized by the length of the series, so that representations of series with different lengths can be compared with each other. With this approach, the dimensionality of the representation would be the number of discretization bins (K) plus the number of identified patterns.
For example, in Figure 4.3 the series are discretized to 4 symbols (A, B, C, and D) and 6 new patterns are identified (E, F, G, H, I, and J); hence the number of features in this representation is 10. However, this is not always the case: not every identified pattern is added as a new feature. A pattern should occur at least N·P times, where 0 < P < 1 and N is the number of series in the dataset, since having sparse features is not beneficial and can lead to overfitting. This constraint also serves as a stopping criterion for finding new patterns. If the overall frequency of the most common pair, F, is less than MAX(N·P, T·U), the loop for finding new patterns is stopped, where T is the number of all pairs in the data in the very first iteration and 0 < U < 1.
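To make the procedure concrete, the following sketch implements the pattern-extraction loop on discretized series. It simplifies the stopping rule to the N·P term alone, and the function name, placeholder symbol names, and example threshold are assumptions of the sketch rather than the exact implementation.

from collections import Counter

def pdbpe_features(sequences, p=0.05):
    # BPE-style pattern discovery over discretized series: repeatedly
    # merge the most frequent adjacent pair across the whole train set.
    n = len(sequences)
    seqs = [list(s) for s in sequences]
    counts = [Counter(s) for s in seqs]   # start from the K base symbols
    merges, next_id = {}, 0
    while True:
        pair_freq = Counter()
        for s in seqs:
            pair_freq.update(zip(s, s[1:]))
        if not pair_freq:
            break
        pair, freq = pair_freq.most_common(1)[0]
        if freq < n * p:                  # most common pair too rare: stop
            break
        sym = f"P{next_id}"               # stands in for E, F, G, ... in Fig. 4.3
        next_id += 1
        merges[sym] = pair
        for i, s in enumerate(seqs):      # replace the pair wherever it occurs
            out, j = [], 0
            while j < len(s):
                if j + 1 < len(s) and (s[j], s[j + 1]) == pair:
                    out.append(sym)
                    counts[i][sym] += 1   # one more occurrence of this pattern
                    j += 2
                else:
                    out.append(s[j])
                    j += 1
            seqs[i] = out
    vocab = sorted({k for c in counts for k in c})
    # One feature vector per series: pattern counts normalized by length
    return [[c[v] / len(s) for v in vocab] for c, s in zip(counts, sequences)], merges

features, merges = pdbpe_features(["ABCDDCBBBB", "BBACDDCCBA"])

Each returned feature vector contains, for every base symbol and every discovered pattern, its count in that series normalized by the original series length, so representations of series with different lengths remain comparable.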