Mining and Modeling Temporal Structures of Human Behavior in Digital
Platforms
by
Akira Matsui
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)
August 2022
Copyright 2022 Akira Matsui
I dedicate this dissertation to my late grandfather.
Acknowledgements
My PhD research would have been impossible without the constant and warm advisement of my advisor, Professor Emilio Ferrara. I am also grateful to my Ph.D. dissertation committee, who have taught me and dedicated their time to my graduation: Professor Aiichiro Nakano and Professor Marlon Twyman. I would also like to thank Professor Aram Galstyan and Professor Xiang Ren, who served on my Ph.D. dissertation proposal committee.
I would also like to thank Professor Teruyoshi Kobayashi for his support. My sincere thanks
go to Dr. Daisuke Moriwaki for providing me with an opportunity to join his intern team and for widening
the scope of my research.
I also thank my friends at USC, Yilei Zeng, Shen Yan, and Hikaru Ibayashi. I thank my fellow
lab-mates for engaging me in stimulating discussions.
Last but not least, I would like to thank my family and friends, especially my late grandfather, who not just encouraged but pushed me to embark on my Ph.D. in the U.S. right before he passed away.
Table of Contents
Dedication ii
Acknowledgements iii
List of Tables viii
List of Figures ix
Abstract xi
Chapter 1: Introduction 1
1.1 Human Behavior in Digital Platforms . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Motivations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2.1 Incorporating Inter-Temporal Information for Mid-Term Human Behavior
Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Overview, Research Questions, and Contributions . . . . . . . . . . . . . . . . . . 5
1.3.1 Overview of the Contents of This Dissertation . . . . . . . . . . . . . . . . 6
1.3.2 Research Questions of This Dissertation . . . . . . . . . . . . . . . . . . . 8
1.3.3 Contributions of the Contents of This Dissertation . . . . . . . . . . . . . 10
Chapter 2: Survey of Human Behavior Models that Leverage Embedding 13
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2 Non-negative Tensor Factorization for Human Behavior Modeling . . . . . . . . . 14
2.2.1 Non-negative Tensor Factorization by PARAFAC . . . . . . . . . . . . . . 15
2.2.1.1 Number of Components . . . . . . . . . . . . . . . . . . . . . . 16
2.2.2 Application of Non-negative Tensor Factorization . . . . . . . . . . . . . . 17
2.2.3 The NTF Model of This Dissertation . . . . . . . . . . . . . . . . . . . . . 18
2.3 Word Embedding for Human Behavior Modeling . . . . . . . . . . . . . . . . . . 19
2.3.1 word2vec Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.3.1.1 CBOW, Skip-gram Model and SGNS Model . . . . . . . . . . . 21
2.3.2 Taxonomy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.3.3 Labeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.3.4 Pre-trained Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.3.4.1 Overfitting Models . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.3.5 Working Variable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.3.6 Reference Words . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.3.6.1 Cosine Similarity or Euclidean Distance? . . . . . . . . . . . . . 27
2.3.7 Comparing the Same Words . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.3.8 Non-text . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.3.9 Classifying the Methods of This Doctoral Dissertation . . . . . . . . . . . 35
Chapter 3: Extracting Temporal Structures from User Behavior: The Case of Consump-
tion Behavior 39
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.2 Consumption Behavior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.3 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.4 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.4.1 Tensor Representation of Consumption Expenditure . . . . . . . . . . . . 43
3.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.5.1 Core-Consistency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.5.2 Multi-Timescale Expenditure Patterns . . . . . . . . . . . . . . . . . . . . 45
3.5.3 Expenditure Patterns and Demographic Differences . . . . . . . . . . . . . 46
3.5.4 Characterizing Clusters Based on the Demographic Properties . . . . . . . 47
3.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
Chapter 4: Characterizing User Behavior on Digital Platforms and Its Action Timing
Intervals: The Case of Wikipedia Editing 55
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.2 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.2.1 Embedding Vectors of Editing Behavior . . . . . . . . . . . . . . . . . . . 58
4.2.1.1 Correspondence between Users and Articles . . . . . . . . . . . 58
4.2.1.2 Embedding Vectors . . . . . . . . . . . . . . . . . . . . . . . . 58
4.2.1.3 What Do the Embedding Vectors Mean? . . . . . . . . . . . . 58
4.2.2 Similarity Graph Among Articles . . . . . . . . . . . . . . . . . . . . . . 59
4.2.3 Entropy of Editing Behavior . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.2.4 Defining Dimensions of Editing Behavior . . . . . . . . . . . . . . . . . . 60
4.3 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.3.1 Wikipedia Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.3.2 Preprocessing Wikipedia Data . . . . . . . . . . . . . . . . . . . . . . . . 61
4.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.4.1 Within-user Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.4.2 Between-Articles Analysis: Anatomy of Similarity Graph . . . . . . . . . 62
4.4.3 Within-Article Analysis: Distribution of Division of Labor Index . . . . . . 64
4.4.4 Within-Article Analysis: Division of Labor Index and Quality of Articles . 65
4.4.5 Characterizing the Differences Between the Two-Sides . . . . . . . . . . . 67
4.5 Wikipedia Editing Behavior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
Chapter 5: Unbiasing Session Analysis Using the Distributions of Individual Time Inter-
vals 75
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
5.2 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
5.2.1 Exponential Mixture Distribution of User Time Interval . . . . . . . . . . 79
5.2.1.1 Modeling user time interval . . . . . . . . . . . . . . . . . . . . 79
5.2.1.2 Estimation by EM Algorithm . . . . . . . . . . . . . . . . . . . 80
5.2.2 Determining Session Thresholds . . . . . . . . . . . . . . . . . . . . . . . 81
5.2.3 Biased Engagement Analysis . . . . . . . . . . . . . . . . . . . . . . . . . 81
5.2.4 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
5.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.3.1 Analysis With Synthetic Data . . . . . . . . . . . . . . . . . . . . . . . . 83
5.3.2 Analysis With Real-World Datasets . . . . . . . . . . . . . . . . . . . . . 85
5.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
Chapter 6: User Action Embedding With Inter-Time Information 87
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
6.2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
6.2.1 Capturing Inter-Temporal Information With Time Bins . . . . . . . . . . . 91
6.2.2 Time Interval Bins Estimation by a Mixture of Exponential Distributions . 91
6.2.2.1 EM Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
6.2.2.2 Attributing Time Intervals to Time Bins . . . . . . . . . . . . . 93
6.2.3 Action Sequence and n-Gram . . . . . . . . . . . . . . . . . . . . . . . . 94
6.2.3.1 Action Sequence . . . . . . . . . . . . . . . . . . . . . . . . . . 94
6.2.3.2 Action n-gram . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
6.2.4 Word Embedding Model . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
6.2.5 Extracting action timing context (ATC) using n-gram actions . . . . . . . . 95
6.2.5.1 Constructing the Reference Vectors . . . . . . . . . . . . . . . . 95
6.2.5.2 Definition of the “Long Term Context” and “Short Term Context” 96
6.2.5.3 Aligning Actions Into Long vs Short Term Context . . . . . . . 96
6.3 Data and Experiment Settings . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
6.3.1 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
6.3.2 Experiment Settings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
6.4 Empirical Analysis with Action Timing Context (ATC) . . . . . . . . . . . . . . . 99
6.4.1 ATC Differences Among Smartphone Apps Usage . . . . . . . . . . . . . 100
6.4.2 ATC Differences Between Dropout and Non-dropout Students . . . . . . . 100
6.4.3 Capturing Behavior Dynamics by ATC . . . . . . . . . . . . . . . . . . . 101
6.5 User Modeling with Embedding and Intertemporal Information between Users’
Actions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
6.5.1 User Modeling With Embedding . . . . . . . . . . . . . . . . . . . . . . . 103
6.5.2 Inter-Action Times of Human Behavior . . . . . . . . . . . . . . . . . . . 104
6.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
Chapter 7: User Action Clustering While Preserving the Order of Actions 110
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
7.2 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
7.2.1 Optimal Transportation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
7.2.1.1 Optimal Transportation Problem . . . . . . . . . . . . . . . . . 113
7.2.1.2 Optimization Problem . . . . . . . . . . . . . . . . . . . . . . . 114
7.2.2 Order-Preserving Wasserstein Distance . . . . . . . . . . . . . . . . . . . 114
7.2.2.1 Local penalty . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
7.2.2.2 Global penalty . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
7.2.2.3 Order-Preserving constraint . . . . . . . . . . . . . . . . . . . . 116
7.2.3 K-means clustering with OPW . . . . . . . . . . . . . . . . . . . . . . . . 116
7.2.3.1 Ordinal k-means algorithm . . . . . . . . . . . . . . . . . . . . 116
7.2.3.2 Modification of k-means algorithm with the OPW distance . . . 117
7.3 Data and Experiment Settings . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
7.3.1 Experiment With Synthetic Data . . . . . . . . . . . . . . . . . . . . . . 119
7.3.1.1 Synthetic User Action Vectors . . . . . . . . . . . . . . . . . . . 119
7.3.1.2 k-means and OPW k-means . . . . . . . . . . . . . . . . . . . . 120
7.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
7.4.1 Analysis with Synthetic Data . . . . . . . . . . . . . . . . . . . . . . . . . 121
7.4.1.1 Clustering Results . . . . . . . . . . . . . . . . . . . . . . . . . 121
7.4.1.2 Within Clustering Errors . . . . . . . . . . . . . . . . . . . . . 121
7.4.2 Analysis with Empirical Data . . . . . . . . . . . . . . . . . . . . . . . . 122
7.4.2.1 A Use Case: Morning Pattern Clustering . . . . . . . . . . . . . 122
7.4.2.2 Results of Morning Pattern Clustering . . . . . . . . . . . . . . 124
7.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
Chapter 8: Conclusion and Discussion 127
References 131
List of Tables
2.1 Labels for Analysis Methods With Word Embedding . . . . . . . . . . . . . . . . 23
2.2 Summary of the Analytical Methods With Word Embedding for the Social Science
Research I . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.3 Summary of the Analytical Methods With Word Embedding for the Social Science
Research II . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.1 Basic statistics of the receipt data . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.2 Chi-squared test for demographic difference between clusters. . . . . . . . . . . . 53
4.1 Reference Article to Define Dimensions . . . . . . . . . . . . . . . . . . . . . . . 60
4.2 Entropy Difference I . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.3 Dimension Differences between addition edits and deletion edits . . . . . . . . . 69
4.4 Entropy Difference II . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5.1 Glossary of Terms: Chapter 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
5.2 Datasets for empirical analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.3 The comparison of engagement distributions . . . . . . . . . . . . . . . . . . . . . 83
6.1 Glossary of Terms: Chapter 6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
6.2 Basic statistics of datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
6.3 EMA Questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
List of Figures
2.1 Schematic comparison: Cosine-similarity vs Euclidean distance . . . . . . . . . . 30
2.2 Empirical relationship: Cosine-similarity vs Euclidean distance . . . . . . . . . . . 31
3.1 Schematic of NTF for extracting intra- and inter-week expenditure patterns. . . . . 43
3.2 Core-consistency calculation to determine the number of components . . . . . . . 44
3.3 Multi-timescale consumption activity . . . . . . . . . . . . . . . . . . . . . . . . 45
3.4 Silhouette analysis for the k-means clustering . . . . . . . . . . . . . . . . . . . . 47
3.5 Silhouette analysis for the k-medoids clustering . . . . . . . . . . . . . . . . . . . 48
3.6 The sum of distances between points . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.7 Visualization of the low-dimensional representation of consumers by the NTF model 50
3.8 Demographic distribution of each cluster. . . . . . . . . . . . . . . . . . . . . . . 51
3.9 Demographic distribution of representative users . . . . . . . . . . . . . . . . . . 52
3.10 The Jaccard index for the overlap of users . . . . . . . . . . . . . . . . . . . . . . 54
4.1 Schematic illustration of modeling “two-sided” nature in editing Wikipedia articles 57
4.2 Anatomy of the similarity graph of deletion and addition edit nodes in Wikipedia . 64
4.3 Distribution of the Division of Labor Index (DLI) of Wikipedia editing: I . . . . . 66
4.4 Information Criteria . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.5 Imbalance of editing and its quality: I . . . . . . . . . . . . . . . . . . . . . . . . 68
4.6 Correlation between dimension values . . . . . . . . . . . . . . . . . . . . . . 72
4.7 Anatomy of the similarity graph of deletion and addition edit nodes in Wikipedia II 73
4.8 Distribution of the Division of Labor Index (DLI) of Wikipedia editing II . . . . . 73
4.9 Imbalance of editing and its quality II . . . . . . . . . . . . . . . . . . . . . . . . 74
5.1 Schematic: Sessions with 1 min threshold . . . . . . . . . . . . . . . . . . . . . . 78
5.2 Schematic: Biased threshold on aggregated data . . . . . . . . . . . . . . . . . . . 79
5.3 Time intervals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.4 Engagement analysis: synthetic data . . . . . . . . . . . . . . . . . . . . . . . . . 83
5.5 The comparisons of the engagement distribution . . . . . . . . . . . . . . . . . . . 84
6.1 Schematic of studying the inter-temporal context of user actions . . . . . . . . . . 90
6.2 Extracting action timing contexts with n-gram action . . . . . . . . . . . . . . . . 97
6.3 Smartphone application usage context . . . . . . . . . . . . . . . . . . . . . . . . 106
6.4 Action timing context differences: Drop-out students vs Non-dropout students . . . 107
6.5 Transition of the action timing context of the students over the academic term . . . 108
6.6 ATCs of the eating actions and the students’ physiological state . . . . . . . . . . . 108
6.7 ATCs of the physical actions and the students’ physiological state . . . . . . . . . 109
7.1 Schematic of synthetic data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
7.2 Clustering synthetic data: Jaccard index . . . . . . . . . . . . . . . . . . . . . . . 122
7.3 Clustering synthetic data: Within cluster error . . . . . . . . . . . . . . . . . . . . 123
7.4 Morning Pattern Detection by OPW . . . . . . . . . . . . . . . . . . . . . . . . . 125
7.5 The number of the shared classes by the students in the same cluster . . . . . . . . 125
7.6 The clusters in which the students particularly share the same class . . . . . . . . . 126
Abstract
Since the advent of the Internet, large-scale data from digital platforms have gained attention be-
cause of their excellent capability for tracking various aspects of human behavior, such as economic
activity, human mobility, and student learning style. Analyzing these substantial data requires
methods to extract the essence of human behavior and expand our understanding of how we live in
this age of digital platforms. This dissertation proposes a comprehensive set of methods and appli-
cations to uncover the underlying mechanisms of the dynamic human behavior observed in digital
platforms. First, I propose a non-negative tensor factorization model for detecting multi-timescale
consumption patterns in high-dimensional data. We demonstrate that the multi-timescale temporal
structure extracted by the proposed methods reflects the demographic information of individu-
als. Second, I discuss recent trends in applying word embedding techniques to human behavior
modeling. Along with these recent trends, I propose a word embedding model for studying hu-
man behavior in knowledge production. By mining the high-dimensional human behavior data of
knowledge production, I demonstrate that produced knowledge is composed of two different types of labor supply. Third, I highlight the importance of understanding time intervals for human
behavior analysis and propose a unified framework for understanding temporal human behavior with inter-time information and for clustering users based on the dynamics of their behavior. The comprehensive set of methods and applications proposed in this dissertation allows us to capture the temporal context of human behavior through the lens of dual-process theory and to detect individual differences in high-dimensional human behavior data.
Chapter 1
Introduction
1.1 Human Behavior in Digital Platforms
Many scientists have studied human behavior to extract insights and advance our understanding of
society. The advent of digital platforms has fueled research on human behavior by gathering and
recording detailed human behavior data. The data obtained from digital platforms are called big
data, named after their large size and complexity.[1]
This technical term is widely and frequently
used not only in research papers but also in media outlets, such as newspapers and magazines.
To analyze big data, researchers have developed a wide range of machine learning techniques and
statistical models.
Studying data from digital platforms enables us to observe human behavior with a level of gran-
ularity that has never been previously achieved. Digital platforms, one of the best sources of big
data, are powered by the internet, IoT (Internet of Things) devices, and smartphone applications.
Digital devices such as smartphones and their applications are now an integral part of our lives;
many of us live with these kinds of high-tech devices that track our activities throughout the day, constantly observing what we do and how we communicate. These “social sensors” have been generating a large amount of highly granular data about human behavior: for example, how we consume, how we use smartphone applications, and how students study and behave on campus, among other social activities.
[1] Big data is a concept proposed by several scientists from the 1980s through the 1990s. Discussions of big data often refer to the “Three Vs” to characterize its nature, as proposed by Doug Laney; they stand for “Variety, Velocity, and Volume.” “Variety” refers to the variety of data types, “velocity” to the speed of data generation, and “volume” to the size of the data. It should be noted that this proposal is still under debate. For example, Salganik (2019) suggests ten properties of Big Data instead of three. While the “Three Vs” may not fully capture the nature of Big Data, it would be reasonable for us to assume that this initial concept captures the minimum properties of its nature.
Research on big data from digital platforms has not only revolutionized the methods of tra-
ditional disciplines, such as computer science, economics, management science, and sociology,
but also created new academic fields, such as computational social science [129, 128]. The emer-
gence of high-granularity data from digital platforms has prompted researchers to conduct in-
terdisciplinary research because the data and problems on digital platforms are so complex that
conventional disciplines and methods do not seem to be fully applicable. Therefore, this trend has
generated new research topics that require interdisciplinary approaches to tackle problems, such
as social bots, in which a combination of computer science, network science, and social science is
effective [60].
1.2 Motivations
In order to analyze complex human behavior in digital platforms, it is necessary for us to have
methods that can extract essential information of human behavior from large amounts of data in
an interpretable manner. To this end, this dissertation proposes a set of methods that address the
above two challenging points and analyzes human behavior using data from digital platforms. In
the first part of this dissertation, I discuss the methods to capture human behavior and its tempo-
ral structure from large amounts of data. As my methods model human behavior from a certain
abstract perspective, they have a wide range of applications and facilitate our understanding of hu-
man behavior. In this dissertation, I leverage low-dimensional representation techniques to obtain
an interpretable representation of human behavior from a large amount of data. With the proposed
methods, I conduct empirical studies that extract insights into human behavior using real-world
data.
Extracting insights on human behavior from digital platform data delivers value to our society,
such as a better understanding of economic policies or public health [129]. However, analyzing
human behavior in digital platforms requires us to extract the essence of human behavior from
substantial and complex data: data too large to read. For this purpose, computer scientists have proposed machine learning models that learn low-dimensional representations, such as embedding and dimensionality reduction methods. Recently, both computer scientists and social scientists have become users and developers of such machine learning models.
While machine learning models provide handy features for prediction and recommendation,
interpreting the results has become a key concern when we analyze social data, especially human behavior
data. This is because methods for human behavior analysis must not only handle a large amount
of data but also provide interpretable results such that findings with those methods advance our
understanding of human behavior. Therefore, methods for human behavior analysis of large-scale social data need to return interpretable results. In addition to interpretability, those methods
need to capture important and widely observed aspects of human behavior such that we can utilize
them to understand not only a particular domain or task but also human behavior in general.
In this dissertation, I first propose a tensor factorization model that extracts human behavior in
a multi-timescale temporal structure. I apply the proposed method to data from a financial management
smartphone application, which records the consumption behavior of each individual over time.
I demonstrate that the low-dimensional representation obtained with this method captures both the consumption behavior and its temporal structure, and that the temporal structure of consumption behavior reflects the demographics of consumers. This study suggests that the temporal structure
plays a vital role in understanding human behavior.
I also propose an embedding model for characterizing human behavior in knowledge produc-
tion. I utilize a standard word-embedding model to model the essential nature of knowledge pro-
duction on digital platforms. I apply the proposed method to the editing history of Wikipedia
articles and unveil a critical mechanism behind knowledge production. I uncover that different groups of users take on different production tasks within the same article, and that this division of labor is linked to the quality of the articles. This dissertation also shows the importance of understanding the hetero-
geneity of inter-temporal structures. I demonstrate that considering users’ time interval differences
can provide a clearer view of human behavior on the digital platform. I conduct a session analysis
by considering the time interval differences among users to investigate their engagement.
The above three findings suggest that understanding temporal information is key to understand-
ing human behavior, and utilizing low-dimensional representation aids in such analysis. Based on
this notion, I propose a unified method that analyzes user actions through inter-temporal informa-
tion (time intervals). I simultaneously embed users’ action sequences and their time intervals to
obtain a low-dimensional representation of the action along with inter-temporal information. With
the proposed method, explicitly modeling action sequences together with inter-temporal information about user behavior enables interpretable analysis. I further develop a method that clusters users
based on the sequences of their actions by considering the orders of actions in those sequences. An
empirical analysis using the proposed methods shows that the order of actions indicates individual
differences. The proposed method investigates human behavior from a mid-term perspective.
1.2.1 Incorporating Inter-Temporal Information for Mid-Term Human Be-
havior Analysis
Most existing models focus on the relationship between the points of observation and the prediction
of the next action. These methods attempt to study relationships between variables over time, even
when it comes to dynamic analysis or time series. However, to understand human decision-making
or the cognitive state, it is important to investigate mid-term human behavior, especially the time
intervals between actions [78, 206, 228, 111].
Time intervals have various essential properties in terms of human behavior. A famous example
is the exchange of letters [169]. To explain the mechanism behind the generation of the time inter-
val distributions, priority queue models and modulated Markov processes have been proposed [95,
145, 166]. In particular, network scientists have analyzed and discussed the importance of the
time interval distribution to understand human behavior [95, 145, 215, 12, 169]. Furthermore, it
is known that differences in human cognitive states appear as differences in the time intervals between behaviors [78,
206, 228]. A wide range of literature in cognitive science, psychology, and behavioral economics
has shown the existence of multiple modes of human behavior [78, 206, 228], and the time inter-
vals between actions are linked to such cognitive modes [206, 111]. Therefore, modeling the time
interval differences associated with actions should facilitate our understanding of human behavior.
Considering the importance of time intervals to human decision-making and the fact that human
behavior changes over time, it is of paramount interest to investigate the link between time intervals
and the temporal behavior of human actions. To understand this point, we need to conduct a mid-
term behavior analysis that focuses on a few sequences of actions. Focusing on mid-term human
behavior allows us to model a sequence of human actions combined with time intervals. However,
little attention has been devoted to this area. There are many well-exploited time-series models for
long-term analysis, but mid-term behavior does not have a sufficiently long sequence of actions to
apply such time-series models. There is ample literature about short-term analysis that predicts
the next single action of a given user, such as recommendations. Therefore, there is a gap between
these two lines of literature.
In addition, both lines of work use the duration between observations as a feature or an indicator to determine the order of observation points, but time intervals are not fully utilized as an objective variable of
interest. To understand inter-temporal information and actions, it is not sufficient to analyze only
the time interval or sequence of actions. Analyzing these two different factors holistically would
expand our understanding of the structural link between human actions and their inter-temporal
information. To tackle this challenge, this dissertation proposes a method for embedding the in-
formation of temporal intervals into observed actions simultaneously and obtaining embedding
vectors that mix the two factors. Using this embedding model, I extract the information for human
behavior analysis from a large amount of data.
1.3 Overview, Research Questions, and Contributions
This doctoral dissertation proposes a comprehensive set of methods and applications for analyzing
human behavior from massive digital platform data. I explicitly model the temporal aspect of
human behavior and utilize the embedding methods to extract crucial information from large data.
Furthermore, I study a method that focuses on mid-term human behavior and conduct several
experiments using real datasets.
1.3.1 Overview of the Contents of This Dissertation
We first discuss the contents of this dissertation to provide a clear overview of the proposed meth-
ods for mining and modeling the temporal structures of human behavior in digital platforms.
This dissertation utilizes non-negative tensor factorization (NTF) techniques and word-
embedding techniques to obtain a low-dimensional representation of human behavior from sub-
stantial data in an interpretable manner. To provide the context of this dissertation in the literature,
Chapter 2 discusses the recent applications of the methods used in this dissertation. I first overview
the techniques and applications of the NTF for human behavior analysis. Then, I turn my attention to a survey of word-embedding techniques for human behavior modeling. To review the recent literature on word embedding methods across multiple disciplines, I build a taxonomy of analyses with word embedding models, particularly word2vec. This taxonomy reveals the
recent interdisciplinary trend that applies word2vec to human behavior analysis.
Chapter 3 proposes a model that extracts the temporal structure of human behavior. I extract
the low-dimensional representation of consumption behavior using the NTF method by explicitly
modeling a multi-timescale behavior. I apply the proposed method to the data from a financial
management smartphone application, which records the consumption behavior of each individual
over time. This method successfully captures consumption behavior represented as high-dimensional data. The proposed model extracts a temporal structure of consumption behavior that reflects the demographics of consumers. This study suggests that the temporal structure plays a vital role in understanding human
behavior.
Following the trends discussed above, Chapter 4 proposes a word embedding-based method for
characterizing human behavior in knowledge production. I utilize word2vec to model the two-sided nature of knowledge production on digital platforms, which views editing behavior as comprising addition and deletion edits. I apply the proposed method to the editing history of Wikipedia articles
and unveil that the two-sided nature plays a vital role in knowledge production, and the division of
labor between the two sides is associated with the quality of knowledge production.
This dissertation also highlights the importance of understanding the heterogeneity of inter-
temporal structures. Chapter 5 notes the importance of understanding the time interval for human
behavior analysis. I demonstrate that considering users’ time interval differences can provide a
clearer view of human behavior on the digital platform. I conduct a session analysis by considering
the time interval differences among users to investigate their engagement.
The above three findings suggest that understanding temporal information is key to modeling
human behavior. Also, utilizing low-dimensional representations helps us mine substantial human
behavior data from digital platforms. Based on this notion, I propose a unified set of methods that
analyze user actions with inter-temporal information (time intervals) in Chapter 6 and Chapter 7.
Chapter 6 proposes a framework for understanding the temporal human behavior and the inter-time
information. Chapter 7 proposes clustering methods that account for the order of actions to conduct user-level analysis. In Chapter 6, I embed users’ action sequences and their time intervals simulta-
neously to obtain a low-dimensional representation of the action with inter-temporal information.
Using the proposed embedding model, I discuss the method for generating interpretable results
that facilitate our understanding of human behavior instead of simply performing dimensional re-
duction. I propose an action timing context (ATC) that describes the inter-temporal context of
actions based on the obtained embedding vectors and estimates the human cognitive state. ATC
is inspired by the dual-process theory [78, 111] that argues humans have two modes of cogni-
tion: Type 1, which makes decisions automatically (short-term); and Type 2, which requires time
to make decisions (long-term). Based on this theory, ATC investigates the actions and cognitive
types to which they belong. The proposed method explicitly models the interdependence between
action sequences and their inter-temporal information, enabling us to obtain interpretable results
from massive amounts of human behavior data. In Chapter 7, I further develop a method that
clusters users based on the sequences of their actions by considering the orders of actions in those
sequences. An empirical analysis using the proposed methods shows that the proposed clustering
algorithm can facilitate the exploration of individual dynamic differences.
1.3.2 Research Questions of This Dissertation
This doctoral dissertation discusses how to extract information from a large amount of data ob-
tained from digital platforms to facilitate our understanding of human behavior. It also aims to
model the temporal aspects of human behavior. To address these issues, I specifically examine the
following research questions:
RQ1. Does the temporal structure of consumption behavior capture consumers’ attributes?
First, I model human behavior in terms of different time scales. I analyze consumption behavior,
which is one of the most significant human behaviors. Economic activities such as consumption
are one of the major drivers of our society, and therefore, studying consumption behavior would
reveal how our society functions. Detecting consumption patterns is often hampered by the high dimensionality of the data, which prevents us from employing traditional statistical methods. In addi-
tion, various factors, such as regularity and seasonality, can influence consumer behavior within
the data period. Moreover, previous studies suggest an association between consumption patterns
within a week or month and consumer attributes. This research question requires investigating whether the temporal structure of consumption behavior reflects consumer attributes. To ad-
dress this question, I first propose a multi-timescale method that extracts the temporal structure
of consumption behavior. Then, I study the relationship between consumption behavior projected
onto the multi-timescale and consumers’ attributes.
RQ2. Can the embedding method extract characteristics of human behavior from massive records?
I study a method for extracting the characteristics of human behavior based on an embedding
model. This research question investigates whether we can use an embedding model invented in
another domain (natural language processing) to study human behavior. I propose a word2vec-based model to understand human behavior in knowledge production and apply it to one of the most
popular knowledge production platforms, Wikipedia. With this research question, I discuss how
low-dimensional representations obtained by the word embedding model can help human behavior
analysis.
RQ3. Does considering different time intervals matter for understanding human behavior in the mid-term?
RQ3 inquires about the inter-temporal differences between users. The concept of a session is used
in many mid-term behavioral analyses and is common in web science. Session analysis allows
us to analyze the temporal behavior of users by combining traces of their consecutive discrete
behaviors into a single group. When a session is created, its end is defined as the point at which
the users’ behavior is no longer observed for a “certain period of time,” and many previous studies have set the threshold for such a period based on the aggregated distribution of time intervals. However,
in reality, users’ behavioral time intervals vary widely, and the determination of the threshold
from the aggregated distribution leads to biased results. To answer RQ3, I study the time interval
differences between users and propose a threshold-setting method based on mixture distributions.
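To make the idea concrete, the sketch below is a minimal illustration of my own (not the implementation described in Chapter 5) of fitting a two-component exponential mixture to one user's inter-action gaps with EM and placing that user's session threshold where the two weighted component densities cross; the synthetic gap data, function names, and hyper-parameters are placeholders.

```python
import numpy as np

def fit_exp_mixture(t, n_iter=200):
    """EM for p(t) = pi * Exp(lam1) + (1 - pi) * Exp(lam2), with lam1 the fast (within-session) rate."""
    pi = 0.5
    lam1 = 1.0 / np.percentile(t, 25)   # init: fast component from short gaps
    lam2 = 1.0 / np.percentile(t, 90)   # init: slow component from long gaps
    for _ in range(n_iter):
        p1 = pi * lam1 * np.exp(-lam1 * t)
        p2 = (1.0 - pi) * lam2 * np.exp(-lam2 * t)
        r = p1 / (p1 + p2)              # E-step: responsibility of the fast component
        pi = r.mean()                   # M-step: mixture weight and rate updates
        lam1 = r.sum() / (r * t).sum()
        lam2 = (1.0 - r).sum() / ((1.0 - r) * t).sum()
    return pi, lam1, lam2

def session_threshold(pi, lam1, lam2):
    """Gap length at which the two weighted densities are equal (assumes lam1 > lam2)."""
    return np.log((pi * lam1) / ((1.0 - pi) * lam2)) / (lam1 - lam2)

# Synthetic gaps (minutes): short within-session gaps mixed with long between-session gaps.
rng = np.random.default_rng(42)
gaps = np.concatenate([rng.exponential(1.0, 800), rng.exponential(60.0, 200)])
pi, lam1, lam2 = fit_exp_mixture(gaps)
print(f"per-user session threshold: {session_threshold(pi, lam1, lam2):.1f} minutes")
```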
RQ4. Does considering user actions and inter-time information foster our understanding of human
behavior in the mid-term?
This research question studies a method that considers user behavior and its inter-temporal struc-
ture in a holistic manner. Many methods have been proposed for mining human dynamic behav-
ioral data, and they have provided valuable insights for research and business. However, most
methods analyze only the sequence of actions and do not comprehensively consider inter-temporal
information, such as the time interval between actions. To answer this research question, I pro-
pose a unified method for analyzing user behavior with the time intervals between actions. By
simultaneously embedding the users’ action sequences and their time intervals, I obtain a low-
dimensional representation of the actions with inter-temporal information. I then use the proposed
model to study human actions in terms of their temporal context.
RQ5. Does a single observed behavior sequence reflect the fluctuation of human behavior over a
period?
Finally, I consider how to analyze user behavior using the low-dimensional representation obtained
in the solution of the previous research question. This last research question explores whether
individual dynamic differences can be studied using the action sequence and low-dimensional
representation of actions. To answer this research question, I propose a method that clusters users
based on their behavior while preserving the order of each behavior. The proposed method uses an
optimal transport framework to formalize a loss function that considers the order of each element
and incorporates it into k-means clustering.
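For intuition about the optimal-transport building block only (this is generic entropically regularized transport, i.e., plain Sinkhorn iterations, without the order-preserving penalties or the modified k-means developed in Chapter 7), here is a small self-contained sketch with made-up inputs:

```python
import numpy as np

def sinkhorn_cost(a, b, C, eps=0.05, n_iter=1000):
    """Entropic optimal transport: approximate cost of moving histogram a onto b under cost matrix C."""
    K = np.exp(-C / eps)
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iter):
        v = b / (K.T @ u)
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]     # optimal transport plan
    return float(np.sum(P * C))

# Two users' shares of time over the same four (hypothetical) action types.
a = np.array([0.5, 0.3, 0.1, 0.1])
b = np.array([0.1, 0.2, 0.3, 0.4])
C = 1.0 - np.eye(4)                     # unit cost for moving mass between different action types
print(sinkhorn_cost(a, b, C))           # close to 0.5, the mass that must change action type
```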
1.3.3 Contributions of the Contents of This Dissertation
In the following section, I describe the contributions and findings of this doctoral dissertation in
detail.
In Chapter 2, along with the survey of the NTF techniques and application, I survey recent
studies that use word-embedding techniques for human behavior mining and modeling. This sur-
vey provides an overview of the analysis and findings of the papers that apply word embedding to
human behavior data. The key contributions of this survey are as follows:
• I survey the literature that uses word-embedding models for human behavior mining, bridg-
ing social science and computer science studies.
• I build a taxonomy to illustrate the methods adopted in the surveyed research for extracting essential information about human behavior using word embedding models, and I discuss guidelines for using word embedding in human behavior analysis.
• I highlight recent emerging literature trends that apply word-embedding techniques to non-
text data such as human behavior data and demonstrate that this trend is present in multiple
fields.
In Chapter 3, I present an NTF-based method for extracting the dynamic consumption patterns
of consumer modeling and analyze the relationship between the multi-timescale of consumption
and the demographic of the consumers. The key contributions of the study are as follows:
• I propose a model that allows us to study consumer behavior from a multi-timescale per-
spective (i.e., intra- and inter-week consumption patterns).
• I prove that the multi-timescale patterns capture daily and weekly consumption behaviors.
• I demonstrate that these multi-time consumption patterns reflect the demographics of indi-
viduals, such as their gender and age.
Chapter 4 proposes a word embedding-based method for extracting human behavior in knowl-
edge production. I model the two-sided nature of editing, in which participants contribute to both
addition and deletion edits in knowledge production tasks. The main contributions of studying
Wikipedia data are as follows:
• I formalize the concept of the two-sided nature of knowledge production and propose a
model to analyze this vital nature.
• I demonstrate that the labor supply of the two sides (addition and deletion edits) is not uni-
form across different facets, such as within-user, within-article, and inter-article facets.
• I connect these findings with the division of labor, a well-known important concept in eco-
nomics.
• I introduce the Division of Labor Index (DLI) to quantify the degree of the division of labor and discuss the association between the DLI and the quality of articles.
In Chapter 5, I study the effects of heterogeneity among mid-term temporal human behaviors
for session analysis. I propose a method for defining the threshold for session analysis based on
estimated time intervals, assuming that several components are mixed in the observed time interval
distribution. The key contributions of this chapter are as follows:
• I demonstrate, on synthetic and empirical data, that a widely used engagement analysis in web science (session analysis) can be biased.
• I propose a threshold-setting method for mitigating the bias based on a mixture distribution.
• I also conduct experiments with several datasets to demonstrate that the application of the proposed method can debias the session analysis.
In Chapters 6 and 7, I propose a unified framework for studying the human behavior in the
mid-term. The proposed framework employs a word embedding model to alloy the two essential
components of human behavior: actions and time intervals. With the proposed model, I investigate
the actions observed in digital platforms in terms of temporal context to understand the cognitive
state of humans. I also propose a clustering method for conducting user-level analysis that uses the
obtained embedding vectors of actions. The key contributions of this chapter are as follows:
• I use the proposed embedding model to obtain low-dimensional representations of user actions with inter-temporal information; these representations capture the context of the temporal structure in
which those actions are executed.
• I propose an indicator, the action timing context (ATC), that tracks the context of actions in terms of temporal structure. Using several real-world datasets, I demonstrate that the ATC can capture the differ-
ence between actions in terms of different contexts.
• I also propose a clustering method that groups users based on their actions while preserving
the order of their actions.
Chapter 2
Survey of Human Behavior Models that Leverage
Embedding
2.1 Introduction
To extract essential information of human behavior from complex data, computer scientists have
been developing machine learning models that learn low-dimensional representations from the
data. From such advancements in machine learning research, not only computer scientists but also
social scientists have benefited and advanced their research.
This dissertation employs two machine learning models to mine and model temporal human be-
havior in digital platforms: a non-negative tensor factorization (NTF) and word embedding model.
These are machine learning methods that learn low-dimensional representations from data. The
reason I chose these two methods instead of many existing methods is that they allow us to carry
out a unified analysis of data with different structures. This doctoral dissertation uses data on
various human behaviors in various formats. It is possible to propose a human behavior model ac-
cording to the data format to be analyzed, but this would mean proposing as many computational
models as the number of data formats to be explored, making it difficult to study human behavior with a unified model. In addition, when we propose a computational model of human behavior for each data format, we need to discuss the validity of the proposed model for every one of those formats, which requires a great deal of effort. However, using machine
learning to obtain a low-dimensional representation from the data as a vector can encapsulate such
discussions on modeling and its validity. Since the low-dimensional representation obtained from
any form of data is a vector, we can focus our discussion on how to model human behavior from
the vectors obtained.
In this dissertation, Chapter 3 uses the NTF to model the multi-timescale consumption behav-
ior. In addition to the NTF, we utilize the word embedding technique to model human temporal
behavior. Chapter 4 proposes a word embedding-based model for understanding human behavior
in knowledge production and analyzing the composition of produced knowledge. In Chapter 6, we
introduce a unified framework that uses a word embedding model to understand temporal human
behavior and inter-time information.
To describe our methods and the context of those methods in the literature of human behavior,
this chapter discusses the methods and related literature of the NTF and word embedding models. The first
part of this chapter overviews the NTF techniques that model and extract the temporal structures
of human behavior. Then, we discuss the applications of the NTF for human behavior mining.
The second part of this chapter surveys recent studies that apply word embedding techniques to
human behavior mining. We build a taxonomy to illustrate the methods and procedures used in
the surveyed papers. Our taxonomy also highlights the recent emerging trends that apply word
embedding models to non-textual data of human behavior. Lastly, we contextualize the word
embedding-based methods proposed in this dissertation within the current trends detected in
the literature.
2.2 Non-negative Tensor Factorization for Human Behavior Mod-
eling
NTF is a machine learning model that decomposes high-dimensional data, represented as a tensor, into several factors [28, 121, 136]. A tensor is an N-dimensional array. Specifically, the NTF decomposes tensor data that consist of non-negative components (i.e., values that are greater than or equal to zero). The NTF has been used to model and analyze temporal structures
of human behavior in a wide range of research areas, such as temporal human activities in digital
platforms [194, 107, 170] and economic activities [172, 71, 120, 146]. This section first discusses
the NTF techniques, including a method for selecting the hyper-parameter of the NTF. Then, we
overview the applications of NTFs on behavior data.
2.2.1 Non-negative Tensor Factorization by PARAFAC
The NTF method decomposes a tensor $\mathcal{X} \in \mathbb{R}_{+}^{I \times J \times K}$ into latent factors that characterize the activity patterns of the corresponding mode. Each element of the tensor is denoted by $x_{ijk} \in \mathcal{X}$. In our model, $x_{ijk}$ denotes the number of items purchased by user $i$ on the $j$-th day of week $k$. We employ the PARAFAC decomposition as an NTF algorithm throughout the analysis. The PARAFAC decomposition is an approximation method that expresses $\mathcal{X}$ as a sum of rank-one non-negative tensors $\{\hat{\mathcal{X}}_r\}_{r=1}^{R}$:

$$\mathcal{X} \approx \sum_{r=1}^{R} \hat{\mathcal{X}}_r = \sum_{r=1}^{R} a_r \circ b_r \circ c_r, \qquad (2.1)$$

where $R$ denotes the number of components, and $a_r \in \mathbb{R}_{+}^{I \times 1}$, $b_r \in \mathbb{R}_{+}^{J \times 1}$ and $c_r \in \mathbb{R}_{+}^{K \times 1}$ represent the $r$-th component factors that respectively encode the membership of a user to a component, and the intra- and inter-week activity levels. The operator $\circ$ represents the outer product.

Let $A \in \mathbb{R}_{+}^{I \times R}$, $B \in \mathbb{R}_{+}^{J \times R}$ and $C \in \mathbb{R}_{+}^{K \times R}$ be the factor matrices, whose $r$-th columns are the vectors $a_r$, $b_r$ and $c_r$, respectively. The factor matrices $A$, $B$ and $C$ are obtained by solving the following minimization problem with non-negativity constraints:

$$\min_{A \geq 0,\, B \geq 0,\, C \geq 0} \ \|\mathcal{X} - [\![A, B, C]\!]\|_F^2, \qquad (2.2)$$

where $\|\cdot\|_F$ denotes the Frobenius norm, and $[\![A, B, C]\!]$ represents the Kruskal form of the tensor decomposition (i.e., the right-hand side of Equation 2.1). To solve this problem, we use the alternating non-negative least squares (ANLS) with the block principal pivoting (BPP) [117].
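As a concrete, runnable illustration of the decomposition in Equations (2.1)-(2.2), the sketch below runs a rank-$R$ non-negative PARAFAC on a synthetic user × day-of-week × week count tensor with the open-source tensorly library. This is only an assumed stand-in: the tensor is random toy data, and tensorly's default non-negative solver (multiplicative updates, in the versions assumed here) differs from the ANLS-BPP algorithm cited above.

```python
import numpy as np
import tensorly as tl
from tensorly.decomposition import non_negative_parafac

# Synthetic stand-in for the purchase-count tensor X (users x days of week x weeks).
I, J, K, R = 100, 7, 52, 5
X = tl.tensor(np.random.poisson(lam=2.0, size=(I, J, K)).astype(float))

# Non-negative PARAFAC: factor matrices A (I x R), B (J x R), C (K x R), as in Eq. (2.1).
weights, (A, B, C) = non_negative_parafac(X, rank=R, n_iter_max=200, init="random")

# Relative reconstruction error ||X - [[A, B, C]]||_F / ||X||_F, cf. Eq. (2.2).
X_hat = tl.cp_to_tensor((weights, [A, B, C]))
print(float(tl.norm(X - X_hat) / tl.norm(X)))
```

In the reading of Chapter 3, the columns of $B$ and $C$ would correspond to intra-week (day-of-week) and inter-week activity patterns, while $A$ assigns users to components.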
2.2.1.1 Number of Components
We utilize the Core-Consistency Diagnostic to determine an appropriate number of components, $R$ [28]. The basic idea of the Core-Consistency measure is to quantify the difference between PARAFAC decomposition and a more general decomposition, namely the Tucker3 decomposition [28]. The Tucker3 decomposition is more flexible than PARAFAC because it allows for correlations between different components. If PARAFAC and Tucker3 return similar decompositions, then the PARAFAC model is considered to be a good approximation of the original tensor (i.e., ignoring correlations among components would be justified).

For the PARAFAC decomposition, the $(i, j, k)$ element of the tensor can be written as

$$x_{ijk} = \sum_{n=1}^{R} \sum_{m=1}^{R} \sum_{p=1}^{R} \lambda_{nmp}\, a_{in} b_{jm} c_{kp}, \qquad (2.3)$$

where $\lambda_{nmp}$ denotes a product of Kronecker deltas, i.e., $\lambda_{nmp} = \delta_{nm}\delta_{mp}\delta_{np}$, where $\delta_{nm}$ is the Kronecker delta that takes one if $n = m$, and 0 otherwise. Note that $\lambda_{nmp}$ takes 1 if $n = m = p$ and 0 otherwise, so $\lambda_{nmp}$ is the $(n, m, p)$ element of the superdiagonal binary tensor $\mathcal{L}$.

For the Tucker3 model, the $(i, j, k)$ element of the tensor is generally written as

$$x_{ijk} = \sum_{n=1}^{R_n} \sum_{m=1}^{R_m} \sum_{p=1}^{R_p} g_{nmp}\, a_{in} b_{jm} c_{kp}, \qquad (2.4)$$

where $g_{nmp}$ may not be expressed by a product of the Kronecker delta. $g_{nmp}$ is an element of the core tensor $\mathcal{G}$ obtained by the Tucker3 algorithm [121].

The Core-Consistency (CC) quantifies the difference between the PARAFAC and Tucker3 decompositions by computing the distance between $\mathcal{L}$ and $\mathcal{G}$ as

$$\mathrm{CC} = 100 \times \left(1 - \frac{\sum_{n=1}^{R} \sum_{m=1}^{R} \sum_{p=1}^{R} (g_{nmp} - \lambda_{nmp})^2}{R}\right). \qquad (2.5)$$

Note that the number of components $R$ is common for all modes in both the PARAFAC and the Tucker3 decomposition, i.e., $R_n = R_m = R_p = R$. If the PARAFAC and the Tucker3 methods yield exactly the same decomposition, then $\mathrm{CC} = 100$ [28]. In general, the CC value decreases with $R$ because interactions between components tend to be more evident as the number of components increases. For implementation, we can utilize the computationally efficient method proposed by Papalexakis and Faloutsos (2015) [171], which is based on the alternating nonnegativity-constrained least squares with block principal pivoting.
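For illustration, a direct (and deliberately naive) way to evaluate Equation (2.5) is to fit an unconstrained core tensor $\mathcal{G}$ by least squares while holding the PARAFAC factors fixed, and then compare $\mathcal{G}$ with the superdiagonal tensor $\mathcal{L}$. The sketch below does this with dense Kronecker products, which is only practical for small $R$ and is not the efficient procedure of Papalexakis and Faloutsos cited above; the function names and toy tensor are mine.

```python
import numpy as np

def core_consistency(X, A, B, C):
    """Core-Consistency Diagnostic (Eq. 2.5): least-squares core G given fixed factors,
    compared against the superdiagonal binary tensor L."""
    R = A.shape[1]
    # Column-major vectorization: vec(X) = (C kron B kron A) vec(G).
    M = np.kron(C, np.kron(B, A))                       # (I*J*K) x R^3
    g = np.linalg.lstsq(M, X.reshape(-1, order="F"), rcond=None)[0]
    G = g.reshape((R, R, R), order="F")
    L = np.zeros((R, R, R))
    for r in range(R):
        L[r, r, r] = 1.0
    return 100.0 * (1.0 - np.sum((G - L) ** 2) / R)

# Sanity check: a noiseless rank-R PARAFAC tensor should give a Core-Consistency near 100.
rng = np.random.default_rng(0)
I, J, K, R = 20, 7, 10, 3
A, B, C = rng.random((I, R)), rng.random((J, R)), rng.random((K, R))
X = np.einsum("ir,jr,kr->ijk", A, B, C)
print(core_consistency(X, A, B, C))
```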
2.2.2 Application of Non-negative Tensor Factorization
The NTF method has been widely used to mine high-dimensional human behavior data in different
social contexts. The NTF decomposes the tensor data of human behavior into several factors,
allowing us to understand the essential information in the data. An early application of the NTF on
behavior data is Panisson et al. (2014), who investigated the temporal structure of activities in a
social network service [170]. They constructed a tensor that contains the transition of contents of
the posts on Twitter during the London 2012 Olympics and demonstrated that the factors obtained
by the NTF can reflect the timeline of the Olympics schedule.
The NTF can mine data from not only social network services but also other digital platforms,
for example, online game platforms. Sapienza et al. (2018) decomposed the behavior of online game players and grouped the players based on the decomposition results [194]. They demonstrated that the players
in the same group have common temporal characteristics such as playing strategies, and their
strategies are stable over time. Following these findings, Jiang (2021) grouped the players of an online game based on the NTF’s decomposition of their behavior data [107]. They showed that each type of player demonstrates stable behavior, even when they
become more experienced. While these studies focus on a specific aspect of human behavior, such
as Twitter and online games, Hosseinmardi et al. (2019) studied human behavior in general by
applying the NTF to the behavior history of students in a university [96]. They demonstrated that
the decomposed factors can reflect student attributes, such as academic performance.
Temporal networks are another major research area that leverages the NTF. A temporal network can be represented as a sequence of adjacency matrices using a tensor representation. Because the elements
of each matrix represent the edges and are non-negative values, such as binary values or weights,
the NTF has high compatibility with temporal network models. Sapienza et al. (2018) applied
the NTF to a temporal network built from incomplete information and obtained a low-dimensional
representation of it [193]. They showed that the obtained low-dimensional representation can
approximate the outcome of complete networks. They analyzed the spreading processes on the
network by simulating a susceptible-infected-recovered (SIR) model, and the epidemiological size
calculated using a surrogate network is consistent with incomplete networks. Because of the wide
applicability of temporal networks, the NTF can be used to analyze a wide range of human behaviors [223], including anomalies in mobility patterns [137, 195].
In this line of literature, temporal networks of economic activities have also been analyzed using
the NTF. Pecora and Spelta (2017) modelled the financial activities of countries using the data of
the consolidated banking statistics from the Bank for International Settlements [172]. They detected
important communities and demonstrated that these activities decreased during the 2008 finan-
cial crisis and European sovereign debt crisis. Guidict et al.(2018) constructed the trading network
based on the monthly import and export data of over 30 countries and obtained the low-dimensional
representations of the constructed temporal network [71]. Their analysis of the network properties
based on the obtained low-dimensional representations detected crucial nodes in the network, and
their method could improve the prediction accuracy of economic activities. Kobayashi et al.(2018)
modelled the bank lending patterns using a temporal network from the bilateral interbank trans-
actions in the Italian online inter-bank market and applied the NTF on that temporal network to
reveal the intra- and inter-day transaction patterns [120].
2.2.3 The NTF Model of This Dissertation
Following this line of literature, Chapter 3 discusses the application of the NTF to the economic activities of individuals and studies the multi-timescale structure of consumption behavior. We construct a temporal bipartite graph model using the daily consumption behavior of consumers observed in a
financial management application. The constructed temporal bipartite graph represents the multi-
timescale consumption patterns. Then, we extract the essential factors of the consumption behavior
hidden in the data, preserving the multi-timescale temporal structure. Our analysis determines that
the extracted temporal structure reflects the demographics of the consumers.
2.3 Word Embedding for Human Behavior Modeling
Social scientific knowledge lies in unconventional and unstructured data. For example, researchers
in social science, particularly in computational social science, have recently investigated text data,
being enamored with the notion of “Text as Data” [69, 16, 81]. Not only disciplines that traditionally employ quantitative analysis [45], such as finance [182], but also qualitative fields, such as history [227], are interested in these new unsupervised machine learning models.
The embedding technique is one of the most common machine learning methods used in so-
cial science research with text data, which obtains a low-dimensional representation from high-
dimensional or unstructured data. Embedding models learn a low-dimensional representation of
each entity in the data from the relationships between entities, such as words in a sentence, nodes in a graph or network [77, 13, 32, 48, 239], or pixels in an image [140, 11]. While many embedding models and algorithms have been proposed, this survey focuses on the word embedding model, one of the most used embedding models [39, 149,
151, 73, 152, 150]. Word embedding models obtain the embedding vectors from the relationships
between words in a sentence.
The obtained embedding vector is considered to represent the semantics of the word, i.e., the meaning expressed in the sentence. To justify this interpretation, most of the literature cites the so-called distributional hypothesis [92], a theory that assumes the meaning of a word is determined by the words surrounding that word in the text. Many experiments verify that word embedding models can capture semantics as humans do (Footnote 1).
Footnote 1: Computer science research evaluates word embedding models from two perspectives: intrinsic evaluation and extrinsic evaluation. Typical tasks for intrinsic evaluation are analogy and similarity, and many datasets for intrinsic evaluation are publicly available and easy to use (Analogy [153, 38, 2, 192, 31, 181, 100, 141, 94]; Similarity [149, 152]). Extrinsic evaluations are, for instance, translation and sentiment analysis. See Wang et al. (2019) for recent evaluation methods [222].
The word embedding model is now accessible to many social scientists, thanks to word2vec.
Word2vec is a package that combines a neural model and a learning algorithm. Both computer
scientists and researchers across many disciplines, such as the humanities [131], management science [213, 37, 135, 20, 41, 238, 115, 214], and poetry [161], now use word2vec in their research.
Therefore, describing the current status of social science research with word embedding models
would contribute to the future development of this interdisciplinary field. In this survey, we build
the taxonomy of how the studies published in social science utilize word2vec to help computer
scientists develop new models. In addition, our taxonomy contributes to the development of com-
putational social science, a research area that links social and computer sciences.
In this section, we survey the usage of word embedding models for social science (Footnote 2), not as a technical report of word embedding algorithms but as a report of how word embedding models can contribute to social science (Footnote 3). This survey primarily focuses on the social science papers that use word2vec. We select journal publications and working papers likely to be published in, or submitted to, journals in the social sciences.
2.3.1 word2vec Model
We first clarify the terminology of “word2vec.” The term word2vec does not refer to a specific
neural network model or algorithm, but to a software program and tool. The pedagogical definition of
the word “word2vec” is software that combines a learning algorithm and a training method [39].
Typical models implemented in word2vec are the continuous bag-of-words (CBOW) model and the skip-gram model, and negative sampling is a popular learning algorithm [149, 151, 73,
152, 150]. While computer science papers usually describe the details of their models, social
science papers often use word2vec as if it refers to a specific neural network model. In addition,
they sometimes do not discuss learning algorithms and training methods in depth, because studying algorithms is not the direct subject of their research. Therefore, when a paper mentions only “word2vec” as its method, it would be beneficial to know whether it uses the CBOW or the skip-gram model. In this survey, we will not provide detailed explanations of the CBOW or skip-gram models but concentrate on how social science research uses word2vec (Footnote 4).
Footnote 2: I include a paper in chemistry [214] to discuss the method for non-textual data.
Footnote 3: Kozlowski et al. (2019) provide a summary of the history of word embedding models for social scientists [124]. The lecture note by Chaubard et al. (2019) [39] provides a great description of word2vec.
2.3.1.1 CBOW, Skip-gram Model and SGNS Model
The CBOW and skip-gram models solve a “fake” problem: predicting which words appear in a given text. The main goal of word embedding models is not to make predictions but to use the parameters obtained by solving the prediction problem as a low-dimensional representation of words. In other words, the prediction problem is solved only to learn the parameters; the prediction itself is not of central interest. Given this background, we refer to the prediction problems that word embedding models solve to learn the low-dimensional representation as “fake” problems. To explain this fake problem, we often use the terms “target” (Footnote 5) and “context.” The target is an element in the text and is often a word. The context is the set of elements around the target and is often a set of words.
Consider the sentences, “The stick to keep the bad away. The rope used to bring the good
toward us.” When the target is “rope,” its context could be [bad, away, the, used, to, bring]. The
window size here is three because the window refers to the three words on the left of the target and
the three words on the right (linear window). Therefore, the window size determines the context
size. In other words, it determines the number of elements to capture from the target. The CBOW
and skip-gram models solve this prediction problem using the context and the target. The difference between the CBOW and skip-gram models is which part predicts the other. The CBOW model predicts the target w from the context c, i.e., P(w | c). Conversely, the skip-gram model predicts the context from the target, i.e., P(c | w).
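A minimal sketch of the window construction described above, reproducing the “rope” example; the two sentences are concatenated and lowercased, and punctuation is ignored for brevity. This is an illustration, not code from the dissertation.

```python
# Extract the context of the target "rope" with a linear window of size three.
tokens = "the stick to keep the bad away the rope used to bring the good toward us".split()
target_index = tokens.index("rope")
window = 3
context = tokens[max(0, target_index - window):target_index] \
    + tokens[target_index + 1:target_index + 1 + window]
print(context)  # ['bad', 'away', 'the', 'used', 'to', 'bring']
```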
The most used word2vec is the skip-gram with negative sampling (SGNS) model, which combines the skip-gram model and the negative sampling algorithm. This software is provided in ready-to-use libraries such as Gensim [185, 184].
Footnote 4: In this survey, several papers ([8, 131, 33]) use the GloVe model [174] and one paper ([113]) uses the fastText model [23].
Footnote 5: “word” is also popularly used instead of “target,” but to avoid potential confusion, we use “target” in this survey.
The SGNS models the distribution p(d | w, c), where d takes the value 1 when a pair of target w and context c is observed in the data and 0 otherwise. The SGNS maximizes the following conditional log-likelihood [56],
\[ \sum_{(w,c)\in D} \left[ \log p(d=1 \mid c, w) + k\,\mathbb{E}_{\bar{w}\sim q} \log p(d=0 \mid c, \bar{w}) \right], \qquad (2.6) \]
where q is the noise distribution used in negative sampling, k is the number of samples drawn from the noise distribution (Footnote 6), and D is the set of pairs of target w and context c. The SGNS computes the conditional probability p(d = 1 | c, w) as σ(v_c · v_w), where σ(x) is the sigmoid function and v_w, v_c ∈ R^d. In other words, the SGNS seeks the parameters v_w, v_c that maximize the above conditional log-likelihood, and v_w, v_c are the embedding vectors of interest.
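A minimal sketch of training an SGNS model with Gensim, the ready-to-use library mentioned above. The toy corpus and hyperparameter values are illustrative assumptions, not settings used in the dissertation.

```python
from gensim.models import Word2Vec

sentences = [
    ["the", "stick", "to", "keep", "the", "bad", "away"],
    ["the", "rope", "used", "to", "bring", "the", "good", "toward", "us"],
]

model = Word2Vec(
    sentences,
    vector_size=100,  # dimension d of the embedding vectors v_w, v_c
    window=3,         # linear window: three words on each side of the target
    sg=1,             # 1 = skip-gram (0 = CBOW)
    negative=5,       # k, the number of negative samples drawn from the noise distribution q
    min_count=1,
    seed=0,
)

print(model.wv["rope"][:5])           # the learned target vector of "rope"
print(model.wv.most_similar("rope"))  # nearest neighbors by cosine similarity
```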
As mentioned above, the typical context is the set of words around the target word. However, the context need not be constructed from the words surrounding the target; it can be arbitrary. Because the context determines the relationships between targets and other elements in the data, we can utilize domain knowledge such as dependency parsing. For example, Levy and Goldberg (2014) proposed the word2vecf model, a modified version of the word2vec model that obtains word embedding vectors by constructing the context based on the dependencies between words [134]. The word2vecf model can construct an arbitrary context for the word2vec model and allows us to apply relationships found in non-text data, not only the dependencies between words in sentences. We discuss this trend in Section 2.3.8.
2.3.2 Taxonomy
We first build the taxonomy of the analytical methods used in the surveyed social science papers to provide a clear overview of this emerging research area. Across disciplines, many papers with word2vec conduct a reference-based analysis to ensure the interpretability of results and to achieve their analytical objectives. Some studies have also used word2vec to analyze not only text but also relational data, such as relationships between users.
Footnote 6: More precisely, this log-likelihood maximization with the noise distribution is considered Noise Contrastive Estimation (NCE) [85], and the SGNS is one of the variations of NCE.
Table 2.1: Labels for Analysis Methods With Word Embedding
Label | Definition
Pre-trained | Pre-trained model
Working variable | Variables constructed from a word embedding model for analysis
W/theory | Validating a theory proposed in a research field, or quoting it to justify an analysis
W/reference | Using references, such as words with a specific semantic, to define variables
Non-text | Applying a word embedding model to non-text data (e.g., human behavior data)
Same words | Comparing the same words from different models
Human subj | Employing human subjects for analysis
Clustering | Clustering analysis using word embedding vectors
Prediction | Prediction using word embedding vectors
2.3.3 Labeling
We first group the papers to be surveyed by their research field based on the journals in which the
papers were published. For working papers, we infer the topic based on the author’s faculty affil-
iation and past publication history to determine the research area of that paper. Our classification
in this survey results in 14 research areas. Note that we understand that defining or classifying
research fields may be non-trivial and controversial. We use these research field labels not for studying
the primary interest of this survey but for clarifying the overall perspective of the survey.
Because of the high diversity of the surveyed papers’ research fields, it is difficult for us to dis-
cuss each paper according to its research field. Instead of following the literature of each research
field, we study the way that the surveyed papers use word embedding models for their analysis,
such as what kind of tasks they use word embedding models for and how they construct variables
for their research. To overview how the surveyed papers use word embedding models, we build the
taxonomy of their analysis. For this survey, we define eight labels and summarize the descriptive
definitions of the analytical method labels in Table 2.1. We summarize the papers we surveyed
in Tables 2.2 and 2.3. In the following sections, we discuss each label by reviewing several
representative papers of each.
2.3.4 Pre-trained Models
One of the central motivations of social science papers is to extract the information from data to
answer their research questions. Most studies use word embedding models trained on their own
datasets. Obtaining low dimensional representations from data allows us to extract the semantics
and characteristics of entities of interest, such as important words in a document. Social science is often interested in a specific subject, such as central banks [147, 14], judges [8], organizations [41], job openings [115], or smartphone applications [113]. To study these, researchers build their own
dataset about their subject and train their word embedding models on the data.
While most studies construct their own word embedding models, pretrained word embedding
models can still be beneficial. There are papers with pretrained word embedding models published
in prestigious journals [19, 124, 65]. Pretrained word embedding models are ready-to-use models
that are already trained on a large corpus. Most pretrained models are trained on a general corpus.
There are many pretrained word embedding models that are open to the public, such as the Google
News pretrained model [75] and pretrained models on historical corpora in several languages [90].
Models trained on general corpora represent general semantics, and hence, such models cannot answer research questions about a specific subject. Conversely, studies with pretrained models can have general implications. Therefore, such studies help study general trends, such as the evaluation of biases [65, 33] and culture [124]. In this line of literature, it should be noted that some studies demonstrate that employing human subjects to validate the findings of pretrained models can aid our understanding of crucial human perceptions [33, 20, 238]. For example, management scientists combine surveys of human subjects with pretrained models to investigate essential concepts in their discipline, such as the perceptions of risk [19] and leadership [20]. In addition, because of their generality, researchers can utilize pretrained models to validate theories proposed in their disciplines. For example, researchers in neuroscience propose features for interpreting word embeddings based on a method from their field called neurosemantic decoding [43].
2.3.4.1 Overfitting Models
Social science is often interested in analyzing a specific subject rather than general phenomena as
natural science is. Computer scientists avoid overfitting and ensure the generality of their trained
model to obtain a better performance on computational tasks, such as analogical reasoning tasks, or
on generating language models. However, it is the overfitting result that social science researchers
are interested in when they study their data because an overfitted word embedding model can rep-
resent the specificity of that data. For example, many social scientists are interested in what kind
of biases, such as stereotypes, are “embedded” in the text of interest [18, 8, 127, 65]. A word embedding learned from such corpora can reveal the existence of biases, in which specific words are overfitted to words associated with human characteristics such as gender, and such computed biases are similar to the biases that humans hold [33]. In the context of fairness in machine learning, there is an important computer science literature that proposes methods for debiasing such biases [24, 236, 30, 63] and examines their validity [74].
2.3.5 Working Variable
While there are papers that use word embedding vectors for clustering [99, 160, 238, 115, 116] or
prediction [20, 188, 131, 226] tasks, we find that most of the papers we surveyed use the obtained vectors to define variables that embody the concept or research question to be examined in their study. To clarify this trend, we call this procedure the “working variable,” using an analogy to the “working hypothesis,” a well-used term in social science research [210]. To answer their
research questions, social scientists transform a theoretical hypothesis into a working hypothesis
that can be tested by experimental or observational research. A working hypothesis is testable
using empirical analysis and is often transformed from a conceptual hypothesis or theoretical hy-
pothesis. While a theoretical hypothesis is a conceptual description of the hypothesis to be proved
in the research, a working hypothesis describes an executable procedure or test that can prove the
theoretical hypothesis.
This survey defines “working variable” as a variable that is a proxy variable of theoretical
interest. By defining working variables, the researchers investigate theoretical hypotheses in vari-
ous fields and external objects such as human perception. For example, to analyze stereotypes in
the documents of interest, researchers can use word embedding to calculate some working vari-
ables that proxy stereotypes [65]. Not only human perception but also qualitative concepts can be defined as working variables, such as the speed of semantics in documents, including research papers [204, 213], law documents [204], and the plots of movies and TV shows [204, 213].
Because social science studies typically aim at analyzing a specific phenomenon or concept, many papers define a working variable to analyze the phenomenon or concept in question. Working variables are often defined as scalar values computed from the embedding vectors obtained in the study. For example, working variables can be proxies of important concepts, such as opacity [26] or grammar [36]. Calculating these working variables often employs the reference
words to be described in Section 2.3.6. Defining not only a single working variable but also
multiple working variables clarifies the concept to be studied. Toubia et al. (2021) quantified the
shape of stories by defining multiple variables such as speed, volume, and circuitousness in the
text [213].
We mention here that, as a matter of course, there are studies that directly use the vectors obtained from word embedding models. Some researchers use word embedding models to estimate their own statistical models for causal inference [226] or equilibrium estimation [113], rather than analyzing a subject represented by the embedding vectors.
2.3.6 Reference Words
There are two primary ways to compare the obtained word vectors with reference words: direct
and indirect comparisons. Analysis using reference words defines working variables as the dis-
tance between the selected reference words and the words of interest. This method determines
appropriate “reference” words representing a concept to be analyzed.
For example, Garg et al. (2018) analyzed the gender stereotypes in text over one century by
calculating the relative similarity between the words related to gender (men or women) and the
words related to specific occupations [65]. To calculate the relative similarity, they first calculated
the distance between the word embedding vectors of words related to men and the words describing
specific occupations. They also calculated the same distance for women-related words. Then,
they computed the relative differences between these two distances. When the relative distance
with an occupation is large, it means that the stereotype in the text is considerable. For example,
they determined that “engineer” is close to the men-related words in the historical corpus and
argued that this suggests the existence of stereotypes in the corpus. The paper showed that the
measured gender stereotypes are correlated with occupational employment rates by gender. They
also conducted the same analysis for ethnic stereotypes.
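A minimal sketch of this kind of reference-word working variable: the relative Euclidean distance between an occupation word and two sets of gender reference words. The word lists are illustrative assumptions, not Garg et al.'s exact sets.

```python
import numpy as np

def mean_vector(wv, words):
    """Average the embedding vectors of the reference words present in the vocabulary."""
    return np.mean([wv[w] for w in words if w in wv], axis=0)

def relative_gender_distance(wv, occupation, male_refs, female_refs):
    occ = wv[occupation]
    d_male = np.linalg.norm(occ - mean_vector(wv, male_refs))
    d_female = np.linalg.norm(occ - mean_vector(wv, female_refs))
    return d_female - d_male  # > 0: the occupation word is closer to the male references

# Example usage with any gensim KeyedVectors object `wv` (hypothetical word lists):
# score = relative_gender_distance(wv, "engineer", ["he", "man", "his"], ["she", "woman", "her"])
```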
There are two advantages to using reference words for analysis. The first is that introduc-
ing references improves the interpretability of the results; for example, if we use the Subjectivity
Lexicon as the reference word, we can measure the degree to which the word of interest is subjec-
tive [147]. The second advantage is that it avoids the problems associated with coordinate systems
when comparing different models, and we detail this point in Section 2.3.7.
2.3.6.1 Cosine Similarity or Euclidean Distance?
As surveyed so far, many papers examine the similarity between embedding vectors, whether relative or absolute. Most of them use cosine similarity to measure the similarity between vectors, but the Euclidean distance (L2 norm) is also popular. Because the similarity between vectors can be interpreted as a distance between vectors, we have various options for defining the distance between vectors (Footnote 7). A natural question we have to ask is whether the results change depending on the measure we use for the analysis. Some papers analyze the robustness of their results using multiple options, such as Euclidean distance and cosine similarity [65] (Footnote 8).
Toubia et al. (2021), in their supplementary information, argued that Euclidean distance is
richer than cosine similarity, which only measures the angle [213]. Indeed, cosine similarity is not
a perfect alternative to Euclidean distance. While their argument is true, using cosine similarity
illuminates the relationship between two given vectors in some cases. Figure 2.1 depicts five different vectors. Vectors 1 and 2 have the same angle θ_1, and therefore, their cosine similarity is one. In contrast, the Euclidean distance between the two is d, which means they are different in terms of Euclidean distance. Note that the same Euclidean distance does not always imply the same cosine similarity. The Euclidean distance between Vectors 2 and 3 is the same as the Euclidean distance between Vectors 1 and 2, but these two pairs have different angles (θ_1 and θ_2). Figure 2.1 also depicts that using the normalized Euclidean distance eliminates the distance between Vectors 1 and 2: the normalized Vector 1 will be on the same point as Vector 2, which is on the unit circle. In this case, these two vectors are the same from the perspective of both cosine similarity and Euclidean distance. In addition, we note that a small distance does not always mean a small or large cosine distance. The distance between Vectors 4 and 5 is small, but their angle is approximately π/4; therefore, the cosine similarity between them is not small.
Specifically, the relationship between the two measures diminishes when θ is small or around π. Let us consider the distance between two points on a unit circle, A(1, 0) and B(cos θ, sin θ) (Footnote 9). The distance between the two is (2(1 − cos θ))^{1/2}, and the cosine similarity is cos θ. Therefore, the error between the Euclidean distance of the normalized vectors and the cosine similarity is a function of θ,
\[ e(\theta) = \cos\theta - \left(2(1-\cos\theta)\right)^{1/2}. \]
Since the first derivative of e(θ) with respect to θ is
\[ e'(\theta) = \frac{-\sin(\theta)}{1 + \left(2(1-\cos(\theta))\right)^{1/2}}, \]
the relationship between the two metrics becomes zero around θ = π or θ = 0. This fact implies that the two metrics are not compatible when their angle is small or large. We also note that the angle is not always relevant to the distance.
Footnote 7: We also note that cosine similarity is not a distance metric.
Footnote 8: Regarding the robustness check, Ash et al. (2020) also studied the correlations between their stereotype measurements computed from 100- and 300-dimension embedding vectors [8]. In addition, they tested three sets of window sizes in their robustness check.
Footnote 9: We can consider two points on a unit circle, (x, y) and (x cos θ − y sin θ, x sin θ + y cos θ), where the angle between the two is θ. Since we only study the distance between the two, we set x = 1 and y = 0 without loss of generality.
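A quick numerical check of the relation used above between the Euclidean distance of unit vectors and their cosine similarity, d(θ) = (2(1 − cos θ))^{1/2}; this is an illustration only.

```python
import numpy as np

for theta in np.linspace(0.1, np.pi - 0.1, 5):
    a = np.array([1.0, 0.0])                       # A(1, 0) on the unit circle
    b = np.array([np.cos(theta), np.sin(theta)])   # B(cos(theta), sin(theta))
    assert np.isclose(np.linalg.norm(a - b), np.sqrt(2 * (1 - np.cos(theta))))
    assert np.isclose(np.dot(a, b), np.cos(theta))  # cosine similarity of unit vectors
```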
To further discuss the comparison between the cosine similarity and Euclidean distance, we
study the empirical relationship between them using the Google News pretrained word2vec model
that contains the embedding vectors of 3,000,000 words [75]. We construct 1,500,000 random
word pairs and plot the relationships in Figures 2.2a and 2.2b. While Figure 2.2a demonstrates that the two metrics are correlated, their correlation is not strong (the Pearson correlation coefficient ρ = −0.357), which implies that using cosine similarity would lose some information, and vice versa. Figure 2.2a also demonstrates that the cosine similarity is distributed widely where the Euclidean distance is small, meaning that two different pairs of vectors with the same Euclidean distance can take different values of cosine similarity. This deviation becomes large where the Euclidean distance is small. This finding is consistent even when we normalize the vectors. Figure 2.2b demonstrates the empirical relationship between the Euclidean distance of normalized vectors and the cosine similarity. While they capture the difference between two word embedding vectors in a similar way (the Pearson correlation coefficient ρ = −0.991), they are not the same. The Euclidean distance of normalized vectors is more sensitive than cosine similarity when the distance between the two points is small (i.e., θ is small) or large (i.e., θ is around π).
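A minimal sketch of this kind of empirical comparison, assuming Gensim's downloader provides the Google News pretrained model; it samples far fewer pairs (10,000) than the 1,500,000 used in the analysis above and is not the dissertation's code.

```python
import numpy as np
import gensim.downloader as api

wv = api.load("word2vec-google-news-300")   # pretrained Google News vectors [75]
rng = np.random.default_rng(0)
words = wv.index_to_key
pairs = rng.integers(0, len(words), size=(10_000, 2))

cos, euc, euc_norm = [], [], []
for i, j in pairs:
    u, v = wv[words[i]], wv[words[j]]
    cos.append(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
    euc.append(np.linalg.norm(u - v))
    u_n, v_n = u / np.linalg.norm(u), v / np.linalg.norm(v)
    euc_norm.append(np.linalg.norm(u_n - v_n))   # equals (2 * (1 - cosine))**0.5

print(np.corrcoef(cos, euc)[0, 1])        # correlation with the raw Euclidean distance
print(np.corrcoef(cos, euc_norm)[0, 1])   # correlation with the normalized Euclidean distance
```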
The above discussion suggests that the cosine similarity and Euclidean distance can behave in a qualitatively similar manner, but they are not always compatible. Notably, the cosine similarity can capture the difference between two vectors when their Euclidean distance is small. However, when we study normalized vectors, the Euclidean distance captures the differences while the cosine similarity does not when the Euclidean distance is small. Given this, we should be aware of what kind of characteristics we intend to capture when selecting a metric.
Figure 2.1: Schematic comparison: Cosine similarity vs. Euclidean distance
Note: Schematic illustration of five different vectors. Vector 1 and Vector 2 have the same angle θ_1, but Vector 1 is longer than Vector 2. The angle between Vector 2 and Vector 3 is θ_2, but the distance between Vector 2 and Vector 3 is the same as the distance between Vector 1 and Vector 2. The distance between Vector 4 and Vector 5 is small, but they have a certain angle between them.
These findings also imply that the Euclidean distance and cosine similarity (angle) capture
different characteristics. Recently, some studies in computer science literature propose word em-
bedding models that explicitly model this relation, such as hierarchical relationships [211, 163, 216, 218]. For example, Iwamoto et al. (2021) proposed an embedding model that uses a polar coordinate system and illustrated that their model captures the hierarchy of words, in which the radius (distance) represents generality and the angles represent similarity [103].
2.3.7 Comparing the Same Words
Some papers use a simple method of analyzing the similarity between words. For example, this
category contains the analysis of enumerating the words similar to a certain word, a task often performed in word embedding tutorials, in which the similarity of each word is calculated in order to list similar words. Some papers have used this simple analysis as seminal work and have yielded useful findings [37].
Figure 2.2: Empirical relationship: Cosine similarity vs. Euclidean distance. (a) Euclidean distance. (b) Euclidean distance of normalized vectors.
Note: We built 1,500,000 pairs of two words by randomly picking words from the Google News pretrained word2vec model [75]. Then, we plot the cosine similarity against the Euclidean distance of each pair, as well as against the Euclidean distance of the normalized vectors. We normalize the word embedding vectors such that the norm of each embedding vector is 1.
Calculating the similarity between the same words from different models is also attractive for
social scientists. The same word often has different meanings depending on the speaker and time
period in which the word was used. Examining such diachronic changes in semantics and changes
in meaning by the speaker can answer research questions that arise from natural social science
ideas. For example, Nyarko and Sanga (2020) examined the difference in perception between
experts and laypeople by examining whether these two groups use the same word to describe
different semantics according to their profession [164].
Comparing the same words from different data requires training separate models on the different corpora. For example, when we want to compare the semantic differences of the same word
from two different documents, we need to train two separate word embedding models on each
document. It is important to note that we are not able to directly compare the two models, even if
they were trained with the same algorithm. This is because the learning algorithm takes an arbi-
trary coordinate system with each training. To overcome this issue, Hamilton et al. (2016) used
orthogonal Procrustes to align different models to reveal historical changes in the semantics of
words [89], and some other studies follow this method [164].
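A minimal sketch of this alignment step using SciPy's orthogonal Procrustes routine; it assumes two Gensim KeyedVectors objects, wv_a and wv_b, trained separately on different corpora with a shared vocabulary, and is not Hamilton et al.'s exact implementation.

```python
import numpy as np
from scipy.linalg import orthogonal_procrustes

def align(wv_a, wv_b, shared_vocab):
    """Rotate model B's vectors onto model A's coordinate system."""
    A = np.vstack([wv_a[w] for w in shared_vocab])
    B = np.vstack([wv_b[w] for w in shared_vocab])
    R, _ = orthogonal_procrustes(B, A)   # orthogonal matrix R minimizing ||B R - A||
    return {w: vec for w, vec in zip(shared_vocab, B @ R)}

# After alignment, the cosine similarity between wv_a["word"] and
# aligned["word"] measures how the meaning of "word" differs across the corpora.
```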
It has also been noted that alignment using orthogonal Procrustes may not be stable [70]. There-
fore, an improved method using the Dynamic Word Embedding model, which explicitly incorpo-
rates semantic changes, has been proposed [70, 231]. There are papers that use the Dynamic Word
Embedding model to analyze semantic changes [147].
In addition, thanks to recent studies on the statistical properties of word embedding models [6],
researchers have proposed a way to compare different models [240] without aligning models,
which will advance the use of word embedding models from a more social science perspective.
In addition, this problem can be avoided if the comparison is done within the same model and the result is projected onto a scalar value. Garg et al. (2018) calculated the stereotype as a scalar value and studied the trajectory of the values over time [65]. We can consider this another advantage of introducing reference words: they allow us to avoid the problems stemming from the fact that word embedding models use an arbitrary coordinate system in their learning.
2.3.8 Non-text
In addition, it should be noted that there are studies that apply word2vec to non-textual data, such as metadata [187] and symbols [214], in addition to text data. We also survey research that applies the word2vec algorithms (skip-gram or CBOW) [149, 151, 73, 152, 150] to digital platform and geographic data.
In recent years, researchers have applied word2vec to non-text data because the word embedding model learns relationships not only between words in sentences but between entities in general. Several studies apply the word embedding algorithm to learn relationships between text and non-text entities, or between non-text entities such as users. As discussed in the introduction, this survey primarily focuses on the applications of word embedding models and algorithms. Therefore, the Non-text label does not cover embedding models for non-text data that are not word embedding models, such as graph embedding [77, 13], image embedding [140, 11], or network embedding [48, 239].
Tshitoyan et al. (2019) applied word2vec to academic papers in chemistry to learn the relationship between text and symbols (chemical formulae) [214]. They demonstrated that potential knowledge about future discoveries can be obtained from past publications by showing that the obtained embedding model can recommend discoveries several years in advance of the actual discovery. Not only symbols but also other information can be embedded. Rheault and Cochrane (2020) analyzed politicians' speeches using the word2vec algorithm such that it embeds information about the politicians' ideologies along with the text data [187].
Not only text data but also non-text data, such as behavior data and geographic data, can be a good application of word embedding models. As discussed in Section 2.3.1.1, word embedding models learn relationships between words in sentences; however, the relationships that word embedding models learn are not limited to words in a text.
Word embedding models can obtain low-dimensional representations from relationships composed of non-text elements. For example, Hu et al. (2020) obtained embedding vectors of Points of Interest (POIs) [99] from geographic data that describe the relationships between POIs. With their clustering algorithm, they suggested that the model can discover the usage patterns of urban space with reasonable accuracy. Murray et al. (2020) noted an important theoretical insight for analyzing geographic data with word2vec: they revealed that word2vec is mathematically equivalent to the gravity model [241], which is often used to analyze human mobility, and conducted experiments with real datasets to validate their findings [158].
Adapting word2vec to non-text data may yield useful insights even when text data is available. Waller and Anderson (2021) adopted a modified version of the word2vec algorithm (Footnote 10) to analyze the posting behavior of Reddit users [220]. They obtained the embedding vectors by learning the relationships between Reddit users and subreddits (Footnote 11).
Footnote 10: Waller and Anderson (2021) used an algorithm called word2vecf, which is a modified version of word2vec [134, 72].
Footnote 11: In Waller and Anderson (2021), a subreddit is called a community.
It is worth noting that they do not analyze the textual data of Reddit posts to investigate users' behavior. Studying the Reddit text data might unveil attributes of Reddit users, such as demographic information, preferences, and age groups. However, user comments do not always
reflect such meta-information. Some supervised machine learning models may predict such meta-information, and there are successful unsupervised methods that predict the characteristics of text, such as sentiment analysis [173, 21, 102] or the moral foundations dictionary [79, 86]. Moreover, the validity of training data is difficult to prove owing to issues regarding annotators' subjectivity. To overcome this problem, the authors used a data-driven and linguistically independent method that characterizes subreddits only from the perspective of “what kind of users are posting in the community.” By setting reference “words” (subreddits), they created indicators such as age, gender, and partisanship. For example, to calculate the working variable of “partisanship,” they chose two opposing communities on Reddit, “democrats” and “conservatives,” and then calculated the relative distance to these references for each community.
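A minimal toy sketch of this non-text application: each user's sequence of subreddits is treated as a "sentence" so that subreddits visited by similar users obtain similar embedding vectors. Note that Waller and Anderson used the word2vecf variant (Footnote 10); this sketch uses standard Gensim word2vec and invented data purely for illustration.

```python
from gensim.models import Word2Vec

# Each "sentence" is the sequence of subreddits one user posted to (toy data).
user_histories = [
    ["democrats", "politics", "news"],
    ["conservatives", "politics", "news"],
    ["gaming", "pcmasterrace", "news"],
]
model = Word2Vec(user_histories, vector_size=32, window=5, sg=1, min_count=1, seed=0)

def partisan_score(wv, subreddit):
    """Reference-based working variable: relative similarity to two opposing communities."""
    return wv.similarity(subreddit, "democrats") - wv.similarity(subreddit, "conservatives")

print(partisan_score(model.wv, "politics"))
```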
If we apply our taxonomy to Waller and Anderson (2021), it reveals that, while they do not
use text, they share characteristics with other studies in their methodologies. They define a work-
ing variable (partisanship) using reference words (subreddits). In addition, they are similar to the
analysis by Toubia et al. (2021) in terms of setting multiple working variables [213].
2.3.9 Classifying the Methods of This Doctoral Dissertation
Finally, we contextualize the word embedding-based methods proposed in this dissertation within the current trends detected in the literature. In this doctoral dissertation, we utilize the word2vec
algorithm to model the human behavior in knowledge production (Chapter 4) and the relation-
ship between actions and their inter-time information (Chapter 6). These two methods learn low-
dimensional representations from non-textual data, and therefore, they are “non-text.” Chapter 4
discusses the knowledge production in Wikipedia. We model editing behavior in Wikipedia in two
parts: addition and deletion editing. To model and mine these, we introduce a word2vec-based
model that learns the low-dimensional representation of the relationships between editors and per-
taining articles, in which they add or delete information. Our analysis with the proposed model
reveals that the users divide their tasks, and this division of labor is linked to the quality of articles.
In addition, we analyze the characteristics of each article by selecting the reference articles.
In Chapter 6, we propose a method based on the word2vec algorithm to study user actions and
their temporal context. We construct the data that represents the user’s behavior as a sequence
of actions. We also insert time-interval information into the action sequence. Then, we obtain low-dimensional representations of actions together with inter-time information. To conduct an
interpretable analysis of human temporal behavior with this model, we propose the Action Timing
Context (ATC), which captures the temporal context of each action. The ATC is calculated using
the references that represent long-time and short-time intervals. Therefore, the word embedding
model in Chapter 6 learns from “non-text” data, and we use “references” to interpret the obtained results.
Table 2.2: Summary of the Analytical Methods With Word Embedding for the Social Science Research I
Research Topic | Pre-trained | Define variable | W/theory | W/reference | Non-text | Same words | Human subj | Clustering | Prediction
[43] Neuro. Sci ✓ ✓
[114] Cogni. Sci ✓
[33] Cogni. Sci ✓ ✓ ✓
[127] Communication ✓ ✓
[126] Communication ✓ ✓
[18] Communication ✓
[226] Finance ✓
[106] Finance
[131] Humanity ✓
[36] Linguistic ✓ ✓
[213] Manag. Sci ✓ ✓
[37] Manag. Sci ✓
[135] Manag. Sci ✓
[20] Manag. Sci ✓ ✓ ✓
[41] Manag. Sci ✓
[238] Manag. Sci ✓ ✓
[115] Manag. Sci ✓
[214] Mat. Sci ✓ ✓ ✓
[161] Cultural Studies ✓ ✓
[160] Cultural Studies ✓
[187] Poli. Sci ✓ ✓
[67] Poli. Sci ✓ ✓
[189] Poli. Sci ✓
[191] Poli. Sci ✓ ✓
[130] Poli. Sci
[8] Poli. Sci ✓
[164] Poli. Sci ✓
[118] Psychology ✓ ✓
[175] Psychology ✓
[26] Psychology ✓
[188] Psychology ✓ ✓ ✓
Table 2.3: Summary of the Analytical Methods With Word Embedding for the Social Science Research II
Research Topic | Pre-trained | Define variable | W/theory | W/reference | Non-text | Same words | Human subj | Clustering | Prediction
[204] Sci. metrics ✓
[98] Sci. metrics ✓
[65] General ✓ ✓
[109] Sociology ✓ ✓
[124] Sociology ✓ ✓ ✓
[99] Urban. Eng ✓ ✓
[158] Urban. Eng ✓ ✓
[116] Urban. Eng ✓
Chapter 4 Comp. Sci ✓ ✓ ✓ ✓
Chapter 6 and 7 Comp. Sci ✓ ✓ ✓ ✓
Chapter 3
Extracting Temporal Structures from User Behavior: The
Case of Consumption Behavior
This chapter discusses the method for extracting the temporal structure of human behavior on an
online platform. We propose a tensor-factorization-based method and study consumption, an aspect of homo economicus in human behavior (Footnote 1).
Understanding consumer behavior is an important task, not only for developing marketing
strategies, but also for the management of economic policies. Detecting consumption patterns,
however, is a high-dimensional problem in which various factors that would affect consumers’
behavior need to be considered, such as consumers’ demographics, circadian rhythm, seasonal
cycles.
We develop a method for extracting multi-timescale expenditure patterns of consumers from
a large dataset of scanned receipts. We use an NTF to detect intra- and inter-week consump-
tion patterns at one time. The proposed method allows us to characterize consumers based on
their consumption patterns that are correlated over different timescales. Our results show that our
method successfully embedded and captured the consumption behavior, and the temporal structure
extracted using our method reflects their consumption behavior.
Footnote 1: This chapter is a modified version of my article “Detecting multi-timescale consumption patterns from receipt data: a non-negative tensor factorization approach” [146], published in the Journal of Computational Social Science; it is reproduced here under a Creative Commons Attribution 4.0 International License.
In the rest of this chapter, we first discuss a machine learning model that decomposes tensor
data. Then, we construct a three-way tensor that represents the consumption data observed in a
smartphone bookkeeping app.
3.1 Introduction
Consumption has been extensively studied in multiple research disciplines, and their viewpoints differ from one another. Macroeconomists, for example, consider that individual consumer deci-
sions determine the economic condition at the macroscopic level [144]. In marketing studies, how-
ever, analyzing the shopping behavior of individual consumers is essential to acquire insight into
business strategy [15]. Researchers also study consumption at different time scales; economists of-
ten assume that representative individuals live infinitely long to investigate life-long consumption
paths, while business researchers are interested in shorter, practical time scales.
Many studies note that consumption patterns change in accordance with the consumer’s stage
of life [9, 101, 3]. Arguably, young people with children visit supermarkets more frequently than elderly people. The income level of an individual would also affect how often and how much they spend, and on what. Consumers with different demographic characteristics may, therefore, exhibit different dynamical patterns of expenditure, which leads us to believe that we could infer consumers’ demographic properties from their dynamical expenditure patterns.
To understand the consumption behavior of individuals with different demographic properties,
we propose an NTF model to detect multi-timescale patterns of consumers’ expenditure at intra-
and inter-week scales. We employ the PARAFAC decomposition to factorize a three-way tensor.
In our model, the(i, j,k)-th element of a tensor corresponds to the number of items purchased by
consumer i on jth day of week k. The NTF allows us to know how the intra-week expenditure
behavior is associated with the inter-week patterns and how many such multi-timescale patterns
exist. We argue that different multi-timescale patterns may come from different demographic
characteristics of consumers, such as gender, marital status, and age. This suggests that people in
different stages of life do indeed spend differently both at intra- and inter-week scales.
3.2 Consumption Behavior
Maximizing aggregate consumption is a primary goal for policymakers and is considered to con-
tribute to social welfare [230, 221]. Economists often model consumer behavior as a solution to a
utility maximization problem with infinite horizon [34, 108, 97, 221]. Using a formal framework
based on a utility maximization problem, economists have been discussing how consumers form
and follow consumption habits [4, 93], including whether or not such an explicit dynamical pat-
tern exists[57, 83, 35, 29, 47, 93]. Various studies also point out that consumption patterns tend to
change according to the consumer’s stage of life [9, 101, 3].
Marketing scientists study consumer behavior from a more business-oriented viewpoint. For
instance, they model the expenditure pattern of targeted consumers to predict the effect of a busi-
ness strategy, such as a recommendation system, on actual consumption [62]. Models of consumer
behavior in marketing studies incorporate various factors, including the structure of consumers’
network [190, 27], self-revealed information in social media [50, 202], and spatial information
regarding the consumer’s geographical location [219]. Among many factors that could explain the
observed consumption patterns, the sequence of temporal actions has been particularly studied to
understand consumers’ dynamic behavior [154, 155, 167, 17, 197]. A dynamical model has also
been used to predict consumers’ future activity [179]. Notably, some studies point out that there
are temporal patterns of shopping activity at the intra-week scale, i.e., day-of-week effects [110,
159, 22].
In this study, we employ an NTF method [121, 136] to uncover hidden patterns in our receipt
data. We represent consumers’ expenditure data as a 3-way tensor, which will be detailed in the
following section. NTF is widely used to mine temporal patterns in face-to-face contacts [66,
193], financial transactions [120], online communications [170] and online games [194]. Based
Table 3.1: Basic statistics of the receipt data
Attribute | Category | #users || Age cohort | #users
Gender | Female | 1,887 || 1 | 69
Gender | Male | 737 || 2 | 690
Marital status | Married | 1,628 || 3 | 824
Marital status | Unmarried | 996 || 4 | 673
Child | No children | 1,345 || 5 | 331
Child | With children | 1,279 || 6 | 137
Total #users: 2,624
Note: Basic statistics of the receipt data collected from Dr. Wallet between April 1, 2017 and January 21, 2018. The total number of purchased items is 2,796,008. Age cohorts are in ascending order, i.e., 1 and 6 denote the youngest and the oldest cohorts, respectively.
on the decomposed patterns from our consumption data, we show that consumers with different
demographics have different consumption patterns.
3.3 Data
Our dataset is constructed from the receipt data scanned through a bookkeeping smartphone ap-
plication Dr.Wallet [54]. This application allows users to digitize the record of their purchases
by scanning receipts using smartphones or tablet PCs. Item names listed in receipts are annotated
and documented by human workers. The dataset contains the prices, the name of each item and
the date when the receipt has been scanned. There are in total 2,796,008 purchased items recorded
by 2,624 users from April 1, 2017 to January 21, 2018. The data also contains the demographic
attributes of the users such as gender, marital status and age range. Table 3.1 shows the basic
statistics and the demography of users.
3.4 Methods
3.4.1 Tensor Representation of Consumption Expenditure
Our study aims to detect dynamical patterns from our shopping record dataset. To pursue this goal,
we use a non-negative tensor factorization (NTF) to obtain the latent factors that would reflect the
characteristic expenditure patterns across different attributes of consumers, which is the method
we discussed in Chapter 2.
Here, we extract multi-timescale patterns that would exist at intra- and inter-week scales [120].
We represent the users’ shopping records by a 3-way tensor of size I × J × K, where I = #consumers (= 2,624), J = #days in a week (= 7), and K = #weeks (= 42). The constructed 3-
way tensor is interpreted as representing a sequence of weekly bipartite networks in each of which
the nodes denoting the days of the week are connected to users with edge weights being the number
of purchased items (Figure 3.1).
Figure 3.1: Schematic of NTF for extracting intra- and inter-week expenditure patterns.
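A minimal sketch of representing purchases as a consumer × day-of-week × week count tensor and factorizing it with a non-negative PARAFAC decomposition, using the tensorly library. The data here are synthetic and the dimensions and rank (R = 3) only mirror the setup described in the text; this is not the dissertation's actual code.

```python
import numpy as np
import tensorly as tl
from tensorly.decomposition import non_negative_parafac

rng = np.random.default_rng(0)
n_users, n_days, n_weeks, R = 100, 7, 42, 3

# X[i, j, k] = number of items user i purchased on day-of-week j of week k (synthetic counts).
X = rng.poisson(lam=1.0, size=(n_users, n_days, n_weeks)).astype(float)

cp = non_negative_parafac(tl.tensor(X), rank=R, init="random", random_state=0)
A, B, C = cp.factors        # consumer, day-of-week, and weekly factor matrices
print(A.shape, B.shape, C.shape)   # (100, 3) (7, 3) (42, 3)
```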
3.5 Results
3.5.1 Core-Consistency
Figure 3.2: Core-consistency calculation to determine the number of components
Note We calculate the average core-consistency value over 20 runs of PARAFAC decomposition. Error bar
denotes 95% confidence interval. Horizontal dashed line denotes CC = 85.
We utilize the core consistency (CC) to select the rank size R, which is the method we discussed in Chapter 2. The CC values for our NTF results with different rank sizes R are shown in Figure 3.2. Since the solution of the PARAFAC decomposition is not unique due to randomly selected seeds, we run the decomposition algorithm 20 times for each R and calculate the mean of the CC value with the 95% confidence interval. The result indicates that R = 3 would be the best choice because the CC value is larger than a rule-of-thumb threshold (= 85) [193] up to R = 3 and turns negative for R = 4. Therefore, we set R = 3 in the following analysis. We have repeated this procedure multiple times and confirmed that the results presented in the rest of the chapter are qualitatively unaffected by the randomness of the seeds.
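A minimal sketch of the core consistency diagnostic (CORCONDIA) as commonly defined: refit a full Tucker core G with the PARAFAC factor matrices fixed and measure how close G is to the ideal superdiagonal core. This is an illustrative implementation for moderately sized tensors, not the dissertation's exact code.

```python
import numpy as np

def core_consistency(X, A, B, C):
    """Core consistency of a rank-R PARAFAC fit (A, B, C) to the 3-way tensor X."""
    R = A.shape[1]
    # Design matrix for X[i,j,k] ~ sum_{p,q,r} G[p,q,r] * A[i,p] * B[j,q] * C[k,r]
    design = np.einsum("ip,jq,kr->ijkpqr", A, B, C).reshape(X.size, R ** 3)
    g, *_ = np.linalg.lstsq(design, X.ravel(), rcond=None)
    G = g.reshape(R, R, R)
    T = np.zeros((R, R, R))
    for r in range(R):
        T[r, r, r] = 1.0  # the ideal PARAFAC core is superdiagonal
    return 100.0 * (1.0 - ((G - T) ** 2).sum() / R)

# Typical usage: run the decomposition repeatedly (e.g., 20 times) for each candidate
# rank, average core_consistency(X, A, B, C), and keep the largest rank whose mean
# CC stays above the rule-of-thumb threshold of 85.
```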
3.5.2 Multi-Timescale Expenditure Patterns
Figure 3.3: Multi-timescale consumption activity
Note: Activity at different timescales. (a) Day-of-week (i.e., intra-week) activity of each component; the activity of Component r on day j is given by b_{jr}. (b) Weekly (i.e., inter-week) activity; the activity of Component r in week k is given by c_{kr}.
We first examine whether the shopping activities have different dynamical patterns by looking at the components of the day-of-week and weekly activities. The r-th columns of the factor matrices B and C contain the day-of-week and weekly activity patterns of Component r, respectively. For R = 3, we find three distinctive day-of-week expenditure patterns from matrix B (Figure 3.3a). Each pattern is characterized by the days of the week on which activity is concentrated, namely weekdays, Saturday, or Sunday. This suggests that the users’ expenditure behavior during a week is characterized by one of these three patterns or a combination of them.
Similarly, weekly patterns can be extracted from C (Figure 3.3b). The activity level of Component 2 (i.e., the weekday-shopping pattern) is the highest among the three and relatively stable, except for the last 5 weeks, which correspond to the year end. The activities of Components 1 (i.e., the Sunday-shopping pattern) and 3 (i.e., the Saturday-shopping pattern) are lower than that of Component 2 throughout the data period, while the activity of Component 1 is a bit more volatile than that of Component 3.
3.5.3 Expenditure Patterns and Demographic Differences
Next, we examine if the temporal structure of consumption reflects demographic differences. To
this end, we group the users based on their activities and see if each group has a characteristic
demographic property. We use the factor matrix A obtained by the PARAFAC decomposition, which quantifies the belongingness of user i to each component, and apply the k-medoids and k-means clustering methods to it. We compare the two clustering methods with silhouette analysis [112]
in Figure 3.4 and Figure 3.5.
We find that the k-medoids method gives us more evenly sized clusters compared to the k-
means method (Figure 3.4 and Figure 3.5). The mean silhouette coefficients for the k-medoids
clustering are roughly the same across different numbers of clusters, which does not convey enough
information to determine the number of clusters. We select k = 5 clusters, judging from the fact that the rate at which the sum of distances between the points in a cluster and the medoid decreases slows down around k = 5 (Figure 3.6). In Section 3.5.4, we also show results in which the consumers are grouped based on a threshold value.
Note that each consumer is classified by the k-medoids into one of the five non-overlapping
groups based on their belongingness to each component quantified by matrix A. To visualize the
clustering result based on the k-medoids at the user level, we project the factor matrix A onto
two-dimensional space by exploiting the t-SNE embedding [142] (Figure 3.7). The t-SNE is a
visualization technique that allows us to convert high-dimensional data into low dimensional vec-
tors [142].
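A minimal sketch of this clustering and visualization step, assuming scikit-learn and the scikit-learn-extra package for k-medoids; the random factor matrix stands in for A and the library choices are illustrative, not the dissertation's exact code.

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn_extra.cluster import KMedoids  # from the scikit-learn-extra package

rng = np.random.default_rng(0)
A = rng.random((2624, 3))   # placeholder for the NTF consumer factor matrix A (R = 3)

labels = KMedoids(n_clusters=5, random_state=0).fit_predict(A)          # cluster affiliations
coords = TSNE(n_components=2, random_state=0).fit_transform(A)          # 2-D projection for plotting
print(np.bincount(labels), coords.shape)
```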
Figure 3.4: Silhouette analysis for the k-means clustering
Note: We conduct a silhouette analysis for the k-means clustering to determine the number of clusters. The number of clusters is annotated at the top of each panel. The red dotted line denotes the mean silhouette coefficient.
3.5.4 Characterizing Clusters Based on the Demographic Properties
Different multi-timescale expenditure patterns would reflect the users’ demographic characteris-
tics because the status of a consumer (i.e., age, gender, marital status, etc) might determine, at
least partially, the timing of shopping and the variety of items purchased. Here, we compare the
demographic characteristics among the five clusters identified by the k-medoids method.
Figure 3.8 indicates that each user cluster is characterized by some demographic properties.
Typical examples can be found from Cluster 1 and Cluster 4. Cluster 1 consists of relatively young
consumers having no children, while Cluster 4 appears to be formed mainly by married elderly
women who have children.
Figure 3.5: Silhouette analysis for the k-medoids clustering
Note: We conduct a silhouette analysis for the k-medoids clustering. The number of clusters is annotated at the top of each panel. The red dotted line denotes the mean silhouette coefficient.
We use the chi-squared test to see if the demographic distribution in
each cluster is significantly different from the null distribution obtained from the original demo-
graphic structure. The chi-squared statistic is given by the sum of squared differences between
the number of users identified by the k-medoids method and the expected number under the null
hypothesis:
\[ \chi^2 = \sum_m \sum_{\ell} \frac{(D_{\ell m} - E_{\ell m})^2}{E_{\ell m}}, \]
where D_{ℓm} denotes the observed number of consumers in category ℓ (i.e., Male, Female, etc.) for Cluster m, and E_{ℓm} is the expected number of consumers in category ℓ for Cluster m under the null [51].
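A minimal sketch of this test for a single cluster and attribute using SciPy; the observed counts are invented for illustration, while the null shares follow the overall gender split in Table 3.1.

```python
import numpy as np
from scipy.stats import chisquare

observed = np.array([120, 45])                  # hypothetical Female / Male counts in one cluster
overall_share = np.array([1887, 737]) / 2624    # null shares from the full sample (Table 3.1)
expected = overall_share * observed.sum()       # expected counts under the null hypothesis
print(chisquare(observed, f_exp=expected))      # chi-squared statistic and p-value
```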
The results from the chi-squared tests suggest that for each demographic attribute (i.e., gender,
age, marital status and child), the distribution of users identified by the clustering method is signif-
icantly different from the null distribution (p < 0.001).
Figure 3.6: The sum of distances between points
Note: We calculate the sum of distances between the points in a cluster and the medoid. We select k = 5 for the analysis.
We also test whether there is a statistical
difference in the distribution of users between two particular clusters. We conduct the statisti-
cal tests for all the pairwise combinations between different clusters. For all the demographic
attributes, the null hypothesis is rejected for most of the pairs of clusters (Table 3.2).
Lastly, we investigate what demographic factors characterize the expenditure patterns by focusing on representative users in each component, who are selected based on their belongingness to a component [120]. Since the representative users in a given component would share similar demographic characteristics, we could identify which component is associated with which demographic properties.
We detect R (= 3) groups of representative users according to the following threshold rule: User i is considered to belong to group r if a_{ir} / ∑_r a_{ir} ≥ h_r, where the threshold h_r is chosen such that only the upper 10 percent of users belong to group r. Figure 3.9 shows the demographic distributions of the representative users belonging to each component. We note that each user may belong to multiple components, but such overlap is quite small (Figure 3.10).
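A minimal sketch of the threshold rule above: normalize each user's loadings so they sum to one, then keep the top 10 percent of users per component. The function name and structure are illustrative assumptions.

```python
import numpy as np

def representative_users(A, top_share=0.10):
    """Return, for each component r, the indices of users with a_ir / sum_r a_ir >= h_r."""
    shares = A / A.sum(axis=1, keepdims=True)           # normalized loadings per user
    groups = []
    for r in range(A.shape[1]):
        h_r = np.quantile(shares[:, r], 1.0 - top_share)  # threshold capturing the top 10 percent
        groups.append(np.where(shares[:, r] >= h_r)[0])
    return groups
```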
Figure 3.7: Visualization of the low-dimensional representation of consumers by the NTF model
Note We plot the consumers based on the low-dimensional representation from the NTF model with the
cluster affiliations. We obtain the consumer representations (feature vectors) from factor matrix A and
visualize them through the t-Distributed Stochastic Neighbor Embedding (t-SNE).
We find that “Marital status” and “Child” are two demographic properties that distinguish
Component 2 (Weekday-shopping pattern) from the other components (Figure 3.9c and d). For
these two family-related attributes, the demographic distribution of the representative consumers
in Component 2 is clearly different from the null distribution. This finding suggests that “Mar-
ital status” and “Child” would be the two driving factors that yield the five clusters detected by
the k-medoids. On the other hand, the difference in user age between clusters seem to be more
reflected in the activity of Component 1 (Sunday-shopping pattern) and 3 (Saturday-shopping pat-
tern) rather than Component 2 (Figure 3.9b), while it is not clear for gender (Figure 3.9a). This
means that gender and user age may be less important in extracting the multi-timescale patterns
and the emergence of clusters classified by them.
Figure 3.8: Demographic distribution of each cluster.
Note We compare the difference of demographic attributes among clusters. (a) Gender, (b) Age range, from
1 (youngest) to 6 (oldest), (c) marital status, and (d) share of users who have or do not have children.
3.6 Conclusion
We presented an NTF-based method for extracting dynamical shopping patterns of consumers from
scanned receipt data collected through a bookkeeping application. The proposed method allows
us to find intra- and inter-week expenditure patterns simultaneously, which would be impossible without such a large, high-resolution, and long time-series dataset. We found three multi-timescale
patterns, each of which captures a characteristic expenditure behavior observed at daily and weekly
scales.
Figure 3.9: Demographic distribution of representative users
Note We compare the demographic distribution of representative users in each component. User i belongs to group r if a_ir / ∑_r a_ir ≥ h_r. (a) Gender, (b) Age range, from 1 (youngest) to 6 (oldest), (c) marital status, and (d) share of users who have or do not have a child.
Table 3.2: Chi-squared test for demographic difference between clusters.
Attribute Cluster X Cluster Y χ² Significance level
Age range 1 2 46.693 ****
Age range 1 3 105.398 ****
Age range 1 4 108.203 ****
Age range 1 5 48.561 ****
Age range 2 3 26.759 ****
Age range 2 4 47.835 ****
Age range 2 5 18.585 **
Age range 3 4 16.049 **
Age range 3 5 5.456
Age range 4 5 3.213
Child 1 2 31.310 ****
Child 1 3 121.161 ****
Child 1 4 179.896 ****
Child 1 5 70.022 ****
Child 2 3 37.733 ****
Child 2 4 92.999 ****
Child 2 5 29.783 ****
Child 3 4 24.574 ****
Child 3 5 3.788
Child 4 5 3.524
Gender 1 2 7.862 **
Gender 1 3 22.765 ****
Gender 1 4 24.737 ****
Gender 1 5 0.642
Gender 2 3 5.939 *
Gender 2 4 10.983 ***
Gender 2 5 0.274
Gender 3 4 2.116
Gender 3 5 3.673
Gender 4 5 7.818 **
Marital status 1 2 27.642 ****
Marital status 1 3 95.259 ****
Marital status 1 4 130.030 ****
Marital status 1 5 66.880 ****
Marital status 2 3 28.989 ****
Marital status 2 4 69.308 ****
Marital status 2 5 34.271 ****
Marital status 3 4 20.620 ****
Marital status 3 5 9.555 ***
Marital status 4 5 0.130
∗ p<0.1; ∗∗ p<0.05; ∗∗∗ p<0.01; ∗∗∗∗ p<0.001 (Bonferroni corrected).
Figure 3.10: The Jaccard index for the overlap of users
Note We calculate the Jaccard index for the overlap of users belonging to multiple components.
Chapter 4
Characterizing User Behavior on Digital Platforms and Its
Action Timing Intervals: The Case of Wikipedia Editing
This chapter discusses the embedding method for characterizing user behavior from high-dimensional
records. This chapter aims to provide a case study that uses a standard word embedding model
(Skip-Gram model). We use the skip-gram model to obtain the embedding representation of user
editing behavior in Wikipedia and discuss the characteristics of this behavior that we can extract
from such high-dimensional representations.
In this case study, we study the division of labor on an online platform for knowledge accumu-
lation. Division of labor has been an essential part of the production of not only physical but also
virtual products. Thanks to the division of labor, workers divide their tasks to yield products that could not be made alone. While such division of labor plays a pivotal role in production, it remains unclear which workers contribute to which parts of production, particularly for virtual products such as knowledge. Here, we analyze the knowledge accumulated through the
division of labor, developing an embedding method to build high-dimensional representations of
Wikipedia articles based on their editing history. We determine that, across multiple facets, the
produced articles are composed of inputs from different groups in terms of their editing behaviors
and interests even within the same article.
For the remainder of this chapter, we briefly discuss the method for constructing the low-
dimensional representation of user editing behavior from a massive record of Wikipedia editing
history. Then, we discuss the characteristics obtained by the embedding method and its importance
to understanding knowledge accumulation through collaboration, a key form of human behavior on online platforms.
4.1 Introduction
Knowledge has been a vital driver of modern society and has been accumulated through the divi-
sion of labor (i.e., collaboration). The strength of the division of labor in knowledge production
is its mutual complementarity; when the information transmitted by an individual is incomplete,
others can correct it. This division of labor has driven technological development in recent cen-
turies. Therefore, understanding what kind of knowledge is accumulated through this process is essential for understanding knowledge accumulation, a crucial public good.
Several studies have investigated the mechanisms behind the division of labor in knowledge
production, including how people collaborate and what factors foster collaboration [157, 178, 162,
180, 133]. Analyzing the dynamics of collaboration, for instance, can reveal how collaborators accumulate knowledge [209, 64, 233, 232]. Conflicts in collaboration and biases in the knowledge accumulated through collaboration are also of paramount interest [165, 132, 1, 201].
While many studies investigate collaboration by the division of labor in knowledge production,
not many analyze what part of the division of labor accounts for what parts of the accumulated
knowledge. This fundamental gap means that we know little about how the division of labor in production is reflected in the knowledge that is ultimately accumulated. Therefore, in
this study, we reverse engineer the knowledge accumulated by the division of labor into factors of
production.
In this chapter, we focus on the division of labor in Wikipedia editing, a free encyclopedia
where anyone can contribute by freely editing its articles. The Wikipedia editing system allows
users to add and delete information from articles, such as updating or correcting information. To
capture this “two-sided” nature of knowledge production in Wikipedia editing, we use an embed-
ding model to characterize these types of editing behavior using the revision history of articles. We
Figure 4.1: Schematic illustration of modeling “two-sided” nature in editing Wikipedia articles
Note: This figure illustrates the schematic of our model to obtain embedding vectors of Wikipedia articles.
(a) We first construct two nodes for each article, representing the addition edit and the deletion edit. Then, we record the correspondences between nodes (article and edit type) and users. (b) We collect those correspondences from the user histories and obtain the embedding model from them with the SGNS algorithm. As a result of this data preprocessing, we generate two embedding vectors for each article, representing addition and deletion edits.
illustrate a schematic of our analysis in Figure 4.1. With an embedding model, we generate two
different embedding vectors for each article: addition and deletion edit vectors. Our analysis with the embedding vectors determines that i) different user groups contribute addition and deletion edits, and ii) the editing of high-quality articles is achieved through a higher degree of division of labor compared to normal-quality articles. Our investigation also identifies that iii) users who divide their editing labor have different interests.
4.2 Method
This section describes the methods used to obtain and analyze the embedding vectors that represent
the two types of Wikipedia editing. We first review an embedding model for constructing the embedding vectors that represent the two sides of editing Wikipedia articles. Then, we explain
how we study the obtained embedding vectors of editing behavior to reveal the system behind the
knowledge production.
4.2.1 Embedding Vectors of Editing Behavior
As discussed in the introduction, this chapter highlights the two-sided nature of knowledge pro-
duction. To quantify this nature, we first construct the correspondence between users and articles
based on their editing behavior and build high-dimensional representations of the editing behavior
using a word embedding method (Figure 4.1).
4.2.1.1 Correspondence between Users and Articles
To depict the two-sided nature of Wikipedia editing, we construct a bipartite graph in which article
nodes and user nodes are connected by edges that represent editing behaviors. For the article side, we build two nodes for each article; one represents the addition edit while the other represents the deletion edit. This bipartite graph describes the correspondence between users and the articles on which they performed addition or deletion edits.
4.2.1.2 Embedding Vectors
We use a word embedding approach to construct an embedding for editing behavior using a Skip
Gram Negative Sampling (SGNS) model, which is discussed in Chapter 2. Particularly, we use the
word2vecf model [134], which allows us to use arbitrary correspondences between two entities. In its ordinary setting, word2vecf obtains word embedding vectors by learning the correspondences
among words in the text based on the dependency relations. Inspired by [220], we, instead of
modelling relationships between words, learn the relationship between users and articles from the
correspondence constructed in the previous subsection.
4.2.1.3 What Do the Embedding Vectors Mean?
We discuss the interpretation of the obtained embedding vectors from the method above. The
embedding vector of a given article represents that article's editing history based on the users' edits of that article. Therefore, when two given vectors are similar, the users who edit those two articles are similar. Because we split the editing history into deletion and addition edits, we can compare the
similarity between the addition and deletion edits of arbitrary pairs of articles. We can even compare the addition and deletion edits of the same article.
4.2.2 Similarity Graph Among Articles
To examine the connectivity among Wikipedia articles, we first construct a similarity graph using
the obtained embedding vectors, in which nodes represent the article edit types and edges represent
the similarity between nodes. We calculate the cosine similarities between all pairs of embedding
vectors of articles and add an edge between a pair if the cosine similarity of that pair is greater than a predetermined threshold. In the similarity graph, the nodes are connected when they
are edited (deletion/addition) by a similar group of users.
4.2.3 Entropy of Editing Behavior
To quantify the differences between addition and deletion edits in terms of what users edit, we measure the entropy of the edits a user makes, defined as

H̃_i = −∑_{j=1}^{m} p_ij log p_ij,    (4.1)

where p_ij is the frequency with which user i edits article j, and m is the total number of unique articles. This entropy represents the degree of disparity in a user's editing preferences. When a user prefers to edit diverse contents, H̃_i takes a large value. The extreme case is a user i who edits all articles with uniform effort (H̃_i = log n_i), where n_i is the number of edits she/he made. To account for the heterogeneity among users, we rescale H̃_i by H_i = H̃_i / log n_i so that the entropy falls between 0 and 1. We calculate H_i separately for the deletion and addition edits of each user to grasp the within-user difference in editing preferences between addition and deletion editing.
Table 4.1: Reference Article to Define Dimensions
Dimension Ref Article Pos Ref Article Neg Interpretation
Politics Democrat: Barack Obama Republican: George W. Bush Democrat
Perspective Science: Albert Einstein Religion: Catholic Church Science
Worldview New Tech: Bitcoin History: American Revolutionary War New Tech
Culture Sport: ATP Tour records Rap: Eminem Sport
Note: The seed articles define each dimension. The Interpretation column describes the meaning of a large dimension value for a given article. For example, when we find a large value in the Politics dimension of Article X, the users who edit Article X are relatively similar to the users who edit the “Barack Obama” article. In other words, this large value suggests that the characteristic of Article X is more Democratic than Republican, and vice versa.
4.2.4 Defining Dimensions of Editing Behavior
To reveal the interconnection between users and their preference of article editing, we define the
dimension on Wikipedia’s articles. Motivated by the literature on online community platform
analysis [220], we define four dimensions to study the users’ preferred type of article editing:
Politics, Perspective, Worldview, and Culture. For each dimension, we select two articles as the
reference points that span the dimension conflicting with each other.
We, for instance, select the Wikipedia article of Barack Obama and George W. Bush for the
Politics dimension. After selecting the two reference articles, we calculate, for each article, the
average embedding vector of its addition and deletion edits. Let the average vector of the addition and deletion edits for Barack Obama be v_obama, and for George W. Bush be v_bush. Following [220], we also automatically pick the 10 most similar vectors for v_obama and v_bush, respectively. Then, we calculate the average vector of these 11 vectors. By averaging v_obama and its ten most similar vectors, we generate the vector that represents Democratic partisanship, v_democ. By the same token, we generate v_repub, which represents Republican partisanship. Then, we position each editing embedding vector u on the political dimension by calculating the similarity difference d_poli(u) = cos(v_democ, u) − cos(v_repub, u). For example, when the addition vector of a given article X, u_add-X, has a high d_poli(u_add-X), it means that the users who made addition edits on Article X lean toward Democratic articles over Republican articles in their editing. We apply the same procedure for the other three dimensions and summarize the selected reference articles and their interpretations in Table 4.1.
4.3 Data
This section discusses the data we collect from Wikipedia and reports the preprocessing procedure.
4.3.1 Wikipedia Data
We collect the editing history of articles from Wikipedia using the MediaWiki API (https://www.mediawiki.org/wiki/API) through the Python client mwclient (https://github.com/mwclient/mwclient).
To construct the dataset that records the Wikipedia article editing history, we
first collect the revision history of each article on Wikipedia. Each revision history contains the
timestamp of that revision, the username of the user who made that revision, and the total size
of the article after that revision. The size of the article is the file size of the text of the article in
bytes. We collect all available article revision histories from the English version of Wikipedia. For
embedding vectors, we focus on the articles with at least 100 edits (5,128,971 articles in total) to
obtain robust embeddings.
4.3.2 Preprocessing Wikipedia Data
Now, we process the data collected from English Wikipedia. To label each revision with Addition
Edit or Deletion Edit, we calculate the difference in size between that revision and the previous re-
vision. Then, if the difference is positive, we label that revision as Addition Edit. Similarly, we use
a Deletion Edit label for revisions whose difference is negative. This manner of calculating size differences between revisions can be found in the revision history pages of articles on Wikipedia (https://en.wikipedia.org/wiki/Help:Page_history).
By this procedure, we obtain the addition and deletion editing history as illustrated in Figure 4.1.
Our processed data records whether each edit added or deleted information. It might be over-
simplified to interpret that addition editing always enriches content and deletion editing always
removes content. Merely formatting an article can decrease or increase the size of the article even if
its editing does not add or delete any particular content. However, we can interpret from our la-
beling that an Addition Edit adds information and a Deletion Edit deletes information because we
calculate the size in bytes. We note that we only use the size for labeling purposes. Thanks to this simplification, we can provide a unified analysis of the two-sided nature of knowledge production.
4.4 Results
This section reports the results of our analysis that demonstrates how the two-sided nature plays a
pivotal role in knowledge production. We first study the within-user differences by investigating
users’ deletion addition edits on Wikipedia. We also explore the differences between the two-sides
from an article-level perspective and study the connection between these differences and the article
quality. Then, we examine the characteristic differences between the two-sides.
4.4.1 Within-user Analysis
First, we study the differences between two-sides within a user. To do this, we compare the entropy
of editing behavior for additions with deletions using the equation 4.1. We present the result in
Table 4.2, which demonstrates that the deletion edit has a larger entropy than the addition edit
(6.5% more, p< 0.0001). The users on Wikipedia tend to commit a wider range of deletion edits
than addition edits, suggesting that, in knowledge production, the participants for the two-sides
(add/del) are not the same.
4.4.2 Between-Articles Analysis: Anatomy of Similarity Graph
Our user-level analysis determined that users do not uniformly commit addition and deletion edits.
To clarify this observation, we compare the similarity between articles using the similarity graph
in which similar nodes are connected. To study the similarity between articles, we calculate the
cosine similarity between the embedding vectors obtained using the method in Section 4.2.1. For
Table 4.2: Entropy Difference I
Entropy
Addition 0.739 (0.238)
Deletion 0.786 (0.236)
Difference [%] 0.047 [6.5%]****
Note: We calculate the entropy of the addition edits and the deletion edits for each user, and report the mean values for each group (standard deviation in parentheses). We also report the difference between the two groups [percent difference in square brackets]. We select the users who conducted addition and deletion edits more than once (n=4,377,872). ****: p-val< 0.0001 (Welch's t-test)
this analysis, we select the top 1000 popular articles in terms of the number of edits. Because we
construct the embedding vectors of both sides (addition and deletion edits) for each article, our
similarity graph contains 2000 nodes. The similarity between two vectors represents the similarity
of the users who commit those edits. When the addition edit of Article X and the deletion edit of
Article Y have common users, the cosine similarity of the embedding vector of the addition edit
and the embedding vector of the deletion edit would be high.
The constructed similarity graph probes the interconnection between the two sides of Wikipedia
editing (Figure 4.2a-c). We determine that the deletion edit nodes have a longer tail in their degree distribution compared to that of the addition edit nodes (Figure 4.2a). Figure 4.2 also employs the
dismantling procedure that removes nodes from the graph by highest-to-lowest degrees [12], and
we determined that removing deletion nodes decreases the connectivity of the network at a faster
rate than removing addition nodes, indicating that the deletion edit node connects more nodes than
addition nodes. This implies that deletion edit vectors are similar to the other vectors, and the
group of users who commit deletion edits is denser than those committing addition edits. Taken
together, these results demonstrate that the editing behavior on the deletion side is more collective.
The users on Wikipedia commit deletion edits on many more articles than addition edits, which is
consistent with the findings of the user-level analysis.
Figure 4.2: Anatomy of the similarity graph of deletion and addition edit nodes in Wikipedia
Note: Analyzing the similarity graph by the Wikipedia article editing vectors to reveal the characteristic
differences of the users engaged in different roles (a-c). (b). We calculate the similarity between nodes
from the top 1000 articles in Wikipedia and add edges when they are similar. We generate two nodes from
one article, representing addition (blue nodes) and deletion (red nodes) edits, as described in Figure 4.1.
The connection between two nodes means that similar users did those two editions. (a). To understand
the properties of the similarity graph, we first plot the degree distribution of the graph. The distribution
indicates that deletion nodes (orange) were characterized by a much longer tail than the addition nodes
(blue), indicating that a more similar group did the deletion edits compared to the addition edit. (c). We
also use the dismantling procedure to understand the connectivity among nodes. The dismantling procedure
removes the nodes with a high degree from the graph and calculates the diameter of that graph. The result
demonstrates that removing deletion nodes results in longer diameters, helping us to reveal that deletion
nodes are well connected.
4.4.3 Within-Article Analysis: Distribution of Division of Labor Index
The findings thus far suggest that addition and deletion edits are different in their participants.
To further understand this difference, we study if such differences are observed within the same
article.
To quantify the differences between the two sides in the same article, we calculate the cosine
similarity between the addition and deletion edits embedding vectors for each article. Because we
characterize our embedding models based on who edits which article of which side (addition or
deletion), a small cosine similarity, for instance, implies that the participants of the deletion and
addition edit are different despite those vectors being from the editing history of the same article.
We can also interpret this cosine similarity as an indicator representing the degree of division of
labor in Wikipedia article editing. A small cosine similarity means the editors divide their labor
into addition and deletion editing. For clarity, we hereby use the term “Division of Labor Index
(DLI)” to refer to the inverse cosine similarity between the deletion and addition edit embedding
vectors, in which a large DLI of a given article means users who committed addition and deletion
edits on that article are different.
To grasp the degree of division of labor across the “production” of Wikipedia articles, we plot
the distribution of the DLI in Figure 4.3. The distribution demonstrates that most articles show
negative DLIs, meaning that the participants of deletions and additions are similar. However,
the histogram is not symmetrical, and this hints that the plotted distribution consists of several
components.
To test this hypothesis, we estimate a Gaussian mixture distribution from the data, which assumes that multiple Gaussian distributions are mixed. Figure 4.3 plots the estimated mixture
Gaussian distribution (dotted lines) and the five different estimated distributions (solid lines); in
which we select the number of distributions (i.e., number of components) based on the Bayesian
information criterion (Figure 4.4). One striking result observed from this mixture distribution is
that there is a distribution that covers the positive area (the solid orange line). Because a high DLI
implies that the participants of addition and deletion edits are different, the distribution depicted
with the orange line suggests the existence of articles produced using a high division of labor.
We also study the time interval distributions between edits in the top and bottom 1000 articles in
the DLI using the method proposed by [166], which fits the time interval distribution by the mixture
of exponential distributions. We determine that the time interval distribution of the bottom articles
consists of 3 distributions, whereas that of the top articles consists of 4 distributions, suggesting that the editing behavior of high-DLI articles is more diverse than that of low-DLI articles.
4.4.4 Within-Article Analysis: Division of Labor Index and Quality of Arti-
cles
The distribution of DLI discussed in Section 4.4.3 suggests that the degree of the division of labor
in Wikipedia article editing is not uniform across the data. This result raises an intriguing question:
Figure 4.3: Distribution of the Division of Labor Index (DLI) of Wikipedia editing: I
Note: The histogram of the Division of Labor Index (DLI). DLI is an inverse cosine similarity between the
addition and deletion edits of the same article. A positive value for a given article indicates that the users who participated in the deletion edits are different from the users in the addition edits, and vice versa. The solid
lines are the Gaussian distribution obtained from the estimated mixture distribution, and the dotted line is
the mixture distribution.
is the division of labor related to the quality of an article? To test this, we examine the relationship
between DLIs and article quality by inspecting high-quality articles on Wikipedia.
For selecting high-quality articles, we use the annotations Wikipedia provides: “Featured articles” and “Good articles” (https://en.wikipedia.org/wiki/Wikipedia:Quality_articles), which are assigned by editors on Wikipedia according to objective and well-defined criteria (https://en.wikipedia.org/wiki/Wikipedia:Good_article_criteria; https://en.wikipedia.org/wiki/Wikipedia:Featured_article_criteria). Using these annotations, we compare the differences in DLI among Featured articles, Good articles, and others.
Figure 4.5 presents that the Featured articles and Good articles have higher DLIs than Normal
articles, suggesting that the high-quality articles tend to have different users in their addition and
deletion edits. This result implies that the division of labor plays a pivotal role in generating
high-quality articles. However, we do not see a meaningful difference in DLI between Featured
and Good articles. Given that the articles annotated as Featured articles are top-notch articles on Wikipedia (Featured articles account for only about 0.09% of all articles, and Good articles for about 0.55%), our finding suggests that the division of labor is vital for generating high-quality articles, but other factors make them the best of the best.
Figure 4.4: Information Criteria
Note: The Bayesian information criterion (BIC) for each number of components. We calculate the BIC of the Gaussian mixture estimation for different numbers of components (from 1 to 10), and then we select the best result, where the number of components is 5.
4.4.5 Characterizing the Differences Between the Two-Sides
We discovered that users divide their labor when editing Wikipedia articles, demonstrating that
different users commit addition and deletion edits. This divided editor pool suggests that users who participate in deletion and addition edits have different interests. To test this, we identify the position of deletion and addition edits along the dimensions defined in Section 4.2.4.
Using the dimensions defined in Section 4.2.4, we systematically compare the characteristic differences and obtain the four characteristics of each editing behavior on a given article. First, we determine that the dimensions are correlated (Figure 4.6). For exam-
ple, the Culture dimension is well correlated with the other dimensions, and we see a positive
correlation with the Politics and Worldview dimensions and a negative correlation with the Perspective dimension (all p< 0.0001). We also note that these correlations between dimensions are not
homogeneous; some dimensions have relatively weak correlations with each other. The Politics
dimension, for example, is weakly correlated with the Worldview and Perspective dimensions.
We then calculate the mean differences between the two sides for each dimension and find that, as in the correlation analysis, they are not homogeneous (Table 4.3). The differences in the
Figure 4.5: Imbalance of editing and its quality: I
Note: We calculate the DLI for each article. A large DLI for an article means a high degree of division of labor in editing, i.e., different groups of users participate in the deletion and addition edits of that article. Then we compare the mean differences in DLI among the three groups based on the article quality (Featured, Good,
and Normal). Compared with the normal articles, we find that the high-quality articles are produced by a
high degree of division of labor.; ****: p-val< 0.0001, ns: p-value> 0.05 (Welch’s t-test)
Politics and Culture dimensions are more evident than those of Perspective or Worldview. The
addition edit prefers articles with Republican characteristics to those with Democratic characteristics, compared to the deletion edit (68.96% more Republican, p< 0.0001). Similarly, the addition edit prefers articles with sports characteristics to those with rap characteristics, compared to the deletion edit (226.68% difference, p< 0.0001). In addition, the differences in the Worldview and Perspective dimensions are smaller (30.84% and 20.94%, respectively).
4.5 Wikipedia Editing Behavior
This section reviews the related research. We briefly discuss the analysis of the Wikipedia platform
in the context of this chapter. Then, we review the studies on the user behaviors on Wikipedia.
Lastly, we review the recent application of embedding methods to understand user behavior on
digital platforms.
Table 4.3: Dimension Differences between Addition Edit and Deletion Edit
Axis Name Addition Edit Deletion Edit Mean Difference (% diff)
Politics -0.00756 -0.00447 -0.00308 (68.96%)****
Worldview 0.01857 0.02685 -0.00828 (30.84%)****
Perspective -0.03079 -0.03895 0.00816 (20.94%)****
Culture 0.02643 0.00809 0.01834 (226.68% )****
Note: The mean difference of the attribution values between addition and deletion vectors. We calculate the four attribution values for each edit node as described in Section 4.2.4. Then, we calculate the mean difference in each attribution value between addition and deletion edits, along with the absolute percent difference in parentheses, computed as the absolute value of the mean difference divided by the mean value of the deletion edit. We also conduct a statistical test for the mean difference of each attribution. ****: p-val< 0.0001 (Welch's t-test)
Table 4.4: Entropy Difference II
Entropy
Addition 0.757 (0.230)
Deletion 0.822 (0.225)
Difference [%] 0.066 [8.68%]****
Note: We calculate the entropy of the addition edits and the deletion edits for each user, and report the mean values for each group (standard deviation in parentheses). We also report the difference between the two groups [percent difference in square brackets]. We select non-bot users who conducted addition and deletion edits more than once with edit sizes greater than 10 bytes (n=2,350,550). ****: p-val< 0.0001 (Welch's t-test)
Wikipedia has gathered attention from researchers as a place for participants to collaborate and accumulate knowledge. This function of Wikipedia is often referred to as the “wisdom of the
crowds” [5, 119]. Different approaches have been taken to understand the performance of the
wisdom of the crowds and the quality of its output.
For example, several studies demonstrate that the editing history of a Wikipedia article, such
as revision history, is associated with the quality of that article [183, 200], and discussions among
editors also play a pivotal role [217]. In this line of research, it is shown that network models,
such as signed networks, can capture the interaction between Wikipedia editors [1, 61, 105] well
and can predict the quality of articles [143]. The discussion on Wikipedia editing is often modeled
using opinion dynamics [44].
Understanding the formation of teams working in Wikipedia article editing and the “crowds”
part has also been an essential topic in the literature toward improving the quality of team orga-
nization and Wikipedia articles [178, 162, 180, 133]. Notably, the characteristics of teams in Wikipedia can mitigate undesirable outcomes. For example, several studies show that the diversity
of teams can alleviate polarization [132, 1, 201], but working in teams does not mean being free
from biases such as in-group biases [165]. In addition, opinion dynamic models can reveal the
decision-making process in groups [209].
User-level analysis of the participants in knowledge production has also received some attention in the literature, for example, in taxonomies of user roles [177, 156, 49, 119]. A seminal work on this topic is Kittur et al. (2007), which demonstrates user-role transitions in Wikipedia. Studies of user behavior on Wikipedia also reveal the factors that decrease users' motivation for editing [88, 87].
The literature is also interested in a phenomenon specific to the Wikipedia platform. “Edit
wars” is, for instance, a topic that researchers pay attention to. This refers to a phenomenon in
which users repeatedly edit the content of a particular article, often on controversial topics [25, 125, 104].
4.6 Conclusion
This chapter unveiled the division of labor in knowledge production. We modeled the two-sided na-
ture of editing on Wikipedia, which showed that the labor supply of the two-sides is not uniform in
different facets. The within-user study informed us that Wikipedia users are more likely to commit
a broader range of deletion edits than addition edits. The similarity graph analysis demonstrated
that similar user groups conduct more deletion edits than addition edits on popular articles. We
also identified the characteristic differences between the two-sides. Overall, our results highlight
the existence of the division of labor between the two. To understand the importance of division
of labor, we studied the relationship between the DLI and the quality of articles. The editors who
participated in high-quality articles divided their labor between addition and deletion edits. These results highlight the importance of considering the division of labor to understand the quality of knowledge production.
Figure 4.6: Correlation between dimension values
Note: R represents Pearson's correlation coefficient between two attribution values; all p-values for each pair of dimensions are < 0.0001. We calculate the attribution values from the deletion and addition edit vectors and examine the correlations among them. Among the four attributions, the cultural attribution (Sport vs Rap) is the most correlated with the others.
Figure 4.7: Anatomy of the similarity graph of deletion and addition edit nodes in Wikipedia II
Note: We conduct the same analysis as in Figure 4.2 after removing bots and small edits (edit size less than 10 bytes). This analysis returns qualitatively similar results to Figure 4.2. (a) The distribution indicates that deletion nodes (orange) are characterized by a much longer tail than the addition nodes (blue), indicating that a more similar group did the deletion edits compared to the addition edits. (c) The dismantling procedure demonstrates that removing deletion nodes results in longer diameters, helping us to reveal that deletion nodes are well connected.
Figure 4.8: Distribution of the Division of Labor Index (DLI) of Wikipedia editing II
Note: The histogram of the Division of Labor Index (DLI) constructed from the edit history data after removing bots and minor edits (edit size less than 10 bytes), as in Figure 4.3. Compared to Figure 4.3, we find that far fewer articles have a positive DLI. While the additional within-user analysis (Table 4.4) and between-article analysis (Figure 4.7) show similar results to the main analysis, this comparison suggests that very high DLIs can be produced by bots or small edits.
Figure 4.9: Imbalance of editing and its quality II
Note: We conduct the same analysis as in Figure 4.5 after removing bots and small edits (edit size less than 10 bytes). Compared to Figure 4.5, we find that the gap between the high-quality articles and the normal articles becomes narrower because of the absence of high-DLI articles, as discussed in Figure 4.8. However, the differences are still statistically significant and not negligible. The average DLIs of Featured, Good, and Normal articles are -0.564, -0.574, and -0.591, respectively. The percentage difference of Featured vs Normal is 4.63% and that of Good vs Normal is 2.83%. ****: p-val< 0.0001 (Welch's t-test)
Chapter 5
Unbiasing Session Analysis Using the Distributions of
Individual Time Intervals
This chapter discusses a method for grouping the user behavior using sessions. In this chapter, we
propose a method for setting the threshold to make sessions that consider the action time interval
differences among users.
A session is one of the most commonly used concepts in web science. It enables us to analyze
the temporal behavior of users by grouping their consecutive and discrete behavioral traces into a
single group. When creating a session, the end of a session is defined as the point at which a user’s
behavior is no longer observed for “a certain period of time,” and many previous studies define the
threshold for such a period of time from the aggregated distribution. However, in reality,
users have different behavioral time intervals, and determining the threshold from the aggregate
distribution introduces bias into the results. This study notes that session analysis can be biased
when users’ behavioral intervals follow different distributions and proposes a threshold-setting
method based on a mixture distribution to mitigate the bias.
For the remainder of this chapter, we discuss the background of the study and the method that
estimates the time interval differences between actions. After discussing our method with synthetic
data, we conduct an empirical analysis with four real-world datasets.
5.1 Introduction
Mining user behavior from their digital traces has been one of the most critical tasks in web science,
delivering invaluable insights on human behavior both to sciences and businesses. Because data
logging systems record user behaviors as sequences of actions, such as click streaming, many
researchers have been devoting their effort to model user behaviors from discrete data. While
time-series analysis studies users’ long-term behavior, one-shot analyses, such as recommendation
systems, predict users’ next action in the short term. In addition to these two large research areas,
it is of paramount interest to conduct a mid-term analysis such as how users move when they visit
a website.
In such middle-term analyses, the concept of a session plays a pivotal role. A session is a group
of consecutive actions without a long-term break [196, 82, 205, 76]. Sessions were introduced for the log analysis of web servers [42, 76, 205] and are widely used for data mining on various topics.
Dismantling the contents of sessions uncovers the driving sources of user engagement and provides a deeper understanding of user preferences, for example, in online games [148, 196] and social networking
sites [122, 123, 203]. To generate sessions from sequence-logged actions, most studies use a
predefined threshold to find a “long-term break” (Fig 5.1). Such sessions allow us to investigate
users’ middle-term dynamic behavior. For example, the session length of smartphone apps and
web services can reflect user engagement.
Preprocessing sessions from the data is a crucial part of session analysis. The threshold defines
a “long-term break” that governs a session's length and which actions belong to which session.
However, despite this importance, there is no stylized method to determine this threshold. Most
studies use a rule of thumb that a threshold predetermined on the aggregated data applies to all
users, such as the 50th percentile of the aggregated users' time interval distribution (we name this rule the “global threshold” for simplicity). Following such a convention implicitly imposes several limitations on the analysis.
Using a single threshold predetermined on the aggregated time interval distribution ignores
the heterogeneity among users (Fig 5.2). Each user can have different allocations for their time
Table 5.1: Glossary of Terms: Chapter 5
Terminology Description
Session A group of consecutive actions
Time interval A time gap between two consecutive actions
Session threshold The threshold used to make sessions, as illustrated in Figure 5.1
Global threshold The session threshold predetermined on the aggregated interval
distribution
Individual threshold The session thresholds determined based on the estimated mixture
distribution by the proposed method
interval and a different number of observation points. This could easily happen for a session
analysis on web applications in which users can have different devices (PC vs. smartphone) or contexts (frequent vs. occasional users). In addition, because of the lack of a statistical procedure, the
results of session analysis with the global threshold may not be robust. The shape of aggregated
time interval distribution depends on its sample size and the number of observations for each user.
For these reasons, a statistical method of threshold determination is necessary.
To resolve the aforementioned issues, this chapter proposes a statistical model of individual
time intervals for session analysis. Our model groups users based on the exponential mixture
distributions, assuming that different users can have different time interval distributions. With
the estimated distribution, we modify the global threshold rule in which we use different thresh-
olds based on users’ time interval distributions. To evaluate our method, this chapter focuses on
studying user engagement using session analysis. We first conduct an engagement analysis with sessions using synthetic data, which suggests that session analysis with the global threshold can be biased when different users' time intervals follow different distributions, and that the proposed method can mitigate these biases. Next, we apply our model to four real datasets
from both digital platforms and real places. Our empirical analysis yields consistent results with
our analysis of synthetic data. Session analysis with the global threshold rule may not evaluate engagement properly because it ignores user heterogeneity. Our method can assess user engagement fairly by considering individual differences in action intervals. To keep our chapter
concise and easy to follow, Table 5.1 summarizes the terms used in this study.
Figure 5.1: Schematic: Sessions with 1 min threshold
The blue dots represent actions (actions A and B), and the black arrows represent time intervals. User A
continued to act without an interval of more than the session time threshold (1 minute), and therefore, all
six actions are in the same session. User B has two sessions because User B had a two-minute time interval
(more than the session time threshold) between the second and third action.
5.2 Method
To overcome the issues discussed in the introduction, we propose a statistical method for modelling
user action time intervals, considering the heterogeneity among users. In our model, we allow in-
dividuals to follow different exponential distributions in their action time intervals. To model these
individual differences, our model employs a mixture of exponential distributions and estimates the
distribution that each individual follows. Then, based on the estimated distributions, we determine
the thresholds for session analysis such that the users affiliated to the same distribution have the
same threshold.
Mixture distributions are popular for modelling traffic waiting times, such as in mobility traffic [46] and computer network traffic [58], but they are also widely applicable to other human behavior data [229]. In terms of web science applications, Ali and Scarr (2007) use mixture distributions for robust click distribution modeling.
Figure 5.2: Schematic: Biased threshold on aggregated data
A threshold determined on the aggregated distribution can be biased when different users prefer different time interval lengths (fast and slow). The figure assumes that two types of users exist: fast vs. slow users. The dotted lines in the left-side graph represent the 50% point of each group's distribution. The left figure shows that the global threshold (determined on the aggregated distribution) is larger than the threshold for the Fast group (1), while it is smaller than that of the Slow group (2).
5.2.1 Exponential Mixture Distribution of User Time Interval
To estimate the time interval distribution of users, we use the mixture of exponential distributions
in which multiple exponential distributions are mixed in observed data. This mixture distribu-
tion allows us to assume that different users can have different distributions for their session time
interval.
5.2.1.1 Modeling user time interval
We assume that the probability of observing the time intervals of user i, y_i = (y_i1, ..., y_ir), is

f(y_i; λ) = ∏_{l=1}^{r} P(y_il; λ),    (5.1)

where P is the probability density function of the exponential distribution and λ is its parameter. Here we assume that each component of y_i is independent. Our assumption is then that each user has a different parameter λ rather than all users sharing the same parameter.
In particular, we assume that user i's time intervals follow one of K distinct subpopulations; that is, user i's exponential parameter is one of λ_1, ..., λ_K. This formulation allows us to know which user follows which distribution.
As mentioned, we assume that the observed user time intervals are an aggregation of different exponential distributions. We describe this aggregated distribution as the exponential mixture distribution:

f(y; ψ_K) = ∏_{i=1}^{N} ∑_{k=1}^{K} π_k f_k(y_i; λ_k),    (5.2)

where ψ_K = (λ_1, ..., λ_K, π_1, ..., π_K). The mixture parameters π_k represent the ratio of users that follow each distribution; therefore, π_k satisfies π_k ≥ 0 and ∑_{k=1}^{K} π_k = 1.
5.2.1.2 Estimation by EM Algorithm
To estimate the parameters of the mixture distribution ψ_K, we use the expectation-maximization (EM) algorithm [198]. At the E step during the t-th iteration, the EM algorithm calculates

h_ki^(t) = π_k f_k(y_i; λ_k^(t)) / ∑_{k=1}^{K} π_k f_k(y_i; λ_k^(t)),    (5.3)

where h_ki^(t) represents the conditional probability that observation y_i arises from component k. At the M step, we update the parameters of the mixture distribution, λ_k^(t+1) and π_k^(t+1), as

π_k^(t+1) = (1/N) ∑_{i=1}^{N} h_ki^(t)    (5.4)

and

λ_k^(t+1) = r ∑_{i=1}^{N} h_ki^(t) / ∑_{i=1}^{N} h_ki^(t) y_i·.    (5.5)

After estimating the parameters ψ_K, we calculate the Bayesian Information Criterion (BIC) for model selection,

BIC(K) = −log f(y; ψ_K) + K log(N).    (5.6)
5.2.2 Determining Session Thresholds
After estimating the mixture distribution of users' time intervals, we determine thresholds using the cumulative distribution. For example, if a user belongs to component k, because their parameter is λ_k, the cumulative distribution function (CDF) of the exponential distribution is F_k(y) = 1 − e^(−λ_k y). With this CDF, we determine the threshold that covers a certain percentile of that component's distribution. That is, this method can be interpreted as a way to create multiple thresholds, applying the global threshold rule separately to each user's estimated time interval distribution. In this chapter, we set the 50th percentile of the distribution as the threshold.
5.2.3 Biased Engagement Analysis
This chapter focuses on evaluating user engagement by session length. The session length of a
given session is the number of actions that session contains. Many applications of session analysis
use the session length to represent the engagement of the users; the longer the session length, the
higher the engagement. However, as discussed in the introduction, the length of the session highly
depends on the threshold used to make sessions.
Assume that User A prefers a slower pace of action, but in some sequences of their actions, User A shows high engagement. However, when a session threshold is determined based on the aggregated data, in which users who prefer fast-paced actions also exist, that predetermined threshold would be too short for User A. Therefore, User A may be considered to have low engagement in
all sessions in the analysis with the threshold determined on the aggregated data. We provide a
schematic as Figure 5.2.
5.2.4 Data
In the rest of the chapter, we conduct the same analysis as in Section 5.3.1 with real datasets. Table 5.2 summarizes the datasets used in the rest of the chapter. We use four different real-world datasets: two smartphone app usage datasets and two physical movement datasets.
Figure 5.3: Time intervals
For each dataset, we focus on one hundred consecutive data points per user and exclude users who are observed in the dataset fewer than one hundred times. Eventually, the numbers of users we study are as follows: Carat: 889; OneWeekApp: 845; WS-16: 124; IC2S2-17: 215. We also min-max normalize the time intervals after excluding intervals longer than a half-day (12 hours) break.
Table 5.2: Datasets for empirical analysis
Name of dataset Description
Carat Long-term app usage tracking dataset [168]
OneWeekApp One-week app usage dataset [234]
WS-16 Human contact tracing dataset at WS-16 conference [68]
IC2S2-17 Human contact tracing dataset at IC2S2-17 conference [68]
Note: The descriptions of the real-world datasets for empirical analysis. The first two datasets contain application usage data, whereas the last two are human contact datasets.
5.3 Results
This section presents the session analysis using our proposed method. As mentioned in the pre-
vious sections, we focus on engagement analysis with sessions. This section first conducts an
analysis with synthetic data to demonstrate that ignoring action timing preferences may
Figure 5.4: Engagement analysis: synthetic data
Note: The engagement analysis using sessions with the synthetic data. We first generate 100 users whose time intervals follow one of four exponential distributions (25 users each, 100 data points per user, λ = 1, 3, 5, or 10). Then we employ our method to cluster the users and calculate the engagement, making sessions for each threshold.
Table 5.3: The comparison of engagement distributions
Dataset #Groups (#comparisons) Global threshold Individual threshold
Carat 6 groups (15 comparisons) 10 out of 15(average: 8.96, std: 15.21) 3 out of 15 (average: 3.38, std: 1.06)
OneWeekApp 4 groups (6 comparisons) 5 out of 6 (average: 2.52, std: 0.69) 3 out of 6 (average: 6.54, std: 1.03)
WS-16 4 groups (6 comparisons) 4 out of 6 (average: 12.37, std: 12.40) 3 out of 6 (average: 5.58, std: 2.00)
IC2S2-17 7 groups (21 comparisons) 9 out of 21 (average: 9.98, std: 13.64) 6 out of 21 (average: 5.90, std: 1.95)
Note: We compare the mean values of the engagement (session length) across the groups. The groups are based on the estimated time interval distribution. We compare the group means for each session-analysis method (the global threshold and the proposed method) and, for each method, count the statistically significant comparisons (Welch's t-test, p-value< 0.01, Bonferroni corrected). We also report the mean value and standard deviation across the groups.
cause a biased engagement analysis, and our proposed method can debias such results. We then
apply our method to the real-world datasets to show the practicality of our approach.
5.3.1 Analysis With Synthetic Data
As in the previous sections, we argue for the importance of considering action timing preferences
for engagement analysis by session analysis. To demonstrate this, we generate synthetic data for
time interval distributions and conduct engagement analysis with two different thresholds. We
generate 100 time intervals for 100 users (10,000 data-points) in which a user’s time intervals
(a) Carat (b) OneWeekApp
(c) WS16 (d) IC2S2-17
Figure 5.5: The comparisons of the engagement distribution
Note: The distributions of the engagement for each method. We plot the engagement distribution of the
sessions with the global threshold (Global) and with the proposed individual time interval estimation method
(Individual). The number of groups is determined based on the BIC (Eq. 5.6). The descriptions of the datasets are in Table 5.2. We study the digital platform data (a and b) and the physical contact data (c and d).
follow one of four different exponential distributions (λ = 1, 3, 5, or 10). We then, for each user,
make two different sets of sessions using two different thresholds. That is, for each user, we make
the sessions with the global threshold and the session with the individual threshold estimated using
our proposed method. Lastly, we calculate the engagement of users, measuring the session length.
Figure 5.4 demonstrates how the threshold affects the results of the engagement analysis, comparing the engagement distributions under the two different thresholds: the Global and Individual thresholds. The figures show that the engagement analysis with the global threshold depends on users' time intervals. Users with longer time intervals are classified as having low engagement even though their time intervals are drawn from a different distribution. All groups have different mean values for their engagement (Welch's t-test for mean values, Bonferroni corrected, p-value < 0.01). For example, Figure 5.3 shows that Group 4 has the longest time interval preference, but Figure 5.4 (Global) shows that its engagement is the lowest of the four groups when we use the Global threshold. However, as stated in Section 5.2.3, this result ignores the differences in time interval distributions. We then apply our method to make thresholds that consider individual time interval differences. Figure 5.4 (Individual) shows that the engagement distributions become similar (Welch's t-test for mean values, p-value > 0.15).
5.3.2 Analysis With Real-World Datasets
As discussed in the analysis with the synthetic data, ignoring action timing preferences can result in a biased engagement analysis. To study this, we apply the global threshold rule to
calculate the engagement, as in Section 5.3.1.
Figure 5.5 compares the engagement distribution of the two different thresholds, as in Fig-
ure 5.4. We see similar findings in Figure 5.5, in which the engagement distributions under the global threshold differ across groups, whereas those under the individual threshold are similar. We also compare the mean differences in engagement to see how many pairs of groups have different mean engagement values, counting the number of pairs with different means at p-value < 0.01. We present
the results in Table 5.3.
For all data sets, both in digital behavior data and physical behavior data, Table 5.3 reports
that the global threshold generates more groups that have different engagement means, and the
individual threshold mitigates these differences, which is consistent with our analysis with the synthetic data in Section 5.3.1. We calculate the average engagement for each group and the standard deviation among the groups. Table 5.3 demonstrates that the groups with the global
threshold tend to have high variances, which suggests that the engagement analysis with the global
threshold is vulnerable to the action timing differences.
The takeaways of our findings are twofold. First, the engagement analysis without considering
action interval differences may distort the results. Such an analysis classifies some users who prefer longer action time intervals as low-engagement users because the global threshold is determined on the aggregated data, in which there are also users who prefer shorter action time intervals. Second, our proposed method, which considers individual differences in time intervals, can debias such results.
5.4 Conclusion
In this chapter, we first demonstrated that session analysis without considering action time preferences can distort session-based engagement analysis. To settle this issue, we proposed an exponential mixture distribution to group users based on their action timing. Our experiments, both with synthetic and real datasets, show that our method can rebalance the distorted results. While our method solves the problem detected in this chapter, there are remaining issues that suggest further studies. For example, our simple solution does not consider sources of heterogeneity other than action timing preferences. Among the many potential directions for future work, we would like to develop our model such that it considers performance differences or estimates the distributions online.
Chapter 6
User Action Embedding With Inter-Time Information
This chapter proposes a method for analyzing user action behavior with inter-time information
from high dimensional data. We utilize the embedding method and estimate the time-interval
between actions to obtain the low-dimensional representation of actions.
With the recent development of relevant technology, data on detailed human temporal behav-
iors has become available. Many methods have been proposed to mine human dynamic behavior
data and have revealed valuable insights for research and business. However, most methods only
focus on sequences of actions and do not study the inter-temporal information, such as the time
intervals between actions, in a holistic manner. While actions and action time intervals are inter-
dependent, it is challenging to integrate them because they have different natures: time and action.
To overcome this challenge, we propose a unified method that analyzes user actions with inter-
temporal information (time interval). We embed the user’s action sequence and its time intervals
simultaneously to obtain a low-dimensional representation of the action along with inter-temporal
information. The proposed method enables us to characterize user actions in terms of their temporal context, which we demonstrate using three real-world datasets. This study shows that explicitly modeling action sequences together with inter-temporal information about user behavior enables a successful, interpretable analysis.
In the remainder of this chapter, we propose the method that embeds user actions and inter-temporal information and analyzes dynamic user actions in their temporal context. Then, we apply our proposed method to several real-world datasets and reveal the temporal context of user behaviors hidden in high-dimensional data.
6.1 Introduction
Human temporal behaviors contain a wide range of valuable information. Mining human physical,
social, and economic dynamic behaviors has contributed greatly to our understanding of human
mobility, social phenomena, and consumption behavior. However, most human dynamic behavior
models focus only on the sequence of user actions. For example, statistical time-series models
study the variation of values in the data over the course of time. While this analysis is natural in many contexts, it does not explicitly model or integrate inter-time information, such as the time intervals between actions. Despite the importance of time interval information for human behavior analysis, existing user models in computer science primarily focus on sequences of actions and do not integrate time-interval information into the actions themselves.
Time intervals between actions provide crucial information about the context of those actions.
Much of the literature shows that time intervals between actions can determine human cognitive
states [206, 78, 228, 111]. The most famous theory in the literature and popular science is “Dual-
process theory,” introduced by [78] and developed by [111]. In addition to the cognitive process
that generates action time intervals, the statistical properties of time interval distributions have
attracted attention from researchers, especially in network science [95, 145, 215, 12, 169].
To consider the context when mining and interpreting actions, a model that incorporates the
information of time intervals into action is needed. An obstruction that prevents us from modeling
user actions and their time intervals in a holistic manner is that actions and time intervals are
incompatible. Action is categorical data, and a time interval is numerical data. One natural solution
is to discretize time intervals using several time bins [224]. However, making time bins requires hyperparameters, and validating these predetermined parameters is a cumbersome process. In addition, even if we successfully discretize time intervals, it is not trivial to incorporate the time interval information into actions.
To overcome the aforementioned challenges, this chapter proposes a model that embeds inter-temporal information to obtain low-dimensional representations of user actions. First, to discretize time intervals, we leverage the statistical properties of time intervals that have been well studied in the literature. Our model fits the time intervals with a mixture of exponential distributions and uses the estimated parameters to group the time intervals. Leveraging this statistical procedure ensures the objectivity of the analysis, avoiding hyperparameter selection for the time bins. In addition, this procedure ensures that the discretized time intervals capture the nature of the observed time intervals. We then construct action sequences in which each action and its time interval appear one after the other, and make n-grams from them. Then, our framework calculates a low-dimensional representation of user actions with the time interval information using a word embedding technique. Lastly, we study the inter-temporal context of each action using the obtained action embedding vectors. To do this, we propose an interpretable measurement of action timing context (ATC), which represents whether a given action belongs to a long-term (slow pace) context or a short-term (fast pace) context. We illustrate our study in Figure 6.1 and summarize the terminology in Table 6.1 to make this chapter easy to follow.
With the proposed model, we conduct three empirical studies to reveal the inter-temporal context of user actions using ATC. We first show that the inter-temporal context of actions corresponds to apps' categories. Second, we demonstrate that ATC can compare the behavior differences of different types of students in terms of their temporal behavior; this analysis investigates the behavior differences between drop-out and non-drop-out students on a Massive Open Online Course (MOOC) platform. In the third study, we study the dynamics of the context of actions. With a student behavior dataset covering an academic term, we demonstrate that the inter-temporal context of actions changes according to events on campus, such as mid-term weeks. These three empirical studies show that our framework and the proposed measurement (ATC) capture the inter-temporal context of actions in an interpretable manner.
Figure 6.1: Schematic of studying the inter-temporal context of user actions
Note: (a) We first construct the action sequence of each user and calculate the time intervals between two consecutive actions. (b) Then, we estimate the mixture distribution of the time intervals to discretize them. (c) With the discretized time intervals and actions, we construct sequences of action + time interval and build trigrams from those sequences. (d) Using the data constructed in (c), we learn the embedding vectors of actions and trigrams. (e) We study the inter-temporal context of actions by extracting the Action Timing Context (ATC) from the obtained embedding vectors.
Table 6.1: Glossary of Terms: Chapter 6
Terminology: Description
Action: A single unit of observed behavior (e.g., a smartphone app usage event, a physical contact, ...)
Time interval: The time gap between two consecutive actions
ATC: Action Timing Context, defined in Equation 6.8 in Section 6.2.5.
6.2 Methods
This chapter studies human behavior considering inter-temporal information. For this, our method
attempts to obtain a low-dimensional representation of the action sequence. We construct action
sequences of each user with the actions and their time interval and then utilize a word embedding
method on the constructed action sequences.
6.2.1 Capturing Inter-Temporal Information With Time Bins
This subsection discusses how we estimate the time intervals and convert them into the action sequence using time bins. While calculating the time intervals between consecutive actions is straightforward, it is not trivial to discretize these continuous values so that they can be placed in the action sequence. A natural solution for this discretization problem is to treat the time bins as hyperparameters. For example, Wang et al. (2016) [224] use predefined bins such as [0-1 min], [1-10 min], [10-60 min], and [more than 60 min], and then classify the calculated time intervals into those bins. However, because there are numerous possible patterns of time bins, it is not easy to choose an appropriate pattern without any prior information, and an ill-chosen combination of time bins might miss the inter-temporal information between actions, which is the primary interest of this study.
Therefore, this study estimates the statistical properties of the time intervals in the data and constructs the time bins based on the estimated parameters, instead of using hyperparameter time bins. The goal of the time-bin construction is to capture the behavior states of the users from their time intervals and to represent those states. For this, we study the behavior states by estimating mixture distributions.
6.2.2 Time Interval Bin Estimation by a Mixture of Exponential Distributions
This chapter estimates the time interval distribution between actions and constructs the time bins
based on the estimated distribution. We assume that the observed time intervals follow the mixture
of exponential distributions. To estimate the mixture of exponential distributions, we utilize the
EM algorithm and the model selection criterion.
Recent developments in the literature reveal that a mixture of exponential distributions with a few components can fit the time intervals of various empirical data well [166]. (Note that [166] uses slightly different wording than this chapter: they describe what we call actions as "events" and time intervals as "inter-event times." To keep our wording consistent, we use the terminology in Table 6.1.) Following [166], we estimate the mixture of exponential distributions using the EM algorithm, selecting the number of components by a model selection criterion.
6.2.2.1 EM Algorithm
To study the mixture of exponential distributions from the observed time intervals between actions $x = \{x_1, \ldots, x_N\}$, we estimate the parameters of

$$f(x) = \sum_{k=1}^{K} \pi_k f_k(x; \lambda_k), \qquad (6.1)$$

where the $\pi_k$ are the mixture parameters (mixing weights) with $\pi_k \geq 0$ and $\sum_{k=1}^{K} \pi_k = 1$; $f_k$ is the probability density function (PDF) of an exponential distribution; and $\lambda_k$ is the parameter of the exponential distribution of component $k$ (the rate parameter).

To estimate the parameters $\pi_k$ and $\lambda_k$ ($k = 1, \ldots, K$), the EM algorithm alternates an E (Expectation) step and an M (Maximization) step. At the E step of iteration $t$, the algorithm calculates the so-called membership probability (weight) $\gamma^{t}_{ik}$ as

$$\gamma^{t}_{ik} = \frac{\pi^{t}_{k} f(x_i; \lambda^{t}_{k})}{\sum_{l=1}^{K} \pi^{t}_{l} f(x_i; \lambda^{t}_{l})}. \qquad (6.2)$$

Intuitively, the membership weight $\gamma^{t}_{ik}$ describes the probability that a given data point (time interval $x_i$) is generated from component $k$ (i.e., the exponential distribution with parameter $\lambda_k$).

With the membership weights calculated at the E step of iteration $t$, the algorithm updates the parameters in the M step as

$$\pi^{(t+1)}_{k} = \frac{1}{N} \sum_{i=1}^{N} \gamma^{(t)}_{ik}, \qquad (6.3)$$

$$\mu^{(t+1)}_{k} = \frac{\sum_{i=1}^{N} \gamma^{(t)}_{ik}\, x_i}{\sum_{i=1}^{N} \gamma^{(t)}_{ik}}, \qquad (6.4)$$

where $\mu^{(t+1)}_{k}$ is the mean of component $k$, from which the rate parameter is recovered as $\lambda^{(t+1)}_{k} = 1/\mu^{(t+1)}_{k}$.
The algorithm iterates these two steps and obtains the final parameters after convergence. As Equations 6.2, 6.3, and 6.4 show, the number of components K is given to the algorithm. As the criterion for selecting the number of components (model selection), we use the decomposed normalized maximum-likelihood (DNML) codelength, following [166].
6.2.2.2 Attributing Time Intervals to Time Bins
We attribute the time intervals between actions to time bins based on the estimated mixture of exponential distributions. We construct as many time bins as there are components of the mixture distribution, plus a bin $T_0$ for zero time intervals. In other words, we construct $K+1$ time bins $\{T_0, \ldots, T_K\}$ when we have $K$ components. For a given time interval $x_i > 0$, we find its time bin $T^{*}$ using the membership probability (Equation 6.2) as

$$T^{*} = \operatorname*{argmax}_{l} \frac{\pi_l f(x_i; \lambda_l)}{\sum_{k=1}^{K} \pi_k f(x_i; \lambda_k)}. \qquad (6.5)$$

Since $f$ is the PDF of an exponential distribution, the time bin attribution cannot be calculated for $x_i = 0$, and this is the reason we make the special time bin $T_0$ for $x_i = 0$.
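A short Python sketch of this attribution rule follows, continuing from the EM sketch above. The function name and the convention that bin 0 is reserved for zero intervals (with the K components mapped to bins 1 through K) are illustrative choices, not part of a released implementation.

```python
import numpy as np

def assign_time_bin(x_i, pi, lam):
    """Attribute one time interval to a bin (Equation 6.5); bin 0 is reserved for x_i == 0."""
    if x_i == 0:
        return 0
    # The argmax of the membership probability equals the argmax of pi_k * f(x_i; lam_k),
    # since the denominator in Equation 6.5 does not depend on the component index.
    dens = pi * lam * np.exp(-lam * x_i)
    return 1 + int(np.argmax(dens))   # bins T_1 ... T_K correspond to the K components

# Toy usage with the parameters estimated above: assign_time_bin(42.0, pi_hat, lam_hat)
```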
6.2.3 Action Sequence and n-Gram
To represent the context of users' actions, we first construct a sequence of actions and their time intervals in chronological order. Then we construct n-grams from those action sequences.
6.2.3.1 Action Sequence
With the actions and the time intervals between them, we construct an action sequence for each user. Let User $i$ have the actions $\{A_{i,j}\}_{j=0}^{J}$. His/her $j$th action is $A_{i,j} \in \{A_1, \ldots, A_M\}$, where $M$ is the number of unique actions, and $x_{i,j}$ is the time interval between $A_{i,j}$ and $A_{i,j+1}$. Using the time bins constructed with Equation 6.5, we attribute the time interval $x_{i,j}$ to a time bin $T_{i,j}$, which is one of $\{T_0, \ldots, T_K\}$. With User $i$'s actions $\{A_{i,j}\}_{j=0}^{J}$ and their time intervals $\{T_{i,j}\}_{j=0}^{J}$, we construct User $i$'s action sequence as

$$A_{i,1}\, T_{i,1}\, A_{i,2}\, T_{i,2}\, A_{i,3}\, \ldots\, T_{i,J-2}\, A_{i,J-1}\, T_{i,J-1}\, A_{i,J}.$$

We construct this action sequence for each user, and we make n-grams from those action sequences.
6.2.3.2 Action n-gram
Our action n-grams capture the context of users' actions, representing the order of actions and their time intervals. For an action sequence $A_1 T_1 \ldots T_{J-1} A_J$ (we write $A_j$ and $T_j$ for $A_{i,j}$ and $T_{i,j}$, abbreviating the user index $i$), we make n-grams that separate the action $A_j$ and the time bin $T_j$. In this study, we use trigrams ($n = 3$) of the action sequence, which gives

$$\{\, A_1 T_1 A_2,\; T_1 A_2 T_2,\; \ldots,\; T_{J-2} A_{J-1} T_{J-1},\; A_{J-1} T_{J-1} A_J \,\}.$$

While one may prefer to treat an action and its time interval as a single unit when making n-grams, such as $A_j T_j$, making n-grams that separate those two benefits the study of the time interval context of each action (to be discussed in Section 6.2.5); a small sketch of this construction is shown below.
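The following is a minimal Python sketch of this construction; the function names and the "T<bin>" token format are illustrative assumptions, and the toy actions are made up for the example.

```python
def build_action_sequence(actions, bins):
    """Interleave actions with their discretized time intervals: A_1 T_1 A_2 T_2 ... A_J."""
    seq = []
    for a, t in zip(actions[:-1], bins):
        seq += [a, f"T{t}"]
    seq.append(actions[-1])
    return seq

def trigrams(seq):
    """Trigrams over the interleaved sequence, e.g. (A_1, T_1, A_2), (T_1, A_2, T_2), ..."""
    return [tuple(seq[i:i + 3]) for i in range(len(seq) - 2)]

# Toy example: three app launches with two discretized intervals between them.
seq = build_action_sequence(["YouTube", "Gmail", "YouTube"], [2, 0])
print(seq)            # ['YouTube', 'T2', 'Gmail', 'T0', 'YouTube']
print(trigrams(seq))  # [('YouTube', 'T2', 'Gmail'), ('T2', 'Gmail', 'T0'), ('Gmail', 'T0', 'YouTube')]
```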
6.2.4 Word Embedding Model
We use the Skip-gram with negative sampling (SGNS) algorithm to obtain low-dimensional rep-
resentations of actions and N-grams [149, 151], and we discuss this method in Chapter 2. This
section discusses the prediction problem that SGNS solves in this chapter, to clarify the embedding vectors obtained from the action sequences. SGNS models the distribution $p(d \mid w, c)$, where $d$ takes 1 when a pair of word $w$ and context $c$ is observed in the data and 0 otherwise.
This chapter treats actions and n-grams of action sequences as either contexts or words. Note
that we treat actions as both words and contexts, and we treat n-grams of action sequences as
well. We calculate the conditional probability of actions/n-gram given actions/n-gram to obtain
the low-dimensional representation. For implementation, we use the ngram2vec [237], which is
the modified version of word2vecf [134].
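For readers who want a quick stand-in without the ngram2vec pipeline, the following sketch trains SGNS with gensim (assuming gensim 4.x) directly on the interleaved action + time-bin sequences. This simplified setup yields embedding vectors only for actions and time-bin symbols, whereas the ngram2vec configuration used in this chapter additionally treats the trigrams themselves as words and contexts.

```python
from gensim.models import Word2Vec  # assuming gensim 4.x is installed

# One interleaved token sequence (actions and time-bin symbols) per user,
# built as in the toy example above; these two sequences are made up.
user_sequences = [
    ["YouTube", "T2", "Gmail", "T0", "YouTube"],
    ["Gmail", "T1", "Gmail", "T2", "YouTube"],
]

model = Word2Vec(
    sentences=user_sequences,
    vector_size=300,   # embedding dimension used in our experiments
    sg=1,              # skip-gram
    negative=5,        # negative sampling (SGNS)
    window=2,          # small window, roughly mimicking the flexible-window setup
    min_count=1,
)
action_vec = model.wv["YouTube"]   # embedding vector of one action
```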
6.2.5 Extracting action timing context (ATC) using n-gram actions
To contextualize the actions with inter-temporal information, we use the embedding vectors of the n-grams. As discussed in Section 6.2.3.2, the n-grams contain elements in which an action sits between two time intervals, such as $TAT$. We leverage the embedding vectors of this type of n-gram as references that represent the inter-temporal context.
6.2.5.1 Constructing the Reference Vectors
To do so, we first make two types of reference n-grams: the long and the short time interval context. Let $T_{\mathrm{long}}$ and $T_{\mathrm{short}}$ be the longest and the shortest time interval bins, respectively (these bins are defined in Section 6.2.2). Then, we calculate the reference vector for the long interval context as

$$v_{\mathrm{long}} = \frac{1}{|V_{\mathrm{long}}|} \sum_{v \in V_{\mathrm{long}}} v, \qquad (6.6)$$

where $V_{\mathrm{long}}$ is the set of embedding vectors of the n-grams in which an action sits between the longest time intervals. That is,

$$V_{\mathrm{long}} = \{\, T_{\mathrm{long}}\, A\, T_{\mathrm{long}} \,\}_{A \in \mathcal{A}}, \qquad (6.7)$$

where $\mathcal{A}$ is the set of all actions. Similarly, we make $V_{\mathrm{short}}$ (and $v_{\mathrm{short}}$) using $T_{\mathrm{short}}$.
6.2.5.2 Definition of the “Long Term Context” and “Short Term Context”
The reference vectors $v_{\mathrm{long}}$ and $v_{\mathrm{short}}$ represent an action taken between long time intervals or between short time intervals, respectively. For example, when an action $A$ is similar to $v_{\mathrm{long}}$, it means that users take a long break (interval) before/after that action; we can interpret such an action as one on which users spend a long execution time. On the other hand, when an action $A$ is similar to $v_{\mathrm{short}}$, users tend to execute it with a short execution time.
6.2.5.3 Aligning Actions Into Long vs Short Term Context
To study the time interval context of a given action A, we calculate the relative distance r(A)
between the two reference vectors for each action of interest,
$$r(A) = \cos(v_{\mathrm{long}}, a) - \cos(v_{\mathrm{short}}, a), \qquad (6.8)$$

where $a$ is the embedding vector of action $A$, and $\cos$ is the cosine similarity. When $r(A)$ is large, action $A$ is in the long-term context; otherwise, $A$ is in the short-term context. We provide a schematic illustration of this relative distance in Figure 6.2.

Figure 6.2: Extracting action timing contexts with n-gram actions
Note: We calculate the relative distance from the long-term and short-term context n-grams. We first construct the long-term vector $v_{\mathrm{long}}$ and the short-term vector $v_{\mathrm{short}}$ from the context n-grams (Equations 6.6 and 6.7). Then, for each action vector $a$, we calculate the difference between the cosine similarities $\cos(v_{\mathrm{long}}, a)$ and $\cos(v_{\mathrm{short}}, a)$. This relative distance $r(A)$, defined in Equation 6.8, represents to which action timing context the action belongs (a large $r(A)$ suggests that the action is used in the long-term context).
6.3 Data and Experiment Settings
In this section, we discuss the datasets and the experiment settings for the embedding model. We employ three datasets to investigate a wide range of human behavior in terms of inter-temporal context, using the ATC discussed in Section 6.2.5.
Table 6.2: Basic statistics of datasets
Dataset Statistics App Usage MoocData StudentLife
Observation period 1 week 2 years 11 days 11 weeks
Observation date Jun’16 Jun’15-Jun’17 Mar’13 - May’13
Observation field Smartphone MOOC platform University campus
Environment Digital device Digital platform Digital device and
Real place
# unique actions 1,696 22 800
# total users 871 225,642 49
# total actions 4,171,950 42,110,402 219,360
Avg. # actions per user 4789.83 186.62 4476.73
Avg. # unique actions per user 1.94 9.74 16.326
# components of the mixture of exp distributions 3 3 3
Note: Basic statistics of the three datasets used for our analysis, after the preprocessing described in Section 6.3. The number of components is determined by DNML, as discussed in Section 6.2.2.
6.3.1 Data
We apply our method to the three data sets. We first use the app usage history dataset [59] to
study the correspondences between the inter-temporal context differences (ATC) and the category
of the apps. Then, we use the ATC to study the behavior difference between different individuals
to demonstrate that our method captures human behavior in an inter-temporal context. For this
analysis, we use the clickstream data from the MOOC platform [234]. Lastly, we turn to the
dynamics of ATC, studying the student behavior sensing data [225].
App Usage Dataset [59] We use the smartphone app usage history data published by [59]. The dataset provides each user's history of smartphone app usage, including the timestamp and the category of the app. We treat each app usage event as an action, labeled with the app's category.
MoocData [234] MoocData is clickstream data on the users of the XuetangX platform (a MOOC platform in China). The dataset contains the users' learning activity logs, including which functions of the platform they use during their learning. The dataset provides a massive amount of logs and also serves as a benchmark for dropout prediction. We utilize this dropout-labeled subset to compare the behavior differences between two distinct types of users in terms of academic performance: dropout vs. non-dropout.
StudentLife [225] The StudentLife dataset provides a wide range of behavior data from automatic sensing using participants' smartphones [225]. Wang et al. (2014) recruited students for the study and tracked their behavior on campus during one academic term. The dataset contains smartphone usage history, eating behavior (e.g., lunch), physical activities (e.g., walking), etc. In addition to the action data, this dataset provides survey data based on Ecological Momentary Assessment (EMA). The EMA data contain the students' self-reports of their psychological state throughout the data period. We use three questionnaires to study the correlation between the ATC of students' activities and the students' psychological state.
We utilize the above three datasets to demonstrate that ATC can capture human behavior from
the inter-temporal point of view.
Table 6.3: EMA Questions
Question Option (scale)
how happy do you feel? 1.little bit, 2.somewhat, 3.very much, 4.extremely
how sad do you feel? 1.little bit, 2.somewhat, 3.very much, 4.extremely
How are you right now? 1.happy, 2.stressed, 3.tired
How many hours did you sleep last night? 18 scales (0.5-hour grid from less than 3 hours to more
than 12 hours)
Note: The EMA questions used in this chapter. Only the participants who answer "yes" to "Do you feel AT ALL happy (sad) right now?" answer the first two questions ("how happy (sad) do you feel?").
6.3.2 Experiment Settings
As discussed in Section 6.2.2, we estimate the mixture distribution of the time intervals between actions to construct the time bins. We use the implementation by [166]. For each dataset, we randomly sample 10k time intervals and estimate the distributions.
For the embedding, we use ngram2vec as implemented by [237], which is based on word2vecf [134]. We use SGNS to learn the embedding vectors (300 dimensions). We use a flexible window size, two windows for bi-grams and one window for actions, as illustrated in Figure 6.1 (d), to ensure that our embedding model captures the dependency among actions and time intervals. We remove the users who have fewer than ten actions.
6.4 Empirical Analysis with Action Timing Context (ATC)
This section reports the empirical analysis that extracts action timing contexts from real-world datasets. Our empirical analysis demonstrates that the action timing context extracts informative insights about human behavior from action sequences. To this aim, we utilize three different datasets: the student behavior data observed in a field study, the smartphone app usage data, and the students' behavior on a Massive Open Online Course (MOOC) platform.
Our empirical analysis compares the differences in action timing contexts between different types of entities. First, we show that the ATC of app usage depends on the app's category; because each application has a different purpose, these differences are expected to be reflected in its ATC. Next, we utilize our method to find behavior differences between academically successful students and those who are not, comparing the differences in the ATC of the dropout and non-dropout students on the MOOC platform. Lastly, we study the transition of the ATC of student behavior in a real academic environment, using a dataset covering physical and digital behavior (smartphone usage) over an academic term.
6.4.1 ATC Differences Among Smartphone Apps Usage
Smartphone apps are an essential tool for modern life, providing a wide range of functions such as games and health tracking. These apps are expected to have different ATCs depending on their purpose. To study this, we calculate the mean ATC for each app category. Figure 6.3 demonstrates that apps in different categories have different mean ATCs.
In the figure, the Reference category has the smallest ATC (-0.27), meaning that the apps in this category, such as dictionaries, are used in a short span. Conversely, Infant & Mom has the largest ATC (1.1): the users of apps for infant care and monitoring use these apps in a long-term context. Our analysis also reveals apps that do not belong to a definitive context. Finance apps, for example, are around the median ATC (0.27); therefore, finance app usage spans from the long-term to the short-term context.
6.4.2 ATC Differences Between Dropout and Non-dropout Students
Our method can reveal behavior differences in an interpretable manner. To demonstrate this, we use the student clickstream data from the MOOC platform. Along with the clickstream, the dataset provides a dropout label for each student. We split the dataset into the data of dropout and non-dropout students and then learn embeddings from each to calculate the ATC. In doing so, we can compare the ATCs of the same action between the two types of students (dropout/non-dropout).
Figure 6.4 plots the differences in the ATC of the same action between the dropout and non-dropout students. The figure calculates the differences by subtracting the ATC of the non-dropout students from that of the dropout students. Therefore, a positive difference in the ATC of an action implies that the dropout students use that action in the long-term context, whereas the non-dropout students use it in the short-term context.
The most evident distinction in the figure is in Pause Video (0.45). This positive difference suggests that dropout students do not often pause the course video, whereas non-dropout students pause videos at short intervals during their learning. In contrast, we find negative differences for the commenting actions (ATC difference of Delete Comment: -0.61; Create Comment: -0.51), revealing that non-dropout students take their time over commenting compared to dropout students. Although we observed differences in the ATCs between the two distinct types of students, they are similar in their ATC of Close Courseware (-0.01). Combining these differences highlights the behavior differences between dropout and non-dropout students: compared to non-dropout students, dropout students are less likely to pause the video while it plays and less likely to spend much time commenting on it.
6.4.3 Capturing Behavior Dynamics by ATC
Lastly, we use our method to study dynamic behavior from the action sequences. We utilize the StudentLife dataset to analyze how the ATC transitions over an academic term (11 weeks). The StudentLife dataset provides a wide range of students' behavior traces from their smartphones, such as physical activity, eating behaviors, and app usage. We first construct the action sequences of the students and split them by week (i.e., into 11 weeks). Then, we obtain the embedding vectors from each week and calculate the ATC of each action.
Figure 6.5a plots the transition of the ATCs over the course of the academic term. The figure shows that the ATCs of physical behaviors settle over the term, but the ATCs of eating behaviors swing around Weeks 4-5, when the students had their midterm. This suggests that students compress the time intervals between their eating actions during the midterm to save time for studying. After the midterm, their eating behavior stabilizes in terms of ATC, and the ATC of the eating behaviors remains at a plateau after Week 6. For reference, we also picked popular smartphone apps among the participants (Gmail and YouTube), whose ATCs lie between those of the physical and eating behaviors.
The results above show that the ATC can reveal which actions are affected by external circumstances. Figure 6.5a shows that eating behaviors change according to an educational event (the midterm), but the physical behaviors do not. The box plot in Figure 6.5b demonstrates this difference in the sensitivity of the ATC: the ATCs of the physical behaviors have shorter bars, and those of the eating behaviors have longer bars.
Finally, we study the correlation between the ATCs and the students' psychological states. Figure 6.6 plots the correlation between the students' EMA answers and the ATCs of the eating actions (Breakfast, Lunch, Supper, Snack). The figure shows that the ATCs of eating actions are correlated with mental state (happy or sad): the ATC of eating actions is positively correlated with a happy mood and negatively correlated with a sad mood. They are not correlated with the question about sleep or with the general mood question ("How are you right now?"). These correlations suggest that eating in the short-term context (low ATC) is associated with a sad mood, and vice versa. We also study the correlation between the EMA answers and the ATCs of the physical actions (Walking, Running, Other Activity) in Figure 6.7 and do not find a clear correlation. We acknowledge that these are only correlations found by a simple linear regression model. However, this result suggests that our model, which captures the inter-temporal context between actions, may capture the user's psychological state.
6.5 User Modeling with Embedding and Intertemporal Information between Users' Actions
This section discusses the work related to our study. We first discuss research on user modeling with embedding methods and on human temporal behavior modeled with embedding techniques. Then, we review research on the inter-action times of human behavior.
6.5.1 User Modeling With Embedding
Word embedding models have been primarily used to obtain the low dimensional representation of
word semantics [149, 151]. It is also known that word embedding can capture biases hidden in the
text data [74, 24, 65, 33] or semantic dynamics [89, 52].
The applications of such embedding techniques are not only in natural language processing but
also in user modeling. The most popular application is in models that take embedding vectors of
user consumption as feature vectors for recommendations [80, 40] or online advertisement [186,
53, 10, 235]. In addition, there is a line of literature that employs the temporal points process to
model the dynamics of user behaviors [55, 199].
For modeling dynamic user behavior, Han et al. (2020) [91] proposed an embedding model that obtains a vector representation of user dynamics from the user action sequence. In addition, Liu et al. (2020) predict students' academic performance by embedding the students' daily behavior sequences [139].
Recently, it has been found that embedding models can construct axes for studying the hidden structure of a platform. Waller and Anderson (2021) proposed a method and application for studying social phenomena in an online community (Reddit) in a data-driven manner [220]. They characterize online communities (subreddits) by the users' posts and study the dynamics of political polarization.
The major difference between our study and the studies discussed above is that our study ex-
plicitly embeds temporal structure into actions. Existing studies use action sequences to represent
the dynamics of user behavior and abstract the time intervals between actions, even though, as
mentioned in the introduction, temporal information is necessary to understand human behavior.
Certainly, using point processes and other methods can improve prediction accuracy, but such
models complicate the understanding of human behavior. In this study, we explicitly model the
temporal structure in the embedding of behavior by estimating the time interval. This makes the
model structure simple and enables a unified analysis of user behavior using indicators, such as the
proposed ATC.
6.5.2 Inter-Action Times of Human Behavior
Studying the interaction between human actions and their time intervals reveals deep mechanisms of human behavior, such as human cognitive states [206, 78, 228, 111]. Many researchers have devoted themselves to modeling human cognition based on this temporal information. The most popular theory in this literature is "Dual-process theory," introduced by [78] and developed by [111]. We propose our framework, ATC, to study the temporal structure of users' actions based on this simple yet solid foundation.
In addition to the cognitive process that generates action time intervals, the statistical properties of time interval distributions have attracted the attention of researchers, especially in network science [95, 145, 215, 12, 169]. In particular, [166] summarize the literature on the two classes of models that generate the distributions of time intervals: priority queue models [12] and modulated Markov processes [95, 145].
6.6 Conclusion
This chapter showed that incorporating a vital piece of information, the time interval, into human behavior analysis can lead to a unified understanding of human dynamic behavior. Our framework is based on the idea that the time interval between actions corresponds to users' cognitive states. To examine the cognitive state from the distribution of time intervals, we discretized the time intervals based on a mixture distribution estimation. Then, we learned embedding vectors of actions based on the discretized time intervals and action sequences, and proposed the ATC, an inter-temporal context index for each action, using the obtained low-dimensional representations of actions. Using the proposed framework, we conducted empirical studies on user behavior with three datasets. As a result of the analysis, we determined that the ATC captures differences in behavior among actions and among users, and how the inter-temporal context changes depending on the situation. We also showed that the ATC is a unified and interpretable measure of inter-temporal context.
There is also a need to analyze users' behavior with the embedding vectors of actions. While this chapter focused on the inter-temporal context of actions, it is natural to calculate a user-behavior vector that represents the user's behavior from the obtained embedding vectors of actions. Just as existing research constructs representations of texts such as sentences or documents by averaging the vectors of the words in them, we can calculate a feature vector of a user by averaging the vectors of the actions in the user's action sequence. However, the dependency between the time points of the user's actions is lost in this case. Therefore, we need a method that preserves the order of the vectors when calculating the user's feature vector, and we discuss this in Chapter 7.
Figure 6.3: Smartphone application usage context
Note: Comparison of smartphone application usage by the action timing context (standardized over all samples). The figure plots the mean value of each category's action timing context (with 95% CIs by bootstrap). While the apps in the "Infant & Mom" category are used in a long-term action timing context, "Reference" category apps are used in the short-term context. The "Finance" category apps are neutral in terms of action timing, suggesting that users use those apps in both contexts.
Figure 6.4: Action timing context differences: Drop-out students vs Non-dropout students
Note: Action timing context differences between the different types of users (standardized within each student type). We calculate the action timing context for the students who dropped out of their course and for the students who did not. A positive difference for an action means that the dropout users use that action in the long-term context, whereas the non-dropout students use it in the short-term context. For example, the dropout students use "Pause Video" in the long-term context, but the non-dropout students use it in the short-term context (difference 0.45). This difference implies that the dropout students do not often pause the course video, but the non-dropout students do.
Figure 6.5: Transition of the action timing context of the students over the academic term
Note: The action timing context of student behavior over the course of an academic term (11 weeks, standardized within each week). We calculate the action timing context of the actions in each week: physical behavior (e.g., walking), smartphone app usage (e.g., YouTube), and eating activity (e.g., lunch). Figure 6.5a shows how the ATCs change over the academic term. The area between the dotted lines represents the mid-term weeks (Weeks 4-6). While the context of physical behavior is stable, eating activity transitions from the long-term context to the short-term context. This transition implies that an important academic event throws the students off their stride in their eating habits. Figure 6.5b shows the box plot of the action timing context of the actions: while the physical activities are stable, the eating activities are volatile (longer bars).
Figure 6.6: ATCs of the eating actions and the students' psychological state
Note: The figure reports the correlation between the ATCs of the eating actions (Breakfast, Lunch, Supper, Snack) and the EMA answers (weekly averages) described in Table 6.3. We use all available EMA answers during the period (11 weeks) for each question. In each subfigure, we report the Pearson correlation coefficient (r) and the p-value for testing non-correlation, and the blue line represents the fitted simple linear regression.
Figure 6.7: ATCs of the physical actions and the students' psychological state
Note: The figure reports the correlation between the ATCs of the physical actions (Walking, Running, Other Activity) and the EMA answers (weekly averages) described in Table 6.3. We use all available EMA answers during the period (11 weeks). In each subfigure, we report the Pearson correlation coefficient (r) and the p-value for testing non-correlation, and the blue line represents the fitted simple linear regression.
Chapter 7
User Action Clustering While Preserving the Order of Actions
This chapter discusses a method for clustering users based on their action sequences. We use the low-dimensional representations of users' actions obtained in Chapter 6 to construct the user action sequences.
Understanding user-level behavior in massive records from digital platforms has been an important research topic in computer science, and many studies propose dimensionality reduction methods, such as embedding vectors, to obtain compact representations of user behavior from high-dimensional data. While those low-dimensional representations foster our understanding of user behavior, limited attention has been paid to how to reconstruct the dynamics of user behavior from such static representations. Many methods obtain low-dimensional representations of each action in behavior data, but few consider how to analyze the dynamic behavior of users using such low-dimensional representations of actions. We develop a method that clusters users based on the sequences of their actions while considering the order of the actions in those sequences. Our analysis with empirical data shows that the order of actions captures individual differences.
In the remainder of this chapter, we propose a clustering method based on the optimal transportation framework and conduct experiments with synthetic data. Then, we also conduct an empirical analysis that studies the transition of human behavior affected by external events.
7.1 Introduction
With the recent development in computer science, many methods have been developed to study
high-dimensional data. One of the most popular methods in this research line is called embedding,
which is a method for obtaining low-dimensional representations from high-dimensional data. Em-
bedding techniques have attracted attention in natural language processing (NLP), and they have
succeeded in obtaining good low-dimensional representations of words from high-dimensional
representations, such as sentences and documents. Embedding vectors are used not only in NLP
but also in many other fields, such as graph and image embeddings. Beyond NLP research, em-
bedding models have a wide range of applications, such as recommendation systems and causal
inferences.
In general, there are two ways to conduct a user-level analysis with embedding models. The
first is to directly obtain the low-dimensional representation of the user by learning the represen-
tations of the users’ action sequences that the users execute. The second is to construct feature
vectors that represent users’ action sequences by embedding vectors of actions. One of the most
common ways is to gather the vector of actions taken by each user and average them into a sin-
gle vector. In NLP, we use both procedures to analyze documents. To obtain representations of
documents, we can use an embedding model, such as doc2vec [7], that obtains low-dimensional
representations of documents. In addition, we can use word embedding models such as word2vec
to construct representations of documents by averaging the embedding vectors of the words in
documents [84]. We also have the same two options for user behavior analysis: obtaining the
vector of the entire user action sequences or reconstructing the actions sequence vectors using the
embedding vectors of user actions.
These two methods have pros and cons. The first method obtains a vector of the entire action sequence (or document) as a whole but does not provide representations of its elements, such as individual actions or words; that is, the first method sacrifices the microscopic information, such as the vector representation of each word. The second method reconstructs the macroscopic information (documents) from the microscopic information (words), but when we average the embedding vectors of the words in a document, we lose some critical information about that document, such as the order of the words, and it is not clear whether such a reconstructed representation ultimately captures the features of the whole document or action sequence.
This chapter proposes a method along the lines of the second procedure described above. We consider a clustering algorithm that uses the action sequences of users to analyze users' dynamic behavior. To achieve this, we incorporate a distance metric based on the optimal transportation framework into a clustering algorithm. Specifically, we use an order-preserving Wasserstein (OPW) distance [208] within k-means clustering so that the clustering recognizes the order of actions in action sequences. To construct the action sequences, we use the model proposed in Chapter 6, which embeds actions and their inter-time information into low-dimensional representations. We use only the action vectors, not the vectors representing inter-temporal information, such as the trigram and time-interval vectors. However, because the inter-temporal information of actions is embedded in the action vectors by the model proposed in Chapter 6, our method does not lose this microscopic information. Using a clustering method that preserves macroscopic information (the order of actions) and an embedding model that embeds microscopic information (inter-temporal information) into actions, we intend to analyze the dynamics of user behavior in a holistic manner.
7.2 Method
This section discusses the method that clusters the users based on the embedding vectors of each
user’s actions. The proposed method uses the optimal transport framework and k-means clustering
with a loss function that considers each element’s order. First, we construct action sequences using
the embedding model that incorporates the inter-temporal (time interval) information between the
actions proposed in Chapter 6. Second, when we cluster the users based on the embedding vec-
tors obtained from the model, we account for the order of the users' actions. To achieve this, we propose a k-means algorithm that uses the OPW distance and preserves the order using the optimal transportation framework. We briefly discuss the OT framework, then explain the OPW distance proposed by [208], and describe our k-means algorithm with the OPW distance.
7.2.1 Optimal Transportation
The optimal transportation problem has been studied in the machine learning literature, such as in
Computer Vision and Natural Language Processing [176, 212]. Recently, the OT framework was
used to define the distance between elements in the data, for example, the words in sentences [138].
In the remainder of this section, we first briefly discuss the OT framework and then explain the
OPW proposed by [208, 207], and we also describe our k-means algorithm with the OPW distance.
7.2.1.1 Optimal Transportation Problem
The optimal transportation problem considers a problem of “transporting” elements from one place
to another. In our context, we use the OT to calculate the transportation cost when we transport the
sequence of embedding vectors of a user to that of another user.
The optimal transportation problem considers transporting the elements of one set $x = (\boldsymbol{x}_1, \ldots, \boldsymbol{x}_N)$ to those of another set $y = (\boldsymbol{y}_1, \ldots, \boldsymbol{y}_M)$. In our case, these are sequences of embedding vectors, $\boldsymbol{x}_i, \boldsymbol{y}_j \in \mathbb{R}^{dim}$, where $dim$ is the dimension of the embedding vectors. To represent this transportation, we use a transport matrix $T \in \mathbb{R}^{N \times M}_{+}$ whose elements are non-negative. The optimal transportation problem then describes the cost of the transport using a cost matrix $D \in \mathbb{R}^{N \times M}_{+}$, whose entry $D_{ij}$ stores the cost of transporting element $i$ to element $j$, taken here to be the Euclidean distance. (Most of the literature uses a more general notation that allows an arbitrary distance metric, such as the $p$-Wasserstein ground cost $d(\boldsymbol{x}_i, \boldsymbol{y}_j)^p$.) Therefore, the definition of the cost matrix $D$ is

$$D_{ij} := \left( \sum_{l=1}^{dim} (x_{i,l} - y_{j,l})^{2} \right)^{1/2}. \qquad (7.1)$$
7.2.1.2 Optimization Problem
The goal of the optimal transportation problem is to find the $T$ that attains

$$\min_{T \in U_{N,M}} \langle T, D \rangle_F, \qquad (7.2)$$

where $\langle \cdot, \cdot \rangle_F$ represents the Frobenius inner product, so that $\langle T, D \rangle_F := \sum_{i,j} T_{i,j} D_{i,j}$. Also, $U_{N,M}$ is the transport polytope, defined as

$$U_{N,M} := \left\{ T \in \mathbb{R}^{N \times M}_{+} \;\middle|\; T \mathbf{1}_M = \alpha,\; T^{\top} \mathbf{1}_N = \beta \right\}, \qquad (7.3)$$

where $\alpha$ and $\beta$ are uniform weight vectors such that $\alpha \in \mathbb{R}^N$ with $\alpha_i = \frac{1}{N}$ for all $i$, and $\beta \in \mathbb{R}^M$ with $\beta_j = \frac{1}{M}$ for all $j$. (In the literature, a typical transport polytope allows arbitrary weights, using a notation such as $U(\alpha, \beta)$; however, since most cases, including [208], use uniform weights, we use the formulation in Equation 7.3.)

We can interpret the formulation of the OT problem in Equation 7.2 as the problem of finding the optimal $T^{*}$ under the constraint $T \in U_{N,M}$. Obviously, the polytope $U_{N,M}$ does not take into account the order of the elements, which means that any element $i$ is transportable to any place $j$ with the cost $D_{ij}$. In other words, the cost $D_{ij}$ is irrelevant to the difference in the order of the elements, that is, $|i - j|$.
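As a small, hedged illustration of Equation 7.2, the snippet below computes the cost matrix and solves the order-agnostic OT problem, assuming the POT library (ot.emd) is available; the array shapes are toy values, and the random vectors stand in for learned action embeddings.

```python
import numpy as np
import ot  # POT: Python Optimal Transport (assumed to be installed)

# Two users' action-embedding sequences (toy shapes; 300-dimensional vectors as in our experiments).
rng = np.random.default_rng(0)
x = rng.standard_normal((40, 300))   # user 1: N = 40 actions
y = rng.standard_normal((55, 300))   # user 2: M = 55 actions

# Cost matrix D (Equation 7.1): Euclidean distance between every pair of embedding vectors.
D = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1)

# Uniform marginals alpha and beta (Equation 7.3).
alpha = np.full(len(x), 1.0 / len(x))
beta = np.full(len(y), 1.0 / len(y))

# Solve Equation 7.2: the plan T minimizing <T, D>_F over the polytope U_{N,M}.
T = ot.emd(alpha, beta, D)
print(float(np.sum(T * D)))          # the (order-agnostic) optimal transport cost
```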
7.2.2 Order-Preserving Wasserstein Distance
To account for the order of the elements in the OT problem, we introduce penalties into the constraint that impose costs according to $|i - j|$. Following Su and Hua (2017) [208], we use two kinds of penalties: a local and a global penalty.
7.2.2.1 Local penalty
The local penalty encourages the transport matrix $T$ to make transports between two nearby positions in the order cheap, and vice versa. Specifically, we use the local penalty function

$$I(T) := - \sum_{i=1}^{N} \sum_{j=1}^{M} \frac{t_{ij}}{\left( \frac{i}{N} - \frac{j}{M} \right)^{2} + 1}. \qquad (7.4)$$

This local penalty function returns a large penalty when we try to transport an element from $i$ to $j$ and those positions are far apart in terms of the order (i.e., $|i - j|$ is large).
7.2.2.2 Global penalty
While the local penalty encourages the transport matrix $T$ to transport each element to a nearby location, it only considers local relations, focusing on specific $i$ and $j$. Therefore, it is necessary to take the global relationship into account. To this aim, we first assume an ideal transport matrix with an ideal distribution, and then impose a penalty when the calculated $T$ is far from that desired transport matrix. The transport matrix with the desired distribution $P$ used in this study is

$$P_{ij} := \frac{1}{\sigma \sqrt{2\pi}} \exp\!\left( - \frac{\ell(i,j)^{2}}{2\sigma^{2}} \right), \qquad (7.5)$$

$$\ell(i,j) = \frac{|i/N - j/M|}{\sqrt{1/N^{2} + 1/M^{2}}}. \qquad (7.6)$$

The matrix $P$ places a Gaussian prior around the diagonal: it takes a high value when $|i/N - j/M|$ is small, which means that, as the local penalty requires, transporting an element a long distance in terms of the order costs a lot. We then use the Kullback-Leibler (KL) divergence to measure how close $T$ is to the desired matrix $P$, denoted $\mathrm{KL}(T \| P)$.
7.2.2.3 Order-Preserving constraint
Incorporating the local and global penalties into the transport polytope in Equation 7.3, we introduce the transport polytope that considers the order of the elements:

$$U_{N,M,\xi_l,\xi_g}(\alpha, \beta) := \left\{ T \in \mathbb{R}^{N \times M}_{+} \;\middle|\; T \mathbf{1}_M = \alpha,\; T^{\top} \mathbf{1}_N = \beta,\; I(T) \le \xi_l,\; \mathrm{KL}(T \| P) \le \xi_g \right\}. \qquad (7.7)$$

Here, we do not impose strict constraints on $I(T)$ and $\mathrm{KL}(T \| P)$; rather, this polytope encourages us to select a $T$ for which both the local and the global penalty are smaller than $\xi_l$ and $\xi_g$, respectively. With this modified polytope, we can define the order-preserving Wasserstein distance as

$$\mathrm{opw}(x, y) = \min_{T \in U_{N,M,\xi_l,\xi_g}} \langle T, D \rangle_F. \qquad (7.8)$$
7.2.3 K-means clustering with OPW
We use the OPW distance in the objective function to be minimized in the k-means clustering, but otherwise follow an ordinary k-means clustering algorithm. This subsection first reviews the ordinary k-means algorithm and then introduces the OPW distance into the algorithm.
7.2.3.1 Ordinary k-means algorithm
The k-means algorithm is one of the most popular clustering algorithms, proposed independently by several researchers. In this study, we use Lloyd's k-means algorithm. The k-means algorithm computes

$$\operatorname*{argmin}_{C} \sum_{j=1}^{K} \sum_{x_i \in X_j} d(x_i, c_j), \qquad (7.9)$$

where $C$ is the set of cluster centers, $X_j$ is the set of data points assigned to cluster $j$ whose center is $c_j$, and $K$ is the total number of clusters. We usually use the $L_2$ norm for the distance metric $d(\cdot)$. The total sum of the calculated distances is called the within-cluster error (WCE).
The aim of the k-means algorithm is to find the set of centers $C$ that minimizes the above objective function. Lloyd's algorithm, for example, tries to find the assignment of each data point $x_i$ to a cluster $c_j$ that minimizes $d(x_i, c_j)$ by iterating assignment steps.

In each iteration of Lloyd's algorithm, the centroids are calculated as the centers of the clusters. The centroid of cluster $j$ is the mean of the data points assigned to that cluster,

$$\frac{1}{|X^{t}_{j}|} \sum_{x_i \in X^{t}_{j}} x_i, \qquad (7.10)$$

where $X^{t}_{j}$ is the set of data points assigned to cluster $j$ at iteration $t$ and $|X^{t}_{j}|$ is its size (i.e., the number of data points). Lloyd's algorithm continues iterating until the assignment of data points does not change between two consecutive iterations.
7.2.3.2 Modification of k-means algorithm with the OPW distance
We introduce the OPW distance into the k-means algorithm, which requires two modifications: the distance metric and the calculation of centroids.

The first modification is trivial: we use the OPW distance $\mathrm{opw}(\cdot)$ as the distance metric $d(\cdot)$ in Equation 7.9. By introducing the OPW distance, we can cluster users based on their sequences of embedding vectors while considering the order. That is, $x_i = (\boldsymbol{x}^{i}_{1}, \ldots, \boldsymbol{x}^{i}_{N})$, where $\boldsymbol{x}^{i}_{n} \in \mathbb{R}^{dim}$ represents the embedding vector of the $n$th action of user $i$. We cluster the users based on the sequences $x_i$, and to represent cluster $k$ at iteration $t$ we use the set of sequences $X_k = \{x_1, \ldots, x_s\}$.

The second modification concerns finding the centroids in each iteration. Since we use the OPW distance, it is not trivial to calculate the mean point of a cluster as in Equation 7.10. Instead of calculating mean points, we select, for each cluster, a data point that represents its center. We define the center of cluster $k$, chosen from the data points assigned to cluster $k$, as $c_k$, which is obtained by

$$c_k = \operatorname*{argmin}_{c_j \in X^{t}_{k}} \sum_{x_i \in X^{t}_{k}} \mathrm{opw}(x_i, c_j). \qquad (7.11)$$
Here, we write $\mathrm{opw}(\cdot)$ instead of $d(\cdot)$ to clarify the distance used in our method. We present pseudocode of our k-means algorithm in Algorithm 1.
Algorithm 1: OPW k-means
Data: set of sequences $\{x_1, \ldots, x_s\}$, number of clusters $K$, initial centroids $C = (c_1, \ldots, c_K)$.
Result: cluster assignments of the sequences and the final centroids $C$.
initialization;
while the centroids $C$ change during the iteration do
    Assign each sequence $x_i$ to the cluster of its nearest centroid under $\mathrm{opw}(\cdot)$;
    Find the centroid of each cluster $k$ by $c_k = \operatorname{argmin}_{c_j \in X^{t}_{k}} \sum_{x_i \in X^{t}_{k}} \mathrm{opw}(x_i, c_j)$;
    Update $C$ and check whether $C$ changed;
end
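The following is a minimal Python sketch of Algorithm 1, reusing the opw_distance sketch from Section 7.2.2. Since the centers are always data points (a medoid-style update, Equation 7.11), the pairwise OPW distances can be precomputed. The function name, the iteration cap, and the empty-cluster handling are illustrative choices.

```python
import numpy as np

def opw_kmeans(sequences, k, n_iter=20, seed=0):
    """Medoid-style k-means with the OPW distance (a sketch of Algorithm 1).

    sequences: list of (length_i, dim) arrays of action embedding vectors.
    Returns (labels, medoid_indices). Requires the opw_distance sketch above.
    """
    rng = np.random.default_rng(seed)
    n = len(sequences)
    medoids = list(rng.choice(n, size=k, replace=False))
    # Precompute all pairwise OPW distances, since centers are always data points.
    dist = np.array([[opw_distance(sequences[i], sequences[j]) for j in range(n)]
                     for i in range(n)])
    labels = np.zeros(n, dtype=int)
    for _ in range(n_iter):
        # Assignment step: nearest medoid under the OPW distance.
        labels = np.argmin(dist[:, medoids], axis=1)
        new_medoids = []
        for c in range(k):
            members = np.where(labels == c)[0]
            if len(members) == 0:        # keep the old medoid for an empty cluster
                new_medoids.append(medoids[c])
                continue
            # Equation 7.11: the member minimizing the total OPW distance within the cluster.
            within = dist[np.ix_(members, members)].sum(axis=0)
            new_medoids.append(members[int(np.argmin(within))])
        if new_medoids == medoids:       # stop when the centroids no longer change
            break
        medoids = new_medoids
    return labels, medoids
```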
Figure 7.1: Schematic of synthetic data
Note: Schematic of the synthetic data of sequences of user action vectors. Each circle represents a user action vector, and the color represents the type of action vector; circles with the same color are the same action vector. We construct four different sequences (a)-(d). (a) and (b) have the same components: 45 walking vectors (gray) and 45 running vectors (white). However, their orders are different: (a) has the 45 walking vectors (gray) first and the 45 running vectors (white) in its last half, while (b) has the 45 running vectors (white) first and the 45 walking vectors (gray) last. (c) and (d) consist of three user action embedding vectors about eating (supper (yellow), lunch (blue), and snack (green)), 30 vectors each. Similarly, (c) and (d) differ in the order of their sequences: (c) has each type of vector in a chunk (30 vectors x 3), while (d) has a nested order of 30 repeated sets of (supper (yellow), lunch (blue), snack (green)).
7.3 Data and Experiment Settings
This section reports the data and the experiment settings for the proposed method. We first use synthetic data to understand the performance of the OPW k-means algorithm, comparing it to the ordinary k-means algorithm. We then turn to experiments with empirical data.
7.3.1 Experiment With Synthetic Data
Our experiment with synthetic data studies whether our OPW k-means captures the dynamic structure of the data. For this, we construct sequences of vectors that represent user behaviors with the same vectors but in different orders. We also use the ordinary k-means algorithm for comparison.
7.3.1.1 Synthetic User Action Vectors
We construct synthetic data that contain two different groups, each of which has two subgroups. The first group is the physical action group, and the second group is the eating action group. For each group, we construct two different sequences of vectors with the same number of actions. We show a schematic of this data construction in Figure 7.1.
The first group contains running action embedding vectors and walking action embedding vectors. For this group, we construct two different subgroups. The first subgroup contains 45 running vectors first and then 45 walking vectors; the second is in the reversed order (i.e., 45 walking vectors and then 45 running vectors). Formally, the sequences of vectors in the first subgroup $x_{\mathrm{sub1}}$ and the second subgroup $x_{\mathrm{sub2}}$ are

$$x_{\mathrm{sub1}} = (\boldsymbol{x}^{\mathrm{run}}_{1}, \ldots, \boldsymbol{x}^{\mathrm{run}}_{45}, \boldsymbol{x}^{\mathrm{walk}}_{46}, \ldots, \boldsymbol{x}^{\mathrm{walk}}_{90})$$
$$x_{\mathrm{sub2}} = (\boldsymbol{x}^{\mathrm{walk}}_{1}, \ldots, \boldsymbol{x}^{\mathrm{walk}}_{45}, \boldsymbol{x}^{\mathrm{run}}_{46}, \ldots, \boldsymbol{x}^{\mathrm{run}}_{90})$$

The second group contains eating actions. As for the first group, we construct two different subgroups. The first subgroup has 30 supper vectors, 30 snack vectors, and 30 lunch vectors, in that order. The second subgroup has 30 repeated sets of supper, snack, and lunch vectors. The subgroups in both groups have different structures in their order, but their components and counts are the same. Formally, the sequences of vectors in the first subgroup $y_{\mathrm{sub1}}$ and the second subgroup $y_{\mathrm{sub2}}$ are

$$y_{\mathrm{sub1}} = (\boldsymbol{y}^{\mathrm{supper}}_{1}, \ldots, \boldsymbol{y}^{\mathrm{supper}}_{30}, \boldsymbol{y}^{\mathrm{snack}}_{31}, \ldots, \boldsymbol{y}^{\mathrm{snack}}_{60}, \boldsymbol{y}^{\mathrm{lunch}}_{61}, \ldots, \boldsymbol{y}^{\mathrm{lunch}}_{90})$$
$$y_{\mathrm{sub2}} = (\boldsymbol{y}^{\mathrm{supper}}_{1}, \boldsymbol{y}^{\mathrm{snack}}_{2}, \boldsymbol{y}^{\mathrm{lunch}}_{3}, \ldots, \boldsymbol{y}^{\mathrm{supper}}_{88}, \boldsymbol{y}^{\mathrm{snack}}_{89}, \boldsymbol{y}^{\mathrm{lunch}}_{90})$$

Because the two sequences in each group have the same components, their mean vectors are the same; that is, $\sum x_{\mathrm{sub1}} = \sum x_{\mathrm{sub2}}$ and $\sum y_{\mathrm{sub1}} = \sum y_{\mathrm{sub2}}$. Therefore, as discussed in the introduction with the example of doc2vec vs. word2vec, a typical manner of representing the user sequence erases the order of the actions.
7.3.1.2 k-means and OPW k-means
With the synthetic data constructed above, we demonstrate that our OPW k-means algorithm cap-
tures the order in the sequence. For comparison, we show that a typical representation ignores such
order as expected. To demonstrate this, we construct 40 users and each user has one of the four
action sequences (x
sub1
,x
sub2
or y
sub1
,y
sub2
). The ground truth of the number of clusters is four
when we consider the order in the sequence, but the ground truth is two without accounting for the
order in the sequence. For this toy experiment, we run our algorithm on the synthetic user data to
illustrate that it clusters the users according to their groups (one of the four groups above). Then,
we show that the k-means algorithm does not distinguish the order of the sequence.
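A short sketch of this setup, continuing the illustrative snippet above (the ten-users-per-group split and the dictionary names are assumptions made for illustration), assembles the 40 synthetic users together with the two notions of ground truth:

    # Ten users per sequence type, 40 users in total.
    sequences = {"a": x_sub1, "b": x_sub2, "c": y_sub1, "d": y_sub2}
    user_groups, user_sequences = [], []
    for group, seq in sequences.items():
        for _ in range(10):
            user_groups.append(group)
            user_sequences.append(np.stack(seq))  # shape: (sequence length, DIM)

    # Ground truth: four groups when order matters, two groups when it is ignored.
    order_agnostic = {"a": "ab", "b": "ab", "c": "cd", "d": "cd"}
    user_groups_unordered = [order_agnostic[g] for g in user_groups]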
7.4 Results
This section reports the results of the experiments with the synthetic and empirical data. We
conduct two experiments to demonstrate that our OPW k-means can classify the data according to the
order in the sequences. We then run an empirical analysis that detects the routine behavior
of users.
7.4.1 Analysis with Synthetic Data
This subsection presents the results of the experiments with the synthetic data. We conduct two
experiments with synthetic data to examine whether our proposed method captures user dynamics as expected.
The first runs our OPW algorithm on the synthetic data in which four different groups of users
exist (the two groups with two subgroups each); the description and schematic of the synthetic data
constructed for the experiments appear in Section 7.3.1. The second experiment uses the ordinary
k-means algorithm to show that the averaged action-sequence vector does not differentiate the order
in the sequences, so the two subgroups of each group become indistinguishable.
7.4.1.1 Clustering Results
We cluster the users based on their action sequences and calculate the Jaccard similarity between
the clusters and the groups. Figure 7.2 shows that the ordinary k-means does not differentiate the
subgroups, whereas the OPW k-means does. The left panel of Figure 7.2 shows that the ordinary k-means
places the users in Group a and Group b, whose action sequences contain the same components in a
different order (as described in Figure 7.1), into the same cluster. In contrast, the OPW k-means
algorithm does distinguish Group a from Group b. These results indicate that the OPW k-means
recognizes the order of actions in the sequences, which the ordinary k-means does not.
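A minimal sketch of this evaluation, assuming the cluster assignments and the ground-truth group labels are given as two equal-length lists (the function name is hypothetical), is:

    def jaccard_matrix(cluster_ids, group_labels):
        """Jaccard index between every predicted cluster and every ground-truth group."""
        scores = {}
        for c in set(cluster_ids):
            in_c = {i for i, cid in enumerate(cluster_ids) if cid == c}
            for g in set(group_labels):
                in_g = {i for i, lab in enumerate(group_labels) if lab == g}
                scores[(c, g)] = len(in_c & in_g) / len(in_c | in_g)
        return scores

When the four groups are perfectly recovered, each cluster has a Jaccard index of 1 with exactly one group and 0 with the others.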
7.4.1.2 Within-Cluster Errors
We also study the within-cluster error (WCE) by running the two clustering algorithms with different
initial centers 50 times for each number of clusters. Figure 7.3 plots the mean, minimum, and maximum
of the calculated WCEs for each number of clusters. The left subfigure in Figure 7.3
shows that the WCE of the ordinary k-means decreases smoothly. In contrast, the WCE curve of
the OPW k-means in Figure 7.3 has two clear elbows, at k = 2 and k = 4.
Figure 7.2: Clustering synthetic data: Jaccard index
Note: Jaccard similarity between the clusters and the user groups (ground truth). We cluster the users based on their
action sequences, where four user groups exist (two subgroups for each of the two groups). The ordinary k-means algorithm,
which ignores the order in the action sequences, places the two subgroups of each group into the same cluster, so that the
users in Groups a and b end up in one cluster and those in Groups c and d in another. In contrast, the OPW k-means
captures the differences in the order of the sequences and assigns all four groups to different clusters.
These elbows correspond to the partitions into the two groups and the four subgroups, respectively. When k = 2, the
OPW k-means returns one cluster containing Groups a and b and another containing Groups c and d. For k = 4, the OPW
k-means separates Groups a-d into four clusters.
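The baseline side of this procedure can be sketched as follows. The snippet assumes scikit-learn's KMeans applied to the order-erasing averaged sequence vectors from the setup above; the OPW k-means follows the same loop with the OPW distance in place of the Euclidean one, and the 50 restarts match the number used here.

    import numpy as np
    from sklearn.cluster import KMeans

    # Order-erasing summary: one averaged vector per user.
    X = np.stack([seq.mean(axis=0) for seq in user_sequences])

    wce = {}
    for k in range(1, 9):
        errors = []
        for seed in range(50):  # 50 different initial centers per k
            km = KMeans(n_clusters=k, n_init=1, random_state=seed).fit(X)
            errors.append(km.inertia_)  # within-cluster sum of squared distances
        wce[k] = (min(errors), float(np.mean(errors)), max(errors))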
7.4.2 Analysis with Empirical Data
Next, we conduct an empirical analysis with our proposed method and demonstrate that it can extract
insights about the dynamic behavior in the data. To this end, we perform morning pattern clustering,
a task to which existing methods are not applicable owing to several difficulties. We start this
subsection by describing these challenges and then discuss the results of our analysis.
7.4.2.1 A Use Case: Morning Pattern Clustering
We formalize an empirical problem to which we apply our proposed clustering method: clustering users
based on their behavior during the morning of a given day.
Figure 7.3: Clustering synthetic data: Within cluster error
Note: The within-cluster error (WCE) for each number of clusters (#Clusters). The shaded areas represent 95% CIs obtained by
bootstrapping. For the two clustering methods, we calculate the WCE 50 times and present the min, mean, and max
values for each #Clusters.
On a given day, people exhibit a certain pattern of behavior, but they may deviate from that pattern
depending on their schedule for the day. For example, if we find behavior that is common among
users on one day but not on other days, we can expect that those users share some event
on that day. To understand these day-to-day fluctuations, we need to study the similarities in
behavior between users on each day.
There are two difficulties in analyzing the similarity of behavior between users on a given day.
First, we can observe each user's morning behavior only once per day, meaning that we obtain
a single behavior pattern per user. This prevents us from comparing users' behaviors from
a statistical perspective, such as through the frequency of observed patterns. For example, when looking
for users who behave similarly over a longer duration, such as a month, we can
calculate the frequency of behavior patterns to obtain feature vectors, and popular methods from the
literature, such as the bag-of-words model or tf-idf, become applicable. In the setting of this chapter,
we cannot study the frequencies of behavior patterns and can only
compare users using the single observation from each user.
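For contrast, such a frequency-based representation would look like the following minimal sketch (the function and argument names are hypothetical); it requires many observations per user and is therefore not applicable in our single-observation setting:

    from collections import Counter

    def frequency_features(user_action_logs, vocabulary):
        """Bag-of-actions count vectors, one per user, over a longer observation window."""
        vectors = []
        for actions in user_action_logs:  # all actions of one user over, e.g., a month
            counts = Counter(actions)
            vectors.append([counts[a] for a in vocabulary])
        return vectors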
Second, we can of course compare users' behaviors based on a single observation: if two people have
exactly the same stream of behaviors, such as the same action sequence or n-grams, we can consider
that they behave the same. However, this only accounts for perfect matches. The behaviors
of running and walking, for instance, are similar in terms of movement, but comparing only those
labels omits this kind of similarity.
Even under these two difficulties, our proposed method can cluster the users. The OPW
clustering overcomes the first difficulty by calculating the distance between the action sequences
of the users on a given day while considering the order of the elements of those action sequences. To
alleviate the second difficulty, the cost matrix used in the OPW measures the similarity between
actions, defined as the L2 distance between the embedding
representations of the actions (Equation 7.1).
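As a minimal sketch, assuming each user's daily sequence is an array of action embedding vectors, the cost matrix can be computed as the pairwise L2 distances between the embeddings of the two sequences:

    import numpy as np

    def opw_cost_matrix(seq_a, seq_b):
        """Pairwise L2 distances between action embeddings of two sequences.

        seq_a has shape (n, d) and seq_b has shape (m, d); the returned (n, m)
        matrix is the cost matrix used by the order-preserving Wasserstein distance.
        """
        diff = seq_a[:, None, :] - seq_b[None, :, :]
        return np.linalg.norm(diff, axis=-1)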
7.4.2.2 Results of Morning Pattern Clustering
To conduct the morning pattern clustering, we investigate how students start their week over an
academic semester. We study the morning behavior on the first day of the week over ten weeks
during the semester. We focus on the students whose behavior was observed in all ten weeks,
resulting in 28 students. We run our OPW clustering algorithm each week and present the results in
Figure 7.4. We observe that the size of each cluster becomes smaller as the semester continues,
meaning that the behaviors among students diversify as the semester approaches its end. We also
observe that, during Weeks 4-6, most students fall in the same cluster (red). Given that the midterm
exams fall in those weeks, we infer that the clustering algorithm captures the behavior
changes associated with this critical event common among the students.
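The weekly analysis can be organized as in the sketch below, where morning_sequences (a mapping from week to each student's action-embedding sequence) and cluster_fn (a handle to the OPW k-means of this chapter) are assumed, hypothetical inputs:

    def weekly_morning_clustering(morning_sequences, cluster_fn, n_clusters=5):
        """Run a clustering function week by week and record cluster IDs per student."""
        weekly_clusters = {}
        for week, seqs in morning_sequences.items():
            students = sorted(seqs)  # fix an order over the students observed that week
            labels = cluster_fn([seqs[s] for s in students], n_clusters)
            weekly_clusters[week] = dict(zip(students, labels))
        return weekly_clusters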
To further investigate the clustering patterns, we study the classes that the students in the same
cluster share. We count how many classes each pair of students in each cluster shares over the
weeks. Figure 7.5 compares the average number of shared classes among the clusters in each week
and shows that the averages of the clusters in the last five weeks tend to be larger. To clarify this
tendency, we construct a null average in which we randomly shuffle the students across clusters and
average over 1,000 realizations. Figure 7.6 indicates the clusters whose average is greater than the
null average; more clusters have a large average compared to the null average in the last part of the term.
Figure 7.4: Morning Pattern Detection by OPW
Note: The results of the OPW clustering on students’ morning pattern detection. We run the clustering
algorithm on each first day of the week (#clustering = 5). The color represents cluster ID, the row represents
users, and the column represents weeks.
Figure 7.5: The number of shared classes among students in the same cluster
Note: The number of classes that the students in the same cluster share. We first calculate the number of
classes that each pair of students shares within each cluster. We then average the number of shared
classes for each cluster; the average is not calculated for clusters of size one.
This finding suggests that students who share the same classes behave more similarly as
the academic term progresses.
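A sketch of this null model is given below; cluster_of (a mapping from student to cluster ID in a given week) and n_shared_classes (a helper returning the number of classes two students share) are assumed, hypothetical inputs.

    import random
    from itertools import combinations
    from math import comb

    def mean_shared_classes(cluster_of, n_shared_classes):
        """Average number of shared classes per cluster; size-one clusters are skipped."""
        clusters = {}
        for student, cid in cluster_of.items():
            clusters.setdefault(cid, []).append(student)
        return {cid: sum(n_shared_classes(a, b) for a, b in combinations(members, 2))
                     / comb(len(members), 2)
                for cid, members in clusters.items() if len(members) > 1}

    def null_average(cluster_of, n_shared_classes, n_iter=1000, seed=0):
        """Shuffle students over the same cluster sizes and average the per-cluster means."""
        rng = random.Random(seed)
        students, cluster_ids = list(cluster_of), list(cluster_of.values())
        totals = {}
        for _ in range(n_iter):
            rng.shuffle(students)
            shuffled = dict(zip(students, cluster_ids))
            for cid, value in mean_shared_classes(shuffled, n_shared_classes).items():
                totals.setdefault(cid, []).append(value)
        return {cid: sum(values) / len(values) for cid, values in totals.items()}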
Thanks to our proposed method, we can capture dynamic behavior from the action sequences on the
first day of each week and reveal the transitions in students' behavior in response to external
events that may affect their behavior.
Figure 7.6: The clusters in which the students particularly share the same classes
Note: The clusters in which the students particularly share the same classes over the academic term. We first
construct null clusters by randomly shuffling the students and calculating the number of shared classes.
We iterate this procedure 1,000 times and compute the null average. For each cluster, we plot a line when
that cluster has a larger average than the null average. The average is not calculated for clusters of size one.
7.5 Conclusion
This chapter proposed a method for clustering users based on their dynamic behavior, preserving
the order of their actions using embedding representations of actions. We first discussed that
existing methods use summary-statistic representations of dynamic behavior constructed from embedding
vectors of behavior, which prevents us from capturing the dynamics of actions in clustering. To
overcome this, we proposed an OPW-based clustering method powered by an optimal transport
technique. To study the method's performance, we analyzed synthetic data in which the order
of actions plays a pivotal role. The analysis with synthetic data demonstrated that our method can
capture dynamic behavior that a typical method in the literature ignores. We then conducted a
case study with the student life dataset that detects morning patterns over the course of
the academic term. We also demonstrated that our method can reveal users' dynamics at a finer
granularity, helping us understand dynamic human behavior.
Chapter 8
Conclusion and Discussion
Temporal structure is critical to understanding human behavior. Studying the temporal structure of
human behavior can help us understand how people live in this age of pervasive digital platforms,
but the amount of data generated by these platforms is enormous. Extracting the necessary
information is therefore essential for modeling and analyzing human behavior from such a considerable
amount of data. To this end, this dissertation proposes methods, and their applications, for mining and
modeling the temporal structure of human behavior in digital platforms.
In this dissertation, I contributed to human behavioral research, bridging computer science
and social science, in three ways. First, I modeled and analyzed the temporal structure of human
consumption behavior using the NTF method and the multi-timescale representation. Our
model uncovered that the temporal structure captured by the multi-timescale representation reflects
consumers' characteristics, such as demographic information.
Second, I reviewed the recent literature on human behavior studies with machine
learning methods, focusing on applications of NTF models and word embedding models. This
review built a taxonomy of the analytical approaches of recent studies that apply word
embedding models to social science, and I surveyed the ways in which the literature leverages
word embedding models in its research. Through this holistic survey, I identified recent trends in
word2vec applications to human behavior data, argued that this dissertation follows this line of
work, and contextualized the dissertation within the recent related research. In addition, I proposed
a word2vec-based model for modeling human behavior in knowledge production and revealed the
mechanism behind users' coordination in knowledge production. The proposed word2vec-based model
demonstrated that there are two sources of labor in producing knowledge (addition and deletion
edits) and that the balance between them is linked to the quality of the produced knowledge.
Third, after noting the importance of inter-temporal information for human behavior analysis, I
proposed a model that examines human actions and their time intervals simultaneously by utilizing
word embedding methods. Inspired by dual-process theory, I proposed the action timing context
(ATC), which examines whether a given action belongs to a long-term or short-term context. I
demonstrated that the ATC allows us to explore the smallest units of human behavior (actions) in
terms of temporal context. Using the proposed model, I also proposed a clustering method to
conduct user-level analysis with the obtained action embedding vectors. The proposed clustering
considers the order of actions and enables us to find clusters of users who show similar dynamic
behavior.
While this dissertation provided a comprehensive set of methodologies and applications for
a wide range of human behavior, some issues remain to be addressed in future
research. Regarding Chapter 3, it is important to note that the proposed NTF model with multiple
timescales successfully revealed explicit patterns, but other multi-timescale activity
patterns may exist at shorter or longer timescales than daily or weekly. For instance, the
timing of shopping may be affected by the time of day, and the consumption of expensive goods
(e.g., cars) may be scheduled once every ten years. In addition, consumption patterns could also
be encoded in what users purchased. While our analysis is based on the number of items purchased
by a user, the composition of those purchases would also be useful for revealing the demographic
characteristics of users. Moreover, more multi-timescale patterns may exist in other economic and
social contexts, such as financial markets, online communication networks, and face-to-face networks.
Chapter 3 demonstrated that the NTF is a useful and user-friendly tool for detecting multi-timescale
properties, and I hope our work stimulates further research on many economic and social activities to
better understand human behavior.
While Chapter 4 demonstrated that users divide their labor between two types of edits and showed
the consequences of this division for the quality of the produced knowledge, many aspects remain
unexplored. As a further direction, we need to conduct the same analysis on other knowledge
production platforms to study whether the findings presented in that chapter are universal
regularities. For example, it should be noted that the current analysis only covers data from the
English version of Wikipedia. Thanks to the high level of abstraction of the model, the analysis
conducted in the chapter is language-agnostic, and we can apply it to a wide range of knowledge
production in digital platforms. Therefore, it should be straightforward to conduct the same analysis
on other types of knowledge production, such as writing academic papers with co-authors or developing
software on GitHub. Moreover, the presented study is not immune to limitations arising from this
high level of abstraction. The proposed model dramatically abstracts the editing behavior in
Wikipedia and does not account for what kind of content the users added or deleted in their edits.
Although this drastic abstraction provides consistent yet interpretable results, it also limits our
study. For example, we do not know what kinds of deletions or additions are associated with
high-quality editing, which is essential to improving the quality of knowledge production. As a
further direction, I will expand the model to characterize addition and deletion edits based on what
content or information is added or deleted during each edit.
In Chapter 6, I proposed a framework that can lead to a unified understanding of human dynamic
behavior by embedding inter-temporal information into low-dimensional representations
of actions. Indeed, the proposed framework allows us to study human actions from the perspective
of temporal context. Yet, our analysis reveals several exciting directions for future work. First, we
need to analyze the sensitivity of our framework in capturing human cognitive changes. While a
simple analysis with EMA data showed a promising result, we need to conduct a more rigorous
analysis on this point. To do so, it is necessary to construct a dataset that combines behavioral
data and psychological changes in subjects, and this interdisciplinary challenge would open up a
new research theme. Second, I discussed that dual-process theory assumes that a human
being has two types of cognitive processing. Using large-scale, high-dimensional data together with
the proposed framework would make it possible to analyze this important theory in detail, for
example by studying how people switch cognitive states. This data-driven analysis would allow us to
capture the transition between the two cognitive states at a high resolution, and it may reveal that
the change of cognitive state is not binary but continuous.
There are several further directions for developing the model proposed in Chapter 7. First, while
the case study demonstrated that our model can capture behavior transitions, we also need
to conduct additional case studies that apply our model to human behavior analysis. To illustrate
the performance of the proposed method in real-world situations, we need empirical
data from natural experiments that record dynamic human behavior together with the events that are
indeed supposed to change the users' behavior. Therefore, in addition to the existing dataset used
in this study, it is vital to construct a dataset that records dynamic human behavior along with the
events that might affect the users' behavior.
Lastly, I note that machine learning models can aid us in bridging one discipline to others,
as discussed in Chapter 2. It is crucial that, while the methods used in this dissertation
utilize machine learning models invented in computer science, this dissertation also employed
knowledge and theories proposed in other research disciplines, such as economics, physiology,
and behavioral science. This dissertation therefore indicates the importance and power of
interdisciplinary research for understanding human behavior. To rephrase, this dissertation
demonstrates that combining knowledge from different fields expands our understanding of the world,
and I argue that this is the beauty of art and science.
130
References
[1] B. T. Adler and L. De Alfaro, “A content-driven reputation system for the wikipedia,” in
Proceedings of the World Wide Web Conference, 2007.
[2] E. Agirre, E. Alfonseca, K. Hall, J. Kravalova, M. Pas ¸ca, and A. Soroa, “A study on simi-
larity and relatedness using distributional and WordNet-based approaches,” in Proceedings
of Human Language Technologies: Annual Conference of the North American Chapter of
the Association for Computational Linguistics, 2009, pp. 19–27.
[3] E. Aguila, O. Attanasio, and C. Meghir, “Changes in consumption at retirement: Evidence
from panel data,” Review of Economics and Statistics, vol. 93, pp. 1094–1099, 2011.
[4] F. Alvarez-Cuadrado, G. Monteiro, and S. J. Turnovsky, “Habit formation, catching up
with the joneses, and economic growth,” Journal of Economic Growth, vol. 9, pp. 47–80,
2004.
[5] O. Arazy, W. Morgan, and R. Patterson, “Wisdom of the crowds: Decentralized knowledge
construction in wikipedia,” in Annual Workshop on Information Technologies & Systems
Paper, 2006.
[6] S. Arora, Y . Li, Y . Liang, T. Ma, and A. Risteski, “A latent variable model approach to PMI-
based word embeddings,” Transactions of the Association for Computational Linguistics,
vol. 4, pp. 385–399, 2016.
[7] S. Arora, Y . Liang, and T. Ma, “A simple but tough-to-beat baseline for sentence embed-
dings,” in International Conference on Learning Representations, 2017.
[8] E. Ash, D. L. Chen, A. Ornaghi, et al., Stereotypes in high-stakes decisions: Evidence from
us circuit courts. University of Warwick, Department of Economics, 2020.
[9] O. P. Attanasio and G. Weber, “Consumption and saving: Models of intertemporal allo-
cation and their implications for public policy,” Journal of Economic Literature, vol. 48,
pp. 693–751, 2010.
[10] J. Attenberg, S. Pandey, and T. Suel, “Modeling and predicting user behavior in sponsored
search,” in Proceedings of the ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining, 2009.
[11] T. Baltruˇ saitis, C. Ahuja, and L.-P. Morency, “Multimodal machine learning: A survey and
taxonomy,” IEEE transactions on pattern analysis and machine intelligence, vol. 41, no. 2,
pp. 423–443, 2018.
[12] A.-L. Barabasi, “The origin of bursts and heavy tails in human dynamics,” Nature, vol. 435,
no. 7039, pp. 207–211, 2005.
[13] C. D. Barros, M. R. Mendonc ¸a, A. B. Vieira, and A. Ziviani, “A survey on embedding
dynamic graphs,” ACM Computing Surveys, vol. 55, no. 1, pp. 1–37, 2021.
131
[14] M. Baumg¨ artner and J. Zahner, “Whatever it takes to understand a central banker: Em-
bedding their words using neural networks,” MAGKS Joint Discussion Paper Series in
Economics, Tech. Rep., 2021.
[15] D. R. Bell and J. M. Lattin, “Shopping behavior and consumer preference for store price
format: Why “large basket” shoppers prefer edlp,” Marketing Science, vol. 17, pp. 66–88,
1998.
[16] K. Benoit, “Text as data: An overview,” The SAGE Handbook of Research Methods in
Political Science and International Relations, SAGE Publishing, London, 2020.
[17] A. R. Benson, R. Kumar, and A. Tomkins, “Modeling user consumption sequences,” in
Proceedings of International Conference on World Wide Web, 2016, pp. 519–529.
[18] N. Bhatia and S. Bhatia, “Changes in Gender Stereotypes Over Time: A Computational
Analysis,” Psychology of Women Quarterly, vol. 45, no. 1, pp. 106–125, 2021.
[19] S. Bhatia, “Predicting Risk Perception: New Insights from Data Science,” Management
Science, vol. 65, no. 8, pp. 3800–3823, 2019.
[20] S. Bhatia, C. Y . Olivola, N. Bhatia, and A. Ameen, “Predicting leadership perception with
large-scale natural language data,” The Leadership Quarterly, p. 101 535, 2021.
[21] S. Bird, E. Klein, and E. Loper, Natural language processing with Python: analyzing text
with the natural language toolkit. ” O’Reilly Media, Inc.”, 2009.
[22] S. Bogomolova, K. V orobyev, B. Page, and T. Bogomolov, “Socio-demographic differ-
ences in supermarket shopper efficiency,” Australasian Marketing Journal (AMJ), vol. 24,
pp. 108–115, 2016.
[23] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov, “Enriching word vectors with sub-
word information,” Transactions of the Association for Computational Linguistics, vol. 5,
pp. 135–146, 2017.
[24] T. Bolukbasi, K.-W. Chang, J. Y . Zou, V . Saligrama, and A. T. Kalai, “Man is to com-
puter programmer as woman is to homemaker? debiasing word embeddings,” Advances in
Neural Information Processing Systems, vol. 29, pp. 4349–4357, 2016.
[25] E. Borra, E. Weltevrede, P. Ciuccarelli, A. Kaltenbrunner, D. Laniado, G. Magni, M. Mauri,
R. Rogers, and T. Venturini, “Societal controversies in wikipedia articles,” in Proceedings
of the Conference on Human Factors in Computing Systems, 2015, pp. 193–196.
[26] C. Boyce-Jacino and S. DeDeo, “Opacity, obscurity, and the geometry of question-asking,”
Cognition, vol. 196, p. 104 071, 2020.
[27] M. Bressan, S. Leucci, A. Panconesi, P. Raghavan, and E. Terolli, “The limits of popularity-
based recommendations, and the role of social ties,” in Proceedings of the ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining, 2016, pp. 745–754.
[28] R. Bro and H. A. Kiers, “A new efficient method for determining the number of compo-
nents in parafac models,” Journal of Chemometrics, vol. 17, pp. 274–286, 2003.
[29] M. Browning and M. D. Collado, “Habits and heterogeneity in demands: A panel data
analysis,” Journal of Applied Econometrics, vol. 22, pp. 625–640, 2007.
[30] M.-E. Brunet, C. Alkalay-Houlihan, A. Anderson, and R. Zemel, “Understanding the ori-
gins of bias in word embeddings,” in International Conference on Machine Learning,
PMLR, 2019, pp. 803–811.
[31] E. Bruni, N.-K. Tran, and M. Baroni, “Multimodal distributional semantics,” Journal of
artificial intelligence research , vol. 49, pp. 1–47, 2014.
132
[32] H. Cai, V . W. Zheng, and K. C.-C. Chang, “A comprehensive survey of graph embedding:
Problems, techniques, and applications,” IEEE Transactions on Knowledge and Data En-
gineering, vol. 30, no. 9, pp. 1616–1637, 2018.
[33] A. Caliskan, J. J. Bryson, and A. Narayanan, “Semantics derived automatically from lan-
guage corpora contain human-like biases,” Science, vol. 356, no. 6334, pp. 183–186, 2017.
[34] J. Y . Campbell and N. G. Mankiw, “Consumption, income, and interest rates: Reinterpret-
ing the time series evidence,” NBER macroeconomics annual, vol. 4, pp. 185–216, 1989.
[35] R. Carrasco, J. M. Labeaga, and J. David L´ opez-Salido, “Consumption and habits: Evi-
dence from panel data,” The Economic Journal, vol. 115, pp. 144–165, 2005.
[36] R.
ˇ
Cech, J. H˚ ula, M. Kub´ at, X. Chen, and J. Miliˇ cka, “The Development of Context Speci-
ficity of Lemma. A Word Embeddings Approach,” Journal of Quantitative Linguistics,
vol. 26, no. 3, pp. 187–204, 2019.
[37] A. Chanen, “Deep learning for extracting word-level meaning from safety report nar-
ratives,” in 2016 Integrated Communications Navigation and Surveillance, IEEE, 2016,
pp. 5D2–1.
[38] W. G. Charles, “Contextual correlates of meaning,” Applied Psycholinguistics, vol. 21,
no. 4, pp. 505–524, 2000.
[39] F. Chaubard, M. Fang, G. Genthial, R. Mundra, and R. Socher, CS224n: Natural language
processing with deep learning, lecture notes: Part I, https://web.stanford.edu/
class/cs224n/readings/cs224n-2019-notes01-wordvecs1.pdf, last accessed
March 1, 2022, 2019.
[40] C. Chen, S. Kim, H. Bui, R. Rossi, E. Koh, B. Kveton, and R. Bunescu, “Predictive analysis
by leveraging temporal user behavior and user embeddings,” in Proceedings of the ACM
International Conference on Information and Knowledge Management, 2018.
[41] H. Chen and R. Zhang, “Identifying Nonprofits by Scaling Mission and Activity with Word
Embedding,” VOLUNTAS: International Journal of Voluntary and Nonprofit Organiza-
tions, 2021.
[42] L. Cherkasova and P. Phaal, “Session-based admission control: A mechanism for improv-
ing performance of commercial web sites,” in Seventh International Workshop on Quality
of Service, IEEE, 1999, pp. 226–235.
[43] E. Chersoni, E. Santus, C.-R. Huang, and A. Lenci, “Decoding Word Embeddings with
Brain-Based Semantic Features,” Computational Linguistics, vol. 47, no. 3, pp. 663–698,
2021.
[44] G. L. Ciampaglia, A. Flammini, and F. Menczer, “The production of information in the
attention economy,” Scientific Reports , vol. 5, no. 1, p. 9452, 2015.
[45] L. Cong, T. Liang, and X. Zhang, “Textual Factors: A Scalable, Interpretable, and Data-
driven Approach to Analyzing Unstructured Information,” SSRN Electronic Journal, 2018.
[46] K. Cooray and M. M. Ananda, “Modeling actuarial data with a composite lognormal-pareto
model,” Scandinavian Actuarial Journal, vol. 2005, no. 5, pp. 321–334, 2005.
[47] I. Crawford, “Habits revealed,” The Review of Economic Studies, vol. 77, pp. 1382–1402,
2010.
[48] P. Cui, X. Wang, J. Pei, and W. Zhu, “A survey on network embedding,” IEEE transactions
on knowledge and data engineering, vol. 31, no. 5, pp. 833–852, 2018.
[49] T. T. Cuong and C. M¨ uller-Birn, “Applicability of sequence analysis methods in analyzing
peer-production systems: A case study in wikidata,” in Social Informatics, 2016.
133
[50] M. De Choudhury, S. Sharma, and E. Kiciman, “Characterizing dietary choices, nutrition,
and language in food deserts via social media,” in Proceedings of the ACM Conference on
Computer Supported Cooperative Work and Social Computing, 2016, pp. 1157–1170.
[51] M. H. DeGroot and M. J. Schervish, Probability and statistics. Pearson Education, 2012.
[52] V . Di Carlo, F. Bianchi, and M. Palmonari, “Training temporal word embeddings with a
compass,” in Proceedings of the AAAI Conference on Artificial Intelligence , 2019.
[53] N. Djuric, V . Radosavljevic, M. Grbovic, and N. Bhamidipati, “Hidden conditional random
fields with deep user embeddings for ad targeting,” in International Conference on Data
Mining, 2014, pp. 779–784.
[54] Dr.wallet,https://www.drwallet.jp, last accessed March 22, 2020.
[55] N. Du, H. Dai, R. Trivedi, U. Upadhyay, M. Gomez-Rodriguez, and L. Song, “Recurrent
marked temporal point processes: Embedding event history to vector,” in Proceedings of
the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining,
2016.
[56] C. Dyer, “Notes on noise contrastive estimation and negative sampling,” arXiv preprint
arXiv:1410.8251, 2014.
[57] K. E. Dynan, “Habit formation in consumer preferences: Evidence from panel data,” Amer-
ican Economic Review, vol. 90, pp. 391–406, 2000.
[58] A. Feldmann and W. Whitt, “Fitting mixtures of exponentials to long-tail distributions to
analyze network performance models,” Performance Evaluation, vol. 31, no. 3, pp. 245–
279, 1998.
[59] W. Feng, J. Tang, T. X. Liu, S. Zhang, and J. Guan, “Understanding dropouts in moocs,”
in Proceedings of the AAAI Conference on Artificial Intelligence , 2019.
[60] E. Ferrara, O. Varol, C. Davis, F. Menczer, and A. Flammini, “The rise of social bots,”
Commun. ACM, vol. 59, no. 7, pp. 96–104, 2016.
[61] F. Fl¨ ock and M. Acosta, “Wikiwho: Precise and efficient attribution of authorship of revi-
sioned content,” in Proceedings of the World Wide Web Conference, 2014, pp. 843–854.
[62] A. C. M. Fong, B. Zhou, S. C. Hui, G. Y . Hong, and T. A. Do, “Web content recom-
mender system based on consumer behavior modeling,” IEEE Transactions on Consumer
Electronics, vol. 57, pp. 962–969, 2011.
[63] J. E. Font and M. R. Costa-Jussa, “Equalizing gender biases in neural machine translation
with word embeddings techniques,” arXiv preprint arXiv:1901.03116, 2019.
[64] Y . Gandica, J. Carvalho, and F. S. Dos Aidos, “Wikipedia editing dynamics,” Physical
Review E, vol. 91, no. 1, p. 012 824, 2015.
[65] N. Garg, L. Schiebinger, D. Jurafsky, and J. Zou, “Word embeddings quantify 100 years of
gender and ethnic stereotypes,” Proceedings of the National Academy of Sciences, vol. 115,
no. 16, E3635–E3644, 2018.
[66] L. Gauvin, A. Panisson, and C. Cattuto, “Detecting the community structure and activity
patterns of temporal networks: A non-negative tensor factorization approach,” PLOS ONE,
vol. 9, e13636, 2014.
[67] G. Gennaro and E. Ash, “Emotion and reason in political language,” The Economic Jour-
nal, 2022.
[68] M. G´ enois, M. Zens, C. Lechner, B. Rammstedt, and M. Strohmaier, “Building connec-
tions: How scientists meet each other during a conference,” arXiv preprint arXiv:1901.01182,
2019.
134
[69] M. Gentzkow, B. Kelly, and M. Taddy, “Text as data,” Journal of Economic Literature,
vol. 57, no. 3, pp. 535–74, 2019.
[70] N. Gillani and R. Levy, “Simple dynamic word embeddings for mapping perceptions in
the public sphere,” in Proceedings of the Third Workshop on Natural Language Processing
and Computational Social Science, 2019, pp. 94–99.
[71] P. Giudici, B. Huang, and A. Spelta, “Trade networks and economic fluctuations in asian
countries,” Economic Systems, vol. 43, no. 2, p. 100 695, 2019.
[72] Y . Goldberg, Word2vecf,https://github.com/BIU-NLP/word2vecf, 2017.
[73] Y . Goldberg and O. Levy, “Word2vec explained: Deriving mikolov et al.’s negative-sampling
word-embedding method,” arXiv preprint arXiv:1402.3722, 2014.
[74] H. Gonen and Y . Goldberg, “Lipstick on a pig: Debiasing methods cover up systematic gen-
der biases in word embeddings but do not remove them,” arXiv preprint arXiv:1903.03862,
2019.
[75] Google, Google news word embedding model, 2013.
[76] K. Goˇ seva-Popstojanova, A. D. Singh, S. Mazimdar, and F. Li, “Empirical characterization
of session–based workload and reliability for web servers,” Empirical Software Engineer-
ing, vol. 11, no. 1, pp. 71–117, 2006.
[77] P. Goyal and E. Ferrara, “Graph embedding techniques, applications, and performance: A
survey,” Knowledge-Based Systems, vol. 151, pp. 78–94, 2018.
[78] A. C. Graesser, K. K. Millis, and R. A. Zwaan, “Discourse comprehension,” Annual review
of psychology, vol. 48, no. 1, pp. 163–189, 1997.
[79] J. Graham, B. A. Nosek, J. Haidt, R. Iyer, S. Koleva, and P. H. Ditto, “Mapping the moral
domain.,” Journal of Personality and Social Psychology, vol. 101, no. 2, p. 366, 2011.
[80] M. Grbovic, V . Radosavljevic, N. Djuric, N. Bhamidipati, J. Savla, V . Bhagwan, and D.
Sharp, “E-commerce in your inbox: Product recommendations at scale,” in Proceedings of
the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining,
2015.
[81] J. Grimmer and B. M. Stewart, “Text as data: The promise and pitfalls of automatic content
analysis methods for political texts,” Political Analysis, vol. 21, no. 3, pp. 267–297, 2013.
[82] N. Grinberg, P. A. Dow, L. A. Adamic, and M. Naaman, “Changes in engagement before
and after posting to facebook,” in Proceedings of the Conference on Human Factors in
Computing Systems, 2016.
[83] A. Guariglia and M. Rossi, “Consumption, habit formation, and precautionary saving: Ev-
idence from the british household panel survey,” Oxford Economic Papers, vol. 54, pp. 1–
19, 2002.
[84] V . Gupta, A. Saw, P. Nokhiz, P. Netrapalli, P. Rai, and P. Talukdar, “P-sif: Document em-
beddings using partition averaging,” in Proceedings of the AAAI Conference on Artificial
Intelligence, vol. 34, 2020, pp. 7863–7870.
[85] M. Gutmann and A. Hyv¨ arinen, “Noise-contrastive estimation: A new estimation princi-
ple for unnormalized statistical models,” in Proceedings of the Conference on Artificial
Intelligence and Statistics, 2010.
[86] J. Haidt and J. Graham, “When morality opposes justice: Conservatives have moral intu-
itions that liberals may not recognize,” Social Justice Research, vol. 20, no. 1, pp. 98–116,
2007.
135
[87] A. Halfaker, R. S. Geiger, J. T. Morgan, and J. Riedl, “The rise and decline of an open col-
laboration system: How wikipedia’s reaction to popularity is causing its decline,” American
Behavioral Scientist, vol. 57, no. 5, pp. 664–688, 2013.
[88] A. Halfaker, A. Kittur, and J. Riedl, “Don’t bite the newbies: How reverts affect the quan-
tity and quality of wikipedia work,” in Proceedings of the International Symposium on
Wikis and Open Collaboration, 2011, pp. 163–172.
[89] W. Hamilton, J. Leskovec, and D. Jurafsky, “Diachronic word embeddings reveal statistical
laws of semantic change,” in Proceedings of the Annual Meeting of the Association for
Computational Linguistics, 2016.
[90] ——, Histwords:word embeddings for historical text, 2016.
[91] L. Han, A. Checco, D. Difallah, G. Demartini, and S. Sadiq, “Modelling user behavior dy-
namics with embeddings,” in Proceedings of the ACM SIGKDD International Conference
on Knowledge Discovery and Data Mining, 2020.
[92] Z. S. Harris, “Distributional structure,” WORD, vol. 10, no. 2-3, pp. 146–162, 1954.
[93] T. Havranek, M. Rusnak, and A. Sokolova, “Habit formation in consumption: A meta-
analysis,” European Economic Review, vol. 95, pp. 142–167, 2017.
[94] F. Hill, R. Reichart, and A. Korhonen, “Simlex-999: Evaluating semantic models with
(genuine) similarity estimation,” Computational Linguistics, vol. 41, no. 4, pp. 665–695,
2015.
[95] P. Holme and J. Saram¨ aki, “Temporal networks,” Physics Reports, vol. 519, no. 3, pp. 97–
125, 2012, Temporal Networks.
[96] H. Hosseinmardi, H.-T. Kao, K. Lerman, and E. Ferrara, “Discovering hidden structure
in high dimensional human behavioral data via tensor factorization,” arXiv:1905.08846,
2019.
[97] C.-T. Hsieh, “Do consumers react to anticipated income changes? evidence from the alaska
permanent fund,” American Economic Review, vol. 93, pp. 397–405, 2003.
[98] K. Hu, K. Qi, S. Yang, S. Shen, X. Cheng, H. Wu, J. Zheng, S. McClure, and T. Yu,
“Identifying the “Ghost City” of domain topics in a keyword semantic space combining
citations,” Scientometrics, vol. 114, no. 3, pp. 1141–1157, 2018.
[99] S. Hu, Z. He, L. Wu, L. Yin, Y . Xu, and H. Cui, “A framework for extracting urban func-
tional regions based on multiprototype word embeddings using points-of-interest data,”
Computers, Environment and Urban Systems, vol. 80, p. 101 442, 2020.
[100] E. Huang, R. Socher, C. Manning, and A. Ng, “Improving word representations via global
context and multiple word prototypes,” in Proceedings of Annual Meeting of the Associa-
tion for Computational Linguistics, 2012, pp. 873–882.
[101] M. D. Hurd and S. Rohwedder, “Heterogeneity in spending change at retirement,” The
Journal of the Economics of Ageing, vol. 1, pp. 60–71, 2013.
[102] C. Hutto and E. Gilbert, “Vader: A parsimonious rule-based model for sentiment analysis
of social media text,” in Proceedings of the International AAAI Conference on Web and
Social Media, vol. 8, 2014.
[103] R. Iwamoto, R. Kohita, and A. Wachi, “Polar embedding,” in Proceedings of the 25th
Conference on Computational Natural Language Learning, Online: Association for Com-
putational Linguistics, 2021, pp. 470–480.
[104] K. Jaidka, A. Ceolin, I. Singh, N. Chhaya, and L. Ungar, “WikiTalkEdit: A dataset for
modeling editors’ behaviors on wikipedia,” in Proceedings of the Conference of the North
136
American Chapter of the Association for Computational Linguistics: Human Language
Technologies, 2021, pp. 2191–2200.
[105] S. Javanmardi, C. Lopes, and P. Baldi, “Modeling user reputation in wikis,” Statistical
Analysis and Data Mining, vol. 3, no. 2, pp. 126–139, 2010.
[106] M. Jha, H. Liu, and A. Manela, “Natural disaster effects on popular sentiment toward
finance,” Journal of Financial and Quantitative Analysis, pp. 1–35, 2021.
[107] J. Jiang, D. Maldeniya, K. Lerman, and E. Ferrara, “The wide, the deep, and the maver-
ick: Types of players in team-based online games,” Proceedings of the ACM on Human-
Computer Interaction, vol. 5, no. CSCW1, pp. 1–26, 2021.
[108] D. S. Johnson, J. A. Parker, and N. S. Souleles, “Household expenditure and the income
tax rebates of 2001,” American Economic Review, vol. 96, pp. 1589–1610, 2006.
[109] J. Jones, M. Amin, J. Kim, and S. Skiena, “Stereotypical Gender Associations in Language
Have Decreased Over Time,” Sociological Science, vol. 7, pp. 1–35, 2020.
[110] B. E. Kahn and D. C. Schmittlein, “Shopping trip behavior: An empirical investigation,”
Marketing Letters, vol. 1, pp. 55–69, 1989.
[111] D. Kahneman, Thinking, fast and slow. Macmillan, 2011.
[112] L. Kaufman and P. J. Rousseeuw, Finding groups in data: an introduction to cluster anal-
ysis. John Wiley & Sons, 2009, vol. 344.
[113] K. Kawaguchi, T. Kuroda, and S. Sato, “Merger Analysis in the App Economy: An Empir-
ical Model of Ad-Sponsored Media,” SSRN Electronic Journal, 2020.
[114] M. Kelly, M. Ghafurian, R. L. West, and D. Reitter, “Indirect associations in learning
semantic and syntactic lexical relationships,” Journal of Memory and Language, vol. 115,
p. 104 153, 2020.
[115] I. Khaouja, G. Mezzour, K. M. Carley, and I. Kassou, “Building a soft skill taxonomy from
job openings,” Social Network Analysis and Mining, vol. 9, no. 1, p. 43, 2019.
[116] B. Kim, M. Yoo, K. C. Park, K. R. Lee, and J. H. Kim, “A value of civic voices for
smart city: A big data analysis of civic queries posed by Seoul citizens,” Cities, vol. 108,
p. 102 941, 2021.
[117] J. Kim and H. Park, “Fast nonnegative tensor factorization with an active-set-like method,”
in High-Performance Scientific Computing , Springer, 2012, pp. 311–326.
[118] Y . Kim and H. Shin, “Finding Sentiment Dimension in Vector Space of Movie Reviews:
An Unsupervised Approach,” Journal of Cognitive Science, vol. 18, no. 1, pp. 85–101,
2017.
[119] A. Kittur, E. Chi, B. A. Pendleton, B. Suh, and T. Mytkowicz, “Power of the few vs.
wisdom of the crowd: Wikipedia and the rise of the bourgeoisie,” in Proceedings of the
Conference on Human Factors in Computing Systems, 2007.
[120] T. Kobayashi, A. Sapienza, and E. Ferrara, “Extracting the multi-timescale activity patterns
of online financial markets,” Scientific Reports , vol. 8, p. 11 184, 2018.
[121] T. G. Kolda and B. W. Bader, “Tensor decompositions and applications,” SIAM Review,
vol. 51, pp. 455–500, 2009.
[122] F. Kooti, E. Moro, and K. Lerman, “Twitter session analytics: Profiling users’ short-term
behavioral changes,” in International Conference on Social Informatics, Springer, 2016,
pp. 71–86.
137
[123] F. Kooti, K. Subbian, W. Mason, L. Adamic, and K. Lerman, “Understanding short-term
changes in online activity sessions,” in Proceedings of the International Conference on
World Wide Web Companion, 2017, pp. 555–563.
[124] A. C. Kozlowski, M. Taddy, and J. A. Evans, “The geometry of culture: Analyzing the
meanings of class through word embeddings,” American Sociological Review, vol. 84,
no. 5, pp. 905–949, 2019.
[125] V . Kristof, M. Grossglauser, and P. Thiran, “War of words: The competitive dynamics of
legislative processes,” in Proceedings of the World Wide Web Conference, 2021, pp. 2803–
2809.
[126] A. C. Kroon, D. Trilling, T. G. L. A. v. d. Meer, and J. G. F. Jonkman, “Clouded reality:
News representations of culturally close and distant ethnic outgroups,” Communications,
vol. 45, no. s1, pp. 744–764, 2020.
[127] A. C. Kroon, D. Trilling, and T. Raats, “Guilty by Association: Using Word Embeddings
to Measure Ethnic Stereotypes in News Coverage,” Journalism & Mass Communication
Quarterly, vol. 98, no. 2, pp. 451–477, 2021.
[128] D. Lazer, A. Pentland, L. Adamic, S. Aral, A.-L. Barab´ asi, D. Brewer, N. Christakis,
N. Contractor, J. Fowler, M. Gutmann, et al., “Computational social science,” Science,
vol. 323, no. 5915, pp. 721–723, 2009.
[129] D. M. Lazer, A. Pentland, D. J. Watts, S. Aral, S. Athey, N. Contractor, D. Freelon, S.
Gonzalez-Bailon, G. King, H. Margetts, et al., “Computational social science: Obstacles
and opportunities,” Science, vol. 369, no. 6507, pp. 1060–1062, 2020.
[130] S. Leavy, M. T. Keane, and E. Pine, “Patterns in language: Text analysis of government
reports on the Irish industrial school system with word embedding,” Digital Scholarship in
the Humanities, vol. 34, no. Supplement 1, pp. i110–i122, 2019.
[131] R. Lee and J. Kim, “Developing a Social Index for Measuring the Public Opinion Re-
garding the Attainment of Sustainable Development Goals,” Social Indicators Research,
vol. 156, no. 1, pp. 201–221, 2021.
[132] J. Lerner and A. Lomi, “Team diversity, polarization, and productivity in online peer pro-
duction,” Social Network Analysis and Mining, vol. 9, no. 1, pp. 1–17, 2019.
[133] ——, “The Third Man: Hierarchy formation in wikipedia,” Applied Network Science,
vol. 2, no. 1, pp. 1–30, 2017.
[134] O. Levy and Y . Goldberg, “Dependency-based word embeddings,” in Proceedings of the
59th Annual Meeting of the Association for Computational Linguistics, 2014.
[135] K. Li, F. Mai, R. Shen, and X. Yan, “Measuring corporate culture using machine learning,”
The Review of Financial Studies, vol. 34, no. 7, pp. 3265–3315, 2021.
[136] L.-H. Lim and P. Comon, “Nonnegative approximations of nonnegative tensors,” Journal
of Chemometrics, vol. 23, pp. 432–441, 2009.
[137] C. Lin, Q. Zhu, S. Guo, Z. Jin, Y .-R. Lin, and N. Cao, “Anomaly detection in spatiotem-
poral data via regularized non-negative tensor analysis,” Data Mining and Knowledge Dis-
covery, vol. 32, no. 4, pp. 1056–1073, 2018.
[138] B. Liu, T. Zhang, F. X. Han, D. Niu, K. Lai, and Y . Xu, “Matching natural language sen-
tences with hierarchical sentence factorization,” in Proceedings of the World Wide Web
Conference, 2018, pp. 1237–1246.
138
[139] H. Liu, Y . Zhu, T. Zang, J. Yu, and H. Cai, “Jointly modeling individual student behav-
iors and social influence for prediction tasks,” in Proceedings of the ACM International
Conference on Information and Knowledge Management. 2020.
[140] L. Liu, W. Ouyang, X. Wang, P. Fieguth, J. Chen, X. Liu, and M. Pietik¨ ainen, “Deep
learning for generic object detection: A survey,” International Journal of Computer Vision,
vol. 128, no. 2, pp. 261–318, 2020.
[141] M.-T. Luong, R. Socher, and C. D. Manning, “Better word representations with recursive
neural networks for morphology,” in Proceedings of the seventeenth conference on compu-
tational natural language learning, 2013, pp. 104–113.
[142] L. v. d. Maaten and G. Hinton, “Visualizing data using t-sne,” Journal of Machine Learning
Research, vol. 9, pp. 2579–2605, 2008.
[143] S. Maniu, B. Cautis, and T. Abdessalem, “Building a signed network from interactions in
wikipedia,” in Databases and Social Networks, 2011, pp. 19–24.
[144] N. Mankiw, Macroeconomics. Worth Publishers, 2003.
[145] N. Masuda and R. Lambiotte, A guide to temporal networks. World Scientific, 2016.
[146] A. Matsui, T. Kobayashi, D. Moriwaki, and E. Ferrara, “Detecting multi-timescale con-
sumption patterns from receipt data: A non-negative tensor factorization approach,” Jour-
nal of Computational Social Science, pp. 1–14, 2020.
[147] A. Matsui, X. Ren, and E. Ferrara, “Using word embedding to reveal monetary policy
explanation changes,” in Proceedings of the Third Workshop on Economics and Natural
Language Processing, 2021, pp. 56–61.
[148] A. Matsui, A. Sapienza, and E. Ferrara, “Does streaming esports affect players’ behavior
and performance?” Games and Culture, vol. 15, no. 1, p. 1 555 412 019 838 095, 2020.
[149] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representa-
tions in vector space,” in International Conference on Learning Representations, 2013.
[150] T. Mikolov, K. Chen, G. S. Corrado, and J. Dean, Computing numeric representations of
words in a high-dimensional space, US Patent 9, 037, 464, 2015.
[151] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean, “Distributed representations
of words and phrases and their compositionality,” in Advances in Neural Information Pro-
cessing Systems, 2013.
[152] T. Mikolov, W. t. Yih, and G. Zweig, “Linguistic regularities in continuous space word
representations,” in Proceedings of the Conference of the North American Chapter of the
Association for Computational Linguistics: Human Language Technologies, Association
for Computational Linguistics, 2013, pp. 746–751.
[153] G. A. Miller and W. G. Charles, “Contextual correlates of semantic similarity,” Language
and cognitive processes, vol. 6, no. 1, pp. 1–28, 1991.
[154] W. W. Moe, “Buying, searching, or browsing: Differentiating between online shoppers
using in-store navigational clickstream,” Journal of Consumer Psychology, vol. 13, pp. 29–
39, 2003.
[155] W. W. Moe and P. S. Fader, “Capturing evolving visit behavior in clickstream data,” Jour-
nal of Interactive Marketing, vol. 18, pp. 5–19, 2004.
[156] C. M¨ uller-Birn, B. Karran, J. Lehmann, and M. Luczak-R¨ osch, “Peer-production system
or collaborative ontology engineering effort: What is wikidata?” In Proceedings of the
International Symposium on Open Collaboration, 2015.
139
[157] G. Muri´ c, A. Abeliuk, K. Lerman, and E. Ferrara, “Collaboration drives individual produc-
tivity,” in Proceedings of the ACM on Human-Computer Interaction, 2019.
[158] D. Murray, J. Yoon, S. Kojaku, R. Costas, W.-S. Jung, S. Milojevi´ c, and Y .-Y . Ahn, “Unsu-
pervised embedding of trajectories captures the latent structure of mobility,” arXiv preprint
arXiv:2012.02785, 2020.
[159] A. Namin and Y . Dehdashti, “A “hidden” side of consumer grocery shopping choice,”
Journal of Retailing and Consumer Services, vol. 48, pp. 16–27, 2019.
[160] A. Nanni and M. Fallin, “Earth, wind, (water), and fire: Measuring epistemic boundaries
in climate change research,” Poetics, vol. 88, p. 101 573, 2021.
[161] L. K. Nelson, “Leveraging the alignment between machine learning and intersectionality:
Using word embeddings to measure intersectional experiences of the nineteenth century
U.S. South,” Poetics, vol. 88, p. 101 539, 2021.
[162] K. Nemoto, P. Gloor, and R. Laubacher, “Social capital increases efficiency of collabora-
tion among wikipedia editors,” in Hypertext and hypermedia, 2011, pp. 231–240.
[163] M. Nickel and D. Kiela, “Poincar´ e embeddings for learning hierarchical representations,”
Advances in Neural Information Processing Systems, vol. 30, 2017.
[164] J. Nyarko and S. Sanga, “A Statistical Test for Legal Interpretation: Theory and Applica-
tions,” SSRN Electronic Journal, 2020.
[165] A. Oeberst, I. von der Beck, C. Matschke, T. A. Ihme, and U. Cress, “Collectively biased
representations of the past: Ingroup bias in wikipedia articles about intergroup conflicts,”
British Journal of Social Psychology, vol. 59, no. 4, pp. 791–818, 2020.
[166] M. Okada, K. Yamanishi, and N. Masuda, “Long-tailed distributions of inter-event times as
mixtures of exponential distributions,” Royal Society Open Science, vol. 7, no. 2, p. 191 643,
2020.
[167] R. Olbrich and C. Holsing, “Modeling consumer purchasing behavior in social shop-
ping communities with clickstream data,” International Journal of Electronic Commerce,
vol. 16, pp. 15–40, 2011.
[168] A. J. Oliner, A. P. Iyer, I. Stoica, E. Lagerspetz, and S. Tarkoma, “Carat: Collaborative
energy diagnosis for mobile devices,” in Proceedings of the 11th ACM Conference on Em-
bedded Networked Sensor Systems, 2013, pp. 1–14.
[169] J. G. Oliveira and A.-L. Barab´ asi, “Darwin and einstein correspondence patterns,” Nature,
vol. 437, no. 7063, pp. 1251–1251, 2005.
[170] A. Panisson, L. Gauvin, M. Quaggiotto, and C. Cattuto, “Mining concurrent topical activity
in microblog streams,” arXiv preprint arXiv:1403.1403, 2014.
[171] E. E. Papalexakis and C. Faloutsos, “Fast efficient and scalable core consistency diagnostic
for the parafac decomposition for big sparse tensors,” in IEEE International Conference on
Acoustics, Speech and Signal Processing, IEEE, 2015, pp. 5441–5445.
[172] N. Pecora and A. Spelta, “A multi-way analysis of international bilateral claims,” Social
Networks, vol. 49, pp. 81–92, 2017.
[173] J. W. Pennebaker, M. E. Francis, and R. J. Booth, “Linguistic inquiry and word count:
Liwc 2001,” Mahway: Lawrence Erlbaum Associates, vol. 71, no. 2001, p. 2001, 2001.
[174] J. Pennington, R. Socher, and C. D. Manning, “Glove: Global vectors for word repre-
sentation,” in Proceedings of the Conference on Empirical Methods in Natural Language
Processing, 2014, pp. 1532–1543.
140
[175] J. C. Peterson, D. Chen, and T. L. Griffiths, “Parallelograms revisited: Exploring the limi-
tations of vector space models for simple analogies,” Cognition, vol. 205, p. 104 440, 2020.
[176] G. Peyr´ e, M. Cuturi, et al., “Computational optimal transport: With applications to data
science,” Foundations and Trends® in Machine Learning, vol. 11, no. 5-6, pp. 355–607,
2019.
[177] A. Piscopo and E. Simperl, “Who models the world? collaborative ontology creation and
user roles in wikidata,” in Proceedings of the ACM on Human-Computer Interaction, 2018.
[178] E. L. Platt and D. M. Romero, “Network structure, efficiency, and performance in wikipro-
jects,” in Proceedings of the International AAAI Conference on Web and Social Media,
2018.
[179] M. Platzer and T. Reutterer, “Ticking away the moments: Timing regularity helps to better
predict customer activity,” Marketing Science, vol. 35, pp. 779–799, 2016.
[180] X. Qin, P. Cunningham, and M. Salter-Townshend, “The influence of network structures
of wikipedia discussion pages on the efficiency of wikiprojects,” Social Networks, vol. 43,
pp. 1–15, 2015.
[181] K. Radinsky, E. Agichtein, E. Gabrilovich, and S. Markovitch, “A word at a time: Comput-
ing word relatedness using temporal semantic analysis,” in Proceedings of the International
Conference on World Wide Web, 2011, pp. 337–346.
[182] E. Rahimikia, S. Zohren, and S.-H. Poon, “Realised V olatility Forecasting: Machine Learn-
ing via Financial Word Embedding,” SSRN Electronic Journal, 2021.
[183] N. Raman, N. Sauerberg, J. Fisher, and S. Narayan, “Classifying wikipedia article quality
with revision history networks,” in Proceedings of the International Symposium on Open
Collaboration, 2020, pp. 1–7.
[184] R. Rehurek and P.Sojka, Gensim,https://github.com/RaRe-Technologies/gensim,
2010.
[185] R. Rehurek and P. Sojka, “Software framework for topic modelling with large corpora,”
in In Proceedings of the LREC 2010 workshop on new challenges for NLP frameworks,
Citeseer, 2010.
[186] K. Ren, Y . Fang, W. Zhang, S. Liu, J. Li, Y . Zhang, Y . Yu, and J. Wang, “Learning multi-
Abstract
Since the advent of the Internet, large-scale data from digital platforms have gained attention because of their capacity to track many aspects of human behavior, such as economic activity, human mobility, and student learning styles. Analyzing data of this scale requires methods that extract the essence of human behavior and expand our understanding of how we live in the age of digital platforms. This dissertation proposes a comprehensive set of methods and applications to uncover the mechanisms underlying the dynamic human behavior observed on digital platforms. First, I propose a non-negative tensor factorization model for detecting multi-timescale consumption patterns in high-dimensional data, and I demonstrate that the multi-timescale temporal structure it extracts reflects the demographic attributes of individuals. Second, I review recent trends in applying word embedding techniques to human behavior modeling and, building on them, propose a word embedding model for studying human behavior in knowledge production; by mining high-dimensional behavioral data from knowledge production, I show that the produced knowledge is composed of two distinct types of labor supply. Third, I highlight the importance of time intervals in human behavior analysis and propose a unified framework that incorporates inter-temporal information to model temporal human behavior and to cluster users according to their behavioral dynamics. Together, the methods and applications proposed in this dissertation allow us to capture the temporal context of human behavior from the perspective of dual-process theory and to detect individual differences in high-dimensional human behavior data.