Deriving Real-World Social Strength and Spatial Influence from Spatiotemporal Data
by
Huy Pham
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)
December 2015
Copyright 2015 Huy Pham
Acknowledgments
I dedicate my dissertation work to my loving mother, who probably does not really understand
what I wrote in here due to her scope of knowledge (which has mostly to do with middle school
math), language barrier and her generation of being used to numeric calculations with a black
board and chalks. However, she has been a great motivation and support for me to start from such
basic knowledge in my childhood, and to continue to broaden my knowledge to the level that gains
me a degree of Doctor of Philosophy.
I would like to say many thanks to my advisor, Professor Cyrus Shahabi. It took me more than
two years after I entered my PhD program to find him, running from the East Coast (Columbia
University) to the West Coast (University of Southern California - the infolab). Our relationship
has been as comfortable as a friendship, except for the responsibility on my side to conduct my
research with his guidance. He has been a true mentor during my PhD, from whom I have learned
countless things, as a professional and as a person. I forever wish to have such a supervisor and
a friend like him, wherever I land on this planet, with whatever I will work on.
In addition, I would also like to thank Professor Yan Liu, who gave me tremendously helpful
advice and information for my research. She is a sweet, considerate and very knowledgeable
young professor. She is part of what has made my PhD an adorable and memorable experience.
Finally, many thanks to Pandush Simo, Grigory Sokolov and Ulas Ciftcioglu, who have always
been by my side to cheer me up, help and/or provide support whenever I needed.
Table of Contents
Acknowledgments
List of Figures
List of Tables
Abstract
Chapter 1: Introduction
1.1 Social Network Research
1.2 Motivation
1.3 Research Significance and Applications
Chapter 2: Background
2.1 Definitions and Challenges
2.1.1 Inferring Social Strength from Spatiotemporal Data
2.1.2 Inferring Spatial Influence from Spatiotemporal Data
2.2 Contributions
2.3 Related Work
2.3.1 Social Strength
2.3.1.1 Possible Candidate Measures for Social Strength
2.3.1.2 Related Studies in Inferring Social Connections from Spatiotemporal Data
2.3.2 Social and Spatial Influence
Chapter 3: Social Strength with GEOSO Model
3.1 Data Representation
3.1.1 Visit Vector
3.1.2 Co-occurrence Vector
3.1.3 Master Vector
3.2 GEOSO Distance Measure
3.3 Properties of the GEOSO Measure
3.3.1 Compatibility
3.3.2 Commitment
3.3.3 Compatibility vs. Commitment
3.4 Experiments
3.4.1 Data
3.4.2 Distance Measure and Result Verification
3.5 Summary of Chapter
Chapter 4: Social Strength with EBM Model
4.1 Preliminaries
4.1.0.1 Representation of Location
4.1.0.2 Visit Vector
4.1.0.3 Co-occurrence Vector
4.2 Entropy-Based Model - EBM
4.2.1 Diversity in Co-occurrences
4.2.2 Formulation of Diversity through Shannon Entropy
4.2.3 Renyi Entropy-based Diversity
4.2.4 Coincidences
4.2.5 Location Entropy
4.2.6 Weighted Frequency
4.2.7 Social Strength
4.2.8 EBM with Location Semantics
4.2.8.1 Location Semantics
4.2.8.2 Applying Location Semantics in EBM
4.3 Optimization
4.4 Performance Evaluation
4.4.1 Dataset
4.4.2 Methodology
4.4.3 Order of Diversity
4.4.4 Social Strength
4.4.5 Goodness of fit
4.4.6 Precision and Recall
4.4.7 Comparison of EBM with other models
4.4.7.1 Improving EBM with Location Semantics
4.5 Discussion
4.6 Summary of Chapter
Chapter 5: Spatial Influence - Measuring Followship in the Real World
5.1 The TLFM Model
5.1.1 Data Representation
5.1.2 Temporal Dependency of Followship
5.1.3 Locational Dependency of Followship
5.1.3.1 Credit distribution via locational dependency
5.1.4 Influence Causality and Coincidences
5.1.4.1 Causality Filter via Shannon Entropy
5.2 Implementation
5.3 Performance Evaluations
5.3.1 Data and Experiment Set-up
5.3.2 Experiments
5.3.2.1 Temporal Dependency
5.3.2.2 Comparison of TLFM with Related Work
5.3.2.3 The Effect of Coincidences
5.3.2.4 Efficiency
5.4 Summary of Chapter
Chapter 6: Conclusions
6.1 Summary
6.2 Future Plans
Bibliography
List of Figures
1.1 Three periods in social network studies.
2.1 Example of the visits by u on v.
2.2 The visits of u and v are unfolded along the time axis.
2.3 Cross-correlation integral of two user visit patterns
3.1 Visit history of user a, b and c
3.2 Vector view of GEOSO distance measurements
3.3 Commitment vs. compatibility
3.4 Experiments on the set of 2k users
3.5 Experiments on the set of 4k users
3.6 Experiments on the set of 10k users
3.7 Precision/Recall vs. Social Distance
4.1 A quadtree storing areas of different levels of popularity, and the visits of Users 1, 2 and 3
4.2 Diagram of EBM - Social strength is formulated via Renyi Entropy and Weighted Frequency
4.3 Location Entropy (LE)
4.4 Durations of co-occurrences for user pairs (p, q), (i, j) and (u, v) are $t_1$, $t_2$ and $t_3$, respectively.
4.5 Data Structure
4.6 The impact of the order of diversity on precision.
4.7 Percentage of real friendships vs. the social strength of buckets.
4.8 Precision vs. Recall
4.9 The precision/recall after utilizing location semantics. (a) Weighted frequency
with Location Entropy WF LE, with average Location Entropy WF ALE, with
average number of unique users/visitors WF NU, and with average number of
visits WF NV. (b) Weighted frequency WF, weighted frequency with average
duration of stay WF ADS, EBM, EBM with average duration of stay EBM ADS
5.1 Influence that u exerts on v at a location over time.
5.2 Exponential Decay - Foursquare Data
5.3 Temporal dependency is shown as how the number of doublets depends on their
time span in hours.
5.4 Comparison of TLFM with Related Work.
5.5 The effect of the causality filter for friendships only (upper graphs), and for all
relationships (lower graphs). The white bars indicate the percentage of pairs that
pass the filter; black bars - fail the filter.
5.6 The impact of filter on spatial influence.
List of Tables
2.1 Questions addressed by GEOSO and EBM and other models
3.1 Dataset
4.1 Example of Diversities
4.2 Coefficient of Determination
4.3 Efficiency of EBM and other models
5.1 Example of Location Entropy
5.2 Datasets
5.3 The running time of searching for doublets.
Abstract
The ubiquity of mobile devices and the popularity of location-based services have generated rich
datasets of people’s location information at a very high fidelity. These location datasets can be
used for studying various social behaviors, including social connections and social influence. For
example, social studies have shown that people who are frequently seen together at the same
places and the same time are most probably socially related. Similarly, the fact that a person
visits a location by following the recommendation of another person who has visited that same
location in the past indicates influence that a person exerts on another. Correspondingly, this
thesis focuses on inferring the real-world social connections and social influence by analyzing
people’s location information, which are useful in a variety of application domains from sales
and marketing to social/cultural studies and intelligence analysis. In particular, in the first two
studies of this thesis we propose models (GEOSO and EBM) to not only infer social connections,
but also to estimate their strengths quantitatively (aka social strength) by analyzing people’s
co-occurrences in space and time. In the third study, we first define followship to capture the
phenomenon of an individual visiting a real-world location (e.g., restaurant) due to the influence
of another individual who has visited that same location in the past. Subsequently, we coin the
term spatial influence as the concept of inferring pair-wise influence from spatiotemporal data
by quantifying the amount of followship influence that an individual has on others, and devise
the TLFM model for quantifying followship. In all these studies, we examine the impacts of
different factors in the location behaviors of people on social strength and influence, including
time, locations and coincidences. We conducted extensive experiments with real-world datasets,
which demonstrate the effectiveness of the proposed models in quantifying social strength and
influence, and their efficiency in working with large data.
Chapter 1: Introduction
1.1 Social Network Research
Social networks have been studied by social scientists since the pre-Internet era, and their relevance
particularly increased in the last decade. We identify three periods in the study of social networks
corresponding to the growth in the availability of data over time.
The very first period in social network research started back in the 1970s [Rogers and Shoemaker, 1971]
when social scientists realized that it was critical to understand the underlying network that
portrays people’s social connections and influence relationships. Such information is significant
in the analysis of the propagation of information, innovations, practice, rumors and contagious
infections, and also in commerce including target advertising and recommendations. However,
in the pre-Internet era, the problem of identifying “who is friend of whom” was challenging, and
studies on social networks in this earlier stage had to confine themselves to extremely small
datasets [Chen et al., 2013], which mostly came from some social surveys at very limited scales.
Figure 1.1: Three periods in social network studies.
The second period started with the Internet revolution in the ’90s and the development
of the web, as our lives continually expanded to occupy virtual worlds [Mashable,
2012]. Towards the end of the last decade, the research on social networks witnessed an explosion.
To a large extent, this has been fueled by the spectacular growth of social media and online
social networks, such as LinkedIn, Facebook and Twitter, which started in 2003, 2004 and 2006,
respectively [Chen et al., 2013]. These giant networks have produced and continue to produce
enormous datasets about hundreds of millions of online connected users in the form of social
graphs. Therefore, the “who is friend of whom” question, which was a big challenge during the first
period, suddenly became a cakewalk. The readily available social graphs collected from online
social networks motivated social scientists to move far beyond the basic question of “who is friend
of whom” to much more interesting and sophisticated topics. As a result, a large number of studies
have been devoted to new questions and solutions related to social networks, including measuring
friendships quantitatively [Liben-Nowell and Kleinberg, 2007b], identifying the most influential people in a
network [Kempe et al., 2003], maximizing and speeding up the propagation of information and
innovations in a social graph [Leskovec et al., 2007], and analyzing the structures and properties
of a social network (e.g., density, clusters, stability) [Ugander et al., 2011][Wu and Tretter,
2009]. However, all these achievements may still be considered inadequate in the eyes of social
scientists due to the gap that exists between online social networks (aka the virtual world) and
real life (aka the real world). The large volume of studies during this period focused on the
virtual world and utilized data collected from online networks, yet people’s relationships
in the virtual world may not necessarily correspond to those in the real world.
Subsequently, we are now witnessing the third period as the phase of bridging the gap between
the virtual world and the real world. Indeed, the pervasiveness of GPS-enabled mobile devices,
and the fact that all the giant social networks have also gone mobile, has introduced massive
data that represents the movements of people in the real world at high resolution, specifically by
indicating who has been where and when (aka spatiotemporal data). Spatiotemporal data can
be collected effortlessly from online services, such as geo-tagged contents (tweets from Twitter,
pictures from Instagram, Facebook and Flickr, check-ins from Foursquare), or from mobile apps’
data (WhatsApp, Glancee), etc. Such collections of spatiotemporal data constitute a rich source of
information for studying and inferring various social behaviors, including social connections and
influence. For example, for social connections, the intuition is that if two people have been to the
same places at the same time (aka co-occurrences), there is a good chance that they are socially
related. Another example is in inferring spatial influence, whose intuition is based on the concept
of followship, which captures the phenomenon of an individual visiting a real-world location (e.g.,
restaurant) due to the influence of another individual who has visited that same location in the past.
The focus of this thesis is to infer the real-world social connections and influence from the
newly available spatiotemporal data. Since these social connections and influence are inferred
from people’s locations, they constitute social behaviors that occur in the real world, as opposed to
those that may take place only in the virtual world.
1.2 Motivation
As mentioned earlier, the current trends in technology have brought billions of devices to users
and organizations, including smart phones, tablets, stationary sensors and satellites. These devices,
together with a new user mentality of utilizing technology to voluntarily share information, produce
a huge flood of geo-spatial data associated with geographic locations. Twitter has reported 14M
geo-tagged tweets per day [Weidemann], Foursquare has 5M check-ins per day [Mangalindan,
2013] and on Facebook, 1.8M locations are checked-in by users, daily [socialbakers, 2011].
Facebook acquired WhatsApp to add mobile users (with location). Google+ is now integrated with
Android-based mobile technology where the pictures and videos captured are automatically
geo-tagged and “backed-up” into Google+ accounts. In addition, a number of widely used applications
enable the capture of social networks. Such location-embedded and location-driven social
structures are known as location-based social networks (LBSNs) or geo-social networks (GSNs).
The dimension of location brings social networks back to reality, bridging the gap between the
physical world and online social networking services.
The need for our research arises from the fact that such geo-social data is collected and
published ubiquitously all over the world, thus creating extremely large sets of data. However, we
still lack the proper methods to integrate and analyze this wealth of information in order to utilize
the immense amount of hidden knowledge. In the first two studies, we observed that location
data (a.k.a. spatiotemporal data), if collected over time, can aid in inferring social strengths. For
example, two individuals seen together on several occasions may suggest some level of social
connection between them. Therefore, we are interested in whether social relationships among
people can be inferred from such a data collection. The intuition is that if two people have been
to the same places at the same time (aka co-occurrences), there is a good chance that they are
socially related. The ultimate goal is to derive the social-network of people and the social strength
from their real-world location data as opposed to (or in addition to) their online activities. In the
third study, we observed that spatiotemporal data can also aid in inferring another similar social
behavior - influence, which indicates the change in attitude, opinion or behavior that one person
causes in another as a result of different forms of actions, such as interactions, recommendations or
observations [Chen et al., 2013][Kelman, 1958][Rogers, 2010]. Specifically, we pay attention to the
phenomenon in which an individual visits a location (e.g., a restaurant) due to the influence of another
individual who visited that same location in the past. We define this behavior as followship. Hence,
followship is an indication of pair-wise influence between people in the real world. Subsequently,
for the first time, we introduce spatial influence - a concept of inferring pair-wise influence from
spatiotemporal data by quantifying the followship influence that an individual exerts on another in
the real world.
1.3 Research Significance and Applications
The major research significance of this thesis lies in the new and effective technical solutions
for inferring social strength and influence from spatiotemporal data, for which we consider the
impact of time, locations and the inherent coincidences in individuals’ visiting behaviors. The
solutions are not only effective in inferring social strength and influence, but also prove to be
efficient in working with large data.
The inferred social strength and influence set the stage for numerous social
and cultural studies. For example, social connections are used for investigating the
dissemination of information (new ideas, practices, rumors, etc.) among people and for identifying
influential people in societies. They also have their own unique utilities due to the geospatial property.
For example, the inferred social connections can be used in criminology to identify the new (or
unknown) members of a criminal gang or a terrorist cell, or in epidemiology to study the spread of
diseases through human contacts. Spatial influence also plays a key role in other social studies. For
example, in the influence maximization problem [Domingos and Richardson, 2001][Chen et al.,
2010][Kempe et al., 2003], the focus is to find a small set of influential individuals as seeds, who
have the most combined influential impact on an entire social network, and thus can be the targets
of advertisements and/or the seeds of political campaigns. The most critical input to this problem
is pair-wise influence, for which one can use the spatial influence inferred from spatiotemporal
data.
Finally, social strength and influence also offer a variety of applications. The applications for
a physically inferred social network include all the applications of online social networks such
as marketing applications (e.g., target advertising, recommendation engines such as friendship
suggestions). The most notable application of spatial influence is for identifying highly influential
people, who can potentially start a large cascade of information propagation, which in turn serves
the purposes of target advertising (by giving influential people coupons/promotions so they can
further spread the information to many other people under their influence), political campaigns
and sharing ideas/practice (by making influential people the seeds/initiators of the campaigns).
Both inferred social strength and spatial influence from users’ location data also have unique
applications due to the geospatial property, specifically by introducing the location factor of social
strength and influence into real-world applications bound to a specific location. Examples include
healthcare (informing the residents of a geographical area about the outbreak of a contagious
disease), local advertisements (local restaurants, cafes, events), local political campaigns
(elections of a district’s representatives), or simply sharing ideas/knowledge/experience that are
related to a geographically contained community, e.g., students at a university campus.
Chapter 2: Background
2.1 Definitions and Challenges
Location: Technically, a location is a point with coordinates in latitude and longitude. Generally,
this is different from a place (e.g., a park), which is an area. However, in the settings of
Location-Based Social Networks (LBSNs), a location can be perceived as a place. Indeed, when a user
checks in at a place, an LBSN first obtains the user’s coordinates via her GPS device, then searches
and presents the nearby places. Then, the user simply selects and shares one of these places, for
which the LBSN already has all the information: the location’s unique ID, coordinates and name. As
a result, all check-ins to the same place have the same coordinates and the same location ID.
Therefore, a location is often associated with a place in LBSNs. This also helps eliminate the
uncertainty of location data collected by GPS devices. We will use the two terms place and
location interchangeably.
Check-in: A check-in is a record of spatiotemporal data in the form of a triplet $\langle u, l, t \rangle$, which
indicates the user’s ID u, the location l and the time t at which the user shared her location with the LBSN.
As a user checks in at a location, the following information will typically be recorded and sent
to the server: the user’s ID u; the user’s location l, which consists of latitude and longitude values
and a unique ID that represents the location; and the time of the check-in t.
Therefore, users’ check-ins can be represented as a set of user-location-time triplets $\langle u, l, t \rangle$,
each of which states that user u visited location l at time t.
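As an illustration of the check-in triplet described above, the following is a minimal sketch of one possible in-memory representation. The names CheckIn and visits_by_user, the string IDs, and the choice of expressing time in hours are assumptions made for this example only; they are not taken from any particular LBSN.

```python
from collections import defaultdict
from typing import NamedTuple


class CheckIn(NamedTuple):
    """One user-location-time triplet (hypothetical field names)."""
    user: str      # user's ID, u
    location: str  # unique location ID, l
    time: float    # check-in time t, assumed here to be in hours


def visits_by_user(checkins):
    """Group check-ins by user and sort each user's visits by time."""
    grouped = defaultdict(list)
    for c in checkins:
        grouped[c.user].append(c)
    for visits in grouped.values():
        visits.sort(key=lambda c: c.time)
    return dict(grouped)


# Toy data, invented for illustration.
checkins = [
    CheckIn("u1", "cafe", 9.0),
    CheckIn("u2", "cafe", 9.5),
    CheckIn("u1", "library", 7.0),
]
history = visits_by_user(checkins)
print([c.location for c in history["u1"]])
```

Grouping by user is only one convenient view of the raw triplets; grouping by location is equally natural when looking for users who visited the same place.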
2.1.1 Inferring Social Strength from Spatiotemporal Data
Definition 1. Social strength is a quantitative measure between 0 and 1 that tells how socially
close two people are. A social strength of value 0 indicates that there is no social connection
between two individuals, while a social strength of value 1 indicates the strongest possible social
connection.
PROBLEM DEFINITION: Given a set of users $U = (u_1, u_2, \ldots, u_M)$, a set of locations
$L = (l_1, l_2, \ldots, l_N)$ and a set of check-ins in the form of user-location-time triplets $\langle u, l, t \rangle$, the
problem is to infer the social strength for each pair of users.
Note: The only time-related input to the problem is the check-in’s time. Another possible
input that can greatly influence social strength is the duration of time two users stayed together
at a place, often referred to as length of stay or duration of stay. However, most location-based
social networks nowadays, such as Facebook, Foursquare, Yelp, etc., neither record the length of
stay nor provide such services. In fact, users check in at places, but never check out. Therefore,
the length of stay is generally absent from the raw input data. We will show in Chapter 4 how to
estimate the length of stay given the semantics of a location.
CHALLENGES:
The problem of inferring social connections from people’s spatiotemporal data is particularly
challenging for many reasons. First, it is not clear which features of co-occurrences should
be measured to infer social connection. For example, if only the number of co-occurrences of
two people, called frequency, is considered, then one may arrive at a wrong conclusion about
their social relationship. To illustrate, suppose two people study at the same library around the
same time every day, which results in high frequencies, but they may not even know each other.
This erroneous conclusion can be attributed to the fact that the library is a popular location and
the observation that two people only co-occur at the library is not a strong indication of social
connection. On the other hand, a few co-occurrences in a small private place are perhaps a better
indication of friendship. Alternatively, several co-occurrences at different popular places (e.g.,
coffeehouses, restaurants) may also be a better indication of friendship. Second, we are interested
in inferring more information about social connections, such as how close a relationship two
people have (aka social strength). Third, there may be a lot of missing data, as people’s location
data may be sparse. Fourth, the spatiotemporal data is often extremely large, on the order of
gigabytes, which could render the inference algorithms inefficient, taking too much time and/or
resources to perform.
A naive approach to estimating social strength is to simply count the number of unique
locations at which two people co-occurred, called richness. However, this measure
would consider different locations equally important, as it ignores the number of co-occurrences
at each location. To address this problem, one could sum up the number of co-occurrences of
two people across different locations as a measure of their social strength, called frequency. The
problem with this approach is that it may overweight coincidences. For example, 10 random
encounters at a crowded coffee shop (called coincidences) are considered 10 times more important
than 1 interactive meeting at a private office.
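The two naive measures can be sketched in a few lines of code. The sketch below is illustrative only: the text does not specify how close in time two check-ins must be to count as "the same time", so the window parameter (in the same unit as the timestamps) is an assumption, as are the function names and the toy data.

```python
from collections import Counter
from itertools import combinations


def cooccurrences(checkins, window=1.0):
    """Count co-occurrences per (user pair, location).

    checkins: iterable of (user, location, time) triplets.
    window:   assumed tolerance for "the same time" (hypothetical parameter).
    """
    by_location = {}
    for user, loc, t in checkins:
        by_location.setdefault(loc, []).append((user, t))
    counts = Counter()
    for loc, visits in by_location.items():
        for (u, tu), (v, tv) in combinations(visits, 2):
            if u != v and abs(tu - tv) <= window:
                counts[(tuple(sorted((u, v))), loc)] += 1
    return counts


def richness(counts, pair):
    """Number of unique locations at which the pair co-occurred."""
    return sum(1 for (p, _loc), n in counts.items() if p == pair and n > 0)


def frequency(counts, pair):
    """Total number of co-occurrences of the pair across all locations."""
    return sum(n for (p, _loc), n in counts.items() if p == pair)


# Toy data: a and b meet by chance at a coffee shop on 10 days (times in hours);
# c and d meet once at a private office.
checkins = []
for day in range(10):
    checkins.append(("a", "coffee_shop", 24 * day + 9.0))
    checkins.append(("b", "coffee_shop", 24 * day + 9.5))
checkins += [("c", "office", 14.0), ("d", "office", 14.2)]

counts = cooccurrences(checkins)
print(frequency(counts, ("a", "b")), richness(counts, ("a", "b")))
print(frequency(counts, ("c", "d")), richness(counts, ("c", "d")))
```

On this toy data, frequency ranks the coffee-shop pair ten times higher than the office pair, while richness cannot tell the two pairs apart, which mirrors the weaknesses of both measures noted above.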
2.1.2 Inferring Spatial Influence from Spatiotemporal Data
Social influence: When user u is said to influence user v, there is an implication of a pair-wise
probability $p_{uv}$ associated with this influential relationship from u to v. Putting this in the context
of actions, it means that if u performs an action (such as clicking the “like” button of a Facebook fan
page), then v will also perform the same action at a later time with probability $p_{uv}$. This is called
the influence probability, which indicates the extent of influence that u exerts on v. Generally speaking,
$p_{uv} \neq p_{vu}$, and therefore influential relationships are directed. When u influences v, we say u is
the influencer and v is the influencee. Social influence is important due to its many potential
applications and it is also a critical part of the influence maximization problem [Kempe et al.,
2003].
Spatial influence: We aim to study influence in the geo-spatial context. For this, we first
define followship as the indication of influence in the real world, then we define spatial influence
as the study of quantifying followship. We give the formal definitions of followship and spatial
influence below. We use u and v as the notations of two individuals.
Definition 2. Followship is a spatiotemporal behavior where an individual v visits a location due
to the influence of another individual u who has visited that same location. We say v follows u.
Definition 3. Spatial influence is the concept of pair-wise influence inferred from spatiotemporal
data by quantifying followship, which indicates the amount of influence that one individual u
exerts on another individual v in the geo-spatial context. We use the notation $[u \to v]$ to denote
the influential relationship in which u influences v, or v follows u.
Clarifications: First, the time delay between the visits is greater than or equal to zero. If it is
zero, followship becomes a co-occurrence (two users are at the same place at the same time).
Second, not all successive visits are followship; only those caused by influence are followship,
according to Definition 2. An example of followship is when people go to a restaurant because of
recommendations from friends who visited the place before. An example of successive visits that
are not followship is when non-related people happen to shop at the same mall by accident. We
call such cases coincidences.
Definition 4. Coincidences are successive visits to the same place by different individuals, which
are due to non-influence reasons, that is, any reason except influence.
Fig. 2.1 shows an example of followship, where person v follows person u in visiting various
places in Los Angeles; each may have multiple visits to the same place (not shown in the figure).
In Fig. 2.1, there are three common places (marked in circles) visited by both u and v: a cafe, an
opera house and a statue. Unfolding their visits at these common places onto the time axis, we get
a more informative picture of followship in Fig. 2.2, which shows the time delay between the visits.
Figure 2.1: Example of the visits by u on v.
Figure 2.2: The visits of u and v are unfolded along the time axis.
In Fig. 2.2, each horizontal time axis corresponds to one place, and the time delays between the
visits are labeled with the indices 1, 2, etc.
PROBLEM DEFINITION: Given a set of locations $L = \{l_1, l_2, \ldots, l_m\}$, a set of users
$U = \{u_i;\ i = 1, \ldots, n\}$, and a set of check-ins in the form of user-location-time triplets
$\langle u, l, t \rangle$, the problem is to infer the pair-wise spatial influence $p$ between users in $U$.
Notes: first, the value of $p$ is not necessarily between 0 and 1, but can be normalized to become
influence probability. Second, since influential relationships are directed, $p_{u_i \to u_j}$ can be
different from $p_{u_j \to u_i}$ for user pair $u_i$ and $u_j$.
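To make the input concrete, the following sketch extracts candidate successive visits and their time delays from check-in triplets. It is only a pre-processing illustration under stated assumptions: the function name, the integer timestamps, and the rule of measuring each delay from the most recent earlier visit are hypothetical choices, and the output still mixes followship with coincidences.

```python
from collections import defaultdict

def successive_visit_delays(checkins, u, v):
    """Return {location: [delay, ...]} for visits by v at or after a visit by u.

    Assumption: each visit of v is paired with the most recent visit of u
    to the same location that happened at the same time or earlier.
    """
    visits = defaultdict(lambda: defaultdict(list))
    for user, loc, t in checkins:
        visits[loc][user].append(t)

    delays = {}
    for loc, by_user in visits.items():
        u_times = sorted(by_user.get(u, []))
        found = []
        for tv in sorted(by_user.get(v, [])):
            earlier = [tu for tu in u_times if tu <= tv]
            if earlier:
                found.append(tv - earlier[-1])
        if found:
            delays[loc] = found
    return delays

# Hypothetical check-ins as (user, location, time) triplets.
checkins = [
    ("u", "cafe", 1), ("v", "cafe", 4),
    ("u", "opera", 10), ("v", "opera", 10),
    ("v", "statue", 2), ("u", "statue", 7),
]
print(successive_visit_delays(checkins, "u", "v"))  # {'cafe': [3], 'opera': [0]}
print(successive_visit_delays(checkins, "v", "u"))  # {'opera': [0], 'statue': [5]}
```

The two calls return different results, which reflects the directed nature of the relationship; a delay of zero corresponds to a co-occurrence.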
CHALLENGES:
Quantifying spatial influence brings up many challenges. First, we do not assume any prior
knowledge about friendship information between users since this information is usually absent or
sparsely available in spatiotemporal datasets [Pham et al., 2013]. Second, we need to distinguish
actual followship from other successive visits that are not due to influence, aka coincidences.
Note that it would not be difficult to eliminate coincidences if explicit friendships were available
because they can be used to filter out successive visits from non-friends, and thus eliminate
coincidences; this was done for online influence in [Goyal et al., 2010]. However, in this case,
explicit friendships are not available, so identifying coincidences becomes challenging. For example,
people who dined at a restaurant earlier do not necessarily influence people who dined there later;
therefore, their successive visits are simply coincidences, even though they may share the same
pattern as followship. Third, even if we can identify successive visits as followship, how should
we quantify followship? Should it be a function of location (the popularity of a location), or the
time delay (the time interval between visits)? Fourth, how should we measure the individual
contribution of each factor (location and time delay) and then how to combine them in a meaningful
manner? Among the above-mentioned issues, the issues related to coincidences and the impact of
locations are critical to spatial influence. Indeed, previous studies in geo-social networks [Cho
et al., 2011b] [Zhang and Pelechrinis, 2014] showed that only 30% to 40% of human movement
is due to social reasons; the remaining is due to regularities, randomness and the popularity of
locations. Since previous studies in social influence only considered the time delay between
actions (see Section 2.3), one cannot simply adapt their solutions for spatial influence.
2.2 Contributions
The research contributions of this thesis consist of the contributions of the GEOSO and EBM
models (for inferring social strength) and of the TLFM model (for inferring spatial influence) from
spatiotemporal data; the details of the contributions are listed below.
GEOSO Model:
• GEOSO quantifies social strength through social distance, a geometric measure of how
socially far away two individuals are.
• GEOSO introduces two properties, commitment and compatibility, which must be considered
by any distance measure. Commitment is when two people repeatedly visited the same
place together, while compatibility is when two people share a variety of commonly visited
places. The consideration of compatibility greatly reduces the impact of coincidences on
social strength.
• The conducted experiments demonstrated that the social distances computed by the GEOSO
model are consistent with the real social distances from the datasets.
EBM Model:
• EBM quantifies the strength of each social connection by considering how diverse the
distribution of the co-occurrences is in the context of locations (aka diversity).
• EBM avoids overestimation by discounting coincidences, utilizing diversity’s order, an
important property of diversity. EBM compensates for the data sparseness by taking into
account the local characteristics of locations (e.g., location popularity). EBM considers
location semantics. Based on the existing studies in location semantics, we argue that
if the semantics of a location is known a priori, we can derive its popularity and average
duration of stay by using historical and/or online data. Subsequently, we propose (a) to
use location popularity, which is estimated based on location semantics, instead of
Location Entropy, and (b) to consider the duration of stay of users (derived from the location
semantics) in the EBM model by applying the decay law of social influence.
• We evaluate EBM using a large real-world dataset collected by a location-based social
network called Gowalla. We 1) use only Gowalla’s location data to infer users’ social
connections, and 2) use Gowalla’s social network as the ground truth. Our evaluation
shows that EBM’s predicted social strength is consistent with the ground truth: 88% of social
strengths are correctly predicted by the EBM model. As for inferring friendships, by using
the diversity, we achieve a precision of 96.5%, but the recall was low due to the data
sparseness. However, after incorporating location entropy into EBM, we improve the recall
by a factor of 1.8.
• The EBM algorithm is parallelizable and thus can be implemented using the MapReduce
framework in order to be efficient for massive data, which is critical in any online application.
• Finally, we experimentally compared our model to the previous studies, including GEOSO
[Pham et al., 2011], the probabilistic model [Crandall et al., 2010], the feature model [Cranshaw
et al., 2010] and the trajectory model [Li et al., 2008], and the results show that EBM
outperforms them in both efficiency and accuracy.
TLFM Model:
• We introduce followship to capture the intuition of pair-wise influence. Subsequently, for the
first time, we define spatial influence as the concept of inferring influence from spatiotemporal
data, which is the influence that an individual exerts on another in the real world, by
quantifying followship.
• We provide the TLFM model to derive spatial influence from spatiotemporal data. In
addition to considering the time delay between visits, we believe TLFM is the first model to
consider the impacts of locations and coincidences.
• Our extensive experiments on real-world datasets collected from different Location-Based
Social Networks demonstrate the effectiveness of our model and the ineffectiveness of techniques
in previous studies (proposed for inferring online social influence) in capturing spatial
influence.
2.3 Related Work
2.3.1 Social Strength
2.3.1.1 Possible Candidate Measures for Social Strength
Since the event history of any person can be represented as a vector or a time-series, applying
existing similarity metrics to the problem of inferring social strength from spatiotemporal data
appears to be promising. In this section, we discuss two straight-forward similarity metrics and
point out why these candidate solutions do not apply to our problem.
Cross-Correlation Integral
Cross-correlation integral is frequently used in signal processing [Von Storch and Zwiers,
2001][Schaff and Waldhauser, 2005] to measure the similarity of two waveforms as a function
of time. It also applies to pattern recognition problems [Rossi and Warner, 1985][Kumar et al.,
2006] to find the similarity between two patterns. We can use cross-correlation integral to compare
the visit patterns of two users in space and time. Particularly, let the x-axis be time and the y-axis
the geo-spatial locations, e.g., the label of grid cells if we consider the whole 2D space as a grid and
number the cells in row order. Each $W^3 = \langle u, l, t \rangle$ event (⟨user, location, time⟩)
corresponds to a point in the coordinate system and points are connected chronologically using linear
interpolation. Consequently, we have one time-series for each user as shown in Fig. 2.3. Next we can
compute the cross-correlation integral based on the time-series of two users and use the result as the
similarity measure of the two users.
However, there are two major problems with this approach. First, as the time-series is a
function of time, it does not scale well. When the time axis is continuously growing, especially
when the size of data reaches Gigabytes, it results in a linear increase in time complexity of any
Figure 2.3: Cross-correlation integral of two user visit patterns
possible similarity function. This shows that representing user visit patterns as a function of time
and space would result in prohibitively expensive computation. Second, the space is discretized as
non-overlapping cells and the cells on the y-axis may be numbered in an arbitrary way (in row
order or Hilbert curve order). Thus, being in two cells, for example, cell $y_1$ and cell $y_3$, at two
time instances does not indicate that the user was ever in any intermediate cells that lie spatially
between cell $y_1$ and cell $y_3$. Therefore, the time-series can potentially misinterpret the visit
pattern of the user, and the cross-correlation integral over time-series of two users may lead to
imprecise results and hence incorrect social distance measurements.
Cosine Similarity
Cosine similarity measures the similarity between two vectors based on the cosine value of
the angle between them. In the field of information retrieval, cosine similarity is often used to
compare the similarity between two documents [Yuan and Sun, 2005][Esteva and Bi, 2009]. If we
consider the user visit patterns as vectors, the cosine similarity metric can be adopted to solve the
problem. Let $V_a$ be a vector which records the number of times that user a visits each geo-location
in the space and $V_b$ be the same vector for user b. We can compute the cosine similarity between
Table 2.1: Questions addressed by GEOSO, EBM and other models

                            EBM   GEOSO   PM   FT   TR   BS
Coincidences                 ✓      ✓     ✗    ✓    ✗    ✗
Location Characteristics     ✓      ✗     ✗    ✓    ✗    ✗
Data Sparseness              ✓      ✗     ✗    ✗    ✗    ✗
Social Strength              ✓      ✓     ✓    ✗    ✓    ✗

the two vectors $V_a$ and $V_b$, which is then used to measure the social distance between a and b.
However, there is a major drawback in this approach because the time dimension is overlooked in
the vector representation of the visit history. For example, if both users a and b have visited the
same geo-locations, but on different days, they are considered similar in this approach but they
are not similar in reality as they have never been at the place at the same time. Obviously, the
simple vector representation cannot handle the time dimension, which is an important factor in
measuring social distances. Furthermore, cosine similarity essentially measures the cosine of the
angle between two vectors; therefore, the magnitudes of the vectors are not considered.
That is, the cosine similarity between a vector $\vec{v}$ and $\vec{u}$ is the same as the cosine
similarity between $k\vec{v}$ and $\vec{u}$ for any scalar $k > 0$. This is not appropriate in measuring
social distances based on visit patterns as the number of visits is an important indication of social
closeness.
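The scale-invariance drawback can be checked numerically with a short sketch. The visit-count vectors and the scaling factor below are hypothetical values chosen for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical visit counts per cell for two users.
u = [1, 2, 0, 1]
v = [2, 1, 0, 3]
k = 10
v_scaled = [k * x for x in v]  # same pattern, ten times as many visits

same = math.isclose(cosine_similarity(u, v), cosine_similarity(u, v_scaled))
print(same)  # True
```

A user who visits every cell ten times as often is indistinguishable from the original under this metric, so the number of visits carries no weight.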
2.3.1.2 Related Studies in Inferring Social Connections from Spatiotemporal Data
Table 2.1 summarizes the GEOSO and EBM models, and the previous studies with the same
problem focus. A check mark indicates that the question is addressed by the model, while
a cross mark indicates the opposite.
To examine the relationship between a friendship network and human interactions, Eagle
et al. [Eagle et al., 2009] conducted their analysis on two different sets of data of the same group
of users: one from mobile phones, called “behavioral”, and another reported by users, called “self
report” (shown as BS in Table 2.1). They examined the communications, locations and proximity
of the users over an extended period of time, conducted a regression analysis over the data and
finally compared the behavioral social network to self-reported relationships. Their results showed
that the two are indeed related. In addition, communication was the most significant predictor of
friendships, followed by the number of common locations and proximity. However, this early study
considered neither the impact of coincidences nor the importance of co-occurrence locations, both
of which are shown to be significant in this thesis.
Crandall et al. [Crandall et al., 2010] used a probability model (PM in Table 2.1) to infer the
probability of friendships given the co-occurrences in time and space and did the evaluation with a
large dataset from Flickr. The first limitation of this model is that it makes a simplifying assumption
about the structure of the social network: each user can have only one friend, which is not the
case in reality. Second, it does not consider the frequency of co-occurrences at each location; all
the co-occurrences at one location count only once. Finally, neither the impact of coincidences
nor the location characteristics, known as location entropy in this thesis and in
[Cranshaw et al., 2010], were addressed.
Cranshaw et al. [Cranshaw et al., 2010] introduced various features such as specificity, location
entropy, etc, in order to analyze the social connections (FT in Table 2.1). Their experiments showed
that there exists a relationship between the mobility patterns of a user and the number of the
user’s friends in the underlying social network. This is an in-depth study and provides much
insight into the social network structure. However, they did not consider a subtle question of the
social network: how close are two people (aka the social strength in this thesis)? In addition,
they studied the location characteristics to avoid coincidences, but for each user pair, the local
frequency was not clearly differentiated from location to location. Here, we show that the impact
of local frequencies on social connections varies greatly from location to location. Furthermore,
weighted frequency also differs from TFIDF in their work in that we use the inverse diversity of
a location ($\exp(-H_l)$) to weight the local frequency, which considers the visiting patterns to a
location and detects the significance of each single co-occurrence to social strength. Their TFIDF
serves more to show the specificity of a location to a user pair.
With a similar problem focus, Li et al. [Li et al., 2008] also used the history of user locations
to develop a similarity measure among users (TR in Table 2.1). They first represented each user as
a trajectory in a hierarchical fashion, then used the similarity between the trajectories of two users
as their social similarity. The model considers the movements of users in both micro and macro
scales. This research is particularly promising for its scalability and its consideration of different
levels of movement. However, it did not consider coincidences and location characteristics,
which are shown to be crucial in this thesis.
In this thesis, we address all the interesting and, at the same time, difficult challenges
that the previous studies either ignored or suffered from. This thesis does not make simplifying
assumptions, and the experiments demonstrate the high accuracy of our proposed models in
predicting social connections with real-world data even when the data is sparse, as well as their
efficiency when it comes to large-scale online processing.
2.3.2 Social and Spatial Influence
A large number of studies in social influence focused on influence maximization (which is orthogonal
to this thesis), while a few others studied the derivation of pair-wise influence from online web
data. We summarize these related studies below.
A large number of studies in social influence focused on an optimization problem called
influence maximization, in which the focus is to find a small set of influential individuals as
seeds, who have the most combined influential impact on an entire social network [Chen et al.,
2010][Domingos and Richardson, 2001][Goyal et al., 2011][Kempe et al., 2003][Leskovec et al.,
2007]. In the influence maximization problem, the pair-wise influence is the most critical input
and is generally assumed to be known or generated randomly for experiment purposes. More
precisely, the influence maximization problem does not focus on inferring pair-wise influence, but
it rather utilizes the pair-wise influence for optimizing the viral propagation of information in a
social network. Hence, technically the influence maximization problem is orthogonal to this thesis,
and it shares similarities with this thesis only in applications.
Studies focusing on pair-wise influence can be categorized into two main subgroups: direct-
influence and indirect-influence.
With the direct-influence subgroup, the subjects are online blogs/articles that directly reference
each other, which offers clear evidence of influence: an article influences another in a referencing
event. Saito et al. [Saito et al., 2008] inferred the influence probability in a network of online
blogs. Specifically, the authors modeled the problem as finding the influence probability from a
series of past episodes, where each episode offered evidence of influence events: who succeeded or
failed to influence whom at successive points of time. The influence probabilities were modeled
so that they would maximize the chance of observing those given episodes. A series of studies by
Gomez-Rodriguez et al. [Gomez Rodriguez et al., 2010][Gomez Rodriguez et al., 2013][Rodriguez
et al., 2011] proposed another way of learning social influence by assuming that online events
cascade along a hidden tree where nodes are blogs/articles. They then inferred the hidden tree
by capturing which node influences which, based on the time when an article started referencing
another. In the studies by Zhou et al. [Zhou et al., 2013] and He et al. [He et al.], the authors
used Hawkes processes to reason and discover the information pathways of information cascades.
Specifically, He et al. captured the postings of information from nodes in a possibly hidden
network as events of a Hawkes process, and also utilized the content of the cascaded topics to
further discover the hidden topics in the texts being cascaded.
With the indirect-influence subgroup, a subject indirectly influences another through a common
action. For example, user u liked and started following a Facebook fan page, and her friend, user v,
later also started following that same fan page. Studies in this subgroup rely on explicit friendships
between users (i.e., users list others as their friends in their profiles) in order to infer influence
based on the actions performed by friends only, and thus do not need to worry about coincidences.
Goyal et al. [Goyal et al., 2010] later proposed another model to learn influence probability from
logging data of users’ online activities, specifically by looking at users joining online groups after
their friends joined the same groups. They proposed several models, including static, continuous
and discrete time models to capture the impact of the time delay between the actions on influence.
The studies in the two above-mentioned subgroups are the most relevant to our work as they
focused on inferring pairwise social influence. The main reason they cannot be applied to spatial
influence is that they focused on utilizing only the time delay between the events, which is not
necessarily a research limitation, but rather a scope restriction of the papers. In fact, the time delay
has been shown to be one of the most significant factors in discovering the hidden network of
information diffusion for online social media [Goyal et al., 2010]. However, in spatial influence,
as we discussed earlier, the presence of users’ geo-locations in the input data introduces new
challenges. Specifically, the impact of locations and coincidences on user location behaviors needs
to be taken into account in inferring spatial influence. Subsequently, among the three challenges
we mentioned, including the impact of time delay, locations and coincidences, only the time delay
challenge has been addressed in the previous studies, whose solutions can be a partial solution to
spatial influence, while the two other challenges remain unresolved.
In addition, the first subgroup assumes the evidence of influence in each referencing event (the
referencing article is influenced by the referenced one), which does not hold in case of successive
visits in spatiotemporal data, which may be due to other reasons, such as coincidences or location
popularity. The second subgroup assumes the availability of explicit friendship information; in
social media, such as Facebook and Twitter, friends follow each other, observe each other’s online
actions, are influenced, and may perform those same actions later in time. Subsequently, there is
a strong correlation between explicit friendships in a social network (e.g., Facebook) and social
influence in that same network [Chen et al., 2013][Kempe et al., 2003]. Therefore, these studies
rely on the explicit friendships in order to take into account only actions performed by pairs of
friends, and hence eliminate the concern about coincidences in social media, as it was done in
[Goyal et al., 2010]. However, in spatial influence, friendships are neither virtual nor explicit; they
are real-world and implicit; the latter means we no longer have information about friendships in
the real world. Subsequently, we have to rely on raw spatiotemporal data only in order to infer
influence, for which we have to handle coincidences and measure the impact of time-delay and
locations.
The roles of locations and randomness (aka coincidences) in the location behaviors of users
were demonstrated in a study by Zhang et al. [Zhang and Pelechrinis, 2014], which looked at
peer-influence in an aggregated fashion, meaning to find what overall percentage of the similarity
between users’ visit patterns is due to location and randomness. By using cosine to measure the
similarity between users’ visit pattern, the authors found that 44% of the similarity between users’
visit patterns is due to randomness, 16% is due to locations, and the remaining 40% is supposedly
due to influence between friends. Even though this study neither inferred pair-wise influence, nor
studied the impact of location and randomness on each individual pair-wise influence, its result
tells us that one must consider the roles of locations and randomness in users’ locations carefully
when trying to infer pair-wise influence, because they contribute significantly to why users visited
various locations.
Finally, one straightforward solution would be to first infer/construct an implicit social graph
from spatiotemporal data (e.g., by using methods in [Pham et al., 2013][Crandall et al., 2010]),
and then use the solution in the indirect-influence subgroup [Goyal et al., 2010] to infer pair-wise
influence; thus, we could rely on the implicit friendships to rule out coincidences. The problem
with this approach is that it still does not take into account the impact of location on influence.
Moreover, using friendships will cause the pair-wise influence between non-friends (e.g., influence
by celebrities, university president, etc.) to be eliminated from consideration. We consider this
“baseline” solution in our experiments, and the result shows our approach was about 2 times better
in accuracy (see the experiment section in Chapter 5).
Chapter 3: Social Strength with GEOSO Model
In this chapter, we discuss the GEOSO model (GEOSO stands for Geo-Social), which utilizes the
geometric properties of users’ data representations as vectors to create a metric for social distance
and social strength.
3.1 Data Representation
Assume that the data input to the problem is a sequence of triplets in the form of ⟨user, location,
time⟩, specifying who visited where and when. Following the storage model in [Crandall et al.,
2010], the 2D space, formed by latitude and longitude, is partitioned into disjoint cells. For
example, the space could be divided by a grid consisting of $X \times Y$ rectangular cells. The size of
the cells is application-dependent.
3.1.1 Visit Vector
A visit vector is a data structure that records the movement history of a user. We consider the
grid as a matrix and then store it in row-first order as a vector. Subsequently, for each user, a visit
vector is constructed to record the visit history of that user within a period of time. Specifically,
each dimension of the visit vector represents one cell of the grid, and the value of the dimension
Figure 3.1: Visit history of user a, b and c
is a list of times showing when the visits to the cell happened. If the user has not visited a cell
within the time period of interest, the value of that cell is 0. For example, in Fig. 3.1, the visit
vectors of user a and user b are:
$$V_a = (0,\ \langle t_1, t_2, t_3 \rangle,\ \langle t_4, t_5 \rangle,\ 0,\ 0,\ 0)$$
$$V_b = (0,\ 0,\ \langle t_4, t_5, t_6 \rangle,\ t_7,\ t_8,\ t_9)$$
As most users only visit a fairly small number of cells compared to the total number of cells in
the space, the visit vector for a single user may contain mostly zeros and a few non-zero values.
For storage and computation efficiency, we can eliminate all zeros and only store the non-zero
values together with their cell IDs. For example, the visit vector of user a in Fig. 3.1 can be stored
as $V_a = (2 : \langle t_1, t_2, t_3 \rangle,\ 3 : \langle t_4, t_5 \rangle)$, which represents that user a
visited cell 2 three times and cell 3 two times. For ease of presentation, we still use the original
representation of the visit vector throughout the rest of the chapter, but note that the vectors can be
stored and computed in a more efficient way.
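The compact storage described above maps naturally onto a dictionary keyed by cell ID. The sketch below is one possible illustration; the function names are hypothetical, and the integers 1 to 5 stand in for the timestamps $t_1$ to $t_5$ of user a.

```python
from collections import defaultdict

def build_visit_vector(events, user):
    """Sparse visit vector: {cell_id: [visit times]}, unvisited cells omitted."""
    vector = defaultdict(list)
    for who, cell, t in events:
        if who == user:
            vector[cell].append(t)
    return {cell: sorted(times) for cell, times in vector.items()}

def to_dense(sparse, num_cells):
    """Dense form with cells numbered 1..num_cells; 0 marks an unvisited cell."""
    return [sparse.get(cell, 0) for cell in range(1, num_cells + 1)]

# Hypothetical (user, cell, time) events for user a: cell 2 three times, cell 3 twice.
events = [("a", 2, 1), ("a", 2, 2), ("a", 2, 3), ("a", 3, 4), ("a", 3, 5)]
v_a = build_visit_vector(events, "a")
print(v_a)               # {2: [1, 2, 3], 3: [4, 5]}
print(to_dense(v_a, 6))  # [0, [1, 2, 3], [4, 5], 0, 0, 0]
```

The dense form is reconstructed only for presentation; computation can operate on the sparse dictionary directly.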
3.1.2 Co-occurrence Vector
Next, we define a data representation to capture the commonalities between two users. The
co-occurrence vector states the common visits of two users for the time period of interest. Each
dimension of the vector still corresponds to a cell in the grid. However, the value of each dimension
does not record the time of the visits but the number of times that the two users visited the same
cell at roughly the same time, that is, the time spans of the visits of the two users at the same cell
overlap. Note the length of the time overlap is application dependent and can be an input parameter
to our model, for example, 20 minutes or two hours. Consider users a and c in Fig. 3.1; both a and
c visited cells 2 and 3 at the same time. In particular, a and c visited cell 2 two times and cell 3
two times together. The co-occurrence vector between users a and c is $C_{ac} = (0, 2, 2, 0, 0, 0)$. We
formally define the co-occurrence vector as follows:
$$C_{ij} = (c_{i1,j1},\ c_{i2,j2},\ \ldots,\ c_{iN,jN}) \tag{3.1}$$
In Equation 3.1, a term $c_{ik,jk}$ denotes the number of times that user i and user j both visited
cell k, where k ranges from 1 to the total number of cells N. Note that co-occurrence vectors can
also be stored in a compact form, where only non-zero values are stored and maintained. In the
next section, we discuss how to perform computation efficiently on these compact vectors.
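A compact co-occurrence vector can be derived from two sparse visit vectors as sketched below. This is an illustration under simplifying assumptions, not the implementation used in the thesis: visits are treated as single timestamps rather than time spans, two visits co-occur when their timestamps differ by at most a hypothetical `window` parameter, and each visit is matched at most once using a greedy rule.

```python
def co_occurrence_vector(visits_i, visits_j, window):
    """Sparse co-occurrence vector {cell: count} from two sparse visit vectors."""
    result = {}
    for cell in sorted(visits_i.keys() & visits_j.keys()):
        times_j = sorted(visits_j[cell])
        used = [False] * len(times_j)
        count = 0
        for ti in sorted(visits_i[cell]):
            # Greedily match ti with the earliest unused visit of j within the window.
            for idx, tj in enumerate(times_j):
                if not used[idx] and abs(ti - tj) <= window:
                    used[idx] = True
                    count += 1
                    break
        if count:
            result[cell] = count
    return result

# Hypothetical sparse visit vectors (cell -> timestamps) for users a and c.
visits_a = {2: [1, 2, 3], 3: [4, 5]}
visits_c = {2: [1, 2], 3: [4, 5], 4: [20]}
c_ac = co_occurrence_vector(visits_a, visits_c, window=0)
print(c_ac)  # {2: 2, 3: 2}
```

The result corresponds to the dense vector (0, 2, 2, 0, 0, 0) given for users a and c, with the zero entries omitted.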
3.1.3 Master Vector
As two co-occurrence vectors can considerably differ from each other, we need to normalize
co-occurrence vectors so that the distance measurements are comparable. Assume that two
users i and j have visited every cell in the space at the same time, and the number of visits to
each cell is the maximum among any pair of users in the group of users of interest. Let $C_{ij}$ be
the co-occurrence vector of i and j. Undoubtedly, user i and user j have the highest similarity,
hence, the smallest distance between each other. Furthermore, the more similar the co-occurrence
vector of any user pair is to $C_{ij}$, the closer the two users are in terms of social distance. Following
this intuition, we define the master vector for a group of users. A master vector contains the
maximum pair-wise co-occurrences in each cell for a group of users of interest. For instance, the
co-occurrence vectors of users a, b and c in Fig. 3.1 are as follows:
$$C_{ab} = (0, 0, 2, 0, 0, 0)$$
$$C_{ac} = (0, 2, 2, 0, 0, 0)$$
$$C_{bc} = (0, 0, 2, 1, 0, 1)$$
The master vector of the three users is M = (0; 2; 2; 1; 0; 1) where the value of each dimension
of M is the maximum value of the three co-occurrence vectors at the corresponding dimension.
Note that only one master vector is constructed for a given set of users. Computing the master
vector is simple and can be done efficiently. The definition of the master vector is shown in
Equation 3.2, where U stands for the total number of users and N is the total number of cells.
$$M = (m_1, m_2, \ldots, m_k, \ldots, m_N), \qquad m_k = \max_{1 \le i < j \le U} c_{ik,jk}, \quad 1 \le k \le N \tag{3.2}$$
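Equation 3.2 amounts to an element-wise maximum over all pair-wise co-occurrence vectors. A minimal sketch, using dense lists and a hypothetical function name, reproduces the master vector of the three users a, b and c:

```python
def master_vector(co_vectors):
    """Element-wise maximum over dense co-occurrence vectors of equal length."""
    return [max(column) for column in zip(*co_vectors)]

c_ab = [0, 0, 2, 0, 0, 0]
c_ac = [0, 2, 2, 0, 0, 0]
c_bc = [0, 0, 2, 1, 0, 1]

m = master_vector([c_ab, c_ac, c_bc])
print(m)  # [0, 2, 2, 1, 0, 1]
```

Only one pass over the co-occurrence vectors is needed, which is consistent with the remark that the master vector is cheap to compute.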
3.2 GEOSO Distance Measure
The goal of our problem is to efficiently compute the social connections among all pairs of users
and report those users who are strongly bonded. For any given set of users and their
$W^3 = \langle u, l, t \rangle$ events, we first compute the co-occurrence vectors for every pair of users
and the master vector for the entire set of users. Next, we compute the social distance between each
pair of users.
Figure 3.2: Vector view of GEOSO distance measurements
The social distance $d_{ij}$ between user i and user j is defined by the Pure Euclidean Distance
(PED) between the co-occurrence vector $C_{ij}$ and the master vector M. The similarity $s_{ij}$ between
two users is the inverse of the distance metric.
$$d_{ij} = \sqrt{\sum_{k} (c_{ik,jk} - m_k)^2} \tag{3.3}$$
$$s_{ij} = \frac{1}{d_{ij}} \tag{3.4}$$
Consider a simple example consisting of two cells and three users shown in Fig. 3.2. The x-axis
shows the number of co-occurrences in cell 1 and the y-axis shows the number of co-occurrences
in cell 2. The co-occurrence vectors are plotted as thin arrow-headed lines and the master vector is
plotted with a solid bold arrow-headed line. The co-occurrence vector of users a and b is $(2, 2)$, the
co-occurrence vector of users a and c is $(0, 3)$, and the co-occurrence vector of users b and c is
$(0, 2)$. The master vector of the three users is $M = (2, 3)$.
Next, the PED distances between all user pairs are computed as follows:
$$d_{ab}^2 = (2 - 2)^2 + (3 - 2)^2 = 1$$
$$d_{ac}^2 = (2 - 0)^2 + (3 - 3)^2 = 4$$
$$d_{bc}^2 = (2 - 0)^2 + (3 - 2)^2 = 5$$
The smaller the distance between two users, the closer they are. Therefore, we know that
users a and b are the closest user pair in the example shown in Fig. 3.2. As co-occurrence vectors
contain mostly zeros, they are stored in a compact form. That is, all zeros are eliminated from the
vector. Subsequently, we can improve the computation efficiency by employing the Projected Pure
Euclidean Distance (PPED) proposed in [Shahabi and Banaei-Kashani, 2003].
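The master vector of Equation 3.2 and the PED of Equation 3.3 can be sketched in a few lines of Python. This is only an illustrative sketch using dense lists; the function names `master_vector` and `ped` are hypothetical, and the compact storage and the PPED optimization mentioned above are not modeled.

```python
import math

def master_vector(cooc_vectors):
    """Element-wise maximum over all pairwise co-occurrence vectors (Eq. 3.2)."""
    return [max(column) for column in zip(*cooc_vectors)]

def ped(cooc, master):
    """Pure Euclidean Distance between a co-occurrence vector and the master vector (Eq. 3.3)."""
    return math.sqrt(sum((c - m) ** 2 for c, m in zip(cooc, master)))

# Two-cell example of Fig. 3.2: co-occurrence vectors of the three user pairs.
cooc = {("a", "b"): [2, 2], ("a", "c"): [0, 3], ("b", "c"): [0, 2]}

M = master_vector(cooc.values())
distances = {pair: ped(vec, M) for pair, vec in cooc.items()}

# The pair with the smallest distance is reported as the socially closest pair.
closest = min(distances, key=distances.get)
print(M, distances, closest)
```

The same `master_vector` function applied to the three six-cell vectors of Fig. 3.1 yields the master vector given earlier in this section.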
3.3 Properties of the GEOSO Measure
In this section, we introduce two important properties of the GEOSO model and show how our model
captures them quantitatively. Intuitively speaking, people who are socially close to
each other have higher chances of visiting the same places at the same time (co-occurrences in both
space and time). For example, a couple who live together probably visit the same grocery shops,
restaurants, and vacation destinations at the same time. Furthermore, people who repeatedly visit
the same location at the same time are socially connected with higher probability. For example,
co-workers go to work every weekday. We state the following observations for
ease of discussion and refer to them later.
Observation 1: The more places two users visited together at the same time, the more likely
these two users are socially close to each other.
Observation 2: The more often two users visited the same places at the same time, the closer the
two users are socially connected.
3.3.1 Compatibility
According to the first observation above, the more common cells two users visit, the higher
the likelihood that these two users are socially close. Now, we show that our social distance
measure is consistent with this observation. First, let us temporarily ignore the number of
co-occurrences in a cell between two users and consider only whether the two users co-occurred
in that cell. In the co-occurrence vector, if two users both visited a cell at the same time
(co-occurred), we use the value 1 for that cell. Otherwise, we assign 0 to that cell. Note that
the dimension of the vector stays the same. In the extreme case, if two users visited every cell
together, their co-occurrence vector contains only ones in all dimensions. Generally, suppose we
have two pairs of users, i.e., $(i, j)$ and $(p, q)$. Users $i$ and $j$ visited $k$ cells together, while
users $p$ and $q$ visited $k + a$ cells together ($a > 0$). The co-occurrence vectors of the two user pairs are:
$C_{ij} = (1, 1, \ldots, 1, 0, 0, \ldots, 0)$
$C_{pq} = (1, 1, \ldots, 1, 1, \ldots, 1, 0, \ldots, 0)$
Without loss of generality, suppose all co-occurrences happened in the first several cells.
Clearly, the user pair $(p, q)$ should be socially closer because $p$ and $q$ have more overlap
in space and time. We define the total number of dimensions with non-zero values in the
co-occurrence vector as the compatibility between the two users. Then, the compatibility property says
that the more compatible two users are in their social relations, the closer they are. Next, we prove
that our distance model captures the compatibility property.
Consider a new master vector $M' = (m, m, \ldots, m)$, where $m$ is the
maximum value of all dimensions in the original master vector in Equation 3.2. Note that the new
master vector $M'$ changes the absolute distance values but does not change the relative values
between two distances. That is, if $d_{ij}$ is greater than $d_{pq}$ with regard to the original master vector
$M$, it is still greater than $d_{pq}$ with regard to the new master vector $M'$. Consequently, we use $M'$
instead of $M$ as the master vector in the following discussions where only the relative distance
values are of concern. Hence, the distances between users $i$ and $j$, and between users $p$ and $q$, are as follows:
$$d_{ij} = \sqrt{k(m - 1)^2 + (N - k)m^2}$$
$$d_{pq} = \sqrt{(k + a)(m - 1)^2 + (N - k - a)m^2}$$
Next, consider the difference between the two squared distances:
$$d_{ij}^2 - d_{pq}^2 = k(m - 1)^2 + (N - k)m^2 - (k + a)(m - 1)^2 - (N - k - a)m^2 = -a(m - 1)^2 + am^2 = a(2m - 1) > 0 \quad \text{as } m \ge 1$$
Hence $d_{ij}$ is greater than $d_{pq}$. Consequently, users $p$ and $q$ are more socially connected than
users $i$ and $j$. Note that if $m = 0$, it is a trivial case where no two users visited the same cell and
their distances are all set to infinity. Therefore, our model has the compatibility property.
3.3.2 Commitment
As stated in our second observation, if two users repeatedly visited the same places together,
they are more likely to be socially close to each other. For example, the fact that students go to the
same classroom twice a week is a strong indication that they are classmates. To show that our
distance model is consistent with this observation, we need to take into account the number of
co-occurrences between two users, which we set aside in the previous section. That is, the value of
each dimension in the co-occurrence vector corresponds to how many times two users co-occurred
in space and time. The second observation then states that the more two users are committed to a
certain place, the closer they are. We call this the commitment property of social relations. Next, we
prove that the model captures the commitment property.
Suppose that the co-occurrence vectors of two pairs of users $(i, j)$ and $(p, q)$ are identical
except in one dimension (the first components of the two vectors are different).
$C_{ij} = (k, c_2, c_3, \ldots, c_N)$
$C_{pq} = (k + a, c_2, c_3, \ldots, c_N) \quad (a > 0)$
The distances between the two pairs of users are:
$$d_{ij} = \sqrt{(m - k)^2 + \delta}$$
$$d_{pq} = \sqrt{(m - k - a)^2 + \delta}, \qquad \delta = \sum_{2 \le l \le N} (m - c_l)^2$$
Next, consider the difference of the two squared distances:
$$d_{ij}^2 - d_{pq}^2 = (m - k)^2 - (m - k - a)^2 > 0$$
Hence $d_{ij}$ is greater than $d_{pq}$. Therefore we conclude that $p$ and $q$ are more socially connected
than $i$ and $j$. This shows that our model has the commitment property.
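As a quick numeric sanity check of the compatibility and commitment arguments, the following Python sketch evaluates the squared distances to the uniform master vector $M' = (m, \ldots, m)$ for one concrete choice of parameters. The values of `N`, `m`, `k`, `a` and the vector `rest` are hypothetical and chosen only for illustration; the sketch checks one instance and is not a substitute for the proofs.

```python
def squared_distance(cooc, m):
    """Squared Euclidean distance from a co-occurrence vector to M' = (m, ..., m)."""
    return sum((m - c) ** 2 for c in cooc)

# Hypothetical parameters: N cells, master value m, and k, a with a > 0.
N, m, k, a = 10, 5, 3, 2

# Compatibility: binary vectors with k versus k + a shared cells.
c_ij = [1] * k + [0] * (N - k)
c_pq = [1] * (k + a) + [0] * (N - k - a)
gap_compat = squared_distance(c_ij, m) - squared_distance(c_pq, m)

# Commitment: vectors identical except for the first component (k versus k + a).
rest = [1, 0, 2, 0, 0, 0, 0, 0, 0]   # arbitrary shared components c_2 .. c_N
v_ij = [k] + rest
v_pq = [k + a] + rest
gap_commit = squared_distance(v_ij, m) - squared_distance(v_pq, m)

print(gap_compat, a * (2 * m - 1))                   # both equal a(2m - 1)
print(gap_commit, (m - k) ** 2 - (m - k - a) ** 2)   # both positive
```

In both cases the gap is positive, so the pair $(p, q)$ ends up closer to the master vector than the pair $(i, j)$, in line with the two properties.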
3.3.3 Compatibility vs. Commitment
We have shown that both the compatibility and commitment properties play important roles in
measuring social distances and that they are captured by our GEOSO model. As the next step,
we analyze the relationship between the two in the model and show which of the two properties
is more important.
Assume users $i$ and $j$ have $x$ co-occurrences in one cell (say cell 1), and users $p$ and $q$ have $y$
co-occurrences, all of which happened in different cells. Without loss of generality, suppose that the $y$
co-occurrences happened in the first $y$ cells. The co-occurrence vectors are:
$C_{ij} = (x, 0, 0, \ldots, 0)$
$C_{pq} = (1, 1, \ldots, 1, 0, \ldots, 0)$
The distance functions are:
$$d_{ij} = \sqrt{(m - x)^2 + (N - 1)m^2}$$
$$d_{pq} = \sqrt{y(m - 1)^2 + (N - y)m^2}$$
Figure 3.3: Commitment vs. compatibility
Letting $d_{ij} = d_{pq}$, we obtain the relationship between $x$ and $y$ as the quadratic function:
$$y = f(x) = \frac{2mx - x^2}{2m - 1} \qquad (3.5)$$
In Equation (3.5), m is a constant. The relationship between the variable x and variable y is
plotted in Fig. 3.3 (m is set to 20).
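To get a feel for Equation (3.5), the relationship can be tabulated numerically. The following Python sketch evaluates $f(x)$ for $m = 20$, the value used for Fig. 3.3; the function name `compatibility_equivalent` and the sampled values of $x$ are illustrative choices only.

```python
def compatibility_equivalent(x, m):
    """Eq. 3.5: number of distinct cells y that yields the same distance
    as x co-occurrences concentrated in a single cell."""
    return (2 * m * x - x * x) / (2 * m - 1)

m = 20  # value used for the plot in Fig. 3.3
for x in (1, 5, 10, 15, 20):
    y = compatibility_equivalent(x, m)
    print(f"x = {x:2d}  ->  y = {y:.3f}")
```

Since $f(x) < x$ is equivalent to $x - x^2 < 0$, the curve lies strictly below the line $y = x$ for every $x > 1$ and touches it at $x = 1$.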
The figure of the relationship between commitment and compatibility gives two important
insights. First, as the curve of $y = f(x)$ lies below the line $y = x$ for $x > 1$, our model shows that the
commitment property has less influence on the distance function than the compatibility property.
This is consistent with some intuitive examples. Consider the activities of two students on campus.
If their $W^3$ event history shows that they went to the cafeteria 10 times together, the gym 10 times
together and the same classroom 4 times together in the past month (high compatibility), this is a
strong indication that these two students are close friends. On the other hand, if two students have
been to the library at the same time 30 times (high commitment), it does not necessarily show
that the two are friends. In fact, there might be hundreds of students who go to the library every
day. However, most of them do not know the other students who study in the same library.
Second, Fig. 3.3 shows that as commitment ($x$) increases, compatibility ($y$) also
increases, but at a much slower rate. We can increase either the commitment or the
compatibility to yield a certain social distance. However, it requires less change in compatibility
than in commitment. When commitment reaches its upper limit (the saturation point), further
increasing commitment affects the social distance of our model only insignificantly. This
is consistent with the intuition that a spike of large commitment value may only reflect a coincidence
in our social lives and does not bring the social distances closer.
The GEOSO model captures both the compatibility and commitment properties of social behaviors
by applying the co-occurrence vectors and the master vector collectively. Without these data
representations, applying the simple cosine or Euclidean distance measures to the plain visit
vectors of users leads to wrong estimates of social connectivity; in particular, the commitment
property will overestimate social distances and weaken the influence of compatibility. For
example, two users who co-occurred in the same place $k$ times will have the same
social distance as two users who co-occurred in $k$ different places but only once in each place
under both the cosine similarity and the Euclidean distance measure.
3.4 Experiments
3.4.1 Data
For the purpose of experimental evaluations, we use data from the Internet Movie Database
(IMDB) [IMDB] because it meets the data requirements of our experiments for two reasons.
First, the dataset contains spatiotemporal data of people. For example, if two actors/actresses acted
in the same movie/episode, we consider that the two persons co-occurred in space and time. If they
performed in more than one movie/episode, we consider that they co-occurred in space and time
multiple times as an indicator of compatibility, and if they performed in multiple episodes of the
same TV series, it is considered an indicator of commitment. The social distance $d$ (see Equation
3.3) is calculated for each pair of people. Second, social connections of these actors/actresses
are publicly available. For example, the Bio sections on IMDB and/or the Wikipedia [Wikipedia]
web pages usually contain the social relationships of an actor/actress, such as parents, siblings,
spouses, best friends, long-time acting partners, etc. These social connection data can be used
to verify whether two people with a short social distance $d$ are indeed socially connected. One might argue
that two actors/actresses performing in the same movie does not necessarily suggest
that they are related. This is a valid argument. However, the same is also true in a real
spatiotemporal dataset, that is, two persons appearing at the same place at the same time may only
be due to coincidence. Our model can handle these coincidences by weighting compatibility and
commitment appropriately in a non-linear fashion (see Fig. 3.3).
We extracted the information described above from IMDB and Wikipedia and ran our
experiments on these datasets. Table 3.1 provides an overview of the datasets used in this section.
The first row describes the sizes of the celebrity sets, and the second row shows the number of different
movies that the corresponding set of celebrities acted in. The last row of the table summarizes the
total number of tuples in the format of $\langle$person, movie/episode, time$\rangle$, which corresponds to the
$\langle$who, where, when$\rangle$ ($W^3$) events.
3.4.2 Distance Measure and Result Verification
Figure 3.4: Experiments on the set of 2k users. (a) Percentage of pairs vs. social distances (2k). (b) PSVP vs. social distances (2k).
In this section, we ran experiments on each dataset and computed social distances using the
GEOSO model. Next, the distances are normalized and discretized. We divided $[0, 1]$ into 25
equal-sized buckets, and each bucket contains user pairs with distances between $a$ and $a + 1/25$.
For example, the first bucket contains user pairs with distances between 0 and 0.04.
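The discretization step can be sketched as follows. This is a minimal Python illustration that assumes the distances have already been normalized to $[0, 1]$ and that buckets are half-open intervals $[a, a + 1/25)$, with the value 1 assigned to the last bucket; the boundary convention and the function names are assumptions, since the text does not specify them.

```python
from collections import Counter

def bucket_index(distance, n_buckets=25):
    """Map a normalized distance in [0, 1] to a bucket index 0 .. n_buckets - 1."""
    if not 0.0 <= distance <= 1.0:
        raise ValueError("distance must be normalized to [0, 1]")
    # A distance of exactly 1.0 is placed in the last bucket (assumed convention).
    return min(int(distance * n_buckets), n_buckets - 1)

def bucket_percentages(distances, n_buckets=25):
    """Percentage of user pairs that falls into each bucket (non-empty input assumed)."""
    counts = Counter(bucket_index(d, n_buckets) for d in distances)
    total = len(distances)
    return [100.0 * counts.get(b, 0) / total for b in range(n_buckets)]

# Hypothetical normalized distances for four user pairs.
percentages = bucket_percentages([0.01, 0.03, 0.5, 1.0])
print(percentages[0], percentages[12], percentages[24])
```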
The first dataset contains 2,000 celebrities with 280k co-occurrences. Most of the user pairs
(96.8% of 280k) have social distances close to 1 (greater than 0.91), meaning that they are socially
far away. Therefore, we drop those user pairs and focus on those who are socially close, which is
8.8k pairs (3.2% of 280k).
Table 3.1: Dataset
Number of celebrities                2k      4k      10k
Number of movies and TV episodes     32k     50k     100k
Number of $W^3$ events               280k    1.1M    4.6M
The relationship between distance values and the user pair percentage, which is calculated out
of the 8.8k pairs, is shown in Fig. 3.4(a). The x-axis shows the social distance calculated by our model
and the y-axis shows the percentage of the top 3.2% user pairs with smaller distances. Each point
in the graph represents a bucket, and as the graph shows, buckets with short social distances
have fewer pairs of people (lower percentage) than buckets with long social distances do. Keep
in mind that the number of buckets (number of points) does not represent the number of
pairs of people; the pair percentage corresponding to the bucket (the value on the y-axis)
does. Fig. 3.4(a) also shows an interesting characteristic: the distribution behaves like
an exponential function (the dotted curve) $p = C_1 e^{C_2 d}$, where $C_1$ and $C_2$ are constants that we
experimentally found to be $C_1 = 1/N$ and $C_2 = 6.92$, where $N = 8{,}860$. In other words, the
GEOSO model shows that the percentage of pairs increases exponentially as the social distance
increases.
Next, we verify the distances using the social information retrieved from IMDB and Wikipedia.
The verified results are shown in Fig. 3.4(b). The x-axis shows the social distances and the y-axis
represents the percentage of successfully verified pairs (PSVP) of each individual bucket for all
280k pairs. As Fig. 3.4(b) shows, buckets with distances less than 0.55 have PSVP above 80%
(150 pairs), and buckets with distance less than 0.6 have PSVP above 59% (301 pairs). As the
distance values increase, especially close to the value of 1, the percent of verified user pairs drops
dramatically. This is due to the fact that when two persons are far away in social distances, there is
no data from IMDB or Wikipedia showing that these two are not friends, family members or in
other relationships, which on the other hand proves that our distance measure is consistent with
the reality.
We also verify our results by manually checking the first 300 celebrity pairs with the smallest
social distances. The user pairs with the smallest distances are, for example, twins Cole Sprouse
and Dylan Sprouse, twins Ashley Olsen and Mary-Kate Olsen, and Ricky Gervais and Stephen
Merchant. These user pairs acted together in many TV series or movies.
Our experiments on larger datasets, with 4,000 and 10,000 users, shown
in Figs. 3.5 and 3.6 respectively, exhibit behavior very similar to that observed in the experiment
with the dataset of 2,000 users. The distributions of the user pair percentage over the social
distance have the exponential behavior (Figs. 3.5(a) and 3.6(a)). The PSVP drops as the social
distance increases (Figs. 3.5(b) and 3.6(b)), which shows that our distance measure is consistent
with the intuition that the shorter the distance, the tighter the social connection.
Figure 3.5: Experiments on the set of 4k users. (a) Percentage of pairs vs. social distances (4k). (b) PSVP vs. social distances (4k).
Figure 3.6: Experiments on the set of 10k users. (a) Percentage of pairs vs. social distances. (b) PSVP vs. social distances.
Finally, we evaluate our results using precision and recall. For each value of the social distance
d (the midpoint of each bucket), we calculate the precision and recall for the set of all pairs with
social distance less than or equal to d.
$$\mathrm{Precision}(d) = \frac{NSVP(d)}{NP(d)}, \qquad \mathrm{Recall} = \frac{NSVP(d)}{\sum_{d} NSVP(d)} \qquad (3.6)$$
The term $NSVP(d)$ represents the number of successfully verified pairs with distance no
greater than $d$, and $NP(d)$ is the number of pairs with distance no greater than $d$. We present the
precision and recall for all three datasets used in the previous sections. In Fig. 3.7, the x-axis
shows the social distance and the y-axis shows the precision and recall measures.
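The evaluation in Equation (3.6) can be illustrated with a short Python sketch. It rests on one explicit assumption: the denominator of the recall is read here as the total number of successfully verified pairs over all distances. The function name `precision_recall` and the sample data are hypothetical and are not taken from the experiments.

```python
def precision_recall(pairs, d):
    """pairs: list of (distance, verified) tuples, verified being True or False.
    Returns (precision, recall) for all pairs with distance <= d."""
    within = [verified for dist, verified in pairs if dist <= d]
    nsvp = sum(within)                  # NSVP(d): verified pairs with distance <= d
    np_d = len(within)                  # NP(d): all pairs with distance <= d
    # Assumption: recall is normalized by the total number of verified pairs.
    total_verified = sum(verified for _, verified in pairs)
    precision = nsvp / np_d if np_d else 0.0
    recall = nsvp / total_verified if total_verified else 0.0
    return precision, recall

# Hypothetical (distance, verified) data for four user pairs.
sample = [(0.1, True), (0.2, True), (0.5, False), (0.9, True)]
print(precision_recall(sample, 0.2))
print(precision_recall(sample, 0.5))
```

On such data, a small threshold gives high precision but low recall, which mirrors the trade-off discussed for Fig. 3.7.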
Figure 3.7: Precision/Recall vs. Social Distance
As shown in Fig. 3.7, the precision is high when the distance values are small. However, the
recall is low. This is due to two reasons. First, the number of pairs with shorter social distances
is only a small fraction (3% to 5%) of the total number of user pairs. Although all of them can
be verified, they account for only a small percentage of all user pairs. Second, some user pairs who are
close in reality are not reported as close by our model. This is because our datasets consist of only
co-acting data instead of real spatiotemporal co-occurrence data. Two persons who are father and
son might never act together, but they are socially close. This generally is not the case in a real
spatiotemporal dataset.
3.5 Summary of Chapter
In this chapter, we focused on how to infer social connections among people based on their
co-occurrences in space and time. We presented the geometric GEOSO model which derives social
connections between people based on spatiotemporal events in real world. We also showed that
the GEOSO model captures the intuitive properties of social behaviors. Finally, the experiments
demonstrated that the social distances computed by the model are consistent with the real social
distances from the datasets.
There are a few future extensions for this work. First, we plan to examine the diversity of
co-occurrences of a user pair in terms of locations, which can show not only the number of
different co-locations, but also the distribution of co-occurrences over these co-locations. The
latter is important as it helps tell whether a co-occurrence is indeed a social interaction or
just a coincidence of two people who happen to be at a crowded place at the same time
without any social interaction. Furthermore, we would like to analyze each individual location
to weigh each co-occurrence differently depending on the crowdedness of the
location: the more crowded, the less weight, and vice versa. This will help alleviate the problem
of data sparseness, which arises when the data per person is small and even a few co-occurrences at
private places may be a good indication of a close social relationship. Finally, the algorithm also
needs to meet the requirements of online services, which often need to process large data in a
very short time, on the order of milliseconds. Therefore, we plan to apply a powerful computational
model/framework, such as Map-Reduce/Hadoop, to improve the overall efficiency of the algorithm.
All these extensions will be addressed in the next chapter.
Chapter 4: Social Strength with EBM Model
In this chapter, we discuss the EBM model (standing for Entropy-Based Model), which utilizes
various types of entropies for inferring social strength from spatiotemporal data.
A naive approach to estimate the social strength is to simply count the number of unique
locations where two people co-occurred as their social strength. However, this measure would consider
different locations equally important as it ignores the number of co-occurrences at each location.
To address this problem, one could sum up the number of co-occurrences of two people across
different locations as a measure of their social strength. The problem with this approach is that it
may overestimate coincidences. For example, 10 random encounters at a crowded coffee shop
(called coincidences) are considered 10 times more important than 1 interactive meeting at a
private office.
To remedy these shortcomings, we propose an Entropy-Based Model, named EBM, which
successfully infers social strengths from spatiotemporal data with high accuracy. With EBM,
we first use the Shannon entropy to measure the diversity of co-occurrences, which, for each
pair of people, uses the number of their co-occurrences at each location to derive a relative
co-occurrence measure, and use only diversity as social strength. This measure may give higher
importance to outliers (i.e., local frequency), and thus may still overestimate the social strength
due to coincidences. Hence, we generalize Shannon by using the Renyi entropy, which can look at the
global pattern of co-occurrences per user pair and has the flexibility of giving more or less weight
to outliers by varying the order of diversity $q$. However, Renyi is still blind to the characteristics of a
location (e.g., whether the location is a crowded public coffeehouse or a private office). Therefore,
we incorporate weighted frequency, which utilizes location entropy to weigh each co-occurrence
differently depending on the popularity of locations and the derived user average duration of stay
depending on the type of the location. Thus, the model captures minor co-occurrences that can be
a significant indication of a social connection.
4.1 Preliminaries
4.1.0.1 Representation of Location
One popular way of storing visited locations is to use a grid to uniformly partition the space into
disjoint cells of equal size [Crandall et al., 2010], where each cell represents only one place, so
that any two people who check in within the same cell at the same time are considered to have
a co-occurrence. However, a uniform grid is inflexible and inefficient for the two following
reasons: 1) In a crowded area such as a downtown, a place that represents the location of
a co-occurrence, say a shopping center, is often much smaller than a place in a sparse
mountainous region, say a national park. Hence, partitioning the space into equal
cells is not applicable here. 2) Reducing the size of all the cells to fit small places in crowded areas
would waste storage resources and look-up time in sparse areas, while
increasing the cells' size to fit large places in sparse regions would result in misinterpreting
the co-occurrences in crowded areas. Therefore, to efficiently store spatial data, we use a quadtree
[Samet, 1984].
Figure 4.1: A quadtree storing areas of different levels of popularity, and the visits of Users 1, 2 and 3
Fig. 4.1 shows a quadtree, where each quadrant, called a cell, has a unique ID,
numbered from 1 to 10. Three users are shown as circles, diamonds and squares, uniquely identified
with user IDs 1, 2 and 3. The arrows show that a user checked in at the cell at time $t_i$. The darker
the cell, the denser the area. Geo-points inside a cell share the same cell ID, which is used along with time
to determine co-occurrences. For example, looking at cell 1, we say Users 1 and 2 co-occurred at
cell 1 at time $t_2$.
For simple presentation, we set the capacity of each cell of the quadtree to 1 so that each
cell can cover a maximum of one place (in the experiment, the capacity is more than 1). The
construction of the tree can be done by recursively dividing an area into four equal quadrants until
each quadrant holds only one place.
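The recursive construction described above can be sketched in Python as follows. This is a simplified illustration, not the implementation used in the experiments: the class name `QuadNode`, the square bounding region, and the `min_size` guard (which stops the recursion when places coincide) are assumptions introduced for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Point = Tuple[float, float]

@dataclass
class QuadNode:
    x: float                      # lower-left corner of the square cell
    y: float
    size: float                   # side length of the cell
    capacity: int = 1             # maximum number of places per leaf cell
    min_size: float = 1e-6        # assumed guard against endless splitting
    places: List[Point] = field(default_factory=list)
    children: Optional[List["QuadNode"]] = None

    def insert(self, p: Point) -> None:
        """Insert a place, splitting the cell when its capacity is exceeded."""
        if self.children is not None:
            self._child_for(p).insert(p)
            return
        self.places.append(p)
        if len(self.places) > self.capacity and self.size > self.min_size:
            self._split()

    def _split(self) -> None:
        """Divide the cell into four equal quadrants and redistribute its places."""
        half = self.size / 2
        self.children = [
            QuadNode(self.x + dx * half, self.y + dy * half, half,
                     self.capacity, self.min_size)
            for dy in (0, 1) for dx in (0, 1)
        ]
        pending, self.places = self.places, []
        for p in pending:
            self._child_for(p).insert(p)

    def _child_for(self, p: Point) -> "QuadNode":
        half = self.size / 2
        col = 1 if p[0] >= self.x + half else 0
        row = 1 if p[1] >= self.y + half else 0
        return self.children[row * 2 + col]

    def leaves(self) -> List["QuadNode"]:
        """Leaf cells; the position in this list can serve as the cell ID."""
        if self.children is None:
            return [self]
        return [leaf for child in self.children for leaf in child.leaves()]

root = QuadNode(0.0, 0.0, 100.0)
for place in [(10.0, 10.0), (80.0, 80.0), (60.0, 60.0)]:
    root.insert(place)
print(len(root.leaves()))
```

Dense regions end up with small cells and sparse regions with large ones, which is the property that motivates replacing the uniform grid.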
4.1.0.2 Visit Vector
We use the data representation from our previous study in Chapter 2, including the visit vector
and the co-occurrence vector. The visit history of a user is represented by a visit vector, which
shows the cell IDs and the check-in time. For example, the visit vector for User 1 in Fig. 4.1 is
$V_1 = (\langle t_2 \rangle, \langle t_3, t_6 \rangle, \ldots)$, which states that User 1 visited cell 1 at time $t_2$; visited cell 2 at times $t_3$ and
$t_6$, etc. The general format of the visit vector of User $i$ is:
$$V_i = (\langle t_{1,1}, \ldots, t_{1,i_1} \rangle, \ldots, \langle t_{M,1}, t_{M,2}, \ldots, t_{M,i_M} \rangle) \qquad (4.1)$$
where $M$ is the number of leaves in the quadtree.
4.1.0.3 Co-occurrence Vector
If two users checked in at the same location within a time interval $\tau$, then we say that they have
a co-occurrence. $\tau$ is an application-dependent parameter and can be set experimentally.
Correspondingly, the co-occurrence vector between User $i$ and User $j$, which represents all the
co-occurrences of Users $i$ and $j$, is:
$$C_{ij} = (c_{ij,1}, c_{ij,2}, \ldots, c_{ij,M}) \qquad (4.2)$$
where $c_{ij,l}$ is the number of co-occurrences between Users $i$ and $j$ at location $l$. We refer to $c_{ij,l}$
as the local frequency throughout this chapter.
For example, given the visit vectors of User 1 and User 2:
$V_1 = (\langle t_2 \rangle, \langle t_3, t_6 \rangle, \langle t_8, t_{10}, t_{11}, t_{12}, t_{13} \rangle, \langle t_{14} \rangle, \langle t_{16} \rangle, 0, 0, 0, 0, 0)$
$V_2 = (\langle t_1, t_2 \rangle, \langle t_3, t_4, t_5, t_7 \rangle, \langle t_8, t_9 \rangle, \langle t_{14}, t_{15} \rangle, \langle t_{16} \rangle, \langle t_{17} \rangle, 0, 0, 0, 0)$
we see that the two users have one co-occurrence at location 1 at time $t_2$, one co-occurrence at
location 2 at time $t_3$, etc.; therefore the co-occurrence vector between User 1 and User 2 is:
$C_{12} = (1, 1, 1, 1, 1, 0, 0, 0, 0, 0)$
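The derivation of a co-occurrence vector from two visit vectors can be sketched as below. The sketch makes several assumptions for illustration: time stamps are represented by their integer indices, an empty list stands for a cell that was never visited, and a visit of User $i$ counts as a co-occurrence when User $j$ has at least one visit to the same cell within the window `tau`. The function name `cooccurrence_vector` is hypothetical.

```python
def cooccurrence_vector(visits_i, visits_j, tau=0):
    """visits_i, visits_j: one list of check-in times per cell.
    Returns the local frequency c_{ij,l} for every cell l."""
    vector = []
    for times_i, times_j in zip(visits_i, visits_j):
        # Count visits of user i matched by some visit of user j within tau.
        count = sum(1 for t in times_i
                    if any(abs(t - s) <= tau for s in times_j))
        vector.append(count)
    return vector

# Visit vectors of User 1 and User 2 from the example (time indices only).
V1 = [[2], [3, 6], [8, 10, 11, 12, 13], [14], [16], [], [], [], [], []]
V2 = [[1, 2], [3, 4, 5, 7], [8, 9], [14, 15], [16], [17], [], [], [], []]

C12 = cooccurrence_vector(V1, V2)
print(C12)
```

With exact time matching (`tau=0`), the result agrees with the vector $C_{12}$ given in the example.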
Usually, a user can only visit a limited number of locations, which makes the visit vector
and the co-occurrence vector sparse (containing many zeroes). Therefore, we will introduce an
alternative data structure and storage for optimization in Section 4.3. For now, we use this format
to simplify the presentation.
4.2 Entropy-Based Model - EBM
The goal of this section is to devise a model, named Entropy-based Model (EBM), to quantify
social strength between two users from their co-occurrence vectors. The overview diagram of
EBM is shown in Fig. 4.2. In Section 4.2.1, we start by utilizing the diversity of the co-occurrence
vectors as the main contributing factor to social strength. Consequently, we use the Shannon entropy
to measure the diversity of co-occurrences (see Section 4.2.2), but we observe that this measure
may overestimate the strength of social connections due to the impact of coincidences, which
occur when people happen to co-occur by chance but do not interact with each other, especially
in crowded places such as downtown areas and shopping centers. Hence, we generalize the Shannon entropy
to the Renyi entropy (see Section 4.2.3), which gives us the flexibility of controlling how much
coincidences can contribute to diversity via a parameter $q$, called the order of diversity. Finally, to
compensate for the problem of data sparseness, we incorporate weighted frequency, which in turn
uses location entropy, into our model in Section 4.2.6 to increase the impact of co-occurrences
at uncrowded places, even at low frequencies, on the strength of social connections. The resulting
social strength is the ultimate measure that describes how close two people are based on the history
of their spatiotemporal information.
Figure 4.2: Diagram of EBM - Social strength is formulated via Renyi Entropy and Weighted
Frequency
4.2.1 Diversity in Co-occurrences
The concept of diversity has long been used in Physics, Economics, Ecology, Information Theory,
etc., as a quantitative measure to characterize the richness of a system [Hill, 1973; Jost, 2006;
Tuomisto, 2010b]. Specifically, in Ecology, diversity is used to measure how diverse an ecosystem
is; in the simplest case, it equals the number of different species in an ensemble. In Statistical
Thermodynamics, diversity is the number of micro-states in which a system can be [Schroeder
and Gould, 2000].
Consider the co-occurrence vectors for each user pair in Fig. 4.1,
$C_{12} = (1, 1, 1, 1, 1, 0, 0, 0, 0, 0)$
$C_{23} = (1, 2, 1, 1, 0, 0, 0, 0, 0, 0)$
$C_{13} = (0, 0, 4, 0, 0, 0, 0, 0, 0, 0)$
we see that User 1 and User 2 have 5 co-occurrences, and so do User 2 and User 3. However, in the
former case the co-occurrences are spread over 5 different locations, while in the latter case the
co-occurrences happened in 4 different locations. Even simpler is the case of User 1 and User 3,
for which all co-occurrences happened in one location, cell 3. We say that $C_{12}$ is more diverse
than $C_{23}$, and $C_{23}$ is more diverse than $C_{13}$.
Table 4.1: Example of Diversities
Co-oc. Vector   Shannon En.   $D_{ij}$   Diversity   Coincidence Prob.   Friendship Prob.
$C_{12}$        1.609         5.000      High        Low                 High
$C_{23}$        1.332         3.789      Medium      Medium              Medium
$C_{13}$        0.000         1.000      Low         High                Low
Intuitively, people, who are socially connected, tend to visit various places together [Cho et al.,
2011a; Crandall et al., 2010; Cranshaw et al., 2010; Eagle et al., 2009; Pham et al., 2011]. This
intuition is captured by our model as how diverse their co-occurrences are. Applying the general
definition of diversity in [Tuomisto, 2010a], we formally define diversity in our model:
DEFINITION: Diversity is a measure that quantifies how many effective locations the
co-occurrences between two people represent, given the mean proportional abundance of the actual
locations.
4.2.2 Formulation of Diversity through Shannon Entropy
In this section, we formulate diversity using the Shannon entropy; we then extend EBM to a more
general entropy, the Renyi entropy, in Section 4.2.3.
First, we define the notations and quantities that will be used to construct EBM. $r^{l,t}_{i,j} = \langle i, j, l, t \rangle$
is a co-occurrence of User $i$ and User $j$ at location $l$ and at time $t$. $R^{l}_{ij} = \bigcup_{t} r^{l,t}_{i,j}$ is the set of
co-occurrences of User $i$ and User $j$ that happened at location $l$. $R_{ij}$ is the set of all co-occurrences
of User $i$ and User $j$ at all locations: $R_{ij} = \bigcup_{l} R^{l}_{i,j} = \bigcup_{l,t} r^{l,t}_{i,j}$.
The probability that a randomly picked co-occurrence from the set $R_{ij}$ happened at location $l$
is:
$$P^{l}_{ij} = \frac{|R^{l}_{ij}|}{|R_{ij}|} \qquad (4.3)$$
If we randomly pick a co-occurrence from the set $R_{ij}$ and define its location as a random variable,
then the uncertainty associated with this random variable is defined by the Shannon entropy for
User $i$ and User $j$ as follows (the upper index $S$ denotes Shannon):
$$H^{S}_{ij} = -\sum_{l} P^{l}_{ij} \log P^{l}_{ij} \qquad (4.4)$$
Formulation of diversity: There is a distinction between entropy and diversity, in which
entropy often acts as the index of diversity. As an illustration, in earlier research [Jost, 2006],
Jost compared the roles of entropy and diversity to the radius and the volume of a sphere,
respectively, where the radius is used to calculate the volume, and showed their relationship:
$$D = \exp(H) \qquad (4.5)$$
where D is diversity, which indicates how diverse an ensemble is. Following this strategy, we
construct the EBM model, where D shows how diverse the co-occurrences of two users are in
terms of locations. The diversity of the co-occurrences of User i and User j, which is defined in
Equation 4.5, becomes:
$$D_{ij} = \exp\left(H^{S}_{ij}\right) = \exp\left(-\sum_{l} P^{l}_{ij} \log P^{l}_{ij}\right) \qquad (4.6)$$
Since we already defined the co-occurrence vector in Equation (4.2), we can rewrite the expression
of diversity $D_{ij}$ in terms of the co-occurrence vector as follows:
$$D_{ij} = \exp\left(-\sum_{l,\; c_{ij,l} \neq 0} \frac{c_{ij,l}}{f_{ij}} \log \frac{c_{ij,l}}{f_{ij}}\right) \qquad (4.7)$$
where $f_{ij} = \sum_{l} c_{ij,l}$ is the total number of co-occurrences of User $i$ and User $j$, named frequency.
Note the difference between frequency $f_{ij}$ and local frequency $c_{ij,l}$; the frequency of two users is
the sum of all their local frequencies. From Equations (4.5), (4.6) and (4.7), we have observations:
The higher the number of co-occurrence locations, the higher the uncertainty given by the
Shannon entropy, and consequently the higher the diversity.
If the number of co-occurrence locations is fixed, the diversity and the Shannon entropy
reach their maximums when all the probabilities in Equations (4.4) and (4.6) are equal to
each other.
To demonstrate the observations, let us consider the example of a group of three users in
Fig. 4.1. The co-occurrence vectors, Shannon Entropy, Diversity value, the diverse information,
the likelihood of coincidences and the probability of being friends for each pair of users are
summarized in Table 4.1.
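The entropy and diversity values in Table 4.1 can be reproduced directly from Equations (4.4) and (4.7). The Python sketch below uses the natural logarithm, which is the base consistent with $D = \exp(H)$; the function names `shannon_entropy` and `diversity` are illustrative.

```python
import math

def shannon_entropy(cooc):
    """Shannon entropy of a co-occurrence vector (Eq. 4.4), using natural log."""
    f = sum(cooc)  # frequency f_ij: total number of co-occurrences
    if f == 0:
        raise ValueError("the user pair has no co-occurrences")
    # Only cells with non-zero local frequency contribute (Eq. 4.7).
    return sum(-(c / f) * math.log(c / f) for c in cooc if c != 0)

def diversity(cooc):
    """Diversity D_ij = exp(H_ij) (Eq. 4.5 and 4.7)."""
    return math.exp(shannon_entropy(cooc))

vectors = {
    "C12": [1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
    "C23": [1, 2, 1, 1, 0, 0, 0, 0, 0, 0],
    "C13": [0, 0, 4, 0, 0, 0, 0, 0, 0, 0],
}
for name, vec in vectors.items():
    print(f"{name}: H = {shannon_entropy(vec):.3f}, D = {diversity(vec):.3f}")
```

For a vector whose co-occurrences are spread uniformly over $n$ cells, the entropy is $\log n$ and the diversity equals $n$, which is why $D_{12}$ equals the number of co-occurrence locations.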
From Table 4.1, we see that $C_{12}$ has the highest value of diversity due to the spread of
co-occurrences over more locations, or in other words, $C_{12}$ is the most diverse, followed by $C_{23}$,
then followed by $C_{13}$, which is the least diverse as all the co-occurrences happened at only one
place, cell 3. For $C_{12}$, the numbers of all co-occurrences are equal to each other (in the first five
cells), which produces the maximum value of Shannon entropy, and consequently, the maximum
value of diversity among the three co-occurrence vectors, together with the highest number of
co-locations (i.e., 5), which makes it the most diverse, therefore suggesting a high probability
of User 1 and User 2 being friends. Furthermore, the value of diversity $D_{12}$ coincides with the
number of co-occurrence locations, which is the reason why diversity is often referred to as the
effective number of states (in Statistical Mechanics [Schroeder and Gould, 2000]) or effective
number of species (in Ecology [Jost, 2006]).
In addition, we see that the diversity of $C_{23}$ (3.789) is less than the number of co-occurrence locations (i.e., 4), which is due to the fact that it has two co-occurrences at one place, cell 2 (less diverse), as compared to the case of $C_{12}$, where all co-occurrences are uniformly spread over five different cells (more diverse).
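The diversity values discussed above can be checked with a short sketch of Equation (4.7). Table 4.1 is not reproduced in this section, so the three vectors below are assumptions chosen only to match the description in the text (equal counts in the first five cells, two co-occurrences at cell 2 among four locations, and four co-occurrences at cell 3); the function name is likewise illustrative.

```python
import math


def shannon_diversity(cooccurrences):
    """Return (entropy, diversity) of a co-occurrence vector, as in Eq. (4.7)."""
    frequency = sum(cooccurrences)  # f_ij: total number of co-occurrences
    if frequency == 0:
        raise ValueError("the user pair has no co-occurrences")
    entropy = 0.0
    for count in cooccurrences:
        if count > 0:  # locations without co-occurrences are skipped
            p = count / frequency
            entropy += p * math.log(1.0 / p)
    return entropy, math.exp(entropy)


# Hypothetical vectors over 10 cells, consistent with the text but not
# copied from Table 4.1.
c12 = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
c23 = [1, 2, 1, 1, 0, 0, 0, 0, 0, 0]
c13 = [0, 0, 4, 0, 0, 0, 0, 0, 0, 0]

for name, vector in (("C12", c12), ("C23", c23), ("C13", c13)):
    entropy, diversity = shannon_diversity(vector)
    print(name, round(entropy, 3), round(diversity, 3))
```

Under these assumptions, the uniform vector yields a diversity equal to its number of locations, while the vector with a repeated location falls slightly below its location count.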
We also observe an interesting point where all four co-occurrences of User 1 and User 3
happened at a popular place - cell 3, which has been visited by all three users and has the highest
number of co-occurrences in total. This fact implies the high likelihood of coincidences between
User 1 and User 3, for example, they might just happen to be at a crowded public place (such as
a shopping center or a library) at the same time. Therefore, even four co-occurrences in such a
crowded place might still say very little about a possible social connection. We have the following
observation:
Observation: A high number of co-occurrences at only one place might be an indicator of a friendship if the place is unpopular and uncrowded, but might instead suggest coincidences if the place is popular and crowded. Shannon Entropy and its corresponding diversity, however, would treat multiple co-occurrences at one place as coincidences regardless of where they took place. Also, Shannon Entropy does not allow us to adjust the impact of this type of co-occurrences on the diversity. Therefore, it is necessary to examine these issues to avoid any false predictions that coincidences might cause. We will address this in Sections 4.2.4 and 4.2.5.
4.2.3 Renyi Entropy-based Diversity
As we discussed in Section 4.2.2, even though the Shannon Entropy (and its corresponding diversity) can capture the likelihood of a social connection between two people, it cannot distinguish the cases when coincidences might or might not happen and does not allow us to adjust the impact of coincidences. Renyi entropy and its corresponding diversity, on the other hand, give us the ability to control how much coincidences contribute to diversity. In fact, Shannon entropy is just a special case of Renyi entropy.
Consider the general case of entropy, the Renyi entropy, given as:
$$H^{R}_{ij} = \left(-\log \sum_{l} \left(P^{l}_{ij}\right)^{q}\right) \Big/ (q - 1)$$ (4.8)
where $q \geq 0$ is the order of diversity. The diversity given by Equation 4.5 becomes (the upper index R denotes Renyi):
$$D_{ij} = \exp\left(H^{R}_{ij}\right) = \exp\left[\left(-\log \sum_{l} \left(P^{l}_{ij}\right)^{q}\right) \Big/ (q - 1)\right] = \left[\exp\left(\log \sum_{l} \left(P^{l}_{ij}\right)^{q}\right)\right]^{1/(1-q)} = \left[\sum_{l} \left(P^{l}_{ij}\right)^{q}\right]^{1/(1-q)}$$ (4.9)
$$= \left[\sum_{l,\, c_{ij,l} \neq 0} \left(\frac{c_{ij,l}}{f_{ij}}\right)^{q}\right]^{1/(1-q)}$$ (4.10)
Equation (4.10) expresses the diversity in terms of a co-occurrence vector. The elegance of using the Renyi entropy in our problem lies in the parameter q, called the order of diversity, which indicates its sensitivity to the local frequency $c_{ij,l}$ [Renyi, 1960]. Specifically:
• When q > 1, the Renyi entropy $H^{R}_{ij}$, and consequently the diversity $D_{ij}$, more favorably considers the high values of $c_{ij,l}$, which are the more popular events. In other words, the higher the local frequency $c_{ij,l}$, the more weight it receives and the more impact it has on diversity.
• When q < 1, conversely, the diversity tends to give more weight to the local frequencies $c_{ij,l}$ with low values.
• When q = 0, the diversity is completely insensitive to $c_{ij,l}$ and gives the pure number of co-occurrence locations.
• Case q = 1: The Renyi entropy favors local frequencies $c_{ij,l}$ in opposite ways when q < 1 versus when q > 1; therefore q = 1 is the crossover point where the Renyi entropy and its diversity stop favoring either side and weight the local frequencies $c_{ij,l}$ by their own values, which is what Shannon entropy captures. This suggests a meeting point of the two entropies. Indeed, even though Equations (4.8), (4.9) and (4.10) are undefined at q = 1, their limits exist when $q \to 1$ (see the proof below) and become the Shannon entropy and the diversity defined in Section 4.2.2.
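The role of the order q can be explored with a small sketch of Equation (4.10). The function name and the sample vector are illustrative assumptions, and the q = 1 case is handled by falling back to the Shannon form, since Equation (4.10) itself is undefined there.

```python
import math


def renyi_diversity(cooccurrences, q):
    """Diversity of order q of a co-occurrence vector, as in Eq. (4.10)."""
    frequency = sum(cooccurrences)
    if frequency == 0:
        raise ValueError("the user pair has no co-occurrences")
    probs = [c / frequency for c in cooccurrences if c > 0]
    if abs(q - 1.0) < 1e-12:
        # Eq. (4.10) is undefined at q = 1; use its limit, the Shannon form.
        return math.exp(sum(p * math.log(1.0 / p) for p in probs))
    return sum(p ** q for p in probs) ** (1.0 / (1.0 - q))


vector = [1, 2, 1, 1, 0]  # hypothetical co-occurrence vector

for q in (0, 0.5, 1, 2):
    print(q, round(renyi_diversity(vector, q), 3))
```

For this vector the diversity equals the number of co-occurrence locations at q = 0 and shrinks as q grows, because larger orders put more weight on the location with the highest local frequency.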
Proof of Renyi Entropy’s Limit: At q = 1, Equation (4.8) takes the indeterminate form $f(q)/g(q) = 0/0$, where $f(q) = \log \sum_{l} \left(P^{l}_{ij}\right)^{q}$ and $g(q) = 1 - q$. Therefore, we use l’Hôpital’s rule to find its limit, which is $\lim_{q \to 1} f(q)/g(q) = \lim_{q \to 1} f'(q)/g'(q)$. Here $f'(q) = \left(1 \Big/ \sum_{l} \left(P^{l}_{ij}\right)^{q}\right) \sum_{l} \left(P^{l}_{ij}\right)^{q} \log P^{l}_{ij}$ and $g'(q) = -1$ (assuming natural logarithm for simplicity). Plugging the value q = 1 into $f'(q)/g'(q)$, and using $\sum_{l} P^{l}_{ij} = 1$, we get:
$$\lim_{q \to 1} f(q)/g(q) = f'(q)/g'(q)\Big|_{q=1} = -\sum_{l} P^{l}_{ij} \log P^{l}_{ij}$$
The last formula is nothing but the Shannon entropy; thus the existence of the limit of the Renyi entropy is proved. This also gives the limit of the diversity, as D = exp(H), and at q = 1 Equations (4.9) and (4.10) become Equations (4.6) and (4.7), respectively.
We see that the impact of the local frequency on the diversity is not necessarily determined by just its own value $c_{ij,l}$, but also by the value of parameter q. This moves us one step closer to solving the problem of coincidences, which we are going to discuss in Section 4.2.4.
4.2.4 Coincidences
Coincidences occur when two people happen to be at the same places at the same time but never or rarely get a chance to see and communicate with each other, and are thus less likely to be friends. This happens often in popular and crowded places, such as cafeterias and public libraries.
Consider the following example: assume there are 5 cells and consider two user pairs (a,b) and (c,d) with co-occurrence vectors $C_{ab} = (10, 1, 0, 0, 9)$ and $C_{cd} = (2, 3, 2, 2, 3)$, respectively. We also assume that cells 1 and 5 are highly crowded places, cell 2 is medium-crowded, while cells 3 and 4 are non-crowded, based on the number of visits. Intuitively, this example suggests that c and d are far more socially connected than a and b, as the co-occurrences of a and b are likely coincidences; the co-occurrence at cell 2 would be the only one that is medium-significant for the friendship of a and b, while c and d would have 7 such or even more significant co-occurrences from cells 2, 3 and 4. First, using the total number of co-occurrences would wrongly suggest that (a,b) are socially closer to each other than (c,d). Second, using the number of co-occurrence locations (NL) for social strength would give us a relative value $NL_{ab}/NL_{cd} = 3/5$, which still indicates a recognizable level of connection of (a,b) compared to (c,d), whereas a fair measure should make that level low.
Now let’s see how diversities of Shannon and Renyi entropies address the challenge in the example above. We set the value of q (order of diversity) to 0.5, less than 1, which, according to the discussion in Section 4.2.3, will limit the impact of coincidences. The relative value for Shannon entropy of the two user pairs is $H^{S}_{ab}/H^{S}_{cd} = 0.86/1.59 = 0.54$, relative Shannon’s diversity is $D^{S}_{ab}/D^{S}_{cd} = 2.35/4.90 = 0.48$, relative Renyi entropy is $H^{R}_{ab}/H^{R}_{cd} = 3.20/5.63 = 0.56$, and relative Renyi’s diversity is $D^{R}_{ab}/D^{R}_{cd} = 24.60/279.67 = 0.09$. First, the Renyi’s diversity shows a relatively high level of social connection of (c,d) compared to (a,b) ($D^{R}_{ab}/D^{R}_{cd} = 0.09$), which we would expect intuitively. Second, compared to Renyi’s diversity, Shannon’s diversity does not limit the impact of coincidences; consequently, the social strength of (a,b) is still high compared to that of (c,d) ($D^{S}_{ab}/D^{S}_{cd} = 0.48$). Third, using entropy (either Shannon or Renyi) instead of diversity as a metric of social strength still results in a relatively high level of connection of (a,b) compared to (c,d). Therefore, this example confirms our discussion in Section 4.2.2 that Entropy can only act as the index of diversity, but should not be used as a direct metric for social connection.
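The first three measures in this comparison (total frequency, number of co-occurrence locations, and the Shannon entropy and diversity) follow directly from the two vectors and can be recomputed with the sketch below. The helper name is an assumption, natural logarithms are assumed, and only the Shannon quantities are recomputed here.

```python
import math


def summarize(cooccurrences):
    """Frequency, number of locations, Shannon entropy and diversity of a vector."""
    frequency = sum(cooccurrences)
    probs = [c / frequency for c in cooccurrences if c > 0]
    entropy = sum(p * math.log(1.0 / p) for p in probs)
    return {
        "frequency": frequency,
        "locations": len(probs),
        "entropy": entropy,
        "diversity": math.exp(entropy),
    }


c_ab = [10, 1, 0, 0, 9]
c_cd = [2, 3, 2, 2, 3]

ab, cd = summarize(c_ab), summarize(c_cd)
# Relative value of (a,b) with respect to (c,d) for each measure.
ratios = {key: ab[key] / cd[key] for key in ab}

for key in ab:
    print(key, round(ab[key], 2), round(cd[key], 2), round(ratios[key], 2))
```

The raw frequency ranks (a,b) above (c,d), the location count gives 3/5, and the Shannon entropy and diversity ratios come out at roughly 0.54 and 0.48.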
We see that coincidences often produce high local frequencies $c_{ij,l}$, which, if misjudged, can be overestimated. However, Renyi entropy and its diversity give us the ability to control the impact of coincidences on diversity through q, which is sensitive to the values of local frequencies. q is one of the optimization parameters and will be determined experimentally in Section 4.4.3.
Up to this point, we have focused on eliminating the impact of coincidences at crowded places. However, we have not yet answered the following two questions: 1) What characterizes crowded and non-crowded places, or even further, the level of crowdedness? 2) What can be used to determine the likelihood of coincidences in co-occurrences, even when local frequencies are low, and conversely, the likelihood of non-coincidences when frequencies are low or high? We are going to answer these questions in Section 4.2.5 utilizing Location Entropy and Weighted Frequency.
4.2.5 Location Entropy
Location entropy is a crucial part of weighted frequency. It was first introduced in [Cranshaw et al., 2010] to describe the popularity of a location. Let l be a location, $V_{l,u} = \{\langle u, l, t \rangle : \forall t\}$ be the set of check-ins at location l of User u, and $V_{l} = \{\langle u, l, t \rangle : \forall t, \forall u\}$ be the set of all check-ins at location l of all users. The probability that a randomly picked check-in from $V_{l}$ belongs to User u is $P_{u,l} = |V_{l,u}| / |V_{l}|$. If we define this event as a random variable, then its uncertainty is given by the Shannon entropy as follows:
$$H_{l} = -\sum_{u,\, P_{u,l} \neq 0} P_{u,l} \log P_{u,l}$$ (4.11)
A high value of location entropy indicates a popular place that has many visitors and is not specific to anyone. On the other hand, a low value of location entropy implies a private place with few visitors, such as a house, which is specific to a few people.
To help understand the meaning of location entropy, let’s assume a simplified case, when N users have visited a location l and each user visited it exactly once. The location entropy then becomes:
$$H_{l} = -\sum_{u=1}^{N} \frac{1}{N} \log \frac{1}{N} = \log N$$ (4.12)
As we see in this simplified case, location entropy is the logarithm of the number of unique users who have been at the place. We show the dependence of location entropy’s value on the number of unique visitors for this simplified case in Fig 4.3(a). Fig 4.3(b) shows an example of location entropy for the case of three users from Fig 4.1, where the value of location entropy of each cell is underlined. Note that cells 7, 8, 9 and 10 have no visitors and they have a default value of 0 for location entropy (not shown in the figure).
Figure 4.3: Location Entropy (LE). (a) Number of Unique Visitors vs. LE; (b) Example of LE.
Fig. 4.3(b) tells us that location entropy is not really determined by the number of visits, but rather by the number of unique visitors. In addition, the location entropy is higher if the location is less specific to any user. Location entropy helps us answer two questions:
• Using location entropy, we can determine the places where coincidences are highly probable, even when the frequency of a user pair in such places is low. That is because location entropy for a place takes into account the visits of all others to that place.
• A low number of co-occurrences (low local frequency) at an uncrowded place can also be a significant indicator of social connections, even if the diversity is low. When the number of co-locations is low, the diversity will also be low, hence this type of co-occurrences cannot be captured by the diversity measure. Therefore, it is necessary to ensure that co-occurrences at highly private places are given more priority or weight.
Using location entropy, we will introduce weighted frequency in Section 4.2.6 to capture this type of co-occurrences.
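As an illustration of Equations (4.11) and (4.12), the sketch below computes location entropy from a list of check-in triples. The function name, the triple layout and the sample data are assumptions made for this example only.

```python
import math
from collections import Counter


def location_entropy(checkins, location):
    """Location entropy H_l of Eq. (4.11) from (user, location, time) triples."""
    visits = Counter(user for user, loc, _ in checkins if loc == location)
    total = sum(visits.values())
    if total == 0:
        return 0.0  # default value for locations without visitors
    return sum((n / total) * math.log(total / n) for n in visits.values())


# Simplified case of Eq. (4.12): N users, one visit each, gives log N.
square = [("user%d" % k, "square", 0) for k in range(8)]

# One user visiting many times: many visits, but a single unique visitor.
office = [("alice", "office", t) for t in range(50)]

# A place dominated by one visitor, with a single visit from someone else.
house = [("host", "house", t) for t in range(1000)] + [("guest", "house", 0)]

print(location_entropy(square, "square"))  # equals log(8)
print(location_entropy(office, "office"))  # 0.0
print(round(location_entropy(house, "house"), 4))
```

The second case shows that repeated visits by one user do not raise the entropy, while the third shows that a place dominated by a single visitor stays close to zero.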
4.2.6 Weighted Frequency
Co-occurrences in small uncrowded places, such as private houses, often result in more social interaction, as compared to those in crowded places. Therefore, the probability of friendships strongly depends on the locations of co-occurrences. Given the co-occurrence vector of Users i and j in Equation (4.2), we define the weighted frequency, which tells us how important the co-occurrences at non-crowded places are to social strength, as follows:
$$F_{ij} = \sum_{l} c_{ij,l} \times \exp(-H_{l})$$ (4.13)
It is interesting to note that $\exp(-H_{l})$ is the inverse of the diversity of a location in terms of its visitors. This weighted frequency is inspired by tf-idf, a numerical statistic widely used in information retrieval and text mining [Blei et al., 2003b] to measure the importance of a term/word t to a document in a corpus. tf is term frequency, often taken as the number of times the term appears in a document. idf is inverse document frequency, defined as $|D| / (|\{d \in D : t \in d\}| + 1)$, i.e., the total number of documents $|D|$ divided by the number of documents that contain t. In our problem, a location is similar to a document in tf-idf, thus the number of co-occurrences at a location is similar to term frequency in a document. However, to weight co-occurrences, we use $\exp(-H_{l})$, not idf, since location entropy provides insights into the intrinsic characteristics of a location, i.e., its visiting patterns. idf is not suitable here because, by its definition, it says how important or how specific a user pair is to a location, but we want to answer a different question: how important is a co-visit to that location to a pair of users?
Another tempting approach, which is also inspired by tf-idf and is in fact used in [Cranshaw et al., 2010], would be to use $c_{ij,l} / \sum_{(u,v)} c_{uv,l}$, the number of co-occurrences of a user pair at a location divided by the total number of co-occurrences by all user pairs at that location. To show the shortcoming of this approach, assume we have a private house where a couple lives, and their check-ins have produced 1000 co-occurrences. A guest couple visited them once and made 1 co-occurrence. The ratio for the guest couple in this house is 1/(1000 + 1) = 1/1001, which is very low and, therefore, would say nothing about the social connection of the guest couple; but as we know, such a co-occurrence, even just one, is a strong indication of a social connection. Our weighted frequency, however, looks at this case differently. Since the house is visited by few people, its location entropy is low (0.0079), which makes a high value of weight $\exp(-H_{l}) = 0.9921$ (note that $0 < \exp(-H_{l}) \leq 1$). Thus, this single co-occurrence makes a high impact on weighted frequency and is significant for the connection of the guest couple.
To continue the example in Section 4.2.4, we calculate weighted frequencies for the two couples (a,b) and (c,d). As we assumed earlier, cells 1 and 5 are crowded places. To compute location entropy, we also need the visit information of other users in each cell. To achieve that, assume there are 20 additional visitors at each of cells 1 and 5, and each of them visited the cell 10 times. For simplicity, also assume that each user a, b, c and d visited each cell as many times as they co-occurred with their partners. The weighted frequency for each pair is $F_{ab} = 1.10$ and $F_{cd} = 3.07$. Our analysis shows that $F_{ab}$ is mostly impacted by cell 2, and $F_{cd}$ is mostly impacted by cells 2, 3 and 4, which matches our expectation that only non-crowded places contribute to weighted frequencies.
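The two weighted frequencies in this example can be recomputed with the sketch below, which applies Equations (4.11) and (4.13) under the stated assumptions. One further assumption is made here: cells 2, 3 and 4 are visited only by users a, b, c and d. The function and variable names are illustrative.

```python
import math


def location_entropy(visits_per_user):
    """Location entropy H_l of Eq. (4.11) from per-user visit counts."""
    total = sum(visits_per_user)
    return sum((n / total) * math.log(total / n) for n in visits_per_user if n > 0)


def weighted_frequency(cooccurrences, entropies):
    """Weighted frequency F_ij of Eq. (4.13)."""
    return sum(c * math.exp(-h) for c, h in zip(cooccurrences, entropies))


c_ab = [10, 1, 0, 0, 9]
c_cd = [2, 3, 2, 2, 3]

# 20 additional visitors with 10 visits each at cells 1 and 5; none elsewhere
# (the latter is an assumption of this sketch).
others = [[10] * 20, [], [], [], [10] * 20]

entropies = []
for cell in range(5):
    # a and b each visit c_ab[cell] times; c and d each visit c_cd[cell] times.
    visits = [c_ab[cell]] * 2 + [c_cd[cell]] * 2 + others[cell]
    entropies.append(location_entropy(visits))

f_ab = weighted_frequency(c_ab, entropies)
f_cd = weighted_frequency(c_cd, entropies)
print(round(f_ab, 2), round(f_cd, 2))
```

With these inputs the weights of the two crowded cells are about 0.04 each, so almost all of the weighted frequency of (c,d) comes from cells 2, 3 and 4.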
Note: Diversity and weighted frequency answer two different questions. Diversity decreases the impact of frequent coincidences, while weighted frequency increases the impact of co-occurrences at less crowded places; the less crowded, the more impact.
Data sparseness: Weighted frequency plays an important role when data is sparse, i.e., when the availability of spatiotemporal data is very limited. With only a few co-occurrences for each couple, the Renyi’s diversity can be very low. However, weighted frequency compensates for low diversity by further looking into location characteristics to capture the co-occurrences at non-crowded areas, which can be insignificant for diversity, but very significant for weighted frequency, and consequently for friendship.
4.2.7 Social Strength
So far, we have formulated two independent ways, through which co-occurrences contribute
to social strength: 1) Diversity (through Renyi entropy) - which measures how diverse the co-
occurrences of two people are, and at the same time, can control and tell us how much coincidences
can impact diversity, and 2) Weighted frequency - which favorably captures the local frequencies
of co-occurrences at uncrowded places and can compensate for diversity in case of data sparseness.
If we want to combine these two measures to produce an ultimate one for social strength, it is necessary to understand the relative importance of each component measure to social strength. To illustrate, from the example discussed in Sections 4.2.4 and 4.2.6, we have ($D^{R}_{ab} = 24.60$, $F_{ab} = 1.10$) and ($D^{R}_{cd} = 279.67$, $F_{cd} = 3.07$). We see that the two measures have different scales; as we decrease the order of diversity q, diversity scales itself up to more clearly differentiate the impacts of coincidences and non-coincidences. As we see, $D^{R}_{ab}/D^{R}_{cd} = 0.09$, but at the same time the diversity’s scale goes up to 279.67 while weighted frequency remains at a low scale, $F_{cd} = 3.07$. This challenge influences the way we combine the two measures.
We now formulate the social strength, which ultimately tells how close two people are, by doing a linear regression over diversity $D_{ij}$ and weighted frequency $F_{ij}$:
$$s_{ij} = \phi(D_{ij}) + \psi(F_{ij})$$ (4.14)
where $\phi$ and $\psi$ are two linear functions and $s_{ij}$ is the ultimate strength measure we look for. Since $D_{ij}$ focuses on the distribution of co-occurrences over different locations, while $F_{ij}$ focuses on the intrinsic properties of locations, they are independent of each other; consequently, Equation (4.14) takes us to a multiple regression problem over two independent variables $D_{ij}$ and $F_{ij}$. For convenience of conducting the multiple regression, we rewrite Equation (4.14) in an explicit form through optimal parameters $\alpha$, $\beta$ and $\gamma$:
$$s_{ij} = \alpha \cdot D_{ij} + \beta \cdot F_{ij} + \gamma$$ (4.15)
where $D_{ij}$ and $F_{ij}$ are defined in Equations (4.10) and (4.13), respectively. Parameters $\alpha$, $\beta$ and $\gamma$ (see footnote 1) can be learned from a dataset and/or provided by the user. In Section 4.4.4 we show how these parameters can be learned from training data and are applicable across networks.
Equation (4.15) is the final formula to determine social strength between two users given their co-occurrences in time and space.
Footnote 1: It is possible to keep only one parameter, say, by letting $\alpha = 1$ and skipping $\gamma$. However, we keep all three parameters to follow the more traditional form of the multiple regression problem.
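To make the regression step concrete, the following is a minimal sketch of an ordinary least-squares fit of the two coefficients and the intercept of Equation (4.15), written without external libraries. It is not the training procedure of Section 4.4.4: the function names are illustrative, and the training triples are synthetic values generated from known parameters so that the fit can be checked.

```python
def fit_social_strength(samples):
    """Least-squares fit of s = a * D + b * F + c from (D, F, s) triples."""
    # Normal equations (X^T X) w = X^T y, with rows x = (D, F, 1).
    lhs = [[0.0] * 3 for _ in range(3)]
    rhs = [0.0] * 3
    for d, f, s in samples:
        x = (d, f, 1.0)
        for r in range(3):
            rhs[r] += x[r] * s
            for c in range(3):
                lhs[r][c] += x[r] * x[c]
    # Gauss-Jordan elimination with partial pivoting on the 3x3 system.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(lhs[r][col]))
        lhs[col], lhs[pivot] = lhs[pivot], lhs[col]
        rhs[col], rhs[pivot] = rhs[pivot], rhs[col]
        for r in range(3):
            if r != col:
                factor = lhs[r][col] / lhs[col][col]
                for c in range(col, 3):
                    lhs[r][c] -= factor * lhs[col][c]
                rhs[r] -= factor * rhs[col]
    return tuple(rhs[i] / lhs[i][i] for i in range(3))


def social_strength(diversity, weighted_freq, params):
    """Evaluate Eq. (4.15) for one user pair."""
    a, b, c = params
    return a * diversity + b * weighted_freq + c


# Synthetic training data generated from known (hypothetical) parameters.
true_params = (0.01, 0.5, 0.1)
features = [(2.35, 1.10), (4.90, 3.07), (1.00, 0.50), (3.80, 0.20), (5.00, 2.00)]
samples = [(d, f, social_strength(d, f, true_params)) for d, f in features]

fitted = fit_social_strength(samples)
print([round(p, 4) for p in fitted])
```

Because the two features live on different scales, the fitted coefficients absorb the rescaling, which is one reason to learn them from data rather than fix them by hand.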
4.2.8 EBM with Location Semantics
4.2.8.1 Location Semantics
In this section, we provide an extension to the EBM model by considering the role of the semantics of locations in social strength. In other words, if we know the type of the location where the co-occurrences between two users took place, such as a restaurant or a hospital, how can we utilize this semantic information to improve the EBM model in inferring social strength? Similar to the approach for the co-occurrence vector (see Section 4.1.0.3), here we also need to identify which features of the semantics of a location may have an impact on social strength.
In order to answer this question, we first look at the existing studies that focused on location semantics. In particular, we pay close attention to three major studies that inferred the semantics of locations based on different features of check-ins that are closely related to the social behaviors of users [Cao et al., 2010][Ye et al., 2011][Lee et al., 2011]. In all three studies, the authors utilized the location history to explore the users’ behaviors and to identify the unique features that are discriminative for classifying the semantics of each location (e.g., as a restaurant, a shopping mall or a hospital). Even though these studies used different datasets, they all shared the same key ideas about which features of the location history data are the most significant for inferring the semantics of locations. Specifically, they agreed on the following. Among all the features of the location history data, the popularity of a location (number of visits, number of unique visitors) and the average duration that users stayed in the location are the most significant features. For example, restaurants have a similar level of crowdedness and an average duration of stay of 1.5 hours. The less significant features are the check-in time (aka entry time) and the distribution of check-ins during a period of 24 hours or a week. In particular, Lee et al. argued that the duration of stay is the most significant feature and chose to utilize only the duration of stay for inferring location semantics [Lee et al., 2011]. Ye et al. put more emphasis on the location popularity [Ye et al., 2011], while Cao et al. put more emphasis on the duration of stay and the check-in time [Cao et al., 2010]. With these observations, they were able to infer the location semantics with a high accuracy. For example, Ye et al. achieved an average precision above 90% for places related to nightlife and restaurants and approximately 80% for all other types of places [Ye et al., 2011]. Consequently, we will rely on these observations to exploit location semantics in our EBM model.
Note that there exist other studies that infer location semantics and do not rely on the users’ behaviors. For example, Liu et al. [Liu et al., 2006] inferred the semantics of a location by converting the location’s coordinates to a street address, and matching the street address with other services (Yellow Pages, Address Book, Map, etc.) to obtain the semantics of the location. This approach is not useful for the EBM model because it is unrelated to users’ check-in behaviors. In contrast, we aim to explore the users’ check-in behaviors (e.g., frequency of visits, duration of stay, etc.) to infer social strength.
4.2.8.2 Applying Location Semantics in EBM
In this section, we discuss how to utilize location semantics to improve the EBM model. According to the studies discussed in the previous section [Cao et al., 2010][Ye et al., 2011][Lee et al., 2011], the two most discriminative features for the semantics of a location are its popularity and the average duration of stay by users. Therefore, we can argue that if the semantics of a location are known a priori, we can derive its popularity and its average duration of stay, for example, by utilizing the historical check-in data. We will focus on these two features in this section.
The first significant feature that can be derived from a location’s semantics is the popularity of the location. We remove Location Entropy from the EBM model, specifically from Equation (4.13), and instead use the popularity computed based on location semantics. Using the historical check-in data, for each location type, we compute the average number of unique users per location ($NU(\sigma(l))$), the average number of visits per location ($NV(\sigma(l))$), and the Average Location Entropy ($H(\sigma(l))$), where $\sigma(l)$ denotes the type (or the semantics) of location l. Subsequently, we use the same values $NU(\sigma(l))$, $NV(\sigma(l))$ and $H(\sigma(l))$ for all the locations of the same type $\sigma(l)$, and obtain three alternative options for the weighted frequency as follows:
$$F_{ij} = \sum_{l} c_{ij,l} \times \frac{1}{NU(\sigma(l))}$$ (4.16)
$$F_{ij} = \sum_{l} c_{ij,l} \times \frac{1}{NV(\sigma(l))}$$ (4.17)
$$F_{ij} = \sum_{l} c_{ij,l} \times \exp(-H(\sigma(l)))$$ (4.18)
In each of the Equations (4.16), (4.17) and (4.18), the local frequency $c_{ij,l}$ is inversely weighted by the popularity of the location l.
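The three alternatives in Equations (4.16), (4.17) and (4.18) differ only in the per-type weight, which the sketch below makes explicit. The per-type averages, location names and type labels are hypothetical placeholders; in practice they would be computed from historical check-in data.

```python
import math

# Hypothetical per-type averages: unique users (NU), visits (NV) and
# average location entropy (H) per location of that type.
type_stats = {
    "restaurant": {"NU": 40.0, "NV": 120.0, "H": 3.2},
    "home": {"NU": 2.0, "NV": 300.0, "H": 0.3},
}

# Hypothetical mapping from each location to its type (its semantics).
location_type = {"cafe_1": "restaurant", "home_1": "home"}


def semantic_weighted_frequency(cooccurrences, metric):
    """Weighted frequency with type-level popularity, Eqs. (4.16)-(4.18)."""
    total = 0.0
    for location, count in cooccurrences.items():
        stats = type_stats[location_type[location]]
        if metric == "NU":    # Eq. (4.16): inverse average unique users
            weight = 1.0 / stats["NU"]
        elif metric == "NV":  # Eq. (4.17): inverse average visits
            weight = 1.0 / stats["NV"]
        elif metric == "H":   # Eq. (4.18): exp of minus average entropy
            weight = math.exp(-stats["H"])
        else:
            raise ValueError("metric must be 'NU', 'NV' or 'H'")
        total += count * weight
    return total


pair = {"cafe_1": 3, "home_1": 2}  # local frequencies of one user pair
for metric in ("NU", "NV", "H"):
    print(metric, round(semantic_weighted_frequency(pair, metric), 4))
```

Note how the visit-based weight penalizes the home heavily because of its many repeat visits, whereas the unique-user and entropy-based weights treat it as a private place.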
However, as compared to Equation (4.13), the location popularity in Equations (4.16), (4.17) and (4.18) is also derived from historical data. In fact, in Equation (4.13), Location Entropy can capture location popularity with higher accuracy as it considers the popularity at a finer granularity, i.e., per location, instead of considering the popularity at the coarser granularity of semantics (i.e., per group of locations of the same type) as in Equations (4.16), (4.17) and (4.18). Consequently, using the semantics-based popularity does not provide any clear advantage over the existing Location Entropy in Equation (4.13). Indeed, we conducted experiments based on the three equations above in Section 4.4.7.1, and the results confirmed that the semantics-based popularity does not improve the precision/recall over the existing location popularity in Equation (4.13).
However, the role of the semantics-based location popularity becomes significant when we do not have access to the historical check-in data of a location, say location l, in order to compute its Location Entropy. This is normally the case when the users’ locations are protected by k-anonymity due to privacy issues [Lee et al., 2011]. In that case, if we at least know the type of the location where users co-occur, we can estimate the popularity of the location based on its type. The estimation can be done by examining the check-ins of the other locations (possibly of a different dataset) that have the same semantics as l, and obtaining the average popularity as one of the metrics NU, NV or H. We discuss more on this issue in Section 4.5.
The second feature related to location semantics is the average duration of stay (ADS); locations of the same semantics share similar ADS [Lee et al., 2011][Cao et al., 2010]. For example, people dining at a restaurant normally stay for about 1 or 2 hours [Lee et al., 2011]. Consequently, given the semantics of a location, we can estimate ADS for that location. For example, for a location that is of type restaurant, we can get the lengths of online reservations and then compute the average duration. In Section 4.4.7.1, we will discuss our method of estimating the ADS for each location type by obtaining data from different sources. Generally, spatiotemporal data may contain location semantics (e.g., Foursquare’s data), but does not typically contain users’ duration of stay. Therefore, being able to estimate the duration of stay based on the location semantics provides us with an opportunity to compensate for the lack of this information from the raw input data. In the remainder of this section, we discuss how to incorporate the duration of stay, which is inferred based on the location semantics, into the EBM model.
To utilize ADS, we need to make the following assumption: if the average duration of stay (ADS) at location l is $t_{l}$, then the average duration of co-occurrences (ADC) at location l is also $t_{l}$. This assumption is based on the observation made in the earlier studies in psychology [Werner and Parmelee, 1979][Sias and Cahill, 1998][Oswald and Clark, 2003] that if two friends attend an event together (a restaurant, a concert, or a shopping mall, etc.), then, in general, they remain together during the event.
We plan to capture the duration of co-occurrences in social strength, particularly in weighted frequency, by weighing each co-occurrence based on both the location popularity and the duration of the co-occurrence; the former is captured by Location Entropy, and we assume the latter to be captured by a function I(l), which is the impact of the duration of co-occurrence on $F_{ij}$. This function should only depend on the location l of the co-occurrence because the duration of co-occurrence is estimated based on the semantics of l. Consequently, we rewrite the weighted frequency for user pair (i, j) as follows:
$$F_{ij} = \sum_{l} c_{ij,l} \times \exp(-H_{l}) \times I(l)$$ (4.19)
We will measure the impact of co-occurrence duration I(l) on weighted frequency by applying the decay law of social influence [Goyal et al., 2010][Goyal et al., 2011][Gomez Rodriguez et al., 2010], which states the following. Consider two users i and j. If i performed an action (i visited a location l in our case) at time $t_{1}$, and j also performed that same action at a later time $t_{2} > t_{1}$ as a result of being influenced by the earlier action by i, then we say that i influences j, and the strength of the influence that i exerts on j decays exponentially over the time delay between their visits, $\Delta t(l) = t_{2} - t_{1}$. The later j performed that action, the weaker the influence. On the other hand, the sooner j performed the action (indicating the urge that j feels), the stronger the influence.
$$I_{i \to j}(\Delta t(l)) = I_{0} \exp\left(-\frac{\Delta t(l)}{\tau}\right)$$ (4.20)
where $I_{i \to j}$ denotes the influence that i exerts on j, $I_{0}$ is the initial influence that i would exert on j if j visited location l at the same time as i did (or they co-occurred), and $\tau$ is the expected time delay, which can be taken as the average time delay between the visits of friends to the same locations [Goyal et al., 2010]. The general interpretation of Equation (4.20) is as follows: if i and j co-occurred, then $I_{0}$ is their mutual influence, which indicates that they both influence each other socially. The strength of the influence depends on their duration of co-occurrence. Unless the two users i and j remain together at the same location, their mutual social influence starts to decay exponentially over time, starting from $I_{0}$. Based on this law, we measure the impact of co-occurrence duration on weighted frequency as the mutual influence between two users during the duration of their co-occurrence. Let us consider three pairs of users (p, q), (i, j) and (u, v); each pair co-occurred at some location, and their durations of co-occurrence are $\Delta t_{1}$, $\Delta t_{2}$ and $\Delta t_{3}$, respectively. Assume that $\Delta t_{1}$ is the longest duration, followed by $\Delta t_{2}$ and by $\Delta t_{3}$, as shown in Figure 4.4.
Figure 4.4: Durations of co-occurrences for user pairs (p, q), (i, j) and (u, v) are $\Delta t_{1}$, $\Delta t_{2}$ and $\Delta t_{3}$, respectively.
Without loss of generality, suppose that the three pairs of users started their co-occurrences at the same time $t_{s}$, as shown in Figure 4.4. During the entire period from $t_{s}$ to $t_{f}$ ($t_{f}$ is the end of the co-occurrence duration of (p, q)), p and q remained together, and therefore, their mutual social influence did not decay and remained $I_{0}$. However, for i and j, their influence remained at $I_{0}$ only from $t_{s}$ to $t_{s} + \Delta t_{2}$. As compared to (p, q), the influence between i and j started to decay at the moment $t_{s} + \Delta t_{2}$, and at $t_{f}$, their influence had decayed from $I_{0}$ down to $I_{0} \exp(-\Delta t'_{2}/\tau) = I_{0} \exp(-(\Delta t_{1} - \Delta t_{2})/\tau)$. Similarly, the influence between (u, v) had decayed down to $I_{0} \exp(-\Delta t'_{3}/\tau) = I_{0} \exp(-(\Delta t_{1} - \Delta t_{3})/\tau)$. Consequently, if we assume that $T = \Delta t_{1}$ is the maximum possible co-occurrence duration, then we can express the strength of the influence for each user pair (i, j) according to their co-occurrence duration $\Delta t$ as follows:
$$I(i, j) = \exp\left(-\frac{T - \Delta t}{\tau}\right)$$ (4.21)
In Equation (4.21), we set $I_{0} = 1$ and replace the symbol $i \to j$ with (i, j) because the users co-occurred and stayed together during $\Delta t$, and thus their influence is bidirectional (mutual). I(i, j) reaches the maximum of 1 when $\Delta t = T$. According to our earlier discussion, we utilize the semantics of location l to estimate the duration of co-occurrence between users in location l, which in turn is the average duration of stay of users at the location, denoted as $ADS_{l}$. Therefore, we can rewrite Equation (4.21) to take into account the location of the co-occurrence between i and j as follows:
$$I_{l}(i, j) = \exp\left(-\frac{T - ADS_{l}}{\tau}\right)$$ (4.22)
$I_l(i, j)$ in Equation (4.22) indicates the strength of influence between i and j due to one single co-occurrence at location $l$. The higher this influence strength, the stronger the social strength between i and j. Subsequently, we incorporate the influence strength $I_l(i, j)$ into the weighted frequency (see Equation 4.19) in order to capture the significance of the duration of co-occurrences between two users. Since $I_l(i, j)$ depends only on location $l$, we change the notation from $I_l(i, j)$ to simply $I(l)$. The weighted frequency becomes:

$$F_{ij} = \sum_{l} c_{ij,l} \exp(-H_l)\, I(l) = \sum_{l} c_{ij,l} \exp(-H_l) \exp\left(-\frac{T - ADS_l}{\tau}\right) \qquad (4.23)$$
The meaning of the factor $I(l)$ to $F_{ij}$ is similar to the meaning of the location popularity factor $\exp(-H_l)$ to $F_{ij}$. Each local frequency $c_{ij,l}$ (the number of co-occurrences at location $l$) is weighted by both the popularity of the location and the duration of the co-occurrences; the former is captured by Location Entropy, while the latter is captured by the influence strength $I(l)$.
Utilizing location semantics in our solution provides several advantages, especially in corner cases. For example, when the data is sparse, many user pairs may have very few co-occurrences, which results in both low diversity and low local frequency. However, by capturing the co-occurrence duration (in addition to the location popularity), we can increase the impact of weighted frequency on social strength. The advantage can also be observed when users co-occurred repeatedly at only one location for a long duration, such as when students attend an art class together due to their mutual interest. Even though the diversity is zero in such a case, their weighted frequency can become substantially high due to the incorporation of their long-lasting co-occurrences in $I(l)$, and thus their social connection may still be captured by the EBM model.
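The effect of the duration factor can be illustrated with a minimal Python sketch of Equations (4.22) and (4.23). It rests on stated assumptions: the decay constant is passed in as `tau`, the popularity weight is taken as exp(-H_l), and the function names and all numbers are hypothetical illustrations rather than the implementation used in the experiments.

```python
import math

def influence(ads, max_duration, tau):
    """I(l) = exp(-(T - ADS_l) / tau), i.e. Eq. (4.22) with I_0 = 1."""
    return math.exp(-(max_duration - ads) / tau)

def weighted_frequency(cooccurrences, entropy, ads, max_duration, tau):
    """Eq. (4.23): sum over locations of c_ij,l * exp(-H_l) * I(l).

    cooccurrences: {location: number of co-occurrences of the pair there}
    entropy:       {location: location entropy H_l}
    ads:           {location: average duration of stay ADS_l}
    """
    return sum(
        count * math.exp(-entropy[loc]) * influence(ads[loc], max_duration, tau)
        for loc, count in cooccurrences.items()
    )

# Hypothetical values: two equally popular places with different stay lengths.
entropy = {"art_class": 2.0, "coffee_shop": 2.0}
ads = {"art_class": 3.0, "coffee_shop": 0.5}   # hours
T, tau = 3.0, 2.0                               # hours (assumed values)

long_stay = weighted_frequency({"art_class": 4}, entropy, ads, T, tau)
short_stay = weighted_frequency({"coffee_shop": 4}, entropy, ads, T, tau)
```

With identical entropy and identical local frequency, the pair that met at the location with the longer average stay receives the larger weighted frequency, which is the behavior described for the art-class corner case.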
4.3 Optimization
To be efficient with massive datasets, we optimize the implementation by using a k-d tree [Bentley, 1975] as the data structure and the MapReduce framework to parallelize the computation.
Figure 4.5: Data Structure
Data structure: We use a k-d tree [Bentley, 1975] with slight modifications to make it suitable for our problem. It is a 3-D tree, shown in Fig. 4.5, with an $(x, y)$ plane for representing locations and a t-axis for time. The details of the k-d tree's implementation can be found in the literature [Bentley, 1975]. Our focus here, however, is how to split a tree or sub-tree (named rectangular parallelepiped, or RP for short) into smaller sub-trees when the inserted data exceeds its capacity. To do that, we can either split the 1-D time interval into two equal halves or the 2-D spatial square of the RP into four equal quads. The former method is cheaper as it is a 1-D split. However, consider the case of quad ABCD in Fig. 4.5: spatial splitting is the better option there, as the spatial points are evenly spread out over the quad but all the temporal points are concentrated at one end of the time interval. In contrast, in the DEFG quad, temporal splitting is the better option. To facilitate this decision, we once again use entropy. Assume we have $N$ check-ins in an RP. Divide the RP's spatial quad into $S = 4^n$ equal cells indexed from 1 to $S$, and the RP's time interval into $T = 2^m$ equal sub-intervals indexed from 1 to $T$ ($n$ and $m$ are integers). Then, for the check-ins, the spatial Shannon entropy in the quad and the temporal Shannon entropy in the time interval are:

$$H_s = -\sum_{i} \frac{s_i}{N} \ln\left(\frac{s_i}{N}\right); \qquad H_t = -\sum_{j} \frac{t_j}{N} \ln\left(\frac{t_j}{N}\right) \qquad (4.24)$$

where $s_i$ and $t_j$ are the numbers of check-ins in spatial cell $i$ and time sub-interval $j$, respectively, and $1 \leq i \leq S$, $1 \leq j \leq T$. The more evenly spread out the check-ins are in the quad (or the time interval), the higher the entropy. These two quantities indicate which dimension to split in an RP when it reaches capacity $N$. The empirically chosen value for both $S$ and $T$ is 16 ($n = 2$ and $m = 4$). In general, when the two entropies are roughly the same, time splitting is chosen to save storage cost.
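The split decision can be sketched in Python as follows. This is only an illustration under assumptions: the function names, the `bounds` layout and the `tolerance` used to decide when the two entropies are "roughly the same" are hypothetical, and the rule "split the dimension with the higher entropy" is inferred from the ABCD and DEFG examples.

```python
import math

def shannon_entropy(counts):
    """Shannon entropy (natural log) of a list of bin counts, as in Eq. (4.24)."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

def choose_split(checkins, bounds, n=2, m=4, tolerance=0.05):
    """Return "space" or "time" for an RP that has reached capacity.

    checkins: list of (x, y, t) points stored in the RP
    bounds:   (x_min, x_max, y_min, y_max, t_min, t_max) of the RP
    """
    x0, x1, y0, y1, t0, t1 = bounds
    side = 2 ** n      # a 2^n x 2^n grid gives S = 4^n spatial cells
    slots = 2 ** m     # T = 2^m time sub-intervals
    spatial = [0] * (side * side)
    temporal = [0] * slots
    for x, y, t in checkins:
        cx = min(int((x - x0) / (x1 - x0) * side), side - 1)
        cy = min(int((y - y0) / (y1 - y0) * side), side - 1)
        ct = min(int((t - t0) / (t1 - t0) * slots), slots - 1)
        spatial[cy * side + cx] += 1
        temporal[ct] += 1
    h_s, h_t = shannon_entropy(spatial), shannon_entropy(temporal)
    # Near-ties (within the assumed tolerance) go to the cheaper time split.
    if h_s > h_t + tolerance:
        return "space"
    return "time"

bounds = (0, 4, 0, 4, 0, 16)
# ABCD-like case: points spread over the quad, all at nearly the same time.
abcd = [(i + 0.5, j + 0.5, 0.1) for i in range(4) for j in range(4)]
# DEFG-like case: one spot in space, times spread over the whole interval.
defg = [(0.5, 0.5, k + 0.5) for k in range(16)]
```

Here `choose_split(abcd, bounds)` selects a spatial split and `choose_split(defg, bounds)` selects a temporal split, mirroring the two cases discussed for Fig. 4.5.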
Implementation with MapReduce: First, with the k-d tree, the search for co-occurrences becomes efficient and standard, as the time interval and the spatial quad can efficiently filter out all non-candidate points. The MapReduce implementation can be done in two phases. In the first phase, Maps build partial co-occurrence vectors in sub-trees, and Reduce combines them to make full co-occurrence vectors and compute diversities. In the second phase, Maps compute location entropy for each RP, while Reduce uses the entropy values to compute weighted frequencies.
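The first phase can be pictured with a small in-memory Python sketch. It is not the MapReduce job used in the experiments: the function names are hypothetical, a co-occurrence is assumed to mean two different users checking in at the same location ID within `delta` time units, and pairs that straddle two sub-trees are ignored for simplicity.

```python
from collections import Counter, defaultdict
from itertools import combinations

def map_partial_vectors(checkins, delta):
    """Map step: partial co-occurrence vectors for one sub-tree.

    checkins: list of (user, location, timestamp) stored in that sub-tree.
    Returns {(user_a, user_b): Counter({location: count})}.
    """
    by_location = defaultdict(list)
    for user, loc, ts in checkins:
        by_location[loc].append((ts, user))
    partial = defaultdict(Counter)
    for loc, visits in by_location.items():
        for (t1, u1), (t2, u2) in combinations(sorted(visits), 2):
            if u1 != u2 and abs(t1 - t2) <= delta:
                partial[tuple(sorted((u1, u2)))][loc] += 1
    return partial

def reduce_vectors(partials):
    """Reduce step: add up the partial vectors of every user pair."""
    full = defaultdict(Counter)
    for partial in partials:
        for pair, vector in partial.items():
            full[pair].update(vector)   # Counter.update adds counts
    return full

# Hypothetical check-ins held by two sub-trees.
subtree_a = [("i", "l1", 0), ("j", "l1", 5), ("i", "l2", 100), ("j", "l2", 102)]
subtree_b = [("i", "l1", 1000), ("j", "l1", 1003), ("k", "l1", 5000)]
full = reduce_vectors(map_partial_vectors(s, delta=10) for s in (subtree_a, subtree_b))
```

The resulting vector for the pair (i, j) counts two co-occurrences at `l1` and one at `l2`; diversity and weighted frequency would then be computed from such vectors.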
4.4 Performance Evaluation
4.4.1 Dataset
The data used in the experiments was collected from Gowalla, a location-based social network where users shared their locations through check-ins. The data was collected from February 2009 to October 2010 and consists of two different sets. The first set is spatiotemporal data, which has 6,442,890 check-ins from 196,591 users. Each check-in has the format ⟨user ID, latitude, longitude, timestamp, location ID⟩. The second set is a social graph of friendships among users. It has 950,727 edges (or friendships).
4.4.2 Methodology
The two metrics we use to measure the accuracy of our techniques are precision and recall. Let
TC be the set of true social connections reported by Gowalla’s social network (i.e., ground truth)
and RC be the set of user pairs that our model reported as socially connected. The precision and
recall are defined as:

$$P = \frac{|TC \cap RC|}{|RC|}; \qquad R = \frac{|TC \cap RC|}{|TC|} \qquad (4.25)$$

where $\cap$ denotes the intersection operation.
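As a small worked example of Equation (4.25), the following Python sketch treats each connection as an unordered pair of user IDs. The function name and the sample pairs are hypothetical.

```python
def precision_recall(true_connections, reported_connections):
    """Eq. (4.25) on collections of unordered user pairs."""
    tc = {frozenset(p) for p in true_connections}
    rc = {frozenset(p) for p in reported_connections}
    hits = len(tc & rc)
    precision = hits / len(rc) if rc else 0.0
    recall = hits / len(tc) if tc else 0.0
    return precision, recall

# Four true friendships; the model reports three pairs, two of them correct.
tc = [(1, 2), (2, 3), (3, 4), (4, 5)]
rc = [(2, 1), (3, 4), (7, 8)]
p, r = precision_recall(tc, rc)
```

In this example two of the three reported pairs are true connections, so precision is 2/3, and two of the four true connections are found, so recall is 1/2.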
4.4.3 Order of Diversity
In this set of experiments, we want to examine how the order of diversity $q$ controls the impact of coincidences on diversity and find the optimal value for $q$. Towards this end, we use only diversity as social strength. To be completely unbiased, we do this experiment using only the training data $L_{west}$ and $S_{west}$.
We perform this particular experiment through the following steps. Step 1: Vary the order of diversity $q$ from 0 with a step of 0.1; for each value of $q$, we calculate the diversity $D^q_{ij}$ based on Equation (4.10). Step 2: Since $S_{west}$ only tells us whether two users are friends or not, while our output $D_{ij}$ gives us a numerical value (presumably a strength), we need to make the two comparable. To accomplish this, we define a diversity threshold $D^q$ so that if $D^q_{ij} \geq D^q$, then User i and User j are considered to be friends by our model; otherwise they are not. Therefore, we vary the threshold $D^q$ from 0 with a step of $\max(D_{ij})/1000$, take the user pairs with diversity $D^q_{ij} \geq D^q$ (assuming they are friends), compare them with the real friendship information in $S_{west}$, and calculate precision $P$ and recall $R$. As a result of varying $D^q$, we get the dependence of precision and recall on $q$.
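The threshold sweep in Step 2 can be sketched in Python as below. This is an illustration under assumptions: the function name and the toy scores are hypothetical, a pair is predicted as friends when its score is at or above the threshold, and the same routine would be run once per value of q.

```python
def sweep_thresholds(scores, friends, steps=1000):
    """Return (threshold, precision, recall) for thresholds 0, h, 2h, ...

    scores:  {frozenset({i, j}): diversity (or any strength score)}
    friends: set of frozenset pairs taken from the ground-truth social graph
    """
    step = max(scores.values()) / steps
    curve = []
    for k in range(steps + 1):
        threshold = k * step
        predicted = {pair for pair, s in scores.items() if s >= threshold}
        if not predicted:          # nothing passes the threshold any more
            break
        hits = len(predicted & friends)
        curve.append((threshold, hits / len(predicted), hits / len(friends)))
    return curve

# Toy data: four scored pairs, three of which are real friends.
a, b, c, d = (frozenset(p) for p in [(1, 2), (1, 3), (2, 3), (3, 4)])
scores = {a: 4.0, b: 3.0, c: 2.0, d: 1.0}
curve = sweep_thresholds(scores, friends={a, b, d}, steps=4)
```

Raising the threshold trades recall for precision: at threshold 0 all four pairs are predicted (precision 0.75, recall 1.0), while at threshold 3 only the two highest-scoring pairs remain (precision 1.0, recall 2/3).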
Figs. 4.6(a) and (b) show how $q$ impacts precision. The x-axis shows the order of diversity $q$ and the y-axis shows the precision. To simplify the visualization, we split the graph into two; each shows three curves, and each curve corresponds to one level of recall. In addition, we only show the results for $q$ ranging from 0 to 2.0 to keep a high level of detail, since increasing $q$ beyond 2.0 decreases the precision dramatically.

Figure 4.6: The impact of the order of diversity on precision. Panels (a) and (b).

We made the following observations:

Our major observation is that the curves at all six recall levels show the same behavior: they all peak at $q = 0.1$, which indicates that $q = 0.1$ is the optimal value for limiting the impact of coincidences. We believe this optimal value is a general phenomenon across networks for two reasons: first, all current networks share the same nature of check-ins, as users share their locations with their friends; second, coincidences are general spatial phenomena and happen in all networks without bias toward any particular network. To confirm this phenomenon, we repeated this experiment on a similar dataset from a different network, Brightkite, which consists of 58K users, 214K connections and 4.5M check-ins. The result again showed a peak at $q = 0.1$ with a very small fluctuation (0.004), which can be considered experimental uncertainty.

The case $q = 0$ makes the diversity equal to the number of co-occurrence locations (a.k.a. richness). Fig. 4.6 shows that simply setting the diversity to the number of co-occurrence locations produces low precision. This is because coincidences are completely ignored and all cases of co-occurrences are considered equally important.

When $q$ increases from 0.1 to 2.0, the diversity increasingly favors high local frequencies. Consequently, it favors coincidences, because coincidences often produce high local frequencies. Coincidences are then no longer controlled and have more impact on diversity, which causes false predictions and, consequently, a decrease in precision.

Decreasing $q$ from 0.1 to 0 also degrades precision, because low values of $q$ ($q < 0.1$) limit not only the impact of coincidences (high local frequencies) but also the impact of non-coincidences (medium local frequencies), which results in over-limiting.
We will use the optimal value $q = 0.1$ for the rest of our experiments. Note that diversity mainly deals with precision since its role is to avoid coincidences. Therefore, in Fig. 4.6 we used precision to learn about the order of diversity. The low recall associated with diversity will be compensated by weighted frequency, which is what we are going to examine next.
4.4.4 Social Strength
Our goals in this set of experiments are 1) to compute the social strength by experimentally conducting multiple linear regression over diversity $D_{ij}$ and weighted frequency $F_{ij}$; and 2) to evaluate the social strength by relating it to friendship information from the ground truth $S_{east}$.
Linear Regression: In order to find the social strength in the evaluation dataset $L_{east}$, we first need to use the training set ($L_{west}$ and $S_{west}$) to learn the parameters $\alpha$, $\beta$ and $\gamma$ in Equation (4.15) (see Section 4.2.7). Thus, we need diversity $D_{ij}$, weighted frequency $F_{ij}$ and strength $\hat{s}_{ij}$ (computed based only on the social graph). We already have $D_{ij}$ and $F_{ij}$ computed from Equations (4.10) and (4.13). However, $S_{west}$ is a social graph that only tells us whether two users are friends or not, but not the strength. Fortunately, there exist different techniques to calculate social strength based solely on a social graph. We will use three techniques that have been shown to have high performance [Liben-Nowell and Kleinberg, 2007a]: Jaccard's index, Adamic/Adar similarity and Katz score.

Subsequently, $\hat{s}_{ij}$ can be any of the three measures above. Among the three, Katz score has the best performance, followed by Adamic/Adar similarity and then by Jaccard's index [Liben-Nowell and Kleinberg, 2007a].

The optimal values of $\alpha$, $\beta$ and $\gamma$, obtained by the least-squares method in linear regression [Bishop, 2006], are given as follows:
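
$$\alpha = \frac{\left(\sum F_{ij}^2\right)\left(\sum D_{ij} \cdot \hat{s}_{ij}\right) - \left(\sum D_{ij} \cdot F_{ij}\right)\left(\sum F_{ij} \cdot \hat{s}_{ij}\right)}{\left(\sum D_{ij}^2\right)\left(\sum F_{ij}^2\right) - \left(\sum D_{ij} \cdot F_{ij}\right)^2} \qquad (4.26)$$

$$\beta = \frac{\left(\sum D_{ij}^2\right)\left(\sum F_{ij} \cdot \hat{s}_{ij}\right) - \left(\sum D_{ij} \cdot F_{ij}\right)\left(\sum D_{ij} \cdot \hat{s}_{ij}\right)}{\left(\sum D_{ij}^2\right)\left(\sum F_{ij}^2\right) - \left(\sum D_{ij} \cdot F_{ij}\right)^2} \qquad (4.27)$$

$$\gamma = \bar{\hat{s}}_{ij} - \alpha \cdot \bar{D}_{ij} - \beta \cdot \bar{F}_{ij} \qquad (4.28)$$

where $\bar{\hat{s}}_{ij}$, $\bar{D}_{ij}$ and $\bar{F}_{ij}$ are the corresponding mean values of $\hat{s}_{ij}$, $D_{ij}$ and $F_{ij}$.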
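The closed-form estimates can be checked with a short Python sketch. It assumes that the sums in Equations (4.26) and (4.27) are taken over mean-centered values, which is the standard form of the two-predictor least-squares solution when an intercept is fitted; the function name and the toy data are hypothetical.

```python
def fit_social_strength(D, F, s_hat):
    """Least-squares fit of s_hat ~ alpha * D + beta * F + gamma."""
    n = len(D)
    mean_d, mean_f, mean_s = sum(D) / n, sum(F) / n, sum(s_hat) / n
    # Center every variable so the sums below are sums of deviations.
    d = [x - mean_d for x in D]
    f = [x - mean_f for x in F]
    s = [x - mean_s for x in s_hat]
    s_dd = sum(x * x for x in d)
    s_ff = sum(x * x for x in f)
    s_df = sum(x * y for x, y in zip(d, f))
    s_ds = sum(x * y for x, y in zip(d, s))
    s_fs = sum(x * y for x, y in zip(f, s))
    det = s_dd * s_ff - s_df ** 2                    # zero if D and F are collinear
    alpha = (s_ff * s_ds - s_df * s_fs) / det        # Eq. (4.26)
    beta = (s_dd * s_fs - s_df * s_ds) / det         # Eq. (4.27)
    gamma = mean_s - alpha * mean_d - beta * mean_f  # Eq. (4.28)
    return alpha, beta, gamma

# Toy data generated from known parameters, to confirm they are recovered.
D = [0.0, 1.0, 2.0, 3.0, 4.0]
F = [1.0, 0.0, 2.0, 1.0, 3.0]
s_hat = [0.4 * d + 0.5 * f + 0.1 for d, f in zip(D, F)]
alpha, beta, gamma = fit_social_strength(D, F, s_hat)
```

On this noise-free toy data the fit returns the generating values (0.4, 0.5, 0.1) up to floating-point error. The denominator vanishes when diversity and weighted frequency are perfectly collinear, in which case the two weights cannot be separated.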
Applying each of the three techniques above to the social network $S_{west}$ (i.e., the training data), we compute the social strength $\hat{s}_{ij}$. However, before computing the parameters ($\alpha$, $\beta$, $\gamma$), we first normalize diversity, weighted frequency and $\hat{s}_{ij}$ so that we can use the values of $\alpha$ and $\beta$ to analyze the relative importance of each measure (diversity and weighted frequency) to social strength. The values of ($\alpha$, $\beta$) for the Jaccard, Adamic/Adar and Katz methods are (0.441, 0.550), (0.476, 0.521) and (0.483, 0.520), respectively. As we see, the two measures are comparable in their importance to social strength in all three methods. Weighted frequency gets a slightly higher priority, which reflects the fact that many co-occurrences at uncrowded places have low frequencies. These are general phenomena because people check in more frequently at famous and popular places, as those are more interesting to share with friends, while uncrowded places are less interesting to share, so the check-in frequencies there should be low. It is important and also interesting to note that this nature of check-ins is general to all networks, since the main purpose of users' check-ins is to share their locations with friends, despite the fact that different networks might have different ways of encouraging users to perform check-ins. Therefore, consider two scenarios: first, if a partial social network is available explicitly for a spatiotemporal dataset, then it can be used to compute its own parameters ($\alpha$, $\beta$, $\gamma$). Second, if no explicit social network is available, then the values of ($\alpha$, $\beta$, $\gamma$) can be applied across networks without much sacrifice of precision due to the general phenomena discussed above.

Finally, applying these parameters to the evaluation dataset $L_{east}$, we compute the social strength $s_{ij}$ for new user pairs based on their spatiotemporal data.
Figure 4.7: Percentage of real friendships vs. the social strength of buckets.
Social Strength and Friendship: Our goal now is to show the relationship between our predicted social strength and the friendships. We do this by grouping the user pairs with similar social strength into subgroups called buckets and finding the percentage of real friendships in each bucket. We perform this experiment through the following steps. Step 1: we divide the social strength axis into 100 intervals of equal length 0.01. Step 2: we group the user pairs whose $s_{ij}$ belongs to the same interval into a bucket. Step 3: we take the user pairs in each bucket and check them against the social network (i.e., ground truth $S_{east}$) to find what percentage of pairs in each bucket are truly friends as reported by Gowalla.
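The three steps can be sketched in Python as follows, assuming the social strength has already been normalized to [0, 1]. The function name and the sample values are hypothetical.

```python
from collections import defaultdict

def friendship_rate_by_bucket(strengths, friends, n_buckets=100):
    """Percentage of true friends per bucket of normalized strength.

    strengths: {frozenset({i, j}): normalized social strength in [0, 1]}
    friends:   set of frozenset pairs from the ground truth
    Returns {bucket midpoint: percentage of pairs that are friends}.
    """
    totals, hits = defaultdict(int), defaultdict(int)
    for pair, s in strengths.items():
        # A strength of exactly 1.0 is placed in the last bucket.
        b = min(int(s * n_buckets), n_buckets - 1)
        totals[b] += 1
        hits[b] += pair in friends
    return {(b + 0.5) / n_buckets: 100.0 * hits[b] / totals[b] for b in sorted(totals)}

# Toy data with 10 buckets: two weak pairs and two strong pairs.
a, b, c, d = (frozenset(p) for p in [(1, 2), (1, 3), (2, 3), (3, 4)])
strengths = {a: 0.11, b: 0.15, c: 0.95, d: 1.0}
rates = friendship_rate_by_bucket(strengths, friends={a, c, d}, n_buckets=10)
```

In this toy example the low-strength bucket contains 50% real friends and the high-strength bucket 100%. Empty buckets are simply omitted from the result.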
Fig. 4.7 shows the results as charts. The x-axis shows the middle value of the normalized social strength of each bucket (or interval), while the y-axis shows the percentage of real friendships in each bucket as checked against the ground truth $S_{east}$. The three graphs in Fig. 4.7(a), (b) and (c) correspond to the technique used in the linear regression of social strength in Section 4.4.4: (a) Jaccard's index; (b) Adamic/Adar similarity; (c) Katz score.

Observations: First, as observed in Fig. 4.7, our predicted social strength is consistent with the ground truth: user pairs with higher social strength have a higher percentage of being friends than user pairs with lower social strength. This also fits the intuition that user pairs with high social strength are more involved or interactive with each other and are therefore more likely to be friends. Second, as Katz score is a better metric than the other two [Liben-Nowell and Kleinberg, 2007a], our predicted social strength also shows better performance when Katz score is used in the regression. Fig. 4.7(c) shows a more consistent curve, as the percentage is supposed to increase when the social strength of the buckets increases. As Jaccard's index is the worst metric compared to Adamic/Adar similarity and Katz score, we see more fluctuations in Fig. 4.7(a), where the percentage goes up and down, while Fig. 4.7(b) and Fig. 4.7(c) are smoother, which means they are more consistent with the ground truth.
4.4.5 Goodness of fit
Our goal in this experiment is to evaluate how well our predicted social strength from the spatiotemporal data $L_{east}$ fits the observed strength computed based only on the social graph $S_{east}$. This is known as goodness of fit. This differs from the previous experiments: in Section 4.4.4 we tested our social strength against friendship, not against observed strength. Hence, we use the coefficient of determination $R^2$ to measure the deviation of our predicted social strength $s_{ij}$ from $\hat{s}_{ij}$ computed solely from the ground truth $S_{east}$ using each of the techniques above.
The coefficient of determination measures how well the predicted values fit the observed values. Let $N$ be the number of user pairs, $\bar{\hat{s}} = \sum \hat{s}_{ij}/N$ be the mean of the observed social strengths, $SS_{tot} = \sum (\hat{s}_{ij} - \bar{\hat{s}})^2$ be the total sum of squares and $SS_{err} = \sum (\hat{s}_{ij} - s_{ij})^2$ be the sum of squares of residuals, where the sums run over all user pairs. The coefficient of determination is defined as:

$$R^2 = 1 - \frac{SS_{err}}{SS_{tot}} \qquad (4.29)$$

$R^2$ is a statistic that shows the goodness of fit of the model. It ranges from 0 to 1.0; $R^2$ near 1.0 indicates that the regression results fit the real data well, while $R^2$ near 0 indicates the opposite.
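Equation (4.29) can be written as a few lines of Python; the function name and the sample values are hypothetical and serve only to illustrate the two extremes.

```python
def r_squared(observed, predicted):
    """Coefficient of determination, Eq. (4.29)."""
    mean_obs = sum(observed) / len(observed)
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    ss_err = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return 1.0 - ss_err / ss_tot

observed = [1.0, 2.0, 3.0, 4.0]
perfect = r_squared(observed, observed)            # predictions equal observations
mean_only = r_squared(observed, [2.5] * 4)         # always predicting the mean
```

A perfect prediction gives 1.0 and always predicting the mean gives 0.0. Note that when a model is scored on data it was not fitted to, as with an evaluation set, the same formula can produce values below 0 if the predictions are worse than the mean.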
Table 4.2: Coefficient of Determination

         Jaccard    Adamic/Adar    Katz
$R^2$    0.691      0.830          0.877

Table 4.2 shows the values of $R^2$ for each technique used to compute $\hat{s}_{ij}$ in the evaluation social network $S_{east}$. First, our model predicts the social strength very well. In particular, if we use Katz score and assume it is fully reliable, then 87.7% of the variance in the social strengths in the eastern USA is explained by our model. Second, the Katz and Adamic/Adar methods fit our linear regression better than Jaccard's index. This also agrees with the evaluation in [Liben-Nowell and Kleinberg, 2007a], which reported that Katz score is a better metric for social strength, followed by Adamic/Adar similarity and then by Jaccard's index. Third, the values of $R^2$ are high, which suggests that linear regression is a suitable choice for integrating diversity and weighted frequency.
4.4.6 Precision and Recall
We already showed in Section 4.4.3 that the precision of EBM is very high, even when using only diversity. Here, we would like to evaluate its recall. That is, what percentage of Gowalla's social connections can be predicted by just analyzing the check-in data? Note that this is a very tough challenge since the check-in data collected from Gowalla is very sparse (unlike more active social networks such as Foursquare). For example, we analyzed the Gowalla users' co-occurrence vectors and found that out of 996,621 user pairs who have co-occurred, only 4.3% co-occurred more than three times. 95% of pairs have few co-occurrences, which inevitably limits the opportunity to explore the social connections in this sparse spatiotemporal data. To alleviate this, we could have used other factors that can help us infer social connections, such as common friends, common interests, etc. [Bukowski et al., 1998]. However, since the focus is on inferring social connections only from spatiotemporal data, we challenged EBM by limiting its knowledge to the check-in data only. Instead, to slightly level the field, we removed from our dataset those pairs of users who have zero or one co-occurrence and only included the pairs with more than one co-occurrence. This is a fair adjustment, as it is almost impossible to infer friendship for users with one or fewer co-occurrences (by any method that relies only on the check-in data).
To calculate precision and recall, we need to compare EBM's predictions with the ground truth $S_{east}$. However, EBM gives us numerical strengths, while $S_{east}$ only tells us whether two users are friends or not. To work around this, we use the same technique as in Section 4.4.3 by defining a threshold $s_0$ for social strength and varying it to find precision and recall. Fig. 4.8(a) shows the results of EBM's evaluation. The x-axis shows precision and the y-axis shows recall. The dotted line corresponds to the case when only diversity is used as social strength, while the other three correspond to the cases when both diversity and weighted frequency are integrated to compute social strength, one for each of the three techniques used in Section 4.4.4.
Observation 1: It is interesting to note that all four curves practically coincide when recall is below 35% and start diverging as recall increases. This can be explained as follows: low recall occurs when we set high values for the threshold $s_0$, so only the subset of user pairs with a high number of co-occurrences passes the threshold and is considered. Therefore, handling coincidences is the only essential requirement for such a subset. Diversity satisfies that requirement and is used in all four cases, so they all have the same performance level. However, when we reduce the threshold $s_0$, user pairs with fewer co-occurrences pass the threshold and are considered. At this point, just handling coincidences is not enough, and this is where weighted frequency comes into play to cope with data sparseness, as discussed in Section 4.2.6. Weighted frequency takes into account the location characteristics to predict social connections with a very low number of co-occurrences. Such subtle co-occurrences cannot be captured by diversity alone; therefore its curve remains below the other three, which use weighted frequency in addition to diversity.

Figure 4.8: Precision vs. Recall. (a) EBM; (b) EBM vs. other models.
Observation 2: As shown in Section 4.4.5 and in the related work [Liben-Nowell and Kleinberg, 2007a], Katz score is a better metric for ground truth, followed by Adamic/Adar similarity and Jaccard's index. Fig. 4.8(a) shows that our predicted social strength matches Katz score the best, which strengthens the trustworthiness of our model.

Observation 3: Our model achieves both high precision and high recall. In particular, using the Katz technique we can achieve (precision, recall) as high as (80%, 70%) or (70%, 82%). Moreover, considering the sparseness of the data, where only 4.5% of co-occurred pairs have more than three co-occurrences, reaching such precision and recall is a major achievement.
4.4.7 Comparison of EBM with other models
Using the precision vs. recall graphs, we compare the performance of EBM with four other models that have previously studied the problem: the probability model (PM) [Crandall et al., 2010], the GEOSO model [Pham et al., 2011], the model that utilizes various features (named FT for short) [Cranshaw et al., 2010], and the trajectory model (TR) [Li et al., 2008]. Fig. 4.8(b) shows the results.
Observations: Precision: EBM outperforms all four other models in precision. This is due to EBM's ability to control the impact of coincidences, which is the real challenge for the other models, since coincidences often produce very high local frequencies that can easily mislead any method that infers social connections from spatiotemporal data. PM's precision remains the lowest; it assumes that each user can have at most one friend, which is almost never true. A possible fix would be to remove that assumption, but that would severely interfere with its design and prevent it from formulating the probability of a social connection. PM, GEOSO and TR all completely ignore location characteristics (i.e., location entropy in our work), so there is no easy way for them to handle data sparseness, where even coincidences can have low frequencies, which our work shows to be a difficult and challenging problem. TR also does not clearly address coincidences at high frequencies. Finally, FT considers coincidences; however, it does not compute social strength but only answers whether two people are friends or not. In addition, its tf-idf fails to capture co-occurrences at private places, as shown in Section 4.2.6.
Recall: Since we challenged all five models with a highly sparse dataset, the recall results truly show how capable a model is of mining friendship information. EBM achieves significantly higher recall than the other models because it knows about the locations (public or private, level of crowdedness) and applies that knowledge to make small things (few co-occurrences) become significant. As discussed earlier, PM, GEOSO and TR have no knowledge about locations, and thus cannot capture minor co-occurrences and consequently have low recall. FT considers location characteristics to capture few co-occurrences in uncrowded locations, which explains its higher recall compared to GEOSO, PM and TR, but its recall is still lower than EBM's. The latter is because with EBM, diversity and weighted frequency can compensate for each other to avoid coincidences and, at the same time, handle sparse data.

Efficiency: We also parallelized the algorithms for GEOSO [Pham et al., 2011], the probability model (PM) [Crandall et al., 2010] and the trajectory model (TR) [Li et al., 2008] to make the comparison. The feature model (FT) [Cranshaw et al., 2010] does not propose any data structure, since the model targeted a relatively small proprietary dataset, so we do not examine its efficiency. Using the same setup, we ran the experiments with different numbers of nodes (maps). Table 4.3 shows the number of seconds each model took to run in each setup: 100 nodes, 500 nodes and 1000 nodes. EBM outperforms the other three models. GEOSO and PM use a uniform grid to partition space into equal cells, which results in a high storage cost as the number of cells grows enormously. TR has a high time cost due to its construction of the trajectory (sequence) of user locations. Furthermore, EBM clearly benefits from parallelization, as shown by the rate at which the running time decreases when we add more nodes to the computation.

Table 4.3: Efficiency of EBM and other models

             EBM        GEOSO      PM         TR
100 maps     159.12s    282.36s    203.25s    394.73s
500 maps     34.84s     92.46s     76.92s     129.32s
1000 maps    19.21s     39.22s     29.36s     64.291s

4.4.7.1 Improving EBM with Location Semantics
In this set of experiments, we utilize location semantics to improve the performance of EBM, according to our discussion in Section 4.2.8.

Since the Gowalla and Brightkite datasets do not include the semantics of locations, we used a third dataset, collected from Foursquare, which does include this information. We use this dataset exclusively for the experiments that require location semantics (see Section 4.2.8). The Foursquare dataset contains 0.8M check-ins from 669k users, who form 1.4M friendships. We divided the Foursquare dataset into two parts. The training part contains 0.34M check-ins, 346k users and 0.52M friendships. The evaluation part contains 0.46M check-ins, 328k users and 0.81M friendships. Note that the two sets of users overlap slightly because some users have check-ins in both parts, and a small portion of friendships was lost during the partitioning whenever the two friends of a friendship fell into different parts (training and evaluation).
First, we removed Location Entropy from the EBM model and instead used the location popularity derived from the semantics of the location, according to our discussion in Section 4.2.8.2. In the training part of the data, we grouped locations of the same type together, and for each group we computed the average number of unique users per location (NU), the average number of visits per location (NV), and the Average Location Entropy (H). We then used each of these as location popularity to compute weighted frequency, according to Equations (4.16), (4.17) and (4.18). We used only weighted frequency as social strength to evaluate precision/recall, and compared the results with the case when Location Entropy is used as popularity in weighted frequency, according to Equation (4.13). Figure 4.9(a) shows the results, where WF_NV, WF_NU, WF_ALE and WF_LE correspond to the cases when NV, NU, H and Location Entropy are used as location popularity, respectively.

Figure 4.9: The precision/recall after utilizing location semantics. (a) Weighted frequency with Location Entropy (WF_LE), with average Location Entropy (WF_ALE), with average number of unique users/visitors (WF_NU), and with average number of visits (WF_NV). (b) Weighted frequency (WF), weighted frequency with average duration of stay (WF_ADS), EBM, and EBM with average duration of stay (EBM_ADS).
From Figure 4.9(a), we observe that using the popularity derived from location semantics did not improve the results, and using Location Entropy still proves to be the best option. The reason is that all the options for location popularity, including NV, NU, H and Location Entropy, were derived from the same historical dataset, and Location Entropy captures the finest granularity of location popularity because it measures popularity per location, while the other three were derived per group (per location type) from the training data. Recall that in Section 4.2.5, we argued that Location Entropy is a better metric for popularity compared to the number of unique visitors (NU) or the number of visits to the location (NV). The result in Figure 4.9(a) also confirms our argument, particularly where the average location entropy WF_ALE outperforms WF_NU and WF_NV.
Despite the lower performance compared to Location Entropy, the role of the semantics-based location popularity becomes significant when users' location privacy matters, that is, when we may not have access to users' locations due to privacy issues and may only know the semantics of their locations. In such cases, WF_ALE (which can be derived from any historical check-in dataset) can be a reasonable replacement for Location Entropy without much sacrifice in performance, as we see in Figure 4.9(a). Please refer to Section 4.5 for more discussion of this matter.
In the next experiment, we utilize the average duration of stay, which is derived from location semantics, to improve EBM, according to our discussion in Section 4.2.8.2. We use the given semantics of each location in the dataset to obtain the average duration of stay (ADS) as follows. First, we crawled data from EventBrite [Eventbrite, 2015], an online service that publishes events of different categories (Education, Sports, Food and Drink, Business, etc.). Each category contains different types of events, such as class, conference, attraction, performance, gala, etc. Each of these event types corresponds to a location type in our dataset. The description of each published event contains the start and end time, which we used to compute the duration of stay. Note that the outline of each event on EventBrite only contains the start time, but the detailed description of each event normally contains both the start time and the end time. For each event type of each category, we extracted all the events and their durations and then computed the average duration. With EventBrite, we were able to assign the average duration of stay (ADS) to 74% of the location types in our dataset (our dataset contains 394 different types of locations). For the remaining types, we independently asked 8 individuals, who are currently in or have finished a graduate program, to estimate the duration of stay for each type of location, and we computed the average duration of stay based on their answers. The average standard deviation among their answers is 6.298 minutes.
Next, we incorporated the estimated average duration of stay into weighted frequency, according to Equation (4.23), where the parameter $\tau$ was estimated to be 89 hours with the training dataset. Using the average duration of stay (ADS), we computed the precision/recall when only the weighted frequency is used as social strength, and when EBM (the combination of diversity and weighted frequency) is used as social strength. Figure 4.9(b) shows the results. In Figure 4.9(b), WF and EBM correspond to the case when the location semantics is used neither in weighted frequency (denoted as WF in the figure, according to Equation 4.13) nor in the EBM model (EBM), while WF_ADS and EBM_ADS correspond to the case when the location semantics (the estimated average duration of stay, ADS) is used both in weighted frequency (WF_ADS, according to Equation 4.23) and in the EBM model (EBM_ADS).
From Figure 4.9(b), we observe that using location semantics does, indeed, improve the
performance. For example, at the precision of 70%, the recall of both WF ADS and EBM ADS
improves by approximately 6%, which is a considerable improvement. This improvement mainly
comes from capturing the social connections of user pairs who have few co-occurrences
and/or who co-occurred at only one location. Without considering their length of co-occurrence
based on location semantics, their social connections would not be captured despite the help of
Location Entropy. Location Entropy generally helps in identifying friendships in the case where
the co-occurrences took place in uncrowded locations. However, in the case where the places are
crowded, and friends had only a few, but long-lasting, co-occurrences (e.g., students in an
art class), Location Entropy alone cannot capture such friendships due to the high value of LE
(see Equation 4.11), and neither can diversity due to the low number of unique locations in the
co-occurrences. Utilizing the location semantics proves to be a valuable solution for such corner
cases because it takes the duration of the co-occurrences between friends into account.
4.5 Discussion
It is important to highlight several limitations of our work based on the experimental observations.
First and foremost, even though EBM can achieve higher performance compared to the state-of-
the-art methods, inevitably there are still friendships that were not identified or whose strengths
were misinterpreted by our model, for several reasons. For example, people tend to perform
check-ins at popular and crowded places more often than at less well-known or less crowded ones,
which reduces the effectiveness of weighted frequency. If the number of co-occurrences is low in
such cases, our model will underestimate the corresponding social strength. Many users also
hesitate to share their locations due to low activeness or privacy concerns, which results in extreme
data sparseness and subsequently in underestimation of social strength. On the other hand, the
uncertainty in location due to the inaccuracy of GPS or other positioning devices may
cause two people who are at two different but nearby places to appear to check in at the same
location or, vice versa, two people at the same place to check in at different locations. The former
causes false co-occurrences while the latter causes real co-occurrences to be missed, which result
in overestimating and underestimating social strength, respectively. The uncertainty of temporal
co-occurrences can also cause misinterpretation of social strength. People leaving a place may be
followed immediately by other people coming to that place, which causes false co-occurrences
from the temporal point of view. In another scenario, two friends may check in at the same place
at quite different moments (the interval exceeds the threshold) during their lengthy stay, which
results in missing their co-occurrences. These two scenarios also result in overestimation and
underestimation of social strength, respectively.
Regarding the data, each dataset can have its own biases due to the method of data
collection. The time period of collecting data may be too short for some user pairs to obtain all
of their co-occurrences, but at the same time may be too long for others if their friendships have
already become stale. The data from Location-Based Social Networks have their own pros and cons.
The pros include the large numbers of users, check-ins and locations, while the cons include the
differences in activeness in sharing locations among users due to differences in age, skill in using
mobile applications and biases towards different locations. On the other hand, a dataset collected
by a wireless network at a university campus [Bilogrevic et al., 2013] can be small, but provides
a more homogeneous body of users (students), who have more similar levels of activeness in
location sharing and less bias towards different locations. Such a smaller dataset may also
contain other useful information for inferring social strength, such as the overlapped length of stay of
two students at the same place or their continuous trajectories, which have the potential to improve
the prediction of social strength.
4.6 Summary of Chapter
In this chapter, we discussed the EBM model for inferring social connections from spatiotemporal
data. We presented the EBM model to address some of the subtle questions
about social connections, including how to infer the social strength of two people and how to avoid
coincidences, which is challenging because coincidences are frequent. EBM also alleviates
the problem of data sparseness by incorporating the location characteristics into the model when
estimating the strength of social connections. Finally, the proposed algorithm is efficient and
parallelizable with the Map-Reduce framework. The experiments confirmed the high accuracy and
efficiency of the EBM model and its superiority over competitors.
Chapter 5: Spatial Influence - Measuring Followship in the Real World
In this chapter, we discuss how to quantify spatial influence by measuring followship (see Section
2.1.2). In particular, we propose the Temporal and Locational Followship Model (TLFM) to
estimate spatial influence by taking into account the contributions of all the impacting factors.
Specifically, the temporal followship estimates influence as the urge, or how soon a person wants
to visit the location following the initial visit of her influencer (aka the time delay), while the
locational followship discounts the popularity of the location from that measurement. In addition,
we utilize Shannon entropy to eliminate the contribution of coincidences. We report extensive
experiments on real-world datasets collected from different Location-Based Social Networks,
which demonstrate the effectiveness of the model in quantifying spatial influence. We used the influence
computed from the corresponding social networks as ground truth to evaluate the spatial influence
computed by TLFM, and we observed that about 70% of our inferred influence closely matches the
ground truth. We also quantified the impact of each factor (temporal, spatial and coincidences)
in computing spatial influence, concluding that the combination of all three factors results
in the best performance. Finally, we developed several baseline approaches by adapting various
techniques proposed to compute influence in social media, and our experimental comparisons
confirm that none of them can effectively capture spatial influence.
Note that even though this work is motivated by users’ real-world location data, the proposed
solution is not limited to spatial influence; in fact, it can also be applied to social media to improve
the inference of online influence due to the additional consideration of the popularity of online
actions, which has not been considered in previous work.
5.1 The TLFM Model
In this section, we discuss the derivation of the Temporal & Locational Followship Model (TLFM).
5.1.1 Data Representation
Definition 5. Doublet: A followship-doublet (or doublet for short) is the finest element of
followship, which consists of a pair of visits or check-ins, denoted as $c_1$ and $c_2$, at the same
location l by two different users.
Clarifications: First, a doublet is the finest element of followship, meaning it does not contain
other doublets (or check-ins) by these two users. For example, in Fig. 2.2, at Location 1, two
check-ins at $t_1$ and $t_2$ form a doublet, but two check-ins at $t_1$ and $t_4$ do not form a doublet because
it is not the finest element - there are other check-ins by the users between $t_1$ and $t_4$ in that same
location. Second, a doublet is defined with regard to an influential relationship - who follows
whom. Therefore the order of the check-ins in a doublet depends on the direction of followship:
the check-in by the influencer precedes the check-in by the influencee. For example, from Location
1 in Fig. 2.2, if v follows u, then the doublets are $(t_1, t_2)$, $(t_3, t_4)$ and $(t_5, t_6)$. However, if u follows
v, then the doublets are $(t_2, t_3)$, $(t_4, t_5)$ and $(t_6, t_5)$.
Definition 6. Doublet’s span: The span of a doublet is the time interval between the two
check-ins in the doublet.
Definition 7. Co-occurrence: A co-occurrence between two users is a doublet whose span is 0.
The span of a co-occurrence is 0, theoretically, but we can accept a small value [Pham et al.,
2013], such as less than 15 minutes due to the fact that users may perform check-ins separately in
time even though they are together at a place.
Definition 8. Multiplet: A followship-multiplet (or a multiplet for short) at a location is the set of
doublets by the two users at that location.
Similar to a doublet, a multiplet is defined with regard to an influential relationship between
two users at a specific location. For example, in Figure 2.2, a multiplet at Location 1 for the
v-follows-u relationship consists of doublets that correspond to the pairs of check-ins at $(t_1, t_2)$,
$(t_3, t_4)$ and $(t_5, t_6)$.
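One possible way to extract the doublets of a multiplet from the check-in times of two users at a single location is sketched below. It is an illustration rather than the implementation used in this work: the function names, the numeric time stamps and the tie-breaking rule for equal time stamps are assumptions. A pair of check-ins is treated as a doublet when the influencer's check-in is immediately followed by the influencee's check-in, with no other check-in by either user in between.

```python
def doublets(influencer_times, influencee_times):
    """Return (t_influencer, t_influencee) pairs that are adjacent in time."""
    # Tag each check-in with its role: 0 = influencer, 1 = influencee.
    # On equal time stamps the influencer sorts first, giving a span-0 doublet.
    merged = sorted([(t, 0) for t in influencer_times] +
                    [(t, 1) for t in influencee_times])
    pairs = []
    for (t_a, role_a), (t_b, role_b) in zip(merged, merged[1:]):
        if role_a == 0 and role_b == 1:
            pairs.append((t_a, t_b))
    return pairs


def spans(pairs):
    """Span of each doublet: time between its two check-ins."""
    return [t_b - t_a for t_a, t_b in pairs]


def is_cooccurrence(span, tolerance=0.25):
    """Span of (almost) zero; 0.25 hours mirrors the 15-minute allowance."""
    return span <= tolerance


# Hypothetical alternating check-ins (in hours): u at t1, t3, t5 and v at t2, t4, t6.
u_times = [1, 3, 5]
v_times = [2, 4, 6]
v_follows_u = doublets(u_times, v_times)
u_follows_v = doublets(v_times, u_times)
```

With six alternating check-ins, the v-follows-u multiplet contains the three pairs listed above, while the reverse direction yields only the pairs in which a check-in by v is immediately followed by one by u.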
We organize our discussion as follows: first we discuss the contribution of a doublet’s span in
Section 5.1.2, which we call temporal dependency of followship. We then discuss the contribution
of a doublet’s location in Section 5.1.3, which we call locational dependency of followship. Finally,
in Section 5.1.4 we discuss the issue related to influence causality and coincidences.
5.1.2 Temporal Dependency of Followship
In this section, we study the relation of the doublet’s span with influence, or how the temporal
factor impacts influence (hence the name temporal dependency). We assume that the causality
of influence has already been established; in other words, we assume that the doublet happened
due to influence, not due to coincidence, and therefore in this section and Section 5.1.3 we
only focus on measuring followship. We address the issue of causality and coincidences separately
in Section 5.1.4.
Consider a doublet in which two users u and v visited location l at times $t_1$ and $t_2$, respectively.
Without loss of generality, assume that u influences v in this doublet and thus $t_2 \geq t_1$. The span of
the doublet is $t_2 - t_1 \geq 0$.
When u influences v, and u visits location l, it is natural to expect that v would like to visit
location l under the influence of u [Goyal et al., 2010][Chen et al., 2013]. The desire of v to visit l
is nothing but the influence that u exerts on v. The stronger the desire, the sooner v will visit l, and
thus the more influence u exerts on v. At best, v will try to visit l together with u, which creates
a co-occurrence in both space and time. On the other hand, the later v visits l, the less influence
u has on v. Consequently, the influence that u exerts on v weakens over time starting from the
moment $t_1$, or in other words, the span of the doublet inversely indicates the influence of u on v.
This observation is in accordance with early studies [Goyal et al., 2010][Gomez Rodriguez et al.,
2010][Gomez Rodriguez et al., 2013], which empirically proposed that influence decays over time
according to a law; our goal is to derive this law theoretically and study the parameters that
govern it. Consider the geo-spatial channel through which u influences v shown in Fig. 5.1, which
illustrates the influence over a time period after u visited location l, meaning after $t_1$. Let $t \geq t_1$ be any
moment after u visited location l, and p(t) be the influence of u on v at time t as if v visits location
l at t; note that t should not be confused with the span $t - t_1$. Furthermore, let dt > 0 be a
temporal differential that represents an infinitely small change of variable t. At time t + dt, the
influence of u on v is p(t + dt). Correspondingly, the change of the influence over the interval
dt is dp = p(t + dt) - p(t). As mentioned earlier, the influence decreases over time, therefore
dp = p(t + dt) - p(t) < 0 shows the loss of influence over the infinitely small time interval dt.

Figure 5.1: Influence that u exerts on v at a location over time.

The amount of influence expected to be lost during the infinitely short time interval dt (from
t to t + dt) depends on the current value of influence p(t) at t and is proportional to p(t). We
assume this behavior because it is also observed in other phenomena, such as in radioactive decay
[Krane, 1987] or in the Maxwell-Boltzmann distribution statistics [Sivukhin, 1990] (note that we
will verify this experimentally). Formally:
$$\frac{p(t + dt) - p(t)}{dt} \propto p(t) \qquad (5.1)$$

or we can write this in the form of an equation as follows:

$$\frac{dp}{dt} = \frac{p(t + dt) - p(t)}{dt} = -\frac{p(t)}{\tau}$$

where the minus sign indicates that p(t) decreases over time, and $\tau$ has the unit of time and is the
proportionality constant, whose meaning we will clarify shortly. Rewriting the above equation in a
more compact form, we have:

$$\frac{dp}{dt} = -\frac{p}{\tau} \qquad (5.2)$$
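Rewriting the equation and integrating both sides, we have:

$$\frac{dp}{p} = -\frac{dt}{\tau} \iff \int \frac{dp}{p} = -\int \frac{dt}{\tau} \iff \ln(p) = -\frac{t}{\tau} + C \iff e^{\ln(p)} = e^{-\frac{t}{\tau} + C} = e^{C} e^{-\frac{t}{\tau}} = p_0 e^{-\frac{t}{\tau}}$$

where C is an integration constant. After simplifying the above equation and de-compacting it (writing
p(t) instead of just p), we have:

$$p(t) = p_0 e^{-t/\tau} \qquad (5.3)$$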
Equation (5.3) shows that the influence of u on v with respect to location l decays exponentially over
time (p(t) = 0 when t < 0). Thus, our theoretical derivation agrees with the empirical proposals for
the exponential decay of influence over time in early studies [Goyal et al., 2010][Gomez Rodriguez
et al., 2010][Gomez Rodriguez et al., 2013]. $p_0$ comes out as an integration constant. Recall that the
doublet span is $t - t_1$; if we set $t_1 = 0$ (i.e., the initial visit of u to location l is at the origin of
the time axis), then the span equals t, thus t becomes the span of the doublet if v visits the location at time t.
Therefore, we omit a separate symbol for the span and simply use t to denote it, because we can always
assume $t_1 = 0$ without loss of generality.
Furthermore, by setting t = 0, we have $p(0) = p_0$. Therefore, $p_0$ has the meaning of the initial
influence that u would exert on v before it starts to decay over time. In other words, $p_0$ is the
influence of u on v at location l if they co-occur at l. To clarify the meaning of the constant $\tau$, we set
$p(t) = p_0/2$, that is, we consider the time at which the influence has decreased by half. Solving the
equation $p_0/2 = p_0 e^{-t/\tau}$, we get $t = \tau \ln 2$. Consequently, $\tau \ln 2$ has the meaning of a half-life:
the time interval over which the influence decreases by half. We denote this quantity as $h = \tau \ln 2$.
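The closed form in Equation (5.3) and the half-life relation can be checked numerically. The sketch below integrates Equation (5.2) with a simple forward Euler scheme and compares the result with the exponential solution; the value of `tau`, the step count and the function names are arbitrary choices made for illustration.

```python
import math


def influence(t, p0=1.0, tau=1.0):
    """Closed-form solution p(t) = p0 * exp(-t / tau) of Equation (5.2)."""
    return p0 * math.exp(-t / tau)


def euler_decay(t_end, p0=1.0, tau=1.0, steps=100_000):
    """Integrate dp/dt = -p / tau from 0 to t_end with forward Euler steps."""
    dt = t_end / steps
    p = p0
    for _ in range(steps):
        p += -p / tau * dt
    return p


tau = 10.0                     # arbitrary illustrative time constant
half_life = tau * math.log(2)  # h = tau * ln 2
numeric = euler_decay(half_life, tau=tau)
closed = influence(half_life, tau=tau)
```

Both values are close to one half of the initial influence, as the definition of the half-life requires.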
It is important to elaborate on the constant $p_0$, the initial influence in Equation (5.3).
Specifically, what does $p_0$ depend on? The attributes related to a doublet are the span t and the location l.
As we see, the temporal factor t has already been considered and included in the exponential part
$e^{-t/\tau}$ of Equation (5.3). Consequently, location l is the only remaining factor that affects $p_0$.
It may be tempting to think that everyone has a different influential capability and thus $p_0$ should
also depend on the user. However, recall that we do not assume that each user has her own initial
level of influence at the beginning, but rather we infer this information entirely from followship.
Therefore, $p_0$ should only depend on the location l. Consequently, we separate Equation (5.3) into
two parts: the temporal dependency $e^{-t/\tau}$ and the locational dependency $p_0$. To better structure
the locational dependency, we rewrite $p_0$ as p(l), a function that depends only on the location l.
Equation (5.3) obtains a new and more informative form:

$$p(t, l) = p(l)\, e^{-t/\tau} \qquad (5.4)$$
Function p(l) only depends on location l and it is considered a constant during the course of the
derivation of Equation (5.3). Therefore the derivation remains valid.
Note: The goal of the derivation above is two-fold. First, it is to derive the impact of the time
delay on influence and identify the governing parameters and their meanings ($\tau$ and h). Second,
the derivation also shows how to combine the impacts of the time delay and the location (Equation
(5.4)). We further elaborate on p(l) and the meaning of the combination in Section 5.1.3.
To verify the validity of the exponential temporal decay of spatial influence, we conducted
an experiment using a real dataset from Foursquare. This dataset has two parts: one contains
the social information between users (who is a friend of whom), and the other contains the
spatiotemporal data. The spatiotemporal data is a set of check-ins, where each check-in is a tuple
<user-id, location-id, latitude, longitude, time> which indicates, from left to right, the ID of the user,
the ID of the location (auto-generated by Foursquare), the latitude and longitude of the location,
and the time of the check-in, respectively. For the purpose of this experiment, we randomly chose
200,000 pairs of friends out of roughly 1.4M friendships, for whom we computed the spans of
all their doublets. In Fig. 5.2, we show the temporal dependency in this real dataset,
that is, how the number of doublets of friends decreases as the span grows. The x-axis shows the
doublet’s span in hours and the y-axis shows the number of doublets that corresponds to the span
on the x-axis. The left-most cross, for example, indicates that there are about 35,000 doublets
with spans between 0 and 10 hours, and the second left-most cross indicates there are about 32,000
doublets with spans between 10 (exclusive) and 20 (inclusive) hours.
Figure 5.2: Exponential Decay - Foursquare Data

From Fig. 5.2 we observe that the number of doublets between friends drops exponentially as
their span grows. This in turn tells us that the tendency of people to repeat their friends’ actions,
or their followship, decreases exponentially starting from the moment when their friends first
initiated the actions. This implies that spatial influence decays exponentially over time, and hence
verifies the validity of our theoretical derivation of temporal dependency in Equations 5.3 and
5.4. From Fig. 5.2 we also estimated the half-life $h = \tau \ln 2 = 130$ hours, or roughly five and
a half days, which is the time when the influence decreases by half, and therefore parameter
$\tau = 130 / \ln 2 \approx 188$ hours. From this experiment, we also observed that the influence drops to
a negligible level when t > 900 hours, or roughly 38 days (not shown in Fig. 5.2 as it is out of range
of the x-axis). We call this value T = 900 hours the span threshold, which can therefore be
used as the threshold or limit when searching for doublets in spatiotemporal data; check-ins that are
more than T apart have a negligible relation in terms of influence. We evaluate h and T
with more extensive experiments across different datasets in Section 5.3.2.1.
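To illustrate how a half-life could be read off a histogram such as the one in Fig. 5.2, the sketch below fits a straight line to the logarithm of the bin counts. The fitting procedure actually used for Fig. 5.2 is not stated here, so the log-linear least-squares fit is an assumption, and the counts are synthetic values generated from an exact exponential with a 130-hour half-life rather than the Foursquare measurements.

```python
import math


def fit_exponential(bin_centers, counts):
    """Least-squares line through (t, ln count); returns (tau, half_life)."""
    ys = [math.log(c) for c in counts]
    n = len(bin_centers)
    mean_t = sum(bin_centers) / n
    mean_y = sum(ys) / n
    sxx = sum((t - mean_t) ** 2 for t in bin_centers)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(bin_centers, ys))
    slope = sxy / sxx            # equals -1 / tau for an exponential decay
    tau = -1.0 / slope
    return tau, tau * math.log(2)


# Synthetic histogram: 10-hour bins, counts following a 130-hour half-life.
true_tau = 130.0 / math.log(2)
centers = [5.0 + 10.0 * k for k in range(30)]
counts = [35000.0 * math.exp(-t / true_tau) for t in centers]

tau_hat, h_hat = fit_exponential(centers, counts)
remaining_at_T = math.exp(-900.0 / tau_hat)  # fraction of influence left at T = 900 h
```

With these values the fitted time constant rounds to 188 hours, and less than 1% of the initial influence remains at the span threshold of 900 hours, which is consistent with treating longer spans as negligible.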
Note that other possible choices for the decay of p(t) include a linear function, a power
law or a Rayleigh distribution. However, we will show later in Section 5.3.2.1 that experiments
for spatial influence across different datasets of different LBSNs show a consistent exponential
behavior; thus, we eliminate these options from further consideration in our study
of spatial influence. In addition, as we discussed in Section 2.3.2, several previous studies of
influence in social media provided extensive analysis on discovering the network of information
diffusion based on the time delay. Essentially, the solutions of these studies can be used as the
metric for the temporal dependency in spatial influence, and some of them share the same formula
with our derivation in Equation (5.3). Therefore, the question of which solution to use
depends on the specific application, input data and efficiency requirements.
Next, we consider the temporal dependency in a multiplet. Without loss of generality, assume
the influential direction is again from u to v, meaning u influences v. Let the multiplet m consist of
multiple doublets $\{d_1, d_2, \ldots, d_n\}$ of two users u and v at the same location l; since we assume
that u influences v, in each of these doublets the check-in by u preceded the check-in by v.
The overall influence of u on v at l consists of the influence scored from the component
doublets $d_i$ of the multiplet m.

$$p(m) = p(l) \sum_{d_i} e^{-t_{d_i}/\tau} \qquad (5.5)$$

where $t_{d_i}$ is the span of doublet $d_i$, and we formally added m as an argument because $d_i$ and l
are defined in m. Note that because all the doublets are related to the same location l, they share
the same factor p(l). Equation (5.5) shows the depth of the influence of u on v at location l.
Note that we follow the implicit assumption in the previous studies [Goyal et al.,
2010][Gomez Rodriguez et al., 2010][Gomez Rodriguez et al., 2013] by assuming that the latest
check-in/action by u, for example at $t_3$ in Fig. 2.2, immediately preceding the check-in by v at $t_4$,
is the most significant and the only significant action by u in influencing v to visit the location at $t_4$.
Thus, we do not consider the impact of the earlier check-ins by u prior to $t_1$ in causing v to visit the
location at $t_4$. This assumption can, in fact, be relaxed easily by including the impact of the check-ins
by u prior to $t_1$ in Eq. (5.5).
Lastly, to determine the direction of influence in a multiplet (who influences whom), we
propose a basic rule as follows. We identify who is the first to initiate the action at location l.
Looking at their check-ins at the location of the multiplet, the person who performed the first
check-in is the influencer and the other is the influencee. If both of them initiated this first action
together (a co-occurrence), then they are considered friends [Pham et al., 2013], and thus this is a
bi-directional influential relationship where friends influence each other [Pham et al., 2013][Kempe
et al., 2003][Domingos and Richardson, 2001]; consequently, we have two multiplets, one for
each influential direction, from u to v and from v to u. Note that we only mention this most basic
rule. A more thorough consideration is also possible, for example, by considering who is the first
to visit the location after neither of the two users has visited the location for a period greater than
the span threshold T.
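The basic direction rule can be sketched as a small function. The labels "u" and "v", the use of hours as the time unit and the 0.25-hour tolerance (mirroring the 15-minute allowance for co-occurrences) are assumptions made for this illustration.

```python
def influence_directions(u_times, v_times, tolerance=0.25):
    """Return the (influencer, influencee) directions for one location."""
    first_u, first_v = min(u_times), min(v_times)
    if abs(first_u - first_v) <= tolerance:
        # The first visits co-occur: treat the relationship as bi-directional.
        return [("u", "v"), ("v", "u")]
    # Otherwise the user with the earliest check-in is the influencer.
    return [("u", "v")] if first_u < first_v else [("v", "u")]


one_way = influence_directions([1.0, 3.0], [2.0, 4.0])
both_ways = influence_directions([2.0], [2.1])
reverse = influence_directions([5.0], [1.0])
```

Each returned direction would then receive its own multiplet, as described above.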
5.1.3 Locational Dependency of Followship
As we mentioned earlier, in Equation (5.4) p(l) is the initial influence at location l that u would
exert on v if they co-occur, that is, before the influence starts to decay exponentially over time. We
also argued that p(l) only depends on location l and thus, in general, it shows the effect of each
individual location in a doublet on social influence in the real world. In this section we elaborate
on the function p(l) - the impact of a location.
It is natural to expect people to visit more famous and popular places because those are
frequently mentioned in online media, in advertisements or by word of mouth. Therefore, it is easier
to convince or to influence a person to visit a popular place, such as Times Square, because the
place is already well-known and thus her motivation may have come from many different sources
of information. On the other hand, it is harder to convince someone to visit a less popular place,
such as a mediocre restaurant which has a low ranking and is little or poorly known among people.
Consequently, it requires a higher level of trust or influence from a person to convince
other people to visit unpopular places, and if the person succeeds in doing so, then she
deserves a higher score in the quantification of her influence. This score is the function p(l). The
less popular the location l, the higher the rewarding score. Subsequently, we will build function
p(l) based on the popularity of a location in order to capture this intuition.
To study the popularity of locations, we use Location Entropy, which we discussed earlier
in Section 4.2.5. At location l, let $V_{l,u} = \{\langle u, l, t \rangle : \forall t\}$ be the set of check-ins by u and
$V_l = \{\langle u, l, t \rangle : \forall t, \forall u\}$ be the set of check-ins by all users. The probability that a check-in, which
is selected randomly from $V_l$, was performed by user u is $P_{u,l} = |V_{l,u}| / |V_l|$. The Shannon entropy
of location l based on this probability is given as follows:

$$H_l = -\sum_{u,\, P_{u,l} \neq 0} P_{u,l} \ln P_{u,l} \qquad (5.6)$$

This entropy measures how popular a location is in terms of the people who visited it. The
advantage of using entropy is that it measures the location popularity based on the distribution of
check-ins over the users who performed them. To understand this, consider a simplified example with
two locations $l_1$ and $l_2$ and two users u and v shown in Table 5.1.
Table 5.1: Example of Location Entropy

                 #u’s visits   #v’s visits   H_l
Location l_1     1000          1             0.35
Location l_2     500           500           0.69
As shown in Table 5.1, location $l_1$ was visited 1000 times by u, but only once by v, while
location $l_2$ was visited equally by u and v, 500 times each. We observe that both locations have the same
number of visitors (which is two), as well as roughly the same number of visits (1000 each).
However, $l_1$ was mostly visited by only one person (u); thus, it is less popular compared to $l_2$,
which was visited frequently by two people. Consequently, neither the number of visitors nor the
number of visits can show the popularity of a location. However, Location Entropy $H_l$ correctly
describes this situation by considering the distribution of check-ins at a location across its visitors;
the more uniform the distribution, the higher the entropy, and the more popular a place. Location
$l_2$ has a higher entropy (0.69) compared to $l_1$, thus $l_2$ is more popular. Note that in Table 5.1, we
assume there are only two users for simplicity, and thus, it is easy to judge the popularity of each
location by simply looking at the table. However, in reality, locations often have many visitors,
and therefore, the use of Location Entropy is significant. It is also worth noting that Location
Entropy is not the same as tf-idf [Blei et al., 2003a], which would indicate the significance of a
visitor to a location, but would not show the popularity of a location.
Note that in social media, measuring the popularity of an online event is generally not
challenging because an online user normally performs each online action only once, such as to
join an online group, to start following a person on Twitter, to like a Facebook fan page, etc.
Thus, popularity can be simply measured by the number of members of a group or the number of
followers of a person.
As mentioned earlier in Section 4.2.5, an important property of Location Entropy is $\exp(H_l)$,
which is called the effective number of visitors at location l. This is not the actual number of users
who visited l, but rather an equivalent number of users who visited l as if they all visited l
an equal number of times. For example, the effective number of visitors is exp(0.35) = 1.4
for location $l_1$, and exp(0.69) = 2 for location $l_2$. Note that the effective number does not need to
be an integer; it is an index that shows how popular a location is.
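Equation (5.6) and the effective number of visitors can be computed directly from per-user visit counts, as in the following sketch (the function names are chosen for illustration). Note that when the sketch is evaluated on the raw counts 1000 and 1, it gives an entropy of roughly 0.008 for $l_1$, which is smaller than the 0.35 listed in Table 5.1; the ordering of the two locations, and hence the conclusion that $l_1$ is the less popular one, is the same either way.

```python
import math


def location_entropy(visit_counts):
    """Shannon entropy H_l of the per-user check-in distribution at a location."""
    total = sum(visit_counts)
    probs = [c / total for c in visit_counts if c > 0]  # skip P_{u,l} = 0
    return -sum(p * math.log(p) for p in probs)


def effective_visitors(visit_counts):
    """exp(H_l): equivalent number of visitors with equal visit counts."""
    return math.exp(location_entropy(visit_counts))


h_l1 = location_entropy([1000, 1])   # almost all visits by one user
h_l2 = location_entropy([500, 500])  # visits split evenly, H = ln 2
n_l2 = effective_visitors([500, 500])
```

For the evenly split location the entropy is ln 2 (about 0.69) and the effective number of visitors is 2, matching the values quoted for $l_2$.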
With the support of Location Entropy, let us go back and discuss function p(l) - the locational
dependency part of Equation (5.4). As we mentioned earlier in this section, p(l) is meant to capture the
initial influence that u would exert on v before this value starts to decay over time. This initial
influence should be inversely proportional to the popularity of location l, for which we adopt
the inverse of the effective number of visitors at l, that is, $p(l) = \exp(-H_l)$. We rewrite
Equation (5.4) as follows:
$$p(l, t) = \exp(-H_l)\, \exp(-t/\tau) = \exp\left( \sum_{u,\, P_{u,l} \neq 0} P_{u,l} \ln P_{u,l} \right) \exp\left( -\frac{t}{\tau} \right) \qquad (5.7)$$
The first exponential expression shows the locational dependency and the second shows the
temporal dependency of followship. Equation (5.5) also obtains a similar form due to the same
factor p(l) that is common for all doublets at location l.
$$p(m) = \exp\left( \sum_{u,\, P_{u,l} \neq 0} P_{u,l} \ln P_{u,l} \right) \sum_{d_i} \exp\left( -\frac{t_{d_i}}{\tau} \right) \qquad (5.8)$$
When u influences v in multiple locations $\{l_1, l_2, \ldots, l_s\}$, and produces in each location $l_k$ a
corresponding multiplet $m_{l_k}$, the overall influence value is:

$$p_{u \to v} = \sum_{l_k} p(m_{l_k}) \qquad (5.9)$$
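Equations (5.7) to (5.9) can be combined into a short computation, sketched below under stated assumptions: each multiplet is represented by a hypothetical pair of (per-user visit counts at the location, list of doublet spans in hours), the function names are illustrative, and $\tau$ is set to the 188 hours estimated in Section 5.1.2.

```python
import math


def locational_weight(visit_counts):
    """p(l) = exp(-H_l), the inverse of the effective number of visitors."""
    total = sum(visit_counts)
    entropy = -sum((c / total) * math.log(c / total)
                   for c in visit_counts if c > 0)
    return math.exp(-entropy)


def multiplet_score(visit_counts, spans, tau):
    """Equation (5.8): p(l) times the sum of exp(-t_d / tau) over the doublets."""
    return locational_weight(visit_counts) * sum(math.exp(-t / tau) for t in spans)


def influence(multiplets, tau):
    """Equation (5.9): sum of the multiplet scores over all shared locations."""
    return sum(multiplet_score(counts, spans, tau) for counts, spans in multiplets)


tau = 188.0                    # hours
half_life = tau * math.log(2)
multiplets = [
    ([500, 500], [0.0, half_life]),  # p(l) = 1/2, temporal sum = 1 + 1/2
    ([1, 1, 1, 1], [0.0]),           # p(l) = 1/4, temporal sum = 1
]
score = influence(multiplets, tau)
```

In this toy input the first location contributes 0.75 and the second 0.25, so the overall score is 1.0; a co-occurrence at the more evenly visited (more popular) location is rewarded less per doublet than one at a location with the same number of doublets but fewer effective visitors would be.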
5.1.3.1 Credit distribution via locational dependency
An important issue concerning the influencer-influencee relationship is the distribution of credit
among multiple influencers when they all share the credit of influencing the same person to visit
a location. In particular, when user v visited a location l as a result of being influenced, we may
expect that v was influenced by many people and thus the credit for convincing v to visit l must be
distributed appropriately among the influencers. Hence the question: how do we distribute the
influence credit among the influencers?
One way to do this is to search for all people (as influencers) who had visited location l prior
to v’s visit or check-in (denoted as $c_v$), and thus obtain a set of earlier check-ins that can form
doublets with $c_v$. Subsequently, we can distribute the total credit for influencing v to perform
the particular check-in $c_v$ to each of the influencers proportionally to the temporal dependency
of their doublet according to Equation 5.3. There are two problems with this approach. First, a
single visit by one user long before v’s visit (e.g., several years earlier) may have very
insignificant or zero influence on v due to the long doublet’s span according to Equation 5.3, but
thousands or millions of such earlier visits by many people combined may still create a huge
motivation and impact on v. This is understandable since it is natural for us to seek to visit places
(e.g., the Statue of Liberty) that other people have visited and talked about for decades. Therefore,
distributing negligibly small credit to each of these much earlier visitors does not make sense.
However, eliminating a large number of those earlier visitors would cause the more
recent visitors (still prior to v’s visit) to get too much credit; “too much” because those recent
visitors are only a tiny portion of a possibly large number of visitors who all influenced v to visit l.
The second problem concerns the computational resources required to perform the credit distribution
based on this approach. In order to know what portion of influence with respect to v belongs to the
more recent visitors, we would have to compute the temporal dependency for all visitors prior to
v’s visit. The number of doublets that stretch along the time axis can be extremely large and one
would have to compute all of them in order to perform the proportionate distribution, which would render
the approach prohibitively expensive when dealing with large spatiotemporal datasets.
However, we have already accomplished the credit distribution implicitly via the locational
dependency p(l). p(l) represents the initial credit given to an influencer for convincing someone to visit
location l. This function is built based on all the visits to location l in order to take into account the
popularity of the location; the more popular, the less credit. Thus we are implicitly
distributing the influence credit via p(l); more visitors to l means there are more influencers
of v, but at the same time also means that the initial credit p(l) becomes smaller (according to
Equations 4.11 and 5.8), and consequently each influencer will get less credit p(l) as a result of
sharing the total credit with the others.
5.1.4 Influence Causality and Coincidences
So far, we have elaborated on estimating spatial influence by measuring followship between two
people via Equations (5.7), (5.8) and (5.9). However, we left out one important question: how to
identify followship in spatiotemporal data? In other words, how to distinguish followship from
coincidences? We refer to this issue as causality, meaning the relationship between cause and
effect. Specifically, per our discussion in Section 2.1.2, influence between two individuals (the
cause) results in their followship (the effect), while non-influence reasons result in coincidences.
The reason we set aside this question temporarily (by assuming that the cause is influence at the
beginning of Section 5.1.2) is that we wanted to focus purely on measuring followship without
worrying about its causes. We now discuss this important matter.
Our strategy is to introduce the causality filter $f(u, v)$, whose purpose is to prevent coincidences
from being counted toward influence. For example, assume that the measure of the successive visits by
two users u and v is 0.8 according to Equation (5.9). Also assume that we have
determined that u and v are not socially related (they are neither friends nor in an influential
relationship) and that their successive visits were coincidences, which the filter captures as a small
value $f(u, v) = 0.01$. The influence of u on v is then $0.01 \times 0.8 = 0.008$, a negligibly small value
that correctly indicates the insignificant influence of u on v despite their successive visits.
The filtering function $f(u, v)$ can take into account multiple factors that help eliminate
coincidences, such as the similarity between users in socio-demographics (age, ethnicity, religion,
education level, etc.), the similarity in users’ location behaviors (such as the cosine similarity in
[Zhang and Pelechrinis, 2014]), or even the strength of friendships. $f(u, v)$ can be expressed as the
product $f(u, v) = f_1(u, v) \cdot f_2(u, v) \cdots f_r(u, v)$, where each component function $f_i(u, v)$, $i = 1, \dots, r$,
considers one factor separately. Alternatively, $f(u, v)$ can take the form of a linear regression
over the component functions, each weighted differently according to its relative importance. In
this work, since we assume that the only input is spatiotemporal data, we leave the other factors as
possibilities and propose a Shannon-entropy-based filter that utilizes only spatiotemporal data.
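As an illustration of the product form described above, the following minimal Python sketch combines several component filters into one. The component functions and the constant scores they return are hypothetical placeholders chosen for this example; they are not filters used in this work.

```python
from math import prod
from typing import Callable, Sequence

# A component filter maps a user pair (u, v) to a score in [0, 1].
Filter = Callable[[str, str], float]


def combine_filters(components: Sequence[Filter]) -> Filter:
    """Build f(u, v) as the product of the component filters f_i(u, v)."""
    def combined(u: str, v: str) -> float:
        return prod(f(u, v) for f in components)
    return combined


# Hypothetical components with made-up constant scores, for illustration only.
def demographic_similarity(u: str, v: str) -> float:
    return 0.9


def location_behavior_similarity(u: str, v: str) -> float:
    return 0.5


causality_filter = combine_filters(
    [demographic_similarity, location_behavior_similarity]
)
print(causality_filter("u", "v"))
```

With a product, a single component close to zero is enough to suppress a pair, which is the behavior a coincidence filter needs; a weighted linear combination would instead let strong components compensate for weak ones.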
5.1.4.1 Causality Filter via Shannon Entropy
We aim to utilize Shannon entropy to capture the breadth of spatial influence. Specifically, the
breadth of spatial influence indicates the ability of an individual to influence another across many
locations. The breadth captures another aspect of spatial influence, in addition to the depth
at each location described by Equation (5.5). Our motivation for using the breadth of spatial
influence comes from earlier social studies [Pham et al., 2013][Crandall et al., 2010], which showed
in both theory and experiments that two people who co-occurred in many locations are closer
friends than those who co-occurred in only one or very few locations. This captures the
intuition that close friends tend to spend time together in many places, and therefore their
co-occurrences should occur in various locations. On the other hand, a large number of co-occurrences
that all took place in one location may indicate a weak or even non-existent friendship due to possible
coincidences, e.g., students who frequently study at the same library but are not friends.
Subsequently, we expect a similar behavior in followship. Specifically, strong influence is
expected to cause one person to follow another in many locations, or their followship should be
diverse in locations. On the other hand, the lack of variety of locations in successive visits may just
imply coincidences. Indeed, even though coincidences can result in a large number of successive
visits (e.g., in shopping malls), they typically only happen in one or few locations for non-related
user pairs, as similarly shown for co-occurrences in [Pham et al., 2013]. Consequently, we use the
term breadth to describe the diversity of locations in successive visits for each pair of people, and
use the breadth of successive visits to capture the causality of influence. The higher the value of
the breadth, the more probable that the successive visits by two individuals are caused by influence.
Next, we show how to measure the breadth of successive visits.
Our strategy is to measure the breadth of the successive visits for every pair of users in the
spatiotemporal data. Then, based on the breadth, we can determine which pairs are influential
relationships and which pairs are not. To ease the presentation, we still keep using the terms
doublets and multiplets, even though they may refer to followship or coincidences. However, we
will use the term successive visits instead of followship.
Let $M_{u,v} = \{m_{l_1}, m_{l_2}, \dots, m_{l_s}\}$ be the set of multiplets between users u and v, where each multiplet
$m_{l_k}$ corresponds to location $l_k$. We also assume that we are considering the (possible) direction of
influence from u to v. We aim to find the amount of location information contained in this set of
multiplets, which describes the breadth of their successive visits in terms of locations, as opposed
to the depth in each location expressed in Equation (5.5).
Recall that in Section 5.1.3 we argued that using the number of unique visitors to a
location may not correctly describe how popular a location is. For the same reason, using the
number of unique locations in $M_{u,v}$ may not correctly characterize its diversity of locations either.
Therefore we use Shannon entropy to measure the amount of location information in $M_{u,v}$. Let
$p_{t,k} = \sum_{d_i} \exp(-\Delta t_{d_i} / \tau)$ be the depth of $M_{u,v}$ at location $l_k$ (the temporal dependency of $m_{l_k}$), and let
$p_t = \sum_k p_{t,k}$. The Shannon entropy, which describes the amount of location information in $M_{u,v}$, is
defined as follows:

$$H(M_{u,v}) = -\sum_k \left( \frac{p_{t,k}}{p_t} \right) \ln \left( \frac{p_{t,k}}{p_t} \right) \qquad (5.10)$$
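To make the entropy-based breadth concrete, the sketch below computes it for one user pair from the spans of their doublets at each location. The function names, the dictionary layout, the sample spans and the half-life of 130 hours are assumptions made for this illustration only.

```python
import math
from typing import Dict, List


def location_depth(spans: List[float], tau: float) -> float:
    """Depth at one location: sum of exp(-span / tau) over its doublets."""
    return sum(math.exp(-span / tau) for span in spans)


def breadth_entropy(spans_by_location: Dict[str, List[float]], tau: float) -> float:
    """Shannon entropy of the per-location depths of one user pair."""
    depths = [location_depth(spans, tau) for spans in spans_by_location.values()]
    depths = [d for d in depths if d > 0.0]  # locations without doublets add nothing
    total = sum(depths)
    if total == 0.0:
        return 0.0
    return sum(-(d / total) * math.log(d / total) for d in depths)


# Made-up doublet spans (in hours) for two hypothetical user pairs.
one_place = {"library": [1.0, 2.0, 3.0, 4.0]}
three_places = {"cafe": [2.0], "gym": [2.0], "park": [2.0]}
tau = 130.0 / math.log(2)  # decay constant for an assumed half-life of 130 hours

print(breadth_entropy(one_place, tau))     # one location only: no breadth
print(breadth_entropy(three_places, tau))  # equal depths at 3 locations: about ln 3
```

The first pair has more doublets but all at one location, so its breadth is zero; the second pair has fewer doublets spread evenly over three locations and obtains the maximum entropy for three locations.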
Note that because we are interested in the breadth of $M_{u,v}$ in terms of locations, the inherent
property of each location should not matter. Thus, in Equation (5.10) we do not include p(l)
(locational dependency) which only characterizes the inherent property (popularity) of each
location.
The Shannon entropy in Equation (5.10) shows the diversity of successive visits in terms of
locations, or the breadth, which can be used to indicate the causality of influence as we discussed
earlier. A high value of $H(M_{u,v})$ indicates that the successive visits by u and v are followship
(caused by the influence of u on v), while a low value indicates that the successive visits are prone to
coincidences. Subsequently, by applying $H(M_{u,v})$ as the causality filter, we can discount the effect
of coincidences. Let $\sum_{l_k} p(m_{l_k})$ (Equation 5.9) be the measure of the successive visits of u and v;
the real influence of u on v after applying the causality filter is defined as follows.
$$p_{u \to v} = \left( H(M_{u,v}) + \epsilon \right) \times \sum_{l_k} p(m_{l_k}) \qquad (5.11)$$

We include the parameter $\epsilon$ in Equation (5.11) because $H(M_{u,v}) = 0$ when the successive
visits occurred in only one location. This parameter is application-dependent and can be set to a
small value if we do not want to completely eliminate successive visits at a single location from
consideration. Generally, $f(u, v) = H(M_{u,v}) + \epsilon$ needs to be normalized to the range [0, 1].
In Equation (5.11), even though coincidences may create a large number of successive visits and
result in high values of the measure $\sum_{l_k} p(m_{l_k})$, the filter $f(u, v)$ can effectively reduce their effect
on $p_{u \to v}$ because $f(u, v)$ is negligibly small due to the lack of location diversity in coincidences.
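The following small sketch shows how the filter scales the measure of successive visits. The name epsilon for the small additive parameter, its default value, and the normalization (dividing by a caller-supplied maximum and capping at 1) are assumptions made for this example, since the normalization scheme is not fixed in the text.

```python
def filtered_influence(entropy: float, visit_measure: float,
                       epsilon: float = 0.01, max_filter: float = 1.0) -> float:
    """Scale the successive-visit measure by the filter f = entropy + epsilon.

    max_filter is an assumed normalization constant (for example, the largest
    value of entropy + epsilon over all pairs); f is capped at 1.
    """
    if max_filter <= 0.0:
        raise ValueError("max_filter must be positive")
    f = min((entropy + epsilon) / max_filter, 1.0)
    return f * visit_measure


# A pair whose successive visits all happen at one location (entropy 0):
# a measure of 0.8 is reduced to roughly 0.008, as in the earlier example.
print(filtered_influence(entropy=0.0, visit_measure=0.8))
```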
5.2 Implementation
The computational complexity of TLFM comes from the search for doublets in spatiotemporal data.
Therefore, we provide the implementation of searching for doublets in Algorithm 1. Note that
the algorithm is straightforward; however, our goal of presenting it is to analyze the complexity.
Inputs include (i) a list c[i] of check-ins at a given location l, sorted by increasing time; (ii) an
empty hashmap H where the key stores the user pair’s IDs (e.g., “12345:678”, where the influencer
precedes the influencee) and the value stores a list of their doublets; (iii) the span threshold T. The
algorithm returns H. The primitive functions used are u(c[i]) and t(c[i]), which return the user and the
time of a check-in c[i], and clear(I), which erases the content of a set I.
Input: c[i] - a list of check-ins; H - an empty hashmap; T - the span threshold
Output: H
    Create an empty set I for users’ IDs; n = size of c; i = 0;
    while (i < n) do
        j = i + 1;
        while (true) do
            if ((j == n) or (t(c[j]) - t(c[i]) > T) or (u(c[i]) == u(c[j])))
                clear(I);
                break;
            end if;
            if (I does not contain u(c[j]))
                Add (c[i], c[j]) to hashmap H under key “u(c[i]):u(c[j])”;
                Add u(c[j]) to I;
            end if;
            j++;
        end while;
        i++;
    end while;
    return H;
Algorithm 1: Doublet Search
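For readers who prefer executable code, a possible Python rendering of Algorithm 1 is sketched below. The tuple layout of a check-in, the function name and the sample data are assumptions made for illustration; the control flow follows the pseudocode (stop scanning at the end of the list, when the span exceeds T, or when the same user checks in again, and count each follower at most once per preceding check-in).

```python
from collections import defaultdict
from typing import Dict, List, Tuple

CheckIn = Tuple[str, float]  # (user_id, time in hours), an assumed layout


def find_doublets(checkins: List[CheckIn],
                  span_threshold: float) -> Dict[str, List[Tuple[CheckIn, CheckIn]]]:
    """Search one location's time-sorted check-ins for doublets."""
    doublets = defaultdict(list)
    n = len(checkins)
    for i in range(n):
        leader, t_i = checkins[i]
        seen = set()  # followers already paired with check-in i (the set I)
        for j in range(i + 1, n):
            follower, t_j = checkins[j]
            # Stop when the span exceeds T or the leader checks in again.
            if t_j - t_i > span_threshold or follower == leader:
                break
            if follower not in seen:
                # Key format "influencer:influencee", as in the text.
                doublets[f"{leader}:{follower}"].append((checkins[i], checkins[j]))
                seen.add(follower)
    return dict(doublets)


# Made-up check-ins at a single location, sorted by time.
sample = [("a", 0.0), ("b", 1.0), ("b", 2.0), ("c", 3.0), ("a", 4.0), ("b", 5.0)]
for key, pairs in sorted(find_doublets(sample, span_threshold=10.0).items()):
    print(key, len(pairs))
```

Creating a fresh set for every preceding check-in plays the role of clear(I) in the pseudocode.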
The algorithm is self-explanatory. It scans through the list of check-ins at each location only
once (index i in Algorithm 1). For each check-in at index i, it only scans the subsequent check-ins
within the time interval T (the span threshold), because check-ins by two users that are separated by
more than this interval are considered to have a negligible relation to each other in terms of influence.
The time complexity for each location is $O(\mu n)$, where $\mu$ is the maximum number of check-ins within a
time interval T and n is the total number of check-ins at that location. The space complexity is
also $O(\mu n)$: for each check-in indexed by i, at most $\mu$ doublets can be constructed
with c[i] (c[i] is the preceding check-in in a doublet). The overall time and space complexities for
a set of spatiotemporal data with N check-ins are both $O(\mu N)$.
Since the search for doublets in each location can be performed separately and independently,
one can use Map-Reduce to parallelize the doublet search over a spatiotemporal dataset to further
improve the efficiency. In a related study on social connections [Pham et al., 2013], the authors
provide a straightforward Map-Reduce implementation for the search of co-occurrences, which
can also be applied to the search for doublets. We refer the readers to [Pham et al., 2013] for more
details.
5.3 Performance Evaluations
5.3.1 Data and Experiment Set-up
For experiments we used three different datasets from three LBSNs: Gowalla, Brightkite and Foursquare,
which were collected during the periods Feb 2009 - Oct 2010, Feb 2008 - Oct 2010 and Feb 2012 -
Jul 2012, respectively. Each of these datasets contains spatiotemporal data which we use to derive
spatial influence, and an explicit social graph which we use as ground truth. Table 5.2 summarizes
these datasets. M denotes million, k denotes thousand, and “Check-ins’ # ” means the number of
check-ins.
Table 5.2: Datasets
Dataset       Check-ins’ #   Users’ #   Friendships’ #
Gowalla       6.4M           197k       951k
Brightkite    4.5M           58k        214k
Foursquare    1.3M           669k       1.4M
We divide each dataset into two parts: training and evaluation. We randomly pick 3 out of
every 5 locations and add all the check-ins at those locations to the training set, and
use the check-ins at the remaining locations for the evaluation set. In the end, the numbers of
check-ins in the training and evaluation sets for Gowalla, Brightkite and
Foursquare are (4.3M, 2.1M), (2.5M, 2.0M) and (0.8M, 0.5M), respectively. Our experiments are
written in Java and run on 64-bit OS X with 16 GB of memory and a 2.20 GHz quad-core CPU.
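A location-based split of this kind can be sketched as follows. This is only one way to realize "3 out of every 5 locations": the check-in tuple layout, the fixed random seed and the choice to shuffle the locations and then cut them into consecutive groups of five are assumptions of this example, not a description of the original Java code.

```python
import random
from typing import List, Tuple

CheckIn = Tuple[str, str, float]  # (user_id, location_id, time), an assumed layout


def split_by_location(checkins: List[CheckIn], seed: int = 0,
                      group_size: int = 5, train_per_group: int = 3):
    """Send all check-ins of a location to either the training or evaluation set."""
    locations = sorted({location for _, location, _ in checkins})
    random.Random(seed).shuffle(locations)
    train_locations = set()
    # From every group of `group_size` shuffled locations, keep the first
    # `train_per_group` for training; the rest go to evaluation.
    for start in range(0, len(locations), group_size):
        train_locations.update(locations[start:start + train_per_group])
    train = [c for c in checkins if c[1] in train_locations]
    evaluation = [c for c in checkins if c[1] not in train_locations]
    return train, evaluation


# Made-up data: 10 locations with 2 check-ins each.
data = [(f"user{k}", f"loc{k % 10}", float(k)) for k in range(20)]
train_set, eval_set = split_by_location(data)
print(len(train_set), len(eval_set))
```

Splitting by location rather than by check-in keeps every doublet of a location inside a single set, so parameters learned on the training locations are evaluated on visits the model has not seen.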
5.3.2 Experiments
5.3.2.1 Temporal Dependency
In our first set of experiments with the training datasets, our goal is to verify the exponential
behavior in the temporal dependency of spatial influence (see Section 5.1.2) and to learn the
half-life parameter h and the span threshold T. Specifically, we examine how the number of doublets
of friendships depends on the doublet’s span. Recall that we already showed this experiment
with a smaller and different subset of data from Foursquare in Section 5.1.2 as an illustration of
our theoretical derivation. Here, we repeat this experiment with larger data from different
LBSNs to show the consistency of the exponential decay of spatial influence across different
networks. Note that we only perform the experiments for user pairs who are explicit friends,
because for them we can assume the evidence of existing influence [Rogers, 2010][Goyal et al.,
2010].
The results for data from three LBSNs are shown in Fig. 5.3. The description of the Fig. 5.3
is similar to that of Fig. 5.2. We observe that all three datasets exhibit consistent behavior: the
number of doublets of friendships drops exponentially as their span grows. This indicates that
the tendency of people repeating their friends’ actions (aka followship) drops exponentially over
time starting from the moment when their friends first initiated the actions. Consequently, spatial
influence that people exert on their friends decays exponentially over time, which confirms the
validity of our theory about the temporal dependency of spatial influence in Section 5.1.2. In addition,
based on this experiment we estimated the half-life and the span threshold parameters (in hours)
for each dataset to be approximately (h, T) = (130, 1000), (135, 850) and (130, 900) for Gowalla,
Brightkite and Foursquare, respectively; the span threshold is beyond the limit of the figures.
Overall, the fluctuations in these parameters across the datasets are not significant considering
experimental uncertainty and possibly different motivations for sharing locations in each LBSN.
Finally, note that the numbers of friendships’ doublets for Gowalla and Brightkite are low despite
their large numbers of check-ins because their friendship networks are much sparser than that of
Foursquare.
Figure 5.3: Temporal dependency is shown as how the number of doublets depends on their time
span in hours.
The parameter $\tau = h / \ln 2$, learned from this experiment with the training datasets, will be used in
the subsequent experiments with the evaluation datasets to compute temporal dependency.
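As a quick numerical check of the relation between the half-life and the decay constant, the sketch below converts a half-life into the decay constant and evaluates the exponential weight of a doublet. The function names are ours, and the half-life of 130 hours is the approximate value reported above for Gowalla.

```python
import math


def decay_constant(half_life: float) -> float:
    """Decay constant of exp(-t / tau) for a given half-life: tau = h / ln 2."""
    return half_life / math.log(2)


def temporal_weight(span: float, half_life: float) -> float:
    """Exponential weight of a doublet whose visits are `span` hours apart."""
    return math.exp(-span / decay_constant(half_life))


h = 130.0  # half-life in hours (approximate value estimated for Gowalla)
print(decay_constant(h))            # about 187.55 hours
print(temporal_weight(h, h))        # about 0.5 after one half-life
print(temporal_weight(2 * h, h))    # about 0.25 after two half-lives
```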
In addition, note that the purpose of our metric for temporal dependency is to show how the
impact of time delay on spatial influence is derived, so that we can gain a deep understanding of how
influence starts, propagates and dies out. We do not aim to replace the solutions that were developed
in previous studies for social media (see Section 2.3.2), which can provide alternative solutions
for measuring the impact of time delay. The main focus of this work is rather to investigate
the different factors that impact spatial influence (time delay, locations and coincidence), their relative
significance and how to combine them into a unified measure, for which we show experiments in
the next sections.
5.3.2.2 Comparison of TLFM with Related Work
In this set of experiments, we use the explicit social graph of each network as ground truth to
evaluate our method and the methods proposed in the related studies discussed in Section 2.3.2.
We created the ground-truth influence from an explicit social graph by using a standard,
well-known technique [Kempe et al., 2003] as follows. First, we used the Jaccard index (proposed
in [Liben-Nowell and Kleinberg, 2007b]) $J(u, v) = |F(u) \cap F(v)| / |F(u) \cup F(v)|$ to compute the social
strength of each friendship between u and v ($F(u)$ denotes the set of friends of u). Then we applied
a technique in [Kempe et al., 2003]: dividing this strength $J(u, v)$ by the number of friends
of v to get the influence of u on v, which is denoted as $s(u \to v) = J(u, v) / |F(v)|$. Similarly,
$s(v \to u) = J(u, v) / |F(u)|$.
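The ground-truth computation described above can be sketched in a few lines. The adjacency-set representation of the social graph and the toy graph itself are assumptions made for this example.

```python
from typing import Dict, Set

Graph = Dict[str, Set[str]]  # user -> set of friends (assumed representation)


def jaccard(friends_u: Set[str], friends_v: Set[str]) -> float:
    """J(u, v) = |F(u) & F(v)| / |F(u) | F(v)|."""
    union = friends_u | friends_v
    if not union:
        return 0.0
    return len(friends_u & friends_v) / len(union)


def ground_truth_influence(graph: Graph, u: str, v: str) -> float:
    """s(u -> v) = J(u, v) / |F(v)|."""
    if not graph[v]:
        return 0.0
    return jaccard(graph[u], graph[v]) / len(graph[v])


# Toy friendship graph (symmetric), for illustration only.
toy = {
    "u": {"v", "a", "b"},
    "v": {"u", "a"},
    "a": {"u", "v"},
    "b": {"u"},
}
print(ground_truth_influence(toy, "u", "v"))  # J = 1/4, |F(v)| = 2
print(ground_truth_influence(toy, "v", "u"))  # J = 1/4, |F(u)| = 3
```

The measure is asymmetric by construction: the same Jaccard strength is divided by the number of friends of the influenced user, so a user with few friends is influenced more by each of them.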
On the other hand, we computed the predicted spatial influence from spatiotemporal data by using
the TLFM model and the models proposed in the related studies discussed in Section 2.3.2, which
include the similarity metric by Zhang et al. [Zhang and Pelechrinis, 2014] (denoted as SIM) and
the model by Goyal et al. [Goyal et al., 2010] (denoted as GO). In addition, we also evaluate each
component of TLFM separately: the temporal dependency alone (denoted as TD), obtained
by setting p(l) = 1 for every location in Equation 5.8, and the locational dependency alone (denoted as
LD), obtained by using only the first exponential part of Equation 5.7. Note that in this set of experiments
GO is the same as TD, and we do not need to use the filter because we intend to compute influence
between friends only, in order to be able to perform evaluations against the ground truth.
Figure 5.4: Comparison of TLFM with Related Work.
We normalized the inferred influence p and the ground truth influence s to [0, 1]. For each
influential relationship, say $u \to v$, we computed the absolute difference $|p(u \to v) - s(u \to v)|$, which
shows how much the inferred influence differs from the ground truth. Finally, we computed the number of
influential relationships for which our inferred influence differs from the ground truth by a
given amount. In other words, we want to answer the questions: for how many relationships does the
result differ from the ground truth, and by how much? Note that we report “how many” as a
percentage.
Figure 5.4 shows the results. The x-axis shows the difference between the inferred influence and the
ground truth. The y-axis shows the percentage of influential relationships. How to read the graph: the
left-most dot of the TLFM curve in Fig. 5.4(a) states that for 70% of relationships, our inferred
influence differs from the ground truth by a value between 0 and 0.1. The next dot states that for 19%
of relationships, our result differs from the ground truth by a value between 0.1 and 0.2, etc. Note: a
good method should have a high percentage for low differences (meaning a large portion of the inferred
pair-wise influence is close to the ground truth) and a low percentage for high differences (meaning
only a small portion of the results differs widely from the ground truth).
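The evaluation metric described above amounts to a histogram of absolute differences. A minimal sketch is given below; the dictionary layout, the bin width of 0.1 and the sample values are assumptions for illustration, and both inputs are assumed to be already normalized to [0, 1].

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]  # directed relationship (u, v), meaning u -> v


def difference_histogram(inferred: Dict[Pair, float],
                         truth: Dict[Pair, float],
                         bin_width: float = 0.1) -> List[float]:
    """Percentage of relationships whose |inferred - truth| falls in each bin."""
    n_bins = round(1.0 / bin_width)
    counts = [0] * n_bins
    for pair, p in inferred.items():
        diff = abs(p - truth[pair])
        # A difference of exactly 1.0 is placed in the last bin.
        counts[min(int(diff / bin_width), n_bins - 1)] += 1
    total = len(inferred)
    if total == 0:
        return [0.0] * n_bins
    return [100.0 * c / total for c in counts]


# Made-up normalized influence values for four directed relationships.
inferred = {("a", "b"): 0.50, ("a", "c"): 0.95, ("b", "c"): 0.20, ("c", "a"): 0.05}
truth = {("a", "b"): 0.45, ("a", "c"): 0.60, ("b", "c"): 0.20, ("c", "a"): 0.00}
print(difference_histogram(inferred, truth))
```

In this toy example three of the four relationships fall in the first bin (difference below 0.1) and one falls in the bin from 0.3 to 0.4, which corresponds to a curve that starts high and drops quickly.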
Observations: We start the observations with the performance of the SIM and GO methods
from the related studies. Fig. 5.4 shows that only small portions of the results (25% for SIM
and 35% for GO) can be considered accurate (differing from the ground truth by less than 0.1),
while a large portion of the results differs significantly from the ground truth. SIM has the worst
performance because the fact that two users visited the same locations a similar number of times
generally does not indicate influence, for two reasons. First, one user might visit a location a long
time (months or years) before the other did; thus the influence, even if present, may have already
decayed to nothing. Second, if their common locations are popular and famous, their visits might just
be random, as argued by the study [Zhang and Pelechrinis, 2014]. Note that the main purpose of
that study is to account for the impact of randomness and locations on the location behaviors of users.
The GO model (or TD) also has poor performance because a significant portion of followships is
not due to influence, but because users visited popular locations due to their own knowledge (e.g.,
from word-of-mouth, social media, etc.). Thus, a person may be over-credited for influencing her
friends to go to a well-known and popular place. Locational dependency alone (LD) also has low
performance because the decay of influence over time is completely ignored in LD.
However, when TD and LD are combined into the TLFM model, we observe a
significant improvement in the performance for the obvious reason: both the temporal decay and the
impact of locations are captured by TLFM. Specifically, Figure 5.4(a) shows that high accuracy
of our inferred influence (a difference between 0 and 0.1) is observed in 70% of relationships (improved
by 30% as compared to TD). Moreover, a large difference, or poor accuracy, is observed in significantly
fewer relationships: a difference between 0.2 and 0.3 is observed in only 6% of relationships for TLFM,
as compared to 11% for TD and 16% for LD. This improvement is observed consistently for all three
datasets, Gowalla, Brightkite and Foursquare, in Figs. 5.4(a), (b) and (c).
Figure 5.5: The effect of the causality filter for friendships only (upper graphs), and for all
relationships (lower graphs). The white bars indicate the percentage of pairs that pass the filter;
the black bars, the percentage that fail the filter.
5.3.2.3 The Effect of Coincidences
We divide this section into two parts.
Part 1: In this part, we evaluate the impact of coincidences and the effectiveness of the causality
filter based on Shannon entropy in reducing the effect of coincidences (see Section 5.1.4.1). We do
this in two steps: first, we demonstrate the correctness of the filter by verifying it against friendships
only, for which we can assume the evidence of existing influence [Rogers, 2010][Goyal et al., 2010];
second, we apply the filter to all user pairs (including friendships and non-friendships) to show its
effectiveness in eliminating coincidences.
Our experimental methodology is as follows. (a) First, we make the deliberately incorrect assumption
that all the successive visits (doublets/multiplets) found in the evaluation dataset are due to influence,
i.e., that they are followship. We then measure and normalize the successive visits for all user pairs
by using only the combination of temporal dependency (TD) and locational dependency (LD),
denoted as TD+LD, according to Equation (5.9), without using any causality filter. (b) Second,
we compute and normalize the breadth of influence (Equation (5.10)) for those relationships in
step (a) whose followship measure is above 0.5. (c) Third, we divide the relationships obtained
from step (b) into five sub-groups, whose TD+LD measures fall in the sub-intervals
[0.5, 0.6], [0.6, 0.7], etc. For each sub-group, we compute (i) the percentages of friendships that
pass and fail the filter (breadth above and below the threshold 0.1, respectively), and similarly (ii)
the percentages of total pairs (including friendships and non-friendships) that pass and fail the
filter. For the friendships in each sub-group, we also compute their average ground truth influence
using the method described in Section 5.3.2.2. Note that we set a threshold for the causality filter
only for the purpose of this experiment; in general, one does not need to set a threshold
for the filter.
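Step (c) of this methodology can be sketched as a simple grouping routine. The input layout (a list of normalized measure and breadth values per pair) and the sample values are assumptions made for this illustration; the threshold of 0.1 and the five sub-intervals starting at 0.5 follow the description above.

```python
from typing import List, Optional, Tuple


def pass_percentages(pairs: List[Tuple[float, float]], threshold: float = 0.1,
                     lower: float = 0.5, width: float = 0.1,
                     groups: int = 5) -> List[Optional[float]]:
    """Percentage of pairs passing the filter in each TD+LD sub-interval.

    Each pair is (normalized TD+LD measure, normalized breadth). A pair passes
    when its breadth is above the threshold. Empty sub-groups yield None.
    """
    passed = [0] * groups
    total = [0] * groups
    for measure, breadth in pairs:
        if measure < lower:
            continue  # only pairs with extensive successive visits are kept
        idx = min(int((measure - lower) / width), groups - 1)
        total[idx] += 1
        if breadth > threshold:
            passed[idx] += 1
    return [100.0 * p / t if t else None for p, t in zip(passed, total)]


# Made-up (measure, breadth) values for four hypothetical pairs.
sample_pairs = [(0.55, 0.30), (0.55, 0.05), (0.75, 0.20), (0.30, 0.90)]
print(pass_percentages(sample_pairs))
```

Running the same routine once on friendships only and once on all pairs gives the two sets of percentages that the experiment compares.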
Figure 5.5 shows the results. The x-axis shows the normalized followship measure computed
by TD+LD, and the y-axis shows the percentage. For each dataset (for example, Figure 5.5(a) shows
the results for the Gowalla dataset), the upper figure shows the percentages of friendships in each
sub-group that passed and failed the filter, and the lower figure shows the percentages of all relationships
in each sub-group that passed and failed the filter. How to read the graph: for the two left-most
graphs (for the Gowalla dataset), the left-most column in the upper graph states that for the sub-group
with a medium followship measure by TD+LD ([0.5, 0.6]), 74% of friendships passed the filter
while the remaining 26% failed; the left-most column in the lower graph states that 48% of total
relationships in that same sub-group passed the filter while 52% failed. The average ground truth
influence for friendships in the sub-group is shown inside each column of the upper graphs.
Observation 1: The upper graphs show that the majority of friendships passed the filter,
especially those friendships with higher ground truth influence (96% of friendships in Gowalla
with average ground truth influence 0.73 passed the filter). Since friendships result in social
influence [Goyal et al., 2010][Kempe et al., 2003][Chen et al., 2013], their successive visits are
generally not considered coincidences. Consequently, this observation shows that the causality
filter behaves correctly.
Observation 2: Relying on this filter, we then analyze the results shown in the lower graphs, which
show the pass/fail percentages for all user pairs (friendships and non-friendships). We observe that
more than half of the total pairs with extensive successive visits (measured above 0.5) still failed the
filter (extensive coincidences), while they would have been considered influential if no filter had been
applied. By applying the filter, we can therefore reduce the impact of coincidences effectively.
Note that in Fig. 5.5, we only show the results for the user pairs with extensive successive visits
(measured above 0.5) to illustrate the effectiveness of the filter.
Part 2: In this part, we repeat the experiments in Section 5.3.2.2 with the following changes.
First, even though we still conduct the experiments for only pairs of friends (in order to be able
to perform evaluation against the ground truth), we do not assume this knowledge. Therefore,
we need to use Equation (5.11) to compute pair-wise influence (by including the filter in the
value) (denoted as TLFM w Filter). This is the main difference as compared to the experiments
in Section 5.3.2.2, where we did not include the filter. For comparison, we also compute
influence by using the solution in a related study [Goyal et al., 2010]; this solution needs to rely on
friendships, thus, we use the EBM model [Pham et al., 2013] to infer implicit friendships from
spatiotemporal data, following our discussion in the last paragraph of Section 2.3 (denoted as
GO+EBM). We also include the result of TLFM without the filter (denoted
as TLFM w/o Filter), and of the GO solution from Section 5.3.2.2 (denoted as GO). The results
are shown in Fig. 5.6.
Figure 5.6: The impact of filter on spatial influence.
Observations: We observe that the Shannon entropy-based filter improves the performance by
reducing the difference between the inferred pair-wise influence and the ground truth, as compared
to when no filter was used. This confirms our argument in Section 5.1.4.1: the breadth of influence
(ability to influence people in various locations) is as important as the depth of influence (strong
influence in only one or few locations). On the other hand, the performance of GO+EBM does
not improve as compared to GO, and both differ significantly from the ground truth. We attribute
this to the fact that the impact of locations on influence is not considered in the GO model, which,
consequently, does not discount the impact of location popularity from followship. Recall that
Zhang et al. also showed that about 60% of the similarity of the location behaviors of users is
due to randomness and location, but not due to influence; this study, together with the result of
Location Dependency (LD) in Section 5.3.2.2, confirms that the solutions proposed for social media
cannot be applied for spatial influence.
5.3.2.4 Efficiency
Table 5.3 shows the running time of the experiments. Time is measured in minutes (m) and seconds
(s). Note that the GO model [Goyal et al., 2010] considers time delay and therefore also needs to
utilize doublets. Consequently, the TLFM and GO models share similar time complexity due to
the same doublet searching process.
Table 5.3: The running time of searching for doublets.
Dataset               Gowalla    Brightkite   Foursquare
Number of Check-ins   6.4M       4.5M         1.3M
Running Time          11m 42s    8m 13s       3m 19s
As we see in Table 5.3, it took approximately 12 minutes to search for doublets in the Gowalla
dataset in our simple experimental settings (without the use of Map-Reduce). Generally, the
running time depends not only on the number of check-ins, but also on the number of unique
locations in the set of check-ins because the search for doublets in each location is processed
sequentially and the transitions between the locations in the input can prolong the running time.
Finally, our solution is an offline process because it uses data collected over a time period (months
or years), and the result need not be updated every time a person performs a check-in.
5.4 Summary of Chapter
In this chapter, we discussed how to infer pair-wise influence from spatiotemporal
data by analyzing people’s movements in the real world, a concept we named spatial influence.
We identified a number of challenges with spatial influence, and to address them we presented the
TLFM model. TLFM includes formulations to account for the decay of spatial influence over time,
to capture the impact of each individual location by considering its popularity, and to account for
the role of coincidences. We conducted extensive experiments with real-world datasets, which
confirmed the high accuracy of the TLFM model; at the same time, the thorough comparative
experiments with related studies on computing influence from social media showed that those
solutions cannot capture spatial influence.
Chapter 6: Conclusions
6.1 Summary
In this thesis, we focused on inferring and quantifying real-world connections (aka social strength)
and spatial influence from spatiotemporal data, which represents human movements at high
resolution. Towards this end, we surveyed related work as well as analyzed the conventional
approaches. We first proposed the GEOSO model for measuring the social strength between
two users, which indicates how socially close they are, based on their co-occurrences in space
and time. While the GEOSO model can capture two intuitive properties of co-occurrences -
commitment and compatibility, and showed consistent experimental results, it still has several
disadvantages. For example, the model does not consider the impact of locations, cannot give high
performance when working with sparse datasets, and is inefficient when working with large data.
Consequently, we proposed the EBM model to address subtle questions about social connections,
including how to infer the social strength of two people and how to avoid coincidences, which is a
challenging problem due to its frequent nature. EBM also alleviated the problem of data sparseness
by incorporating the location characteristics into the model when estimating the strength of social
connections. We also showed how popularity of a location could be inferred from location-
semantics, replacing location-entropy in the model for privacy preservation. We showed how
average duration of co-occurrences could be inferred from location-semantics, improving recall,
in particular for corner cases where diversity is low but duration of overlap is high. Finally, we
described an efficient data structure and provided a fast parallelized implementation of the algorithm
using Map/Reduce, which makes the model highly applicable to online services. The experiments
confirmed the high accuracy and efficiency of the EBM model and its superiority over competitors.
As for influence, for the first time, we coined the term spatial influence to refer to the study of
inferring influence from individuals’ movement (spatiotemporal) data by measuring the strength
of followship relationship. We identified several challenges in accurately measuring influence
from spatiotemporal data, and to address these challenges we presented the TLFM model. TLFM
includes formulations to account for the decay of spatial influence over time, captures the impact
of each individual location by considering its popularity, and accounts for the role of coincidences.
We conducted extensive experiments with real-world datasets and confirmed the high accuracy and
efficiency of the model.
6.2 Future Plans
In this section, we briefly discuss several possible extensions for the future work.
First, with the availability of location semantics, it is possible to identify the type of each
relationship based on the type of locations where two users co-occurred. For example, if
two users only co-occurred at a work place repeatedly during business hours, then it is probable
that they are co-workers. Similarly, they can be roommates or in a family relationship if they
co-occurred at a domestic house. The inference of relationship types (semantic relationships)
can be achieved, for example, by training a classifier with the social connections, for which the
relationship types are available. We would like to integrate social strength and influence inferred
from different sources of data to create unified measures of pair-wise social strength and influence,
respectively, that capture people’s behaviors in both the virtual and real worlds.
Finally, we plan to address the privacy of users, since their locations are used at
a high resolution and consequently their social circles (friends and influential icons) can be
exposed. Even though the discussion of user privacy falls outside the scope of this research, we
would still like to touch upon this matter briefly by discussing a possible direction for extending the
GEOSO, EBM and TLFM models to work around the barrier of user privacy. Per our discussion
in Section 4.2.8.2, the role of semantics-based location popularity becomes more significant when
protecting user privacy prevents us from accessing the locations of users and/or the historical
check-ins at those locations. In such cases, it is not straightforward to compute the
co-occurrences between people, and it is practically impossible to compute the Location Entropy of
the locations where users co-occur. However, if at least the semantics of the locations
of users are available, then we can still compute the co-occurrences between two users without
the users having to reveal their exact locations. This can be achieved, for example, by using the
solution proposed by Narayanan et al. [Narayanan and Shmatikov, 2008], which allows users
to detect whether their friends are nearby without revealing their actual locations. Consequently, it is
possible to obtain the co-occurrences between two people in a privacy-preserving manner, and
to compute their weighted frequency by utilizing the semantics-based location popularity derived
from other historical datasets.
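The last step above can be sketched in Python as follows. This is a minimal illustration, not the formulation of Section 4.2.8.2: it assumes that the popularity of a location category is an entropy-style score (the Shannon entropy of how check-ins in that category are spread across users in some separate historical dataset), and that a co-occurrence at a category with entropy H contributes exp(-H) to the weighted frequency. The function names, the sample check-ins, and the category labels are hypothetical.

```python
import math
from collections import Counter


def category_entropy(checkins):
    """Estimate an entropy-style popularity score per location category.

    checkins: iterable of (user_id, category) from a historical dataset
    that is unrelated to the users being protected (an assumption).
    Returns a dict mapping category -> Shannon entropy (natural log) of
    the distribution of that category's check-ins over users.
    """
    visits_by_category = {}
    for user, category in checkins:
        visits_by_category.setdefault(category, Counter())[user] += 1
    entropies = {}
    for category, visits in visits_by_category.items():
        total = sum(visits.values())
        entropies[category] = -sum(
            (n / total) * math.log(n / total) for n in visits.values()
        )
    return entropies


def weighted_frequency(cooccurrence_categories, entropies):
    """Sum exp(-entropy) over the categories of a pair's co-occurrences.

    cooccurrence_categories: one category per detected co-occurrence;
    exact coordinates are never needed. Raises KeyError for a category
    that is missing from the historical estimates.
    """
    return sum(math.exp(-entropies[c]) for c in cooccurrence_categories)


# Hypothetical historical check-ins: a cafe visited by many users,
# and a home visited by a single user.
historical = [
    ("a", "cafe"), ("b", "cafe"), ("c", "cafe"), ("d", "cafe"),
    ("a", "home"), ("a", "home"),
]
entropies = category_entropy(historical)

# Two users detected as co-occurring once at a cafe and once at a home.
score = weighted_frequency(["cafe", "home"], entropies)
print(score)
```

Under these assumptions, a co-occurrence at a crowded category counts for less than one at a private category, which mirrors the role that location popularity plays in the text.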
Bibliography
J.L. Bentley. Multidimensional binary search trees used for associative searching. Communications
of the ACM, 18(9):509–517, 1975.
Igor Bilogrevic, Kévin Huguenin, Murtuza Jadliwala, Florent Lopez, Jean-Pierre Hubaux, Philip
Ginzboorg, and Valtteri Niemi. Inferring social ties in academic networks using short-range
wireless communications. In Proceedings of the 12th ACM Workshop on Privacy
in the Electronic Society, pages 179–188. ACM, 2013.
C.M. Bishop. Pattern recognition and machine learning, volume 4. Springer New York, 2006.
David M Blei, Andrew Y Ng, and Michael I Jordan. Latent dirichlet allocation. JMLR, 3:993–1022,
2003a.
D.M. Blei, A.Y. Ng, and M.I. Jordan. Latent dirichlet allocation. The Journal of Machine Learning
Research, 3:993–1022, 2003b.
W.M. Bukowski, A.F. Newcomb, and W.W. Hartup. The company they keep: Friendships in
childhood and adolescence. Cambridge University Press, 1998.
Xin Cao, Gao Cong, and Christian S Jensen. Mining significant semantic locations from gps data.
Proceedings of the VLDB Endowment, 3(1-2):1009–1020, 2010.
Wei Chen, Chi Wang, and Yajun Wang. Scalable influence maximization for prevalent viral
marketing in large-scale social networks. In ACM SIGKDD, pages 1029–1038, 2010.
Wei Chen, Laks VS Lakshmanan, and Carlos Castillo. Information and influence propagation in
social networks. Synthesis Lectures on Data Management, 5(4):1–177, 2013.
Eunjoon Cho, Seth A. Myers, and Jure Leskovec. Friendship and mobility: user movement in
location-based social networks. In Proceedings of the 17th ACM SIGKDD, KDD ’11, pages
1082–1090, New York, NY, USA, 2011a. ISBN 978-1-4503-0813-7. doi: 10.1145/2020408.2020579.
URL http://doi.acm.org/10.1145/2020408.2020579.
Eunjoon Cho, Seth A Myers, and Jure Leskovec. Friendship and mobility: user movement in
location-based social networks. In ACM SIGKDD, pages 1082–1090. ACM, 2011b.
David J. Crandall, Lars Backstrom, Dan Cosley, Siddharth Suri, Daniel Huttenlocher, and Jon
Kleinberg. Inferring social ties from geographic coincidences. Proceedings of the National
Academy of Sciences, 107(52):22436–22441, 2010. doi: 10.1073/pnas.1006155107. URL
http://www.pnas.org/content/107/52/22436.abstract.
Justin Cranshaw, Eran Toch, Jason Hong, Aniket Kittur, and Norman Sadeh. Bridging the
gap between physical location and online social networks. In Proceedings of the 12th ACM
international conference on Ubiquitous computing, Ubicomp ’10, pages 119–128, New York,
NY, USA, 2010. ACM. ISBN 978-1-60558-843-8. doi: 10.1145/1864349.1864380. URL
http://doi.acm.org/10.1145/1864349.1864380.
Pedro Domingos and Matt Richardson. Mining the network value of customers. In ACM SIGKDD,
pages 57–66. ACM, 2001.
Nathan Eagle, Alex (Sandy) Pentland, and David Lazer. Inferring friendship network structure
by using mobile phone data. Proceedings of the NAS, 106(36):15274–15278, 2009.
doi: 10.1073/pnas.0900282106. URL http://www.pnas.org/content/106/36/15274.abstract.
Maria Esteva and Hai Bi. Inferring intra-organizational collaboration from cosine similarity
distributions in text documents. In Proceedings of the 9th ACM/IEEE-CS joint conference on
Digital libraries, pages 385–386. ACM, 2009.
Eventbrite, 2015. URL https://www.eventbrite.com/.
Manuel Gomez Rodriguez, Jure Leskovec, and Andreas Krause. Inferring networks of diffusion
and influence. In ACM SIGKDD, pages 1019–1028, 2010.
Manuel Gomez Rodriguez, Jure Leskovec, and Bernhard Schölkopf. Structure and dynamics of
information pathways in online media. In ACM WSDM, pages 23–32, 2013.
Amit Goyal, Francesco Bonchi, and Laks V.S. Lakshmanan. Learning influence probabilities in
social networks. In ACM WSDM, pages 241–250, New York, NY, USA, 2010.
Amit Goyal, Francesco Bonchi, and Laks VS Lakshmanan. A data-based approach to social
influence maximization. VLDB, 5(1):73–84, 2011.
Xinran He, Theodoros Rekatsinas, James Foulds, Lise Getoor, and Yan Liu.
Hawkestopic: A joint model for network inference and topic modeling from
text-based cascades.
M. O. Hill. Diversity and evenness: A unifying notation and its consequences. Ecology, 54:
427–432, 1973. ISSN 0012-9658. URL http://dx.doi.org/10.2307/1934352.
IMDB. http://www.imdb.com.
Lou Jost. Entropy and diversity. Oikos, 113(2):363–375, 2006.
ISSN 1600-0706. doi: 10.1111/j.2006.0030-1299.14714.x. URL
http://dx.doi.org/10.1111/j.2006.0030-1299.14714.x.
Herbert C Kelman. Compliance, identification, and internalization: Three processes of attitude
change. Journal of conflict resolution, pages 51–60, 1958.
David Kempe, Jon Kleinberg, and Eva Tardos. Maximizing the spread of influence through a social
network. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge
discovery and data mining, pages 137–146. ACM, 2003.
Kenneth S Krane. Introductory nuclear physics. 1987.
BVK Vijaya Kumar, Marios Savvides, and Chunyan Xie. Correlation pattern recognition for face
recognition. Proceedings of the IEEE, 94(11):1963–1976, 2006.
Byoungyoung Lee, Jinoh Oh, Hwanjo Yu, and Jong Kim. Protecting location privacy using
location semantics. In Proceedings of the 17th ACM SIGKDD international conference on
Knowledge discovery and data mining, pages 1289–1297. ACM, 2011.
Jure Leskovec, Andreas Krause, Carlos Guestrin, Christos Faloutsos, and Natalie Glance.
Cost-effective outbreak detection in networks. In ACM SIGKDD, pages 420–429, 2007.
Quannan Li, Yu Zheng, Xing Xie, Yukun Chen, Wenyu Liu, and Wei-Ying Ma. Mining user
similarity based on location history. In Proceedings of the 16th ACM SIGSPATIAL, GIS ’08,
pages 34:1–34:10, New York, NY, USA, 2008. ACM. ISBN 978-1-60558-323-5.
doi: 10.1145/1463434.1463477. URL http://doi.acm.org/10.1145/1463434.1463477.
D. Liben-Nowell and J. Kleinberg. The link-prediction problem for social networks. Journal of
the American society for IST, 58(7):1019–1031, 2007a.
David Liben-Nowell and Jon Kleinberg. The link-prediction problem for social networks. Journal
of the ASIST, 58(7):1019–1031, 2007b.
Juhong Liu, Ouri Wolfson, and Huabei Yin. Extracting semantic location from outdoor positioning
systems. In MDM, page 73. Citeseer, 2006.
JP Mangalindan. tech.fortune.cnn.com/2013/03/18/today-in-tech-hulu-tk/, 2013.
Mashable. http://mashable.com/2012/01/09/real-world-digital-world/, 2012.
Arvind Narayanan and Vitaly Shmatikov. Robust de-anonymization of large sparse datasets. In
Security and Privacy, 2008. SP 2008. IEEE Symposium on, pages 111–125. IEEE, 2008.
Debra L Oswald and Eddie M Clark. Best friends forever?: High school best friendships and the
transition to college. Personal Relationships, 10(2):187–196, 2003.
Huy Pham, Ling Hu, and Cyrus Shahabi. Towards integrating real-world spatiotemporal data with
social networks. In Proceedings of the 19th ACM SIGSPATIAL, GIS ’11, pages 453–457, New
York, NY, USA, 2011. ACM. ISBN 978-1-4503-1031-4. doi: 10.1145/2093973.2094046. URL
http://doi.acm.org/10.1145/2093973.2094046.
Huy Pham, Cyrus Shahabi, and Yan Liu. Ebm: an entropy-based model to infer social strength
from spatiotemporal data. In ACM SIGMOD, pages 265–276. ACM, 2013.
A. Renyi. On Measures of Entropy and Information. In Berkeley Symposium Mathematics,
Statistics, and Probability, pages 547–561, 1960.
Manuel Gomez Rodriguez, David Balduzzi, and Bernhard Schölkopf. Uncovering the temporal
dynamics of diffusion networks. arXiv preprint arXiv:1105.0697, 2011.
Everett M Rogers. Diffusion of innovations. 2010.
Everett M Rogers and F Floyd Shoemaker. Communication of innovations; a cross-cultural
approach. 1971.
TM Rossi and IM Warner. Pattern recognition of two-dimensional fluorescence data using cross-
correlation analysis. Applied spectroscopy, 39(6):949–959, 1985.
Kazumi Saito, Ryohei Nakano, and Masahiro Kimura. Prediction of information diffusion
probabilities for independent cascade model. In KBII - ES, pages 67–75. Springer, 2008.
Hanan Samet. The quadtree and related hierarchical data structures. ACM Comput. Surv.,
16(2):187–260, June 1984. ISSN 0360-0300. doi: 10.1145/356924.356930. URL
http://doi.acm.org/10.1145/356924.356930.
David P Schaff and Felix Waldhauser. Waveform cross-correlation-based differential travel-time
measurements at the Northern California Seismic Network. Bulletin of the Seismological Society
of America, 95(6):2446–2461, 2005.
Daniel V. Schroeder and Harvey Gould. An introduction to thermal physics. Physics Today, 53(8):
44–45, 2000. doi: 10.1063/1.2405696. URL http://link.aip.org/link/?PTO/53/44/2.
Cyrus Shahabi and Farnoush Banaei-Kashani. Efficient and anonymous web-usage mining for
web personalization. INFORMS Journal on Computing, 15(2):123–147, 2003.
Patricia M Sias and Daniel J Cahill. From coworkers to friends: The development of peer
friendships in the workplace. Western Journal of Communication (includes Communication
Reports), 62(3):273–299, 1998.
DV Sivukhin. A course of general physics. vol. ii, thermodynamics and molecular physics, 1990.
socialbakers. http://www.socialbakers.com/blog/167-interesting-facebook-places-numbers, 2011.
Hanna Tuomisto. A diversity of beta diversities: straightening up a concept. Ecography,
33(1):2–22, 2010a. ISSN 1600-0587. doi: 10.1111/j.1600-0587.2009.05880.x. URL
http://dx.doi.org/10.1111/j.1600-0587.2009.05880.x.
Hanna Tuomisto. A consistent terminology for quantifying species diversity? yes,
it does exist. Oecologia, 164:853–860, 2010b. ISSN 0029-8549. URL
http://dx.doi.org/10.1007/s00442-010-1812-0. 10.1007/s00442-010-1812-0.
Johan Ugander, Brian Karrer, Lars Backstrom, and Cameron Marlow. The anatomy of the facebook
social graph. arXiv preprint arXiv:1111.4503, 2011.
Hans Von Storch and Francis W Zwiers. Statistical analysis in climate research. Cambridge
University Press, 2001.
Chris Weidemann. http://geosocialfootprint.com/.
Carol Werner and Pat Parmelee. Similarity of activity preferences among friends: Those who play
together stay together. Social Psychology Quarterly, pages 62–66, 1979.
Wikipedia. http://www.wikipedia.org.
Peng Wu and Dan Tretter. Close & closer: social cluster and closeness from photo collections. In
Proceedings of the 17th ACM international conference on Multimedia, pages 709–712. ACM,
2009.
Mao Ye, Dong Shou, Wang-Chien Lee, Peifeng Yin, and Krzysztof Janowicz. On the semantic
annotation of places in location-based social networks. In Proceedings of the 17th ACM
SIGKDD international conference on Knowledge discovery and data mining, pages 520–528.
ACM, 2011.
Soe-Tsyr Yuan and Jerry Sun. Ontology-based structured cosine similarity in document
summarization: with applications to mobile audio-based knowledge management. Systems, Man, and
Cybernetics, Part B: Cybernetics, IEEE Transactions on, 35(5):1028–1040, 2005.
Ke Zhang and Konstantinos Pelechrinis. Understanding spatial homophily: the case of peer
influence and social selection. In WWW, pages 271–282. ACM, 2014.
Ke Zhou, Hongyuan Zha, and Le Song. Learning social infectivity in sparse low-rank networks
using multi-dimensional hawkes processes. In AI and Statistics, pages 641–649, 2013.
Asset Metadata
Creator
Pham, Huy
(author)
Core Title
Deriving real-world social strength and spatial influence from spatiotemporal data
School
Viterbi School of Engineering
Degree
Doctor of Philosophy
Degree Program
Computer Science
Publication Date
09/17/2015
Defense Date
08/13/2015
Publisher
University of Southern California
(original),
University of Southern California. Libraries
(digital)
Tag
geo-social,locations,OAI-PMH Harvest,social connections,social networks,spatiotemporal data,Strength
Format
application/pdf
(imt)
Language
English
Contributor
Electronically uploaded by the author
(provenance)
Advisor
Shahabi, Cyrus (committee chair), Liu, Yan (committee member), O'Leary, Daniel E. (committee member)
Creator Email
huyvpham@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c40-184673
Unique identifier
UC11273518
Identifier
etd-PhamHuy-3925.pdf (filename),usctheses-c40-184673 (legacy record id)
Legacy Identifier
etd-PhamHuy-3925.pdf
Dmrecord
184673
Document Type
Dissertation
Format
application/pdf (imt)
Rights
Pham, Huy
Type
texts
Source
University of Southern California
(contributing entity),
University of Southern California Dissertations and Theses
(collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA