MODELING AND PREDICTING WITH SPATIAL-TEMPORAL SOCIAL NETWORKS

by

Yoon-Sik Cho

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)

June 2014

Copyright 2014 Yoon-Sik Cho

Dedication

Dedicated to my parents, Moon-Hwan and Huen-Ja.

Acknowledgements

This dissertation would not have been possible without the help and support of many friends and colleagues. Foremost, I would like to thank my advisor Aram Galstyan. I feel very fortunate to have met Aram, who is a great mentor, collaborator, and teacher. He taught me about research and broadened my interests, and he has always been generous and kind to me. I also thank my co-advisor Bhaskar Krishnamachari for his support and guidance. I was deeply impressed with how dedicated Bhaskar was to his research and to his students.

I thank my dissertation committee, Antonio Ortega, Kristina Lerman, and Greg Ver Steeg, for their insightful comments and suggestions. I really enjoyed Antonio's class, which I took in the first year of my Ph.D. studies, and I also enjoyed the weekly meeting that Kristina organized at ISI. Greg is one of my greatest friends and collaborators, and he always gave me valuable feedback.

I thank all my friends at ISI: Jeon-Hyung Kang, Sahil Garg, Bo Wu, Suradej Intagorn, Shuyang Gao, and my officemate Ardeshir Kianercy. I also thank my colleague Min-Yian Su and my mentor Shankar Sadasivam, whom I met at Qualcomm during my internship. I particularly thank Professor No, who taught me during my undergraduate studies at Seoul National University and helped me make the decision to study abroad.

Finally, I would like to thank my family for their support. My beloved wife, Kyung, deserves special thanks and recognition. I especially thank my mother, who has always been my biggest supporter.
My father, who is always missed and loved, receives my deepest gratitude and love for his dedication.

Abstract

Network data and user behavior data are becoming pervasive. In this thesis, we develop efficient machine learning methods for addressing various real-world problems using these datasets. Network structure allows us to understand the underlying complex system that generated the data. By clustering the nodes in a latent space, we can measure the similarity between nodes and study how links are formed with regard to their latent properties. This model can be extended by combining it with a user behavior model, another emerging field in machine learning applications. We study these two aspects and propose a model that combines them by inferring a joint latent space that describes both the links and the user behaviors.

The first part of our work focuses on network modeling and examines the dynamics of the hidden attributes of nodes together with the link dynamics. We assume the two dynamics co-evolve with each other, allowing feedback. The Co-evolving MMSB is based on a well-defined clustering algorithm that has been widely used in the analysis of social networks and gene regulatory networks. We show how the Co-evolving MMSB captures the influence between nodes that have interacted. The other topic in network modeling studies the temporal dynamics of pair interactions. We propose a model that can reconstruct missing information in pair interactions using temporal and spatial information.

The second part of our work focuses on user behavior modeling. Specifically, we examine "check-in" behaviors of users in Location-Based Social Networks (LBSNs). Venues in an LBSN can be represented in a latent space through a lower-dimensional representation. For venue clustering, we use a non-parametric method called the information bottleneck, which clusters similar types of data with minimal loss of relevant information.
"Check-in" data in an LBSN can also be clustered in the time domain by measuring the influence of past events. Through this clustering, one can predict the time of future "check-ins" better than existing methods. Finally, we combine the two models by finding the joint space of the two latent spaces that describe link formation and user behavior, respectively. With this model one can build an effective recommendation system even with limited data: we show how to predict behavior based solely on link information and vice versa.

Contents

Dedication
Acknowledgements
Abstract
List of Figures
List of Tables
1 Introduction
   1.1 Network Models
      1.1.1 Notations in MMSB
   1.2 Point Process Models
      1.2.1 Hawkes Process
      1.2.2 Notations in Point Processes
2 Mixed Membership Blockmodels for Dynamic Networks with Feedback
   2.1 Introduction
      2.1.1 Selection and influence in networks
      2.1.2 MMSBs
      2.1.3 Related Work
   2.2 Co-evolving Mixed Membership Blockmodel
      2.2.1 Inference and Learning
      2.2.2 Variational E-step
      2.2.3 Variational M-step
   2.3 Results
      2.3.1 Experiments on Synthetic Data
      2.3.2 Comparison with dMMSB
      2.3.3 US Senate Co-Sponsorship Network
      2.3.4 Interpreting Results
      2.3.5 Polarization Dynamics
   2.4 Summary
3 Latent Self-Exciting Point Process Model for Spatial-Temporal Networks
   3.1 Introduction
   3.2 Related Work
   3.3 Spatial-Temporal Model of Relationship Network
      3.3.1 Hawkes process
      3.3.2 Spatial Gaussian Mixture Model (GMM)
   3.4 Learning and Inference
   3.5 Experiments with Synthetic Data
   3.6 Experiments with Real-World Data
      3.6.1 Data description
      3.6.2 Inferring event participants
      3.6.3 Event prediction with LPPM
   3.7 Summary
4 Venue Clustering through Information Bottleneck
   4.1 Introduction
   4.2 Related Work
   4.3 Data Description
   4.4 Network-infused Agglomerative Information Bottleneck
   4.5 Experimental Results
      4.5.1 Venue clustering
      4.5.2 Reconstructing Edges
   4.6 Summary
5 Temporal Clustering through Hawkes Process
   5.1 Introduction
   5.2 Dataset Description
   5.3 Model Description
      5.3.1 Modeling Temporal Patterns
      5.3.2 Characterizing Correlations Between Events
      5.3.3 Three Factors Causing Temporal Clustering
   5.4 Experimental Evaluation
      5.4.1 Model selection
      5.4.2 Predicting Venue Attendance
      5.4.3 Evaluating the Three Factors
   5.5 Summary
6 Network Behavior Joint-Space
   6.1 Introduction
   6.2 Modeling LBSN
      6.2.1 Generative Processes
      6.2.2 Variational Inference
      6.2.3 Parameter Estimation
   6.3 Related Works
   6.4 Experiments
      6.4.1 Venue Prediction
      6.4.2 Edge Prediction
      6.4.3 Evaluation using Document Citation Network
   6.5 Improving Computational Efficiency of MMSB
      6.5.1 Multi-Stage MMSB
      6.5.2 Evaluation using Real-World Data
   6.6 Summary
7 Conclusions and Open Questions
A Variational EM Update Equations for Co-Evolving MMSB
   A.1 Variational EM Update Equations for Co-Evolving MMSB
      A.1.1 Alternative View of EM Algorithm
      A.1.2 KL-Distance
      A.1.3 Variational E-step
      A.1.4 Variational M-step
B Variational EM Update Equations for LPPM
   B.1 Variational E-step
   B.2 Variational M-step

List of Figures

1.1 Combination of Two Modules in this Dissertation
2.1 Actual and inferred mixed membership trajectories on a simplex
2.2 (a) Inference error for dMMSB and CMMSB on synthetic data generated with K = 2 and β = 0.1 for all nodes; (b) the same with β = 0.2 for all nodes
2.3 Correlation between ACU/ADA scores and inferred probabilities
2.4 Comparison of inference results with ACU and ADA scores: Sen. Specter (left) and Sen. Dole (right)
2.5 Polarization trends during the 97th–104th US Congresses
3.1 Schematic demonstration of the missing-label problem for temporal point processes
3.2 Spatial data generated by varying the covariance matrix from 0.25I (a) to 4I (b)
3.3 (a) Accuracy of inference using spatial data only, temporal data only, and spatial-temporal data, for different settings of the standard deviation of the spatial Gaussian model; (b) average accuracy (over 20 trials) plotted against the percentage of missing labels
3.4 Spatial (a) and temporal (b) description of the events involving four active gang rivalries
3.5 Average accuracy for a varying fraction of missing labels
3.6 Average accuracy of the participant-inference task for the user in San Francisco
4.1 Geo plot of three clusters C1 (red), C2 (green), C3 (blue)
4.2 ROC curve for link prediction using JS-divergences
4.3 ROC curve using JS-divergence compared to other baselines
5.1 Temporal pattern of check-ins (SF)
5.2 Three users' activity profiles at Dolores Park Cafe
5.3 AIC comparison
5.4 Exogenous effect score S_exgn plotted against time (San Francisco)
5.5 Average number of check-ins with respect to days since the first check-in (San Francisco)
6.1 Predictions on the LBSN dataset: lower score is better
6.2 Predictions on the citation dataset: lower score is better
6.3 The majority of nodes take only a single role (HEP-PH dataset: 12K nodes, 32 roles)
6.4 Multi-stage MMSB uses a tree structure of roles
6.5 Multi-stage MMSB converges faster than the existing model without losing accuracy (HEP-PH dataset: 12K nodes, 32 roles)

List of Tables

3.1 Model evaluation for a total of n = 40 events between 6 pairs
3.2 Prediction accuracy of top-K choices for K = 1, 2, 3
4.1 Top 3 largest clusters in San Francisco
5.1 Statistics of check-ins from three cities
5.2 Performance of predictions
5.3 Top 5 venues with self-reinforcing behavior
5.4 Top 5 venues with high social effect
6.1 Dataset statistics

Chapter 1

Introduction

In this thesis, we introduce efficient machine learning methods for addressing various real-world problems. The recent proliferation of Social Network Services (SNS) has resulted in an abundance of network and user behavior data. Users not only make friends online through these services; they also watch video clips, listen to music, read articles, and check in to venues they visit. These large datasets have become widely accessible, and effective models are needed for understanding the behaviors they record.

As sensors become more ubiquitous, with accelerometers and GPS embedded in cell phones and pulsimeters in wearable devices, engineers can now analyze the movement and behavior of users at a much finer granularity. These types of data require efficient models that can handle them properly, especially as the size of the data grows every day. Besides efficiency, one of the primary benchmarks for these models is their ability to predict future events. All phenomena and behaviors related to users (or actors in general) can be considered as events.
Useful tasks include predicting the arrival of a user at a certain venue, predicting the location and time of an interaction between two actors, or predicting the probability that a given user suffers heart failure. Solutions to these problems would benefit society in many ways, for example in traffic control, predictive policing, market analysis, and the health industry.

Within a probabilistic modeling framework, we are interested in estimating the posterior distribution of missing or future behaviors given their history. Specifically, in this work we consider cases where the time and location of each behavior are given as past records. Along with the behavior records, the relationships between the actors are also of interest. Relationships between actors provide rich information because many of people's personal networks are homogeneous with regard to many sociodemographic, behavioral, and intrapersonal characteristics [56]. Analyzing network data yields useful predictive models and recommendation systems: predicting the events associated with behaviors, and recommending new friends, items, and activities to users.

Our approach can be divided into two modules, where one focuses on the network aspects and the other on the behavior aspects. Later in this work, the two modules are connected through a joint space of the two: the latent space of the network and the latent space of the behavior.

[Figure 1.1: Combination of Two Modules in this Dissertation]

The chapters are organized as follows:

∙ Chapter 2: Many networks are complex dynamical systems, where both the attributes of nodes and the topology of the network can change with time. We propose a model of co-evolving networks where both node attributes and network structure evolve under mutual influence.
Compared to other dynamic social network models, this model captures the influence between users and its effect on network formation.

∙ Chapter 3: In this chapter, we deal with missing information in pair interactions. Our efficient model of spatial-temporal networks allows reconstructing missing information (i.e., the identity of the pairs) and predicting future events. Specifically, we apply our model to an LAPD gang-related crime dataset for investigating unknown perpetrators and for predictive policing.

∙ Chapter 4: Location information reflects many aspects of users, such as their affiliations and predilections. However, the majority of users in an LBSN are inactive in leaving "check-in" records. One solution is to cluster venues of similar type for a better understanding of users. In particular, we show how to cluster venues with respect to their social relevance.

∙ Chapter 5: We also found that the collection of "check-ins" at a venue exhibits strong temporal patterns. To study these aspects further, we cluster the temporal point processes and relate each event (point) to the previous events (points). We explore how self-reinforcing behaviors, social factors, and exogenous effects contribute to this clustering, and we introduce a framework to distinguish these effects at the level of individual check-ins for both users and venues.

∙ Chapter 6: Finally, in this chapter, we combine the two modules by projecting the two latent spaces, which account for network formation and behaviors, onto a joint space. Using this approach, either side of the information can be predicted solely from the other. This model holds great potential for applications in SNS, such as recommending friends based on limited existing edge information or recommending actions given a limited behavior history. Later in the chapter, we show how to enhance the clustering algorithm for link structure.
Next we introduce some of the basic models that we use and improve in the rest of this thesis, including network models and point process models.

1.1 Network Models

A network graph represents pairwise relationships between objects; in the social network literature, it represents the pairwise relationships between social actors. The graph consists of nodes, which represent the social actors, and edges, which represent the relationships between them. In the most general case, an edge can be directed, because the relationship from A to B is not necessarily the same as that from B to A. Some social networks, on the other hand, have undirected edges, as often appears in social network services like Facebook. A previous line of work based on Mixed Membership Stochastic Blockmodels (MMSB) [3] has successfully modeled pairwise interactions between actors and can easily be modified for the undirected case. Our network models that account for pairwise interactions are based on MMSB, which we use for modeling dynamic networks and collective behaviors.

MMSB assumes that interactions are governed by the roles that nodes select within a pair's interaction and by a block matrix that specifies role-to-role link probabilities. MMSB overcomes the limitation that each actor can belong to only one cluster by introducing the flexibility of mixed membership. This assumption holds especially well in social network analysis, where many social interactions are multi-faceted: one can interact with another person as a co-worker or as a neighbor living nearby. MMSB captures the multiple roles that actors undertake in their interactions, and the relationships between the chosen roles, through the block matrix. MMSB has been followed by a series of extensions, including the dynamic mixed membership stochastic blockmodel [27] and the joint latent topic model [59].
More recently, more efficient models have been introduced by adding constraints to the existing assumptions [36].

1.1.1 Notations in MMSB

Here we present the notation used in our network models.

Symbol        Description
N             Total number of nodes
K             Total number of roles
Y(p, q)       Edge from node p to node q
B             Block matrix, where B(i, j) gives the probability of a link between the i-th role and the j-th role
z_{p→q}       K × 1 indicator vector for the role that node p undertakes in the p-to-q interaction
z_{p←q}       K × 1 indicator vector for the role that node q undertakes in the p-to-q interaction

1.2 Point Process Models

The points in a point process often represent the times and locations of events in temporal space and geographic space, respectively. Many real-world phenomena can be treated as point processes, including the occurrence of earthquakes, outbreaks of epidemics, and incidents of crime. These random processes exhibit patterns in temporal or geographic space: most events are history-dependent and concentrated in specific regions of geographic space. We are interested in discovering the underlying structural patterns that govern the occurrence of these events. In this work, we consider sets of events of different types using marked point processes, where the mark differentiates the types. Specifically, we study inter-gang violence among multiple gang rivalries and check-ins at venues in an LBSN using marked point processes. Rather than assuming that events are temporally independent of each other, we assume they are temporally dependent, with previous events affecting future events.

1.2.1 Hawkes Process

The Hawkes process is a self-exciting point process which captures temporal dependency and has been widely used in seismology. The simplest model for a temporal point process is the Poisson process, in which the inter-arrival times are assumed to be independent. This process lacks the ability to capture temporal dependency.
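To make the contrast concrete, the following minimal sketch (not from the thesis; the parameter values μ, α, β and the exponential kernel are illustrative) compares a constant Poisson rate with a history-dependent Hawkes intensity:

```python
import math

def poisson_intensity(t, mu=0.5):
    """Homogeneous Poisson process: the rate is constant and ignores history."""
    return mu

def hawkes_intensity(t, history, mu=0.5, alpha=0.8, beta=1.0):
    """Hawkes process: background rate mu plus an exponentially decaying
    contribution alpha * exp(-beta * (t - t_k)) from every past event t_k < t."""
    return mu + sum(alpha * math.exp(-beta * (t - t_k))
                    for t_k in history if t_k < t)

past_events = [1.0, 1.2, 1.3]                    # a recent burst of events
rate_now = hawkes_intensity(2.0, past_events)    # elevated just after the burst
rate_later = hawkes_intensity(50.0, past_events) # decays back toward mu
```

Shortly after a burst the Hawkes rate exceeds the constant Poisson rate; long after the burst it decays back to the background rate, which is the temporal dependency the Poisson process cannot express.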
The Hawkes process defines the rate as a function of time and history through a kernel function, whereas the Poisson process has a constant rate over time. The Hawkes process consists of a background rate, which is often constant, and a self-excitation term. The self-excitation term triggers offspring events once events have been generated.

1.2.2 Notations in Point Processes

Here we present the notation used in our point process models.

Symbol                 Description
λ(t)                   Rate function over time
μ                      Poisson rate
h_k = (t_k, x_k, z_k)  Tuple for the k-th event, with time, location, and mark information
H_t                    History of events up to time t: {h_k}_{t_k < t}
S(t, x | H_t)          Rate function defined over time, location, and history

Chapter 2

Mixed Membership Blockmodels for Dynamic Networks with Feedback

2.1 Introduction

Networks are a useful paradigm for representing various social, biological, and technological systems. One of the recent emerging themes in machine learning research is to model the structure and formation of complex networks. This task is made more difficult when the nodes in the network and the topology of the network change over time. However, the growth of the internet, and of social media in particular, has provided researchers with huge amounts of data that make such studies both feasible and highly desirable.

A standard approach to network modeling is to assume a generative model for links that is based on node attributes. That is, the nodes or objects modeled are assumed to have some (possibly hidden) attributes, e.g., group membership, and these hidden properties determine the formation of links between nodes. A version of this approach which has achieved great success is the mixed membership stochastic blockmodel (MMSB) [3]. MMSBs recognize that nodes often have multiple attributes (mixed membership) that may come into play when determining whether two nodes should be linked.
Thus, MMSBs are a special case of a more general class of latent space models, which assume that nodes' attributes are described in some abstract space and that the formation of links between nodes depends on the distance between their attributes in that space [42, 48].

A common limitation of these approaches is that the attributes of nodes are assumed to be unchanging over time. If the nodes represent people, for instance, we know that attributes like interests, location, or job may change over time, and this may affect a person's connections to the network. In this case, it is necessary to model the dynamics of the nodes' hidden attributes as well. This difficulty is compounded by the possibility that the change in a node's attributes at one time step may depend on the network structure at previous time steps. This in turn may affect the network structure in the future, causing a feedback loop between node dynamics and network evolution. A concrete example of this phenomenon occurs in social networks. For instance, it is known that new friendship links are often formed as a result of selection effects like homophily: actors often befriend people with similar interests [77]. In turn, social actors introduce their friends to new ideas and interests in a process known as social influence or diffusion. Together, these dynamics cause both the nodes and the network structure to evolve simultaneously.

Our contribution is to combine a model of node dynamics that depends on network topology with an MMSB-inspired generative model for link formation that depends on changing node attributes. We use this model to describe the co-evolution of selection and influence in real-world dynamic network data. The rest of the chapter is structured as follows. We begin with a high-level description of dynamic networks and how we can adapt MMSBs to describe them, followed by a discussion of related work.
In Section 2.2, we describe the details of our co-evolving mixed membership stochastic blockmodel (CMMSB), including a discussion of how to efficiently infer model parameters. In Section 2.3, we apply CMMSB to a synthetic dataset and a real-world dataset consisting of the bill co-sponsorship network among U.S. senators. A discussion of results follows in Section 2.4. Detailed calculations can be found in the Appendix.

2.1.1 Selection and influence in networks

Suppose we have N nodes and we observe a network structure among them at discrete time steps t = 0, 1, ..., T. If there exists a directed link from node p to node q at time t, we say Y^t(p, q) = 1, and otherwise Y^t(p, q) = 0. There are many examples of real-world data that fit this format, including friendship ties in a social network and gene regulatory networks. We suppose that the nodes themselves are described by some hidden attribute that changes over time, i.e., node p is described at time t by \vec{\mu}^t_p. For a social network, this vector could represent interests, group membership, or behavioral traits, while in a gene regulatory network it could indicate the response to stages of a cell cycle. Then by selection we mean that the probability of a link between two nodes depends on their attribute vectors:

    Prob(Y^t(p, q) = 1) = g(\vec{\mu}^t_p, \vec{\mu}^t_q).    (2.1)

One of the most famous forms of selection is homophily, or assortative mixing, which states that nodes tend to interact with other nodes that have similar attributes. We stress, however, that different selection mechanisms are possible as well (e.g., disassortative mixing patterns such as buyer-seller relationships). The next step is to explicitly model the dynamics in the latent space. For instance, \vec{\mu} may drift over time, or perhaps it responds to either one-time or recurring external events. As discussed in the introduction, we are particularly interested in modeling the influence of a node's neighbors on its dynamics.
Toward this end, we allow a feedback mechanism where an interaction at one time step affects the position of the node at the next time step. That is, we want to model dynamics of the form

    \vec{\mu}^{t+1}_p = f(\vec{\mu}^t_p, \{\vec{\mu}^t_q\}_{q \in \mathcal{S}^t_p}),    (2.2)

where \mathcal{S}^t_p denotes the neighbors of node p at time t. For instance, to model positive social influence one should select a function f such that the distance between nodes contracts after an interaction. It is possible to have more general (e.g., repulsive) interactions as well, depending on the concrete scenario.

Together, Equations 2.1 and 2.2 provide a very high-level description of our approach. We would like to emphasize that while distance-based interaction (such as given by Equation 2.1) is at the core of most prior work, introducing a feedback mechanism via the influence model of Equation 2.2 is one of the main ideas distinguishing our approach from a previous attempt to formulate dynamic MMSBs in [27]. Once we have specified a model for node dynamics, the task of fixing a model for link formation remains. Ideally, a generative model for link formation based on the node dynamics should capture our intuitions about real link formation while admitting some uncertainty and allowing efficient inference. For these reasons, we chose to adapt MMSBs, which we describe in the next section.

2.1.2 MMSBs

In this chapter we use a latent space representation of the nodes based on MMSBs [3]. In this section, we purposely adhere to a high-level description of MMSBs and their dynamic extensions; a detailed implementation is discussed in Section 2.2. Starting with a static MMSB, each node has a normalized mixed membership vector \vec{\pi}_p \in \mathbb{R}^K, which describes the probability of node p taking each of K roles. The role that a node takes in a particular interaction is sampled according to the
The role that a node takes in a particular interaction is sampled according to the 11 membership vector, and the probability of a link between𝑝,𝑞 then depends on the roles they take and the role compatibility matrix,𝐵 . The generative process is as follows: ⃗ 𝜋 𝑝 ∼ Prior distribution ⃗ 𝑧 𝑝 →𝑞 ∼ Multinomial(⃗ 𝜋 𝑝 ) ⃗ 𝑧 𝑞→𝑝 ∼ Multinomial(⃗ 𝜋 𝑞 ) 𝑌 (𝑝,𝑞 ) ∼ Bernoulli(⃗ 𝑧 ⊤ 𝑝 →𝑞 𝐵⃗ 𝑧 𝑝 ←𝑞 ) The most naive dynamic extension is to simply add a𝑡 index to all the variables in the previous expression. This amounts to learning𝑇 independent, static MMSBs and fails to take into account any of our knowledge of the underlying node dynamics. An extension considered in [27] is to say that the prior distribution for the⃗ 𝜋 𝑡 should evolve over time. However, each mixed membership vector is still sampled from the same distribution at each time, so the effect is to model only aggregate dynamics. In contrast, and as discussed in the previous section, we would prefer that the mixed membership vector of nodes to evolve individually but under mutual influence. The par- ticular form of influence we will study is ⃗ 𝜇 𝑡 +1 𝑝 = (1−𝛽 𝑝 )⃗ 𝜇 𝑡 𝑝 +𝛽 𝑝 ⃗ 𝜇 𝑡 𝑎𝑣𝑔 + noise term, (2.3) where ⃗ 𝜇 𝑡 𝑎𝑣𝑔 = 1 |𝒮 𝑡 𝑝 | ∑︀ 𝑞∈𝒮 𝑡 𝑝 𝑤 𝑝 ←𝑞 ⃗ 𝜇 𝑡 𝑞 is the weighted average of node 𝑞 ’s neighbors log- membership vectors. Thus, the membership vector of node𝑞 at time𝑡 + 1 is a weighted average of his membership vector at time 𝑡 as well as the membership vectors of the nodes he has interacted at time 𝑡 . This feature of our model has the desired effect of incorporating feedback between network structure and individual node dynamics. The 12 relative importance of the neighbors is captured by the parameter 0<𝛽 𝑝 < 1: larger𝛽 𝑝 means that node𝑝 is more susceptible to influence from his neighbors. Before proceeding further, we note that exact inference is not feasible even for static MMSB, so adding dynamics to a model makes the inference problem much harder. 
Here we use a variational EM [7, 88] approach that allows us to perform efficient approximate inference.

2.1.3 Related Work

The problem of properly characterizing selection and influence has been a subject of extensive study in sociology. For instance, [78] suggested a continuous-time agent-based model of network co-evolution. In this model, each agent is characterized by a certain utility function that depends on the agent's individual attributes as well as his/her local neighborhood in the network. The agents evolve as continuous-time Markovian processes which, at randomly chosen time points, select an action to maximize their utility. Despite its intuitive appeal, a serious shortcoming of this model is that it cannot handle missing data well; thus most of the attributes have to be fully observable. This was addressed in [26], where a continuous Dynamic Bayesian approach was developed. Continuous-time models have certain advantages when the network observations are infrequent and well-separated in time. In situations where more fine-grained data is available, however, discrete-time models are more suitable [39]. The model presented here is based on Mixed Membership Stochastic Blockmodels [3]. MMSBs are an extension of stochastic blockmodels, which have been studied extensively both in the social sciences and in computer science [43, 30]. In a stochastic blockmodel each node is assigned to a block (or a role), and the pattern of interactions between different nodes depends only on their block assignments. Many situations, however, are better described by multi-faceted interactions, where nodes can bear multiple latent roles that influence their relationships to others. MMSB accounts for such "mixed" interactions by allowing each node to have a probability distribution over roles, and by making the interactions role-dependent [3]. A different approach to mixed membership community detection has been developed in physics [5, 2].
In particular, [2] suggested a definition of communities in terms of links rather than nodes. Previously, a dynamic extension of the MMSB (dMMSB) was suggested in [27]. In contrast to dMMSB, where the dynamics is imposed externally, our model assumes that the membership evolution is driven by the interactions between the nodes through a parametrized influence mechanism. At the same time, the patterns of those interactions themselves change due to the evolution of the node memberships. An advantage of the present model over dMMSB is that the latter models aggregate dynamics, e.g., the mean of the logistic normal distribution from which the membership vectors are sampled. CMMSB, however, models each node's trajectory separately, thus providing better flexibility for describing system dynamics. Of course, more flexibility comes at a higher computational cost, as CMMSB tracks the trajectories of all nodes individually. This additional cost, however, can be well justified in scenarios where the system as a whole is almost static (e.g., no shift in the mean membership vector), but different subsystems experience dynamic changes. One such scenario, dealing with political polarization in the U.S. Senate, is presented in our experimental results section.

2.2 Co-evolving Mixed Membership Blockmodel

Consider a set of $N$ nodes, each of which can have $K$ different roles, and let $\vec{\pi}_p^t$ be the mixed membership vector of node $p$ at time $t$. Let $Y^t$ be the network formed by those nodes at time $t$: $Y^t(p,q) = 1$ if the nodes $p$ and $q$ are connected at time $t$, and $Y^t(p,q) = 0$ otherwise. Further, let $Y^{0:T} = \{Y^0, Y^1, \ldots, Y^T\}$ be a time sequence of such networks. The generative process that induces this sequence is described below.
∙ For each node $p$ at time $t = 0$, employ a logistic normal distribution¹ to sample an initial membership vector,

$\pi_{p,k}^0 = \exp(\mu_{p,k}^0 - C(\vec{\mu}_p^0))$, $\quad \vec{\mu}_p^0 \sim \mathcal{N}(\vec{\alpha}^0, A)$

where $C(\vec{\mu}) = \log(\sum_k \exp(\mu_k))$ is a normalization constant, and $\vec{\alpha}^0$, $A$ are the prior mean and covariance matrix.

∙ For each node $p$ at time $t > 0$, the mean of each normal distribution is updated due to influence from the neighbors at the previous step:

$\vec{\alpha}_p^t = (1-\beta_p)\vec{\mu}_p^{t-1} + \beta_p \vec{\mu}_{\mathcal{S}_p^{t-1}}$

where $\vec{\mu}_{\mathcal{S}_p^t}$ is the weighted average of the membership vectors $\vec{\mu}$ of the nodes that node $p$ is connected to at time $t$:

$\vec{\mu}_{\mathcal{S}_p^t} = \frac{1}{|\mathcal{S}_p^t|}\sum_{q\in\mathcal{S}_p^t} w_{p\leftarrow q}^t \vec{\mu}_q$

Here $\beta_p$ describes how easily node $p$ is influenced by its neighbors, while the weights $w$ allow for different degrees of influence from different neighbors. The membership vector at time $t$ is

$\pi_{p,k}^t = \exp(\mu_{p,k}^t - C(\vec{\mu}_p^t))$, $\quad \vec{\mu}_p^t \sim \mathcal{N}(\vec{\alpha}_p^t, \Sigma_\mu)$

where the covariance $\Sigma_\mu$ accounts for noise in the evolution process.

∙ For each pair of nodes $p$, $q$ at time $t$, sample role indicator vectors from multinomial distributions:

$\vec{z}_{p\to q}^t \sim \mathrm{Multinomial}(\vec{\pi}_p^t)$, $\quad \vec{z}_{p\leftarrow q}^t \sim \mathrm{Multinomial}(\vec{\pi}_q^t)$

Here $\vec{z}_{p\to q}$ is a unit indicator vector of dimension $K$, so that $z_{p\to q,k} = 1$ means node $p$ undertakes role $k$ while interacting with $q$.

∙ Sample a link between $p$ and $q$ as a Bernoulli trial:

$Y^t(p,q) \sim \mathrm{Bernoulli}((1-\rho)\,\vec{z}_{p\to q}^{t\,\top} B^t \vec{z}_{p\leftarrow q}^t)$

where $B$ is a $K\times K$ role-compatibility matrix, so that $B_{rs}^t$ describes the likelihood of interaction between two nodes in roles $r$ and $s$ at time $t$. When $B^t$ is diagonal, the only possible interactions are among nodes in the same role. Here $\rho$ is a parameter that accounts for the sparsity of the network [3].

¹ We found that the logistic normal form of the membership vector suggested in [27] leads to more tractable equations compared to the Dirichlet distribution used for static MMSBs.
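A compact sketch of one time slice of this generative process follows. All parameter values ($N$, $K$, $\rho$, $B$, and the prior) are arbitrary illustrations, not values used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(1)
N, K, rho = 4, 2, 0.1
B = np.array([[0.9, 0.05],   # role-compatibility matrix, heavy on the diagonal
              [0.05, 0.8]])

# latent positions and their logistic-normal transform to membership vectors
mu = rng.normal(loc=[2.0, -2.0], scale=1.0, size=(N, K))
pi = np.exp(mu - np.log(np.exp(mu).sum(axis=1, keepdims=True)))

Y = np.zeros((N, N), dtype=int)
for p in range(N):
    for q in range(N):
        if p == q:
            continue
        z_pq = rng.choice(K, p=pi[p])   # role p takes toward q
        z_qp = rng.choice(K, p=pi[q])   # role q takes toward p
        # Bernoulli link, thinned by the sparsity parameter rho
        Y[p, q] = rng.random() < (1 - rho) * B[z_pq, z_qp]
```

Each row of `pi` sums to one by construction, and the simplex plots in the experiments (Fig. 2.1) are projections of exactly such vectors.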
Thus, the coupling between the dynamics of different nodes is introduced by allowing the role vector of a node to be influenced by the role vectors of its neighbors. For computational simplicity, we update $\vec{\pi}$ by changing its associated $\vec{\mu}$. This update of $\vec{\mu}$ is a linear combination of its current state and the values of its neighbors' current states. The influence is measured by a node-specific parameter $\beta_p$ and by $w_{p\leftarrow q}^t$, which need to be estimated from the data. $\beta_p$ describes how easily node $p$ is influenced by its neighbors: $\beta_p = 0$ means it is not influenced at all, whereas $\beta_p = 1$ means its behavior is solely determined by the neighbors. On the other hand, $w_{p\leftarrow q}^t$ reflects the weight of the specific influence that node $q$ exerts on node $p$, so that larger values correspond to more influence.

2.2.1 Inference and Learning

Under the Co-Evolving MMSB, the joint probability of the data $Y^{0:T}$ and the latent variables $\{\vec{\mu}_{1:N}^t, \vec{z}_{p\to q}^t : p,q\in N, \vec{z}_{p\leftarrow q}^t : p,q\in N\}$ can be written in the following factored form. To simplify the notation, we define $\vec{z}_{p,q}^t$ as the pair of $\vec{z}_{p\to q}^t$ and $\vec{z}_{p\leftarrow q}^t$. Also denote the sets of latent group indicators $\{\vec{z}_{p\to q}^t : p,q\in N\}$ and $\{\vec{z}_{p\leftarrow q}^t : p,q\in N\}$ as $\vec{Z}_\to^t$ and $\vec{Z}_\leftarrow^t$.

$p(Y^{0:T}, \vec{\mu}_{1:N}^{0:T}, \vec{Z}_\to^{0:T}, \vec{Z}_\leftarrow^{0:T} \mid \vec{\alpha}, A, B, \beta_p, w_{p\leftarrow q}^t, \Sigma_\mu) = \prod_t \prod_{p,q} P(Y^t(p,q)\mid\vec{z}_{p,q}^t, B^t)\, P(\vec{z}_{p,q}^t\mid\vec{\mu}_p^t, \vec{\mu}_q^t) \times \prod_p P(\vec{\mu}_p^0\mid\vec{\alpha}^0, A) \prod_{t\neq 0} P(\vec{\mu}_p^t\mid\vec{\mu}_p^{t-1}, \vec{\mu}_{\mathcal{S}_p^{t-1}}, \Sigma_\mu, \beta_p)$ (2.4)

In Equation 2.4, the term describing the dynamics of the membership vector is defined as follows²:

$P(\vec{\mu}_p^t\mid\vec{\mu}_p^{t-1}, \vec{\mu}_{\mathcal{S}_p^{t-1}}, \Sigma_\mu, Y^t, \beta_p) = f_G\!\left(\vec{\mu}_p^t - f_b(\vec{\mu}_p^{t-1}, \vec{\mu}_{\mathcal{S}_p^{t-1}}),\, \Sigma_\mu\right)$ (2.5)

$f_G(\vec{x}, \Sigma_\mu) = \frac{1}{(2\pi)^{k/2}|\Sigma_\mu|^{1/2}}\, e^{-\frac{1}{2}x^{T}\Sigma_\mu^{-1}x}$

$f_b(\vec{\mu}_p^{t-1}, \vec{\mu}_{\mathcal{S}_p^{t-1}}) = (1-\beta_p)\vec{\mu}_p^{t-1} + \beta_p\vec{\mu}_{\mathcal{S}_p^{t-1}}$

As we already mentioned, performing exact inference with this model is not feasible. Thus, one needs to resort to approximate techniques.
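The transition term of Eqs. 2.4-2.5 can be sketched directly. This is a minimal sketch assuming a diagonal covariance (per footnote 2); the function and variable names are ours.

```python
import numpy as np

def f_b(mu_prev, mu_nbr, beta):
    """Influence-shifted mean: (1 - beta) mu_p^{t-1} + beta mu_{S_p^{t-1}}."""
    return (1 - beta) * mu_prev + beta * mu_nbr

def log_f_G(x, var_diag):
    """Log-density of a zero-mean Gaussian with diagonal covariance, at x."""
    return -0.5 * (np.sum(np.log(2 * np.pi * var_diag)) + np.sum(x**2 / var_diag))

# a node whose new position lands exactly on the influence-shifted mean
mu_prev = np.array([0.5, -0.5])
mu_nbr = np.array([1.0, 0.0])
mu_t = np.array([0.7, -0.3])
var = np.array([0.25, 0.25])
ll = log_f_G(mu_t - f_b(mu_prev, mu_nbr, beta=0.4), var)
```

Since the residual is zero here, `ll` is the maximal attainable log-density for this covariance, which is a quick sanity check on the density's centering.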
Here we use a variational EM [7, 88] approach. The main idea behind variational methods is to posit a simpler distribution $q(X)$ over the latent variables with free (variational) parameters, and then fit those parameters so that the distribution is close to the true posterior in KL divergence:

$D_{KL}(q\,\|\,p) = \int_X q(X)\log\frac{q(X)}{p(X,Y)}\, dX$ (2.6)

Here we introduce the following factorized variational distribution:

$q(\vec{\mu}_{1:N}^{0:T}, Z_\to^{0:T}, Z_\leftarrow^{0:T} \mid \vec{\gamma}_{1:N}^{0:T}, \Phi_\to^{0:T}, \Phi_\leftarrow^{0:T}) = \prod_{p,t} q_1(\vec{\mu}_p^t\mid\vec{\gamma}_p^t, \Sigma_p^t) \prod_{p,q,t} \left(q_2(\vec{z}_{p\to q}^t\mid\vec{\phi}_{p\to q}^t)\, q_2(\vec{z}_{p\leftarrow q}^t\mid\vec{\phi}_{p\leftarrow q}^t)\right)$ (2.7)

where $q_1$ is the normal distribution, $q_2$ is the multinomial distribution, and $\vec{\gamma}_p^t$, $\Sigma_p^t$, $\vec{\phi}_{p\to q}^t$, $\vec{\phi}_{p\leftarrow q}^t$ are the variational parameters. Intuitively, $\phi_{p\to q,g}^t$ is the probability of node $p$ undertaking role $g$ in an interaction with node $q$ at time $t$, and $\phi_{p\leftarrow q,h}^t$ is defined similarly.

² For simplicity, we will assume $\Sigma_\mu$ is a diagonal matrix.

For this choice of the variational distribution, we rewrite Equation 2.6 as follows:

$D_{KL}(q\,\|\,p) = E_q[\log\prod_t\prod_p q_1(\vec{\mu}_p^t\mid\vec{\gamma}_p^t,\Sigma_p^t)] + E_q[\log\prod_t\prod_{p,q} q_2(\vec{z}_{p\to q}^t\mid\vec{\phi}_{p\to q}^t)] + E_q[\log\prod_t\prod_{p,q} q_2(\vec{z}_{p\leftarrow q}^t\mid\vec{\phi}_{p\leftarrow q}^t)] - E_q[\log\prod_t\prod_{p,q} P(Y^t(p,q)\mid\vec{z}_{p\to q}^t,\vec{z}_{p\leftarrow q}^t,B)] - E_q[\log\prod_t\prod_{p,q} P(\vec{z}_{p\to q}^t\mid\vec{\mu}_p^t)] - E_q[\log\prod_t\prod_{p,q} P(\vec{z}_{p\leftarrow q}^t\mid\vec{\mu}_q^t)] - E_q[\log\prod_{t\neq 0}\prod_p P(\vec{\mu}_p^t\mid\vec{\mu}_p^{t-1},\vec{\mu}_{\mathcal{S}_p^{t-1}},\Sigma_\mu)] - E_q[\log\prod_p P(\vec{\mu}_p^0\mid\vec{\alpha}^0,A)]$ (2.8)

In the expansion above, we need to compute the expected value of $\log[\sum_k \exp(\mu_k)]$ under the variational distribution, which is problematic.
Toward this end, we introduce $N$ additional variational parameters $\zeta$, and replace the expectation of the log by the upper bound induced from the first-order Taylor expansion [9]:

$\log\left[\sum_k \exp(\mu_k)\right] \leq \log\zeta - 1 + \frac{1}{\zeta}\sum_k \exp(\mu_k)$ (2.9)

The variational EM algorithm works by iterating between the E-step of calculating the expectation value using the variational distribution, and the M-step of updating the model (hyper)parameters so that the data likelihood is locally maximized. The pseudocode is shown in Algorithm 1, and the details of the calculations are discussed below.

Algorithm 1 Variational EM
Input: data $Y^t(p,q)$, sizes $N$, $T$, $K$
Initialize all $\{\vec{\gamma}\}^t$, $\{\sigma\}^t$
Start with an initial guess for the model parameters.
repeat
  repeat
    for $t = 0$ to $T$ do
      repeat
        Initialize $\phi_{p\to q}^t$, $\phi_{p\leftarrow q}^t$ to $\frac{1}{K}$ for all $g$, $h$
        repeat
          Update all $\{\phi\}^t$
        until convergence of $\{\phi\}^t$
        Find $\{\vec{\gamma}\}^t$, $\{\sigma\}^t$
        Update all $\{\zeta\}^t$
      until convergence at time $t$
    end for
  until convergence across all time steps
  Update hyperparameters.
until convergence in hyperparameters

2.2.2 Variational E-step

In the variational E-step, we minimize the KL distance over the variational parameters. Taking the derivative of the KL divergence with respect to each variational parameter and setting it to zero, we obtain a set of equations that can be solved via iterative or other numerical techniques. For instance, the variational parameters ($\vec{\phi}_{p\to q}^t$, $\vec{\phi}_{p\leftarrow q}^t$), corresponding to a pair of nodes $(p,q)$ at time $t$, can be found via the following iterative scheme:

$\phi_{p\to q,g}^t \propto \exp(\gamma_{p,g}^t) \prod_h \left(B(g,h)^{Y^t(p,q)}(1-B(g,h))^{1-Y^t(p,q)}\right)^{\phi_{p\leftarrow q,h}^t}$ (2.10)

$\phi_{p\leftarrow q,h}^t \propto \exp(\gamma_{q,h}^t) \prod_g \left(B(g,h)^{Y^t(p,q)}(1-B(g,h))^{1-Y^t(p,q)}\right)^{\phi_{p\to q,g}^t}$ (2.11)

In the above equations, $\phi_{p\to q,g}^t$ and $\phi_{p\leftarrow q,h}^t$ are normalized after each update. Note also that Eqs. 2.10 and 2.11 are coupled with each other as well as with the parameters $\gamma_{p,g}^t$, $\gamma_{q,h}^t$.
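The coupled fixed-point updates of Eqs. 2.10-2.11 for a single pair $(p,q)$ can be sketched as follows. This is a sketch under our own simplifications (time-independent $B$, given $\gamma$ vectors, a fixed iteration count rather than a convergence test).

```python
import numpy as np

def update_phis(gamma_p, gamma_q, B, y, n_iter=50):
    """Coupled updates for phi_{p->q} and phi_{p<-q} at one time step.

    gamma_p, gamma_q : (K,) variational means for nodes p and q
    B                : (K, K) role-compatibility matrix
    y                : observed link Y^t(p, q), 0 or 1
    """
    K = len(gamma_p)
    phi_out = np.full(K, 1.0 / K)   # phi_{p->q}, initialized uniformly
    phi_in = np.full(K, 1.0 / K)    # phi_{p<-q}
    # Bernoulli likelihood of observing y under each role pair (g, h)
    like = B**y * (1 - B)**(1 - y)
    for _ in range(n_iter):
        # the product over h in Eq. 2.10 becomes a weighted sum in log space
        phi_out = np.exp(gamma_p + np.log(like) @ phi_in)
        phi_out /= phi_out.sum()
        phi_in = np.exp(gamma_q + np.log(like).T @ phi_out)
        phi_in /= phi_in.sum()
    return phi_out, phi_in

B = np.array([[0.9, 0.1], [0.1, 0.8]])
phi_out, phi_in = update_phis(np.zeros(2), np.zeros(2), B, y=1)
```

With an observed link and a diagonally weighted $B$, both sides concentrate on the same role, which is the expected behavior of the coupled scheme.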
The sets of variational parameters $\{\vec{\gamma}\}^t$ and $\{\sigma\}^t$ are initialized at the beginning of variational EM. For $\{\vec{\gamma}\}^t$, we sample from the normal distribution $\mathcal{N}(\vec{\alpha}^0, A)$, and $\{\sigma\}^t$ is initialized to the same value for all nodes across all time steps. Once the $\{\phi\}^t$ have converged to their optimal points, we update $\{\vec{\gamma}\}^t$ and $\{\sigma\}^t$ using the update equations. Neither of these variational parameters has a closed-form solution; the details are given in Appendix A.2. Here we simply note that their general form is:

$\vec{\gamma}_p^t = f_\gamma(\vec{\gamma}_p^{t-1}, \vec{\gamma}_p^{t+1}, \vec{\gamma}_q^t, \vec{\phi}_{p\to q}^t, \vec{\phi}_{q\leftarrow p}^t, \zeta_p^t, \Sigma_p^t)$ (2.12)

Thus, the parameter $\vec{\gamma}_p^t$ depends on its immediate past and future values, $\vec{\gamma}_p^{t-1}$ and $\vec{\gamma}_p^{t+1}$, as well as the parameters of its neighbors. For the variational parameters of the covariance matrix $\Sigma_p^t$, which is assumed to be diagonal with components $((\sigma_{p,1}^t)^2, (\sigma_{p,2}^t)^2, \ldots, (\sigma_{p,k}^t)^2)$, the general form of the optimal point is:

$\sigma_{p,k}^t = f_\sigma(\gamma_{p,k}^t, \zeta_p^t)$ (2.13)

Finally, for the variational parameters $\zeta$ we have

$\zeta_p^t = \sum_i \exp\!\left(\gamma_{p,i}^t + \frac{(\sigma_{p,i}^t)^2}{2}\right)$ (2.14)

Note that the above equations can be solved via simple iterative updates as before. To expedite convergence, however, we combine the iterations with the Newton-Raphson method, where we solve for individual parameters while keeping the others fixed, and then repeat this process until all the parameters have converged.

2.2.3 Variational M-step

The M-step of the EM algorithm computes the parameters by maximizing the expected log-likelihood found in the E-step. The model parameters in our case are: $B^t$, the role-compatibility matrix; the covariance matrix $\Sigma_\mu$; $\beta_p$ for each node; $w_{p\leftarrow q}^t$ for each pair; and $\vec{\alpha}$ and $A$ from the prior.
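The first of these parameters has a closed-form update (Eq. 2.15 below): a ratio of expected link counts to expected interaction counts per role pair. A minimal sketch with our own array layout follows; note that in practice the sums run over distinct pairs, so self-pairs $p = q$ would be masked out.

```python
import numpy as np

def m_step_B(Y, phi_out, phi_in):
    """Averaged-over-time M-step for the role-compatibility matrix.

    Y       : (T, N, N) observed adjacency tensors
    phi_out : (T, N, N, K) variational probs phi_{p->q}
    phi_in  : (T, N, N, K) variational probs phi_{p<-q}
    """
    num = np.einsum('tpq,tpqg,tpqh->gh', Y.astype(float), phi_out, phi_in)
    den = np.einsum('tpqg,tpqh->gh', phi_out, phi_in)
    return num / np.maximum(den, 1e-12)  # guard empty role pairs

# toy check: everyone hard-assigned to role 0 and fully linked
phi_out = np.zeros((1, 2, 2, 2)); phi_out[..., 0] = 1.0
phi_in = phi_out.copy()
Bhat = m_step_B(np.ones((1, 2, 2)), phi_out, phi_in)
```

With hard role assignments the update reduces to the empirical link density per role pair, which is a useful sanity check on the estimator.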
If we assume that the time variation of the block compatibility matrix is small compared to the evolution of the node attributes, we can neglect the time dependence in $B$ and use its average across time, which yields:

$\hat{B}(g,h) = \frac{\sum_{p,q,t} Y^t(p,q)\,\phi_{p\to q,g}^t\,\phi_{p\leftarrow q,h}^t}{\sum_{p,q,t}\phi_{p\to q,g}^t\,\phi_{p\leftarrow q,h}^t}$ (2.15)

Likewise, for the update of the diagonal components of the noise covariance matrix $\Sigma_\mu$,

$(\hat{\eta}_k)^2 = \frac{1}{N(T-1)}\, E_q\!\left[\sum_{p,t}\left(\mu_{p,k}^t - (1-\beta_p)\mu_{p,k}^{t-1} - \beta_p\,\mu_{\mathcal{S}_p^{t-1},k}\right)^2\right]$ (2.16)

Similar equations are obtained for $\beta_p$ and $w_{p\leftarrow q}^t$. The update equations for $\beta_p$ and $w_{p\leftarrow q}^t$ are functions of the $\gamma$ and $\sigma$ related to the transitions of the specific node $p$:

$\beta_p = \frac{\sum_{t>0}\sum_k\left((\gamma_{p,k}^{t-1})^2 + (\sigma_{p,k}^{t-1})^2 - \gamma_{p,k}^t\gamma_{p,k}^{t-1} - \gamma_{p,k}^{t-1}\gamma_{\mathcal{S}_p^{t-1},k} + \gamma_{p,k}^t\gamma_{\mathcal{S}_p^{t-1},k}\right)}{\sum_{t>0}\sum_k\left((\gamma_{p,k}^{t-1})^2 + (\sigma_{p,k}^{t-1})^2 - 2\gamma_{p,k}^{t-1}\gamma_{\mathcal{S}_p^{t-1},k}\right) + \sum_{t>0}\sum_k\left(\gamma_{\mathcal{S}_p^{t-1},k}^2 + \sigma_{\mathcal{S}_p^{t-1},k}^2\right)}$

where $\vec{\gamma}_{\mathcal{S}_p^t}$ and $\Sigma_{\mathcal{S}_p^t}$ are the mean and covariance of the set of nodes that node $p$ is connected to at time $t$. The priors of the model can be expressed in closed form as follows:

$\vec{\alpha}^0 = \frac{1}{N}\sum_p \vec{\gamma}_p^0$ (2.17)

$a_k = \sqrt{\frac{1}{N}\sum_p\left((\gamma_{p,k}^0)^2 + (\sigma_{p,k}^0)^2 - 2\alpha_k^0\gamma_{p,k}^0 + (\alpha_k^0)^2\right)}$ (2.18)

2.3 Results

2.3.1 Experiments on Synthetic Data

[Figure 2.1: Actual and inferred mixed membership trajectories on a simplex.]
[Figure 2.2: Inference error (average $L_2$ distance) for dMMSB and CMMSB on synthetic data generated with $K = 2$: (a) $\beta = 0.1$ for all nodes; (b) $\beta = 0.2$ for all nodes.]

We tested our model by generating a sequence of networks according to the process described above, for 50 nodes and $K = 3$ latent roles across $T = 8$ time steps. We use a covariance matrix $A = 3I$ and a mean $\vec{\alpha}^0$ with homogeneous values for the prior, so that initially nodes have a well-defined role (i.e., the membership vector is peaked around a single role). More precisely, the majority of nodes had around 90% of their membership probability mass centered at a specific role, and on average a third of those nodes had 90% on role $k$. For the role-compatibility matrix, we gave high weight to the diagonal. Starting from some initial parameter estimates, we performed variational EM and obtained re-estimated parameters that were very close to the original values (ground truth). With those learned parameters, we inferred the hidden trajectory of agents as given by their mixed membership vector at each time step. The results are shown in Fig. 2.1, where, for three nodes, we plot the projection of the trajectories onto the simplex. One can see that for all three nodes, the inferred trajectories are very close to the actual ones.

2.3.2 Comparison with dMMSB

As a further verification of our results, we compare the performance of our inference method to the dynamic mixed membership stochastic blockmodel (dMMSB) [27]. We use synthetic data generated in a manner similar to the previous section.
This time, though, for simplicity we keep $K = 2$ and set $\beta$ to the same constant for all nodes: $\beta = 0.1$ in one trial and $\beta = 0.2$ in the other. We compare performance by evaluating the $L_2$ distance between the actual and inferred mixed membership vectors for each method. At each time step, we calculate the $L_2$ distance from the actual membership vector, averaged over all nodes. As shown in Figs. 2.2(a) and 2.2(b), CMMSB captures the dynamics better than dMMSB. This is due to the fact that our model tracks all of the nodes individually (internal dynamics), while dMMSB regards the dynamics as an evolution of the environment (external dynamics). Here, we have only included results for relatively small and homogeneous dynamics. In fact, we noticed that our method tends to fare even better as we increase the degree of dynamics or the heterogeneity of dynamics across nodes (node-varying values of $\beta$). We believe heterogeneous dynamics is more prevalent in real systems, and so we expect our method to outperform dMMSB even more than is indicated by Fig. 2.2(b).

2.3.3 US Senate Co-Sponsorship Network

We have also performed some preliminary experiments testing our model against real-world data. In particular, we used Senate co-sponsorship networks from the 97th to the 104th Senate, treating each Senate as a separate time point in the dynamics. There were 43 senators who remained in the Senate during this period. For any pair of senators $(p,q)$ in a given Senate, we generated a directed link $p \to q$ if $p$ co-sponsored at least 3 bills that $q$ originally sponsored. The threshold of 3 bills was chosen to avoid too dense a network. With this data, we wanted to test (a) to what extent senators tend to follow others who share their political views (i.e., conservative vs. liberal) and (b) whether some senators change their political creed more easily than others.
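The network construction just described can be sketched as follows. The bill records and senator names here are invented for illustration; the real data come from the Senate records described above.

```python
from collections import defaultdict

def build_cosponsorship_network(bills, threshold=3):
    """Directed edges (p, q): p co-sponsored >= threshold bills sponsored by q.

    bills: list of (sponsor, [cosponsors]) records for one congress.
    """
    counts = defaultdict(int)
    for sponsor, cosponsors in bills:
        for c in cosponsors:
            counts[(c, sponsor)] += 1   # edge direction: cosponsor -> sponsor
    return {edge for edge, n in counts.items() if n >= threshold}

# P co-sponsors 3 of Q's bills, R co-sponsors 3 of Q's bills,
# but P co-sponsors only 2 of R's bills (below the threshold)
bills = [("Q", ["P", "R"])] * 3 + [("R", ["P"])] * 2
edges = build_cosponsorship_network(bills)
```

The threshold drops low-frequency pairs, which is exactly how the text avoids an overly dense network.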
The number of roles $K = 2$ was chosen to reflect the mostly bi-polar nature of the US Senate. The susceptibility of senator $p$ to influence is measured by the corresponding parameter $\beta_p$, which is learned using the EM algorithm. High $\beta$ means that a senator tends to change his/her role more easily. Likewise, the power of influence of senator $q$ on senator $p$ is measured by the parameter $w_{p\leftarrow q}^t$, where $w_{p\leftarrow q_1}^t > w_{p\leftarrow q_2}^t$ means senator $q_1$ is more influential on senator $p$ than senator $q_2$. Here the direction of the arrow reflects the direction of the influence, which is opposite to the direction of the link. To initialize the EM procedure, we assigned the same $\beta$ and $w$ to all the senators, and started with a diagonally weighted matrix for $B$. Another method of validation is to compare the degrees of influence. Our model handles, and learns, the degree of influence in the update equation. Ranking influential senators is an area of active research. Since 2005, KNOWLEGIS has been ranking US senators based on various criteria, including influence. Since our data was extracted from the 97th through the 104th Senates, direct comparison of the rankings was impossible. Another study [53] ranked the 10 most influential senators in both parties who have been elected since 1955. Among our top 5 influential senators, we were able to find 3 (Sen. Byrd, Sen. Thurmond, and Sen. Dole) in that list.

2.3.4 Interpreting Results

The role-compatibility matrix learned via variational EM has high values on the diagonal, confirming our intuition that interaction is indeed more likely between senators who share the same role. Furthermore, the learned values of $\beta$ showed that senators varied in their "susceptibility". In particular, Sen. Arlen Specter was found to be the most easily influenced, while Sen. Dole was found to be one of the most inert.
Note that while there is no direct way of estimating the "dynamism" of senators, our results seem to agree with our intuition about both senators (e.g., Sen. Specter switched parties in 2009, while Sen. Dole became his party's candidate for President in 1996). To get some independent verification, we compared our results to the yearly ratings that the ACU (American Conservative Union) and the ADA (Americans for Democratic Action) assign to senators³. The ACU/ADA rate every senator based on selected votes which they believe to have a clear ideological distinction, so that a high ACU score means a senator is truly conservative, while a lower ACU score suggests they are liberal, and for the ADA vice versa. To compare the ratings with our predictions (given by the membership vector) we scaled the former to get scores in the range [0, 1]. Fig. 2.3 shows the relationship between these scores and our mixed membership vector score, confirming our interpretation of the two roles in our model as corresponding to liberal/conservative. Although those values cannot be used for quantitative agreement, we found that, at least qualitatively, the inferred trajectories agree reasonably well with the ACU/ADA ratings. This agreement is rather remarkable, since the ACU/ADA scores are based on selected votes rather than on the co-sponsorship network used in our data.

³ Accessible at http://www.conservative.org/, http://www.adaction.org/

[Figure 2.3: Correlation between ACU/ADA scores and inferred probabilities of role 1 ($R^2 = 0.82629$ for ACU, $R^2 = 0.84766$ for ADA).]

Of course, we are most interested in correctly identifying the dynamics for each senator. We compare the inferred trajectories of the most dynamic senator and the most inert senator to the ACU and ADA scores.
In Fig. 2.4 the ADA scores have been flipped, so that all of the scores can be compared on the same scale. However, since ACU/ADA scores are assigned to every senator each year, the dynamics of our inference and the dynamics of the ACU/ADA scores cannot be compared one to one. Not all senators showed as high a correlation in trend as senators Specter and Dole.

2.3.5 Polarization Dynamics

The yearly ACU/ADA scores give a good comparison of the relative political positions of the senators scored in each year. However, they are not very appropriate for comparisons between years, a point illustrated by the fact that the score is based on voting records for different bills in each year. Therefore, for validation of the dynamics we turn to another scoring system, highly regarded by political scientists and used to observe historical trends: the DW-NOMINATE score.

[Figure 2.4: Comparison of inference results with ACU and ADA scores: Sen. Specter (left) and Sen. Dole (right).]

[Figure 2.5: Polarization trends during 97th–104th US Congresses.]

For the time period of our study, [54] shows
that the political polarization of the Senate was increasing. In particular, they show that the gap between the average DW-NOMINATE scores of Republicans and Democrats is monotonically increasing, as we show in Fig. 2.5. In fact, the polarization of the entire Senate was stronger every year than that of our subset. This is due to the unbalanced seats in the entire Senate: our data had 22 Republicans and 21 Democrats, while in the entire Senate the majority outnumbered the minority by around 10 seats. For comparison, at each time step we took the average of our inferred score for the 14 most and the 14 least conservative senators. As we show in Fig. 2.5, our inferred result agrees qualitatively with the results of [54], showing an increase in polarization for every Senate in the studied time window. Since the DW-NOMINATE score uses its own metric, and our polarization is measured by the difference between the upper-average and lower-average probabilities, we should not expect quantitative agreement. We would like to highlight, however, that the direction of the trend is correctly predicted for each of the eight terms.

2.4 Summary

We have presented the Co-evolving Mixed Membership Blockmodel for modeling inter-coupled node and link dynamics in networks. We used a variational EM approach for learning and inference with CMMSB, and were able to reproduce the hidden dynamics of synthetically generated data, both qualitatively and quantitatively. We also tested our model on US Senate bill co-sponsorship data, and obtained reasonable results in our experiments. In particular, CMMSB was able to detect increasing polarization in the Senate, as reported by other sources that analyze the individual voting records of the senators. Our results with the U.S. Senate dataset suggest that our dynamical model can actually capture some nuances of individual dynamics.
While we lack a ground truth for the true positions of senators, third-party analyses qualitatively support the findings of our model. Of course, many factors are not explicitly modeled in our approach, but we hope that by including individual dynamical terms we capture these effects implicitly. For instance, external events like upcoming re-election campaigns surely affect a senator's actions. While the true chain of events may depend on such external events, if all relevant external events are not or cannot be included in our model, then capturing the dynamics through shifts in observed relationships is a good proxy. The approach to modeling influence described in Section 2.2 is only one of several possibilities. Although we learned a static parameter $\beta$ for each node, describing how easily influenced it is, we also pointed out the possibility of adding a weight that varies for each pair: that is, a node may be more influenced by one person than another. Additionally, someone's influence may change over time. Finally, we chose a simple linear influence mechanism. In principle, someone may be more influential along one axis than another. For instance, a node may be influenced by a friend's musical taste, but not by his politics. As future work, we intend to test our model against different real-world data, such as communication networks or co-authorship networks of publications. We also plan to extend CMMSB in several ways. A significant bottleneck of the current model is that it explicitly considers links between all pairs of nodes, resulting in quadratic complexity in the network size. Most real-world networks, however, are sparse, which is not accounted for in the current approach. Introducing sparsity into the model would greatly enhance its efficiency. We note that this is also a drawback of static MMSBs, but progress has already been made towards reducing this complexity [58].
An additional drawback of MMSB (and stochastic blockmodels in general) is the inability to properly deal with degree heterogeneity. Indeed, MMSB (or related latent space models) might assign nodes to the same group based merely on the frequency of their interactions with other nodes. Possible remedies are found in the degree-corrected blockmodel recently proposed in [45], or in exponential random graph models that separately model node and group variability [69]. The problem reveals a fundamental ambiguity in network modeling. A priori, we have no reason to believe that node connectivity is a less important dimension for clustering nodes than homophily on some hidden attribute. Our intuition leads us to expect otherwise for human networks, but this intuition must be explicitly modeled. In the co-sponsorship network studied here, most senators are well connected, and so the network structure is better explained by political views than by node connectivity. However, large variability in node connectivity has been observed in many social networks, where this effect will have to be explicitly modeled.

Chapter 3
Latent Self-Exciting Point Process Model for Spatial-Temporal Networks

3.1 Introduction

In recent years there has been considerable interest in understanding dynamic social networks. Traditionally, longitudinal analysis of social network data has been limited to relatively small amounts of data collected from manual and time-consuming surveys. The recent development of various sensing technologies, online communication services, and location-based social networks has made it possible to gather time-stamped and geo-coded data on social interactions at an unprecedented scale. Such data can potentially facilitate a better and more nuanced understanding of geo-temporal patterns in social interactions. To harness this potential, it is imperative to have efficient computational models that can deal with spatial-temporal social networks.
One of the main challenges in social network analysis is handling missing data. Indeed, most social network data are generally incomplete, with missing information about links [37, 47, 52], nodes [25], or both [46]. In the repeated-interaction networks studied here, there is another source of data ambiguity that comes from the limited observability of certain interaction events. Namely, even when interactions are recorded, information about the participants might be missing or only partially known. A real-world problem that highlights the latter scenario concerns the inter-gang rivalry network in Los Angeles, where the records of violent events between rival gangs might lack information about one or both participants [80]. Here we formalize the missing label problem for spatial-temporal networks by introducing the Latent Point Process Model, or LPPM, to describe geographically distributed interaction events between pairs of entities. LPPM assumes that the interaction between each pair is governed by a spatial-temporal point process. In contrast to existing models [38, 63, 68, 75], however, it allows a non-trivial generalization where certain attributes of those events are not fully observed. Instead, they need to be inferred from the available observations. To illustrate the problem, consider a sequence of events generated by $M$ temporal point processes; see Figure 3.1. Each sequence is generated via a non-homogeneous and possibly history-dependent point process. The combined time series is a marked point process, where the mark, or label, describes the component that generated the event. The observed data consist of the recorded events. However, we assume that those labels are only partially observable and need to be inferred from the observations. How well can one identify the label of a specific event based on limited observations? The answer depends on the nature of the process generating the events.
For instance, if the events in Figure 3.1 are generated by a set of independent and homogeneous Poisson processes with intensities $\lambda_1 < \lambda_2 < \lambda_3$, then the identification accuracy is limited by $\lambda_3/(\lambda_1+\lambda_2+\lambda_3)$, i.e., all the unlabeled events are attributed to the process with the highest intensity. Luckily, most real-world processes describing human interactions demonstrate highly non-homogeneous and history-dependent temporal patterns, suggesting that interaction events are not statistically independent, but exhibit non-trivial correlations [6, 79].

Figure 3.1: Schematic demonstration of the missing label problem for temporal point processes. The dashed lines represent events for which the generating process is unknown.

To account for temporal correlations, here we augment LPPM with a model of self-exciting point process known as the Hawkes process, which has been used previously in a number of applications. Furthermore, we use interaction-specific mixture distributions of spatial patterns of interactions to inform the inference problem. Learning and inference with LPPM constitute inferring missing labels, predicting the timing and/or the source of the next event, and so on. Due to missing observations, exact inference and learning are intractable for even moderately large datasets. Toward this end, we develop an efficient algorithm for learning and inference based on the variational EM approach [7]. We validate our model on both synthetic and real-world data. For the latter, we use two distinctly different datasets: (1) data on inter-gang violence from the Los Angeles Police Department; (2) user check-in data from Gowalla, a location-based social networking service. Our results indicate that LPPM outperforms the baselines in both inference and prediction tasks. The rest of the paper is organized as follows: after reviewing some relevant work in Section 3.2, we define our latent point process model in Section 3.3.
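To make the homogeneous-Poisson limit above concrete, the following short simulation (illustrative only, not part of the dissertation's experiments) attributes every event of a superposition of three Poisson processes to the highest-rate component and recovers the accuracy bound $\lambda_3/(\lambda_1+\lambda_2+\lambda_3)$:

```python
import random

def poisson_label_accuracy(rates, n_events=100_000, seed=0):
    """Simulate superposed homogeneous Poisson processes and measure the
    accuracy of attributing every unlabeled event to the highest-rate one."""
    rng = random.Random(seed)
    total = sum(rates)
    best = max(range(len(rates)), key=lambda i: rates[i])
    correct = 0
    for _ in range(n_events):
        # In a superposition, each event comes from process i w.p. rate_i/total.
        u, acc, label = rng.random() * total, 0.0, 0
        for i, r in enumerate(rates):
            acc += r
            if u <= acc:
                label = i
                break
        # The best one can do without further structure is to guess `best`.
        correct += (label == best)
    return correct / n_events
```

With rates 1, 2, and 3, the measured accuracy settles near 3/6 = 0.5, matching the bound.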
In Section 3.4 we describe a variational EM approach for efficient learning and inference with LPPM. We present results of experiments with both synthetic and real-world data in Sections 3.5 and 3.6, and provide some concluding remarks in Section 3.7.

3.2 Related Work

Modeling temporal social networks has attracted considerable interest in recent years. Both discrete-time [39] and continuous-time [26, 78, 85] models have been proposed to study longitudinal networks. In particular, Perry and Wolfe [63] suggested point process models for describing repeated interactions among a set of nodes. They used a Cox hazard model and allowed the interaction intensity to depend on the history of interactions as well as on node attributes. In contrast to our work, however, Ref. [63] assumes that all the interactions are perfectly observable. Different continuous-time models, such as Poisson Cascades [75], Poisson Networks [68], and the Piecewise-Constant Conditional Intensity Model [38], have also been used to describe temporal dependencies between events. In a related line of research, a number of authors have addressed the problem of uncovering hidden networks that facilitate information diffusion and/or activation cascades, based on time-resolved traces of such cascades. Most of the existing approaches rely on temporal information only [31, 32, 33, 23, 83], although several other methods also utilize additional features, such as prior knowledge about the network structure [60], or the content diffusing through the network [87, 84]. The self-exciting point process was originally suggested in seismology to model the aftershocks of earthquakes [61]. Self-exciting models have since been used in a number of diverse applications, such as assessing financial portfolio credit risk [24], detecting terrorist activities [66], and predicting corporate defaults [4]. Recently, Mohler et al. [57] used a spatial-temporal self-exciting process to model urban crime.
Their model, however, studies a different problem and does not assume any missing information. In particular, they consider a univariate point process, as opposed to the multivariate model used here, which is needed to describe interactions among different entities. Stomakhin et al. [80] studied the temporal data reconstruction problem in a very similar setting. Their approach, however, assumes known model parameters, which is impractical in real-world scenarios, thus limiting their experiments to synthetically generated data only. In contrast, LPPM learns the model parameters directly from the data, labeled or unlabeled. More recently, Hegemann et al. [41] proposed a method that does not assume known parameters but learns those parameters using an iterative scheme. The main difference between LPPM and Ref. [41] is that the former is a generative probabilistic model, which allows one to estimate the posterior probability that a certain pair is involved in a given event based on the observations. Ref. [41], on the other hand, calculates heuristic score functions that can be used to rank different pairs' involvement as more or less likely, but those scores cannot be interpreted as probabilities. Furthermore, in contrast to Ref. [41], here we consider both temporal and spatial components, and use the generative model for event prediction.

3.3 Spatial-Temporal Model of Relationship Network

Consider $N$ individuals forming $M$ pairs that are engaged in pairwise interactions with each other. Generally, $M$ would be the total number of undirected edges among $N$ nodes, which is $N(N-1)/2$. However, in some cases (e.g., when the network structure is given or some pairs are not under consideration), the total number of pairs $M$ can be fixed to a smaller size of interest for efficient computation. We observe a sequence of interaction events (called events hereafter) given as $\mathcal{H} = \{h_k\}_{k=1}^{n}$, where each event is a tuple $h_k = (t_k, \mathbf{x}_k, z_k)$.
Here $t_k \in \mathbb{R}^+$ and $\mathbf{x}_k \in \mathbb{R}^2$ are the time and the location of the event, while $z_k$ is the symmetric interaction matrix for event $k$: $z^{ij}_k = 1$ if agents $i$ and $j$ are involved in the $k$-th event, and $z^{ij}_k = 0$ otherwise. Since each event involves only one pair of agents, we have $\sum_{i<j} z^{ij}_k = 1$. Without loss of generality, we assume $t_1 = 0$ and $t_n = T$. Let $\mathcal{H}_t$ denote the history of events up to time $t$, i.e., the set of all the events that have occurred before that time, $\mathcal{H}_t = \{h_k\}_{t_k < t}$. We assume that the interactions between the pairs are point processes with spatial-temporal conditional intensity function $S_{ij}(t, \mathbf{x} \,|\, \mathcal{H}_t)$, so that the probability that agents $i$ and $j$ will interact within a time window $(t, t+dt]$ and location $(\mathbf{x}, \mathbf{x}+d\mathbf{x})$ is simply $S_{ij}(t, \mathbf{x} \,|\, \mathcal{H}_t)\,dt\,d\mathbf{x}$. Note that the intensity function is conditioned on the history of past events. Here we assume that the above intensity function can be factorized into temporal and spatial components as follows:

$$S_{ij}(t, \mathbf{x} \,|\, \mathcal{H}_t) = \lambda_{ij}(t \,|\, \mathcal{H}_t)\, r_{ij}(\mathbf{x}) \qquad (3.1)$$

This factorized form of $S_{ij}(\cdot)$ is based on the separability of spatio-temporal covariance functions [21], assuming that the temporal evolution proceeds independently at each spatial location, depending only on its own history. Note that the temporal conditional intensity $\lambda_{ij}(t \,|\, \mathcal{H}_t)$ is history-dependent, whereas the spatial component is not. The scope of our research is not the influence of spatial preference between nodes, but rather the spatial activities of pairs. In this regard, we assume that a pair's preference of location stays the same over time. Let us define

$$\Lambda^{T}_{ij} = \int_0^T \lambda_{ij}(\tau \,|\, \mathcal{H}_\tau)\, d\tau \qquad (3.2)$$

The likelihood of an observed sequence of interactions under the above model is given as

$$p(\mathcal{H}; \Theta) = \prod_k \prod_{i<j} \underbrace{[\lambda_{ij}(t_k)]^{z^{ij}_k} e^{-\Lambda^{T}_{ij}}}_{\text{temporal process}} \; \underbrace{[r_{ij}(\mathbf{x}_k)]^{z^{ij}_k}}_{\text{spatial process}} \qquad (3.3)$$

where the products are over all the events and the pairs, respectively.
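As a sketch of how a likelihood of this form is evaluated in practice, the log-likelihood decomposes into per-event terms plus one survival term per pair. The interfaces below (`lam`, `big_lam`, `r`) are hypothetical stand-ins for the temporal intensity, its integral $\Lambda^{T}_{ij}$, and the spatial density:

```python
import math

def log_likelihood(events, pairs, lam, big_lam, r):
    """Log of a factorized point-process likelihood in the spirit of Eq. (3.3).

    events:  list of (t_k, x_k, pair_k); only the involved pair has z = 1,
             so the per-event sums collapse to that pair's terms.
    lam:     lam(p, t)  -> temporal intensity of pair p at time t.
    big_lam: big_lam(p) -> integral of lam over the observation window [0, T].
    r:       r(p, x)    -> spatial density of pair p at location x.
    """
    ll = 0.0
    for t_k, x_k, p_k in events:       # event terms: z * (log lam + log r)
        ll += math.log(lam(p_k, t_k)) + math.log(r(p_k, x_k))
    for p in pairs:                    # survival terms, one per pair
        ll -= big_lam(p)
    return ll
```

A toy check: with a constant intensity of 2 on a unit window, a spatial density of 0.5, one observed event, and two pairs, the log-likelihood is $\log 2 + \log 0.5 - 2 \cdot 2 = -4$.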
Here $\Theta$ encodes all the hyperparameters of the model (to be specified below). From this point on, we simplify the intensity expression to $\lambda_{ij}(t)$, omitting $\mathcal{H}_t$. So far our description has been rather general. Next we have to specify a concrete parametric form of the temporal and spatial conditional intensity functions. As stated above, the presence of non-trivial temporal correlations between the events suggests that it is not realistic to use a Poisson process with constant intensity. Instead, here we will use a Hawkes process, which is a variant of a self-exciting process [61].

3.3.1 Hawkes process

We assume that the intensity of events involving the pair $(i,j)$ at time $t$ is given as follows:

$$\lambda_{ij}(t) = \mu_{ij} + \sum_{p:\, t_p < t} g_{ij}(t - t_p) \qquad (3.4)$$

where the summation in the second term is over all the events that have happened up to time $t$. In Equation 3.4, $\mu_{ij}$ describes the time-independent background rate of event occurrence, whereas the second term describes the self-excitation part, so that events in the past increase the probability of observing another event in the (near) future. We will use a two-parameter family for the self-excitation term:

$$g_{ij}(t - t_p) = \beta_{ij}\,\omega_{ij} \exp\{-\omega_{ij}(t - t_p)\} \qquad (3.5)$$

Here $\beta_{ij}$ describes the weight of the self-excitation term (compared to the background rate), while $\omega_{ij}$ describes the decay rate of the excitation.

3.3.2 Spatial Gaussian Mixture Model (GMM)

To model the spatial aspect of the interactions, we assume that different pairs might have different geo-profiles of interactions. Namely, we assume that the interactions of a specific pair are spatially distributed according to a pair-specific Gaussian mixture model:

$$r_{ij}(\mathbf{x}) = \sum_{c=1}^{C} w^{c}_{ij}\, \mathcal{N}(\mathbf{x}; \mathbf{m}^{c}_{ij}, \Sigma^{c}_{ij}) \qquad (3.6)$$

In Equation 3.6, $C$ is the number of components, $\mathcal{N}(\mathbf{x}; \mathbf{m}^{c}_{ij}, \Sigma^{c}_{ij})$ denotes the 2-D multivariate normal distribution with mean $\mathbf{m}^{c}_{ij}$ and covariance $\Sigma^{c}_{ij}$, and $w^{c}_{ij}$ is the weight of the $c$-th component for pair $(i,j)$.
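Putting Eqs. (3.1) and (3.4)-(3.6) together for a single pair, a minimal sketch of the factorized intensity might look as follows (the function and parameter names are our own, and covariances are assumed diagonal for brevity):

```python
import math

def hawkes_intensity(t, history, mu, beta, omega):
    """Temporal part, Eqs. (3.4)-(3.5): background rate plus exponentially
    decaying excitation from the past event times in `history`."""
    return mu + sum(beta * omega * math.exp(-omega * (t - tp))
                    for tp in history if tp < t)

def gmm_density(x, weights, means, variances):
    """Spatial part, Eq. (3.6): weighted sum of axis-aligned 2-D Gaussians."""
    total = 0.0
    for w, (mx, my), (vx, vy) in zip(weights, means, variances):
        dx, dy = x[0] - mx, x[1] - my
        total += w * math.exp(-0.5 * (dx * dx / vx + dy * dy / vy)) \
                 / (2.0 * math.pi * math.sqrt(vx * vy))
    return total

def pair_intensity(t, x, history, mu, beta, omega, weights, means, variances):
    """Factorized spatial-temporal intensity of Eq. (3.1) for one pair."""
    return hawkes_intensity(t, history, mu, beta, omega) * \
           gmm_density(x, weights, means, variances)
```

Between events the temporal term decays back toward the background rate $\mu$, while the spatial term is time-independent, mirroring the separability assumption above.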
The number of components $C$ was obtained using Ref. [62], where BIC scores were used to optimize the number of components. A larger weight on a specific component means a higher chance that an event appears within that component's cluster. For simplicity, the dynamics of the weights over time have been ignored. We would like to note that the use of Gaussian mixtures rather than a single Gaussian model is justified by the observation that interactions of the same pair might have different modalities (e.g., school, movies, etc.). In this sense, the model borrows from the mixed membership stochastic block model [3], which assumes that agents can interact while assuming different roles. Equations 3.2-3.6 complete the definition of our latent point process model. Next we describe our approach for efficient learning and inference with LPPM.

3.4 Learning and Inference

As mentioned in the introduction, we are interested in the scenario where the actual participants of the events are not observed directly and need to be inferred, together with the model parameters (i.e., the pair-specific parameters of the Hawkes process model and of the Gaussian mixture model). For the latter, we employ maximum likelihood (ML) estimation. ML selects the parameters that maximize the likelihood of the observations, which consist of the timing and the location of the events, and participant information for some of the events. Due to the missing labels, the interaction matrices $z_k$ are unobserved (or latent) for some $k$. Therefore, there is no closed-form expression for the likelihood of the observed sequence of events. Instead, one has to resort to approximate techniques for learning and inference, which are described next. Here we use a variational EM approach [7] by positing a simpler distribution $Q(Z)$ over the latent variables with free parameters.
The free parameters are selected to minimize the Kullback-Leibler (KL) divergence between the variational and the true posterior distributions. Recall that the KL divergence between two distributions $Q$ and $P$ is defined as

$$D_{KL}(Q \,\|\, P) = \int_Z Q(Z) \log \frac{Q(Z)}{P(Z, Y)}\, dZ \qquad (3.7)$$

where $Z$ denotes the hidden variables and $Y$ the observed variables. In our case, $Z$ is the hidden identity of the interactions, a portion of which is known, whereas $Y$ describes the location and the time of the incidents. We introduce the following variational multinomial distribution:

$$Q(\mathcal{Z}_n \,|\, \Phi) = \prod_k \prod_{i<j} q(z^{ij}_k \,|\, \phi_k) \qquad (3.8)$$

where $\mathcal{Z}_k = \{z_l\}_{l=1}^{k}$ denotes the set of interaction matrices for events up to the $k$-th event, and $q(\cdot \,|\, \phi_k)$ is the multinomial distribution with parameter $\phi_k$. The matrix $\phi_k$ consists of the free variational parameters $\phi^{ij}_k$ describing the probability that agents $i$ and $j$ are involved in the $k$-th event. Note that the present choice of the variational distribution discards correlations between past and future incidents, thus making the calculation tractable. The variational parameters are determined by maximizing the following lower bound on the log-likelihood [7]:

$$\mathcal{L}_\Phi(Q, \Theta) = E_Q\Big[\log \prod_k \prod_{i<j} [\lambda_{ij}(t_k)]^{z^{ij}_k} e^{-\Lambda^{T}_{ij}}\Big] + E_Q\Big[\log \prod_k \prod_{i<j} [r_{ij}(\mathbf{x}_k)]^{z^{ij}_k}\Big] - E_Q\Big[\log \prod_k \prod_{i<j} q(z^{ij}_k \,|\, \phi_k)\Big] \qquad (3.9)$$

where $\Phi$ is the set of variational parameters, and $\Theta$ is the set of all the model parameters. The above equation can be rewritten as follows:

$$\mathcal{L}_\Phi(Q, \Theta) = E_Q\Big[\sum_k \sum_{i<j} z^{ij}_k \log[\lambda_{ij}(t_k)] - \Lambda^{T}_{ij}\Big] + \sum_k \sum_{i<j} \phi^{ij}_k \log[r_{ij}(\mathbf{x}_k)] - \sum_k \sum_{i<j} \phi^{ij}_k \log \phi^{ij}_k \qquad (3.10)$$

Algorithm 2 Variational EM
  Size: consider a total of $n$ events, $M = N(N-1)/2$ pairs
  Input: data $\mathbf{x}_{1:n}$, $t_{1:n}$, $z_k$ of complete events
  Start with an initial guess for the hyperparameters.
  Fix all $\phi_k = z_k$ for labeled events.
  repeat
    Initialize all components of $\phi_k$ corresponding to unknown pairs of event $k$ to $1/M$
    repeat
      for $k = 1$ to $n$ do
        if the pair of the $k$-th event is unknown then
          Update $\phi_k$ using Eq. B.7
        end if
      end for
    until convergence across all time steps
    Update hyperparameters.
  until convergence in hyperparameters

where in the last two terms we have explicitly performed the averaging over the multinomial variational distribution defined in Equation 3.8. The variational EM algorithm works by iterating between the E-step of calculating the expectation value using the variational distribution, and the M-step of updating the model (hyper)parameters so that the data likelihood is locally maximized. The overall pseudo-algorithm is shown in Algorithm 2. The details of the update equations used in both the E-step and the M-step are provided in the appendix.

3.5 Experiments with Synthetic Data

We first report our experiments with synthetically generated data for six pairs of agents. The sequence of interaction events was generated according to the LPPM process as follows:

1. For each pair, sample the first time of the incident using an exponential distribution with rate parameter $\mu$.

2. For each pair, sample the duration of time until the next incident using Poisson thinning. Since we are dealing with a non-homogeneous Poisson process, we use the so-called thinning algorithm [65] to sample the next event time. By repeating step 2, we obtain the timestamps of incidents for each pair.

3. For every timestamp of a given pair, we sample the location of the incident.

To compare the performance of our algorithm with previous approaches, we follow the experimental set-up proposed in [80], where the authors used temporal-only information for reconstructing missing information in synthetically generated data. In addition to ML estimation, Ref.
[80] also used an alternative objective function over relaxed continuous variables, and performed constrained optimization of the new objective function using $l_2$ regularization. Although their method does not assign proper probabilities to the various timelines, it can provide a ranking of the most likely participants.

Table 3.1: Model evaluation for a total of n = 40 events between 6 pairs. Only 4 events have unknown participants. The parameters are $\mu = 10^{-2}$ days$^{-1}$, $\omega = 10^{-1}$ days$^{-1}$, and $\beta = 0.5$. The accuracy of the top three methods is from Ref. [80]; Variational EM is our result using LPPM. The results are averaged over 1000 trials.

  METHOD           ACCURACY
  Exact ML         47.3%
  Max $l_1$        47%
  Max $l_2$        47.1%
  Variational EM   46.9%

Following Ref. [80], we consider 40 events, and assume that for 10% (4 events) we do not have participant information. Table 3.1 shows the overall performance of the different approaches. To make the comparison meaningful, we omit the spatial information in our model and focus on the temporal part only. For our algorithm the results are averaged over 1000 runs. Throughout this paper we measure the accuracy (expressed as a percentage) as the number of correctly identified events divided by the total number of hidden events. Table 3.1 indicates that all four methods perform almost identically. In particular, all four methods have significantly better accuracy than the simple baseline value of 1/6, where each pair is selected randomly. Also, while our method does slightly worse, it is important to remember that the other methods assume known values of the parameters, whereas LPPM learns the parameters from the data. In the next set of experiments we examine the relative importance of the spatial and temporal parts by comparing three variants of our algorithm that use:

1. Temporal-only data,
2. Spatial-only data, and
3. Combined spatial and temporal data.
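Returning to step 2 of the synthetic-data generation, the thinning algorithm [65] mentioned there can be sketched for the exponential-kernel Hawkes intensity of Eqs. (3.4)-(3.5) as follows (an illustrative Ogata-style implementation in our own notation, not the authors' code):

```python
import math
import random

def sample_hawkes(mu, beta, omega, t_max, seed=0):
    """Sample event times of one pair's Hawkes process on [0, t_max] by
    thinning: propose candidates from a dominating rate, accept w.p. λ(t)/λ̄."""
    rng = random.Random(seed)
    events, t = [], 0.0
    while True:
        # The intensity only decays between events, so its current value
        # dominates λ(s) for all s > t until the next accepted event.
        lam_bar = mu + sum(beta * omega * math.exp(-omega * (t - tp))
                           for tp in events)
        t += rng.expovariate(lam_bar)
        if t >= t_max:
            return events
        lam_t = mu + sum(beta * omega * math.exp(-omega * (t - tp))
                         for tp in events)
        if rng.random() <= lam_t / lam_bar:   # accept with prob λ(t)/λ̄
            events.append(t)
```

With $\beta = 0$ the sampler reduces to a homogeneous Poisson process with rate $\mu$, which is a convenient sanity check.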
For the spatial component of the data, we use six multivariate normal distributions, each centered on a vertex of a hexagon (one for each of the 6 pairs). Here we use a single Gaussian per pair for the spatial process. As in Figure 3.2, we fix the side length of the hexagon to 1, and analyze how varying the width of the normal distribution affects the overall performance. Specifically, we varied the covariance matrix $\Sigma$ from 0.25I to 4I. Again, the results are averaged over 100 runs. The accuracy was computed as the average number of correct estimates divided by the number of unknown incidents. As expected, the relative importance of the spatial information decreases with increasing $\sigma$. In the limit when $\sigma$ is very large, the location of an event does not contain any useful information about the participants, so the accuracy based on spatial information only should converge to the random baseline of 1/6. On the other hand, for small values of $\Sigma$, the spatial information helps to increase accuracy.

Figure 3.2: The spatial data generated varying the covariance matrix from 0.25I (a) to 4I (b). Each color and symbol represents a pair. The 6 centers are on the vertices of a hexagon with side length 1, while the covariance matrix is varied.

In the last set of experiments with synthetic data, we examine the performance of LPPM by varying the fraction of unknown incident labels. We compare the performance of LPPM to two baseline methods.
Figure 3.3: (a) Accuracy of inference using spatial data only, temporal data only, and spatial-temporal data, for different settings of the standard deviation of the spatial Gaussian model. The results are averaged over 100 trials; (b) Average accuracy (over 20 trials) plotted against the percentage of missing labels. Spatial data was generated based on a Gaussian with standard deviation 1.

∙ Baseline I (B1): This method fits a self-exciting Hawkes process model using labeled data only. We perform MLE to estimate the model parameters of the Hawkes process by considering only the events that are labeled. In other words, we discard the events that are missing the label, i.e., the information on the pairs.

∙ Baseline II (B2): This method uses a homogeneous Poisson process model with constant intensity using both labeled and unlabeled data. This method is similar to ours except that the temporal process is a Poisson process. It can be treated as a special case of the Hawkes process with $\beta = 0$.

We note that both LPPM and the baseline methods use the spatial component, so any differences in their performance should come from the temporal part of the model only. The results of our comparative studies are shown in Figure 3.3. It can be seen that LPPM outperforms both baselines by a significant margin, which increases as the data becomes noisier. Thus, LPPM is a much better choice when the amount of missing information is significant. The result also shows that learning model parameters only from the labeled data is not sufficient for inferring missing labels.

3.6 Experiments with Real-World Data

In this section we report on our experiments using two distinctly different real-world datasets.
The first dataset describes gang-rivalry networks in the Hollenbeck policing division of Los Angeles [82], and the second dataset is from the popular location-based social networking service Gowalla [15]. The rest of the section is organized as follows: we first describe both datasets; then we conduct experiments on identity-inference problems in Section 3.6.2. Finally, we evaluate LPPM on the event prediction problem in Section 3.6.3.

3.6.1 Data description

LAPD dataset. Hollenbeck is a 15.2 square mile (39.4 km²) policing division of the Los Angeles Police Department (LAPD), located on the eastern edge of the City of Los Angeles, with approximately 220,000 residents. Overall, 31 active criminal street gangs were identified in Hollenbeck between 1999 and 2002 [82]. These gangs formed at least 40 unique rivalries, which are responsible for the vast majority of violent exchanges observed between gangs. Between November 14, 1999 and September 28, 2002 (1049 days), there were 1208 violent crimes attributed to criminal street gangs in the area. Of these, 1132 crimes explicitly identify the gang affiliation of the suspect, the victim, or both. The remaining events include crimes such as 'shots fired' which are known to be gang related, but for which the intended victim and suspect gangs are not clear. For each violent crime, the collected information includes the street address where the crime occurred as well as the date and time of the event [82], allowing examination of the spatial-temporal dynamics of gang violence. In Figure 3.4 we show the temporal and spatial distributions of interactions between the three most active gangs. For this dataset, we found that each pair is characterized by a single Gaussian. This can be treated as a special case of the GMM with $C = 1$.

Gowalla dataset. Gowalla is a location-based social networking website where users share their locations by checking in [15]. We used the top 20 nodes who actively check in to places.
The network consists of 196,591 nodes and 950,327 undirected edges. 6,442,890 check-ins by these users were gathered from February 2009 to October 2010. Each check-in has not only its latitude and longitude coordinates but also a location ID provided by Gowalla. The location ID is very useful in that it enables us to verify the co-occurrence of a pair at a given location even when the latitude-longitude coordinates have some error or there is a multi-story building at the given coordinates. Gowalla also provides a list of friends, where the edges between them are undirected. We looked into every check-in of the friends of the 20 nodes and assumed two users interacted with each other if their check-ins at the same location were within 10,000 seconds of each other. Venues of popular places such as airports and stations have been removed to rule out unexpected coincidences between users. Out of the 20 active nodes, we were able to collect 3 groups: one each from Stockholm, Tokyo, and San Francisco.

3.6.2 Inferring event participants

As we mentioned earlier, most social network data are noisy and incomplete, with missing information about nodes and/or interactions. In this section, we consider a scenario where one has the timing and location of interaction events, but only partial information about the event participants. A specific real-world problem matching this scenario is inter-gang violence, where one has a record of reported violent inter-gang events, but where either the perpetrator gang, the victim gang, or both, are unknown. Thus, the problem is to infer the unknown participants based on the available information. The naive solution would be to discard the missing data, learn the model parameters based on fully observed events only, and then use the learned model for inferring the participants of partially labeled events. However, below we show that the naive approach is sub-optimal.
Instead, by taking the missing data into account via the expectation-maximization framework, one achieves better accuracy on the participant identification task.

Experiments with LAPD dataset

As described above, the LAPD dataset contains the time stamp and the location of incidents between pairs of gangs. Approximately 31% of the records contain information about both participants in the event. Furthermore, 62% of the records contain information about one of the participants, but not the other. Finally, 7% do not have any information about the participants. For a better understanding of gang rivalries, it is important to recover the missing information in those 70% of the data. Since this research is not a study of the actual rivalries in Hollenbeck but a verification of how well our algorithm performs on inference, in the experiments below we discard the latter portion of the data. This way we can validate our inference by comparing it with the actual given labels. In the remaining data, we focused on 31 active gangs which were involved in at least 4 incidents within the time period. Furthermore, out of all possible pairs, we use the 40 pairs which had more than one reported incident with each other. In the first set of experiments, we focused on the portion of the data that contains information about both participants. We randomly select a fraction $\rho$ of the incidents, and then hide the identity of the participants for those incidents. Next, we use LPPM to see how well it can reconstruct the hidden identities as we vary $\rho$. We compared the results to the same two baseline methods outlined in Section 3.5. In addition, we add another baseline that uses all existing labels to learn a spatial-only model. The accuracy is defined as the fraction of events for which the algorithm correctly recovers both participants. The results were averaged over 20 different runs. The centers of the clusters were initialized with the mean location of the labeled data.
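The evaluation protocol just described (hide a fraction $\rho$ of the labels, then score how many hidden pairs are recovered) can be sketched as follows; the helper names are ours:

```python
import random

def hide_labels(labels, rho, seed=0):
    """Hide a fraction rho of event labels, returning the masked list and
    the set of indices whose labels must be recovered."""
    rng = random.Random(seed)
    idx = list(range(len(labels)))
    rng.shuffle(idx)
    hidden = set(idx[: int(rho * len(labels))])
    masked = [None if i in hidden else lab for i, lab in enumerate(labels)]
    return masked, hidden

def recovery_accuracy(truth, predicted, hidden):
    """Fraction of hidden events whose pair label was correctly recovered."""
    return sum(predicted[i] == truth[i] for i in hidden) / len(hidden)
```

Any inference method (LPPM or a baseline) fills in the `None` entries, and `recovery_accuracy` then scores only the events that were actually hidden, matching the accuracy definition above.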
Figure 3.5 demonstrates our results. One can see that LPPM does consistently better than B1 and B2. With only 10% of the label information missing, the accuracies of LPPM and B1 are fairly close. This is to be expected, since for vanishing $\rho$ those algorithms become identical: they learn the same model using the same data. However, LPPM performs much better than B1 as $\rho$ increases. Another interesting observation is that B2 performs better than B1 when $\rho$ is sufficiently large. This suggests that for large $\rho$ it is better to use a simpler (and presumably wrong) model with both missing and labeled data than to learn a more elaborate model using labeled data only. We also note that LPPM does better than the spatial-only baseline even when half of the events are hidden. This is significant since the spatial model uses all the label information that is not available to LPPM. Although the spatial model performs better as $\rho$ increases further, LPPM remains very competitive even when 70% of the events are hidden, which matches the condition (i.e., the fraction of unknowns) of the LAPD gang-related crime data.

Experiments with Gowalla dataset

Next, we perform experiments on the participant-inference task using the Gowalla data. Note that while the participant information is generally available in this data, it still provides an interesting benchmark for validating LPPM. Out of the 20 most active users in the Gowalla network, we focus on three users that have a high interaction frequency with their friends.¹ Coincidentally, the three users were from different cities (Tokyo, Stockholm, and San Francisco). We found that some of the check-in locations were repeated by the same pairs. Strictly speaking, this suggests that the spatial component is not a point process. However, this detail has little bearing on our model, as the spatial interactions can still be modeled via the Gaussian mixture model.
Spatial analysis of the dataset reveals that the interactions are multi-modal, in the sense that the same pair of users interacts at different locations. This is different from the crime dataset, and necessitates using more than one component in the spatial mixture model.

¹ Recall that for this dataset, an interaction between two users is determined by near-simultaneous check-ins; see the description of the dataset.

In the experiments, we used 4 GMM components for two of the pairs (Stockholm and San Francisco), and three components for the other pair (Tokyo). The results of the experiments are shown in Figure 3.6. Due to limited space, we present the results of the simulation using users in San Francisco. Since the two baseline methods perform similarly, here we show the comparison only with B2, which learns a homogeneous Poisson point process model using both labeled and unlabeled data. Again, the results suggest that LPPM is consistently better than the baseline for all of the pairs. The gap between LPPM and the baseline is not as significant as before, which is mainly due to the active pairs that dominate the interactions. When there are dominant active pairs, a Poisson process can distinguish the users by comparing the rates between the pairs. Moreover, some active pairs checked into the exact same location repeatedly, leading to higher accuracy.

3.6.3 Event prediction with LPPM

LPPM can be used not only for inferring missing information but also for predicting future events, which can be potentially useful for many applications. For instance, in the context of proactive policing, the predictions can be used to anticipate the participants/timing/location of the next event and to properly assign resources for patrol. For a friendship network, one can predict spatial-temporal movement patterns by predicting the hot clusters involving given pairs.
This kind of prediction can also be very useful in epidemiology, e.g., for predicting the diffusion patterns of an infectious disease. In this section, we use learned LPPM models for two different prediction tasks: (1) predicting the timing of the next interaction event; (2) predicting the pair that will have the next interaction. Let us first discuss the timing prediction problem. Given the history of events up to the $k$-th event, our goal is to predict the timing of the $(k+1)$-th event. Note that the prediction can be either pair-specific or across all pairs. Here we select the latter option. The estimated waiting time until the next incident is given by

$$\int_0^L t\, \lambda_{\mathcal{S}}(t) \exp\Big(-\int_0^t \lambda_{\mathcal{S}}(\tau)\, d\tau\Big)\, dt \qquad (3.11)$$

where $L$ is a fairly large number, and $\lambda_{\mathcal{S}}(t) = \sum_{(ij)} \lambda_{ij}(t)$ is the sum of the conditional intensity functions across all the pairs. Below we compare the prediction performance of LPPM with B2 defined in Section 3.5, which employs homogeneous Poisson processes. Under this baseline, the expected waiting time to the next event is simply $1/\sum_{(ij)} \lambda^*_{ij}$ (with $\lambda_{ij}(t) \equiv \lambda^*_{ij}$), where $\lambda^*_{ij}$ is the time-independent intensity for the pair $(i,j)$. The prediction accuracy is measured using the mean absolute percentage error (MAPE) score, which measures the relative error of the predicted waiting time: MAPE $= |(A_n - F_n)/A_n|$, where $A_n$ is the actual waiting time until the next incident, and $F_n$ is our predicted value. Note that more accurate prediction corresponds to a lower MAPE score, with MAPE $= 0$ for perfect prediction. We measure the MAPE score for LPPM prediction on the LAPD and Gowalla datasets. For the former, we use LPPM to predict the timing of the last 50 incidents among the top 40 pairs. For the latter dataset, we focus on only one of the users (in Tokyo), and use the last 10 events (out of 40 total) for prediction. For both datasets, LPPM provides significantly more accurate prediction than the baseline for most of the incidents.
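Equation (3.11) and the MAPE score can be evaluated numerically; a minimal sketch, assuming a caller-supplied summed intensity $\lambda_{\mathcal{S}}(t)$:

```python
import math

def expected_waiting_time(lam_total, horizon=50.0, dt=1e-3):
    """Numerically evaluate Eq. (3.11): E[t] = ∫ t λ_S(t) exp(-∫_0^t λ_S) dt,
    where lam_total(t) is the conditional intensity summed over all pairs."""
    t, cum, expect = 0.0, 0.0, 0.0
    while t < horizon:
        lam = lam_total(t)
        expect += t * lam * math.exp(-cum) * dt   # integrand of Eq. (3.11)
        cum += lam * dt                           # running ∫_0^t λ_S(τ) dτ
        t += dt
    return expect

def mape(actual, predicted):
    """Mean absolute percentage error |A - F| / A of a waiting-time forecast."""
    return abs((actual - predicted) / actual)
```

For a constant total intensity $\lambda_{\mathcal{S}} \equiv \lambda^*$, the integral reduces to $1/\lambda^*$, which is exactly the Poisson-baseline prediction quoted above.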
On the LAPD dataset, the average MAPE score was 2.7502 for LPPM compared to 11.0434 for B2; on the Gowalla dataset it was 1.2236 for LPPM compared to 5.9350 for B2. A possible explanation of the poor performance of the Poisson model is that it fails to accurately predict the timing of highly correlated events that are clustered in time, whereas LPPM is able to capture such correlations. When the next event is strongly influenced by the previous event, the Poisson model is limited in that it treats the triggered event as a random event. For prediction task (2), we used LPPM to find the conditional intensity of interactions between different pairs based on all the events up to event $k$, which happens at time $t_k$. We then predict the pair with the highest conditional intensity to have an interaction event at a time $t > t_k$, assuming that no other interaction has taken place in the time interval $[t_k, t]$. Note that the homogeneous Poisson process model (Baseline II) simply selects the pair that has been the most active in the past. For this particular task, we also use another prediction method (Baseline III), which predicts that the pair that had the last event will also participate in the follow-up event. In addition to the top pair, we also report the second- and third-best predictions. We performed experiments with the crime dataset, for which 14 incidents out of 100 were predicted correctly by LPPM. Baseline II correctly predicted only 8 incidents, whereas Baseline III did considerably better with 13 correct predictions. Furthermore, LPPM outperforms both methods in predicting the top 2 and top 3 pairs, as shown in Table 3.2.

Table 3.2: Prediction accuracy of top-K choices for K=1,2,3.

Method | Baseline II | Baseline III | LPPM
Top 1  | 8%          | 13%          | 14%
Top 2  | 16%         | 20%          | 26%
Top 3  | 23%         | 22%          | 37%

3.7 Summary We suggested a latent point process model to describe spatial-temporal interaction networks.
In contrast to existing continuous-time models of temporal networks, here we assume that interactions along the network links are only partially observable. We described an efficient variational EM approach for learning and inference with such models, and demonstrated good performance in our experiments with both synthetic and real-world data. We note that while our work was motivated by modeling spatial-temporal interaction networks, the latent point process suggested here is much more general and can be used for modeling scenarios where one deals with a latent mixture of arbitrary point processes. For instance, LPPM can be generalized to describe a geographically distributed sequence of arbitrary events over multiple pairs, even when some events are missing the pair information. There are several ways to generalize the model further. For instance, we have assumed a homogeneous background rate, whereas in certain scenarios one might need to introduce cyclic activity patterns. Furthermore, the assumption that the process intensity factorizes into temporal and spatial components might not work well for certain types of processes, where the location component might depend on the event time.

[Figure 3.4: Spatial (a) and temporal (b) description of the events involving four active gang rivalries. Different colors represent different pairs. In (b) each spike represents the time of the event.]

[Figure 3.5: Average accuracy for varying fraction of missing labels. Baseline I and Baseline II are defined in Section 3.5. The horizontal line corresponds to inference using spatial data only.]
[Figure 3.6: Average accuracy of the participant-inference task for users in San Francisco. The fraction of missing labels is varied between 10% and 70%.]

Chapter 4 Venue Clustering through Information Bottleneck 4.1 Introduction Despite the efforts of social scientists, understanding human mobility patterns remains a challenging problem. As sensors become more ubiquitous, with accelerometers and GPS embedded in cell phones, computer scientists are able to analyze movements at a more fine-grained level. The emergence of location-based social network services takes this potential even further: large-scale data covering wide areas over long timescales is coupled with detailed information about users' online interactions. Location Based Social Networks (LBSN) bear unique features. Many application interfaces now enable users to select the location they would like to check into from an automatically generated list. Often these lists are produced by tracking the GPS coordinates of the current user and searching for nearby candidates. If the name of a location is not provided by the service, the user can add it for future visitors. This feature is remarkable for this area of study in that the exact location can be pinpointed using the labels attached to each check-in; the noise in GPS coordinates can easily be filtered using these labels. This enables us to collect the users who have visited specific venues during a given time period. Another unique feature of LBSN is the sharing of user locations with friends. On popular social networking sites like Facebook, users can pinpoint where their friends are and where they have been in the past if they have checked in. This creates online influence, which is the social-network equivalent of word-of-mouth influence.
Each check-in creates a visible reminder that may induce friends to return to a location or visit it for the first time. In this sense, network structure may have major implications for this area of study. Understanding and modeling individual movement has many applications. By understanding individual movements and the relationships between users, service providers could recommend venues to groups of users with potential interest. In a broader context, human mobility models also impact prediction of the spread of disease, controlling traffic congestion, business marketing, and urban analysis. Yet another large impact on other fields of study comes from the fact that venues comprise not only geo-coordinate information but also other useful information when combined with the profiles of the users who have visited them. This may provide information about the behavior patterns associated with a venue, or about the groups that frequently visit it. For instance, based on the location, timing, and composition of a group, it could be possible to infer the activity as 'studying' with high probability. This intuition suggests that venues see similar patterns of activity based on the users who visit them. In this chapter, we use the network structure information to cluster venues so that a venue's group reflects its functionality. This coarse representation of venues may be useful in many ways, but we focus on two concrete benefits. First, by clustering the venues, we may gain a better understanding of what venues represent as a whole. For instance, venues connected to schools, libraries, and bookstores may be related to activities like 'studying'. Second, with a coarse representation of venues, unknown relationships among users can be inferred based on the sets of venues two users have visited. Using the LBSN dataset, we found that many actual friends show similar check-in patterns, but some pairs had no overlap at all.
We show that by clustering venue types, our model can correctly infer relationships even between pairs of users with no overlapping venues. We present network-infused agglomerative information bottleneck, an extension of agglomerative information bottleneck [76]. This simple non-parametric method allows us to cluster venues utilizing the network information. As mentioned previously, we show two advantages: categorizing venues and edge prediction using the clustered venues. However, for ease of validation, we mainly focus on edge prediction in this chapter and show how the coarse representation performs better than using the raw venue data. We also show how our clustering performs better relative to other methods of dimensionality reduction. 4.2 Related Work There have been a number of studies [15, 19, 71, 86] using geo-spatial datasets for modeling mobility patterns in social networks. For instance, Ref. [86] defined mobile homophily based on visitation frequencies, and used this measure to infer social interactions. The difference between our approach and the prior work lies in how homophily is measured. Namely, our method first projects venues onto a latent space, and then finds similar users in this space. Thus, our approach can yield high similarity for a pair of users who have never visited the same venue, provided that they have visited similar venues. This is very different from approaches that use, for instance, distances between locations for link prediction. As we mentioned above, we intend to compress the check-in data into a coarser representation. There are many existing approaches for dimensionality reduction. Latent Semantic Analysis (LSA) [49] and Latent Dirichlet Allocation (LDA) [11] were originally introduced in NLP to discover hidden concepts/topics characterizing document data using the term-document matrix.
Probabilistic Matrix Factorization (PMF) [72, 73], developed for collaborative filtering, works by decomposing the rating matrix into two low-rank matrices and performs well at predicting missing ratings. Spectral Co-Clustering [22] simultaneously clusters rows and columns using spectral graph partitioning. We would like to note that, similar to our work, Ref. [44] used LDA to cluster venues, although they did not use the obtained clusters for link prediction. Furthermore, in contrast to [44] (and the dimensionality reduction methods listed above), the approach proposed here is a supervised method, as it uses information about known social ties. 4.3 Data Description We use the Gowalla dataset [15] in this chapter. Gowalla is a location-based social network (LBSN) service where each user can post their current location and share it with their friends. A 'check-in' consists of the node id (user), the actual date and time, and the coordinates along with the location ID provided by Gowalla. The location ID becomes useful when two locations that are close, or that share the same coordinates but sit on different building levels, need to be distinguished. To distinguish our setting from previous works using geo-spatial datasets, we use the term venue, i.e., a location with an ID in the LBSN, instead of location, which in prior work was mostly based on cell towers in cell-phone data. The majority of users in the dataset show sparse 'check-in' activity, which makes modeling difficult. We observe that the 20% most active users were responsible for 80% of all check-ins. For the experiments, we use the check-in data of users from three major US cities: San Francisco, Austin, and New York. For each city, the check-in data is encoded as a $|V| \times |U|$ matrix, where $V$ is the set of venues and $U$ is the set of users. The $(i,j)$-th entry indicates the number of visits of user $j$ to venue $i$. The dataset also contains a node-node friendship matrix, which is an unweighted and undirected graph.
We use the check-ins of users who showed active histories in those three cities. Measuring Similarity In previous work [86], one of the measures used to compute mobile similarity was the cosine similarity between the vectors of two given users. Each user has an occurrence vector, where the $i$-th component counts the number of visits of the given user to location $i$. Cosine similarity is the inner product of two $\ell_2$-normalized vectors, which measures the cosine of the angle between them:

$$\mathrm{SIM}_{\cos}(A,B) = \frac{A}{\|A\|} \cdot \frac{B}{\|B\|} \qquad (4.1)$$

Another measure of similarity is the Kullback-Leibler (KL) divergence (defined below). Our results indicate that KL divergence achieves better inference of friendships than cosine similarity. (We believe this is mainly due to the $\ell_1$ normalization: the $\ell_2$ norm is sensitive to scaling, especially when dealing with high-dimensional vectors, whereas $\ell_1$-normalized vectors show more robustness [17].) Hence we use KL divergence as the metric for measuring similarity of check-in histories. 4.4 Network-infused Agglomerative Information Bottleneck Our objective in this chapter is to capture the unknown edges using the similarity between users in some latent space. Our approach is based on a variation of the Information Bottleneck (IB) method [81]. This method tries to find a compressed representation $\tilde{X}$ of the original data $X$ such that $\tilde{X}$ still contains useful information about some relevance variable $Y$. In other words, IB tries to find the features of the original dataset that are most useful for predicting the relevance variable $Y$, while discarding (via compression) the features that are not.
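The two similarity measures compared above can be sketched as follows (function names are ours; note that the KL divergence requires the second distribution to be strictly positive wherever the first is, which is why the chapter later switches to the JS divergence):

```python
import math

def cosine_sim(a, b):
    # Eq. 4.1: inner product of l2-normalized occurrence vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def l1_normalize(v):
    # Turn a raw visit-count vector into a probability vector.
    s = float(sum(v))
    return [x / s for x in v]

def kl_divergence(p, q):
    # Eq. 4.3; q must be strictly positive wherever p is.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```

A lower KL divergence between two users' normalized check-in vectors indicates more similar mobility patterns.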
To make this intuition more formal, let us recall the definition of the mutual information between two random variables $X$ and $Y$:

$$I(X;Y) = \sum_{x \in X,\, y \in Y} p(x,y) \log\Big(\frac{p(x,y)}{p(x)p(y)}\Big) \qquad (4.2a)$$
$$\phantom{I(X;Y)} = \sum_{y \in Y} p(y)\,\mathrm{D}_{\mathrm{KL}}\big(p(x|y)\,\|\,p(x)\big) \qquad (4.2b)$$

where in the second equation we have introduced the Kullback-Leibler (KL) divergence:

$$\mathrm{D}_{\mathrm{KL}}(p\,\|\,q) = \sum_i p(i) \log\Big(\frac{p(i)}{q(i)}\Big) \qquad (4.3)$$

The objective of the IB method is to find a compact representation $\tilde{X}$ of the original variable $X$ that results in a minimal loss of information about the relevance variable $Y$. Introducing a Lagrange multiplier $\beta$, this objective is captured by the following functional:

$$\mathcal{L}[p(\tilde{x}|x)] = I(X;\tilde{X}) - \beta I(\tilde{X};Y) \qquad (4.4)$$

Note that IB can be viewed as a soft clustering algorithm characterized by the conditional distribution $p(\tilde{x}|x)$. When the compressed representation has finite cardinality, then in the limit $\beta \to \infty$, IB reduces to hard clustering [76]. In this limit, the first term in Equation 4.4 is discarded, and the problem reduces to maximizing $I(\tilde{X};Y)$. We focus on this scenario from now on. We adopt a version of IB known as agglomerative Information Bottleneck [76], which is essentially a bottom-up hard clustering method. Agglomerative IB starts with the trivial partition where each data point is in its own cluster. Define the information loss as the decrease in the mutual information $I(\tilde{X};Y)$ due to merging, $\delta I_y = I(X;Y) - I(\tilde{X};Y)$. Then, at each iteration, agglomerative IB greedily merges the two clusters that incur the minimum loss of mutual information. We now define an objective function inspired by the IB approach. In our case, the data that we would like to compress is the set of all venues, while the relevance variable is the existing network structure.
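The equivalence of the two forms of mutual information in Eqs. 4.2a and 4.2b can be checked numerically on a small joint distribution (a sketch with our own variable layout, where `joint[i][j]` holds $p(x_i, y_j)$):

```python
import math

def mutual_information(joint):
    # Eq. 4.2a: sum_{x,y} p(x,y) * log( p(x,y) / (p(x)p(y)) ).
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    mi = 0.0
    for i, row in enumerate(joint):
        for j, pxy in enumerate(row):
            if pxy > 0:
                mi += pxy * math.log(pxy / (px[i] * py[j]))
    return mi

def mutual_information_kl(joint):
    # Eq. 4.2b: sum_y p(y) * D_KL( p(x|y) || p(x) ).
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    total = 0.0
    for j, pyj in enumerate(py):
        for i in range(len(joint)):
            p_x_given_y = joint[i][j] / pyj
            if p_x_given_y > 0:
                total += pyj * p_x_given_y * math.log(p_x_given_y / px[i])
    return total
```

For an independent joint distribution both forms give zero; for a dependent one they agree and are strictly positive.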
Then, intuitively, we would like to compress the venues so that users who are linked in the network are close to each other in the compressed representation, whereas users who are not linked are further apart. Thus, we define

$$I_S(X;Y) = \sum_{y \in Y} p_w(y)\Big\{\mathrm{D}_{\mathrm{KL}}\big(p(x|y)\,\|\,p_{\mathcal{S}_y}(x)\big) - \mathrm{D}_{\mathrm{KL}}\big(p(x|y)\,\|\,p_{\tilde{\mathcal{S}}_y}(x)\big)\Big\} \qquad (4.5)$$

where $\mathcal{S}_y$ denotes the set of friends of user $y$, $\tilde{\mathcal{S}}_y$ denotes the set of non-friends of user $y$, and $p(x|y)$ is the probability of a given user visiting venue $x$. The two terms in the objective function result from combining two types of information: the existence and the absence of links between users. Since our objective is to differentiate the two sets (friends vs. non-friends) for each user, we separate the two by penalizing the second term, the distance to the distribution of non-friends. Note also that instead of $p(y)$, our objective uses $p_w(y) = \frac{\#\text{of edges containing } y}{\#\text{of edges}}$, which gives more weight to users that have more edges. Having defined the above objective, we can use the greedy bottom-up technique described above to find a hard clustering of the venues. One remark is in order: since users do not visit all venues, the denominator in the KL divergence is often zero. To avoid this, below we use the Jensen-Shannon (JS) divergence, a symmetrized and smoothed version of the KL divergence:

$$J(\mathcal{C}) = \sum_{u \in U} w_u \big\{ JS(\mathbf{p}^{\mathcal{C}}_u \,\|\, \mathbf{p}^{\mathcal{C}}_{\mathcal{S}_u}) - JS(\mathbf{p}^{\mathcal{C}}_u \,\|\, \mathbf{p}^{\mathcal{C}}_{\tilde{\mathcal{S}}_u}) \big\} \qquad (4.6)$$
$$JS(\mathbf{p}\,\|\,\mathbf{q}) = \mathrm{D}_{\mathrm{KL}}(\mathbf{p}\,\|\,\mathbf{r}) + \mathrm{D}_{\mathrm{KL}}(\mathbf{q}\,\|\,\mathbf{r}), \quad \text{where } \mathbf{r} = \tfrac{1}{2}\mathbf{p} + \tfrac{1}{2}\mathbf{q} \qquad (4.7)$$
$$\mathbf{p}^{\mathcal{C}}_u(k) = \frac{\#\text{of visits of user } u \text{ to cluster } k}{\#\text{of total check-ins of user } u} \qquad (4.8)$$

In Equations 4.6 and 4.8, $\mathcal{C}$ represents the set of current clusters; the components of the probability vector are defined in Equation 4.8. The overall procedure is shown in Algorithm 3. 4.5 Experimental Results For our experiments, we focused on Gowalla data from three major US cities: San Francisco, Austin, and New York.
Those three cities exhibited more active 'check-ins' compared to other major cities. For each city, we used the top 20% of active users, whose check-ins form 80% of all check-ins. We also left out venues with fewer than 10 distinct visitors during the considered period.

Algorithm 3 Venue Clustering
  Size: consider a total of |V| venues, |U| users
  Input: |V| x |U| co-occurrence matrix X, and |U| x |U| adjacency matrix Y, which is partially observable
  Initialization: start with |V| clusters, each of which contains a single venue
  repeat
    for i, j = 1 to C, i < j do
      d_ij = J(C) - J(C'), where C' = (C \ {c_i, c_j}) ∪ {c̄_ij}, and c̄_ij is the merge of clusters i and j
    end for
    Merge:
      - Find {α, β} = argmin_{i,j} d_ij
      - Merge {c_α, c_β} → c̄_αβ
  until min d_ij < −ε
  Output: C x U matrix X_C

4.5.1 Venue clustering In our first set of experiments, we examined whether our approach yields meaningful clusterings of the venues. We run our algorithm starting from the configuration where each venue initially constitutes a separate cluster. We merge clusters following Algorithm 3 while the information loss remains below a predefined threshold. For all three cities, we examine the top three clusters containing the most venues. Since we are only interested in what the clusters represent, we use 100% of the edge information between the users for this clustering. For each cluster, we find the top 10 most popular venues. We assume that the number of unique users at a venue represents its popularity, i.e., the more unique visitors, the more popular it is. The top 10 venues of each cluster with the most unique users are examined. The names of the venues in the top three clusters of San Francisco are presented in Table 4.1.
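The JS divergence of Eq. 4.7 (which, as written in the text, omits the usual 1/2 factors) and a single merge step of Algorithm 3 can be sketched as follows; the data layout, a dict mapping cluster id to a per-user visit-count vector, is our own:

```python
import math

def kl(p, q):
    # Eq. 4.3, skipping zero entries of p.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q):
    # Eq. 4.7: D_KL(p||r) + D_KL(q||r) with r = (p + q) / 2.
    r = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return kl(p, r) + kl(q, r)

def merge_clusters(counts, i, j):
    # Merging venue clusters i and j sums their per-user check-in counts,
    # which coarsens every user's probability vector (Eq. 4.8).
    merged = dict(counts)
    merged[i] = [a + b for a, b in zip(counts[i], counts[j])]
    del merged[j]
    return merged
```

The greedy loop of Algorithm 3 evaluates the change in the objective J for every candidate merge and applies the one with the smallest loss.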
We observe that cluster $C_1$ mainly consists of amusement facilities such as theaters, breweries (or bars), and cafes. Cluster $C_2$ mostly contains shops and venues in the shopping district of San Francisco. For cluster $C_3$, we found that most of the venues appear to be associated with the LGBT (lesbian, gay, bisexual, and transgender) community. Thus, we see that our clustering algorithm is able to capture semantic information about venues using the friendship network information between the users who check in.

[Figure 4.1: Geo plot of the three clusters $C_1$ (red), $C_2$ (green), $C_3$ (blue).]

Figure 4.1 shows the actual mapping of the venues. It can be seen that venues belonging to the same cluster are not necessarily geographically close to each other; instead, the closeness is in the induced latent space. We also observe that $C_3$ is more geographically localized than the other clusters. This is because many of the venues in this cluster are located in a prominent LGBT neighborhood of the city.

[Figure 4.2: ROC curves for link prediction using JS divergences. The results are shown for the information bottleneck (red) and the unclustered original venues (blue) for varying sizes of the training set: (a) 25% of the whole data (AUC 0.78 vs. 0.72), (b) 50% (AUC 0.80 vs. 0.73), and (c) 75% (AUC 0.80 vs. 0.70).]
4.5.2 Reconstructing Edges In the next experiment, we use the coarse-grained representation of the venues to predict social links among the users. In the San Francisco dataset, there were 3,360 edges out of 706,266 pairs. Furthermore, out of 1,680 edges, 565 (more than a third) had no common venues between the two users at all. The New York dataset has 1,205 active users with 1,051 venues, and 1,781 edges from 725,410 pairs. For the Austin dataset, there were 1,920 active users with 9,126 edges. For the experiment, we use a fraction of the existing edges to cluster the venues using our algorithm. We then try to recover the remaining edges based on the venue clusters the users have visited. As our algorithm uses the available network information in venue clustering, we expect to achieve better accuracy with more network information. To examine this effect, we set the observable edge ratio to 25%, 50%, and 75%, and infer the remaining 75%, 50%, and 25% of edges, respectively. Due to limited space, we only present the results from the San Francisco data (the results were similar for the remaining two datasets). In Figure 4.2, we show the ROC curves for the different approaches, together with the corresponding AUC scores. We see that measuring user similarity based on clusters of venues results in more accurate link prediction. In other words, the induced latent categories of the venues are a better measure of similarity than the individual venues themselves. We also compare our algorithm to the baselines described in Section 4.2. All of the baselines use reduced-dimensional representations of venues for finding similarities between users. For this experiment, we used 50% of the edge information for our algorithm, and validated on the other 50% of the data as a test set, assuming the edge information is unknown. With the same test set, we inferred the edges between users using the other baseline methods and compared them to ours.
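The AUC scores reported alongside the ROC curves summarize ranking quality: AUC equals the probability that a randomly chosen linked pair scores higher than a randomly chosen unlinked pair. A minimal rank-based sketch (the quadratic all-pairs loop is for clarity only):

```python
def auc(pos_scores, neg_scores):
    # Probability that a linked pair outscores an unlinked pair (ties count 1/2).
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos_scores
        for n in neg_scores
    )
    return wins / (len(pos_scores) * len(neg_scores))
```

Here a pair's score would be, e.g., the negative JS divergence between the two users' cluster-level check-in distributions, so that more similar users rank higher.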
As shown in Figure 4.3, our method (IB) outperforms the other baselines (topic: LDA; co-clustering: Spectral Co-Clustering). We note, however, that a direct comparison of the methods is somewhat unfair, since our method makes use of additional (social network) information for clustering the venues, whereas the baselines above are fully unsupervised. Nevertheless, our results clearly indicate that information about social interactions is indeed relevant for clustering venues. 4.6 Summary Finding similar users or friends in LBSN is important for better understanding user mobility patterns. Though there are many venues in a city, only a small number of venues are visited by each individual user. The mobility patterns (i.e., the venues that a user frequently checks into) exhibit the characteristics of each user. Conversely, we can predict the venues that users might be interested in based on our inference. Reaffirming many previous studies, the social network plays a great role in inferring the characteristics of users. In this chapter, we focused on reconstructing the social network based on the partially observable network and the check-ins of users. We showed that when the venues are merged using our algorithm, we achieve better predictions about the social network. We also validated that our clusters contain meaningful representations by examining the names of the venues and their actual locations in geo-space.

[Figure 4.3: ROC curve using JS divergence compared to other baselines; the AUC (0.80) is for the red plot (IB) only. IB denotes our model, whereas 'venues' denotes edge reconstruction using the unclustered venues; topic (LDA) and co-cluster (Spectral Co-Clustering) are the baselines introduced previously.]
Table 4.1: Top 3 largest clusters in San Francisco

C1: Metreon (movie, video game, city target); Amendment Brewery; Mint Plaza; SFMOMA; Moscone Center North; San Francisco Ferry Bldg; Whole Food Market; Thirsty Bear Brewery; Sightglass (coffee); Bloodhound (bar)

C2: Apple Store; Union Square Park; Westfield San Francisco Centre (shopping mall); Moscone West (exhibition hall); Powell St (cable car turntable); Flood Building (Powell St shopping district); Transamerica Pyramid (small shop); Lucca Delicatessen (Italian deli grocery); Macy's

C3: Rainbow World Fund (friends community); Toad Hall (bar); Building (rainbow flag); Sunflower Cafe (unknown, moved out); QBar (rainbow flag); Moby Dick (bar, rainbow flag); 440 Castro (underwear night, rainbow flag); closed property (rainbow flag); Exclusive club

Chapter 5 Temporal Clustering through Hawkes Process 5.1 Introduction Human mobility patterns influence human behavior in both routine and profound ways. Any picture of the spread of disease, traffic congestion, or urban crime would be incomplete without understanding the movements of individuals and groups. Even though the behavior of groups of humans has many degrees of freedom, previous works [34, 70] have demonstrated that human mobility exhibits structural regularities. The recent emergence of Location Based Social Network (LBSN) services such as Gowalla and Foursquare has enabled researchers to perform fine-grained analysis of users' mobility patterns and their impact on social interactions. In LBSN services, users share their current location, or the venues they have visited in the past, with their friends. Most LBSNs give unique IDs to different establishments even if they share the same geographical location (i.e., lat/long coordinates); we emphasize this distinction by using the term "venue" rather than "location".
Typically, a user "checks in" to a specific venue by using a smartphone or tablet to choose from a list of venues near their current location, as determined by Wi-Fi or GPS. This information is sent to the LBSN server and shared with their friends. A user can check in to a venue during each visit and is often encouraged to do so through incentives. The primary LBSN data consist of the check-in history of the users, where each check-in is described by a user id, a venue id, and the time of the check-in. In addition, most LBSN services also provide secondary data that describe the underlying social network of the users. Prior research has studied the correlation between individual mobility patterns and social interactions, e.g., by predicting social ties based on similar mobility patterns [19], or, conversely, by predicting the next check-in location of a user based on the recent check-in history within his local network [55]. While most prior work has focused on user-based modeling of spatial-temporal LBSN data [15, 28], here we argue that a venue-centric approach is sometimes preferable. For instance, if the goal is to predict future attendance at a particular venue, or to measure the impact of an ad campaign on attendance, it is more natural to focus on the check-in dynamics of venues rather than users. While recent work has studied correlations between a venue's characteristics and its popularity [44], the dynamics of venue-specific check-ins have been largely ignored. We focus on modeling the full temporal dynamics of check-ins from a venue-centric perspective. We observe that check-ins at venues are clustered in time, sometimes exhibiting bursty behavior. We also observe that the average check-in patterns for both users and venues are not static, but change over time. We include three primary mechanisms to describe check-in dynamics: (1) repeated behavior, captured by a self-reinforcing mechanism in which a user is strongly influenced by his recent behavior; (2) social influence, i.e., a visit by a user triggers future visits by his friends; and (3) exogenous effects, which include external events (such as releasing new software for the service, or a promotion campaign) that modulate the attendance rates.
We include three primary mech- anisms to describe check-in dynamics: (1) Repeated behavior is captured by a self- reinforcing mechanism in which a user is strongly influenced by his recent behavior; (2) Social influence, i.e., a visit by a user triggers future visits by his friends; and (3) Exoge- nous effects, which include external events (such as releasing new SW for the service or a promotion campaign) that modulate the attendance rates. 74 Here we are especially interested in assessing social influence on visitation patterns. Toward this goal, we adopt a parametric point process model known as a Hawkes pro- cess [40] to describe check-in dynamics at venues. A Hawkes process is an example of a self-exciting point process in which past events positively influence the likelihood (inten- sity) of future events. This model allows us to measure the likelihood that a particular (offspring) event was triggered by a past (parent) event. This allows us to distinguish the most likely factors contributing to an individual check-in. Combining this information with the known social network structure enables us to estimate the fraction of check-ins that can be plausibly attributed to social influence. Beyond the rich explanatory power of the model, we also demonstrate that it predicts future check-in data better than several alternatives. In particular, we consider various baseline point process models and compare them on their ability to capture temporal dynamics of check-ins. Finally, we consider each of the three mechanisms in our model separately and demonstrate their validity by distinguishing social and non-social venues and by capturing known exogenous effects like (external) promotion campaigns. This multi-faceted analysis allows a fine-grained discrimination of different types of venues. While we focus on user/venue dynamics here, the mechanisms we describe are general and could apply to other aspects of human behavior. 
Related Work There is a growing body of literature on LBSN analysis. Link prediction using geo-coincidences has been studied in [19, 86, 74]. Other studies have used social network information to infer user location [71, 29] and to predict the next check-in [55]. Several recent studies have also attempted to cluster users [44] and venues [20] based on similar visitation patterns. Most human activity patterns have bursty dynamics and cannot be adequately described by a homogeneous Poisson process. To describe temporal correlations in social interactions, researchers have used non-homogeneous Poisson processes (NHPP) such as Cox processes [50, 64], as well as Hidden Markov Models [67]. Our approach here is based on the self-exciting Hawkes process [40], which has previously been used for modeling urban crime [57], inter-gang violence [16], and repeated social interactions [12]. 5.2 Dataset Description We use the Gowalla dataset [15] in this work. Gowalla is a location-based social networking website where users share their locations by checking in. In this dataset, the network consists of 196,591 nodes and 950,327 undirected edges. Between February 2009 and October 2010, there were 6,442,890 check-ins. We extracted all the check-ins of active users in San Francisco, New York, and Stockholm as representative samples of the western U.S., the eastern U.S., and Europe. Asian cities had little activity and were excluded from the analysis; check-ins from a relatively active city in Asia (Tokyo) were a quarter of those in San Francisco. We collected all activity within a rectangular box of latitude-longitude coordinates around each of the selected cities. We considered only the 20% of users who were most active, to ensure sufficient statistics for parameter estimation. These users accounted for around 80% of the total number of check-ins; this 80-20 rule was universal across all the cities we examined. Statistics for each city are presented in Table 5.1.
Table 5.1: Statistics of check-ins from three cities

City          | Check-ins | Venues | Users
San Francisco | 142,972   | 10,751 | 5,989
New York      | 114,777   | 17,062 | 6,205
Stockholm     | 184,485   | 15,753 | 9,320

5.3 Model Description 5.3.1 Modeling Temporal Patterns We treat the check-ins in LBSN as a marked point process in time, where the mark represents the venue as well as the user for an event at a given time. By separating the events with respect to venue id, each venue forms its own point process. We defer the analysis of user-specific processes until Sec. 5.3.3. As shown in Figure 5.1, clustering is apparent in the three temporal point processes. Thicker lines represent the degree to which an event is explained by previous events (as opposed to background or exogenous effects). The strength of these ties is computed using the self-exciting point process known as the Hawkes process [40], detailed in the next section. The Hawkes process defines a (mark-specific) intensity as a function of history and time. This model has been widely used in various applications that show temporal clustering of events, such as shocks and aftershocks in seismology.

[Figure 5.1: Temporal pattern of check-ins (SF).]

Hawkes process Each 'check-in' at a given venue is treated as an event in the venue-specific point process. We assume that the intensity of check-in events involving venue $v$ at time $t$ is given as follows:

$$\lambda_v(t) = \mu_v + \sum_{p: t_p < t} g_v(t - t_p). \qquad (5.1)$$

This intensity function can be interpreted as the rate at which events occur (see Eq. 5.4). The summation in the second term is over all the events that have happened up to time $t$.
In Equation 5.1, $\mu_v$ describes the time-independent background rate of event occurrence, whereas the second term describes the self-excitation part, so that events in the past increase the probability of observing another event in the (near) future. We use a two-parameter family for the self-excitation term:

g_v(t - t_p) = \beta_v \omega_v \exp\{-\omega_v (t - t_p)\}.     (5.2)

Here $\beta_v$ describes the weight of the self-excitation term (relative to the background rate), while $\omega_v$ describes the decay rate of the excitation. Intuitively, the decay term captures the notion that more recent events are more important.

5.3.2 Characterizing Correlations Between Events

We have seen the clustering of points in time in Fig. 5.1. A Hawkes process model allows us to measure the strength of ties between two events. By examining the intensity function in Equation 5.1 for a given event, we can further infer the likelihood that the event was triggered by a specific earlier event. We use the probabilistic measure in Equation 5.3 as the strength of the tie between events $i$ and $j$. For a given process (representing a specific venue $v$), the probability that the $j$-th event was triggered by the $i$-th event can be expressed as

p^v_{i \to j} = \frac{g_v(t_j - t_i)}{\mu_v + \sum_{p:\, t_p < t_j} g_v(t_j - t_p)}.     (5.3)

This probability can be inferred from the estimated set of parameters $\{\mu_v, \beta_v, \omega_v\}$. Since we are interested in the correlation of points within a given process (venue), and not in correlation across different processes (venues), we assume each venue has its own parameters and estimate them separately. We use the EM algorithm for inference and estimation of the model parameters, following the parameter update equations of [51].

5.3.3 Three Factors Causing Temporal Clustering

Our study considers three factors that contribute to the temporal clustering of events. We describe each in turn below.
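To make Eqs. (5.1)-(5.3) concrete, the intensity and the triggering probabilities can be evaluated directly from a venue's event history. The sketch below is an illustration of ours, not the thesis code: the function names are hypothetical, and the parameters $(\mu_v, \beta_v, \omega_v)$ are assumed to have already been estimated (e.g., by EM).

```python
import math

def intensity(t, event_times, mu, beta, omega):
    """Eqs. (5.1)-(5.2): background rate plus exponentially decaying
    excitation from every event strictly before time t."""
    excite = sum(beta * omega * math.exp(-omega * (t - tp))
                 for tp in event_times if tp < t)
    return mu + excite

def trigger_prob(j, event_times, mu, beta, omega):
    """Eq. (5.3): probability that event j was triggered by each earlier
    event i.  Returns the background probability (the mu term's share of
    the intensity) and the list of pairwise probabilities p_{i->j};
    together they sum to one."""
    tj = event_times[j]
    lam = intensity(tj, event_times, mu, beta, omega)
    background = mu / lam
    pairwise = [beta * omega * math.exp(-omega * (tj - event_times[i])) / lam
                for i in range(j)]
    return background, pairwise
```

With three events at times 0, 1, and 3, the third event's triggering probabilities and the background share sum to one, as the decomposition requires.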
We are able to disentangle these effects because of the rich information in the data, which includes the user and venue for each event along with a social network among users. This allows us to construct a fine-grained model of the strength of the effect of one visit on another. While the ground-truth cause of each visit is unknown, in the next section we consider various ways to qualitatively test the validity of our model.

Self-reinforcing Behavior

Looking at the behavior of individual users already reveals strongly predictable patterns. Many users return frequently and repeatedly to the same venue. Figure 5.2 shows the activity of three users at Dolores Park Cafe in San Francisco. Users A, B, and C (bottom three rows) each form their own temporal clusters, which combine into a series of clusters in the aggregate (top row). A user who has recently visited a venue is much more likely to visit again soon; conversely, a paucity of visits strongly predicts few visits in the future. This self-reinforcing tendency is measured using Eq. 5.3 by summing over pairs of events $i$ and $j$ that were initiated by a single user. Later in our study, we examine how individuals' overall activity decays over time.

Social Effects

Another factor explaining temporal clustering is social influence. In this case, a user may be more likely to visit a location his friends have visited recently. An LBSN allows users to see their friends' check-ins, and this in turn attracts users to visit the same venues. This effect may function by recommending venues to friends, who likely share similar interests, or simply by reminding users of places they have visited in the past. The increased likelihood of visits due to previous visits by friends is again captured by Eq. 5.3, but this time by summing only over pairs of events involving users who are friends in the social network.

[Figure 5.2: Three users' activity profiles at Dolores Park Cafe]
Exogenous Effect

If a visit is not explained by either of the effects above, we consider it to be caused by some external (exogenous) factor. Many businesses use LBSNs for marketing purposes. By reporting the physical location of their business, their venue becomes visible in the service, which invites LBSN users to visit in the future. Local businesses promoted check-ins to increase their visibility online and to entice new customers by offering special deals. These marketing activities were not limited to local businesses: LBSN services also teamed up with major companies for their own marketing, and Gowalla attracted users in the early stages of its business by giving away presents to active users. The influx of users during such periods also forms clusters that cannot be described by the two effects above. In our studies we examine how these are captured as an exogenous effect.

5.4 Experimental Evaluation

In this section, we perform an experimental evaluation of the model.

5.4.1 Model Selection

For every popular venue, we fit the data to a Hawkes process using the EM algorithm and evaluate the goodness of fit against other baseline approaches (see Section 5.4.2 for the list of baselines). For evaluation we use the AIC score [13], which is widely used for model selection. In addition to rewarding high likelihood, AIC penalizes models with a large number of parameters to discourage overfitting; the model with the smallest score is chosen from the candidates. In our experiments, we found that the baselines generally compared poorly with the homogeneous Poisson process (HPP), so here we focus on comparing the Hawkes process with the HPP. Fig. 5.3 shows the difference in AIC scores plotted against the inferred value of the parameter $\beta$. A positive difference (above the dashed line) suggests a better fit for the HPP, while a negative difference suggests that the Hawkes process model should be selected.
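The comparison above rests on $\mathrm{AIC} = 2k - 2\ln\hat{L}$, with $k = 1$ for the HPP (only $\mu$) and $k = 3$ for the Hawkes process ($\mu, \beta, \omega$). A minimal sketch, with helper names of our own choosing; for the HPP the maximized log-likelihood has the closed form $n\log(n/T) - n$, since the MLE of the rate is $n/T$:

```python
import math

def aic(log_likelihood, n_params):
    # AIC = 2k - 2 ln L; the model with the smaller score is preferred.
    return 2 * n_params - 2 * log_likelihood

def hpp_log_likelihood(event_times, horizon):
    """Maximized log-likelihood of a homogeneous Poisson process on
    [0, horizon]: n log(n/T) - n, using the MLE rate n/T."""
    n = len(event_times)
    rate = n / horizon
    return n * math.log(rate) - rate * horizon
```

The Hawkes log-likelihood would come out of the EM fit; the AIC difference plotted in Fig. 5.3 is then `aic(ll_hawkes, 3) - aic(ll_hpp, 1)`.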
We observe that the Hawkes process is the better choice overall, and especially so for venues with large $\beta$. When $\beta$ is large, our model predicts significant temporal clustering, so it is natural to expect the Hawkes process to do a better job of explaining the data. We also note that as $\beta$ decreases, the gap in AIC between the two models becomes small, and it actually reverses sign as $\beta \to 0$. This too is understandable: if there is no temporal clustering, there is no advantage to including a self-exciting term in the model. We note, however, that in the latter scenario the difference between the AIC scores is minuscule, compared to the significant differences observed for large $\beta$.

[Figure 5.3: AIC comparison: $\mathrm{AIC}_H - \mathrm{AIC}_P$ plotted against $\beta$]

5.4.2 Predicting Venue Attendance

In this experiment, we predict the number of daily visitors in the future. For each venue, we compute the mid-time, i.e., the mid-point between the venue's initial and final check-in times. To have enough temporal data for the training set, we sample prediction times that fall after the mid-time; the check-ins made before the mid-point are collected as a training set. Using the training set, we fit the data to a Hawkes process and estimate its parameters. With the estimated parameters, the rate function at time $t$ can be computed from the history up to time $t$. The number of events in the interval between $t$ and $t + \Delta t$ can then be computed from the counting process ($\Delta t > 0$):

N(t + \Delta t) - N(t) = \int_t^{t+\Delta t} \lambda(\tau)\, d\tau.     (5.4)

In our experiment we focus on predicting daily check-ins, so we set $\Delta t = 24$ hrs. The time $t$ is sampled at random from the second half of each venue's history, so that we have enough history for parameter estimation.
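Under the exponential kernel of Eq. (5.2), the integral in Eq. (5.4) has a closed form for the part of the intensity driven by events already observed at time $t$: each past event $t_p$ contributes $\beta\,[e^{-\omega(t - t_p)} - e^{-\omega(t + \Delta t - t_p)}]$. The sketch below uses that closed form and, as a simplifying assumption of ours, ignores excitation from events that arrive inside the prediction window; the function name is hypothetical.

```python
import math

def expected_count(t, dt, history, mu, beta, omega):
    """Approximate E[N(t+dt) - N(t)] from Eq. (5.4) by integrating the
    intensity driven by events observed up to time t (self-excitation
    from events arriving inside the window is ignored)."""
    total = mu * dt  # background term: mu integrates to mu * dt
    for tp in history:
        if tp < t:
            total += beta * (math.exp(-omega * (t - tp))
                             - math.exp(-omega * (t + dt - tp)))
    return total
```

With an empty history the prediction reduces to the background rate times the window length, and recent events push it above that floor.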
For each venue, we repeat the experiment 1,000 times for different random $t$'s and compare our prediction to the actual number of events. The prediction error is the gap between the actual and predicted number of events, $\mathrm{abs}(\text{true count} - \text{predicted count})$, where the number of predicted events is estimated using Equation 5.4. We compare the Hawkes process to several baselines, including non-homogeneous Poisson processes (NHPP) and Cox processes.

Baseline 1: piecewise-constant NHPP

Check-in data from all three cities shows stronger activity during the weekend than on weekdays. We therefore separate weekend check-ins from weekday check-ins and estimate a rate parameter for each. The parameters $\lambda_{\text{weekend}}$ and $\lambda_{\text{weekday}}$ are constant and can be easily estimated. For predicting the number of visits on a given day, we simply use the appropriate rate parameter for a weekend or a weekday. As for the Hawkes process, we repeat the experiment 1,000 times for each venue.

Baseline 2: NHPP with drift

\lambda(t) = a t + b     (5.5)

Here the rate function is a linear function of time. At many venues, check-ins became more frequent as time elapsed from the first check-in, because more and more people joined the Gowalla service after its introduction. This intensity function captures the arrival of new users, and we will see that it predicts the number of visitors relatively well.

Baseline 3: Cox proportional hazard model

A Cox process is a generalization of the Poisson process in which the random intensity is itself a stochastic process. The Cox proportional hazard model [18] associates covariates that modulate a baseline rate $\lambda_0(t)$ up or down. The baseline is often a non-parametric function of time, while the covariates carry a coefficient $\beta$ that is estimated using the partial likelihood. For our experiment, we define the covariate $x(t)$ as the number of unique visitors so far, assuming that the number of unique past visitors affects the intensity function.
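Baseline 1 amounts to two sample means. A sketch under our own naming, assuming daily counts and day-of-week labels have already been extracted from the check-in timestamps:

```python
def fit_piecewise_rates(daily_counts, day_of_week):
    """Baseline 1: one constant rate for weekdays and one for weekends,
    estimated as the average check-ins per day in each regime.
    day_of_week[i] is 0-6, with 5 and 6 meaning Saturday/Sunday."""
    weekend = [c for c, d in zip(daily_counts, day_of_week) if d >= 5]
    weekday = [c for c, d in zip(daily_counts, day_of_week) if d < 5]
    lam_weekday = sum(weekday) / len(weekday) if weekday else 0.0
    lam_weekend = sum(weekend) / len(weekend) if weekend else 0.0
    return lam_weekday, lam_weekend

def predict_day(day_of_week, lam_weekday, lam_weekend):
    # Prediction is simply the regime-appropriate constant rate.
    return lam_weekend if day_of_week >= 5 else lam_weekday
```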
\lambda(t) = \lambda_0(t) \exp(\beta x(t))     (5.6)

In our experiment, we take the baseline rate to be constant and repeat the same experiment for Baseline 3 with the same sampled times and training sets.

Baseline 4: Sigmoidal Gaussian Cox process

\lambda(s) = \lambda^\star \sigma(g(s))     (5.7)

This is another variant of the Cox process, in which the intensity function is a transformation of a random realization from a Gaussian process. Adams et al. [1] proposed this model, which achieves tractable inference on the unknown intensity function. The random intensity function $\lambda(s)$ has an upper bound $\lambda^\star$, and a sigmoid function projects $g(s)$, sampled from a Gaussian process, onto the intensity function. In [1], the sigmoidal Gaussian Cox process (SGCP) successfully inferred intensity functions of simple form in synthetic experiments. We also compare the Hawkes process to the SGCP on prediction of future events.

Error Comparison

We use 360 venues (120 from each city) and repeat 1,000 predictions for each sampled time range between $t$ and $t + \Delta t$. Venues with few check-ins (fewer than 100 over 400 days) or with a short history (less than 200 days from the first check-in to the last) were excluded from this experiment. Some venues have much more frequent check-ins than others, so we evenly divide the 360 venues into three groups based on the total number of actual check-in counts over the 1,000 test samples, labeling them inactive/moderate/active venues, from fewest to most check-ins. Splitting venues by activity was done on a per-city basis, to avoid city-specific bias due to higher average usage. We further divide the 1,000 test samples into two groups: those with no check-ins (the zero group) and those with at least one check-in. On average, ~70% of test samples from inactive venues, ~50% from moderate venues, and ~35% from active venues fell into the zero group.
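For reference, Baseline 3's intensity (Eq. 5.6) with the constant baseline used in our experiments reduces to a one-liner over the check-in history. A sketch with hypothetical names; the coefficient $\beta$ would in practice come from a partial-likelihood fit:

```python
import math

def cox_intensity(t, check_ins, lam0, beta):
    """Baseline 3 (Eq. 5.6) with a constant baseline rate lam0.  The
    covariate x(t) is the number of unique users who checked in before
    time t; check_ins is a list of (time, user_id) pairs."""
    x = len({user for ts, user in check_ins if ts < t})
    return lam0 * math.exp(beta * x)
```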
The results are presented in Table 5.2. The average prediction error for the zero group is averaged over the total number of zero occurrences in the test samples, while the average prediction error for the non-zero group is averaged over the total number of counts in the test samples.

Table 5.2: Performance of Predictions (average prediction error per sample)

  Process      Obs.       Inactive   Moderate   Active
  Hawkes       zero       0.3202     0.3710     0.7210
               non-zero   0.7238     0.5361     0.4712
  Baseline 1   zero       0.3273     0.4455     1.0306
               non-zero   0.7305     0.7029     0.5937
  Baseline 2   zero       0.5318     0.6795     1.6901
               non-zero   0.6011     0.5707     0.5086
  Baseline 3   zero       0.7289     0.9347     2.2185
               non-zero   0.6040     0.6361     0.5660
  Baseline 4   zero       0.2477     0.4037     0.9989
               non-zero   0.7927     0.7331     0.5990

We observe that the Hawkes process clearly outperforms the other baselines for venues with moderate and high activity levels. In particular, for those venues the Hawkes process produces more accurate predictions both for the rate of events when they occur and for the absence of events. For the inactive venues, we find that Baseline 4 makes more accurate predictions for non-events, while Baseline 2 (and also Baseline 3) make better predictions of the rate of events when they occur. The former observation can be attributed to the fact that Baseline 4 tends to under-predict, which results in a low prediction error for non-events and the highest prediction error (among all methods) for events. Of all the processes under consideration, the Hawkes process is the only one that captures influence between check-ins; the others capture only the fluctuation of rates over time.

5.4.3 Evaluating the Three Factors

We now focus on understanding the relative importance of the three main factors, put forward in Section 5.3.3, that are responsible for temporal clustering. Toward this goal, recall that the (directional) correlation between two check-in events can be measured using Equation 5.3.
Furthermore, by using the existing social network information, we can estimate the relative contribution of each factor by analyzing the identity of the users in those check-ins. Namely, if both check-ins are by the same user, the event pair contributes to the self-reinforcing behavior. Similarly, when the check-ins are by two users who are connected in the social network, the pair contributes to the social effect. Finally, event pairs that belong to neither of these groups are attributed to exogenous effects. Since we are interested in differentiating the strength of these effects, we separately sum the pairwise probabilities $p^v_{i \to j}$ from Equation 5.3 over the cases where events $i$ and $j$ involve the same person, two people who are friends with each other, or neither. To quantify how much each factor contributes to the temporal patterns, we define the following scores corresponding to the three factors, respectively:

S_{\text{self}} = \frac{\sum_{t_i < t_j} p^v_{i\to j}\, \mathbf{1}[u_i = u_j]}{\sum_{t_i < t_j} p^v_{i\to j}},
\quad
S_{\text{social}} = \frac{\sum_{t_i < t_j} p^v_{i\to j}\, \mathbf{1}[u_j \in F(u_i)]}{\sum_{t_i < t_j} p^v_{i\to j}},
\quad
S_{\text{exgn}} = 1 - S_{\text{self}} - S_{\text{social}},     (5.8)

where $\mathbf{1}[\cdot]$ is an indicator function, $u_i$ is the user corresponding to event $i$, and $F(u_i)$ is the set of friends of $u_i$. We measured the above scores for 120 venues from the three cities. (We excluded Stockholm from the analysis below because most of the venues there were hard to identify using geo-coordinates alone.)

Table 5.3 shows the top 5 venues with the highest estimated self-reinforcing behavior scores. Interestingly, a high $S_{\text{self}}$ score seems to capture venues that reflect repeated behaviors, such as commuting between work and home or regularly visiting favorite local places.

Table 5.3: Top 5 Venues with Self-Reinforcing Behavior

  San Francisco                     New York
  Laguna Honda station              G. Washington bridge
  Bernie's (local coffee)           Manhattan bridge
  San Francisco Caltrain station    Port authority bus terminal
  Mail Access                       Lincoln tunnel
  Research Institute                Grand central terminal

We next turn to venues characterized by a high social effect score $S_{\text{social}}$. Intuitively, we expect such venues to consist of bars or local restaurants in a community, which have higher chances of attracting users who are friends of each other than venue types such as popular tourist attractions or stations visited by large numbers of unrelated users. We list the venues with high social effect scores in Table 5.4. Indeed, based on their names and types, these venues seem to reflect what one might expect of highly social venues.

Table 5.4: Top 5 Venues with High Social Effect

  San Francisco                  New York
  303 second st. Plaza           Tasti D lite (ice cream)
  Chinatown (restaurant)         Cafe 28
  Golf smith (shop) or others    Moschino Boutique
  Restaurant (name unknown)      Ace Hotel NY
  Western athletic clubs         Radio City Music Hall

Finally, we analyze the relative importance of, and temporal variations in, the exogenous effects as predicted by our model. Toward this goal, we average $S_{\text{exgn}}$ over all the popular venues in San Francisco for all the check-ins during a given week, and then track the variation of the averaged score over time. The resulting dynamics are shown in Figure 5.4. We observe that the exogenous score increases from $S_{\text{exgn}} \approx 0.25$ in September 2009 to over 0.5 in June 2010. Remarkably, the onsets of the two growth periods identified in the figure correspond to important external events: according to the Gowalla blog, the company released its software for iPhone users on December 2nd, 2009, and for Android users on March 7th, 2010. We clearly see steep growth of $S_{\text{exgn}}$ after each release, vindicating our intuition.
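The scores in Eq. (5.8) reduce to a single bookkeeping pass over the pairwise triggering probabilities of Eq. (5.3). A minimal sketch, assuming those probabilities have already been computed for one venue; the data layout and names are ours:

```python
def factor_scores(event_users, friends, trigger):
    """Eq. (5.8): split the total triggering mass at one venue into
    self-reinforcing, social, and exogenous shares.
    event_users: time-ordered list of user ids, one per event.
    friends: dict mapping a user id to a set of friend ids.
    trigger: dict mapping (i, j) with t_i < t_j to p_{i->j} from Eq. (5.3)."""
    total = s_self = s_social = 0.0
    for (i, j), p in trigger.items():
        total += p
        if event_users[i] == event_users[j]:
            s_self += p
        elif event_users[j] in friends.get(event_users[i], set()):
            s_social += p
    s_self /= total
    s_social /= total
    return s_self, s_social, 1.0 - s_self - s_social
```

Summing the three returned scores always gives one, matching the definition of $S_{\text{exgn}}$ as the residual mass.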
More generally, we believe that by tracking the dynamics of $S_{\text{exgn}}$ (perhaps with only local averaging), it might be possible to detect the impact of even smaller promotional events.

[Figure 5.4: Exogenous effect score $S_{\text{exgn}}$ plotted against time (San Francisco), with the iPhone and Android app releases marked]

Comparison between Cities Among the three cities, San Francisco showed the highest social effect score, meaning that many of the events within a cluster involved users with social relationships. For this experiment, we used the top ~120 venues from each city and compared the average social effect; learning the Hawkes process requires a sufficient number of samples, which led us to use ~120 venues. The average social effect scores for San Francisco, New York, and Stockholm were 0.0895, 0.0220, and 0.0188, respectively. We also measured the average social density over all collected venues, defined as the fraction of realized friendship edges over all possible pairs of visitors: 0.0374 (San Francisco), 0.0552 (New York), and 0.0406 (Stockholm). Interestingly, while San Francisco had the lowest social density, it showed the highest social effect score. This is because users who were friends in San Francisco visited venues during the same time periods, while friends in the other cities did not cluster as much of their activity at similar times and venues.

Long-term Activity Trends We conclude this section by commenting on the long-term activity patterns of individual users. As indicated in Figure 5.2, the temporal patterns of individuals' behavior seem to decay over time. One possibility is that a user simply stops visiting a venue. Another possible explanation is that users tend to use the Gowalla service less often as time passes.
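The social density used above is just the edge count over the number of possible pairs among a venue's visitors. A sketch with hypothetical names:

```python
def social_density(num_users, edges):
    """Fraction of realized friendship edges over all possible pairs
    among the users who visited a venue (undirected, no self-loops)."""
    possible = num_users * (num_users - 1) / 2
    return len(edges) / possible if possible else 0.0
```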
Indeed, this trend is indicated in Figure 5.5, where we plot the average number of check-ins (averaged over all the active users in all three cities) against the time passed since the first check-in. We observe that user activity decays rapidly after the first week of use. Remarkably, this decay seems to be fairly universal across the cities. A more detailed analysis of the user-turnover dynamics is an interesting open problem.

[Figure 5.5: Average number of check-ins per day with respect to the days since the first check-in (San Francisco)]

5.5 Summary

In conclusion, we have introduced a point process model describing check-in activity for users and venues participating in LBSNs. Our model, which is based on the self-exciting Hawkes process, outperforms benchmarks in both data explanation and prediction tasks. More importantly, the proposed approach allows us to construct a fine-grained view of events that enables us to distinguish relevant factors such as self-reinforcing behavior, social effects, and exogenous effects. Qualitative results suggest that we are able to meaningfully distinguish these factors. Future work will provide a more in-depth analysis of these important effects and their repercussions on human mobility patterns.

Chapter 6

Network Behavior Joint-Space

6.1 Introduction

Social Network Services (SNS) enable users to make friends and share their thoughts and experiences online. Real network data and user behavior data have become increasingly popular in machine learning applications. Analyzing the connection between the two is necessary both for better understanding users and for predictive modeling, such as recommending new friends or activities: videos to watch, restaurants to visit, etc. The recent proliferation of Location-Based Social Network (LBSN) services has resulted in an abundance of human mobility data.
Thanks to GPS and these services, users are now able to leave a record of their visits at a much finer granularity. Before LBSNs, many researchers relied on cell phone records, which only indicate the area a user was in during a call. With LBSN data, we know exactly where the user was (i.e., the name of the place) when "checking in." LBSN users check in at venues from a list provided by the service, in order to share their visits with friends. The collection of a user's venues reflects what the user likes, and for this reason LBSN data is considered privacy-sensitive: users' behavior and predilections can easily be inferred from the restaurants or shops they frequent. This motivates clustering venues of similar type, and many works have suggested such clustering methods, based on geo-coordinates, friendship, and latent topics.

In this chapter, we consider probabilistic models that describe users' behaviors and their friendships. We note that although we introduce our model in the LBSN setting, it generalizes easily to other applications, not limited to social networks and venues. The proposed model is a fully generative process that combines Latent Dirichlet Allocation [11] (LDA) and the Mixed Membership Stochastic Blockmodel [3] (MMSB), where we represent behaviors using LDA and friendships using MMSB. LDA infers a latent mixture of topics for each document based on the words it comprises, where each topic has its own distribution over words. Though LDA was originally formulated for text corpora, it can be applied to any collection of discrete data. For instance, recent work [44] has shown that venues can be clustered into latent topics by treating users as documents and venues as words. The idea of LDA has also been extended to model interactions between nodes in MMSB.
In MMSB, each node takes on a mixture of roles/communities (the analogue of topics in LDA) and forms edges to other nodes through a stochastic process. Finding a joint space for the two models has a significant advantage for prediction: the unknown edges of a given node can be predicted without any information about its existing edges, by observing only the venues the user visited. Similarly, venues can be predicted without any previous check-in records, by relying on the network information alone.

Another challenge with these datasets is computational complexity. Earlier MMSB variants struggled with large datasets because their computational complexity is quadratic in the number of nodes. Recent works [35, 36] have addressed this issue by using stochastic variational inference or by adding an extra constraint on non-edges to improve efficiency. In our work we focus on the latter approach and build a model that also accounts for the behavior of users. Later in this chapter, we introduce a new model that improves on the efficiency of the state of the art.

6.2 Modeling LBSN

The LBSN data consists of two parts: the network information between users and each user's check-in record. In recent work [44], users and venues have been clustered through latent topics by treating users as documents in the LDA context. We extend this model by including a-MMSB [36] to describe link formation between users. One can think of a-MMSB as a subclass of MMSB [3] in which the off-diagonal components of the block matrix are ignored for the edges in the network. One of the challenges is to infer a joint community/topic distribution that describes both processes well: check-ins and link formation. The earliest work [59] on finding a joint space for MMSB and LDA underperforms when the model infers a mixture of communities and topics to describe the links and the words.
This stems from the exchangeability assumptions that model makes, and in our experiments we found that it tended to occur more often as the number of communities/topics increased. This problem was already addressed in [14], where it was solved by introducing a constraint that enforces the two spaces to be the same. Constraints of this type were originally introduced in [8, 10], and [14] follows the approach of [10]. However, the models proposed in [14] lose the flexibility of multi-faceted interactions: every node has a fixed probability distribution over topics, and the interactions are determined by that distribution itself. MMSB, in contrast, allows a node to select a topic from its distribution separately for each of its relations.

We introduce a fully generative process that uses MMSB for network formation and LDA for check-in generation. Our work is inspired by [8], which correlates images and captions, where we use the network and check-ins instead. We later improve the efficiency of this model. Our contribution is twofold: (1) better predictive performance than the state-of-the-art model, and (2) improved computational efficiency.

We let $y_{a,b}$ indicate the existence of an edge between nodes $a$ and $b$. Many real-world social networks are undirected, so we need only consider the cases where $a < b$; the expression generalizes easily to directed networks by using $y_{a \to b}$ instead. We further define $\pi_a$ as the community/topic distribution of node $a$ and $w_{a,i} \in \{1, \ldots, V\}$ as the venue index of user $a$'s $i$-th check-in. Using these notations, we describe the generative processes for link formation and check-ins.

6.2.1 Generative Processes

Here we present the generative processes for link formation and check-ins.

∙ Link Formation: Let $N$ be the total number of nodes and $K$ the total number of communities.

1. For each community $k$, sample the community strength $\beta_k \sim \text{Beta}(\eta_1, \eta_0)$.

2.
For each node $a$, sample the $K \times 1$ membership vector $\pi_a \sim \text{Dirichlet}(\alpha)$.

3. For each pair $a$ and $b$, where $a < b$:

(a) Draw the membership indicator vector $z_{a \to b} \sim \text{Multinomial}(\pi_a)$.

(b) Draw the membership indicator vector $z_{b \to a} \sim \text{Multinomial}(\pi_b)$.

(c) Sample the pair interaction $y(a,b) \sim \text{Bernoulli}(z_{a \to b}^\top B\, z_{b \to a})$, where $B = \text{diag}(\beta_1, \beta_2, \ldots, \beta_K) + \epsilon(1 - I)$.

4. For each node $a$, construct the set $Z_a = \{z_{a \to b} \mid b \in \mathcal{N}_{-a}\}$, where $\mathcal{N}_{-a}$ is the set of nodes excluding $a$.

∙ Check-ins (Perfect Joint): Let $M_a$ be the total number of check-ins of user $a$ and $V$ the total number of venues.

1. For each community $k$, sample the venue distribution vector $\omega_k \sim \text{Dirichlet}(\kappa)$.

2. For each check-in $w_{a,m}$ of user $a$, $m \in \{1, \ldots, M_a\}$:

(a) Sample the correspondence index indicator $c^a_m \sim \text{Unif}(\{1, \ldots, \text{size}(Z_a)\})$.

(b) Sample $w_{a,m} \sim \text{Multinomial}(\Omega_{c^a_m})$, where $\Omega = [\omega_1 | \ldots | \omega_K]$.

∙ Check-ins (Partially Joint):

1. Sample the venue distributions:

(a) For each community $k$, sample the venue distribution vector $\omega_k \sim \text{Dirichlet}(\kappa)$ (same as in the perfect joint model).

(b) Sample a "global" venue distribution vector, not tied to any community, $\omega_G \sim \text{Dirichlet}(\kappa)$.

6.2.2 Variational Inference

We introduce a factorized distribution $q(\cdot)$ and build the Evidence Lower BOund (ELBO):

\log p(Y, \mathbf{w}_{1:N} \mid \eta, \alpha, \kappa) \geq \mathcal{L}(\phi, \gamma, \lambda)
  \triangleq E_q[\log p(Y, \mathbf{w}_{1:N}, \pi_{1:N}, Z_{1:N}, B, C, \Omega \mid \eta, \alpha, \kappa)]
  - E_q[\log q(\pi_{1:N}, Z_{1:N}, B, C, \Omega)],     (6.1)

where $Y$ is the matrix with components $y(a,b)$, $\mathbf{w}_n$ is the collection of check-ins of user $n$, and $C$ is the collection of the $c_m$'s from all of the nodes. The factorized distribution over the latent variables is defined as

q(\pi_{1:N}, Z_{1:N}, C) = \prod_{n=1}^{N} q_{\text{dir}}(\pi_n \mid \gamma_n)
  \prod_{a,b}^{N} q_{\text{mul}}(z_{a \to b} \mid \phi_{a \to b})\, q_{\text{mul}}(z_{a \leftarrow b} \mid \phi_{a \leftarrow b})
  \prod_{m=1}^{M_n} q_{\text{mul}}(c^n_m \mid \lambda^n_m).     (6.2)

For the parameter distributions we also use factorized $q(\cdot)$ distributions, namely $q_{\text{beta}}(\beta_k \mid \tau_{k1}, \tau_{k0})$ and $q_{\text{dir}}(\omega_i \mid \rho_i)$.
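Before turning to inference, the generative process of Sec. 6.2.1 can be exercised directly. The sketch below samples one link (steps 3a-3c) and one perfect-joint check-in (steps 2a-2b); it is an illustration under our own naming, with the block matrix represented only by its diagonal $\beta$ and the off-diagonal constant $\epsilon$:

```python
import random

def sample_link(pi_a, pi_b, beta_diag, eps):
    """Steps 3a-3c: each node draws a community indicator from its
    membership vector; the edge is Bernoulli with the community strength
    beta_k when the indicators match, and eps otherwise."""
    z_a = random.choices(range(len(pi_a)), weights=pi_a)[0]
    z_b = random.choices(range(len(pi_b)), weights=pi_b)[0]
    p = beta_diag[z_a] if z_a == z_b else eps
    return z_a, z_b, random.random() < p

def sample_checkin(Z_a, omega):
    """Steps 2a-2b (perfect joint): pick one of the node's membership
    indicators uniformly, then draw a venue from that community's venue
    distribution omega[z]."""
    z = random.choice(Z_a)
    return random.choices(range(len(omega[z])), weights=omega[z])[0]
```

With a degenerate membership vector concentrated on one community, the sampled indicators are deterministic, which makes the coupling between link formation and check-in generation easy to see.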
The ELBO can be maximized using a coordinate ascent algorithm. The update equation for each variational parameter is obtained by taking the partial derivative of the ELBO and setting it to zero. The update for the $k$-th component of node $a$'s $\gamma_a$ is

\gamma_{a,k} \leftarrow \alpha_k + \sum_{b} \phi_{a \to b, k} + \sum_{b} \phi_{a \leftarrow b, k}.     (6.3)

For the updates of $\{\phi\}$ and $\{\lambda\}$, we optimize the ELBO under the constraint that the components of each vector sum to 1. Solving with Lagrange multipliers, the corresponding update equations are

\phi_{a \to b, k} \propto \exp\Big( E_q[\log p(\pi_{a,k})] + E_q[\log p(Y(a,b))] + E_q\big[\log \textstyle\prod_{m=1}^{M_a} p(w_{a,m})^{c^a_{m,b}}\big] \Big),     (6.4)

\lambda^a_{m,b} \propto \exp\big( E_q[\log p(w_{a,m} \mid c^a_{m,b} = 1)] \big)
  \propto \exp\Big( \sum_{k=1}^{K} \phi_{a \to b, k} \log p(w_{a,m} \mid c^a_{m,b} = 1, z_{a \to b, k} = 1, \Omega) \Big),     (6.5)

where the subscript $b$ in $c_{m,b}$ and $\lambda_{m,b}$ denotes the index of $\phi_{a \to b}$ in $Z_a$. Note that, like the indicator vector $z_{a \to b}$, $c^a_m$ has exactly one component equal to 1, with all others set to 0.

As one can see, the computational complexity of the update equation for $\{\phi\}$ is quadratic in the number of nodes. In recent work [35], a link sampling method was proposed by adding the constraint that the average of the $\phi_{a \to \cdot}$'s over non-links equals the average of the $\phi_{a \to \cdot}$'s over links. We found that with this constraint we can perform inference efficiently without resorting to stochastic variational inference, which relies on sub-sampling within variational inference. We incorporate this constraint in our work as well, which makes the computational complexity linear in the total number of edges. In what follows, we denote by $\{\phi_{a \to \cdot}\}^+$ the set of $\phi_{a \to \cdot}$'s from links and by $\{\phi_{a \to \cdot}\}^-$ the set from non-links, with $\bar{\phi}_{a \to \cdot}$ being the mean over $\{\phi_{a \to \cdot}\}^-$. The size of the set $Z_a$ can also be reduced from all relations of node $a$ to just its number of edges.
This is mainly because, under the constraint we impose, the $\{\phi_{a \to \cdot}\}^-$ in the unreduced original set $Z_a$ can be treated as duplicates of $\{\phi_{a \to \cdot}\}^+$, in that the averages of the positive and negative sets are the same. Moreover, since many social networks are undirected, we can further reduce the update equations for $\{\phi_{a \to \cdot}\}^+$ to

\phi_{ab,k} \propto \exp\Big( E_q[\log p(\pi_{a,k})] + E_q[\log p(\pi_{b,k})] + E_q[\beta_k] + E_q\big[\log \textstyle\prod_{m=1}^{M_a} p(w_{a,m})^{c^a_{m,b}}\big] \Big),     (6.6)

where, using properties of the exponential family, $E_q[\log p(\pi_{a,k})] = \psi(\gamma_{a,k}) - \psi(\sum_r \gamma_{a,r})$ and $E_q[\beta_k] = \psi(\eta_{k,1}) - \psi(\eta_{k,2})$. As for the last term in Equation 6.6, $E_q[\log \prod_{m=1}^{M_a} p(w_{a,m})^{c^a_{m,b}}] = \sum_{m=1}^{M_a} \lambda_{m,b}\, \big(\psi(\sum_r \mathbf{1}(w_{a,m} = r)\, \rho_{k,r}) - \psi(\sum_i \rho_{k,i})\big)$.

6.2.3 Parameter Estimation

Fixing the variational parameters above to their update equations, we can update the model parameters by maximizing the ELBO with respect to the model parameter of interest. This process is known as the M-step of the variational EM algorithm. In our model, $\beta_k$ follows a Beta distribution with variational parameters $\tau_1, \tau_0$, and $\omega_k$ follows a Dirichlet distribution with variational parameters $\rho_k$. The update equations for these variational parameters are

\tau_{k,1} = \eta_1 + \sum_{(a,b) \in \text{link}} \phi_{ab,k},
\quad
\tau_{k,2} = \eta_2 + \sum_{(a,b) \in \text{non-link}} \phi_{a \to b, k}\, \phi_{a \leftarrow b, k},     (6.7)

\rho_{ij} = \kappa_j + \sum_{a=1}^{N} \sum_{m=1}^{M_a} \mathbf{1}(w_{a,m} = j) \sum_{b \in \text{link}} \phi_{ab,i}\, \lambda_{m,b}.     (6.8)

6.3 Related Works

Nallapati et al. [59] originated the idea of combining LDA and MMSB in a joint space, using a document citation network. This work was improved by Chang and Blei [14], whose RTM embeds the data in a latent space that explains both the words of the documents and how the documents are connected. RTM extends the supervised topic model [10] by enforcing that the topic distribution accounting for the network be the same as the average distribution of the topics generating the words.
6.4 Experiments

For the evaluation, we use a real-world dataset [15] collected from the Gowalla LBSN service. From the whole dataset, we collect users and their check-ins from three major cities. We perform link prediction of unknown links and venue prediction of unknown check-ins under different set-ups, and compare our model to previous works. The datasets from these cities were preprocessed by filtering out the 80% of inactive users who account for only 20% of the check-ins. Users with fewer than 5 hops were also excluded, to discard communities with an extremely small number of users. Table 6.1 shows the statistics of these datasets. For all of the experiments, we kept the hyper-parameter $\alpha$ fixed to 5 and ran the variational EM algorithm until convergence. For the convergence criterion, the algorithm stops when the proportional change in the likelihood bound is less than 1e-8.

Table 6.1: Dataset Statistics

City            # of Users   # of Venues   # of Check-ins   # of Edges
San Francisco   931          1,909         63,543           3,429
New York        687          1,058         23,626           1,859
Stockholm       1,607        1,698         79,704           6,444

6.4.1 Venue Prediction

For venue prediction, we perform two experiments, both with 5-fold evaluation. We use 4/5 of the whole data, with both check-ins and network, for training, and 1/5 for evaluation. For the chosen test set, we hide all information except the network information; in other words, all check-ins are hidden for the users in the test set. Our goal in this experiment is to predict the venues a user might be interested in, based on the existing edges only. In practice, SNSs face this problem often: many users rarely exhibit behaviors online while establishing many online friends. We believe this could be one solution for prediction and also for recommendation. For each of the three cities' datasets, we perform 10 trials of experiments, increasing the number of topics or communities from 5 to 25 in increments of 5.
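The 5-fold setup described above can be sketched as follows: split users into 5 folds, then train on 4/5 of them while hiding the held-out fold's check-ins (its network edges are kept). The fold logic here is a minimal illustrative stand-in, not the thesis code.

```python
# Sketch of a 5-fold split of users; each user lands in exactly one fold.
def five_folds(users, k=5):
    return [users[i::k] for i in range(k)]

folds = five_folds(list(range(10)))
print(len(folds), sorted(sum(folds, [])))  # 5 folds covering every user once
```

Evaluation then cycles over the folds, using each in turn as the test set whose check-ins are hidden.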
The predictive perplexity was used to compare the models, where a lower perplexity score denotes more predictive power. We also measure the average ranking of visited venues at 100% recall, presented in Figure 6.1:

$$\text{venue predictive-perplexity} = \exp\big( -\log p(\mathbf{w} \mid Y, \eta, \alpha, \kappa) \big) \quad (6.9)$$

[Figure 6.1: Predictions on the LBSN dataset (left: Venue Prediction, right: Edge Prediction, San Francisco); average ranking vs. number of topics/communities for Our Model, RTM, and MMSB+LDA; lower score is better]

[Figure 6.2: Predictions on the citation dataset (left: Word Prediction, right: Link Prediction, CORA); average ranking vs. number of topics/communities for Our Model, RTM, and MMSB+LDA; lower score is better]

6.4.2 Edge Prediction

Likewise, we perform edge prediction based on the collection of users' check-ins. With an LBSN as an example, this prediction can recommend friends a user might know or share similarities with. Such recommendations would be especially useful for users who have just joined the service and haven't made any friends. Since network data is imbalanced, with only a small number of edges, rather than comparing the perplexity score for positive edges we sort pairs by their likelihood of link formation and find the ranks of the positive edges. In fact, we found that our method outperforms previous works in terms of perplexity as well. In this set of experiments, we present the average ranking of positive edges at 100% recall.

6.4.3 Evaluation using a Document Citation Network

We continue the experiments by evaluating our model against citation network data, on which RTM has been tested. The CoRa dataset consists of 2,708 scientific publications and 1,433 words.
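The average-ranking metric described above can be sketched as follows: score every candidate, sort in descending order, and average the ranks of the held-out positives (100% recall means every positive receives a rank). The names and toy scores are illustrative.

```python
# Sketch of the average-ranking metric used for venue/edge prediction.
def average_rank(scores, positives):
    """scores: dict candidate -> predicted likelihood; positives: held-out items."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    pos = set(positives)
    ranks = [i + 1 for i, c in enumerate(ranked) if c in pos]
    return sum(ranks) / len(ranks)

scores = {"cafe": 0.9, "gym": 0.2, "park": 0.7, "bar": 0.1}
print(average_rank(scores, ["cafe", "park"]))  # ranks 1 and 2 -> 1.5
```

A lower average rank means the model pushes the true venues or edges closer to the top of the list, matching the "lower score is better" convention in Figures 6.1 and 6.2.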
Every directed edge was converted to an undirected edge, and each document has a binary word vector indicating the presence of each word in the document. The difference between the citation network dataset and the LBSN dataset is that the word vectors of the former are binary, while those of the latter count word appearances. In this regard, rather than measuring the predictive perplexity as in our previous work, we measure the rank of each appearance of a word or an edge. The results are given in Figure 6.2.

6.5 Improving the Computational Efficiency of MMSB

In this section, we introduce a method that improves the computational efficiency of MMSB. The assortative MMSB model with a constraint on non-edges [36] improved the scalability of MMSB. MMSB assumes that a node can undertake multiple roles in its various interactions. In reality, we have found that the majority of nodes undertake only a single role across all of their interactions. We studied various real-world networks and found that, as in the case of the HEP-PH dataset, 70% of the nodes were assigned to a single community.

[Figure 6.3: Community histogram: the majority of nodes take only a single role (HEP-PH dataset: 12K nodes, 32 roles)]

This sparse role assignment leads us to the idea of representing the roles in a tree structure. In the HEP-PH dataset, where we have predefined 32 total roles, we can subdivide these roles into two stages, with 4 roles at the parent level and 8 child roles under each parent. The overall algorithm is divided into two rounds: the first round for the parent roles and the second round for the child roles. Rather than visiting 32 roles in the first round for every node in the dataset, visiting 4 roles is much more efficient.
After convergence in the first round, unassigned roles can be ignored, and only the roles selected in the first round are used further in the second round. If a role was selected in the first round for a given node, we revisit its 8 child roles during inference. The overall procedure enhances the computational efficiency.

[Figure 6.4: Multi-stage MMSB uses a tree structure of roles (Role A → A1, A2, A3, A4; Role B → B1, B2, B3, B4)]

6.5.1 Multi-Stage MMSB

As we have shown previously, the majority of nodes are not well mixed, often selecting one role across all of their interactions. We use multi-stage variational inference on MMSB for our role inference. Figure 6.4 is an example of a tree-structured representation of 8 roles. For simplicity we have used 2-stage variational inference, but this easily generalizes to multiple stages with different tree-structured representations. In the first round, we use only two roles for our inference; once optimized, unselected roles are ignored in the second round. Only the child roles whose parents were selected become candidates in the inference algorithm. The first round usually converges fast due to the limited number of roles considered. The second round also converges fast because only the children of selected parents are considered, ignoring the children of unselected parents.

[Figure 6.5: Multi-stage MMSB converges faster than the existing model without losing accuracy (perplexity vs. time until convergence, AMB vs. multistage-AMB; HEP-PH dataset: 12K nodes, 32 roles)]

6.5.2 Evaluation using Real-World Data

For the evaluation, we use the HEP-PH dataset, which was also used in [35, 36]. As in those works, we assumed this dataset has 32 roles (or communities), and compared our inference to theirs.
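The two-round tree search described above can be sketched as follows. The role tree and the scoring function are toy stand-ins assumed for illustration; they are not the thesis implementation, which works on variational responsibilities rather than a single hard score.

```python
# Illustrative two-stage role assignment: round 1 over the few parent roles,
# round 2 only over the children of the selected parent.
def two_stage_assign(nodes, tree, score):
    """tree: parent role -> list of child roles; score(node, role) -> affinity."""
    assignments = {}
    for n in nodes:
        parent = max(tree, key=lambda p: score(n, p))          # round 1: parents only
        child = max(tree[parent], key=lambda c: score(n, c))   # round 2: that parent's children
        assignments[n] = child
    return assignments

tree = {"A": ["A1", "A2"], "B": ["B1", "B2"]}
score = lambda n, r: len(set(n) & set(r))  # toy affinity: shared characters
print(two_stage_assign(["A1x", "B2y"], tree, score))  # {'A1x': 'A1', 'B2y': 'B2'}
```

With 4 parents of 8 children each, a node visits at most 4 + 8 = 12 roles instead of all 32, which is the source of the speed-up reported in Figure 6.5.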
We measured the perplexity of a held-out set to compare accuracy, and also compared the time until convergence. As shown in Figure 6.5, the multi-stage inference algorithm converges much faster than the existing model without losing any accuracy. This inference algorithm is especially useful when dealing with large datasets. In large networks, many communities are exclusive of other communities and often isolated; in these cases, multi-stage inference over the communities is much faster than other algorithms.

6.6 Summary

In this chapter, we proposed a latent space approach that finds the joint space of network and behavior. Because the MMSB-LDA joint space model captures both node behavior and link structure, it can predict one observation given the other. This is useful when no previous record of the same type exists and we have to rely on other information. Many new users who have just appeared in an SNS either have no friends or have had no online activity; in those cases, the predictions made by our model can serve as a very effective recommendation system.

We also studied enhancing the computational efficiency of MMSB, which accounts for the link structure. When using tree-structured roles, a number of roles can be ignored during inference. This speeds up convergence by skipping unnecessary visits to irrelevant roles. Many social networks are sparsely connected, which we can take advantage of using the multi-stage inference algorithm.

Chapter 7
Conclusions and Open Questions

In this thesis, we presented models that efficiently infer hidden attributes in social networks and actors' behavior. For network modeling, we studied how node attributes and network structure co-evolve, influencing each other.
Pairwise interaction modeling was extended to spatial-temporal network modeling, considering all pair interactions in time and geographic space. We found that pair interactions exhibit strong spatial-temporal patterns. By understanding the underlying structure of these patterns, one can estimate missing information and make predictions of future events. This set of works is very useful when dealing with the incomplete datasets we often face. We also modeled users' movement patterns in time and geographic space. This "check-in" activity can easily be generalized to other behavior types by considering venues as items. Our work has shown that items can be clustered using the link structure, and that the temporal data of each item can be clustered, showing how an event is affected by the series of past events. Below we summarize each chapter and present the contributions of our work.

In Chapter 2, node attribute dynamics and link structure dynamics were modeled in a way that lets the two dynamics co-evolve. Our work differs from previous dynamic MMSB network models in that ours captures the influence between nodes, whereas the previous work captures global drifting. We showed that our model performs better than the previous model on a real-world dataset. In this work, we studied the interplay between node attribute dynamics and link dynamics; this framework allows us to capture the trajectory of the latent attributes well.

In Chapter 3, we investigated the temporal patterns of pair interactions in a network. Besides the temporal patterns, we also defined an intensity function between pairs over geographic space. Many datasets suffer from missing information, which needs to be estimated. We proposed a model that can handle missing information: it estimates the model parameters even with missing information, through estimates for the items with missing data. LPPM handles missing data better than previous works.
In Chapter 4, we studied venues using the network structure of the users who visited them. Specifically, we used a non-parametric approach called the information bottleneck, and showed how clustered venues achieve better predictions than the original sparse data. Dimensionality reduction reduces the number of zero entries by combining items. This work can be generalized to other behavioral datasets in which each user has a vector describing their behavior.

In Chapter 5, we studied the temporal patterns of users' behavior. In this work, we used an LBSN dataset and looked into users' "check-in" behavior. We introduced a venue-based model by defining an intensity function for each venue, and saw how current check-ins are affected by previous check-ins. We showed that our model predicts future check-ins better than other baselines. Three factors were defined and examined for each behavior, to see how it was affected by previous behaviors through these three factors.

Finally, in Chapter 6, we combined our network models and behavior models by inferring their joint space: the latent space for the network and the latent space for behavior. This joint space model allows a better understanding of the two spaces and provides an effective recommendation system. When new users appear in an SNS, the system may have trouble recommending friends or activities if there is no previous record of the user. Our model suggests a recommendation system that predicts edges using previous behavior records, or vice versa.

We present a set of open questions that can be addressed by extending our models.

∙ Incorporating review scores: Most behaviors or items are reviewed by users in online services. For instance, users on Netflix, Yelp, or Amazon review movies, venues, or products using 5-star ratings. By modeling the ratings, one can predict how much users would like an item, or what kind of group of users might like it.
∙ Global behavior vs. local behavior: Our network-behavior joint space model infers the joint space of two latent spaces, one describing link structure and the other users' behavior. We can further assume that behavior can be subdivided into two groups, global and local, where global behaviors are universal across all groups of users. By separating out the global behavior, one can achieve better prediction of links and behavior by connecting the local behaviors to each group.

∙ Incorporating geo-coordinates in LBSN: Each user in an LBSN service has home location information. This could be the actual place where they live, or a place they often visit. Home location information could be used to further distinguish venues that are far from the home location from venues that are nearby. Mobility patterns are strongly affected by distance, and by adding a distance penalty term we can predict more accurately where a user might visit.

∙ Schedule of the day: The temporal patterns of check-ins are also affected by the actual time. For instance, around noon and in the evening, people tend to check in to venues more actively than at other hours. Our current work captures this as an exogenous effect, but it could also be captured by defining the check-in intensity as a function of absolute time. This way, the "time of day" effect can be excluded from the exogenous term.

Appendix A
Variational EM Update Equations for Co-Evolving MMSB

A.1 Variational EM Update Equations for Co-Evolving MMSB

A.1.1 Alternative View of the EM Algorithm

We start with the log-likelihood function, where $Y$ is the data, $X$ is the set of latent variables, and $\Theta$ is the set of model parameters.
$$\log p(Y \mid \Theta) = \log \int p(Y, X \mid \Theta)\, dX = \log \int q(X)\, \frac{p(Y, X \mid \Theta)}{q(X)}\, dX \;\geq\; \int q(X) \log \frac{p(Y, X \mid \Theta)}{q(X)}\, dX \quad \text{(Jensen's inequality)} \quad (A.1)$$

We define the lower bound as the free energy:

$$F(q, \Theta) = \int q(X) \log \frac{p(X \mid Y, \Theta)\, p(Y \mid \Theta)}{q(X)}\, dX = \log p(Y \mid \Theta) - D_{KL}\big(q(X) \,\|\, p(X \mid Y, \Theta)\big) \quad (A.2)$$

The goal is to maximize the lower bound (free energy) by updating $q$ and $\Theta$. In the E-step, we minimize the KL-distance between the two distributions, and in the M-step, we maximize the free energy under the fixed $q$ distribution obtained in the E-step.

A.1.2 KL-Distance

Here we present the KL-distance between $q(X)$ and $p(Y, X)$:

$$
\begin{aligned}
D_{KL}(q \| p) =\;
& \sum_t \sum_p \Big( -\tfrac{1}{2} E_q\big[(\vec\mu^{\,t}_p - \vec\gamma^{\,t}_p)^T (\Sigma^t_p)^{-1} (\vec\mu^{\,t}_p - \vec\gamma^{\,t}_p)\big] - \log(2\pi)^{k/2} - \log |\Sigma^t_p|^{1/2} \Big) \\
& + \sum_t \sum_{p,q} \sum_g \varphi^t_{p \to q, g} \log \varphi^t_{p \to q, g} + \sum_t \sum_{p,q} \sum_h \varphi^t_{p \leftarrow q, h} \log \varphi^t_{p \leftarrow q, h} \\
& - \sum_t \sum_{p,q} \sum_{g,h} \varphi^t_{p \to q, g}\, \varphi^t_{p \leftarrow q, h}\, f(Y^t(p,q), B(g,h)) \\
& - \sum_t \sum_{p,q} \sum_g \varphi^t_{p \to q, g} \Big( \gamma^t_{p,g} - \log \zeta^t_p + 1 - \tfrac{1}{\zeta^t_p} \sum_k \exp\big(\gamma^t_{p,k} + \tfrac{\sigma^{t\,2}_{p,k}}{2}\big) \Big) \\
& - \sum_t \sum_{p,q} \sum_h \varphi^t_{p \leftarrow q, h} \Big( \gamma^t_{q,h} - \log \zeta^t_q + 1 - \tfrac{1}{\zeta^t_q} \sum_k \exp\big(\gamma^t_{q,k} + \tfrac{\sigma^{t\,2}_{q,k}}{2}\big) \Big) \\
& - \sum_{t>0} \sum_p \Big( -\tfrac{1}{2} E_q\big[ (\vec\mu^{\,t}_p - f_b(\vec\mu^{\,t-1}_p, \vec\mu^{\,t-1}_{\mathcal{S}^{t-1}_p}))^T \Sigma_\mu^{-1} (\vec\mu^{\,t}_p - f_b(\vec\mu^{\,t-1}_p, \vec\mu^{\,t-1}_{\mathcal{S}^{t-1}_p})) \big] - (k/2)\log(2\pi) - \log |\Sigma_\mu|^{1/2} \Big) \\
& - \sum_p \Big( -\tfrac{1}{2} E_q\big[(\vec\mu^{\,0}_p - \vec\alpha_0)^T A^{-1} (\vec\mu^{\,0}_p - \vec\alpha_0)\big] - \log(2\pi)^{k/2} - \log |A|^{1/2} \Big)
\end{aligned}
\quad (A.3)
$$

where $\exp(\gamma_{p,k} + \sigma^{t\,2}_{p,k}/2)$ comes from the moment-generating function of the normal distribution, $M_X(t) := E[e^{tX}]$, with $t = 1$. The first line simplifies to $\text{const} - \sum_{t,p,k} \log \sigma^t_{p,k}$, where, once again, we have taken the covariance matrix to be diagonal.

A.1.3 Variational E-step

In the variational E-step, we minimize the KL distance over the variational parameters. The variational parameters $\{\vec\gamma_p\}^t$ and $\{\sigma_{p,k}\}^t$ cannot be solved for analytically, so we use the Newton-Raphson method as the optimization algorithm for tightening the bound with respect to those variational parameters. First, we minimize the divergence with respect to $\vec\gamma^{\,t}_p$.
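The free-energy bound in Equations A.1-A.2 can be verified numerically on a toy model with one binary latent variable $X$: $F(q) = \sum_x q(x)\log[p(Y,x)/q(x)] \le \log p(Y)$, with equality when $q$ equals the posterior $p(X \mid Y)$. All numbers below are illustrative.

```python
import math

# Toy check of Jensen's bound: free energy <= log evidence, tight at the posterior.
p_joint = {0: 0.3, 1: 0.1}            # p(Y, X=x) for the observed Y
log_evidence = math.log(sum(p_joint.values()))

def free_energy(q):
    return sum(q[x] * math.log(p_joint[x] / q[x]) for x in q if q[x] > 0)

q_bad = {0: 0.5, 1: 0.5}
posterior = {x: p / sum(p_joint.values()) for x, p in p_joint.items()}
print(free_energy(q_bad) <= log_evidence)          # True: the bound holds
print(abs(free_energy(posterior) - log_evidence))  # ~0: tight at the posterior
```

This is exactly the decomposition $F(q,\Theta) = \log p(Y \mid \Theta) - D_{KL}(q \,\|\, p(X \mid Y, \Theta))$: the gap between the bound and the evidence is the KL term the E-step drives to zero.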
Since the other variational parameters $\Sigma^t_p$ are assumed to be diagonal matrices, we treat the multivariate normal distribution as a product of independent normal distributions, and update the mean and variance for each coordinate. We use the Newton-Raphson method for each coordinate, where the derivative is:

$$
\begin{aligned}
dD_{KL}(q \| p)/d\gamma^t_{p,k} =\;
& - \sum_q \varphi^t_{p \to q, k} + \sum_q \sum_g \varphi^t_{p \to q, g}\, \frac{1}{\zeta^t_p} \exp\Big(\gamma^t_{p,k} + \frac{(\sigma^t_{p,k})^2}{2}\Big) \\
& - \sum_q \varphi^t_{q \leftarrow p, k} + \sum_q \sum_h \varphi^t_{q \leftarrow p, h}\, \frac{1}{\zeta^t_p} \exp\Big(\gamma^t_{p,k} + \frac{(\sigma^t_{p,k})^2}{2}\Big) \\
& + \Big[ \frac{\gamma^t_{p,k}}{|\eta_k|^2} + \frac{(1-\beta_p)^2\, \gamma^t_{p,k}}{|\eta_k|^2} + \sum_{q \in \mathcal{S}^t_p} \frac{\beta_q^2\, \gamma^t_{\mathcal{S}^t_q,k}/|\mathcal{S}^t_q|}{|\eta_k|^2} - (1-\beta_p)\frac{\gamma^{t-1}_{p,k}}{|\eta_k|^2} - (1-\beta_p)\frac{\gamma^{t+1}_{p,k}}{|\eta_k|^2} \\
& \quad + \beta_p(1-\beta_p)\frac{\gamma^t_{\mathcal{S}^t_p,k}}{|\eta_k|^2} - \beta_p\frac{\gamma^{t-1}_{\mathcal{S}^{t-1}_p,k}}{|\eta_k|^2} + (1-\beta_p)\sum_{q \in \mathcal{S}^t_p} \frac{\beta_q\, \gamma^t_{q,k}/|\mathcal{S}^t_p|}{|\eta_k|^2} - \sum_{q \in \mathcal{S}^t_p} \frac{\beta_q\, \gamma^{t+1}_{q,k}/|\mathcal{S}^t_p|}{|\eta_k|^2} \Big]
\end{aligned}
\quad (A.4)
$$

Here $\gamma^t_{\mathcal{S}^t_p,k} \equiv \sum_{q \in \mathcal{S}^t_p} w^t_{p \leftarrow q}\, \gamma^t_{q,k}$ is the mean of the set of neighbors of node $p$ at time $t$, and $\sigma^2_{\mathcal{S}^t_p,k} \equiv \sum_{q \in \mathcal{S}^t_p} (w^t_{p \leftarrow q})^2 (\sigma^t_{q,k})^2$ is the variance of the set of neighbors of node $p$ at time $t$. The mean and variance of the neighbors can be computed easily, since the components of the neighbors are independent of each other and are Gaussian themselves. The derivative above is valid for $\gamma^t_{p,k}$ when $0 < t < T$; the form is slightly different when $t = 0$ or $t = T$.

Second, we minimize the divergence with respect to $((\sigma^t_{p,1})^2, (\sigma^t_{p,2})^2, \ldots, (\sigma^t_{p,K})^2)$ using the Newton-Raphson method. The derivative with respect to $\sigma^t_{p,k}$ is:

$$dD_{KL}(q \| p)/d\sigma^t_{p,k} = 2(N-1)\, \frac{\sigma^t_{p,k}}{\zeta^t_p} \exp\Big(\gamma^t_{p,k} + \frac{(\sigma^t_{p,k})^2}{2}\Big) - \frac{1}{\sigma^t_{p,k}} + \frac{1 + (1-\beta_p)^2 + \sum_q Y^t(q,p)\, \beta_q^2\, (w^t_{q \leftarrow p})^2 / |\mathcal{S}^t_q|^2}{\eta_k^2}\, \sigma^t_{p,k}, \quad (A.5)$$

where $\eta_k^2$ is the diagonal component of the covariance matrix $\Sigma_\mu$. When $t = 0$ or $t = T$, the derivative differs slightly from the above equation.

A.1.4 Variational M-step

$$\log p(Y \mid \Theta) \geq \int_X q(X) \log \frac{p(Y, X \mid \Theta)}{q(X)}\, dX \quad (A.6)$$

The M-step of the EM algorithm computes the hyper-parameters by maximizing the lower bound under the fixed $q$ found in the E-step.
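The per-coordinate Newton-Raphson iteration used above can be sketched generically. The objective here is a toy stand-in for the per-coordinate KL derivative; the routine itself is the standard root-finding scheme, not the thesis code.

```python
# Generic Newton-Raphson: drive f(x) to zero using its derivative df(x).
def newton_raphson(f, df, x0, tol=1e-10, max_iter=50):
    x = x0
    for _ in range(max_iter):
        step = f(x) / df(x)
        x -= step
        if abs(step) < tol:
            break
    return x

# Toy example: root of f(x) = x^2 - 2, i.e. sqrt(2).
root = newton_raphson(lambda x: x * x - 2, lambda x: 2 * x, x0=1.0)
print(root)  # ~1.41421356
```

In the E-step, $f$ would be the derivative in Equation A.4 (or A.5) for one coordinate, and the iteration is run once per coordinate per node and time step.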
The lower bound on the log-likelihood comes from Jensen's inequality (Equation A.6), and the expectation is taken with respect to the variational distribution. Hence the general form of the update equation at the $k$-th step is:

$$\Theta_k = \arg\max_\Theta \int q_k(X) \log p(Y, X \mid \Theta)\, dX \quad (A.7)$$

Since the final forms of most model parameters are quite intuitive, we only derive Equation 2.17 in this section. To obtain the update equation for $\beta_p$, we start by differentiating the expected log-likelihood and setting it to zero:

$$0 = \sum_{t<T} \sum_k \Big( -(1-\beta_p)\big(\gamma^{t\;2}_{p,k} + \sigma^2_{p,k}\big) + \beta_p \big(\gamma^2_{\mathcal{S}^t_p,k} + \sigma^2_{\mathcal{S}^t_p,k}\big) \Big) + \sum_{t>0} \sum_k \Big( \gamma^t_{p,k}\gamma^{t-1}_{p,k} - \gamma^t_{p,k}\gamma_{\mathcal{S}^{t-1}_p,k} + (1 - 2\beta_p)\, \gamma^{t-1}_{p,k}\gamma_{\mathcal{S}^{t-1}_p,k} \Big)$$

Solving the equation above,

$$\beta_p = \frac{\sum_{t>0}\sum_k \big( \gamma^{t-1\;2}_{p,k} + \sigma^{t-1\;2}_k - \gamma^t_{p,k}\gamma^{t-1}_{p,k} - \gamma^{t-1}_{p,k}\gamma_{\mathcal{S}^{t-1}_p,k} + \gamma^t_{p,k}\gamma_{\mathcal{S}^{t-1}_p,k} \big)}{\sum_{t>0}\sum_k \big( \gamma^{t-1\;2}_{p,k} + \sigma^{t-1\;2}_k - 2\gamma^{t-1}_{p,k}\gamma_{\mathcal{S}^{t-1}_p,k} \big) + \sum_{t<T}\sum_k \big( \gamma^2_{\mathcal{S}^t_p,k} + \sigma^2_{\mathcal{S}^t_p,k} \big)} \quad (A.8)$$

To solve for the optimal weights, we differentiate the lower bound with respect to $w_{p \leftarrow q_1}$ and set it to zero:

$$0 = \sum_k \Big( \frac{\beta_p^2\, w_{p \leftarrow q_1}}{|\mathcal{S}^t_p|^2} \big(\gamma^{t\;2}_{q_1,k} + \sigma^{t\;2}_{q_1,k}\big) - \frac{\beta_p}{|\mathcal{S}^t_p|}\gamma^t_{q_1,k}\gamma^{t+1}_{p,k} + \frac{\beta_p}{|\mathcal{S}^t_p|}(1-\beta_p)\gamma^t_{q_1,k}\gamma^t_{p,k} + \frac{\beta_p^2}{|\mathcal{S}^t_p|^2} \Big( \sum_{q \in \mathcal{S}^t_p,\, q \neq q_1} Y(p, q_1)\, w_{p \leftarrow q}\, \gamma^t_{q,k} \Big) \Big) \quad (A.9)$$

Finally, the update equation for the weight becomes

$$w_{p \leftarrow q_1} = \frac{\sum_k \frac{\beta_p}{|\mathcal{S}^t_p|}\gamma^t_{q_1,k}\gamma^{t+1}_{p,k} - \frac{\beta_p}{|\mathcal{S}^t_p|}(1-\beta_p)\gamma^t_{q_1,k}\gamma^t_{p,k} - \frac{\beta_p^2}{|\mathcal{S}^t_p|^2}\big( \sum_{q \in \mathcal{S}^t_p,\, q \neq q_1} Y(p, q_1)\, w_{p \leftarrow q}\, \gamma^t_{q,k} \big)}{\sum_k \frac{\beta_p^2}{|\mathcal{S}^t_p|^2}\big(\gamma^{t\;2}_{q_1,k} + \sigma^{t\;2}_{q_1,k}\big)} \quad (A.10)$$

Appendix B
Variational EM Update Equations for LPPM

B.1 Variational E-step

In the variational E-step, we maximize $\mathcal{L}_\Phi$ over the variational parameters. Note that the variational parameters should satisfy the normalization constraint $\sum_{i<j} \varphi^{ij}_p = 1$.
Introducing Lagrange multipliers $\gamma_p$ to enforce those constraints, and taking the derivative of Equation 3.10 with respect to the variational parameters, yields

$$0 = \frac{\partial}{\partial \varphi^{ij}_p} E_Q\Big[ \sum_k z^{ij}_k \log \lambda_{ij}(t_k) - \Lambda^T_{ij} \Big] + \log r_{ij}(\mathbf{x}_p) - \log \varphi^{ij}_p - 1 + \gamma_p \quad (B.1)$$

Solving the constrained optimization problem with Lagrange multipliers, we obtain the update equation for the variational parameter $\varphi^{ij}_p$:

$$\varphi^{ij}_p = \frac{1}{C_p} \exp\Bigg\{ \frac{\partial}{\partial \varphi^{ij}_p} E_Q\Big[ \sum_k z^{ij}_k \log \lambda_{ij}(t_k) - \Lambda^T_{ij} \Big] \Bigg\}\, r_{ij}(\mathbf{x}_p) \quad (B.2)$$

where the Lagrange multiplier has been absorbed into the normalization constant $C_p$.

To evaluate the derivative of the expectation of $\log \lambda_{ij}(t_k)$ with respect to $\varphi^{ij}_p$ in the above equation, we distinguish two cases: $k > p$ and $k = p$. Before giving the derivatives for the two cases, we introduce a new function for simpler notation:

$$\mathcal{M}_{ij}(\mathcal{Z}_k) = \prod_{l=1}^{k} (\varphi^{ij}_l)^{z^{ij}_l} (1 - \varphi^{ij}_l)^{(1 - z^{ij}_l)}, \quad (B.3)$$

which is the joint probability of a given scenario from the beginning up to event $k$. First, for the case $k = p$, we have

$$\frac{\partial}{\partial \varphi^{ij}_p} E_Q[\log \lambda_{ij}(t_p)] = \sum_{\mathcal{Z}_{p-1}} \mathcal{M}_{ij}(\mathcal{Z}_{p-1}) \log \Big[ \mu_{ij} + \sum_{l=1}^{p-1} z^{ij}_l\, g_{ij}(t_p - t_l) \Big] \quad (B.4)$$

On the right-hand side of Equation B.4, the sum is over all possible configurations of the latent variables up to event $p-1$. Similarly, we can derive the derivative with respect to $\varphi^{ij}_p$ for the terms with $k > p$. For steps where $k$ is greater than $p$,

$$\frac{\partial}{\partial \varphi^{ij}_p} E_Q[\log \lambda_{ij}(t_k)] = \sum_{\tilde{\mathcal{Z}}^p_{k-1}} \varphi^{ij}_k\, \tilde{\mathcal{M}}^p_{ij}(\tilde{\mathcal{Z}}^p_{k-1}) \times \log \Bigg[ \frac{\mu_{ij} + \sum_{l=1,\, l \neq p}^{k-1} z^{ij}_l\, g_{ij}(t_k - t_l) + g_{ij}(t_k - t_p)}{\mu_{ij} + \sum_{l=1,\, l \neq p}^{k-1} z^{ij}_l\, g_{ij}(t_k - t_l)} \Bigg] \quad (B.5)$$

where we define $\tilde{\mathcal{Z}}^p_k$ as $\mathcal{Z}_k$ excluding $z_p$, with $\tilde{\mathcal{M}}^p_{ij}(\cdot)$ following the same logic. The numerator term in the logarithm above comes from the case where pair $i$ and $j$ triggered the $k$-th event via the $p$-th event, while the denominator term comes from the case where they did not.
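The normalization in Equation B.2, where the Lagrange multiplier is absorbed into a constant $C_p$, amounts to a softmax over candidate pairs: each $\varphi^{ij}_p$ is an exponentiated score divided by the sum of all of them. The sketch below is illustrative, with toy scores.

```python
import math

# Sketch of the Eq. B.2 normalization: exponentiate scores and divide by C_p.
def normalize_phi(scores):
    m = max(scores)                       # subtract max for numerical stability
    expd = [math.exp(s - m) for s in scores]
    z = sum(expd)
    return [e / z for e in expd]

phi = normalize_phi([1.0, 2.0, 3.0])
print(sum(phi))  # 1.0: the constraint sum_p phi_p = 1 holds by construction
```

Subtracting the maximum before exponentiating does not change the result (it cancels in the ratio) but avoids overflow when the scores are large, which matters because the scores in Equation B.2 contain log-intensity terms.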
Finally, for the derivative of the expectation of $\Lambda^T_{ij}$ in Equation B.2, we use

$$\frac{\partial}{\partial \varphi^{ij}_p} E_Q[-\Lambda^T_{ij}] = -\beta_{ij}\, \{1 - \exp(-\omega_{ij}(T - t_p))\} \quad (B.6)$$

By combining Equations B.2–B.6, we obtain an iterative scheme for finding the variational parameters, of the form

$$\varphi^{ij}_p = f\big( \{\varphi^{ij}_k\}_{k=1;\, k \neq p}^{n};\ \Theta \big) \quad (B.7)$$

These iterations are repeated until all variational parameters converge.

B.2 Variational M-step

The M-step of the EM algorithm computes the parameters by maximizing the expected log-likelihood found in the E-step. The model parameters consist of spatial parameters and temporal parameters. We first look at the update equations for the spatial parameters. In some cases, when the spatial pattern is distinct across pairs, we use a single Gaussian for each pair, and the update equations (i.e., the mean and the variance of the Gaussian distribution) are:

$$\mathbf{m}_{ij} \leftarrow \frac{\sum_k \varphi^{ij}_k\, \mathbf{x}_k}{\sum_k \varphi^{ij}_k} \quad (B.8)$$

$$\sigma^2_{ij,lat} \leftarrow \frac{\sum_k \varphi^{ij}_k\, (\mathbf{x}_{k,lat} - \mathbf{m}_{ij,lat})^2}{\sum_k \varphi^{ij}_k} \quad (B.9)$$

$$\sigma^2_{ij,long} \leftarrow \frac{\sum_k \varphi^{ij}_k\, (\mathbf{x}_{k,long} - \mathbf{m}_{ij,long})^2}{\sum_k \varphi^{ij}_k} \quad (B.10)$$

When using a Gaussian mixture model, the weight vector of the mixture model for each pair is updated respectively:

$$w^c_{ij} \leftarrow \frac{\sum_k \varphi^{ij}_k\, \dfrac{\mathcal{N}(\mathbf{x}_k \mid \mathbf{m}^c_{ij}, \Sigma^c_{ij})}{\sum_{p=1}^{C} \mathcal{N}(\mathbf{x}_k \mid \mathbf{m}^p_{ij}, \Sigma^p_{ij})}}{\sum_k \varphi^{ij}_k} \quad (B.11)$$

The re-estimation of the temporal parameters is more involved.
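The weighted mean/variance M-step in Equations B.8-B.10 can be sketched for a single coordinate. The function and variable names are illustrative; the responsibilities $\varphi^{ij}_k$ weight each event's location.

```python
# Sketch of Eqs. B.8-B.10 for one coordinate: responsibility-weighted
# mean and variance of the event locations attributed to a pair (i, j).
def weighted_gaussian_update(phi, x):
    """phi: list of responsibilities phi^{ij}_k; x: list of scalar coordinates."""
    w = sum(phi)
    mean = sum(p * xi for p, xi in zip(phi, x)) / w
    var = sum(p * (xi - mean) ** 2 for p, xi in zip(phi, x)) / w
    return mean, var

mean, var = weighted_gaussian_update([1.0, 1.0], [0.0, 2.0])
print(mean, var)  # 1.0 1.0
```

The same update is applied independently to the latitude and longitude coordinates, which is why Equations B.9 and B.10 are identical in form.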
For instance, to estimate $\mu_{ij}$, we nullify the derivative of the likelihood with respect to $\mu_{ij}$, $\frac{\partial \mathcal{L}_\Phi}{\partial \mu_{ij}} = 0$, which yields

$$\mu_{ij} \leftarrow \frac{1}{T} \sum_k \sum_{\mathcal{Z}_{k-1}} \frac{\varphi^{ij}_k\, \mu_{ij}\, \mathcal{M}_{ij}(\mathcal{Z}_{k-1})}{\mu_{ij} + \sum_{l=1}^{k-1} z^{ij}_l\, g_{ij}(t_k - t_l)} \quad (B.12)$$

Similarly, for the re-estimation of $\beta_{ij}$:

$$\beta_{ij} \leftarrow \frac{\sum_k \sum_{\mathcal{Z}_{k-1}} \varphi^{ij}_k\, \mathcal{M}_{ij}(\mathcal{Z}_{k-1})\, \dfrac{\sum_{l=1}^{k-1} z^{ij}_l\, g_{ij}(t_k - t_l)}{\mu_{ij} + \sum_{l=1}^{k-1} z^{ij}_l\, g_{ij}(t_k - t_l)}}{\sum_k \varphi^{ij}_k \int_0^{T - t_k} \omega_{ij}\, e^{-\omega_{ij}\tau}\, d\tau} \quad (B.13)$$

Finally, for $\omega_{ij}$, we obtain

$$\sum_k \varphi^{ij}_k \Bigg[ \sum_{\mathcal{Z}_{k-1}} \frac{\sum_{l=1}^{k-1} z^{ij}_l\, (1 - \omega_{ij}(t_k - t_l))\, g_{ij}(t_k - t_l)}{\mu_{ij} + \sum_{l=1}^{k-1} z^{ij}_l\, g_{ij}(t_k - t_l)}\, \mathcal{M}_{ij}(\mathcal{Z}_{k-1}) - \beta_{ij}(t_k - T)\, e^{-\omega_{ij}(T - t_k)} \Bigg] = 0 \quad (B.14)$$

where

$$g_{ij}(t - t_p) = \beta_{ij}\, \omega_{ij} \exp\{-\omega_{ij}(t - t_p)\} \quad (B.15)$$

Unfortunately, the resulting equations do not allow closed-form solutions, so they have to be solved using numerical methods, such as Newton's method employed here. We can also obtain a closed-form update equation for $\omega_{ij}$ by approximating the second term in Equation B.14 by zero. When $\omega_{ij} T$ is fairly large compared to $\omega_{ij} t_k$, we can ignore the second term and obtain the closed form:

$$\omega_{ij} \leftarrow \frac{\sum_k \sum_{\mathcal{Z}_{k-1}} \varphi^{ij}_k\, \mathcal{M}_{ij}(\mathcal{Z}_{k-1})\, \dfrac{\sum_{l=1}^{k-1} z^{ij}_l\, g_{ij}(t_k - t_l)}{\mu_{ij} + \sum_{l=1}^{k-1} z^{ij}_l\, g_{ij}(t_k - t_l)}}{\sum_k \sum_{\mathcal{Z}_{k-1}} \varphi^{ij}_k\, \mathcal{M}_{ij}(\mathcal{Z}_{k-1})\, \dfrac{\sum_{l=1}^{k-1} z^{ij}_l\, (t_k - t_l)\, g_{ij}(t_k - t_l)}{\mu_{ij} + \sum_{l=1}^{k-1} z^{ij}_l\, g_{ij}(t_k - t_l)}} \quad (B.16)$$

The following remark is due: the update equations for both the variational parameters and the model parameters involve summation over all possible configurations of the latent variables. This sum may become prohibitively expensive for long history windows. However, due to the exponential decay of the self-excitation term, events too far in the past have negligible impact on future events. This observation justifies limiting the summation to a window, i.e., $\lambda_{ij}(t_k \mid \mathcal{H}_{t_k}) \approx \lambda_{ij}(t_k \mid \{h_l\}^k_{l=k-d})$, which discards events that are far in the past.
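The truncated self-exciting intensity discussed above can be sketched as follows: $\lambda(t) = \mu + \sum_l \beta\,\omega\, e^{-\omega(t - t_l)}$ over only the last $d$ events. Parameter values and the window size $d$ are illustrative.

```python
import math

# Sketch of the truncated Hawkes-style intensity with exponential kernel
# g(t - t_l) = beta * omega * exp(-omega * (t - t_l)), keeping only the
# last d past events (the truncation justified by the exponential decay).
def intensity(t, history, mu, beta, omega, d=50):
    recent = [tl for tl in history if tl < t][-d:]   # keep only the last d events
    return mu + sum(beta * omega * math.exp(-omega * (t - tl)) for tl in recent)

lam = intensity(10.0, [1.0, 9.0], mu=0.1, beta=0.5, omega=1.0)
print(lam > 0.1)  # True: past events only add to the baseline rate
```

The event at $t = 1$ contributes almost nothing ($e^{-9}$) while the event at $t = 9$ dominates, which is exactly why discarding distant events changes the intensity negligibly.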
In the results, we use this truncation to speed up the inference process.

Reference List

[1] Ryan Prescott Adams, Iain Murray, and David J. C. MacKay. Tractable nonparametric Bayesian inference in Poisson processes with Gaussian process intensities. In Proceedings of The 26th International Conference on Machine Learning, 2009.

[2] Yong-Yeol Ahn, James P. Bagrow, and Sune Lehmann. Link communities reveal multiscale complexity in networks. Nature, 466(7307):761–764, August 2010.

[3] Edoardo M. Airoldi, David M. Blei, Stephen E. Fienberg, and Eric P. Xing. Mixed membership stochastic blockmodels. Journal of Machine Learning Research, 9:1981–2014, June 2008.

[4] Shahriar Azizpour, Kay Giesecke, Xiaowei Ding, Baeho Kim, and Supakorn Mudchanatongsuk. Self-exciting corporate defaults: Contagion vs. frailty, 2008.

[5] Brian Ball, Brian Karrer, and M. E. J. Newman. Efficient and principled method for detecting communities in networks. Phys. Rev. E, 84:036103, Sep 2011.

[6] Albert-László Barabási. The origin of bursts and heavy tails in human dynamics. Nature, 435:207–211, May 2005.

[7] Matthew J. Beal and Zoubin Ghahramani. The variational Bayesian EM algorithm for incomplete data: with application to scoring graphical model structures. Bayesian Statistics, 7:453–464, 2003.

[8] David M. Blei and Michael I. Jordan. Modeling annotated data. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2003.

[9] David M. Blei and John D. Lafferty. A correlated topic model of science. Annals of Applied Statistics, 1(1):17–35, 2007.

[10] David M. Blei and Jon D. McAuliffe. Supervised topic models. In Proceedings of the Advances in Neural Information Processing Systems 20, 2007.

[11] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet allocation. J. Mach. Learn. Res., 3:993–1022, March 2003.

[12] Charles Blundell, Katherine A. Heller, and Jeffrey M. Beck.
Modelling reciprocating relationships with Hawkes processes. In Proceedings of the Advances in Neural Information Processing Systems 25, 2012.

[13] Kenneth P. Burnham and David R. Anderson. Model selection and multimodel inference: a practical information-theoretic approach. Springer, 2nd edition, 2002.

[14] Jonathan Chang and David M. Blei. Relational topic models for document networks. In Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics, 2009.

[15] Eunjoon Cho, Seth A. Myers, and Jure Leskovec. Friendship and mobility: User movement in location-based social networks. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2011.

[16] Yoon Sik Cho, Aram Galstyan, Jeff Brantingham, and George Tita. Latent point process models for spatial-temporal networks. arXiv:1302.2671, 2013.

[17] Thomas M. Cover and Joy A. Thomas. Elements of information theory. Wiley-Interscience, New York, NY, USA, 1991.

[18] David R. Cox. Regression models and life-tables (with discussion). Journal of the Royal Statistical Society, B(34):187–220, 1972.

[19] David J. Crandall, Lars Backstrom, Dan Cosley, Siddharth Suri, Daniel Huttenlocher, and Jon Kleinberg. Inferring social ties from geographic coincidences. Proceedings of the National Academy of Sciences, 107(52):22436–22441, December 2010.

[20] Justin Cranshaw, Raz Schwartz, Jason I. Hong, and Norman Sadeh. Utilizing social media to understand the dynamics of a city. In Proceedings of the 6th International Conference on Weblogs and Social Media, 2012.

[21] Noel Cressie and Christopher K. Wikle. Statistics for Spatio-Temporal Data (Wiley Series in Probability and Statistics). Wiley, 1st edition, May 2011.

[22] Inderjit S. Dhillon. Co-clustering documents and words using bipartite spectral graph partitioning. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2001.
[23] Nan Du, Le Song, Alex J. Smola, and Ming Yuan. Learning networks of heterogeneous influence. In Proceedings of the Advances in Neural Information Processing Systems 25, 2012.

[24] E. Errais, K. Giesecke, and L. Goldberg. Affine point processes and portfolio credit risk. SIAM Journal on Financial Mathematics, 1(1):642–665, 2010.

[25] Ron Eyal, Sarit Kraus, and Avi Rosenfeld. Identifying missing node information in social networks. In Proceedings of the Twenty-Fifth Conference on Artificial Intelligence, 2011.

[26] Yu Fan and Christian R. Shelton. Learning continuous-time social network dynamics. In Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence, pages 161–168, 2009.

[27] Wenjie Fu, Le Song, and Eric P. Xing. Dynamic mixed membership blockmodel for evolving networks. In Proceedings of the 26th International Conference on Machine Learning, 2009.

[28] Huiji Gao, Jiliang Tang, and Huan Liu. Exploring social-historical ties on location-based social networks. In Proceedings of the 6th International AAAI Conference on Weblogs and Social Media, 2012.

[29] Huiji Gao, Jiliang Tang, and Huan Liu. gSCorr: modeling geo-social correlations for new check-ins on location-based social networks. In Proceedings of the 21st ACM International Conference on Information and Knowledge Management, 2012.

[30] Anna Goldenberg, Alice X. Zheng, Stephen E. Fienberg, and Edoardo M. Airoldi. A Survey of Statistical Network Models. Found. Trends Mach. Learn., 2(2):129–233, 2010.

[31] Manuel Gomez Rodriguez, Jure Leskovec, and Andreas Krause. Inferring networks of diffusion and influence. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2010.

[32] Manuel Gomez Rodriguez, Jure Leskovec, and Bernhard Schölkopf. Structure and dynamics of information pathways in online media. In Proceedings of the Sixth ACM International Conference on Web Search and Data Mining, 2013.

[33] Manuel Gomez-Rodriguez, Jure Leskovec, and Bernhard Schölkopf. Modeling information propagation with survival theory. In Proceedings of The 30th International Conference on Machine Learning, 2013.

[34] Marta C. Gonzalez, Cesar A. Hidalgo, and Albert-Laszlo Barabasi. Understanding individual human mobility patterns. Nature, 453(7196):779–782, June 2008.

[35] Prem Gopalan, David M. Mimno, Sean Gerrish, Michael J. Freedman, and David M. Blei. In Proceedings of the Advances in Neural Information Processing Systems 25, 2012.

[36] Prem K. Gopalan and David M. Blei. Efficient discovery of overlapping communities in massive networks. Proceedings of the National Academy of Sciences, 110(36):14534–14539, 2013.

[37] Roger Guimerà and Marta Sales-Pardo. Missing and spurious interactions and the reconstruction of complex networks. Proceedings of the National Academy of Sciences, 106(52):22073–22078, 2009.

[38] Asela Gunawardana, Christopher Meek, and Puyang Xu. A model for temporal dependencies in event streams. In Proceedings of the Advances in Neural Information Processing Systems 24, pages 1962–1970, 2011.

[39] Steve Hanneke, Wenjie Fu, and Eric Xing. Discrete Temporal Models of Social Networks. Electronic Journal of Statistics, 4:585–605, 2010.

[40] Alan G. Hawkes. Spectra of Some Self-Exciting and Mutually Exciting Point Processes. Biometrika, 58(1):83–90, 1971.

[41] Rachel Hegemann, Erik Lewis, and Andrea Bertozzi. An estimate & score algorithm for simultaneous parameter estimation and reconstruction of incomplete data on social networks. Security Informatics, 2(1), 2013.

[42] P. D. Hoff, A. E. Raftery, and M. S. Handcock. Latent space approaches to social network analysis. Journal of the American Statistical Association, pages 1090–1098, December 2002.

[43] Paul W. Holland, Kathryn Blackmond Laskey, and Samuel Leinhardt. Stochastic blockmodels: First steps. Social Networks, 5(2):109–137, 1983.

[44] Kenneth Joseph, Chun How Tan, and Kathleen M.
Carley. Beyond “local", “cate- gories" and “friends": Clustering foursquare users with latent “topics". In Proceed- ings of the 2012 ACM Conference on Ubiquitous Computing, UbiComp ’12, pages 919–926, New York, NY , USA, 2012. ACM. [45] Brian Karrer and M. E. J. Newman. Stochastic blockmodels and community struc- ture in networks. Phys. Rev. E, 83:016107, Jan 2011. 126 [46] Myunghwan Kim and Jure Leskovec. The network completion problem: Inferring missing nodes and edges in networks. In Proceedings of the 2011 SIAM Interna- tional Conference on Data Mining, pages 47–58, 2011. [47] G. Kossinets. Effects of missing data in social networks. Social Networks, 28(3):247–268, July 2006. [48] D. Krioukov, F. Papadopoulos, A. Vahdat, and M. Boguñá. Curvature and tempera- ture of complex networks. Physical Review E, 80(3):035101, September 2009. [49] Thomas K Landauer and Susan T. Dutnais. A solution to platoÕs problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological review, pages 211–240, 1997. [50] David Lando. On cox processes and credit risky securities. Review of Derivatives Research, 2(2-3):99–120, 1998. [51] Erik Lewis and Georges Mohler. A Nonparametric EM algorithm for Multiscale Hawkes Processes. preprint, 2011. [52] David Liben-Nowell and Jon Kleinberg. The link-prediction problem for social networks. Journal of the American Society for Information Science and Technology, 58(7):1019–1031, 2007. [53] L. Sandy Maisel. Rating united states senators the strength of maineÕs delegation since 1955. The New England Journal of Political Science, 4(1), 2010. [54] N. McCarty, K. T. Poole, and H. Rosenthal. Polarized America: The Dance of Ideology and Unequal Riches. MIT Press, 2006. [55] Jeffrey McGee, James Caverlee, and Zhiyuan Cheng. Location prediction in social media based on tie strength. In Proceedings of The ACM International Conference on Information and Knowledge Management. ACM, 2013. 
[56] Miller McPherson, Lynn Smith-Lovin, and James M Cook. Birds of a feather: Homophily in social networks. Annual Review of Sociology, 27(1):415–444, 2001. [57] G. O. Mohler, M. B. Short, P. J. Brantingham, F. P. Schoenberg, and G. E. Tita. Self-exciting point process modeling of crime. Journal of the American Statistical Association, 106(493):100–108, 2011. [58] Morten M£rup, Mikkel N Schmidt, and Lars Kai Hansen. Infinite multiple member- ship relational modeling for complex networks. Technical Report arXiv:1101.5097, Jan 2011. 127 [59] Ramesh M. Nallapati, Amr Ahmed, Eric P. Xing, and William W. Cohen. Joint latent topic models for text and citations. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2008. [60] Praneeth Netrapalli and Sujay Sanghavi. Learning the graph of epidemic cascades. In Proceedings of the 12th ACM SIGMETRICS/PERFORMANCE Joint Interna- tional Conference on Measurement and Modeling of Computer Systems, 2012. [61] Y . Ogata. Space-time point process models for earthquake occurrences. Ann.Inst.Statist.Math., 50:379–402, 1988. [62] Dan Pelleg and Andrew W. Moore. X-means: Extending k-means with efficient estimation of the number of clusters. In Proceedings of the Seventeenth Interna- tional Conference on Machine Learning, 2000. [63] P. O. Perry and P. J. Wolfe. Point process modeling for directed interaction net- works. Preprint,arXiv:1011.1703v2, November 2011. [64] Patrick O. Perry and Patrick J. Wolfe. Point process modelling for directed interac- tion networks. Journal of the Royal Statistical Society: Series B (Statistical Method- ology), 75(5):821–849, 2013. [65] P.Lewis and G.Shedler. Simulation of nonhomogenous Poisson processes by thin- ning. Naval Research Logistics Quarterly, 26(3):403–413, 1979. [66] M. D. Porter and G. White. Self-exciting hurdle models for terrorist activity. The Annals of Applied Statistics, 6(1):106–124, 2012. 
[67] Vasanthan Raghavan, Greg Ver Steeg, Aram Galstyan, and Alexander G. Tar- takovsky. Modeling temporal activity patterns in dynamic social networks. CoRR, abs/1305.1980, 2013. [68] Shyamsundar Rajarm, Thore Graepel, and Ralf Herbrich. Poisson-networks: A model for structured point processes. In Proceedings of the International Workshop on Artificial Intelligence and Statistics, 2005. [69] Joerg Reichardt, Roberto Alamino, and David Saad. The interplay between micro- scopic and mesoscopic structures in complex networks. PLoS ONE, 6(8):e21282, 08 2011. [70] Injong Rhee, Minsu Shin, Seongik Hong, Kyunghan Lee, Seong Joon Kim, and Song Chong. On the levy-walk nature of human mobility. IEEE/ACM Trans. Netw., 19(3):630–643, June 2011. 128 [71] Adam Sadilek, Henry Kautz, and Jeffrey P. Bigham. Finding your friends and following them to where you are. In Proceedings of the 5th ACM International Conference on Web Search and Data Mining, 2012. [72] Ruslan Salakhutdinov and Andriy Mnih. Bayesian probabilistic matrix factoriza- tion using Markov chain Monte Carlo. In Proceedings of The 25th International Conference on Machine Learning, 2008. [73] Ruslan Salakhutdinov and Andriy Mnih. Probabilistic matrix factorization. In Pro- ceedings of the Advances in Neural Information Processing Systems, volume 20, 2008. [74] Salvatore Scellato, Anastasios Noulas, and Cecilia Mascolo. Exploiting place fea- tures in link prediction on location-based social networks. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, 2011. [75] Aleksandr Simma and Michael I. Jordan. Modeling events with cascades of Pois- son processes. In Proceedings of The 26th Conference on Uncertainty in Artificial Intelligence, 2010. [76] Noam Slonim and Naftali Tishby. Agglomerative information bottleneck. In Pro- ceedings of the Advances in Neural Information Processing Systems 12, 1999. [77] Tom A. B. Snijders, Christian E. G. Steglich, and Michael Schweinberger. 
Modeling the co-evolution of networks and behavior. In In, 2006. [78] Christian Steglich, Tom A. B. Snijders, and Michael Pearson. Dynamic Networks and Behavior: Separating Selection from Influence. Sociological Methodology, 2010. [79] Juliette Stehlé, Alain Barrat, and Ginestra Bianconi. Dynamical and bursty interac- tions in social networks. Phys. Rev. E, 81:035101, Mar 2010. [80] A. Stomakhin, M. B. Short, and A. L. Bertozzi. Reconstruction of missing data in social networks based on temporal patterns of interactions. Inverse Problems, 27(11):115013, 2011. [81] Naftali Tishby, Fernando C. Pereira, and William Bialek. The information bottle- neck method. In Proceedings of the 37-th Annual Allerton Conference on Commu- nication, Control and Computing, pages 368–377, 1999. 129 [82] G. Tita, J. K. Riley, G. Ridgeway, C. Grammich, A. F. Abrahamse, and P.W. Green- wood. Reducing Gun Violence: Results from an Intervention in East Los Angeles. RAND Press, 2003. [83] Greg Ver Steeg and Aram Galstyan. Information transfer in social media. In Pro- ceedings of the 21st International Conference on World Wide Web, 2012. [84] Greg Ver Steeg and Aram Galstyan. Information-theoretic measures of influence based on content dynamics. In Proceedings of the Sixth ACM International Confer- ence on Web Search and Data Mining, 2013. [85] Duy Vu, Arthur U. Asuncion, David Hunter, and Padhraic Smyth. Continuous-time regression models for longitudinal networks. In Proceedings of the Advances in Neural Information Processing Systems 24, 2011. [86] Dashun Wang, Dino Pedreschi, Chaoming Song, Fosca Giannotti, and Albert- Laszlo Barabasi. Human mobility, social ties, and link prediction. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, 2011. [87] Liaoruo Wang, Stefano Ermon, and John E. Hopcroft. Feature-enhanced probabilis- tic models for diffusion network inference. 
In Proceedings of the 2012 European Conference on Machine Learning and Knowledge Discovery in Databases - Vol- ume Part II, ECML PKDD’12, pages 499–514, Berlin, Heidelberg, 2012. Springer- Verlag. [88] Eric Xing, Michael Jordan, and Stuart Russell. A generalized mean field algorithm for variational inference in exponential families. In Proceedings of the 19th Con- ference on Uncertainty in Artificial Intelligence, 2003. 130
Abstract
Network data and user-behavior data are becoming pervasive. In this thesis, we develop efficient machine learning methods for addressing various real-world problems using such datasets. Network structure allows us to understand the underlying complex system that generated the data. By clustering the nodes in a latent space, we can measure the similarity between nodes and study how links are formed with regard to their latent properties. This model can be extended by combining it with a user-behavior model, another emerging area of machine learning applications. We study these two aspects and propose a model that combines them by inferring a joint latent space that describes both the links and the user behavior.

The first part of our work focuses on network modeling and examines the dynamics of the hidden attributes of nodes together with the link dynamics. We assume the two dynamics co-evolve, allowing feedback between them. The Co-evolving MMSB builds on the mixed-membership stochastic blockmodel, a well-established clustering model that has been widely used in the analysis of social networks and gene-regulatory networks. We show how the Co-evolving MMSB captures the influence between nodes that have interacted. The other topic in network modeling studies the temporal dynamics of pairwise interactions. We propose a model that reconstructs missing information in pairwise interactions using temporal and spatial information.

The second part of our work focuses on user-behavior modeling. Specifically, we examine the "check-in" behavior of users in location-based social networks (LBSNs). Venues in an LBSN can be represented in a latent space through a lower-dimensional representation. For venue clustering, we use a non-parametric method called the information bottleneck, which clusters similar data with minimal loss of relevant information. "Check-in" data in an LBSN can also be clustered in time by measuring the influence of past events.
Through this clustering, one can predict the time of future "check-ins" better than existing methods.

Finally, we combine the two models by finding a joint space of the two latent spaces, which describe link formation and user behavior respectively. With this model, one can build an effective recommendation system even with limited data. We show how behavior can be predicted solely from link information, and vice versa.
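The idea of clustering events in time by "measuring the influence from the past" can be sketched with a self-exciting (Hawkes-style) conditional intensity, in which each past check-in temporarily raises the rate of future ones. This is a minimal illustrative sketch, not the exact model of the thesis; the function name and the parameter values (mu, alpha, beta) are hypothetical.

```python
import math

def hawkes_intensity(t, history, mu=0.1, alpha=0.5, beta=1.0):
    """Conditional intensity of a univariate self-exciting point process:
    a baseline rate mu plus, for each past event at time ti < t, an
    exponentially decaying boost alpha * exp(-beta * (t - ti))."""
    return mu + alpha * sum(math.exp(-beta * (t - ti)) for ti in history if ti < t)

# Events cluster in time: the rate shortly after a burst of check-ins
# exceeds the rate long after the burst has decayed.
events = [1.0, 1.2, 1.3]
assert hawkes_intensity(1.4, events) > hawkes_intensity(5.0, events)
```

Predicting the time of the next check-in then amounts to sampling (or taking the expectation of) the next event under this intensity, for example by thinning.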
Asset Metadata
Creator: Cho, Yoon-Sik (author)
Core Title: Modeling and predicting with spatial-temporal social networks
School: Viterbi School of Engineering
Degree: Doctor of Philosophy
Degree Program: Electrical Engineering
Publication Date: 07/24/2014
Defense Date: 06/04/2014
Publisher: University of Southern California (original); University of Southern California. Libraries (digital)
Tags: clustering; location based social networks; mixed membership; OAI-PMH Harvest; point processes; spatial-temporal; topic model
Format: application/pdf (imt)
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisors: Galstyan, Aram (committee chair); Krishnamachari, Bhaskar (committee chair); Lerman, Kristina (committee member); Ortega, Antonio K. (committee member); Ver Steeg, Greg (committee member)
Creator Email: yoonsik@isi.edu, yoonsikc@usc.edu
Permanent Link (DOI): https://doi.org/10.25549/usctheses-c3-447544
Unique Identifier: UC11286891
Identifier: etd-ChoYoonSik-2737.pdf (filename); usctheses-c3-447544 (legacy record id)
Legacy Identifier: etd-ChoYoonSik-2737.pdf
Dmrecord: 447544
Document Type: Dissertation
Rights: Cho, Yoon-Sik
Type: texts
Source: University of Southern California (contributing entity); University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA