University of Southern California Dissertations and Theses

Disentangling the network: understanding the interplay of topology and dynamics in network analysis

(USC Thesis Other)


DISENTANGLING THE NETWORK: UNDERSTANDING THE INTERPLAY OF TOPOLOGY AND DYNAMICS IN NETWORK ANALYSIS

by Rumi Ghosh

A Dissertation Presented to the FACULTY OF THE USC GRADUATE SCHOOL, UNIVERSITY OF SOUTHERN CALIFORNIA, In Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (COMPUTER SCIENCE)

August 2012

Copyright 2012 Rumi Ghosh

Dedication

To my parents Pratip Kumar Ghosh and Devyani Ghosh

Acknowledgments

I would like to express my deep and sincere gratitude to my supervisor, Dr. Kristina Lerman, for her continuous support, enthusiasm, patience, motivation and guidance. Besides my advisor, I would like to thank the rest of my thesis committee: Dr. Shanghua Teng, Dr. Yan Liu and Dr. Peter Monge. I am also grateful to Dr. Craig Knoblock and Dr. Daniel O’Leary for helping me to shape my thesis proposal. My sincere thanks go to Dr. Greg Ver Steeg, Dr. Aram Galstyan and Dr. Sofus Macskassy for their valuable advice and friendly help.

During this work I have collaborated with many colleagues to whom I wish to extend my warmest thanks. I would especially like to thank all my colleagues and friends at ISI, from whom I have learnt a lot. I am indebted to all my friends who have helped me at every stage of my life and would especially like to thank Sushmita Allam, Adarsh Shekhar, Krishnakali Dasgupta and Prithviraj Bannerjee for being my foster family in Los Angeles.

My deepest gratitude goes to my entire extended family for their love and affection. I wish to thank my parents, Pratip Kumar and Devyani Ghosh, my sister, Reshmi Ghosh, my parents-in-law, Atanu and Bidyasri Choudhury, and my sister-in-law, Ajanta Choudhury, for their support and faith in me. Last but not least, I would like to thank my husband, Anustup Kumar Choudhury, for his loving care and for being there for me in sickness and in health.
Table of Contents

Dedication
Acknowledgments
List of Tables
List of Figures
Abstract
Chapter 1: Introduction
  1.1 Motivation
  1.2 Thesis Statement
  1.3 Proposed Approach
  1.4 Contributions to Research
    1.4.1 Classifying Dynamic Interactions
    1.4.2 Modeling Dynamic Interactions
    1.4.3 Impact of Dynamics on Analysis of Network Structure
      1.4.3.1 Community Structure
      1.4.3.2 Centrality Prediction
      1.4.3.3 Proximity Prediction
    1.4.4 Understanding Information Spread in Networks
      1.4.4.1 Comparative empirical study of cascades on networks
      1.4.4.2 Cascade Generating Function
      1.4.4.3 Mechanism of Social Contagion
    1.4.5 Alpha-Centrality
  1.5 Outline of the Proposal
Chapter 2: Preliminaries
  2.1 Conventions
  2.2 Definitions
  2.3 Real Networks
    2.3.1 Karate Club
    2.3.2 Digg
    2.3.3 Twitter
      2.3.3.1 Dataset 1
      2.3.3.2 Dataset 2
    2.3.4 Facebook
Chapter 3: Real-World Dynamic Interactions
  3.1 Dynamics of Retweeting Activity
  3.2 Entropy-Based Analysis
    3.2.1 Time Interval Distribution
    3.2.2 User Distribution
    3.2.3 Classification
  3.3 Validation
    3.3.1 Manual Annotation
    3.3.2 Classification
      3.3.2.1 Supervised Classification
      3.3.2.2 Unsupervised Classification
      3.3.2.3 Observations
  3.4 Summary
Chapter 4: Modeling Interactions
  4.1 Generalized Interaction Model
    4.1.1 Conservative Interactions
    4.1.2 Non-Conservative Interactions
      4.1.2.1 Contagion as Non-Conservative Interaction
    4.1.3 Spectral Properties of Kernels
    4.1.4 Classification of Interaction Models
  4.2 Interactions and Network Structure
Chapter 5: Interactions and Community Structure
  5.1 Classification of Synchronization Processes
    5.1.1 Kuramoto Synchronization as Conservative Interaction Model
    5.1.2 Novel Methods for Synchronization using Non-Conservative Interactions
  5.2 Synchronization and Community Structure
    5.2.1 A Consolidated Framework for Community Detection
    5.2.2 Community Structure via Interaction Dynamics
  5.3 Empirical Study
    5.3.1 Synthetic Network
    5.3.2 Karate Club
    5.3.3 Digg Mutual Follower Network
      5.3.3.1 Multi-scale Structure of Digg
      5.3.3.2 Empirical Evaluation
    5.3.4 Facebook
      5.3.4.1 Multi-scale Structure of Facebook
      5.3.4.2 Empirical Evaluation
  5.4 Summary
Chapter 6: Interactions and Centrality
  6.1 Classification of Centrality Metrics
    6.1.1 Conservative Interactions and PageRank
    6.1.2 Non-conservative Interactions and Alpha-Centrality
  6.2 Ranking Nodes by Centrality - An Illustration
  6.3 Application to Online Social Network Analysis
    6.3.1 Empirical Estimates of Influence
    6.3.2 Comparison of Centrality Metrics
  6.4 Summary
Chapter 7: Interactions and Proximity
  7.1 Proximity
  7.2 Classification of Proximity Metrics
    7.2.1 Conservative Proximity
    7.2.2 Non-conservative Proximity
  7.3 Predicting Activity in Social Media
    7.3.1 Analysis of Proximity Metrics
    7.3.2 Prediction Results
    7.3.3 Predictability
    7.3.4 Discussion
  7.4 Summary
Chapter 8: Information Spread Under the Microscope
  8.1 Information Spread on Digg and Twitter
    8.1.1 Characteristics of User Activity
    8.1.2 Evolution of a Story
    8.1.3 Evolution of Fan Votes
    8.1.4 How Popular is a Story?
    8.1.5 How Far does a Story Spread on the OSN?
  8.2 Quantitative Framework for Measuring Cascades
    8.2.1 Characterizing Cascades
    8.2.2 Analyzing Cascades
    8.2.3 Reconstructing Cascades
    8.2.4 Digg Case Study
      8.2.4.1 Microscopic Cascade Characteristics
      8.2.4.2 Macroscopic Cascade Characteristics
      8.2.4.3 Mesoscopic Cascade Characteristics
  8.3 Why are Information Cascades in Digg so Small?
    8.3.1 Network Structure: Clustering Effect
      8.3.1.1 Synthetic Graph Construction
      8.3.1.2 Simulations using Independent Cascade Model
      8.3.1.3 Theoretical results
      8.3.1.4 Characterization of real and simulated cascades
    8.3.2 Effect of Contagion Mechanism on Network Dynamics
      8.3.2.1 Simulation using Friend Saturation Model
      8.3.2.2 Inferring Transmissibility of a Cascade
    8.3.3 Discussion
  8.4 Summary
Chapter 9: Alpha-Centrality
  9.1 Node Ranking
  9.2 Community Detection
  9.3 Empirical Results
    9.3.1 Karate Club Network: Communities, Leaders and Bridges
    9.3.2 Other Real-World Networks
  9.4 Multimodal Networks
  9.5 Summary
Chapter 10: Related Work
  10.1 Real-World Dynamic Interactions
  10.2 Interactions and Community
  10.3 Interaction and Centrality
  10.4 Interaction and Proximity
  10.5 Information Spread Under the Microscope
  10.6 Alpha-Centrality
Chapter 11: Future Work and Conclusion
References
Appendix A: Computational Framework for Cascade Generating Function
Appendix B: Normalized Alpha-Centrality - Proofs and Theorems
Appendix C: Approximation Algorithm for Alpha-Centrality
    C.0.1 Quality of Approximate Results

List of Tables

3.1 F-Measure (F) and ROC area for 10-fold cross validation experiments using SVM and k-NN classification
3.2 Confusion matrix with manually annotated data and clusters automatically detected by EM algorithm
3.3 Confusion matrix with manually annotated data and clusters detected by EM algorithm when number of clusters is predefined to be 5
7.1 Some of the proximity metrics used for network analysis, including four proposed in this work
7.2 Correlation between proximity of pairs of users connected by an edge in the follower graph and their co-activity on (a) Digg and (b) Twitter. Rows in (a) present co-votes under different filter conditions. For example, the co-votes < 200 condition reports correlations for pairs of users who voted for fewer than 200 common stories. The number of pairs satisfying the filter condition is reported in the second column.
7.3 Evaluation of predictions by different metrics in the Digg and Twitter data sets. Lift is defined as % change over baseline.
8.1 Parameter estimates for distributions that best describe data (Weibull and Lognormal).
8.2 Parameter estimates for distributions that best describe data (Power Law).
9.1 The number and purity of communities discovered at different values of α

List of Figures

2.1 Karate Club Network. The different colors show the different communities as detected by Zachary.
2.2 A screen shot of the Digg front page
2.3 A screen shot of the Twitter home page
2.4 A screen shot of the Tweetmeme front page
2.5 A screen shot of the Facebook front page
3.1 Evolution of retweeting activity for story posted by (a) a popular news website (nytimes) (b) popular celebrity (billgates) (c) politician (silva_marina) (d) an aspiring artist (youngdizzy) (e) post by a fan site (AnnieBieber) (f) animal rights campaign (nokillanimalist) (g) advertisement using social media (onstrategy) (h) advertisement from an account eventually suspended by Twitter (EasyCash435) (i) advertisement by a Japanese user (nitokono). Insets in (d), (e) and (g) show automatic retweeting, with multiple retweets made within a short time period either by the same or different users.
3.2 The distribution of the inter-arrival gaps for the retweeting activities shown in Figure 3.1 (a) nytimes (b) billgates (c) silva_marina (d) youngdizzy1 (e) AnnieBieber (f) nokillanimalist (g) onstrategy (h) EasyCash435 (i) nitokono.
3.3 The number of retweets by distinct users. Each user is marked by a unique user id for the retweeting activities shown in Figure 3.1 (a) nytimes (b) billgates (c) silva_marina (d) youngdizzy1 (e) AnnieBieber (f) nokillanimalist (g) onstrategy (h) EasyCash435 (i) nitokono.
3.4 Manually annotated URLs shown in the entropy plane.
3.5 Unsupervised clustering of the data points using EM: (a) when EM automatically finds the best number of clusters, and (b) when the number of clusters is constrained to be five.
5.1 Analysis of the synthetic graph. (a) Hinton diagram of the adjacency matrix. A point is red if an edge exists between nodes at that location; otherwise it is blue. (b) Eigenvalue spectrum of the two kernels. Similarity matrix at t = 1500 under the (c) conservative and (d) non-conservative synchronization models. Color indicates how similar two nodes are, with red corresponding to higher and blue to lower similarity.
5.2 Analysis of the karate club network. (a) Friendship graph. (b) Comparison of eigenvalues of the Laplacian (L) and Replicator (R) kernels. Synchronization matrix at time t = 1000 due to (c) the conservative interaction model and (d) the non-conservative interaction model. The color of each square indicates how similar two nodes are (zoom in to see node labels), with red corresponding to more similar nodes and blue to less similar nodes.
5.3 Evolution of the discovered community structure of the karate club network, as measured by normalized mutual information, in the conservative and non-conservative interaction models.
5.4 (a) Top 6000 eigenvalues of the Replicator and Laplacian kernels of the Digg friendship network. (b) The long-tailed distribution (using logarithmic binning) of the components comprising the core for different similarity thresholds for the non-conservative model.
5.5 Number of nodes comprising the core at different levels of hierarchy (resolution scales) found by the interaction models in the Digg and Facebook networks. The resolution scales correspond to similarity thresholds that give cores of comparable size. The green line shows the number of nodes that the cores have in common at that resolution scale.
5.6 Evaluation of communities found in the Digg mutual follower graph at t = 100 by the two interaction models. (a) Number of small communities found at different resolutions specified by the similarity threshold parameter. The smallest resolution corresponds to the smallest value of the similarity threshold. (b) Average quality of communities at each scale, as measured by the number of co-votes.
5.7 Distribution of communities in the Facebook network for American University at t = 100. (a) Comparison of size distribution of small communities found by the two interaction models at the coarsest and finest resolution scales. (b) Number of small communities at different resolutions. The smallest resolution corresponds to highest similarity between individuals.
5.8 Evaluation of communities found in the Facebook network of American University at t = 100 by the two interaction models. Average quality of communities at each resolution scale, as measured by the probability of occurrence of the most frequent value of features major, dorm, year, category of individual.
6.1 An example network, where node 1 has the highest Alpha-Centrality followed by node 3. In contrast, node 3 has the highest PageRank followed by node 1.
6.2 (a) The scatter plot shows the average number of fan votes received by a story within the first 100 votes vs submitter’s in-degree (number of fans) on Digg. Each point represents a distinct submitter. (b) Probability of the expected number of fan votes being generated purely by chance.
6.3 (a) The scatter plot shows the average number of follower retweets received by stories within the first 100 votes vs submitter’s in-degree (number of followers) on Twitter. Each point represents a distinct submitter. (b) Probability of the expected number of follower retweets being generated purely by chance.
6.4 Correlation between the rankings produced by the empirical measures of influence and those predicted by Alpha-Centrality and PageRank for Digg. We use (a) the average number of fan votes and (b) average cascade size as the empirical measures of influence. The inset zooms into the variation in correlation for 0 ≤ α ≤ 0.01.
6.5 Correlation between the rankings produced by the empirical measures of influence and those predicted by Alpha-Centrality and PageRank for Twitter. We use (a) the average number of fan retweets and (b) average cascade size as the empirical measures of influence.
7.1 Example of a directed graph.
7.2 Average value of the proximity metrics vs activity for pairs of users linked by an edge in the follower graphs of (a) Digg and (b) Twitter.
7.3 Prediction methodology
7.4 Distribution of prediction precision over users in the Digg and Twitter data sets.
7.5 Distribution of precision as a function of user attribute, such as the number of friends, followers, and the activity level, as measured by the number of votes the user made.
7.6 Differences between predictable (precision=1) and unpredictable (precision=0) Twitter users in terms of the number of friends, followers and their activity level, as measured by the number of URLs they tweeted.
8.1 Distribution of user activity. (a) Number of active fans per user in the Digg data set vs the number of users with that many fans. Inset shows distribution of voting activity, i.e., number of votes per user vs number of users who cast that many votes. (b) Number of active followers per user in the Twitter data set vs the number of users with that many followers. Inset shows distribution of retweeting activity.
8.2 Dynamics of stories on Digg and Twitter. (a) Total number of votes (diggs) and fan votes received by stories on Digg since submission. (b) Total number of times a story was retweeted and the number of retweets from followers since the first post vs time. The titles of stories on Digg were: story1: “U.S. Government Asks Twitter to Stay Up for #IranElection”, story2: “Western Corporations Helped Censor Iranian Internet”, story3: “Iranian clerics defy ayatollah, join protests.” The titles of retweeted stories were: story1: “US gov asks twitter to stay up”, story2: “Iran Has Built a Censorship Monster with help of west tech”, story3: “Clerics join Iran’s anti-government protests - CNN.com.”
8.3 Spread of interest in stories through the network. (a) Median number of fan votes vs votes, aggregated over all Digg stories in our data set. Dotted lines show the boundary one standard deviation from the mean. Dashed line shows the number of votes from fans of submitter. (b) Probability next vote is from a fan before and after the Digg story is promoted. (c) Median number of retweets from followers vs all retweets, aggregated over all stories in the Twitter data set. (d) Probability next retweet is from a follower.
8.4 Distribution of story popularity. (a) Distribution of the total number of votes received by Digg stories, with line showing log-normal fit. The plot excludes the 15 stories that received more than 6,000 votes. (b) Distribution of the total number of times stories in the Twitter data set were retweeted, with the line showing log-normal fit.
8.5 Distribution of story cascade sizes. (a) Histogram of the distribution of the total number of fan votes received by Digg stories (size of the interest cascade). The inset shows the distribution of the number of votes from submitter’s fans. (b) Histogram of the distribution of the total number of retweets from followers. The inset shows the distribution of the number of retweets of a story from submitter’s followers.
8.6 A toy example of an information cascade on a network. Nodes are labeled in the temporal order in which they are activated by the cascade. The nodes that are never activated are blank. (a) The edges show the underlying friendship network. Edge direction shows the semantics of the connection, i.e., nodes are watching others to which they are pointing. (b) Two cascades on the network (shown in yellow and red). Node 1 is the seed of the first (yellow) cascade and node 2 is the seed of the second (red) cascade. Node 4 belongs to both cascades and is shown in orange.
8.7 Analysis of various cascades. Nodes are labeled by the order in which they are activated by the contagion process. Row (a) shows cascade plots obtained by computing the cascade generating function, at different times. Row (b) shows the corresponding contagion process. Different cascades within the same contagion process are shown in different colors. Row (c) shows some of the numeric properties of the cascades, and row (d) shows sets of isomorphic nodes.
8.8 Cascade plots for the top 3 cascades for four stories. The left set of plots in each figure shows cascade evolution in the early stages of the contagion process, while the right set of plots shows cascade evolution over the entire time period. Red dot shows the time when cascade seed was activated.
8.9 PDF of distribution of cascade properties: number of cascades per story, cascade size, spread, diameter, average path length, and log of the number of paths. Distributions are fitted with the stretched exponential/Weibull (black), mixture of Weibull (cyan), lognormal (red) and power law (green) functions. The double pareto lognormal distribution (magenta) gives a very good fit for the number of cascades.
8.10 Distribution of principal cascade size.
8.11 For nodes who were exposed to a story, the average number of friends who voted on the story.
8.12 Cascade size as a function of transmissibility for simulated cascades on the Digg graph and the randomized graph with the same degree distribution (see section on simulations). Heterogeneous mean field predicts cascade size as a fraction of the nodes affected. The line (hmf) reports these predictions multiplied by the total number of nodes in the Digg network.
8.13 Characteristics of voting on Digg. Probability that a user votes given that n friends voted, as given by the independent cascade model and by actual voting behavior on Digg (averaged over all cascades).
8.14 (a) Inferred versus actual transmissibility for simulated cascades in the FSM. (b) Cascade size vs inferred transmissibility for simulated and real cascades on the Digg graph, this time plotted on a log-log scale to highlight the order of magnitude difference between these cascade sizes and predictions of the epidemic model (HMF, see text for details).
8.15 Dynamics of transmissibility and fanout. (a) Number of new fans who can see the story (watching) and who actually vote for the story (voting) vs time (voter i) for actual and simulated cascades. (b) Change in the estimated value of transmissibility for actual and simulated Digg cascades as a function of time.
9.1 Zachary’s karate club data. Circles and squares represent the two actual factions, while colors stand for discovered communities as the strength of ties increases: (a) α = 0, (b) 0 < α < 0.14, (c) α ≥ 0.14.
9.2 Centrality scores of Zachary club members vs. α.
9.3 Comparison of eigenvector centrality and converged normalized Alpha-Centrality for Zachary’s karate club network.
9.4 Classification of karate club nodes according to the roles scheme proposed by Guimera et al. [76]: (i) non-hubs (z < 2.5) are divided into ultra-peripheral, peripheral, and connector nodes (kinless nodes whose links are homogeneously distributed among all communities are not shown); (ii) hubs (z ≥ 2.5) are subdivided into provincial hubs (majority of links within their own community) and connector hubs (many links to other communities). Global hubs whose links are homogeneously distributed among all communities are not shown.

Abstract

Understanding the complex interplay of topology and dynamics in complex networks is necessary to answer a variety of questions, including who are the important people in a social network, what are the authoritative pages on the World Wide Web, whom to quarantine to minimize the spread of an epidemic, what are the functional modules in a protein-protein network, and even how the world trade network affects the robustness of the global economy. To address these questions, we build predictive network models that take dynamic interactions into account. Our mathematical models are grounded empirically in data from online social networks on sites such as Digg, Twitter and Facebook.

We claim that network structure is the product of both topology and dynamics. We propose a generalized interaction model that describes a range of dynamic processes, or interactions, taking place in complex networks, from random walks to epidemic spread. Traditionally, network analysis methods, including those used to identify central nodes and communities in the network, either ignore or make implicit assumptions about network interactions. We show, however, that different interactions lead to different views of network structure, and empirically verify this insight using real-world data from online social networks.

A wide spectrum of heterogeneous activity spanning from information diffusion to spamming has been observed in online social networks.
We have designed a simple, scalable and robust information-theoretic framework to automatically classify different types of activities. Of these, we are especially interested in information spread. We have developed a mathematical framework to quantitatively measure how information spreads on networks, and showed that standard epidemic models fail to describe the spread of information in real-world networks.

Our work is a step towards the ultimate goal of building theoretically justified, empirically grounded network models that improve the prediction of future behavior, aid information discovery and outbreak control, and help in designing network policies for our connected world.

Chapter 1: Introduction

In this chapter, we first provide the motivation and outline the problems that this thesis attempts to address. We will then briefly describe possible solutions we propose, followed by the outline of this manuscript.

1.1 Motivation

What are the functional modules in a protein-protein interaction network? What are the authoritative pages on the World Wide Web? Who should the target audience be for a viral marketing campaign for a newly launched product? Who are the important people in a recommendation network? What are the communities in a social network? Who are the experts in a scientific discipline? How do we stop an epidemic from spreading? Which firms are the key players in the global economy? The answers to these and other questions [43, 55, 193, 145, 89, 146, 177, 150, 114, 41, 62, 67, 164, 116, 45, 112, 63, 108, 66, 84, 69, 117, 175, 58] lie in the modeling and analysis of complex networks. This thesis addresses the immediate and pressing need for efficient, scalable and effective tools for network analysis.
Analyzing networks is of paramount importance in predicting their fu- ture behavior, identifying events and trends, and forecasting factors that shape them. Predictive analysis of networks includes models to discover experts [45] and influen- tials, measure centrality of individuals [67, 63] and their proximity [112] to each other, and to detect communities [108, 66]. These models in turn can aid in information discovery and search, be used for link prediction, activity prediction, recommending useful connections [114], help in outbreak control and in design of policies in this networked world. How do we develop these predictive models of network analytics? What are the design principles that should be kept in mind to improve the accuracy of these models? What are the interactions taking place on networks and how do we model them? How do we build simple, computationally scalable models that capture the complex feed- back between interactions and topology that shapes the network? This thesis develops a principled, empirically grounded mathematical framework for the analysis of net- work structure that takes into account the variety of dynamic interactions taking place on the network. Until recently, obtaining empirical data to study networks involved laborious sur- veys [174] and contact traces [126, 35, 36], which made statistical analysis of their properties impractical. This changed with the advent of online social networks on so- cial media sites such as Twitter, Digg, Facebook, and the like, where users explicitly declare social links and interact with others. These networks have emerged as a critical factor in information dissemination [184, 75, 118], search [2] and marketing [49, 93]. They have been used as effective conduits for organizing and coordinating huge groups 2 of people as seen during the ‘Arab Spring’ in 2011 [18]. 
In the cultural arena, such networks have developed into an effective mouthpiece for celebrities, spawning a generation of stars, like Justin Bieber, and starlets. The massive quantities of data made available by social media sites have led to a quantum leap in progress in network science, uncovering new phenomena that need to be explained and demonstrating gaps between theoretical understanding and empirical observations. Though the network models we develop in this thesis are general, and applicable to a variety of complex networks, a major portion of our empirical analysis deals with online social networks like Digg, Facebook and Twitter.

1.2 Thesis Statement

Our research focuses on characterizing interactions in social and information networks and understanding their impact on the analysis of network structure. We also study interactions leading to information spread in online social networks. Our goal is to build efficient, scalable and useful network models grounded in empirical data that will help in predicting future trends, activity and behavior, finding influentials and experts and understanding the overall organization and structure of networks.

1.3 Proposed Approach

Characterizing Dynamic Interactions

The wide gamut of activities occurring in online social media includes conversations, information dissemination, advertising, marketing, propaganda campaigns, robotic activities and spamming. Understanding these activities and their intent will lead to better tools for trend identification and spam detection, and improve user modeling and content analysis. We propose a solution to the difficult problem of differentiating between these diverse activities in online social media in Chapter 3.

Interactions determine the activities taking place on the network, whether diffusion and other types of transport in biological networks, or epidemics and information spread in online social media.
Random walk has been a fundamental approach to modeling interactions and related activities on networks [166, 146, 143, 135]. However, random walks and related processes cannot capture the diverse activities occurring on complex networks, especially one-to-many interactions such as broadcasts in social media or epidemic contagion. In Chapter 4, we propose a consolidated framework to model many of the different kinds of interactions, or dynamic processes, taking place on complex networks.

Impact of Dynamics on Analysis of Network Structure

The aspects of network structure that are of interest to researchers include the organization of the network into communities, the centrality (or importance) of individuals and proximity (or closeness) between them. Community structure is an important characteristic of complex networks, including social networks, which are composed of communities and sub-communities of similar individuals, and biological networks, which are often organized within functional modules [151, 153]. Identifying central or influential individuals is of great importance in today’s world, be it to find the target audience for a product, or the infected individuals to be quarantined during an epidemic. Similarly, identifying people who are close to an individual can help predict his activity and the future connections he will make.

An array of models, methods and metrics has been proposed for network analysis (including but not limited to metrics to predict centrality and proximity and to detect communities within the network), but principled approaches to evaluate them are relatively scarce. Moreover, the impact of microscopic dynamic interactions on our understanding of network structure has not been investigated. Metrics such as PageRank have revolutionized Internet search and commerce and are used in many applications from image processing to Web page ranking.
However, how appropriate is PageRank, which is based on the random walk, for determining influential people in a social network? How do we choose an appropriate metric for a given network? How is this choice affected by dynamics? We focus not only on analyzing and modeling dynamics in complex networks, but also on principles guiding the choice of models for network analysis to lead to the best predictions of network structure. We validate predictions of these models on data from real-world networks and online social media (Chapters 5, 6 and 7).

Understanding Information Spread in Networks

The dynamic processes of greatest interest to researchers, and therefore examined in great detail in this work, are those leading to information spread in networks (Chapter 8). When a person is infected with a disease, he may spread it to people in contact with him. This interaction, or contact process, leads to epidemic spread in a network of connected individuals [8, 81]. Information spread, common in online social networks, is another example of a contact process. It is created, for example, when an individual forwards an email she receives to her contacts [184, 115], or retweets a news item to her followers on Twitter [107]. Besides information diffusion in online social networks [107], many other diverse phenomena can be classified as contact processes, including adoption of new ideas [19], spread of behaviors [35, 36], computer virus epidemics on the Internet [30], word-of-mouth recommendations [70] and viral marketing campaigns [93, 88]. This gives rise to the need for principled tools for studying contact processes.
For instance, a mathematical tool for the analysis of cascades (sequences of ‘infections’ in a contact process) can find extensive use wherever contact processes are studied: anomaly and spam detection, information classification, viral marketing, epidemiological studies, computer virus spread, political and social unrest and even power transmission failure [48, 181]. Therefore, we focus on building such tools for network analysis.

How information spreads in online social networks may be indicative of its quality [41, 106], and understanding it is especially critical for the design of network policies and outbreak control. In previous studies [124] of information spread, the structure of the underlying network was not visible but had to be inferred from the flow of information from one individual to another. This posed a serious challenge to the efforts to understand the complex interplay between structure and dynamics of real-life networks. We address this gap by comparing information spread on two popular social news sites, Digg and Twitter, which are described in Chapter 2. Specifically, we extract social networks of active users on Digg and Twitter, and track how interest in news stories spreads among them and how it depends on the topology and dynamic interactions within the network.

Empirical observations of complex networks using principled analytical tools lead to better insights into their structure and function. They also unearth new mysteries and present new challenges. The independent cascade model is widely used to model the spread of a disease or information. However, how well does it explain observed properties of information diffusion in real networks? It turns out not well, and such discrepancies highlight the need to understand the characteristic properties of the network in order to build better predictive network models.
We identify the probable factors shaping a network through observations and simulations, and use this knowledge to build better network models.

One of the metrics we examine in great depth is Alpha-Centrality, which is used to identify central individuals in a social network. We show that Alpha-Centrality is closely related to contact processes, and, therefore, best predicts influential individuals in networks when the observed dynamic process is information spread. We use Alpha-Centrality to analyze network structure and differentiate between local and global structures. Specifically, we use it to identify communities in social networks, as well as individuals who act as community leaders and those who bridge different communities, thereby facilitating communication between them. We evaluate performance of this metric on benchmark networks as well as social media data sets (Chapter 9).

1.4 Contributions to Research

Here, we briefly enumerate our contributions to the study of structure and dynamics of complex networks.

1.4.1 Classifying Dynamic Interactions

We present an information-theoretic approach to automatic classification of user activity in online social media (Chapter 3). We focus on messages that contain embedded URLs and study collective user response to these messages. We identify two features, time-interval and user entropy, which we use to classify the response. We achieve good separation of different activities using just these two features and are able to categorize content based on the collective user response it generates. Our information-theoretic framework for classification of activity and associated content is content-independent, robust, efficient and scalable [68].

1.4.2 Modeling Dynamic Interactions

We propose a generalized linear model for describing a wide range of interactions (Chapter 4) [66]. This model describes random walks at one end of the spectrum and contact processes, such as epidemic spread, at the other end.
We elucidate the connection between existing epidemic models, e.g., [179], and the generalized linear model, showing how it leads to insights into the relationship between network structure and macroscopic dynamic phenomena, such as the location of the epidemic threshold.

1.4.3 Impact of Dynamics on Analysis of Network Structure

We claim that not only network topology, but also the dynamic interactions taking place on it, play an important role in predicting network structure. Different dynamic processes running on the same network can lead to different perspectives of network structure. We show that the mathematical model that best predicts network structure is one that takes into account the dynamic processes taking place in the network.

1.4.3.1 Community Structure

One of our principal contributions in the area of community detection is a formal framework that consolidates different approaches to community detection in complex networks and generalizes them within a dynamics-based formalism (Chapter 5) [108, 66]. We apply this framework to study real-world networks. The specific contributions that support this perspective are:

• A novel synchronization model for locally interacting nodes coupled via non-conservative interactions
• A metric to calculate synchronization similarity (and a hierarchical community detection method based on this similarity) and an activity-based measure of community quality
• Detailed investigation of communities detected using different interaction models on real-world networks, revealing important differences between them within a layered ‘onion-like’ organization of complex networks

1.4.3.2 Centrality Prediction

The contributions [67, 63] in the area of centrality and the use of centrality to predict influentials (Chapter 6) are:

• A new metric, Normalized Alpha-Centrality, that avoids the problem of bounded parameters in Alpha-Centrality while retaining the desirable characteristics of Alpha-Centrality, i.e., its ability to differentiate between local and global structures
• Classification of centrality metrics based on the dynamic interactions they implicitly emulate
• Statistically significant empirical measurements of influence on online social media
• Evaluation methodology for centrality metrics based on empirical measurements of influence for online social media

1.4.3.3 Proximity Prediction

While it may seem counter-intuitive that proximity between individuals in a network depends on anything but network topology, as we show in this thesis (Chapter 7), different dynamic processes running on the same network can lead to different notions of proximity [112]. This thesis makes the following contributions:

• New structural proximity metrics for directed graphs that take into account the nature of interactions between individuals
• Detailed evaluation of proximity metrics in the context of activity prediction in social media

Ongoing work includes design of metrics for networks whose topology changes with time [109, 57].

1.4.4 Understanding Information Spread in Networks

We have analyzed in detail the spread of information on online social media (Chapter 8). The specific contributions in this area are:

1.4.4.1 Comparative empirical study of cascades on networks

We compare user activity and characteristics of information spread in the online social networks of Digg and Twitter. We show that while these sites are used in strikingly similar ways to spread information, network structure and details of the user interface affect information flow [110, 107].

1.4.4.2 Cascade Generating Function

We present an efficient and scalable mathematical framework for quantitative analysis of cascades on networks. We define a cascade generating function that computes the macroscopic properties of cascades, such as their size, spread, diameter, number of paths, average path length etc.
We show that this function also captures the details of the microscopic as well as mesoscopic properties of the cascades. We present an algorithm for efficiently computing the cascade generating function and demonstrate that while significantly compressing information within a cascade, it nevertheless allows us to almost perfectly reconstruct its structure [64].

1.4.4.3 Mechanism of Social Contagion

Using the tools described above, we identify information cascades in social media that spread fast enough for one initial spreader to infect hundreds of people, yet end up affecting only a minuscule portion of the entire network. This is in stark contrast to traditional epidemic models, which are often used to explain online information spread. We demonstrate two competing factors that limit the final size of cascades in Chapter 8. First, the highly clustered structure of the online social network slows the overall growth of cascades. In addition, the mechanism of social contagion by which information spreads in an online social network deviates from standard contagion models, like the independent cascade model. We propose an alternate empirically grounded model of social contagion in social media: the “friend saturation model” [164]. In this model, multiple exposures to information only marginally increase the probability of spreading it. This model better replicates the observed features of information spread in social media.

1.4.5 Alpha-Centrality

We demonstrate [65, 61, 59, 60] the use of Alpha-Centrality (Chapter 9) in

• Influence Prediction: By tracking the variation of rankings with the change of the tunable parameter, we were able to identify locally important ‘leaders’ and globally important ‘bridges’ or ‘brokers’ that facilitate communication between different communities.
• Community Detection: We extended the modularity maximization class of algorithms [69] to use (normalized) Alpha-Centrality, rather than edge density, as a measure of network connectivity.

We demonstrate that by capturing the spectrum from coarse-grained to fine-grained structures, with the aid of the tuning parameter, this metric provides a simpler alternative to previous attempts in this direction both in community detection [131, 132] and role-based [76, 77] descriptions. We apply the proposed methods to several benchmark networks as well as real-world networks. We extend this definition to multimodal networks that link entities of different types, and use this approach to study the structure of such networks.

1.5 Outline of the Proposal

The general definitions and conventions, the description of the online social networks used in the study, and the datasets explored are given in Chapter 2. In Chapter 3, we enumerate the wide range of dynamic activities occurring on social media websites and develop a methodology to automatically classify them. In Chapter 4, we propose simple mathematical models of interaction dynamics that can be used to express a spectrum of activities occurring on these networks. Next we move to the predictive models of network structure. We discuss community detection and the role dynamic interactions play in it in Chapter 5. The effect of interactions on centrality and proximity prediction is discussed in Chapter 6 and Chapter 7, respectively. Information propagation is one of the predominant activities occurring on networks and we analyze it in greater detail in Chapter 8. We compare information propagation in Digg and Twitter. We build a principled mathematical framework in this chapter to measure information cascades in large networks. We note that very simple models of dynamic interactions do a good job in explaining many of the complex characteristics of large networks.
But there do remain many unanswered questions. For example, in the context of information propagation, one of the obvious questions is: why are actual cascades very small compared to those predicted by traditional models of contagion? We investigate some of these questions in greater detail and show how topology and dynamics play a role in limiting the growth of cascades. We explore Alpha-Centrality (whose underlying principles match those of traditional models of information propagation) further in Chapter 9 and discuss its applicability in identifying important people as well as in community detection. We also build a framework for applying Alpha-Centrality in the analysis of multimodal networks. We outline the related work in Chapter 10. Highlighting the scope of future work, we conclude in Chapter 11. Appendix B provides the mathematical formulations and proofs associated with normalized Alpha-Centrality. Appendix C outlines the scalable algorithm we have designed for computing Alpha-Centrality.

Chapter 2

Preliminaries

In this chapter, we state the general conventions and definitions used in this proposal. We also describe the datasets used in this work.

2.1 Conventions

As a convention, for any matrix $M$, let $M[i, j]$ represent the element in the $i$-th row and $j$-th column of $M$. $M[i, :]$ represents the $i$-th row of matrix $M$ and $M[:, j]$ represents the $j$-th column. Similarly, for any vector $V$, let $V[i]$ represent the $i$-th element of that vector. If $V$ is an $n$-dimensional vector, then the $L_1$-norm of $V$ is given by $\|V\|_1 = \sum_{i=1}^{n} |V[i]|$. If $F : X \to Y$ is a function, where $X$ is the domain of the function and $Y$ is the co-domain, then for any element $x \in X$ there is a $y \in Y$ such that $y = F(x)$. $I$ is the identity matrix and $e$ is a unit vector.

2.2 Definitions

A network is represented by a directed, weighted graph $G = (V, E)$, where $V$ is the set of nodes and $E$ is the set of edges.
Let $|V|$ be the number of nodes in the graph and $|E|$ the number of edges. The weight of the edge from $u$ to $v$ is specified using $w(u, v)$.

Adjacency Matrix: The adjacency matrix of the graph is defined as:

$$A[u, v] = \begin{cases} 1 & \text{if } (u, v) \in E \\ 0 & \text{otherwise.} \end{cases}$$

Indegree and Outdegree: $d^{out}_i$ is the outdegree of node $i$. Let $d^{out}_{max}$ be the maximum outdegree of any node in the network. Let $D$ be the degree matrix, which is a diagonal matrix where $D[i, j] = d^{out}_i$ if $i = j$; otherwise, $D[i, j] = 0$. $d^{in}_i$ is the indegree of node $i$. Let $d^{in}_{max}$ be the maximum indegree of any node in the network.

Neighbors, Friends, Fans, Followers and Followees: If there exists an edge from $j$ to $i$, $j$ is said to be a neighbor of $i$. The neighborhood of $j$ is defined as all the nodes of which $j$ is a neighbor. Borrowing from sociology, it can also be said that $j$ is a fan or follower of $i$ and $i$ is a friend or followee of $j$. Thus all friends comprise the neighborhood.

Mutual Friends: If node $i$ is a friend of node $j$ and node $j$ is a friend of node $i$, then we call nodes $i$ and $j$ mutual friends.

Spectral Radius: $\lambda_1$ is the largest eigenvalue of the adjacency matrix $A$ and is called the spectral radius of the network.

Distribution: Let $V$ be a set of nodes. Any distribution $w \in (\mathbb{R}^+ \cup \{0\})^{|V|}$ is defined as a $|V|$-dimensional vector, where $w[i]$ gives the amount or weight of commodity present in node $i$, $i \in V$.

Contact Process: A contact process is simply a diffusion of activation on a graph, where each activated, or “infected,” node can infect its neighbors with some probability given by the transmissibility.

Entropy: The Shannon entropy of a discrete random variable $X$ is given by:

$$H(X) = -\sum_{x \in X} p(x) \log p(x) \tag{2.1}$$

Here $p$ is the probability mass function of $X$.
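As a quick check of the entropy definition in Eq. (2.1), written with the conventional leading minus sign, consider two limiting cases. This is only a worked illustration: the symbols $k$ and $x_0$ are introduced here for the example, and the base of the logarithm is left unspecified as in the definition. A variable that is uniform over $k$ outcomes attains entropy $\log k$, while a variable that always takes the same value has zero entropy.

```latex
% Uniform over k outcomes: p(x) = 1/k for every x
H(X) = -\sum_{x \in X} \frac{1}{k} \log \frac{1}{k} = \log k

% Deterministic: p(x_0) = 1 and p(x) = 0 otherwise (using 0 \log 0 = 0)
H(X) = -1 \cdot \log 1 = 0
```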
Mutual Information: Mutual information of two discrete random variables $X$ and $Y$ can be defined as:

$$I(X, Y) = \sum_{y \in Y} \sum_{x \in X} p(x, y) \log\left(\frac{p(x, y)}{p(x)\,p(y)}\right) \tag{2.2}$$

Normalized Mutual Information: Normalized mutual information of two discrete random variables $X$ and $Y$ is given by:

$$MI(X, Y) = \frac{2\,I(X, Y)}{H(X) + H(Y)} \tag{2.3}$$

2.3 Real Networks

We have extensively analyzed the online social networks of Digg, Twitter and Facebook and social network benchmarks existing in the literature. We have illustrated and validated our understanding of the structure and dynamics of complex networks using these networks.

2.3.1 Karate Club

We have extensively analyzed the real-world friendship network of Zachary’s karate club [193], shown in Fig. 2.1, a widely studied social network benchmark. During the study, a disagreement developed between the administrator and the club’s instructor, resulting in the division of the club into two factions, represented by circles and squares of pink and yellow colors respectively.

Figure 2.1: Karate Club Network. The different colors show the different communities as detected by Zachary.

2.3.2 Digg

The social news aggregator Digg¹ is one of the oldest and most popular social media sites, with over 3 million registered users. Digg allows users to submit links to news stories and vote for, or digg, them. There are multiple submissions every minute, many thousands a day. A newly submitted story goes to the upcoming stories list, where it remains for 24 hours, or until it is promoted to the front page. Newly submitted stories are displayed as a chronologically ordered list, with the most recent story at the top of the list, 15 stories to a page. Digg selects about a hundred of these stories every day to feature on its front page. Although the exact promotion mechanism is secret, it appears to take into account the number of votes a story receives and the rate at which it receives them.
Promoted (or ‘popular’) stories are also displayed in reverse chronological order on the front pages, 15 stories to a page, with the most recently promoted story at the top of the list. The importance of being promoted has, among other things, spawned a black market² which claims the ability to manipulate the voting process.

Digg also allows users to designate friends and track their activities. Digg highlights the stories a user’s friends posted or voted for by marking them with a green ribbon and also displaying them on the Friends Interface, a special page for watching friends’ activity. The friendship graph is directed, because the friendship relationship is asymmetric. When user A lists user B as a friend, A can watch the activities of B but not vice versa. Therefore, A is the fan, or the follower, of B. A newly submitted story is visible in the upcoming stories list, as well as to the submitter’s fans through the Friends Interface. With each vote it also becomes visible to the voter’s fans. The Friends Interface can be accessed by clicking on the Friends Activity tab at the top of any Digg page. In addition, a story submitted or voted on by a user’s friends receives a green ribbon on the story’s Digg badge, raising its visibility to fans. Together the friends and the fans comprise the social network of Digg. Thus Digg provides an opportunity for users to watch not only the stories submitted by their friends but also the most popular stories. The fundamental dynamic process occurring on Digg is information broadcast. Figure 2.2 captures a screen-shot of the Digg front page.

¹ http://digg.com
² As an example, see http://subvertandprofit.com

Figure 2.2: A screen shot of the Digg front page

The Digg API was used to collect data about 3,553 stories promoted to the front page in June 2009.
(Data set available at: http://www.isi.edu/~lerman/downloads/digg2009.html) The data associated with each story contains the story title, story id, link, submitter’s name, submission time, list of voters and the time of each vote, and the time the story was promoted to the front page. In addition, the list of voters’ friends was collected. Using this information, it was possible to reconstruct the fan network of Digg users who were active during the sample period. An active user is defined as any user who voted for at least one story on Digg during the data collection period. There were 139,409 active users, of which 69,524 designated at least one other user as a friend. The friends of these users were extracted and the fan network of active users (a directed graph of active users who are watching activities of other users) was reconstructed. There were 279,634 nodes in the fan network, with 1,731,658 links. Of the 69K users, around 40K users were mutual friends with at least one other user in the network, and the mutual friendship network comprised more than 360K edges.

2.3.3 Twitter

Twitter⁴ is a micro-blogging website that allows registered users to post and read short (at most 140 characters) text messages, which may contain URLs to online content, usually shortened by a URL shortening service such as bit.ly or tinyurl. The user interface of Twitter provides an opportunity for users to post messages and stories (via URLs), repost (retweet) a submitted post, carry on conversations and so on. Posting a link on Twitter is analogous to submitting a new story on Digg, and retweeting the post is analogous to voting for it. Like on Digg, a user can follow other users and have other users as followers. If user a watches the activity of user b, a is following b and b has a as a follower. However, unlike Digg, in Twitter the user can watch only the activity of the people he is following.
Also, information broadcast is not the only dynamic process occurring on Twitter. Exchange of information and private conversations [25] are some of the other dynamic processes occurring on Twitter. Figure 2.3 captures a screen-shot of a Twitter home page.

Figure 2.3: A screen shot of the Twitter home page

Figure 2.4: A screen shot of the Tweetmeme front page

2.3.3.1 Dataset 1

Twitter restricts large-scale access to its data to a limited number of entities. One of these, Tweetmeme (http://tweetmeme.com, Figure 2.4), aggregates all Twitter posts to determine frequently retweeted URLs, categorizes the stories these URLs point to, and presents them as news stories in a fashion similar to Digg’s front page. Data from Tweetmeme was collected using specialized page scrapers developed using Fetch Technologies’ AgentBuilder tool. For each story, the name of the user who posted the link to it, the time it was posted, the number of times the link was retweeted, and details of up to 1000 of the most recent retweets were retrieved. For each retweet, the name of the user, the text and the time stamp of the retweet were extracted. Extraction was limited to the 1000 most recent retweets by the structure of Tweetmeme. 398 stories from Tweetmeme that were originally posted between June 11, 2009 and July 3, 2009 were extracted. Of these, 329 stories had fewer than 1000 retweets. Next, the Twitter API was used to download profile information for each user in the data set. The profile included the complete list of the user’s friends and followers. An active user was defined as any user who voted for (retweeted) at least one story on Twitter during the data collection period. There are 137,582 active Twitter users in the sample. Active users on Twitter were connected to 6,200,051 users. From this data, the fan network of active users (i.e., active users who are watching activities of other users) was reconstructed.
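The reconstruction of a fan network from activity logs and friend lists can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the pipeline used to build the datasets: the input layout (`retweets_by_story`, `friends_of`) and all function names are hypothetical, the toy data is made up, and edges are oriented from a fan to the friend being watched, following the convention of Section 2.2.

```python
def active_users(retweets_by_story):
    """Return every user who retweeted (voted for) at least one story.

    retweets_by_story: hypothetical dict mapping story id -> iterable of user names.
    """
    return {user for users in retweets_by_story.values() for user in users}


def fan_network(active, friends_of, active_only=False):
    """Build the directed fan network as a set of (fan, friend) edges.

    friends_of: hypothetical dict mapping a user -> set of users they watch.
    Only active users contribute outgoing edges. With active_only=True the
    watched user must be active too (an assumed, stricter reading).
    """
    edges = set()
    for fan in active:
        for friend in friends_of.get(fan, ()):
            if active_only and friend not in active:
                continue
            edges.add((fan, friend))
    return edges


def mutual_friends(edges):
    """Return unordered pairs of users who watch each other."""
    return {tuple(sorted((u, v))) for (u, v) in edges if (v, u) in edges}


# Toy example with made-up data
retweets_by_story = {"s1": ["a", "b"], "s2": ["b", "c"]}
friends_of = {"a": {"b", "x"}, "b": {"a"}, "d": {"a"}}

active = active_users(retweets_by_story)
edges = fan_network(active, friends_of)
```

In the toy example, user `d` never retweets and so contributes no edges, while `x` enters the network only as someone watched by an active user. Whether such inactive friends should be kept as nodes is a modeling choice, exposed here through the assumed `active_only` flag.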
⁴ www.twitter.com

2.3.3.2 Dataset 2

Twitter’s Gardenhose streaming API provides access to a portion of real time user activity, roughly 20%-30% of all user activity. This API was used to collect tweets for a period of three weeks in the fall of 2010. The focus was specifically on tweets that included a URL in the body of the message, usually shortened by some service, such as bit.ly or tinyurl. In order to ensure the acquisition of the complete tweeting history of each URL, Twitter’s search API was used to retrieve all activity for that URL. Then, for each tweet, the REST API was used to collect friend and follower information for the user who posted it. The data collection process resulted in 3,424,033 tweets which mentioned 70,343 distinct shortened URLs. There were 815,614 users in the data sample, but retrieval of the follower information for some of them was not possible, resulting in a graph with 699,985 nodes and 36,743,448 edges.

2.3.4 Facebook

This dataset contains a snapshot of the Facebook networks of more than 100 colleges and universities as of September 2005 [171, 172]. Each user in this data set has four descriptive features: status (e.g., student, faculty, staff, and so on), major, dorm or house, and graduation year.

Figure 2.5: A screen shot of the Facebook front page

Chapter 3

Real-World Dynamic Interactions

Differentiating between the diverse activities in online social network sites such as Twitter and classifying the short posts is a challenging problem. For example, a post that is retweeted multiple times by the same user may be categorized as spam. However, if the same message is of interest to and retweeted by many other users, it can be classified as a successful campaign or information dissemination. Such judgements are difficult to make based solely on content. The advent of bots and automatic tweeting services has added another dimension of complexity to the already difficult problem.
How do we distinguish human activity from programmed or bot activity, campaigns designed to manipulate opinion from those that capture users’ interest, and popular from unpopular content?

We propose a novel, quantitative approach to address these questions. In Section 3.2 we describe an information-theoretic method to characterize the dynamics of retweeting activity generated by some content on Twitter. While content- and language-independent, our method is nevertheless able to categorize content into multiple classes based on how Twitter users react to it: it can separate newsworthy stories from those that are not interesting, campaigns that are driven by humans from those driven by bots, and successful marketing campaigns from unsuccessful ones. Previous work provided a binary (such as low-quality vs. high-quality content) or ternary classification of content based on analysis of content and structure [3] or user response to it [42]. However, the rich, heterogeneous and complex activity on Twitter necessitates a more detailed characterization.

When a user posts or ‘tweets’ a story, he exposes it to other Twitter users. We focus on tweets that contain URLs and use these URLs as markers to trace the spread of information or content through the Twitter population. When a later tweet includes the same URL as an earlier one, we say that the new post ‘retweets’ the content of the original tweet. We do not require the retweet to contain the ‘RT’ string, nor do we check that the user follows the author of the original tweet. Our retweets therefore include not only traditional retweets from the original author’s followers, but also conversations about the content associated with that URL and independent mentions of it. The collective user response to the tweet, which we call the retweeting activity, varies with the nature of content and users’ interest in it, leading to characteristic dynamic patterns.
For example, a popular news story will be retweeted by many different users (but only once by each user), whereas campaigns will get many retweets but from the same small group of users. Some retweets, however, could be automatically generated, and relying purely on the frequency of retweets would give a misleading picture of the popularity of content. The temporal signature of automated retweeting is drastically different from human response, allowing us to easily differentiate between them. Given some content (URL), we characterize its retweeting dynamics by two distributions: the distribution of the time intervals between successive retweets and the distribution of distinct users involved in retweeting. We use entropy to quantitatively characterize these distributions. We show in Section 3.3 that these two numeric features capture much of the complexity of user activity.

Using these features to classify activity on Twitter, we have been able to identify several different types of activity, including marketing campaigns, information dissemination, auto-tweeting and spam. In fact, some of the profiles that we identified as engaging in spam-like activities were eventually suspended by Twitter. Our simple yet powerful approach can easily separate newsworthy content from promotional campaigns, independent of the language of the content, and provides an objective measure of the value of content to people.

3.1 Dynamics of Retweeting Activity

Users’ response to content posted on Twitter is encoded in the dynamics of retweeting of this content. Figure 3.1 shows the cumulative number of times nine different URLs were retweeted vs. time. The panels show a wide variety of collective responses to content. Figure 3.1(a) shows a characteristic response to newsworthy information: a fast initial rise followed by a slow saturation in the number of retweets. Such a response is typical of diffusion patterns of newsworthy information in online social networks [104, 107, 185].
A similar trend is also observed in the response to content (often photos) posted by major celebrities, as in Figure 3.1(b).

Retweeting activity of posts made by starlets (without a major following) is starkly different from that of stars. Figure 3.1(d) shows retweeting activity of a post by Young Dizzy, an aspiring artist and songwriter. Short bursts of intense activity are followed by long periods of inactivity. As we show later, this is one of the characteristics of automated tweeting, an increasingly popular feature on social media. In many of these cases, such automated retweets are generated by one user or a small group of users, pointing to attempts to manipulate the apparent popularity of content. Such automated methods to boost popularity are used not only by aspiring starlets, but also by dedicated fans of major stars, e.g., Justin Bieber, as shown in Figure 3.1(e). In this case, fans are asked to register their Twitter accounts on a fan website, which then automatically tweets posts about the star from their accounts.

Figure 3.1: Evolution of retweeting activity for a story posted by (a) a popular news website (nytimes), (b) a popular celebrity (billgates), (c) a politician (silva_marina), (d) an aspiring artist (youngdizzy), (e) a fan site (AnnieBieber), (f) an animal rights campaign (nokillanimalist), (g) an advertisement using social media (onstrategy), (h) an advertisement from an account eventually suspended by Twitter (EasyCash435), (i) an advertisement by a Japanese user (nitokono). Insets in (d), (e) and (g) show automatic retweeting, with multiple retweets made within a short time period either by the same or different users.

There are other examples where a user (or a small group of users) retweets the same message multiple times, often with the aid of some automated service, leading to a spam-like campaign. This is shown in Figure 3.1(g) and Figure 3.1(h).
One of these accounts, EasyCash435, was eventually suspended by Twitter. Figure 3.1(i) shows similar characteristics for some content in Japanese. Note that using only the retweet dynamics, without any knowledge of the content, we are able to deduce the spam-like advertisement campaign that this profile engages in. This is confirmed by analyzing the content.

In addition to information dissemination, automated tweeting, promotional activities and advertisements, campaigns add to the diversity of Twitter dynamics. One of the successful campaigners in our sample was the Brazilian politician Marina Silva. Figure 3.1(c) traces the retweeting activity of a post made by her over a period of 4 days. Every day she posts the same link using the social media dashboard Hootsuite (www.hootsuite.com). The retweeting activity follows a news-like trace seen in (a) and (b). However, when the activity gradually slows down, she breathes new life into the campaign by retweeting the same URL, generating a new upsurge in interest (and retweeting). Contrast this with a not-so-popular animal rights campaign shown in Figure 3.1(f), where the same few users (as shown later) repeatedly and manually retweet some content to raise its visibility.

3.2 Entropy-Based Analysis

Manual analysis of retweeting activity on Twitter is labor-intensive. Instead, in this section we describe a principled approach to categorize retweeting activity associated with some content.

Problem Statement

Given some user-generated content or tweet $c_j \in C$ (where $C$ is a set of tweets or content), our aim is to analyze the trace $\mathcal{T}_j \in \mathcal{T}$ (where $\mathcal{T}$ is the collective activity on all content) of retweeting activity on it, to understand the content and associated dynamics. This trace $\mathcal{T}_j$ can be represented by a sequence of tuples $((u_{j1}, t_{j1}), (u_{j2}, t_{j2}), \ldots, (u_{ji}, t_{ji}), \ldots, (u_{jK}, t_{jK}))$, where $u_{ji}$ represents a user retweeting $c_j$ at time $t_{ji}$.
Given $N$ such traces $\mathcal{T}_1, \ldots, \mathcal{T}_N \in \mathcal{T}$ and their corresponding tweets $c_1, \ldots, c_j, \ldots, c_N \in C$, how do we meaningfully characterize and categorize them?

3.2.1 Time Interval Distribution

The observations we made above about the dynamics of retweeting can be succinctly captured by two distributions: the inter-tweet time interval distribution and the distinct user distribution. First, we consider the distribution of time intervals between successive retweets. These are shown in Figure 3.2 for the same URLs whose retweeting activity is shown in Figure 3.1. Humans are very heterogeneous; therefore, a signature of human activity is a broad distribution with time intervals of many different lengths that are all equally likely, as shown in Figure 3.2(a)-(c) and (f). Specifically, there is a lot of activity initially associated with newsworthy content, which gradually decreases with time, resulting in many short intervals and some long ones, as shown in Figure 3.2(a)–(b). Automated retweeting results in tweets at regular time intervals, which leads to an isolated peak or peaks in the distribution (as in Figure 3.2(i)), or bursty behavior with many zero-second intervals (as seen in Figure 3.2(e) and (g)).

Figure 3.2: The distribution of the inter-arrival gaps for the retweeting activities shown in Figure 3.1: (a) nytimes (b) billgates (c) silva_marina (d) youngdizzy1 (e) AnnieBieber (f) nokillanimalist (g) onstrategy (h) EasyCash435 (i) nitokono.

We measure the regularity or predictability of the temporal trace of tweets using entropy. Let $\Delta T$ represent the time interval between two consecutive retweets in a trace $\mathcal{T}_j$, with possible values $\{\Delta t_1, \Delta t_2, \ldots, \Delta t_i, \ldots, \Delta t_{n_T}\}$.
If there are $n_{\Delta t_i}$ time intervals of length $\Delta t_i$, then $p_{\Delta T}(\Delta t_i)$ denotes the probability of observing a time interval $\Delta t_i$:

$p_{\Delta T}(\Delta t_i) = \frac{n_{\Delta t_i}}{\sum_{k=1}^{n_T} n_{\Delta t_k}}$ (3.1)

The entropy $H_{\Delta T}$ of the distribution of time intervals is:

$H_{\Delta T}(\mathcal{T}_j) = -\sum_{i=1}^{n_T} p_{\Delta T}(\Delta t_i) \log(p_{\Delta T}(\Delta t_i))$ (3.2)

If a tweeting pattern is regular, certain time intervals are more probable than the rest, so there is less uncertainty associated with the distribution. In contrast, if a tweeting pattern is heterogeneous, the uncertainty associated with the corresponding time interval distribution is greater. Entropy captures this notion of uncertainty. Automatic retweeting with a regular pattern therefore has a lower time interval entropy, and is more predictable, than human retweeting, which is more broadly distributed and less predictable.

3.2.2 User Distribution

In addition to the time interval, we also measure the distribution of the number of times distinct users retweet some URL. Figure 3.3 shows the number of retweets made by each user involved in the tweeting activity shown in Figure 3.1. Newsworthy content is usually retweeted once by each user who participates in the tweeting activity, as shown in Figure 3.3(a)–(c). Spam-like activity and campaigns, on the other hand, result when an individual (Figure 3.3(g)–(i)) or a small group (Figure 3.3(f)) repeatedly retweets the same post. The more retweets per user, the greater the manipulation effort.

Figure 3.3: The number of retweets by distinct users. Each user is marked by a unique user id for the retweeting activities shown in Figure 3.1: (a) nytimes (b) billgates (c) silva_marina (d) youngdizzy1 (e) AnnieBieber (f) nokillanimalist (g) onstrategy (h) EasyCash435 (i) nitokono.

The campaign shown in Figure 3.1(c) is successful, since there are many distinct users who participate in it, as shown in Figure 3.3(c).
However, there are some dedicated campaigners, including silva_marina herself, who retweet the same message multiple times. Also, the distribution of inter-arrival times in Figure 3.2(c) is similar to that of Figure 3.2(a) and (b), indicating human activity. A campaign that is probably not as successful as that by silva_marina is the one by nokillanimalist (Figure 3.1(f)), which has very few participating users. The distribution of the inter-arrival times in Figure 3.2(f) is also comparable to Figure 3.2(a)–(c), with a large number of nonzero inter-arrival times and the frequency of shorter inter-arrival gaps being larger than that of longer ones, indicating human activity. However, the distribution of the number of retweets by distinct users shows a stark contrast: there are only three dedicated users generating over 3000 retweets. Similarly, in the case of the retweeting activity shown in Figure 3.1(h), there are only two users engaged in spreading spam-like advertisements (Figure 3.3(h)). These two users together account for around 900 retweets. Spam-like characteristics are also observed in the advertisements whose retweeting activity is shown in Figure 3.1(g) and 3.1(i), which have one user (Figure 3.3(g)) and two users (Figure 3.3(i)) generating the bulk of the content. However, on looking into the temporal distribution more closely, we observe that in the case of Figure 3.1(g), almost two-thirds of the retweets occur almost consecutively (the time interval gap is zero seconds), indicating possible autotweeting activity. Figure 3.1(i), too, shows signs of scheduled or automated tweeting activity, with around 28% of the tweets having an exact interval gap of 604 seconds. Possible autotweeting is also indicated in the promotional activity shown in Figure 3.1(e). Although a large number of users participate in this activity, as shown by Figure 3.3(e), almost all the retweets are generated simultaneously, as seen in Figure 3.2(e).
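The two temporal signatures mentioned above (a large share of zero-second gaps, or a large share of gaps at one exact value such as 604 seconds) can be checked directly from the timestamps of a trace. The following is a minimal Python sketch of such a check, not the procedure used in this work; the function name `dominant_gap_share` and the example timestamps are hypothetical and chosen only for illustration.

```python
from collections import Counter


def dominant_gap_share(timestamps):
    """Return (most common gap in seconds, fraction of all gaps equal to it)."""
    ordered = sorted(timestamps)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    if not gaps:
        return None, 0.0
    gap, count = Counter(gaps).most_common(1)[0]
    return gap, count / len(gaps)


# Hypothetical trace: three retweets exactly 604 s apart, then two irregular ones.
scheduled = [0, 604, 1208, 1812, 1900, 2500]
scheduled_gap, scheduled_share = dominant_gap_share(scheduled)

# Hypothetical burst: several retweets recorded in the same second.
burst = [10, 10, 10, 50]
burst_gap, burst_share = dominant_gap_share(burst)
```

A high share concentrated on a single gap value hints at scheduled or automated posting, while a dominant gap of zero points to bursts; what counts as "high" is a judgment call that this sketch does not settle.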
We use entropy to measure the breadth of the user distribution. Let the random variable $F$ represent a distinct user in a trace $\mathcal{T}_j$, with possible values $\{f_1, f_2, \ldots, f_i, \ldots, f_{n_F}\}$. Let there be $n_{f_i}$ retweets from user $f_i$ in the trace $\mathcal{T}_j$. If $p_F$ denotes the probability mass function of $F$, such that $p_F(f_i)$ gives the probability of a retweet being generated by user $f_i$, then

$p_F(f_i) = \frac{n_{f_i}}{\sum_{k=1}^{n_F} n_{f_k}}$ (3.3)

The user entropy $H_F$ is given by:

$H_F(\mathcal{T}_j) = -\sum_{i=1}^{n_F} p_F(f_i) \log(p_F(f_i))$ (3.4)

As is clear from Equation 3.4, in spam-like activity a small number of users are responsible for a large number of tweets, which leads to a lower entropy than that of the retweeting activity of newsworthy content. On the other hand, automated retweeting coming from many distinct users (as seen in Figure 3.3(e)) indicates that users’ accounts may have been compromised.

3.2.3 Classification

The time interval and user entropies, $H_{\Delta T}(\mathcal{T}_j)$ and $H_F(\mathcal{T}_j)$, can be used to categorize the retweeting activity of any content. This classification not only helps us identify the different dynamic activities occurring on Twitter, but also provides valuable insight into the nature of the associated content. The linear runtime complexity of the entropy calculation and the availability of scalable clustering methods [26] ensure that this entropy-based approach can be easily applied to very large data sets.

3.3 Validation

We study the retweeting activity of URLs in Twitter Dataset 2 described in Chapter 2, Section 2.3.3.2. We focus on URLs posted by users who posted at least two popular URLs. By popular, we mean URLs that were retweeted at least 100 times. There were 687 such distinct URLs.

We apply the entropy-based approach to study the retweeting dynamics of these URLs. We show that entropy-based analysis gives a good characterization of the different types of activities observed in the collective retweeting of these URLs.
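The two entropy features of Section 3.2 can be computed in a single pass over a trace. The sketch below is one possible Python rendering under stated assumptions: a trace is a list of (user, timestamp in seconds) tuples, the logarithm is natural (the text does not fix a base), and the names `shannon_entropy` and `trace_entropies` as well as the two toy traces are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(counts):
    """Entropy -sum(p * log p) of the empirical distribution given by counts."""
    counts = list(counts)
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def trace_entropies(trace):
    """Return (time-interval entropy, user entropy) for one retweet trace.

    trace is a list of (user, timestamp_in_seconds) tuples.
    """
    ordered = sorted(trace, key=lambda pair: pair[1])
    times = [t for _, t in ordered]
    # Gaps between successive retweets; equal gaps fall into the same bin.
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    h_time = shannon_entropy(Counter(gaps).values())
    h_user = shannon_entropy(Counter(u for u, _ in ordered).values())
    return h_time, h_user


# Hypothetical traces: one account posting every 600 s, and four different
# users retweeting once each at irregular intervals.
bot_trace = [("bot", 600 * i) for i in range(5)]
human_trace = [("a", 0), ("b", 7), ("c", 40), ("d", 300)]
bot_h_time, bot_h_user = trace_entropies(bot_trace)
human_h_time, human_h_user = trace_entropies(human_trace)
```

In this toy case the regular single-account trace has zero entropy on both axes, while the irregular multi-user trace reaches the maximum possible for its size (the logarithm of the number of gaps and of users, respectively). Binning gaps by exact value in seconds is an assumption of the sketch; a coarser binning would change the time-interval entropy.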
3.3.1 Manual Annotation

We manually examined the content of each URL (using Google Translate on foreign-language pages) to annotate the activity along the following categories:

News: If the URL belongs to the Twitter profile of a news organization, we label the retweeting activity as following news.

Blogs: If the URL links to the blog or webpage maintained by an individual, we classify the retweeting activity as following blogs or celebrity.

Campaigns: If the URL belongs to an individual or an organization with a discernible agenda (politics, animal rights issues), we classify the retweeting activity as a campaign.

Advertisements and promotions: If the URL links to an advertisement or promotion, we classify the retweeting activity as such. This includes instances where users post the same link repeatedly, leading to spam-like content generation, and the promotional activities of aspiring starlets.

Parasitic ads: This is a form of parasitic advertisement in which users participate unwittingly. This happens when a user logs into a website or web service, and then that service tweets a message in the user’s name telling his followers about it. For example, when a user visits sites such as Tinychat¹ or Twitcam², a message is posted to the user’s Twitter account: “join me on tinychat...”

Automated/robotic activity: Retweeting that is mainly generated through Twitterfeed³ or similar services is classified as automatic activity. Note that automated activity could be associated with any type of content, but since it has its own unique characteristics, different from all the aforementioned activities, we included it as a separate class. It can be easily identified by looking at the source of the tweet, which will identify twitterfeed (or a similar service) as the originator.

We found that users respond to news stories and blog posts in an identical manner, making them difficult to distinguish.
Generally, the type of information contained in these two sources is also very similar. Therefore, for classification purposes, we put them in the same category of newsworthy content.

Figure 3.4 shows the retweeting activity of URLs in our data sample as measured by the time interval and user entropy. The bulk of the URLs belong to the news or blog category. They are also characterized by medium to high user and time interval entropies, indicating newsworthy content. Blog posts or websites of major celebrities represent more popular content and are located in the upper section of the plot. Blog posts from starlets without a major following are located in the lower section of the plot. Though these posts have similar numbers of retweets, lower user entropy means that the starlets, or their dedicated followers, generate some of the retweeting activity. The automatic retweeting cluster is isolated. This contains URLs like the one whose activity is shown in Figure 3.1(e), but also several news stories, most notably from the online technology magazine TechCrunch. This is because some Twitter users employ Twitterfeed to automatically tweet stories that are posted on TechCrunch. This helps users appear to be more active on Twitter than they really are. The uninteresting stories are for the most part retweeted automatically, and not manually by other people. Therefore, they have low time interval entropy, but high user entropy, since many different Twitter accounts are used.

¹ tinychat.com
² twitcam.com
³ www.twitterfeed.com

Figure 3.4: Manually annotated URLs shown in the entropy plane.

Advertisements are mostly located in the lower half of the figure, although successful advertisements that capture public interest are indistinguishable from newsworthy content. Unsuccessful campaigns that are driven by a few dedicated zealots are in their own cluster with high time interval and low user entropy, but successful campaigns are also indistinguishable from newsworthy content.
Table 3.1: F-Measure (F) and ROC area for 10-fold cross validation experiments using SVM and k-NN classification

              ads &        auto-tweet   campaign   news &   parasitic
              promotion                            blog     ads
k-NN   F      0.686        0.96         0.5        0.89     0.105
       ROC    0.807        0.959        0.678      0.837    0.644
SVM    F      0.719        0.939        0.526      0.897    0
       ROC    0.833        0.973        0.685      0.875    0.718

3.3.2 Classification

The distributions of distinct time intervals and of users involved in the retweeting activity give a good characterization of that activity. As explained in Section 3.2, temporal and user entropy are used to quantify these distributions. Temporal entropy is maximum when all time intervals between successive retweets are different. User entropy is maximum when each user retweets the message only once. Next, using temporal and user entropies as features, we classify the retweeting activity represented by a trace $\mathcal{T}_j \in \mathcal{T}$. We perform both unsupervised and supervised classification. The data is manually labelled to train the supervised classifier and to evaluate the performance of the classification techniques. We used the Weka software library (www.cs.waikato.ac.nz/ml/weka) for off-the-shelf implementations of EM (expectation maximization [46]), k-NN (k-nearest neighbors) and SVM (support vector machines [24]) classification.

3.3.2.1 Supervised Classification

We used a Support Vector Machine with a radial basis function (RBF) kernel and the k-NN algorithm with three nearest neighbors and a Euclidean distance function to classify the data. Table 3.1 reports the results of 10-fold cross validation, in which each model was trained on 90% of the labeled data and tested on the remaining 10%. The F-scores of both algorithms are relatively high for most classes (parasitic ads being the exception), showing that they separate instances well into different classes.

3.3.2.2 Unsupervised Classification

Figure 3.5: Unsupervised clustering of the data points using EM: (a) when EM automatically finds the best number of clusters, and (b) when the number of clusters is constrained to be five.
We use the Expectation Maximization (EM) algorithm to automatically cluster points. EM uses a Gaussian mixture model and can decide how many clusters to create by cross validation. The number of clusters determined automatically by this method was nine. Figure 3.5(a) shows the resulting clusters, and the confusion matrix is shown in Table 3.2. When the number of clusters is predefined to be 5, the resulting confusion matrix is shown in Table 3.3, and the discovered clusters are shown in Figure 3.5(b).

Table 3.2: Confusion matrix with manually annotated data and clusters automatically detected by EM algorithm

            advertisement   auto-tweet   campaign   news   blogs   parasitic
            & promotion                                            advertisement
cluster0    45              0            0          0      8       0
cluster1    7               0            0          41     13      1
cluster2    17              0            0          0      14      0
cluster3    0               0            0          53     10      1
cluster4    0               23           0          0      0       0
cluster5    53              0            7          2      34      0
cluster6    36              2            1          27     19      6
cluster7    10              1            3          14     30      6
cluster8    11              0            2          130    60      0

Table 3.3: Confusion matrix with manually annotated data and clusters detected by EM algorithm when number of clusters is predefined to be 5

            advertisement   auto-tweet   campaign   news      parasitic
            & promotion                             & blogs   advertisements
cluster0    7               0            0          82        1
cluster1    85              0            7          49        0
cluster2    1               23           0          0         0
cluster3    22              1            5          272       7
cluster4    64              2            1          52        6

3.3.2.3 Observations

Broadly speaking, we identify five classes of retweeting activity and associated content on Twitter.

Automatic/Robotic Activity: As we can see from the results, almost all methods classify automatic or robotic retweeting (auto-tweet) with high accuracy. While some of this activity in our data set is related to technology news stories, and its user entropy is similar to that of other news stories, it has a much lower time interval entropy than other news stories.

The two primary kinds of automated services that we identified are auto-tweeting services and tweet-scheduling services. There are two categories of auto-tweeting activities.
The first arises when an individual subscribes to an automatic service that tweets messages on the user’s profile on his behalf. One such automatic service is Twitterfeed, through which the user can subscribe to a blog or news website (any service with an RSS feed). Twitter users employ this service to automatically retweet stories posted on the technology news sites Mashable and TechCrunch. This leads to individual auto-tweets observed from the profile of that user. However, this auto-tweeting feature is also being used for promotional and perhaps phishing activities. For example, a fan site (http://bieberinsanityblog.blogspot.com/) for Justin Bieber asks fans to provide their Twitter account information. The site is powered by Twitterfeed, and then auto-tweets Justin Bieber news from the profiles of registered fans, resulting in collective auto-tweeting.

Services like Tweet-u-later (http://www.tweet-u-later.com/) and Hootsuite can be used to schedule tweeting activities. These websites can be used for spamming. Registering a collection of profiles with these websites and scheduling a tweet to be posted repeatedly enables spammers to post the same message multiple times.

Since our method can easily differentiate human activity from bot or automated activity, we are able to identify marketing companies which engage automated services to increase their visibility on Twitter. Such services include OperationWeb (http://www.operationweb.com/) and TweetMaster (http://tweetmaster.tk/), which claim that they “will tweet your ad or message on my Twitter accounts that add up to over 170k* followers 2-6 times per day for 30 days.” Most of these services use bots or automated services to push up the perceived visibility of the advertisements. To increase visibility they need a large number of profiles. To gain access to a large number of profiles, such services ask users to register, set their own prices for tweets and feature the sponsored tweets in their profile.
In this way these services create a win-win situation, helping companies to promote their products and users to make money by featuring sponsored messages on their profiles.

Newsworthy information: This class consists mostly of news and blogs and some successful campaigns. Newsworthy information is characterized by comparable (usually high) user and temporal entropy. Since people, not bots, are involved in disseminating such content, we call this “human response to information.” Both supervised and unsupervised algorithms are able to separate news and blogs, i.e., information sharing by humans, from the rest of the retweeting activity with good accuracy (Tables 3.1, 3.2 and 3.3). However, the EM algorithm with five classes breaks this class into smaller clusters (cluster0, cluster3 and cluster4). This is indeed a meaningful subdivision based on popularity, with content in cluster3 being the most popular, content in cluster0 being normal content and content in cluster4 having low popularity. When EM is allowed to automatically adjust the number of clusters, the popular clusters found by the earlier algorithm are subdivided into two more classes, giving five clusters of human response to information (cluster1, cluster3, cluster6, cluster7 and cluster8 in Figure 3.5(a)). Comparing with the hand-labeled dataset (Figure 3.4) and the confusion matrix in Table 3.2, we observe that cluster7 consists predominantly of popular blogs, cluster8 consists mostly of popular news, cluster1 and cluster3 consist of normal human response to information and cluster6 shows human response to unpopular information.

Advertisements and Promotions: Advertisements and promotions are distinguished by low user entropy and low to high temporal entropy. Supervised classification is able to accurately detect advertisements and promotions (Table 3.1). Most spam-like advertisements fall in this section.
These are unwanted advertisements which are never retweeted by any user besides the originator of the advertisement. The EM algorithm with five classes also identifies a group consisting predominantly of advertisements. However, the EM algorithm with automatic class detection divides this group further into three classes: cluster0, consisting mostly of spam-like activity with very low user entropy (≈ 0); cluster2, containing advertisements with low user and medium time entropy; and cluster5, consisting of campaign-like promotions and advertisements with low user entropy and medium to high temporal entropy.

Campaigns: Campaigns are identified by low user entropy and very high temporal entropy. There are very few campaigns in the hand-labeled dataset. Even then, supervised algorithms are able to classify campaigns with a fair degree of accuracy (cf. Table 3.1). However, the unsupervised algorithm merges campaigns with advertisements and promotions. Due to the considerable overlap between the characteristics of campaigns and those of advertisements or promotions, distinguishing a campaign from an advertisement is difficult, even for manual annotators. Note that when a campaign is very successful, like the one by silva_marina (Figure 3.1(c)), the information that the campaigner intends to propagate spreads through the online social media. The retweeting activity in this case becomes similar to human response to information.

Parasitic Advertisements: None of the methods was able to identify parasitic advertisements very accurately. One possible reason may be their parasitic nature: they do not have a distinct characteristic feature of their own, but adopt the characteristics of the hosting user profile.

3.4 Summary

In this chapter, we characterize the dynamics of retweeting activity of some content on Twitter by the entropy of the user and time interval distributions, and show that these two features alone are able to separate user activity into different meaningful classes.
The method is computationally efficient and scalable, content- and language-independent, and robust to missing data. We have identified five categories of retweeting activity on Twitter: newsworthy information dissemination, advertisements and promotions, campaigns, automatic or robotic activity and parasitic advertisements. We have observed that human response to news, blogs and celebrity posts is very similar. The novel entropy-based classification method not only enables us to characterize user activity, but also helps us to understand user-generated content and separate popular content from normal or unpopular content.

Chapter 4

Modeling Interactions

In this chapter, we propose a generalized framework to model the dynamics of interactions within a complex network. Later, we use this framework to show how interactions affect the network structure. We consider a static network of active nodes (or agents), who can affect the state or activity of their neighbors through interactions. We differentiate between conservative and non-conservative interactions, which we explain in greater detail in this chapter. Examples of the former include money exchange, Web surfing, and diffusion in physical systems. Non-conservative interactions include broadcast-based interactions that lead to information diffusion, epidemics, and other social phenomena.

4.1 Generalized Interaction Model

Interactions between nodes determine the dynamic process taking place on the network. A dynamic process is a process which results in the spread of tangible or intangible content in a network or graph. Mathematically, a dynamic process is given by a function $f : (\mathbb{R}^+ \cup \{0\})^{|V|} \to (\mathbb{R}^+ \cup \{0\})^{|V|}$, i.e., a map from a $|V|$-dimensional non-negative vector (say, $F(t)$) to a $|V|$-dimensional vector ($F(t+1)$), which describe the distribution of some observed feature at time $t$ and $t+1$ respectively. The interaction process traces the change in the observed feature $F$ over time.
The change in the observed feature might depend on some intrinsic property of the nodes, given by a $|V|$-dimensional vector $\omega$. It would also depend on the nature of the dynamic activity taking place in the network. The kernel function $\mathcal{K}(A)$ captures the interdependence of interactions and topology, where $A$ is the adjacency matrix (and hence represents the topology of the network). The generalized model can be described as:

$\frac{dF^T}{dt} = \omega - k F^T \mathcal{K}(A)$ (4.1)

Here $k$ is a constant and $F^T$ denotes the transpose of the vector $F$. The discrete time version of the generalized model would be:

$\Delta F^T(t) = F^T(t) - F^T(t-1) = (\omega - k F^T(t-1) \mathcal{K}(A)) \Delta t$ (4.2)

In the context of social networks, $F(t)[i]$ could represent the opinions of an individual agent $i$ at time $t$, and $\omega[i]$ his intrinsic beliefs. Though his opinions depend on his intrinsic beliefs, they may change over time as the result of interactions with neighbors. Though rather simplified, we believe that this abstract model provides a useful framework to study social phenomena.

We classify interaction processes based on their spreading behavior, i.e., on whether or not the interaction process conserves the amount of the quantity being spread. Consider financial exchange networks in which nodes distribute money among their network neighbors. The interactions that give rise to the financial exchange can be called conservative, since they neither increase nor decrease the amount of money exchanged. Web surfing, communicating via phone calls, and other one-to-one interactions are conservative, because, as in the case of a Web surfer, at any time the surfer can browse only one page, and the total probability of finding the surfer on some Web page remains constant. We contrast these with non-conservative interactions, which do not preserve the amount of the quantity exchanged. Take, as an example, a virus spreading through a social network.
A person (node) will get infected with a virus through her infected friends, but the amount of the virus present in the network will increase because of these interactions (or decrease as infected people become cured). Social processes based on one-to-many interactions, such as users broadcasting messages in online social media, are also non-conservative in nature. While the conservative/non-conservative dichotomy might not capture the full range of possible interactions in a network, we begin our investigation here because this dichotomy can be described mathematically. Moreover, to keep the mathematics tractable, we focus the analysis on linear interactions.

4.1.1 Conservative Interactions

The function $C$ defines a conservative interaction process if for all $F \in (\mathbb{R}^+ \cup \{0\})^{|V|}$, $\|F\|_1 = \|C(F)\|_1$. Here, $\|\cdot\|_1$ denotes the $L_1$-norm of the argument (for a vector $x = (x_1, \ldots, x_n)$, the $L_1$ norm is $\|x\|_1 = \sum_{i=1}^{n} x_i$). In other words, the net change in weight is $\|\Delta F\|_1 = \|C(F) - F\|_1 = 0$. For example, the evolution of the probability distribution of a random walk on the network is a conservative interaction process, since the sum of total probabilities is always 1. In the continuous case, this implies that a process is conservative if $\|\frac{dF}{dt}\|_1 = 0$.

We refer to the observed feature of each node $i$, $F(t)[i]$ in Equations 4.1 and 4.2 at time $t$, as its weight. At any time $t$, a conservative interaction process simply redistributes the weights between the nodes of the graph, while the total weight remains constant.

Let us illustrate a conservative interaction process with discrete time intervals on a toy example of an imaginary city. Suppose one resident of this city, $i$, wins a lottery at time $t = 0$, while all other residents are moneyless. We call friends of resident $i$ one-hop neighbors of $i$, friends of friends of $i$ two-hop neighbors of $i$, and so on. Let the amount of money each resident of the city has at $t = 0$ be $F_c(0)$ ($F_c(0)[i] = 1$ and $F_c(0)[j] = 0$ $\forall j \neq i$).
Let $W_c$ be the transfer matrix, where $W_c[p,q]$ gives the fraction of the amount transferred from resident $p$ to resident $q$ at time $t$. We take $dF_c(t)$ to be the vector representing the amount each resident receives at time $t$. At time $t+1$, each resident retains a fraction $1-\alpha$ of what she receives at time $t$ and redistributes the rest. At time $t = 0$, $dF_c^T(0) = F_c^T(0)$.

At time $t = 1$, resident $i$ retains a fraction $(1-\alpha)$, $0 \le \alpha \le 1$, of what she receives at time $t = 0$. In matrix form, the amount retained is $(1-\alpha)\,dF_c^T(0) = (1-\alpha)F_c^T(0)$. She distributes the rest, $\alpha\,dF_c^T(0) = \alpha F_c^T(0)$, equally amongst her one-hop neighbors. Taking $W_c = D^{-1}A \;\forall t$, where $D$ is the degree matrix and $A$ is the adjacency matrix (see Footnote 2), the amount received by the one-hop neighbors at time $t = 1$ is given by

$$dF_c^T(1) = \alpha\, dF_c^T(0)\, W_c = \alpha F_c^T(0)\, W_c.$$

At time $t = 2$, each one-hop neighbor retains a fraction $1-\alpha$ of what she receives at time $t = 1$,

$$(1-\alpha)\, dF_c^T(1) = (1-\alpha)\,\alpha F_c^T(0)\, W_c,$$

and distributes the rest equally amongst her friends, i.e., amongst the two-hop neighbors of $i$. Therefore, the amount received by the two-hop neighbors at $t = 2$ is given by

$$dF_c^T(2) = \alpha\, dF_c^T(1)\, W_c = \alpha^2 F_c^T(0)\, W_c^2.$$

Similarly, at any time $t$ ($t > 0$), the $t$-hop neighbors of resident $i$ retain a fraction $(1-\alpha)$ of the share of resident $i$'s money that they receive at time $t-1$ and redistribute the rest equally amongst their friends.

Footnote 2: For definitions of the degree and adjacency matrices, see Section 2.2.
Hence,

$$dF_c^T(t) = \alpha\, dF_c^T(t-1)\, W_c = \alpha^t F_c^T(0)\, W_c^t \tag{4.3}$$

Therefore the distribution of money in the network at time $t$, $F_c(t)$, can also be given by:

$$\begin{aligned}
F_c^T(t) &= \sum_{k=1}^{t} (1-\alpha)\, dF_c^T(k-1) + dF_c^T(t) \\
&= \sum_{k=1}^{t} (1-\alpha)\,\alpha^{k-1} F_c^T(0)\, W_c^{k-1} + dF_c^T(t) \\
&= \Big(\sum_{k=1}^{t} (1-\alpha)\,\alpha^{k-1} F_c^T(0)\, W_c^{k-1}\Big) + \alpha^{t} F_c^T(0)\, W_c^{t} \\
&= (1-\alpha)F_c^T(0) + \alpha\Big\{\sum_{k=1}^{t-1} (1-\alpha)\,\alpha^{k-1} F_c^T(0)\, W_c^{k-1} + \alpha^{t-1} F_c^T(0)\, W_c^{t-1}\Big\} W_c \\
&= (1-\alpha)F_c^T(0) + \alpha F_c^T(t-1)\, W_c
\end{aligned} \tag{4.4}$$

Therefore Equation 4.4 can be written as

$$\Delta F_c^T(t) = (1-\alpha)F_c^T(0) - F_c^T(t-1)\,(I - \alpha W_c) \tag{4.5}$$

which is a special case of the discrete generalized linear model described in Equation 4.2 with $\omega = (1-\alpha)F_c^T(0)$, $\mathcal{K}(A) = I - \alpha W_c$, $k = 1$ and time step $\Delta t = 1$. As $t \to \infty$, Equation 4.4 reduces to

$$F_c^T(t \to \infty) = (1-\alpha)F_c^T(0) + \alpha F_c^T(t \to \infty)\, W_c = (1-\alpha)F_c^T(0)\,(I - \alpha W_c)^{-1} \tag{4.6}$$

We observe that the transfer matrix $W_c$ is a stochastic matrix, since its rows sum to 1. We can express different processes using different transfer matrices. In the above money distribution example $W_c = D^{-1}A \;\forall t$, i.e., the transfer matrix is static for all $t$. A general form of the static transfer matrix $W_c$ is [7]:

$$W_c = \gamma I + (1-\gamma)\, D^{-1}A. \tag{4.7}$$

Due to the linearity of this interaction process, Equations 4.3 to 4.6 hold true for any starting vector $F_c(0)$. If the transfer matrix varies with time, $W_c(t)$, the interaction process can be expressed as:

$$\begin{aligned}
F_c^T(t) &= \Big(\sum_{k=0}^{t-1} (1-\alpha)\, dF_c^T(k)\Big) + dF_c^T(t) \\
&= \Big(\sum_{j=0}^{t-1} (1-\alpha)\,\alpha^{j} F_c^T(0) \prod_{k=1}^{j} W_c(k)\Big) + dF_c^T(t)
\end{aligned} \tag{4.8}$$

Note that at any time $t$, the total money in the city remains conserved in the above example. If this interaction process at time $t$ is given by a function $C_t : F_c(0) \to F_c(t)$, then $\|F_c(0)\|_1 = \|C_t(F_c(0))\|_1$. Hence this is a conservative interaction process. Since $C_t$ is a linear mapping, we call the above interaction processes (Equations 4.4 and 4.8) linear conservative interactions. A more general representation of a conservative process could be:

$$F_c(t) = f_t(F_c(0)) \tag{4.9}$$

where $f_t$ may be a linear or non-linear mapping $f_t : F_c(0) \to F_c(t)$.
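The recursion in Equation 4.4 can be checked numerically. The following is a minimal sketch, not taken from the thesis: the four-resident path graph, the value $\alpha = 0.8$, and all function and variable names are illustrative assumptions. It iterates $F_c^T(t) = (1-\alpha)F_c^T(0) + \alpha F_c^T(t-1)W_c$ with $W_c = D^{-1}A$ and records the total amount of money at every step.

```python
def vec_mat(v, m):
    """Row vector times matrix, i.e. v @ m, using plain lists."""
    return [sum(v[i] * m[i][j] for i in range(len(v))) for j in range(len(m[0]))]


def transfer_matrix(adj):
    """W_c = D^{-1} A: divide each row of A by that node's degree."""
    return [[a / sum(row) for a in row] for row in adj]


def conservative_step(f_prev, f0, w, alpha):
    """One step of Eq. 4.4: F(t) = (1 - alpha) F(0) + alpha F(t-1) W_c."""
    spread = vec_mat(f_prev, w)
    return [(1 - alpha) * f0[i] + alpha * spread[i] for i in range(len(f0))]


def simulate_conservative(adj, f0, alpha, steps):
    """Return the list [F(0), F(1), ..., F(steps)]."""
    w = transfer_matrix(adj)
    f = list(f0)
    history = [f]
    for _ in range(steps):
        f = conservative_step(f, f0, w, alpha)
        history.append(f)
    return history


# Hypothetical city: residents on a path 0 - 1 - 2 - 3; resident 0 wins the lottery.
ADJ = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
F0 = [1.0, 0.0, 0.0, 0.0]
HISTORY = simulate_conservative(ADJ, F0, alpha=0.8, steps=50)
TOTALS = [sum(f) for f in HISTORY]  # total money at each time step
```

Because every row of $W_c$ sums to 1, each entry of `TOTALS` stays at the initial total (up to floating-point error), which is the conservation property $\|F_c(0)\|_1 = \|C_t(F_c(0))\|_1$ stated above.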
Summarizing, the circulation of money on any network is a conservative process. Every node in the network starts with some amount of money and, at each time step, transfers a fraction of it to its neighbors. Hence, a fixed amount of wealth is redistributed within a social network and, at any time $t$, the total amount of wealth within the network remains the same. Likewise, goods moving through the network may flow along different trajectories, but their total quantity stays constant.

A random walk is also an example of a conservative process. Let the initial probability of the random walker being on any node be uniform, i.e., $F_c(0)[i] = \frac{1}{|V|}$. At any time $t$, a random walker at node $i$ chooses, with probability $\alpha$, a node $j$ in the neighborhood of $i$ (uniformly at random) and jumps to it. With probability $1-\alpha$, she chooses any node $j$ uniformly at random from the network and jumps to it. Let the matrix $X$ be defined as $X[i,j] = \frac{1}{|V|}$ and let $W_c = D^{-1}A$. Then her probability of being on node $j$ at time $t$ is given by

$$F_c^T(t) = (1-\alpha)F_c^T(t-1)\,X + \alpha F_c^T(t-1)\,W_c = (1-\alpha)F_c^T(0) + \alpha F_c^T(t-1)\,W_c.$$

Thus a random walk with a uniform starting vector is exactly equivalent to Equation 4.4, which describes a linear conservative process with a static transfer matrix.

An example of a continuous conservative interaction process would be:

$$\frac{dF_c^T}{dt} = -k F_c^T L \tag{4.10}$$

Here $\mathcal{K}(A) = L = D - A$ is the Laplacian matrix. Note that $\left\|\frac{dF_c}{dt}\right\|_1 = \|{-k} F_c^T L\|_1 = 0$ and hence this is a conservative interaction process. The model would change based on the nature of the interactions. Another example of a conservative interaction model would be:

$$\frac{dF_c^T}{dt} = -k F_c^T (I - AD^{-1}) \tag{4.11}$$

Here, $D^{-1}$ is the inverse of the diagonal degree matrix. Similarly, using the normalized Laplacian operator, Equation 4.12 models a conservative interaction framework:

$$\frac{dF_c^T}{dt} = -k F_c^T (I - D^{-1/2} A D^{-1/2}) \tag{4.12}$$

The normalized Laplacian operators in Eqs. 4.11 and 4.12 are often used to describe random walk-based processes.
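The conservation property of the continuous Laplacian model (Equation 4.10) can be illustrated with a simple forward-Euler integration. This is only a sketch under stated assumptions: the four-node graph (a triangle with one pendant node), the step size, the coupling $k = 1$ and all names are hypothetical choices, and forward Euler is a convenience here, not a method used in the thesis.

```python
def laplacian(adj):
    """L = D - A for an unweighted, undirected graph given as a 0/1 matrix."""
    n = len(adj)
    return [
        [(sum(adj[i]) if i == j else 0) - adj[i][j] for j in range(n)]
        for i in range(n)
    ]


def diffuse(adj, f0, k, dt, steps):
    """Forward-Euler integration of dF^T/dt = -k F^T L (Eq. 4.10)."""
    lap = laplacian(adj)
    n = len(f0)
    f = list(f0)
    for _ in range(steps):
        # (F^T L)[j] = sum_i F[i] * L[i][j]
        flow = [sum(f[i] * lap[i][j] for i in range(n)) for j in range(n)]
        f = [f[j] - k * dt * flow[j] for j in range(n)]
    return f


# Hypothetical graph: triangle 0-1-2 with a pendant node 3 attached to node 2.
ADJ = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]
F0 = [1.0, 0.0, 0.0, 0.0]
F_FINAL = diffuse(ADJ, F0, k=1.0, dt=0.05, steps=2000)
TOTAL_FINAL = sum(F_FINAL)
```

Since the rows of $L$ sum to zero, each Euler step leaves the total weight unchanged, and on this connected graph the weights relax towards the uniform distribution.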
Equation 4.10 has been used to describe a variety of conservative systems. It measures the electric potential in a network of capacitors of unit capacitance, with one plate of each capacitor grounded and the other plate connected according to the graph structure, with each edge corresponding to a resistor of resistance $1/c$. The same equation has been used to model (discrete) diffusion of heat and fluid flow in networks, and it serves as the basis of interaction kernels over discrete structures in machine learning algorithms [97].

4.1.2 Non-Conservative Interactions

An interaction process where the total weight $F$ (which is the observed feature) can change in time is a non-conservative interaction process. Formally, a function $N$ defines a non-conservative interaction process if for some $F \in (\mathbb{R}^+ \cup \{0\})^{|V|}$, $\|F\|_1 \neq \|N(F)\|_1$. In other words, $\|\Delta F\|_1 = \|N(F) - F\|_1 \neq 0$. In the continuous case, this translates to $\left\|\frac{dF}{dt}\right\|_1 \neq 0$.

For example, consider the spread of a virus on a network. At each time step, an infected node may infect its neighbors, thereby replicating the virus. Clearly, the total amount of the virus in the system will change with time. This can be modeled by a non-conservative dynamic process. In the context of online social networks, consider the spread of information, e.g., on Digg. Users on Digg broadcast information to their network neighbors by voting for a story. Those neighbors may in turn broadcast the information to their own neighbors by voting for the story. Here the observed feature is the number of votes in the network. The number of votes in the network increases with time; therefore, information diffusion on such a network is a non-conservative process.

To illustrate such non-conservative processes, let us return to the imaginary city of the previous example. Let us now imagine that each resident has some amount of money.
Also, she has a personal money minting machine and wants to produce more money by the replication process described below. The amount of money a person receives is the sum of what she produces for herself and what she receives from her neighbors. Let the amount each resident receives at time $t$ be $dF_n(t)$, with $dF_n^T(0) = F_n^T(0)$. At time $t$, each resident produces for herself a fraction $\mu$ of the money she receives at time $t-1$. At time $t$, she also prints a fraction $\alpha$ of the money she receives at time $t-1$ for each of her neighbors. This process of replication occurring at time $t$ can then be expressed using a replication matrix $W_n = \mu I + \alpha A$. At time $t = 1$, the additional amount each resident receives is

$$dF_n^T(1) = \mu\, dF_n^T(0) + \alpha\, dF_n^T(0)\, A = F_n^T(0)\, W_n.$$

The additional amount each resident receives at time $t = 2$ is

$$dF_n^T(2) = \mu\, dF_n^T(1) + \alpha\, dF_n^T(1)\, A = dF_n^T(1)\,(\mu I + \alpha A) = F_n^T(0)\, W_n^2.$$

The amount of money each resident receives at time $t$, $dF_n(t)$, is:

$$dF_n^T(t) = \mu\, dF_n^T(t-1) + \alpha\, dF_n^T(t-1)\, A = dF_n^T(t-1)\, W_n = F_n^T(0)\, W_n^t = F_n^T(0)\,(\mu I + \alpha A)^t \tag{4.13}$$

The amount of money that each resident has at time $t$ is the total amount accumulated by her up to time $t$ and is given by the distribution $F_n(t)$:

$$\begin{aligned}
F_n^T(t) &= \sum_{k=0}^{t} dF_n^T(k) = F_n^T(0) \sum_{k=0}^{t} W_n^k = \sum_{k=0}^{t} F_n^T(0)\,(\mu I + \alpha A)^k \\
&= F_n^T(0) + \sum_{k=0}^{t-1} F_n^T(0)\,(\mu I + \alpha A)^k (\mu I + \alpha A) \\
&= F_n^T(0) + F_n^T(t-1)\,(\mu I + \alpha A) \\
&= F_n^T(0) + F_n^T(t-1)\, W_n
\end{aligned} \tag{4.14}$$

Therefore Equation 4.14 can be written as

$$\Delta F_n^T(t) = F_n^T(0) - F_n^T(t-1)\,(I - W_n) \tag{4.15}$$

which is a special case of the discrete generalized linear model described in Equation 4.2 with $\omega = F_n^T(0)$, $\mathcal{K}(A) = I - W_n$, $k = 1$ and time step $\Delta t = 1$. Note that $\|\Delta F_n^T(t)\|_1 \neq 0$.

As $t \to \infty$, Equation 4.14 reduces to

$$F_n^T(t \to \infty) = F_n^T(0) \sum_{k=0}^{t \to \infty} (\mu I + \alpha A)^k \tag{4.16}$$

which can be solved to yield

$$F_n^T(t \to \infty) = F_n^T(0) + F_n^T(t \to \infty)\,(\mu I + \alpha A) = F_n^T(0)\,\big((1-\mu) I - \alpha A\big)^{-1} \tag{4.17}$$

for $\frac{\alpha}{1-\mu} < \frac{1}{\lambda_1}$, where $\lambda_1$ is the spectral radius of $A$. Here, the replication matrix is static for all $t$, $W_n = (\mu I + \alpha A)$.
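The money-minting example can be simulated in the same way as the conservative one. The sketch below is illustrative only: the triangle graph, the values $\mu = 0.2$ and $\alpha = 0.1$, and the names are assumptions chosen so that $\frac{\alpha}{1-\mu} < \frac{1}{\lambda_1}$ holds (for a triangle, $\lambda_1 = 2$). It iterates Equation 4.14, $F_n^T(t) = F_n^T(0) + F_n^T(t-1)W_n$, and records the total weight.

```python
def replicate(adj, f0, mu, alpha, steps):
    """Iterate Eq. 4.14: F(t) = F(0) + F(t-1) (mu I + alpha A)."""
    n = len(f0)
    f = list(f0)
    totals = [sum(f)]
    for _ in range(steps):
        # (F W_n)[j] = mu * F[j] + alpha * sum_i F[i] * A[i][j]
        spread = [
            mu * f[j] + alpha * sum(f[i] * adj[i][j] for i in range(n))
            for j in range(n)
        ]
        f = [f0[j] + spread[j] for j in range(n)]
        totals.append(sum(f))
    return f, totals


# Hypothetical city of three mutual friends (a triangle); spectral radius is 2.
ADJ = [
    [0, 1, 1],
    [1, 0, 1],
    [1, 1, 0],
]
F0 = [1.0, 0.0, 0.0]
F_STEADY, TOTALS = replicate(ADJ, F0, mu=0.2, alpha=0.1, steps=200)
```

Unlike the conservative case, `TOTALS` grows from its initial value before levelling off. For these parameters the iteration settles on the fixed point of Equation 4.17, which by symmetry can be solved by hand as $F_n(\infty) = [7/5.4,\; 1/5.4,\; 1/5.4]$.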
If the replication matrix changes with time, $W_n(t)$, the amount of money each resident has at time $t$ is:

$$F_n^T(t) = F_n^T(0) + F_n^T(0) \sum_{j=1}^{t} \prod_{k=1}^{j} W_n(k) \tag{4.18}$$

If this interaction process at time $t$ is given by a function $N_t : F_n(0) \to F_n(t)$, then $\|F_n(0)\|_1 \neq \|N_t(F_n(0))\|_1$. Hence this is a non-conservative interaction process. Since $N_t$ is a linear mapping, we call the above interaction processes (Equations 4.14 to 4.18) linear non-conservative interaction processes. A more general representation of a non-conservative process could be:

$$F_n(t) = f_t(F_n(0)) \tag{4.19}$$

where $f_t$ may be a linear or non-linear mapping $f_t : F_n(0) \to F_n(t)$.

An example of a continuous linear non-conservative interaction process is

$$\frac{dF_n^T}{dt} = \omega^T - k F_n^T (\alpha I - A) \tag{4.20}$$

We call the kernel $R = \alpha I - A$ the Replicator kernel. Another linear non-conservative interaction model could be:

$$\frac{dF_n^T}{dt} = \omega^T - k F_n^T \Big(I - \frac{1}{\alpha} A\Big) \tag{4.21}$$

In Equations 4.20 and 4.21, $\alpha > \lambda_1$, where $\lambda_1$ is the spectral radius of $A$.

The different interaction models represent the different kinds of interactions that take place in the network. For example, the conservative model given by Equation 4.4 can accurately represent a money exchange process. On the other hand, Equation 4.14 might better explain an epidemic process. We explore the role of the non-conservative model in studying epidemics in the next section.

4.1.2.1 Contagion as Non-Conservative Interaction

Non-conservative interaction provides a useful framework for thinking about epidemics and other spreading processes, specifically the relation between network structure and the growth of epidemics.

A spreading process is a contact process, since information or a virus spreads from one individual to another only if they are in contact with one another.
Contact processes have been extensively studied in epidemiology, where compartmental models, such as SIS (susceptible-infectious-susceptible) and SIR (susceptible-infectious-recovered), have been used to model the dynamics of epidemics [14, 80]. The simplest of these models assume homogeneity, i.e., everyone is in contact with everyone else in the population, and the rates of infection and recovery are uniform [14]. In other words, they do not take the heterogeneity of the underlying network into account.

One approach to creating more realistic models of interactions is to segregate the population using different categorical features, such as age, sex and so on, and then treat the interactions within the subpopulations as homogeneous and symmetric [80]. Another approach modifies homogeneous models to represent interactions between individuals as a directed graph [94], leading to a single mean field (MF) equation. To relax the homogeneity assumptions further, and to take into account the strong fluctuations in the connectivity distribution, the single MF equation is modified to a heterogeneous mean field (HMF) rate equation [133].

In order to model a spreading process accurately, the structure of the underlying network has to be taken into account. Therefore, to account for the heterogeneity and structural dependence of epidemic spread, Wang et al. [179] modified the existing SIS models to describe epidemic spreading in real networks. However, we demonstrate here that this model too makes implicit assumptions about the spread of epidemics. It assumes that the spread of epidemics is a linear non-conservative interaction process with a static replication matrix (Equation 4.14).

Consider the SIS model of virus spreading. Let $\beta$ be the virus birth rate on a node connected to an infected node. Let $\delta$ be the virus curing rate. Wang et al. [179] modeled the probability $p_{i,t}$ that node $i$ is infected at time $t$.
Let $c_{i,t}$ be the probability that a node $i$ is not infected by its neighbors at time $t$:

$$c_{i,t} = \prod_{j:\,\text{neighbor of } i} \big(p_{j,t-1}(1-\beta) + (1 - p_{j,t-1})\big) = \prod_{j:\,\text{neighbor of } i} (1 - \beta\, p_{j,t-1}) \tag{4.22}$$

Node $i$ is healthy at $t$ if (1) the node is healthy before time $t$ and was not infected by its neighbors, or (2) the node is infected before $t$, but is cured at time $t$ and was not infected by its neighbors, or (3) the node is infected before $t$, receives but is not affected by infection from its neighbors, and was cured at time $t$. It is assumed that the probability of a curing event taking place at $i$ after infection from a neighbor is roughly 50%. Putting it all together, the probability that node $i$ is healthy at time $t$ is:

$$\begin{aligned}
1 - p_{i,t} &= (1 - p_{i,t-1})\, c_{i,t} + \delta\, p_{i,t-1}\, c_{i,t} + 0.5\,\delta\, p_{i,t-1}\,(1 - c_{i,t}) \\
&= c_{i,t}\,\big(1 - (1 - 0.5\,\delta)\, p_{i,t-1}\big) + 0.5\,\delta\, p_{i,t-1} \\
&= \prod_{j:\,\text{neighbor of } i} (1 - \beta\, p_{j,t-1})\,\big(1 - (1 - 0.5\,\delta)\, p_{i,t-1}\big) + 0.5\,\delta\, p_{i,t-1} \\
&\approx 1 - (1 - 0.5\,\delta)\, p_{i,t-1} - \beta \sum_{j:\,\text{neighbor of } i} p_{j,t-1} + 0.5\,\delta\, p_{i,t-1} \\
&= 1 - (1 - \delta)\, p_{i,t-1} - \beta \sum_{j:\,\text{neighbor of } i} p_{j,t-1}
\end{aligned} \tag{4.23}$$

This uses the approximation $(1-a)(1-b) \approx 1 - a - b$. Therefore the probability $p_{i,t}$ that node $i$ is infected at time $t$ is:

$$p_{i,t} = (1-\delta)\, p_{i,t-1} + \beta \sum_{j:\,\text{neighbor of } i} p_{j,t-1}$$

Writing this in matrix notation, where $P_t[i]$ is the probability of infection of node $i$ at time $t$ and $P_0$ is the initial probability of infection:

$$P_t^T = P_{t-1}^T\,\big((1-\delta) I + \beta A\big) = P_0^T\,\big((1-\delta) I + \beta A\big)^t \tag{4.24}$$

The expected number of times a node $i$ is infected up to time $t$ is then given by $P_t^{cum}[i]$, where $P_t^{cum}$ is:

$$P_t^{(cum)T} = \sum_{k=0}^{t} P_k^T = P_0^T \sum_{k=0}^{t} \big((1-\delta) I + \beta A\big)^k \tag{4.25}$$

As $t \to \infty$, when $\beta/\delta < 1/\lambda_1$, the series $\{P_t^{cum}\}$ converges to

$$P_{t \to \infty}^{(cum)T} = P_0^T\,(\delta I - \beta A)^{-1}. \tag{4.26}$$

The convergence of the series implies that the infection dies off. Hence the infection dies off if the effective transmissibility satisfies $\beta/\delta < 1/\lambda_1$. Using terms from epidemiology, this shows that there exists an epidemic threshold $\tau$. The epidemic threshold is the value of transmissibility below which the disease dies out and above which an epidemic occurs [14].
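The threshold behaviour of the linearized model in Equation 4.24 can be seen in a small numerical experiment. This is a hedged sketch rather than the procedure of Wang et al. [179]: the four-node graph, the rates, the use of power iteration to estimate $\lambda_1$, and all names are assumptions made for illustration. Note that in the linearized model the entries of $P_t$ are not capped at 1, so above the threshold they simply grow.

```python
def spectral_radius(adj, iters=500):
    """Estimate the largest eigenvalue of A by power iteration (max-norm scaling)."""
    n = len(adj)
    v = [1.0] * n
    lam = 0.0
    for _ in range(iters):
        w = [sum(adj[i][j] * v[j] for j in range(n)) for i in range(n)]
        lam = max(abs(x) for x in w)
        v = [x / lam for x in w]
    return lam


def infection_mass(adj, p0, beta, delta, steps):
    """Iterate Eq. 4.24, P_t = P_{t-1} ((1 - delta) I + beta A), and return sum(P_t)."""
    n = len(p0)
    p = list(p0)
    for _ in range(steps):
        p = [
            (1 - delta) * p[i] + beta * sum(p[j] * adj[j][i] for j in range(n))
            for i in range(n)
        ]
    return sum(p)


# Hypothetical contact network: triangle 0-1-2 with a pendant node 3 attached to node 2.
ADJ = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]
P0 = [0.1, 0.0, 0.0, 0.0]

LAMBDA_1 = spectral_radius(ADJ)
THRESHOLD = 1.0 / LAMBDA_1  # epidemic threshold for beta / delta

# beta / delta = 0.2 lies below the threshold; beta / delta = 0.6 lies above it.
BELOW = infection_mass(ADJ, P0, beta=0.1, delta=0.5, steps=100)
ABOVE = infection_mass(ADJ, P0, beta=0.3, delta=0.5, steps=100)
```

With the transmissibility below $1/\lambda_1$, the total infection probability `BELOW` decays towards zero, while `ABOVE` ends up larger than the initial mass, consistent with the convergence condition for Equation 4.26.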
For any graph, the threshold of a susceptible-infected-susceptible model is determined by the largest eigenvalue of the adjacency matrix $A$, i.e., its spectral radius: $\tau = 1/\lambda_1$ [179]. The epidemic threshold then has an intuitive explanation: when $\beta/\delta < 1/\lambda_1$, $\{P_t^{cum}\}$ is a convergent series, which means that the amount of infection each node has as $t \to \infty$ must be finite, and therefore the disease will die out. On the other hand, when $\beta/\delta \ge 1/\lambda_1$, $\{P_t^{cum}\}$ is a divergent series, which is consistent with the fact that the epidemic spreads. Therefore, the epidemic will spread when the effective transmissibility is greater than $1/\lambda_1$, which is called the radius of convergence. This is a signature of critical phenomena, and happens when $\beta/\delta = 1/\lambda_1$.

Equation 4.24 is exactly equivalent to Equation 4.13 ($dF_n(t) = F_n(0)\,(\mu I + \alpha A)^t$) and Equation 4.25 is equivalent to Equation 4.14 ($F_n(t) = F_n(0) \sum_{k=0}^{t} (\mu I + \alpha A)^k$) with $P_0 = F_n(0)$, $\mu = 1 - \delta$ and $\alpha = \beta$. Equations 4.13 and 4.14 define the distribution of a linear non-conservative interaction process with static replication matrix $(\mu I + \alpha A)$ at time step $t$. Therefore, the epidemic model described above is exactly equivalent to a linear non-conservative interaction with a static replication matrix.

4.1.3 Spectral Properties of Kernels

As we saw above, the linear conservative model naturally gives rise to the Laplacian kernel $L$ (see Eq. 4.10). This explains the connection between the spectrum of the Laplacian and the properties of linear conservative interaction and the associated structure. The number of null eigenvalues of $L$ gives the number of disconnected components of the graph and is the basis of spectral clustering. The time to reach the steady state is inversely proportional to the smallest positive eigenvalue of the Laplacian, and the gaps between consecutive eigenvalues are related to the relative difference in synchronization time scales of different modules [12, 11].

The Replicator kernel $R$ we introduced in Eq. 4.20 is the non-conservative counterpart of the Laplacian. Its spectrum gives us information about the topological and temporal scales of non-conservative dynamical systems. In particular, the time it takes for the system to reach the steady state is inversely proportional to the smallest positive eigenvalue of $R$ (when $\alpha = \lambda_{max}$), as explained in detail in Chapter 5, Section 5.2.1.

4.1.4 Classification of Interaction Models

We would like to emphasize that the division into conservative and non-conservative interaction models is just one of many possible ways to classify interaction models. Another possibility is to classify interaction models into constant rate and varying rate interaction models. We can generalize conservative interaction models to constant rate interaction models by including all interaction models satisfying $\left\|\frac{dF}{dt}\right\|_1 = c$ in Equation 4.1 ($\frac{dF^T}{dt} = \omega - k F^T \mathcal{K}(A)$). Here $c$ is a constant, as opposed to $\left\|\frac{dF}{dt}\right\|_1 = 0$ in the conservative case of Section 4.1.1. In such models the rate is independent of time. On the other hand, if the cumulative rate $\left\|\frac{dF}{dt}\right\|_1$ depends on time, we call such models varying rate interaction models. Similarly, one-to-one interaction models have a conservative flavor and one-to-many interaction models have a non-conservative flavor. The objective of classification, irrespective of the scheme, is to understand the distinct properties of each class of interaction models, to contrast them with the properties of other classes, and to identify global characteristics common to all interaction models.

4.2 Interactions and Network Structure

The different indicators of network structure include the organization of the communities, the identity of the influentials within a community, and the measure of proximity or closeness in the network.
There exist many metrics for the detection of communities, the prediction of influentials (centrality) and the measurement of proximity. However, as we show in the next three chapters, most of these metrics make implicit assumptions about the dynamic interactions taking place in the network. We claim that, given a network, the predictive metrics that best detect the network structure are those whose implicit dynamic interactions most closely match the actual dynamic activity taking place in the network.

Chapter 5
Interactions and Community Structure

Modular structure is an important characteristic of complex real-world networks, including social networks, which are composed of communities and sub-communities of interconnected individuals, and biological networks, which are often organized within functional modules [151, 153]. Conductance minimization [38] and modularity maximization [52] are some of the most popular methods for community detection. However, these are combinatorial approaches that have been shown to be NP-hard or NP-complete. As a result, researchers resort to heuristics and approximation algorithms when applying these methods to the community detection problem. On the other hand, decentralized algorithms based on local computation have been shown to provide scalable solutions to combinatorial problems [191]. Motivated by this idea, we cast community detection as a decentralized computation problem in which a network of locally interacting agents finds, over time, a global solution that corresponds to the community division of the network.

To find interesting structure in networks, community detection algorithms have to take into account both the network topology and the dynamics of interactions between nodes. We investigate this claim using the paradigm of synchronization in a network of coupled oscillators and develop a framework for multi-scale analysis of community structure. The local interactions cause the nodes' activity to become more similar.
As the network evolves (see Footnote 1) to a global steady state, nodes belonging to the same community synchronize faster than nodes belonging to different communities. In a social network, for example, frequent contact leads to similarity of behavior among friends. Over time, communities composed of individuals who act in a similar manner will emerge. As another example, consider a population of fireflies that have characteristic light flashing patterns to help males and females recognize each other. Some firefly species exhibit synchronous flashing, during which an individual's flashing pattern can affect that of its neighbors, leading all nearby fireflies to flash in unison [167]. The Kuramoto model is a simple mathematical description of distributed synchronization in this and other physical and biological systems [100]. The model considers a network of coupled oscillators, in which the phase of each oscillator is affected by the phases of its neighbors. While the network as a whole eventually reaches a fully synchronized state, it does so in stages, with nodes belonging to the same community synchronizing faster than nodes belonging to different communities [12].

Traditionally, nodes in network synchronization models are coupled via conservative interactions or constant rate interactions (Chapter 4, Section 4.1.4). However, social interactions are often one-to-many, as for example in social media, where users broadcast messages to all their followers. We formulate a novel model of synchronization in a network of coupled oscillators in which the oscillators are coupled via such non-conservative interactions. We study the dynamics of different interaction models and contrast their spectral properties. We show that in the non-conservative interaction model, nodes synchronize much faster than in the conservative interaction model.

Footnote 1: By evolution we mean the change of phase of the nodes of the system with time, following the rules of interaction. Evolution is used here in the sense of development of the system through a series of interaction processes. We do not use evolution in its biological sense of a gradual change in the characteristics of a population of animals or plants over successive generations, which accounts for the origin of existing species from ancestors unlike them.

To find multi-scale community structure in a network of interacting nodes, we define a similarity function that measures the degree to which nodes are synchronized and use it to hierarchically cluster nodes. We use dynamic interaction models to explore the community structure of several networks, including an artificial network, a benchmark social network and large real-world networks from social media sites. Our study reveals substantial differences in the network structure discovered by different interaction models. We find a complex layered organization of the real-world networks. While these networks exhibit the 'core and whiskers' organization found in other real-world social and information networks [122], with a giant core and multiple small communities (whiskers) weakly connected to the core, this constitutes but one layer of the organization. As we peel away the whiskers layer to examine the core, we find a similar 'core and whiskers' structure in the new layer, and so on.

To evaluate the quality of the discovered communities in a social media network, we propose a community quality metric based on user activity. We find that conservative and non-conservative interaction models lead to dramatically different views of community structure even within the same network. Our work offers a novel mathematical framework for exploring the relationship between network structure, topology and dynamics.
We demonstrate in this chapter an important yet perhaps surprising principle: different dynamic processes running on the same topology can lead to different views of network structure. In reality, network structure, topology and dynamics are intricately interconnected, and our work offers a formal framework to begin exploring these connections.

5.1 Classification of Synchronization Processes

Physicists have studied the dynamics of interacting entities in an attempt to understand the collective behavior of complex networks. The Kuramoto model [100] was proposed as a simple model for how global synchronization may arise in physical and biological systems. The model considers a network of phase oscillators, each coupled to its neighbors through the sine of their phase differences. The Kuramoto model has a fully synchronized steady state in which the phase difference between all oscillators is zero.

As we show below, the Kuramoto model (at least in the linear case) assumes that interactions between nodes are mediated by a conservative process similar to heat diffusion, which is mathematically related to the random walk. However, not all social phenomena, including epidemic spread and information diffusion, admit such descriptions [63]. In this section we introduce a new model of distributed synchronization based on non-conservative interactions.

5.1.1 Kuramoto Synchronization as Conservative Interaction Model

The Kuramoto model is written as:

$$\frac{d\theta_i}{dt} = \omega_i + k \sum_{j \in neigh(i)} \sin(\theta_j - \theta_i) \tag{5.1}$$

where $\theta_i$ is the instantaneous phase of the $i$th oscillator, $\omega_i$ is its natural frequency, and $k$ is the coupling constant that describes the strength of interaction with a neighbor. The neighborhood of node $i$, $neigh(i)$, contains the nodes that share an edge with node $i$. For small phase differences, $\sin\theta \approx \theta$, and the linearized version of the Kuramoto model can be written as:

$$\frac{d\theta_i}{dt} = \omega_i + k \sum_{j \in neigh(i)} (\theta_j - \theta_i) \tag{5.2}$$

For convenience, we rewrite Eq. 5.2 in vector form:

$$\frac{d\theta}{dt} = \omega - k L \theta \tag{5.3}$$

Here $\omega$ is the vector of length $N$ of intrinsic properties of the nodes, $\theta$ is a vector of observed features, and $k$ is the coupling constant. The kernel $L$ is the Laplacian of the graph, $L = D - A$. Here $A$ is the adjacency matrix of the unweighted, undirected graph, such that $A[i,j] = 1$ if there exists an edge between $i$ and $j$; otherwise, $A[i,j] = 0$. The matrix $D$ is the diagonal degree matrix, where $D[i,i] = \sum_j A[i,j]$ and $D[i,j] = 0 \;\forall i \neq j$.

The linearized version of the Kuramoto model belongs to the class of constant rate interaction models (Chapter 4, Section 4.1.4). When $\omega = 0$, this is exactly equivalent to Equation 4.10 ($\frac{dF_c}{dt} = -k L F_c$), which is a continuous conservative model of interaction. This model ($\omega = 0$) describes the evolution of the extrinsic properties of a population of nodes (or agents). After some time, the network reaches a steady state, and interactions no longer change the property of any node, i.e., $\theta_i(t) = \theta_i(t+1)$. In the opinion formation example (where the observed feature is the opinion of the people), it would mean that after some period, individual opinions no longer change. For $\omega_i = \omega_j \;\forall i,j$, in the steady state $\theta_i(t) = \theta_j(t) \;\forall i,j$. In other words, the extrinsic properties of all the nodes are the same in the steady state. In the context of oscillators, this means that their phases are equal and they are synchronized.

This model is just one of a family of conservative interaction models. The kernel in the Kuramoto model is $\mathcal{K}(A) = L = D - A$. Other kernels that can give rise to conservative synchronization models include $\mathcal{K}(A) = I - AD^{-1}$, leading to Equation 4.11 ($\frac{d\theta}{dt} = -k (I - AD^{-1})\theta$), and $\mathcal{K}(A) = I - D^{-1/2} A D^{-1/2}$, leading to Equation 4.12 ($\frac{d\theta}{dt} = -k (I - D^{-1/2} A D^{-1/2})\theta$).

5.1.2 Novel Methods for Synchronization using Non-Conservative Interactions

In contrast to the Kuramoto models, in most human or biological networks the cumulative rate of interaction might change with time and is rarely conservative.
This changes the nature of the interactions and the resulting dynamics of the network. We present a model of synchronization based on non-conservative interactions in undirected networks, where the cumulative rate of interaction $\left\|\frac{d\theta}{dt}\right\|_1$ varies with time:

$$\frac{d\theta_i}{dt} = \omega_i + k \sum_{j \in neigh(i)} \Big(\theta_j - \frac{\alpha\,\theta_i}{d_i}\Big) \tag{5.4}$$

$$\frac{d\theta}{dt} = \omega - k(\alpha I - A)\theta \tag{5.5}$$

Here $\alpha$ is a constant, $I$ is the identity matrix and $(\alpha I - A)$ is the Replicator kernel $R$. The Replicator kernel can be used to model processes like epidemics, as shown in Section 4.1.2.1. In order for this system to reach a steady state, $\alpha \ge \lambda_{max}$, where $\lambda_{max}$ is the largest eigenvalue of the adjacency matrix of the network. Eq. 5.5 gives the vector form of the non-conservative model and is exactly equivalent to the non-conservative model in Eq. 4.20 ($\frac{dF_n^T}{dt} = \omega^T - k F_n^T (\alpha I - A)$). In spite of non-conservation, the system reaches a steady state where the phases of the oscillators no longer change: $\theta_i(t) = \theta_i(t+1)$ when $\alpha = \lambda_{max}$. In the steady state, $\theta_i$ is proportional to the $i$th element of the eigenvector of the adjacency matrix corresponding to its largest eigenvalue. Other flavors of the non-conservative interaction model are also possible, such as $\frac{d\theta}{dt} = \omega - k \big(I - \frac{1}{\alpha} A\big)\theta$ (Eq. 4.21).

5.2 Synchronization and Community Structure

A community is a group of nodes that are more similar to each other than to other nodes. Some network community detection approaches, like conductance, measure similarity by the number (or fraction) of edges linking nodes to other nodes within the same community [52]. The interaction models allow us to define communities dynamically. Given a network of nodes with random initial states ($\theta_i(t = 0)$), we allow the system to evolve according to the rules of the interaction model. As Arenas et al. [12] observed, as nodes interact, their phases (or extrinsic properties) become more similar, with nodes within the same community becoming more similar to each other faster than nodes from different communities. This happens in stages that reveal the network's hierarchical community structure.
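The different steady states of the conservative model (Eq. 5.3) and the non-conservative model (Eq. 5.5) can be compared numerically. The sketch below makes several assumptions that are not in the text: $\omega = 0$, $k = 1$, forward-Euler integration, a hypothetical four-node graph with arbitrary initial phases, and power iteration to estimate $\lambda_{max}$. All names are illustrative.

```python
def mat_vec(m, v):
    """Matrix times column vector using plain lists."""
    return [sum(m[i][j] * v[j] for j in range(len(v))) for i in range(len(m))]


def laplacian_kernel(adj):
    """Conservative kernel L = D - A."""
    n = len(adj)
    return [
        [(sum(adj[i]) if i == j else 0) - adj[i][j] for j in range(n)]
        for i in range(n)
    ]


def replicator_kernel(adj, alpha):
    """Non-conservative Replicator kernel R = alpha I - A."""
    n = len(adj)
    return [[(alpha if i == j else 0.0) - adj[i][j] for j in range(n)] for i in range(n)]


def largest_eigenvalue(adj, iters=500):
    """Power-iteration estimate of lambda_max of the adjacency matrix."""
    v = [1.0] * len(adj)
    lam = 0.0
    for _ in range(iters):
        w = mat_vec(adj, v)
        lam = max(abs(x) for x in w)
        v = [x / lam for x in w]
    return lam


def synchronize(kernel, theta0, k=1.0, dt=0.05, steps=2000):
    """Forward-Euler integration of d(theta)/dt = -k * kernel * theta (omega = 0)."""
    theta = list(theta0)
    for _ in range(steps):
        drift = mat_vec(kernel, theta)
        theta = [theta[i] - k * dt * drift[i] for i in range(len(theta))]
    return theta


# Hypothetical network: triangle 0-1-2 with a pendant node 3 attached to node 2.
ADJ = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]
THETA0 = [0.1, 0.9, 0.4, 0.7]  # arbitrary initial phases

LAM = largest_eigenvalue(ADJ)
THETA_CONS = synchronize(laplacian_kernel(ADJ), THETA0)
THETA_NON = synchronize(replicator_kernel(ADJ, LAM), THETA0)
```

In this sketch the conservative run ends with all phases equal to the mean of the initial phases, whereas the non-conservative run with $\alpha = \lambda_{max}$ ends with phases proportional to the leading eigenvector of $A$, so the best-connected node (node 2) carries the largest value.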
In this section we define a new similarity function and describe a hierarchical clustering algorithm that uses it to identify a network's community structure.

5.2.1 A Consolidated Framework for Community Detection

As we demonstrated in Chapter 4, both conservative and non-conservative interaction models are special cases of the general linearized interaction model, Eq. 4.1 ($\frac{d\theta}{dt} = \omega - k\,\mathcal{K}(A)\,\theta$). As we show below, this model generalizes several community detection methods, such as spectral clustering, modularity maximization and conductance minimization. Solving this differential equation we get:

$$\theta(t) = e^{-k\mathcal{K}(A)t}\,\big(\theta_0 - (k\mathcal{K}(A))^{-1}\omega\big) + (k\mathcal{K}(A))^{-1}\omega \tag{5.6}$$

with $\theta_0$ the initial value $\theta(t = 0)$ and $\omega$ the vector of natural frequencies. Let $|V|$ be the number of nodes in the network. Let $X$ be a $|V| \times |V|$ matrix whose column $X[:,i]$ gives the eigenvector of the kernel $\mathcal{K}(A)$ corresponding to eigenvalue $\lambda_i$ (see Footnote 2). Also, let $\Lambda$ be the diagonal eigenvalue matrix where $\Lambda[i,i] = \lambda_i$. Let $Y = X^{-1}$. Therefore $\mathcal{K}(A) = \sum_{i \in \{1,2,\ldots,|V|\}} X[:,i]\,\lambda_i\,Y[i,:]$. Eq. 5.6 with $\omega = 0$ can be rewritten as:

$$\theta(t) = e^{-k\mathcal{K}(A)t}\,\theta_0 = \sum_{i \in \{1,2,\ldots,|V|\}} X[:,i]\, e^{-k\lambda_i t}\, Y[i,:]\,\theta_0 = \sum_{i \in \{1,2,\ldots,|V|\}} X[:,i]\, e^{-k\lambda_i t}\, c_i \tag{5.7}$$

Footnote 2: We assume $\mathcal{K}(A)$ to be diagonalizable.

Here $c_i = Y[i,:]\,\theta_0$ is a constant. Let $\lambda_1 \le \lambda_2 \le \cdots \le \lambda_{max}$. Let $t_j$ be such that $e^{-k\lambda_i t_j} \to 0 \;\forall i \ge j$ and $t_{j+1}$ be such that $e^{-k\lambda_i t_{j+1}} \to 0 \;\forall i \ge j+1$. Therefore, for $t_{j+1} \le t < t_j$, $\theta_t = \sum_{i=1}^{j} X[:,i]\, e^{-k\lambda_i t}\, c_i$.

Steady State. Let us look at the $\lambda_1 = 0$ case more closely. This arises in the non-conservative interaction when $\alpha = \lambda_{max}$ (Eq. 5.5). In this case, as $t \to \infty$, Eq. 5.7 reduces to $\theta_{t \to \infty} = X[:,1]\,c_1$, where $c_1$ is a constant; this is the steady state or equilibrium. For non-conservative interaction models, $\lambda_1 > 0$ leads to a trivial equilibrium condition. Considering Eqs. 4.1, 5.3 and 5.5 with $\omega = 0$:

$\mathcal{K}(A) = D - A = L$: In this case $X[:,1] \propto \mathbf{1}$ (the vector of 1s). Hence $\theta_{t \to \infty}[i] = \theta_{t \to \infty}[j] \;\forall i,j$, i.e., the content or phase of all nodes is equal at synchronization.

$\mathcal{K}(A) = I - AD^{-1}$: Here $\theta_{t \to \infty}[i] \propto d[i]$, where $d[i]$ is the degree of node $i$.
$\mathcal{L}(A) = I - \frac{1}{\lambda_{max}}A$ or $\mathcal{L}(A) = \lambda_{max}I - A = R$: Here $\theta_{t \to \infty} \propto$ the eigenvector of the adjacency matrix $A$ corresponding to the largest eigenvalue.

$\mathcal{L}(A) = I - \frac{1}{\alpha}A$ or $\mathcal{L}(A) = \alpha I - A$, $\forall \alpha > \lambda_{max}$: Here $\theta_{t \to \infty}[i] \to 0$ $\forall i$.

Spectral Clustering and Partitioning

Note that if $X[:,i]$, $\forall i \in \{1,\ldots,j\}$, is used for clustering, the conservative interaction models in Eqs. 5.3, 4.11 and 4.12 reduce to spectral clustering techniques that use the Laplacian $D - A$ (Eq. 5.3) or the normalized Laplacians $I - AD^{-1}$ or $I - D^{-\frac{1}{2}}AD^{-\frac{1}{2}}$ (Eq. 4.11 or 4.12) to find $j$ communities [176]. [163] showed that naive spectral bisection methods do not necessarily work. However, for a conservative dynamic process with $\mathcal{L}(A) = D - A$ (Eq. 5.3 with $\omega = 0$), if the vertices are arranged such that $\theta_t[u_1] \le \theta_t[u_2] \le \cdots \le \theta_t[u_{|V|}]$ and the set $S_i$ comprises nodes $u_1, u_2, \ldots, u_i$, then at $t = t_2$: $e^{-k\lambda_i t_2} \to 0$, $\forall i > 2$, $\min_{S_i} \frac{E(S_i, \bar{S_i})}{\min(|S_i|, |\bar{S_i}|)}$ (the Fiedler cut) is $O(1/\sqrt{n})$ for bounded-degree planar graphs, and a bisector of $O(\sqrt{n})$ can be found by repeatedly finding Fiedler cuts. (This is one of the very few theoretical guarantees for spectral partitioning.)

Conductance

Finding a partition with a low conductance is closely related to the conservative interaction model with $\mathcal{L} = I - AD^{-1}$ (Eq. 4.11 with $\omega = 0$). Let $S \subseteq V$ be a set of vertices. Let $E(S, \bar{S})$ be the cut size, i.e., the number of edges going from $S$ to $\bar{S}$. The volume $vol(S)$ is the sum of the degrees of all vertices in $S$. Conductance is given by $\phi(G) = \min_S \frac{E(S, \bar{S})}{vol(S)}$. The classic Cheeger inequality states that $2\phi(G) \ge \lambda_2 \ge \frac{\phi(G)^2}{2}$. Therefore, if this conservative dynamic process starts at node $u$, i.e., $\theta_0[u] = 1$ for $u \in V$ and $\theta_0[v] = 0$ $\forall v \ne u \in V$, then $|\theta_t[v] - \theta_\infty[v]| \le e^{-t\frac{\phi(G)^2}{2}}\sqrt{\frac{d[u]}{d[v]}}$, where $d[u]$ is the degree of node $u$. In other words, if conductance is large, this dynamic process reaches equilibrium quickly. Let the nodes be arranged such that $\frac{\theta_t[u_1]}{d[u_1]} \ge \frac{\theta_t[u_2]}{d[u_2]} \ge \cdots \ge \frac{\theta_t[u_{|V|}]}{d[u_{|V|}]}$ at time $t$, and let the set $S_i$ comprise nodes $u_1, u_2, \ldots, u_i$.
In this setup, for a set $S$ with volume $vol(S) \le \frac{vol(G)}{4}$ and conductance below a constant, there is a subset $S' \subseteq S$ with volume $vol(S') \ge vol(S)/2$ such that, if the conservative dynamic process (Eq. 4.11) starts at $u \in S'$, then after a number of steps fixed by that constant, the best sweep cut $\min_{S_i} \frac{E(S_i, \bar{S_i})}{vol(S_i)}$ is bounded by a term that grows as $\sqrt{\log(vol(S))}$. This shows that by focusing on cuts determined by the linear ordering of vertices using $\theta_t$ of the conservative interaction model in Eq. 4.11, the conductance of the partition obtained is within a quadratic factor of the minimum conductance (which is one of the best approximation guarantees for local partitioning using conductance) [38].

Modularity Maximization

If $\mathcal{L}(A) = DD - A$, where $DD[i,j] = \frac{d[i]d[j]}{2m}$, $d[i]$ and $d[j]$ are the degrees of nodes $i$ and $j$, and $2m$ is the sum of the degrees of all nodes, and if $X[:,i]$, $\forall i$, is used for clustering, then the model reduces to the modularity maximization problem using the eigenvector approach [141].

5.2.2 Community Structure via Interaction Dynamics

In the section below we use interaction models to identify the community structure that emerges en route to the steady state in real-world networks. We find that conservative and non-conservative interaction models lead to a similar multi-scale organization of the network, but the composition of communities found at different scales is markedly different.

Similarity Measure

We assume that when nodes are similar, further interactions between them do not change their extrinsic property, which is given by the dynamic variable $\theta_i(t)$. Maximal similarity is reached at time $t_{eq}$, when the equilibrium or steady state is reached. In the conservative model in Eq. 5.3 with $\omega = 0$, the steady state corresponds to global synchronization, in which every node has the same phase. The steady state of the non-conservative model is given by the eigenvector of $R$ (or of the adjacency matrix $A$) corresponding to the largest eigenvalue of $A$ when $\omega = 0$.
For the sake of convention, we call this state the synchronized state, even if the values of all $\theta_i$ are not the same (but they do have fixed values, given by the first eigenvector). Once the system reaches synchronization, $\theta_i(t+1) = \theta_i(t)$ for all subsequent times. Arenas et al. used the cosine of the phase difference between nodes as the measure of similarity. However, such a measure leads to finite differences between nodes in the steady state of the non-conservative model. Instead, we measure similarity by the relative difference of the variables in the synchronized state. In other words, the similarity between nodes $i$ and $j$ at time $t$ is

$$sim(i,j,t) = \cos\left( \theta_i(t) - \frac{\theta_i^{eq}}{\theta_j^{eq}}\,\theta_j(t) \right)$$

where $\theta_i^{eq}$ is the value of the dynamic variable in the steady state. Therefore, for both the conservative and non-conservative interaction models, $sim(i,j,t) = 1$, $\forall i,j \in V$ at $t \ge t_{eq}$. In the conservative case, the similarity measure we propose reduces to the one used by Arenas et al., because in the conservative steady state $\theta_i^{eq} = \theta_j^{eq}$; therefore, $sim(i,j,t) = \cos(\theta_i(t) - \theta_j(t))$.

Hierarchical Community Detection

We simulate the interaction model by letting the network evolve from some initial configuration. At any time $t < t_{eq}$, we can find the structure of the evolving network by executing a clustering algorithm, e.g., the average link hierarchical agglomerative algorithm, with the similarity calculated as shown above.

The hierarchical structure of the network can be captured by a dendrogram. However, a complete dendrogram may be difficult to visualize, especially for large networks. Instead, we use a coarse-graining strategy that clusters nodes if their similarity is above some threshold. Algorithm 3 describes the clustering procedure, which takes a similarity threshold $\mu$ as input and, at time $t$, finds all communities in the network such that if $i \in C_i$, then $\max_{j \in C_i}(sim(i,j,t))$ is greater than or equal to $1 - \mu$.
By construction, in Algorithm 3, for every $i \in C_i$ there exists a $j \in C_i$ with $1 - \mu \le sim(i,j,t) \le \max_{j \in C_i}(sim(i,j,t))$. Therefore, in all communities output by this algorithm, for all nodes $i \in C_i$, the similarity $\max_{j \in C_i}(sim(i,j,t)) \ge (1 - \mu)$. This algorithm has linear runtime, $O(|E|)$, where $|E|$ is the number of edges. By changing $\mu$, we can change the number and size of clusters. As $\mu$ decreases (the threshold tightens), a cluster fragments into sub-clusters, and thus a hierarchical arrangement of the clusters can be found. The set of communities output by Algorithm 3 at time $t$ for a given $\mu$ is unique and independent of the order in which edges $e(i,j) \in E$ are considered.

Algorithm 3 Communities at time $t$ with threshold of similarity $\mu$
  Input
    $K$: number of simulations of the interaction model $I$
    $t$: time at which the hierarchy of the evolving communities is calculated
    $\theta_i(t)[k]$: $\theta_i(t)$ from the $k$-th simulation; $\theta_i(t) = (\theta_i(t)[1], \theta_i(t)[2], \ldots, \theta_i(t)[K])$
    $\mu$: similarity threshold
    $G(V,E)$: network with $|V|$ nodes and $|E|$ edges
    $e(i,j)$: edge between $i$ and $j$
  Output
    Communities $\{C_i\}$ such that $\forall i \in V$, $\max_{j \in C_i}(sim(i,j,t)) \ge (1 - \mu)$ in the interaction model $I$.
  Initialize
    $S = E$
    Assign each node $i$ to a separate community $C_i \in C$.
  repeat
    for each $e(i,j) \in E$ do
      $sim(i,j,t) = \frac{1}{K}\sum_{y=1}^{K} \cos\left( \theta_i(t)[y] - \frac{\theta_i^{eq}[y]}{\theta_j^{eq}[y]}\,\theta_j(t)[y] \right)$
      $S = S - \{e(i,j)\}$
      if $sim(i,j,t) \ge (1 - \mu)$ then
        Merge $C_i$ and $C_j$
      end if
    end for
  until $S = \emptyset$

Fast and scalable

The decentralized nature of the interaction models allows each node $i$ to compute $\theta_i$ locally by interacting with at most $d[i]$ of its neighbors, which helps us to parallelize the computation, making it fast and scalable. Due to the linear nature of the interaction models considered, Eq. 4.1 can easily be rewritten as $\sum_{i=1}^{|V|} \frac{d\theta^{(i)}}{dt} = \omega - k\mathcal{L}(A)\theta^{(i)} \,\big|\, \theta_0^{(i)}$, where $\theta_0^{(i)}[i] = \theta_0[i]$ and $\theta_0^{(i)}[j] = 0$ $\forall j \ne i$, $\theta_0$ being the initial starting vector in Eq. 5.6. Each of the $|V|$ terms of this model can be calculated independently, increasing parallelizability further.
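The clustering procedure of Algorithm 3 can be sketched in a few lines. This is an illustrative reading of the pseudocode rather than the author's implementation: the function names (`find`, `similarity`, `communities`), the array layout (one row per simulation, one column per node), and the use of a union-find structure for the merge step are all assumptions. It also assumes that every equilibrium value $\theta_j^{eq}$ is non-zero.

```python
import numpy as np

def find(parent, i):
    """Return the representative of node i's community (with path halving)."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i

def similarity(theta_t, theta_eq, i, j):
    """Average over the K runs of cos(theta_i - (theta_i^eq / theta_j^eq) * theta_j)."""
    ratio = theta_eq[:, i] / theta_eq[:, j]
    return float(np.mean(np.cos(theta_t[:, i] - ratio * theta_t[:, j])))

def communities(edges, theta_t, theta_eq, mu):
    """Merge the endpoints of every edge whose similarity is at least 1 - mu.

    theta_t and theta_eq have shape (K, n): K simulations, n nodes.
    """
    n = theta_t.shape[1]
    parent = list(range(n))          # each node starts in its own community
    for i, j in edges:               # a single pass over the edge list
        if similarity(theta_t, theta_eq, i, j) >= 1.0 - mu:
            ri, rj = find(parent, i), find(parent, j)
            if ri != rj:
                parent[rj] = ri
    groups = {}
    for v in range(n):
        groups.setdefault(find(parent, v), []).append(v)
    return sorted(groups.values())

# Hypothetical example: a 4-node path, one simulation, conservative equilibrium.
edges = [(0, 1), (1, 2), (2, 3)]
theta_t = np.array([[0.10, 0.11, 1.00, 1.01]])
theta_eq = np.ones((1, 4))
print(communities(edges, theta_t, theta_eq, mu=0.01))
```

Because communities are the connected components of the subgraph of edges that pass the threshold, the result does not depend on the order in which edges are visited, in line with the uniqueness property stated above. Loosening the threshold (larger `mu`) merges the two groups of the example into one.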
5.3 Empirical Study

We explore the differences between the dynamics of conservative and non-conservative synchronization and the structures that emerge in an artificial network and in real-world networks, including Digg and Facebook. We contrast the structure discovered by the linearized Kuramoto model, given by Eq. 5.3, with that discovered by the non-conservative interaction model, given by Eq. 5.5. In each simulation, the initial phases of nodes are drawn from a uniform random distribution on $[-\pi, \pi]$ and all $\omega$s are set to 0. In the non-conservative interaction model we took $\alpha = \lambda_{max}$, i.e., interaction models which reach a non-trivial equilibrium ($\theta_{t \to \infty}[i] \propto$ the eigenvector of the adjacency matrix $A$ corresponding to $\lambda_{max}$). Investigating the differences between interaction models that reach the trivial equilibrium $\theta_{t \to \infty}[i] = 0$ $\forall i$ and those that reach a non-trivial equilibrium is left for future work. We ran 100 simulations of each interaction model with different initial conditions and used these as input to the structure detection algorithms described in the previous section.

5.3.1 Synthetic Network

We consider a synthetic network with a fixed hierarchical community structure. While the synthetic network does not have the statistical properties of naturally evolved real-world networks, we study this case to demonstrate that imposing different dynamics on the same graph leads to measurable differences in the structures found by the two synchronization models. The synthetic network, constructed following the methodology of [12], has $N = 256$ nodes evenly divided between four communities, with each community further sub-divided into four equal-size sub-communities. Each node randomly connects to $z_1$ nodes within its sub-community, $z_1 + z_2$ nodes within its community, and $z_{out}$ nodes outside the community. For our experiments, we took $z_1 = 13$, $z_2 = 4$ and $z_{out} = 1$.
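A network of this kind can be generated along the following lines. The sketch makes an assumption that the text does not spell out: instead of giving every node exactly $z_1$, $z_2$ and $z_{out}$ links, it draws each possible edge independently with a probability chosen so that the expected numbers of links match those values. The function name `hierarchical_graph` and the contiguous node numbering are also illustrative choices.

```python
import numpy as np

def hierarchical_graph(n=256, n_comm=4, n_sub=4, z1=13, z2=4, z_out=1, seed=0):
    """Random two-level hierarchical graph whose expected degrees are z1, z2, z_out.

    Assumption: edges are drawn independently, so the degrees match the
    targets only on average.
    """
    rng = np.random.default_rng(seed)
    comm_size = n // n_comm               # 64 nodes per community
    sub_size = comm_size // n_sub         # 16 nodes per sub-community
    comm = np.arange(n) // comm_size
    sub = np.arange(n) // sub_size
    same_sub = sub[:, None] == sub[None, :]
    same_comm = comm[:, None] == comm[None, :]
    p = np.where(same_sub, z1 / (sub_size - 1),
                 np.where(same_comm, z2 / (comm_size - sub_size),
                          z_out / (n - comm_size)))
    upper = np.triu(rng.random((n, n)) < p, k=1)   # sample each pair once
    A = (upper | upper.T).astype(int)              # symmetric, zero diagonal
    return A, sub, comm

A, sub, comm = hierarchical_graph()
print("mean degree:", A.sum(axis=1).mean())
```

With the default values the expected degree is $z_1 + z_2 + z_{out} = 18$. A generator that fixes the degree of every node exactly would need a different sampling scheme.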
Figure 5.1(a) shows the Hinton diagram of the adjacency matrix of this network, with red entries indicating the presence and blue the absence of an edge. Dense red blocks correspond to sub-communities at the first level of the hierarchy, and sparse red blocks to second-level communities. The spectra of the $L$ and $R$ kernels are shown in Figure 5.1(b). Each spectrum contains the eigenvalues of the kernel, ranked in descending order, with the largest eigenvalue in the first position. While there are already differences in the spectra of the two kernels, these differences become more pronounced in real-world networks characterized by a heterogeneous degree distribution.

We simulate synchronization dynamics in the synthetic network by letting nodes' phases evolve from some initial configuration. Figures 5.1(c) and (d) show the similarity matrix of the network after $t = 1500$ iterations under the two models. The matrix represents the similarity, $s_{ij}$, of pairs of nodes, with red corresponding to higher similarity values and blue to lower. The minimum similarity between any two nodes in the non-conservative system is 0.998, compared to 0.958 for the conservative system. This demonstrates that in both the conservative and the non-conservative model, at time $t = 1500$ the system has reached a state close to equilibrium. The hierarchical community structure is visible in both similarity matrices.

To find the hierarchical community structure of the synthetic network, we execute a hierarchical agglomerative clustering algorithm on the similarity matrix at some time (here, after 1500 iterations). This procedure produces a dendrogram, which can be partitioned into four or 16 clusters. We use normalized mutual information $MI$ (Chapter 2, Equation 2.3) to measure how well these clusters reproduce the actual communities [44]. When $MI = 1$, the discovered clusters are the actual communities, while for $MI = 0$ they are independent of the actual communities.
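For concreteness, a score with these two limiting values can be computed as follows. Equation 2.3 is not reproduced in this section, so the normalization used here, $2I(X;Y)/(H(X)+H(Y))$, is an assumption (it is one common choice), and the function names are illustrative.

```python
import numpy as np
from collections import Counter

def entropy(labels):
    """Shannon entropy (natural log) of a labelling."""
    n = len(labels)
    return -sum((c / n) * np.log(c / n) for c in Counter(labels).values())

def normalized_mutual_information(a, b):
    """2 * I(a; b) / (H(a) + H(b)); assumed normalization, see the lead-in."""
    n = len(a)
    count_a, count_b = Counter(a), Counter(b)
    joint = Counter(zip(a, b))
    mi = sum((c / n) * np.log(c * n / (count_a[x] * count_b[y]))
             for (x, y), c in joint.items())
    denom = entropy(a) + entropy(b)
    return 1.0 if denom == 0 else 2.0 * mi / denom

truth = [0, 0, 1, 1]
print(normalized_mutual_information(truth, [1, 1, 0, 0]))  # same partition, relabelled
print(normalized_mutual_information(truth, [0, 1, 0, 1]))  # statistically independent
```

The score depends only on how nodes are grouped, not on the labels given to the groups, which is why the relabelled partition in the first call still counts as a perfect match.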
When we split each dendrogram into four clusters, we find $MI = 1.00$ for the conservative model and $MI = 0.83$ for the non-conservative model, while splitting it into 16 clusters gives $MI = 0.66$ (conservative) and $MI = 0.96$ (non-conservative). The non-conservative model appears to identify smaller structures faster and more accurately than the conservative model.

Figure 5.1: Analysis of the synthetic graph. (a) Hinton diagram of the adjacency matrix. A point is red if an edge exists between nodes at that location; otherwise it is blue. (b) Eigenvalue spectrum of the two kernels. Similarity matrix at $t = 1500$ under the (c) conservative and (d) non-conservative synchronization models. Color indicates how similar two nodes are, with red corresponding to higher and blue to lower similarity.

5.3.2 Karate Club

Figure 5.2: Analysis of the karate club network. (a) Friendship graph. (b) Comparison of eigenvalues of the Laplacian ($L$) and Replicator ($R$) kernels. Synchronization matrix at time $t = 1000$ due to (c) the conservative interaction model and (d) the non-conservative interaction model. The color of each square indicates how similar two nodes are (zoom in to see node labels), with red corresponding to more similar nodes and blue to less similar nodes.

Next, we study the real-world friendship network of Zachary's karate club [193], described in Chapter 2, Section 2.3.1 and shown in Fig. 5.2(a), a widely studied social network benchmark. The circles and squares represent the actual factions in the network and are taken as ground truth communities for this data set. There are greater differences between the spectra of $L$ and $R$, shown in Fig. 5.2(b), than for the synthetic graph, which has a more homogeneous degree distribution. The smallest positive eigenvalue of $R$ is larger than that of $L$, implying that the non-conservative model reaches steady state faster than the conservative model.
We observe this empirically in Figures 5.2(c) and (d), which show the similarity matrices of the network at $t = 1000$ under the two interaction models. The minimum similarity in the conservative (non-conservative) model is 0.65 (0.91). Clearly, nodes are more synchronized in the non-conservative model.

We used the average link hierarchical clustering algorithm to hierarchically cluster the network at different times, using the synchronization metric as the measure of similarity. Both models reveal rich hierarchical structure within the dendrogram, though the two dendrograms are very different. In the conservative model, high degree nodes (hubs) are deeper within the hierarchy, meaning they are more synchronized: 33 and 34 in one community, and 1, 3 and 2 in the other community. Peripheral nodes (such as 13, 18, 22) synchronize later, although nodes 15 and 10 never synchronize with their actual community and are mis-assigned. In the non-conservative model, peripheral nodes synchronize first, while the hubs synchronize later. Bridging nodes connected to both communities synchronize earlier in the conservative model than in the non-conservative model and remain more synchronized.

Figure 5.3: Evolution of the discovered community structure of the karate club network, as measured by normalized mutual information, in the conservative and non-conservative interaction models.

Figure 5.3 reports $MI$ scores of communities discovered by the two synchronization models at different times. The non-conservative model identifies communities faster than the conservative model, and the discovered communities are purer. Under both models, community membership of nodes does not change after 3899 iterations. However, as time increases (beyond 3899 iterations), both the conservative and the non-conservative models move closer to their equilibrium states. This leads to an increase in the similarity of the nodes with time. When the models reach their equilibrium state, each node is maximally similar to every other node in the network. Thus similarity continues to increase beyond 3899 iterations until every node is equally similar to every other node.

5.3.3 Digg Mutual Follower Network

We study the community structure of the mutual friendship network of the Digg dataset described in Chapter 2, Section 2.3.2. There are 4,811 disconnected components in this network, with the largest component comprising 70% of the nodes (27K nodes) and 96% of the edges (352K edges). The second largest component has 22 nodes. Since the inherent richness of structure of this network is largely captured by the giant component, we study this component in detail.

Figure 5.4: (a) Top 6000 eigenvalues of the Replicator and Laplacian kernels of the Digg friendship network. (b) The long-tailed distribution (using logarithmic binning) of the components comprising the core for different similarity thresholds for the non-conservative model.

Using the Jacobi-Davidson algorithm for calculating eigenvalues of a graph, we compute more than 6K of the smallest eigenvalues of the Replicator and Laplacian kernels and rank them in descending order (Fig. 5.4(a)). The two spectra are dramatically different. The smallest positive eigenvalue of $L$ is much smaller than that of $R$. This indicates that the non-conservative interaction model reaches the steady state much faster than the conservative model.

Figure 5.5: Number of nodes comprising the core at different levels of hierarchy (resolution scales) found by the interaction models in the (a) Digg and (b) Facebook networks. The resolution scales correspond to similarity thresholds that give cores of comparable size. The green line shows the number of nodes that the cores have in common at that resolution scale.

5.3.3.1 Multi-scale Structure of Digg

We use Algorithm 3 to cluster nodes at different resolutions specified by the similarity threshold.
While the overall structure changes over time, we find an intricate multi-scale organization of the network in both interaction models. At every resolution, we find a 'core and whiskers' organization [122], with one giant community (core) and many small communities (whiskers). The core itself has a well-defined structure: as we tighten the similarity threshold, the core fragments into another large core and many small communities with a long-tailed size distribution. This process continues until the core fragments into some number of small communities.

The community structure of Digg, therefore, resembles an onion, with multiple layers of whiskers. This paradigm is captured in Figure 5.5(a), which shows core sizes at different resolution scales at time $t = 100$. At later times, at any given resolution the core grows until $t = t_{eq}$, when it forms a giant component for every resolution scale. However, the composition of the core remains almost time invariant, i.e., the core at a coarser resolution at time $t_1$ is very similar to a core at some finer resolution at a later time $t_2$. We chose the threshold parameters that give comparable size cores at each resolution scale for the interaction models.

Using the non-conservative interaction model, all thresholds above $\mu = 0.0004$ produce a single component with about 27K nodes. At a finer resolution (smaller $\mu$), the number of communities increases. As illustrated in Figure 5.5(a), at $\mu = 0.00018$, 76% of these nodes form a giant component, or the core. In addition, there are several small communities, whose sizes have a long-tailed distribution (Fig. 5.4(b)). At $\mu = 0.00016$, the core again divides into one large community, with 72% of the nodes, and many small communities, whose sizes also have a long-tailed distribution, as shown in Figure 5.4(b). Increasing the resolution further to $\mu = 0.00014$, we discover that the core found at $\mu = 0.00016$ breaks down once more into one giant component comprising 62% of the nodes, and so on.
A similar organization is discovered using the conservative model, though at larger similarity thresholds $\mu$.

While the onion-like organization discovered by both interaction models is similar, its composition is different. Figure 5.5(a) shows the overlap of the membership of comparable-size cores found by the two models. For example, the size of the giant component discovered by the non-conservative interaction model for $\mu = 0.00018$ is comparable to the size of the core discovered by the conservative interaction model for $\mu = 0.2$; however, they share only about 80% of the nodes. Core overlap decreases to about 40% at $\mu = 0.00014$ for the non-conservative interaction model ($\mu = 0.008$ for the conservative model), and keeps decreasing as we fine-tune the resolution scale. Finally, the largest components at $\mu = 0.00008$ for the non-conservative and $\mu = 0.0001$ for the conservative model (resolution scale 1) do not have any nodes in common.

5.3.3.2 Empirical Evaluation

While the two interaction models discover different structures in the Digg network, in the absence of ground truth communities for this network, it is challenging to say which model is correct. However, user activity provides an independent source of evidence for evaluating the quality of communities. We use this evidence to gain more insight into the structure of the Digg network, and show that the non-conservative model is better suited for studying it.

Figure 5.6: Evaluation of communities found in the Digg mutual follower graph at $t = 100$ by the two interaction models. (a) Number of small communities found at different resolutions specified by the similarity threshold parameter. The smallest resolution corresponds to the smallest value of the similarity threshold. (b) Average quality of communities at each scale, as measured by the number of co-votes.

We propose an empirical measure of community quality based on user activity.
Members of the same community are likely to share the same information, interests, and attributes [73]. As a consequence, they are likely to behave in a similar manner, which on Digg translates into voting for the same news stories. We measure the similarity of two Digg users by the number of stories for which they both voted, i.e., co-votes. Then, averaging over the co-votes of all pairs of community members, we obtain a number that quantifies the quality of the community.

We focus on small components (whiskers) of at least size three isolated from the core at different resolutions. The non-conservative interaction model assigned 3,712 users to such small communities. In contrast, the conservative interaction model assigned just 449 users to small communities. The rest of the users fragmented into isolated pairs or singletons.

Figure 5.6(a) shows the number of small communities resolved by the two interaction models at different scales. Figure 5.6(b) reports the average community quality at each resolution scale, as measured by the number of co-votes between pairs of community members. Community quality increases at finer resolution scales, producing tighter communities in the center of the 'onion', as expected. Members of the innermost communities (resolution scale 1) are much more similar than members of the outer communities (resolution scales 5, 6). Except for these innermost communities, the average quality of communities found by the non-conservative model is better than that found by the conservative model.

The difference at resolution scale 1 is driven by two outliers in the conservative model. The first of these is a community of 26 users, with more than 300 co-votes on average, and the other is a community of nine with more than 600 co-votes. In addition to co-voting on an extraordinary number of stories (600 is nearly 20% of all stories in our data set), these users are also highly interlinked.
The first group forms a 13-core (a cluster in which each node is linked to at least 13 other nodes), and the second group forms a 4-core. These users also share many friends. While we cannot say whether these groups represent the often-rumored voting blocs on Digg, their activity does appear to be anomalous. One way such activity could arise is if each member of the group navigated to the profiles of other group members and voted for the stories that appeared on that profile, e.g., the stories that member submitted or voted for. Such browsing can be represented by one-to-one interactions; therefore, the conservative model is best at finding it. Non-conservative models describe information diffusion through broadcasts of recent votes to followers, and find communities arising from this information sharing behavior.

To summarize, the non-conservative model finds many more small communities of higher quality than the conservative model, though the latter seems to pick out some anomalous groups of users.

Figure 5.7: Distribution of communities in the Facebook network for American University at $t = 100$. (a) Comparison of the size distribution of small communities found by the two interaction models at the coarsest and finest resolution scales. (b) Number of small communities at different resolutions. The smallest resolution corresponds to the highest similarity between individuals.

5.3.4 Facebook

We also performed our analysis on a data set comprising the Facebook networks described in Chapter 2, Section 2.3.4. We use the four features associated with each user in the dataset (status (e.g., student, faculty, staff, and so on), major, dorm or house, and graduation year) to empirically evaluate the quality of the discovered communities, in the sense that a good community should consist of individuals who are similar according to these features.
While this data set contains more than 100 colleges and universities, we present here the analysis of the network for American University, which comprises 6,386 nodes and more than 200K edges.

5.3.4.1 Multi-scale Structure of Facebook

We use Algorithm 3 to cluster nodes at different resolution scales specified by the similarity threshold. As with Digg, we find an onion-like, multi-scale organization in the structures discovered by conservative and non-conservative interaction models for the Facebook networks, underpinning the generality of the observed structure. At each resolution, we discover a giant community (core) and many small communities (whiskers) with a long-tailed size distribution (Fig. 5.7(a)). Just as on Digg, there is little overlap in membership between cores found by the two interaction models at finer resolutions (Fig. 5.5(b)).

As on Digg, many nodes participate in small, clique-like communities. However, while 1,320 nodes contribute to the formation of such communities in the non-conservative interaction model, only 32 nodes participate in such communities in the conservative interaction model. The remaining users are fragmented into isolated pairs or singletons. As in the Digg data set, the non-conservative model found many more communities than the conservative model. Figure 5.7(b) shows the number of communities discovered at each resolution scale for the conservative and non-conservative models.

5.3.4.2 Empirical Evaluation

We measure the quality of the communities discovered at different resolution scales using the four features enumerated above, namely major, dorm, year and category of individual. We measure the prevalence of the most popular value of some feature among community members. If the community is pure, the quality will be high. For example, the quality of a community with respect to the dorm feature gives the largest fraction of community members that belong to the same dorm.
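As a minimal sketch of this measure, the quality of one community with respect to one feature can be computed as the share of the most frequent feature value. The function name `community_quality`, the dictionary-based data layout, and the decision to ignore members whose feature value is missing are assumptions made for illustration; the text does not say how missing values were handled.

```python
from collections import Counter

def community_quality(members, feature):
    """Fraction of community members that share the most common feature value.

    `feature` maps a member to its value (e.g. a dorm name). Members with a
    missing value are skipped, which is an assumption of this sketch.
    """
    values = [feature[m] for m in members if feature.get(m) is not None]
    if not values:
        return 0.0
    most_common_count = Counter(values).most_common(1)[0][1]
    return most_common_count / len(values)

# Hypothetical community of four users, three of whom live in the same dorm.
dorm = {"a": "North", "b": "North", "c": "South", "d": "North"}
print(community_quality(["a", "b", "c", "d"], dorm))
```

A community whose members all share one value scores 1, while a community spread evenly over many values scores close to the reciprocal of the number of values.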
Figure 5.8 reports the quality of communities found by the two models at different resolution scales with respect to those features. We find that, overall, the quality increases as we tighten the similarity threshold (decrease the resolution scale), irrespective of the feature under consideration. However, the characteristics of the community structure discovered by the conservative and non-conservative interaction models vary significantly. At finer resolution scales, the non-conservative model finds communities of individuals who are more likely to have the same major and belong to the same dorm. Conservative models, on the other hand, are more likely to put into the same community individuals who belong to the student category and are in the same year. Though the type of interactions may differ from college to college, it is reasonable to assume that students who belong to the same year will have more face-to-face (conservative) interactions, while students who have the same major or live in the same dorm may meet in study groups or at organized events, increasing the chances for one-to-many (non-conservative) interaction.

Figure 5.8: Evaluation of communities found in the Facebook network of American University at $t = 100$ by the two interaction models. Average quality of communities at each resolution scale, as measured by the probability of occurrence of the most frequent value of the features major, dorm, year, and category of individual (one panel per feature).

In summary, regardless of the interaction process, we observe a roughly scale-invariant organization in the real-world social networks. At almost every resolution scale, we find a large component and many small components with a long-tailed size distribution. Thus, the structure of Digg and Facebook resembles an onion. Peeling each layer reveals another, almost self-similar structure with a core and many smaller communities.
However, the composition of communities depends on the interaction process, and is different for the conservative and non-conservative interaction models.

5.4 Summary

This chapter highlights the importance of dynamic interactions in the analysis of community structure and provides a framework for unifying some of the existing community detection methods. Our view of a network's community structure depends not only on how its nodes are connected but also on how they interact. We have explored this issue using models of synchronization in a network of coupled oscillators. We also presented a new formulation of similarity, which we used in multi-scale analysis of network structure, and an activity-based metric to measure the quality of communities in a real-world network. Our decentralized approach to community detection is fast and scalable.

Our study of the community structure of real-world social networks revealed a complex 'onion'-like organization. Peeling each level of the hierarchy gives a core and many small components, regardless of the interaction model. However, different interactions lead to distinct kernels that govern the dynamics of synchronization, each with its own spectral properties and characteristic topological and temporal scales. In practical terms this suggests that to identify communities in real-world networks, algorithms have to take into account the nature of the dynamic processes taking place on them.

In future, we would like to investigate the effect of non-zero $\omega$ on the interaction models. We would also like to investigate the ergodicity of interaction models and the spectral properties of the different kernels. Our work offers a framework for understanding the role of dynamic processes in the measurement of community structure.
Chapter 6 Interactions and Centrality

In this chapter we argue that, just as in detecting community structure (Chapter 5), interactions between nodes play an important role in predicting centrality, i.e., who the influential people in a network are.

6.1 Classification of Centrality Metrics

One of the more popular metrics, betweenness centrality [54], measures the fraction of all shortest paths in a network that pass through a given node. Other centrality metrics include those based on random walks [166, 146, 143, 135] and path-based metrics. The simplest path-based metric, degree centrality, measures the number of edges that connect a node to others in a network. According to this measure, the most important nodes are those that have the most connections.

6.1.1 Conservative Interactions and PageRank

We focus our attention on the PageRank [146] algorithm, which is widely used in network analysis to measure the importance, or prestige, of nodes. The computer science community has developed a variety of fast algorithms to efficiently compute PageRank and its variants for large graphs. These methods compute the relative importance of nodes in a graph when the underlying interaction process is a random walk on the graph [169, 53]. PageRank, for example, gives the probability that a random walk initiated at node $i$ will reach $j$, while random-walk centrality computes the number of times a node $i$ will be visited by walks from all pairs of nodes in the network.

As stated in Chapter 4, the transfer matrix gives the transition probabilities of a random walk on the network. Let the transfer matrix be denoted by $W = D^{-1}A$, where $D$ is the degree matrix and $A$ is the adjacency matrix (see footnote 1). A PageRank vector $pr_\alpha(s,t)$ is the steady state probability distribution of a random walk with damping factor $\alpha$ (restart probability $= 1 - \alpha$). The damping factor is introduced to ensure that the walk always has a stationary distribution.
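The random walk with restart described above can be sketched as a simple fixed-point iteration. The details below are assumptions made for illustration rather than the thesis's own procedure: the function name `pagerank`, the row-vector convention, the restart distribution `s` (the distribution from which the walk resumes after a restart), the tolerance, and the requirement that every node has at least one edge so that $D^{-1}A$ is defined.

```python
import numpy as np

def pagerank(A, s, alpha=0.85, tol=1e-12, max_iter=1000):
    """Steady state of a walk that follows W = D^{-1} A with probability alpha
    and restarts from the distribution s with probability 1 - alpha.

    Assumes every node has at least one edge, so that D is invertible.
    """
    W = A / A.sum(axis=1, keepdims=True)      # row-stochastic transfer matrix
    pr = np.asarray(s, dtype=float).copy()
    for _ in range(max_iter):
        nxt = (1.0 - alpha) * s + alpha * (pr @ W)
        if np.abs(nxt - pr).sum() < tol:      # stop once the update is negligible
            return nxt
        pr = nxt
    return pr

# Hypothetical example: a 3-node path 0 - 1 - 2 with a uniform restart vector.
A = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
s = np.full(3, 1.0 / 3.0)
pr = pagerank(A, s)
print(pr)
```

Each update moves a fixed total amount of probability between nodes, so the scores always sum to one; this is the sense in which the underlying process is conservative. Replacing the uniform `s` with a vector concentrated on one node gives a personalized variant.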
The starting vector, $s$, gives the probability distribution for where the walk transitions after restarting. Formally, PageRank is the unique solution of the system represented by Equation 6.1 at time $t \to \infty$, i.e., PageRank is $pr_\alpha(s, \infty)$, where $pr_\alpha(s, t)$ is:

$$pr_\alpha(s, t) = (1 - \alpha)\, s + \alpha\, pr_\alpha(s, t - 1)\, W \tag{6.1}$$

For ease of notation, we shall denote PageRank by $pr_\alpha(s)$. Hence

$$pr_\alpha(s) = (1 - \alpha)\, s + \alpha\, pr_\alpha(s)\, W \tag{6.2}$$

¹ For definitions of the degree and adjacency matrices, see Section 2.2.

The PageRank vector with a uniform starting vector $s$ gives the global PageRank of each vertex. PageRank with a non-uniform starting vector $s$ is known as personalized PageRank [89].

Equation 6.2 is identical to Equation 4.6, $F_c(t \to \infty) = (1 - \alpha) F_c(0) + \alpha F_c(t \to \infty) W_c$, in the special case where the weight of the identity term in the transfer matrix $W_c$ is zero, so that $W_c = D^{-1}A$. Equation 4.6 defines the distribution of a linear conservative interaction process with static transfer matrix at time step $t \to \infty$, if $s = F_c(0)$. Hence, given the initial value of the observed feature, PageRank is the final distribution (at $t \to \infty$) of the observed feature when a linear conservative interaction process with static transfer matrix takes place. Therefore, PageRank is a conservative centrality model. Similarly, other metrics derived from the random walk make an implicit assumption that a conservative interaction process is occurring on the network.

6.1.2 Non-conservative Interactions and Alpha-Centrality

Much of the analysis done by social scientists has considered local structure, i.e., the number [180] and nature [161, 72, 29] of an individual’s ties. By focusing on local structure, however, traditional theories fail to take into account the macroscopic structure of the network. Many metrics proposed and studied over the years deal with this shortcoming, including PageRank [146] and random walk centrality [135]. These metrics aim to identify nodes that are ‘close’ in some sense to other nodes in the network and are, therefore, more important.
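As an illustration of the PageRank recursion in Equation 6.1, the following Python sketch iterates $pr_\alpha(s, t)$ on a small graph. The three-node adjacency matrix, the function name `pagerank`, and the value $\alpha = 0.85$ are assumptions chosen for illustration and are not taken from the data sets analyzed in this thesis. The sketch also assumes that every node has at least one out-link, so that $D^{-1}$ exists.

```python
def pagerank(adj, s, alpha=0.85, iters=200):
    """Iterate Eq. 6.1: pr(t) = (1 - alpha) * s + alpha * pr(t - 1) * W, W = D^{-1} A."""
    n = len(adj)
    out_deg = [sum(row) for row in adj]
    # Assumption of this sketch: no dangling nodes, so D is invertible.
    if any(d == 0 for d in out_deg):
        raise ValueError("every node needs at least one out-link")
    pr = list(s)
    for _ in range(iters):
        nxt = [(1 - alpha) * s[j] for j in range(n)]
        for i in range(n):
            for j in range(n):
                # Node i passes the fraction A[i][j] / d_out(i) of its weight to j.
                nxt[j] += alpha * pr[i] * adj[i][j] / out_deg[i]
        pr = nxt
    return pr


# Hypothetical toy graph: 0 -> 1, 0 -> 2, 1 -> 0, 2 -> 1.
toy_adj = [[0, 1, 1],
           [1, 0, 0],
           [0, 1, 0]]
uniform_s = [1 / 3] * 3
toy_pr = pagerank(toy_adj, uniform_s)
```

Because each row of $W$ sums to one and $s$ sums to one, the entries of `toy_pr` also sum to one after every step, which is the sense in which the underlying process is conservative.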
One such metric, which captures the intuition that a node’s centrality depends not only on how many others it is connected to but also on the centralities of those nodes, is Alpha-Centrality [20]. It measures the total number of paths from a node, exponentially attenuated by their length.

For starting vector $s$ and attenuation parameter $\alpha$, the Alpha-Centrality vector $cr_\alpha(s)$ is the solution to the following equation at $t \to \infty$:

$$cr_\alpha(s, t) = s + \alpha\, cr_\alpha(s, t - 1)\, A = s \sum_{k=0}^{t} \alpha^k A^k. \tag{6.3}$$

The starting vector $s$ is usually taken as in-degree centrality [20]. Solving this equation for $t \to \infty$ and $|\alpha| < \frac{1}{|\lambda_1|}$ (where $|\lambda_1|$ is the spectral radius of the network), we obtain $cr_\alpha(s, \infty) = s(I - \alpha A)^{-1}$, where $I$ is the identity matrix of size $|V|$. For ease of notation, we shall denote Alpha-Centrality by $cr_\alpha(s)$. Hence, an equivalent formulation is:

$$cr_\alpha(s) = s + \alpha\, cr_\alpha(s)\, A = s(I - \alpha A)^{-1} = s \sum_{k=0}^{\infty} \alpha^k A^k. \tag{6.4}$$

The parameter $\alpha$ controls how much weight we give to longer paths. Setting $\alpha$ close to zero enables us to probe only the local structure of the network. As $\alpha$ increases, longer paths become important, and it becomes a more global measure.

Normalized Alpha-Centrality: One difficulty in applying Alpha-Centrality in network analysis is that its key parameter $\alpha$ is bounded by the reciprocal of $|\lambda_1|$, the spectral radius of the network. As a result, the metric diverges for larger values of this parameter. To overcome this, we have recently introduced normalized Alpha-Centrality [63], denoted by $ncr_\alpha(s)$, which normalizes the score of each node by the sum of the centrality scores of all the nodes. Formally,² we show that the new metric avoids the problem of bounded parameters while retaining the desirable characteristics of Alpha-Centrality, namely its ability to differentiate between local and global structures.

² Although the sum of normalized Alpha-Centrality scores of all nodes is constant, we do not call it a conservative metric, since it does not define a conservative dynamic process. We distinguish between a conservative metric that maps to a conservative dynamic process, and a trivially conservative metric, achieved by numeric normalization, that does not map to a dynamic process. Normalized Alpha-Centrality belongs to the latter class of metrics.

We define normalized Alpha-Centrality using the system of equations shown below:

$$ncr_\alpha(s, t) = \frac{1}{\| ncr_\alpha(s, t) \|_1}\; s \sum_{k=0}^{t} \alpha^k A^k \tag{6.5}$$

We show in Appendix B that the new metric is well defined for $\alpha \ge 0$ ($\alpha \ne \frac{1}{|\lambda|}$), where $\lambda$ is an eigenvalue of the network. Normalized Alpha-Centrality is the solution to this system for $t \to \infty$. For ease of notation, we shall denote normalized Alpha-Centrality by $ncr_\alpha(s)$, i.e., $ncr_\alpha(s) = ncr_\alpha(s, t \to \infty)$. Hence, an equivalent formulation is:

$$ncr_\alpha(s) = \frac{1}{\| ncr_\alpha(s) \|_1}\; s \sum_{k=0}^{\infty} \alpha^k A^k. \tag{6.6}$$

The rankings produced by normalized Alpha-Centrality are identical to the rankings produced by Alpha-Centrality for $|\alpha| < \frac{1}{|\lambda_1|}$. For $\alpha \in \left(\frac{1}{|\lambda_1|}, 1\right]$ the value of normalized Alpha-Centrality remains finite and independent of $\alpha$. For values of $\alpha > \frac{1}{|\lambda_1|}$, $ncr_\alpha(s) = ncr = ncr_{\lim \alpha \to \frac{1}{|\lambda_1|}}(s) = cr_{\lim \alpha \to \frac{1}{|\lambda_1|}}(s)$. The derivation of these results is presented in Appendix B.

Just like the original Alpha-Centrality, normalized Alpha-Centrality contains a tunable parameter $\alpha$ that sets the length scale of interactions. The presence of a tunable parameter turns normalized Alpha-Centrality into a powerful tool for studying network structure and allows us to seamlessly connect the rankings produced by well-known local and global centrality metrics. For $\alpha = 0$, normalized Alpha-Centrality takes into account local interactions that are mediated by direct edges only and, therefore, reduces to degree centrality. As $\alpha$ increases and longer-range interactions become more important, nodes that are connected by longer paths grow in importance. For
For 96 symmetric matrices, forjj> 1 j 1 j , normalized alpha centrality reduces to eigenvector centrality (Appendix B) . We provide an in-depth study of Alpha-Centrality and Normalized Alpha-centrality in the next Chapter (Chapter 9). Equation 6.4 and 6.6 are equivalent to Equation 4.16 (F n (t ! 1) =F n (0) P t!1 k=0 (I +A) k ) with = 0 and starting vectorF n (0) = cs (c = 1 for Alpha- Centrality andc = 1 P i;j P t k=0 k A k [i;j] for normalized Alpha-Centrality). Equation 4.16 defines the distribution of a linear non-conservative interaction process with static replication matrix at time step t ! 1. Therefore, given the initial value of the observed features on the nodes, Alpha-Centrality is the final distribution of the ob- served feature when a linear non-conservative interaction process with static repli- cation matrix goes on for a long time ( t ! 1). Therefore Alpha-Centrality is a non-conservative centrality model. Alpha-Centrality and centrality models similar to it, make an implicit assumption of non-conservative interaction process occurring on a network. Katz score [92], SenderRank [95], and eigenvector centrality [21] are other exam- ples of non-conservative centrality metrics. 6.2 Ranking Nodes by Centrality- An Illustration The ranking of the nodes predicted by a centrality model depends on the interaction process it implicitly emulates. The fundamental differences between network inter- action processes impacts the ranking of the nodes [23]. Therefore, centrality metric should be chosen carefully to account for the dynamics. We illustrate these differences in ranking given by different centrality metrics, on a toy shown in network Fig. 6.1, 97 where a link from nodeu to nodev indicates that nodeu is watching nodev, orv is an out-neighbor ofu. Figure 6.1: An example network, where node 1 has the highest Alpha-Centrality fol- lowed by node 3. In contrast node 3 has the highest PageRank followed by node 1. 
Even in this simple example, PageRank and Alpha-Centrality disagree about who the important nodes in the network are. PageRank without restarts ranks node 3 highest, followed by node 1. In contrast, Alpha-Centrality ranks node 1 above node 3. The difference in the rankings produced by the two centrality metrics is due to the difference in the underlying interaction process that redistributes the weights of the nodes. Assume that all nodes start with equal weights, which then evolve according to the rules of interaction. In PageRank without restarts, each follower divides his weight equally among the $d_{out}$ friends he is watching, and hence transfers the fraction $1/d_{out}$ to each. Therefore, in PageRank without restarts, node 5 will contribute 1/3 of its weight to node 1, and so will node 8. Node 3, on the other hand, will get the entire weight of node 4, giving it a higher weight than node 1 already in the first iteration and, therefore, more influence. In contrast to PageRank, Alpha-Centrality has nodes update their weights by copying a portion of their followers’ weights. Thus, each follower contributes a portion of his weight to all the friends he is watching. For example, the weight of node 1 will include contributions from nodes 2, 5 and 8, while the weight of node 3 only includes contributions from nodes 2 and 4. The weight of node 1 will be greater than that of node 3, and consequently, it will be ranked higher by Alpha-Centrality.

6.3 Application to Online Social Network Analysis

Online social networks on sites such as Facebook, Twitter, and Digg have become important hubs of social activity and conduits of information. While a variety of methods [32, 102] have been used to identify influential users in online social networks, each metric leads to different results, and no justification for these metrics has been proposed. Fortunately, by exposing the activity of their users, online social networks provide a unique opportunity to study dynamic processes.
We analyze information flow on Digg and provide some preliminary investigations into Twitter. The description of the data sets used in our analysis is provided in Section 2.3. We define two empirical measures of influence based on how users react to the information their friends create. We then investigate how well different centrality metrics reproduce the empirically measured influence. We claim that since the spread of news or information is a non-conservative interaction process, Alpha-Centrality will better reproduce empirically measured influence than a conservative metric (such as PageRank) on Digg. We note that there is a wide range of heterogeneous activities going on in Twitter. We adopt the novel information-theoretic automatic classification technique we developed (Chapter 3) to detect stories in Twitter associated with information spreading-like activity and use these stories for further empirical analysis. This data set comprises 3,798 distinct URLs of news-like content retweeted by 542K distinct Twitter users.

6.3.1 Empirical Estimates of Influence

Katz and Lazarsfeld [91] defined influentials as “individuals who were likely to influence other persons in their immediate environment.” In the years that followed, many attempts were made to identify people who influenced others to adopt a new practice or product by looking at how innovations or word-of-mouth recommendations spread [28]. The rise of online social networks has allowed researchers to trace the flow of information through social links on a massive scale. Using the new empirical foundation, some researchers proposed to measure a person’s influence by the size of the cascade he or she triggers [93]. However, as Watts and Dodds [182] note, “the ability of any individual to trigger a cascade depends much more on the global structure of the influence network than on his or her personal degree of influence.” Alternatively, Trusov et al.
[173] defined influential people in an online social network as those whose activity stimulates those connected to them to increase their activity. Cha et al. [32], on the other hand, used the number of retweets and mentions to measure user influence on Twitter.

Motivated by these works, we measure influence by analyzing users’ activity on an online social network. Suppose some user, the submitter, posts a new story or URL on Digg or Twitter. By posting the story, the submitter broadcasts it to her fans. When another user votes for this story (on Digg) or retweets it (on Twitter), she broadcasts it to her own fans. We measure the activity a submitter’s post generates by the number of times it is re-broadcast by fans or followers, or by the information cascade the story generates. Our quantification and empirical measurement of influence is somewhat similar to that of Bakshy et al. [16], who quantify the influence of a given post by the size of the information cascade it generates. However, we use the posts to evaluate the influence of a user. We postulate that the cascade size depends not only on the influence of the submitter but also on the quality of the post or story. We assume that a story’s quality is uncorrelated with the submitter.³ Therefore, we can average out its contribution to the activity a submitter generates by aggregating over all stories submitted by the same user. We claim that the residual difference between submitters can be attributed to variations in influence. We propose two metrics to measure a submitter’s influence: (i) the average number of fan votes her posts generate and (ii) the average size of the cascades her posts trigger.

Figure 6.2: (a) The scatter plot shows the average number of fan votes received by a story within the first 100 votes vs. the submitter’s in-degree (number of fans) on Digg. Each point represents a distinct submitter. (b) Probability of the expected number of fan votes being generated purely by chance.
Estimate 1: Average number of fan votes

To reduce the effect of the front page on Digg voting, we count the number of votes from the submitter’s fans within the first 100 votes. Since few stories are promoted to the front page before they receive 100 votes, this ensures that we consider mainly the network effects [107]. Of the 3553 stories in the Digg data set, 3489 were submitted by 572 connected users. Of these, 289 distinct users submitted more than two stories which received at least one fan vote within the first 100 votes. Figure 6.2(a) shows the average number of fan votes $\langle k \rangle$ within the first 100 votes received by stories submitted by these users vs. the number of fans $K$ these users have.

³ This is a fairly strong assumption, but it appears to hold at least for Digg [111].

Are these observations significant? Let us assume that there are $N$ users who vote for stories independently of who submits them. This type of stochastic voting can be described by the urn model, in which $n = 100$ balls are drawn without replacement from an urn containing $N$ balls, of which only $K$ balls are white. The probability that $k$ of the first $n$ votes come from the submitter’s fans purely by chance is equivalent to the probability that $k$ of the $n$ balls drawn from the urn are white. This probability is given by the hypergeometric distribution:

$$P(X = k \mid K, N, n) = \frac{\binom{K}{k} \binom{N - K}{n - k}}{\binom{N}{n}} \tag{6.7}$$

Using Eq. 6.7, we compute the probability $P(X = \langle k \rangle \mid K, N, n)$ ($N = 69524$, $n = 100$) that stories submitted by a Digg user with $K$ fans received $\langle k \rangle$ fan votes purely by chance. As shown in Figure 6.2(b), for $K > 100$ this probability is small ($P < 0.03$); therefore, it is unlikely to observe such numbers of fan votes purely by chance. We conclude that the average number of fan votes received by stories submitted by a specific user is an effective estimate of her influence (given she has at least 100 fans).
Figure 6.3: (a) The scatter plot shows the average number of follower retweets received by stories within the first 100 votes vs. the submitter’s in-degree (number of followers) on Twitter. Each point represents a distinct submitter. (b) Probability of the expected number of follower retweets being generated purely by chance.

We analyzed the Twitter data set using the same methodology. There were 174 users who posted at least two URLs that were retweeted at least 100 times. Figure 6.3(a) shows the average number of times the posts of these users were retweeted by their followers. Figure 6.3(b) shows the probability that this number of retweets could have been observed purely by chance. Since these values are small, we conclude that the average number of follower retweets is a significant estimate of influence on Twitter.

Estimate 2: Average cascade size

Alternatively, we can measure the influence of the submitter by the average size of the cascades her posts trigger. For each post, using the methodology described in Section 8.2.1, we extracted the cascade that starts with the submitter and includes all voters who are connected to the submitter either directly or indirectly via the fan network. The hypothesis is that the larger the cascade size (on average), the more influential the submitter.

6.3.2 Comparison of Centrality Metrics

Figure 6.4: Correlation between the rankings produced by the empirical measures of influence and those predicted by Alpha-Centrality and PageRank for Digg. We use (a) the average number of fan votes and (b) the average cascade size as the empirical measures of influence. The inset zooms into the variation in correlation for $0 \le \alpha \le 0.01$.

We use the empirical estimates of influence to rank a subset of users in our sample who submitted more than one story. These rankings are used for evaluating the performance of the different centrality metrics.
We use Pearson’s correlation coefficient (since ties in rank may exist [134]) to compare the rankings predicted by different centrality metrics to the empirical rankings. We study the change in correlation with $\alpha$, where $\alpha$ ($0 \le \alpha \le 1$) stands for the attenuation factor for normalized Alpha-Centrality (see Equation 6.3) and the damping factor (restart probability $= 1 - \alpha$) for PageRank (see Equation 6.1). Note that the correlation of PageRank (with uniform starting vector) at $\alpha = 0$ (restart probability $= 1$) with the empirical estimate cannot be computed, because the standard deviation of the PageRank rankings would be zero in this case.

In Figure 6.4(a), we use the average number of fan votes to rank 289 Digg users who submitted two or more stories and compare these rankings to those predicted by normalized Alpha-Centrality and PageRank for these users. Alpha-Centrality reproduces the empirical rankings better than PageRank for all values of $\alpha$ except when $\alpha$ is close to 1. Alpha-Centrality also outperforms PageRank when the average cascade size is used to rank Digg users, as shown in Figure 6.4(b). Note that the correlation of both centrality metrics with average fan votes is much higher than with cascade size, indicating that average fan votes might be a better estimate of influence.

Figure 6.5: Correlation between the rankings produced by the empirical measures of influence and those predicted by Alpha-Centrality and PageRank for Twitter. We use (a) the average number of fan retweets and (b) the average cascade size as the empirical measures of influence.

Figure 6.5 shows the correlation between the rankings produced by the average number of follower retweets and the average cascade size and those computed by normalized Alpha-Centrality and PageRank. As Figure 6.5(a) shows, the better performance of Alpha-Centrality in comparison to PageRank is even more pronounced on Twitter than on Digg.
We observe in this figure that the correlation of the empirical measure of influence (average fan retweets) with Alpha-Centrality is higher than with PageRank. Interestingly, both PageRank and Alpha-Centrality are anti-correlated with average cascade size, unlike on Digg, as seen in Figure 6.5(b). One possible explanation could be differences in the user interfaces on these sites. Another possibility is that information spread cannot be modeled as a simple epidemic [164], and these differences are more pronounced on Twitter. Yet another explanation could be differences in network structure on the two sites, or simply an artifact of the biases introduced by our aggressive spam filtering or the small size of the data set. We are addressing these questions in ongoing work.

6.4 Summary

In this chapter, we showed that centrality metrics can be classified as conservative or non-conservative based on the dynamic processes that they model. We studied the properties of two prototypical metrics: PageRank and Alpha-Centrality. We introduced normalized Alpha-Centrality as a metric to study network structure. Like the original Alpha-Centrality [22] on which it is based, this metric measures the number of paths that exist between nodes in a network, attenuated by their length with the attenuation parameter $\alpha$. The tuning parameter can be used to set the length scale of the interactions. Hence, it retains the parametrized approach to centrality adopted by Alpha-Centrality. However, unlike the original Alpha-Centrality, which bounds $\alpha$ to be less than the reciprocal of the spectral radius of the network, normalized Alpha-Centrality sets no such limit. We showed that, since Alpha-Centrality and its normalized counterpart implicitly emulate non-conservative interactions, such metrics are among those best suited to study online social networks whose primary function is to spread information, a non-conservative process. We study this metric in greater detail in Chapter 9.
Chapter 7
Interactions and Proximity

In this chapter, we explore the complex interplay between proximity and activity in networks. The topology of a social network, along with the dynamics occurring on it, contains information useful for predicting its structure (Chapters 5 and 6). In this chapter we show that this information also helps predict user behavior, i.e., future activity associated with distinct individuals in the network. People who are “close” in some sense in a social network are more likely to perform similar actions than more distant people. We use network proximity to capture the degree to which people are “close” to each other. Proximity depends not only on the topology but also on the function of the network. In addition to standard proximity metrics used in the link prediction task, such as neighborhood overlap, we introduce new metrics that model different types of interactions that may take place between people. We classify these proximity metrics into conservative and non-conservative based on the nature of the activity (Chapter 4) they implicitly emulate. We study the predictive power of different proximity metrics empirically using data about URL forwarding activity on the social media sites Digg and Twitter. We show that the structural proximity of two users in the follower graph is related to the similarity of their activity, i.e., how many URLs they both forward. We also show that, given friends’ activity, knowing their proximity to the user can help better predict which URLs the user will forward. We compare the performance of different proximity metrics on the activity prediction task and find that metrics that take into account the attention-limited nature of interaction in social media lead to substantially better predictions. In addition, we examine what attributes help make some users more predictable than others.
7.1 Proximity

Structural proximity measures how readily information can be exchanged by nodes in a network even in the absence of a direct link between them. Given a pair of unconnected nodes, a link prediction algorithm [125, 98, 168, 127] calculates a graph-based proximity score between them. The “closer” the two nodes are, the more likely they are to become linked in the future or, in the case of partially observed networks, the more likely a link is to actually exist between them. Note that centrality (Chapter 6) is a property of a node (i.e., how important that node is in the network), whereas proximity is a property of a pair of nodes (i.e., how “close” they are to each other). Researchers have proposed a variety of proximity metrics for the link prediction task, including local metrics, such as the number and fraction of common neighbors, measures that weigh the contribution of each common neighbor by the inverse of its degree [194] or the inverse of the logarithm of its degree [1], as well as global metrics based on the number of paths between nodes [92] or the probability that a random walk starting at one node will reach another [98]. Researchers have empirically evaluated different metrics on the link prediction task in a variety of networks. Liben-Nowell and Kleinberg [125] showed that the Adamic-Adar score (which weighs the contribution of each common neighbor by the inverse of the log of its degree) best predicted new links in scientific co-authorship networks. Zhou et al. [194], on the other hand, found that the linear version of the Adamic-Adar score best predicts missing links in biological and technological networks, including protein-protein interaction networks, the electrical power grid, and US air transportation networks. Neither study explained the variation in performance or motivated the choice of the metric.
The greater the number of paths connecting two nodes through intermediaries, the greater the potential for information exchange; therefore, the closer the nodes are. However, the degree to which information can reach one node from another depends not only on network topology, but also on the nature of the interaction between the nodes [67]. Consider one-to-one interactions such as web surfing or phone conversations. In web surfing, a person chooses one of the outgoing hyperlinks from the current page to navigate to next. Likewise, in the phone conversations example, a person chooses one of her friends and places a call to her. Such interactions can be modeled as a random walk; therefore, metrics based on the random walk, such as conductance [98], are appropriate as a proximity measure. However, the spread of a disease or of information in a social network is fundamentally different and cannot be modeled as a random walk. In these processes, rather than picking one network neighbor to whom to transmit a message or a pathogen, users broadcast the messages to all their neighbors. However, recent research [82] shows that in social media, users’ capacity to respond to incoming messages from their neighbors is limited by their finite attention [90]. Since users must divide their attention among all neighbors, the more neighbors they have, the less likely they are to respond to and further spread an arbitrary message from a neighbor.

Friends’ activity has been shown to be a useful predictor of user activity in social media. People tend to vote for stories their friends vote for on Digg [105], use the same tags [130] and favorite the same images as friends on Flickr [113, 33], and so on. User activity, e.g., their tagging vocabulary, was used to predict social links between users [159]. We explore the inverse problem, i.e., the degree to which network structure helps predict user activity.
We propose proximity metrics that take into account the one-to-many and attention-limited interactions between nodes. We focus on local metrics that do not require knowledge of the full graph and are easier to compute, and relate these metrics to those commonly used in the link prediction task.

We study URL forwarding on the social media sites Digg and Twitter. We investigate how well different proximity metrics predict which URLs the user will forward. Our results show that metrics that take into account divided attention lead to better predictions of user activity. These findings suggest that limited attention plays an important role in interactions on social media.

Note that activity prediction differs from the link prediction problem. In the latter, network structure is used both as the basis for prediction and to evaluate prediction results. In activity prediction, on the other hand, prediction results are evaluated independently of the network structure using evidence from users’ voting or retweeting behavior.

7.2 Classification of Proximity Metrics

We represent a network as a directed, unweighted graph $G = (V, E)$ with $V$ nodes and $E$ edges. The adjacency matrix of the graph is defined as: $A(u, v) = 1$ if $(u, v) \in E$; otherwise, $A(u, v) = 0$. The set of out-neighbors of $u$ is $\Gamma_{out}(u) = \{ v \in V \mid (u, v) \in E \}$, and the out-degree of $u$ is $d_{out}(u) = \sum_{v \in V} A(u, v) = |\Gamma_{out}(u)|$. Similarly, $\Gamma_{in}(u)$ represents the set of in-neighbors of $u$, and $d_{in}(u)$ is the in-degree of $u$. The total degree of the node is $d(u) = d_{out}(u) + d_{in}(u)$. In an undirected graph, the neighborhood of $u$ consists of the nodes that are connected to $u$ and is denoted by $\Gamma(u)$.

Figure 7.1: Example of a directed graph.

Intuitively, network proximity measures the likelihood that a message starting at node $u$ will reach $v$, regardless of whether an edge exists between them. The greater the number of paths connecting $u$ and $v$, the more likely they are to share information, and the closer they are considered to be in the network.
Proximity metrics used in previous studies [125, 127] include the number of common neighbors (CN), the fraction of common neighbors, or Jaccard (JC) coefficient, and the Adamic-Adar (AA) score, which weighs each common neighbor by the inverse of the logarithm of its degree. Table 7.1 gives their definitions in terms of the directed neighborhoods of $u$ and $v$:

$$\Gamma = \Gamma_{out}(u) \cap \Gamma_{in}(v), \qquad \Gamma' = \Gamma_{in}(u) \cap \Gamma_{out}(v).$$

Table 7.1: Some of the proximity metrics used for network analysis, including four proposed in this work.

metric: definition
CN: $\mathrm{CN} = \frac{1}{2}\left( |\Gamma| + |\Gamma'| \right)$
JC: $\mathrm{JC} = \frac{1}{2}\left[ \frac{|\Gamma_{out}(u) \cap \Gamma_{in}(v)|}{|\Gamma_{out}(u) \cup \Gamma_{in}(v)|} + \frac{|\Gamma_{out}(v) \cap \Gamma_{in}(u)|}{|\Gamma_{out}(v) \cup \Gamma_{in}(u)|} \right]$
AA: $\mathrm{AA} = \frac{1}{2}\left[ \sum_{z \in \Gamma} \frac{1}{\log(d(z))} + \sum_{z' \in \Gamma'} \frac{1}{\log(d(z'))} \right]$
CS: $\mathrm{CS} = \frac{1}{2} \sum_{z \in \Gamma} \frac{1}{d_{out}(u)\, d_{out}(z)} + \frac{1}{2} \sum_{z \in \Gamma'} \frac{1}{d_{out}(v)\, d_{out}(z)}$
CS_AL: $\mathrm{CS\_AL} = \frac{1}{2} \sum_{z \in \Gamma} \frac{1}{d_{out}(u)\, d_{in}(z)\, d_{out}(z)\, d_{in}(v)} + \frac{1}{2} \sum_{z \in \Gamma'} \frac{1}{d_{out}(v)\, d_{in}(z)\, d_{out}(z)\, d_{in}(u)}$
NC: $\mathrm{NC} = \frac{1}{2}\left( |\Gamma| + |\Gamma'| \right)$
NC_AL: $\mathrm{NC\_AL} = \frac{1}{2} \sum_{z \in \Gamma} \frac{1}{d_{in}(z)\, d_{in}(v)} + \frac{1}{2} \sum_{z \in \Gamma'} \frac{1}{d_{in}(z)\, d_{in}(u)}$

The likelihood that a message will reach $v$ from $u$ depends, however, not only on the number of paths, but also on the nature of the dynamic process by which messages spread on the network [67]. Consider a graph of hyperlinked Web pages. The process of browsing this graph is best described by a random walk. At each page, a Web surfer picks one of the neighbors of that page in the Web graph and navigates to it. The interactions by which information is exchanged in the air transportation network, the electric power grid, and the mobile phone network can also be modeled by the random walk. We call such one-to-one processes conservative, since they conserve some underlying weight distribution. Not all interactions, however, are conservative. The one-to-many interactions common in social media, where users broadcast information to all their followers, cannot be modeled as a random walk. This, and many other social phenomena, such as the spread of disease or innovation, are fundamentally non-conservative.
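The three previously studied metrics in Table 7.1 (CN, JC and AA) can be sketched in Python directly from the directed neighborhoods $\Gamma$ and $\Gamma'$. The edge list, node labels, and function names below are hypothetical and serve only as an illustration. Treating an empty union in the Jaccard term as zero is an assumption of this sketch, not something specified in the text.

```python
from collections import defaultdict
from math import log


def build_neighborhoods(edges):
    """Return (out-neighbor sets, in-neighbor sets) for a list of directed edges (u, v)."""
    out_nb, in_nb = defaultdict(set), defaultdict(set)
    for u, v in edges:
        out_nb[u].add(v)
        in_nb[v].add(u)
    return out_nb, in_nb


def local_proximity(edges, u, v):
    """Compute CN, JC and AA from Table 7.1 for the node pair (u, v)."""
    out_nb, in_nb = build_neighborhoods(edges)
    gamma = out_nb[u] & in_nb[v]          # z with u -> z -> v
    gamma_prime = in_nb[u] & out_nb[v]    # z' with v -> z' -> u

    def degree(z):
        # Total degree d(z) = d_out(z) + d_in(z); at least 2 for any intermediate node.
        return len(out_nb[z]) + len(in_nb[z])

    def overlap_fraction(a, b):
        union = a | b
        # Assumption: an empty union contributes zero rather than raising an error.
        return len(a & b) / len(union) if union else 0.0

    cn = 0.5 * (len(gamma) + len(gamma_prime))
    jc = 0.5 * (overlap_fraction(out_nb[u], in_nb[v])
                + overlap_fraction(out_nb[v], in_nb[u]))
    aa = 0.5 * (sum(1 / log(degree(z)) for z in gamma)
                + sum(1 / log(degree(z)) for z in gamma_prime))
    return {"CN": cn, "JC": jc, "AA": aa}


# Hypothetical graph: u -> z -> v, u -> w -> v, and v -> y -> u.
example_edges = [("u", "z"), ("z", "v"), ("u", "w"), ("w", "v"),
                 ("v", "y"), ("y", "u")]
scores = local_proximity(example_edges, "u", "v")
```

All three scores depend only on the neighborhoods of $u$ and $v$ and on the degrees of the intermediate nodes, so they can be computed without access to the full graph.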
Different dynamic processes will lead to different notions of proximity, even in the same network. Below we derive different proximity metrics for each type of process. We focus on local measures that depend only on the neighborhoods of $u$ and $v$. Such measures can be computed efficiently, since they do not require knowledge of the full graph, e.g., the entire Twitter follower graph.

7.2.1 Conservative Proximity

Consider conservative processes first. Koren et al. [98] introduced cycle-free effective conductance as a global measure of proximity. This metric computes the probability that a random walk starting at $u$ will reach $v$ through any path in the graph. In the directed graph shown in Fig. 7.1, a walker starting at $u$ can reach $v$ through $z$. While there could be longer paths that connect $u$ to $v$, our local metrics consider only paths of length two that go through intermediate nodes such as $z$ or $z'$ in Fig. 7.1. A random walker moving from $u$ to $v$ first needs to pick an edge that will take it from $u$ to $z$ (which it will do with probability $1/d_{out}(u)$), and then it has to pick an edge that will take it from $z$ to $v$ (which it will do with probability $1/d_{out}(z)$). Symmetrizing, we obtain the conservative proximity metric, which gives the probability that a random walk will reach $u$ from $v$ or vice versa through paths of length two:

$$\mathrm{CS} = \frac{1}{2}\left[ \sum_{z \in \Gamma} \frac{1}{d_{out}(u)\, d_{out}(z)} + \sum_{z \in \Gamma'} \frac{1}{d_{out}(v)\, d_{out}(z)} \right]. \tag{7.1}$$

Note that in an undirected graph, this metric reduces to

$$\mathrm{CS} = \frac{1}{2}\left[ \frac{1}{d(u)} + \frac{1}{d(v)} \right] \sum_{z \in \Gamma(u) \cap \Gamma(v)} \frac{1}{d(z)}. \tag{7.2}$$

Like the Adamic-Adar score, conservative proximity takes into account the degree of the common neighbor. This measure is almost identical to the resource allocation metric (RA) shown by Zhou et al. [194] to be the best-performing local metric on the missing link prediction task in several networks, including the network of political blogs, the electric power grid, the router-level Internet graph, and the US air transportation network.
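The two-step random-walk argument behind Equation 7.1 can be checked numerically with a short Python sketch. The edge list and the function name `conservative_proximity` are hypothetical; the graph below is not the one in Fig. 7.1.

```python
from collections import defaultdict


def conservative_proximity(edges, u, v):
    """Eq. 7.1: symmetrized probability that a two-step random walk links u and v."""
    out_nb, in_nb = defaultdict(set), defaultdict(set)
    for a, b in edges:
        out_nb[a].add(b)
        in_nb[b].add(a)

    gamma = out_nb[u] & in_nb[v]          # intermediates z on paths u -> z -> v
    gamma_prime = in_nb[u] & out_nb[v]    # intermediates z' on paths v -> z' -> u

    # Each path contributes 1 / d_out(start) * 1 / d_out(intermediate).
    forward = sum(1 / (len(out_nb[u]) * len(out_nb[z])) for z in gamma)
    backward = sum(1 / (len(out_nb[v]) * len(out_nb[z])) for z in gamma_prime)
    return 0.5 * (forward + backward)


# Hypothetical graph: u -> z -> v, u -> w -> v, v -> y -> u, plus a distracting link z -> x.
example_edges = [("u", "z"), ("z", "v"), ("u", "w"), ("w", "v"),
                 ("v", "y"), ("y", "u"), ("z", "x")]
cs_uv = conservative_proximity(example_edges, "u", "v")
```

In this example the extra link from `z` to `x` lowers the score, because a walker at `z` may now leave toward `x`; a pure path count such as CN would be unchanged by that link.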
On an undirected network, RA is:

\[ \mathrm{RA} = \frac{1}{d_u} \sum_{z \in \Gamma(u) \cap \Gamma(v)} \frac{1}{d(z)}. \]

Conservative proximity in undirected networks (Eq. 7.2) is the symmetric version of this metric. Since the RA metric is conservative in nature, it is reasonable to expect it to do well on these networks, because, except for political blogs, the processes taking place on them are conservative in nature. When a plane leaves one airport, its destination is exactly one other airport. For the political blogs network, Zhou et al. ignored the direction of links, which may have changed the properties of the network.

A person's capacity to process incoming stimuli is finite. This phenomenon, known as finite attention [90], limits a person's ability to receive and process incoming messages. In online social networks, users must divide their attention among all in-neighbors, so that the more in-neighbors they have, the less likely they are to process a message from an arbitrary neighbor. Limited attention alters the dynamic process and affects the propagation of messages. Now, in order for a message to get from $u$ to $z$, not only must $u$ pick an edge that will get the message to $z$, but $z$ must also pay attention to that in-link to receive it, which it will do with probability $1/d_{in}(z)$. The attention-limited conservative proximity metric can be written as:

\[ \mathrm{CS\_AL} = \frac{1}{2}\left[\sum_{z \in \Gamma} \frac{1}{d_{out}(u)\,d_{in}(z)\,d_{out}(z)\,d_{in}(v)} + \sum_{z \in \Gamma'} \frac{1}{d_{out}(v)\,d_{in}(z)\,d_{out}(z)\,d_{in}(u)}\right]. \]

7.2.2 Non-conservative Proximity

Now imagine that information flows on a network via one-to-many broadcasts. When a node broadcasts a message, it is sent to all of the node's out-neighbors. In this case, for a message to get from $u$ to $v$ in Fig. 7.1, first $u$ broadcasts it to its neighbors, including $z$, and then $z$ broadcasts it. The probability of the message being transmitted from one node to another is one.
Therefore, the symmetrized non-conservative proximity measure is:

\[ \mathrm{NC} = \frac{1}{2}\left[\sum_{z \in \Gamma} 1 + \sum_{z \in \Gamma'} 1\right] = \frac{1}{2}\left(|\Gamma| + |\Gamma'|\right). \tag{7.3} \]

The non-conservative metric counts the expected number of times a message is received and is identical to the neighborhood overlap metric CN.

Finite attention can also play a role in non-conservative interactions. When $u$ broadcasts a message, $z$ will receive it only if it pays attention to the channel from $u$. Therefore, the attention-limited non-conservative proximity metric can be written as

\[ \mathrm{NC\_AL} = \frac{1}{2}\left[\sum_{z \in \Gamma} \frac{1}{d_{in}(z)\,d_{in}(v)} + \sum_{z \in \Gamma'} \frac{1}{d_{in}(z)\,d_{in}(u)}\right]. \]

In undirected graphs, this reduces to

\[ \mathrm{NC\_AL} = \frac{1}{2}\left[\frac{1}{d(u)} + \frac{1}{d(v)}\right] \sum_{z \in \Gamma(u) \cap \Gamma(v)} \frac{1}{d(z)}, \]

which is identical to conservative proximity in undirected networks (Eq. 7.2).

7.3 Predicting Activity in Social Media

In this work we focus on URL forwarding activity on two popular social media sites: Digg and Twitter. We use the Digg and Twitter data sets described in Chapter 2, Sections 2.3.2 and 2.3.3.2, respectively.

In social networks, network proximity can be interpreted as social closeness. In his seminal paper, Granovetter [73] argued that the strength of a social tie, which specifies the intensity and the depth of interaction between two people, can be estimated from their network structure. Ideally, the strength of a tie should consider paths of all lengths between two nodes (since a 'weak tie' ties a node to distant parts of the network). A one-step approximation of tie strength, however, considers only the one-hop paths between the nodes, that is, their neighborhood overlap. Thus Granovetter proposed the neighborhood overlap of two nodes as the metric to approximate tie strength. Subsequently, a large-scale study of a mobile phone network established a correlation between the strength of ties, measured by the frequency and duration of phone calls between two people, and structural proximity, measured by their neighborhood overlap [144].

We claim that proximity also has predictive power.
People who are close to each other in an online social network are more likely to act in a similar way because they share the same information, have similar tastes and attributes, or participate in the same community. While the causes of similarity of activity are hotly debated and difficult to tease apart [160], our goal is simply to show that information in the follower graph can help predict activity. In other words, knowing the actions of some people allows us to predict the actions of others who are close to them in the network. Specifically, we study retweeting activity in social media and show that while users tend to retweet the URLs their friends retweet, knowing the friends' proximity in the follower graph can help better predict which URLs the user will retweet.

Retweeting activity in our sample encompassed diverse behaviors, from spreading newsworthy content to orchestrated human and bot-driven campaigns that included advertising and spam. We have proposed a novel method to automatically classify these behaviors (Chapter 3) by characterizing the dynamics of retweeting with two information-theoretic features: user entropy and time interval entropy. In this work, we focus on those URLs from the data set which are characterized by high (> 3) user and time interval entropies. These parameter values are associated with the spread of newsworthy content and exclude robotic spamming and manipulation campaigns driven by a few individuals. This left us with a data set containing 3,798 distinct URLs retweeted by 542K distinct Twitter users.

One limitation of our data collection methodology is that it was not guaranteed to retrieve all of a user's friends and followers. Knowing all friends and followers may not be necessary, since some of them could have become inactive without deleting their accounts and, therefore, will not contribute to activity prediction.
However, what if the friend is still active, but participates on Digg and Twitter so infrequently that we did not record his activity during the observation period? While our methodology guarantees that we collect information about all active friends and followers whose actions were observed during the data collection period, it misses some of the infrequently active followers (on Digg) or friends (on Twitter). In light of these considerations, the metrics described below provide a lower bound on proximity values: if infrequently active friends and followers are included, the calculated proximity values may be higher.

7.3.1 Analysis of Proximity Metrics

We compute proximity metrics on the directed follower graphs of active Digg and Twitter users. Proximity metrics used in this study are listed in Table 7.1, where user $u$'s out-neighbors are the set of her followers, and in-neighbors are her friends. We measure similarity of activity of a pair of users by the number of common URLs they both recommended. Activity of a pair of Digg users is measured by co-votes, the number of promoted stories for which they both voted. Activity of a pair of Twitter users is measured by co-retweets, the number of common URLs they both tweeted or retweeted.

Figure 7.2 plots proximity, computed using different metrics, vs activity for pairs of users linked by an edge in the follower graph. The y-value represents the average proximity for all pairs with that many co-votes or co-retweets. There are significant trends in proximity as a function of activity on Digg (Fig. 7.2(a)), at least for co-votes < 800. Above this value, there is no observable correlation between proximity and activity. This could be because some users tend to vote on many front page stories regardless of their content, or due to automatic voting. Interestingly, attention-limited versions of the conservative and non-conservative proximity decrease with the number of co-votes.
The conservative metric is the only one whose behavior is not, on the whole, monotonic: the value of the metric decreases until around 50 co-votes and increases after that.

Figure 7.2: Average value of the proximity metrics vs activity for pairs of users linked by an edge in the follower graphs of (a) Digg and (b) Twitter. Curves in each panel: common neighbors (CN, NC), Jaccard (JA), Adamic-Adar (AA), conservative (CS), conservative attention-limited (CS_AL), and non-conservative attention-limited (NC_AL).

Proximity-activity trends on Twitter are more complex (Figure 7.2(b)). In the first three plots, the average value of proximity initially increases with activity, until about 15 co-retweets, at which point there is a decreasing trend. The last three metrics, however, show an increasing trend.

Table 7.2: Correlation between proximity of pairs of users connected by an edge in the follower graph and their co-activity on (a) Digg and (b) Twitter. Rows in (a) present co-votes under different filter conditions. For example, the co-votes < 200 condition reports correlations for pairs of users who voted for fewer than 200 common stories. The number of pairs satisfying the filter condition is reported in the second column.

(a) Digg: correlation
filter            # edges      CN      JC      AA      CS      CS_AL    NC      NC_AL
co-votes < 200    1,410,590    0.256   0.129   0.232   0.015   -0.010   0.256   -0.028
co-votes < 400    1,429,712    0.277   0.158   0.246   0.019   -0.009   0.277   -0.027
co-votes < 800    1,438,320    0.283   0.170   0.249   0.024   -0.008   0.283   -0.025
all               1,439,842    0.279   0.163   0.246   0.025   -0.008   0.279   -0.023

(b) Twitter: correlation
                  # edges      CN      JC      AA      CS      CS_AL    NC      NC_AL
                  28M          -0.769  -0.339  -0.755  0.523   0.350    -0.769  0.406

We compute the correlation between proximity and activity for all pairs of users linked by an edge in the follower graph. These correlations for different proximity metrics are shown in Table 7.2.
We can limit the edges taken into account by the correlation to those that satisfy some filter condition. For example, the co-votes < 200 line reports correlations for pairs of Digg users who voted on fewer than 200 common stories. The number of pairs satisfying the filter condition is reported in the second column. Despite growing scatter, correlation increases with the amount of co-activity until about 800 co-votes. The non-conservative metric, which is equivalent to the common neighbors metric, leads to the highest correlation. The story is somewhat different for Twitter (Table 7.2(b)), where the conservative and attention-limited non-conservative metrics lead to the highest correlations.

Figure 7.3: Prediction methodology

Algorithm 2: Predict woman $w$'s attendance of test events
    $F \leftarrow \mathrm{friends}(w)$
    for each friend $j \in F$ do
        $\vec{f}_j \leftarrow \mathrm{test\_events}(j)$        (vector of test events friend attended)
        $x_j \leftarrow \mathrm{proximity}(w, j)$        (friend's proximity to $w$)
    end for
    $\vec{p} = \sum_j \vec{f}_j x_j / |\vec{x}|$        (construct prediction vector)
    $\vec{u} \leftarrow \mathrm{test\_events}(w)$        (test events $w$ actually attended)
    $Pr = \vec{u} \cdot \vec{p} / |\vec{p}|$
    $Re = \vec{u} \cdot \vec{p} / |\vec{u}|$

7.3.2 Prediction Results

We study the claim that knowing the (local) structure of the follower graph can enhance the power of this predictor. In other words, while social media users tend to act like their friends, they are more likely to act like their closer friends. Proximity metrics, however, vary in their prediction success, which makes the choice of the right metric an important question.

Table 7.3: Evaluation of predictions by different metrics in the Digg and Twitter data sets. Lift is defined as % change over baseline.
                base     CN, NC   JA       AA       CS       CS_AL    NC_AL
(a) Digg
precision       0.032    0.027    0.033    0.027    0.028    0.039    0.034
recall          0.172    0.248    0.174    0.250    0.272    0.195    0.174
pr lift %       0        -15.0    3.3      -14.7    -11.1    22.1     7.7
re lift %       0        44.2     1.1      45.5     57.9     13.3     1.3
(b) Twitter
precision       0.105    0.091    0.120    0.093    0.094    0.133    0.125
recall          0.094    0.090    0.102    0.091    0.097    0.113    0.106
pr lift %       0        -14.1    14.1     -12.0    -10.7    25.9     18.5
re lift %       0        -4.8     8.4      -3.4     2.8      19.7     12.3

We evaluate this claim on the task of predicting user activity on Digg and Twitter. This task can be stated as follows: given the follower graph and the URLs that a user's friends voted for (or retweeted), predict which URLs the user votes for (or retweets).

To quantitatively evaluate this claim, we construct a prediction vector $\vec{p}$ for a user (see Fig. 7.3). The value $p_i$ of the prediction vector represents the probability that a user's friends voted for the $i$-th URL, weighted by each friend's proximity to the user in the follower graph. To compute precision and recall of prediction, we construct a vector $\vec{u}$ of the URLs the user actually voted for. Then precision is $Pr = \vec{u} \cdot \vec{p} / |\vec{p}|$ and recall is $Re = \vec{u} \cdot \vec{p} / |\vec{u}|$, where $|\vec{z}| = \sum_i z_i$. Algorithm 2 gives the pseudocode of the prediction algorithm. We compare proximity-based predictions to a baseline that weighs friends' votes uniformly, without regard to their proximity to the user.

Voters in the Digg data set voted on more than 3.5K stories. Almost 53K of these voters had at least one friend and were included in the baseline. Of these, we could calculate proximity for about 25K voters. The rest of the voters did not share any common friends or followers with other active users. We also limit predictions to votes users made on stories before they were promoted to the front page.
This is because interactions with friends are most important for attracting new votes before promotion, since at that time a story is visible mainly through the friends interface, which shows the user stories that friends recently submitted or voted for (see footnote 1). After promotion, stories are visible on the highly popular front page, which dilutes the effect of social networks [104, 84].

Table 7.3(a) reports average precision and recall values of predictions for Digg. Although average precision appears low, ranging from 2.7% to 3.9%, it is an order of magnitude better than the precision of 0.6% for randomly guessing which stories a user will vote for. We define lift as percent change over baseline. Proximity-based prediction results in a substantial lift, especially for the attention-limited versions of the conservative and non-conservative metrics. Even Jaccard results in a small positive lift, while the common neighbors and Adamic-Adar metrics still perform worse than baseline in precision.

In the Twitter data set, almost 542K users retweeted 3.8K URLs. Twitter does not provide an equivalent of Digg's front page for the most retweeted URLs; therefore, URLs generally spread via recommendations by friends. Table 7.3(b) compares the prediction performance of different proximity metrics. Baseline precision is around 10%. Attention-limited versions of the conservative and non-conservative proximity metrics result in the greatest lift both in precision and recall (about 26% in precision and 20% in recall for CS_AL). As in the Digg data set, the precision of the common neighbors, Adamic-Adar, and conservative metrics is worse than baseline.

These results suggest that divided attention plays an important role in interactions on Digg and Twitter. Moreover, taking interactions into account improves the predictive power of structural proximity.

Footnote 1: The story is also visible on the Upcoming Stories list, but since that list gets tens of thousands of new submissions daily, it contributes marginally to the votes the story receives.
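The evaluation procedure of this section (prediction vector, precision, recall, and lift) can be sketched as follows. This is a minimal illustration rather than the thesis implementation: the function names and the dictionary-based inputs are hypothetical, and the uniform baseline is assumed to correspond to giving every friend the same proximity weight.

```python
def prediction_vector(urls, friend_votes, proximity):
    """Proximity-weighted share of friends who voted for each URL."""
    total = sum(proximity[j] for j in friend_votes)
    if total == 0:
        return [0.0] * len(urls)
    return [
        sum(proximity[j] for j, voted in friend_votes.items() if url in voted) / total
        for url in urls
    ]


def precision_recall(urls, user_votes, p):
    """Pr = u.p / |p| and Re = u.p / |u|, where |z| is the sum of the entries of z."""
    u = [1.0 if url in user_votes else 0.0 for url in urls]
    overlap = sum(ui * pi for ui, pi in zip(u, p))
    precision = overlap / sum(p) if sum(p) else 0.0
    recall = overlap / sum(u) if sum(u) else 0.0
    return precision, recall


def lift(value, baseline):
    """Percent change of a score over the baseline score."""
    return 100.0 * (value - baseline) / baseline


# Toy usage: two friends, three URLs, and a user who voted only for "b".
urls = ["a", "b", "c"]
friend_votes = {"j1": {"a", "b"}, "j2": {"b", "c"}}
proximity = {"j1": 0.75, "j2": 0.25}
p = prediction_vector(urls, friend_votes, proximity)
pr, re = precision_recall(urls, {"b"}, p)
```

Passing equal weights for all friends reproduces the uniform baseline under the assumption above, so the same two functions can score both the baseline and each proximity metric before `lift` compares them.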
Figure 7.4: Distribution of prediction precision over users in the Digg and Twitter data sets.

7.3.3 Predictability

The values reported in Table 7.3 represent precision and recall averaged over all users. However, as shown in Figure 7.4, precision values have a long-tailed distribution (recall is more uniformly distributed) for both Digg and Twitter users. This means that while the majority of users are essentially unpredictable, some users are very predictable. In the Twitter data set, our algorithm did not make any correct predictions for 208K users, while it perfectly predicted the activity of about 5K users. In the Digg data set, about 10K users were unpredictable, while more than 100 users were perfectly predictable.

A natural question to ask is whether there is some attribute of the user that predicts how predictable her activity will be, so that we can automatically distinguish users whose actions we can predict with high confidence from those whose actions are essentially unpredictable. Figure 7.5 shows the distribution of precision on Digg as a function of three user attributes: the number of friends and followers the user has, and her activity level, as measured by the number of votes she made.

Figure 7.5: Distribution of precision as a function of user attribute, such as the number of friends, followers, and the activity level, as measured by the number of votes the user made.

Precision was computed using the NC_AL metric, but there is little qualitative difference between the metrics. The distribution is very noisy, though there exists a weak correlation between precision and the number of friends, followers, and votes. However, contrasting users who are perfectly predictable (precision = 1) and unpredictable (precision = 0), a different picture emerges. Generally, predictable users tend to have far fewer friends than unpredictable users: up to 12 for predictable and up to 1,000 for unpredictable users.
Also, predictable users are less active on Digg, voting for fewer than 100 stories, as opposed to unpredictable users, who vote for thousands of stories. Predictable users also have fewer followers, but that is to be expected, since the numbers of friends and followers are strongly correlated.

There is no observable correlation between precision and attribute values of Twitter users. To reduce noise, we contrast predictable Twitter users (whose activity we predicted with precision = 1) and unpredictable users (precision = 0) along the three attributes. A user's activity level is measured by the number of retweets she made. Figure 7.6 shows these results for precision values computed using the NC_AL proximity metric. Similar to Digg, predictable users tend to have far fewer friends than unpredictable users: up to 100 for predictable and up to 10,000 for unpredictable. Also, predictable users are less active on Twitter, tweeting at most 25 URLs, as opposed to unpredictable users, who tweet up to 100 URLs. The two classes of users do not differ substantially in the number of followers.

Figure 7.6: Differences between predictable (precision = 1) and unpredictable (precision = 0) Twitter users in terms of the number of friends, followers, and their activity level, as measured by the number of URLs they tweeted.

One potential explanation for these trends, which were observed on both sites, is that users with many friends have a variety of interests and are active in many, possibly diverse, communities, making their activity less predictable. Users with fewer friends could be part of more cohesive communities and, as a consequence, be more like their friends, making their activity more predictable. This hypothesis is supported by the finding that predictable users on Twitter generally had higher average proximity to their friends than unpredictable users.
Being close to many friends is a hallmark of a community; however, more research is required to understand this phenomenon.

7.3.4 Discussion

Just as in the link prediction task, structural information can help the activity prediction task. However, as we show in this thesis, the choice of the structural proximity metric matters for prediction performance. Although the non-conservative metric produced the highest correlation between structural proximity and activity, it did not lead to the best prediction results. In fact, on both Digg and Twitter it gave the worst predictions, compared to the uniform friend recommendation baseline. The non-conservative metrics model epidemic spreading in networks. We know, however, that information spread in social media (at least on Digg) is different from the spread of epidemics, because the probability of becoming "infected" with information does not depend on the number of "infected" friends, a phenomenon we explore in detail in Chapter 8, Section 8.3. Results of this chapter suggest that attention plays an important role in information spread in social media.

Even if we do not yet fully understand this process, we show in this work that the choice of the proximity metric matters. Attention-limited metrics produce the best prediction results because they describe the dynamic processes taking place in social media more closely than other metrics. This may also help explain link prediction results. Adamic-Adar probably performed best on the task of predicting future paper co-authorship because, of the many metrics studied by Liben-Nowell and Kleinberg, it most closely approximated the nature of interactions between authors, which is probably best modeled by an attention-limited process. On the missing link prediction task in conservative transportation and power grid networks, the linear RA metric gave the best results.
This makes sense, since the RA metric is an unsymmetrized version of the conservative metric described in this work. This further underscores the need to consider the nature of the dynamic process when choosing a proximity metric for the prediction task. The design of proximity metrics with even better predictability than those described in this section is left for future work.

Our work also ignores the timing of votes, i.e., whether friends' recommendations came before or after a user's own recommendation. Therefore, we do not distinguish between the effects of homophily and influence [160].

7.4 Summary

We study activity prediction in social networks. In this task, information about the activity of a user's friends is used to predict her activity. We showed that taking into account friends' proximity to the user in the follower graph can help better predict the user's activity. Our results indicate that proximity metrics that take into account the attention-limited nature of interactions in social media lead, in aggregate, to better predictions than other metrics, or the baseline that does not use proximity for prediction. The aggregate results, however, hide a wide range of predictability. On a per-user basis, prediction performance is characterized by a long-tailed distribution, with few highly predictable users and a large number of non-predictable users.

These results suggest interesting directions for future research. We did not explore the temporal nature of activity, that is, whether a user retweets the URL before or after her friend does. In addition, we found evidence that some users' activity may be easier to predict than others', so an interesting question is whether we can automatically determine whose behaviors are more predictable. We leave these questions for future research.

Chapter 8

Information Spread Under the Microscope

Social media has become an important channel for people to share information.
Information spread is a contagion or contact process occurring on a network. A contagion process starts with a seed (on social networks like Digg or Twitter, it is the story's submitter) and grows as the story accrues fan votes. On Digg, Twitter, Slashdot, Reddit, and Facebook, among others, users post news or links to news stories, discuss them, and share their opinions in real time. We study information spread on Digg and Twitter. Sections 2.3.2 and 2.3.3.1 describe the procedure adopted for data collection from these websites. The contagion process of story propagation can be traced through the underlying social network of Digg (Twitter) by checking whether a new vote (retweet) came from a fan (follower) of any of the previous voters, including the submitter. We call such votes or retweets fan votes, regardless of whether we are talking about Digg or Twitter.

But how similar or different is information propagation in different online social media? In Section 8.1, we compare the contagion process of story (information) propagation on Digg and on Twitter.

What drives information diffusion, and what factors facilitate information spread? As in any other field of research, there are two distinct ways of tackling this problem: model-centric or empirical. Model-centric approaches make certain assumptions about how individuals participating in a cascade are affected by their neighbors (independent cascade or threshold model). Using these models, researchers have tried to infer global properties of information cascades in social networks [192, 181], devise efficient methods to infer the underlying network structure [155, 75] or maximize cascade size [49, 93], and identify influential spreaders [96]. However, empirical approaches are needed to validate assumptions made by these models.
We need principled mathematical tools to quantitatively characterize the temporal and spatial properties of cascades (sequences of activations generated by a contagion or contact process) as they occur in real-world networks. However, to the best of our knowledge, no previous work has attempted to quantify the dynamics of information cascades on social networks or characterize their microscopic growth. At most, researchers have visualized the shape of cascades [115] or enumerated their commonly observed patterns [119]. Such approaches do not scale to even moderately large cascades.

To address this gap, in Section 8.2 we propose a practical, general, and scalable quantitative framework for the analysis of cascades on social networks that is applicable even to large cascades. We define a cascade generating function, which captures the details of the dynamics of information diffusion on networks. We can use this function to (1) compute the macroscopic properties of the cascade, such as its size, diameter, average path length, etc., (2) reconstruct the shape of the cascade, and (3) analyze its microscopic and mesoscopic dynamic properties. The cascade generating function is a good signature [120] of the contagion process occurring on a network. It could help us identify patterns, trends, and anomalies within the cascades in near real-time. As the size of cascades grows, storing their complete structure may not be feasible. However, the cascade generating function can approximate the structure of the cascade with very high accuracy, in spite of having pseudo-linear space complexity. Hence, the cascade generating function provides efficient compression of the information in a cascade.

The use of these powerful tools leads to many interesting insights into network structure. One of the interesting properties of information cascades in online social media discovered using these tools is that the size of observed information cascades is very small.
These cascades are much smaller than those predicted by existing contagion models. We delve into this discrepancy in Section 8.3. We propose possible models of social contagion that better describe the observed phenomena of information spread occurring on online social media.

8.1 Information Spread on Digg and Twitter

At the time of submission, a Digg story is visible on the upcoming stories list and to the submitter's fans through the friends interface. As users vote on the story, it becomes visible to their own fans via the friends interface. Analogous to the spread of a contagious disease [137], interest in the story cascades through the social network. When the story is promoted to the front page, it becomes visible to many nonfans, although users are still able to pick out stories their friends liked through the green ribbon on the story's Digg badge. Similarly, a new post on Twitter is visible to the submitter's followers, and every user who retweets the story broadcasts it to his own followers. Although aggregators like Tweetmeme attempt to identify popular stories on Twitter in Digg-like fashion, there is no evidence that they boost their visibility to nonfans.

8.1.1 Characteristics of User Activity

Figure 8.1: Distribution of user activity. (a) Number of active fans per user in the Digg data set vs the number of users with that many fans. Inset shows the distribution of voting activity, i.e., number of votes per user vs number of users who cast that many votes. (b) Number of active followers per user in the Twitter data set vs the number of users with that many followers. Inset shows the distribution of retweeting activity.

Figure 8.1 shows the distribution of the number of active fans and followers per user. Digg's distribution, shown in Fig.
8.1(a), has a long-tail shape that is common to degree distributions in real-world complex networks and is well described by a power law of the form $p(k) \propto k^{-\alpha}$ with $\alpha \approx 2$. Twitter's distribution, shown in Fig. 8.1(b), has a peak at around 100 followers and a long tail.

Next, we characterize users' voting activity. User activity is not uniform, as shown in the insets of Fig. 8.1(a) and (b). While the majority of users cast fewer than 10 votes, some users voted on thousands of stories over the sample time period. The distribution of the number of retweets per user in the Twitter data set has a similar shape, with the number of retweets per user ranging from 1 to about 100. The difference in slopes in these distributions is likely explained by the level of effort [183] required to vote on Digg vs retweet on Twitter.

As the ratio of the number of connections to the number of active users suggests, the Digg social network is denser and more tightly knit than the Twitter social network. We measure density by the number of reciprocal friendship links and the modified clustering coefficient. A reciprocal, or mutual, friendship link exists when user $A$ marks $B$ as a friend and vice versa. There were 125,219 such links among 279,725 distinct users in the Digg sample and 3,973,892 mutual links among 6,200,051 users in the Twitter sample. Normalizing these counts by the number of all possible mutual links in the network gives the fraction of mutual links $f_m$. For Digg $f_m = 3.20 \times 10^{-6}$, and for Twitter $f_m = 2.07 \times 10^{-7}$, an order of magnitude smaller. The clustering coefficient $f_c$ measures the degree to which a node's network neighbors are interlinked. We define the clustering coefficient for directed networks such as those that exist on Digg and Twitter as the fraction of closed triangles that exist out of all possible sets of three nodes, or triples. For simplicity, we define a closed triangle as a cycle of length three that exists when $A$ lists $B$ as a friend, $B$ lists $C$, and $C$ lists $A$ as a friend.
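The two density measures defined above can be sketched on a small in-memory edge list as follows. This is an illustrative sketch only: the function name is hypothetical, an edge `(a, b)` is assumed to mean that `a` lists `b` as a friend, and the counts are normalized by the number of unordered node pairs and unordered node triples, which is one reading of "all possible" links and triples and may differ from the normalization actually used in the thesis.

```python
from collections import defaultdict
from math import comb


def density_measures(edges):
    """Return (f_m, f_c) for directed friendship edges (a, b): a lists b as a friend."""
    edge_set = {(a, b) for a, b in edges if a != b}  # ignore self-loops
    nodes = {x for edge in edge_set for x in edge}
    friends = defaultdict(set)
    for a, b in edge_set:
        friends[a].add(b)

    # Each mutual link {a, b} appears as two ordered edges, so divide by 2.
    mutual = sum(1 for a, b in edge_set if (b, a) in edge_set) // 2

    # Each directed cycle a -> b -> c -> a is found once per starting node,
    # so the raw count is divided by 3.
    rotations = sum(
        1
        for a, b in edge_set
        for c in friends[b]
        if c != a and (c, a) in edge_set
    )
    triangles = rotations // 3

    n = len(nodes)
    f_m = mutual / comb(n, 2) if n >= 2 else 0.0  # assumed normalization: node pairs
    f_c = triangles / comb(n, 3) if n >= 3 else 0.0  # assumed normalization: node triples
    return f_m, f_c
```

On graphs with millions of nodes this in-memory approach would not be practical, which is consistent with the distributed implementation mentioned in the text.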
There were 166,239 such triangles in the Digg network, giving a clustering coefficient of $f_c = 7.60 \times 10^{-12}$, and 4,566,952 triangles on Twitter, giving a clustering coefficient of $f_c = 1.92 \times 10^{-14}$, which is two orders of magnitude smaller. Due to the size of the networks, we implemented these metrics using Hadoop (http://hadoop.apache.org/). We suspect that the differences in density of the two networks are due to their age, since Twitter is a more recent service than Digg. With time, it is expected that the Twitter network will grow denser [117] and become as tightly knit as Digg.

8.1.2 Evolution of a Story

Figure 8.2: Dynamics of stories on Digg and Twitter. (a) Total number of votes (diggs) and fan votes received by stories on Digg since submission. (b) Total number of times a story was retweeted and the number of retweets from followers since the first post vs time. The titles of stories on Digg were: story1: "U.S. Government Asks Twitter to Stay Up for #IranElection", story2: "Western Corporations Helped Censor Iranian Internet", story3: "Iranian clerics defy ayatollah, join protests." The titles of retweeted stories were: story1: "US gov asks twitter to stay up", story2: "Iran Has Built a Censorship Monster with help of west tech", story3: "Clerics join Iran's anti-government protests - CNN.com."

The data sets contain a complete record of voting on Digg front page stories and frequently retweeted stories on Twitter. From this data, the dynamics of voting can be reconstructed. In addition to voting history, the active fan network of Digg and Twitter users is also known. We use this information to check whether a particular voter is a fan of the submitter or previous voters. In-network votes are called fan votes. This information allows us to study how interest in the story spreads through the social networks on Digg and Twitter.
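The fan-vote check described above can be sketched as a single pass over a story's chronological vote list. The function name and input layout below are assumptions made for illustration: `voters` lists user ids in order of voting with the submitter first, and `fans_of` maps each user to the set of users who follow them.

```python
def flag_fan_votes(voters, fans_of):
    """Mark each vote after submission as a fan vote (True) or not (False).

    voters  -- user ids in chronological order; voters[0] is the submitter
    fans_of -- maps a user id to the set of that user's fans (followers)
    """
    if not voters:
        return []
    # Fans of the submitter are the first users exposed to the story.
    exposed = set(fans_of.get(voters[0], ()))
    flags = []
    for voter in voters[1:]:
        # A fan vote comes from a fan of the submitter or of any earlier voter.
        flags.append(voter in exposed)
        # The story now becomes visible to this voter's own fans.
        exposed |= set(fans_of.get(voter, ()))
    return flags


# Toy usage: "a" is a fan of the submitter "s", and "c" is a fan of "a".
flags = flag_fan_votes(["s", "a", "b", "c"], {"s": {"a"}, "a": {"c"}})
fan_vote_count = sum(flags)
```

The running total of `True` flags gives the number of fan votes after each vote, which is the cascade size tracked for each story.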
Figure 8.2(a) shows the evolution of the number of votes received by three Digg stories about post-election unrest in Iran in June 2009. While the details of the dynamics differ, the general features of vote evolution are shared by all Digg stories and can be described by a stochastic model of social voting [84]. In the upcoming stories queue, a story accumulates votes at some slow rate. The point where the slope abruptly changes corresponds to promotion to the front page. After promotion the story is visible to a large number of people, and the number of votes grows at a faster rate. As the story ages, the accumulation of new votes slows down [185] and finally saturates. Figure 8.2(b) shows the evolution of the number of times stories on the same topics were retweeted. The number of retweets grows smoothly until it saturates. It takes about a day for the number of votes/retweets to saturate on both sites.

The dashed lines in Figure 8.2 show how the number of fan votes received by each story grows in time. Their evolution is similar to that of all votes, and growth saturates after a period of about a day. The value at which growth saturates shows the story’s range, or how widely it penetrates the social network.

8.1.3 Evolution of Fan Votes

Figure 8.3 shows how the number of fan votes (size of the cascade), aggregated over all stories, grows during the early stages of voting or retweeting. While there is significant variation in the number of fan votes received by a story, the aggregate exhibits a well-defined trend. The solid lines show the median cascade size, while dotted lines show the envelope of the boundary that is one standard deviation from the mean. The cascade grows steadily with new votes on Digg (Fig. 8.3(a)), although faster initially, indicating that there are two distinct mechanisms for story visibility on Digg. This is seen more clearly in Fig.
8.3(b), which shows the probability that the next vote is a fan vote and will increase the size of the cascade. The votes cast before promotion are separated from those cast after the story is promoted. Before promotion, this probability is almost constant, at $p = 0.74$. After promotion, it decays to a lower, but also almost constant, value of $p = 0.3$. This is consistent with our hypothesis that before promotion social networks are the primary mechanism for spreading interest in new stories. Although a story is also visible on the upcoming stories list, few users actually discover stories there. With 16,000 daily submissions, a new story is quickly submerged by new submissions and is pushed to page 15 of the upcoming stories list within the first 20 minutes. Few users are likely to navigate that far [87]. Promotion to the front page, which generally happens when a story accrues between 50 and 100 votes, exposes the story to a large and diverse audience, making social networks less of a factor in its spread, since large numbers of Digg users who read front page stories do not befriend others.

Figure 8.3: Spread of interest in stories through the network. (a) Median number of fan votes vs votes, aggregated over all Digg stories in our data set. Dotted lines show the boundary one standard deviation from the mean. Dashed lines show the number of votes from fans of the submitter. (b) Probability that the next vote is from a fan before and after the Digg story is promoted. (c) Median number of retweets from followers vs all retweets, aggregated over all stories in the Twitter data set. (d) Probability that the next retweet is from a follower.

The spread of interest in stories through the Twitter network, shown in Figure 8.3(c), is similar to Digg. As on Digg, the median number of fan votes rises steadily during the early stages of voting.
However, the rate of growth is nearly constant, indicating there is a single significant mechanism for making stories visible to voters, namely the social network. The probability that the next retweet is from a fan, shown in Fig. 8.3(d), rises slowly from around $p = 0.4$ to $p = 0.55$. This value is lower than the pre-promotion probability of the next fan vote on Digg. The rate of interest spread appears to depend on the density of the network. Initially, Digg stories spread faster through the social network than stories on Twitter, because of Digg’s denser network structure, but after promotion they spread much more slowly as unconnected users see and vote on the stories.

The dashed lines in Fig. 8.3(a) and (c) show how the median number of votes from the submitter’s fans or followers changes with voting. By the time a story accumulates 50 votes on Digg (at which point some of the stories are promoted to the front page), about half of the votes are from the submitter’s fans, and another 10 are from fans of prior voters but not the submitter. After a story receives about 100 votes (by which point most of the stories are promoted), the number of votes from the submitter’s fans changes very slowly, while the number of fan votes continues to grow. This indicates that the submitter’s fans vote for the story during its early stages and that users pay attention to the stories their friends submit. On Twitter, the initial votes come from the submitter’s fans, but their accumulation slows significantly later.

8.1.4 How Popular is a Story?

Figure 8.4: Distribution of story popularity. (a) Digg: distribution of the total number of votes received by Digg stories, with the line showing a log-normal fit. The plot excludes the 15 stories that received more than 6,000 votes. (b) Twitter: distribution of the total number of times stories in the Twitter data set were retweeted, with the line showing a log-normal fit.

The total number of times the story was voted for and retweeted reflects its popularity among Digg and Twitter users respectively.
The distribution of story popularity on either site, Figure 8.4, shows the ‘inequality of popularity’ [157], with relatively few stories becoming very popular, accruing thousands of votes, while most are much less popular, receiving fewer than 500 votes (see footnote 2). The most common number of votes received by a story is around 500 on Digg and 400 on Twitter. These values are well described by a log-normal distribution (shown as the red line in the figure).

The log-normal distribution of story popularity is typical of the “heavy-tailed” distributions associated with social production and consumption of content. In a heavy-tailed distribution a small but non-vanishing number of items generate an uncharacteristically large amount of activity. These distributions have been observed in a variety of contexts, including voting on Digg [185] and Essembly [83], edits of Wikipedia articles [183], and music downloads [157]. Understanding the origin of such distributions is the next challenge in modeling user activity on social media sites.

8.1.5 How Far does a Story Spread on the OSN?

Figure 8.5(a) and (b) show the distribution of fan votes generated by Digg and Twitter stories. These distributions are markedly different from the distribution of story popularity shown in Fig. 8.4. Although the distribution of network cascades of Digg stories, Fig. 8.5(a), is slightly asymmetrical, it is best described by a normal distribution with mean and standard deviation equal to 104.27 and 32.31 votes respectively, not the log-normal distribution in Fig. 8.4(a). It is also unlike the distribution of cascade sizes in a blog post network, which follows a power law [119]. Remarkably, there are no stories that did not generate a cascade, i.e., which did not receive any fan votes. The inset in Figure 8.5(a) shows the distribution of votes from the submitter’s fans only. It is also described by a normal function with a mean around 50 votes.
A small fraction of stories, fewer than 400, did not have any votes from the submitter’s fans. This indicates that active users who are fans of the submitter are also fans of other voters, i.e., that the social network of active Digg users is dense and highly interlinked. This observation is supported by the finding of a relatively high clustering coefficient of the Digg social network.

Footnote 2: This distribution applies to Digg’s front page stories only. Stories that are never promoted to the front page receive very few votes, in many cases just a single vote from the submitter.

Figure 8.5: Distribution of story cascade sizes. (a) Digg: histogram of the distribution of the total number of fan votes received by Digg stories (size of the interest cascade). The inset shows the distribution of the number of votes from the submitter’s fans. (b) Twitter: histogram of the distribution of the total number of retweets from followers. The inset shows the distribution of the number of retweets of a story from the submitter’s followers.

The distribution of the number of retweets from followers of Twitter stories is shown in Fig. 8.5(b). These also appear to be normally distributed, although a substantial number of stories do not spread on the network. This distribution is broader than that of Digg stories, which indicates that stories spread farther on the Twitter network. The distribution of the number of votes cast by the submitter’s followers, shown in the inset in Fig. 8.5(b), is markedly different from Digg. The vast majority of the stories did not receive any votes from the submitter’s followers, indicating that the submitter’s and other voters’ followers are disjoint. This observation is supported by our finding that the Twitter social network is sparsely interconnected.

8.2 Quantitative Framework for Measuring Cascades

Figure 8.6: A toy example of an information cascade on a network. Nodes are labeled in the temporal order in which they are activated by the cascade.
The nodes that are never activated are blank. (a) The edges show the underlying friendship network. Edge direction shows the semantics of the connection, i.e., nodes are watching others to which they are pointing. (b) Two cascades on the network (shown in yellow and red). Node 1 is the seed of the first (yellow) cascade and node 2 is the seed of the second (red) cascade. Node 4 belongs to both cascades and is shown in orange.

A cascade is a sequence of activations generated by a contagion or contact process, in which nodes cause connected nodes to be activated with some probability. In analogy with the spread of an infectious disease on a network, an infected (activated) node exposes his fans to the contagion. Information or a story cascades through the network as exposed fans become infected, thereby exposing their own fans to the story, and so on. The seed of a cascade is the node that initiates the cascade. In information cascades, the seed is an independent originator of information, who then influences others to adopt, endorse, or transmit that information. We call a node that participates in a cascade a member of the cascade. A contagion process can generate multiple cascades, and a node can participate in more than one cascade, resulting in a commonly observed “collision of cascades” [119] phenomenon.

Figure 8.6(a) shows a directed network in which node 4 is a fan of 1 and 2. We call an edge $e_{ij}$ active if node $j$ is a fan of node $i$ and node $i$ is activated before node $j$. Information or influence flows from activated nodes to their fans. In the figure above, information flows from nodes 1 and 2 to 4. Therefore, edges $e_{14}$ and $e_{24}$ are active. On online social media like Digg, we observe that, as interest in a story spreads, it may generate many cascades from independent seeds. Figure 8.6(b) shows cascades on the network shown in Fig. 8.6(a), in which nodes are labeled in the order they are activated, with links showing the direction of influence.
As shown, the contagion process generates two cascades whose seeds are nodes 1 and 2, respectively. Node 4 participates in both cascades. A cascade chain is a sequence of connected nodes participating in a cascade. Each node in the cascade chain is influenced by all the nodes in the chain activated before it and influences all the successive nodes in the chain. The length of the longest chain is the diameter of the cascade [119]. The spread of the cascade is the maximal branching number of its participants, i.e., the maximum number of nodes a single member infects. The diameter of the contagion process in Fig. 8.6(b) is two (the longest chain is $1 \to 3 \to 6$), the spread of cascade 1 (yellow) is 4 and that of cascade 2 (red) is 2.

In Section 8.2.1, we define the cascade generating function, which describes how information spreads through the network. We show that this function can be used to compute a cascade’s macroscopic properties, such as its size, diameter, number of paths in the cascade, etc. We demonstrate the use of the cascade generating function to study the dynamics of cascades in Sections 8.2.2 and 8.2.3. In Section 8.2.4 we also apply it to study large information cascades occurring on a real-world social network of Digg using the dataset described in Section 2.3. Stories propagate on Digg’s social network through a series of cascades as users influence their fans to vote for the story [107, 63]. We study the distribution of several macroscopic properties of these cascades. In addition, we study the microscopic and meso-scopic dynamics of their temporal evolution. Time plots of the cascade generating function show several characteristic signatures of cascade growth, such as star-like, chain-like and community-like growth.

8.2.1 Characterizing Cascades

We characterize a cascade mathematically by the cascade generating function, $\phi(j, \alpha_{j,i})$, which describes how activation spreads through the network.
The contagion process is parameterized by the transmission rates $\alpha_{j,i}$, $\forall j, i \in [1, N]$, which give the probability that a node $i$ activated at time $t_i$ will activate a connected node $j$ at a later time $t_j$. Though, in principle, $\alpha_{j,i}$ could be different for different values of $i$ and $j$, for simplicity, we assume that they are all the same, i.e., $\alpha_{j,i} = \alpha$. Note that since the nodes are labeled in the temporal order of their activation, $\phi(j, \alpha_{j,i})$ characterizes the cascade at time $t_j$.

We use the contagion process shown in Fig. 8.6(b) to illustrate how the cascade generating function is calculated. The initial value of the cascade function is some constant. In the example, nodes 1 and 2 are seeds; therefore, the values of the cascade function at the times they are activated are constant. While these values may be different, for convenience we set them both to one: $\phi(1, \alpha) = \phi(2, \alpha) = 1$. The value of $\phi$ captures the cumulative effect on node $j$ of activated nodes that are connected to $j$. Node 3 is connected to 1 and activated by it with probability $\alpha$; therefore, $\phi(3, \alpha) = \alpha\,\phi(1, \alpha)$. At the time node 4 is activated, the cascade function is $\phi(4, \alpha) = \alpha\,\phi(1, \alpha) + \alpha\,\phi(2, \alpha)$. Nodes continue to activate others in this fashion. At time $t_6$, the cascade function is $\phi(6, \alpha) = \alpha\,\phi(1, \alpha) + \alpha\,\phi(3, \alpha)$. Since $\phi(3, \alpha)$ only depends on $\phi(1, \alpha)$, $\phi(6, \alpha)$ can be rewritten as $\phi(6, \alpha) = \alpha\,\phi(1, \alpha) + \alpha^2\,\phi(1, \alpha)$.

In general terms, if node $i$ is a node activated at time $t_i$, the value of the cascade generating function at a later time $t_j$ when node $j$ is activated is:

$$\phi(j, \alpha) = \sum_{i \in \mathrm{friend}(j)} \alpha\,\phi(i, \alpha) \qquad (8.1)$$

where $\mathrm{friend}(j)$ is the set of nodes connected to node $j$ that are activated before it. Since links are directed, without loss of generality, we can assume that there are $K$ cascades in a contagion process. Let $\phi(i_1, \alpha), \phi(i_2, \alpha), \ldots, \phi(i_K, \alpha)$ be the weights of their seeds. Then, Eq.
8.1 reduces to

$$\phi(j, \alpha) = \sum_{i \in \mathrm{friend}(j)} \alpha\,\phi(i, \alpha) = \sum_{p=1}^{K} f(j, i_p, \alpha)\,\phi(i_p, \alpha) \qquad (8.2)$$

The value of $\phi(j, \alpha)$ is proportional to the cumulative effect or influence of all cascades on node $j$ activated at time $t_j$ and can be described using the vector $(f(j, i_1, \alpha), \ldots, f(j, i_K, \alpha))$. Here $f(j, i_p, \alpha)$ captures the cumulative effect of the cascade generated at seed node $i_p$ on the node $j$, where $t_j > t_{i_p}$. In Fig. 8.6(b), at time $t = 4$, $\phi(4, \alpha) = f(4, 1, \alpha)\,\phi(1, \alpha) + f(4, 2, \alpha)\,\phi(2, \alpha)$, where $f(4, 1, \alpha) = \alpha$ and $f(4, 2, \alpha) = \alpha$. At time $t = 6$, $\phi(6, \alpha) = f(6, 1, \alpha)\,\phi(1, \alpha) + f(6, 2, \alpha)\,\phi(2, \alpha)$. Here $f(6, 1, \alpha) = \alpha + \alpha^2$ and $f(6, 2, \alpha) = 0$.

If the values of the cascade generating function for nodes $i$ and $j$ are the same, $\phi(i, \alpha) = \phi(j, \alpha)$, the nodes $i$ and $j$ are isomorphic with respect to the contagion process. Such nodes are structurally similar with respect to the cascade; therefore, the value of the cascade function is independent of the order in which they are activated. By structural similarity, we mean that in a network comprising only the activated nodes and the active edges between them, the topological distance of two isomorphic nodes from all the seeds is the same. Here, the topological distance of a node from the seed is measured in terms of the total number of attenuated paths over active edges. Isomorphic nodes can be grouped together in a tier with its own characteristic $\phi(\alpha)$. In the contagion process in Fig. 8.6(b), nodes 3 and 7 are isomorphic and form a tier with value $\phi(\alpha) = \alpha\,\phi(1, \alpha)$.

Cascade properties. We can use the cascade generating function to compute the macroscopic properties of cascades, such as their size, diameter, number of paths, and their average length. If we take $\phi(i_p, \alpha) = 1$, where $i_p$ is the seed of the $p$-th cascade activated at time $t_{i_p}$, then the total number of paths from $i_p$ to node $j$ is equal to $f(j, i_p, 1)$ in Eq. 8.2.
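The recursion of Eqs. 8.1 and 8.2 can be sketched for the toy contagion process of Fig. 8.6(b) as follows. This is a minimal illustration, not the efficient framework of Appendix A. The function names and the `friends` dictionary are hypothetical, and the toy connections (3 and 7 follow 1; 4 follows 1 and 2; 5 follows 2; 6 follows 1 and 3) are read off the worked values in the text. Exact fractions are used so that isomorphic nodes compare equal.

```python
from fractions import Fraction


def cascade_vectors(friends, alpha=Fraction(1, 2)):
    """Per-seed components f(j, i_p, alpha) of the cascade generating function.

    friends[j] lists the nodes that j is a fan of; nodes are integers labeled
    in activation order, so only friends with a smaller label count (Eq. 8.1).
    Returns {node: {seed: f(node, seed, alpha)}}, with seed weights set to 1.
    """
    vectors = {}
    for j in sorted(friends):
        earlier = [i for i in friends[j] if i < j]
        if not earlier:                      # nobody to be influenced by: a seed
            vectors[j] = {j: Fraction(1)}
            continue
        acc = {}
        for i in earlier:
            for seed, value in vectors[i].items():
                acc[seed] = acc.get(seed, 0) + alpha * value
        vectors[j] = acc
    return vectors


def tiers(vectors):
    """Group non-seed nodes whose per-seed vectors are identical (isomorphic nodes)."""
    groups = {}
    for node, vec in vectors.items():
        if vec == {node: 1}:                 # skip seeds
            continue
        groups.setdefault(tuple(sorted(vec.items())), []).append(node)
    return sorted(groups.values())


# Toy contagion process assumed from Fig. 8.6(b).
toy = {1: [], 2: [], 3: [1], 4: [1, 2], 5: [2], 6: [1, 3], 7: [1]}
f_half = cascade_vectors(toy)               # alpha = 0.5
f_one = cascade_vectors(toy, alpha=1)       # alpha = 1 counts paths from each seed
toy_tiers = tiers(f_half)
```

With these assumptions `f_half[6]` holds the single component $\alpha + \alpha^2 = 3/4$ for seed 1, `f_one[6]` gives the two paths from seed 1 to node 6, and `toy_tiers` places nodes 3 and 7 in the same tier.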
The total length of paths from the seed $i_p$ to $j$, $l(j, i_p)$, can be obtained by differentiating with respect to $\alpha$ and evaluating the derivative at $\alpha = 1$, i.e., $l(j, i_p) = \left.\frac{d f(j, i_p, \alpha)}{d\alpha}\right|_{\alpha=1}$, so that

$$\left.\frac{d\phi(j, \alpha)}{d\alpha}\right|_{\alpha=1} = \sum_{p=1}^{K} \left.\frac{d f(j, i_p, \alpha)}{d\alpha}\right|_{\alpha=1} \phi(i_p, \alpha) = \sum_{p=1}^{K} l(j, i_p)\,\phi(i_p, \alpha) \qquad (8.3)$$

To illustrate this, consider again the contagion process shown in Fig. 8.6(b). For example, if we pick node 6, there are two paths from the seed (node 1) to node 6: $1 \to 3 \to 6$ and $1 \to 6$. The total length of these paths is three. There are no paths from the second seed (node 2) to node 6. We can also get this answer from:

$$\left.\frac{d\phi(6, \alpha)}{d\alpha}\right|_{\alpha=1} = \left.\frac{d\,(\alpha + \alpha^2)\,\phi(1, \alpha)}{d\alpha}\right|_{\alpha=1} = 3.$$

We can use similar reasoning to compute other cascade properties. The average path length, $l_{av}$, is given by:

$$l_{av} = \frac{\sum_j \sum_{p=1}^{K} l(j, i_p)}{\sum_j \sum_{p=1}^{K} f(j, i_p, 1)} = \left.\frac{\dfrac{d \sum_j \phi(j, \alpha)}{d\alpha}}{\sum_j \phi(j, \alpha)}\right|_{\alpha=1} \qquad (8.4)$$

The diameter of the contagion process is the length of the longest path of any cascade generated by this process. It is given by modifying Eq. 8.4:

$$l_{max} = \max_{j \in [2, N]} \frac{\dfrac{d\phi_{min}(j, \alpha)}{d\alpha}}{\phi_{min}(j, \alpha)}, \qquad (8.5)$$

where $\phi_{min}(j, \alpha) = \min_{i \in \mathrm{friend}(j)} \alpha\,\phi_{min}(i, \alpha)$. The computational framework for efficiently computing the cascade generating function is described in Appendix A.

8.2.2 Analyzing Cascades

Plotting the cascade generating function $\phi(j, \alpha)$ vs time ($j$) shows how the structure of the cascade evolves over time. Fig. 8.7 illustrates the cascade plots computed for a variety of contagion processes, which include several prototypes of cascades frequently observed in recommendation and blog networks [119, 120]. We label nodes in the order in which they are activated and take $\phi(i, \alpha) = 1$ when $i$ is the seed of the cascade, thus giving equal weights to all cascades in the contagion process. Without loss of generality, in this study, we set the value of $\alpha$ to 0.5. Future work includes estimation of the transmission rate empirically from the network.
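As a minimal sketch of how a cascade plot and the path totals of Eqs. 8.3 and 8.4 can be obtained together, the snippet below stores $\phi(j, \alpha)$ as a polynomial in $\alpha$, i.e., a count of paths for each path length. The helper names are hypothetical, and the three prototypes (a star, a chain and a clique) are assumed to have six nodes each with node 1 as the only seed.

```python
from collections import Counter


def path_length_counts(friends):
    """counts[j][L] = number of length-L paths from any seed to node j.

    With unit seed weights, phi(j, alpha) = sum over L of counts[j][L] * alpha**L.
    """
    counts = {}
    for j in sorted(friends):
        earlier = [i for i in friends[j] if i < j]
        if not earlier:
            counts[j] = Counter({0: 1})          # seed: phi = 1 = alpha**0
            continue
        counts[j] = Counter()
        for i in earlier:
            for length, n_paths in counts[i].items():
                counts[j][length + 1] += n_paths  # one more hop multiplies by alpha
    return counts


def summarize(friends, alpha=0.5):
    """Return (cascade plot values, total number of paths, total path length)."""
    counts = path_length_counts(friends)
    phi = [sum(n * alpha ** length for length, n in counts[j].items())
           for j in sorted(counts)]
    non_seeds = [j for j in counts if any(i < j for i in friends[j])]
    total_paths = sum(sum(counts[j].values()) for j in non_seeds)       # f at alpha = 1
    total_length = sum(length * n for j in non_seeds
                       for length, n in counts[j].items())             # d/d(alpha) at 1
    return phi, total_paths, total_length


# Assumed six-node prototypes; friends[j] lists the nodes that j is a fan of.
star = {1: [], 2: [1], 3: [1], 4: [1], 5: [1], 6: [1]}
chain = {1: [], 2: [1], 3: [2], 4: [3], 5: [4], 6: [5]}
clique = {j: list(range(1, j)) for j in range(1, 7)}

star_phi, star_paths, star_len = summarize(star)
chain_phi, chain_paths, chain_len = summarize(chain)
clique_phi, clique_paths, clique_len = summarize(clique)
```

Under these assumptions the star's values stay flat at 0.5, the chain's values halve at every step, and the clique's values grow with each activation. The path totals it produces (5 and 5, 5 and 15, 31 and 80) agree with the values listed in Fig. 8.7 for cascades (1)–(3), and the average path length of Eq. 8.4 is the ratio of total length to total paths.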
We show that cascade plots contain as much information as cascade graphs, but can be used to analyze the structure and evolution of even large cascades, for which visualization is not feasible.

Figure 8.7: Analysis of various cascades. Nodes are labeled by the order in which they are activated by the contagion process. Row (a) shows cascade plots obtained by computing the cascade generating function, $\phi$, at different times. Row (b) shows the corresponding contagion process. Different cascades within the same contagion process are shown in different colors. Row (c) shows some of the numeric properties of the cascades, and row (d) shows sets of isomorphic nodes. Rows (c) and (d) for cascades (1)–(14):

Cascade | tot. paths | tot. len | av. len | diam. | isomorphic node groups
(1) | 5 | 5 | 1 | 2 | {1},{2,3,4,5,6}
(2) | 5 | 15 | 3 | 5 | {1},{2},{3},{4},{5},{6}
(3) | 31 | 80 | 2.58 | 5 | {1},{2},{3},{4},{5},{6}
(4) | 3 | 4 | 1.33 | 2 | {1},{2,3},{4}
(5) | 3 | 4 | 1.33 | 2 | {1},{2,3},{4}
(6) | 3 | 4 | 1.33 | 2 | {1},{2,4},{3}
(7) | 3 | 5 | 1.67 | 2 | {1},{2},{3,4}
(8) | 5 | 6 | 1.2 | 2 | {1},{2,3,4,5},{6}
(9) | 5 | 7 | 1.4 | 2 | {1},{2,4,6},{3,5}
(10) | 5 | 7 | 1.4 | 2 | {1},{2},{3,4}
(11) | 10 | 15 | 1.50 | 2 | {1},{2},{3,4,6,7},{5}
(12) | 7 | 8 | 1.14 | 2 | {1,2},{3,7},{5},{4},{6}
(13) | 41 | 154 | 3.76 | 8 | {1,5,11,12},...,{8,9},...,{18}
(14) | 12 | 25 | 2.08 | 4 | {1,3,7},{2},{4},{5},{6},{8}

In addition to showing the cascade plot for each cascade (row (a)), Fig. 8.7 also reports some of the macroscopic properties of the cascade (row (c)), such as the total number of paths and their length, average path length, and diameter. Note that this is not an exhaustive list of the properties that can be calculated using the cascade characterization function.
Row (d) lists groups of isomorphic nodes in each cascade.

Cascades (1)–(3) in Fig. 8.7 are three of the commonly observed patterns: a star (Fig. 8.7(1)), a chain (Fig. 8.7(2)), and a community (clique) (Fig. 8.7(3)). In the star-like contagion process, Fig. 8.7(1), nodes activated by $n_1$ have the same value of $\phi$ and form an isomorphic group. Interchanging the order of their activation does not affect the value of $\phi$ or the cascade plot. In the chain-like contagion process, the cascade function decreases as the chain becomes longer. There are no isomorphic nodes. In the clique-like contagion process, the value of the cascade function grows in time as more paths are created in the cascade. There are also no isomorphic nodes in this cascade.

In the contagion process in Fig. 8.7(4), nodes activated at $t = 2$ and $t = 3$ are isomorphic; therefore, the evolution of this cascade is indistinguishable from the cascade shown in Fig. 8.7(5). However, if the shape of the cascade is the same, but nodes are activated in a different order as in Fig. 8.7(6), the cascade plot and its structure are different. This is because in the contagion processes (4) and (5), the cascade widens first (it is star-like) before lengthening, while in the contagion process (6), the cascade lengthens first (it is chain-like) before widening. Similarly, the cascade (7) first deepens, then widens, the opposite of cascade (8), while cascade (9) alternates between deepening and widening. In none of these cascades (except (3)) are there multiple paths to a node. Once this happens, as in cascades (10) and (11), the value of the cascade function increases.

We can also disentangle multiple cascades co-occurring in a contagion process. Contagion processes (12)–(14) contain multiple cascades, whose cascade functions are shown in different colors.
Note that in the contagion process (12), node 4 is isomorphic to 3 and 7 with respect to the cascade initiated by 1, and it is isomorphic to 5 with respect to the cascade initiated by 2.

8.2.3 Reconstructing Cascades

The cascade generating function, $\phi$, compresses information and has a space complexity of $O(KN)$, where $K$ is the number of seeds in the contagion process. How well does the compressed representation capture the contagion process?

Using $\phi$ with $0 < \alpha < 1$, a tier-level reconstruction of the contagion process is possible. This reconstruction does not remove the degeneracy of isomorphic nodes. The temporal ordering of the nodes helps us to fine-tune the tier-level reconstruction. Additional information, such as the number of nodes $m$ and their indegree and outdegree, can help us further improve the approximation. In all the examples shown in Fig. 8.7, using just $\phi$, we are able to obtain the exact tier-level reconstruction.

To illustrate, consider Fig. 8.6. Taking $\phi(1) = \phi(2) = 1$, we get $\phi(3, \alpha) = \phi(7, \alpha) = (\alpha, 0)$, where $\alpha$ is the value of the cascade function for the cascade initiated by seed 1, and 0 is the value of the cascade function for the cascade initiated by the second seed, node 2. Likewise, $\phi(4, \alpha) = (\alpha, \alpha)$, $\phi(5, \alpha) = (0, \alpha)$ and $\phi(6, \alpha) = (\alpha + \alpha^2, 0)$. Hence we can reconstruct that nodes 1 and 2 are independent seeds, 3 and 7 are connected to only node 1 and 5 is connected to only node 2. Node 4 is connected to both 1 and 2. Node 6 is connected to 1 and to the tier of nodes containing nodes 3 and 7. However, due to the temporal arrangement of the nodes, we know that 6 is activated before 7; hence it is necessarily connected to node 3. Thus we are able to obtain the exact reconstruction of the cascade.

In Fig. 8.7, using just $\phi$, we are able to obtain the exact tier-level reconstruction for cases 1, 4, 5, 7, 8, 9, 10, 11 and 13. In most of these cases we are also able to disambiguate between isomorphic nodes in the same tier.
For cases 2, 3, 6, 12, and 14, we are also able to obtain an exact node-level reconstruction of the cascade graph.

Space and time complexity. Clearly, as demonstrated by the discussion above (and in Appendix A), knowing the values of $\phi$ at different times allows us to deduce the dynamics of a cascade and reconstruct its structure (up to the degeneracy that exists for isomorphic nodes). Storing the shape of the cascade has $O(N^2)$ space complexity. However, as demonstrated above, the cascade generating function can reconstruct this shape with a high degree of accuracy. Having a pseudo-linear space complexity $O(KN)$, it provides an efficient compression of this information. Besides, this model is general, because the same model can be used to investigate cascades in information flow, epidemics, computer viruses, and so on. This method is fast, having $O(dKN)$ runtime complexity even in its naive implementation, where $d$ is the maximum degree of any node. Moreover, the cascade generating function of a node activated at time $t$ depends only on the cascade values of his friends activated before him. Hence it can be calculated in real time and is appropriate even for applications which require streaming, online or near real-time analysis of cascades.

8.2.4 Digg Case Study

Apart from the social network, there can be other means through which the story can reach a user. For instance, the user could independently find it on one of Digg’s web pages or through a link from an external site. If a user finds the story through means other than the friends interface, he becomes an independent seed for another cascade. Not all seeds, however, generate non-trivial cascades. If a voter is unconnected or does not influence at least one of his fans to vote, the story does not spread. An independent user who generates a non-trivial cascade is its active seed. On online social media like Digg, we observe that, as interest in a story spreads, it may generate many cascades from independent active seeds.
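The distinction between seeds and active seeds can be sketched as follows. This is an illustrative reading of the definitions above rather than the procedure actually applied to the Digg data: the function name, the `friends_of` mapping (user to the set of users that user is a fan of) and the sample voters are hypothetical.

```python
def find_seeds(voters, friends_of):
    """Split a story's voters into seeds and active seeds.

    voters     -- user ids in the temporal order of their votes
    friends_of -- dict mapping a user id to the set of users that user is a fan of

    A seed is a voter who is not a fan of any earlier voter. An active seed is
    a seed with at least one later voter among its fans, so that its cascade
    is non-trivial.
    """
    voted_before = set()
    seeds = []
    influencers = set()      # voters who have at least one fan voting after them
    for voter in voters:
        sources = friends_of.get(voter, set()) & voted_before
        if sources:
            influencers |= sources
        else:
            seeds.append(voter)
        voted_before.add(voter)
    active_seeds = [s for s in seeds if s in influencers]
    return seeds, active_seeds


# Hypothetical story: "b" is a fan of "a"; "d" is a fan of "z", who never votes.
story_voters = ["a", "b", "c", "d"]
fan_links = {"b": {"a"}, "d": {"z"}}
seeds, active_seeds = find_seeds(story_voters, fan_links)
```

In this example "a", "c" and "d" are seeds, but only "a" is an active seed, because it is the only one that influences a fan to vote.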
A story will typically generate multiple, even hundreds of, cascades as shown in Section 8.2.4.2.

We use the framework described above to study information spread on the Digg dataset described in Section 2.3 in greater detail. We treat each story as an independent contagion process. We arrange all voters in the temporal order in which they voted for the story and extract the underlying social network of these voters. Let $n_s$ be the number of active seeds of the contagion process of a story $s$. We take each active seed to be independent of other seeds. Therefore, we can quantitatively characterize the cascade by an $n_s \times 1$ vector $c_j$ for every node $j$ participating in the contagion process. The transmission rate $\alpha$ can be derived empirically from the network. In this work, without loss of generality, we set the value of $\alpha$ to 0.5. We use the framework described above to study the macroscopic properties of cascades on Digg, such as the distribution of cascade size, diameter, etc. We also study the dynamics of evolution of cascades associated with some sample stories.

Figure 8.8: Cascade plots of the top three cascades for four stories: (a) Story 1, (b) Story 2, (c) Story 3, (d) Story 4. The left set of plots in each figure shows cascade evolution in the early stages of the contagion process, while the right set of plots shows cascade evolution over the entire time period. The red dot shows the time when the cascade seed was activated.

8.2.4.1 Microscopic Cascade Characteristics

The cascade generating function is an effective tool for analyzing the microscopic dynamic signatures of cascades. In previous works, this was done by visualizing individual cascades [119] or by creating a generative model of the contagion process [119, 181]. Visualization, however, quickly becomes difficult, even for moderately-sized cascades.
Generative models [121, 4] are ad hoc in nature, and while they are designed to produce cascades with macroscopic properties similar to those of the observed cascades, they are not guaranteed to reproduce their microscopic characteristics. For example, [121] assumes that graphs are generated using a non-standard matrix operation, the Kronecker product. These generated graphs replicate the heavy-tailed characteristics of real graphs, but the Kronecker product does not necessarily represent the actual microscopic properties of the real graph accurately. The cascade generating function, on the other hand, allows us to study microscopic properties of even very large cascades without the need to visualize them.

We illustrate the use of cascade plots to study the microscopic dynamics of cascades with four different stories. Story 1, titled “‘Infomercial King’ Billy Mays Dead at 50”, was submitted by a user who had 760 fans. This story was among the most popular in our data set, receiving 8,471 votes, of which 1,244 were from fans. The contagion process of this story generated 853 cascades. Its diameter was 46, spread 412, and the average path length 24. Fig. 8.8(a) shows the evolution of the cascade function $\phi(t)$ of the top three cascades, ranked by their largest $\phi$ value. The left-hand set of plots shows the early dynamics of the cascade ($t < 100$), while the right-hand set of plots shows cascade dynamics over the entire time period. The seed of the cascade is shown in red. The top cascade attains its largest value of $\phi = 3.554 \times 10^{7}$. This cascade started early in the contagion process. Though the seeds of the next two cascades were also activated within the first 100 votes, these cascades did not start growing until later. Values of $\phi > 1$ imply that the voter is a fan of two or more previous voters. Large values of $\phi$ in Fig. 8.8(a) indicate a community effect (cf. Fig. 8.7(3)). This implies that information is spreading within an interconnected fan network.
Though initially the three cascades of Story 1 are very different, in their later stages they become increasingly similar. This is due to mixing caused by the “collision of cascades,” which happens when the same nodes participate in different cascades.

The popularity of Story 2, titled “Bender’s back,” is comparable to the popularity of Story 1. Story 2 received 8,034 votes, of which 1,464 were from fans, and generated 722 cascades. Its diameter was 26, spread 401, and the average path length 12. Fig. 8.8(b) shows both the early and late-stage dynamics of the top three cascades generated by this story. However, the largest value of $\phi$ attained by any cascade was just $\phi = 1859.4$, four orders of magnitude smaller than for Story 1. This indicates a much lower connectivity of the underlying fan network. In the three dominant cascades of this story, $\phi$ does not rise above 2.5 during the first 100 votes. Low values of $\phi$ in the initial stages of cascade evolution imply a chaining effect (cf. Fig. 8.7(2)), or cascade growth by deepening. Unlike Story 1, here the seed of the dominant cascade is the 19th voter. However, as seen from the larger values of $\phi$ in the cascade plots, in the later stages of information spread the community effect also comes into play.

The third story, in Fig. 8.8(c), is titled “Play Doctor On Yourself: 16 Things To Do Between Checkups.” While this story was submitted by a well-connected user (with 1,701 fans), it did not become popular, receiving only 390 votes, of which 158 were from fans. This story generated 11 cascades, and its diameter was 48, spread 5, and the average path length 25. All of the first 100 voters participated in the dominant cascade, the one initiated by the submitter himself. The maximum value reached by this cascade was very high ($\phi = 7.53 \times 10^{7}$), even though this cascade was of short duration.
Unlike in the previous stories, we observed very high cascade function values already within the first 100 votes, which indicates a strong community effect and high connectivity within the fan network.

For the final illustration we consider the story titled “APOD: 2009 July 1 - Three Galaxies in Draco,” shown in Fig. 8.8(d). The submitter of this story has only 27 fans. This story is one of the least popular in our data set, receiving only 199 votes, of which 27 were from fans. This contagion process generated eight cascades; its diameter was 7, its spread 7, and its average path length 2.6. In the early stages, constant cascade function values in the dominant cascade (top plot in Fig. 8.8(d)) indicate a branching effect (cf. Fig. 8.7(1)). This implies that the cascade is growing in a star-like fashion, rather than deepening. The decreasing values for the third cascade (bottom plot in Fig. 8.8(d)) indicate a chaining effect, implying that this cascade is deepening. We do not observe the community effect in either the initial or the later stages of this contagion process. The maximum cascade function value for this story is two and the average path length is 2.6, indicating that most of the voters are fans of the submitter or of the submitter’s fans, but are not themselves interconnected.

In summary, cascade plots can tell us much about the microscopic evolution of an information cascade. Popular stories that have large participation also generated many cascades and had high spread. Initially they showed chaining and branching effects, as evidenced by cascade function values that are decreasing or staying constant in time, respectively. The community effect, manifested by growing values, is visible in later stages, when a story penetrates and then spreads through a community. The trends of the dominant cascades grow increasingly similar with time due to the mixing effect of “colliding cascades.” However, stories that do not become popular generate very few cascades and have low spread.
When the submitter is well connected, the community effect is visible in all stages of the contagion process, implying that the story spreads within the submitter’s community only. However, when the submitter is poorly connected, cascades grow by chaining and branching.

8.2.4.2 Macroscopic Cascade Characteristics

[Figure 8.9 panels: num cascades, cascade size, spread, ave. path, log(num paths), diameter]

Figure 8.9: PDF of the distribution of cascade properties: number of cascades per story, cascade size, spread, diameter, average path length, and log of the number of paths. Distributions are fitted with the stretched exponential/Weibull (black), mixture of Weibull (cyan), lognormal (red) and power law (green) functions. The double Pareto lognormal distribution (magenta) gives a very good fit for the number of cascades.

The stories in our data set generated 216,088 distinct information cascades on the Digg social network. Using the formalism described above, we calculate global properties of these cascades and plot their distribution. These properties include cascade size, spread, diameter, etc.

Table 8.1: Parameter estimates for distributions that best describe data (Weibull and Lognormal).

                    Lognormal                                              Weibull (n=1)
                    $\hat{\mu}$  $\hat{\sigma}$  lk ($\times 10^{3}$)  KS       $\hat{k}$  $\hat{\lambda}$  $\hat{\gamma}$  lk ($\times 10^{3}$)  KS
# cascades          3.57     0.96     -17.58     0.063    0.88     53.46    2.98     -17.91     0.1053
cascade size        2.06     1.43     -829.59    0.175    0.41     7.44     1.24     -672.51    0.444
spread              0.94     1.00     -509.84    0.255    0.59     2.47     0.83     -447.32    0.56
diameter            1.19     1.14     -590.45    0.186    0.55     3.43     0.91     -513.94    0.495
ave. path length    0.75     0.84     -431.77    0.262    0.6      1.54     0.90     -342.79    0.79
log # of paths      1.086    0.91     -349.58    0.392    0.717    3.107    0.848    -326.58    0.673

Table 8.2: Parameter estimates for distributions that best describe data (Power Law).

                    Power Law
                    %        $\hat{\alpha}$  $\hat{x}_{\min}$  lk ($\times 10^{3}$)  KS
# cascades          48.97    2.17     33       -9.01      0.291
cascade size        4.56     3.14     133      -55.1      0.036
spread              12.21    2.92     10       -82.4      0.081
diameter            30.96    2.11     6        -234       0.690
ave. path length    15.01    2.78     5.88     -81.57     0.850
log # of paths      2.51     1.5      21       -0.636     0.646
We have included in this thesis just some examples of the many properties that we can calculate using the cascade generating function.

To fit continuous distributions to discrete data, we treat a discrete distribution as if it were generated from a continuous probability density function and then rounded to the nearest integer. We do not use commonly used methods such as least-squares minimization, because the data span many orders of magnitude and least-squares minimization can produce substantially inaccurate parameter estimates for heavy-tailed distributions like the power law [39]. We use Maximum Likelihood Estimation (MLE) to estimate the parameters of these distributions and the Kolmogorov-Smirnov (KS) statistic to test the goodness of fit. The closer the KS statistic is to 0, the better the fit.

We study the following distributions:

lognormal: $F(x;\mu,\sigma) = 0.5\,\mathrm{erfc}\left[-\frac{\ln x-\mu}{\sigma\sqrt{2}}\right]$

Weibull: $F(x;k,\lambda,\gamma) = \left(1-e^{-(x/\lambda)^{k}}\right)^{\gamma}$

mixed Weibull: $F(x;\alpha_i,k_i,\lambda_i) = \sum_{i=1}^{n}\alpha_i\left(1-e^{-(x/\lambda_i)^{k_i}}\right)$ with $\sum_{i=1}^{n}\alpha_i = 1$

power law: $F(x;x_{\min},\alpha) = \left(\frac{x}{x_{\min}}\right)^{-\alpha+1}$

More often, the power law applies only for values greater than a certain minimum $x_{\min}$. In such cases the tail of the distribution follows the power law. Using the MLE estimates of $x_{\min}$ and the scaling parameter $\alpha$, we find what percentage of the data comprises this tail of the distribution.

We also investigated distribution fitting using the Double Pareto Lognormal distribution [152]:

$F(x;\alpha,\beta,\nu,\tau) = 0.5\,\mathrm{erfc}\left[-\frac{\ln x-\nu}{\tau\sqrt{2}}\right] - 0.5\,\frac{\beta}{\alpha+\beta}\,x^{-\alpha}A(\alpha,\nu,\tau)\,\mathrm{erfc}\left[-\frac{\ln x-\nu-\alpha\tau^{2}}{\tau\sqrt{2}}\right] + \frac{\alpha}{\alpha+\beta}\,x^{\beta}A(-\beta,\nu,\tau)\left(1-0.5\,\mathrm{erfc}\left[-\frac{\ln x-\nu+\beta\tau^{2}}{\tau\sqrt{2}}\right]\right)$

where $A(\theta,\nu,\tau) = e^{\theta\nu+\theta^{2}\tau^{2}/2}$. The Double Pareto Lognormal (DPLN) distribution with $\alpha = 2.8$, $\beta = 1.9$, $\nu = 3.0941$, $\tau = 0.3119$ gives the best fit for the number of cascades (better than any of the distributions shown in Tables 8.1 and 8.2), with a likelihood of $-15.234 \times 10^{3}$ and a KS statistic of 0.0109.

Fig. 8.9 shows the distribution of several macroscopic properties of the information cascades on Digg, along with the functions that best describe them.
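The fitting procedure described above (maximum likelihood estimation followed by a KS goodness-of-fit check) can be sketched for the lognormal case as follows. This is a minimal illustration rather than the code used in the thesis: it runs on synthetic continuous data instead of the rounded Digg counts, and the function names `fit_lognormal_mle`, `lognormal_cdf` and `ks_statistic` are hypothetical.

```python
import math
import random


def fit_lognormal_mle(samples):
    """Closed-form lognormal MLE: mean and standard deviation of ln(x)."""
    logs = [math.log(x) for x in samples]
    mu = sum(logs) / len(logs)
    var = sum((v - mu) ** 2 for v in logs) / len(logs)  # MLE divides by n
    return mu, math.sqrt(var)


def lognormal_cdf(x, mu, sigma):
    """F(x; mu, sigma) = 0.5 * erfc(-(ln x - mu) / (sigma * sqrt(2)))."""
    return 0.5 * math.erfc(-(math.log(x) - mu) / (sigma * math.sqrt(2)))


def ks_statistic(samples, cdf):
    """Largest gap between the empirical CDF and the model CDF."""
    xs = sorted(samples)
    n = len(xs)
    gap = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        gap = max(gap, abs(f - i / n), abs((i + 1) / n - f))
    return gap


# Synthetic data with known parameters (an assumption made for illustration).
rng = random.Random(0)
data = [rng.lognormvariate(1.0, 0.5) for _ in range(5000)]

mu_hat, sigma_hat = fit_lognormal_mle(data)
ks = ks_statistic(data, lambda x: lognormal_cdf(x, mu_hat, sigma_hat))
print(mu_hat, sigma_hat, ks)
```

A KS value close to 0 indicates that the fitted curve tracks the empirical distribution closely, which is the criterion used to compare the candidate distributions in Tables 8.1 and 8.2.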
Tables 8.1 and 8.2 show the MLE estimates of these distributions. We observe that the lognormal or stretched exponential gives a good fit to the observed distributions, and that the power law mostly (if at all) accounts for a small percentage at the tail of the distribution. This indicates that a small number of core users may not be driving information propagation in online social networks on the whole. However, as the cascade size increases, some users may have disproportionate influence on information propagation. Future work includes delving deeper into the probable causes of these distributions.

8.2.4.3 Mesoscopic Cascade Characteristics

For each story on Digg, using the methodology proposed in Section 8.2.1, we extracted the cascade that starts with the submitter and includes all voters who are connected to the submitter either directly or indirectly via the fan network. We call this the principal cascade of the story. The cascade generating function helps us to identify not only local properties (like the evolution of a cascade) and global properties (like the distribution of sizes of all the cascades in all the contagion processes), but also intermediate properties like the size distribution of principal cascades. Figure 8.10 shows the distribution of principal cascade sizes of stories in our sample. This distribution is well described by a log-normal function with a mean of 156. Note that most of the cascades are smaller than 500, and only three are bigger than 1,000.

Figure 8.10: Distribution of principal cascade size.

We observe that the information cascades are very small. In our sample, only one cascade, about Michael Jackson’s death, can be said to have reached epidemic proportions, i.e., to have reached a significant fraction of active Digg users (in this case, about 5%). The majority of the cascades for the remaining stories reached fewer than 1% of active Digg users.
This observation becomes more striking in the next section, where we show that typical epidemic models predict that stories will reach an order of magnitude more voters than we observe on Digg.

The measurements made using this quantitative framework of the cascade generating function lead to many useful insights about online social networks. The framework helps us to verify empirically whether the underlying principles of many models for predictive analytics on networks hold. It also leads to many interesting questions, such as: why are information cascades on Digg so small?

8.3 Why are Information Cascades in Digg so Small?

In this section, we attempt to solve this puzzle by examining factors that might be responsible for the stunted growth of cascades. For this purpose, we explore Digg further, both empirically and via simulations on the Digg graph and on synthetic graphs constructed to have similar properties.

There are a number of factors that could explain why information cascades on Digg are so small. Perhaps Digg users modulate the transmissibility of stories and keep them small to prevent information overload. On the other hand, transmissibility could diminish in time, either because of novelty decay [185] or because of a decrease in the visibility of stories as new stories are submitted to Digg [84]. Perhaps the structure of the network (e.g., clustering or communities) limits the spread of information. Or it could be that the mechanism of social contagion, i.e., how people decide to vote for a story once their friends vote for it, prevents stories from growing on Digg. In addition, users are active at different times, and the heterogeneity of their activity could be another explanation.

Here, we examine some of these alternative hypotheses through simulations of contact processes on networks and an empirical study of real cascades on Digg. Ultimately, we are able to identify factors that allow us to closely reproduce the observed behavior on Digg.
8.3.1 Network Structure: Clustering Effect

Figure 8.11: For nodes who were exposed to a story, the average number of friends who voted on the story.

In Section 8.1, we showed that the social network of Digg seems to be denser than that of Twitter. However, a traditional measure of graph clustering like the (Watts-Strogatz) clustering coefficient, which is based on the number of triangles in a graph, yields an unremarkable 0.0924 for the Digg graph. In practice, we find clustering effects to be far more pronounced than this measure suggests, potentially reflecting high variance of the clustering coefficient across all nodes. For every node that sees a story from one of its friends, we count the total number of that node’s friends who voted on the story (or, if the node itself voted, the total number of friends who voted on the story before it did). From the distribution of this quantity in Figure 8.11, we see that a solid majority of 63% have more than one friend voting on a story, with some having dozens of friends voting before they do. This is especially remarkable when one considers theoretical results that model social contagion as a branching process, e.g., a Galton-Watson process [88]. That model assumes cascades spread in a tree-like fashion, so that each node has only one friend voting before it.

8.3.1.1 Synthetic Graph Construction

Figure 8.1(a) shows that the distribution of the number of active fans per user in the social network of Digg is well approximated by a power law whose exponent is approximately 2, i.e., roughly $p(k) \propto k^{-2}$. To study the effect of graph structure on the spread of cascades, we construct a graph with the same number of nodes and require that each node has the same degree as its counterpart in the Digg graph. We used the directed configuration model from [136] to create a random graph with a given degree sequence. This method preserves the degree distribution of the original Digg graph while destroying degree correlations and cluster structure in the graph.
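The randomization step can be illustrated with a small stub-matching sketch. This is not the implementation from [136]; it is a simplified illustration under stated assumptions: nodes are integers, degrees are given as two lists, and self-loops or repeated edges are tolerated. The function name `randomize_directed_graph` is hypothetical.

```python
import random


def randomize_directed_graph(in_deg, out_deg, seed=None):
    """Return random (source, target) edges that preserve each node's in- and out-degree.

    Every node contributes one "out stub" per outgoing edge and one "in stub"
    per incoming edge; shuffling the in stubs and pairing them with the out
    stubs rewires the graph at random. Self-loops and repeated edges may
    occur, as is usual for plain stub matching.
    """
    if sum(in_deg) != sum(out_deg):
        raise ValueError("in- and out-degree sequences must have equal sums")
    rng = random.Random(seed)
    out_stubs = [node for node, d in enumerate(out_deg) for _ in range(d)]
    in_stubs = [node for node, d in enumerate(in_deg) for _ in range(d)]
    rng.shuffle(in_stubs)
    return list(zip(out_stubs, in_stubs))


# Toy degree sequence (an assumption for illustration, not Digg data).
in_deg = [2, 1, 0, 1]
out_deg = [0, 1, 2, 1]
edges = randomize_directed_graph(in_deg, out_deg, seed=1)
print(edges)
```

Because only the pairing of stubs is randomized, each node keeps its degree while any correlation between neighbors' degrees, and any triangle structure, is scrambled.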
By simulating cascade processes on the original and randomized networks, we can measure the effect that graph structure has on cascade size.

8.3.1.2 Simulations using Independent Cascade Model

We begin with the independent cascade model (ICM), which is widely used to study diffusion processes on networks [158, 93, 88] and is similar to the susceptible-infected-removed (SIR) model in the epidemic literature [81, 17]. We start with a single seed node who has voted for a story. By analogy with epidemic processes, we call this node infected. The susceptible fans of the seed node decide to vote on the story with some probability given by the transmissibility $\lambda$. Since every node can only vote for the story once, at this point the seed node is removed, and we repeat the process with the newly infected nodes. Note that a node who is a fan of n voting nodes has n independent chances to become infected, but in this model, once a node votes on a story, it only has one chance to spread it to its fans before it is removed (see footnote 3). Intuitively, this assumption implies that you are more likely to vote on a story if many of your friends vote on it.

Starting with a random seed node, we generated 100,000 cascades using a transmissibility $\lambda$ picked uniformly from the range [0, 0.01] and 40,000 cascades with transmissibility picked uniformly from the range [0.01, 0.03]. Each time a node is infected, it will infect each of its fans independently with probability $\lambda$. Additionally, we model the time between seeing a story and voting for it as a random variable drawn uniformly from some interval. After some time, no new nodes are infected, and the cascade stops. Because the graph is finite, the cascade is guaranteed to stop eventually. The final number of infected nodes gives the cascade size. These are shown in Figure 8.12, where each point represents a single cascade, with the y-axis giving the final cascade size and the x-axis giving the transmissibility $\lambda$.
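The simulation loop just described can be sketched as below. This is a simplified illustration, not the thesis code: the graph is a toy dictionary mapping each node to its fans, the random delay between seeing and voting is omitted because it does not change the final cascade size, and the name `simulate_icm` is hypothetical.

```python
import random


def simulate_icm(fans, seed_node, transmissibility, rng):
    """Independent cascade: each newly infected node gets one chance to infect each fan.

    `fans` maps a node to the list of nodes that follow it. Returns the final
    cascade size (number of infected nodes, seed included).
    """
    infected = {seed_node}
    frontier = [seed_node]
    while frontier:
        next_frontier = []
        for node in frontier:
            for fan in fans.get(node, ()):
                # A fan of n voting nodes gets n independent chances overall,
                # one from each voting node it follows.
                if fan not in infected and rng.random() < transmissibility:
                    infected.add(fan)
                    next_frontier.append(fan)
        # Nodes in the old frontier are now "removed" and never spread again.
        frontier = next_frontier
    return len(infected)


# Toy fan graph (an assumption for illustration, not the Digg graph).
fans = {0: [1, 2], 1: [2, 3], 2: [3], 3: []}
rng = random.Random(42)
size_never = simulate_icm(fans, 0, 0.0, rng)   # nothing spreads
size_always = simulate_icm(fans, 0, 1.0, rng)  # everything reachable votes
print(size_never, size_always)
```

Repeating the call with many random seed nodes and transmissibility values, and recording the returned sizes, gives the kind of scatter plotted in Figure 8.12.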
We only keep cascades with more than 10 infected nodes (votes).

Footnote 3: In the epidemic literature this is equivalent to setting the recovery rate to 1 and the infection rate equal to the transmissibility $\lambda$.

Figure 8.12: Cascade size as a function of transmissibility for simulated cascades on the Digg graph and the randomized graph with the same degree distribution (see section on simulations). Heterogeneous mean field predicts cascade size as a fraction of the nodes affected. The line (hmf) reports these predictions multiplied by the total number of nodes in the Digg network.

Blue dots represent cascades on the original Digg graph, while pink dots represent cascades on a randomized version of the Digg graph. In both simulations, there exists a critical value of $\lambda$, the epidemic threshold (see footnote 4). Note that even above the epidemic threshold, cascades that start in an isolated region of the graph will die out.

Footnote 4: As stated in Chapter 4, the epidemic threshold is the value of transmissibility below which the disease dies out and above which an epidemic occurs [14].

8.3.1.3 Theoretical results

The location of the epidemic threshold is accurately calculated for both the Digg and the randomized graph using the inverse of the largest eigenvalue of the adjacency matrix of the graph [179]; this is analyzed in detail in Section 4.1.2.1. For the original graph this gives $\lambda_c^{digg} = 0.00587$, while for the randomized graph this gives $\lambda_c^{rand} = 0.00928$.

As we noted previously, this process should be accurately modeled by the SIR model of epidemics. In the limit of large graphs, if we assume that a node’s behavior is defined by its degree (with no degree correlations), we can calculate the expected size of cascades using heterogeneous mean field (HMF) theory [133]. Based on Figure 8.1(a), we pick a degree distribution $p(k) \propto k^{-2}$, with a cut-off on the maximum degree, $k_{\max} = 10^{3}$. This prediction is depicted with the solid line in Figure 8.12. Both the threshold and the growth accurately characterize the randomized graph.
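The threshold calculation (inverse of the largest eigenvalue of the adjacency matrix) can be sketched with plain power iteration. This is an illustrative sketch on a toy graph; it assumes a non-negative adjacency matrix for which power iteration converges (for example a connected, non-bipartite undirected graph), and the name `largest_eigenvalue` is hypothetical. A graph the size of Digg would call for a sparse eigensolver instead.

```python
import math


def largest_eigenvalue(adj, iterations=200):
    """Estimate the largest eigenvalue of a non-negative matrix by power iteration."""
    n = len(adj)
    vec = [1.0 / math.sqrt(n)] * n  # unit-length starting vector
    eig = 0.0
    for _ in range(iterations):
        prod = [sum(adj[i][j] * vec[j] for j in range(n)) for i in range(n)]
        eig = math.sqrt(sum(p * p for p in prod))  # ||A v|| with ||v|| = 1
        if eig == 0.0:
            return 0.0
        vec = [p / eig for p in prod]
    return eig


# Complete graph on four nodes (toy example); its largest eigenvalue is 3.
adj = [
    [0, 1, 1, 1],
    [1, 0, 1, 1],
    [1, 1, 0, 1],
    [1, 1, 1, 0],
]
threshold = 1.0 / largest_eigenvalue(adj)
print(threshold)
```

Denser or more clustered graphs tend to have a larger leading eigenvalue and therefore a lower threshold, which is consistent with the ordering of the two values reported above for the Digg graph and its randomized version.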
Note that HMF applies in the large graph limit. Because the randomized graph is still finite, some clustering inevitably occurs (it has a clustering coefficient of about 0.02), decreasing the cascade size from the HMF prediction.

If one assumes Digg’s graph structure consists of dense clusters, the effects on cascades in the independent cascade model are quite intuitive. It is easier for a story to take off within a smaller, more tightly connected community, thereby lowering the epidemic threshold. This also explains why the majority of people exposed to a story are exposed to it from multiple sources. On the other hand, for cascades to grow very large it is better to have a more homogeneous link structure that reaches all parts of the graph quickly. Ultimately, clusters have the effect of marginally decreasing the size of cascades by sequestering an infection in one part of the graph.

Comparing the theoretical and simulation results for cascades in Figure 8.12 to the observed distribution of cascade sizes in Figure 8.10 highlights the aforementioned puzzle. Why are cascades so small? According to our cascade model, only transmissibilities in a very narrow range near the threshold produce cascades of the appropriate size of 500 votes. Is there some sort of critical behavior that tunes transmissibilities to be exactly near the threshold? Our subsequent analysis suggests a more mundane possibility.

8.3.1.4 Characterization of real and simulated cascades

Figure 8.13: Characteristics of voting on Digg. Probability that a user votes given that n friends voted, as given by the independent cascade model and by actual voting behavior on Digg (averaged over all cascades).

As we previously noted, modeling the spreading process as a branching process assumes that each node has only one voting friend. In that case, the definition of transmissibility, $\lambda$, is unambiguous: the probability of voting for a story if your friend voted.
As we saw from Figure 8.11, most people exposed to a story are exposed multiple times. In that case, even if we maintain the definition of transmissibility for the case of one friend voting, there is some freedom to model the effect of having multiple voting friends. The most straightforward generalization, which we used in the last section, is the independent cascade model (ICM), which says that if a node has n voting friends then it has n independent chances to vote. Therefore,

$p_{ICM}(\text{vote} \mid n \text{ friends voted}) = 1 - (1 - \lambda)^{n}.$

We can measure this quantity on Digg. To do so, we considered all cascades with more than 10 votes. We isolated the users in a cascade who had exactly n friends voting and did not vote, versus people with n friends voting on the story before they themselves voted. For a given n, the percentage of people voting is depicted with a solid line in Figure 8.13. For n = 1, the percentage of users voting was 1.3%, suggesting a transmissibility of $\lambda = 0.013$. The dashed line depicts $p_{ICM}(\text{vote} \mid n \text{ friends voted})$ for this value of transmissibility. Even for n = 2, ICM overestimates the probability of a vote, and by n = 10, a relatively common occurrence, the ICM prediction is an order of magnitude too large.

8.3.2 Effect of Contagion Mechanism on Network Dynamics

Clearly, Figure 8.13 shows that multiple exposures to a story only marginally increase the probability of voting for it. The effect of having more than one friend recommend a story quickly saturates and would be better approximated as constant:

$p_{FSM}(\text{vote} \mid n \text{ friends voted}) = \lambda, \quad \text{for } n \geq 1.$

We will refer to this simplified model for generating cascades as the friend saturation model (FSM). We point out that p(vote | n friends voted) actually contains two factors: the probability that you visit Digg and see that your friend(s) voted on a story, and the probability that you vote on the story given that you did visit.
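The gap between the two contagion mechanisms is easy to see numerically. The sketch below evaluates both voting probabilities at the transmissibility of 0.013 quoted above; the function names `p_icm` and `p_fsm` are hypothetical and the choice of n values is only for illustration.

```python
def p_icm(n, lam):
    """Independent cascade model: n independent chances, each with probability lam."""
    return 1.0 - (1.0 - lam) ** n


def p_fsm(n, lam):
    """Friend saturation model: probability stays at lam for any n >= 1."""
    if n < 1:
        raise ValueError("the model is defined for n >= 1 voting friends")
    return lam


lam = 0.013  # transmissibility suggested by the n = 1 measurement
for n in (1, 2, 5, 10):
    print(n, round(p_icm(n, lam), 4), p_fsm(n, lam))
```

For a single voting friend the two models coincide; with ten voting friends the ICM probability is roughly nine to ten times the FSM value, which illustrates how strongly the independence assumption inflates predicted votes in a clustered network.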
In fact, a careful examination of Figure 8.13 suggests that a more sophisticated model of behavior might include some small marginal increase in voting probability from multiple voting friends, balanced by a marginal decrease from having many friends. For simplicity, we will stick with the simpler model.

8.3.2.1 Simulation using Friend Saturation Model

We can repeat the simulation procedure of the previous section. This time, though, after a node is exposed to a story from one of its friends (voting with probability $\lambda$), if the node chooses not to vote, it will not vote in the future even if it is exposed to the story again. We generated 100,000 cascades with transmissibility picked uniformly from the range [0, 0.04]. Again, we only keep cascades with more than 10 votes.

8.3.2.2 Inferring Transmissibility of a Cascade

Assuming that cascades on Digg spread according to the FSM process, we can infer the transmissibility of actual cascades. We label the nodes in a cascade in order of voting, $i = 1, \ldots, v$, where there are v total nodes who vote on the story (not counting the seed node), and $i = v+1, \ldots, w$ label the $w - v$ nodes (watching) who are exposed to the story but do not vote on it.

According to the FSM, each node votes for a story that is spread to it with probability $\lambda$, and we can read the probability of a cascade directly from the graph:

$p(\text{cascade} \mid \lambda) = (1 - \lambda)^{w - v} \lambda^{v}$   (8.6)

Using Bayes’ rule and noting that $p(\lambda \mid \text{cascade})$ is a simple Beta distribution, we can infer the most likely value for $\lambda$:

$\lambda_{inf} = \arg\max_{\lambda}\, p(\lambda \mid \text{cascade}) = \frac{v}{w}$   (8.7)

To test the accuracy of the inference method, we compute the inferred values of $\lambda$ for simulated cascades and plot them against the actual values of $\lambda$ used in the simulation in Figure 8.14(a). Pearson’s correlation coefficient for these values is 0.946.
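The step from Eq. 8.6 to Eq. 8.7 can be made explicit. The following short derivation is a sketch that assumes a uniform prior over the transmissibility (the prior is not stated in the text, so this is an assumption), in which case the posterior is proportional to the likelihood and its maximum is found by setting the derivative of the log-likelihood to zero:

```latex
\begin{align*}
p(\lambda \mid \text{cascade})
  &\propto p(\text{cascade} \mid \lambda) = \lambda^{v}(1-\lambda)^{w-v}
  && \text{(assumed uniform prior; a Beta}(v+1,\; w-v+1)\text{ density)} \\
\frac{d}{d\lambda}\,\log p(\text{cascade} \mid \lambda)
  &= \frac{v}{\lambda} - \frac{w-v}{1-\lambda} = 0 \\
v\,(1-\lambda) &= (w-v)\,\lambda
  \;\;\Longrightarrow\;\; \lambda_{inf} = \frac{v}{w}
\end{align*}
```

In words, the inferred transmissibility is the fraction of exposed nodes that actually voted, which is why it can be read directly off the observed cascade.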
Figure 8.14: (a) Inferred versus actual transmissibility for simulated cascades in the FSM. (b) Cascade size vs. inferred transmissibility for simulated and real cascades on the Digg graph, this time plotted on a log-log scale to highlight the order of magnitude difference between these cascade sizes and predictions of the epidemic model (HMF, see text for details).

Pink dots in Figure 8.14(b) plot the size of simulated cascades generated according to the FSM versus inferred transmissibility. Already, looking at the solid line plotting the HMF prediction in Figure 8.14(b), we see that cascade sizes are an order of magnitude smaller than for the independent cascade model. Using this model, we can also infer transmissibilities for actual Digg cascades and compare them on the same plot, as we do with blue dots. The similarity is striking, and the overlap so complete that most of the simulated cascade dots are covered. Also, note that the threshold still lines up fairly well with the HMF and eigenvalue predictions. At the beginning of a cascade, most people have not been exposed multiple times, so the FSM and the independent cascade model differ very little; therefore, we should not expect much change in the location of the threshold.

The inferred transmissibilities of actual Digg cascades are almost all above threshold. This is not surprising, given that we are analyzing stories that have been promoted to the front page and, therefore, have been found by Digg to be interesting to the community. Note that the largest cascade, the one about Michael Jackson’s death, also has the highest inferred transmissibility.

8.3.3 Discussion

Figure 8.15: Dynamics of transmissibility and fanout. (a) Number of new fans who can see the story (watching) and who actually vote for the story (voting) vs. time (voter i) for actual and simulated cascades. (b) Change in the estimated value of transmissibility for actual and simulated Digg cascades as a function of time.
In epidemic models, population models, and other branching processes, the principal quantity of interest is the reproductive number, $R_0$. Intuitively, the reproductive number is just the average number of people infected by a single infected person. If $R_0 > 1$, each infection leads to another indefinitely, producing an epidemic, whereas if $R_0 < 1$, the infection will die out eventually. Naively, the reproductive number should just be the average fanout, i.e., the average number of fans, times the transmissibility. For Digg, we have $\langle k \rangle \approx 6$, so $R_0 \approx 6\lambda$. In that case, we would expect an epidemic threshold at $R_0 = 1 \rightarrow \lambda_c \approx 1/6$, much higher than we observe. It is well known, however, that heterogeneous degree distributions lower the threshold compared to this prediction [17].

However, we can gain some intuition from this quantity if we view it as a dynamic quantity. The FSM implies that the true fanout only includes the number of new fans (those that have not already been exposed to a story) and changes with time. Figure 8.15(a) shows that with this definition the fanout is steadily decaying, both for actual and simulated cascades on the Digg graph. Effectively, this leads to a decrease in the reproductive number as well, so that a cascade that initially starts above the epidemic threshold may fall below it with time.

Additionally, in Figure 8.15(b) we examine the dynamics of the transmissibility. We calculate the transmissibility for each voter by dividing the number of votes a story gets from their fans by the number of new fans the voter exposed the story to. We see that this quantity is constant for simulated cascades, as it should be by construction. For actual cascades, on the other hand, the transmissibility remains constant until about 100 people have voted, and then it begins to decline. This is another effect limiting the size of cascades. The decline could be due to decay of novelty [185] or decrease in visibility [84] as a consequence of new stories being submitted to Digg.
Alternatively, people may vote for stories mostly hoping to help them get “promoted” to the front page. After about 100 votes, a story is usually promoted, thereby offering less incentive to give it further votes. The reproductive number is a product of fanout and transmissibility, and on Digg both decrease with time. From this perspective, the slowdown of cascade growth follows naturally.

8.4 Summary

We conducted an empirical analysis of user activity on Digg and Twitter. Though the two sites are vastly different in their functionality and user interface, they are used in strikingly similar ways to spread information. The mechanism for the spread of information is the same on both sites: users watch their friends’ activities (what they tweet or vote for), and by their own tweeting and voting actions they make this information visible to their own fans or followers.

In spite of the similarities, there are quantitative differences in the structure and functionality of the social networks on Digg and Twitter. While the number of fans a user has on each site exhibits a long-tail distribution, Digg’s social network is denser and more interconnected than Twitter’s, as judged by the number of reciprocated links and the network clustering coefficient. We find that the structure of the network affects the dynamics of information spread, with information reaching nodes faster in the denser network of Digg than in Twitter, and with users who are following the submitter also likely to follow other voters. We demonstrate that user activity on both sites has a power-law distribution, albeit with different exponents.

The user interface does play a role in the spread of information via these networks. We show that the user interface affects the dynamics of votes, with the evolution of Digg stories going through two distinct stages: before and after promotion to the front page. On Digg, stories spread mainly through the network before promotion.
After promotion, a story spreads mainly outside the network and is exposed to a large number of unconnected users. The spread of the story on the network slows significantly, though the story may still generate a large response from the Digg audience. The design of the Twitter interface facilitates the spread of information or stories primarily through the social network. On Twitter, stories initially spread through the network more slowly than Digg stories do, but they continue spreading at this rate as the story ages and generally penetrate the network farther than Digg stories. Nevertheless, the number of votes accumulated by stories on both sites saturates after a period of about a day to a value that reflects their popularity.

In this chapter we also present an efficient, scalable analytical tool to quantify cascades. We believe that our work is the first to provide a mathematical framework to quantify and analyze cascades, even for applications requiring real-time or online analysis. The mathematical framework is based on the cascade generating function, which quantitatively characterizes the micro-, meso- and macroscopic properties of the cascade. The macroscopic properties that can be efficiently calculated using this tool include the diameter and the spread of the cascades. This function also provides an efficient compression of the information encoded in cascades. In spite of having pseudo-linear space complexity, it can be used to reconstruct the shape of the cascade with a high degree of accuracy. Although large-scale studies of contagion processes have been carried out, the number of participants involved in these studies was relatively small. To the best of our knowledge, this is the first study of very large contagion processes with thousands of participants.
Microscopic analysis revealed interesting insights into cascades and contagion processes, such as the possible effect of the initial number of seeds and of the branching, chaining and community effects on the initial popularity of news. For macroscopic properties like the number of cascades in a contagion process, cascade size, spread, diameter, average path length and so on, we observe that a stretched exponential (Weibull) or a lognormal distribution fits the observed distribution well. The Double Pareto Lognormal distribution gives a very good fit for the distribution of the number of cascades. Usually the power law accounts (if at all) for a small percentage of data in the tail of the distribution. We also analyzed mesoscopic properties like the distribution of principal cascades using this function.

One of the surprising observations was that the vast majority of cascades grow far more slowly than predicted by traditional models and fail to reach “epidemic” proportions. We demonstrate that two important effects contribute to stunting the growth of cascades in online social media. The first is topological: due to the highly clustered structure of the Digg network, most people who are aware of a story have been exposed to it via multiple friends. Another effect limiting the size of cascades comes from the nature of the interaction, or the social contagion mechanism. Many network studies assume that graphs with locally tree-like behavior give a good approximation to real networks. In this case, we find that such methods wildly overestimate the size of cascades. If most of the people exposed to a story are exposed repeatedly, understanding how they are affected by repeat exposures is of paramount importance. On Digg, subsequent exposures to a story have almost no effect on the probability of voting. Our simple model of contagion reproduces the behavior of Digg cascades better than traditional models like the independent cascade model.
Much remains to be studied: whether these results hold on other social networks, more sophisticated models of response to friends, the time dependence of transmissibility, and more detailed analysis of the effect of network structure and dynamics on cascades.

Chapter 9

Alpha-Centrality

In Chapter 6, we showed that the non-conservative Alpha-Centrality [22] and its normalized counterpart better reproduce empirically measured influence than a conservative metric (such as PageRank) on Digg and Twitter. In this chapter we explore the application of Alpha-Centrality to network analysis in greater detail.

Alpha-centrality is given by:

$cr(s') = s' \sum_{t=0}^{\infty} \left( \prod_{k=1}^{t} \alpha_k \right) A^{t}$

Although $\alpha_k$ along different edges in a path could in principle be different, for simplicity we take them all to be equal: $\alpha_k = \alpha,\ \forall k \neq 1$, and $\alpha_1 = \beta$ [20]. $\beta$ is known as the direct attenuation factor and $\alpha$ is known as the indirect attenuation factor. As shown in Chapter 6, Equation 6.4, this reduces to

$cr_{\alpha}(s) = s + \alpha\, cr_{\alpha}(s) A = s (I - \alpha A)^{-1}$

where $s = \beta s' A$.

Another way to formulate Alpha-Centrality is using the Alpha-centrality matrix $C_{\alpha,k}$, $\forall \alpha \in [0, 1]$, which is given by:

$C_{\alpha,k} = I + \alpha A + \alpha^{2} A^{2} + \cdots + \alpha^{k} A^{k} = \sum_{t=0}^{k} \alpha^{t} A^{t}$   (9.1)

Then Alpha-centrality is $cr_{\alpha}(s) = s\, C_{\alpha,k \to \infty}$. The first term in the summation gives the number of paths of length one (edges) to i, the second gives the number of paths of length two, etc. Similarly, if the normalized Alpha-centrality matrix is given by:

$NC_{\alpha,k} = \frac{1}{\sum_{i,j} C_{\alpha,k}[i,j]}\, C_{\alpha,k}$   (9.2)

then normalized Alpha-centrality is $ncr_{\alpha}(s) = s\, NC_{\alpha,k \to \infty}$.

Alpha-Centrality is a powerful tool for network analysis. Unlike other centrality metrics, which do not distinguish between local and global structure, a parameterized centrality metric can differentiate between locally connected nodes, i.e., nodes that are linked to other nodes which are themselves interconnected, and globally connected nodes that link and mediate communication between poorly connected groups of nodes.
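To make the matrix formulation concrete, the sketch below computes a truncated Alpha-centrality matrix and its normalized version on a toy graph. It assumes that the matrix in Equation 9.1 is the truncated series of powers of the adjacency matrix weighted by powers of the attenuation parameter, starting from the identity; that reading, the dense list-of-lists representation, and the function names are assumptions made for illustration. The approximation algorithm of Appendix C is not reproduced here.

```python
def mat_mul(a, b):
    """Multiply two dense matrices given as lists of rows."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[m] * b[m][j] for m in range(inner)) for j in range(cols)]
            for row in a]


def alpha_centrality_matrix(adj, alpha, k):
    """Truncated series: sum over t = 0..k of alpha**t * A**t (assumed form of Eq. 9.1)."""
    n = len(adj)
    power = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # A**0
    total = [row[:] for row in power]
    weight = 1.0
    for _ in range(k):
        power = mat_mul(power, adj)  # A**t
        weight *= alpha              # alpha**t
        for i in range(n):
            for j in range(n):
                total[i][j] += weight * power[i][j]
    return total


def normalize(matrix):
    """Divide every entry by the sum of all entries (Eq. 9.2)."""
    s = sum(sum(row) for row in matrix)
    return [[v / s for v in row] for row in matrix]


def centrality(start, matrix):
    """Row vector times matrix: scores[j] = sum_i start[i] * matrix[i][j]."""
    return [sum(start[i] * matrix[i][j] for i in range(len(start)))
            for j in range(len(matrix[0]))]


# Two nodes joined by one undirected edge (toy example).
adj = [[0, 1], [1, 0]]
C = alpha_centrality_matrix(adj, alpha=0.5, k=60)
NC = normalize(C)
scores = centrality([1.0, 1.0], NC)
print(C, scores)
```

The series only converges when the attenuation parameter is smaller than the inverse of the largest eigenvalue of the adjacency matrix, so larger values require either a small truncation depth k or a different graph.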
We have used normalized Alpha-Centrality to study the structure of networks, specifically, to identify important nodes and communities within the network. We have extended the modularity maximization class of algorithms [69] to use (normalized) Alpha-Centrality, rather than edge density, as a measure of network connectivity. Rather than find regions of the network that have a greater than expected number of edges connecting nodes [142], our approach looks for regions that have a greater than expected number of weighted paths connecting nodes. For small values of $\alpha$, smaller, more locally connected communities emerge, while for larger values of $\alpha$, we observe larger globally connected communities. We also used this metric to rank nodes in a network. By studying changes in rankings that occur when the parameter $\alpha$ is varied, we were able to identify locally important ‘leaders’ and globally important ‘bridges’ or ‘brokers’ that facilitate communication between different communities. We applied this approach to benchmark networks studied in the literature and found that it results in network divisions in close agreement with the ground truth. We can easily extend this definition to multi-modal networks that link entities of different types, and use the approach described in this thesis to study the structure of such networks [61]. We have also provided an approximation algorithm to compute Alpha-Centrality for large networks (Appendix C).

9.1 Node Ranking

Normalized Alpha-Centrality, $\mathrm{ncr}_{\alpha}(s)$, measures how ‘close’ node $i$ is to other nodes in a network and can be used to rank the nodes accordingly. Studying how rankings change as the parameter $\alpha$ is varied allows us to identify locally and globally important nodes. Leaders are the core members of the communities and are central with respect to the local community structure. Such nodes have high rankings (using Alpha-Centrality) for small values of the parameter $\alpha$, but their rankings may decrease as $\alpha$ increases.
Bridges act as passageways of commodity/information flow between communities. They may not be central to any community locally, but are central globally with respect to the network. As we show in Section 9.3, Alpha-Centrality lets us discover not only the leaders of a community but also the ‘bridges’. Bridges are identified as nodes which might have low rankings for small values of $\alpha$, but whose ranking increases as $\alpha$ increases.

For $\alpha = 0$, normalized Alpha-Centrality reduces to degree centrality. Also, as shown in Appendix B, for symmetric matrices, as $\alpha \to 1/|\lambda_1|$, normalized Alpha-Centrality converges to eigenvector centrality. The rankings no longer change as $\alpha$ increases further, since $\alpha$ has reached some fundamental length scale of the network. The underlying principle of Alpha-Centrality is the concept of path-based centrality. This principle of path-based connectivity can be used to relate many existing centrality metrics.

9.2 Community Detection

Girvan & Newman [142] proposed modularity as a metric for evaluating the community structure of a network. The modularity-optimization class of community detection algorithms [139] finds a network division that maximizes the modularity, which is defined as $Q$ = (connectivity within community) − (expected connectivity), where connectivity is measured by the density of edges. We extend this definition to use normalized Alpha-Centrality as the measure of network connectivity. According to this definition, in the best division of a network, there are more weighted paths connecting nodes to others within their own community than to nodes in other communities. Modularity can, therefore, be written as:

$$Q(\alpha) = \sum_{ij} \big( NC_{\alpha,n \to \infty}[i,j] - \overline{NC}_{\alpha,n \to \infty}[i,j] \big)\, \delta(s_i, s_j) \qquad (9.3)$$

$NC_{\alpha,n \to \infty}$ is given by Equation 9.2. $\overline{NC}_{\alpha,n \to \infty}[i,j]$ is the expected normalized Alpha-Centrality, and $s_i$ is the index of the community that $i$ belongs to, with $\delta(s_i, s_j) = 1$ if $s_i = s_j$; otherwise, $\delta(s_i, s_j) = 0$.
We round the values of $NC_{\alpha,n \to \infty}[i,j]$ to the nearest integer. To compute $\overline{NC}_{\alpha,n \to \infty}[i,j]$, we consider a graph, referred to as the null model, which has the same number of nodes and edges as the original graph, but in which the edges are placed at random. To make the derivation below more intuitive, instead of normalized Alpha-Centrality, we talk of the number of attenuated paths. In normalized Alpha-Centrality, the number of attenuated paths is scaled by a constant, hence the derivation below holds true. When all the nodes are placed in a single group, then axiomatically $Q(\alpha) = 0$. Therefore $\sum_{ij} \big[ NC_{\alpha,n \to \infty}[i,j] - \overline{NC}_{\alpha,n \to \infty}[i,j] \big] = 0$, and we set

$$W = \sum_{ij} NC_{\alpha,n \to \infty}[i,j] = \sum_{ij} \overline{NC}_{\alpha,n \to \infty}[i,j].$$

Therefore, according to the argument above, the total number of paths between nodes in the null model, $\sum_{ij} \overline{NC}_{\alpha,n \to \infty}[i,j]$, is equal to the total number of paths in the original graph, $\sum_{ij} NC_{\alpha,n \to \infty}[i,j]$. We further restrict the choice of null model to one where the expected number of paths reaching node $j$, $W^{in}_j$, is equal to the actual number of paths reaching the corresponding node in the original graph: $W^{in}_j = \sum_i \overline{NC}_{\alpha,n \to \infty}[i,j] = \sum_i NC_{\alpha,n \to \infty}[i,j]$. Similarly, we also assume that in the null model, the expected number of paths originating at node $i$, $W^{out}_i$, is equal to the actual number of paths originating at the corresponding node in the original graph: $W^{out}_i = \sum_j \overline{NC}_{\alpha,n \to \infty}[i,j] = \sum_j NC_{\alpha,n \to \infty}[i,j]$. $W$, $W^{out}_i$ and $W^{in}_j$ are then rounded to the nearest integers.

Next, we reduce the original graph $G$ to a new graph $G'$ that has the same number of nodes as $G$ and a total of $W$ edges, such that each edge has weight 1 and the number of edges between nodes $i$ and $j$ in $G'$ is $NC_{\alpha,n \to \infty}[i,j]$. Now the expected number of paths between $i$ and $j$ in graph $G$ can be taken as the expected number of edges between nodes $i$ and $j$ in graph $G'$, and the actual number of paths between nodes $i$ and $j$ in graph $G$ can be taken as the actual number of edges between node $i$ and node $j$ in graph $G'$.
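The constraints just stated (the null model preserves the total $W$, every $W^{out}_i$ and every $W^{in}_j$) can be illustrated with a small sketch. It assumes the product form $W^{out}_i W^{in}_j / W$ for the expected count, which is the form the derivation that follows arrives at, and it skips the rounding step. The function names and the toy graph (two triangles joined by one edge, with the adjacency matrix standing in for a matrix of path counts) are hypothetical.

```python
def expected_path_counts(nc):
    """Null-model expectation that preserves row sums, column sums and the total.

    expected[i][j] = W_out[i] * W_in[j] / W
    """
    n = len(nc)
    w_out = [sum(nc[i]) for i in range(n)]                        # paths leaving i
    w_in = [sum(nc[i][j] for i in range(n)) for j in range(n)]    # paths reaching j
    w = sum(w_out)                                                # total weight W
    return [[w_out[i] * w_in[j] / w for j in range(n)] for i in range(n)]


def modularity(nc, labels):
    """Q = sum over same-community pairs of (actual - expected), as in Eq. 9.3."""
    expected = expected_path_counts(nc)
    n = len(nc)
    return sum(nc[i][j] - expected[i][j]
               for i in range(n) for j in range(n)
               if labels[i] == labels[j])


# Toy stand-in for the path-count matrix: two triangles joined by the edge 2-3.
paths = [[0, 1, 1, 0, 0, 0],
         [1, 0, 1, 0, 0, 0],
         [1, 1, 0, 1, 0, 0],
         [0, 0, 1, 0, 1, 1],
         [0, 0, 0, 1, 0, 1],
         [0, 0, 0, 1, 1, 0]]

print(modularity(paths, [0, 0, 0, 1, 1, 1]))  # the two triangles as communities
print(modularity(paths, [0, 1, 0, 1, 0, 1]))  # an arbitrary poor division
print(modularity(paths, [0, 0, 0, 0, 0, 0]))  # single group: Q vanishes
```

In this sketch the division into the two triangles scores higher than the arbitrary one, and placing all nodes in one group gives zero, matching the axiom used in the text.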
The equivalent random graph $G''$ is used to find the expected number of edges from node $i$ to node $j$. In this graph the edges are placed at random subject to the following constraints: (i) the total number of edges in $G''$ is $W$; (ii) the out-degree of node $i$ in $G''$ = out-degree of node $i$ in $G'$ = $W^{out}_i$; (iii) the in-degree of node $j$ in $G''$ = in-degree of node $j$ in $G'$ = $W^{in}_j$. Thus in $G''$ the probability that an edge will emanate from a particular node depends only on the out-degree of that node; the probability that an edge is incident on a particular node depends only on the in-degree of that node; and the probabilities of the two nodes being the two ends of a single edge are independent of each other. In this case, the probability that an edge exists from $i$ to $j$ is given by $P(\text{edge in } G' \text{ emanates from } i) \cdot P(\text{edge in } G' \text{ incident on } j) = (W^{out}_i / W)(W^{in}_j / W)$. Since the total number of edges in $G''$ is $W$, the expected number of edges between $i$ and $j$ is $W (W^{out}_i / W)(W^{in}_j / W) = \overline{NC}_{\alpha,n \to \infty}[i,j]$, the expected normalized Alpha-Centrality in $G$.

Once we compute $Q(\alpha)$, we have to select an algorithm to divide the network into communities that maximize $Q(\alpha)$. Brandes et al. [27] have shown that the decision version of modularity maximization is NP-complete. Like others [140, 103], we use the leading eigenvector method to obtain an approximate solution. In this method, nodes are assigned to either of two groups based on a single eigenvector corresponding to the largest positive eigenvalue of the modularity matrix. This process is repeated for each group until modularity does not increase further upon division. We provide a scalable algorithm for Alpha-Centrality in Appendix C.

9.3 Empirical Results

We apply the formalism developed above to benchmark networks studied in the literature and to real-world networks.

Figure 9.1: Zachary’s karate club data.
Circles and squares represent the two actual factions, while colors stand for discovered communities as the strength of ties increases: (a) $\alpha = 0$, (b) $0 < \alpha < 0.14$, (c) $\alpha \geq 0.14$.

9.3.1 Karate Club Network: Communities, Leaders and Bridges

First, we study the friendship network of Zachary’s karate club [193], described in Chapter 2 (Section 2.3.1) and shown in Figure 9.1. During the course of the study, a disagreement developed between the administrator and the club’s instructor, resulting in the division of the club into two factions, represented by circles and squares in Figure 9.1. We find the community division of this network. The first bisection of the network results in two communities, regardless of the value of $\alpha$, which are identical to the two factions observed by Zachary. However, when the algorithm runs to termination (no more bisections are possible), different groups are found for different values of $\alpha$. For $\alpha = 0$, the method reduces to edge-based modularity maximization [138] and leads to four groups [50, 51] (Figure 9.1(a)). For $0 < \alpha < 0.14$ it discovers three groups (Figure 9.1(b)), and for $\alpha > 0.14$, two groups that are identical to the factions found by Zachary (Figure 9.1(c)). Thus, increasing $\alpha$ allows local groups to merge into more global communities.

Figure 9.2: Centrality scores of Zachary club members vs. $\alpha$.

Figure 9.2 shows how the normalized Alpha-Centrality scores of nodes change with $\alpha$. For $\alpha = 0$, normalized Alpha-Centrality reproduces the rankings given by degree centrality. As we show in Appendix B, the final rankings produced by normalized Alpha-Centrality for this symmetric matrix are the same as those given by eigenvector centrality. This can be confirmed by their values in Figure 9.3. Varying $\alpha$ allows us to transition smoothly from a local to a global measure of centrality. Nodes 34 and 1 have the highest centrality scores, especially at lower values of $\alpha$. These are the leaders of their communities.
It was the disagreement between these nodes, the club administrator (node 1) and instructor (node 34), that led to the club’s division. Nodes 33 and 2 also have high centrality and hold leadership positions. All these nodes are also scored highly by betweenness centrality and PageRank. Note that the centrality scores of these nodes decrease with $\alpha$, indicating that they are far more important locally than globally.

A node may also have high centrality if it is connected to many nodes from different communities. Such nodes, which bridge communities, are crucially important to maintaining cohesiveness and facilitating communication flow in both human [161, 72] and animal [128] groups. We can identify these nodes because their normalized Alpha-Centrality increases with $\alpha$, i.e., they become more important as longer paths become more important. The centrality of nodes 3, 14, 9, 31, 8, 20, 10, etc., increases with $\alpha$ from moderate to relatively high values. While most of these nodes are directly connected to both communities, some are only indirectly connected by longer paths. The betweenness centrality of these nodes is low, but non-zero.

Nodes 25, 26 and 17 have low centrality, which decreases with $\alpha$. These are peripheral members. The betweenness centrality of node 17 is zero, as expected, but nodes 25 and 26 have scores similar to node 31. The PageRank scores of these peripheral nodes are higher than those of nodes 21, 22 and 23, which are connected to central nodes, and comparable to the scores of the bridging nodes 20 and 31. While both betweenness centrality and PageRank correctly pick out leaders, they do not distinguish between locally and globally connected nodes.

Figure 9.3: Comparison of eigenvector centrality and converged normalized Alpha-Centrality for Zachary’s karate club network.

Figure 9.4: Classification of karate club nodes according to the roles scheme proposed by Guimera et al.
[76]: (i) non-hubs ($z < 2.5$) are divided into ultra-peripheral, peripheral, and connector nodes (kinless nodes, whose links are homogeneously distributed among all communities, are not shown); (ii) hubs ($z \geq 2.5$) are subdivided into provincial hubs (majority of links within their own community) and connector hubs (many links to other communities). Global hubs, whose links are homogeneously distributed among all communities, are not shown.

Guimera and collaborators [76] proposed a role-based description of complex networks as an alternative to the ‘average description’ approach, which characterizes network structure in terms of average degree or degree distribution. They define a role in terms of the relative within-community degree $z$ (which measures how well the node is connected to other nodes in its community) and the participation coefficient $P$ (which measures how well the node is connected to nodes in other communities). They propose a heuristic classification scheme to assign roles to nodes based on where they fall in the $z$–$P$ plane, and find similar patterns of role-to-role connectivity among networks with similar functional needs and growth mechanisms [77].

Figure 9.4 shows the positions of nodes in the karate club network in the $z$–$P$ plane. Colored regions demarcate the boundaries of different roles according to Guimera et al.’s classification scheme. Nodes separate into provincial hubs (34, 1), peripheral nodes (33, 2, 28, 14, 31, 29, 20, 3, 9, 10) and ultra-peripheral nodes (the rest of the nodes). No special role is assigned to the bridging nodes, such as 9. Even if the boundary of non-hub connectors is shifted to slightly less than $P = 0.5$ in order to identify nodes 3, 9, 10 as serving a special role, the method would still miss node 14, whose position in the network is very similar to that of node 9. This is because the method takes into account direct links only, rather than complete connectivity between nodes.
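For reference, the two role coordinates can be computed directly from an adjacency matrix and a community assignment. The sketch below assumes the commonly stated definitions, which are not spelled out in this chapter: $z_i$ is the within-community degree of node $i$ standardized over the nodes of its community, and $P_i = 1 - \sum_c (k_{ic}/k_i)^2$, where $k_{ic}$ is the number of links from $i$ to community $c$. The function name, the use of the population standard deviation and the toy graph are assumptions.

```python
from statistics import mean, pstdev


def role_coordinates(adj, labels):
    """Return (z, P): within-community degree z-score and participation coefficient."""
    n = len(adj)
    # links[i][c] = number of links from node i to nodes in community c
    links = [{} for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if adj[i][j]:
                links[i][labels[j]] = links[i].get(labels[j], 0) + 1
    within = [links[i].get(labels[i], 0) for i in range(n)]

    z, p = [], []
    for i in range(n):
        peers = [within[j] for j in range(n) if labels[j] == labels[i]]
        spread = pstdev(peers)
        # z is set to 0 when every node of the community has the same within-degree
        z.append((within[i] - mean(peers)) / spread if spread > 0 else 0.0)
        degree = sum(links[i].values())
        p.append(1.0 - sum((c / degree) ** 2 for c in links[i].values())
                 if degree else 0.0)
    return z, p


# Toy graph: two triangles joined by the edge 2-3, one community per triangle.
adj = [[0, 1, 1, 0, 0, 0],
       [1, 0, 1, 0, 0, 0],
       [1, 1, 0, 1, 0, 0],
       [0, 0, 1, 0, 1, 1],
       [0, 0, 0, 1, 0, 1],
       [0, 0, 0, 1, 1, 0]]
z, p = role_coordinates(adj, [0, 0, 0, 1, 1, 1])
print(z)
print(p)
```

Only direct links enter either coordinate, which is the limitation noted above: a node that reaches the other community only through longer paths gets $P = 0$.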
The method also requires one to first identify communities in the network, which is a very computationally expensive procedure for large networks. Our method, on the other hand, provides a simpler and more scalable way to identify network structure.

9.3.2 Other Real-World Networks

In addition to the Karate Club friendship network described above, we evaluated the performance of our community division algorithm on other real-world networks, such as the US College football and the political books networks. We were not able to evaluate rankings due to the lack of ground truth for these data sets.

The first network represents the schedule of Division 1 games for the 2001 season, where the nodes represent teams and the edges represent the regular season games between teams [69]. The teams are divided into conferences containing 8 to 12 teams each. Games are more frequent between members of the same conference, though inter-conference games also take place. This leads to the intuition that the natural communities may be larger than conferences.

The political books network represents books about US politics sold by the online bookseller Amazon.¹ Edges represent frequent co-purchasing by the same buyers, as indicated by the “customers who bought this book also bought these other books” feature of Amazon. The nodes were labeled liberal, neutral, or conservative by Mark Newman based on a reading of their descriptions and reviews on Amazon.² We take these labels as communities.

We used the Wallace criterion [178] to evaluate the quality of discovered communities. The Wallace criterion, or purity, is the fraction of all pairs of objects in the same community that are assigned to the same group by the algorithm. The higher the purity of the detected communities, the better the performance of the algorithm.
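The purity measure as worded above translates directly into a pair count. The following is a minimal sketch of that wording; the function name and the example labelings are hypothetical, and ground-truth and discovered communities are given as one label per object.

```python
from itertools import combinations


def wallace_purity(truth, found):
    """Fraction of same-community pairs (ground truth) that the algorithm
    also places in the same group."""
    same_truth = 0
    same_both = 0
    for i, j in combinations(range(len(truth)), 2):
        if truth[i] == truth[j]:
            same_truth += 1
            if found[i] == found[j]:
                same_both += 1
    if same_truth == 0:
        raise ValueError("no pair of objects shares a ground-truth community")
    return same_both / same_truth


truth = [0, 0, 0, 1, 1, 1]
print(wallace_purity(truth, [0, 0, 0, 1, 1, 1]))  # exact recovery
print(wallace_purity(truth, [0, 0, 1, 1, 2, 2]))  # true communities split up
```

Note that, under this definition, merging groups can never lower the score, so purity is best read together with the number of groups found.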
Table 9.1: The number (grps) and purity (Pu) of communities discovered at different values of $\alpha$

network            α       grps   Pu
karate club        0.00    4      0.505
karate club        0.12    3      0.736
karate club        0.14    2      1.000
florentine         0.00    7      0.34
florentine         0.05    6      0.34
florentine         0.10    5      0.42
political books    0.00    4      0.633
political books    0.04    3      0.805
political books    0.08    2      0.917
football           0.00    8      0.715
football           0.02    8      0.723
football           0.04    8      0.723
football           0.06    7      0.723
football           0.08    7      0.723
football           0.10    7      0.791
football           0.12    6      0.803
football           0.14    6      0.813
football           0.16    6      0.813
football           0.18    4      0.862
flickr             0.000   4      0.501
flickr             0.001   3      0.565
flickr             0.002   3      0.567
flickr             0.003   3      0.567
flickr             0.004   3      0.567
flickr             0.005   3      0.568
flickr             0.006   3      0.570
flickr             0.007   3      0.571
flickr             0.008   3      0.572
flickr             0.009   3      0.574

¹ http://www.orgnet.com/
² http://www-personal.umich.edu/mejn/netdata/

The number and purity of the communities found in networks as a function of the parameter $\alpha$ are shown in Table 9.1. The case $\alpha = 0$ corresponds to the edge-based modularity method. As $\alpha$ increases, the number of groups discovered in all networks goes down, while their purity increases. This is consistent with our hypothesis that using smaller values of $\alpha$ allows us to identify more local network structure, while larger values of $\alpha$ lead to more global structure. In the Karate club network, for example, at $\alpha = 0$, there are four small communities, as shown in Fig. 9.1(a). These local communities coalesce into two large groups as $\alpha$ increases (Fig. 9.1(c)), which are identical to the groups identified by Zachary [193]. We applied this approach to other benchmark networks studied in the literature (such as Florentine Families [145]) and other real-world networks (such as the photo-sharing website Flickr) and found that the resulting network division and node ranking are in close agreement with the ground truth [62]. The performance of the Alpha-Centrality based community detection method on these datasets is also shown in Table 9.1.

9.4 Multimodal Networks

Multimodal networks play a key role in the evolution of communities and the decisions individuals make.
While traditional network analysis algorithms can efficiently find structure even in large data sets, they usually work on homogeneous data, i.e., networks composed of entities of a single type, for example, a social network where individuals are nodes and an edge between nodes corresponds to a (possibly directed) friendship relationship. Such networks can be represented as unipartite graphs. Many online networks, however, mix entities of different types. For example, we can represent the popular photo-sharing site Flickr as a multimodal network composed of several entity types: users, images, groups, and tags, with connections between the entities representing different types of relations. A link between users denotes a friendship; a link between a user and a group denotes the user’s membership in the group; a link between an image and tags represents the keywords used to annotate that image, and so on. In order to extract useful knowledge from this data, we need to look at the network in its entirety.

We can compactly represent a multimodal network as a layered graph, in which entities belonging to different classes are partitioned into separate layers, with intra-layer and inter-layer edges representing links between entities. Consider a network with two entity classes $X$ ($|X| = m$) and $Y$ ($|Y| = n$). For concreteness, suppose the data represents a scientific papers dataset with authors $X$ and papers $Y$, and that in addition to the usual authorship relations, we managed to collect additional data about friendships, acknowledgements and citations. This data can be represented as a graph with two layers, with vertices of type $X$ (authors) in one layer, and vertices of type $Y$ (papers) in the other layer.
An $(m+n) \times (m+n)$ adjacency matrix captures the intra- and inter-layer relations between different vertices:

$$A = \begin{bmatrix} XX_{m \times m} & XY_{m \times n} \\ YX_{n \times m} & YY_{n \times n} \end{bmatrix}$$

Here $A[i,j] = XX[i,j]$ gives the binary relation of the ordered pair $(x_i, x_j)$, e.g., a friendship between authors $i$ and $j$; $A[i,j+m] = XY[i,j]$ gives the binary relation of the ordered pair $(x_i, y_j)$, e.g., whether author $i$ wrote paper $j$; $A[i+m,j] = YX[i,j]$ gives the binary relation of the ordered pair $(y_i, x_j)$, e.g., whether paper $i$ acknowledges author $j$; $A[i+m,j+m] = YY[i,j]$ gives the binary relation of the ordered pair $(y_i, y_j)$, e.g., whether paper $i$ cites paper $j$. We call this data structure a 2-mode matrix. This representation is similar to the one used by Tong et al. [170] to represent bipartite graphs, except that, since bipartite graphs describe only the inter-layer and not the intra-layer relations, the diagonal submatrices $XX$ and $YY$ are zero in the bipartite case.

We can easily generalize the above formulation to $N$-mode matrices, which represent graphs having $N$ distinct types of nodes, or being composed of entities belonging to $N$ distinct classes. The adjacency matrix in this case represents $N^2$ distinct types of binary relations.

In [61], we introduced this compact data structure, the $N$-mode matrix, to represent different classes of entities and relations present in a multimodal network. We used Alpha-Centrality to study the structure of such networks, specifically, to identify communities and important nodes in the network. We applied this approach to benchmark networks studied in the literature (Southern Women [55] and the US College football dataset [69]) and to the photo-sharing website Flickr. We found that the results are in close agreement with the ground truth. In addition, the approach gave useful insights into the structure of the graph and the prediction of future events.
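The 2-mode matrix described above can be assembled by concatenating the four blocks. The sketch below is illustrative only: the function name and the tiny authors-and-papers data are made up, the blocks are plain lists of rows, and indices are zero-based (the text counts from one).

```python
def two_mode_matrix(xx, xy, yx, yy):
    """Assemble A = [[XX, XY], [YX, YY]] from blocks given as lists of rows.

    xx is m x m, xy is m x n, yx is n x m and yy is n x n.
    """
    m, n = len(xx), len(yy)
    shapes_ok = (all(len(row) == m for row in xx)
                 and len(xy) == m and all(len(row) == n for row in xy)
                 and len(yx) == n and all(len(row) == m for row in yx)
                 and all(len(row) == n for row in yy))
    if not shapes_ok:
        raise ValueError("block shapes are inconsistent")
    top = [xx[i] + xy[i] for i in range(m)]      # rows for entities of type X
    bottom = [yx[i] + yy[i] for i in range(n)]   # rows for entities of type Y
    return top + bottom


# Hypothetical data: m = 2 authors, n = 3 papers.
friends = [[0, 1],
           [1, 0]]                # XX: author-author friendship
wrote = [[1, 1, 0],
         [0, 1, 1]]               # XY: author i wrote paper j
acknowledges = [[0, 0],
                [1, 0],
                [0, 0]]           # YX: paper i acknowledges author j
cites = [[0, 0, 0],
         [1, 0, 0],
         [0, 1, 0]]               # YY: paper i cites paper j

a = two_mode_matrix(friends, wrote, acknowledges, cites)
for row in a:
    print(row)
```

A bipartite representation is the special case in which `friends` and `cites` are all zeros.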
By studying changes in rankings that occur when the indirect attenuation factor changes, we were able to identify leaders and ‘bridging’ nodes that facilitate communication between different communities.

One possible extension of this method is to weigh relations differentially, to balance the transmission of influence along different channels. To do this, we break the 2-mode matrix into diagonal (intra-layer) and off-diagonal (inter-layer) components, $A = D_1 A_1 + D_2 A_2$, where

$$A_1 = \begin{bmatrix} XX_{m \times m} & 0_{m \times n} \\ 0_{n \times m} & YY_{n \times n} \end{bmatrix}, \qquad A_2 = \begin{bmatrix} 0_{m \times m} & XY_{m \times n} \\ YX_{n \times m} & 0_{n \times n} \end{bmatrix},$$

and the weights are given by the block-diagonal matrices

$$D_1 = \begin{bmatrix} D^{XX}_{m \times m} & 0_{m \times n} \\ 0_{n \times m} & D^{YY}_{n \times n} \end{bmatrix}, \qquad D_2 = \begin{bmatrix} D^{XY}_{m \times m} & 0_{m \times n} \\ 0_{n \times m} & D^{YX}_{n \times n} \end{bmatrix},$$

with each block a diagonal matrix whose diagonal entries all equal the scalar weight assigned to the corresponding relation. We plan to study this balancing scheme on other real-world networks in the future. This work is complementary to ongoing work in social science, where sociologists are emphasizing the importance of social space, which is defined as being multi-layered, relational and containing multiple types of entities. They are using this concept of social space to build models for interactive systems of variables [154, 147]. Many of these social spaces can be represented using our $N$-mode matrix.

9.5 Summary

In this chapter, we closely examined the properties of (normalized) Alpha-Centrality that make this metric suitable for network analysis. Tuning the attenuation parameter helps the metric identify both locally and globally important nodes and structures. We have shown that this metric can be used in two important aspects of network analysis: influence prediction and community detection. We used normalized Alpha-Centrality to study the structure of networks, specifically, to identify important nodes and communities within the network. For small values of $\alpha$, smaller, more locally connected communities emerge, while for larger values of $\alpha$, we observe larger globally connected communities.
We also used this metric to rank nodes in a network.

Chapter 10

Related Work

The range of interaction processes that can occur on a network includes the spread of epidemics [8, 81] and information [107], viral marketing [93, 88], word-of-mouth recommendation [70], money exchange, e-mail forwarding [124], and Web surfing [146], among others.

10.1 Real-World Dynamic Interactions

There has been some work to differentiate temporal activity associated with heterogeneous content on online social media. In [189], the authors enumerate the different approximate shapes of the temporal distribution of content on Twitter. But unlike us, they are not able to associate semantic meaning with the activity associated with the clusters they observe.

Previous work has tried to estimate the quality or interestingness of content [3, 42]. However, quality or interestingness is a subjective measure and is biased by the perspective of the user. For instance, what would be high-quality or interesting information to a campaigner might be junk to a news aggregator. Therefore there is a need for an objective quantitative measure of user-generated content. Our entropy-based approach for classifying user activity and content addresses this need. While the method described in [42] is similar in spirit to ours, it can discover only three classes of activity. Heterogeneous activity on Twitter requires more than three classes.

Most of the existing spam detection [129] and trust management systems [31] are based on content and structure but do not look at collective dynamics. Besides, they usually require additional constraints such as labelled, up-to-date annotation of resources, access to content and the cooperation of search engines. Satisfying so many constraints is difficult, especially when one takes the diversity and astronomical size of online social media into account.
Our method, on the other hand, while having no such constraints, may be able to detect spam with an accuracy close to that of humans.

There has been some work done on spam detection on Twitter. Grier et al. [74] analyzed the features of spam on Twitter. However, they detect spam using three blacklisting services. Similarly, one of the methods employed to remove spam on Twitter is Clean Tweets¹ [101]. Clean Tweets filters tweets from users who are less than a day (or any duration specified) old and tweets that mention three (or any number specified) trending topics. However, it would be unable to detect spammers who auto-tweet or post spam-like tweets at regular intervals (like EasyCash435 or onstrategy, Figure 3.1 (g) and (h)), which our approach can easily detect. Also, since URL shortening services such as http://bit.ly are often used on Twitter, users cannot guess which references are pointed at, which in turn is an attractive feature for spammers. However, since our categorization method is content independent, we can easily identify such spam using this method. Yardi et al. [190] state “Twitter spam varies in style and tone; some approaches are well-worn and transparent and others are deceptively sophisticated and adaptable.” Using this method, we can capture the characteristics of spam, whether it is generated by an auto-tweet service, a malicious advertiser or a passionate campaigner.

¹ http://www.seoq.com/blvdstatus/clean-tweets.html

Automated email spamming has been studied by [188]. They have identified the activity of botnets generating e-mail spam as being ‘bursty’ (inferred from the duration of activity) and ‘specific’ (pertaining to a randomly generated URL matching the signature). In this study of Twitter, we identify automated activity by a set pattern of retweeting (indicated by much lower time-interval entropy compared to user entropy).

Note that this approach is based on the observed collective response to content.
By reposting some content, users give implicit feedback on that content. Therefore, unlike [3, 31], this method can be applied even in systems where users do not explicitly rate other users. We can automatically detect newsworthy, information-rich content and separate it from other user-generated content, based on user response. We have shown that this method can further categorize content within this class into blogs or celebrity websites and news. [187] studies the flow of information between these sub-categories.

10.2 Interactions and Community

Community detection is an extremely active research area, with a variety of methods proposed, including conductance [7], spectral and graph partitioning [176, 163] and modularity maximization [141, 52]. We demonstrate that to get the full picture of a network’s emergent structure, community detection methods must take into account the dynamic process occurring on the network. We show in this thesis that most of the existing community detection methods assume a specific type of interaction and can be expressed using our framework of the generalized linear interaction model.

It might be argued that taking interactions into account eventually leads to a weighted graph, and that off-the-shelf community detection algorithms for weighted graphs [52] might be applied. However, as in the unweighted case, applying a community detection method to a weighted graph without taking the nature of the interaction process into account might lead to unsatisfactory results. For example, if a conductance minimization algorithm is applied to a weighted graph whose weights are a consequence of a non-conservative process, the structure detected might differ significantly from the ground truth. Our method, on the other hand, learns the weights from the interaction process and then detects structure dynamically.
Learning the underlying interaction process from the activity logs of the nodes of the network, and using this process to determine the community structure, is left for future work.

Several community detection methods implicitly take dynamic interactions into account. These include spin models, random walk models and synchronization. Spin models [186] imply that the interaction is ferromagnetic, i.e., it favors spin alignment. As we show in this thesis, random walk and Kuramoto synchronization models [100] are both conservative in nature, with the former expressed in terms of the normalized Laplacian, and the latter in terms of the graph Laplacian. Arenas et al. [12] studied the relationship between the topological and community structure of complex networks using the Kuramoto model of synchronization. They created a threshold graph at some point in time, in which an edge exists between nodes only if their similarity exceeds some threshold. They defined communities as disconnected components of the threshold graph. We, on the other hand, explore different types of interactions and show how these reveal different hierarchical community structures in real-world complex networks. We also introduce a process-independent similarity metric.

Hu et al. [85] found communities based on signaling interactions. They described the interactions by a kernel $L(A) = (I + A)$ and used K-means clustering and F-statistics to find the optimal clusters at some point in time. However, it can be shown mathematically that the process they defined will never reach a steady state. Our non-conservative interaction model treats signaling interactions in a principled way.

Community detection methods are used to reveal the structure of complex networks. Leskovec et al. [122] found a ‘core and whiskers’ structure in real-world networks using conductance-based methods and argued that these methods cannot reveal any further structure in the giant core.
Song [162] claimed that there exist self-repeating patterns in complex networks at all length scales. Our results corroborate this claim, as we show a repeating ‘core and whiskers’ pattern in the Digg social network at many different length scales.

It can be shown that some of the interaction models described above not only solve certain regularized Semi-Definite Programs but also give fast solutions to these problems [148].

10.3 Interaction and Centrality

The interplay between the topology of the underlying network and the interaction occurring on it contributes to the complexity of real-life networks. For example, in epidemiology, the dynamics of disease spread on a network and the epidemic threshold are closely related to the spectral radius of the graph [179]. Similarly, a random walk on a graph is closely related to the Laplacian of the graph [37].

Researchers have developed an arsenal of centrality metrics to study the properties of networks, including degree, closeness [156], graph [79] and betweenness [54] centrality; Markov process-based random measures like Hubbell’s model [86]; and path-based ranking measures like the Katz score [92], SenderRank [95], and eigenvector centrality [22]. However, as Borgatti noted [23] and as we also argue, most centrality measures make implicit assumptions about the interaction process occurring on a network. In order to give correct predictions, these assumptions must match the actual dynamics of the network. Borgatti classified dynamic processes according to the trajectories they follow (geodesic, path, trail, walk) and the method of spread (transfer, serial or parallel duplication). We, on the other hand, maintain that a simpler classification scheme, which divides dynamic processes into conservative and non-conservative, captures the essential differences between them and informs the choice of the centrality metric. Apart from PageRank and Alpha-Centrality, other measures can also be classified as conservative or non-conservative [63].
Online social networks provide us with a unique opportunity to study the dynamic processes occurring on networks. Some studies compared empirical measures, such as tweets and mentions on Twitter [32, 102], with centrality metrics including PageRank and in-degree centrality. We, on the other hand, differentiate between two distinct methods of quantifying influence: estimating influence by measuring the dynamics of social network behavior, and using centrality metrics to predict influence. In addition, we evaluate the predictive influence models using the empirical measurements.

10.4 Interaction and Proximity

Granovetter [73] proposed neighborhood overlap as a metric to quantify the strength of a tie, i.e., how intensely and deeply two actors in a social network interact. If u and v have many friends in common, they are more likely to attend the same events and be exposed to the same information, and therefore to interact and act in a similar manner. A study of a massive mobile phone network established a correlation between social tie strength and neighborhood overlap, or proximity [144]. This study measured tie strength by the frequency and duration of phone calls between two people, and it measured proximity by the fraction of common neighbors. Though it established a correlation between proximity and activity, it did not attempt to predict activity. Granovetter’s paper is best remembered for the special role he assigned to weak ties in information diffusion. In this thesis, we focus only on the role of strong ties in predicting activity.

Activity prediction is similar to link prediction in that it uses network structure for prediction. However, these problems are fundamentally different: in link prediction, structural evidence is used to predict the structure of the network, while in activity prediction, structural evidence is used to predict user activity, a distinct source of evidence.
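The neighborhood-overlap notion of proximity discussed above can be made concrete with a short sketch. It assumes one common normalization, the number of shared neighbors divided by the number of distinct neighbors of either endpoint, excluding the two endpoints themselves. The cited studies may normalize differently, and the graph `FRIENDS` and the function name are hypothetical.

```python
def neighborhood_overlap(graph, u, v):
    """Fraction of the neighbors of u or v (excluding u and v) that are shared by both."""
    common = graph[u] & graph[v]
    either = (graph[u] | graph[v]) - {u, v}
    if not either:
        return 0.0
    return len(common) / len(either)


# Hypothetical undirected friendship graph stored as adjacency sets.
FRIENDS = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
```

Under this normalization, a tie whose endpoints share all of their other friends scores 1, and a tie whose endpoints share none scores 0.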
Several researchers have studied the link prediction task, in which they used network proximity to identify unobserved or missing links or to predict future links in a network. These studies used a number of metrics, including the number and fraction of common neighbors, the Adamic-Adar score [125, 127], a metric based on resource allocation (RA) [194], and metrics based on the random walk, such as effective conductance [98] and escape probability [168, 170]. Although some metrics were shown to perform better than others, no explanation was given for these differences. On the link prediction task in co-authorship networks, for example, the Adamic-Adar score gave the best results [125], while on the missing link prediction task in power grid and transportation networks, the linear version (RA) of Adamic-Adar performed best [194]. We postulate that the RA metric, which is equivalent to our conservative proximity, worked so well because it captures the conservative nature of interactions in power grid and transportation networks. We suspect that Adamic-Adar worked best on the link prediction task because, of all the metrics tested by Liben-Nowell and Kleinberg, it came closest to capturing the nature of interactions between authors. We suspect that the metrics we introduce in this work will lead to even better link prediction performance.

Activity and network structure are, of course, not completely independent. Previous studies examined the impact of social ties and network structure on user behavior. Crandall et al. [40] attempted to predict a user’s future behavior on Wikipedia and LiveJournal based on the number of network neighbors who have already adopted the behavior. Unlike that study, we also take the social proximity of neighbors into account. Anagnostopoulos et al.
[6] examined user activity on the social media site Flickr and found evidence for social correlations, i.e., they found that a user’s tagging activity was similar to that of her friends in the network. The goal of that work, however, was to test whether homophily or social influence is responsible for social correlation. Other studies [9, 34, 165] have examined the cause of behavior correlation in networks, both online social networks and friendship networks. We do not attempt to explain the source of social correlation and its relationship to network structure; rather, we exploit existing correlations to predict activity.

10.5 Information Spread Under the Microscope

Several researchers have studied the dynamics of information flow on networks; however, empirical studies have produced conflicting results. Wu et al. [184] examined patterns of email forwarding within an organization and found that email forwarding chains terminate after an unexpectedly small number of steps. They argued that, unlike the spread of a virus on a social network, which is expected to reach many individuals, the flow of information is slowed by the decay of similarity among individuals within the social network. They measured similarity by the distance in the organizational hierarchy between two individuals within an organization or, more generally, by the number of edges separating two nodes within a graph. Similarly, in a large-scale study of the effectiveness of word-of-mouth product recommendations, [114] found that most recommendation chains terminate after one or two steps. However, the authors noted the sensitivity of recommendations to the price and category of the product, leaving open the question of whether social networks are an effective tool for disseminating information, as opposed to purchasing products. Contrary to these studies, we find that information, such as news, reaches many individuals within a social network.
On Digg, whose users are highly interconnected, a story does not reach as many fans as on Twitter, where users are less densely connected.

Like Wu et al., [124] studied the patterns of forwarding of two popular email petitions. Contrary to their expectations, the forwarding chains produced long, narrow trees rather than bushy, wide ones. In these studies, however, the structure of the underlying social network was not directly visible but had to be inferred by observing new signatures on the forwarded petitions. This method offers only a partial view of the network and does not identify all edges between individuals who participated in the email chain. If an individual has already forwarded the message, she will not do so again, and an edge between her and the sender will not be observed.

A number of researchers have studied the flow of information and influence in the blogosphere and in a virtual world. [75] traced topic propagation through blogs and used a model of the spread of epidemics on networks [137] to characterize the spread of topics through the blogosphere. [119] defined an information cascade as a graph of hyperlinks between blog posts. As in the studies of email propagation [124], the networks in these studies were derived from the observed links between blog posts, i.e., from the diffusion of information. In our study, by contrast, they were extracted from the sites independently of data about the diffusion of information. [15] traced the spread of influence in a multi-player online game and found that, similar to our findings with social news, influence spreads easily on social networks in virtual worlds. This provides an independent confirmation of the importance of social networks in the dynamics of information flow.

Most of the earlier work does not clearly distinguish between cascades and the contagion processes generating these cascades. We believe that ours is the first work studying large-scale cascades.
Though large-scale studies of information contagion have been carried out earlier [120], the contagion was in general small in size (O(10)). We, on the other hand, have very large contagion processes (extending up to O(10^4)). The quantitative framework for analyzing cascades that we present here is very scalable and can easily be used to provide an efficient compressed representation of large cascades, when storing the complete information about the entire cascade is no longer trivial.

In previous studies [119, 120], the authors characterize the cascades and contagion processes using a multilevel approach comprising global and local signatures. Cascades are considered to be approximately isomorphic if they have the same global signature. If the global signatures match, more expensive isomorphism tests based on local signatures are carried out. To aid reasoning about cascades, the authors focus on local cascades, which they define as the ‘cascade in the (undirected) neighborhood of the node’, which for every node is the subgraph induced on the nodes reachable from it. They enumerate the shapes of these local cascades. As cascades grow in size, the number of possible shapes increases exponentially and such enumeration becomes infeasible. Note that in this work we provide a scalable, efficient and compressed representation of the observed cascades and make no claims about whether the observed cascades are the actual cascades or fragments of them.

In our study of Digg, we have contagion processes infecting up to 20,000 users. We aim to deduce their qualitative as well as quantitative properties, such as shape and size. Hence a more formal framework for characterizing cascades is required. In this work, we provide such a formalism, which not only captures the macro, meso and micro level signatures described above, but much more.
For instance, it captures the similarity between cascades that were initially similar but later became dissimilar, or between cascades that are similar at some stage of their growth. This formalism enables us to distinguish between cascades and obviates the need for enumerating them or drawing their shape.

In [155, 75], the underlying network on which information spreads is not observed, but has to be inferred from the observed cascades. However, such inferences [155] are based on the hypothesis that the contagion process follows an independent cascade model [93]. Our work, on the other hand, focuses on providing a quantitative tool to analyze the trends and patterns of actual contagion processes observed on real-life networks. Even when the underlying network is predicted using different inference methods, e.g., [155, 75], the trends of the contagion process occurring on the network can be investigated using the cascade generating function. Ongoing work (Chapter 6, Chapter 8) demonstrates that these tools can be effective in verifying or rejecting the hypotheses used for modeling information spread [14, 48, 181]. They can also be used for evaluating the robustness of inferred networks [155].

As demonstrated by the third story in our examples, we observe that if the submitter is well connected, the community effect is visible at all stages. However, initial popularity only within the tightly knit community (shown by a high cascade value and few seeds in the initial stages) does not ensure global popularity (a large number of votes). In contrast, stories submitted by a not so well connected user, which spread by branching and deepening initially (with low cascade values) but have a larger number of initial active seeds, become more popular globally (as shown by the second story in the example).
This observation is in agreement with those reported in [106, 15], namely that content diffusing primarily through an interconnected community tends to be confined to that community. These cascades are also complicated by the interplay between social influence and homophily [6, 34]. Future work will address these questions more closely. We observe that though a power law describes the degree distribution in these networks well, it accounts for only a small fraction of cascade sizes at the tail of the distribution. Rather, the entire data can be approximated well with a stretched-exponential (Weibull), lognormal or double Pareto lognormal distribution, similar to those observed in [78]. In contrast, in previous works [119, 120], the cascade size was found to be described well by the power-law distribution.

In graphs with a power-law degree distribution, a common property of social networks [5], large degree heterogeneity speeds up epidemics [17, 126], resulting in a vanishing epidemic threshold in the limit of very large graphs [158]. This result has alarming implications for the propagation of viruses in human populations and computer networks: any outbreak, even one that is not very virulent, will spread to infect a large number of nodes.

In [149], the authors conjecture that for any virus propagation model (including SIS and SIR), the epidemic threshold depends only on the largest eigenvalue of the adjacency matrix of the network. However, [30] argue that while this holds true for the SIS model, the heterogeneous mean-field (HMF) prediction in the SIR model seems to be much more accurate than the generic claim made in [149] for scale-rich networks. They claim that on quenched scale-rich networks the threshold of generic epidemic models is vanishing or finite depending on the presence or absence of a steady state. Future work includes exploring these alternative epidemic models further.

Another modified spreading process for social contagion that has been considered is the effect of adding “stiflers” [17].
As in the friend saturating model (FSM), stiflers will not spread a story (rumor) no matter how many times they encounter it. Stiflers, however, are not merely desensitized to multiple exposures; they may actively convert spreaders or susceptible nodes into stiflers. This complicated dynamic can lead to drastic changes, e.g., the elimination of the epidemic threshold. On Digg, a fan who does not vote on a story after multiple exposures does not actively persuade the exposed and susceptible fans not to vote on the story. Hence, this model does not apply to the process of information diffusion on Digg.

The friend saturating model we have used to describe cascades on Digg is a special case of a broader class of models called “decreasing cascade models” [93]. Several works have observed similar diminishing returns from friends in social networks. [114] analyzed the usefulness of product recommendations on Amazon.com. They found that people rarely received more than a handful of recommendations for any product, and that the marginal benefit of multiple recommendations, while product dependent, was typically sublinear (i.e., two recommendations did not make someone twice as likely to buy as one recommendation). Link formation was studied in [99], which also found diminishing returns in the probability of befriending someone with whom one shares n mutual friends, with saturation occurring around n = 5. The probability of joining a group that n friends have joined was studied in [13], with saturation occurring for n around 10-20.

[88] modeled viral email cascades using branching processes like the Galton-Watson process and the Bellman-Harris process. They argued that the topology of the underlying social network is irrelevant to the prediction of cascade size. This may hold true for the tree-like cascades studied by the authors. However, as stated previously, the dynamics of information propagation on Digg are not tree-like and these models do not hold.
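The idea of diminishing returns from repeated exposures, described above, can be illustrated with a small sketch. It treats a decreasing cascade model as a sequence of per-exposure adoption probabilities that never increases from one exposure to the next. The saturating variant below, in which only the first exposure has any effect, is a deliberately extreme stand-in and is not claimed to reproduce the parameters of the friend saturating model used in this thesis. All function names and the value 0.1 are hypothetical.

```python
def independent_marginals(p, exposures):
    """Independent cascade: every exposure succeeds with the same probability p."""
    return [p] * exposures


def saturating_marginals(p, exposures):
    """Extreme saturation (an assumption): only the first exposure can succeed."""
    return [p] + [0.0] * (exposures - 1)


def is_decreasing_cascade(marginals):
    """True if the per-exposure success probability never increases."""
    return all(a >= b for a, b in zip(marginals, marginals[1:]))


def cumulative_adoption(marginals):
    """Probability of having adopted after each successive exposure."""
    not_adopted = 1.0
    result = []
    for p in marginals:
        not_adopted *= 1.0 - p
        result.append(1.0 - not_adopted)
    return result
```

With `p = 0.1`, the independent variant keeps raising the adoption probability with each additional friend, while the saturating variant stays flat after the first exposure. Both sequences satisfy `is_decreasing_cascade`.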
Future work includes learning the nature of dynamic activity in a network. This in turn would help in the design of customized metrics with better predictive capabilities.

10.6 Alpha-Centrality

A variety of metrics have been proposed to measure a node’s centrality in a network [92, 86, 20, 54, 180, 146, 22, 135], yet few studies have systematically evaluated their performance on real-world networks. Liben-Nowell and Kleinberg [123] compared the performance of several commonly used centrality metrics on the link prediction task and found the Katz score [92] to be the most effective measure for this task, outperforming PageRank [146] and its variants. The Alpha-Centrality metric modifies the Katz score by introducing a parameter α that gives a weight to indirect links and also sets the length scale of interactions in the network. We showed in Chapter 6 [63] that normalized Alpha-Centrality outperforms other centrality metrics on the task of predicting influential nodes in an online social network.

Similar to personalized PageRank [89] for conservative interaction, each user’s unique notion of importance in non-conservative interaction can be captured using a customized starting vector for each user in Alpha-Centrality, leading to personalized Alpha-Centrality. The use of residual vectors and incremental computation in the calculation of approximate Alpha-Centrality makes the method scalable. Moreover, as in personalized PageRank, these residual vectors can be shared across multiple personalized views, scaling the personalized Alpha-Centrality metric. Analogous to approximate PageRank [7], in approximate Alpha-Centrality the residual vector is redistributed at each iteration to reduce the difference between the Alpha-Centrality vector and its approximate version. However, the process of redistribution of the residual vector mimics the kind of interaction the model emulates.
For approximate PageRank, the redistribution of residual vectors is conservative (with the total weight of the residual vector conserved). In approximate Alpha-Centrality, on the other hand, the redistribution of residual vectors is not conservative.

Guimera and collaborators [76, 77] proposed a role-based description of complex networks. They define a role in terms of the relative within-community degree z (which measures how well the node is connected to other nodes in its community) and the participation coefficient P (which measures how well the node is connected to nodes in other communities). They proposed a heuristic classification scheme based on where the nodes lie in the z–P plane. This classification scheme is similar to the locally vs. globally connected distinction we are making, with connector nodes being more globally connected, while provincial hubs and peripheral nodes are more locally connected. Role-based analysis requires community decomposition of the network to be performed first. This is a computationally expensive procedure for most real-world networks. Our approach, on the other hand, allows us to differentiate between the roles of nodes in a more computationally efficient way.

Like us, Arenas et al. [10] have generalized modularity to find correlations between nodes that go beyond nearest neighbors. Their approach relies on the presence of motifs [131, 132], i.e., connected subgraphs such as cycles, to identify communities within a network. For example, a higher than expected density of triangles implies the presence of a community, and a triangle modularity may be defined to identify it. The motif-based modularity uses the size of the motif to impose a limit on the proximity of neighbors. Our Alpha-Centrality based modularity optimization method, on the other hand, imposes no such limit. The measure of global correlation computed using Alpha-Centrality is equal to the weighted average of correlations for motifs of different sizes.
Our method enables us to easily calculate this complex term.

Chapter 11 Future Work and Conclusion

We leverage the power of online social networks like Digg, Twitter and Facebook in our efforts to understand the structure and dynamics of complex networks. These online social networks have not only been a rich source of data, but have also proven to be an effective platform for testing the underlying principles of complex networks. In this thesis we have studied the impact of interactions on the analysis of complex networks, focusing especially on online social media.

There are many different kinds of interactions taking place on social networks like Twitter. For instance, besides being a medium for conversations and information spread, Twitter hosts services for sophisticated spamming, marketing and auto-tweeting. We find that the heterogeneous interactions on Twitter are often indicative of the characteristics of the associated content. For example, the activity associated with a campaign is observed to be very different from the activity associated with the spread of news. We propose a novel information-theoretic framework for the classification of content based on users’ participation and the temporal signature of the associated activity. Using this technique we are able to automatically classify tweets into newsworthy content, campaigns, spam, advertisements and promotions. Our robust, content-independent, scalable technique has wide-scale applications in spam detection, trend identification, trust management, user modeling, social search and content classification.

We also provide different categorizations of dynamic activities based on their inherent properties. For example, we can classify dynamic activities into conservative and non-conservative activities, or into constant-rate and varying-rate activities. We propose a simple yet useful generalized linear model that can be used to describe a wide range of dynamic activities.
Using this model we study and compare the characteristics of various classes of activities and explore their similarities and differences.

We hypothesize that the structure of a network depends not only on the topology but also on the dynamic activities. Therefore the implicit dynamics emulated by the metrics used to predict structure (whether community detection, proximity or centrality metrics) must match the actual dynamics taking place in the network for best predictions. We test this hypothesis empirically in the context of information spread in online social networks. We detect the community structure using the framework of synchronization. Nodes that are similar to each other synchronize with each other faster than with the rest of the network. However, similarity depends on the nature of the dynamic processes. We formulate a novel methodology for community detection using non-conservative interactions. Given the topology of the network, different dynamic activities give different perspectives on community structure. Similarly, different centrality metrics give different views of centrality in the network. On Digg and Twitter, when the activity is information diffusion, Alpha-Centrality seems to predict influentials better than PageRank (when the empirical measure of influence is the average number of fan votes). This is because the dynamics captured by Alpha-Centrality are closer to the spread of information on Digg than those captured by PageRank.

There exists a complex feedback between structure and activity. We show using proximity metrics that structure can also be used to predict activity. Proximity metrics too can be classified as conservative or non-conservative, based on the interactions they implicitly emulate. We compare the performance of different proximity metrics and find that metrics that explicitly take attention into account give substantially better results.
We also investigate what factors make the actions of some users more predictable than those of other users. Future work includes modeling user behavior and understanding the effect of heterogeneity of user behavior on the structure and dynamics of a network.

Recognizing the potential of Alpha-Centrality in modeling online social networks, we study this metric in greater detail. We show that this metric can be used to predict the importance of individuals as well as to detect groups or community structure within a network. We introduce a normalized version of this metric which measures the number of attenuated paths between individuals. We extend the modularity-maximization method [142] for community detection to take path-based connectivity into account, using normalized Alpha-Centrality. Normalized Alpha-Centrality can prove to be a very useful tool in network analysis since it contains a tunable parameter which sets the length scale of the interactions between individuals. This metric helps us to identify not only the locally important leaders, but also the ‘bridges’ or ‘brokers’ who facilitate communication between communities. By varying the tuning parameter, we can seamlessly connect local centrality models like degree centrality to global centrality models like eigenvector centrality. It provides a simpler way to quantify the entire spectrum, from coarse-grained to fine-grained structure of complex networks, than the difficult and computationally expensive previous attempts in this direction (motif-based and role-based descriptions).

We extend network analysis to multi-modal networks with multiple types of links and entities, since real-world networks are usually multi-modal in nature. Most existing network analysis algorithms conflate the relations and project such networks onto simple graphs composed of entities of a single type, losing information in the process.
In this work, we present a compact mathematical framework for analyzing multi-modal complex networks and demonstrate its applicability in the study of real networks.

Most of the above methods assume that the topology is static over time. How can structure be detected when the topology itself changes with time? Existing network analysis techniques represent such dynamic networks as static networks by aggregating the nodes and edges over some time period, and then apply static network analysis tools to them. As a consequence, such methods are often unsuccessful in correctly characterizing dynamic networks. We have designed a parametric dynamic centrality metric which closely emulates the interaction process occurring in real-world networks like citation networks. We have also developed an approach to find optimal parameters for the metric. We have used this dynamic centrality to rank nodes by the number of time-dependent paths that connect them to other nodes in the network. In addition to discovering the best connected, or influential, nodes, this method can identify the nodes that are most connected to a specific node and, therefore, have the highest influence on it. We have performed a detailed analysis of a toy dynamic network and of citation networks. Our results indicate that this dynamic centrality metric can produce a radically different view of which nodes in the network are important than static measures do, and that it leads to new insights about the structure of the dynamic network. Not only does it predict the importance of nodes, but it also helps to uncover structural properties missed by other well-known centrality metrics [109, 57].

One of the prime functionalities of online social networks is information propagation, a dynamic activity we study in greater detail. We empirically study the cascading of information in these networks and provide a mathematical framework for quantifying such cascades.
We investigate the characteristics of user activity and the evolution and reach of information, or stories, in the networks of Twitter and Digg. One of the puzzles that surfaces in the course of these investigations is that the size of actual cascades on these online social networks is much smaller than that predicted theoretically. We find that while network structure somewhat limits the growth of cascades, a far more dramatic effect comes from the social contagion mechanism. Therefore we propose an alternative cascade model, the friend cascade model, and demonstrate through simulations that this model emulates the characteristics of the real cascades occurring on Digg more closely than existing contagion models.

Our investigations into the structure and dynamics of complex networks have revealed many unresolved questions in the field. Future work will attempt to answer some of these questions.

There are many different definitions of group or community in the literature [51]. How do we evaluate these methods? What is the quality of a community? How do we compare different metrics of quality? We plan to explore some of these questions further in the future.

In this thesis we have shown that structure depends on the dynamic activity occurring in the network. Different dynamic activities give different views of the network. In the future we plan to learn the dynamic activity occurring in the network, and use this knowledge to design metrics with better predictive capabilities.

There is a complex interplay of individual and group dynamics in complex networks. Together, they influence the emerging trends and shape the evolution of the network as a whole. Understanding individual behavior and its role in group dynamics is an important step towards decoupling this problem. In spite of the heterogeneity of human dynamics, there exists some universality in the patterns of human behavior [71].
Future work also includes unearthing the factors that determine these distinct characteristics, and building viable mathematical models of individual user activity on online social networks. Further, we plan to explore the other dimension of the same problem: how group dynamics affect individual behavior.

To summarize, this work is an initial attempt to understand the structure and dynamics of complex networks. Our work just scratches the surface of the intrinsically interwoven, multilayered mystery that is the complex network. Further exploration into the realms of network analysis is required and is the scope of future work.

References

[1] Lada Adamic and Eytan Adar. Friends and neighbors on the Web. Social Networks, 25(3):211–230, 2003.
[2] Lada Adamic and Eytan Adar. How to search a social network. Social Networks, 27(3):187–203, 2005.
[3] Eugene Agichtein, Carlos Castillo, Debora Donato, Aristides Gionis, and Gilad Mishne. Finding high-quality content in social media. In Proceedings of the international conference on Web search and web data mining, WSDM ’08, pages 183–194, New York, NY, USA, 2008. ACM.
[4] Réka Albert, Hawoong Jeong, and Albert-László Barabási. Diameter of the world-wide web. 1999.
[5] L. A. Amaral, A. Scala, M. Barthelemy, and H. E. Stanley. Classes of small-world networks. Proceedings of the National Academy of Sciences of the United States of America, 97(21):11149–11152, October 2000.
[6] A. Anagnostopoulos, R. Kumar, and M. Mahdian. Influence and correlation in social networks. In Proc. Knowledge Discovery and Data Mining Conference (KDD-2008), 2008.
[7] R. Andersen, F. Chung, and K. Lang. Local graph partitioning using pagerank vectors. In Proc. IEEE Foundations of Computer Science, pages 475–486, 2006.
[8] R. M. Anderson and R. May. Infectious diseases of humans: dynamics and control. Oxford University Press, 1991.
[9] Sinan Aral, Lev Muchnik, and Arun Sundararajan.
Distinguishing influence-based contagion from homophily-driven diffusion in dynamic networks. Proceedings of the National Academy of Sciences, 106(51):21544–21549, December 2009.
[10] A. Arenas, A. Fernandez, S. Fortunato, and S. Gomez. Motif-based communities in complex networks. Mathematical Systems Theory, 41, 2008.
[11] Alex Arenas, Albert Díaz-Guilera, Jurgen Kurths, Yamir Moreno, and Changsong Zhou. Synchronization in complex networks. Physics Reports, 469(3):93–153, December 2008.
[12] Alex Arenas, Albert Díaz-Guilera, and Conrad J. Pérez Vicente. Synchronization Reveals Topological Scales in Complex Networks. Physical Review Letters, 96(11):114102+, March 2006.
[13] Lars Backstrom, Dan Huttenlocher, Jon Kleinberg, and Xiangyang Lan. Group formation in large social networks: membership, growth, and evolution. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD ’06, pages 44–54, New York, NY, USA, 2006. ACM.
[14] Norman Bailey. The Mathematical Theory of Infectious Diseases and its Applications. Griffin, London, 1975.
[15] E. Bakshy, B. Karrer, and L. A. Adamic. Social influence and the diffusion of user-created content. In EC ’09: Proc. 10th ACM conference on Electronic commerce, pages 325–334, 2009.
[16] Eytan Bakshy, Jake M. Hofman, Winter A. Mason, and Duncan J. Watts. Everyone’s an influencer: quantifying influence on twitter. In Proceedings of the fourth ACM international conference on Web search and data mining, WSDM ’11, pages 65–74, New York, NY, USA, 2011. ACM.
[17] Alain Barrat, Marc Barthélemy, and Alessandro Vespignani. Dynamical Processes on Complex Networks. Cambridge University Press, Cambridge, England, 1st edition, 2008.
[18] Peter Beaumont. Can social networking overthrow a government? http://www.smh.com.au/technology/technology-news/can-social-networking-overthrow-a-government-20110225-1b7u6.html, 2011.
[19] Luís M. A.
Bettencourt, Ariel Cintrón-Arias, David I. Kaiser, and Carlos Castillo-Chávez. The power of a good idea: quantitative modeling of the spread of ideas from epidemiological models. Physica A: Statistical Mechanics and its Applications, In Press, Corrected Proof, Jun 2005.
[20] Philip Bonacich. Factoring and weighting approaches to status scores and clique identification. Journal of Mathematical Sociology, 2(1):113–120, 1972.
[21] Phillip Bonacich. Power and centrality: a family of measures. The American Journal of Sociology, 92(5):1170–1182, 1987.
[22] Phillip Bonacich and Paulette Lloyd. Eigenvector-like measures of centrality for asymmetric relations. Social Networks, 23(3):191–201, 2001.
[23] S. Borgatti. Centrality and network flow. Social Networks, 27(1):55–71, January 2005.
[24] Bernhard E. Boser, Isabelle M. Guyon, and Vladimir N. Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of the fifth annual workshop on Computational learning theory, COLT ’92, pages 144–152, New York, NY, USA, 1992. ACM.
[25] Danah Boyd, Scott Golder, and Gilad Lotan. Tweet, tweet, retweet: Conversational aspects of retweeting on twitter. Hawaii International Conference on System Sciences, 0:1–10, 2010.
[26] P. S. Bradley, C. A. Reina, and U. M. Fayyad. Clustering Very Large Databases Using EM Mixture Models. Pattern Recognition, International Conference on, 2:2076+, 2000.
[27] U. Brandes, D. Delling, M. Gaertler, R. Gorke, M. Hoefer, Z. Nikoloski, and D. Wagner. On modularity clustering. IEEE Trans. on Knowl. and Data Eng., 20(2):172–188, 2008.
[28] Jacqueline J. Brown and Peter H. Reingen. Social Ties and Word-of-Mouth Referral Behavior. The Journal of Consumer Research, 14(3):350–362, 1987.
[29] R. S. Burt. Structural Holes: The Structure of Competition. Harvard University Press, Cambridge, MA, 1992.
[30] Claudio Castellano and Romualdo Pastor-Satorras. Thresholds for epidemic spreading in networks. Dec 2010.
[31] James Caverlee, Ling Liu, and Steve Webb. Socialtrust: tamper-resilient trust establishment in online communities. In JCDL ’08: Proceedings of the 8th ACM/IEEE-CS joint conference on Digital libraries, pages 104–114, New York, NY, USA, 2008. ACM.
[32] Meeyoung Cha, Hamed Haddadi, Fabrício Benevenuto, and Krishna P. Gummadi. Measuring User Influence in Twitter: The Million Follower Fallacy. In Proceedings of 4th International Conference on Weblogs and Social Media (ICWSM), 2010.
[33] Meeyoung Cha, Alan Mislove, and Krishna P. Gummadi. A measurement-driven analysis of information propagation in the flickr social network. In Proceedings of the 18th international conference on World wide web, WWW ’09, pages 721–730, New York, NY, USA, 2009. ACM.
[34] M. D. Choudhury, H. Sundaram, A. John, D. D. Seligmann, and A. Kelliher. “Birds of a feather”: Does homophily among users impact information diffusion in social media? In arXiv:1006.1702v1, 2010.
[35] Nicholas A. Christakis and James H. Fowler. The spread of obesity in a large social network over 32 years. The New England Journal of Medicine, 357(4):370–379, July 2007.
[36] Nicholas A. Christakis and James H. Fowler. The Collective Dynamics of Smoking in a Large Social Network. New England Journal of Medicine, 358(21):2249–2258, May 2008.
[37] Fan Chung and Wenbo Zhao. Pagerank and random walks on graphs.
[38] Fan R. K. Chung. Spectral Graph Theory (CBMS Regional Conference Series in Mathematics, No. 92). American Mathematical Society, February 1997.
[39] A. Clauset, C. R. Shalizi, and M. E. J. Newman. Power-law distributions in empirical data. SIAM Review, 51(4):661+, 2009.
[40] David Crandall, Dan Cosley, Daniel Huttenlocher, Jon Kleinberg, and Siddharth Suri. Feedback effects between similarity and social influence in online communities.
In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD ’08, pages 160–168, New York, NY, USA, 2008. ACM.
[41] R. Crane and D. Sornette. Viral, quality, and junk videos on youtube: Separating content from noise in an information-rich environment. In SIP, 2008.
[42] R. Crane and D. Sornette. Viral, quality, and junk videos on youtube: Separating content from noise in an information-rich environment. In Proceedings of the AAAI Symposium on Social Information Processing, 2008.
[43] P. Csermely. Creative elements: network-based predictions of active centres in proteins and cellular and social networks. Trends in Biochemical Sciences, 33(12):569–576, December 2008.
[44] Leon Danon, Jordi Duch, Albert Diaz-Guilera, and Alex Arenas. Comparing community structure identification. October 2005.
[45] J. Davitz, J. Yu, S. Basu, D. Gutelius, and A. Harris. Search and routing in social networks.
[46] A. Dempster, N. Laird, and D. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Royal Statistical Society B, 39:1–38, 1977.
[47] P. Dienes. Notes on linear equations in infinite matrices. Quart. J. of Math. (Oxford), 3:253–268, 1932.
[48] P. S. Dodds and D. J. Watts. Universal behavior in a generalized model of contagion. In Phys. Rev. Letters, 2004.
[49] P. Domingos and M. Richardson. Mining the network value of customers. In KDD, 2001.
[50] Jordi Duch and Alex Arenas. Community detection in complex networks using extremal optimization. Physical Review E, 72(2):027104+, Aug 2005.
[51] S. Fortunato. Community detection in graphs. Phys. Reports, 486:75–174, 2010.
[52] Santo Fortunato. Community detection in graphs. Physics Reports, 486(3-5):75–174, February 2010.
[53] Santo Fortunato and Alessandro Flammini. Random Walks on Directed Networks: the Case of PageRank. International Journal of Bifurcation and Chaos, 17:2343–2353, Sep 2007.
[54] L. C. Freeman.
A set of measures of centrality based on betweenness. Sociometry, 40:35–41, 1977.
[55] Linton Freeman. Finding Social Groups: A Meta-Analysis of the Southern Women Data.
[56] F. Gebali. Markov chains. Analysis of Computer and Communication Networks, pages 65–122, 2008.
[57] Rumi Ghosh, Tsung-Ting Kuo, Chun-Nan Hsu, Shou-De Lin, and Kristina Lerman. Time-aware ranking in dynamic citation networks. In ICDM Workshops, pages 373–380, 2011.
[58] Rumi Ghosh and Kristina Lerman. Leaders and Negotiators: An Influence-based Metric for Rank. In Proceedings of the Third International ICWSM Conference (2009).
[59] Rumi Ghosh and Kristina Lerman. Community Detection using a Measure of Global Influence. August 2008.
[60] Rumi Ghosh and Kristina Lerman. Leaders and negotiators: An influence-based metric for rank. In ICWSM, 2009.
[61] Rumi Ghosh and Kristina Lerman. Structure of Heterogeneous Networks. In Proceedings of 1st IEEE Conference on Social Computing, June 2009.
[62] Rumi Ghosh and Kristina Lerman. A Parameterized Centrality Metric for Network Analysis. October 2010.
[63] Rumi Ghosh and Kristina Lerman. Predicting influential users in online social networks. In Proceedings of KDD workshop on Social Network Analysis (SNA-KDD), July 2010.
[64] Rumi Ghosh and Kristina Lerman. A Framework for Quantitative Analysis of Cascades on Networks. In Proceedings of Web Search and Data Mining Conference (WSDM), Nov 2011.
[65] Rumi Ghosh and Kristina Lerman. Parameterized centrality metric for network analysis. Physical Review E, 83(6):066118+, June 2011.
[66] Rumi Ghosh and Kristina Lerman. The Role of Dynamic Interactions in Multi-scale Analysis of Network Structure. January 2012.
[67] Rumi Ghosh, Kristina Lerman, Tawan Surachawala, Konstantin Voevodski, and Shang-Hua Teng. Non-Conservative Diffusion and its Application to Social Network Analysis. Submitted to KDD, February 2011.
[68] Rumi Ghosh, Tawan Surachawala, and Kristina Lerman.
Entropy-based classification of ’retweeting’ activity on twitter. abs/1106.0346, 2011.
[69] M. Girvan and M. E. J. Newman. Community structure in social and biological networks. Proc. Natl. Acad. Sci. USA, 99:7821, 2002.
[70] J. Goldenberg, B. Libai, and E. Muller. Talk of the Network: A Complex Systems Look at the Underlying Process of Word-of-Mouth. Marketing Letters, pages 211–223, August 2001.
[71] Marta C. Gonzalez, Cesar A. Hidalgo, and Albert-Laszlo Barabasi. Understanding individual human mobility patterns. Nature, 453(7196):779–782, June 2008.
[72] M. Granovetter. The strength of weak ties. Am. J. Sociology, May 1973.
[73] Mark S. Granovetter. The Strength of Weak Ties. American Journal of Sociology, 78(6):1360–1380, 1973.
[74] Chris Grier, Kurt Thomas, Vern Paxson, and Michael Zhang. @spam: the underground on 140 characters or less. In Proceedings of the 17th ACM conference on Computer and communications security, CCS ’10, pages 27–37, New York, NY, USA, 2010. ACM.
[75] D. Gruhl, R. Guha, D. Liben-Nowell, and A. Tomkins. Information diffusion through blogspace. In WWW, 2004.
[76] Roger Guimera and Luis A. Nunes Amaral. Functional cartography of complex metabolic networks. Nature, 433(7028):895–900, February 2005.
[77] Roger Guimera, Marta Sales-Pardo, and Luis A. N. Amaral. Classes of complex networks defined by role-to-role connectivity profiles. Nat Phys, 3(1):63–69, January 2007.
[78] L. Guo, E. Tan, S. Chen, X. Zhang, and Y. E. Zhao. Analyzing patterns of user content generation in online social networks. In KDD, 2009.
[79] P. Hage and F. Harary. Eccentricity and centrality in networks. Social Networks, 17:57–63, 1995.
[80] Herbert W. Hethcote. An immunization model for a heterogeneous population. Theoretical Population Biology, 14(3):338–349, 1978.
[81] Herbert W. Hethcote. The Mathematics of Infectious Diseases. SIAM Review, 42(4):599–653, 2000.
[82] Nathan Hodas and Kristina Lerman.
Analyzing effects of limited attention and visibility on social contagion using human response dynamics. Private communication, 2012.
[83] T. Hogg and G. Szabo. Diversity of user activity and content quality in online communities. In Proc. Int. Conference on Weblogs and Social Media (ICWSM09), 2009.
[84] Tad Hogg and Kristina Lerman. Stochastic Models of User-Contributory Web Sites. In Proceedings of 3rd International Conference on Weblogs and Social Media (ICWSM), March 2009.
[85] Yanqing Hu, Menghui Li, Peng Zhang, Ying Fan, and Zengru Di. Community detection by signaling on complex networks. Physical Review E, 78(1):016115+, July 2008.
[86] C. Hubbell. An input-output approach to clique identification. Sociometry, 28:377–399, 1965.
[87] B. A. Huberman, P. L. T. Pirolli, J. E. Pitkow, and R. M. Lukose. Strong regularities in World Wide Web surfing. Science, 280(5360):95–97, 1998.
[88] José L. Iribarren and Esteban Moro. Impact of Human Activity Patterns on the Dynamics of Information Diffusion. Physical Review Letters, 103(3):038702+, Jul 2009.
[89] Glen Jeh and Jennifer Widom. Scaling personalized web search. In Proceedings of the 12th international conference on World Wide Web, WWW ’03, pages 271–279, New York, NY, USA, 2003. ACM.
[90] Daniel Kahneman. Attention and Effort (Experimental Psychology). Prentice Hall, 1973.
[91] Elihu Katz and Paul Lazarsfeld. Personal Influence: The Part Played by People in the Flow of Mass Communications. Transaction Publishers, October 2005.
[92] Leo Katz. A new status index derived from sociometric analysis. Psychometrika, 18(1):39–43, March 1953.
[93] D. Kempe, J. Kleinberg, and E. Tardos. Maximizing the spread of influence through a social network, 2003.
[94] Jeffrey O. Kephart and Steve R. White. Directed-graph epidemiological models of computer viruses. Security and Privacy, IEEE Symposium on, 0:343, 1991.
[95] C. Kiss and M. Bichler.
Identification of influencers - measuring influence in customer networks. Decision Support Systems, 46(1):233–253, 2008.
[96] M. Kitsak, L. K. Gallos, S. Havlin, F. Liljeros, L. Muchnik, H. E. Stanley, and H. A. Makse. Identifying influential spreaders in complex networks. 2010.
[97] Risi I. Kondor and John Lafferty. Diffusion Kernels on Graphs and Other Discrete Structures. In ICML, pages 315–322, 2002.
[98] Yehuda Koren, Stephen C. North, and Chris Volinsky. Measuring and extracting proximity graphs in networks. ACM Trans. Knowl. Discov. Data, 1(3), December 2007.
[99] Gueorgi Kossinets and Duncan J. Watts. Empirical Analysis of an Evolving Social Network. Science, 311(5757):88–90, January 2006.
[100] Y. Kuramoto. Chemical Oscillations, Waves, and Turbulence. Springer-Verlag, New York, 1984.
[101] Haewoon Kwak, Changhyun Lee, Hosung Park, and Sue Moon. What is Twitter, a social network or a news media? In Proceedings of the 19th international conference on World wide web, WWW ’10, pages 591–600, New York, NY, USA, 2010. ACM.
[102] Changhyun Lee, Haewoon Kwak, Hosung Park, and Sue Moon. Finding Influentials from Temporal Order of Information Adoption in Twitter. In Proceedings of 19th World-Wide Web (WWW) Conference (Poster), 2010.
[103] E. A. Leicht and M. E. J. Newman. Community structure in directed networks. Phys. Rev. Letters, 100:118703, 2008.
[104] Kristina Lerman. Social information processing in social news aggregation. IEEE Internet Computing: special issue on Social Search, 11(6):16–28, 2007.
[105] Kristina Lerman. Social Networks and Social Information Filtering on Digg. In Proceedings of 1st International Conference on Weblogs and Social Media (poster), December 2007.
[106] Kristina Lerman and Aram Galstyan. Analysis of social voting patterns on digg. In Proceedings of the 1st ACM SIGCOMM Workshop on Online Social Networks, 2008.
[107] Kristina Lerman and Rumi Ghosh.
Information Contagion: an Empirical Study of the Spread of News on Digg and Twitter Social Networks. In Proceedings of 4th International AAAI Conference on Weblogs and Social Media (ICWSM), March 2010.
[108] Kristina Lerman and Rumi Ghosh. Network Structure, Topology and Dynamics in Generalized Models of Synchronization. March 2012.
[109] Kristina Lerman, Rumi Ghosh, and Jeon H. Kang. Centrality Metric for Dynamic Networks. In Proceedings of KDD workshop on Mining and Learning with Graphs (MLG), June 2010.
[110] Kristina Lerman, Rumi Ghosh, and Tawan Surachawala. Social Contagion: An Empirical Study of Information Spread on Digg and Twitter Follower Graphs. February 2012.
[111] Kristina Lerman and Tad Hogg. Using a model of social dynamics to predict popularity of news. In Proceedings of the 19th international conference on World wide web, WWW ’10, pages 621–630, New York, NY, USA, 2010. ACM.
[112] Kristina Lerman, Suradej Intagorn, Jeon-Hyung Kang, and Rumi Ghosh. Using Proximity to Predict Activity in Social Networks. December 2011.
[113] Kristina Lerman and Laurie Jones. Social Browsing on Flickr. In Proceedings of International Conference on Weblogs and Social Media (ICWSM), March 2007.
[114] J. Leskovec, L. Adamic, and B. Huberman. The dynamics of viral marketing. In EC ’06: Proc. 7th Conf. on Electronic commerce, pages 228–237, 2006.
[115] J. Leskovec, L. Backstrom, and J. Kleinberg. Meme-tracking and the dynamics of the news cycle. In KDD, 2009.
[116] J. Leskovec and E. Horvitz. Planetary-scale views on a large instant-messaging network. In WWW ’08: Proc. 17th Int. World Wide Web Conference, pages 915–924, 2008.
[117] J. Leskovec, J. Kleinberg, and C. Faloutsos. Graphs over time: densification laws, shrinking diameters and possible explanations. In KDD ’05: Proc. 11th Int. Conf. on Knowledge discovery in data mining, pages 177–187, 2005.
[118] J. Leskovec, A. Krause, C. Guestrin, C. Faloutsos, J. Vanbriesen, and N. Glance.
Cost-effective outbreak detection in networks. In KDD ’07: Proc. 13th Int. Conf. on Knowledge discovery in data mining, pages 420–429, 2007.
[119] J. Leskovec, M. McGlohon, C. Faloutsos, N. Glance, and M. Hurst. Cascading behavior in large blog graphs. In SDM, 2007.
[120] J. Leskovec, A. Singh, and J. Kleinberg. Patterns of influence in a recommendation network. In PAKDD, pages 380–389, 2005.
[121] Jure Leskovec, Deepayan Chakrabarti, Jon Kleinberg, Christos Faloutsos, and Zoubin Ghahramani. Kronecker graphs: An approach to modeling networks. J. Mach. Learn. Res., 11:985–1042, March 2010.
[122] Jure Leskovec, Kevin J. Lang, Anirban Dasgupta, and Michael W. Mahoney. Statistical properties of community structure in large social and information networks. In WWW, pages 695–704, New York, NY, USA, 2008. ACM.
[123] D. Liben-Nowell and J. Kleinberg. The link-prediction problem for social networks. J. Am. Soc. Inf. Sci. Technol., 58(7):1019–1031, 2007.
[124] D. Liben-Nowell and J. Kleinberg. Tracing information flow on a global scale using internet chain-letter data. In PNAS, pages 4633–4638, 2008.
[125] David Liben-Nowell and Jon Kleinberg. The link-prediction problem for social networks. J. Am. Soc. Inf. Sci., 58(7):1019–1031, 2007.
[126] J. O. Lloyd-Smith, S. J. Schreiber, P. E. Kopp, and W. M. Getz. Superspreading and the effect of individual variation on disease emergence. Nature, 438(7066):355–359, November 2005.
[127] Linyuan Lü and Tao Zhou. Link prediction in complex networks: A survey. Physica A: Statistical Mechanics and its Applications, December 2010.
[128] David Lusseau and M. E. J. Newman. Identifying the role that animals play in their social networks. Proceedings of the Royal Society of London. Series B: Biological Sciences, 271(Suppl 6):S477–S481, December 2004.
[129] Benjamin Markines, Ciro Cattuto, and Filippo Menczer. Social spam detection.
In Proceedings of the 5th International Workshop on Adversarial Information Retrieval on the Web, AIRWeb ’09, pages 41–48, New York, NY, USA, 2009. ACM.
[130] C. Marlow, M. Naaman, D. Boyd, and M. Davis. HT06, tagging paper, taxonomy, flickr, academic article, ToRead. In Proceedings of Hypertext 2006, New York, 2006. ACM Press.
[131] R. Milo, S. Shen-Orr, S. Itzkovitz, N. Kashtan, D. Chklovskii, and U. Alon. Network motifs: Simple building blocks of complex networks. Science, 298(5594):824–827, October 2002.
[132] Ron Milo, Shalev Itzkovitz, Nadav Kashtan, Reuven Levitt, Shai Shen-Orr, Inbal Ayzenshtat, Michal Sheffer, and Uri Alon. Superfamilies of evolved and designed networks. Science, 303(5663):1538–1542, March 2004.
[133] Y. Moreno, R. Pastor-Satorras, and A. Vespignani. Epidemic outbreaks in complex heterogeneous networks. The European Physical Journal B - Condensed Matter and Complex Systems, 26(4):521–529, April 2002.
[134] Jerome L. Myers and Arnold D. Well. Research Design and Statistical Analysis. Lawrence Erlbaum, 2003.
[135] M. Newman. A measure of betweenness centrality based on random walks. Social Networks, 27(1):39–54, 2005.
[136] M. E. Newman, S. H. Strogatz, and D. J. Watts. Random graphs with arbitrary degree distributions and their applications. Phys Rev E Stat Nonlin Soft Matter Phys, 64(2 Pt 2), August 2001.
[137] M. E. J. Newman. Spread of epidemic disease on networks. Phys. Rev. E, 66(1):016128+, 2002.
[138] M. E. J. Newman. Detecting community structure in networks. The European Physical Journal B, 38:321–330, 2004.
[139] M. E. J. Newman. Fast algorithm for detecting community structure in networks. Phys. Rev. E, 69:066133, 2004.
[140] M. E. J. Newman. Finding community structure in networks using the eigenvectors of matrices. Phys. Rev. E, 74:036104, 2006.
[141] M. E. J. Newman. Modularity and community structure in networks.
Proceedings of the National Academy of Sciences, 103(23):8577–8582, June 2006.
[142] M. E. J. Newman and M. Girvan. Finding and evaluating community structure in networks. Phys. Rev. E, 69:026113, 2004.
[143] J. D. Noh and H. Rieger. Stability of shortest paths in complex networks with random edge weights. Phys. Rev. E, 66(6):066127+, 2002.
[144] J. P. Onnela, J. Saramäki, J. Hyvönen, G. Szabó, D. Lazer, K. Kaski, J. Kertész, and A. L. Barabási. Structure and tie strengths in mobile communication networks. Proceedings of the National Academy of Sciences, 104(18):7332–7336, May 2007.
[145] John F. Padgett and Christopher K. Ansell. Robust action and the rise of the Medici, 1400-1434. The American Journal of Sociology, 98(6):1259–1319, 1993.
[146] L. Page, S. Brin, R. Motwani, and T. Winograd. The pagerank citation ranking: Bringing order to the web. Technical report, Stanford Digital Library Technologies Project, 1998.
[147] Philippa Pattison and Garry Robins. Building models for social space: Neighbourhood-based models for social networks and affiliation structures. Mathematics and Social Sciences, 168:11–29, 2004.
[148] Patrick O. Perry and Michael W. Mahoney. Regularized laplacian estimation and fast eigenvector approximation. CoRR, abs/1110.1757, 2011.
[149] B. Aditya Prakash, Deepayan Chakrabarti, Michalis Faloutsos, Nicholas Valler, and Christos Faloutsos. Got the Flu (or Mumps)? Check the Eigenvalue! Apr 2010.
[150] Filippo Radicchi, Santo Fortunato, Benjamin Markines, and Alessandro Vespignani. Diffusion of scientific credits and the ranking of scientists. Physical Review E, 80:056103+, September 2009.
[151] E. Ravasz, A. L. Somera, D. A. Mongru, Z. N. Oltvai, and A. L. Barabási. Hierarchical Organization of Modularity in Metabolic Networks. Science, 297(5586):1551–1555, 2002.
[152] William J. Reed and Murray Jorgensen. The Double Pareto-Lognormal Distribution - A New Parametric Model for Size Distributions.
Communications in Statistics - Theory and Methods, 33(8):1733–1753, 2004.
[153] Alexander W. Rives and Timothy Galitski. Modular organization of cellular networks. Proc Natl Acad Sci U S A, 100(3):1128–1133, 2003.
[154] Garry Robins, Philippa Pattison, and Peter Elliott. Network models for social influence processes. Psychometrika, 66:161–189, 2001. 10.1007/BF02294834.
[155] M. G. Rodriguez, J. Leskovec, and A. Krause. Inferring networks of diffusion and influence. 2010.
[156] G. Sabidussi. The centrality index of a graph. Psychometrika, 31:581–603, 1966.
[157] M. Salganik, P. Dodds, and D. Watts. Experimental study of inequality and unpredictability in an artificial cultural market. Science, 311:854, 2006.
[158] Romualdo P. Satorras and Alessandro Vespignani. Epidemic Spreading in Scale-Free Networks. Physical Review Letters, 86(14):3200–3203, Apr 2001.
[159] Rossano Schifanella, Alain Barrat, Ciro Cattuto, Benjamin Markines, and Filippo Menczer. Folks in folksonomies: social link prediction from shared metadata. In Proceedings of the third ACM international conference on Web search and data mining, WSDM ’10, pages 271–280, New York, NY, USA, March 2010. ACM.
[160] Cosma R. Shalizi and Andrew C. Thomas. Homophily and Contagion Are Generically Confounded in Observational Social Network Studies. Sociological Methods & Research, 40(2):211–239, May 2011.
[161] G. Simmel. The Sociology of Georg Simmel, chapter Individual and Society. Free Press, 1950.
[162] Chaoming Song, Shlomo Havlin, and Hernan A. Makse. Self-similarity of complex networks. Nature, 433(7024):392–395, January 2005.
[163] Daniel A. Spielman and Shang-Hua Teng. Spectral partitioning works: Planar graphs and finite element meshes. In IEEE Symposium on Foundations of Computer Science, pages 96–105, 1996.
[164] Greg V. Steeg, Rumi Ghosh, and Kristina Lerman. What stops social epidemics?
In Proceedings of the 5th International AAAI Conference on Weblogs and Social Media (ICWSM), Feb 2011.
[165] Christian Steglich, Tom A. B. Snijders, and Michael Pearson. Dynamic Networks and Behavior: Separating Selection from Influence. Sociological Methodology, 2010.
[166] K. Stephenson and M. Zelen. Rethinking centrality: Methods and applications. Social Networks, 11:1–37, 1989.
[167] Steven Strogatz. Sync: The Emerging Science of Spontaneous Order. Theia, March 2003.
[168] Hanghang Tong, Christos Faloutsos, and Yehuda Koren. Fast direction-aware proximity for graph mining. In KDD ’07: Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 747–756, New York, NY, USA, 2007. ACM.
[169] Hanghang Tong, Christos Faloutsos, and Jia-yu Pan. Fast Random Walk with Restart and Its Applications. In ICDM ’06: Proceedings of the Sixth International Conference on Data Mining, pages 613–622, Washington, DC, USA, December 2006. IEEE Computer Society.
[170] Hanghang Tong, Spiros Papadimitriou, Philip S. Yu, and Christos Faloutsos. Proximity tracking on time-evolving bipartite graphs. In Proc. SIAM Conference on Data Mining, pages 704–715, 2008.
[171] Amanda L. Traud, Eric D. Kelsic, Peter J. Mucha, and Mason A. Porter. Comparing community structure to characteristics in online collegiate social networks. SIAM Review, in press (arXiv:0809.0960), 2010.
[172] Amanda L. Traud, Peter J. Mucha, and Mason A. Porter. Social structure of facebook networks. arXiv:1102.2166, 2011.
[173] Michael Trusov, Anand V. Bodapati, and Randolph E. Bucklin. Determining Influential Users in Internet Social Networks. Journal of Marketing Research, XLVII:643–658, 2010.
[174] T. W. Valente. Social Networks and Health: Models, Methods, and Applications. Oxford University Press, 2010.
[175] A. Vazquez, J. G. Oliveira, Z. Dezso, K.-I. Goh, I. Kondor, and A. Barabasi. Modeling bursts and heavy tails in human dynamics. Phys. Rev.
E, 73(3):036127, 2006.
[176] Ulrike von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4):395–416, December 2007.
[177] Dylan Walker, Huafeng Xie, Koon-Kiu Yan, and Sergei Maslov. Ranking Scientific Publications Using a Simple Model of Network Traffic. December 2006.
[178] David L. Wallace. A method for comparing two hierarchical clusterings: Comment. Journal of the American Statistical Association, 78(383):569–576, 1983.
[179] Yang Wang, Deepayan Chakrabarti, Chenxi Wang, and Christos Faloutsos. Epidemic Spreading in Real Networks: An Eigenvalue Viewpoint. Reliable Distributed Systems, IEEE Symposium on, 0:25+, 2003.
[180] S. Wasserman and K. Faust. Social Network Analysis: Methods and Applications. Cambridge Univ. Press, 1994.
[181] D. J. Watts. A simple model of global cascades in random networks. In PNAS, 2002.
[182] Duncan J. Watts and Peter S. Dodds. Influentials, Networks, and Public Opinion Formation. Journal of Consumer Research, 34(4):441–458, December 2007.
[183] D. M. Wilkinson. Strong regularities in online peer production. In EC ’08: Proc. 9th Conf. on Electronic commerce, pages 302–309, 2008.
[184] F. Wu, B. Huberman, L. Adamic, and J. Tyler. Information flow in social groups. Physica A: Statistical and Theoretical Physics, 2004.
[185] F. Wu and B. A. Huberman. Novelty and collective attention. PNAS, 104(45):17599–17601, 2007.
[186] F. Y. Wu. The Potts model. Reviews of Modern Physics, 54(1):235–268, January 1982.
[187] S. Wu, J. M. Hofman, W. A. Mason, and D. J. Watts. Who Says What to Whom on Twitter. In Proceedings of World Wide Web Conference (WWW ’11), 2011.
[188] Yinglian Xie, Fang Yu, Kannan Achan, Rina Panigrahy, Geoff Hulten, and Ivan Osipkov. Spamming botnets: signatures and characteristics. SIGCOMM Comput. Commun. Rev., 38(4):171–182, August 2008.
[189] Jaewon Yang and Jure Leskovec. Patterns of temporal variation in online media.
In Proceedings of the fourth ACM international conference on Web search and data mining, WSDM ’11, pages 177–186, New York, NY, USA, 2011. ACM.
[190] Sarita Yardi, Daniel Romero, Grant Schoenebeck, and Danah Boyd. Detecting spam in a Twitter network. First Monday, 15(1), January 2010.
[191] Makoto Yokoo, Edmund H. Durfee, Toru Ishida, and Kazuhiro Kuwabara. The distributed constraint satisfaction problem: Formalization and algorithms. Knowledge and Data Engineering, 10(5):673–685, 1998.
[192] P. H. Young. The diffusion of innovations in social networks. In The Economy as a Complex Evolving System, 2003.
[193] W. W. Zachary. An information flow model for conflict and fission in small groups. J. Anthropological Research, 33:452–473, 1977.
[194] Tao Zhou, Linyuan Lü, and Yi-Cheng Zhang. Predicting missing links via local information. The European Physical Journal B - Condensed Matter and Complex Systems, 71(4):623–630, October 2009.

Appendix A
Computational Framework for Cascade Generating Function

Cascade Graph

For the analysis of the contagion process, we create a cascade graph $G_c(V_c, E_c)$ from the original network $G(V, E)$ as follows. Let $V_c$ be the set of nodes participating in all cascades. Let a cascade begin at time $t_1$ and end at $t_N$. We arrange and label the nodes in the temporal order in which they are activated, e.g., transmit information: $1, 2, \ldots, N$, where node $k$ is activated at time $t_k$ and $t_1 \le t_k \le t_N$. An edge exists from $j$ to $i$ in $G_c$ (i.e., $i$ is activated by $j$) if an edge exists from $i$ to $j$ in $G$ ($i$ is a fan of $j$) and $t_i > t_j$. The adjacency matrix $A_c$ of the cascade graph $G_c(V_c, E_c)$, the cascade matrix, is:

\[ A_c(i,j) = \begin{cases} 1 & \text{if } \exists \text{ an edge from } j \text{ to } i \text{ in } G_c(V_c, E_c),\ j < i \\ 0 & \text{otherwise} \end{cases} \]

We break ties randomly. If nodes $a$ and $b$ receive information at the same time $t_k$, without loss of generality, we assume $a = k$ and $b = k + 1$.
Also, we modify the adjacency matrix $A_c$, setting $A_c(k+1, k) = 0$ and $A_c(k, k+1) = 0$, irrespective of whether or not an edge exists between $k$ and $k+1$. This means that neither node can activate the other, since they are activated at the same time. We note that node 1 is always the seed of a cascade. The cascade matrix can encode a contagion process that generates multiple cascades.

Contagion and Length Matrix

In addition to the cascade matrix, we introduce the dynamic adjacency matrix of the cascade graph, $A(t)$. This is a time-dependent matrix whose non-zero elements include all nodes that have been activated up to time $t$:

\[ A_{ij}(t_k) = \begin{cases} 1 & \text{if } A_c(i,j) = 1 \text{ and } t_i \le t_k \\ 0 & \text{otherwise} \end{cases} \]

The dynamic adjacency matrix allows us to compute connectivity between nodes in a cascade, as measured by the number of paths that exist between them. Following [109], let the attenuation parameter $\alpha$ be the probability of transmitting a message or influence along any edge from node $i$ at time $t_i$ to node $j$ at time $t_j$. The contagion matrix over the time period $[t_1, t_N]$ is:

\[ C(\alpha) = \alpha^{N-1} A(t_N) \cdots A(t_3) A(t_2) + \cdots + \alpha^{2} A(t_N) A(t_{N-1}) + \alpha A(t_N) + I \tag{A.1} \]

The term $C_{ij}(\alpha)$ gives the number of attenuated paths from node $j$ to $i$ in $G_c(V_c, E_c)$, and $I$ is the identity matrix.

The total length of paths from one node to another can be modeled using a formalism similar to the contagion matrix. We define the length matrix as:

\[ L(\alpha) = (N-1)\,\alpha^{N-1} A(t_N) \cdots A(t_3) A(t_2) + \cdots + 2\,\alpha^{2} A(t_N) A(t_{N-1}) + \alpha A(t_N) + I \tag{A.2} \]

Here $L_{ij}(1)$ gives the total length of paths from node $j$ to node $i$ in $G_c(V_c, E_c)$. $L_{ij}(1)/C_{ij}(1)$ then gives the average length of paths from node $j$ to node $i$.

The first step towards quantifying cascades is seed identification. This can be achieved by collecting all the maximal elements of $G_c$, seen as a partially ordered set. Equivalently, if all the elements of the $i$-th row of $A_c$ are zero, then node $i$ is a seed of the cascade. Finding all the seeds gives the total number of cascades, $K$, in the contagion process.
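The definitions above can be sketched in a few lines of standard-library Python. This is only an illustrative sketch under stated assumptions: the five-node cascade matrix `A_c`, the helper names, and the 0-based node labels (node `i` is activated at the `i`-th time step) are hypothetical and not taken from the thesis. The code evaluates Eq. A.1 and Eq. A.2 term by term from the dynamic adjacency matrices and identifies seeds as the all-zero rows of the cascade matrix.

```python
def matmul(X, Y):
    """Multiply two square matrices stored as lists of lists."""
    n = len(X)
    return [[sum(X[i][m] * Y[m][j] for m in range(n)) for j in range(n)]
            for i in range(n)]


def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def dynamic_adjacency(A_c, k):
    """A(t_k): keep only the rows of nodes activated at or before t_k."""
    n = len(A_c)
    return [list(A_c[i]) if i <= k else [0] * n for i in range(n)]


def find_seeds(A_c):
    """A node whose row in A_c is all zeros was activated by nobody."""
    return [i for i, row in enumerate(A_c) if not any(row)]


def contagion_and_length(A_c, alpha):
    """Evaluate C(alpha) and L(alpha) term by term (Eq. A.1 and Eq. A.2)."""
    n = len(A_c)
    C, L = identity(n), identity(n)
    P = identity(n)  # running product A(t_N) A(t_{N-1}) ...
    for m in range(1, n):
        # With 0-based labels, the m-th factor in the product is A(t_{n-m}).
        P = matmul(P, dynamic_adjacency(A_c, n - m))
        w = alpha ** m
        for i in range(n):
            for j in range(n):
                C[i][j] += w * P[i][j]
                L[i][j] += m * w * P[i][j]
    return C, L


# Hypothetical toy cascade: A_c[i][j] = 1 if j activated i (j < i).
A_c = [
    [0, 0, 0, 0, 0],  # node 0: seed
    [1, 0, 0, 0, 0],  # node 1 activated by 0
    [1, 1, 0, 0, 0],  # node 2 activated by 0 and 1
    [0, 0, 0, 0, 0],  # node 3: second seed
    [0, 0, 1, 1, 0],  # node 4 activated by 2 and 3
]
seeds = find_seeds(A_c)
C, L = contagion_and_length(A_c, 1.0)
avg_len_0_to_4 = L[4][0] / C[4][0]  # average path length from node 0 to node 4
```

In this toy cascade there are two paths from node 0 to node 4 (0→2→4 and 0→1→2→4), so `C[4][0]` counts two paths of total length five at `alpha = 1.0`.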
Let $\varphi(i_1, \alpha), \ldots, \varphi(i_K, \alpha)$ be the cascade function value of each seed. The value of the cascade function of node $j$, which is activated after $m$ (but before $k - m$) cascade seeds, is $\varphi(j, \alpha) = \sum_{p=1}^{m} C_{j,i_p}(\alpha)\,\varphi(i_p, \alpha)$. A non-zero value of $C_{j,i_p}(\alpha)$ indicates that $j$ is a member of the cascade initiated by $i_p$. Hence, $C_{j,i_p}(\alpha) = f(j, i_p, \alpha)$ in Eq. 8.2.

The cascade generating function is designed so that knowing only the columns of the contagion matrix that correspond to the seeds is enough to characterize the entire matrix. This, along with the triangular nature of the cascade matrix, enables us to calculate the contagion and length matrices, and hence the corresponding cascade generating function, efficiently using dynamic programming (Algorithm 3). This algorithm has $O(kN)$ space and $O(dkN)$ runtime complexity even in its naive implementation, where $d$ is the maximum degree of any node and $N$ is the number of nodes in the process.

Algorithm 3: Efficient algorithm for computing the Contagion and the Length Matrix

Input: $A_c$: adjacency matrix of the cascade graph; $\alpha$: transmission probability
Output: $C(\alpha)$: contagion matrix ($N \times k$); $L(\alpha)$: length matrix ($N \times k$)

The $p$-th column of $C$ and $L$ corresponds to features of the cascade generated by the $p$-th seed.
$\forall p \in [1, k]$, $j$ is the label of the $p$-th seed, activated at time $t_j$.
$C_{i,p}(\alpha)$ is the cascade generating value for node $i$ with respect to the $p$-th seed.
$L_{i,p}(\alpha)$ at $\alpha = 1$ gives the total length of paths from the $p$-th seed to node $i$.

$\forall p$: $C_{j,p}(\alpha) = 1$; $L_{j,p}(\alpha) = 1$
if $i < j$ then
    $C_{i,p}(\alpha) = 0$; $L_{i,p}(\alpha) = 0$
else
    if $i == j + 1$ then
        $C_{i,p}(\alpha) = \alpha A_c(i,j)$; $L_{i,p}(\alpha) = \alpha A_c(i,j)$
    else
        $C_{i,p}(\alpha) = \alpha A_c(i,j) + \sum_{\forall \text{ edges } e(i-k,\,i) \,\mid\, k=1}^{i-j-1} \alpha A_c(i, i-k)\, C_{i-k,p}(\alpha)$
        $L_{i,p}(\alpha) = \alpha A_c(i,j) + \sum_{\forall \text{ edges } e(i-k,\,i) \,\mid\, k=1}^{i-j-1} \alpha A_c(i, i-k)\, \bigl(C_{i-k,p}(\alpha) + L_{i-k,p}(\alpha)\bigr)$
    end if
end if

The contagion and length matrices together fully determine $\varphi$ and $\frac{d\varphi}{d\alpha}$, and therefore capture the microscopic details of the contagion process.
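The dynamic-programming recursion of Algorithm 3 can be sketched as follows. This is a minimal illustration rather than the thesis implementation: the function names, the 0-based labels, and the dense list-of-lists representation are assumptions, and the toy cascade matrix is hypothetical. Because rows are stored densely, the inner loop scans every earlier node instead of only the incoming edges, so it does not attain the stated $O(dkN)$ bound. The general branch also covers the case $i = j + 1$, where the sum is empty.

```python
def find_seeds(A_c):
    """Seeds are nodes whose row in the cascade matrix is all zeros."""
    return [i for i, row in enumerate(A_c) if not any(row)]


def cascade_columns(A_c, alpha):
    """Return (seeds, C, L) with one column per seed, following Algorithm 3."""
    n = len(A_c)
    seeds = find_seeds(A_c)
    C = [[0.0] * len(seeds) for _ in range(n)]
    L = [[0.0] * len(seeds) for _ in range(n)]
    for p, j in enumerate(seeds):
        C[j][p] = 1.0
        L[j][p] = 1.0  # initial value for the seed, as stated in Algorithm 3
        for i in range(j + 1, n):  # rows with i < j stay zero
            c = alpha * A_c[i][j]  # direct edge from the seed
            l = alpha * A_c[i][j]
            for m in range(j + 1, i):  # m = i - k for k = 1 .. i-j-1
                if A_c[i][m]:
                    c += alpha * C[m][p]
                    l += alpha * (C[m][p] + L[m][p])
            C[i][p] = c
            L[i][p] = l
    return seeds, C, L


# Hypothetical toy cascade: A_c[i][j] = 1 if j activated i (j < i).
A_c = [
    [0, 0, 0, 0, 0],  # node 0: seed
    [1, 0, 0, 0, 0],  # node 1 activated by 0
    [1, 1, 0, 0, 0],  # node 2 activated by 0 and 1
    [0, 0, 0, 0, 0],  # node 3: second seed
    [0, 0, 1, 1, 0],  # node 4 activated by 2 and 3
]
seeds, C, L = cascade_columns(A_c, 1.0)
_, C_half, L_half = cascade_columns(A_c, 0.5)
```

At `alpha = 1.0`, the entry `C[4][0]` counts the paths from seed 0 to node 4 and `L[4][0]` sums their lengths; at `alpha = 0.5` each path of length `n` contributes `0.5 ** n` to `C_half`.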
If $(C_{j_1,i_1}(\alpha), C_{j_1,i_2}(\alpha), \ldots, C_{j_1,i_K}(\alpha)) = (C_{j_2,i_1}(\alpha), C_{j_2,i_2}(\alpha), \ldots, C_{j_2,i_K}(\alpha))$, then $j_1$ and $j_2$ are isomorphic with respect to the contagion process. The total number and total length of paths in the cascade from seed $i_p$ to node $j$ are given by $C_{j,i_p}(1) = f(j,i_p,1)$ and $L_{j,i_p}(1) = l(j,i_p)$ in Eq. 8.2 and Eq. 8.3. Hence, the total number of paths, total length, and average path length for the entire contagion process are given by

\[ \sum_{i_p}\sum_{j \neq i_p \forall p} C_{j,i_p}(1), \qquad \sum_{i_p}\sum_{j \neq i_p \forall p} L_{j,i_p}(1), \qquad \frac{\sum_{i_p}\sum_{j \neq i_p \forall p} L_{j,i_p}(1)}{\sum_{i_p}\sum_{j \neq i_p \forall p} C_{j,i_p}(1)}. \]

As can be seen in Eq. 8.5, analogous to the length calculation, computation of the diameter can also be done with comparable runtime and space complexity.

Appendix B
Normalized Alpha centrality - Proofs and Theorems

If $\lambda$ is an eigenvalue of $A$, then

\[ \left(I - \frac{1}{\lambda} A\right) x = 0 \tag{B.1} \]

Invertibility of $(I - \frac{1}{\lambda}A)$ would lead to the trivial solution for the eigenvector $x$ ($x = 0$). Hence, for computation of eigenvalues and eigenvectors, we require that no inverse of $(I - \frac{1}{\lambda}A)$ should exist, i.e.

\[ \mathrm{Det}\left(I - \frac{1}{\lambda} A\right) = 0 \tag{B.2} \]

Equation B.2 is called the characteristic equation; solving it gives the eigenvalues and eigenvectors of the adjacency matrix $A$. Using eigenvalues and eigenvectors, the adjacency matrix $A$ can be written as:

\[ A = X \Lambda X^{-1} = \sum_{i=1}^{n} \lambda_i Y_i \tag{B.3} \]

where $X$ is a matrix whose columns are the eigenvectors of $A$, and $\Lambda$ is a diagonal matrix whose diagonal elements are the eigenvalues, $\Lambda_{ii} = \lambda_i$, arranged according to the ordering of the eigenvectors in $X$. Without loss of generality we assume that $\lambda_1 > \lambda_2 > \cdots > \lambda_n$. The matrices $Y_i$ can be determined from the product

\[ Y_i = X Z_i X^{-1} \tag{B.4} \]

where $Z_i$ is the selection matrix having zeros everywhere except for element $(Z_i)_{ii} = 1$ [56].
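The decomposition in Equations B.3 and B.4 can be verified numerically on a small graph. The sketch below assumes NumPy and a diagonalizable adjacency matrix (here the 3-node path graph, whose eigenvalues are distinct); `spectral_components` is a hypothetical helper name. It uses the fact that $X Z_i X^{-1}$ equals the outer product of the $i$-th column of $X$ with the $i$-th row of $X^{-1}$.

```python
import numpy as np

def spectral_components(A):
    """Return the eigenvalues of A and the matrices Y_i = X Z_i X^{-1}."""
    eigvals, X = np.linalg.eig(A)
    X_inv = np.linalg.inv(X)
    # X Z_i X^{-1} picks column i of X and row i of X^{-1}.
    Y = [np.outer(X[:, i], X_inv[i, :]) for i in range(len(eigvals))]
    return eigvals, Y

# Adjacency matrix of the path graph 0 - 1 - 2 (eigenvalues sqrt(2), 0, -sqrt(2)).
A = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
eigvals, Y = spectral_components(A)

# A = sum_i lambda_i Y_i, and the Y_i sum to the identity.
A_rebuilt = sum(lam * Yi for lam, Yi in zip(eigvals, Y))
identity_rebuilt = sum(Y)
# Powers of A follow by raising each eigenvalue to the same power.
A_cubed = sum(lam**3 * Yi for lam, Yi in zip(eigvals, Y))
```

Because powers of $A$ only change the scalar weights $\lambda_i^k$, any power series in $A$ can be summed eigenvalue by eigenvalue, which is the property the rest of this appendix relies on.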
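The Alpha-centrality matrix $C_{\alpha,k}$, $\forall \alpha \in [0,1]$, is given by:

\[ C_{\alpha,k} = I + \alpha A + \alpha^2 A^2 + \cdots + \alpha^k A^k = \sum_{t=0}^{k} \alpha^t A^t \tag{B.5} \]

The normalized Alpha-centrality matrix is then given by:

\[ NC_{\alpha,k} = \frac{1}{\sum_{i,j} C_{\alpha,k}[i,j]}\; C_{\alpha,k} \tag{B.6} \]

As shown in Equations 6.4 and 6.5, the Alpha-centrality vector is $vC_{\alpha,k\to\infty}$ and the normalized Alpha-centrality vector is $vNC_{\alpha,k\to\infty}$. $A^k$ can then be written as:

\[ A^k = X \Lambda^k X^{-1} = \sum_{i=1}^{n} \lambda_i^k Y_i \tag{B.7} \]

Using Equation B.7, Equation B.5 reduces to

\[ C_{\alpha,k} = \sum_{i=1}^{n}\sum_{t=0}^{k} \alpha^t \lambda_i^t\, Y_i = \sum_{i=1}^{n} \frac{(-1)^{p_i}\left(1 - \alpha^{k+1}\lambda_i^{k+1}\right)}{(-1)^{p_i}\left(1 - \alpha\lambda_i\right)}\, Y_i \tag{B.8} \]

where $p_i = 0$ if $|\alpha\lambda_i| < 1$ and $p_i = 1$ if $|\alpha\lambda_i| > 1$. As is obvious from the above, for Equations B.5 and B.8 to hold non-trivially, $\alpha \neq \frac{1}{|\lambda_i|}$ $\forall i \in 1, 2, \ldots, n$.

We consider the characterization of the series $\{NC_{\alpha,k\to\infty}\}$ for $\alpha \in [0,1]$.

1. $\alpha \ll \frac{1}{|\lambda_1|}$: If $\alpha \ll \frac{1}{|\lambda_1|}$, $C_{\alpha,k\to\infty}$ (and $NC_{\alpha,k\to\infty}$) would be independent of $\alpha$, since

\[ C_{\alpha,k\to\infty} \approx I, \qquad NC_{\alpha,k\to\infty} \approx \frac{1}{n} I \tag{B.9} \]

2. $\alpha < \frac{1}{|\lambda_1|}$: The sequence of matrices $\{C_{\alpha,k}\}$ would converge to $C_\alpha$ as $k \to \infty$ if all the sequences $\{(C_{\alpha,k})[i,j]\}$ for every fixed $i$ and $j$ converge to $(C_\alpha)[i,j]$ [47]. If $\alpha < \frac{1}{|\lambda_1|}$, $C_{\alpha,k}$ converges to $C_\alpha$:

\[ C_\alpha = C_{\alpha,k\to\infty} = \sum_{i=1}^{n} \frac{1}{1 - \alpha\lambda_i}\, Y_i = (I - \alpha A)^{-1}, \qquad NC_{\alpha,k\to\infty} = \frac{C_\alpha}{\sum_{i,j}^{n} (C_\alpha)[i,j]} \tag{B.10} \]

3. $\alpha > \frac{1}{|\lambda_1|}$ and $k \to \infty$: $\alpha^k A^k$ dominates in Equation B.8.

\[ C_{\alpha,k\to\infty} \approx \alpha^k A^k, \qquad NC_{\alpha,k\to\infty} \approx \frac{1}{\sum_{i,j}^{n} A^k[i,j]}\; A^k \tag{B.11} \]

Theorem 1: The induced ordering of nodes due to normalized $\alpha$-centrality would be equal to the induced ordering of nodes due to $\alpha$-centrality for $\alpha < \frac{1}{|\lambda_1|}$.

Proof: Since $cr_\alpha(k\to\infty) = vC_{\alpha,k\to\infty}$ and $ncr_\alpha(k\to\infty) = vNC_{\alpha,k\to\infty}$, from Equations B.9 and B.10, the induced ordering of nodes due to $\alpha$-centrality ($\alpha < \frac{1}{|\lambda_1|}$) would be equal to the induced ordering of nodes due to normalized $\alpha$-centrality ($\alpha < \frac{1}{|\lambda_1|}$).

Theorem 2: The value of normalized $\alpha$-centrality remains the same $\forall \alpha \in (\frac{1}{|\lambda_1|}, 1]$ ($ncr_{\alpha > \frac{1}{|\lambda_1|}}(k\to\infty) = ncr$).

Proof: As can be seen from Equation B.11, when $\alpha > \frac{1}{|\lambda_1|}$ and $k \to \infty$, $NC_{\alpha,k\to\infty}$ reduces to $\frac{1}{\sum_{i,j}^{n} A^k[i,j]} A^k$ and is independent of $\alpha$.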
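Since normalized $\alpha$-centrality is $ncr_\alpha = vNC_{\alpha,k\to\infty}$, the value of normalized $\alpha$-centrality remains the same $\forall \alpha \in (\frac{1}{|\lambda_1|}, 1]$ ($ncr_{\alpha > \frac{1}{|\lambda_1|}}(k\to\infty) = ncr$).

The remaining theorems hold under the condition that $|\lambda_1|$ is strictly greater than any other eigenvalue, which is true in most real life cases studied.

Theorem 3: $\lim_{\alpha \to \frac{1}{|\lambda_1|}} ncr_\alpha(k\to\infty)$ exists and $\lim_{\alpha \to \frac{1}{|\lambda_1|}} ncr_\alpha(k\to\infty) = ncr = ncr_{\alpha > \frac{1}{|\lambda_1|}}(k\to\infty) = \frac{vY_1}{\sum_{i,j}^{n} (Y_1)_{ij}}$.

Proof: Under the assumption that $|\lambda_1|$ is strictly greater than any other eigenvalue, as $\alpha \to \frac{1}{|\lambda_1|}^{-}$ Equation B.10 reduces to

\[ C_{\alpha \to \frac{1}{|\lambda_1|}^{-},\,k\to\infty} = \frac{1}{1 - \alpha\lambda_1}\, Y_1 \tag{B.12} \]

This is because all other eigenvectors shrink in importance as $\alpha \to \frac{1}{|\lambda_1|}$ [22]. Therefore, as $\alpha \to \frac{1}{|\lambda_1|}^{-}$, we have

\[ NC_{\alpha \to \frac{1}{|\lambda_1|}^{-},\,k\to\infty} = \frac{1}{\sum_{i,j}^{n} (Y_1)_{ij}}\; Y_1 \tag{B.13} \]

Under the assumption that $|\lambda_1|$ is strictly greater than any other eigenvalue, $\alpha^k \lambda_1^k Y_1$ dominates in Equations B.8 and B.11:

\[ C_{\alpha,k\to\infty} \approx \alpha^k \lambda_1^k\, Y_1, \qquad NC_{\alpha,k\to\infty} \approx \frac{1}{\sum_{i,j}^{n} (Y_1)_{ij}}\; Y_1 \tag{B.14} \]

Hence, from Equation B.14, as $\alpha \to \frac{1}{|\lambda_1|}^{+}$, we have

\[ NC_{\alpha \to \frac{1}{|\lambda_1|}^{+},\,k\to\infty} = \frac{1}{\sum_{i,j}^{n} (Y_1)_{ij}}\; Y_1 \tag{B.15} \]

Since $\lim_{\alpha \to \frac{1}{|\lambda_1|}^{-}} NC_{\alpha,k\to\infty} = \lim_{\alpha \to \frac{1}{|\lambda_1|}^{+}} NC_{\alpha,k\to\infty} = \frac{Y_1}{\sum_{i,j}^{n} (Y_1)_{ij}}$, the limit $\lim_{\alpha \to \frac{1}{|\lambda_1|}} NC_{\alpha,k\to\infty}$ exists and

\[ \lim_{\alpha \to \frac{1}{|\lambda_1|}} NC_{\alpha,k\to\infty} = \frac{Y_1}{\sum_{i,j}^{n} (Y_1)_{ij}} \tag{B.16} \]

Since $ncr_\alpha(k\to\infty) = vNC_{\alpha,k\to\infty}$, it follows that $\lim_{\alpha \to \frac{1}{|\lambda_1|}} ncr_\alpha(k\to\infty) = ncr = ncr_{\alpha > \frac{1}{|\lambda_1|}}(k\to\infty) = \frac{vY_1}{\sum_{i,j}^{n} (Y_1)_{ij}}$.

Theorem 4: For symmetric matrices, the induced ordering of nodes due to eigenvector centrality $C_E$ is equivalent to the induced ordering of nodes given by normalized centrality $ncr = \lim_{\alpha \to \frac{1}{|\lambda_1|}} ncr_\alpha(k\to\infty) = ncr_{\alpha > \frac{1}{|\lambda_1|}}(k\to\infty) = \frac{vY_1}{\sum_{i,j}^{n} (Y_1)_{ij}}$.

Proof: For symmetric matrices

\[ A = X \Lambda X^{-1} = X \Lambda X^{T} \tag{B.17} \]

Therefore Equation B.4 reduces to

\[ Y_i = X Z_i X^{T} = X_i X_i^{T} \tag{B.18} \]

where $X_i$ is the column of $X$ representing the eigenvector corresponding to $\lambda_i$. Hence, in the case of symmetric matrices:

\[ ncr = ncr_{\alpha > \frac{1}{|\lambda_1|}}(k\to\infty) = \lim_{\alpha \to \frac{1}{|\lambda_1|}} ncr_\alpha(k\to\infty) = \frac{vY_1}{\sum_{i,j}^{n} (Y_1)_{ij}} = c_1\, v X_1 X_1^{T} = c_2\, X_1^{T} \tag{B.19} \]

where $c_1 = \frac{1}{\sum_{i,j}^{n} (Y_1)_{ij}}$ and $c_2 = c_1 v X_1$.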
Since $X_1^{T}$ corresponds to the eigenvector centrality vector $C_E$, for symmetric matrices the induced ordering of nodes given by eigenvector centrality $C_E$ is equivalent to the induced ordering of nodes given by normalized centrality $ncr = \lim_{\alpha \to \frac{1}{|\lambda_1|}} ncr_\alpha(k\to\infty) = ncr_{\alpha > \frac{1}{|\lambda_1|}}(k\to\infty) = \frac{vY_1}{\sum_{i,j}^{n} (Y_1)_{ij}}$.

Appendix C
Approximation Algorithm for Alpha-Centrality

In order to compute the exact Alpha-Centrality vector we have to solve Equation 6.3, which requires us to compute a matrix inverse. A naive implementation of matrix inversion takes $O(n^3)$ time (where $n$ is the number of nodes in the network), so this is difficult to compute for large networks. One way to compute an approximate solution is to use the alternate formulation given in Equation 6.4 and compute $s(I + \alpha A + \alpha^2 A^2 + \alpha^3 A^3 + \ldots)$ until the $\alpha^i$ coefficient grows sufficiently small. While this technique is effective in practice, computing $\alpha A^i$ in each iteration takes at least $n^2$ time in a naive implementation, and it is not clear how many iterations we need to get a good approximation.

In this section we present an algorithm for approximating Alpha-Centrality, which has a single parameter that controls both the runtime and the quality of the produced approximation. Our procedure, shown in Algorithm 4, is similar to the algorithm for approximating PageRank that is given in [7].

Our algorithm takes the network, the starting vector $s$, $\alpha$, and an approximation parameter $\epsilon$ ($0 < \epsilon \le 1$) as input, and computes an approximate Alpha-Centrality vector where each entry has error of at most $\epsilon$ (see Theorem 5). The notations used are: $n$ gives the number of nodes in the network, $V$ is the set of nodes of the network, and $N(u)$ is the neighborhood of node $u$, or the friends of node $u$. In order to approximate a centrality vector with starting vector $s$, we maintain an approximate centrality vector $\widetilde{cr}$ and a residual vector $r$.
Initially $r$ is equivalent to the starting vector $s$; the algorithm iteratively moves content from $r$ to $\widetilde{cr}$ until each entry in $r$ is small. We give a proof that throughout the execution of the algorithm the error in the approximate centrality vector is the amount of content remaining in the residual vector.

Algorithm 4 Approximate-Centrality($V, E, s, \alpha, \epsilon$)
 1: $\delta = \epsilon\,\|s\|_1 / n$;
 2: $r = s$;
 3: Queue q = new Queue();
 4: for each $u \in V$ do
 5:     $\widetilde{cr}[u] = 0$;
 6:     if $r[u] > \delta$ then
 7:         q.add(u);
 8:     end if
 9: end for
10: while q.size > 0 do
11:     u = q.dequeue();
12:     $\widetilde{cr}[u] = \widetilde{cr}[u] + r[u]$;
13:     $T = r[u]$;
14:     $r[u] = 0$;
15:     for each $v \in N(u)$ do
16:         $r[v] = r[v] + \alpha\, T\, w[u,v]$;
17:         if !q.contains(v) and $r[v] > \delta$ then
18:             q.add(v);
19:         end if
20:     end for
21: end while
return $\widetilde{cr}$;

Here $\delta$ denotes the termination threshold set in line 1.

Our arguments depend on the linearity of the centrality computation with respect to the starting vector, which is easy to verify. We can show that $cr_\alpha(s_1) + cr_\alpha(s_2) = cr_\alpha(s_1 + s_2)$, and $c \cdot cr_\alpha(s) = cr_\alpha(c\,s)$. When the parameter $\alpha$ is fixed, we use $cr(s)$ to denote $cr_\alpha(s)$. We will also use $[cr(s)](u)$ to refer to how much content vertex $u$ has in $cr(s)$. We next give our formal performance guarantee for Algorithm 4.

Theorem 5: Given an $\alpha \le \frac{c}{d^{out}_{max}}$ for some $c < 1$ and a uniform starting vector $s$, the vector $\widetilde{cr}$ output by Approximate-Centrality satisfies $[cr(s)](u) \ge \widetilde{cr}[u] \ge [cr(s)](u)(1-\epsilon)$ for each vertex $u \in V$. The runtime of the algorithm is $O(\frac{n}{\epsilon}\, d^{out}_{max})$.

Proof: Lemma 6 argues that $\widetilde{cr} = cr(s-r) = cr(s) - cr(r)$ throughout the execution of the algorithm, so we have $\widetilde{cr}[u] = [cr(s)](u) - [cr(r)](u)$ for all vertices $u \in V$. Given a uniform starting vector $s$, $s(u) = \|s\|_1/n$ for all $u \in V$. The algorithm terminates when $r[u] \le \delta$ for all $u \in V$, so we choose $\delta = \epsilon\|s\|_1/n = \epsilon\, s(u)$ such that upon completion $r[u] \le \epsilon\, s(u)$ for all $u \in V$. Clearly, $[cr(s)](u) \ge \widetilde{cr}[u]$ because $r$ and $cr(r)$ are non-negative. We can also show that, given that $r[u] \le \epsilon\, s(u)$ for all $u \in V$, $[cr(r)](u) \le \epsilon\,[cr(s)](u)$ for all vertices $u \in V$. It follows that $\widetilde{cr}[u] = [cr(s)](u) - [cr(r)](u) \ge [cr(s)](u)(1-\epsilon)$.
Therefore we can see that indeed $[cr(s)](u) \ge \widetilde{cr}[u] \ge [cr(s)](u)(1-\epsilon)$ for all vertices $u \in V$.

We assume that $\alpha$ is chosen such that $\alpha \le \frac{c}{d^{out}_{max}}$ for some constant $c < 1$, where $d^{out}_{max}$ is the largest out-degree of any node in the graph. In order to bound the runtime of the algorithm, consider that each iteration of the while-loop decreases the sum of the entries of $r$ by $(1 - \alpha\, d_{out}(u))\, r[u] > (1 - \alpha\, d_{out}(u))\,\delta \ge (1 - \alpha\, d^{out}_{max})\,\delta \ge (1-c)\,\delta$, where $\delta$ is the termination threshold of the algorithm. Because $r = s$ at initialization and each iteration decreases $\|r\|_1$ by at least $(1-c)\,\delta$, the number of iterations $i$ must satisfy $i\,(1-c)\,\delta \le \|s\|_1$. Therefore the number of iterations may be at most $\frac{\|s\|_1}{\delta(1-c)} = O(\|s\|_1/\delta)$. The cost of each iteration is proportional to the out-degree of the node that is dequeued, so the worst-case runtime of the algorithm is $O(\frac{\|s\|_1}{\delta}\, d^{out}_{max})$. For our choice of $\delta$ this is equivalent to $O(\frac{n}{\epsilon}\, d^{out}_{max})$.

Lemma 6: The invariant $\widetilde{cr} = cr(s-r)$ is maintained throughout the execution of the while-loop.

Proof: Before the loop starts, we have $r = s$ and $\widetilde{cr} = \vec{0}$, so $cr(s-r) = cr(\vec{0}) = \vec{0} = \widetilde{cr}$. We can also show that if $\widetilde{cr} = cr(s-r)$ holds prior to an iteration of the loop, then $\widetilde{cr}' = cr(s-r')$ is still true after the iteration, where $\widetilde{cr}'$ and $r'$ are the updated approximate centrality and residual vectors.

We first observe that $cr(s)A = cr(sA)$. To see this, consider that by definition $cr(s) = s + \alpha\, cr(s)A$. Multiplying this equation by $A$ we get $cr(s)A = sA + \alpha\,(cr(s)A)A$. This shows that $cr(s)A$ is by definition a centrality vector for starting vector $sA$. Moreover, we know that the solution to $cr(sA)$ is unique, so we have $cr(s)A = cr(sA)$. This observation shows that we can iteratively compute the centrality vector by expressing $cr(s)A$ as $cr(sA)$.

We will write the operations performed inside the while-loop using vector-matrix notation. We use $e_u$ to denote a row vector that has all of its content in vertex $u$: $e_u(i) = 1$ if $i = u$; otherwise, $e_u(i) = 0$.
After an iteration of the loop we have $\widetilde{cr}' = \widetilde{cr} + r[u]\,e_u$ and $r' = r - r[u]\,e_u + \alpha\, r[u]\,e_u A$, where $u$ is the vertex that is dequeued in line 11. We next specify the relationship between the approximate centrality and residual vectors before and after an iteration of the while-loop. Consider that

\[ \begin{aligned} cr(r) &= cr(r - r[u]e_u) + cr(r[u]e_u) \\ &= cr(r - r[u]e_u) + r[u]e_u + \alpha\, cr(r[u]e_u A) \\ &= cr(r - r[u]e_u + \alpha\, r[u]e_u A) + r[u]e_u \\ &= cr(r') + \widetilde{cr}' - \widetilde{cr}. \end{aligned} \]

If $\widetilde{cr} = cr(s-r)$, we have $cr(r) = cr(r') + \widetilde{cr}' - cr(s-r)$. It follows that $\widetilde{cr}' = cr(r) - cr(r') + cr(s-r) = cr(r - r' + (s-r)) = cr(s-r')$. This completes the proof.

C.0.1 Quality of Approximate Results

We compare the performance of the approximate algorithm with the power iteration method in Equation 6.3, using the starting vector as in [20]. To compute Alpha-centrality using the approximate algorithm, we fix the termination threshold $\delta$ of Algorithm 4 to be $3.57 \times 10^{-8}$ and $1.42 \times 10^{-8}$, guaranteeing that the error in approximation would be less than 1% ($\epsilon < 0.01$). We terminate the power iteration algorithm after 100 iterations in Digg and 10 to 100 iterations in Twitter. We calculate the RMS (root mean square) error of the approximate algorithm with respect to the power iteration algorithm for different values of $\alpha$. The RMS error, averaged over all values of $\alpha$, is 0.797% and 0.75% for Digg and Twitter respectively.
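The push procedure of Algorithm 4 and the guarantee of Theorem 5 can be tried on a small directed graph. The following sketch is an assumption-laden illustration, not the code used for the Digg and Twitter experiments: it uses NumPy, a dense adjacency matrix `A` with `A[u, v]` as the weight of edge u→v, out-neighbors as $N(u)$, and an exact reference obtained from $cr = s + \alpha\, cr\, A$, i.e. $cr = s(I - \alpha A)^{-1}$. The names `approximate_centrality` and `delta` are hypothetical.

```python
from collections import deque
import numpy as np

def approximate_centrality(A, s, alpha, eps):
    """Queue-based approximation of the Alpha-centrality row vector cr(s)."""
    n = len(s)
    delta = eps * np.abs(s).sum() / n      # termination threshold (line 1)
    r = np.array(s, dtype=float)           # residual vector
    cr = np.zeros(n)                       # approximate centrality
    queue = deque(u for u in range(n) if r[u] > delta)
    queued = set(queue)                    # O(1) stand-in for q.contains(v)
    while queue:
        u = queue.popleft()
        queued.discard(u)
        cr[u] += r[u]                      # move the residual of u into cr
        T = r[u]
        r[u] = 0.0
        for v in map(int, np.flatnonzero(A[u])):
            r[v] += alpha * T * A[u, v]    # push attenuated content to neighbors
            if v not in queued and r[v] > delta:
                queue.append(v)
                queued.add(v)
    return cr

# Small directed graph with maximum out-degree 2, so alpha = 0.25 gives c = 0.5.
A = np.array([[0.0, 1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0, 0.0],
              [0.0, 0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0, 0.0]])
s = np.full(4, 0.25)                       # uniform starting vector
alpha, eps = 0.25, 0.01
approx = approximate_centrality(A, s, alpha, eps)
exact = s @ np.linalg.inv(np.eye(4) - alpha * A)
```

For this input the approximation should never exceed the exact value (up to floating-point rounding) and should stay within a factor $(1-\epsilon)$ of it, which is what Theorem 5 states. For large sparse networks an adjacency-list representation would be the natural replacement for the dense matrix used here.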

Asset Metadata

Creator Ghosh, Rumi (author)

Core Title Disentangling the network: understanding the interplay of topology and dynamics in network analysis

School Viterbi School of Engineering

Degree Doctor of Philosophy

Degree Program Computer Science

Publication Date 07/17/2012

Defense Date 04/23/2012

Publisher University of Southern California (original), University of Southern California. Libraries (digital)

Tag centrality,communities,information diffusion,network analysis,network dynamics,OAI-PMH Harvest,online social networks

Language English

Contributor Electronically uploaded by the author (provenance)

Advisor Lerman, Kristina (committee chair), Teng, Shang-Hua (committee member), Liu, Yan (committee member), Monge, Peter R. (committee member)

Creator Email rumi.ghosh@gmail.com,rumig@usc.edu

Permanent Link (DOI) https://doi.org/10.25549/usctheses-c3-60034

Unique identifier UC11290018

Identifier usctheses-c3-60034 (legacy record id)

Legacy Identifier etd-GhoshRumi-959.pdf

Dmrecord 60034

Document Type Dissertation

Rights Ghosh, Rumi

Type texts

Source University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)

Access Conditions The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...

Repository Name University of Southern California Digital Library

Repository Location USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA